How to load multiple objects (list of lists) from HDF5? - python

I am currently trying to use Keras to predict ... stuff. I am using an HDF5 file as input. The file contains 195 objects, each of which is a list of matrices with one attribute. I would like Keras to learn on the lists of matrices and predict the attribute. The issue is that, as far as I have seen, one object can only be assigned to one variable, which would be meaningless in my case.
I would like to know whether it is possible to load all of these objects at once, say under one variable, so Keras can predict the attribute. For instance, here are some of the objects:
['10gs', '1a30', '1bcu',..., '4tmn']
I know I can assign one object to one variable,
dataset=infile['1a30']
However, I am not sure how to assign several objects to one variable. Do I need to create a list of objects? Here is what I am trying to get:
dataset=infile['all of my objects'].
In the end, I will be using it in Keras, but I am not sure Keras matters here; it seems to me this is an HDF5 issue (or my misunderstanding of it).
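One possible approach with h5py (a sketch: it assumes each object is a dataset in a single file and that the value to predict is stored as an HDF5 attribute; the file name "complexes.hdf5" and the attribute name "target" are hypothetical):

import h5py
import numpy as np

with h5py.File("complexes.hdf5", "r") as infile:
    names = list(infile.keys())                  # ['10gs', '1a30', '1bcu', ...]
    data = [infile[name][()] for name in names]  # each entry: array of matrices
    labels = [infile[name].attrs["target"] for name in names]

# If every object has the same shape, stack them into one array that Keras
# can take directly as x; otherwise keep the Python list and pad or batch it.
try:
    x = np.stack(data)
    y = np.array(labels)
    print(x.shape, y.shape)
except ValueError:
    print("Objects differ in shape; use padding or a generator instead.")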

Related

How to save an array of objects to a file in Python

I know how to save an array of simple data (like float type values) to a file using numpy.save and numpy.savez.
I also know how to save a single object to a file using the pickle module, although I have not tested it yet.
The question is: how can I save (and load) an array of objects to a file? Can I combine the two methods described above to achieve this? Is there a better way?
If you know how to pickle a single object, then to pickle several objects you can create a structure that contains all of them (a list, a set, or a dict with the objects as values, for example, or your own class that holds them in some way). That structure is itself a single object, available to pickle.
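A quick sketch of that idea (the Sample class and the file name are just placeholders):

import pickle

class Sample:
    """Hypothetical object type; stands in for whatever you want to save."""
    def __init__(self, name, values):
        self.name = name
        self.values = values

objects = [Sample("a", [1.0, 2.0]), Sample("b", [3.0])]

# Pickling the list pickles every contained object in one go.
with open("objects.pkl", "wb") as f:
    pickle.dump(objects, f)

with open("objects.pkl", "rb") as f:
    restored = pickle.load(f)

print(restored[0].name, restored[1].values)

Note that unpickling still needs the class definition to be importable at load time.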

Cannot iterate over tf.data.Dataset

I'm using the tensorflow.data.Dataset API. However, I need to create datasets on the fly by filtering out elements of another dataset. Training goes well and I can iterate over the training set and the dev set, but when I reinitialize the iterator with a new dataset that I have just created with a filter, I receive the following exception:
tensorflow.python.framework.errors_impl.NotFoundError: Function tf_predicate_5HKZIzWZBv8 is not defined.
I'm using the following function to create an initialiser out of a dataset:
self.iterator.make_initializer(dataset)
where self.iterator is defined as follows:
self.iterator = tf.data.Iterator.from_structure(ds_types, ds_shapes)
Do you guys have any idea why this is happening? Note that it happens when I call make_initializer after I have created a session, run a dataset, and then create a new initializer. If I also recreate the Session after creating the new dataset, everything works (except that all the variables have to be reinitialized).
I found the solution and I'm sharing it in case somebody runs into this problem. Because I was defining a new dataset after the session had been created, the session did not contain the new operations added for that dataset (in my case a new filter every time I created a dataset), and that is why it could not find the operation. To overcome the problem, I defined all the datasets I needed before creating the session, and I used a filter that takes a placeholder as input, so the graph always contains the same filter and I just feed the right value at iterator initialization time.
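A minimal sketch of that pattern, assuming the TF 1.x graph API (tf.compat.v1 under TensorFlow 2); the toy features and labels are made up:

import numpy as np
import tensorflow as tf

features = np.arange(10, dtype=np.int64)
labels = np.array([0, 1] * 5, dtype=np.int64)

# The value the filter compares against is a placeholder, so the graph
# contains a single filter op that is parameterised at init time instead
# of a new filter op per dataset.
keep_label = tf.placeholder(tf.int64, shape=[])

dataset = tf.data.Dataset.from_tensor_slices((features, labels))
dataset = dataset.filter(lambda x, y: tf.equal(y, keep_label))

iterator = tf.data.Iterator.from_structure(dataset.output_types,
                                           dataset.output_shapes)
init_op = iterator.make_initializer(dataset)
next_elem = iterator.get_next()

with tf.Session() as sess:
    for label in (0, 1):
        # Feeding the placeholder here selects which elements survive the filter.
        sess.run(init_op, feed_dict={keep_label: label})
        while True:
            try:
                print(sess.run(next_elem))
            except tf.errors.OutOfRangeError:
                break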

Sporadically assigning values to variables (placeholders vs feed_dict)

I have a model where I need to assign new external values to the weights (trainable variables) every N iterations.
I can think of a few solutions:
Save and restore
Not good, as I would need serialization, file system calls, etc. (even if I use something like tmpfs).
Using placeholders and assign operations
I would create a placeholder and an assign op for each trainable variable. Every time I want to assign something to the weights, I run the assign ops.
However, I understand that this means I will be forced to include these placeholders in every feed_dict and pass dummy values every time I run any operation in my graph.
In addition, I would be using much more memory than necessary.
Use a feed_dict for the trainable variables and trigger ops that assign each variable to itself?
Does this work? Is there any drawback?
Before coding something, I thought it was a good idea to ask.
What is the recommended way to assign new external values to variables efficiently (memory/timewise)?
Your third option sounds like the best one.
You can feed values to tensors that aren’t placeholders.
TensorFlow's feed mechanism lets you inject data into any Tensor in a computation graph. A Python computation can thus feed data directly into the graph.
Any tensors that are feedable can be fed. To check if a tensor is feedable or not, use: tf.Graph.is_feedable(tensor).
In recent versions of TensorFlow, the Variable class has a load method. It does exactly what you want.
https://www.tensorflow.org/api_docs/python/tf/Variable#load
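For example (a sketch, assuming the TF 1.x graph API with an active session):

import numpy as np
import tensorflow as tf

v = tf.Variable(tf.zeros([2, 3]))

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())

    new_value = np.ones((2, 3), dtype=np.float32)
    # load() pushes the value into the existing variable without adding
    # new ops to the graph, so it can be called as often as needed.
    v.load(new_value, sess)

    print(sess.run(v))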
You can use the assign operations with placeholders.
"I will be forced to include these placeholders in every feed_dict and pass dummy values every time I run any operation in my graph. In addition, I would be using much more memory than necessary."
No. You would only need to feed values to the placeholders when you run the assign operations. Don't make the assign operations part of your training graph, and only run them when you want to assign new values.
If the assigning turns out to be a bottleneck (for small N it might slow down your program) you can consider other methods of getting data into TensorFlow.
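A sketch of that setup (TF 1.x graph API; the toy loss and shapes are just for illustration):

import numpy as np
import tensorflow as tf

w = tf.Variable(tf.zeros([3, 3]), name="w")
loss = tf.reduce_sum(tf.square(w - 1.0))
train_op = tf.train.GradientDescentOptimizer(0.1).minimize(loss)

# Dedicated placeholder + assign op, kept out of the training path, so
# ordinary training steps never have to feed w_ph.
w_ph = tf.placeholder(tf.float32, shape=w.shape)
w_assign = tf.assign(w, w_ph)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for step in range(300):
        sess.run(train_op)                       # no feed_dict required here
        if step % 100 == 0:                      # every N iterations
            external = np.random.randn(3, 3).astype(np.float32)
            sess.run(w_assign, feed_dict={w_ph: external})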

Access blob by index in Pycaffe

I can access blobs in a network using their name.
For example:
net.blobs['data'].data.shape
returns the shape of that blob.
Is there a way to access a blob by its index? I tried the following:
net.blobs[1].data.shape
but I get an uninformative Python exception:
KeyError: 1
I need to access it in this manner because, at run time, I don't have the name of the layer I want to access.
Does anyone know a way to do this?
net.blobs is an OrderedDict, therefore you cannot access its items by numerical index; you need to use string keys.
You can get a list of the available keys with list(net.blobs.keys()) (the list() call is needed on Python 3, where keys() returns a view), and you can access a blob by position using:
net.blobs[list(net.blobs.keys())[1]].data.shape
But you should be careful doing so: caffe may introduce auxiliary blobs in your model (specifically, split layers when one "top" is fed as "bottom" to several layers).
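A small sketch of that, iterating over all blobs by position (deploy.prototxt and weights.caffemodel are placeholders for your own files):

import caffe

net = caffe.Net('deploy.prototxt', 'weights.caffemodel', caffe.TEST)

# net.blobs is an OrderedDict; build an indexable list of its keys once.
blob_names = list(net.blobs.keys())

for i, name in enumerate(blob_names):
    print(i, name, net.blobs[name].data.shape)

# Access "the second blob" by position rather than by name:
second = net.blobs[blob_names[1]]
print(second.data.shape)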

System/Convention for code-independent serialization

I'm training neural networks. I want to save them in a code-independent way so they can be loaded by someone using different software.
Just pickling my objects is no good, because the pickle breaks if it's loaded in an environment where the code has changed or moved (which it always does).
So I've been converting my objects into dicts of primitive types and pickling those. I maintain a module that can convert these dicts back into objects (the type of object is defined by a "class" key of the dict). My current solution feels messy.
So I was wondering: is there some package or design pattern made to handle this kind of "code-independent serialization"?
If you are using NumPy/SciPy for your project, you could save your weight matrices in MATLAB format.
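For example, with scipy.io (the layer names and shapes below are made up):

import numpy as np
from scipy import io

# Hypothetical network: a dict of plain NumPy weight matrices.
weights = {
    "W1": np.random.randn(784, 128),
    "b1": np.zeros((1, 128)),
    "W2": np.random.randn(128, 10),
}

# .mat files can be read from MATLAB, Octave, R, Julia, and other tools,
# so the format does not depend on your Python code.
io.savemat("model_weights.mat", weights)

# Loading back in Python:
restored = io.loadmat("model_weights.mat")
print(restored["W1"].shape)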
