Representing time sequence input/output in tensorflow - python

I've been working through the TensorFlow documentation (still learning), and I can't figure out how to represent input/output sequence data. My inputs are sequences of 20 8-entry vectors, making an 8x20xN matrix, where N is the number of instances. I'd like to eventually pass these through an LSTM for sequence-to-sequence learning. I know I need a 3D tensor, but I'm unsure which dimensions are which.
RTFMs with pointers to the correct documentation greatly appreciated. I feel like this is obvious and I'm just missing it.

As described in the excellent blog post by WildML, the proper way is to save your examples in a TFRecord using the tf.SequenceExample() format. Using TFRecords for this provides the following advantages:
You can split your data into many files and load each of them on a different GPU.
You can use TensorFlow utilities for loading the data (for example, using queues to load your data on demand).
Your model code will be separate from your dataset processing (this is a good habit to have).
You can bring new data to your model just by putting it into this format.
TFRecords use protocol buffers (protobuf) to format your data; the protobuf documentation can be found here. The basic idea is that you define a format for your data (in this case tf.SequenceExample), save it to a TFRecord, and load it back using the same data definition. Code for this pattern can be found at this ipython notebook.
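To make that pattern concrete, here is a minimal sketch (TensorFlow 1.x API; the feature names "length" and "inputs" are placeholders I chose) that writes one 20x8 sequence as a tf.SequenceExample and parses it back with the same definition:

import numpy as np
import tensorflow as tf

# One instance: a sequence of 20 time steps, each an 8-entry float vector.
sequence = np.random.rand(20, 8).astype(np.float32)

# Build a tf.train.SequenceExample: the sequence length goes into the context,
# and each time step becomes one entry in a FeatureList.
example = tf.train.SequenceExample()
example.context.feature["length"].int64_list.value.append(sequence.shape[0])
inputs = example.feature_lists.feature_list["inputs"]
for step in sequence:
    inputs.feature.add().float_list.value.extend(step)

# Write it to a TFRecord file.
with tf.python_io.TFRecordWriter("train.tfrecord") as writer:
    writer.write(example.SerializeToString())

# Load it back using the same data definition.
serialized = next(tf.python_io.tf_record_iterator("train.tfrecord"))
context, sequences = tf.parse_single_sequence_example(
    serialized,
    context_features={"length": tf.FixedLenFeature([], dtype=tf.int64)},
    sequence_features={"inputs": tf.FixedLenSequenceFeature([8], dtype=tf.float32)})

with tf.Session() as sess:
    print(sess.run(sequences["inputs"]).shape)  # (20, 8)

(Batched together, this gives the [batch, time, features] = [N, 20, 8] layout that TensorFlow's RNN ops expect by default with time_major=False.)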
As my answer is mostly summarizing the WildML blog post on this topic, I suggest you check that out, again found here.

Related

How to export a fasttext model created by gensim, to a binary file?

I'm trying to export the fasttext model created by gensim to a binary file. But the docs are unclear about how to achieve this.
What I've done so far:
model.wv.save_word2vec_format('model.bin')
But this does not seem like the best solution, since later, when I want to load the model using:
fasttext.load_facebook_model('model.bin')
I get into what looks like an infinite loop, while loading the fasttext.model created by the model.save('fasttext.model') function completes in around 30 seconds.
Using .save_word2vec_format() saves just the full-word vectors, to a simple format that was used by Google's original word2vec.c release. It doesn't save unique things about a full FastText model. Such files would be reloaded with the matched .load_word2vec_format().
The .load_facebook_model() method loads files in the format saved by Facebook's original (non-Python) FastText code release. (The name of this method is pretty misguided, since 'facebook' could mean so many different things other than a specific data format.) Gensim doesn't have a matched method for saving to this same format – though it probably wouldn't be very hard to implement, and would make symmetric sense to support this export option.
Gensim's models typically implement gensim-native .save() and .load() options, which make use of a mix of Python 'pickle' serialization and raw large-array files. These are your best options if you want to save the full model state, for later reloading back into Gensim.
(Such files can't be loaded by other FastText implementations.)
Be sure to keep the multiple related files written by this .save() (all with the same user-supplied prefix) together when moving the saved model to a new location.
Update (May 2020): Recent versions of gensim such as 3.8.3 and later include a new contributed FastText.save_facebook_model() method which saves to the original Facebook FastText binary format.
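A short sketch of the two save/load options (assumes gensim 3.8.3 or later, and that model is the already-trained FastText model from the question):

from gensim.models import FastText
from gensim.models.fasttext import load_facebook_model, save_facebook_model

# Option 1: gensim-native save/load. Keeps the full model state as a pickle file
# plus raw array files; keep all of them together when moving the model.
model.save("fasttext.model")
restored = FastText.load("fasttext.model")

# Option 2: export to Facebook's original binary format (gensim >= 3.8.3),
# readable by load_facebook_model() and by Facebook's own fasttext tooling.
save_facebook_model(model, "model.bin")
restored_fb = load_facebook_model("model.bin")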

Keras custom fit_generator for numeric dataframe

I have several CSV files placed in a directory. What I want to do is create a flow from this directory where each file is taken, preprocessed (null-value filling, outlier treatment, etc.), and then each data point is passed to a Keras model, and this process should repeat itself for every file placed in the directory. Any suggestions on how to create a data flow like the one Keras provides for image data? Also, this should happen in Python :)
Thanks in advance!
I don't think that Keras natively supplies such functionality.
You should make your own converter, using something like glob to go over each file, send it to preprocessing functions, and finally save it as a format readily usable by Keras, such as a numpy array.
You might want to have a look here for an example of inputting multiple files (although in this case they are already numpy arrays, not csv files) to use in the training of a model.
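As a rough sketch of such a converter (the preprocessing steps and the assumption that the last CSV column is the target are placeholders for your own logic):

import glob
import numpy as np
import pandas as pd

def csv_batch_generator(directory, batch_size=32):
    """Yield (features, labels) batches from every CSV file in `directory`, forever."""
    while True:  # fit_generator expects a generator that never terminates
        for path in glob.glob(directory + "/*.csv"):
            df = pd.read_csv(path)

            # Placeholder preprocessing: fill missing values, clip outliers.
            df = df.fillna(df.mean())
            df = df.clip(lower=df.quantile(0.01), upper=df.quantile(0.99), axis=1)

            # Assume the last column is the target and the rest are features.
            data = df.values.astype(np.float32)
            features, labels = data[:, :-1], data[:, -1]

            for start in range(0, len(features), batch_size):
                yield features[start:start + batch_size], labels[start:start + batch_size]

# Usage with a compiled Keras model:
# model.fit_generator(csv_batch_generator("data/"), steps_per_epoch=100, epochs=10)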

Import TensorFlow data from pyspark

I want to create a predictive model on several hundred GBs of data. The data needs some non-intensive preprocessing that I can do in pyspark but not in tensorflow. In my situation, it would be much more convenient to directly pass the result of the preprocessing to TF, ideally treating the pyspark data frame as a virtual input file to TF, instead of saving the preprocessed data to disk. However, I haven't the faintest idea how to do that, and I couldn't find an answer anywhere on the internet.
After some thought, it seems to me that I actually need an iterator (like the one defined by tf.data.Iterator) over Spark's data. However, I found comments online hinting that the distributed structure of Spark makes this very hard, if not impossible. Why so? Imagine that I don't care about the order of the lines; why should it be impossible to iterate over the Spark data?
It sounds like you simply want to use tf.data.Dataset.from_generator(): you define a Python generator which reads samples out of Spark. Although I don't know Spark very well, I'm certain you can do a reduce to the server that will be running the TensorFlow model. Better yet, if you're distributing your training, you can reduce to the set of servers that need some shard of your final dataset.
The Importing Data programmer's guide covers the Dataset input pipeline in more detail. The TensorFlow Dataset will provide you with an iterator that's accessed directly by the graph, so there's no need for tf.placeholder ops or marshaling data outside of the tf.data.Dataset.from_generator() code you write.
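Here is a minimal sketch of that idea (TensorFlow 1.x tf.data API; assumes a PySpark DataFrame named spark_df with 8 numeric columns, iterated on the driver via toLocalIterator()):

import numpy as np
import tensorflow as tf

def spark_row_generator(spark_df):
    """Yield each row of a PySpark DataFrame as a float32 numpy vector."""
    # toLocalIterator() streams partitions to the driver one at a time,
    # so the whole DataFrame never has to fit in memory at once.
    for row in spark_df.toLocalIterator():
        yield np.array(row, dtype=np.float32)

# spark_df is your preprocessed PySpark DataFrame (assumed to exist already).
dataset = tf.data.Dataset.from_generator(
    lambda: spark_row_generator(spark_df),
    output_types=tf.float32,
    output_shapes=tf.TensorShape([8]))      # 8 numeric columns assumed

dataset = dataset.shuffle(10000).batch(128)
iterator = dataset.make_one_shot_iterator()
next_batch = iterator.get_next()            # feed this tensor directly into your graph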

TensorFlow: How and why to use SavedModel

I have a few questions regarding the SavedModel API, whose documentation I find leaves a lot of details unexplained.
The first three questions are about what to pass to the arguments of the add_meta_graph_and_variables() method of tf.saved_model.builder.SavedModelBuilder, while the fourth question is about why to use the SavedModel API over tf.train.Saver.
What is the format of the signature_def_map argument? Do I normally need to set this argument when saving a model?
Similarly, what is the format of the assets_collection argument?
Why do you save a list of tags with a metagraph as opposed to just giving it a name (i.e. attaching just one unique tag to it)? Why would I add multiple tags to a given metagraph? What if I try to load a metagraph from a pb by a certain tag, but multiple metagraphs in that pb match that tag?
The documentation recommends using SavedModel to save entire models (as opposed to variables only) in self-contained files. But tf.train.Saver also saves the graph in addition to the variables in a .meta file. So what are the advantages of using SavedModel? The documentation says
When you want to save and load variables, the graph, and the graph's metadata--basically, when you want to save or restore your model--we recommend using SavedModel. SavedModel is a language-neutral, recoverable, hermetic serialization format. SavedModel enables higher-level systems and tools to produce, consume, and transform TensorFlow models.
but this explanation is quite abstract and doesn't really help me understand what the advantages of SavedModel are. What would be concrete examples where SavedModel (as opposed to tf.train.Saver) would be better to use?
Please note that my question is not a duplicate of this question. I'm not asking how to save a model, I am asking very specific questions about the properties of SavedModel, which is only one of multiple mechanisms TensorFlow provides to save and load models. None of the answers in the linked question touch on the SavedModel API (which, once again, is not the same as tf.train.Saver).
EDIT: I wrote this back at TensorFlow 1.4. As of today (TensorFlow 1.12 is stable, there's a 1.13rc and 2.0 is around the corner) the docs linked in the question are much improved.
I'm trying to use tf.saved_model and also found the Docs quite (too) abstract. Here's my stab at a full answer to your questions:
1. signature_def_map:
a. Format: See Tom's answer to Tensorflow: how to save/restore a model. (Ctrl-F for "tf.saved_model" - currently, the only uses of the phrase on that question are in his answer).
b. Need: It's my understanding that you normally do need it. If you intend to use the model, you need to know the inputs and outputs of the graph. I think it is akin to a C++ function signature: if you intend to define a function after it's called or in another C++ file, you need the signature in your main file (i.e. prototyped or in a header file).
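To make that concrete, here's a minimal sketch of building and passing a signature_def_map (TensorFlow 1.x API; the toy graph, the names "input"/"output", and export_dir are placeholders):

import tensorflow as tf

# A toy graph standing in for your trained model.
x = tf.placeholder(tf.float32, shape=[None, 8], name="input")
y = tf.layers.dense(x, 1, name="output")

builder = tf.saved_model.builder.SavedModelBuilder("export_dir")

# A SignatureDef records which tensors a consumer should treat as inputs and outputs.
signature = tf.saved_model.signature_def_utils.predict_signature_def(
    inputs={"input": x}, outputs={"output": y})

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    builder.add_meta_graph_and_variables(
        sess,
        tags=[tf.saved_model.tag_constants.SERVING],
        signature_def_map={
            tf.saved_model.signature_constants.DEFAULT_SERVING_SIGNATURE_DEF_KEY: signature})
builder.save()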
2. assets_collection:
Format: Couldn't find clear documentation, so I went to the builder source code. It appears that the argument is an iterable of Tensors of dtype=tf.string, where each Tensor is a path for the asset directory. So, a TensorFlow Graph collection should work. I guess that is the parameter's namesake, but from the source code I would expect a Python list to work too.
(You didn't ask if you need to set it, but judging from Zoe's answer to What are assets in tensorflow? and iga's answer to the tangentially related Tensorflow serving: “No assets to save/writes” when exporting models, it doesn't usually need to be set.)
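If you ever do need it, a rough sketch of the usual pattern looks like this (TensorFlow 1.x; assumes a real asset file vocab.txt exists next to the script, and the name is just a placeholder):

import tensorflow as tf

# The asset is referenced through a string tensor holding its path, and that
# tensor is registered in the ASSET_FILEPATHS graph collection.
vocab_path = tf.constant("vocab.txt", name="vocab_path")
tf.add_to_collection(tf.GraphKeys.ASSET_FILEPATHS, vocab_path)

builder = tf.saved_model.builder.SavedModelBuilder("export_dir_with_assets")
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    builder.add_meta_graph_and_variables(
        sess,
        tags=[tf.saved_model.tag_constants.SERVING],
        # Pass the collection (an iterable of string tensors) as assets_collection;
        # the referenced files get copied into the SavedModel's assets/ directory.
        assets_collection=tf.get_collection(tf.GraphKeys.ASSET_FILEPATHS))
builder.save()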
3. Tags:
a. Why a list: I don't know why you must pass a list, but you may pass a list with one element. For instance, in my current project I only use the [tf...tag_constants.SERVING] tag.
b. When to use multiple: Say you're using explicit device placement for operations. Maybe you want to save a CPU version and a GPU version of your graph. Obviously you want to save a serving version of each, and say you want to save training checkpoints. You could use a CPU/GPU tag and a training/serving tag to manage all cases. The docs hint at it:
Each MetaGraphDef added to the SavedModel must be annotated with user-specified tags. The tags provide a means to identify the specific MetaGraphDef to load and restore, along with the shared set of variables and assets. These tags typically annotate a MetaGraphDef with its functionality (for example, serving or training), and optionally with hardware-specific aspects (for example, GPU).
c. Collision: Too lazy to force a collision myself (I see two cases that would need to be addressed), so I went to the loader source code. Inside def load, you'll see:
saved_model = _parse_saved_model(export_dir)
found_match = False
for meta_graph_def in saved_model.meta_graphs:
  if set(meta_graph_def.meta_info_def.tags) == set(tags):
    meta_graph_def_to_load = meta_graph_def
    found_match = True
    break

if not found_match:
  raise RuntimeError(
      "MetaGraphDef associated with tags " + str(tags).strip("[]") +
      " could not be found in SavedModel. To inspect available tag-sets in"
      " the SavedModel, please use the SavedModel CLI: `saved_model_cli`"
  )
It appears to me that it's looking for an exact match. E.g. say you have a metagraph with tags "GPU" and "Serving" and a metagraph with tag "Serving". If you load "Serving", you'll get the latter metagraph. On the other hand, say you have a metagraph "GPU" and "Serving" and a metagraph "CPU" and "Serving". If you try to load "Serving", you'll get the error. If you try to save two metagraphs with the exact same tags in the same folder, I expect you'll overwrite the first one. It doesn't look like the builder code handles such a collision in any special way.
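For reference, a small sketch of saving two tag-sets and loading by exact match (TensorFlow 1.x API; the "gpu" tag and the toy graph are just illustrative):

import tensorflow as tf

export_dir = "export_dir_multi"
x = tf.placeholder(tf.float32, shape=[None, 8])
y = tf.layers.dense(x, 1)

builder = tf.saved_model.builder.SavedModelBuilder(export_dir)
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # The first call saves the variables plus a metagraph tagged {SERVING, "gpu"}.
    builder.add_meta_graph_and_variables(
        sess, [tf.saved_model.tag_constants.SERVING, "gpu"])
    # Additional metagraphs (sharing the same variables) get their own tag-sets.
    builder.add_meta_graph([tf.saved_model.tag_constants.TRAINING])
builder.save()

# Loading requires the exact tag-set of one saved metagraph.
with tf.Graph().as_default(), tf.Session() as sess:
    tf.saved_model.loader.load(
        sess, [tf.saved_model.tag_constants.SERVING, "gpu"], export_dir)
    # Loading with just [SERVING] here would raise the RuntimeError shown above.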
4. SavedModel or tf.train.Saver:
This confused me too. wicke's answer to Should TensorFlow users prefer SavedModel over Checkpoint or GraphDef? cleared it up for me. I'll throw in my two cents:
In the scope of local Python+TensorFlow, you can make tf.train.Saver do everything. But it will cost you. Let me outline the save-a-trained-model-and-deploy use case. You'll need your saver object. It's easiest to set it up to save the complete graph (every variable). You probably don't want to save the .meta all the time, since you're working with a static graph. You'll need to specify that in your training hook. You can read about that on cv-tricks.

When your training finishes, you'll need to convert your checkpoint file to a pb file. That usually means clearing the current graph, restoring the checkpoint, freezing your variables to constants with tf.python.framework.graph_util, and writing it with tf.gfile.GFile. You can read about that on medium.

After that, you want to deploy it in Python. You'll need the input and output Tensor names - the string names in the graph def. You can read about that on metaflow (actually a very good blog post for the tf.train.Saver method). Some op nodes will let you feed data into them easily. Some not so much. I usually gave up on finding an appropriate node and added a tf.reshape that didn't actually reshape anything to the graph def. That was my ad-hoc input node. Same for the output. And then, finally, you can deploy your model, at least locally in Python.
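A rough sketch of that checkpoint-to-pb freezing step (TensorFlow 1.x; "model.ckpt", "frozen_model.pb" and the output node name are placeholders for your own files and graph):

import tensorflow as tf
from tensorflow.python.framework import graph_util

output_node_names = ["output/BiasAdd"]   # string name(s) of your graph's output op(s)

with tf.Session() as sess:
    # Restore the graph structure and the trained variable values from the checkpoint.
    saver = tf.train.import_meta_graph("model.ckpt.meta")
    saver.restore(sess, "model.ckpt")

    # Freeze: replace every variable with a constant holding its trained value.
    frozen_graph_def = graph_util.convert_variables_to_constants(
        sess, sess.graph_def, output_node_names)

# Write the self-contained GraphDef to a single .pb file.
with tf.gfile.GFile("frozen_model.pb", "wb") as f:
    f.write(frozen_graph_def.SerializeToString())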
Or, you could use the answer I linked in point 1 to accomplish all this with the SavedModel API. Fewer headaches, thanks to Tom's answer. You'll get more support and features in the future if it ever gets documented appropriately. It looks like it's easier to use command-line serving (the medium link covers doing that with Saver - looks tough, good luck!). It's practically baked into the new Estimators. And according to the Docs,
SavedModel is a language-neutral, recoverable, hermetic serialization format.
Emphasis mine: it looks like you can get your trained models into the growing C++ API much more easily.
The way I see it, it's like the Datasets API. It's just easier than the old way!
As far as concrete examples of SavedModel vs. tf.train.Saver go: if "basically, when you want to save or restore your model" isn't clear enough for you, the correct time to use it is any time it makes your life easier. To me, that looks like always. Especially if you're using Estimators, deploying in C++, or using command-line serving.
So that's my research on your question. Or four enumerated questions. Err, eight question marks. Hope this helps.

How to load a very large file into TensorFlow and create minibatches?

I have an HDF5 file that has been written to disk by some Python code (I only have the file, not the code). The size of this file is 90GB, and the data in it has the following shape: (N, 250, 360, 3). Just as a side note, the data can't fit into memory.
Now I want to write a data loader in TensorFlow that each time loads just M samples from this file (M is much smaller than N).
What would be the best way to do this? Any pointer to a code would be highly appreciated.
Thanks.
J
The Tensorflow MNIST tutorial shows how this can be done:
https://www.tensorflow.org/tutorials/mnist/beginners/
If you look at the implementation on Github, you'll see that it uses a next_batch function to read batches of inputs 100 at a time.
The implementation of next_batch lives here:
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/learn/python/learn/datasets/mnist.py#L160
You would need to implement something similar for your data set. I'm not particularly familiar with HDF5, but you can use any Python library to do the loading; it doesn't have to be specific to Tensorflow.
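For example, with h5py something like the following would work (a minimal sketch; "data" is a placeholder for the dataset name inside your HDF5 file):

import h5py
import numpy as np

class HDF5BatchReader(object):
    """Serve minibatches of M samples from an HDF5 file without loading it all into memory."""

    def __init__(self, path, batch_size):
        self._file = h5py.File(path, "r")
        self._data = self._file["data"]      # shape (N, 250, 360, 3), stays on disk
        self._batch_size = batch_size
        self._cursor = 0

    def next_batch(self):
        start = self._cursor
        end = min(start + self._batch_size, self._data.shape[0])
        batch = np.asarray(self._data[start:end])   # only M samples are read from disk
        self._cursor = 0 if end == self._data.shape[0] else end
        return batch

# Usage: feed each batch through a tf.placeholder of shape [None, 250, 360, 3].
# reader = HDF5BatchReader("data.h5", batch_size=32)
# x_batch = reader.next_batch()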
Hope that helps!
