How to load a very large file into TensorFlow and create minibatches? - python

I have an HDF5 file that was written to disk by a Python script (I only have the file, not the code). The file is 90 GB and the data in it has shape (N, 250, 360, 3). Just as a side note, the data can't fit into memory.
Now I want to write a data loader in TensorFlow that loads only M samples at a time from this file (M is much smaller than N).
What would be the best way to do this? Any pointer to code would be highly appreciated.
Thanks.
J

The Tensorflow MNIST tutorial shows how this can be done:
https://www.tensorflow.org/tutorials/mnist/beginners/
If you look at the implementation on Github, you'll see that it uses a next_batch function to read batches of inputs 100 at a time.
The implementation of next_batch lives here:
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/learn/python/learn/datasets/mnist.py#L160
You would need to implement something similar for your data set. I'm not particularly familiar with HDF5, but you can use any Python library to do the loading; it doesn't have to be specific to Tensorflow.
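For instance, here is a minimal sketch of a batch loader built on h5py; the dataset name "data" and the file name "big_file.h5" are assumptions, so check your file's keys first.

import h5py
import numpy as np

class HDF5BatchLoader:
    """Yields minibatches of M samples from a large HDF5 file without loading it all."""

    def __init__(self, path, dataset_name="data", batch_size=32):
        self.file = h5py.File(path, "r")
        self.data = self.file[dataset_name]   # shape (N, 250, 360, 3), stays on disk
        self.batch_size = batch_size

    def next_batch(self):
        # Random minibatch; h5py fancy indexing requires the indices to be sorted.
        idx = np.sort(np.random.choice(self.data.shape[0], self.batch_size, replace=False))
        return self.data[idx]                 # only these M samples are read into RAM

loader = HDF5BatchLoader("big_file.h5", batch_size=100)
batch = loader.next_batch()                   # feed this to session.run via feed_dict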
Hope that helps!

Related

Dask Read Data from Binary File

I am trying to implement an out-of-core version of the k-means clustering algorithm in Python. I learned about Dask from this git project K-Mean Parallel... Dask...
I use the same git project but am trying to load my data which is in the form of a binary file. The binary file contains data points with 1024 floating point features each.
My problem is: how do I load the data if it is very large, i.e. larger than the available memory itself? I tried to use numpy's fromfile function but my kernel somehow dies. Some of my questions are:
Q. Is it possible to load data into numpy from a file created by some other source (the file was not written by numpy but by a C script)?
Q. Is there a module for dask that can load data directly from a binary file? I have seen csv files used but nothing related to binary files.
I only dabble in Dask, but calling np.fromfile in the code below via Dask delayed should allow you to work with it lazily. That said, I'm working on this myself so this is currently a partial answer.
For your first question: I currently load .bin files created by a LabVIEW program with no issues using code similar to this:
import os
import numpy as np

mode = "rb"                 # binary read mode
chunk_bytes = 1_000_000     # bytes per chunk; adjust as needed for your purposes
item_size = np.dtype(np.float32).itemsize
file_size = os.path.getsize(myfile)

data = []
with open(myfile, mode) as f:
    for offset in range(0, file_size, chunk_bytes):
        data.append(np.fromfile(f, dtype=np.float32, count=chunk_bytes // item_size))
For the second question: I have not been able to find anything built into Dask for dealing with binary files. I find that converting to a format Dask can use is worth it.
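That said, here is a rough sketch of the lazy approach mentioned above, wrapping np.fromfile chunks in dask.delayed and assembling them into a dask array; the file name "data.bin" and the chunk size are assumptions to adjust for your data.

import os
import numpy as np
import dask
import dask.array as da

N_FEATURES = 1024                            # features per data point (from the question)
ITEM_SIZE = np.dtype(np.float32).itemsize    # 4 bytes
ROWS_PER_CHUNK = 10_000                      # tune to your available RAM

def load_rows(path, start_row, n_rows):
    # Read n_rows records of N_FEATURES float32 values, starting at start_row.
    with open(path, "rb") as f:
        f.seek(start_row * N_FEATURES * ITEM_SIZE)
        flat = np.fromfile(f, dtype=np.float32, count=n_rows * N_FEATURES)
        return flat.reshape(-1, N_FEATURES)

path = "data.bin"                            # hypothetical file name
total_rows = os.path.getsize(path) // (N_FEATURES * ITEM_SIZE)

blocks = []
for start in range(0, total_rows, ROWS_PER_CHUNK):
    n_rows = min(ROWS_PER_CHUNK, total_rows - start)
    lazy = dask.delayed(load_rows)(path, start, n_rows)
    blocks.append(da.from_delayed(lazy, shape=(n_rows, N_FEATURES), dtype=np.float32))

X = da.concatenate(blocks, axis=0)           # lazy (total_rows, 1024) array for k-means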

Import TensorFlow data from pyspark

I want to create a predictive model on several hundred GBs of data. The data needs some non-intensive preprocessing that I can do in pyspark but not in tensorflow. In my situation, it would be much more convenient to directly pass the result of the pre-processing to TF, ideally treating the pyspark data frame as a virtual input file to TF, instead of saving the pre-processed data to disk. However, I haven't the faintest idea how to do that, and I couldn't find an answer anywhere on the internet.
After some thought, it seems to me that I actually need an iterator (like the one defined by tf.data.Iterator) over Spark's data. However, I found comments online hinting that the distributed structure of Spark makes this very hard, if not impossible. Why so? Imagine that I don't care about the order of the lines; why should it be impossible to iterate over the Spark data?
It sounds like you simply want to use tf.data.Dataset.from_generator(): you define a Python generator which reads samples out of Spark. Although I don't know Spark very well, I'm certain you can do a reduce to the server that will be running the TensorFlow model. Better yet, if you're distributing your training, you can reduce to the set of servers that need some shard of your final dataset.
The Importing Data programmer's guide covers the Dataset input pipeline in more detail. The TensorFlow Dataset will provide you with an iterator that's accessed directly by the graph, so there's no need for tf.placeholder ops or marshaling data outside of the tf.data.Dataset.from_generator() code you write.
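As a minimal sketch of that idea, assuming a pre-processed pyspark DataFrame named spark_df with a "features" column (list of floats) and an integer "label" column; both column names are made up for illustration.

import numpy as np
import tensorflow as tf

def spark_row_generator():
    # toLocalIterator() streams rows to the driver partition by partition
    # instead of collecting the whole DataFrame at once.
    for row in spark_df.toLocalIterator():
        yield np.asarray(row["features"], dtype=np.float32), np.int32(row["label"])

dataset = tf.data.Dataset.from_generator(
    spark_row_generator,
    output_types=(tf.float32, tf.int32),
    output_shapes=(tf.TensorShape([None]), tf.TensorShape([])),
)
dataset = dataset.batch(128).prefetch(1)

iterator = dataset.make_one_shot_iterator()
features, labels = iterator.get_next()   # build the model directly on these tensors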

Working with large (+15 gb) CSV datasets and Pandas/XGBoost

I am trying to find a means of starting to work with very large CSV files in Pandas, ultimately to be able to do some machine learning with XGBoost.
I am torn between using MySQL or some SQLite framework to manage chunks of my data; my issue is with the machine learning aspect of it later on, and with loading chunks at a time to train the model.
My other thought was to use Dask, which is built on top of Pandas, but also has XGBoost functionality.
I'm not sure what the best starting point is and was hoping to ask for an opinion! I am leaning towards Dask but I have not used it yet.
This blogpost goes through an example using XGBoost on a large CSV dataset. However, it did so by using a distributed cluster with enough RAM to fit the entire dataset in memory at once. While many dask.dataframe operations can operate in small space, I don't think that XGBoost training is likely to be one of them. XGBoost seems to operate best when all data is available all the time.
I haven't tried this, but I would load your data into an HDF5 file using h5py. This library lets you store data on disk but access it like a numpy array, so you are no longer constrained by memory for your dataset.
For the XGBoost part, I would use the sklearn API and pass in the h5py object as the X value. I recommend the sklearn API since it accepts numpy-like arrays for input, which should let h5py objects work. Make sure to use a small value for subsample, otherwise you'll likely run out of memory fast. A rough sketch of this idea follows.
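A sketch of that approach (untested, as the answer itself notes). The column name "target", the file names, and the chunk size are placeholders, and whether XGBoost actually stays out-of-core when handed h5py datasets is not guaranteed.

import h5py
import numpy as np
import pandas as pd
import xgboost as xgb

# One-time conversion: stream the CSV into HDF5 in chunks so the full
# dataset never has to sit in memory at once.
with h5py.File("train.h5", "w") as f:
    X_ds = y_ds = None
    for chunk in pd.read_csv("train.csv", chunksize=100_000):
        X_chunk = chunk.drop(columns=["target"]).values.astype(np.float32)
        y_chunk = chunk["target"].values.astype(np.float32)
        if X_ds is None:
            X_ds = f.create_dataset("X", data=X_chunk, chunks=True,
                                    maxshape=(None, X_chunk.shape[1]))
            y_ds = f.create_dataset("y", data=y_chunk, chunks=True, maxshape=(None,))
        else:
            n = X_ds.shape[0]
            X_ds.resize(n + X_chunk.shape[0], axis=0)
            y_ds.resize(n + y_chunk.shape[0], axis=0)
            X_ds[n:] = X_chunk
            y_ds[n:] = y_chunk

# Training: hand the on-disk datasets to the sklearn-style estimator with a small subsample.
f = h5py.File("train.h5", "r")
model = xgb.XGBRegressor(n_estimators=100, subsample=0.1)
model.fit(f["X"], f["y"])   # passing the h5py datasets directly, per the answer above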

Load next batch during training step

I have a TensorFlow setup that looks something like this:
for b in batch_loader.iter_batches(self.TRAINING_SET):
    ...
    self.session.run(train_step, feed_dict=...)
The iter_batches function loads image data from numpy memory map files into RAM. Measurements showed that the loading of data from disk takes about 1/3 of the time of running that train_step. Also, the train_step operation does not need to access the hard drive at all.
So I could make everything faster if I could load batch i+1 while I'm still training on batch i.
Can I use some python multiprocessing library for this or does TensorFlow offer something for this use case? I looked around their documentation but did not find anything. Is there a canonical way of doing this?
You can set up queues using tf.train.start_queue_runners and tf.train.Coordinator. See here for details.
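A minimal, self-contained sketch of the idea with a background feeder thread and a small tf.FIFOQueue; the dummy iter_batches loader and the trivial model stand in for the question's batch_loader and real train_step.

import threading
import numpy as np
import tensorflow as tf

def iter_batches(n_batches=50, batch_size=32):
    # Dummy loader; replace with batch_loader.iter_batches(self.TRAINING_SET).
    for _ in range(n_batches):
        yield (np.random.rand(batch_size, 250, 360, 3).astype(np.float32),
               np.random.randint(0, 10, size=batch_size).astype(np.int32))

image_input = tf.placeholder(tf.float32, [None, 250, 360, 3])
label_input = tf.placeholder(tf.int32, [None])

# Small FIFO queue: the feeder thread loads batch i+1 while the main thread trains on batch i.
queue = tf.FIFOQueue(capacity=2, dtypes=[tf.float32, tf.int32])
enqueue_op = queue.enqueue([image_input, label_input])
close_op = queue.close()
images, labels = queue.dequeue()          # build the model on these instead of placeholders

w = tf.Variable(1.0)                      # dummy model so train_step is runnable
loss = tf.reduce_mean(images) * w
train_step = tf.train.GradientDescentOptimizer(0.01).minimize(loss)

def feed_queue(sess, coord):
    try:
        for imgs, lbls in iter_batches():
            if coord.should_stop():
                break
            sess.run(enqueue_op, feed_dict={image_input: imgs, label_input: lbls})
    finally:
        sess.run(close_op)                # pending dequeues raise OutOfRangeError once drained

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    coord = tf.train.Coordinator()
    feeder = threading.Thread(target=feed_queue, args=(sess, coord))
    feeder.start()
    try:
        while not coord.should_stop():
            sess.run(train_step)          # dequeues a batch as part of the graph
    except tf.errors.OutOfRangeError:
        pass                              # queue closed and drained: epoch finished
    finally:
        coord.request_stop()
        coord.join([feeder])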

Representing time sequence input/output in tensorflow

I've been working through the TensorFlow documentation (still learning), and I can't figure out how to represent input/output sequence data. My inputs are sequences of 20 8-entry vectors, making an 8x20xN matrix, where N is the number of instances. I'd like to eventually pass these through an LSTM for sequence-to-sequence learning. I know I need a 3D tensor, but I'm unsure which dimensions are which.
RTFMs with pointers to the correct documentation greatly appreciated. I feel like this is obvious and I'm just missing it.
As described in the excellent blog post by WildML, the proper way is to save your examples in a TFRecord using the format tf.SequenceExample. Using TFRecords for this provides the following advantages:
You can split your data into many files and load each of them on a different GPU.
You can use TensorFlow utilities for loading the data (for example, using queues to load your data on demand).
Your model code will be separate from your dataset processing (this is a good habit to have).
You can bring new data to your model just by putting it into this format.
TFRecords use protocol buffers (protobuf) to format your data; the documentation can be found here. The basic idea is that you define a format for your data (in this case tf.SequenceExample), save it to a TFRecord, and load it back using the same data definition. Code for this pattern can be found in this ipython notebook.
As my answer is mostly summarizing the WildML blog post on this topic, I suggest you check that out, again found here.
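To make that concrete, here is a small sketch of writing and parsing tf.SequenceExample records for the shapes in the question (20 time steps of 8-entry vectors); the file name "sequences.tfrecord" and the feature key "inputs" are made up.

import numpy as np
import tensorflow as tf

def make_sequence_example(sequence):
    # `sequence` is one instance of shape (20, 8): each time step becomes
    # one Feature in the "inputs" FeatureList.
    feature_list = tf.train.FeatureList(feature=[
        tf.train.Feature(float_list=tf.train.FloatList(value=step.tolist()))
        for step in sequence
    ])
    return tf.train.SequenceExample(
        feature_lists=tf.train.FeatureLists(feature_list={"inputs": feature_list}))

# Write a few dummy instances to a TFRecord file.
with tf.python_io.TFRecordWriter("sequences.tfrecord") as writer:
    for _ in range(5):
        example = make_sequence_example(np.random.rand(20, 8).astype(np.float32))
        writer.write(example.SerializeToString())

# Each parsed "inputs" tensor comes back with shape [time_steps, 8]; batching
# these gives the [batch, time, features] layout an LSTM expects.
def parse_fn(serialized):
    _, sequence = tf.parse_single_sequence_example(
        serialized,
        sequence_features={"inputs": tf.FixedLenSequenceFeature([8], dtype=tf.float32)})
    return sequence["inputs"]

dataset = tf.data.TFRecordDataset("sequences.tfrecord").map(parse_fn).batch(4)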
