Load next batch during training step - python

I have a TensorFlow setup that looks something like this:
for b in batch_loader.iter_batches(self.TRAINING_SET):
    ...
    self.session.run(train_step, feed_dict=...)
The iter_batches function loads image data from numpy memory-map files into RAM. Measurements showed that loading the data from disk takes about 1/3 as long as running that train_step. Also, the train_step operation does not need to access the hard drive at all.
So I could make everything faster if I could load the next batch i+1 while I'm training with batch i.
Can I use some python multiprocessing library for this or does TensorFlow offer something for this use case? I looked around their documentation but did not find anything. Is there a canonical way of doing this?

You can set up queues using tf.train.start_queue_runners and tf.train.Coordinator. See here for details.
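Roughly, that could look like the untested sketch below (TF 1.x-style API): a background thread loads batch i+1 from disk and enqueues it into a tf.FIFOQueue while the main thread trains on batch i. The placeholder shapes and the iter_batches call are assumptions taken from the question.

import threading
import tensorflow as tf

images_ph = tf.placeholder(tf.float32, shape=[None, 224, 224, 3])  # assumed image shape
labels_ph = tf.placeholder(tf.int64, shape=[None])

queue = tf.FIFOQueue(capacity=4, dtypes=[tf.float32, tf.int64])
enqueue_op = queue.enqueue([images_ph, labels_ph])
images, labels = queue.dequeue()  # build the model on these tensors instead of using feed_dict

def feed_queue(sess, coord, batch_loader):
    # Runs in a background thread: loads the next batch from disk while the current one trains.
    for imgs, lbls in batch_loader.iter_batches("train"):
        if coord.should_stop():
            break
        sess.run(enqueue_op, feed_dict={images_ph: imgs, labels_ph: lbls})

# coord = tf.train.Coordinator()
# threading.Thread(target=feed_queue, args=(sess, coord, batch_loader), daemon=True).start()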

Related

Import TensorFlow data from pyspark

I want to create a predictive model on several hundred GB of data. The data needs some non-intensive preprocessing that I can do in pyspark but not in tensorflow. In my situation, it would be much more convenient to pass the result of the preprocessing directly to TF, ideally treating the pyspark data frame as a virtual input file to TF, instead of saving the pre-processed data to disk. However, I haven't the faintest idea how to do that, and I couldn't find an answer anywhere on the internet.
After some thought, it seems to me that I actually need an iterator (like the one defined by tf.data.Iterator) over Spark's data. However, I found comments online hinting that the distributed structure of Spark makes it very hard, if not impossible. Why so? Imagine that I don't care about the order of the lines; why should it be impossible to iterate over the Spark data?
It sounds like you simply want to use tf.data.Dataset.from_generator(): you define a Python generator which reads samples out of Spark. Although I don't know Spark very well, I'm certain you can do a reduce to the server that will be running the TensorFlow model. Better yet, if you're distributing your training, you can reduce to the set of servers that need some shard of your final dataset.
The Importing Data programmer's guide covers the Dataset input pipeline in more detail. The TensorFlow Dataset will provide you with an iterator that's accessed directly by the graph, so there's no need for tf.placeholder or for marshaling data outside of the tf.data.Dataset.from_generator() code you write.
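As a rough, untested sketch of that idea (not from the original answer): a Python generator pulls rows from the pyspark DataFrame with toLocalIterator() and feeds them into tf.data.Dataset.from_generator(). The DataFrame name, column names, dtypes, and shapes below are assumptions.

import tensorflow as tf

def spark_row_generator(df):
    # toLocalIterator() streams partitions to the driver one at a time,
    # so the whole DataFrame never has to sit in memory at once.
    for row in df.toLocalIterator():
        yield row["features"], row["label"]

dataset = tf.data.Dataset.from_generator(
    lambda: spark_row_generator(preprocessed_df),  # preprocessed_df: your pyspark result
    output_types=(tf.float32, tf.int64),
    output_shapes=((None,), ()))
dataset = dataset.batch(128).prefetch(1)
features, labels = dataset.make_one_shot_iterator().get_next()  # wire these tensors into the graph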

Speed up Spacy Named Entity Recognition

I'm using spacy to recognize street addresses on web pages.
My model is initialized basically using spacy's new entity type sample code found here:
https://github.com/explosion/spaCy/blob/master/examples/training/train_new_entity_type.py
My training data consists of plain text webpages with their corresponding Street Address entities and character positions.
I was able to quickly build a model in spacy to start making predictions, but I found its prediction speed to be very slow.
My code works by iterating through several raw HTML pages and feeding each page's plain-text version into spaCy as it iterates. For reasons I can't get into, I need to make predictions with spaCy page by page, inside the iteration loop.
After the model is loaded, I'm using the standard way of making predictions, which I'm referring to as the prediction/evaluation phase:
doc = nlp(plain_text_webpage)
if len(doc.ents) > 0:
    print("found entity")
Questions:
How can I speed up the entity prediction/recognition phase? I'm using a c4.8xlarge instance on AWS, and all 36 cores are constantly maxed out when spaCy is evaluating the data. spaCy is turning processing a few million webpages from a 1-minute job into a 1-hour+ job.
Will the speed of entity recognition improve as my model becomes more accurate?
Is there a way to remove pipeline components like the tagger during this phase? Can NER be decoupled like that and still be accurate? Will removing other pipeline components affect the model itself, or is it just a temporary thing?
I saw that you can use a GPU during the NER training phase; can it also be used in this evaluation phase in my code for faster predictions?
Update:
I managed to significantly cut down the processing time by:
Using a custom tokenizer (used the one in the docs)
Disabling other pipelines that aren't for Named Entity Recognition
Instead of feeding the whole body of text from each webpage into spacy, I'm only sending over a maximum of 5,000 characters
My updated code to load the model:
import spacy

nlp = spacy.load('test_model/', disable=['parser', 'tagger', 'textcat'])
nlp.tokenizer = WhitespaceTokenizer(nlp.vocab)  # custom tokenizer class from the spaCy docs
doc = nlp(text)
However, it is still too slow (20X slower than I need it)
Questions:
Are there any other improvements I can make to speed up the Named Entity Recognition? Any fat I can cut from spacy?
I'm still looking to see if a GPU-based solution would help. I saw that GPU use is supported during the Named Entity Recognition training phase; can it also be used in this evaluation phase in my code for faster predictions?
Please see here for details about speed troubleshooting: https://github.com/explosion/spaCy/issues/1508
The most important things:
1) Check which BLAS library numpy is linked against, and make sure it's compiled well for your machine. Using conda is helpful, as then you get Intel's MKL.
2) You say "c4.8xlarge instance on AWS and all 36 cores are constantly maxed out when spacy is evaluating the data." That's probably bad. We can only really parallelise the matrix multiplications at the moment, because we're using numpy, so there's no way to thread larger chunks. This means the BLAS library is probably launching too many threads. In general you can only profitably use 3-4 cores per process. Try setting the environment variables for your BLAS library to restrict the number of threads (see the sketch at the end of this answer).
3) Use nlp.pipe() to process batches of data. This makes the matrix multiplications bigger, making processing more efficient (also shown in the sketch below).
4) Your outer loop of "feed data through my processing pipeline" is probably embarrassingly parallel. So, parallelise it. Either use Python's multiprocessing, or something like joblib, or something like Spark, or just fire off 10 bash scripts in parallel. But take the outermost, highest level chunk of work you can, and run it as independently as possible.
It's actually usually better to run multiple smaller VMs instead of one large VM. It's annoying operationally, but it means less resource sharing.
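To make points 2) and 3) concrete, here is a small illustrative sketch (not from the original answer); the thread counts, batch size, model path, and the webpage_texts iterable are assumptions.

import os

# Cap BLAS threading before numpy/spaCy are imported; these variables cover
# the common BLAS builds (OpenMP, MKL, OpenBLAS).
os.environ["OMP_NUM_THREADS"] = "4"
os.environ["MKL_NUM_THREADS"] = "4"
os.environ["OPENBLAS_NUM_THREADS"] = "4"

import spacy

nlp = spacy.load('test_model/', disable=['parser', 'tagger', 'textcat'])

# Process documents in batches instead of one nlp(text) call per page,
# so the matrix multiplications are larger and more efficient.
for doc in nlp.pipe(webpage_texts, batch_size=50):  # webpage_texts: your iterable of pages
    if len(doc.ents) > 0:
        print("found entity")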

Working with large (+15 gb) CSV datasets and Pandas/XGBoost

I am trying to find a means of starting to work with very large CSV files in Pandas, ultimately to be able to do some machine learning with XGBoost.
I am torn between using MySQL or some SQLite framework to manage chunks of my data; my issue is with the machine learning aspect later on, and with loading chunks at a time to train the model.
My other thought was to use Dask, which is built on top of Pandas, but also has XGBoost functionality.
I'm not sure what the best starting point is and was hoping to ask for an opinion! I am leaning towards Dask but I have not used it yet.
This blogpost goes through an example using XGBoost on a large CSV dataset. However, it did so by using a distributed cluster with enough RAM to fit the entire dataset in memory at once. While many dask.dataframe operations can operate in small space, I don't think that XGBoost training is likely to be one of them. XGBoost seems to operate best when all data is available all the time.
I haven't tried this, but I would load your data into an HDF5 file using h5py. This library lets you store data on disk but access it like a numpy array. Therefore you are no longer constrained by memory for your dataset.
For the XGBoost part, I would use the sklearn API and pass in the h5py object as the X value. I recommend the sklearn API since it accepts numpy-like arrays for input, which should let h5py objects work. Make sure to use a small value for subsample, otherwise you'll likely run out of memory fast.
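A rough, untested sketch of that suggestion; the file name and dataset keys are assumptions, and whether XGBoost actually keeps the data out of memory depends on the version.

import h5py
from xgboost import XGBClassifier

with h5py.File("train.h5", "r") as f:
    X = f["features"]   # h5py dataset: behaves like a numpy array but lives on disk
    y = f["labels"][:]  # labels are small enough to load fully

    # Pass the h5py object as X, as suggested above; note that XGBoost may still
    # convert it to an in-memory array internally.
    model = XGBClassifier(subsample=0.1)  # small subsample to limit memory use
    model.fit(X, y)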

How to load a very large file into Tensorflow and create minibatches?

I have an hdf5 file that has been written to disk by a python code (I only have the file, not the code). The size of this file is 90 GB, and the data in it has the following shape: (N, 250, 360, 3). Just as a side note, the data can't fit into memory.
Now I want to write a data loader in Tensorflow that loads just M samples from this file at a time (M is much smaller than N).
What would be the best way to do this? Any pointer to a code would be highly appreciated.
Thanks.
J
The Tensorflow MNIST tutorial shows how this can be done:
https://www.tensorflow.org/tutorials/mnist/beginners/
If you look at the implementation on Github, you'll see that it uses a next_batch function to read batches of inputs 100 at a time.
The implementation of next_batch lives here:
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/learn/python/learn/datasets/mnist.py#L160
You would need to implement something similar for your data set. I'm not particularly familiar with HDF5, but you can use any Python library to do the loading; it doesn't have to be specific to Tensorflow.
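For example, here is an untested sketch of a next_batch-style loader over the HDF5 file using h5py (the dataset key "images" and the placeholder x are assumptions):

import h5py

class HDF5BatchLoader(object):
    def __init__(self, path, batch_size):
        self.f = h5py.File(path, "r")
        self.data = self.f["images"]  # shape (N, 250, 360, 3), stays on disk
        self.batch_size = batch_size
        self.pos = 0

    def next_batch(self):
        # Read only M samples at a time; h5py slices from disk on demand.
        end = min(self.pos + self.batch_size, self.data.shape[0])
        batch = self.data[self.pos:end]
        self.pos = 0 if end == self.data.shape[0] else end
        return batch

# loader = HDF5BatchLoader("data.h5", batch_size=M)
# sess.run(train_step, feed_dict={x: loader.next_batch()})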
Hope that helps!

Benchmark of HowTo: Reading Data

I'm using tensorflow 0.10 and I was benchmarking the examples found in the official HowTo on reading data. This HowTo illustrates different methods to move data to tensorflow, using the same MNIST example.
I was surprised by the results and I was wondering if anyone has enough low-level understanding to explain what is happening.
In the HowTo there are basically 3 methods to read in data:
Feeding: building the mini-batch in python and passing it with sess.run(..., feed_dict={x: mini_batch})
Reading from files: use tf operations to open the files and create mini-batches. (Bypass handling data in python.)
Preloaded data: load all the data in either a single tf variable or constant and use tf functions to break that up in mini-batches. The variable or constant is pinned to the cpu, not gpu.
The scripts I used to run my benchmarks are found within tensorflow:
Feeding: examples/tutorials/mnist/fully_connected_feed.py
Reading from files: examples/how_tos/reading_data/convert_to_records.py and examples/how_tos/reading_data/fully_connected_reader.py
Preloaded data (constant): examples/how_tos/reading_data/fully_connected_preloaded.py
Preloaded data (variable): examples/how_tos/reading_data/fully_connected_preloaded_var.py
I ran those scripts unmodified, except for the last two, because they crash (for version 0.10 at least) unless I add an extra sess.run(tf.initialize_local_variables()).
Main Question
The time to execute 100 mini-batches of 100 examples running on a GTX1060:
Feeding: ~0.001 s
Reading from files: ~0.010 s
Preloaded data (constant): ~0.010 s
Preloaded data (variable): ~0.010 s
Those results are quite surprising to me. I would have expected Feeding to be the slowest, since it does almost everything in python, while the other methods use lower-level tensorflow/C++ to carry out similar operations. It is the complete opposite of what I expected. Does anyone understand what is going on?
Secondary question
I have access to another machine which has a Titan X and older NVidia drivers. The relative results were roughly in line with the above, except for Preloaded data (constant) which was catastrophically slow, taking many seconds for a single mini-batch.
Is this some known issue that performance can vary greatly with hardware/drivers?
Update Oct 9: the slowness comes from the computation running too fast for Python to pre-empt the computation thread and schedule the pre-fetching threads. Computation in the main thread takes 2 ms, and apparently that's too little for a pre-fetching thread to grab the GIL. The pre-fetching threads have larger delays and hence can always be pre-empted by the computation thread. So the computation thread runs through all of the examples, and then spends most of its time blocked on the GIL as some prefetching thread gets scheduled and enqueues a single example. The solution is to increase the number of Python threads, increase the queue size to fit the entire dataset, start the queue runners, and then pause the main thread for a couple of seconds to give the queue runners time to pre-populate the queue.
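A hedged sketch of that workaround (TF 1.x-style API; input_images, input_labels, and the sizes and thread counts are assumptions):

import time
import tensorflow as tf

# Several pre-fetching threads and a queue capacity large enough for the whole dataset.
image, label = tf.train.slice_input_producer(
    [input_images, input_labels], capacity=55000)
image_batch, label_batch = tf.train.batch(
    [image, label], batch_size=100, num_threads=5, capacity=500)

sess = tf.Session()
sess.run(tf.global_variables_initializer())
coord = tf.train.Coordinator()
tf.train.start_queue_runners(sess=sess, coord=coord)
time.sleep(10)  # pause so the queue runners can pre-populate the queue before training starts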
Old stuff
That's surprisingly slow.
This looks like some kind of special case making the last 3 examples unnecessarily slow (most effort went into optimizing large models like ImageNet, so MNIST didn't get as much attention).
You can diagnose the problems by getting timelines, as described here
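For reference, collecting a timeline for a single session.run call looks roughly like this in TF 1.x-style code (sess and train_op are assumptions):

import tensorflow as tf
from tensorflow.python.client import timeline

run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
run_metadata = tf.RunMetadata()

sess.run(train_op, options=run_options, run_metadata=run_metadata)

# Write a Chrome trace that can be opened in chrome://tracing.
trace = timeline.Timeline(step_stats=run_metadata.step_stats)
with open("timeline.json", "w") as f:
    f.write(trace.generate_chrome_trace_format())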
Here are 3 of those examples with timeline collection enabled.
Here's the timeline for feed_dict implementation
The important thing to notice is that matmul takes a good chunk of the time, so the reading overhead is not significant.
Now here's the timeline for reader implementation
You can see that the operation is bottlenecked on QueueDequeueMany, which takes a whopping 45 ms.
If you zoom in, you'll see a bunch of tiny MEMCPY and Cast operations, which is a sign of some op being CPU-only (parse_single_example), and of the dequeue having to schedule multiple independent CPU->GPU transfers.
For the var example below with GPU disabled, I don't see tiny little ops, but QueueDequeueMany still takes over 10ms. The timing seems to scale linearly with batch size, so there's some fundamental slowness there. Filed #4740
Yaroslav nails the problem well. With small models you'll need to speed up the data import. One way to do this is with the TensorFlow function tf.TFRecordReader.read_up_to, which reads multiple records in each session.run() call, thereby removing the excess overhead caused by multiple calls.
enqueue_many_size = SOME_ENQUEUE_MANY_SIZE
reader = tf.TFRecordReader(
    options=tf.python_io.TFRecordOptions(tf.python_io.TFRecordCompressionType.ZLIB))
_, queue_batch = reader.read_up_to(filename_queue, enqueue_many_size)
batch_serialized_example = tf.train.shuffle_batch(
    [queue_batch],
    batch_size=batch_size,
    num_threads=thread_number,
    capacity=capacity,
    min_after_dequeue=min_after_dequeue,
    enqueue_many=True)
This was also addressed in this SO question.
The main question is why the example with the preloaded data (constant), examples/how_tos/reading_data/fully_connected_preloaded.py, is significantly slower than the other data loading example codes when using a GPU.
I had the same problem: fully_connected_preloaded.py was unexpectedly slow on my Titan X. The problem was that the whole dataset was pre-loaded on the CPU, not the GPU.
First, let me share my initial attempts. I applied the following performance tips by Yaroslav.
set capacity=55000 for tf.train.slice_input_producer (55000 is the size of the MNIST training set in my case).
set num_threads=5 for tf.train.batch.
set capacity=500 for tf.train.batch.
put time.sleep(10) after tf.train.start_queue_runners.
However, the average speed per batch stayed the same. I tried timeline visualization for profiling, and still got QueueDequeueManyV2 dominating.
The problem was line 65 of fully_connected_preloaded.py. The following code loads the entire dataset onto the CPU, still creating a bottleneck for CPU-GPU data transfer:
with tf.device('/cpu:0'):
    input_images = tf.constant(data_sets.train.images)
    input_labels = tf.constant(data_sets.train.labels)
Hence, I switched the device allocation:
with tf.device('/gpu:0'):
Then I got a 100x speed-up per batch.
Note:
This was possible because the Titan X has enough memory to preload the entire dataset.
In the original code (fully_connected_preloaded.py), the comment on line 64 says "rest of pipeline is CPU-only". I am not sure what this comment intended.
