Benchmark of HowTo: Reading Data - python

I'm using TensorFlow 0.10 and I was benchmarking the examples found in the official HowTo on reading data. The HowTo illustrates different methods for moving data into TensorFlow, using the same MNIST example.
I was surprised by the results and I was wondering if anyone has enough low-level understanding to explain what is happening.
In the HowTo there are basically 3 methods to read in data:
Feeding: building the mini-batch in Python and passing it with sess.run(..., feed_dict={x: mini_batch})
Reading from files: use TF operations to open the files and create mini-batches. (Bypasses handling the data in Python.)
Preloaded data: load all the data into either a single TF variable or constant and use TF functions to break it up into mini-batches. The variable or constant is pinned to the CPU, not the GPU.
The scripts I used to run my benchmarks are found within tensorflow:
Feeding: examples/tutorials/mnist/fully_connected_feed.py
Reading from files: examples/how_tos/reading_data/convert_to_records.py and examples/how_tos/reading_data/fully_connected_reader.py
Preloaded data (constant): examples/how_tos/reading_data/fully_connected_preloaded.py
Preloaded data (variable): examples/how_tos/reading_data/fully_connected_preloaded_var.py
I ran those scripts unmodified, except for the last two, which crash (for version 0.10 at least) unless I add an extra sess.run(tf.initialize_local_variables()).
Main Question
The time to execute 100 mini-batches of 100 examples running on a GTX1060:
Feeding: ~0.001 s
Reading from files: ~0.010 s
Preloaded data (constant): ~0.010 s
Preloaded data (variable): ~0.010 s
Those results are quite surprising to me. I would have expected Feeding to be the slowest, since it does almost everything in Python, while the other methods use lower-level TensorFlow/C++ to carry out similar operations. It is the complete opposite of what I expected. Does anyone understand what is going on?
Secondary question
I have access to another machine with a Titan X and older NVIDIA drivers. The relative results were roughly in line with the above, except for Preloaded data (constant), which was catastrophically slow, taking many seconds for a single mini-batch.
Is it a known issue that performance can vary this much with hardware/drivers?

Update Oct 9: the slowness comes from the computation running too fast for Python to pre-empt the computation thread and schedule the pre-fetching threads. Computation in the main thread takes 2 ms, and apparently that's too little for a pre-fetching thread to grab the GIL. A pre-fetching thread has a larger delay and hence can always be pre-empted by the computation thread. So the computation thread runs through all of the examples and then spends most of its time blocked on the GIL while some prefetching thread gets scheduled and enqueues a single example. The solution is to increase the number of Python threads, increase the queue size to fit the entire dataset, start the queue runners, and then pause the main thread for a couple of seconds to give the queue runners time to pre-populate the queue.
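A rough sketch of that recipe, in the spirit of the preloaded examples (input_images, input_labels and DATASET_SIZE stand for the full in-memory dataset and its size; all values are illustrative):

import time
import tensorflow as tf

# Make the queue large enough to hold the whole dataset, so the prefetching
# threads can fill it once and the computation thread never starves.
image, label = tf.train.slice_input_producer(
    [input_images, input_labels], capacity=DATASET_SIZE)
image_batch, label_batch = tf.train.batch(
    [image, label], batch_size=100,
    num_threads=16, capacity=DATASET_SIZE)

sess = tf.Session()
sess.run(tf.initialize_all_variables())
sess.run(tf.initialize_local_variables())
tf.train.start_queue_runners(sess=sess)
# Give the queue-runner threads a couple of seconds to pre-populate the
# queue before the (much faster) main thread starts dequeuing.
time.sleep(3)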
Old stuff
That's surprisingly slow.
This looks like some kind of special case making the last 3 examples unnecessarily slow (most effort went into optimizing large models like ImageNet, so MNIST didn't get as much attention).
You can diagnose the problems by collecting timelines, as described here.
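For reference, timeline collection looks roughly like this (sess and train_op are assumed to come from the example scripts; the timeline module's API may differ slightly across versions):

import tensorflow as tf
from tensorflow.python.client import timeline

run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
run_metadata = tf.RunMetadata()
sess.run(train_op, options=run_options, run_metadata=run_metadata)

# Dump a Chrome trace that can be inspected at chrome://tracing.
trace = timeline.Timeline(run_metadata.step_stats)
with open('timeline.json', 'w') as f:
    f.write(trace.generate_chrome_trace_format())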
Here are 3 of those examples with timeline collection enabled.
Here's the timeline for the feed_dict implementation:
The important thing to notice is that matmul takes a good chunk of the time, so the reading overhead is not significant.
Now here's the timeline for the reader implementation:
You can see that the operation is bottlenecked on QueueDequeueMany, which takes a whopping 45 ms.
If you zoom in, you'll see a bunch of tiny MEMCPY and Cast operations, which is a sign of some op being CPU-only (parse_single_example), and of the dequeue having to schedule multiple independent CPU->GPU transfers.
For the var example below with GPU disabled, I don't see tiny little ops, but QueueDequeueMany still takes over 10ms. The timing seems to scale linearly with batch size, so there's some fundamental slowness there. Filed #4740

Yaroslav nails the problem well. With small models you'll need to speed up the data import. One way to do this is with the TensorFlow function tf.TFRecordReader.read_up_to, which reads multiple records in each session.run() call, thereby removing the excess overhead caused by multiple calls.
import tensorflow as tf

# Queue of input TFRecord files ("filenames" is assumed to be a list of paths).
filename_queue = tf.train.string_input_producer(filenames)

# Read several serialized records per session.run() call to amortize the per-call overhead.
enqueue_many_size = SOME_ENQUEUE_MANY_SIZE
reader = tf.TFRecordReader(
    options=tf.python_io.TFRecordOptions(tf.python_io.TFRecordCompressionType.ZLIB))
_, queue_batch = reader.read_up_to(filename_queue, enqueue_many_size)

# enqueue_many=True because queue_batch already holds multiple serialized examples.
batch_serialized_example = tf.train.shuffle_batch(
    [queue_batch],
    batch_size=batch_size,
    num_threads=thread_number,
    capacity=capacity,
    min_after_dequeue=min_after_dequeue,
    enqueue_many=True)
This was also addressed in this SO question.

The main question is why the example with preloaded data (constant),
examples/how_tos/reading_data/fully_connected_preloaded.py, is significantly slower than the other data loading examples when using a GPU.
I had the same problem: fully_connected_preloaded.py was unexpectedly slow on my Titan X. The cause was that the whole dataset was preloaded onto the CPU, not the GPU.
First, let me share my initial attempts. I applied the following performance tips by Yaroslav.
set capacity=55000 for tf.train.slice_input_producer (55000 is the size of the MNIST training set in my case)
set num_threads=5 for tf.train.batch.
set capacity=500 for tf.train.batch.
put time.sleep(10) after tf.train.start_queue_runners.
However, the average time per batch stayed the same. I tried timeline visualization for profiling, and QueueDequeueManyV2 still dominated.
The problem was line 65 of fully_connected_preloaded.py. The following code loads the entire dataset onto the CPU, which still creates a bottleneck for CPU-GPU data transfer.
with tf.device('/cpu:0'):
    input_images = tf.constant(data_sets.train.images)
    input_labels = tf.constant(data_sets.train.labels)
Hence, I switched the device allocation:
with tf.device('/gpu:0'):
Then I got a ~100x speed-up per batch.
Note:
This was possible because the Titan X has enough memory to preload the entire dataset.
In the original code (fully_connected_preloaded.py), the comment on line 64 says "rest of pipeline is CPU-only". I am not sure what this comment intended.

Related

Query execution time with small batches vs entire input set

I'm using ArangoDB 3.9.2 for a search task. The number of items in the dataset is 100,000. When I pass the entire dataset as an input list to the engine, the execution time is around ~10 sec, which is pretty quick. But if I pass the dataset in small batches of 100 items each, the execution time grows rapidly: processing the full dataset takes about ~2 min. Could you please explain why this is happening? The dataset is the same.
I'm using the Python driver "ArangoClient" from the python-arango lib, ver 0.2.1.
PS: I had a similar problem with Neo4j, but it was solved by committing transactions via the HTTP API. Does ArangoDB have something similar?
Every time you make a call to a remote system (Neo4j, ArangoDB, or any database) there is overhead in making the connection, sending the data, and then, after executing your command, tearing down the connection.
What you're doing is trying to find the 'sweet spot' for your implementation as to the most efficient batch size for the type of data you are sending, the complexity of your query, the performance of your hardware, etc.
What I recommend doing is writing a test script that sends the data in varying batch sizes to help you determine the optimal settings for your use case.
I have taken this approach with many systems that I've designed and the optimal batch sizes are unique to each implementation. It totally depends on what you are doing.
See what results you get for the overall load time if you use batch sizes of 100, 1000, 2000, 5000, and 10000.
This way you'll work out the best answer for you.
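A minimal sketch of such a test script (assuming a recent python-arango client and an AQL query that takes the batch as a bind variable; dataset stands for your 100,000 items, and the query is a placeholder for your real search):

import time
from arango import ArangoClient

client = ArangoClient(hosts='http://localhost:8529')
db = client.db('_system', username='root', password='')

# Placeholder for your real search query; the batch is passed as a bind variable.
QUERY = 'FOR item IN @items RETURN item'

def run_in_batches(dataset, batch_size):
    start = time.time()
    for i in range(0, len(dataset), batch_size):
        batch = dataset[i:i + batch_size]
        # Drain the cursor so the server-side work is actually completed.
        list(db.aql.execute(QUERY, bind_vars={'items': batch}))
    return time.time() - start

# dataset = [...]  # your 100,000 items
for size in (100, 1000, 2000, 5000, 10000):
    print(size, 'items/batch:', run_in_batches(dataset, size), 'seconds')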

Basic Mapreduce with threads is slower than sequential version

I am trying to do a word counter with MapReduce using threads, but this version is much slower than the sequential version. With a 300MB text file the MapReduce version takes about 80s, while the sequential version takes significantly less. My question comes from not understanding why: I have implemented all the stages of MapReduce (split, mapping, shuffle, reduce), but I can't figure out why it is slower, since I used about 6 threads for the test. I was thinking that the creation of the threads might be expensive compared to the execution time, but since it takes about 80s I think it is clear that this is not the problem. Could you take a look at the code to see what it is? I'm pretty sure the code works fine; the problem is that I don't know what is causing the slowness.
One last thing: when using a text file of more than 300MB, the program fills all the RAM of my computer. Is there any way to optimize it?
First of all several disclaimers:
to know the exact reason why the application is slow, you need to profile it; in this answer I'm giving some common-sense reasoning.
I'm assuming you are using CPython.
When you parallelize an algorithm, there are several factors that influence performance. Some of them work in favour of speed (I'll mark them with +) and some against (-). Let's look at them:
you need to split the work first (-)
work done by parallel workers happens simultaneously (+)
parallel workers may need to synchronize their work (-)
reduce requires time (-)
In order for your parallel algorithm to give you any gain over the sequential one, all the factors that speed things up must outweigh all the factors that drag you down.
Also, the gain from #2 should be big enough to compensate for the additional work you need to do compared to sequential processing (this means that for small data you will not get any boost, as the coordination overhead will be bigger).
The main problems in your implementation now are items #2 and #3.
First of all, the workers are not actually working in parallel. The portion of the task you parallelize is CPU bound, and in Python the threads of a single process cannot use more than one CPU. So in this program they never execute in parallel; they all share the same CPU.
Moreover, every modification they make to the dicts uses locking/unlocking, and this is much slower than the sequential version, which does not require such synchronization.
To speed up your algorithm you need:
use multiprocessing instead of multithreading (this way you can use multiple CPU for processing)
structure the algorithm in a way that does not require synchronization between workers when they do their job (each worker should use its own dict to store intermediate results)
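A minimal sketch combining both points: multiprocessing workers, each building its own Counter with no shared state, merged once at the end (the file path and chunking are illustrative):

from collections import Counter
from multiprocessing import Pool

def map_chunk(lines):
    # Each worker builds its own Counter: no shared state, no locking.
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts

def word_count(path, workers=6, chunk_size=100_000):
    with open(path, encoding='utf-8') as f:
        lines = f.readlines()
    chunks = [lines[i:i + chunk_size] for i in range(0, len(lines), chunk_size)]
    with Pool(workers) as pool:
        partial_counts = pool.map(map_chunk, chunks)
    # Reduce: merge the per-worker counters once at the end.
    return sum(partial_counts, Counter())

if __name__ == '__main__':
    print(word_count('big.txt').most_common(10))

Note that readlines() still pulls the whole file into memory; streaming the file (or handing each worker a byte range to read itself) would also address the RAM issue you mention.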

Dataflow pipeline throughput decreases drastically as execution advances + unexpected side input behavior

I have a dataflow pipeline processing an input of about 1Gb of data with two dicts as side_inputs. The goal is to calculate features from the main dataset with the help of those two side_inputs.
Overall structure of the pipeline is as follows:
# First side input, ends up as a 2GB dict with 3.5 million keys
side_inp1 = (
    p
    | "read side_input1" >> beam.io.ReadFromAvro("$PATH/*.avro")
    | "to list of tuples" >> beam.Map(lambda row: (row["key"], row["value"]))
)

# Second side input, ends up as a 1.6GB dict with 4.5 million keys
side_inp2 = (
    p
    | "read side_input2" >> beam.io.ReadFromAvro("$PATH2/*.avro")
    | "to list of tuples" >> beam.Map(lambda row: (row["key"], row["value"]))
)

# The main part of the pipeline, reading an avro dataset of 1 million rows -- 20GB
(
    p
    | "read inputs" >> beam.io.ReadFromAvro("$MainPath/*.avro")
    | "main func" >> beam.Map(MyMapper, pvalue.AsDict(side_inp1), pvalue.AsDict(side_inp2))
)
Here's the Dataflow graph:
And the "Featurize" step unwrapped:
So Featurize is a function that looks up ids in the side inputs, .get()s the vectors, and computes something like 180 different vector dot products to calculate features. It's a completely CPU-bound process and it's expected to take longer than the rest of the pipeline, but the stalling is the strange thing here.
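For context, a simplified sketch of that shape (not the actual code; the field names and the single feature below are placeholders):

import numpy as np

def MyMapper(row, id_to_vec1, id_to_vec2):
    # The two side inputs arrive as plain dicts (pvalue.AsDict above).
    v1 = id_to_vec1.get(row["id1"])
    v2 = id_to_vec2.get(row["id2"])
    if v1 is None or v2 is None:
        return {}
    # In reality ~180 dot-product style features are computed here,
    # which is why the step is completely CPU bound.
    return {"feature_dot": float(np.dot(v1, v2))}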
My problems are two fold:
The Dataflow pipeline seems to slow down drastically as it moves further through the process. I don't know what the reasons are or how I can alleviate this problem. A throughput chart of the MyMapper step can be seen below; I'm wondering about the declining throughput (from ~400 rows/sec to nearly ~1 row/sec at the end).
Also, the behavior of the side_inputs is strange to me. I expected the side_inputs to be read once and only once, but when I check the Job Metrics / Throughput chart, I observe the following. As can be seen, the pipeline is constantly reading the side_inputs, while what I want is just two dicts kept in memory.
Other job configurations
zone: us-central-1a
machine_type: m1-ultramem-40 (40 CPU cores, 960GB RAM)
disk_type/size: ssd/50GB
experiments: shuffle-service enabled.
max_num_workers: 1 to help ease calculations and metrics, and not have them vary due to auto-scaling.
Extra Observations
I'm constantly seeing log entries like the following in LogViewer: "[INFO] Completed workitem: 4867151000103436312 in 1069.056863785 seconds"
All completed workItems so far have taken about 1000-1100 seconds; this is another source of confusion: why should throughput drop while processing a workItem takes the same time as before? Has parallelism dropped for some reason? (Maybe some hidden threading threshold that's out of my control, like harness_threads?)
In the later parts of the pipeline, looking at the logs, it looks like the execution pattern is very sequential (it seems to execute one workItem, finish it, and go to the next), which is strange to me considering there's 1TB of available memory and 40 cores.
There are 0 errors or even warnings
The throughput chart in point 1 is a good indicator that the performance in your job decreased somehow.
The side input is intended to be in memory; however, I'm not quite sure that a pipeline with only 1 highmem node is a good approach. By having only one node, the pipeline might have bottlenecks that are difficult to identify, e.g. network or OS limitations (like the maximum number of open files related to the files loaded into memory). Given Beam's architecture, I don't think having more nodes is a problem, even if autoscaling is enabled, since autoscaling automatically chooses the appropriate number of worker instances required to run your job. If you are worried about calculations and metrics for other reasons, please share.
Regarding point 2, I think it is expected to find activity on the graph, since the side input (in memory) is read by each element being processed. However, if this doesn't make sense to you, you can always add the complete job graph for us to understand any other details of the pipeline steps.
My recommendation is to add more workers to distribute the workload, as a PCollection is a distributed dataset that will be spread across the available nodes. You can try to keep similar computational resources with more nodes, for example 4 n2d-highmem-16 instances (16 vCPUs, 128GB). With these changes it is possible that any bottlenecks disappear or can be mitigated; in addition, you can monitor the new job in the same way:
Remember to check errors in your pipeline, so you can identify any other issues that are happening/causing the performance issue.
Check the CPU and memory usage in the Dataflow UI. If memory errors are happening at the job level, Stackdriver should show them as memory errors, but the memory on the host instance should also be checked to be sure it is not reaching the OS limit for other reasons.
You might want to check this example with side inputs as dictionaries. I'm not an expert, but you can follow the best practices in the example.
UPDATE
If the n2d-highmem-16 machines hit OOM, it seems to me that each harness thread might use a copy of the dicts. I'm not quite sure whether configuring the number of threads can help, but you can try setting number_of_worker_harness_threads in the pipeline options.
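A sketch of how those options could be passed (values are illustrative, and I haven't verified that this removes the OOM):

from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner='DataflowRunner',
    machine_type='n2d-highmem-16',
    # Capping harness threads may limit how many copies of the large
    # side-input dicts are in use at once (the value here is illustrative).
    number_of_worker_harness_threads=8,
)
# pipeline = beam.Pipeline(options=options)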
On the other hand, can you expand the Featurize step? The wall time is very high in this step (~6 days); let's check which composite transforms absorbed such latency. For the problematic composite transforms, let us know the code snippet. To identify the composite transforms that can have issues, please refer to the Side Inputs Metrics, especially Time spent writing and Time spent reading.

TensorFlow PoolAllocator huge number of requests

Using TensorFlow r0.9/r0.10 I get the following message, which makes me worried that I've set up my neural network model in the wrong way.
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:244] PoolAllocator: After 6206792 get requests, put_count=6206802 evicted_count=5000 eviction_rate=0.000805568 and unsatisfied allocation rate=0.000806536
The network I use is similar to AlexNet/VGG-M. I create the variables and the ops in a function that is called once, and then I just loop over multiple epochs, calling the same optimizer, loss, and prediction functions for each mini-batch iteration.
Another thing that worries me is that the network can be unstable when using a large batch size: it runs fine for a few epochs, and then it goes out of memory (trying to allocate...).
Is there any way to check if there is something wrong and what it is?
This is an info-level log statement (the "I" prefix). It does not necessarily mean that anything is wrong: however, the pool allocator (a cache for allocations) is finding that it frequently has to fall back on the underlying allocator. This may indicate memory pressure.
For your instability problem: as you observe, large batches can lead to out-of-memory errors. There is some nondeterminism to operator scheduling, which is why you may not see it fail every time. Try lowering your batch size until you consistently no longer see out of memory errors.

Run out of VRAM using Theano on Amazon cluster

I'm trying to execute the logistic_sgd.py code on an Amazon cluster running the ami-b141a2f5 (Theano - CUDA 7) image.
Instead of the included MNIST database I am using the SD19 database, which requires changing a few dimensional constants, but otherwise no code has been touched. The code runs fine locally, on my CPU, but once I SSH the code and data to the Amazon cluster and run it there, I get this output:
It looks to me like it is running out of VRAM, but it was my understanding that the code should run on a GPU already, without any tinkering on my part necessary. After following the suggestion from the error message, the error persists.
There's nothing especially strange here. The error message is almost certainly accurate: there really isn't enough VRAM. Often, a script will run fine on CPU but then fail like this on GPU simply because there is usually much more system memory available than GPU memory, especially since the system memory is virtualized (and can page out to disk if required) while the GPU memory isn't.
For this script, there needs to be enough memory to store the training, validation, and testing data sets, the model parameters, and enough working space to store intermediate results of the computation. There are two options available:
Reduce the amount of memory needed for one or more of these three components. Reducing the amount of training data is usually easiest; reducing the size of the model comes next. Unfortunately, both of those options will often impair the quality of the result being looked for. Reducing the amount of memory needed for intermediate results is usually beyond the developer's control -- it is managed by Theano, but there is sometimes scope for altering the computation to achieve this goal once a good understanding of Theano's internals is gained.
If the model parameters and working memory can fit in GPU memory, then the most common solution is to change the code so that the data is no longer stored in GPU memory (i.e. just store it as NumPy arrays, not as Theano shared variables), then pass each batch of data in as inputs instead of givens. The LSTM sample code is an example of this approach.
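A sketch contrasting the two styles (the shapes and the dummy cost below are illustrative, not the tutorial's actual model): the first keeps the whole dataset in a Theano shared variable (typically in GPU memory) and slices it via givens; the second keeps the data as NumPy arrays on the host and feeds one mini-batch per call.

import numpy as np
import theano
import theano.tensor as T

x = T.matrix('x')
cost = x.sum()                      # stand-in for the real model cost
index = T.lscalar('index')
batch_size = 128
data = np.zeros((60000, 784), dtype=theano.config.floatX)

# Style 1 (givens): the whole dataset lives in a shared variable, usually on the GPU.
train_x = theano.shared(data)
train_givens = theano.function(
    [index], cost,
    givens={x: train_x[index * batch_size:(index + 1) * batch_size]})

# Style 2 (inputs): the dataset stays in host memory as a NumPy array and
# only one mini-batch is transferred to the GPU per call.
train_inputs = theano.function([x], cost)
batch_cost = train_inputs(data[0:batch_size])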
