Tensorflow Multi-GPU reusing vs. duplicating? - python

To train a model on multiple GPUs one can create one set of variables on the first GPU and reuse them (via tf.variable_scope(tf.get_variable_scope(), reuse=device_num != 0)) on the other GPUs, as in cifar10_multi_gpu_train.
But I came across the official CNN benchmarks where, in the local replicated setting, they use a new variable scope for each GPU (via tf.variable_scope('v%s' % device_num)). Since all variables are initialized randomly, a post-init op is used to copy the values from GPU:0 to the others.
Both implementations then average the gradients on the CPU and apply the result (at least that is what I think, since the benchmark code is cryptic :)), probably resulting in the same outcome.
What, then, is the difference between these two approaches, and more importantly, which is faster?
Thank you.

The difference is that if you're reusing variables, every iteration starts with a broadcast of the variables from their original location to all GPUs, while if you're copying variables this broadcast is unnecessary, so not sharing should be faster.
The one downside of not sharing is that it's easier for a bug or numerical instability somewhere to lead to different GPUs ending up with different values for each variable.
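For concreteness, here is a minimal TF 1.x sketch of the two scoping patterns discussed above. build_tower and the random per-GPU inputs are hypothetical stand-ins for the real model code, and the snippet assumes four GPUs; it is illustrative, not the benchmarks' actual implementation.

import tensorflow as tf

def build_tower(images):
    # Hypothetical model body; tf.get_variable is what makes sharing possible.
    w = tf.get_variable('w', shape=[784, 10])
    return tf.matmul(images, w)

# Stand-in per-GPU input batches.
inputs = [tf.random_normal([128, 784]) for _ in range(4)]

# Approach 1 (cifar10_multi_gpu_train): one set of variables, reused on every GPU.
for device_num in range(4):
    with tf.device('/gpu:%d' % device_num):
        with tf.variable_scope(tf.get_variable_scope(), reuse=device_num != 0):
            logits = build_tower(inputs[device_num])

# Approach 2 (benchmarks' local replicated mode): a separate copy of the
# variables per GPU; a post-init op then copies GPU:0's values to the others.
for device_num in range(4):
    with tf.device('/gpu:%d' % device_num):
        with tf.variable_scope('v%s' % device_num):
            logits = build_tower(inputs[device_num])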

Related

Difference between MirroredStrategy and CentralStorageStrategy

I read the documentation of both CentralStorageStrategy and MirroredStrategy, but cannot understand the essential difference between them.
In MirroredStrategy:
Each variable in the model is mirrored across all the replicas.
In CentralStorageStrategy:
Variables are not mirrored, instead they are placed on the CPU and operations are replicated across all local GPUs.
Source: https://www.tensorflow.org/guide/distributed_training
What does this mean in practice? What are the use cases for CentralStorageStrategy, and how does training work if variables are placed on the CPU in this strategy?
Consider one particular variable (call it "my_var") in your usual, single-GPU, non-distributed use case (e.g. a weight matrix of a convolutional layer).
If you use 4 GPUs, MirroredStrategy will create 4 copies of "my_var", one on each GPU. However, each copy will always have the same value, because they are all updated in the same way: the variable updates happen in sync on all the GPUs.
In the case of CentralStorageStrategy, only one variable is created for "my_var", in host (CPU) memory, and the updates happen in only one place.
Which one is better probably depends on the computer's topology and how fast CPU-GPU communication is compared with GPU-GPU. If the GPUs can communicate fast with each other, MirroredStrategy may be more efficient. But I'd benchmark it to be sure.
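As an illustration, here is a minimal TF 2.x sketch of how the two strategies are selected; the tiny Keras model is just a placeholder, and CentralStorageStrategy lives under tf.distribute.experimental in current releases.

import tensorflow as tf

# Variables created inside strategy.scope() are placed according to the strategy:
# MirroredStrategy keeps one synchronized copy per GPU, while CentralStorageStrategy
# keeps a single copy on the CPU and replicates only the computation.
strategy = tf.distribute.MirroredStrategy()
# strategy = tf.distribute.experimental.CentralStorageStrategy()

with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(10, activation='softmax', input_shape=(784,))
    ])
    model.compile(optimizer='sgd', loss='sparse_categorical_crossentropy')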

Benchmark of HowTo: Reading Data

I'm using tensorflow 0.10 and I was benchmarking the examples found in the official HowTo on reading data. This HowTo illustrates different methods to move data to tensorflow, using the same MNIST example.
I was surprised by the results and I was wondering if anyone has enough low-level understanding to explain what is happening.
In the HowTo there are basically 3 methods to read in data:
Feeding: building the mini-batch in python and passing it with sess.run(..., feed_dict={x: mini_batch})
Reading from files: use tf operations to open the files and create mini-batches. (Bypass handling data in python.)
Preloaded data: load all the data into either a single tf variable or constant and use tf functions to break that up into mini-batches. The variable or constant is pinned to the CPU, not the GPU. (See the sketch after this list.)
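For reference, here is a rough sketch of the preloaded-data pattern using the TF 1.x queue-runner API that the example scripts rely on; the random arrays and shapes are stand-ins for the real MNIST data.

import numpy as np
import tensorflow as tf

# The whole dataset sits in a single constant pinned to the CPU; mini-batches
# are then sliced out with TF ops instead of being fed from Python.
images_np = np.random.rand(55000, 784).astype(np.float32)
labels_np = np.random.randint(0, 10, size=55000).astype(np.int32)

with tf.device('/cpu:0'):
    all_images = tf.constant(images_np)
    all_labels = tf.constant(labels_np)

image, label = tf.train.slice_input_producer([all_images, all_labels], shuffle=True)
images_batch, labels_batch = tf.train.batch([image, label], batch_size=100)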
The scripts I used to run my benchmarks are found within tensorflow:
Feeding: examples/tutorials/mnist/fully_connected_feed.py
Reading from files: examples/how_tos/reading_data/convert_to_records.py and examples/how_tos/reading_data/fully_connected_reader.py
Preloaded data (constant): examples/how_tos/reading_data/fully_connected_preloaded.py
Preloaded data (variable): examples/how_tos/reading_data/fully_connected_preloaded_var.py
I ran those scripts unmodified, except for the last two, which crash (for version 0.10 at least) unless I add an extra sess.run(tf.initialize_local_variables()).
Main Question
The time to execute 100 mini-batches of 100 examples running on a GTX1060:
Feeding: ~0.001 s
Reading from files: ~0.010 s
Preloaded data (constant): ~0.010 s
Preloaded data (variable): ~0.010 s
Those results are quite surprising to me. I would have expected Feeding to be the slowest since it does almost everything in Python, while the other methods use lower-level tensorflow/C++ to carry out similar operations. It is the complete opposite of what I expected. Does anyone understand what is going on?
Secondary question
I have access to another machine which has a Titan X and older NVidia drivers. The relative results were roughly in line with the above, except for Preloaded data (constant) which was catastrophically slow, taking many seconds for a single mini-batch.
Is it a known issue that performance can vary greatly with hardware/drivers?
Update Oct 9: the slowness comes from the computation running too fast for Python to pre-empt the computation thread and schedule the pre-fetching threads. Computation in the main thread takes 2 ms, and apparently that's too little for the pre-fetching thread to grab the GIL. The pre-fetching thread has a larger delay and hence can always be pre-empted by the computation thread. So the computation thread runs through all of the examples, and then spends most of its time blocked on the GIL while some prefetching thread gets scheduled and enqueues a single example. The solution is to increase the number of Python threads, increase the queue size to fit the entire dataset, start the queue runners, and then pause the main thread for a couple of seconds to give the queue runners time to pre-populate the queue.
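A rough sketch of that workaround, assuming the usual TF 1.x queue-runner setup; single_image and single_label below are stand-ins for the per-example tensors produced by the real input pipeline.

import time
import tensorflow as tf

# Stand-ins for the per-example tensors produced by the real input pipeline.
single_image = tf.random_normal([784])
single_label = tf.constant(0, dtype=tf.int32)

# Several reader threads and a capacity large enough to hold the whole dataset.
images_batch, labels_batch = tf.train.batch(
    [single_image, single_label],
    batch_size=100,
    num_threads=5,
    capacity=55000)

sess = tf.Session()
coord = tf.train.Coordinator()
threads = tf.train.start_queue_runners(sess=sess, coord=coord)
time.sleep(10)   # let the queue runners pre-populate the queue before training starts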
Old stuff
That's surprisingly slow.
This looks like some kind of special case making the last 3 examples unnecessarily slow (most effort went into optimizing large models like ImageNet, so MNIST didn't get as much attention).
You can diagnose the problems by getting timelines, as described here.
Here are 3 of those examples with timeline collection enabled.
Here's the timeline for the feed_dict implementation:
The important thing to notice is that matmul takes a good chunk of the time, so the reading overhead is not significant.
Now here's the timeline for the reader implementation:
You can see that the operation is bottlenecked on QueueDequeueMany, which takes a whopping 45 ms.
If you zoom in, you'll see a bunch of tiny MEMCPY and Cast operations, which is a sign of some op being CPU-only (parse_single_example), and of the dequeue having to schedule multiple independent CPU->GPU transfers.
For the var example below with the GPU disabled, I don't see the tiny little ops, but QueueDequeueMany still takes over 10 ms. The timing seems to scale linearly with batch size, so there's some fundamental slowness there. Filed #4740.
Yaroslav nails the problem well. With small models you'll need to speed up the data import. One way to do this is with the TensorFlow function tf.TFRecordReader.read_up_to, which reads multiple records in each session.run() call, thereby removing the excess overhead caused by multiple calls:
import tensorflow as tf

# Read up to enqueue_many_size serialized records per session.run() call, then let
# shuffle_batch split them back into mini-batches (note enqueue_many=True).
# filename_queue is assumed to come from tf.train.string_input_producer(filenames).
enqueue_many_size = SOME_ENQUEUE_MANY_SIZE
reader = tf.TFRecordReader(
    options=tf.python_io.TFRecordOptions(tf.python_io.TFRecordCompressionType.ZLIB))
_, queue_batch = reader.read_up_to(filename_queue, enqueue_many_size)
batch_serialized_example = tf.train.shuffle_batch(
    [queue_batch],
    batch_size=batch_size,
    num_threads=thread_number,
    capacity=capacity,
    min_after_dequeue=min_after_dequeue,
    enqueue_many=True)
This was also addressed in this SO question.
The main question is why the preloaded data (constant) example, examples/how_tos/reading_data/fully_connected_preloaded.py, is significantly slower than the other data loading examples when using a GPU.
I had the same problem: fully_connected_preloaded.py was unexpectedly slow on my Titan X. The problem was that the whole dataset was pre-loaded onto the CPU, not the GPU.
First, let me share my initial attempts. I applied the following performance tips by Yaroslav.
set capacity=55000 for tf.train.slice_input_producer (55000 is the size of the MNIST training set in my case).
set num_threads=5 for tf.train.batch.
set capacity=500 for tf.train.batch.
put time.sleep(10) after tf.train.start_queue_runners.
However, the average speed per batch stayed the same. I tried timeline visualization for profiling, and still saw QueueDequeueManyV2 dominating.
The problem was line 65 of fully_connected_preloaded.py. The following code loads the entire dataset onto the CPU, which still leaves a bottleneck in CPU-to-GPU data transfer.
with tf.device('/cpu:0'):
    input_images = tf.constant(data_sets.train.images)
    input_labels = tf.constant(data_sets.train.labels)
Hence, I switched the device allocation to the GPU:
with tf.device('/gpu:0'):
Then I got a 100x speed-up per batch.
Note:
This was possible because the Titan X has enough memory to preload the entire dataset.
In the original code (fully_connected_preloaded.py), the comment on line 64 says "rest of pipeline is CPU-only". I am not sure what this comment intended.

TensorFlow PoolAllocator huge number of requests

Using TensorFlow r0.9/r0.10 I get the following message, which makes me worried that I've set up my neural network model in the wrong way.
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:244] PoolAllocator: After 6206792 get requests, put_count=6206802 evicted_count=5000 eviction_rate=0.000805568 and unsatisfied allocation rate=0.000806536
The network I use is similar to AlexNet/VGG-M. I create the variables and the ops in a function called once, and then I just loop over multiple epochs, calling the same optimizer, loss and prediction function for each mini-batch iteration.
Another thing that worries me is that the network can be unstable when using a large batch size: it runs fine for a few epochs, and then it goes out of memory (trying to allocate...).
Is there any way to check if there is something wrong and what it is?
This is an info-level log statement (the "I" prefix). It does not necessarily mean that anything is wrong: however, the pool allocator (a cache for allocations) is finding that it frequently has to fall back on the underlying allocator. This may indicate memory pressure.
For your instability problem: as you observe, large batches can lead to out-of-memory errors. There is some nondeterminism to operator scheduling, which is why you may not see it fail every time. Try lowering your batch size until you consistently no longer see out of memory errors.

TensorFlow: critical graph operations assigned to cpu rather than gpu

I have implemented a TensorFlow DNN model (2 hidden layers with elu activation functions trained on MNIST) as a Python class in order to wrap TF calls within another library with its own optimization routines and tools.
When running some tests on a Tesla K20, I noticed that the GPU was being used at 4% of its total capacity. Therefore I looked a bit more closely at the log device placement and found that all critical operations like MatMul, Sum, Add, Mean etc. were being assigned to the CPU.
The first thing that came to mind was that it was because I was using dtype=float64, so I switched to dtype=float32. While a lot more operations were assigned to the GPU, a good number were still assigned to the CPU, like Mean, gradient/Mean_grad/Prod, gradient/Mean.
So here comes my first question (I'm linking a working code example at the end),
1) why would that be? I have written different TF models that consist of simple tensor multiplications and reductions and they run fully on GPU as long as I use single precision.
So here comes the second question,
2) why does TF assign the graph to different devices depending on the data type? I understand that not all kernels are implemented for GPU but I would have thought that things like MatMul could run on GPU for both single and double precision.
3) Could the fact that the model is wrapped within a Python class have an effect? I do not think this is the case because as I said, it did not happen for other models wrapped similarly but that were simpler.
4) What sort of steps can I take to run the model fully on a GPU?
Here is a full working example of my code that I have isolated from my library
https://gist.github.com/smcantab/8ecb679150a327738102 .
If you run it and look at the output you'll see how different parts of the graph have been assigned to different devices. To see how this changes with types and devices, change dtype and device within main() at the end of the example. Note that if I set allow_soft_placement=False, the graph fails to initialize.
Any word of advice would be really appreciated.
As Yaroslav noted, Mean in particular was not yet implemented for GPU, but it is now available, so these operations should run on the GPU with the latest TensorFlow (as per the DEVICE_GPU registration at that link).
Prior to the availability of mean, the status of this was:
(a) You can implement mean by hand, because reduce_sum is available on the GPU (see the sketch after this list).
(b) I've re-pinged someone to see if there's an easy way to add the GPU support, but we'll see.
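A minimal sketch of that workaround; the tensor x is just a placeholder example.

import tensorflow as tf

# Hand-rolled mean: reduce_sum runs on the GPU, so dividing by the element
# count reproduces reduce_mean without falling back to a CPU kernel.
x = tf.random_normal([128, 256])
mean = tf.reduce_sum(x) / tf.cast(tf.size(x), tf.float32)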
Re float64 on GPU, someone opened an issue three days ago with a patch for supporting float64 reductions on GPU. Currently being reviewed and tested.
No, it doesn't matter if it's wrapped in Python: it's really just about whether a kernel has been defined for the op to execute on the GPU or not. In many cases, the answer to "why is X supported on GPU but Y not?" comes down to whether or not there's been demand for Y to run on the GPU. The answer for float64 is simpler: float32 is a lot faster, so in most cases people work to make their models run in float32 whenever possible, because it gives all-around speed benefits.
Most consumer graphics cards like the GTX 980, 1080, etc. are stripped of the double-precision floating-point hardware units. Since these are much cheaper and therefore more ubiquitous than the newer Tesla units (which have FP64 double-precision hardware), doing double-precision calculations on consumer graphics cards is very slow compared to single precision. FP64 calculations seem to be about 32x slower than FP32 on a GPU without the FP64 hardware. I believe this is why FP32 calculations tend to be assigned to the GPU while FP64 goes to the CPU (which is faster in most systems). Hopefully, in the future, frameworks will test the GPU's capabilities at runtime to decide where to assign FP64 calculations.

Run out of VRAM using Theano on Amazon cluster

I'm trying to execute the logistic_sgd.py code on an Amazon cluster running the ami-b141a2f5 (Theano - CUDA 7) image.
Instead of the included MNIST database I am using the SD19 database, which requires changing a few dimensional constants, but otherwise no code has been touched. The code runs fine locally, on my CPU, but once I copy the code and data to the Amazon cluster over SSH and run it there, I get this output:
It looks to me like it is running out of VRAM, but it was my understanding that the code should run on a GPU already, without any tinkering needed on my part. After following the suggestion in the error message, the error persists.
There's nothing especially strange here. The error message is almost certainly accurate: there really isn't enough VRAM. Often, a script will run fine on CPU but then fail like this on GPU simply because there is usually much more system memory available than GPU memory, especially since the system memory is virtualized (and can page out to disk if required) while the GPU memory isn't.
For this script, there needs to be enough memory to store the training, validation, and testing data sets, the model parameters, and enough working space to store intermediate results of the computation. There are two options available:
Reduce the amount of memory needed for one or more of these three components. Reducing the amount of training data is usually easiest; reducing the size of the model comes next. Unfortunately, both of those options will often impair the quality of the result that is being sought. Reducing the amount of memory needed for intermediate results is usually beyond the developer's control; it is managed by Theano, though there is sometimes scope for altering the computation to achieve this goal once a good understanding of Theano's internals is acquired.
If the model parameters and working memory can fit in GPU memory, then the most common solution is to change the code so that the data is no longer stored in GPU memory (i.e. just store it as numpy arrays, not as Theano shared variables) and then pass each batch of data in as inputs instead of givens. The LSTM sample code is an example of this approach.
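A minimal sketch of that pattern, with a toy logistic-regression model standing in for the real one; w, cost, the learning rate, and the shapes are all illustrative.

import numpy as np
import theano
import theano.tensor as T

# The dataset stays in host memory as numpy arrays; only one mini-batch at a
# time is sent to the GPU through the function's inputs.
x = T.matrix('x')
y = T.ivector('y')
w = theano.shared(np.zeros((784, 10), dtype=theano.config.floatX), name='w')
cost = T.mean(T.nnet.categorical_crossentropy(T.nnet.softmax(T.dot(x, w)), y))
updates = [(w, w - 0.1 * T.grad(cost, w))]

# Mini-batches are passed as inputs rather than wired in via givens.
train_fn = theano.function(inputs=[x, y], outputs=cost, updates=updates)

train_x = np.random.rand(60000, 784).astype(theano.config.floatX)
train_y = np.random.randint(0, 10, size=60000).astype('int32')
for start in range(0, len(train_x), 128):
    batch_cost = train_fn(train_x[start:start + 128], train_y[start:start + 128])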
