Difference between MirroredStrategy and CentralStorageStrategy - python

I have read the documentation of both CentralStorageStrategy and MirroredStrategy, but I cannot understand the essential difference between them.
In MirroredStrategy:
Each variable in the model is mirrored across all the replicas.
In CentralStorageStrategy:
Variables are not mirrored, instead they are placed on the CPU and operations are replicated across all local GPUs.
Source: https://www.tensorflow.org/guide/distributed_training
What does it mean in practice? What are use cases for the CentralStorageStrategy and how does the training work if variables are placed on the CPU in this strategy?

Consider one particular variable (call it "my_var") in your usual, single-GPU, non-distributed use case (e.g. a weight matrix of a convolutional layer).
If you use 4 GPUs, MirroredStrategy will create 4 copies of "my_var", one on each GPU. Every copy always holds the same value, because they all receive the same updates: the variable updates happen in sync on all the GPUs.
With CentralStorageStrategy, only one copy of "my_var" is created, in host (CPU) memory, so the updates happen in a single place while the computation itself is still replicated across the local GPUs.
Which one is better probably depends on the computer's topology and how fast CPU-GPU communication is compared with GPU-GPU. If the GPUs can communicate fast with each other, MirroredStrategy may be more efficient. But I'd benchmark it to be sure.
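For concreteness, a minimal sketch (assuming TensorFlow 2.x; the layer and optimizer are just placeholders) of how either strategy is selected - the only difference from the user's point of view is which strategy's scope the model is built under:

    import tensorflow as tf

    # Variables created inside the scope are handled per the chosen strategy:
    # MirroredStrategy keeps one copy per GPU, CentralStorageStrategy keeps one copy on the CPU.
    strategy = tf.distribute.MirroredStrategy()
    # strategy = tf.distribute.experimental.CentralStorageStrategy()

    with strategy.scope():
        model = tf.keras.Sequential([tf.keras.layers.Dense(10)])
        model.compile(optimizer="sgd", loss="mse")
    # model.fit(...) then runs the replicated computation according to the strategy.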

Related

Updating Pytorch variables

First, I should mention I’m on torch version 0.3.1, but am happy to upgrade if necessary.
I am still learning, so please let me know if I misunderstood anything here.
I am building several graphs imperatively, using Variables. The mathematical expressions might be quite complicated as time goes on, and certain parts of the graphs may be generated using autograd’s gradient functions, which may then be operated on.
I can create whatever expression I want just fine. The only catch is building up this graph can be quite “slow” - for relatively simple operations (4x4 matrices being multiplied together), it can take a few ms.
Once these graphs are generated, though, I expect that I should be able to update certain Variables' values and evaluate the output of the graph (or any node, really) much faster. I believe that is probably the case, since evaluation should be able to happen completely in the C back-end, which should be pretty optimized for speed. In other words, building the graph in Python might be slow, since it involves Python for loops and the like, but re-evaluating a graph with static topology should be fast.
Is this intuition correct? If so, how can I efficiently re-assign a value and re-evaluate the graph?
If you are talking about updating the values of input variables, and you just want to evaluate the forward pass of the graph and are not going to call .backward(), you can use volatile=True on the input Variable(s) in Pytorch 0.3.
This flag was removed in 0.4 in favor of the context manager with torch.no_grad():.
This is usually done at inference time. If you do this, the autograd graph won't be created, thus eliminating the overhead.
Update
Pytorch 0.3
From Pytorch 0.3 docs:
Volatile is recommended for purely inference mode, when you’re sure you won’t be even calling .backward(). It’s more efficient than any other autograd setting - it will use the absolute minimal amount of memory to evaluate the model. volatile also determines that requires_grad is False.
and
Volatile differs from requires_grad in how the flag propagates. If there’s even a single volatile input to an operation, its output is also going to be volatile. Volatility spreads across the graph much easier than non-requiring gradient - you only need a single volatile leaf to have a volatile output, while you need all leaves to not require gradient to have an output that doesn’t require gradient. Using volatile flag you don’t need to change any settings of your model parameters to use it for inference. It’s enough to create a volatile input, and this will ensure that no intermediate states are saved.
So if you use Pytorch 0.3 and want to do efficient inference, set volatile=True on your inputs.
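For example, a minimal sketch for Pytorch 0.3 (the Linear layer and input shape are just placeholders for your own graph):

    import torch
    from torch.autograd import Variable

    model = torch.nn.Linear(4, 4)                    # stand-in for your own graph
    x = Variable(torch.randn(1, 4), volatile=True)   # a single volatile leaf...
    y = model(x)                                     # ...makes the output volatile, so no
                                                     # intermediate states are saved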
Pytorch 0.4
The torch.no_grad() context manager was introduced in Pytorch 0.4. From the Pytorch 0.3-to-0.4 migration guide:
The volatile flag is now deprecated and has no effect. Previously, any computation that involves a Variable with volatile=True wouldn’t be tracked by autograd. This has now been replaced by a set of more flexible context managers including torch.no_grad(), torch.set_grad_enabled(grad_mode), and others.
So when you say "Then, if there is a way to turn all variables of that graph to non-volatile, I can just perform inference.", setting the inputs as volatile (0.3) or wrapping the call in torch.no_grad() (0.4) does exactly that.
As for the example of using torch.no_grad(), look at the test function in this script.
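A minimal sketch of the same thing for Pytorch 0.4+ (again, the model and input are placeholders):

    import torch

    model = torch.nn.Linear(4, 4)    # stand-in for your own graph
    x = torch.randn(1, 4)

    with torch.no_grad():            # nothing inside this block is tracked by autograd
        y = model(x)                 # no graph is built, so no buffers are kept around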
Static graphs during training
I'm not aware of any way to avoid creating the graph at each iteration in training mode in Pytorch; dynamic graphs come at a cost. If you really need a static graph, convert your model to Tensorflow/Caffe2/any library supporting static graphs, although in my experience they won't necessarily be faster.

Tensorflow Multi-GPU reusing vs. duplicating?

To train a model on multiple GPUs, one can create one set of variables on the first GPU and reuse them (via tf.variable_scope(tf.get_variable_scope(), reuse=device_num != 0)) on the other GPUs, as in cifar10_multi_gpu_train.
But I came across the official CNN benchmarks where, in the local replicated setting, a new variable scope is used for each GPU (via tf.variable_scope('v%s' % device_num)). Since all variables are initialized randomly, a post-init op is used to copy the values from GPU:0 to the others.
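Roughly, the two patterns look like this (a TF1-style sketch; num_gpus and build_tower are placeholders for the real model-building code):

    import tensorflow as tf

    num_gpus = 4

    def build_tower():
        # placeholder for the per-GPU model-building code
        return tf.get_variable('w', shape=[3, 3], initializer=tf.zeros_initializer())

    # (a) reuse one set of variables across all towers (cifar10_multi_gpu_train style)
    for d in range(num_gpus):
        with tf.device('/gpu:%d' % d):
            with tf.variable_scope(tf.get_variable_scope(), reuse=d != 0):
                build_tower()

    # (b) a separate variable scope per tower (benchmarks style),
    #     synced once by a post-init op that copies the values from GPU:0
    for d in range(num_gpus):
        with tf.device('/gpu:%d' % d):
            with tf.variable_scope('v%s' % d):
                build_tower()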
Both implementations then average the gradients on the CPU and propagate the result back (at least that is what I think, since the benchmark code is cryptic :)) - probably resulting in the same outcome.
What, then, is the difference between these two approaches, and more importantly, which one is faster?
Thank you.
The difference is that if you're reusing variables, every iteration starts with a broadcast of the variables from their original location to all GPUs, whereas if each GPU keeps its own copy this broadcast is unnecessary, so not sharing should be faster.
The one downside of not sharing is that it's easier for a bug or numerical instability somewhere to lead to different GPUs ending up with different values for the same variable.

How to make word2vec model's loading time and memory use more efficient?

I want to use Word2vec in a web server (production) in two different variants, where I fetch two sentences from the web and compare them in real time. For now, I am testing it on a local machine which has 16GB of RAM.
Scenario:
w2v = load w2v model
if condition 1 is true:
    if normalized:
        reverse normalization via w2v.init_sims(replace=False)   (not sure if it will work)
    loop through some items:
        calculate their vectors using w2v
else if condition 2 is true:
    if not normalized:
        w2v.init_sims(replace=True)
    loop through some items:
        calculate their vectors using w2v
I have already read about the solution of reducing the vocabulary to a smaller size, but I would like to use the full vocabulary.
Are there any new workarounds for handling this? Is there a way to load a small portion of the vocabulary for the first 1-2 minutes and, in parallel, keep loading the whole vocabulary?
Since the first-time load() is a one-time delay that you should be able to schedule before any service requests arrive, I would recommend against worrying too much about it. (It inherently takes a lot of time to load a lot of data from disk into RAM – but once there, if it's kept around and shared well between processes, the cost is not paid again for an arbitrarily long service uptime.)
It doesn't make much sense to "load a small portion of the vocabulary for the first 1-2 minutes and in parallel keep loading the whole vocabulary" – as soon as any similarity calculation is needed, the whole set of vectors needs to be accessed for any top-N results. (So the "half-loaded" state isn't very useful.)
Note that if you do init_sims(replace=True), the model's original raw vector magnitudes are clobbered with the new unit-normed (all-same-magnitude) vectors. So looking at your pseudocode, the only difference between the two paths is the explicit init_sims(replace=True). But if you're truly keeping the same shared model in memory between requests, as soon as condition 2 occurs, the model is normalized, and thereafter calls under condition 1 are also occurring with normalized vectors. And further, additional calls under condition 2 just redundantly (and expensively) re-normalize the vectors in-place. So if normalized-comparisons are your only focus, best to do one in-place init_sims(replace=True) at service startup - not at the mercy of order-of-requests.
If you've saved the model using gensim's native save() (rather than save_word2vec_format()), and as uncompressed files, there's the option to 'memory-map' the files on a future re-load. This means that rather than immediately copying the full vector array into RAM, the file on disk is simply mapped into the process's address space. There are two potential benefits: (1) if you only ever access some limited ranges of the array, only those are loaded, on demand; (2) many separate processes all using the same mapped files will automatically reuse any shared ranges loaded into RAM, rather than potentially duplicating the same data.
(1) isn't much of an advantage once you need to do a full sweep over the whole vocabulary – because then they're all brought into RAM anyway, and moreover at the moment of access (which adds more service lag than if you'd just pre-loaded them). But (2) is still an advantage in multi-process webserver scenarios. There's a lot more detail on how you might use memory-mapped word2vec models efficiently in a prior answer of mine, at How to speed up Gensim Word2vec model load time?
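As a minimal sketch (assuming a gensim 3.x-era API, a placeholder filename, and that the vectors were unit-normed once before saving), the memory-mapped reload could look like:

    from gensim.models import KeyedVectors

    # one-time offline preparation: normalize in place, then save uncompressed
    # with gensim's native save():
    #     kv.init_sims(replace=True)
    #     kv.save('w2v_normed.kv')

    kv = KeyedVectors.load('w2v_normed.kv', mmap='r')   # arrays are memory-mapped, not copied
    kv.vectors_norm = kv.vectors    # vectors are already unit-normed, so skip re-normalizing
    print(kv.most_similar('word', topn=10))             # 'word' is just an example token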

TensorFlow: critical graph operations assigned to cpu rather than gpu

I have implemented a TensorFlow DNN model (2 hidden layers with elu activation functions trained on MNIST) as a Python class in order to wrap TF calls within another library with its own optimization routines and tools.
When running some tests on a Tesla K20 I noticed that the GPU was being used at 4% of its total capacity. I therefore looked a bit more closely at the log-device-placement output and found that all critical operations like MatMul, Sum, Add, Mean etc. were being assigned to the CPU.
The first thing that came to mind was that it was because I was using dtype=float64, so I switched to dtype=float32. While a lot more operations were then assigned to the GPU, a good number were still assigned to the CPU, like Mean, gradient/Mean_grad/Prod, and gradient/Mean.
So here comes my first question (I'm linking a working code example at the end),
1) why would that be? I have written different TF models that consist of simple tensor multiplications and reductions and they run fully on GPU as long as I use single precision.
So here comes the second question,
2) why does TF assign the graph to different devices depending on the data type? I understand that not all kernels are implemented for GPU but I would have thought that things like MatMul could run on GPU for both single and double precision.
3) Could the fact that the model is wrapped within a Python class have an effect? I do not think this is the case because as I said, it did not happen for other models wrapped similarly but that were simpler.
4) What sort of steps can I take to run the model fully on a GPU?
Here is a full working example of my code that I have isolated from my library
https://gist.github.com/smcantab/8ecb679150a327738102 .
If you run it and look at the output you'll see how different parts of the graph have been assigned to different devices. To see how this changes with types and devices, change dtype and device within main() at the end of the example. Note that if I set allow_soft_placement=False the graph fails to initialize.
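For reference, this is roughly the placement setup being discussed (a minimal sketch, not the full model; the op and device string are just illustrative):

    import tensorflow as tf

    config = tf.ConfigProto(log_device_placement=True,   # log which device each op lands on
                            allow_soft_placement=True)   # fall back to CPU if no GPU kernel exists

    with tf.device('/gpu:0'):
        x = tf.constant([[1.0, 2.0], [3.0, 4.0]], dtype=tf.float32)
        m = tf.reduce_mean(x)

    with tf.Session(config=config) as sess:
        print(sess.run(m))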
Any word of advice would be really appreciated.
As Yaroslav noted, Mean in particular was not yet implemented for GPU at the time, but it is now available, so these operations should run on the GPU with a recent TensorFlow (as per the DEVICE_GPU registration at that link).
Prior to the availability of Mean on GPU, the status was:
(a) You can implement mean by hand, because reduce_sum is available on the GPU (see the sketch after (b)).
(b) I've re-pinged someone to see if there's an easy way to add the GPU support, but we'll see.
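A minimal TF1-style sketch of option (a) above (the shapes are illustrative): divide a GPU-supported reduce_sum by the row count instead of calling the then-CPU-only Mean:

    import tensorflow as tf

    x = tf.placeholder(tf.float32, [None, 10])
    # reduce_sum had a GPU kernel, so a mean over axis 0 can be built from it:
    mean = tf.reduce_sum(x, 0) / tf.cast(tf.shape(x)[0], tf.float32)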
Re float64 on GPU, someone opened an issue three days ago with a patch for supporting float64 reductions on GPU. Currently being reviewed and tested.
No, it doesn't matter if it's wrapped in Python - it's really just about whether a kernel has been defined for the op to execute on the GPU or not. In many cases, the answer to "why is X supported on GPU but Y not?" comes down to whether or not there's been demand for Y to run on the GPU. The answer for float64 is simpler: float32 is a lot faster, so in most cases people work to make their models run in float32 when possible, because it gives all-around speed benefits.
Most consumer graphics cards, like the GTX 980, 1080, etc., have very few double-precision floating-point hardware units. Since these cards are much cheaper and therefore more ubiquitous than the newer Tesla units (which do have FP64 double-precision hardware), doing double-precision calculations on them is very slow compared to single precision; FP64 calculations on a GPU without FP64 hardware seem to be about 32x slower than FP32. I believe this is why FP32 calculations tend to be assigned to the GPU while FP64 goes to the CPU (which is faster for double precision on most such systems). Hopefully, in the future, frameworks will test the GPU's capabilities at runtime to decide where to assign FP64 calculations.

Run out of VRAM using Theano on Amazon cluster

I'm trying to execute the logistic_sgd.py code on an Amazon cluster running the ami-b141a2f5 (Theano - CUDA 7) image.
Instead of the included MNIST database I am using the SD19 database, which requires changing a few dimensional constants, but otherwise no code has been touched. The code runs fine locally on my CPU, but once I copy the code and data to the Amazon cluster over SSH and run it there, I get this output:
It looks to me like it is running out of VRAM, but it was my understanding that the code should run on a GPU already, without any tinkering on my part necessary. After following the suggestion from the error message, the error persists.
There's nothing especially strange here. The error message is almost certainly accurate: there really isn't enough VRAM. Often, a script will run fine on CPU but then fail like this on GPU simply because there is usually much more system memory available than GPU memory, especially since the system memory is virtualized (and can page out to disk if required) while the GPU memory isn't.
For this script, there needs to be enough memory to store the training, validation, and testing data sets, the model parameters, and enough working space to store intermediate results of the computation. There are two options available:
Reduce the amount of memory needed for one or more of these three components. Reducing the amount of training data is usually easiest; reducing the size of the model comes next. Unfortunately, both of those options will often impair the quality of the result that is being sought. Reducing the amount of memory needed for intermediate results is usually beyond the developer's control -- it is managed by Theano, though there is sometimes scope for altering the computation to achieve this goal once a good understanding of Theano's internals has been gained.
If the model parameters and working memory can fit in GPU memory, then the most common solution is to change the code so that the data is no longer stored in GPU memory (i.e. just store it as numpy arrays, not as Theano shared variables) and then pass each batch of data in as inputs instead of givens. The LSTM sample code is an example of this approach.
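A minimal self-contained sketch of that second approach, with a toy softmax model standing in for the real one (all names and sizes here are placeholders):

    import numpy as np
    import theano
    import theano.tensor as T

    # The data set stays in host RAM as plain numpy arrays, not Theano shared variables:
    rng = np.random.RandomState(0)
    train_x = rng.randn(10000, 784).astype('float32')
    train_y = rng.randint(0, 10, 10000).astype('int32')

    # Only the parameters live in GPU memory:
    W = theano.shared(np.zeros((784, 10), dtype='float32'), name='W')
    x = T.fmatrix('x')
    y = T.ivector('y')
    p_y = T.nnet.softmax(T.dot(x, W))
    cost = -T.mean(T.log(p_y)[T.arange(y.shape[0]), y])

    # Each minibatch is passed as an input at call time, instead of slicing a
    # GPU-resident shared dataset via givens:
    train_fn = theano.function([x, y], cost,
                               updates=[(W, W - np.float32(0.1) * T.grad(cost, W))])

    for start in range(0, train_x.shape[0], 256):
        train_fn(train_x[start:start + 256], train_y[start:start + 256])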
