I have implemented a TensorFlow DNN model (two hidden layers with ELU activations, trained on MNIST) as a Python class, in order to wrap the TF calls within another library that has its own optimization routines and tools.
When running some tests on a Tesla K20, I noticed that the GPU was being used at 4% of its total capacity. I therefore looked more closely at the log device placement and found that all the critical operations, such as MatMul, Sum, Add, and Mean, were being assigned to the CPU.
The first thing that came to mind was that this was because I was using dtype=float64, so I switched to dtype=float32. While many more operations were then assigned to the GPU, a good number were still assigned to the CPU, such as Mean, gradient/Mean_grad/Prod, and gradient/Mean.
So here comes my first question (I'm linking a working code example at the end):
1) Why would that be? I have written different TF models that consist of simple tensor multiplications and reductions, and they run fully on the GPU as long as I use single precision.
And here comes the second question:
2) Why does TF assign parts of the graph to different devices depending on the data type? I understand that not all kernels are implemented for GPU, but I would have thought that things like MatMul could run on the GPU for both single and double precision.
3) Could the fact that the model is wrapped in a Python class have an effect? I do not think so because, as I said, this did not happen for other models that were wrapped similarly but were simpler.
4) What sort of steps can I take to run the model fully on a GPU?
Here is a full working example of my code that I have isolated from my library
https://gist.github.com/smcantab/8ecb679150a327738102 .
If you run it and look at the output, you'll see how different parts of the graph have been assigned to different devices. To see how this changes with types and devices, change dtype and device within main() at the end of the example. Note that if I set allow_soft_placement=False the graph fails to initialize.
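For reference, the device placement logging and soft placement I mention are configured roughly like this in my code (a minimal sketch in the graph-mode API, not the actual gist; the tiny graph here is just illustrative):

    import tensorflow as tf

    dtype = tf.float32   # switching from tf.float64 moved more ops onto the GPU
    device = "/gpu:0"    # the device the graph is pinned to in main()

    with tf.device(device):
        x = tf.placeholder(dtype, shape=[None, 784])
        w = tf.Variable(tf.truncated_normal([784, 10], dtype=dtype))
        b = tf.Variable(tf.zeros([10], dtype=dtype))
        logits = tf.matmul(x, w) + b
        loss = tf.reduce_mean(tf.square(logits))  # Mean is one of the ops that lands on the CPU

    config = tf.ConfigProto(
        log_device_placement=True,   # prints the device each op was assigned to
        allow_soft_placement=True,   # with False, graph initialization fails for me
    )
    with tf.Session(config=config) as sess:
        sess.run(tf.initialize_all_variables())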
Any word of advice would be really appreciated.
As Yaroslav noted, Mean in particular was not yet implemented for GPU, but it is now available, so these operations should run on the GPU with the latest TensorFlow (as per the DEVICE_GPU registration at that link).
Prior to the availability of Mean on GPU, the status was:
(a) You can implement mean by hand, because reduce_sum is available on GPU (see the sketch after this list).
(b) I've re-pinged someone to see if there's an easy way to add the GPU support, but we'll see.
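For (a), a minimal sketch of what implementing mean by hand looks like (assuming you want the mean over all elements of a tensor x):

    import tensorflow as tf

    x = tf.placeholder(tf.float32, shape=[None, 10])

    # reduce_sum has a GPU kernel, so sum and divide by the element count
    # instead of calling tf.reduce_mean directly.
    n = tf.cast(tf.size(x), tf.float32)
    mean_by_hand = tf.reduce_sum(x) / n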
Re float64 on GPU, someone opened an issue three days ago with a patch for supporting float64 reductions on GPU. Currently being reviewed and tested.
No, it doesn't matter that it's wrapped in a Python class - it really just comes down to whether a kernel has been defined for the op to execute on the GPU. In many cases, the answer to "why is X supported on GPU but Y not?" comes down to whether there has been demand for Y to run on the GPU. The answer for float64 is simpler: float32 is a lot faster, so in most cases people work to make their models run in float32 when possible, because it gives all-around speed benefits.
Most consumer graphics cards, like the GTX 980, 1080, etc., are stripped of their double-precision floating-point hardware units. Since these cards are much cheaper, and therefore more ubiquitous, than the newer Tesla units (which do have FP64 double-precision hardware), doing double-precision calculations on them is very slow compared to single precision. FP64 calculations on a GPU without FP64 hardware seem to be about 32x slower than FP32. I believe this is why FP32 calculations tend to be assigned to the GPU while FP64 goes to the CPU (which is faster for FP64 on most such systems). Hopefully, in the future, the frameworks will test the GPU's capabilities at runtime to decide where to assign FP64 calculations.
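A quick, rough way to see the gap on your own card is to time the same matmul in both precisions (a sketch using the same graph-mode TF API as the question, not a rigorous benchmark; fetching the result also includes a device-to-host copy):

    import time
    import tensorflow as tf

    def time_matmul(dtype, n=4096, iters=10):
        with tf.Graph().as_default():
            a = tf.Variable(tf.random_normal([n, n], dtype=dtype))
            b = tf.Variable(tf.random_normal([n, n], dtype=dtype))
            c = tf.matmul(a, b)
            with tf.Session() as sess:
                sess.run(tf.initialize_all_variables())
                sess.run(c)  # warm-up
                start = time.time()
                for _ in range(iters):
                    sess.run(c)
                return (time.time() - start) / iters

    print("float32:", time_matmul(tf.float32))
    print("float64:", time_matmul(tf.float64))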
I am running a UNet with PyTorch on medical imaging data, with a bunch of transformations and augmentations in my preprocessing. However, after digging into the different preprocessing packages like TorchIO and MONAI, I noticed that most of the functions, even when they take tensors as input/output, run things on the CPU.
The functions either take numpy arrays as input outright or call .numpy() on the tensors.
The problem is that my data consists of 3D images of dimension 91x109x91 that I resize to 96x128x96, so they are pretty big. Hence, running transformations and augmentations on the CPU is pretty inefficient, I think.
First, it makes my program CPU-bound, because it takes more time to transform my images than to run them through the model (I timed it many times). Secondly, I checked the GPU usage and it oscillates between pretty much 0% and 100% at each batch, so it is clearly limited by the CPU. I would like to speed it up if possible.
My question is: why are these packages not using the GPU? They could at least have hybrid functions taking either a numpy array or a tensor as input, since a lot of numpy functions are available in Torch as well. Is there a good reason to stick to the CPU rather than speeding up the preprocessing by loading the images onto the GPU at the beginning of the preprocessing?
I translated a simple normalization function to run on the GPU and compared the running time between the GPU and CPU versions, and even on a laptop (NVIDIA M2000M) the function was 3 to 4 times faster on the GPU.
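The normalization I ported is roughly along these lines (a minimal sketch; the timing above comes from something equivalent to this, and the epsilon is just there to avoid division by zero):

    import torch

    def zscore_gpu(volume: torch.Tensor) -> torch.Tensor:
        # Z-score a 3D volume on whatever device it already lives on.
        mean = volume.mean()
        std = volume.std()
        return (volume - mean) / (std + 1e-8)

    # Move the volume to the GPU once, then keep all preprocessing there.
    vol = torch.rand(91, 109, 91).cuda()  # one host-to-device copy up front
    vol = zscore_gpu(vol)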
On an ML Discord, someone mentioned that GPU-based functions might not give deterministic results and that this might be why it's not a good idea, but I don't know if that's actually the case.
My preprocessing includes resizing, intensity clamping, z-scoring, and intensity rescaling, and then I have some augmentations like random histogram shift, elastic transform, affine transform, and bias field.
A transformation will typically only be faster on the GPU than on the CPU if the implementation can make use of the parallelism the GPU offers. Anything that operates element-wise or row/column-wise can usually be made faster on the GPU, which covers most image transformations.
The reason why some libraries don't implement things on GPU is that it requires additional work for each tensor manipulation library you want to support (PyTorch, TensorFlow, MXNet, ...), and you still have to maintain a CPU implementation anyway. Since you're using PyTorch, check out the torchvision package, which implements many transformations for both GPU and CPU tensors.
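For example, the tensor-based transforms in torchvision run directly on CUDA tensors (a sketch for 2D images; for 3D volumes like yours you would still need volume-aware ops such as torch.nn.functional.interpolate):

    import torch
    import torchvision.transforms as T

    transform = T.Compose([
        T.Resize((128, 128)),               # works on (C, H, W) tensors, CPU or GPU
        T.Normalize(mean=[0.5], std=[0.5]),
    ])

    img = torch.rand(1, 96, 96, device="cuda")  # image already on the GPU
    out = transform(img)                        # stays on the GPU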
For more complex transformations, like elastic deformation, I'm not sure if you can find a GPU version. If not, you might have to write one yourself, or drop this transformation, or pay the cost of copying back-and-forth between CPU and GPU during your data augmentation.
Another solution that some people prefer is to precompute a large set of transformations on the CPU as a separate step and to save the result to a file. The HDF5 file format is commonly used to save large datasets that can then be read very fast from disk. Since you will be saving a finite set of augmentations, be careful to generate several augmentations for each sample of your dataset to preserve a somewhat random behavior. This is not perfect, but it's a very pragmatic approach that will likely speed things up quite a bit if your CPU is holding your GPU back.
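A minimal sketch of that precomputation step with h5py (the dataset shapes and the augment function are placeholders for your own pipeline):

    import h5py
    import numpy as np

    def augment(volume):
        # placeholder for your actual augmentation pipeline
        return volume + np.random.normal(0, 0.01, volume.shape).astype(volume.dtype)

    volumes = np.random.rand(10, 96, 128, 96).astype(np.float32)  # stand-in dataset
    n_augmentations = 5  # several augmentations per sample, to keep some randomness

    with h5py.File("augmented.h5", "w") as f:
        dset = f.create_dataset("volumes",
                                shape=(len(volumes) * n_augmentations, 96, 128, 96),
                                dtype="float32")
        for i, vol in enumerate(volumes):
            for k in range(n_augmentations):
                dset[i * n_augmentations + k] = augment(vol)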
Regarding the determinism of the GPU, it's true that floating-point operations are not guaranteed to be deterministic by default when run on the GPU. This is because reordering some floating-point operations can make them faster, but the reordering cannot guarantee that the result will be exactly the same (it will be close, of course!). This can matter for reproducibility if you use a seed in your code and still get slightly different results. See the PyTorch documentation to understand other sources of non-determinism.
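If reproducibility is the concern, PyTorch lets you opt into deterministic behaviour explicitly, at some speed cost; a sketch of the usual knobs:

    import torch

    torch.manual_seed(0)
    torch.cuda.manual_seed_all(0)
    torch.backends.cudnn.deterministic = True  # force deterministic cuDNN kernels
    torch.backends.cudnn.benchmark = False     # disable autotuning, which can vary between runs
    # Errors out on ops with no deterministic implementation; some ops also
    # require the CUBLAS_WORKSPACE_CONFIG environment variable to be set.
    torch.use_deterministic_algorithms(True)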
I read the documentation of both CentralStorageStrategy and MirroredStrategy, but I cannot understand the essential difference between them.
In MirroredStrategy:
Each variable in the model is mirrored across all the replicas.
In CentralStorageStrategy:
Variables are not mirrored, instead they are placed on the CPU and operations are replicated across all local GPUs.
Source: https://www.tensorflow.org/guide/distributed_training
What does it mean in practice? What are the use cases for CentralStorageStrategy, and how does training work if variables are placed on the CPU in this strategy?
Consider one particular variable (call it "my_var") in your usual, single-GPU, non-distributed use case (e.g. a weight matrix of a convolutional layer).
If you use 4 GPUs, MirroredStrategy will create 4 copies of "my_var", one on each GPU. However, each copy will always have the same value, because they are all updated in the same way. So the variable updates happen in sync on all the GPUs.
In case of the CentralStorageStrategy, only one variable is created for "my_var", in the host (CPU) memory. The updates only happen in one place.
Which one is better probably depends on the computer's topology and how fast CPU-GPU communication is compared with GPU-GPU. If the GPUs can communicate fast with each other, MirroredStrategy may be more efficient. But I'd benchmark it to be sure.
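In code the two strategies are used the same way; only the variable placement differs. A minimal sketch:

    import tensorflow as tf

    # Variables copied onto every GPU; updates kept in sync via all-reduce.
    strategy = tf.distribute.MirroredStrategy()
    # Or: one copy of each variable on the CPU, compute replicated across GPUs.
    # strategy = tf.distribute.experimental.CentralStorageStrategy()

    with strategy.scope():
        model = tf.keras.Sequential([
            tf.keras.layers.Dense(10, activation="softmax", input_shape=(784,)),
        ])
        model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")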
I am starting to get into deep learning and I am trying out the example from Chapter 6 of neuralnetworksanddeeplearning.com. Theano tells me that it is using my GPU (a GTX 780). However, GPU usage hovers at only around 40~50% and the clock speed is only ~800 MHz (the normal boost clock in games is ~1100 MHz).
Is this normal? Or is something wrong here?
It is normal. Actually, 40~50% should be considered high usage. Some operations, like vector concatenation, are performed on the CPU. The GPU has to wait for these operations to complete before it can use the results as input. On top of that, there is overhead from loading data from memory.
So people commonly run 2~3 programs on the same GPU to take full advantage of it.
I'm trying to execute the logistic_sgd.py code on an Amazon cluster running the ami-b141a2f5 (Theano - CUDA 7) image.
Instead of the included MNIST database I am using the SD19 database, which requires changing a few dimensional constants, but otherwise no code has been touched. The code runs fine locally on my CPU, but once I SSH the code and data to the Amazon cluster and run it there, I get this output:
It looks to me like it is running out of VRAM, but it was my understanding that the code should already run on a GPU without any tinkering needed on my part. Even after following the suggestion from the error message, the error persists.
There's nothing especially strange here. The error message is almost certainly accurate: there really isn't enough VRAM. A script will often run fine on the CPU but then fail like this on the GPU, simply because there is usually much more system memory available than GPU memory, especially since system memory is virtualized (and can page out to disk if required) while GPU memory isn't.
For this script, there needs to be enough memory to store the training, validation, and testing data sets, the model parameters, and enough working space to store intermediate results of the computation. There are two options available:
Reduce the amount of memory needed for one or more of these three components. Reducing the amount of training data is usually the easiest; reducing the size of the model comes next. Unfortunately, both of those options will often impair the quality of the result being sought. Reducing the amount of memory needed for intermediate results is usually beyond the developer's control -- it is managed by Theano -- but there is sometimes scope for altering the computation to achieve this goal once a good understanding of Theano's internals has been gained.
If the model parameters and working memory can fit in GPU memory then the most common solution is to change the code so that the data is no longer stored in GPU memory (i.e. just store it as numpy arrays, not as Theano shared variables) then pass each batch of data in as inputs instead of givens. The LSTM sample code is an example of this approach.
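A minimal sketch of that change, loosely following the logistic_sgd model (only the data handling matters here): the training set stays in host memory as plain NumPy arrays and each minibatch is passed through the function's inputs rather than through givens on GPU-resident shared variables.

    import numpy as np
    import theano
    import theano.tensor as T

    x = T.matrix("x")
    y = T.ivector("y")
    w = theano.shared(np.zeros((784, 10), dtype=theano.config.floatX), name="w")
    b = theano.shared(np.zeros(10, dtype=theano.config.floatX), name="b")

    p_y = T.nnet.softmax(T.dot(x, w) + b)
    cost = -T.mean(T.log(p_y)[T.arange(y.shape[0]), y])
    updates = [(p, p - 0.1 * T.grad(cost, p)) for p in (w, b)]

    # The data is NOT wrapped in shared variables, so it never sits in GPU memory.
    train_model = theano.function(inputs=[x, y], outputs=cost, updates=updates)

    train_x = np.random.rand(1000, 784).astype(theano.config.floatX)
    train_y = np.random.randint(0, 10, size=1000).astype("int32")
    for start in range(0, len(train_x), 128):
        batch_cost = train_model(train_x[start:start + 128],
                                 train_y[start:start + 128])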
Anyone following CUDA will probably have seen a few of my queries regarding a project I'm involved in, but for those who haven't, I'll summarize. (Sorry for the long question in advance.)
There are three kernels: one generates a data set based on some input variables (it deals with bit combinations, so it can grow exponentially), another solves the generated linear systems, and a third is a reduction kernel that extracts the final result. These three kernels are run over and over again as part of an optimisation algorithm for a particular system.
On my dev machine (GeForce 9800 GT, running under CUDA 4.0) this works perfectly, all the time, no matter what I throw at it (up to a computational limit imposed by the stated exponential growth). But on a test machine (4x Tesla S1070, only one of which is used, under CUDA 3.1), the exact same code (Python base, PyCUDA interface to the CUDA kernels) produces correct results for 'small' cases, while in mid-range cases the solving stage fails on random iterations.
Previous problems I've had with this code have been to do with the numerical instability of the problem and have been deterministic in nature (i.e. it failed at exactly the same stage every time), but this one is frankly pissing me off, as it fails whenever it wants to.
As such, I don't have a reliable way of breaking the CUDA code out of the Python framework for proper debugging, and PyCUDA's debugger support is questionable, to say the least.
I've checked the usual things, like checking free memory on the device before each kernel invocation, and the occupancy calculations say that the grid and block allocations are fine. I'm not doing any crazy 4.0-specific stuff; I'm freeing everything I allocate on the device at each iteration, and I've fixed all the data types as floats.
TL;DR: has anyone come across any gotchas regarding CUDA 3.1 that I haven't seen in the release notes, or any issues with PyCUDA's autoinit memory management environment that would cause intermittent launch failures on repeated invocations?
Have you tried:
    cuda-memcheck python yourapp.py
You likely have an out of bounds memory access.
You can use the NVIDIA CUDA profiler to see what gets executed before the failure.
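If cuda-memcheck is too slow on the full problem, you can also narrow down where things go wrong from the Python side by synchronizing after each launch and checking free device memory (a sketch; the kernel launches are placeholders for your own):

    import pycuda.autoinit
    import pycuda.driver as cuda

    def report_free_mem(tag):
        free, total = cuda.mem_get_info()
        print("%s: %d MB free of %d MB" % (tag, free // 2**20, total // 2**20))

    report_free_mem("before generate")
    # generate_kernel(..., block=(256, 1, 1), grid=(grid_x, 1))  # placeholder launch
    cuda.Context.synchronize()  # surfaces asynchronous launch failures here, not later

    report_free_mem("before solve")
    # solve_kernel(...)  # placeholder launch
    cuda.Context.synchronize()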