I experience extremely high (CPU) RAM usage with TensorFlow, even though almost every variable is allocated on the GPU device and all computation runs there. Even then, RAM usage exceeds VRAM usage by a factor of at least 2. I'm trying to understand why, so I can see whether it can be remedied or is inevitable.
Question
So my main question is: does TensorFlow allocate and maintain a copy of all GPU variables in (CPU) RAM? If yes, what is allocated when (in which phase, see below)? And why is it useful to allocate this in CPU memory?
More info
There are three phases in which I see RAM usage increase dramatically.
First, defining the graph (I append quite large loss functions to VGG-19 that iterate over many translated activation maps) adds 2 GB to RAM usage.
Second, defining the optimizer (I use Adam) adds 250 MB.
Third, initializing the global variables adds 750 MB.
And then it remains stable and runs very fast (all on GPU).
(The amounts mentioned here are for tiny input images of size 8x8x3 with a batch size of 1. With anything larger than 1x16x16x3, the process gets killed because it exceeds my 8 GB RAM + 6 GB swap limit.)
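To make the three measurement points concrete, here is a minimal sketch of how they could be instrumented (assuming psutil is available; the single conv layer is just a stand-in for my actual VGG-19 graph and losses):

import os
import psutil
import tensorflow as tf

def rss_mb():
    # resident set size of this process, in MB
    return psutil.Process(os.getpid()).memory_info().rss / 1024**2

print("baseline:", rss_mb())

x = tf.placeholder(tf.float32, [1, 8, 8, 3])
loss = tf.reduce_sum(tf.layers.conv2d(x, 64, 3))   # stand-in for VGG-19 + loss graph
print("after graph definition:", rss_mb())

train_op = tf.train.AdamOptimizer(1e-4).minimize(loss)
print("after optimizer definition:", rss_mb())

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print("after variable initialization:", rss_mb())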
Note that I recorded variable placement with tf.ConfigProto(log_device_placement=True), and GPU usage using tf.RunMetadata with visualization in TensorBoard.
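For completeness, this is roughly how that information is collected with the TF 1.x session API (train_op stands in for whatever training op is actually run):

import tensorflow as tf

config = tf.ConfigProto(log_device_placement=True)   # print CPU/GPU placement of every op
run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
run_metadata = tf.RunMetadata()                       # filled with per-op memory/timing info

with tf.Session(config=config) as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(train_op, options=run_options, run_metadata=run_metadata)   # train_op: your training op
    writer = tf.summary.FileWriter("logs", sess.graph)
    writer.add_run_metadata(run_metadata, "step_0")   # inspect per-device memory in TensorBoard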
Thank you for any help.
System information
Have I written custom code (as opposed to using a stock example script provided in TensorFlow): Yes
OS Platform and Distribution: Linux Ubuntu 17.10
TensorFlow installed from (source or binary): binary
TensorFlow version (use command below): 1.7
Python version: 3.6.3
GCC/Compiler version (if compiling from source): 6.4.0
CUDA/cuDNN version: 9.0
GPU model and memory: NVidia GeForce Titan Xp
Related
I installed Stable Diffusion v1.4 by following the instructions described in https://www.howtogeek.com/830179/how-to-run-stable-diffusion-on-your-pc-to-generate-ai-images/#autotoc_anchor_2
My machine heavily exceeds the min reqs to run Stable Diffusion:
Windows 11 Pro
11th Gen Intel i7 @ 2.30GHz
Latest NVIDIA GeForce GPU
16GB Memory
1TB SSD
Yet, I get an error when trying to run the test prompt
python scripts/txt2img.py --prompt "a close-up portrait of a cat by pablo picasso, vivid, abstract art, colorful, vibrant" --plms --n_iter 5 --n_samples 1
RuntimeError: CUDA out of memory. Tried to allocate 1024.00 MiB (GPU 0; 8.00 GiB total capacity; 6.13 GiB already allocated; 0 bytes free; 6.73 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
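As a side note, the max_split_size_mb setting that the error message itself mentions can be supplied through the PYTORCH_CUDA_ALLOC_CONF environment variable; a minimal sketch (it must be set before torch initializes CUDA, and I can't say whether it is sufficient on its own):

import os
# must be set before torch touches the GPU, e.g. at the very top of scripts/txt2img.py
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch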
Reading a post by Marco Ramos, it seems to relate to the number of workers in PyTorch:
Strange Cuda out of Memory behavior in Pytorch
How do I change the number of workers while running Stable Diffusion? And why is it throwing this error if my machine still has lots of memory? Has anyone encountered this same issue while running Stable Diffusion?
I had the same issue; it's because you're using a non-optimized version of Stable Diffusion. You have to download basujindal's branch, which allows it to use much less RAM by sacrificing precision. This is the branch: https://github.com/basujindal/stable-diffusion
Everything else in that guide stays the same; just clone from this version. It even allows you to push past the default 512x512 resolution; for example, you can use 756x512 to get rectangular images (but the results may vary, since the model was trained on a 512x512 square set).
The new prompt becomes: python optimizedSD/optimized_txt2img.py --prompt "blue orange" --H 756 --W 512
Another note: as of a few days ago, an even faster and more optimized version was released by neonsecret (https://github.com/basujindal/stable-diffusion); however, I'm having issues installing it, so I can't really recommend it, but you can try it as well and see if it works for you.
In addition to using basujindal's optimized version, the additional flags following the prompt allow the model to run properly on a machine with an NVIDIA or AMD GPU with 8+ GB.
So the new prompt would look like this
>> python optimizedSD/optimized_txt2img.py --prompt "a close-up portrait of a cat by pablo picasso, vivid, abstract art, colorful, vibrant" --H 512 --W 512 --seed 27 --n_iter 2 --n_samples 10 --ddim_steps 50
I am asking this question because I am successfully training a segmentation network on my GTX 2070 laptop with 8 GB of VRAM, while the exact same code with the exact same software libraries installed on my desktop PC with a GTX 1080 Ti still throws out of memory.
Why does this happen, considering that:
The same Windows 10 + CUDA 10.1 + cuDNN 7.6.5.32 + NVIDIA driver 418.96 (which ships with CUDA 10.1) is installed on both the laptop and the PC.
Training with TensorFlow 2.3 runs smoothly on the GPU on my PC; it fails to allocate memory for training only with PyTorch.
PyTorch recognizes the GPU (it prints GTX 1080 Ti) via the command: print(torch.cuda.get_device_name(0))
PyTorch allocates memory when running this command: torch.rand(20000, 20000).cuda() # allocates 1.5 GB of VRAM
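A few more standard torch.cuda calls can be used to compare the two machines directly; a sketch:

import torch

print(torch.cuda.get_device_name(0))
props = torch.cuda.get_device_properties(0)
print("total VRAM: %.1f GiB" % (props.total_memory / 1024**3))

x = torch.rand(20000, 20000).cuda()   # ~1.5 GB of float32
print("allocated: %.2f GiB" % (torch.cuda.memory_allocated(0) / 1024**3))
print("reserved:  %.2f GiB" % (torch.cuda.memory_reserved(0) / 1024**3))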
What is the solution to this?
Most people (even in the thread below) jump to suggest that decreasing the batch_size will solve this problem. In this case, it does not. For example, it would have been illogical for a network to train on 8 GB of VRAM yet fail to train on 11 GB of VRAM, considering that no other applications were consuming video memory on the 11 GB system and the exact same configuration was installed and used.
The reason this happened in my case is that, when using the DataLoader object, I had set a very high value (12) for the num_workers parameter. Decreasing this value to 4 solved the problem.
In fact, although it sits at the bottom of the thread, the answer provided by Yurasyk at https://github.com/pytorch/pytorch/issues/16417#issuecomment-599137646 pointed me in the right direction.
Solution: decrease the number of workers in the PyTorch DataLoader. Although I do not exactly understand why this works, I assume it is related to the worker processes spawned behind the scenes for data fetching; on some processors, such an error may appear.
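A minimal sketch of the change, assuming a typical training setup (train_dataset is a placeholder for whatever Dataset is actually used):

from torch.utils.data import DataLoader

train_loader = DataLoader(
    train_dataset,        # placeholder: your existing Dataset
    batch_size=8,
    shuffle=True,
    num_workers=4,        # was 12; each worker is a separate process with its own memory overhead
    pin_memory=True,
)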
I run Keras code (TensorFlow backend) on a GPU. I simply run it without setting anything; the code runs on the GPU automatically, and I can see the GPU usage. When I run my normal Python code and check my GPU usage, it turns out to be 0%.
My questions are:
(1) How do I make the CPU send the data to the GPU and always let the GPU compute it?
(2) I heard that the default data type of NumPy arrays is always "float". Does this have something to do with the GPU?
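To illustrate question (2), a small example of NumPy's default dtype versus the float32 that Keras/TensorFlow typically use on the GPU (my assumption about what "float" refers to):

import numpy as np

x = np.array([1.0, 2.0, 3.0])
print(x.dtype)               # float64: NumPy's default floating-point type
x32 = x.astype(np.float32)   # the precision typically used on the GPU
print(x32.dtype)             # float32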
In the context of deep neural network training, training runs faster when it uses the GPU as the processing unit.
This is done by configuring cuDNN optimizations and selecting the processing unit via environment variables, with the following line (Python 2.7 and Keras on Windows):
os.environ["THEANO_FLAGS"] = "floatX=float32,device=gpu,optimizer_including=cudnn,gpuarray.preallocate=0.8,dnn.conv.algo_bwd_filter=deterministic,dnn.conv.algo_bwd_data=deterministic,dnn.include_path=e:/toolkits.win/cuda-8.0.61/include,dnn.library_path=e:/toolkits.win/cuda-8.0.61/lib/x64"
The output is then:
Using gpu device 0: TITAN Xp (CNMeM is disabled, cuDNN 5110)
The problem is that GPU memory is limited compared to RAM (12 GB and 128 GB respectively), and training is only one phase of the whole flow. Therefore I want to switch back to the CPU once training is complete.
I've tried the following line, but it has no effect:
os.environ["THEANO_FLAGS"] = "floatX=float32,device=cpu"
My questions are:
Is it technically possible to switch from GPU to CPU and vice versa during runtime?
If yes, how can I do it programmatically in Python? (2.7, Windows, and Keras with Theano backend).
Yes, this is possible, at least for the TensorFlow backend. You just have to import tensorflow as well and wrap your code in the following with blocks:
import tensorflow as tf

with tf.device('/cpu:0'):
    your_code  # ops/layers created in this block are pinned to the CPU

with tf.device('/gpu:0'):
    your_code  # ops/layers created in this block are pinned to the first GPU
I am unsure whether this also works for the Theano backend. However, switching from one backend to the other is just a matter of setting a flag beforehand, so it should not cause too much trouble.
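For completeness, a small sketch of what "setting a flag beforehand" looks like for the backend switch, using the KERAS_BACKEND environment variable (read when keras is first imported; ~/.keras/keras.json works as well):

import os
os.environ["KERAS_BACKEND"] = "tensorflow"   # or "theano"; must be set before the first keras import

import keras   # prints "Using TensorFlow backend." (or the Theano equivalent)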
Before I installed the second graphics card, I had successfully trained a VGG-11 model using TensorFlow and a single GPU with 6 GB of memory. But I got an OOM error after I installed the second graphics card and ran the same code (allow_growth=True and no tf.device() used).
My understanding is that my second card (gpu:0) has 8 GB of memory, and that TF uses "device:gpu:0" by default for computation when I do not use tf.device() to specify a device. The memory should be enough, because 8 GB > 6 GB.
Then I used CUDA_VISIBLE_DEVICES=0 to expose only one card and ran the same code. TF worked successfully.
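For reference, a minimal sketch of the working setup (TF 1.x; CUDA_VISIBLE_DEVICES must be set before TF initializes CUDA):

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"   # expose only the first card to TF

import tensorflow as tf

config = tf.ConfigProto()
config.gpu_options.allow_growth = True     # grab VRAM on demand instead of all at once
sess = tf.Session(config=config)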
What is the problem?