Cap tensorflow batch size based on GPU memory - python

I am switching to training on GPUs and found that training crashes at batch sizes that are arbitrary and not very big. With 256x256 RGB images in a UNET, a batch of 32 causes an out-of-memory crash, while 16 works successfully. The amount of memory consumed surprised me, as I never ran into an out-of-memory error on a 16 GB RAM system. Is TensorFlow free to use swap?
How can I check the total amount of memory available on a GPU? Many guides online only look at memory used.
How does one estimate the memory needs? Image size (pixels * channels * dtype) * batch + parameter size * float?
Many thanks,
Bogdan
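For the "total memory" part, one option outside TensorFlow is NVIDIA's management library. This is only a sketch and assumes the nvidia-ml-py (pynvml) package is installed; it reports the card's full capacity rather than just what TensorFlow has allocated:

import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)             # first GPU
info = pynvml.nvmlDeviceGetMemoryInfo(handle)             # .total / .free / .used, in bytes
print(f"total {info.total / 2**30:.1f} GiB, free {info.free / 2**30:.1f} GiB")
pynvml.nvmlShutdown()

# Rough estimate for the input tensor alone at batch size 32 (float32):
# 32 * 256 * 256 * 3 * 4 bytes ≈ 24 MiB. The true peak is far higher, because
# every layer's activations are kept for backpropagation and usually dominate,
# on top of parameters, gradients and optimizer state.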

Related

Cleaning Google TPU memory (python)

My Python code has two steps. In each step I train a neural network (primarily using from mesh_transformer.transformer_shard import CausalTransformer) and delete the network before the next step, in which I train another network with the same function. The problem is that in some cases I receive this error:
Resource exhausted: Failed to allocate request for 32.00MiB (33554432B) on device ordinal 0: while running replica 0 and partition 0 of a replicated computation (other replicas may have failed as well).
I think there is still some leftover state in the TPU memory that I need to remove besides the network itself. The point here is that both steps are independent; they don't share any information or variables. But I have to do this sequentially to manage my storage on Google Cloud. Also, when I run these two steps separately, everything works fine. Is there any way to clean the TPU memory thoroughly before going on to the next step of my code? I think just removing the network is not enough.
Unfortunately, you can’t clean the TPU memory, but you can reduce memory usage with these options.
The most effective ways to reduce memory usage are to:
Reduce excessive tensor padding
Tensors in TPU memory are padded, that is, the TPU rounds up the sizes of tensors stored in memory to perform computations more efficiently. This padding happens transparently at the hardware level and does not affect results. However, in certain cases the padding can result in significantly increased memory use and execution time.
Reduce the batch size
Slowly reduce the batch size until it fits in memory, making sure that the total batch size is a multiple of 64 (the per-core batch size has to be a multiple of 8). Keep in mind that larger batch sizes are more efficient on the TPU. A total batch size of 1024 (128 per core) is generally a good starting point.
If the model cannot be run on the TPU even with a small batch size (for example, 64), try reducing the number of layers or the layer sizes.
You can read more about troubleshooting in this documentation.
You can also try cleaning the TPU state after each training step with a tf.tpu.experimental.shutdown_tpu_system() call and see if that helps.
Another option is to restart the TPU to clear its memory. Install the client first with pip3 install cloud-tpu-client, then run:
import tensorflow as tf
from cloud_tpu_client import Client

# Print the local TF version, then restart the TPU runtime on that same
# version; the restart wipes everything held in TPU memory.
print(tf.__version__)
Client().configure_tpu_version(tf.__version__, restart_type='always')
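If both steps must live in one script, a possible shape (an assumption, not spelled out in the answer) is to shut the TPU system down after step 1 and re-initialize it before step 2, so allocations from the first network are released. The resolver setup below is illustrative and assumes a TensorFlow workflow on a Cloud TPU:

import tensorflow as tf

resolver = tf.distribute.cluster_resolver.TPUClusterResolver()   # assumes a Cloud TPU is attached
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)

# ... step 1: build, train and delete the first network ...

tf.tpu.experimental.shutdown_tpu_system(resolver)                # drop TPU-side state from step 1
tf.tpu.experimental.initialize_tpu_system(resolver)              # start step 2 from a clean system

# ... step 2: train the second network ...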

Allowing Tensorflow to Use both GPU and Physical System Memory

For a project I'm working on, I am using an altered version of Mask RCNN to train a model that will find objects in an image. These images are relatively small, about 300 x 200 pixels, and I train them for a relatively long time, around 100 epochs.
However, my main question relates to the batch size and how TensorFlow allocates memory on the GPU for the validation stage of each epoch. I want to increase my batch size to help smooth out the validation curve and increase the accuracy of the overall model. However, if I increase my batch size too drastically, I get an OOM: GPU out-of-memory and keras_scratch_graph error. I'm currently working with two NVIDIA Quadro P5000s that have 16 GB of VRAM each. With about 3 images per GPU, I can have a maximum batch size of 6 before it errors out. I've looked around, and most people either say to just decrease the batch size, which I would prefer not to do, or to enable GPU memory growth, which I couldn't get to work either. I could decrease the complexity of my model to shrink the tensors being evaluated, but I don't want to risk it, as it could cause my accuracy to decrease or my loss to increase.
Is there a way that I can offload some images onto my system's physical memory, or am I purely limited by the amount of RAM available on my GPU? Are there any more compact or robust methods out there that could solve this issue?
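For reference, the memory-growth setting mentioned above is usually enabled like this in TF 2.x (a sketch, not an answer to the offloading question; it must run before any GPU operation executes):

import tensorflow as tf

# Let TensorFlow allocate GPU memory on demand instead of grabbing it all up front.
for gpu in tf.config.list_physical_devices('GPU'):
    tf.config.experimental.set_memory_growth(gpu, True)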

Why does model.fit in Keras significantly increase RAM usage?

I load my data with the open_memmap function and it takes 5 GB of RAM. Then I compile the model, which has 89,268,608 parameters, and that does not take any additional RAM. My batch size is 200 at the moment and the input images have shape (300, 54, 3).
My problem is that when I call the model.fit function in Keras, my RAM usage increases from 5 GB to 24 GB. My question is: why?
When I try different batch sizes nothing changes; 23 GB of RAM are still occupied.
If somebody can explain to me what is happening, I would highly appreciate it.
Thanks!
Keras' fit method loads all the data into memory at once, meaning changing your batch size will have no effect on the RAM it takes up. Have a look at using fit_generator, which is designed for use with large datasets.
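A minimal sketch of that idea with a keras.utils.Sequence (the names data and labels and the batch size are assumptions, not taken from the question). Only the slice for the current batch is copied off the memmap into regular RAM:

import math
import numpy as np
from tensorflow import keras

class MemmapSequence(keras.utils.Sequence):
    def __init__(self, data, labels, batch_size=200):
        super().__init__()
        self.data, self.labels, self.batch_size = data, labels, batch_size

    def __len__(self):
        return math.ceil(len(self.data) / self.batch_size)

    def __getitem__(self, idx):
        batch = slice(idx * self.batch_size, (idx + 1) * self.batch_size)
        # np.array copies only this slice into RAM, not the whole memmap
        return np.array(self.data[batch]), np.array(self.labels[batch])

# model.fit(MemmapSequence(data, labels), epochs=10)

With older standalone Keras the same object can be passed to model.fit_generator instead of model.fit.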

TensorFlow Object Detection API Mask-RCNN Training Causes OOM Error

I'm trying to train a Mask-RCNN (ResNet101) model on a single GPU using the TensorFlow Object Detection API.
This is the configuration I'm using for Mask-RCNN.
Immediately after training starts, it takes up ~24GB of CPU RAM. At the 10 minute mark, when the first round of evaluation begins, all 32GB of my CPU RAM fill up and the process gets killed.
My dataset consists of 1 sample in the training set (i.e. 1 image) and 1 sample in the eval set. The images are 775 x 522 pixels. Each sample (image + boxes + masks) amounts to no more than 2MB on disk. I've done this to ensure that the dataset's effect on memory consumption is minimised.
I've also attempted to train a Faster RCNN model on the same dataset, and it works as expected (using 2-3 GB of CPU RAM).
This is the configuration I'm using for Faster RCNN.
Why is the memory usage during the training and evaluation of Mask-RCNN so high (compared to Faster RCNN's), and what can I do to reduce it?

Why does GPU usage run low in NN training?

I'm running NN training on my GPU with PyTorch, but the GPU usage is strangely "limited" at about 50-60%.
That's a waste of computing resources, yet I can't push it any higher.
I'm sure the hardware is fine, because running two of my processes at the same time, or training a simple NN (a DCGAN, for instance), can occupy 95% or more of the GPU (which is how it is supposed to be).
My NN contains several convolution layers and should use more GPU resources.
Besides, I believe the data from the dataset is being fed fast enough, because I use workers=64 in my dataloader instance and my disk works just fine.
I'm just confused about what is happening.
Dev details:
GPU: Nvidia GTX 1080 Ti
OS: Ubuntu 64-bit
I can only guess without further investigation, but it could be that your network is small in terms of layer size (not number of layers), so each training step is not enough to occupy all of the GPU's resources. Or, at the least, the ratio between the data size and the transfer speed (to the GPU memory) is poor, and the GPU stays idle most of the time.
tl;dr: the GPU jobs are not long enough to justify the memory transfers.
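One way to test that guess (a rough sketch, not from the answer; model and loader stand for your own network and DataLoader, assumed to yield (input, target) pairs) is to time the data wait and the GPU work separately for a handful of batches:

import time
import torch

def profile_batches(model, loader, device='cuda', n_batches=50):
    model = model.to(device).train()
    data_wait = gpu_work = 0.0
    it = iter(loader)
    for _ in range(n_batches):
        t0 = time.perf_counter()
        x, _ = next(it)                          # time spent waiting on the DataLoader
        data_wait += time.perf_counter() - t0

        t1 = time.perf_counter()
        x = x.to(device, non_blocking=True)
        model(x).float().mean().backward()       # stand-in for the real loss
        torch.cuda.synchronize()                 # wait for the GPU so the timing is honest
        gpu_work += time.perf_counter() - t1
    print(f"data wait: {data_wait:.2f}s, gpu work: {gpu_work:.2f}s")

If the data-wait total dominates, the DataLoader is the bottleneck; if the per-batch GPU work is tiny, the kernels are simply too short to keep the card busy, and a larger batch size would likely help more than additional workers.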
