Allowing TensorFlow to Use Both GPU and Physical System Memory - python

For a project I'm working on, I am using an altered version of Mask RCNN to train a model that will find objects in an image. The images are relatively small, about 300 x 200 pixels, and I train for a relatively long time, around 100 epochs.
However, my main question relates to the batch size and how TensorFlow allocates memory on the GPU for the validation stage of each epoch. I want to increase my batch size to help smooth out the validation curve and increase the accuracy of the overall model. However, if I increase my batch size too drastically, I get an OOM (GPU out-of-memory) and keras_scratch_graph error. I'm currently working with two NVIDIA Quadro P5000s that have 16 GB of VRAM each. With about 3 images per GPU, I can have a maximum batch size of 6 before it errors out. I've looked around, and most people either say to just decrease the batch size, which I would prefer not to do, or to enable GPU memory growth, which I couldn't get to work either. I could decrease the complexity of my model to shrink the tensors being evaluated, but I don't want to risk it, as it could cause my accuracy to decrease or my loss to increase.
Is there a way I can offload some images onto my physical system's memory, or am I purely limited by the amount of RAM available on my GPU? Are there any more compact or robust methods out there that could solve this issue?
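For reference, the "GPU memory growth" option mentioned above is usually requested like this in TF 2.x (it has to run before the GPUs are first used); this is only a sketch of that setting, not a way to offload tensors into system RAM:

import tensorflow as tf

# Sketch: ask TensorFlow to grow its GPU allocation on demand instead of
# reserving all of each card's VRAM up front. Must run before any GPU op executes.
for gpu in tf.config.list_physical_devices('GPU'):
    tf.config.experimental.set_memory_growth(gpu, True)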

Related

Keras/TensorFlow I'm Having Trouble Using GPUs, CPUs, and Other Hardware Efficiently

I have access to a 2018 Lambda machine with 24 CPUs and 4 GPUs and a 4 TB SSD. I also have access to a 2022 Dell with 40 CPUs and 3 A6000 GPUs and a 1TB NVMe SSD in addition to a 1TB SSD and 2 8TB HDDs.
If I restrict my Python/Keras/TensorFlow code to use 1 CPU and 1 GPU, it takes about 9-10 minutes per epoch to run with about 40k parameters. If I increase the number of CPUs available or the number of GPUs available or both, it still takes 9-10 minutes per epoch. In the beginning, the disks are heavily taxed as the data is read into memory, but apparently, it all fits into memory because, after a while, it appears that there is no disk activity.
I used atop to monitor resource usage and that is how I found out about the disk activity.
I used htop to monitor CPU activity and I found that if only one CPU is used, it is heavily taxed. If 2 CPUs are used, they are each taxed at about 60% capacity. For 10 CPUs, each is used at about 13% capacity, and with 20 CPUs, each is used at about 7% capacity.
I used nvtop to monitor GPU activity and it appears, on the Dell, that one GPU is utilized at about 10-25% capacity while the other two are rarely used. In fact, it appears that no data is read into the memory of GPU1 or GPU2. Only GPU0 appears to be doing something real. Previously using nvidia-smi revealed similar behavior on the Lambda machine.
I am using the Keras model.fit with my own data generator function. I would think that using multiple CPUs would cause the data generator to be run in parallel on several CPUs to keep the queue full, but the CPUs seem mostly idle.
So, the CPUs appear to be underutilized, the GPUs appear to be underutilized, and the disks appear to be underutilized. Either there is some other source of bottlenecking or the main memory/PCI-bus can't keep up. Or maybe Keras just isn't running more than one copy of the data generator. At this point, I'm at a complete loss.
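For context, a plain Python generator handed to model.fit runs in a single thread, which would match the idle CPUs described above. Below is a sketch of the tf.data alternative that parallelizes the input pipeline; the file pattern and parse function are hypothetical placeholders, not the asker's code:

import tensorflow as tf

def parse_example(path):
    # Hypothetical: load one image and return (image, label); replace with
    # whatever the real generator does per sample.
    image = tf.io.decode_png(tf.io.read_file(path), channels=3)
    return tf.image.convert_image_dtype(image, tf.float32), 0

files = tf.data.Dataset.list_files("data/*.png")                     # placeholder pattern
dataset = (files
           .map(parse_example, num_parallel_calls=tf.data.AUTOTUNE)  # use several CPUs
           .batch(500)
           .prefetch(tf.data.AUTOTUNE))                              # overlap with training

model.fit(dataset, epochs=10)   # model built elsewhere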
BTW, I'm using Python 3.8 and TensorFlow 2.10.
Any help is appreciated.
Edit: Some more information:
I can't say exactly what the application is, but the network currently has two Siamese branches each having 17 convolutional layers performing 3x3 convolutions in a telescoping arrangement. The input images are 69x69 and the smallest layer is 5x5. The input layer has 3 channels and subsequent layers each have 16 channels. Following that are three dense layers with 25, 20, and 10 units. Following that, the two Siamese branches are combined in a product layer with 10 outputs and those outputs are combined by another dense layer with 1 output. There is also a custom loss function which is similar to the binary cross-entropy loss. The input dataset consists of about 600,000 images.
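A rough Keras sketch of that architecture, just to make the description concrete; the layer counts and widths come from the text above, while the padding, activations, weight sharing between the branches, and the downsampling schedule that telescopes 69x69 down to 5x5 are guesses:

import tensorflow as tf
from tensorflow.keras import layers

def build_tower():
    # 17 3x3 conv layers with 16 channels, then dense layers of 25, 20, 10 units.
    # The stride/pooling schedule that reaches 5x5 isn't specified, so plain
    # 'valid' convolutions stand in for it here.
    inp = layers.Input(shape=(69, 69, 3))
    x = inp
    for _ in range(17):
        x = layers.Conv2D(16, 3, padding='valid', activation='relu')(x)
    x = layers.Flatten()(x)
    for units in (25, 20, 10):
        x = layers.Dense(units, activation='relu')(x)
    return tf.keras.Model(inp, x)

tower = build_tower()                     # assume the two branches share weights
in_a = layers.Input(shape=(69, 69, 3))
in_b = layers.Input(shape=(69, 69, 3))
merged = layers.Multiply()([tower(in_a), tower(in_b)])    # product layer, 10 outputs
output = layers.Dense(1, activation='sigmoid')(merged)    # sigmoid is an assumption
model = tf.keras.Model([in_a, in_b], output)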
On the advice of a colleague, I turned off all the GPUs and am running purely on CPUs. Under that condition, it appears that all 40 CPUs on the Dell are being utilized at between 20% and 50% of capacity with one CPU sometimes using 95+% of capacity. It appears that under this condition, the time increases to about 12-13 minutes per epoch.
I should also mention that I'm using a batch size of 500. I've tried increasing or decreasing the batch size, but the run time does not appear to be greatly affected. If anything, it just gets slower.
Thanks.
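One thing that matches the nvtop observation: by default Keras places the whole model on GPU:0 and leaves the other cards idle. Using all GPUs requires an explicit distribution strategy. A minimal sketch, where build_model(), my_loss, and the training arrays are placeholders:

import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()              # uses every visible GPU
print("Replicas:", strategy.num_replicas_in_sync)

with strategy.scope():
    model = build_model()                                # hypothetical builder
    model.compile(optimizer='adam', loss=my_loss)        # my_loss: the custom loss

# The global batch is split across replicas, so scale it with the GPU count.
model.fit(x_train, y_train,
          batch_size=500 * strategy.num_replicas_in_sync,
          epochs=10)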

Cap TensorFlow batch size based on GPU memory

I am switching to training on GPUs and found that, with an arbitrary and not very big batch size, training will crash. With 256x256 RGB images in a U-Net, a batch of 32 causes an out-of-memory crash, while 16 works successfully. The amount of memory consumed was surprising, as I never ran into an out-of-memory error on a 16 GB RAM system. Is TensorFlow free to use swap?
How can I check the amount of total memory available on a GPU? Many guides online only look at memory used.
How does one estimate the memory needs? Image size (pixels × channels × dtype size) × batch size + parameter count × float size?
Many thanks,
Bogdan
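One way to read the total (not just used) memory on the card from Python is to query NVML; the pynvml package below is an assumption about what is installed, and the size calculation is only a back-of-envelope illustration:

from pynvml import nvmlInit, nvmlDeviceGetHandleByIndex, nvmlDeviceGetMemoryInfo

nvmlInit()
info = nvmlDeviceGetMemoryInfo(nvmlDeviceGetHandleByIndex(0))
print(f"total {info.total / 2**30:.1f} GiB, free {info.free / 2**30:.1f} GiB")

# Rough sizing for one batch of inputs. In practice the footprint is dominated
# by the intermediate activations of every layer (kept for backprop), which is
# why a U-Net on 256x256 images can OOM long before the input size suggests.
batch, h, w, c, bytes_per_value = 32, 256, 256, 3, 4      # float32
print(f"inputs alone: {batch * h * w * c * bytes_per_value / 2**20:.1f} MiB")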

Why is Keras LSTM on CPU three times faster than GPU?

I use this notebook from Kaggle to run an LSTM neural network.
I started training the neural network and saw that it was too slow: almost three times slower than CPU training.
CPU performance: 8 min per epoch;
GPU performance: 26 min per epoch.
After this I decided to find an answer in this question on Stack Overflow, and I applied CuDNNLSTM (which runs only on GPU) instead of LSTM.
As a result, GPU performance improved to only 1 min per epoch, but the accuracy of the model decreased by 3%.
Questions:
1) Does anybody know why the GPU works slower than the CPU with the classic LSTM layer? I do not understand why this happens.
2) Why does training become much faster, and the accuracy of the model decrease, when I use CuDNNLSTM instead of LSTM?
P.S.:
My CPU: Intel Core i7-7700 Processor (8M Cache, up to 4.20 GHz)
My GPU: nVidia GeForce GTX 1050 Ti (4 GB)
My guess is it's just a different, better implementation, and if the implementation is different, you shouldn't expect identical results.
In general, efficiently implementing an algorithm on a GPU is hard, and getting maximum performance requires architecture-specific implementations. Therefore, it wouldn't be surprising if an implementation specific to Nvidia's GPUs had enhanced performance versus a general implementation for GPUs. It also wouldn't be surprising for Nvidia to sink significantly more resources into accelerating their code for their GPUs than would a team working on a general CNN implementation.
The other possibility is that the data type used on the backend has changed from double- to single- or even half-precision float. The smaller data types mean you can crunch more numbers faster at the cost of accuracy. For NN applications this is often acceptable because no individual number needs to be especially accurate for the net to produce acceptable results.
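If you want to try the precision trade-off deliberately in more recent TensorFlow versions (2.4+ is assumed here), the Keras mixed-precision policy is the usual switch; a minimal sketch with placeholder shapes:

import tensorflow as tf

# Run most ops in float16 while keeping variables (and the loss) in float32.
tf.keras.mixed_precision.set_global_policy('mixed_float16')

model = tf.keras.Sequential([
    tf.keras.layers.LSTM(128, input_shape=(100, 64)),                 # placeholder shapes
    tf.keras.layers.Dense(1, activation='sigmoid', dtype='float32'),  # full-precision output
])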
I had a similar problem today and found two things that may be helpful to others (this is a regression problem on a data set with ~2.1MM rows, running on a machine with 4 P100 GPUs):
Using the CuDNNLSTM layer instead of the LSTM layer on a GPU machine reduced the fit time from ~13500 seconds to ~400 seconds per epoch.
Increasing the batch size (~500 to ~4700) reduced it to ~130 seconds per epoch.
Reducing the batch size increased loss and val loss, so you'll need to make a decision about the trade-offs you want to make.
In Keras, the fast LSTM implementation is the one backed by CuDNN:
from keras.layers import CuDNNLSTM
model.add(CuDNNLSTM(units, input_shape=(len(X_train), len(X_train[0])), return_sequences=True))
It can only be run on the GPU with the TensorFlow backend.
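(For anyone on TensorFlow 2.x: the standalone CuDNNLSTM layer was removed, and tf.keras.layers.LSTM selects the fused cuDNN kernel automatically when it runs on a GPU with its default activations and recurrent_dropout=0. A sketch of the modern equivalent, with placeholder shapes:)

import tensorflow as tf

timesteps, features = 100, 64                # placeholder shapes
model = tf.keras.Sequential([
    # Defaults (tanh/sigmoid, recurrent_dropout=0) keep the cuDNN path enabled.
    tf.keras.layers.LSTM(128, return_sequences=True, input_shape=(timesteps, features)),
])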

Why will GPU usage run low in NN training?

I'm running NN training on my GPU with PyTorch.
But the GPU usage is strangely "limited" at about 50-60%.
That's a waste of computing resources, but I can't make it any higher.
I'm sure that the hardware is fine, because running 2 of my processes at the same time, or training a simple NN (a DCGAN, for instance), can occupy 95% or more of the GPU (which is how it's supposed to be).
My NN contains several convolution layers and it should use more GPU resources.
Besides, I guess that the data from the dataset is being fed fast enough, because I use num_workers=64 in my DataLoader instance and my disk works just fine.
I'm just confused about what is happening.
Dev details:
GPU: Nvidia GTX 1080 Ti
OS: Ubuntu 64-bit
I can only guess without further research, but it could be that your network is small in terms of layer size (not number of layers), so each step of the training is not enough to occupy all the GPU resources. Or at least the ratio between the data size and the transfer speed (to the GPU memory) is bad, and the GPU stays idle most of the time.
tl;dr: the GPU jobs are not long enough to justify the memory transfers.
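A sketch of the DataLoader and transfer settings that usually shrink that idle time (the dataset object, batch size, and training loop body are placeholders, not the asker's code):

import torch
from torch.utils.data import DataLoader

device = torch.device("cuda")
# Pinned host memory plus non_blocking copies let transfers overlap with compute;
# a larger batch gives each kernel launch more work per transfer.
loader = DataLoader(dataset, batch_size=256, shuffle=True,
                    num_workers=8, pin_memory=True)

for images, labels in loader:
    images = images.to(device, non_blocking=True)
    labels = labels.to(device, non_blocking=True)
    ...  # forward / backward as usual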

Training in TensorFlow CPU version too slow

I have installed the TensorFlow CPU version. I have only a few images as a dataset, and I am training on a machine with 4 GB of RAM and a Core i5-3340M at 2.70 GHz, with batch size 1, and it is still extremely slow. The size of all the images is the same (200x185, I think). Will it train like this? Kindly tell me how I can speed up this process.
Training process
If your network is deep, it could take a long time to train using a CPU, as it is not optimized like a GPU for these calculations.
I would suggest you get a graphics card; even an old graphics card can significantly improve performance (it could be something like 100x faster).
Let's put some numbers on this. You are dealing with images of size 200x185; do you realize that is 37,000 features per image if we deal with gray levels, and three times that if we deal with RGB? How many images are you using for training? Keep in mind also that SGD (stochastic gradient descent, mini-batch size = 1) tends to be very slow for big datasets. Give us some numbers: how many training images you have, what "slow" means, and how much time one epoch takes. Something else: the programming language, library (TensorFlow, etc.), optimizer, etc. would help us judge whether your code is "slow" and whether it can be made faster.
Batch size is another parameter that affects training time: a larger batch size reduces the time per epoch, but may require more epochs to reach the same efficiency as batch size 1.
And if your network is deep (using CNNs, etc.), you should run it on a GPU.
