I use this notebook from Kaggle to run an LSTM neural network.
I started training the network and saw that it was too slow: GPU training is almost three times slower than CPU training.
CPU performance: 8 min per epoch;
GPU performance: 26 min per epoch.
After this I decided to look for an answer in this question on Stack Overflow, and I applied CuDNNLSTM (which runs only on the GPU) instead of LSTM.
As a result, GPU performance improved to only 1 min per epoch, but the accuracy of the model decreased by 3%.
Questions:
1) Does anybody know why the GPU works slower than the CPU with the classic LSTM layer? I do not understand why this happens.
2) Why, when I use CuDNNLSTM instead of LSTM, does training become much faster while the accuracy of the model decreases?
P.S.:
My CPU: Intel Core i7-7700 Processor (8M Cache, up to 4.20 GHz)
My GPU: nVidia GeForce GTX 1050 Ti (4 GB)
My guess is that it's just a different, better implementation, and if the implementation is different you shouldn't expect identical results.
In general, efficiently implementing an algorithm on a GPU is hard, and getting maximum performance requires architecture-specific implementations. Therefore, it wouldn't be surprising if an implementation specific to Nvidia's GPUs had enhanced performance versus a general implementation for GPUs. It also wouldn't be surprising that Nvidia would sink significantly more resources into accelerating their code for their GPUs than a team working on a general CNN implementation would.
The other possibility is that the data type used on the backend has changed from double- to single- or even half-precision float. The smaller data types mean you can crunch more numbers faster at the cost of accuracy. For NN applications this is often acceptable because no individual number needs to be especially accurate for the net to produce acceptable results.
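For illustration only, here is a minimal sketch of how reduced precision can be opted into explicitly in tf.keras (TF 2.4+ mixed-precision API); this is not a claim about what the CuDNN backend does internally, and the layer sizes are placeholders:
import tensorflow as tf
from tensorflow.keras import layers, mixed_precision
mixed_precision.set_global_policy('mixed_float16')   # compute in float16, keep variables in float32
model = tf.keras.Sequential([
    layers.LSTM(64, input_shape=(100, 8)),            # placeholder sequence length and feature count
    layers.Dense(1, dtype='float32'),                 # keep the output layer in float32 for numeric stability
])
model.compile(optimizer='adam', loss='mse')
The speed-up comes from exactly the trade-off described above: more numbers crunched per second, at the cost of some numeric precision.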
I had a similar problem today and found two things that may be helpful to others (this is a regression problem on a data set with ~2.1MM rows, running on a machine with 4 P100 GPUs):
Using the CuDNNLSTM layer instead of the LSTM layer on a GPU machine reduced the fit time from ~13500 seconds to ~400 seconds per epoch.
Increasing the batch size (~500 to ~4700) reduced it to ~130 seconds per epoch.
Reducing the batch size increased the loss and validation loss, so you'll need to make a decision about the trade-offs you want to make.
In Keras, the fast LSTM implementation backed by CuDNN is the CuDNNLSTM layer:
model.add(CuDNNLSTM(units, input_shape=(X_train.shape[1], X_train.shape[2]), return_sequences=True))  # input_shape is (timesteps, features)
It can only be run on the GPU with the TensorFlow backend.
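A minimal end-to-end sketch of the same idea, assuming the Keras 2.x API of that era and placeholder shapes and data (in TF 2.x, tf.keras.layers.LSTM selects the cuDNN kernel automatically when its default arguments are used):
import numpy as np
from keras.models import Sequential
from keras.layers import CuDNNLSTM, Dense
timesteps, features = 50, 10                      # placeholder shapes
model = Sequential()
model.add(CuDNNLSTM(128, input_shape=(timesteps, features)))   # drop-in replacement for LSTM, GPU only
model.add(Dense(1))
model.compile(optimizer='adam', loss='mse')
X = np.random.rand(1000, timesteps, features).astype('float32')   # synthetic data for the sketch
y = np.random.rand(1000, 1).astype('float32')
model.fit(X, y, batch_size=128, epochs=2)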
I have access to a 2018 Lambda machine with 24 CPUs and 4 GPUs and a 4 TB SSD. I also have access to a 2022 Dell with 40 CPUs and 3 A6000 GPUs and a 1TB NVMe SSD in addition to a 1TB SSD and 2 8TB HDDs.
If I restrict my Python/Keras/TensorFlow code to use 1 CPU and 1 GPU, it takes about 9-10 minutes per epoch to run with about 40k parameters. If I increase the number of CPUs available or the number of GPUs available or both, it still takes 9-10 minutes per epoch. In the beginning, the disks are heavily taxed as the data is read into memory, but apparently, it all fits into memory because, after a while, it appears that there is no disk activity.
I used atop to monitor resource usage and that is how I found out about the disk activity.
I used htop to monitor CPU activity and I found that if only one CPU is used, it is heavily taxed. If 2 CPUs are used, they are each taxed at about 60% capacity. For 10 CPUs, each is used at about 13% capacity, and with 20 CPUs, each is used at about 7% capacity.
I used nvtop to monitor GPU activity and it appears, on the Dell, that one GPU is utilized at about 10-25% capacity while the other two are rarely used. In fact, it appears that no data is read into the memory of GPU1 or GPU2. Only GPU0 appears to be doing something real. Previously using nvidia-smi revealed similar behavior on the Lambda machine.
I am using the Keras model.fit with my own data generator function. I would think that using multiple CPUs would cause the data generator to be run in parallel on several CPUs to keep the queue full, but the CPUs seem mostly idle.
So, the CPUs appear to be underutilized, the GPUs appear to be underutilized, and the disks appear to be underutilized. Either there is some other source of bottlenecking or the main memory/PCI-bus can't keep up. Or maybe Keras just isn't running more than one copy of the data generator. At this point, I'm at a complete loss.
BTW, I'm using Python 3.8 and TensorFlow 2.10.
Any help is appreciated.
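For reference, a minimal sketch of the model.fit-with-generator pattern mentioned in the question above, with the workers/use_multiprocessing arguments that control how many processes Keras (TF 2.10) uses to fill the batch queue. The generator internals and the single-input model below are hypothetical stand-ins, not the actual code:
import numpy as np
import tensorflow as tf
class MyBatchGenerator(tf.keras.utils.Sequence):        # hypothetical generator producing random batches
    def __init__(self, n_samples=600000, batch_size=500):
        self.n_samples = n_samples
        self.batch_size = batch_size
    def __len__(self):
        return self.n_samples // self.batch_size
    def __getitem__(self, idx):
        x = np.random.rand(self.batch_size, 69, 69, 3).astype('float32')
        y = np.random.rand(self.batch_size, 1).astype('float32')
        return x, y
model = tf.keras.Sequential([                           # placeholder model, only to make the sketch runnable
    tf.keras.layers.Flatten(input_shape=(69, 69, 3)),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer='adam', loss='mse')
model.fit(MyBatchGenerator(),
          epochs=1,
          workers=8,                  # CPU processes filling the batch queue in parallel
          use_multiprocessing=True,   # use processes instead of threads
          max_queue_size=20)
Without workers/use_multiprocessing (or with a plain Python generator), the generator runs in a single process, which would match the mostly idle CPUs described above.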
Edit: Some more information:
I can't say exactly what the application is, but the network currently has two Siamese branches each having 17 convolutional layers performing 3x3 convolutions in a telescoping arrangement. The input images are 69x69 and the smallest layer is 5x5. The input layer has 3 channels and subsequent layers each have 16 channels. Following that are three dense layers with 25, 20, and 10 units. Following that, the two Siamese branches are combined in a product layer with 10 outputs and those outputs are combined by another dense layer with 1 output. There is also a custom loss function which is similar to the binary cross-entropy loss. The input dataset consists of about 600,000 images.
On the advice of a colleague, I turned off all the GPUs and am running purely on CPUs. Under that condition, it appears that all 40 CPUs on the Dell are being utilized at between 20% and 50% of capacity with one CPU sometimes using 95+% of capacity. It appears that under this condition, the time increases to about 12-13 minutes per epoch.
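For reference, one common way to force a CPU-only run (a sketch; the question does not say which method was actually used):
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"       # hide all GPUs; must be set before TensorFlow is imported
import tensorflow as tf
print(tf.config.list_physical_devices('GPU'))   # should print an empty list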
I should also mention that I'm using a batch size of 500. I've tried increasing or decreasing the batch size, but the run time does not appear to be greatly affected. If anything, it just gets slower.
Thanks.
For a project I'm working on, I am using an altered version of Mask R-CNN to train a model that will find objects in an image. These images are relatively small, about 300 x 200 pixels, and I train for a relatively long time, around 100 epochs.
However, my main question relates to the batch size and how TensorFlow allocates memory on the GPU for the validation stage of each epoch. I want to increase my batch size to help smooth out the validation curve and increase the accuracy of the overall model. However, if I increase my batch size too drastically, I get an OOM (GPU out-of-memory) and keras_scratch_graph error. I'm currently working with two NVIDIA Quadro P5000s that have 16 GB of VRAM each. With about 3 images per GPU, I can have a maximum batch size of 6 before it errors out. I've looked around, and most people either say to just decrease the batch size, which I would prefer not to do, or to enable GPU memory growth, which I couldn't get to work either. I could decrease the complexity of my model to reduce the size of the tensors being evaluated, but I don't want to risk it, as it could cause my accuracy to decrease or my loss to increase.
Is there a way that I can offload some images onto my system's physical memory, or am I purely limited to the amount of RAM available on my GPU? Are there any more compact or robust methods out there that could solve this issue?
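For reference, "GPU growth" above refers to TensorFlow's memory-growth option; a sketch of the usual TF 2.x way to enable it (whether it actually resolves the OOM depends on the model, since the validation tensors still have to fit on the card):
import tensorflow as tf
for gpu in tf.config.list_physical_devices('GPU'):
    tf.config.experimental.set_memory_growth(gpu, True)   # allocate GPU memory on demand; call before any GPU op runs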
First of all: this question is connected to neural network inference and not training.
I have discovered that when doing inference with a trained neural network on a single image over and over on a GPU (e.g. a P100), the utilization of the computing power with TensorFlow does not reach 100%, but instead sits at around 70%. This is also the case if the image does not have to be transferred to the GPU. Therefore, the issue has to be connected to constraints on the parallelization of the calculations. My best guesses for the reasons are:
TensorFlow can only utilize the parallelization capabilities of a GPU up to a certain level. (The higher utilization of the same model as a TensorRT model also suggests this.) In this case, the question is: what is the reason for that?
The inherent neural network structure, with several sequential layers, prevents higher usage. In this case the problem is not framework overhead but lies in the general design of neural networks. The question then is: what are the restrictions that cause this?
Both of the above combined.
Thanks for your ideas on the issue!
Why do you expect the GPU utilization to go to 100% when you run the neural network prediction for one image?
GPU utilization is measured per time unit (e.g. 1 second). This means that when the neural network algorithm finishes before this time unit has elapsed (e.g. within 0.5 s), then for the rest of that time the GPU may be used by other programs or not used at all. If the GPU is not used by any other program either, then you will not reach 100%.
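A hedged sketch of how to check this: time the prediction itself instead of relying on the sampled utilization figure. The model and input shape below are placeholders, not the network from the question:
import time
import numpy as np
import tensorflow as tf
model = tf.keras.applications.MobileNetV2(weights=None)   # stand-in for the trained network
image = np.random.rand(1, 224, 224, 3).astype('float32')  # placeholder single image
model.predict(image)                                       # warm-up call (graph building, memory allocation)
start = time.perf_counter()
for _ in range(100):
    model.predict(image)
elapsed = time.perf_counter() - start
print("mean latency per image: %.2f ms" % (elapsed / 100 * 1000))
If each prediction finishes well inside the sampling interval, the reported utilization will stay below 100% even though the GPU is not the bottleneck.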
I have installed the TensorFlow CPU version. I have only a few images as a dataset, and I am training on a machine with 4 GB of RAM and a Core i5-3340M at 2.70 GHz with batch size 1, and it is still extremely slow. All images are the same size (200x185, I think). Will it train like this? Kindly tell me how I can speed up this process.
Training process
If your network is deep, it could take a long time to train on a CPU, since a CPU is not optimized for these calculations the way a GPU is.
I would suggest you get a graphics card; even an old graphics card can significantly improve performance (it could be something like 100x faster).
Let's put some numbers here. You are dealing with images of size 200x185. Do you realize we are talking about 37,000 features if we deal with gray levels? If we deal with RGB, multiply that by 3. How many images are you using for training? Keep in mind also that SGD (stochastic gradient descent, mini-batch size = 1) tends to be very slow for big datasets. Give us some numbers: how many training images, and what does "slow" mean? How much time for one epoch? Something else: the programming language, library (TensorFlow, etc.), optimizer, etc. would help us judge whether your code is "slow" and whether it can be made faster.
Batch size is another parameter that affects training time: a higher batch size will reduce the time per epoch, but may require more epochs to reach the same performance as batch size 1.
And if your network is deep (using CNNs, etc.), you should run it on a GPU.
I am building a neural network on Keras, including multiple layers of LSTM, Permute and Dense.
It seems LSTM is GPU-unfriendly, so I did some research and used
with tf.device('/cpu:0'):
    out = LSTM(cells)(inp)
But based on my understanding of with, a with statement is a try...finally block that ensures clean-up code is executed. I don't know whether the following mixed CPU/GPU usage code works or not. Will it accelerate training?
with tf.device('/cpu:0'):
    out = LSTM(cells)(inp)
with tf.device('/gpu:0'):
    out = Permute(some_shape)(out)
with tf.device('/cpu:0'):
    out = LSTM(cells)(out)
with tf.device('/gpu:0'):
    out = Dense(output_size)(out)
As you may read here, tf.device is a context manager which switches the default device to the one passed as its argument within the context (block) it creates. So this code should run everything under '/cpu:0' on the CPU and the rest on the GPU.
Whether it will speed up your training is really hard to answer, because it depends on the machine you use, but I don't expect computations to be faster, since each change of device causes data to be copied between GPU RAM and machine RAM. This could even slow down your computations.
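If you want to confirm where each op actually ends up, a sketch of the TF 2.x device-placement logging (in TF 1.x the equivalent is log_device_placement=True in the session's ConfigProto); the tensors here are toy examples:
import tensorflow as tf
tf.debugging.set_log_device_placement(True)   # print the device chosen for every op
tf.config.set_soft_device_placement(True)     # fall back to CPU if the requested device is unavailable
with tf.device('/cpu:0'):
    a = tf.constant([[1.0, 2.0]])                      # logged as placed on the CPU
with tf.device('/gpu:0'):
    b = tf.matmul(a, tf.constant([[3.0], [4.0]]))      # logged on the GPU if one is available
print(b)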
I have created a model using 2 LSTM layers and 1 dense layer and trained it on my GPU (NVIDIA GTX 1050 Ti). Here are my observations.
Use CuDNNLSTM: https://keras.io/layers/recurrent/#cudnnlstm
Use a batch size that allows more GPU parallelism; with a very small batch size (2-10), the GPU's cores are not fully utilized, so I used 100 as the batch size.
If I train my network on the GPU and try to use it for predictions on the CPU, it works in terms of compiling and running, but the predictions are weird. In my case I have the luxury of using a GPU for prediction as well.
For a multi-layer LSTM, you need return_sequences=True on every LSTM layer except the last.
Here is a sample snippet:
model = keras.Sequential()
model.add(keras.layers.CuDNNLSTM(neurons,
          batch_input_shape=(nbatch_size, reshapedX.shape[1], reshapedX.shape[2]),
          return_sequences=True,   # a second recurrent layer follows, so pass the full sequence on
          stateful=True))
model.add(keras.layers.CuDNNLSTM(neurons, stateful=True))   # last recurrent layer: return_sequences not needed
model.add(keras.layers.Dense(1))