I am building a neural network in Keras with multiple LSTM, Permute and Dense layers.
It seems LSTM is GPU-unfriendly, so I did some research and now use
with tf.device('/cpu:0'):
    out = LSTM(cells)(inp)
But based on my understanding of with, a with statement is a try...finally block that ensures clean-up code is executed. I don't know whether the following CPU/GPU mixture works or not. Will it speed up training?
with tf.device('/cpu:0'):
    out = LSTM(cells)(inp)
with tf.device('/gpu:0'):
    out = Permute(some_shape)(out)
with tf.device('/cpu:0'):
    out = LSTM(cells)(out)
with tf.device('/gpu:0'):
    out = Dense(output_size)(out)
As you may read here, tf.device is a context manager which switches the default device to the one passed as its argument within the block it creates. So this code should run everything placed under '/cpu:0' on the CPU and the rest on the GPU.
Whether it will speed up your training is really hard to answer because it depends on the machine you use, but I don't expect the computations to be faster: each change of device forces data to be copied between GPU memory and machine RAM. This could even slow down your computations.
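If you want to see where ops actually end up and how often tensors cross the CPU/GPU boundary, you can turn on device-placement logging. This is a minimal sketch assuming TensorFlow 2.x with eager execution and a visible GPU, using plain ops rather than the Keras layers from the question:
import tensorflow as tf

tf.debugging.set_log_device_placement(True)      # print the device every op actually runs on

with tf.device('/cpu:0'):
    a = tf.random.uniform((4, 8))                # placed on the CPU
with tf.device('/gpu:0'):
    b = tf.matmul(a, tf.random.uniform((8, 2)))  # placed on the GPU; note the implicit copy of a
print(b.device)
The logs make the cross-device copies mentioned above visible, which is usually the overhead that cancels out any gain from mixing devices.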
I have created a model using 2 LSTM and 1 Dense layers and trained it on my GPU (NVidia GTX 10150Ti). Here are my observations:
Use CuDNNLSTM (https://keras.io/layers/recurrent/#cudnnlstm) instead of the plain LSTM layer.
Use a batch size that allows more GPU parallelism; with a very small batch size (2-10) the GPU's cores are not well utilized, so I used 100 as the batch size.
If I train my network on the GPU and try to use it for predictions on the CPU, it compiles and runs, but the predictions are weird. In my case I have the luxury of using a GPU for prediction as well.
For a multi-layer LSTM you need return_sequences=True on the intermediate layers.
Here is a sample snippet:
model = keras.Sequential()
model.add(keras.layers.cudnn_recurrent.CuDNNLSTM(
    neurons,
    batch_input_shape=(nbatch_size, reshapedX.shape[1], reshapedX.shape[2]),
    return_sequences=True,
    stateful=True))
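To stack a second recurrent layer on top, the last recurrent layer usually drops return_sequences before the output layer. This is only a hypothetical continuation of the snippet above, reusing the neurons variable from it:
model.add(keras.layers.CuDNNLSTM(neurons, stateful=True))  # last recurrent layer: no return_sequences
model.add(keras.layers.Dense(1))                           # e.g. a single regression output
model.compile(loss='mean_squared_error', optimizer='adam')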
Related
I am running the following LSTM code on Databricks with a GPU:
model = Sequential()
model.add(LSTM(64, activation=LeakyReLU(alpha=0.05), batch_input_shape=(1, timesteps, n_features),
stateful=False, return_sequences = True))
model.add(Dropout(0.2))
model.add(LSTM(32))
model.add(Dropout(0.2))
model.add(Dense(n_features))
model.compile(loss='mean_squared_error', optimizer=Adam(learning_rate = 0.001), metrics='acc')
model.fit(generator, epochs=epochs, verbose=0, shuffle=False)
but the following warning keeps appearing
WARNING:tensorflow:Layer lstm will not use cuDNN kernels since it doesn't meet the criteria. It will use a generic GPU kernel as fallback when running on GPU.
It trains much slower than it does without a GPU.
I'm using DBR 9.0 ML (includes Apache Spark 3.1.2, GPU, Scala 2.12)
Do I need any additional libraries for this?
cuDNN has functionality to specifically accelerate LSTM and GRU layers, but these layers can only be accelerated if they meet certain criteria. In your case the problem is that you are using the LeakyReLU activation: the cuDNN LSTM acceleration only works if the activation is tanh.
Quoting from the documentation (https://www.tensorflow.org/api_docs/python/tf/keras/layers/LSTM)
The requirements to use the cuDNN implementation are:
activation == tanh
recurrent_activation == sigmoid
recurrent_dropout == 0
unroll is False
use_bias is True
Inputs, if use masking, are strictly right-padded.
Eager execution is enabled in the outermost context.
Your LSTM will still run on the GPU, but it will be constructed from scan and matmul operations and therefore be much slower. In my experience the cuDNN LSTM/GRU acceleration works so well that both of these layers run faster than the SimpleRNN layer (which is not accelerated by cuDNN), despite SimpleRNN being a much simpler layer.
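If the speed-up matters more than the custom activation, one option is to leave the LSTM activations at their default tanh so the cuDNN kernel is eligible. This is only a sketch of the model from the question with the activation argument removed (timesteps and n_features are placeholders for your own values); note that tanh and LeakyReLU are different nonlinearities, so results will differ:
from tensorflow.keras import Sequential
from tensorflow.keras.layers import LSTM, Dropout, Dense
from tensorflow.keras.optimizers import Adam

timesteps, n_features = 10, 4   # hypothetical values; use your own

model = Sequential()
model.add(LSTM(64, batch_input_shape=(1, timesteps, n_features),
               stateful=False, return_sequences=True))  # default activation='tanh' keeps the cuDNN path
model.add(Dropout(0.2))
model.add(LSTM(32))                                     # defaults again, also cuDNN-eligible
model.add(Dropout(0.2))
model.add(Dense(n_features))
model.compile(loss='mean_squared_error', optimizer=Adam(learning_rate=0.001), metrics=['acc'])
With this change the "will not use cuDNN kernels" warning should disappear and each epoch should run noticeably faster on the GPU.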
This is what Francois Chollet, creator of the Keras library and a main contributor to the TensorFlow framework, says about RNN runtime performance in his book Deep Learning with Python, 2nd edition:
Recurrent models with very few parameters, like the ones in this chapter, tend to be significantly faster on a multicore CPU than on GPU, because they only involve small matrix multiplications, and the chain of multiplications is not well parallelizable due to the presence of a for loop. But larger RNNs can greatly benefit from a GPU runtime.
When using a Keras LSTM or GRU layer on GPU with default keyword arguments, your layer will be leveraging a cuDNN kernel, a highly optimized, low-level, NVIDIA-provided implementation of the underlying algorithm. As usual, cuDNN kernels are a mixed blessing: they’re fast, but inflexible—if you try to do anything not supported by the default kernel, you will suffer a dramatic slowdown, which more or less forces you to stick to what NVIDIA happens to provide. For instance, recurrent dropout isn’t supported by the LSTM and GRU cuDNN kernels, so adding it to your layers forces the runtime to fall back to the regular TensorFlow implementation, which is generally two to five times slower on GPU (even though its computational cost is the same).
As a way to speed up your RNN layer when you can’t use cuDNN, you can try unrolling it. Unrolling a for loop consists of removing the loop and simply inlining its content N times. In the case of the for loop of an RNN, unrolling can help TensorFlow optimize the underlying computation graph. However, it will also considerably increase the memory consumption of your RNN—as such, it’s only viable for relatively small sequences (around 100 steps or fewer). Also, note that you can only do this if the number of timesteps in the data is known in advance by the model (that is to say, if you pass a shape without any None entries to your initial Input()). It works like this:
inputs = keras.Input(shape=(sequence_length, num_features))
x = layers.LSTM(32, recurrent_dropout=0.2, unroll=True)(inputs)
I am already using Google Colab to train my model, so I will not use my own GPU for training. I want to ask: is there a performance difference between GPU and CPU when working with a pre-trained model? I already trained a model on a Google Colab GPU and used it on my own local CPU. Should I use a GPU for testing?
It depends how many predictions you need to make. Training involves a huge number of calculations, so GPU parallelisation shortens the overall training time. When using a trained model you typically only need an occasional prediction, and in that situation the CPU should be fine. However, if you need to make as many predictions as during training, a GPU would be beneficial. This is particularly true with reinforcement learning, where your model must adapt to continuously changing environmental input.
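If you want to measure this for your own model rather than guess, a minimal timing sketch could look like the following (assuming TensorFlow 2.x; 'model.h5' and the random x_test batch are hypothetical stand-ins for your own model and data). Run it once with the GPU visible and once with it hidden to compare:
import os
# To force CPU-only inference for the comparison, hide the GPU before importing TensorFlow:
# os.environ['CUDA_VISIBLE_DEVICES'] = '-1'

import time
import numpy as np
import tensorflow as tf

model = tf.keras.models.load_model('model.h5')       # hypothetical path to your trained model
x_test = np.random.rand(256, 32).astype('float32')   # stand-in for a real test batch

model.predict(x_test, batch_size=64)                 # warm-up (graph tracing, memory allocation)
start = time.perf_counter()
model.predict(x_test, batch_size=64)
print('prediction time: %.4f s' % (time.perf_counter() - start))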
I am trying to train a CNN model with Keras using the 36 cores that I have. I am trying to follow:
How to run Keras on multiple cores?
But it doesn't make my code faster, and I am not sure whether it uses all the available cores or just one core while the rest remain unused.
My code is:
The model is defined with Keras:
import tensorflow as tf
from keras.backend import tensorflow_backend as K
from keras.callbacks import EarlyStopping
sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
K.set_session(sess)
CNN_Model = CNN_model()
ES = EarlyStopping(monitor='val_loss', mode='min', verbose=1, patience=150)
history = CNN_Model.fit(IM_Training, Y_Train, batch_size=256, epochs=250, verbose=1, validation_data=(IM_Valid, Y_Val), callbacks=[ES])
How can I make sure that the code uses all the cores?
There are 2 main ways you can get parallelism when evaluating a neural network:
Matrix computations
Micro-batch parallelism
The computational graph of many neural networks is sequential (hence Keras having a Sequential model), i.e. layer1 ... layerN are computed in sequence on both the forward and backward passes. A sequential network can't be sped up by distributing its layers across different cores.
However, most of the computation consists of matrix operations, which are typically implemented with a high-performance library like BLAS that uses all the cores available to the CPU. Typically, the larger the batch size, the larger the opportunity for parallelism.
Micro-batch parallelism is the strategy used by multi_gpu_model, where different micro-batches are distributed to different compute units (it primarily makes sense for GPUs).
Non-sequential models can also be parallelised via careful device placement, though I'm not sure that is the scenario here. The TLDR is: increase your batch_size and enjoy all 36 cores spinning on matrix computations.
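If you also want to be explicit about how many threads TensorFlow may use, a sketch assuming TensorFlow 2.x is shown below (with the TF 1.x session API from the question you would instead pass intra_op_parallelism_threads and inter_op_parallelism_threads to tf.ConfigProto):
import tensorflow as tf

# Must run before TensorFlow creates its thread pools (i.e. before any op executes).
tf.config.threading.set_intra_op_parallelism_threads(36)  # threads used inside a single op, e.g. one matmul
tf.config.threading.set_inter_op_parallelism_threads(2)   # independent ops that may run concurrently
print(tf.config.threading.get_intra_op_parallelism_threads())
Whether this actually helps still depends mostly on the batch size, as explained above.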
I use this notebook from Kaggle to run an LSTM neural network.
I started training the network and saw that it is too slow, almost three times slower than training on the CPU:
CPU performance: 8 min per epoch;
GPU performance: 26 min per epoch.
After this I decided to find an answer in this question on Stack Overflow, and I applied CuDNNLSTM (which runs only on a GPU) instead of LSTM.
With this, GPU performance became only 1 min per epoch, but the accuracy of the model decreased by 3%.
Questions:
1) Does somebody know why the GPU works slower than the CPU with the classic LSTM layer? I do not understand why this happens.
2) Why does training become much faster when I use CuDNNLSTM instead of LSTM, and why does the accuracy of the model decrease?
P.S.:
My CPU: Intel Core i7-7700 Processor (8M Cache, up to 4.20 GHz)
My GPU: nVidia GeForce GTX 1050 Ti (4 GB)
My guess is that it's simply a different, better implementation, and if the implementation is different you shouldn't expect identical results.
In general, efficiently implementing an algorithm on a GPU is hard, and getting maximum performance requires architecture-specific implementations. Therefore, it wouldn't be surprising if an implementation specific to Nvidia's GPUs had enhanced performance versus a general GPU implementation. It also wouldn't be surprising that Nvidia would sink significantly more resources into accelerating their code for their GPUs than would a team working on a general CNN implementation.
The other possibility is that the data type used on the backend has changed from double- to single- or even half-precision float. The smaller data types mean you can crunch more numbers faster at the cost of accuracy. For NN applications this is often acceptable because no individual number needs to be especially accurate for the net to produce acceptable results.
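If you want to experiment with lower precision yourself, tf.keras exposes a mixed-precision policy. This is a sketch assuming TensorFlow 2.4+ and a GPU with float16 support; it is a related knob, not what CuDNNLSTM does internally, and the shapes below are hypothetical:
import tensorflow as tf
from tensorflow.keras import layers, mixed_precision

mixed_precision.set_global_policy('mixed_float16')  # compute in float16, keep variables in float32

model = tf.keras.Sequential([
    layers.LSTM(128, input_shape=(100, 20)),                  # hypothetical sequence length and feature count
    layers.Dense(10, activation='softmax', dtype='float32'),  # keep the output layer in float32 for stability
])
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam')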
I had a similar problem today and found two things that may be helpful to others (this is a regression problem on a data set with ~2.1MM rows, running on a machine with 4 P100 GPUs):
Using the CuDNNLSTM layer instead of the LSTM layer on a GPU machine reduced the fit time from ~13500 seconds to ~400 seconds per epoch.
Increasing the batch size (~500 to ~4700) reduced it to ~130 seconds per epoch.
Reducing the batch size increased the loss and val_loss, so you'll need to make a decision about the trade-offs you want to make.
In Keras, CuDNNLSTM is the fast LSTM implementation backed by cuDNN:
model.add(CuDNNLSTM(units, input_shape=(len(X_train), len(X_train[0])), return_sequences=True))
It can only be run on the GPU with the TensorFlow backend.
We've trained a tf-seq2seq model for question answering. The main framework is from google/seq2seq. We use a bidirectional RNN (GRU encoders/decoders, 128 units) with a soft attention mechanism.
We limit the maximum length to 100 words; it mostly generates just 10~20 words.
For model inference, we try two cases:
normal (greedy) decoding: its inference time is about 40ms~100ms;
beam search with beam width 5: its inference time is about 400ms~1000ms.
We could try beam width 3, which may decrease the time, but it may also affect the final quality.
So, are there any suggestions to decrease inference time for our case? Thanks.
You can do network compression.
Cut the sentences into pieces with byte-pair encoding, a unigram language model, etc., and then try TreeLSTM.
You can try a faster softmax such as adaptive softmax (https://arxiv.org/pdf/1609.04309.pdf).
Try CuDNNLSTM.
Try dilated RNN.
Switch to CNN-like architectures such as dilated CNN, or to BERT, for better parallelization and more efficient GPU support.
If you require improved performance, I'd suggest you try OpenVINO. It reduces inference time by pruning the graph and fusing some operations. Although OpenVINO is optimized for Intel hardware, it should work with any CPU.
Here are some performance benchmarks for an NLP model (BERT) on various CPUs.
It's rather straightforward to convert a TensorFlow model to OpenVINO unless you have fancy custom layers. The full tutorial on how to do it can be found here. Some snippets below.
Install OpenVINO
The easiest way to do it is using PIP. Alternatively, you can use this tool to find the best way in your case.
pip install openvino-dev[tensorflow2]
Use Model Optimizer to convert the SavedModel
The Model Optimizer is a command-line tool that comes with the OpenVINO Development Package. It converts the TensorFlow model to IR, the default format for OpenVINO. You can also try FP16 precision, which should give you better performance without a significant accuracy drop (just change data_type). Run in the command line:
mo --saved_model_dir "model" --input_shape "[1, 3, 224, 224]" --data_type FP32 --output_dir "model_ir"
Run the inference
The converted model can be loaded by the runtime and compiled for a specific device, e.g. CPU or GPU (integrated into your CPU, like Intel HD Graphics). If you don't know what the best choice for you is, just use AUTO.
from openvino.runtime import Core
# Load the network
ie = Core()
model_ir = ie.read_model(model="model_ir/model.xml")
compiled_model_ir = ie.compile_model(model=model_ir, device_name="CPU")
# Get the output layer
output_layer_ir = compiled_model_ir.output(0)
# Run inference on the input image
result = compiled_model_ir([input_image])[output_layer_ir]
Disclaimer: I work on OpenVINO.