Multiprocessing with Keras - python

I am trying to train a CNN model with Keras using the 36 cores that I have. I am following:
How to run Keras on multiple cores?
But it doesn't make my code faster, and I am not sure whether it uses all the available cores or just one core while the rest remain unused.
My code is:
The model is defined with Keras ==>
import tensorflow as tf
from keras.backend import tensorflow_backend as K
from keras.callbacks import EarlyStopping

sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
K.set_session(sess)

CNN_Model = CNN_model()
ES = EarlyStopping(monitor='val_loss', mode='min', verbose=1, patience=150)
history = CNN_Model.fit(IM_Training, Y_Train, batch_size=256, epochs=250, verbose=1,
                        validation_data=(IM_Valid, Y_Val), callbacks=[ES])
How can I make sure that the code uses all the cores?

There are 2 main ways you can get parallelism when evaluating a neural network:
Matrix computations
micro-batch parallelism
The computational graph for many neural networks is sequential (hence Keras having a Sequential model), i.e. one computes layer1 ... layerN in order on both the forward and backward passes. A sequential network can't be sped up by distributing its layers across different cores.
However, most of the computation consists of matrix operations, which are typically implemented using a high-performance library like BLAS that uses all the cores available to the CPU. Typically, the larger the batch size, the larger the opportunity for parallelism.
Micro-batch parallelism is the strategy used by multi_gpu_model, where different micro-batches are distributed to different compute units (it primarily makes sense for GPUs).
Non-sequential models can also be parallelised via careful device placement; I'm not sure that is the scenario here. But the TL;DR is: increase your batch_size and enjoy all 36 of your cores spinning on matrix computations.
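If you also want to make sure TensorFlow is allowed to use all cores, here is a minimal sketch using the TF 1.x Session/ConfigProto API from the question (the thread counts and the larger batch_size are illustrative values, not requirements):
import tensorflow as tf
from keras.backend import tensorflow_backend as K

# Explicitly allow TensorFlow to use all 36 cores, both within a single
# matrix op (intra_op) and across independent ops (inter_op).
config = tf.ConfigProto(intra_op_parallelism_threads=36,
                        inter_op_parallelism_threads=36,
                        log_device_placement=True)
K.set_session(tf.Session(config=config))

# A larger batch gives BLAS bigger matrices to parallelise over.
history = CNN_Model.fit(IM_Training, Y_Train, batch_size=1024, epochs=250, verbose=1,
                        validation_data=(IM_Valid, Y_Val), callbacks=[ES])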

Related

Do activations of untrainable layers occupy GPU memory during training in Tensorflow-GPU?

We deal with semantic segmentation of large 3D volumes involving dozens of classes, and we often find ourselves unable to begin training (out-of-memory errors) in some of our larger models with our 12 GB of GPU memory. We think the following question is important to help us strategize our training given our GPU memory constraint.
I am working on a two-step training scheme [illustration] for 3D segmentation of 50 classes. Our model involves U-Net architectures modified for our task and built with tf.keras layers.
The first step is pre-training a cleanup network (S1p).
The second step will involve building a full model comprised of two subnetworks (S1 and S2). S1 layers will be loaded from the pre-trained S1p, and these layers will be set to untrainable. Meanwhile, S2 will be initialized with random weights.
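For reference, here is a minimal tf.keras sketch of this setup (build_s1, build_s2 and the checkpoint filename are hypothetical placeholders rather than our actual code):
import tensorflow as tf

s1 = build_s1()                        # same architecture as the pre-trained S1p
s1.load_weights("s1p_pretrained.h5")   # hypothetical checkpoint path
s1.trainable = False                   # freeze every layer of S1

s2 = build_s2()                        # second subnetwork, randomly initialised

inputs = tf.keras.Input(shape=(None, None, None, 1))  # 3D volume, single channel
outputs = s2(s1(inputs))
full_model = tf.keras.Model(inputs, outputs)           # only S2 weights will be updated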
I understand that during training, the activations of all layers need to be kept in memory for back-propagation, whereas during inference only the activations of the layers currently executing are stored. In my case, because the S1 layers are untrainable and only the weights in S2 need training, does TensorFlow know that the S1 layer activations do not need to be kept in GPU memory during training?
Thank you.
Version details:
python=3.9
tensorflow-gpu=2.9.1

lstm will not use cuDNN kernels since it doesn't meet the criteria. It will use a generic GPU kernel as fallback when running on GPU

I am running the following code for an LSTM on Databricks with a GPU:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dropout, Dense, LeakyReLU
from tensorflow.keras.optimizers import Adam

model = Sequential()
model.add(LSTM(64, activation=LeakyReLU(alpha=0.05),
               batch_input_shape=(1, timesteps, n_features),
               stateful=False, return_sequences=True))
model.add(Dropout(0.2))
model.add(LSTM(32))
model.add(Dropout(0.2))
model.add(Dense(n_features))
model.compile(loss='mean_squared_error', optimizer=Adam(learning_rate=0.001), metrics=['acc'])
model.fit(generator, epochs=epochs, verbose=0, shuffle=False)
but the following warning keeps appearing
WARNING:tensorflow:Layer lstm will not use cuDNN kernels since it doesn't meet the criteria. It will use a generic GPU kernel as fallback when running on GPU.
It trains much slower than it does without a GPU.
I'm using DBR 9.0 ML (includes Apache Spark 3.1.2, GPU, Scala 2.12)
Do I need any additional libraries for this?
CUDNN has functionality to specifically accelerate LSTM and GRU layers. These GRU/LSTM layers can only be accelerated if they meet certain criteria. In your case the problem is that you are using the LeakyReLU activation; the cuDNN LSTM acceleration only works if the activation is tanh.
Quoting from the documentation (https://www.tensorflow.org/api_docs/python/tf/keras/layers/LSTM)
The requirements to use the cuDNN implementation are:
activation == tanh
recurrent_activation == sigmoid
recurrent_dropout == 0
unroll is False
use_bias is True
Inputs, if use masking, are strictly right-padded.
Eager execution is enabled in the outermost context.
Your LSTM should still run on the GPU, but it will be constructed from scan and matmul operations and will therefore be much slower. In my experience the cuDNN LSTM/GRU acceleration works so well that both of these layers run faster than the SimpleRNN layer (which is not accelerated by cuDNN), even though SimpleRNN is a much simpler layer.
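If you want the cuDNN path here, one option (a sketch, not the only possible fix, and note that it is not mathematically identical to using LeakyReLU inside the cell) is to keep the default tanh activation in the LSTM and apply LeakyReLU as a separate layer on its output:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dropout, Dense, LeakyReLU
from tensorflow.keras.optimizers import Adam

model = Sequential()
# Default activation='tanh' and recurrent_activation='sigmoid'
# keep the layer eligible for the cuDNN kernel.
model.add(LSTM(64, batch_input_shape=(1, timesteps, n_features),
               stateful=False, return_sequences=True))
model.add(LeakyReLU(alpha=0.05))   # optional extra non-linearity outside the LSTM
model.add(Dropout(0.2))
model.add(LSTM(32))
model.add(Dropout(0.2))
model.add(Dense(n_features))
model.compile(loss='mean_squared_error', optimizer=Adam(learning_rate=0.001), metrics=['acc'])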
This is what François Chollet, creator of the Keras library and a main contributor to the TensorFlow framework, says about RNN runtime performance in his book Deep Learning with Python, 2nd edition:
Recurrent models with very few parameters, like the ones in this chapter, tend to be significantly faster on a multicore CPU than on GPU, because they only involve small matrix multiplications, and the chain of multiplications is not well parallelizable due to the presence of a for loop. But larger RNNs can greatly benefit from a GPU runtime.
When using a Keras LSTM or GRU layer on GPU with default keyword arguments, your layer will be leveraging a cuDNN kernel, a highly optimized, low-level, NVIDIA-provided implementation of the underlying algorithm. As usual, cuDNN kernels are a mixed blessing: they’re fast, but inflexible—if you try to do anything not supported by the default kernel, you will suffer a dramatic slowdown, which more or less forces you to stick to what NVIDIA happens to provide. For instance, recurrent dropout isn’t supported by the LSTM and GRU cuDNN kernels, so adding it to your layers forces the runtime to fall back to the regular TensorFlow implementation, which is generally two to five times slower on GPU (even though its computational cost is the same).
As a way to speed up your RNN layer when you can’t use cuDNN, you can try unrolling it. Unrolling a for loop consists of removing the loop and simply inlining its content N times. In the case of the for loop of an RNN, unrolling can help TensorFlow optimize the underlying computation graph. However, it will also considerably increase the memory consumption of your RNN—as such, it’s only viable for relatively small sequences (around 100 steps or fewer). Also, note that you can only do this if the number of timesteps in the data is known in advance by the model (that is to say, if you pass a shape without any None entries to your initial Input()). It works like this:
from tensorflow import keras
from tensorflow.keras import layers

inputs = keras.Input(shape=(sequence_length, num_features))
x = layers.LSTM(32, recurrent_dropout=0.2, unroll=True)(inputs)

Why it's necessary to freeze all the inner state of a Batch Normalization layer when fine-tuning

The following content comes from a Keras tutorial:
This behavior has been introduced in TensorFlow 2.0, in order to enable layer.trainable = False to produce the most commonly expected behavior in the convnet fine-tuning use case.
Why should we freeze the layer when fine-tuning a convolutional neural network? Is it because of some mechanism in tensorflow keras, or because of the batch normalization algorithm itself? I ran an experiment myself and found that if trainable is not set to false, the model tends to catastrophically forget what it has learned before and returns a very large loss during the first few epochs. What's the reason for that?
During training, varying batch statistics act as a regularization mechanism that can improve the ability to generalize. This can help to minimize overfitting when training for a high number of iterations. Indeed, using a very large batch size can harm generalization, as there is less variation in the batch statistics and therefore less regularization.
When fine-tuning on a new dataset, the batch statistics are likely to be very different if the fine-tuning examples have different characteristics to the examples in the original training dataset. Therefore, if batch normalization is not frozen, the network will learn new batch normalization parameters (gamma and beta in the batch normalization paper) that are different from what the other network parameters were optimised for during the original training. Relearning all the other network parameters is often undesirable during fine-tuning, either because of the required training time or because of the small size of the fine-tuning dataset. Freezing batch normalization avoids this issue.
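A minimal sketch of the usual fine-tuning recipe in tf.keras (the ResNet50 base and the 10-class head are illustrative choices, not taken from the question):
import tensorflow as tf

# Pre-trained convnet used as a frozen feature extractor.
base = tf.keras.applications.ResNet50(include_top=False, pooling="avg")
base.trainable = False   # freezes the weights, including BatchNormalization's gamma/beta

inputs = tf.keras.Input(shape=(224, 224, 3))
x = base(inputs, training=False)   # keep BN in inference mode: use moving mean/variance
outputs = tf.keras.layers.Dense(10, activation="softmax")(x)
model = tf.keras.Model(inputs, outputs)
model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
              loss="sparse_categorical_crossentropy")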

Mixture usage of CPU and GPU in Keras

I am building a neural network in Keras that includes multiple LSTM, Permute and Dense layers.
It seems LSTM is GPU-unfriendly, so I did some research and used:
with tf.device('/cpu:0'):
    out = LSTM(cells)(inp)
But based on my understanding of with, a with statement is a try...finally block that ensures clean-up code is executed. I don't know whether the following mixed CPU/GPU code works or not. Will it accelerate training?
with tf.device('/cpu:0'):
    out = LSTM(cells)(inp)
with tf.device('/gpu:0'):
    out = Permute(some_shape)(out)
with tf.device('/cpu:0'):
    out = LSTM(cells)(out)
with tf.device('/gpu:0'):
    out = Dense(output_size)(out)
As you may read here - tf.device is a context manager which switches the default device to the one passed as its argument within the context (block) it creates. So this code should run everything under '/cpu:0' on the CPU and the rest on the GPU.
Whether it will speed up your training is really hard to answer because it depends on the machine you use - but I don't expect the computations to be faster, as each change of device causes data to be copied between GPU RAM and machine RAM. This could even slow down your computations.
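For comparison, a sketch that avoids those back-and-forth copies by keeping the whole stack on a single device (cells, some_shape, inp and output_size are the names from the question; whether '/cpu:0' or '/gpu:0' is faster depends on your hardware):
import tensorflow as tf
from keras.layers import LSTM, Permute, Dense

# Placing every layer under one device context avoids copying the
# intermediate tensors between host and GPU memory after each layer.
with tf.device('/gpu:0'):
    out = LSTM(cells, return_sequences=True)(inp)  # return_sequences needed to stack LSTMs
    out = Permute(some_shape)(out)
    out = LSTM(cells)(out)
    out = Dense(output_size)(out)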
I created a model using 2 LSTM and 1 Dense layers and trained it on my GPU (NVidia GTX 10150Ti). Here are my observations:
Use CuDNNLSTM https://keras.io/layers/recurrent/#cudnnlstm
Use a batch size that allows more GPU parallelism; if I use a very small batch size (2-10) the GPU's cores are not fully utilized, so I used 100 as the batch size.
If I train my network on the GPU and try to use it for predictions on the CPU, it works in terms of compiling and running, but the predictions are weird. In my case I have the luxury of using a GPU for prediction as well.
For a multi-layer LSTM, you need to use return_sequences=True.
Here is a sample snippet:
model = keras.Sequential()
model.add(keras.layers.CuDNNLSTM(neurons,
                                 batch_input_shape=(nbatch_size, reshapedX.shape[1], reshapedX.shape[2]),
                                 return_sequences=True,
                                 stateful=True))

Tensorflow - stop restoring network parameters

I'm attempting to make multiple sequential predictions from a tensorflow network, but performance seems very poor (~500ms per prediction for a 2-layer 8x8 convolutional network) even for a CPU. I suspect that part of the problem is that it appears to be reloading the network parameters every time. Each call to classifier.predict in the code below results in the following line of output - which I therefore see hundreds of times.
INFO:tensorflow:Restoring parameters from /tmp/model_data/model.ckpt-102001
How can I reuse the checkpoint that is already loaded?
(I can't do batch predictions here because the output of the network is a move to play in a game, which then needs to be applied to the current state before feeding in the new game state.)
Here's the loop that's doing the predictions.
def rollout(classifier, state):
    while not state.terminated:
        predict_input_fn = tf.estimator.inputs.numpy_input_fn(x={"x": state.as_nn_input()}, shuffle=False)
        prediction = next(classifier.predict(input_fn=predict_input_fn))
        # Select a move according to the network's output probabilities
        index = np.random.choice(NUM_ACTIONS, p=prediction["probabilities"])
        state.apply_move(index)
classifier is a tf.estimator.Estimator created with...
classifier = tf.estimator.Estimator(
    model_fn=cnn_model_fn, model_dir=os.path.join(tempfile.gettempdir(), 'model_data'))
The Estimator API is a high-level API.
The tf.estimator framework makes it easy to construct and train machine learning models via its high-level Estimator API. Estimator offers classes you can instantiate to quickly configure common model types such as regressors and classifiers.
The Estimator API abstracts away lots of the complexity of TensorFlow, but loses some generality in the process. Having read the code, it's clear that there's no way to run multiple sequential predictions without reloading the model each time. The low-level TensorFlow APIs allow this behaviour. But...
Keras is a high-level framework that supports this use case. Simply define the model and then call predict repeatedly.
def rollout(model, state):
    while not state.terminated:
        predictions = model.predict(state.as_nn_input())
        for prediction in predictions:
            index = np.random.choice(bt.ACTIONS, p=prediction)
            state.apply_move(index)
Unscientific benchmarking shows that this is ~100x faster.
