I am new to TensorFlow and machine learning. Recently I have been working on a model. My model looks like this:
Character-level embedding vector -> embedding lookup -> LSTM1
Word-level embedding vector -> embedding lookup -> LSTM2
[LSTM1 + LSTM2] -> single-layer MLP -> softmax layer
[LSTM1 + LSTM2] -> single-layer MLP -> WGAN discriminator
Code of the RNN model:
While working on this model I got the following error. I thought my batch size was too big, so I tried reducing it from 20 to 10, but it didn't help.
ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[24760,100]
[[Node: chars/bidirectional_rnn/bw/bw/while/bw/lstm_cell/split = Split[T=DT_FLOAT, num_split=4, _device="/job:localhost/replica:0/task:0/device:GPU:0"](gradients_2/Add_3/y, chars/bidirectional_rnn/bw/bw/while/bw/lstm_cell/BiasAdd)]]
[[Node: bi-lstm/bidirectional_rnn/bw/bw/stack/_167 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_636_bi-lstm/bidirectional_rnn/bw/bw/stack", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/device:CPU:0"]]
A tensor with shape [24760, 100] means 24760 * 100 * 32 bits / 8 / (1024 * 1024) ≈ 9.44 MB of memory. I am running the code on a Titan X (11 GB) GPU. What could be going wrong? Why does this type of error occur?
*Extra info*: The size of LSTM1 is 100; for the bidirectional LSTM it becomes 200.
The size of LSTM2 is 300; for the bidirectional LSTM it becomes 600.
*Note*: The error occurred after 32 epochs. My question is why there is an error after 32 epochs, and why not at the initial epochs.
I have been tweaking a lot these days trying to solve this problem.
In the end, I haven't solved the mystery of the memory size described in the question. My guess is that while computing the gradients, TensorFlow allocates a lot of additional memory. I would need to check the TensorFlow source, which seems very cumbersome at this time. You can check how much memory your model is using from the terminal with the following command:
nvidia-smi
From its output you can estimate how much additional memory you have available.
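If you want to watch the usage change while training is running, you can refresh that output every second:
watch -n 1 nvidia-smi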
But the solution to this type of problem lies in reducing the batch size. In my case, reducing the batch size to 3 worked. This may vary from model to model.
But what if you are using a model whose embedding matrix is so big that you cannot load it into memory at all?
The solution is to write some painful code.
You have to perform the lookup on the embedding matrix outside the model and then load the looked-up embeddings into the model. In short, for each batch, you have to feed the lookup matrices to the model (via the feed_dict argument in sess.run()).
Next you will face a new problem:
You cannot make the embeddings trainable this way. The solution is to feed the embeddings through a placeholder and assign them to a Variable (say A). After each batch of training, the learning algorithm updates the variable A. Then you compute the value of A with TensorFlow and assign it back to your embedding matrix, which lives outside the model. (I did say the process is painful.)
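A minimal TF1-style sketch of that placeholder-to-Variable trick, with made-up sizes and names (big_embeddings, batch_emb, and the dummy loss are just for illustration, not from the original code):

import numpy as np
import tensorflow as tf

# Hypothetical sizes, just for illustration.
full_vocab, batch_vocab, emb_dim = 100000, 50, 8

# The full embedding matrix stays outside the graph as a numpy array.
big_embeddings = np.random.randn(full_vocab, emb_dim).astype(np.float32)

# Inside the graph: a placeholder to load one batch's slice, a trainable
# Variable holding that slice, and an assign op connecting the two.
emb_ph = tf.placeholder(tf.float32, [batch_vocab, emb_dim])
emb_var = tf.get_variable("batch_emb", [batch_vocab, emb_dim])
load_emb = emb_var.assign(emb_ph)

word_ids = tf.placeholder(tf.int32, [None])          # ids within the slice
vectors = tf.nn.embedding_lookup(emb_var, word_ids)
loss = tf.reduce_mean(tf.square(vectors))            # dummy loss for the sketch
train_op = tf.train.GradientDescentOptimizer(0.1).minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    rows = np.random.choice(full_vocab, batch_vocab, replace=False)
    sess.run(load_emb, feed_dict={emb_ph: big_embeddings[rows]})
    sess.run(train_op, feed_dict={word_ids: np.arange(batch_vocab)})
    # Write the updated slice back into the big matrix outside the graph.
    big_embeddings[rows] = sess.run(emb_var)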
Now your next question should be: what if the embedding lookup itself is too big to feed to the model? That is a fundamental problem you cannot avoid, and it is why there is such a price difference between the NVIDIA GTX 1080, 1080 Ti and NVIDIA Titan Xp, even though the 1080 Ti and 1080 run at higher clock frequencies.
*Note*: The error occurred after 32 epochs. My question is why there is an error after 32 epochs, and why not at the initial epochs.
This is a major clue that the graph is not static during execution. By that I mean, you're likely doing sess.run(tf.something) instead of
my_something = tf.something
with tf.Session() as sess:
    sess.run(my_something)
I ran into the same problem trying to implement a stateful RNN. I would occasionally reset the state, so I was doing sess.run([reset if some_condition else tf.no_op()]). Simply adding nothing = tf.no_op() to my graph and using sess.run([reset if some_condition else nothing]) solved my problem.
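To make the pattern concrete, here is a rough, self-contained sketch (the state variable and reset condition are hypothetical, just to illustrate building the ops once before the loop):

import tensorflow as tf

state_var = tf.get_variable("rnn_state", [4, 16], initializer=tf.zeros_initializer())
reset_state = state_var.assign(tf.zeros_like(state_var))
nothing = tf.no_op()          # built once, before the training loop

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for step in range(1000):
        some_condition = (step % 100 == 0)
        # Bad: sess.run(reset_state if some_condition else tf.no_op())
        #   -> calling tf.no_op() inside the loop adds a new node every step.
        # Good: only reuse ops that were created before the loop started.
        sess.run(reset_state if some_condition else nothing)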
If you could post the training loop, it would be easier to tell if that is what's going wrong.
I also faced the same problem while training a convolutional autoencoder model and solved it by reducing the batch size. My earlier batch size was 64; reducing it to 32 made the error go away.
The error message:
InternalError: Failed copying input tensor from /job:localhost/replica:0/task:0/device:CPU:0 to /job:localhost/replica:0/task:0/device:GPU:0 in order to run _EagerConst: Dst tensor is not initialized.
The code:
from tensorflow.keras.applications import VGG16

VGG_model = VGG16(weights='imagenet', include_top=False, input_shape=(SIZE, SIZE, 3))
for layer in VGG_model.layers:
    layer.trainable = False
VGG_model.summary()  # Trainable parameters will be 0
train_feature_extractor = VGG_model.predict(x_train)
I tried to reduce the dimensionality of the input data and it worked, but that affected the accuracy.
So I ran the code on the CPU only, and that worked too, but it takes a long time.
This error message indicates that your GPU does not have enough memory to load your data.
If you want to use your GPU, you need to be sure that your data (batch) fits into its memory. Unfortunately, you do not provide additional information about your data preprocessing and the shape of x_train.
If x_train is a set of multiple images, a default batch size of 32 will be used.
Try manually reducing the batch size by calling VGG_model.predict(x_train, batch_size) with the additional batch_size argument set to a value smaller than 32.
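For example (the value 8 here is just a guess; use whatever your GPU can handle):

train_feature_extractor = VGG_model.predict(x_train, batch_size=8)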
Tensorflow Documentation: predict()
I'm training an LSTM network with TensorFlow in Python and wanted to switch to tf.contrib.cudnn_rnn.CudnnLSTM for faster training. What I did was replace
cells = tf.nn.rnn_cell.LSTMCell(self.num_hidden)
initial_state = cells.zero_state(self.batch_size, tf.float32)
rnn_outputs, _ = tf.nn.dynamic_rnn(cells, my_inputs, initial_state = initial_state)
with
lstm = tf.contrib.cudnn_rnn.CudnnLSTM(1, self.num_hidden)
rnn_outputs, _ = lstm(my_inputs)
I'm experiencing a significant training speedup (more than 10x), but at the same time my performance metric goes down. The AUC on a binary classification task is 0.741 when using LSTMCell and 0.705 when using CudnnLSTM. I'm wondering whether I'm doing something wrong or whether it's a difference in implementation between the two, and if it's the latter, how to get my performance back while keeping CudnnLSTM.
The training dataset has 15,337 sequences of varying length (up to a few hundred elements) that are padded with zeros to be the same length in each batch. All the code is the same, including the TF Dataset API pipeline and all evaluation metrics. I ran each version a few times and in all cases it converges around those values.
Moreover, I have a few datasets that can be plugged into exactly the same model, and the problem persists on all of them.
In the tensorflow code for cudnn_rnn I found a sentence saying:
Cudnn LSTM and GRU are mathematically different from their tf
counterparts.
But there's no explanation of what those differences really are...
It seems tf.contrib.cudnn_rnn.CudnnLSTM is time-major, so it should be fed sequences of shape (seq_len, batch_size, embedding_size) instead of (batch_size, seq_len, embedding_size), which means you would have to transpose your input (I think; I can't be sure given the messy TensorFlow documentation, but you may want to test that. See the links below if you wish to check it).
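If that is the case, the fix would be a transpose before and after the Cudnn layer, roughly like this (an untested sketch based on the snippet in the question; double-check against your TF version):

# my_inputs is assumed to be batch-major: (batch_size, seq_len, embedding_size)
time_major_inputs = tf.transpose(my_inputs, [1, 0, 2])    # -> (seq_len, batch, emb)
lstm = tf.contrib.cudnn_rnn.CudnnLSTM(1, self.num_hidden)
rnn_outputs, _ = lstm(time_major_inputs)
rnn_outputs = tf.transpose(rnn_outputs, [1, 0, 2])        # back to batch-major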
More information on the topic is here (there is another link in there pointing towards the math differences), except one thing seems to be wrong: not only GRU is time-major, LSTM is as well (as pointed out by this issue).
I would advise against using tf.contrib, as it's even messier (and will finally be left out of the TensorFlow 2.0 release), and to stick to keras if possible (as it will be the main front end of the upcoming TensorFlow 2.0) or tf.nn, as it's going to be part of the tf.Estimator API (though it's far less readable, IMO).
... or consider using PyTorch to save yourself the hassle, where input shapes (and their meaning) are provided in the documentation at the very least.
I have to say this might be one of the weirdest problems I've ever met.
I was implementing ResNet to perform 10-class classification on CIFAR-10 with TensorFlow. Everything seemed to be fine during the training phase: the loss decreased steadily and the accuracy on the training set kept increasing to over 90%. However, the results were totally abnormal during inference.
I have analyzed my code very carefully and ruled out the possibility of making mistakes when feeding the data or saving/loading the model. So the only difference between the training phase and the test phase lies in the batch normalization layers.
For the BN layers, I used tf.layers.batch_normalization directly, and I thought I had paid attention to every pitfall of using tf.layers.batch_normalization.
Specifically, I've included the dependency for train_op as follows:
update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
with tf.control_dependencies(update_ops):
    self.train_op = optimizer.minimize(self.losses)
Also, for saving and loading the model, I've specified var_list as tf.global_variables(). Moreover, I used training=True for training and training=False for testing.
Nevertheless, the accuracy during inference was only around 10%, even when applied to the same data used for training. And when I output the last layer of the network (i.e., the 10-dimensional vector fed into the softmax), I found that the magnitude of each entry in that vector during training was always around 1e0 or 1e-1, while for inference it could be 1e4 or even 1e5. The strangest part was that the magnitude of the 10-dimensional vector during inference correlated with the batch size used in training, i.e., the bigger the batch size, the smaller the magnitude.
Besides, I also found that the magnitudes of the moving_mean and moving_variance of the BN layers correlated with the batch size too. But why is that even possible? I thought moving_mean is supposed to be the mean over the entire training population, and likewise for moving_variance. So why should it have anything to do with the batch size?
I think there must be something I don't know about using BN with TensorFlow. This problem is really driving me crazy! I never expected to deal with such a problem in TensorFlow, considering how convenient it is to use BN with PyTorch!
The problem has been solved!
I read the TensorFlow source code. Based on my understanding, the value of momentum in tf.layers.batch_normalization should be about 1 - 1/num_of_batches. The default value is 0.99, which means the default is most suitable when there are 100 batches in the training data.
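In code, that amounts to something like this (x, is_training, num_training_examples and batch_size stand for whatever your model already has; the momentum argument is the only point here):

num_of_batches = num_training_examples // batch_size   # batches per epoch
bn_out = tf.layers.batch_normalization(
    x,
    momentum=1.0 - 1.0 / num_of_batches,   # instead of the default 0.99
    training=is_training)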
I didn't find any documentation mentioning this. I hope it is helpful to someone who runs into the same problem with BN in TensorFlow!
First off, this is not my code. I just changed it to be able to train it on a TPU. The original author is here. I am able to run it on the GPU-accelerated runtime in a Colaboratory notebook, but it seems to break when I use the TPU-accelerated runtime.
Here is my notebook. It just gives me an error that the activation output is not the right size.
ValueError: Error when checking target: expected activation_21 to have shape (1,) but got array with shape (205,)
I would appreciate any help I can get, as I have spent about 3 hours debugging this.
Since you are one-hot encoding the labels and therefore they are not sparse, you need to use 'categorical_accuracy' as the metric:
model.compile(..., metrics=['categorical_accuracy'])
or more succinctly use 'accuracy' to let Keras infer the right metric based on the loss function used (which in this case would be 'categorical_accuracy' since you are using categorical_crossentropy as the loss function):
model.compile(..., metrics=['accuracy'])
I'm trying to train a neural net on a GPU using Keras and am getting a "Resource exhausted: OOM when allocating tensor" error. The specific tensor it's trying to allocate isn't very big, so I assume some previous tensor consumed almost all the VRAM. The error message comes with a hint that suggests this:
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
That sounds good, but how do I do it? RunOptions appears to be a Tensorflow thing, and what little documentation I can find for it associates it with a "session". I'm using Keras, so Tensorflow is hidden under a layer of abstraction and its sessions under another layer below that.
How do I dig underneath everything to set this option in such a way that it will take effect?
TF1 solution:
It's not as hard as it seems. What you need to know is that, according to the documentation, the **kwargs parameter passed to model.compile will be passed to session.run.
So you can do something like:
import tensorflow as tf
run_opts = tf.RunOptions(report_tensor_allocations_upon_oom = True)
model.compile(loss = "...", optimizer = "...", metrics = "..", options = run_opts)
And it should be passed directly each time session.run is called.
TF2:
The solution above works only for tf1. For tf2, unfortunately, it appears there is no easy solution yet.
Currently, it is not possible to add the options to model.compile. See: https://github.com/tensorflow/tensorflow/issues/19911
OOM means out of memory. Maybe it is using more memory than is available at that point.
Decrease batch_size significantly. I set it to 16 and then it worked fine.
I got the same error, but only when the training dataset was about the same size as my GPU memory. For example, with 4 GB of video card memory I can train the model with a ~3.5 GB dataset. The workaround for me was to create a custom data_generator function, with yield, indices, and a lookback.
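A minimal sketch of such a generator (the lookback value, field names and shapes are made up for illustration; adapt them to your data):

import numpy as np

def data_generator(data, targets, lookback, batch_size):
    # Yield (samples, labels) batches instead of loading everything at once.
    i = lookback
    while True:
        rows = np.arange(i, min(i + batch_size, len(data)))
        samples = np.stack([data[r - lookback:r] for r in rows])
        labels = targets[rows]
        i += batch_size
        if i >= len(data):
            i = lookback                     # start over for the next epoch
        yield samples, labels

# Usage with Keras (fit_generator on older Keras versions):
# model.fit(data_generator(train_x, train_y, lookback=30, batch_size=32),
#           steps_per_epoch=(len(train_x) - 30) // 32, epochs=10)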
The other suggestion I got was to start learning the true TensorFlow framework and work with tf.Session (example).
OOM is nothing but "OUT OF MEMORY".
TensorFlow throws this error when it runs out of GPU memory (VRAM) while loading batches into memory.
I was trying to train a Vision Transformer on the CIFAR-100 dataset.
GPU:
GTX 1650 w/ 4GB vRAM
Initially I had batch_size set to 256, which was totally insane for such a GPU, and I was getting the same OOM error.
I tweaked it to batch_size = 16 (or something lower, whatever your GPU can handle), and training worked perfectly fine.
So, always choose a smaller batch_size if you are training on a laptop or a mid-range GPU.
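For reference, with a Keras-style training call that usually just means something like (dataset and model names assumed):

model.fit(train_images, train_labels, batch_size=16, epochs=50)
# or, if the data goes through a tf.data pipeline:
# train_ds = train_ds.batch(16)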