Tensorflow gelu out of memory error, but not relu - python

Gelu activation causes my model to return an OOM error when training, but when I switch to relu the problem goes away. Even if the model is doubled in size, the model with relu performs fine.
if activation == "gelu":
    out = tfa.layers.GELU()(out)
elif activation == "relu":
    out = KL.ReLU()(out)
The OOM error is not raised at the gelu op itself, but since the two models are identical apart from the activation function, I don't think the node reported in the traceback is the real issue:
File ".../python3.9/site-packages/keras/backend.py", line 3693, in resize_images
x = tf.image.resize(x, new_shape, method=interpolations[interpolation])
Node: 'model/up_sampling2d_2/resize/ResizeNearestNeighbor'
2 root error(s) found.
(0) RESOURCE_EXHAUSTED: OOM when allocating tensor with shape[8,320,240,64] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[{{node model/up_sampling2d_2/resize/ResizeNearestNeighbor}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.
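A minimal sketch of one way to isolate the activation as the variable, assuming TF >= 2.4 where a built-in gelu activation exists (the helper below is hypothetical, not from the question): gelu's gradient has to save the layer input, while relu's only needs a cheap mask, so gelu models typically need more activation memory, and lowering batch_size is the other usual knob.
import tensorflow as tf
from tensorflow.keras import layers as KL

def activate(out, activation):
    # Hypothetical helper: built-in gelu (TF >= 2.4) instead of tfa.layers.GELU;
    # whether this changes peak memory depends on the TF/tfa versions in use.
    if activation == "gelu":
        out = KL.Activation(tf.keras.activations.gelu)(out)
    elif activation == "relu":
        out = KL.ReLU()(out)
    return out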

Related

In PyTorch, what happens when I move a module to a cuda device?

In some example code I see this:
self.models["pose_encoder"] = \
networks.ResnetEncoder(18, self.opt.weights_init == "pretrained",
num_input_images=self.num_pose_frames)
self.models["pose_encoder"].to("cuda:4")
with ResnetEncoder defined by
class ResnetEncoder(nn.Module):
    """Pytorch module for a resnet encoder
    """
    def __init__(self, num_layers, pretrained, num_input_images=1, **kwargs):
        super(ResnetEncoder, self).__init__()
I am confused about what happens to the module when the .to("cuda:4") part runs. Do all the tensors defined in the module move to cuda:4?
What causes my error now is that I have a member function in the module, and in that function I define a tensor:
def A(self):
    self.mytensor = []
    # after some operations this is no longer empty
    self.mytensor.cuda()
and this error occurs:
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:4 and cuda:0!
I know the cuda:0 comes from the .cuda() call in the last line of the code, but I don't know how to move self.mytensor to cuda:4. I could pass the device as a parameter in the module's constructor, but I suspect there is a better way to do this, and since I want the device to change at runtime I don't want to use os.environ either. Is there any way to do this?
Following the documentation, .to(device) is the torch.Tensor (and nn.Module) method used to move a tensor to a different device.
The device name can be cuda:<gpu_index> or cpu; .to('cuda:4') moves the tensor to GPU number 4 (you can check the available GPUs with nvidia-smi in a terminal).
Any operation between two tensors requires that they be on the same device. The error is raised because your self.mytensor ends up on GPU 0; you can move it to GPU 4 with self.mytensor = self.mytensor.to('cuda:4') (for tensors, .to() returns a new tensor rather than moving it in place, so you must reassign).
Note: calling .cuda() with no argument is equivalent to .to('cuda:0') (the current default device). You can specify the device with self.mytensor.cuda(4).
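A common pattern for this (a sketch with illustrative names, not the asker's ResnetEncoder) is to derive the device from the module's own parameters at runtime, so tensors created inside member functions follow wherever .to(...) moved the module:
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(8, 8)

    def forward(self, x):
        # The parameters already live on whatever device .to(...) chose,
        # so runtime tensors can be created directly on that device.
        device = next(self.parameters()).device
        extra = torch.zeros(x.size(0), 8, device=device)
        return self.linear(x) + extra

model = Encoder().to("cuda:4" if torch.cuda.device_count() > 4 else "cpu")
x = torch.randn(2, 8, device=next(model.parameters()).device)
out = model(x)
This way the device can change at runtime with a single model.to(...) call and no environment variables.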

RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: PyTorch error

I am trying to run some code in PyTorch but I got stuck at this point.
On the first iteration, both backward operations, for the Discriminator and the Generator, run fine:
....
self.G_loss.backward(retain_graph=True)
self.D_loss.backward()
...
At the second iteration, when self.G_loss.backward(retain_graph=True) executes, I get this error:
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [8192, 512]] is at version 2; expected version 1 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!
According to torch.autograd.set_detect_anomaly, the last of the following lines in the Discriminator network is responsible for this:
bottleneck = bottleneck[:-1]
self.embedding = x.view(x.size(0), -1)
self.logit = self.layers[-1](self.embedding)
The strange thing is that I have used that network architecture in other code where it worked properly. Any suggestions?
The full error:
site-packages\torch\autograd\__init__.py", line 127, in backward
allow_unreachable=True) # allow_unreachable flag
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [8192, 512]] is at version 2; expected version 1 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!
Solved by removing the loss += loss_val accumulation lines.
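For context, a minimal sketch (illustrative, not the asker's GAN) of how this class of error arises when a loss tensor's graph outlives an in-place parameter update, which is exactly what accumulating loss += loss_val across optimizer steps does:
import torch

w = torch.randn(3, requires_grad=True)
opt = torch.optim.SGD([w], lr=0.1)

loss1 = (w * w).sum()   # the multiply saves w (version 0) for its backward pass
loss2 = w.sum()
loss2.backward()        # e.g. the Discriminator update
opt.step()              # in-place update of w bumps its version counter

loss1.backward()        # RuntimeError: ... modified by an inplace operation
Accumulating plain numbers instead (running += loss.item()), or recomputing the loss after each optimizer step, avoids keeping the stale graph around.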

Why does keras model.fit use so much memory despite using allow_growth=True?

Thanks mostly to this question, I have been able to stop TensorFlow from allocating memory I didn't want allocated. However, I have recently found that despite using set_session with allow_growth=True, calling model.fit still allocates all of the memory, and I can no longer use it for the rest of my program, even after the function has exited and the model, being a local variable, should no longer hold any allocated memory.
Here is some example code demonstrating this:
from numpy import array
from keras import Input, Model
from keras.layers import Conv2D, Dense, Flatten
from keras.optimizers import SGD
# stops keras/tensorflow from allocating all the GPU's memory immediately
from tensorflow.compat.v1.keras.backend import set_session
from tensorflow.compat.v1 import Session, ConfigProto, GPUOptions
tf_config = ConfigProto(gpu_options=GPUOptions(allow_growth=True))
session = Session(config=tf_config)
set_session(session)
# makes the neural network
def make_net():
    input = Input((2, 3, 3))
    conv = Conv2D(256, (1, 1))(input)
    flattened_input = Flatten()(conv)
    output = Dense(1)(flattened_input)
    model = Model(inputs=input, outputs=output)
    sgd = SGD(0.2, 0.9)
    model.compile(sgd, 'mean_squared_error')
    model.summary()
    return model

def make_data(input_data, target_output):
    input_data.append([[[0 for i in range(3)] for j in range(3)] for k in range(2)])
    target_output.append(0)

def main():
    data_amount = 4096
    input_data = []
    target_output = []
    model = make_net()
    for i in range(data_amount):
        make_data(input_data, target_output)
    model.fit(array(input_data), array(target_output), batch_size=len(input_data))
    return

while True:
    main()
When I run this code with the Pycharm debugger, I find that the GPU RAM used stays at around 0.1GB until I run model.fit for the first time, at which point the memory usage shoots up to 3.2GB of my 4GB of GPU RAM. I have also noted that the memory usage doesn't increase after the first time that model.fit is run and that if I remove the convolutional layer from my network, the memory increase doesn't happen at all.
Could someone please shed some light on my problem?
UPDATE: Setting per_process_gpu_memory_fraction in GPUOptions to 0.1 helps limit the effect in the code included, but not in my actual program. A better solution would still be helpful.
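For reference, a minimal sketch of that cap using the same TF1-compat session setup as the code above (the 0.1 fraction is the value from the update, not a recommendation):
from tensorflow.compat.v1 import ConfigProto, GPUOptions, Session
from tensorflow.compat.v1.keras.backend import set_session

# Cap this process at roughly 10% of the GPU's memory instead of relying on allow_growth
tf_config = ConfigProto(gpu_options=GPUOptions(per_process_gpu_memory_fraction=0.1))
set_session(Session(config=tf_config))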
I used to face this problem, and I found a solution from someone I can no longer track down; I paste it below. In fact, I found that even with allow_growth=True, TensorFlow can still end up using all of your memory, so you should just set an explicit maximum limit.
Try this:
gpus = tf.config.experimental.list_physical_devices("GPU")
if gpus:
    # Restrict TensorFlow to only use the first GPU
    try:
        for gpu in gpus:
            tf.config.experimental.set_memory_growth(gpu, False)
            tf.config.experimental.set_virtual_device_configuration(
                gpu,
                [
                    tf.config.experimental.VirtualDeviceConfiguration(
                        memory_limit=12288  # set your limit (in MB)
                    )
                ],
            )
        tf.config.experimental.set_visible_devices(gpus[0], "GPU")
        logical_gpus = tf.config.experimental.list_logical_devices("GPU")
        print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
    except RuntimeError as e:
        # Visible devices must be set before GPUs have been initialized
        print(e)
Training with SGD and the whole training set in a single batch can, depending on your input data, be very memory-intensive.
Try lowering your batch_size (e.g. 8, 16, or 32).
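In the example code above that would mean something like the following (32 is an arbitrary choice):
# instead of batch_size=len(input_data), use a small mini-batch
model.fit(array(input_data), array(target_output), batch_size=32)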

Issue with keras fit_generator epoch

I'm creating an LSTM model for text generation using Keras. Since the dataset I'm using (around 25 novels, about 1.4 million words) can't be processed at once (a memory issue when converting my outputs with to_categorical()), I created a custom generator function to read the data in.
# Data generator for fit and evaluate
def generator(batch_size):
    start = 0
    end = batch_size
    while True:
        x = sequences[start:end, :-1]
        # print(x)
        y = sequences[start:end, -1]
        y = to_categorical(y, num_classes=vocab_size)
        # print(y)
        yield x, y
        if batch_size == len(lines):
            break
        else:
            start += batch_size
            end += batch_size
When I execute the model.fit() method, the following error is thrown after the first epoch finishes training:
UnknownError: [_Derived_] CUDNN_STATUS_BAD_PARAM
in tensorflow/stream_executor/cuda/cuda_dnn.cc(1459): 'cudnnSetTensorNdDescriptor( tensor_desc.get(), data_type, sizeof(dims) / sizeof(dims[0]), dims, strides)'
[[{{node CudnnRNN}}]]
[[sequential/lstm/StatefulPartitionedCall]] [Op:__inference_train_function_25138]
Function call stack:
train_function -> train_function -> train_function
Does anyone know how to solve this issue? Thanks.
From many sources on the internet, this issue seems to occur when an LSTM layer is used together with a Masking layer while training on GPU.
The workarounds below may help:
If you can compromise on speed, train your model on CPU rather than on GPU. It works without any error.
As per this comment, check whether your input sequences consist entirely of zeros, as the Masking layer may then mask all of the inputs.
If possible, disable eager execution. As per this comment, it works without any error.
Instead of using a Masking layer, you can try the alternatives mentioned in this link:
a. Add the argument mask_zero=True to the Embedding layer (a sketch follows this answer), or
b. pass a mask argument manually when calling layers that support it.
A last resort can be to remove the Masking layer, if that is possible.
If none of the above workarounds solves your problem, the Google TensorFlow team is working to resolve this error; we may have to wait until that is fixed.
Hope this information helps. Happy learning!
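A sketch of option (a), with illustrative values for vocab_size and sequence length in a typical text-generation stack:
import tensorflow as tf

vocab_size = 10000   # assumed; use your own vocabulary size
seq_length = 50      # assumed input sequence length

model = tf.keras.Sequential([
    # mask_zero=True lets the Embedding layer generate the mask itself,
    # so no separate Masking layer (the suspected trigger on GPU) is needed
    tf.keras.layers.Embedding(vocab_size, 100, mask_zero=True,
                              input_length=seq_length),
    tf.keras.layers.LSTM(128),
    tf.keras.layers.Dense(vocab_size, activation="softmax"),
])
model.compile(loss="categorical_crossentropy", optimizer="adam")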

getting error with softmax and cross entropy in theano

I'm implementing a DNN with Theano. At the last layer of the DNN, I'm using a softmax as the nonlinear function, from theano.tensor.nnet.softmax.
As the loss function I'm using cross entropy from T.nnet.binary_crossentropy.
But I get a strange error:
"The following error happened while compiling the node', GpuDnnSoftmaxGrad{tensor_format='bc01' ..."
I'm a newbie with Theano and can't figure out what's wrong with this model. Your help is appreciated.
PS: my guess is it is somehow related to the fact that softmax takes a 2D tensor and returns a 2D tensor.
PS2: I'm using bleeding-edge Theano (just cloned). My CUDA version is old (4.2), but I'm almost sure that's not the problem, since I work without errors with other DNN tools written on top of Theano.
I'm using pylearn2 for acceleration, and that's not the problem either, since I have already used it successfully with the current Theano and CUDA in another DNN.
The error happens at this line: train = theano.function([idx], train_loss, givens=givens, updates=updates)
The full error message is:
cmodule.py", line 293, in dlimport
rval = __import__(module_name, {}, {}, [module_name])
RuntimeError: ('The following error happened while compiling the node', GpuDnnSoftmaxGrad{tensor_format='bc01', mode='channel', algo='accurate'}(GpuContiguous.0, GpuContiguous.0), '\n', 'could not create cuDNN handle: The handle was not initialized(Is your driver recent enought?).', "[GpuDnnSoftmaxGrad{tensor_format='bc01', mode='channel', algo='accurate'}(<CudaNdarrayType(float32, (False, False, True, True))>, <CudaNdarrayType(float32, (False, False, True, True))>)]")
The cross entropy function I'm using is defined as:
error = T.mean(T.nnet.binary_crossentropy(input, target_y))
where input is the output of the softmax layer and target_y is the labels.
Solved: I had to use T.nnet.categorical_crossentropy, since my target variable is an integer vector.
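A minimal sketch of that fix with illustrative symbolic variables (a softmax output paired with integer labels goes through categorical_crossentropy):
import theano
import theano.tensor as T

pre_activation = T.matrix("pre_activation")   # (batch, n_classes) scores from the last layer
target_y = T.ivector("target_y")              # integer class labels

p_y_given_x = T.nnet.softmax(pre_activation)
loss = T.mean(T.nnet.categorical_crossentropy(p_y_given_x, target_y))

loss_fn = theano.function([pre_activation, target_y], loss)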
