Why does my Colab session run out of RAM?

I'm building a model for image deblurring based on the model described in this paper using Keras. I train the model on Colab using the following training code:
x_train, y_train = load_h5_dataset()

def train(batch_size=16, epoch_num=5, critic_updates=5, log_dir='drive/MyDrive/train_logs'):
    g = make_resnet_generator_model()
    d = make_discriminator_model()
    gan = make_gan(g, d)

    d_opt = Adam(learning_rate=1e-4, beta_1=0.9, beta_2=0.999, epsilon=1e-8)
    gan_opt = Adam(learning_rate=1e-4, beta_1=0.9, beta_2=0.999, epsilon=1e-8)

    d.trainable = True
    d.compile(optimizer=d_opt, loss=wasserstein_loss)
    d.trainable = False
    loss = [perceptual_loss, wasserstein_loss]
    loss_weights = [100, 1]
    gan.compile(optimizer=gan_opt, loss=loss, loss_weights=loss_weights)
    d.trainable = True

    output_true_batch, output_false_batch = np.ones((batch_size, 1)), -np.ones((batch_size, 1))
    writer = tf.summary.create_file_writer(log_dir)

    for epoch in tqdm(range(epoch_num)):
        print(f"Epoch {epoch + 1}/{epoch_num}...")
        permuted_indexes = np.random.permutation(x_train.shape[0])
        d_losses = []
        gan_losses = []
        x_train = dataset['sharp_img']

        for index in range(int(x_train.shape[0] / batch_size)):
            batch_indexes = permuted_indexes[index * batch_size:(index + 1) * batch_size]
            image_blur_batch = x_train[batch_indexes]
            image_full_batch = y_train[batch_indexes]
            generated_images = g.predict(x=image_blur_batch, batch_size=batch_size)

            for _ in range(critic_updates):
                d_loss_real = d.train_on_batch(image_full_batch, output_true_batch)
                d_loss_fake = d.train_on_batch(generated_images, output_false_batch)
                d_loss = 0.5 * np.add(d_loss_fake, d_loss_real)
                d_losses.append(d_loss)

            d.trainable = False
            gan_loss = gan.train_on_batch(image_blur_batch, [image_full_batch, output_true_batch])
            gan_losses.append(gan_loss)
            d.trainable = True

        write_logs(writer, ['d_loss', 'gan_loss'], [np.mean(d_losses), np.mean(gan_losses)], epoch)
        save_weights(d, g, epoch, int(np.mean(gan_losses)))
In the training code above, the perceptual loss is calculated using a VGG16 network pretrained on ImageNet, and the function load_h5_dataset() loads a dataset saved as an .hdf5 file. I encounter two problems when executing this code:
When I run it on Colab, it keeps running out of RAM and the execution stops. However, the dataset is only 6 GB, which is well below Colab's available RAM.
When I run the code on my local machine (16 GB of RAM and an NVIDIA GeForce GTX 1660 Ti with 6 GB of memory), I get this error: tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[16,256,128,128] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc [Op:Conv2D]
Can someone have a look at my code and see what is going wrong here? Thank you very much.

Can you check this issue: https://github.com/tensorflow/models/issues/1993
You can also delete variables you no longer need, e.g.
del whatevervariable
and the RAM they occupy will be freed.
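For example, a minimal, self-contained sketch of that idea (the array here is hypothetical; whether memory actually comes back depends on nothing else still referencing it):

import gc
import numpy as np
import tensorflow as tf

# Hypothetical large intermediate (~800 MB) standing in for generated_images etc.
big_batch = np.zeros((1024, 256, 256, 3), dtype=np.float32)

# ... use big_batch ...

del big_batch      # drop the last reference to the array
gc.collect()       # ask Python to reclaim the memory now rather than later

# Between separate training runs in the same Colab session, clearing the Keras
# session also releases accumulated graph/model state.
tf.keras.backend.clear_session()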

Related

Training loss not decreasing when training

I am training a graph neural network on a node of a cluster with one Titan RTX GPU. I am using tensorflow-gpu 1.15 and it recognizes the GPU successfully. The training involves some tensor operations in float64, and the training set consists of 256K sparse block-circulant matrices of moderate size. I evaluate 256 samples per run and the batch size is set to 32.
When I look at the loss graph in TensorBoard, I notice that even after evaluating more than 100K samples (after 24 hours of training) my training loss is not decaying at all: it looks noisy and quite flat. (The TensorBoard plot is omitted here; it shows exactly that noisy, essentially flat curve.)
The loss is measured as the Frobenius norm of an error matrix and is supposed to decay. I am using the Adam optimizer with a learning rate of 10^-3.
Any insights on why it is behaving like this? It is basically not learning anything.
I did some quick profiling to see which operations are the slowest, but could not find anything significant.
Could it be the GPU I am using and a loss in performance due to the heavy memory allocation of float64? When I check GPU usage, I allocate 60% of the memory (and I have the option to release it after the operation).
Any suggestions or tips?
I have been using:
Tensorflow-gpu 1.15,
CUDA 10.0.130,
NCCL 2.4.7-CUDA-10.0.130,
cuDNN 7.6.3-CUDA-10.0.130.
Running on a remote server with 4 Titan RTX GPUs (I am using 1 of them).
tf.float64 by itself is not the problem, provided you select the correct optimizer; I run this in TF1 compatibility mode with tf.compat.v1.disable_eager_execution(). Make sure that:
the input data and the optimization target are set up correctly;
the correct tf.Variable objects are used;
the equation or method being optimized is the right one;
the inputs are explicitly transformed to tf.float64 where required in compatibility mode;
the running sessions feed the same inputs and update the same variables, arrays, or feed_dict.
The purpose of the optimizer also matters: it depends on whether you need to find similarities or to separate the data into categories.
Example: a similarity score treats re-occurring patterns as the same, while comparing all pixels registers even a small change as a difference.
import os
from os.path import exists

import tensorflow as tf
import matplotlib.pyplot as plt
from skimage.transform import resize
import numpy as np

# Expected output:
# [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
# None
physical_devices = tf.config.experimental.list_physical_devices('GPU')
assert len(physical_devices) > 0, "Not enough GPU hardware devices available"
config = tf.config.experimental.set_memory_growth(physical_devices[0], True)
print(physical_devices)
print(config)

# Variables
learning_rate = 0.1
global_step = 0
tf.compat.v1.disable_eager_execution()

BATCH_SIZE = 1
IMG_SIZE = (32, 32)

history = []
history_Y = []
list_file = []
list_label = []

for file in os.listdir("F:\\datasets\\downloads\\dark\\train"):
    image = plt.imread("F:\\datasets\\downloads\\dark\\train\\" + file)
    image = resize(image, (32, 32))
    image = np.reshape(image, (1, 32, 32, 3))
    list_file.append(image)
    list_label.append(1)

optimizer = tf.compat.v1.train.AdamOptimizer(
    learning_rate=0.1,
    beta1=0.9,
    beta2=0.999,
    epsilon=1e-08,
    use_locking=False,
    name='Adam'
)

var1 = tf.Variable(255.0, dtype=tf.dtypes.float64)
var2 = tf.Variable(10.0, dtype=tf.dtypes.float64)
X_var = tf.compat.v1.get_variable('X', dtype=tf.float64,
                                  initializer=tf.random.normal((1, 32, 32, 3), dtype=tf.dtypes.float64))
y_var = tf.compat.v1.get_variable('Y', dtype=tf.float64,
                                  initializer=tf.random.normal((1, 32, 32, 3), dtype=tf.dtypes.float64))

Z = tf.nn.l2_loss((var1 - X_var) ** 2 + (var2 - y_var) ** 2, name="loss")
cosine_loss = tf.keras.losses.CosineSimilarity(axis=1)
loss = tf.reduce_mean(input_tensor=tf.square(Z))
training_op = optimizer.minimize(loss)

previous_train_loss = 0
with tf.compat.v1.Session() as sess:
    sess.run(tf.compat.v1.global_variables_initializer())

    image = list_file[0]
    X = image
    Y = image

    for i in range(1000):
        global_step = global_step + 1
        train_loss, temp = sess.run([loss, training_op], feed_dict={X_var: X, y_var: Y})
        history.append(train_loss)

        if global_step % 2 == 0:
            var2 = var2 - 0.001
        if global_step % 4 == 0 and train_loss <= previous_train_loss:
            var1 = var1 - var2 + 0.5

        print('steps: ' + str(i))
        print('train_loss: ' + str(train_loss))
        previous_train_loss = train_loss

    sess.close()

# Graph
history = history[:-1]
plt.plot(np.asarray(history))
plt.xlabel('Epoch')
plt.ylabel('loss')
plt.legend(loc='lower right')
plt.show()
Without cosine similarity: all pixels are compared, so even a small change is taken as meaningful.
With cosine similarity: re-occurrences in the series are supposed to be treated as the same pattern.
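For illustration only (this is not part of the answer above), a minimal eager TF2 sketch of that difference, using made-up tensors:

import tensorflow as tf

# Two hypothetical batches of flattened 32x32x3 images; b differs from a by a tiny per-pixel change.
a = tf.random.uniform((4, 32 * 32 * 3))
b = a + tf.random.normal((4, 32 * 32 * 3), stddev=0.01)

# Per-pixel comparison: even a small change yields a non-zero error.
mse = tf.keras.losses.MeanSquaredError()
print("MSE:", float(mse(a, b)))

# Cosine similarity compares the overall pattern; nearly identical patterns
# score close to -1 (the Keras convention for "most similar").
cos = tf.keras.losses.CosineSimilarity(axis=1)
print("Cosine similarity loss:", float(cos(a, b)))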

Model parallelism, CUDA out of memory in Pytorch

I am trying to build an autoencoder model where the input/output is RGB images of size 256 x 256. I tried to train the model on one GPU with 12 GB of memory but I always get CUDA OOM (I tried different batch sizes, and even a batch size of 1 fails). So I read about model parallelism in PyTorch and tried this:
class Autoencoder(nn.Module):
    def __init__(self, input_output_size):
        super(Autoencoder, self).__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_output_size, 1024),
            nn.ReLU(True),
            nn.Linear(1024, 200),
            nn.ReLU(True)
        ).cuda(0)

        self.decoder = nn.Sequential(
            nn.Linear(200, 1024),
            nn.ReLU(True),
            nn.Linear(1024, input_output_size),
            nn.Sigmoid()).cuda(1)

        print(self.encoder.get_device())
        print(self.decoder.get_device())

    def forward(self, x):
        x = x.cuda(0)
        x = self.encoder(x)
        x = x.cuda(1)
        x = self.decoder(x)
        return x
So I have moved my encoder and decoder onto different GPUs. But now I get this exception:
Expected tensor for 'out' to have the same device as tensor for argument #2 'mat1'; but device 0 does not equal 1 (while checking arguments for addmm)
It appears when I do x = x.cuda(1) in the forward method.
Moreover, here is my "train" code; maybe you can give me some advice about optimizations? Are images of 3 x 256 x 256 too large for training? (I cannot reduce them.) Thank you in advance.
Training:
input_output_size = 3 * 256 * 256
model = Autoencoder(input_output_size).to(device)
optimizer = optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.MSELoss()

for epoch in range(100):
    epoch_loss = 0
    for batch_idx, (images, _) in enumerate(dataloader):
        images = torch.flatten(images, start_dim=1).to(device)
        output_images = model(images).to(device)
        train_loss = criterion(output_images, images)
        train_loss.backward()
        optimizer.step()

        if batch_idx % 5 == 0:
            with torch.no_grad():
                model.eval()
                pred = model(test_set).to(device)
                model.train()
                test_loss = criterion(pred, test_set)
                wandb.log({"MSE train": train_loss})
                wandb.log({"MSE test": test_loss})
                del pred, test_loss

        if batch_idx % 200 == 0:
            # here I send testing images from output to W&B
            with torch.no_grad():
                model.eval()
                pred = model(test_set).to(device)
                model.train()
                wandb.log({"PRED": [wandb.Image((pred[i].cpu().reshape((3, 256, 256)).permute(1, 2, 0) * 255).numpy().astype(np.uint8), caption=str(i)) for i in range(20)]})
                del pred

        gc.collect()
        torch.cuda.empty_cache()
        epoch_loss += train_loss.item()
        del output_images, train_loss

    epoch_loss = epoch_loss / len(dataloader)
    wandb.log({"Epoch MSE train": epoch_loss})
    del epoch_loss
Three issues that I'm seeing:
1. model(test_set) — this sends the entirety of your test set (presumably huge) through your model as a single batch.
2. I don't know what wandb is, but another likely source of memory growth is these lines:
wandb.log({"MSE train": train_loss})
wandb.log({"MSE test": test_loss})
You are saving train_loss and test_loss, but these contain not only the numbers themselves but also the computational graphs (living on the GPU) needed for backprop. Before saving them, convert them to float or numpy.
3. Your model contains two 3*256*256 x 1024 weight blocks. When used with Adam, each will require 3*256*256 x 1024 * 3 * 4 bytes = 2.25 GB of VRAM (possibly more if it is implemented inefficiently). This looks like a poor architecture for other reasons as well.
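A minimal, self-contained sketch of the first two fixes; the stand-in model and test set below are hypothetical and much smaller than the question's, and in the real code the computed floats would go to wandb.log as indicated in the comments:

import torch
import torch.nn as nn

# Tiny stand-ins for the question's model, criterion, and test set (hypothetical sizes).
model = nn.Sequential(nn.Linear(3 * 32 * 32, 64), nn.ReLU(),
                      nn.Linear(64, 3 * 32 * 32), nn.Sigmoid())
criterion = nn.MSELoss()
test_set = torch.rand(64, 3 * 32 * 32)

# 1) Log plain Python floats, not tensors that still carry the autograd graph.
train_loss = criterion(model(test_set[:4]), test_set[:4])
logged_train = train_loss.item()           # e.g. wandb.log({"MSE train": train_loss.item()})

# 2) Evaluate the test set in mini-batches instead of one giant forward pass.
model.eval()
total, n = 0.0, 0
with torch.no_grad():
    for start in range(0, test_set.shape[0], 8):
        batch = test_set[start:start + 8]
        total += criterion(model(batch), batch).item() * batch.shape[0]
        n += batch.shape[0]
model.train()
logged_test = total / n                     # e.g. wandb.log({"MSE test": logged_test})
print(logged_train, logged_test)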

Resource exhausted while training Xception model on flower_dataset with tensorflow-gpu-2.4.0

I am having issues while training a TensorFlow model on a flower dataset with 5 classes. There are 3000+ pictures of flowers across those 5 classes. I have installed the necessary libraries for TensorFlow-gpu, but training stops with resources exhausted.
Configuration:
Intel i5-1135G7
NVIDIA MX330 2 GB
tensorflow-gpu==2.4.0
CUDA 11.0.1
cuDNN 8.2
Resource exhausted: OOM when allocating tensor with shape[32,728,14,14] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
I also tried reducing BATCH_SIZE.
Here is my code in TensorFlow:
IMAGE_SIZE = (224, 224)
BATCH_SIZE = 32

base_model = keras.applications.Xception(include_top=False,
                                         weights="imagenet",
                                         input_shape=IMAGE_SIZE + (3,))
base_model.trainable = True

inputs = layers.Input(shape=IMAGE_SIZE + (3,))
x = data_augmentation(inputs)
x = layers.experimental.preprocessing.Rescaling(1./255)(x)
x = base_model(x, training=False)
x = layers.GlobalAveragePooling2D()(x)
x = layers.Dropout(0.2)(x)
outputs = layers.Dense(5, activation="softmax")(x)
model = keras.Model(inputs, outputs)

model.compile(
    optimizer=keras.optimizers.Adam(),
    loss=keras.losses.SparseCategoricalCrossentropy(),
    metrics=["accuracy"]
)

early_stopping = keras.callbacks.EarlyStopping(monitor="val_accuracy",
                                               patience=5)

epochs = 30
history = model.fit(
    train_ds,
    shuffle=True,
    epochs=epochs,
    validation_data=val_ds,
    callbacks=[early_stopping]
)
OOM basically means that you are running out of memory. Your GPU has only 2 GB, so it is quite normal to run out of memory on tasks like this. You might want to reduce BATCH_SIZE as much as you can (I have no clue which values you have tried so far); try setting BATCH_SIZE to 8.
If that does not solve it, you will need to resize your pictures so they consume less memory. Note that Xception with include_top=False requires inputs of at least 71x71, so try something like 96x96 rather than going all the way down to 30x30.
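A minimal sketch of those two changes against the TF 2.4-style API used in the question (sizes are only examples; data_augmentation and the datasets are left out):

from tensorflow import keras
from tensorflow.keras import layers

IMAGE_SIZE = (96, 96)   # Xception with include_top=False needs inputs of at least 71x71
BATCH_SIZE = 8          # pass this when building train_ds / val_ds or calling model.fit

# Smaller inputs and batches shrink the intermediate activation tensors,
# such as the [32, 728, 14, 14] one from the OOM message.
base_model = keras.applications.Xception(include_top=False,
                                         weights="imagenet",
                                         input_shape=IMAGE_SIZE + (3,))

inputs = layers.Input(shape=IMAGE_SIZE + (3,))
x = layers.experimental.preprocessing.Rescaling(1. / 255)(inputs)
x = base_model(x, training=False)
x = layers.GlobalAveragePooling2D()(x)
outputs = layers.Dense(5, activation="softmax")(x)
model = keras.Model(inputs, outputs)
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])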

PyTorch (GPU) slower than CPU slower than keras

I'm just getting started with PyTorch and wanted to run through a few toy problems. In the following case, I'm noticing a significant difference in how long it takes to train the model once and issue one batch of predictions.
This is the PyTorch implementation. On the GPU it takes ~17 seconds on my machine; the same model on the CPU takes ~11 seconds.
class LR(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear1 = torch.nn.Linear(2, 20)
        self.linear2 = torch.nn.Linear(20, 1)

    def forward(self, x):
        x = torch.nn.functional.relu(self.linear1(x))
        x = torch.nn.functional.relu(self.linear2(x))
        return x

def fit_torch(df_train, df_test):
    sampler_tr = torch.utils.data.SubsetRandomSampler(df_train.index)
    train = torch.utils.data.DataLoader(
        torch.tensor(df_train.values, dtype=torch.float),
        batch_size=batch_size, sampler=sampler_tr)

    sampler_te = torch.utils.data.SubsetRandomSampler(df_test.index)
    test = torch.utils.data.DataLoader(
        torch.tensor(df_test.values, dtype=torch.float),
        batch_size=batch_size, sampler=sampler_te)

    model = LR()
    model = model.to(device)

    loss = torch.nn.MSELoss()
    optim = torch.optim.Adam(model.parameters(), lr=0.001)

    model.train()
    for _ in range(1000):
        for train_data in train:
            train_data = train_data.to(device)
            x_train = train_data[:, :2]
            y_train = train_data[:, 2]

            optim.zero_grad()
            pred = model(x_train)
            loss_val = loss(pred.squeeze(), y_train)
            loss_val.backward()
            optim.step()

    model.eval()
    with torch.no_grad():
        for test_data in test:
            test_data = test_data.to(device)
            pred = model(test_data[:, :2].float())
            break
This is the keras implementation. It takes approximately 9 seconds to run.
def fit_tf(df_train, df_test):
    model = tf.keras.models.Sequential()
    model.add(tf.keras.layers.Dense(20, activation='relu'))
    model.add(tf.keras.layers.Dense(1, activation='relu'))
    model.compile(loss='mse', optimizer='adam')
    model.fit(
        df_train.values[:, :2],
        df_train.values[:, 2],
        batch_size=batch_size, epochs=1000, verbose=0)
    model.predict(df_test.iloc[:batch_size].values[:, :2])
The dataset and main function:
device = torch.device('cuda:0')
scaler = MinMaxScaler()
batch_size = 64

def create_dataset():
    dataset = []
    random_x = np.random.randint(10, 1000, 1000)
    random_y = np.random.randint(10, 1000, 1000)
    for x, y in zip(random_x, random_y):
        dataset.append((x, y, 4 * x + 3 * y + 10))
    np.random.shuffle(dataset)
    df = pd.DataFrame(dataset)
    df = pd.DataFrame(scaler.fit_transform(df))
    return df

def __main__():
    df = create_dataset()
    df_train, df_test = train_test_split(df)
    start_time = time.time()
    fit_tf(df_train.reset_index(drop=True), df_test.reset_index(drop=True))
    print(time.time() - start_time)
PyTorch uses a dynamic computational graph by default, which is more flexible when you start developing a neural network because it gives more straightforward debug messages. TensorFlow, in contrast, builds a static computational graph, which is why you need to compile the model before using it. The compiler can optimize your model, but the trade-off is that the network becomes harder to debug. This may cause a minor difference in performance between the two frameworks, but it should not be a big deal.
Since your network is pretty small, the overhead of copying data between CPU memory and GPU memory and of initializing the CUDA subsystem exceeds the benefit brought by the GPU. If you try a more complex neural network such as AlexNet, ResNet, or even GoogLeNet, the benefit will be much more obvious.
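As an illustration of that copy overhead (not from the original post): for a dataset this small, one option is to move the whole tensor to the GPU once and slice mini-batches on the device instead of calling .to(device) per batch. A rough sketch with random data and the question's layer sizes:

import torch

device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

# Hypothetical dataset of 1000 rows (x1, x2, target), copied to the GPU once.
data = torch.rand(1000, 3, device=device)
batch_size = 64

model = torch.nn.Sequential(
    torch.nn.Linear(2, 20), torch.nn.ReLU(),
    torch.nn.Linear(20, 1), torch.nn.ReLU(),
).to(device)
loss_fn = torch.nn.MSELoss()
optim = torch.optim.Adam(model.parameters(), lr=0.001)

for _ in range(1000):                                   # same epoch count as the question
    perm = torch.randperm(data.shape[0], device=device)
    for start in range(0, data.shape[0], batch_size):
        batch = data[perm[start:start + batch_size]]    # sliced on the GPU, no host copy
        optim.zero_grad()
        pred = model(batch[:, :2]).squeeze(1)
        loss = loss_fn(pred, batch[:, 2])
        loss.backward()
        optim.step()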

Speed of Logistic Regression on MNIST with Tensorflow

I am taking CS 20SI: TensorFlow for Deep Learning Research from Stanford. I have a question regarding the following code:
import time

import numpy as np
import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data

# Step 1: Read in data
# using TF Learn's built-in function to load MNIST data to the folder data/mnist
MNIST = input_data.read_data_sets("/data/mnist", one_hot=True)

# Batched logistic regression
learning_rate = 0.01
batch_size = 128
n_epochs = 25

X = tf.placeholder(tf.float32, [batch_size, 784], name='image')
Y = tf.placeholder(tf.float32, [batch_size, 10], name='label')

#w = tf.Variable(tf.random_normal(shape=[int(shape[1]), int(Y.shape[1])], stddev=0.01), name='weights')
#b = tf.Variable(tf.zeros(shape=[1, int(Y.shape[1])]), name='bias')
w = tf.Variable(tf.random_normal(shape=[784, 10], stddev=0.01), name="weights")
b = tf.Variable(tf.zeros([1, 10]), name="bias")

logits = tf.matmul(X, w) + b
entropy = tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=Y)
loss = tf.reduce_mean(entropy)  # computes the mean over examples in the batch
optimizer = tf.train.GradientDescentOptimizer(learning_rate=learning_rate).minimize(loss)

init = tf.global_variables_initializer()

with tf.Session() as sess:
    sess.run(init)
    n_batches = int(MNIST.train.num_examples / batch_size)
    for i in range(n_epochs):
        start_time = time.time()
        for _ in range(n_batches):
            X_batch, Y_batch = MNIST.train.next_batch(batch_size)
            opt, loss_ = sess.run([optimizer, loss], feed_dict={X: X_batch, Y: Y_batch})
        end_time = time.time()
        print('Epoch %d took %f' % (i, end_time - start_time))
In this code, logistic regression is performed on the MNIST dataset. The author states:
Running on my Mac, the batch version of the model with batch size 128
runs in 0.5 second
However, when I run it, each epoch takes around 2 seconds, giving a total execution time of around a minute. Is it reasonable that this example takes that long? I currently have a Ryzen 1700 without OC (3.0 GHz) and a GTX 1080 GPU without OC.
I tried this code on a GTX Titan X (Maxwell) and got around 0.5 seconds per epoch, so I would expect a GTX 1080 to get similar results.
Try using the latest TensorFlow and CUDA/cuDNN versions. Make sure no limiting environment variables are set (which GPUs are visible, how much memory TensorFlow may use, etc.). You can also run a micro-benchmark to check that you can reach the stated FLOPS of your card, e.g. Testing GPU with tensorflow matrix multiplication.
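A minimal sketch of such a matrix-multiplication micro-benchmark, written against the TF1-style API used in the question (matrix size and iteration count are arbitrary):

import time
import tensorflow as tf

N = 8192        # three float32 N x N matrices need roughly 0.8 GB of GPU memory
iters = 10

with tf.device('/gpu:0'):
    a = tf.random_normal((N, N))
    b = tf.random_normal((N, N))
    c = tf.matmul(a, b)

with tf.Session() as sess:
    sess.run(c.op)                     # warm-up run (includes CUDA initialization)
    start = time.time()
    for _ in range(iters):
        sess.run(c.op)                 # run the matmul without copying the result to the host
    elapsed = time.time() - start
    # One N x N matmul costs roughly 2 * N**3 floating-point operations.
    print('~%.1f TFLOPS' % (2 * N ** 3 * iters / elapsed / 1e12))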
