I'm trying to calculate the mean_iou and update a confusion matrix for each batch. But after 30 steps I get a SIGKILL event. The images I use in my generator have a resolution of 2048x1024, so my batch_size is 2. It seems that I can't release the memory after one step is finished. I tested the generator by iterating over all images and everything works fine.
I'm using Keras 2.1.2 with TensorFlow 1.4.1 as backend on a GTX 1080. It would be really nice if someone had some advice.
def calculate_iou_tf(model, generator, steps, num_classes):
    conf_m = K.tf.zeros((num_classes, num_classes), dtype=K.tf.float64)
    generator.reset()
    pb = Progbar(steps)
    for i in range(0, steps):
        x, y_true = generator.next()
        y_pred = model.predict_on_batch(x)
        # num_classes = K.int_shape(y_pred)[-1]
        y_pred = K.flatten(K.argmax(y_pred, axis=-1))
        y_true = K.reshape(y_true, (-1,))
        mask = K.less_equal(y_true, num_classes - 1)
        y_true = K.tf.to_int32(K.tf.boolean_mask(y_true, mask))
        y_pred = K.tf.to_int32(K.tf.boolean_mask(y_pred, mask))
        mIoU, up_op = K.tf.contrib.metrics.streaming_mean_iou(y_pred, y_true, num_classes, updates_collections=[conf_m])
        K.get_session().run(K.tf.local_variables_initializer())
        with K.tf.control_dependencies([up_op]):
            score = K.eval(mIoU)
            print(score)
        pb.update(i + 1)
    conf_m = K.eval(conf_m)
    return conf_m, K.eval(mIoU)
The problem lay in using keras.backend functions instead of NumPy ones. Every time a function was called, a new tensor was created. Unfortunately, with the current implementation of TF there is no systematic garbage collection of such tensors, so memory kept filling up until the process was killed. Switching to NumPy solved the problem.
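For illustration, here is a minimal NumPy-only sketch of the same loop. It assumes, as in the question's code, that the generator yields integer ground-truth labels and that values greater than num_classes - 1 mark void pixels; no new graph nodes are created inside the loop, so memory stays flat.

import numpy as np

def calculate_iou_np(model, generator, steps, num_classes):
    # One confusion matrix accumulated over all batches:
    # rows = ground truth, columns = prediction.
    conf_m = np.zeros((num_classes, num_classes), dtype=np.float64)
    generator.reset()
    for _ in range(steps):
        x, y_true = generator.next()
        y_pred = np.argmax(model.predict_on_batch(x), axis=-1).ravel()
        y_true = y_true.ravel().astype(np.int64)
        mask = y_true <= num_classes - 1          # drop void labels
        y_true, y_pred = y_true[mask], y_pred[mask]
        idx = y_true * num_classes + y_pred       # linear index into the matrix
        conf_m += np.bincount(idx, minlength=num_classes ** 2).reshape(num_classes, num_classes)
    # Per-class IoU = TP / (TP + FP + FN), averaged over classes that actually occur.
    tp = np.diag(conf_m)
    denom = conf_m.sum(axis=0) + conf_m.sum(axis=1) - tp
    present = denom > 0
    mean_iou = np.mean(tp[present] / denom[present])
    return conf_m, mean_iou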
I'm trying to train a resnet18 model in PyTorch (plus pytorch-lightning) using Virtual Adversarial Training. During the computations required for this type of training I need to obtain the gradient of D (i.e. the cross-entropy loss of the model) with respect to the tensor r.
This should, in theory, happen in the following code snippet:
def generic_step(self, train_batch, batch_idx, step_type):
    x, y = train_batch
    unlabeled_idx = y is None
    d = torch.rand(x.shape).to(x.device)
    d = d / (torch.norm(d) + 1e-8)
    pred_y = self.classifier(x)
    y[unlabeled_idx] = pred_y[unlabeled_idx]
    l = self.criterion(pred_y, y)
    R_adv = torch.zeros_like(x)
    for _ in range(self.ip):
        r = self.xi * d
        r.requires_grad = True
        pred_hat = self.classifier(x + r)
        # pred_hat = F.log_softmax(pred_hat, dim=1)
        D = self.criterion(pred_hat, pred_y)
        self.classifier.zero_grad()
        D.requires_grad = True
        D.backward()
        R_adv += self.eps * r.grad / (torch.norm(r.grad) + 1e-8)
    R_adv /= 32
    loss = l + R_adv * self.a
    loss.backward()
    self.accuracy[step_type] = self.acc_metric(torch.argmax(pred_y, 1), y)
    return loss
Here, to my understanding, r.grad should in theory be the gradient of D with respect to r. However, the code throws this at D.backward():
RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn
(full traceback excluded because this error is not helpful and technically "solved" as I know the cause for it, explained just below)
After some research and debugging it seems that in this situation D.backward() attempts to calculate dD/dD disregarding any previous mention of requires_grad=True. This is confirmed when I add D.requires_grad=True and I get D.grad=Tensor(1.,device='cuda:0') but r.grad=None.
Does anyone know why this may be happening?
In Lightning, .backward() and the optimizer step are handled under the hood. If you call them yourself, as in the code above, it will confuse Lightning because it doesn't know you called backward yourself.
You can enable manual optimization in the LightningModule:
def __init__(self):
    super().__init__()
    # put this in your init
    self.automatic_optimization = False
This tells Lightning that you are taking over calling backward and handling the optimizer step plus zero grad yourself; don't forget to add those calls in your code above. You can access the optimizer and scheduler in your training step like so:
def training_step(self, batch, batch_idx):
    optimizer = self.optimizers()
    scheduler = self.lr_schedulers()

    # do your training step
    # don't forget to call:
    # 1) backward 2) optimizer step 3) zero grad
Read more about manual optimization here.
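To make the workflow concrete, here is a minimal, hypothetical sketch of a manual-optimization training_step. It assumes a single optimizer and reuses the self.classifier and self.criterion attributes from the question:

def training_step(self, batch, batch_idx):
    optimizer = self.optimizers()

    x, y = batch
    loss = self.criterion(self.classifier(x), y)

    optimizer.zero_grad()
    # manual_backward() replaces loss.backward() so Lightning can still
    # apply precision/distributed scaling for you
    self.manual_backward(loss)
    optimizer.step()
    return loss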
I'm trying to define the following (toy) custom loss function in Keras:
def flexed_distance_loss(y_true, y_pred):
    y_true_df = pd.DataFrame(y_true, columns=my_columns)

    # do something with y_true_df

    return categorical_crossentropy(y_true_df.values, y_pred)
I'm running this model on GPU with tf.distribute.MirroredStrategy().
Compiling the model generates no error, but when running model.fit(), the following error happens:
>>> y_true_df = pd.DataFrame(y_true, columns=my_columns)
OperatorNotAllowedInGraphError: iterating over `tf.Tensor` is not allowed:
AutoGraph did convert this function. This might indicate you are trying to use an unsupported feature.
It seems that Pandas is trying to iterate over the tensor y_true, which is forbidden in graph mode (the preferred mode when training on GPU).
Am I to understand that it's not possible to use Pandas within a loss function when training on GPU?
What would be some plausible alternatives, other than doing all the manipulations directly in TensorFlow itself? I'm doing quite a lot of heavy re-indexing and merging, and I can't begin to imagine the pain of doing all this in native TensorFlow code.
Note:
For reference, this is the kind of manipulation I'm trying to make:
def flexed_distance_loss(y_true, y_pred):
    y_true_df = pd.DataFrame(y_true, columns=my_columns)
    y_true_custom = y_true_df.idxmax(axis=1).to_frame(name='my_name')
    y_true_df = pd.concat([y_true_custom, y_true_df], axis=1)
    y_true_df = y_true_df.where(y_true_df != 0, np.NaN)
    y_true_df = y_true_df.reset_index().set_index('my_name')

    nearby = y_true_df.fillna(pivoted_df.reindex(y_true_df.index)) \
                      .fillna(0) \
                      .set_index('index').sort_index()
    nearby = np.expm1(nearby).div(np.sum(np.expm1(nearby), axis=1), axis=0)
    y_true_flexed = nearby.values

    return categorical_crossentropy(y_true_flexed, y_pred)
Actually I realised that all I'm doing within the custom loss function is transforming y_true. In the real case, I'm transforming it based on some random number (if random.random() > 0.1 then apply the transformation).
The most appropriate place to do this is not in a loss function, but in the batch generator instead.
class BatchGenerator(tf.keras.utils.Sequence):

    def __init__(self, indices, batch_size, mode):
        self.indices = indices
        self.batch_size = batch_size
        self.mode = mode

    def __len__(self):
        return math.ceil(len(self.indices) / self.batch_size)

    def __getitem__(self, idx):
        batch = self.indices[idx * self.batch_size:(idx + 1) * self.batch_size]
        X_batch = X[batch, :]
        y_batch = y[batch, :]

        if self.mode == 'train' and random.random() > 0.3:
            # pick y from regular batch
            return X_batch, y_batch
        else:
            # apply flex-distancing to y
            return X_batch, flex_distance_batch(y_batch)

batch_size = 512 * 4

train_generator = BatchGenerator(range(0, test_cutoff), batch_size, 'train')
test_generator = BatchGenerator(range(test_cutoff, len(y_df)), batch_size, 'test')
This way the transformations are applied directly from the batch generator, and Pandas is perfectly fine here since we're dealing only with NumPy arrays on the CPU.
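For completeness, a hedged sketch of wiring these generators into training; it assumes a tf.keras model compiled as usual (with older standalone Keras you would call fit_generator instead):

model.fit(train_generator,
          validation_data=test_generator,
          epochs=10)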
I'm trying to get BatchNormalization working properly in Keras (2.2.4), and haven't had luck. Its behavior seems inconsistent across model.fit() and model.predict()/evaluate().
My original problem was in the context of a complex GAN setup with various layers that were switching between frozen and unfrozen. In an attempt to grok why things were failing, I created this toy example, in which there is only a single BatchNormalization layer trying to learn the identity function, and there is no more freezing/unfreezing nonsense:
import numpy as np
import keras
from keras.layers import Input, BatchNormalization

if __name__ == '__main__':
    x = np.random.uniform(low=-1.0, high=1.0, size=(20, 1))
    y = x

    ip = tx = Input(shape=[1,])
    tx = BatchNormalization()(tx)
    mod = keras.Model(inputs=ip, outputs=tx)
    mod.compile(optimizer=keras.optimizers.SGD(lr=0.01), loss="mse")
    mod.fit(x, y, epochs=2000)
    print("mod evaluate", mod.evaluate(x, y))
I also created a pure TensorFlow implementation that tries to do the exact same thing:
import tensorflow as tf
import numpy as np

if __name__ == '__main__':
    x = np.random.uniform(low=-1.0, high=1.0, size=(20, 1))
    y = x

    input = tx = tf.Variable(initial_value=x)
    is_training = tf.Variable(initial_value=False)
    tx = tf.layers.batch_normalization(tx, training=is_training)
    output = tx

    train_output = tf.placeholder(tf.float32, shape=(None, 1))
    update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
    loss = tf.losses.mean_squared_error(train_output, output)
    train_op = tf.train.MomentumOptimizer(learning_rate=0.01, momentum=0.9).minimize(loss)
    train_op = tf.group([train_op, update_ops])

    sess = tf.Session()
    sess.run(tf.global_variables_initializer())

    is_training.load(False, sess)
    py = sess.run(output)
    print("py", py)
    print("mse", np.mean(np.square(py - x)))

    print("training...")
    for i in range(2000):
        is_training.load(True, sess)
        sess.run(train_op, {train_output: y})

        is_training.load(False, sess)
        py = sess.run(output)
        print("step", i, "mse", np.mean(np.square(py - x)))

    is_training.load(False, sess)
    py = sess.run(output)
    print("mse", np.mean(np.square(py - x)))
I expect the Keras and TensorFlow results to be similar. But the Keras code prints an MSE of approximately 0.0002 at the end, while showing a loss of about 1e-12 during training. The TensorFlow code gives a much lower error, around 1e-16 for both training and test. The TensorFlow result is the expected one, since it should be trivial to learn the identity function, and once the moving mean and variance of BN converge, the results should be identical across train/test.
Why is this happening? Why is the behavior inconsistent between TensorFlow and Keras, and how can I get a lower error using pure Keras? Any insight into the situation would be appreciated.
I am trying out the SSIM loss implemented by this repo for image restoration.
Following the original sample code on the author's GitHub, I tried:
model.train()

for epo in range(epoch):
    for i, data in enumerate(trainloader, 0):
        inputs = data
        inputs = Variable(inputs)
        optimizer.zero_grad()

        inputs = inputs.view(bs, 1, 128, 128)
        top = model.upward(inputs)
        outputs = model.downward(top, shortcut=True)
        outputs = outputs.view(bs, 1, 128, 128)

        if i % 20 == 0:
            out = outputs[0].view(128, 128).detach().numpy() * 255
            cv2.imwrite("/home/tk/Documents/recover/SSIM/" + str(epo) + "_" + str(i) + "_re.png", out)

        loss = - criterion(inputs, outputs)
        ssim_value = - loss.data.item()
        print(ssim_value)

        loss.backward()
        optimizer.step()
However, the results didn't come out as I expected. After the first 10 epochs, the output images were all black.
loss = - criterion(inputs, outputs) is what the author proposes; however, classical PyTorch training code uses loss = criterion(y_pred, target), so it should be loss = criterion(inputs, outputs) here.
However, I tried loss = criterion(inputs, outputs) and the results are still the same.
Can anyone share some thoughts about how to properly utilize SSIM loss? Thanks.
The author is trying to maximize the SSIM value.
The natural behaviour of a PyTorch loss function and optimizer is to reduce the loss, but SSIM is a quality measure, so higher is better. Hence the author uses
loss = - criterion(inputs, outputs)
You can instead try using
loss = 1 - criterion(inputs, outputs)
as described in this paper.
Modified code (max_ssim.py) for testing the above using this repo:
import pytorch_ssim
import torch
from torch.autograd import Variable
from torch import optim
import cv2
import numpy as np

npImg1 = cv2.imread("einstein.png")

img1 = torch.from_numpy(np.rollaxis(npImg1, 2)).float().unsqueeze(0) / 255.0
img2 = torch.rand(img1.size())

if torch.cuda.is_available():
    img1 = img1.cuda()
    img2 = img2.cuda()

img1 = Variable(img1, requires_grad=False)
img2 = Variable(img2, requires_grad=True)

print(img1.shape)
print(img2.shape)

# Functional: pytorch_ssim.ssim(img1, img2, window_size=11, size_average=True)
ssim_value = 1 - pytorch_ssim.ssim(img1, img2).item()
print("Initial ssim:", ssim_value)

# Module: pytorch_ssim.SSIM(window_size=11, size_average=True)
ssim_loss = pytorch_ssim.SSIM()

optimizer = optim.Adam([img2], lr=0.01)

while ssim_value > 0.05:
    optimizer.zero_grad()
    ssim_out = 1 - ssim_loss(img1, img2)
    ssim_value = ssim_out.item()
    print(ssim_value)
    ssim_out.backward()
    optimizer.step()

cv2.imshow('op', np.transpose(img2.cpu().detach().numpy()[0], (1, 2, 0)))
cv2.waitKey()
The usual way to transform a similarity (higher is better) into a loss is to compute 1 - similarity(x, y).
To create this loss you can create a new "function".
def ssim_loss(x, y):
    return 1. - ssim(x, y)
Alternatively, if the similarity is a class (nn.Module), you can overload it to create a new one.
class SSIMLoss(SSIM):
    def forward(self, x, y):
        return 1. - super().forward(x, y)
Also, there are better implementations of SSIM than the one in this repo. For example, the one in the piqa Python package is faster.
The package can be installed with
pip install piqa
For your problem
from piqa import SSIM

class SSIMLoss(SSIM):
    def forward(self, x, y):
        return 1. - super().forward(x, y)

criterion = SSIMLoss()  # .cuda() if you need GPU support
...
loss = criterion(x, y)
...
should work well.
Problem statement
I am trying to train a dynamic RNN in TensorFlow v1.0.1 on Linux RedHat 7.3 (problem also manifests on Windows 7), and no matter what I try, I get the exact same training and validation error at every epoch, i.e. my weights are not updating.
I appreciate any help you can offer.
Example
I tried to reduce this to a minimal example that shows my issue, but the minimal example is still pretty large. I based the network structure largely on this gist.
Network definition
import functools
import numpy as np
import tensorflow as tf


def lazy_property(function):
    attribute = '_' + function.__name__

    @property
    @functools.wraps(function)
    def wrapper(self):
        if not hasattr(self, attribute):
            setattr(self, attribute, function(self))
        return getattr(self, attribute)
    return wrapper


class MyNetwork:
    """
    Class defining an RNN for labeling a time series.
    """

    def __init__(self, data, target, num_hidden=64):
        self.data = data
        self.target = target
        self._num_hidden = num_hidden
        self._num_steps = int(self.target.get_shape()[1])
        self._num_classes = int(self.target.get_shape()[2])
        self._weight_and_bias()  # create weight and bias tensors
        self.prediction
        self.error
        self.optimize

    @lazy_property
    def prediction(self):
        """Defines the recurrent neural network prediction scheme."""

        # Dynamic LSTM.
        network = tf.contrib.rnn.BasicLSTMCell(self._num_hidden)
        output, _ = tf.nn.dynamic_rnn(network, data, dtype=tf.float32)

        # Flatten and apply same weights to all time steps.
        output = tf.reshape(output, [-1, self._num_hidden])
        prediction = tf.nn.softmax(tf.matmul(output, self.weight) + self.bias)
        prediction = tf.reshape(prediction,
                                [-1, self._num_steps, self._num_classes])
        return prediction

    @lazy_property
    def cost(self):
        """Defines the cost function for the network."""

        cross_entropy = -tf.reduce_sum(self.target * tf.log(self.prediction),
                                       axis=[1, 2])
        cross_entropy = tf.reduce_mean(cross_entropy)
        return cross_entropy

    @lazy_property
    def optimize(self):
        """Defines the optimization scheme."""

        learning_rate = 0.003
        optimizer = tf.train.RMSPropOptimizer(learning_rate)
        return optimizer.minimize(self.cost)

    @lazy_property
    def error(self):
        """Defines a measure of prediction error."""

        mistakes = tf.not_equal(tf.argmax(self.target, 2),
                                tf.argmax(self.prediction, 2))
        return tf.reduce_mean(tf.cast(mistakes, tf.float32))

    def _weight_and_bias(self):
        """Returns appropriately sized weight and bias tensors for the output layer."""

        self.weight = tf.Variable(tf.truncated_normal(
            [self._num_hidden, self._num_classes],
            mean=0.0,
            stddev=0.01,
            dtype=tf.float32))
        self.bias = tf.Variable(tf.constant(0.1, shape=[self._num_classes]))
Training
Here is my training process. The all_data class just holds my data and labels, and uses a batch generator class to spit out batches for training when I call all_data.train.next() and all_data.train_labels.next(). You can reproduce with any batch generation scheme you like, and I can add the code if you think it is relevant; I felt like this was getting too long as it is.
tf.reset_default_graph()

data = tf.placeholder(tf.float32,
                      [None, all_data.num_steps, all_data.num_features])
target = tf.placeholder(tf.float32,
                        [None, all_data.num_steps, all_data.num_outputs])
model = MyNetwork(data, target, NUM_HIDDEN)

print('Training the model...')
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print('Initialized.')
    for epoch in range(3):
        print('Epoch {} |'.format(epoch), end='', flush=True)
        for step in range(all_data.train_size // BATCH_SIZE):

            # Generate the next training batch and train.
            d = all_data.train.next()
            t = all_data.train_labels.next()
            sess.run(model.optimize,
                     feed_dict={data: d, target: t})

            # Update the user periodically.
            if step % summary_frequency == 0:
                print('.', end='', flush=True)

        # Show training and validation error at the end of each epoch.
        print('|', flush=True)
        train_error = sess.run(model.error,
                               feed_dict={data: d, target: t})
        valid_error = sess.run(model.error,
                               feed_dict={
                                   data: all_data.valid,
                                   target: all_data.valid_labels
                               })
        print('Training error: {}%'.format(100 * train_error))
        print('Validation error: {}%'.format(100 * valid_error))

    # Check testing error after everything.
    test_error = sess.run(model.error,
                          feed_dict={
                              data: all_data.test,
                              target: all_data.test_labels
                          })
    print('Testing error after {} epochs: {}%'.format(epoch + 1, 100 * test_error))
For a simple example, I generated random data and labels, where data has shape [num_samples, num_steps, num_features], and each sample has a single label associated with the whole thing:
data = np.random.rand(5000, 1000, 2)
labels = np.random.randint(low=0, high=2, size=[5000])
I then converted my labels to one-hot vectors and tiled them so that the resulting labels tensor was the same size as the data tensor.
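For concreteness, a hypothetical sketch of that one-hot plus tiling step, assuming the data and labels arrays generated just above:

num_classes = 2
one_hot_labels = np.eye(num_classes)[labels]              # shape [5000, 2]
tiled_labels = np.tile(one_hot_labels[:, np.newaxis, :],  # shape [5000, 1000, 2], same as data
                       (1, data.shape[1], 1))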
Results
No matter what I do, I get results like this:
Training the model...
Initialized.
Epoch 0 |.......................................................|
Training error: 56.25%
Validation error: 53.39999794960022%
Epoch 1 |.......................................................|
Training error: 56.25%
Validation error: 53.39999794960022%
Epoch 2 |.......................................................|
Training error: 56.25%
Validation error: 53.39999794960022%
Testing error after 3 epochs: 49.000000953674316%
I get exactly the same error at every epoch. Even if my weights were just randomly walking around, this should change. For the example shown here, I used random data with random labels, so I do not expect much improvement, but I do expect some change, and I am getting the exact same results every epoch. When I do this with my actual data set, I get the same behavior.
Insight
I hesitate to include this in case it proves to be a red herring, but I believe that my optimizer is calculating cost function gradients of None. When I tried a different optimizer and attempted to clip the gradients, I went ahead and used tf.Print to output the gradients as well. The network crashed with an error that tf.Print could not handle None-type values.
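(As a hedged debugging sketch, not part of the original post: in TF 1.x you can list which variables come back with a None gradient by calling compute_gradients directly on an optimizer, e.g.:)

grads_and_vars = tf.train.RMSPropOptimizer(0.003).compute_gradients(model.cost)
for grad, var in grads_and_vars:
    print(var.name, 'gradient is None' if grad is None else 'gradient OK')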
Attempted fixes
I have tried the following things, and the problem persists in all cases:
Using different optimizers, e.g. AdamOptimizer with and without modifications to the gradients (clipping).
Adjusting batch sizes.
Using many more and many fewer hidden nodes.
Running for more epochs.
Initializing my weights with different values assigned to stddev.
Initializing my biases to zeros (using tf.zeros) and to different constants.
Using weights and biases that are defined within the prediction method and are not member variables of the class, and a _weight_and_bias method that is defined as a @staticmethod like in this gist.
Determining logits in the prediction function instead of softmax predictions, i.e. predictions = tf.matmul(output, self.weights) + self.bias, and then using tf.nn.softmax_cross_entropy_with_logits. This requires some reshaping because the method wants its labels and logits given with shape [batch_size, num_classes], so the cost method becomes:
@lazy_property
def cost(self):
    """Defines the cost function for the network."""

    targs = tf.reshape(self.target, [-1, self._num_classes])
    logits = tf.reshape(self.predictions, [-1, self._num_classes])
    cross_entropy = tf.nn.softmax_cross_entropy_with_logits(labels=targs, logits=logits)
    cross_entropy = tf.reduce_mean(cross_entropy)
    return cross_entropy
Changing which size dimension I leave as None when I create my placeholders as suggested in this answer, which requires a bit of rewriting in the network definition. Basically setting size = [all_data.batch_size, -1, all_data.num_features] and size = [all_data.batch_size, -1, all_data.num_classes].
Using tf.contrib.rnn.DropoutWrapper in my network definition and passing a dropout value set to 0.5 in training and 1.0 in validation and testing.
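(For reference, a minimal, hypothetical sketch of the DropoutWrapper variant from the last item; keep_prob would be a placeholder fed with 0.5 during training and 1.0 for validation and testing:)

keep_prob = tf.placeholder(tf.float32, shape=[])
cell = tf.contrib.rnn.BasicLSTMCell(self._num_hidden)
cell = tf.contrib.rnn.DropoutWrapper(cell, output_keep_prob=keep_prob)
output, _ = tf.nn.dynamic_rnn(cell, self.data, dtype=tf.float32)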
The problem went away when I used
output = tf.contrib.layers.flatten(output)
logits = tf.contrib.layers.fully_connected(output, some_size, activation_fn=None)
instead of flattening my network output, defining weights, and performing the tf.matmul(output, weight) + bias manually. I then used logits (instead of predictions in the question) in my cost function with
cross_entropy = tf.nn.softmax_cross_entropy_with_logits(labels=target,
logits=logits)
If you want to get the network prediction, you will still need to do prediction = tf.nn.softmax(logits).
I have no idea why this helped, but the network would not train even on random made-up data until I made these changes.