I am trying to compute "gradients through gradients" (second-order gradients) for a paper (MAML, by C. Finn et al.) in TensorFlow 2 with the Keras API. The idea is to start at some initial weights, take K gradient update steps, and then backpropagate the resulting loss through the initial weights. The code sample below illustrates what I want to achieve, but unfortunately it does not work.
optimizer = tf.keras.optimizers.SGD()
initial_weights = model.trainable_variables

with tf.GradientTape() as mt:
    for gradient_step in range(10):
        with tf.GradientTape() as t:
            loss = loss_function(y_train, model(x_train))
        grads = t.gradient(loss, model.trainable_variables)
        optimizer.apply_gradients(zip(grads, model.trainable_variables))
    test_loss = loss_function(y_test, model(x_test))

mt.gradient(test_loss, initial_weights)
Does anyone know how to differentiate through the initialization? Any help would be greatly appreciated!
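Not from the original question, but for reference: the usual culprit is optimizer.apply_gradients, since variable assignments are not differentiable, so the outer tape loses the connection back to the initial weights. One common pattern is to carry the inner-loop ("fast") weights as plain tensors and update them manually. A minimal sketch with a made-up linear model (all names, shapes, and the learning rate are illustrative, not taken from the post):

import tensorflow as tf

# Toy data and a toy linear "model" (purely illustrative).
x_train = tf.random.normal([8, 3])
y_train = tf.random.normal([8, 1])
x_test = tf.random.normal([8, 3])
y_test = tf.random.normal([8, 1])
loss_function = tf.keras.losses.MeanSquaredError()

W = tf.Variable(tf.random.normal([3, 1]))    # the initial weights we differentiate through
b = tf.Variable(tf.zeros([1]))
inner_lr = 0.01

def forward(x, W, b):
    return tf.matmul(x, W) + b

with tf.GradientTape() as mt:
    fast_W, fast_b = W, b                    # inner loop starts from the initial weights
    for _ in range(10):                      # K inner gradient steps
        with tf.GradientTape() as t:
            t.watch([fast_W, fast_b])        # fast weights are plain tensors after step 1
            loss = loss_function(y_train, forward(x_train, fast_W, fast_b))
        gW, gb = t.gradient(loss, [fast_W, fast_b])
        fast_W = fast_W - inner_lr * gW      # plain tensor ops, so mt can still trace them
        fast_b = fast_b - inner_lr * gb
    test_loss = loss_function(y_test, forward(x_test, fast_W, fast_b))

meta_grads = mt.gradient(test_loss, [W, b])  # gradients w.r.t. the initial weights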
Related
When trying to train a multi-task model in TensorFlow using Keras, I run into gradients whose values are None. When I use the regular model.fit(), I am able to train normally (without any issues), but TensorBoard shows some missing distributions (some of them are entirely missing), which makes me believe something is going on with the gradients.
Then, when I try to debug this using a custom training loop, I see gradient values coming out as None. For example:
w1 = tf.Variable(1.0)
w2 = tf.Variable(1.0)

for step, (x_batch_train, y_batch_train) in enumerate(trainDS):
    with tf.GradientTape(persistent=True) as tape:
        logits = model(x_batch_train, training=True)
        loss1 = loss_fn1(y_batch_train['label1'], logits[0])
        loss2 = loss_fn2(y_batch_train['label2'], logits[1])
        l1 = tf.math.multiply(w1, loss1)
        l2 = tf.math.multiply(w2, loss2)
        loss_value = tf.math.add(l1, l2)
    grad_task_1 = tape.gradient(l1, model.trainable_weights)
I get gradients of None when inspecting the list.
Eventually I want to compute the norm of the grad_task_1 gradients, but I cannot do that with Nones.
Is there anything I am doing wrong, either in my approach or in the design? Or am I missing something?
Could anyone kindly provide some guidance? It would be greatly appreciated. Thank you.
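As an aside (not from the original post), on the norm computation mentioned above: once the gradients exist, a common way to get a single norm over a list of gradients is tf.linalg.global_norm, and disconnected variables can be skipped explicitly. A short sketch, continuing from the loop above:

# Hypothetical continuation: norm of the task-1 gradients, ignoring any None
# entries (variables not connected to l1).
connected = [g for g in grad_task_1 if g is not None]
grad_task_1_norm = tf.linalg.global_norm(connected)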
I have been trying to implement the training step for a DQN described in this paper on various RL methods using TensorFlow, but when I try to compute the gradient using a GradientTape I get a ValueError: No gradients provided for any variable. Below is the training step code:
def train_step(model, target, optimizer, observations, actions, rewards, next_observations):
    with tf.GradientTape() as tape:
        target_logits = tf.math.reduce_max(target(np.expand_dims(next_observations, -1)), 1)
        logits = model(np.expand_dims(observations, -1))
        act_logits = np.ndarray(EXPERIENCE_SAMPLE_SIZE)
        for i in range(EXPERIENCE_SAMPLE_SIZE):
            act_logits[i] = logits[i][actions[i]]
        act_logits = tf.convert_to_tensor(act_logits, dtype=tf.float32)
        y_T = tf.math.add(tf.convert_to_tensor(rewards, dtype=tf.float32), tf.math.scalar_mul(DISCOUNT_RATE, target_logits))
        loss = tf.math.squared_difference(act_logits, y_T)
        loss = tf.math.scalar_mul(1.0 / EXPERIENCE_SAMPLE_SIZE, loss)
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
Here model and target are tf.keras.Sequential models that output the expected value for each of 5 possible actions, optimizer is SGD, and observations, actions, rewards, and next_observations are NumPy arrays sampled from an experience replay buffer.
This is part of implementing pseudocode from the aforementioned paper.
My best guess is that this error occurs because indexing logits through a NumPy array breaks the gradient chain, but I don't know how else to calculate the Q*(s, a; theta) quantity.
Adding the Solution in the Answer Section for the benefit of the Community.
From Comments:
The problem is resolved by replacing the code:
act_logits = np.ndarray(EXPERIENCE_SAMPLE_SIZE)
for i in range(EXPERIENCE_SAMPLE_SIZE):
    act_logits[i] = logits[i][actions[i]]
with the code:
act_logits = tf.math.reduce_max(tf.math.multiply(act_logits, logits), 1)
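Presumably act_logits in that replacement already holds a one-hot encoding of the chosen actions rather than the NumPy buffer from the original code; the key point is that selecting the taken action's Q-value with TensorFlow ops keeps the computation differentiable. A hedged sketch of that idea (NUM_ACTIONS and the mask-and-sum form are illustrative, not taken from the accepted fix):

# Select Q(s, a) for the action actually taken, using only TF ops so the tape can
# differentiate through the selection; `actions` holds integer action indices.
action_mask = tf.one_hot(actions, depth=NUM_ACTIONS, dtype=tf.float32)
act_logits = tf.reduce_sum(action_mask * logits, axis=1)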
I'm using TensorFlow 2.1 and Python 3, creating my custom training model following the tutorial "TensorFlow - Custom training: walkthrough".
I'm trying to use Hamming distance as my loss function:
import tensorflow as tf
import tensorflow_addons as tfa

def my_loss_hamming(model, x, y):
    global output
    output = model(x)
    return tfa.metrics.hamming.hamming_loss_fn(y, output, threshold=0.5, mode='multilabel')

def grad(model, inputs, targets):
    with tf.GradientTape() as tape:
        tape.watch(model.trainable_variables)
        loss_value = my_loss_hamming(model, inputs, targets)
    return loss_value, tape.gradient(loss_value, model.trainable_variables)
When I call it:
loss_value, grads = grad(model, feature, label)
optimizer.apply_gradients(zip(grads, model.trainable_variables))
The grads variable is a list of 38 Nones.
And I get the error:
No gradients provided for any variable: ['conv1_1/kernel:0', ...]
Is there any way to use Hamming distance without interrupting the gradient chain registered by the gradient tape?
Apologies if I'm saying something obvious, but backpropagation fits a neural network through gradients: for each batch of training data, you compute how much the loss function improves or degrades if you move a particular trainable weight by a very small amount delta.
Hamming loss is by definition not differentiable, so for small movements of the trainable weights you will never see any change in the loss. I imagine it is intended for final measurement of a trained model's performance rather than for training.
If you want to train a neural net through backpropagation you need to use some differentiable loss, one that can help the model move its weights in the right direction. Sometimes people smooth losses like the Hamming loss into differentiable approximations: here it could be something that penalizes predictions close to the target less, rather than just giving 1 for everything above the threshold and 0 for everything else.
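Not part of the original answer, but a minimal sketch of what such a smoothed surrogate could look like in the multilabel setting (binary cross-entropy is the more standard differentiable choice; the soft Hamming variant below is only an illustration):

import tensorflow as tf

def soft_hamming_loss(y_true, y_pred):
    # y_pred are probabilities in [0, 1]; no thresholding, so gradients can flow.
    return tf.reduce_mean(tf.abs(tf.cast(y_true, tf.float32) - y_pred))

def my_loss_differentiable(model, x, y):
    output = model(x)
    return soft_hamming_loss(y, output)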
I'm training a model with TensorFlow 2.0. The images in my training set have different resolutions. The model I've built can handle variable resolutions (conv layers followed by global averaging). My training set is very small and I want to use the full training set in a single batch.
Since my images have different resolutions, I can't use model.fit(). So I'm planning to pass each sample through the network individually, accumulate the errors/gradients, and then apply one optimizer step. I'm able to compute loss values, but I don't know how to accumulate the losses/gradients. How can I accumulate the losses/gradients and then apply a single optimizer step?
Code:
for i in range(num_epochs):
    print(f'Epoch: {i + 1}')
    total_loss = 0
    for j in tqdm(range(num_samples)):
        sample = samples[j]
        with tf.GradientTape() as tape:
            prediction = self.model(sample)
            loss_value = self.loss_function(y_true=labels[j], y_pred=prediction)
        gradients = tape.gradient(loss_value, self.model.trainable_variables)
        self.optimizer.apply_gradients(zip(gradients, self.model.trainable_variables))
        total_loss += loss_value
    epoch_loss = total_loss / num_samples
    print(f'Epoch loss: {epoch_loss}')
If I understand correctly from this statement:
How can I accumulate the losses/gradients and then apply a single optimizer step?
@Nagabhushan is trying to accumulate gradients and then apply the optimization to the (mean) accumulated gradient. The answer provided by @TensorflowSupport does not answer that.
In order to perform the optimization only once, and accumulate the gradient from several tapes, you can do the following:
for i in range(num_epochs):
    print(f'Epoch: {i + 1}')
    total_loss = 0

    # get trainable variables
    train_vars = self.model.trainable_variables
    # Create empty gradient list (not a tf.Variable list)
    accum_gradient = [tf.zeros_like(this_var) for this_var in train_vars]

    for j in tqdm(range(num_samples)):
        sample = samples[j]
        with tf.GradientTape() as tape:
            prediction = self.model(sample)
            loss_value = self.loss_function(y_true=labels[j], y_pred=prediction)
        total_loss += loss_value

        # get gradients of this tape
        gradients = tape.gradient(loss_value, train_vars)
        # Accumulate the gradients
        accum_gradient = [(accum_grad + grad) for accum_grad, grad in zip(accum_gradient, gradients)]

    # Now, after executing all the tapes you needed, we apply the optimization step
    # (but first we take the average of the gradients)
    accum_gradient = [this_grad / num_samples for this_grad in accum_gradient]
    # apply optimization step
    self.optimizer.apply_gradients(zip(accum_gradient, train_vars))

    epoch_loss = total_loss / num_samples
    print(f'Epoch loss: {epoch_loss}')
Using tf.Variable() should be avoided inside the training loop, since it will produce errors when the code is executed as a graph. If you use tf.Variable() inside your training function and then decorate it with @tf.function, or apply tf.function(my_train_fcn) to obtain a graph function (i.e. for improved performance), the execution will raise an error.
This happens because tracing a tf.Variable creation results in different behaviour from that observed in eager execution (re-utilization vs. creation, respectively). You can find more info on this in the TensorFlow documentation.
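For illustration only (this pattern does not appear in the original answer): the problematic case is a training step that creates a fresh tf.Variable on every call, which works eagerly but breaks once the function is traced as a graph.

import tensorflow as tf

@tf.function
def bad_train_step(grad):
    # A new tf.Variable is created on every call: fine in eager mode, but once this
    # function is traced it typically fails with
    # "tf.function-decorated function tried to create variables on non-first call."
    accum = tf.Variable(tf.zeros_like(grad))
    return accum.assign_add(grad)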
In line with the Stack Overflow answer and the explanation provided on the TensorFlow website, below is the code for accumulating gradients in TensorFlow 2.0:
def train(epochs):
    for epoch in range(epochs):
        for (batch, (images, labels)) in enumerate(dataset):
            with tf.GradientTape() as tape:
                logits = mnist_model(images, training=True)
                tvs = mnist_model.trainable_variables
                accum_vars = [tf.Variable(tf.zeros_like(tv.initialized_value()), trainable=False) for tv in tvs]
                zero_ops = [tv.assign(tf.zeros_like(tv)) for tv in accum_vars]
                loss_value = loss_object(labels, logits)
            loss_history.append(loss_value.numpy().mean())
            grads = tape.gradient(loss_value, tvs)
            #print(grads[0].shape)
            #print(accum_vars[0].shape)
            accum_ops = [accum_vars[i].assign_add(grad) for i, grad in enumerate(grads)]
            optimizer.apply_gradients(zip(grads, mnist_model.trainable_variables))
        print('Epoch {} finished'.format(epoch))

# Call the above function
train(epochs=3)
Complete code can be found in this Github Gist.
I'm looking to use GradientTape() on a custom NN architecture, but I don't see an explanation anywhere of how to use it to compute gradients for the biases. A similar question was answered here, but it was not answered fully.
As a simple example, I have the training step for my NN like so:
self.W = ...  ## Initialized earlier on
self.b = ...  ## Initialized earlier on

@tf.function
def train(self):
    with tf.GradientTape() as tape:
        pred = self.feedforward()
        loss = self.loss_evaluation()
    grad = tape.gradient(loss, self.W)
    grad = tape.gradient(loss, self.b)  ## How do I do this?
    optimizer.apply_gradients(zip(grad, self.W))
    optimizer.apply_gradients(zip(grad, self.b))  ## How do I do this?
Put simply, I cannot evaluate the gradients with respect to the biases, as nowhere in any documentation or tutorial do I see the bias term included. So, how do I go about implementing the bias term as a trainable variable in my code? I'm not looking to implement this with Keras, so please don't suggest I use trainable_variables; I want to do it from scratch.
@thushv89 The code Jamie showed doesn't work because you can't call gradient() twice on the same (non-persistent) tape.
Jamie, why can't you simply do the following?
with tf.GradientTape() as tape:
    pred = self.feedforward()
    loss = self.loss_evaluation()
grads = tape.gradient(loss, [self.W, self.b])
optimizer.apply_gradients(zip(grads, [self.W, self.b]))
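If separate gradient calls are really wanted, a possible alternative (a sketch, assuming self.W and self.b are single tf.Variables as in the question) is to make the tape persistent so it can be queried more than once:

with tf.GradientTape(persistent=True) as tape:
    pred = self.feedforward()
    loss = self.loss_evaluation()
grad_W = tape.gradient(loss, self.W)
grad_b = tape.gradient(loss, self.b)
optimizer.apply_gradients(zip([grad_W, grad_b], [self.W, self.b]))
del tape  # release the resources held by a persistent tape once you're done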