I have been trying to implement the training step for a DQN described in this paper on various RL methods using TensorFlow, but when I try to compute the gradient using a GradientTape I get a ValueError: No gradients provided for any variable:. Below is the training step code:
def train_step(model, target, optimizer, observations, actions, rewards, next_observations):
with tf.GradientTape() as tape:
target_logits = tf.math.reduce_max(target(np.expand_dims(next_observations, -1)), 1)
logits = model(np.expand_dims(observations, -1))
act_logits = np.ndarray(EXPERIENCE_SAMPLE_SIZE)
for i in range(EXPERIENCE_SAMPLE_SIZE):
act_logits[i] = logits[i][actions[i]]
act_logits = tf.convert_to_tensor(act_logits, dtype=tf.float32)
y_T = tf.math.add(tf.convert_to_tensor(rewards, dtype=tf.float32), tf.math.scalar_mul(DISCOUNT_RATE, target_logits))
loss = tf.math.squared_difference(act_logits, y_T)
loss = tf.math.scalar_mul(1.0 / EXPERIENCE_SAMPLE_SIZE, loss)
grads = tape.gradient(loss, model.trainable_variables)
optimizer.apply_gradients(zip(grads, model.trainable_variables))
Where model and target are tf.keras.Sequential that output the expected value for taking each of 5 possible actions, optimizer is SGD, and observations, actions, rewards, and next_observations are numpy arrays sampled from an experience replay buffer.
This is part of implementing the following pseudocode from the aforementioned paper:
My best guess is that this error is because indexing logits makes the gradient impossible to differentiate, but I don't know else to calculate the Q*(s,a,theta) quantity.
Adding the Solution in the Answer Section for the benefit of the Community.
From Comments:
The problem is resolved by replacing the code:
act_logits = np.ndarray(EXPERIENCE_SAMPLE_SIZE)
for i in range(EXPERIENCE_SAMPLE_SIZE):
act_logits[i] = logits[i][actions[i]]
with the code:
act_logits = tf.math.reduce_max(tf.math.multiply(act_logits, logits), 1)
Related
I have been trying to implement policy gradient algorithm in reinforcement learning. However, I am facing the error"ValueError: No gradients provided for any variable:" while computing the gradients for the custom loss function as shown below:
def loss_function(prob, action, reward):
prob_action = np.array([prob.numpy()[0][action]]) #prob is like ->[0.4900, 0.5200] and action is scalar index->1,0
log_prob = tf.math.log(prob_action)
loss = tf.multiply(log_prob, (-reward))
return loss
I am computing the gradients as below:
def update_policy(policy, states, actions, discounted_rewards):
opt = tf.keras.optimizers.SGD(learning_rate=0.1)
for state, reward, action in zip(states, discounted_rewards, actions):
with tf.GradientTape() as tape:
prob = policy(state, training=True)
loss = loss_function(prob, action, reward)
print(loss)
gradients = tape.gradient(loss, policy.trainable_variables)
opt.apply_gradients(zip(gradients, policy.trainable_variables))
Kindly please help me out in this issue.
Thank you
As #gekrone indicates in the comment this is definetly due to the gradients not flowing due to prob_action being a numpy array and not a tensor. Also be careful not to use the .numpy() method. Probably stick to something like
prob_action = prob[0][action]
...
and this should work.
I am trying to compute "gradients through gradients" for a paper (MAML, by C.Finn et al.) in Tensorflow 2 with Keras backend. Thus, we start at some initial weights, compute K gradient update steps, and want to backpropagate through our initial weights. The code sample belows illustrates what I want to achieve, but unfortunately does not work.
optimizer = tf.keras.SGD()
initial_weights = model.trainable_variables
with tf.GradientTape() as mt:
for gradient_steps in range(10):
with tf.GradientTape() as t:
loss = loss_function(y_train, model(x_train))
grads = t.gradient(loss, model.trainable_variables)
optimizer.apply_gradients(zip(grads, model.trainable_variables))
test_loss = loss_function(y_test, model(x_test))
mt.gradient(test_loss, initial_weights)
Does anyone know how to differentiate through the initialization? Any help would be greatly appreciated!
I'm using Tensorflow 2.1 and Python 3, creating my custom training model following the tutorial "Tensorflow - Custom training: walkthrough".
I'm trying to use Hamming Distance on my loss function:
import tensorflow as tf
import tensorflow_addons as tfa
def my_loss_hamming(model, x, y):
global output
output = model(x)
return tfa.metrics.hamming.hamming_loss_fn(y, output, threshold=0.5, mode='multilabel')
def grad(model, inputs, targets):
with tf.GradientTape() as tape:
tape.watch(model.trainable_variables)
loss_value = my_loss_hamming(model, inputs, targets)
return loss_value, tape.gradient(loss_value, model.trainable_variables)
When I call it:
loss_value, grads = grad(model, feature, label)
optimizer.apply_gradients(zip(grads, model.trainable_variables))
grads variable is a list with 38 None.
And I get the error:
No gradients provided for any variable: ['conv1_1/kernel:0', ...]
Is there any way to use Hamming Distance without "interrupts the gradient chain registered by the gradient tape"?
Apology if I'm saying something obvious, but the way how backpropagation works as a fitting algorithm for neural networks is through gradients - e.g. for each batch of training data you compute how much the loss function will improve/degrade if you move a particular trainable weight by a very small amount delta.
Hamming loss is by definition not differentiable, so for small movements of trainable weights you will never experience any changes in the loss. I imagine it is only added to be used for final measurements of trained models' performance rather than for training.
If you want to train a neural net through backpropagation you need to use some differentiable loss - such that can help the model to move weights in the right direction. Sometimes people use different techniques to smooth such losses as Hamming less and create approximations - e.g. here it could be something which would penalize less predictions which are closer to the target answer rather then just giving out 1 for everything above threshold and 0 for everything else.
I'm training a model with tensorflow 2.0. The images in my training set are of different resolutions. The Model I've built can handle variable resolutions (conv layers followed by global averaging). My training set is very small and I want to use full training set in a single batch.
Since my images are of different resolutions, I can't use model.fit(). So, I'm planning to pass each sample through the network individually, accumulate the errors/gradients and then apply one optimizer step. I'm able to compute loss values, but I don't know how to accumulate the losses/gradients. How can I accumulate the losses/gradients and then apply a single optimizer step?
Code:
for i in range(num_epochs):
print(f'Epoch: {i + 1}')
total_loss = 0
for j in tqdm(range(num_samples)):
sample = samples[j]
with tf.GradientTape as tape:
prediction = self.model(sample)
loss_value = self.loss_function(y_true=labels[j], y_pred=prediction)
gradients = tape.gradient(loss_value, self.model.trainable_variables)
self.optimizer.apply_gradients(zip(gradients, self.model.trainable_variables))
total_loss += loss_value
epoch_loss = total_loss / num_samples
print(f'Epoch loss: {epoch_loss}')
If I understand correctly from this statement:
How can I accumulate the losses/gradients and then apply a single optimizer step?
#Nagabhushan is trying to accumulate gradients and then apply the optimization on the (mean) accumulated gradient. The answer provided by #TensorflowSupport does not answers it.
In order to perform the optimization only once, and accumulate the gradient from several tapes, you can do the following:
for i in range(num_epochs):
print(f'Epoch: {i + 1}')
total_loss = 0
# get trainable variables
train_vars = self.model.trainable_variables
# Create empty gradient list (not a tf.Variable list)
accum_gradient = [tf.zeros_like(this_var) for this_var in train_vars]
for j in tqdm(range(num_samples)):
sample = samples[j]
with tf.GradientTape as tape:
prediction = self.model(sample)
loss_value = self.loss_function(y_true=labels[j], y_pred=prediction)
total_loss += loss_value
# get gradients of this tape
gradients = tape.gradient(loss_value, train_vars)
# Accumulate the gradients
accum_gradient = [(acum_grad+grad) for acum_grad, grad in zip(accum_gradient, gradients)]
# Now, after executing all the tapes you needed, we apply the optimization step
# (but first we take the average of the gradients)
accum_gradient = [this_grad/num_samples for this_grad in accum_gradient]
# apply optimization step
self.optimizer.apply_gradients(zip(accum_gradient,train_vars))
epoch_loss = total_loss / num_samples
print(f'Epoch loss: {epoch_loss}')
Using tf.Variable() should be avoided inside the training loop, since it will produce errors when trying to execute the code as a graph. If you use tf.Variable() inside your training function and then decorate it with "#tf.function" or apply "tf.function(my_train_fcn)" to obtain a graph function (i.e. for improved performance), the execution will rise an error.
This happens because the tracing of the tf.Variable function results in a different behaviour than the observed in eager execution (re-utilization or creation, respectively). You can find more info on this in the tensorflow help page.
In line with the Stack Overflow Answer and the explanation provided in Tensorflow Website, mentioned below is the code for Accumulating Gradients in Tensorflow Version 2.0:
def train(epochs):
for epoch in range(epochs):
for (batch, (images, labels)) in enumerate(dataset):
with tf.GradientTape() as tape:
logits = mnist_model(images, training=True)
tvs = mnist_model.trainable_variables
accum_vars = [tf.Variable(tf.zeros_like(tv.initialized_value()), trainable=False) for tv in tvs]
zero_ops = [tv.assign(tf.zeros_like(tv)) for tv in accum_vars]
loss_value = loss_object(labels, logits)
loss_history.append(loss_value.numpy().mean())
grads = tape.gradient(loss_value, tvs)
#print(grads[0].shape)
#print(accum_vars[0].shape)
accum_ops = [accum_vars[i].assign_add(grad) for i, grad in enumerate(grads)]
optimizer.apply_gradients(zip(grads, mnist_model.trainable_variables))
print ('Epoch {} finished'.format(epoch))
# Call the above function
train(epochs = 3)
Complete code can be found in this Github Gist.
Suppose we have a simple Keras model that uses BatchNormalization:
model = tf.keras.Sequential([
tf.keras.layers.InputLayer(input_shape=(1,)),
tf.keras.layers.BatchNormalization()
])
How to actually use it with GradientTape? The following doesn't seem to work as it doesn't update the moving averages?
# model training... we want the output values to be close to 150
for i in range(1000):
x = np.random.randint(100, 110, 10).astype(np.float32)
with tf.GradientTape() as tape:
y = model(np.expand_dims(x, axis=1))
loss = tf.reduce_mean(tf.square(y - 150))
grads = tape.gradient(loss, model.variables)
opt.apply_gradients(zip(grads, model.variables))
In particular, if you inspect the moving averages, they remain the same (inspect model.variables, averages are always 0 and 1). I know one can use .fit() and .predict(), but I would like to use the GradientTape and I'm not sure how to do this. Some version of the documentation suggests to update update_ops, but that doesn't seem to work in eager mode.
In particular, the following code will not output anything close to 150 after the above training.
x = np.random.randint(200, 210, 100).astype(np.float32)
print(model(np.expand_dims(x, axis=1)))
with gradient tape mode BatchNormalization layer should be called with argument training=True
example:
inp = KL.Input( (64,64,3) )
x = inp
x = KL.Conv2D(3, kernel_size=3, padding='same')(x)
x = KL.BatchNormalization()(x, training=True)
model = KM.Model(inp, x)
then moving vars are properly updated
>>> model.layers[2].weights[2]
<tf.Variable 'batch_normalization/moving_mean:0' shape=(3,) dtype=float32, numpy
=array([-0.00062087, 0.00015137, -0.00013239], dtype=float32)>
I just give up. I spent quiet a bit of time trying to make sense of a model that looks like:
model = tf.keras.Sequential([
tf.keras.layers.BatchNormalization(),
])
And I do give up because that thing looks like that:
My intuition was that BatchNorm these days is not as straight forward as it used to be and that is why it scales original distribution but not so much new distribution (which is a shame), but ain't nobody got time for that.
Edit: the reason for that behavior is that BN only calculates moments and normalizes batches during training. During training it maintains running averages of mean and deviation and once you switch to evaluation, parameters are used as constants. i.e. evaluation should not depend on normalization because evaluation can be used even for a single input and can not rely on batch statistics. Since constants are calculated on a different distribution, you are getting a higher error during evaluation.
With Gradient Tape mode, you would usually find gradients like:
with tf.GradientTape() as tape:
y_pred = model(features)
loss = your_loss_function(y_pred, y_true)
gradients = tape.gradient(loss, model.trainable_variables)
train_op = model.optimizer.apply_gradients(zip(gradients, model.trainable_variables))
However, if your model contains BatchNormalization or Dropout layer (or any layer that has different train/test phases) then tf will fail building the graph.
A good practice would be to explicitly use trainable parameter when obtaining output from a model. When optimizing use model(features, trainable=True) and when predicting use model(features, trainable=False), in order to explicitly choose train/test phase when using such layers.
For PREDICT and EVAL phase, use
training = (mode == tf.estimator.ModeKeys.TRAIN)
y_pred = model(features, trainable=training)
For TRAIN phase, use
with tf.GradientTape() as tape:
y_pred = model(features, trainable=training)
loss = your_loss_function(y_pred, y_true)
gradients = tape.gradient(loss, model.trainable_variables)
train_op = model.optimizer.apply_gradients(zip(gradients, model.trainable_variables))
Note that, iperov's answer works as well, except that you will need to set the training phase manually for those layers.
x = BatchNormalization()(x, training=True)
x = Dropout(rate=0.25)(x, training=True)
x = BatchNormalization()(x, training=False)
x = Dropout(rate=0.25)(x, training=False)
I'd recommended to have one get_model function that returns the model, while changing the phase using training parameter when calling the model.
Note:
If you use model.variables when finding gradients, you'll get this warning
Gradients do not exist for variables
['layer_1_bn/moving_mean:0',
'layer_1_bn/moving_variance:0',
'layer_2_bn/moving_mean:0',
'layer_2_bn/moving_variance:0']
when minimizing the loss.
This can be resolved by computing gradients only against trainable variables. Replace model.variables with model.trainable_variables