ValueError: No gradients provided for any variable in policy gradient - python

I have been trying to implement the policy gradient algorithm in reinforcement learning. However, I am facing the error "ValueError: No gradients provided for any variable:" while computing the gradients for the custom loss function shown below:
def loss_function(prob, action, reward):
    prob_action = np.array([prob.numpy()[0][action]])  # prob is like -> [0.4900, 0.5200] and action is a scalar index -> 1, 0
    log_prob = tf.math.log(prob_action)
    loss = tf.multiply(log_prob, (-reward))
    return loss
I am computing the gradients as below:
def update_policy(policy, states, actions, discounted_rewards):
    opt = tf.keras.optimizers.SGD(learning_rate=0.1)
    for state, reward, action in zip(states, discounted_rewards, actions):
        with tf.GradientTape() as tape:
            prob = policy(state, training=True)
            loss = loss_function(prob, action, reward)
            print(loss)
        gradients = tape.gradient(loss, policy.trainable_variables)
        opt.apply_gradients(zip(gradients, policy.trainable_variables))
Could someone kindly help me out with this issue?
Thank you

As @gekrone indicates in the comments, this is definitely due to the gradients not flowing, because prob_action is a NumPy array and not a tensor. Also be careful not to use the .numpy() method. Stick to something like
prob_action = prob[0][action]
...
and this should work.
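For reference, a minimal sketch of the same loss written entirely with TensorFlow ops, so the tape can trace it (assuming prob has shape (1, num_actions) and reward is a scalar):

def loss_function(prob, action, reward):
    # Index the action probability with TF ops only, so the gradient
    # can flow from the loss back into the policy network.
    prob_action = prob[0, action]
    log_prob = tf.math.log(prob_action)
    # REINFORCE-style loss: negative reward-weighted log-probability.
    return -log_prob * reward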

Related

Tensorflow / Keras gradients return a None value in multi-task model

When trying to train a multi-task model in TensorFlow using Keras, I run into gradients whose values are None. When I use the regular model.fit() I am able to train normally, but TensorBoard shows missing distributions for some layers (some are missing entirely), which makes me believe something is going on with the gradients.
Then, when I try to debug this using a custom training loop, I see gradient values coming out as None. For example, like this:
w1 = tf.Variable(1.0)
w2 = tf.Variable(1.0)

for step, (x_batch_train, y_batch_train) in enumerate(trainDS):
    with tf.GradientTape(persistent=True) as tape:
        logits = model(x_batch_train, training=True)
        loss1 = loss_fn1(y_batch_train['label1'], logits[0])
        loss2 = loss_fn2(y_batch_train['label2'], logits[1])
        l1 = tf.math.multiply(w1, loss1)
        l2 = tf.math.multiply(w2, loss2)
        loss_value = tf.math.add(l1, l2)
    grad_task_1 = tape.gradient(l1, model.trainable_weights)
I get gradients of None when inspecting the list.
Eventually I want to compute the norm of the grad_task_1 gradients, but I cannot do that with Nones.
Is there anything I am doing wrong from an approach POV or from a design POV?
Or am I missing something?
Could anyone kindly provide some guidance? It would be greatly appreciated.
Thank you
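A minimal sketch for narrowing this down, assuming the loop above: pair each trainable weight with its gradient and print the ones that come back as None, which usually points to a variable that never contributes to the loss being differentiated.

grad_task_1 = tape.gradient(l1, model.trainable_weights)
for var, grad in zip(model.trainable_weights, grad_task_1):
    if grad is None:
        # This variable did not influence l1 inside the tape's scope.
        print("No gradient for:", var.name)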

ValueError: No gradients provided for any variable when calculating loss

I have been trying to implement the training step for a DQN described in this paper on various RL methods using TensorFlow, but when I try to compute the gradient using a GradientTape I get ValueError: No gradients provided for any variable. Below is the training step code:
def train_step(model, target, optimizer, observations, actions, rewards, next_observations):
    with tf.GradientTape() as tape:
        target_logits = tf.math.reduce_max(target(np.expand_dims(next_observations, -1)), 1)
        logits = model(np.expand_dims(observations, -1))
        act_logits = np.ndarray(EXPERIENCE_SAMPLE_SIZE)
        for i in range(EXPERIENCE_SAMPLE_SIZE):
            act_logits[i] = logits[i][actions[i]]
        act_logits = tf.convert_to_tensor(act_logits, dtype=tf.float32)
        y_T = tf.math.add(tf.convert_to_tensor(rewards, dtype=tf.float32), tf.math.scalar_mul(DISCOUNT_RATE, target_logits))
        loss = tf.math.squared_difference(act_logits, y_T)
        loss = tf.math.scalar_mul(1.0 / EXPERIENCE_SAMPLE_SIZE, loss)
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
Here model and target are tf.keras.Sequential models that output the expected value of taking each of 5 possible actions, optimizer is SGD, and observations, actions, rewards, and next_observations are NumPy arrays sampled from an experience replay buffer.
This is part of implementing the following pseudocode from the aforementioned paper:
My best guess is that this error occurs because copying the indexed logits into a NumPy array makes the result impossible to differentiate, but I don't know how else to calculate the Q*(s, a, theta) quantity.
Adding the Solution in the Answer Section for the benefit of the Community.
From Comments:
The problem is resolved by replacing the code:
act_logits = np.ndarray(EXPERIENCE_SAMPLE_SIZE)
for i in range(EXPERIENCE_SAMPLE_SIZE):
    act_logits[i] = logits[i][actions[i]]
with the code:
act_logits = tf.math.reduce_max(tf.math.multiply(act_logits, logits), 1)
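Presumably the act_logits being multiplied here is a one-hot encoding of the sampled actions, so selecting Q(s, a) stays inside the TensorFlow graph. A minimal sketch of that pattern, assuming a NUM_ACTIONS constant and using reduce_sum so that negative Q-values are also handled:

action_mask = tf.one_hot(actions, NUM_ACTIONS)                        # shape (batch, NUM_ACTIONS)
act_logits = tf.reduce_sum(tf.multiply(action_mask, logits), axis=1)  # shape (batch,)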

Use Hamming Distance Loss Function with Tensorflow GradientTape: no gradients. Is it not differentiable?

I'm using Tensorflow 2.1 and Python 3, creating my custom training model following the tutorial "Tensorflow - Custom training: walkthrough".
I'm trying to use Hamming Distance on my loss function:
import tensorflow as tf
import tensorflow_addons as tfa

def my_loss_hamming(model, x, y):
    global output
    output = model(x)
    return tfa.metrics.hamming.hamming_loss_fn(y, output, threshold=0.5, mode='multilabel')

def grad(model, inputs, targets):
    with tf.GradientTape() as tape:
        tape.watch(model.trainable_variables)
        loss_value = my_loss_hamming(model, inputs, targets)
    return loss_value, tape.gradient(loss_value, model.trainable_variables)
When I call it:
loss_value, grads = grad(model, feature, label)
optimizer.apply_gradients(zip(grads, model.trainable_variables))
The grads variable is a list of 38 None values.
And I get the error:
No gradients provided for any variable: ['conv1_1/kernel:0', ...]
Is there any way to use Hamming distance without "interrupting the gradient chain registered by the gradient tape"?
Apologies if I'm saying something obvious, but the way backpropagation works as a fitting algorithm for neural networks is through gradients: for each batch of training data you compute how much the loss function will improve or degrade if you move a particular trainable weight by a very small amount delta.
Hamming loss is by definition not differentiable, so for small movements of the trainable weights you will never see any change in the loss. I imagine it is only meant for final measurements of a trained model's performance rather than for training.
If you want to train a neural net through backpropagation you need to use a differentiable loss, one that can help the model move its weights in the right direction. Sometimes people use smoothing techniques to approximate losses like Hamming loss, for example something that penalizes predictions closer to the target answer less harshly, rather than just giving out 1 for everything above the threshold and 0 for everything else; a sketch of one such surrogate follows.
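A minimal sketch of such a differentiable surrogate, assuming multilabel 0/1 targets and sigmoid outputs in [0, 1]:

import tensorflow as tf

def soft_hamming_loss(y_true, y_prob):
    # Replace the hard threshold with the probabilities themselves:
    # the expected fraction of mismatched labels, differentiable in y_prob.
    y_true = tf.cast(y_true, y_prob.dtype)
    per_label = y_true * (1.0 - y_prob) + (1.0 - y_true) * y_prob
    return tf.reduce_mean(per_label)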

How to detect vanishing and exploding gradients with Tensorboard?

I have two "sub-questions":
1) How can I detect vanishing or exploding gradients with TensorBoard, given that write_grads=True is currently deprecated in the TensorBoard callback as per "un-deprecate write_grads for fit #31173"?
2) I figured I can probably tell whether my model suffers from vanishing gradients based on the weights' distributions and histograms in the Distributions and Histograms tab in TensorBoard. My problem is that I have no frame of reference to compare with. Currently, my biases seem to be "moving", but I can't tell whether my kernel weights (Conv2D layers) are "moving"/"changing" enough. Can someone give a rule of thumb for assessing this visually in TensorBoard? For example, if only the bottom 25th percentile of kernel weights is moving, is that good enough or not? Or perhaps someone could post two reference images from TensorBoard of vanishing vs. non-vanishing gradients.
Here are my histograms and distributions; is it possible to tell whether my model suffers from vanishing gradients? (Some layers omitted for brevity.) Thanks in advance.
I am currently facing the same question and approached the problem similarly using TensorBoard.
Even though write_grads is deprecated, you can still manage to log the gradients for each layer of your network by subclassing the tf.keras.Model class and computing the gradients manually with tf.GradientTape in the train_step method.
Something similar to this is working for me:
from tensorflow.keras import Model

class TrainWithCustomLogsModel(Model):

    def __init__(self, **kwargs):
        super(TrainWithCustomLogsModel, self).__init__(**kwargs)
        self.step = tf.Variable(0, dtype=tf.int64, trainable=False)

    def train_step(self, data):
        # Get batch images and labels
        x, y = data
        # Compute the batch loss
        with tf.GradientTape() as tape:
            p = self(x, training=True)
            loss = self.compiled_loss(y, p, regularization_losses=self.losses)
        # Compute gradients for each weight of the network. Note trainable_vars and gradients are lists of tensors
        trainable_vars = self.trainable_variables
        gradients = tape.gradient(loss, trainable_vars)
        # Log gradients in Tensorboard
        self.step.assign_add(tf.constant(1, dtype=tf.int64))
        # tf.print(self.step)
        with train_summary_writer.as_default():
            for var, grad in zip(trainable_vars, gradients):
                name = var.name
                var, grad = tf.squeeze(var), tf.squeeze(grad)
                tf.summary.histogram(name, var, step=self.step)
                tf.summary.histogram('Gradients_' + name, grad, step=self.step)
        # Update model's weights
        self.optimizer.apply_gradients(zip(gradients, trainable_vars))
        del tape
        # Update metrics (includes the metric that tracks the loss)
        self.compiled_metrics.update_state(y, p)
        # Return a dict mapping metric names to current value
        return {m.name: m.result() for m in self.metrics}
You should then be able to visualize the distributions of your gradients for any training step, along with the distributions of your kernels' values.
Moreover, it might be worth trying to plot the gradient norm through time instead of the individual values, as sketched below.
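A minimal sketch of that idea, assuming the same train_step and train_summary_writer as above: log the norm of each layer's gradient as a scalar so its evolution over training is easy to read.

# Inside train_step, after computing `gradients`:
with train_summary_writer.as_default():
    for var, grad in zip(trainable_vars, gradients):
        if grad is not None:
            tf.summary.scalar('GradNorm_' + var.name, tf.norm(grad), step=self.step)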

How to use TensorFlow's GradientTape() to compute biases

I'm looking to implement GradientTape() on a custom NN architecture but I don't see an explanation anywhere on how to use this to compute biases. A similar question was answered here, but it was not answered fully.
As a simple example, I have the training step for my NN like so:
self.W = ## Initialized earlier on
self.b = ## Initialized earlier on

@tf.function
def train(self):
    with tf.GradientTape() as tape:
        pred = self.feedforward()
        loss = self.loss_evaluation()
    grad = tape.gradient(loss, self.W)
    grad = tape.gradient(loss, self.b)  ## How do I do this?
    optimizer.apply_gradients(zip(grad, self.W))
    optimizer.apply_gradients(zip(grad, self.b))  ## How do I do this?
Put simply, I cannot evaluate the gradients with respect to the biases as nowhere in any documentation or tutorial do I see the bias term included. So, how do I go about implementing the bias term as a trainable variable in my code? I'm not looking to implement this with keras, so do not suggest I use trainable_variables as I want to do it from scratch.
@thushv89 The code Jamie showed doesn't work because you can't call gradient() twice on the same (non-persistent) tape.
Jamie, why can't you simply do the following?
with tf.GradientTape() as tape:
    pred = self.feedforward()
    loss = self.loss_evaluation()
grads = tape.gradient(loss, [self.W, self.b])
optimizer.apply_gradients(zip(grads, [self.W, self.b]))
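If separate gradient calls really are needed, a persistent tape allows gradient() to be called more than once; a minimal sketch reusing the same names as above:

with tf.GradientTape(persistent=True) as tape:
    pred = self.feedforward()
    loss = self.loss_evaluation()
grad_W = tape.gradient(loss, self.W)
grad_b = tape.gradient(loss, self.b)
optimizer.apply_gradients([(grad_W, self.W), (grad_b, self.b)])
del tape  # release the resources held by the persistent tape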
