How to detect vanishing and exploding gradients with TensorBoard? - python

I have two "sub-questions":
1) How can I detect vanishing or exploding gradients with TensorBoard, given that write_grads=True is currently deprecated in the TensorBoard callback as per "un-deprecate write_grads for fit #31173"?
2) I figured I can probably tell whether my model suffers from vanishing gradients based on the weights' distributions and histograms in the Distributions and Histograms tabs in TensorBoard. My problem is that I have no frame of reference to compare with. Currently my biases seem to be "moving", but I can't tell whether my kernel weights (Conv2D layers) are "moving"/"changing" enough. Can someone give me a rule of thumb for assessing this visually in TensorBoard? E.g., if only the bottom 25th percentile of kernel weights is moving, is that good enough or not? Or perhaps someone could post two reference images from TensorBoard of vanishing vs. non-vanishing gradients.
Here are my histograms and distributions; is it possible to tell whether my model suffers from vanishing gradients? (Some layers omitted for brevity.) Thanks in advance.

I am currently facing the same question and approached the problem similarly using TensorBoard.
Even though write_grads is deprecated, you can still log gradients for each layer of your network by subclassing tf.keras.Model and computing the gradients manually with tf.GradientTape in the train_step method.
Something similar to this is working for me:
import tensorflow as tf
from tensorflow.keras import Model

class TrainWithCustomLogsModel(Model):

    def __init__(self, **kwargs):
        super(TrainWithCustomLogsModel, self).__init__(**kwargs)
        self.step = tf.Variable(0, dtype=tf.int64, trainable=False)

    def train_step(self, data):
        # Get batch images and labels
        x, y = data
        # Compute the batch loss
        with tf.GradientTape() as tape:
            p = self(x, training=True)
            loss = self.compiled_loss(y, p, regularization_losses=self.losses)
        # Compute gradients for each weight of the network.
        # Note trainable_vars and gradients are lists of tensors.
        trainable_vars = self.trainable_variables
        gradients = tape.gradient(loss, trainable_vars)
        # Log gradients in TensorBoard
        # (train_summary_writer is created elsewhere with tf.summary.create_file_writer)
        self.step.assign_add(tf.constant(1, dtype=tf.int64))
        with train_summary_writer.as_default():
            for var, grad in zip(trainable_vars, gradients):
                name = var.name
                var, grad = tf.squeeze(var), tf.squeeze(grad)
                tf.summary.histogram(name, var, step=self.step)
                tf.summary.histogram('Gradients_' + name, grad, step=self.step)
        # Update model's weights
        self.optimizer.apply_gradients(zip(gradients, trainable_vars))
        del tape
        # Update metrics (includes the metric that tracks the loss)
        self.compiled_metrics.update_state(y, p)
        # Return a dict mapping metric names to current value
        return {m.name: m.result() for m in self.metrics}
You should then be able to visualize the distributions of your gradients at any training step, along with the distributions of your kernels' values.
Moreover, it might be worth trying to plot the distribution of the gradient norm over time instead of single values.
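For completeness, here is a minimal usage sketch. The log directory, the tiny CNN, and x_train/y_train are placeholders; the important part is that train_summary_writer is created before training so the train_step above can see it:
import tensorflow as tf

# The writer referenced inside train_step; the log directory is a placeholder.
train_summary_writer = tf.summary.create_file_writer('logs/gradient_tape/train')

# Any architecture works; a tiny CNN stands in here.
inputs = tf.keras.Input(shape=(28, 28, 1))
x = tf.keras.layers.Conv2D(8, 3, activation='relu')(inputs)
x = tf.keras.layers.Flatten()(x)
outputs = tf.keras.layers.Dense(10, activation='softmax')(x)

model = TrainWithCustomLogsModel(inputs=inputs, outputs=outputs)
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
model.fit(x_train, y_train, epochs=2)  # x_train / y_train assumed to exist
# Then inspect the histograms with: tensorboard --logdir logs/gradient_tape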

Related

ValueError: No gradients provided for any variable when calculating loss

I have been trying to implement the training step for a DQN described in this paper on various RL methods using TensorFlow, but when I try to compute the gradient using a GradientTape I get ValueError: No gradients provided for any variable. Below is the training step code:
def train_step(model, target, optimizer, observations, actions, rewards, next_observations):
    with tf.GradientTape() as tape:
        target_logits = tf.math.reduce_max(target(np.expand_dims(next_observations, -1)), 1)
        logits = model(np.expand_dims(observations, -1))
        act_logits = np.ndarray(EXPERIENCE_SAMPLE_SIZE)
        for i in range(EXPERIENCE_SAMPLE_SIZE):
            act_logits[i] = logits[i][actions[i]]
        act_logits = tf.convert_to_tensor(act_logits, dtype=tf.float32)
        y_T = tf.math.add(tf.convert_to_tensor(rewards, dtype=tf.float32), tf.math.scalar_mul(DISCOUNT_RATE, target_logits))
        loss = tf.math.squared_difference(act_logits, y_T)
        loss = tf.math.scalar_mul(1.0 / EXPERIENCE_SAMPLE_SIZE, loss)
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
Here model and target are tf.keras.Sequential models that output the expected value for taking each of 5 possible actions, optimizer is SGD, and observations, actions, rewards, and next_observations are NumPy arrays sampled from an experience replay buffer.
This is part of implementing the following pseudocode from the aforementioned paper:
My best guess is that this error occurs because indexing logits makes the result impossible to differentiate, but I don't know how else to calculate the Q*(s, a, theta) quantity.
Adding the Solution in the Answer Section for the benefit of the Community.
From Comments:
The problem is resolved by replacing the code:
act_logits = np.ndarray(EXPERIENCE_SAMPLE_SIZE)
for i in range(EXPERIENCE_SAMPLE_SIZE):
    act_logits[i] = logits[i][actions[i]]
with the code:
act_logits = tf.math.reduce_max(tf.math.multiply(act_logits, logits), 1)
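For context, that replacement only makes sense when act_logits already holds a one-hot encoding of the chosen actions, so that the selection of Q-values happens entirely in TensorFlow ops and stays differentiable. A minimal sketch of that idea (the names and the one-hot step are assumptions for illustration, not code from the original post):
import tensorflow as tf

def select_action_values(logits, actions, n_actions=5):
    """Differentiable alternative to indexing logits with a Python loop.

    logits:  float tensor of shape (batch, n_actions) produced by the model
    actions: int tensor/array of shape (batch,) with the actions taken
    """
    actions_one_hot = tf.one_hot(actions, depth=n_actions, dtype=logits.dtype)
    # Masking with the one-hot vector and summing over the action axis picks
    # out Q(s, a) per sample while keeping the op graph differentiable.
    return tf.reduce_sum(logits * actions_one_hot, axis=1)
With a one-hot mask and non-negative selected Q-values, reduce_max as in the quoted fix gives the same result; reduce_sum avoids surprises when the selected value is negative.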

Use Hamming Distance Loss Function with Tensorflow GradientTape: no gradients. Is it not differentiable?

I'm using Tensorflow 2.1 and Python 3, creating my custom training model following the tutorial "Tensorflow - Custom training: walkthrough".
I'm trying to use Hamming Distance on my loss function:
import tensorflow as tf
import tensorflow_addons as tfa

def my_loss_hamming(model, x, y):
    global output
    output = model(x)
    return tfa.metrics.hamming.hamming_loss_fn(y, output, threshold=0.5, mode='multilabel')

def grad(model, inputs, targets):
    with tf.GradientTape() as tape:
        tape.watch(model.trainable_variables)
        loss_value = my_loss_hamming(model, inputs, targets)
    return loss_value, tape.gradient(loss_value, model.trainable_variables)
When I call it:
loss_value, grads = grad(model, feature, label)
optimizer.apply_gradients(zip(grads, model.trainable_variables))
The grads variable is a list containing 38 None values.
And I get the error:
No gradients provided for any variable: ['conv1_1/kernel:0', ...]
Is there any way to use Hamming distance without "interrupting the gradient chain registered by the gradient tape"?
Apologies if I'm saying something obvious, but the way backpropagation works as a fitting algorithm for neural networks is through gradients - e.g. for each batch of training data you compute how much the loss function will improve or degrade if you move a particular trainable weight by a very small amount delta.
Hamming loss is by definition not differentiable, so for small movements of trainable weights you will never see any change in the loss. I imagine it is only provided for final measurements of a trained model's performance rather than for training.
If you want to train a neural net through backpropagation you need to use a differentiable loss - one that can help the model move its weights in the right direction. Sometimes people use techniques to smooth such losses as Hamming loss and create approximations - e.g. something that penalizes predictions which are closer to the target less heavily, rather than just giving out 1 for everything above the threshold and 0 for everything else.
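As a purely illustrative sketch of such a smoothed surrogate (not from the original answer), one could average the element-wise distance between the labels and the raw sigmoid outputs instead of thresholding them, which keeps the loss differentiable:
import tensorflow as tf

def soft_hamming_loss(y_true, y_pred):
    """Differentiable stand-in for multilabel Hamming loss.

    y_true: 0/1 labels of shape (batch, n_labels)
    y_pred: sigmoid outputs in [0, 1] with the same shape
    Averaging |y_true - y_pred| rewards predictions that are merely closer
    to the target, so small weight updates produce small loss changes.
    """
    y_true = tf.cast(y_true, y_pred.dtype)
    return tf.reduce_mean(tf.abs(y_true - y_pred))
The true Hamming loss can still be tracked as a metric for reporting, just not used as the training objective.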

Keras: Generating target values based on Faster-RCNN predictions

SOLUTION
OK, I did find a solution, maybe not the best one. I essentially created a new method that uses TensorFlow's GradientTape: get the predictions, generate the targets, calculate the loss, and then apply the gradients.
with tf.GradientTape() as tape:
    # Forward pass
    y_preds = self.model(x, training=True)
    # Generate the target values from the predictions
    actual_deltas, actual_objectness = self.generate_target_values(y_preds, labels)
    # Get the loss
    loss = self.model.compiled_loss([actual_deltas, actual_objectness], y_preds, regularization_losses=self.model.losses)
# Compute gradients
trainable_vars = self.model.trainable_variables
gradients = tape.gradient(loss, trainable_vars)
# Update weights
self.model.optimizer.apply_gradients(zip(gradients, trainable_vars))
I understand that there are better ways to do this with Keras subclassing, but this did the job; a rough sketch of that approach follows.
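For reference, a rough sketch of that subclassing variant, under the assumption that generate_target_values and the two-headed output behave as described above (none of this is code from the original post):
import tensorflow as tf

class RPNTrainer(tf.keras.Model):
    """Hypothetical Model subclass whose train_step builds targets from its own predictions."""

    def train_step(self, data):
        x, labels = data
        with tf.GradientTape() as tape:
            y_preds = self(x, training=True)
            # Build the regression/objectness targets from the current predictions
            # (generate_target_values is assumed to be defined on the subclass).
            actual_deltas, actual_objectness = self.generate_target_values(y_preds, labels)
            loss = self.compiled_loss([actual_deltas, actual_objectness], y_preds,
                                      regularization_losses=self.losses)
        gradients = tape.gradient(loss, self.trainable_variables)
        self.optimizer.apply_gradients(zip(gradients, self.trainable_variables))
        return {m.name: m.result() for m in self.metrics}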
ORIGINAL POST
I am currently trying to create a model where the predictions need to be run through a function that compares them to the training labels and returns the target values. How do I train my model so that the predictions get fed to my function and training uses the values it returns?
I am using TensorFlow 2.1.0 and Keras 2.2.4-tf.
Edit:
The model is a modified Faster-RCNN model.
I am trying to add a function which takes the predictions (a 2xN and a 4xN vector), converts them to bounding boxes, compares them to the ground-truth bounding boxes, and then returns what each of the proposed bounding boxes' values should have been in order to overlay that bounding box properly.

How can I disable gradient updates for some modules in autograd backpropagation?

I'm building a multi-model neural network for reinforcement learning that includes an action network, a world model network, and a critic. The idea is to train the world model to emulate whatever simulation you are trying to master based on input from the action network and the previous state, to train the critic to maximize the Bellman equation (total reinforcement over time) based on the world model output, and then backpropagate the critic value through the world model to provide gradient targets for training the actions. So - from some state, the action network outputs an action which is fed into the model to generate the next state, and that state feeds into the critic network for evaluation against some goal state.
For all this to work, I must use 3 separate loss functions, one for each network, and they all add something to the gradients in one or more networks, but they can be in conflict. For example - to train the world model I use a target from an environmental simulation, and for the critic I use a target of the current state reward + discount * next state forecast value. However, to train the actor I just use the negative critic value as a loss and backpropagate all the way through all three models to calibrate the best action.
I can make this work without any batching by zeroing out gradients incrementally, but that is inefficient and doesn't let me accumulate gradients for any kind of "time-series batching" optimizer update step. Each model has its own trainable parameters, but the execution graph flows through all three networks. So inside the calibration loop after firing the networks in sequence:
...
if self.actor.calibrating:
    self.actor.optimizer.zero_grad()
    # Pick loss for maximizing the value of all actions
    loss = -self.critic.value
    # Backpropagate through all three networks to train the actor output
    # How do I stop the critic and model networks from incrementing their gradient values?
    loss.backward(retain_graph=True)
    self.actor.optimizer.step()
if self.model.calibrating:
    self.model.optimizer.zero_grad()
    # Reduce loss for ambiguous actions
    loss = self.model.get_loss() * self.actor.get_confidence()**2
    # How can I block this from backpropagating through the action network?
    loss.backward(retain_graph=True)
    self.model.optimizer.step()
if self.critic.calibrating:
    self.critic.optimizer.zero_grad()
    # Reduce loss for ambiguous actions
    loss = self.critic.get_loss(self.goal) * self.actor.get_confidence()**2
    # How do I stop this from backpropagating through the model and action networks?
    loss.backward(retain_graph=True)
    self.critic.optimizer.step()
...
Finally - my question is in two parts:
How can I temporarily stop loss.backward() at a given layer without detaching it forever?
How can I block loss.backward() from updating some gradients where I'm just flowing through a model to get gradients for another model?
Got this figured out thanks to a suggestion from a colleague to try the requires_grad setting. (I had assumed that would break the execution graph, but it doesn't)
So - to answer my own two questions:
If you calibrate the chained models in the correct order, you can detach them one at a time so that loss.backward() doesn't run over models that aren't needed. I was thinking that this would break the graph, but this is PyTorch, not TensorFlow 1.x, and the graph is regenerated on every forward pass anyway. Silly me for missing this yesterday.
If you set requires_grad to False for a model (or a layer or an individual weight) then loss.backward() will STILL traverse the entire connected graph but it will leave those individual gradients as they were while still setting any gradients earlier in the graph. Exactly what I wanted.
This code works to minimize the execution of unnecessary graph traversals and gradient updates. I still need to refactor it for staggered updates over time so that it can accumulate gradients for several cycles before stepping the optimizers, but this definitely works as intended.
# Step through all models in a chain to create gradient paths from the critic back through the world model, to the actor.
def step(self):
    # Get the current state from the simulation
    state = self.world.state
    # Fire the actor to select a softmax action.
    self.actor(state)
    # Run the world simulation on that action.
    self.world.step(self.actor.action)
    # Combine the action and starting state as input to the world model.
    if self.actor.calibrating:
        action_state = torch.cat([self.actor.value, state], dim=0)
    else:
        # Push softmax action closer to 1.0
        action_state = torch.cat([self.actor.hard_value, state], dim=0)
    # Run the model and then the critic on the action_state
    self.critic(self.model(action_state))
    if self.actor.calibrating:
        self.actor.optimizer.zero_grad()
        self.model.requires_grad = False
        self.critic.requires_grad = False
        # Pick loss for maximizing the value of the action choice
        loss = -self.critic.value * self.actor.get_confidence()
        loss.backward(retain_graph=True)
        self.actor.optimizer.step()
    if self.model.calibrating:
        # Don't need to backpropagate through the actor again
        self.actor.value.detach_()
        self.model.optimizer.zero_grad()
        self.model.requires_grad = True
        # Reduce loss for ambiguous actions
        loss = self.model.get_loss() * self.actor.get_confidence()**2
        loss.backward(retain_graph=True)
        self.model.optimizer.step()
    if self.critic.calibrating:
        # Don't need to backpropagate through the model or actor again
        self.model.value.detach_()
        self.critic.optimizer.zero_grad()
        self.critic.requires_grad = True
        # Reduce loss for ambiguous actions
        loss = self.critic.get_loss(self.goal) * self.actor.get_confidence()**2
        loss.backward(retain_graph=True)
        self.critic.optimizer.step()
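As an aside (not part of the original answer), the more common PyTorch idiom for this freeze/unfreeze dance is to toggle requires_grad on the module's parameters rather than on the module object itself; a minimal sketch:
import torch.nn as nn

def set_requires_grad(module: nn.Module, flag: bool) -> None:
    # Toggle gradient accumulation for every parameter of the module.
    # Backward still flows *through* the module to earlier parts of the graph;
    # only the module's own .grad buffers are left untouched while flag is False.
    for p in module.parameters():
        p.requires_grad_(flag)

# Usage inside the step above would look like:
#   set_requires_grad(self.model, False)
#   set_requires_grad(self.critic, False)
Recent PyTorch versions also offer module.requires_grad_(flag) as a one-line shorthand for the same loop.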
Here's a more precise and fuller example.
import torch
import torch.nn as nn
from torch.autograd import Variable
class Net(nn.Module):

    def __init__(self):
        super(Net, self).__init__()
        self.layers = nn.ModuleList([
            nn.Linear(10, 10),
            nn.Linear(10, 10),
            nn.Linear(10, 10),
            nn.Linear(10, 10),
        ])

    def forward(self, x):
        self.output = []
        self.input = []
        for layer in self.layers:
            # Detach from previous history
            x = Variable(x.data, requires_grad=True)
            # You can add this line after each layer to stop backpropagation at that layer
            self.input.append(x)
            # Compute output
            x = layer(x)
            # Add to list of outputs
            self.output.append(x)
        return x

    def backward(self, g):
        for i, output in reversed(list(enumerate(self.output))):
            if i == (len(self.output) - 1):
                # For the last node, use g
                output.backward(g)
            else:
                output.backward(self.input[i + 1].grad.data)
                print(i, self.input[i + 1].grad.data.sum())
model = Net()
inp = Variable(torch.randn(4, 10))
output = model(inp)
gradients = torch.randn(*output.size())
model.backward(gradients)

If one captures gradient with Optimizer, will it calculate twice the gradient?

I recently hit a training performance bottleneck. I always add a lot of histograms to the summary. I want to know whether computing the gradients first and then minimizing the loss calculates the gradients twice. A simplified example:
# layers
...
# optimizer
loss = tf.losses.mean_squared_error(labels=y_true, predictions=logits)
opt = AdamOptimizer(learning_rate)
# collect gradients
gradients = opt.compute_gradients(loss)
# train operation
train_op = opt.minimize(loss)
...
# merge summary
...
Is there a minimize method in the optimizers that uses the gradients directly? Something like opt.minimize(gradients) instead of opt.minimize(loss)?
You can use apply_gradients after computing the gradients with compute_gradients, so the gradient ops are only built once; a fuller sketch with histogram summaries follows the snippet:
grads_and_vars = opt.compute_gradients(loss)
train_op = opt.apply_gradients(grads_and_vars)
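Putting it together with the histograms mentioned in the question, a sketch in TF 1.x style (y_true, logits, and learning_rate are taken as given from the snippet in the question) that computes the gradients once, logs them, and reuses them for the update:
import tensorflow as tf

# Assumes `logits` and `y_true` come from the graph built earlier in the question.
loss = tf.losses.mean_squared_error(labels=y_true, predictions=logits)
opt = tf.train.AdamOptimizer(learning_rate)

# Gradients are computed exactly once here...
grads_and_vars = opt.compute_gradients(loss)

# ...logged for TensorBoard...
for grad, var in grads_and_vars:
    if grad is not None:
        tf.summary.histogram(var.name.replace(':', '_') + '/gradient', grad)

# ...and then reused for the update, instead of calling opt.minimize(loss),
# which would add a second set of gradient ops to the graph.
train_op = opt.apply_gradients(grads_and_vars)
merged_summary = tf.summary.merge_all()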
