If one captures gradient with Optimizer, will it calculate twice the gradient?

If one captures gradient with Optimizer, will it calculate twice the gradient? - python

I recently have some training performance bottleneck. I always add a lot of histograms in the summary. I want to know if by calculating gradients first then re-minimizing the lose will calculate twice the gradients. A simplified code:
# layers
...
# optimizer
loss = tf.losses.mean_squared_error(labels=y_true, predictions=logits)
opt = AdamOptimizer(learning_rate)
# collect gradients
gradients = opt.compute_gradients(loss)
# train operation
train_op = opt.minimize(loss)
...
# merge summary
...
Is there an minimize method in optimizers that use directly the gradients? Something like opt.minimize(gradients) instead of opt.minimize(loss)?

You could use apply_gradients after the calculation of the gradients with compute_gradients as follows :
grads_and_vars = opt.compute_gradients(loss)
train_op = opt.apply_gradients(grads_and_vars)

Related

Combining gradients from different "networks" in TensorFlow2

I'm trying to combine a few "networks" into one final loss function. I'm wondering if what I'm doing is "legal", as of now I can't seem to make this work. I'm using tensorflow probability :
The main problem is here:
# Get gradients of the loss wrt the weights.
gradients = tape.gradient(loss, [m_phis.trainable_weights, m_mus.trainable_weights, m_sigmas.trainable_weights])
# Update the weights of our linear layer.
optimizer.apply_gradients(zip(gradients, [m_phis.trainable_weights, m_mus.trainable_weights, m_sigmas.trainable_weights])
Which gives me None gradients and throws on apply gradients:
AttributeError: 'list' object has no attribute 'device'
Full code:
univariate_gmm = tfp.distributions.MixtureSameFamily(
mixture_distribution=tfp.distributions.Categorical(probs=phis_true),
components_distribution=tfp.distributions.Normal(loc=mus_true,scale=sigmas_true)
)
x = univariate_gmm.sample(n_samples, seed=random_seed).numpy()
dataset = tf.data.Dataset.from_tensor_slices(x)
dataset = dataset.shuffle(buffer_size=1024).batch(64)
m_phis = keras.layers.Dense(2, activation=tf.nn.softmax)
m_mus = keras.layers.Dense(2)
m_sigmas = keras.layers.Dense(2, activation=tf.nn.softplus)
def neg_log_likelihood(y, phis, mus, sigmas):
a = tfp.distributions.Normal(loc=mus[0],scale=sigmas[0]).prob(y)
b = tfp.distributions.Normal(loc=mus[1],scale=sigmas[1]).prob(y)
c = np.log(phis[0]*a + phis[1]*b)
return tf.reduce_sum(-c, axis=-1)
# Instantiate a logistic loss function that expects integer targets.
loss_fn = neg_log_likelihood
# Instantiate an optimizer.
optimizer = tf.keras.optimizers.SGD(learning_rate=1e-3)
# Iterate over the batches of the dataset.
for step, y in enumerate(dataset):
yy = np.expand_dims(y, axis=1)
# Open a GradientTape.
with tf.GradientTape() as tape:
# Forward pass.
phis = m_phis(yy)
mus = m_mus(yy)
sigmas = m_sigmas(yy)
# Loss value for this batch.
loss = loss_fn(yy, phis, mus, sigmas)
# Get gradients of the loss wrt the weights.
gradients = tape.gradient(loss, [m_phis.trainable_weights, m_mus.trainable_weights, m_sigmas.trainable_weights])
# Update the weights of our linear layer.
optimizer.apply_gradients(zip(gradients, [m_phis.trainable_weights, m_mus.trainable_weights, m_sigmas.trainable_weights]))
# Logging.
if step % 100 == 0:
print("Step:", step, "Loss:", float(loss))

There are two separate problems to take into account.
1. Gradients are None:
Typically this happens, if non-tensorflow operations are executed in the code that is watched by the GradientTape. Concretely, this concerns the computation of np.log in your neg_log_likelihood functions. If you replace np.log with tf.math.log, the gradients should compute. It may be a good habit to try not to use numpy in your "internal" tensorflow components, since this avoids errors like this. For most numpy operations, there is a good tensorflow substitute.
2. apply_gradients for multiple trainables:
This mainly has to do with the input that apply_gradients expects. There you have two options:
First option: Call apply_gradients three times, each time with different trainables
optimizer.apply_gradients(zip(m_phis_gradients, m_phis.trainable_weights))
optimizer.apply_gradients(zip(m_mus_gradients, m_mus.trainable_weights))
optimizer.apply_gradients(zip(m_sigmas_gradients, m_sigmas.trainable_weights))
The alternative would be to create a list of tuples, like indicated in the tensorflow documentation (quote: "grads_and_vars: List of (gradient, variable) pairs.").
This would mean calling something like
optimizer.apply_gradients(
[
zip(m_phis_gradients, m_phis.trainable_weights),
zip(m_mus_gradients, m_mus.trainable_weights),
zip(m_sigmas_gradients, m_sigmas.trainable_weights),
]
)
Both options require you to split the gradients. You can either do that by computing the gradients and indexing them separately (gradients[0],...), or you can simply compute the gradiens separately. Note that this may require persistent=True in your GradientTape.
# [...]
# Open a GradientTape.
with tf.GradientTape(persistent=True) as tape:
# Forward pass.
phis = m_phis(yy)
mus = m_mus(yy)
sigmas = m_sigmas(yy)
# Loss value for this batch.
loss = loss_fn(yy, phis, mus, sigmas)
# Get gradients of the loss wrt the weights.
m_phis_gradients = tape.gradient(loss, m_phis.trainable_weights)
m_mus_gradients = tape.gradient(loss, m_mus.trainable_weights)
m_sigmas_gradients = tape.gradient(loss, m_sigmas .trainable_weights)
# Update the weights of our linear layer.
optimizer.apply_gradients(
[
zip(m_phis_gradients, m_phis.trainable_weights),
zip(m_mus_gradients, m_mus.trainable_weights),
zip(m_sigmas_gradients, m_sigmas.trainable_weights),
]
)
# [...]

Use Hamming Distance Loss Function with Tensorflow GradientTape: no gradients. Is it not differentiable?

I'm using Tensorflow 2.1 and Python 3, creating my custom training model following the tutorial "Tensorflow - Custom training: walkthrough".
I'm trying to use Hamming Distance on my loss function:
import tensorflow as tf
import tensorflow_addons as tfa
def my_loss_hamming(model, x, y):
global output
output = model(x)
return tfa.metrics.hamming.hamming_loss_fn(y, output, threshold=0.5, mode='multilabel')
def grad(model, inputs, targets):
with tf.GradientTape() as tape:
tape.watch(model.trainable_variables)
loss_value = my_loss_hamming(model, inputs, targets)
return loss_value, tape.gradient(loss_value, model.trainable_variables)
When I call it:
loss_value, grads = grad(model, feature, label)
optimizer.apply_gradients(zip(grads, model.trainable_variables))
grads variable is a list with 38 None.
And I get the error:
No gradients provided for any variable: ['conv1_1/kernel:0', ...]
Is there any way to use Hamming Distance without "interrupts the gradient chain registered by the gradient tape"?

Apology if I'm saying something obvious, but the way how backpropagation works as a fitting algorithm for neural networks is through gradients - e.g. for each batch of training data you compute how much the loss function will improve/degrade if you move a particular trainable weight by a very small amount delta.
Hamming loss is by definition not differentiable, so for small movements of trainable weights you will never experience any changes in the loss. I imagine it is only added to be used for final measurements of trained models' performance rather than for training.
If you want to train a neural net through backpropagation you need to use some differentiable loss - such that can help the model to move weights in the right direction. Sometimes people use different techniques to smooth such losses as Hamming less and create approximations - e.g. here it could be something which would penalize less predictions which are closer to the target answer rather then just giving out 1 for everything above threshold and 0 for everything else.

Calculating the derivates of the output with respect to input for a give time step in LSTM tensorflow2.0

I wrote a sample code to generate the real problem I am facing in my project. I am using an LSTM in tensorflow to model some time series data. Input dimensions are (10, 100, 1), that is, 10 instances, 100 time steps, and number of features is 1. The output is of the same shape.
What I want to achieve after training the model is to study the influence of each of the inputs to each output at each particular time step. In other words, I would like to see which input variables affect my output the most (or which input has the most influence on the output/maybe large gradient) at each time step. Here is the code for this problem:
tf.keras.backend.clear_session()
tf.random.set_seed(42)
model_input = tf.data.Dataset.from_tensor_slices(np.random.normal(size=(10, 100, 1)))
model_input = model_input.batch(10)
model_output = tf.data.Dataset.from_tensor_slices(np.random.normal(size=(10, 100, 1)))
model_output = model_output.batch(10)
my_dataset = tf.data.Dataset.zip((model_input, model_output))
m_inputs = tf.keras.Input(shape=(None, 1))
lstm_outputs = tf.keras.layers.LSTM(32, return_sequences=True)(m_inputs)
m_outputs = tf.keras.layers.TimeDistributed(tf.keras.layers.Dense(1))(lstm_outputs)
my_model = tf.keras.Model(m_inputs, m_outputs, name="my_model")
my_optimizer=tf.keras.optimizers.Adam(learning_rate=0.001)
my_loss_fn = tf.keras.losses.MeanSquaredError()
my_epochs = 3
for epoch in range(my_epochs):
for step, (x_batch_tr, y_batch_tr) in enumerate(my_dataset):
x += 1
# open a gradient tape to record the operations run during the forward pass, which enables autodifferentiation
with tf.GradientTape() as tape:
# Run the forward pass of the layer
logits = my_model(x_batch_tr, training=True)
# compute the loss value for this mismatch
loss_value = my_loss_fn(y_batch_tr, logits)
# use the gradient tape to automatically retrieve the gradients of the trainable variables with respect to the loss.
grads = tape.gradient(loss_value, my_model.trainable_weights)
# Run one step of gradient descent by updating the value of the variables to minimize the loss.
my_optimizer.apply_gradients(zip(grads, my_model.trainable_weights))
print(f"Step {step}, loss: {loss_value}")
print("\n\nCalculate gradient of ouptuts w.r.t inputs\n\n")
for step, (x_batch_tr, y_batch_tr) in enumerate(my_dataset):
# open a gradient tape to record the operations run during the forward pass, which enables autodifferentiation
with tf.GradientTape() as tape:
tape.watch(x_batch_tr)
# Run the forward pass of the layer
logits = my_model(x_batch_tr, training=True)
#tape.watch(logits[:, 10, :]) # this didn't help
# compute the loss value for this mismatch
loss_value = my_loss_fn(y_batch_tr, logits)
# use the gradient tape to automatically retrieve the gradients of the trainable variables with respect to the loss.
# grads = tape.gradient(logits, x_batch_tr) # This works
# print(grads.numpy().shape) # This works
grads = tape.gradient(logits[:, 10, :], x_batch_tr)
print(grads)
In other words, I would like to pay attention to the inputs that affect my output the most (at each particular time step).
To me grads = tape.gradient(logits, x_batch_tr) won't do the job cuz this will add the gradients from all outputs w.r.t each inputs.
However, the gradients are always None.
Any help is much appreciated!

You can use tf.GradientTape.batch_jacobian to get precisely that information:
grads = tape.batch_jacobian(logits, x_batch_tr)
print(grads.shape)
# (10, 100, 1, 100, 1)
Here, grads[i, t1, f1, t2, f2] gives you, for the example i, the gradient of output feature f1 at time t1 with respect to input feature f2 at time t2. If, as in your case, you only have one feature, you can just say that grads[i, t1, 0, t2, 0] gives you the gradient of t1 with respect to t2. Conveniently, you can also aggregate different axes or slices of this result to get aggregated gradients. For example, tf.reduce_sum(grads[:, :, :, :10], axis=3) would give you the gradient of each output time step with respect to the first ten input time steps.
About getting None gradients in your example, I think it is because you are doing the slicing operation outside of the gradient tape context, so the gradient tracking is lost.

so the solution was to create a temporary tensor for part of the logits that we need to use in tape.grad, and register that tensor on tape using tape.watch
This is how it should be done:
for step, (x_batch_tr, y_batch_tr) in enumerate(my_dataset):
# open a gradient tape to record the operations run during the forward pass, which enables autodifferentiation
with tf.GradientTape() as tape:
tape.watch(x_batch_tr)
# Run the forward pass of the layer
logits = my_model(x_batch_tr, training=True)
tensor_logits = tf.constant(logits[:, 10, :])
tape.watch(tensor_logits) # this didn't help
# compute the loss value for this mismatch
loss_value = my_loss_fn(y_batch_tr, logits)
# use the gradient tape to automatically retrieve the gradients of the trainable variables with respect to the loss.
grads = tape.gradient(tensor_logits, x_batch_tr)
print(grads.numpy())

How to detect vanishing and exploding gradients with Tensorboard?

I have two "sub-questions"
1) How can I detect vanishing or exploding gradients with Tensorboard, given the fact that currently write_grads=True is deprecated in the Tensorboard callback as per "un-deprecate write_grads for fit #31173" ?
2) I figured I can probably tell whether my model suffers from vanishing gradients based on the weights' distributions and histograms in the Distributions and Histograms tab in Tensorboard. My problem is that I have no frame of reference to compare with. Currently, my biases seem to be "moving" but I can't tell whether my kernel weights (Conv2D layers) are "moving"/"changing" "enough". Can someone help me by giving a rule of thumb to asses this visually in Tensorboard? I.e. if only the bottom 25% percentile of kernel weights are moving, that's good enough / not good enough? Or perhaps someone can post two reference images from tensorBoard of vanishing gradients vs, non vanishing gradients.
Here are my histograms and distributions, is it possible to tell whether my model suffers from Vanishing gradients? (some layers omitted for brevity) Thanks in advance.

I am currently facing the same question and approached the problem similarly using Tensorboard.
Even tho write_grads is deprecated you can still manage to log gradients for each layer of your network by subclassing the tf.keras.Model class and computing the gradients manually with gradient.Tape in the train_step method.
Something similar to this is working for me
from tensorflow.keras import Model
class TrainWithCustomLogsModel(Model):
def __init__(self, **kwargs):
super(TrainWithCustomLogsModel, self).__init__(**kwargs)
self.step = tf.Variable(0, dtype=tf.int64,trainable=False)
def train_step(self, data):
# Get batch images and labels
x, y = data
# Compute the batch loss
with tf.GradientTape() as tape:
p = self(x , training = True)
loss = self.compiled_loss(y, p, regularization_losses=self.losses)
# Compute gradients for each weight of the network. Note trainable_vars and gradients are list of tensors
trainable_vars = self.trainable_variables
gradients = tape.gradient(loss, trainable_vars)
# Log gradients in Tensorboard
self.step.assign_add(tf.constant(1, dtype=tf.int64))
#tf.print(self.step)
with train_summary_writer.as_default():
for var, grad in zip(trainable_vars, gradients):
name = var.name
var, grad = tf.squeeze(var), tf.squeeze(grad)
tf.summary.histogram(name, var, step = self.step)
tf.summary.histogram('Gradients_'+name, grad, step = self.step)
# Update model's weights
self.optimizer.apply_gradients(zip(gradients, trainable_vars))
del tape
# Update metrics (includes the metric that tracks the loss)
self.compiled_metrics.update_state(y, p)
# Return a dict mapping metric names to current value
return {m.name: m.result() for m in self.metrics}
You should then be able to visualize distributions of your gradients for any train step of your training, along with distributions of your kernel's values.
Moreover, it might be worth try to plot the distribution of the norm through time instead of single values.

How to use Tensorflow BatchNormalization with GradientTape?

Suppose we have a simple Keras model that uses BatchNormalization:
model = tf.keras.Sequential([
tf.keras.layers.InputLayer(input_shape=(1,)),
tf.keras.layers.BatchNormalization()
])
How to actually use it with GradientTape? The following doesn't seem to work as it doesn't update the moving averages?
# model training... we want the output values to be close to 150
for i in range(1000):
x = np.random.randint(100, 110, 10).astype(np.float32)
with tf.GradientTape() as tape:
y = model(np.expand_dims(x, axis=1))
loss = tf.reduce_mean(tf.square(y - 150))
grads = tape.gradient(loss, model.variables)
opt.apply_gradients(zip(grads, model.variables))
In particular, if you inspect the moving averages, they remain the same (inspect model.variables, averages are always 0 and 1). I know one can use .fit() and .predict(), but I would like to use the GradientTape and I'm not sure how to do this. Some version of the documentation suggests to update update_ops, but that doesn't seem to work in eager mode.
In particular, the following code will not output anything close to 150 after the above training.
x = np.random.randint(200, 210, 100).astype(np.float32)
print(model(np.expand_dims(x, axis=1)))

with gradient tape mode BatchNormalization layer should be called with argument training=True
example:
inp = KL.Input( (64,64,3) )
x = inp
x = KL.Conv2D(3, kernel_size=3, padding='same')(x)
x = KL.BatchNormalization()(x, training=True)
model = KM.Model(inp, x)
then moving vars are properly updated
>>> model.layers[2].weights[2]
<tf.Variable 'batch_normalization/moving_mean:0' shape=(3,) dtype=float32, numpy
=array([-0.00062087, 0.00015137, -0.00013239], dtype=float32)>

I just give up. I spent quiet a bit of time trying to make sense of a model that looks like:
model = tf.keras.Sequential([
tf.keras.layers.BatchNormalization(),
])
And I do give up because that thing looks like that:
My intuition was that BatchNorm these days is not as straight forward as it used to be and that is why it scales original distribution but not so much new distribution (which is a shame), but ain't nobody got time for that.
Edit: the reason for that behavior is that BN only calculates moments and normalizes batches during training. During training it maintains running averages of mean and deviation and once you switch to evaluation, parameters are used as constants. i.e. evaluation should not depend on normalization because evaluation can be used even for a single input and can not rely on batch statistics. Since constants are calculated on a different distribution, you are getting a higher error during evaluation.

With Gradient Tape mode, you would usually find gradients like:
with tf.GradientTape() as tape:
y_pred = model(features)
loss = your_loss_function(y_pred, y_true)
gradients = tape.gradient(loss, model.trainable_variables)
train_op = model.optimizer.apply_gradients(zip(gradients, model.trainable_variables))
However, if your model contains BatchNormalization or Dropout layer (or any layer that has different train/test phases) then tf will fail building the graph.
A good practice would be to explicitly use trainable parameter when obtaining output from a model. When optimizing use model(features, trainable=True) and when predicting use model(features, trainable=False), in order to explicitly choose train/test phase when using such layers.
For PREDICT and EVAL phase, use
training = (mode == tf.estimator.ModeKeys.TRAIN)
y_pred = model(features, trainable=training)
For TRAIN phase, use
with tf.GradientTape() as tape:
y_pred = model(features, trainable=training)
loss = your_loss_function(y_pred, y_true)
gradients = tape.gradient(loss, model.trainable_variables)
train_op = model.optimizer.apply_gradients(zip(gradients, model.trainable_variables))
Note that, iperov's answer works as well, except that you will need to set the training phase manually for those layers.
x = BatchNormalization()(x, training=True)
x = Dropout(rate=0.25)(x, training=True)
x = BatchNormalization()(x, training=False)
x = Dropout(rate=0.25)(x, training=False)
I'd recommended to have one get_model function that returns the model, while changing the phase using training parameter when calling the model.
Note:
If you use model.variables when finding gradients, you'll get this warning
Gradients do not exist for variables
['layer_1_bn/moving_mean:0',
'layer_1_bn/moving_variance:0',
'layer_2_bn/moving_mean:0',
'layer_2_bn/moving_variance:0']
when minimizing the loss.
This can be resolved by computing gradients only against trainable variables. Replace model.variables with model.trainable_variables

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

If one captures gradient with Optimizer, will it calculate twice the gradient? - python

You could use apply_gradients after the calculation of the gradients with compute_gradients as follows : grads_and_vars = opt.compute_gradients(loss) train_op = opt.apply_gradients(grads_and_vars)

Related

Combining gradients from different "networks" in TensorFlow2

Use Hamming Distance Loss Function with Tensorflow GradientTape: no gradients. Is it not differentiable?

Calculating the derivates of the output with respect to input for a give time step in LSTM tensorflow2.0

How to detect vanishing and exploding gradients with Tensorboard?

How to use Tensorflow BatchNormalization with GradientTape?

Categories

Resources