How to use TensorFlow's GradientTape() to compute biases - python

I'm looking to implement GradientTape() on a custom NN architecture, but I don't see an explanation anywhere of how to use it to compute biases. A similar question was asked here, but it was not fully answered.
As a simple example, I have the training step for my NN like so:
self.W = ...  ## Initialized earlier on
self.b = ...  ## Initialized earlier on

@tf.function
def train(self):
    with tf.GradientTape() as tape:
        pred = self.feedforward()
        loss = self.loss_evaluation()
    grad = tape.gradient(loss, self.W)
    grad = tape.gradient(loss, self.b)  ## How do I do this?
    optimizer.apply_gradients(zip(grad, self.W))
    optimizer.apply_gradients(zip(grad, self.b))  ## How do I do this?
Put simply, I cannot evaluate the gradients with respect to the biases, as nowhere in any documentation or tutorial do I see the bias term included. So how do I go about making the bias term a trainable variable in my code? I'm not looking to implement this with Keras, so please don't suggest trainable_variables; I want to do it from scratch.

@thushv89 The code Jamie showed doesn't work because you can't call gradient() on the same (non-persistent) tape twice.
Jamie, why can't you simply do the following?
with tf.GradientTape() as tape:
    pred = self.feedforward()
    loss = self.loss_evaluation()
grads = tape.gradient(loss, [self.W, self.b])
optimizer.apply_gradients(zip(grads, [self.W, self.b]))
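For completeness, a minimal self-contained sketch of this pattern (the shapes, data, and loss below are made up for illustration, not taken from the question):
import tensorflow as tf

# Explicit weight and bias, both registered as tf.Variable so the tape tracks them automatically.
W = tf.Variable(tf.random.normal([3, 1]), name="W")
b = tf.Variable(tf.zeros([1]), name="b")
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)

x = tf.random.normal([8, 3])
y = tf.random.normal([8, 1])

with tf.GradientTape() as tape:
    pred = tf.matmul(x, W) + b
    loss = tf.reduce_mean(tf.square(pred - y))

# One gradient() call covers all trainable variables, weights and biases alike.
grads = tape.gradient(loss, [W, b])
optimizer.apply_gradients(zip(grads, [W, b]))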

Related

TensorFlow / Keras gradients return a None value in multi-task model

When trying to train a multi-task model in TensorFlow using Keras, I run into gradients whose values are None. When I use the regular model.fit(), I am able to train normally (without any issues), but TensorBoard shows some missing distributions (some are entirely missing), which makes me believe something is going on with the gradients.
Then, when I try to debug this using a custom training loop, I see gradient values coming out as None. For example, like this:
w1 = tf.Variable(1.0)
w2 = tf.Variable(1.0)
for step, (x_batch_train, y_batch_train) in enumerate(trainDS):
    with tf.GradientTape(persistent=True) as tape:
        logits = model(x_batch_train, training=True)
        loss1 = loss_fn1(y_batch_train['label1'], logits[0])
        loss2 = loss_fn2(y_batch_train['label2'], logits[1])
        l1 = tf.math.multiply(w1, loss1)
        l2 = tf.math.multiply(w2, loss2)
        loss_value = tf.math.add(l1, l2)
    grad_task_1 = tape.gradient(l1, model.trainable_weights)
I get gradients of None when inspecting the list.
Eventually I want to compute the norm of the grad_task_1 gradients, but I cannot do that with Nones.
Is there anything that I am doing wrong from an approach POV or from a design POV?
Or am I missing something?
Could anyone kindly provide some guidance? It would be greatly appreciated.
Thank you
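One common reason for None entries in a multi-task setup is that weights belonging to the other task's head are simply not connected to l1. Whether that is the cause here is an assumption, but if it applies, tape.gradient can be asked to return zeros for unconnected variables instead of None (sketch, reusing the names from the snippet above):
# Zeros instead of None for variables that do not influence l1.
grad_task_1 = tape.gradient(
    l1,
    model.trainable_weights,
    unconnected_gradients=tf.UnconnectedGradients.ZERO,
)
# Every entry is now a tensor, so a global norm can be computed.
grad_norm = tf.linalg.global_norm(grad_task_1)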

ValueError: No gradients provided for any variable when calculating loss

I have been trying to implement the training step for a DQN described in this paper on various RL methods using TensorFlow, but when I try to compute the gradient using a GradientTape I get a ValueError: No gradients provided for any variable. Below is the training step code:
def train_step(model, target, optimizer, observations, actions, rewards, next_observations):
    with tf.GradientTape() as tape:
        target_logits = tf.math.reduce_max(target(np.expand_dims(next_observations, -1)), 1)
        logits = model(np.expand_dims(observations, -1))
        act_logits = np.ndarray(EXPERIENCE_SAMPLE_SIZE)
        for i in range(EXPERIENCE_SAMPLE_SIZE):
            act_logits[i] = logits[i][actions[i]]
        act_logits = tf.convert_to_tensor(act_logits, dtype=tf.float32)
        y_T = tf.math.add(tf.convert_to_tensor(rewards, dtype=tf.float32), tf.math.scalar_mul(DISCOUNT_RATE, target_logits))
        loss = tf.math.squared_difference(act_logits, y_T)
        loss = tf.math.scalar_mul(1.0 / EXPERIENCE_SAMPLE_SIZE, loss)
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
Where model and target are tf.keras.Sequential models that output the expected value for taking each of 5 possible actions, optimizer is SGD, and observations, actions, rewards, and next_observations are NumPy arrays sampled from an experience replay buffer.
This is part of implementing pseudocode from the aforementioned paper (not reproduced here).
My best guess is that this error occurs because indexing logits makes the result impossible to differentiate, but I don't know how else to calculate the Q*(s, a, theta) quantity.
Adding the Solution in the Answer Section for the benefit of the Community.
From Comments:
The problem is resolved by replacing the code:
act_logits = np.ndarray(EXPERIENCE_SAMPLE_SIZE)
for i in range(EXPERIENCE_SAMPLE_SIZE):
    act_logits[i] = logits[i][actions[i]]
with the code:
act_logits = tf.math.reduce_max(tf.math.multiply(act_logits, logits), 1)
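For reference, the original loop breaks the gradient because it copies values out of logits into a NumPy array, which the tape cannot trace. A minimal sketch of a differentiable way to pick out Q(s, a) with a one-hot mask (the name NUM_ACTIONS is an assumption for illustration):
# Select the Q-value of the taken action without leaving TensorFlow.
# logits has shape [batch, NUM_ACTIONS]; actions holds integer action indices.
action_mask = tf.one_hot(actions, depth=NUM_ACTIONS, dtype=tf.float32)
act_logits = tf.reduce_sum(action_mask * logits, axis=1)  # shape [batch]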

ValueError: No gradients provided for any variable in policy gradient

I have been trying to implement the policy gradient algorithm in reinforcement learning. However, I am facing the error "ValueError: No gradients provided for any variable:" while computing the gradients for the custom loss function, as shown below:
def loss_function(prob, action, reward):
    prob_action = np.array([prob.numpy()[0][action]])  # prob is like -> [0.4900, 0.5200] and action is a scalar index -> 1, 0
    log_prob = tf.math.log(prob_action)
    loss = tf.multiply(log_prob, (-reward))
    return loss
I am computing the gradients as below:
def update_policy(policy, states, actions, discounted_rewards):
    opt = tf.keras.optimizers.SGD(learning_rate=0.1)
    for state, reward, action in zip(states, discounted_rewards, actions):
        with tf.GradientTape() as tape:
            prob = policy(state, training=True)
            loss = loss_function(prob, action, reward)
            print(loss)
        gradients = tape.gradient(loss, policy.trainable_variables)
        opt.apply_gradients(zip(gradients, policy.trainable_variables))
Kindly help me out with this issue.
Thank you
As @gekrone indicates in the comments, this is definitely due to the gradients not flowing, because prob_action is a NumPy array and not a tensor. Also be careful not to use the .numpy() method. Probably stick to something like:
prob_action = prob[0][action]
...
and this should work.
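Putting the fix together, a minimal sketch of the corrected loss function that keeps everything as TensorFlow ops so the tape can trace it (the signature comes from the question, the body is illustrative):
def loss_function(prob, action, reward):
    # prob has shape [1, num_actions]; index it with TF ops instead of .numpy()
    prob_action = prob[0][action]
    log_prob = tf.math.log(prob_action)
    # Negative log-likelihood weighted by the (discounted) reward
    return -log_prob * reward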

TensorFlow 2: differentiate through optimization path?

I am trying to compute "gradients through gradients" for a paper (MAML, by C. Finn et al.) in TensorFlow 2 with the Keras backend. Thus, we start at some initial weights, compute K gradient update steps, and want to backpropagate through to our initial weights. The code sample below illustrates what I want to achieve, but unfortunately does not work.
optimizer = tf.keras.optimizers.SGD()
initial_weights = model.trainable_variables
with tf.GradientTape() as mt:
    for gradient_step in range(10):
        with tf.GradientTape() as t:
            loss = loss_function(y_train, model(x_train))
        grads = t.gradient(loss, model.trainable_variables)
        optimizer.apply_gradients(zip(grads, model.trainable_variables))
    test_loss = loss_function(y_test, model(x_test))
mt.gradient(test_loss, initial_weights)
Does anyone know how to differentiate through the initialization? Any help would be greatly appreciated!
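Not the original poster's code, but as an illustration of one common workaround (the linear model, names, and shapes here are assumptions): keep the inner-loop "fast weights" as plain tensors and update them functionally, so the outer tape records the whole optimization path and can differentiate back to the initial weights.
import tensorflow as tf

# Toy data and a single meta-trained weight matrix (illustrative only).
w = tf.Variable(tf.random.normal([3, 1]))
inner_lr = 0.1
x_train = tf.random.normal([8, 3])
y_train = tf.random.normal([8, 1])
x_test = tf.random.normal([8, 3])
y_test = tf.random.normal([8, 1])

def forward(weights, x):
    return tf.matmul(x, weights)

with tf.GradientTape() as mt:
    fast_w = tf.identity(w)                    # start the inner loop from the meta weights
    for _ in range(10):                        # K inner gradient steps
        with tf.GradientTape() as t:
            t.watch(fast_w)                    # fast_w is a tensor, so watch it explicitly
            loss = tf.reduce_mean(tf.square(y_train - forward(fast_w, x_train)))
        grads = t.gradient(loss, fast_w)
        fast_w = fast_w - inner_lr * grads     # functional update; stays on the outer tape
    test_loss = tf.reduce_mean(tf.square(y_test - forward(fast_w, x_test)))

meta_grads = mt.gradient(test_loss, w)         # gradient w.r.t. the initial (meta) weights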

How to use TensorFlow BatchNormalization with GradientTape?

Suppose we have a simple Keras model that uses BatchNormalization:
model = tf.keras.Sequential([
    tf.keras.layers.InputLayer(input_shape=(1,)),
    tf.keras.layers.BatchNormalization()
])
How do you actually use it with GradientTape? The following doesn't seem to work, as it doesn't update the moving averages:
# model training... we want the output values to be close to 150
for i in range(1000):
    x = np.random.randint(100, 110, 10).astype(np.float32)
    with tf.GradientTape() as tape:
        y = model(np.expand_dims(x, axis=1))
        loss = tf.reduce_mean(tf.square(y - 150))
    grads = tape.gradient(loss, model.variables)
    opt.apply_gradients(zip(grads, model.variables))
In particular, if you inspect the moving averages, they remain the same (inspect model.variables; the averages stay at 0 and 1). I know one can use .fit() and .predict(), but I would like to use the GradientTape and I'm not sure how to do this. Some version of the documentation suggests updating update_ops, but that doesn't seem to work in eager mode.
For example, the following code will not output anything close to 150 after the above training.
x = np.random.randint(200, 210, 100).astype(np.float32)
print(model(np.expand_dims(x, axis=1)))
In gradient tape mode, the BatchNormalization layer should be called with the argument training=True.
Example:
from tensorflow.keras import layers as KL
from tensorflow.keras import models as KM

inp = KL.Input((64, 64, 3))
x = inp
x = KL.Conv2D(3, kernel_size=3, padding='same')(x)
x = KL.BatchNormalization()(x, training=True)
model = KM.Model(inp, x)
Then the moving variables are properly updated:
>>> model.layers[2].weights[2]
<tf.Variable 'batch_normalization/moving_mean:0' shape=(3,) dtype=float32, numpy=array([-0.00062087,  0.00015137, -0.00013239], dtype=float32)>
I just give up. I spent quite a bit of time trying to make sense of a model that looks like:
model = tf.keras.Sequential([
    tf.keras.layers.BatchNormalization(),
])
And I do give up because of what the result looks like (plot not reproduced here). My intuition was that BatchNorm these days is not as straightforward as it used to be, and that is why it scales the original distribution but not so much the new distribution (which is a shame), but ain't nobody got time for that.
Edit: the reason for that behavior is that BN only computes batch moments and normalizes with them during training. During training it also maintains running averages of the mean and variance, and once you switch to evaluation those parameters are used as constants. That is, evaluation should not depend on batch statistics, because evaluation can be run even on a single input and cannot rely on them. Since those constants are computed on a different distribution, you get a higher error during evaluation.
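To make that concrete, a tiny sketch (the values are arbitrary): in training mode BN normalizes with the current batch's statistics and updates the moving averages, while in inference mode it uses the stored moving statistics as constants:
import numpy as np
import tensorflow as tf

bn = tf.keras.layers.BatchNormalization()
x = np.array([[1.0], [2.0], [3.0]], dtype=np.float32)

# training=True: normalize with this batch's mean/variance and update moving stats
print(bn(x, training=True))
# training=False: normalize with the moving statistics accumulated so far
print(bn(x, training=False))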
In gradient tape mode, you would usually compute gradients like this:
with tf.GradientTape() as tape:
    y_pred = model(features)
    loss = your_loss_function(y_pred, y_true)
gradients = tape.gradient(loss, model.trainable_variables)
train_op = model.optimizer.apply_gradients(zip(gradients, model.trainable_variables))
However, if your model contains a BatchNormalization or Dropout layer (or any layer that behaves differently in the train and test phases), then TF will fail to build the graph.
A good practice is to explicitly pass the training parameter when obtaining output from the model. When optimizing, use model(features, training=True), and when predicting use model(features, training=False), in order to explicitly choose the train/test phase for such layers.
For PREDICT and EVAL phase, use
training = (mode == tf.estimator.ModeKeys.TRAIN)
y_pred = model(features, training=training)
For TRAIN phase, use
with tf.GradientTape() as tape:
    y_pred = model(features, training=training)
    loss = your_loss_function(y_pred, y_true)
gradients = tape.gradient(loss, model.trainable_variables)
train_op = model.optimizer.apply_gradients(zip(gradients, model.trainable_variables))
Note that iperov's answer works as well, except that you will need to set the training phase manually for those layers.
x = BatchNormalization()(x, training=True)
x = Dropout(rate=0.25)(x, training=True)
x = BatchNormalization()(x, training=False)
x = Dropout(rate=0.25)(x, training=False)
I'd recommend having one get_model function that returns the model, and changing the phase using the training parameter when calling the model.
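A minimal sketch of that recommendation (the architecture, data, and names here are illustrative assumptions, not from the original answer):
import tensorflow as tf

def get_model():
    # Built once; the phase is chosen later, at call time, via training=...
    return tf.keras.Sequential([
        tf.keras.layers.Dense(16, activation='relu'),
        tf.keras.layers.BatchNormalization(),
        tf.keras.layers.Dropout(0.25),
        tf.keras.layers.Dense(1),
    ])

model = get_model()
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)

x = tf.random.normal([32, 8])
y = tf.random.normal([32, 1])

# TRAIN phase: batch statistics are used and moving averages are updated.
with tf.GradientTape() as tape:
    y_pred = model(x, training=True)
    loss = tf.reduce_mean(tf.square(y_pred - y))
gradients = tape.gradient(loss, model.trainable_variables)
optimizer.apply_gradients(zip(gradients, model.trainable_variables))

# PREDICT/EVAL phase: the stored moving statistics are used instead.
y_eval = model(x, training=False)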
Note:
If you use model.variables when finding gradients, you'll get this warning:
Gradients do not exist for variables
['layer_1_bn/moving_mean:0',
'layer_1_bn/moving_variance:0',
'layer_2_bn/moving_mean:0',
'layer_2_bn/moving_variance:0']
when minimizing the loss.
This can be resolved by computing gradients only with respect to trainable variables. Replace model.variables with model.trainable_variables.
