Keras: "An operation has None for gradient" when using train_on_batch - python

Google Colab to reproduce the error None_for_gradient.ipynb
I need a custom loss function whose value is calculated from the model inputs, and these inputs are not the default ones (y_true, y_pred). The predict method works for the generated architecture, but when I try to use train_on_batch, the following error appears.
ValueError: An operation has None for gradient. Please make sure that all of your ops have a gradient defined (i.e. are differentiable). Common ops without gradient: K.argmax, K.round, K.eval.
My custom loss function (below) was based on this example image_ocr.py#L475; the Colab link has another example based on this solution, Custom loss function y_true y_pred shape mismatch #4781, and it generates the same error:
import numpy as np
from keras import backend as K
from keras import losses
import keras
from keras.layers import TimeDistributed, Dense, Dropout, LSTM  # these live in keras.layers, not keras.models

def my_loss(args):
    input_y, input_y_pred, y_pred = args
    return keras.losses.binary_crossentropy(input_y, input_y_pred)

def generator2():
    input_noise = keras.Input(name='input_noise', shape=(40, 38), dtype='float32')
    input_y = keras.Input(name='input_y', shape=(1,), dtype='float32')
    input_y_pred = keras.Input(name='input_y_pred', shape=(1,), dtype='float32')
    lstm1 = LSTM(256, return_sequences=True)(input_noise)
    drop = Dropout(0.2)(lstm1)
    lstm2 = LSTM(256, return_sequences=True)(drop)
    y_pred = TimeDistributed(Dense(38, activation='softmax'))(lstm2)
    loss_out = keras.layers.Lambda(my_loss, output_shape=(1,), name='my_loss')([input_y, input_y_pred, y_pred])
    model = keras.models.Model(inputs=[input_noise, input_y, input_y_pred], outputs=[y_pred, loss_out])
    model.compile(loss={'my_loss': lambda y_true, y_pred: y_pred}, optimizer='adam')
    return model

g2 = generator2()
noise = np.random.uniform(0, 1, size=[10, 40, 38])
g2.train_on_batch([noise, np.ones(10), np.zeros(10)], noise)
I need help figuring out which operation is generating this error, because as far as I know keras.losses.binary_crossentropy is differentiable.

I think the reason is that input_y and input_y_pred are both Keras Inputs. Your loss is computed only from these two tensors, which are not connected to the model's trainable parameters, so the loss function gives no gradient to your model.
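A quick way to check this explanation (not from the original post) is to make the Lambda loss depend on y_pred, which is connected to the LSTM weights; with such a change, train_on_batch no longer hits the "None for gradient" error. The reduction below is purely illustrative:
# Illustrative only: tie the loss to y_pred so gradients can reach the LSTM weights.
def my_loss(args):
    input_y, input_y_pred, y_pred = args
    # Hypothetical reduction of the (batch, 40, 38) output to one score per sample
    score = K.mean(y_pred, axis=[1, 2])
    return keras.losses.binary_crossentropy(input_y, K.expand_dims(score, axis=-1))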

Related

Training a TensorFlow-Keras model with extra layer which gets also the labels as input

I want to train a model that creates a feature vector for an RGB image (a 2D array with 3 channels); using that feature vector, a classifier decides what to do (e.g. person recognition from an image: assign a label by choosing the "closest" pre-trained feature vector among the people enrolled in the system). To do so I use categorical cross-entropy. In the training phase I apply a categorical softmax to the feature vector as an extra layer, obtain as output the probability of each label or class, and then use the softmax output and the training label to compute the loss.
So, for working or testing, the model receives just one input, the image, and outputs a feature vector, while for training the model receives pairs: the image and its label.
I want to train such a model, with [image, label] pairs as input in the training phase and [image] alone as input in the testing or working phase.
I use TensorFlow 2.8 and Keras 2.8 with Python 3.9.5.
The code (with a toy model and some random data):
# ==============================================================================
# Imports
# ==============================================================================
import numpy as np
import tensorflow as tf
import keras
import keras.backend as K
from keras import layers as tfl
from keras import Model

# ==============================================================================
# Switch case layer, behaves differently for training and testing
# ==============================================================================
class Switch(tf.keras.layers.Layer):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)

    def call(self, inputs, training=None):
        x = tf.identity(inputs)
        if training:
            y = tfl.Input(shape=(2,), name="label")
            output_tensor = tf.nn.softmax_cross_entropy_with_logits(y, x)
            return output_tensor
        else:
            output_tensor = tf.identity(x, name="output")
            return output_tensor

# ==============================================================================
# Define model
# ==============================================================================
inputs = keras.Input(shape=(4, 4, 3))
conv = keras.layers.Conv2D(filters=2, kernel_size=2)(inputs)
pooling = keras.layers.GlobalAveragePooling2D()(conv)
feature = keras.layers.Dense(10)(pooling)
outputs = Switch()(feature)
# output = tf.identity(feature)
model = keras.Model(inputs, outputs)

# ==============================================================================
# Training data
# ==============================================================================
tf.random.set_seed(42)
x_train = tf.random.normal((5, 4, 4, 3))
y_train = tf.constant([1, 1, 0, 0, 2])

# ==============================================================================
# Train model
# ==============================================================================
model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics='accuracy',)
model.fit(
    x=x_train,
    y=y_train,
    epochs=3,
    verbose='auto',
    shuffle=True,
    initial_epoch=0,
    max_queue_size=10
)
The Switch layer is based on: Is it possible to add different behavior for training and testing in keras Functional API
If I understand correctly, when using model.fit, the model's call is automatically invoked with training=True.
However, when I run the model I get the following error:
TypeError: You are passing KerasTensor(type_spec=TensorSpec(shape=(),
dtype=tf.float32, name=None), name='Placeholder:0',
description="created by layer 'tf.cast_4'"), an intermediate Keras
symbolic input/output, to a TF API that does not allow registering
custom dispatchers, such as tf.cond, tf.function, gradient tapes,
or tf.map_fn. Keras Functional model construction only supports TF
API calls that do support dispatching, such as tf.math.add or
tf.reshape. Other APIs cannot be called directly on symbolic
Keras inputs/outputs. You can work around this limitation by putting
the operation in a custom Keras layer call and calling that layer on
this symbolic input/output.
When I pass:
model.fit(
    x=[x_train, y_train],
    y=y_train,
I receive the following error:
ValueError: Layer "model" expects 1 input(s), but it received 2 input
tensors. Inputs received: [<tf.Tensor 'IteratorGetNext:0' shape=(None,
4, 4, 3) dtype=float32>, <tf.Tensor 'IteratorGetNext:1' shape=(None,)
dtype=int32>]
The problem is probably due to the Switch layer.
How do I solve it, and how do I train a model whose training-phase inputs and outputs differ from those of the working phase (it gets an image and outputs a feature vector)?
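For reference, a minimal sketch (not from the original post) of one common pattern for this setup: feed the label as a second Input, attach the cross-entropy inside a custom layer via add_loss (the workaround the error message itself suggests), and build a separate inference model that shares the same layers. The layer and variable names below are illustrative.
import tensorflow as tf
import keras
from keras import layers

class CrossEntropyEndpoint(layers.Layer):
    # Adds the training loss and passes the feature vector through unchanged.
    def call(self, inputs):
        features, labels = inputs
        loss = tf.reduce_mean(
            tf.keras.losses.sparse_categorical_crossentropy(
                labels, features, from_logits=True))
        self.add_loss(loss)
        return features

image_in = keras.Input(shape=(4, 4, 3), name="image")
label_in = keras.Input(shape=(), dtype=tf.int32, name="label")
conv = layers.Conv2D(filters=2, kernel_size=2)(image_in)
pooled = layers.GlobalAveragePooling2D()(conv)
features = layers.Dense(10)(pooled)
outputs = CrossEntropyEndpoint()([features, label_in])

train_model = keras.Model([image_in, label_in], outputs)  # used for training
infer_model = keras.Model(image_in, features)             # image -> feature vector
train_model.compile(optimizer="adam")                     # loss comes from add_loss

x_train = tf.random.normal((5, 4, 4, 3))
y_train = tf.constant([1, 1, 0, 0, 2])
train_model.fit([x_train, y_train], epochs=3, verbose=0)
feats = infer_model(x_train)                              # inference: image only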

confusing behaviour of binary_crossentropy loss in evaluate method of keras network

I am trying to understand the calculation of binary_crossentropy when used as the loss for a network that outputs 2 probabilities rather than just 1. Basically I wanted to reproduce the calculation that keras/tf does in this case, rather than in the common case where the network outputs a single value (the logit of the probability of positive classification). Here's some minimal reproducer code:
from tensorflow import keras
import numpy as np

loss_func = keras.losses.BinaryCrossentropy()
nn = keras.Sequential([
    keras.layers.Dense(2**8, input_shape=(1,), activation='relu'),
    keras.layers.Dense(2, activation='softmax')
])
nn.compile(loss=loss_func, optimizer='adam')

train_x = np.array([0.4, 0.7, 0.3, 0.2])
train_y = np.array([[0, 1], [1, 0], [0, 1], [0, 1]])

print("Evaluated loss = ", nn.evaluate(train_x, train_y))
print("Function loss = ", loss_func(train_y, nn.predict(train_x)).numpy())
print("Manual loss = ", np.average(-train_y * np.log(nn.predict(train_x)) - (1 - train_y) * np.log(1. - nn.predict(train_x))))
This outputs:
Evaluated loss = 0.6944893002510071
Function loss = 0.6959093
Manual loss = 0.6959095224738121
So there's a difference between the loss calculated by the evaluate method vs using the loss as a function or even calculating the loss by hand. I note that if I swap to using keras.losses.CategoricalCrossentropy() then all three calculations agree. I also note that if I use a network with a single logit output then everything also agrees, i.e. if I do the following:
loss_func = keras.losses.BinaryCrossentropy(from_logits=True)
nn = keras.Sequential([
    keras.layers.Dense(2**8, input_shape=(1,), activation='relu'),
    keras.layers.Dense(1)
])
nn.compile(loss=loss_func, optimizer='adam')

train_x = np.array([0.4, 0.7, 0.3, 0.2])
train_y = np.array([[1.], [0.], [1.], [1.]])

print("Evaluated loss = ", nn.evaluate(train_x, train_y))
print("Function loss = ", loss_func(train_y, nn.predict(train_x)).numpy())
print("Manual loss = ", np.average(-train_y * np.log(1. / (1 + np.exp(-nn.predict(train_x)))) - (1 - train_y) * np.log(1. - 1. / (1 + np.exp(-nn.predict(train_x))))))
gives:
Evaluated loss = 0.6919926404953003
Function loss = 0.69199264
Manual loss = 0.6919926702976227
So my question is: what calculation is evaluate doing on the first network (the one outputting 2 probabilities), and why is it different from the value calculated using the loss function as a standalone function or by doing the calculation by hand?
Thanks!
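For what it's worth, here is a small standalone check (with illustrative values, not from the original post) that the standalone BinaryCrossentropy call on a two-column probability array reduces to the element-wise binary cross-entropy averaged over every entry, i.e. each softmax output is scored as an independent binary problem; this is why the "Function loss" and "Manual loss" above agree with each other:
import numpy as np
from tensorflow import keras

y_true = np.array([[0., 1.], [1., 0.]])
y_prob = np.array([[0.3, 0.7], [0.6, 0.4]])  # rows sum to 1, as after a softmax

keras_val = keras.losses.BinaryCrossentropy()(y_true, y_prob).numpy()
manual_val = np.mean(-y_true * np.log(y_prob) - (1. - y_true) * np.log(1. - y_prob))
print(keras_val, manual_val)  # should match up to Keras's internal clipping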

Tensorflow No gradients provided for any variable with different shape of variable

with tf.GradientTape() as tape:
    images, labels = x
    initial_points = self.model(images, is_training=True)
    final_images = (tf.ones_like(initial_points) + initial_points).numpy()
    final_images = np.expand_dims(final_images, axis=-1)
    final_labels = tf.zeros_like(final_images)
    loss = tf.nn.softmax_cross_entropy_with_logits(logits=final_images, labels=final_labels)
gradients = tape.gradient(loss, self.model.trainable_variables)
self.optimizer.apply_gradients(zip(gradients, self.model.trainable_variables))
Why is it that if I modify the shape of the model output using np.expand_dims(), I get the following error when applying the gradients to my model variables:
"ValueError: No gradients provided for any variable ..."? It works fine if I don't have the np.expand_dims(), though. Is it because the model loss has to have the same shape as the model output? Or is it non-differentiable?
Always use the TensorFlow versions of NumPy functions to avoid this kind of error.
with tf.GradientTape() as tape:
    images, labels = x
    initial_points = self.model(images, is_training=True)
    # Keep everything as TF tensors: calling .numpy() here would detach the
    # result from the tape and break the gradient chain.
    final_images = tf.ones_like(initial_points) + initial_points
    final_images = tf.expand_dims(final_images, axis=-1)
    final_labels = tf.zeros_like(final_images)
    loss = tf.nn.softmax_cross_entropy_with_logits(logits=final_images, labels=final_labels)
gradients = tape.gradient(loss, self.model.trainable_variables)
self.optimizer.apply_gradients(zip(gradients, self.model.trainable_variables))
The TensorFlow library operates in a very specific manner when you are using tf.GradientTape(). Inside this context, it automatically computes partial derivatives for you in order to update the gradients afterwards. It can do this because each tf function was designed for this specifically.
When you use a NumPy function, however, there is a break in the formula. TensorFlow does not know or understand this function and thus can no longer compute the partial derivative of your loss via the chain rule.
You must use only tf functions under GradientTape() for this reason.
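A tiny standalone illustration of this point (not from the original answer): a NumPy call detaches the result from the tape, while the equivalent TF op keeps the gradient path intact.
import numpy as np
import tensorflow as tf

x = tf.Variable(2.0)
with tf.GradientTape(persistent=True) as tape:
    y = tf.square(x)                       # recorded on the tape
    z_tf = tf.expand_dims(y, axis=-1)      # TF op: still connected to x
    z_np = np.expand_dims(y.numpy(), -1)   # NumPy op: the connection to x is lost

print(tape.gradient(z_tf, x))               # tf.Tensor(4.0, shape=(), dtype=float32)
print(tape.gradient(tf.constant(z_np), x))  # None -- no path back to x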

GradientTape with Keras returns 0

I've tried using GradientTape with a Keras model (simplified) as follows:
import tensorflow as tf
tf.enable_eager_execution()
input_ = tf.keras.layers.Input(shape=(28, 28))
flat = tf.keras.layers.Flatten()(input_)
output = tf.keras.layers.Dense(10, activation='softmax')(flat)
model = tf.keras.Model(input_, output)
model.compile(loss='categorical_crossentropy', optimizer='sgd')
import numpy as np
inp = tf.Variable(np.random.random((1,28,28)), dtype=tf.float32, name='input')
target = tf.constant([[1,0,0,0,0,0,0,0,0,0]], dtype=tf.float32)
with tf.GradientTape(persistent=True) as g:
    g.watch(inp)
    result = model(inp, training=False)
print(tf.reduce_max(tf.abs(g.gradient(result, inp))))
But for some random values of inp, the gradient is zero everywhere, and for the rest, the gradient magnitude is really small (<1e-7).
I've also tried this with a MNIST-trained 3-layer MLP and the results are the same, but trying it with a 1-layer Linear model with no activation works.
What's going on here?
You are computing gradients of a softmax output layer -- since softmax always sums to 1, it makes sense that the gradients (which, in a multi-output case, are summed/averaged over dimensions AFAIK) must be 0 -- the overall output of the layer cannot change. The cases where you get small values > 0 are numerical hiccups, I presume.
When you remove the activation function, this limitation no longer holds and the activations can become larger (meaning gradients with magnitude > 0).
Are you trying to use gradient descent to construct inputs that result in a very large probability for a certain class (if not, disregard this...)? @jdehesa already included a way to do this via the loss function. Note that you can do it via the softmax as well, like so:
import tensorflow as tf
tf.enable_eager_execution()
input_ = tf.keras.layers.Input(shape=(28, 28))
flat = tf.keras.layers.Flatten()(input_)
output = tf.keras.layers.Dense(10, activation='softmax')(flat)
model = tf.keras.Model(input_, output)
model.compile(loss='categorical_crossentropy', optimizer='sgd')
import numpy as np
inp = tf.Variable(np.random.random((1,28,28)), dtype=tf.float32, name='input')
with tf.GradientTape(persistent=True) as g:
    g.watch(inp)
    result = model(inp, training=False)[:,0]
print(tf.reduce_max(tf.abs(g.gradient(result, inp))))
Note that I grab only the results in column 0, corresponding to the first class (I removed target because it's not used). This will compute gradients only for the softmax value for this class, which are meaningful.
Some caveats:
It's important to do the indexing inside the gradient tape context manager! If you do it outside (e.g. in the line where you call g.gradient), this will not work (no gradients).
You can also use gradients of the logits (pre-softmax values) instead. This is different, because softmax probabilities can be increased by making other classes less likely, whereas logits can only be increased by increasing the "score" for the class in question; see the sketch below.
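A hedged sketch of that logits variant (not from the original answer): build the head as a Dense layer without activation followed by a separate Softmax layer, so the pre-softmax tensor is reachable and you can differentiate a single class score directly.
import numpy as np
import tensorflow as tf
# tf.enable_eager_execution()  # only needed on TF 1.x, as in the snippets above

input_ = tf.keras.layers.Input(shape=(28, 28))
flat = tf.keras.layers.Flatten()(input_)
logits = tf.keras.layers.Dense(10)(flat)         # pre-softmax scores
probs = tf.keras.layers.Softmax()(logits)
model = tf.keras.Model(input_, [logits, probs])

inp = tf.Variable(np.random.random((1, 28, 28)), dtype=tf.float32, name='input')
with tf.GradientTape() as g:
    g.watch(inp)
    logit_0 = model(inp, training=False)[0][:, 0]  # logit for class 0 only
print(tf.reduce_max(tf.abs(g.gradient(logit_0, inp))))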
Computing the gradients against the output of the model is not usually very meaningful; in general, you compute the gradients against the loss, which is what tells the model where the variables should go to reach your goal. In this case you would be optimizing your input instead of the model parameters, but the mechanics are the same.
import tensorflow as tf
import numpy as np
tf.enable_eager_execution() # Not necessary in TF 2.x
tf.random.set_random_seed(0) # tf.random.set_seed in TF 2.x
np.random.seed(0)
input_ = tf.keras.layers.Input(shape=(28, 28))
flat = tf.keras.layers.Flatten()(input_)
output = tf.keras.layers.Dense(10, activation='softmax')(flat)
model = tf.keras.Model(input_, output)
model.compile(loss='categorical_crossentropy', optimizer='sgd')
inp = tf.Variable(np.random.random((1, 28, 28)), dtype=tf.float32, name='input')
target = tf.constant([[1, 0, 0, 0, 0, 0, 0, 0, 0, 0]], dtype=tf.float32)
with tf.GradientTape(persistent=True) as g:
    g.watch(inp)
    result = model(inp, training=False)
    # Get the loss for the example
    loss = tf.keras.losses.categorical_crossentropy(target, result)
print(tf.reduce_max(tf.abs(g.gradient(loss, inp))))
# tf.Tensor(0.118953675, shape=(), dtype=float32)

How to use Tensorflow BatchNormalization with GradientTape?

Suppose we have a simple Keras model that uses BatchNormalization:
model = tf.keras.Sequential([
    tf.keras.layers.InputLayer(input_shape=(1,)),
    tf.keras.layers.BatchNormalization()
])
How do you actually use it with GradientTape? The following doesn't seem to work, as it doesn't update the moving averages:
# model training... we want the output values to be close to 150
for i in range(1000):
    x = np.random.randint(100, 110, 10).astype(np.float32)
    with tf.GradientTape() as tape:
        y = model(np.expand_dims(x, axis=1))
        loss = tf.reduce_mean(tf.square(y - 150))
    grads = tape.gradient(loss, model.variables)
    opt.apply_gradients(zip(grads, model.variables))
In particular, if you inspect the moving averages, they remain the same (inspect model.variables; the averages are always 0 and 1). I know one can use .fit() and .predict(), but I would like to use GradientTape, and I'm not sure how to do this. Some version of the documentation suggests updating update_ops, but that doesn't seem to work in eager mode.
In particular, the following code will not output anything close to 150 after the above training.
x = np.random.randint(200, 210, 100).astype(np.float32)
print(model(np.expand_dims(x, axis=1)))
In gradient tape mode, the BatchNormalization layer should be called with the argument training=True.
example:
from tensorflow.keras import layers as KL  # import aliases assumed for this snippet
from tensorflow.keras import models as KM

inp = KL.Input((64, 64, 3))
x = inp
x = KL.Conv2D(3, kernel_size=3, padding='same')(x)
x = KL.BatchNormalization()(x, training=True)
model = KM.Model(inp, x)
then moving vars are properly updated
>>> model.layers[2].weights[2]
<tf.Variable 'batch_normalization/moving_mean:0' shape=(3,) dtype=float32, numpy=array([-0.00062087, 0.00015137, -0.00013239], dtype=float32)>
I just give up. I spent quite a bit of time trying to make sense of a model that looks like this:
model = tf.keras.Sequential([
    tf.keras.layers.BatchNormalization(),
])
And I do give up because of what the resulting fit looks like (plot not reproduced here).
My intuition was that BatchNorm these days is not as straightforward as it used to be, and that is why it scales the original distribution but not so much the new distribution (which is a shame), but ain't nobody got time for that.
Edit: the reason for this behavior is that BN only calculates moments and normalizes batches during training. During training it maintains running averages of the mean and deviation, and once you switch to evaluation those parameters are used as constants, i.e. evaluation should not depend on normalization, because evaluation can be used even for a single input and cannot rely on batch statistics. Since the constants are calculated on a different distribution, you get a higher error during evaluation.
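To make the "running averages only move during training" point concrete, here is a small standalone sketch (not from the original answer) that inspects moving_mean directly:
import numpy as np
import tensorflow as tf

bn = tf.keras.layers.BatchNormalization()
x = tf.constant(np.random.normal(loc=5.0, scale=2.0, size=(32, 3)), dtype=tf.float32)

bn(x, training=False)          # inference call: moving statistics stay untouched
print(bn.moving_mean.numpy())  # still [0. 0. 0.]

for _ in range(500):
    bn(x, training=True)       # training calls: moving_mean decays toward the batch mean
print(bn.moving_mean.numpy())  # now close to the batch mean (roughly 5 per feature)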
With Gradient Tape mode, you would usually find gradients like:
with tf.GradientTape() as tape:
    y_pred = model(features)
    loss = your_loss_function(y_pred, y_true)
gradients = tape.gradient(loss, model.trainable_variables)
train_op = model.optimizer.apply_gradients(zip(gradients, model.trainable_variables))
However, if your model contains a BatchNormalization or Dropout layer (or any layer that has different train/test phases), then TF will fail to build the graph.
A good practice is to explicitly pass the training argument when obtaining output from the model: use model(features, training=True) when optimizing and model(features, training=False) when predicting, in order to explicitly choose the train/test phase for such layers.
For PREDICT and EVAL phase, use
training = (mode == tf.estimator.ModeKeys.TRAIN)
y_pred = model(features, training=training)
For TRAIN phase, use
with tf.GradientTape() as tape:
    y_pred = model(features, training=training)
    loss = your_loss_function(y_pred, y_true)
gradients = tape.gradient(loss, model.trainable_variables)
train_op = model.optimizer.apply_gradients(zip(gradients, model.trainable_variables))
Note that iperov's answer works as well, except that you will need to set the training phase manually for those layers:
x = BatchNormalization()(x, training=True)
x = Dropout(rate=0.25)(x, training=True)
x = BatchNormalization()(x, training=False)
x = Dropout(rate=0.25)(x, training=False)
I'd recommend having one get_model function that returns the model, and changing the phase via the training parameter when calling the model.
Note:
If you use model.variables when finding gradients, you'll get this warning
Gradients do not exist for variables
['layer_1_bn/moving_mean:0',
'layer_1_bn/moving_variance:0',
'layer_2_bn/moving_mean:0',
'layer_2_bn/moving_variance:0']
when minimizing the loss.
This can be resolved by computing gradients only against trainable variables: replace model.variables with model.trainable_variables.
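Putting the advice above together for the original snippet, a hedged sketch of the training loop (opt is assumed to be a tf.keras optimizer, e.g. tf.keras.optimizers.Adam()): pass training=True when calling the model inside the tape, and take gradients only against trainable_variables:
for i in range(1000):
    x = np.random.randint(100, 110, 10).astype(np.float32)
    with tf.GradientTape() as tape:
        # training=True so BatchNormalization updates its moving statistics
        y = model(np.expand_dims(x, axis=1), training=True)
        loss = tf.reduce_mean(tf.square(y - 150))
    # trainable_variables excludes moving_mean / moving_variance
    grads = tape.gradient(loss, model.trainable_variables)
    opt.apply_gradients(zip(grads, model.trainable_variables))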
