I've tried using GradientTape with a Keras model (simplified) as follows:
import tensorflow as tf
tf.enable_eager_execution()
input_ = tf.keras.layers.Input(shape=(28, 28))
flat = tf.keras.layers.Flatten()(input_)
output = tf.keras.layers.Dense(10, activation='softmax')(flat)
model = tf.keras.Model(input_, output)
model.compile(loss='categorical_crossentropy', optimizer='sgd')
import numpy as np
inp = tf.Variable(np.random.random((1,28,28)), dtype=tf.float32, name='input')
target = tf.constant([[1,0,0,0,0,0,0,0,0,0]], dtype=tf.float32)
with tf.GradientTape(persistent=True) as g:
    g.watch(inp)
    result = model(inp, training=False)
print(tf.reduce_max(tf.abs(g.gradient(result, inp))))
But for some random values of inp, the gradient is zero everywhere, and for the rest, the gradient magnitude is really small (<1e-7).
I've also tried this with a MNIST-trained 3-layer MLP and the results are the same, but trying it with a 1-layer Linear model with no activation works.
What's going on here?
You are computing gradients of a softmax output layer. Since the softmax always sums to 1, it makes sense that the gradients (which, in a multi-output case, are summed/averaged over dimensions AFAIK) must be 0 -- the overall output of the layer cannot change. The cases where you get small values > 0 are numerical hiccups, I presume.
When you remove the activation function, this constraint no longer holds and the activations can become larger (meaning gradients with magnitude > 0).
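As a quick sketch of this effect (with made-up logit values, not from the question): the gradient of the summed softmax output with respect to its inputs is zero, because that sum is constant at 1:
import tensorflow as tf
tf.enable_eager_execution()

logits = tf.Variable([[2.0, -1.0, 0.5]])
with tf.GradientTape() as g:
    total = tf.reduce_sum(tf.nn.softmax(logits))  # always exactly 1
print(g.gradient(total, logits))  # approximately [[0., 0., 0.]]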
Are you trying to use gradient descent to construct inputs that result in a very large probability for a certain class (if not, disregard this...)? @jdehesa already included a way to do this via the loss function. Note that you can do it via the softmax as well, like so:
import tensorflow as tf
tf.enable_eager_execution()
input_ = tf.keras.layers.Input(shape=(28, 28))
flat = tf.keras.layers.Flatten()(input_)
output = tf.keras.layers.Dense(10, activation='softmax')(flat)
model = tf.keras.Model(input_, output)
model.compile(loss='categorical_crossentropy', optimizer='sgd')
import numpy as np
inp = tf.Variable(np.random.random((1,28,28)), dtype=tf.float32, name='input')
with tf.GradientTape(persistent=True) as g:
    g.watch(inp)
    result = model(inp, training=False)[:, 0]
print(tf.reduce_max(tf.abs(g.gradient(result, inp))))
Note that I grab only the results in column 0, corresponding to the first class (I removed target because it's not used). This will compute gradients only for the softmax value for this class, which are meaningful.
Some caveats:
It's important to do the indexing inside the gradient tape context manager! If you do it outside (e.g. in the line where you call g.gradient), it will not work (no gradients).
You can also use gradients of the logits (pre-softmax values) instead. This is different, because softmax probabilities can be increased by making other classes less likely, whereas logits can only be increased by increasing the "score" for the class in question.
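A minimal sketch of that logit variant, assuming you split the softmax into a separate Activation layer so that the pre-softmax scores are reachable through a second model sharing the same weights (the logit_model name is just for illustration):
import tensorflow as tf
import numpy as np
tf.enable_eager_execution()

input_ = tf.keras.layers.Input(shape=(28, 28))
flat = tf.keras.layers.Flatten()(input_)
logits = tf.keras.layers.Dense(10)(flat)                    # pre-softmax scores
probs = tf.keras.layers.Activation('softmax')(logits)
model = tf.keras.Model(input_, probs)
logit_model = tf.keras.Model(input_, logits)                # shares weights with model

inp = tf.Variable(np.random.random((1, 28, 28)), dtype=tf.float32, name='input')
with tf.GradientTape() as g:
    g.watch(inp)
    class0_logit = logit_model(inp, training=False)[:, 0]   # logit of class 0 only
print(tf.reduce_max(tf.abs(g.gradient(class0_logit, inp))))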
Computing the gradients against the output of the model is not usually very meaningful; in general you compute the gradients against the loss, which is what tells the model where the variables should go to reach your goal. In this case, you would be optimizing your input instead of the model parameters, but the idea is the same.
import tensorflow as tf
import numpy as np
tf.enable_eager_execution() # Not necessary in TF 2.x
tf.random.set_random_seed(0) # tf.random.set_seed in TF 2.x
np.random.seed(0)
input_ = tf.keras.layers.Input(shape=(28, 28))
flat = tf.keras.layers.Flatten()(input_)
output = tf.keras.layers.Dense(10, activation='softmax')(flat)
model = tf.keras.Model(input_, output)
model.compile(loss='categorical_crossentropy', optimizer='sgd')
inp = tf.Variable(np.random.random((1, 28, 28)), dtype=tf.float32, name='input')
target = tf.constant([[1, 0, 0, 0, 0, 0, 0, 0, 0, 0]], dtype=tf.float32)
with tf.GradientTape(persistent=True) as g:
    g.watch(inp)
    result = model(inp, training=False)
    # Get the loss for the example
    loss = tf.keras.losses.categorical_crossentropy(target, result)
print(tf.reduce_max(tf.abs(g.gradient(loss, inp))))
# tf.Tensor(0.118953675, shape=(), dtype=float32)
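As a follow-up sketch (reusing inp, loss, and g from the snippet above, with an assumed step size lr), one plain gradient-descent step on the input itself could look like:
grad = g.gradient(loss, inp)  # works again here because the tape is persistent
lr = 0.1                      # assumed step size, not from the original answer
inp.assign_sub(lr * grad)     # nudge the input toward lower loss for `target`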
I am trying to understand the calculation of binary_crossentropy when used as the loss for a network that outputs 2 probabilities rather than just 1. Basically I want to reproduce the calculation that keras/tf is doing in this case, rather than in the common case where the network outputs a single value (the logit of the probability of positive classification). Here's some minimal reproducer code:
from tensorflow import keras
import numpy as np
loss_func = keras.losses.BinaryCrossentropy()
nn = keras.Sequential([
    keras.layers.Dense(2**8, input_shape=(1,), activation='relu'),
    keras.layers.Dense(2, activation='softmax')
])
nn.compile(loss=loss_func, optimizer='adam')
train_x = np.array([0.4,0.7,0.3,0.2])
train_y = np.array([[0,1],[1,0],[0,1],[0,1]])
print("Evaluted loss = ",nn.evaluate(train_x,train_y))
print("Function loss = ",loss_func(train_y,nn.predict(train_x)).numpy())
print("Manual loss = ",np.average( -train_y*np.log(nn.predict(train_x)) -(1-train_y)*np.log(1. - nn.predict(train_x)) ))
This outputs:
Evaluted loss = 0.6944893002510071
Function loss = 0.6959093
Manual loss = 0.6959095224738121
So there's a difference between the loss calculated by the evaluate method and the loss computed by calling the loss as a function, or even calculating the loss by hand. I note that if I switch to keras.losses.CategoricalCrossentropy() then all three calculations agree. I also note that if I use a network with a single logit output then everything also agrees, i.e. if I do the following:
loss_func = keras.losses.BinaryCrossentropy(from_logits=True)
nn = keras.Sequential([
    keras.layers.Dense(2**8, input_shape=(1,), activation='relu'),
    keras.layers.Dense(1)
])
nn.compile(loss=loss_func, optimizer='adam')
train_x = np.array([0.4,0.7,0.3,0.2])
train_y = np.array([[1.],[0.],[1.],[1.]])
print("Evaluted loss = ",nn.evaluate(train_x,train_y))
print("Function loss = ",loss_func(train_y,nn.predict(train_x)).numpy())
print("Manual loss = ",np.average( -train_y*np.log(1./(1+np.exp(-nn.predict(train_x)))) -(1-train_y)*np.log(1. - 1./(1+np.exp(-nn.predict(train_x)))) ))
gives:
Evaluted loss = 0.6919926404953003
Function loss = 0.69199264
Manual loss = 0.6919926702976227
So my question is: what calculation is evaluate doing on the first network (the one that outputs 2 probabilities), and why is it different from the value calculated using the loss function as a standalone function or by doing the calculation by hand?
Thanks!
According to the documentation, tf.keras.activations.relu(x, alpha=0.0, max_value=None, threshold=0) seems to clip x within [threshold, max_value], but x must be specified. How can I use it for clipping the output of a layer in neural network? Or is there a more convenient way to do so?
Suppose I want to output the linear combination of all elements of a 10-by-10 2D-array only when the result is between 0 and 5.
import tensorflow as tf
from tensorflow import keras
model = keras.models.Sequential()
model.add(keras.layers.Flatten(input_shape=[10, 10]))
model.add(keras.layers.Dense(1, activation='relu'))  # output layer
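A minimal sketch of one way to do this (assuming the goal is an output clipped to [0, 5]): pass a ReLU configured with max_value as the layer's activation, since Keras accepts any callable there. The clipped_relu name is just for illustration:
import tensorflow as tf
from tensorflow import keras

def clipped_relu(x):
    # a ReLU clipped to [0, 5]
    return keras.activations.relu(x, max_value=5.0)

model = keras.models.Sequential()
model.add(keras.layers.Flatten(input_shape=[10, 10]))
model.add(keras.layers.Dense(1, activation=clipped_relu))  # output layer, clipped to [0, 5]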
I am trying to teach a neural net a simple function (f(t) = 2t) and then compute its derivative with respect to the input (df/dt = 2). I use a net with one dense layer and no activation:
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import RMSprop

model = Sequential()
model.add(Dense(output_dim=1, input_shape=(1,), bias_initializer='ones'))
opt = RMSprop(lr=0.01, rho=0.9, epsilon=None, decay=0.0)
model.compile(optimizer=opt, loss='mse', metrics=['mae'])
model.summary()
My data consist of pairs t -> f(t), where t is chosen randomly on [0, 1]. To compute df/dt of my net I found this code:
https://groups.google.com/forum/#!msg/keras-users/g2JmncAIT9w/36MJZI7NBQAJ
and https://colab.research.google.com/drive/1l9YdIa2N40Fj3Y09qb3r3RhqKPXoaVJC
This is my full code on colab:
import numpy as np
import tensorflow as tf
from keras import backend as k

model.fit(train_x, train_y, epochs=100, validation_data=(test_x, test_y), shuffle=False, batch_size=32)
model.layers[0].get_weights()  # this displays 2.0069, quite right
outputTensor = model.output
listOfVariableTensors = model.inputs[0]
gradients = k.gradients(outputTensor, listOfVariableTensors)
sess = tf.InteractiveSession()
sess.run(tf.initialize_all_variables())
evaluated_gradients = sess.run(gradients, feed_dict={model.input: np.array([[10]])})
evaluated_gradients  # this displays kinda random number
model.layers[0].get_weights()  # this displays same number as above
I believe my model performs a simple w*t + b transform, so its derivative should be just w. But the code I found provides wrong results and breaks trained weights. I actually think it resets them to initial weights because if I initialize dense layer weights with kernel_initializer= "ones", the code returns 1 as a derivative.
So, I need help with correctly computing the derivative of the neural net.
sess = tf.InteractiveSession() # create the session BEFORE building and training the model,
                               # and do NOT re-run variable initialization afterwards --
                               # re-initializing is what reset the trained weights above
#....
model.fit(train_x ,train_y, epochs=100,validation_data=(test_x, test_y),shuffle=False, batch_size=32)
model.layers[0].get_weights()# this displays 2.0069, quite right
outputTensor = model.output
listOfVariableTensors = model.inputs[0]
gradients = k.gradients(outputTensor, listOfVariableTensors)
evaluated_gradients = sess.run(gradients,feed_dict={model.input:np.array([[10]])})
evaluated_gradients
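For what it's worth, a sketch of the same input derivative computed with tf.GradientTape instead of K.gradients, assuming a TF 2.x setup where the model above is built with tf.keras and eager execution is on:
import tensorflow as tf

t = tf.constant([[10.0]])
with tf.GradientTape() as tape:
    tape.watch(t)             # t is a constant, so it must be watched explicitly
    y = model(t)              # the trained model computes roughly w*t + b
print(tape.gradient(y, t))    # roughly [[2.0]], i.e. the learned weight w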
I want to create a network where in the input layer nodes are just connected to some nodes in the next layer. Here is a small example:
My solution so far is to set the weight of the edge between i1 and h1 to zero, and after every optimization step I multiply the weights with a matrix (I call it the mask matrix) in which every entry is 1 except the entry for the weight of the edge between i1 and h1.
(See code below)
Is this approach right? Or does it have an effect on the gradient descent? Is there another approach to create this kind of network in TensorFlow?
import tensorflow as tf
import tensorflow.contrib.eager as tfe
import numpy as np
tf.enable_eager_execution()
model = tf.keras.Sequential([
    tf.keras.layers.Dense(2, activation=tf.sigmoid, input_shape=(2,)),  # input shape required
    tf.keras.layers.Dense(2, activation=tf.sigmoid)
])
#set the weights
weights=[np.array([[0, 0.25],[0.2,0.3]]),np.array([0.35,0.35]),np.array([[0.4,0.5],[0.45, 0.55]]),np.array([0.6,0.6])]
model.set_weights(weights)
model.get_weights()
features = tf.convert_to_tensor([[0.05,0.10 ]])
labels = tf.convert_to_tensor([[0.01,0.99 ]])
mask =np.array([[0, 1],[1,1]])
#define the loss function
def loss(model, x, y):
    y_ = model(x)
    return tf.losses.mean_squared_error(labels=y, predictions=y_)

#define the gradient calculation
def grad(model, inputs, targets):
    with tf.GradientTape() as tape:
        loss_value = loss(model, inputs, targets)
    return loss_value, tape.gradient(loss_value, model.trainable_variables)
#create optimizer and global step
optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.01)
global_step = tf.train.get_or_create_global_step()
#optimization step
loss_value, grads = grad(model, features, labels)
optimizer.apply_gradients(zip(grads, model.variables),global_step)
#masking the optimized weights: apply the mask to the first layer's kernel only,
#then write the full weight list back
weights = model.get_weights()
weights[0] = weights[0] * mask
model.set_weights(weights)
If you are looking for a solution for the specific example you provided, you can simply use the tf.keras Functional API and define two Dense layers, where one is connected to both neurons in the previous layer and the other is only connected to one of them:
from tensorflow.keras.layers import Input, Lambda, Dense, concatenate
from tensorflow.keras.models import Model
inp = Input(shape=(2,))
inp2 = Lambda(lambda x: x[:,1:2])(inp) # get the second neuron
h1_out = Dense(1, activation='sigmoid')(inp2) # only connected to the second neuron
h2_out = Dense(1, activation='sigmoid')(inp) # connected to both neurons
h_out = concatenate([h1_out, h2_out])
out = Dense(2, activation='sigmoid')(h_out)
model = Model(inp, out)
# simply train it using `fit`
model.fit(...)
The problem with your solution, and some others suggested by other answers in this post, is that they do not prevent training of this weight. They allow gradient descent to train the non-existent weight and then overwrite it retrospectively. This will result in a network that has a zero in this location as desired, but it will negatively affect your training process: the back-propagation calculation will not see the masking step, since it is not part of the TensorFlow graph, so gradient descent will follow a path which assumes this weight does have an effect on the outcome (it does not).
A better solution would be to include the masking step as part of your TensorFlow graph, so that it can be factored into the gradient descent. Since the masking step is simply an element-wise multiplication by your sparse, binary mask matrix, you could include the mask as an element-wise multiplication in the graph definition using tf.multiply.
Sadly this means saying goodbye to the user-friendly keras.layers methods and embracing a more nuts-and-bolts approach to TensorFlow. I can't see an obvious way to do it using the layers API.
See the implementation below, I have tried to provide comments explaining what is happening at each stage.
import tensorflow as tf

## Graph definition for model
# set up tf.placeholders for inputs x, and outputs y_
# these remain fixed during training and can have values fed to them during the session
with tf.name_scope("Placeholders"):
    x = tf.placeholder(tf.float32, shape=[None, 2], name="x")    # input layer
    y_ = tf.placeholder(tf.float32, shape=[None, 2], name="y_")  # output layer

# set up tf.Variables for the weights at each layer from l1 to l3, and set up feeding of initial values
# also set up the mask as a variable and set it to be un-trainable
with tf.name_scope("Variables"):
    w_l1_values = [[0, 0.25], [0.2, 0.3]]
    w_l1 = tf.Variable(w_l1_values, name="w_l1")
    w_l2_values = [[0.4, 0.5], [0.45, 0.55]]
    w_l2 = tf.Variable(w_l2_values, name="w_l2")

    mask_values = [[0., 1.], [1., 1.]]
    mask = tf.Variable(mask_values, trainable=False, name="mask")

# link each set of weights as matrix multiplications in the graph. Include an elementwise multiplication by mask.
# Sequence takes us from inputs x to output final_out, which will be compared to labels fed to placeholder y_
l1_out = tf.nn.relu(tf.matmul(x, tf.multiply(w_l1, mask)), name="l1_out")
final_out = tf.nn.relu(tf.matmul(l1_out, w_l2), name="output")

## define loss function and training operation
with tf.name_scope("Loss"):
    # some loss defined as a function of graph output: final_out and labels: y_
    loss = tf.nn.sigmoid_cross_entropy_with_logits(logits=final_out, labels=y_, name="loss")

with tf.name_scope("Train"):
    # some optimisation strategy, arbitrary learning rate
    optimizer = tf.train.AdamOptimizer(learning_rate=0.001, name="optimizer_adam")
    train_op = optimizer.minimize(loss, name="train_op")

# create session, initialise variables and train according to inputs and corresponding labels
# This should show that the values of the first layer weights change, but the one set to 0 remains at 0
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    initial_l1_weights = sess.graph.get_tensor_by_name("Variables/w_l1:0")
    print(initial_l1_weights.eval())

    inputs = [[0.05, 0.10]]
    labels = [[0.01, 0.99]]
    ans = sess.run(train_op, feed_dict={"Placeholders/x:0": inputs, "Placeholders/y_:0": labels})

    train_steps = 1
    for i in range(train_steps):
        initial_l1_weights = sess.graph.get_tensor_by_name("Variables/w_l1:0")
        print(initial_l1_weights.eval())
Or use the answer provided by @today for a Keras-friendly option.
You have multiple options here.
First, you could use the dynamic masking approach in your example. I believe this will work as expected since the gradients w.r.t. the masked-out parameters will be zero (the output is constant when you change the unused parameters). This approach is simple and it can be used even when your mask is not constant during the training.
Second, if you know beforehand which weights will always be zero, you can compose your weight matrix using tf.get_variable to get a submatrix, and then concatenate it with a tf.constant tensor, e.g.:
weights_sub = tf.get_variable("w", [dim_in, dim_out - 1])
zeros = tf.zeros([dim_in, 1])
weights = tf.concat([weights_sub, zeros], axis=1)
This example makes one column of your weight matrix always zero.
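A sketch of how the composed matrix might then be used in a layer (dim_in, dim_out, and x here are assumed names and sizes, not from the original answer); gradients only flow into weights_sub, so the zeroed column never trains:
import tensorflow as tf

dim_in, dim_out = 2, 2                                # assumed sizes for this sketch
weights_sub = tf.get_variable("w", [dim_in, dim_out - 1])
zeros = tf.zeros([dim_in, 1])
weights = tf.concat([weights_sub, zeros], axis=1)

x = tf.placeholder(tf.float32, [None, dim_in])
h = tf.nn.relu(tf.matmul(x, weights))                 # the zero column contributes nothing to the output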
Finally, if your mask is more complex, you can use tf.get_variable on a flattened vector and then compose a tf.SparseTensor with the variable values on the used indices:
weights_used = tf.get_variable("w", [num_used_vars])
indices = ... # get your indices in a 2-D matrix of shape [num_used_vars, 2]
dense_shape = tf.constant([dim_in, dim_out]) # this is the final shape of the weight matrix
weights = tf.SparseTensor(indices, weights_used, dense_shape)
EDIT: This probably won't work in combination with Keras' set_weights method, as it expects Numpy arrays, not Tensors.
Google Colab to reproduce the error None_for_gradient.ipynb
I need a custom loss function whose value is calculated from the model inputs; these inputs are not the default values (y_true, y_pred). The predict method works for the generated architecture, but when I try to use train_on_batch, the following error appears.
ValueError: An operation has None for gradient. Please make sure that all of your ops have a gradient defined (i.e. are differentiable). Common ops without gradient: K.argmax, K.round, K.eval.
My custom loss function (below) was based on this example image_ocr.py#L475; the Colab link has another example based on this solution Custom loss function y_true y_pred shape mismatch #4781, and it generates the same error:
from keras import backend as K
from keras import losses
import keras
import numpy as np
from keras.layers import TimeDistributed, Dense, Dropout, LSTM

def my_loss(args):
    input_y, input_y_pred, y_pred = args
    return keras.losses.binary_crossentropy(input_y, input_y_pred)

def generator2():
    input_noise = keras.Input(name='input_noise', shape=(40, 38), dtype='float32')
    input_y = keras.Input(name='input_y', shape=(1,), dtype='float32')
    input_y_pred = keras.Input(name='input_y_pred', shape=(1,), dtype='float32')
    lstm1 = LSTM(256, return_sequences=True)(input_noise)
    drop = Dropout(0.2)(lstm1)
    lstm2 = LSTM(256, return_sequences=True)(drop)
    y_pred = TimeDistributed(Dense(38, activation='softmax'))(lstm2)
    loss_out = keras.layers.Lambda(my_loss, output_shape=(1,), name='my_loss')([input_y, input_y_pred, y_pred])
    model = keras.models.Model(inputs=[input_noise, input_y, input_y_pred], outputs=[y_pred, loss_out])
    model.compile(loss={'my_loss': lambda y_true, y_pred: y_pred}, optimizer='adam')
    return model
g2 = generator2()
noise = np.random.uniform(0,1,size=[10,40,38])
g2.train_on_batch([noise, np.ones(10), np.zeros(10)], noise)
I need help to verify which operation is generating this error, because as far as I know the keras.losses.binary_crossentropy is differentiable.
I think the reason is that input_y and input_y_pred are both Keras Inputs. Your loss function is calculated from these two tensors only; they are not connected to the model parameters, so the loss function gives no gradient to your model.
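A minimal toy sketch of that point (a hypothetical model, assuming graph mode / TF 1.x as in the question): a loss built only from Input tensors has no path to the trainable weights, so its gradients are None, while a loss that uses the model output does:
import keras
from keras import backend as K

inp_x = keras.Input(shape=(3,))
inp_y = keras.Input(shape=(1,))
out = keras.layers.Dense(1, activation='sigmoid')(inp_x)
model = keras.models.Model([inp_x, inp_y], out)

loss_disconnected = K.mean(keras.losses.binary_crossentropy(inp_y, inp_y))  # uses only Inputs
loss_connected = K.mean(keras.losses.binary_crossentropy(inp_y, out))       # uses the model output

print(K.gradients(loss_disconnected, model.trainable_weights))  # [None, None]
print(K.gradients(loss_connected, model.trainable_weights))     # actual gradient tensors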