I'm trying to combine a few "networks" into one final loss function. I'm wondering if what I'm doing is "legal"; as of now I can't seem to make this work. I'm using TensorFlow Probability:
The main problem is here:
# Get gradients of the loss wrt the weights.
gradients = tape.gradient(loss, [m_phis.trainable_weights, m_mus.trainable_weights, m_sigmas.trainable_weights])
# Update the weights of our linear layer.
optimizer.apply_gradients(zip(gradients, [m_phis.trainable_weights, m_mus.trainable_weights, m_sigmas.trainable_weights]))
This gives me None gradients and throws on apply_gradients:
AttributeError: 'list' object has no attribute 'device'
Full code:
univariate_gmm = tfp.distributions.MixtureSameFamily(
    mixture_distribution=tfp.distributions.Categorical(probs=phis_true),
    components_distribution=tfp.distributions.Normal(loc=mus_true, scale=sigmas_true)
)
x = univariate_gmm.sample(n_samples, seed=random_seed).numpy()
dataset = tf.data.Dataset.from_tensor_slices(x)
dataset = dataset.shuffle(buffer_size=1024).batch(64)
m_phis = keras.layers.Dense(2, activation=tf.nn.softmax)
m_mus = keras.layers.Dense(2)
m_sigmas = keras.layers.Dense(2, activation=tf.nn.softplus)
def neg_log_likelihood(y, phis, mus, sigmas):
    a = tfp.distributions.Normal(loc=mus[0], scale=sigmas[0]).prob(y)
    b = tfp.distributions.Normal(loc=mus[1], scale=sigmas[1]).prob(y)
    c = np.log(phis[0]*a + phis[1]*b)
    return tf.reduce_sum(-c, axis=-1)
# Use the negative log-likelihood defined above as the loss function.
loss_fn = neg_log_likelihood
# Instantiate an optimizer.
optimizer = tf.keras.optimizers.SGD(learning_rate=1e-3)
# Iterate over the batches of the dataset.
for step, y in enumerate(dataset):
    yy = np.expand_dims(y, axis=1)
    # Open a GradientTape.
    with tf.GradientTape() as tape:
        # Forward pass.
        phis = m_phis(yy)
        mus = m_mus(yy)
        sigmas = m_sigmas(yy)
        # Loss value for this batch.
        loss = loss_fn(yy, phis, mus, sigmas)
    # Get gradients of the loss wrt the weights.
    gradients = tape.gradient(loss, [m_phis.trainable_weights, m_mus.trainable_weights, m_sigmas.trainable_weights])
    # Update the weights of our linear layer.
    optimizer.apply_gradients(zip(gradients, [m_phis.trainable_weights, m_mus.trainable_weights, m_sigmas.trainable_weights]))
    # Logging.
    if step % 100 == 0:
        print("Step:", step, "Loss:", float(loss))
There are two separate problems to take into account.
1. Gradients are None:
Typically this happens if non-TensorFlow operations are executed in the code that is watched by the GradientTape. Concretely, this concerns the computation of np.log in your neg_log_likelihood function. If you replace np.log with tf.math.log, the gradients should compute. It is a good habit not to use NumPy in your "internal" TensorFlow components, since this avoids errors like this. For most NumPy operations there is a good TensorFlow substitute.
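For illustration, here is the loss from the question with only that one call swapped (np.log replaced by tf.math.log); everything else is unchanged:
def neg_log_likelihood(y, phis, mus, sigmas):
    a = tfp.distributions.Normal(loc=mus[0], scale=sigmas[0]).prob(y)
    b = tfp.distributions.Normal(loc=mus[1], scale=sigmas[1]).prob(y)
    # tf.math.log keeps the operation on the tape, so gradients can flow back
    c = tf.math.log(phis[0] * a + phis[1] * b)
    return tf.reduce_sum(-c, axis=-1)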
2. apply_gradients for multiple trainables:
This mainly has to do with the input that apply_gradients expects. There you have two options:
First option: Call apply_gradients three times, each time with different trainables
optimizer.apply_gradients(zip(m_phis_gradients, m_phis.trainable_weights))
optimizer.apply_gradients(zip(m_mus_gradients, m_mus.trainable_weights))
optimizer.apply_gradients(zip(m_sigmas_gradients, m_sigmas.trainable_weights))
The alternative would be to create a single flat list of (gradient, variable) tuples, as indicated in the TensorFlow documentation (quote: "grads_and_vars: List of (gradient, variable) pairs.").
This would mean calling something like
optimizer.apply_gradients(
    list(zip(m_phis_gradients, m_phis.trainable_weights))
    + list(zip(m_mus_gradients, m_mus.trainable_weights))
    + list(zip(m_sigmas_gradients, m_sigmas.trainable_weights))
)
Both options require you to split the gradients. You can either do that by computing all gradients at once and indexing them (gradients[0], ...), or you can simply compute the gradients separately. Note that the latter may require persistent=True in your GradientTape.
# [...]
# Open a GradientTape.
with tf.GradientTape(persistent=True) as tape:
    # Forward pass.
    phis = m_phis(yy)
    mus = m_mus(yy)
    sigmas = m_sigmas(yy)
    # Loss value for this batch.
    loss = loss_fn(yy, phis, mus, sigmas)
# Get gradients of the loss wrt the weights.
m_phis_gradients = tape.gradient(loss, m_phis.trainable_weights)
m_mus_gradients = tape.gradient(loss, m_mus.trainable_weights)
m_sigmas_gradients = tape.gradient(loss, m_sigmas.trainable_weights)
# Update the weights of our linear layers with one flat list of (gradient, variable) pairs.
optimizer.apply_gradients(
    list(zip(m_phis_gradients, m_phis.trainable_weights))
    + list(zip(m_mus_gradients, m_mus.trainable_weights))
    + list(zip(m_sigmas_gradients, m_sigmas.trainable_weights))
)
# [...]
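A simpler variant under the same setup (a sketch, not from the original answer): put all trainable weights of the three layers into one flat list, take a single tape.gradient call, and pass the zipped pairs straight to apply_gradients. This also avoids persistent=True:
# Flat list of all trainable variables of the three layers.
all_vars = (m_phis.trainable_weights
            + m_mus.trainable_weights
            + m_sigmas.trainable_weights)

with tf.GradientTape() as tape:
    # Forward pass.
    phis = m_phis(yy)
    mus = m_mus(yy)
    sigmas = m_sigmas(yy)
    # Loss value for this batch.
    loss = loss_fn(yy, phis, mus, sigmas)

# One gradient per variable, in the same order as all_vars.
gradients = tape.gradient(loss, all_vars)
optimizer.apply_gradients(zip(gradients, all_vars))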
Related
I am currently working with an LSTM sequence-to-sequence model for time domain signal predictions. Because of domain knowledge, I know that the first part of the prediction (about 20%) can never be predicted correctly, since the information required is not available in the given input sequence. The remaining 80% of the predicted sequence is usually predicted quite well. In order to exclude the first 20% from the training optimization, it would be nice to define a loss function that operates on a given index range, like the NumPy code below:
start = int(0.2 * sequence_length)
stop = sequence_length

def mse(pred, target):
    """Mean squared error between two time series np.arrays."""
    return 1 / target.shape[0] * np.sum((pred - target)**2)

def range_mse_loss(y_pred, y):
    return mse(y_pred[start:stop], y[start:stop])
How do I have to write this loss function so that it works with my preexisting Keras code, where the loss is simply given by model.compile(loss='mse')?
You can slice your tensors so that only the last 80% of the data enters the loss:
size = int(y_true.shape[0] * 0.8)  # for a 2D tensor, e.g., (100, 1)
loss_fn = tf.keras.losses.MeanSquaredError(name='mse')
loss_fn(y_true[-size:], y_pred[-size:])
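To make this work with the preexisting model.compile(loss='mse') workflow, you can wrap the slicing in a custom loss function and pass it to compile. A minimal sketch, assuming the time axis is axis 1 of y_true/y_pred (i.e., shape (batch, timesteps, features)):
import tensorflow as tf

def range_mse_loss(y_true, y_pred):
    # Drop the first 20% of the time steps and average the squared error over the rest.
    start = tf.shape(y_true)[1] // 5
    return tf.reduce_mean(tf.square(y_true[:, start:, :] - y_pred[:, start:, :]))

# model.compile(optimizer='adam', loss=range_mse_loss)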
You can also pass sample_weight when calling the tf.keras.losses.MeanSquaredError() instance: an array of weights whose first 20% are zero.
size = int(y_true.shape[0] * 0.8)  # for a 2D tensor, e.g., (100, 1)
zeros = tf.zeros((y_true.shape[0] - size,), dtype=tf.float32)
ones = tf.ones((size,), dtype=tf.float32)
weights = tf.concat([zeros, ones], 0)
loss_fn = tf.keras.losses.MeanSquaredError(name='mse')
loss_fn(y_true, y_pred, sample_weight=weights)
There is a caveat with the second solution: the final loss will be lower than with the first solution, because you are zeroing out the first prediction values but not removing them from the formula MSE = 1/n * sum((y - y_hat)^2), so the weighted sum is still divided by the full n.
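If you want the two variants to agree numerically, one option (a sketch, not part of the original answer) is to rescale the weighted loss by the kept fraction:
# The weighted MSE still divides by the full length n, while the sliced MSE
# divides by 0.8 * n, so the weighted value is 0.8 times the sliced one.
weighted = loss_fn(y_true, y_pred, sample_weight=weights)
rescaled = weighted * tf.cast(tf.shape(y_true)[0], tf.float32) / size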
One workaround would be to mark those observations as None/NaN and then override the train_step method. Following TensorFlow's tutorial on customizing train_step, you would do something like this:
@tf.function
def train_step(keras_model, data):
    print('custom train_step')
    # Unpack the data. Its structure depends on your model and
    # on what you pass to `fit()`.
    x, y = data
    with tf.GradientTape() as tape:
        y_pred = keras_model(x, training=True)  # Forward pass
        # Mask NaN values in the observations, also assuming that targets are > 0.0
        mask = tf.greater(y, 0.0)
        true_y = tf.boolean_mask(y, mask)
        pred_y = tf.boolean_mask(y_pred, mask)
        # Compute the loss value
        # (the loss function is configured in `compile()`)
        loss = keras_model.compiled_loss(true_y, pred_y, regularization_losses=keras_model.losses)
    # Compute gradients
    trainable_vars = keras_model.trainable_variables
    gradients = tape.gradient(loss, trainable_vars)
    # Update weights
    keras_model.optimizer.apply_gradients(zip(gradients, trainable_vars))
    # Update metrics (includes the metric that tracks the loss)
    keras_model.compiled_metrics.update_state(true_y, pred_y)
    # Return a dict mapping metric names to current value
    return {m.name: m.result() for m in keras_model.metrics}
This will work for all the performance metrics you are tracking. An alternative would be to mask the NaNs inside the loss function, but that would be limited to a single loss function and would not cover other loss functions/performance metrics.
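To actually hook this into fit(), the usual pattern from that tutorial is to subclass keras.Model and define train_step as a method. A minimal sketch with the same masking logic (the class name MaskedModel is just illustrative):
import tensorflow as tf
from tensorflow import keras

class MaskedModel(keras.Model):
    def train_step(self, data):
        x, y = data
        with tf.GradientTape() as tape:
            y_pred = self(x, training=True)
            # Drop the positions marked as unpredictable (here: targets <= 0.0).
            mask = tf.greater(y, 0.0)
            true_y = tf.boolean_mask(y, mask)
            pred_y = tf.boolean_mask(y_pred, mask)
            loss = self.compiled_loss(true_y, pred_y, regularization_losses=self.losses)
        gradients = tape.gradient(loss, self.trainable_variables)
        self.optimizer.apply_gradients(zip(gradients, self.trainable_variables))
        self.compiled_metrics.update_state(true_y, pred_y)
        return {m.name: m.result() for m in self.metrics}

# Built and compiled like any functional model:
# model = MaskedModel(inputs, outputs)
# model.compile(optimizer='adam', loss='mse')
# model.fit(x_train, y_train, epochs=10)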
with tf.GradientTape() as tape:
    images, labels = x
    initial_points = self.model(images, is_training=True)
    final_images = (tf.ones_like(initial_points) + initial_points).numpy()
    final_images = np.expand_dims(final_images, axis=-1)
    final_labels = tf.zeros_like(final_images)
    loss = tf.nn.softmax_cross_entropy_with_logits(logits=final_images, labels=final_labels)
gradients = tape.gradient(loss, self.model.trainable_variables)
self.optimizer.apply_gradients(zip(gradients, self.model.trainable_variables))
Why is it that if I modify the shape of the model output using np.expand_dims(), I get the following error:
"ValueError: No gradients provided for any variable ... " when applying the gradients to my model variables? It works fine if I don't have the np.expand_dims() though. Is it because the model loss has to have the same shape as the model output? Or is it non-differentiable?
Always use the TensorFlow versions of NumPy functions to avoid this kind of error.
with tf.GradientTape() as tape:
    images, labels = x
    initial_points = self.model(images, is_training=True)
    # Stay in TensorFlow ops end to end; calling .numpy() here would also break the tape.
    final_images = tf.ones_like(initial_points) + initial_points
    final_images = tf.expand_dims(final_images, axis=-1)
    final_labels = tf.zeros_like(final_images)
    loss = tf.nn.softmax_cross_entropy_with_logits(logits=final_images, labels=final_labels)
gradients = tape.gradient(loss, self.model.trainable_variables)
self.optimizer.apply_gradients(zip(gradients, self.model.trainable_variables))
The TensorFlow library operates in a very specific manner when you are using tf.GradientTape(). Inside this context, it automatically records operations and computes partial derivatives for you in order to update the gradients afterwards. It can do this because each tf function was designed specifically for this.
When you use a NumPy function, however, there is a break in the computation graph: TensorFlow does not know/understand this function and thus cannot compute the partial derivative of your loss via the chain rule anymore.
You must use only tf functions under GradientTape() for this reason.
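A tiny self-contained illustration of that break (not from the original answer): the same computation once with a TensorFlow op and once routed through NumPy.
import numpy as np
import tensorflow as tf

x = tf.Variable(3.0)

# Pure TensorFlow path: the tape can differentiate through tf.square.
with tf.GradientTape() as tape:
    y = tf.square(x)
print(tape.gradient(y, x))  # tf.Tensor(6.0, shape=(), dtype=float32)

# NumPy in the middle: x.numpy() leaves the graph, so the result is a constant.
with tf.GradientTape() as tape:
    y = tf.constant(np.square(x.numpy()))
print(tape.gradient(y, x))  # None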
I wrote some sample code to reproduce the real problem I am facing in my project. I am using an LSTM in TensorFlow to model some time series data. The input dimensions are (10, 100, 1), that is, 10 instances, 100 time steps, and 1 feature. The output has the same shape.
What I want to achieve after training the model is to study the influence of each of the inputs on each output at each particular time step. In other words, I would like to see which input variables affect my output the most (or which input has the most influence on the output, e.g., the largest gradient) at each time step. Here is the code for this problem:
tf.keras.backend.clear_session()
tf.random.set_seed(42)
model_input = tf.data.Dataset.from_tensor_slices(np.random.normal(size=(10, 100, 1)))
model_input = model_input.batch(10)
model_output = tf.data.Dataset.from_tensor_slices(np.random.normal(size=(10, 100, 1)))
model_output = model_output.batch(10)
my_dataset = tf.data.Dataset.zip((model_input, model_output))
m_inputs = tf.keras.Input(shape=(None, 1))
lstm_outputs = tf.keras.layers.LSTM(32, return_sequences=True)(m_inputs)
m_outputs = tf.keras.layers.TimeDistributed(tf.keras.layers.Dense(1))(lstm_outputs)
my_model = tf.keras.Model(m_inputs, m_outputs, name="my_model")
my_optimizer=tf.keras.optimizers.Adam(learning_rate=0.001)
my_loss_fn = tf.keras.losses.MeanSquaredError()
my_epochs = 3
for epoch in range(my_epochs):
    for step, (x_batch_tr, y_batch_tr) in enumerate(my_dataset):
        # Open a gradient tape to record the operations run during the forward pass, which enables autodifferentiation
        with tf.GradientTape() as tape:
            # Run the forward pass of the layer
            logits = my_model(x_batch_tr, training=True)
            # Compute the loss value for this mismatch
            loss_value = my_loss_fn(y_batch_tr, logits)
        # Use the gradient tape to automatically retrieve the gradients of the trainable variables with respect to the loss
        grads = tape.gradient(loss_value, my_model.trainable_weights)
        # Run one step of gradient descent by updating the value of the variables to minimize the loss
        my_optimizer.apply_gradients(zip(grads, my_model.trainable_weights))
        print(f"Step {step}, loss: {loss_value}")

print("\n\nCalculate gradient of outputs w.r.t. inputs\n\n")

for step, (x_batch_tr, y_batch_tr) in enumerate(my_dataset):
    # Open a gradient tape to record the operations run during the forward pass, which enables autodifferentiation
    with tf.GradientTape() as tape:
        tape.watch(x_batch_tr)
        # Run the forward pass of the layer
        logits = my_model(x_batch_tr, training=True)
        # tape.watch(logits[:, 10, :])  # this didn't help
        # Compute the loss value for this mismatch
        loss_value = my_loss_fn(y_batch_tr, logits)
    # Use the gradient tape to retrieve the gradients of one output time step with respect to the inputs
    # grads = tape.gradient(logits, x_batch_tr)  # This works
    # print(grads.numpy().shape)  # This works
    grads = tape.gradient(logits[:, 10, :], x_batch_tr)
    print(grads)
In other words, I would like to find the inputs that affect my output the most (at each particular time step).
To me, grads = tape.gradient(logits, x_batch_tr) won't do the job, because it sums the gradients from all outputs with respect to each input.
However, the gradients are always None.
Any help is much appreciated!
You can use tf.GradientTape.batch_jacobian to get precisely that information:
grads = tape.batch_jacobian(logits, x_batch_tr)
print(grads.shape)
# (10, 100, 1, 100, 1)
Here, grads[i, t1, f1, t2, f2] gives you, for the example i, the gradient of output feature f1 at time t1 with respect to input feature f2 at time t2. If, as in your case, you only have one feature, you can just say that grads[i, t1, 0, t2, 0] gives you the gradient of t1 with respect to t2. Conveniently, you can also aggregate different axes or slices of this result to get aggregated gradients. For example, tf.reduce_sum(grads[:, :, :, :10], axis=3) would give you the gradient of each output time step with respect to the first ten input time steps.
About getting None gradients in your example, I think it is because you are doing the slicing operation outside of the gradient tape context, so the gradient tracking is lost.
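For completeness, a sketch of that fix using the same names as in the question: do the slicing while the tape is still recording and take the gradient of the sliced tensor.
for step, (x_batch_tr, y_batch_tr) in enumerate(my_dataset):
    with tf.GradientTape() as tape:
        tape.watch(x_batch_tr)
        logits = my_model(x_batch_tr, training=True)
        # Slice inside the tape context so the operation is tracked.
        logits_t10 = logits[:, 10, :]
    grads = tape.gradient(logits_t10, x_batch_tr)
    print(grads.shape)  # (10, 100, 1), summed over the output features at t=10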
So the solution was to create a temporary tensor for the part of the logits that we need to use in tape.gradient, and to register that tensor on the tape using tape.watch.
This is how it should be done:
for step, (x_batch_tr, y_batch_tr) in enumerate(my_dataset):
    # Open a gradient tape to record the operations run during the forward pass, which enables autodifferentiation
    with tf.GradientTape() as tape:
        tape.watch(x_batch_tr)
        # Run the forward pass of the layer
        logits = my_model(x_batch_tr, training=True)
        tensor_logits = tf.constant(logits[:, 10, :])
        tape.watch(tensor_logits)
        # Compute the loss value for this mismatch
        loss_value = my_loss_fn(y_batch_tr, logits)
    # Use the gradient tape to retrieve the gradients of the selected outputs with respect to the inputs
    grads = tape.gradient(tensor_logits, x_batch_tr)
    print(grads.numpy())
I'm currently analyzing how gradients develop over the course of training of a CNN using Tensorflow 2.x. What I want to do is compare each gradient in a batch to the gradient resulting for the whole batch. At the moment I use this simple code snippet for each training step:
[...]
loss_object = tf.keras.losses.SparseCategoricalCrossentropy()
[...]
# One training step
# x_train is a batch of input data, y_train the corresponding labels
def train_step(model, optimizer, x_train, y_train):
    # Process batch
    with tf.GradientTape() as tape:
        batch_predictions = model(x_train, training=True)
        batch_loss = loss_object(y_train, batch_predictions)
    batch_grads = tape.gradient(batch_loss, model.trainable_variables)
    # Do something with gradient of whole batch
    # ...

    # Process each data point in the current batch
    for index in range(len(x_train)):
        with tf.GradientTape() as single_tape:
            single_prediction = model(x_train[index:index+1], training=True)
            single_loss = loss_object(y_train[index:index+1], single_prediction)
        single_grad = single_tape.gradient(single_loss, model.trainable_variables)
        # Do something with gradient of single data input
        # ...

    # Use batch gradient to update network weights
    optimizer.apply_gradients(zip(batch_grads, model.trainable_variables))

    train_loss(batch_loss)
    train_accuracy(y_train, batch_predictions)
My main problem is that the computation time explodes when calculating each of the gradients individually, even though these calculations should already have been done by TensorFlow when calculating the batch gradient. The reason is that GradientTape as well as compute_gradients always return a single gradient, no matter whether one or several data points were given, so this computation has to be done for each data point.
I know that I could compute the batch gradient used to update the network from all the single gradients calculated for each data point, but that saves only a minor amount of computation time.
Is there a more efficient way to compute single gradients?
You can use the jacobian method of the gradient tape to get the Jacobian matrix, which will give you the gradients for each individual loss value:
import tensorflow as tf
# Make a random linear problem
tf.random.set_seed(0)
# Random input batch of ten four-vector examples
x = tf.random.uniform((10, 4))
# Random weights
w = tf.random.uniform((4, 2))
# Random batch label
y = tf.random.uniform((10, 2))
with tf.GradientTape() as tape:
    tape.watch(w)
    # Prediction
    p = x @ w
    # Loss
    loss = tf.losses.mean_squared_error(y, p)
# Compute Jacobian
j = tape.jacobian(loss, w)
# The Jacobian gives you the gradient for each loss value
print(j.shape)
# (10, 4, 2)
# Gradient of the loss wrt the weights for the first example
tf.print(j[0])
# [[0.145728424 0.0756840706]
# [0.103099883 0.0535449386]
# [0.267220169 0.138780832]
# [0.280130595 0.145485848]]
I recently hit a training performance bottleneck. I always add a lot of histograms to the summary. I want to know whether calculating the gradients first and then minimizing the loss again computes the gradients twice. A simplified code:
# layers
...
# optimizer
loss = tf.losses.mean_squared_error(labels=y_true, predictions=logits)
opt = AdamOptimizer(learning_rate)
# collect gradients
gradients = opt.compute_gradients(loss)
# train operation
train_op = opt.minimize(loss)
...
# merge summary
...
Is there a minimize method in the optimizers that uses the gradients directly? Something like opt.minimize(gradients) instead of opt.minimize(loss)?
You can use apply_gradients after computing the gradients with compute_gradients, as follows:
grads_and_vars = opt.compute_gradients(loss)
train_op = opt.apply_gradients(grads_and_vars)
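A sketch of how that fits together with the histogram summaries from the question (TF1-style API, matching the question's code; the summary names are illustrative): the gradients are computed once and reused for both the summaries and the update.
# Compute the gradients once...
grads_and_vars = opt.compute_gradients(loss)

# ...log them...
for grad, var in grads_and_vars:
    if grad is not None:
        tf.summary.histogram(var.op.name + '/gradient', grad)

# ...and reuse them for the update step instead of calling opt.minimize(loss).
train_op = opt.apply_gradients(grads_and_vars)
merged_summary = tf.summary.merge_all()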