How to accumulate gradients in TensorFlow 2.0?

I'm training a model with TensorFlow 2.0. The images in my training set have different resolutions. The model I've built can handle variable resolutions (conv layers followed by global averaging). My training set is very small and I want to use the full training set in a single batch.
Since my images have different resolutions, I can't use model.fit(). So, I'm planning to pass each sample through the network individually, accumulate the errors/gradients, and then apply one optimizer step. I'm able to compute loss values, but I don't know how to accumulate the losses/gradients. How can I accumulate the losses/gradients and then apply a single optimizer step?
Code:
for i in range(num_epochs):
    print(f'Epoch: {i + 1}')
    total_loss = 0
    for j in tqdm(range(num_samples)):
        sample = samples[j]
        with tf.GradientTape() as tape:
            prediction = self.model(sample)
            loss_value = self.loss_function(y_true=labels[j], y_pred=prediction)
        gradients = tape.gradient(loss_value, self.model.trainable_variables)
        self.optimizer.apply_gradients(zip(gradients, self.model.trainable_variables))
        total_loss += loss_value
    epoch_loss = total_loss / num_samples
    print(f'Epoch loss: {epoch_loss}')

If I understand correctly from this statement:
How can I accumulate the losses/gradients and then apply a single optimizer step?
@Nagabhushan is trying to accumulate gradients and then apply the optimization on the (mean) accumulated gradient. The answer provided by @TensorflowSupport does not answer it.
In order to perform the optimization only once, and accumulate the gradient from several tapes, you can do the following:
for i in range(num_epochs):
    print(f'Epoch: {i + 1}')
    total_loss = 0

    # get trainable variables
    train_vars = self.model.trainable_variables
    # Create empty gradient list (not a tf.Variable list)
    accum_gradient = [tf.zeros_like(this_var) for this_var in train_vars]

    for j in tqdm(range(num_samples)):
        sample = samples[j]
        with tf.GradientTape() as tape:
            prediction = self.model(sample)
            loss_value = self.loss_function(y_true=labels[j], y_pred=prediction)
        total_loss += loss_value

        # get gradients of this tape
        gradients = tape.gradient(loss_value, train_vars)
        # Accumulate the gradients
        accum_gradient = [(accum_grad + grad) for accum_grad, grad in zip(accum_gradient, gradients)]

    # Now, after executing all the tapes you needed, we apply the optimization step
    # (but first we take the average of the gradients)
    accum_gradient = [this_grad / num_samples for this_grad in accum_gradient]
    # apply optimization step
    self.optimizer.apply_gradients(zip(accum_gradient, train_vars))

    epoch_loss = total_loss / num_samples
    print(f'Epoch loss: {epoch_loss}')
Using tf.Variable() should be avoided inside the training loop, since it will produce errors when trying to execute the code as a graph. If you use tf.Variable() inside your training function and then decorate it with "@tf.function" or apply "tf.function(my_train_fcn)" to obtain a graph function (i.e. for improved performance), the execution will raise an error.
This happens because tracing the tf.Variable creation results in different behaviour than what is observed in eager execution (re-utilization or creation, respectively). You can find more info on this in the TensorFlow help page.
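If you do want to run the training step as a graph, a common workaround is to create the accumulator tf.Variables once, outside the traced function, and only assign to them inside it. The following is a minimal sketch of my own (not part of the answer above); it drops the self. prefix and assumes model, loss_function and optimizer are defined as in the snippets above:

import tensorflow as tf

# Accumulators created once, outside the traced function, so tracing
# does not attempt to re-create variables.
train_vars = model.trainable_variables
accum_vars = [tf.Variable(tf.zeros_like(v), trainable=False) for v in train_vars]

@tf.function
def accumulate_step(x, y):
    # forward pass and per-sample gradient, accumulated into accum_vars
    with tf.GradientTape() as tape:
        prediction = model(x, training=True)
        loss_value = loss_function(y_true=y, y_pred=prediction)
    gradients = tape.gradient(loss_value, train_vars)
    for accum, grad in zip(accum_vars, gradients):
        accum.assign_add(grad)
    return loss_value

def apply_accumulated(num_samples):
    # average the accumulated gradients, apply them once, then reset
    optimizer.apply_gradients(
        zip([accum / num_samples for accum in accum_vars], train_vars))
    for accum in accum_vars:
        accum.assign(tf.zeros_like(accum))

Calling accumulate_step once per sample and apply_accumulated once per epoch reproduces the accumulate-then-apply behaviour described above while remaining compatible with @tf.function.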

In line with the Stack Overflow answer and the explanation provided on the TensorFlow website, below is code for accumulating gradients in TensorFlow 2.0:
def train(epochs):
    for epoch in range(epochs):
        for (batch, (images, labels)) in enumerate(dataset):
            with tf.GradientTape() as tape:
                logits = mnist_model(images, training=True)
                tvs = mnist_model.trainable_variables
                accum_vars = [tf.Variable(tf.zeros_like(tv.initialized_value()), trainable=False) for tv in tvs]
                zero_ops = [tv.assign(tf.zeros_like(tv)) for tv in accum_vars]
                loss_value = loss_object(labels, logits)
            loss_history.append(loss_value.numpy().mean())
            grads = tape.gradient(loss_value, tvs)
            # print(grads[0].shape)
            # print(accum_vars[0].shape)
            accum_ops = [accum_vars[i].assign_add(grad) for i, grad in enumerate(grads)]
            optimizer.apply_gradients(zip(grads, mnist_model.trainable_variables))
        print('Epoch {} finished'.format(epoch))

# Call the above function
train(epochs=3)
Complete code can be found in this Github Gist.
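As a usage note (my reading of the intended pattern, not something stated in the answer or the Gist): in eager execution the zero_ops and accum_ops assignments above run as soon as the list comprehensions execute, so the accumulators would typically be created once before the loop, updated with assign_add on every micro-batch, passed to apply_gradients after a fixed number of micro-batches, and then reset. A sketch, with accum_steps as a name I introduce:

accum_steps = 4  # hypothetical number of micro-batches to accumulate over
accum_vars = [tf.Variable(tf.zeros_like(tv), trainable=False)
              for tv in mnist_model.trainable_variables]

for step, (images, labels) in enumerate(dataset):
    with tf.GradientTape() as tape:
        logits = mnist_model(images, training=True)
        loss_value = loss_object(labels, logits)
    grads = tape.gradient(loss_value, mnist_model.trainable_variables)

    # accumulate this micro-batch's gradients
    for accum_var, grad in zip(accum_vars, grads):
        accum_var.assign_add(grad)

    # every accum_steps micro-batches: apply the averaged gradients and reset
    if (step + 1) % accum_steps == 0:
        optimizer.apply_gradients(
            zip([av / accum_steps for av in accum_vars], mnist_model.trainable_variables))
        for accum_var in accum_vars:
            accum_var.assign(tf.zeros_like(accum_var))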

Related

Keras loss value significant jump

I am working on a simple neural network in Keras with TensorFlow. There is a significant jump in the loss value from the last mini-batch of epoch L-1 to the first mini-batch of epoch L.
I am aware that the loss should decrease as the number of iterations increases, but a significant jump in loss after each epoch does look strange. Here is the code snippet:
import tensorflow as tf
from tensorflow.keras import backend as K

tf.keras.initializers.he_uniform(seed=None)
initializer = tf.keras.initializers.he_uniform()

def my_loss(y_true, y_pred):
    epsilon = 1e-30  # epsilon is added to avoid inf/nan
    y_pred = K.cast(y_pred, K.floatx())
    y_true = K.cast(y_true, K.floatx())
    loss = y_true * K.log(y_pred + epsilon) + (1 - y_true) * K.log(1 - y_pred + epsilon)
    loss = K.mean(loss, axis=-1)
    loss = K.mean(loss)
    loss = -1 * loss
    return loss

inputs = tf.keras.Input(shape=(140,))
x = tf.keras.layers.Dense(1000, kernel_initializer=initializer)(inputs)
x = tf.keras.layers.BatchNormalization()(x)
x = tf.keras.layers.Dense(1000, kernel_initializer=initializer)(x)
x = tf.keras.layers.ReLU()(x)
x = tf.keras.layers.Dense(1000, kernel_initializer=initializer)(x)
x = tf.keras.layers.ReLU()(x)
x = tf.keras.layers.Dense(100, kernel_initializer=initializer)(x)
outputs = tf.keras.activations.sigmoid(x)
model = tf.keras.Model(inputs=inputs, outputs=outputs)
opt = tf.keras.optimizers.Adam()
recall1 = tf.keras.metrics.Recall(top_k=8)
c_entropy = tf.keras.losses.BinaryCrossentropy()
model.compile(loss=c_entropy, optimizer=opt, metrics=[recall1, my_loss], run_eagerly=True)
model.fit(X_train_test, Y_train_test, epochs=epochs, batch_size=batch, shuffle=True, verbose=1)
When I searched online, I found this article, which suggests that Keras calculates the moving average over the mini-batches. I also found that the array used for calculating the moving average is reset after each epoch, which is why we obtain a very smooth curve within an epoch but a jump after the epoch.
In order to avoid the moving average, I implemented my own loss function, which should output the loss values of the mini-batch instead of the moving average over the batches. Since each mini-batch is different, its loss should also differ, so I was expecting an arbitrary loss value for each mini-batch from my implementation of the loss function. Instead, I obtain exactly the same values as the loss function from Keras.
I am unclear about:
Is Keras calculating the moving average over the mini-batches, the array of which is reset after each epoch, causing the jump? If not, what is causing the jump in the loss value?
Is my implementation of the loss for each mini-batch correct? If not, how can I obtain the loss value of the mini-batch during training?
Keras in fact shows the moving average instead of the "raw" loss values. The moving average array is reset after each epoch, which is why we see a huge jump after each epoch. In order to acquire the raw loss values, one should implement a callback as shown below:
import keras

class LossHistory(keras.callbacks.Callback):
    def on_train_begin(self, logs={}):
        # initialize a list at the beginning of training
        self.losses = []

    def on_batch_end(self, batch, logs={}):
        self.losses.append(logs.get('loss'))

mycallback = LossHistory()
Then call it in model.fit:
model.fit(X, Y, epochs=epochs, batch_size=batch, shuffle=True, verbose=0, callbacks=[mycallback])
print(mycallback.losses)
I tested with the following configuration
Keras 2.3.1
Tensorflow 2.1.0
Python 3.7.9
For some reason, it didn't work with the following configuration
Keras 2.4.3
Tensorflow 2.2.0
Python 3.8.5
To answer the second question: the implementation of the loss function my_loss is correct, and the values it produces are very close to those generated by the built-in tf.keras.losses.BinaryCrossentropy().
In TensorFlow version 2.2 and newer, the loss provided to on_train_batch_end is now the average loss of all batches up until the current batch within the given epoch. This is also the case for additional metrics, and applies to the built-in losses/metrics as well as any custom losses/metrics.
Fortunately, the loss for the current batch can be calculated from the average loss as follows:
from tensorflow.keras.callbacks import Callback

class CustomCallback(Callback):
    ''' This callback converts the average loss (default behavior in TF>=2.2)
    into the loss for only the current batch.
    '''
    def on_epoch_begin(self, epoch, logs={}):
        self.previous_loss_sum = 0

    def on_train_batch_end(self, batch, logs={}):
        # calculate loss of current batch:
        current_loss_sum = (batch + 1) * logs['loss']
        current_loss = current_loss_sum - self.previous_loss_sum
        self.previous_loss_sum = current_loss_sum
        # use current_loss:
        # ...
This code can be added to any custom callback that needs the loss for the current batch instead of the average loss, including the LossHistory callback provided in Doc Jazzy's answer.
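For instance, a combined version might look like the following sketch (my own assembly of the two ideas; the class and attribute names are mine, and model, X, Y, epochs and batch are assumed to be defined as in the question):

from tensorflow.keras.callbacks import Callback

class BatchLossHistory(Callback):
    '''Stores the raw per-batch loss in TF >= 2.2, where logs['loss'] is the
    running average over the current epoch.'''
    def on_train_begin(self, logs={}):
        self.batch_losses = []

    def on_epoch_begin(self, epoch, logs={}):
        self.previous_loss_sum = 0

    def on_train_batch_end(self, batch, logs={}):
        # recover the current batch's loss from the running average
        current_loss_sum = (batch + 1) * logs['loss']
        self.batch_losses.append(current_loss_sum - self.previous_loss_sum)
        self.previous_loss_sum = current_loss_sum

history = BatchLossHistory()
model.fit(X, Y, epochs=epochs, batch_size=batch, callbacks=[history])
print(history.batch_losses)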
Also, if you are using TensorFlow 1 or TensorFlow 2 <= 2.1, do not include this code in your callback, as those versions already provide the current loss instead of the average loss.

ValueError: No gradients provided for any variable when calculating loss

I have been trying to implement the training step for a DQN described in this paper on various RL methods using TensorFlow, but when I try to compute the gradient using a GradientTape I get ValueError: No gradients provided for any variable. Below is the training step code:
def train_step(model, target, optimizer, observations, actions, rewards, next_observations):
    with tf.GradientTape() as tape:
        target_logits = tf.math.reduce_max(target(np.expand_dims(next_observations, -1)), 1)
        logits = model(np.expand_dims(observations, -1))
        act_logits = np.ndarray(EXPERIENCE_SAMPLE_SIZE)
        for i in range(EXPERIENCE_SAMPLE_SIZE):
            act_logits[i] = logits[i][actions[i]]
        act_logits = tf.convert_to_tensor(act_logits, dtype=tf.float32)
        y_T = tf.math.add(tf.convert_to_tensor(rewards, dtype=tf.float32), tf.math.scalar_mul(DISCOUNT_RATE, target_logits))
        loss = tf.math.squared_difference(act_logits, y_T)
        loss = tf.math.scalar_mul(1.0 / EXPERIENCE_SAMPLE_SIZE, loss)
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
Where model and target are tf.keras.Sequential models that output the expected value for taking each of 5 possible actions, optimizer is SGD, and observations, actions, rewards, and next_observations are numpy arrays sampled from an experience replay buffer.
This is part of implementing pseudocode from the aforementioned paper.
My best guess is that this error occurs because indexing logits makes the result impossible to differentiate, but I don't know how else to calculate the Q*(s,a,theta) quantity.
Adding the solution in the answer section for the benefit of the community.
From the comments, the problem is resolved by replacing the code:
act_logits = np.ndarray(EXPERIENCE_SAMPLE_SIZE)
for i in range(EXPERIENCE_SAMPLE_SIZE):
    act_logits[i] = logits[i][actions[i]]
with the code:
act_logits = tf.math.reduce_max(tf.math.multiply(act_logits, logits), 1)
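For reference, an equivalent differentiable selection (a sketch of my own, not taken from the comments) builds a one-hot mask from the taken actions inside the tape, assuming actions is an integer vector of action indices and logits has shape (EXPERIENCE_SAMPLE_SIZE, num_actions):

# one-hot mask over the action axis, built from the taken actions
action_mask = tf.one_hot(actions, depth=logits.shape[-1], dtype=logits.dtype)
# differentiable Q(s, a) selection: zero out non-taken actions and reduce
act_logits = tf.reduce_sum(action_mask * logits, axis=1)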

TensorFlow 2.0: Eager execution of training either returns bad results or doesn't learn at all

I am experimenting with TensorFlow 2.0 (alpha). I want to implement a simple feed-forward network with two output nodes for binary classification (it's a 2.0 version of this model).
This is a simplified version of the script. After I defined a simple Sequential() model, I set:
# import layers + dropout & activation
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.activations import elu, softmax

# Neural Network Architecture
n_input = X_train.shape[1]
n_hidden1 = 15
n_hidden2 = 10
n_output = y_train.shape[1]

model = tf.keras.models.Sequential([
    Dense(n_input, input_shape=(n_input,), activation=elu),  # Input layer
    Dropout(0.2),
    Dense(n_hidden1, activation=elu),  # hidden layer 1
    Dropout(0.2),
    Dense(n_hidden2, activation=elu),  # hidden layer 2
    Dropout(0.2),
    Dense(n_output, activation=softmax)  # Output layer
])

# define loss and accuracy
bce_loss = tf.keras.losses.BinaryCrossentropy()
accuracy = tf.keras.metrics.BinaryAccuracy()

# define optimizer
optimizer = tf.optimizers.Adam(learning_rate=0.001)

# save training progress in lists
loss_history = []
accuracy_history = []

# loop over 1000 epochs
for epoch in range(1000):
    with tf.GradientTape() as tape:
        # take binary cross-entropy (bce_loss)
        current_loss = bce_loss(model(X_train), y_train)

    # Update weights based on the gradient of the loss function
    gradients = tape.gradient(current_loss, model.trainable_variables)
    optimizer.apply_gradients(zip(gradients, model.trainable_variables))

    # save in history vectors
    current_loss = current_loss.numpy()
    loss_history.append(current_loss)
    accuracy.update_state(model(X_train), y_train)
    current_accuracy = accuracy.result().numpy()
    accuracy_history.append(current_accuracy)

    # print loss and accuracy scores each 100 epochs
    if (epoch + 1) % 100 == 0:
        print(str(epoch + 1) + '.\tTrain Loss: ' + str(current_loss) + ',\tAccuracy: ' + str(current_accuracy))
    accuracy.reset_states()

print('\nTraining complete.')
Training runs without errors, but strange things happen:
Sometimes, the network doesn't learn anything: the loss and accuracy scores are constant throughout all the epochs.
Other times, the network is learning, but very badly. Accuracy never goes beyond 0.4 (while in TensorFlow 1.x I got an effortless 0.95+). Such low performance suggests to me that something went wrong in the training.
Other times, the accuracy improves very slowly, while the loss remains constant the whole time.
What can cause these problems? Please help me understand my mistakes.
UPDATE:
After some corrections, I can make the network learn. However, its performance is extremely poor. After 1000 epochs, it reaches about 40% accuracy, which clearly means something is still wrong. Any help is appreciated.
tf.GradientTape records every operation that happens inside its scope.
You don't want to record the gradient calculation in the tape; you only want to compute the forward pass and the loss.
with tf.GradientTape() as tape:
    # take binary cross-entropy (bce_loss)
    current_loss = bce_loss(model(df), classification)
# End of tape scope

# Update weights based on the gradient of the loss function
gradients = tape.gradient(current_loss, model.trainable_variables)
# The tape is now consumed
optimizer.apply_gradients(zip(gradients, model.trainable_variables))
More importantly, I don't see a loop over the training set, so I suppose the complete code looks like:
for epoch in range(n_epochs):
    for df, classification in dataset:
        # your code that computes loss and trains
Moreover, the usage of the metrics is wrong.
You want to accumulate, i.e. update the internal state of, the accuracy metric at every training step and measure the overall accuracy at the end of every epoch.
Thus you have to:
# Measure the accuracy inside the training loop
accuracy.update_state(model(df), classification)
And call accuracy.result() only at the end of the epoch, when all the accuracy values have been accumulated into the metric.
Remember to call the .reset_states() method to clear the variable state, resetting it to zero at the end of every epoch.
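Putting those pieces together, a minimal corrected loop might look like the following sketch (my own assembly of the advice above, reusing the names from the snippets; note that Keras losses and metrics expect arguments in the order (y_true, y_pred)):

for epoch in range(n_epochs):
    for df, classification in dataset:
        with tf.GradientTape() as tape:
            # forward pass + loss only, inside the tape
            current_loss = bce_loss(classification, model(df, training=True))
        gradients = tape.gradient(current_loss, model.trainable_variables)
        optimizer.apply_gradients(zip(gradients, model.trainable_variables))
        # accumulate the metric state at every step
        accuracy.update_state(classification, model(df, training=False))
    # report and reset once per epoch
    print('Epoch {}: loss={:.4f}, accuracy={:.4f}'.format(
        epoch + 1, current_loss.numpy(), accuracy.result().numpy()))
    accuracy.reset_states()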

Use custom loss value while training in tensorflow

I would like to train my neural network using a custom loss value of my own. Therefore, I would like to perform a feed-forward pass for one mini-batch to store the activations in memory, and then perform backpropagation using my own loss value. This is to be done using TensorFlow.
Finally, I need to do something like:
sess.run(optimizer, feed_dict={x: training_data, loss: my_custom_loss_value})
Is that possible? I am assuming that the optimizer depends on the loss, which itself depends on the input. Therefore, I want the inputs to be fed into the graph, but I want to use my own value for the loss.
I guess that since the optimizer depends on the activations, they will be evaluated; in other words, the input is going to be fed into the network. Here is an example:
import tensorflow as tf

a = tf.Variable(tf.constant(8.0))
a = tf.Print(input_=a, data=[a], message="a:")
b = tf.Variable(tf.constant(6.0))
b = tf.Print(input_=b, data=[b], message="b:")
c = a * b

optimizer = tf.train.AdadeltaOptimizer(learning_rate=0.1).minimize(c)
init_op = tf.global_variables_initializer()

with tf.Session() as sess:
    sess.run(init_op)
    value, _ = sess.run([c, optimizer], feed_dict={c: 1})
    print(value)
Finally, the printed value is 1.0, while the console shows a:[8] b:[6], which means that the inputs got evaluated.
Exactly so.
When you train with Gradient Descent or any other optimization algorithm like AdamOptimizer(), the optimizer minimizes your loss function, which could be a softmax cross-entropy (tf.nn.softmax_cross_entropy_with_logits) for multi-class classification, a squared error loss (tf.losses.mean_squared_error) for regression, or your own custom loss. The loss function is evaluated or computed using the model hypothesis.
So TensorFlow uses this cascade approach to train the model hypothesis by calling tf.Session().run() on the optimizer. See the following as a rough example in a multi-class classification setting:
batch_size = 128

# build the linear model
hypothesis = tf.add(tf.matmul(input_X, weight), bias)

# softmax cross entropy loss or cost function for logistic regression
loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=targets,
                                                              logits=hypothesis))

# optimizer to minimize loss
optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.001).minimize(loss)

# execute in Session
with tf.Session() as sess:
    # initialize all variables
    tf.global_variables_initializer().run()
    tf.local_variables_initializer().run()

    # Train the model
    for steps in range(1000):
        mini_batch = zip(range(0, X_train.shape[0], batch_size),
                         range(batch_size, X_train.shape[0] + 1, batch_size))

        # train using mini-batches
        for (start, end) in mini_batch:
            sess.run(optimizer, feed_dict={input_X: X_features[start:end],
                                           input_y: y_targets[start:end]})

How does pytorch compute the gradients for a simple linear regression model?

I am using pytorch and trying to understand how a simple linear regression model works.
I'm using a simple LinearRegressionModel class:
class LinearRegressionModel(nn.Module):
    def __init__(self, input_dim, output_dim):
        super(LinearRegressionModel, self).__init__()
        self.linear = nn.Linear(input_dim, output_dim)

    def forward(self, x):
        out = self.linear(x)
        return out
model = LinearRegressionModel(1, 1)
Next I instantiate a loss criterion and an optimizer
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
Finally to train the model I use the following code:
for epoch in range(epochs):
    if torch.cuda.is_available():
        inputs = Variable(torch.from_numpy(x_train).cuda())
    if torch.cuda.is_available():
        labels = Variable(torch.from_numpy(y_train).cuda())

    # Clear gradients w.r.t. parameters
    optimizer.zero_grad()

    # Forward to get output
    outputs = model(inputs)

    # Calculate Loss
    loss = criterion(outputs, labels)

    # Getting gradients w.r.t. parameters
    loss.backward()

    # Updating parameters
    optimizer.step()
My question is how does the optimizer get the loss gradient, computed by loss.backward(), to update the parameters using the step() method? How are the model, the loss criterion and the optimizer tied together?
PyTorch has the concepts of tensors and variables. When you use nn.Linear, the module creates two variables, namely W and b. In PyTorch, a variable is a wrapper that encapsulates a tensor, its gradient, and information about the function that created it. You can directly access the gradients via:
w.grad
If you try it before calling loss.backward(), you get None. Once you call loss.backward(), it will contain the gradients. You can then update these gradients manually with the simple step below:
w.data -= learning_rate * w.grad.data
When you have a complex network, this simple manual step grows complicated, so optimisers like SGD and Adam take care of it. When you create the object for these optimisers, you pass in the parameters of your model. nn.Module provides a parameters() function that returns all the learnable parameters to the optimiser, which can be done as below:
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss.backward()
calculates the gradients and stores them in the parameters. And you pass in the parameters that need to be tuned here:
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
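As an illustration (a minimal sketch of my own, not part of the answer above), this is roughly what optimizer.step() followed by optimizer.zero_grad() does for plain SGD without momentum:

import torch

learning_rate = 0.01

with torch.no_grad():
    for param in model.parameters():
        if param.grad is not None:
            param -= learning_rate * param.grad  # the update optimizer.step() performs
            param.grad.zero_()                   # what optimizer.zero_grad() does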
