I am trying to update the weights once per epoch, but I am processing the data in batches. The problem is that, to normalize the loss, I need to record the batches on a GradientTape outside the training loop (so the accumulated loss can be tracked and normalized). But when I do this, the training time is huge.
I think it accumulates the operations from all batches into the graph and calculates the gradients at the end.
I have tried opening the tape both outside and inside the for loop, and the latter is much faster than the former. I am confused about why this happens, because either way my model's trainable variables and loss remain the same.
# Very Slow
loss_value = 0
batches = 0
with tf.GradientTape() as tape:
    for inputs, min_seq in zip(dataset, minutes_sequence):
        temp_loss_value = my_loss_function(inputs, min_seq)
        batches += 1
        loss_value = loss_value + temp_loss_value
# The following line takes a huge amount of time.
grads = tape.gradient(loss_value, model.trainable_variables)
# Very Fast
loss_value = 0
batches = 0
for inputs, min_seq in zip(dataset, minutes_sequence):
    with tf.GradientTape() as tape:
        temp_loss_value = my_loss_function(inputs, min_seq)
        batches += 1
        loss_value = loss_value + temp_loss_value
# If I do the following line, the graph breaks because these ops are outside the tape's scope.
loss_value = loss_value / batches
# The following line takes a huge amount of time.
grads = tape.gradient(loss_value, model.trainable_variables)
When I declare tf.GradientTape() inside the for loop it is very fast, but when I declare it outside it is slow.
P.S. This is for a custom loss, and the architecture contains just one hidden layer of size 10.
I want to know what difference the position of tf.GradientTape() makes, and how it should be used to update the weights once per epoch on a batched dataset.
The tape is used primarily to watch trainable tensor variables (record the previous and changing values of the variables) so that we can calculate the gradient for a training pass according to the loss function. It is an implementation of the Python context-manager construct, used here to record the state of the variables. An excellent resource on Python context managers is here.

If it is inside the loop, it records the variables (weights) for that forward pass only, so that we can calculate the gradient for all those variables in one shot (instead of stack-based gradient passing, as in a naive implementation without a library like TensorFlow). If it is outside the loop, it records the states across all the iterations of the loop, and, per the TensorFlow source code, it also flushes itself in TF 2.0, unlike TF 1.x where the model developer had to take care of flushing. In your example you do not have any writer set, but if one were set it would record to that too.

So for the code above it keeps recording (the Graph.add_to_collection method is used internally) all the weights, and as the iterations accumulate you should see a slowdown. The rate of slowdown is proportional to the size of the network (number of trainable variables) and to how many iterations have been recorded so far.
So placing it inside the loop is correct. Also, the gradients should be applied inside the for loop, not outside (i.e. at the same indent level as the with), otherwise you are only applying gradients at the end of your training loop (after the last iteration). I suspect your training may not work that well with the current placement of the gradient retrieval (after which the gradients are applied in your code, though you omitted that part in the snippet).
Here is one more good resource on GradientTape that I just found.
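For reference, here is a minimal sketch of the pattern described above (tape and gradient application both inside the batch loop). It reuses model, dataset, minutes_sequence and my_loss_function from the question; num_epochs and the choice of optimizer are assumptions:

import tensorflow as tf

optimizer = tf.keras.optimizers.Adam()  # assumption: any Keras optimizer works here

for epoch in range(num_epochs):  # num_epochs is assumed to be defined
    for inputs, min_seq in zip(dataset, minutes_sequence):
        with tf.GradientTape() as tape:
            # The tape records only this batch's forward pass.
            loss_value = my_loss_function(inputs, min_seq)
        # Gradients are computed and applied per batch, inside the loop.
        grads = tape.gradient(loss_value, model.trainable_variables)
        optimizer.apply_gradients(zip(grads, model.trainable_variables))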
Related
I searched all over the web but couldn't find it: how does Keras calculate the loss if we have multiple output values?
It depends on which loss function you are using.
Usually, for each batch during training, Keras will call the loss function to compute the loss and use it to perform a gradient descent step. Moreover, it will keep track of the total loss since the beginning of the epoch, and it will display the mean loss.
Regarding multiple outputs, the same process happens for each output; calculation-wise, you can pick any loss function mentioned in the documentation here and check the examples.
For SparseCategoricalCrossentropy, for example, you can check this document.
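As a rough illustration (a hypothetical model, not code from the question): with a two-output functional model you pass one loss per output, optionally with loss_weights, and Keras reports each loss separately while minimising their weighted sum:

import tensorflow as tf
from tensorflow import keras

inputs = keras.Input(shape=(16,))
hidden = keras.layers.Dense(32, activation="relu")(inputs)
price = keras.layers.Dense(1, name="price")(hidden)  # regression head
category = keras.layers.Dense(10, activation="softmax", name="category")(hidden)  # classification head

model = keras.Model(inputs, [price, category])
model.compile(
    optimizer="adam",
    loss={"price": "mse", "category": "sparse_categorical_crossentropy"},
    # total loss = 1.0 * mse + 0.5 * crossentropy
    loss_weights={"price": 1.0, "category": 0.5},
)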
I am using the TF2 research object detection API with the pre-trained EfficientDet D3 model from the TF2 model zoo. During training on my own dataset I notice that the total loss is jumping up and down - for example from 0.5 to 2.0 a few steps later, and then back to 0.75:
So all in all this training does not seem to be very stable. I thought the problem might be the learning rate, but as you can see in the charts above, I set the LR to decay during training; it goes down to a really small value of 1e-15, so I don't see how this can be the problem (at least in the second half of the training).
Also, when I smooth the curves in Tensorboard, as in the 2nd image above, one can see the total loss going down, so the direction is correct, even though it's still at quite a high value. I would be interested in why I can't achieve better results with my training set, but I guess that is another question. First, I would really like to understand why the total loss is going up and down so much throughout the training. Any ideas?
PS: The pipeline.config file for my training can be found here.
In your config it states that your batch size is 2. This is tiny and will cause a very volatile loss.
Try increasing your batch size substantially; try a value of 256 or 512. If you are constrained by memory, try increasing it via gradient accumulation.
Gradient accumulation is the process of synthesising a larger batch by combining the backwards passes from smaller mini-batches. You would run multiple backwards passes before updating the model's parameters.
Typically, a training loop would look like this (I'm using pytorch-like syntax for illustrative purposes):
for model_inputs, truths in iter_batches():
    predictions = model(model_inputs)
    loss = get_loss(predictions, truths)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
With gradient accumulation, you'll put several batches through and then update the model. This simulates a larger batch size without requiring the memory to actually put a large batch size through all at once:
accumulations = 10

for i, (model_inputs, truths) in enumerate(iter_batches()):
    predictions = model(model_inputs)
    loss = get_loss(predictions, truths)
    loss.backward()
    # Only update the parameters once every `accumulations` mini-batches.
    if (i + 1) % accumulations == 0:
        optimizer.step()
        optimizer.zero_grad()
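Since the question is about TensorFlow, here is a rough TF2 sketch of the same idea (an illustration only, not the Object Detection API's built-in mechanism; model, dataset, loss_fn and optimizer are assumed to exist):

import tensorflow as tf

accumulations = 10  # number of mini-batches combined into one "virtual" batch

accum_grads = [tf.zeros_like(v) for v in model.trainable_variables]

for i, (model_inputs, truths) in enumerate(dataset):
    with tf.GradientTape() as tape:
        predictions = model(model_inputs, training=True)
        # Scale so the accumulated gradient matches the mean over the large batch.
        loss = loss_fn(truths, predictions) / accumulations
    grads = tape.gradient(loss, model.trainable_variables)
    accum_grads = [a + g for a, g in zip(accum_grads, grads)]

    if (i + 1) % accumulations == 0:
        optimizer.apply_gradients(zip(accum_grads, model.trainable_variables))
        accum_grads = [tf.zeros_like(v) for v in model.trainable_variables]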
Reading
https://medium.com/huggingface/training-larger-batches-practical-tips-on-1-gpu-multi-gpu-distributed-setups-ec88c3e51255
How to accumulate gradients in tensorflow?
https://towardsdatascience.com/how-to-easily-use-gradient-accumulation-in-keras-models-fa02c0342b60
Understanding accumulated gradients in PyTorch
I'm working with an existing tensorflow model.
For one part of the network, I want to set a different learning rate than in the rest of the network. Let's say all_variables is made up of variables_1 and variables_2; then I want to change the learning rate only for the variables in variables_2.
The existing code for settings up optimizer, computing and applying gradients looks basically like this:
optimizer = tf.train.MomentumOptimizer(learning_rate, 0.9)
grads_and_vars = optimizer.compute_gradients(loss, all_variables)
grads_updates = optimizer.apply_gradients(grads_and_vars, global_step)
I already tried to create a second optimizer following this scheme. However, for debugging, I set both learning rates equal, and the decrease of the regularization loss was still very dissimilar to the single-optimizer run.
Isn't it possible to create a second optimizer, optimizer_new, and simply call apply_gradients on the respective grads_and_vars of variables_1 and variables_2? I.e., instead of having this line
grads_updates = optimizer.apply_gradients(grads_and_vars, global_step)
one could use
grads_updates = optimizer.apply_gradients(grads_and_vars['variables_1'], global_step)
grads_updates_new = optimizer_new.apply_gradients(grads_and_vars['variables_2'], global_step)
and finally, train_op = tf.group(grads_updates, grads_updates_new).
However, the regularization loss behavior is still present.
I came across the cause through a comment in this post. In my case, it doesn't make sense to supply global_step twice via the global_step argument of apply_gradients. Because the learning_rate, and therefore the optimizer, depends on global_step, passing it to both apply_gradients calls changes the training process, and especially the regularization-loss behaviour. Thanks to y.selivonchyk for pointing this out.
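A sketch of the resulting setup, assuming variables_1, variables_2, loss, learning_rate and global_step are defined as in the question (learning_rate_new is a hypothetical second learning rate). The key point is that global_step is passed to only one of the apply_gradients calls, so it is incremented exactly once per training step:

optimizer = tf.train.MomentumOptimizer(learning_rate, 0.9)
optimizer_new = tf.train.MomentumOptimizer(learning_rate_new, 0.9)

grads_and_vars_1 = optimizer.compute_gradients(loss, variables_1)
grads_and_vars_2 = optimizer_new.compute_gradients(loss, variables_2)

# global_step goes to only one apply_gradients call; passing it to both
# would increment it twice per step and distort the learning-rate schedule.
grads_updates = optimizer.apply_gradients(grads_and_vars_1, global_step=global_step)
grads_updates_new = optimizer_new.apply_gradients(grads_and_vars_2)

train_op = tf.group(grads_updates, grads_updates_new)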
When does tensorflow update weights and biases in the for loop?
Below is the code from TensorFlow's GitHub (mnist_softmax.py):
for _ in range(1000):
    batch_xs, batch_ys = mnist.train.next_batch(100)
    sess.run(train_step, feed_dict={x: batch_xs, y_: batch_ys})
When does tensorflow update weights and biases?
1. Does it update them when running sess.run()? If so, does that mean that, in this program, TF updates the weights and biases 1000 times?
2. Or does it update them after finishing the whole for loop?
3. If 2. is correct, my next question is: does TF update the model using different training data every time (since it uses next_batch(100))? There are 1000*100 training data points in total, but each data point is considered only once individually. Am I correct, or did I misunderstand something?
4. If 3. is correct, is it weird that after just one update step the model has been trained?
I think I must be misunderstanding something. It would be really great if anyone could give me a hint or point me to some material.
It updates weights every time you run the train_step.
Yes, it is updating the weights 1000 times in this program.
See above
Yes, you are correct, it loads a mini-batch containing 100 points at once and uses it to compute gradients.
It's not weird at all. You don't necessarily need to see the same data again and again; all that is required is that you have enough data for the network to converge. You can iterate multiple times over the same data if you want, but since this model doesn't have many parameters, it converges in a single epoch.
TensorFlow works by creating a graph of the computations that are required to compute the output of a network. Each basic operation, like matrix multiplication or addition, is a node in this computation graph. In the TensorFlow MNIST example you are following, lines 40-46 define the network architecture:
x: placeholder
y_: placeholder
W: Variable - This is learnt during training
b: Variable - This is also learnt during training
The network represents a simple linear model where the prediction is made using y = x*W + b (see line 43).
Next, you configure the training procedure for your network. This code uses cross-entropy as the loss function to minimize (see line 57). The minimization is done using the gradient descent algorithm (see line 59).
At this point, your network is fully constructed. Now you need to run these nodes so that actual computation is performed (no computation has been performed up to this point).
In the loop where sess.run(train_step, feed_dict={x: batch_xs, y_: batch_ys}) is executed, TF computes the value of train_step, which causes the GradientDescentOptimizer to take one step towards minimizing the cross_entropy, and this is how training progresses.
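For reference, the graph-construction part of mnist_softmax.py looks roughly like this (a paraphrased sketch, not an exact copy of the file):

import tensorflow as tf

# Placeholders: filled with a fresh mini-batch on every sess.run call.
x = tf.placeholder(tf.float32, [None, 784])
y_ = tf.placeholder(tf.float32, [None, 10])

# Variables: the parameters that each gradient-descent step updates.
W = tf.Variable(tf.zeros([784, 10]))
b = tf.Variable(tf.zeros([10]))

y = tf.matmul(x, W) + b  # the model's predictions (logits)

cross_entropy = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(labels=y_, logits=y))
train_step = tf.train.GradientDescentOptimizer(0.5).minimize(cross_entropy)
# Every sess.run(train_step, feed_dict=...) computes gradients on the fed
# batch and updates W and b once.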
One way to improve stability in deep Q-learning tasks is to maintain a set of target weights for the network that are updated slowly and are used for calculating Q-value targets. As a result, at different times in the learning procedure, two different sets of weights are used in the forward pass. For normal DQN this is not difficult to implement, as the weights are TensorFlow variables that can be overridden in a feed_dict, e.g.:
sess = tf.Session()
input = tf.placeholder(tf.float32, shape=[None, 5])
weights = tf.Variable(tf.random_normal(shape=[5, 4], stddev=0.1))
bias = tf.Variable(tf.constant(0.1, shape=[4]))
output = tf.matmul(input, weights) + bias
target = tf.placeholder(tf.float32, [None, 4])
loss = ...
...
#Here we explicitly set weights to be the slowly updated target weights
sess.run(output, feed_dict={input: states, weights: target_weights, bias: target_bias})
# Targets for the learning procedure are computed using this output.
....
#Now we run the learning procedure, using the most up to date weights,
#as well as the previously computed targets
sess.run(loss, feed_dict={input: states, target: targets})
I'd like to use this target network technique in a recurrent version of DQN, but I don't know how to access and set the weights used inside a recurrent cell. Specifically I'm using a tf.nn.rnn_cell.BasicLSTMCell, but I'd like to know how to do this for any type of recurrent cell.
The BasicLSTMCell does not expose its variables as part of its public API. I recommend that you look up the names these variables have in your graph and feed those names (the names are unlikely to change, since they are stored in checkpoints and changing them would break checkpoint compatibility).
Alternatively, you can make a copy of BasicLSTMCell which does expose the variables. This is the cleanest approach, I think.
You can use the following line to get the trainable variables in the graph:
variables = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES)
Then you can inspect these variables to see how they are changing.
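For example, here is a rough sketch of filtering the LSTM cell's variables by name and copying them into a slowly-updated target set. The scope names "online" and "target" are assumptions; inspect the printed names to find the scopes your cells actually live under:

import tensorflow as tf

all_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES)

# Print the names once to find the scope your LSTM cell was built under.
for v in all_vars:
    print(v.name)

# Assumption: the online and target cells were built under variable scopes
# named "online" and "target"; adjust the filters to your actual names.
online_vars = [v for v in all_vars if v.name.startswith("online/")]
target_vars = [v for v in all_vars if v.name.startswith("target/")]

# Ops that copy the online weights into the target weights.
copy_ops = [t.assign(o) for o, t in zip(online_vars, target_vars)]

sess.run(copy_ops)  # run periodically to refresh the target network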