Additional optimizer affects regularization loss - python

I'm working with an existing TensorFlow model.
For one part of the network, I want to set a different learning rate than in the rest of the network. Let's say all_variables is made up of variables_1 and variables_2; I want to change the learning rate only for the variables in variables_2.
The existing code for setting up the optimizer and for computing and applying the gradients looks basically like this:
optimizer = tf.train.MomentumOptimizer(learning_rate, 0.9)
grads_and_vars = optimizer.compute_gradients(loss, all_variables)
grads_updates = optimizer.apply_gradients(grads_and_vars, global_step)
I already tried to create a second optimizer following this scheme. However, for debugging I set both learning rates equal, and the decrease of the regularization loss was still very different from before.
Isn't it possible to create a second optimizer, optimizer_new, and simply call apply_gradients on the respective grads_and_vars of variables_1 and variables_2? I.e., instead of having this line
grads_updates = optimizer.apply_gradients(grads_and_vars, global_step)
one could use
grads_updates = optimizer.apply_gradients(grads_and_vars['variables_1'], global_step)
grads_updates_new = optimizer_new.apply_gradients(grads_and_vars['variables_2'], global_step)
and finally, train_op = tf.group(grads_updates, grads_updates_new).
However, the differing regularization loss behaviour is still present.

I came across the cause through a comment in this post. In my case, it doesn't make sense to supply "global_step" twice for the global_step argument of apply_gradients: doing so increments global_step twice per training step, and since the learning_rate (and therefore the optimizer) depends on global_step, the training process, especially the regularization loss behaviour, differs. Thanks to y.selivonchyk for pointing this out.
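For reference, a hedged sketch of what the fix looks like with two optimizers, where global_step is passed to only one apply_gradients call so that it is incremented once per training step (variable names follow the question; learning_rate_new is an assumed second learning rate):
optimizer = tf.train.MomentumOptimizer(learning_rate, 0.9)
optimizer_new = tf.train.MomentumOptimizer(learning_rate_new, 0.9)
grads_and_vars_1 = optimizer.compute_gradients(loss, variables_1)
grads_and_vars_2 = optimizer_new.compute_gradients(loss, variables_2)
# Only the first call increments global_step; the second leaves it untouched.
grads_updates = optimizer.apply_gradients(grads_and_vars_1, global_step=global_step)
grads_updates_new = optimizer_new.apply_gradients(grads_and_vars_2)
train_op = tf.group(grads_updates, grads_updates_new)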

Related

Effect of the position of tf.GradientTape() in model training time

I am trying to update the weights once per epoch, but I am processing the data in batches. The problem is that, to normalize the loss, I need to accumulate it under a tf.GradientTape() opened outside the training loop (so it can be tracked and normalized). But when I do this, the training time is huge.
I think it accumulates variables from all batches into the graph and calculates the gradients at the end.
I have tried tracking the variables both outside the for loop and inside it, and the latter is much faster than the former. I am confused about why this happens, because either way my model's trainable variables and the loss remain the same.
# Very Slow
loss_value = 0
batches = 0
with tf.GradientTape() as tape:
    for inputs, min_seq in zip(dataset, minutes_sequence):
        temp_loss_value = my_loss_function(inputs, min_seq)
        batches += 1
        loss_value = loss_value + temp_loss_value
# The following line takes huge time.
grads = tape.gradient(loss_value, model.trainable_variables)
# Very Fast
loss_value = 0
batches = 0
for inputs, min_seq in zip(dataset, minutes_sequence):
    with tf.GradientTape() as tape:
        temp_loss_value = my_loss_function(inputs, min_seq)
    batches += 1
    loss_value = loss_value + temp_loss_value
# If I do the following line, the graph will break because it is outside the tape's scope.
loss_value = loss_value / batches
# The following line takes huge time.
grads = tape.gradient(loss_value, model.trainable_variables)
When I declare tf.GradientTape() inside the for loop it is very fast, but outside it is slow.
P.S. - This is for a custom loss, and the architecture contains just one hidden layer of size 10.
I want to know what difference tf.GradientTape()'s position makes, and how it should be used for per-epoch weight updates on a batched dataset.
The tape is used primarily to watch trainable tensor variables (record the previous and changing values of the variables) so that we can calculate the gradient for a training pass according to the loss function. It is an implementation of the Python context-manager construct, used here to record the state of the variables. An excellent resource on Python context managers is here.
If the tape is inside the loop, it records the variables (weights) for that forward pass only, so that we can calculate the gradient for all those variables in one shot (instead of stack-based gradient passing as in a naive implementation without a library like TensorFlow). If it is outside the loop, it records the states across all the epochs, and, as per the TensorFlow source code, it also flushes in TF 2.0, unlike TF 1.x where the model developer had to take care of flushing. In your example you do not have any writer set, but if one were set it would record to it as well. So for the above code it keeps recording (the Graph.add_to_collection method is used internally) all the weights, and as the epochs increase you should see a slowdown. The rate of slowdown is proportional to the size of the network (trainable variables) and the current epoch number.
So placing it inside the loop is correct. Also, the gradients should be applied inside the for loop, not outside it (at the same indent level as the with), otherwise you are only applying gradients at the end of your training loop (after the last epoch). I suspect your training may not be that good with the current placement of the gradient retrieval (after which the gradients are applied in your code, though you omitted that in the snippet).
Here is one more good resource on GradientTape that I just found.
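To make the recommended placement concrete, here is a minimal sketch (not the asker's exact code) with the tape and the gradient application both inside the batch loop; my_loss_function, dataset, minutes_sequence and model come from the question, and the SGD optimizer is an illustrative assumption:
import tensorflow as tf

optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)  # assumed optimizer
for inputs, min_seq in zip(dataset, minutes_sequence):
    with tf.GradientTape() as tape:
        # Only this forward pass is recorded by the tape.
        loss_value = my_loss_function(inputs, min_seq)
    grads = tape.gradient(loss_value, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))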

pytorch predictions stability

This is my predict function. Is there anything wrong with it? The predictions are not stable; every time I run it on the same data, I get different predictions.
def predict(model, device, inputs, batch_size=1024):
    model = model.to(device)
    dataset = torch.utils.data.TensorDataset(*inputs)
    loader = torch.utils.data.DataLoader(
        dataset,
        batch_size=batch_size,
        pin_memory=False
    )
    predictions = []
    for i, batch in enumerate(loader):
        with torch.no_grad():
            pred = model(*(item.to(device) for item in batch))
            pred = pred.detach().cpu().numpy()
            predictions.append(pred)
    return np.concatenate(predictions)
As Usman Ali suggested, you need to set your model to eval mode by calling
model.eval()
before your prediction function.
What eval mode does:
Sets the module in evaluation mode.
This has any effect only on certain modules. See documentations of particular modules for details of their behaviors in training/evaluation mode, if they are affected, e.g. Dropout, BatchNorm, etc.
When you finish your prediction and wish to continue training, don't forget to reset your model to training mode by calling
model.train()
There are several layers in models that may introduce randomness into the forward pass of the net. One example is dropout layers: a dropout layer "drops" p percent of its neurons at random to increase the model's generalization.
Additionally, BatchNorm (and possibly other adaptive normalization layers) keeps track of the statistics of the data and therefore behaves differently in train mode versus eval mode.
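A minimal sketch of how the predict function from the question could be wrapped, assuming model, device and inputs are already defined:
model.eval()                              # disable dropout, use running BatchNorm statistics
preds = predict(model, device, inputs)
model.train()                             # restore training behaviour before further training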
You have defined the function, but you haven't trained the model. The model produces essentially random predictions before it is trained, which is why yours are inconsistent. If you set up an optimizer with a loss function and run over multiple epochs, the predictions will stabilize. This link may help: https://pytorch.org/tutorials/beginner/blitz/cifar10_tutorial.html. Look at sections 3 and 4.

Can I copy the optimizer (state) from one model to another? [keras, tensorflow]

Assuming I create model A, which has a similar but not exactly the same architecture as the compiled model B, can I compile model A as follows?
model_A.compile(model_B.optimizer,
                loss=model_B.loss,
                metrics=model_B.metrics,
                )
I am most worried that some values stored in the optimizer (e.g. updates, weights, ...) are specific to the model architecture and may yield a mismatch. Can somebody explain what exactly is happening when I perform such a copy? I couldn't extract helpful information from the source code (l37ff).
P.s.: Is the state of the optimizer also copied this way? If not, can you copy it somehow?
You can use the optimizer from one model with another. Most optimizers take the learning rate, momentum, decay, etc. as arguments. model.compile initialises the weights according to your arguments; the optimizer only takes care of how your loss is propagated after it has been calculated.
We change the optimizer only to make the model converge faster for the given data.
But you may not be able to use the same loss function for different models (model B could use MSE while model A has softmax as its last layer), and the same holds for the metrics.
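As a minimal sketch of what this suggests, you could reuse model_B's optimizer object while choosing a loss and metric that match model_A's own output layer (categorical_crossentropy and accuracy here are purely illustrative assumptions):
model_A.compile(optimizer=model_B.optimizer,
                loss='categorical_crossentropy',   # pick a loss that fits model_A's last layer
                metrics=['accuracy'])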

Changing optimizer or lr after loading model yields strange results

I'm using the latest Keras with Tensorflow backend (Python 3.6)
I'm loading a model that had a training accuracy at around 86% when I last trained it.
The original optimizer that I used was:
r_optimizer = Adam(lr=0.0001, decay=0.02)
model.compile(optimizer=r_optimizer,
              loss='categorical_crossentropy', metrics=['accuracy'])
If I load the model and continue training without recompiling, my
accuracy would stay around 86% (even after 10 or so more epochs).
So I wanted to try changing the learning rate or optimizer.
If I recompile the model and try to change the learning rate or the
optimizer as follows:
new_optimizer = Adam(lr=0.001, decay=0.02)
or to this one:
sgd = optimizers.SGD(lr=0.0001)
and then compile:
model.compile(optimizer=new_optimizer,
              loss='categorical_crossentropy', metrics=['accuracy'])
model.fit ....
The accuracy then resets to around 15%-20% instead of starting around 86%, and my loss is much higher.
Even if I use a small learning rate and recompile, I still start off from a very low accuracy.
From browsing the internet it seems some optimizers like ADAM or RMSPROP have
a problem with resetting weights after recompiling (can't find the link at the moment)
So I did some digging and tried to reset my optimizer without recompiling as follows:
model = load_model(load_path)
sgd = optimizers.SGD(lr=1.0)  # very high for testing
model.optimizer = sgd  # change optimizer

# fit for training
history = model.fit_generator(
    train_gen,
    steps_per_epoch=r_steps_per_epoch,
    epochs=r_epochs,
    validation_data=valid_gen,
    validation_steps=np.ceil(len(valid_gen.filenames) / r_batch_size),
    callbacks=callbacks,
    shuffle=True,
    verbose=1)
However, these changes don't seem to be reflected in my training.
Despite raising the lr significantly, I'm still floundering around 86% with the same loss. During each epoch, I'm seeing very little loss or accuracy movement. I would expect the loss to be a lot more volatile.
This leads me to believe that my change in optimizer and lr isn't being
realized by the model.
Any idea what I could be doing wrong?
I think your change does not assign the new lr to the optimizer. I found a solution for resetting the lr values after loading a model in Keras; I hope it helps you.
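For illustration, a common way to do this is to set the optimizer's learning-rate variable directly with the Keras backend instead of recompiling. This is a hedged sketch reusing load_path from the question; 0.001 is just an example value:
from keras import backend as K
from keras.models import load_model

model = load_model(load_path)
K.set_value(model.optimizer.lr, 0.001)   # overwrite the lr variable without recompiling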
This is a partial answer referring to what you wrote here:
From browsing the internet it seems some optimizers like ADAM or RMSPROP have a problem with resetting weights after recompiling (can't find the link at the moment)
Adaptive optimizers such as ADAM, RMSPROP, ADAGRAD, ADADELTA, and any variation of these rely on previous update steps to improve the direction and magnitude of any current adjustment to the weights of the model.
Because of this, the first few steps they take tend to be relatively "bad" as they "calibrate themselves" with information from previous steps.
When used on a random initialization this is not a problem, but when used on a pretrained model, these first few steps can degrade the model so much that almost all of the pretrained work gets lost.
Even worse, training then no longer starts from a carefully chosen random initialization such as Xavier initialization, but from some sub-optimal starting point, which could potentially prevent the model from converging to the local optimum it would have reached had it started from a good random initialization.
Unfortunately I'm not sure how you can avoid this... Perhaps: pretrain with one optimizer --> save the weights --> replace the optimizer --> restore the weights --> train for a few epochs and hope the new adaptive optimizer learns a "useful history" --> then restore the weights again from the saved weights of the pretrained model and, without recompiling, start training again, now with a better optimizer "history".
Please let us know if this works.
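A hedged sketch of that workaround, reusing the names from the question (model, train_gen, r_steps_per_epoch, r_epochs); the file name, the new optimizer and the two warm-up epochs are illustrative assumptions:
model.save_weights('pretrained_weights.h5')            # keep the pretrained weights safe
model.compile(optimizer=Adam(lr=0.0001),               # swap in the new adaptive optimizer (recompiles)
              loss='categorical_crossentropy',
              metrics=['accuracy'])
model.fit_generator(train_gen, steps_per_epoch=r_steps_per_epoch,
                    epochs=2)                          # let the optimizer accumulate some "history"
model.load_weights('pretrained_weights.h5')            # restore the pretrained weights
history = model.fit_generator(train_gen, steps_per_epoch=r_steps_per_epoch,
                              epochs=r_epochs)         # continue training without recompiling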

Modify learning rate in imported Tensorflow graph

I have created a graph with an AdamOptimizer, which I have then saved with tf.train.Saver().save(session, "model_name")
After training it for a while I am able to import the whole graph and the variables in a different session and resume training with
saver = tf.train.import_meta_graph("model_name")
saver.restore(session, "model_name")
What I would like to do is, after importing the graph+variables and before resuming the optimization, to change the learning_rate of the AdamOptimizer. Is that possible?
EDIT: One way of doing this would be to define the learning rate as a placeholder and feed a different value every time. But let's assume the graph has already been saved without doing this for the sake of argument.
I think you can replace learning_rate with a placeholder, i.e.
learning_rate = tf.placeholder(tf.float32, shape=(), name="learning_rate")
train_op = tf.train.AdamOptimizer(learning_rate=learning_rate).minimize(your_loss_tensor, name="train_op")
When you have restored your graph, get all the ops and tensors related to training, such as train_op and learning_rate, using
train_op = graph.get_operation_by_name("train_op")
learning_rate = graph.get_tensor_by_name("learning_rate:0")
and run the training op:
sess.run(train_op, feed_dict={learning_rate: whatever_you_want})
UPDATE:
see this if you want to change some input of your saved graph
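For completeness, here is a small, self-contained TF 1.x sketch of the whole pattern: build a graph whose learning rate is a placeholder, save it, then restore it and feed a different learning rate (the toy loss, the file path and the 0.0001 value are illustrative assumptions):
import tensorflow as tf

# --- build and save a graph whose learning rate is a placeholder ---
w = tf.Variable(3.0, name="w")
loss = tf.square(w)                                   # stand-in for your_loss_tensor
learning_rate = tf.placeholder(tf.float32, shape=(), name="learning_rate")
train_op = tf.train.AdamOptimizer(learning_rate=learning_rate).minimize(loss, name="train_op")
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    tf.train.Saver().save(sess, "./model_name")

# --- restore the graph and train with a new learning rate ---
tf.reset_default_graph()
with tf.Session() as sess:
    saver = tf.train.import_meta_graph("./model_name.meta")
    saver.restore(sess, "./model_name")
    graph = tf.get_default_graph()
    train_op = graph.get_operation_by_name("train_op")
    learning_rate = graph.get_tensor_by_name("learning_rate:0")
    sess.run(train_op, feed_dict={learning_rate: 0.0001})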
