Changing optimizer or lr after loading model yields strange results - python

I'm using the latest Keras with Tensorflow backend (Python 3.6)
I'm loading a model that had a training accuracy at around 86% when I last trained it.
The original optimizer that I used was:
r_optimizer = Adam(lr=0.0001, decay=0.02)
model.compile(optimizer=r_optimizer,
              loss='categorical_crossentropy', metrics=['accuracy'])
If I load the model and continue training without recompiling, my
accuracy would stay around 86% (even after 10 or so more epochs).
So I wanted to try changing the learning rate or optimizer.
If I recompile the model and try to change the learning rate or the
optimizer as follows:
new_optimizer = Adam(lr=0.001, decay=0.02)
or to this one:
sgd = optimizers.SGD(lr=0.0001)
and then compile:
model.compile(optimizer=new_optimizer,
              loss='categorical_crossentropy', metrics=['accuracy'])
model.fit ....
The accuracy would reset to around 15% - 20%, instead of starting around 86%,
and my loss would be much higher.
Even if I used a small learning rate, and recompiled, I would still start
off from a very low accuracy.
From browsing the internet, it seems some optimizers like ADAM or RMSPROP have a problem with resetting weights after recompiling (I can't find the link at the moment).
So I did some digging and tried to reset my optimizer without recompiling as follows:
model = load_model(load_path)
sgd = optimizers.SGD(lr=1.0)  # very high, just for testing
model.optimizer = sgd  # change optimizer without recompiling

# fit for training
history = model.fit_generator(
    train_gen,
    steps_per_epoch=r_steps_per_epoch,
    epochs=r_epochs,
    validation_data=valid_gen,
    validation_steps=np.ceil(len(valid_gen.filenames) / r_batch_size),
    callbacks=callbacks,
    shuffle=True,
    verbose=1)
However, these changes don't seem to be reflected in my training.
Despite raising the lr significantly, I'm still floundering around 86% with the same loss. During each epoch, I'm seeing very little loss or accuracy movement. I would expect the loss to be a lot more volatile.
This leads me to believe that my change in optimizer and lr isn't being
realized by the model.
Any idea what I could be doing wrong?

I think your change does not actually assign the new lr to the optimizer. I found a way to reset the lr value after loading a model in Keras; I hope it helps you.
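A minimal sketch of that idea (assuming the model was compiled before saving, so load_model restores a working optimizer; load_path, train_gen and the other names are the ones from the question): overwrite the optimizer's learning-rate variable in place instead of recompiling.
from keras import backend as K
from keras.models import load_model

model = load_model(load_path)

K.set_value(model.optimizer.lr, 0.001)   # overwrite the lr variable in place
print(K.get_value(model.optimizer.lr))   # confirm the new value before training

history = model.fit_generator(train_gen,
                              steps_per_epoch=r_steps_per_epoch,
                              epochs=r_epochs)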

This is a partial answer referring to what you wrote here:
From browsing the internet it seems some optimizers like ADAM or RMSPROP have a problem with resetting weights after recompiling (can't find the link at the moment)
Adaptive optimizers such as ADAM, RMSPROP, ADAGRAD, ADADELTA, and any variation on these rely on previous update steps to improve the direction and magnitude of any current adjustment to the weights of the model.
Because of this, the first few steps that they take tend to be relatively "bad", as they "calibrate themselves" with information from previous steps.
When used on a random initialization this is not a problem, but when used on a pretrained model these first few steps can degrade the model so much that almost all of the pretrained work gets lost.
Even worse, the training then no longer starts from a carefully chosen random initialization such as Xavier initialization, but from some sub-optimal starting point, which could potentially prevent the model from converging to the local optimum that it would have reached had it started from a good random initialization.
Unfortunately I'm not sure how you can avoid this... Perhaps: pretrain with one optimizer --> save the weights --> replace the optimizer --> restore the weights --> train for a few epochs and hope the new adaptive optimizer builds up a "useful history" --> then restore the weights again from the saved weights of the pretrained model and, without recompiling, start training again, now with a better optimizer "history" (see the sketch below).
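A rough, untested sketch of that procedure (the file name pretrained_weights.h5 and the 2 warm-up epochs are just illustrative choices; load_path and train_gen are from the question):
from keras.models import load_model
from keras.optimizers import Adam

model = load_model(load_path)                # the pretrained model from the question
model.save_weights('pretrained_weights.h5')  # keep a copy of the good weights

# recompile with the new optimizer (this resets the optimizer state)
model.compile(optimizer=Adam(lr=0.001),
              loss='categorical_crossentropy', metrics=['accuracy'])

# a few warm-up epochs so the new optimizer accumulates some history
model.fit_generator(train_gen, steps_per_epoch=r_steps_per_epoch, epochs=2)

# restore the good weights WITHOUT recompiling, then continue training
model.load_weights('pretrained_weights.h5')
model.fit_generator(train_gen, steps_per_epoch=r_steps_per_epoch, epochs=r_epochs)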
Please let us know if this works.

Related

Resume training with Adam optimizer in Keras

My question is quite straightforward but I can't find a definite answer online (so far).
I have saved the weights of a Keras model trained with an Adam optimizer after a defined number of epochs of training, using:
callback = tf.keras.callbacks.ModelCheckpoint(filepath=path, save_weights_only=True)
model.fit(X,y,callbacks=[callback])
When I resume training after closing my Jupyter notebook, can I simply use:
model.load_weights(path)
to continue training?
Since Adam is dependent on the epoch number (such as in the case of learning rate decay), I would like to know the easiest way to resume training in the same conditions as before.
Following ibarrond's answer, I have written a small custom callback.
import pickle
import tensorflow as tf

optim = tf.keras.optimizers.Adam()
model.compile(optimizer=optim, loss='categorical_crossentropy', metrics=['accuracy'])

weight_callback = tf.keras.callbacks.ModelCheckpoint(filepath=checkpoint_path,
                                                     save_weights_only=True,
                                                     verbose=1, save_best_only=False)

class optim_callback(tf.keras.callbacks.Callback):
    '''Custom callback to save the optimiser configuration at the end of each epoch.'''
    def on_epoch_end(self, epoch, logs=None):
        optim_state = optim.get_config()
        with open(optim_state_pkl, 'wb') as f_out:
            pickle.dump(optim_state, f_out)

model.fit(X, y, callbacks=[weight_callback, optim_callback()])
When I resume training:
model.load_weights(checkpoint_path)
with open(optim_state_pkl, 'rb') as f_in:
    optim_state = pickle.load(f_in)
optim = tf.keras.optimizers.Adam.from_config(optim_state)
I would just like to check if this is correct. Many thanks again!!
Addendum: On further reading of the default Keras implementation of Adam and the original Adam paper, I believe that the default Adam is not dependent on epoch number but only on the iteration number. Therefore, this is unnecessary. However, the code may still be useful for anyone who wishes to keep track of other optimisers.
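As a quick check of that point, the optimizer's step counter can be inspected directly (a small sketch, assuming a tf.keras model that has already been fitted for at least one batch):
import tensorflow as tf
# iterations counts update steps (batches), not epochs
print(tf.keras.backend.get_value(model.optimizer.iterations))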
In order to perfectly capture the status of your optimizer, you should store its configuration using the function get_config(). This function returns a dictionary (containing the options) that can be serialized and stored in a file using pickle.
To restart the process, just load the dictionary back, e.g. with open('my_saved_tfconf.txt', 'rb') as f: d = pickle.load(f), to retrieve the dictionary with the configuration, and then generate your Adam optimizer using the function from_config(d) of the Keras Adam optimizer.
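Put together, a minimal sketch of that save/restore round trip (the file name adam_config.pkl is just illustrative; note that get_config() captures only the hyperparameters, not the optimizer's internal moment estimates):
import pickle
import tensorflow as tf

# after (or during) training: save the optimizer configuration
config = model.optimizer.get_config()
with open('adam_config.pkl', 'wb') as f:
    pickle.dump(config, f)

# when resuming: rebuild an optimizer with the same hyperparameters
with open('adam_config.pkl', 'rb') as f:
    config = pickle.load(f)
optim = tf.keras.optimizers.Adam.from_config(config)
model.compile(optimizer=optim, loss='categorical_crossentropy', metrics=['accuracy'])
model.load_weights(checkpoint_path)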

How can I get weights converged in a way that MSE minimizes?

here is my code
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error
from keras import backend as K
from keras.models import Sequential
from keras.layers import LSTM, Dense, Dropout

for _ in range(5):
    K.clear_session()
    model = Sequential()
    model.add(LSTM(256, input_shape=(None, 1)))
    model.add(Dropout(0.2))
    model.add(Dense(256))
    model.add(Dropout(0.2))
    model.add(Dense(1))
    model.compile(loss='mean_squared_error', optimizer='rmsprop', metrics=['accuracy'])
    hist = model.fit(x_train, y_train, epochs=20, batch_size=64, verbose=0,
                     validation_data=(x_val, y_val))

    p = model.predict(x_test)
    print(mean_squared_error(y_test, p))
    plt.plot(y_test)
    plt.plot(p)
    plt.legend(['testY', 'p'], loc='upper right')
    plt.show()
Total params : 330,241
samples : 2264
and below is the result.
I haven't changed anything between runs; I only re-ran the for loop.
As you can see in the picture, the resulting MSE is huge and varies a lot between runs, even though all I did was re-run the loop.
I think the fundamental reason for this problem is that the optimizer cannot find the global minimum; it finds a local minimum and converges there. I think so because, looking at all the loss curves, the loss stops decreasing significantly after about 20 epochs. So in order to solve this problem, I have to find the global minimum. How should I do this?
I tried adjusting the batch_size and the number of epochs. I also tried changing the hidden layer size, the LSTM units, adding a kernel_initializer, changing the optimizer, etc., but could not get any meaningful result.
I wonder how can I solve this problem.
Your valuable opinions and thoughts will be very much appreciated.
if you want to see full source here is link https://gist.github.com/Lay4U/e1fc7d036356575f4d0799cdcebed90e
From your example, the problem simply comes from the fact that you have over 100 times more parameters than you have samples. If you reduce the size of your model, you will see less variance.
The wider question you are asking is actually very interesting and usually isn't covered in tutorials. Nearly all machine learning models are stochastic by nature: the output predictions will change slightly every time you retrain, which means you will always have to ask the question: which model do I deploy to production?
Off the top of my head there are two things you can do:
Choose the first model trained on all the data (after cross-validation, ...)
Build an ensemble of models that all have the same hyper-parameters and implement a simple voting strategy (sketched below)
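For the second option, a minimal sketch of what such an ensemble could look like, re-using the imports and data from the code above (a hypothetical build_model() helper, and simple prediction averaging as the regression analogue of voting):
def build_model():
    # same architecture and hyper-parameters as in the question
    model = Sequential()
    model.add(LSTM(256, input_shape=(None, 1)))
    model.add(Dropout(0.2))
    model.add(Dense(256))
    model.add(Dropout(0.2))
    model.add(Dense(1))
    model.compile(loss='mean_squared_error', optimizer='rmsprop')
    return model

# train several models that differ only in their random initialisation
ensemble = []
for _ in range(5):
    m = build_model()
    m.fit(x_train, y_train, epochs=20, batch_size=64, verbose=0)
    ensemble.append(m)

# combine by averaging the predictions
p = np.mean([m.predict(x_test) for m in ensemble], axis=0)
print(mean_squared_error(y_test, p))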
References:
https://machinelearningmastery.com/train-final-machine-learning-model/
https://machinelearningmastery.com/randomness-in-machine-learning/
If you want to always start from the same point, you should set the random seeds. You can do it like this if you use the TensorFlow backend in Keras:
from numpy.random import seed
seed(1)
from tensorflow import set_random_seed
set_random_seed(2)
If you want to learn why you get different results in ML/DL models, I recommend this article.

Model learns with SGD but not Adam

I was going through a basic PyTorch MNIST example here and noticed that when I changed the optimizer from SGD to Adam the model did not converge. Specifically, I changed line 106 from
optimizer = optim.SGD(model.parameters(), lr=args.lr, momentum=args.momentum)
to
optimizer = optim.Adam(model.parameters(), lr=args.lr)
I thought this would have no effect on the model. With SGD the loss quickly dropped to low values after about a quarter of an epoch. However, with Adam the loss did not drop at all, even after 10 epochs. I'm curious as to why this is happening; it seems to me these should have nearly identical performance.
I ran this on Win10/Py3.6/PyTorch1.01/CUDA9
And to save you a tiny bit of code digging, here are the hyperparams:
lr=0.01
momentum=0.5
batch_size=64
Adam is famous for working out of the box with its default parameters, which, in almost all frameworks, include a learning rate of 0.001 (see the default values in Keras, PyTorch, and TensorFlow); this is indeed the value suggested in the Adam paper. Here you are using lr=0.01, i.e. ten times the default, which is often too high for Adam.
So, I would suggest changing to
optimizer = optim.Adam(model.parameters(), lr=0.001)
or simply
optimizer = optim.Adam(model.parameters())
in order to leave lr at its default value (although I would say I am surprised, as MNIST is famous nowadays for working with practically whatever you may throw at it).

should model.compile() be run prior to using model.load_weights(), if model has been only slightly changed say dropout?

I trained and validated on a dataset for nearly 24 epochs, in intervals of 8 epochs at a time, saving the weights after each interval.
I observed a steadily declining train and test loss for the first 16 epochs, after which the training loss keeps falling while the test loss rises, so I think it's a case of overfitting.
To address this, I tried to resume training from the weights saved after 16 epochs with a change in hyperparameters, say, increasing the dropout_rate a little.
Therefore I re-ran the dense and transition blocks with the new dropout to get an identical architecture with the same layer sequence and learnable parameter count.
Now, when I assign the previous weights to my new model (with the new dropout) using model.load_weights() and compile thereafter,
I see that the training loss is even higher. That is expected initially (given the increased inactivity of random nodes during training), but it keeps performing quite unsatisfactorily later as well,
so I suspect that maybe compiling after loading the pretrained weights might have ruined the performance?
What is the reasoning and recommended sequence of model.load_weights() and model.compile()? I'd really appreciate any insights on the above case.
The model.compile() method does not touch the weights in any way.
Its purpose is to create a symbolic function adding the loss and the optimizer to the model's existing function.
You can compile the model as many times as you want, whenever you want, and your weights will be kept intact.
Possible consequences of compile
If you have a model that has been well trained for some epochs, its optimizer (depending on what type and parameters you chose for it) will also have been trained for those epochs.
Compiling will make you lose the trained optimizer, and your first training batches might experience some bad results due to learning rates not suited to the current state of the model.
Other than that, compiling doesn't cause any harm.
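For reference, a minimal sketch of the usual order when fine-tuning with a modified dropout rate (build_new_model(), the weights file name, the optimizer/loss strings and the data variables are all hypothetical placeholders for the setup described in the question):
# rebuild the architecture with the new dropout rate (hypothetical helper)
model = build_new_model(dropout_rate=0.3)

# load the pretrained weights; this only copies weight values, nothing else
model.load_weights('weights_after_16_epochs.h5')

# compile before or after load_weights; the weights are untouched either way,
# but the optimizer state starts fresh here, so a lower learning rate can help
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

model.fit(X_train, y_train, epochs=8)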

Resume training with multi_gpu_model in Keras

I'm training a modified InceptionV3 model with the multi_gpu_model in Keras, and I use model.save to save the whole model.
Then I closed and restarted the IDE and used load_model to reinstantiate the model.
The problem is that I am not able to resume the training exactly where I left off.
Here is the code:
parallel_model = multi_gpu_model(model, gpus=2)
parallel_model.compile(optimizer='rmsprop', loss='categorical_crossentropy')
history = parallel_model.fit_generator(generate_batches(path), steps_per_epoch = num_images/batch_size, epochs = num_epochs)
model.save('my_model.h5')
Before the IDE closed, the loss is around 0.8.
After restarting the IDE, reloading the model and re-running the above code, the loss became 1.5.
But, according to the Keras FAQ, model.save should save the whole model (architecture + weights + optimizer state), and load_model should return a compiled model that is identical to the previous one.
So I don't understand why the loss becomes larger after resuming the training.
EDIT: If I don't use the multi_gpu_model and just use the ordinary model, I'm able to resume exactly where I left off.
When you call multi_gpu_model(...), Keras automatically sets the weights of your model to some default values (at least in the version 2.2.0 which I am currently using). That's why you were not able to resume the training at the same point as it was when you saved it.
I just solved the issue by replacing the weights of the parallel model with the weights from the sequential model:
parallel_model = multi_gpu_model(model, gpus=2)
parallel_model.layers[-2].set_weights(model.get_weights()) # you can check the index of the sequential model with parallel_model.summary()
parallel_model.compile(optimizer='rmsprop', loss='categorical_crossentropy')
history = parallel_model.fit_generator(generate_batches(path), steps_per_epoch = num_images/batch_size, epochs = num_epochs)
I hope this will help you.
@saul19am When you compile it, you only load the weights and the model structure, but you still lose the optimizer state. I think this can help.
