I was going through a basic PyTorch MNIST example here and noticed that when I changed the optimizer from SGD to Adam the model did not converge. Specifically, I changed line 106 from
optimizer = optim.SGD(model.parameters(), lr=args.lr, momentum=args.momentum)
to
optimizer = optim.Adam(model.parameters(), lr=args.lr)
I thought this would have no effect on the model. With SGD, the loss quickly dropped to low values after about a quarter of an epoch. However, with Adam the loss did not drop at all, even after 10 epochs. I'm curious as to why this is happening; it seems to me these should have nearly identical performance.
I ran this on Win10/Py3.6/PyTorch1.01/CUDA9
And to save you a tiny bit of code digging, here are the hyperparams:
lr=0.01
momentum=0.5
batch_size=64
Adam is famous for working out of the box with its default parameters, which, in almost all frameworks, include a learning rate of 0.001 (see the default values in Keras, PyTorch, and TensorFlow), which is indeed the value suggested in the Adam paper.
So, I would suggest changing to
optimizer = optim.Adam(model.parameters(), lr=0.001)
or simply
optimizer = optim.Adam(model.parameters())
in order to leave lr at its default value (although I must say I am surprised, as MNIST is famous nowadays for working with practically whatever you throw at it).
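If you want to double-check what those defaults are on your own installation, here is a quick sketch (assuming model is already built as in the MNIST example):
import torch.optim as optim

# Inspect the defaults PyTorch fills in when lr is omitted
optimizer = optim.Adam(model.parameters())
print(optimizer.defaults)  # should show something like {'lr': 0.001, 'betas': (0.9, 0.999), ...}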
Related
I'm implementing a deep architecture using TensorFlow Keras. At first, I compiled the model without defining a learning rate, for example:
model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])
I'm wondering what the default learning rate is and how TensorFlow Keras sets it. Second, which is preferable: the default learning rate or a custom (user-specified) learning rate?
Then I switched to a custom learning rate. However, I've observed two different ways of assigning a value to the learning rate: one uses lr and the other uses learning_rate.
First way to set the learning rate:
optimizer = Adam(learning_rate=0.001)
model.compile(loss="categorical_crossentropy", optimizer=optimizer, metrics=["accuracy"])
Second way to set the learning rate:
optimizer = Adam(lr=0.001)
model.compile(loss="categorical_crossentropy", optimizer=optimizer, metrics=["accuracy"])
What is the difference between learning_rate and lr?
The lr argument is deprecated, but they do essentially the same thing. You can check this here. Credit - #Frightera.
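A quick way to convince yourself of this (a small sketch; on very recent Keras versions the lr spelling may be removed entirely, so the exact warning behaviour depends on your TF version):
from tensorflow.keras.optimizers import Adam

# Both spellings configure the same hyperparameter; lr is just the deprecated alias
opt_a = Adam(learning_rate=0.001)
opt_b = Adam(lr=0.001)   # emits a deprecation warning on recent TF versions
print(float(opt_a.learning_rate), float(opt_b.learning_rate))  # both print 0.001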
I was building a dense neural network for predicting poker hands. First I had a problem with reproducibility, but then I discovered my real problem: I cannot reproduce my results because of the Adam optimizer, because with SGD it worked.
This means
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
did NOT work, whereas
opti = tf.train.AdamOptimizer()
model.compile(loss='sparse_categorical_crossentropy', optimizer=opti, metrics=['accuracy'])
worked with reproducibility.
So my question is now:
Is there any difference using
tf.train.AdamOptimizer
and
model.compile(..., optimizer = 'adam')
because I would like to use the first one because of the reproducibility problem.
They are both the same. However, with tf.train.AdamOptimizer you can change the learning rate:
tf.compat.v1.train.AdamOptimizer(
    learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-08,
    use_locking=False, name='Adam')
which can improve learning performance, though training may take longer. With model.compile(optimizer="adam"), on the other hand, the learning rate, beta1, beta2, etc. are left at their default settings.
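If you want the same configurability without leaving the Keras optimizer API, you can pass a configured Keras Adam object to compile instead of the 'adam' string. A sketch (the values shown are just the usual defaults, and the argument names may vary slightly by Keras/TF version):
from tensorflow import keras

# The same hyperparameters optimizer='adam' would use, spelled out so they can be changed
opti = keras.optimizers.Adam(lr=0.001, beta_1=0.9, beta_2=0.999)
model.compile(loss='sparse_categorical_crossentropy', optimizer=opti, metrics=['accuracy'])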
My question is quite straightforward but I can't find a definite answer online (so far).
I have saved the weights of a keras model trained with an adam optimizer after a defined number of epochs of training using:
callback = tf.keras.callbacks.ModelCheckpoint(filepath=path, save_weights_only=True)
model.fit(X,y,callbacks=[callback])
When I resume training after closing my Jupyter notebook, can I simply use:
model.load_weights(path)
to continue training?
Since Adam is dependent on the epoch number (such as in the case of learning rate decay), I would like to know the easiest way to resume training under the same conditions as before.
Following ibarrond's answer, I have written a small custom callback.
import pickle

optim = tf.keras.optimizers.Adam()
model.compile(optimizer=optim, loss='categorical_crossentropy', metrics=['accuracy'])
weight_callback = tf.keras.callbacks.ModelCheckpoint(filepath=checkpoint_path, save_weights_only=True, verbose=1, save_best_only=False)

class optim_callback(tf.keras.callbacks.Callback):
    '''Custom callback to save optimiser state'''
    def on_epoch_end(self, epoch, logs=None):
        # Serialise the optimiser's configuration at the end of every epoch
        optim_state = tf.keras.optimizers.Adam.get_config(optim)
        with open(optim_state_pkl, 'wb') as f_out:
            pickle.dump(optim_state, f_out)

model.fit(X, y, callbacks=[weight_callback, optim_callback()])
When I resume training:
model.load_weights(checkpoint_path)
with open(optim_state_pkl, 'rb') as f_out:
    optim_state = pickle.load(f_out)
tf.keras.optimizers.Adam.from_config(optim_state)
I would just like to check if this is correct. Many thanks again!!
Addendum: On further reading of the default Keras implementation of Adam and the original Adam paper, I believe that the default Adam is not dependent on epoch number but only on the iteration number. Therefore, this is unnecessary. However, the code may still be useful for anyone who wishes to keep track of other optimisers.
In order to perfectly capture the status of your optimizer, you should store its configuration using the function get_config(). This function returns a dictionary (containing the options) that can be serialized and stored in a file using pickle.
To restart the process, just run d = pickle.load(open('my_saved_tfconf.txt', 'rb')) to retrieve the dictionary with the configuration and then generate your Adam optimizer using the function from_config(d) of the Keras Adam optimizer.
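A minimal sketch of that save/restore flow (the file name is just an example; only the optimizer's configuration dictionary is stored):
import pickle
import tensorflow as tf

# Save: serialise the optimizer's configuration dictionary
config = model.optimizer.get_config()
with open('my_saved_tfconf.txt', 'wb') as f:
    pickle.dump(config, f)

# Restore: rebuild an Adam optimizer from the stored configuration and recompile
with open('my_saved_tfconf.txt', 'rb') as f:
    config = pickle.load(f)
optimizer = tf.keras.optimizers.Adam.from_config(config)
model.compile(optimizer=optimizer, loss='categorical_crossentropy', metrics=['accuracy'])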
I'm a newbie in Neural Networks. I'm doing my university NN project in Keras. I assembled and trained a one-layer sequential model using the SGD optimizer:
[...]
nn_model = Sequential()
nn_model.add(Dense(32, input_dim=X_train.shape[1], activation='tanh'))
nn_model.add(Dense(1, activation='tanh'))
sgd = keras.optimizers.SGD(lr=0.001, momentum=0.25)
nn_model.compile(loss='mean_squared_error', optimizer=sgd, metrics=['accuracy'])
history = nn_model.fit(X_train, Y_train, epochs=2000, verbose=2, validation_data=(X_test, Y_test))
[...]
I've tried various learning rates, momentum values, and numbers of neurons and got satisfying accuracy and error results. But I have to know how Keras works. So could you please explain to me how exactly fitting in Keras works, because I can't find it in the Keras documentation?
How does Keras update weights? Does it use a backpropagation algorithm? (I'm 95% sure.)
How is the SGD algorithm implemented in Keras? Is it similar to the Wikipedia explanation?
How exactly does Keras calculate a gradient?
Thank you kindly for any information.
Let's try to break it down and I'll cover only Keras specific bits:
How does Keras update weights? Using an Optimizer, which is the base class for the different optimizers. Each optimizer calculates the new weights in a get_updates function, which returns a list of update operations that apply the updates when run.
Back-propagation? Yes, but Keras doesn't implement it directly; it leaves it to the backend tensor libraries to perform automatic differentiation. For example, K.gradients calls tf.gradients in the TensorFlow backend.
SGD algorithm? It is implemented as described on Wikipedia in the SGD class, with basic extensions such as momentum. You can follow the code easily and see how it calculates the updates.
How gradient is calculated? Using back-propagation.
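To make the momentum variant concrete, here is a rough NumPy sketch of the per-parameter update the classic SGD-with-momentum rule performs (plain Python, just for illustration, using the lr and momentum values from the question):
import numpy as np

def sgd_momentum_step(w, grad, velocity, lr=0.001, momentum=0.25):
    """One SGD-with-momentum update (classic formulation, not Nesterov)."""
    velocity = momentum * velocity - lr * grad   # decaying history of past gradients
    w = w + velocity                             # move the weights along that history
    return w, velocity

# Example: a single scalar weight
w, v = np.array(0.5), np.array(0.0)
w, v = sgd_momentum_step(w, grad=np.array(0.2), velocity=v)
print(w, v)  # the first step moves w by -lr * grad, since the velocity starts at 0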
I'm using the latest Keras with the TensorFlow backend (Python 3.6).
I'm loading a model that had a training accuracy at around 86% when I last trained it.
The original optimizer that I used was:
r_optimizer = Adam(lr=0.0001, decay=.02)
model.compile(optimizer=r_optimizer,
              loss='categorical_crossentropy', metrics=['accuracy'])
If I load the model and continue training without recompiling, my
accuracy would stay around 86% (even after 10 or so more epochs).
So I wanted to try changing the learning rate or optimizer.
If I recompile the model and try to change the learning rate or the
optimizer as follows:
new_optimizer = Adam(lr=0.001, decay=.02)
or to this one:
sgd = optimizers.SGD(lr= .0001)
and then compile:
model.compile(optimizer=new_optimizer,
              loss='categorical_crossentropy', metrics=['accuracy'])
model.fit ....
The accuracy would reset to around 15% - 20%, instead of starting around 86%,
and my loss would be much higher.
Even if I used a small learning rate, and recompiled, I would still start
off from a very low accuracy.
From browsing the internet it seems some optimizers like ADAM or RMSPROP have
a problem with resetting weights after recompiling (can't find the link at the moment)
So I did some digging and tried to reset my optimizer without recompiling as follows:
model = load_model(load_path)
sgd = optimizers.SGD(lr=1.0) # very high for testing
model.optimizer = sgd #change optimizer
#fit for training
history = model.fit_generator(
train_gen,
steps_per_epoch = r_steps_per_epoch,
epochs = r_epochs,
validation_data=valid_gen,
validation_steps= np.ceil(len(valid_gen.filenames)/r_batch_size),
callbacks = callbacks,
shuffle= True,
verbose = 1)
However, these changes don't seem to be reflected in my training.
Despite raising the lr significantly, I'm still floundering around 86% with the same loss. During each epoch, I'm seeing very little loss or accuracy movement. I would expect the loss to be a lot more volatile.
This leads me to believe that my change in optimizer and lr isn't being
realized by the model.
Any idea what I could be doing wrong?
I think your change does not assign the new lr to the optimizer. I found a solution to reset lr values after loading a model in Keras; I hope it will help you.
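For reference, one common way to do that (a sketch, assuming the standalone Keras API used in the question) is to set the optimizer's learning-rate variable in place via the backend, rather than recompiling:
from keras import backend as K
from keras.models import load_model

model = load_model(load_path)
# Change the learning rate of the optimizer restored with the model,
# in place, instead of recompiling
K.set_value(model.optimizer.lr, 0.0001)
print(K.get_value(model.optimizer.lr))  # confirm the new value took effect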
This is a partial answer referring to what you wrote here:
From browsing the internet it seems some optimizers like ADAM or RMSPROP have a problem with resetting weights after recompiling (can't find the link at the moment)
Adaptive optimizers such as ADAM, RMSPROP, ADAGRAD, ADADELTA, and any variation of these rely on previous update steps to improve the direction and magnitude of any current adjustment to the weights of the model.
Because of this, the first few steps that they take tend to be relatively "bad" as they "calibrate themselves" with information from previous steps.
When used on a random initialization, this is not a problem, but when used on a pretrained model, these first few steps can degrade the model so much that almost all of the pretrained work gets lost.
Even worse, now the training doesn't start from a carefully chosen random initialization such as Xavier initialization, but from some sub-optimal starting point, which could potentially prevent the model from converging to the local optimum it would have reached if it had started from a good random initialization.
Unfortunately I'm not sure how you can avoid this... Perhaps pretrain with one optimizer --> save weights --> replace optimizer --> restore weights --> train for a few epochs and hope the new adaptive optimizer learns a "useful history" --> then restore the weights again from the saved weights of the pretrained model and, without recompiling, start training again, now with a better optimizer "history".
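In code, the sequence I have in mind would look roughly like this (file name, data, and epoch counts are placeholders; adapt it to your fit_generator setup):
from keras.optimizers import Adam

# 1. Pretrain with the original optimizer, then save the weights
model.save_weights('pretrained_weights.h5')

# 2. Recompile with the new optimizer, then restore the pretrained weights
model.compile(optimizer=Adam(lr=0.001, decay=.02),
              loss='categorical_crossentropy', metrics=['accuracy'])
model.load_weights('pretrained_weights.h5')

# 3. Train a few epochs so the adaptive optimizer accumulates a "history"
model.fit(X, y, epochs=3)

# 4. Restore the pretrained weights once more and, without recompiling, keep training
model.load_weights('pretrained_weights.h5')
model.fit(X, y, epochs=50)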
Please let us know if this works.