Optimizer doesn't work when resuming training - python

I defined a scalar _log_alpha and an optimizer to optimize it.
self._log_alpha = torch.log(torch.ones(1) * alpha).to(get_device()).requires_grad_(True)
self._log_alpha_optimizer = optim.Adam([self._log_alpha], lr=lr)
If I start training from the very beginning, it works just fine. The _log_alpha changes a little bit every time I call
However, if I train several steps, and save the optimizer's state_dict and _log_alpha
ckpt = {'log_alpha_optimizer_state_dict': self._log_alpha_optimizer.state_dict(),
'log_alpha': self._log_alpha}
torch.save(ckpt, save_dir)
and then load them to resume training
ckpt = torch.load(load_dir, map_location=torch.device(get_device()))
self._log_alpha = ckpt['log_alpha']
the _log_alpha won't change anymore.
I also defined some nn.Modules, whose optimizers still work after saving and loading. I wonder which part of the _log_alpha_optimizer is wrong?
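A likely explanation, assuming the optimizer is built before the checkpoint is loaded (the loading code shown above is partial): optim.Adam([self._log_alpha]) keeps a reference to the original tensor object, while self._log_alpha = ckpt['log_alpha'] rebinds the attribute to a different tensor that the optimizer never sees. The sketch below copies the saved value into the existing tensor in place and then restores the optimizer state. It runs on CPU, and make_log_alpha, step and the toy loss are hypothetical names made up for illustration.

```python
import io

import torch
from torch import optim


def make_log_alpha(alpha=0.2, lr=1e-2):
    """Create the trainable scalar and the optimizer that tracks it."""
    log_alpha = torch.log(torch.ones(1) * alpha).requires_grad_(True)
    optimizer = optim.Adam([log_alpha], lr=lr)
    return log_alpha, optimizer


def step(log_alpha, optimizer):
    """One optimization step on a toy loss (illustrative only)."""
    loss = (log_alpha.exp() - 1.0).pow(2).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()


# First run: train one step, then save the value and the optimizer state.
log_alpha, optimizer = make_log_alpha()
step(log_alpha, optimizer)
buffer = io.BytesIO()  # stands in for a checkpoint file on disk
torch.save(
    {
        "log_alpha": log_alpha.detach().clone(),
        "log_alpha_optimizer_state_dict": optimizer.state_dict(),
    },
    buffer,
)

# Resumed run: build the tensor and optimizer first, then load INTO them.
buffer.seek(0)
ckpt = torch.load(buffer, map_location="cpu")
resumed_log_alpha, resumed_optimizer = make_log_alpha()
with torch.no_grad():
    # In-place copy keeps the tensor object the optimizer already references.
    resumed_log_alpha.copy_(ckpt["log_alpha"])
resumed_optimizer.load_state_dict(ckpt["log_alpha_optimizer_state_dict"])

before = resumed_log_alpha.item()
step(resumed_log_alpha, resumed_optimizer)
after = resumed_log_alpha.item()
print(before, after)  # the two values differ, so the optimizer is updating the tensor
```

This would also explain why the nn.Module optimizers keep working: load_state_dict on a module copies values into the existing parameters instead of replacing the parameter objects.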


Hugging Face not able to reload all weights after training

I have recently been using a RobertaLarge model, on which I perform downstream training using the "Trainer" package.
All goes well: I see the loss going down, and I manually compared some results against the validation dataset.
The problem arises when I try to save the model and reload it afterwards.
I keep seeing the warning when trying to reload the model:
Some weights of the model checkpoint at Roberta_trained_1epoch were not used when initializing RobertaPreTrainedModel: ['module.roberta.encoder.layer.10.output.dense.bias', [........................................340_LAYERS_..................................]
'module.roberta.encoder.layer.6.attention.self.key.bias', 'module.roberta.encoder.layer.22.output.dense.weight', 'module.roberta.encoder.layer.3.attention.self.key.bias', 'module.roberta.encoder.layer.15.attention.self.value.bias', 'module.roberta.encoder.layer.15.attention.self.query.bias', 'module.roberta.encoder.layer.2.attention.self.value.bias']
I searched extensively for the cause of this problem and so far couldn't find a solution. Some claim this is just a warning and nothing is wrong; being suspicious, however, I did some manual checks, and the model does indeed seem untrained.
I'm calling Trainer.save_model('save_here') after training, and RobertaForTokenClassification.from_pretrained('save_here', local_files_only=True) to reload the model.
However, the results show that the model is clearly not loading correctly.
training code:
trainer = Trainer(
this results in evaluation loss of: 0.002
Reloading and re-evaluation:
model = RobertaForTokenClassification.from_pretrained('save_here', local_files_only=True)
tokenizer = AutoTokenizer.from_pretrained('tokenizers_saved')
dl_valid = DataLoader(ds_valid, batch_size=Config.batch_size, shuffle=True)
with torch.no_grad():
    for index, data in enumerate(dl_valid):
        batch_input_ids = data['input_ids'].to(device, dtype=torch.long)
        batch_att_mask = data['attention_mask'].to(device, dtype=torch.long)
        batch_target = data['label_ids'].to(device, dtype=torch.long)
        output = model(batch_input_ids, token_type_ids=None, attention_mask=batch_att_mask, labels=batch_target)
        step_loss, eval_prediction = output['loss'], output['logits']
        eval_prediction = np.argmax(eval_prediction.detach().to('cpu').numpy(), axis=2)
        eval_loss += step_loss
This results in loss: 1.2 - 0.9 (randomly after loading)
I found out what was wrong.
I'll share it here, since others may have the same issue.
My problem was that I had wrapped my model in DataParallel: model = nn.DataParallel(model)
So it seems that Trainer can't save it properly and load it back the usual way.
As a workaround:
model = trainer.model
# afterwards in another machine
model = RobertaForTokenClassification.from_pretrained('save_here')
Still think that this should be handled differently.
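The 'module.' prefix on every key in the warning is what nn.DataParallel adds, because it stores the wrapped model under its .module attribute. A minimal sketch of stripping that prefix from a saved state dict is below; strip_module_prefix is a hypothetical helper, and the float values are placeholders standing in for real tensors.

```python
from collections import OrderedDict


def strip_module_prefix(state_dict, prefix="module."):
    """Return a copy of state_dict with a leading prefix removed from each key."""
    cleaned = OrderedDict()
    for key, value in state_dict.items():
        new_key = key[len(prefix):] if key.startswith(prefix) else key
        cleaned[new_key] = value
    return cleaned


# Placeholder values; in a real checkpoint these would be tensors.
saved = OrderedDict(
    [
        ("module.roberta.encoder.layer.10.output.dense.bias", 0.1),
        ("module.roberta.encoder.layer.6.attention.self.key.bias", 0.2),
        ("classifier.weight", 0.3),
    ]
)
cleaned = strip_module_prefix(saved)
for key in cleaned:
    print(key)
```

With a real checkpoint, the cleaned dict would be passed to model.load_state_dict. Saving the unwrapped model (the .module attribute of the DataParallel wrapper) avoids the prefix in the first place.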

How to disable automatic checkpoint loading

I'm trying to run a loop over a set of parameters, and I want to create a new network for each parameter and let it train for a few epochs.
Currently my code looks like this:
def optimize_scale(self, epochs=5, comp_scale=100, scale_list=[1, 100]):
    trainer = pyli.Trainer(gpus=1, max_epochs=epochs)
    for scale in scale_list:
        test_model = CustomNN(num_layers=1, scale=scale, lr=1, pad=True, batch_size=1)
        del test_model
Everything works fine for the first element of scale_list: the network trains for 5 epochs and completes the test, all of which can be seen in the console. For all following elements of scale_list, however, it doesn't work: the old network is not overwritten, and instead an old checkpoint is loaded automatically when trainer.fit(model) is called. In the console this is indicated by:
C:\Users\XXXX\AppData\Roaming\Python\Python39\site-packages\pytorch_lightning\callbacks\model_checkpoint.py:623: UserWarning:
Checkpoint directory D:\XXXX\src\lightning_logs\version_0\checkpoints exists and is not empty.
rank_zero_warn(f"Checkpoint directory {dirpath} exists and is not empty.")
train_size = 8 val_size = 1 test_size = 1
Restoring states from the checkpoint path at D:\XXXX\src\lightning_logs\version_0\checkpoints\epoch=4-step=39.ckpt
Loaded model weights from checkpoint at D:\XXXX\src\lightning_logs\version_0\checkpoints\epoch=4-step=39.ckpt
The consequence is that the second test outputs the same result, as the checkpoint from the old network, which had already finished all 5 epochs, was loaded. I thought that adding the del test_model might help in dropping the model completely, but that did not work.
In my search I found a few closely related issues, for example: https://github.com/PyTorchLightning/pytorch-lightning/issues/368. However, I did not manage to fix my problem. I assume it has something to do with the fact that the new network, which should overwrite the old one, has the same name/version and therefore looks for the same checkpoints.
If anyone has an idea or knows how to circumvent this I would be very grateful.
I think, in your settings, you want to disable automatic checkpointing:
trainer = pyli.Trainer(gpus=1, max_epochs=epochs,enable_checkpointing=False)
You may need to explicitly save a checkpoint (with a different name) for each training session you are running.
You can manually save a checkpoint via:
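The call itself is missing from the answer as captured here; in PyTorch Lightning the usual method for writing a checkpoint by hand is trainer.save_checkpoint(path). A minimal sketch of giving each run its own file name follows. It uses only the standard library, checkpoint_path_for and the manual_checkpoints folder are hypothetical names, and the Trainer calls are left as comments because they depend on the rest of the training code.

```python
from pathlib import Path


def checkpoint_path_for(scale, root="manual_checkpoints"):
    """Build a distinct checkpoint file path for one value of `scale`."""
    return Path(root) / f"scale_{scale}.ckpt"


scale_list = [1, 100]
paths = [checkpoint_path_for(scale) for scale in scale_list]

for scale, path in zip(scale_list, paths):
    # Assumed usage inside the loop: create a fresh Trainer and model here,
    # call trainer.fit(...), then trainer.save_checkpoint(str(path)).
    print(scale, path.as_posix())
```

Because every run writes to a different path, a later run cannot pick up the file left behind by an earlier one.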

OutOfRangeError: tensorflow iterator not reinitializing between runs

I am fine-tuning an Inception model via TensorFlow with the setup below, and am feeding batches through the tf.data Dataset API. However, every time I attempt to train this model (before successfully retrieving any batches), I get an OutOfRangeError claiming that the iterator is exhausted:
Caught OutOfRangeError. Stopping Training. End of sequence
[[node IteratorGetNext (defined at <ipython-input-8-c768436e70d8>:13) = IteratorGetNext[output_shapes=[[?,224,224,3], [?,1]], output_types=[DT_FLOAT, DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"](OneShotIterator)]]
with tf.Graph().as_default():
I created a function to feed in hard-coded batches in place of the result of get_batch, and this runs and converges without any issues, leading me to believe that the graph and session code is working properly. I also tested the get_batch function by iterating over it in a session, and this causes no errors. The behavior I would expect is that restarting training (especially after resetting the notebook, etc.) would produce a fresh iterator over the dataset.
Code to train model:
with tf.Graph().as_default():
    images, labels = get_batch(filenames=tf_train_record_path+train_file)
    # Create the model, use the default arg scope to configure the batch norm parameters.
    with slim.arg_scope(inception.inception_v1_arg_scope()):
        logits, ax = inception.inception_v1(images, num_classes=1, is_training=True)
    # Specify the loss function:
    total_loss = tf.losses.get_total_loss()
    tf.summary.scalar('losses/Total_Loss', total_loss)
    # Specify the optimizer and create the train op:
    optimizer = tf.train.AdamOptimizer(learning_rate=0.01)
    train_op = slim.learning.create_train_op(total_loss, optimizer)
    # Run the training:
    final_loss = slim.learning.train(
Code to get batches using Dataset
def get_batch(filenames):
    dataset = tf.data.TFRecordDataset(filenames=filenames)
    dataset = dataset.map(parse)
    dataset = dataset.batch(2)
    iterator = dataset.make_one_shot_iterator()
    data_X, data_y = iterator.get_next()
    return data_X, data_y
This previously asked question resembles the issue I am experiencing; however, I am not using a batch_join call. I am not sure if this is an issue with slim.learning.train, restoring from a checkpoint, or scope. Any help would be appreciated!
Your input pipeline looks OK. The problem might be a damaged TFRecords file. You can try your code with random data, or use your images as numpy arrays with tf.data.Dataset.from_tensor_slices().
Your parse function may also cause problems. Try printing your image/label with sess.run.
I'd also advise using the Estimator API to drive training. It is much more convenient, and slim will be deprecated soon.

TensorFlow Eager Mode: How to restore a model from a checkpoint?

I've trained a CNN model in TensorFlow eager mode. Now I'm trying to restore the trained model from a checkpoint file but haven't had any success.
All the examples (as shown below) I've found are talking about restoring checkpoint to a Session. But what I need is to restore the model into eager mode, i.e. without creating a session.
with tf.Session() as sess:
    # Restore variables from disk.
    saver.restore(sess, "/tmp/model.ckpt")
Basically what I need is something like:
model = tfe.restore('model.ckpt')
and then I can use the model to make predictions.
Can someone please help?
The example code can be found at: mnist eager mode demo
I've tried to follow the steps from @Jay Shah's answer and it almost worked, but the restored model doesn't have any variables in it.
model2 = MNISTModel()
The original model has lots of variables in it:
[<tf.Variable 'mnist_model_1/conv2d/kernel:0' shape=(5, 5, 1, 32) dtype=float32, numpy=
array([[[[ -8.25184360e-02, 6.77833706e-03, 6.97569922e-02,...
Eager execution is still a new feature in TensorFlow and was not included in the latest release, so not all features are supported; fortunately, loading a model from a saved checkpoint is.
You'll need to use the tfe.Saver class (which is a thin wrapper over the tf.train.Saver class), and your code should look something like this:
saver = tfe.Saver([x, y])
Where [x,y] represents the list of variables and/or models you wish to restore. This should precisely match the variables passed when the saver that created the checkpoint was initially created.
More details, including sample code, can be found here, and the API details of the saver can be found here.
Ok, after spending a few hours running the code in line-by-line mode, I've figured out a way to restore a checkpoint to a new TensorFlow Eager Mode model.
Using the examples from TF Eager Mode MNIST
1. After your model has been trained, find the index file of the latest checkpoint (or of the checkpoint you want) in the checkpoint folder created during training, such as 'ckpt-25800.index'. Use only the prefix 'ckpt-25800' when restoring in step 5.
2. Start a new Python terminal and enable TensorFlow Eager mode by running:
3. Create a new instance of MNISTModel:
model_new = MNISTModel()
4. Initialise the variables for model_new by running a dummy train process once. (This step is important. Without initialising the variables first, they can't be restored by the following step. However, I can't find another way to initialise variables in Eager mode other than what I did below.)
model_new(tfe.Variable(np.zeros((1,784),dtype=np.float32)), training=True)
5. Restore the variables to model_new using the checkpoint identified in step 1.
If the restore process is successful, you should see something like:
INFO:tensorflow:Restoring parameters from ./tf_checkpoints/ckpt-25800
Now the checkpoint has been successfully restored to model_new and you can use it to make predictions on new data.
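Picking the checkpoint prefix from the '.index' file name, as described in the first step above, can be sketched with the standard library alone. latest_checkpoint_prefix is a hypothetical helper, and the 'ckpt-<step>' naming is assumed from the example file name in this answer. TensorFlow also provides tf.train.latest_checkpoint, which reads the 'checkpoint' state file in the directory instead of scanning file names.

```python
import os
import re
import tempfile


def latest_checkpoint_prefix(checkpoint_dir):
    """Return the path prefix of the highest-numbered 'ckpt-N.index' file, or None."""
    pattern = re.compile(r"^(ckpt-(\d+))\.index$")
    best_step, best_prefix = -1, None
    for name in os.listdir(checkpoint_dir):
        match = pattern.match(name)
        if match and int(match.group(2)) > best_step:
            best_step, best_prefix = int(match.group(2)), match.group(1)
    if best_prefix is None:
        return None
    return os.path.join(checkpoint_dir, best_prefix)


# Demonstrate on a throwaway directory holding empty stand-in files.
with tempfile.TemporaryDirectory() as tmp:
    for name in (
        "ckpt-100.index",
        "ckpt-25800.index",
        "ckpt-25800.data-00000-of-00001",
        "checkpoint",
    ):
        open(os.path.join(tmp, name), "w").close()
    found = os.path.basename(latest_checkpoint_prefix(tmp))

print(found)  # ckpt-25800
```

The returned prefix, without the '.index' extension, is the value that the restore call in the last step expects.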
I'd like to share the TFLearn library, a deep learning library featuring a higher-level API for TensorFlow. With the help of this library you can easily save and restore a model.
Saving a model
model = tflearn.DNN(net) #Here 'net' is your designed network model.
#This is a sample example for training the model
model.fit(train_x, train_y, n_epoch=10, validation_set=(test_x, test_y), batch_size=10, show_metric=True)
Restore a model
model = tflearn.DNN(net)
For more examples of TFLearn, you can check sites like:
My first CNN in TFLearn.
Github Link
First, save your model to a checkpoint by doing the following:
saver.save(sess, './my_model.ckpt')
The line above saves your session to the "my_model.ckpt" checkpoint.
The following code restores the model:
saver = tf.train.Saver()
with tf.Session() as sess:
    saver.restore(sess, './my_model.ckpt')
Restoring the session this way restores your model from the checkpoint.
For eager mode to save :
For eager mode to restore :
sess is an object of class Network. Any object of class Network can be saved and restored. A quick explanation of network objects :-
class TwoLayerNetwork(tfe.Network):
    def __init__(self, name):
        super(TwoLayerNetwork, self).__init__(name=name)
        self.layer_one = self.track_layer(tf.layers.Dense(16, input_shape=(8,)))
        self.layer_two = self.track_layer(tf.layers.Dense(1, input_shape=(16,)))
    def call(self, inputs):
        return self.layer_two(self.layer_one(inputs))
After constructing an object and calling the Network, a list of variables
created by tracked Layers is available via Network.variables:
sess = TwoLayerNetwork(name="net") # sess is object of Network
output = sess(tf.ones([1, 8]))
print([v.name for v in sess.variables])
This example prints variable names, one kernel and one bias per
`tf.layers.Dense` layer:
These variables can be passed to a `Saver` (`tf.train.Saver`, or
`tf.contrib.eager.Saver` when executing eagerly) to save or restore the
tfe.save_network_checkpoint(sess,'./my_model.ckpt') # saving the model
tfe.restore_network_checkpoint(sess,'./my_model.ckpt') # restoring
Saving variables with tfe.Saver().save() :
for epoch in range(epochs):
    all_variables = model.variables + optimizer.variables()
    # save the variables
And then reload saved variables with tfe.Saver().restore() :
tfe.Saver((model.variables + optimizer.variables())).restore(checkpoint_prefix)
The model is then loaded with the saved variables, and there is no need to create a new one as in @Stefan Falk's answer.

Issue with TensorFlow saving

I am training neural nets with TensorFlow, and the model's training is working using a custom implementation of batch gradient descent. I have a logging function which records validation error, and it gets down to about 2.6%. I'm saving the model every 10 epochs using a tf.train.Saver.
However, when I load the variables into memory again using a tf.train.Saver with the same script, the model performs poorly, about as well as it does when the weights are randomly initialized. I have inspected the convolutional filters in the checkpoint, however, and they don't seem to be random.
I have not included all of my code, since it's around 400 lines long, but I've included what seem to be the important sections here and summarized the other functionality.
class ModelTrainer:
def __init__(self, ...hyperparameters...):
# Initialize datasets and hyperparameters
for each gpu
# Create loss function and gradient assigned to this gpu using tf.device("/gpu:n")
with tf.device("/cpu:0")
# Average and clip gradients from the gpu's
# Create this batch gradient descent operation for each trainable variable
variable.assign_sub(learning_rate * averaged_and_clipped_gradient).op
def train(self, ...hyperparameters...)
saver = train.Saver(tf.all_variables(), max_to_keep = 30)
init = tf.initialize_all_variables()
sess = tf.Session()
if starting_point is not None: # Used to evaluate existing models
saver.restore(sess, starting_point)
for i in range(number_of_batches)
# ... Get training batch ...
gradients = sess.run(calculate_gradients, feeds = training_batch)
# Average "gradients" variable across multiple batches
# Must be done because of GPU memory limitations
if i % meta_batch_size == 0:
feeds = gradients_that_have_been_averaged_across_multiple_batches)
# Log validation error
if i % save_after_n_batches == 0:
saver.save(sess, "some-filename", global_step=self.iter_num)
As expected, running these two functions creates a set of checkpoint files called "some-filename-40001", or whatever other iteration number the training is at when that file is saved. Unfortunately, when I load these checkpoints back in using the starting_point parameter, they perform on par with random initialization.
Initially I assumed it was something to do with the way I'm training the model, since I haven't found anyone else with this issue, but the validation error behaves as expected.
Edit: More odd results. After more experimentation, I have found that when I load the saved model using the code:
with tf.Session() as sess:
saver = tf.train.import_meta_graph("saved-checkpoint-40.meta")
saver.restore(sess, "saved-checkpoint-40")
# ... Use model in some way ...
I get different, but still incorrect results.
