Save Keras model at specific epochs - python

I am using Keras to do some training on my dataset and it is time consuming to keep running every time to locate the number of epochs needed to get the best results. I tried using callbacks to get the best model, but it just does not work and usually stops too early. Also, saving every N epochs is not an option for me.
What I am trying to do is save the model after some specific epochs are done. Let's say for example, after epoch = 150 is over, it will be saved as model.save(model_1.h5) and after epoch = 152, it will be saved as model.save(model_2.h5) etc... for few specific epochs.
Is there a way to implement this in Keras ? I already searched for a method but no luck so far.
Thank you for any help/suggestion.

Edit
In most cases it's enough to use name formatting suggested by #Toan Tran in his answer.
But if you need some sophisticated logic, you can use a callback, for example
import keras
class CustomSaver(keras.callbacks.Callback):
def on_epoch_end(self, epoch, logs={}):
if epoch == 2: # or save after some epoch, each k-th epoch etc.
self.model.save("model_{}.hd5".format(epoch))
on_epoch_end is called at the end of each epoch; epoch is a number of epoch, latter argument is a logs (you can read about other callback methods in docs). Put the logic into this method (in example it's as simple as possible).
Create saver object and put it into fit method:
import keras
import numpy as np
inp = keras.layers.Input(shape=(10,))
dense = keras.layers.Dense(10, activation='relu')(inp)
out = keras.layers.Dense(1, activation='sigmoid')(dense)
model = keras.models.Model(inp, out)
model.compile(optimizer="adam", loss="binary_crossentropy",)
# Just a noise data for fast working example
X = np.random.normal(0, 1, (1000, 10))
y = np.random.randint(0, 2, 1000)
# create and use callback:
saver = CustomSaver()
model.fit(X, y, callbacks=[saver], epochs=5)
In the bash:
!ls
Out:
model_2.hd5
So, it works.

checkpoint = keras.callbacks.ModelCheckpoint('model{epoch:08d}.h5', period=5)
model.fit(X_train, Y_train, callbacks=[checkpoint])
Did you try checkpoint? period=5 means model is saved after 5 epoch
More details here
Hope this help :)

Well, I can't comment on posts yet. So, I'm adding on to #Toan Tran's answer. With the latest version of Keras, the argument period is deprecated. Instead, we can use save_freq.
In the following example, the model is saved after every epoch.
checkpoint = keras.callbacks.ModelCheckpoint(model_save_path+'/checkpoint_{epoch:02d}', save_freq='epoch')
H=model.fit(x=x_train, y=y_train,epochs=epoch_no,verbose=2, callbacks=[checkpoint])
You can find more details from keras documentation.

Related

Hugging Face not able to reload all weights after training

I recently being using a RobertaLarge model, which I perform a down stream Training, using "Trainer" package.
All goes well, I see the loss going down, and compare manually some results with valid dataset.
Problem goes when I try to save the model and reload it afterwards.
I keep seeing the warning when trying to reload the model:
Some weights of the model checkpoint at Roberta_trained_1epoch were not used when initializing RobertaPreTrainedModel: ['module.roberta.encoder.layer.10.output.dense.bias', [........................................340_LAYERS_..................................]
'module.roberta.encoder.layer.6.attention.self.key.bias', 'module.roberta.encoder.layer.22.output.dense.weight', 'module.roberta.encoder.layer.3.attention.self.key.bias', 'module.roberta.encoder.layer.15.attention.self.value.bias', 'module.roberta.encoder.layer.15.attention.self.query.bias', 'module.roberta.encoder.layer.2.attention.self.value.bias']
I looked extensively for an answer to why this problem, and so far couldn't find a solution. Some claim this is just a warning and there's nothing wrong, however suspiciously I did some manual checks, and indeed the model seems... virgin.
I'm using the: Trainer.save_model('save_here') after training, and using the RobertaForTokenClassification.from_pretrained('save_here', local_files_only=True)model to reload it.
However the results show me that the model is not loading currently clearly.
training code:
trainer = Trainer(
model=model,
args=training_args,
compute_metrics=compute_metrics,
train_dataset=ds_train,
eval_dataset=ds_valid,
callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)
trainer.train()
trainer.evaluate()
trainer.save_model('save_here')
this results in evaluation loss of: 0.002
Reloading and re-evaluation:
model = RobertaForTokenClassification.from_pretrained('save_here', local_files_only=True)
tokenizer = AutoTokenizer.from_pretrained('tokenizers_saved')
dl_valid = DataLoader(ds_valid, batch_size=Config.batch_size, shuffle=True)
with torch.no_grad():
for index, data in enumerate(dl_valid):
batch_input_ids = data['input_ids'].to(device, dtype=torch.long)
batch_att_mask = data['attention_mask'].to(device, dtype=torch.long)
batch_target = data['label_ids'].to(device, dtype=torch.long)
output = model(batch_input_ids, token_type_ids=None, attention_mask=batch_att_mask, labels=batch_target)
step_loss, eval_prediction = output['loss'], output['logits']
eval_prediction = np.argmax(eval_prediction.detach().to('cpu').numpy(), axis=2)
predictions.append(eval_prediction)
reals.append(batch_target)
eval_loss += step_loss
print(eval_loss)
This results in loss: 1.2 - 0.9 (randomly after loading)
I found out what was wrong.
Will share with others, given others may have the same issue.
My problem was that I wrapped my model into a DataParallel model = nn.DataParallel(model)
So it seems that Trainer can't save it properly and get it back the usual way.
As a work around:
model = trainer.model
model.module.save_pretrained('save_here')
....
# afterwards in another machine
....
model = RobertaForTokenClassification.from_pretrained('save_here')
Still think that this should be handled differently.

Optimizer doesn't work when resuming training

I defined a scalar _log_alpha and an optimizer to otimize it.
self._log_alpha = torch.log(torch.ones(1) * alpha).to(get_device()).requires_grad_(True)
self._log_alpha_optimizer = optim.Adam([self._log_alpha], lr=lr)
If I start training from the very beginning, it works just fine. The _log_alpha changes a little bit every time I call
self._log_alpha_optimizer.zero_grad()
log_alpha_loss.backward()
self._log_alpha_optimizer.step()
However, if I train several steps, and save the optimizer's state_dict and _log_alpha
ckpt = {'log_alpha_optimizer_state_dict': self._log_alpha_optimizer.state_dict(),
'log_alpha': self._log_alpha}
torch.save(ckpt, save_dir)
and then load them to resume training
ckpt = torch.load(load_dir, map_location=torch.device(get_device()))
self._log_alpha_optimizer.load_state_dict(ckpt['log_alpha_optimizer_state_dict'])
self._log_alpha = ckpt['log_alpha']
self._log_alpha.requires_grad_(True)
the _log_alpha won't change anymore.
I also defined some nn.Modules, whose optimizers still work after saving and loading. I wonder which part of the _log_alpha_optimizer is wrong?

Call back not working in tensor flow to stop the training

I have written a call back which stops training when accuracy becomes 99%.But the problem is i get this error .Sometimes if i resolve this error the call back not get called even though acuurqacy becoms 100 %.
'>' not supported between instances of 'NoneType' and 'float'
class myCallback(tf.keras.callbacks.Callback):
def on_epoch_end(self, epoch, logs={}):
if(logs.get('accuracy') > 0.99):
self.model.stop_training = True
def train_mnist():
# Please write your code only where you are indicated.
# please do not remove # model fitting inline comments.
# YOUR CODE SHOULD START HERE
# YOUR CODE SHOULD END HERE
call = myCallback()
mnist = tf.keras.datasets.mnist
(x_train, y_train),(x_test, y_test) = mnist.load_data(path=path)
# YOUR CODE SHOULD START
x_train = x_train/255
y_train = y_train/255
# YOUR CODE SHOULD END HERE
model = tf.keras.models.Sequential([
# YOUR CODE SHOULD START HERE
keras.layers.Flatten(input_shape=(28,28)),
keras.layers.Dense(128,activation='relu'),
keras.layers.Dense(10,activation='softmax')
# YOUR CODE SHOULD END HERE
])
model.compile(optimizer='adam',
loss='sparse_categorical_crossentropy',
metrics=['accuracy'])
# model fitting
history = model.fit(# YOUR CODE SHOULD START HERE
x_train,y_train,epochs=9,callbacks=[call] )
# model fitting
return history.epoch, history.history['acc'][-1]
Two major problems with the above code:
Getting to 100% accuracy on training set almost always means that your model is overfitting. Thats BAD. What you want to do instead is specify the validation_split=.2 parameter in the .fit method, and look for a high accuracy on the validation set.
What you are trying to build in your custom callback is already done in keras.callbacks.EarlyStopping, it even has an option to restore to the best overall model over each epoch. And, by default, it is looking for a validation accuracy, not training accuracy, if you have a validation split.
So, here's what you should do:
Stop using custom callbacks, they take some mastery to get to work. Use EarlyStopping with restore_best instead. like this
Always use validation_split and look for high accuracy in validation set. Like in this quick example.
Did using built-in callbacks resolve your problem?

KeyError: ''val_loss" when training model

I am training a model with keras and am getting an error in callback in fit_generator function. I always run to epoch 3rd and get this error
annotation_path = 'train2.txt'
log_dir = 'logs/000/'
classes_path = 'model_data/deplao_classes.txt'
anchors_path = 'model_data/yolo_anchors.txt'
class_names = get_classes(classes_path)
num_classes = len(class_names)
anchors = get_anchors(anchors_path)
input_shape = (416,416) # multiple of 32, hw
is_tiny_version = len(anchors)==6 # default setting
if is_tiny_version:
model = create_tiny_model(input_shape, anchors, num_classes,
freeze_body=2, weights_path='model_data/tiny_yolo_weights.h5')
else:
model = create_model(input_shape, anchors, num_classes,
freeze_body=2, weights_path='model_data/yolo_weights.h5') # make sure you know what you freeze
logging = TensorBoard(log_dir=log_dir)
checkpoint = ModelCheckpoint(log_dir + 'ep{epoch:03d}-loss{loss:.3f}-val_loss{val_loss:.3f}.h5',
monitor='val_loss', save_weights_only=True, save_best_only=True, period=3)
reduce_lr = ReduceLROnPlateau(monitor='val_loss', factor=0.1, patience=3, verbose=1)
early_stopping = EarlyStopping(monitor='val_loss', min_delta=0, patience=10, verbose=1)
[error]
Traceback (most recent call last):
File "train.py", line 194, in <module>
_main()
File "train.py", line 69, in _main
callbacks=[logging, checkpoint])
File "C:\Users\ilove\AppData\Roaming\Python\Python37\lib\site-packages\keras\legacy\interfaces.py", line 91, in wrapper
return func(*args, **kwargs)
File "C:\Users\ilove\AppData\Roaming\Python\Python37\lib\site-packages\keras\engine\training.py", line 1418, in fit_generator
initial_epoch=initial_epoch)
File "C:\Users\ilove\AppData\Roaming\Python\Python37\lib\site-packages\keras\engine\training_generator.py", line 251, in fit_generator
callbacks.on_epoch_end(epoch, epoch_logs)
File "C:\Users\ilove\AppData\Roaming\Python\Python37\lib\site-packages\keras\callbacks.py", line 79, in on_epoch_end
callback.on_epoch_end(epoch, logs)
File "C:\Users\ilove\AppData\Roaming\Python\Python37\lib\site-packages\keras\callbacks.py", line 429, in on_epoch_end
filepath = self.filepath.format(epoch=epoch + 1, **logs)
KeyError: 'val_loss'
can anyone find out problem to help me?
Thanks in advance for your help.
This callback runs at the end of iteration 3.
checkpoint = ModelCheckpoint(log_dir + 'ep{epoch:03d}-loss{loss:.3f}-val_loss{val_loss:.3f}.h5',
monitor='val_loss', save_weights_only=True, save_best_only=True, period=3)
The error message is claiming that there is no val_loss in the logs variable when executing:
filepath = self.filepath.format(epoch=epoch + 1, **logs)
This would happen if fit is called without validation_data.
I would start by simplifying the path name for model checkpoint. It is probably enough to include the epoch in the name.
This answer doesn't apply to the question, but this was at the top of the Google results for keras "KeyError: 'val_loss'" so I'm going to share the solution for my problem.
The error was the same for me: when using val_loss in the checkpoint file name, I would get the following error: KeyError: 'val_loss'. My checkpointer was also monitoring this field, so even if I took the field out of the file name, I would still get this warning from the checkpointer: WARNING:tensorflow:Can save best model only with val_loss available, skipping.
In my case, the issue was that I was upgrading from using Keras and Tensorflow 1 separately to using the Keras that came with Tensorflow 2. The period param for ModelCheckpoint had been replaced with save_freq. I erroneously assumed that save_freq behaved the same way, so I set it to save_freq=1 thinking this would save it every epic. However, the docs state:
save_freq: 'epoch' or integer. When using 'epoch', the callback saves the model after each epoch. When using integer, the callback saves the model at end of a batch at which this many samples have been seen since last saving. Note that if the saving isn't aligned to epochs, the monitored metric may potentially be less reliable (it could reflect as little as 1 batch, since the metrics get reset every epoch). Defaults to 'epoch'
Setting save_freq='epoch' solved the issue for me. Note: the OP was still using period=1 so this is definitely not what was causing their problem
Use val_accuracy in the filepath and checkpoint. If it still doesn't improve just restart the pc or colab.
this error happens when we are not providing validation data to the model,
And check the parameters of the model.fit_generator(or model.fit)(train_data, steps_per_epoch,validation_data, validation_steps, epochs,initial_epoch, callbacks)
For me the problem was that I was trying to set the initial_epoch (in model.fit) to a value other than the standard 0. I was doing so because I'm running model.fit in a loop that runs 10 epochs each cycle, then retrieves history data, checks if loss has decreased and runs model.fit again until it's satisfied.
I thought I had to update the value as I was restarting the previous model but apparently no...
switch = True
epoch = 0
wait = 0
previous = 10E+10
while switch:
history = model.fit( X, y, batch_size=1, epochs=step, verbose=False )
epoch += step
current = history.history["loss"][-1]
if current >= previous:
wait += 1
if wait >= tolerance:
switch = False
else:
wait = 0
if epoch >= max_epochs:
switch = False
previous = current
In my case, the val_generator was broken when colab notebook try to read the images from google drive. So i run the cell create val_generator again and it worked
I had this error and didn't manage to find the cause of the bug anywhere online.
What was happening in my case was that I was asking for more training samples than I actually had. TF didn't give me an explicit error for that and it even provided me with a saved value for the loss. I only received the esoteric KeyError: "val_loss" when trying to save that.
Hope this helps someone sniff out their silly bug if that's whats happening to them.
I do not know if this will work in all cases. But, for me I restarted my computer and it seemed to work.

OutOfRangeError: tensorflow iterator not reinitializing between runs

I am fine-tuning an Inception model via tensorflow with the below setup, and am feeding batches tf.DatasetAPI. However, every time I attempt to train this model (before successfully retrieving any batches), I get an OutOfRangeError claiming that the iterator is exhausted:
Caught OutOfRangeError. Stopping Training. End of sequence
[[node IteratorGetNext (defined at <ipython-input-8-c768436e70d8>:13) = IteratorGetNext[output_shapes=[[?,224,224,3], [?,1]], output_types=[DT_FLOAT, DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"](OneShotIterator)]]
with tf.Graph().as_default():
I created a function to feed in hard coded batches as the result of get_batch, and this runs and converges without any issues, leading me to believe that the graph and session code is working properly. I also tested the get_batch function to iterate in a session, and this causes no errors. The behavior I would expect is that restarting training (especially with reseting the notebook, etc. ) would produce a fresh iterator over the dataset.
Code to train model:
with tf.Graph().as_default():
tf.logging.set_verbosity(tf.logging.INFO)
images, labels = get_batch(filenames=tf_train_record_path+train_file)
# Create the model, use the default arg scope to configure the batch norm parameters.
with slim.arg_scope(inception.inception_v1_arg_scope()):
logits, ax = inception.inception_v1(images, num_classes=1, is_training=True)
# Specify the loss function:
tf.losses.mean_squared_error(labels,logits)
total_loss = tf.losses.get_total_loss()
tf.summary.scalar('losses/Total_Loss', total_loss)
# Specify the optimizer and create the train op:
optimizer = tf.train.AdamOptimizer(learning_rate=0.01)
train_op = slim.learning.create_train_op(total_loss, optimizer)
# Run the training:
final_loss = slim.learning.train(
train_op,
logdir=train_dir,
init_fn=get_init_fn(),
number_of_steps=1)
Code to get batches using Dataset
def get_batch(filenames):
dataset = tf.data.TFRecordDataset(filenames=filenames)
dataset = dataset.map(parse)
dataset = dataset.batch(2)
iterator = dataset.make_one_shot_iterator()
data_X, data_y = iterator.get_next()
return data_X, data_y
This previously asked question resembles the issue I am experiencing, however, I am not using a batch_join call. I am not if this is an issue with slim.learning.train, restoring from a checkpoint, or scope. Any help would be appreciated!
Your input pipeline looks ok. The problem might be with damaged TFRecords file. You can try your code with random data, or use your images as numpy arrays with tf.data.Dataset.from_tensor_slices().
Also your parse function may cause problems. Try to print your image/label with sess.run.
And I'd advise using Estimator API as train_op. It is much more convenient and slim will be deprecated soon.

Categories