Keras early stopping callback error, val_loss metric not available - python

I am training a Keras model (TensorFlow backend, Python, on a MacBook) and am getting an error from the early stopping callback in the fit_generator function. The error is as follows:
[local-dir]/lib/python3.6/site-packages/keras/callbacks.py:497: RuntimeWarning: Early stopping conditioned on metric `val_loss` which is not available. Available metrics are:
(self.monitor, ','.join(list(logs.keys()))), RuntimeWarning
[local-dir]/lib/python3.6/site-packages/keras/callbacks.py:406: RuntimeWarning: Can save best model only with val_acc available, skipping.
'skipping.' % (self.monitor), RuntimeWarning)
Traceback (most recent call last):
:
[my-code]
:
File "[local-dir]/lib/python3.6/site-packages/keras/legacy/interfaces.py", line 91, in wrapper
return func(*args, **kwargs)
File "[local-dir]/lib/python3.6/site-packages/keras/engine/training.py", line 2213, in fit_generator
callbacks.on_epoch_end(epoch, epoch_logs)
File "[local-dir]/lib/python3.6/site-packages/keras/callbacks.py", line 76, in on_epoch_end
callback.on_epoch_end(epoch, logs)
File "[local-dir]/lib/python3.6/site-packages/keras/callbacks.py", line 310, in on_epoch_end
self.progbar.update(self.seen, self.log_values, force=True)
AttributeError: 'ProgbarLogger' object has no attribute 'log_values'
My code is as follows (which looks OK):
:
ES = EarlyStopping(monitor="val_loss", min_delta=0.001, patience=3, mode="min", verbose=1)
:
self.model.fit_generator(
generator = train_batch,
validation_data = valid_batch,
validation_steps = validation_steps,
steps_per_epoch = steps_per_epoch,
epochs = epochs,
callbacks = [ES],
verbose = 1,
workers = 3,
max_queue_size = 8)
The error message appears to relate to the early stopping callback, but the callback itself looks fine. The error also states that val_loss is not available, and I am not sure why. One more unusual thing: the error only occurs when I use smaller data sets.
Any help is appreciated.

If the error only occurs when you use smaller datasets, you're very likely using datasets small enough that not a single sample lands in the validation set.
Thus it cannot calculate a validation loss.
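If you prefer to fail fast instead of silently skipping the callback, you can check the split yourself before training. A minimal sketch, assuming valid_batch is a Keras Sequence (so len() gives the number of validation batches); the names are placeholders for your own objects:

if len(valid_batch) == 0 or validation_steps == 0:
    raise ValueError("Validation set is empty: val_loss cannot be computed, "
                     "so EarlyStopping(monitor='val_loss') will never see its metric.")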

The error occurred for us because we forgot to set validation_data in the fit() method while using callbacks=[keras.callbacks.EarlyStopping(monitor='val_loss', patience=1)].
The code causing the error was:
self.model.fit(
x=x_train,
y=y_train,
callbacks=[keras.callbacks.EarlyStopping(monitor='val_loss', patience=1)],
verbose=True)
Adding validation_data=(x_validate, y_validate) to the fit() call fixed it:
self.model.fit(
x=x_train,
y=y_train,
callbacks=[keras.callbacks.EarlyStopping(monitor='val_loss', patience=1)],
validation_data=(x_validate, y_validate),
verbose=True)

I up-voted the previous answer as it gave me the insight to verify the data and the inputs to the fit_generator function and find the actual root cause of the issue. In summary, when my dataset was small, the validation_steps and steps_per_epoch I calculated turned out to be zero (0), which caused the error.
I suppose the better longer-term answer, perhaps for the Keras team, is to raise an error/exception in fit_generator when these values are zero, which would probably lead to a better understanding of how to address this issue.
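A guard like the following avoids the zero-steps case. This is only a minimal sketch, assuming train_size, valid_size and batch_size are your own variables; rounding up with math.ceil and flooring at 1 keeps both values positive on tiny datasets:

import math

steps_per_epoch = max(1, math.ceil(train_size / batch_size))
validation_steps = max(1, math.ceil(valid_size / batch_size))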

This error occurs due to the smaller dataset. To resolve it, increase the training data and split the training set 80:20.

My problem was that I called these callbacks with the parameter "val_acc".
The right parameter is "val_accuracy".
The solution was in my error message, in the sentence "Available metrics are: ..."

I got this warning too. It appeared after switching to the master branch of Keras 2.2.4 to get the validation_freq functionality:
//anaconda3/lib/python3.7/site-packages/keras/callbacks/callbacks.py:846: RuntimeWarning: Early stopping conditioned on metric `val_loss` which is not available. Available metrics are: loss,accuracy
(self.monitor, ','.join(list(logs.keys()))), RuntimeWarning
However, despite the warning, early stopping on val_loss still works (at least for me). For example, this is the output I received when the computation stopped early:
Epoch 00076: early stopping
Before that Keras update, early stopping worked on val_loss with no warning.
Don't ask me why it works because I haven't a clue.
(You can try this behavior with a small example that you know should stop early.)

I got this error message using fit_generator. The error appeared after the first epoch had finished.
The problem was that I had set validation_freq=20 in fit_generator parameters.
Keras executes the callbacks list at the end of the first epoch, but it didn't actually calculate val_loss until after epoch 20, so val_loss was not (yet) available.
Setting validation_freq=1 fixed the problem.
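For context, validation_freq is passed alongside the callbacks. A minimal sketch (the generator names and step counts are placeholders for your own objects):

from keras.callbacks import EarlyStopping

model.fit_generator(
    generator=train_batch,
    validation_data=valid_batch,
    validation_steps=validation_steps,
    steps_per_epoch=steps_per_epoch,
    validation_freq=1,   # compute val_loss every epoch so the callbacks can see it
    epochs=epochs,
    callbacks=[EarlyStopping(monitor="val_loss", patience=3)])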

Using tf.compat.v1.disable_eager_execution() solved the problem for me. Trying validation_freq=1 is also a good idea. However, you then have to wait for the terminal output of each completed epoch, so I recommend observing the results with TensorBoard, Weights & Biases, etc.

Try to avoid using tf.keras when importing dependencies. It works for me when I directly use Keras (e.g., to import layers and callbacks).
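The usual pitfall here is mixing the two packages, e.g. passing a keras callback to a model built with tf.keras (or vice versa). A minimal sketch of keeping the imports consistent, whichever package you standardize on:

# either use plain Keras everywhere...
from keras.models import Sequential
from keras.layers import Dense
from keras.callbacks import EarlyStopping

# ...or tf.keras everywhere, but don't mix the two in one model:
# from tensorflow.keras.models import Sequential
# from tensorflow.keras.layers import Dense
# from tensorflow.keras.callbacks import EarlyStopping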

You should set the monitor parameter for early stopping. The default value of the monitor parameter for EarlyStopping is "val_loss":
keras.callbacks.EarlyStopping(monitor='val_loss', patience=5)
When you do not set validation data for your model, you don't have a val_loss. So either set validation data for your model inside the fit function, or change the monitor parameter to "loss":
keras.callbacks.EarlyStopping(monitor='loss', patience=5)

Change the monitored metric in this line from 'val_loss' to 'loss':
ES = EarlyStopping(monitor="val_loss", min_delta=0.001, patience=3, mode="min", verbose=1)
changes to:
ES = EarlyStopping(monitor="loss", min_delta=0.001, patience=3, mode="min", verbose=1)

Also, check your validation_freq input, as val_loss is only available after it has been computed for the first time.
For example, if your early stopping triggers at epoch 5 but validation_freq is set to 10, val_loss won't be available yet.
The Keras early stopping callback needs val_loss to be computed on each epoch it monitors.

If you are too lazy to dig further, try something like this.
Quite often EarlyStopping fails because it runs too soon. Setting strict to False avoids crashing when the metric check fails, and verbose writes out what happened, so you can confirm that it stopped crashing after the first epoch. (Note that strict is a parameter of PyTorch Lightning's EarlyStopping, not of the plain Keras callback.)
EarlyStopping(
monitor="val_loss",
mode="min",
verbose = True,
strict=False)

Related

Hugging Face not able to reload all weights after training

I have recently been using a RoBERTa-large model, on which I perform downstream training using the Trainer API.
All goes well: I see the loss going down, and I manually compare some results against the validation dataset.
The problem comes when I try to save the model and reload it afterwards.
I keep seeing this warning when trying to reload the model:
Some weights of the model checkpoint at Roberta_trained_1epoch were not used when initializing RobertaPreTrainedModel: ['module.roberta.encoder.layer.10.output.dense.bias', [........................................340_LAYERS_..................................]
'module.roberta.encoder.layer.6.attention.self.key.bias', 'module.roberta.encoder.layer.22.output.dense.weight', 'module.roberta.encoder.layer.3.attention.self.key.bias', 'module.roberta.encoder.layer.15.attention.self.value.bias', 'module.roberta.encoder.layer.15.attention.self.query.bias', 'module.roberta.encoder.layer.2.attention.self.value.bias']
I looked extensively for an answer to this problem and so far couldn't find a solution. Some claim this is just a warning and nothing is wrong; however, I did some manual checks, and the model indeed seems... untrained.
I'm using Trainer.save_model('save_here') after training, and RobertaForTokenClassification.from_pretrained('save_here', local_files_only=True) to reload it.
However, the results show me that the model is clearly not loading correctly.
training code:
trainer = Trainer(
model=model,
args=training_args,
compute_metrics=compute_metrics,
train_dataset=ds_train,
eval_dataset=ds_valid,
callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)
trainer.train()
trainer.evaluate()
trainer.save_model('save_here')
this results in evaluation loss of: 0.002
Reloading and re-evaluation:
model = RobertaForTokenClassification.from_pretrained('save_here', local_files_only=True)
tokenizer = AutoTokenizer.from_pretrained('tokenizers_saved')
dl_valid = DataLoader(ds_valid, batch_size=Config.batch_size, shuffle=True)
with torch.no_grad():
    for index, data in enumerate(dl_valid):
        batch_input_ids = data['input_ids'].to(device, dtype=torch.long)
        batch_att_mask = data['attention_mask'].to(device, dtype=torch.long)
        batch_target = data['label_ids'].to(device, dtype=torch.long)
        output = model(batch_input_ids, token_type_ids=None, attention_mask=batch_att_mask, labels=batch_target)
        step_loss, eval_prediction = output['loss'], output['logits']
        eval_prediction = np.argmax(eval_prediction.detach().to('cpu').numpy(), axis=2)
        predictions.append(eval_prediction)
        reals.append(batch_target)
        eval_loss += step_loss
print(eval_loss)
This results in loss: 1.2 - 0.9 (randomly after loading)
I found out what was wrong, and will share it here since others may have the same issue.
My problem was that I had wrapped my model in DataParallel: model = nn.DataParallel(model)
So it seems that Trainer can't save it properly and get it back the usual way.
As a workaround:
model = trainer.model
model.module.save_pretrained('save_here')
....
# afterwards in another machine
....
model = RobertaForTokenClassification.from_pretrained('save_here')
Still think that this should be handled differently.

Callback not working in TensorFlow to stop the training

I have written a callback which stops training when accuracy reaches 99%, but I get the error below. And sometimes, even when I work around the error, the callback does not get called even though accuracy reaches 100%.
'>' not supported between instances of 'NoneType' and 'float'
class myCallback(tf.keras.callbacks.Callback):
    def on_epoch_end(self, epoch, logs={}):
        if(logs.get('accuracy') > 0.99):
            self.model.stop_training = True

def train_mnist():
    # Please write your code only where you are indicated.
    # please do not remove # model fitting inline comments.
    # YOUR CODE SHOULD START HERE
    # YOUR CODE SHOULD END HERE
    call = myCallback()
    mnist = tf.keras.datasets.mnist
    (x_train, y_train), (x_test, y_test) = mnist.load_data(path=path)
    # YOUR CODE SHOULD START
    x_train = x_train/255
    y_train = y_train/255
    # YOUR CODE SHOULD END HERE
    model = tf.keras.models.Sequential([
        # YOUR CODE SHOULD START HERE
        keras.layers.Flatten(input_shape=(28,28)),
        keras.layers.Dense(128, activation='relu'),
        keras.layers.Dense(10, activation='softmax')
        # YOUR CODE SHOULD END HERE
    ])
    model.compile(optimizer='adam',
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])
    # model fitting
    history = model.fit(  # YOUR CODE SHOULD START HERE
        x_train, y_train, epochs=9, callbacks=[call])
    # model fitting
    return history.epoch, history.history['acc'][-1]
Two major problems with the above code:
Getting to 100% accuracy on the training set almost always means that your model is overfitting. That's BAD. What you want to do instead is specify the validation_split=.2 parameter in the .fit method and look for a high accuracy on the validation set.
What you are trying to build in your custom callback is already done in keras.callbacks.EarlyStopping, and it even has an option to restore the best overall model across epochs. By default, it looks at validation accuracy, not training accuracy, if you have a validation split.
So, here's what you should do:
Stop using custom callbacks; they take some mastery to get working. Use EarlyStopping with restore_best_weights instead, as shown below.
Always use validation_split and look for high accuracy on the validation set, like in the quick sketch after this list.
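A minimal sketch combining both suggestions (the model is assumed to be the compiled Sequential model from the question; restore_best_weights and validation_split are standard tf.keras arguments):

es = tf.keras.callbacks.EarlyStopping(
    monitor='val_accuracy',      # watch the validation metric, not training accuracy
    patience=3,
    restore_best_weights=True)   # roll back to the best epoch when stopping

history = model.fit(
    x_train, y_train,
    validation_split=0.2,        # hold out 20% of the training data for validation
    epochs=50,
    callbacks=[es])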
Did using built-in callbacks resolve your problem?

KeyError: 'val_loss' when training model

I am training a model with Keras and am getting an error from a callback in the fit_generator function. I always get to the 3rd epoch and then get this error:
annotation_path = 'train2.txt'
log_dir = 'logs/000/'
classes_path = 'model_data/deplao_classes.txt'
anchors_path = 'model_data/yolo_anchors.txt'
class_names = get_classes(classes_path)
num_classes = len(class_names)
anchors = get_anchors(anchors_path)
input_shape = (416,416) # multiple of 32, hw
is_tiny_version = len(anchors)==6 # default setting
if is_tiny_version:
    model = create_tiny_model(input_shape, anchors, num_classes,
        freeze_body=2, weights_path='model_data/tiny_yolo_weights.h5')
else:
    model = create_model(input_shape, anchors, num_classes,
        freeze_body=2, weights_path='model_data/yolo_weights.h5') # make sure you know what you freeze
logging = TensorBoard(log_dir=log_dir)
checkpoint = ModelCheckpoint(log_dir + 'ep{epoch:03d}-loss{loss:.3f}-val_loss{val_loss:.3f}.h5',
monitor='val_loss', save_weights_only=True, save_best_only=True, period=3)
reduce_lr = ReduceLROnPlateau(monitor='val_loss', factor=0.1, patience=3, verbose=1)
early_stopping = EarlyStopping(monitor='val_loss', min_delta=0, patience=10, verbose=1)
[error]
Traceback (most recent call last):
File "train.py", line 194, in <module>
_main()
File "train.py", line 69, in _main
callbacks=[logging, checkpoint])
File "C:\Users\ilove\AppData\Roaming\Python\Python37\lib\site-packages\keras\legacy\interfaces.py", line 91, in wrapper
return func(*args, **kwargs)
File "C:\Users\ilove\AppData\Roaming\Python\Python37\lib\site-packages\keras\engine\training.py", line 1418, in fit_generator
initial_epoch=initial_epoch)
File "C:\Users\ilove\AppData\Roaming\Python\Python37\lib\site-packages\keras\engine\training_generator.py", line 251, in fit_generator
callbacks.on_epoch_end(epoch, epoch_logs)
File "C:\Users\ilove\AppData\Roaming\Python\Python37\lib\site-packages\keras\callbacks.py", line 79, in on_epoch_end
callback.on_epoch_end(epoch, logs)
File "C:\Users\ilove\AppData\Roaming\Python\Python37\lib\site-packages\keras\callbacks.py", line 429, in on_epoch_end
filepath = self.filepath.format(epoch=epoch + 1, **logs)
KeyError: 'val_loss'
Can anyone find the problem and help me?
Thanks in advance for your help.
This callback runs at the end of epoch 3 (because period=3).
checkpoint = ModelCheckpoint(log_dir + 'ep{epoch:03d}-loss{loss:.3f}-val_loss{val_loss:.3f}.h5',
monitor='val_loss', save_weights_only=True, save_best_only=True, period=3)
The error message is claiming that there is no val_loss in the logs variable when executing:
filepath = self.filepath.format(epoch=epoch + 1, **logs)
This would happen if fit is called without validation_data.
I would start by simplifying the path name for model checkpoint. It is probably enough to include the epoch in the name.
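A minimal sketch of both suggestions (the generator names and step counts are placeholders; the simplified filename only interpolates epoch, which is always available):

checkpoint = ModelCheckpoint(log_dir + 'ep{epoch:03d}.h5',
                             monitor='val_loss', save_weights_only=True,
                             save_best_only=True, period=3)

model.fit_generator(train_generator,
                    steps_per_epoch=steps_per_epoch,
                    validation_data=val_generator,    # without this, val_loss never appears in logs
                    validation_steps=validation_steps,
                    epochs=epochs,
                    callbacks=[logging, checkpoint])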
This answer doesn't apply to the question, but this was at the top of the Google results for keras "KeyError: 'val_loss'" so I'm going to share the solution for my problem.
The error was the same for me: when using val_loss in the checkpoint file name, I would get the following error: KeyError: 'val_loss'. My checkpointer was also monitoring this field, so even if I took the field out of the file name, I would still get this warning from the checkpointer: WARNING:tensorflow:Can save best model only with val_loss available, skipping.
In my case, the issue was that I was upgrading from using Keras and TensorFlow 1 separately to using the Keras that comes with TensorFlow 2. The period param for ModelCheckpoint had been replaced with save_freq. I erroneously assumed that save_freq behaved the same way, so I set save_freq=1, thinking this would save the model every epoch. However, the docs state:
save_freq: 'epoch' or integer. When using 'epoch', the callback saves the model after each epoch. When using integer, the callback saves the model at end of a batch at which this many samples have been seen since last saving. Note that if the saving isn't aligned to epochs, the monitored metric may potentially be less reliable (it could reflect as little as 1 batch, since the metrics get reset every epoch). Defaults to 'epoch'
Setting save_freq='epoch' solved the issue for me. Note: the OP was still using period, so this is definitely not what was causing their problem.
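For clarity, here is the difference (both are valid tf.keras ModelCheckpoint arguments; the filename is just an example):

# saves, and checks val_loss, once per epoch
cb = tf.keras.callbacks.ModelCheckpoint('model.h5', monitor='val_loss', save_freq='epoch')

# saves every 1000 samples/batches seen (depending on the TF version), not every epoch
cb = tf.keras.callbacks.ModelCheckpoint('model.h5', monitor='val_loss', save_freq=1000)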
Use val_accuracy in the filepath and checkpoint. If that still doesn't help, just restart the PC or Colab.
This error happens when we do not provide validation data to the model.
Also check the parameters of model.fit_generator (or model.fit): (train_data, steps_per_epoch, validation_data, validation_steps, epochs, initial_epoch, callbacks).
For me the problem was that I was trying to set the initial_epoch (in model.fit) to a value other than the standard 0. I was doing so because I'm running model.fit in a loop that runs 10 epochs each cycle, then retrieves history data, checks if loss has decreased and runs model.fit again until it's satisfied.
I thought I had to update the value since I was resuming the previous model, but apparently not...
switch = True
epoch = 0
wait = 0
previous = 10E+10

while switch:
    history = model.fit(X, y, batch_size=1, epochs=step, verbose=False)
    epoch += step
    current = history.history["loss"][-1]
    if current >= previous:
        wait += 1
        if wait >= tolerance:
            switch = False
    else:
        wait = 0
    if epoch >= max_epochs:
        switch = False
    previous = current
In my case, the val_generator was broken because the Colab notebook failed to read the images from Google Drive. Re-running the cell that creates val_generator made it work again.
I had this error and didn't manage to find the cause of the bug anywhere online.
What was happening in my case was that I was asking for more training samples than I actually had. TF didn't give me an explicit error for that and it even provided me with a saved value for the loss. I only received the esoteric KeyError: "val_loss" when trying to save that.
Hope this helps someone sniff out their own silly bug if that's what's happening to them.
I do not know if this will work in all cases. But, for me I restarted my computer and it seemed to work.

Save Keras model at specific epochs

I am using Keras to do some training on my dataset, and it is time consuming to keep re-running it to find the number of epochs needed to get the best results. I tried using callbacks to get the best model, but it just does not work and usually stops too early. Also, saving every N epochs is not an option for me.
What I am trying to do is save the model after some specific epochs are done. Let's say, for example, after epoch = 150 is over it will be saved as model.save(model_1.h5), and after epoch = 152 it will be saved as model.save(model_2.h5), etc., for a few specific epochs.
Is there a way to implement this in Keras ? I already searched for a method but no luck so far.
Thank you for any help/suggestion.
Edit
In most cases it's enough to use the name formatting suggested by @Toan Tran in his answer.
But if you need some sophisticated logic, you can use a callback, for example
import keras
class CustomSaver(keras.callbacks.Callback):
    def on_epoch_end(self, epoch, logs={}):
        if epoch == 2:  # or save after some epoch, each k-th epoch etc.
            self.model.save("model_{}.hd5".format(epoch))
on_epoch_end is called at the end of each epoch; epoch is the epoch number, and the latter argument is the logs dict (you can read about the other callback methods in the docs). Put your logic into this method (in the example it's as simple as possible).
Create a saver object and pass it to the fit method:
import keras
import numpy as np
inp = keras.layers.Input(shape=(10,))
dense = keras.layers.Dense(10, activation='relu')(inp)
out = keras.layers.Dense(1, activation='sigmoid')(dense)
model = keras.models.Model(inp, out)
model.compile(optimizer="adam", loss="binary_crossentropy",)
# Just a noise data for fast working example
X = np.random.normal(0, 1, (1000, 10))
y = np.random.randint(0, 2, 1000)
# create and use callback:
saver = CustomSaver()
model.fit(X, y, callbacks=[saver], epochs=5)
In the bash:
!ls
Out:
model_2.hd5
So, it works.
checkpoint = keras.callbacks.ModelCheckpoint('model{epoch:08d}.h5', period=5)
model.fit(X_train, Y_train, callbacks=[checkpoint])
Did you try ModelCheckpoint? period=5 means the model is saved after every 5 epochs.
You can find more details in the Keras documentation.
Hope this helps :)
Well, I can't comment on posts yet, so I'm adding on to @Toan Tran's answer. With the latest version of Keras, the argument period is deprecated; instead, we can use save_freq.
In the following example, the model is saved after every epoch.
checkpoint = keras.callbacks.ModelCheckpoint(model_save_path+'/checkpoint_{epoch:02d}', save_freq='epoch')
H=model.fit(x=x_train, y=y_train,epochs=epoch_no,verbose=2, callbacks=[checkpoint])
You can find more details in the Keras documentation.

ML Engine Experiment eval tf.summary.scalar not displaying in tensorboard

I am trying to output some summary scalars in an ML engine experiment at both train and eval time. tf.summary.scalar('loss', loss) is correctly outputting the summary scalars for both training and evaluation on the same plot in tensorboard. However, I am also trying to output other metrics at both train and eval time and they are only outputting at train time. The code immediately follows tf.summary.scalar('loss', loss) but does not appear to work. For example, the code as follows is only outputting for TRAIN, but not EVAL. The only difference is that these are using custom accuracy functions, but they are working for TRAIN
if mode in (Modes.TRAIN, Modes.EVAL):
    loss = tf.contrib.legacy_seq2seq.sequence_loss(logits, outputs, weights)
    tf.summary.scalar('loss', loss)
    sequence_accuracy = sequence_accuracy(targets, predictions, weights)
    tf.summary.scalar('sequence_accuracy', sequence_accuracy)
Does it make any sense why loss would plot in tensorboard for both TRAIN & EVAL, while sequence_accuracy would only plot for TRAIN?
Could this behavior somehow be related to the warning I received "Found more than one metagraph event per run. Overwriting the metagraph with the newest event."?
Because the summary node in the graph is just a node: it still needs to be evaluated (outputting a protobuf string), and that string still needs to be written to a file. It's not evaluated in training mode because it's not upstream of the train_op in your graph, and even if it were evaluated, it wouldn't be written to a file unless you specified a tf.train.SummarySaverHook as one of your training_chief_hooks in your EstimatorSpec. Because the Estimator class doesn't assume you want any extra evaluation during training, evaluation is normally only done during the EVAL phase, and you just increase min_eval_frequency or checkpoint_frequency to get more evaluation data points.
If you really really want to log a summary during training here's how you'd do it:
def model_fn(mode, features, labels, params):
    ...
    if mode == Modes.TRAIN:
        # loss is already written out during training, don't duplicate the summary op
        loss = tf.contrib.legacy_seq2seq.sequence_loss(logits, outputs, weights)
        sequence_accuracy = sequence_accuracy(targets, predictions, weights)
        seq_sum_op = tf.summary.scalar('sequence_accuracy', sequence_accuracy)
        with tf.control_dependencies([seq_sum_op]):
            train_op = optimizer.minimize(loss)
        return tf.estimator.EstimatorSpec(
            loss=loss,
            mode=mode,
            train_op=train_op,
            training_chief_hooks=[tf.train.SummarySaverHook(
                save_steps=100,
                output_dir='./summaries',
                summary_op=seq_sum_op
            )]
        )
But it's better to just increase your eval frequency and make an eval_metric_ops for accuracy with tf.metrics.streaming_accuracy
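A rough sketch of that alternative, using the TF1 Estimator API (tf.metrics.accuracy is the core streaming-accuracy metric; it returns the (value, update_op) pair that EstimatorSpec expects in eval_metric_ops, and the names mirror the snippet above):

if mode == Modes.EVAL:
    eval_metric_ops = {
        'sequence_accuracy': tf.metrics.accuracy(
            labels=targets, predictions=predictions, weights=weights)
    }
    return tf.estimator.EstimatorSpec(
        mode=mode,
        loss=loss,
        eval_metric_ops=eval_metric_ops)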
