How can we perform early stopping with train_on_batch?

How can we perform early stopping with train_on_batch? - python

I manually run the epochs in a loop, as well as further nested mini-batches in the loop. At each mini-batch, I need to call train_on_batch, to enable the training of a customized model.
Is there a manual way to restore the functionality of early stopping, i.e. to break the loop?

In practice, 'early stopping' is largely done via: (1) train for X epochs, (2) save the model each time it achieves a new best performance, (3) select the best model. "Best performance" defined as achieving the highest (e.g. accuracy) or lowest (e.g. loss) validation metric - example script below:
best_val_loss = 999 # arbitrary init - should be high if 'best' is low, and vice versa
num_epochs = 5
epoch = 0
while epoch < num_epochs:
model.train_on_batch(x_train, y_train) # get x, y somewhere in the loop
val_loss = model.evaluate(x_val, y_val)
if val_loss < best_val_loss:
model.save(best_model_path) # OR model.save_weights()
print("Best model w/ val loss {} saved to {}".format(val_loss, best_model_path))
# ...
epoch += 1
See saving Keras models. If you rather early-stop directly, then define some metric - i.e. condition - that'll end the train loop. For example,
while True:
loss = model.train_on_batch(...)
if loss < .02:
break

Related

How to seperate code into train, val and test functions for pytorch cnn?

I am training a cnn using pytorch and have created a training loop. As I am performing optimisation and experimenting with hyper-parameter tuning, I want to separate my training, validation and testing into different functions. I need to be able to record my accuracy and loss for each function in order to plot graphs. For this I want to create a function which returns the accuracy.
I am pretty new to coding and was wondering the best way to go about this. I feel like my code is a bit messy at the moment. I need to be able to feed in various hyper-parameters for experimentation in my training function. Could anyone offer any advice? Below is what I can so far:
def train_model(model, optimizer, data_loader, num_epochs, criterion=criterion):
total_epochs = notebook.tqdm(range(num_epochs))
for epoch in total_epochs:
model.train()
train_correct = 0.0
train_running_loss=0.0
train_total=0.0
for i, (img, label) in enumerate(data_loader['train']):
#uploading images and labels to GPU
img = img.to(device)
label = label.to(device)
#training model
outputs = model(img)
#computing losss
loss = criterion(outputs, label)
#propagating the loss backwards
optimizer.zero_grad()
loss.backward()
optimizer.step()
train_running_loss += loss.item()
_, predicted = outputs.max(1)
train_total += label.size(0)
train_correct += predicted.eq(label).sum().item()
train_loss=train_running_loss/len(data_loader['train'])
train_accu=100.*correct/total
print('Train Loss: %.3f | Train Accuracy: %.3f'%(train_loss,train_accu))
I have also experimented with making a functions to record accuracy:
def accuracy(outputs, labels):
_, preds = torch.max(outputs, dim = 1)
return torch.tensor(torch.sum(preds == labels).item() / len(preds))

First, note that:
Unless you have some specific motivation, validation (and testing) should be performed on a different dataset than the training set, so you should use a different DataLoader. The computation time will increase because of an additional for loop at every epoch.
Always call model.eval() before validation/testing.
That said, The signature of the validation function is pretty much similar to that of train_model
# criterion is passed if you want to register the validation loss too
def validate_model(model, eval_loader, criterion):
...
Then, in train_model, after each epoch, you can call the function validate_model and store the returned metrics in some data structure (list, tensor, etc.) that will be used later for plotting.
At the end of the training, you can then use the same validate_model function for testing.
Instead of coding the accuracy by yourself, you can use Accuracy from TorchMetrics
Finally, if you feel the need to level up, you can use DL training frameworks like PyTorch Lightning or FastAI. Give also a look at some hyperparameter tuning library such as Ray Tune.

Custom Callback made in Keras to stop "Overfitting and Overtraining" all in one is taking too much time to save model and weight

I have made this CallBack to stop Overfitting specially and Overtraining secondly is taking too much time for model to save weights. SO the logic is that I want to save only and only that where current_val_loss is always less than min_val_loss AND less than min_train_loss. I want to save that weight as best_model_weight and restore that weight after the Training ends. Here is the code for that Callback:
class StopOverFit(tf.keras.callbacks.Callback):
'''
A class that stops Overfitting by saving the model/weights if val_loss < loss and val_loss < min_val_loss ever encountered
Adds the functionality of ModelCheckpoint and EarlyStopping too. Currently works only with "val_loss"
'''
def __init__(self,filename=None,weights_only=True,restore_weights=False,early_stop=True,
watch='val_loss',patience=5,stop_at=0.0001,min_delta=0.003):
'''
Constructor of the Callback
args:
filename: {str} should in the format "epoch: {epoch:02d}- val_loss: {val_loss:.3f}.hdf5"
weights_only: {bool} whether to save the whole model or just the weights
restore_weights: {bool} whether to restore the "best" weights at the end of training
early_stop: {bool} whether to stop he model before reaching the epochs described
watch: {str} which metri to watch out for
patience: {int} if model has not improved with there continuous epochs, stop model training
stop_at: {float} if the model has a metric less than equal to this value, stop training
min_delta: {float} If the difference between best and current is greater than this, do not consider it as improvement
'''
super(StopOverFit,self).__init__()
self.filename = filename
self.weights_only = weights_only
self.best_min_val_loss = np.inf # "best" means without overfitting
self.min_loss = np.inf # min train loss
self.restore = restore_weights
self.best_weights = None # for restoring weights
self.early_stop = early_stop
self.watch = watch # which metric to watch out for
self.patience = patience # how many epochs before stopping training
self.counter = 0 # to get the counter of how many epochs it's been since the last improvement
self.stop_at = stop_at # stop if the required metric has been achieved
self.delta = min_delta # if last saved loss has improved more than delta, say it a progress
def on_epoch_end(self, epoch, logs=None):
'''
Do something when each epoch begins like getting the logs dict and comparing the values
arguments are provided by the model itself
args:
epoch: {int} epoch number. starts with 0
logs: {dict} dictonary with metrices and losses as key. single value for each key
'''
val_loss = logs['val_loss']
loss = logs['loss']
val_f1 = logs['val_f1']
if loss<self.min_loss:
self.min_loss = loss
if (val_loss<self.best_min_val_loss) and (val_loss<self.min_loss):
print(f"\nbest val_loss improved from {self.best_min_val_loss:.4f} to {val_loss:.4f}. saving..")
self.best_min_val_loss = val_loss
self.best_weights = self.model.get_weights()
if self.weights_only:
if self.filename:
self.model.save_weights('weights-'+self.filename.format(val_loss=val_loss,epoch=epoch+1,val_f1=val_f1))
else:
self.model.save_weights('best_weights.h5')
else:
if self.filename:
self.model.save('model-'+self.filename.format(val_loss=val_loss,epoch=epoch+1,val_f1=val_f1))
else:
self.model.save('best_model.h5')
def on_train_end(self, logs=None):
'''
Do something when when the model end. We'll restore the best_weights
args:
logs: {dict} dictonary with metrices and losses as key. key:list pair
'''
if self.restore:
print("\nend of training. Restoring Best Weights")
self.model.set_weights(self.best_weights)
Can someone help me understand why this is happening.

I would recommend not setting the criteria to save the weights based on both the validation loss being the lowest thus far AND being below the training loss. Generally the training loss of the previous epoch may be lower than the validation loss for the current epoch, especially in the later epochs. You want to continue training as long as the value loss continues to decrease. To do that you can just use the existing Keras callback EarlyStopping. Monitor validation loss and set restore_best_weights=True. I also recommend that you include the callback ReduceLROnPlateau. Set it to monitor validation loss. Documentation is here.

PyTorch: is there a definitive training loop similar to Keras' fit()?

I'm coming over from Keras to PyTorch, and one of the surprising things I've found is that I'm supposed to implement my own training loop.
In Keras, there is a de facto fit() function that: (1) runs gradient descent and (2) collects a history of metrics for loss and accuracy over both the training set and validation set.
In PyTorch, it appears that the programmer needs to implement the training loop. Since I'm new to PyTorch, I don't know if my training loop implementation is correct. I just want to compare apples-to-apples loss and accuracy metrics with what I'm seeing in Keras.
I've already read through:
the official PyTorch 60-minute blitz, where they provide a sample training loop.
official PyTorch example code, where I've found the training loop placed in-line with other code.
the O'Reilly book Programming PyTorch for Deep Learning with its own training loop.
Stanford CS230 sample code.
various blog posts (e.g. here and here).
So I'm wondering: is there a definitive, universal training loop implementation that does the same thing and reports the same numbers as the Keras fit() function?
My points of frustration:
Pulling data out of the dataloader is not consistent between image data and NLP data.
Correctly computing loss and accuracy is not consistent in any sample code I've seen.
Some code examples use Variable, while others do not.
Unnecessarily detailed: moving data to/from the GPU; knowing when to call zero_grad().
For what it's worth, here is my current implementation. Are there any obvious bugs?
import time
def train(model, optimizer, loss_fn, train_dl, val_dl, epochs=20, device='cuda'):
'''
Runs training loop for classification problems. Returns Keras-style
per-epoch history of loss and accuracy over training and validation data.
Parameters
----------
model : nn.Module
Neural network model
optimizer : torch.optim.Optimizer
Search space optimizer (e.g. Adam)
loss_fn :
Loss function (e.g. nn.CrossEntropyLoss())
train_dl :
Iterable dataloader for training data.
val_dl :
Iterable dataloader for validation data.
epochs : int
Number of epochs to run
device : string
Specifies 'cuda' or 'cpu'
Returns
-------
Dictionary
Similar to Keras' fit(), the output dictionary contains per-epoch
history of training loss, training accuracy, validation loss, and
validation accuracy.
'''
print('train() called: model=%s, opt=%s(lr=%f), epochs=%d, device=%s\n' % \
(type(model).__name__, type(optimizer).__name__,
optimizer.param_groups[0]['lr'], epochs, device))
history = {} # Collects per-epoch loss and acc like Keras' fit().
history['loss'] = []
history['val_loss'] = []
history['acc'] = []
history['val_acc'] = []
start_time_sec = time.time()
for epoch in range(epochs):
# --- TRAIN AND EVALUATE ON TRAINING SET -----------------------------
model.train()
train_loss = 0.0
num_train_correct = 0
num_train_examples = 0
for batch in train_dl:
optimizer.zero_grad()
x = batch[0].to(device)
y = batch[1].to(device)
yhat = model(x)
loss = loss_fn(yhat, y)
loss.backward()
optimizer.step()
train_loss += loss.data.item() * x.size(0)
num_train_correct += (torch.max(yhat, 1)[1] == y).sum().item()
num_train_examples += x.shape[0]
train_acc = num_train_correct / num_train_examples
train_loss = train_loss / len(train_dl.dataset)
# --- EVALUATE ON VALIDATION SET -------------------------------------
model.eval()
val_loss = 0.0
num_val_correct = 0
num_val_examples = 0
for batch in val_dl:
x = batch[0].to(device)
y = batch[1].to(device)
yhat = model(x)
loss = loss_fn(yhat, y)
val_loss += loss.data.item() * x.size(0)
num_val_correct += (torch.max(yhat, 1)[1] == y).sum().item()
num_val_examples += y.shape[0]
val_acc = num_val_correct / num_val_examples
val_loss = val_loss / len(val_dl.dataset)
print('Epoch %3d/%3d, train loss: %5.2f, train acc: %5.2f, val loss: %5.2f, val acc: %5.2f' % \
(epoch+1, epochs, train_loss, train_acc, val_loss, val_acc))
history['loss'].append(train_loss)
history['val_loss'].append(val_loss)
history['acc'].append(train_acc)
history['val_acc'].append(val_acc)
# END OF TRAINING LOOP
end_time_sec = time.time()
total_time_sec = end_time_sec - start_time_sec
time_per_epoch_sec = total_time_sec / epochs
print()
print('Time total: %5.2f sec' % (total_time_sec))
print('Time per epoch: %5.2f sec' % (time_per_epoch_sec))
return history

Short answer: there is no equivalent training loop for PT and TF.keras and there shall never be one.
First of all, the training loop is syntactical sugar that is supposed to makes one's life easier. From my point of view, "making life easier" is a moto of TF.keras framework and this is the main reason it has it. Training loop can not be formalized as well defined practice, it might vary a lot depending on the task/dataset/procedure/metric/you_name_it and it would require a lot of effort to match all the options for 2 frameworks. Furthermore, creating a defining interface for training loop in Pytorch might be too restrictive for many actual users of the framework.
Matching the outputs of network would require matching behaviors of every operation within 2 frameworks, which would be impossible. First of all, the frameworks don't necessarily provide same sets of operations. Operations can be grouped into higher level abstracts differently. Also, some common functions like sigmoid or BatchNorm might look well mathematically defined on paper, but in reality have dozens of implementation specific details. Also, when improvements are introduced to the operations it is up to the community to integrate these updates into main framework distributions or plane ignore them. Needless to say, developers of 2 frameworks make these decisions independently and likely have different motivation behind them.
To sum it all up, matching high level details of 2 frameworks would require enormous effort and would probably be very disruptive for the existing users.

Indeed, the Pytorch Module class (source code) doesn't have a fit() method, so you have to implement your own according to your needs.
However there are some implementations which mimic the Keras training API, such as this one:
https://github.com/ncullen93/torchsample
or a simpler one:
https://github.com/henryre/pytorch-fitmodule

A close thing to Keras model.fit in Pytorch is Pytorch Extension called Torchbearer.
From the MNIST example notebook:
trial = Trial(model, optimizer, loss, metrics=['acc', 'loss'], callbacks=callbacks).to(device)
trial.with_generators(train_generator=traingen, val_generator=valgen, test_generator=testgen)
history = trial.run(epochs=5, verbose=1)
the similarity is there, although the usage requiers some reading.
Best of luck!

I wanted to point out that there is now a fit(...) equivalent. PyTorch Lightning is a wrapper around PyTorch that allows for a clean object-oriented approach to creating ML models in PyTorch. It provides a fit(...) loop using their Trainer class. I would checkout their official website for a more detailed answer.

How do I get a loss per epoch and not per batch?

In my understanding an epoch is an arbitrarily often repeated run over the whole dataset, which in turn is processed in parts, so called batches. After each train_on_batch a loss is calculated, the weights are updated and the next batch will get better results. These losses are indicators of the quality and learning state of my to NNs.
In several sources the loss is calculated (and printed) per epoch. Therefore I am not sure if I am doing this right.
At the moment my GAN looks like this:
for epoch:
for batch:
fakes = generator.predict_on_batch(batch)
dlc = discriminator.train_on_batch(batch, ..)
dlf = discriminator.train_on_batch(fakes, ..)
dis_loss_total = 0.5 * np.add(dlc, dlf)
g_loss = gan.train_on_batch(batch,..)
# save losses to array to work with later
These losses are for each batch. How do I get them for an epoch? As an aside: Do I need losses for an epoch, what for?

There is no direct way to compute the loss for an epoch. Actually, the loss of an epoch is usually defined as the average of the loss of batches in that epoch. So you can accumulate the loss values during an epoch and at the end divide it by the number of batches in the epoch:
epoch_loss = []
for epoch in range(n_epochs):
acc_loss = 0.
for batch in range(n_batches):
# do the training
loss = model.train_on_batch(...)
acc_loss += loss
epoch_loss.append(acc_loss / n_batches)
As for the other question, one usage of epoch loss might be to use it as an indicator to stop the training (however, the validation loss is usually used for that, not the training loss).

I'll expand on #today answer a bit. There is a certain balance to strike in how to report loss over an epoch and how to use it to determine when training should stop.
If you only look at the loss of the most recent batch, it will be a very noisy estimate of your dataset loss because maybe that batch happened to store all the samples your model has trouble with, or all the samples that are trivial to succeed on.
If you look at the averaged loss over all batches in the epoch, you may get a skewed response because, like you indicated, the model has been (hopefully) improving over the epoch, so the performance on the initial batches aren't as meaningfully compared to the performance on the later batches.
The only way to accurately report your epoch loss is to take your model out of training mode, i.e. fix all the model parameters, and run your model on the whole dataset. That will be an unbiased computation of your epoch loss. However, in general that's a terrible idea because if you have a complex model or a lot of training data, you will waste a lot of time doing this.
So, I think it's most common to balance these factors by reporting an averaged loss over N mini-batches, where N is large enough to smooth out the noise of individual batches but not so large that the model performance is not comparable between the first and last batches.
I know you're in Keras but here is a PyTorch example that illustrates this concept clearly, replicated here:
for epoch in range(2): # loop over the dataset multiple times
running_loss = 0.0
for i, data in enumerate(trainloader, 0):
# get the inputs; data is a list of [inputs, labels]
inputs, labels = data
# zero the parameter gradients
optimizer.zero_grad()
# forward + backward + optimize
outputs = net(inputs)
loss = criterion(outputs, labels)
loss.backward()
optimizer.step()
# print statistics
running_loss += loss.item()
if i % 2000 == 1999: # print every 2000 mini-batches
print('[%d, %5d] loss: %.3f' %
(epoch + 1, i + 1, running_loss / 2000))
running_loss = 0.0
print('Finished Training')
You can see they accumulate the loss over N=2000 batches, report the averaged loss over those 2000 batches, then zero out the running loss and keep going.

Keras: Optimal epoch selection

I'm trying to write some logic that selects the best epoch to run a neural network in Keras. My code saves the training loss and the test loss for a set number of epochs and then picks the best fitting epoch according to some logic. The code looks like this:
ini_epochs = 100
df_train_loss = DataFrame(data=history.history['loss'], columns=['Train_loss']);
df_test_loss = DataFrame(data=history.history['val_loss'], columns=['Test_loss']);
df_loss = concat([df_train_loss,df_test_loss], axis=1)
Min_loss = max(df_loss['Test_loss'])
for i in range(ini_epochs):
Test_loss = df_loss['Test_loss'][i];
Train_loss = df_loss['Train_loss'][i];
if Test_loss > Train_loss and Test_loss < Min_loss:
Min_loss = Test_loss;
The idea behind the logic is this; to get the best model, the epoch selected should select the model with the lowest loss value, but it must be above the training loss value to avoid overfitting.
In general, this epoch selection method works OK. However, if the test loss value is below the train loss from the start, then this method picks an epoch of zero (see below).
Now I could add another if statement assessing whether the difference between the test and train losses are positive or negative, and then write logic for each case, but what happens if the difference starts positive and then ends up negative. I get confused and haven't been able to write effective code.
So, my questions are:
1) Can you show me how you what code you would write to to account for the situation show in the graph (and for the case where the test and train loss curves cross). I'd say the strategy would be to take the value that with the minimum difference.
2) There is a good chance that I'm going about this the wrong way. I know Keras has a callbacks feature but I don't like the idea of using the save_best_only feature because it can save overfitted models. Any advice on a more efficient epoch selection method would be great.

Use EarlyStopping which is available in Keras. Early stopping is basically stopping the training once your loss starts to increase (or in other words validation accuracy starts to decrease). use ModelCheckpoint to save the model wherever you want.
from keras.callbacks import EarlyStopping, ModelCheckpoint
STAMP = 'simple_lstm_glove_vectors_%.2f_%.2f'%(rate_drop_lstm,rate_drop_dense)
early_stopping =EarlyStopping(monitor='val_loss', patience=5)
bst_model_path = STAMP + '.h5'
model_checkpoint = ModelCheckpoint(bst_model_path, save_best_only=True, save_weights_only=True)
hist = model.fit(data_train, labels_train, \
validation_data=(data_val, labels_val), \
epochs=50, batch_size=256, shuffle=True, \
callbacks=[early_stopping, model_checkpoint])
model.load_weights(bst_model_path)
refer to this link for more info

Here is a simple example illustrate how to use early stooping in Keras:
First necessarily import:
from keras.callbacks import EarlyStopping, ModelCheckpoint
Setup Early Stopping
# Set callback functions to early stop training and save the best model so far
callbacks = [EarlyStopping(monitor='val_loss', patience=2),
ModelCheckpoint(filepath='best_model.h5', monitor='val_loss', save_best_only=True)]
Train neural network
history = network.fit(train_features, # Features
train_target, # Target vector
epochs=20, # Number of epochs
callbacks=callbacks, # Early stopping
verbose=0, # Print description after each epoch
batch_size=100, # Number of observations per batch
validation_data=(test_features, test_target)) # Data for evaluation
See the full example here.
Please also check :Stop Keras Training when the network has fully converge; the best answer of Daniel.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.