I'm coming over from Keras to PyTorch, and one of the surprising things I've found is that I'm supposed to implement my own training loop.
In Keras, there is a de facto fit() function that: (1) runs gradient descent and (2) collects a history of metrics for loss and accuracy over both the training set and validation set.
In PyTorch, it appears that the programmer needs to implement the training loop. Since I'm new to PyTorch, I don't know if my training loop implementation is correct. I just want to compare apples-to-apples loss and accuracy metrics with what I'm seeing in Keras.
I've already read through:
the official PyTorch 60-minute blitz, where they provide a sample training loop.
official PyTorch example code, where I've found the training loop placed in-line with other code.
the O'Reilly book Programming PyTorch for Deep Learning with its own training loop.
Stanford CS230 sample code.
various blog posts (e.g. here and here).
So I'm wondering: is there a definitive, universal training loop implementation that does the same thing and reports the same numbers as the Keras fit() function?
My points of frustration:
Pulling data out of the dataloader is not consistent between image data and NLP data.
Correctly computing loss and accuracy is not consistent in any sample code I've seen.
Some code examples use Variable, while others do not.
Unnecessarily detailed: moving data to/from the GPU; knowing when to call zero_grad().
For what it's worth, here is my current implementation. Are there any obvious bugs?
import time
def train(model, optimizer, loss_fn, train_dl, val_dl, epochs=20, device='cuda'):
'''
Runs training loop for classification problems. Returns Keras-style
per-epoch history of loss and accuracy over training and validation data.
Parameters
----------
model : nn.Module
Neural network model
optimizer : torch.optim.Optimizer
Search space optimizer (e.g. Adam)
loss_fn :
Loss function (e.g. nn.CrossEntropyLoss())
train_dl :
Iterable dataloader for training data.
val_dl :
Iterable dataloader for validation data.
epochs : int
Number of epochs to run
device : string
Specifies 'cuda' or 'cpu'
Returns
-------
Dictionary
Similar to Keras' fit(), the output dictionary contains per-epoch
history of training loss, training accuracy, validation loss, and
validation accuracy.
'''
print('train() called: model=%s, opt=%s(lr=%f), epochs=%d, device=%s\n' % \
(type(model).__name__, type(optimizer).__name__,
optimizer.param_groups[0]['lr'], epochs, device))
history = {} # Collects per-epoch loss and acc like Keras' fit().
history['loss'] = []
history['val_loss'] = []
history['acc'] = []
history['val_acc'] = []
start_time_sec = time.time()
for epoch in range(epochs):
# --- TRAIN AND EVALUATE ON TRAINING SET -----------------------------
model.train()
train_loss = 0.0
num_train_correct = 0
num_train_examples = 0
for batch in train_dl:
optimizer.zero_grad()
x = batch[0].to(device)
y = batch[1].to(device)
yhat = model(x)
loss = loss_fn(yhat, y)
loss.backward()
optimizer.step()
train_loss += loss.data.item() * x.size(0)
num_train_correct += (torch.max(yhat, 1)[1] == y).sum().item()
num_train_examples += x.shape[0]
train_acc = num_train_correct / num_train_examples
train_loss = train_loss / len(train_dl.dataset)
# --- EVALUATE ON VALIDATION SET -------------------------------------
model.eval()
val_loss = 0.0
num_val_correct = 0
num_val_examples = 0
for batch in val_dl:
x = batch[0].to(device)
y = batch[1].to(device)
yhat = model(x)
loss = loss_fn(yhat, y)
val_loss += loss.data.item() * x.size(0)
num_val_correct += (torch.max(yhat, 1)[1] == y).sum().item()
num_val_examples += y.shape[0]
val_acc = num_val_correct / num_val_examples
val_loss = val_loss / len(val_dl.dataset)
print('Epoch %3d/%3d, train loss: %5.2f, train acc: %5.2f, val loss: %5.2f, val acc: %5.2f' % \
(epoch+1, epochs, train_loss, train_acc, val_loss, val_acc))
history['loss'].append(train_loss)
history['val_loss'].append(val_loss)
history['acc'].append(train_acc)
history['val_acc'].append(val_acc)
# END OF TRAINING LOOP
end_time_sec = time.time()
total_time_sec = end_time_sec - start_time_sec
time_per_epoch_sec = total_time_sec / epochs
print()
print('Time total: %5.2f sec' % (total_time_sec))
print('Time per epoch: %5.2f sec' % (time_per_epoch_sec))
return history
Short answer: there is no equivalent training loop for PT and TF.keras and there shall never be one.
First of all, the training loop is syntactical sugar that is supposed to makes one's life easier. From my point of view, "making life easier" is a moto of TF.keras framework and this is the main reason it has it. Training loop can not be formalized as well defined practice, it might vary a lot depending on the task/dataset/procedure/metric/you_name_it and it would require a lot of effort to match all the options for 2 frameworks. Furthermore, creating a defining interface for training loop in Pytorch might be too restrictive for many actual users of the framework.
Matching the outputs of network would require matching behaviors of every operation within 2 frameworks, which would be impossible. First of all, the frameworks don't necessarily provide same sets of operations. Operations can be grouped into higher level abstracts differently. Also, some common functions like sigmoid or BatchNorm might look well mathematically defined on paper, but in reality have dozens of implementation specific details. Also, when improvements are introduced to the operations it is up to the community to integrate these updates into main framework distributions or plane ignore them. Needless to say, developers of 2 frameworks make these decisions independently and likely have different motivation behind them.
To sum it all up, matching high level details of 2 frameworks would require enormous effort and would probably be very disruptive for the existing users.
Indeed, the Pytorch Module class (source code) doesn't have a fit() method, so you have to implement your own according to your needs.
However there are some implementations which mimic the Keras training API, such as this one:
https://github.com/ncullen93/torchsample
or a simpler one:
https://github.com/henryre/pytorch-fitmodule
A close thing to Keras model.fit in Pytorch is Pytorch Extension called Torchbearer.
From the MNIST example notebook:
trial = Trial(model, optimizer, loss, metrics=['acc', 'loss'], callbacks=callbacks).to(device)
trial.with_generators(train_generator=traingen, val_generator=valgen, test_generator=testgen)
history = trial.run(epochs=5, verbose=1)
the similarity is there, although the usage requiers some reading.
Best of luck!
I wanted to point out that there is now a fit(...) equivalent. PyTorch Lightning is a wrapper around PyTorch that allows for a clean object-oriented approach to creating ML models in PyTorch. It provides a fit(...) loop using their Trainer class. I would checkout their official website for a more detailed answer.
Related
I am training a cnn using pytorch and have created a training loop. As I am performing optimisation and experimenting with hyper-parameter tuning, I want to separate my training, validation and testing into different functions. I need to be able to record my accuracy and loss for each function in order to plot graphs. For this I want to create a function which returns the accuracy.
I am pretty new to coding and was wondering the best way to go about this. I feel like my code is a bit messy at the moment. I need to be able to feed in various hyper-parameters for experimentation in my training function. Could anyone offer any advice? Below is what I can so far:
def train_model(model, optimizer, data_loader, num_epochs, criterion=criterion):
total_epochs = notebook.tqdm(range(num_epochs))
for epoch in total_epochs:
model.train()
train_correct = 0.0
train_running_loss=0.0
train_total=0.0
for i, (img, label) in enumerate(data_loader['train']):
#uploading images and labels to GPU
img = img.to(device)
label = label.to(device)
#training model
outputs = model(img)
#computing losss
loss = criterion(outputs, label)
#propagating the loss backwards
optimizer.zero_grad()
loss.backward()
optimizer.step()
train_running_loss += loss.item()
_, predicted = outputs.max(1)
train_total += label.size(0)
train_correct += predicted.eq(label).sum().item()
train_loss=train_running_loss/len(data_loader['train'])
train_accu=100.*correct/total
print('Train Loss: %.3f | Train Accuracy: %.3f'%(train_loss,train_accu))
I have also experimented with making a functions to record accuracy:
def accuracy(outputs, labels):
_, preds = torch.max(outputs, dim = 1)
return torch.tensor(torch.sum(preds == labels).item() / len(preds))
First, note that:
Unless you have some specific motivation, validation (and testing) should be performed on a different dataset than the training set, so you should use a different DataLoader. The computation time will increase because of an additional for loop at every epoch.
Always call model.eval() before validation/testing.
That said, The signature of the validation function is pretty much similar to that of train_model
# criterion is passed if you want to register the validation loss too
def validate_model(model, eval_loader, criterion):
...
Then, in train_model, after each epoch, you can call the function validate_model and store the returned metrics in some data structure (list, tensor, etc.) that will be used later for plotting.
At the end of the training, you can then use the same validate_model function for testing.
Instead of coding the accuracy by yourself, you can use Accuracy from TorchMetrics
Finally, if you feel the need to level up, you can use DL training frameworks like PyTorch Lightning or FastAI. Give also a look at some hyperparameter tuning library such as Ray Tune.
I am trying to optimize a given neural network (ex Perceptron Multilayer, with 2 hidden layers), by finding the number of epoch and batch that give the highest accuracy.
for epoch from 10 to 200 (in steps of 10):
for batch from 40 to 200 (in steps of 20):
modele.fit (X_train, Y_train, epochs = epoch, batch_size = batch)
I save batch, epoch, Accuracy;
Afterwards I kept the smallest epoch with the smallest corresponding batch which has the highest recognition
ex best_params: epoch = 10, batch = 150 => Accuracy = 94%
My problem is that when I re-run my model with the best_params, it doesn't give me the same results (loss, accuracy), even sometimes very low accuracy (eg 10%).
i try to fix seed, but no best result
Regards
Djam75
df=pd.DataFrame(columns=['Nb_Batch','Nb_Epoch','Accuracy'])
i=0
lst_loss=[]
lst_accuracy=[]
lst_epoch=list(np.arange(10,200,10))
lst_batch=list(np.arange(100,400,20))
for epoch in lst_epoch:
print ('---------------- Epoch ' + str(epoch)+ '------------------')
for batch in lst_batch:
modelSimple.fit(X_train, Y_train, nb_epoch = epoch, batch_size = batch, verbose = 0)
score = modelSimple.evaluate(X_test, Y_test)
df.loc[i,"Nb_Batch"]=batch
df.loc[i,"Nb_Epoch"]=epoch
df.loc[i,"Accuracy"]=score[1]*100
i=i+1
This might be happening due to random parameter initialization. Because if you are building an end-to-end model without transfer learn the weights, every time you training architecture get random values for its parameters.
In this case, a good practice is to use batch normalization layers after some layers according to your architecture.
tensoflow-implementation
pytorch-implmentation
extra idea:
Do not use any 'for', 'while' loops in the model implementation.
you can follow templates in TensorFlow or PyTorch.
OR, if you build a complete model from scratch, vectorize operations by using NumPy like metrics operation library.
Thanks for the update.
I resolve my probelm by saving a model and load it after.
thaks for idea (batch normalization ) and extra idea : not user any for ;-)
regards
I think you might not be updating the weight matrix after completing the training for certain batch sizes and epochs.
Please include the code as well in order to see the problem
I'm training a neural network (doesn't matter which one) on CIFAR-10 dataset. I'm using Federated Learning:
I have 10 models, each model having access to its own part of the dataset. At every time step, each model makes a step using its own data, and then the global model is an average of the model (this version is based on this, but I tried a lot of options):
def server_aggregate(server_model, client_models):
global_dict = server_model.state_dict()
for k in global_dict.keys():
global_dict[k] = torch.stack([client_models[i].state_dict()[k].float() for i in range(len(client_models))], 0).mean(0)
server_model.load_state_dict(global_dict)
for model in client_models:
model.load_state_dict(server_model.state_dict())
To be specific, each machine only has access to a data corresponding to a single class. I.e. machine 0 has only samples corresponding to class 0, etc. I'm doing it the following way:
def split_into_classes(full_ds, batch_size, num_classes=10):
class2indices = [[] for _ in range(num_classes)]
for i, y in enumerate(full_ds.targets):
class2indices[y].append(i)
datasets = [torch.utils.data.Subset(full_ds, indices) for indices in class2indices]
return [DataLoader(ds, batch_size=batch_size, shuffle=True) for ds in datasets]
Problem. During training, I can see that my federated training loss decreases. However, I never see my test loss/accuracy improve (acc is always around 10%).
Moreover, when I check accuracy on train/test datasets:
For the federated dataset, the accuracy improves.
For the testing dataset, the accuracy doesn't improve.
(Most surprising) for the training dataset, the accuracy doesn't improve. Note that this dataset is essentially the same as federated dataset, but not split into classes. The checking code is the following:
def epoch_summary(model, fed_loaders, true_train_loader, test_loader, frac):
with torch.no_grad():
train_len = 0
train_loss, train_acc = 0, 0
for train_loader in fed_loaders:
cur_loss, cur_acc, cur_len = true_results(model, train_loader, frac)
train_loss += cur_len * cur_loss
train_acc += cur_len * cur_acc
train_len += cur_len
train_loss /= train_len
train_acc /= train_len
true_train_loss, true_train_acc, true_train_len = true_results(model, true_train_loader, frac)
test_loss, test_acc, test_len = true_results(model, test_loader, frac)
print("TrainLoss: {:.4f} TrainAcc: {:.2f} TrueLoss: {:.4f} TrueAcc: {:.2f} TestLoss: {:.4f} TestAcc: {:.2f}".format(
train_loss, train_acc, true_train_loss, true_train_acc, test_loss, test_acc
), flush=True)
The full code can be found here. Things which don't seem to matter:
Model. I got the same problem for Resnet models and for some other models.
How I aggregate the models. I tried using state_dict or directly manipulate model.parameters(), no effect.
How I learn the models. I tried using optim.SGD or directly update param.data -= learning_rate * param.grad, no effect.
Computational graph. I've tried adding .detach().clone() and with torch.no_grad() into all possible places, no effect.
So I'm suspecting that the problem is somehow with the federated data itself (especially given strange accuracy results). What can be a problem?
10% on CIFAR-10 is basically random - your model outputs labels at random and gets 10%.
I think the problem lies in your "federated training" strategy: you cannot expect your sub-models to learn anything meaningful when all they see is a single label. This is why training data is shuffled.
Think of it: if each of your sub models learns all weights to be zero apart from the bias vector of the last classification layer that has 1 in the entry corresponding to the class this sub-model sees - the training of each sub model is perfect (it gets it right for all training samples it sees), but the averaged model is meaningless.
I manually run the epochs in a loop, as well as further nested mini-batches in the loop. At each mini-batch, I need to call train_on_batch, to enable the training of a customized model.
Is there a manual way to restore the functionality of early stopping, i.e. to break the loop?
In practice, 'early stopping' is largely done via: (1) train for X epochs, (2) save the model each time it achieves a new best performance, (3) select the best model. "Best performance" defined as achieving the highest (e.g. accuracy) or lowest (e.g. loss) validation metric - example script below:
best_val_loss = 999 # arbitrary init - should be high if 'best' is low, and vice versa
num_epochs = 5
epoch = 0
while epoch < num_epochs:
model.train_on_batch(x_train, y_train) # get x, y somewhere in the loop
val_loss = model.evaluate(x_val, y_val)
if val_loss < best_val_loss:
model.save(best_model_path) # OR model.save_weights()
print("Best model w/ val loss {} saved to {}".format(val_loss, best_model_path))
# ...
epoch += 1
See saving Keras models. If you rather early-stop directly, then define some metric - i.e. condition - that'll end the train loop. For example,
while True:
loss = model.train_on_batch(...)
if loss < .02:
break
I'm trying to write some logic that selects the best epoch to run a neural network in Keras. My code saves the training loss and the test loss for a set number of epochs and then picks the best fitting epoch according to some logic. The code looks like this:
ini_epochs = 100
df_train_loss = DataFrame(data=history.history['loss'], columns=['Train_loss']);
df_test_loss = DataFrame(data=history.history['val_loss'], columns=['Test_loss']);
df_loss = concat([df_train_loss,df_test_loss], axis=1)
Min_loss = max(df_loss['Test_loss'])
for i in range(ini_epochs):
Test_loss = df_loss['Test_loss'][i];
Train_loss = df_loss['Train_loss'][i];
if Test_loss > Train_loss and Test_loss < Min_loss:
Min_loss = Test_loss;
The idea behind the logic is this; to get the best model, the epoch selected should select the model with the lowest loss value, but it must be above the training loss value to avoid overfitting.
In general, this epoch selection method works OK. However, if the test loss value is below the train loss from the start, then this method picks an epoch of zero (see below).
Now I could add another if statement assessing whether the difference between the test and train losses are positive or negative, and then write logic for each case, but what happens if the difference starts positive and then ends up negative. I get confused and haven't been able to write effective code.
So, my questions are:
1) Can you show me how you what code you would write to to account for the situation show in the graph (and for the case where the test and train loss curves cross). I'd say the strategy would be to take the value that with the minimum difference.
2) There is a good chance that I'm going about this the wrong way. I know Keras has a callbacks feature but I don't like the idea of using the save_best_only feature because it can save overfitted models. Any advice on a more efficient epoch selection method would be great.
Use EarlyStopping which is available in Keras. Early stopping is basically stopping the training once your loss starts to increase (or in other words validation accuracy starts to decrease). use ModelCheckpoint to save the model wherever you want.
from keras.callbacks import EarlyStopping, ModelCheckpoint
STAMP = 'simple_lstm_glove_vectors_%.2f_%.2f'%(rate_drop_lstm,rate_drop_dense)
early_stopping =EarlyStopping(monitor='val_loss', patience=5)
bst_model_path = STAMP + '.h5'
model_checkpoint = ModelCheckpoint(bst_model_path, save_best_only=True, save_weights_only=True)
hist = model.fit(data_train, labels_train, \
validation_data=(data_val, labels_val), \
epochs=50, batch_size=256, shuffle=True, \
callbacks=[early_stopping, model_checkpoint])
model.load_weights(bst_model_path)
refer to this link for more info
Here is a simple example illustrate how to use early stooping in Keras:
First necessarily import:
from keras.callbacks import EarlyStopping, ModelCheckpoint
Setup Early Stopping
# Set callback functions to early stop training and save the best model so far
callbacks = [EarlyStopping(monitor='val_loss', patience=2),
ModelCheckpoint(filepath='best_model.h5', monitor='val_loss', save_best_only=True)]
Train neural network
history = network.fit(train_features, # Features
train_target, # Target vector
epochs=20, # Number of epochs
callbacks=callbacks, # Early stopping
verbose=0, # Print description after each epoch
batch_size=100, # Number of observations per batch
validation_data=(test_features, test_target)) # Data for evaluation
See the full example here.
Please also check :Stop Keras Training when the network has fully converge; the best answer of Daniel.