I am new to PyTorch. I wrote a script for training my model, but I would like to record the accuracy and other metrics for each epoch.
Currently, I use four lists to record the per-epoch histories, convert them to a DataFrame, and save it as a CSV. I am wondering what people usually do for this part.
def train(model, criterion, optimizer, scheduler, num_epochs=25):
    train_loss_history = []
    val_loss_history = []
    train_acc_history = []
    val_acc_history = []
    # Training scripts below
TensorboardX allows you to plot and store the information you need.
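For example, here is a minimal sketch of logging per-epoch metrics (assuming tensorboardX is installed; the log directory, tag names, and the dummy metric values below are placeholders for whatever your training loop actually computes):

from tensorboardX import SummaryWriter  # torch.utils.tensorboard offers the same SummaryWriter API

writer = SummaryWriter(log_dir='runs/experiment_1')  # hypothetical log directory

for epoch in range(25):
    # Replace these dummy values with the metrics computed by your training/validation loops.
    train_loss, val_loss = 1.0 / (epoch + 1), 1.2 / (epoch + 1)
    train_acc, val_acc = 0.5 + 0.01 * epoch, 0.45 + 0.01 * epoch
    writer.add_scalar('loss/train', train_loss, epoch)
    writer.add_scalar('loss/val', val_loss, epoch)
    writer.add_scalar('acc/train', train_acc, epoch)
    writer.add_scalar('acc/val', val_acc, epoch)

writer.close()

Running tensorboard --logdir runs then plots each tag over the epochs.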
I found some code that trains a model by creating batches at a fixed rate from different types of data.
The code is as follows.
def train_model(self):
    print('train the model')
    i = 0
    while (i < self.iterations) and (self.file_date == os.path.getmtime(sys.argv[0])):
        #x,y,d,e,s = self.fgen.load_train(self.nbatch, scenario='doubletalk')
        x, y, d, e, s = self.fgen.load_train(self.nbatch)
        self.model.fit([x, y, d, e, s], None, batch_size=self.nbatch, epochs=1, verbose=0, callbacks=[self.logger])
        i += 1
The model uses the Adam optimizer.
However, my question is as follows.
I wonder whether the optimizer state is carried over and updated across iterations of the while loop, given that each fit() call runs with epochs=1.
For example, my concern is whether its learning rate stays constant for every training epoch.
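To make the concern concrete, here is a small check I could run (assuming TensorFlow 2.x; the toy model and data below are placeholders, not my actual setup). If the printed step counter keeps increasing across fit() calls, the Adam state is being carried over between iterations of the while loop.

import numpy as np
import tensorflow as tf

# Toy stand-in model, only to illustrate the check.
model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
model.compile(optimizer=tf.keras.optimizers.Adam(), loss='mse')

x = np.random.rand(8, 4).astype('float32')
y = np.random.rand(8, 1).astype('float32')

for i in range(3):
    model.fit(x, y, batch_size=8, epochs=1, verbose=0)
    # Adam's step counter, printed after each fit() call.
    print(model.optimizer.iterations.numpy())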
Using Ubuntu 20.04 and PyTorch 1.10.1.
I am trying to solve a music generation task with a transformer architecture and multi-embeddings, to process tokens with several characteristics.
In each training iteration, I compute the loss of each token characteristic and store it in a vector. I suppose I should then store a vector containing all of those losses in the checkpoint (or something similar), instead of what I am doing now, which is saving only the total loss. I would like to know how to store all the losses in the checkpoint (so that I can keep training when loading it), or whether that is needed at all.
The epochs loop:
for epoch in range(0, epochs):
    print('Epoch: ', epoch)
    loss = trfrmr.train(epoch+1, model, train_loader, train_loss_func, opt, lr_scheduler, num_iters=-1)
    loss_train.append(loss)
    torch.save({
        'epoch': epoch,
        'model_state_dict': model.state_dict(),
        'optimizer_state_dict': opt.state_dict(),
        'loss': loss,
    }, "model_pop909_checkpoint.pth")
The training loop:
for batch_num, batch in enumerate(dataloader):
    time_before = time.time()
    opt.zero_grad()
    x = batch[0].to(get_device())
    tgt = batch[1].to(get_device())
    # x is the input sequence of shape (N, T, Z); the transformer forward function expects (T, N, Z)
    y = model(x.permute(1, 0, 2))
    # tgt is the real output sequence of shape (N, T, Z): T is sequence length, N batch size, Z the different token types
    # y are the output logits: a list of Z tensors of shape (T, N, C*), where C* is the vocabulary size,
    # which varies depending on the token type (pitch, velocity, etc.)
    losses = []
    for j in range(LEN_VOCAB):
        aux_loss = loss.forward(y[j].permute(1, 2, 0),
                                tgt[..., j])  # shapes (N, C, T) and (N, T); see PyTorch cross-entropy for details
        losses.append(aux_loss)
    losses_sum = sum(losses)  # here we sum, but we could also take the mean, for instance
    losses_sum.backward()
    opt.step()
    if lr_scheduler is not None:
        lr_scheduler.step()
    lr = opt.param_groups[0]['lr']
    loss_hist.append(losses_sum)
    if batch_num == num_iters:
        break
Thanks in advance.
HOURS LATER EDIT: SOLUTION TO MY SPECIFIC PROBLEM
The problem was that when loading the model again I wasn't doing it properly (I was loading only the model parameters, not the optimizer ones). Now, at the beginning of the loop, I do:
if loaded:
    print('Loading model and optimizer...')
    model.load_state_dict(checkpoint['model_state_dict'], strict=False)
    opt.load_state_dict(checkpoint['optimizer_state_dict'])
    print('Loaded successfully!')
And I also load the epoch:
epoch = 0
if loaded:
    print('Loading epoch value...')
    epoch = checkpoint['epoch']
    print('Loaded successfully!')
As far as I can tell from your code, your loss function has no custom learnable parameters; it is simply recomputed every time your model iterates. So there is no need to save its value beyond keeping a history of it; it is not required in order to continue training from a checkpoint.
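If you do want the checkpoint to carry a record of the per-token-type losses anyway, a minimal sketch could look like this (the save_checkpoint/load_checkpoint helpers are illustrative, and the variable names model, opt, lr_scheduler follow the question's code):

import torch

def save_checkpoint(path, epoch, model, opt, lr_scheduler, loss_history):
    # loss_history could be, e.g., a list of per-token-type loss values per iteration.
    torch.save({
        'epoch': epoch,
        'model_state_dict': model.state_dict(),
        'optimizer_state_dict': opt.state_dict(),
        'scheduler_state_dict': lr_scheduler.state_dict() if lr_scheduler is not None else None,
        'loss_history': loss_history,
    }, path)

def load_checkpoint(path, model, opt, lr_scheduler):
    checkpoint = torch.load(path)
    model.load_state_dict(checkpoint['model_state_dict'])
    opt.load_state_dict(checkpoint['optimizer_state_dict'])
    if lr_scheduler is not None and checkpoint['scheduler_state_dict'] is not None:
        lr_scheduler.load_state_dict(checkpoint['scheduler_state_dict'])
    return checkpoint['epoch'], checkpoint['loss_history']

Only the model, optimizer (and scheduler, if used) state dicts plus the epoch are needed to resume; the loss history entry is purely for record-keeping.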
I am training a CNN using PyTorch and have created a training loop. As I am performing optimisation and experimenting with hyper-parameter tuning, I want to separate my training, validation and testing into different functions. I need to record the accuracy and loss of each function in order to plot graphs, so I want to create a function which returns the accuracy.
I am pretty new to coding and was wondering the best way to go about this. My code feels a bit messy at the moment, and I need to be able to feed various hyper-parameters into my training function for experimentation. Could anyone offer any advice? Below is what I have so far:
def train_model(model, optimizer, data_loader, num_epochs, criterion=criterion):
    total_epochs = notebook.tqdm(range(num_epochs))
    for epoch in total_epochs:
        model.train()
        train_correct = 0.0
        train_running_loss = 0.0
        train_total = 0.0
        for i, (img, label) in enumerate(data_loader['train']):
            # move images and labels to the GPU
            img = img.to(device)
            label = label.to(device)
            # forward pass
            outputs = model(img)
            # compute the loss
            loss = criterion(outputs, label)
            # propagate the loss backwards
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            train_running_loss += loss.item()
            _, predicted = outputs.max(1)
            train_total += label.size(0)
            train_correct += predicted.eq(label).sum().item()
        train_loss = train_running_loss / len(data_loader['train'])
        train_accu = 100. * train_correct / train_total
        print('Train Loss: %.3f | Train Accuracy: %.3f' % (train_loss, train_accu))
I have also experimented with making a function to record accuracy:
def accuracy(outputs, labels):
    _, preds = torch.max(outputs, dim=1)
    return torch.tensor(torch.sum(preds == labels).item() / len(preds))
First, note that:
Unless you have some specific motivation, validation (and testing) should be performed on a different dataset than the training set, so you should use a different DataLoader. The computation time will increase because of an additional for loop at every epoch.
Always call model.eval() before validation/testing.
That said, the signature of the validation function is pretty much the same as that of train_model:
# criterion is passed if you want to register the validation loss too
def validate_model(model, eval_loader, criterion):
    ...
Then, in train_model, after each epoch, you can call the function validate_model and store the returned metrics in some data structure (list, tensor, etc.) that will be used later for plotting.
At the end of the training, you can then use the same validate_model function for testing.
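For instance, a minimal sketch of such a function, mirroring the training code from the question (it assumes the same global device variable; the returned values are the average loss and the accuracy in percent):

import torch

def validate_model(model, eval_loader, criterion):
    model.eval()
    val_correct = 0.0
    val_running_loss = 0.0
    val_total = 0.0
    with torch.no_grad():  # no gradients needed for validation/testing
        for img, label in eval_loader:
            img = img.to(device)      # same global device as in the training code
            label = label.to(device)
            outputs = model(img)
            loss = criterion(outputs, label)
            val_running_loss += loss.item()
            _, predicted = outputs.max(1)
            val_total += label.size(0)
            val_correct += predicted.eq(label).sum().item()
    val_loss = val_running_loss / len(eval_loader)
    val_accu = 100. * val_correct / val_total
    return val_loss, val_accu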
Instead of coding the accuracy by yourself, you can use Accuracy from TorchMetrics.
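A small sketch of its use (in recent TorchMetrics versions the task argument is required, while older versions use Accuracy() without it; the 10-class setup here is just a placeholder):

import torch
from torchmetrics import Accuracy

# Hypothetical 10-class example; move the metric to the GPU with .to(device) if needed.
metric = Accuracy(task='multiclass', num_classes=10)

preds = torch.randn(8, 10)           # raw logits work; the argmax is taken internally
target = torch.randint(0, 10, (8,))  # ground-truth class indices
print(metric(preds, target))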
Finally, if you feel the need to level up, you can use DL training frameworks like PyTorch Lightning or FastAI. Also take a look at a hyperparameter tuning library such as Ray Tune.
I was testing the use of an h5 file vs flow_from_directory and noticed different validation scores during training, but very similar training scores, and when I tested the models on the test data both gave almost the same scores. For the experiment I am using the same model, training for 5 epochs, and both runs start from the same weights (via get_weights and save_weights). I would like to use the h5 alternative since it cuts my training time per epoch from 2 minutes down to about half a minute.
Using flow from directory
Here are the scores during training.
And the prediction results on the test data:
loss: 1.9690 - accuracy: 0.4802
Using flow from the h5 file
And the "trained" model applied to the test data:
loss: 1.9695 - accuracy: 0.4822
From these stats, it looks like both runs train on the same data and predict on the same test data (because of the similar loss and accuracy scores), while the validation data is different.
How the h5 file was created
Below is the code for inserting the validation data into the dataset. For the training and testing images it is the same code, modified accordingly.
...
hdf5_file.create_dataset('x_val', val_data_shape, np.uint8)
hdf5_file.create_dataset('y_val', val_label_shape, np.uint8)
...
for i in range(len(val_it)):
    if i % 200 == 0:
        print(f"Validation: Done {i} of {len(val_it)}")
    x, y = val_it[i]
    img = x[0]
    label = y[0]
    hdf5_file['x_val'][i, ...] = img
    hdf5_file['y_val'][i, ...] = label
...
How the data was loaded to feed it into the model
First I create a datagen that applies the preprocessing needed for the ResNet model. The training data and the test data are handled in the same fashion.
datagen_val = ImageDataGenerator(preprocessing_function=preprocess_resnet)
val_it = datagen_val.flow_from_directory(BASE_PATH + 'val', batch_size=1)
val_x = hdf5_file['x_val']
val_y = hdf5_file['y_val']
datagen_val_h5 = ImageDataGenerator(preprocessing_function=preprocess_resnet)
val_it_h5 = datagen_val_h5.flow((val_x, val_y), batch_size=1)
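For reference, a stripped-down, self-contained version of this round trip with dummy data (the shapes, class count, and file name are placeholders, not my actual setup; here the h5 arrays are read fully into memory and passed to flow as separate x and y arguments):

import h5py
import numpy as np
from tensorflow.keras.preprocessing.image import ImageDataGenerator

n, h, w, c, n_classes = 16, 224, 224, 3, 4

# Write dummy validation images and one-hot labels to an h5 file.
with h5py.File('val_data.h5', 'w') as f:
    f.create_dataset('x_val', (n, h, w, c), np.uint8)
    f.create_dataset('y_val', (n, n_classes), np.uint8)
    f['x_val'][...] = np.random.randint(0, 256, (n, h, w, c), dtype=np.uint8)
    f['y_val'][...] = np.eye(n_classes, dtype=np.uint8)[np.random.randint(0, n_classes, n)]

# Read the arrays back into memory and feed them to a generator
# (add preprocessing_function=preprocess_resnet as in my real code).
f = h5py.File('val_data.h5', 'r')
x_arr, y_arr = f['x_val'][...], f['y_val'][...]
datagen = ImageDataGenerator()
val_it_h5 = datagen.flow(x_arr, y_arr, batch_size=1, shuffle=False)
x_batch, y_batch = next(val_it_h5)
print(x_batch.shape, y_batch.shape)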
The question is: why do the loss and accuracy scores on the validation dataset differ between the two methods, when on training and test data they score very similarly?
I'm coming over from Keras to PyTorch, and one of the surprising things I've found is that I'm supposed to implement my own training loop.
In Keras, there is a de facto fit() function that: (1) runs gradient descent and (2) collects a history of metrics for loss and accuracy over both the training set and validation set.
In PyTorch, it appears that the programmer needs to implement the training loop. Since I'm new to PyTorch, I don't know if my training loop implementation is correct. I just want to compare apples-to-apples loss and accuracy metrics with what I'm seeing in Keras.
I've already read through:
the official PyTorch 60-minute blitz, where they provide a sample training loop.
official PyTorch example code, where I've found the training loop placed in-line with other code.
the O'Reilly book Programming PyTorch for Deep Learning with its own training loop.
Stanford CS230 sample code.
various blog posts (e.g. here and here).
So I'm wondering: is there a definitive, universal training loop implementation that does the same thing and reports the same numbers as the Keras fit() function?
My points of frustration:
Pulling data out of the dataloader is not consistent between image data and NLP data.
Correctly computing loss and accuracy is not consistent in any sample code I've seen.
Some code examples use Variable, while others do not.
Unnecessarily detailed: moving data to/from the GPU; knowing when to call zero_grad().
For what it's worth, here is my current implementation. Are there any obvious bugs?
import time

def train(model, optimizer, loss_fn, train_dl, val_dl, epochs=20, device='cuda'):
    '''
    Runs training loop for classification problems. Returns Keras-style
    per-epoch history of loss and accuracy over training and validation data.

    Parameters
    ----------
    model : nn.Module
        Neural network model
    optimizer : torch.optim.Optimizer
        Search space optimizer (e.g. Adam)
    loss_fn :
        Loss function (e.g. nn.CrossEntropyLoss())
    train_dl :
        Iterable dataloader for training data.
    val_dl :
        Iterable dataloader for validation data.
    epochs : int
        Number of epochs to run
    device : string
        Specifies 'cuda' or 'cpu'

    Returns
    -------
    Dictionary
        Similar to Keras' fit(), the output dictionary contains per-epoch
        history of training loss, training accuracy, validation loss, and
        validation accuracy.
    '''
    print('train() called: model=%s, opt=%s(lr=%f), epochs=%d, device=%s\n' % \
          (type(model).__name__, type(optimizer).__name__,
           optimizer.param_groups[0]['lr'], epochs, device))

    history = {}  # Collects per-epoch loss and acc like Keras' fit().
    history['loss'] = []
    history['val_loss'] = []
    history['acc'] = []
    history['val_acc'] = []

    start_time_sec = time.time()

    for epoch in range(epochs):

        # --- TRAIN AND EVALUATE ON TRAINING SET -----------------------------
        model.train()
        train_loss = 0.0
        num_train_correct = 0
        num_train_examples = 0

        for batch in train_dl:
            optimizer.zero_grad()

            x = batch[0].to(device)
            y = batch[1].to(device)
            yhat = model(x)
            loss = loss_fn(yhat, y)

            loss.backward()
            optimizer.step()

            train_loss += loss.data.item() * x.size(0)
            num_train_correct += (torch.max(yhat, 1)[1] == y).sum().item()
            num_train_examples += x.shape[0]

        train_acc = num_train_correct / num_train_examples
        train_loss = train_loss / len(train_dl.dataset)

        # --- EVALUATE ON VALIDATION SET -------------------------------------
        model.eval()
        val_loss = 0.0
        num_val_correct = 0
        num_val_examples = 0

        for batch in val_dl:
            x = batch[0].to(device)
            y = batch[1].to(device)
            yhat = model(x)
            loss = loss_fn(yhat, y)

            val_loss += loss.data.item() * x.size(0)
            num_val_correct += (torch.max(yhat, 1)[1] == y).sum().item()
            num_val_examples += y.shape[0]

        val_acc = num_val_correct / num_val_examples
        val_loss = val_loss / len(val_dl.dataset)

        print('Epoch %3d/%3d, train loss: %5.2f, train acc: %5.2f, val loss: %5.2f, val acc: %5.2f' % \
              (epoch+1, epochs, train_loss, train_acc, val_loss, val_acc))

        history['loss'].append(train_loss)
        history['val_loss'].append(val_loss)
        history['acc'].append(train_acc)
        history['val_acc'].append(val_acc)

    # END OF TRAINING LOOP

    end_time_sec = time.time()
    total_time_sec = end_time_sec - start_time_sec
    time_per_epoch_sec = total_time_sec / epochs
    print()
    print('Time total: %5.2f sec' % (total_time_sec))
    print('Time per epoch: %5.2f sec' % (time_per_epoch_sec))

    return history
Short answer: there is no equivalent training loop for PyTorch and tf.keras, and there probably never will be one.
First of all, a built-in training loop is syntactic sugar that is supposed to make one's life easier. From my point of view, "making life easier" is the motto of the tf.keras framework, and this is the main reason it has one. A training loop cannot be formalized as a well-defined practice; it can vary a lot depending on the task/dataset/procedure/metric/you-name-it, and matching all the options across the two frameworks would require a lot of effort. Furthermore, defining a fixed interface for the training loop in PyTorch might be too restrictive for many actual users of the framework.
Matching the outputs of the networks would require matching the behavior of every operation in the two frameworks, which is impossible. First of all, the frameworks don't necessarily provide the same sets of operations, and operations can be grouped into higher-level abstractions differently. Also, some common functions like sigmoid or BatchNorm might look mathematically well defined on paper, but in reality have dozens of implementation-specific details. And when improvements are introduced to these operations, it is up to each community to integrate the updates into the main framework distribution or to ignore them; needless to say, the developers of the two frameworks make these decisions independently and likely with different motivations.
To sum it up, matching these details between the two frameworks would require enormous effort and would probably be very disruptive for existing users.
Indeed, the PyTorch Module class (source code) doesn't have a fit() method, so you have to implement your own according to your needs.
However there are some implementations which mimic the Keras training API, such as this one:
https://github.com/ncullen93/torchsample
or a simpler one:
https://github.com/henryre/pytorch-fitmodule
A close equivalent to Keras's model.fit in PyTorch is the PyTorch extension called Torchbearer.
From the MNIST example notebook:
trial = Trial(model, optimizer, loss, metrics=['acc', 'loss'], callbacks=callbacks).to(device)
trial.with_generators(train_generator=traingen, val_generator=valgen, test_generator=testgen)
history = trial.run(epochs=5, verbose=1)
The similarity is there, although the usage requires some reading.
Best of luck!
I wanted to point out that there is now a fit(...) equivalent. PyTorch Lightning is a wrapper around PyTorch that allows for a clean, object-oriented approach to creating ML models in PyTorch. It provides a fit(...) loop through its Trainer class. I would check out their official website for a more detailed answer.
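For a rough idea, here is a minimal sketch of what this looks like (the model and data are toy placeholders, and API details can vary between Lightning versions):

import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl

class LitClassifier(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 2))
        self.loss_fn = nn.CrossEntropyLoss()

    def forward(self, x):
        return self.net(x)

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = self.loss_fn(self(x), y)
        self.log('train_loss', loss)  # logged metrics play the role of the Keras history
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)

# Toy dataset, just to make the example runnable.
x = torch.randn(256, 20)
y = torch.randint(0, 2, (256,))
train_dl = DataLoader(TensorDataset(x, y), batch_size=32)

trainer = pl.Trainer(max_epochs=3)
trainer.fit(LitClassifier(), train_dl)  # the fit(...) equivalent

Trainer.fit() then handles the epoch and batch loops, device placement, and logging for you.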