What is running loss in PyTorch and how is it calculated - python

I had a look at this tutorial in the PyTorch docs for understanding Transfer Learning. There was one line that I failed to understand.
After the loss is calculated using loss = criterion(outputs, labels), the running loss is calculated using running_loss += loss.item() * inputs.size(0) and finally, the epoch loss is calculated using running_loss / dataset_sizes[phase].
Isn't loss.item() supposed to be the loss for an entire mini-batch (please correct me if I am wrong)? That is, if the batch_size is 4, loss.item() would give the loss for the entire set of 4 images. If this is true, why is loss.item() multiplied by inputs.size(0) while calculating running_loss? Isn't this an extra multiplication in this case?
Any help would be appreciated. Thanks!

It's because the loss given by CrossEntropyLoss (and most other loss functions) is divided by the number of elements, i.e. the reduction parameter is 'mean' by default:
torch.nn.CrossEntropyLoss(weight=None, size_average=None, ignore_index=-100, reduce=None, reduction='mean')
Hence, loss.item() contains the loss of the entire mini-batch divided by the batch size. That's why loss.item() is multiplied by the batch size, given by inputs.size(0), when computing running_loss.
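As a quick illustration (a minimal sketch with made-up tensors, not from the tutorial), multiplying the mean-reduced batch loss by the batch size recovers the summed per-sample loss, which is why dividing the accumulated total by dataset_sizes[phase] at the end gives the average loss per sample for the epoch:

import torch
import torch.nn as nn

criterion_mean = nn.CrossEntropyLoss(reduction='mean')  # the default
criterion_sum = nn.CrossEntropyLoss(reduction='sum')

outputs = torch.randn(4, 10)             # made-up batch of 4 samples, 10 classes
labels = torch.randint(0, 10, (4,))

mean_loss = criterion_mean(outputs, labels)
sum_loss = criterion_sum(outputs, labels)

# mean loss * batch size == summed per-sample loss
print(torch.isclose(mean_loss * outputs.size(0), sum_loss))  # tensor(True)

# so accumulating loss.item() * inputs.size(0) over all batches and dividing
# by the dataset size yields the true per-sample epoch loss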

if the batch_size is 4, loss.item() would give the loss for the entire set of 4 images
That depends on how the loss is calculated. Remember, loss is a tensor just like every other tensor. In general, the PyTorch loss functions return the average loss by default:
"The losses are averaged across observations for each minibatch."
t.item() for a single-element tensor t simply converts it to a standard Python number (a plain float for floating-point tensors).
More importantly, if you are new to PyTorch, it might be helpful to know that we use t.item() to maintain the running loss instead of t itself, because tensors that require gradients keep a reference to their computational graph, which can exhaust your GPU memory very quickly.
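For instance (a tiny sketch with a made-up tensor, not from the answer), .item() gives a graph-free Python number, while the tensor itself still references its autograd history:

import torch

w = torch.randn(3, requires_grad=True)
loss = (w ** 2).sum()

print(type(loss.item()))       # <class 'float'> -- a plain Python number, no graph attached
print(loss.grad_fn)            # <SumBackward0 object ...> -- the tensor still references its graph
print(loss.detach().grad_fn)   # None -- detach() also drops the graph but keeps a tensor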

Related

Is it possible to perform step according to batch size in pytorch?

I am iterating over training samples in batches; however, the last batch always returns fewer samples.
Is it possible to specify the step size in torch according to the current batch length?
For example, most batches are of size 64, but the last batch has only 6 samples.
If I do the usual routine:
optimizer.zero_grad()
loss.backward()
optimizer.step()
It seems that the last 6 samples carry the same weight in the gradient update as the 64-sample batches, but in fact they should carry only about 1/10 of the weight because there are fewer samples.
In Mxnet I could specify the step size accordingly but I don't know how to do it in torch.
You can define a custom loss function and then reweight it based on batch size, e.g.:
import torch.nn as nn

def reweighted_cross_entropy(my_outputs, my_labels):
    # compute the batch size from the output tensor
    my_batch_size = my_outputs.size(0)
    original_loss = nn.CrossEntropyLoss()
    loss = original_loss(my_outputs, my_labels)
    # reweight the (mean-reduced) loss by the batch size
    return my_batch_size * loss
If you are using something like plain gradient descent, it is easy to see that
(lr / 10) * grad(loss) = lr * grad(loss / 10)
so reweighting the loss is equivalent to reweighting your learning rate. This won't be exactly true for more complex optimisers, but it can be good enough in practice.
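As a quick, hypothetical check of how the reweighted loss above plays out in a training loop (the toy model, data, and batch sizes below are made up), the small final batch then contributes a proportionally smaller gradient:

import torch
import torch.nn as nn

# toy model and optimizer purely for illustration; reweighted_cross_entropy is defined above
model = nn.Linear(20, 5)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

for batch_size in (64, 6):                      # a full batch and a small last batch
    inputs = torch.randn(batch_size, 20)
    labels = torch.randint(0, 5, (batch_size,))

    optimizer.zero_grad()
    loss = reweighted_cross_entropy(model(inputs), labels)
    loss.backward()
    optimizer.step()
    # the 6-sample batch yields a gradient roughly 6/64 the size of the full batch's, in expectation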
I suggest just ignoring the last batch. The PyTorch DataLoader has a parameter to implement that behaviour:
drop_last=True  # (False by default)
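A minimal sketch of what that looks like (the TensorDataset and sizes here are made up):

import torch
from torch.utils.data import DataLoader, TensorDataset

# made-up dataset of 134 samples: two full batches of 64 plus a leftover of 6
dataset = TensorDataset(torch.randn(134, 20), torch.randint(0, 5, (134,)))

loader = DataLoader(dataset, batch_size=64, shuffle=True, drop_last=True)
for inputs, labels in loader:
    print(inputs.size(0))  # always 64; the incomplete final batch is dropped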

Why is saving state_dict getting slower as training progresses?

I'm saving my model's and optimizer's state dict as follows:
if epoch % 50000 == 0:
    # checkpoint save every 50000 epochs
    print('\nSaving model... Loss is: ', loss)
    torch.save({
        'epoch': epoch,
        'model': self.state_dict(),
        'optimizer_state_dict': self.optimizer.state_dict(),
        'scheduler': self.scheduler.state_dict(),
        'loss': loss,
        'losses': self.losses,
    }, PATH)
When I first start the training it saves in less than 5 seconds. However, after a couple of hours of training it takes over two minutes to save. The only reason I could think of is the list of losses. But I can't see how that would increase the time by that much.
Update 1:
I have my losses as:
self.losses = []
I'm appending the loss at each epoch to this list as follows:
#... loss calculation
loss.backward()
self.optimizer.step()
self.scheduler.step()
self.losses.append(loss)
As mentioned in the comments, the instruction
self.losses.append(loss)
is definitely the culprit, and should be replaced with
self.losses.append(loss.item())
The reason is that when you store the tensor loss, you also store the whole computational graph alongside it (all the information required to perform the backprop). In other words, you are not merely storing a tensor, but also pointers to all the tensors that were involved in the computation of the loss and their relations (which ones were added, multiplied, etc.). So it will grow really big really fast.
When you use loss.item() (or loss.detach(), which would also work), you detach the value from the computational graph, and thus you store only what you intended: the loss value itself. With item() that is a simple Python float; with detach() it is a graph-free tensor.
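As a small illustration (a made-up model and training loop, not the asker's code), appending loss keeps every iteration's tensor and its graph references alive, while appending loss.item() stores only floats:

import torch
import torch.nn as nn

model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = nn.MSELoss()

losses = []
for step in range(100):
    x, y = torch.randn(32, 10), torch.randn(32, 1)
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()

    # losses.append(loss)        # would keep each iteration's tensor and graph references alive
    losses.append(loss.item())   # stores a plain float; the graph can be freed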

How does PyTorch compute the backward pass when optimizing triplet loss?

I am implementing a triplet network in Pytorch where the 3 instances (sub-networks) share the same weights. Since the weights are shared, I implemented it as a single instance network that is called three times to produce the anchor, positive, and negative embeddings. The embeddings are learned by optimizing the triplet loss. Here is a small snippet for illustration:
from dependencies import *

model = SingleSubNet()  # represents each instance in the triplet net

for epoch in epochs:
    for anch, pos, neg in train_loader:
        optimizer.zero_grad()
        fa, fp, fn = model(anch), model(pos), model(neg)
        loss = triplet_loss(fa, fp, fn)
        loss.backward()
        optimizer.step()
        # Do more stuff ...
My complete code works as expected. However, I do not understand what gradients loss.backward() computes in this case. I am confused because there are 3 gradients of the loss in each learning step (the gradient formulas are here). I assume the gradients are summed before performing optimizer.step(). But then it looks from the equations that if the gradients are summed, they will cancel each other out and yield a zero update term. Of course, this is not true, as the network learns meaningful embeddings in the end.
Thanks in advance
Late answer, but hope this helps someone.
The gradients that you linked are the gradients of the loss with respect to the embeddings (the anchor, positive embedding and negative embedding). To update the model parameters, you use the gradient of the loss with respect to the model parameters. This does not sum to zero.
The reason for this is that when calculating the gradient of the loss with respect to the model parameters, the formula makes use of the activations from the forward pass, and the 3 different inputs (anchor image, positive example and negative example) have different activations in the forward pass.
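A tiny sketch (a toy linear embedding network with made-up data, using torch.nn.TripletMarginLoss) showing that autograd accumulates the contributions of all three forward passes into the shared parameters' .grad, and that the result is generally non-zero:

import torch
import torch.nn as nn

embed = nn.Linear(8, 4, bias=False)          # the shared "sub-network"
triplet_loss = nn.TripletMarginLoss(margin=1.0)

anchor = torch.randn(16, 8)
positive = torch.randn(16, 8)
negative = torch.randn(16, 8)

fa, fp, fn = embed(anchor), embed(positive), embed(negative)
loss = triplet_loss(fa, fp, fn)
loss.backward()

# the gradient w.r.t. the shared weights is the sum of the contributions from
# the three forward passes; it does not cancel to zero because each pass used
# different inputs (and hence different activations)
print(embed.weight.grad.norm())   # almost surely non-zero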

How to compute and sum gradients over multiple mini-batches in Keras?

I want to increase the mini-batch size for my neural network during training (instead of decaying the learning rate), but the upper limit for the mini-batch size is 8 due to my GPU memory.
I found this article
https://medium.com/#davidlmorton/increasing-mini-batch-size-without-increasing-memory-6794e10db672
on how to increase the mini-batch size without increasing the memory usage, and it does that with a DataLoader loop in PyTorch.
The technique is simple, you just compute and sum gradients over
multiple mini-batches. Only after the specified number of mini-batches
do you update the model parameters.
count = 0
for inputs, targets in training_data_loader:
    if count == 0:
        optimizer.step()
        optimizer.zero_grad()
        count = batch_multiplier
    outputs = model(inputs)
    loss = loss_function(outputs, targets) / batch_multiplier
    loss.backward()
    count -= 1
However, I couldn't find any examples of this in Keras. I assume it would have to be done using a data_generator(Sequence), in the __getitem__ method?
But I have no idea how to implement it, or if it is even possible in Keras. I tried looking at examples of DataGenerators as well, but none of them involved optimizers.
Would appreciate if anybody can help me out!
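One way this could be sketched in TensorFlow 2 is with a custom training loop using tf.GradientTape rather than a data generator; the model, data, and batch_multiplier below are placeholders for illustration, not an answer from the thread:

import tensorflow as tf

# placeholder model and data purely for illustration
model = tf.keras.Sequential([tf.keras.Input(shape=(20,)), tf.keras.layers.Dense(10)])
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

dataset = tf.data.Dataset.from_tensor_slices(
    (tf.random.normal([64, 20]), tf.random.uniform([64], maxval=10, dtype=tf.int32))
).batch(8)                                   # small mini-batches that fit in memory

batch_multiplier = 4                         # effective batch size = 8 * 4 = 32
accum_grads = [tf.zeros_like(v) for v in model.trainable_variables]
count = 0

for inputs, targets in dataset:
    with tf.GradientTape() as tape:
        logits = model(inputs, training=True)
        loss = loss_fn(targets, logits) / batch_multiplier
    grads = tape.gradient(loss, model.trainable_variables)
    accum_grads = [acc + g for acc, g in zip(accum_grads, grads)]  # sum gradients over mini-batches
    count += 1

    if count == batch_multiplier:            # update only every batch_multiplier mini-batches
        optimizer.apply_gradients(zip(accum_grads, model.trainable_variables))
        accum_grads = [tf.zeros_like(v) for v in model.trainable_variables]
        count = 0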

How do I get a loss per epoch and not per batch?

In my understanding, an epoch is one run over the whole dataset (repeated an arbitrary number of times), which in turn is processed in parts, the so-called batches. After each train_on_batch a loss is calculated, the weights are updated, and the next batch will get better results. These losses are indicators of the quality and learning state of my two NNs.
In several sources the loss is calculated (and printed) per epoch. Therefore I am not sure if I am doing this right.
At the moment my GAN looks like this:
for epoch:
    for batch:
        fakes = generator.predict_on_batch(batch)
        dlc = discriminator.train_on_batch(batch, ..)
        dlf = discriminator.train_on_batch(fakes, ..)
        dis_loss_total = 0.5 * np.add(dlc, dlf)
        g_loss = gan.train_on_batch(batch, ..)
        # save losses to array to work with later
These losses are for each batch. How do I get them for an epoch? As an aside: Do I need losses for an epoch, what for?
There is no direct way to compute the loss for an epoch. Actually, the loss of an epoch is usually defined as the average of the loss of batches in that epoch. So you can accumulate the loss values during an epoch and at the end divide it by the number of batches in the epoch:
epoch_loss = []
for epoch in range(n_epochs):
    acc_loss = 0.
    for batch in range(n_batches):
        # do the training
        loss = model.train_on_batch(...)
        acc_loss += loss
    epoch_loss.append(acc_loss / n_batches)
As for the other question, one usage of epoch loss might be to use it as an indicator to stop the training (however, the validation loss is usually used for that, not the training loss).
I'll expand on @today's answer a bit. There is a certain balance to strike in how to report loss over an epoch and how to use it to determine when training should stop.
If you only look at the loss of the most recent batch, it will be a very noisy estimate of your dataset loss, because maybe that batch happened to contain all the samples your model has trouble with, or all the samples that are trivial to succeed on.
If you look at the averaged loss over all batches in the epoch, you may get a skewed picture because, as you indicated, the model has been (hopefully) improving over the epoch, so the performance on the initial batches isn't meaningfully comparable to the performance on the later batches.
The only way to accurately report your epoch loss is to take your model out of training mode, i.e. fix all the model parameters, and run your model on the whole dataset. That will be an unbiased computation of your epoch loss. However, in general that's a terrible idea because if you have a complex model or a lot of training data, you will waste a lot of time doing this.
So, I think it's most common to balance these factors by reporting an averaged loss over N mini-batches, where N is large enough to smooth out the noise of individual batches but not so large that the model performance is not comparable between the first and last batches.
I know you're in Keras but here is a PyTorch example that illustrates this concept clearly, replicated here:
for epoch in range(2):  # loop over the dataset multiple times
    running_loss = 0.0
    for i, data in enumerate(trainloader, 0):
        # get the inputs; data is a list of [inputs, labels]
        inputs, labels = data

        # zero the parameter gradients
        optimizer.zero_grad()

        # forward + backward + optimize
        outputs = net(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

        # print statistics
        running_loss += loss.item()
        if i % 2000 == 1999:    # print every 2000 mini-batches
            print('[%d, %5d] loss: %.3f' %
                  (epoch + 1, i + 1, running_loss / 2000))
            running_loss = 0.0

print('Finished Training')
You can see they accumulate the loss over N=2000 batches, report the averaged loss over those 2000 batches, then zero out the running loss and keep going.
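For completeness, a similar pattern could be sketched in Keras with train_on_batch (the toy model, data, and report_every interval here are made up for illustration):

import numpy as np
import tensorflow as tf

# toy model and data purely for illustration
model = tf.keras.Sequential([tf.keras.Input(shape=(20,)), tf.keras.layers.Dense(1)])
model.compile(optimizer='sgd', loss='mse')

x = np.random.randn(6400, 20).astype('float32')
y = np.random.randn(6400, 1).astype('float32')

report_every = 50            # N mini-batches to average over
batch_size = 64
running_loss = 0.0

for epoch in range(2):
    for i in range(len(x) // batch_size):
        xb = x[i * batch_size:(i + 1) * batch_size]
        yb = y[i * batch_size:(i + 1) * batch_size]
        running_loss += model.train_on_batch(xb, yb)
        if (i + 1) % report_every == 0:      # report the averaged loss every N batches
            print('[%d, %3d] loss: %.3f' % (epoch + 1, i + 1, running_loss / report_every))
            running_loss = 0.0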
