I'm saving my model's and optimizer's state dict as follows:
if epoch % 50000 == 0:
    # checkpoint save every 50000 epochs
    print('\nSaving model... Loss is: ', loss)
    torch.save({
        'epoch': epoch,
        'model': self.state_dict(),
        'optimizer_state_dict': self.optimizer.state_dict(),
        'scheduler': self.scheduler.state_dict(),
        'loss': loss,
        'losses': self.losses,
    }, PATH)
When I first start the training it saves in less than 5 seconds. However, after a couple of hours of training it takes over two minutes to save. The only cause I can think of is the list of losses, but I can't see how that would increase the time by that much.
Update 1:
I have my losses as:
self.losses = []
I'm appending the loss at each epoch to this list as follows:
#... loss calculation
loss.backward()
self.optimizer.step()
self.scheduler.step()
self.losses.append(loss)
As mentioned in the comments, the instruction
self.losses.append(loss)
is definitely the culprit, and should be replaced with
self.losses.append(loss.item())
The reason is that when you store the tensor loss, you also store the whole computational graph alongside it (all the information that is required to perform the backward pass). In other words, you are not merely storing a tensor, but also pointers to all the tensors that were involved in the computation of the loss and their relations (which ones were added, multiplied, etc.). So it grows really big really fast.
When you call loss.item(), you detach the value from the computational graph, and thus you only store what you intended: the loss value itself, as a simple Python float. (loss.detach() would also work; it returns a tensor cut off from the graph rather than a float.)
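For illustration, here is a minimal sketch of the difference between storing the loss tensor and storing its value (the toy model and data are made up, not taken from the question):

import torch

# Hypothetical toy setup, only to illustrate the difference
model = torch.nn.Linear(10, 1)
criterion = torch.nn.MSELoss()
x, y = torch.randn(4, 10), torch.randn(4, 1)

loss = criterion(model(x), y)

print(type(loss), loss.grad_fn)   # a tensor that still references its graph
print(type(loss.item()))          # a plain Python float, no graph attached

losses = []
losses.append(loss.item())        # store only the number, not the graph

Appending loss.item() keeps the list lightweight no matter how many epochs you run, so the checkpoint size (and save time) stays roughly constant.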
Related
I have a set up as follows, where I have an outer for loop iterating over epochs, and an inner for loop iterating over batches.
In the inner for loop, over batches, I'm using a cross-entropy loss and an Adam optimizer with a certain learning rate.
After the inner for loop (after all batches are evaluated), I'm then calculating another loss function based off of the output (a custom loss function), and optimizing.
However, I notice that when I define a second optimizer with a different learning rate for this step, it doesn't seem to train. When I keep the same optimizer, things change, but when I switch to the second one, they don't. Example as follows:
net = <my defined model, from another function>
optimizer_1 = torch.optim.Adam(net.parameters(), lr=0.1)
optimizer_2 = torch.optim.Adam(net.parameters(), lr=0.01)

for epoch in range(num_epochs):
    for data in training_data:  # these are the batches
        <get output here>
        loss1 = <compute loss function>
        optimizer_1.zero_grad()
        loss1.backward()
        optimizer_1.step()

    loss2 = <compute a different loss function here>
    optimizer_2.zero_grad()  # use a second optimizer with a different learning rate
    loss2.backward()
    optimizer_2.step()
When I do this, it doesn't seem to actually carry out the second optimization on the second loss function. Why is this? I want the second optimization to use a different learning rate than the first one, yet only reusing the first optimizer, optimizer_1, with its respective learning rate seems to work.
First, you can accumulate loss1 in the inner loop.
Next, you might want to consider merging the two loss functions:
(sum(accumulated_loss1) + loss2).backward()
This ensures both losses are taken into account during training and all the gradients are propagated in the backward pass.
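Putting that together, here is one possible sketch under the assumptions of the question (net, training_data, num_epochs, and the two loss computations are placeholders for the asker's own code, and the learning rate is arbitrary):

optimizer = torch.optim.Adam(net.parameters(), lr=0.01)

for epoch in range(num_epochs):
    accumulated_loss1 = []
    for data in training_data:
        output = net(data)                                # placeholder forward pass
        accumulated_loss1.append(compute_loss1(output))   # placeholder per-batch loss

    loss2 = compute_loss2(output)                         # placeholder custom loss

    optimizer.zero_grad()
    (sum(accumulated_loss1) + loss2).backward()           # one combined backward pass
    optimizer.step()

Note that keeping the loss1 tensors around keeps their computational graphs alive until the backward pass, so this trades memory for a single consistent update; if the two losses need different weights, scale them before summing.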
I had a look at this tutorial in the PyTorch docs for understanding Transfer Learning. There was one line that I failed to understand.
After the loss is calculated using loss = criterion(outputs, labels), the running loss is calculated using running_loss += loss.item() * inputs.size(0) and finally, the epoch loss is calculated using running_loss / dataset_sizes[phase].
Isn't loss.item() supposed to be the loss for an entire mini-batch (please correct me if I am wrong)? I.e., if the batch_size is 4, loss.item() would give the loss for the entire set of 4 images. If this is true, why is loss.item() being multiplied by inputs.size(0) while calculating running_loss? Isn't this step an extra multiplication in this case?
Any help would be appreciated. Thanks!
It's because the loss given by CrossEntropyLoss (and most other loss functions) is divided by the number of elements, i.e. the reduction parameter is 'mean' by default:
torch.nn.CrossEntropyLoss(weight=None, size_average=None, ignore_index=-100, reduce=None, reduction='mean')
Hence, loss.item() contains the loss of the entire mini-batch, divided by the batch size. That's why loss.item() is multiplied by the batch size, given by inputs.size(0), when calculating running_loss.
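In context, the tutorial's bookkeeping amounts to the following sketch (dataloader, model, criterion, and dataset_size stand in for the tutorial's own variables):

criterion = torch.nn.CrossEntropyLoss(reduction='mean')

running_loss = 0.0
for inputs, labels in dataloader:
    outputs = model(inputs)
    loss = criterion(outputs, labels)               # mean loss over this batch
    running_loss += loss.item() * inputs.size(0)    # undo the mean: summed loss over this batch

epoch_loss = running_loss / dataset_size            # mean loss per sample over the whole dataset

Multiplying by inputs.size(0) also handles the last batch correctly when it is smaller than batch_size, which a plain running average over batches would not.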
if the batch_size is 4, loss.item() would give the loss for the entire set of 4 images
That depends on how the loss is calculated. Remember, loss is a tensor just like every other tensor. In general, PyTorch's loss functions return the average loss over the mini-batch by default:
"The losses are averaged across observations for each minibatch."
t.item() for a single-element tensor t simply converts it to a plain Python number (a float for floating-point tensors).
More importantly, if you are new to PyTorch, it might be helpful to know that we use t.item() to maintain the running loss instead of t, because a loss tensor keeps a reference to its whole computation history, which can overload your GPU memory very quickly.
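As a quick sanity check of the averaging claim (toy tensors, not from the tutorial): with reduction='mean', the loss is simply the 'sum' reduction divided by the number of samples in the batch.

import torch

logits = torch.randn(4, 3)                # a batch of 4 samples, 3 classes
targets = torch.tensor([0, 2, 1, 0])

mean_loss = torch.nn.CrossEntropyLoss(reduction='mean')(logits, targets)
sum_loss = torch.nn.CrossEntropyLoss(reduction='sum')(logits, targets)

# mean * batch_size recovers the summed loss over the batch
assert torch.isclose(mean_loss * 4, sum_loss)
print(mean_loss.item(), sum_loss.item())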
I'm training a sequence to sequence (seq2seq) model and I have different values to train on for the input_sequence_length.
For values of 10 and 15 I get acceptable results, but when I try to train with 20 I get memory errors, so I switched to training by batches. However, the model overfits and the validation loss explodes, and even with accumulated gradients I get the same behavior, so I'm looking for hints and leads towards a more accurate way to do the update.
Here is my training function (only the batch section):
if batch_size is not None:
    k = len(list(np.arange(0, (X_train_tensor_1.size()[0] // batch_size - 1), batch_size)))
    for epoch in range(num_epochs):
        optimizer.zero_grad()
        epoch_loss = 0
        # using equidistant batches up to the last one is much faster than using X.size()[0] directly
        for i in list(np.arange(0, (X_train_tensor_1.size()[0] // batch_size - 1), batch_size)):
            sequence = X_train_tensor[i:i + batch_size, :, :].reshape(-1, sequence_length, input_size).to(device)
            labels = y_train_tensor[i:i + batch_size, :, :].reshape(-1, sequence_length, output_size).to(device)
            # Forward pass
            outputs = model(sequence)
            loss = criterion(outputs, labels)
            epoch_loss += loss.item()
            # Backward pass: gradients accumulate across batches because zero_grad() is only called once per epoch
            loss.backward()
        # Single optimizer step per epoch, acting on the accumulated gradient
        optimizer.step()
        epoch_loss = epoch_loss / k
        model.eval()
        validation_loss, _ = evaluate(model, X_test_hard_tensor_1, y_test_hard_tensor_1)
        model.train()
        training_loss_log.append(epoch_loss)
        print('Epoch [{}/{}], Train MSELoss: {}, Validation: {}'.format(
            epoch + 1, num_epochs, epoch_loss, validation_loss))
EDIT:
Here are the parameters that I'm training with:
batch_size = 1024
num_epochs = 25000
learning_rate = 10e-04
optimizer=torch.optim.Adam(model.parameters(), lr=learning_rate)
criterion = nn.MSELoss(reduction='mean')
Batch size affects regularization. Training on a single example at a time is quite noisy, which makes it harder to overfit. Training on batches smoothes everything out, which makes it easier to overfit. Translating back to regularization:
Smaller batches add regularization.
Larger batches reduce regularization.
I am also curious about your learning rate. Every call to loss.backward() will accumulate the gradient. If you have set your learning rate to expect a single example at a time, and not reduced it to account for batch accumulation, then one of two things will happen.
The learning rate will be too high for the now-accumulated gradient, training will diverge, and both training and validation errors will explode.
The learning rate won't be too high, and nothing will diverge. The model will just train more quickly and effectively. If the model is too large for the data being fit, then training error will go to 0 but validation error will explode due to overfitting.
Update
Here is a bit more detail regarding the gradient accumulation.
Every call to loss.backward() will accumulate gradient, until you reset it with optimizer.zero_grad(). It will be acted on when you call optimizer.step(), based on whatever it has accumulated.
The way your code is written, you call loss.backward() for every pass through the inner loop, then you call optimizer.step() in the outer loop before resetting. So the gradient has been accumulated, that is summed, over all examples in the batch and not just one example at a time.
Under most assumptions, that will make the batch-accumulated gradient larger than the gradient for a single example. If the gradients are all aligned, for B batches, it will be larger by B times. If the gradients are i.i.d. then it will be more like sqrt(B) times larger.
If you do not account for this, then you have effectively increased your learning rate by that factor. Some of that will be mitigated by the smoothing effect of larger batches, which can then tolerate a higher learning rate. Larger batches reduce regularization, larger learning rates add it back. But that will not be a perfect match to compensate, so you will still want to adjust accordingly.
In general, whenever you change your batch size you will also want to re-tune your learning rate to compensate.
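One common way to compensate (a sketch, not the asker's exact code; model, criterion, optimizer, batches, and k are placeholders standing in for the question's variables) is to divide each batch loss by the number of accumulated batches, so the summed gradient has roughly the same scale as a single-batch gradient and the learning rate you already tuned still applies.

accumulation_steps = k                      # number of batches accumulated before one step

optimizer.zero_grad()
for sequence, labels in batches:            # placeholder iterable of (input, target) batches
    outputs = model(sequence)
    loss = criterion(outputs, labels) / accumulation_steps   # scale down each batch's contribution
    loss.backward()                         # gradients sum across batches

optimizer.step()                            # one update on the averaged gradient
optimizer.zero_grad()

Alternatively, keep the losses unscaled and lower the learning rate; either way, re-tune after changing the effective batch size, as noted above.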
Leslie N. Smith has written some excellent papers on a methodical approach to hyperparameter tuning. A great place to start is A disciplined approach to neural network hyper-parameters: Part 1 -- learning rate, batch size, momentum, and weight decay. He recommends you start by reading the diagrams, which are very well done.
I am just a little confused on the following:
I am training a neural network and have it print out the losses. I am training it over 4 iterations just to try it out, using batches. I normally see loss curves as parabolas, where the losses decrease to a minimum point before increasing again. But my losses keep increasing as the iteration progresses.
For example, let's say there are 100 batches in each iteration. In iteration 0, losses started at 26.3 (batch 0) and went up to 1500.7 (batch 100). In iteration 1, it started at 2.4e-14 and went up to 80.8.
I am following an example from spacy (https://spacy.io/usage/examples#training-ner). Should I be comparing the losses across batches instead (i.e. if I take the points from all of the batch 0s it should resemble a parabola)?
If you are using the exact same code as linked, this behaviour is to be expected.
for itn in range(n_iter):
    random.shuffle(TRAIN_DATA)
    losses = {}
    # batch up the examples using spaCy's minibatch
    batches = minibatch(TRAIN_DATA, size=compounding(4.0, 32.0, 1.001))
    for batch in batches:
        texts, annotations = zip(*batch)
        nlp.update(
            texts,  # batch of texts
            annotations,  # batch of annotations
            drop=0.5,  # dropout - make it harder to memorise data
            losses=losses,
        )
    print("Losses", losses)
An "iteration" is the outer loop: for itn in range(n_iter). And from the sample code you can also infer that losses is being reset every iteration. The nlp.update call will actually increment the appropriate loss in each call, i.e. with each batch that it processes.
So yes: the loss increases WITHIN an iteration, for each batch that you process. To check whether your model is actually learning anything, you need to check the loss across iterations, similar to how the print statement in the original snippet only prints after looping through the batches, not during.
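For example, you could keep the final losses value of each iteration and compare those (a small, hypothetical extension of the linked snippet; it assumes the NER pipe is being trained, so the relevant key is "ner"):

loss_history = []
for itn in range(n_iter):
    random.shuffle(TRAIN_DATA)
    losses = {}
    batches = minibatch(TRAIN_DATA, size=compounding(4.0, 32.0, 1.001))
    for batch in batches:
        texts, annotations = zip(*batch)
        nlp.update(texts, annotations, drop=0.5, losses=losses)
    loss_history.append(losses["ner"])   # total NER loss accumulated over this iteration
    print("Iteration", itn, "loss:", losses["ner"])

If the model is learning, loss_history should trend downward across iterations, even though losses grows within each one.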
Hope that helps!
In my understanding, an epoch is one run over the whole dataset, repeated an arbitrary number of times; the dataset in turn is processed in parts, so-called batches. After each train_on_batch a loss is calculated, the weights are updated, and the next batch gets better results. These losses are indicators of the quality and learning state of my two NNs.
In several sources the loss is calculated (and printed) per epoch. Therefore I am not sure if I am doing this right.
At the moment my GAN looks like this:
for epoch in range(n_epochs):
    for batch in batches:
        fakes = generator.predict_on_batch(batch)
        dlc = discriminator.train_on_batch(batch, ..)
        dlf = discriminator.train_on_batch(fakes, ..)
        dis_loss_total = 0.5 * np.add(dlc, dlf)
        g_loss = gan.train_on_batch(batch, ..)
        # save losses to array to work with later
These losses are for each batch. How do I get them for an epoch? And as an aside: do I need losses for an epoch at all, and if so, what for?
There is no direct way to compute the loss for an epoch. Actually, the loss of an epoch is usually defined as the average of the loss of batches in that epoch. So you can accumulate the loss values during an epoch and at the end divide it by the number of batches in the epoch:
epoch_loss = []
for epoch in range(n_epochs):
    acc_loss = 0.
    for batch in range(n_batches):
        # do the training
        loss = model.train_on_batch(...)
        acc_loss += loss
    epoch_loss.append(acc_loss / n_batches)
As for the other question, one usage of epoch loss might be to use it as an indicator to stop the training (however, the validation loss is usually used for that, not the training loss).
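As a rough illustration of that stopping criterion, here is a minimal, framework-agnostic sketch (train_one_epoch and evaluate_validation are hypothetical placeholders for your own training and validation code):

best_val_loss = float("inf")
patience, bad_epochs = 5, 0            # stop after 5 epochs without improvement

for epoch in range(n_epochs):
    train_loss = train_one_epoch()     # e.g. the averaged batch losses from above
    val_loss = evaluate_validation()   # loss on a held-out validation set

    if val_loss < best_val_loss:
        best_val_loss, bad_epochs = val_loss, 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            print("Stopping early at epoch", epoch)
            break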
I'll expand on @today's answer a bit. There is a balance to strike in how to report the loss over an epoch and how to use it to decide when training should stop.
If you only look at the loss of the most recent batch, it will be a very noisy estimate of your dataset loss, because maybe that batch happened to contain all the samples your model has trouble with, or all the samples that are trivial for it.
If you look at the averaged loss over all batches in the epoch, you may get a skewed estimate because, as you indicated, the model has (hopefully) been improving over the epoch, so the performance on the initial batches isn't meaningfully comparable to the performance on the later batches.
The only way to accurately report your epoch loss is to take your model out of training mode, i.e. fix all the model parameters, and run your model on the whole dataset. That will be an unbiased computation of your epoch loss. However, in general that's a terrible idea because if you have a complex model or a lot of training data, you will waste a lot of time doing this.
So, I think it's most common to balance these factors by reporting an averaged loss over N mini-batches, where N is large enough to smooth out the noise of individual batches but not so large that the model performance is not comparable between the first and last batches.
I know you're in Keras but here is a PyTorch example that illustrates this concept clearly, replicated here:
for epoch in range(2):  # loop over the dataset multiple times

    running_loss = 0.0
    for i, data in enumerate(trainloader, 0):
        # get the inputs; data is a list of [inputs, labels]
        inputs, labels = data

        # zero the parameter gradients
        optimizer.zero_grad()

        # forward + backward + optimize
        outputs = net(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

        # print statistics
        running_loss += loss.item()
        if i % 2000 == 1999:    # print every 2000 mini-batches
            print('[%d, %5d] loss: %.3f' %
                  (epoch + 1, i + 1, running_loss / 2000))
            running_loss = 0.0

print('Finished Training')
You can see they accumulate the loss over N=2000 batches, report the averaged loss over those 2000 batches, then zero out the running loss and keep going.