How to compute and sum gradients over multiple mini-batches in Keras?

I want to increase the mini-batch size for my neural network during training (instead of decaying the learning rate), but the upper limit for the mini-batch size is 8 because of my GPU memory.
I found this article
https://medium.com/#davidlmorton/increasing-mini-batch-size-without-increasing-memory-6794e10db672
on how to increase the mini-batch size without increasing memory use, which it does by implementing a DataLoader in PyTorch.
"The technique is simple, you just compute and sum gradients over multiple mini-batches. Only after the specified number of mini-batches do you update the model parameters."
count = 0
for inputs, targets in training_data_loader:
    if count == 0:
        optimizer.step()
        optimizer.zero_grad()
        count = batch_multiplier
    outputs = model(inputs)
    loss = loss_function(outputs, targets) / batch_multiplier
    loss.backward()
    count -= 1
However, I couldn't find any examples of this in Keras. I assume it would have to be done in a data_generator(Sequence), in the __getitem__ function?
But I have no idea how to implement it, or whether it is even possible in Keras. I also looked at examples of DataGenerators, but none of them involved optimizers.
I would appreciate it if anybody could help me out!
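
A framework-independent sketch of the arithmetic behind the quoted loop may help here. It uses plain NumPy and a made-up linear model (the data, the `mse_grad` helper and the sizes are all assumptions for illustration) to check that summing the gradients of `loss / batch_multiplier` over several small batches gives the same gradient as one large batch. In Keras, the same pattern would most likely live in a custom training loop (for example with `tf.GradientTape` and `optimizer.apply_gradients`) rather than in a Sequence, since `__getitem__` only supplies data; that mapping is an assumption and is not shown in the linked article.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 3))      # 32 samples, 3 features (made-up data)
y = rng.normal(size=32)
w = np.zeros(3)                   # parameters of a toy linear model

def mse_grad(w, xb, yb):
    """Gradient of mean((xb @ w - yb)**2) with respect to w."""
    return 2.0 * xb.T @ (xb @ w - yb) / len(xb)

micro_batch = 8                   # largest batch that fits in memory
batch_multiplier = 4              # 4 x 8 = effective batch of 32

# Accumulate gradients of (loss / batch_multiplier) over the micro-batches.
accumulated = np.zeros_like(w)
for start in range(0, len(X), micro_batch):
    xb = X[start:start + micro_batch]
    yb = y[start:start + micro_batch]
    accumulated += mse_grad(w, xb, yb) / batch_multiplier

# One gradient computed on the full effective batch, for comparison.
full_batch = mse_grad(w, X, y)
print(np.allclose(accumulated, full_batch))

# A single parameter update then uses the accumulated gradient.
learning_rate = 0.1
w_new = w - learning_rate * accumulated
```

The equality holds exactly only when every micro-batch has the same size and the loss is a per-sample mean; with a shorter final batch the weights would need adjusting.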

Related

Is it possible to perform step according to batch size in pytorch?

I am iterating over training samples in batches; however, the last batch always returns fewer samples.
Is it possible to specify the step size in torch according to the current batch length?
For example, most batches have 64 samples, but the last batch has only 6.
If I follow the usual routine:
optimizer.zero_grad()
loss.backward()
optimizer.step()
It seems that the last 6 samples carry the same weight in the parameter update as a full batch of 64, when in fact they should carry only about 1/10 of that weight because there are fewer samples.
In MXNet I could specify the step size accordingly, but I don't know how to do it in torch.

You can define a custom loss function and then, for example, reweight it based on batch size:
import torch.nn as nn

def reweighted_cross_entropy(my_outputs, my_labels):
    # compute batch size
    my_batch_size = my_outputs.size()[0]
    # CrossEntropyLoss averages over the batch by default (reduction='mean')
    original_loss = nn.CrossEntropyLoss()
    loss = original_loss(my_outputs, my_labels)
    # reweight accordingly: mean loss times batch size
    return my_batch_size * loss
If you are using something like plain gradient descent, then it is easy to see that
(1/10 * lr) * grad(loss) = lr * grad(1/10 * loss)
so reweighting the loss is equivalent to reweighting your learning rate. This won't be exactly true for more complex optimisers, but it can be good enough in practice.
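
To make the caveat about more complex optimisers concrete, here is a small NumPy sketch. The gradient values, the `sgd_step` helper and the hand-written `adam_first_step` helper are assumptions for illustration, not PyTorch code. It compares the update from plain gradient descent with the very first Adam update when the loss is scaled by 1/10.

```python
import numpy as np

def sgd_step(grad, lr):
    """Parameter change produced by one plain gradient-descent step."""
    return -lr * grad

def adam_first_step(grad, lr, beta1=0.9, beta2=0.999, eps=1e-8):
    """First Adam update from zero-initialized moments, with bias correction."""
    m = (1 - beta1) * grad
    v = (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1)
    v_hat = v / (1 - beta2)
    return -lr * m_hat / (np.sqrt(v_hat) + eps)

grad = np.array([0.5, -2.0, 1.5])   # made-up gradient of the unscaled loss
lr, scale = 0.01, 0.1

# Plain gradient descent: scaling the loss equals scaling the learning rate.
sgd_scaled_loss = sgd_step(scale * grad, lr)
sgd_scaled_lr = sgd_step(grad, scale * lr)

# Adam normalizes by the gradient magnitude, so scaling the loss
# barely changes the first update.
adam_unscaled = adam_first_step(grad, lr)
adam_scaled_loss = adam_first_step(scale * grad, lr)

print(sgd_scaled_loss, sgd_scaled_lr)
print(adam_unscaled, adam_scaled_loss)
```

With an Adam-style optimiser, the effect of scaling a single batch's loss also depends on the moment estimates built up from earlier batches, so this first-step picture is only a rough guide.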

I suggest just ignoring the last batch. PyTorch's DataLoader has a parameter that implements this behavior:
drop_last=True  # False by default

Neural Network optimization using epoch and batch

I am trying to optimize a given neural network (e.g. a multilayer perceptron with 2 hidden layers) by finding the number of epochs and the batch size that give the highest accuracy.
for epoch from 10 to 200 (in steps of 10):
    for batch from 40 to 200 (in steps of 20):
        modele.fit(X_train, Y_train, epochs = epoch, batch_size = batch)
        I save batch, epoch, Accuracy;
Afterwards I keep the smallest number of epochs, with the smallest corresponding batch size, that has the highest accuracy,
e.g. best_params: epoch = 10, batch = 150 => Accuracy = 94%
My problem is that when I re-run my model with the best_params, it doesn't give me the same results (loss, accuracy), and sometimes the accuracy is even very low (e.g. 10%).
I tried fixing the seed, but the results did not improve.
Regards
Djam75

import numpy as np
import pandas as pd

df = pd.DataFrame(columns=['Nb_Batch', 'Nb_Epoch', 'Accuracy'])
i = 0
lst_loss = []
lst_accuracy = []
lst_epoch = list(np.arange(10, 200, 10))
lst_batch = list(np.arange(100, 400, 20))
for epoch in lst_epoch:
    print('---------------- Epoch ' + str(epoch) + '------------------')
    for batch in lst_batch:
        # Note: fit() continues from the model's current weights; it does not re-initialize them.
        # `epochs` replaces the older `nb_epoch` argument name.
        modelSimple.fit(X_train, Y_train, epochs=epoch, batch_size=batch, verbose=0)
        score = modelSimple.evaluate(X_test, Y_test)
        df.loc[i, "Nb_Batch"] = batch
        df.loc[i, "Nb_Epoch"] = epoch
        df.loc[i, "Accuracy"] = score[1] * 100
        i = i + 1

This might be happening due to random parameter initialization. If you are building an end-to-end model without transfer learning, the parameters get new random initial values every time you train the architecture.
In this case, a good practice is to use batch normalization layers after some of the layers, depending on your architecture.
TensorFlow implementation
PyTorch implementation
Extra idea:
Do not use any 'for' or 'while' loops in the model implementation.
You can follow the templates in TensorFlow or PyTorch.
Or, if you build a complete model from scratch, vectorize the operations with a matrix-operation library such as NumPy.

Thanks for the update.
I resolved my problem by saving the model and loading it afterwards.
Thanks for the idea (batch normalization) and the extra idea: not using any for loops ;-)
Regards

I think you might not be updating the weight matrix after completing the training for certain batch sizes and epochs.
Please include the code as well in order to see the problem
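
One way to make such a search reproducible is to start every (epochs, batch size) combination from freshly initialized, seeded weights, so that re-running the best combination repeats exactly the same training. The sketch below illustrates this with a tiny NumPy logistic regression standing in for the Keras model. All names here (`make_weights`, `fit`, `accuracy`, `run`) and the synthetic data are assumptions for illustration. With Keras, the equivalent would presumably be to rebuild the model, or reload saved initial weights, before each call to fit.

```python
import numpy as np

def make_weights(seed, n_features):
    """Fresh, seeded initial weights (stand-in for rebuilding a model)."""
    rng = np.random.default_rng(seed)
    return rng.normal(scale=0.1, size=n_features)

def fit(w, X, y, epochs, batch_size, lr=0.1):
    """Mini-batch gradient descent for logistic regression; returns new weights."""
    w = w.copy()
    for _ in range(epochs):
        for start in range(0, len(X), batch_size):
            xb = X[start:start + batch_size]
            yb = y[start:start + batch_size]
            p = 1.0 / (1.0 + np.exp(-(xb @ w)))
            w -= lr * xb.T @ (p - yb) / len(xb)
    return w

def accuracy(w, X, y):
    """Fraction of samples whose predicted class matches the label."""
    return float(np.mean((X @ w > 0) == (y == 1)))

def run(epochs, batch_size, data, seed=0):
    """Train from scratch with the given settings and return test accuracy."""
    X_train, y_train, X_test, y_test = data
    w0 = make_weights(seed, X_train.shape[1])   # re-initialize on every run
    w = fit(w0, X_train, y_train, epochs, batch_size)
    return accuracy(w, X_test, y_test)

# Synthetic, linearly separable data (made up for the example).
rng = np.random.default_rng(42)
X = rng.normal(size=(400, 5))
y = (X @ np.array([1.0, -2.0, 0.5, 0.0, 1.5]) > 0).astype(float)
data = (X[:300], y[:300], X[300:], y[300:])

results = {(e, b): run(e, b, data) for e in (5, 10, 20) for b in (20, 50, 100)}
best = max(results, key=results.get)
rerun_score = run(best[0], best[1], data)
print(best, results[best], rerun_score)
```

Because each run starts from the same seeded weights and the batches are not shuffled, the re-run score matches the score recorded during the search.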

How does PyTorch compute the backward pass when optimizing triplet loss?

I am implementing a triplet network in Pytorch where the 3 instances (sub-networks) share the same weights. Since the weights are shared, I implemented it as a single instance network that is called three times to produce the anchor, positive, and negative embeddings. The embeddings are learned by optimizing the triplet loss. Here is a small snippet for illustration:
from dependencies import *
model = SingleSubNet() # represents each instance in the triplet net
for epoch in epochs:
    for anch, pos, neg in train_loader:
        optimizer.zero_grad()
        fa, fp, fn = model(anch), model(pos), model(neg)
        loss = triplet_loss(fa, fp, fn)
        loss.backward()
        optimizer.step()
        # Do more stuff ...
My complete code works as expected. However, I do not understand how loss.backward() computes the gradient(s) in this case. I am confused because there are 3 gradients of the loss in each learning step (the gradient formulas are here). I assume the gradients are summed before performing optimizer.step(). But then it looks from the equations as if the summed gradients would cancel each other out and yield a zero update term. Of course, this is not true, as the network learns meaningful embeddings in the end.
Thanks in advance

Late answer, but hope this helps someone.
The gradients that you linked are the gradients of the loss with respect to the embeddings (the anchor, positive embedding and negative embedding). To update the model parameters, you use the gradient of the loss with respect to the model parameters. This does not sum to zero.
The reason for this is that when calculating the gradient of the loss with respect to the model parameters, the formula makes use of the activations from the forward pass, and the 3 different inputs (anchor image, positive example and negative example) have different activations in the forward pass.
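
A small numerical check of this point, as a sketch under stated assumptions: the embedding is a single shared linear map `W`, the loss uses squared Euclidean distances, and the hinge is taken to be active. The variable names and sizes are made up. The three gradients with respect to the embeddings cancel, but the gradient with respect to `W` does not, because each one is multiplied by a different input.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(2, 4))                 # shared embedding weights
x_a, x_p, x_n = rng.normal(size=(3, 4))     # anchor, positive, negative inputs

# Three forward passes through the same weights.
f_a, f_p, f_n = W @ x_a, W @ x_p, W @ x_n

# Loss (hinge assumed active): ||f_a - f_p||^2 - ||f_a - f_n||^2 + margin.
# Gradients of that loss with respect to each embedding:
g_a = 2 * (f_n - f_p)
g_p = -2 * (f_a - f_p)
g_n = 2 * (f_a - f_n)

# Chain rule through f = W @ x: each branch contributes outer(g, x).
grad_W = np.outer(g_a, x_a) + np.outer(g_p, x_p) + np.outer(g_n, x_n)

print(g_a + g_p + g_n)   # embedding gradients cancel
print(grad_W)            # parameter gradient does not
```

This summed contribution from the three branches is what backpropagation accumulates into the shared weights before optimizer.step() is called.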

What is running loss in PyTorch and how is it calculated

I had a look at this tutorial in the PyTorch docs for understanding Transfer Learning. There was one line that I failed to understand.
After the loss is calculated using loss = criterion(outputs, labels), the running loss is calculated using running_loss += loss.item() * inputs.size(0) and finally, the epoch loss is calculated using running_loss / dataset_sizes[phase].
Isn't loss.item() supposed to be for an entire mini-batch (please correct me if I am wrong)? That is, if the batch_size is 4, loss.item() would give the loss for the entire set of 4 images. If this is true, why is loss.item() multiplied by inputs.size(0) when calculating running_loss? Isn't this an extra multiplication in this case?
Any help would be appreciated. Thanks!

It's because the loss given by CrossEntropy or other loss functions is divided by the number of elements i.e. the reduction parameter is mean by default.
torch.nn.CrossEntropyLoss(weight=None, size_average=None, ignore_index=-100, reduce=None, reduction='mean')
Hence, loss.item() contains the loss of entire mini-batch, but divided by the batch size. That's why loss.item() is multiplied with batch size, given by inputs.size(0), while calculating running_loss.
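
The bookkeeping can be checked with made-up numbers. This NumPy sketch assumes ten per-sample losses and a batch size of 4, so the last batch has only 2 samples. Multiplying each batch mean by its batch size and dividing by the dataset size recovers the true per-sample average, while simply averaging the batch means over-weights the short last batch.

```python
import numpy as np

# Made-up per-sample losses for 10 samples; batches will be 4, 4 and 2.
per_sample_loss = np.array([1.0] * 8 + [3.0] * 2)
batch_size = 4

running_loss = 0.0
sum_of_batch_means = 0.0
num_batches = 0
for start in range(0, len(per_sample_loss), batch_size):
    batch = per_sample_loss[start:start + batch_size]
    batch_mean = batch.mean()                 # what reduction='mean' reports
    running_loss += batch_mean * len(batch)   # loss.item() * inputs.size(0)
    sum_of_batch_means += batch_mean
    num_batches += 1

epoch_loss = running_loss / len(per_sample_loss)   # running_loss / dataset size
naive_loss = sum_of_batch_means / num_batches      # over-weights the short batch

print(epoch_loss, naive_loss)
```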

if the batch_size is 4, loss.item() would give the loss for the entire set of 4 images
That depends on how the loss is calculated. Remember, loss is a tensor just like every other tensor. In general, the PyTorch APIs return the average loss by default:
"The losses are averaged across observations for each minibatch."
t.item() for a single-element tensor t simply converts it to a standard Python number (a Python float for a floating-point tensor).
More importantly, if you are new to PyTorch, it might be helpful to know that we use t.item() to maintain the running loss instead of t, because a loss tensor keeps a reference to its autograd history. Accumulating the tensors themselves keeps every batch's computation graph alive and can exhaust GPU memory very quickly.

training by batches leads to more over-fitting

I'm training a sequence to sequence (seq2seq) model and I have different values to train on for the input_sequence_length.
For values 10 and 15 I get acceptable results, but when I try to train with 20 I get memory errors, so I switched to training by batches. Now the model overfits and the validation loss explodes, and I get the same behavior even with accumulated gradients. I'm looking for hints and leads on more accurate ways to do the update.
Here is my training function (only with batch section) :
if batch_size is not None:
    k = len(list(np.arange(0, (X_train_tensor_1.size()[0]//batch_size-1), batch_size)))
    for epoch in range(num_epochs):
        optimizer.zero_grad()
        epoch_loss = 0
        # by using equidistant batch till the last one it becomes much faster than using the X.size()[0] directly
        for i in list(np.arange(0, (X_train_tensor_1.size()[0]//batch_size-1), batch_size)):
            sequence = X_train_tensor[i:i+batch_size,:,:].reshape(-1, sequence_length, input_size).to(device)
            labels = y_train_tensor[i:i+batch_size,:,:].reshape(-1, sequence_length, output_size).to(device)
            # Forward pass
            outputs = model(sequence)
            loss = criterion(outputs, labels)
            epoch_loss += loss.item()
            # Backward and optimize
            loss.backward()
        optimizer.step()
        epoch_loss = epoch_loss/k
        model.eval  # note: without parentheses this does not switch the model to eval mode
        validation_loss, _ = evaluate(model, X_test_hard_tensor_1, y_test_hard_tensor_1)
        model.train()
        training_loss_log.append(epoch_loss)
        print('Epoch [{}/{}], Train MSELoss: {}, Validation : {}'.format(epoch+1, num_epochs, epoch_loss, validation_loss))
EDIT:
here are the parameters that I'm training with :
batch_size = 1024
num_epochs = 25000
learning_rate = 10e-04
optimizer=torch.optim.Adam(model.parameters(), lr=learning_rate)
criterion = nn.MSELoss(reduction='mean')

Batch size affects regularization. Training on a single example at a time is quite noisy, which makes it harder to overfit. Training on batches smoothes everything out, which makes it easier to overfit. Translating back to regularization:
Smaller batches add regularization.
Larger batches reduce regularization.
I am also curious about your learning rate. Every call to loss.backward() will accumulate the gradient. If you have set your learning rate to expect a single example at a time, and not reduced it to account for batch accumulation, then one of two things will happen.
The learning rate will be too high for the now-accumulated gradient, training will diverge, and both training and validation errors will explode.
The learning rate won't be too high, and nothing will diverge. The model will just train more quickly and effectively. If the model is too large for the data being fit, then training error will go to 0 but validation error will explode due to overfitting.
Update
Here is a bit more detail regarding the gradient accumulation.
Every call to loss.backward() will accumulate gradient, until you reset it with optimizer.zero_grad(). It will be acted on when you call optimizer.step(), based on whatever it has accumulated.
The way your code is written, you call loss.backward() for every pass through the inner loop, then you call optimizer.step() in the outer loop before resetting. So the gradient has been accumulated, that is summed, over all the mini-batches in the epoch rather than applied one mini-batch at a time.
Under most assumptions, that will make the accumulated gradient larger than the gradient for a single mini-batch. If the gradients are all aligned, then for B batches it will be B times larger. If the gradients are i.i.d., it will be more like sqrt(B) times larger.
If you do not account for this, then you have effectively increased your learning rate by that factor. Some of that will be mitigated by the smoothing effect of larger batches, which can then tolerate a higher learning rate. Larger batches reduce regularization, larger learning rates add it back. But that will not be a perfect match to compensate, so you will still want to adjust accordingly.
In general, whenever you change your batch size you will also want to re-tune your learning rate to compensate.
Leslie N. Smith has written some excellent papers on a methodical approach to hyperparameter tuning. A great place to start is A disciplined approach to neural network hyper-parameters: Part 1 -- learning rate, batch size, momentum, and weight decay. He recommends you start by reading the diagrams, which are very well done.
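
The B versus sqrt(B) scaling mentioned above can be illustrated with random vectors standing in for gradients. This is only a sketch: the dimension, the number of accumulated batches and the Gaussian "gradients" are assumptions, not measurements from any real model.

```python
import numpy as np

rng = np.random.default_rng(0)
B, dim = 16, 10_000              # accumulated batches, number of parameters

# Case 1: all B gradients are identical (perfectly aligned).
single = rng.normal(size=dim)
aligned_sum = B * single

# Case 2: B independent random gradients.
iid = rng.normal(size=(B, dim))
iid_sum = iid.sum(axis=0)

aligned_ratio = np.linalg.norm(aligned_sum) / np.linalg.norm(single)
iid_ratio = np.linalg.norm(iid_sum) / np.linalg.norm(iid, axis=1).mean()

print(aligned_ratio)   # equals B
print(iid_ratio)       # close to sqrt(B)
```

Dividing each loss by B before calling backward(), or dividing the learning rate by B, turns the accumulated sum back into a mean, which is one way to account for the factor.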
