I am using the TF2 Object Detection API with the pre-trained EfficientDet D3 model from the TF2 model zoo. During training on my own dataset I notice that the total loss jumps up and down - for example from 0.5 to 2.0 a few steps later, and then back to 0.75:
So all in all this training does not seem to be very stable. I thought the problem might be the learning rate, but as you can see in the charts above, I set the LR to decay during training, and it goes down to a really small value of 1e-15, so I don't see how this can be the problem (at least in the second half of the training).
Also, when I smooth the curves in TensorBoard, as in the second image above, one can see the total loss going down, so the direction is correct, even though it is still quite high. I would be interested in why I can't achieve better results with my training set, but I guess that is another question. First, I would really like to know why the total loss is going up and down so much throughout the training. Any ideas?
PS: The pipeline.config file for my training can be found here.
Your config states that your batch size is 2. This is tiny and will cause a very volatile loss.
Try increasing your batch size substantially - a value of 256 or 512, for example. If you are constrained by memory, you can increase the effective batch size via gradient accumulation.
Gradient accumulation is the process of synthesising a larger batch by combining the backwards passes from smaller mini-batches. You would run multiple backwards passes before updating the model's parameters.
Typically, a training loop would look like this (I'm using PyTorch-like syntax for illustrative purposes):
for model_inputs, truths in iter_batches():
    predictions = model(model_inputs)
    loss = get_loss(predictions, truths)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
With gradient accumulation, you'll put several batches through and then update the model. This simulates a larger batch size without requiring the memory to actually put a large batch size through all at once:
accumulations = 10
for i, (model_inputs, truths) in enumerate(iter_batches()):
    predictions = model(model_inputs)
    loss = get_loss(predictions, truths)
    # (optionally divide the loss by `accumulations` so the accumulated
    # gradient has roughly the same scale as one large batch)
    loss.backward()
    # only update the parameters every `accumulations` mini-batches
    if (i + 1) % accumulations == 0:
        optimizer.step()
        optimizer.zero_grad()
Reading
https://medium.com/huggingface/training-larger-batches-practical-tips-on-1-gpu-multi-gpu-distributed-setups-ec88c3e51255
How to accumulate gradients in tensorflow?
https://towardsdatascience.com/how-to-easily-use-gradient-accumulation-in-keras-models-fa02c0342b60
Understanding accumulated gradients in PyTorch
Here is a typical plot of train/test losses behaviour as epoch increases.
I'm not an expert but I have read several topics on similar problems. Well, let me explain what I'm doing.
First, I have used the implementation given by https://github.com/pytorch/vision/blob/master/torchvision/models/resnet.py for resnet18 & resnet50, and by https://github.com/akamaster/pytorch_resnet_cifar10 for resnet32 and resnet56. For all these nets I got the same kind of erratic test-loss behaviour.
Second, my inputs are images of size 5x64x64, so I have adapted the first convolutional layer, and the output of the last fully-connected layer consists of 180 neurons. I have used batch sizes of 64, 128 and 256 for training, and 128 for the test: the same behaviour persists. I have also used either 300k or 100k images as training input (100k for the test): the same behaviour persists too. The images are not "standard" RGB photos: first, as you have probably already remarked, there are 5 channels; second, the pixel values can be negative (e.g. spanning the range (-0.01, 500)).
Third, I am aware of the model.train() statement for the training phase, as well as the model.eval() statement (coupled with with torch.no_grad():) for the testing phase. It is clear that if I do not use model.eval() during the test phase, the test loss decreases gently like the training loss. But that is not allowed, is it?
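For readers less familiar with that pattern, here is a minimal sketch of a generic PyTorch evaluation routine (not the poster's exact code; model, loader and loss_fn are assumed to exist):

import torch

def evaluate(model, loader, loss_fn):
    model.eval()                      # BN uses running stats, dropout is disabled
    total, n = 0.0, 0
    with torch.no_grad():             # no gradient tracking at test time
        for inputs, targets in loader:
            outputs = model(inputs)
            total += loss_fn(outputs, targets).item() * inputs.size(0)
            n += inputs.size(0)
    model.train()                     # switch back to training mode afterwards
    return total / n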
I have tried several things after reading posts concerning the BatchNorm behaviour, without any success:
I have used SGD and Adam (& SWATS).
I have tried lr = 0.1 down to lr = 1e-5.
I have modified the BN momentum (default = 0.1) to 0.5 and 0.01, as well as the eps parameter (see the sketch just below).
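For reference, a minimal sketch of the kind of BN adjustment meant above, assuming a torchvision-style ResNet; the specific momentum/eps values are just examples:

import torch.nn as nn
import torchvision

model = torchvision.models.resnet18(num_classes=180)
# adapt the first convolution to the 5-channel inputs described above
model.conv1 = nn.Conv2d(5, 64, kernel_size=7, stride=2, padding=3, bias=False)

# change momentum / eps on every BatchNorm layer
for m in model.modules():
    if isinstance(m, nn.BatchNorm2d):
        m.momentum = 0.01   # PyTorch default is 0.1
        m.eps = 1e-3        # PyTorch default is 1e-5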
Now, I have managed to get nice results (i.e. good training & testing losses) with a classical CNN (i.e. without any batch normalization or shortcuts), but I would like to study ResNet behaviour against adversarial attacks. So, I would like to get ResNet to fit my images.
Any idea?
Thanks
After making some tests, I have found something: I have used the standard resnet20 (h=1). Then, I have used as test set the same samples (100,000 images) as for the train set. BUT, for the test set 1) I do not use shuffling, and 2) I do not apply any horizontal/vertical flip or Rot90deg, Rot180deg or Rot270deg. I observe the same kind of fluctuations for the test loss.
Moreover, when I switch OFF completely the transformations of the train set, and use the same set for train & test, I get the same behaviour:
And finally, if I switch off the shuffling and random transforms (flips & rotations) of the train set, and I use the same set for test, then I get:
It seems that the test loss is converging towards a value, but a different one from the train loss. Why?
In training a neural network in Tensorflow 2.0 in python, I'm noticing that training accuracy and loss change dramatically between epochs. I'm aware that the metrics printed are an average over the entire epoch, but accuracy seems to drop significantly after each epoch, despite the average always increasing.
The loss also exhibits this behavior, changing dramatically at each epoch boundary even though the per-epoch averages follow the expected trend. Here is an image of what I mean (from TensorBoard):
I've noticed this behavior on all of the models I've implemented myself, so it could be a bug, but I want a second opinion on whether this is normal behavior and if so what does it mean?
Also, I'm using a fairly large dataset (roughly 3 million examples). Batch size is 32 and each dot in the accuracy/loss graphs represents 50 batches (2k on the graph = 100k batches). The learning rate graph is 1:1 for batches.
It seems this phenomenon comes from the fact that the model has a high batch-to-batch variance in terms of accuracy and loss. This is illustrated if I take a graph of the model with the actual metrics per step as opposed to the average over the epoch:
Here you can see that the model can vary widely. (This graph is just for one epoch, but the fact remains).
Since the average metrics were being reported per epoch, at the beginning of the next epoch it is highly likely that the average metrics will be lower than the previous average, leading to a dramatic drop in the running average value, illustrated in red below:
If you imagine the discontinuities in the red graph as being epoch transitions, you can see why you would observe the phenomenon in the question.
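A tiny, purely synthetic illustration of that reset effect (the numbers are made up, not from the question):

import numpy as np

rng = np.random.default_rng(0)
per_batch_acc = rng.uniform(0.3, 0.9, size=300)   # noisy per-batch accuracy
epoch_len = 100

logged = []
for epoch in range(3):
    total = 0.0
    for i in range(epoch_len):
        total += per_batch_acc[epoch * epoch_len + i]
        logged.append(total / (i + 1))            # running average within the epoch

# `logged` resets to a single-batch value at every epoch boundary, so the curve
# shows a sharp jump there even though the underlying metric hasn't changed.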
TL;DR The model has a very high variance in its output with respect to each batch.
I recently experienced this kind of issue while working on an object localization project. In my case, there were three main candidates.
I was not shuffling my training data, which creates a loss jump after each epoch.
I had defined a new loss function, calculated using IoU. It was something like:
def new_loss(y_true, y_pred):
    mse = tf.losses.mean_squared_error(y_true, y_pred)
    iou = calculate_iou(y_true, y_pred)   # custom IoU helper (not shown)
    return mse + (1 - iou)
I also suspected this loss could be a possible cause of the increase in loss after each epoch. However, I was not able to replace it.
I was using an Adam optimizer, so a possible thing to do is to change it and see how the training is affected.
Conclusion
I changed Adam to SGD and shuffled my data during training. There is still a jump in the loss, but it is minimal compared to before the changes. For example, my loss spike was ~0.3 before the changes and it became ~0.02.
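As a rough Keras-style sketch of those two changes (assuming an existing model, the new_loss defined above, and in-memory arrays x_train/y_train; these names are placeholders, not from the original post):

import tensorflow as tf

# swap Adam for plain SGD and shuffle the training data every epoch
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=1e-3),
              loss=new_loss)
model.fit(x_train, y_train, epochs=20, batch_size=32, shuffle=True)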
Note
I should add that there are lots of discussions about this topic; I simply tried the solutions that seemed like possible candidates for my model.
I have to say this might be one of the weirdest problems I've ever met.
I was implementing ResNet to perform 10-class classification on CIFAR-10 with TensorFlow. Everything seemed to be fine during the training phase - the loss decreased steadily and the accuracy on the training set kept increasing to over 90% - however, the results were totally abnormal during inference.
I have analyzed my code very carefully and ruled out the possibility of making mistakes when feeding the data or saving/loading the model. So the only difference between the training phase and the test phase lies in batch normalization layers.
For BN layers, I used tf.layers.batch_normalization directly and I thought I've paid attention to every pitfall in using tf.layers.batch_normalization.
Specifically, I've included the dependency for train_op as follows,
update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
with tf.control_dependencies(update_ops):
    self.train_op = optimizer.minimize(self.losses)
Also, for saving and loading the model, I've specified var_list as tf.global_variables(). Moreover, I used training=True for training and training=False for test.
Nevertheless, the accuracy during inference was only around 10%, even when applied to the same data used for training. And when I output the last layer of the network (i.e., the 10-dimension vector input to softmax), I found that the magnitude of each item in the 10-dimension vector during training was always 1e0 or 1e-1, while for inference, it could be 1e4 or even 1e5. The strangest part was that I found the magnitude of the 10-dimension vector during inference correlated with the batch size used in training, i.e., the bigger the batch size, the smaller the magnitude.
Besides, I also found that the magnitudes of moving_mean and moving_variance of the BN layers correlated with the batch size too, but why was this even possible? I thought moving_mean represented the mean of the entire training population, and likewise for moving_variance. So why would they have anything to do with the batch size?
I think there must be something that I don't know about using BN with tensorflow. This problem is really gonna drive me crazy! I've never expected to deal with such a problem in tensorflow, considering how convenient it is to use BN with PyTorch!
The problem has been solved!
I read the source code of TensorFlow. Based on my understanding, the value of momentum in tf.layers.batch_normalization should be on the order of 1 - 1/num_of_batches. The default value is 0.99, which means it is most suitable when there are around 100 batches in the training data.
I didn't find any documentation that mentions this. I hope this is helpful to someone who runs into the same problem with BN in TensorFlow!
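For illustration, a hedged TF 1.x-style sketch of lowering the momentum (the value 0.9 is just an example; pick one based on how many batches you actually have):

import tensorflow as tf

def bn(x, is_training):
    # lower momentum so moving_mean/moving_variance adapt within relatively
    # few update steps; the default of 0.99 assumes many batches
    return tf.layers.batch_normalization(x, momentum=0.9, training=is_training)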
With training & validation through a dataset for nearly 24 epochs, intermittently 8 epochs at once and saving weights cumulatively after each interval.
I observed a constant declining train & test-loss for first 16 epochs, post which the training loss continues to fall whereas test loss rises so i think it's the case of Overfitting.
For which i tried to resume training with weights saved after 16 epochs with change in hyperparameters say increasing dropout_rate a little.
Therefore i reran the dense & transition blocks with new dropout to get identical architecture with same sequence & learnable parameters count.
Now when i'm assigning previous weights to my new model(with new dropout) with model.load_weights() and compiling thereafter.
i see the training loss is even higher, that should be initially (blatantly with increased inactivity of random nodes during training) but later also it's performing quite unsatisfactory,
so i'm suspecting maybe compiling after loading pretrained weights might have ruined the performance?
what's reasoning & recommended sequence of model.load_weights() & model.compile()? i'd really appreciate any insights on above case.
The model.compile() method does not touch the weights in any way.
Its purpose is to create a symbolic function adding the loss and the optimizer to the model's existing function.
You can compile the model as many times as you want, whenever you want, and your weights will be kept intact.
Possible consequences of compile
If you have a model that has been trained for some epochs, its optimizer (depending on what type and parameters you chose for it) will also have been trained over those epochs.
Compiling will make you lose the trained optimizer state, and your first training batches might show some bad results due to learning rates not suited to the current state of the model.
Other than that, compiling doesn't cause any harm.
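A minimal, hypothetical Keras sketch of the sequence discussed above (build_model, the checkpoint path, the loss/optimizer choice and the data arrays are placeholders, not from the question):

from tensorflow import keras

model = build_model(dropout_rate=0.3)        # hypothetical builder with the new dropout
model.load_weights("weights_epoch_16.h5")    # hypothetical checkpoint path
model.compile(optimizer="adam", loss="categorical_crossentropy")

# compile() only attaches the loss and a (fresh) optimizer; the loaded weights
# are untouched, but the new optimizer starts without its previous internal state.
model.fit(x_train, y_train, validation_data=(x_val, y_val), epochs=8)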
I'm still a newbie in TensorFlow and I'm trying to understand in detail what happens while my models train. Briefly, I'm using the slim models pretrained on ImageNet to do finetuning on my dataset. Here are some plots extracted from TensorBoard for 2 separate models:
Model_1 (InceptionResnet_V2)
Model_2 (InceptionV4)
So far, both models have poor results on the validation sets (average Az (area under the ROC curve) = 0.7 for Model_1 & 0.79 for Model_2). My interpretation of these plots is that the weights are not changing over the mini-batches; only the biases change, and this might be the problem. But I don't know where to look to verify this point. This is the only interpretation I can think of, but it might be wrong considering that I'm still a newbie. Can you please share your thoughts with me? Don't hesitate to ask for more plots if needed.
EDIT:
As you can see in the plots below, it seems the weights are barely changing over time. The same applies to all the other weights of both networks. This leads me to think there is a problem somewhere, but I don't know how to interpret it.
InceptionV4 weights
InceptionResnetV2 weights
EDIT2:
These models were first trained on ImageNet, and these plots are the results of finetuning them on my dataset. I'm using a dataset of 19 classes with roughly 800,000 images. I'm doing a multi-label classification problem and I'm using sigmoid_crossentropy as the loss function. The classes are highly unbalanced. The table below shows the percentage of presence of each class in the two subsets (train, validation):
Objects train validation
obj_1 3.9832 % 0.0000 %
obj_2 70.6678 % 33.3253 %
obj_3 89.9084 % 98.5371 %
obj_4 85.6781 % 81.4631 %
obj_5 92.7638 % 71.4327 %
obj_6 99.9690 % 100.0000 %
obj_7 90.5899 % 96.1605 %
obj_8 77.1223 % 91.8368 %
obj_9 94.6200 % 98.8323 %
obj_10 88.2051 % 95.0989 %
obj_11 3.8838 % 9.3670 %
obj_12 50.0131 % 24.8709 %
obj_13 0.0056 % 0.0000 %
obj_14 0.3237 % 0.0000 %
obj_15 61.3438 % 94.1573 %
obj_16 93.8729 % 98.1648 %
obj_17 93.8731 % 97.5094 %
obj_18 59.2404 % 70.1059 %
obj_19 8.5414 % 26.8762 %
The values of the hyperparams:
batch_size = 32
weight_decay = 0.00004                   # weight decay on the model weights
optimizer = rmsprop
rmsprop_momentum = 0.9
rmsprop_decay = 0.9                      # decay term for RMSProp
learning_rate_decay_type = exponential   # how the learning rate is decayed
learning_rate = 0.01                     # initial learning rate
learning_rate_decay_factor = 0.94        # learning rate decay factor
num_epochs_per_decay = 2.0               # number of epochs after which the learning rate decays
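For context, here is roughly how those values translate into a decayed learning rate in TF 1.x/slim-style code. This is a sketch under the stated batch size and approximate dataset size, not the exact slim implementation:

import tensorflow as tf

num_samples = 800000                                 # approximate training-set size
batch_size = 32
decay_steps = int(num_samples / batch_size * 2.0)    # num_epochs_per_decay = 2.0

global_step = tf.train.get_or_create_global_step()
learning_rate = tf.train.exponential_decay(
    learning_rate=0.01,                              # initial learning rate
    global_step=global_step,
    decay_steps=decay_steps,
    decay_rate=0.94,                                 # learning_rate_decay_factor
    staircase=True)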
Concerning the sparsity of the layers, here are some samples of the sparsity of the layers for both networks:
sparsity (InceptionResnet_V2)
sparsity (InceptionV4)
EDIT3:
Here are the plots of the losses for both models:
Losses and regularization loss (InceptionResnet_V2)
Losses and regularization loss (InceptionV4)
I agree with your assessment - the weights aren't changing very much across the minibatches. It does appear they are changing somewhat.
As I'm sure you're aware, you are doing fine tuning with very large models. As such, backprop can sometimes take a while. But, you're running many training iterations. I don't really think this is the problem.
If I'm not mistaken, both of these were originally trained on ImageNet. If your images are in a totally different domain than something in ImageNet, that could explain the problem.
The backprop equations do make it easier for biases to change than weights in certain activation ranges. ReLU is one example when the model is highly sparse (i.e. if many layers have activation values of 0, then the weights will struggle to adjust but the biases will not). Also, if activations are in the range [0, 1], the gradient with respect to a weight will be no larger than the gradient with respect to a bias. (This is one reason sigmoid is a poor activation function.)
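Concretely, for a single unit z = w*a + b, the chain rule gives dL/dw = (dL/dz) * a and dL/db = dL/dz. So when the incoming activation a lies in [0, 1] (or is often exactly 0, as with a sparse ReLU network), the weight gradient is at most as large as the bias gradient, and the biases can adjust more freely than the weights.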
It could also be related to your readout layer - specifically the activation function. How are you calculating the error? Is this a classification or regression problem? If at all possible, I recommend using something other than sigmoid as your final activation function; tanh could be marginally better. A linear readout sometimes speeds up training too: all the gradients have to "pass through" the readout layer, and if the derivative of the readout layer is always 1 (linear), you're "letting more gradient through" to adjust the weights further down the model.
Lastly I notice your weights histograms are pushing towards negative weights. Sometimes, especially with models that have a lot of ReLU activation, that can be an indicator of the model learning sparsity. Or an indicator of the dead neuron problem. Or both - the two are somewhat linked.
Ultimately, I think your model is just struggling to learn. I've encountered very similar histograms retraining Inception. I was using a dataset of about 2000 images, and I was struggling to push it over 80% accuracy (as it happens, the dataset was heavily biased - that accuracy was roughly as good as random guessing). It helped when I made the convolution variables constant and only made changes to the fully connected layer.
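As a hedged TF 1.x-style sketch of that last idea (restricting updates to the final layers): the scope name "Logits", as well as optimizer and total_loss, are assumptions; adjust them to your actual variable names and graph.

import tensorflow as tf

# train only the variables of the final logits / fully connected scope,
# leaving the convolutional weights fixed during fine-tuning
trainable = [v for v in tf.trainable_variables() if "Logits" in v.name]
train_op = optimizer.minimize(total_loss, var_list=trainable)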
Indeed, this is a classification problem, and sigmoid cross-entropy is an appropriate loss for it. And you do have a sizable dataset - certainly big enough to fine-tune these models.
With this new information, I would suggest lowering the initial learning rate. I have a two-fold reasoning here:
(1) is my own experience. As I mentioned, I'm not especially familiar with RMSprop. I've only used it in the context of DNCs (though, DNCs with convolutional controllers), but my experience there backs up what I'm about to say. I think .01 is high for training a model from scratch, let alone fine tuning. It's definitely high for Adam. In some sense, starting with a small learning rate is the "fine" part of fine tuning. Don't force the weights to shift quite so much. Especially if you're adjusting the whole model rather than the last (few) layer(s).
(2) is the increasing sparsity and shift toward negative weights. Based on your sparsity plots (good idea btw), it looks to me like some weights might be getting stuck in a sparse configuration as a result of overcorrection. I.e., as a result of a high initial rate, the weights are "overshooting" their optimal position and getting stuck somewhere that makes it hard for them to recover and contribute to the model. That is, slightly negative and close to zero is not good in a ReLU network.
As I've mentioned (repeatedly) I'm not very familiar with RMSprop. But, since you're already running lots of training iterations, give low, low, low initial rates a shot and work your way up. I mean, see how 1e-8 works. It's possible the model won't respond to training with a rate that low, but do something of an informal hyperparameter search with the learning rate. In my experience with Inception using Adam, 1e-4 to 1e-8 worked well.
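For example, an informal sweep along those lines might look like this (the training and evaluation entry points are placeholders, not from the answer):

# try a range of small initial learning rates and compare validation results
for lr in [1e-8, 1e-7, 1e-6, 1e-5, 1e-4]:
    model = finetune_for_a_few_epochs(initial_learning_rate=lr)   # hypothetical helper
    print(lr, evaluate_validation_az(model))                      # hypothetical helper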