What is causing large jumps in training accuracy and loss between epochs?

What is causing large jumps in training accuracy and loss between epochs? - python

In training a neural network in Tensorflow 2.0 in python, I'm noticing that training accuracy and loss change dramatically between epochs. I'm aware that the metrics printed are an average over the entire epoch, but accuracy seems to drop significantly after each epoch, despite the average always increasing.
The loss also exhibits this behavior, dropping significantly each epoch but the average increases. Here is an image of what I mean (from Tensorboard):
I've noticed this behavior on all of the models I've implemented myself, so it could be a bug, but I want a second opinion on whether this is normal behavior and if so what does it mean?
Also, I'm using a fairly large dataset (roughly 3 million examples). Batch size is 32 and each dot in the accuracy/loss graphs represent 50 batches (2k on the graph = 100k batches). The learning rate graph is 1:1 for batches.

It seems this phenomenon comes from the fact that the model has a high batch-to-batch variance in terms of accuracy and loss. This is illustrated if I take a graph of the model with the actual metrics per step as opposed to the average over the epoch:
Here you can see that the model can vary widely. (This graph is just for one epoch, but the fact remains).
Since the average metrics were being reported per epoch, at the beginning of the next epoch it is highly likely that the average metrics will be lower than the previous average, leading to a dramatic drop in the running average value, illustrated in red below:
If you imagine the discontinuities in the red graph as being epoch transitions, you can see why you would observe the phenomenon in the question.
TL;DR The model has a very high variance in it's output with respect to each batch.

I have just newly experienced this kind of issue while I was working on a project that is about object localization. For my case, there was three main candidates.
I have used no shuffling in my training. That creates a loss increase after each epoch.
I have defined a new loss function that is calculated using IOU. It was something like;
def new_loss(y_true, y_pred):
mse = tf.losses.mean_squared_error(y_true, y_pred)
iou = calculate_iou(y_true, y_pred)
return mse + (1 - iou)
I also suspect this loss may be a possible candidate of increase in loss after epoch. However, I was not able to replace it.
I was using an Adam optimizer. So, a possible thing to do is to change it to see how the training affected.
Conclusion
I have just changed the Adam to SGD and shuffled my data in training. There was still a jump in the loss but it was so minimal compared without a change. For example, my loss spike was ~0.3 before the changes and it became ~0.02.
Note
I need to add there are lots of discussions about this topic. I tried to utilize the possible solutions that are possible candidates for my model.

Related

StratifiedKFold overfitting

I'm working on a multimodal classifier (text + image) using pytorch (only 2 classes).
Since I don't have a lot of data, i've decided to use StratifiedKFold to avoid overfitting.
I noticed a strange behavior on training/testing curves.
My training accuracy quickly converges forward a unique value for few epochs before evolving again.
With these results I directly thought of overfitting, .67 being the maximum accuracy of the model.
With the rest of the data separated by the KFold, I tested my model in evaluation mode.
I've been quite surprised since test accuracy follows (quite exactly) the training accuracy while the loss (CrossEntropyLoss) still evolves.
Note : changing the batch size only make growing of accuracy delays or brings closer the moment the loss evolves.
Any ideas about this behaviour ?

Why is training loss oscilating up and down?

I am using the TF2 research object detection API with the pre-trained EfficientDet D3 model from the TF2 model zoo. During training on my own dataset I notice that the total loss is jumping up and down - for example from 0.5 to 2.0 a few steps later, and then back to 0.75:
So all in all this training does not seem to be very stable. I thought the problem might be the learning rate, but as you can see in the charts above, I set the LR to decay during the training, it goes down to a really small value of 1e-15, so I don't see how this can be the problem (at least in the 2nd half of the training).
Also when I smooth the curves in Tensorboard, as in the 2nd image above, one can see the total loss going down, so the direction is correct, even though it's still on quite a high value. I would be interested why I can't achieve better results with my training set, but I guess that is another question. First I would be really interested why the total loss is going up and down so much the whole training. Any ideas?
PS: The pipeline.config file for my training can be found here.

In your config it states that your batch size is 2. This is tiny and will cause a very volatile loss.
Try increasing your batch size substantially; try a value of 256 or 512. If you are constrained by memory, try increasing it via gradient accumulation.
Gradient accumulation is the process of synthesising a larger batch by combining the backwards passes from smaller mini-batches. You would run multiple backwards passes before updating the model's parameters.
Typically, a training loop would like this (I'm using pytorch-like syntax for illustrative purposes):
for model_inputs, truths in iter_batches():
predictions = model(model_inputs)
loss = get_loss(predictions, truths)
loss.backward()
optimizer.step()
optimizer.zero_grad()
With gradient accumulation, you'll put several batches through and then update the model. This simulates a larger batch size without requiring the memory to actually put a large batch size through all at once:
accumulations = 10
for i, (model_inputs, truths) in enumerate(iter_batches()):
predictions = model(model_inputs)
loss = get_loss(predictions, truths)
loss.backward()
if (i - 1) % accumulations == 0:
optimizer.step()
optimizer.zero_grad()
Reading
https://medium.com/huggingface/training-larger-batches-practical-tips-on-1-gpu-multi-gpu-distributed-setups-ec88c3e51255
How to accumulate gradients in tensorflow?
https://towardsdatascience.com/how-to-easily-use-gradient-accumulation-in-keras-models-fa02c0342b60
Understanding accumulated gradients in PyTorch

Keras LSTM training loss is increasing instantaneously in one epoch

I am training an LSTM to predict a price chart. In order to do so I am arranging the data into 3D with a certain time window like 10 days to predict the next day.
I am not shuffling the data before splitting into training and testing. So the most recent data is used as testing. Is that a good idea with a stateless LSTM?
Secondly, during one epoch, the training is firstly decreasing but halfway through the epoch it is suddenly increased by an order of magnitude? what are the possible reasons for that happening? Is this normal behavior. I.e. is it a misconception of mine that loss in one epoch should constantly decrease?
I used Bayesian optimization to find the right hyperparameter and I am using adam optimizer with the default learning rate. And normalizing data between 0 and 1.

What am I trying to do here? train acc: 100%, test acc: 80% does this mean overfitting?

classifier.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
classifier.fit(X_train, y_train, epochs=50, batch_size=100)
Epoch 1/50
27455/27455 [==============================] - 3s 101us/step - loss: 2.9622 - acc: 0.5374
I know I'm compiling my model in first line and fitting it in second. I know what is optimiser. I'm interested the meaning of metrics=['accuracy'] and what does the acc: XXX exactly mean when I compile the model.
Also, I'm getting acc : 1.000 when I train my model (100%) but when I test my model I'm getting 80% accuracy. Does my model overfitting?

Ok, let's begin from the top,
First, metrics = ['accuracy'], The model can be evaluated on multiple parameters, accuracy is one of the metrics, other can be binary_accuracy, categorical_accuracy, sparse_categorical_accuracy, top_k_categorical_accuracy, and sparse_top_k_categorical_accuracy, these are only the inbuilt ones, you can even create custom metrics, to understand metrics in more details, you need to have a clear understanding of loss in a Neural Network, you might know that loss function must be differentiable in order to be able to do back propagation, this is not necessary in case of metrics, metrics are used purely for model evaluation and thus can even be functions that are not differentiable, in Keras as mentioned even in their documentation
A metric function is similar to a loss function, except that the results from evaluating a metric are not used when training the model. You may use any of the loss functions as a metric function.
On your Own, you can custom define an accuracy that is not differentiable but creates an objective function on what you need from your model.
TLDR; Metrics are just loss functions not used in back propagation but used for model evaluation.
Now,
acc:xxx might just be that it has not even finished one minibatch propagation and thus cannot give an accuracy score yet, I have not paid much attention to it, but it usually stays there for a few seconds and is thus an speculation from that.
Finally 20% Decrease in model performance when taken out of training, yes this can be a case of Overfitting but no one can know for sure without looking at your dataset, but most probably yes, it is overfitting, and you may need to look at the data it is performing bad on to know the cause.
If something is unclear, doesn't make sense, feel free to comment.

Having 100% accuracy on train dataset while having 80% accuracy on test dataset doesn't mean that your model overfits. Moreover, it almost surely doesn't overfit if your model is equipped with much more effective parameters that the number of training samples [2], [5] (insanely large model example [1]). This contradicts to conventional statistical learning theory, but these are empirical results.
For models with number of parameters greater than number of samples, it's better to continue to optimize the logistic or cross-entropy loss even after the training error is zero and the training loss is extremely small, and even if the validation loss increases [3]. This may hold even regardless of batch size [4].
Clarifications (edit)
The "models" I was referring to are neural networks with two or more hidden layers (could be also convolutional layers prior to dense layers).
[1] is cited to show a clear contradiction to classical statistical learning theory, which says that large models may overfit without some form of regularization.
I would invite anyone who disagrees with "almost surely doesn't overfit" to provide a reproducible example where models, say for MNIST/CIFAR etc with few hundred thousand parameters do overfit (in a sense of increasing with iterations test error curve).
[1] Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc V. Le,Geoffrey E. Hinton, and Jeff Dean. Outrageously large neural networks: Thesparsely-gated mixture-of-experts layer.CoRR, abs/1701.06538, 2017.
[2] Lei Wu, Zhanxing Zhu, et al. Towards understanding generalization of deep learn-ing: Perspective of loss landscapes.arXiv preprint arXiv:1706.10239, 2017.
[3] Daniel Soudry, Elad Hoffer, Mor Shpigel Nacson, Suriya Gunasekar, and NathanSrebro. The implicit bias of gradient descent on separable data.The Journal of Machine Learning Research, 19(1):2822–2878, 2018.
[4] Elad Hoffer, Itay Hubara, and Daniel Soudry. Train longer, generalize better: clos-ing the generalization gap in large batch training of neural networks. InAdvancesin Neural Information Processing Systems, pages 1731–1741, 2017.`
[5] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals.Understanding deep learning requires rethinking generalization.arXiv preprintarXiv:1611.03530, 2016.

Starting off with the first part of your question -
Keras defines a Metric as "a function that is used to judge the performance of your model". In this case you are using accuracy as the function to judge how good your model is. (This is the norm)
For the second part of your question - acc is the accuracy of your model at that Epoch. This can, and will change depending on which metrics were defined in the model.
Finally it is possible that you have ended up with an overfit model given what you have told us but there are simple solutions

So the meaning of metrics=['accuracy'] is actually dependent on what loss function you use. You can see how keras handels this from line 375 and down. Since you are using categorical_crossentropy, your case follows the logic in the elif (line 386). Hence your metric function is set to
metric_fn = metrics_module.sparse_categorical_accuracy
See this post for a description of the logic behind sparse_categorical_accuracy, it should clear the meaning of "accuracy" in your case. It basically just counts how many of your prediction (the class with maximum probability) was the same as the true class.
The train vs validation accuracy can show sign of over-fitting. To test this plot the train accuracy and validation accuracy against each other and see at what point the validation accuracy start to decrease. Follow this for a good description of how to plot accuracy and loss etc, to test for over-fitting.

Is my model underfitting, tensorflow?

My loss first decreased for few epochs but then started increasing and then increased up to a certain point and then stopped moving. I think now it has converged. Now, can we say that my model is underfitting? Because my interpretation is that (slide 93 link) if my loss is going down and then increasing it means that I have a high learning rate and which after every 2 epochs I'm decaying so after few epochs loss stopped increasing because learning rate is low now, because I'm still decaying my learning rate, now loss should start decreasing again, according to slide 93 because learning rate is low, but it doesn't. Can we say that loss is not decreasing further because my model is underfitting?

So, to summarize, the loss on the training data:
first went down
then it went up again
then it remains at the same level
then the learning rate is decayed
and the loss doesn't go back down again (still stays the same with decayed learning rate)
To me it sounds like the learning rate was too high initially, and it got stuck in a local minimum afterwards. Decaying the learning rate at that point, once it's already stuck in a local minimum, is not going to help it escape that minimum. Setting the initial learning rate at a lower value is more likely to be beneficial, so that you don't end up in the ''bad'' local minimum to begin with.
It is possible that your model is now underfitting, and that making the model more complex (more nodes in hidden layers, for instance) would help. This is not necessarily the case though.
Are you using any techniques to avoid overfitting? For example, regularization and/or dropout? If so, it is also possible that your model was initially overfitting (when the loss was going down, before it went back up again). To get a better idea of what's going on, it would be beneficial to plot not only your loss on the training data, but also loss on a validation set. If the loss on your training data drops significantly below the loss on the validation data, you know it's overfitting.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.