StratifiedKFold overfitting - python

I'm working on a multimodal classifier (text + image) in PyTorch, with only 2 classes.
Since I don't have a lot of data, I've decided to use StratifiedKFold to avoid overfitting.
I noticed a strange behavior in the training/testing curves.
My training accuracy quickly converges toward a single value, stays there for a few epochs, and then starts evolving again.
Seeing this, I immediately suspected overfitting, with 0.67 being the apparent maximum accuracy of the model.
I then tested the model in evaluation mode on the held-out data from the KFold split.
I was quite surprised to see that the test accuracy follows the training accuracy almost exactly, while the loss (CrossEntropyLoss) keeps evolving.
Note: changing the batch size only delays or advances the moment at which the accuracy starts growing and the loss starts evolving.
Any ideas about this behaviour?
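
For reference, here is a minimal sketch of how StratifiedKFold is typically wired around a PyTorch training loop; the dataset, model, and hyperparameters below are dummy stand-ins rather than the actual code behind this question.

```python
# Minimal sketch: stratified k-fold cross-validation around a PyTorch training loop.
# The features, labels, and build_model() below are dummy stand-ins for your own
# multimodal dataset and model.
import numpy as np
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, Subset, TensorDataset
from sklearn.model_selection import StratifiedKFold

features = torch.randn(200, 16)                         # placeholder inputs
labels = np.random.randint(0, 2, size=200)              # placeholder binary labels
dataset = TensorDataset(features, torch.as_tensor(labels))

def build_model():
    # placeholder network; substitute your text+image model
    return nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 2))

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
fold_acc = []

for train_idx, val_idx in skf.split(np.zeros(len(labels)), labels):
    train_loader = DataLoader(Subset(dataset, train_idx), batch_size=32, shuffle=True)
    val_loader = DataLoader(Subset(dataset, val_idx), batch_size=32)

    model = build_model()                               # fresh model for every fold
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    criterion = nn.CrossEntropyLoss()

    for epoch in range(20):
        model.train()
        for inputs, targets in train_loader:
            optimizer.zero_grad()
            loss = criterion(model(inputs), targets)
            loss.backward()
            optimizer.step()

    model.eval()
    correct = total = 0
    with torch.no_grad():
        for inputs, targets in val_loader:
            correct += (model(inputs).argmax(dim=1) == targets).sum().item()
            total += targets.size(0)
    fold_acc.append(correct / total)

print("per-fold validation accuracy:", fold_acc)
```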

Related

Training Accuracy Increasing but Validation Accuracy Remains as Chance of Each Class (1/number of classes)

I am training a classifier using CNNs in PyTorch. My classifier has 6 labels. There are 700 training images for each label and 10 validation images for each label. The batch size is 10 and the learning rate is 0.000001. Each class makes up 16.7% of the whole dataset. I have trained for 60 epochs and the architecture has 3 main blocks:
Conv2D -> ReLU -> BatchNorm2D -> MaxPool2D -> Dropout2D
Conv2D -> ReLU -> BatchNorm2D -> Flatten -> Dropout2D
Linear -> ReLU -> BatchNorm1D -> Dropout
followed by a final fully connected layer and a softmax.
My optimizer is AdamW and the loss function is cross-entropy. The network is training well, as the training accuracy is increasing, but the validation accuracy remains almost fixed at the chance level of each class (1/number of classes). The accuracy is shown in the image below:
[Figure: accuracy of training and test]
And the loss is shown in:
[Figure: loss for training and validation]
Is there any idea why this is happening? How can I improve the validation accuracy? I have used L1 and L2 regularization as well as the dropout layers, and I have also tried adding more data, but these didn't help.
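
For reference, a rough PyTorch sketch of the architecture described above might look as follows; the channel and feature sizes (and the assumed 3x32x32 input) are invented for illustration. Note that nn.CrossEntropyLoss already applies log-softmax internally, so the final layer here outputs raw logits rather than an explicit softmax.

```python
# Rough sketch of the described architecture; sizes are invented, not the asker's code.
import torch
import torch.nn as nn

class SmallCNN(nn.Module):
    def __init__(self, num_classes=6):
        super().__init__()
        self.block1 = nn.Sequential(            # Conv -> ReLU -> BN -> MaxPool -> Dropout
            nn.Conv2d(3, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.BatchNorm2d(32),
            nn.MaxPool2d(2),
            nn.Dropout2d(0.25),
        )
        self.block2 = nn.Sequential(            # Conv -> ReLU -> BN
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.BatchNorm2d(64),
        )
        self.head = nn.Sequential(              # Flatten -> Dropout -> Linear -> ReLU -> BN -> Dropout -> Linear
            nn.Flatten(),
            nn.Dropout(0.25),
            nn.Linear(64 * 16 * 16, 128),
            nn.ReLU(),
            nn.BatchNorm1d(128),
            nn.Dropout(0.5),
            nn.Linear(128, num_classes),        # raw logits; CrossEntropyLoss applies softmax internally
        )

    def forward(self, x):
        return self.head(self.block2(self.block1(x)))

model = SmallCNN()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-6)  # the learning rate from the question
criterion = nn.CrossEntropyLoss()
```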
Problem Solved: at first I treated this problem as overfitting and spent a lot of time on methods to address it, such as regularization and augmentation. After trying different methods without improving the validation accuracy, I went through the data. I found a bug in my data preparation that was generating similar tensors under different labels. I generated the correct data and the problem was solved to some extent (the validation accuracy increased to around 60%). I then got the validation accuracy up to 90% by adding more "conv2d + maxpool" layers.
This is not so much a programming-related question, so maybe ask it again on Cross Validated,
and it would be easier if you posted your architecture code.
But here are things that I would suggest:
you wrote that you "tried adding more data"; if you can, always use all the data you have. If that's still not enough (and even if it is), use augmentation (e.g. flip, crop, add noise to the images); see the sketch after this list
your learning rate should not be so small; start with 0.001 and decay it while training, or try ~0.0001 without decaying
remove the dropout after the conv layers and the batchnorm after the dense layers and see if that helps. It is not so common to use dropout after conv layers, but normally that shouldn't have a negative effect, so try it anyway
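
A minimal sketch of the first two suggestions (augmentation and a decaying learning rate) using PyTorch/torchvision; the transform values, schedule, and placeholder model are illustrative only.

```python
# Illustrative only: basic augmentation (flip, crop, a little noise) plus a learning
# rate that starts at 0.001 and decays. The model below is a placeholder for your network.
import torch
import torch.nn as nn
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.RandomHorizontalFlip(),                            # flip
    transforms.RandomCrop(32, padding=4),                         # crop (assumes 32x32 inputs)
    transforms.ToTensor(),
    transforms.Lambda(lambda t: t + 0.01 * torch.randn_like(t)),  # add a little noise
])

model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 6))    # placeholder model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)        # start at 0.001 ...
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)  # ... and decay

for epoch in range(60):
    # ... usual training loop over the augmented training set goes here ...
    scheduler.step()
```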

Reason for keras validation zig-zagging

I am training a NN and getting this result on loss and validation loss:
These are 200 epochs, a batch size of 16, 500 training samples and 200 validation samples.
As you can see, after about 20 epochs, the validation loss begins to do a very exaggerated zig-zagging.
Do you know which could be the reason for that behavior?
I tried to increase the number of validation samples but that just increased the zig-zagging and made it more exaggerated.
Also, I added a decay value to the optimizer, but the loss and validation loss did not look any better.
I was looking for another way to improve it.
Any idea what the reason for the zig-zagging is, and how I could minimize it?
This might be a case of overfitting:
Overfitting refers to a model that models the training data too well. Overfitting happens when a model learns the detail and noise in the training data to the extent that it negatively impacts the performance of the model on new data.
Basically, you have a very small training sample (500), but are training for a very long time (200 epochs!).
The network will start learning your training data by heart and won't learn to generalise. It will thus seem to be very good during training, but will fail miserably on the test set.
Early stopping is a nice way to avoid overfitting: basically, stop as soon as the validation loss becomes erratic or starts increasing. Another way to lower the chances of overfitting is to use techniques such as dropout, or simply to increase the amount of training data.
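
For example, in Keras early stopping is just a callback. A minimal sketch (with dummy data, a tiny dummy model, and an arbitrary patience value) might look like this:

```python
# Early stopping in Keras: stop once the validation loss stops improving.
import numpy as np
from tensorflow import keras

# Dummy data and model just so the snippet runs; substitute your own.
x_train, y_train = np.random.rand(500, 20), np.random.randint(0, 2, 500)
x_val, y_val = np.random.rand(200, 20), np.random.randint(0, 2, 200)

model = keras.Sequential([
    keras.layers.Dense(32, activation="relu", input_shape=(20,)),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")

early_stop = keras.callbacks.EarlyStopping(
    monitor="val_loss",          # watch the validation loss
    patience=10,                 # allow 10 epochs without improvement before stopping
    restore_best_weights=True,   # roll back to the best epoch seen
)

model.fit(x_train, y_train, validation_data=(x_val, y_val),
          epochs=200, batch_size=16, callbacks=[early_stop])
```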
tl;dr: you are overfitting. To avoid this issue there are many possibilities: drastically reduce the number of epochs, use a dev set and a stopping criterion, get more training data, ...
For alternative explanations, see also this question on Quora.
I would suggest not worrying about the zigzag pattern of the validation loss or validation accuracy. While the neural network is training, it makes mistakes and updates its weights (if you know the math behind it), so it is natural that the validation curves zigzag while the model is still in its learning stage. Once the model is fully trained, you will notice that the zigzag decreases (if you have chosen the right number of epochs).
So don't worry about this.

Why does validation accuracy remain at 75% while training accuracy is 100%?

I used my own data set to train a model using the retrain.py file from the TensorFlow site. However, with my first set of images, I am seeing training accuracy of 100% while validation accuracy is at 70%. I see that the validation cross-entropy is increasing, which suggests overfitting. I am new to this field and got to this stage by following online tutorials.
I have not yet enabled random brightness, crop, and flip for training. I am trying to understand why this behaviour occurs. I tried the flowers example and it worked as expected: the cross-entropy reached its lowest point instead of increasing, unlike with my data set.
Could someone explain what's going on inside the CNN here?
Your model has over-fitted on the training data. If it's a large model, you should consider transfer learning, where the model is first trained on a large dataset like ImageNet and then fine-tuned on your data. You can also try adding some form of regularization to prevent overfitting, especially Dropout and L2 regularization.
This simply means your model is overfitting. Overfitting means your model is not generalizing well to unseen data (i.e. the validation data). What you can do is add some form of regularization (L2 is normally used). This penalizes weights that take on very large values, which would otherwise lead to overfitting. It also discourages the model from fitting outliers, which again hurts generalization and increases overfitting.
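
As an illustration, here is a hedged Keras sketch of both suggestions, transfer learning from ImageNet weights plus dropout and an L2 penalty; the base network, image size, class count, and regularization strength are arbitrary placeholders, not the asker's setup.

```python
# Sketch only: transfer learning from ImageNet weights plus dropout and an L2 penalty.
from tensorflow import keras

num_classes = 2                                            # substitute your own class count

base = keras.applications.MobileNetV2(
    include_top=False, weights="imagenet",
    input_shape=(224, 224, 3), pooling="avg",
)
base.trainable = False                                     # freeze the pretrained backbone

model = keras.Sequential([
    base,
    keras.layers.Dropout(0.5),                             # dropout regularization
    keras.layers.Dense(num_classes, activation="softmax",
                       kernel_regularizer=keras.regularizers.l2(1e-4)),  # L2 penalty
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```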

Overfitting terminology in Machine Learning

In the book Introduction to Machine Learning with Python, on page 50, the author performs a linear regression on a dataset and gets:
training set score: 0.67
test set score: 0.66
They then state that they are “likely underfitting, not overfitting.”
However, TensorFlow's Basic Classification tutorial uses the Fashion MNIST dataset with a neural network and gets:
training set score: 0.892
test set score: 0.876
and then they state the following
“It turns out, the accuracy on the test dataset is a little less than the accuracy on the training dataset. This gap between training accuracy and test accuracy is an example of overfitting. Overfitting is when a machine learning model performs worse on new data than on their training data.”
I believe that the quote from the TensorFlow site is the correct one. Or are they both correct and I just don't fully understand overfitting?
Underfitting occurs when both the training and testing accuracies are low. This signifies a systematic problem with your model, e.g. the data would be fit better by a polynomial model but you're using a linear model. So a ~66% accuracy on both training and testing is considered underfitting because both are quite low. In general, high error on both sets indicates underfitting.
Overfitting occurs when you have relatively high accuracy on training but lower accuracy on testing. This signifies that your model has fit too closely to your training data and does not generalize well to other data. In general, low error on training and higher error on testing indicates overfitting.
In general, it is extremely rare to build a model that shows the same performance on the training and validation (or test, or holdout, whatever you wish to call it) sets. Thus, a gap between training and validation performance will (almost) always be there. You will often see overfitting defined in terms of this gap, but in practice that definition is hard to apply because it is not quantitative. The more general concept here is the "bias-variance trade-off", which you may want to read up on. The relevant questions are how large the gap is, how good the performance is, and how the performance on the validation set behaves as the complexity of the model changes.
I find this figure from Wikipedia very instructive: https://en.wikipedia.org/wiki/Overfitting#/media/File:Overfitting_svg.svg. The x axis is the number of training iterations (epochs) in the case of NNs or GBMs, but you can also think of it as a model-complexity parameter, e.g. the number of powers included in a polynomial model. As you can see, there is always a gap between performance on the training and validation samples. But the key to choosing a model that does not overfit is to find the optimal trade-off between performance on the training sample (the bias) and the difference between performance on the training and validation samples (the variance).
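
A small synthetic example of this trade-off: sweeping the degree of a polynomial regression shows the training score improving steadily while the validation score eventually degrades. The data and degrees below are made up purely for illustration.

```python
# Toy illustration of the bias-variance trade-off on synthetic data: as model
# complexity (polynomial degree) grows, the gap between training and validation
# scores widens.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + 0.3 * rng.standard_normal(200)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

for degree in (1, 3, 10, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    print(f"degree {degree:2d}: "
          f"train R^2 = {model.score(X_train, y_train):.3f}, "
          f"val R^2 = {model.score(X_val, y_val):.3f}")
```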
Over- and under-fitting
The most extreme overfitting you can get is an accuracy of 100% on your training set.
This means that your model learned to predict exactly inputs that it has seen before.
If you are ever in this situation, your test set will probably perform very poorly.
You can detect overfitting by:
High accuracy on the training set
A large gap between the training and test set
You can detect underfitting by:
A low accuracy on the training set (irrespective of performance on the test set)
Examples:
1)
training set score: 0.67
test set score: 0.66
This example has a low score on the training set, so underfitting seems like a fair assumption.
2)
training set score: 0.892
test set score: 0.876
This one is up to interpretation. The score on the training set is quite high, and there is a gap with respect to the test set.
If the examples in both sets are very similar, then I would say that there is some overfitting. However, if the two sets are quite different (for example from different sources), then the results could be deemed acceptable.
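
As a toy illustration only, the rules of thumb above can be written down directly; the thresholds here are arbitrary and not taken from the answers above.

```python
# Toy function encoding the rules of thumb above; the thresholds are arbitrary.
def diagnose(train_score, test_score, low_train=0.85, big_gap=0.05):
    if train_score < low_train:
        return "possible underfitting (low training score)"
    if train_score - test_score > big_gap:
        return "possible overfitting (large train/test gap)"
    return "no obvious over- or under-fitting"

print(diagnose(0.67, 0.66))     # example 1 -> possible underfitting
print(diagnose(0.892, 0.876))   # example 2 -> no obvious over- or under-fitting
```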

Is this an overfitted network?

I obtained this result after training a neural network in Keras and I was wondering whether it is overfitting or not.
I'm having doubts because I have read that overfitting happens when a net is overtrained, and that it shows up when the validation loss INCREASES.
But in this case it doesn't increase. It stays the same, while the training loss DECREASES.
EXTRA INFO
A single dataset, split this way:
70% of the dataset used as training data
30% of the dataset used as validation data
500 EPOCHS TRAINING
2000 EPOCHS TRAINING
Training loss: 3.1711e-05
Validation loss: 0.0036
There is a slight overfit in the sense that your training loss keeps decreasing while the validation loss has stopped decreasing.
However, I wouldn't consider this harmful, because the validation loss isn't increasing. That is, if I read the graph correctly; if there is even a small increase, then it's getting bad.
A harmful overfit is when your validation loss starts increasing. The validation loss is your true measure of the performance of the network. If it goes up, then your model is starting to do bad things and you should stop there.
All in all this seems pretty decent. The training loss will almost always end up lower than the validation loss at some point, since the optimization is performed over the training set.
The training loss does indeed appear to keep decreasing further than the validation loss (to me it still looks like it hadn't finished decreasing by the 500th epoch; it would be good to continue for more epochs and see what happens). The difference doesn't appear to be large, though.
It may be overfitting slightly, but it may also be possible that the distribution of your validation data is simply a bit different from the distribution of the training data.
I'd recommend testing the following:
Continue for more than 500 epochs to see whether the training loss keeps decreasing even further, or whether it stabilizes close to the validation loss. If it keeps decreasing much further while the validation loss stays the same, it's safe to say the network is overfitting.
Try creating different splits of training and validation sets (see the sketch below). How did you actually determine the training and validation sets? Were you given two separate sets, one for training and one for validation? Or were you given a single large training set that you split up yourself? In the first case the distributions may be different, so a difference between training and validation loss wouldn't be strange. In the second case, try randomly creating different splits and repeating the experiments to see whether you consistently get the same difference between training and validation loss, or whether they're sometimes closer together.
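
A minimal sketch of that second suggestion, repeating the experiment over several random 70/30 splits; the data and the train_and_evaluate() helper are placeholders for your own pipeline.

```python
# Repeat training over several random 70/30 splits and compare the train/validation
# gaps; if the gap is consistent across splits, it is a property of the model rather
# than of one particular split. Data and the helper below are placeholders.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(1000, 20)            # placeholder features
y = np.random.randint(0, 2, size=1000)  # placeholder labels

def train_and_evaluate(X_tr, y_tr, X_va, y_va):
    # train your Keras model from scratch here and return (train_loss, val_loss)
    return 0.0, 0.0

for seed in range(5):
    X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.3, random_state=seed)
    train_loss, val_loss = train_and_evaluate(X_tr, y_tr, X_va, y_va)
    print(f"split {seed}: train loss = {train_loss:.4f}, val loss = {val_loss:.4f}, "
          f"gap = {val_loss - train_loss:.4f}")
```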
