Why does the test loss fluctuate so much with ResNet? - python

Here is a typical plot of the train/test loss behaviour as the number of epochs increases.
I'm not an expert but I have read several topics on similar problems. Well, let me explain what I'm doing.
First, I have used the implementation given at https://github.com/pytorch/vision/blob/master/torchvision/models/resnet.py for resnet18 & resnet50, and the one at https://github.com/akamaster/pytorch_resnet_cifar10 for resnet32 and resnet56. For all of these nets I got the same kind of erratic test-loss behaviour.
Second, my inputs are 5x64x64 images, so I have adapted the first convolutional layer, and the output of the last fully-connected layer consists of 180 neurons. I have used batch sizes of 64, 128 and 256 for training, and 128 for testing: the same behaviour persists. I have also used both 300k and 100k images for training (100k for testing): the same behaviour persists too. The images are not "standard" RGB photos: first, as you have probably already noticed, there are 5 channels; second, the pixel values can be negative (e.g. spanning the range (-0.01, 500)).
Third, I am aware of the model.train() statement for the training phase, as well as the model.eval() statement (coupled with with torch.no_grad():) for the testing phase. It is clear that if I do not use model.eval() during the test phase, the test loss decreases gently like the training loss. But this is not allowed, is it?
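For readers, here is a minimal sketch of the train/eval pattern described above; model, criterion, optimizer, the data loaders and device are placeholders, not the poster's actual code.

import torch

def run_epoch(model, criterion, optimizer, train_loader, test_loader, device):
    # Training phase: BatchNorm uses per-batch statistics and updates its running stats.
    model.train()
    for x, y in train_loader:
        x, y = x.to(device), y.to(device)
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()

    # Test phase: BatchNorm uses its running statistics; gradients are disabled.
    model.eval()
    test_loss = 0.0
    with torch.no_grad():
        for x, y in test_loader:
            x, y = x.to(device), y.to(device)
            test_loss += criterion(model(x), y).item() * x.size(0)
    return test_loss / len(test_loader.dataset)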
I have tried several things after reading posts concerning BatchNorm behaviour, without any success:
I have used SGD, Adam (& SWATS)
I have tried lr = 0.1 down to lr = 1e-5
I have modified the BN momentum (default = 0.1): 0.5 and 0.01; as well as the eps parameter (see the sketch after this list for one way to set these).
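For illustration, a sketch of how these BatchNorm hyperparameters can be overridden on a torchvision ResNet; the stem adaptation for 5 channels and the 180 outputs follow the description above, while the momentum and eps values are just examples among the ones tried.

import torch.nn as nn
from torchvision.models import resnet18

model = resnet18(num_classes=180)
# Adapt the stem to 5-channel inputs, keeping the original kernel/stride/padding.
model.conv1 = nn.Conv2d(5, 64, kernel_size=7, stride=2, padding=3, bias=False)

# Override the BatchNorm hyperparameters in place.
for m in model.modules():
    if isinstance(m, nn.BatchNorm2d):
        m.momentum = 0.01  # default is 0.1
        m.eps = 1e-3       # default is 1e-5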
Now, I have managed to get nice results (i.e. good training & testing losses) with a classical CNN (i.e. without any batch normalization or short-cuts), but I would like to study ResNet behaviour against adversarial attacks. So, I would like to get ResNet to fit my images :)
Any idea ?
Thanks
After making some tests, I have found something: I have used the standard resnet20 (h=1). Then, I have used as test set the same samples (100,000 images) as for the train set. BUT, for the test set 1) I do not use shuffling, and 2) I do not apply any horizontal/vertical flip or 90/180/270-degree rotation. I observe the same kind of fluctuations for the test loss.
Moreover, when I switch OFF the transformations of the train set completely, and use the same set for train & test, I get the same behaviour:
And finally, if I switch off the shuffling and the random transforms (flips & rotations) of the train set, and I use the same set for testing, then I get:
It seems that the test loss is converging towards a value, but a different one from the train loss. Why???
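One way to check whether this gap comes from BatchNorm's running statistics rather than from the data is to evaluate the very same loader twice, once in train mode (per-batch statistics) and once in eval mode (running statistics). This is only an illustrative sketch; model, loader, criterion and device are placeholders.

import torch

@torch.no_grad()
def compare_bn_modes(model, loader, criterion, device):
    # Average loss over the same data with batch statistics vs. running statistics.
    # Note: even under no_grad, forward passes in train mode still update the
    # BN running estimates, so run this on a copy of the model if that matters.
    losses = {}
    for mode in ("train", "eval"):
        model.train(mode == "train")
        total, n = 0.0, 0
        for x, y in loader:
            x, y = x.to(device), y.to(device)
            total += criterion(model(x), y).item() * x.size(0)
            n += x.size(0)
        losses[mode] = total / n
    return losses  # a large gap points at the BN running estimates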

Related

Training Accuracy Increasing but Validation Accuracy Remains as Chance of Each Class (1/number of classes)

I am training a classifier using CNNs in Pytorch. My classifier has 6 labels. There are 700 training images for each label and 10 validation images for each label. The batch size is 10 and the learning rate is 0.000001. Each class has 16.7% of the whole dataset's images. I have trained for 60 epochs and the architecture has 3 main layers:
Conv2D->ReLU->BatchNorm2D->MaxPool2D>Dropout2D
Conv2D->ReLU->BatchNorm2D->Flattening->Dropout2D
Linear->ReLU->BatchNorm1D->Dropout
And finally a fully-connected layer and a softmax (a rough sketch of this stack is given below).
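For readers, a rough PyTorch sketch of the stack described above; the channel counts, kernel sizes, dropout rates and the assumed 32x32 input size are illustrative guesses, not the asker's actual values.

import torch
import torch.nn as nn

model = nn.Sequential(
    # Conv2D -> ReLU -> BatchNorm2D -> MaxPool2D -> Dropout2D
    nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.BatchNorm2d(32), nn.MaxPool2d(2), nn.Dropout2d(0.25),

    # Conv2D -> ReLU -> BatchNorm2D -> Flatten -> Dropout
    nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
    nn.BatchNorm2d(64), nn.Flatten(), nn.Dropout(0.25),

    # Linear -> ReLU -> BatchNorm1D -> Dropout, then the final classifier
    nn.Linear(64 * 16 * 16, 128), nn.ReLU(),   # assumes 32x32 inputs and one 2x pooling
    nn.BatchNorm1d(128), nn.Dropout(0.5),
    nn.Linear(128, 6),   # nn.CrossEntropyLoss applies log-softmax itself
)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-6)  # the rate given in the question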
My optimizer is AdamW and the loss function is cross-entropy. The network is training well, as the training accuracy is increasing, but the validation accuracy remains almost fixed and equal to the chance level of each class (1/number of classes). The accuracy is shown in the image below:
Accuracy of training and test
And the loss is shown in:
Loss for training and validation
Any idea why this is happening? How can I improve the validation accuracy? I have used L1 and L2 regularization as well as dropout layers. I have also tried adding more data, but these didn't help.
Problem solved: First, I looked at this problem as overfitting and spent a lot of time on methods to address that, such as regularization and augmentation. Finally, after trying different methods, I couldn't improve the validation accuracy. Thus, I went through the data. I found a bug in my data preparation which was resulting in similar tensors being generated under different labels. I generated the correct data and the problem was solved to some extent (the validation accuracy increased around 60%). Then I finally improved the validation accuracy to 90% by adding more "conv2d + maxpool" layers.
This is not so much a programming-related question, so maybe ask it again on Cross Validated.
It would also be easier if you posted your architecture code.
But here are things that I would suggest:
You wrote that you "tried adding more data"; if you can, always use all the data you have. If that's still not enough (and even if it is), use augmentation (e.g. flip, crop, add noise to the image).
Your learning rate should not be so small; start with 0.001 and decay it while training, or try ~0.0001 without decaying.
Remove the dropout after the conv layers and the batchnorm after the dense layers and see if that helps. It is not so common to use dropout after conv layers, but normally that shouldn't have a negative effect; try it anyway. (A sketch of the augmentation and learning-rate suggestions is given below.)
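A sketch of the augmentation and learning-rate suggestions above, in PyTorch; the transform values, the 32x32 crop size and the schedule are examples only, and model is the placeholder network from the earlier sketch.

import torch
from torchvision import transforms

# Simple augmentation pipeline: flips, padded random crops, mild colour jitter.
train_transform = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomCrop(32, padding=4),
    transforms.ColorJitter(brightness=0.1, contrast=0.1),
    transforms.ToTensor(),
])

# Start around 1e-3 and decay during training, e.g. multiply by 0.1 every 20 epochs.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.1)
# call scheduler.step() once per epoch, after the training loop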

Keras: Over fitting Conv2D

I'm trying to build a convolution-based model. I trained two different structures, as follows. As you can see, for the single layer there isn't any obvious change over the number of epochs. The bi-layer Conv2D shows improving accuracy and losses on the train dataset, but the validation characteristics are a tragedy.
Given that I can't increase my dataset, what should I do to improve the validation characteristics?
I've tried the L1 & L2 regularizers, but they didn't affect my model.
1) You can use an adaptive learning rate (exponential decay or a step-dependent schedule may work for you). Furthermore, you can try extremely high learning rates when your model gets stuck in a local minimum.
2) If you are training with images, you can flip, rotate or do other things to increase your dataset size, and maybe some other augmentation techniques might work for your case.
3) Try changing the model: deeper, shallower, wider, narrower.
4) If you are building a classification model, please ensure that you are not using sigmoid as your activation function at the end unless you are doing binary classification.
5) Always check your dataset's condition before a training session:
Your train-test split may not be suitable for your case.
There might be extreme noise in your data.
Some of your data might be corrupted.
Note: I will update this list whenever a new idea comes to mind. Furthermore, I didn't want to repeat the comments and the other answers; both contain valuable information for your case. (A sketch of points 1 and 2 follows below.)
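A sketch of points 1) and 2) using Keras utilities; the schedule parameters and augmentation ranges are examples only.

import tensorflow as tf

# 1) Adaptive learning rate: exponential decay instead of a fixed rate.
lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=1e-3, decay_steps=1000, decay_rate=0.9)
optimizer = tf.keras.optimizers.Adam(learning_rate=lr_schedule)

# 2) On-the-fly augmentation: flips, rotations, shifts.
datagen = tf.keras.preprocessing.image.ImageDataGenerator(
    rotation_range=15,
    width_shift_range=0.1,
    height_shift_range=0.1,
    horizontal_flip=True)

# model.compile(optimizer=optimizer, loss="categorical_crossentropy", metrics=["accuracy"])
# model.fit(datagen.flow(x_train, y_train, batch_size=32), epochs=50,
#           validation_data=(x_val, y_val))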
The validation becomes a tragedy because the model is overfitting on the training data. You can try these to see if any of them work:
1) Batch normalisation would be a good option to go with.
2) Try reducing the batch size.
I tried a variety of models known to work well on small datasets, but, as I suspected, my ultimate verdict is that it is a lost cause.
You don't have nearly enough data to train a good DL model, or even an ML model like an SVM, and matters are exacerbated by having eight separate classes; your dataset would stand some chance with an SVM for binary classification, but none for 8-class. As a last resort, you can try XGBoost, but I wouldn't bet on it.
What can you do? Get more data. There's no way around it. I don't have an exact number, but for 8-class classification I'd say you need anywhere from 50-200x your current data to get reasonable results. Note also that your validation performance is bound to be much worse on a bigger validation set, which is accounted for in this estimate.
For readers: OP shared his dataset with me; the shapes are X = (1152, 1024, 1), y = (1152, 8).

Why do the magnitudes of the outputs during inference correlate with the batch size used during training?

I have to say this might be one of the weirdest problems I've ever met.
I was implementing ResNet to perform 10-class classification on CIFAR-10 with TensorFlow. Everything seemed to be fine during the training phase: the loss decreased steadily, and the accuracy on the training set kept increasing to over 90%. However, the results were totally abnormal during inference.
I have analyzed my code very carefully and ruled out the possibility of making mistakes when feeding the data or saving/loading the model. So the only difference between the training phase and the test phase lies in batch normalization layers.
For the BN layers, I used tf.layers.batch_normalization directly, and I thought I had paid attention to every pitfall of using tf.layers.batch_normalization.
Specifically, I've included the dependency for train_op as follows:
# Make sure the BN moving-average updates run before each optimizer step.
update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
with tf.control_dependencies(update_ops):
    self.train_op = optimizer.minimize(self.losses)
Also, for saving and loading the model, I've specified var_list as tf.global_variables(). Moreover, I used training=True for training and training=False for test.
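For readers, a minimal TF1-style sketch of the saving/loading pattern described above; the tiny graph, shapes and checkpoint path are illustrative, not the poster's code. Listing tf.global_variables() matters because the BN moving_mean / moving_variance are not trainable variables.

import tensorflow as tf

# Tiny illustrative graph with one BN layer (shapes are arbitrary).
inputs = tf.placeholder(tf.float32, [None, 8])
is_training = tf.placeholder(tf.bool, [])
hidden = tf.layers.dense(inputs, 16)
hidden = tf.layers.batch_normalization(hidden, training=is_training)

# tf.global_variables() includes the BN moving statistics, so they get checkpointed too.
saver = tf.train.Saver(var_list=tf.global_variables())

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # ... training steps, feeding is_training=True ...
    saver.save(sess, "./model.ckpt")

# Later, at inference time:
with tf.Session() as sess:
    saver.restore(sess, "./model.ckpt")
    # run the graph with is_training=False so BN uses the restored moving statistics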
Nevertheless, the accuracy during inference was only around 10%, even when applied to the same data used for training. And when I output the last layer of the network (i.e., the 10-dimensional vector fed into the softmax), I found that the magnitude of each item in that vector during training was always around 1e0 or 1e-1, while for inference it could be 1e4 or even 1e5. The strangest part was that the magnitude of this vector during inference correlated with the batch size used in training, i.e., the bigger the batch size, the smaller the magnitude.
Besides, I also found that the magnitudes of the moving_mean and moving_variance of the BN layers correlated with the batch size too. But why is this even possible? I thought moving_mean was supposed to be the mean of the entire training population, and likewise for moving_variance. So why would it have anything to do with the batch size?
I think there must be something I don't know about using BN with TensorFlow. This problem is really going to drive me crazy! I never expected to deal with such a problem in TensorFlow, considering how convenient it is to use BN with PyTorch!
The problem has been solved!
I read the source code of TensorFlow. Based on my understanding, the value of momentum in tf.layers.batch_normalization should be about 1 - 1/num_of_batches. The default value is 0.99, which means the default is most suitable when there are 100 batches in the training data.
I didn't find any documentation mentioning this. I hope this can be helpful to someone who runs into the same problem with BN in TensorFlow!
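Following that explanation, a sketch of what lowering the momentum looks like; the tensor shape, number of batches and the is_training placeholder are hypothetical, and the 1 - 1/num_batches rule is the poster's own rule of thumb.

import tensorflow as tf

inputs = tf.placeholder(tf.float32, [None, 32, 32, 64])
is_training = tf.placeholder(tf.bool, [])

num_batches_per_epoch = 50                        # hypothetical value
bn_momentum = 1.0 - 1.0 / num_batches_per_epoch   # ~0.98 rather than the 0.99 default

normalized = tf.layers.batch_normalization(
    inputs, momentum=bn_momentum, training=is_training)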

NASNet-A fine tuning poor validation accuracy

I have a dataset of roughly 34,000 images divided into 2 sets: a train set (30,000 images) and a validation set (4,000 images). Each image is the result of the difference between two images taken from a video (the time offset between the images in each pair is about 1 second). The videos have a static background, so the diff images contain mostly black with only one or two small coloured regions. Each diff image has a label (there has been an action or not: 1 or 0), so this is a sort of binary classification. Briefly, I'm using the slim models pretrained on ImageNet to do the fine-tuning on my dataset. I've launched 5 separate trainings using 5 different networks: InceptionV4, InceptionResnetV2, Resnet152, NASNet-mobile, NASNet.
I got very good results using the first 4 networks (InceptionV4, InceptionResnetV2, Resnet152, NASNet-mobile), but that was not the case with NASNet. The thing is that the area under the ROC curve on the validation set is always 0.5, and the logits of the validation images all have roughly the same values, which is really weird. In fact, I got this kind of result with NASNet-mobile over the first 10,000 mini-batches, but after that the model did converge. Here are the values of the hyperparameters I have in my script:
batch_size=10
weight_decay = 0.00004
optimizer = rmsprop
rmsprop_momentum = 0.9
rmsprop_decay = 0.9
learning_rate_decay_type = exponential
learning_rate = 0.01
learning_rate_decay_factor = 0.94
num_epochs_per_decay = 2.0  # number of epochs after which the learning rate decays
I'm still a newbie with tensorflow and I did not find anything related anywhere else. This is really weird behavior because I'm using the same parameters and the same inputs, but it seems that with NASNet there is a problem somewhere. I'm not only looking for a solution (if that's possible, since I know it is tough to troubleshoot such things without many details about the model); insights about where to look and how to troubleshoot would also be great. Has anybody had this kind of problem when fine-tuning NASNet before, something like the model not converging, for example? Finally, I know it is really hard to get answers to such questions, but I hope to get at least some insights so I can move forward with my investigation.
EDIT:
Here are the plots of the cross entropy and regularization losses:
EDIT:
As proposed in the answer, I set the drop_path_keep_prob param to 1 and now the model converges and I get good accuracy on the validation set. But now the question is: what does this param mean? Is it one of the params that we should adapt to our dataset (like the learning rate, etc.)?
The simplest sanity check you can do would be to run the fine-tuning on a single minibatch. Any deep network should be able to overfit to that if there aren't any big problems. If you see that it can't, then there must be some problem with the definition, or with the way you're using it.
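A sketch of that single-minibatch check, written in PyTorch for brevity (the same idea applies to the slim/TF setup in the question); model, criterion, optimizer and the tensors x, y are placeholders.

import torch

def can_overfit_one_batch(model, criterion, optimizer, x, y, steps=200):
    # Repeatedly fit one minibatch; the loss should collapse towards zero.
    model.train()
    for step in range(steps):
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
        if step % 50 == 0:
            print(f"step {step}: loss {loss.item():.4f}")
    return loss.item()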
The only guess I have in your case is that it could be something to do with the drop_path implementation. It's disabled in the mobile version, but it is enabled during training on the large model. It could make the model unstable enough that it wouldn't fine tune, so it may be worth trying to train with it disabled.

Keras CNN training accuracy is good but test accuracy is very low

Please give me any comments on these CNN results.
I have used 2000 training images and 400 test images.
The training accuracy is perfect but the test accuracy is very low.
I think it is because there is a lot of variation between the training and test images.
Does anyone have a good idea for this case?
This is a clear case of over-fitting. How many learnable parameters do you have? For example, VGGNet has 138M parameters, and in such a case it's not hard to imagine that some neurons in the network have essentially memorized a training image as-is, so your network is not generalizing well.
To fix that, first of all you can try a simpler model if the task is simple, like discriminating between shapes. You can also increase the training data via transformations, such as swapping colour channels (if it doesn't affect the output class), or flipping or rotating the image, to make your net generalize better. Include L1/L2 regularization in your loss function and try dropout.
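For readers, a hedged Keras sketch of those suggestions: checking the parameter count and adding L2 regularization plus dropout. The layer sizes and the 64x64x3 input shape are illustrative, not the asker's model.

import tensorflow as tf
from tensorflow.keras import layers, regularizers

model = tf.keras.Sequential([
    layers.Conv2D(32, 3, activation="relu", padding="same",
                  kernel_regularizer=regularizers.l2(1e-4),
                  input_shape=(64, 64, 3)),
    layers.MaxPooling2D(),
    layers.Dropout(0.25),
    layers.Flatten(),
    layers.Dense(64, activation="relu",
                 kernel_regularizer=regularizers.l2(1e-4)),
    layers.Dropout(0.5),
    layers.Dense(10, activation="softmax"),
])

# Compare the parameter count against the number of training images (2000 here).
print(model.count_params())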
