This is a problem that I constantly face, but I can't seem to find the answer anywhere. I have a data set of 700 samples. As a result, I have to use cross-validation, instead of a single validation set and a single test set, to get a close estimate of the error.
I would like to use a neural network to do this. But after doing CV with a neural network and getting an error estimate, how do I train the NN on the whole data set? For other algorithms like logistic regression or SVM, there is no question of when to stop training. But for a NN, you train until your validation score starts to get worse. So, for the final model, trained on the whole data set, how do you know when to stop?
Just to make it clear, my problem is not how to choose hyper-parameters with a NN. I can do that by using nested CV. My question is how to train the final NN on the whole data set (more specifically, when to stop) before applying it in the wild.
To rephrase your question:
"When training a neural network, a common stopping criterion is the 'early stopping criterion' which stops training when the validation loss increases (signaling overfitting). For small datasets, where training samples are precious, we would prefer to use some other criterion and use 100% of the data for training the model."
I think this is generally a hard problem, so I am not surprised you have not found a simple answer. I think you have a few options:
Add regularization (such as Dropout or Batch Normalization) which should help prevent overfitting. Then, use the training loss for a stopping criterion. You could see how this approach would perform on a validation set without using early stopping to ensure that the model is not overfitting.
Be sure not to overprovision the model. Smaller models will have a more difficult time overfitting.
Take a look at the stopping criterion described in this paper which does not rely on a validation set: https://arxiv.org/pdf/1703.09580.pdf
Finally, you may not want to use neural networks here at all. Generally, these models work best with large amounts of training data. With only 700 samples, you can possibly get better performance with another algorithm.
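As a rough sketch of the first option (stopping on the training loss alone), the function below stops once the training loss has plateaued. The names should_stop, stopping_epoch, patience and min_rel_improvement are made up for illustration, and the thresholds are assumptions you would have to tune, not values taken from any library.

```python
def should_stop(train_losses, patience=5, min_rel_improvement=1e-3):
    """Return True once the training loss has stopped improving.

    train_losses: per-epoch training losses so far, oldest first.
    Training stops when the best loss of the last `patience` epochs improves
    on the best loss seen before them by less than `min_rel_improvement`
    (measured relative to the earlier best).
    """
    if len(train_losses) <= patience:
        return False  # not enough history yet
    best_before = min(train_losses[:-patience])
    best_recent = min(train_losses[-patience:])
    improvement = (best_before - best_recent) / max(abs(best_before), 1e-12)
    return improvement < min_rel_improvement


def stopping_epoch(train_losses, **kwargs):
    """Replay a recorded loss curve; return the 1-based epoch at which
    should_stop first fires, or None if it never does."""
    for epoch in range(1, len(train_losses) + 1):
        if should_stop(train_losses[:epoch], **kwargs):
            return epoch
    return None
```

During cross-validation you could replay each fold's training curve through stopping_epoch and check that the validation loss at that epoch is still acceptable, before trusting the same rule on the full data set.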
I'm working on a regression problem using a neural network. The MSE loss decreases at the beginning of training and the accuracy is satisfactory, but as training goes on, the loss makes a huge jump and then stays at a certain value, like the curve in the picture. I don't know why this happens or how to fix it. Also, could I use the coefficients from before the jump, for example at training step 8000, as my final result?
This is a typical case of model training where the accuracy metric stops improving (and even gets worse) after a certain number of training epochs.
I suggest you implement early stopping, meaning that yes, you can take the model from step 8000 as your final result if your only goal is to minimize the training loss.
This TF documentation explains how to implement early stopping with TensorFlow's tf.keras.callbacks.EarlyStopping callback.
However, if your goal is a model that generalizes well to unseen data (test/validation data), as is generally the case, you might want to evaluate your model's accuracy on held-out data and take it into account when implementing early stopping.
This article gives a very good example of an end-to-end implementation of early stopping with TensorFlow.
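To make the "take the weights from before the jump" idea concrete, here is a minimal sketch in plain Python. It assumes you saved a checkpoint and logged a held-out loss at regular steps; the history values are invented for illustration, and best_checkpoint and stop_step are hypothetical helper names. In Keras, the EarlyStopping callback with its monitor, patience and restore_best_weights arguments does roughly this job for you.

```python
def best_checkpoint(history):
    """Return the (step, loss) pair with the lowest held-out loss.

    history: list of (step, loss) pairs in training order.
    Ties go to the earliest step.
    """
    if not history:
        raise ValueError("history is empty")
    return min(history, key=lambda pair: pair[1])


def stop_step(history, patience=2):
    """Return the step at which patience-based early stopping would halt,
    or None if the loss kept improving often enough."""
    best_loss = float("inf")
    evals_since_best = 0
    for step, loss in history:
        if loss < best_loss:
            best_loss = loss
            evals_since_best = 0
        else:
            evals_since_best += 1
            if evals_since_best >= patience:
                return step
    return None


# Invented numbers: the loss improves until step 8000, then jumps.
history = [(2000, 0.90), (4000, 0.50), (6000, 0.32),
           (8000, 0.30), (10000, 2.10), (12000, 2.00)]

print(best_checkpoint(history))  # (8000, 0.3)
print(stop_step(history))        # 12000
```

Training would halt at step 12000, but the checkpoint you keep is the one from step 8000.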
I am trying to implement a residual network to classify images from the CIFAR10 dataset for a project. I have a working model whose training accuracy grows logarithmically, but whose validation accuracy plateaus. I used batch normalization and ReLU after most layers and a softmax at the end.
Here is my data split:
(train_images, train_labels), (test_images, test_labels) = datasets.cifar10.load_data()
Here is my code to compile and train the model
resNet50.compile(optimizer='adam', loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True), metrics=['accuracy'])
resNet50.fit(train_images, train_labels, epochs=EPOCHS, validation_data=(test_images, test_labels))
What might be causing this validation plateau and what could improve my model?
Thank you in advance for your feedback and comments.
This is a very common problem and a form of overfitting.
I invite you to read the book Deep Learning by Ian Goodfellow, Yoshua Bengio and Aaron Courville, especially this chapter (freely available), which is very informative.
In short, you seem to have chosen a model (ResNet50 + default training parameters) that has too much capacity for your problem and data. If you choose a model that is too simple, the training and evaluation curves will be very close to one another, but with worse performance than what you could achieve. If you choose a model that is too complex (as is somewhat the case here), you can reach much better performance on the training data, but the evaluation performance will not be at the same level and could even be quite bad. That's called overfitting the training set.
What you want is the best middle point: the best performance on evaluation data is found at a model complexity just short of overfitting. You want the two performance curves to be close to one another, but both should be as good as possible.
So you need to decrease the capacity of your model for your problem. There are different ways to do that; they are not equally effective at reducing overfitting, and they do not cost the same in training performance. The best method is usually to add more training data, if you can. If you can't, the next best thing to add is regularization, such as data augmentation, dropout, L1 or L2 regularization, and early stopping. The last one is especially useful if your validation performance starts decreasing at some point, instead of just plateauing. That is not your case, so it should not be the first thing you try.
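To illustrate two of the regularizers listed above, here is a small framework-free sketch of an L2 penalty and of (inverted) dropout. The function names and the values of lam and rate are assumptions for illustration only; in Keras you would normally use the built-in Dropout layer and the kernel_regularizer argument rather than writing this by hand.

```python
import random


def l2_penalty(weights, lam):
    """L2 term added to the loss: lam times the sum of squared weights."""
    return lam * sum(w * w for w in weights)


def dropout(activations, rate, rng, training=True):
    """Inverted dropout on a list of activations.

    During training each unit is zeroed with probability `rate`, and the
    survivors are scaled by 1 / (1 - rate) so the expected activation is
    unchanged. At inference time the input is returned as-is.
    """
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    if not training or rate == 0.0:
        return list(activations)
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
dropped = dropout([1.0] * 1000, rate=0.5, rng=rng)
print(sum(dropped) / len(dropped))    # close to 1.0 on average
print(l2_penalty([1.0, -2.0], 0.1))   # 0.1 * (1 + 4)
```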
If regularization is not enough, then try playing with the learning rate or the other parameters mentioned in the book. You should be able to make ResNet50 itself work much better than this on CIFAR10, but it may not be trivial.
I am training a NN and getting this result on loss and validation loss:
These are 200 epochs, a batch size of 16, 500 training samples and 200 validation samples.
As you can see, after about 20 epochs, the validation loss begins to zig-zag wildly.
Do you know what could be the reason for that behavior?
I tried to increase the number of validation samples but that just increased the zig-zagging and made it more exaggerated.
Also, I added a decay value to the optimizer, but the loss and validation loss did not look so good.
I was looking for another way to improve it.
Any idea what causes the zig-zagging and how I could minimize it?
This might be a case of overfitting:
Overfitting refers to a model that models the “training data” too well. Overfitting happens when a model learns the detail and noise in the training data to the extent that it negatively impacts the performance of the model on new data.
Basically, you have a very small training sample (500), but are training for a very long time (200 epochs!).
The network will start learning your training data by heart and won't learn to generalise. It will thus seem to be very good during training, but will fail miserably on the test set.
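To put numbers on that, using the figures from the question (500 training samples, batch size 16, 200 epochs) and assuming the last, smaller batch of each epoch is kept rather than dropped:

```python
import math

train_samples = 500
batch_size = 16
epochs = 200

# Number of weight updates per pass over the training set.
steps_per_epoch = math.ceil(train_samples / batch_size)
# Size of the final, partial batch in each epoch.
last_batch_size = train_samples - (steps_per_epoch - 1) * batch_size
# Total number of gradient updates over the whole run.
total_updates = steps_per_epoch * epochs

print(steps_per_epoch)  # 32
print(last_batch_size)  # 4
print(total_updates)    # 6400
```

So the network performs 6400 weight updates while seeing each of the 500 samples 200 times, which gives it plenty of opportunity to memorize them.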
Early stopping is a nice way to avoid overfitting: basically, stop as soon as the validation loss becomes erratic or starts increasing. Another way to lower the chances of overfitting is to use techniques such as dropout, or simply to increase the amount of training data.
TL;DR: you are overfitting. There are many ways to avoid this issue: drastically reduce the number of epochs, use a dev set and a stopping criterion, get more training data, ...
For alternative explanations, see also this question on Quora.
I would suggest not worrying about the zigzag pattern in the validation loss or validation accuracy. As training goes on, the network makes mistakes and updates its weights accordingly (if you know the math behind it). Evaluating on held-out data while the weights are still changing will therefore produce a zigzag, because the model is still in its learning stage. Once the model is fully trained, you will notice that the zigzag decreases (if you have chosen the correct number of epochs).
So don't worry about this.
I ask this question because many deep learning frameworks, such as Caffe, support refining an already trained model. For example, in Caffe, we can use a snapshot to initialize the neural network parameters and then continue training, as the following command shows:
./caffe train -solver solver_file.prototxt -snapshot snap_file.solverstate
To further train the model, I can play with the following tricks:
use a smaller learning rate
change the optimisation method, for example by switching from stochastic gradient descent to the Adam algorithm
Any other tricks I can play with?
ps: I understand that reducing the loss function value of the training samples does not mean that we can get a better model.
The question is way too broad, I think. However, this is a common practice, especially in case of a small training set. I would rank possible methods like this:
smaller learning rate
more/different data augmentation
add noise to train set (related to data augmentation, indeed)
fine-tune on subset of the training set.
The very last one is indeed a very powerful method for finalizing a model that performs poorly on some corner cases. You can build a 'difficult' training subset in order to bias the model towards it. I personally use it very often.
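One possible way to build such a 'difficult' subset, sketched under the assumption that you can compute a loss for each training sample with the current model: keep the samples with the highest loss and fine-tune on those. The helper name hardest_subset and the default fraction are made up for illustration.

```python
def hardest_subset(per_sample_losses, fraction=0.2):
    """Return the indices of the highest-loss samples, hardest first.

    per_sample_losses: one loss value per training sample.
    fraction: share of the training set to keep (at least one sample).
    """
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    n = len(per_sample_losses)
    k = max(1, round(fraction * n))
    ranked = sorted(range(n), key=lambda i: per_sample_losses[i], reverse=True)
    return ranked[:k]


# Invented per-sample losses for five training samples.
losses = [0.1, 2.0, 0.3, 1.5, 0.05]
print(hardest_subset(losses, fraction=0.4))  # [1, 3]
```

The returned indices can then be used to slice the training arrays before continuing training with a small learning rate.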
I am using TensorFlow to train a model that has 1 output for 4 inputs. It is a regression problem.
I found that when I use RandomForest to train the model, it converges quickly and also does well on the test data. But when I use a simple neural network for the same problem, the loss (Random square error) does not converge. It gets stuck at a particular value.
I tried increasing/decreasing number of hidden layers, increasing/decreasing learning rate. I also tried multiple optimizers and tried to train the model on both normalized and non-normalized data.
I am new to this field but the literature that I have read so far vehemently asserts that the neural network should marginally and categorically work better than the random forest.
What could be the reason behind non-convergence of the model in this case?
If your model is not converging, it may mean that the optimizer is stuck in a local minimum of your loss function.
I don't know what optimizer you are using, but try increasing the momentum or even the learning rate slightly.
Another commonly used strategy is learning rate decay, which reduces your learning rate by a factor every few epochs. This can also help you avoid getting stuck in a local minimum early in the training phase, while still achieving maximum accuracy towards the end of training.
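As a sketch of step decay: the learning rate is multiplied by a fixed factor every few epochs. The defaults below (halving every 10 epochs) are arbitrary assumptions, not recommendations, and step_decay is a hypothetical helper name.

```python
def step_decay(initial_lr, epoch, drop_factor=0.5, epochs_per_drop=10):
    """Learning rate for a given 0-based epoch under step decay."""
    return initial_lr * drop_factor ** (epoch // epochs_per_drop)


for epoch in (0, 9, 10, 25):
    print(epoch, step_decay(0.1, epoch))
# 0 0.1
# 9 0.1
# 10 0.05
# 25 0.025
```

A function of this shape can be plugged into a scheduler hook such as Keras's LearningRateScheduler callback, after wrapping it to match the signature the callback expects.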
Otherwise, you could try an adaptive optimizer (Adam, Adagrad, Adadelta, etc.) that adapts the learning rate for you during training.
This is a very good post comparing different optimization techniques.
Deep neural networks need a significant amount of data to perform adequately. Be sure you have lots of training data, or your model will overfit.
A useful rule when starting to train models is not to begin with the more complex methods. Start instead with something simpler, for example a linear model, which you will be able to understand and debug more easily.
In case you continue with the current methods, some ideas:
Check the initial weight values (init them with a normal distribution)
As a previous poster said, reduce the learning rate
Do some additional checking on the data: check for NaN values and outliers, since the current models could be more sensitive to noise. Remember, garbage in, garbage out.
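The data check in the last point can be sketched with the standard library alone. This assumes each feature is available as a list of floats; the z-score threshold of 3 is a common rule of thumb rather than a fixed rule, and check_column is a hypothetical helper name.

```python
import math
import statistics


def check_column(values, z_threshold=3.0):
    """Return the indices of NaN entries and of z-score outliers."""
    nan_idx = [i for i, v in enumerate(values) if math.isnan(v)]
    clean = [v for v in values if not math.isnan(v)]
    outlier_idx = []
    if len(clean) >= 2:
        mean = statistics.mean(clean)
        std = statistics.pstdev(clean)
        if std > 0:
            outlier_idx = [
                i for i, v in enumerate(values)
                if not math.isnan(v) and abs(v - mean) / std > z_threshold
            ]
    return {"nan": nan_idx, "outliers": outlier_idx}


# Invented feature column: 20 ordinary values, one extreme value, one NaN.
column = [1.0, 1.1, 0.9, 1.05, 0.95] * 4 + [100.0, float("nan")]
print(check_column(column))  # {'nan': [21], 'outliers': [20]}
```

Running this over each of the input features and the target before training is cheap, and it rules out one frequent cause of a loss that refuses to move.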