i'm working on a regression problem using neural network. the mse loss would decrease at the beginning of train and the accuracy is satisfactory, yet, when the train process goes on, the loss had a huge jump, and maintain at a certain value,like the curve in the picture. i don't know why and how to fix it? and i wanna ask if i could use the train coefficient before the jump, like train step at 8000, as my final result?
This is a typical case of model training where the value of the accuracy metric stops improving (and even get worse) from a certain number of training epochs.
I'll suggest you to implement Early Stopping meaning that, "yes", you can take the training accuracy at step 8000 as you final result if your only goal is to minimize the training loss.
This TF documentation explains how to implement Early Stopping with Tensorflow's tf.keras.callbacks.EarlyStopping() function.
However if your goal is a model that generalizes well on unseen data (test/validation data) as this is generally the case, you might want to evaluate your model's test accuracy in order to take it into account when implementing Early Stopping.
This article gives a very good example of end-to-end implementation of early stopping with Tensorflow.
Related
Good training, testing and validation accuracies but strange historical accuracies behavior for model:
Here is the summary of my model :
I performed the execution and prediction tasks and I've got the next confusion matrix :
while the Accuracy behavior was the next :
I can not understand if this is overfitting or underfitting or a normal behavior?
Adding the loss plot to clarify more in the next
Thank you in advance for any useful information and help !
Does not look like over fitting. Your training accuracy is increasing and so is the AVERAGE test accuracy. Over fitting is when the test loss improves, then plateaus and then starts to increase. It is best to look at loss metrics to monitor this. It is typical that once the training accuracy gets high the test loss will oscillate to a small degree. You can test for over fitting by varying the drop out rate and see the effect on test loss.
As you already mentioned your training is doing well.
First of all I recommend you to check a prediction by yourself with test-data.
The Validation-loss will converge until a specific value. It may looks a little bit variance but you need as reference the y-Axis. The ups and downs of the last epochs are between 91% and 94% which is not really much in reference to 100% (maybe change the y-Axis).
I am training a NN and getting this result on loss and validation loss:
These are 200 epochs, a batch size of 16, 500 training samples and 200 validation samples.
As you can see, after about 20 epochs, the validation loss begins to do a very exaggerated zig-zagging.
Do you know which could be the reason for that behavior?
I tried to increase the number of validation samples but that just increased the zig-zagging and made it more exaggerated.
Also, I added a decay value to the optimizer, but the loss and validation loss did not look so good.
.
I was looking for another way to improve it.
Any idea on which is the zig-zagging reason and how could I minimize it?
This might be a case of overfitting:
Overfitting refers to a model that models the “training data” too well. Overfitting happens when a model learns the detail and noise in the training data to the extent that it negatively impacts the performance of the model on new data source.
Basically, you have a very small training sample (500), but are training for a very long time (200 epochs!).
The network will start learning your training data by heart and won't learn to generalise. It will thus seem to be very good during training, but will fail miserably on the test set.
early stopping is a nice way to avoid overfitting: basically, stop as soon as the validation loss becomes erratic/starts increasing. Another way to lower the chances of overfitting is to use techniques such as dropout or simply to increase the training data.
tldr; you are overfitting. To avoid this issue, many possibilities: reduce drastically the number of epochs, use a dev set and a stopping criterion, have more training data, ...
For alternative explanations, see also this question on QUORA.
I would suggest that don't be worry for the zigzag fashion of the validation loss or validation accuracy. See, what happens when training of the neural network goes on, it makes the mistakes and update the weights, right ?( if you know the math behind it). So it is obvious that testing data will create zigzag because model is in training mode (learning stage). Once the model will get trained fully , you will notice that ... zigzag will decrease (if you have chose correct number of epochs).
So don't worry for this.
classifier.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
classifier.fit(X_train, y_train, epochs=50, batch_size=100)
Epoch 1/50
27455/27455 [==============================] - 3s 101us/step - loss: 2.9622 - acc: 0.5374
I know I'm compiling my model in first line and fitting it in second. I know what is optimiser. I'm interested the meaning of metrics=['accuracy'] and what does the acc: XXX exactly mean when I compile the model.
Also, I'm getting acc : 1.000 when I train my model (100%) but when I test my model I'm getting 80% accuracy. Does my model overfitting?
Ok, let's begin from the top,
First, metrics = ['accuracy'], The model can be evaluated on multiple parameters, accuracy is one of the metrics, other can be binary_accuracy, categorical_accuracy, sparse_categorical_accuracy, top_k_categorical_accuracy, and sparse_top_k_categorical_accuracy, these are only the inbuilt ones, you can even create custom metrics, to understand metrics in more details, you need to have a clear understanding of loss in a Neural Network, you might know that loss function must be differentiable in order to be able to do back propagation, this is not necessary in case of metrics, metrics are used purely for model evaluation and thus can even be functions that are not differentiable, in Keras as mentioned even in their documentation
A metric function is similar to a loss function, except that the results from evaluating a metric are not used when training the model. You may use any of the loss functions as a metric function.
On your Own, you can custom define an accuracy that is not differentiable but creates an objective function on what you need from your model.
TLDR; Metrics are just loss functions not used in back propagation but used for model evaluation.
Now,
acc:xxx might just be that it has not even finished one minibatch propagation and thus cannot give an accuracy score yet, I have not paid much attention to it, but it usually stays there for a few seconds and is thus an speculation from that.
Finally 20% Decrease in model performance when taken out of training, yes this can be a case of Overfitting but no one can know for sure without looking at your dataset, but most probably yes, it is overfitting, and you may need to look at the data it is performing bad on to know the cause.
If something is unclear, doesn't make sense, feel free to comment.
Having 100% accuracy on train dataset while having 80% accuracy on test dataset doesn't mean that your model overfits. Moreover, it almost surely doesn't overfit if your model is equipped with much more effective parameters that the number of training samples [2], [5] (insanely large model example [1]). This contradicts to conventional statistical learning theory, but these are empirical results.
For models with number of parameters greater than number of samples, it's better to continue to optimize the logistic or cross-entropy loss even after the training error is zero and the training loss is extremely small, and even if the validation loss increases [3]. This may hold even regardless of batch size [4].
Clarifications (edit)
The "models" I was referring to are neural networks with two or more hidden layers (could be also convolutional layers prior to dense layers).
[1] is cited to show a clear contradiction to classical statistical learning theory, which says that large models may overfit without some form of regularization.
I would invite anyone who disagrees with "almost surely doesn't overfit" to provide a reproducible example where models, say for MNIST/CIFAR etc with few hundred thousand parameters do overfit (in a sense of increasing with iterations test error curve).
[1] Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc V. Le,Geoffrey E. Hinton, and Jeff Dean. Outrageously large neural networks: Thesparsely-gated mixture-of-experts layer.CoRR, abs/1701.06538, 2017.
[2] Lei Wu, Zhanxing Zhu, et al. Towards understanding generalization of deep learn-ing: Perspective of loss landscapes.arXiv preprint arXiv:1706.10239, 2017.
[3] Daniel Soudry, Elad Hoffer, Mor Shpigel Nacson, Suriya Gunasekar, and NathanSrebro. The implicit bias of gradient descent on separable data.The Journal of Machine Learning Research, 19(1):2822–2878, 2018.
[4] Elad Hoffer, Itay Hubara, and Daniel Soudry. Train longer, generalize better: clos-ing the generalization gap in large batch training of neural networks. InAdvancesin Neural Information Processing Systems, pages 1731–1741, 2017.`
[5] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals.Understanding deep learning requires rethinking generalization.arXiv preprintarXiv:1611.03530, 2016.
Starting off with the first part of your question -
Keras defines a Metric as "a function that is used to judge the performance of your model". In this case you are using accuracy as the function to judge how good your model is. (This is the norm)
For the second part of your question - acc is the accuracy of your model at that Epoch. This can, and will change depending on which metrics were defined in the model.
Finally it is possible that you have ended up with an overfit model given what you have told us but there are simple solutions
So the meaning of metrics=['accuracy'] is actually dependent on what loss function you use. You can see how keras handels this from line 375 and down. Since you are using categorical_crossentropy, your case follows the logic in the elif (line 386). Hence your metric function is set to
metric_fn = metrics_module.sparse_categorical_accuracy
See this post for a description of the logic behind sparse_categorical_accuracy, it should clear the meaning of "accuracy" in your case. It basically just counts how many of your prediction (the class with maximum probability) was the same as the true class.
The train vs validation accuracy can show sign of over-fitting. To test this plot the train accuracy and validation accuracy against each other and see at what point the validation accuracy start to decrease. Follow this for a good description of how to plot accuracy and loss etc, to test for over-fitting.
This is a problem that I am constantly facing, but don't seem to find the answer anywhere. I have a data set of 700 samples. As a result, I have to use cross-validation instead of just using one validation and one test set to get a close estimate of the error.
I would like to use a neural network to do this. But after doing CV with a neural network, and get an error estimate, how do I train the NN on the whole data set? Because for other algorithms like Logistic regression or SVM, there is no question of when to stop in training. But for NN, you train it until your validation score goes down. So, for the final model, training on the whole dataset, how do you know when to stop?
Just to make it clear, my problem is not how to choose hyper-parametes with NN. I can do that by using a nested CV. My question is how to train the final NN on the whole data set(when to stop more specifically) before applying it in wild?
To rephrase your question:
"When training a neural network, a common stopping criterion is the 'early stopping criterion' which stops training when the validation loss increases (signaling overfitting). For small datasets, where training samples are precious, we would prefer to use some other criterion and use 100% of the data for training the model."
I think this is generally a hard problem, so I am not surprised you have not found a simple answer. I think you have a few options:
Add regularization (such as Dropout or Batch Normalization) which should help prevent overfitting. Then, use the training loss for a stopping criterion. You could see how this approach would perform on a validation set without using early stopping to ensure that the model is not overfitting.
Be sure not to overprovision the model. Smaller models will have a more difficult time overfitting.
Take a look at the stopping criterion described in this paper which does not rely on a validation set: https://arxiv.org/pdf/1703.09580.pdf
Finally, you may not use Neural Networks here. Generally, these models work best with large amounts of training data. In this case of 700 samples, you can possibly get better performance with another algorithm.
I am using TensorFlow for training model which has 1 output for the 4 inputs. The problem is of regression.
I found that when I use RandomForest to train the model, it quickly converges and also runs well on the test data. But when I use a simple Neural network for the same problem, the loss(Random square error) does not converge. It gets stuck on a particular value.
I tried increasing/decreasing number of hidden layers, increasing/decreasing learning rate. I also tried multiple optimizers and tried to train the model on both normalized and non-normalized data.
I am new to this field but the literature that I have read so far vehemently asserts that the neural network should marginally and categorically work better than the random forest.
What could be the reason behind non-convergence of the model in this case?
If your model is not converging it means that the optimizer is stuck in a local minima in your loss function.
I don't know what optimizer you are using but try increasing the momentum or even the learning rate slightly.
Another strategy employed often is the learning rate decay, which reduces your learning rate by a factor every several epochs. This can also help you not get stuck in a local minima early in the training phase, while achieving maximum accuracy towards the end of training.
Otherwise you could try selecting an adaptive optimizer (adam, adagrad, adadelta, etc) that take care of the hyperparameter selection for you.
This is a very good post comparing different optimization techniques.
Deep Neural Networks need a significant number of data to perform adequately. Be sure you have lots of training data or your model will overfit.
A useful rule for beginning training models, is not to begin with the more complex methods, for example, a Linear model, which you will be able to understand and debug more easily.
In case you continue with the current methods, some ideas:
Check the initial weight values (init them with a normal distribution)
As a previous poster said, diminish the learning rate
Do some additional checking on the data, check for NAN and outliers, the current models could be more sensitive to noise. Remember, garbage in, garbage out.