How to split test and train size

I am trying to train a CNN model (human body pose estimation) on a dataset that contains 1000 samples.
First, how can I make sure that my dataset is already large enough?
Second, how should I split my data into training and test sets? (When I set train_size = 0.6 and test_size = 0.4, the network doesn't work well and shows NaN for the weights, biases and loss value!)

There is no fixed way to determine when a data set is large enough; it depends on many factors. The best thing to do is train with what you have and see how the model performs. I usually split my data into three sets: training, validation and test. I usually try 75% for training, 15% for validation and 10% for the final test.
The validation set is what I use to tweak the hyperparameters. Initially I monitor the training accuracy and loss. If I can get the training accuracy above 95%, I then monitor the validation accuracy and loss. I use the Keras ModelCheckpoint callback to save the model with the lowest validation loss. If the validation accuracy and loss are not satisfactory, I tweak the hyperparameters to try to improve them. I have found an adjustable learning rate to be useful for this purpose.
Finally, when I am satisfied with the training and validation accuracy, I use the saved model to make predictions on the test set. This is the final measure of how the model performs.
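The 75/15/10 split described above can be sketched with the standard library alone. This is a minimal illustration, not the answerer's actual code: the function name `three_way_split`, its parameters and the fixed seed are assumptions made for the example.

```python
import random


def three_way_split(samples, train_frac=0.75, val_frac=0.15, seed=0):
    """Shuffle the samples and cut them into train, validation and test lists.

    Whatever is left after the train and validation shares becomes the test set.
    """
    if train_frac <= 0 or val_frac < 0 or train_frac + val_frac >= 1:
        raise ValueError("fractions must leave a non-empty test share")
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test


# Stand-in for 1000 samples: here just their indices.
train, val, test = three_way_split(range(1000))
print(len(train), len(val), len(test))  # 750 150 100
```

Shuffling before cutting matters when the data is stored in some order (for example grouped by subject or class); the test list should then be left untouched until the hyperparameters are final.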

Related

Am I overfitting?

What it looks like with less smoothing
Hi! I am currently training my model with Darkflow YOLOv2. The optimiser is SGD with a learning rate of 0.001.
Based on this graph, my val loss > train loss. Does that mean the model is overfitting? If so, what would be the recommended course of action? It seems weird because both losses are decreasing, but the val loss is decreasing more slowly.
For more info:
My train dataset consists of 400 images per class with single annotations, for a total of 2800 images. I did this to prevent class imbalance, by annotating only one class instance per image. My val dataset consists of 350 images with multiple annotations; basically, I annotated every object within the images. I have 7 classes and my train-val-test split is 80-10-10. Is this the cause of the val loss behaviour?

Detecting over-fitting involves spotting a mismatch as training accuracy diverges from test (validation) accuracy. Since you haven't provided that data, we can't evaluate your model.
It might help to clarify stages and terms; this should let you answer the question for yourself in the future:
"Convergence" is the point in training at which we believe that the model
has learned something useful;
has reached this point via reproducible process;
isn't going to get significantly better;
is about to get worse.
Convergence is where we want to stop training and save (checkpoint) the model for production use.
We detect convergence by use of training passes and testing (validation) passes.
At convergence, we expect:
validation loss (error function, perplexity, etc.) is at a relative minimum;
validation accuracy is at a relative maximum;
validation and training metrics are "reasonably stable", with respect to the model's general behaviour;
training accuracy and validation accuracy are essentially equal.
Once a training run passes this point, it often transitions into "over-fitting", in which the model learns things so specific to the training data, that it is no longer as good at inferring about new observations. In this state,
training loss drops; validation loss rises;
training accuracy rises; validation accuracy drops.
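The convergence and over-fitting criteria above can be checked on recorded loss curves. The sketch below is only one way to do it: the helper names `best_epoch` and `overfits_after`, the `window` parameter and the loss values are all made up for illustration.

```python
def best_epoch(val_losses):
    """Index of the epoch with the lowest validation loss (the checkpoint to keep)."""
    return min(range(len(val_losses)), key=val_losses.__getitem__)


def overfits_after(train_losses, val_losses, window=3):
    """First epoch that starts `window` consecutive epochs in which training loss
    drops while validation loss rises; None if no such run exists."""
    for start in range(1, len(val_losses) - window + 1):
        steps = range(start, start + window)
        train_down = all(train_losses[i] < train_losses[i - 1] for i in steps)
        val_up = all(val_losses[i] > val_losses[i - 1] for i in steps)
        if train_down and val_up:
            return start
    return None


# Invented per-epoch losses, not taken from any real run.
train_losses = [1.00, 0.70, 0.50, 0.40, 0.30, 0.22, 0.15, 0.10]
val_losses = [1.10, 0.80, 0.65, 0.60, 0.62, 0.66, 0.72, 0.80]

print(best_epoch(val_losses))                    # 3
print(overfits_after(train_losses, val_losses))  # 4
```

Requiring several consecutive epochs keeps a single noisy validation reading from being reported as over-fitting; the right window length depends on how noisy the curves are.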

CNN model validation accuracy is not improving

I am currently working on a CNN model for classification: I have to predict words from a wav file. I ran into a problem where my validation accuracy stays (almost) the same. At first I thought it was overfitting, but that does not seem to be the problem. Below you can see a photo with the results at the different epochs:
I am building the CNN model with Keras, using the 'adam' optimizer and 'categorical_crossentropy' for the loss. I have already tried increasing the number of epochs up to 1000 and changing the batch size.

Your training loss seems to be decreasing, but val_loss is increasing while val_accuracy stays approximately the same. This is a standard case of overfitting. Why do you think that's not the case?
Increasing the number of training epochs or the batch size does not help, as you're just changing the number of times the model sees the data or the amount of data it sees in one batch.
In the current scenario, the best model is the one at the point up to which both val_loss and train_loss keep decreasing, before val_loss saturates.
To address the problem, you can add noise to the training data so that the model generalizes better, make the training examples more general, and create balanced categories in terms of training data volume.
Secondly, you can enlarge your validation dataset to see whether the issue persists. If it does, then the model is definitely overfitting. Also, please update your question with the kind of validation set and technique you're using. If possible, add the code for your validation set and loss function.
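As a rough illustration of the "add noise to the training data" suggestion, the sketch below appends noisy copies of each training example. It assumes each example is a flat list of numeric features (for audio, something like a feature vector extracted from the wav file); the function names, `sigma` and `copies` are hypothetical choices, not values from the answer.

```python
import random


def add_gaussian_noise(features, sigma=0.01, rng=None):
    """Return a copy of the feature vector with zero-mean Gaussian noise added."""
    rng = rng or random.Random()
    return [x + rng.gauss(0.0, sigma) for x in features]


def augment(dataset, copies=2, sigma=0.01, seed=0):
    """dataset: list of (features, label) pairs.

    Returns the original pairs followed by `copies` noisy versions of each;
    labels are left unchanged.
    """
    rng = random.Random(seed)
    augmented = list(dataset)
    for features, label in dataset:
        for _ in range(copies):
            augmented.append((add_gaussian_noise(features, sigma, rng), label))
    return augmented


# Tiny invented training set with two classes.
training_set = [([0.10, 0.20, 0.30], "yes"), ([0.90, 0.80, 0.70], "no")]
bigger_set = augment(training_set, copies=2, sigma=0.05)
print(len(bigger_set))  # 6
```

Noise of this kind would normally be applied to the training split only, so that the validation and test sets keep measuring performance on unmodified data.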

Why does validation accuracy remain at 75% while train accuracy is 100%?

I used my own data set to train a model using the retrain.py file from the TensorFlow site. However, with my first set of images, I am seeing a test accuracy of 100% while validation accuracy is at 70%. I see that the validation cross-entropy is increasing, which indicates overfitting. I am new to this field and got to this stage by following online tutorials.
I have not yet enabled random brightness, crop and flip for training. I am trying to understand why this behaviour occurs. I tried the flower example and it worked as expected: there, cross-entropy reached its lowest value, instead of increasing as it does with my data set.
Could someone explain what's going on inside the CNN here?

Your model has over-fitted the training data. If it's a large model, you should consider using transfer learning, where you train the model on a large dataset like ImageNet and then fine-tune it on your data. You can also try adding some form of regularization to prevent overfitting, especially Dropout and L2 regularization.

This simply means your model is overfitting. Overfitting means your model is not generalizing well to unseen data (i.e. the validation data). What you can do is add some form of regularization (L2 is commonly used). It penalizes weights for taking very large values, which tend to lead to overfitting. It also works against the model trying to fit outliers, which again leads to less generalization and more overfitting.
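To make the L2 idea concrete: the penalty adds lambda times the sum of squared weights to the loss, so every gradient step also pulls each weight toward zero. The following is a plain-Python sketch with hypothetical names (`l2_penalty`, `regularized_loss`, `sgd_step`) and arbitrary values for the learning rate and `lam`; a framework such as Keras would apply the equivalent through its own regularizer options.

```python
def l2_penalty(weights, lam):
    """lam * sum of squared weights."""
    return lam * sum(w * w for w in weights)


def regularized_loss(data_loss, weights, lam=0.01):
    """Loss on the data plus the L2 penalty on the weights."""
    return data_loss + l2_penalty(weights, lam)


def sgd_step(weights, grads, lr=0.1, lam=0.01):
    """One gradient step on data_loss + lam * sum(w^2).

    `grads` are the gradients of the data loss alone; the penalty contributes
    an extra 2 * lam * w per weight, which shrinks large weights the most.
    """
    return [w - lr * (g + 2.0 * lam * w) for w, g in zip(weights, grads)]


weights = [3.0, -4.0]
print(regularized_loss(0.5, weights, lam=0.01))
# With zero data gradient the step only shrinks the weights toward zero.
print(sgd_step(weights, grads=[0.0, 0.0]))
```

A larger `lam` penalizes big weights more strongly; too large a value can push the model toward underfitting, so it is usually tuned on the validation set.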

Overfitting terminology in Machine Learning

In the book Introduction to Machine Learning with Python on page 50 the author is performing a Linear Regression on a dataset and gets:
training set score: 0.67
test set score: 0.66
They then state that they are “likely underfitting, not overfitting.”
However, when using TensorFlow’s Basic Classification Tutorial they are using the MNIST Fashion dataset with a neural network and get:
training set score: 0.892
test set score: 0.876
and then they state the following
“It turns out, the accuracy on the test dataset is a little less than the accuracy on the training dataset. This gap between training accuracy and test accuracy is an example of overfitting. Overfitting is when a machine learning model performs worse on new data than on their training data.”
I believe that the quote taken from the TensorFlow site is the correct one. Or are they both correct, and I just don't fully understand overfitting?

Underfitting occurs when both the training and testing accuracies are low. This signifies a systematic problem with your model, e.g. the data would fit better with a polynomial model but you're using a linear model. So a score of ~0.66 on both training and testing is considered underfitting because both are very low. In general, high error on both sets indicates underfitting.
Overfitting occurs when you have relatively high accuracy on training, but lower accuracy on testing. This signifies that your model has fit too closely to your training data and does not generalize well to other data. In general, low error on training and higher error on testing indicates overfitting.

In general, it is extremely rare to build a model that shows the same performance on the training and validation (or test, or holdout, whatever you wish to call it) sets. Thus, a gap between training and validation performance will (almost) always be there. You will often see overfitting defined in terms of this gap, but in practice that definition is not applicable because it is not quantitative. The more general concept here is the "bias-variance trade-off", which you might want to look up. The relevant questions are how large the gap is, how good the performance is, and how performance on the validation set behaves as the complexity of the model changes.
I find this figure from Wikipedia very instructive: https://en.wikipedia.org/wiki/Overfitting#/media/File:Overfitting_svg.svg. The x axis is the number of training iterations (epochs) in the case of NNs or GBMs, but you can also think of it as a model complexity parameter, e.g. the number of powers included in a polynomial model. As you can see, there is always a gap between performance on the training and validation samples. But the key to choosing a model that does not overfit is to choose the optimal trade-off between performance on the training sample (= bias) and performance on the validation sample (the difference between performance on training and validation samples = variance).

Over- and under-fitting
The most extreme overfitting you can have is an accuracy of 100% on your training set.
This means that your model has learned to predict exactly the inputs that it has seen before.
If you are ever in this situation, your model will probably perform very poorly on the test set.
You can detect overfitting by:
a high accuracy on the training set
a large gap between training and test set
You can detect underfitting by:
a low accuracy on the training set (irrespective of performance on the test set)
Examples:
1)
training set score: 0.67
test set score: 0.66
This example has a low score on training set. So underfitting seems like a fair assumption.
2)
training set score: 0.892
test set score: 0.876
This one is up to interpretation. The score on the training set is quite high and there is a gap with respect to the test set.
If the examples in both sets are very similar, then I would say that there is some overfitting. However, if the two sets are quite different (for example from different sources), then the results could be deemed acceptable.
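The two rules of thumb above can be written down as a small helper. The cut-offs `low_score` and `max_gap` are assumptions picked only to make the example run; as the second case shows, where to draw the line is a judgment call that depends on the task and on how similar the two sets are.

```python
def diagnose(train_score, test_score, low_score=0.7, max_gap=0.05):
    """Apply the rules of thumb with hypothetical thresholds:

    low training score -> underfitting;
    otherwise a large train/test gap -> overfitting.
    """
    if train_score < low_score:
        return "underfitting"
    if train_score - test_score > max_gap:
        return "overfitting"
    return "no clear sign of either"


print(diagnose(0.67, 0.66))                   # underfitting
print(diagnose(0.892, 0.876))                 # no clear sign of either
print(diagnose(0.892, 0.876, max_gap=0.01))   # overfitting
```

The last two calls use the same scores and differ only in the tolerated gap, which is exactly the "up to interpretation" part of example 2.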

Is this an overfitted network?

I obtained this result after training a neural network in Keras, and I was wondering whether this is overfitting or not.
I have doubts because I have read that overfitting is produced when a net is overtrained, and that it shows up as the validation loss INCREASING.
But in this case it doesn't increase. It remains the same, while the training loss DECREASES.
EXTRA INFO
A single dataset split this way:
70% of the dataset used as training data
30% of the dataset used as validation data
500 EPOCHS TRAINING
2000 EPOCHS TRAINING
Training loss: 3.1711e-05
Validation loss: 0.0036

There is a slight overfit in the sense that your training loss keeps decreasing while the validation loss has stopped decreasing.
However, I wouldn't consider this harmful because the validation loss isn't increasing. That is, if I read the graph correctly; if there is a small increase, then it's getting bad.
A harmful overfit is when your validation loss starts increasing. The validation loss is your true measure of the performance of the network. If it goes up, then your model is starting to do bad things and you should stop there.
All in all this seems pretty decent. The training loss will almost always go lower than the validation loss at some point, since this is an optimization process over the training set.
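"Stop when the validation loss starts going up" is usually implemented with some patience, so that one noisy epoch does not end training. Below is a framework-independent sketch; the class name `EarlyStopper`, the `patience` and `min_delta` parameters and the loss values are invented for the example. Keras offers an `EarlyStopping` callback that serves the same purpose.

```python
class EarlyStopper:
    """Track the best validation loss and signal a stop after `patience`
    consecutive epochs without an improvement of at least `min_delta`."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best_loss = float("inf")
        self.best_epoch = None
        self.bad_epochs = 0

    def update(self, epoch, val_loss):
        """Record one epoch; return True when training should stop."""
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.best_epoch = epoch  # this is the epoch whose weights to keep
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Invented validation losses: improving until epoch 2, then rising.
val_losses = [0.90, 0.60, 0.50, 0.52, 0.55, 0.60, 0.70]
stopper = EarlyStopper(patience=3)
stopped_at = None
for epoch, loss in enumerate(val_losses):
    if stopper.update(epoch, loss):
        stopped_at = epoch
        break

print(stopped_at, stopper.best_epoch)  # 5 2
```

In a case like the one in the question, where the validation loss is flat rather than rising, a stopper like this would also end training once no further improvement shows up, which saves epochs without costing validation performance.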

Training loss does indeed appear to continue decreasing further than validation loss (it still looks to me like it didn't finish decreasing yet at the 500th epoch, would be good to continue for more epochs and see what happens). The difference doesn't appear to be large though.
It may be overfitting slightly, but it may also be possible that the distribution of your validation data is simply a bit different from the distribution of the training data.
I'd recommend testing the following:
Continue for more than 500 epochs, to see if the training loss keeps on decreasing even further, or if it stabilizes close to the validation loss. If it keeps on decreasing much further, and the validation loss stays the same, it's safe to say that the network is overfitting.
Try creating different splits of training and validation sets. How did you determine training and validation sets actually? Were you given two separate sets, one for training and one for validation? Or were you given a single large training set, and did you split it up yourself? In the first case, the distributions may be different, so a difference in training vs validation loss wouldn't be strange. In the second case, try randomly creating different splits and repeating the experiments to see if you always consistently get the same difference in training vs validation loss, or if they're sometimes also closer together.
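The second suggestion, repeating the experiment over several random splits, might look like the sketch below. To stay self-contained it uses a deliberately trivial stand-in "model" (predict the mean of the training targets) and synthetic data; in practice the marked line would be replaced by training and evaluating the real network. All names and parameter values here are assumptions.

```python
import random
import statistics


def mse(targets, prediction):
    """Mean squared error of a constant prediction."""
    return sum((t - prediction) ** 2 for t in targets) / len(targets)


def gap_over_random_splits(targets, n_repeats=5, val_frac=0.3, seed=0):
    """Validation loss minus training loss for several random 70/30 splits."""
    rng = random.Random(seed)
    gaps = []
    for _ in range(n_repeats):
        shuffled = list(targets)
        rng.shuffle(shuffled)
        n_val = round(len(shuffled) * val_frac)
        val, train = shuffled[:n_val], shuffled[n_val:]
        prediction = statistics.mean(train)  # stand-in for "train the model"
        gaps.append(mse(val, prediction) - mse(train, prediction))
    return gaps


# Synthetic targets, only to have something to split.
data_rng = random.Random(42)
targets = [data_rng.gauss(0.0, 1.0) for _ in range(200)]
gaps = gap_over_random_splits(targets)
print([round(g, 3) for g in gaps])
print("spread of the gap:", round(statistics.pstdev(gaps), 3))
```

If the gap has a similar size and sign on every split, it is more likely a property of the model (overfitting); if it swings widely from split to split, the particular split is a more likely explanation.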
