Best techniques to split a dataset to get high performance accuracy - python

I have applied these 4 methods:
Train and Test Sets.
K-fold Cross Validation.
Leave One Out Cross Validation.
Repeated Random Test-Train Splits.
The method "Train and Test Sets" achieves high accuracy, while the remaining methods achieve roughly the same accuracy as each other, but lower than the first approach.
I want to know which method I should choose.

Train and Test Sets and Cross Validation are each used in certain cases. Cross Validation is used when you want to compare different models. Accuracy generally increases when you use more training data, which is why Leave One Out Cross Validation sometimes performs better than K-fold Cross Validation; it depends on your dataset size and sometimes on the algorithm you are using. On the other hand, a plain Train and Test Set split is usually used when you aren't comparing different models and when the time required to run cross validation isn't worth it, meaning cross validation simply isn't needed in that case. In most cases Cross Validation is preferred. So which method should you choose? That usually depends on the choices you make while training, such as the way you handle the data and the algorithm you use: for example, when training with Random Forests you usually don't need to do Cross Validation if you use the Out-of-Bag estimate, although you can still run it if you need more assurance.
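For illustration, here is a minimal sketch (not from the answer above) that contrasts a single train/test split score with mean k-fold cross-validation scores for two models; the iris dataset and the two classifiers are just placeholders.

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score

X, y = load_iris(return_X_y=True)

# Single hold-out split: one accuracy number, sensitive to that particular split.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
holdout_acc = RandomForestClassifier(random_state=0).fit(X_train, y_train).score(X_test, y_test)

# 5-fold cross validation: an averaged estimate, more useful for comparing models.
rf_cv = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=5).mean()
lr_cv = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5).mean()

print("hold-out accuracy:", holdout_acc)
print("5-fold CV accuracy - random forest:", rf_cv, "logistic regression:", lr_cv)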

Training a model comprises tuning model accuracy as well as model generalization. If a model is not generalized, it may be an underfit or overfit model.
In that case, the model may perform better on training data, but accuracy may decrease on test or unknown data.
We use training data to improve the accuracy of the model. As the training data size increases, model accuracy may also increase.
Similarly, we use different training samples to generalize the model.
So train-test splitting methods depend on the size of the available data and the algorithm used for model design.
The first, the train-test method, has fixed-size training and testing data. So on each iteration we use the same train data to train the model and the same test data to assess the model's accuracy.
The second, the k-fold method, also has fixed-size train and test data, but on each iteration the test and train data change. So it may be a better approach irrespective of data size.
The leave-one-out approach is useful only if the data size is small. Here we use almost the whole dataset for training, so the training accuracy of the model will be better, but it may not be a generalized model.
The randomised train-test method is also a good approach for training and testing a model's performance. Here we randomly select the train and test data each time, so it may perform better than the leave-one-out method if the data size is small.
Finally, each splitting approach has some pros and cons, so it is up to you which splitting method is good for your model. It also depends on the data size and on data selection, i.e. how we select data from the sample while splitting.
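As a rough sketch of the four splitting strategies discussed above, here is how they could be set up with scikit-learn; the synthetic dataset, model, and fold counts are just placeholders.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, KFold, LeaveOneOut, ShuffleSplit, cross_val_score
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=100, random_state=0)
model = LogisticRegression(max_iter=1000)

# 1. Single train/test split: fixed train data, fixed test data.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# 2. K-fold cross validation: train/test parts change on every iteration.
kfold = KFold(n_splits=10, shuffle=True, random_state=0)

# 3. Leave-one-out: one fold per sample, almost all data used for training.
loo = LeaveOneOut()

# 4. Repeated random test-train splits (shuffle split).
shuffle = ShuffleSplit(n_splits=10, test_size=0.25, random_state=0)

# Each splitter can be passed to cross_val_score, e.g.:
scores = cross_val_score(model, X, y, cv=kfold)
print(scores.mean())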

Related

Reset the weights in K-fold cross validation

In k-fold cross validation, why do we need to reset the weights after each fold?
We use this function:
def reset_weights(m):
    if isinstance(m, nn.Conv2d) or isinstance(m, nn.Linear):
        m.reset_parameters()
We reset the weights of the model so that each cross-validation fold starts from some random initial state and does not learn from the previous folds.
Why is that important? I would think that if we don't do that it would be better, since the model would learn from all folds and update its parameters from all of them, not from each one on its own.
K-fold cross validation is meant to check whether model performance is consistent and robust across different subsamples of train and test data, and to fine-tune hyperparameters in a less biased way.
If the model performs well, with low variance among the numerous (usually 5 or 10) folds of train and test data, it means that the model's performance is not tied to some particular subsample of the data.
https://en.wikipedia.org/wiki/Cross-validation_(statistics)
After validating the model, you can train it on the whole dataset, without splitting it, to improve performance.
But this approach alone can't tell you whether your model has overfitted or not, so take note of CNN regularization and validation methods.
https://www.analyticsvidhya.com/blog/2020/09/overfitting-in-cnn-show-to-treat-overfitting-in-convolutional-neural-networks/
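To make this concrete, here is a minimal sketch of how the question's reset_weights function could be applied at the start of each fold; the model, data, and fold count are placeholders, not the asker's actual setup.

import torch
import torch.nn as nn
from sklearn.model_selection import KFold

X = torch.randn(100, 10)           # placeholder features
y = torch.randint(0, 2, (100,))    # placeholder labels
model = nn.Linear(10, 2)           # placeholder model

def reset_weights(m):
    if isinstance(m, nn.Conv2d) or isinstance(m, nn.Linear):
        m.reset_parameters()

for fold, (train_idx, test_idx) in enumerate(KFold(n_splits=5, shuffle=True).split(X)):
    model.apply(reset_weights)     # fresh random initial state for every fold
    # ... train on X[train_idx], y[train_idx]; evaluate on X[test_idx], y[test_idx] ...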

Understanding Cross Validation for Machine learning

Is the following correct about cross validation?:
The training data is divided into different groups, all but one of the training data sets is used for training the model. Once the model is trained the ‘left out’ training data is used to perform hyperparameter tuning. Once the most optimal hyperparameters have been chosen the test data is applied to the model to give a result which is then compared to other models that have undergone a similar process but with different combinations of training data sets. The model with the best results on the test data is then chosen.
I don't think it is correct. You wrote:
Once the model is trained the ‘left out’ training data is used to perform hyperparameter tuning
You tune the model by picking (manually or using a method like grid search or random search) a set of the model's hyperparameters (parameters whose values are set by you before you even fit the model to the data). Then, for a selected set of hyperparameter values, you calculate the validation set error using cross-validation.
So it should be like this:
The training data is divided into different groups, all but one of the training data sets is used for training the model. Once the model is trained the ‘left out’ training data is used to ...
... calculate the error. At the end of the cross validation you will have k errors, calculated on the k left-out sets. What you do next is calculate the mean of these k errors, which gives you a single value: the validation set error.
If you have n sets of hyperparameters, you simply repeat the procedure n times, which gives you n validation set errors. You then pick the set that gave you the smallest validation error.
At the end, you will typically calculate the test set error to see what the model's performance is on unseen data, which simulates putting the model into production, and to see whether there is a difference between the test set error and the validation set error. If there is a significant difference, it means over-fitting.
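Here is a small sketch of that procedure, assuming a synthetic dataset, an SVM classifier, and a hypothetical grid of three values for the hyperparameter C; none of these choices come from the question itself.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

best_C, best_cv_acc = None, -1.0
for C in [0.1, 1.0, 10.0]:                                             # n sets of hyperparameters
    cv_acc = cross_val_score(SVC(C=C), X_train, y_train, cv=5).mean()  # mean of the k fold scores
    if cv_acc > best_cv_acc:
        best_C, best_cv_acc = C, cv_acc

final_model = SVC(C=best_C).fit(X_train, y_train)
test_acc = final_model.score(X_test, y_test)   # compare with best_cv_acc to check for over-fitting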
Just to add something on cross-validation itself: the reason we use k-fold CV or LOOCV is that it is a good estimate of the test set error, which means that when I tune the hyperparameters and the validation set error drops, I know that I really improved the model, instead of just being lucky and simply fitting the model better to the train set.

How to use data augmentation with cross validation

I need to apply data augmentation to what would be my training data. The problem is that I am using cross-validation, so I can't find a reference for how to adjust my model to use data augmentation. My cross-validation is somewhat indexing my data by hand.
There are articles and general content about data augmentation, but very little, and nothing general, about combining cross validation with data augmentation.
I need to apply data augmentation to the training data by simply rotating and adding zoom, cross-validate for the best weights and save them, but I don't know how.
This example can be copy-pasted for better reproducibility; in short, how would I employ data augmentation and also save the weights with the best accuracy?
When training machine learning models, you should not test the model on samples that were used during the training phase (if you care about realistic results).
Cross validation is a method for estimating model accuracy. The essence of the method is that you split your available labeled data into several parts (or folds), then use one part as a test set while training the model on all the rest, repeating this procedure for every part in turn. This way you essentially test your model on all the available data, without hurting training too much. There is an implicit assumption that the data distribution is the same in all folds. As a rule of thumb, the number of cross validation folds is usually 5 or 7. This depends on the amount of labeled data at your disposal: if you have lots of data, you can afford to leave less data for training the model and increase the test set size. The higher the number of folds, the better the accuracy estimate you can achieve, as the training portion increases, but the more time you have to invest in the procedure. In the extreme case you have a leave-one-out procedure: train on everything but one single sample, effectively making the number of folds equal to the number of data samples.
So for a 5-fold CV you train 5 different models, which have a large overlap in their training data. As a result, you should get 5 models with similar performance. (If that is not the case, you have a problem ;) ) After you have the test results, you throw away all 5 models you have trained and train a new model on all the available data, assuming its performance will be the mean of the values you got during the CV phase.
Now about the augmented data: you should not allow data obtained by augmenting the training part to leak into the test set. Each data point created from the training part should be used only for training, and the same applies to the test set.
So you should split your original data into k folds (for example using KFold or GroupKFold), then create augmented data for each fold and concatenate it to the original. Then you follow the regular CV procedure.
In your case, you can simply pass each group (such as x_group1) through the augmenting procedure before concatenating them, and you should be fine.
Please note that splitting the data in a linear way can lead to unbalanced datasets and is not the best way of splitting the data. You should consider the functions I've mentioned above.
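As a rough sketch of that split-then-augment idea (assuming NumPy arrays and a purely hypothetical augment() helper standing in for your rotation/zoom pipeline):

import numpy as np
from sklearn.model_selection import KFold

def augment(images, labels):
    # hypothetical placeholder: return rotated/zoomed copies of the training images
    return images.copy(), labels.copy()

X = np.random.rand(100, 32, 32, 3)    # placeholder image data
y = np.random.randint(0, 2, 100)

for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    X_aug, y_aug = augment(X[train_idx], y[train_idx])      # augment the training part only
    X_train = np.concatenate([X[train_idx], X_aug])
    y_train = np.concatenate([y[train_idx], y_aug])
    X_test, y_test = X[test_idx], y[test_idx]               # the test fold stays untouched
    # ... fit the model on (X_train, y_train), evaluate on (X_test, y_test) ...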

How to check machine learning accuracy without cross validation

I have training samples X_train and Y_train to train on, and X_estimated.
My task is to make my classifier learn as accurately as it can and then predict a vector of results over X_estimated that comes as close as possible to Y_estimated (which I have now, and I have to be as precise as I can). If I split my training data, say 75/25, into train and test, I can get the accuracy using sklearn.metrics.accuracy_score and a confusion matrix. But I am losing those 25% of samples that would make my predictions more accurate.
Is there any way I could learn using 100% of the data and still be able to see an accuracy score (or percentage), so I can predict many times and save the best result?
I am using a random forest with 500 estimators and usually get around 90% accuracy. I want to save the best possible prediction vector for my task without splitting any data (not wasting anything), but still be able to calculate the accuracy (so I can save the best prediction vector) across multiple attempts (a random forest always shows different results).
Thank you
Splitting your data is critical for evaluation.
There is no way you could train your model on 100% of the data and get a correct evaluation accuracy unless you expand your dataset. I mean, you could change your train/test split or try to optimize your model in other ways, but I guess the simple answer to your question would be no.
As per your requirement, you can try K-fold Cross Validation with a 90|10 train|test split. Using 100% of the data for training is not possible, because you have to test on something in order to validate how good your model is. K-fold CV takes your whole train data into consideration and, in each fold, randomly takes a test sample from that train data. Finally it calculates the accuracy by averaging over all the folds. Then you can test the accuracy on the remaining 10% of the data.
K Fold Cross Validation
Sklearn provides simple methods for performing K-fold cross validation; you simply pass the number of folds to the method. But remember, the more folds you use, the longer it takes to train the model.
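A small sketch of that workflow, assuming a random forest like the asker's and a synthetic dataset: hold out 10% for a final test, run 10-fold CV on the remaining 90%, and average the fold scores.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score, train_test_split

X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.10, random_state=0)

clf = RandomForestClassifier(n_estimators=500, random_state=0)
cv_scores = cross_val_score(clf, X_train, y_train, cv=KFold(n_splits=10, shuffle=True, random_state=0))
print("mean CV accuracy:", cv_scores.mean())

clf.fit(X_train, y_train)
print("held-out 10% accuracy:", clf.score(X_test, y_test))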
It is not necessary to do a 75|25 split of your data all the time. 75|25 is kind of old school now. It greatly depends on the amount of data that you have. For example, if you have 1 billion sentences for training a language model, it is not necessary to reserve 25% for testing.
Also, I second the previous answer of trying K-fold cross-validation. As a side note, you could consider looking at the other metrics like precision and recall as well.
In general, splitting your data set is critical for evaluation, so I would recommend you always do that.
That said, there are methods that in some sense allow you to train on all your data and still get an estimate of your performance, i.e. estimate the generalization accuracy.
One particularly prominent method is leveraging the out-of-bag samples of models based on bootstrapping, i.e. Random Forests.
from sklearn.ensemble import RandomForestClassifier

# With bootstrap=True and oob_score=True, each tree is scored on the samples
# it did not see during training, giving a built-in generalization estimate.
rf = RandomForestClassifier(n_estimators=100, bootstrap=True, oob_score=True)
rf.fit(X, y)
print(rf.oob_score_)  # out-of-bag accuracy, no separate test split needed
If you are doing classification, always go with stratified k-fold CV (https://machinelearningmastery.com/cross-validation-for-imbalanced-classification/).
If you're doing regression, go with simple k-fold CV, or you can divide the target into bins and do stratified k-fold CV. This way you can use your data completely in model training.
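A minimal sketch of that suggestion (the data here is random and only for illustration): StratifiedKFold for classification, and a binned target to stratify a regression problem.

import numpy as np
from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Regression: bin the continuous target and stratify on the bin labels.
X_reg = np.random.rand(200, 4)
y_reg = np.random.rand(200)
y_bins = np.digitize(y_reg, bins=np.quantile(y_reg, [0.25, 0.5, 0.75]))

for train_idx, test_idx in skf.split(X_reg, y_bins):
    pass  # fit on X_reg[train_idx], y_reg[train_idx]; evaluate on the test indices

# Classification works the same way, stratifying on the class labels directly:
# for train_idx, test_idx in skf.split(X, y): ...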

How do you estimate the performance of a classifier on test data?

I'm using scikit to make a supervised classifier and I am currently tuning it to give me good accuracy on the labeled data. But how do I estimate how well it does on the test data (unlabeled)?
Also, how do I find out if I'm starting to overfit the classifier?
You can't score your method on unlabeled data because you need to know the right answers. In order to evaluate a method, you should split your train set into a (new) train set and a test set (via sklearn.model_selection.train_test_split, for example). Then fit the model on the train set and score it on the test set.
If you don't have a lot of data and holding out some of it may negatively impact performance of an algorithm, use cross validation.
Since overfitting is the inability to generalize, a low test score (compared with the training score) is a good indicator of it.
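As a small sketch of that check, on placeholder data: compare the train and test scores of a fitted classifier; a large gap suggests over-fitting.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("train accuracy:", clf.score(X_train, y_train))
print("test accuracy:", clf.score(X_test, y_test))   # a much lower test score points to over-fitting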
