Understanding Cross Validation for Machine learning - python

Is the following correct about cross validation?:
The training data is divided into different groups, all but one of the training data sets is used for training the model. Once the model is trained the ‘left out’ training data is used to perform hyperparameter tuning. Once the most optimal hyperparameters have been chosen the test data is applied to the model to give a result which is then compared to other models that have undergone a similar process but with different combinations of training data sets. The model with the best results on the test data is then chosen.

I don't think it is correct. You wrote:
Once the model is trained the ‘left out’ training data is used to perform hyperparameter tuning
You tune the model by picking (manually or using a method like grid search or random search) a set of model's hyperparameters (parameters which values are set by you, before you will even fit the model to data). Then for a selected set of hyperparameters' values you calculate the validation set error using Cross-Validation.
So it should be like this:
The training data is divided into different groups, all but one of the training data sets is used for training the model. Once the model is trained the ‘left out’ training data is used to ...
... calculate the error. At the end of the cross validation, you will have k errors calculated on k left out sets. What you do next is calculating a mean of these k errors which gives you a single value - validation set error.
If you have n sets of hyperparameters, you simply repeat the procedure n times, which gives you n validation set errors. You then pick this set, that gave you the smallest validation error.
At the end, you will typically calculate the test set error to see what is the model's performance on unseen data, which simulates putting a model into production and to see whether there is a difference between test set error and validation set error. If there is a significant difference, it means over-fitting.
Just to add something on cross-validation itself, the reason why we use k-CV or LOOCV
is that it is great test set error estimate, which means that when I manipulate with hyperparameters and the value of validation set error dropped down, I know that I really improved model instead of being lucky and simply better fitting the model to train set.

Related

How to choose n_component in TruncatedSVD

I want to use TruncatedSvd to reduce the dimension of dataset, but I can't understand how to choose the best number for the n_component. Can anyone help me?
This is a hyperparameter of your model, as such, there is no right answer. What you will likely want to do is split your dataset into training/validation/test set, and use the validation set to conduct hyperparameter tuning to conduct a grid-search of the number of components in the TruncatedSvd.
The basic pipeline is:
Train your model using only your training set, starting with some random value for the number of components
Evaluate your model performance on the validation set. Then, go back to step one trying a different number of components. Then, evaluate again on your validation set. Repeat until you have searched over a reasonable size of number of components, and choose the number of components that gives you highest performance on the validation set.
Evaluate your model on the test set. This is your final model performance

How to use data augmentation with cross validation

I need to use data augmentation on what would be my training data from the data augmentation step. The problem is that i am using cross-validation, so i can't find a reference how to adjust my model to use data augmentation. My cross-validation is somewhat indexing by hand my data.
There is articles and general content about data augmentation, but very little and with no generalization for cross validation with data augmentation
I need to use data augmentation on training data by simply rotating and adding zoom, cross validate for the best weights and save them, but i wouldnt know how.
This example can be copy pasted for better reproducibility, in short how would i employ data augmentation and also save the weights with the best accuracy?
When training machine learning models, you should not test model on the samples used during model training phase (if you care for realistic results).
Cross validation is a method for estimating model accuracy. The essence of the method is that you split your available labeled data into several parts (or folds), and then use one part as a test set, training the model on all the rest, and repeating this procedure for all parts one by one. This way you essentially test your model on all the available data, without hurting training too much. There is an implicit assumption that data distribution is the same in all folds. As a rule of thumb, the number of cross validation folds is usually 5 or 7. This depends on the amount of the labeled data at one's disposal - if you have lots of data, you can afford to leave less data to train the model and increase test set size. The higher the number of folds, the better accuracy estimation you can achieve, as the training size part increases, and more time you have to invest into the procedure. In extreme case one have a leave-one-out training procedure: train on everything but one single sample, effectively making number of the folds equal to the number of data samples.
So for a 5-fold CV you train 5 different models, which have a a large overlap of the training data. As a result, you should get 5 models that have similar performance. (If it is not the case, you have a problem ;) ) After you have the test results, you throw away all 5 models you have trained, and train a new model on all the available data, assuming it's performance would be a mean of the values you've got during CV phase.
Now about the augmented data. You should not allow data obtained by augmentation of the training part leak into the test. Each data point created from the training part should be used only for training, same applies to the test set.
So you should split your original data into k-folds (for example using KFold or GroupKFold), then create augmented data for each fold and concatenate them to the original. Then you follow regular CV procedure.
In your case, you can simply pass each group (such as x_group1) through augmenting procedure before concatenating them, and you should be fine.
Please note, that splitting data in linear way can lead to unbalanced data sets and it is not the best way of splitting the data. You should consider functions I've mentioned above.

Use same data for test and validation

Hey I am training a CNN model , and was wondering what will happen if I use the same data for validation and test?
Does the model train on validation data as well? (Does my model see the validation data?) Or just the error and accuracy are calculatd and taken into account for training?
You use your validation_set to tune your model. It means that you don`t train on this data but the model takes it into account. For example, you use it to tune the model's hyperparameters.
In order to have a good evaluation - as test set you should use a data which is totally unknown to this model.
Take a look at this article for more information which here I point out the most relevant parts of it to your question :
A validation dataset is a sample of data held back from training your
model that is used to give an estimate of model skill while tuning
model’s hyperparameters.
The validation dataset is different from the test dataset that is also
held back from the training of the model, but is instead used to give
an unbiased estimate of the skill of the final tuned model when
comparing or selecting between final models.
If you use the same set for validation and test, your model may overfit (since it has seen the test data before the final test stage).

How to properly select the best model in GridSearchCV - both sklearn and caret do it wrong

Consider 3 data sets train/val/test. Sklearns GridSearchCV by default chooses the best model with the highest cross validation score. In a real world setting where the predictions need to be accurate this is a horrible approach to choosing the best model. The reason is because this is how it's supposed to be used:
-Train set for the model to learn the dataset
-Val set to validate what the model has learned in the train set and update parameters/hyperparameters to maximize the validation score.
-Test set - to test your data on unseen data.
-Finally use the model in a live setting and log the results to see if the results are good enough to make decisions. It's surprising that many data scientists impulsively use their trained model in production based only on selecting the model with the highest validation score. I find grid search to choose models that are painfully overfit and do a worse job at predicting unseen data than the default parameters.
My approaches:
-Manually train the models and look at the results for each model (in a sort of a loop, but not very efficient). It's very manual and time consuming, but I get significantly better results than grid search. I want this to be completely automated.
-Plot the validation curve for each hyperparameter I want to choose, and then pick the hyperparameter that shows the smallest difference between train and val set while maximizing both (i.e. train=98%, val = 78% is really bad, but train=72%, val=70% is acceptable).
Like I said, I want a better (automated) method for choosing the best model.
What kind of answer I'm looking for:
I want to maximize the score in the train and validation set, while minimizing the score difference between the train and val sets. Consider the following example from a grid search algorithm:
There are two models:
Model A: train score = 99%, val score = 89%
Model B: train score = 80%, val score = 79%
Model B is a much more reliable model and I would chose Model B over model A anyday. It is less overfit and the predictions are consistent. We know what to expect. However grid search will choose model A since the val score is higher. I find this to be a common problem and haven't found any solution anywhere on the internet. People tend to be so focused on what they learn in school and don't actually think about the consequences about choosing an overfit model. I see redundant posts about how to use sklearn and carets gridsearch packages and have them choose the model for you, but not how to actually choose the best model.
My approach so far has been very manual. I want an automated way of doing this.
What I do currently is this:
gs = GridSearchCV(model, params, cv=3).fit(X_train, y_train) # X_train and y_train consists of validation sets too if you do it this way, since GridSearchCV already creates a cv set.
final_model = gs.best_estimator_
train_predictions = final_model.predict(X_train)
val_predictions = final_model.predict(X_val)
test_predictions = final_model.predict(X_test)
print('Train Score:', accuracy_score(train_predictions, y_train)) # .99
print('Val Score:', accuracy_score(val_predictions, y_val)) # .89
print('Test Score:', accuracy_score(test_predictions, y_test)) # .8
If I see something like above I'll rule out that model and try different hyperparameters until I get consistent results. By manually fitting different models and looking at all 3 of these results, the validation curves, etc... I can decide what is the best model. I don't want to do this manually. I want this process to be automated. The grid search algorithm returns overfit models every time. I look forward to hearing some answers.
Another big issue is the difference between val and test sets. Since many problems face a time dependency issue, I'd like to know a reliable way to test the models performance as time goes on. It's crucial to split the data set by time, otherwise we are presenting data leakage. One method I'm familiar with is discriminative analysis (fitting a model to see if the model can predict which dataset the example came from: train val test). Another method is KS / KL tests and looking at the distribution of the target variable, or looping through each feature and comparing the distribution.
I agree with the comments that using the test set to choose hyperparameters obviates the need for the validation set (/folds), and makes the test set scores no longer representative of future performance. You fix that by "testing the model on a live feed," so that's fine.
I'll even give the scenario where I take out the test set - it's the same problem. The gridsearch algorithm picks the model with the highest validation score. It doesn't look at the difference between the train score and val score. The difference should be close to 0. A train score of 99% and a val score of 88% is not a good model, but grid search will take that over train score of 88% and val score of 87%. I would choose the second model.
Now this is something that's more understandable: there are reasons outside of raw performance to want the train/test score gap to be small. See e.g. https://datascience.stackexchange.com/q/66350/55122. And sklearn actually does accommodate this since v0.20: by using return_train_score=True and refit as a callable that consumes cv_results_ and returns the best index:
refit : bool, str, or callable, default=True
...
Where there are considerations other than maximum score in choosing a best estimator, refit can be set to a function which returns the selected best_index_ given cv_results_. In that case, the best_estimator_ and best_params_ will be set according to the returned best_index_ while the best_score_ attribute will not be available.
...
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html
Of course, that requires you can put your manual process of looking at scores and their differences down into a function, and probably doesn't admit anything like validation curves, but at least it's something.

python- Best techniques to split datase to get high performance accuracy

I have applied these 4 methods:
Train and Test Sets.
K-fold Cross Validation.
Leave One Out Cross
Validation. Repeated Random Test-Train Splits.
The method "Train and Test Sets" achieve high accuracy but the remaining methods achieve same accuracy but lower then first approach.
I want to know which method should I choose?
Each of Train and Test Sets and Cross Validation used in certain case,Cross Validation used if you want to compare different models.Accuracy always increase if you use bigger training data that's why sometimes Leave One Out Cross perform better than K-fold Cross Validation,it's depends on your dataset size and sometimes on algorithm you are using.On the other hand Train and Test Sets usually used if you aren't comparing diffrent models, and if the time requirements for running the cross validation aren't worth it,mean it's not needed to make Cross Validation in this case.In most cases Cross Validation is preferred,but, what method you should choose? this usually depend on your choices while training your data such way you handle data and algorithm such you are trainning data using Random Forests usually it's not needed to do Cross Validation but you can and do it in case need more you usually not doing Cross Validation in Random Forests when you use Out of Bag estimate .
Training a model comprises tuning model accuracy as well as model generalization. If model is not generalized it may be Underfit or Overfit model.
In this case, model may perform better on training data but accuracy may decrease on test or unknown data.
We use training data to improve the accuracy of model. As training data size increases model accuracy may also increase.
Similarly we use different training samples to generalize the model.
So Train-Test splitting methods depend on the size of available data and algorithm used for model design.
First train-test method has a fix size training and testing data. So on each iteration, we use same train data to train model and same test data for model's accuracy assessment.
Second k-fold method has fix size train and test data but on each iteration, test and train data changes. So it may be a better approach irrespective of data size.
Leave one out approach is useful only if data size is small. Here we use almost whole data for training purpose. So training accuracy of model will be better but may not be a generalized model.
Randomised Train-test method is also a good approach for training and testing model's performance. Here we randomly select train and test data each time. So it may perform better than Leave one out method if data size is small.
And last each splitting approach has some pros and cons. So it depends on you which splitting method is good to your model. It also depends on data size and data selection means how we are selecting data from sample while splitting.

Categories