Cross Validation in Scikit Learn - python

I have been using http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.cross_val_score.html
in order to cross validate a Logistic Regression classifier. The results I got are:
[0.78571429 0.64285714 0.85714286 0.71428571 0.78571429 0.64285714 0.84615385 0.53846154 0.76923077 0.66666667]
My primary question is how I could find which set/fold maximises my classifier's score and produces 0.857.
Follow-up question: Is training my classifier with this set a good practice?
Thank you in advance.

whether and how I could find which set/fold maximises my classifier's score
From the documentation of cross_val_score, you can see that it operates on a specific cv object. (If you do not pass one explicitly, a default is chosen - KFold or StratifiedKFold, depending on the estimator and the problem - refer to the documentation there.)
You can iterate over this object (or an identical one) to find the exact train/test indices. E.g.:
from sklearn.model_selection import KFold
for tr, te in KFold(n_splits=10).split(X):
    # tr, te in each iteration correspond to the fold that gave you each of the scores you saw.
    ...
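For example, here is a minimal sketch (X and y are made-up stand-ins for your data) of recovering which fold produced the 0.857 score by passing the same cv object to cross_val_score:
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
X, y = make_classification(n_samples=140, random_state=0)  # stand-in data
cv = KFold(n_splits=10)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
best_fold = int(np.argmax(scores))                  # index of the highest score
train_idx, test_idx = list(cv.split(X))[best_fold]  # the train/test indices behind that score
print(scores[best_fold], len(train_idx), len(test_idx))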
whether training my classifier with this set is a good practice.
Absolutely not!
The only legitimate uses of cross validation are things like assessing overall performance, choosing between different models, or tuning model parameters.
Once you are committed to a model, you should train it over the entire training set. It is completely wrong to train it over the subset which happened to give the best score.
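Continuing the sketch above, the sound workflow is: report the cross-validation average as your performance estimate, then refit a single model on all of the training data.
final_model = LogisticRegression(max_iter=1000)
print("estimated accuracy:", scores.mean())  # report the CV average, not the best fold
final_model.fit(X, y)                         # this is the model you actually use going forward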


GridSearchCV does not improve my test accuracy

I am making multiple classifier models and the test accuracy for all of them is 0.508.
I find it weird that multiple models have the same accuracy. The models I used are LogisticRegression, DecisionTreeClassifier, MLPClassifier, RandomForestClassifier, BaggingClassifier, AdaBoostClassifier, XGBClassifier, SVC, and VotingClassifier.
After using GridSearchCV to tune the models, all of their best cross-validation scores improved, but the test accuracy scores did not change.
I wish I could say I changed something, but I don't know why the test scores did not change. After using grid search, I expected the test scores to improve, but they didn't.
Just to confirm: do you mean your training scores improved but your test scores did not change? If so, there are many possible reasons.
You might want to reconfigure and widen your hyperparameter ranges; for example, with KNN you can increase the number of neighbors k or add more distance metrics (see the sketch below).
If you want, you can also change the hyperparameter optimization technique, e.g. randomized search or Bayesian search.
I don't have any information about your data, but sometimes turning shuffling on or off when splitting can affect the scores; for instance, with time series data you must not shuffle the dataset.
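A minimal sketch of the first two suggestions (the data and ranges here are hypothetical): widen the KNN search space and let RandomizedSearchCV sample from it instead of running an exhaustive grid.
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
X, y = make_classification(n_samples=1000, random_state=0)  # stand-in for your data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
param_distributions = {
    "n_neighbors": randint(1, 50),                      # wider range of k
    "weights": ["uniform", "distance"],
    "metric": ["euclidean", "manhattan", "minkowski"],  # more distance metrics
}
search = RandomizedSearchCV(KNeighborsClassifier(), param_distributions,
                            n_iter=30, cv=5, random_state=0)
search.fit(X_train, y_train)
print(search.best_params_, "test accuracy:", search.score(X_test, y_test))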
There can be several reasons why the test accuracy didn't change after using GridSearchCV:
The best parameters found by GridSearchCV might not be optimal for the test data.
The test data may have a different distribution than the training data, leading to low test accuracy.
The models might be overfitting to the training data and not generalizing well to the test data.
The test data size might be small, leading to high variance in test accuracy scores.
The problem itself might be challenging, and a test accuracy of 0.508 might be the best that can be achieved with the current models and data.
It would be useful to have more information about the data, the problem, and the experimental setup to diagnose the issue further.
Looking at your accuracy, first of all I would ask: are you performing a binary classification task? Because if that is the case, your models are barely better than random on the test set, which may suggest that something is wrong with your training.
Otherwise, GridSearchCV, like RandomizedSearchCV and other hyperparameter optimization techniques, tries to find optimal parameters within a range that you define. If, after optimization, the optimal value of a parameter sits on one bound of your range, it may suggest that you need to explore beyond this bound, that is to say deliberately set another range and run the optimization again (see the sketch below).
By the way, I don't know the size of your dataset, but if it is big I would recommend using RandomizedSearchCV instead of GridSearchCV. Since it is not exhaustive, it takes less time and gives results that are (nearly) optimal.
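For instance, a minimal sketch (hypothetical grid; `search` stands for an already-fitted GridSearchCV or RandomizedSearchCV) of checking whether a numeric hyperparameter landed on a bound of its searched range:
grid = {"n_neighbors": [1, 3, 5, 7, 9], "weights": ["uniform", "distance"]}
for name, values in grid.items():
    best = search.best_params_[name]
    if isinstance(best, (int, float)) and best in (min(values), max(values)):
        print(f"{name}={best} sits on the edge of its range; consider widening it.")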

Types of Training vs Test Error for Random Forest Classification Algorithm (Assessing Variance)

I have 2 questions that I would like to ascertain if possible (questions are bolded):
I've recently understood (I hope) the random forest classification algorithm, and have tried to apply it using sklearn on Python on a rather large dataset of pixels derived from satellite images (with the features being the different bands, and the labels being specific features that I outlined by myself, i.e., vegetation, cloud, etc). I then wanted to understand if the model was experiencing a variance problem, and so the first thought that came to my mind was to compare between the training and testing data.
Now this is where the confusion kicks in for me - I understand that there have been many different posts about:
How CV error should/should not be used compared to the out of bag (OOB) error
How by design, the training error of a random forest classifier is almost always ~0 (i.e., fitting my model on the training data and using it to predict on the same set of training data) - seems to be the case regardless of the tree depth
Regarding point 2, it seems that I can never compare my training and test error as the former will always be low, and so I decided to use the OOB error as my 'representative' training error for the entire model. I then realized that the OOB error might be a pseudo test error as it essentially tests trees on points that they did not specifically learn (in the case of bootstrapped trees), and so I defaulted to CV error being my new 'representative' training error for the entire model.
Looking back at the usage of CV error, I initially used it for hyperparameter tuning (e.g., max tree depth, number of trees, criterion type, etc), and so I was again doubting myself if I should use it as my official training error to be compared against my test error.
What makes this worse is that it's hard for me to validate what I think is true based on posts across the web, because each answer covers only a small part and they might contradict each other. So would anyone kindly help me with my predicament on what to use as my official training error to be compared to my test error?
My second question revolves around how the OOB error might be a pseudo test error, since it is based on data points not selected during bootstrapping. If that were true, would it be fair to say this does not hold if bootstrapping is disabled (the algorithm is technically still a random forest as features are still randomly subsampled for each tree; it's just that the correlation between trees is probably higher)?
Thank you!!!!
Generally, you want to distinctly break a dataset into training, validation, and test. Training is data fed into the model, validation is to monitor progress of the model as it learns, and test data is to see how well your model is generalizing to unseen data. As you've discovered, depending on the application and the algorithm, you can mix up training and validation data or even forgo validation data entirely. For random forest, if you want to forgo having a distinct validation set and just use OOB to monitor progress, that is fine. If you have enough data, I think it still makes sense to have a distinct validation set. No matter what, you should still reserve some data for testing. Depending on your data, you may even need to be careful about how you split up the data (e.g. if there's unevenness in the labels).
As to your second point about comparing training and test sets, I think you may be confused. The test set is really all you care about. You can compare the two to see if you're overfitting, so that you can change hyperparameters to generalize more, but otherwise the whole point is that the test set is the sole truthful evaluation. If you have a really small dataset, you may need to bootstrap a number of models with a CV scheme like stratified CV to generate a more accurate test evaluation.
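To make the OOB / CV / held-out test comparison from the question concrete, here is a minimal sketch with synthetic stand-in data:
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split
X, y = make_classification(n_samples=2000, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)
rf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
rf.fit(X_train, y_train)
print("train accuracy:", rf.score(X_train, y_train))  # typically close to 1.0
print("OOB accuracy:  ", rf.oob_score_)               # the pseudo test error
print("CV accuracy:   ", cross_val_score(RandomForestClassifier(
    n_estimators=200, random_state=0), X_train, y_train, cv=5).mean())
print("test accuracy: ", rf.score(X_test, y_test))
# Note: oob_score=True requires bootstrap=True; with bootstrap=False there are
# no out-of-bag samples, so the OOB estimate is unavailable.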

Classification: Tweet Sentiment Analysis - Order of steps

I am currently working on a tweet sentiment analysis and have a few questions regarding the right order of the steps. Please assume that the data was already preprocessed and prepared accordingly. So this is how I would proceed:
use train_test_split (80:20 ratio) to withhold a test data set.
vectorize x_train since the tweets are not numerical.
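A minimal sketch of this step, assuming a TF-IDF representation (x_train/x_test are the tweet splits from train_test_split; the name train_tf matches the variable used in the grid search code below):
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()
train_tf = vectorizer.fit_transform(x_train)  # fit only on the training tweets
test_tf = vectorizer.transform(x_test)        # reuse the fitted vocabulary, no refit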
In the next steps, I would like to identify the best classifier. Please assume those were already imported. So I would go on by:
hyperparameterization (grid-search) including a cross-validation approach.
In this step, I would like to identify the best parameters of each classifier. For KNN the code is as follows:
from sklearn.model_selection import GridSearchCV, RepeatedStratifiedKFold
from sklearn.neighbors import KNeighborsClassifier

model = KNeighborsClassifier()
n_neighbors = range(1, 10, 2)
weights = ['uniform', 'distance']
metric = ['euclidean', 'manhattan', 'minkowski']
# define grid search
grid = dict(n_neighbors=n_neighbors, weights=weights, metric=metric)
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
grid_search = GridSearchCV(estimator=model, param_grid=grid, n_jobs=-1, cv=cv, scoring='accuracy', error_score=0)
grid_result = grid_search.fit(train_tf, y_train)
# summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))
compare the accuracy of the classifiers (each with its best hyperparameters)
choose the best classifier
take the withheld test data set (from train_test_split()) and use the best classifier on the test data
Is this the right approach or would you recommend changing something (e.g. doing the cross-validation alone and not within the hyperparameter tuning)? Does it make sense to evaluate on the test data only as the final step, or should I do it earlier to assess the accuracy on an unknown data set?
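A minimal sketch of the final step, assuming test_tf is the withheld test set transformed with the vectorizer fitted on the training tweets, y_test the withheld labels, and grid_result the fitted GridSearchCV from above:
from sklearn.metrics import accuracy_score
best_clf = grid_result.best_estimator_        # already refit on the full training split
test_predictions = best_clf.predict(test_tf)
print("Test accuracy:", accuracy_score(y_test, test_predictions))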
There are lots of ways to do this and people have strong opinions about it and I'm not always convinced they fully understand what they advocate.
TL;DR: Your methodology looks great and you're asking sensible questions.
Having said that, here are some things to consider:
Why are you doing train-test split validation?
Why are you doing hyperparameter tuning?
Why are you doing cross-validation?
Yes, each of these techniques are good at doing something specific; but that doesn't necessarily mean they should all be part of the same pipeline.
First off, let's answer these questions:
Train-Test Split is useful for testing your classifier's inference abilities. In other words, we want to know how well a classifier performs in general (not on the data we used for training). The test portion allows us to evaluate our classifier without using our training portion.
Hyperparameter-Tuning is useful for evaluating the effect of hyperparameters on the performance of a classifier. For it to be meaningful, we must compare two (or more) models (using different hyperparameters) but trained preferably using the same training portion (to eliminate selection bias). What do we do once we know the best performing hyperparameters? Will this set of hyperparameters always perform optimally? No. You will see that, due to the stochastic nature of classification, one hyperparameter set may work best in experiment A then another set of hyperparameters may work best on experiment B. Rather, hyperparameter tuning is good for generalizing about which hyperparameters to use when building a classifier.
Cross-validation is used to smooth out some of the stochastic randomness associated with building classifiers. So, a machine learning pipeline may produce a classifier that is 94% accurate using 1 test-fold and 83% accuracy using another test-fold. What does it mean? It might mean that 1 fold contains samples that are easy. Or it might mean that the classifier, for whatever reason, is actually better. You don't know because it's a black box.
Practically, how is this helpful?
I see little value in using test-train split and cross-validation. I use cross-validation and report accuracy as an average over the n-folds. It is already testing my classifier's performance. I don't see why dividing your training data further to do another round of train-test validation is going to help. Use the average. Having said that, I use the best performing model of the n-fold models created during cross-validation as my final model. As I said, it's black-box, so we can't know which model is best but, all else being equal, you may as well use the best performing one. It might actually be better.
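One way to keep the per-fold models mentioned above is cross_validate with return_estimator=True; a minimal sketch with synthetic stand-in data (not necessarily the exact setup the answer has in mind):
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
X, y = make_classification(n_samples=500, random_state=0)
cv_results = cross_validate(LogisticRegression(max_iter=1000), X, y,
                            cv=5, return_estimator=True)
print("mean accuracy:", cv_results["test_score"].mean())
best_fold = np.argmax(cv_results["test_score"])
best_model = cv_results["estimator"][best_fold]  # the best-scoring fold's model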
Hyperparameter-tuning is useful but it can take forever to do extensive tuning. I suggest adding hyperparameter tuning to your pipeline but only test 2 sets of hyperparameters. So, keep all your hyperparameters constant except 1. e.g. Batch size = {64, 128}. Run that, and you'll be able to say with confidence, "Oh, that made a big difference: 64 works better than 128!" or "Well, that was a waste of time. It didn't make much difference either way." If the difference is small, ignore that hyperparameter and try another pair. This way, you'll slowly tack towards optimal without all the wasted time.
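A minimal sketch of that "change one hyperparameter at a time" idea, reusing the KNN setup and the train_tf/y_train names from the question (the pair of values is hypothetical):
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
for k in (3, 9):  # hypothetical pair of values to compare
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k),
                             train_tf, y_train, cv=5)
    print(f"n_neighbors={k}: mean accuracy {scores.mean():.3f}")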
In practice, I'd say leave the extensive hyperparameter-tuning to academics and take a more pragmatic approach.
But yeah, your methodology looks good as it is. I think you're thinking about what you're doing, and that already puts you a step ahead of the pack.

Get individual models and customized score in GridSearchCV and RandomizedSearchCV [duplicate]

This question already has an answer here:
Retrieving specific classifiers and data from GridSearchCV
GridSearchCV and RandomizedSearchCV have a best_estimator_ that:
Returns only the best estimator/model
Finds the best estimator via one of the simple scoring methods: accuracy, recall, precision, etc.
Evaluates based on the training set only
I would like to go beyond those limitations with:
My own definition of scoring methods
Evaluation on the test set rather than on the training data as done by GridSearchCV. Ultimately it's the test set performance that counts; the training set tends to give almost perfect accuracy in my grid search.
I was thinking of achieving it by :
Get the individual estimators/models in GridSearchCV and RandomizedSearchCV
With every estimator/model, predict on test set and evaluate with my customized score
My question is:
Is there a way to get all individual models from GridSearchCV?
If not, how would you suggest achieving the same thing? Initially I wanted to exploit the existing GridSearchCV because it automatically handles multiple parameter grids, CV and multi-threading. Any other recommendation to achieve a similar result is welcome.
You can use custom scoring methods already in the XYZSearchCVs: see the scoring parameter and the documentation's links to the User Guide for how to write a custom scorer.
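A minimal sketch of a custom scorer passed to a search (the metric here, recall of the positive class, and the SVC grid are just illustrative choices):
from sklearn.metrics import make_scorer, recall_score
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
def positive_recall(y_true, y_pred):
    return recall_score(y_true, y_pred, pos_label=1)
search = GridSearchCV(SVC(), {"C": [0.1, 1, 10]},
                      scoring=make_scorer(positive_recall), cv=5)
# search.fit(X_train, y_train)  # X_train/y_train are your own data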
You can use a fixed train/validation split to evaluate the hyperparameters (see the cv parameter), but this will be less robust than a k-fold cross-validation. The test set should be reserved for scoring only the final model; if you use it to select hyperparameters, then the scores you receive will not be unbiased estimates of future performance.
There is no easy way to retrieve all the models built by GridSearchCV. (It would generally be a lot of models, and saving them all would generally be a waste of memory.)
The parallelization and parameter grid parts of GridSearchCV are surprisingly simple; if you need to, you can copy out the relevant parts of the source code to produce your own approach.
Training set tends to give almost perfect accuracy on my Grid Search.
That's a bit surprising, since the CV part of the searches means the models are being scored on unseen data. If you get very high best_score_ but low performance on the test set, then I would suspect your training set is not actually a representative sample, and that'll require a much more nuanced understanding of the situation.

How to properly select the best model in GridSearchCV - both sklearn and caret do it wrong

Consider 3 data sets: train/val/test. Sklearn's GridSearchCV by default chooses the best model with the highest cross-validation score. In a real-world setting where the predictions need to be accurate, this is a horrible approach to choosing the best model. The reason is that this is how it's supposed to be used:
-Train set for the model to learn the dataset
-Val set to validate what the model has learned in the train set and update parameters/hyperparameters to maximize the validation score.
-Test set - to test your data on unseen data.
-Finally use the model in a live setting and log the results to see if they are good enough to make decisions. It's surprising that many data scientists impulsively use their trained model in production based only on selecting the model with the highest validation score. I find that grid search chooses models that are painfully overfit and do a worse job of predicting unseen data than the default parameters.
My approaches:
-Manually train the models and look at the results for each model (in a sort of a loop, but not very efficient). It's very manual and time consuming, but I get significantly better results than grid search. I want this to be completely automated.
-Plot the validation curve for each hyperparameter I want to choose, and then pick the hyperparameter that shows the smallest difference between train and val set while maximizing both (i.e. train=98%, val = 78% is really bad, but train=72%, val=70% is acceptable).
Like I said, I want a better (automated) method for choosing the best model.
What kind of answer I'm looking for:
I want to maximize the score in the train and validation set, while minimizing the score difference between the train and val sets. Consider the following example from a grid search algorithm:
There are two models:
Model A: train score = 99%, val score = 89%
Model B: train score = 80%, val score = 79%
Model B is a much more reliable model and I would choose Model B over Model A any day. It is less overfit and the predictions are consistent. We know what to expect. However, grid search will choose Model A since the val score is higher. I find this to be a common problem and haven't found any solution anywhere on the internet. People tend to be so focused on what they learned in school that they don't actually think about the consequences of choosing an overfit model. I see redundant posts about how to use sklearn's and caret's grid search packages and have them choose the model for you, but not about how to actually choose the best model.
My approach so far has been very manual. I want an automated way of doing this.
What I do currently is this:
gs = GridSearchCV(model, params, cv=3).fit(X_train, y_train)  # X_train and y_train include the validation data if you do it this way, since GridSearchCV already creates its own CV splits.
final_model = gs.best_estimator_
train_predictions = final_model.predict(X_train)
val_predictions = final_model.predict(X_val)
test_predictions = final_model.predict(X_test)
print('Train Score:', accuracy_score(train_predictions, y_train)) # .99
print('Val Score:', accuracy_score(val_predictions, y_val)) # .89
print('Test Score:', accuracy_score(test_predictions, y_test)) # .8
If I see something like above I'll rule out that model and try different hyperparameters until I get consistent results. By manually fitting different models and looking at all 3 of these results, the validation curves, etc... I can decide what is the best model. I don't want to do this manually. I want this process to be automated. The grid search algorithm returns overfit models every time. I look forward to hearing some answers.
Another big issue is the difference between the val and test sets. Since many problems have a time dependency, I'd like to know a reliable way to test the model's performance as time goes on. It's crucial to split the data set by time, otherwise we are introducing data leakage. One method I'm familiar with is discriminative analysis (fitting a model to see if it can predict which dataset an example came from: train, val, or test). Another method is KS/KL tests: looking at the distribution of the target variable, or looping through each feature and comparing the distributions.
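A minimal sketch of that KS-test idea, assuming X_train and X_test are pandas DataFrames with the same columns:
from scipy.stats import ks_2samp
for col in X_train.columns:
    stat, p_value = ks_2samp(X_train[col], X_test[col])
    if p_value < 0.05:
        print(f"{col}: distributions differ (KS statistic={stat:.3f}, p={p_value:.3g})")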
I agree with the comments that using the test set to choose hyperparameters obviates the need for the validation set (/folds), and makes the test set scores no longer representative of future performance. You fix that by "testing the model on a live feed," so that's fine.
I'll even give the scenario where I take out the test set - it's the same problem. The gridsearch algorithm picks the model with the highest validation score. It doesn't look at the difference between the train score and val score. The difference should be close to 0. A train score of 99% and a val score of 88% is not a good model, but grid search will take that over train score of 88% and val score of 87%. I would choose the second model.
Now this is something that's more understandable: there are reasons outside of raw performance to want the train/test score gap to be small. See e.g. https://datascience.stackexchange.com/q/66350/55122. And sklearn actually does accommodate this since v0.20: by using return_train_score=True and refit as a callable that consumes cv_results_ and returns the best index:
refit : bool, str, or callable, default=True
...
Where there are considerations other than maximum score in choosing a best estimator, refit can be set to a function which returns the selected best_index_ given cv_results_. In that case, the best_estimator_ and best_params_ will be set according to the returned best_index_ while the best_score_ attribute will not be available.
...
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html
Of course, that requires that you can distill your manual process of looking at scores and their differences into a function, and it probably doesn't accommodate anything like validation curves, but at least it's something.
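A minimal sketch of such a refit callable (the threshold and the estimator/grid are hypothetical): among candidates whose train/validation gap is at most 10 points, pick the highest mean validation score.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
def small_gap_best_index(cv_results):
    val = np.asarray(cv_results["mean_test_score"])
    train = np.asarray(cv_results["mean_train_score"])  # needs return_train_score=True
    gap = train - val
    ok = gap <= 0.10                  # hypothetical acceptable overfitting gap
    if not ok.any():                  # fall back to the candidate with the smallest gap
        return int(np.argmin(gap))
    return int(np.argmax(np.where(ok, val, -np.inf)))
gs = GridSearchCV(RandomForestClassifier(random_state=0),
                  {"max_depth": [2, 5, 10, None]},
                  cv=3, return_train_score=True, refit=small_gap_best_index)
# gs.fit(X_train, y_train)  # best_estimator_ will then honor the gap constraint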
