GridSearchCV does not improve my test accuracy

I am building several classifier models, and the test accuracy for all of them is 0.508.
I find it odd that different models have exactly the same accuracy. The models I used are LogisticRegression, DecisionTreeClassifier, MLPClassifier, RandomForestClassifier, BaggingClassifier, AdaBoostClassifier, XGBClassifier, SVC, and VotingClassifier.
After using GridSearchCV to tune the models, all of their training accuracy scores improved, but the test accuracy scores did not change.
I wish I could say I changed something, but I don't know why the test scores stayed the same. After the grid search I expected the test scores to improve, but they didn't.

I would like to confirm: do you mean your training scores improved but your test scores did not change? If yes, there are many possible reasons behind this.
You might want to reconfigure and widen your hyperparameter ranges. For example, if you are using KNN, you can increase the range of k or add more distance metrics.
If you want to, you can change the hyperparameter optimization technique, for example to randomized search or Bayesian search.
I don't have any information about your data, but turning shuffling on or off when splitting can sometimes affect the scores. For instance, if you have time series data, you should not shuffle the dataset.
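
As a rough sketch of the two suggestions above (a wider KNN search space explored with randomized search, and an explicit shuffle setting in the split), something like the following could be tried. The synthetic data, the ranges and the number of iterations are assumptions chosen for illustration, not values from the question.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (assumption); replace with your own X and y.
X, y = make_classification(n_samples=400, n_features=10, random_state=0)

# shuffle=True is the default; pass shuffle=False for ordered data such as time series.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, shuffle=True, random_state=0
)

# A wider search space than a small hand-written grid (ranges are illustrative).
param_distributions = {
    "n_neighbors": randint(1, 50),  # integers from 1 to 49
    "weights": ["uniform", "distance"],
    "metric": ["euclidean", "manhattan", "chebyshev"],
}

search = RandomizedSearchCV(
    KNeighborsClassifier(),
    param_distributions=param_distributions,
    n_iter=20,  # number of sampled parameter combinations
    cv=5,
    scoring="accuracy",
    random_state=0,
)
search.fit(X_train, y_train)

test_accuracy = search.score(X_test, y_test)
print("best params:", search.best_params_)
print("CV accuracy:", round(search.best_score_, 3))
print("test accuracy:", round(test_accuracy, 3))
```

If the cross-validated accuracy moves with the search but the test accuracy does not, the split or the data is a more likely culprit than the hyperparameters.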

There can be several reasons why the test accuracy didn't change after using GridSearchCV:
The best parameters found by GridSearchCV might not be optimal for the test data.
The test data may have a different distribution than the training data, leading to low test accuracy.
The models might be overfitting to the training data and not generalizing well to the test data.
The test data size might be small, leading to high variance in test accuracy scores.
The problem itself might be challenging, and a test accuracy of 0.508 might be the best that can be achieved with the current models and data.
It would be useful to have more information about the data, the problem, and the experimental setup to diagnose the issue further.
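
One way to start diagnosing these possibilities is to put the training, cross-validation and test scores next to a trivial baseline. The sketch below uses synthetic data and a logistic regression purely as placeholders (both are assumptions); the point is the comparison, not the particular model.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic, roughly balanced binary problem (assumption); use your own data instead.
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=5, n_redundant=0,
    n_clusters_per_class=1, class_sep=1.5, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Baseline that always predicts the most frequent training class.
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

scores = {
    "baseline_test": baseline.score(X_test, y_test),
    "train": model.score(X_train, y_train),
    "cv": cross_val_score(
        LogisticRegression(max_iter=1000), X_train, y_train, cv=5
    ).mean(),
    "test": model.score(X_test, y_test),
}
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

A test score close to the baseline for every model, while the training and CV scores are clearly higher, points towards a problem with the test data or the split rather than with any single model.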

Looking at your accuracy, first of all I would ask: are you performing a binary classification task? If so, your models are barely better than random on the test set, which may suggest that something is wrong with your training.
Otherwise, GridSearchCV, like RandomizedSearchCV and other hyperparameter optimization techniques, tries to find optimal parameters within a range that you define. If, after optimization, your optimal parameter sits on one bound of your range, it may suggest that you need to explore beyond this bound, that is to say, deliberately set another range and run the optimization again.
By the way, I don't know the size of your dataset, but if it is big I would recommend RandomizedSearchCV instead of GridSearchCV. As it is not exhaustive, it takes less time and gives results that are (nearly) optimized.
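
The "best value sits on a bound of the range" check can be automated with a small helper. Note that `params_on_grid_edge` below is a hypothetical function written for this illustration and is not part of scikit-learn; the decision tree, the grid and the synthetic data are likewise assumptions.

```python
import numbers

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier


def params_on_grid_edge(best_params, param_grid):
    """Hypothetical helper: names of numeric parameters whose best value is
    the smallest or largest candidate in the grid."""
    on_edge = []
    for name, candidates in param_grid.items():
        is_numeric = all(
            isinstance(c, numbers.Real) and not isinstance(c, bool)
            for c in candidates
        )
        # With fewer than three candidates every value is on an edge, so skip.
        if not is_numeric or len(candidates) < 3:
            continue
        if best_params[name] in (min(candidates), max(candidates)):
            on_edge.append(name)
    return on_edge


# Synthetic stand-in data (assumption).
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
param_grid = {"max_depth": [1, 2, 3], "min_samples_leaf": [1, 5, 10]}

search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=5)
search.fit(X, y)

edge_params = params_on_grid_edge(search.best_params_, param_grid)
print("best params:", search.best_params_)
print("on the edge of the grid:", edge_params)
```

Any parameter reported here is a candidate for a second search with the range extended in that direction.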

Related

Imbalanced data set score after SMOTE

Is it correct to use 'accuracy' as a metric for an imbalanced data set after using oversampling methods such as SMOTE, or do we have to use other metrics such as AUROC or other precision-recall related metrics?

You can use accuracy for the dataset after using SMOTE, since as far as I know it should no longer be imbalanced. You should try the other metrics as well for a more detailed evaluation (classification_report_imbalanced combines some metrics).

SMOTE and similar imbalance treatment techniques will only be applied to your training data. When you have a largely imbalanced data set, say 99% against 1%, accuracy on the TEST set might still give you a value of 99% by always choosing the larger class.
Therefore, you should definitely switch to another metric.
A popular choice is the F1 score, but there is also a balanced version of accuracy; see the scikit-learn balanced accuracy page.
As mentioned by @Nocry, applying several evaluation measures might give you a better feeling. For example, check how accuracy (the regular variant) and balanced accuracy perform with and without SMOTE; then you should see the difference.
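
The 99% against 1% case can be reproduced with a few lines. The labels below are made up for illustration (an assumption, not data from the question), and the "model" simply predicts the majority class for every test sample.

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score, f1_score

# Made-up, untouched test set: 990 samples of class 0 and 10 of class 1.
y_test = np.array([0] * 990 + [1] * 10)

# A degenerate classifier that always predicts the majority class.
y_pred = np.zeros_like(y_test)

acc = accuracy_score(y_test, y_pred)           # 990 correct out of 1000
bal_acc = balanced_accuracy_score(y_test, y_pred)  # mean of per-class recall
f1 = f1_score(y_test, y_pred, zero_division=0)     # F1 for the minority class

print(f"accuracy:          {acc:.2f}")      # 0.99
print(f"balanced accuracy: {bal_acc:.2f}")  # 0.50
print(f"F1 (class 1):      {f1:.2f}")       # 0.00
```

Plain accuracy rewards the useless classifier, while balanced accuracy and F1 expose it, which is why they are the safer choice on an imbalanced test set.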

Classification: Tweet Sentiment Analysis - Order of steps

I am currently working on a tweet sentiment analysis and have a few questions regarding the right order of the steps. Please assume that the data was already preprocessed and prepared accordingly. So this is how I would proceed:
use train_test_split (80:20 ratio) to withhold a test data set.
vectorize x_train since the tweets are not numerical.
In the next steps, I would like to identify the best classifier. Please assume those were already imported. So I would go on by:
hyperparameter tuning (grid search) including a cross-validation approach. In this step, I would like to identify the best parameters of each classifier. For KNN the code is as follows:
from sklearn.model_selection import GridSearchCV, RepeatedStratifiedKFold
from sklearn.neighbors import KNeighborsClassifier

model = KNeighborsClassifier()
n_neighbors = range(1, 10, 2)
weights = ['uniform', 'distance']
# note: 'minkowski' with the default p=2 is the same as 'euclidean'
metric = ['euclidean', 'manhattan', 'minkowski']
# define grid search
grid = dict(n_neighbors=n_neighbors, weights=weights, metric=metric)
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
grid_search = GridSearchCV(estimator=model, param_grid=grid, n_jobs=-1, cv=cv, scoring='accuracy', error_score=0)
# train_tf is the vectorized x_train
grid_result = grid_search.fit(train_tf, y_train)
# summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))
compare the accuracy (depending on the best hyperparameters) of the classifiers
choose the best classifier
take the withheld test data set (from train_test_split()) and use the best classifier on the test data
Is this the right approach or would you recommend changing something (e. g. doing the cross-validation alone and not within the hyperparametrization)? Does it make sense to test the test data as the final step or should I do it earlier to assess the accuracy for an unknown data set?

There are lots of ways to do this and people have strong opinions about it and I'm not always convinced they fully understand what they advocate.
TL;DR: Your methodology looks great and you're asking sensible questions.
Having said that, here are some things to consider:
Why are you doing train-test split validation?
Why are you doing hyperparameter tuning?
Why are you doing cross-validation?
Yes, each of these techniques is good at doing something specific, but that doesn't necessarily mean they should all be part of the same pipeline.
First off, let's answer these questions:
Train-Test Split is useful for testing your classifier's inference abilities. In other words, we want to know how well a classifier performs in general (not on the data we used for training). The test portion allows us to evaluate our classifier without using our training portion.
Hyperparameter-Tuning is useful for evaluating the effect of hyperparameters on the performance of a classifier. For it to be meaningful, we must compare two (or more) models (using different hyperparameters) but trained preferably using the same training portion (to eliminate selection bias). What do we do once we know the best performing hyperparameters? Will this set of hyperparameters always perform optimally? No. You will see that, due to the stochastic nature of classification, one hyperparameter set may work best in experiment A then another set of hyperparameters may work best on experiment B. Rather, hyperparameter tuning is good for generalizing about which hyperparameters to use when building a classifier.
Cross-validation is used to smooth out some of the stochastic randomness associated with building classifiers. So, a machine learning pipeline may produce a classifier that is 94% accurate using 1 test-fold and 83% accuracy using another test-fold. What does it mean? It might mean that 1 fold contains samples that are easy. Or it might mean that the classifier, for whatever reason, is actually better. You don't know because it's a black box.
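
To see the fold-to-fold variation described above, the per-fold scores can be printed rather than only their mean. This is a minimal sketch on synthetic data with a logistic regression as a placeholder model (both are assumptions).

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in data (assumption).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
fold_scores = cross_val_score(
    LogisticRegression(max_iter=1000), X, y, cv=cv, scoring="accuracy"
)

# One accuracy per held-out fold: same pipeline, different data partition.
print("per-fold accuracy:", fold_scores.round(3))
print("mean: %.3f, std: %.3f" % (fold_scores.mean(), fold_scores.std()))
```

The spread between the best and worst fold gives a feeling for how much a single train-test split could mislead you.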
Practically, how is this helpful?
I see little value in using test-train split and cross-validation. I use cross-validation and report accuracy as an average over the n-folds. It is already testing my classifier's performance. I don't see why dividing your training data further to do another round of train-test validation is going to help. Use the average. Having said that, I use the best performing model of the n-fold models created during cross-validation as my final model. As I said, it's black-box, so we can't know which model is best but, all else being equal, you may as well use the best performing one. It might actually be better.
Hyperparameter-tuning is useful but it can take forever to do extensive tuning. I suggest adding hyperparameter tuning to your pipeline but only test 2 sets of hyperparameters. So, keep all your hyperparameters constant except 1. e.g. Batch size = {64, 128}. Run that, and you'll be able to say with confidence, "Oh, that made a big difference: 64 works better than 128!" or "Well, that was a waste of time. It didn't make much difference either way." If the difference is small, ignore that hyperparameter and try another pair. This way, you'll slowly tack towards optimal without all the wasted time.
In practice, I'd say leave the extensive hyperparameter-tuning to academics and take a more pragmatic approach.
But yeah, your methodology looks good as it is. I think you're thinking about what you're doing, and that already puts you a step ahead of the pack.
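
For reference, the order of steps from the question (split, vectorize, grid search with cross-validation, one final evaluation on the withheld test set) can be sketched as below. The toy tweets and labels are invented for illustration, and wrapping the vectorizer in a Pipeline is an assumption on my part: it makes the vectorizer refit on the training part of every CV fold, so the validation folds do not influence the vocabulary.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

# Invented toy data (assumption): 10 positive and 10 negative "tweets".
positive = [
    "I love this phone", "great service today", "what a wonderful movie",
    "best coffee ever", "so happy with my order", "amazing concert last night",
    "really enjoyed the game", "fantastic support team", "love the new update",
    "great day at the beach",
]
negative = [
    "I hate this phone", "terrible service today", "what an awful movie",
    "worst coffee ever", "so angry about my order", "boring concert last night",
    "really disliked the game", "useless support team", "hate the new update",
    "horrible day at the office",
]
tweets = positive + negative
labels = [1] * len(positive) + [0] * len(negative)

# Step 1: withhold a test set (80:20).
X_train, X_test, y_train, y_test = train_test_split(
    tweets, labels, test_size=0.2, stratify=labels, random_state=0
)

# Steps 2 and 3: vectorizer and classifier tuned together with cross-validation.
pipe = Pipeline([("tfidf", TfidfVectorizer()), ("knn", KNeighborsClassifier())])
param_grid = {"knn__n_neighbors": [1, 3, 5], "knn__weights": ["uniform", "distance"]}
cv = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)
search = GridSearchCV(pipe, param_grid, cv=cv, scoring="accuracy")
search.fit(X_train, y_train)

# Final step: touch the withheld test set exactly once.
test_accuracy = search.score(X_test, y_test)
print("best params:", search.best_params_)
print("CV accuracy:", round(search.best_score_, 3))
print("test accuracy:", round(test_accuracy, 3))
```

The same pattern applies to each candidate classifier: compare them on `best_score_`, and score only the chosen one on the test set.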

Get individual models and customized score in GridSearchCV and RandomizedSearchCV [duplicate]

This question already has an answer here:
Retrieving specific classifiers and data from GridSearchCV
(1 answer)
Closed 2 years ago.
GridSearchCV and RandomizedSearchCV have a best_estimator_ attribute, which:
Returns only the best estimator/model
Finds the best estimator via one of the simple scoring methods: accuracy, recall, precision, etc.
Evaluates based on training sets only
I would like to overcome those limitations with:
My own definition of scoring methods
Further evaluation on the test set rather than on the training set as done by GridSearchCV. Eventually it's the test set performance that counts. Training set tends to give almost perfect accuracy on my Grid Search.
I was thinking of achieving it by:
Getting the individual estimators/models in GridSearchCV and RandomizedSearchCV
With every estimator/model, predicting on the test set and evaluating with my customized score
My questions are:
Is there a way to get all individual models from GridSearchCV?
If not, what are your thoughts on achieving the same thing? Initially I wanted to use the existing GridSearchCV because it automatically handles multiple parameter grids, CV and multi-threading. Any other recommendation to achieve a similar result is welcome.

You can use custom scoring methods already in the XYZSearchCVs: see the scoring parameter and the documentation's links to the User Guide for how to write a custom scorer.
You can use a fixed train/validation split to evaluate the hyperparameters (see the cv parameter), but this will be less robust than a k-fold cross-validation. The test set should be reserved for scoring only the final model; if you use it to select hyperparameters, then the scores you receive will not be unbiased estimates of future performance.
There is no easy way to retrieve all the models built by GridSearchCV. (It would generally be a lot of models, and saving them all would generally be a waste of memory.)
The parallelization and parameter grid parts of GridSearchCV are surprisingly simple; if you need to, you can copy out the relevant parts of the source code to produce your own approach.
"Training set tends to give almost perfect accuracy on my Grid Search."
That's a bit surprising, since the CV part of the searches means the models are being scored on unseen data. If you get very high best_score_ but low performance on the test set, then I would suspect your training set is not actually a representative sample, and that'll require a much more nuanced understanding of the situation.
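
A minimal sketch of a custom scorer passed through the scoring parameter is shown below. The scoring function `recall_weighted_f`, the decision tree and the synthetic data are assumptions made for this example. Setting return_train_score=True additionally exposes the training-fold scores, which makes it easy to see that mean_test_score is computed on held-out folds and not on the data the model was fitted on.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import fbeta_score, make_scorer
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier


def recall_weighted_f(y_true, y_pred):
    """Example custom metric (assumption): F-beta with beta=2, favouring recall."""
    return fbeta_score(y_true, y_pred, beta=2)


custom_scorer = make_scorer(recall_weighted_f, greater_is_better=True)

# Synthetic stand-in data (assumption).
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [1, 3, None]},
    scoring=custom_scorer,
    cv=5,
    return_train_score=True,  # also record scores on the training folds
)
search.fit(X, y)

results = search.cv_results_
for params, train, held_out in zip(
    results["params"], results["mean_train_score"], results["mean_test_score"]
):
    print(f"{params}: train folds {train:.3f}, held-out folds {held_out:.3f}")
```

An unrestricted tree (max_depth=None) scores perfectly on its training folds, while its held-out score is what the search actually uses to pick best_params_.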

How to check machine learning accuracy without cross validation

I have training samples X_train and Y_train to train on, and X_estimated to predict on.
My task is to make my classifier learn as accurately as it can, and then predict a vector of results over X_estimated that comes close to Y_estimated (which I have now, and I have to be as precise as I can). If I split my training data 75/25 into train and test sets, I can get the accuracy using sklearn.metrics.accuracy_score and a confusion matrix. But then I lose the 25% of samples that would make my predictions more accurate.
Is there any way I could learn from 100% of the data and still be able to see an accuracy score (or percentage), so that I can predict many times and save the best (%) result?
I am using a random forest with 500 estimators, and usually get around 90% accuracy. I want to save the best possible prediction vector for my task, without splitting any data (not wasting anything), but still be able to calculate accuracy (so I can save the best prediction vector) from multiple attempts (random forest always shows different results).
Thank you

Splitting your data is critical for evaluation.
There is no way that you could train your model on 100% of the data and still get a correct evaluation accuracy unless you expand your dataset. I mean, you could change your train/test split, or try to optimize your model in other ways, but I guess the simple answer to your question would be no.

As per your requirement, you can try K-fold cross-validation, for example after a 90|10 train|test split. Training on 100% of the data is not possible, because you need held-out data to validate how good your model is. K-fold CV uses your whole training set: in each fold it holds out a different part of the training data for validation and trains on the rest. The accuracy is then computed by averaging over all folds. Finally, you can test the accuracy on the remaining 10% of the data.
More you can read here and here
K Fold Cross Validation
Sklearn provides simple methods for performing K-fold cross-validation. You simply pass the number of folds to the method. But remember: the more folds, the longer it takes to train the model. You can check more here
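
A minimal sketch of this with scikit-learn's cross_val_score is shown below. The synthetic data and the forest size are assumptions; the names X_train and Y_train follow the question. After the cross-validated estimate is obtained, the model is refitted on all of the training data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the question's X_train / Y_train (assumption).
X_train, Y_train = make_classification(n_samples=300, n_features=10, random_state=0)

model = RandomForestClassifier(n_estimators=50, random_state=0)

# 5 folds: every sample is used for validation exactly once.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X_train, Y_train, cv=cv, scoring="accuracy")
print("per-fold accuracy:", scores.round(3))
print("mean accuracy: %.3f" % scores.mean())

# cross_val_score fits clones, so fit the final model on all training data.
model.fit(X_train, Y_train)
```

The mean over the folds serves as the accuracy estimate, while the final model still sees every training sample.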

It is not necessary to do a 75|25 split of your data all the time. 75|25 is kind of old school now. It greatly depends on the amount of data that you have. For example, if you have 1 billion sentences for training a language model, it is not necessary to reserve 25% for testing.
Also, I second the previous answer of trying K-fold cross-validation. As a side note, you could consider looking at the other metrics like precision and recall as well.

In general, splitting your data set is critical for evaluation, so I would recommend you always do that.
That said, there are methods that in some sense allow you to train on all your data and still get an estimate of your performance, i.e. of the generalization accuracy.
One particularly prominent method is leveraging the out-of-bag samples of models based on bootstrapping, e.g. random forests.
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=100, bootstrap=True, oob_score=True)
rf.fit(X, y)
print(rf.oob_score_)

If you are doing classification, always go with stratified k-fold CV (https://machinelearningmastery.com/cross-validation-for-imbalanced-classification/).
If you're doing regression, then go with simple k-fold CV, or divide the target into bins and do stratified k-fold CV. This way you can use your data completely in model training.
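
Both variants can be sketched with StratifiedKFold. The arrays below are randomly generated placeholders, and using quartiles as bin edges for the regression target is an assumption; any reasonable binning would do.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))  # placeholder features (assumption)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Classification: 90 samples of class 0, 10 of class 1.
y_class = np.array([0] * 90 + [1] * 10)
minority_per_fold = [
    int(y_class[test_idx].sum()) for _, test_idx in skf.split(X, y_class)
]
print("minority samples per test fold:", minority_per_fold)

# Regression: bin the continuous target into quartiles and stratify on the bins.
y_reg = rng.normal(size=100)
edges = np.quantile(y_reg, [0.25, 0.5, 0.75])
y_bins = np.digitize(y_reg, edges)  # bin index 0..3 for every sample
bin_counts_per_fold = [
    np.bincount(y_bins[test_idx], minlength=4).tolist()
    for _, test_idx in skf.split(X, y_bins)
]
print("bin counts per test fold:", bin_counts_per_fold)
```

Stratifying keeps the class (or target-bin) proportions the same in every fold, so no fold ends up without minority samples or without part of the target range. The models themselves are still trained on the original continuous y_reg; the bins are only used to build the folds.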

Linear regression: Good results for training data, horrible for test data

I am working with a dataset of about 400,000 x 250.
I have a problem with the model yielding a very good R^2 score when testing it on the training set, but an extremely poor one when used on the test set. Initially, this sounds like overfitting. But the data is split into training/test sets at random and the data set is pretty big, so I feel like there has to be something else.
Any suggestions?
Splitting dataset into training set and test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df.drop(['SalePrice'],
axis=1), df.SalePrice, test_size = 0.3)
Sklearn's Linear Regression estimator
from sklearn import linear_model
linReg = linear_model.LinearRegression() # Create linear regression object
linReg.fit(X_train, y_train) # Train the model using the training sets
# Predict from training set
y_train_linreg = linReg.predict(X_train)
# Predict from test set
y_pred_linreg = linReg.predict(X_test)
Metric calculation
from sklearn import metrics
metrics.r2_score(y_train, y_train_linreg)
metrics.r2_score(y_test, y_pred_linreg)
R^2 score when testing on training set: 0.64
R^2 score when testing on testing set: -10^23 (approximately)

While I agree with Mihai that your problem definitely looks like overfitting, I don't necessarily agree with his answer that a neural network would solve your problem; at least, not out of the box. By themselves, neural networks overfit more, not less, than linear models. You need to take care of your data somehow; hardly any model can do that for you. A few options that you might consider (apologies, I cannot be more precise without looking at the dataset):
Easiest thing, use regularization. 400k rows is a lot, but with 250 dimensions you can overfit almost whatever you like. So try replacing LinearRegression by Ridge or Lasso (or Elastic Net or whatever). See http://scikit-learn.org/stable/modules/linear_model.html (Lasso has the advantage of discarding features for you, see next point)
Especially if you want to go outside of linear models (and you probably should), it's advisable to first reduce the dimension of the problem, as I said 250 is a lot. Try using some of the Feature selection techniques here: http://scikit-learn.org/stable/modules/feature_selection.html
Probably more important than anything else, you should consider adapting your input data. The very first thing I'd try, assuming you are really trying to predict a price as your code implies, is to replace it by its logarithm, or log(1+x). Otherwise linear regression will try very, very hard to fit that single object that was sold for $1 million while ignoring everything below $1k. Just as important, check whether you have any non-numeric (categorical) columns and keep them only if you need them, possibly reducing them to macro-categories: a categorical column with 1000 possible values will increase your problem dimension by 1000, making it an assured overfit. A single column with a unique categorical value for each input (e.g. buyer name) will lead you straight to perfect overfitting.
After all this (cleaning data, reducing dimension via either one of the methods above or just Lasso regression until you get to certainly less than dim 100, possibly less than 20 - and remember that this includes any categorical data!), you should consider non-linear methods to further improve your results - but that's useless until your linear model provides you at least some mildly positive R^2 value on test data. sklearn provides a lot of them: http://scikit-learn.org/stable/modules/kernel_ridge.html is the easiest to use out-of-the-box (also does regularization), but it might be too slow to use in your case (you should first try this, and any of the following, on a subset of your data, say 1000 rows once you've selected only 10 or 20 features and see how slow that is). http://scikit-learn.org/stable/modules/svm.html#regression have many different flavours, but I think all but the linear one would be too slow. Sticking to linear things, http://scikit-learn.org/stable/modules/sgd.html#regression is probably the fastest, and would be how I'd train a linear model on this many samples. Going truly out of linear, the easiest techniques would probably include some kind of trees, either directly http://scikit-learn.org/stable/modules/tree.html#regression (but that's an almost-certain overfit) or, better, using some ensemble technique (random forests http://scikit-learn.org/stable/modules/ensemble.html#forests-of-randomized-trees are the typical go-to algorithm, gradient boosting http://scikit-learn.org/stable/modules/ensemble.html#gradient-tree-boosting sometimes works better). Finally, state-of-the-art results are indeed generally obtained via neural networks, see e.g. http://scikit-learn.org/stable/modules/neural_networks_supervised.html but for these methods sklearn is generally not the right answer and you should take a look at dedicated environments (TensorFlow, Caffe, PyTorch, etc.)... 
However, if you're not familiar with those, it is certainly not worth the trouble!
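
The first and third suggestions (regularization and a log-transformed target) can be combined in a few lines. The sketch below runs on synthetic, price-like data generated for this example, and the alpha values are arbitrary assumptions that would need tuning on real data.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption): 20 features, only the first 5 matter,
# and the target is skewed like a price.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 20))
log_price = (
    10
    + X[:, :5] @ np.array([0.5, -0.3, 0.2, 0.1, 0.4])
    + rng.normal(scale=0.2, size=2000)
)
y = np.expm1(log_price)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Ridge fitted on log(1 + y); predictions are mapped back with expm1.
model = TransformedTargetRegressor(
    regressor=make_pipeline(StandardScaler(), Ridge(alpha=1.0)),
    func=np.log1p,
    inverse_func=np.expm1,
)
model.fit(X_train, y_train)
r2_train = model.score(X_train, y_train)
r2_test = model.score(X_test, y_test)
print(f"Ridge on log target: train R^2 = {r2_train:.3f}, test R^2 = {r2_test:.3f}")

# Lasso on the log target sets the coefficients of unhelpful features to zero.
lasso = make_pipeline(StandardScaler(), Lasso(alpha=0.05))
lasso.fit(X_train, np.log1p(y_train))
n_kept = int(np.sum(lasso.named_steps["lasso"].coef_ != 0))
print("features kept by Lasso:", n_kept, "of", X.shape[1])
```

On real data, comparing the train and test R^2 of this model against plain LinearRegression is a quick way to see whether regularization and the target transform close the gap.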
