I'm trying to get the best set of parameters for an SVR model.
I'd like to use the GridSearchCV over different values of C.
However, from the previous test, I noticed that the split into the Training/Test set highly influences the overall performance (r2 in this instance).
To address this problem, I'd like to implement a repeated 5-fold cross-validation (10 x 5CV). Is there a built-in way of performing it using GridSearchCV?
Quick solution, following the idea presented in the scikit-learn official documentation:
import numpy as np
from sklearn.model_selection import GridSearchCV, KFold

# svr, p_grid, X, y: your estimator, parameter grid and data
NUM_TRIALS = 10
scores = []
for i in range(NUM_TRIALS):
    cv = KFold(n_splits=5, shuffle=True, random_state=i)
    clf = GridSearchCV(estimator=svr, param_grid=p_grid, cv=cv)
    clf.fit(X, y)  # fit is required before best_score_ is available
    scores.append(clf.best_score_)

print("Average Score: {0} STD: {1}".format(np.mean(scores), np.std(scores)))
This is called nested cross-validation. You can look at the official documentation example to guide you in the right direction, and also have a look at my other answer here for a similar approach.
You can adapt the steps to suit your need:
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.svm import SVC

# The docs example uses SVC on the iris data; swap in SVR and your own data for regression
X_iris, y_iris = load_iris(return_X_y=True)
svr = SVC(kernel="rbf")
c_grid = {"C": [1, 10, 100, ...]}  # fill in the rest of your C grid

# CV technique: KFold, GroupKFold, LeaveOneOut, LeaveOneGroupOut, etc.

# To be used within GridSearchCV (5 in your case)
inner_cv = KFold(n_splits=5, shuffle=True, random_state=42)

# To be used in the outer CV (you asked for 10)
outer_cv = KFold(n_splits=10, shuffle=True, random_state=42)

# Non-nested parameter search and scoring
clf = GridSearchCV(estimator=svr, param_grid=c_grid, cv=inner_cv)
clf.fit(X_iris, y_iris)
non_nested_score = clf.best_score_

# Pass the GridSearchCV estimator to cross_val_score
# This will be your required 10 x 5 CV:
# 10 for the outer CV and 5 for GridSearchCV's internal CV
clf = GridSearchCV(estimator=svr, param_grid=c_grid, cv=inner_cv)
nested_score = cross_val_score(clf, X=X_iris, y=y_iris, cv=outer_cv).mean()
Edit - Description of nested cross-validation with cross_val_score() and GridSearchCV()
1. clf = GridSearchCV(estimator, param_grid, cv=inner_cv).
2. Pass clf, X, y, and outer_cv to cross_val_score.
3. As seen in the source code of cross_val_score, this X will be divided into X_outer_train and X_outer_test using outer_cv. Same for y.
4. X_outer_test will be held back and X_outer_train will be passed on to clf for fit() (GridSearchCV in our case). Assume X_outer_train is called X_inner from here on, since it is passed to the inner estimator, and assume y_outer_train is y_inner.
5. X_inner will now be split into X_inner_train and X_inner_test using inner_cv inside the GridSearchCV. Same for y.
6. Now the grid search estimator will be trained using X_inner_train and y_inner_train and scored using X_inner_test and y_inner_test.
7. Steps 5 and 6 will be repeated for inner_cv_iters (5 in this case).
8. The hyper-parameters for which the average score over all inner iterations (X_inner_train, X_inner_test) is best are passed on to clf.best_estimator_, which is then fitted on all the outer training data, i.e. X_outer_train.
9. This clf (gridsearch.best_estimator_) will then be scored using X_outer_test and y_outer_test.
10. Steps 3 to 9 will be repeated for outer_cv_iters (10 here), and an array of scores will be returned from cross_val_score.
11. We then use mean() to get back nested_score.
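For intuition, here is a rough sketch (not sklearn's actual implementation) of what cross_val_score does with a GridSearchCV estimator, assuming X_iris, y_iris, svr, inner_cv, outer_cv and a concrete c_grid are defined as in the snippet above:
import numpy as np
from sklearn.base import clone
from sklearn.model_selection import GridSearchCV

outer_scores = []
for train_idx, test_idx in outer_cv.split(X_iris):                     # step 3
    X_outer_train, X_outer_test = X_iris[train_idx], X_iris[test_idx]  # step 4
    y_outer_train, y_outer_test = y_iris[train_idx], y_iris[test_idx]

    # steps 5-8: inner CV over the parameter grid, then refit the best
    # hyper-parameters on the whole outer training fold
    gs = GridSearchCV(estimator=clone(svr), param_grid=c_grid, cv=inner_cv)
    gs.fit(X_outer_train, y_outer_train)

    # step 9: score the refitted best estimator on the held-out outer fold
    outer_scores.append(gs.score(X_outer_test, y_outer_test))

# steps 10-11: one score per outer fold, averaged into nested_score
nested_score = np.mean(outer_scores)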
You can supply different cross-validation generators to GridSearchCV. The default for binary or multiclass classification problems is StratifiedKFold. Otherwise, it uses KFold. But you can supply your own. In your case, it looks like you want RepeatedKFold or RepeatedStratifiedKFold.
from sklearn.model_selection import GridSearchCV, RepeatedKFold, RepeatedStratifiedKFold
# Define svr here
...
# Specify cross-validation generator, in this case (10 x 5CV)
cv = RepeatedKFold(n_splits=5, n_repeats=10)
clf = GridSearchCV(estimator=svr, param_grid=p_grid, cv=cv)
# Continue as usual
clf.fit(...)
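For completeness, here is a minimal self-contained sketch of the same idea; the synthetic dataset and the C values are just illustrative assumptions, not taken from your setup:
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV, RepeatedKFold
from sklearn.svm import SVR

# toy regression data standing in for your own X, y
X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

svr = SVR(kernel="rbf")
p_grid = {"C": [0.1, 1, 10, 100]}

# 10 repeats of 5-fold CV = 50 fits per parameter combination
cv = RepeatedKFold(n_splits=5, n_repeats=10, random_state=0)
clf = GridSearchCV(estimator=svr, param_grid=p_grid, cv=cv, scoring="r2")
clf.fit(X, y)

print(clf.best_params_)
print(clf.best_score_)  # mean r2 over all 50 validation folds for the best C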
Related
I am testing RandomForestClassifier on a simple dataset from sklearn. When I split the data with train_test_split, I get accuracy = 0.89. If I use cross-validation with cross_val_score with the same classifier parameters, the accuracy is lower, about 0.83. Why?
Here is the code:
import numpy as np
from sklearn.model_selection import cross_val_score, StratifiedKFold, GridSearchCV, train_test_split
from sklearn.metrics import accuracy_score, f1_score, make_scorer
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_circles

np.random.seed(42)

# create dataset:
x, y = make_circles(n_samples=500, factor=0.1, noise=0.35, random_state=42)

# initialize stratified split:
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# create classifier:
clf = RandomForestClassifier(random_state=42, max_depth=12, n_jobs=-1,
                             oob_score=True, n_estimators=100, min_samples_leaf=10)

# average accuracy on cross-validation:
results = np.mean(cross_val_score(clf, x, y, cv=skf, scoring=make_scorer(accuracy_score)))
print("ACCURACY WITH CV = ", results)  # prints 0.832

# use train_test_split
xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.2)
clf = RandomForestClassifier(random_state=42, max_depth=12, n_jobs=-1,
                             oob_score=True, n_estimators=100, min_samples_leaf=10)
clf.fit(xtrain, ytrain)
ypred = clf.predict(xtest)
print("ACCURACY WITHOUT CV = ", accuracy_score(ytest, ypred))  # prints 0.89
What I got:
ACCURACY WITH CV = 0.83
ACCURACY WITHOUT CV = 0.89
Cross validation is used to run multiple experiments on different splits of data and then average their results. This is to ensure that the result of the experiment is not biased by one split, as it is in your case.
Your chosen seed, along with some luck, gave you a train/test split with higher accuracy than the average. The higher accuracy is an artifact of random sampling when making the split, not an indicator of better model performance.
Simply put:
Cross-validation makes multiple splits of the data. Your model is trained on all of these different splits and then the performance is averaged.

If you pick one of these splits, you may get lucky and there might be good overlap between the data points in your test and train set. Your model will have high accuracy in this case.

Or you may get unlucky and there might not be a high overlap between the data points in the test and train set. Your model will have a lower accuracy in this case.
Thus, cross validation is used to average the results of various such splits (5 in your case).
Here is your code run in a google colab notebook:
https://colab.research.google.com/drive/16-NotF-_WVLESmvGMONSGSZigxrT3KLx?usp=sharing
The last cell makes 5 different splits and then averages their accuracies. Notice how that is the same as the one you got from cross validation. Also notice how some splits have higher and some splits have a lower accuracy.
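A minimal sketch of what that last cell does, assuming the x, y, clf, train_test_split and accuracy_score from the code above (the seeds here are arbitrary):
import numpy as np

split_scores = []
for seed in range(5):
    # a fresh 80/20 split for every seed
    xtr, xte, ytr, yte = train_test_split(x, y, test_size=0.2, random_state=seed)
    clf.fit(xtr, ytr)
    split_scores.append(accuracy_score(yte, clf.predict(xte)))

print(split_scores)           # individual splits scatter above and below the CV mean
print(np.mean(split_scores))  # should land close to the ~0.83 from cross_val_score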
To further convince yourself, look at the output of:
cross_val_score(clf, x, y, cv=skf, scoring=make_scorer(accuracy_score))
The output is a list of scores (accuracies in your case) for the 5 different splits. You can see that they have varying values around 0.83.
This is just down to chance for the split and the random state of the Random Forest classifier. Try leaving random_state=42 out, fit the model several times, and you'll get a spread of different accuracies. By chance, I had one run without CV of "just" 0.78! In contrast, CV gives you an average (your calculated mean) PLUS an idea of how much your accuracy could vary around that.
I use 20% of the data set as my testing set and use GridSearchCV to implement K-fold cross-validation to tune hyperparameters.
By using a pipeline, we can put the column transformer and the machine learning algorithm into GridSearchCV together. If I set up a 5-fold cross-validation for GridSearchCV, the function will use 5 different training and validation sets to train and validate each combination of hyperparameters. As far as I know, GridSearchCV uses the mean of the 5 fold scores to choose the best model.
Then my question is, how does it transform the testing set?
I'm very confused about this because, to avoid data leakage, we should use only the training set to fit the transformer. But in this case we have 5 different training sets, and I don't know which one GridSearchCV uses to fit the transformer and then transform the validation and testing sets.
My code is given below
# preprocessor, ML_algo, param_grid and i are defined earlier in my code
X_other, X_test, y_other, y_test = train_test_split(X, y, test_size=0.2, random_state=i)

kf = KFold(n_splits=4, shuffle=True, random_state=i)

pipe = Pipeline(steps=[("preprocessor", preprocessor), ("model", ML_algo)])

grid = GridSearchCV(estimator=pipe, param_grid=param_grid,
                    scoring=make_scorer(mean_squared_error, greater_is_better=False, squared=False),
                    cv=kf, return_train_score=True, n_jobs=-1, verbose=False)
grid.fit(X_other, y_other)

test_score = mean_squared_error(y_test, grid.predict(X_test), squared=False)
Short answer: there is no data leakage; the test set is not used (and should not be used) for training the model in your code.

Long answer: k-fold cross-validation randomly divides your X_other and y_other (the training set) into k folds. For each iteration of cross-validation, k-1 folds are used to train the model, which is then evaluated on the 1 fold left out using the metric you specified in scoring= (see the cross-validation diagram in the sklearn docs for details: https://scikit-learn.org/stable/modules/cross_validation.html#cross-validation).

After GridSearchCV() has found the best set of hyperparameters, all of the training set data is used to train a final model with those hyperparameters; then X_test, y_test (the test set) can be transformed and scored by this fitted pipeline. Note that in this process X_test, y_test are not used, and should not be used, other than in the final prediction.
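To make the refit behaviour concrete, here is a hedged sketch using the pipe and grid from your code (the step names "preprocessor" and "model" are the ones you chose in the pipeline):
# After grid.fit(X_other, y_other), GridSearchCV (with the default refit=True)
# refits the entire pipeline on ALL of X_other using the best hyperparameters.
best_pipe = grid.best_estimator_  # a fitted Pipeline

# The preprocessor inside it was fitted on X_other only, never on X_test.
fitted_preprocessor = best_pipe.named_steps["preprocessor"]

# grid.predict(X_test) is roughly equivalent to:
X_test_transformed = fitted_preprocessor.transform(X_test)
y_pred = best_pipe.named_steps["model"].predict(X_test_transformed)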
I am doing Randomized search cv to find alpha value in Lasso Regression and I am performing 10 fold cross validation. Is there a way to get the coefficients value for every split, just like we get the scores by using cv_results function?
There is no direct way to do this via RandomizedSearchCV. But you can work around this by defining your own class that e.g. prints the coefficients to the console when the predict function is called:
from sklearn.linear_model import Lasso

class MyLasso(Lasso):
    def predict(self, X):
        print(self.coef_)
        return super().predict(X)
MyLasso behaves the same as Lasso and can be used as usual:
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split, RandomizedSearchCV
X, y = make_regression(n_features=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
param_distributions = {'alpha': [0.01, 0.1, 1]}
rs = RandomizedSearchCV(
    MyLasso(),
    param_distributions=param_distributions,
    cv=2,
    n_iter=3,
    random_state=42,
)
rs.fit(X_train, y_train)
Output for the example above (three iterations of 2-fold cross-validation gives six results):
[64.57650818 98.64237403 57.07123743 60.56898095 35.59985227]
[64.57001187 98.63679695 57.06557977 60.56304163 35.59888746]
[64.43774582 98.55938568 57.01219706 60.49221968 35.51151313]
[64.37690435 98.49805298 56.95345309 60.43375789 35.5018112 ]
[63.05012223 97.72950224 56.42179336 59.72460697 34.62812171]
[62.44582912 97.11061327 55.83218634 59.14092054 34.53104869]
It seemed to me that saving the coefficients as additional scores would be slicker than modifying the estimator itself as in #afsharov's answer. Defining a scorer and passing it to the search as
def coefs_scorer(estimator, X, y):
    return estimator.coef_

rs = RandomizedSearchCV(
    ...
    scoring={'r2': 'r2', 'coefs': coefs_scorer},
    refit='r2',
)
fails because there's a check that scorers return single numbers. So you need to unpack the coefficients, and I ended up with this:
from functools import partial

def coefs_scorer(estimator, X, y, i):
    return estimator.coef_[i]

scoring = {'r2': 'r2'}
for i in range(X_train.shape[1]):
    scoring[f'coef{i}'] = partial(coefs_scorer, i=i)

param_distributions = {'alpha': [0.01, 0.1, 1]}
rs = RandomizedSearchCV(
    Lasso(),
    param_distributions=param_distributions,
    cv=2,
    n_iter=3,
    random_state=42,
    scoring=scoring,
    refit='r2',
)
Note that with multiple metrics you need to specify which to use for refitting. Because of all the additional work, I'm not so sure this is better than the custom class. It does have a few advantages though:
If you wanted to pickle the best estimator, you don't need to package-ize the custom class.
The scores are programmatically saved rather than just printed.
Since they're scores, you get the average and standard deviation of the coefficients across folds stored in cv_results_ (of course, calculating them yourself wouldn't be difficult).
Disadvantages:
We had to specify a metric per feature. It's ugly, but worse it assumes you know in advance the number of features (it would fail if your estimator was a pipeline that had a feature selection or certain feature engineering steps).
If you return train scores, you'll duplicate the coefficients in cv_results_.
These aren't actually scores, so semantically this is hacky.
The scorer assumes that coef_ exists and is one-dimensional.
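If you go this route, the per-feature "coefficient scores" end up in cv_results_ like any other metric. A small sketch of pulling them out after rs.fit(X_train, y_train), assuming the rs defined above:
import pandas as pd

results = pd.DataFrame(rs.cv_results_)

# one mean/std column per feature: mean_test_coef0, std_test_coef0, ...
mean_coef_cols = [c for c in results.columns if c.startswith("mean_test_coef")]
print(results[["param_alpha"] + mean_coef_cols])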
I'm using sklearn to do some machine learning. I often use GridSearchCV to explore hyperparameters and perform cross-validation. Using this, I can specify a scoring function, like this:
scores = -cross_val_score(svr, X, Y, cv=10, scoring='neg_mean_squared_error')
However, I want to train my SVR model using mean squared error. Unfortunately, there's no scoring parameter in either the constructor for SVR or the fit method.
How should I do this?
Thanks!
I typically use a Pipeline to do it. You can create a pipeline that includes the SVR model (and other steps if you want). Then you can apply GridSearchCV, passing the pipeline in as the estimator argument.

Here, you can add a params_grid where the search space is defined as pipelinename__paramname (double underscore in between). For example, if my pipeline step is named svr and I want to search over the parameter C, I can put the key in my parameter dictionary as svr__C.
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score, GridSearchCV
from sklearn.svm import SVR

c_range = np.arange(1, 10, 1)
pipeline = Pipeline([('svr', SVR())])
params_grid = {'svr__C': c_range}

# grid search with 3-fold cross-validation
gridsearch_model = GridSearchCV(pipeline, params_grid,
                                cv=3, scoring='neg_mean_squared_error')
Then you can follow the same procedure: fit the training data and retrieve the best score and parameters.
gridsearch_model.fit(X_train, y_train)
print(gridsearch_model.best_params_, gridsearch_model.best_score_)
You can also use cross_val_score to find the score:
cross_val_score(gridsearch_model, X_train, y_train,
                cv=3, scoring='neg_mean_squared_error')
Hope this helps!
I am using stratified 10-fold cross-validation to find the model that predicts y (binary outcome) from X (X has 34 labels) with the highest AUC. I set up the GridSearchCV:
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

log_reg = LogisticRegression()
parameter_grid = {'penalty': ["l1", "l2"], 'C': np.arange(0.1, 3, 0.1)}
cross_validation = StratifiedKFold(n_splits=10, shuffle=True, random_state=100)
grid_search = GridSearchCV(log_reg, param_grid=parameter_grid, scoring='roc_auc',
                           cv=cross_validation)
And then do the cross-validation:
grid_search.fit(X, y)
y_pr = grid_search.predict(X)
I do not understand the following:
Why do grid_search.score(X, y) and roc_auc_score(y, y_pr) give different results (the former is 0.74 and the latter is 0.63)? Why don't these commands do the same thing in my case?
This is due to the way the roc_auc scorer is constructed when it is used inside GridSearchCV.
Look at the source code here:
roc_auc_scorer = make_scorer(roc_auc_score, greater_is_better=True,
                             needs_threshold=True)
Observe the third parameter, needs_threshold. When true, the scorer requires continuous values for y_pred, such as probabilities or confidence scores, which in the grid search are obtained from log_reg.decision_function().
When you explicitly call roc_auc_score with y_pr, you are using .predict(), which outputs the predicted class labels of the data and not probabilities. That should account for the difference.
Try:
y_pr = grid_search.decision_function(X)
roc_auc_score(y, y_pr)
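To see the two quantities side by side, here is a minimal illustration reusing your X, y and grid_search (the exact numbers depend on your data):
from sklearn.metrics import roc_auc_score

# AUC computed from hard class labels (what you did) vs. from continuous scores
auc_from_labels = roc_auc_score(y, grid_search.predict(X))
auc_from_scores = roc_auc_score(y, grid_search.decision_function(X))

print(auc_from_labels)  # ~0.63 in your case
print(auc_from_scores)  # should agree with grid_search.score(X, y), ~0.74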
If the results are still not the same, please update the question with complete code and some sample data.