Use sklearn's GridSearchCV with a pipeline, preprocessing just once - python

I'm using scikit-learn to tune a model's hyperparameters. I'm using a pipeline to chain the preprocessing with the estimator. A simple version of my problem would look like this:
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
grid = GridSearchCV(make_pipeline(StandardScaler(), LogisticRegression()),
                    param_grid={'logisticregression__C': [0.1, 10.]},
                    cv=2,
                    refit=False)
_ = grid.fit(X=np.random.rand(10, 3),
             y=np.random.randint(2, size=(10,)))
In my case the preprocessing (what would be StandardScaler() in the toy example) is time consuming, and I'm not tuning any of its parameters.
So, when I execute the example, the StandardScaler is executed 8 times: 2 (fit/transform) * 2 (CV folds) * 2 (parameter values). But every time StandardScaler runs for a different value of the parameter C it returns the same output, so it would be much more efficient to compute it once and then just run the estimator part of the pipeline.
I can manually split the pipeline between the preprocessing (no hyperparameters tuned) and the estimator. But to apply the preprocessing correctly, I should fit it on the training set only. So, I would have to implement the splits manually, and not use GridSearchCV at all.
Is there a simple/standard way to avoid repeating the preprocessing while using GridSearchCV?

Update:
Ideally, the answer below should not be used, as it leads to data leakage, as discussed in the comments. In this answer, GridSearchCV tunes the hyperparameters on data already preprocessed by StandardScaler over the full dataset, which is not correct. In most cases that will not matter much, but algorithms that are very sensitive to scaling will give misleading results.
Essentially, GridSearchCV is also an estimator, implementing fit() and predict() methods, used by the pipeline.
So instead of:
grid = GridSearchCV(make_pipeline(StandardScaler(), LogisticRegression()),
                    param_grid={'logisticregression__C': [0.1, 10.]},
                    cv=2,
                    refit=False)
Do this:
clf = make_pipeline(StandardScaler(),
                    GridSearchCV(LogisticRegression(),
                                 # the grid wraps LogisticRegression directly,
                                 # so the key is just 'C', not 'logisticregression__C'
                                 param_grid={'C': [0.1, 10.]},
                                 cv=2,
                                 refit=True))
clf.fit(X, y)      # X, y: your training data
clf.predict(X)
What this does is call StandardScaler() only once, for the single call to clf.fit(), instead of the multiple calls you described.
Edit:
Changed refit to True when GridSearchCV is used inside a pipeline. As mentioned in the documentation:
refit : boolean, default=True
Refit the best estimator with the entire dataset. If “False”, it is impossible to make predictions using this GridSearchCV instance
after fitting.
If refit=False, the fitted pipeline is of little use, because the GridSearchCV step will not keep a refitted best estimator, so the pipeline cannot be used for prediction.
When refit=True, the GridSearchCV will be refitted with the best scoring parameter combination on the whole data that is passed in fit().
So if you only want to build the pipeline to look at the scores of the grid search, then refit=False is appropriate. If you want to call the clf.predict() method, refit=True must be used, otherwise a NotFittedError will be raised.
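If you do need the scores from the inner search after fitting the pipeline above, here is a minimal sketch (the step name 'gridsearchcv' is the one make_pipeline generates automatically; X and y are placeholder data, not part of the original answer):
import numpy as np

X = np.random.rand(10, 3)           # placeholder features
y = np.random.randint(2, size=10)   # placeholder binary labels

clf.fit(X, y)                        # StandardScaler is fitted only once here
inner_search = clf.named_steps['gridsearchcv']
print(inner_search.best_params_)     # e.g. {'C': 0.1}
print(inner_search.cv_results_['mean_test_score'])
clf.predict(X)                       # works because refit=True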

For those who stumbled upon a slightly different problem, which I had as well.
Suppose you have this pipeline:
classifier = Pipeline([
    ('vectorizer', CountVectorizer(max_features=100000, ngram_range=(1, 3))),
    ('clf', RandomForestClassifier(n_estimators=10, random_state=SEED, n_jobs=-1))])
Then, when specifying parameters, you need to prefix them with the 'clf__' name that you gave your estimator step (note the double underscore). So the parameter grid is going to be:
params = {'clf__max_features': [0.3, 0.5, 0.7],
          'clf__min_samples_leaf': [1, 2, 3],
          'clf__max_depth': [None]
          }
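To tie it together, a minimal sketch of feeding that grid into a search (X and y stand in for the raw documents and labels, which are not shown above):
from sklearn.model_selection import GridSearchCV

grid = GridSearchCV(classifier, param_grid=params, cv=5, n_jobs=-1)
grid.fit(X, y)              # X: raw documents for the vectorizer, y: labels
print(grid.best_params_)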

It was not possible to do this in scikit-learn 0.18.1, the current version at the time. A fix was proposed on the GitHub project:
https://github.com/scikit-learn/scikit-learn/issues/8830
https://github.com/scikit-learn/scikit-learn/pull/8322
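For later readers: scikit-learn 0.19 added transformer caching to pipelines via a memory argument, which addresses exactly this use case. A minimal sketch on the toy example from the question, assuming a recent scikit-learn (the temporary cache directory is arbitrary):
from shutil import rmtree
from tempfile import mkdtemp

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

cache_dir = mkdtemp()  # fitted transformers are cached here and reused across C values
pipe = make_pipeline(StandardScaler(), LogisticRegression(), memory=cache_dir)

grid = GridSearchCV(pipe,
                    param_grid={'logisticregression__C': [0.1, 10.]},
                    cv=2,
                    refit=False)
grid.fit(X=np.random.rand(10, 3), y=np.random.randint(2, size=(10,)))

rmtree(cache_dir)  # remove the temporary cache when done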

Related

Does anybody know Hyperparameters of GridSearchCV for MultiOutputClassifier model?

from sklearn.model_selection import GridSearchCV
parameters = {?????}
search = GridSearchCV(_pipeline, n_jobs=1, cv=5, param_grid=parameters)
#multi_target_linear = MultiOutputClassifier(search)
search.fit(X, y)
#search.get_params().keys()
search.best_params_
Hyperparameters in a case like this vary from case to case; they are what you are solving for, with the goal of arriving at a combination that is both accurate and efficient.
From what it seems, you aim to create a parameter grid called parameters that includes the hyperparameters you want to tune. GridSearchCV then tries every combination in that grid and reports the best-performing one. The CV stands for cross-validation, a way of splitting the data into training and validation folds for fair evaluation; here you use 5 folds, hence cv=5.
You are also using MultiOutputClassifier, which adapts other classifiers to handle multiple targets, but you don't say which base classifier that is.
The param_grid of GridSearchCV is determined by its estimator, but you didn't show how _pipeline is constructed.
So I cannot answer this precisely, but this answer will help, and a hedged illustration of the naming convention follows.
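As an illustration only (the base classifier, the pipeline, and the step names below are assumptions, since _pipeline is not shown): if the pipeline wraps MultiOutputClassifier(RandomForestClassifier()) in a step named 'clf', the grid keys have to traverse each wrapper with double underscores:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical pipeline standing in for the unspecified _pipeline
_pipeline = Pipeline([
    ('scale', StandardScaler()),
    ('clf', MultiOutputClassifier(RandomForestClassifier(random_state=0))),
])

# step name -> the wrapper's 'estimator' parameter -> the base classifier's parameters
parameters = {
    'clf__estimator__n_estimators': [100, 300],
    'clf__estimator__max_depth': [None, 10],
}

search = GridSearchCV(_pipeline, n_jobs=1, cv=5, param_grid=parameters)
# search.fit(X, y)  # y must be 2D: one column per target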

Does it make sense to use sklearn GridSearchCV together with CalibratedClassifierCV?

What I want to do is to derive a classifier which is optimal in its parameters with respect to a given metric (for example the recall score) but also calibrated (in the sense that the output of the predict_proba method can be directly interpreted as a confidence level, see https://scikit-learn.org/stable/modules/calibration.html). Does it make sense to use sklearn GridSearchCV together with CalibratedClassifierCV, that is, to fit a classifier via GridSearchCV, and then pass the GridSearchCV output to the CalibratedClassifierCV object? If I'm correct, the CalibratedClassifierCV object would fit a given estimator cv times, and the probabilities for each of the folds are then averaged for prediction. However, the results of the GridSearchCV could be different for each of the folds.
Yes you can do this and it would work. I don't know if it makes sense to do this, but the least I can do is explain what I believe would happen.
We can compare doing this to the alternative which is getting the best estimator from the grid search and feeding that to the calibration.
Simply getting the best estimator and feeding it to CalibratedClassifierCV
from sklearn.model_selection import GridSearchCV
from sklearn import svm, datasets
from sklearn.calibration import CalibratedClassifierCV
iris = datasets.load_iris()
parameters = {'kernel':('linear', 'rbf'), 'C':[1, 10]}
svc = svm.SVC()
clf = GridSearchCV(svc, parameters)
clf.fit(iris.data, iris.target)
calibration_clf = CalibratedClassifierCV(clf.best_estimator_)
calibration_clf.fit(iris.data, iris.target)
calibration_clf.predict_proba(iris.data[0:10])
array([[0.91887427, 0.07441489, 0.00671085],
[0.91907451, 0.07417992, 0.00674558],
[0.91914982, 0.07412815, 0.00672202],
[0.91939591, 0.0738401 , 0.00676399],
[0.91894279, 0.07434967, 0.00670754],
[0.91910347, 0.07414268, 0.00675385],
[0.91944594, 0.07381277, 0.0067413 ],
[0.91903299, 0.0742324 , 0.00673461],
[0.91951618, 0.07371877, 0.00676505],
[0.91899007, 0.07426733, 0.00674259]])
Feeding the grid search into CalibratedClassifierCV
from sklearn.model_selection import GridSearchCV
from sklearn import svm, datasets
from sklearn.calibration import CalibratedClassifierCV
iris = datasets.load_iris()
parameters = {'kernel':('linear', 'rbf'), 'C':[1, 10]}
svc = svm.SVC()
clf = GridSearchCV(svc, parameters)
cal_clf = CalibratedClassifierCV(clf)
cal_clf.fit(iris.data, iris.target)
cal_clf.predict_proba(iris.data[0:10])
array([[0.900434 , 0.0906832 , 0.0088828 ],
[0.90021418, 0.09086583, 0.00891999],
[0.90206035, 0.08900572, 0.00893393],
[0.9009212 , 0.09012478, 0.00895402],
[0.90101953, 0.0900889 , 0.00889158],
[0.89868497, 0.09242412, 0.00889091],
[0.90214948, 0.08889812, 0.0089524 ],
[0.8999936 , 0.09110965, 0.00889675],
[0.90204193, 0.08896843, 0.00898964],
[0.89985101, 0.09124147, 0.00890752]])
Notice that the output probabilities are slightly different between the two.
The difference between each method is:
Using the best estimator is only doing the calibration across 5 splits (the default cv). It uses the same estimator in all 5 splits.
Feeding in the grid search will fit a grid search inside each of the 5 calibration CV splits, i.e. 5 times. You are essentially doing cross validation on 4/5 of the data each time, choosing the best estimator for that 4/5, and then doing the calibration with that best estimator on the remaining fifth. You could have slightly different models running on each set of test data, depending on what the grid search chooses.
I think grid search and calibration have different goals, so in my opinion I would work on each separately and go with the first way specified above: get the model that works best, and then feed that into the calibration.
However, I don't know your specific goals so I can't say that the 2nd way described here is the WRONG way. You can always try both ways and see what gives you better performance and go with the one that works best.
I think your approach is a little at odds with your objective. Your objective says "find the model with the best recall, whose confidence should be unbiased", but what you do is "find the model with the best recall, then make its confidence unbiased". So a better (but slower) way to do that is:
Wrap your model with CalibratedClassifierCV, and treat this as the final model to be optimized;
Modify your param grid so that you are tuning the model inside CalibratedClassifierCV (change each param to something like base_estimator__param, since base_estimator is the attribute CalibratedClassifierCV uses to hold the base estimator);
Feed the CalibratedClassifierCV model into your final GridSearchCV, then fit;
Get best_estimator_, which is your calibrated (unbiased) model with the best recall.
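A minimal sketch of the steps above, assuming an SVC base model and an older scikit-learn in which CalibratedClassifierCV exposes its inner model as base_estimator (newer releases renamed it to estimator, which changes the grid keys accordingly):
from sklearn import datasets, svm
from sklearn.calibration import CalibratedClassifierCV
from sklearn.model_selection import GridSearchCV

iris = datasets.load_iris()

calibrated = CalibratedClassifierCV(base_estimator=svm.SVC())
param_grid = {
    'base_estimator__kernel': ['linear', 'rbf'],
    'base_estimator__C': [1, 10],
}

# Tune the model inside the calibration wrapper; iris is multiclass, hence recall_macro
search = GridSearchCV(calibrated, param_grid, scoring='recall_macro', cv=5)
search.fit(iris.data, iris.target)
best_calibrated_model = search.best_estimator_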
I would advise that you calibrate on a separate set, so as not to bias the estimate.
I see two options: either you cross-validate within a fraction of the folds generated for calibrating, as suggested above, or you set apart an ad hoc evaluation set that you use only for calibration, after performing cross validation on the training set.
In any case, I would recommend that you finally evaluate on a test set.
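A hedged sketch of the held-out calibration set idea, reusing the iris example (cv='prefit' tells CalibratedClassifierCV not to refit the already fitted model; very recent scikit-learn replaces this with FrozenEstimator):
from sklearn import datasets, svm
from sklearn.calibration import CalibratedClassifierCV
from sklearn.model_selection import GridSearchCV, train_test_split

iris = datasets.load_iris()
X_train, X_cal, y_train, y_cal = train_test_split(
    iris.data, iris.target, test_size=0.25, stratify=iris.target, random_state=0)

# Tune hyperparameters on the training portion only
search = GridSearchCV(svm.SVC(), {'kernel': ('linear', 'rbf'), 'C': [1, 10]}, cv=5)
search.fit(X_train, y_train)

# Calibrate the already fitted best model on the held-out calibration set
calibrated = CalibratedClassifierCV(search.best_estimator_, cv='prefit')
calibrated.fit(X_cal, y_cal)
calibrated.predict_proba(X_cal[:5])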

Skip some of the transform steps (related to over and under sampling) in imbalanced-learn pipeline when predicting on test data set [duplicate]

I'm dealing with an imbalanced dataset and want to do a grid search to tune my model's parameters using scikit-learn's GridSearchCV. To oversample the data, I want to use SMOTE, and I know I can include that as a stage of a pipeline and pass it to GridSearchCV.
My concern is that SMOTE will be applied to both the training and validation folds, which is not what you are supposed to do: the validation set should not be oversampled.
Am I right that the whole pipeline will be applied to both dataset splits? And if yes, how can I turn around this?
Thanks a lot in advance
Yes, it can be done, but with imblearn Pipeline.
You see, imblearn has its own Pipeline to handle the samplers correctly. I described this in a similar question here.
When predict() is called on an imblearn.pipeline.Pipeline object, it skips the sampling steps and passes the data unchanged to the next transformer.
You can confirm that by looking at the source code here:
if hasattr(transform, "fit_sample"):
    pass
else:
    Xt = transform.transform(Xt)
So for this to work correctly, you need the following:
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

model = Pipeline([
    ('sampling', SMOTE()),
    ('classification', LogisticRegression())
])

grid = GridSearchCV(model, params, ...)
grid.fit(X, y)
Fill the details as necessary, and the pipeline will take care of the rest.
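For instance, a possible params grid for that pipeline could look like this (the values are placeholders, not a recommendation):
# Keys are '<step name>__<parameter name>' from the pipeline above
params = {
    'sampling__k_neighbors': [3, 5],        # SMOTE neighbourhood size
    'classification__C': [0.1, 1.0, 10.0],  # LogisticRegression regularization strength
}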

Evaluate Loss Function Value Getting From Training Set on Cross Validation Set

I am following Andrew Ng's instructions for evaluating an algorithm in classification:
Find the loss function value on the Training Set.
Compare it with the loss function value on the Cross Validation Set.
If both are close enough and small, go to the next step (otherwise, there is bias or variance, etc.).
Make a prediction on the Test Set using the resulting Thetas (i.e. weights) produced in the previous step, as a final confirmation.
I am trying to apply this using the scikit-learn library; however, I am really lost and fairly sure that I am doing it wrong (I didn't find anything similar online):
from sklearn import model_selection, svm
from sklearn.metrics import make_scorer, log_loss
from sklearn import datasets

def main():
    iris = datasets.load_iris()
    kfold = model_selection.KFold(n_splits=10, random_state=42)
    model = svm.SVC(kernel='linear', C=1)
    results = model_selection.cross_val_score(estimator=model,
                                              X=iris.data,
                                              y=iris.target,
                                              cv=kfold,
                                              scoring=make_scorer(log_loss, greater_is_better=False))
    print(results)
Error
ValueError: y_true contains only one label (0). Please provide the true labels explicitly through the labels argument.
I am not sure this is even the right way to start. Any help is very much appreciated.
Given the clarifications you provide in the comments and that you are not particularly interested in the log loss itself, I think the most straightforward approach is to abandon log loss and go for the accuracy instead:
from sklearn import model_selection, svm
from sklearn import datasets

iris = datasets.load_iris()
kfold = model_selection.KFold(n_splits=10, random_state=42)
model = svm.SVC(kernel='linear', C=1)
results = model_selection.cross_val_score(estimator=model,
                                          X=iris.data,
                                          y=iris.target,
                                          cv=kfold,
                                          scoring="accuracy")  # change
As already mentioned in the comments, the inclusion of log loss in such situations still suffers from some unresolved issues in scikit-learn (see here and here).
For the purpose of estimating the generalization ability of your model, you will be fine with the accuracy metric.
This kind of error appears often when you do cross validation.
Basically your data is split into n_splits = 10 and some classes are missing on some of these splits. For example, your 9th split may not have training examples for class number 2.
So when you evaluate your loss, the set of classes seen in your predictions and the set of classes in the test fold do not match. You cannot compute the loss if there are 3 classes in y_true but your model was trained to predict only 2.
What do you do in this case?
You have a few options (a combined sketch follows this list):
Shuffle your data: KFold(n_splits=10, random_state=42, shuffle=True)
Make n_splits bigger.
Provide the list of labels explicitly to the loss function, as follows:
args_loss = {"labels": [0, 1, 2]}
make_scorer(log_loss, greater_is_better=False, **args_loss)
Cherry-pick your splits so you make sure this doesn't happen. I don't think KFold allows this, but GridSearchCV does.
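A minimal sketch combining the first and third options on the iris example (note that log_loss expects class probabilities, so this also sets probability=True on the SVC and needs_proba=True on the scorer; very recent scikit-learn replaces that keyword with response_method='predict_proba'):
from sklearn import datasets, model_selection, svm
from sklearn.metrics import make_scorer, log_loss

iris = datasets.load_iris()

# Option 1: shuffle so every fold is likely to contain all classes
kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=42)

# Option 3: tell log_loss which labels exist, even if a fold misses one
scorer = make_scorer(log_loss, greater_is_better=False,
                     needs_proba=True, labels=[0, 1, 2])

model = svm.SVC(kernel='linear', C=1, probability=True)
results = model_selection.cross_val_score(model, iris.data, iris.target,
                                          cv=kfold, scoring=scorer)
print(results)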
Just for future readers who are following Andrew's course:
K-fold is not really applicable for this purpose, because we mainly want to evaluate the Thetas (i.e. weights) produced by a certain algorithm with some parameters on the Cross Validation Set, by using those Thetas in a comparison between the two cost functions, J(train) vs J(CV), to determine whether the model suffers from bias, from variance, or is OK.
Nevertheless, K-fold is mainly for testing the predictions on the CV set using the weights produced from training the model on the Training Set.

How to use scikit's preprocessing/normalization along with cross validation?

As an example of cross-validation without any preprocessing, I can do something like this:
tuned_params = [{"penalty": ["l2", "l1"]}]
from sklearn.linear_model import SGDClassifier
SGD = SGDClassifier()
from sklearn.model_selection import GridSearchCV  # sklearn.grid_search in older releases
clf = GridSearchCV(SGD, tuned_params, verbose=5)
clf.fit(x_train, y_train)
I would like to preprocess my data using something like
from sklearn import preprocessing
x_scaled = preprocessing.scale(x_train)
But it would not be a good idea to do this before setting up the cross-validation, because then the training and test sets would be normalized together. How do I set up the cross-validation so that it preprocesses the corresponding training and test sets separately on each run?
Per the documentation, if you employ Pipeline, this can be done for you. From the docs, just above section 3.1.1.1, emphasis mine:
Just as it is important to test a predictor on data held-out from training, preprocessing (such as standardization, feature selection, etc.) and similar data transformations similarly should be learnt from a training set and applied to held-out data for prediction [...] A Pipeline makes it easier to compose estimators, providing this behavior under cross-validation[.]
More relevant information on pipelines is available here; a minimal sketch follows.
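A minimal sketch of what that looks like for the scaler plus SGDClassifier case above (x_train and y_train are the question's own, unshown, data):
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

pipe = make_pipeline(StandardScaler(), SGDClassifier())

# Step names generated by make_pipeline are the lowercased class names
param_grid = {'sgdclassifier__penalty': ['l2', 'l1']}

clf = GridSearchCV(pipe, param_grid, verbose=5)
# clf.fit(x_train, y_train)  # the scaler is refit on the training folds only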
