how to use SMOTE & feature selection together in sklearn pipeline? - python

from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE

smt = SMOTE(random_state=0)
pipeline_rf_smt_fs = Pipeline([
    ('preprocess', preprocessor),
    ('selector', SelectKBest(mutual_info_classif, k=30)),
    ('smote', smt),
    ('rf_classifier', RandomForestClassifier(n_estimators=600, random_state=2021))
])
I am getting the error below:
All intermediate steps should be transformers and implement fit and transform or be the string 'passthrough' 'SMOTE(random_state=0)' (type <class 'imblearn.over_sampling._smote.SMOTE'>) doesn't
I believe SMOTE has to be used after the feature selection step. Any help on this would be much appreciated.

This is the error message given by scikit-learn's version of Pipeline. Your code, as written, should not produce this error, but you have probably run from sklearn.pipeline import Pipeline somewhere, which has overwritten the imblearn Pipeline object.
From a methodological point of view, I nonetheless find it questionable to use a sampler after the preprocessing and feature selection in a general setting. What if the features you select are relevant because of the imbalance in your dataset? I would prefer using it in the first step of a pipeline (but this is up to you, it should not cause any errors).
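A minimal sketch of the import fix, assuming preprocessor is defined as in your setup (importing imblearn's Pipeline under an explicit alias avoids it being shadowed by a later sklearn import):
from imblearn.pipeline import Pipeline as ImbPipeline   # imblearn's pipeline, under an unambiguous alias
from imblearn.over_sampling import SMOTE
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.ensemble import RandomForestClassifier

smt = SMOTE(random_state=0)
pipeline_rf_smt_fs = ImbPipeline([
    ('preprocess', preprocessor),   # `preprocessor` comes from the question's setup
    ('selector', SelectKBest(mutual_info_classif, k=30)),
    ('smote', smt),
    ('rf_classifier', RandomForestClassifier(n_estimators=600, random_state=2021)),
])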

Related

How can I explain predictions of an imblearn pipeline?

I have an imblearn (not sklearn) pipeline consisting of the following steps:
Column selector
Preprocessing pipeline (ColumnTransformer with OneHotEncoders and CountVectorizers on different columns)
imblearn's SMOTE
XGBClassifier
I have a tabular dataset and I'm trying to explain my predictions.
I managed to work out feature importance plots with some work, but can't get either eli5 or lime to work.
Lime requires that I transform the data to its state before the last transformation (because the transformers in the Pipeline, like custom vectorizers, create new columns).
In principle, I can slice my Pipeline like this: pipeline[:-1].predict(instance). However, I get the following error: {AttributeError}'SMOTE' object has no attribute 'predict'.
I also tried an eli5 explainer, since it supposedly works with Sklearn Pipelines.
However, after running eli5.sklearn.explain_prediction.explain_prediction_sklearn_not_supported(pipeline, instance_to_explain) I get the message that the classifier is not supported.
Will appreciate any ideas on how to proceed with this.
Imblearn's samplers are effectively no-op (i.e. identity) transformers during prediction. Therefore, it should be safe to delete them after the pipeline has been fitted.
Try the following workflow:
Construct an Imblearn pipeline, and fit it.
Extract the steps of the fitted Imblearn pipeline to a new Scikit-Learn pipeline.
Delete the SMOTE step.
Explain your predictions using standard Scikit-Learn pipeline explanation tools.
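A minimal sketch of this workflow, assuming the fitted imblearn pipeline is called pipeline and its SMOTE step is named 'smote' (names inferred from the question, not from actual code):
from sklearn.pipeline import Pipeline as SkPipeline

# Copy the already-fitted steps over, leaving the sampler out.
explain_steps = [(name, step) for name, step in pipeline.steps if name != 'smote']
explain_pipeline = SkPipeline(explain_steps)

# The new pipeline's steps are already fitted, so it should be usable directly:
explain_pipeline.predict(instance)           # full prediction path, no SMOTE
explain_pipeline[:-1].transform(instance)    # features as the classifier sees them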

Skip some of the transform steps (related to over and under sampling) in imbalanced-learn pipeline when predicting on test data set [duplicate]

I'm dealing with an imbalanced dataset and want to do a grid search to tune my model's parameters using scikit-learn's GridSearchCV. To oversample the data, I want to use SMOTE, and I know I can include that as a stage of a pipeline and pass it to GridSearchCV.
My concern is that SMOTE will be applied to both the train and validation folds, which is not what you are supposed to do. The validation set should not be oversampled.
Am I right that the whole pipeline will be applied to both dataset splits? And if yes, how can I work around this?
Thanks a lot in advance
Yes, it can be done, but with imblearn Pipeline.
You see, imblearn has its own Pipeline to handle the samplers correctly. I described this in a similar question here.
When predict() is called on an imblearn Pipeline object, it will skip the sampling steps and pass the data unchanged to the next transformer.
You can confirm that by looking at the source code here:
if hasattr(transform, "fit_sample"):
    pass
else:
    Xt = transform.transform(Xt)
So for this to work correctly, you need the following:
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

model = Pipeline([
    ('sampling', SMOTE()),
    ('classification', LogisticRegression())
])

grid = GridSearchCV(model, params, ...)
grid.fit(X, y)
Fill the details as necessary, and the pipeline will take care of the rest.
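As a hedged illustration of "fill in the details" (the grid values below are made up), note that the parameter-grid keys must be prefixed with the pipeline step name:
params = {
    'classification__C': [0.1, 1.0, 10.0],   # hypothetical values; 'classification' is the step name above
    'classification__penalty': ['l2'],
}
grid = GridSearchCV(model, params, scoring='f1', cv=5)
grid.fit(X, y)   # X, y are your training data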

When fitting with TPOT CV, is the fitted_pipeline_ retrained on the whole dataset?

I am using a LeaveOneGroupOut CV strategy with TPOTRegressor:
from tpot import TPOTRegressor
from sklearn.model_selection import LeaveOneGroupOut

tpot = TPOTRegressor(
    config_dict=regressor_config_dict,   # defined elsewhere
    generations=100,
    population_size=100,
    cv=LeaveOneGroupOut(),
    verbosity=2,
    n_jobs=1)
tpot.fit(XX, yy, groups=groups)
After optimization the best scoring trained pipeline is stored in tpot.fitted_pipeline_ and tpot.fitted_pipeline_.predict(X) is available.
My question is: what will the fitted pipeline have been trained on? E.g. does TPOT refit the optimised pipeline on the entire dataset before storing it in tpot.fitted_pipeline_, or does it represent the trained pipeline from the best-scoring split during cross-validation?
Additionally, is there a way to access the complete set of trained models corresponding to the set of splits for the winning/optimized pipeline?
TPOT will fit the final 'best' pipeline on the full training set (see the TPOT source code).
It's therefore recommended that your testing data never be passed to the TPOT fit function if you plan to directly interact with the 'best' pipeline via the TPOT object.
If that is an issue for you, you can retrain the pipeline directly via the tpot.fitted_pipeline_ attribute, which is simply a sklearn Pipeline object. Alternatively, you can use the export function to export the 'best' pipeline to its corresponding Python code and interact with the pipeline outside of TPOT.
Additionally, is there a way to access the complete set of trained models corresponding to the set of splits for the winning/optimized pipeline?
No. TPOT uses sklearn's cross_val_score when evaluating pipelines, so it throws out the set of trained pipelines from the CV process. However, you can access the scoring results of every pipeline that TPOT evaluated via the tpot.evaluated_individuals_ attribute.
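A hedged sketch of the retrain-or-export options mentioned above, assuming X_train/y_train and X_test are your own held-out split (these names are not from the original posts):
# Refit the winning pipeline on the training portion only.
tpot.fitted_pipeline_.fit(X_train, y_train)
preds = tpot.fitted_pipeline_.predict(X_test)

# Or export the winning pipeline as standalone Python code.
tpot.export('best_pipeline.py')

# CV scores for every evaluated pipeline (no fitted models are kept).
for pipeline_str, info in tpot.evaluated_individuals_.items():
    print(pipeline_str, info['internal_cv_score'])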

Use sklearn's GridSearchCV with a pipeline, preprocessing just once

I'm using scikit-learn to tune a model's hyper-parameters. I'm using a pipeline to chain the preprocessing with the estimator. A simple version of my problem would look like this:
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

grid = GridSearchCV(make_pipeline(StandardScaler(), LogisticRegression()),
                    param_grid={'logisticregression__C': [0.1, 10.]},
                    cv=2,
                    refit=False)
_ = grid.fit(X=np.random.rand(10, 3),
             y=np.random.randint(2, size=(10,)))
In my case the preprocessing (what would be StandardScale() in the toy example) is time consuming, and I'm not tuning any parameter of it.
So, when I execute the example, the StandardScaler is executed 8 times: 2 (fit/score) × 2 (CV folds) × 2 (parameter values). But every time StandardScaler is executed for a different value of the parameter C, it returns the same output, so it would be much more efficient to compute it once and then just run the estimator part of the pipeline.
I can manually split the pipeline between the preprocessing (no hyper parameters tuned) and the estimator. But to apply the preprocessing to the data, I should provide the training set only. So, I would have to implement the splits manually, and not use GridSearchCV at all.
Is there a simple/standard way to avoid repeating the preprocessing while using GridSearchCV?
Update:
Ideally, the answer below should not be used as it leads to data leakage as discussed in comments. In this answer, GridSearchCV will tune the hyperparameters on the data already preprocessed by StandardScaler, which is not correct. In most conditions that should not matter much, but algorithms which are too sensitive to scaling will give wrong results.
Essentially, GridSearchCV is also an estimator, implementing fit() and predict() methods, used by the pipeline.
So instead of:
grid = GridSearchCV(make_pipeline(StandardScaler(), LogisticRegression()),
                    param_grid={'logisticregression__C': [0.1, 10.]},
                    cv=2,
                    refit=False)
Do this:
clf = make_pipeline(StandardScaler(),
                    GridSearchCV(LogisticRegression(),
                                 param_grid={'C': [0.1, 10.]},   # no step prefix here: the inner estimator is LogisticRegression itself
                                 cv=2,
                                 refit=True))

clf.fit(X, y)     # X, y are your training data
clf.predict(X)
What it will do is call StandardScaler() only once, for one call to clf.fit(), instead of the multiple calls described above.
Edit:
Changed refit to True, when GridSearchCV is used inside a pipeline. As mentioned in documentation:
refit : boolean, default=True
Refit the best estimator with the entire dataset. If “False”, it is impossible to make predictions using this GridSearchCV instance
after fitting.
If refit=False, clf.fit() will have no effect because the GridSearchCV object inside the pipeline will be reinitialized after fit().
When refit=True, the GridSearchCV will be refitted with the best scoring parameter combination on the whole data that is passed in fit().
So if you want to build the pipeline just to see the scores of the grid search, only then is refit=False appropriate. If you want to call the clf.predict() method, refit=True must be used, otherwise a NotFittedError will be thrown.
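As a small usage sketch (assuming the make_pipeline construction above, whose auto-generated step name is 'gridsearchcv'), the fitted inner search can then be inspected through the pipeline:
inner_search = clf.named_steps['gridsearchcv']   # step name generated by make_pipeline
print(inner_search.best_params_)
print(inner_search.best_score_)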
For those who stumbled upon a slightly different problem, which I had as well.
Suppose you have this pipeline:
classifier = Pipeline([
    ('vectorizer', CountVectorizer(max_features=100000, ngram_range=(1, 3))),
    ('clf', RandomForestClassifier(n_estimators=10, random_state=SEED, n_jobs=-1))])
Then, when specifying the parameters, you need to include the 'clf__' prefix (the step name you used for your estimator, followed by two underscores). So the parameter grid is going to be:
params = {'clf__max_features': [0.3, 0.5, 0.7],
          'clf__min_samples_leaf': [1, 2, 3],
          'clf__max_depth': [None]
          }
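A brief usage sketch, assuming training data X_train/y_train (these names are not in the original post):
from sklearn.model_selection import GridSearchCV

grid = GridSearchCV(classifier, param_grid=params, cv=3)
grid.fit(X_train, y_train)
print(grid.best_params_)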
It is not possible to do this in the current version of scikit-learn (0.18.1). A fix has been proposed on the GitHub project:
https://github.com/scikit-learn/scikit-learn/issues/8830
https://github.com/scikit-learn/scikit-learn/pull/8322
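Note (hedged, beyond what the original answer says): scikit-learn 0.19 and later added a memory argument to Pipeline that caches fitted transformers, which covers this use case. A minimal sketch:
from tempfile import mkdtemp
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

cachedir = mkdtemp()
pipe = Pipeline([('standardscaler', StandardScaler()),
                 ('logisticregression', LogisticRegression())],
                memory=cachedir)   # cache fitted transformers on disk
grid = GridSearchCV(pipe, param_grid={'logisticregression__C': [0.1, 10.]},
                    cv=2, refit=False)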

How to use scikit's preprocessing/normalization along with cross validation?

As an example of cross-validation without any preprocessing, I can do something like this:
tuned_params = [{"penalty": ["l2", "l1"]}]

from sklearn.linear_model import SGDClassifier
SGD = SGDClassifier()

from sklearn.model_selection import GridSearchCV
clf = GridSearchCV(SGD, tuned_params, verbose=5)
clf.fit(x_train, y_train)
I would like to preprocess my data using something like
from sklearn import preprocessing
x_scaled = preprocessing.scale(x_train)
But it would not be a good idea to do this before setting up the cross-validation, because then the training and testing sets would be normalized together. How do I set up the cross-validation so that it preprocesses the corresponding training and test sets separately on each run?
Per the documentation, if you employ Pipeline, this can be done for you. From the docs, just above section 3.1.1.1, emphasis mine:
Just as it is important to test a predictor on data held-out from training, preprocessing (such as standardization, feature selection, etc.) and similar data transformations similarly should be learnt from a training set and applied to held-out data for prediction [...] A Pipeline makes it easier to compose estimators, providing this behavior under cross-validation[.]
More relevant information on pipelines available here.
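A hedged sketch tying the docs' advice back to the question's setup: put the scaler inside the pipeline so each CV split learns its own scaling (x_train, y_train are the question's variables; the grid values mirror the question's tuned_params).
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV

pipe = make_pipeline(StandardScaler(), SGDClassifier())
tuned_params = {"sgdclassifier__penalty": ["l2", "l1"]}   # step-name prefix required

clf = GridSearchCV(pipe, tuned_params, verbose=5)
clf.fit(x_train, y_train)   # the scaler is refit on each training fold only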
