I have an imblearn (not sklearn) pipeline consisting of the following steps:
Column selector
Preprocessing pipeline (ColumnTransformer with OneHotEncoders and CountVectorizers on different columns)
imblearn's SMOTE
XGBClassifier
I have a tabular dataset and I'm trying to explain my predictions.
I managed to work out feature importance plots with some work, but can't get either
eli5 or lime to work.
Lime requires that I transform the data to the state it was in before the final step, because the transformers in the pipeline (such as the custom vectorizers) create new columns.
In principle, I can slice my pipeline like this: pipeline[:-1].predict(instance). However, that raises AttributeError: 'SMOTE' object has no attribute 'predict'.
I also tried an eli5 explainer, since it supposedly works with Sklearn Pipelines.
However, after running eli5.sklearn.explain_prediction.explain_prediction_sklearn_not_supported(pipeline, instance_to_explain) I get the message that the classifier is not supported.
Will appreciate any ideas on how to proceed with this.
Imblearn's samplers are effectively no-op (i.e. identity) transformers during prediction. Therefore, it should be safe to remove them after the pipeline has been fitted.
Try the following workflow:
Construct an Imblearn pipeline, and fit it.
Extract the steps of the fitted Imblearn pipeline to a new Scikit-Learn pipeline.
Delete the SMOTE step.
Explain your predictions using standard Scikit-Learn pipeline explanation tools.
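A minimal sketch of steps 2 through 4, assuming pipeline is your fitted imblearn pipeline and instance is the row you want to explain (samplers are identified here by their fit_resample method, which drops the SMOTE step):

from sklearn.pipeline import Pipeline

# Rebuild the fitted steps as a plain scikit-learn pipeline, skipping any
# sampler (samplers expose fit_resample instead of transform).
sk_pipeline = Pipeline([
    (name, est) for name, est in pipeline.steps
    if not hasattr(est, 'fit_resample')
])

# The sliced pipeline now transforms an instance for LIME without hitting SMOTE.
transformed_instance = sk_pipeline[:-1].transform(instance)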
I have a quick question about the following short snippet of code (my sklearn version, from which cross_val_score and LinearDiscriminantAnalysis are imported, is 1.1.1):
cv_results = cross_val_score(LinearDiscriminantAnalysis(), data, isTarget, cv=kfold, scoring='accuracy')
I am trying to train a LinearDiscriminantAnalysis ML algorithm on the 'data' and 'isTarget' variables, which are a numpy array of the features of the samples in my dataset and a list marking which samples are targets (1) or non-targets (0), respectively. kfold is just a method for scoring the algorithm; it isn't important here.
My question is this: I am scoring this algorithm by training it on 'data' and 'isTarget', but I would like to test it on a different dataset, 'data_val' and 'isTarget_val', and cross_val_score does not have parameters for training an algorithm on one dataset and testing it on another. I've been searching for other functions that will do this, and I feel it has a really simple answer that I just can't find.
Can someone help me out? Thanks :)
This is how cross-validation is designed to work. The cv argument you are supplying specifies that you want to do K-Fold cross-validation, which means that the entirety of your dataset will be used for both training and testing in K different folds.
You can read up more on cross-validation here.
You can accomplish this using a PredefinedSplit (docs) as the cv argument.
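A minimal sketch of that approach, reusing the variable names from the question; rows marked -1 are only ever used for training, and rows marked 0 form the single test fold:

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import PredefinedSplit, cross_val_score

# Stack the training and validation sets; -1 = train-only, 0 = test fold.
X = np.concatenate([data, data_val])
y = np.concatenate([isTarget, isTarget_val])
split = PredefinedSplit(np.r_[np.full(len(data), -1), np.full(len(data_val), 0)])

cv_results = cross_val_score(LinearDiscriminantAnalysis(), X, y,
                             cv=split, scoring='accuracy')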
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, mutual_info_classif

smt = SMOTE(random_state=0)
pipeline_rf_smt_fs = Pipeline([
    ('preprocess', preprocessor),  # preprocessor is defined elsewhere
    ('selector', SelectKBest(mutual_info_classif, k=30)),
    ('smote', smt),
    ('rf_classifier', RandomForestClassifier(n_estimators=600, random_state=2021)),
])
I am getting the error below:
All intermediate steps should be transformers and implement fit and transform or be the string 'passthrough' 'SMOTE(random_state=0)' (type <class 'imblearn.over_sampling._smote.SMOTE'>) doesn't
I believe SMOTE has to be used after the feature selection step. Any help on this would be much appreciated.
This is the error message given by scikit-learn's version of the pipeline. Your code, as is, should not produce this error, but you have probably run from sklearn.pipeline import Pipeline somewhere, which has shadowed imblearn's Pipeline object.
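One way to rule the shadowing out is to import imblearn's Pipeline under an explicit alias; a sketch reusing the names from the question:

# An alias keeps a later `from sklearn.pipeline import Pipeline` from
# shadowing the sampler-aware imblearn version.
from imblearn.pipeline import Pipeline as ImbPipeline

pipeline_rf_smt_fs = ImbPipeline([
    ('preprocess', preprocessor),
    ('selector', SelectKBest(mutual_info_classif, k=30)),
    ('smote', smt),
    ('rf_classifier', RandomForestClassifier(n_estimators=600, random_state=2021)),
])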
From a methodological point of view, I nonetheless find it questionable to use a sampler after the preprocessing and feature selection in a general setting. What if the features you select are relevant because of the imbalance in your dataset? I would prefer using it in the first step of a pipeline (but this is up to you, it should not cause any errors).
I am working on a machine learning problem with a dataset of shape 1,456,354 x 53, and I want to do feature selection on it. I know how to do feature selection in python using the following code.
import numpy as np
from sklearn.feature_selection import RFE, RFECV
from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression()
rfe = RFE(logreg, step=1, n_features_to_select=28)
rfe = rfe.fit(df.values, arrythmia.values)
features_bool = np.array(rfe.support_)
features = np.array(df.columns)
result = features[features_bool]
print(result)
However, I could not find any article showing how to perform recursive feature elimination in pyspark.
I tried to import sklearn libraries in pyspark, but it gave me a 'sklearn module not found' error. I am running pyspark on a google dataproc cluster.
Could someone please help me achieve this in pyspark?
You have a few options for doing this.
If the model you need is implemented in either Spark's MLlib or spark-sklearn, you can adapt your code to use the corresponding library.
If you can train your model locally and just want to deploy it to make predictions, you can use User Defined Functions (UDFs) or vectorized UDFs to run the trained model on Spark. Here's a good post discussing how to do this.
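A minimal sketch of that route with a vectorized (pandas) UDF, assuming Spark >= 3.0, a SparkSession named spark, a locally trained sklearn estimator model, and two illustrative feature columns f1 and f2:

import pandas as pd
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import DoubleType

# Broadcast the locally trained model so each worker gets one copy.
broadcast_model = spark.sparkContext.broadcast(model)

@pandas_udf(DoubleType())
def predict_udf(f1: pd.Series, f2: pd.Series) -> pd.Series:
    X = pd.concat([f1, f2], axis=1)
    return pd.Series(broadcast_model.value.predict(X))

scored = df.withColumn('prediction', predict_udf('f1', 'f2'))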
If you need to run an sklearn model on Spark that is not supported by spark-sklearn, you'll need to make sklearn available to Spark on each worker node in your cluster. You can do this by manually installing sklearn on each node in your Spark cluster (make sure you are installing into the Python environment that Spark is using).
Alternatively, you can package and distribute the sklearn library with the Pyspark job. In short, you can pip install sklearn into a local directory near your script, then zip the sklearn installation directory and use the --py-files flag of spark-submit to send the zipped sklearn to all workers along with your script. This article has a complete overview of how to accomplish this.
We can try the following feature selection methods in pyspark:
Chi-Squared selector
Random forest selector
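A minimal sketch of the Chi-Squared selector, assuming df already has an assembled 'features' vector column and a numeric 'label' column:

from pyspark.ml.feature import ChiSqSelector

# Keep the 28 features with the strongest chi-squared association with the label.
selector = ChiSqSelector(numTopFeatures=28, featuresCol='features',
                         outputCol='selectedFeatures', labelCol='label')
df_selected = selector.fit(df).transform(df)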
References:
https://spark.apache.org/docs/2.2.0/ml-features.html#feature-selectors
https://databricks.com/session/building-custom-ml-pipelinestages-for-feature-selection
I suggest a stepwise regression model: it lets you find the important features easily, and you can then keep only those in the logistic regression. Stepwise regression works on correlation, but it has variations.
The link below shows how to implement stepwise regression for feature selection.
https://datascience.stackexchange.com/questions/24405/how-to-do-stepwise-regression-using-sklearn
I am using a LeaveOneGroupOut CV strategy with TPOTRegressor:
from tpot import TPOTRegressor
from tpot.config import regressor_config_dict
from sklearn.model_selection import LeaveOneGroupOut

tpot = TPOTRegressor(
    config_dict=regressor_config_dict,
    generations=100,
    population_size=100,
    cv=LeaveOneGroupOut(),
    verbosity=2,
    n_jobs=1)

tpot.fit(XX, yy, groups=groups)
After optimization, the best-scoring trained pipeline is stored in tpot.fitted_pipeline_, and tpot.fitted_pipeline_.predict(X) is available.
My question is: what will the fitted pipeline have been trained on? E.g.:
does tpot refit the optimised pipeline on the entire dataset before storing it in tpot.fitted_pipeline_?
or does it represent the trained pipeline from the best-scoring split during cross-validation?
Additionally, is there a way to access the complete set of trained models corresponding to the set of splits for the winning/optimized pipeline?
TPOT will fit the final 'best' pipeline on the full training set: code
It's therefore recommended that your testing data never be passed to the TPOT fit function if you plan to directly interact with the 'best' pipeline via the TPOT object.
If that is an issue for you, you can retrain the pipeline directly via the tpot.fitted_pipeline_ attribute, which is simply a sklearn Pipeline object. Alternatively, you can use the export function to export the 'best' pipeline to its corresponding Python code and interact with the pipeline outside of TPOT.
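For instance, a minimal sketch of the retraining route, assuming X_train/y_train and X_test/y_test are your own splits, with the test portion never passed to TPOT:

from sklearn.base import clone

# clone() returns an unfitted copy of the winning sklearn Pipeline, so
# nothing from the TPOT search leaks into this fit.
best_pipeline = clone(tpot.fitted_pipeline_)
best_pipeline.fit(X_train, y_train)
print(best_pipeline.score(X_test, y_test))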
Additionally, is there a way to access the complete set of trained models corresponding to the set of splits for the winning/optimized pipeline?
No. TPOT uses sklearn's cross_val_score when evaluating pipelines, so it throws out the set of trained pipelines from the CV process. However, you can access the scoring results of every pipeline that TPOT evaluated via the tpot.evaluated_individuals_ attribute.
I am wondering if it is possible to use pipeline in Scikit-Learn in the following way:
I want to train a model on dataset A and then make predictions with the same model on dataset B. That way, I could use GridSearch to find the best pipeline parameters, using the predictions on dataset B as the measure.
I know how to write a normal pipeline and use it with GridSearch, but I can't see how to work with two datasets.
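The PredefinedSplit approach suggested above extends to GridSearch as well; a minimal sketch, assuming X_a/y_a is dataset A, X_b/y_b is dataset B, and an illustrative scaler + SVC pipeline:

import numpy as np
from sklearn.model_selection import GridSearchCV, PredefinedSplit
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Stack A and B; -1 = train-only rows (A), 0 = the single test fold (B).
X = np.concatenate([X_a, X_b])
y = np.concatenate([y_a, y_b])
cv = PredefinedSplit(np.r_[np.full(len(X_a), -1), np.full(len(X_b), 0)])

pipe = Pipeline([('scale', StandardScaler()), ('clf', SVC())])
search = GridSearchCV(pipe, {'clf__C': [0.1, 1, 10]}, cv=cv, refit=False)
search.fit(X, y)  # each candidate is fitted on A and scored on B
print(search.best_params_)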