Pre-process y column in a sklearn pipeline before classification - python

As part of a sklearn pipeline, I'd like to bin my response variable into k ordinal categories and then perform classification on these categories. I found KBinsDiscretizer, which seems to perform this transformation, but it appears to work only on feature columns, not on the target column.
Reproducible example
import sklearn
from sklearn.compose import make_column_transformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
import pandas as pd
from sklearn.datasets import load_boston
data = load_boston()
df = pd.DataFrame(data['data'], columns=data['feature_names'])
df['target'] = data['target']
binarizer_col_y = make_column_transformer(
    [sklearn.preprocessing.KBinsDiscretizer(n_bins=3, encode='ordinal'), ['target']],
    remainder='passthrough'
)
pipeline = Pipeline(steps=[
    ('preprocess', binarizer_col_y),
    ('ols', LinearRegression())
])
pipeline.fit(df[data['feature_names']], df['target'])
This errors with
pandas._libs.hashtable.PyObjectHashTable.get_item()
KeyError: 'target'
The above exception was the direct cause of the following exception:
...
[ another key error for 'target']
I also found sklearn.compose.TransformedTargetRegressor for transforming the response (but I want to do classification), and that I can write my own transformers, but they apparently only modify X, not y.
Can anyone tell me how to modify y in a pre-processing step prior to classification as part of a pipeline?
Why inside the pipeline?
The idea is to move as many transformations as possible into the pipeline, reducing boilerplate code, avoiding data leakage, and simplifying model deployment (e.g. services like the Databricks model registry can deploy a sklearn model with pre-processing expected to happen inside the model).

You get the error because the transformation is applied only to X, not to y, so the 'target' column is not available when the pipeline is fitted.
A sklearn Pipeline does not support transforming the target y the way you wrote it.
However, there is sklearn.compose.TransformedTargetRegressor, which wraps a model and can be told how to transform the target.
A warning: it is not well supported, and I ran into many issues when trying to use it on a real project, so you may prefer manual target transformation steps.
Here is a little demo that might work for you.
import sklearn
from sklearn.compose import make_column_transformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
import pandas as pd
from sklearn.datasets import load_boston
from sklearn.compose import TransformedTargetRegressor
data = load_boston()
df = pd.DataFrame(data["data"], columns=data["feature_names"])
X = df[data["feature_names"]]
y = data["target"]
pipeline = Pipeline(
    steps=[
        (
            "ols",
            TransformedTargetRegressor(
                LinearRegression(),
                transformer=sklearn.preprocessing.KBinsDiscretizer(
                    n_bins=3, encode="ordinal"
                ),
            ),
        )
    ]
)
pipeline.fit(X, y)
pipeline.predict(X)
Or, a more readable snippet showing how to create the target transformer:
model = LinearRegression()
kbins = sklearn.preprocessing.KBinsDiscretizer(n_bins=3, encode="ordinal")
ttr = TransformedTargetRegressor(model, transformer=kbins)
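If the end goal is classification on the binned target, the manual route mentioned above is to bin y once, outside the pipeline, and then fit an ordinary classifier pipeline on the binned labels. A minimal sketch of that idea (LogisticRegression is only a placeholder classifier, not something from the question):
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import KBinsDiscretizer, StandardScaler
# bin the continuous target into 3 ordinal classes, outside the pipeline
kbins = KBinsDiscretizer(n_bins=3, encode="ordinal")
y_binned = kbins.fit_transform(y.reshape(-1, 1)).ravel()
# everything that transforms X can still live inside the pipeline
clf_pipeline = Pipeline(steps=[
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])
clf_pipeline.fit(X, y_binned)
clf_pipeline.predict(X)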

Related

sklearn to pmml pipeline how to apply postprocessing linear transformation

I'm having a tough time trying to apply a postprocessing step with the sklearn2pmml package. What I'm trying to do is to apply a linear transformation after applying the predict_proba method within the PMMLPipeline class in sklearn2pmml. Any idea how to do this?
Even an automatable solution outside this package would help me (like automatically modifying the XML of the PMML).
Here's an example so you can get a deeper understanding of what I'm trying to do:
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
import pandas as pd
import numpy as np
from sklearn2pmml.pipeline import PMMLPipeline
from sklearn2pmml import make_pmml_pipeline, sklearn2pmml
# FORGET ABOUT TRAIN TEST SPLIT; we only care if the PMML pipeline works for now
BIRTHDAY_SEED = 1995
nrows, cols = 1000, 5
X, y = make_classification(n_samples=nrows, n_features=cols, n_informative=2, n_redundant=3, n_classes=2, shuffle=True, random_state=BIRTHDAY_SEED)
X, y = pd.DataFrame(X), pd.Series(y)
model = DecisionTreeClassifier()
model.fit(X,y)
def postprocessing_linear_transformation(probabilities, a, b):
    """This function multiplies probabilities by a and adds b"""
    return probabilities * a + b
# the pipeline should look like this
# first predict probabilities
probabilities = model.predict_proba(X)[:,0]
# then scale them (apply linear transformation)
probabilities_scaled = postprocessing_linear_transformation(probabilities, a=1000, b=100)
# of course it does not work
pmml_pipeline = PMMLPipeline([
    # here we should place the category preprocessor; I know it does not work, but you get the idea
    ('decisiontree', model),
    ('postprocessing_apply_linear_transformation', postprocessing_linear_transformation)
])
sklearn2pmml(pmml_pipeline, "example_pipeline_pmml.pmml", with_repr = True)
On second thought, you don't need a full-blown LinearRegression step to perform a deterministic a * x + b probability scaling operation. A simple ExpressionTransformer step is more than adequate:
from sklearn2pmml.preprocessing import ExpressionTransformer
pipeline = PMMLPipeline([
    ("decisiontree", model)
], predict_proba_transformer=ExpressionTransformer("X[0] * 1000 + 100"))
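For what it's worth, my reading of the expression syntax (treat it as an assumption rather than documentation): X[0] refers to the first column of the input the transformer receives, which for a predict_proba_transformer is the predicted probability of the first class, i.e. the same value as model.predict_proba(X)[:, 0] in the question.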
I'm having a tough time trying to apply a postprocessing step with the sklearn2pmml package.
Don't blame the SkLearn2PMML package for your troubles. It is the Scikit-Learn framework that prohibits you from inserting two estimator objects into a single Pipeline object.
In the current case, you should rephrase your problem. What you're really trying to do is build a "chain of two models" (the first model feeding into the second model). The SkLearn2PMML package provides a sklearn2pmml.ensemble.EstimatorChain estimator type, which allows you to accomplish exactly that.

Why are my transformers in a pipeline in a ColumnTransformer missing the fitted attributes?

I have a pipeline in a ColumnTransformer. One of the transformers is a PCA. When I use fit and then transform, the data looks right and everything is working. But when I try to access the explained_variance_ratio_ of the PCA in the pipeline after the fit, the attribute does not exist. All my other transformers in the pipeline are also missing the attributes they should have after fitting. What am I doing wrong?
The code looks like this:
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
import pandas as pd
def transform(df: pd.DataFrame, cat_cols, log_cols, passthrough_cols):
    oh_enc = OneHotEncoder(handle_unknown='ignore')
    transformer_oh = ColumnTransformer([('cat_cols', oh_enc, cat_cols)], remainder='passthrough')
    scaler = StandardScaler()
    pca = PCA(n_components=5)
    pipe = Pipeline([("preprocessing", transformer_oh),
                     ("scaling", scaler),
                     ("pca", pca)])
    to_transform = list(set(df.columns) - set(passthrough_cols))
    transformer = ColumnTransformer([("pipe", pipe, to_transform)], remainder='passthrough')
    transformer = transformer.fit(df)
    pca2 = transformer.transformers[0][1].steps[2][1]
    print(pca2.explained_variance_ratio_)  # AttributeError: 'PCA' object has no attribute 'explained_variance_ratio_'
To access the fitted transformers of a fitted ColumnTransformer you have to use the attribute transformers_ (with a trailing underscore), not transformers. With that change everything works fine.
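For reference, a minimal sketch of the corrected access, assuming the step names "pipe" and "pca" from the code above:
# use the fitted attributes (note the trailing underscores)
fitted_pipe = transformer.named_transformers_["pipe"]  # the fitted inner Pipeline
fitted_pca = fitted_pipe.named_steps["pca"]            # the fitted PCA step
print(fitted_pca.explained_variance_ratio_)
# equivalently: transformer.transformers_[0][1].steps[2][1].explained_variance_ratio_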

Unable to make prediction with Sklearn model on pyspark dataframe

I have loaded a sklearn model successfully but am unable to make predictions on a pyspark dataframe. When running the code below, I get the error shown underneath. Please help me get the code to make predictions with a sklearn model on a pyspark dataframe. I have also searched related questions but could not find a solution.
sc = spark.sparkContext
braodcast_model = sc.broadcast(loaded_model)
braodcast_model.value

# update prediction method
def predictor(cols):
    # call predict method for model
    return model.value.predict(*cols)

udf_predictor = udf(predictor, FloatType())
# apply the udf to dataframe
df_prediction = df.withColumn("prediction", udf_predictor(df.select(list_of_columns)))
I get the following error message
TypeError: Invalid argument, not a string or column. For column literals, use 'lit', 'array',
'struct' or 'create_map' function.
I think you were on the right track towards your expected output.
I managed to find two possible solutions for this problem: one uses a Spark UDF, the other a Pandas UDF.
Spark UDF
from pyspark.sql.functions import udf
@udf('integer')
def predict_udf(*cols):
    return int(braodcast_model.value.predict((cols,)))

list_of_columns = df.columns
df_prediction = df.withColumn('prediction', predict_udf(*list_of_columns))
Pandas UDF
import pandas as pd
from pyspark.sql.functions import pandas_udf
@pandas_udf('integer')
def predict_pandas_udf(*cols):
    X = pd.concat(cols, axis=1)
    return pd.Series(braodcast_model.value.predict(X))

list_of_columns = df.columns
df_prediction = df.withColumn('prediction', predict_pandas_udf(*list_of_columns))
Reproducible example
Here I used a Databricks Community cluster with Spark 3.1.2, pandas==1.2.4 and pyarrow==4.0.0.
The broadcast model is a simple logistic regression from scikit-learn, trained on the breast cancer dataset.
import pandas as pd
import joblib
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from pyspark.sql.functions import udf, pandas_udf
# load dataset
X, y = load_breast_cancer(return_X_y=True, as_frame=True)
# split in training and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=28)
# create a small pipeline with standardization and model, and fit it
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X_train, y_train)
# save and reload the model
path = '/databricks/driver/test_model.joblib'
joblib.dump(pipe, path)
loaded_model = joblib.load(path)
# sample of unseen data
df = spark.createDataFrame(X_test.sample(50, random_state=42))
# create broadcasted model
sc = spark.sparkContext
braodcast_model = sc.broadcast(loaded_model)
Then I used the two methods illustrated above, and you will see that the resulting df_prediction is the same in both cases.
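As a quick sanity check (my addition, not part of the original answer), you can compare the Spark output with the predictions the loaded model produces locally on the same 50 rows:
# local scikit-learn predictions on the collected sample
local_preds = loaded_model.predict(df.toPandas())
print(local_preds[:5])
# distributed predictions computed through the UDF
df_prediction.select("prediction").show(5)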

How to use imblearn undersampler in pipeline?

I have the following pipeline construction:
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline
sel = SelectKBest(k='all',score_func=chi2)
under = RandomUnderSampler(sampling_strategy=0.2)
preprocessor = ColumnTransformer(transformers=[('num', numeric_transformer, numeric_cols)])
final_pipe = Pipeline(steps=[('sample',under),('preprocessor', preprocessor),('var',VarianceThreshold()),('sel',sel),('clf', model)])
However, I get the following error:
TypeError: All intermediate steps of the chain should be estimators that implement fit and transform or fit_resample (but not both) or be a string 'passthrough' '<class 'sklearn.compose._column_transformer.make_column_selector'>' (type <class 'type'>) doesn't)
I don't understand what I am doing wrong. Can anybody help?
@Math12, I encountered this same issue recently, and the way I got around it is to wrap the RandomUnderSampler() in a custom function which is then wrapped in a FunctionTransformer.
I have reworked your code this way and it worked.
Below is a snippet of the code:
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

sel = SelectKBest(k='all', score_func=chi2)
preprocessor = ColumnTransformer(transformers=[('num', numeric_transformer, numeric_cols)])

def Data_Preprocessing_3(df):
    # fit the random under-sampler on the train data
    rus = RandomUnderSampler(sampling_strategy=0.2)
    df = rus.fit_resample(df)
    return df

# in a separate code line outside the above function, wrap the function
# with a FunctionTransformer
under = FunctionTransformer(Data_Preprocessing_3)

# implement your pipeline as done initially
final_pipe = Pipeline(steps=[('sample', under), ('preprocessor', preprocessor), ('var', VarianceThreshold()), ('sel', sel), ('clf', model)])
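For completeness, a different sketch that is not part of the answer above: imblearn's own Pipeline also accepts samplers such as RandomUnderSampler directly as steps, as long as every step is an instantiated estimator (the error message complains that make_column_selector was passed as a class rather than an instance). Assuming purely numeric features and a placeholder classifier:
from imblearn.pipeline import Pipeline
from imblearn.under_sampling import RandomUnderSampler
from sklearn.feature_selection import SelectKBest, VarianceThreshold, chi2
from sklearn.linear_model import LogisticRegression

simple_pipe = Pipeline(steps=[
    ('sample', RandomUnderSampler(sampling_strategy=0.2)),  # resampling happens only during fit
    ('var', VarianceThreshold()),
    ('sel', SelectKBest(k='all', score_func=chi2)),
    ('clf', LogisticRegression()),  # placeholder for your model
])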

Parameter Tuning for ML model with column transformer and pipeline

My code works perfectly until the fitting of the final model. But I have no idea how to do GridSearchCV or RandomizedSearchCV for the pipeline. Kindly help me.
import pandas as pd
import numpy as np
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import make_pipeline
df = pd.read_csv('data/vehicle_dataset_v4A.csv')
X = df.drop('price', axis=1)
y = df['price']
numerical_ix = X.select_dtypes(include=['int64', 'float64']).columns
categorical_ix = X.select_dtypes(include=['object', 'bool']).columns
col_transform = make_column_transformer(
    (OneHotEncoder(), categorical_ix),
    (StandardScaler(), numerical_ix),
    remainder='passthrough'
)
model = RandomForestRegressor()
pipe = make_pipeline(col_transform, model)
pipe.fit(X, y)
I tried the following code. It runs without any error, but when I try to make predictions with GridSearchCV it throws different errors at different times. I hope there is a solution for this. Otherwise, if I can find out the best parameters after the grid search, I can apply them to the model directly.
lr = {
    'base_score': [0.4, 0.45, 0.5, 0.55, 0.6],
    'max_depth': [1, 2, 3, 4, 6, 8, 10],
    'subsample': [0.5, 0.6, 0.7, 0.8, 0.9, 1],
    'n_estimators': [50, 100, 200, 250, 300],
    'learning_rate': [0.05, 0.1, 0.4, 0.5, 0.8, 0.9, 1],
    'min_child_weight': [0.1, 0.5, 1, 1.5, 2, 3],
    'gamma': [0, 0.1, 0.5, 1, 1.5, 2, 2.5, 3]
}
clf = make_pipeline(OneHotEncoder(),
                    StandardScaler(with_mean=False),
                    GridSearchCV(RandomForestRegressor(),
                                 param_grid=lr,
                                 scoring='r2', cv=3, verbose=2))
Three thoughts on your application:
Do not use OneHotEncoder for RandomForestRegressor; you don't need it.
Do not use make_pipeline; it is overkill for your problem.
First apply StandardScaler to the data, and then run GridSearchCV.
Please test this and give us feedback on it.
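If you would rather keep the ColumnTransformer preprocessing and still tune the forest, a common alternative (my sketch, not part of the answer above) is to wrap the whole pipeline in GridSearchCV and prefix each parameter with its step name. Note that parameters such as learning_rate, gamma, subsample and min_child_weight belong to gradient-boosting models, not to RandomForestRegressor, so they would have to be dropped:
from sklearn.model_selection import GridSearchCV

# with make_pipeline, the step name is the lower-cased class name
param_grid = {
    'randomforestregressor__n_estimators': [100, 200, 300],
    'randomforestregressor__max_depth': [None, 5, 10],
}
search = GridSearchCV(pipe, param_grid=param_grid, scoring='r2', cv=3, verbose=2)
search.fit(X, y)
print(search.best_params_)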
