Parameter Tuning for ML model with column transformer and pipeline - python

My code works perfectly until the fitting of the final model. But I have no idea how to do GridSearchCV or RandomizedSearchCV for the pipeline. Kindly help me.
import pandas as pd
import numpy as np
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import make_pipeline
df = pd.read_csv('data/vehicle_dataset_v4A.csv')
X = df.drop('price', axis=1)
y = df['price']
numerical_ix = X.select_dtypes(include=['int64', 'float64']).columns
categorical_ix = X.select_dtypes(include=['object', 'bool']).columns
col_transform = make_column_transformer(
    (OneHotEncoder(), categorical_ix),
    (StandardScaler(), numerical_ix),
    remainder='passthrough'
)
model = RandomForestRegressor()
pipe = make_pipeline(col_transform, model)
pipe.fit(X, y)
I tried the following code. It runs without any error, but when I try to make predictions with GridSearchCV it throws different errors at different times. I hope there is a solution for this. Otherwise, if I can find out what the best parameters are after the grid search, I can apply those parameters to the model directly.
lr = {
    'base_score': [0.4, 0.45, 0.5, 0.55, 0.6],
    'max_depth': [1, 2, 3, 4, 6, 8, 10],
    'subsample': [0.5, 0.6, 0.7, 0.8, 0.9, 1],
    'n_estimators': [50, 100, 200, 250, 300],
    'learning_rate': [0.05, 0.1, 0.4, 0.5, 0.8, 0.9, 1],
    'min_child_weight': [0.1, 0.5, 1, 1.5, 2, 3],
    'gamma': [0, 0.1, 0.5, 1, 1.5, 2, 2.5, 3]
}
clf = make_pipeline(OneHotEncoder(),
                    StandardScaler(with_mean=False),
                    GridSearchCV(RandomForestRegressor(),
                                 param_grid=lr,
                                 scoring='r2', cv=3, verbose=2))

Three thoughts on your application:
Do not use OneHotEncoder for RandomForestRegressor; you don't need it.
Do not use make_pipeline; it is overkill for your problem.
First apply StandardScaler to the data, and then run GridSearchCV.
Please test this and give us feedback on it.
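For reference, GridSearchCV can also wrap the whole pipeline from the question directly; hyperparameters are then addressed as <step name>__<parameter>, the same double-underscore convention used in the Lasso/Ridge answer below. A minimal sketch, with illustrative grid values and restricted to parameters that RandomForestRegressor actually accepts (it has no learning_rate, gamma, subsample, min_child_weight or base_score):
from sklearn.model_selection import GridSearchCV
# make_pipeline names each step after its class in lowercase,
# so the model step is called 'randomforestregressor'.
param_grid = {
    'randomforestregressor__n_estimators': [50, 100, 200, 300],
    'randomforestregressor__max_depth': [4, 8, 10, None],
    'randomforestregressor__min_samples_leaf': [1, 2, 5],
}
search = GridSearchCV(pipe, param_grid, scoring='r2', cv=3, verbose=2)
search.fit(X, y)
print(search.best_params_)
print(search.best_score_)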

Related

How do you use GridSearchCV to search values for alpha in lasso and ridge regression models?

I'm doing an exercise on using sklearn Pipelines and GridSearchCV to find values for alpha in lasso and ridge regression models, where we also use SimpleImputer to take care of some missing values.
However, I keep getting the following traceback error:
ValueError: Invalid parameter alpha for estimator Pipeline
However, I'm certain that Sklearn's Ridge and Lasso both have the alpha parameter and have double checked on the sklearn docs.
Where am I going wrong? My code is below:
import numpy as np
import pandas as pd
from sklearn import svm, datasets
from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder, StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn import metrics
columns = ["sex","length","diam","height","whole","shucked","viscera","shell","age"]
df = pd.read_csv("http://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.data",names=columns)
y = df.age
X=df.drop(columns=['age'])
num_cols = X.select_dtypes(include=np.number).columns
cat_cols = X.select_dtypes(include=['object']).columns
#create some missing values
for i in range(1000):
    X.loc[np.random.choice(X.index), np.random.choice(X.columns)] = np.nan
x_train, x_test, y_train, y_test = train_test_split(X,y, random_state=0, test_size=0.25)
cat_vals = Pipeline([("imputer",SimpleImputer(strategy='most_frequent')), ("ohe",OneHotEncoder(sparse=False, drop='first'))])
num_vals = Pipeline([("imputer",SimpleImputer(strategy='mean')), ("scale",StandardScaler())])
preprocess = ColumnTransformer(
    transformers=[
        ("cat_process", cat_vals, cat_cols),
        ("num_process", num_vals, num_cols)
    ]
)
# Create a pipeline with preprocess and a linear regression model
pipeline = Pipeline([("preprocess", preprocess),
                     ("regr", LinearRegression())])
# simple parameter grid, with and without the intercept
param_grid = {
    'regr': [LinearRegression()],
    "regr__fit_intercept": [True, False]
}
# 1. Update the dictionary to include values for alpha in lasso and ridge regression models. Use np.logspace(-4,2,10).
param_grid['alpha'] = np.logspace(-4,2,10)
param_grid['regr'] = [Lasso(), Ridge()]
# 2. Fit the GridSearchCV on the training data and print the best estimator and score from the search.
gscv = GridSearchCV(pipeline, param_grid, scoring='neg_mean_squared_error', cv=5)
gscv.fit(x_train, y_train)
print(gscv.best_params_)
print(gscv.best_score_)
print(gscv.best_estimator_)
The parameter name should be given a prefix according to the pipeline stage name, i.e. regr__alpha - just like you did for the fit_intercept.
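With that fix, the grid from the question becomes:
param_grid = {
    'regr': [Lasso(), Ridge()],
    'regr__fit_intercept': [True, False],
    'regr__alpha': np.logspace(-4, 2, 10)
}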

Why are my transformers in a pipeline in a ColumnTransformer missing the fitted attributes?

I have a pipeline inside a ColumnTransformer. One of the transformers is a PCA. When I use fit and then transform, the data looks right and everything works. But when I try to access the explained_variance_ratio_ of the PCA in the pipeline after the fit, the attribute does not exist. All my other transformers in the pipeline are also missing the attributes they should have after fitting. What am I doing wrong?
The code looks like this:
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
import pandas as pd
def transform(df: pd.DataFrame, cat_cols, log_cols, passthrough_cols):
    oh_enc = OneHotEncoder(handle_unknown='ignore')
    transformer_oh = ColumnTransformer([('cat_cols', oh_enc, cat_cols)], remainder='passthrough')
    scaler = StandardScaler()
    pca = PCA(n_components=5)
    pipe = Pipeline([("preprocessing", transformer_oh),
                     ("scaling", scaler),
                     ("pca", pca)
                     ])
    to_transform = list(set(df.columns) - set(passthrough_cols))
    transformer = ColumnTransformer([("pipe", pipe, to_transform)], remainder='passthrough')
    transformer = transformer.fit(df)
    pca2 = transformer.transformers[0][1].steps[2][1]
    print(pca2.explained_variance_ratio_)  # AttributeError: 'PCA' object has no attribute 'explained_variance_ratio_'
To access the fitted transformers in a fitted ColumnTransformer you have to use the attribute transformers_, not transformers. With that change everything works fine.
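For example, the last two lines of the function could read as follows (a small sketch; named_transformers_ is an equivalent lookup by name):
    pca2 = transformer.transformers_[0][1].named_steps['pca']  # or transformer.named_transformers_['pipe'].named_steps['pca']
    print(pca2.explained_variance_ratio_)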

Pre-process y column in a sklearn pipeline before classification

As part of a sklearn pipeline, I'd like to bin my response variable into a variable with k ordinal categories and then do classification on these categories. I found KBinsDiscretizer, which seems to perform this transformation, but it only seems to work on feature columns, not on the target column.
Reproducible example
import sklearn
from sklearn.compose import make_column_transformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
import pandas as pd
from sklearn.datasets import load_boston
data = load_boston()
df = pd.DataFrame(data['data'], columns=data['feature_names'])
df['target'] = data['target']
binarizer_col_y = make_column_transformer(
    [sklearn.preprocessing.KBinsDiscretizer(n_bins=3, encode='ordinal'), ['target']],
    remainder='passthrough'
)
pipeline = Pipeline(steps=[
    ('preprocess', binarizer_col_y),
    ('ols', LinearRegression())
])
pipeline.fit(df[data['feature_names']], df['target'])
This errors with
pandas._libs.hashtable.PyObjectHashTable.get_item()
KeyError: 'target'
The above exception was the direct cause of the following exception:
...
[ another key error for 'target']
I also found sklearn.compose.TransformedTargetRegressor to transform the response (but I want to do classification), and that I can write my own transformers, but they apparently only modify X, not y.
Can anyone tell me how to modify y in a pre-processing step prior to classification as part of a pipeline?
Why inside the pipeline?
The idea is to move as many transformations into the pipeline as possible, reducing boilerplate code, avoiding data leaks, and simplifying model deployment (e.g. services like the Databricks model registry can deploy a sklearn model with the pre-processing expected to happen inside the model).
You get the error because the transformation is applied only to X, not to y, so the target column is not available inside the pipeline.
A sklearn pipeline does not support transforming the target y in the way you tried to write it.
However, there is sklearn.compose.TransformedTargetRegressor, which wraps a model and can be given instructions on how to transform the target.
A warning: it is not well supported and I found many issues when trying to work with it on a real project. You may prefer manual target-transformation steps instead.
Here is a little demo that might work for you.
import sklearn
from sklearn.compose import make_column_transformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
import pandas as pd
from sklearn.datasets import load_boston
from sklearn.compose import TransformedTargetRegressor
data = load_boston()
df = pd.DataFrame(data["data"], columns=data["feature_names"])
X = df[data["feature_names"]]
y = data["target"]
pipeline = Pipeline(
    steps=[
        (
            "ols",
            TransformedTargetRegressor(
                LinearRegression(),
                transformer=sklearn.preprocessing.KBinsDiscretizer(
                    n_bins=3, encode="ordinal"
                ),
            ),
        )
    ]
)
pipeline.fit(X, y)
pipeline.predict(X)
Or, a more readable snippet that shows how to create the target transformer:
model = LinearRegression()
kbins = sklearn.preprocessing.KBinsDiscretizer(n_bins=3, encode="ordinal")
ttr = TransformedTargetRegressor(model, transformer=kbins)
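The ttr can then be dropped into a pipeline and fitted exactly as in the demo above:
pipeline = Pipeline(steps=[("ols", ttr)])
pipeline.fit(X, y)
pipeline.predict(X)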

How do cross_val_score and GridSearchCV work?

I am new to Python and I have been trying to figure out how GridSearchCV and cross_val_score work.
After finding odd results I set up a sort of validation experiment, but I still do not understand what I am doing wrong.
To try to simplify, I am using GridSearchCV in the simplest possible way and trying to validate and understand what is happening:
Here it is:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, RobustScaler, QuantileTransformer
from sklearn.feature_selection import SelectKBest, f_regression, RFECV
from sklearn.decomposition import PCA
from sklearn.linear_model import RidgeCV,Ridge, LinearRegression
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.model_selection import GridSearchCV,KFold,TimeSeriesSplit,PredefinedSplit,cross_val_score
from sklearn.metrics import mean_squared_error,make_scorer,r2_score,mean_absolute_error,mean_squared_error
from math import sqrt
I create a cross-validation object (for GridSearchCV and cross_val_score) and a train/test dataset for the pipeline and simple linear regression. I have checked that the two datasets are identical:
train_indices = np.full((15,), -1, dtype=int)
test_indices = np.full((6,), 0, dtype=int)
test_fold = np.append(train_indices, test_indices)
kf = PredefinedSplit(test_fold)
for train_index, test_index in kf.split(X):
    print('TRAIN:', train_index, 'TEST:', test_index)
    X_train_kf = X[train_index]
    X_test_kf = X[test_index]
train_data = list(range(0,15))
test_data = list(range(15,21))
X_train, y_train = X[train_data,:], y[train_data]
X_test, y_test = X[test_data,:], y[test_data]
Here is what I do:
instantiate a simple linear model and use it with the manual set of data
lr=LinearRegression()
lm=lr.fit(X,y)
lmscore_train=lm.score(X_train,y_train)
->r2=0.4686662249071524
lmscore_test=lm.score(X_test,y_test)
->r2 0.6264021467338086
Now I try to do the exact same thing using a pipeline:
pipe_steps = ([('est', LinearRegression())])
pipe=Pipeline(pipe_steps)
p=pipe.fit(X,y)
pscore_train=p.score(X_train,y_train)
->r2=0.4686662249071524
pscore_test=p.score(X_test,y_test)
->r2 0.6264021467338086
The LinearRegression and the pipeline match perfectly.
Now I try to do the same using cross_val_score with the predefined split kf:
cv_scores = cross_val_score(lm, X, y, cv=kf)
->r2 = -1.234474757883921470e+01?!?! (this is supposed to be the test score)
Now let's try gridsearchCV
scoring = {'r_squared':'r2'}
grid_parameters = [{}]
gridsearch=GridSearchCV(p, grid_parameters, verbose=3,cv=kf,scoring=scoring,return_train_score='true',refit='r_squared')
gs=gridsearch.fit(X,y)
results=gs.cv_results_
from cv_results_ I get once again
->mean_test_r_squared->r2->-1.234474757883921292e+01
So cross_val_score and the grid search match one another in the end, but the score is totally off and different from what it should be.
Will you please help me solve this puzzle?
cross_val_score and GridSearchCV will first split the data, train the model on the training data only, and then score on the test data.
Here you are training on the full data and then scoring on the test data, hence you don't match the results of cross_val_score.
Instead of this:
lm=lr.fit(X,y)
Try this:
lm=lr.fit(X_train, y_train)
Same for pipeline:
Instead of p=pipe.fit(X,y), do this:
p=pipe.fit(X_train, y_train)
You can look at my answers for more detail:
https://stackoverflow.com/a/42364900/3374996
https://stackoverflow.com/a/42230764/3374996
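Putting it together, a minimal sketch of the corrected comparison (using the variables already defined in the question):
lm = lr.fit(X_train, y_train)                 # fit on the training fold only
print(lm.score(X_test, y_test))               # manual test-fold r2
cv_scores = cross_val_score(LinearRegression(), X, y, cv=kf)
print(cv_scores)                              # the single predefined-split score should now agree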

Include feature extraction in pipeline sklearn

For a text classification project I made a pipeline for the feature selection and the classifier. Now my question is whether it is possible to include the feature extraction module in the pipeline, and how. I looked some things up, but it doesn't seem to fit with my current code.
This is what I have now:
# feature_extraction module.
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.feature_extraction import DictVectorizer
import numpy as np
vec = DictVectorizer()
X = vec.fit_transform(instances)
scaler = StandardScaler(with_mean=False) # we use cross validation, no train/test set
X_scaled = scaler.fit_transform(X) # To make sure everything is on the same scale
enc = LabelEncoder()
y = enc.fit_transform(labels)
# Feature selection and classification pipeline
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn import model_selection
from sklearn.metrics import classification_report
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn import linear_model
from sklearn.pipeline import Pipeline
feat_sel = SelectKBest(mutual_info_classif, k=200)
clf = linear_model.LogisticRegression()
pipe = Pipeline([('mutual_info', feat_sel), ('logistregress', clf)])
y_pred = model_selection.cross_val_predict(pipe, X_scaled, y, cv=10)
How can I include everything from the DictVectorizer up to the LabelEncoder in the pipeline?
Here's how you would do it. Assuming instances is an iterable of dict-like objects, as the DictVectorizer API expects, just build your pipeline like so:
pipe = Pipeline([('vectorizer', DictVectorizer()),
                 ('scaler', StandardScaler(with_mean=False)),
                 ('mutual_info', feat_sel),
                 ('logistregress', clf)])
To predict, then call cross_val_predict, passing instances as X:
y_pred = model_selection.cross_val_predict(pipe, instances, y, cv=10)
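Note that the LabelEncoder for the target stays outside the pipeline, since pipeline steps only transform X (the same limitation discussed in the target-transformation question above), so the target is still encoded separately:
enc = LabelEncoder()
y = enc.fit_transform(labels)  # target encoding happens outside the pipeline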
