Invalid parameter n_neighbors for estimator Pipeline - python

I'm pretty new to Python. I've run into an issue below that I really need help with:
df = pd.read_csv('train.csv')  # Titanic dataset from Kaggle
df = df.loc[df.Embarked.notna(), ['Survived', 'Pclass', 'Sex', 'SibSp', 'Embarked']]
X = df.drop('Survived', axis='columns')
y = df.Survived

column_trans = make_column_transformer(
    (OneHotEncoder(), ['Sex', 'Embarked']),
    remainder='passthrough')
column_trans.fit_transform(X)

train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)

param_grid = dict(n_neighbors=k_range)
knn = KNeighborsClassifier()
pipe = make_pipeline(column_trans, knn)
grid = GridSearchCV(pipe, param_grid, cv=10, scoring='accuracy')
grid.fit(train_X, train_y)  # this line gives me an error
The last line gives me an error of:
ValueError: Invalid parameter n_neighbors for estimator Pipeline(memory=None,
    steps=[('columntransformer',
            ColumnTransformer(n_jobs=None, remainder='passthrough',
                              sparse_threshold=0.3,
                              transformer_weights=None,
                              transformers=[('onehotencoder',
                                             OneHotEncoder(categories='auto',
                                                           drop=None,
                                                           dtype=<class 'numpy.float64'>,
                                                           handle_unknown='error',
                                                           sparse=True),
                                             ['Sex', 'Embarked'])],
                              verbose=False)),
           ('kneighborsclassifier',
            KNeighborsClassifier(algorithm='auto', leaf_size=30,
                                 metric='minkowski', metric_params=None,
                                 n_jobs=None, n_neighbors=5, p=2,
                                 weights='uniform'))],
    verbose=False). Check the list of available parameters with `estimator.get_params().keys()`.
What am I doing wrong here? Is it just not possible to do one-hot encoding, KNN, and a pipeline simultaneously?

Parameters of pipeline steps are set using __-separated parameter names (step name, double underscore, parameter name). The way you have defined your pipeline also needs a revision. Please refer to the modified code below:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import make_column_transformer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

df = pd.read_csv("titanic.csv")
df = df.drop(["Name"], axis=1)
X = df.drop('Survived', axis='columns')
y = df.Survived

train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)

column_trans = make_column_transformer(
    (OneHotEncoder(), ['Sex']),
    remainder='passthrough')
knn = KNeighborsClassifier()
pipe = Pipeline(steps=[('column_trans', column_trans), ('knn', knn)])

param_grid = {
    'knn__n_neighbors': [2, 5, 15, 30, 45, 64]
}
grid = GridSearchCV(pipe, param_grid, cv=10, scoring='accuracy')
grid.fit(train_X, train_y)
grid.best_params_
# {'knn__n_neighbors': 5}
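If you prefer to keep make_pipeline as in your original code, note that it names each step after its lowercased class name, so the grid key must be kneighborsclassifier__n_neighbors rather than n_neighbors. A minimal sketch of that variant, reusing column_trans and knn from above:

from sklearn.pipeline import make_pipeline

# make_pipeline auto-names steps after the lowercased class name,
# so the KNN step here is called 'kneighborsclassifier'
pipe = make_pipeline(column_trans, knn)
print(pipe.get_params().keys())  # lists every valid parameter path

param_grid = {'kneighborsclassifier__n_neighbors': [2, 5, 15, 30, 45, 64]}
grid = GridSearchCV(pipe, param_grid, cv=10, scoring='accuracy')
grid.fit(train_X, train_y)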

Related

Scikit Learn Pipeline with SMOTE

I would like to create a Pipeline with SMOTE() inside, but I can't figure out where to implement it.
My target variable is imbalanced, and without SMOTE I get very bad results.
My code:
df_n = df[['user_id', 'signup_day', 'signup_month', 'signup_year',
           'purchase_day', 'purchase_month', 'purchase_year', 'purchase_value',
           'source', 'browser', 'sex', 'age', 'is_fraud']]

# Define X and y:
X = df_n.drop(['is_fraud'], axis=1)
y = df_n.is_fraud

# split into 70:30 ratio
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
print(Counter(y_train))  # Counter({0: 95844, 1: 9934})

numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant')),
    ('encoder', OrdinalEncoder())
])

numeric_features = ['user_id', 'signup_day', 'signup_month', 'signup_year',
                    'purchase_day', 'purchase_month', 'purchase_year', 'purchase_value', 'age']
categorical_features = ['source', 'browser', 'sex']

preprocessor = ColumnTransformer(
    transformers=[
        ('numeric', numeric_transformer, numeric_features),
        ('categorical', categorical_transformer, categorical_features)
    ])

regressors = [
    RandomForestRegressor(),
    LogisticRegression(),
    DecisionTreeClassifier(),
    KNeighborsClassifier(),
    LinearSVC(random_state=42)]

for regressor in regressors:
    pipeline = Pipeline(steps=[
        ('preprocessor', preprocessor),
        ('regressor', regressor)
    ])
    model = pipeline.fit(X_train, y_train)
    predictions = model.predict(X_test)
    print(regressor)
    print(r2_score(y_test, predictions))
My results:
RandomForestRegressor()
0.48925960579049166
LogisticRegression()
0.24151543370722806
DecisionTreeClassifier()
-0.14622417739659155
KNeighborsClassifier()
0.3542030752350408
LinearSVC(random_state=42)
-0.10256098450762474
You can use the code below for adding SMOTE to a pipeline (it needs some tweaking for your data). Note that the Pipeline must come from imblearn, not sklearn, because sklearn's Pipeline does not accept samplers as steps:

from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# define pipeline
model = DecisionTreeClassifier()
over = SMOTE(sampling_strategy=0.1)
under = RandomUnderSampler(sampling_strategy=0.5)
steps = [('over', over), ('under', under), ('model', model)]
pipeline = Pipeline(steps=steps)

# evaluate pipeline
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
scores = cross_val_score(pipeline, X, y, scoring='roc_auc', cv=cv, n_jobs=-1)

Alternatively, treat SMOTE separately, outside the pipeline. What you can do is use a modification of the SMOTE algorithm, called SMOTE-N (see https://imbalanced-learn.org/dev/over_sampling.html#smote-variants), which works when all features are categorical: it modifies SMOTE to generate new samples from nominal feature values rather than by interpolating continuous ones.

from imblearn.over_sampling import SMOTEN

sampler = SMOTEN(random_state=0)
Xsm, ysm = sampler.fit_resample(X, y)
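To place the sampler at the right spot in the question's own pipeline: imblearn's Pipeline accepts samplers as intermediate steps and calls fit_resample on them during fit only, so the test data is never resampled. A minimal sketch, assuming the preprocessor, X_train/y_train, and X_test/y_test from the question:

from imblearn.pipeline import Pipeline as ImbPipeline
from imblearn.over_sampling import SMOTE
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import f1_score

# The sampler sits between preprocessing and the model; resampling
# happens during fit only and is skipped at predict time.
clf = ImbPipeline(steps=[
    ('preprocessor', preprocessor),      # ColumnTransformer from the question
    ('smote', SMOTE(random_state=42)),
    ('model', DecisionTreeClassifier())
])
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print(f1_score(y_test, y_pred))

Also, since is_fraud is a binary target, a classification metric such as F1 or ROC AUC is more informative than the r2_score used in the question's loop.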

How to use StandardScaler inside a pipeline only on certain values?

I have a problem. I want to use StandardScaler(), but my dataset contains one-hot encoded values and other values that should not be scaled. If I run StandardScaler() on everything, all the values get scaled. So is there an option to run this method on only certain values inside a pipeline?
I found this question: One-Hot-Encode categorical variables and scale continuous ones simultaneously, with the code below:
columns = ['rank']
columns_to_scale = ['gre', 'gpa']
scaler = StandardScaler()
ohe = OneHotEncoder(sparse=False)
# Concatenate (Column-Bind) Processed Columns Back Together
processed_data = np.concatenate([scaled_columns, encoded_columns], axis=1)
So is there a way to run StandardScaler() inside a pipeline on only certain columns, with the remaining columns merged back alongside the scaled ones?
In my case the pipeline should apply StandardScaler only to the columns 'xy' and 'xyz'.
StandardScaler Class
from sklearn.base import BaseEstimator, TransformerMixin

class StandardScaler_with_certain_features(BaseEstimator, TransformerMixin):
    def __init__(self, columns_to_scale):
        scaler = StandardScaler()

    def fit(self, X, y=None):
        scaler.fit(X_train)  # only fit the scaler on the train set
        X_train_nor = scaler.transform(X_train.values)

    def transform(self, X, y=None):
        return X
Pipeline
columns_to_scale = ['xy', 'xyz']
steps = [('standard_scaler', StandardScaler_with_certain_features(columns_to_scale)),
         ('feature_selection', SelectFromModel(estimator=LogisticRegression(max_iter=100))),
         ('lasso', Lasso(alpha=0.03))]
pipeline = Pipeline(steps)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=30)
parameteres = {}
grid = GridSearchCV(pipeline, param_grid=parameteres, cv=5)
grid.fit(X_train, y_train)
print("score = %3.2f" % (grid.score(X_test, y_test)))
print('Training set score: ' + str(grid.score(X_train, y_train)))
print('Test set score: ' + str(grid.score(X_test, y_test)))

# Prediction
y_pred = grid.predict(X_test)
print("RMSE Val:", metrics.mean_squared_error(y_test, y_pred, squared=False))
You can include a ColumnTransformer in the Pipeline in order to apply the StandardScaler only to certain columns. You need to set remainder='passthrough' to make sure that the columns that are not scaled are concatenated with the ones that are scaled.
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.linear_model import Lasso

df = pd.DataFrame({
    'y': np.random.normal(0, 1, 100),
    'x': np.random.normal(0, 1, 100),
    'z': np.random.normal(0, 1, 100),
    'xy': np.random.normal(2, 3, 100),
    'xyz': np.random.normal(4, 5, 100),
})

X = df.drop(labels=['y'], axis=1)
y = df['y']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=30)

preprocessor = ColumnTransformer(
    transformers=[('scaler', StandardScaler(), ['xy', 'xyz'])],
    remainder='passthrough'
)

pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('lasso', Lasso(alpha=0.03))
])

pipeline.fit(X_train, y_train)
pipeline.score(X_test, y_test)
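Since the question also runs its pipeline through GridSearchCV, note that the same double-underscore convention reaches parameters nested inside the ColumnTransformer. A short sketch, assuming the pipeline defined above (the grid values are illustrative):

from sklearn.model_selection import GridSearchCV

# Step names chain with '__': pipeline -> 'preprocessor' -> 'scaler' -> parameter
param_grid = {
    'preprocessor__scaler__with_mean': [True, False],  # illustrative values
    'lasso__alpha': [0.01, 0.03, 0.1],
}
grid = GridSearchCV(pipeline, param_grid=param_grid, cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_)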

How to access ColumnTransformer elements in GridSearchCV

I wanted to find out the correct naming convention for referring to an individual preprocessor inside a ColumnTransformer (which is itself part of a pipeline) in the param_grid for grid search.
Environment & sample data:
import seaborn as sns
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, KBinsDiscretizer, MinMaxScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

df = sns.load_dataset('titanic')[['survived', 'age', 'embarked']]
X_train, X_test, y_train, y_test = train_test_split(df.drop(columns='survived'), df['survived'],
                                                    test_size=0.2, random_state=123)
Pipeline:
num = ['age']
cat = ['embarked']

num_transformer = Pipeline(steps=[('imputer', SimpleImputer()),
                                  ('discritiser', KBinsDiscretizer(encode='ordinal', strategy='uniform')),
                                  ('scaler', MinMaxScaler())])
cat_transformer = Pipeline(steps=[('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
                                  ('onehot', OneHotEncoder(handle_unknown='ignore'))])

preprocessor = ColumnTransformer(transformers=[('num', num_transformer, num),
                                               ('cat', cat_transformer, cat)])
pipe = Pipeline(steps=[('preprocessor', preprocessor),
                       ('classiffier', LogisticRegression(random_state=1, max_iter=10000))])

param_grid = dict([SOMETHING]imputer__strategy=['mean', 'median'],
                  [SOMETHING]discritiser__nbins=range(5, 10),
                  classiffier__C=[0.1, 10, 100],
                  classiffier__solver=['liblinear', 'saga'])

grid_search = GridSearchCV(pipe, param_grid=param_grid, cv=10)
grid_search.fit(X_train, y_train)
Basically, what should I write instead of [SOMETHING] in my code?
I have looked at this answer, which answered the question for make_pipeline; using a similar idea I tried 'preprocessor__num__', 'preprocessor__num_', 'pipeline__num__', and 'pipeline__num_', with no luck so far.
Thank you
You were close, the correct way to declare it is like this:
param_grid = {'preprocessor__num__imputer__strategy': ['mean', 'median'],
              'preprocessor__num__discritiser__n_bins': range(5, 10),
              'classiffier__C': [0.1, 10, 100],
              'classiffier__solver': ['liblinear', 'saga']}
Here is the full code:
import seaborn as sns
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, KBinsDiscretizer, MinMaxScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

df = sns.load_dataset('titanic')[['survived', 'age', 'embarked']]
X_train, X_test, y_train, y_test = train_test_split(df.drop(columns='survived'), df['survived'],
                                                    test_size=0.2, random_state=123)

num = ['age']
cat = ['embarked']

num_transformer = Pipeline(steps=[('imputer', SimpleImputer()),
                                  ('discritiser', KBinsDiscretizer(encode='ordinal', strategy='uniform')),
                                  ('scaler', MinMaxScaler())])
cat_transformer = Pipeline(steps=[('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
                                  ('onehot', OneHotEncoder(handle_unknown='ignore'))])

preprocessor = ColumnTransformer(transformers=[('num', num_transformer, num),
                                               ('cat', cat_transformer, cat)])
pipe = Pipeline(steps=[('preprocessor', preprocessor),
                       ('classiffier', LogisticRegression(random_state=1, max_iter=10000))])

param_grid = {'preprocessor__num__imputer__strategy': ['mean', 'median'],
              'preprocessor__num__discritiser__n_bins': range(5, 10),
              'classiffier__C': [0.1, 10, 100],
              'classiffier__solver': ['liblinear', 'saga']}

grid_search = GridSearchCV(pipe, param_grid=param_grid, cv=10)
grid_search.fit(X_train, y_train)
One simple way to check the available parameter names is:
print(pipe.get_params().keys())
This will print out the list of all the available parameters which you can copy directly into your params dictionary.
I have written a utility function which you can use to check whether a parameter exists in a pipeline/classifier by simply passing in a keyword.
def check_params_exist(estimator, params_keyword):
    all_params = estimator.get_params().keys()
    available_params = [x for x in all_params if params_keyword in x]
    if len(available_params) == 0:
        return "No matching params found!"
    else:
        return available_params
Now, if you are unsure of the exact name, just pass 'imputer' as the keyword:
print(check_params_exist(pipe, 'imputer'))
This will print the following list:
['preprocessor__num__imputer',
'preprocessor__num__imputer__add_indicator',
'preprocessor__num__imputer__copy',
'preprocessor__num__imputer__fill_value',
'preprocessor__num__imputer__missing_values',
'preprocessor__num__imputer__strategy',
'preprocessor__num__imputer__verbose',
'preprocessor__cat__imputer',
'preprocessor__cat__imputer__add_indicator',
'preprocessor__cat__imputer__copy',
'preprocessor__cat__imputer__fill_value',
'preprocessor__cat__imputer__missing_values',
'preprocessor__cat__imputer__strategy',
'preprocessor__cat__imputer__verbose']
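Once you know the exact name, the same double-underscore path also works outside of a grid search, via set_params:

# The same '__' paths work for setting values directly;
# 'classiffier' matches the (misspelled) step name defined above
pipe.set_params(preprocessor__num__imputer__strategy='median',
                classiffier__C=10)
pipe.fit(X_train, y_train)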

OneHotEncoder for categorical feature "day of week" results in ValueError

I want to define a Pipeline with a OneHotEncoder for the day_of_week column. I don't understand why I get a ValueError:
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder

if __name__ == '__main__':
    data_dict = {
        'age': [1, 2, 3],
        'day_of_week': ['monday', 'tuesday', 'wednesday'],
        'y': [5, 6, 7]
    }
    data = pd.DataFrame(data_dict, columns=data_dict)

    numeric_features = ['age']
    numeric_transformer = Pipeline(steps=[
        ('scaler', StandardScaler())])

    categorical_features = ['day_of_week']
    print(categorical_features)
    categorical_transformer = Pipeline(steps=[
        ('onehot', OneHotEncoder(handle_unknown='ignore', categories='auto'))])

    preprocessor = ColumnTransformer(
        transformers=[
            ('numerical', numeric_transformer, numeric_features),
            ('categorical', categorical_transformer, categorical_features)])

    classifier = Pipeline(
        steps=[
            ('preprocessor', preprocessor),
            ('classifier', RandomForestRegressor(n_estimators=60))])

    X = data.drop(labels=['y'], axis=1)
    y = data['y']

    X_train, y_train, X_test, y_test = train_test_split(X, y, train_size=0.8, random_state=30)
    trained_model = classifier.fit(X_train, y_train)
There is an error on this line:
X_train, y_train, X_test, y_test = train_test_split(X, y, train_size=0.8, random_state=30)
train_test_split returns X_train, X_test, y_train, y_test, in that order; since you assigned them in the wrong order, your classifier throws all kinds of errors.
Try changing it to:
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, random_state=30)
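A quick way to convince yourself of the return order is to check the shapes of the unpacked pieces; a minimal sketch with toy arrays:

import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)   # 10 samples, 2 features
y = np.arange(10)                  # 10 labels

# Return order: X_train, X_test, y_train, y_test
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, random_state=30)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
# (8, 2) (2, 2) (8,) (2,)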
Your code runs without error for me.

Error while fitting model after One Hot Encoding

I am using one-hot encoding (I am aware ordinal encoding is better in this case) for the categorical variables of the Titanic dataset. The one-hot encoding is successfully done. However, fitting the model throws the following error:
ValueError: setting an array element with a sequence.
Here is the code I am running:
import pandas as pd
from sklearn import preprocessing
from sklearn.preprocessing import OneHotEncoder

def one_hot_encode_features(df_train, df_test):
    features = ['Fare', 'Cabin', 'Age', 'Sex']
    #features = ['Cabin', 'Sex']
    df_combined = pd.concat([df_train[features], df_test[features]])
    for feature in features:
        le = preprocessing.LabelEncoder()
        onehot_encoder = OneHotEncoder()
        le = le.fit(df_combined[feature])
        integer_encoding_train = le.transform(df_train[feature])
        integer_encoding_test = le.transform(df_test[feature])
        integer_encoding_train = integer_encoding_train.reshape(len(integer_encoding_train), 1)
        integer_encoding_test = integer_encoding_test.reshape(len(integer_encoding_test), 1)
        df_train[feature] = onehot_encoder.fit_transform(integer_encoding_train)
        df_test[feature] = onehot_encoder.fit_transform(integer_encoding_test)
    return df_train, df_test

data_train, data_test = one_hot_encode_features(data_train, data_test)

from sklearn.model_selection import train_test_split
X = data_train.drop(['Survived', 'PassengerId'], axis=1)
Y = data_train['Survived']
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=23)

from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import make_scorer, accuracy_score
from sklearn.model_selection import GridSearchCV
clf = GaussianNB()
acc_scorer = make_scorer(accuracy_score)
clf.fit(X_train, Y_train)
The error goes away if I use ordinal encoding instead of one-hot encoding. I am new to handling categorical variables, so I cannot figure out the cause.
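A note on the likely cause (my reading of the code, not a confirmed answer): OneHotEncoder.fit_transform returns a 2-D sparse matrix with one column per category, and assigning that whole matrix to a single DataFrame column (df_train[feature] = ...) leaves array-like objects inside individual cells. GaussianNB then fails to convert the frame to a plain numeric array, which is exactly the "setting an array element with a sequence" error; ordinal encoding avoids it because it yields one scalar per cell. A minimal sketch of the idiomatic alternative, letting a ColumnTransformer expand the categorical columns (it assumes missing values, e.g. in Cabin and Age, are already handled, and the column choices are illustrative):

from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.naive_bayes import GaussianNB

# Expand categoricals into extra columns instead of stuffing a 2-D
# matrix back into one DataFrame column. Dense output because GaussianNB
# needs a dense array (the parameter is named sparse_output in newer
# scikit-learn versions, sparse in older ones).
ct = make_column_transformer(
    (OneHotEncoder(handle_unknown='ignore', sparse=False), ['Sex', 'Cabin']),
    remainder='passthrough')

model = make_pipeline(ct, GaussianNB())
model.fit(X_train, Y_train)
print(model.score(X_test, Y_test))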
