I would like to create a Pipeline with SMOTE() inside, but I can't figure out where to implement it.
My target variable is imbalanced; without SMOTE I get very bad results.
My code:
df_n = df[['user_id', 'signup_day', 'signup_month', 'signup_year',
           'purchase_day', 'purchase_month', 'purchase_year', 'purchase_value',
           'source', 'browser', 'sex', 'age', 'is_fraud']]
# Define X and y:
X = df_n.drop(['is_fraud'], axis = 1)
y = df_n.is_fraud
# split into a 70:30 ratio
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 42)
print(Counter(y_train)) #Counter({0: 95844, 1: 9934})
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant')),
    ('encoder', OrdinalEncoder())
])
numeric_features = ['user_id','signup_day', 'signup_month', 'signup_year',
'purchase_day', 'purchase_month', 'purchase_year','purchase_value', 'age']
categorical_features = ['source', 'browser', 'sex']
preprocessor = ColumnTransformer(
    transformers=[
        ('numeric', numeric_transformer, numeric_features),
        ('categorical', categorical_transformer, categorical_features)
    ])
regressors = [
    RandomForestRegressor(),
    LogisticRegression(),
    DecisionTreeClassifier(),
    KNeighborsClassifier(),
    LinearSVC(random_state=42)]
for regressor in regressors:
    pipeline = Pipeline(steps=[
        ('preprocessor', preprocessor),
        ('regressor', regressor)
    ])
    model = pipeline.fit(X_train, y_train)
    predictions = model.predict(X_test)
    print(regressor)
    print(r2_score(y_test, predictions))
My results:
RandomForestRegressor()
0.48925960579049166
LogisticRegression()
0.24151543370722806
DecisionTreeClassifier()
-0.14622417739659155
KNeighborsClassifier()
0.3542030752350408
LinearSVC(random_state=42)
-0.10256098450762474
One suggestion is to treat SMOTE separately, not inside the pipeline, by resampling before fitting (resample only the training split, so that no synthetic samples leak into the test set):
from imblearn.over_sampling import SMOTEN

sampler = SMOTEN(random_state=0)
Xsm, ysm = sampler.fit_resample(X_train, y_train)
You can use the code below for adding SMOTE to a pipeline (it needs some tweaking for your data, though). Note that the Pipeline must be imported from imblearn, not sklearn, because only imblearn's pipeline accepts samplers as steps:
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# define pipeline: oversample the minority class, then undersample the majority
model = DecisionTreeClassifier()
over = SMOTE(sampling_strategy=0.1)
under = RandomUnderSampler(sampling_strategy=0.5)
steps = [('over', over), ('under', under), ('model', model)]
pipeline = Pipeline(steps=steps)

# evaluate pipeline
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
scores = cross_val_score(pipeline, X, y, scoring='roc_auc', cv=cv, n_jobs=-1)
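Adapted to the setup in the question, a minimal sketch could look like this (it reuses the preprocessor and train/test split defined above; note that imblearn's pipeline applies the sampler during fit only, so the test set is never resampled):
from imblearn.pipeline import Pipeline as ImbPipeline
from imblearn.over_sampling import SMOTE
from sklearn.tree import DecisionTreeClassifier

# SMOTE sits after preprocessing (it needs fully numeric input)
# and before the estimator
pipeline = ImbPipeline(steps=[
    ('preprocessor', preprocessor),
    ('smote', SMOTE(random_state=42)),
    ('classifier', DecisionTreeClassifier()),
])
pipeline.fit(X_train, y_train)
predictions = pipeline.predict(X_test)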
What you can do is use a modification of the SMOTE algorithm, called SMOTE-N (see https://imbalanced-learn.org/dev/over_sampling.html#smote-variants), which works when all features are categorical. It modifies the SMOTE algorithm so that new samples are generated from the categories of a point's nearest neighbours rather than by numeric interpolation.
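Since the question's data mixes numeric and categorical columns, the SMOTE-NC variant from the same docs page may be the better fit; a rough sketch, reusing the question's column names and train/test split:
from imblearn.over_sampling import SMOTENC

# SMOTENC needs the positions of the categorical columns within X
cat_idx = [X.columns.get_loc(c) for c in ['source', 'browser', 'sex']]
sampler = SMOTENC(categorical_features=cat_idx, random_state=0)
Xsm, ysm = sampler.fit_resample(X_train, y_train)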
I have this:
Preprocessing
numeric_transformer = Pipeline(
    steps=[("imputer", SimpleImputer(strategy="mean")), ("scaler", StandardScaler())]
)
num=['hrs', 'absences', 'JobInvolvement', 'PerformanceRating', 'EnvironmentSatisfaction', 'JobSatisfaction', 'WorkLifeBalance', 'Age', 'DistanceFromHome', 'Education', 'EducationField', 'JobLevel', 'JobRole', 'MonthlyIncome', 'NumCompaniesWorked', 'PercentSalaryHike', 'StockOptionLevel', 'TotalWorkingYears', 'TrainingTimesLastYear', 'YearsAtCompany', 'YearsSinceLastPromotion', 'YearsWithCurrManager']
categorical_transformer = Pipeline(
    steps=[("imputer", SimpleImputer(strategy="most_frequent")), ("OE", OrdinalEncoder())]  # DROP IF BINARY?
)
cat= ['BusinessTravel', 'Department', 'Gender', 'MaritalStatus']
preprocessor = ColumnTransformer(transformers=[
    ("numericals", numeric_transformer, num),
    ("categoricals", categorical_transformer, cat)], remainder='passthrough')
Function to simplify
def mod(a, b):
    model = Pipeline(
        steps=[("preprocessing", preprocessor), ("select", a), ("clf", b)])
    return model
Starting to create the model
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=100431219)
clf=mod(SelectKBest(chi2),RandomForestClassifier()) # preprocessing, select, clf
param_grid = {
    'preprocessing__numericals__imputer__strategy': ['mean'],
    'preprocessing__numericals__scaler': [MinMaxScaler()],
    'preprocessing__categoricals__imputer__strategy': ['most_frequent'],
    'select__k': list(range(1, 14)),
}
inner = KFold(n_splits=7, shuffle=True, random_state=100431219)
clf = GridSearchCV(
    clf,
    param_grid,
    scoring='accuracy',
    cv=inner,
    n_jobs=4,
    verbose=1,
)
np.random.seed(100431219)
clf.fit(X_train, y_train)
And here I got the error:
trained_pipeline = clf.best_estimator_
print(f"Features selected: {trained_pipeline.named_steps['select'].get_support()}")
print(f"Locations where features selected: {np.where(trained_pipeline.named_steps['select'].get_support())}")
# Feature names before selection (i.e. after preprocessing)
feature_names_before_selection = trained_pipeline.named_steps['preprocessing'].get_feature_names_out() # In this line is the error
print(f"In Scikit-learn 1.x, we can even get the feature names after selection: {trained_pipeline.named_steps['select'].get_feature_names_out(feature_names_before_selection)}")
I obtained the number of features and their positions, but not their names. I want the names.
As you are using sklearn version 0.24, you should refer to the docs for that specific version.
Have a look at this link https://scikit-learn.org/0.24/modules/generated/sklearn.compose.ColumnTransformer.html?highlight=get_feature_names#sklearn.compose.ColumnTransformer.get_feature_names
It says that there is a method ColumnTransformer.get_feature_names; in that version there is no method `get_feature_names_out`.
For more information about this change you can have a look here https://github.com/scikit-learn/scikit-learn/pull/18444
The likely fix to your issue is using this code:
feature_names_before_selection = trained_pipeline.named_steps['preprocessing'].get_feature_names()
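From there, a small sketch to map the selector's mask back to names (assuming every transformer inside the preprocessor implements get_feature_names in that version):
import numpy as np

# names after preprocessing, as an array so the boolean mask can index it
names = np.array(trained_pipeline.named_steps['preprocessing'].get_feature_names())
mask = trained_pipeline.named_steps['select'].get_support()
print(f"Selected feature names: {names[mask]}")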
I'm trying to implement a Sequential Feature Selection and it gives me this error.
Here's my code.
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from category_encoders import TargetEncoder  # assuming category_encoders' implementation
X = df.copy()
y = X.pop("stroke")
Xtrain, X_val, ytrain, y_val = train_test_split(
    X, y, random_state=1, test_size=0.2, shuffle=True
)
categorical_cols = X.select_dtypes("object").columns.tolist()
numerical_cols = X.select_dtypes("float64").columns.tolist()
numerical_transformer = MinMaxScaler()
categorical_transformer = TargetEncoder()
preprocessor = ColumnTransformer(
    remainder="passthrough",
    transformers=[
        ("num", numerical_transformer, numerical_cols),
        ("cat", categorical_transformer, categorical_cols),
    ],
)
Xtrain_pre = preprocessor.fit_transform(Xtrain, ytrain)
sfs = SFS(LogisticRegression(), k_features="best", scoring="precision")
sfs.fit(Xtrain_pre, ytrain)
The output I'm getting:
[screenshot of the error]
I know that there are some similar questions, but I did not see anything using SFS. Can someone help?
I was working with California Housing Prices dataset, and this is what I've done:
import pandas as pd
from sklearn.model_selection import train_test_split
housing = pd.read_csv("housing.csv")
X = housing.drop(["longitude", "latitude", "median_house_value"], axis=1)
y = housing["median_house_value"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
import category_encoders as ce
encoder_list = [ce.WOEEncoder(), ce.OneHotEncoder()]
for encoder in encoder_list:
    numeric_transformer = Pipeline(
        steps=[
            ("imputer", SimpleImputer(strategy="median")),
            ("scaler", StandardScaler()),
        ]
    )
    categorical_transformer = Pipeline(
        steps=[
            ("imputer", SimpleImputer(strategy="constant")),
            ("encoder", encoder),
        ]
    )
    pipe = Pipeline(
        steps=[("preprocessor", preprocessor), ("regressor", LinearRegression())]
    )
    pipe.fit(X_train, y_train)
    pipe.predict(X_test)
    print(encoder)
    print(pipe.score(X_test, y_test))
Why is this generating two similar results? Shouldn't they be different? The same is happening when I try different scalers.
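Judging from the code as posted, a likely cause: the pipeline is built from a preprocessor that is never rebuilt inside the loop, so the freshly created categorical_transformer (and with it each encoder) never reaches the model. A sketch of the fix, assuming numeric_features/categorical_features lists like the ones earlier in this thread:
from sklearn.compose import ColumnTransformer

for encoder in encoder_list:
    categorical_transformer = Pipeline(
        steps=[("imputer", SimpleImputer(strategy="constant")), ("encoder", encoder)]
    )
    # rebuild the ColumnTransformer each iteration so it wraps this encoder
    preprocessor = ColumnTransformer(transformers=[
        ("numeric", numeric_transformer, numeric_features),
        ("categorical", categorical_transformer, categorical_features),
    ])
    pipe = Pipeline(
        steps=[("preprocessor", preprocessor), ("regressor", LinearRegression())]
    )
    pipe.fit(X_train, y_train)
    print(encoder, pipe.score(X_test, y_test))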
I want to define a Pipeline with a OneHotEncoder for the day_of_week column. I don't understand why I get a ValueError:
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
if __name__ == '__main__':
    data_dict = {
        'age': [1, 2, 3],
        'day_of_week': ['monday', 'tuesday', 'wednesday'],
        'y': [5, 6, 7]
    }
    data = pd.DataFrame(data_dict, columns=data_dict)
    numeric_features = ['age']
    numeric_transformer = Pipeline(steps=[
        ('scaler', StandardScaler())])
    categorical_features = ['day_of_week']
    print(categorical_features)
    categorical_transformer = Pipeline(steps=[
        ('onehot', OneHotEncoder(handle_unknown='ignore', categories='auto'))])
    preprocessor = ColumnTransformer(
        transformers=[
            ('numerical', numeric_transformer, numeric_features),
            ('categorical', categorical_transformer, categorical_features)])
    classifier = Pipeline(
        steps=[
            ('preprocessor', preprocessor),
            ('classifier', RandomForestRegressor(n_estimators=60))])
    X = data.drop(labels=['y'], axis=1)
    y = data['y']
    X_train, y_train, X_test, y_test = train_test_split(X, y, train_size=0.8, random_state=30)
    trained_model = classifier.fit(X_train, y_train)
There is an error on this line:
X_train, y_train, X_test, y_test = train_test_split(X, y, train_size=0.8, random_state=30)
train_test_split returns X_train, X_test, y_train, y_test, in that order; since you assigned them in the wrong order, your classifier throws all kinds of errors.
Try changing it to:
X_train,X_test, y_train,y_test = train_test_split(X, y, train_size=0.8, random_state=30)
Your code runs without error for me.
I'm pretty new to Python. I've run into an issue below that I really need help with:
df = pd.read_csv('train.csv') #titanic dataset from Kaggle
df = df.loc[df.Embarked.notna(), ['Survived', 'Pclass', 'Sex', 'SibSp', 'Embarked']]
X = df.drop('Survived', axis='columns')
y = df.Survived
column_trans = make_column_transformer(
    (OneHotEncoder(), ['Sex', 'Embarked']),
    remainder='passthrough')
column_trans.fit_transform(X)
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)
param_grid = dict(n_neighbors=k_range)
knn = KNeighborsClassifier()
pipe = make_pipeline(column_trans, knn)
grid = GridSearchCV(pipe, param_grid, cv=10, scoring='accuracy')
grid.fit(train_X, train_y) #this line gives me an error
The last line gives me an error of:
ValueError: Invalid parameter n_neighbors for estimator Pipeline(memory=None,
steps=[('columntransformer',
ColumnTransformer(n_jobs=None, remainder='passthrough',
sparse_threshold=0.3,
transformer_weights=None,
transformers=[('onehotencoder',
OneHotEncoder(categories='auto',
drop=None,
dtype=<class 'numpy.float64'>,
handle_unknown='error',
sparse=True),
['Sex', 'Embarked'])],
verbose=False)),
('kneighborsclassifier',
KNeighborsClassifier(algorithm='auto', leaf_size=30,
metric='minkowski', metric_params=None,
n_jobs=None, n_neighbors=5, p=2,
weights='uniform'))],
verbose=False). Check the list of available parameters with `estimator.get_params().keys()`.
What am I doing wrong here? Is it just not possible to do oneHot encoding, knn and pipeline simultaneously?
Parameters of pipelines can be set using `__`-separated parameter names; also, the way you have defined your pipeline needs a revision. Please refer to the modified code below:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import make_column_transformer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
df = pd.read_csv("titanic.csv")
df = df.drop(["Name"], axis=1)
X = df.drop('Survived', axis='columns')
y = df.Survived
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)
column_trans = make_column_transformer(
    (OneHotEncoder(), ['Sex']),
    remainder='passthrough')
knn = KNeighborsClassifier()
pipe = Pipeline(steps=[('column_trans', column_trans), ('knn', knn)])
param_grid = {
    'knn__n_neighbors': [2, 5, 15, 30, 45, 64]
}
grid = GridSearchCV(pipe, param_grid, cv=10, scoring='accuracy')
grid.fit(train_X,train_y)
grid.best_params_
#{'knn__n_neighbors': 5}
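If you are unsure which names are valid, the error message itself points at the answer: list the pipeline's tunable parameters with get_params.
# every tunable parameter of the pipeline, step-prefixed, e.g.
# 'knn__n_neighbors' or 'column_trans__onehotencoder__handle_unknown'
print(sorted(pipe.get_params().keys()))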