Why do I have different encoders generating the same results? - python

I was working with the California Housing Prices dataset, and this is what I've done:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
import category_encoders as ce

housing = pd.read_csv("housing.csv")
X = housing.drop(["longitude", "latitude", "median_house_value"], axis=1)
y = housing["median_house_value"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

encoder_list = [ce.WOEEncoder(), ce.OneHotEncoder()]
for encoder in encoder_list:
    numeric_transformer = Pipeline(
        steps=[
            ("imputer", SimpleImputer(strategy="median")),
            ("scaler", StandardScaler()),
        ]
    )
    categorical_transformer = Pipeline(
        steps=[
            ("imputer", SimpleImputer(strategy="constant")),
            ("encoder", encoder),
        ]
    )
    # note: `preprocessor` is referenced below but is not defined anywhere in this snippet
    pipe = Pipeline(
        steps=[("preprocessor", preprocessor), ("regressor", LinearRegression())]
    )
    pipe.fit(X_train, y_train)
    pipe.predict(X_test)
    print(encoder)
    print(pipe.score(X_test, y_test))
Why is this generating two identical results? Shouldn't they be different? The same thing happens when I try different scalers.
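As written, the loop's numeric_transformer and categorical_transformer are never used: pipe is built from a preprocessor that is not recreated inside the loop, so every iteration fits exactly the same model and the encoder choice never takes effect. Below is a minimal sketch of the likely fix, reusing the imports above and assuming preprocessor should be a ColumnTransformer over the housing columns (the column lists are my assumption; also note that WOEEncoder expects a binary target, so it may raise an error on this continuous regression target):
from sklearn.compose import ColumnTransformer

# column lists are assumptions based on the California Housing columns used above
numeric_features = ["housing_median_age", "total_rooms", "total_bedrooms",
                    "population", "households", "median_income"]
categorical_features = ["ocean_proximity"]

for encoder in encoder_list:
    numeric_transformer = Pipeline(steps=[
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler()),
    ])
    categorical_transformer = Pipeline(steps=[
        ("imputer", SimpleImputer(strategy="constant")),
        ("encoder", encoder),
    ])
    # rebuilding the ColumnTransformer inside the loop is the key change:
    # each iteration's pipeline now actually contains the current encoder
    preprocessor = ColumnTransformer(transformers=[
        ("numeric", numeric_transformer, numeric_features),
        ("categorical", categorical_transformer, categorical_features),
    ])
    pipe = Pipeline(steps=[
        ("preprocessor", preprocessor),
        ("regressor", LinearRegression()),
    ])
    pipe.fit(X_train, y_train)
    print(encoder, pipe.score(X_test, y_test))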

Related

Sequential Feature Selection - UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 in labels with no predicted samples

I'm trying to implement Sequential Feature Selection and it gives me this warning.
Here's my code:
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from category_encoders import TargetEncoder  # assuming category_encoders' TargetEncoder

# df is the stroke dataset, loaded elsewhere
X = df.copy()
y = X.pop("stroke")
Xtrain, X_val, ytrain, y_val = train_test_split(
    X, y, random_state=1, test_size=0.2, shuffle=True
)
categorical_cols = X.select_dtypes("object").columns.tolist()
numerical_cols = X.select_dtypes("float64").columns.tolist()
numerical_transformer = MinMaxScaler()
categorical_transformer = TargetEncoder()
preprocessor = ColumnTransformer(
    remainder="passthrough",
    transformers=[
        ("num", numerical_transformer, numerical_cols),
        ("cat", categorical_transformer, categorical_cols),
    ],
)
Xtrain_pre = preprocessor.fit_transform(Xtrain, ytrain)
sfs = SFS(LogisticRegression(), k_features="best", scoring="precision")
sfs.fit(Xtrain_pre, ytrain)
The output I'm getting:
[screenshot: repeated "UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 in labels with no predicted samples"]
I know there are some similar questions, but I did not see anything using SFS. Can someone help?
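The warning usually means that on some cross-validation folds the classifier never predicts the positive class, so precision is 0/0 and scikit-learn sets it to 0.0. A minimal sketch of two common mitigations, assuming a binary, imbalanced stroke target (the zero_division scorer and the class_weight choice are suggestions, not the only fix):
from sklearn.metrics import make_scorer, precision_score

# scorer that returns 0.0 quietly instead of warning when a fold
# contains no predicted positives
precision_scorer = make_scorer(precision_score, zero_division=0)

sfs = SFS(
    LogisticRegression(class_weight="balanced", max_iter=1000),
    k_features="best",
    scoring=precision_scorer,
)
sfs.fit(Xtrain_pre, ytrain)
Reweighting the classes makes the model predict positives more often, which addresses the cause; the custom scorer only suppresses the symptom.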

Scikit Learn Pipeline with SMOTE

I would like to create a Pipeline with SMOTE() inside, but I can't figure out where to implement it.
My target variable is imbalanced; without SMOTE I get very bad results.
My code:
from collections import Counter
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OrdinalEncoder
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC
from sklearn.metrics import r2_score

df_n = df[['user_id', 'signup_day', 'signup_month', 'signup_year',
           'purchase_day', 'purchase_month', 'purchase_year', 'purchase_value',
           'source', 'browser', 'sex', 'age', 'is_fraud']]

# Definition of X and y:
X = df_n.drop(['is_fraud'], axis=1)
y = df_n.is_fraud

# split into 70:30 ratio
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
print(Counter(y_train))  # Counter({0: 95844, 1: 9934})

numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler()),
])
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant')),
    ('encoder', OrdinalEncoder()),
])
numeric_features = ['user_id', 'signup_day', 'signup_month', 'signup_year',
                    'purchase_day', 'purchase_month', 'purchase_year',
                    'purchase_value', 'age']
categorical_features = ['source', 'browser', 'sex']
preprocessor = ColumnTransformer(transformers=[
    ('numeric', numeric_transformer, numeric_features),
    ('categorical', categorical_transformer, categorical_features),
])
regressors = [
    RandomForestRegressor(),
    LogisticRegression(),
    DecisionTreeClassifier(),
    KNeighborsClassifier(),
    LinearSVC(random_state=42),
]
for regressor in regressors:
    pipeline = Pipeline(steps=[
        ('preprocessor', preprocessor),
        ('regressor', regressor),
    ])
    model = pipeline.fit(X_train, y_train)
    predictions = model.predict(X_test)
    print(regressor)
    print(r2_score(y_test, predictions))
My results:
RandomForestRegressor()
0.48925960579049166
LogisticRegression()
0.24151543370722806
DecisionTreeClassifier()
-0.14622417739659155
KNeighborsClassifier()
0.3542030752350408
LinearSVC(random_state=42)
-0.10256098450762474
You can use the code below to add SMOTE to a pipeline (it needs some tweaking for your data, though). Note that sampling steps require imblearn's Pipeline rather than scikit-learn's:
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# define pipeline: oversample the minority class, then undersample the majority
model = DecisionTreeClassifier()
over = SMOTE(sampling_strategy=0.1)
under = RandomUnderSampler(sampling_strategy=0.5)
steps = [('over', over), ('under', under), ('model', model)]
pipeline = Pipeline(steps=steps)

# evaluate pipeline
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
scores = cross_val_score(pipeline, X, y, scoring='roc_auc', cv=cv, n_jobs=-1)
Alternatively, treat SMOTE separately, not inside the pipeline. What you can do is use a modification of the SMOTE algorithm, called SMOTE-N (see https://imbalanced-learn.org/dev/over_sampling.html#smote-variants), which works when all features are categorical. It modifies SMOTE to choose the categories of each generated sample by taking the most common category among the sample's nearest neighbors:
from imblearn.over_sampling import SMOTEN
sampler = SMOTEN(random_state=0)
Xsm, ysm = sampler.fit_resample(X, y)
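To answer where SMOTE goes in this specific setup: with imblearn's Pipeline the sampler can sit between the existing ColumnTransformer and the estimator, so resampling is applied to the encoded training folds only and never to test data. A minimal sketch, reusing preprocessor, X_train, and y_train from the question; pairing it with a classifier and a classification metric is my suggestion, since SMOTE targets classification and r2_score does not suit a fraud label:
from imblearn.pipeline import Pipeline as ImbPipeline
from imblearn.over_sampling import SMOTE
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report

pipeline = ImbPipeline(steps=[
    ('preprocessor', preprocessor),      # ColumnTransformer from the question
    ('smote', SMOTE(random_state=42)),   # applied during fit only, not predict
    ('classifier', DecisionTreeClassifier()),
])
pipeline.fit(X_train, y_train)
predictions = pipeline.predict(X_test)
print(classification_report(y_test, predictions))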

How to solve 'Input contains NaN, infinity or a value too large for dtype('float64')' after already preprocessing using Pipeline?

There are many posts about this error, but I couldn't find a solution to this problem. I'm using this dataset. This is what I've done: preprocessing with SimpleImputer for both categorical and numerical features:
import pandas as pd
import numpy as np

%load_ext nb_black

from sklearn.preprocessing import StandardScaler, OrdinalEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from category_encoders import CatBoostEncoder
from sklearn.model_selection import train_test_split

housing = pd.read_csv("housing.csv")
housing.head()

X = housing.drop(["longitude", "latitude", "median_house_value"], axis=1)
y = housing["median_house_value"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

numeric_transformer = Pipeline(
    steps=[("imputer", SimpleImputer(strategy="median")), ("scaler", StandardScaler())]
)
categorical_transformer = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="constant")),
        ("encoder", CatBoostEncoder()),
    ]
)
numeric_features = [
    "housing_median_age",
    "total_rooms",
    "total_bedrooms",
    "population",
    "households",
    "median_income",
]
categorical_features = ["ocean_proximity"]
preprocessor = ColumnTransformer(
    transformers=[
        ("numeric", numeric_transformer, numeric_features),
        ("categorical", categorical_transformer, categorical_features),
    ]
)

from sklearn.linear_model import LinearRegression

pipeline = Pipeline(
    steps=[("preprocessor", preprocessor), ("regressor", LinearRegression())]
)
lr_model = pipeline.fit(X_train, y_train)
But I got this error:
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
Any idea what's happening here?
It seems that the CatBoostEncoder is returning several NaN values when fitted to the training set, which is why the LinearRegression throws an error. You can see this by running the preprocessor on its own:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from category_encoders import CatBoostEncoder

housing = pd.read_csv("housing.csv")
X = housing.drop(["longitude", "latitude", "median_house_value"], axis=1)
y = housing["median_house_value"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])
categorical_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="constant")),
    ("encoder", CatBoostEncoder()),
])
numeric_features = ["housing_median_age", "total_rooms", "total_bedrooms",
                    "population", "households", "median_income"]
categorical_features = ["ocean_proximity"]
preprocessor = ColumnTransformer(transformers=[
    ("numeric", numeric_transformer, numeric_features),
    ("categorical", categorical_transformer, categorical_features),
])

X_new = preprocessor.fit_transform(X_train, y_train)
print(np.isnan(X_new).sum(axis=0))
# array([ 0, 0, 0, 0, 0, 0, 4315])
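One plausible explanation, though it is an assumption about category_encoders internals rather than something the traceback proves: inside the pipeline, SimpleImputer converts the data to a NumPy array, so CatBoostEncoder rebuilds a DataFrame with a fresh 0..n-1 index, while y_train still carries the shuffled index from train_test_split. The encoder aligns X and y by index, and rows whose indices no longer match become NaN. If that is the cause, giving the target a plain index before fitting should remove the NaNs; a minimal sketch:
# hypothetical fix: reset y_train's index so it lines up with the
# RangeIndex of the array coming out of the SimpleImputer step
lr_model = pipeline.fit(X_train, y_train.reset_index(drop=True))

# equivalently, pass a bare NumPy array so there is no index to misalign
lr_model = pipeline.fit(X_train, y_train.to_numpy())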

A given column is not a column of the dataframe Pandas

I have the following splitting function:
import pandas as pd
import numpy as np
from typing import Tuple
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split

def split_dataframe(
    df: pd.DataFrame,
    target_feature: str,
    split_ratio: float = 0.2,  # was annotated as int, but it is a float fraction
) -> Tuple[pd.DataFrame, pd.DataFrame, np.ndarray, np.ndarray]:
    df_ = df.copy()
    X = df_.drop(target_feature, axis=1)
    y = df_[target_feature]
    encoder = LabelEncoder()
    y = encoder.fit_transform(y)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=split_ratio)
    return X_train, X_test, y_train, y_test
I split the dataframe by using the following:
X_train, X_test, y_train, y_test = split_dataframe(df, 'Банк')
I use a pipeline to transform X_train and y_train:
from sklearn.pipeline import Pipeline, FeatureUnion
from mlxtend.feature_selection import ColumnSelector
import category_encoders as ce

# categorical_features and numeric_features are defined elsewhere, not shown here
cat_pipe = Pipeline(
    [
        ('selector', ColumnSelector(categorical_features)),
        ('encoder', ce.one_hot.OneHotEncoder()),
    ]
)
num_pipe = Pipeline(
    [
        ('selector', ColumnSelector(numeric_features)),
        ('scaler', StandardScaler()),
    ]
)
preprocessor = FeatureUnion(
    transformer_list=[
        ('cat', cat_pipe),
        ('num', num_pipe),
    ]
)
new_df = pipe.fit_transform(X_train, y_train)
After that I got ValueError: A given column is not a column of the dataframe, and specifically KeyError: 'Банк'. I checked that the columns exist before passing the dataframe to be split into train and test. If I change X = df_.drop(target_feature, axis=1) to X = df_, everything works correctly, but the target feature is still in X.
I made a mistake in pipe.fit_transform(X_train, y_train); I changed it to preprocessor.fit_transform(X_train, y_train) and it worked.
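For reference, a minimal sketch of the working call, plus an alternative that drops the per-branch ColumnSelector steps by using ColumnTransformer (the ColumnTransformer variant is my suggestion, not part of the original code):
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler
import category_encoders as ce

# the fix from the answer above: call the freshly built object, not a stale `pipe`
new_df = preprocessor.fit_transform(X_train, y_train)

# alternative sketch: ColumnTransformer routes each column list to its
# transformer directly, so no separate selector step is needed
preprocessor_ct = ColumnTransformer(
    transformers=[
        ('cat', ce.one_hot.OneHotEncoder(), categorical_features),
        ('num', StandardScaler(), numeric_features),
    ]
)
new_df_ct = preprocessor_ct.fit_transform(X_train, y_train)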

OneHotEncoder for categorical feature "day of week" results in ValueError

I want to define a Pipeline with a OneHotEncoder for the day_of_week column. I don't understand why I get a ValueError:
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder

if __name__ == '__main__':
    data_dict = {
        'age': [1, 2, 3],
        'day_of_week': ['monday', 'tuesday', 'wednesday'],
        'y': [5, 6, 7],
    }
    data = pd.DataFrame(data_dict, columns=data_dict)
    numeric_features = ['age']
    numeric_transformer = Pipeline(steps=[
        ('scaler', StandardScaler())])
    categorical_features = ['day_of_week']
    print(categorical_features)
    categorical_transformer = Pipeline(steps=[
        ('onehot', OneHotEncoder(handle_unknown='ignore', categories='auto'))])
    preprocessor = ColumnTransformer(
        transformers=[
            ('numerical', numeric_transformer, numeric_features),
            ('categorical', categorical_transformer, categorical_features)])
    classifier = Pipeline(
        steps=[
            ('preprocessor', preprocessor),
            ('classifier', RandomForestRegressor(n_estimators=60))])
    X = data.drop(labels=['y'], axis=1)
    y = data['y']
    X_train, y_train, X_test, y_test = train_test_split(X, y, train_size=0.8, random_state=30)
    trained_model = classifier.fit(X_train, y_train)
There is an error on this line:
X_train, y_train, X_test, y_test = train_test_split(X, y, train_size=0.8, random_state=30)
train_test_split returns X_train, X_test, y_train, y_test, in that order. Since you assigned them in the wrong order, y_train ends up holding the test portion of X (a DataFrame of features) and X_test the training portion of y, so your classifier throws all kinds of errors.
Try changing it to:
X_train,X_test, y_train,y_test = train_test_split(X, y, train_size=0.8, random_state=30)
Your code runs without error for me
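A quick way to catch this kind of unpacking mistake early is to sanity-check the shapes right after splitting; a small sketch using the question's variable names:
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, random_state=30)

# features keep their column count; targets are 1-D, so a swapped
# assignment shows up immediately here
print(X_train.shape, X_test.shape)  # (n_train, 2), (n_test, 2)
print(y_train.shape, y_test.shape)  # (n_train,), (n_test,)
assert len(X_train) == len(y_train) and len(X_test) == len(y_test)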
