Applying SimpleImputer and OneHotEncoder to multiple columns at once - python

I am applying the following code to impute and then encode categorical data in my dataset:
# Encoding categorical data
# Define a Pipeline with an imputing step using SimpleImputer prior to the OneHot encoding
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
# use strategy='constant', fill_value='missing' for imputing to preserve the categories' structure
categorical_transformer = Pipeline(steps=[
('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
('encoder', OneHotEncoder(handle_unknown='ignore'))])
preprocessor = ColumnTransformer(
transformers=[
('cat', categorical_transformer, [0])
])
Z = np.array(preprocessor.fit_transform(Z))
print (Z[:,0])
I want to repeat these steps for all columns in the array Z, as Z is comprised of all categorical features from my original dataset.
Is there a more efficient way of doing this rather than listing each column as such:
preprocessor = ColumnTransformer(
transformers=[
('cat', categorical_transformer, [0,1,2,3,4,5,6,7,8,9,10])
])
Thanks in advance!

If all columns have the same type, I would simply omit the ColumnTransformer and use a simple pipeline in your case:
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder
# some sample data
X = pd.DataFrame({
'col1': ['obj1', 'obj2', 'obj3'],
'col2': [np.nan, 'oj3', 'oj1'],
'col3': ['jo3', 'jo1', np.nan]
}).astype('category')
y = pd.Series([0, 1, 1])
pipeline = make_pipeline(
SimpleImputer(missing_values=np.nan, strategy='constant', fill_value='missing'),
OneHotEncoder(handle_unknown='ignore', sparse=False)
)
Z = pipeline.fit_transform(X, y)
The ColumnTransformer is meant to be used for heterogeneous data when columns or column subsets of the input need to be transformed separately (read here). However, since your features are all of the same type and all require the same preprocessing procedure, you can just apply SimpleImputer and OneHotEncoder to the whole dataset as these transformers will automatically detect the columns to transform (which in your case are simply all).

Related

How to classify and substitute NaN with a pipeline

I have the following data frame:
import pandas as pd
d = {'hrs': [1, "NaN", 2], 'Department': ["ResearchDevelopment", "NaN", "ResearchDevelopment"]}
df = pd.DataFrame(data=d)
df
And I want to use a pipeline in order to classify it as categorical or numerical and predict NaN values as the median in the case of numerical and ignore in the case of categorical.
I have done the following:
from sklearn.compose import ColumnTransformer
from sklearn.datasets import fetch_openml
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn import pipeline
numeric_features = ["hrs"]
numeric_transformer = Pipeline(
steps=[("imputer", SimpleImputer(strategy="median")), ("scaler", StandardScaler())]
)
categorical_features = ["Department"]
categorical_transformer = Pipeline(
steps=[("imputer", SimpleImputer(strategy="most_frequent")),("OneHotEncoder", OneHotEncoder(handle_unknown="ignore"))]
)
preprocessor = ColumnTransformer(
transformers=[
("num", numeric_transformer, numeric_features),
("cat", categorical_transformer, categorical_features),
]
)
But I do not know where is the error, and if I have to call the data frame or not.
No error happens, but the pipeline is not changing anything.

Specifying columns in scikit-learn pipeline after ColumnTransformer

I want to construct a scikit-learn pipeline in which some columns have values imputed, and then scaling is subsequently applied to some. If I put both operations in the same columntransformer this does not work as they proceed in parallel (and so missing values cause the scaler to fail). If I make two columntransformers and run them in series, however, I run into the issue that I cannot specify column names (as output of the first transformer is a np array). What is the correct way to go about this?
numeric_columns = list(X.select_dtypes('float64').columns)
cat_columns = list(X.select_dtypes('object').columns)+list(X.select_dtypes('int64').columns)
# Imputation
imp_mean = SimpleImputer(strategy='mean')
imp_freq = SimpleImputer(strategy='most_frequent')
imputer = ColumnTransformer(
[('Imput_mean', imp_mean, numeric_columns),
('Imput_freq', imp_freq, cat_columns),
], remainder='passthrough'
)
# Scaling
feature_transformer = ColumnTransformer(
[('num',StandardScaler(),numeric_columns),
], remainder='passthrough'
)
#Hyperparameters
parameters = {'model__n_components':[1,2,3,4,5]}
#Pipeline
pipeline = Pipeline([('imputer', imputer),
('feature_transformer', feature_transformer),
('model', PLSRegression())])
#Cross validation strategy
cv = KFold(n_splits=10, shuffle=True)
#Cross valdiate and evaluate
clf = GridSearchCV(pipeline, parameters, scoring="r2", cv=10)
cross_val_score(clf, X, y, cv=cv, scoring="r2"))
You might nest a Pipeline which takes care of the preprocessing of the numerical columns (performed serially) within the ColumnTransformer instance.
Here's an example:
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
X = pd.DataFrame({'city': ['London', 'London', '', 'Sallisaw'],
'title': ['His Last Bow', 'How Watson Learned the Trick', 'A Moveable Feast', 'The Grapes of Wrath'],
'expert_rating': [5, 3, np.nan, 5],
'user_rating': [4, np.nan, 4, 3]})
numeric_columns = list(X.select_dtypes('float64').columns)
cat_columns = list(X.select_dtypes('object').columns) + list(X.select_dtypes('int64').columns)
imp_mean = SimpleImputer(strategy='mean')
imp_freq = SimpleImputer(missing_values='', strategy='most_frequent')
ct = ColumnTransformer([
('Imput_freq', imp_freq, cat_columns),
('pipe_num', Pipeline([('Imput_mean', imp_mean), ('num', StandardScaler())]), numeric_columns)
], remainder='passthrough'
)
pd.DataFrame(ct.fit_transform(X))
Here's a similar post: How to execute both parallel and serial transformations with sklearn pipeline?.

Sklearn pipeline transform specific columns - ValueError: too many values to unpack (expected 2)

i am trying make pipeline with scaler, onhotencoder, polynomialfeature, and finally linear regression model
from sklearn.pipeline import Pipeline
pipeline = Pipeline([
('scaler', StandardScaler(), num_cols),
('polynom', PolynomialFeatures(3), num_cols),
('encoder', OneHotEncoder(), cat_cols),
('linear_regression', LinearRegression() )
])
but when i fit the pipeline i have ValueError: too many values to unpack (expected 2)
pipeline.fit(x_train,y_train)
pipeline.score(x_test, y_test)
If I understand correctly, you want to apply some steps of the pipeline to specific columns. Instead of doing it by adding the column names ad the end of the pipeline stage (which is incorrect and causes the error), you have to use a ColumnTransformer. Here you can find another similar example.
In your case, you could do something like this:
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import PolynomialFeatures
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LinearRegression
from sklearn.compose import ColumnTransformer
# Fake data.
train_data = pd.DataFrame({'n1': range(10), 'n2': range(10)})
train_data['c1'] = 0
train_data['c1'][5:] = 1
y_train = [0]*10
y_train[5:] = [1]*5
# Here I assumed you are using a DataFrame. If not, use integer indices instead of column names.
num_cols = ['n1', 'n2']
cat_cols = ['c1']
# Pipeline to transform the numerical features.
numerical_transformer = Pipeline([('scaler', StandardScaler()),
('polynom', PolynomialFeatures(3))
])
# Apply the numerical transformer only on the numerical columns.
# Spearately, apply the OneHotEncoder.
ct = ColumnTransformer([('num_transformer', numerical_transformer, num_cols),
('encoder', OneHotEncoder(), cat_cols)])
# Main pipeline for fitting.
pipeline = Pipeline([
('column_transformer', ct),
('linear_regression', LinearRegression() )
])
pipeline.fit(train_data, y_train)
Schematically, the layout of your pipeline would be like this:

ColumnTransformer(s) in various parts of a pipeline do not play well

I am using sklearn and mlxtend.regressor.StackingRegressor to build a stacked regression model.
For example, say I want the following small pipeline:
A Stacking Regressor with two regressors:
A pipeline which:
Performs data imputation
1-hot encodes categorical features
Performs linear regression
A pipeline which:
Performs data imputation
Performs regression using a Decision Tree
Unfortunately this is not possible, because StackingRegressor doesn't accept NaN in its input data.
This is even if its regressors know how to handle NaN, as it would be in my case where the regressors are actually pipelines which perform data imputation.
However, this is not a problem: I can just move data imputation outside the stacked regressor.
Now my pipeline looks like this:
Perform data imputation
Apply a Stacking Regressor with two regressors:
A pipeline which:
1-hot encodes categorical features
Standardises numerical features
Performs linear regression
An sklearn.tree.DecisionTreeRegressor.
One might try to implement it as follows (the entire minimal working example in this gist, with comments):
sr_linear = Pipeline(steps=[
('preprocessing', ColumnTransformer(transformers=[
('categorical',
make_pipeline(OneHotEncoder(), StandardScaler()),
make_column_selector(dtype_include='category')),
('numerical',
StandardScaler(),
make_column_selector(dtype_include=np.number))
])),
('model', LinearRegression())
])
sr_tree = DecisionTreeRegressor()
ct_imputation = ColumnTransformer(transformers=[
('categorical',
SimpleImputer(strategy='constant', fill_value='None'),
make_column_selector(dtype_include='category')),
('numerical',
SimpleImputer(strategy='median'),
make_column_selector(dtype_include=np.number))
])
stacked_regressor = Pipeline(steps=[
('imputation', ct_imputation),
('back_to_pandas', FunctionTransformer(
func=lambda values: pd.DataFrame(values, columns=ct_imputation.get_feature_names_out())
)),
('model', StackingRegressor(
regressors=[sr_linear, sr_tree],
meta_regressor=DecisionTreeRegressor(),
use_features_in_secondary=True
))
])
Note that the "outer" ColumnTransformer (in stacked_regressor) returns a numpy matrix.
But the "inner" ColumnTransformer (in sr_linear) expects a pandas.DataFrame, so I had to convert the matrix back to a data frame using step back_to_pandas.
(To use get_feature_names_out I had to use the nightly version of sklearn, because the current stable 1.0.2 version does not support it yet. Fortunately it can be installed with one simple command.)
The above code fails when calling stacked_regressor.fit(), with the following error (the entire stacktrace is again in the gist):
ValueError: make_column_selector can only be applied to pandas dataframes
However, because I added the back_to_pandas step to my outer pipeline, the inner pipelines should be getting a pandas data frame!
In fact, if I only fit_transform() my ct_imputation object, I clearly obtain a pandas data frame.
I cannot understand where and when exactly the data which gets passed around ceases to be a data frame.
Why is my code failing?
Imo the issue has to be ascribed to StackingRegressor. Actually, I am not an expert on its usage and still I have not explored its source code, but I've found this sklearn issue - #16473 which seems implying that << the concatenation [of regressors and meta_regressors] does not preserve dataframe >> (though this is referred to sklearn StackingRegressor instance, rather than on mlxtend one).
Indeed, have a look at what happens once you replace it with your sr_linear pipeline:
from sklearn.datasets import fetch_openml
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, FunctionTransformer
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from mlxtend.regressor import StackingRegressor
import numpy as np
import pandas as pd
# We use the Ames house prices dataset for this example
d = fetch_openml('house_prices', as_frame=True).frame
# Small data preprocessing:
for column in d.columns:
if d[column].dtype == object or column == 'MSSubClass':
d[column] = pd.Categorical(d[column])
d.drop(columns='Id', inplace=True)
# Prepare the data for training
label = 'SalePrice'
features = [col for col in d.columns if col != label]
X, y = d[features], d[label]
# Train the stacked regressor
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=True)
sr_linear = Pipeline(steps=[
('preprocessing', ColumnTransformer(transformers=[
('categorical',
make_pipeline(OneHotEncoder(), StandardScaler(with_mean=False)),
make_column_selector(dtype_include='category')),
('numerical',
StandardScaler(),
make_column_selector(dtype_include=np.number))
])),
('model', LinearRegression())
])
ct_imputation = ColumnTransformer(transformers=[
('categorical',
SimpleImputer(missing_values=np.nan, strategy='constant', fill_value='None'),
make_column_selector(dtype_include='category')),
('numerical',
SimpleImputer(strategy='median'),
make_column_selector(dtype_include=np.number))
])
stacked_regressor = Pipeline(steps=[
('imputation', ct_imputation),
('back_to_pandas', FunctionTransformer(
func=lambda values: pd.DataFrame(values, columns=ct_imputation.get_feature_names_out()).astype(types)
)),
('mdl', sr_linear)
])
stacked_regressor.fit(X_train, y_train)
Observe that I had to slightly modify the 'back_to_pandas' step because for some reason pd.DataFrame was changing the dtypes of the columns to 'object' only (from 'category' and 'float64'), therefore clashing with the imputation performed in sr_linear. For this reason, I've applied .astype(types) to the pd.DataFrame constructor, where types is defined as follows (based on the implementation of .get_feature_names_out() method of the SimpleImputer from the dev version of sklearn):
types = {}
for col in d.columns[:-1]:
if d[col].dtype == 'category':
types['categorical__' + col] = str(d[col].dtype)
else:
types['numerical__' + col] = str(d[col].dtype)
The correct thing to do was:
Move from mlxtend's to sklearn's StackingRegressor. I believe the former was creater when sklearn still didn't have a stacking regressor. Now there is no need to use more 'obscure' solutions. sklearn's stacking regressor works pretty well.
Move the 1-hot-encoding step to the outer pipeline, because (surprisingly!) sklearn's DecisionTreeRegressor cannot handle categorical data among the features.
A working version of the code is given below:
from sklearn.datasets import fetch_openml
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.ensemble import StackingRegressor
import numpy as np
import pandas as pd
def set_correct_categories(df: pd.DataFrame) -> pd.DataFrame:
for column in df.columns:
if df[column].dtype == object or 'MSSubClass' in column:
df[column] = pd.Categorical(df[column])
return df
d = fetch_openml('house_prices', as_frame=True).frame
d = set_correct_categories(d).drop(columns='Id')
sr_linear = Pipeline(steps=[
('preprocessing', StandardScaler()),
('model', LinearRegression())
])
ct_preprocessing = ColumnTransformer(transformers=[
('categorical',
make_pipeline(
SimpleImputer(strategy='constant', fill_value='None'),
OneHotEncoder(sparse=False, handle_unknown='ignore')
),
make_column_selector(dtype_include='category')),
('numerical',
SimpleImputer(strategy='median'),
make_column_selector(dtype_include=np.number))
])
stacking_regressor = Pipeline(steps=[
('preprocessing', ct_preprocessing),
('model', StackingRegressor(
estimators=[('linear_regression', sr_linear), ('regression_tree', DecisionTreeRegressor())],
final_estimator=DecisionTreeRegressor(),
passthrough=True
))
])
label = 'SalePrice'
features = [col for col in d.columns if col != label]
X, y = d[features], d[label]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=True)
stacking_regressor.fit(X_train, y_train)
Thanks to user amiola for his answer putting me on the right track.

OneHotEncoder raising NaN issue after SimpleImputer has been called already

I have trouble understanding how pipelines are supposed to work in Sklearn. Following is an example using the titanic dataset.
data = pd.read_csv('datasets/train.csv')
cat_attribs = ["Embarked", "Cabin", "Ticket", "Name"]
num_pipeline = Pipeline([
('imputer', SimpleImputer(strategy="median")),
])
str_pipeline = Pipeline([
('imputer', SimpleImputer(strategy="most_frequent")),
])
full_pipeline = ColumnTransformer([
("num", num_pipeline, ["Pclass", "Age", "SibSp", "Parch", "Fare"]),
("str", str_pipeline, ["Cabin", "Sex"]),
("cat", OneHotEncoder(), ["Cabin"]),
])
full_pipeline.fit_transform(data)
I'd expect this to fill all missing NaN values (both in numeric and string) attributes, and then finally transform the Cabin attribute into a numerical one.
Instead the code ends up with the following error:
ValueError: Input contains NaN. If I remove the line calling the
OneHotEncoder and printing the transformed array, there is no NaN
value.
Hence I wonder. How am I supposed to call OneHotEncoder in this situation.
I would recommend applying OneHotEncoder to all categorical variables. Hence make that as a seperate pipeline.
As it's a single step process for numerical columns, you can use the ColumnTransformer directly.
Try this!
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer, make_column_transformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import Pipeline, make_pipeline
cat_preprocess = make_pipeline(SimpleImputer(strategy="most_frequent"), OneHotEncoder())
ct = make_column_transformer([
("num", SimpleImputer(strategy="median"), ["Pclass", "Age", "SibSp", "Parch", "Fare"]),
("str", cat_preprocess, ["Cabin", "Sex"]),
])
pipeline = Pipeline([('preprocess', ct)])

Categories