Imputation after one hot encoding in scikit-learn - python

I have a dataset with both categorical and numerical data. I want to:
Apply one-hot encoding to all categorical columns.
Use the numerical data plus the one-hot encoded categorical data for multiple imputation with IterativeImputer.
Integrate this into a pipeline that exposes fit and transform methods.
I can use ColumnTransformer to impute using only the numerical columns, but I want the categorical columns to inform the imputation too.
E.g.
import pandas as pd

sample_data = pd.DataFrame({
    "a": [4.4, 1.0, None, 3.0, 2.7],
    "b": ["HIGH", "HIGH", "LOW", "HIGH", "LOW"],
    "c": [True, False, False, True, False]
})
I want to first encode columns b and c, and then use them along with a to impute the missing value in column a.
Also, there are no missing values in the categorical columns.

You need a pipeline that first applies one-hot encoding to the categorical columns via a ColumnTransformer, followed by an IterativeImputer step that sees both the one-hot encoded categorical columns and the numerical columns when imputing missing values.
As an example:
from sklearn.compose import ColumnTransformer
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# All transformers inside a ColumnTransformer run in parallel on the original
# input, so the encoded columns (b_HIGH, b_LOW, ...) do not exist yet at that
# stage and cannot be selected by name there. Encode first, then impute in a
# second pipeline step that sees the encoded categoricals alongside 'a'.
column_transformer = ColumnTransformer(transformers=[
    ('cat', OneHotEncoder(), ['b', 'c']),
    ('num', 'passthrough', ['a'])
], sparse_threshold=0)  # force dense output; IterativeImputer needs dense input

pipeline = Pipeline(steps=[
    ('preprocessor', column_transformer),
    ('imputer', IterativeImputer())
])
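A quick sanity check on the sample_data frame from the question (the output column order, encoded b and c first and a last, follows from the transformer order above):

pipeline.fit_transform(sample_data)
# the NaN in 'a' is imputed using the one-hot encoded 'b' and 'c' columns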

Sklearn Pipeline / OneHotEncoder : consistency in getting categorical features with feature_names_in_ / get_feature_names_out()

Similar questions have been asked before, but this is a particular case, and it seems that sklearn has evolved quite a bit since then (I am using scikit-learn 1.1.2), so I think it is worth a new post.
I created an sklearn Pipeline in which I apply different transformations to numeric and categorical columns, as below:
# Separate numeric columns and categorical columns
numeric_features = X_train.select_dtypes(exclude=['object']).columns.tolist()
categorical_features = X_train.select_dtypes(include=['object']).columns.tolist()
# Define transformer pipelines to be applied to each type of column
# 1. Apply KNNImputer to numeric columns
# 2. Apply OneHotEncoder to categorical columns
num_transform_pipeline = Pipeline(steps = [('imputer', KNNImputer(n_neighbors=1, weights="uniform"))])
cat_transform_pipeline = Pipeline(steps = [('onehotencoding', OneHotEncoder(handle_unknown='ignore', sparse=False))])
# Apply each transformer pipeline to each type of columns
column_transformer = ColumnTransformer(
    transformers=[
        ("num_column_transformer", num_transform_pipeline, numeric_features),
        ("cat_column_transformer", cat_transform_pipeline, categorical_features),
    ],
    verbose_feature_names_out=False
)
# Define the final pipeline combining column transformers and the regressor
pipeline = Pipeline([('column_transformer', column_transformer),
('regressor', XGBRegressor())])
After loading the pipeline from another script, I am trying to find the categorical columns that are passed to the OneHotEncoder step. In the previous example, since OneHotEncoder is the first step of cat_transform_pipeline, I can't use get_feature_names_out() on the previous step.
However, I found two different ways of getting the list of categorical columns:
Accessing the last element of the (name, fitted_transformer, columns) tuple in the second transformer of column_transformer returns the categorical columns:
cat_feature_names = pipeline['column_transformer'].transformers_[1][-1]
However, when I try to access the second transformer cat_column_transformer by its name:
cat_feature_names = pipeline['column_transformer'].named_transformers_['cat_column_transformer'][-1]
I get the error TypeError: 'OneHotEncoder' object is not iterable
Is there a way to achieve the same result by using the name of the transformer instead of its index?
Accessing OneHotEncoder's feature_names_in_ attribute does the job and seems to be the easiest method:
cat_feature_names = pipeline['column_transformer'].named_transformers_['cat_column_transformer']['onehotencoding'].feature_names_in_
However, when OneHotEncoder is not the first step of the pipeline, such as in the following case where an imputer is defined just before it:
cat_transform_pipeline = Pipeline(steps=[('imputer', SimpleImputer(strategy='most_frequent')),
                                         ('onehotencoding', OneHotEncoder(handle_unknown='ignore', sparse=False))])
I get the following error: AttributeError: 'OneHotEncoder' object has no attribute 'feature_names_in_'
The solution in this case is to use get_feature_names_out() on the previous step (the imputer). But that doesn't seem very consistent. Why would the attribute feature_names_in_ cease to exist when OneHotEncoder is preceded by an imputer?
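For reference, that workaround looks like this (a sketch, assuming the variant of cat_transform_pipeline with the 'imputer' step has been fitted inside the pipeline; SimpleImputer gained get_feature_names_out in scikit-learn 1.1):

cat_pipe = pipeline['column_transformer'].named_transformers_['cat_column_transformer']
cat_feature_names = cat_pipe['imputer'].get_feature_names_out()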

How to Encode two or more columns in a dataframe using Sk-learn OneHotEncoder?

Here is my dataframe
I want to encode 2 columns: Lines and Name using sklearn OneHotEncoder and Column Transformer. However, I can only encode one column using this code.
import numpy as np

from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

X = new_df_cl.iloc[:, :-2].values
lines = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [4])], remainder='passthrough')
X = np.array(lines.fit_transform(X))
The syntax for the tuples in the list of transformers in ColumnTransformer is
(name, transformer, columns)
where columns can be a list of column indices. So if you want to transform the columns at indices 4 and 5, specify columns as [4, 5] like this:
ColumnTransformer(
    transformers=[('encoder', OneHotEncoder(), [4, 5])],  # <-- here
    remainder='passthrough'
)
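As a side note, if you keep the data as a pandas DataFrame instead of converting it to a numpy array with .values, you can also select the columns by name rather than by position (a sketch; the Lines and Name column labels are taken from the question):

lines = ColumnTransformer(
    transformers=[('encoder', OneHotEncoder(), ['Lines', 'Name'])],
    remainder='passthrough'
)
X = lines.fit_transform(new_df_cl)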

How to get the names of the new columns after performing sklearn Column Transformer

preprocessor = ColumnTransformer(
    [
        ('num', StandardScaler(), numeric_features),
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features)
    ]
)
I want to perform transformations on both some numeric attributes and also on some categorical features.
Running test = preprocessor.fit_transform(X_train) returns a numpy array, which does not carry column names.
According to the documentation, ColumnTransformer should have a get_feature_names() method, which would return the names of the new features. However, when I run it I get:
AttributeError: Transformer num (type StandardScaler) does not provide get_feature_names.
I want to get the names of the columns dynamically because I don't know the number of categories in advance.
ColumnTransformer takes the columns in the same order they are defined in your dataframe, so you may consider obtaining them with pandas select_dtypes. Supposing your data is contained in a df:
numeric_columns = list(df.select_dtypes('number'))
categorical_columns = list(df.select_dtypes('object')) + list(df.select_dtypes('category'))
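On recent scikit-learn versions (1.0 and later), the fitted ColumnTransformer also exposes get_feature_names_out(), which includes the expanded one-hot column names, so the output can be wrapped back into a labelled DataFrame (a sketch, assuming the preprocessor above; if the output is sparse, densify it first or pass sparse_threshold=0 to the ColumnTransformer):

import pandas as pd

test = preprocessor.fit_transform(X_train)
test_df = pd.DataFrame(test, columns=preprocessor.get_feature_names_out())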

Column-specific processing in an sklearn pipeline

I have a situation where I need to do some column-specific processing in a pipeline, but because transformers return numpy arrays rather than pandas dataframes, I don't have column names to do my feature engineering.
Here's a simple, reproducible example with a function called engineer_feature that I want to use to create new data. It has to run during/after the pipeline because it depends on one column having been imputed, and I would like it to be performed during k-fold cross-validation.
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder, StandardScaler
df = pd.DataFrame({"Type": ["Beta", "Beta", "Alpha", "Charlie", "Beta", "Charlie"], "A": [1, 2, 3, np.nan, 22, 4], "B": [5, 7, 12, 21, 12, 10]})
def engineer_feature(df):
    df["C"] = df["A"] / df["B"]
    return df
categorical_transformer = Pipeline([
    ("one_hot", OneHotEncoder())
])
numeric_transformer = Pipeline([
    ("imputer", SimpleImputer()),
    ("engineer", FunctionTransformer(engineer_feature)),
    ("scaler", StandardScaler())
])
preprocessor = ColumnTransformer([
    ("categorical", categorical_transformer, ["Type"]),
    ("numeric", numeric_transformer, ["A", "B"])
])
preprocessor.fit_transform(df)
Which yields this error:
IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices
Which makes sense, because engineer_feature is trying to index columns as though they were dataframes when they are just numpy arrays.
What's a strategy for getting around this? I don't want to hardcode column indices to access them via numpy, especially since my real dataframe has many more columns.
There are ways to get around your challenge by adding a few steps and simplifying the entire approach, instead of trying to run everything through a single pipeline on one input dataframe.
For one hot encoding, you can use the get_dummies() function in pandas.
For calculating df["C"], you can write a lambda function and apply it to all rows in the dataframe using the apply function in pandas.
You should still rely on sklearn for imputing and scaling the numeric columns.
As you correctly mentioned, the output from sklearn will be a numpy array. You should convert it back to a pandas dataframe that can be used further.
In order to follow the above approach,
Split your dataframe into two: one with the categorical columns and the other with the numeric ones. Once you are done with data processing, use concat in pandas to put them back together side by side:
df_combined = pd.concat([df_numeric, df_categorical], axis=1)
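Putting the pieces together, a minimal sketch of this approach on the df from the question (the intermediate frame names are placeholders; imputing and scaling of the numeric part are omitted here):

df_categorical = pd.get_dummies(df[["Type"]])
df_numeric = df[["A", "B"]].copy()
# engineer "C" row-wise with apply, as suggested above
df_numeric["C"] = df_numeric.apply(lambda row: row["A"] / row["B"], axis=1)
df_combined = pd.concat([df_numeric, df_categorical], axis=1)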
You will need to save the output of each step in a new dataframe, and pass it further downstream in your data pipeline.
To reduce the memory footprint, delete the old dataframe and invoke the garbage collector:
import gc
del df
gc.collect()
You do not need to save the column index of a numpy array. Simply use df.columns to recover the dataframe's column labels. For example, this is how you can convert the output of a sklearn transformation back into a dataframe:
sim = SimpleImputer()
sklearn_output_array = sim.fit_transform(df_input)
df_output = pd.DataFrame(sklearn_output_array, columns=df_input.columns)
del df_input
del sklearn_output_array
gc.collect()
df_output["C"] = df_output["A"] / df_output["B"]
I agree that the above approach increases the number of lines of code. However, the code will be much more readable and easier to follow.
In addition to the above, below is another stack overflow post that deals with one-hot encoding and saving column names of transformed dataframes for further use downstream. The answer has some examples with code that you might find useful.
https://stackoverflow.com/a/60107683/12855052
Hope this all helps, and let me know if you have further questions!
Thanks to the discussion and answers given by Nick and Sergey (specifically that I do know what columns of my dataframe I'm passing into engineer_feature), I've come up with a solution that is acceptable to me; though if anyone has a better idea, please chime in.
import numpy as np
import pandas as pd
from functools import partial
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder, StandardScaler
df = pd.DataFrame({"Type": ["Beta", "Beta", "Alpha", "Charlie", "Beta", "Charlie"], "A": [1, 2, 3, np.nan, 22, 4], "B": [5, 7, 12, 21, 12, 10]})
def engineer_feature(columns, X):
    df = pd.DataFrame(X, columns=columns)
    df["C"] = df["A"] / df["B"]
    return df

categorical_transformer = Pipeline([
    ("one_hot", OneHotEncoder())
])

def numeric_transformer(columns):
    transformer = Pipeline([
        ("imputer", SimpleImputer()),
        ("engineer", FunctionTransformer(partial(engineer_feature, columns))),
        ("scaler", StandardScaler())
    ])
    return ("numeric", transformer, columns)

preprocessor = ColumnTransformer([
    ("categorical", categorical_transformer, ["Type"]),
    numeric_transformer(["A", "B"])
])
preprocessor.fit_transform(df)
It's worth noting that this depends on columns A and B each having at least one non-missing value, so that SimpleImputer does not drop the column.
For your toy example to work you need to:
def engineer_feature(X):
    return np.c_[X, X[:, 0] / X[:, 1]]

categorical_transformer = Pipeline([
    ("one_hot", OneHotEncoder())
])
numeric_transformer = Pipeline([
    ("imputer", SimpleImputer()),
    ("engineer", FunctionTransformer(engineer_feature)),
    ("scaler", StandardScaler())
])
preprocessor = ColumnTransformer([
    ("categorical", categorical_transformer, ["Type"]),
    ("numeric", numeric_transformer, ["A", "B"])
])
preprocessor.fit_transform(df)
FunctionTransformer() receives a numpy array, so you cannot avoid hardcoding the column positions here.

AttributeError when using ColumnTransformer in a pipeline

This is my first machine learning project and the first time that I use ColumnTransformer. My aim is to perform two steps of data preprocessing, and use ColumnTransformer for each of them.
In the first step, I want to replace the missing values in my dataframe with the string 'missing_value' for some features, and with the most frequent value for the remaining features. Therefore, I combine these two operations using a ColumnTransformer, passing it the corresponding columns of my dataframe.
In the second step, I want to use the just preprocessed data and apply OrdinalEncoder or OneHotEncoder depending on the features. For that I use again ColumnTransformer.
I then combine the two steps into a single pipeline.
I am using the Kaggle Houses Price dataset, I have scikit-learn version 0.20 and this is a simplified version of my code:
cat_columns_fill_miss = ['PoolQC', 'Alley']
cat_columns_fill_freq = ['Street', 'MSZoning', 'LandContour']
cat_columns_ord = ['Street', 'Alley', 'PoolQC']
ord_mapping = [['Pave', 'Grvl'], # Street
['missing_value', 'Pave', 'Grvl'], # Alley
['missing_value', 'Fa', 'TA', 'Gd', 'Ex'] # PoolQC
]
cat_columns_onehot = ['MSZoning', 'LandContour']
imputer_cat_pipeline = ColumnTransformer([
    ('imp_miss', SimpleImputer(strategy='constant'), cat_columns_fill_miss),  # fill_value='missing_value' by default
    ('imp_freq', SimpleImputer(strategy='most_frequent'), cat_columns_fill_freq),
])
encoder_cat_pipeline = ColumnTransformer([
    ('ordinal', OrdinalEncoder(categories=ord_mapping), cat_columns_ord),
    ('pass_ord', OneHotEncoder(), cat_columns_onehot),
])
cat_pipeline = Pipeline([
    ('imp_cat', imputer_cat_pipeline),
    ('cat_encoder', encoder_cat_pipeline),
])
Unfortunately, when I apply it to housing_cat, the subset of my dataframe including only categorical features,
cat_pipeline.fit_transform(housing_cat)
I get the error:
AttributeError: 'numpy.ndarray' object has no attribute 'columns'
During handling of the above exception, another exception occurred:
...
ValueError: Specifying the columns using strings is only supported for pandas DataFrames
I have tried this simplified pipeline and it works properly:
new_cat_pipeline = Pipeline([
    ('imp_cat', imputer_cat_pipeline),
    ('onehot', OneHotEncoder()),
])
However, if I try:
enc_one = ColumnTransformer([
    ('onehot', OneHotEncoder(), cat_columns_onehot),
    ('pass_ord', 'passthrough', cat_columns_ord)
])
new_cat_pipeline = Pipeline([
    ('imp_cat', imputer_cat_pipeline),
    ('onehot_encoder', enc_one),
])
I start to get the same error.
I suspect that this error is related to the use of ColumnTransformer in the second step, but I do not actually understand where it comes from. The way I identify the columns in the second step is the same as in the first step, so it remains unclear to me why I get the AttributeError only in the second step...
ColumnTransformer returns a numpy array, which has no columns attribute (as your error indicates), so the second ColumnTransformer cannot select columns by string name.
If I may suggest a different solution, use pandas for both of your tasks, it will be easier.
Step 1 - replacing missing values
To replace missing values in a subset of columns with the missing_value string, use this (assigning the result back, since fillna with inplace=True on a column slice operates on a copy):
dataframe[["PoolQC", "Alley"]] = dataframe[["PoolQC", "Alley"]].fillna("missing_value")
For the rest (imputing each column with its most frequent value, since these columns are categorical and have no mean), this works:
cols = ["Street", "MSZoning", "LandContour"]
dataframe[cols] = dataframe[cols].fillna(dataframe[cols].mode().iloc[0])
Step 2 - one hot encoding and categorical variables
pandas provides get_dummies, which returns a pandas DataFrame, unlike ColumnTransformer. The code for this would be:
encoded = pd.get_dummies(dataframe[['MSZoning', 'LandContour']], drop_first=True)
dataframe.drop(columns=['MSZoning', 'LandContour'], inplace=True)
dataframe = dataframe.join(encoded)
For ordinal variables and their encoding I would suggest you to look at this SO answer (unluckily some manual mapping would be needed in this case).
If you want to use the transformer anyway
Get a np.array from the dataframe using the values attribute, pass it through the pipeline, and recreate columns and indices from the array like this:
pd.DataFrame(data=your_array, index=np.arange(len(your_array)), columns=["A", "B"])
There is one caveat of this approach though; you will not know the names of the custom created one-hot-encoded columns (the pipeline will not do this for you).
Additionally, you could get the names of columns from sklearn's transforming objects (e.g. using the categories_ attribute), but I think it would break the pipeline (someone correct me if I'm wrong).
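For example, the fitted encoder's categories_ attribute lists the labels behind the one-hot columns (a minimal sketch, fitting the encoder standalone outside any pipeline):

ohe = OneHotEncoder()
ohe.fit(dataframe[['MSZoning', 'LandContour']])  # assumes missing values were imputed first
print(ohe.categories_)  # one array of category labels per input column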
Option #2
Use the make_pipeline function.
(I had the same error, found this answer, and then found this: Introducing the ColumnTransformer.)
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline
cat_columns_fill_miss = ['PoolQC', 'Alley']
cat_columns_fill_freq = ['Street', 'MSZoning', 'LandContour']
cat_columns_ord = ['Street', 'Alley', 'PoolQC']
ord_mapping = [['Pave', 'Grvl'], # Street
['missing_value', 'Pave', 'Grvl'], # Alley
['missing_value', 'Fa', 'TA', 'Gd', 'Ex'] # PoolQC
]
cat_columns_onehot = ['MSZoning', 'LandContour']
imputer_cat_pipeline = make_column_transformer(
    (make_pipeline(SimpleImputer(strategy='constant')), cat_columns_fill_miss),
    (make_pipeline(SimpleImputer(strategy='most_frequent')), cat_columns_fill_freq),
)
encoder_cat_pipeline = make_column_transformer(
    (OrdinalEncoder(categories=ord_mapping), cat_columns_ord),
    (OneHotEncoder(), cat_columns_onehot),
)
cat_pipeline = Pipeline([
('imp_cat', imputer_cat_pipeline),
('cat_encoder', encoder_cat_pipeline),
])
In my own pipelines I do not have overlapping preprocessing in the column space, so I am not sure how the transformation and the "outer pipelining" interact.
However, the important part is to wrap the SimpleImputer in make_pipeline so it can be used in a column transformer properly:
imputer_cat_pipeline = make_column_transformer(
    (make_pipeline(SimpleImputer(strategy='constant')), cat_columns_fill_miss),
)
Just to add to the other answers here. I'm no Python or data science expert, but you can pass another pipeline to ColumnTransformer in order to do what you need and add more than one transformer to a column. I came here looking for an answer to the same question and found this solution.
Doing it all via pipelines makes it much easier to control the train/test data and avoid leakage, and it opens up more grid-search possibilities too (see the sketch after the code below). I'm personally not a fan of the pandas approach in another answer for these reasons, but it would still work fine.
encoder_cat_pipeline = Pipeline([
    ('ordinal', OrdinalEncoder(categories=ord_mapping)),
    ('pass_ord', OneHotEncoder()),
])
imputer_cat_pipeline = ColumnTransformer([
    ('imp_miss', SimpleImputer(strategy='constant'), cat_columns_fill_miss),
    ('new_pipeline', encoder_cat_pipeline, cat_columns_fill_freq)
])
cat_pipeline = Pipeline([
    ('imp_cat', imputer_cat_pipeline),
])
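Since every step stays addressable by name, a hyper-parameter search can reach the nested transformers through sklearn's step__param syntax (a sketch; it assumes a final estimator, e.g. ('model', LogisticRegression()), has been appended to cat_pipeline so there is something to score):

from sklearn.model_selection import GridSearchCV

param_grid = {
    'imp_cat__imp_miss__strategy': ['constant', 'most_frequent'],
}
search = GridSearchCV(cat_pipeline, param_grid, cv=5)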
Whenever I need custom transformations, I like to use the FunctionTransformer sklearn offers instead of doing them directly in pandas. The reason is that the feature transformations become more generalizable to new incoming data (e.g. suppose you win, and you need to use the same code to predict on next year's data). This way you won't have to re-run your code; you can save your preprocessor and just call transform. I use something like this:
FE_pipeline = {
    # numeric columns: map infinities to NaN, scale, then impute (with a missingness indicator)
    'numeric_pipe': make_pipeline(
        FunctionTransformer(lambda x: x.replace([np.inf, -np.inf], np.nan)),
        MinMaxScaler(),
        SimpleImputer(strategy='median', add_indicator=True),
    ),
    # categorical columns: cast to str, impute a constant, then one-hot encode
    'oh_pipe': make_pipeline(
        FunctionTransformer(lambda x: x.astype(str)),
        SimpleImputer(strategy='constant'),
        OneHotEncoder(handle_unknown='ignore')
    )
}
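A sketch of how these pipelines might then be wired to concrete columns with make_column_transformer (the numeric_columns and categorical_columns lists are placeholders):

from sklearn.compose import make_column_transformer

preprocessor = make_column_transformer(
    (FE_pipeline['numeric_pipe'], numeric_columns),
    (FE_pipeline['oh_pipe'], categorical_columns),
)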
