Column-specific processing in an sklearn pipeline - python

I have a situation where I need to do some column-specific processing in a pipeline, but because transformers return numpy arrays rather than pandas dataframes, I don't have column names to do my feature engineering.
Here's a simple, reproducible example where I have a function called engineer_feature that I want to use to create new data. I need to use it during/after the pipeline because it depends on one column being imputed, and I would like it to be able to run during k-fold cross-validation.
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder, StandardScaler
df = pd.DataFrame({"Type": ["Beta", "Beta", "Alpha", "Charlie", "Beta", "Charlie"], "A": [1, 2, 3, np.nan, 22, 4], "B": [5, 7, 12, 21, 12, 10]})
def engineer_feature(df):
    df["C"] = df["A"] / df["B"]
    return df
categorical_transformer = Pipeline([
    ("one_hot", OneHotEncoder())
])
numeric_transformer = Pipeline([
    ("imputer", SimpleImputer()),
    ("engineer", FunctionTransformer(engineer_feature)),
    ("scaler", StandardScaler())
])
preprocessor = ColumnTransformer([
    ("categorical", categorical_transformer, ["Type"]),
    ("numeric", numeric_transformer, ["A", "B"])
])
preprocessor.fit_transform(df)
Which yields this error:
IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices
Which makes sense, because engineer_feature is trying to index columns as though it received a dataframe, when it is actually given a plain numpy array.
What's a strategy for getting around this? I don't want to hardcode column indices to access them via numpy, especially since my real dataframe has many more columns.

There are ways to get around your challenge by adding a few steps and simplifying the entire approach, instead of trying to run everything on a single input dataframe.
For one hot encoding, you can use the get_dummies() function in pandas.
For calculating df["C"], you can write a lambda function and apply it to all rows in the dataframe using the apply function in pandas.
You should still rely on sklearn for imputing and scaling the numeric columns.
As you correctly mentioned, the output from sklearn will be a numpy array. You should convert it back to a pandas dataframe that can be used further.
In order to follow the above approach,
Split your dataframe into two, one with the categorical columns and the other with the numeric ones. Once you are done with the data processing, put them back together column-wise with concat in pandas (append would stack rows, not columns):
pd.concat([df_numeric, df_categorical], axis=1)
You will need to save the output of each step in a new dataframe, and pass it further downstream in your data pipeline.
To reduce the memory footprint, delete the old dataframe and call the garbage collector:
import gc
del df
gc.collect()
You do not need to save the column index of a numpy array. Simply use df.columns to get the dataframe's column names. For example, below is how you can convert the output of an sklearn transformation back into a dataframe:
sim = SimpleImputer()
sklearn_output_array = sim.fit_transform(df_input)
df_output = pd.DataFrame(sklearn_output_array, columns=df_input.columns)
del df_input
del sklearn_output_array
gc.collect()
df_output["C"] = df_output["A"] / df_output["B"]
I agree that the above approach increases the number of lines of code. However, the code will be much more readable and easier to follow.
In addition to the above, here is another Stack Overflow post that deals with one-hot encoding and saving the column names of transformed dataframes for further use downstream. The answer has some code examples that you might find useful.
https://stackoverflow.com/a/60107683/12855052
Hope this all helps, and let me know if you have further questions!

Thanks to the discussion and answers given by Nick and Sergey (specifically, that I do know which columns of my dataframe I'm passing into engineer_feature), I've come up with a solution that is acceptable to me; though if anyone has a better idea, please chime in.
import numpy as np
import pandas as pd
from functools import partial
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder, StandardScaler
df = pd.DataFrame({"Type": ["Beta", "Beta", "Alpha", "Charlie", "Beta", "Charlie"], "A": [1, 2, 3, np.nan, 22, 4], "B": [5, 7, 12, 21, 12, 10]})
def engineer_feature(columns, X):
    df = pd.DataFrame(X, columns=columns)
    df["C"] = df["A"] / df["B"]
    return df
categorical_transformer = Pipeline([
    ("one_hot", OneHotEncoder())
])
def numeric_transformer(columns):
    transformer = Pipeline([
        ("imputer", SimpleImputer()),
        ("engineer", FunctionTransformer(partial(engineer_feature, columns))),
        ("scaler", StandardScaler())
    ])
    return ("numeric", transformer, columns)
preprocessor = ColumnTransformer([
    ("categorical", categorical_transformer, ["Type"]),
    numeric_transformer(["A", "B"])
])
preprocessor.fit_transform(df)
It's worth noting that this depends on columns A and B each having at least one non-missing value, so that SimpleImputer does not drop either column.
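For reference, here is a hedged sketch of using this preprocessor inside k-fold cross-validation, which was the original goal. The target and classifier are made up for illustration, and on a dataset this tiny the one-hot encoder must be told to ignore categories unseen in a training fold:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

y = [0, 1, 0, 1, 0, 1]  # hypothetical target for the six rows of df
# a fold may lack a category (e.g. "Alpha"), so ignore unknowns at transform time
preprocessor.set_params(categorical__one_hot__handle_unknown="ignore")
model = Pipeline([
    ("preprocessor", preprocessor),
    ("classifier", LogisticRegression())
])
print(cross_val_score(model, df, y, cv=2))  # preprocessing is re-fit inside each fold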

For your toy example to work, you need to rewrite engineer_feature to operate on the numpy array it receives:
def engineer_feature(X):
    # append a new column: the elementwise ratio of the first two columns
    return np.c_[X, X[:, 0] / X[:, 1]]
categorical_transformer = Pipeline([
    ("one_hot", OneHotEncoder())
])
numeric_transformer = Pipeline([
    ("imputer", SimpleImputer()),
    ("engineer", FunctionTransformer(engineer_feature)),
    ("scaler", StandardScaler())
])
preprocessor = ColumnTransformer([
    ("categorical", categorical_transformer, ["Type"]),
    ("numeric", numeric_transformer, ["A", "B"])
])
preprocessor.fit_transform(df)
FunctionTransformer() receives a plain numpy array, so you cannot avoid hardcoding the column indices here.
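That said, the indices are stable: the ColumnTransformer hands this sub-pipeline only the ["A", "B"] slice, so index 0 is always A and index 1 is always B. A quick sanity check on the df from the question (the shape annotation is mine):
print(preprocessor.fit_transform(df).shape)  # (6, 6): 3 one-hot columns plus scaled A, B, C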

Related

Get feature names after sklearn pipeline

I want to match the output np array with the features to make a new pandas dataframe
Here is my pipeline:
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OrdinalEncoder, StandardScaler
from sklearn.compose import make_column_transformer
# Categorical pipeline
categorical_preprocessing = Pipeline(
    [
        ('Imputation', SimpleImputer(missing_values=np.nan, strategy='most_frequent')),
        ('Ordinal encoding', OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)),
    ]
)
# Continuous pipeline
continuous_preprocessing = Pipeline(
    [
        ('Imputation', SimpleImputer(missing_values=np.nan, strategy='mean')),
        ('Scaling', StandardScaler())
    ]
)
# Creating preprocessing pipeline
preprocessing = make_column_transformer(
    (continuous_preprocessing, continuous_cols),
    (categorical_preprocessing, categorical_cols),
)
# Final pipeline
pipeline = Pipeline(
    [('Preprocessing', preprocessing)]
)
Here is how I call it:
X_train = pipeline.fit_transform(X_train)
X_val = pipeline.transform(X_val)
X_test = pipeline.transform(X_test)
Here is what I get when trying to get the feature names:
pipeline['Preprocessing'].transformers_[1][1]['Ordinal encoding'].get_feature_names()
OUT:
AttributeError: 'OrdinalEncoder' object has no attribute 'get_feature_names'
Here is a SO question that was similar: Sklearn Pipeline: Get feature names after OneHotEncode In ColumnTransformer
The point is that, as of today, some transformers expose a .get_feature_names_out() method and some others do not, which generates problems - for instance - whenever you want to create a well-formatted DataFrame from the np.array outputted by a Pipeline or ColumnTransformer instance. (Afaik, .get_feature_names() was deprecated in the latest versions in favor of .get_feature_names_out().)
For what concerns the transformers that you are using, StandardScaler belongs to the first category (transformers exposing the method), while both SimpleImputer and OrdinalEncoder belong to the second. The docs show the exposed methods within the Methods paragraph. As said, this causes problems when doing something like pd.DataFrame(pipeline.fit_transform(X_train), columns=pipeline.get_feature_names_out()) on your pipeline, but it would cause problems as well on your categorical_preprocessing and continuous_preprocessing pipelines (as in both cases at least one transformer lacks the method) and on the preprocessing ColumnTransformer instance.
There's an ongoing attempt in sklearn to enrich all estimators with the .get_feature_names_out() method. It is tracked within github issue #21308 which, as you might see, branches into many PRs (each one dealing with a specific module): for instance, issue #21079 for the preprocessing module, which will enrich the OrdinalEncoder among others, and issue #21078 for the impute module, which will enrich the SimpleImputer. I guess they'll be available in a new release as soon as all the referenced PRs are merged.
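Until then, you can check availability per transformer with hasattr; a quick sketch (the results depend on your installed sklearn version):
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

print(hasattr(StandardScaler(), 'get_feature_names_out'))  # True for the versions discussed here
print(hasattr(SimpleImputer(), 'get_feature_names_out'))   # False until the PRs above are merged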
In the meanwhile, imo, you should go with a custom solution that fits your needs. Here's a simple example, which does not necessarily resemble your exact case, but which is meant to show a (possible) way of proceeding:
import pandas as pd
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OrdinalEncoder, StandardScaler
from sklearn.compose import make_column_transformer, make_column_selector
X = pd.DataFrame({'city': ['London', 'London', 'Paris', 'Sallisaw', ''],
                  'title': ['His Last Bow', 'How Watson Learned the Trick', 'A Moveable Feast', 'The Grapes of Wrath', 'The Jungle'],
                  'expert_rating': [5, 3, 4, 5, np.NaN],
                  'user_rating': [4, 5, 4, np.NaN, 3]})
X
num_cols = X.select_dtypes(include=np.number).columns.tolist()
cat_cols = X.select_dtypes(exclude=np.number).columns.tolist()
# Categorical pipeline
categorical_preprocessing = Pipeline(
    [
        ('Imputation', SimpleImputer(missing_values='', strategy='most_frequent')),
        ('Ordinal encoding', OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)),
    ]
)
# Continuous pipeline
continuous_preprocessing = Pipeline(
    [
        ('Imputation', SimpleImputer(missing_values=np.nan, strategy='mean')),
        ('Scaling', StandardScaler())
    ]
)
# Creating preprocessing pipeline
preprocessing = make_column_transformer(
    (continuous_preprocessing, num_cols),
    (categorical_preprocessing, cat_cols),
)
# Final pipeline
pipeline = Pipeline(
    [('Preprocessing', preprocessing)]
)
X_trans = pipeline.fit_transform(X)
pd.DataFrame(X_trans, columns=num_cols + cat_cols)

How to setup the Imputer as part of sklearn pipeline?

I am working on the Titanic dataset and I wish to handle all the preprocessing activities in a pipeline. So, here is my code:
To get the dataset
!wget "https://calmcode.io/datasets/titanic.csv"
And then read it as below:
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
dt = pd.read_csv("./data/titanic.csv", index_col=["PassengerId"])
And then I set up a single pipeline which is supposed to preprocess the numerical features:
numerical_features = ["Age", "SibSp", "Parch", "Fare"]
numerical_pipeline = Pipeline(steps=[("min_max_scaler", MinMaxScaler()),
                                     ('num_imputer', SimpleImputer(missing_values=np.nan, strategy='mean'))])
Then fit the pipeline:
column_transformer = ColumnTransformer(transformers=[
    ('numeric_transformer', numerical_pipeline, numerical_features)],
    remainder='drop')
column_transformer.fit(dt)
transformed_dt = column_transformer.transform(dt)
But I need to apply the imputer only to the Age feature and not to all the other columns. Currently, it applies the imputer over all the columns.
My question is:
How can I specify that I need to apply the SimpleImputer only to the Age column and not to all of the columns in numerical_pipeline?
I think you need to use two column transformers, so if you set up the minmax this way:
minmax = ColumnTransformer([(
    "minmax",
    MinMaxScaler(),
    ["age", "sibsp", "parch", "fare"])
], remainder='drop')
The output comes without column names, but based on the column names we input, age will be the first, so:
imp = ColumnTransformer([(
    "impute",
    SimpleImputer(missing_values=np.nan, strategy='mean'),
    [0])
], remainder='passthrough')
Then into a pipeline:
Pipeline([("scale", minmax), ("impute", imp)]).fit_transform(dt)
As you have said in a comment, you want to first impute and then do the scaling. In that case, I would first create a column transformer that imputes only the one column, passes through the three other numerical columns, and drops the columns that are not part of that set. After that, you add a MinMaxScaler on the output of that column transformer. In code:
ct = ColumnTransformer(
    [
        ("num_imputer", SimpleImputer(missing_values=np.nan, strategy="mean"), ["Age"]),
        ("needed_columns", "passthrough", ["SibSp", "Parch", "Fare"]),
    ],
)
pipeline = Pipeline(steps=[("transform", ct), ("scale", MinMaxScaler())])
The important bit here is that you add a second entry to the list of transformers, that has the word "passthrough" and specifies all the columns that should be passed through without any modifications.
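For completeness, a short usage sketch, assuming dt is the dataframe loaded above and uses these capitalized column names:
transformed_dt = pipeline.fit_transform(dt)  # imputes Age first, then scales all four columns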

Applying SimpleImputer and OneHotEncoder to multiple columns at once

I am applying the following code to impute and then encode categorical data in my dataset:
# Encoding categorical data
# Define a Pipeline with an imputing step using SimpleImputer prior to the OneHot encoding
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
# use strategy='constant', fill_value='missing' for imputing to preserve the categories' structure
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('encoder', OneHotEncoder(handle_unknown='ignore'))])
preprocessor = ColumnTransformer(
    transformers=[
        ('cat', categorical_transformer, [0])
    ])
Z = np.array(preprocessor.fit_transform(Z))
print (Z[:,0])
I want to repeat these steps for all columns in the array Z, as Z is comprised of all categorical features from my original dataset.
Is there a more efficient way of doing this rather than listing each column as such:
preprocessor = ColumnTransformer(
    transformers=[
        ('cat', categorical_transformer, [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
    ])
Thanks in advance!
If all columns have the same type, I would simply omit the ColumnTransformer and use a simple pipeline in your case:
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder
# some sample data
X = pd.DataFrame({
    'col1': ['obj1', 'obj2', 'obj3'],
    'col2': [np.nan, 'oj3', 'oj1'],
    'col3': ['jo3', 'jo1', np.nan]
}).astype('category')
y = pd.Series([0, 1, 1])
pipeline = make_pipeline(
    SimpleImputer(missing_values=np.nan, strategy='constant', fill_value='missing'),
    OneHotEncoder(handle_unknown='ignore', sparse=False)
)
Z = pipeline.fit_transform(X, y)
The ColumnTransformer is meant to be used for heterogeneous data when columns or column subsets of the input need to be transformed separately (read here). However, since your features are all of the same type and all require the same preprocessing procedure, you can just apply SimpleImputer and OneHotEncoder to the whole dataset as these transformers will automatically detect the columns to transform (which in your case are simply all).
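If you do end up needing a ColumnTransformer for genuinely mixed data, one hedged alternative to hardcoding the index list is make_column_selector, which picks columns by dtype from a dataframe input (reusing the categorical_transformer defined in the question):
from sklearn.compose import ColumnTransformer, make_column_selector

preprocessor = ColumnTransformer(
    transformers=[
        # selects every category-dtype column instead of listing [0, 1, 2, ...]
        ('cat', categorical_transformer, make_column_selector(dtype_include='category'))
    ])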

Can You Consistently Keep Track of Column Labels Using Sklearn's Transformer API?

This seems like a very important issue for this library, and so far I don't see a decisive answer, although it seems like for the most part, the answer is 'No.'
Right now, any method that uses the transformer api in sklearn returns a numpy array as its results. Usually this is fine, but if you're chaining together a multi-step process that expands or reduces the number of columns, not having a clean way to track how they relate to the original column labels makes it difficult to use this section of the library to its fullest.
As an example, here's a snippet that I just recently used, where the inability to map new columns to the ones originally in the dataset was a big drawback:
numeric_columns = train.select_dtypes(include=np.number).columns.tolist()
cat_columns = train.select_dtypes(include=object).columns.tolist()
numeric_pipeline = make_pipeline(SimpleImputer(strategy='median'), StandardScaler())
cat_pipeline = make_pipeline(SimpleImputer(strategy='most_frequent'), OneHotEncoder())
transformers = [
    ('num', numeric_pipeline, numeric_columns),
    ('cat', cat_pipeline, cat_columns)
]
combined_pipe = ColumnTransformer(transformers)
train_clean = combined_pipe.fit_transform(train)
test_clean = combined_pipe.transform(test)
In this example I split up my dataset using the ColumnTransformer and then added additional columns using the OneHotEncoder, so my arrangement of columns is not the same as what I started out with.
I could easily have different arrangements if I used different modules that use the same API: OrdinalEncoder, SelectKBest, etc.
If you're doing multi-step transformations, is there a way to consistently see how your new columns relate to your original dataset?
There's an extensive discussion about it here, but I don't think anything has been finalized yet.
Yes, you are right that there isn't complete support for tracking feature names in sklearn as of now. Initially, it was decided to keep it generic at the level of the numpy array. The latest progress on adding feature names to sklearn estimators can be tracked here.
Anyhow, we can create wrappers to get the feature names of a ColumnTransformer. I am not sure whether it captures all the possible ways of using a ColumnTransformer, but at least it can solve your problem.
From the documentation of ColumnTransformer:
Notes
The order of the columns in the transformed feature matrix follows the order of how the columns are specified in the transformers list. Columns of the original feature matrix that are not specified are dropped from the resulting transformed feature matrix, unless specified in the passthrough keyword. Those columns specified with passthrough are added at the right to the output of the transformers.
Try this!
import pandas as pd
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline, Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, MinMaxScaler
from sklearn.feature_extraction.text import _VectorizerMixin
from sklearn.feature_selection._base import SelectorMixin
from sklearn.feature_selection import SelectKBest
from sklearn.feature_extraction.text import CountVectorizer
train = pd.DataFrame({'age': [23, 12, 12, np.nan],
                      'Gender': ['M', 'F', np.nan, 'F'],
                      'income': ['high', 'low', 'low', 'medium'],
                      'sales': [10000, 100020, 110000, 100],
                      'foo': [1, 0, 0, 1],
                      'text': ['I will test this',
                               'need to write more sentence',
                               'want to keep it simple',
                               'hope you got that these sentences are junk'],
                      'y': [0, 1, 1, 1]})
numeric_columns = ['age']
cat_columns = ['Gender', 'income']
numeric_pipeline = make_pipeline(SimpleImputer(strategy='median'), StandardScaler())
cat_pipeline = make_pipeline(SimpleImputer(strategy='most_frequent'), OneHotEncoder())
text_pipeline = make_pipeline(CountVectorizer(), SelectKBest(k=5))
transformers = [
    ('num', numeric_pipeline, numeric_columns),
    ('cat', cat_pipeline, cat_columns),
    ('text', text_pipeline, 'text'),
    ('simple_transformer', MinMaxScaler(), ['sales']),
]
combined_pipe = ColumnTransformer(
    transformers, remainder='passthrough')
transformed_data = combined_pipe.fit_transform(
    train.drop('y', axis=1), train['y'])
def get_feature_out(estimator, feature_in):
    if hasattr(estimator, 'get_feature_names'):
        if isinstance(estimator, _VectorizerMixin):
            # handling all vectorizers
            return [f'vec_{f}' for f in estimator.get_feature_names()]
        else:
            return estimator.get_feature_names(feature_in)
    elif isinstance(estimator, SelectorMixin):
        return np.array(feature_in)[estimator.get_support()]
    else:
        return feature_in

def get_ct_feature_names(ct):
    # handles all estimators and pipelines inside a ColumnTransformer;
    # for remainder='passthrough' it needs the input column names,
    # which are read from ct._feature_names_in below
    output_features = []
    for name, estimator, features in ct.transformers_:
        if name != 'remainder':
            if isinstance(estimator, Pipeline):
                current_features = features
                for step in estimator:
                    current_features = get_feature_out(step, current_features)
                features_out = current_features
            else:
                features_out = get_feature_out(estimator, features)
            output_features.extend(features_out)
        elif estimator == 'passthrough':
            output_features.extend(ct._feature_names_in[features])
    return output_features

pd.DataFrame(transformed_data,
             columns=get_ct_feature_names(combined_pipe))

AttributeError when using ColumnTransformer into a pipeline

This is my first machine learning project and the first time that I use ColumnTransformer. My aim is to perform two steps of data preprocessing, and use ColumnTransformer for each of them.
In the first step, I want to replace the missing values in my dataframe with the string 'missing_value' for some features, and the most frequent value for the remaining features. Therefore, I combine these two operations using ColumnTransformer and passing to it the corresponding columns of my dataframe.
In the second step, I want to use the just preprocessed data and apply OrdinalEncoder or OneHotEncoder depending on the features. For that I use again ColumnTransformer.
I then combine the two steps into a single pipeline.
I am using the Kaggle House Prices dataset, I have scikit-learn version 0.20, and this is a simplified version of my code:
cat_columns_fill_miss = ['PoolQC', 'Alley']
cat_columns_fill_freq = ['Street', 'MSZoning', 'LandContour']
cat_columns_ord = ['Street', 'Alley', 'PoolQC']
ord_mapping = [['Pave', 'Grvl'],                           # Street
               ['missing_value', 'Pave', 'Grvl'],          # Alley
               ['missing_value', 'Fa', 'TA', 'Gd', 'Ex']]  # PoolQC
cat_columns_onehot = ['MSZoning', 'LandContour']
imputer_cat_pipeline = ColumnTransformer([
    ('imp_miss', SimpleImputer(strategy='constant'), cat_columns_fill_miss),  # fill_value='missing_value' by default
    ('imp_freq', SimpleImputer(strategy='most_frequent'), cat_columns_fill_freq),
])
encoder_cat_pipeline = ColumnTransformer([
    ('ordinal', OrdinalEncoder(categories=ord_mapping), cat_columns_ord),
    ('pass_ord', OneHotEncoder(), cat_columns_onehot),
])
cat_pipeline = Pipeline([
    ('imp_cat', imputer_cat_pipeline),
    ('cat_encoder', encoder_cat_pipeline),
])
Unfortunately, when I apply it to housing_cat, the subset of my dataframe including only categorical features,
cat_pipeline.fit_transform(housing_cat)
I get the error:
AttributeError: 'numpy.ndarray' object has no attribute 'columns'
During handling of the above exception, another exception occurred:
...
ValueError: Specifying the columns using strings is only supported for pandas DataFrames
I have tried this simplified pipeline and it works properly:
new_cat_pipeline = Pipeline([
    ('imp_cat', imputer_cat_pipeline),
    ('onehot', OneHotEncoder()),
])
However, if I try:
enc_one = ColumnTransformer([
    ('onehot', OneHotEncoder(), cat_columns_onehot),
    ('pass_ord', 'passthrough', cat_columns_ord)
])
new_cat_pipeline = Pipeline([
    ('imp_cat', imputer_cat_pipeline),
    ('onehot_encoder', enc_one),
])
I start to get the same error.
I suspect then that this error is related to the use of ColumnTransformer in the second step, but I do not actually understand where it comes from. The way I identify the columns in the second step is the same as in the first step, so it remains unclear to me why I get the AttributeError only in the second step...
ColumnTransformer returns a numpy array, so it can't have a columns attribute (as indicated by your error).
If I may suggest a different solution, use pandas for both of your tasks; it will be easier.
Step 1 - replacing missing values
To replace missing values in a subset of columns with the missing_value string, use this (note that fillna(..., inplace=True) on a column subset would act on a temporary copy, so assign the result back instead):
dataframe[["PoolQC", "Alley"]] = dataframe[["PoolQC", "Alley"]].fillna("missing_value")
For the rest (imputing with the most frequent value of each column, since these columns are categorical), this will work:
freq_cols = ["Street", "MSZoning", "LandContour"]
dataframe[freq_cols] = dataframe[freq_cols].fillna(dataframe[freq_cols].mode().iloc[0])
Step 2 - one hot encoding and categorical variables
pandas provides get_dummies, which returns a pandas DataFrame, unlike ColumnTransformer; the code for this would be:
encoded = pd.get_dummies(dataframe[['MSZoning', 'LandContour']], drop_first=True)
dataframe.drop(['MSZoning', 'LandContour'], axis='columns', inplace=True)
dataframe = dataframe.join(encoded)
For ordinal variables and their encoding, I would suggest you look at this SO answer (unfortunately, some manual mapping would be needed in this case).
If you want to use the transformer anyway
Get the np.array from the dataframe using the values attribute, pass it through the pipeline, and recreate columns and indices from the array like this:
pd.DataFrame(data=your_array, index=np.arange(len(your_array)), columns=["A", "B"])
There is one caveat of this approach, though: you will not know the names of the custom created one-hot-encoded columns (the pipeline will not do this for you).
Additionally, you could get the names of the columns from sklearn's transforming objects (e.g. using the categories_ attribute of a fitted encoder), but I think it would break the pipeline (someone correct me if I'm wrong).
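For illustration, a minimal sketch of reading those names off a fitted encoder, assuming the missing values were already filled as in step 1:
from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder()
enc.fit(dataframe[['MSZoning', 'LandContour']])
print(enc.categories_)  # one array of category labels per input column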
Option #2
use the make_pipeline function
(Had the same error, found this answer, then found this: Introducing the ColumnTransformer)
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline
cat_columns_fill_miss = ['PoolQC', 'Alley']
cat_columns_fill_freq = ['Street', 'MSZoning', 'LandContour']
cat_columns_ord = ['Street', 'Alley', 'PoolQC']
ord_mapping = [['Pave', 'Grvl'],                           # Street
               ['missing_value', 'Pave', 'Grvl'],          # Alley
               ['missing_value', 'Fa', 'TA', 'Gd', 'Ex']]  # PoolQC
cat_columns_onehot = ['MSZoning', 'LandContour']
imputer_cat_pipeline = make_column_transformer(
    (make_pipeline(SimpleImputer(strategy='constant')), cat_columns_fill_miss),
    (make_pipeline(SimpleImputer(strategy='most_frequent')), cat_columns_fill_freq),
)
encoder_cat_pipeline = make_column_transformer(
    (OrdinalEncoder(categories=ord_mapping), cat_columns_ord),
    (OneHotEncoder(), cat_columns_onehot),
)
cat_pipeline = Pipeline([
    ('imp_cat', imputer_cat_pipeline),
    ('cat_encoder', encoder_cat_pipeline),
])
In my own pipelines I do not have overlapping preprocessing in the column space, so I am not sure how the transformation and the "outer pipelining" work together.
However, the important part is to wrap the SimpleImputer in make_pipeline to use it in a column transformer properly:
imputer_cat_pipeline = make_column_transformer(
    (make_pipeline(SimpleImputer(strategy='constant')), cat_columns_fill_miss),
)
Just to add to the other answers here. I'm no Python or data science expert, but you can pass another pipeline to ColumnTransformer in order to do what you need and add more than one transformer to a column. I came here looking for an answer to the same question and found this solution.
Doing it all via pipelines makes it much easier to control the test/train data and avoid leakage, and it opens up more grid search possibilities too. I'm personally not a fan of the pandas approach in another answer for these reasons, but it would still work OK.
encoder_cat_pipeline = Pipeline([
    ('ordinal', OrdinalEncoder(categories=ord_mapping)),
    ('pass_ord', OneHotEncoder()),
])
imputer_cat_pipeline = ColumnTransformer([
    ('imp_miss', SimpleImputer(strategy='constant'), cat_columns_fill_miss),
    ('new_pipeline', encoder_cat_pipeline, cat_columns_fill_freq)
])
cat_pipeline = Pipeline([
    ('imp_cat', imputer_cat_pipeline),
])
Whenever I am doing any transformations, I like to use the FunctionTransformer sklearn offers instead of doing the transformations directly in pandas. The reason is that my feature transformations then generalize to new incoming data (e.g. suppose you win, and you need to use the same code to predict on next year's data). This way you won't have to re-run your code; you can save your preprocessor and call transform. I use something like this:
FE_pipeline = {
    'numeric_pipe': make_pipeline(
        FunctionTransformer(lambda x: x.replace([np.inf, -np.inf], np.nan)),
        MinMaxScaler(),
        SimpleImputer(strategy='median', add_indicator=True),
    ),
    'oh_pipe': make_pipeline(
        FunctionTransformer(lambda x: x.astype(str)),
        SimpleImputer(strategy='constant'),
        OneHotEncoder(handle_unknown='ignore')
    )
}
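One hedged way to use these sub-pipelines downstream is to plug them into a ColumnTransformer; the column lists below are hypothetical placeholders:
from sklearn.compose import ColumnTransformer

preprocessor = ColumnTransformer([
    ('num', FE_pipeline['numeric_pipe'], ['A', 'B']),  # hypothetical numeric columns
    ('cat', FE_pipeline['oh_pipe'], ['Type'])          # hypothetical categorical column
])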
