I want to match the output np array with the feature names to make a new pandas DataFrame.
Here is my pipeline:
from sklearn.pipeline import Pipeline

# Categorical pipeline
categorical_preprocessing = Pipeline([
    ('Imputation', SimpleImputer(missing_values=np.nan, strategy='most_frequent')),
    ('Ordinal encoding', OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)),
])

# Continuous pipeline
continuous_preprocessing = Pipeline([
    ('Imputation', SimpleImputer(missing_values=np.nan, strategy='mean')),
    ('Scaling', StandardScaler())
])

# Creating preprocessing pipeline
preprocessing = make_column_transformer(
    (continuous_preprocessing, continuous_cols),
    (categorical_preprocessing, categorical_cols),
)

# Final pipeline
pipeline = Pipeline(
    [('Preprocessing', preprocessing)]
)
Here is how I call it:
X_train = pipeline.fit_transform(X_train)
X_val = pipeline.transform(X_val)
X_test = pipeline.transform(X_test)
Here is what I get when trying to get the feature names:
pipeline['Preprocessing'].transformers_[1][1]['Ordinal encoding'].get_feature_names()
AttributeError: 'OrdinalEncoder' object has no attribute 'get_feature_names'
The point is that, as of today, some transformers expose a .get_feature_names_out() method and others do not. This causes problems whenever, for instance, you want to build a well-formatted DataFrame from the np.array returned by a Pipeline or ColumnTransformer instance. (As far as I know, .get_feature_names() was deprecated in recent versions in favor of .get_feature_names_out().)
As for the transformers you are using, StandardScaler belongs to the first category and exposes the method, while both SimpleImputer and OrdinalEncoder belong to the second. The docs list the exposed methods in each estimator's Methods section. As said, this causes problems when doing something like pd.DataFrame(pipeline.fit_transform(X_train), columns=pipeline.get_feature_names_out()) on your pipeline, but it would equally cause problems on your categorical_preprocessing and continuous_preprocessing pipelines (in both cases at least one transformer lacks the method) and on the preprocessing ColumnTransformer instance.
There is an ongoing effort in sklearn to add the .get_feature_names_out() method to all estimators. It is tracked in GitHub issue #21308, which, as you can see, branches into many PRs (each dealing with a specific module). For instance, #21079 covers the preprocessing module, which will enrich OrdinalEncoder among others, and #21078 covers the impute module, which will enrich SimpleImputer. I guess they will be available in a new release as soon as all the referenced PRs are merged.
In the meantime, imo, you should go with a custom solution that fits your needs. Here's a simple example, which does not necessarily match your use case but is meant to show one possible way of proceeding:
import pandas as pd
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OrdinalEncoder, StandardScaler
from sklearn.compose import make_column_transformer, make_column_selector
X = pd.DataFrame({'city': ['London', 'London', 'Paris', 'Sallisaw', ''],
                  'title': ['His Last Bow', 'How Watson Learned the Trick', 'A Moveable Feast', 'The Grapes of Wrath', 'The Jungle'],
                  'expert_rating': [5, 3, 4, 5, np.nan],
                  'user_rating': [4, 5, 4, np.nan, 3]})
num_cols = X.select_dtypes(include=np.number).columns.tolist()
cat_cols = X.select_dtypes(exclude=np.number).columns.tolist()
# Categorical pipeline
categorical_preprocessing = Pipeline([
    ('Imputation', SimpleImputer(missing_values='', strategy='most_frequent')),
    ('Ordinal encoding', OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)),
])

# Continuous pipeline
continuous_preprocessing = Pipeline([
    ('Imputation', SimpleImputer(missing_values=np.nan, strategy='mean')),
    ('Scaling', StandardScaler())
])

# Creating preprocessing pipeline
preprocessing = make_column_transformer(
    (continuous_preprocessing, num_cols),
    (categorical_preprocessing, cat_cols),
)

# Final pipeline
pipeline = Pipeline(
    [('Preprocessing', preprocessing)]
)

X_trans = pipeline.fit_transform(X)

# Column order follows the order of the transformers: numerical first, then categorical
pd.DataFrame(X_trans, columns=num_cols + cat_cols)
I am using sklearn and mlxtend.regressor.StackingRegressor to build a stacked regression model.
For example, say I want the following small pipeline:
A Stacking Regressor with two regressors:
  - A pipeline which:
      - Performs data imputation
      - 1-hot encodes categorical features
      - Performs linear regression
  - A pipeline which:
      - Performs data imputation
      - Performs regression using a Decision Tree
Unfortunately this is not possible, because StackingRegressor doesn't accept NaN in its input data.
This is the case even if its regressors know how to handle NaN, as in my case, where the regressors are actually pipelines that perform data imputation.
However, this is not a problem: I can just move data imputation outside the stacked regressor.
Now my pipeline looks like this:
- Perform data imputation
- Apply a Stacking Regressor with two regressors:
    - A pipeline which:
        - 1-hot encodes categorical features
        - Standardises numerical features
        - Performs linear regression
    - An sklearn.tree.DecisionTreeRegressor.
One might try to implement it as follows (the entire minimal working example in this gist, with comments):
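sr_linear = Pipeline(steps=[
    ('preprocessing', ColumnTransformer(transformers=[
        ('categorical',
         make_pipeline(OneHotEncoder(), StandardScaler()),
         make_column_selector(dtype_include='category')),
        # ... (numerical transformer not shown)
    ])),
    ('model', LinearRegression())
])

sr_tree = DecisionTreeRegressor()

ct_imputation = ColumnTransformer(transformers=[
    ('categorical',
     SimpleImputer(strategy='constant', fill_value='None'),
     make_column_selector(dtype_include='category')),
    # ... (numerical transformer not shown)
])

stacked_regressor = Pipeline(steps=[
    ('imputation', ct_imputation),
    ('back_to_pandas', FunctionTransformer(
        func=lambda values: pd.DataFrame(values, columns=ct_imputation.get_feature_names_out())
    )),
    ('model', StackingRegressor(
        regressors=[sr_linear, sr_tree],
        # ... (remaining arguments, e.g. meta_regressor, not shown)
    ))
])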
Note that the "outer" ColumnTransformer (in stacked_regressor) returns a numpy matrix.
But the "inner" ColumnTransformer (in sr_linear) expects a pandas.DataFrame, so I had to convert the matrix back to a data frame using step back_to_pandas.
(To use get_feature_names_out I had to use the nightly version of sklearn, because the current stable 1.0.2 version does not support it yet. Fortunately it can be installed with one simple command.)
The above code fails when calling stacked_regressor.fit(), with the following error (the entire stacktrace is again in the gist):
ValueError: make_column_selector can only be applied to pandas dataframes
However, because I added the back_to_pandas step to my outer pipeline, the inner pipelines should be getting a pandas data frame!
In fact, if I only fit_transform() my ct_imputation object, I clearly obtain a pandas data frame.
I cannot understand where and when exactly the data which gets passed around ceases to be a data frame.
Why is my code failing?
Imo the issue has to be ascribed to StackingRegressor. I am not an expert on its usage and I have not yet explored its source code, but I found sklearn issue #16473, which seems to imply that "the concatenation [of regressors and meta_regressors] does not preserve dataframe" (though this refers to sklearn's StackingRegressor rather than mlxtend's).
Indeed, have a look at what happens once you replace it with your sr_linear pipeline:
from sklearn.datasets import fetch_openml
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, FunctionTransformer
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from mlxtend.regressor import StackingRegressor
import numpy as np
import pandas as pd
# We use the Ames house prices dataset for this example
d = fetch_openml('house_prices', as_frame=True).frame
# Small data preprocessing:
for column in d.columns:
    if d[column].dtype == object or column == 'MSSubClass':
        d[column] = pd.Categorical(d[column])
d.drop(columns='Id', inplace=True)
# Prepare the data for training
label = 'SalePrice'
features = [col for col in d.columns if col != label]
X, y = d[features], d[label]
# Train the stacked regressor
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=True)
sr_linear = Pipeline(steps=[
    ('preprocessing', ColumnTransformer(transformers=[
        ('categorical',
         make_pipeline(OneHotEncoder(), StandardScaler(with_mean=False)),
         make_column_selector(dtype_include='category')),
        # ... (numerical transformer not shown)
    ])),
    ('model', LinearRegression())
])

ct_imputation = ColumnTransformer(transformers=[
    ('categorical',
     SimpleImputer(missing_values=np.nan, strategy='constant', fill_value='None'),
     make_column_selector(dtype_include='category')),
    # ... (numerical transformer not shown)
])

stacked_regressor = Pipeline(steps=[
    ('imputation', ct_imputation),
    ('back_to_pandas', FunctionTransformer(
        func=lambda values: pd.DataFrame(values, columns=ct_imputation.get_feature_names_out()).astype(types)
    )),
    ('mdl', sr_linear)
])

stacked_regressor.fit(X_train, y_train)
Observe that I had to slightly modify the 'back_to_pandas' step because pd.DataFrame was changing the dtypes of all columns to 'object' (from 'category' and 'float64'), presumably because the ColumnTransformer stacks its outputs into a single NumPy array, which becomes object-typed when strings and numbers are mixed. This clashed with the imputation performed in sr_linear. For this reason, I've applied .astype(types) to the pd.DataFrame constructor, where types is defined as follows (based on the implementation of the .get_feature_names_out() method of SimpleImputer in the dev version of sklearn):
types = {}
for col in d.columns[:-1]:
    if d[col].dtype == 'category':
        types['categorical__' + col] = str(d[col].dtype)
    else:
        types['numerical__' + col] = str(d[col].dtype)
The correct thing to do was:
Move from mlxtend's to sklearn's StackingRegressor. I believe the former was created when sklearn still didn't have a stacking regressor. Now there is no need to use more 'obscure' solutions: sklearn's stacking regressor works pretty well.
Move the 1-hot encoding step to the outer pipeline, because (surprisingly!) sklearn's DecisionTreeRegressor cannot handle categorical data among the features.
A working version of the code is given below:
from sklearn.datasets import fetch_openml
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.ensemble import StackingRegressor
import numpy as np
import pandas as pd
def set_correct_categories(df: pd.DataFrame) -> pd.DataFrame:
    for column in df.columns:
        if df[column].dtype == object or 'MSSubClass' in column:
            df[column] = pd.Categorical(df[column])
    return df
d = fetch_openml('house_prices', as_frame=True).frame
d = set_correct_categories(d).drop(columns='Id')
sr_linear = Pipeline(steps=[
    ('preprocessing', StandardScaler()),
    ('model', LinearRegression())
])

ct_preprocessing = ColumnTransformer(transformers=[
    ('categorical',
     make_pipeline(
         SimpleImputer(strategy='constant', fill_value='None'),
         OneHotEncoder(sparse=False, handle_unknown='ignore')
     ),
     make_column_selector(dtype_include='category')),
    # ... (numerical transformer not shown)
])

stacking_regressor = Pipeline(steps=[
    ('preprocessing', ct_preprocessing),
    ('model', StackingRegressor(
        estimators=[('linear_regression', sr_linear), ('regression_tree', DecisionTreeRegressor())],
    ))
])
label = 'SalePrice'
features = [col for col in d.columns if col != label]
X, y = d[features], d[label]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=True)
stacking_regressor.fit(X_train, y_train)
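The same layout can be tried without downloading the Ames data. The sketch below is only an illustration on made-up data (the columns 'zone' and 'area' are hypothetical): imputation and 1-hot encoding happen once in the outer pipeline, and sklearn's StackingRegressor only ever sees a dense numeric array. It passes sparse_threshold=0 to the ColumnTransformer to force dense output rather than relying on an encoder argument.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import StackingRegressor
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
n = 80

# Hypothetical toy data: one categorical and one numerical feature
X = pd.DataFrame({
    'zone': pd.Series(rng.choice(['A', 'B', 'C'], size=n), dtype=object),
    'area': rng.normal(100.0, 20.0, size=n),
})
y = X['area'] * 2.0 + (X['zone'] == 'A') * 30.0 + rng.normal(0.0, 1.0, size=n)

# Punch some holes into both features after the target has been computed
X.loc[::10, 'zone'] = np.nan
X.loc[5::10, 'area'] = np.nan

preprocessing = ColumnTransformer(
    transformers=[
        ('categorical',
         make_pipeline(SimpleImputer(strategy='constant', fill_value='None'),
                       OneHotEncoder(handle_unknown='ignore')),
         ['zone']),
        ('numerical', SimpleImputer(strategy='median'), ['area']),
    ],
    sparse_threshold=0,  # always return a dense array
)

stacking_regressor = Pipeline(steps=[
    ('preprocessing', preprocessing),
    ('model', StackingRegressor(estimators=[
        ('linear_regression', make_pipeline(StandardScaler(), LinearRegression())),
        ('regression_tree', DecisionTreeRegressor(random_state=0)),
    ])),
])

stacking_regressor.fit(X, y)
predictions = stacking_regressor.predict(X)
```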
I am trying to use GridSearchCV to select the best imputer strategy but I am having trouble doing that.
First, I have a data preparation pipeline for numerical and categorical columns:
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import Pipeline, make_pipeline
num_pipe = make_pipeline(SimpleImputer(strategy='median'), StandardScaler())
cat_pipe = make_pipeline(SimpleImputer(strategy='constant', fill_value='NA'),
                         OneHotEncoder(sparse=False, handle_unknown='ignore'))

preprocessing = ColumnTransformer([
    ("num", num_pipe, num_cols),
    ("cat", cat_pipe, cat_cols)
])
Next, I have created a pipeline to train a support vector machine model with feature selection.
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel
from sklearn.svm import SVR

model = Pipeline([
    ("preprocess", preprocessing),
    ("feature_select", SelectFromModel(RandomForestRegressor(random_state=42))),
    ("regressor", SVR(kernel='rbf', C=30000.0, gamma=0.3))
])
Now, I am trying to see which imputer strategy is best for imputing missing values for numerical columns using a GridSearchCV
grid = {"model.named_steps.preprocess.transformers[0][1].named_steps['simpleimputer'].strategy":
grid_search = GridSearchCV(model, param_grid = grid, cv=5, scoring='neg_mean_squared_error')
grid_search.fit(X_train, y_train)
This is where I am getting the error. The full pipeline looks like this -
['longitude', 'latitude',
'population', 'households',
('regressor', SVR(C=30000.0, gamma=0.3))])
Can anyone tell me what I need to change in the grid search to make it work?
You specify the parameter via a dictionary whose keys combine the name of the estimator/transformer with the name of the parameter you want to change, and whose values are the settings you want to try. If you have a pipeline, or a pipeline of pipelines, the key is the names of all the parents joined by double underscores. So in your case it looks like:
grid = {
    "preprocess__num__simpleimputer__strategy": [...]  # the list of strategies to try
}
simpleimputer is simply the name that was automatically assigned by make_pipeline.
However, I think there are other issues in your code. For example, fill_value='NA' looks incorrect and is probably not needed: fill_value is not the value to be replaced (that is missing_values) but the value used to fill in the missing entries.
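The naming rule above can be checked on a small example. The following is a sketch on made-up data, not the asker's exact setup: the columns 'rooms' and 'ocean' are hypothetical, Ridge stands in for the feature selection plus SVR steps to keep it short, and the three candidate strategies are just an assumption about what one might want to compare.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

rng = np.random.default_rng(42)
n = 60

# Hypothetical toy data with a few missing numerical values
X = pd.DataFrame({
    'rooms': rng.normal(5.0, 1.0, size=n),
    'ocean': pd.Series(rng.choice(['near', 'far'], size=n), dtype=object),
})
y = 3.0 * X['rooms'] + rng.normal(0.0, 0.1, size=n)
X.loc[::7, 'rooms'] = np.nan

num_pipe = make_pipeline(SimpleImputer(strategy='median'), StandardScaler())
cat_pipe = make_pipeline(SimpleImputer(strategy='constant', fill_value='NA'),
                         OneHotEncoder(handle_unknown='ignore'))
preprocessing = ColumnTransformer([
    ('num', num_pipe, ['rooms']),
    ('cat', cat_pipe, ['ocean']),
])
model = Pipeline([
    ('preprocess', preprocessing),
    ('regressor', Ridge()),
])

# Pipeline step -> ColumnTransformer entry -> make_pipeline step -> parameter
param_name = 'preprocess__num__simpleimputer__strategy'

# Candidate values are an assumption for illustration
grid = {param_name: ['mean', 'median', 'most_frequent']}
grid_search = GridSearchCV(model, param_grid=grid, cv=3,
                           scoring='neg_mean_squared_error')
grid_search.fit(X, y)
best_strategy = grid_search.best_params_[param_name]
```

If a key is rejected, listing model.get_params().keys() shows every valid parameter name for the pipeline, which is a quick way to spot a misspelt step.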
As part of a sklearn pipeline, I'd like to bin my response variable into k ordinal categories and then do classification on these categories. I found KBinsDiscretizer, which seems to perform this transformation, but it seems to work only on feature columns, not on the target column.
Reproducible example
import sklearn
from sklearn.compose import make_column_transformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
import pandas as pd
from sklearn.datasets import load_boston
data = load_boston()
df = pd.DataFrame(data['data'], columns=data['feature_names'])
df['target'] = data['target']
binarizer_col_y = make_column_transformer(
    [sklearn.preprocessing.KBinsDiscretizer(n_bins=3, encode='ordinal'), ['target']],
    remainder='passthrough'
)
pipeline = Pipeline(steps=[
    ('preprocess', binarizer_col_y),
    ('ols', LinearRegression())
])
pipeline.fit(df[data['feature_names']], df['target'])
This errors with
KeyError: 'target'
The above exception was the direct cause of the following exception:
[ another key error for 'target']
I also found sklearn.compose.TransformedTargetRegressor, which transforms the response (but I want to do classification), and I know that I can write my own transformers, but they apparently only modify X, not y.
Can anyone tell me how to modify y in a pre-processing step prior to classification as part of a pipeline?
Why inside the pipeline?
The idea is to move as many transformations into the pipeline as possible, reducing boilerplate code, avoiding data leaks plus simplifying model deployment (e.g. as services like Databricks model registry can deploy a sklearn model with pre-processing expected to happen inside the model).
You get the error because the transformation is applied only to X, not to y, so the target column is not available.
A sklearn Pipeline does not support transforming the target y in the way you tried to write it.
However, there is sklearn.compose.TransformedTargetRegressor, which wraps a model and can be given instructions on how to transform the target.
Be warned: it is not well supported, and I ran into many issues when trying to work with it on a real project. You may prefer to do the target transformation manually.
Here is a little demo that might work for you.
import sklearn
from sklearn.compose import make_column_transformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
import pandas as pd
from sklearn.datasets import load_boston
from sklearn.compose import TransformedTargetRegressor
data = load_boston()
df = pd.DataFrame(data["data"], columns=data["feature_names"])
X = df[data["feature_names"]]
y = data["target"]
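pipeline = Pipeline(steps=[
    ("ols", TransformedTargetRegressor(
        regressor=LinearRegression(),
        transformer=sklearn.preprocessing.KBinsDiscretizer(n_bins=3, encode="ordinal"),
    )),
])
pipeline.fit(X, y)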
Or a more readable snippet that shows how you create target transformer
model = LinearRegression()
kbins = sklearn.preprocessing.KBinsDiscretizer(n_bins=3, encode="ordinal")
ttr = TransformedTargetRegressor(model, transformer=kbins)
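Since the question is about classification, the manual target transformation mentioned above might look like the sketch below: the target is binned once, outside the pipeline, and a classifier is then fitted on the binned labels. This is only one possible arrangement, shown on synthetic data from make_regression. The choice of LogisticRegression and of strategy='uniform' is an assumption made for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import KBinsDiscretizer, StandardScaler

# Synthetic regression data standing in for the real features and target
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# Bin the continuous target into 3 ordinal classes outside the pipeline,
# because Pipeline steps only ever transform X
binner = KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='uniform')
y_binned = binner.fit_transform(y.reshape(-1, 1)).ravel().astype(int)

# The feature preprocessing still lives inside the pipeline
classifier = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
classifier.fit(X, y_binned)
predicted_classes = classifier.predict(X)
```

To avoid leakage in a real project, the binner would be fitted on the training targets only and then applied to the validation and test targets with binner.transform.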
This seems like a very important issue for this library, and so far I don't see a decisive answer, although it seems like for the most part, the answer is 'No.'
Right now, any method that uses the transformer api in sklearn returns a numpy array as its results. Usually this is fine, but if you're chaining together a multi-step process that expands or reduces the number of columns, not having a clean way to track how they relate to the original column labels makes it difficult to use this section of the library to its fullest.
As an example, here's a snippet that I just recently used, where the inability to map new columns to the ones originally in the dataset was a big drawback:
numeric_columns = train.select_dtypes(include=np.number).columns.tolist()
cat_columns = train.select_dtypes(include=object).columns.tolist()

numeric_pipeline = make_pipeline(SimpleImputer(strategy='median'), StandardScaler())
cat_pipeline = make_pipeline(SimpleImputer(strategy='most_frequent'), OneHotEncoder())

transformers = [
    ('num', numeric_pipeline, numeric_columns),
    ('cat', cat_pipeline, cat_columns)
]

combined_pipe = ColumnTransformer(transformers)
train_clean = combined_pipe.fit_transform(train)
test_clean = combined_pipe.transform(test)
In this example I split up my dataset using the ColumnTransformer and then added additional columns using the OneHotEncoder, so my arrangement of columns is not the same as what I started out with.
I could easily end up with different arrangements if I used other modules that share the same API: OrdinalEncoder, SelectKBest, etc.
If you're doing multi-step transformations, is there a way to consistently see how your new columns relate to your original dataset?
There's an extensive discussion about it here, but I don't think anything has been finalized yet.
Yes, you are right that there isn't complete support for tracking feature names in sklearn as of now. Initially, it was decided to keep things generic at the level of NumPy arrays. The latest progress on adding feature names to sklearn estimators can be tracked here.
Anyhow, we can create wrappers to get the feature names of the ColumnTransformer. I am not sure whether this captures all possible types of ColumnTransformers, but at least it can solve your problem.
From Documentation of ColumnTransformer:
The order of the columns in the transformed feature matrix follows the order of how the columns are specified in the transformers list. Columns of the original feature matrix that are not specified are dropped from the resulting transformed feature matrix, unless specified in the passthrough keyword. Those columns specified with passthrough are added at the right to the output of the transformers.
Try this!
import pandas as pd
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline, Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, MinMaxScaler
from sklearn.feature_extraction.text import _VectorizerMixin
from sklearn.feature_selection._base import SelectorMixin
from sklearn.feature_selection import SelectKBest
from sklearn.feature_extraction.text import CountVectorizer
train = pd.DataFrame({'age': [23, 12, 12, np.nan],
                      'Gender': ['M', 'F', np.nan, 'F'],
                      'income': ['high', 'low', 'low', 'medium'],
                      'sales': [10000, 100020, 110000, 100],
                      'foo': [1, 0, 0, 1],
                      'text': ['I will test this',
                               'need to write more sentence',
                               'want to keep it simple',
                               'hope you got that these sentences are junk'],
                      'y': [0, 1, 1, 1]})
numeric_columns = ['age']
cat_columns = ['Gender','income']
numeric_pipeline = make_pipeline(SimpleImputer(strategy='median'), StandardScaler())
cat_pipeline = make_pipeline(SimpleImputer(strategy='most_frequent'), OneHotEncoder())
text_pipeline = make_pipeline(CountVectorizer(), SelectKBest(k=5))
transformers = [
    ('num', numeric_pipeline, numeric_columns),
    ('cat', cat_pipeline, cat_columns),
    ('text', text_pipeline, 'text'),
    ('simple_transformer', MinMaxScaler(), ['sales']),
]

combined_pipe = ColumnTransformer(
    transformers, remainder='passthrough')

transformed_data = combined_pipe.fit_transform(
    train.drop(columns='y'), train['y'])
def get_feature_out(estimator, feature_in):
    if hasattr(estimator, 'get_feature_names'):
        if isinstance(estimator, _VectorizerMixin):
            # handling all vectorizers
            return [f'vec_{f}'
                    for f in estimator.get_feature_names()]
        else:
            return estimator.get_feature_names(feature_in)
    elif isinstance(estimator, SelectorMixin):
        return np.array(feature_in)[estimator.get_support()]
    else:
        return feature_in


def get_ct_feature_names(ct):
    # handles all estimators, pipelines inside ColumnTransfomer
    # doesn't work when remainder =='passthrough'
    # which requires the input column names.
    output_features = []

    for name, estimator, features in ct.transformers_:
        if name != 'remainder':
            if isinstance(estimator, Pipeline):
                current_features = features
                for step in estimator:
                    current_features = get_feature_out(step, current_features)
                features_out = current_features
            else:
                features_out = get_feature_out(estimator, features)
            output_features.extend(features_out)
        elif estimator == 'passthrough':
            # input column names of the passed-through columns
            # (private attribute; it may differ between sklearn versions)
            output_features.extend(ct._feature_names_in[features])

    return output_features
I am pretty new to pipelines in sklearn and I am running into this problem: I have a dataset that has a mixture of text and numbers i.e. certain columns have text only and rest have integers (or floating point numbers).
I was wondering if it was possible to build a pipeline where I can for example call LabelEncoder() on the text features and MinMaxScaler() on the numbers columns. The examples I have seen on the web mostly point towards using LabelEncoder() on the entire dataset and not on select columns. Is this possible? If so any pointers would be greatly appreciated.
The way I usually do it is with a FeatureUnion, using a FunctionTransformer to pull out the relevant columns.
Important notes:
You have to define your functions with def since annoyingly you can't use lambda or partial in FunctionTransformer if you want to pickle your model
You need to initialize FunctionTransformer with validate=False
Something like this:
from sklearn.pipeline import make_union, make_pipeline
from sklearn.preprocessing import FunctionTransformer, LabelEncoder, MinMaxScaler

def get_text_cols(df):
    return df[['name', 'fruit']]

def get_num_cols(df):
    return df[['height', 'age']]

vec = make_union(*[
    make_pipeline(FunctionTransformer(get_text_cols, validate=False), LabelEncoder()),
    make_pipeline(FunctionTransformer(get_num_cols, validate=False), MinMaxScaler())
])
Since v0.20, you can use ColumnTransformer to accomplish this.
An Example of ColumnTransformer might help you:
# filter data
data = data[data['county'].isin(COUNTIES_OF_INTEREST)]
# define the feature encoding of the data
impute_and_one_hot_encode = Pipeline([
    ('impute', SimpleImputer(strategy='most_frequent')),
    ('encode', OneHotEncoder(sparse=False, handle_unknown='ignore'))
])

featurisation = ColumnTransformer(transformers=[
    ("impute_and_one_hot_encode", impute_and_one_hot_encode, ['smoker', 'county', 'race']),
    ('word2vec', MyW2VTransformer(min_count=2), ['last_name']),
    ('numeric', StandardScaler(), ['num_children', 'income'])
])
# define the training pipeline for the model
neural_net = KerasClassifier(build_fn=create_model, epochs=10, batch_size=1, verbose=0, input_dim=109)
pipeline = Pipeline([
('features', featurisation),
('learner', neural_net)])
# train-test split
train_data, test_data = train_test_split(data, random_state=0)
# model training
model = pipeline.fit(train_data, train_data['label'])
You can find the entire code under: https://github.com/stefan-grafberger/mlinspect/blob/19ca0d6ae8672249891835190c9e2d9d3c14f28f/example_pipelines/healthcare/healthcare.py
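For the original question (encode the text columns, scale the numeric ones), a much smaller self-contained sketch may be enough. The data below is made up and reuses the column names 'name', 'fruit', 'height' and 'age' from the earlier answer. It uses OrdinalEncoder instead of LabelEncoder, since LabelEncoder is designed for a 1-D target rather than for feature columns.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, OrdinalEncoder

# Hypothetical toy data mixing text and numeric columns
df = pd.DataFrame({
    'name': ['Ann', 'Bob', 'Cid', 'Ann'],
    'fruit': ['apple', 'pear', 'apple', 'kiwi'],
    'height': [150.0, 180.0, 165.0, 172.0],
    'age': [20, 40, 30, 25],
})

encode_and_scale = ColumnTransformer(transformers=[
    # integer codes per text column (categories are sorted alphabetically)
    ('text', OrdinalEncoder(), ['name', 'fruit']),
    # rescale each numeric column to the [0, 1] range
    ('numbers', MinMaxScaler(), ['height', 'age']),
])

# Output columns follow the transformer order: name, fruit, height, age
encoded = encode_and_scale.fit_transform(df)
```

The ColumnTransformer can then be used as the first step of a Pipeline, exactly like featurisation in the example above.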