Pipeline with SimpleImputer and OneHotEncoder - how to do properly? - python

I am facing a challenge creating a pipeline that imputes (SI) a categorical variable (e.g. colour) and then one-hot encodes (OHE) two variables (e.g. colour and dayofweek). colour is used in both steps.
I wanted to put SI and OHE in one ColumnTransformer. I just learnt that SI and OHE run in parallel, meaning OHE will not encode the imputed colour (i.e. it encodes the original, un-imputed colour).
I then tried:
si = SimpleImputer(strategy='mean', add_indicator=True)
ohe = OneHotEncoder(sparse=False,drop='first')
ctsi = ColumnTransformer(transformers=[('si',si,['colour'])],remainder='passthrough')
ctohe = ColumnTransformer(transformers=[('ohe',ohe,['colour','dayofweek'])],remainder='passthrough')
pl = Pipeline([('ctsi',ctsi),('ctohe',ctohe)])
outfit = pl.fit_transform(X,y)
I get the error:
ValueError: Specifying the columns using strings is only supported for pandas DataFrames
I believe it's because the column name colour has been removed by SI. When I change the OHE columns to a list of int:
ctohe = ColumnTransformer(transformers=[('ohe',ohe,[0,1])],remainder='passthrough')
It goes through. I'm just testing the processing; obviously, the columns are incorrect.
So my question is: given what I want to accomplish, is it possible? And how can I do that?
Many thanks in advance!

Actually, I agree with your reasoning. The problem is that ColumnTransformer forgets the column names after the transformation and, quoting the answer in here, ColumnTransformer's intended usage is to apply transformations in parallel. In my opinion, the doc says as much in this sentence:
This estimator allows different columns or column subsets of the input to be transformed separately. [...] This is useful for heterogeneous or columnar data, to combine several feature extraction mechanisms or transformations into a single transformer.
I guess one solution might be a custom renaming of the columns, passing a callable as the columns element of the (name, transformer, columns) tuples in ColumnTransformer's transformers (notation as in the documentation), according to your needs (this would work, I guess, if you pass a callable to the second ColumnTransformer instance in the pipeline).
EDIT: I have to partly withdraw what I wrote. I'm not sure that passing a callable as columns would work for your need, because your problem is not column selection per se but column selection via string column names, for which you need a DataFrame (and IMO acting on the column selector alone won't solve that).
Instead, you probably need a transformer, acting on a DataFrame, that restores the column names after the imputation and before the one-hot encoding (bearing in mind that having different ColumnTransformer instances transform the same variables in sequence in a Pipeline is not the ideal setting).
A couple of months ago, https://github.com/scikit-learn/scikit-learn/pull/21078 was merged; I suspect it is not yet in the latest release because I couldn't get it to work after upgrading sklearn. Anyway, IMO it may help in similar situations in the future, as it adds get_feature_names_out() to SimpleImputer, and get_feature_names_out() is in turn really useful when dealing with column names.
In general, I would also suggest the same post linked above for further details.
Finally, here's a naive example I could come up with; it's not scalable (I tried to get something more scalable by exploiting the feature_names_in_ attribute of the fitted SimpleImputer instance, without reaching a consistent result), but hopefully it gives some hints.
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

X = pd.DataFrame({'city': ['London', 'London', 'Paris', np.nan],
                  'title': ['His Last Bow', 'How Watson Learned the Trick', 'A Moveable Feast', 'The Grapes of Wrath'],
                  'expert_rating': [5, 3, 4, 5],
                  'user_rating': [4, 5, 4, 3]})

# Step 1: impute 'city'; the output is a NumPy array, so the column names are lost.
ct_1 = ColumnTransformer([('si', SimpleImputer(strategy='most_frequent'), ['city'])],
                         remainder='passthrough')
# Step 2: one-hot encode 'city' by name, which requires a DataFrame as input.
ct_2 = ColumnTransformer([('ohe', OneHotEncoder(), ['city'])], remainder='passthrough', verbose_feature_names_out=True)

class ColumnExtractor(BaseEstimator, TransformerMixin):
    # Rebuilds a DataFrame with the given column names between the two steps.
    def __init__(self, columns):
        self.columns = columns
    def transform(self, X, *_):
        return pd.DataFrame(X, columns=self.columns)
    def fit(self, *_):
        return self

pipe = Pipeline([
    ('ct_1', ct_1),
    ('ce', ColumnExtractor(['city', 'title', 'expert_rating', 'user_rating'])),
    ('ct_2', ct_2)
])
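
To make the lost-column-names point concrete, here is a minimal sketch on made-up data (the toy frame and the names ct_si / ct_ohe are my own assumptions, not the asker's real setup). It shows that the first ColumnTransformer hands back a plain NumPy array with the transformed column first and the passthrough columns after it, so a second ColumnTransformer can only select by position:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy data (assumed for illustration); 'colour' has one missing value.
X = pd.DataFrame({'dayofweek': ['mon', 'tue', 'mon', 'wed'],
                  'colour': ['red', 'blue', np.nan, 'red']}, dtype=object)

ct_si = ColumnTransformer(
    [('si', SimpleImputer(strategy='most_frequent'), ['colour'])],
    remainder='passthrough')

# The result is a NumPy array: column 0 is the imputed colour,
# column 1 is the passed-through dayofweek.
imputed = ct_si.fit_transform(X)

# The second step therefore selects columns by integer position.
ct_ohe = ColumnTransformer([('ohe', OneHotEncoder(), [0, 1])],
                           remainder='passthrough')
pipe = Pipeline([('ct_si', ct_si), ('ct_ohe', ct_ohe)])

out = pipe.fit_transform(X)
# ColumnTransformer may return a sparse matrix; densify for inspection.
out = out.toarray() if hasattr(out, 'toarray') else np.asarray(out)
```

Positional selection works, but it silently depends on the column order produced by the first step, which is why the name-restoring transformer above (or a nested pipeline) tends to be less fragile.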

I think the simplest option (for now) is separate pipelines for the two columns:
si = SimpleImputer(strategy='mean', add_indicator=True)
# On scikit-learn >= 1.2 the parameter is called sparse_output instead of sparse.
ohe = OneHotEncoder(sparse=False, drop='first')
siohe = Pipeline([
    ('si', si),
    ('ohe', ohe),
])
ct = ColumnTransformer([
    ('siohe', siohe, ['colour']),
    ('ohe', ohe, ['dayofweek']),
])
(You specified the strategy as 'mean', which is probably not what you want, but I've left it in the code above.)
You might also consider just using siohe for both columns: if there are no missing values in dayofweek (in production data!) then there's no real difference.
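
As a runnable sketch of the nested-pipeline idea, here is a version on toy data. The data is made up, and I have swapped in strategy='most_frequent' (since 'mean' cannot apply to strings) and left out drop='first' and the dense-output flag to keep it version-independent; treat those as my assumptions rather than the asker's settings:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy data (assumed); the third colour is missing.
X = pd.DataFrame({'colour': ['red', 'blue', np.nan, 'red'],
                  'dayofweek': ['mon', 'tue', 'mon', 'wed']}, dtype=object)

# Impute then encode, chained sequentially inside one pipeline.
siohe = Pipeline([
    ('si', SimpleImputer(strategy='most_frequent')),
    ('ohe', OneHotEncoder(handle_unknown='ignore')),
])

# The two branches run in parallel, but each column is handled by one branch only.
ct = ColumnTransformer([
    ('siohe', siohe, ['colour']),
    ('ohe', OneHotEncoder(handle_unknown='ignore'), ['dayofweek']),
])

out = ct.fit_transform(X)
# Densify in case a sparse matrix is returned.
out = out.toarray() if hasattr(out, 'toarray') else np.asarray(out)
print(out.shape)  # (4, 5): 2 colour columns + 3 dayofweek columns
```

The missing colour is imputed as 'red' before encoding, so the third row gets the same colour columns as the first.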


How to implement a function through scikit FunctionTransformer() that refers to two columns of a data frame ('kw_args' argument?)

While working on my submission for the famous Kaggle Titanic dataset (890 rows/11 columns), I would like to execute all of my 'Feature Engineering' steps within one scikit pipeline. However, I could barely find any online examples that demonstrate how to use the scikit FunctionTransformer() to execute slightly more complex custom functions, especially functions that refer to more than one column of the dataset.
In my concrete example, I would like to replace NaN values in the column 'Age' depending on the passenger class (column 'Pclass'). Possible passengers classes are 1, 2 or 3 and the corresponding ages that should replace the NaN values are 38, 30 and 25. My current code looks like this:
def impute_age_class(df, column_1, column_2):
    for i in range(len(df)):
        if np.isnan(df[column_1].iloc[i]):
            if df[column_2].iloc[i] == 1:
                df[column_1].iloc[i] = 38
            elif df[column_2].iloc[i] == 2:
                df[column_1].iloc[i] = 30
            else:
                df[column_1].iloc[i] = 25
    return df
age_transformers = [("impute_age_class", FunctionTransformer(impute_age_class,validate=False, kw_args={'column_1': 'Age', 'column_2': 'Pclass'}), ["Age", "Pclass"])]
It seems like the code gets executed and I receive a slightly better accuracy score with my logreg model, but I also get the warnings shown in this picture:
[image: warning message]
I would be very thankful if you could give me any hints on whether the syntax of my code could be improved in order to avoid these warnings and ensure correct execution.
That warning is very common, and worth reading up on. But it's also not great to be looping over the rows of a dataframe. You can use pandas's own fillna for this:
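from sklearn.preprocessing import FunctionTransformer

def impute_age_class(df, fillme, groupby):
    df = df.copy()  # work on a copy so the caller's frame is not modified
    # Fill missing values with the age mapped from the passenger class.
    df.loc[:, fillme] = df[fillme].fillna(
        df[groupby].map({1: 38, 2: 30, 3: 25})
    )
    return df

tfmr = FunctionTransformer(
    impute_age_class,
    kw_args={'fillme': 'Age', 'groupby': 'Pclass'}
)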
It's a little unusual to have the parameters for the two column names when you are hard-coding the mapping inside the function. And if you didn't have the mapping already in mind, it'd be better to learn it at fit time and then transform train and test sets with that mapping: see SimpleImputer with groupby and https://datascience.stackexchange.com/q/71856/55122.
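
A minimal sketch of the "learn it at fit time" idea follows. GroupMedianImputer is a hypothetical helper name of my own, not a scikit-learn class, and using the per-class median is an assumption about what mapping you would want to learn:

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class GroupMedianImputer(BaseEstimator, TransformerMixin):
    """Hypothetical helper: fill NaNs in `target` with per-group medians learned in fit."""

    def __init__(self, target='Age', group='Pclass'):
        self.target = target
        self.group = group

    def fit(self, X, y=None):
        # Learn one median per group from the training data only.
        self.medians_ = X.groupby(self.group)[self.target].median()
        return self

    def transform(self, X):
        X = X.copy()  # avoid modifying the caller's DataFrame
        X[self.target] = X[self.target].fillna(X[self.group].map(self.medians_))
        return X


# Toy train/test frames (assumed values).
train = pd.DataFrame({'Pclass': [1, 1, 2, 2, 3, 3],
                      'Age': [40.0, 36.0, 30.0, np.nan, 20.0, 30.0]})
test = pd.DataFrame({'Pclass': [1, 2, 3], 'Age': [np.nan, np.nan, 22.0]})

imputer = GroupMedianImputer().fit(train)
filled = imputer.transform(test)
```

Because the medians are stored on the fitted object, the test set is filled with values learned from the training set, which is the behaviour you want inside a Pipeline.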

Python sklearn-pandas Transform Multiple Columns at the same time error

I am using python with pandas and sklearn and trying to use the new and very convenient sklearn-pandas.
I have a big data frame and need to transform multiple columns in a similar way.
I have multiple column names in the variable other. The source code documentation here states explicitly that there is a possibility of transforming multiple columns with the same transformation, but the following code does not behave as expected:
from sklearn.preprocessing import MinMaxScaler, LabelEncoder
mapper = DataFrameMapper([[other[0],other[1]],LabelEncoder()])
I get the following error:
raise ValueError("bad input shape {0}".format(shape))
ValueError: ['EFW', 'BPD']: bad input shape (154, 2)
When I use the following code, it works great:
cols = [(other[i], LabelEncoder()) for i,col in enumerate(other)]
mapper = DataFrameMapper(cols)
To my understanding, both should work well and yield same results.
What am I doing wrong here?
The problem you encounter here, is that the two snippets of code are completely different in terms of data structure.
cols = [(other[i], LabelEncoder()) for i,col in enumerate(other)] builds a list of tuples. Do note that you can shorten this line of code to:
cols = [(col, LabelEncoder()) for col in other]
Anyway, the first snippet, [[other[0],other[1]],LabelEncoder()] results in a list containing two elements: a list and a LabelEncoder instance. Now, it is documented that you can transform multiple columns through specifying:
Transformations may require multiple input columns. In these cases, the column names can be specified in a list:
mapper2 = DataFrameMapper([
    (['children', 'salary'], sklearn.decomposition.PCA(1))
])
This is a list containing tuple(list, object) structured elements, not list[list, object] structured elements.
If we take a look at the source code itself,
class DataFrameMapper(BaseEstimator, TransformerMixin):
    """
    Map Pandas data frame column subsets to their own
    sklearn transformation.
    """
    def __init__(self, features, default=False, sparse=False, df_out=False,
                 ...):
        """
        features    a list of tuples with features definitions.
                    The first element is the pandas column selector. This can
                    be a string (for one column) or a list of strings.
                    The second element is an object that supports
                    sklearn's transform interface, or a list of such objects.
                    The third element is optional and, if present, must be
                    a dictionary with the options to apply to the
                    transformation. Example: {'alias': 'day_of_week'}
        """
It is also clearly stated in the class definition that the features argument to DataFrameMapper is required to be a list of tuples, where the elements of the tuple may be lists.
As a last note, as to why you actually get your error message: the LabelEncoder transformer in sklearn is meant for encoding labels, i.e. 1D arrays. As such, it is fundamentally unable to handle 2 columns at once and will raise an exception. So, if you want to use the LabelEncoder, you will have to build N tuples, each with 1 column name and the transformer, where N is the number of columns you wish to transform.
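
The 1D restriction can be checked with plain scikit-learn, without sklearn-pandas. This is only a sketch on made-up values (the column names are borrowed from the error message, the data is assumed):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame (assumed values).
df = pd.DataFrame({'EFW': ['low', 'high', 'low'],
                   'BPD': ['a', 'a', 'b']}, dtype=object)
cols = ['EFW', 'BPD']

# Passing both columns at once fails because LabelEncoder expects 1D input.
try:
    LabelEncoder().fit_transform(df[cols])
    failed = False
except ValueError:
    failed = True

# One encoder per column, mirroring the one-tuple-per-column form.
encoders = {col: LabelEncoder().fit(df[col]) for col in cols}
encoded = pd.DataFrame({col: encoders[col].transform(df[col]) for col in cols})
```

Keeping the fitted encoders in a dict also lets you call inverse_transform per column later.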

Creating many feature columns in Tensorflow

I'm getting started on a Tensorflow project, and am in the middle of defining and creating my feature columns. However, I have hundreds and hundreds of features- it's a pretty extensive dataset. Even after preprocessing and scrubbing, I have a lot of columns.
The traditional way of creating a feature_column is defined in the Tensorflow tutorial and even this StackOverflow post. You essentially declare and initialize a Tensorflow object for each feature column:
gender = tf.feature_column.categorical_column_with_vocabulary_list(
"gender", ["Female", "Male"])
This works all well and good if your dataset has only a few columns, but in my case, I surely don't want to have hundreds of lines of code initializing different feature_column objects.
What's the best way to resolve this issue? I notice that in the tutorial, all the columns are collected as a list:
base_columns = [
gender, native_country, education, occupation, workclass, relationship,
Which is ultimately passed into your estimator:
m = tf.estimator.LinearClassifier(
model_dir=model_dir, feature_columns=base_columns)
So would the ideal way of handling feature_column creation for hundreds of columns be to append them directly into a list? Something like this?
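my_columns = []
for col in df.columns:
    if is_string_dtype(df[col]):  # is_string_dtype is a pandas function
        my_columns.append(tf.feature_column.categorical_column_with_hash_bucket(
            col, hash_bucket_size=len(df[col].unique())))
    elif is_numeric_dtype(df[col]):  # is_numeric_dtype is a pandas function
        my_columns.append(tf.feature_column.numeric_column(col))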
Is this the best way of creating these feature columns? Or am I missing some functionality to Tensorflow that allows me to work around this step?
What you have posted in the question makes sense. Small extension based on your own code:
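import pandas.api.types as ptypes
my_columns = []
for col in df.columns:
    if ptypes.is_string_dtype(df[col]):
        my_columns.append(tf.feature_column.categorical_column_with_hash_bucket(
            col, hash_bucket_size=len(df[col].unique())))
    elif ptypes.is_numeric_dtype(df[col]):
        my_columns.append(tf.feature_column.numeric_column(col))
    elif ptypes.is_categorical_dtype(df[col]):
        # Assumed: a hash-bucket column, inferred from the hash_bucket_size argument.
        my_columns.append(tf.feature_column.categorical_column_with_hash_bucket(
            col, hash_bucket_size=len(df[col].unique())))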
I used your own answer. Just edited a little bit (there should be my_columns instead of my_column in for loop) and posting it the way it worked for me.
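import pandas.api.types as ptypes
my_columns = []
for col in df.columns:
    if ptypes.is_string_dtype(df[col]):  # is_string_dtype is a pandas function
        my_columns.append(tf.feature_column.categorical_column_with_hash_bucket(
            col, hash_bucket_size=len(df[col].unique())))
    elif ptypes.is_numeric_dtype(df[col]):  # is_numeric_dtype is a pandas function
        my_columns.append(tf.feature_column.numeric_column(col))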
The two methods above work only if the data is provided as a pandas DataFrame with a name for each column. If instead all your columns are numeric and you don't want to name them, e.g. when reading several numerical columns from a NumPy array, you can use something like this:
feature_column = [tf.feature_column.numeric_column(key='image',shape=(784,))]
input_fn = tf.estimator.inputs.numpy_input_fn(dict({'image':x_train})
where x_train is your NumPy array with 784 columns. You can check this post by Vikas Sangwan for more details.
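
The dtype dispatch used in the loops above can be tried out without TensorFlow. In this sketch, describe_columns is a hypothetical helper of my own that only records which kind of feature column each DataFrame column would get, plus the distinct-value count that the loops pass as hash_bucket_size; the toy frame is assumed:

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def describe_columns(df):
    """Hypothetical helper: return (name, kind, n_unique) for each column."""
    specs = []
    for col in df.columns:
        if is_numeric_dtype(df[col]):
            specs.append((col, 'numeric', None))
        else:
            # Everything non-numeric is treated as categorical; the number of
            # distinct values is what would be used as the bucket size.
            specs.append((col, 'categorical', df[col].nunique()))
    return specs


# Toy frame (assumed values).
df = pd.DataFrame({'gender': pd.Series(['Female', 'Male', 'Female'], dtype=object),
                   'age': [39, 50, 38]})
specs = describe_columns(df)
```

Separating the "decide the kind" step from the "build the TensorFlow object" step makes it easy to inspect what the loop will do on hundreds of columns before any feature column is created.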

Deal with missing categorical data python

I have a CSV file and I'm preparing its data for training with different machine learning algorithms, so I replaced missing numeric data with the mean of that column. But how should I deal with missing categorical data: should I replace it with the most frequent element? And what is the easiest way to do it in Python using pandas?
dataset = pd.read_csv('doc.csv')
X = dataset.iloc[:, [2, 4, 5, 6, 7, 9,10 ,11]].values
y = dataset.iloc[:, -1].values
Row number 2 contains the categorical data.
first row value :
[3, 'S', 22.0, 1, 0, 7.25, 107722, 2]
Regarding the modelling part of your question, you're better off asking that at CrossValidated.
If there are too many records with missing data, you could just remove that column from consideration altogether. There are some other excellent suggestions on this StackOverflow post, including scikit-learn's Imputer() class, or just letting the model handle the missing data.
Regarding replacing values in a column, look into the DataFrame.replace() method.
An example usage of this for your dataset, assuming that the missing column values are called 'N' and you are replacing them by some other category 'S' (which you found out using the DataFrame.mode() method): dataset[1].replace('N', 'S').
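
If the missing entries are real NaN values rather than a placeholder such as 'N', a minimal pandas sketch of "most frequent for categorical, mean for numeric" could look like this (column names and values are made up for illustration):

```python
import numpy as np
import pandas as pd

# Toy data (assumed); one missing value in each column.
dataset = pd.DataFrame({
    'embarked': pd.Series(['S', 'C', np.nan, 'S'], dtype=object),
    'fare': [7.25, 71.28, np.nan, 8.05],
})

# Most frequent category for the categorical column.
most_frequent = dataset['embarked'].mode()[0]
dataset['embarked'] = dataset['embarked'].fillna(most_frequent)

# Column mean for the numeric column, as described in the question.
dataset['fare'] = dataset['fare'].fillna(dataset['fare'].mean())
```

Note that mode() can return several values when there is a tie; taking [0] simply picks the first one.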

How to encode categorical features in sklearn?

I have a dataset with 41 features [from 0 to 40 columns], of which 7 are categorical. This categorical set is divided in two subset:
A subset of string type(the column-features 1, 2, 3)
A subset of int type, in binary form 0 or 1 (the column-features 6, 11, 20, 21)
Furthermore the column-features 1, 2 and 3 (of string type) have cardinality 3, 66 and 11 respectively.
In this context I have to encode them to use support vector machine algorithm.
This is the code that I have:
import numpy as np
import pandas as pd
from sklearn import preprocessing
from sklearn import feature_extraction
df = pd.read_csv("train.csv")
datanumpy = df.to_numpy()  # as_matrix() was removed in pandas 1.0
X = datanumpy[:, 0:41] # select columns 0 through 40 (the 41 features)
y = datanumpy[:, 41] # select column 41 (the labels)
I don't know whether it is better to use DictVectorizer() or OneHotEncoder() [for the reasons given above], and above all how to use them [in terms of code] with the X matrix that I have.
Or should I simply assign a number to each value in the string-type subset (since they have high cardinality and so my feature space will increase exponentially)?
With respect to the int-type subset, I guess the best choice is to keep the column-features as they are (not pass them to any encoder).
The problem persists for the string-type subset with high cardinality.
This is by far the easiest:
df = pd.get_dummies(df, drop_first=True)
If you get a memory overflow or it is too slow then reduce the cardinality:
top = df[col].isin(df[col].value_counts().index[:10])
df.loc[~top, col] = "other"
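
Putting the two snippets together on toy data (the column names, values, and the choice to keep the top 2 categories are assumptions for illustration), the cardinality reduction has to happen before get_dummies:

```python
import pandas as pd

# Toy data (assumed); 'service' has four distinct values.
df = pd.DataFrame({
    'service': pd.Series(['http', 'http', 'http', 'ftp', 'ftp', 'smtp', 'dns'],
                         dtype=object),
    'duration': [1, 2, 3, 4, 5, 6, 7],
})
col = 'service'
keep = 2  # number of most frequent categories to keep (assumed value)

# Rows whose category is among the `keep` most frequent ones.
top = df[col].isin(df[col].value_counts().index[:keep])
# Collapse every other category into a single "other" bucket.
df.loc[~top, col] = 'other'

# Encode after the reduction, so only ftp/http/other produce dummy columns.
encoded = pd.get_dummies(df, drop_first=True)
```

With drop_first=True the first category in sorted order ('ftp' here) is dropped, leaving one column each for 'http' and 'other'.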
As per the official documentation of One Hot Encoder, it should be applied over the combined dataset (Train and Test). Otherwise it may not form a proper encoding.
And performance-wise I think One Hot Encoder is much better than DictVectorizer.
You can use the pandas method .get_dummies() as suggested by @simon above, or you can use the sklearn equivalent given by OneHotEncoder.
I prefer OneHotEncoder because you can pass to it parameters like the categorical features you want to encode and the number of values to keep for each feature (if not indicated, it will select automatically the optimal number).
If, for some features, the cardinality is too big, impose low n_values.
If you have enough memory don't worry, encode all the values of your features.
I guess for a cardinality of 66, if you have a basic computer, encoding all of the 66 features won't lead to a memory issue. Memory overflow usually happens when you have for example as much values for a feature as the number of samples in your dataset (the case for IDs that are unique per sample). The bigger the dataset, the more likely you'll get a memory issue.
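
For the mixed case in the question (string columns to encode, binary int columns to leave alone), one possible sketch with current scikit-learn is to wrap OneHotEncoder in a ColumnTransformer and pass the other columns through. The column names and data below are made up, and the older n_values parameter mentioned above is not used here:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Toy data (assumed): one string feature, one binary int feature, one float feature.
X = pd.DataFrame({
    'protocol': pd.Series(['tcp', 'udp', 'icmp', 'tcp'], dtype=object),
    'logged_in': [1, 0, 0, 1],
    'duration': [0.5, 1.0, 0.0, 2.0],
})

# Encode only the string column; every other column is passed through unchanged.
ct = ColumnTransformer(
    [('ohe', OneHotEncoder(handle_unknown='ignore'), ['protocol'])],
    remainder='passthrough')

out = ct.fit_transform(X)
# Densify in case a sparse matrix is returned.
out = out.toarray() if hasattr(out, 'toarray') else np.asarray(out)
```

The encoded columns come first (categories in sorted order: icmp, tcp, udp), followed by the passthrough columns, giving 5 output columns for this toy frame.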
