Environment:
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.ensemble import RandomForestClassifier
Sample data:
X_train = pd.DataFrame({'A': ['a1', 'a3', 'a2'],
'B': ['b2', 'b1', 'b3'],
'C': [1, 2, 3]})
y_train = pd.DataFrame({'Y': [1,0,1]})
Desired outcome:
I would like to include sklearn OneHotEncoder in my pipeline in this format:
encoder = ### SOME CODE ###
scaler = StandardScaler()
model = RandomForestClassifier(random_state=0)
# This is my ideal pipeline
pipe = Pipeline([('OneHotEncoder', encoder),
('Scaler', scaler),
('Classifier', model)])
pipe.fit(X_train, y_train)
Challenge:
OneHotEncoder encodes everything, including the numerical columns. I want to keep the numerical columns as they are and encode only the categorical features, in an efficient way that is compatible with Pipeline().
encoder = OneHotEncoder(drop='first', sparse=False)
encoder.fit(X_train)
encoder.transform(X_train) # Column C is encoded - this is what I want to avoid
Workaround (not ideal): I can get around the problem using pd.get_dummies(). However, this means I can't include it in my pipeline. Or is there a way?
X_train = pd.get_dummies(X_train, drop_first=True)
My preferred solution for this would be to use sklearn's ColumnTransformer (see here).
It lets you split the data into as many groups as you want (in your case, categorical vs. numerical data) and apply different preprocessing operations to each group. The transformer can then be used in a pipeline like any other sklearn preprocessing tool. Here is a short example:
import pandas as pd
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
X = pd.DataFrame({"a":[1,2,3],"b":["A","A","B"]})
y = np.array([0,1,1])
OHE = OneHotEncoder()
scaler = StandardScaler()
RFC = RandomForestClassifier()
cat_cols = ["b"]
num_cols = ["a"]
transformer = ColumnTransformer([('cat_cols', OHE, cat_cols),
('num_cols', scaler, num_cols)])
pipe = Pipeline([("preprocessing", transformer),
("classifier", RFC)])
pipe.fit(X,y)
NB: I have taken some license with your request because this only applies the scaler to the numerical data, which I believe makes more sense? If you do want to apply the scaler to all columns, you can do this as well by modifying this example.
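If you do want the exact layout from the question (encoder, then a scaler over all columns, then the model), one possible sketch is to wrap OneHotEncoder in a ColumnTransformer with remainder='passthrough' and use that as the encoder step. The step names are taken from the question; sparse_threshold=0 is my own choice to force a dense output so that StandardScaler can centre the data, and y_train is assumed to be a Series rather than a DataFrame.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

X_train = pd.DataFrame({'A': ['a1', 'a3', 'a2'],
                        'B': ['b2', 'b1', 'b3'],
                        'C': [1, 2, 3]})
y_train = pd.Series([1, 0, 1], name='Y')

# Encode only A and B; C is passed through untouched and appended on the right.
encoder = ColumnTransformer(
    [('onehot', OneHotEncoder(drop='first'), ['A', 'B'])],
    remainder='passthrough',
    sparse_threshold=0,  # assumption: always return a dense array
)
scaler = StandardScaler()
model = RandomForestClassifier(random_state=0)

pipe = Pipeline([('OneHotEncoder', encoder),
                 ('Scaler', scaler),
                 ('Classifier', model)])
pipe.fit(X_train, y_train)

# Inspect what the encoder step alone produces.
encoded = pipe.named_steps['OneHotEncoder'].transform(X_train)
predictions = pipe.predict(X_train)
```

With drop='first', A and B each contribute two dummy columns, and C stays as the last column.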
What I would do is create my own custom transformer and put it into the pipeline. This way you have a lot of control over the data. The steps are as follows:
1) Create a custom transformer class inheriting from BaseEstimator and TransformerMixin. In its transform() method, detect whether each column's values are numerical or categorical. If you do not want to deal with that logic right now, you can instead pass the names of the categorical columns to the transformer so that it selects them on the fly.
2) (Optional) Create your custom transformer to handle columns with only categorical values.
3) (Optional) Create your custom transformer to handle columns with only numerical values.
4) Build two pipelines (one for categorical, the other for numerical) using transformers you created and you can also use the existing ones from sklearn.
5) Merge two pipelines with FeatureUnion.
6) Merge your big pipeline with your ML model.
7) Call fit() (or fit_transform() if the pipeline ends in a transformer rather than a model).
The sample code (no optionals implemented): GitHub Jupyter Notebook
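The steps above can be sketched roughly as follows. This is not the code from the linked notebook; ColumnSelector is a hypothetical helper I made up for illustration, and the column names are hard-coded rather than detected automatically.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler


class ColumnSelector(TransformerMixin, BaseEstimator):
    """Hypothetical helper: keep only the named DataFrame columns."""

    def __init__(self, columns):
        self.columns = columns

    def fit(self, X, y=None):
        return self  # nothing to learn

    def transform(self, X):
        return X[self.columns]


X = pd.DataFrame({'A': ['a1', 'a3', 'a2'],
                  'B': ['b2', 'b1', 'b3'],
                  'C': [1, 2, 3]})
y = [1, 0, 1]

# Step 4: one pipeline per column group.
cat_pipe = Pipeline([('select', ColumnSelector(['A', 'B'])),
                     ('onehot', OneHotEncoder(handle_unknown='ignore'))])
num_pipe = Pipeline([('select', ColumnSelector(['C'])),
                     ('scale', StandardScaler())])

# Step 5: merge the two pipelines side by side.
features = FeatureUnion([('cat', cat_pipe), ('num', num_pipe)])

# Steps 6 and 7: attach the model and fit.
model = Pipeline([('features', features),
                  ('clf', RandomForestClassifier(random_state=0))])
model.fit(X, y)

feature_matrix = features.fit_transform(X)
predictions = model.predict(X)
```

ColumnTransformer covers the same ground with less code, but the custom class gives you a place to put any extra per-column logic.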
I am working on the Titanic dataset and I wish to handle all the preprocessing activities on a pipeline. So, here is my code:
To get the dataset
!wget "https://calmcode.io/datasets/titanic.csv"
And then read it as below:
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
dt = pd.read_csv("./data/titanic.csv", index_col=["PassengerId"])
Then I set up a single pipeline which is supposed to preprocess the numerical features:
numerical_features = ["Age", "SibSp", "Parch", "Fare"]
numerical_pipeline = Pipeline(steps=[("min_max_scaler", MinMaxScaler()),
    ('num_imputer', SimpleImputer(missing_values=np.nan, strategy='mean'))])
Then fit the pipeline:
column_transformer = ColumnTransformer(transformers=[
    ('numeric_transformer', numerical_pipeline, numerical_features)], remainder='drop')
column_transformer.fit(dt)
transformed_dt = column_transformer.transform(dt)
However, I need to apply the imputer only to the Age feature and not to the other columns. Currently, it is applied to all of them.
My question is:
How can I specify that SimpleImputer should be applied only to the Age column and not to every column in numerical_pipeline?
I think you need to use two column transformers, so if you set up the minmax this way:
minmax = ColumnTransformer([(
"minmax",
MinMaxScaler(),
["age", "sibsp", "parch", "fare"])
],remainder='drop')
The output comes without column names, but based on the column names we input, age will be the first, so:
imp = ColumnTransformer([(
"impute",
SimpleImputer(missing_values=np.nan, strategy='mean'),
[0])
],remainder='passthrough')
Then into a pipeline:
Pipeline([("scale",minmax),("impute",imp)]).fit_transform(dt)
As you said in a comment, you want to impute first and scale second. In that case, I would first create a column transformer that imputes only the one column, passes through the three other numerical columns, and drops any columns that are not part of that set. After that, you add a MinMaxScaler on the output of that column transformer. In code:
ct = ColumnTransformer(
[
("num_imputer", SimpleImputer(missing_values=np.nan, strategy="mean"), ["Age"]),
("needed_columns", "passthrough", ["SibSp", "Parch", "Fare"]),
],
)
pipeline = Pipeline(steps=[("transform", ct), ("scale", MinMaxScaler())])
The important bit here is that you add a second entry to the list of transformers, that has the word "passthrough" and specifies all the columns that should be passed through without any modifications.
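To check that the imputer really touches only Age, here is a small sketch on made-up data (the toy frame and its values are my own, not the Titanic file). Fare deliberately contains a NaN: since MinMaxScaler leaves NaNs in place, that value should survive the pipeline while the missing Age gets filled.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Made-up rows; "Name" stands in for any column that should be dropped.
toy = pd.DataFrame({
    "Age": [20.0, np.nan, 40.0],
    "SibSp": [0, 1, 2],
    "Parch": [0, 0, 2],
    "Fare": [10.0, np.nan, 30.0],
    "Name": ["a", "b", "c"],
})

ct = ColumnTransformer(
    [
        ("num_imputer", SimpleImputer(missing_values=np.nan, strategy="mean"), ["Age"]),
        ("needed_columns", "passthrough", ["SibSp", "Parch", "Fare"]),
    ],
)
pipeline = Pipeline(steps=[("transform", ct), ("scale", MinMaxScaler())])

# Output column order follows the transformer list: Age, SibSp, Parch, Fare.
result = pipeline.fit_transform(toy)
```

The missing Age is replaced by the mean (30) before scaling, whereas the missing Fare is still NaN in the output.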
I'm working through a tutorial focusing on OneHotEncoder. I get the idea behind encoding features, but I'm having a little problem with using the encoder with pipeline to make a new prediction. Two of the features--"Sex" and "Embarked"--are categorical rather than numerical. When creating a new numpy array to make a prediction, do you use the initial values, say "male" and "C", or, say, "1" and "2" to make a new prediction? I get the following error: " ValueError: Specifying the columns using strings is only supported for pandas DataFrames," which is weird given that the values I'm using are numerical. Regardless, would I have to fit the pipeline to X_new to make a new prediction? If so, how can I do that?
X_new = [[3, 1, 0]] OR X_new = [['3','male', 'C']]
pipe.predict(X_new)
Complete code:
import pandas as pd
import numpy as np
df = pd.read_csv("https://raw.githubusercontent.com/---/pandas-videos/master/data/titanic_train.csv")
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression(solver='lbfgs')
from sklearn.model_selection import cross_val_score
cross_val_score(logreg, X, y, cv=5, scoring='accuracy').mean()
from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder(sparse=False)
X = df.drop('Survived', axis='columns')
from sklearn.compose import make_column_transformer
column_trans = make_column_transformer(
(OneHotEncoder(), ['Sex', 'Embarked']),
remainder='passthrough')
column_trans.fit_transform(X)
from sklearn.pipeline import make_pipeline
pipe = make_pipeline(column_trans, logreg)
cross_val_score(pipe, X, y, cv=5, scoring='accuracy').mean()
X_new = [[3, 1, 0]]
pipe.predict(X_new)
When you apply OneHotEncoder, each categorical column you specify is transformed into multiple binary (0/1) columns, one per unique value in that column.
For example, if the gender column contains 'male' and 'female', the original column is converted into 2 columns, one for 'male' and one for 'female'. This is different from LabelEncoder, which maps each category to a single integer.
If you want to combine a pipeline, logistic regression and OneHotEncoder, first fit the pipeline on the training data:
pipe.fit(X,y)
and then you can make predictions. Here is an example where I use 3 features (Sex, Age and Embarked) and apply OHE to Sex and Embarked.
X_new = [['female', 20, 'C']]
X_new_df = pd.DataFrame(X_new, columns=['Sex', 'Age', 'Embarked'])
pipe.predict(X_new_df)
However, the X in your code contains every column except the label ('Survived'), which is 11 features. The input passed to predict() must have the same columns the pipeline was fitted on, so passing only 3 values will raise an error.
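A self-contained sketch of the whole flow, using a few made-up rows instead of the Titanic file (the values and the choice of the three columns Pclass, Sex and Embarked are assumptions for illustration). The new observation holds the raw category strings and is wrapped in a DataFrame with the same column names used for fitting; the pipeline is not refitted on it.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Made-up training rows with the same column names as the tutorial data.
X = pd.DataFrame({
    'Pclass': [3, 1, 3, 1, 2, 3],
    'Sex': ['male', 'female', 'female', 'female', 'male', 'male'],
    'Embarked': ['S', 'C', 'S', 'S', 'Q', 'C'],
})
y = pd.Series([0, 1, 1, 1, 0, 0], name='Survived')

column_trans = make_column_transformer(
    (OneHotEncoder(handle_unknown='ignore'), ['Sex', 'Embarked']),
    remainder='passthrough')
pipe = make_pipeline(column_trans, LogisticRegression(solver='lbfgs'))
pipe.fit(X, y)  # fit once, on the training data only

# Raw values ('male', 'C'), not their encoded form, in a DataFrame.
X_new = pd.DataFrame([[3, 'male', 'C']],
                     columns=['Pclass', 'Sex', 'Embarked'])
prediction = pipe.predict(X_new)
```

Passing a plain list such as [[3, 'male', 'C']] is what I would expect to trigger the ValueError from the question, because the column transformer selects columns by name and a list has none.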
I have a pandas DataFrame df. I want to encode the continuous and categorical features of df using different encoders. I find make_column_transformer very convenient, but the code shown below fails with LabelEncoder(), although it works fine with OneHotEncoder(handle_unknown='ignore'). The error message is:
TypeError: fit_transform() takes 2 positional arguments but 3 were given
It's not clear to me how to fix this issue.
The code:
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import RobustScaler, OneHotEncoder, LabelEncoder
continuous_features = ['COL1','COL2']
categorical_features = ['COL3','COL4']
column_trans = make_column_transformer(
(categorical_features,LabelEncoder()),
(continuous_features, RobustScaler()))
X_enc = column_trans.fit_transform(df)
According to https://scikit-learn.org/stable/modules/generated/sklearn.compose.make_column_transformer.html, each tuple is (transformer, columns):
make_column_transformer(
    (StandardScaler(), ['numerical_column']),
    (OneHotEncoder(), ['categorical_column']))
So for your case:
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import RobustScaler, OneHotEncoder, LabelEncoder
continuous_features = ['COL1','COL2']
categorical_features = ['COL3','COL4']
column_trans = make_column_transformer(
(OneHotEncoder(), categorical_features),
(RobustScaler(), continuous_features))
X_enc = column_trans.fit_transform(df)
If you want to use LabelEncoder(), you can only pass one column, not two!
Hope this helps.
Imo the problem here was that LabelEncoder could not (and still can't) be used within ColumnTransformers or Pipelines instances, because it is meant to only transform targets, while ColumnTransformers and Pipelines are intended to be used for feature transformation only.
This can be clearly seen from the signatures of methods .fit(), .transform(), .fit_transform() of the LabelEncoder class, which differ from the ones of "more-standard" transformers.
fit(y) vs fit(X[,y]) | transform(y) vs transform(X) | fit_transform(y) vs fit_transform(X[,y])
respectively for LabelEncoder-like transformers (i.e. transformers to be applied on target) and for transformers to be applied on features.
I've just posted an answer in full details at Why should LabelEncoder from sklearn be used only for the target variable?.
That said, even if using LabelEncoder within a ColumnTransformer were allowed, you would probably have run into the problem described in the other answer, which derives from the fact that the input to LabelEncoder must be 1D (and, by the way, in such cases both the ColumnTransformer constructor and the make_column_transformer function require the columns parameter to be passed as a string rather than as a list of strings). A common use case for this is transformers meant to deal with text, like CountVectorizer, which do require 1D input
(see Sklearn custom transformers with pipeline: all the input array dimensions for the concatenation axis must match exactly, for instance).
Finally, since sklearn version 0.20, OrdinalEncoder is an alternative to LabelEncoder that can be used on feature matrices.
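A minimal sketch of that substitution, assuming a made-up df with the column names from the question: OrdinalEncoder takes a 2D block of feature columns and so fits into make_column_transformer, while LabelEncoder is kept for a 1D target.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder, RobustScaler

# Made-up data using the column names from the question.
df = pd.DataFrame({
    'COL1': [1.0, 2.0, 3.0, 4.0],
    'COL2': [10.0, 20.0, 30.0, 40.0],
    'COL3': ['a', 'b', 'a', 'c'],
    'COL4': ['x', 'y', 'y', 'x'],
})
continuous_features = ['COL1', 'COL2']
categorical_features = ['COL3', 'COL4']

# OrdinalEncoder follows the usual fit(X, y=None) transformer signature.
column_trans = make_column_transformer(
    (OrdinalEncoder(), categorical_features),
    (RobustScaler(), continuous_features))
X_enc = column_trans.fit_transform(df)

# LabelEncoder is meant for the target: a single 1D array of labels.
y_enc = LabelEncoder().fit_transform(['no', 'yes', 'yes', 'no'])
```

Categories are numbered in sorted order, so COL3 becomes 0, 1, 0, 2 and COL4 becomes 0, 1, 1, 0.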
This seems like a very important issue for this library, and so far I don't see a decisive answer, although it seems like for the most part, the answer is 'No.'
Right now, any method that uses the transformer api in sklearn returns a numpy array as its results. Usually this is fine, but if you're chaining together a multi-step process that expands or reduces the number of columns, not having a clean way to track how they relate to the original column labels makes it difficult to use this section of the library to its fullest.
As an example, here's a snippet that I just recently used, where the inability to map new columns to the ones originally in the dataset was a big drawback:
numeric_columns = train.select_dtypes(include=np.number).columns.tolist()
cat_columns = train.select_dtypes(include=object).columns.tolist()
numeric_pipeline = make_pipeline(SimpleImputer(strategy='median'), StandardScaler())
cat_pipeline = make_pipeline(SimpleImputer(strategy='most_frequent'), OneHotEncoder())
transformers = [
('num', numeric_pipeline, numeric_columns),
('cat', cat_pipeline, cat_columns)
]
combined_pipe = ColumnTransformer(transformers)
train_clean = combined_pipe.fit_transform(train)
test_clean = combined_pipe.transform(test)
In this example I split up my dataset using the ColumnTransformer and then added additional columns using the OneHotEncoder, so my arrangement of columns is not the same as what I started out with.
I could easily end up with different arrangements if I used other modules that share the same API, such as OrdinalEncoder, SelectKBest, etc.
If you're doing multi-step transformations, is there a way to consistently see how your new columns relate to your original dataset?
There's an extensive discussion about it here, but I don't think anything has been finalized yet.
Yes, you are right that there isn't complete support for tracking feature names in sklearn as of now. Initially, it was decided to keep things generic at the level of numpy arrays. The latest progress on adding feature names to sklearn estimators can be tracked here.
In the meantime, we can write wrappers to get the feature names out of a ColumnTransformer. I am not sure whether this captures every possible kind of ColumnTransformer, but at least it should solve your problem.
From Documentation of ColumnTransformer:
Notes
The order of the columns in the transformed feature matrix follows the order of how the columns are specified in the transformers list. Columns of the original feature matrix that are not specified are dropped from the resulting transformed feature matrix, unless specified in the passthrough keyword. Those columns specified with passthrough are added at the right to the output of the transformers.
Try this!
import pandas as pd
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline, Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, MinMaxScaler
from sklearn.feature_extraction.text import _VectorizerMixin
from sklearn.feature_selection._base import SelectorMixin
from sklearn.feature_selection import SelectKBest
from sklearn.feature_extraction.text import CountVectorizer
train = pd.DataFrame({'age': [23,12, 12, np.nan],
'Gender': ['M','F', np.nan, 'F'],
'income': ['high','low','low','medium'],
'sales': [10000, 100020, 110000, 100],
'foo' : [1,0,0,1],
'text': ['I will test this',
'need to write more sentence',
'want to keep it simple',
'hope you got that these sentences are junk'],
'y': [0,1,1,1]})
numeric_columns = ['age']
cat_columns = ['Gender','income']
numeric_pipeline = make_pipeline(SimpleImputer(strategy='median'), StandardScaler())
cat_pipeline = make_pipeline(SimpleImputer(strategy='most_frequent'), OneHotEncoder())
text_pipeline = make_pipeline(CountVectorizer(), SelectKBest(k=5))
transformers = [
('num', numeric_pipeline, numeric_columns),
('cat', cat_pipeline, cat_columns),
('text', text_pipeline, 'text'),
('simple_transformer', MinMaxScaler(), ['sales']),
]
combined_pipe = ColumnTransformer(
transformers, remainder='passthrough')
transformed_data = combined_pipe.fit_transform(
train.drop(columns='y'), train['y'])
def get_feature_out(estimator, feature_in):
    if hasattr(estimator, 'get_feature_names'):
        if isinstance(estimator, _VectorizerMixin):
            # handling all vectorizers
            return [f'vec_{f}'
                    for f in estimator.get_feature_names()]
        else:
            return estimator.get_feature_names(feature_in)
    elif isinstance(estimator, SelectorMixin):
        return np.array(feature_in)[estimator.get_support()]
    else:
        return feature_in

def get_ct_feature_names(ct):
    # handles all estimators, pipelines inside ColumnTransformer
    # doesn't work when remainder == 'passthrough'
    # which requires the input column names.
    output_features = []
    for name, estimator, features in ct.transformers_:
        if name != 'remainder':
            if isinstance(estimator, Pipeline):
                current_features = features
                for step in estimator:
                    current_features = get_feature_out(step, current_features)
                features_out = current_features
            else:
                features_out = get_feature_out(estimator, features)
            output_features.extend(features_out)
        elif estimator == 'passthrough':
            output_features.extend(ct._feature_names_in[features])
    return output_features
pd.DataFrame(transformed_data,
columns=get_ct_feature_names(combined_pipe))
I am pretty new to pipelines in sklearn and I am running into this problem: I have a dataset that has a mixture of text and numbers i.e. certain columns have text only and rest have integers (or floating point numbers).
I was wondering if it was possible to build a pipeline where I can for example call LabelEncoder() on the text features and MinMaxScaler() on the numbers columns. The examples I have seen on the web mostly point towards using LabelEncoder() on the entire dataset and not on select columns. Is this possible? If so any pointers would be greatly appreciated.
The way I usually do it is with a FeatureUnion, using a FunctionTransformer to pull out the relevant columns.
Important notes:
You have to define your functions with def since annoyingly you can't use lambda or partial in FunctionTransformer if you want to pickle your model
You need to initialize FunctionTransformer with validate=False
Something like this:
from sklearn.pipeline import make_union, make_pipeline
from sklearn.preprocessing import FunctionTransformer, LabelEncoder, MinMaxScaler
def get_text_cols(df):
    return df[['name', 'fruit']]
def get_num_cols(df):
    return df[['height', 'age']]
vec = make_union(*[
    make_pipeline(FunctionTransformer(get_text_cols, validate=False), LabelEncoder()),
    make_pipeline(FunctionTransformer(get_num_cols, validate=False), MinMaxScaler())
])
Since v0.20, you can use ColumnTransformer to accomplish this.
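A rough sketch of what that could look like for a frame with text and number columns. The data is made up, OrdinalEncoder stands in for LabelEncoder (which is meant for targets, not features), and make_column_selector, available in later releases than 0.20, is used so the columns are picked by dtype instead of being listed by hand.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.preprocessing import MinMaxScaler, OrdinalEncoder

# Made-up data: two text columns and two numeric columns.
df = pd.DataFrame({
    'name': ['ann', 'bob', 'cy'],
    'fruit': ['apple', 'kiwi', 'apple'],
    'height': [150, 160, 170],
    'age': [20, 30, 40],
})

ct = make_column_transformer(
    # everything that is not numeric gets integer codes
    (OrdinalEncoder(), make_column_selector(dtype_exclude=np.number)),
    # numeric columns are scaled to [0, 1]
    (MinMaxScaler(), make_column_selector(dtype_include=np.number)),
)
encoded = ct.fit_transform(df)
```

The output columns follow the transformer order: name and fruit codes first, then scaled height and age.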
An Example of ColumnTransformer might help you:
# FOREGOING TRANSFORMATIONS ON 'data' ...
# filter data
data = data[data['county'].isin(COUNTIES_OF_INTEREST)]
# define the feature encoding of the data
impute_and_one_hot_encode = Pipeline([
('impute', SimpleImputer(strategy='most_frequent')),
('encode', OneHotEncoder(sparse=False, handle_unknown='ignore'))
])
featurisation = ColumnTransformer(transformers=[
("impute_and_one_hot_encode", impute_and_one_hot_encode, ['smoker', 'county', 'race']),
('word2vec', MyW2VTransformer(min_count=2), ['last_name']),
('numeric', StandardScaler(), ['num_children', 'income'])
])
# define the training pipeline for the model
neural_net = KerasClassifier(build_fn=create_model, epochs=10, batch_size=1, verbose=0, input_dim=109)
pipeline = Pipeline([
('features', featurisation),
('learner', neural_net)])
# train-test split
train_data, test_data = train_test_split(data, random_state=0)
# model training
model = pipeline.fit(train_data, train_data['label'])
You can find the entire code under: https://github.com/stefan-grafberger/mlinspect/blob/19ca0d6ae8672249891835190c9e2d9d3c14f28f/example_pipelines/healthcare/healthcare.py