Names for Feature Selection - python

I want to know the names of the features within my RF model. I read here that the output from gs.best_estimator_.named_steps["stepname"].feature_importances_ would mirror my columns from my data. However, the length of gs.best_estimator_.... is 10 and I have 13 columns. Some columns were not important. From other answers around (answer1, answer2), I would have to declare something within my pipeline. But I am confused as to what to declare because both answers deal with PCA, not RF.
Here is what I have so far.
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn import datasets

# use iris as example
iris = datasets.load_iris(as_frame=True).frame
iris.columns = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species']
X = iris.drop(['sepal_length'], axis=1)
y = iris.sepal_length
cat_feats = ['species']

X_train, X_test, y_train, y_test = \
    train_test_split(X, y, train_size=0.8, test_size=0.2, random_state=13)

# Pipeline
categorical_transformer = Pipeline(steps=[
    ('onehot', OneHotEncoder(handle_unknown='ignore', sparse=False))
])

# Bundle any preprocessing
preprocessor = ColumnTransformer(
    transformers=[
        ('cat', categorical_transformer, cat_feats)
    ])

rf = RandomForestRegressor(random_state=13)

mymodel = Pipeline(steps=[('preprocessor', preprocessor),
                          ('model', rf)])

# For this example, I used an empty grid (default values). In reality I do use a dictionary of parameters
gs = GridSearchCV(mymodel,
                  param_grid={},
                  n_jobs=-1,
                  cv=5)
gs.fit(X_train, y_train)

Why the length of the feature list does not match
The length of your features does not match because all non-categorical columns are being discarded by your ColumnTransformer. By default, it only keeps columns for which a transformation was specified. If you do not want this to happen, you need to specify remainder='passthrough':
preprocessor = ColumnTransformer(transformers=[('cat', OneHotEncoder(), cat_feats)],
                                 remainder='passthrough')
(I removed your categorical pipeline, which is not necessary here)
Also keep in mind that applying the OHE will add features and so the total number of features is going to be larger than what you had in the beginning.
How to get the feature names
Once you have fitted everything, you need to retrieve the feature names for the result of the OHE and the remaining numerical columns.
For the OHE columns:
cat_features = gs.best_estimator_["preprocessor"].named_transformers_["cat"].get_feature_names()
For the numerical columns, you need to declare num_feats, a list of all numerical features in the same order as in your original dataframe.
Then just do:
feature_names = np.concatenate((cat_features, num_feats))
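For example, here is a minimal sketch of tying the names to the importances (it assumes the modified preprocessor above, so that named_transformers_['cat'] is the OneHotEncoder itself, and builds num_feats as the non-categorical columns in their original order):
import numpy as np
import pandas as pd

num_feats = [c for c in X_train.columns if c not in cat_feats]  # passthrough columns, original order
cat_features = gs.best_estimator_["preprocessor"].named_transformers_["cat"].get_feature_names()
feature_names = np.concatenate((cat_features, num_feats))

# Pair each name with its importance from the fitted forest (the 'model' step of the pipeline).
importances = pd.Series(gs.best_estimator_["model"].feature_importances_,
                        index=feature_names).sort_values(ascending=False)
print(importances)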
PS: this is a bit cumbersome, and this might be improved in later sklearn versions, but as of now, this is the procedure
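Indeed, if you are on a newer sklearn release (roughly 1.0 onwards), the fitted ColumnTransformer exposes get_feature_names_out(), which covers both the OHE output and the passthrough columns in a single call, so the manual concatenation above can be replaced by:
feature_names = gs.best_estimator_["preprocessor"].get_feature_names_out()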

Related

Sklearn pipeline transform specific columns - ValueError: too many values to unpack (expected 2)

I am trying to make a pipeline with a scaler, one-hot encoder, polynomial features, and finally a linear regression model:
from sklearn.pipeline import Pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler(), num_cols),
    ('polynom', PolynomialFeatures(3), num_cols),
    ('encoder', OneHotEncoder(), cat_cols),
    ('linear_regression', LinearRegression())
])
But when I fit the pipeline I get ValueError: too many values to unpack (expected 2):
pipeline.fit(x_train,y_train)
pipeline.score(x_test, y_test)
If I understand correctly, you want to apply some steps of the pipeline to specific columns. Instead of doing it by adding the column names at the end of the pipeline steps (which is incorrect and causes the error), you have to use a ColumnTransformer. Here you can find another similar example.
In your case, you could do something like this:
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import PolynomialFeatures
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LinearRegression
from sklearn.compose import ColumnTransformer

# Fake data.
train_data = pd.DataFrame({'n1': range(10), 'n2': range(10)})
train_data['c1'] = 0
train_data.loc[5:, 'c1'] = 1
y_train = [0]*10
y_train[5:] = [1]*5

# Here I assumed you are using a DataFrame. If not, use integer indices instead of column names.
num_cols = ['n1', 'n2']
cat_cols = ['c1']

# Pipeline to transform the numerical features.
numerical_transformer = Pipeline([('scaler', StandardScaler()),
                                  ('polynom', PolynomialFeatures(3))])

# Apply the numerical transformer only on the numerical columns.
# Separately, apply the OneHotEncoder.
ct = ColumnTransformer([('num_transformer', numerical_transformer, num_cols),
                        ('encoder', OneHotEncoder(), cat_cols)])

# Main pipeline for fitting.
pipeline = Pipeline([
    ('column_transformer', ct),
    ('linear_regression', LinearRegression())
])

pipeline.fit(train_data, y_train)
Schematically, the layout of the pipeline is: a ColumnTransformer (scaler + polynomial features on the numerical columns, one-hot encoder on the categorical columns) feeding into the LinearRegression step.
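For reference, a slightly more compact but equivalent construction is possible with make_pipeline / make_column_transformer, which generate the step names automatically; this is only a sketch reusing num_cols, cat_cols, train_data and y_train from above:
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures, OneHotEncoder
from sklearn.linear_model import LinearRegression

# Same layout as above: numerical pipeline + OneHotEncoder inside a column transformer,
# followed by the linear regression.
pipeline = make_pipeline(
    make_column_transformer(
        (make_pipeline(StandardScaler(), PolynomialFeatures(3)), num_cols),
        (OneHotEncoder(), cat_cols)
    ),
    LinearRegression()
)
pipeline.fit(train_data, y_train)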

ColumnTransformer(s) in various parts of a pipeline do not play well

I am using sklearn and mlxtend.regressor.StackingRegressor to build a stacked regression model.
For example, say I want the following small pipeline:
A Stacking Regressor with two regressors:
  - A pipeline which:
      - Performs data imputation
      - 1-hot encodes categorical features
      - Performs linear regression
  - A pipeline which:
      - Performs data imputation
      - Performs regression using a Decision Tree
Unfortunately this is not possible, because StackingRegressor doesn't accept NaN in its input data.
This is even if its regressors know how to handle NaN, as it would be in my case where the regressors are actually pipelines which perform data imputation.
However, this is not a problem: I can just move data imputation outside the stacked regressor.
Now my pipeline looks like this:
  - Perform data imputation
  - Apply a Stacking Regressor with two regressors:
      - A pipeline which:
          - 1-hot encodes categorical features
          - Standardises numerical features
          - Performs linear regression
      - An sklearn.tree.DecisionTreeRegressor.
One might try to implement it as follows (the entire minimal working example in this gist, with comments):
sr_linear = Pipeline(steps=[
    ('preprocessing', ColumnTransformer(transformers=[
        ('categorical',
         make_pipeline(OneHotEncoder(), StandardScaler()),
         make_column_selector(dtype_include='category')),
        ('numerical',
         StandardScaler(),
         make_column_selector(dtype_include=np.number))
    ])),
    ('model', LinearRegression())
])

sr_tree = DecisionTreeRegressor()

ct_imputation = ColumnTransformer(transformers=[
    ('categorical',
     SimpleImputer(strategy='constant', fill_value='None'),
     make_column_selector(dtype_include='category')),
    ('numerical',
     SimpleImputer(strategy='median'),
     make_column_selector(dtype_include=np.number))
])

stacked_regressor = Pipeline(steps=[
    ('imputation', ct_imputation),
    ('back_to_pandas', FunctionTransformer(
        func=lambda values: pd.DataFrame(values, columns=ct_imputation.get_feature_names_out())
    )),
    ('model', StackingRegressor(
        regressors=[sr_linear, sr_tree],
        meta_regressor=DecisionTreeRegressor(),
        use_features_in_secondary=True
    ))
])
Note that the "outer" ColumnTransformer (in stacked_regressor) returns a numpy matrix.
But the "inner" ColumnTransformer (in sr_linear) expects a pandas.DataFrame, so I had to convert the matrix back to a data frame using step back_to_pandas.
(To use get_feature_names_out I had to use the nightly version of sklearn, because the current stable 1.0.2 version does not support it yet. Fortunately it can be installed with one simple command.)
The above code fails when calling stacked_regressor.fit(), with the following error (the entire stacktrace is again in the gist):
ValueError: make_column_selector can only be applied to pandas dataframes
However, because I added the back_to_pandas step to my outer pipeline, the inner pipelines should be getting a pandas data frame!
In fact, if I only fit_transform() my ct_imputation object, I clearly obtain a pandas data frame.
I cannot understand where and when exactly the data which gets passed around ceases to be a data frame.
Why is my code failing?
Imo the issue has to be ascribed to StackingRegressor. I am not an expert on its usage and I have not explored its source code, but I found sklearn issue #16473, which seems to imply that "the concatenation [of regressors and meta_regressors] does not preserve dataframe" (though this refers to sklearn's StackingRegressor rather than the mlxtend one).
Indeed, have a look at what happens once you replace it with your sr_linear pipeline:
from sklearn.datasets import fetch_openml
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, FunctionTransformer
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from mlxtend.regressor import StackingRegressor
import numpy as np
import pandas as pd

# We use the Ames house prices dataset for this example
d = fetch_openml('house_prices', as_frame=True).frame

# Small data preprocessing:
for column in d.columns:
    if d[column].dtype == object or column == 'MSSubClass':
        d[column] = pd.Categorical(d[column])
d.drop(columns='Id', inplace=True)

# Prepare the data for training
label = 'SalePrice'
features = [col for col in d.columns if col != label]
X, y = d[features], d[label]

# Train the stacked regressor
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=True)

sr_linear = Pipeline(steps=[
    ('preprocessing', ColumnTransformer(transformers=[
        ('categorical',
         make_pipeline(OneHotEncoder(), StandardScaler(with_mean=False)),
         make_column_selector(dtype_include='category')),
        ('numerical',
         StandardScaler(),
         make_column_selector(dtype_include=np.number))
    ])),
    ('model', LinearRegression())
])

ct_imputation = ColumnTransformer(transformers=[
    ('categorical',
     SimpleImputer(missing_values=np.nan, strategy='constant', fill_value='None'),
     make_column_selector(dtype_include='category')),
    ('numerical',
     SimpleImputer(strategy='median'),
     make_column_selector(dtype_include=np.number))
])

stacked_regressor = Pipeline(steps=[
    ('imputation', ct_imputation),
    ('back_to_pandas', FunctionTransformer(
        func=lambda values: pd.DataFrame(values, columns=ct_imputation.get_feature_names_out()).astype(types)
    )),
    ('mdl', sr_linear)
])

stacked_regressor.fit(X_train, y_train)
Observe that I had to slightly modify the 'back_to_pandas' step because, for some reason, pd.DataFrame was casting all column dtypes to 'object' (instead of 'category' and 'float64'), which clashed with the preprocessing performed in sr_linear. For this reason I applied .astype(types) to the pd.DataFrame constructor, where types is defined as follows (based on the implementation of the .get_feature_names_out() method of SimpleImputer in the dev version of sklearn):
types = {}
for col in d.columns[:-1]:
    if d[col].dtype == 'category':
        types['categorical__' + col] = str(d[col].dtype)
    else:
        types['numerical__' + col] = str(d[col].dtype)
The correct thing to do was:
  1. Move from mlxtend's to sklearn's StackingRegressor. I believe the former was created when sklearn still didn't have a stacking regressor. Now there is no need to use more 'obscure' solutions; sklearn's stacking regressor works pretty well.
  2. Move the 1-hot-encoding step to the outer pipeline, because (surprisingly!) sklearn's DecisionTreeRegressor cannot handle categorical data among the features.
A working version of the code is given below:
from sklearn.datasets import fetch_openml
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.ensemble import StackingRegressor
import numpy as np
import pandas as pd

def set_correct_categories(df: pd.DataFrame) -> pd.DataFrame:
    for column in df.columns:
        if df[column].dtype == object or 'MSSubClass' in column:
            df[column] = pd.Categorical(df[column])
    return df

d = fetch_openml('house_prices', as_frame=True).frame
d = set_correct_categories(d).drop(columns='Id')

sr_linear = Pipeline(steps=[
    ('preprocessing', StandardScaler()),
    ('model', LinearRegression())
])

ct_preprocessing = ColumnTransformer(transformers=[
    ('categorical',
     make_pipeline(
         SimpleImputer(strategy='constant', fill_value='None'),
         OneHotEncoder(sparse=False, handle_unknown='ignore')
     ),
     make_column_selector(dtype_include='category')),
    ('numerical',
     SimpleImputer(strategy='median'),
     make_column_selector(dtype_include=np.number))
])

stacking_regressor = Pipeline(steps=[
    ('preprocessing', ct_preprocessing),
    ('model', StackingRegressor(
        estimators=[('linear_regression', sr_linear), ('regression_tree', DecisionTreeRegressor())],
        final_estimator=DecisionTreeRegressor(),
        passthrough=True
    ))
])

label = 'SalePrice'
features = [col for col in d.columns if col != label]
X, y = d[features], d[label]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=True)
stacking_regressor.fit(X_train, y_train)
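As a quick usage sketch on the held-out split created above (not part of the original answer), the fitted pipeline behaves like any other estimator:
print(stacking_regressor.score(X_test, y_test))  # R^2 on the 20% test split
preds = stacking_regressor.predict(X_test)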
Thanks to user amiola for his answer putting me on the right track.

Trouble changing imputer strategy in scikit-learn pipeline

I am trying to use GridSearchCV to select the best imputer strategy but I am having trouble doing that.
First, I have a data preparation pipeline for numerical and categorical columns-
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import Pipeline, make_pipeline
num_pipe = make_pipeline(SimpleImputer(strategy='median'), StandardScaler())
cat_pipe = make_pipeline(SimpleImputer(strategy='constant', fill_value='NA'),
                         OneHotEncoder(sparse=False, handle_unknown='ignore'))

preprocessing = ColumnTransformer([
    ("num", num_pipe, num_cols),
    ("cat", cat_pipe, cat_cols)
])
Next, I have created a pipeline to train a support vector machine model with feature selection.
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR

model = Pipeline([
    ("preprocess", preprocessing),
    ("feature_select", SelectFromModel(RandomForestRegressor(random_state=42))),
    ("regressor", SVR(kernel='rbf', C=30000.0, gamma=0.3))
])
Now, I am trying to see which imputer strategy is best for imputing missing values for numerical columns using a GridSearchCV
grid = {"model.named_steps.preprocess.transformers[0][1].named_steps['simpleimputer'].strategy":
['mean','median','most_frequent']}
grid_search = GridSearchCV(model, param_grid = grid, cv=5, scoring='neg_mean_squared_error')
grid_search.fit(X_train, y_train)
This is where I am getting the error. The full pipeline looks like this -
Pipeline(steps=[('preprocess',
                 ColumnTransformer(transformers=[('num',
                                                  Pipeline(steps=[('simpleimputer',
                                                                   SimpleImputer(strategy='median')),
                                                                  ('standardscaler',
                                                                   StandardScaler())]),
                                                  ['longitude', 'latitude',
                                                   'housing_median_age',
                                                   'total_rooms',
                                                   'total_bedrooms',
                                                   'population', 'households',
                                                   'median_income']),
                                                 ('cat',
                                                  Pipeline(steps=[('simpleimputer',
                                                                   SimpleImputer(fill_value='NA',
                                                                                 strategy='constant')),
                                                                  ('onehotencoder',
                                                                   OneHotEncoder(handle_unknown='ignore',
                                                                                 sparse=False))]),
                                                  ['ocean_proximity'])])),
                ('feature_select',
                 SelectFromModel(estimator=RandomForestRegressor(random_state=42))),
                ('regressor', SVR(C=30000.0, gamma=0.3))])
Can anyone tell me what I need to change in the grid search to make it work?
The way you specify the parameter is via a dictionary that maps the name of the estimator/transformer plus the name of the parameter you want to change to the values you want to try. If you have a pipeline, or a pipeline of pipelines, the key is the names of all the parent steps joined with double underscores. So for your case it looks like this:
grid = {
    "preprocess__num__simpleimputer__strategy": ['mean', 'median', 'most_frequent']
}
simpleimputer is simply the name that was automatically assigned by make_pipeline.
However, I think there are other issues in your code, like fill_value='NA': it is probably not what you want, and is actually not needed, since fill_value is not the marker of missing values but the value used to fill them.
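With that dictionary, the rest of the grid search from the question works unchanged; a minimal sketch, assuming model, X_train and y_train as defined in the question:
from sklearn.model_selection import GridSearchCV

grid = {"preprocess__num__simpleimputer__strategy": ['mean', 'median', 'most_frequent']}
grid_search = GridSearchCV(model, param_grid=grid, cv=5, scoring='neg_mean_squared_error')
grid_search.fit(X_train, y_train)

# Best imputation strategy found, e.g. {'preprocess__num__simpleimputer__strategy': 'median'}
print(grid_search.best_params_)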

Dummify categorical variables for logistic regression with pandas and scikit (OneHotEncoder)

I read this blog about new things in scikit. The OneHotEncoder taking strings seems like a useful feature. Below is my attempt to use it.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
cols = ['Survived', 'Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']
train_df = pd.read_csv('../../data/train.csv', usecols=cols)
test_df = pd.read_csv('../../data/test.csv', usecols=[e for e in cols if e != 'Survived'])
train_df.dropna(inplace=True)
test_df.dropna(inplace=True)
X_train = train_df.drop("Survived", axis=1)
Y_train = train_df["Survived"]
X_test = test_df.copy()
ct = ColumnTransformer([("onehot", OneHotEncoder(sparse=False), ['Sex', 'Embarked'])], remainder='passthrough')
X_train_t = ct.fit_transform(train_df)
X_test_t = ct.fit_transform(test_df)
print(X_train_t[0])
print(X_test_t[0])
# [ 0. 1. 0. 0. 1. 0. 3. 22. 1. 0. 7.25]
# [ 0. 1. 0. 1. 0. 3. 34.5 0. 0. 7.8292]
logreg = LogisticRegression(max_iter=5000)
logreg.fit(X_train_t, Y_train)
Y_pred = logreg.predict(X_test_t) # ValueError: X has 10 features per sample; expecting 11
acc_log = round(logreg.score(X_train, Y_train) * 100, 2)
print(acc_log)
I encounter the below python error with this code and also I have some additional concerns.
ValueError: X has 10 features per sample; expecting 11
To start from the beginning .. this script is written for the "titanic" dataset from kaggle. We have five numerical columns Pclass, Age, SibSp, Parch and Fare. The columns Sex and Embarked are categories male/female and Q/S/C (which is an abbreviation for a city name).
What I understood from the OneHotEncoder is that it creates dummy variables by placing additional columns. Well actually the output of ct.fit_transform() is no longer a pandas dataframe but a numpy array now. But as seen in the print debug statement there are more than the original 7 columns now.
There are three problems I encounter:
1. For some reason the test.csv has one less column. That would indicate to me that there is one less option in one of the categories. To fix that I would have to find all the available options in the categories over both train + test data, and then use these options (such as male/female) to transform the train and the test data separately. I have no idea how to do this with the tools I'm working with (pandas, scikit, etc). On second thought, after inspecting the data I cannot find the missing option in the test.csv.
2. I want to avoid the "dummy variable trap". Right now it seems that there are too many columns created. I was expecting 1 column for Sex (total options 2 - 1 to avoid the trap) and 2 for Embarked. With the additional 5 numerical columns that would come to 8 total.
3. I don't recognize the output of the transform anymore. I would prefer a new dataframe where the new dummy columns have been given their own names, such as Sex_male (1/0), Embarked_Q (1/0) and Embarked_S (1/0).
I'm only used to using gretl; there, dummifying a variable and leaving out one option is very natural. I don't know if I'm doing it wrong in python or if this scenario is not part of the standard scikit toolkit. Any advice? Maybe I should write a custom encoder for this?
I will try and answer all your questions individually.
Answer for Question 1
In your code you have used the fit_transform method on both your train and test data, which is not the correct way of doing it. Generally, fit_transform is applied only on your train data set; the fitted transformer is then used to transform your test data set. When you call fit_transform on your test data, the transformer is refitted with only the options/levels of the categorical variables available in the test set, and it is very possible that your test data does not contain all options/levels of all categorical variables. As a result, the dimensions of your train and test data sets differ, which produces the error you got.
So the correct way of doing it would be:
X_train_t = ct.fit_transform(X_train)
X_test_t = ct.transform(X_test)
Answer for Question 2
If you want to avoid the "dummy variable trap" you can make use of the parameter drop (by setting it to first) while creating the OneHotEncoder object in the ColumnTransformer, this will result in creating just one column for sex and two columns for Embarked since they have two and three options/levels respectively.
So the correct way of doing it would be:
ct = ColumnTransformer([("onehot", OneHotEncoder(sparse=False, drop="first"), ['Sex','Embarked'])], remainder='passthrough')
Answer for Question 3
As of now, the get_feature_names method, which could be used to reconstruct your data frame with the new dummy columns, is not implemented in sklearn for a ColumnTransformer with remainder='passthrough'. One workaround is to set remainder to 'drop' in the ColumnTransformer and construct your data frame separately, as shown below:
ct = ColumnTransformer([("onehot", OneHotEncoder(sparse=False, drop="first"), ['Sex', 'Embarked'])], remainder='drop')
A = pd.concat([X_train.drop(["Sex", "Embarked"], axis=1), pd.DataFrame(X_train_t, columns=ct.get_feature_names())], axis=1)
A.head()
which will result in a data frame containing the untouched numerical columns followed by the new dummy columns.
Your final code will look like this:
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
cols = ['Survived', 'Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']
train_df = pd.read_csv('train.csv', usecols=cols)
test_df = pd.read_csv('test.csv', usecols=[e for e in cols if e != 'Survived'])
train_df = train_df.dropna()
test_df = test_df.dropna()
train_df = train_df.reset_index(drop=True)
test_df = test_df.reset_index(drop=True)
X_train = train_df.drop("Survived", axis=1)
Y_train = train_df["Survived"]
X_test = test_df.copy()
categorical_values = ['Sex', 'Embarked']
X_train_cont = X_train.drop(categorical_values, axis=1)
X_test_cont = X_test.drop(categorical_values, axis=1)
ct = ColumnTransformer([("onehot", OneHotEncoder(sparse=False, drop="first"), categorical_values)], remainder='drop')
X_train_categorical = ct.fit_transform(X_train)
X_test_categorical = ct.transform(X_test)
X_train_t = pd.concat([X_train_cont, pd.DataFrame(X_train_categorical, columns=ct.get_feature_names())], axis=1)
X_test_t = pd.concat([X_test_cont, pd.DataFrame(X_test_categorical, columns=ct.get_feature_names())], axis=1)
logreg = LogisticRegression(max_iter=5000)
logreg.fit(X_train_t, Y_train)
Y_pred = logreg.predict(X_test_t)
acc_log = round(logreg.score(X_train_t, Y_train) * 100, 2)
print(acc_log)
80.34
And when you do X_train_t.head() you can see the numerical columns followed by the new dummy columns.
The recommended practice is what is suggested in #Parthasarathy Subburaj's answer, but I have seen in Kaggle and other competitions that people fit on the complete data (train + test). If you want to try the same, use the following format:
ct.fit(X_complete)
X_train_t, X_test_t = ct.transform(X_train), ct.transform(X_test)
Yes, use drop='first' to get over this issue. At the same time, remember that this multicollinearity problem is not a big deal for non-linear models such as neural networks or even decision trees; I believe that is why it is not the default argument value.
get_feature_names is not implemented exhaustively for pipelines and other components in sklearn, hence it is not fully supported in ColumnTransformer either.
Based on my experience, I built this wrapper for ColumnTransformer, which supports get_feature_names even when the transformer contains pipelines or remainder='passthrough'.
It also picks up the actual feature names for get_feature_names instead of reporting them as x0, x1, ..., because the actual column names are available inside the ColumnTransformer via _feature_names_in.
from sklearn.compose import ColumnTransformer
from sklearn.utils.validation import check_is_fitted

def _get_features_out(name, trans, features_in):
    if hasattr(trans, 'get_feature_names'):
        return [name + "__" + f for f in
                trans.get_feature_names(features_in)]
    else:
        return features_in

class NamedColumnTransformer(ColumnTransformer):
    def get_feature_names(self):
        check_is_fitted(self)
        feature_names = []
        for name, trans, features, _ in self._iter(fitted=True):
            if trans == 'drop':
                continue
            if trans == 'passthrough':
                feature_names.extend(self._feature_names_in[features])
            elif hasattr(trans, '_iter'):
                for _, op_name, t in trans._iter():
                    features = _get_features_out(op_name, t, features)
                feature_names.extend(features)
            elif not hasattr(trans, 'get_feature_names'):
                raise AttributeError("Transformer %s (type %s) does not "
                                     "provide get_feature_names."
                                     % (str(name), type(trans).__name__))
            else:
                feature_names.extend(_get_features_out(name, trans, features))
        return feature_names
Now, for your example,
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
import pandas as pd

# you can fetch the titanic dataset using this
X, y = fetch_openml("titanic", version=1,
                    as_frame=True, return_X_y=True)

# removing the columns which you are not using
X.drop(['name', 'ticket', 'cabin', 'boat', 'body', 'home.dest'],
       axis=1, inplace=True)
X.dropna(inplace=True)
X.reset_index(drop=True, inplace=True)
y = y[X.index]

categorical_values = ['sex', 'embarked']

ct = NamedColumnTransformer([("onehot", OneHotEncoder(
    sparse=False, drop="first"), categorical_values)], remainder='passthrough')

clf = Pipeline(steps=[('preprocessor', ct),
                      ('classifier', LogisticRegression(max_iter=5000))])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
clf.fit(X_train, y_train)

clf[0].get_feature_names()
# ['onehot__sex_male',
#  'onehot__embarked_Q',
#  'onehot__embarked_S',
#  'pclass',
#  'age',
#  'sibsp',
#  'parch',
#  'fare']

pd.DataFrame(clf[0].transform(X_train), columns=clf[0].get_feature_names())
You can also try the NamedColumnTransformer for a more interesting example of ColumnTransformer here.

sklearn pipeline - how to apply different transformations on different columns

I am pretty new to pipelines in sklearn and I am running into this problem: I have a dataset that has a mixture of text and numbers i.e. certain columns have text only and rest have integers (or floating point numbers).
I was wondering if it was possible to build a pipeline where I can for example call LabelEncoder() on the text features and MinMaxScaler() on the numbers columns. The examples I have seen on the web mostly point towards using LabelEncoder() on the entire dataset and not on select columns. Is this possible? If so any pointers would be greatly appreciated.
The way I usually do it is with a FeatureUnion, using a FunctionTransformer to pull out the relevant columns.
Important notes:
- You have to define your functions with def since, annoyingly, you can't use lambda or partial in FunctionTransformer if you want to pickle your model
- You need to initialize FunctionTransformer with validate=False
Something like this:
from sklearn.pipeline import make_union, make_pipeline
from sklearn.preprocessing import FunctionTransformer, LabelEncoder, MinMaxScaler

def get_text_cols(df):
    return df[['name', 'fruit']]

def get_num_cols(df):
    return df[['height', 'age']]

vec = make_union(*[
    make_pipeline(FunctionTransformer(get_text_cols, validate=False), LabelEncoder()),
    make_pipeline(FunctionTransformer(get_num_cols, validate=False), MinMaxScaler())
])
Since v0.20, you can use ColumnTransformer to accomplish this.
An Example of ColumnTransformer might help you:
# FOREGOING TRANSFORMATIONS ON 'data' ...

# filter data
data = data[data['county'].isin(COUNTIES_OF_INTEREST)]

# define the feature encoding of the data
impute_and_one_hot_encode = Pipeline([
    ('impute', SimpleImputer(strategy='most_frequent')),
    ('encode', OneHotEncoder(sparse=False, handle_unknown='ignore'))
])

featurisation = ColumnTransformer(transformers=[
    ("impute_and_one_hot_encode", impute_and_one_hot_encode, ['smoker', 'county', 'race']),
    ('word2vec', MyW2VTransformer(min_count=2), ['last_name']),
    ('numeric', StandardScaler(), ['num_children', 'income'])
])

# define the training pipeline for the model
neural_net = KerasClassifier(build_fn=create_model, epochs=10, batch_size=1, verbose=0, input_dim=109)
pipeline = Pipeline([
    ('features', featurisation),
    ('learner', neural_net)])

# train-test split
train_data, test_data = train_test_split(data, random_state=0)
# model training
model = pipeline.fit(train_data, train_data['label'])
You can find the entire code under: https://github.com/stefan-grafberger/mlinspect/blob/19ca0d6ae8672249891835190c9e2d9d3c14f28f/example_pipelines/healthcare/healthcare.py
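If you just want a minimal, self-contained version of the ColumnTransformer approach for the original question, here is a sketch using hypothetical column names (name, fruit, height, age, borrowed from the first answer); note it uses OneHotEncoder rather than LabelEncoder, since LabelEncoder is intended for targets and does not work inside a ColumnTransformer:
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data mixing text and numeric columns.
df = pd.DataFrame({'name': ['ann', 'bob', 'cid', 'dee'],
                   'fruit': ['apple', 'pear', 'apple', 'plum'],
                   'height': [150, 160, 170, 180],
                   'age': [20, 30, 40, 50]})
y = [0, 1, 0, 1]

# One-hot encode the text columns, scale the numeric ones.
preprocess = ColumnTransformer([
    ('text', OneHotEncoder(handle_unknown='ignore'), ['name', 'fruit']),
    ('num', MinMaxScaler(), ['height', 'age'])
])

clf = Pipeline([('preprocess', preprocess), ('model', LogisticRegression())])
clf.fit(df, y)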
