How do I Label Encode using the Pipeline API? - python

I want to incorporate Label Encoding through the scikit learn pipeline. Unfortunately, LabelEncoder() is broken with the pipeline API so that's not an option right now. I tried creating my own class which calls .map() to map categories to labels:
from sklearn.base import TransformerMixin
from sklearn.base import BaseEstimator
class RatingsTransformer(BaseEstimator, TransformerMixin):
    """Takes in dataframe, converts all categorical Ratings columns into numerical Ratings columns
    via label-encoding"""
    def __init__(self):
        pass
    def fit(self, df, y=None):
        return self
    def transform(self, df, y=None):
        """Transform all of the categorical ratings columns into numerical ratings columns"""
        for feature in df.columns:
            df[feature] = df[feature].map({
                "Po" : 1,
                "Fa" : 2,
                "TA" : 3,
                "Gd" : 4,
                "Ex" : 5,
            })
        return df
Then, I set up the following pipeline:
import numpy as np
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
def select_numeric_features(df):
    return df.select_dtypes(include=np.number).columns
def select_categorical_features(df):
    return df.select_dtypes(exclude=np.number).columns
def select_rated_features(df):
    rated_features = []
    for column in df:
        # This criterion determines if a column is a 'rated column'
        if any(df[column] == 'TA'):
            rated_features.append(column)
    return rated_features
pipeline = make_column_transformer(
    (RatingsTransformer(), select_rated_features),
    (SimpleImputer(strategy='constant', fill_value='None'), select_categorical_features),
    (SimpleImputer(strategy='constant', fill_value=0), select_numeric_features),
    remainder='passthrough'
)
The problem with this is that after the RatingsTransformer() step, the categorical 'ratings' columns are supposed to become numerical columns. However, this change doesn't show up in the column selection part of the column transformer, so select_numeric_features and select_categorical_features will pick up the 'ratings' columns as if they had not been mapped from categories to values. Basically, the column transformer isn't using columns that were updated in the middle of the pipeline. Is there any workaround for this? Or is there a simpler solution to label encoding using the pipeline API?

LabelEncoder is meant for encoding labels, i.e. the target y. If you want to encode data (i.e. X), you can use a OneHotEncoder or an OrdinalEncoder, both of which integrate easily into a scikit-learn Pipeline.
In your case, it seems that you want to ordinal-encode your data:
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OrdinalEncoder
preprocessor = make_pipeline(
    SimpleImputer(strategy="constant", fill_value="missing"),
    OrdinalEncoder()
)
preprocessor.fit_transform(X_train)
A more complete example can be found here: https://scikit-learn.org/stable/auto_examples/compose/plot_column_transformer_mixed_types.html#sphx-glr-auto-examples-compose-plot-column-transformer-mixed-types-py
You could use an OrdinalEncoder instead of the OneHotEncoder if the classifier is not a linear model (e.g. RandomForestClassifier).
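As a rough sketch of how this might look for the rating columns in the question: OrdinalEncoder accepts a categories argument that fixes the order of the levels, which replaces the hand-written .map() call. The toy frame and its column names (KitchenQual, LotArea) are made up for illustration. Note that the resulting codes start at 0 rather than 1.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy data: one rating column and one numeric column.
X = pd.DataFrame({
    "KitchenQual": ["TA", "Gd", "Ex", "Po", "Fa"],
    "LotArea": [8450.0, np.nan, 11250.0, 9550.0, 14260.0],
})

# Explicit level order, worst to best; codes are assigned 0..4 in this order.
rating_levels = ["Po", "Fa", "TA", "Gd", "Ex"]

preprocessor = make_column_transformer(
    (OrdinalEncoder(categories=[rating_levels]), ["KitchenQual"]),
    (SimpleImputer(strategy="constant", fill_value=0), ["LotArea"]),
)

# Each transformer receives its own slice of the ORIGINAL columns.
encoded = preprocessor.fit_transform(X)
print(encoded)
```

Because every transformer in a column transformer works on the original input, listing the columns explicitly (as here) avoids the problem of dtype-based selectors seeing the not-yet-encoded rating columns.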

Related

How to setup the Imputer as part of sklearn pipeline?

I am working on the Titanic dataset and I wish to handle all the preprocessing activities on a pipeline. So, here is my code:
To get the dataset
!wget "https://calmcode.io/datasets/titanic.csv"
And then read it as below:
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
dt = pd.read_csv("./data/titanic.csv", index_col=["PassengerId"])
And then I set up a single pipeline which is supposed to preprocess the numerical features:
numerical_features = ["Age", "SibSp", "Parch", "Fare"]
numerical_pipeline = Pipeline(steps=[("min_max_scaler", MinMaxScaler()),
                                     ("num_imputer", SimpleImputer(missing_values=np.nan, strategy="mean"))])
Then fit the pipeline:
from sklearn.compose import ColumnTransformer
column_transformer = ColumnTransformer(transformers=[
    ('numeric_transformer', numerical_pipeline, numerical_features)], remainder='drop')
column_transformer.fit(dt)
transformed_dt = column_transformer.transform(dt)
But I need to apply the imputer only to the Age feature, not to all the other columns. Currently, it is applied to all of the columns.
My question is:
How can I specify that the SimpleImputer should be applied only to the Age column and not to everything in numerical_pipeline?
I think you need to use two column transformers, so if you set up the minmax this way:
minmax = ColumnTransformer([(
    "minmax",
    MinMaxScaler(),
    ["age", "sibsp", "parch", "fare"])
], remainder='drop')
The output comes without column names, but based on the column names we input, age will be the first, so:
imp = ColumnTransformer([(
    "impute",
    SimpleImputer(missing_values=np.nan, strategy='mean'),
    [0])
], remainder='passthrough')
Then into a pipeline:
Pipeline([("scale",minmax),("impute",imp)]).fit_transform(dt)
As you have said in a comment, you want to impute first and scale second. In that case, I would first create a column transformer that imputes only the one column, passes through the three other numerical columns, and drops any columns that are not part of that set. After that, you add a MinMaxScaler on the output of that column transformer. In code:
ct = ColumnTransformer(
    [
        ("num_imputer", SimpleImputer(missing_values=np.nan, strategy="mean"), ["Age"]),
        ("needed_columns", "passthrough", ["SibSp", "Parch", "Fare"]),
    ],
)
pipeline = Pipeline(steps=[("transform", ct), ("scale", MinMaxScaler())])
The important bit here is that you add a second entry to the list of transformers, that has the word "passthrough" and specifies all the columns that should be passed through without any modifications.
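To make the effect concrete, here is a minimal sketch of the same impute-then-scale pipeline run on a small made-up frame. The data and the extra "Name" column are assumptions for illustration, not the real Titanic file.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Hypothetical stand-in for the Titanic frame; "Name" should end up dropped.
toy = pd.DataFrame({
    "Age": [20.0, np.nan, 40.0],
    "SibSp": [0, 1, 2],
    "Parch": [0, 0, 2],
    "Fare": [10.0, 20.0, 30.0],
    "Name": ["a", "b", "c"],
})

ct = ColumnTransformer([
    # Only "Age" goes through the imputer.
    ("num_imputer", SimpleImputer(strategy="mean"), ["Age"]),
    # The other numeric columns are forwarded untouched.
    ("needed_columns", "passthrough", ["SibSp", "Parch", "Fare"]),
])  # remainder defaults to "drop", so "Name" is discarded
pipeline = Pipeline(steps=[("transform", ct), ("scale", MinMaxScaler())])

# Output column order follows the transformer list: Age, SibSp, Parch, Fare.
scaled = pipeline.fit_transform(toy)
print(scaled)
```

The missing age is filled with the mean of the observed ages (30) before scaling, so the first output column becomes 0, 0.5, 1.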

implement custom one-hot-encoding function for sklearn pipeline

In relation to the question posted in "One Hot Encoding preserve the NAs for imputation", I am trying to create a custom function that handles NAs when one-hot encoding categorical variables. The setup should be suitable for a train/test split and modelling using an sklearn pipeline.
A simple reproducible example of my problem:
#Packages
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.impute import KNNImputer
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import Ridge
from sklearn.impute import SimpleImputer
# Make some categorical data X and a response y and split it.
X = pd.DataFrame(columns=["1","2"],data = [["A",np.nan],["B","A"],[np.nan,"A"],[np.nan,"B"],["B","A"],["A","B"],["C","B"],["D","E"]])
y = pd.DataFrame(data = np.array([1,5,4,6,2,3,9,9]))
X_train, X_test, Y_train, Y_test = train_test_split(X,y,test_size=0.2,random_state=42)
I've then created a custom function that does OHE with nan (using the procedure described in Cyclical Loop Between OneHotEncoder and KNNImpute in Scikit-learn)
class OHE_with_nan(BaseEstimator,TransformerMixin):
    """ OHE with NAN. Not super pretty but works..
    """
    def __init__(self, copy=True):
        self.copy = copy
    def fit(self, X, y = None):
        """ This transformer does not use a fit procedure """
        return self
    def transform(self, X, y = None):
        """ Return the new object here"""
        # Replace nans with "Missing" such that OneHotEncoder can work.
        enc_missing = SimpleImputer(strategy="constant",fill_value="missing")
        data1 = pd.DataFrame(columns=X.columns,data = enc_missing.fit_transform(X))
        #Perform standard OHE
        OHE = OneHotEncoder(sparse=False,handle_unknown="ignore")
        OHE_fit = OHE.fit_transform(data1)
        #save feature names of the OHE dataframe
        data_OHE = pd.DataFrame(columns=OHE.get_feature_names(data1.columns),data = OHE_fit)
        # Initialize
        Column_names = data1.columns
        Final_OHE = pd.DataFrame()
        # Loop over columns to replace 0s with nan the correct places.
        for i in range(len(data1.columns)):
            tmp_data = data_OHE[data_OHE.columns[pd.Series(data_OHE.columns).str.startswith(Column_names[i])]]
            missing_name = tmp_data.iloc[:,-1:].columns
            missing_index = np.where(tmp_data[missing_name]==1)[0]
            tmp_data.loc[missing_index,:] = np.nan
            tmp_data1 = tmp_data.drop(missing_name,axis=1)
            Final_OHE = pd.concat([Final_OHE, tmp_data1], axis=1)
        return Final_OHE
This is then combined into a pipeline that predicts y using ridge regression (random choice of model, just for the example..)
Estimator = Pipeline([
    ('Ohe_with_NA',OHE_with_nan()),
    ("Imputer",KNNImputer(n_neighbors=1)),
    ('Model',Ridge(alpha = 0.01))
])
The procedure can be fitted:
pipe_fit = Estimator.fit(X_train,Y_train)
But testing on unseen data fails:
pipe_fit.score(X_test, Y_test)
ValueError: X has 2 features, but KNNImputer is expecting 7 features as input.
This is because handle_unknown = "ignore" in the OneHotEncoder within OHE_with_nan is no longer "active", as it has been wrapped into my custom function.
If one simply uses OneHotEncoder(handle_unknown = "ignore") directly in the pipeline, everything works fine (but that's not my intention as this "removes" the nans from the data I try to impute.)
My question
How do I enable handle_unknown = "ignore" in my custom function such that it can perform in a pipeline setup on unseen data as well?
Hope you understand my situation - any help is highly appreciated!
I think the main problem is that you need to save more information (especially, the internal OneHotEncoder) at fit time. I also made the missing-column identification a little more robust (I think maybe you were relying on the ordering putting that last, but that only held for your sample data because of alphabetical order?). I didn't spend much time cleaning things up or looking for efficiencies.
class OHE_with_nan(BaseEstimator, TransformerMixin):
    """One-hot encode, propagating NaNs.
    Requires a dataframe as input!
    """
    # Note: newer scikit-learn releases rename `sparse` to `sparse_output`
    # and `get_feature_names` to `get_feature_names_out`.
    def fit(self, X, y=None):
        self.orig_cols_ = X.columns
        self.imputer_ = SimpleImputer(strategy="constant", fill_value="MISSING")
        X_filled = self.imputer_.fit_transform(X)
        self.ohe_ = OneHotEncoder(sparse=False, handle_unknown="ignore")
        self.ohe_.fit(X_filled)
        self.ohe_colnames_ = self.ohe_.get_feature_names(X.columns)
        self.missing_value_columns = np.array(["MISSING" in col for col in self.ohe_colnames_])
        return self
    def transform(self, X, y=None):
        # Reuse the imputer and encoder learned at fit time, so unseen data gets the same columns.
        raw_ohe = pd.DataFrame(self.ohe_.transform(self.imputer_.transform(X)), columns=self.ohe_colnames_)
        out_list = []
        # Loop over columns to replace 0s with nan in the correct places.
        for orig_col in self.orig_cols_:
            tmp_data = raw_ohe[self.ohe_colnames_[pd.Series(self.ohe_colnames_).str.startswith(orig_col)]]
            missing_name = tmp_data.columns[["MISSING" in col for col in tmp_data.columns]]
            missing_indices = np.where(tmp_data[missing_name]==1)[0]
            tmp_data.loc[missing_indices, :] = np.nan
            tmp_data1 = tmp_data.drop(missing_name, axis=1)
            out_list.append(tmp_data1)
        out = pd.concat(out_list, axis=1)
        return out
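To illustrate the key idea (learn the categories once in fit, then reuse them in transform), here is a more compact sketch. It is not the code above: the class name NanPreservingOHE and the toy frames are hypothetical, it returns a NumPy array instead of a DataFrame, and it reads the learned categories from the encoder's categories_ attribute rather than parsing feature names.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import OneHotEncoder


class NanPreservingOHE(BaseEstimator, TransformerMixin):
    """Hypothetical compact variant: one-hot encode, keeping NaN rows as NaN."""

    def fit(self, X, y=None):
        # Learn the categories once, from the training data only.
        self.ohe_ = OneHotEncoder(handle_unknown="ignore")
        self.ohe_.fit(X.fillna("MISSING"))
        return self

    def transform(self, X, y=None):
        dense = self.ohe_.transform(X.fillna("MISSING")).toarray()
        blocks = []
        start = 0
        for col, cats in zip(X.columns, self.ohe_.categories_):
            block = dense[:, start:start + len(cats)].copy()
            start += len(cats)
            # Rows that were NaN in this source column become all-NaN.
            block[X[col].isna().to_numpy(), :] = np.nan
            # Drop the placeholder indicator column, if the training data had one.
            keep = np.array([c != "MISSING" for c in cats])
            blocks.append(block[:, keep])
        return np.hstack(blocks)


train = pd.DataFrame({"c1": ["A", "B", np.nan], "c2": ["x", "y", "x"]})
test = pd.DataFrame({"c1": ["C", np.nan], "c2": ["y", "x"]})  # "C" is unseen

enc = NanPreservingOHE()
out_train = enc.fit_transform(train)
out_test = enc.transform(test)
print(out_train.shape, out_test.shape)
```

Both outputs have the same number of columns, so a downstream KNNImputer fitted on the training output should accept the test output as well. The unseen category "C" is encoded as all zeros, while the missing value stays NaN for the imputer to fill.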

Accessing column names of a pandas dataframe within a custom transformer in a Sklearn pipeline with ColumnTransformer?

I need to use a custom transformer within a pipeline that acts using the column names. However, the previous pipeline transformations convert the dataframe to a numpy array. I know I can retrieve the column names from the Column Transformer object after the pipeline has been fit, but I need to access the column names within the fit step. The custom transformer in my example below is a simple minimal example for illustration only, not the true transformation.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler
from sklearn.base import BaseEstimator, TransformerMixin
class MyCustomTransformer(BaseEstimator, TransformerMixin):
    def my_custom_transformation(self, X):
        """
        Parameters
        ----------
        X: pandas dataframe
        """
        columns_to_keep = [col for col in X.columns if col.endswith(('_a', '_b'))]
        return columns_to_keep
    def fit(self, X, y=None):
        self.columns_to_keep = self.my_custom_transformation(X)
        return self
    def transform(self, X, y=None):
        return X[self.columns_to_keep]
numeric_transformer = Pipeline(steps=[('minmax_scaler', MinMaxScaler())])
categorical_transformer = Pipeline(steps=[('onehot_encoder', OneHotEncoder(sparse=False))])
column_transformer = ColumnTransformer(transformers=[
    ('numeric_transformer', numeric_transformer, ['num']),
    ('categorical_transformer', categorical_transformer, ['cat']),
])
pipeline = Pipeline(steps=[
    ('column_transformer', column_transformer),
    ('my_custom_transformer', MyCustomTransformer())
])
df = pd.DataFrame(data={'num': [1,2,3], 'cat':['a', 'b', 'c']})
pipeline.fit(df)
which would ideally result in:
transformed_df = pipeline.transform(df)
print(transformed_df)
>>>
   num  cat_a  cat_b
0    0      1      0
1  0.5      0      1
2    1      0      0
The transformations in the column_transformer convert the dataframe to a numpy array, which is then passed to the custom transformer. Obviously this results in an error since you can't get the column names from a numpy array.
I can't use indexing to access the columns, since the one-hot encoding can result in a number of columns that is not known in advance.
If I could access the ColumnTransformer object within the fit method of the custom transformer, I could retrieve the column names, then create a pandas dataframe to use in the fit method as above (?), but I have not successfully found a way to do this.
Any help would be much appreciated.
See my proposed implementation of a ColumnTransformerWithNames in response to how do i get_feature_names using a column transformer.
You can replace the calls to ColumnTransformer with ColumnTransformerWithNames and the output of the pipeline will be a DataFrame with column names =)
pip install sklearn-pandas-transformers
from sklearn_pandas_transformers.transformers import SklearnPandasWrapper
column_transformer = ColumnTransformer(transformers=[
    ('numeric_transformer', SklearnPandasWrapper(numeric_transformer), ['num']),
    ('categorical_transformer', SklearnPandasWrapper(categorical_transformer), ['cat']),
])

How to use OneHotEncoder and Pipeline to make new predictions?

I'm working through a tutorial focusing on OneHotEncoder. I get the idea behind encoding features, but I'm having a little problem with using the encoder with pipeline to make a new prediction. Two of the features--"Sex" and "Embarked"--are categorical rather than numerical. When creating a new numpy array to make a prediction, do you use the initial values, say "male" and "C", or, say, "1" and "2" to make a new prediction? I get the following error: " ValueError: Specifying the columns using strings is only supported for pandas DataFrames," which is weird given that the values I'm using are numerical. Regardless, would I have to fit the pipeline to X_new to make a new prediction? If so, how can I do that?
X_new = [[3, 1, 0]] OR X_new = [['3','male', 'C']]
pipe.predict(X_new)
Complete code:
import pandas as pd
import numpy as np
df = pd.read_csv("https://raw.githubusercontent.com/---/pandas-videos/master/data/titanic_train.csv")
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression(solver='lbfgs')
from sklearn.model_selection import cross_val_score
cross_val_score(logreg, X, y, cv=5, scoring='accuracy').mean()
from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder(sparse=False)
X = df.drop('Survived', axis='columns')
from sklearn.compose import make_column_transformer
column_trans = make_column_transformer(
(OneHotEncoder(), ['Sex', 'Embarked']),
remainder='passthrough')
column_trans.fit_transform(X)
from sklearn.pipeline import make_pipeline
pipe = make_pipeline(column_trans, logreg)
cross_val_score(pipe, X, y, cv=5, scoring='accuracy').mean()
X_new = [[3, 1, 0]]
pipe.predict(X_new)
When you apply OneHotEncoder, each categorical column that you specify is transformed into multiple binary columns, one per unique value in that column.
For example, if the gender column contains 'male' and 'female', the original column is converted into 2 columns, 'male' and 'female'. This is different from LabelEncoder.
To combine the pipeline, logistic regression, and OneHotEncoder, fit the pipeline on the training data:
pipe.fit(X,y)
and then you can make predictions. Here is an example where I use 3 features (Sex, Age, and Embarked) and apply OHE to Sex and Embarked.
X_new = [['female', 20, 'C']]
X_new_df = pd.DataFrame (X_new,columns=['Sex','Age','Embarked'])
pipe.predict(X_new_df)
However, the features that you use in your code are all the columns except the label ('Survived'), which is 11 features. The input passed to predict must contain at least the columns the pipeline was fitted on, so passing only 3 columns can raise an error.
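Putting the pieces together, here is a self-contained sketch of fitting on three features and predicting for a new row. The miniature training set is made up for illustration. The point is only that the new data is a DataFrame with the raw category strings and the same column names used during fitting.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical miniature training set using the three features discussed above.
X = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "female", "male"],
    "Age": [22, 38, 26, 35, 27, 54],
    "Embarked": ["S", "C", "S", "S", "C", "S"],
})
y = [0, 1, 1, 0, 1, 0]

column_trans = make_column_transformer(
    # handle_unknown="ignore" avoids errors on categories not seen in training.
    (OneHotEncoder(handle_unknown="ignore"), ["Sex", "Embarked"]),
    remainder="passthrough",
)
pipe = make_pipeline(column_trans, LogisticRegression(solver="lbfgs"))
pipe.fit(X, y)

# New rows use the raw strings; the fitted encoder handles the conversion.
X_new = pd.DataFrame([["female", 20, "C"]], columns=["Sex", "Age", "Embarked"])
prediction = pipe.predict(X_new)
print(prediction)
```

There is no need to fit anything on X_new: predict runs the already-fitted encoder on it before calling the classifier.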

Preprocess a dataset with different feature types using a single Pipeline

I am trying to preprocess and prepare a dataset to train a model. The dataset has some numerical and categorical features. Let's assume this is the dataset (presented as a Pandas DataFrame):
Age,Gender,Height
23,M,170
33,M,195
31,F,165
and we want to convert it to the following format, using only the Transform functions in scikit-learn 0.19.0 and a single Pipeline object:
Age,Gender_M,Gender_F,Height
0.0,1,0,170
1.0,1,0,195
0.8,0,1,165
The transforms are as follows:
Age: MinMaxScaler
Gender: OneHotEncoder
Height: Unchanged
Is this even possible (an alternative to make it slightly easier is to drop the Height feature for now)?
In my own attempts to come up with a solution, I noticed there is no way to apply a Transform function to a subset of all features, which made me believe this cannot be done using a single Pipeline. But I will be happy to be proven wrong.
Thanks
You can write your own transformer to do the desired operations as follows.
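from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import MinMaxScaler
class CustomTransformer(BaseEstimator, TransformerMixin):
    def __init__(self):
        self.one_hot_encoder = OneHotEncoder()
        self.min_max_scaler = MinMaxScaler()
    def fit(self, X, y=None):
        # Both transformers expect 2D input, hence the double brackets.
        self.one_hot_encoder.fit(X[["Gender"]])
        self.min_max_scaler.fit(X[["Age"]])
        return self
    def transform(self, X, y=None):
        X = X.copy()
        # One-hot encoding yields one column per category,
        # so it cannot be assigned back to the single "Gender" column.
        encoded = self.one_hot_encoder.transform(X[["Gender"]]).toarray()
        for i, category in enumerate(self.one_hot_encoder.categories_[0]):
            X["Gender_" + str(category)] = encoded[:, i]
        X["Age"] = self.min_max_scaler.transform(X[["Age"]]).ravel()
        return X.drop(columns="Gender")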
Now you can use the Pipeline object on this transformer as follows:
from sklearn.pipeline import Pipeline
# est is a placeholder for whatever estimator you want to train
pipeline = Pipeline([("transform", CustomTransformer()), ("estimator", est)])
pipeline.fit(X, y)
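If upgrading is an option, note that scikit-learn 0.20 and later ship ColumnTransformer, which applies a transformer to a subset of columns without a hand-written class. The following is a sketch under that assumption (so it does not apply to 0.19.0 as asked), using the three-row dataset from the question.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

data = pd.DataFrame({
    "Age": [23, 33, 31],
    "Gender": ["M", "M", "F"],
    "Height": [170, 195, 165],
})

preprocess = ColumnTransformer(
    transformers=[
        ("age", MinMaxScaler(), ["Age"]),
        ("gender", OneHotEncoder(), ["Gender"]),
    ],
    remainder="passthrough",  # Height is passed through unchanged
    sparse_threshold=0,       # always return a dense array
)
pipeline = Pipeline(steps=[("preprocess", preprocess)])

# Output columns: scaled Age, Gender_F, Gender_M, Height
result = pipeline.fit_transform(data)
print(result)
```

The one-hot columns come out in sorted category order (F before M), which differs from the column order shown in the question but carries the same information. An estimator can be appended as a further Pipeline step.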
