scikit-learn regression pipeline with imputers - python

I am trying to build a regression pipeline but there are two problems I cannot resolve:
I try the below code, but one thing I cannot understand is how to deal with "NOT NULL" values columns? What to do with them? See the first row of the pipeline, these columns don't contain any "NULL" values in the data frame. I just don't know what to do with these columns.
When I run this Pipeline with .fit it is showing an error.
ValueError: Some of the variables to transform contain NaN. Check and remove those before using this transformer.
regression_pipe = Pipeline([
("Imputer", mdi.ArbitraryNumberImputer(arbitrary_number=-1, variables=['longitude', 'latitude', 'number_of_reviews', 'minimum_nights', 'accommodates', 'availability_365'])),
('medianImputer', MeanMedianImputer(imputation_method='median', variables=['review_scores_rating', 'security_deposit', 'cleaning_fee', 'bathrooms', 'bedrooms', 'beds'])),
("Count_Frequency_Encoder", CountFrequencyEncoder(encoding_method="count", variables=cat_val)),
("One_Hot_Encoding", OneHotEncoder(variables=cat_val, drop_last=False)),
("MinMax_Scaling", MinMaxScaler()),
("Linear_Regression", LinearRegression())
])

Related

Sklearn Pipeline / OneHotEncoder : consistency in getting categorical features with feature_names_in_ / get_feature_names_out()

Similar questions have been asked before, but this is a particular case, and it seems that sklearn has evolved quite a bit since then (I am using scikit-learn 1.1.2), so I think it is worth a new post.
I created an sklearn Pipeline in which I apply different transformations to numeric and categorical columns, as below :
# Separate numeric columns and categorical columns
numeric_features = X_train.select_dtypes(exclude=['object']).columns.tolist()
categorical_features = X_train.select_dtypes(include=['object']).columns.tolist()
# Define transformer pipelines to be applied to each type of column
# 1. Apply KNNImputer to numeric columns
# 2. Apply OneHotEncoder to categorical columns
num_transform_pipeline = Pipeline(steps = [('imputer', KNNImputer(n_neighbors=1, weights="uniform"))])
cat_transform_pipeline = Pipeline(steps = [('onehotencoding', OneHotEncoder(handle_unknown='ignore', sparse=False))])
# Apply each transformer pipeline to each type of columns
column_transformer = ColumnTransformer(
transformers=[
("num_column_transformer", num_transform_pipeline, numeric_features),
("cat_column_transformer", cat_transform_pipeline, categorical_features),
], verbose_feature_names_out = False
)
# Define the final pipeline combining column transformers and the regressor
pipeline = Pipeline([('column_transformer', column_transformer),
('regressor', XGBRegressor())])
After loading the pipeline from another script, I am trying to find the categorical columns that are passed to the OneHotEncoder step. In the previous example, since OneHotEncoder is the first step of cat_transform_pipeline, I can't use get_feature_names_out() on the previous step.
However, I found two different ways of getting the list of categorical columns :
Accessing the last element of (name, fitted_transformer, column) in the second transformer of column_transformer returns the categorical columns :
cat_feature_names = pipeline['column_transformer'].transformers_[1][-1]
However, when I try to access the second transformer cat_column_transformer by its name :
cat_feature_names = pipeline['column_transformer'].named_transformers_['cat_column_transformer'][-1]
I get an error TypeError: 'OneHotEncoder' object is not iterable
Is there a way to achieve the same result by using the name of the transformer and not its index ?
Accessing OneHotEncoder's feature_names_in_ attribute does the job and seems to be the easiest method :
cat_feature_names = pipeline['column_transformer'].named_transformers_['cat_column_transformer']['onehotencoding'].feature_names_in_
However, when OneHotEncoder is not the first step of the pipeline, such as in the following case where an imputer is defined just before :
cat_transform_pipeline = Pipeline(steps = [('imputer', SimpleImputer(strategy = 'most_frequent')),
('onehotencoding', OneHotEncoder(handle_unknown='ignore', sparse=False))])
I get the following error : AttributeError: 'OneHotEncoder' object has no attribute 'feature_names_in_'
The solution in this case is to use get_feature_names_out() on the previous step (the imputer). But that doesn't seem very consistent. Why would the attribute feature_names_in_ cease to exist when OneHotEncoder is preceded by an Imputer ?

How to get feature names when using onehot encoder on only certain columns sklearn

I have read many posts on this that reference the get_feature_names() from sklearn which appears to be now deprecated and replaced by get_feature_names_out neither of which I can get to work. It also appears that there is no way to use the get_feature_names (or the get_feature_names_out) with the ColumnTransformer class. So I am trying to fit and transform my numeric columns with a SimpleImputer and then StandardScaler class then SimpleImpute ('most_frequent') and OneHotEncode the categorical variables. I run them all individually since I can't put them in a pipeline then I try to get_feature_names and this results:
ValueError: input_features should have length equal to number of features (5), got 11
I have also tried getting feature names for just the categorical features as well as just the numeric and each one give the following errors respectively:
ValueError: input_features should have length equal to number of features (5), got 121942
and
ValueError: input_features should have length equal to number of features (5), got 121942
I am completely lost and also open to an easier way to get the feature names so that I can make sure the prod data that I run this model on after training/testing has the exact same features as the ones the model is trained to expect (which is the root issue here).
If I'm "barking up the wrong tree" by trying to get the feature names for the reasoning outlined in the root issue I'm also more than willing to be corrected. Here is my code:
#ONE HOT
import sklearn
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
# !pip install -U scikit-learn
print('The scikit-learn version is {}.'.format(sklearn.__version__))
numeric_columns = X.select_dtypes(include=['int64','float64']).columns
cat_columns = X.select_dtypes(include=['object']).columns
si_num = SimpleImputer(strategy='median')
si_cat = SimpleImputer(strategy='most_frequent')
ss = StandardScaler()
ohe = OneHotEncoder()
si_num.fit_transform(X[numeric_columns])
si_cat.fit_transform(X[cat_columns])
ss.fit_transform(X[numeric_columns])
ohe.fit_transform(X[cat_columns])
ohe.get_feature_names(X[numeric_columns])
Thanks!
I think this should work as a single composite estimator that does all your transformations and provides get_feature_names_out:
num_pipe = Pipeline([
("imp", si_num),
("scale", ss),
])
cat_pipe = Pipeline([
("imp", si_cat),
("ohe", ohe),
])
preproc = ColumnTransformer([
("num", num_pipe, numeric_columns),
("cat", cat_pipe, cat_columns),
])
Ideally, you should save the fitted composite and use that to transform production data, rather than using the feature names to reconcile different categories.
You should also fit this composite only on the training set, transforming the test set separately.

Apply multiple preprocessing steps to a column in sklearn pipeline

I was trying sklearn pipeline for the first time and using Titanic dataset. I want to first impute missing value in Embarked and then do one hot encoding. While in Sex attribute, I just want to do one hot encoding. So, I have the below steps in which two steps are for Embarked. But it is not working as expected as the Embarked column remains in addition to its one hot encoding as shown in the output(column having 'S').
If I do imputation and one hot encoding for Embarked in single step, it is working as expected.
What is the reason behind this or I am doing something wrong? Also, I didn't find any information related to this.
categorical_cols_impute = ['Embarked']
categorical_impute = Pipeline([
("mode_impute", SimpleImputer(missing_values=np.nan, strategy='constant', fill_value='S')),
# ("one_hot", OneHotEncoder(sparse=False))
])
categorical_cols = ['Embarked', 'Sex']
categorical_one_hot = Pipeline([
("one_hot", OneHotEncoder(sparse=False))
])
preprocesor = ColumnTransformer([
("cat_impute", categorical_impute, categorical_cols_impute),
("cat_one_hot", categorical_one_hot, categorical_cols)
], remainder="passthrough")
pipe = Pipeline([
("preprocessor", preprocesor),
# ("model", RandomForestClassifier(random_state=0))
])
ColumnTransformer transformers are applied in parallel, not sequentially. So in your example, Embarked ends up in your transformed data twice: once from the first transformer, keeping its string type, and again from the second transformer, this time one-hot encoded (but not imputed first!(?)).
So just uncomment the second step in the embarked pipeline, and remove Embarked from categorical_cols.
See also Consistent ColumnTransformer for intersecting lists of columns (but I don't think it's quite a duplicate).

AttributeError when using ColumnTransformer into a pipeline

This is my first machine learning project and the first time that I use ColumnTransformer. My aim is to perform two steps of data preprocessing, and use ColumnTransformer for each of them.
In the first step, I want to replace the missing values in my dataframe with the string 'missing_value' for some features, and the most frequent value for the remaining features. Therefore, I combine these two operations using ColumnTransformer and passing to it the corresponding columns of my dataframe.
In the second step, I want to use the just preprocessed data and apply OrdinalEncoder or OneHotEncoder depending on the features. For that I use again ColumnTransformer.
I then combine the two steps into a single pipeline.
I am using the Kaggle Houses Price dataset, I have scikit-learn version 0.20 and this is a simplified version of my code:
cat_columns_fill_miss = ['PoolQC', 'Alley']
cat_columns_fill_freq = ['Street', 'MSZoning', 'LandContour']
cat_columns_ord = ['Street', 'Alley', 'PoolQC']
ord_mapping = [['Pave', 'Grvl'], # Street
['missing_value', 'Pave', 'Grvl'], # Alley
['missing_value', 'Fa', 'TA', 'Gd', 'Ex'] # PoolQC
]
cat_columns_onehot = ['MSZoning', 'LandContour']
imputer_cat_pipeline = ColumnTransformer([
('imp_miss', SimpleImputer(strategy='constant'), cat_columns_fill_miss), # fill_value='missing_value' by default
('imp_freq', SimpleImputer(strategy='most_frequent'), cat_columns_fill_freq),
])
encoder_cat_pipeline = ColumnTransformer([
('ordinal', OrdinalEncoder(categories=ord_mapping), cat_columns_ord),
('pass_ord', OneHotEncoder(), cat_columns_onehot),
])
cat_pipeline = Pipeline([
('imp_cat', imputer_cat_pipeline),
('cat_encoder', encoder_cat_pipeline),
])
Unfortunately, when I apply it to housing_cat, the subset of my dataframe including only categorical features,
cat_pipeline.fit_transform(housing_cat)
I get the error:
AttributeError: 'numpy.ndarray' object has no attribute 'columns'
During handling of the above exception, another exception occurred:
...
ValueError: Specifying the columns using strings is only supported for pandas DataFrames
I have tried this simplified pipeline and it works properly:
new_cat_pipeline = Pipeline([
('imp_cat', imputer_cat_pipeline),
('onehot', OneHotEncoder()),
])
However, if I try:
enc_one = ColumnTransformer([
('onehot', OneHotEncoder(), cat_columns_onehot),
('pass_ord', 'passthrough', cat_columns_ord)
])
new_cat_pipeline = Pipeline([
('imp_cat', imputer_cat_pipeline),
('onehot_encoder', enc_one),
])
I start to get the same error.
I suspect then that this error is related to the use of ColumnTransformer in the second step, but I do not actually understand where it comes from. The way I identify the columns in the second step is the same as in the first step, so it remains unclear to me why only in the second step I get the Attribute Error...
ColumnTransformer returns numpy.array, so it can't have column attribute (as indicated by your error).
If I may suggest a different solution, use pandas for both of your tasks, it will be easier.
Step 1 - replacing missing values
To replace missing value in a subset of columns with missing_value string use this:
dataframe[["PoolQC", "Alley"]].fillna("missing_value", inplace=True)
For the rest (imputing with mean of each column), this will work perfectly:
dataframe[["Street", "MSZoning", "LandContour"]].fillna(
dataframe[["Street", "MSZoning", "LandContour"]].mean(), inplace=True
)
Step 2 - one hot encoding and categorical variables
pandas provides get_dummies, which returns pandas Dataframe, unlike ColumnTransfomer, code for this would be:
encoded = pd.get_dummies(dataframe[['MSZoning', 'LandContour']], drop_first=True)
pd.dropna(['MSZoning', 'LandContour'], axis=columns, inplace=True)
dataframe = dataframe.join(encoded)
For ordinal variables and their encoding I would suggest you to look at this SO answer (unluckily some manual mapping would be needed in this case).
If you want to use transformer anyway
Get np.array from the dataframe using values attribute, pass it through the pipeline and recreate columns and indices from the array like this:
pd.DataFrame(data=your_array, index=np.arange(len(your_array)), columns=["A", "B"])
There is one caveat of this aprroach though; you will not know the names of custom created one-hot-encoded columns (the pipeline will not do this for you).
Additionally, you could get the names of columns from sklearn's transforming objects (e.g. using categories_ attribute), but I think it would break the pipeline (someone correct me if I'm wrong).
Option #2
use the make_pipeline function
(Had the same Error, found this answer, than found this: Introducing the ColumnTransformer)
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline
cat_columns_fill_miss = ['PoolQC', 'Alley']
cat_columns_fill_freq = ['Street', 'MSZoning', 'LandContour']
cat_columns_ord = ['Street', 'Alley', 'PoolQC']
ord_mapping = [['Pave', 'Grvl'], # Street
['missing_value', 'Pave', 'Grvl'], # Alley
['missing_value', 'Fa', 'TA', 'Gd', 'Ex'] # PoolQC
]
cat_columns_onehot = ['MSZoning', 'LandContour']
imputer_cat_pipeline = make_column_transformer(
(make_pipeline(SimpleImputer(strategy='constant'), cat_columns_fill_miss),
(make_pipeline(SimpleImputer(strategy='most_frequent'), cat_columns_fill_freq),
)
encoder_cat_pipeline = make_column_transformer(
(OrdinalEncoder(categories=ord_mapping), cat_columns_ord),
(OneHotEncoder(), cat_columns_onehot),
)
cat_pipeline = Pipeline([
('imp_cat', imputer_cat_pipeline),
('cat_encoder', encoder_cat_pipeline),
])
In my own pipelines i do not have overlapping preprocessing in the column space. So i am not sure, how the transformation and than the "outer pipelining" works.
However, the important part is to use make_pipeline around the SimpleImputer to use it in a pipeline properly:
imputer_cat_pipeline = make_column_transformer(
(make_pipeline(SimpleImputer(strategy='constant'), cat_columns_fill_miss),
)
Just to add to the other answers here. I'm no Python or data science expert but you can pass another pipeline to ColumnTransformer in order to do what you need an add more than one transformer to a column. I came here looking for an answer to the same question and found this solution.
Doing it all via pipelines enables you to control the test/train data a lot easier to avoid leakage, and opens up more Grid Search possibilities too. I'm personally not a fan of the pandas approach in another answer for these reasons, but it would work ok still.
encoder_cat_pipeline = Pipeline([
('ordinal', OrdinalEncoder(categories=ord_mapping)),
('pass_ord', OneHotEncoder()),
])
imputer_cat_pipeline = ColumnTransformer([
('imp_miss', SimpleImputer(strategy='constant'), cat_columns_fill_miss),
('new_pipeline', encoder_cat_pipeline, cat_columns_fill_freq)
])
cat_pipeline = Pipeline([
('imp_cat', imputer_cat_pipeline),
])
I like to use the FunctionTransformer sklearn offers instead of doing transformations directly in pandas whenever I am doing any transformations. The reason for this is now my feature transformations are more generalizable on new incoming data (e.g. suppose you win, and you need to use the same code to predict on next years data). This way you won't have to re-run your code, you can save your preprocessor and call transform. I use something like this
FE_pipeline = {
'numeric_pipe': make_pipeline(
FunctionTransformer(lambda x: x.replace([np.inf, -np.inf], np.nan)),
MinMaxScaler(),
SimpleImputer(strategy='median', add_indicator=True),
),
'oh_pipe': make_pipeline(
FunctionTransformer(lambda x: x.astype(str)),
SimpleImputer(strategy='constant'),
OneHotEncoder(handle_unknown='ignore')
)
}

FeatureUnion with different feature dimensions

I want to classify some sentences with sklearn. The sentences are stored in a Pandas DataFrame.
To begin, I want to use the length of the sentence and it's TF-IDF vectors as a feature, so I created this pipeline:
pipeline = Pipeline([
('features', FeatureUnion([
('meta', Pipeline([
('length', LengthAnalyzer())
])),
('bag-of-words', Pipeline([
('tfidf', TfidfVectorizer())
]))
])),
('model', LogisticRegression())
where the LengthAnalyzer is a custom TransformerMixinwith:
def transform(self, documents):
for document in documents:
yield len(document)
So, LengthAnalyzer returns a number (1 dimension) while TfidfVectorizer returns a n-dimensional list.
When I try to run this, I get
ValueError: blocks[0,:] has incompatible row dimensions. Got blocks[0,1].shape[0] == 494, expected 1.
What has to be done to make this feature combination work?
Seems like the problem is originating from the yield used in the transform(). Maybe due to yield the number of rows reported to the scipy hstack method is 1 instead of actual number of samples in documents.
There should be 494 rows (samples) in your data which is coming correct from TfidfVectorizer but LengthAnalyzer is only reporting a single row. Hence the error.
If you can change it to
return np.array([len(document) for document in documents]).reshape(-1,1)
then the pipeline fits successfully.
Note:
I tried finding any related issue on scikit-learn github but was unsuccessful. You can post this issue there to get some real feedback for the usage.

Categories