Do you need to scale Vectorizers in sklearn? - python

I have a set of custom features and a set of features created with Vectorizers, in this case TfidfVectorizer.
All of my custom features are simple np.arrays (e.g. [0, 5, 4, 22, 1]). I am using StandardScaler to scale all of my featues, as you can see in my Pipeline by calling StandardScaler after my "custom pipeline". The question is whether there is a way or a need to scale the Vectorizers I use in my "vectorized_pipeline". Applying StandardScaler on the vectorizers doesn't seem to work (I get the following Error: "ValueError: Cannot center sparse matrices").
And another question, is it sensible to scale all of my features after I have joined them in the FeatureUnion or do I scale each of them separately (in my example, by calling the scaler in "pos_cluster" and "stylistic_features" seprately instead of calling it after the both of them have been joined), what is a better practice of doing this?
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn import feature_selection
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC
X = ['I am a sentence', 'an example']
Y = [1, 2]
X_dev = ['another sentence']
inner_scaler = StandardScaler()
# classifier
LinearSVC1 = LinearSVC(tol=1e-4, C = 0.10000000000000001)
# vectorizers
countVecWord = TfidfVectorizer(ngram_range=(1, 3), max_features=2000, analyzer=u'word', sublinear_tf=True, use_idf = True, min_df=2, max_df=0.85, lowercase = True)
countVecWord_tags = TfidfVectorizer(ngram_range=(1, 4), max_features= 1000, analyzer=u'word', min_df=2, max_df=0.85, sublinear_tf=True, use_idf = True, lowercase = False)
pipeline = Pipeline([
('union', FeatureUnion(
transformer_list=[
('vectorized_pipeline', Pipeline([
('union_vectorizer', FeatureUnion([
('stem_text', Pipeline([
('selector', ItemSelector(key='stem_text')),
('stem_tfidf', countVecWord)
])),
('pos_text', Pipeline([
('selector', ItemSelector(key='pos_text')),
('pos_tfidf', countVecWord_tags)
])),
])),
])),
('custom_pipeline', Pipeline([
('custom_features', FeatureUnion([
('pos_cluster', Pipeline([
('selector', ItemSelector(key='pos_text')),
('pos_cluster_inner', pos_cluster)
])),
('stylistic_features', Pipeline([
('selector', ItemSelector(key='raw_text')),
('stylistic_features_inner', stylistic_features)
]))
])),
('inner_scale', inner_scaler)
])),
],
# weight components in FeatureUnion
# n_jobs=6,
transformer_weights={
'vectorized_pipeline': 0.8, # 0.8,
'custom_pipeline': 1.0 # 1.0
},
)),
('clf', classifier),
])
pipeline.fit(X, Y)
y_pred = pipeline.predict(X_dev)

First things first:
Error "Cannot center sparse matrices"
The reason is quite simple - StandardScaler efficiently applies feature-wise transformation:
f_i = (f_i - mean(f_i)) / std(f_i)
which for sparse matrices will result in the dense ones, as mean(f_i) will be non zero (usually). In practise only features equal to their means - will end up being zero. Scikit learn does not want to do this, as this is a huge modification of your data, which might result in failures in other parts of code, huge usage of memory etc. How to deal with it? If you really want to do this, there are two options:
densify your matrix through .toarray(), which will require lots of memory, but will give you exactly what you expect
create StandardScaler without mean, thus StandardScaler(with_mean = False) which instaed willl apply f_i = f_i / std(f_i), but will leave sparse format of your data.
Is scalind needed?
This is a whole other problem - usualy, scaling (of any form) is just a heuristics. This is not something that you have to apply, there are no guarantees that it will help, it is just a reasonable thing to do when you have no idea what your data looks like. "Smart" vectorizers, such as tfidf are actually already doing that. The idf transformation is supposed to create a kind of reasonable data scaling. There is no guarantee which one will be better, but in general, tfidf should be enough. Especially given the fact, that it still support sparse computations, while StandardScaler does not.

Related

ColumnTransformer(s) in various parts of a pipeline do not play well

I am using sklearn and mlxtend.regressor.StackingRegressor to build a stacked regression model.
For example, say I want the following small pipeline:
A Stacking Regressor with two regressors:
A pipeline which:
Performs data imputation
1-hot encodes categorical features
Performs linear regression
A pipeline which:
Performs data imputation
Performs regression using a Decision Tree
Unfortunately this is not possible, because StackingRegressor doesn't accept NaN in its input data.
This is even if its regressors know how to handle NaN, as it would be in my case where the regressors are actually pipelines which perform data imputation.
However, this is not a problem: I can just move data imputation outside the stacked regressor.
Now my pipeline looks like this:
Perform data imputation
Apply a Stacking Regressor with two regressors:
A pipeline which:
1-hot encodes categorical features
Standardises numerical features
Performs linear regression
An sklearn.tree.DecisionTreeRegressor.
One might try to implement it as follows (the entire minimal working example in this gist, with comments):
sr_linear = Pipeline(steps=[
('preprocessing', ColumnTransformer(transformers=[
('categorical',
make_pipeline(OneHotEncoder(), StandardScaler()),
make_column_selector(dtype_include='category')),
('numerical',
StandardScaler(),
make_column_selector(dtype_include=np.number))
])),
('model', LinearRegression())
])
sr_tree = DecisionTreeRegressor()
ct_imputation = ColumnTransformer(transformers=[
('categorical',
SimpleImputer(strategy='constant', fill_value='None'),
make_column_selector(dtype_include='category')),
('numerical',
SimpleImputer(strategy='median'),
make_column_selector(dtype_include=np.number))
])
stacked_regressor = Pipeline(steps=[
('imputation', ct_imputation),
('back_to_pandas', FunctionTransformer(
func=lambda values: pd.DataFrame(values, columns=ct_imputation.get_feature_names_out())
)),
('model', StackingRegressor(
regressors=[sr_linear, sr_tree],
meta_regressor=DecisionTreeRegressor(),
use_features_in_secondary=True
))
])
Note that the "outer" ColumnTransformer (in stacked_regressor) returns a numpy matrix.
But the "inner" ColumnTransformer (in sr_linear) expects a pandas.DataFrame, so I had to convert the matrix back to a data frame using step back_to_pandas.
(To use get_feature_names_out I had to use the nightly version of sklearn, because the current stable 1.0.2 version does not support it yet. Fortunately it can be installed with one simple command.)
The above code fails when calling stacked_regressor.fit(), with the following error (the entire stacktrace is again in the gist):
ValueError: make_column_selector can only be applied to pandas dataframes
However, because I added the back_to_pandas step to my outer pipeline, the inner pipelines should be getting a pandas data frame!
In fact, if I only fit_transform() my ct_imputation object, I clearly obtain a pandas data frame.
I cannot understand where and when exactly the data which gets passed around ceases to be a data frame.
Why is my code failing?
Imo the issue has to be ascribed to StackingRegressor. Actually, I am not an expert on its usage and still I have not explored its source code, but I've found this sklearn issue - #16473 which seems implying that << the concatenation [of regressors and meta_regressors] does not preserve dataframe >> (though this is referred to sklearn StackingRegressor instance, rather than on mlxtend one).
Indeed, have a look at what happens once you replace it with your sr_linear pipeline:
from sklearn.datasets import fetch_openml
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, FunctionTransformer
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from mlxtend.regressor import StackingRegressor
import numpy as np
import pandas as pd
# We use the Ames house prices dataset for this example
d = fetch_openml('house_prices', as_frame=True).frame
# Small data preprocessing:
for column in d.columns:
if d[column].dtype == object or column == 'MSSubClass':
d[column] = pd.Categorical(d[column])
d.drop(columns='Id', inplace=True)
# Prepare the data for training
label = 'SalePrice'
features = [col for col in d.columns if col != label]
X, y = d[features], d[label]
# Train the stacked regressor
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=True)
sr_linear = Pipeline(steps=[
('preprocessing', ColumnTransformer(transformers=[
('categorical',
make_pipeline(OneHotEncoder(), StandardScaler(with_mean=False)),
make_column_selector(dtype_include='category')),
('numerical',
StandardScaler(),
make_column_selector(dtype_include=np.number))
])),
('model', LinearRegression())
])
ct_imputation = ColumnTransformer(transformers=[
('categorical',
SimpleImputer(missing_values=np.nan, strategy='constant', fill_value='None'),
make_column_selector(dtype_include='category')),
('numerical',
SimpleImputer(strategy='median'),
make_column_selector(dtype_include=np.number))
])
stacked_regressor = Pipeline(steps=[
('imputation', ct_imputation),
('back_to_pandas', FunctionTransformer(
func=lambda values: pd.DataFrame(values, columns=ct_imputation.get_feature_names_out()).astype(types)
)),
('mdl', sr_linear)
])
stacked_regressor.fit(X_train, y_train)
Observe that I had to slightly modify the 'back_to_pandas' step because for some reason pd.DataFrame was changing the dtypes of the columns to 'object' only (from 'category' and 'float64'), therefore clashing with the imputation performed in sr_linear. For this reason, I've applied .astype(types) to the pd.DataFrame constructor, where types is defined as follows (based on the implementation of .get_feature_names_out() method of the SimpleImputer from the dev version of sklearn):
types = {}
for col in d.columns[:-1]:
if d[col].dtype == 'category':
types['categorical__' + col] = str(d[col].dtype)
else:
types['numerical__' + col] = str(d[col].dtype)
The correct thing to do was:
Move from mlxtend's to sklearn's StackingRegressor. I believe the former was creater when sklearn still didn't have a stacking regressor. Now there is no need to use more 'obscure' solutions. sklearn's stacking regressor works pretty well.
Move the 1-hot-encoding step to the outer pipeline, because (surprisingly!) sklearn's DecisionTreeRegressor cannot handle categorical data among the features.
A working version of the code is given below:
from sklearn.datasets import fetch_openml
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.ensemble import StackingRegressor
import numpy as np
import pandas as pd
def set_correct_categories(df: pd.DataFrame) -> pd.DataFrame:
for column in df.columns:
if df[column].dtype == object or 'MSSubClass' in column:
df[column] = pd.Categorical(df[column])
return df
d = fetch_openml('house_prices', as_frame=True).frame
d = set_correct_categories(d).drop(columns='Id')
sr_linear = Pipeline(steps=[
('preprocessing', StandardScaler()),
('model', LinearRegression())
])
ct_preprocessing = ColumnTransformer(transformers=[
('categorical',
make_pipeline(
SimpleImputer(strategy='constant', fill_value='None'),
OneHotEncoder(sparse=False, handle_unknown='ignore')
),
make_column_selector(dtype_include='category')),
('numerical',
SimpleImputer(strategy='median'),
make_column_selector(dtype_include=np.number))
])
stacking_regressor = Pipeline(steps=[
('preprocessing', ct_preprocessing),
('model', StackingRegressor(
estimators=[('linear_regression', sr_linear), ('regression_tree', DecisionTreeRegressor())],
final_estimator=DecisionTreeRegressor(),
passthrough=True
))
])
label = 'SalePrice'
features = [col for col in d.columns if col != label]
X, y = d[features], d[label]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=True)
stacking_regressor.fit(X_train, y_train)
Thanks to user amiola for his answer putting me on the right track.

Get feature names after sklearn pipeline

I want to match the output np array with the features to make a new pandas dataframe
Here is my pipeline:
from sklearn.pipeline import Pipeline
# Categorical pipeline
categorical_preprocessing = Pipeline(
[
('Imputation', SimpleImputer(missing_values=np.nan, strategy='most_frequent')),
('Ordinal encoding', OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)),
]
)
# Continuous pipeline
continuous_preprocessing = Pipeline(
[
('Imputation', SimpleImputer(missing_values=np.nan, strategy='mean')),
('Scaling', StandardScaler())
]
)
# Creating preprocessing pipeline
preprocessing = make_column_transformer(
(continuous_preprocessing, continuous_cols),
(categorical_preprocessing, categorical_cols),
)
# Final pipeline
pipeline = Pipeline(
[('Preprocessing', preprocessing)]
)
Here is how I call it:
X_train = pipeline.fit_transform(X_train)
X_val = pipeline.transform(X_val)
X_test = pipeline.transform(X_test)
Here is what I get when trying to get the feature names:
pipeline['Preprocessing'].transformers_[1][1]['Ordinal encoding'].get_feature_names()
OUT:
AttributeError: 'OrdinalEncoder' object has no attribute 'get_feature_names'
Here is a SO question that was similar: Sklearn Pipeline: Get feature names after OneHotEncode In ColumnTransformer
Point is that, as of today, some transformers do expose a method .get_feature_names_out() and some others do not, which generates some problems - for instance - whenever you want to create a well-formatted DataFrame from the np.array outputted by a Pipeline or ColumnTransformer instance. (Instead, afaik, .get_feature_names() was deprecated in latest versions in favor of .get_feature_names_out()).
For what concerns the transformers that you are using, StandardScaler belongs to the first category of transformers exposing the method, while both SimpleImputer and OrdinalEncoder do belong to the second. The docs show the exposed methods within the Methods paragraphs. As said, this causes problems when doing something like pd.DataFrame(pipeline.fit_transform(X_train), columns=pipeline.get_feature_names_out()) on your pipeline, but it would cause problems as well on your categorical_preprocessing and continuous_preprocessing pipelines (as in both cases at least one transformer lacks of the method) and on the preprocessing ColumnTransformer instance.
There's an ongoing attempt in sklearn to enrich all estimators with the .get_feature_names_out() method. It is tracked within github issue #21308, which, as you might see, branches in many PRs (each one dealing with a specific module). For instance, issue #21079 for the preprocessing module, which will enrich the OrdinalEncoder among the others, issue #21078 for the impute module, which will enrich the SimpleImputer. I guess that they'll be available in a new release as soon as all the referenced PR will be merged.
In the meanwhile, imo, you should go with a custom solution that might fit your needs. Here's a simple example, which do not necessarily resemble your need, but which is meant to give a (possible) way of proceeding:
import pandas as pd
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OrdinalEncoder, StandardScaler
from sklearn.compose import make_column_transformer, make_column_selector
X = pd.DataFrame({'city': ['London', 'London', 'Paris', 'Sallisaw', ''],
'title': ['His Last Bow', 'How Watson Learned the Trick', 'A Moveable Feast', 'The Grapes of Wrath', 'The Jungle'],
'expert_rating': [5, 3, 4, 5, np.NaN],
'user_rating': [4, 5, 4, np.NaN, 3]})
X
num_cols = X.select_dtypes(include=np.number).columns.tolist()
cat_cols = X.select_dtypes(exclude=np.number).columns.tolist()
# Categorical pipeline
categorical_preprocessing = Pipeline(
[
('Imputation', SimpleImputer(missing_values='', strategy='most_frequent')),
('Ordinal encoding', OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)),
]
)
# Continuous pipeline
continuous_preprocessing = Pipeline(
[
('Imputation', SimpleImputer(missing_values=np.nan, strategy='mean')),
('Scaling', StandardScaler())
]
)
# Creating preprocessing pipeline
preprocessing = make_column_transformer(
(continuous_preprocessing, num_cols),
(categorical_preprocessing, cat_cols),
)
# Final pipeline
pipeline = Pipeline(
[('Preprocessing', preprocessing)]
)
X_trans = pipeline.fit_transform(X)
pd.DataFrame(X_trans, columns= num_cols + cat_cols)

Names for Feature Selection

I want to know the names of the features within my RF model. I read here that the output from gs.best_estimator_.named_steps["stepname"].feature_importances_ would mirror my columns from my data. However, the length of gs.best_estimator_.... is 10 and I have 13 columns. Some columns were not important. From other answers around (answer1, answer2), I would have to declare something within my pipeline. But I am confused as to what to declare because both answers deal with PCA, not RF.
Here is what I have so far.
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn import datasets
# use iris as example
iris = datasets.load_iris()
X = iris.drop(['sepal_length'],axis=1)
y = iris.sepal_length
cats_feats = ['species']
X_train, X_test, y_train, y_test = \
train_test_split(X, y, train_size=0.8, test_size=0.2, random_state=13)
# Pipeline
categorical_transformer = Pipeline(steps=[
('onehot', OneHotEncoder(handle_unknown='ignore',sparse=False))
])
# Bundle any preprocessing
preprocessor = ColumnTransformer(
transformers=[
('cat', categorical_transformer, cat_feats)
])
rf = RandomForestRegressor(random_state = 13)
mymodel = Pipeline(steps = [('preprocessor', preprocessor),
('model', rf)
])
# For this example, I used default values. In reality I do use a dictionary of parameters
gs = GridSearchCV(mymodel
,n_jobs = -1
,cv = 5
)
gs.fit(X_train,y_train)
Why the length of the feature list does not match
The length of your features does not match because all non-categorical columns are being discarded when you are using your ColumnTransformer. By default, it only keeps columns for which a transformation was specified. As a result, if you do not want this to happen, you need to do this
preprocessor = ColumnTransformer(transformers=[('cat', OneHotEncoder(), cat_feats)],
remainder='passthrough')
(I removed your categorical pipeline, which is not necessary here)
Also keep in mind that applying the OHE will add features and so the total number of features is going to be larger than what you had in the beginning.
How to get the feature names
Once you have fitted everything, you need to retrieve the feature names for the result of the OHE and the remaining numerical columns.
For the OHE columns:
cat_features = gs.best_estimator_["preprocessor"].named_transformers_["cat"].get_feature_names()
For the numerical columns, you need to declare num_feats where all numerical features are in the same order as in your original dataframe.
Then just do:
feature_names = np.concatenate((cat_features, num_feats))
PS: this is a bit cumbersome, and this might be improved in later sklearn versions, but as of now, this is the procedure

sklearn pipeline - how to apply different transformations on different columns

I am pretty new to pipelines in sklearn and I am running into this problem: I have a dataset that has a mixture of text and numbers i.e. certain columns have text only and rest have integers (or floating point numbers).
I was wondering if it was possible to build a pipeline where I can for example call LabelEncoder() on the text features and MinMaxScaler() on the numbers columns. The examples I have seen on the web mostly point towards using LabelEncoder() on the entire dataset and not on select columns. Is this possible? If so any pointers would be greatly appreciated.
The way I usually do it is with a FeatureUnion, using a FunctionTransformer to pull out the relevant columns.
Important notes:
You have to define your functions with def since annoyingly you can't use lambda or partial in FunctionTransformer if you want to pickle your model
You need to initialize FunctionTransformer with validate=False
Something like this:
from sklearn.pipeline import make_union, make_pipeline
from sklearn.preprocessing import FunctionTransformer
def get_text_cols(df):
return df[['name', 'fruit']]
def get_num_cols(df):
return df[['height','age']]
vec = make_union(*[
make_pipeline(FunctionTransformer(get_text_cols, validate=False), LabelEncoder()))),
make_pipeline(FunctionTransformer(get_num_cols, validate=False), MinMaxScaler())))
])
Since v0.20, you can use ColumnTransformer to accomplish this.
An Example of ColumnTransformer might help you:
# FOREGOING TRANSFORMATIONS ON 'data' ...
# filter data
data = data[data['county'].isin(COUNTIES_OF_INTEREST)]
# define the feature encoding of the data
impute_and_one_hot_encode = Pipeline([
('impute', SimpleImputer(strategy='most_frequent')),
('encode', OneHotEncoder(sparse=False, handle_unknown='ignore'))
])
featurisation = ColumnTransformer(transformers=[
("impute_and_one_hot_encode", impute_and_one_hot_encode, ['smoker', 'county', 'race']),
('word2vec', MyW2VTransformer(min_count=2), ['last_name']),
('numeric', StandardScaler(), ['num_children', 'income'])
])
# define the training pipeline for the model
neural_net = KerasClassifier(build_fn=create_model, epochs=10, batch_size=1, verbose=0, input_dim=109)
pipeline = Pipeline([
('features', featurisation),
('learner', neural_net)])
# train-test split
train_data, test_data = train_test_split(data, random_state=0)
# model training
model = pipeline.fit(train_data, train_data['label'])
You can find the entire code under: https://github.com/stefan-grafberger/mlinspect/blob/19ca0d6ae8672249891835190c9e2d9d3c14f28f/example_pipelines/healthcare/healthcare.py

Merging bag-of-words scikits classifier with arbitrary numeric fields

How would you merge a scikits-learn classifier that operates over a bag-of-words with one that operates on arbitrary numeric fields?
I know that these are basically the same thing behind-the-scenes, but I'm having trouble figuring out how to do this via the existing library methods. For example, my bag-of-words classifier uses the pipeline:
classifier = Pipeline([
('vectorizer', HashingVectorizer(ngram_range=(1,4), non_negative=True)),
('tfidf', TfidfTransformer()),
('clf', OneVsRestClassifier(LinearSVC())),
])
classifier.fit(['some random text','some other text', ...], [CLS_A, CLS_B, ...])
Whereas my other usage is like:
classifier = LinearSVC()
classifier.fit([1.23, 4.23, ...], [CLS_A, CLS_B, ...])
How would I construct a LinearSVC classifier that could be trained using both sets of data simeltaneously? e.g.
classifier = ?
classifier.fit([('some random text',1.23),('some other text',4.23), ...], [CLS_A, CLS_B, ...])
The easy way:
import scipy.sparse
tfidf = Pipeline([
('vectorizer', HashingVectorizer(ngram_range=(1,4), non_negative=True)),
('tfidf', TfidfTransformer()),
])
X_tfidf = tfidf.fit_transform(texts)
X_other = load_your_other_features()
X = scipy.sparse.hstack([X_tfidf, X_other])
clf = LinearSVC().fit(X, y)
The principled solution, which allows you to keep everything in one Pipeline, would be to wrap hashing, tf-idf and your other feature extraction method in a few simple transformer objects and put these in a FeatureUnion, but it's hard to tell what the code would look like from the information you've given.
(P.S. As I keep saying on SO, on the mailing list and elsewhere, OneVsRestClassifier(LinearSVC()) is useless. LinearSVC does OvR out of the box, so this is just a slower way of fitting an OvR SVM.)

Categories