This issue has been discussed here, but there have been no comments: https://github.com/scikit-learn/scikit-learn/issues/16473
I have some numerical features and categorical features in X. The categorical features were one-hot encoded, so my pipeline is similar to the sklearn docs example:
cat_proc_lin = make_pipeline(
    SimpleImputer(missing_values=None,
                  strategy='constant',
                  fill_value='missing'),
    OneHotEncoder(categories=categories)
)

num_proc_lin = make_pipeline(
    SimpleImputer(strategy='mean'),
    StandardScaler()
)

processor_lin = make_column_transformer(
    (cat_proc_lin, cat_cols),
    (num_proc_lin, num_cols),
    remainder='passthrough')

lasso_pipeline = make_pipeline(processor_lin,
                               LassoCV())

rf_pipeline = make_pipeline(processor_nlin,
                            RandomForestRegressor(random_state=42))

gradient_pipeline = make_pipeline(
    processor_nlin,
    HistGradientBoostingRegressor(random_state=0))

estimators = [('Random Forest', rf_pipeline),
              ('Lasso', lasso_pipeline),
              ('Gradient Boosting', gradient_pipeline)]

stacking_regressor = StackingRegressor(estimators=estimators,
                                       final_estimator=RidgeCV())
But if I set passthrough=True, fitting fails because the passthrough gives the final estimator the original X and skips the preprocessing part of the pipeline:
/usr/local/lib/python3.6/dist-packages/sklearn/model_selection/_validation.py:536: FitFailedWarning: Estimator fit failed. The score on this train-test partition for these parameters will be set to nan. Details:
ValueError: could not convert string to float: 'RL'
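For reference, the only change that triggers this is adding the passthrough flag to the stacking regressor:

stacking_regressor = StackingRegressor(estimators=estimators,
                                       final_estimator=RidgeCV(),
                                       passthrough=True)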
Is there any way to make the passthrough include the first preprocessing part of the pipeline?
I also cannot add the preprocessing pipeline in front of the final estimator, because it would concatenate the original X dataframe with the final-layer predictions, which are a numpy array, as mentioned in the GitHub discussion linked at the top of this post. My actual preprocessing pipeline has several custom transformers that operate on a pandas dataframe.
Thank you for any help!
Related
I have read many posts on this that reference get_feature_names() from sklearn, which now appears to be deprecated and replaced by get_feature_names_out; I cannot get either of them to work. It also appears that there is no way to use get_feature_names (or get_feature_names_out) with the ColumnTransformer class. So I am trying to fit and transform my numeric columns with SimpleImputer and then StandardScaler, and then SimpleImputer('most_frequent') and OneHotEncoder for the categorical variables. I run them all individually, since I can't put them in a pipeline, then I try get_feature_names and get this result:
ValueError: input_features should have length equal to number of features (5), got 11
I have also tried getting feature names for just the categorical features as well as just the numeric ones, and each one gives the following errors respectively:
ValueError: input_features should have length equal to number of features (5), got 121942
and
ValueError: input_features should have length equal to number of features (5), got 121942
I am completely lost, and also open to an easier way to get the feature names, so that I can make sure the production data I run this model on after training/testing has exactly the same features as the ones the model was trained to expect (which is the root issue here).
If I'm "barking up the wrong tree" by trying to get the feature names for the reasoning outlined in the root issue, I'm also more than willing to be corrected. Here is my code:
#ONE HOT
import sklearn
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
# !pip install -U scikit-learn
print('The scikit-learn version is {}.'.format(sklearn.__version__))
numeric_columns = X.select_dtypes(include=['int64','float64']).columns
cat_columns = X.select_dtypes(include=['object']).columns
si_num = SimpleImputer(strategy='median')
si_cat = SimpleImputer(strategy='most_frequent')
ss = StandardScaler()
ohe = OneHotEncoder()
si_num.fit_transform(X[numeric_columns])
si_cat.fit_transform(X[cat_columns])
ss.fit_transform(X[numeric_columns])
ohe.fit_transform(X[cat_columns])
ohe.get_feature_names(X[numeric_columns])  # note: ohe was fit on cat_columns, but the numeric columns are passed here
Thanks!
I think this should work as a single composite estimator that does all your transformations and provides get_feature_names_out:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

num_pipe = Pipeline([
    ("imp", si_num),
    ("scale", ss),
])
cat_pipe = Pipeline([
    ("imp", si_cat),
    ("ohe", ohe),
])
preproc = ColumnTransformer([
    ("num", num_pipe, numeric_columns),
    ("cat", cat_pipe, cat_columns),
])
Ideally, you should save the fitted composite and use that to transform production data, rather than using the feature names to reconcile different categories.
You should also fit this composite only on the training set, transforming the test set separately.
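A minimal usage sketch of the points above, assuming X has already been split into X_train and X_test and that scikit-learn is recent enough (>= 1.0) for ColumnTransformer to provide get_feature_names_out:

import joblib

# fit only on the training data, then reuse the fitted composite
preproc.fit(X_train)
Xt_train = preproc.transform(X_train)
Xt_test = preproc.transform(X_test)

# names of the generated columns (scaled numerics + one-hot dummies)
feature_names = preproc.get_feature_names_out()

# persist the fitted composite and apply it unchanged to production data
joblib.dump(preproc, 'preproc.joblib')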
I wrote the following code and it gives me this error:
"Given feature/column names do not match the ones for the data given
during fit."
The train and prediction data have the same features.
df_train = data_preprocessing(df_train)

# Split X and y
X_train = df_train.drop(target_columns, axis=1)
y_train = df_train[target_columns]

# Lists of categorical and numerical column names
categorical_columns = X_train.columns[X_train.dtypes == 'O'].tolist()
numerical_columns = X_train.columns[X_train.dtypes != 'O'].tolist()

# Scaling & encoding objects
numeric_transformer = Pipeline(steps=[('scaler', StandardScaler())])
categorical_transformer = OneHotEncoder(handle_unknown='ignore')
col_transformers = ColumnTransformer(
    # name, transformer itself, columns to apply
    transformers=[("scaler_onestep", numeric_transformer, numerical_columns),
                  ("ohe_onestep", categorical_transformer, categorical_columns)])

# Model
model = MultiOutputClassifier(
    xgb.XGBClassifier(objective="binary:logistic",
                      colsample_bytree=0.5))

# Define a pipeline
pipeline = Pipeline([("preprocessing", col_transformers), ("XGB", model)])
pipeline.fit(X_train, y_train)

# Preprocess the prediction data and predict
predicted = data_preprocessing(predicted)
X_predicted = predicted.drop(target_columns, axis=1)
predictions = pipeline.predict(X_predicted)
I get the error during prediction. How can I fix this problem? I couldn't find any solution.
Try reordering the columns in X_predicted so that they exactly match X_train.
I am guessing the feature names in the training dataset are not identical to those in the prediction dataset.
For example, if you have 19 features in the training dataset, the prediction dataset must contain the same 19 features. The model cannot be applied to features it has not seen before.
Adding to the above answers: if you're using a ColumnTransformer like the OP and are unsure what the column names were at the time the model was fit, you can use pipeline.named_steps['preprocessing']._feature_names_in (exposed as the public feature_names_in_ attribute in recent scikit-learn versions) to figure it out.
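A small sketch combining both suggestions (the exact attribute name depends on the scikit-learn version; feature_names_in_ is the public spelling on recent releases):

# columns the preprocessor saw during fit (use _feature_names_in on older versions)
expected = pipeline.named_steps['preprocessing'].feature_names_in_

# align the prediction frame to exactly those columns before predicting
X_predicted = X_predicted[list(expected)]
predictions = pipeline.predict(X_predicted)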
I'm trying to train the model to classify short texts. I do the following:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(max_features=1000)
train['vector'] = tfidf.fit_transform(train['item_name'])
train = train.drop('item_name', axis=1)

y = train.category_id
train = train.drop('category_id', axis=1)

X_train, X_test, y_train, y_test = train_test_split(train, y, test_size=0.10, stratify=y, random_state=42)

import xgboost as xgb
xgb_model = xgb.XGBClassifier()
xgb_model.fit(X_train, y_train)
But I get an error:
ValueError: DataFrame.dtypes for data must be int, float, bool or categorical. When
categorical type is supplied, DMatrix parameter
enable_categorical must be set to True.vector
You should be training XGBClassifier with TfidfVectorizer transformation results only. Right now you're passing the original un-vectorized text sentences also, which causes the above ValueError to be raised.
The simplest solution is to set up a two-step pipeline:
pipeline = Pipeline([
("vectorizer", TfidfVectorizer()),
("classifier", XGBClassifier())
])
pipeline.fit(X_train, y_train)
However, be aware that XGBoost estimators interpret sparse data matrices differently from regular Scikit-Learn estimators. To get correct/meaningful results, you should additionally convert the sparse data matrix to a dense one.
See Training Scikit-Learn based TF(-IDF) plus XGBoost pipelines
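A minimal sketch of that idea, using a FunctionTransformer-based densifying step between the vectorizer and the classifier (the step names are illustrative):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer
from xgboost import XGBClassifier

def to_dense(X):
    # convert the scipy sparse matrix produced by TfidfVectorizer to a dense array
    return X.toarray()

pipeline = Pipeline([
    ("vectorizer", TfidfVectorizer(max_features=1000)),
    ("densify", FunctionTransformer(to_dense, accept_sparse=True)),
    ("classifier", XGBClassifier()),
])

# X_train here is the raw text column (e.g. train['item_name']), not the whole frame
pipeline.fit(X_train, y_train)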
I'm having trouble applying different transformers at once to columns of different types (text vs. numerical), and concatenating those transformers into a single one for later use.
I tried to follow the steps in the documentation for Column Transformer with Mixed Types, which explains how to do that for a mix of categorical and numerical data, but it doesn't seem to work with text data.
TL;DR
How do you create a storable transformer that follows different pipelines for text and numerical data?
Data download and preparation
# imports
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.datasets import fetch_openml
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import StandardScaler
np.random.seed(0)
# download Titanic data
X, y = fetch_openml("titanic", version=1, as_frame=True, return_X_y=True)
# data preparation
numeric_features = ['age', 'fare']
text_features = ['name', 'cabin', 'home.dest']
X.fillna({text_col: '' for text_col in text_features}, inplace=True)
# train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
Transforming numerical features: ok
Following the steps in the link above, one can create a transformer for the numerical features as follows:
# handling missing data and normalization
numeric_transformer = Pipeline(steps=[('imputer', SimpleImputer(strategy='median')),
                                      ('scaler', StandardScaler())])
num_preprocessor = ColumnTransformer(transformers=[('num', numeric_transformer, numeric_features)])
# this works
num_preprocessor.fit(X_train)
train_feature_set = num_preprocessor.transform(X_train)
test_feature_set = num_preprocessor.transform(X_test)
# verify shape = (number of data points, number of numerical features (2) )
train_feature_set.shape # (1047, 2)
test_feature_set.shape # (262, 2)
Transforming text features: ok
To process text features, I vectorize each text column with TF-IDF (as opposed to concatenating all text columns, and applying TF-IDF just once):
# Tfidf of max 30 features
text_transformer = TfidfVectorizer(use_idf=True,
                                   max_features=30)
# apply separately to each column
text_transformer_list = [(x + '_vectorizer', text_transformer, x) for x in text_features]
text_preprocessor = ColumnTransformer(transformers=text_transformer_list)
# this works
text_preprocessor.fit(X_train)
train_feature_set = text_preprocessor.transform(X_train)
test_feature_set = text_preprocessor.transform(X_test)
# verify shape = (number of data points, number of text features (3) times max_features(30) )
train_feature_set.shape # (1047, 90)
test_feature_set.shape # (262, 90)
How do you do both at once?
I've tried various strategies to save both above procedures in a single transformer, but they all fail due to different errors.
Attempt 1: Follow documented strategy
Following the documentation (Column Transformer with Mixed Types) doesn't work, once text data replaces categorical data:
# documented strategy
sum_preprocessor = ColumnTransformer(transformers=[('num', numeric_transformer, numeric_features),
                                                   ('text', text_transformer, text_features)])
# fails
sum_preprocessor.fit(X_train)
returns following error message:
ValueError: all the input array dimensions for the concatenation axis must match exactly, but along dimension 0, the array at index 0 has size 1047 and the array at index 1 has size 3
Attempt 2: FeatureUnion on the lists of transformers
# create a list of numerical transformer, like those for text
numerical_transformer_list = [(x + '_scaler', numeric_transformer, x) for x in numeric_features]
# fails
column_trans = FeatureUnion([text_transformer_list, numerical_transformer_list])
returns following error message:
TypeError: All estimators should implement fit and transform. '('cabin_vectorizer', TfidfVectorizer(max_features=30), 'cabin')' (type <class 'tuple'>) doesn't
Attempt 3: ColumnTransformer on the lists of transformers
# create a list of all transformers, text and numerical
sum_transformer_list = text_transformer_list + numerical_transformer_list
# works
sum_preprocessor = ColumnTransformer(transformers=sum_transformer_list)
# fails
sum_preprocessor.fit(X_train)
returns following error message:
ValueError: Expected 2D array, got 1D array instead:
array=[54. nan nan ... 20. nan nan].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
My question
How do I create a single object that can fit and transform data mixing text and numerical types?
Short answer:
all_transformers = text_transformer_list + [('num', numeric_transformer, numeric_features)]
all_preprocessor = ColumnTransformer(transformers=all_transformers)
all_preprocessor.fit(X_train)
train_all = all_preprocessor.transform(X_train)
test_all = all_preprocessor.transform(X_test)
print(train_all.shape, test_all.shape)
# prints (1047, 92) (262, 92)
The difficulty here is that (most?) text transformers expect 1-dimensional input, but (most?) numerical transformers expect 2-dimensional input. ColumnTransformer handles that by allowing you to specify a single column or a list of columns: in the first case, the 1d array is passed on to the transformer, and in the second a 2d array is passed.
So, to explain the errors in the three attempts:
Attempt 1: the TfidfVectorizer receives a 2d array and treats the columns, rather than the individual entries, as the documents, so it produces just three outputs. When ColumnTransformer tries to concatenate those with the 1047-row numerical output, it fails.
Attempt 2: FeatureUnion doesn't have the same input format as ColumnTransformer: you shouldn't have triples (name, transformer, columns) in this case. Anyway, FeatureUnion isn't meant for what you're doing here.
Attempt 3: This time you're trying to send 1d data through to the numerical transformer, but those are expecting 2d data.
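If you wanted to keep attempt 3's per-column structure for the numerical features as well, a sketch of the fix, following the 1d/2d rule above, is to wrap each numeric column name in a one-element list so the transformer receives 2d input:

# each numeric column passed as a one-element list -> the pipeline receives 2d input
numerical_transformer_list = [(x + '_scaler', numeric_transformer, [x]) for x in numeric_features]

sum_preprocessor = ColumnTransformer(transformers=text_transformer_list + numerical_transformer_list)
sum_preprocessor.fit(X_train)  # now works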
I am pretty new to pipelines in sklearn and I am running into this problem: I have a dataset with a mixture of text and numbers, i.e. certain columns contain text only and the rest contain integers (or floating-point numbers).
I was wondering if it is possible to build a pipeline where I can, for example, call LabelEncoder() on the text features and MinMaxScaler() on the numeric columns. The examples I have seen on the web mostly apply LabelEncoder() to the entire dataset, not to selected columns. Is this possible? If so, any pointers would be greatly appreciated.
The way I usually do it is with a FeatureUnion, using a FunctionTransformer to pull out the relevant columns.
Important notes:
You have to define your functions with def since annoyingly you can't use lambda or partial in FunctionTransformer if you want to pickle your model
You need to initialize FunctionTransformer with validate=False
Something like this:
from sklearn.pipeline import make_union, make_pipeline
from sklearn.preprocessing import FunctionTransformer, LabelEncoder, MinMaxScaler

def get_text_cols(df):
    return df[['name', 'fruit']]

def get_num_cols(df):
    return df[['height', 'age']]

vec = make_union(*[
    make_pipeline(FunctionTransformer(get_text_cols, validate=False), LabelEncoder()),
    make_pipeline(FunctionTransformer(get_num_cols, validate=False), MinMaxScaler())
])
Since v0.20, you can use ColumnTransformer to accomplish this.
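A minimal sketch of that for this question's case (note: LabelEncoder is designed for 1d targets, so OrdinalEncoder is used below as the column-wise substitute; the column names are taken from the example above and df stands for the input dataframe):

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, OrdinalEncoder

preprocess = ColumnTransformer(transformers=[
    # OrdinalEncoder stands in for LabelEncoder, which only accepts 1d targets
    ('text', OrdinalEncoder(), ['name', 'fruit']),
    ('num', MinMaxScaler(), ['height', 'age']),
])

features = preprocess.fit_transform(df)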
An example of ColumnTransformer might help you:
# FOREGOING TRANSFORMATIONS ON 'data' ...

# filter data
data = data[data['county'].isin(COUNTIES_OF_INTEREST)]

# define the feature encoding of the data
impute_and_one_hot_encode = Pipeline([
    ('impute', SimpleImputer(strategy='most_frequent')),
    ('encode', OneHotEncoder(sparse=False, handle_unknown='ignore'))
])

featurisation = ColumnTransformer(transformers=[
    ("impute_and_one_hot_encode", impute_and_one_hot_encode, ['smoker', 'county', 'race']),
    ('word2vec', MyW2VTransformer(min_count=2), ['last_name']),
    ('numeric', StandardScaler(), ['num_children', 'income'])
])

# define the training pipeline for the model
neural_net = KerasClassifier(build_fn=create_model, epochs=10, batch_size=1, verbose=0, input_dim=109)
pipeline = Pipeline([
    ('features', featurisation),
    ('learner', neural_net)])

# train-test split
train_data, test_data = train_test_split(data, random_state=0)

# model training
model = pipeline.fit(train_data, train_data['label'])
You can find the entire code under: https://github.com/stefan-grafberger/mlinspect/blob/19ca0d6ae8672249891835190c9e2d9d3c14f28f/example_pipelines/healthcare/healthcare.py