Transformers for mixed data types - python

I'm having trouble applying different transformers to columns of different types (text vs. numerical) at the same time, and combining those transformers into a single one for later use.
I tried to follow the steps in the documentation for Column Transformer with Mixed Types, which explains how to do that for a mix of categorical and numerical data, but it doesn't seem to work with text data.
TL;DR
How do you create a storable transformer that follows different pipelines for text and numerical data?
Data download and preparation
# imports
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.datasets import fetch_openml
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import StandardScaler
np.random.seed(0)
# download Titanic data
X, y = fetch_openml("titanic", version=1, as_frame=True, return_X_y=True)
# data preparation
numeric_features = ['age', 'fare']
text_features = ['name', 'cabin', 'home.dest']
X.fillna({text_col: '' for text_col in text_features}, inplace=True)
# train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
Transforming numerical features: ok
Following the steps in the link above, one can create a transformer for the numerical features as follows:
# handling missing data and normalization
numeric_transformer = Pipeline(steps=[('imputer', SimpleImputer(strategy='median')),
                                      ('scaler', StandardScaler())])
num_preprocessor = ColumnTransformer(transformers=[('num', numeric_transformer, numeric_features)])
# this works
num_preprocessor.fit(X_train)
train_feature_set = num_preprocessor.transform(X_train)
test_feature_set = num_preprocessor.transform(X_test)
# verify shape = (number of data points, number of numerical features (2) )
train_feature_set.shape # (1047, 2)
test_feature_set.shape # (262, 2)
Transforming text features: ok
To process text features, I vectorize each text column with TF-IDF (as opposed to concatenating all text columns, and applying TF-IDF just once):
# Tfidf of max 30 features
text_transformer = TfidfVectorizer(use_idf=True,
                                   max_features=30)
# apply separately to each column
text_transformer_list = [(x + '_vectorizer', text_transformer, x) for x in text_features]
text_preprocessor = ColumnTransformer(transformers=text_transformer_list)
# this works
text_preprocessor.fit(X_train)
train_feature_set = text_preprocessor.transform(X_train)
test_feature_set = text_preprocessor.transform(X_test)
# verify shape = (number of data points, number of text features (3) times max_features(30) )
train_feature_set.shape # (1047, 90)
test_feature_set.shape # (262, 90)
How do you do both at once?
I've tried various strategies to save both above procedures in a single transformer, but they all fail due to different errors.
Attempt 1: Follow documented strategy
Following the documentation (Column Transformer with Mixed Types) doesn't work, once text data replaces categorical data:
# documented strategy
sum_preprocessor = ColumnTransformer(transformers=[('num', numeric_transformer, numeric_features),
                                                   ('text', text_transformer, text_features)])
# fails
sum_preprocessor.fit(X_train)
This returns the following error message:
ValueError: all the input array dimensions for the concatenation axis must match exactly, but along dimension 0, the array at index 0 has size 1047 and the array at index 1 has size 3
Attempt 2: FeatureUnion on the lists of transformers
# create a list of numerical transformer, like those for text
numerical_transformer_list = [(x + '_scaler', numeric_transformer, x) for x in numeric_features]
# fails
column_trans = FeatureUnion([text_transformer_list, numerical_transformer_list])
This returns the following error message:
TypeError: All estimators should implement fit and transform. '('cabin_vectorizer', TfidfVectorizer(max_features=30), 'cabin')' (type <class 'tuple'>) doesn't
Attempt 3: ColumnTransformer on the lists of transformers
# create a list of all transformers, text and numerical
sum_transformer_list = text_transformer_list + numerical_transformer_list
# works
sum_preprocessor = ColumnTransformer(transformers=sum_transformer_list)
# fails
sum_preprocessor.fit(X_train)
This returns the following error message:
ValueError: Expected 2D array, got 1D array instead:
array=[54. nan nan ... 20. nan nan].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
My question
How do I create a single object that can fit and transform data mixing text and numerical types?

Short answer:
all_transformers = text_transformer_list + [('num', numeric_transformer, numeric_features)]
all_preprocessor = ColumnTransformer(transformers=all_transformers)
all_preprocessor.fit(X_train)
train_all = all_preprocessor.transform(X_train)
test_all = all_preprocessor.transform(X_test)
print(train_all.shape, test_all.shape)
# prints (1047, 92) (262, 92)
The difficulty here is that (most?) text transformers expect 1-dimensional input, but (most?) numerical transformers expect 2-dimensional input. ColumnTransformer handles that by allowing you to specify a single column or a list of columns: in the first case, the 1d array is passed on to the transformer, and in the second a 2d array is passed.
So, to explain the errors in the three attempts:
Attempt 1: The TF-IDF is receiving a 2d array, and treats the columns as the documents not the individual entries, and so produces just three outputs. When it tries to concatenate that to the 1047-row numerical output, it fails.
Attempt 2: FeatureUnion doesn't have the same input format as ColumnTransformer: you shouldn't have triples (name, transformer, columns) in this case. Anyway, FeatureUnion isn't meant for what you're doing here.
Attempt 3: This time you're trying to send 1d data through to the numerical transformer, but those are expecting 2d data.
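The 1d-versus-2d distinction can be reproduced on a tiny frame. The following is only a sketch: the `toy` DataFrame and its values are made up for illustration, and it uses a bare StandardScaler in place of the question's imputer-plus-scaler pipeline.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler

# Toy stand-in for the Titanic frame (values are made up for illustration).
toy = pd.DataFrame({
    "name": ["Allen, Miss Elisabeth", "Allison, Master Hudson",
             "Brown, Mrs Margaret", "Smith, Mr John"],
    "cabin": ["B5", "C22 C26", "", "E12"],
    "fare": [211.3, 151.6, 26.5, 7.9],
})

# Attempt 1 in miniature: iterating a 2d frame yields its column labels,
# so TF-IDF sees one "document" per column instead of one per row.
as_2d = TfidfVectorizer().fit_transform(toy[["name", "cabin"]])
print(as_2d.shape[0])  # number of columns, not number of rows

# Attempt 3 in miniature: a scalar column name hands the scaler 1d input.
try:
    ColumnTransformer([("num", StandardScaler(), "fare")]).fit(toy)
    scaler_rejected_1d = False
except ValueError:
    scaler_rejected_1d = True
print(scaler_rejected_1d)

# Working combination: scalar names for text columns, a list for numeric ones.
combined = ColumnTransformer([
    ("name_tfidf", TfidfVectorizer(), "name"),
    ("cabin_tfidf", TfidfVectorizer(), "cabin"),
    ("num", StandardScaler(), ["fare"]),
])
features = combined.fit_transform(toy)
print(features.shape[0])  # one row per passenger
```

The only difference between the failing and working specifications is whether the column selector is a string or a list of strings.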

Related

How do I convert values of data frame to string type and how do I use train_test_split to generate the two arrays with same dimensions?

I am trying to learn more about machine learning. I have a dataset of spam/non-spam emails and am trying to build a classifier. To use CountVectorizer, I need to convert the DataFrame values (the emails) to strings, but after looping over the rows and converting them, the values still come back as a pandas Series. 1. How would I fix that? P.S. I will include the code as well.
'''
import re

def preprocessor(e):
    e = re.sub("[^a-zA-Z0-9]+", " ", e)
    return e.lower()

indexes = list(df['content'].index)
for i in indexes:
    df['content'][i] = preprocessor(str(df['content'][i]))
    df['name'][i] = preprocessor(str(df['name'][i]))
    df['category'][i] = preprocessor(str(df['category'][i]))
'''
code for converting to string types
Secondly, assuming that the conversion worked, I next need to apply CountVectorizer, which should generate two arrays: X for the email texts and y for the category (spam or not spam). It does generate the arrays, but after I apply train_test_split and then try to fit my model, I get the error "y should be a 1d array, got an array of shape (3455, 1483) instead." I tried to reshape, but then the dimensions get all messed up.
'''
vectorizer = CountVectorizer()
x = vectorizer.fit_transform(df['content'])
x = x.toarray()
y = vectorizer.fit_transform(df['category'])
y = y.toarray()
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.33, random_state=42)
model = LogisticRegression(random_state=0)
model.fit(x_train, y_train)
'''

How to get feature names when using onehot encoder on only certain columns sklearn

I have read many posts on this that reference the get_feature_names() from sklearn which appears to be now deprecated and replaced by get_feature_names_out neither of which I can get to work. It also appears that there is no way to use the get_feature_names (or the get_feature_names_out) with the ColumnTransformer class. So I am trying to fit and transform my numeric columns with a SimpleImputer and then StandardScaler class then SimpleImpute ('most_frequent') and OneHotEncode the categorical variables. I run them all individually since I can't put them in a pipeline then I try to get_feature_names and this results:
ValueError: input_features should have length equal to number of features (5), got 11
I have also tried getting feature names for just the categorical features as well as just the numeric and each one give the following errors respectively:
ValueError: input_features should have length equal to number of features (5), got 121942
and
ValueError: input_features should have length equal to number of features (5), got 121942
I am completely lost and also open to an easier way to get the feature names so that I can make sure the prod data that I run this model on after training/testing has the exact same features as the ones the model is trained to expect (which is the root issue here).
If I'm "barking up the wrong tree" by trying to get the feature names for the reasoning outlined in the root issue I'm also more than willing to be corrected. Here is my code:
#ONE HOT
import sklearn
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
# !pip install -U scikit-learn
print('The scikit-learn version is {}.'.format(sklearn.__version__))
numeric_columns = X.select_dtypes(include=['int64','float64']).columns
cat_columns = X.select_dtypes(include=['object']).columns
si_num = SimpleImputer(strategy='median')
si_cat = SimpleImputer(strategy='most_frequent')
ss = StandardScaler()
ohe = OneHotEncoder()
si_num.fit_transform(X[numeric_columns])
si_cat.fit_transform(X[cat_columns])
ss.fit_transform(X[numeric_columns])
ohe.fit_transform(X[cat_columns])
ohe.get_feature_names(X[numeric_columns])
Thanks!
I think this should work as a single composite estimator that does all your transformations and provides get_feature_names_out:
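from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# si_num, ss, si_cat and ohe are the estimators defined in the question
num_pipe = Pipeline([
    ("imp", si_num),
    ("scale", ss),
])
cat_pipe = Pipeline([
    ("imp", si_cat),
    ("ohe", ohe),
])
preproc = ColumnTransformer([
    ("num", num_pipe, numeric_columns),
    ("cat", cat_pipe, cat_columns),
])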
Ideally, you should save the fitted composite and use that to transform production data, rather than using the feature names to reconcile different categories.
You should also fit this composite only on the training set, transforming the test set separately.

Unable to make prediction after loading sklearn model

I have created a ML model with Scikit-Learn and saved it. Now when I load the model, I have trouble with transformation and prediction.
I have 4 features in DataFrame. First two features are textual, and other 2 are numerical. The result column is 1 or 0.
In order to train my model, I used ColumnTransformer and CountVectorizer to transform and vectorize the textual features. I specified the NAMES of the columns that I want to transform/vectorize (columns text1 and text2). The numerical columns do not need to be vectorized, so remainder='passthrough' takes care of them.
Part of code that works:
features = df.iloc[:, :-1]
results = df.iloc[:, -1]
transformerVectoriser = ColumnTransformer(
    transformers=[
        ('vector word 1',
         CountVectorizer(analyzer='word', ngram_range=(1, 1), max_features=12000, stop_words='english'),
         'text1'),
        ('vector phrase 3',
         CountVectorizer(analyzer='word', ngram_range=(3, 3), max_features=2500, stop_words='english'),
         'text2'),
    ],
    remainder='passthrough')  # Default is to drop untransformed columns; 'passthrough' leaves them as they are
x_train, x_test, y_train, y_test = train_test_split(features, results, test_size=0.3, random_state=0)
x_train = transformerVectoriser.fit_transform(x_train)
x_test = transformerVectoriser.transform(x_test)
model = clf.fit(x_train, y_train)
y_pred = model.predict(x_test)
filename = 'ml_model.sav'
pickle.dump(model, open(filename, 'wb'))
filename = 'ml_transformer.sav'
pickle.dump(transformerVectoriser, open(filename, 'wb'))
But when I want to load a model, and make prediction I get an error:
# LOADING MODEL
model = pickle.load(open('ml_model.sav','rb'))
vectorizer = pickle.load(open('ml_transformer.sav','rb'))
# MAKING PREDICTION
data_for_prediction = vectorizer.transform([data_for_prediction]) #ERROR
print(model.predict_proba(data_for_prediction))
I get the error:
ValueError: Specifying the columns using strings is only supported for pandas DataFrames
When I was training my model I used a pandas DataFrame, but when I wanted to make a prediction I just put the values in a list. So data_for_prediction is a list that looks like this:
["text that should be vectorized with vectorizer that i created", "More texts that should be vectorized", 4, 7]
I think that is the cause of the error: I used column names when setting up the ColumnTransformer, but now that I want to make a prediction, the vectorizer does not know which values to vectorize.
My final model and vectorizer will be used in an API that only accepts JSON, so I do not want to convert the JSON to a DataFrame before passing it to the model.
Is there a way to fix this error without using a pandas DataFrame in my final Flask app?
The training data is a DataFrame with the columns
x_train.columns
(as they were before x_train was overwritten with the transformed matrix). The function vectorizer.transform() wants data in the same format, so assuming that
data_f_p = ["text that should be vectorized", "more text that should be vectorized", 4, 7]
corresponds to the same four columns as x_train, you can turn it into a one-row DataFrame with
data_f_p = pd.DataFrame([data_f_p], columns=x_train.columns)
data_f_p = vectorizer.transform(data_f_p)
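If the values arrive as a JSON object, the same idea can be sketched as follows. This is only an illustration under assumptions: the column names `num1` and `num2`, the toy training rows and the payload are all made up, and the training frame's column order is kept in a plain list so that it is still available at prediction time.

```python
import json

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer

# Made-up training frame with two text and two numeric columns.
train = pd.DataFrame({
    "text1": ["cheap pills online", "meeting at noon", "win money now"],
    "text2": ["buy cheap pills today", "see you at the meeting", "claim your free prize"],
    "num1": [4, 1, 9],
    "num2": [7, 0, 3],
})
feature_columns = list(train.columns)  # remember the column order used for fitting

vectorizer = ColumnTransformer(
    [("t1", CountVectorizer(), "text1"), ("t2", CountVectorizer(), "text2")],
    remainder="passthrough",
)
train_features = vectorizer.fit_transform(train)

# Hypothetical JSON body of an API request.
payload = json.loads(
    '{"text1": "cheap meeting", "text2": "free pills", "num1": 4, "num2": 7}'
)

# Wrap the single record in a one-row DataFrame with the training columns.
row = pd.DataFrame([payload], columns=feature_columns)
row_features = vectorizer.transform(row)
print(row_features.shape)
```

Building the one-row frame costs very little per request, and it keeps the name-based column selection that the transformer was fitted with.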
In the case you don't want to use pandas.DataFrame in your REST API endpoint, just don't train your model with the DataFrame but convert your data to a numpy array first:
>>> df
                    TEXT_1                TEXT_2    NUM_1  NUM_2
0  This is the first text.      The second text.  300.000   23.3
1  Here is the third text.  And the fourth text.    2.334   29.0
>>> df.to_numpy()
array([['This is the first text.', 'The second text.', 300.0, 23.3],
       ['Here is the third text.', 'And the fourth text.', 2.334, 29.0]],
      dtype=object)
Then, change how you define the model. I'd suggest combining the preprocessing and prediction steps into a single model using sklearn.pipeline.Pipeline, like this:
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier
model = Pipeline(steps=[
    ('transformer', ColumnTransformer(
        transformers=[
            ('TEXT_1', CountVectorizer(analyzer='word', stop_words='english'), 0),
            ('TEXT_2', CountVectorizer(analyzer='word', stop_words='english'), 1),
        ],
        remainder='passthrough',
    )),
    ('predictor', RandomForestClassifier()),
])
Note, here we are using indices instead of names to reference texts when defining transformers for the ColumnTransformer instance. Once we've transformed the initial DataFrame to a numpy array, the TEXT_1 feature is located at 0, and the TEXT_2 at 1 in a data row. Here is how you can use the model:
from joblib import dump, load
X = df.to_numpy()
model.fit(X, y)
dump(model, 'model.joblib')
...
model = load('model.joblib')
results = model.predict(data)
As a result, you don't have to convert your incoming data to the DataFrame in order to make a prediction.
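For reference, an end-to-end version of this index-based approach might look like the sketch below. The training rows, labels and the incoming record are made up, the stop-word filter is left out so the tiny vocabulary cannot end up empty, and the persistence step is omitted to keep the example self-contained.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline

# Made-up training data: columns 0 and 1 are text, columns 2 and 3 are numeric.
X = np.array([
    ["cheap pills online", "buy cheap pills today", 4, 7],
    ["meeting at noon", "see you at the meeting", 1, 0],
    ["win money now", "claim your free prize", 9, 3],
    ["lunch tomorrow", "project status update", 2, 1],
], dtype=object)
y = np.array([1, 0, 1, 0])

model = Pipeline(steps=[
    ("transformer", ColumnTransformer(
        transformers=[
            ("TEXT_1", CountVectorizer(), 0),  # scalar index -> 1d input
            ("TEXT_2", CountVectorizer(), 1),
        ],
        remainder="passthrough",
    )),
    ("predictor", RandomForestClassifier(n_estimators=10, random_state=0)),
])
model.fit(X, y)

# A single incoming record, e.g. decoded from JSON, as a plain list.
record = ["free pills", "claim prize today", 2, 5]
prediction = model.predict(np.array([record], dtype=object))
print(prediction.shape)
```

Wrapping the record in an object-dtype array keeps the numbers from being coerced to strings, which is what a default `np.array` call on mixed values would do.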

Python's "StandardScaler" and "LabelEncoder", and "fit" and "fit_transform" do not work with a CSV which contains both float and string

I was learning about the MLP regressor in Google Colaboratory and ran this source code:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
data = np.array(table)
scaler.fit(data)
y_index = data.shape[1]-1
sd_x = (scaler.var_[:y_index])**0.5
sd_y = (scaler.var_[y_index])**0.5
mean_x = scaler.mean_[:y_index]
mean_y = scaler.mean_[y_index]
x = (data[:, :y_index]).astype(np.float32)
y = (data[:, y_index]).astype(np.float32)
train_x, test_x, train_y, test_y = train_test_split(x, y, test_size=0.25)
print('Separate training and testing sets!')
It gave the error ValueError: could not convert string to float: 'Photo Editor & Candy Camera & Grid & ScrapBook'.
So I checked the question RandomForestClassfier.fit(): ValueError: could not convert string to float. I also tried sklearn-LinearRegression: could not convert string to float: '--'.
I changed fit(data) to fit_transform(data), but the same error persisted. Then I changed StandardScaler to LabelEncoder, i.e. scaler = StandardScaler() became scaler = LabelEncoder(). This time a different error appeared on the line scaler.fit_transform(data): ValueError: bad input shape (10841, 13).
You can check the CSV from Kaggle's CSV here. The CSV contains both strings and numbers without quotation marks (except the prices which contain double quotation marks).
From the documentation of sklearn's LabelEncoder: This transformer should be used to encode target values, i.e. y, and not the input X.
Particularly, it's not intended to fit a LabelEncoder on the full dataset.
If you just want to replace the values of the categorical (i.e, string-valued) columns by unique and numeric ids, one way to go is to apply the label encoder (before splitting the data) on each column you want to encode individually. As your sample code imports pandas, I assume that your data has been loaded into a pandas.DataFrame like
df = pd.read_csv('/path/to/googleplaystore.csv')
From there, you can apply the encoder on each column:
df['App'] = LabelEncoder().fit_transform(df['App'].values)
You may also want to have a look how to handle categorical data within pandas.
However, even after doing this for each non-numeric column in your dataset, there is still a long way before fitting a model on the encoded data (you may want to apply one-hot encoding onto these columns afterwards, but this heavily depends on the model you want to use).
StandardScaler is a preprocessing class from sklearn that takes numeric entries and standardizes each column to zero mean and unit variance (it rescales the values but does not change the shape of their distribution). It doesn't deal with text data. That explains the first error.
LabelEncoder is another preprocessing class from sklearn that takes data and maps them to a numeric encoded representation.
Ex: ["apple","banana","apple","banana"] to [0,1,0,1]
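The mapping in that example can be checked directly. This is a small sketch using a plain list as a stand-in for a single 1d column:

```python
from sklearn.preprocessing import LabelEncoder

fruits = ["apple", "banana", "apple", "banana"]

encoder = LabelEncoder()
codes = encoder.fit_transform(fruits)  # 1d input only; a 2d table is rejected

print(codes.tolist())                             # [0, 1, 0, 1]
print(encoder.classes_.tolist())                  # ['apple', 'banana']
print(encoder.inverse_transform(codes).tolist())  # back to the original strings
```

Classes are sorted before ids are assigned, so the integer codes carry no meaning beyond identity.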
Your dataset has missing values, and you should deal with them first, by imputing, dropping, or some similar approach.
Then you should convert each column to an appropriate type (all but Rating are read as object, i.e. string) so that each datatype is handled properly.
table = pd.read_csv('googleplaystore.csv')
# check dataset info
table.info()
# check missing values
table.isna().sum()
To be honest, I think this is more of a conceptual problem than a technical one. As other users told you, StandardScaler must be used on numeric columns, but most of your DataFrame columns are of object type. You should probably use OneHotEncoder on those; all transformers in sklearn behave in a similar way.
from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder(handle_unknown='ignore')
ohe.fit_transform(X) # your data without target column
# ...blabla...
Finally, I recommend reading about Pipelines in sklearn; I think they are more elegant than a lot of messy code. You can put the preprocessing and model steps in the same pipeline.
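A minimal sketch of such a pipeline for a table that mixes strings and numbers could look like this. The column names and values below are made up rather than taken from the Kaggle file, and Ridge is used purely as a simple stand-in for the regressor.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up table: one string column, one numeric column with a gap, one target.
toy = pd.DataFrame({
    "Category": ["ART", "GAME", "GAME", "TOOLS", "ART", "TOOLS"],
    "Reviews": [159.0, 967.0, np.nan, 215.0, 87.0, 1200.0],
    "Rating": [4.1, 3.9, 4.7, 4.5, 4.3, 4.0],
})
X = toy[["Category", "Reviews"]]
y = toy["Rating"]

preprocess = ColumnTransformer([
    # numeric branch: fill gaps, then standardize
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ]), ["Reviews"]),
    # string branch: one-hot encode, ignoring categories unseen during fit
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Category"]),
])

model = Pipeline([("preprocess", preprocess), ("regress", Ridge())])
model.fit(X, y)

# New rows go through exactly the same preprocessing before prediction.
new_rows = pd.DataFrame({"Category": ["MUSIC"], "Reviews": [500.0]})
pred = model.predict(new_rows)
print(pred.shape)
```

Because the scaler only ever sees the numeric branch, the "could not convert string to float" error does not arise.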

sklearn Stacking Estimator passthrough skips preprocessing and passes original data

This issue has been discussed here but there has been no comments: https://github.com/scikit-learn/scikit-learn/issues/16473
I have some numerical features and categorical features in X. The categorical features were one hot encoded. So my pipeline is something similar to the sklearn docs example:
cat_proc_lin = make_pipeline(
    SimpleImputer(missing_values=None,
                  strategy='constant',
                  fill_value='missing'),
    OneHotEncoder(categories=categories)
)
num_proc_lin = make_pipeline(
    SimpleImputer(strategy='mean'),
    StandardScaler()
)
processor_lin = make_column_transformer(
    (cat_proc_lin, cat_cols),
    (num_proc_lin, num_cols),
    remainder='passthrough')
lasso_pipeline = make_pipeline(processor_lin,
                               LassoCV())
rf_pipeline = make_pipeline(processor_nlin,
                            RandomForestRegressor(random_state=42))
gradient_pipeline = make_pipeline(
    processor_nlin,
    HistGradientBoostingRegressor(random_state=0))
estimators = [('Random Forest', rf_pipeline),
              ('Lasso', lasso_pipeline),
              ('Gradient Boosting', gradient_pipeline)]
stacking_regressor = StackingRegressor(estimators=estimators,
                                       final_estimator=RidgeCV())
But if I set passthrough=True, fitting fails, because passthrough hands the original X to the final estimator and skips the preprocessing part of the pipeline:
/usr/local/lib/python3.6/dist-packages/sklearn/model_selection/_validation.py:536: FitFailedWarning: Estimator fit failed. The score on this train-test partition for these parameters will be set to nan. Details:
ValueError: could not convert string to float: 'RL'
Is there any way to make the passthrough include the first preprocessing part of the pipeline?
I also cannot add the preprocessing pipeline in front of the final estimator, because it would concatenate the original X DataFrame with the final-layer predictions, which are a numpy array, as mentioned in the GitHub discussion linked at the top of this post. My exact preprocessing pipeline has several custom transformers that operate on pandas DataFrames.
Thank you for any help!
