Performing CountVectorizer on multiple columns [duplicate] - python

While using this as a model for spam classification, I'd like to add an additional feature of the Subject plus the body.
I have all of my features in a pandas dataframe. For example, the subject is df['Subject'], the body is df['body_text'] and the spam/ham label is df['ham/spam']
I receive the following error:
TypeError: 'FeatureUnion' object is not iterable
How can I use both df['Subject'] and df['body_text'] as features all while running them through the pipeline function?
from sklearn.pipeline import FeatureUnion

features = df[['Subject', 'body_text']].values
combined_2 = FeatureUnion(list(features))

pipeline = Pipeline([
    ('count_vectorizer', CountVectorizer(ngram_range=(1, 2))),
    ('tfidf_transformer', TfidfTransformer()),
    ('classifier', MultinomialNB())])

pipeline.fit(combined_2, df['ham/spam'])

k_fold = KFold(n=len(df), n_folds=6)
scores = []
confusion = numpy.array([[0, 0], [0, 0]])
for train_indices, test_indices in k_fold:
    train_text = combined_2.iloc[train_indices]
    train_y = df.iloc[train_indices]['ham/spam'].values
    test_text = combined_2.iloc[test_indices]
    test_y = df.iloc[test_indices]['ham/spam'].values

    pipeline.fit(train_text, train_y)
    predictions = pipeline.predict(test_text)
    prediction_prob = pipeline.predict_proba(test_text)

    confusion += confusion_matrix(test_y, predictions)
    score = f1_score(test_y, predictions, pos_label='spam')
    scores.append(score)

FeatureUnion was not meant to be used that way. It takes two (or more) feature extractors / vectorizers and applies them to the input data; it does not take data in its constructor the way it is shown.
CountVectorizer expects a sequence of strings. The easiest way to provide that is to concatenate the strings together. That passes the text from both columns to the same CountVectorizer.
combined_2 = df['Subject'] + ' ' + df['body_text']
An alternative method would be to run CountVectorizer and optionally TfidfTransformer individually on each column, and then stack the results.
import scipy.sparse as sp
subject_vectorizer = CountVectorizer(...)
subject_vectors = subject_vectorizer.fit_transform(df['Subject'])
body_vectorizer = CountVectorizer(...)
body_vectors = body_vectorizer.fit_transform(df['body_text'])
combined_2 = sp.hstack([subject_vectors, body_vectors], format='csr')
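Note that any new data then has to go through the same two fitted vectorizers before stacking. A minimal sketch, assuming a hypothetical new_df with the same columns (use transform, not fit_transform, so the vocabulary learned on the training data is reused):
# Hypothetical new data: new_df has the same 'Subject' and 'body_text' columns
new_subject_vectors = subject_vectorizer.transform(new_df['Subject'])
new_body_vectors = body_vectorizer.transform(new_df['body_text'])
new_combined = sp.hstack([new_subject_vectors, new_body_vectors], format='csr')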
A third option is to implement your own transformer that would extract a dataframe column.
class DataFrameColumnExtracter(TransformerMixin):
    def __init__(self, column):
        self.column = column

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        return X[self.column]
In that case you could use FeatureUnion on two pipelines, each containing your custom transformer, then CountVectorizer.
from sklearn.pipeline import make_pipeline, make_union

subj_pipe = make_pipeline(
    DataFrameColumnExtracter('Subject'),
    CountVectorizer()
)

body_pipe = make_pipeline(
    DataFrameColumnExtracter('body_text'),
    CountVectorizer()
)

feature_union = make_union(subj_pipe, body_pipe)
This feature union of pipelines will take the dataframe and each pipeline will process its column. It will produce the concatenation of term count matrices from the two columns given.
sparse_matrix_of_counts = feature_union.fit_transform(df)
This feature union can also be added as the first step in a larger pipeline.
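For instance, a minimal sketch (assuming the df and the estimators from the question) of the union feeding the rest of the original pipeline:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

pipeline = Pipeline([
    ('features', feature_union),               # counts from 'Subject' and 'body_text'
    ('tfidf_transformer', TfidfTransformer()),
    ('classifier', MultinomialNB())
])
pipeline.fit(df[['Subject', 'body_text']], df['ham/spam'])
predictions = pipeline.predict(df[['Subject', 'body_text']])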

Related

Is there a way to use mutual information as part of a pipeline in scikit learn?

I'm creating a model with scikit-learn. The pipeline that seems to be working best is:
1. mutual_info_classif with a threshold - i.e. only include fields whose mutual information score is above a given threshold
2. PCA
3. LogisticRegression
I'd like to do them all using sklearn's pipeline object, but I'm not sure how to get the mutual info classification in. For the second and third steps I do:
pca = PCA(random_state=100)
lr = LogisticRegression(random_state=200)

pipe = Pipeline(
    [
        ('dim_red', pca),
        ('pred', lr)
    ]
)
But I don't see a way to include the first step. I know I can create my own class to do this, and I will if I have to, but is there a way to do this within sklearn?
You can implement your own estimator by subclassing BaseEstimator. Then you can pass it as the estimator to a SelectFromModel instance, which can be used in a Pipeline:
from sklearn.feature_selection import SelectFromModel, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.base import BaseEstimator
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
X = [[ 0.87, -1.34,  0.31],
     [-2.79, -0.02, -0.85],
     [-1.34, -0.48, -2.55],
     [ 1.92,  1.48,  0.65]]
y = [0, 1, 0, 1]

class MutualInfoEstimator(BaseEstimator):
    def __init__(self, discrete_features='auto', n_neighbors=3, copy=True, random_state=None):
        self.discrete_features = discrete_features
        self.n_neighbors = n_neighbors
        self.copy = copy
        self.random_state = random_state

    def fit(self, X, y):
        self.feature_importances_ = mutual_info_classif(
            X, y,
            discrete_features=self.discrete_features,
            n_neighbors=self.n_neighbors,
            copy=self.copy,
            random_state=self.random_state)
        return self
feat_sel = SelectFromModel(estimator=MutualInfoEstimator(random_state=0))
pca = PCA(random_state=100)
lr = LogisticRegression(random_state=200)

pipe = Pipeline(
    [
        ('feat_sel', feat_sel),
        ('pca', pca),
        ('pred', lr)
    ]
)
print(pipe)

Pipeline(steps=[('feat_sel',
                 SelectFromModel(estimator=MutualInfoEstimator(random_state=0))),
                ('pca', PCA(random_state=100)),
                ('pred', LogisticRegression(random_state=200))])
Note that of course the new estimator should expose the parameters you want to tweak during optimisation. Here I just exposed all of them.
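As a rough, hedged sketch of what that tuning could look like (parameter names follow the step names in pipe above; the toy X/y here are too small to actually fit):
from sklearn.model_selection import GridSearchCV

param_grid = {
    'feat_sel__threshold': ['mean', 'median'],       # SelectFromModel cut-off
    'feat_sel__estimator__n_neighbors': [3, 5],      # forwarded to mutual_info_classif
    'pca__n_components': [1, 2],
}
search = GridSearchCV(pipe, param_grid, cv=2)
# search.fit(X, y)  # with a realistically sized dataset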
Yeah, I do not think there is another way to do it. At least not that I know!
How about SelectKBest or SelectPercentile:
from sklearn.feature_selection import SelectKBest, mutual_info_classif

mi_best = SelectKBest(score_func=mutual_info_classif, k=10)
pca = PCA(random_state=100)
lr = LogisticRegression(random_state=200)

pipe = Pipeline(
    [
        ('select', mi_best),
        ('dim_red', pca),
        ('pred', lr),
    ]
)
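And a minimal sketch of the SelectPercentile variant mentioned above (keeping, say, the top 20% of features by mutual information score):
from sklearn.feature_selection import SelectPercentile

mi_top20 = SelectPercentile(score_func=mutual_info_classif, percentile=20)

pipe = Pipeline(
    [
        ('select', mi_top20),
        ('dim_red', PCA(random_state=100)),
        ('pred', LogisticRegression(random_state=200)),
    ]
)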
It is possible to do this, but the utility of doing this will vary depending on where in the ML workflow you are. I will describe how I got something similar to work.
High Level:
A selector in a ColumnTransformer is just a callable that returns a list of columns when it is passed the dataframe. We can use this to do what you're trying to do. We can define it as follows:
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.model_selection import ShuffleSplit
from sklearn.model_selection import cross_validate
from sklearn.dummy import DummyRegressor
from sklearn.feature_selection import mutual_info_classif
from sklearn.feature_selection import mutual_info_regression
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.compose import make_column_selector as selector
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline


def mi_selector(mi_threshold=0.05, target_label=None, random_state=None):
    def selector_to_return(df):
        mi_df = compute_mutual_information(
            df=df,
            target_label=target_label,
            random_state=random_state)
        matching_variables = mi_df[mi_df.loc[:, 'mutual_information'] > mi_threshold].index.tolist()
        matching_features = []
        # Remove target
        if target_label in matching_variables:
            matching_variables.remove(target_label)
        # Only return features that were in the original df,
        # since we compute more than that as we impute, encode etc.
        for feature_name in df.columns.tolist():
            if feature_name in matching_variables:
                matching_features.append(feature_name)
        return matching_features
    return selector_to_return
What this does is use the mutual information computed by compute_mutual_information to create a selector which can be plugged into a Pipeline.
Since we're computing mutual information, we need to know what the target is. We also need to know whether it's a classification or regression problem. This is the part that compute_mutual_information solves, using the following approach:
1. Figure out the metadata for the dataframe: physical datatypes, whether or not each feature is numerical, and some other stats (second-to-last code snippet)
2. Figure out whether it's a regression or a classification problem, to set up the right function for computing mutual information
3. Impute and encode/scale the data and reconstruct the processed dataframe
4. Compute mutual information on the processed dataframe
def compute_mutual_information(df, target_label, random_state):
    # Analyze data frame
    meta_df = df_metadata(df, numerical_threshold=50)
    target_is_numerical = meta_df.loc[meta_df.variable == target_label][
        'is_numerical'].iloc[0]

    # Determine problem type
    if target_is_numerical:
        problem_type = 'regression'
        mutual_information_function = mutual_info_regression
    else:
        problem_type = 'classification'
        mutual_information_function = mutual_info_classif

    # Select feature types
    my_numerical_selector = feature_type_selector(dtype_include='numerical')
    my_categorical_selector = feature_type_selector(dtype_include='categorical')
    numerical_features = my_numerical_selector(df)
    categorical_features = my_categorical_selector(df)

    # Remove target label from features
    for feature_list in [numerical_features, categorical_features]:
        if target_label in feature_list:
            feature_list.remove(target_label)

    # Transform df
    imputation_preprocessor = ColumnTransformer(
        [('numerical_imputer',
          SimpleImputer(strategy='median', add_indicator=True),
          numerical_features),
         ('categorical_imputer',
          SimpleImputer(strategy='most_frequent', add_indicator=True),
          categorical_features)],
        remainder='passthrough')

    # We need to figure out the indices of the features that are supposed to be
    # scaled and encoded by the next step
    post_imputation_np = imputation_preprocessor.fit_transform(df)
    feature_name_np_array = imputation_preprocessor.get_feature_names_out()
    categorical_feature_indices = np.zeros(len(categorical_features))
    numerical_feature_indices = np.zeros(len(numerical_features))

    for position, feature in enumerate(categorical_features):
        categorical_feature_indices[position] = np.where(
            feature_name_np_array == 'categorical_imputer__' + feature)[0]

    for position, feature in enumerate(numerical_features):
        numerical_feature_indices[position] = np.where(
            feature_name_np_array == 'numerical_imputer__' + feature)[0]

    categorical_feature_indices = categorical_feature_indices.astype(int).tolist()
    numerical_feature_indices = numerical_feature_indices.astype(int).tolist()

    numeric_and_categorical_transformer = ColumnTransformer(
        [('OneHotEncoder', OneHotEncoder(),
          categorical_feature_indices),
         ('StandardScaler', StandardScaler(),
          numerical_feature_indices)],
        remainder='passthrough')
    preprocessor = Pipeline(
        [('imputation_preprocessor', imputation_preprocessor),
         ('numeric_and_categorical_transformer',
          numeric_and_categorical_transformer)])
    df_transformed_np = preprocessor.fit_transform(df)
    preprocessed_feature_names = list(preprocessor.get_feature_names_out())
    df_transformed = pd.DataFrame(
        df_transformed_np.todense(),
        columns=preprocessed_feature_names)
    df_transformed = df_transformed.rename(shorten_param, axis=1)

    estimated_mutual_information = mutual_information_function(
        X=df_transformed, y=df[target_label], random_state=random_state)
    estimated_mutual_information_df = pd.DataFrame(
        estimated_mutual_information.T.reshape(1, -1),
        columns=preprocessed_feature_names)
    estimated_mutual_information_df = estimated_mutual_information_df.rename(
        shorten_param, axis=1)
    estimated_mutual_information_df = estimated_mutual_information_df.T
    estimated_mutual_information_df.columns = ['mutual_information']
    estimated_mutual_information_df = estimated_mutual_information_df.sort_values(
        by=['mutual_information'])

    return estimated_mutual_information_df
The above used a feature_type_selector which is defined as following:
def feature_type_selector(dtype_include=None):
    def nested_function(df):
        meta_df = df_metadata(df)
        if dtype_include == 'numerical':
            return meta_df.loc[meta_df.is_numerical, 'variable'].tolist()
        else:
            return meta_df.loc[meta_df.is_numerical == False, 'variable'].tolist()
    return nested_function
The metadata analysis of the dataframe does the following:
1. Determine variable types
2. Figure out, with some threshold, which features are really categoricals encoded as numericals
3. Compute the percentage of missing data, etc.
def df_metadata(df, numerical_threshold=50):
    list_of_variables = list(df.dtypes.index)
    list_of_dtypes = [df.dtypes[variable] for variable in list_of_variables]
    categorical_selector = selector(dtype_include=object)
    numerical_selector = selector(dtype_exclude=object)
    unique_value_counts = [df[variable].nunique()
                           for variable in list_of_variables]
    categorical_features = categorical_selector(df)
    numerical_features = numerical_selector(df)
    is_numerical_init = [True] * len(list_of_variables)
    metadata_frame = pd.DataFrame(
        {'variable': list_of_variables, 'dtype': list_of_dtypes,
         'is_numerical': is_numerical_init,
         'unique_value_counts': unique_value_counts})
    null_sum = df.isnull().sum()
    null_sum.name = 'null_sum'
    metadata_frame = pd.merge(
        metadata_frame,
        null_sum,
        left_on='variable',
        right_index=True)
    metadata_frame['samples_missing'] = metadata_frame['null_sum'] > 0
    total_samples = len(df)
    metadata_frame['percent_missing'] = metadata_frame['null_sum'] / total_samples
    for feature in categorical_features:
        metadata_frame.loc[metadata_frame.variable ==
                           feature, ['is_numerical']] = False
    for feature in numerical_features:
        if df[feature].nunique() < numerical_threshold:
            metadata_frame.loc[metadata_frame.variable ==
                               feature, ['is_numerical']] = False
    return metadata_frame
The shorten_param function is defined as follows:
def shorten_param(param_name):
    if "__" in param_name:
        if len(param_name.rsplit(" ", 1)) < 2:
            return param_name.rsplit("__", 1)[1]
        else:
            return str(shorten_param(param_name.rsplit(" ", 1)[0])) + " " + \
                shorten_param(' '.join(param_name.rsplit(" ", 1)[1:]))
    return param_name
With all this in place, you can do something like the following to run your model.
standard_scaler_transformer = StandardScaler()
identity_transformer = ColumnTransformer(
    [('unused_scaler', standard_scaler_transformer, [])],
    remainder='passthrough')

my_mi_selector = mi_selector(mi_threshold=0.2, target_label=target_label)
mi_filter = ColumnTransformer(
    [('identity_transformer', identity_transformer, my_mi_selector)],
    remainder='drop')
brute_imputer = SimpleImputer(strategy='most_frequent')

ames_target = ames_data[ames_target_name]
my_model = Pipeline([('mi_filter', mi_filter),
                     ('brute_imputer', brute_imputer),
                     ('Ridge', Ridge())])
my_model.fit(X=ames_data, y=ames_target)
To my original point about the utility of this, I think it's useful very early on in the process when you're trying to figure out which features are important, and which are not. Here are some challenges that I ran into:
- Since feature selection happens at runtime, you are limited in what feature engineering you can do. For example, for imputation I had to use most_frequent because that works for either feature type.
- It would be really cool if there were a way to track variable locations as they enter numpy land. For example, in compute_mutual_information, after the transformation the ndarray needs to be put back into a DataFrame before the mutual information is computed, in order to keep traceability of the mutual_information values for each feature.
- You cannot treat mi_threshold as a hyperparameter, because the selector is a callable, and model.get_params() essentially contains a memory address.
I look forward to the day when mutual_information is a hyperparameter that you can tune, and you set policies for the types of feature engineering you want to apply.

Unable to make prediction after loading sklearn model

I have created an ML model with scikit-learn and saved it. Now when I load the model, I have trouble with transformation and prediction.
I have 4 features in my DataFrame. The first two features are textual, and the other 2 are numerical. The result column is 1 or 0.
In order to train my model, I used ColumnTransformer and CountVectorizer to transform and vectorize the textual features. I specified the NAMES of the columns that I want to transform/vectorize (columns text1 and text2). The numerical columns do not need to be vectorized, so remainder='passthrough' takes care of that.
Part of code that works:
features = df.iloc[:, :-1]
results = df.iloc[:, -1]

transformerVectoriser = ColumnTransformer(
    transformers=[
        ('vector word 1', CountVectorizer(analyzer='word', ngram_range=(1, 1), max_features=12000, stop_words='english'), 'text1'),
        ('vector phrase 3', CountVectorizer(analyzer='word', ngram_range=(3, 3), max_features=2500, stop_words='english'), 'text2')],
    remainder='passthrough')  # Default is to drop untransformed columns, passthrough == leave columns as they are

x_train, x_test, y_train, y_test = train_test_split(features, results, test_size=0.3, random_state=0)
x_train = transformerVectoriser.fit_transform(x_train)
x_test = transformerVectoriser.transform(x_test)

model = clf.fit(x_train, y_train)
y_pred = model.predict(x_test)

filename = 'ml_model.sav'
pickle.dump(model, open(filename, 'wb'))
filename = 'ml_transformer.sav'
pickle.dump(transformerVectoriser, open(filename, 'wb'))
But when I want to load a model, and make prediction I get an error:
# LOADING MODEL
model = pickle.load(open('ml_model.sav','rb'))
vectorizer = pickle.load(open('ml_transformer.sav','rb'))
# MAKING PREDICTION
data_for_prediction = vectorizer.transform([data_for_prediction]) #ERROR
print(model.predict_proba(data_for_prediction))
I get the error:
ValueError: Specifying the columns using strings is only supported for pandas DataFrames
When I was training my model I used a pandas DataFrame, but when I wanted to make a prediction I just put the values in a list. So data_for_prediction is a list that looks like this:
["text that should be vectorized with vectorizer that i created", "More texts that should be vectorized", 4, 7]
I think that is the cause of the error: I used column names with ColumnTransformer, but now when I want to make a prediction, the vectorizer does not know what to vectorize.
My final model and vectorizer should be used in an API, and the API should only take JSON, so I do not want to convert JSON to a DataFrame and pass it to the model.
Is there a way to fix this error without using a pandas DataFrame in my final Flask app?
The training data is a dataframe with the columns:
x_train.columns
The function vectorizer.transform() wants data in the same format, so assuming that
data_f_p = ["text that should be vectorized", 4, 7, 0]
corresponds to the same four columns as x_train, you can turn it into a dataframe with
data_f_p = pd.DataFrame([data_f_p], columns=x_train.columns)
data_f_p = vectorizer.transform(data_f_p)
If you don't want to use pandas.DataFrame in your REST API endpoint, just don't train your model with the DataFrame; convert your data to a numpy array first:
>>> df
TEXT_1 TEXT_2 NUM_1 NUM_2
0 This is the first text. The second text. 300.000 23.3
1 Here is the third text. And the fourth text. 2.334 29.0
>>> df.to_numpy()
array([['This is the first text.', 'The second text.', 300.0, 23.3],
['Here is the third text.', 'And the fourth text.', 2.334, 29.0]],
dtype=object)
Then, make changes in how you define the model. I'd suggest to combine preprocessing and predicting steps using sklearn.pipeline.Pipeline into a single model like this:
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier
model = Pipeline(steps=[
('transformer', ColumnTransformer(
transformers=[
('TEXT_1', CountVectorizer(analyzer='word', stop_words='english'), 0),
('TEXT_2', CountVectorizer(analyzer='word', stop_words='english'), 1),
],
remainder='passthrough',
)),
('predictor', RandomForestClassifier()),
])
Note, here we are using indices instead of names to reference texts when defining transformers for the ColumnTransformer instance. Once we've transformed the initial DataFrame to a numpy array, the TEXT_1 feature is located at 0, and the TEXT_2 at 1 in a data row. Here is how you can use the model:
from joblib import dump, load
X = df.to_numpy()
model.fit(X, y)
dump(model, 'model.joblib')
...
model = load('model.joblib')
results = model.predict(data)
As a result, you don't have to convert your incoming data to the DataFrame in order to make a prediction.
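As a rough, hedged sketch of how that could look inside a Flask endpoint (the route name and payload shape are assumptions, not part of the original answer): the incoming JSON rows are converted straight to a numpy object array, mirroring df.to_numpy() above.
from flask import Flask, request, jsonify
import numpy as np
from joblib import load

app = Flask(__name__)
model = load('model.joblib')

@app.route('/predict', methods=['POST'])
def predict():
    # Expects a JSON list of rows such as [["text one", "text two", 4, 7]],
    # in the same column order as the training data.
    rows = request.get_json()
    X = np.array(rows, dtype=object)
    return jsonify(model.predict(X).tolist())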

Sklearn custom transformers with pipeline: all the input array dimensions for the concatenation axis must match exactly

I am learning about sklearn custom transformers and read about the two core ways to create custom transformers:
1. by setting up a custom class that inherits from BaseEstimator and TransformerMixin, or
2. by creating a transformation method and passing it to FunctionTransformer.
I wanted to compare these two approaches by implementing a "meta-vectorizer" functionality: a vectorizer that supports either CountVectorizer or TfidfVectorizer and transforms the input data according to the specified vectorizer type.
However, I can't seem to get any of the two work when passing them to a sklearn.pipeline.Pipeline. I am getting the following error message in the fit_transform() step:
ValueError: all the input array dimensions for the concatenation axis must match
exactly, but along dimension 0, the array at index 0 has size 6 and the array
at index 1 has size 1
My code for option 1 (using a custom class):
class Vectorizer(BaseEstimator, TransformerMixin):
    def __init__(self, vectorizer: Callable = CountVectorizer(), ngram_range: tuple = (1, 1)) -> None:
        super().__init__()
        self.vectorizer = vectorizer
        self.ngram_range = ngram_range

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        X_vect_ = self.vectorizer.fit_transform(X.copy())
        return X_vect_.toarray()


pipe = Pipeline([
    ('column_transformer', ColumnTransformer([
        ('lesson_type_category', OneHotEncoder(), ['Type']),
        ('comment_text_vectorizer', Vectorizer(), ['Text'])],
        remainder='drop')),
    ('model', LogisticRegression())])

param_dict = {'column_transformer__comment_text_vectorizer__vectorizer':
              [CountVectorizer(), TfidfVectorizer()]}

randsearch = GridSearchCV(pipe, param_dict, cv=2, scoring='f1').fit(X_train, y_train)
And my code for option 2 (creating a custom transformer from a function using FunctionTransformer):
def vectorize_text(X, vectorizer: Callable):
    X_vect_ = vectorizer.fit_transform(X)
    return X_vect_.toarray()


vectorizer_transformer = FunctionTransformer(vectorize_text, kw_args={'vectorizer': TfidfVectorizer()})

pipe = Pipeline([
    ('column_transformer', ColumnTransformer([
        ('lesson_type_category', OneHotEncoder(), ['Type']),
        ('comment_text_vectorizer', vectorizer_transformer, ['Text'])],
        remainder='drop')),
    ('model', LogisticRegression())])

param_dict = {'column_transformer__comment_text_vectorizer__kw_args':
              [{'vectorizer': CountVectorizer()}, {'vectorizer': TfidfVectorizer()}]}

randsearch = GridSearchCV(pipe, param_dict, cv=2, scoring='f1').fit(X_train, y_train)
Imports and sample data:
import pandas as pd
from typing import Callable
import sklearn
from sklearn.preprocessing import OneHotEncoder, FunctionTransformer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import GridSearchCV
df = pd.DataFrame([
['A99', 'hi i love python very much', 'c', 1],
['B07', 'which programming language should i learn', 'b', 0],
['A12', 'what is the difference between python django flask', 'b', 1],
['A21', 'i want to be a programmer one day', 'c', 0],
['B11', 'should i learn java or python', 'b', 1],
['C01', 'how much can i earn as a programmer with python', 'a', 0]
], columns=['Src', 'Text', 'Type', 'Target'])
Notes:
As recommended in this question, I transformed all sparse matrices to dense arrays after the vectorization, as you can see in both cases: X_vect_.toarray().
The issue is that both CountVectorizer and TfidfVectorizer require their input to be 1D (and not 2D). In such cases the doc of ColumnTransformer states that parameter columns of the transformers tuple should be passed as a string rather than as a list.
columns: str, array-like of str, int, array-like of int, array-like of bool, slice or callable
Indexes the data on its second axis. Integers are interpreted as positional columns, while strings can reference DataFrame columns by name. A scalar string or int should be used where transformer expects X to be a 1d array-like (vector), otherwise a 2d array will be passed to the transformer. A callable is passed the input data X and can return any of the above. To select multiple columns by name or dtype, you can use make_column_selector.
Therefore, the following will work in your case (i.e. changing ['Text'] into 'Text').
class Vectorizer(BaseEstimator, TransformerMixin):
    def __init__(self, vectorizer: Callable = CountVectorizer(), ngram_range: tuple = (1, 1)) -> None:
        super().__init__()
        self.vectorizer = vectorizer
        self.ngram_range = ngram_range

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        X_vect_ = self.vectorizer.fit_transform(X.copy())
        return X_vect_.toarray()


pipe = Pipeline([
    ('column_transformer', ColumnTransformer([
        ('lesson_type_category', OneHotEncoder(handle_unknown='ignore'), ['Type']),
        ('comment_text_vectorizer', Vectorizer(), 'Text')],
        remainder='drop')),
    ('model', LogisticRegression())])

param_dict = {'column_transformer__comment_text_vectorizer__vectorizer':
              [CountVectorizer(), TfidfVectorizer()]}

randsearch = GridSearchCV(pipe, param_dict, cv=2, scoring='f1').fit(X_train, y_train)
You can adjust the example with FunctionTransformer accordingly, as sketched below. Observe, as a final remark, that I had to pass handle_unknown='ignore' to OneHotEncoder to prevent errors from categories that appear in the test folds of the cross-validation but not in the training folds.
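A minimal sketch of that adjustment, reusing vectorize_text from the question; the only change is again passing 'Text' as a scalar string rather than ['Text']:
vectorizer_transformer = FunctionTransformer(vectorize_text, kw_args={'vectorizer': TfidfVectorizer()})

pipe = Pipeline([
    ('column_transformer', ColumnTransformer([
        ('lesson_type_category', OneHotEncoder(handle_unknown='ignore'), ['Type']),
        ('comment_text_vectorizer', vectorizer_transformer, 'Text')],  # scalar string, not a list
        remainder='drop')),
    ('model', LogisticRegression())])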

implement custom one-hot-encoding function for sklearn pipeline

Related to the question posted in One Hot Encoding preserve the NAs for imputation, I am trying to create a custom function that handles NAs when one-hot encoding categorical variables. The setup should be suitable for train/test splitting and modelling using an sklearn pipeline.
A simple reproducible example of my problem:
#Packages
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.impute import KNNImputer
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import Ridge
from sklearn.impute import SimpleImputer
# Make some categorical data X and a response y and split it.
X = pd.DataFrame(columns=["1","2"],data = [["A",np.nan],["B","A"],[np.nan,"A"],[np.nan,"B"],["B","A"],["A","B"],["C","B"],["D","E"]])
y = pd.DataFrame(data = np.array([1,5,4,6,2,3,9,9]))
X_train, X_test, Y_train, Y_test = train_test_split(X,y,test_size=0.2,random_state=42)
I've then created a custom function that does OHE with nan (using the procedure described in Cyclical Loop Between OneHotEncoder and KNNImpute in Scikit-learn)
class OHE_with_nan(BaseEstimator, TransformerMixin):
    """OHE with NaN. Not super pretty but works."""
    def __init__(self, copy=True):
        self.copy = copy

    def fit(self, X, y=None):
        """This transformer does not use a fit procedure"""
        return self

    def transform(self, X, y=None):
        """Return the new object here"""
        # Replace nans with "missing" such that OneHotEncoder can work.
        enc_missing = SimpleImputer(strategy="constant", fill_value="missing")
        data1 = pd.DataFrame(columns=X.columns, data=enc_missing.fit_transform(X))
        # Perform standard OHE
        OHE = OneHotEncoder(sparse=False, handle_unknown="ignore")
        OHE_fit = OHE.fit_transform(data1)
        # Save feature names of the OHE dataframe
        data_OHE = pd.DataFrame(columns=OHE.get_feature_names(data1.columns), data=OHE_fit)
        # Initialize
        Column_names = data1.columns
        Final_OHE = pd.DataFrame()
        # Loop over columns to replace 0s with nan in the correct places.
        for i in range(len(data1.columns)):
            tmp_data = data_OHE[data_OHE.columns[pd.Series(data_OHE.columns).str.startswith(Column_names[i])]]
            missing_name = tmp_data.iloc[:, -1:].columns
            missing_index = np.where(tmp_data[missing_name] == 1)[0]
            tmp_data.loc[missing_index, :] = np.nan
            tmp_data1 = tmp_data.drop(missing_name, axis=1)
            Final_OHE = pd.concat([Final_OHE, tmp_data1], axis=1)
        return Final_OHE
This is then combined into a pipeline that predicts y using ridge regression (random choice of model, just for the example..)
Estimator = Pipeline([
    ('Ohe_with_NA', OHE_with_nan()),
    ("Imputer", KNNImputer(n_neighbors=1)),
    ('Model', Ridge(alpha=0.01))
])
The procedure can be fitted:
pipe_fit = Estimator.fit(X_train,Y_train)
But testing on unseen data fails:
pipe_fit.score(X_test, Y_test)
ValueError: X has 2 features, but KNNImputer is expecting 7 features as input.
This is because handle_unknown="ignore" in the OneHotEncoder within OHE_with_nan is no longer "active", as it has been wrapped inside my custom function.
If one simply uses OneHotEncoder(handle_unknown="ignore") directly in the pipeline, everything works fine (but that's not my intention, as this "removes" the NaNs from the data I am trying to impute).
My question
How do I enable handle_unknown = "ignore" in my custom function such that it can perform in a pipeline setup on unseen data as well?
Hope you understand my situation - any help is highly appreciated!
I think the main problem is that you need to save more information (especially, the internal OneHotEncoder) at fit time. I also made the missing-column identification a little more robust (I think maybe you were relying on the ordering putting that last, but that only held for your sample data because of alphabetical order?). I didn't spend much time cleaning things up or looking for efficiencies.
class OHE_with_nan(BaseEstimator, TransformerMixin):
    """One-hot encode, propagating NaNs.

    Requires a dataframe as input!
    """
    def fit(self, X, y=None):
        self.orig_cols_ = X.columns
        self.imputer_ = SimpleImputer(strategy="constant", fill_value="MISSING")
        X_filled = self.imputer_.fit_transform(X)
        self.ohe_ = OneHotEncoder(sparse=False, handle_unknown="ignore")
        self.ohe_.fit(X_filled)
        self.ohe_colnames_ = self.ohe_.get_feature_names(X.columns)
        self.missing_value_columns = np.array(["MISSING" in col for col in self.ohe_colnames_])
        return self

    def transform(self, X, y=None):
        raw_ohe = pd.DataFrame(self.ohe_.transform(self.imputer_.transform(X)), columns=self.ohe_colnames_)
        out_list = []
        # Loop over columns to replace 0s with nan in the correct places.
        for orig_col in self.orig_cols_:
            tmp_data = raw_ohe[self.ohe_colnames_[pd.Series(self.ohe_colnames_).str.startswith(orig_col)]]
            missing_name = tmp_data.columns[["MISSING" in col for col in tmp_data.columns]]
            missing_indices = np.where(tmp_data[missing_name] == 1)[0]
            tmp_data.loc[missing_indices, :] = np.nan
            tmp_data1 = tmp_data.drop(missing_name, axis=1)
            out_list.append(tmp_data1)
        out = pd.concat(out_list, axis=1)
        return out
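As a quick check (a sketch reusing the Estimator pipeline and the train/test split from the question), the revised transformer should now also handle unseen data:
Estimator = Pipeline([
    ('Ohe_with_NA', OHE_with_nan()),
    ("Imputer", KNNImputer(n_neighbors=1)),
    ('Model', Ridge(alpha=0.01))
])

pipe_fit = Estimator.fit(X_train, Y_train)
print(pipe_fit.score(X_test, Y_test))  # no longer raises the feature-count mismatch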

How to add a feature using a pipeline and FeatureUnion

In the code below I use a tweeter dataset to perform sentiment analysis. I use a pipeline which performs the following processes:
1) performs some basic text preprocessing
2) vectorizes the tweet text
3) adds an extra feature ( text length)
4) classification
I would like to add one more feature: the scaled number of followers. I wrote a function that takes the whole dataframe (df) as input and returns a new dataframe with the scaled number of followers. However, I am finding it challenging to add this process to the pipeline, e.g. to combine this feature with the other features using the sklearn pipeline.
Any help or advise on this problem will be much appreciated.
The question and code below are inspired by Ryan's post: pipelines
import re
import nltk
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, FunctionTransformer


def import_data(filename, sep, eng, header=None, skiprows=1):
    # read csv
    dataset = pd.read_csv(filename, sep=sep, engine=eng, header=header, skiprows=skiprows)
    # rename columns
    dataset.columns = ['text', 'followers', 'sentiment']
    return dataset


df = import_data('apple_v3.txt', '\t', 'python')
X, y = df.text, df.sentiment
X_train, X_test, y_train, y_test = train_test_split(X, y)

tokenizer = nltk.casual.TweetTokenizer(preserve_case=False, reduce_len=True)
count_vect = CountVectorizer(tokenizer=tokenizer.tokenize)
classifier = LogisticRegression()


def get_scalled_followers(df):
    scaler = MinMaxScaler()
    df[['followers']] = df[['followers']].astype(float)
    df[['followers']] = scaler.fit_transform(df[['followers']])
    followers = df['followers'].values
    followers_reshaped = followers.reshape((len(followers), 1))
    return df


def get_tweet_length(text):
    return len(text)


def genericize_mentions(text):
    return re.sub(r'@[\w_-]+', 'thisisanatmention', text)


def reshape_a_feature_column(series):
    return np.reshape(np.asarray(series), (len(series), 1))


def pipelinize_feature(function, active=True):
    def list_comprehend_a_function(list_or_series, active=True):
        if active:
            processed = [function(i) for i in list_or_series]
            processed = reshape_a_feature_column(processed)
            return processed
        else:
            return reshape_a_feature_column(np.zeros(len(list_or_series)))
    # wrap the helper so it can be used as a pipeline step
    return FunctionTransformer(list_comprehend_a_function, validate=False, kw_args={'active': active})


from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn_helpers import pipelinize, genericize_mentions, train_test_and_evaluate


sentiment_pipeline = Pipeline([
    ('genericize_mentions', pipelinize(genericize_mentions, active=True)),
    ('features', FeatureUnion([
        ('vectorizer', count_vect),
        ('post_length', pipelinize_feature(get_tweet_length, active=True))
    ])),
    ('classifier', classifier)
])

sentiment_pipeline, confusion_matrix = train_test_and_evaluate(sentiment_pipeline, X_train, y_train, X_test, y_test)
The best explanation I have found so far is at the following post: pipelines
My data includes heterogeneous features, and the following step-by-step approach works well and is easy to understand:
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

# step 1 - select data from the dataframe and split the dataset into train and test sets
features = [c for c in df.columns.values if c not in ['sentiment']]
numeric_features = [c for c in df.columns.values if c not in ['text', 'sentiment']]
target = 'sentiment'

X_train, X_test, y_train, y_test = train_test_split(df[features], df[target], test_size=0.33, random_state=42)

# step 2 - create a number selector class and a text selector class.
# These classes allow selecting specific columns from the dataframe
class NumberSelector(BaseEstimator, TransformerMixin):
    def __init__(self, key):
        self.key = key

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[[self.key]]


class TextSelector(BaseEstimator, TransformerMixin):
    def __init__(self, key):
        self.key = key

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[self.key]


# step 3 - create one pipeline for the text data and one for the numerical data
text = Pipeline([
    ('selector', TextSelector(key='text')),
    ('tfidf', TfidfVectorizer(stop_words='english'))
])
text.fit_transform(X_train)

followers = Pipeline([
    ('selector', NumberSelector(key='followers')),
    ('standard', MinMaxScaler())
])
followers.fit_transform(X_train)

# step 4 - feature union
feats = FeatureUnion([('text', text),
                      ('length', followers)])
feature_processing = Pipeline([('feats', feats)])
feature_processing.fit_transform(X_train)

# step 5 - add the classifier and predict
pipeline = Pipeline([
    ('features', feats),
    ('classifier', SVC(kernel='linear', probability=True, C=1, class_weight='balanced'))
])
pipeline.fit(X_train, y_train)

preds = pipeline.predict(X_test)
np.mean(preds == y_test)

# step 6 - use the model to predict new data not included in the test set
# in my example the pipeline expects a dataframe as input, which should have
# a column called 'text' and a column called 'followers'
array = [["#apple is amazing", 25000]]
dfObj = pd.DataFrame(array, columns=['text', 'followers'])

# prints the expected class e.g. positive or negative sentiment
print(pipeline.predict(dfObj))
# print the probability for each class
print(pipeline.predict_proba(dfObj))
You can use FeatureUnion to combine the features extracted from the different columns of your dataframe. You should feed the dataframe to the pipeline and use FunctionTransformer to extract specific columns. It might look like this (I haven't run it, some errors possible)
sentiment_pipeline = Pipeline([
    ('all_features', FeatureUnion([
        # your added feature (maybe you'll need to reshape it so ndim == 2)
        ('scaled_followers', FunctionTransformer(lambda df: get_scalled_followers(df).values,
                                                 validate=False)),
        # previous features
        ('text_features', Pipeline([
            ('extractor', FunctionTransformer(lambda df: df.text.values, validate=False)),
            ('genericize_mentions', pipelinize(genericize_mentions, active=True)),
            ('features', FeatureUnion([
                ('vectorizer', count_vect),
                ('post_length', pipelinize_feature(get_tweet_length, active=True))
            ])),
        ]))
    ])),
    ('classifier', classifier)
])

sentiment_pipeline, confusion_matrix = train_test_and_evaluate(sentiment_pipeline, df_train, y_train, df_test, y_test)
Another solution could be to not use Pipeline at all and just stack the features together with np.hstack, as sketched below.
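A minimal sketch of that alternative, assuming the count_vect, classifier, get_scalled_followers and df defined in the question:
import numpy as np

# Vectorize the tweet text and scale the follower counts separately,
# then stack them column-wise into one feature matrix.
text_features = count_vect.fit_transform(df['text']).toarray()
followers_scaled = get_scalled_followers(df)['followers'].values.reshape(-1, 1)

X_all = np.hstack([text_features, followers_scaled])
classifier.fit(X_all, df['sentiment'])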
