How to add a feature to a vectorized data set? - python

I want to write a Naive Bayes text classifier.
Because sklearn does not accept features in text form, I transform them with TfidfVectorizer.
I was able to build such a classifier using only the transformed data as features. The code looks like this:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectPercentile, f_classif
from sklearn.naive_bayes import GaussianNB
### text vectorization--go from strings to lists of numbers
vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5,
                             stop_words='english')
X_train_transformed = vectorizer.fit_transform(X_train_raw['url'])
X_test_transformed = vectorizer.transform(X_test_raw['url'])
### feature selection, because text is super high dimensional and
### can be really computationally chewy as a result
selector = SelectPercentile(f_classif, percentile=1)
selector.fit(X_train_transformed, y_train_raw)
X_train = selector.transform(X_train_transformed).toarray()
X_test = selector.transform(X_test_transformed).toarray()
clf = GaussianNB()
clf.fit(X_train, y_train_raw)
Everything works as intended, but I run into problems when I want to add another feature, e.g. a flag indicating whether the given text contains a certain keyword.
I tried several ways to transform the 'url' feature and then combine it with another boolean feature, but none of them worked.
Any tips on how this should be done, assuming I have a pandas DataFrame containing two features: 'url' (which I want to transform) and a 'contains_keyword' flag?
The solution that failed looks like this:
vectorizer = CountVectorizer(min_df=1)
X_train_transformed = vectorizer.fit_transform(X_train_raw['url'])
X_test_transformed = vectorizer.transform(X_test_raw['url'])
selector = SelectPercentile(f_classif, percentile=1)
selector.fit(X_train_transformed, y_train_raw)
X_train_selected = selector.transform(X_train_transformed)
X_test_selected = selector.transform(X_test_transformed)
X_train_raw['transformed_url'] = X_train_selected.toarray().tolist()
X_train_without = X_train_raw.drop(['url'], axis=1)
X_train = X_train_without.values
This produces rows containing a boolean flag and a list, which is not valid input for an sklearn model. I have no idea how I should transform this properly. Grateful for any help.
Here is the test data (columns: url, target, ads_keyword):
googleadapis l google com,1,True
googleadapis l google com,1,True
clients1 google com,1,False
c go-mpulse net,1,False
translate google pl,1,False
url - split domain taken from a DNS query
target - target class for classification
ads_keyword - flag indicating whether the 'url' contains the word 'ads'.
I want to transform the 'url' using the TfidfVectorizer and use the transformed data together with 'ads_keyword' (and possibly more features in the future) as features used to train the Naive Bayes model.

Here is a demo showing how to union features and how to tune hyperparameters using GridSearchCV.
Unfortunately your sample data set is too tiny to train a real model...
try:
    from pathlib import Path
except ImportError:  # Python 2
    from pathlib2 import Path
import os
import re
from pprint import pprint
import pandas as pd
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import FunctionTransformer, LabelEncoder, LabelBinarizer, StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectPercentile
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import SGDClassifier
from sklearn.naive_bayes import MultinomialNB, GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.pipeline import Pipeline, FeatureUnion
import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23
from scipy.sparse import csr_matrix, hstack
class ColumnSelector(BaseEstimator, TransformerMixin):
    def __init__(self, name=None, position=None,
                 as_cat_codes=False, sparse=False):
        self.name = name
        self.position = position
        self.as_cat_codes = as_cat_codes
        self.sparse = sparse
    def fit(self, X, y=None):
        return self
    def transform(self, X, **kwargs):
        # locate the column either by name or by position
        if self.name is not None:
            col_pos = X.columns.get_loc(self.name)
        elif self.position is not None:
            col_pos = self.position
        else:
            raise Exception('either [name] or [position] parameter must be not-None')
        if self.as_cat_codes and X.dtypes.iloc[col_pos] == 'category':
            ret = X.iloc[:, col_pos].cat.codes
        else:
            ret = X.iloc[:, col_pos]
        if self.sparse:
            ret = csr_matrix(ret.values.reshape(-1,1))
        return ret
union = FeatureUnion([
    ('text',
     Pipeline([
         ('select', ColumnSelector('url')),
         #('pct', SelectPercentile(percentile=1)),
         ('vect', TfidfVectorizer(sublinear_tf=True, max_df=0.5,
                                  stop_words='english')),
     ])),
    ('ads',  # step name was lost in the snippet; 'ads' is a placeholder
     Pipeline([
         ('select', ColumnSelector('ads_keyword', sparse=True)),
         #('scale', StandardScaler(with_mean=False)),
     ]))
])
pipe = Pipeline([
    ('union', union),
    ('clf', MultinomialNB())
])
param_grid = [
    {
        'union__text__vect': [TfidfVectorizer(sublinear_tf=True,
                                              max_df=0.5,
                                              stop_words='english')],
        'clf': [SGDClassifier(max_iter=500)],
        'union__text__vect__ngram_range': [(1,1), (2,5)],
        'union__text__vect__analyzer': ['word','char_wb'],
        'clf__alpha': np.logspace(-5, 0, 6),
        #'clf__max_iter': [500],
    },
    {
        'union__text__vect': [TfidfVectorizer(sublinear_tf=True,
                                              max_df=0.5,
                                              stop_words='english')],
        'clf': [MultinomialNB()],
        'union__text__vect__ngram_range': [(1,1), (2,5)],
        'union__text__vect__analyzer': ['word','char_wb'],
        'clf__alpha': np.logspace(-4, 2, 7),
    },
    #{   # NOTE: does NOT support sparse matrices!
    #    'union__text__vect': [TfidfVectorizer(sublinear_tf=True,
    #                                          max_df=0.5,
    #                                          stop_words='english')],
    #    'clf': [GaussianNB()],
    #    'union__text__vect__ngram_range': [(1,1), (2,5)],
    #    'union__text__vect__analyzer': ['word','char_wb'],
    #},
]
gs_kwargs = dict(scoring='roc_auc', cv=3, n_jobs=1, verbose=2)
X_train, X_test, y_train, y_test = \
    train_test_split(df[['url','ads_keyword']], df['target'], test_size=0.33)
grid = GridSearchCV(pipe, param_grid=param_grid, **gs_kwargs)
grid.fit(X_train, y_train)
# prediction
predicted = grid.predict(X_test)
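For comparison, here is a minimal sketch of the same idea without a pipeline: vectorize the text column, turn the boolean flag into a one-column sparse matrix, and stack the two horizontally. The rows and labels in the toy frame are made up for illustration, and the variable names are my own, not part of the demo above.

```python
import pandas as pd
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy frame; the rows and labels are illustrative, not the asker's data.
df = pd.DataFrame({
    "url": ["googleadapis l google com", "ads example com",
            "clients1 google com", "translate google pl"],
    "ads_keyword": [True, True, False, False],
    "target": [1, 1, 0, 0],
})

vectorizer = TfidfVectorizer()
X_text = vectorizer.fit_transform(df["url"])  # sparse, shape (n_samples, n_terms)

# The flag becomes a sparse column of 0.0 / 1.0 values, shape (n_samples, 1).
X_flag = csr_matrix(df[["ads_keyword"]].astype(float).values)

# Stack the blocks side by side; every row keeps its text features plus the flag.
X = hstack([X_text, X_flag]).tocsr()

clf = MultinomialNB().fit(X, df["target"])
pred = clf.predict(X)
```

At prediction time the same steps apply, with `vectorizer.transform` instead of `fit_transform`. MultinomialNB is used here because it accepts sparse input; GaussianNB would need a dense array.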


How to get a list of wrong predictions on the validation set

I'm trying to build a text-classification model on a database of site reviews (3 classes).
I cleaned the DataFrame, tokenized it (with CountVectorizer), applied TF-IDF (TfidfTransformer) and built an MNB model.
Now that I have trained and evaluated the model, I want to get a list of the wrong predictions so I can pass them through LIME and explore the words that confuse the model.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.naive_bayes import MultinomialNB
# from sklearn.metrics import (...)  # the imported names are truncated in the post
df = pd.read_csv(...)  # the file path is truncated in the post
cleaned_df = df.drop(
    labels=["review_id", "user_id", "business_id", "review_date"], axis=1
)
x = cleaned_df["review_text"]
y = cleaned_df["business_category"]
# tokenization
vectorizer = CountVectorizer()
vectorizer_fit = vectorizer.fit(x)
bow_x = vectorizer_fit.transform(x)
#### transform BOW to TF-IDF
transformer = TfidfTransformer()
transformer_x = transformer.fit(bow_x)
tfidf_x = transformer_x.transform(bow_x)
x_train, x_test, y_train, y_test = train_test_split(
    tfidf_x, y, test_size=0.3, random_state=101
)
mnb = MultinomialNB(alpha=0.14)
mnb.fit(x_train, y_train)
predmnb = mnb.predict(x_test)
My objective is to get the original indices of the reviews that the model predicted wrongly.
I managed to get the result like this:
predictions = c.predict(preprocessed_df['review_text'])
df2= preprocessed_df.join(pd.DataFrame(predictions))
df2.columns = ['review_text', 'business_category', 'word_count', 'prediction']
I'm sure there is a more elegant way...
There seems to be another problem in your code. Generally, the TF-IDF vectorizer is fit on the training data only, and the test data is then brought into the same format with the transform operation. This is done primarily to avoid data leakage. Please refer to "TfidfVectorizer: should it be used on train only or train+test". I have modified your code to suit your needs.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.naive_bayes import MultinomialNB
# from sklearn.metrics import (...)  # the imported names are truncated in the post
df = pd.read_csv(...)  # the file path is truncated in the post
cleaned_df = df.drop(
    labels=["review_id", "user_id", "business_id", "review_date"], axis=1
)
x = cleaned_df["review_text"]
y = cleaned_df["business_category"]
# split first, so the vectorizer never sees the test documents
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.3, random_state=101
)
# TfidfVectorizer (not TfidfTransformer) is needed here because the input is raw text
transformer = TfidfVectorizer()
x_train_tf = transformer.fit_transform(x_train)
x_test_tf = transformer.transform(x_test)
mnb = MultinomialNB(alpha=0.14)
mnb.fit(x_train_tf, y_train)
predmnb = mnb.predict(x_test_tf)
# keep the test documents whose prediction differs from the true label
incorrect_docs = x_test[predmnb != y_test]
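A minimal sketch of how the original row indices can be recovered, assuming the split is done on pandas objects so that `x_test` and `y_test` keep the DataFrame index. The reviews, labels and the `errors` frame are invented for illustration; only the column names come from the question.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Illustrative reviews; only the column names match the question.
df = pd.DataFrame({
    "review_text": [
        "great pizza and pasta", "the burger was cold", "tasty noodles",
        "room was clean and quiet", "hotel staff were rude", "comfortable bed",
        "haircut was perfect", "the salon was dirty", "lovely manicure",
    ],
    "business_category": [
        "restaurant", "restaurant", "restaurant",
        "hotel", "hotel", "hotel",
        "beauty", "beauty", "beauty",
    ],
})

x_train, x_test, y_train, y_test = train_test_split(
    df["review_text"], df["business_category"], test_size=0.3, random_state=101
)

# The pipeline fits the vectorizer on the training documents only.
model = make_pipeline(TfidfVectorizer(), MultinomialNB(alpha=0.14))
model.fit(x_train, y_train)

# Give the predictions the same index as the test rows so they line up.
pred = pd.Series(model.predict(x_test), index=x_test.index, name="prediction")
wrong = pred != y_test

errors = pd.DataFrame({
    "review_text": x_test[wrong],
    "actual": y_test[wrong],
    "prediction": pred[wrong],
})
wrong_indices = errors.index.tolist()  # index labels from the original frame
```

The `errors` frame holds the raw text of each misclassified review, which is the form a text explainer such as LIME expects, and `wrong_indices` can be passed to `df.loc[...]` to pull any other columns.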

How to add a feature using a pipeline and FeatureUnion

In the code below I use a Twitter dataset to perform sentiment analysis. I use a pipeline that performs the following steps:
1) performs some basic text preprocessing
2) vectorizes the tweet text
3) adds an extra feature (text length)
4) classification
I would like to add one more feature: the scaled number of followers. I wrote a function that takes the whole dataframe (df) as input and returns a new dataframe with the scaled number of followers. However, I am finding it challenging to add this step to the pipeline, i.e. to add this feature to the other features using the sklearn pipeline.
Any help or advice on this problem will be much appreciated.
The question and the code below are inspired by Ryan's post: pipelines
import nltk
import pandas as pd
import re
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
def import_data(filename,sep,eng,header = None,skiprows=1):
    #read csv
    dataset = pd.read_csv(filename,sep=sep,engine=eng,header = header,skiprows=skiprows)
    #rename columns
    dataset.columns = ['text','followers','sentiment']
    return dataset
df = import_data('apple_v3.txt','\t','python')
X, y = df.text, df.sentiment
X_train, X_test, y_train, y_test = train_test_split(X, y)
tokenizer = nltk.casual.TweetTokenizer(preserve_case=False, reduce_len=True)
count_vect = CountVectorizer(tokenizer=tokenizer.tokenize)
classifier = LogisticRegression()
def get_scalled_followers(df):
    scaler = MinMaxScaler()
    df[['followers']] = df[['followers']].astype(float)
    df[['followers']] = scaler.fit_transform(df[['followers']])
    followers = df['followers'].values
    followers_reshaped = followers.reshape((len(followers),1))
    return df
def get_tweet_length(text):
    return len(text)
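import numpy as np
from sklearn.preprocessing import FunctionTransformer
def genericize_mentions(text):
    return re.sub(r'#[\w_-]+', 'thisisanatmention', text)
def reshape_a_feature_column(series):
    return np.reshape(np.asarray(series), (len(series), 1))
def pipelinize_feature(function, active=True):
    def list_comprehend_a_function(list_or_series, active=True):
        if active:
            processed = [function(i) for i in list_or_series]
            processed = reshape_a_feature_column(processed)
            return processed
        else:
            # a deactivated feature becomes a column of zeros
            return reshape_a_feature_column(np.zeros(len(list_or_series)))
    # this return is missing from the snippet as posted; a FunctionTransformer
    # wrapper is assumed so the helper can be used as a pipeline step
    return FunctionTransformer(list_comprehend_a_function, validate=False,
                               kw_args={'active': active})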
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn_helpers import pipelinize, genericize_mentions, train_test_and_evaluate
sentiment_pipeline = Pipeline([
    ('genericize_mentions', pipelinize(genericize_mentions, active=True)),
    ('features', FeatureUnion([
        ('vectorizer', count_vect),
        ('post_length', pipelinize_feature(get_tweet_length, active=True))
    ])),
    ('classifier', classifier)
])
sentiment_pipeline, confusion_matrix = train_test_and_evaluate(sentiment_pipeline, X_train, y_train, X_test, y_test)
The best explanation I have found so far is at the following post: pipelines
My data includes heterogeneous features, and the following step-by-step approach works well and is easy to understand:
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.svm import SVC
#step1 - select data from dataframe and split the dataset in train and test sets
features= [c for c in df.columns.values if c not in ['sentiment']]
numeric_features= [c for c in df.columns.values if c not in ['text','sentiment']]
target = 'sentiment'
X_train, X_test, y_train, y_test = train_test_split(df[features], df[target], test_size=0.33, random_state=42)
#step2 - create a number selector class and text selector class. These classes allow to select specific columns from the dataframe
class NumberSelector(BaseEstimator, TransformerMixin):
    def __init__(self, key):
        self.key = key
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X[[self.key]]
class TextSelector(BaseEstimator, TransformerMixin):
    def __init__(self, key):
        self.key = key
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X[self.key]
#step 3 create one pipeline for the text data and one for the numerical data
text = Pipeline([
    ('selector', TextSelector(key='text')),  # the text column is named 'text' in this dataset
    ('tfidf', TfidfVectorizer(stop_words='english'))
])
followers = Pipeline([
    ('selector', NumberSelector(key='followers')),
    ('standard', MinMaxScaler())
])
#step 4 - features union
feats = FeatureUnion([('text', text),
                      ('length', followers)])
feature_processing = Pipeline([('feats', feats)])
# step 5 - add the classifier and predict
pipeline = Pipeline([
    ('features', feats),
    ('classifier', SVC(kernel = 'linear', probability=True, C=1, class_weight = 'balanced'))
])
pipeline.fit(X_train, y_train)
preds = pipeline.predict(X_test)
np.mean(preds == y_test)
# step 6 use the model to predict new data not included in the test set
# in my example the pipeline expects a dataframe as an input which should have a column called 'text' and a column called 'followers'
array = [["#apple is amazing",25000]]
dfObj = pd.DataFrame(array,columns = ['text' , 'followers'])
#prints the expected class e.g. positive or negative sentiment
print(pipeline.predict(dfObj))
#print the probability for each class
print(pipeline.predict_proba(dfObj))
You can use FeatureUnion to combine the features extracted from the different columns of your dataframe. You should feed the dataframe to the pipeline and use FunctionTransformer to extract specific columns. It might look like this (I haven't run it, some errors possible)
sentiment_pipeline = Pipeline([
    ('all_features', FeatureUnion([  # 'all_features' is a placeholder step name
        # your added feature (maybe you'll need to reshape it so ndim == 2)
        ('scaled_followers', FunctionTransformer(lambda df: get_scalled_followers(df).values,
                                                 validate=False)),
        # previous features
        ('text_features', Pipeline([
            ('extractor', FunctionTransformer(lambda df: df.text.values, validate=False)),
            ('genericize_mentions', pipelinize(genericize_mentions, active=True)),
            ('features', FeatureUnion([
                ('vectorizer', count_vect),
                ('post_length', pipelinize_feature(get_tweet_length, active=True))
            ])),
        ])),
    ])),
    ('classifier', classifier)
])
sentiment_pipeline, confusion_matrix = train_test_and_evaluate(sentiment_pipeline, df_train, y_train, df_test, y_test)
Another solution could be to skip Pipeline and just stack the features together with np.hstack (or scipy.sparse.hstack when some of the feature blocks are sparse).
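The FeatureUnion approach can be sketched end to end as follows. This is a self-contained illustration under my own assumptions, not the code from either answer: the tweets and follower counts are invented, the helper functions `select_text`, `select_followers` and `tweet_length` are hypothetical names, and plain functions are used instead of lambdas so the pipeline stays picklable.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer, MinMaxScaler

# Illustrative data, not the asker's dataset.
df = pd.DataFrame({
    "text": ["#apple is amazing", "i hate my #apple phone",
             "love the new #apple", "#apple support is awful"],
    "followers": [25000, 120, 4300, 15],
    "sentiment": ["positive", "negative", "positive", "negative"],
})

def select_text(frame):
    # 1-D Series of strings, which is what a text vectorizer expects
    return frame["text"]

def select_followers(frame):
    # 2-D float array of shape (n_samples, 1), which is what a scaler expects
    return frame[["followers"]].astype(float).to_numpy()

def tweet_length(frame):
    # character count of each tweet as a single numeric column
    return frame["text"].str.len().to_numpy().reshape(-1, 1)

features = FeatureUnion([
    ("bow", Pipeline([
        ("select", FunctionTransformer(select_text, validate=False)),
        ("vect", CountVectorizer()),
    ])),
    ("followers", Pipeline([
        ("select", FunctionTransformer(select_followers, validate=False)),
        ("scale", MinMaxScaler()),
    ])),
    ("length", FunctionTransformer(tweet_length, validate=False)),
])

model = Pipeline([("features", features),
                  ("classifier", LogisticRegression())])
model.fit(df[["text", "followers"]], df["sentiment"])

new = pd.DataFrame([["#apple is amazing", 25000]], columns=["text", "followers"])
prediction = model.predict(new)
```

Because the scaler lives inside the pipeline, it is fit on the training rows only and then reused unchanged for new data, which avoids refitting MinMaxScaler on every call the way `get_scalled_followers` does.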

Machine Learning Algorithm does not work after Vectorizing a feature that is of type text

I am trying to classify and my features are a combination of words, number and text. I am trying to vectorize the feature that is of type text but when I run it through a classifying algorithm it throws the following error.
line 51, in <module>
    classifier.fit(X_train, y_train.values.ravel())
ValueError: setting an array element with a sequence.
Below is my code.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from io import StringIO
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score
from sklearn.metrics import confusion_matrix
df = pd.read_csv('data.csv')
df = df[pd.notnull(df['memo'])]
df = df[pd.notnull(df['name'])]
# factorize type, name, and categorized account
df['type_id'] = df.txn_type.factorize()[0]
df['name_id'] = df['name'].factorize()[0]
df['categorizedAccountId'] = df.categorizedAccount.factorize()[0]
my_list = df['categorizedAccountId'].tolist()
tfidf = TfidfVectorizer(sublinear_tf=True, min_df=5, norm='l2', encoding='latin-1', ngram_range=(1, 2), stop_words='english')
memoFeatures = tfidf.fit_transform(df.memo)
df['memo_id'] = pd.Series(memoFeatures, index=df.index)
X = df.loc[:, ['type_id', 'name_id', 'memo_id']]
y = df.loc[:, ['categorizedAccountId']]
X_train, X_test, y_train, y_test = train_test_split(X, y)
classifier = LogisticRegression(random_state=0)
classifier.fit(X_train, y_train.values.ravel())
y_pred = classifier.predict(X_test)
confusion_matrix = confusion_matrix(y_test, y_pred)
print('Accuracy of logistic regression classifier on test set: {:.2f}'.format(classifier.score(X_test, y_test)))
And also here are a few rows of my Data. The top row has the labels and the categorized account is the class
"Journal",""," 11/29/16 Payments",0,207.24,"1072 Money Out Clearing"
"Bill Payment","College Tuition Fund","Multiple inv. (details on stub)",164,-207.24,"1072 Money Out Clearing"
Ok so I have implemented some modifications to your code, which I paste here. This snippet goes immediately after you read the csv, and drop the null rows. You have to implement the train_test_split yourself though.
df['categorizedAccount'] = df['categorizedAccount'].astype('category')
df['all_text'] = df['txn_type'] + ' ' + df['name'] + ' ' + df['memo']
X = df['all_text']
y = df['categorizedAccount']
X_train = X # Change these four lines for train_test_split
X_test = X # I don't have enough rows in the mock dataset to implement it,
y_train = y # And it returns an error
y_test = y
tfidf = TfidfVectorizer()
X_train_transformed = tfidf.fit_transform(X_train)
classifier = LogisticRegression(random_state=0)
classifier.fit(X_train_transformed, y_train)
X_test_transformed = tfidf.transform(X_test)
y_pred = classifier.predict(X_test_transformed)
classifier.score(X_test_transformed, y_test)  # score against the true labels, not y_pred
A few comments though:
from sklearn.feature_extraction.text import TfidfVectorizer
Imported once, ok
from io import StringIO
Unnecessary as far as I can see
from sklearn.feature_extraction.text import TfidfVectorizer
Why do you import it again?
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
TfidfVectorizer does the job of both CountVectorizer and TfidfTransformer. From sklearn: "Equivalent to CountVectorizer followed by TfidfTransformer." See here for more
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score
Not used, do not import.
1) It is not clear what you are trying to do with factorize. TfidfVectorizer automatically tokenizes any string of text you give it. All the columns you selected in your original code contain only strings, so it makes more sense to concatenate them and let tfidf do the tokenization rather than trying to do it yourself.
2) Use the Pipeline constructor, it will save your life.
3) X = df.loc[:, ['type_id', 'name_id', 'memo_id']] This kind of slicing is needlessly verbose; just call df[['column_name_1','column_name_2','column_name_3']]
4) And remember PEP 20: "Simple is better than complex"!
As a last piece of advice, when developing an ML model it is always better to start with something plain and simple, and then develop it further once you have something that works.
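Point 2 can be illustrated with a short sketch in which the vectorizer and the classifier are wrapped in a single Pipeline. The first two rows are taken from the sample data in the question; the other two rows and the second account label are made up so that the classifier has two classes to learn.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Rows 1-2 come from the question's sample; rows 3-4 are invented.
df = pd.DataFrame({
    "txn_type": ["Journal", "Bill Payment", "Journal", "Bill Payment"],
    "name": ["", "College Tuition Fund", "", "City Water Utility"],
    "memo": [" 11/29/16 Payments", "Multiple inv. (details on stub)",
             " 12/02/16 Payments", "Water bill December"],
    "categorizedAccount": ["1072 Money Out Clearing", "1072 Money Out Clearing",
                           "1072 Money Out Clearing", "6100 Utilities"],
})

# Concatenate the string columns into one text field, as in the answer.
df["all_text"] = df["txn_type"] + " " + df["name"] + " " + df["memo"]

model = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(random_state=0)),
])

# One fit call runs fit_transform on the vectorizer and then fits the classifier.
model.fit(df["all_text"], df["categorizedAccount"])
y_pred = model.predict(df["all_text"])
```

With a real dataset the frame would first go through train_test_split, and `model.fit` would see the training part only; the pipeline then applies the already fitted vectorizer whenever `predict` or `score` is called.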

How to add another text feature to current bag of words classification? In Scikit-learn

This is my input matrix: (image not included)
My sample code:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
X_train, X_test, y_train, y_test = train_test_split(data['Extract'],
data['Expense Account code Description'], random_state = 0)
from sklearn.pipeline import Pipeline , FeatureUnion
text_clf = Pipeline([('vect', CountVectorizer(ngram_range=(1,1))),
                     ('tfidf', TfidfTransformer(use_idf = False)),
                     ('clf', RandomForestClassifier(n_estimators =100,
                                                    max_features='log2',criterion = 'entropy')),
                     ])
text_clf = text_clf.fit(X_train, y_train)
Here I am applying a bag-of-words model to the 'Extract' column to classify 'Expense Account code Description', and I am getting an accuracy of around 92%. If I want to include 'Vendor name' as another input feature, how can I do that? Is there a way to do it alongside the bag of words?
You can use FeatureUnion.
You will also need to create a new transformer class that performs the steps you need, i.e. select the vendor name and encode it (get dummies).
The FeatureUnion then fits into your pipeline.
For reference:
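class get_Vendor(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self
    def transform(self, X, y=None):
        ...  # select 'Vendor name' and encode it, e.g. with get dummies
lr_tfidf = Pipeline([('features', FeatureUnion([('other', get_Vendor()),
                                                ('vect', tfidf)])),
                     ('clf', RandomForestClassifier())])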
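As an alternative to writing a custom transformer, scikit-learn's ColumnTransformer (available since version 0.20) can route each column to its own encoder. This is a different technique from the FeatureUnion outlined above, shown only as a sketch: the rows and vendor names are invented, and TfidfVectorizer stands in for the CountVectorizer plus TfidfTransformer pair from the question.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

target = "Expense Account code Description"

# Illustrative rows; only the column names come from the question.
data = pd.DataFrame({
    "Extract": ["office paper and pens", "monthly internet service",
                "printer toner cartridge", "fibre broadband invoice",
                "stapler and notebooks", "mobile data plan"],
    "Vendor name": ["Vendor A", "Vendor B", "Vendor A",
                    "Vendor B", "Vendor A", "Vendor C"],
    target: ["Office supplies", "Telecom", "Office supplies",
             "Telecom", "Office supplies", "Telecom"],
})

features = ColumnTransformer([
    # a plain string selects the column as 1-D text for the vectorizer
    ("extract", TfidfVectorizer(), "Extract"),
    # a list selects a 2-D frame, which the one-hot encoder requires
    ("vendor", OneHotEncoder(handle_unknown="ignore"), ["Vendor name"]),
])

text_clf = Pipeline([
    ("features", features),
    ("clf", RandomForestClassifier(n_estimators=100, random_state=0)),
])
text_clf.fit(data[["Extract", "Vendor name"]], data[target])

# An unseen vendor is encoded as all zeros because of handle_unknown="ignore".
new = pd.DataFrame({"Extract": ["broadband bill"], "Vendor name": ["Vendor D"]})
pred = text_clf.predict(new)
```

The difference between passing `"Extract"` as a string and `["Vendor name"]` as a list matters: text vectorizers need a 1-D sequence of documents, while most other encoders need a 2-D input.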

Pipeline with meta classifier

I am trying to train a meta classifier on different features from a pandas dataframe.
The features are either text or categorical in nature.
I am having issues with fitting the model, with the following error 'Found input variables with inconsistent numbers of samples: [1, 48678]'. I understand what the error means, but not how to fix it. Help much appreciated!
The code I am using is as follows:
import pandas as pd
from sklearn import preprocessing
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
# set target label
target_label = ['target']
features = ['cat_1', 'cat_2', 'cat_3', 'cat_4', 'cat_5', 'text_1']
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(cleansed_data[features],
cleansed_data[target_label], test_size=0.2, random_state=0)
text_features = ['text_1']
categorical_features = ['cat_1', 'cat_2', 'cat_3', 'cat_4', 'cat_5']
# encoder
le = preprocessing.LabelEncoder()
# vectoriser
vectoriser = TfidfVectorizer()
# classifiers
mlp_clf = MLPClassifier()
rf_clf = RandomForestClassifier()
from sklearn.base import TransformerMixin, BaseEstimator
class SelectColumnsTransfomer(BaseEstimator, TransformerMixin):
    def __init__(self, columns=[]):
        self.columns = columns
    def transform(self, X, **transform_params):
        trans = X[self.columns].copy()
        return trans
    def fit(self, X, y=None, **fit_params):
        return self
# text pipeline
text_steps = [('feature extractor', SelectColumnsTransfomer(text_features)),
              ('tf-idf', vectoriser),
              ('classifier', mlp_clf)]
# categorical pipeline
categorical_steps = [('feature extractor', SelectColumnsTransfomer(categorical_features)),
                     ('label encode', le),
                     ('classifier', rf_clf)]
pl_text = Pipeline(text_steps)
pl_categorical = Pipeline(categorical_steps)
pl_text.fit(X_train, y_train)
from mlxtend.classifier import StackingCVClassifier
sclf = StackingCVClassifier(classifiers=[pl_text, pl_categorical],
                            meta_classifier=LogisticRegression())
EDIT: Here is some code that recreates the issue. 'ValueError: Found input variables with inconsistent numbers of samples: [1, 3]'
d = {'cat_1': ['A', 'A', 'B'], 'cat_2': [1, 2, 3],
'cat_2': ['G', 'H', 'I'], 'cat_3': ['AA', 'DD', 'PP'],
'cat_4': ['X', 'B', 'V'],
'text_1': ['the cat sat on the mat', 'the mat sat on the cat', 'sat on the cat mat']}
features = pd.DataFrame(data=d)
t = [0, 1, 0]
target = pd.DataFrame(data=t)
text_features = ['text_1']
categorical_features = ['cat_1', 'cat_2', 'cat_3', 'cat_4', 'cat_5']
# text pipeline
text_steps = [('feature extractor', SelectColumnsTransfomer(text_features)),
              ('tf-idf', vectoriser),
              ('classifier', mlp_clf)]
# categorical pipeline
categorical_steps = [('feature extractor', SelectColumnsTransfomer(categorical_features)),
                     ('label encode', le),
                     ('classifier', rf_clf)]
pl_text = Pipeline(text_steps)
pl_categorical = Pipeline(categorical_steps)
pl_text.fit(features, target)
from mlxtend.classifier import StackingCVClassifier
sclf = StackingCVClassifier(classifiers=[pl_text, pl_categorical],
                            meta_classifier=LogisticRegression())
sclf.fit(features, target)
Ok, I managed to get it to work by replacing text_features = ['text_1']
with text_features = 'text_1'
Basically, when you pass ['text_1'] to the SelectColumnsTransfomer class, it returns a DataFrame, which the tfidf vectoriser treats as one single input. The vectoriser applies fit_transform in your pipeline and returns a single row. That single row cannot be matched against three target values.
If you pass in 'text_1', you get a Series and the vectoriser correctly identifies that you have three strings as samples. Your text pipeline will work now.
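The difference can be checked directly with a small sketch that uses the three sentences from the question. The variable names `rows_from_frame` and `rows_from_series` are my own.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

df = pd.DataFrame({
    "text_1": ["the cat sat on the mat",
               "the mat sat on the cat",
               "sat on the cat mat"],
})

# Selecting with a list returns a DataFrame. Iterating over a DataFrame yields
# its column names, so the vectoriser sees one "document": the string "text_1".
rows_from_frame = TfidfVectorizer().fit_transform(df[["text_1"]]).shape[0]

# Selecting with a plain string returns a Series holding the three documents.
rows_from_series = TfidfVectorizer().fit_transform(df["text_1"]).shape[0]

print(rows_from_frame, rows_from_series)
```

The first count is the 1 in "inconsistent numbers of samples: [1, 3]", and the second matches the three target values.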
