How to use one-hot encoding with the Naive Bayes algorithm? - python

I'm trying to use the Naive Bayes algorithm for one of my requirements, and I plan to use one-hot encoding for the input features. I have used the following code to run the algorithm, but I'm not sure how to apply the one-hot encoding.
Please find the code below:
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB
from sklearn.metrics import confusion_matrix

def load_data(filename):
    x = list()
    y = list()
    with open(filename) as file:
        file.readline()
        for line in file:
            line = line.strip().split(',')
            y.append(line[1])
            x.append(line[0].split())
    return x, y
X_train, y_train = load_data('/Users/Desktop/abc/train.csv')
X_test, y_test = load_data('/Users/Desktop/abc/test.csv')
onehot_enc = MultiLabelBinarizer()
onehot_enc.fit(X_train)
bnbc = BernoulliNB(binarize=None)
bnbc.fit(onehot_enc.transform(X_train), y_train)
score = bnbc.score(onehot_enc.transform(X_test), y_test)
print("score of Naive Bayes algo is :" , score)
Can anyone please tell me whether the above code is correct?

Try using CountVectorizer:
from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer(binary=True)  # binary=True yields 0/1 indicators, i.e. a one-hot style encoding
X_train_one_hot = vec.fit_transform(X_train)  # fit AND transform; fit alone returns the vectorizer, not the data
X_test_one_hot = vec.transform(X_test)
bnbc = BernoulliNB(binarize=None)  # binarize=None because the input is already binary
bnbc.fit(X_train_one_hot, y_train)
score = bnbc.score(X_test_one_hot, y_test)
print("score of Naive Bayes algo is:", score)
Note that CountVectorizer expects a list of raw strings, so load_data should append line[0] rather than line[0].split().
You can also try TfidfVectorizer if you want to use TF-IDF featurization of the text.
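For reference, here is a minimal, self-contained TF-IDF sketch (the toy reviews are made up; MultinomialNB is swapped in because it handles real-valued TF-IDF weights, unlike BernoulliNB):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# toy data standing in for the output of load_data (raw strings, not token lists)
X_train = ["colors and clarity is superb", "picture is not nearly as clear"]
y_train = ["positive", "negative"]
X_test = ["the picture is clear and beautiful", "picture is not clear"]
y_test = ["positive", "negative"]

vec = TfidfVectorizer()
X_train_tfidf = vec.fit_transform(X_train)  # learn vocabulary and idf weights on train only
X_test_tfidf = vec.transform(X_test)        # reuse the fitted vocabulary
clf = MultinomialNB()
clf.fit(X_train_tfidf, y_train)
print("TF-IDF Naive Bayes score:", clf.score(X_test_tfidf, y_test))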

Related

Keep a model made with TF-IDF for predicting new content using scikit-learn in Python

This is a sentiment analysis model made with tf-idf for feature extraction.
I want to know how I can save this model and reuse it.
I tried saving it the way shown below, but when I load it, do the same pre-processing on the test text, and call fit_transform on it, it gives an error that the model expected X features but got Y.
This is how I saved it:
filename = "model.joblib"
joblib.dump(model, filename)
And this is the code for my tf-idf model:
import pandas as pd
import re
import nltk
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import BernoulliNB
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
nltk.download('stopwords')
from nltk.corpus import stopwords
processed_text = ['List of pre-processed text']
y = ['List of labels']
tfidfconverter = TfidfVectorizer(max_features=10000, min_df=5, max_df=0.7, stop_words=stopwords.words('english'))
X = tfidfconverter.fit_transform(processed_text).toarray()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
text_classifier = BernoulliNB()
text_classifier.fit(X_train, y_train)
predictions = text_classifier.predict(X_test)
print(confusion_matrix(y_test, predictions))
print(classification_report(y_test, predictions))
print(accuracy_score(y_test, predictions))
Edit: to be exact about where to put every line.
So, after:
tfidfconverter = TfidfVectorizer(max_features=10000, min_df=5, max_df=0.7, stop_words=stopwords.words('english'))
then
tfidf_obj = tfidfconverter.fit(processed_text)  # this is what will be used again
joblib.dump(tfidf_obj, 'tf-idf.joblib')
Then do the rest of the steps. You will save the classifier after training as well, so after:
text_classifier.fit(X_train, y_train)
put
joblib.dump(text_classifier, "classifier.joblib")
Now, when you want to predict any text:
tf_idf_converter = joblib.load("tf-idf.joblib")
classifier = joblib.load("classifier.joblib")
Now, given a list of sentences to predict:
sent = []  # your list of sentences goes here
classifier.predict(tf_idf_converter.transform(sent))
This returns a list with one predicted sentiment for each sentence.
You can first fit tfidf to your training set using:
tfidfconverter = TfidfVectorizer(max_features=10000, min_df=5, max_df=0.7, stop_words=stopwords.words('english'))
tfidf_obj = tfidfconverter.fit(processed_text)
Then store the tfidf_obj, for instance using pickle or joblib, e.g.:
joblib.dump(tfidf_obj, filename)
Then load the saved tfidf_obj and apply only transform on your test set:
loaded_tfidf = joblib.load(filename)
test_new = loaded_tfidf.transform(X_test)
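Putting the two answers together, a minimal end-to-end sketch of the save/reload cycle (toy data and placeholder file names; the key point is that prediction time uses transform only, never fit_transform):

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import BernoulliNB

# --- training time ---
processed_text = ["great product loved it", "terrible waste of money"]  # toy stand-in
y = ["positive", "negative"]
tfidf_obj = TfidfVectorizer().fit(processed_text)  # learns vocabulary + idf weights
text_classifier = BernoulliNB()
text_classifier.fit(tfidf_obj.transform(processed_text).toarray(), y)
joblib.dump(tfidf_obj, "tf-idf.joblib")            # persist both objects
joblib.dump(text_classifier, "classifier.joblib")

# --- prediction time, possibly in another process ---
loaded_tfidf = joblib.load("tf-idf.joblib")
loaded_clf = joblib.load("classifier.joblib")
sent = ["loved this product"]
print(loaded_clf.predict(loaded_tfidf.transform(sent).toarray()))  # transform only: same feature count as training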

ValueError when training a model with RandomForestClassifier

from sklearn import ensemble
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import OneHotEncoder
import time
from sklearn import metrics
from sklearn import preprocessing
from sklearn.linear_model import LinearRegression
enc = preprocessing.OneHotEncoder()
onehotencoder = OneHotEncoder(categories='auto')
enc.fit(X)
onehotlabels = enc.transform(X).toarray()
onehotlabels.shape
clf=RandomForestClassifier(n_estimators=10)
clf.fit(X_train,y_train)
y_pred = clf.predict(X_test)
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))
predict = clf.predict(X_test)
print("Evaluation on Test Set",predict)
I am doing this to train my model with a random forest classifier, but I am getting the following error:
ValueError: could not convert string to float: 'gorilla'
I can't tell for sure by looking at your code, because the data structures of X, X_train, and X_test are not clear.
However, I suspect that the onehotlabels variable is never used.
If the one-hot encoding had been applied, the string 'gorilla' would not have been included in the training data.
So I suggest you check whether the following code has been executed:
X_train, X_test = train_test_split(onehotlabels)
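For illustration, a minimal sketch of the intended flow with made-up categorical data (the feature values, including 'gorilla', are hypothetical): it is the encoded 0/1 array, not the raw strings, that gets split and fed to the classifier.

import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn import metrics

# hypothetical categorical features and labels
X = np.array([["gorilla", "zoo"], ["cat", "home"], ["dog", "home"], ["gorilla", "zoo"]])
y = np.array([1, 0, 0, 1])

enc = OneHotEncoder()
onehotlabels = enc.fit_transform(X).toarray()  # each string category becomes a 0/1 column

# split the *encoded* array, so no raw strings reach the classifier
X_train, X_test, y_train, y_test = train_test_split(onehotlabels, y, random_state=0)

clf = RandomForestClassifier(n_estimators=10)
clf.fit(X_train, y_train)
print("Accuracy:", metrics.accuracy_score(y_test, clf.predict(X_test)))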

Machine learning algorithm does not work after vectorizing a feature that is of type text

I am trying to classify, and my features are a combination of words, numbers, and text. I am trying to vectorize the feature that is of type text, but when I run it through a classification algorithm it throws the following error:
line 51, in <module>
classifier.fit(X_train, y_train.values.ravel())
ValueError: setting an array element with a sequence.
Below is my code.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from io import StringIO
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score
from sklearn.metrics import confusion_matrix
df = pd.read_csv('data.csv')
df = df[pd.notnull(df['memo'])]
df = df[pd.notnull(df['name'])]
# factorize type, name, and categorized account
df['type_id'] = df.txn_type.factorize()[0]
df['name_id'] = df.name.factorize()[0]
df['categorizedAccountId'] = df.categorizedAccount.factorize()[0]
my_list = df['categorizedAccountId'].tolist()
print(my_list)
tfidf = TfidfVectorizer(sublinear_tf=True, min_df=5, norm='l2', encoding='latin-1', ngram_range=(1, 2), stop_words='english')
memoFeatures = tfidf.fit_transform(df.memo)
df['memo_id'] = pd.Series(memoFeatures, index=df.index)
X = df.loc[:, ['type_id', 'name_id', 'memo_id']]
y = df.loc[:, ['categorizedAccountId']]
X_train, X_test, y_train, y_test = train_test_split(X, y)
'''print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)
'''
classifier = LogisticRegression(random_state=0)
classifier.fit(X_train, y_train.values.ravel())
y_pred = classifier.predict(X_test)
confusion_matrix = confusion_matrix(y_test, y_pred)
print('Accuracy of logistic regression classifier on test set: {:.2f}'.format(classifier.score(X_test, y_test)))
Also, here are a few rows of my data. The top row has the column labels; categorizedAccount is the class:
"txn_type","name","memo","account","amount","categorizedAccount"
"Journal","","ABC.com 11/29/16 Payments",0,207.24,"1072 ABC.com Money Out Clearing"
"Bill Payment","College Tuition Fund","Multiple inv. (details on stub)",164,-207.24,"1072 ABC.com Money Out Clearing"
OK, so I have implemented some modifications to your code, which I paste here. This snippet goes immediately after you read the CSV and drop the null rows. You have to implement the train_test_split yourself, though.
df['categorizedAccount'] = df['categorizedAccount'].astype('category')
df['all_text'] = df['txn_type'] + ' ' + df['name'] + ' ' + df['memo']
X = df['all_text']
y = df['categorizedAccount']
X_train = X # Change these four lines for train_test_split
X_test = X # I don't have enough rows in the mock dataset to implement it,
y_train = y # And it returns an error
y_test = y
tfidf = TfidfVectorizer()
X_train_transformed = tfidf.fit_transform(X_train)
classifier = LogisticRegression(random_state=0)
classifier.fit(X_train_transformed, y_train)
X_test_transformed = tfidf.transform(X_test)
y_pred = classifier.predict(X_test_transformed)
classifier.score(X_test_transformed, y_test)  # score against the true labels, not the predictions
A few comments though:
from sklearn.feature_extraction.text import TfidfVectorizer
Imported once, ok
from io import StringIO
Unnecessary as far as I can see
from sklearn.feature_extraction.text import TfidfVectorizer
Why do you import it again?
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
TfidfVectorizer does the job of both CountVectorizer and TfidfTransformer. From the sklearn docs: "Equivalent to CountVectorizer followed by TfidfTransformer." A quick check of this equivalence is sketched below.
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score
Not used, do not import.
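As a quick sanity check of that equivalence (a toy example, not from the original post):

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer

docs = ["the cat sat on the mat", "the dog barked at the cat"]
direct = TfidfVectorizer().fit_transform(docs)
two_step = TfidfTransformer().fit_transform(CountVectorizer().fit_transform(docs))
print(np.allclose(direct.toarray(), two_step.toarray()))  # True: identical matrices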
Additionally:
1) It is not clear what you are trying to do with factorize. TfidfVectorizer automatically performs tokenization for any string of text that you provide it. All columns that you have selected in your original code contain only strings, so it makes more sense to concatenate them and let tfidf do the tokenization, rather than trying to do it yourself.
2) Use the Pipeline constructor; it will save your life (see the sketch after this list).
3) X = df.loc[:, ['type_id', 'name_id', 'memo_id']]: this kind of slicing is hard to read; just call df[['column_name_1', 'column_name_2', 'column_name_3']].
4) And remember PEP20, "Simple is better than complex"!
As a final piece of advice: when developing an ML model, it's always better to start with something plain and simple, and then develop further once you have something that works.
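For instance, a minimal Pipeline sketch in the spirit of the code above (the two-row dataset is made up, loosely following the question's CSV columns):

import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

df = pd.DataFrame({
    "txn_type": ["Journal", "Bill Payment"],
    "name": ["", "College Tuition Fund"],
    "memo": ["ABC.com 11/29/16 Payments", "Multiple inv. (details on stub)"],
    "categorizedAccount": ["1072 ABC.com Money Out Clearing", "2000 Tuition Expense"],
})
X = df["txn_type"] + " " + df["name"] + " " + df["memo"]  # concatenate the text columns
y = df["categorizedAccount"]

# one object holds vectorizer + classifier: fit/predict run both steps in order,
# so the vocabulary fitted on training data is automatically reused at predict time
pipe = Pipeline([("tfidf", TfidfVectorizer()), ("clf", LogisticRegression(random_state=0))])
pipe.fit(X, y)
print(pipe.predict(X))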

Facing AttributeError: 'list' object has no attribute 'lower'

I have posted my sample train data as well as test data along with my code. I'm trying to use the Naive Bayes algorithm to train the model.
But in the reviews I'm getting a list of lists, so I think my code is failing with the following error:
return lambda x: strip_accents(x.lower())
AttributeError: 'list' object has no attribute 'lower'
Can anyone please help me out with this? I'm new to Python.
train.txt:
review,label
Colors & clarity is superb,positive
Sadly the picture is not nearly as clear or bright as my 40 inch Samsung,negative
test.txt:
review,label
The picture is clear and beautiful,positive
Picture is not clear,negative
My code:
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB
from sklearn.metrics import confusion_matrix
from sklearn.feature_extraction.text import CountVectorizer
def load_data(filename):
    reviews = list()
    labels = list()
    with open(filename) as file:
        file.readline()
        for line in file:
            line = line.strip().split(',')
            labels.append(line[1])
            reviews.append(line[0].split())
    return reviews, labels
X_train, y_train = load_data('/Users/7000015504/Desktop/Sep_10/sample_train.csv')
X_test, y_test = load_data('/Users/7000015504/Desktop/Sep_10/sample_test.csv')
clf = CountVectorizer()
X_train_one_hot = clf.fit(X_train)
X_test_one_hot = clf.transform(X_test)
bnbc = BernoulliNB(binarize=None)
bnbc.fit(X_train_one_hot, y_train)
score = bnbc.score(X_test_one_hot, y_test)
print("score of Naive Bayes algo is :" , score)
I have applied a few modifications to your code. The one posted below works; I added comments on how to debug the one you posted above.
# These three will not be used, do not import them
# from sklearn.preprocessing import MultiLabelBinarizer
# from sklearn.model_selection import train_test_split
# from sklearn.metrics import confusion_matrix

# This performs the classification task that you want with your input data in the format provided
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer

def load_data(filename):
    """This function works, but you have to modify the second-to-last line from
    reviews.append(line[0].split()) to reviews.append(line[0]).
    CountVectorizer will perform the splits by itself as it sees fit, trust it :)"""
    reviews = list()
    labels = list()
    with open(filename) as file:
        file.readline()
        for line in file:
            line = line.strip().split(',')
            labels.append(line[1])
            reviews.append(line[0])
    return reviews, labels
X_train, y_train = load_data('train.txt')
X_test, y_test = load_data('test.txt')
vec = CountVectorizer()
# Notice: clf means classifier, not vectorizer.
# While it is syntactically correct, it's bad practice to give misleading names to your objects.
# Replace "clf" with "vec" or something similar.
# Important! you called only the fit method, but did not transform the data
# afterwards. The fit method does not return the transformed data by itself. You
# either have to call .fit() and then .transform() on your training data, or just fit_transform() once.
X_train_transformed = vec.fit_transform(X_train)
X_test_transformed = vec.transform(X_test)

clf = MultinomialNB()
clf.fit(X_train_transformed, y_train)
score = clf.score(X_test_transformed, y_test)
print("score of Naive Bayes algo is:", score)
The output of this code is:
score of Naive Bayes algo is : 0.5
You need to lowercase each and every element in the list individually. Note that reassigning the loop variable, as in for item in my_list: item = item.lower(), does not modify the list; use a comprehension instead:
my_list = [item.lower() for item in my_list]
Note: only applicable if the list contains strings (dtype = str).

How do cross_val_score and GridSearchCV work?

I am new to Python and I have been trying to figure out how GridSearchCV and cross_val_score work.
I set up a sort of validation experiment and found odd results, and I still do not understand what I am doing wrong.
To simplify, I am using GridSearchCV in the simplest possible way and trying to validate and understand what is happening.
Here it is:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, RobustScaler, QuantileTransformer
from sklearn.feature_selection import SelectKBest, f_regression, RFECV
from sklearn.decomposition import PCA
from sklearn.linear_model import RidgeCV,Ridge, LinearRegression
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.model_selection import GridSearchCV,KFold,TimeSeriesSplit,PredefinedSplit,cross_val_score
from sklearn.metrics import mean_squared_error, make_scorer, r2_score, mean_absolute_error
from math import sqrt
I create a cross-validation object (for GridSearchCV and cross_val_score) and a train/test dataset for the pipeline and the simple linear regression. I have checked that the two datasets are identical:
train_indices = np.full((15,), -1, dtype=int)
test_indices = np.full((6,), 0, dtype=int)
test_fold = np.append(train_indices, test_indices)
kf = PredefinedSplit(test_fold)
for train_index, test_index in kf.split(X):
    print('TRAIN:', train_index, 'TEST:', test_index)
    X_train_kf = X[train_index]
    X_test_kf = X[test_index]
train_data = list(range(0,15))
test_data = list(range(15,21))
X_train, y_train = X[train_data, :], y[train_data]
X_test, y_test = X[test_data, :], y[test_data]
Here is what I do: instantiate a simple linear model and use it with the manually split data:
lr = LinearRegression()
lm = lr.fit(X, y)
lmscore_train = lm.score(X_train, y_train)
-> r2 = 0.4686662249071524
lmscore_test = lm.score(X_test, y_test)
-> r2 = 0.6264021467338086
Now I try to do the exact same thing using a pipeline:
pipe_steps = ([('est', LinearRegression())])
pipe=Pipeline(pipe_steps)
p = pipe.fit(X, y)
pscore_train = p.score(X_train, y_train)
-> r2 = 0.4686662249071524
pscore_test = p.score(X_test, y_test)
-> r2 = 0.6264021467338086
LinearRegression and the pipeline match perfectly.
Now I try to do the same with cross_val_score, using the predefined split kf:
cv_scores = cross_val_score(lm, X, y, cv=kf)
-> r2 = -1.234474757883921470e+01 ?!?! (this is supposed to be the test score)
Now let's try GridSearchCV:
scoring = {'r_squared':'r2'}
grid_parameters = [{}]
gridsearch=GridSearchCV(p, grid_parameters, verbose=3,cv=kf,scoring=scoring,return_train_score='true',refit='r_squared')
gs=gridsearch.fit(X,y)
results=gs.cv_results_
From cv_results_ I once again get:
-> mean_test_r_squared -> r2 = -1.234474757883921292e+01
So cross_val_score and GridSearchCV do match one another in the end, but the score is totally off and different from what it should be.
Will you please help me solve this puzzle?
cross_val_score and GridSearchCV will first split the data, train the model on the train data only, and then score on the test data.
Here you are training on the full data and then scoring on the test data. Hence you don't match the results of cross_val_score.
Instead of this:
lm=lr.fit(X,y)
Try this:
lm=lr.fit(X_train, y_train)
Same for pipeline:
Instead of p=pipe.fit(X,y), do this:
p=pipe.fit(X_train, y_train)
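To illustrate, a minimal reproduction with synthetic data (the 15-train / 6-test split mirrors the question; the data itself is made up): fitting on the training rows only makes the manual score match cross_val_score exactly.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import PredefinedSplit, cross_val_score

rng = np.random.RandomState(0)
X = rng.rand(21, 3)
y = X @ np.array([1.0, -2.0, 0.5]) + rng.rand(21)

# -1 marks rows that are only ever used for training, 0 marks the single test fold
test_fold = np.append(np.full(15, -1), np.full(6, 0))
kf = PredefinedSplit(test_fold)

lr = LinearRegression().fit(X[:15], y[:15])  # fit on the 15 training rows only
manual_score = lr.score(X[15:], y[15:])      # r2 on the 6 held-out rows
cv_score = cross_val_score(LinearRegression(), X, y, cv=kf)[0]
print(manual_score, cv_score)                # the two values are identical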
You can look at my answers for more detail:
https://stackoverflow.com/a/42364900/3374996
https://stackoverflow.com/a/42230764/3374996
