How to predict after training data using naive bayes with python? - python

I have got a dataset which contains just two useful columns for training my model, first is news heading and the second is category of news.
So, I got the following training command running successfully using python:
import re
import numpy as np
import pandas as pd
# the Naive Bayes model
from sklearn.naive_bayes import MultinomialNB
# function to split the data for cross-validation
from sklearn.model_selection import train_test_split
# function for transforming documents into counts
from sklearn.feature_extraction.text import CountVectorizer
# function for encoding categories
from sklearn.preprocessing import LabelEncoder
# grab the data
news = pd.read_csv("/Users/helloworld/Downloads/NewsAggregatorDataset/newsCorpora.csv",encoding='latin-1')
news.head()
def normalize_text(s):
s = s.lower()
# remove punctuation that is not word-internal (e.g., hyphens, apostrophes)
s = re.sub('\s\W',' ',s)
s = re.sub('\W\s',' ',s)
# make sure we didn't introduce any double spaces
s = re.sub('\s+',' ',s)
return s
news['TEXT'] = [normalize_text(s) for s in news['TITLE']]
# pull the data into vectors
vectorizer = CountVectorizer()
x = vectorizer.fit_transform(news['TEXT'])
encoder = LabelEncoder()
y = encoder.fit_transform(news['CATEGORY'])
# split into train and test sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)
nb = MultinomialNB()
nb.fit(x_train, y_train)
So my question is, how can I give a new set of data (e.g. Just news heading) and tell the program to predict the news category using python sklearn command?
P.S. My training data is like:

You should train the model using the training data (as you did) and then you should predict using new data (the test data).
Do the following:
nb = MultinomialNB()
nb.fit(x_train, y_train)
y_predicted = nb.predict(x_test)
Now, if you want to evaluate the predictions based on the **accuracy you can do the following:**
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_predicted)
Similarly, you can calculate other metrics.
Finally, we can see all the available metrics here !
EDIT 1
When you type:
y_predicted = nb.predict(x_test)
y_predicted will contain numerical values that correspond to your categories.
To project back these values and get the labels you can do:
y_predicted_labels = encoder.inverse_transform(y_predicted)

You are very close. Just need two more lines of code. Use this link, explains Naives Bayes using Sci Kit,
https://www.digitalocean.com/community/tutorials/how-to-build-a-machine-learning-classifier-in-python-with-scikit-learn
The short answer to your question is below, import the accuracy function,
from sklearn.metrics import accuracy_score
test the model using the predict function,
preds = nb.predict(x_test)
and then test the accuracy
print(accuracy_score(y_test, preds))

Related

How to properly use Smote in Classification models

I am using smote to balanced the output (y) only for Model train but want to test the model with original data as it makes logic how we can test the model with smote created outputs. Please ask anything for clarification if I didn't explained it well. It's my starting on Stack overflow.
from imblearn.over_sampling import SMOTE
oversample = SMOTE()
X_sm, y_sm = oversample.fit_resample(X, y)
# Splitting Dataset into Train and Test (Smote)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_sm, y_sm,test_size=0.2,random_state=42)
Here i applied the Random Forest Classifier on my data
import math
from sklearn.metrics import accuracy_score, confusion_matrix
import seaborn as sn
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics
# RF = RandomForestClassifier(n_estimators=100)
# RF.fit(X_train, y_train.values.ravel())
# y_pred = RF.predict(X)
# print(metrics.classification_report(y,y_pred))
RF = RandomForestClassifier(n_estimators=10)
RF.fit(X_train, y_train.values.ravel())
If i applied this but X also contains the data which we used for train. how we can remove the data which we already used for training the data.
y_pred = RF.predict(X)
print(metrics.classification_report(y,y_pred))
I used SMOTE in the past, it is suboptimal. Lately, researchers have proven some flaws in the generated distribution of Synthetic Minority Oversample Technique (SMOTE). I know sometimes we don't have a choice regarding the unbalanced classes, but you can use sklearn.ensemble.RandomForestClassifier, where you can define a proper class_weight to handle the unbalanced class problem.
Check scikit-learn documentation:
Scikit-documentation
I agree with razimbres about using class_weight.
Another option for you would be to split the dataset into train and test first. Then, keep the test set aside. Use only the training set from here on:
X_sm, y_sm = oversample.fit_resample(X_train, y_train)
.
.
.

Make predictions with a trained model on Python

I'm very new to programming and machine learning but I've been trying to create a prediction model to tag product reviews. I found the following model:
import numpy as np
import pandas as pd
# the Naive Bayes model
from sklearn.naive_bayes import MultinomialNB
# function to split the data for cross-validation
from sklearn.model_selection import train_test_split
# function for transforming documents into counts
from sklearn.feature_extraction.text import CountVectorizer
# function for encoding categories
from sklearn.preprocessing import LabelEncoder
dataset = pd.read_csv('dataset.csv')
def normalize_text(s):
s = s.lower()
# remove punctuation that is not word-internal (e.g., hyphens, apostrophes)
s = re.sub('\s\W',' ',s)
s = re.sub('\W\s',' ',s)
# make sure we didn't introduce any double spaces
s = re.sub('\s+',' ',s)
return s
dataset['TEXT'] = [normalize_text(s) for s in dataset['texto']]
# pull the data into vectors
vectorizer = CountVectorizer()
x = vectorizer.fit_transform(dataset['TEXT'])
encoder = LabelEncoder()
y = encoder.fit_transform(dataset['codigo'])
# split into train and test sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)
nb = MultinomialNB()
nb.fit(x_train, y_train)
y_predicted = nb.predict(x_test)
So far so good. But then, I tried to use that trained model to predict another set of data like this:
#new data
test = pd.read_csv('testset.csv')
test['TEXT'] = [normalize_text(s) for s in test['respostas']]
# pull the data into vectors
vectorizer = CountVectorizer()
classes = vectorizer.fit_transform(test['TEXT'])
classificacao = nb.predict(classes)
However, I got a "ValueError: dimension mismatch"
I'm not sure how to do this second step, which is using the model to predict the category of a fresh data set.
Thanks in advance for your assistance.

How do I apply ML model after it has been trained?

I apologies for the naive question, I've trained a model (Naive Bayes) in python , it does well (95% accuracy). It takes an input string (i.e. 'Apple Inc.' or 'John Doe') and discerns whether it's a business name or customer name.
How do I actually implement this on another data set? If I bring in another pandas dataframe, how do I apply what the model has learned from the training data to the new dataframe?
The new dataframe has a completely new population and set of strings that it needs to predict whether its a business or customer name.
Ideally I would like to insert into the new dataframe a column that has the model's prediction.
Any code snippets are appreciated.
Sample code of current model:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df["CUST_NM_CLEAN"],
df["LABEL"],test_size=0.20,
random_state=1)
# Instantiate the CountVectorizer method
count_vector = CountVectorizer()
# Fit the training data and then return the matrix
training_data = count_vector.fit_transform(X_train)
# Transform testing data and return the matrix.
testing_data = count_vector.transform(X_test)
#in this case we try multinomial, there are two other methods
from sklearn.naive_bayes import cNB
naive_bayes = MultinomialNB()
naive_bayes.fit(training_data,y_train)
#MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
predictions = naive_bayes.predict(testing_data)
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
print('Accuracy score: {}'.format(accuracy_score(y_test, predictions)))
print('Precision score: {}'.format(precision_score(y_test, predictions, pos_label='Org')))
print('Recall score: {}'.format(recall_score(y_test, predictions, pos_label='Org')))
print('F1 score: {}'.format(f1_score(y_test, predictions, pos_label='Org')))
Figured it out.
# Convert a collection of text documents to a vector of term/token counts.
cnt_vect_for_new_data = count_vector.transform(df['new_data'])
#RUN Prediction
df['NEW_DATA_PREDICTION'] = naive_bayes.predict(cnt_vect_for_new_data)

How to get the predicted probabilities of a classification model?

I'm trying out different classification models using a binary dependent variable (occupied/unoccupied). The models I am interested in are Logistic regression, Decision tree and Gaussian Naïve Bayes.
My input data is a csv-file with a datetime index (e.g. 2019-01-07 14:00), three variable columns ("R", "P", "C", containing numerical values), and the dependent variable column ("value", containing the binary values).
Training the model is not the problem, that all works fine. All the models give me their prediction in binary values (this of course should be the ultimate outcome), but I would also like to see the predicted probabilities which made them decide on either of the binary values. Is there any way to get also these values?
I have tried all of the classification visualizers that function with the yellowbrick package (ClassBalance, ROCAUC, ClassificationReport, ClassPredictionError). But all of these don't give me a graph that shows the calculated probabilities by the model for the data set.
import pandas as pd
import numpy as np
data = pd.read_csv('testrooms_data.csv', parse_dates=['timestamp'])
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
##split dataset into test and trainig set
X = data.drop("value", axis=1) # X contains all the features
y = data["value"] # y contains only the label
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.5, random_state = 1)
###model training
###Logistic Regression###
clf_lr = LogisticRegression()
# fit the dataset into LogisticRegression Classifier
clf_lr.fit(X_train, y_train)
#predict on the unseen data
pred_lr = clf_lr.predict(X_test)
###Decision Tree###
from sklearn.tree import DecisionTreeClassifier
clf_dt = DecisionTreeClassifier()
pred_dt = clf_dt.fit(X_train, y_train).predict(X_test)
###Bayes###
from sklearn.naive_bayes import GaussianNB
bayes = GaussianNB()
pred_bayes = bayes.fit(X_train, y_train).predict(X_test)
###visualization for e.g. LogReg
from yellowbrick.classifier import ClassificationReport
from yellowbrick.classifier import ClassPredictionError
from yellowbrick.classifier import ROCAUC
#classificationreport
visualizer = ClassificationReport(clf_lr, support=True)
visualizer.fit(X_train, y_train) # Fit the visualizer and the model
visualizer.score(X_test, y_test) # Evaluate the model on the test data
g = visualizer.poof() # Draw/show/poof the data
#classprediction report
visualizer2 = ClassPredictionError(LogisticRegression())
visualizer2.fit(X_train, y_train) # Fit the training data to the visualizer
visualizer2.score(X_test, y_test) # Evaluate the model on the test data
g2 = visualizer2.poof() # Draw visualization
#(ROC)
visualizer3 = ROCAUC(LogisticRegression())
visualizer3.fit(X_train, y_train) # Fit the training data to the visualizer
visualizer3.score(X_test, y_test) # Evaluate the model on the test data
g3 = visualizer3.poof() # Draw/show/poof the data
it would be great to have e.g. an array similar to pred_lr that contains the probabilities calculated for each row of the csv file. Is that possible? If yes, how can I get it?
In most sklearn estimators (if not all) you have a method for obtaining the probability that precluded the classification, either in log probability or probability.
For example, if you have your Naive Bayes classifier and you want to obtain probabilities but not classification itself, you could do (I used same nomenclatures as in your code):
from sklearn.naive_bayes import GaussianNB
bayes = GaussianNB()
pred_bayes = bayes.fit(X_train, y_train).predict(X_test)
#for probabilities
bayes.predict_proba(X_test)
bayes.predict_log_proba(X_test)
Hope this helps.

Logistic regression sklearn - train and apply model

I'm new to machine learning and trying Sklearn for the first time. I have two dataframes, one with data to train a logistic regression model (with 10-fold cross-validation) and another one to predict classes ('0,1') using that model.
Here's my code so far using bits of tutorials I found on Sklearn docs and on the Web:
import pandas as pd
import numpy as np
import sklearn
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold
from sklearn.preprocessing import normalize
from sklearn.preprocessing import scale
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import cross_val_predict
from sklearn import metrics
# Import dataframe with training data
df = pd.read_csv('summary_44.csv')
cols = df.columns.drop('num_class') # Data to use (num_class is the column with the classes)
# Import dataframe with data to predict
df_pred = pd.read_csv('new_predictions.csv')
# Scores
df_data = df.ix[:,:-1].values
# Target
df_target = df.ix[:,-1].values
# Values to predict
df_test = df_pred.ix[:,:-1].values
# Scores' names
df_data_names = cols.values
# Scaling
X, X_pred, y = scale(df_data), scale(df_test), df_target
# Define number of folds
kf = KFold(n_splits=10)
kf.get_n_splits(X) # returns the number of splitting iterations in the cross-validator
# Logistic regression normalizing variables
LogReg = LogisticRegression()
# 10-fold cross-validation
scores = [LogReg.fit(X[train], y[train]).score(X[test], y[test]) for train, test in kf.split(X)]
print scores
# Predict new
novel = LogReg.predict(X_pred)
Is this the correct way to implement a Logistic Regression?
I know that the fit() method should be used after cross-validation in order to train the model and use it for predictions. However, since I called fit() inside a list comprehension I really don't know if my model was "fitted" and can be used to make predictions.
I general things are okay, but there are some problems.
Scaling
X, X_pred, y = scale(df_data), scale(df_test), df_target
You scale training and test data independently, which isn't correct. Both datasets must be scaled with the same scaler. "Scale" is a simple function, but it is better to use something else, for example StandardScaler.
scaler = StandardScaler()
scaler.fit(df_data)
X = scaler.transform(df_data)
X_pred = scaler.transform(df_test)
Cross-validation and predicting.
How your code works? You split data 10 times into train and hold-out set; 10 times fit model on train set and calculate score on hold-out set. This way you get cross-validation scores, but the model is fitted only on a part of data. So it would be better to fit model on the whole dataset and then make a prediction:
LogReg.fit(X, y)
novel = LogReg.predict(X_pred)
I want to notice that there are advanced technics like stacking and boosting, but if you learn using sklearn, then it is better to stick to the basics.

Categories