I apologies for the naive question, I've trained a model (Naive Bayes) in python , it does well (95% accuracy). It takes an input string (i.e. 'Apple Inc.' or 'John Doe') and discerns whether it's a business name or customer name.
How do I actually implement this on another data set? If I bring in another pandas dataframe, how do I apply what the model has learned from the training data to the new dataframe?
The new dataframe has a completely new population and set of strings that it needs to predict whether its a business or customer name.
Ideally I would like to insert into the new dataframe a column that has the model's prediction.
Any code snippets are appreciated.
Sample code of current model:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df["CUST_NM_CLEAN"],
df["LABEL"],test_size=0.20,
random_state=1)
# Instantiate the CountVectorizer method
count_vector = CountVectorizer()
# Fit the training data and then return the matrix
training_data = count_vector.fit_transform(X_train)
# Transform testing data and return the matrix.
testing_data = count_vector.transform(X_test)
#in this case we try multinomial, there are two other methods
from sklearn.naive_bayes import cNB
naive_bayes = MultinomialNB()
naive_bayes.fit(training_data,y_train)
#MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
predictions = naive_bayes.predict(testing_data)
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
print('Accuracy score: {}'.format(accuracy_score(y_test, predictions)))
print('Precision score: {}'.format(precision_score(y_test, predictions, pos_label='Org')))
print('Recall score: {}'.format(recall_score(y_test, predictions, pos_label='Org')))
print('F1 score: {}'.format(f1_score(y_test, predictions, pos_label='Org')))
Figured it out.
# Convert a collection of text documents to a vector of term/token counts.
cnt_vect_for_new_data = count_vector.transform(df['new_data'])
#RUN Prediction
df['NEW_DATA_PREDICTION'] = naive_bayes.predict(cnt_vect_for_new_data)
Related
I'm trying out different classification models using a binary dependent variable (occupied/unoccupied). The models I am interested in are Logistic regression, Decision tree and Gaussian Naïve Bayes.
My input data is a csv-file with a datetime index (e.g. 2019-01-07 14:00), three variable columns ("R", "P", "C", containing numerical values), and the dependent variable column ("value", containing the binary values).
Training the model is not the problem, that all works fine. All the models give me their prediction in binary values (this of course should be the ultimate outcome), but I would also like to see the predicted probabilities which made them decide on either of the binary values. Is there any way to get also these values?
I have tried all of the classification visualizers that function with the yellowbrick package (ClassBalance, ROCAUC, ClassificationReport, ClassPredictionError). But all of these don't give me a graph that shows the calculated probabilities by the model for the data set.
import pandas as pd
import numpy as np
data = pd.read_csv('testrooms_data.csv', parse_dates=['timestamp'])
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
##split dataset into test and trainig set
X = data.drop("value", axis=1) # X contains all the features
y = data["value"] # y contains only the label
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.5, random_state = 1)
###model training
###Logistic Regression###
clf_lr = LogisticRegression()
# fit the dataset into LogisticRegression Classifier
clf_lr.fit(X_train, y_train)
#predict on the unseen data
pred_lr = clf_lr.predict(X_test)
###Decision Tree###
from sklearn.tree import DecisionTreeClassifier
clf_dt = DecisionTreeClassifier()
pred_dt = clf_dt.fit(X_train, y_train).predict(X_test)
###Bayes###
from sklearn.naive_bayes import GaussianNB
bayes = GaussianNB()
pred_bayes = bayes.fit(X_train, y_train).predict(X_test)
###visualization for e.g. LogReg
from yellowbrick.classifier import ClassificationReport
from yellowbrick.classifier import ClassPredictionError
from yellowbrick.classifier import ROCAUC
#classificationreport
visualizer = ClassificationReport(clf_lr, support=True)
visualizer.fit(X_train, y_train) # Fit the visualizer and the model
visualizer.score(X_test, y_test) # Evaluate the model on the test data
g = visualizer.poof() # Draw/show/poof the data
#classprediction report
visualizer2 = ClassPredictionError(LogisticRegression())
visualizer2.fit(X_train, y_train) # Fit the training data to the visualizer
visualizer2.score(X_test, y_test) # Evaluate the model on the test data
g2 = visualizer2.poof() # Draw visualization
#(ROC)
visualizer3 = ROCAUC(LogisticRegression())
visualizer3.fit(X_train, y_train) # Fit the training data to the visualizer
visualizer3.score(X_test, y_test) # Evaluate the model on the test data
g3 = visualizer3.poof() # Draw/show/poof the data
it would be great to have e.g. an array similar to pred_lr that contains the probabilities calculated for each row of the csv file. Is that possible? If yes, how can I get it?
In most sklearn estimators (if not all) you have a method for obtaining the probability that precluded the classification, either in log probability or probability.
For example, if you have your Naive Bayes classifier and you want to obtain probabilities but not classification itself, you could do (I used same nomenclatures as in your code):
from sklearn.naive_bayes import GaussianNB
bayes = GaussianNB()
pred_bayes = bayes.fit(X_train, y_train).predict(X_test)
#for probabilities
bayes.predict_proba(X_test)
bayes.predict_log_proba(X_test)
Hope this helps.
I have trained the model using labeled data for Naive Bayes algorithm. And tested the same model with the other set of labeled data. And I have calculated accuracy, precision and recall scores using the below code.
My code :
from sklearn.metrics import confusion_matrix
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from io import open
def load_data(filename):
reviews = list()
labels = list()
with open(filename, encoding='utf-8') as file:
file.readline()
for line in file:
line = line.strip().split(' ',1)
labels.append(line[0])
reviews.append(line[1])
return reviews, labels
X_train, y_train = load_data('./train_data.txt')
X_test, y_test = load_data('./test_data.txt')
vec = CountVectorizer()
X_train_transformed = vec.fit_transform(X_train)
X_test_transformed = vec.transform(X_test)
clf= MultinomialNB()
clf.fit(X_train_transformed, y_train)
score = clf.score(X_test_transformed, y_test)
print("score of Naive Bayes algo is :" , score)
y_pred = clf.predict(X_test_transformed)
print(confusion_matrix(y_test,y_pred))
print("Precision Score : ",precision_score(y_test, y_pred,average='micro'))
print("Recall Score : ",recall_score(y_test, y_pred,average='micro'))
But, now I have another test set which contains unlabeled data. Now, can I test the model with this unlabeled data using the above code ?
This is what I could interpret from your question.
You have trained Naive Bayes model use train data & tested it using test data and you have used confusion matrix & accuracy as a metric to measure the performance of the model.
Now your question may be
Using this model, is it possible to predict label of the unseen data which don't have any labels ?
if that is your question, then, YES it is possible. Moreover that is the reason why you have trained the model i.e, to predict the labels on unseen data.
Since the unseen data don't have labels, how do you know predicted labels are correct ? For this reason only you have tested the model with test data & measured the performance of the model. If the accuracy of the model is 70%, then 70% of the times your model is predicting correctly.
I strongly suggest you to think why are you doing what are you doing before start doing it!!
If you want to automatically calculate the accuracy of the model with unseen data then answer is NO.
To create confusion matix and find out these matrices you need to pass Y label variable. Good practice is to split your training data into training and test data.
I'm new to machine learning and trying Sklearn for the first time. I have two dataframes, one with data to train a logistic regression model (with 10-fold cross-validation) and another one to predict classes ('0,1') using that model.
Here's my code so far using bits of tutorials I found on Sklearn docs and on the Web:
import pandas as pd
import numpy as np
import sklearn
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold
from sklearn.preprocessing import normalize
from sklearn.preprocessing import scale
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import cross_val_predict
from sklearn import metrics
# Import dataframe with training data
df = pd.read_csv('summary_44.csv')
cols = df.columns.drop('num_class') # Data to use (num_class is the column with the classes)
# Import dataframe with data to predict
df_pred = pd.read_csv('new_predictions.csv')
# Scores
df_data = df.ix[:,:-1].values
# Target
df_target = df.ix[:,-1].values
# Values to predict
df_test = df_pred.ix[:,:-1].values
# Scores' names
df_data_names = cols.values
# Scaling
X, X_pred, y = scale(df_data), scale(df_test), df_target
# Define number of folds
kf = KFold(n_splits=10)
kf.get_n_splits(X) # returns the number of splitting iterations in the cross-validator
# Logistic regression normalizing variables
LogReg = LogisticRegression()
# 10-fold cross-validation
scores = [LogReg.fit(X[train], y[train]).score(X[test], y[test]) for train, test in kf.split(X)]
print scores
# Predict new
novel = LogReg.predict(X_pred)
Is this the correct way to implement a Logistic Regression?
I know that the fit() method should be used after cross-validation in order to train the model and use it for predictions. However, since I called fit() inside a list comprehension I really don't know if my model was "fitted" and can be used to make predictions.
I general things are okay, but there are some problems.
Scaling
X, X_pred, y = scale(df_data), scale(df_test), df_target
You scale training and test data independently, which isn't correct. Both datasets must be scaled with the same scaler. "Scale" is a simple function, but it is better to use something else, for example StandardScaler.
scaler = StandardScaler()
scaler.fit(df_data)
X = scaler.transform(df_data)
X_pred = scaler.transform(df_test)
Cross-validation and predicting.
How your code works? You split data 10 times into train and hold-out set; 10 times fit model on train set and calculate score on hold-out set. This way you get cross-validation scores, but the model is fitted only on a part of data. So it would be better to fit model on the whole dataset and then make a prediction:
LogReg.fit(X, y)
novel = LogReg.predict(X_pred)
I want to notice that there are advanced technics like stacking and boosting, but if you learn using sklearn, then it is better to stick to the basics.
I have got a dataset which contains just two useful columns for training my model, first is news heading and the second is category of news.
So, I got the following training command running successfully using python:
import re
import numpy as np
import pandas as pd
# the Naive Bayes model
from sklearn.naive_bayes import MultinomialNB
# function to split the data for cross-validation
from sklearn.model_selection import train_test_split
# function for transforming documents into counts
from sklearn.feature_extraction.text import CountVectorizer
# function for encoding categories
from sklearn.preprocessing import LabelEncoder
# grab the data
news = pd.read_csv("/Users/helloworld/Downloads/NewsAggregatorDataset/newsCorpora.csv",encoding='latin-1')
news.head()
def normalize_text(s):
s = s.lower()
# remove punctuation that is not word-internal (e.g., hyphens, apostrophes)
s = re.sub('\s\W',' ',s)
s = re.sub('\W\s',' ',s)
# make sure we didn't introduce any double spaces
s = re.sub('\s+',' ',s)
return s
news['TEXT'] = [normalize_text(s) for s in news['TITLE']]
# pull the data into vectors
vectorizer = CountVectorizer()
x = vectorizer.fit_transform(news['TEXT'])
encoder = LabelEncoder()
y = encoder.fit_transform(news['CATEGORY'])
# split into train and test sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)
nb = MultinomialNB()
nb.fit(x_train, y_train)
So my question is, how can I give a new set of data (e.g. Just news heading) and tell the program to predict the news category using python sklearn command?
P.S. My training data is like:
You should train the model using the training data (as you did) and then you should predict using new data (the test data).
Do the following:
nb = MultinomialNB()
nb.fit(x_train, y_train)
y_predicted = nb.predict(x_test)
Now, if you want to evaluate the predictions based on the **accuracy you can do the following:**
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_predicted)
Similarly, you can calculate other metrics.
Finally, we can see all the available metrics here !
EDIT 1
When you type:
y_predicted = nb.predict(x_test)
y_predicted will contain numerical values that correspond to your categories.
To project back these values and get the labels you can do:
y_predicted_labels = encoder.inverse_transform(y_predicted)
You are very close. Just need two more lines of code. Use this link, explains Naives Bayes using Sci Kit,
https://www.digitalocean.com/community/tutorials/how-to-build-a-machine-learning-classifier-in-python-with-scikit-learn
The short answer to your question is below, import the accuracy function,
from sklearn.metrics import accuracy_score
test the model using the predict function,
preds = nb.predict(x_test)
and then test the accuracy
print(accuracy_score(y_test, preds))
I am running two different classification algorithms on my data logistic regression and naive bayes but it is giving me same accuracy even if I change the training and testing data ratio. Following is the code I am using
import pandas as pd
from sklearn.cross_validation import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
df = pd.read_csv('Speed Dating.csv', encoding = 'latin-1')
X = pd.DataFrame()
X['d_age'] = df ['d_age']
X['match'] = df ['match']
X['importance_same_religion'] = df ['importance_same_religion']
X['importance_same_race'] = df ['importance_same_race']
X['diff_partner_rating'] = df ['diff_partner_rating']
# Drop NAs
X = X.dropna(axis=0)
# Categorical variable Match [Yes, No]
y = X['match']
# Drop y from X
X = X.drop(['match'], axis=1)
# Transformation
scalar = StandardScaler()
X = scalar.fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Logistic Regression
model = LogisticRegression(penalty='l2', C=1)
model.fit(X_train, y_train)
print('Accuracy Score with Logistic Regression: ', accuracy_score(y_test, model.predict(X_test)))
#Naive Bayes
model_2 = GaussianNB()
model_2.fit(X_train, y_train)
print('Accuracy Score with Naive Bayes: ', accuracy_score(y_test, model_2.predict(X_test)))
print(model_2.predict(X_test))
Is it possible that every time the accuracy is same ?
This is common phenomena occurring if the class frequencies are unbalanced, e.g. nearly all samples belong to one class. For examples if 80% of your samples belong to class "No", then classifier will often tend to predict "No" because such a trivial prediction reaches the highest overall accuracy on your train set.
In general, when evaluating the performance of a binary classifier, you should not only look at the overall accuracy. You have to consider other metrics such as the ROC Curve, class accuracies, f1 scores and so on.
In your case you can use sklearns classification report to get a better feeling what your classifier is actually learning:
from sklearn.metrics import classification_report
print(classification_report(y_test, model_1.predict(X_test)))
print(classification_report(y_test, model_2.predict(X_test)))
It will print the precision, recall and accuracy for every class.
There are three options on how to reach a better classification accuracy on your class "Yes"
use sample weights, you can increase the importance of the samples of the "Yes" class thus forcing the classifier to predict "Yes" more often
downsample the "No" class in the original X to reach more balanced class frequencies
upsample the "Yes" class in the original X to reach more balanced class frequencies