I'm a newbie to Python as well as machine learning. As per my requirement, I'm trying to use the Naive Bayes algorithm on my dataset.
I'm able to find the accuracy, but when I try to compute precision and recall for the same data, it throws the following error:
ValueError: Target is multiclass but average='binary'. Please choose another average setting.
Can anyone please suggest how to proceed? I have tried using average='micro' in the precision and recall scores. It worked without any errors, but it gives the same score for accuracy, precision, and recall.
My dataset:
train_data.csv:
review,label
Colors & clarity is superb,positive
Sadly the picture is not nearly as clear or bright as my 40 inch Samsung,negative
test_data.csv:
review,label
The picture is clear and beautiful,positive
Picture is not clear,negative
My Code:
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB
from sklearn.metrics import confusion_matrix
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
def load_data(filename):
    reviews = list()
    labels = list()
    with open(filename) as file:
        file.readline()
        for line in file:
            line = line.strip().split(',')
            labels.append(line[1])
            reviews.append(line[0])
    return reviews, labels
X_train, y_train = load_data('/Users/abc/Sep_10/train_data.csv')
X_test, y_test = load_data('/Users/abc/Sep_10/test_data.csv')
vec = CountVectorizer()
X_train_transformed = vec.fit_transform(X_train)
X_test_transformed = vec.transform(X_test)
clf= MultinomialNB()
clf.fit(X_train_transformed, y_train)
score = clf.score(X_test_transformed, y_test)
print("score of Naive Bayes algo is :" , score)
y_pred = clf.predict(X_test_transformed)
print(confusion_matrix(y_test,y_pred))
print("Precision Score : ",precision_score(y_test,y_pred,pos_label='positive'))
print("Recall Score :" , recall_score(y_test, y_pred, pos_label='positive') )
You need to add the 'average' param. According to the documentation:
average : string, [None, ‘binary’ (default), ‘micro’, ‘macro’, ‘samples’, ‘weighted’]
This parameter is required for multiclass/multilabel targets. If None, the scores for each class are returned. Otherwise, this determines the type of averaging performed on the data:
Do this:
print("Precision Score : ",precision_score(y_test, y_pred,
pos_label='positive'
average='micro'))
print("Recall Score : ",recall_score(y_test, y_pred,
pos_label='positive'
average='micro'))
Replace 'micro' with any one of the above options except 'binary'. Also, in the multiclass setting, there is no need to provide 'pos_label', as it will be ignored anyway.
Update for comment:
Yes, they can be equal. It's given in the user guide here:
Note that for “micro”-averaging in a multiclass setting with all labels included will produce equal precision, recall and F, while “weighted” averaging may produce an F-score that is not between precision and recall.
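To see this concretely, here is a minimal sketch with made-up toy labels (not your data) showing that micro-averaged precision and recall coincide with accuracy when all classes are included:
from sklearn.metrics import accuracy_score, precision_score, recall_score

# toy multiclass labels, purely for illustration
y_true = ['positive', 'negative', 'neutral', 'positive', 'negative']
y_pred = ['positive', 'neutral',  'neutral', 'positive', 'negative']

print(accuracy_score(y_true, y_pred))                    # 0.8
print(precision_score(y_true, y_pred, average='micro'))  # 0.8
print(recall_score(y_true, y_pred, average='micro'))     # 0.8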
I need to set a specific threshold value and generate a confusion matrix. The data is in a CSV file (11.1 MB); the download link is: https://drive.google.com/file/d/1cQFp7HteaaL37CefsbMNuHqPzkINCVzs/view?usp=sharing?
First, I received an error message: "AttributeError: predict_proba is not available when probability=False"
So I used this as a correction:
svc = SVC(C=1e9,gamma= 1e-07)
scv_calibrated = CalibratedClassifierCV(svc)
svc_model = scv_calibrated.fit(X_train, y_train)
I read a lot on the internet, but I didn't quite understand how a specific threshold value is personalized. It sounds pretty hard.
Now I see a wrong output:
array([[   0,    0],
       [5359,   65]])
I have no idea what is wrong.
I need help, and I'm new to this.
Thanks
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv('fraud_data.csv')

X = df.iloc[:, :-1]
y = df.iloc[:, -1]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

def answer_four():
    from sklearn.metrics import confusion_matrix
    from sklearn.svm import SVC
    from sklearn.calibration import CalibratedClassifierCV
    from sklearn.model_selection import train_test_split

    svc = SVC(C=1e9, gamma=1e-07)
    scv_calibrated = CalibratedClassifierCV(svc)
    svc_model = scv_calibrated.fit(X_train, y_train)

    # set threshold as -220
    y_pred = (svc_model.predict_proba(X_test)[:, 1] >= -220)

    conf_matrix = confusion_matrix(y_pred, svc_model.predict(X_test))
    return conf_matrix
answer_four()
This function should return a confusion matrix, a 2x2 numpy array with 4 integers.
This code produces the expected output. Besides the fact that in the previous code I was using the confusion matrix incorrectly, I should also have used decision_function and filtered the output with the -220 threshold.
import numpy as np

def answer_four():
    from sklearn.metrics import confusion_matrix
    from sklearn.svm import SVC
    from sklearn.calibration import CalibratedClassifierCV
    from sklearn.model_selection import train_test_split

    # SVC without mention of a kernel, the default is rbf
    svc = SVC(C=1e9, gamma=1e-07).fit(X_train, y_train)

    # decision_function scores: predict confidence scores for the samples
    y_score = svc.decision_function(X_test)

    # set a threshold of -220: the threshold is the separation limit between
    # the classes, applied to the scores of the already trained model
    y_score = np.where(y_score > -220, 1, 0)

    conf_matrix = confusion_matrix(y_test, y_score)
    return conf_matrix

answer_four()
#output:
array([[5320,   24],
       [  14,   66]])
You are using the confusion matrix in a wrong way.
The idea behind the confusion matrix is to have a picture as to how good our predictions y_pred are compared with the ground truth y_true, usually in a test set.
What you actually do here is computing a "confusion matrix" between your predictions with the custom threshold of -220 (y_pred), compared to some other predictions with the default threshold (the output of svc_model.predict(X_test)), which does not make any sense.
Your ground truth for the test set is y_test; so, to get the confusion matrix with the default threshold, you should use
confusion_matrix(y_test, svc_model.predict(X_test))
To get the confusion matrix with your custom threshold of -220, you should use
confusion_matrix(y_test, y_pred)
See the documentation for more details on its usage (the documentation is your best friend, and should always be the first place to look when you have issues or doubts).
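Putting both together, here is a minimal sketch that reuses the uncalibrated svc fitted in your second snippet, together with X_test and y_test:
import numpy as np
from sklearn.metrics import confusion_matrix

# confusion matrix with the default threshold (decision score > 0)
print(confusion_matrix(y_test, svc.predict(X_test)))

# confusion matrix with the custom threshold of -220 on the decision scores
y_pred_custom = np.where(svc.decision_function(X_test) > -220, 1, 0)
print(confusion_matrix(y_test, y_pred_custom))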
Using Python and SVM, I applied these two pieces of code.
First, I applied this code to a dataset:
from sklearn.metrics import confusion_matrix
from sklearn.metrics import cohen_kappa_score
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.svm import LinearSVC
model = LinearSVC(class_weight='balanced',C=0.01, penalty='l2').fit(X_, y)
y_preds = model.predict(X_)
report = classification_report( y, y_preds )
print(report)
print(cohen_kappa_score(y, y_preds), '\n', accuracy_score(y, y_preds), '\n', confusion_matrix(y, y_preds))
This gives me this accuracy: 0.9485714285714286
Second, I applied this code to exactly the same dataset:
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score
models = [
    LinearSVC(class_weight='balanced', C=0.01, penalty='l2', loss='squared_hinge'),
]
CV = 5
cv_df = pd.DataFrame(index=range(CV * len(models)))
entries = []
for model in models:
    model_name = model.__class__.__name__
    accuracies = cross_val_score(model, X_, y, scoring='accuracy', cv=CV)
    for fold_idx, accuracy in enumerate(accuracies):
        entries.append((model_name, fold_idx, accuracy))
cv_df = pd.DataFrame(entries, columns=['model_name', 'fold_idx', 'accuracy'])
cv_df.groupby('model_name').accuracy.mean()
The accuracy is different: 0.797090
Where are my mistakes?
Which code is correct if any?
How to calculate precision and recall after cross-validation as in the 2nd code?
In the 1st code, you do a single prediction and accuracy calculation, and you do it on the very same data (X_, y) that the model was fitted on, which inflates the score. In the 2nd code, cross_val_score does 5 predictions and accuracy calculations, each on a different held-out chunk of the dataset, and you then take the mean of the accuracy scores. In other words, the 2nd code gives a more reliable accuracy score.
As for your other question, if you want to do cross validation with multiple metrics, you can use cross_validate() instead of cross_val_score():
from sklearn.model_selection import cross_validate

scores = cross_validate(model, X, y, scoring=('precision', 'recall'))
print(scores['test_precision'])
print(scores['test_recall'])
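Applied to the setup in your question, a sketch assuming the same model, X_, y and CV as above, with the fold scores averaged the same way the accuracies were (note that the built-in 'precision'/'recall' scorers assume a binary target whose positive class is labeled 1):
from sklearn.model_selection import cross_validate

scores = cross_validate(model, X_, y, scoring=('accuracy', 'precision', 'recall'), cv=CV)
print(scores['test_accuracy'].mean())
print(scores['test_precision'].mean())
print(scores['test_recall'].mean())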
I am running two different classification algorithms on my data, logistic regression and Naive Bayes, but they give me the same accuracy even if I change the training/testing data ratio. The following is the code I am using:
import pandas as pd
from sklearn.cross_validation import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
df = pd.read_csv('Speed Dating.csv', encoding = 'latin-1')
X = pd.DataFrame()
X['d_age'] = df ['d_age']
X['match'] = df ['match']
X['importance_same_religion'] = df ['importance_same_religion']
X['importance_same_race'] = df ['importance_same_race']
X['diff_partner_rating'] = df ['diff_partner_rating']
# Drop NAs
X = X.dropna(axis=0)
# Categorical variable Match [Yes, No]
y = X['match']
# Drop y from X
X = X.drop(['match'], axis=1)
# Transformation
scalar = StandardScaler()
X = scalar.fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Logistic Regression
model = LogisticRegression(penalty='l2', C=1)
model.fit(X_train, y_train)
print('Accuracy Score with Logistic Regression: ', accuracy_score(y_test, model.predict(X_test)))
#Naive Bayes
model_2 = GaussianNB()
model_2.fit(X_train, y_train)
print('Accuracy Score with Naive Bayes: ', accuracy_score(y_test, model_2.predict(X_test)))
print(model_2.predict(X_test))
Is it possible that the accuracy is the same every time?
This is a common phenomenon when the class frequencies are unbalanced, e.g. when nearly all samples belong to one class. For example, if 80% of your samples belong to class "No", then the classifier will often tend to predict "No", because such a trivial prediction reaches the highest overall accuracy on your train set.
In general, when evaluating the performance of a binary classifier, you should not look only at the overall accuracy. You have to consider other metrics such as the ROC curve, per-class accuracies, F1 scores, and so on.
In your case you can use sklearn's classification report to get a better feeling for what your classifier is actually learning:
from sklearn.metrics import classification_report
print(classification_report(y_test, model.predict(X_test)))
print(classification_report(y_test, model_2.predict(X_test)))
It will print the precision, recall and F1-score for every class.
There are three options for reaching a better classification accuracy on your class "Yes" (a sketch of the first one follows the list):
use sample weights: you can increase the importance of the samples of the "Yes" class, thus forcing the classifier to predict "Yes" more often
downsample the "No" class in the original X to reach more balanced class frequencies
upsample the "Yes" class in the original X to reach more balanced class frequencies
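A minimal sketch of the first option, assuming the X_train, y_train, X_test, y_test from your code; class_weight='balanced' re-weights samples inversely to their class frequencies, which has the same effect as setting per-sample weights by hand:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# 'balanced' gives the rare "Yes" class a proportionally larger weight
model_balanced = LogisticRegression(penalty='l2', C=1, class_weight='balanced')
model_balanced.fit(X_train, y_train)
print(classification_report(y_test, model_balanced.predict(X_test)))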
I tried to use GridSearchCV for multi-class case based on the answer from here:
Accelerating the prediction
But I got a ValueError: multiclass format is not supported.
How can I use this method for the multi-class case?
The following code is from the answer in the above link:
import numpy as np
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.grid_search import GridSearchCV
from sklearn.metrics import accuracy_score, recall_score, f1_score, roc_auc_score, make_scorer
X, y = make_classification(n_samples=3000, n_features=5, weights=[0.1, 0.9, 0.3])
pipe = make_pipeline(StandardScaler(), SVC(kernel='rbf', class_weight='auto'))
param_space = dict(svc__C=np.logspace(-5,0,5), svc__gamma=np.logspace(-2, 2, 10))
accuracy_score, recall_score, roc_auc_score
my_scorer = make_scorer(roc_auc_score, greater_is_better=True)
gscv = GridSearchCV(pipe, param_space, scoring=my_scorer)
gscv.fit(X, y)
print(gscv.best_params_)
From the documentation on roc_auc_score:
Note: this implementation is restricted to the binary classification task or multilabel classification task in label indicator format.
By "label indicator format", they mean each label value is represented as a binary column (rather than as a unique target value in a single column). You don't want to do that for your predictor, as it could result in non-mutually-exclusive predictions (i.e., predicting both label 2 and 4 for case p1, or predicting no labels for case p2).
Pick or custom-implement a scoring function that is well-defined for the multiclass problem, such as F1 score. Personally I find informedness more convincing than F1 score, and easier to generalize to the multiclass problem than roc_auc_score.
GridSearchCV supports the multi-class case. You can set the scoring parameter to 'f1_macro', for example:
gsearch1 = GridSearchCV(estimator = est1, param_grid=params_test1, scoring='f1_macro', cv=5, n_jobs=-1)
Or use scoring='roc_auc_ovr'.
GridSearchCV supports multi-class naturally, as long as the metric behind the scoring string handles multi-class y_true and y_pred/y_score by default.
Otherwise, one has to do some customization using make_scorer.
For common metrics like AUROC for multi-class problems, sklearn offers 'roc_auc_ovr', which actually refers to
roc_auc_ovr_scorer = make_scorer(roc_auc_score, needs_proba=True,
multi_class='ovr')
as in the source file.
To deal with a multi-class problem with a classifier such as LogisticRegression, 'ovr' is required and y_true must be in the format of categorical values. The above setting will work directly.
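A minimal sketch of that setup, using hypothetical toy data and 'roc_auc_ovr' as the scoring string (any classifier exposing predict_proba will do):
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# toy 3-class problem; y holds plain categorical class values 0/1/2
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5,
                           n_classes=3, random_state=0)

param_grid = {'C': np.logspace(-2, 2, 5)}
gscv = GridSearchCV(LogisticRegression(max_iter=1000), param_grid,
                    scoring='roc_auc_ovr', cv=5)
gscv.fit(X, y)
print(gscv.best_params_)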
Some other metrics for binary classifications can also be extended by wrapping the respective function. E.g., average_precision_score can be wrapped as
from sklearn.metrics import average_precision_score
from sklearn.preprocessing import OneHotEncoder

def multi_auprc(y_true_cat, y_score):
    y_true = OneHotEncoder().fit_transform(y_true_cat.reshape(-1, 1)).toarray()
    return average_precision_score(y_true, y_score)
The metric can then be defined for GridSearchCV as
{
    'auprc': make_scorer(multi_auprc, needs_proba=True, greater_is_better=True)
}
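That scorer dictionary can then be plugged into GridSearchCV like this, a sketch assuming a classifier clf with predict_proba, a parameter grid param_grid, and data X, y (when a dict of scorers is passed, refit must name the scorer used to pick the best model):
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer

scoring = {'auprc': make_scorer(multi_auprc, needs_proba=True, greater_is_better=True)}

gscv = GridSearchCV(clf, param_grid, scoring=scoring, refit='auprc', cv=5)
gscv.fit(X, y)
print(gscv.best_score_)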
I am trying to run multinomial Naive Bayes on a series of examples in Python using scikit-learn. I am consistently getting all examples classified as negative. The training set is somewhat biased towards negatives, P(negative) ~ 0.75. I looked through the documentation and I couldn't find a way to bias towards positives.
from sklearn.datasets import load_svmlight_file
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score
from sklearn.metrics import recall_score
from sklearn.metrics import precision_score
X_train, y_train= load_svmlight_file("POS.train")
x_test, y_test = load_svmlight_file("POS.val")
clf = MultinomialNB()
clf.fit(X_train, y_train)
preds = clf.predict(x_test)
print('accuracy: ' + str(accuracy_score(y_test, preds)))
print('precision: ' + str(precision_score(y_test, preds)))
print('recall: ' + str(recall_score(y_test, preds)))
Setting a prior is a poor way to handle this and will result in negative cases being classified as positive that really shouldn't be. Your data has a .25/.75 split, so a .5/.5 prior is a pretty bad option.
Instead, one can average the precision and recall with a harmonic mean to produce an F score which attempts to properly handle biased data like this:
from sklearn.metrics import f1_score
The F1 score can then be used to assess the quality of the model. You can then do some model tuning and cross-validation to find a model that better classifies your data, i.e. the model that maximizes the F1 score.
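For instance, a minimal sketch reusing y_test and preds from your code (pos_label may need adjusting to match how the positive class is encoded in the svmlight files):
from sklearn.metrics import f1_score

# harmonic mean of precision and recall for the positive class
print('f1: ' + str(f1_score(y_test, preds)))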
Another option is to randomly prune out the negative cases in your data so that the classifier is trained with .5/.5 data. The predict step should then give more appropriate classifications.
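A minimal sketch of that pruning idea, assuming the X_train and y_train loaded from the svmlight file above and that the positive class is labeled 1:
import numpy as np
from sklearn.naive_bayes import MultinomialNB

rng = np.random.RandomState(0)
pos_idx = np.where(y_train == 1)[0]
neg_idx = np.where(y_train != 1)[0]

# keep all positives and an equally sized random subset of negatives
neg_keep = rng.choice(neg_idx, size=len(pos_idx), replace=False)
keep = np.concatenate([pos_idx, neg_keep])

clf_balanced = MultinomialNB()
clf_balanced.fit(X_train[keep], y_train[keep])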