I'm classifying my data using several algorithms, including
KNN, LogisticRegression, RandomForest, DecisionTreeClassifier, GaussianNB, etc.
After fitting my data I analyze the results using the following:
from sklearn.metrics import confusion_matrix, classification_report
classification_report(y_test, predicted)
I'm not totally clear on the semantics of "predicted positive / negative" etc. with respect to which label it is trying to predict.
Also, maybe more importantly, I am trying to understand why all of the various algorithms predict relatively well with regard to the "Predicted Negative / True Negative vs. Predicted Negative / True Positive" portions but very badly with regard to the "Predicted Positive" portion.
In other words, from my understanding, it is quite good at saying "not something" but is basically tossing a coin when predicting "is something" (around 50-50).
Here are some example classification reports I generated for the different techniques:
confusion matrix (knn)
               Predicted Negative  Predicted Positive
True Negative               14776                5442
True Positive                2367                6337
             precision    recall  f1-score   support
          f       0.73      0.86      0.79     17143
          t       0.73      0.54      0.62     11779
avg / total       0.73      0.73      0.72     28922
confusion matrix (SVM)
               Predicted Negative  Predicted Positive
True Negative               14881                4947
True Positive                2262                6832
             precision    recall  f1-score   support
          f       0.75      0.87      0.81     17143
          t       0.75      0.58      0.65     11779
avg / total       0.75      0.75      0.74     28922
confusion matrix (logistic regression)
               Predicted Negative  Predicted Positive
True Negative               14881                4947
True Positive                2262                6832
             precision    recall  f1-score   support
          f       0.75      0.87      0.81     17143
          t       0.75      0.58      0.65     11779
avg / total       0.75      0.75      0.74     28922
confusion matrix (decision tree)
               Predicted Negative  Predicted Positive
True Negative               14852                4941
True Positive                2291                6838
             precision    recall  f1-score   support
          f       0.75      0.87      0.80     17143
          t       0.75      0.58      0.65     11779
avg / total       0.75      0.75      0.74     28922
confusion matrix (naive_bayes)
               Predicted Negative  Predicted Positive
True Negative               13435                4759
True Positive                3708                7020
             precision    recall  f1-score   support
          f       0.74      0.78      0.76     17143
          t       0.65      0.60      0.62     11779
avg / total       0.70      0.71      0.70     28922
confusion matrix (random_forest)
               Predicted Negative  Predicted Positive
True Negative               13287                5248
True Positive                3856                6531
             precision    recall  f1-score   support
          f       0.72      0.78      0.74     17143
          t       0.63      0.55      0.59     11779
avg / total       0.68      0.69      0.68     28922
confusion matrix (gradient_boost)
               Predicted Negative  Predicted Positive
True Negative               15071                5583
True Positive                2072                6196
             precision    recall  f1-score   support
          f       0.73      0.88      0.80     17143
          t       0.75      0.53      0.62     11779
avg / total       0.74      0.74      0.72     28922
confusion matrix (neural network MLPClassifier)
               Predicted Negative  Predicted Positive
True Negative               10789                3653
True Positive                6354                8126
             precision    recall  f1-score   support
          f       0.75      0.63      0.68     17143
          t       0.56      0.69      0.62     11779
avg / total       0.67      0.65      0.66     28922
The only one that seems to handle "Predicted Positive" reasonably well is the MLPClassifier.
Sorry, I don't know what your dataset looks like. But let's say there is a coin-flipping experiment with two possible results, either head (1) or tail (0). Now we use a classification algorithm to predict the result from a bunch of possible features.
If the prediction is correct (the same as the class label), we count it as a true one. If not, it is a false one.
If the algorithm outputs a "head" prediction, it is regarded as a positive result, and "tail" as a negative one.
On its own, the "True Positive" count has little value. But if we add it to the "False Negative" count, the sum is the number of actual positive cases.
And if we divide "True Positive" by the number of all actual positive cases, we get what is normally called "recall" or the TP rate: the fraction of actual positive (head) cases that the model predicts correctly.
We can compare the TP rate (TP/P) with the FP rate (FP/N) to analyze the performance of a given model.
There are also other combinations of these positive, negative, true, false and rate quantities, such as sensitivity and specificity.
If you want to know more, I would recommend looking at the ROC curve.
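To make the TP rate and FP rate concrete, here is a minimal sketch using scikit-learn's confusion_matrix. The labels and predictions are made-up coin flips (1 = head, 0 = tail), not data from the question.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical coin-flip labels: 1 = head (positive), 0 = tail (negative)
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

# With labels=[0, 1], ravel() yields the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()

tp_rate = tp / (tp + fn)  # recall / sensitivity: TP / P
fp_rate = fp / (fp + tn)  # 1 - specificity: FP / N

print(f"TP rate = {tp_rate:.2f}, FP rate = {fp_rate:.2f}")
```

Which class counts as "positive" is only a convention: with 0/1 labels scikit-learn treats 1 as the positive class by default in its binary metrics.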
I have two classification report results (from 2 different models), and I want to use a bar chart to plot them in one single graph. How can I do this?
Sample classification report result:
              precision    recall  f1-score   support
       False       0.94      0.95      0.95     10078
        True       0.95      0.94      0.94     10078
    accuracy                           0.94     20156
   macro avg       0.94      0.94      0.94     20156
weighted avg       0.94      0.94      0.94     20156
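One possible approach, sketched under assumptions: ask classification_report for a dictionary (output_dict=True) instead of a string, pull out the per-class scores, and draw grouped bars with matplotlib. The labels, the predictions, the names "model A" / "model B" and the output file name are all hypothetical.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove to open a window instead
import matplotlib.pyplot as plt
import numpy as np
from sklearn.metrics import classification_report

# Hypothetical true labels and predictions from two models
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred_a = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred_b = [0, 1, 1, 1, 1, 1, 1, 0]

# output_dict=True returns nested dicts keyed by class label (as a string)
report_a = classification_report(y_true, y_pred_a, output_dict=True)
report_b = classification_report(y_true, y_pred_b, output_dict=True)

metrics = ["precision", "recall", "f1-score"]
classes = ["0", "1"]
groups = [f"{c} {m}" for c in classes for m in metrics]
values_a = [report_a[c][m] for c in classes for m in metrics]
values_b = [report_b[c][m] for c in classes for m in metrics]

# Two bars per group, shifted left and right of the tick position
x = np.arange(len(groups))
width = 0.4
fig, ax = plt.subplots(figsize=(8, 4))
ax.bar(x - width / 2, values_a, width, label="model A")
ax.bar(x + width / 2, values_b, width, label="model B")
ax.set_xticks(x)
ax.set_xticklabels(groups, rotation=45, ha="right")
ax.set_ylim(0, 1)
ax.set_ylabel("score")
ax.legend()
fig.tight_layout()
fig.savefig("classification_reports.png")
```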
I don't know if it is explicit in the title, but I want to print a classification_report.
I want to write that it belongs to the Test set just like this :
print(f'Test classification report :{classification_report(y_test, y_pred)}')
But it gives this output with the 4 column names shifted :
Test classification report :              precision    recall  f1-score   support
           0       0.68      0.50      0.57       187
           1       0.79      0.89      0.84       407
    accuracy                           0.77       594
   macro avg       0.74      0.69      0.71       594
weighted avg       0.76      0.77      0.76       594
Thanks
\n is the way to go.
It adds a new line: put it right before the report inside the f-string so the header row starts on its own line.
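A minimal sketch of that fix, with made-up labels and predictions standing in for y_test and y_pred:

```python
from sklearn.metrics import classification_report

# Hypothetical labels and predictions, only for illustration
y_test = [0, 0, 1, 1, 1, 0, 1, 1]
y_pred = [0, 1, 1, 1, 0, 0, 1, 1]

# The "\n" after the colon pushes the report onto its own lines,
# so the header row stays aligned with the columns below it
text = f"Test classification report:\n{classification_report(y_test, y_pred)}"
print(text)
```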
I am working on a multi-class classification problem and solved it using XGBoost. The number of unique classes is 7.
I got a classification report with precision, recall and F1 score for each class.
I have no clue how to code this in Python.
I need the mean per-class accuracy. Is there a mathematical formula to calculate per-class accuracy?
Update:
Test data per class samples:
Class # samples
0 13
1 16
2 9
SVM predictions per class samples:
Class # samples
0 13
1 15
2 10
SVM classification report:
svm           precision    recall  f1-score   support
           0       1.00      1.00      1.00        13
           1       1.00      0.94      0.97        16
           2       0.90      1.00      0.95         9
   micro avg       0.97      0.97      0.97        38
   macro avg       0.97      0.98      0.97        38
weighted avg       0.98      0.97      0.97        38
Can you please make a suggestion based on this?
Per-class recall = (members of class identified correctly) / (number of members of class)
Multiply each per-class recall value by the number of samples that are actually in the class to get the number of samples of each class classified correctly, add these up to get the total number of correct predictions, and then divide by the total number of samples. This gives the overall accuracy, i.e. the per-class recall values averaged with the class sizes as weights. If you want every class to count equally instead, take the plain mean of the per-class recall values.
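As a rough sketch of this calculation with the three-class SVM numbers from the question: the per-sample predictions below are reconstructed under the assumption that exactly one class-1 sample was predicted as class 2. That is consistent with the counts and the report, but it is an inference, not something the question states.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Class sizes from the question: 13, 16 and 9 test samples
y_true = np.repeat([0, 1, 2], [13, 16, 9])

# Assumed predictions: everything correct except the last class-1 sample
# (index 28), which is predicted as class 2
y_pred = y_true.copy()
y_pred[28] = 2

cm = confusion_matrix(y_true, y_pred)  # rows = true class, columns = predicted
support = cm.sum(axis=1)               # number of true samples per class
per_class_recall = cm.diagonal() / support

# Recall weighted by class size: identical to overall accuracy
overall_accuracy = (per_class_recall * support).sum() / support.sum()

# Plain (unweighted) mean of per-class recall
mean_per_class = per_class_recall.mean()

print(per_class_recall, overall_accuracy, mean_per_class)
```

The two averages correspond to the recall column of the "micro avg" (0.97) and "macro avg" (0.98) rows in the report above.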
Scikit-learn's classification report shows precision and recall scores with only two digits. Is it possible to make it display 4 digits after the decimal point, i.e. 0.6783 instead of 0.67?
from sklearn.metrics import classification_report
print(classification_report(testLabels, p, labels=list(set(testLabels)), target_names=['POSITIVE', 'NEGATIVE', 'NEUTRAL']))
             precision    recall  f1-score   support
   POSITIVE       1.00      0.82      0.90     41887
   NEGATIVE       0.65      0.86      0.74     19989
    NEUTRAL       0.62      0.67      0.64     10578
Also, should I worry about a precision score of 1.00? Thanks!
I just came across this old question.
It is indeed possible to have more precision points in classification_report. You just need to pass in a digits argument.
classification_report(y_true, y_pred, target_names=target_names, digits=4)
From the documentation:
digits : int
Number of digits for formatting output floating point values
Demonstration:
from sklearn.metrics import classification_report
y_true = [0, 1, 2, 2, 2]
y_pred = [0, 0, 2, 2, 1]
target_names = ['class 0', 'class 1', 'class 2']
print(classification_report(y_true, y_pred, target_names=target_names))
Output:
             precision    recall  f1-score   support
    class 0       0.50      1.00      0.67         1
    class 1       0.00      0.00      0.00         1
    class 2       1.00      0.67      0.80         3
avg / total       0.70      0.60      0.61         5
With 4 digits:
print(classification_report(y_true, y_pred, target_names=target_names, digits=4))
Output:
             precision    recall  f1-score   support
    class 0     0.5000    1.0000    0.6667         1
    class 1     0.0000    0.0000    0.0000         1
    class 2     1.0000    0.6667    0.8000         3
avg / total     0.7000    0.6000    0.6133         5
No, it is not possible to display more digits with classification_report. The format string is hardcoded, see here.
edit: there is an update, see CentAu's answer
I fit a logistic regression model and train it on a training dataset using the following:
from sklearn.linear_model import LogisticRegression
# the 'l1' penalty needs a solver that supports it, e.g. liblinear
lr = LogisticRegression(C=0.1, penalty='l1', solver='liblinear')
# all columns but the last are features; the last column holds the labels
model = lr.fit(training[:, 0:-1], training[:, -1])
I have a cross-validation dataset whose labels are stored in the last column of the input matrix and can be accessed as
cv[:,-1]
I run my cross-validation dataset against the trained model, which returns a list of 0s and 1s as predictions:
cv_predict = model.predict(cv[:,0:-1])
Question
I want to calculate the precision and recall scores based on the actual labels and the predicted labels. Is there a standard method to do it using numpy/scipy/scikits?
Thank you
Yes there are, see the documentation: http://scikit-learn.org/stable/modules/classes.html#classification-metrics
You should also have a look at the sklearn.metrics.classification_report utility:
>>> from sklearn.metrics import classification_report
>>> from sklearn.linear_model import SGDClassifier
>>> from sklearn.datasets import load_digits
>>> digits = load_digits()
>>> n_samples, n_features = digits.data.shape
>>> n_split = n_samples // 2
>>> clf = SGDClassifier().fit(digits.data[:n_split], digits.target[:n_split])
>>> predictions = clf.predict(digits.data[n_split:])
>>> expected = digits.target[n_split:]
>>> print(classification_report(expected, predictions))
             precision    recall  f1-score   support
          0       0.90      0.98      0.93        88
          1       0.81      0.69      0.75        91
          2       0.94      0.98      0.96        86
          3       0.94      0.85      0.89        91
          4       0.90      0.93      0.91        92
          5       0.92      0.92      0.92        91
          6       0.92      0.97      0.94        91
          7       1.00      0.85      0.92        89
          8       0.71      0.89      0.79        88
          9       0.89      0.83      0.86        92
avg / total       0.89      0.89      0.89       899
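If only the two numbers are needed, sklearn.metrics also provides precision_score and recall_score. A minimal sketch for the binary 0/1 case from the question, with made-up lists standing in for cv[:,-1] and cv_predict:

```python
from sklearn.metrics import precision_score, recall_score

# Hypothetical stand-ins for the actual labels cv[:, -1] and for cv_predict
cv_labels = [1, 0, 1, 1, 0, 1, 0, 0]
cv_predict = [1, 0, 0, 1, 0, 1, 1, 0]

# By default both functions treat label 1 as the positive class
precision = precision_score(cv_labels, cv_predict)  # TP / (TP + FP)
recall = recall_score(cv_labels, cv_predict)        # TP / (TP + FN)

print(f"precision = {precision:.2f}, recall = {recall:.2f}")
```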