sklearn: multi-class problem and reporting sensitivity and specificity - python

I have a three-class problem and I'm able to report precision and recall for each class with the below code:
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))
which gives me the precision and recall nicely for each of the 3 classes in a table format.
My question is how can I now get sensitivity and specificity for each of the 3 classes? I looked at sklearn.metrics and I didn't find anything for reporting sensitivity and specificity.

If we check the help page for classification report:
Note that in binary classification, recall of the positive class is
also known as “sensitivity”; recall of the negative class is
“specificity”.
So, for each class, we can binarize the true labels and the predictions (this class versus the rest) and then read both recall values from precision_recall_fscore_support.
Using an example:
from sklearn.metrics import classification_report
y_true = [0, 1, 2, 2, 2]
y_pred = [0, 0, 2, 2, 1]
target_names = ['class 0', 'class 1', 'class 2']
print(classification_report(y_true, y_pred, target_names=target_names))
Looks like:
              precision    recall  f1-score   support
     class 0       0.50      1.00      0.67         1
     class 1       0.00      0.00      0.00         1
     class 2       1.00      0.67      0.80         3
    accuracy                           0.60         5
   macro avg       0.50      0.56      0.49         5
weighted avg       0.70      0.60      0.61         5
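Using sklearn:
import numpy as np
import pandas as pd
from sklearn.metrics import precision_recall_fscore_support
res = []
for l in [0, 1, 2]:
    # One-vs-rest: True where the label equals the current class.
    # With average=None the recalls follow the sorted labels [False, True],
    # so recall[0] is the specificity and recall[1] is the sensitivity.
    prec, recall, _, _ = precision_recall_fscore_support(np.array(y_true) == l,
                                                         np.array(y_pred) == l,
                                                         average=None)
    res.append([l, recall[1], recall[0]])
Put the results into a dataframe:
pd.DataFrame(res, columns=['class', 'sensitivity', 'specificity'])
   class  sensitivity  specificity
0      0     1.000000         0.75
1      1     0.000000         0.75
2      2     0.666667         1.00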
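As a cross-check, the same per-class quantities can be read off the multi-class confusion matrix. This is only a minimal sketch: it reuses the y_true and y_pred from the example above, and the helper name sensitivity_specificity is made up for illustration.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def sensitivity_specificity(y_true, y_pred, labels):
    """Per-class one-vs-rest sensitivity and specificity (hypothetical helper)."""
    cm = confusion_matrix(y_true, y_pred, labels=labels)  # rows: true, columns: predicted
    tp = np.diag(cm)                # correctly predicted samples of each class
    fn = cm.sum(axis=1) - tp        # samples of the class predicted as something else
    fp = cm.sum(axis=0) - tp        # samples of other classes predicted as this class
    tn = cm.sum() - tp - fn - fp    # everything else
    # Assumes each class occurs in y_true and is not the only class present,
    # otherwise one of the denominators below is zero.
    return tp / (tp + fn), tn / (tn + fp)

y_true = [0, 1, 2, 2, 2]
y_pred = [0, 0, 2, 2, 1]
sens, spec = sensitivity_specificity(y_true, y_pred, labels=[0, 1, 2])
print(sens)
print(spec)
```

The sensitivity values should agree with the recall column of classification_report for the same data.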

Classification report's output is a formatted string. This code snippet extracts the required values and stores them in a 2-D list.
Note: To understand the code better, add print statements to check the variable values.
y = classification_report(y_test, y_pred)  # classification report's output is a string
lines = y.split('\n')  # extract every line and store in a list
res = []  # list to store the cleaned results
for i in range(len(lines)):
    line = lines[i].split(" ")  # Values are separated by blanks. Split at the blank spaces.
    line = [j for j in line if j != '']  # add only the values into the list
    if len(line) != 0:
        # empty lines get added as empty lists. Remove those
        res.append(line)
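If the installed scikit-learn supports it (the output_dict parameter, available since version 0.20 as far as I know), the string parsing can be avoided altogether. A minimal sketch, using small made-up label lists rather than the asker's y_test and y_pred:

```python
from sklearn.metrics import classification_report

# Made-up labels for illustration
y_true = [0, 1, 2, 2, 2]
y_pred = [0, 0, 2, 2, 1]
target_names = ['class 0', 'class 1', 'class 2']

# output_dict=True returns nested dicts instead of a formatted string
report = classification_report(y_true, y_pred,
                               target_names=target_names,
                               output_dict=True)

for name in target_names:
    row = report[name]
    print(name, row['precision'], row['recall'], row['f1-score'], row['support'])
print('accuracy', report['accuracy'])
```

The values in the dict are not rounded, and labels that contain spaces (such as "macro avg") stay in one piece, which the whitespace split above does not guarantee.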

Related

Problems with all values output to 1 in evaluation metrics

x_test,x_val,y_test,y_val = train_test_split(x_test,y_test,test_size=0.5)
print(x_train.shape)
#(1413, 3) <----Result
print(x_val.shape)
#(472, 3) <----Result
print(x_test.shape)
#(471, 3) <----Result
I split the data as shown above and got these shapes.
from sklearn.tree import DecisionTreeClassifier
dTree = DecisionTreeClassifier(max_depth=2,random_state=0).fit(x_train,y_train)
print("train score : {}".format(dTree.score(x_train, y_train)))
#train score : 1.0 <----Result
print("val score : {}".format(dTree.score(x_val, y_val)))
#val score : 1.0 <----Result
I then fit a decision tree and printed the train and validation scores; both were 1.
predict_y = dTree.predict(x_test)
from sklearn.metrics import classification_report
print(classification_report(y_test, dTree.predict(x_test)))
print("test score : {}".format(dTree.score(x_test, y_test)))
              precision    recall  f1-score   support
           A       1.00      1.00      1.00       235
           B       1.00      1.00      1.00       236
    accuracy                           1.00       471
   macro avg       1.00      1.00      1.00       471
weighted avg       1.00      1.00      1.00       471
test score : 0.9978768577494692
Finally, classification_report also showed the above results. Are some of my data splits wrong? Or does a value of 1 mean that all the data were classified perfectly? If I'm wrong, I'd like to hear the right solution.

How to split data into test and train after applying stratified k-fold cross validation?

I have already assigned columns to their specific k-fold using the following code:
from sklearn.model_selection import StratifiedKFold, train_test_split
# Stratified K-fold cross-validation
df['kfold'] = -1
df = df.sample(frac=1).reset_index(drop=True)
y = df.quality
kf = StratifiedKFold(n_splits=5)
for f, (t_, v_) in enumerate(kf.split(X=df, y=y)):
    df.loc[v_, 'kfold'] = f
Now the dataframe is as expected:
      alcohol  volatile acidity  sulphates  citric acid  quality  kfold
1499     10.9              0.36       0.73         0.39        6      4
1500      9.5              0.65       0.55         0.10        5      4
1501     13.4              0.44       0.66         0.68        6      4
1502      9.6              0.59       0.67         0.24        5      4
1503     13.0              0.53       0.77         0.79        5      4
But how do I split it into train and test sets?
StratifiedKFold will split the dataframe into a number of folds and return the training/test indices.
Each fold will have one part for testing (of size len(data)/n) and the rest will be used for training.
In your for loop, you can access the train and test sets as follows:
for f, (t_, v_) in enumerate(kf.split(X=df, y=y)):
    df_train = df.loc[t_]
    df_test = df.loc[v_]
As you can see, the kfold column you added labels the test data of each fold. The rest of the data should be used for training in that fold. For example, for kfold == 1 the training data is all other data (kfold != 1).
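Once the kfold column exists, the split can also be recovered later with plain boolean masks, without calling kf.split again. A minimal sketch, assuming a small synthetic dataframe in place of the wine data (the values are made up; only the column names alcohol, quality and kfold follow the question):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in for the wine data (made up for illustration)
rng = np.random.default_rng(0)
df = pd.DataFrame({
    'alcohol': rng.uniform(9, 14, size=50),
    'quality': np.repeat([5, 6], 25),
})

# Assign each row to the fold in which it serves as test data
df['kfold'] = -1
kf = StratifiedKFold(n_splits=5)
for f, (t_, v_) in enumerate(kf.split(X=df, y=df.quality)):
    df.loc[v_, 'kfold'] = f

# Recover train/test sets per fold from the kfold column alone
for f in range(5):
    df_train = df[df.kfold != f]
    df_test = df[df.kfold == f]
    print(f, len(df_train), len(df_test))
```

This relies on the dataframe having a default integer index, as it does after reset_index(drop=True) in the question, so that the positional indices from kf.split match the labels used by df.loc.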

Naive Bayes and SVM classification - how to plot accuracy on x y axis?

I'm trying to generate a line graph with an x and y axis showing the accuracy of 2 different classification algorithms, Naive Bayes and SVM.
I train/test the data like this:
from sklearn import model_selection, preprocessing, naive_bayes, metrics
from sklearn.feature_extraction.text import TfidfVectorizer
# split the dataset into training and validation datasets
train_x, valid_x, train_y, valid_y = model_selection.train_test_split(result['post'], result['type'], test_size=0.30, random_state=1)
# label encode the target variable
encoder = preprocessing.LabelEncoder()
train_y = encoder.fit_transform(train_y)
# reuse the mapping learned on the training labels instead of refitting
valid_y = encoder.transform(valid_y)
def tokenizersplit(text):
    return text.split()
tfidf_vect = TfidfVectorizer(tokenizer=tokenizersplit, encoding='utf-8', min_df=2, ngram_range=(1, 2), max_features=25000)
tfidf_vect.fit(result['post'])
tfidf_vect.transform(result['post'])
xtrain_tfidf = tfidf_vect.transform(train_x)
xvalid_tfidf = tfidf_vect.transform(valid_x)
def train_model(classifier, trains, t_labels, valids, v_labels):
    # fit the training dataset on the classifier
    classifier.fit(trains, t_labels)
    # predict the labels on validation dataset
    predictions = classifier.predict(valids)
    return metrics.accuracy_score(predictions, v_labels)
# Naive Bayes
accuracy = train_model(naive_bayes.MultinomialNB(), xtrain_tfidf, train_y, xvalid_tfidf, valid_y)
print("NB accuracy: ", accuracy)
However for an assignment I need something plotted on the x/y axis using matplotlib. I tried this:
m=linear_model.LogisticRegression()
m.fit(xtrain_tfidf, train_y)
y_pred = m.predict(xvalid_tfidf)
print(metrics.classification_report(valid_y, y_pred))
plt.plot(valid_y, y_pred)
plt.show()
But this gives me:
I need something that can more easily compare the accuracy of Naive Bayes vs SVM vs another algorithm. How can I do this?
Plotting classification report:
plt.plot(metrics.classification_report(valid_y, y_pred))
plt.show()
My classification output:
              precision    recall  f1-score   support
           0       1.00      0.18      0.31        11
           1       0.00      0.00      0.00        14
           2       0.00      0.00      0.00        19
           3       0.50      0.77      0.61        66
           4       0.39      0.64      0.49        47
           5       0.00      0.00      0.00        23
    accuracy                           0.46       180
   macro avg       0.32      0.27      0.23       180
weighted avg       0.35      0.46      0.37       180
Error with edit:
df = pd.DataFrame(metrics.classification_report(valid_y, y_pred)).transpose()
gives error
ValueError: DataFrame constructor not properly called!
metrics.classification_report summarizes the prediction results as text. It is meant for printing a "report", not for plotting. If you want the table in a visual format you can follow https://stackoverflow.com/a/34304414/4005668.
Otherwise, you can capture the report in a dataframe. By default classification_report returns a string, which is why the DataFrame constructor raises the ValueError above; pass output_dict=True to get a dict instead:
import pandas as pd
# put it in a dataframe
df = pd.DataFrame(metrics.classification_report(valid_y, y_pred, output_dict=True)).transpose()
# plot the dataframe
df.plot()
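To compare the accuracy of several classifiers directly, one option is a bar chart with one bar per model. The following is only a sketch under stated assumptions: the data is synthetic (make_classification) rather than the TF-IDF features from the question, GaussianNB is swapped in for MultinomialNB because the synthetic features can be negative, and the model names in the dict are arbitrary labels.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch also runs headless; drop for interactive use
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

# Synthetic data standing in for the TF-IDF features (an assumption)
X, y = make_classification(n_samples=300, n_features=10, n_informative=5, random_state=1)
train_x, valid_x, train_y, valid_y = train_test_split(X, y, test_size=0.30, random_state=1)

models = {
    "Naive Bayes": GaussianNB(),
    "SVM": SVC(kernel="linear"),
    "Logistic Regression": LogisticRegression(max_iter=1000),
}

# Fit each model and record its validation accuracy
accuracies = {}
for name, model in models.items():
    model.fit(train_x, train_y)
    accuracies[name] = accuracy_score(valid_y, model.predict(valid_x))

# One bar per classifier, accuracy on the y axis
fig, ax = plt.subplots()
ax.bar(list(accuracies.keys()), list(accuracies.values()))
ax.set_ylim(0, 1)
ax.set_ylabel("Validation accuracy")
ax.set_title("Accuracy by classifier")
```

With an interactive backend, plt.show() displays the chart; fig.savefig with a file name of your choice writes it to disk.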

score error while creating model in python

I was using classification report to check the accuracy, and also the confusion matrix.
I made some modifications to the code and it seems to work now:
import numpy as np
import sklearn as sk
import sklearn.metrics  # makes sk.metrics available
from sklearn import svm
from sklearn.model_selection import train_test_split
x = np.array([17, 17.083333, 17.166667, 17.25, 17.333333, 17.416667])
x = x.reshape(6, 1)
y = [1, 0, 1, 1, 0, 1]
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.20)
clf = svm.SVC(kernel='linear')
clf.fit(X_train, y_train)
pred = clf.predict(X_test)
score = sk.metrics.accuracy_score(y_test, pred)
report = sk.metrics.classification_report(y_test, pred, target_names=['0', '1'])
confusionmatrix = sk.metrics.confusion_matrix(y_test, pred)
print("Accuracy_Score: " + str(score))
print("Classification_Report:\n" + report)
print("Confusion_Matrix:")
print(confusionmatrix)
output:
Accuracy_Score: 0.5
Classification_Report:
             precision    recall  f1-score   support
          0       0.00      0.00      0.00         1
          1       0.50      1.00      0.67         1
avg / total       0.25      0.50      0.33         2
Confusion_Matrix:
[[0 1]
 [0 1]]
I changed the input "x" to a NumPy array and removed values from x.reshape. You also had a typo in clf.predict(): you passed "Xtest", but it has to be "X_test".
Hope this helps

Scikit classification report - change the format of displayed results

Scikit classification report shows precision and recall scores with two digits only. Is it possible to make it display 4 digits after the decimal point, for example 0.6783 instead of 0.67?
from sklearn.metrics import classification_report
print(classification_report(testLabels, p, labels=list(set(testLabels)), target_names=['POSITIVE', 'NEGATIVE', 'NEUTRAL']))
             precision    recall  f1-score   support
   POSITIVE       1.00      0.82      0.90     41887
   NEGATIVE       0.65      0.86      0.74     19989
    NEUTRAL       0.62      0.67      0.64     10578
Also, should I worry about a precision score of 1.00? Thanks!
I just came across this old question.
It is indeed possible to have more precision points in classification_report. You just need to pass in a digits argument.
classification_report(y_true, y_pred, target_names=target_names, digits=4)
From the documentation:
digits : int
Number of digits for formatting output floating point values
Demonstration:
from sklearn.metrics import classification_report
y_true = [0, 1, 2, 2, 2]
y_pred = [0, 0, 2, 2, 1]
target_names = ['class 0', 'class 1', 'class 2']
print(classification_report(y_true, y_pred, target_names=target_names))
Output:
             precision    recall  f1-score   support
    class 0       0.50      1.00      0.67         1
    class 1       0.00      0.00      0.00         1
    class 2       1.00      0.67      0.80         3
avg / total       0.70      0.60      0.61         5
With 4 digits:
print(classification_report(y_true, y_pred, target_names=target_names, digits=4))
Output:
             precision    recall  f1-score   support
    class 0     0.5000    1.0000    0.6667         1
    class 1     0.0000    0.0000    0.0000         1
    class 2     1.0000    0.6667    0.8000         3
avg / total     0.7000    0.6000    0.6133         5
No, it is not possible to display more digits with classification_report. The format string is hardcoded, see here.
edit: there is an update, see CentAu's answer
