I was trying to plot a ROC curve with classifiers other than svm.SVC, which is the one used in the documentation. My code works for svm.SVC; however, after I switched to KNeighborsClassifier, MultinomialNB, and DecisionTreeClassifier, the call to check_consistent_length(y_true, y_score) keeps failing with "Found input variables with inconsistent numbers of samples: [26632, 53264]". My CSV file looks like this (screenshot not included).
And here is my code:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from itertools import cycle
import sys
from sklearn import svm, datasets
from sklearn.metrics import roc_curve, auc
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import label_binarize
from sklearn.multiclass import OneVsRestClassifier
from scipy import interp
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier
# Import some data to play with
df = pd.read_csv("E:\\autodesk\\Hourly and weather categorized2.csv")
X = df[['TTI', 'Max TemperatureF', 'Mean TemperatureF', 'Min TemperatureF', ' Min Humidity']].values
y = df['TTI_Category'].values  # .values replaces the deprecated .as_matrix()
y = y.reshape(-1, 1)
# Binarize the output
y = label_binarize(y, classes=['Good', 'Bad'])
n_classes = y.shape[1]
# shuffle and split training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.5,
                                                    random_state=0)
# Learn to predict each class against the other
classifier = OneVsRestClassifier(DecisionTreeClassifier(random_state=0))
y_score = classifier.fit(X_train, y_train).predict_proba(X_test)
# Compute ROC curve and ROC area for each class
fpr = dict()
tpr = dict()
roc_auc = dict()
for i in range(n_classes):
    fpr[i], tpr[i], _ = roc_curve(y_test[:, i], y_score[:, i])
    roc_auc[i] = auc(fpr[i], tpr[i])
# Compute micro-average ROC curve and ROC area
fpr["micro"], tpr["micro"], _ = roc_curve(y_test.ravel(), y_score.ravel())
roc_auc["micro"] = auc(fpr["micro"], tpr["micro"])
plt.figure()
lw = 1
plt.plot(fpr[0], tpr[0], color='darkorange', lw=lw, label='ROC curve (area = %0.2f)' % roc_auc[0])
plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic example')
plt.legend(loc="lower right")
plt.show()
I suspect that the error occurs at these lines:
fpr["micro"], tpr["micro"], _ = roc_curve(y_test.ravel(), y_score.ravel())
roc_auc["micro"] = auc(fpr["micro"], tpr["micro"])
but I'm a beginner with ROC curves, so could someone kindly guide me through this traceback? Thanks a lot for your time and help.
Here is the whole traceback. Hopefully my explanation is clear enough.
Traceback (most recent call last):
File "<ipython-input-1-16eb0db9d4d9>", line 1, in <module>
runfile('C:/Users/Think/Desktop/Python Practice/ROC with decision tree.py', wdir='C:/Users/Think/Desktop/Python Practice')
File "C:\Users\Think\Anaconda2\lib\site-packages\spyder\utils\site\sitecustomize.py", line 880, in runfile
execfile(filename, namespace)
File "C:\Users\Think\Anaconda2\lib\site-packages\spyder\utils\site\sitecustomize.py", line 87, in execfile
exec(compile(scripttext, filename, 'exec'), glob, loc)
File "C:/Users/Think/Desktop/Python Practice/ROC with decision tree.py", line 47, in <module>
fpr["micro"], tpr["micro"], _ = roc_curve(y_test.ravel(), y_score.ravel())
File "C:\Users\Think\Anaconda2\lib\site-packages\sklearn\metrics\ranking.py", line 510, in roc_curve
y_true, y_score, pos_label=pos_label, sample_weight=sample_weight)
File "C:\Users\Think\Anaconda2\lib\site-packages\sklearn\metrics\ranking.py", line 302, in _binary_clf_curve
check_consistent_length(y_true, y_score)
File "C:\Users\Think\Anaconda2\lib\site-packages\sklearn\utils\validation.py", line 173, in check_consistent_length
" samples: %r" % [int(l) for l in lengths])
ValueError: Found input variables with inconsistent numbers of samples: [26632, 53264]
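The two numbers in the error message (53264 is exactly twice 26632) can be reproduced on synthetic data. The sketch below is only an illustration: the data, the labelling rule, and the variable names are made up, and a plain DecisionTreeClassifier stands in for the wrapped one. With two classes, label_binarize returns a single column, while predict_proba returns one column per class, so raveling both arrays yields different lengths.

```python
import numpy as np
from sklearn.metrics import auc, roc_curve
from sklearn.preprocessing import label_binarize
from sklearn.tree import DecisionTreeClassifier

# Synthetic two-class data (assumption: stands in for the CSV file)
rng = np.random.RandomState(0)
X = rng.rand(200, 3)
labels = np.where(X[:, 0] + 0.3 * rng.rand(200) > 0.6, 'Bad', 'Good')

# Two classes -> label_binarize yields ONE column
y = label_binarize(labels, classes=['Good', 'Bad'])
X_train, X_test, y_train, y_test = X[:100], X[100:], y[:100], y[100:]

clf = DecisionTreeClassifier(max_depth=3, random_state=0)
clf.fit(X_train, y_train.ravel())

# predict_proba yields one column PER CLASS, i.e. two columns here
y_score = clf.predict_proba(X_test)

print(y_test.ravel().shape)   # n_test values
print(y_score.ravel().shape)  # 2 * n_test values -> inconsistent lengths

# Passing only the positive-class column keeps the lengths consistent
fpr, tpr, _ = roc_curve(y_test.ravel(), y_score[:, 1])
roc_auc = auc(fpr, tpr)
```

Using only the second column of predict_proba is one way to make the two inputs the same length; it assumes the positive class is the one encoded as 1.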
You need to use the predict_proba function of the DecisionTreeClassifier:
Example:
import matplotlib.pyplot as plt
from itertools import cycle
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import label_binarize
from sklearn.metrics import roc_curve, auc
from sklearn.multiclass import OneVsRestClassifier
from sklearn.tree import DecisionTreeClassifier
iris = datasets.load_iris()
X = iris.data
y = iris.target
# Binarize the output
y = label_binarize(y, classes=[0, 1, 2])
n_classes = y.shape[1]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.5, random_state=0)
classifier = OneVsRestClassifier(DecisionTreeClassifier(random_state=0))
y_score = classifier.fit(X_train, y_train).predict_proba(X_test)
fpr = dict()
tpr = dict()
roc_auc = dict()
for i in range(n_classes):
    fpr[i], tpr[i], _ = roc_curve(y_test[:, i], y_score[:, i])
    roc_auc[i] = auc(fpr[i], tpr[i])
lw = 2  # line width used for all curves
colors = cycle(['blue', 'red', 'green'])
for i, color in zip(range(n_classes), colors):
    plt.plot(fpr[i], tpr[i], color=color, lw=lw,
             label='ROC curve of class {0} (area = {1:0.2f})'
             ''.format(i, roc_auc[i]))
plt.plot([0, 1], [0, 1], 'k--', lw=lw)
plt.xlim([-0.05, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic for multi-class data')
plt.legend(loc="lower right")
plt.show()
The problem was solved by adding this line to the original code: y_resampled = label_binarize(y_resampled, classes=['Good','Bad','Ok'])
There is a key difference between these implementations that is being ignored. Tree-based algorithms in sklearn interpret one-hot encoded (binarized) target labels as a multi-label problem. To get the AUC and ROC curve for a multi-class problem, one must binarize the outputs for the ROC calculation only. There is no need to use OneVsRestClassifier with any of the algorithms listed as inherently multi-class. For algorithms that aren't inherently multi-class, it makes sense to use the OvR classifier, or to use it to avoid complex decision functions in the case of SVM. Please refer to the code snippets below: the first one is the same code that was used in the example above; the second one is the correct implementation, in which a multi-class classifier is trained and the ROC is then computed for each individual class. Check the difference in the plots.
import matplotlib.pyplot as plt
from itertools import cycle
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import label_binarize
from sklearn.metrics import roc_curve, auc
from sklearn.multiclass import OneVsRestClassifier
from sklearn.tree import DecisionTreeClassifier
iris = datasets.load_iris()
X = iris.data
y = iris.target
# Binarize the output
y = label_binarize(y, classes=[0, 1, 2])
n_classes = y.shape[1]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.5, random_state=0)
classifier = OneVsRestClassifier(DecisionTreeClassifier(random_state=0))
y_score = classifier.fit(X_train, y_train).predict_proba(X_test)
fpr = dict()
tpr = dict()
roc_auc = dict()
for i in range(n_classes):
    fpr[i], tpr[i], _ = roc_curve(y_test[:, i], y_score[:, i])
    roc_auc[i] = auc(fpr[i], tpr[i])
lw = 2  # line width used for all curves
colors = cycle(['blue', 'red', 'green'])
for i, color in zip(range(n_classes), colors):
    plt.plot(fpr[i], tpr[i], color=color, lw=lw,
             label='ROC curve of class {0} (area = {1:0.2f})'
             ''.format(i, roc_auc[i]))
plt.plot([0, 1], [0, 1], 'k--', lw=lw)
plt.xlim([-0.05, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic for multi-class data')
plt.legend(loc="lower right")
plt.show()
import matplotlib.pyplot as plt
from sklearn import datasets
from itertools import cycle
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import label_binarize
from sklearn.metrics import roc_curve, auc
from sklearn.multiclass import OneVsRestClassifier
from sklearn.tree import DecisionTreeClassifier
iris = datasets.load_iris()
X = iris.data
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.5, random_state=0)
# Train on the original multi-class labels
classifier = DecisionTreeClassifier(random_state=0)
y_score = classifier.fit(X_train, y_train).predict_proba(X_test)
# Binarize the test labels for the ROC calculation only
y_test_bin = label_binarize(y_test, classes=[0, 1, 2])
n_classes = y_test_bin.shape[1]
fpr = dict()
tpr = dict()
roc_auc = dict()
for i in range(n_classes):
    fpr[i], tpr[i], _ = roc_curve(y_test_bin[:, i], y_score[:, i])
    roc_auc[i] = auc(fpr[i], tpr[i])
colors = cycle(['blue', 'red', 'green'])
for i, color in zip(range(n_classes), colors):
    plt.plot(fpr[i], tpr[i], color=color,
             label='ROC curve of class {0} (area = {1:0.2f})'
             ''.format(i, roc_auc[i]))
plt.plot([0, 1], [0, 1], 'k--')
plt.xlim([-0.05, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic for multi-class data')
plt.legend(loc="lower right")
plt.show()
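The multi-label behaviour described in this answer can be checked directly by comparing what predict_proba returns in the two setups. This is a small sketch on the Iris data; the variable names multi_label and multi_class are illustrative only.

```python
from sklearn.datasets import load_iris
from sklearn.preprocessing import label_binarize
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
y_bin = label_binarize(y, classes=[0, 1, 2])  # shape (150, 3)

# Fitted on binarized labels: the tree treats each column as a separate
# output, so predict_proba returns a list with one array per column
multi_label = DecisionTreeClassifier(random_state=0).fit(X, y_bin).predict_proba(X)

# Fitted on the original labels: a single (n_samples, n_classes) array
multi_class = DecisionTreeClassifier(random_state=0).fit(X, y).predict_proba(X)

print(type(multi_label), len(multi_label))
print(multi_class.shape)
```

The second form is the one whose columns can be passed to roc_curve one class at a time, paired with the binarized test labels.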
I was trying to plot a ROC curve by following the documentation provided by sklearn. My data is in a CSV file, and it has two classes, 'Good' and 'Bad'.
[screenshot of my CSV file]
And my code looks like this:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from itertools import cycle
import sys
from sklearn import svm, datasets
from sklearn.metrics import roc_curve, auc
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import label_binarize
from sklearn.multiclass import OneVsRestClassifier
from scipy import interp
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import MultinomialNB
# Import some data to play with
df = pd.read_csv("E:\\autodesk\\TTI ROC curve.csv")
X = df[['TTI', 'Max TemperatureF', 'Mean TemperatureF', 'Min TemperatureF', ' Min Humidity']].values
y = df['TTI_Category'].values  # .values replaces the deprecated .as_matrix()
# Binarize the output
y = label_binarize(y, classes=['Good', 'Bad'])
n_classes = y.shape[1]
# shuffle and split training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.5,
                                                    random_state=0)
# Learn to predict each class against the other
classifier = OneVsRestClassifier(svm.SVC(kernel='linear', probability=True,
                                         random_state=random_state))
y_score = classifier.fit(X_train, y_train).decision_function(X_test)
# Compute ROC curve and ROC area for each class
fpr = dict()
tpr = dict()
roc_auc = dict()
for i in range(n_classes):
    fpr[i], tpr[i], _ = roc_curve(y_test[:, i], y_score[:, i])
    roc_auc[i] = auc(fpr[i], tpr[i])
# Compute micro-average ROC curve and ROC area
fpr["micro"], tpr["micro"], _ = roc_curve(y_test.ravel(), y_score.ravel())
roc_auc["micro"] = auc(fpr["micro"], tpr["micro"])
plt.figure()
lw = 2
plt.plot(fpr[2], tpr[2], color='darkorange',
lw=lw, label='ROC curve (area = %0.2f)' % roc_auc[2])
plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic example')
plt.legend(loc="lower right")
plt.show()
When I run this code, the system tells me random_state is not defined, so I changed it to random_state=True. Then the system told me:
plt.plot(fpr[2], tpr[2], color='darkorange', KeyError: 2 <matplotlib.figure.Figure at 0xd8bff60>
If I print n_classes, I get 1, whereas printing n_classes in the documentation example gives 3. I'm not sure if that's where the problem is. Does anyone have an answer to this traceback?
Looks like you simply don't understand how your data is structured and how your code should work.
LabelBinarizer will return a one-v-all encoding, meaning that for two classes you will get the following mapping: ['good', 'bad', 'good'] -> [[1], [0], [1]], s.t. n_classes = 1.
Why would you expect it to be 3 if you have 2 classes?
Simply change plt.plot(fpr[2], tpr[2], color='darkorange', lw=lw, label='ROC curve (area = %0.2f)' % roc_auc[2]) to plt.plot(fpr[0], tpr[0], color='darkorange', lw=lw, label='ROC curve (area = %0.2f)' % roc_auc[0]) and you should be good.
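The mapping described in this answer can be checked in a few lines. The sketch below uses made-up labels; the third class 'ok' is added only to show the contrast with the three-class documentation example.

```python
from sklearn.preprocessing import label_binarize

# Two classes: a single column, so y.shape[1] == 1
two = label_binarize(['good', 'bad', 'good'], classes=['bad', 'good'])

# Three classes: one column per class, so y.shape[1] == 3
three = label_binarize(['good', 'bad', 'ok'], classes=['bad', 'good', 'ok'])

print(two.shape)
print(three.shape)
```

With a single column, the only valid index into the fpr, tpr, and roc_auc dictionaries built by the loop is 0.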
Just look in the tpr and fpr dictionaries and you will see that you don't have a tpr[2] or fpr[2]. n_classes = y.shape[1] gives the number of columns in the binarized label array; with two classes, label_binarize returns a single column, so n_classes is 1 and the only key in your tpr and fpr dictionaries is 0.
You are overcomplicating things by using a multi-class approach when you only have 2 classes (binary classification). I think you are using this tutorial.
I would advise replacing the following:
# Compute ROC curve and ROC area for each class
fpr = dict()
tpr = dict()
roc_auc = dict()
for i in range(n_classes):
fpr[i], tpr[i], _ = roc_curve(y_test[:, i], y_score[:, i])
roc_auc[i] = auc(fpr[i], tpr[i])
# Compute micro-average ROC curve and ROC area
fpr["micro"], tpr["micro"], _ = roc_curve(y_test.ravel(), y_score.ravel())
roc_auc["micro"] = auc(fpr["micro"], tpr["micro"])
plt.figure()
lw = 2
plt.plot(fpr[2], tpr[2], color='darkorange',
lw=lw, label='ROC curve (area = %0.2f)' % roc_auc[2])
With something like:
# roc_curve returns three arrays; y_score must hold one positive-class score per sample
fpr, tpr, _ = roc_curve(y_test.ravel(), np.ravel(y_score))
roc_auc = auc(fpr, tpr)
plt.figure()
lw = 2
plt.plot(fpr, tpr, color='darkorange', lw=lw, label='ROC curve (area = %0.2f)' % roc_auc)
I've recently been struggling with using sklearn for my project.
I wanted to build a classifier that sorts my data into six groups. The total sample size was 88, which I split into train (66) and test (22).
I did exactly as the sklearn documentation showed; here is my code:
from sklearn.multiclass import OneVsRestClassifier
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis as QDA
clf = OneVsRestClassifier(QDA())
QDA_score = clf.fit(train,label).decision_function(test)
from sklearn.metrics import roc_curve, auc
fpr = dict()
tpr = dict()
roc_auc = dict()
for i in range(3):
    fpr[i], tpr[i], _ = roc_curve(label_test[:, i], QDA_score[:, i])
    roc_auc[i] = auc(fpr[i], tpr[i])
from itertools import cycle
import matplotlib.pyplot as plt
plt.figure()
lw = 2
colors = cycle(['aqua', 'darkorange', 'cornflowerblue'])
for i, color, n in zip(range(3), colors, ['_000', '_15_30_45', '60']):
    plt.plot(fpr[i], tpr[i], color=color, lw=lw,
             label='ROC curve of {0} (area = {1:0.2f})'
             ''.format(n, roc_auc[i]))
plt.plot([0, 1], [0, 1], 'k--', lw=lw)
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC for multi-classes')
plt.legend(loc="lower right")
plt.show()
[link to my result plot]
However, every time I run the code the result changes. I'm wondering if there is any way to combine this with cross-validation and compute an average, stable ROC for each class.
Thanks!
You can use cross_val_predict to first get the cross-validated probabilities and then plot the ROC curve for each class.
Example using Iris data
import matplotlib.pyplot as plt
from itertools import cycle
from sklearn import svm, datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import label_binarize
from sklearn.metrics import roc_curve, auc
from sklearn.multiclass import OneVsRestClassifier
from sklearn.model_selection import cross_val_predict
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis as QDA
iris = datasets.load_iris()
X = iris.data
y = iris.target
# Binarize the output
y_bin = label_binarize(y, classes=[0, 1, 2])
n_classes = y_bin.shape[1]
clf = OneVsRestClassifier(QDA())
y_score = cross_val_predict(clf, X, y, cv=10 ,method='predict_proba')
fpr = dict()
tpr = dict()
roc_auc = dict()
for i in range(n_classes):
    fpr[i], tpr[i], _ = roc_curve(y_bin[:, i], y_score[:, i])
    roc_auc[i] = auc(fpr[i], tpr[i])
lw = 2  # line width used for all curves
colors = cycle(['blue', 'red', 'green'])
for i, color in zip(range(n_classes), colors):
    plt.plot(fpr[i], tpr[i], color=color, lw=lw,
             label='ROC curve of class {0} (area = {1:0.2f})'
             ''.format(i, roc_auc[i]))
plt.plot([0, 1], [0, 1], 'k--', lw=lw)
plt.xlim([-0.05, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic for multi-class data')
plt.legend(loc="lower right")
plt.show()
To get the ROC for each Fold do this:
import numpy as np
import matplotlib.pyplot as plt
from itertools import cycle
from sklearn import svm, datasets
from sklearn.metrics import roc_curve, auc
from sklearn.model_selection import StratifiedKFold
iris = datasets.load_iris()
X = iris.data
y = iris.target
# Keep only two classes to get a binary problem
X, y = X[y != 2], y[y != 2]
n_samples, n_features = X.shape
# Add noisy features
random_state = np.random.RandomState(0)
X = np.c_[X, random_state.randn(n_samples, 200 * n_features)]
# Classification and ROC analysis
# Run classifier with cross-validation and plot ROC curves
cv = StratifiedKFold(n_splits=6)
classifier = svm.SVC(kernel='linear', probability=True,
                     random_state=random_state)
tprs = []
aucs = []
mean_fpr = np.linspace(0, 1, 100)
i = 0
for train, test in cv.split(X, y):
    probas_ = classifier.fit(X[train], y[train]).predict_proba(X[test])
    # Compute ROC curve and area under the curve
    fpr, tpr, thresholds = roc_curve(y[test], probas_[:, 1])
    # Interpolate onto the common FPR grid (scipy.interp was a deprecated alias of np.interp)
    tprs.append(np.interp(mean_fpr, fpr, tpr))
    tprs[-1][0] = 0.0
    roc_auc = auc(fpr, tpr)
    aucs.append(roc_auc)
    plt.plot(fpr, tpr, lw=1, alpha=0.3,
             label='ROC fold %d (AUC = %0.2f)' % (i, roc_auc))
    i += 1
plt.plot([0, 1], [0, 1], linestyle='--', lw=2, color='r',
         label='Luck', alpha=.8)
mean_tpr = np.mean(tprs, axis=0)
mean_tpr[-1] = 1.0
mean_auc = auc(mean_fpr, mean_tpr)
std_auc = np.std(aucs)
plt.plot(mean_fpr, mean_tpr, color='b',
         label=r'Mean ROC (AUC = %0.2f $\pm$ %0.2f)' % (mean_auc, std_auc),
         lw=2, alpha=.8)
std_tpr = np.std(tprs, axis=0)
tprs_upper = np.minimum(mean_tpr + std_tpr, 1)
tprs_lower = np.maximum(mean_tpr - std_tpr, 0)
plt.fill_between(mean_fpr, tprs_lower, tprs_upper, color='grey', alpha=.2,
                 label=r'$\pm$ 1 std. dev.')
plt.xlim([-0.05, 1.05])
plt.ylim([-0.05, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic example')
plt.legend(loc="lower right")
plt.show()
It is hard to tell without more details about the data and the complexity of the problem you are trying to solve, but irregular learning performance like yours could indicate that your dataset is too small for the irregularity and complexity of the data, so that every time you sample you get a different train dataset.
A common technique for stabilizing train/test results that you could also look into is k-fold cross-validation.
UPDATE:
K-fold cross-validation basically slices the data into k parts and then runs the learning process k times, averaging the results; each time a different part of the data is the test dataset and the remaining k-1 parts are the train dataset.
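The procedure described above can be sketched as follows. This is a minimal illustration, not the asker's setup: the Iris data, the LogisticRegression model, and accuracy as the score are all assumptions chosen to keep the example short.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

X, y = load_iris(return_X_y=True)

# Slice the data into k = 5 parts
kf = KFold(n_splits=5, shuffle=True, random_state=0)

scores = []
test_counts = np.zeros(len(y), dtype=int)  # how often each sample is tested
for train_idx, test_idx in kf.split(X):
    # Train on the other k-1 parts, test on the held-out part
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])
    scores.append(model.score(X[test_idx], y[test_idx]))
    test_counts[test_idx] += 1

# Average the k results; the spread shows how stable the estimate is
print(np.mean(scores), np.std(scores))
```

Every sample ends up in the test set exactly once, which is why the averaged score depends less on one particular split than a single train/test split does.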
UPDATE
I randomly and independently shuffled the data as @Paul suggested, and my classifier has random performance now.
I have an imbalanced dataset with around 200,000 instances and 50 predictors. The imbalance has a 4:1 ratio in favor of the negative class (i.e. class 0). In other words, the negative class makes up around 80% of the samples and the positive class just 20%.
It's a binary classification problem where I have a target vector of 0's and 1's.
I have been trying to fit several classifiers, such as logistic regression and random forest.
I evaluate them with cross-validation, skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=999), and ROC curves from roc_curve in Python sklearn v0.18.
My Problem
My ROC curves for each validation fold are almost the same and I have no idea why. The AUC is identical and always absurdly good (0.9), although the precision-recall curve shows a worse AUC of 0.74 (which I think is more accurate).
I tried following this example for ROC with cross-validation: http://lijiancheng0614.github.io/scikit-learn/auto_examples/model_selection/plot_roc_crossval.html#example-model-selection-plot-roc-crossval-py
ROC curves Logistic Regression
ROC curves Confidence Interval Logistic Regression [zoomed in]
Precision Recall curves
The question:
Why does the performance of the model seem to be similar on each fold? shouldn't the AUC differ at least slightly?
Code Below
import numpy as np
import matplotlib.pyplot as plt
from sklearn import linear_model
from sklearn.metrics import roc_curve, auc
from sklearn.model_selection import StratifiedKFold
from sklearn.utils import shuffle

X, y = shuffle(X, y, random_state=0)
classifier = linear_model.LogisticRegression(class_weight="balanced")
classifier.fit(X, y)
fig, ax1 = plt.subplots(figsize=(12, 8))
mean_tpr = 0.0
mean_fpr = np.linspace(0, 1, 100)
skf = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=999)
for i, (train_index, test_index) in enumerate(skf.split(X, y)):
    # calculate the probability of each class assuming it to be positive
    probas_ = classifier.fit(X[train_index], y[train_index]).predict_proba(X[test_index])
    # Compute ROC curve and area under the curve
    fpr, tpr, thresholds = roc_curve(y[test_index], probas_[:, 1], pos_label=1)
    mean_tpr += np.interp(mean_fpr, fpr, tpr)
    mean_tpr[0] = 0.0
    roc_auc = auc(fpr, tpr)
    plt.plot(fpr, tpr, lw=1, label='ROC fold %d (area = %0.2f)' % (i+1, roc_auc))
plt.plot([0, 1], [0, 1], '--', color=(0.6, 0.6, 0.6), label='Random', lw=2)
mean_tpr /= n_folds
mean_tpr[-1] = 1.0
mean_auc = auc(mean_fpr, mean_tpr)
plt.plot(mean_fpr, mean_tpr, 'k--',
         label='Mean ROC (area = %0.2f)' % mean_auc, lw=3)
plt.xlim([-0.05, 1.05])
plt.ylim([-0.05, 1.05])
plt.xlabel('False Positive Rate (1- specificity)', fontsize=18)
plt.ylabel('True Positive Rate (sensitivity)', fontsize=18)
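One way to look at the fold-to-fold variation numerically, rather than by eye on overlapping curves, is to collect one ROC AUC and one precision-recall summary per fold. The sketch below is an illustration under assumptions: the data is synthetic with roughly the 80/20 class ratio described above, and average_precision_score is used as the precision-recall summary, which is related to but not identical to a trapezoidal area under the precision-recall curve.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import StratifiedKFold

# Synthetic imbalanced data (assumption: stands in for the real dataset)
X, y = make_classification(n_samples=5000, n_features=20,
                           weights=[0.8, 0.2], random_state=0)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=999)
roc_aucs, pr_scores = [], []
for train_idx, test_idx in skf.split(X, y):
    clf = LogisticRegression(class_weight="balanced", max_iter=1000)
    clf.fit(X[train_idx], y[train_idx])
    proba = clf.predict_proba(X[test_idx])[:, 1]  # positive-class scores
    roc_aucs.append(roc_auc_score(y[test_idx], proba))
    pr_scores.append(average_precision_score(y[test_idx], proba))

print("ROC AUC per fold:", np.round(roc_aucs, 4))
print("Average precision per fold:", np.round(pr_scores, 4))
```

Printing four decimals instead of two makes small differences between folds visible; with many samples per fold, the per-fold estimates tend to vary little.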
Following up from here: Converting a 1D array into a 2D class-based matrix in python
I want to draw ROC curves for each of my 46 classes. I have 300 test samples for which I've run my classifier to make a prediction.
y_test is the true classes, and y_pred is what my classifier predicted.
Here's my code:
from sklearn.metrics import confusion_matrix, roc_curve, auc
from sklearn.preprocessing import label_binarize
import numpy as np
classes = list(range(46))  # class labels 0..45
y_test_bi = label_binarize(y_test, classes=classes)
y_pred_bi = label_binarize(y_pred, classes=classes)
# Compute ROC curve and ROC area for each class
fpr = dict()
tpr = dict()
roc_auc = dict()
for i in range(2):
    fpr[i], tpr[i], _ = roc_curve(y_test_bi, y_pred_bi)
    roc_auc[i] = auc(fpr[i], tpr[i])
However, now I'm getting the following error:
Traceback (most recent call last):
File "C:\Users\app\Documents\Python Scripts\gbc_classifier_test.py", line 152, in <module>
fpr[i], tpr[i], _ = roc_curve(y_test_bi, y_pred_bi)
File "C:\Users\app\Anaconda\lib\site-packages\sklearn\metrics\metrics.py", line 672, in roc_curve
fps, tps, thresholds = _binary_clf_curve(y_true, y_score, pos_label)
File "C:\Users\app\Anaconda\lib\site-packages\sklearn\metrics\metrics.py", line 505, in _binary_clf_curve
y_true = column_or_1d(y_true)
File "C:\Users\app\Anaconda\lib\site-packages\sklearn\utils\validation.py", line 265, in column_or_1d
raise ValueError("bad input shape {0}".format(shape))
ValueError: bad input shape (300L, 46L)
roc_curve takes parameter with shape [n_samples] (link), and your inputs (either y_test_bi or y_pred_bi) are of shape (300, 46). Note the first
I think the problem is that y_pred_bi is an array of probabilities, created by calling clf.predict_proba(X) (please confirm this). Since your classifier was trained on all 46 classes, it outputs a 46-dimensional vector for each data point, and there is nothing label_binarize can do about that.
I know of two ways around this:
Train 46 binary classifiers by invoking label_binarize before clf.fit() and then compute the ROC curve for each one.
Slice each column of the 300-by-46 output array and pass that as the second parameter to roc_curve. This is my preferred approach, but I am assuming y_pred_bi contains probabilities.
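The second option, slicing columns, can be sketched as follows. The example is an illustration under assumptions: a synthetic 4-class dataset and a LogisticRegression model stand in for the 46-class problem and the original classifier, and all names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import auc, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import label_binarize

# Synthetic multi-class data (assumption: 4 classes instead of 46)
X, y = make_classification(n_samples=600, n_features=20, n_informative=10,
                           n_classes=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, stratify=y, random_state=0)

# Train ONE multi-class classifier and get an (n_samples, n_classes) score array
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_score = clf.predict_proba(X_test)

# Binarize only the true labels, in the same class order as the score columns
y_test_bi = label_binarize(y_test, classes=clf.classes_)

fpr, tpr, roc_auc = {}, {}, {}
for i in range(len(clf.classes_)):
    # Column i of both arrays is a 1-D, per-sample vector for class i
    fpr[i], tpr[i], _ = roc_curve(y_test_bi[:, i], y_score[:, i])
    roc_auc[i] = auc(fpr[i], tpr[i])

print(roc_auc)
```

Each call to roc_curve now receives two 1-D arrays of length n_samples, which is the shape the function expects.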
Use label_binarize:
import matplotlib.pyplot as plt
from itertools import cycle
from sklearn import svm, datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import label_binarize
from sklearn.metrics import roc_curve, auc
from sklearn.multiclass import OneVsRestClassifier
iris = datasets.load_iris()
X = iris.data
y = iris.target
# Binarize the output
y = label_binarize(y, classes=[0, 1, 2])
n_classes = y.shape[1]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.5, random_state=0)
classifier = OneVsRestClassifier(svm.SVC(kernel='linear', probability=True,
                                         random_state=0))
y_score = classifier.fit(X_train, y_train).decision_function(X_test)
fpr = dict()
tpr = dict()
roc_auc = dict()
for i in range(n_classes):
    fpr[i], tpr[i], _ = roc_curve(y_test[:, i], y_score[:, i])
    roc_auc[i] = auc(fpr[i], tpr[i])
lw = 2  # line width used for all curves
colors = cycle(['blue', 'red', 'green'])
for i, color in zip(range(n_classes), colors):
    plt.plot(fpr[i], tpr[i], color=color, lw=lw,
             label='ROC curve of class {0} (area = {1:0.2f})'
             ''.format(i, roc_auc[i]))
plt.plot([0, 1], [0, 1], 'k--', lw=lw)
plt.xlim([-0.05, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic for multi-class data')
plt.legend(loc="lower right")
plt.show()