UPDATE
I randomly & independently shuffled the data as #Paul suggest and my classifier has random performance now .
I have an imbalanced dataset with around 200.000 instances and 50 predictors. The imbalance has a 4:1 ratio for the negative class (i.e class 0). In other words the negative class makes around 80% of the samples and the positive just 20% of the samples.
It's a binary classification problem where I have a target vector with 0's and 1's.
I have been trying to fit several classifiers like logistic regression and random forest.
I evaluate them with cross -validation skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=999)and ROC roc_curve from Python sklearn v.018
My Problem
My ROC curve for each validation fold are almost the same and I have no idea why. The AUC is identical and always absurdly good (0.9). Although the Precision-Recall Curve shows worse AUC=0.74 (which I think it's more accurate).
I tried following this example for ROC with cross-validation: http://lijiancheng0614.github.io/scikit-learn/auto_examples/model_selection/plot_roc_crossval.html#example-model-selection-plot-roc-crossval-py
ROC curves Logistic Regression
ROC curves Confidence Interval Logistic Regression [zoomed in]
Precision Recall curves
The question:
Why does the performance of the model seem to be similar on each fold? shouldn't the AUC differ at least slightly?
Code Below
X, y = shuffle(X, y, random_state=0)
clasifier = linear_model.LogisticRegression(class_weight = "balanced")
clasifier.fit(X,y)
fig, ax1 = plt.subplots(figsize=(12, 8))
mean_tpr = 0.0
mean_fpr = linspace(0, 1, 100)
skf = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=999)
for i, (train_index, test_index) in enumerate(skf.split(X,y)):
# calculate the probability of each class assuming it to be positive
probas_ = classifier.fit(X[train_index], y[train_index]).predict_proba(X[test_index])
# Compute ROC curve and area under the curve
fpr, tpr, thresholds = roc_curve(y[test_index], probas_[:, 1], pos_label=1)
mean_tpr += interp(mean_fpr, fpr, tpr)
mean_tpr[0] = 0.0
roc_auc = auc(fpr, tpr)
plt.plot(fpr, tpr, lw=1, label='ROC fold %d (area = %0.2f)' % (i+1, roc_auc))
plt.plot([0, 1], [0, 1], '--', color=(0.6, 0.6, 0.6), label='Random', lw=2)
mean_tpr /= n_folds
mean_tpr[-1] = 1.0
mean_auc = auc(mean_fpr, mean_tpr)
plt.plot(mean_fpr, mean_tpr, 'k--',
label='Mean ROC (area = %0.2f)' % mean_auc, lw=3)
plt.xlim([-0.05, 1.05])
plt.ylim([-0.05, 1.05])
plt.xlabel('False Positive Rate (1- specificity)', fontsize=18)
plt.ylabel('True Positive Rate (sensitivity)', fontsize=18)
Related
I am trying to plot the mean ROC curve of a support vector machine (SVM) model with a linear kernel over 10 runs. The code fits the SVM model to the training data, and generates the ROC curve and its corresponding area under the curve (AUC) for each run. The mean ROC curve is then computed using the mean false positive rate (mean_fpr) and mean true positive rate (mean_tpr) obtained from all 10 runs. However, the resulting plot does not start at the origin (0, 0), indicating that there is an issue with the computation of mean_fpr and mean_tpr.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer
from sklearn.svm import SVC
import seaborn as sns
import matplotlib.pyplot as plt
# Read the csv file
df = load_breast_cancer(as_frame=True)
# Split the data into features (X) and target (y)
X = df['data']
y = df['target']
from sklearn.metrics import roc_curve, auc
# Number of runs
#random.seed(321)
n_runs = 10
# Lists to store the results
aucs = []
tprs = []
fprs = []
for i in range(n_runs):
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=None)
svclassifier = SVC(kernel='linear', random_state=i)
svclassifier.fit(X_train, y_train)
y_pred = svclassifier.predict(X_test)
y_score = svclassifier.decision_function(X_test)
fpr, tpr, thresholds = roc_curve(y_test, y_score)
fprs.append(fpr)
tprs.append(tpr)
roc_auc = auc(fpr, tpr)
aucs.append(roc_auc)
# Mean ROC curve
mean_fpr = np.unique(np.concatenate(fprs))
mean_tpr = np.unique(np.concatenate(tprs))
mean_tpr = np.zeros_like(mean_fpr)
for i in range(n_runs):
mean_tpr += np.interp(mean_fpr, fprs[i], tprs[i])
mean_tpr /= n_runs
mean_auc = auc(mean_fpr, mean_tpr)
# Plot the mean ROC curve
sns.lineplot(x=mean_fpr, y=mean_tpr, ci=None, label='Mean ROC (AUC = %0.2f)' % mean_auc)
plt.xlim([-0.1, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.show()
The problematic line of code is as follows:
for i in range(n_runs):
mean_tpr += np.interp(mean_fpr, fprs[i], tprs[i])
Can anyone help me identify and fix the issue with the mean_fpr and mean_tpr values so that the resulting plot starts at (0, 0)?
I have a data having 3 class labels(0,1,2). I tried to make ROC curve. and did it by using pos_label parameter.
fpr, tpr, thresholds = metrics.roc_curve(Ytest, y_pred_prob, pos_label = 0)
By changing pos_label to 0,1,2- I get 3 graphs, Now I am having issue in calculating AUC score.
How can I average the 3 graphs and plot 1 graph from it and then calculate the Roc_AUC score.
i am having error in by this
metrics.roc_auc_score(Ytest, y_pred_prob)
ValueError: multiclass format is not supported
please help me.
# store the predicted probabilities for class 0
y_pred_prob = cls.predict_proba(Xtest)[:, 0]
#first argument is true values, second argument is predicted probabilities
fpr, tpr, thresholds = metrics.roc_curve(Ytest, y_pred_prob, pos_label = 0)
plt.plot(fpr, tpr)
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.title('ROC curve classifier')
plt.xlabel('False Positive Rate (1 - Specificity)')
plt.ylabel('True Positive Rate (Sensitivity)')
plt.grid(True)
# store the predicted probabilities for class 1
y_pred_prob = cls.predict_proba(Xtest)[:, 1]
#first argument is true values, second argument is predicted probabilities
fpr, tpr, thresholds = metrics.roc_curve(Ytest, y_pred_prob, pos_label = 0)
plt.plot(fpr, tpr)
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.title('ROC curve classifier')
plt.xlabel('False Positive Rate (1 - Specificity)')
plt.ylabel('True Positive Rate (Sensitivity)')
plt.grid(True)
# store the predicted probabilities for class 2
y_pred_prob = cls.predict_proba(Xtest)[:, 2]
#first argument is true values, second argument is predicted probabilities
fpr, tpr, thresholds = metrics.roc_curve(Ytest, y_pred_prob, pos_label = 0)
plt.plot(fpr, tpr)
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.title('ROC curve classifier')
plt.xlabel('False Positive Rate (1 - Specificity)')
plt.ylabel('True Positive Rate (Sensitivity)')
plt.grid(True)
from the above code. 3 roc curves are generated. Due to multi-classes.
I want to have a one roc curve from above 3 by taking average or mean. Then, one roc_auc score from that.
Highlights in the multi-class AUC:
You cannot calculate a common AUC for all classes. You must calculate the AUC for each class separately. Just as you have to calculate the recall, precision is separate for each class when making a multi-class classification.
THE SIMPLEST method of calculating the AUC for individual classes:
We choose a classifier
from sklearn.linear_model import LogisticRegression
LRE = LogisticRegression(solver='lbfgs')
LRE.fit(X_train, y_train)
I am making a list of multi-class classes
d = y_test.unique()
class_name = list(d.flatten())
class_name
Now calculate the AUC for each class separately
for p in class_name:
`fpr, tpr, thresholds = metrics.roc_curve(y_test,
LRE.predict_proba(X_test)[:,1], pos_label = p)
auroc = round(metrics.auc(fpr, tpr),2)
print('LRE',p,'--AUC--->',auroc)`
For multiclass, it is often useful to calculate the AUROC for each class. For example, here's an excerpt from some code I use to calculate AUROC for each class separately, where label_meanings is a list of strings describing what each label is, and the various arrays are formatted such that each row is a different example and each column corresponds to a different label:
for label_number in range(len(label_meanings)):
which_label = label_meanings[label_number] #descriptive string for the label
true_labels = true_labels_array[:,label_number]
pred_probs = pred_probs_array[:,label_number]
#AUROC and AP (sliding across multiple decision thresholds)
fpr, tpr, thresholds = sklearn.metrics.roc_curve(y_true = true_labels,
y_score = pred_probs,
pos_label = 1)
auc = sklearn.metrics.auc(fpr, tpr)
If you want to plot an average AUC curve across your three classes: This code https://scikit-learn.org/stable/auto_examples/model_selection/plot_roc.html includes parts that calculate the average AUC so that you can make a plot (if you have three classes, it will plot the average AUC for the three classes.)
If you just want an average AUC across your three classes: once you have calculated the AUC of each class separately you can average the three numbers to get an overall AUC.
If you want more background on AUROC and how it is calculated for single class versus multi class you can see this article, Measuring Performance: AUC (AUROC).
I have some data labeled as either a 0 or 1 and I am trying to predict these classes using a random forest. Each instance is labeled with 20 features that are used to train the random forest (~30.000 training instances and ~6000 test instances.
I am plotting the precision-recall and ROC curves using the following code:
precision, recall, _ = precision_recall_curve(y_test, y_pred)
plt.step(recall, precision, color='b', alpha=0.2,where='post')
plt.fill_between(recall, precision, step='post', alpha=0.2, color='b')
fpr, tpr, _ = roc_curve(y_test, y_pred)
roc_auc = auc(fpr, tpr)
plt.plot(fpr, tpr, color='darkorange', lw=2, label='ROC curve (area = %0.2f)' % roc_auc)
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
All the PR and ROC curves I have seen thus far always have a jagged/smooth decline in precision/recall and a smooth/jagged increase in the ROC line. But my PR and ROC curves for some reason always look like this:
For some reason the only have a single point where they change direction. Is this due to a coding error by me or something inherent about the data/classification problem? If so, how can this behavior be explained?
I suspect you used the RandomForestClassifier.predict() method which results in either 0 or 1 depending on the predicted class.
To get the probability, which is the fraction of trees voted for a specific class, you have to use the RandomForestClassifier.predict_proba() method.
Using these probabilities as input for your curve calculations should fix the problem.
EDIT: The curve creation methods of scikit-learn sort the predictions first according to the prediction score, then according to their real/observed value, therefore the curves have these "bends".
Inside the precision_recall_curve, the y_pred must be the probabilities of the target class AND NOT the actual predicted class.
Since you are using a RandomForestClassifier, use predict_proba(X) to get the probabilities.
rf = RandomForestClassifier()
probas_pred = rf.predict_proba(X_test)
precision, recall, _ = precision_recall_curve(y_true, probas_pred)
plt.step(recall, precision, color='b', alpha=0.2,where='post')
plt.fill_between(recall, precision, step='post', alpha=0.2, color='b')
I was trying to plot a ROC curve by using the documentation provided by sklearn. My data is in a CSV file, and it looks like this.It has two classes 'Good'and 'Bad'
screenshot of my CSV file
And my code looks like this
import numpy as np
import matplotlib.pyplot as plt
from itertools import cycle
import sys
from sklearn import svm, datasets
from sklearn.metrics import roc_curve, auc
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import label_binarize
from sklearn.multiclass import OneVsRestClassifier
from scipy import interp
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import MultinomialNB
# Import some data to play with
df = pd.read_csv("E:\\autodesk\\TTI ROC curve.csv")
X =df[['TTI','Max TemperatureF','Mean TemperatureF','Min TemperatureF',' Min Humidity']].values
y = df['TTI_Category'].as_matrix()
# Binarize the output
y = label_binarize(y, classes=['Good','Bad'])
n_classes = y.shape[1]
# shuffle and split training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.5,
random_state=0)
# Learn to predict each class against the other
classifier = OneVsRestClassifier(svm.SVC(kernel='linear', probability=True,
random_state=random_state))
y_score = classifier.fit(X_train, y_train).decision_function(X_test)
# Compute ROC curve and ROC area for each class
fpr = dict()
tpr = dict()
roc_auc = dict()
for i in range(n_classes):
fpr[i], tpr[i], _ = roc_curve(y_test[:, i], y_score[:, i])
roc_auc[i] = auc(fpr[i], tpr[i])
# Compute micro-average ROC curve and ROC area
fpr["micro"], tpr["micro"], _ = roc_curve(y_test.ravel(), y_score.ravel())
roc_auc["micro"] = auc(fpr["micro"], tpr["micro"])
plt.figure()
lw = 2
plt.plot(fpr[2], tpr[2], color='darkorange',
lw=lw, label='ROC curve (area = %0.2f)' % roc_auc[2])
plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic example')
plt.legend(loc="lower right")
plt.show()enter code here
If i run this code the system told me random_state is not defined. so I changed it to random_state=true. Then the system told me
plt.plot(fpr[2], tpr[2], color='darkorange', KeyError: 2 <matplotlib.figure.Figure at 0xd8bff60>
if I print out n_classes. The system told me it's "1", and if I print out the n_classes in the documentation it says 3. I'm not sure if that's where the problem is. Does anyone have answer to this traceback?
Looks like you simply don't understand how your data is structured and how your code should work.
LabelBinarizer will return a one-v-all encoding, meaning that for two classes you will get the following mapping: ['good', 'bad', 'good'] -> [[1], [0], [1]], s.t. n_classes = 1.
Why would you expect it to be 3 if you have 2 classes?
Simply change plt.plot(fpr[2], tpr[2], color='darkorange', lw=lw, label='ROC curve (area = %0.2f)' % roc_auc[2]) to plt.plot(fpr[0], tpr[0], color='darkorange', lw=lw, label='ROC curve (area = %0.2f)' % roc_auc[0]) and you should be good.
Just look in tpr and fpr dictionaries and you will see that you don't have a tpr[2] or fpr[2]. n_classes = y.shape[1] shows how many classes you have (= 2), which means that you have keys of 0 and 1 in your tpr and fpr dictionaries.
You are overcomplicating things by using a multi-class approach when you only have 2 classes (binary classification). I think you are using this tutorial.
I would advise replacing the following:
# Compute ROC curve and ROC area for each class
fpr = dict()
tpr = dict()
roc_auc = dict()
for i in range(n_classes):
fpr[i], tpr[i], _ = roc_curve(y_test[:, i], y_score[:, i])
roc_auc[i] = auc(fpr[i], tpr[i])
# Compute micro-average ROC curve and ROC area
fpr["micro"], tpr["micro"], _ = roc_curve(y_test.ravel(), y_score.ravel())
roc_auc["micro"] = auc(fpr["micro"], tpr["micro"])
plt.figure()
lw = 2
plt.plot(fpr[2], tpr[2], color='darkorange',
lw=lw, label='ROC curve (area = %0.2f)' % roc_auc[2])
With something like:
fpr, tpr = roc_curve(y_test.values, y_score[:,1])
roc_auc = auc(fpr,tpr)
plt.figure()
lw = 2
plt.plot(fpr, tpr, color='darkorange', lw=lw, label='ROC curve (area = %0.2f)' % roc_auc)
I have an understanding problem by using the roc libraries.
I want to plot a roc curve with a python
http://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_auc_score.html
I am writing a program which evalutes detectors (haarcascade, neuronal networks) and want to evaluate them.
So I already have the data saved in a file in the following format:
0.5 TP
0.43 FP
0.72 FN
0.82 TN
...
whereas TP means True Positive, FP - False Positivve, FN - False Negative, TN - True Negative
I parse it and fill 4 arrays with this data set.
Then I want to put this in
fpr, tpr = sklearn.metrics.roc_curve(y_true, y_score, average='macro', sample_weight=None)
but how to do this? What is y_true in my case and y_score?
afterwards, I put it fpr, tpr in
auc = sklearn.metric.auc(fpr, tpr)
Quotting Wikipedia:
The ROC is created by plotting the FPR (false positive rate) vs the TPR (true positive rate) at various thresholds settings.
In order to compute FPR and TPR, you must provide the true binary value and the target scores to the function sklearn.metrics.roc_curve.
So in your case, I would do something like this :
from sklearn.metrics import roc_curve
from sklearn.metrics import auc
# Compute fpr, tpr, thresholds and roc auc
fpr, tpr, thresholds = roc_curve(y_true, y_score)
roc_auc = auc(fpr, tpr)
# Plot ROC curve
plt.plot(fpr, tpr, label='ROC curve (area = %0.3f)' % roc_auc)
plt.plot([0, 1], [0, 1], 'k--') # random predictions curve
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.xlabel('False Positive Rate or (1 - Specifity)')
plt.ylabel('True Positive Rate or (Sensitivity)')
plt.title('Receiver Operating Characteristic')
plt.legend(loc="lower right")
If you want to have a deeper understanding of how the False positive rate and the True positive rate are computed for all the possible thresholds values, I suggest you to read this article