I am trying to find the ROC curve and the AUROC for a decision tree. My code was something like:
clf.fit(x,y)
y_score = clf.fit(x,y).decision_function(test[col])
pred = clf.predict_proba(test[col])
print(sklearn.metrics.roc_auc_score(actual,y_score))
fpr,tpr,thre = sklearn.metrics.roc_curve(actual,y_score)
output:
AttributeError: 'DecisionTreeClassifier' object has no attribute 'decision_function'
Basically, the error comes up while computing y_score. Please explain what y_score is and how to solve this problem.
First of all, DecisionTreeClassifier has no decision_function attribute.
Judging from the structure of your code, I guess you followed this example.
In that case the classifier is not the decision tree but the OneVsRestClassifier, and that is what supports the decision_function method.
You can see the available attributes of DecisionTreeClassifier here
A possible way to do it is to binarize the classes and then compute the AUC for each class:
Example:
from sklearn import datasets
from sklearn.metrics import roc_curve, auc
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import label_binarize
from sklearn.tree import DecisionTreeClassifier
iris = datasets.load_iris()
X = iris.data
y = iris.target
y = label_binarize(y, classes=[0, 1, 2])
n_classes = y.shape[1]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.5, random_state=0)
classifier = DecisionTreeClassifier()
# with binarized targets, predict() returns an (n_samples, n_classes) indicator matrix
y_score = classifier.fit(X_train, y_train).predict(X_test)
fpr = dict()
tpr = dict()
roc_auc = dict()
for i in range(n_classes):
    fpr[i], tpr[i], _ = roc_curve(y_test[:, i], y_score[:, i])
    roc_auc[i] = auc(fpr[i], tpr[i])
# Compute micro-average ROC curve and ROC area
fpr["micro"], tpr["micro"], _ = roc_curve(y_test.ravel(), y_score.ravel())
roc_auc["micro"] = auc(fpr["micro"], tpr["micro"])
# AUC for a specific class, here class 2
roc_auc[2]
Result
0.94852941176470573
Note that for a decision tree you can use .predict_proba() instead of .decision_function(), so you will get something like the following:
y_score = classifier.fit(X_train, y_train).predict_proba(X_test)
Then, the rest of the code will be the same.
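One caveat: because the targets above were binarized into a multi-output problem, predict_proba returns a list with one (n_samples, 2) array per class rather than a single matrix. A minimal sketch of the extra stacking step, reusing classifier, X_train, y_train, and X_test from the example:
import numpy as np
# stack the positive-class column of each per-class probability array into
# an (n_samples, n_classes) score matrix that roc_curve can consume
proba_per_class = classifier.fit(X_train, y_train).predict_proba(X_test)
y_score = np.column_stack([p[:, 1] for p in proba_per_class])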
In fact, the roc_curve function from scikit-learn can take two types of input:
"Target scores, can either be probability estimates of the positive class, confidence values, or non-thresholded measure of decisions (as returned by “decision_function” on some classifiers)."
See here for more details.
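So for your original setup, assuming a binary target, a minimal sketch using the names from your question (clf, test, col, actual) would be to take the probability of the positive class as the score:
y_score = clf.predict_proba(test[col])[:, 1]  # probability of the positive class
print(sklearn.metrics.roc_auc_score(actual, y_score))
fpr, tpr, thre = sklearn.metrics.roc_curve(actual, y_score)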
Related
Lasso, although it is a regression algorithm, can be used as a classifier, so there should be a way to plot a ROC curve and find its AUC.
This is my code for building the model and scaling/standardizing the data:
X = data.drop(['Response'], axis = 1)
Y = data.Response
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.25, random_state = 42)
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
scaled_X_train = scaler.fit_transform(X_train)
scaled_X_test = scaler.transform(X_test)
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV
pipline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', Lasso(normalize=True))
])
lasso_model = GridSearchCV(pipline,
                           {'model__alpha': np.arange(0, 3, 0.05)},
                           cv = 10,
                           scoring = 'roc_auc',
                           verbose = 3,
                           n_jobs = -1,
                           error_score = 'raise')
lasso_model.fit(scaled_X_train, Y_train)
Now I am trying to make the ROC curves, one for the training set and one for the test set:
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score
# Make a line for the random classification, AUC = 0.5:
r_probs = [0 for _ in range(len(Y_test))]
r_auc = roc_auc_score(Y_test,r_probs)
r_fpr, r_tpr , _ = roc_curve(Y_test, r_probs)
y_pred_proba = lasso_model.predict_proba(scaled_X_train)[::,1]
fpr, tpr, _ = roc_curve(Y_train, y_pred_proba)
auc = roc_auc_score(Y_train, y_pred_proba)
#create ROC curve
plt.plot(r_fpr, r_tpr, linestyle = '--', label = 'Random Prediction (AUROC = %0.3f)' %r_auc)
plt.plot(fpr,tpr,label="AUC="+str(auc))
plt.title('Train set ROC')
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.legend(loc=4)
plt.show()
I get the error that predict_proba does not exist for lasso regression, since it is not a classification algorithm. So how can I plot a ROC curve?
Lasso is basically a linear model with L1 regularization. You can enable the same kind of regularization for LogisticRegression() by using penalty='l1' for the same effect (see this question, for example).
Alternatively, you can try to duct tape a sigmoid function to your Lasso() output, but that'd be doing the same thing as above with much more effort.
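A minimal sketch of that swap, reusing scaled_X_train, scaled_X_test, and Y_train from your code (pipe and clf are just illustrative names, and LogisticRegression needs a solver that supports L1, e.g. liblinear or saga):
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# L1-penalized logistic regression plays the role of a "classification Lasso";
# C is the inverse of the regularization strength (smaller C = stronger penalty)
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LogisticRegression(penalty='l1', solver='liblinear'))
])
clf = GridSearchCV(pipe,
                   {'model__C': np.logspace(-3, 2, 20)},
                   cv=10, scoring='roc_auc', n_jobs=-1)
clf.fit(scaled_X_train, Y_train)
y_pred_proba = clf.predict_proba(scaled_X_test)[:, 1]  # predict_proba now exists
Then the rest of your plotting code works unchanged.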
I'd like to evaluate my machine learning model. I computed the area under the ROC curve with roc_auc_score() and plotted the ROC curve with the plot_roc_curve() function of sklearn. The second function also computes the AUC and shows it in the plot. My problem is that I get two different values for the AUC.
Here's the reproducible code with sample dataset:
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import plot_roc_curve
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import MinMaxScaler
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
scaler = MinMaxScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
model = MLPClassifier(random_state=42)
model.fit(X_train, y_train)
yPred = model.predict(X_test)
print(roc_auc_score(y_test, yPred))
plot_roc_curve(model, X_test, y_test)
plt.show()
The roc_auc_score function gives me 0.979 and the plot shows 1.00.
Even though the second function takes the model as an argument and makes its own predictions, the outcome should not differ. It is not a rounding error: if I decrease the training iterations to get a worse predictor, the values still differ.
With my real dataset I "achieved" a difference of 0.1 between the two methods.
Where does this discrepancy come from?
You should pass the prediction probabilities to roc_auc_score, and not the predicted classes. Like this:
yPred_p = model.predict_proba(X_test)[:,1]
print(roc_auc_score(y_test, yPred_p))
# output: 0.9983354140657512
When you pass the predicted classes, the AUC is actually calculated for a degenerate curve built from the hard 0/1 labels, which is wrong. Code to regenerate that curve:
from sklearn.metrics import roc_curve, auc
fpr, tpr, _ = roc_curve(y_test, yPred)
roc_auc = auc(fpr, tpr)
plt.plot(fpr, tpr, label='AUC = ' + str(round(roc_auc, 2)))
plt.legend(loc='lower right')
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
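The underlying reason is that the hard class predictions take only two values, so roc_curve has almost no thresholds to sweep. A quick sketch, reusing yPred, model, X_test, and y_test from the code above:
import numpy as np
print(np.unique(yPred))                      # only the hard labels 0 and 1
fpr_c, tpr_c, thr_c = roc_curve(y_test, yPred)
print(len(thr_c))                            # just a few thresholds -> degenerate curve
yPred_p = model.predict_proba(X_test)[:, 1]
fpr_p, tpr_p, thr_p = roc_curve(y_test, yPred_p)
print(len(thr_p))                            # many thresholds -> a proper ROC curve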
I'm trying to analyze the data below. I modeled it with logistic regression first, made predictions, and calculated the accuracy and AUC; then I performed recursive feature selection and calculated the accuracy and AUC again. I thought both would be higher, but they are actually lower after the recursive feature selection. Is that expected, or did I miss something?
Data:
https://github.com/amandawang-dev/census-training/blob/master/census-training.csv
----------------------
for logistic regression, Accuracy: 0.8111649491571692; AUC: 0.824896256487386
after recursive feature selection, Accuracy: 0.8130075752405651; AUC: 0.7997315631730443
import pandas as pd
import numpy as np
from sklearn import preprocessing, metrics
from sklearn.model_selection import train_test_split
train=pd.read_csv('census-training.csv')
train = train.replace('?', np.nan)
for column in train.columns:
    train[column].fillna(train[column].mode()[0], inplace=True)
train['Income'] = train['Income'].str.contains('>50K').astype(int)
train['Gender'] = train['Gender'].str.contains('Male').astype(int)
obj = train.select_dtypes(include=['object']) #all features that are 'object' datatypes
le = preprocessing.LabelEncoder()
for i in range(len(obj.columns)):
    train[obj.columns[i]] = le.fit_transform(train[obj.columns[i]])  # encode categorical features
train_set, test_set = train_test_split(train, test_size=0.3, random_state=42)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report, roc_curve, roc_auc_score
from sklearn.metrics import accuracy_score
log_rgr = LogisticRegression(random_state=0)
X_train=train_set.iloc[:, 0:9]
y_train=train_set.iloc[:, 9:10]
X_test=test_set.iloc[:, 0:9]
y_test=test_set.iloc[:, 9:10]
log_rgr.fit(X_train, y_train)
y_pred = log_rgr.predict(X_test)
lr_acc = accuracy_score(y_test, y_pred)
probs = log_rgr.predict_proba(X_test)
preds = probs[:,1]
print(preds)
from sklearn.preprocessing import label_binarize
y = label_binarize(y_test, classes=[0, 1]) #note to myself: class need to have only 0,1
fpr, tpr, threshold = metrics.roc_curve(y, preds)
roc_auc = roc_auc_score(y_test, preds)
print("Accuracy: {}".format(lr_acc))
print("AUC: {}".format(roc_auc))
from sklearn.feature_selection import RFE
rfe = RFE(log_rgr, n_features_to_select=5)
fit = rfe.fit(X_train, y_train)
X_train_new = fit.transform(X_train)
X_test_new = fit.transform(X_test)
log_rgr.fit(X_train_new, y_train)
y_pred = log_rgr.predict(X_test_new)
lr_acc = accuracy_score(y_test, y_pred)
probs = rfe.predict_proba(X_test)
preds = probs[:,1]
y = label_binarize(y_test, classes=[0, 1])
fpr, tpr, threshold = metrics.roc_curve(y, preds)
roc_auc =roc_auc_score(y_test, preds)
print("Accuracy: {}".format(lr_acc))
print("AUC: {}".format(roc_auc))
There is simply no guarantee that any kind of feature selection (backward, forward, recursive - you name it) will actually lead to better performance in general. None at all. Such tools are there for convenience only - they may work, or they may not. Best guide and ultimate judge is always the experiment.
Apart from some very specific cases in linear or logistic regression (most notably the Lasso, which, not coincidentally, comes from statistics), or somewhat extreme cases with too many features (a.k.a. the curse of dimensionality), even when it works (or doesn't), there is not necessarily much to explain as to why (or why not).
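If you want to let the experiment decide up front, here is a minimal sketch reusing X_train and y_train from your code, comparing cross-validated AUC with and without RFE (the pipeline and the choice of 5 features are illustrative, not a recommendation):
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

base = LogisticRegression(random_state=0, max_iter=1000)
plain_auc = cross_val_score(base, X_train, y_train.values.ravel(),
                            cv=5, scoring='roc_auc').mean()

rfe_pipe = Pipeline([
    ('rfe', RFE(LogisticRegression(random_state=0, max_iter=1000), n_features_to_select=5)),
    ('clf', LogisticRegression(random_state=0, max_iter=1000))
])
rfe_auc = cross_val_score(rfe_pipe, X_train, y_train.values.ravel(),
                          cv=5, scoring='roc_auc').mean()

print(plain_auc, rfe_auc)  # either one may be higher; there is no guarantee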
I've trained a simple random forest algorithm and a bagging classifier (n_estimators = 100). Is it possible to plot the history of accuracy for the BaggingClassifier? How can I calculate the variance over the 100 samples?
I've just printed the accuracy value for both algorithms:
# DecisionTree
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.90)
clf2 = tree.DecisionTreeClassifier()
clf2.fit(X_train, y_train)
pred2 = clf2.predict(X_test)
acc2 = clf2.score(X_test, y_test)
acc2  # 0.6983930778739185

# Bagging
clf3 = BaggingClassifier(tree.DecisionTreeClassifier(), max_samples=0.5, max_features=0.5,
                         n_estimators=100, verbose=2)
clf3.fit(X_train, y_train)
pred3 = clf3.predict(X_test)
acc3 = clf3.score(X_test, y_test)
acc3  # 0.911619283065513
I don't think that you can get this information from the fitted BaggingClassifier. But you can create such a plot by fitting for different n_estimators:
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn import datasets
from sklearn.model_selection import train_test_split
iris = datasets.load_iris()
X, X_test, y, y_test = train_test_split(iris.data,
                                        iris.target,
                                        test_size=0.20)

estimators = list(range(1, 20))
accuracy = []
for n_estimators in estimators:
    clf = BaggingClassifier(DecisionTreeClassifier(max_depth=1),
                            max_samples=0.2,
                            n_estimators=n_estimators)
    clf.fit(X, y)
    acc = clf.score(X_test, y_test)
    accuracy.append(acc)

plt.plot(estimators, accuracy)
plt.xlabel("Number of estimators")
plt.ylabel("Accuracy")
plt.show()
(Of course, the iris dataset is easily fit with just a single DecisionTreeClassifier, so I set max_depth=1 in this example.)
For a statistically meaningful result, you should fit a BaggingClassifier multiple times for each n_estimators and take the average of the obtained accuracies.
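A minimal sketch of that averaging, reusing the imports and the train/test split from above (n_repeats is just an illustrative name):
import numpy as np

n_repeats = 10
mean_accuracy = []
for n_estimators in estimators:
    scores = []
    for _ in range(n_repeats):
        clf = BaggingClassifier(DecisionTreeClassifier(max_depth=1),
                                max_samples=0.2,
                                n_estimators=n_estimators)
        clf.fit(X, y)
        scores.append(clf.score(X_test, y_test))
    mean_accuracy.append(np.mean(scores))

plt.plot(estimators, mean_accuracy)
plt.xlabel("Number of estimators")
plt.ylabel("Mean accuracy over %d fits" % n_repeats)
plt.show()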
I've been running the implementation of the 'Mean Decrease Accuracy' measure that is shown on this website:
In the example the author uses the random forest regressor RandomForestRegressor, but I am using the random forest classifier RandomForestClassifier. Thus, my question is whether I should also use r2_score for measuring accuracy, or whether I should switch to classic accuracy (accuracy_score) or the Matthews correlation coefficient (matthews_corrcoef).
Does anybody know if I should switch or not, and why?
Thanks for any help!
Here is the code from the website in case you are too lazy to click :)
from sklearn.cross_validation import ShuffleSplit
from sklearn.metrics import r2_score
from collections import defaultdict
X = boston["data"]
Y = boston["target"]
rf = RandomForestRegressor()
scores = defaultdict(list)
#crossvalidate the scores on a number of different random splits of the data
for train_idx, test_idx in ShuffleSplit(len(X), 100, .3):
    X_train, X_test = X[train_idx], X[test_idx]
    Y_train, Y_test = Y[train_idx], Y[test_idx]
    r = rf.fit(X_train, Y_train)
    acc = r2_score(Y_test, rf.predict(X_test))
    for i in range(X.shape[1]):
        X_t = X_test.copy()
        np.random.shuffle(X_t[:, i])
        shuff_acc = r2_score(Y_test, rf.predict(X_t))
        scores[names[i]].append((acc-shuff_acc)/acc)
print "Features sorted by their score:"
print sorted([(round(np.mean(score), 4), feat) for
              feat, score in scores.items()], reverse=True)
r2_score is for regression (a continuous response variable), whereas classic classification metrics (for a discrete categorical variable) such as accuracy_score, f1_score, and roc_auc (the last two are most appropriate if you have imbalanced y-labels) are the right choices for your task.
Randomly shuffling each feature in the input data matrix and measuring the decline in these classification metrics sounds like a valid approach to ranking feature importances.
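A minimal sketch of the same loop adapted to a classifier, assuming X, Y, and names are defined as in the quoted snippet (accuracy_score is used here; f1_score or roc_auc_score would slot in the same way):
import numpy as np
from collections import defaultdict
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import ShuffleSplit

rf = RandomForestClassifier()
scores = defaultdict(list)
for train_idx, test_idx in ShuffleSplit(n_splits=100, test_size=0.3).split(X):
    X_train, X_test = X[train_idx], X[test_idx]
    Y_train, Y_test = Y[train_idx], Y[test_idx]
    rf.fit(X_train, Y_train)
    acc = accuracy_score(Y_test, rf.predict(X_test))
    for i in range(X.shape[1]):
        X_t = X_test.copy()
        np.random.shuffle(X_t[:, i])  # permute one feature at a time
        shuff_acc = accuracy_score(Y_test, rf.predict(X_t))
        scores[names[i]].append((acc - shuff_acc) / acc)  # relative drop in accuracy

print(sorted([(round(np.mean(score), 4), feat)
              for feat, score in scores.items()], reverse=True))
Newer scikit-learn versions also provide sklearn.inspection.permutation_importance, which implements essentially this idea and accepts a scoring argument.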