Difference in calculating mean AUC using Scikit-Learn - python

I have the following code:
from sklearn import svm, datasets
from sklearn.metrics import roc_curve, auc
from sklearn.model_selection import StratifiedKFold, cross_val_score
import numpy as np
from scipy import interp
seed = 7
np.random.seed(seed)
iris = datasets.load_iris()
X = iris.data
y = iris.target
X, y = X[y != 2], y[y != 2]
n_samples, n_features = X.shape
# Add noisy features
random_state = np.random.RandomState(0)
X = np.c_[X, random_state.randn(n_samples, 200 * n_features)]
cv = StratifiedKFold(n_splits=10)
classifier = svm.SVC(kernel='linear', probability=True, random_state=seed)
mean_tpr = 0.0
mean_fpr = np.linspace(0, 1, 100)
i = 0
for train, test in cv.split(X, y):
    probas_ = classifier.fit(X[train], y[train]).predict_proba(X[test])
    # Compute ROC curve and area under the curve
    fpr, tpr, thresholds = roc_curve(y[test], probas_[:, 1])
    mean_tpr += interp(mean_fpr, fpr, tpr)
    mean_tpr[0] = 0.0
    roc_auc = auc(fpr, tpr)
    i += 1
mean_tpr /= cv.get_n_splits(X, y)
mean_tpr[-1] = 1.0
mean_auc_1 = auc(mean_fpr, mean_tpr)
print "#--- Method 1 to calculate mean AUC ---"
print mean_auc_1
print "#--- Method 2 to calculate mean AUC ---"
results = cross_val_score(classifier, X, y, cv=cv)
mean_auc_2 = "{:.3f}".format(results.mean())
print mean_auc_2
It produces the following result:
#--- Method 1 to calculate mean AUC ---
0.801818181818
#--- Method 2 to calculate mean AUC ---
0.700
Method 1 calculates the mean AUC through a loop, as suggested by this scikit-learn tutorial.
Method 2 calculates the mean AUC using scikit-learn's built-in cross_val_score() function.
My question is: why the difference? Which mean AUC should I believe?
How should I modify Method 2 so that its result matches Method 1?
I'm using this version of Scikit-Learn:
In [442]: sklearn.__version__
Out[442]: '0.18'

There is no AUC calculation in your second example.
You should specify a scoring function; see the API for cross_val_score.
You are just calculating the average accuracy, which is the default scorer for a classifier. You can check the default score method for the SVM in the documentation.
Something like this
cross_val_score(classifier, X, y, cv=cv, scoring='roc_auc')
should work.
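For completeness, a minimal sketch of Method 2 with that change (classifier, X, y and cv as defined in the question). Even with scoring='roc_auc' the two numbers need not match exactly, because Method 1 takes the AUC of an interpolated mean ROC curve while cross_val_score averages the per-fold AUCs:
# Method 2 with the ROC AUC scorer (a sketch; classifier, X, y, cv as above).
results = cross_val_score(classifier, X, y, cv=cv, scoring='roc_auc')
# cross_val_score averages the AUC computed separately on each fold,
# whereas Method 1 interpolates a mean ROC curve and takes its AUC,
# so a small residual difference is expected.
print("{:.3f}".format(results.mean()))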

Related

Plotting mean ROC curve from multiple ROC curves in Python

I am trying to plot the mean ROC curve of a support vector machine (SVM) model with a linear kernel over 10 runs. The code fits the SVM model to the training data, and generates the ROC curve and its corresponding area under the curve (AUC) for each run. The mean ROC curve is then computed using the mean false positive rate (mean_fpr) and mean true positive rate (mean_tpr) obtained from all 10 runs. However, the resulting plot does not start at the origin (0, 0), indicating that there is an issue with the computation of mean_fpr and mean_tpr.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer
from sklearn.svm import SVC
import seaborn as sns
# Load the breast cancer dataset as a DataFrame
df = load_breast_cancer(as_frame=True)
# Split the data into features (X) and target (y)
X = df['data']
y = df['target']
from sklearn.metrics import roc_curve, auc
# Number of runs
#random.seed(321)
n_runs = 10
# Lists to store the results
aucs = []
tprs = []
fprs = []
for i in range(n_runs):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=None)
    svclassifier = SVC(kernel='linear', random_state=i)
    svclassifier.fit(X_train, y_train)
    y_pred = svclassifier.predict(X_test)
    y_score = svclassifier.decision_function(X_test)
    fpr, tpr, thresholds = roc_curve(y_test, y_score)
    fprs.append(fpr)
    tprs.append(tpr)
    roc_auc = auc(fpr, tpr)
    aucs.append(roc_auc)
# Mean ROC curve
mean_fpr = np.unique(np.concatenate(fprs))
mean_tpr = np.unique(np.concatenate(tprs))
mean_tpr = np.zeros_like(mean_fpr)
for i in range(n_runs):
    mean_tpr += np.interp(mean_fpr, fprs[i], tprs[i])
mean_tpr /= n_runs
mean_auc = auc(mean_fpr, mean_tpr)
# Plot the mean ROC curve
sns.lineplot(x=mean_fpr, y=mean_tpr, ci=None, label='Mean ROC (AUC = %0.2f)' % mean_auc)
plt.xlim([-0.1, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.show()
The problematic lines of code are as follows:
for i in range(n_runs):
    mean_tpr += np.interp(mean_fpr, fprs[i], tprs[i])
Can anyone help me identify and fix the issue with the mean_fpr and mean_tpr values so that the resulting plot starts at (0, 0)?
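One possible fix (a sketch only, not from the original thread), following the pattern of scikit-learn's cross-validated ROC example: interpolate each run's TPR onto a fixed FPR grid and pin the first interpolated value to 0 so the averaged curve starts at (0, 0). Variable names match the question's code:
# A sketch of one possible fix (assumes fprs, tprs and n_runs from the code above).
mean_fpr = np.linspace(0, 1, 100)          # fixed FPR grid instead of concatenated values
mean_tpr = np.zeros_like(mean_fpr)
for i in range(n_runs):
    interp_tpr = np.interp(mean_fpr, fprs[i], tprs[i])
    interp_tpr[0] = 0.0                     # force each interpolated curve to start at (0, 0)
    mean_tpr += interp_tpr
mean_tpr /= n_runs
mean_tpr[-1] = 1.0                          # force the averaged curve to end at (1, 1)
mean_auc = auc(mean_fpr, mean_tpr)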

What would be the n and p values when calculating adjusted r-squared using scikit-learn and 10-fold cross validation?

I know the formula for adjusted r-squared is: adj_r2 = 1-(1-R2)*(n-1)/(n-p-1).
However, what would the correct values be for n and p when you get the r-squared score with cross-validation?
Code sample:
from sklearn.model_selection import cross_val_score
lin_reg_r2 = cross_val_score(estimator = lin_reg, X = X_train, y = y_train, scoring = 'r2', cv = 10)
print("R2: {:.2f} %".format(lin_reg_r2.mean()*100))

How to find the best degree of polynomials?

I'm new to machine learning and am currently stuck on this.
First I used linear regression to fit the training set but got a very large RMSE. Then I tried polynomial regression to reduce the bias.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import mean_squared_error
poly_features = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly_features.fit_transform(X)
poly_reg = LinearRegression()
poly_reg.fit(X_poly, y)
poly_predict = poly_reg.predict(X_poly)
poly_mse = mean_squared_error(y, poly_predict)
poly_rmse = np.sqrt(poly_mse)
poly_rmse
That gave a slightly better result than linear regression, and as I increased the degree to 3/4/5 the result kept improving, but the model might be overfitting as the degree increases.
The best polynomial degree should be the one that produces the lowest RMSE on a cross-validation set, but I have no idea how to achieve that. Should I use GridSearchCV, or some other method?
I'd much appreciate it if you could help me with this.
You should provide the data for X/y next time, or something dummy; it will be faster and lead to a more specific solution. For now I've created a dummy equation of the form y = X**4 + X**3 + X + 1.
There are many ways you can improve on this, but a quick iteration to find the best degree is to simply fit your data on each degree and pick the degree with the best performance (e.g., lowest RMSE).
You can also play with how you decide to hold out your train/test/validation data.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
X = np.arange(100).reshape(100, 1)
y = X**4 + X**3 + X + 1
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
rmses = []
degrees = np.arange(1, 10)
min_rmse, min_deg = 1e10, 0
for deg in degrees:
    # Train features
    poly_features = PolynomialFeatures(degree=deg, include_bias=False)
    x_poly_train = poly_features.fit_transform(x_train)
    # Linear regression
    poly_reg = LinearRegression()
    poly_reg.fit(x_poly_train, y_train)
    # Compare with test data
    x_poly_test = poly_features.fit_transform(x_test)
    poly_predict = poly_reg.predict(x_poly_test)
    poly_mse = mean_squared_error(y_test, poly_predict)
    poly_rmse = np.sqrt(poly_mse)
    rmses.append(poly_rmse)
    # Cross-validation of degree
    if min_rmse > poly_rmse:
        min_rmse = poly_rmse
        min_deg = deg
# Plot and present results
print('Best degree {} with RMSE {}'.format(min_deg, min_rmse))
fig = plt.figure()
ax = fig.add_subplot(111)
ax.plot(degrees, rmses)
ax.set_yscale('log')
ax.set_xlabel('Degree')
ax.set_ylabel('RMSE')
This will print:
Best degree 4 with RMSE 1.27689038706e-08
Alternatively, you could also build a new class that carries out Polynomial fitting, and pass that to GridSearchCV with a set of parameters.
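A brief sketch of that alternative (illustrative only; the class name and structure are my own, not an established API): wrap the polynomial fit in a small estimator whose degree is a hyperparameter, then let GridSearchCV search over it. x_train and y_train are from the code above.
# A sketch of a custom estimator whose degree is a tunable hyperparameter
# (the class name PolyRegressor is illustrative).
from sklearn.base import BaseEstimator, RegressorMixin
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.model_selection import GridSearchCV

class PolyRegressor(BaseEstimator, RegressorMixin):
    def __init__(self, degree=2):
        self.degree = degree

    def fit(self, X, y):
        self.poly_ = PolynomialFeatures(degree=self.degree, include_bias=False)
        self.reg_ = LinearRegression().fit(self.poly_.fit_transform(X), y)
        return self

    def predict(self, X):
        return self.reg_.predict(self.poly_.transform(X))

grid = GridSearchCV(PolyRegressor(), {'degree': range(1, 10)},
                    scoring='neg_mean_squared_error', cv=5)
grid.fit(x_train, y_train)          # x_train, y_train from the code above
print(grid.best_params_)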
In my opinion, the best way to find an optimal curve-fitting degree, or more generally an optimal fitting model, is to use GridSearchCV from scikit-learn.
Here is an example of how to use it.
First, let us define a function to sample random data:
def make_data(N, err=1.0, rseed=1):
    rng = np.random.RandomState(rseed)
    X = rng.rand(N, 1) ** 2
    y = 1. / (X.ravel() + 0.3)
    if err > 0:
        y += err * rng.randn(N)
    return X, y
Build a pipeline:
def PolynomialRegression(degree=2, **kwargs):
    return make_pipeline(PolynomialFeatures(degree), LinearRegression(**kwargs))
Create the data and a vector (X_test) for testing and visualisation purposes:
X, y = make_data(200)
X_test = np.linspace(-0.1, 1.1, 200)[:, None]
Define the GridSearchCV parameters:
param_grid = {'polynomialfeatures__degree': np.arange(20),
              'linearregression__fit_intercept': [True, False],
              'linearregression__normalize': [True, False]}
grid = GridSearchCV(PolynomialRegression(), param_grid, cv=7)
grid.fit(X, y)
Get the best parameters from our model:
model = grid.best_estimator_
model
Pipeline(memory=None,
         steps=[('polynomialfeatures', PolynomialFeatures(degree=4, include_bias=True, interaction_only=False)),
                ('linearregression', LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False))])
Fit the model with the X and y data and use the vector to predict the values:
y_test = model.fit(X, y).predict(X_test)
Visualize the result:
plt.scatter(X, y)
plt.plot(X_test.ravel(), y_test, 'r')
The best fit result
The full code snippet:
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import GridSearchCV
def make_data(N, err=1.0, rseed=1):
    rng = np.random.RandomState(rseed)
    X = rng.rand(N, 1) ** 2
    y = 1. / (X.ravel() + 0.3)
    if err > 0:
        y += err * rng.randn(N)
    return X, y
def PolynomialRegression(degree=2, **kwargs):
    return make_pipeline(PolynomialFeatures(degree), LinearRegression(**kwargs))
X, y = make_data(200)
X_test = np.linspace(-0.1, 1.1, 200)[:, None]
param_grid = {'polynomialfeatures__degree': np.arange(20),
              'linearregression__fit_intercept': [True, False],
              'linearregression__normalize': [True, False]}
grid = GridSearchCV(PolynomialRegression(), param_grid, cv=7)
grid.fit(X, y)
model = grid.best_estimator_
y_test = model.fit(X, y).predict(X_test)
plt.scatter(X, y)
plt.plot(X_test.ravel(), y_test, 'r')
This is really where Bayesian model selection comes in: it gives you the most likely model given both model complexity and data fit. The quick answer is to use the BIC (Bayesian information criterion):
k = number of variables in the model
n = number of observations
sse = sum(residuals**2)
BIC = n*ln(sse/n) + k*ln(n)
Choosing the model with the lowest BIC (or AIC, etc.) gives you the best model.
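As an illustration only (a sketch using the dummy data from the earlier answer, and counting k as the polynomial features plus the intercept, which is one common convention), the BIC formula above can be applied to the degree-selection problem like this:
# A sketch: score each candidate degree with the BIC formula above.
# Assumption: k counts the polynomial features plus the intercept term.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

X = np.arange(100).reshape(100, 1)
y = (X**4 + X**3 + X + 1).ravel()

for deg in range(1, 7):
    X_poly = PolynomialFeatures(degree=deg, include_bias=False).fit_transform(X)
    model = LinearRegression().fit(X_poly, y)
    sse = np.sum((y - model.predict(X_poly)) ** 2)
    n = len(y)
    k = X_poly.shape[1] + 1                 # features + intercept
    bic = n * np.log(sse / n) + k * np.log(n)
    print(deg, bic)                         # the lowest BIC indicates the preferred degree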

How to get ROC curve for decision tree?

I am trying to find the ROC curve and AUROC for a decision tree. My code was something like this:
clf.fit(x,y)
y_score = clf.fit(x,y).decision_function(test[col])
pred = clf.predict_proba(test[col])
print(sklearn.metrics.roc_auc_score(actual,y_score))
fpr,tpr,thre = sklearn.metrics.roc_curve(actual,y_score)
output:
Error()
'DecisionTreeClassifier' object has no attribute 'decision_function'
Basically, the error comes up while computing y_score. Please explain what y_score is and how to solve this problem.
First of all, DecisionTreeClassifier has no attribute decision_function.
Judging from the structure of your code, you saw this example.
In that example the classifier is not the decision tree; it is the OneVsRestClassifier, which supports the decision_function method.
You can see the available attributes of DecisionTreeClassifier here.
A possible way to do it is to binarize the classes and then compute the AUC for each class:
Example:
from sklearn import datasets
from sklearn.metrics import roc_curve, auc
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import label_binarize
from sklearn.tree import DecisionTreeClassifier
from scipy import interp
iris = datasets.load_iris()
X = iris.data
y = iris.target
y = label_binarize(y, classes=[0, 1, 2])
n_classes = y.shape[1]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.5, random_state=0)
classifier = DecisionTreeClassifier()
y_score = classifier.fit(X_train, y_train).predict(X_test)
fpr = dict()
tpr = dict()
roc_auc = dict()
for i in range(n_classes):
    fpr[i], tpr[i], _ = roc_curve(y_test[:, i], y_score[:, i])
    roc_auc[i] = auc(fpr[i], tpr[i])
# Compute micro-average ROC curve and ROC area
fpr["micro"], tpr["micro"], _ = roc_curve(y_test.ravel(), y_score.ravel())
roc_auc["micro"] = auc(fpr["micro"], tpr["micro"])
# ROC AUC for a specific class, here class 2
roc_auc[2]
Result
0.94852941176470573
Note that for a decision tree you can use .predict_proba() instead of .decision_function(), so you will get something like this:
y_score = classifier.fit(X_train, y_train).predict_proba(X_test)
Then, the rest of the code will be the same.
In fact, the roc_curve function from scikit-learn accepts several kinds of score input:
"Target scores, can either be probability estimates of the positive class, confidence values, or non-thresholded measure of decisions (as returned by “decision_function” on some classifiers)."
See here for more details.
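For a binary problem, a minimal self-contained sketch of the same idea (the dataset and parameters here are illustrative, not from the question): the positive-class column of predict_proba is passed to roc_curve as the score.
# A minimal binary-class sketch (illustrative dataset, not the asker's data).
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import roc_curve, auc
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

clf = DecisionTreeClassifier(max_depth=3, random_state=0)
proba = clf.fit(X_train, y_train).predict_proba(X_test)[:, 1]   # probability of the positive class
fpr, tpr, _ = roc_curve(y_test, proba)
print(auc(fpr, tpr))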

How to plot ROC-curve in sklearn for LASSO method?

I want to compare Lasso with other classifiers in sklearn. I have a binary outcome vector y. I usually compute a vector probas containing the predicted probability of each input point having phenotype 1, and then generate a ROC curve from these two vectors. But how do I compute this probability for a Lasso classifier? There is no predict_proba method.
For other classifiers this code works:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn import cross_validation
from sklearn import datasets
from sklearn.cross_validation import LeaveOneOut
from sklearn import metrics
#loading a toy dataset
iris = datasets.load_iris()
X = iris.data
random_state = np.random.RandomState(0)
n_samples, n_features = X.shape
X = np.c_[X, random_state.randn(n_samples, 200 * n_features)]
y = iris.target
X, y = X[y != 2], y[y != 2]
classifiers = [
    RandomForestClassifier(max_depth=5, n_estimators=10, max_features=1),
    LogisticRegression(),
]
classifierNames = ["Random Forests", "Logistic Regression"]
for i, clf in enumerate(classifiers):
    print(clf)
    loo = LeaveOneOut(len(y))
    probas = []
    for train, test in loo:
        probas.append(clf.fit(X[train], y[train]).predict_proba(X[test])[0][1])
    # probas is a vector that contains the probability of getting phenotype 1
    # Then we just need to use our ROC/AUC functions for plotting.
    dfphenotypes = pd.DataFrame(y)
    dfpredicted = pd.DataFrame(probas)
    roc_auc = metrics.roc_auc_score(dfphenotypes, dfpredicted)
    fpr, tpr, thresholds = metrics.roc_curve(dfphenotypes, dfpredicted)
    # Plot ROC curve
    plt.plot(fpr, tpr, '--', label=classifierNames[i] + ' (area = %0.3f)' % roc_auc)
plt.plot([0, 1], [0, 1], 'k--')  # random predictions curve
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.xlabel('False Positive Rate or (1 - Specificity)')
plt.ylabel('True Positive Rate or (Sensitivity)')
plt.title('Receiver Operating Characteristic')
plt.legend(loc="lower right")
plt.figure(num=1, figsize=(30, 40))
print("auc =", roc_auc)
