Cross Validation with ROC? - python

I use the following code to run cross-validation and return ROC AUC scores:
rf = RandomForestClassifier(n_estimators=1000, oob_score=True, class_weight='balanced')
scores = cross_val_score(rf, X, np.ravel(y), cv=10, scoring='roc_auc')
How can I return the ROC AUC based on
roc_auc_score(y_test, results.predict(X_test))
rather than
roc_auc_score(y_test, results.predict_proba(X_test))

ROC AUC is only useful if you can rank order your predictions. Using .predict() will just give the most probable class for each sample, and so you won't be able to do that rank ordering.
In the example below, I fit a random forest on a randomly generated dataset and tested it on a held out sample. The blue line shows the proper ROC curve done using .predict_proba() while the green shows the degenerate one with .predict() where it only really knows of the one cutoff point.
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.metrics import roc_curve
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
rf = RandomForestClassifier()
data, target = make_classification(n_samples=4000, n_features=2, n_redundant=0, flip_y=0.4)
train, test, train_t, test_t = train_test_split(data, target, train_size=0.9)
rf.fit(train, train_t)
# proper ROC curve from class probabilities (blue)...
plt.plot(*roc_curve(test_t, rf.predict_proba(test)[:, 1])[:2])
# ...versus the degenerate curve from hard predictions (green)
plt.plot(*roc_curve(test_t, rf.predict(test))[:2])
plt.show()
EDIT: While there's nothing stopping you from calculating an roc_auc_score() on .predict(), the point of the above is that it's not really a useful measurement.
In [5]: roc_auc_score(test_t, rf.predict_proba(test)[:,1]), roc_auc_score(test_t, rf.predict(test))
Out[5]: (0.75502749115010925, 0.70238005573548234)
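That said, if you do want the .predict()-based number inside cross_val_score, a minimal sketch (reusing rf, X, y from the question) is to wrap roc_auc_score in make_scorer, which scores on hard predictions by default:
import numpy as np
from sklearn.metrics import make_scorer, roc_auc_score
from sklearn.model_selection import cross_val_score
# make_scorer without probabilities scores on the .predict() output
hard_label_auc = make_scorer(roc_auc_score)
scores = cross_val_score(rf, X, np.ravel(y), cv=10, scoring=hard_label_auc)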

Related

How to get multi-class roc_auc in cross_validate in sklearn?

I have a classification problem where I want to get the roc_auc value using cross_validate in sklearn. My code is as follows.
from sklearn import datasets
iris = datasets.load_iris()
X = iris.data[:, :2] # we only take the first two features.
y = iris.target
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(random_state=0, class_weight="balanced")
from sklearn.model_selection import cross_validate
cross_validate(clf, X, y, cv=10, scoring = ('accuracy', 'roc_auc'))
However, I get the following error.
ValueError: multiclass format is not supported
Please note that I selected roc_auc specifically because it supports both binary and multiclass classification, as mentioned in https://scikit-learn.org/stable/modules/model_evaluation.html
I have a binary classification dataset too. Please let me know how to resolve this error.
I am happy to provide more details if needed.
By default multi_class='raise', so you need to change this explicitly.
From the docs:
multi_class {'raise', 'ovr', 'ovo'}, default='raise'
Multiclass only. Determines the type of configuration to use. The default value raises an error, so either 'ovr' or 'ovo' must be passed explicitly.
'ovr': Computes the AUC of each class against the rest [3] [4]. This treats the multiclass case in the same way as the multilabel case. Sensitive to class imbalance even when average == 'macro', because class imbalance affects the composition of each of the 'rest' groupings.
'ovo': Computes the average AUC of all possible pairwise combinations of classes [5]. Insensitive to class imbalance when average == 'macro'.
Solution:
Use make_scorer (docs):
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import make_scorer, roc_auc_score
from sklearn.model_selection import cross_validate
iris = datasets.load_iris()
X = iris.data[:, :2]  # we only take the first two features.
y = iris.target
clf = RandomForestClassifier(random_state=0, class_weight="balanced")
myscore = make_scorer(roc_auc_score, multi_class='ovo', needs_proba=True)
cross_validate(clf, X, y, cv=10, scoring=myscore)
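As an aside (assuming a reasonably recent scikit-learn, 0.22 or later), the predefined scorer strings 'roc_auc_ovr' and 'roc_auc_ovo' should also work, avoiding the hand-built scorer:
from sklearn.model_selection import cross_validate
# 'roc_auc_ovo' is the built-in one-vs-one ROC AUC scorer (check your version)
cross_validate(clf, X, y, cv=10, scoring=('accuracy', 'roc_auc_ovo'))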

Generate negative predictive value using cross_val_score in sklearn for model performance evaluation

As part of evaluating a model's metrics, I would like to use cross_val_score in sklearn to generate negative predictive value for a binary classification model.
In example below, I set the 'scoring' parameter within cross_val_score to calculate and print 'precision' (mean and standard deviations from 10-fold cross-validation) for positive predictive value of the model:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
log=LogisticRegression()
log_prec = cross_val_score(log, x, y, cv=10, scoring='precision')
print("PPV(mean, std): ", np.round(log_prec.mean(), 2), np.round(log_prec.std(), 2))
How can I use something like the above line of code to generate negative predictive value/NPV (likelihood of a predicted negative to be a true negative) from within the cross_val_score method?
sklearn provides many scoring options (e.g. roc_auc, recall, accuracy, F1, etc.), but unfortunately not one for NPV...
For a binary classification problem you can invert the label definition. Then the PPV will correspond to the NPV in your original problem.
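For example, a minimal sketch of that inversion (assuming y is a 0/1 NumPy array and x, log are as in the question):
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
log = LogisticRegression()
y_inverted = 1 - y  # original negatives become the positive class
# 'precision' on the inverted labels is the NPV of the original problem
log_npv = cross_val_score(log, x, y_inverted, cv=10, scoring='precision')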
You can use make_scorer to pass in pos_label=0 to the precision score function (metrics.precision_score) to get NPV. Like this:
from sklearn.metrics import make_scorer, precision_score
npv = cross_val_score(log, x, y, cv=10, scoring=make_scorer(precision_score, pos_label=0))
For more details see this sklearn example.

Recursive feature elimination combined with nested (leave one group out) cross-validation in scikit

I want to do a binary classification for 30 groups of subjects with 230 samples and 150 features. I find it very hard to implement, especially the feature selection and parameter tuning through nested leave-one-group-out cross-validation, reporting the accuracy for two classifiers (SVM and random forest), and seeing which features were selected.
I'm new to this and I'm sure the following code is not correct:
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.feature_selection import RFE
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
X= the data (230 samples * 150 features)
y= [1,0,1,0,0,0,1,1,1..]
groups = [1,2...30]
param_grid = [{'estimator__C': [0.01, 0.1, 1.0, 10.0]}]
inner_cross_validation = LeaveOneGroupOut().split(X, y, groups)
outer_cross_validation = LeaveOneGroupOut().split(X, y, groups)
estimator = SVC(kernel="linear")
selector = RFE(estimator, step=1)
grid_search = GridSearchCV(selector, param_grid, cv=inner_cross_validation)
grid_search.fit(X, y)
scores = cross_val_score(grid_search, X, y,cv=outer_cross_validation)
I don't know where to set "the random forest classifier" in the above because I want to compare the accuracies between SVM and random forest.
Thank you very much for reading and hope that someone can help me.
Best regards
You should set up the tree in the same way that you set up the SVM:
# your libraries
from sklearn.tree import DecisionTreeClassifier
# ....
estimator = SVC(kernel="linear")
estimator2 = DecisionTreeClassifier( ...parameters here... )
selector = RFE(estimator, step=1)
selector2 = RFE(estimator2, step=1)
grid_search = GridSearchCV(selector, param_grid, cv=inner_cross_validation)
grid_search2 = GridSearchCV(selector2, ...grid for the tree here..., cv=inner_cross_validation)
Please note that this procedure will lead to two different sets of selected features: one for the SVM and one for the decision tree.
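Building on that, a rough end-to-end sketch under stated assumptions (X, y, groups are NumPy arrays; the random forest grid and step=10 below are purely hypothetical choices), with the groups passed to both the inner and the outer LeaveOneGroupOut splits:
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.model_selection import GridSearchCV, LeaveOneGroupOut
from sklearn.svm import SVC
logo = LeaveOneGroupOut()
candidates = {
    'svm': (RFE(SVC(kernel="linear"), step=10),
            {'estimator__C': [0.01, 0.1, 1.0, 10.0]}),
    'rf': (RFE(RandomForestClassifier(random_state=0), step=10),
           {'estimator__n_estimators': [100, 500]}),  # hypothetical grid
}
scores = {name: [] for name in candidates}
for train_idx, test_idx in logo.split(X, y, groups):  # outer loop
    for name, (selector, grid) in candidates.items():
        # inner loop: leave-one-group-out over the training groups only
        inner_cv = list(logo.split(X[train_idx], y[train_idx], groups[train_idx]))
        gs = GridSearchCV(selector, grid, cv=inner_cv)
        gs.fit(X[train_idx], y[train_idx])
        scores[name].append(gs.score(X[test_idx], y[test_idx]))
        # gs.best_estimator_.support_ shows which features RFE kept on this fold
for name in scores:
    print(name, np.mean(scores[name]))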

Biasing Sklearn toward positives For MultinomialNB

I am trying to run multinomial naive Bayes on a series of examples in Python using scikit-learn. I am consistently getting all examples classified as negative. The training set is somewhat biased towards negatives, P(negative) ~ 0.75. I looked through the documentation and I couldn't find a way to bias toward positives.
from sklearn.datasets import load_svmlight_file
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score
from sklearn.metrics import recall_score
from sklearn.metrics import precision_score
X_train, y_train= load_svmlight_file("POS.train")
x_test, y_test = load_svmlight_file("POS.val")
clf = MultinomialNB()
clf.fit(X_train, y_train)
preds = clf.predict(x_test)
print('accuracy: ' + str(accuracy_score(y_test, preds)))
print('precision: ' + str(precision_score(y_test, preds)))
print('recall: ' + str(recall_score(y_test, preds)))
Setting a prior is a poor way to handle this and will result in negative cases being classified as positive that really shouldn't be. Your data has a .25/.75 split, so a .5/.5 prior is a pretty bad option.
Instead, one can average the precision and recall with a harmonic mean to produce an F score which attempts to properly handle biased data like this:
from sklearn.metrics import f1_score
The F1 score can then be used to assess the quality of the model. You can then do some model tuning and cross validation to find a model that better classifies your data i.e. the model that maximizes the F1 score.
Another option is to randomly prune out the negative cases in your data so that the classifier is trained with .5/.5 data. The predict step should then give more appropriate classifications.
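If you do try that pruning, a minimal sketch (assuming 0/1 labels; adjust the masks if your labels are coded differently):
import numpy as np
from sklearn.naive_bayes import MultinomialNB
rng = np.random.RandomState(0)
pos_idx = np.where(y_train == 1)[0]
neg_idx = np.where(y_train == 0)[0]
# keep all positives and an equally sized random subset of negatives
neg_keep = rng.choice(neg_idx, size=len(pos_idx), replace=False)
keep = np.concatenate([pos_idx, neg_keep])
clf = MultinomialNB()
clf.fit(X_train[keep], y_train[keep])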

Best way to combine probabilistic classifiers in scikit-learn

I have a logistic regression and a random forest and I'd like to combine them (ensemble) for the final classification probability calculation by taking an average.
Is there a built-in way to do this in sci-kit learn? Some way where I can use the ensemble of the two as a classifier itself? Or would I need to roll my own classifier?
NOTE: The scikit-learn Voting Classifier is probably the best way to do this now
OLD ANSWER:
For what it's worth I ended up doing this as follows:
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin

class EnsembleClassifier(BaseEstimator, ClassifierMixin):
    def __init__(self, classifiers=None):
        self.classifiers = classifiers

    def fit(self, X, y):
        # fit every underlying classifier on the same data
        for classifier in self.classifiers:
            classifier.fit(X, y)
        return self

    def predict_proba(self, X):
        # average the predicted probabilities across the classifiers
        self.predictions_ = list()
        for classifier in self.classifiers:
            self.predictions_.append(classifier.predict_proba(X))
        return np.mean(self.predictions_, axis=0)
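For instance, hypothetical usage of the class above (X_train, y_train, X_test assumed to exist):
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
# average the predicted probabilities of a logistic regression and a random forest
ensemble = EnsembleClassifier([LogisticRegression(), RandomForestClassifier()])
ensemble.fit(X_train, y_train)
avg_proba = ensemble.predict_proba(X_test)  # shape (n_samples, n_classes)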
Given the same problem, I used a majority voting method.
Combining probabilities/scores arbitrarily is very problematic, because the performance of your different classifiers can differ (for example, an SVM with two different kernels, plus a random forest, plus another classifier trained on a different training set).
One possible method to "weigh" the different classifiers is to use their Jaccard score as a "weight".
(But be warned: as I understand it, the different scores are not "all made equal". I know that a gradient boosting classifier I have in my ensemble gives all its scores as 0.97, 0.98, 1.00 or 0.41/0, i.e. it is very overconfident.)
What about the sklearn.ensemble.VotingClassifier?
http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.VotingClassifier.html#sklearn.ensemble.VotingClassifier
Per the description:
The idea behind the voting classifier implementation is to combine conceptually different machine learning classifiers and use a majority vote or the average predicted probabilities (soft vote) to predict the class labels. Such a classifier can be useful for a set of equally well performing models in order to balance out their individual weaknesses.
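For completeness, a minimal soft-voting sketch with the two models from the question (X_train, y_train, X_test assumed):
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
# voting='soft' averages the estimators' predicted probabilities
voter = VotingClassifier(
    estimators=[('lr', LogisticRegression()), ('rf', RandomForestClassifier())],
    voting='soft')
voter.fit(X_train, y_train)
proba = voter.predict_proba(X_test)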
Now scikit-learn has StackingClassifier which can be used to stack multiple estimators.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.ensemble import StackingClassifier
X, y = load_iris(return_X_y=True)
estimators = [
    ('rf', RandomForestClassifier(n_estimators=10, random_state=42)),
    ('lg', LogisticRegression())
]
clf = StackingClassifier(
    estimators=estimators, final_estimator=LogisticRegression()
)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42
)
clf.fit(X_train, y_train)
clf.predict_proba(X_test)
