Why roc_auc produces weird results in sklearn?

Why roc_auc produces weird results in sklearn? - python

I have a binary classification problem where I use the following code to get my weighted avarege precision, weighted avarege recall, weighted avarege f-measure and roc_auc.
df = pd.read_csv(input_path+input_file)
X = df[features]
y = df[["gold_standard"]]
clf = RandomForestClassifier(random_state = 42, class_weight="balanced")
k_fold = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_validate(clf, X, y, cv=k_fold, scoring = ('accuracy', 'precision_weighted', 'recall_weighted', 'f1_weighted', 'roc_auc'))
print("accuracy")
print(np.mean(scores['test_accuracy'].tolist()))
print("precision_weighted")
print(np.mean(scores['test_precision_weighted'].tolist()))
print("recall_weighted")
print(np.mean(scores['test_recall_weighted'].tolist()))
print("f1_weighted")
print(np.mean(scores['test_f1_weighted'].tolist()))
print("roc_auc")
print(np.mean(scores['test_roc_auc'].tolist()))
I got the following results for the same dataset with 2 different feature settings.
Feature setting 1 ('accuracy', 'precision_weighted', 'recall_weighted', 'f1_weighted', 'roc_auc'):
0.6920, 0.6888, 0.6920, 0.6752, 0.7120
Feature setting 2 ('accuracy', 'precision_weighted', 'recall_weighted', 'f1_weighted', 'roc_auc'):
0.6806 0.6754 0.6806 0.6643 0.7233
So, we can see that in feature setting 1 we get good results for 'accuracy', 'precision_weighted', 'recall_weighted', 'f1_weighted' compared to feature setting 2.
However, when it comes to 'roc_auc' feature setting 2 is better than feature setting 1. I found this weird becuase every other metric was better with feature setting 1.
On one hand, I suspect that this happens since I am using weighted scores for precision, recall and f-measure and not with roc_auc. Is it possible to do weighted roc_auc for binary classification in sklearn?
What is the real problem for this weird roc_auc results?

It is not weird, because comparing all these other metrics with AUC is like comparing apples to oranges.
Here is a high-level description of the whole process:
Probabilistic classifiers (like RF here) produce probability outputs p in [0, 1].
To get hard class predictions (0/1), we apply a threshold to these probabilities; if not set explicitly (like here), this threshold is implicitly taken to be 0.5, i.e. if p>0.5 then class=1, else class=0.
Metrics like accuracy, precision, recall, and f1-score are calculated over the hard class predictions 0/1, i.e after the threshold has been applied.
In contrast, AUC measures the performance of a binary classifier averaged over the range of all possible thresholds, and not for a particular threshold.
So, it can certainly happen, and it can indeed lead to confusion among new practitioners.
The second part of my answer in this similar question might be helpful for more details. Quoting:
According to my experience at least, most ML practitioners think that the AUC score measures something different from what it actually does: the common (and unfortunate) use is just like any other the-higher-the-better metric, like accuracy, which may naturally lead to puzzles like the one you express yourself.

Related

random forest: predict vs predict_proba

I am working on a multiclass, highly imbalanced classification problem. I use random forest as base classifier.
I would have to give report of model performance on the evaluation set considering multiple criteria (metrics: precision, recall conf_matrix, roc_auc).
Model train:
rf = RandomForestClassifier(()
rf.fit(train_X, train_y)
To obtain precision/recall and confusion_matrix, I go like:
pred = rf.predict(test_X)
precision = metrics.precision_score(y_test, pred)
recall = metrics.recall_score(y_test, pred)
f1_score = metrics.f1_score(y_test, pred)
confusion_matrix = metrics.confusion_matrix(y_test, pred)
Fine, but then computing roc_auc requires the prediction probability of classes and not the class labels. For that I must further do this:
y_prob = rf.predict_proba(test_X)
roc_auc = metrics.roc_auc_score(y_test, y_prob)
But then I'm worried here that the outcome produced first by rf.predict() may not be consistent with rf.predict_proba() so the roc_auc score I'm reporting. I know that calling predict several times will produce exactly the same result, but I'm concern predict then predict_proba might produce slightly different results, making it inappropriate to discuss together with the metrics above.
If that is the case, is there a way to control this, making sure the class probabilities used by predict() to decide predicted labels are exactly the same when I then call predict_proab?

predict_proba() and predict() are consistent with eachother. In fact, predict uses predict_proba internally as can be seen here in the source code

Python: scoring = 'recall' in GridSearchCV

I have a binary classification problem:
I try to find the best parameters for my model with
grid = {'penalty': ['l1', 'l2'],'C':[0.001,.009,0.01,.09,1,5,10,25]}
logreg =GridSearchCV(LogisticRegression(),grid,cv=5,scoring = 'recall')
logreg.fit(X, Y)
Y_Pred = logreg.predict(X)
I would like to know what is exactly the parameter scoring = 'recall'. When I add it, it improves a lot my model.

Scoring is basically how the model is being evaluated. Scikit supports quite a lot, you can see the full available scorers here.
Having high recall means that your model has high true positives and less false negatives. It means that there are more actual positives values being predicted as true and less actual positive values being predicted as false. You may also like to read more about confusion matrix.
As to what kind of scoring should you use, that depends on what you are trying to achieve with your model.

How to improve Precision without downing the Recall in a unbalanced dataset?

I have to use a Decision Tree for binary classification on a unbalanced dataset(50000:0, 1000:1). To have a good Recall (0.92) I used RandomOversampling function found in module Imblearn and pruning with max_depth parameter.
The problem is that the Precision is very low (0.44), I have too many false positives.
I tried to train a specific classifier to deal with borderline instances that generate false positives.
First I splitted dataset in train and test sets(80%-20%).
Then I splitted train in train2 and test2 sets (66%,33%).
I used a dtc(#1) to predict test2 and i took only the instances predicted as true.
Then I trained a dtc(#2) on all these datas with the goal of build a classifier able to distinguish borderline cases.
I used the dtc(#3) trained on first oversampled train set to predict official test set and got Recall=0.92 and Precision=0.44.
Finally I used dtc(#2) only on datas predicted as true by dtc(#3) with hope to distinguish TP from FP but it doesn't work too good. I got Rec=0.79 and Prec=0.69.
x_train, X_test, y_train, Y_test =train_test_split(df2.drop('k',axis=1), df2['k'], test_size=test_size, random_state=0.2)
x_res, y_res=ros.fit_resample(x_train,y_train)
df_to_trick=df2.iloc[x_train.index.tolist(),:]
#....split in 0.33-0.66, trained and tested
confusion_matrix(y_test,predicted1) #dtc1
array([[13282, 266],
[ 18, 289]])
#training #dtc2 only on (266+289) datas
confusion_matrix(Y_test,predicted3) #dtc3 on official test set
array([[9950, 294],
[ 20, 232]])
confusion_matrix(true,predicted4)#here i used dtc2 on (294+232) datas
array([[204, 90],
[ 34, 198]])
I have to choose between dtc3 (Recall=0.92, Prec=0.44) or the entire cervellotic process with (Recall=0.79, Prec=0.69).
Do you have any ideas to improve these metrics? My goal is about (0.8/0.9).

Keep in mind that precision and recall are based on the threshold that you choose (i.e. in sklearn the default threshold is 0.5 - any class with a prediction probability > 0.5 is classified as positive) and that there will always be a trade-off between favoring precision over recall. ...
I think in the case you are describing (trying to fine-tune your classifier given your model's performance limitations) you can choose a higher or lower threshold to cut-off which has a more favorable precision-recall tradeoff ...
The below code can help you visualize how your precision and recall change as you move your decision threshold:
def plot_precision_recall_vs_threshold(precisions, recalls, thresholds):
plt.figure(figsize=(8, 8))
plt.title("Precision and Recall Scores as a function of the decision threshold")
plt.plot(thresholds, precisions[:-1], "b--", label="Precision")
plt.plot(thresholds, recalls[:-1], "g-", label="Recall")
plt.ylabel("Score")
plt.xlabel("Decision Threshold")
plt.legend(loc='best')
Other suggestions to improve your model's performance is to either use alternative pre-processing methods - SMOTE instead of Random Oversampling or choosing a more complex classifier (a random forrest/ ensemble of trees or a boosting approach ADA Boost or Gradient based boosting)

Inverse ROC-AUC value?

I have a classification problem where I need to predict a class of (0,1) given a data. Basically I have a dataset with more than 300 features (including a target value for prediction) and more than 2000 rows (samples). I applied different classifiers as follows:
1. DecisionTreeClassifier()
2. RandomForestClassifier()
3. GradientBoostingClassifier()
4. KNeighborsClassifier()
Almost all the classifiers gave me similar results around 0.50 AUC value except Random forest around 0.28. I would like to know that whether it is correct if I inverse the RandomForest result like:
1-0.28= 0.72
And report it as the AUC? Is it correct?

Your intuition is not wrong: if a binary classifier performs indeed worse than random (i.e. AUC < 0.5), a valid strategy is to simply invert its predictions, i.e. report a 0 whenever the classifier predicts a 1, and vice versa); from the relevant Wikipedia entry (emphasis added):
The diagonal divides the ROC space. Points above the diagonal represent good classification results (better than random); points below the line represent bad results (worse than random). Note that the output of a consistently bad predictor could simply be inverted to obtain a good predictor.
Nevertheless, the formally correct AUC for this inverted classifier, would be to first invert the individual probabilistic predictions prob of your model:
prob_invert = 1 - prob
and then calculate the AUC using these predictions prob_invert (arguably the process should give similar results with the naive approach you describe of simply subtracting the AUC from 1, but I'm not quire sure of the exact result - see also this Quora answer).
Needless to say, all this is based on the assumption that your whole process is correct, i.e. you don't have any modeling or coding errors (constructing a worse-than-random classifier is not exactly trivial).

auc score in scikit-learn for no binary classifiers

I want to calculate the roc_auc for different classifiers. Some are not binary classifiers. Here is a portion of the code I used:
if hasattr(clf, "decision_function"):
y_score = clf.fit(X_train, y_train).decision_function(X_test)
else:
y_score = clf.fit(X_train, y_train).predict_proba(X_test)
AUC=roc_auc_score(y_test, y_score)
However, I get an error for some classifiers (Nearest Neighbors
for example):
ValueError: bad input shape
Just a remark, I used: y_score = clf.fit(X_train, y_train).predict_proba(X_test), but I don't really know if it's correct to use it.

okay so first things first
clf.fit(X_train, y_train)
that will fit your model to your training data. first parameter being features, second parameter being target. okay, nicely done.
After fiting, you can apply ".predict" or ".predict_proba" on another dataset to get an estimative/prediction of its results. or you can do both fit and predict at the same time as you did below:
clf.fit(X_train, y_train).predict_proba(X_test)
Now those are your predictions, not your score.
Your score will be a function of the prediction and the true value "(y_test)".
You can use different score metrics depending on the kind of problem you got, such as accuracy, precision, recall, f1, etc.. (read more at http://scikit-learn.org/stable/modules/model_evaluation.html)
Now, roc_auc_score is one of those metrics, but you gotta watch out what you input that function, otherwise it wont work. As explained on the roc_auc_score page (http://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_auc_score.html#sklearn.metrics.roc_auc_score), parameters should be:
y_true: True binary labels in binary label indicators.
y_score : Target scores, can either be probability estimates of the positive class, confidence values, or non-thresholded measure of decisions (as returned by “decision_function” on some classifiers).
so, if you got labels, or multilabels for y_true, the function wont work, it gotta be binary.
y_score in the other hand can be either binary or probabilities (ranging from [0,1])
hope that helps!
edit: if you got a multilabel problem, what you can do is tackle different classes one at a time. that way it will become many binary binary problems/models. (try building a model to predict if its class A or not, and do the roc curve of it, afterwards, move on to the next class and build another model, and so on)

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Why roc_auc produces weird results in sklearn? - python

Related

random forest: predict vs predict_proba

Python: scoring = 'recall' in GridSearchCV

How to improve Precision without downing the Recall in a unbalanced dataset?

Inverse ROC-AUC value?

auc score in scikit-learn for no binary classifiers

Categories

Resources