Python: scoring = 'recall' in GridSearchCV

I have a binary classification problem. I try to find the best parameters for my model with:
grid = {'penalty': ['l1', 'l2'], 'C': [0.001, 0.009, 0.01, 0.09, 1, 5, 10, 25]}
logreg = GridSearchCV(LogisticRegression(), grid, cv=5, scoring='recall')
logreg.fit(X, Y)
Y_Pred = logreg.predict(X)
I would like to know what exactly the parameter scoring = 'recall' does. When I add it, it improves my model a lot.

Scoring is how the model is evaluated during the search. Scikit-learn supports quite a lot of scorers; you can see the full list of available scorers here.
Having high recall means that your model produces many true positives and few false negatives: more of the actual positive values are predicted as positive, and fewer of them are predicted as negative. You may also like to read more about the confusion matrix.
As for what kind of scoring you should use, that depends on what you are trying to achieve with your model.
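As a small, self-contained illustration of what scoring='recall' changes, here is a sketch on toy data (the dataset is generated with make_classification, and solver='liblinear' is added so that both the 'l1' and 'l2' penalties are supported; neither is part of the original question):
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Toy imbalanced binary dataset, purely for illustration
X, Y = make_classification(n_samples=500, weights=[0.8, 0.2], random_state=0)

grid = {'penalty': ['l1', 'l2'], 'C': [0.001, 0.009, 0.01, 0.09, 1, 5, 10, 25]}

# scoring='recall' means every (penalty, C) combination is judged by
# recall = TP / (TP + FN) on the held-out folds, and best_params_ is the
# combination with the highest mean recall across the 5 folds.
logreg = GridSearchCV(LogisticRegression(solver='liblinear'), grid, cv=5, scoring='recall')
logreg.fit(X, Y)
print(logreg.best_params_, logreg.best_score_)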

Related

Why roc_auc produces weird results in sklearn?

I have a binary classification problem where I use the following code to get my weighted average precision, weighted average recall, weighted average F-measure, and roc_auc.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

df = pd.read_csv(input_path + input_file)
X = df[features]
y = df[["gold_standard"]]

clf = RandomForestClassifier(random_state=42, class_weight="balanced")
k_fold = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_validate(clf, X, y, cv=k_fold,
                        scoring=('accuracy', 'precision_weighted', 'recall_weighted', 'f1_weighted', 'roc_auc'))

print("accuracy")
print(np.mean(scores['test_accuracy']))
print("precision_weighted")
print(np.mean(scores['test_precision_weighted']))
print("recall_weighted")
print(np.mean(scores['test_recall_weighted']))
print("f1_weighted")
print(np.mean(scores['test_f1_weighted']))
print("roc_auc")
print(np.mean(scores['test_roc_auc']))
I got the following results for the same dataset with 2 different feature settings.
Feature setting 1 ('accuracy', 'precision_weighted', 'recall_weighted', 'f1_weighted', 'roc_auc'):
0.6920, 0.6888, 0.6920, 0.6752, 0.7120
Feature setting 2 ('accuracy', 'precision_weighted', 'recall_weighted', 'f1_weighted', 'roc_auc'):
0.6806, 0.6754, 0.6806, 0.6643, 0.7233
So, we can see that feature setting 1 gives better results for 'accuracy', 'precision_weighted', 'recall_weighted', and 'f1_weighted' than feature setting 2.
However, when it comes to 'roc_auc', feature setting 2 is better than feature setting 1. I found this weird because every other metric was better with feature setting 1.
On one hand, I suspect this happens because I am using weighted scores for precision, recall, and F-measure, but not for roc_auc. Is it possible to do a weighted roc_auc for binary classification in sklearn?
What is the real reason for these weird roc_auc results?
It is not weird, because comparing all these other metrics with AUC is like comparing apples to oranges.
Here is a high-level description of the whole process:
Probabilistic classifiers (like RF here) produce probability outputs p in [0, 1].
To get hard class predictions (0/1), we apply a threshold to these probabilities; if not set explicitly (like here), this threshold is implicitly taken to be 0.5, i.e. if p>0.5 then class=1, else class=0.
Metrics like accuracy, precision, recall, and F1 score are calculated on the hard 0/1 class predictions, i.e. after the threshold has been applied.
In contrast, AUC measures the performance of a binary classifier averaged over the range of all possible thresholds, and not for a particular threshold.
So, a disagreement between AUC and the threshold-based metrics can certainly happen, and it does indeed lead to confusion among new practitioners.
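To make the distinction concrete, here is a minimal sketch on toy data (not your dataset) contrasting the two kinds of metrics:
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = RandomForestClassifier(random_state=42).fit(X_tr, y_tr)

proba = clf.predict_proba(X_te)[:, 1]    # probability scores p in [0, 1]
hard = (proba > 0.5).astype(int)         # hard 0/1 predictions, implicit 0.5 threshold

# Accuracy and F1 are computed from the thresholded predictions...
print(accuracy_score(y_te, hard), f1_score(y_te, hard))
# ...while AUC is computed from the raw scores, i.e. over all possible thresholds
print(roc_auc_score(y_te, proba))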
The second part of my answer in this similar question might be helpful for more details. Quoting:
According to my experience at least, most ML practitioners think that the AUC score measures something different from what it actually does: the common (and unfortunate) use is just like any other the-higher-the-better metric, like accuracy, which may naturally lead to puzzles like the one you express yourself.

Why all the true positives are classified as true negatives in the machine learning model?

I fit a random forest model to the data. I divided my dataset into training and testing in the ratio 70:30 and trained the model. I got an accuracy of 80% on the test data. Then I took a benchmark dataset and tested the model with that dataset. That dataset only contains instances with positive labels (1). But when I get the predictions for the benchmark dataset using the model, all the actual positives are classified as negatives. Accuracy is 90%. Why is that? Is there a way to interpret this?
X = dataset.iloc[:, 1:11].values
y = dataset.iloc[:, 11].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1, shuffle=True)
XBench_test = benchmarkData.iloc[:, 1:11].values
YBench_test = benchmarkData.iloc[:, 11].values
classifier = RandomForestClassifier(n_estimators=35, criterion='entropy', max_depth=30,
                                    min_samples_split=2, min_samples_leaf=1, max_features='sqrt',
                                    class_weight='balanced', bootstrap=True, random_state=0, oob_score=True)
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
y_pred_benchmark = classifier.predict(XBench_test)
print("Accuracy on test data: {:.4f}".format(classifier.score(X_test, y_test)))  # This gives 80%
print("Accuracy on benchmark data: {:.4f}".format(classifier.score(XBench_test, YBench_test)))  # This gives 90%
I'll take a shot at providing a better way to interpret your results. In cases where you have an imbalanced data set accuracy is not going to be a good way to measure your performance.
Here is a common example:
Imagine you have a disease that is present in only .01% of people. If you predict no one has the disease you have an accuracy of 99.99% but your model is not a good model.
In this example it appears your benchmark data set (commonly referred to as a test dataset) has imbalanced classes and you are getting an accuracy of 90% when you call the classifier.score method. In this case, accuracy is not a good way to interpret the model. You should instead look at other metrics.
Other common metrics to look at are precision and recall, which tell you how your model performs on the positive class. In this case, since all of the actual positives are predicted as negative, your recall is 0 (and precision is undefined, because there are no predicted positives), meaning your model is not differentiating the classes well.
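A tiny sketch of that situation with hypothetical labels (not your data), just to show the arithmetic:
from sklearn.metrics import precision_score, recall_score

# Hypothetical benchmark: every true label is positive, but the model predicts negative for all
y_true = [1, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0]

print(recall_score(y_true, y_pred))                      # 0.0 (= TP / (TP + FN) = 0 / 5)
print(precision_score(y_true, y_pred, zero_division=0))  # 0.0 here, since TP / (TP + FP) is 0/0;
                                                         # the zero_division argument needs scikit-learn >= 0.22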
Going further, if you have imbalanced classes it may be better to check different score thresholds and look at metrics like ROC AUC. These metrics look at the probability scores output by the model (predict_proba in sklearn) and evaluate different thresholds. Perhaps your model works well at a lower threshold, with the positive cases consistently scoring higher than the negative cases.
Here is an additional article about ROC AUC.
Scikit-learn has a few different metric scores you can use; they are listed here.
Here is one way you could implement ROC AUC into your code.
X = dataset.iloc[:, 1:11].values
y = dataset.iloc[:, 11].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1, shuffle=True)
XBench_test = benchmarkData.iloc[:, 1:11].values
YBench_test = benchmarkData.iloc[:, 11].values
classifier = RandomForestClassifier(n_estimators=35, criterion='entropy', max_depth=30,
                                    min_samples_split=2, min_samples_leaf=1, max_features='sqrt',
                                    class_weight='balanced', bootstrap=True, random_state=0, oob_score=True)
classifier.fit(X_train, y_train)
# Use predict_proba to get probability scores instead of hard class predictions
y_pred = classifier.predict_proba(X_test)
y_pred_benchmark = classifier.predict_proba(XBench_test)
from sklearn.metrics import roc_auc_score
# Instead of measuring accuracy, use ROC AUC: pass the true labels first, then the
# probability of the positive class (column 1 of predict_proba's output).
# Note that ROC AUC needs both classes present in the true labels.
print("ROC AUC on test data: {:.4f}".format(roc_auc_score(y_test, y_pred[:, 1])))
print("ROC AUC on benchmark data: {:.4f}".format(roc_auc_score(YBench_test, y_pred_benchmark[:, 1])))

Underfitting, Overfitting, Good_Generalization

So as a part of my assignment I'm applying linear and lasso regressions, and here's Question 7.
Based on the scores from question 6, what gamma value corresponds to a
model that is underfitting (and has the worst test set accuracy)? What
gamma value corresponds to a model that is overfitting (and has the
worst test set accuracy)? What choice of gamma would be the best
choice for a model with good generalization performance on this
dataset (high accuracy on both training and test set)?
Hint: Try plotting the scores from question 6 to visualize the
relationship between gamma and accuracy. Remember to comment out the
import matplotlib line before submission.
This function should return one tuple with the degree values in this order: (Underfitting, Overfitting, Good_Generalization) Please note there is only one correct solution.
I really need help; I can't think of a way to solve this last question. What code should I use to determine (Underfitting, Overfitting, Good_Generalization), and why?
Thanks.
Data set: http://archive.ics.uci.edu/ml/datasets/Mushroom?ref=datanews.io
Here's my code from question 6:
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import validation_curve

def answer_six():
    # SVC requires kernel='rbf', C=1, random_state=0 as instructed
    # C: penalty parameter of the error term
    # random_state: seed of the pseudo-random number generator used when
    # shuffling the data for probability estimates
    # The radial basis function (RBF) kernel is a popular kernel function
    # used in various kernelized learning algorithms, in particular in
    # support vector machine classification
    model = SVC(kernel='rbf', C=1, random_state=0)
    # numpy array of numbers spaced evenly on a log scale:
    # np.logspace(start, stop, num=50, endpoint=True, base=10.0, dtype=None, axis=0)
    gamma = np.logspace(-4, 1, 6)
    # Create a validation curve for the model over the gamma range on
    # X_subset and y_subset, scoring by accuracy
    train_scores, test_scores = validation_curve(model, X_subset, y_subset,
                                                 param_name='gamma',
                                                 param_range=gamma,
                                                 scoring='accuracy')
    # Mean of the scores across the CV folds (axis=1)
    sc = (train_scores.mean(axis=1), test_scores.mean(axis=1))
    return sc

answer_six()
Well, make yourself familiar with overfitting. You are supposed to produce something like this: Article on this topic
On the left you have underfitting, on the right overfitting... Where both errors are low you have good generalisation.
And these things are a function of gamma (the regularizer).
Overfitting means the problem is with your model: if the model is wrong, plot the data and change the model, for example from a linear one to a polynomial one, or to a support vector machine with a kernel that works for the problem.
Underfitting means the problem is with your dataset: add new, well-correlated data.
Check the numbers: look at the score/accuracy on both the train and the test set. If both are high and there is no big difference between them, you are doing well; if the test or the train score is low, you are facing overfitting or underfitting.
Hope that explains it.
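For completeness, here is a rough sketch of the plot the hint asks for, assuming answer_six() returns the (train_means, test_means) tuple from the code above and the same gamma range:
import numpy as np
import matplotlib.pyplot as plt   # the assignment asks you to comment this out before submission

train_means, test_means = answer_six()
gamma = np.logspace(-4, 1, 6)

plt.semilogx(gamma, train_means, label='train accuracy')
plt.semilogx(gamma, test_means, label='test accuracy')
plt.xlabel('gamma')
plt.ylabel('accuracy')
plt.legend()
plt.show()

# Underfitting: small gamma where both train and test accuracy are low.
# Overfitting: large gamma where train accuracy is high but test accuracy drops.
# Good generalization: the gamma where test accuracy peaks and stays close to train accuracy.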

AUC score in scikit-learn for non-binary classifiers

I want to calculate the roc_auc for different classifiers. Some are not binary classifiers. Here is a portion of the code I used:
if hasattr(clf, "decision_function"):
    y_score = clf.fit(X_train, y_train).decision_function(X_test)
else:
    y_score = clf.fit(X_train, y_train).predict_proba(X_test)
AUC = roc_auc_score(y_test, y_score)
However, I get an error for some classifiers (Nearest Neighbors, for example):
ValueError: bad input shape
Just a remark, I used: y_score = clf.fit(X_train, y_train).predict_proba(X_test), but I don't really know if it's correct to use it.
Okay, so first things first:
clf.fit(X_train, y_train)
That will fit your model to your training data: the first parameter is the features, the second is the target. Okay, nicely done.
After fitting, you can apply .predict or .predict_proba on another dataset to get an estimate/prediction of its results, or you can do both fit and predict at the same time, as you did below:
clf.fit(X_train, y_train).predict_proba(X_test)
Now those are your predictions, not your score.
Your score will be a function of the predictions and the true values (y_test).
You can use different score metrics depending on the kind of problem you have, such as accuracy, precision, recall, F1, etc. (read more at http://scikit-learn.org/stable/modules/model_evaluation.html).
Now, roc_auc_score is one of those metrics, but you have to watch what you feed into that function, otherwise it won't work. As explained on the roc_auc_score page (http://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_auc_score.html#sklearn.metrics.roc_auc_score), the parameters should be:
y_true: True binary labels in binary label indicators.
y_score : Target scores, can either be probability estimates of the positive class, confidence values, or non-thresholded measure of decisions (as returned by “decision_function” on some classifiers).
So, if you have multiclass labels or multilabels in y_true, the function won't work; it has to be binary.
y_score, on the other hand, can be either binary or probabilities (ranging over [0, 1]).
Hope that helps!
Edit: if you have a multilabel or multiclass problem, what you can do is tackle the classes one at a time, so that it becomes many binary problems/models. (Try building a model to predict whether an example is class A or not, and compute the ROC curve for it; afterwards, move on to the next class and build another model, and so on.)
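For example, here is a minimal sketch of that one-vs-rest idea on toy multiclass data (dataset and names made up for the example), binarizing the labels and scoring each class separately:
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import label_binarize

X, y = make_classification(n_samples=600, n_classes=3, n_informative=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = KNeighborsClassifier().fit(X_train, y_train)
y_score = clf.predict_proba(X_test)                # shape (n_samples, n_classes)
y_test_bin = label_binarize(y_test, classes=[0, 1, 2])

# One ROC AUC per class: "class k vs the rest"
for k in range(3):
    print(k, roc_auc_score(y_test_bin[:, k], y_score[:, k]))

# Recent scikit-learn versions can also do this in one call:
# roc_auc_score(y_test, y_score, multi_class='ovr')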

Sci-kit: What's the easiest way to get the confusion matrix of an estimator when using GridSearchCV?

In this simplified example, I've trained a learner with GridSearchCV. I would like to return the confusion matrix of the best learner when predicting on the full set X.
lr_pipeline = Pipeline([('clf', LogisticRegression())])
lr_parameters = {}
lr_gs = GridSearchCV(lr_pipeline, lr_parameters, n_jobs=-1)
lr_gs = lr_gs.fit(X,y)
print(lr_gs.confusion_matrix)  # Would like to be able to do this
Thanks
You will first need to predict using the best estimator from your GridSearchCV. One option is GridSearchCV.decision_function(), but for your example decision_function returns continuous confidence scores from LogisticRegression rather than class labels, so it does not work with confusion_matrix. Instead, take the best estimator from lr_gs and predict the labels with it:
y_pred = lr_gs.best_estimator_.predict(X)
Finally, use sklearn's confusion_matrix on real and predicted y
from sklearn.metrics import confusion_matrix
print(confusion_matrix(y, y_pred))
I found this question while searching for how to calculate the confusion matrix while fitting Sci-kit Learn's GridSearchCV. I was able to find a solution by defining a custom scoring function, although it's somewhat kludgy. I'm leaving this answer for anyone else who makes a similar search.
As mentioned by @MLgeek and @bugo99iot, the accepted answer by @Sudeep Juvekar isn't really satisfactory. It offers a literal answer to the original question as asked, but it's not usually the case that a machine learning practitioner is interested in the confusion matrix of a fitted model on its training data. It is more typically of interest to know how well a model generalizes to data it hasn't seen.
To use a custom scoring function in GridSearchCV you will need to import the Scikit-learn helper function make_scorer.
from sklearn.metrics import make_scorer
The custom scoring function looks like this
def _count_score(y_true, y_pred, label1=0, label2=1):
    return sum((y == label1 and pred == label2)
               for y, pred in zip(y_true, y_pred))
For a given pair of labels, (label1, label2), it calculates the number of examples where the true value of y is label1 and the predicted value of y is label2.
To start, find all of the labels in the training data
all_labels = sorted(set(y))
The optional argument scoring of GridSearchCV can receive a dictionary mapping strings to scorers. make_scorer can take a scoring function along with bindings for some of its parameters and produce a scorer, which is a particular type of callable that is used for scoring in GridSearchCV, cross_val_score, etc. Let's build up this dictionary for each pair of labels.
scorer = {}
for label1 in all_labels:
    for label2 in all_labels:
        count_score = make_scorer(_count_score, label1=label1,
                                  label2=label2)
        scorer['count_%s_%s' % (label1, label2)] = count_score
You'll also want to add any additional scoring functions you're interested in. To avoid getting into the subtleties of scoring for multi-class classification let's add a simple accuracy score.
# import placed here for the sake of demonstration.
# Should be imported alongside make_scorer above
from sklearn.metrics import accuracy_score
scorer['accuracy'] = make_scorer(accuracy_score)
We can now fit GridSearchCV
num_splits = 5
lr_gs = GridSearchCV(lr_pipeline, lr_parameters, n_jobs=-1,
                     scoring=scorer, refit='accuracy',
                     cv=num_splits)
refit='accuracy' tells GridSearchCV that it should judge by best accuracy score to decide on the parameters to use when refitting. In the case where you are passing a dictionary of multiple scorers to scoring, if you do not pass a value to the optional argument refit, GridSearchCV will not refit the model on all training data. We've explicitly set the number of splits because we'll need to know this later.
Now, for each of the training folds used in cross-validation, what we've essentially done is calculate the confusion matrix on the respective test folds. The test folds do not overlap and cover the entire space of data; we've therefore made predictions for each data point in X in such a way that the prediction for each point does not depend on the associated target label for that point.
We can add up the confusion matrices associated to the test folds to get something useful that gives information on how well the model generalizes. It can also be interesting to look at the confusion matrices for the test folds separately and do stuff like calculate variances.
We're not done yet though. We need to actually pull out the confusion matrix for the best estimator. In this example, the cross-validation results are stored in the dictionary lr_gs.cv_results_. First let's get the index in the results corresponding to the best set of parameters, i.e. the entry with rank 1 (this assumes numpy is imported as np):
best_index = np.argmin(lr_gs.cv_results_['rank_test_accuracy'])
If you are using a different metric to decide upon the best parameters, substitute for 'accuracy' the key you are using for the associated scorer in the scoring dictionary passed to GridSearchCV.
In my own application I chose to store the confusion matrix as a nested dictionary.
confusion = defaultdict(lambda: defaultdict(int))
for label1 in all_labels:
    for label2 in all_labels:
        for i in range(num_splits):
            key = 'split%s_test_count_%s_%s' % (i, label1, label2)
            val = int(lr_gs.cv_results_[key][best_index])
            confusion[label1][label2] += val
confusion = {key: dict(value) for key, value in confusion.items()}
There's some stuff to unpack here. defaultdict(lambda: defaultdict(int)) constructs a nested defaultdict; a defaultdict of defaultdict of int (if you're copying and pasting, don't forget to add from collections import defaultdict at the top of your file). The last line of this snippet is used to turn confusion into a regular dict of dict of int. Never leave defaultdicts lying around when they are no longer needed.
You will likely want to store your confusion matrix in a different way. The key fact is that the confusion matrix entry for the pair of labels 'label1', 'label2' for test fold i is stored in
lr_gs.cv_results_['split%s_test_count_%s_%s' % (i, label1, label2)][best_index]
See here for an example of this confusion matrix calculation used in practice. I think it's a bit of a code smell to rely on the specific format of the keys in the cv_results_ dictionary, but this does work, at least as of the day of this post.
