Make a ROC curve for user defined true/false positives - python

I have two models that predict basically the same thing: one is a regression version, the other a multi-class classifier.
I want to make ROC curves for both of them. I have a function my_roc(y_true, y_pred) that returns the true positives and false positives for a given y_true/y_pred pair. I would like to know whether there is a way to get a ROC plot when I provide y_true, y_pred, my my_roc(y_true, y_pred) function and the trained model. The scikit-learn and Keras functions I have seen all assume the standard definition of tp/fp.
However, in my case the multi-class version does not have to predict the exact class; something close counts as a true positive. The same goes for the regression version, where "something close" is defined by me with a measure of distance.
Is there any simple way to do this?
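Since the built-in functions only cover the standard tp/fp definition, one option is to sweep the decision threshold yourself and plot the curve directly. A minimal sketch, assuming my_roc(y_true, y_pred) returns raw (tp, fp) counts under your own "close enough" rule, that the trained model exposes a continuous score you can threshold, and that you know the total numbers of positives and negatives; custom_roc_plot, n_pos and n_neg are placeholder names, not library functions:

import numpy as np
import matplotlib.pyplot as plt

def custom_roc_plot(y_true, y_score, my_roc, n_pos, n_neg, n_thresholds=100):
    # sweep a decision threshold over the model's continuous output
    fpr, tpr = [], []
    for t in np.linspace(y_score.min(), y_score.max(), n_thresholds):
        y_pred = (y_score >= t).astype(int)   # replace with your own decision rule
        tp, fp = my_roc(y_true, y_pred)       # your custom "close enough" tp/fp counts
        tpr.append(tp / n_pos)
        fpr.append(fp / n_neg)
    plt.plot(fpr, tpr)
    plt.xlabel("False positive rate")
    plt.ylabel("True positive rate")
    plt.show()
    return fpr, tpr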

Related

Understanding sklearn CalibratedClassifierCV

Hi all, I am having trouble understanding how to use the output of sklearn.calibration.CalibratedClassifierCV.
I have calibrated my binary classifier using this method, and the results are greatly improved. However, I am not sure how to interpret them.
The sklearn guide states that, after calibration,
the output of predict_proba method can be directly interpreted as a
confidence level. For instance, a well calibrated (binary) classifier should classify the samples such that among the samples to which it gave a predict_proba value close to 0.8, approximately 80% actually belong to the positive class.
Now I would like to reduce false positives by applying a cutoff at .6 for the model to predict label True. Without calibration, I would have simply used my_model.predict_proba() > .6.
However, it seems that after calibration the meaning of predict_proba has changed, so I am not sure if I can do that anymore.
From a quick test, it seems that predict and predict_proba follow the same logic I would expect before calibration. The output of:
pred = my_model.predict(valid_x)
proba = my_model.predict_proba(valid_x)
pd.DataFrame({"label": pred, "proba": proba[:,1]})
shows that everything with a probability above .5 gets classified as True, and everything below .5 as False.
Can you confirm that, after calibration, I can still use predict_proba to apply a different cutoff to identify my labels?
See https://scikit-learn.org/stable/modules/calibration.html#calibration
In my view, you can indeed still use predict_proba() after calibration to apply a different cutoff.
What happens within class CalibratedClassifierCV (as you noticed) is effectively that the output of predict() is based on the output of predict_proba() (see here for reference), i.e. np.argmax(self.predict_proba(X), axis=1) == self.predict(X).
On the other hand, for the non-calibrated classifier that you're passing to CalibratedClassifierCV, the above equality may or may not hold, depending on whether it is a probabilistic classifier or not (e.g. it does not hold for an SVC() classifier; see here, for instance, for some further details on this).
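For completeness, a small self-contained sketch of applying a custom cutoff to the calibrated probabilities; the data here is a toy example, and my_model and valid_x simply mirror the names used in the question:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.calibration import CalibratedClassifierCV

X, y = make_classification(n_samples=1000, random_state=0)
train_x, valid_x, train_y, valid_y = train_test_split(X, y, random_state=0)

# calibrate an SVC, whose raw decision scores are not probabilities
my_model = CalibratedClassifierCV(SVC(), method="sigmoid", cv=3)
my_model.fit(train_x, train_y)

proba = my_model.predict_proba(valid_x)[:, 1]    # calibrated P(y = 1)
custom_pred = proba > 0.6                        # your own cutoff instead of the default 0.5
default_pred = my_model.predict(valid_x)         # effectively proba > 0.5, i.e. the argmax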

ROC curve from y_true and y_pred

I have not worked much with ROC. Is it possible to plot the ROC curve with just y_true = ['A','B','A','B'] and y_pred=['A','B','A','A']?
Or is it necessary to have the model to be able to get the scores?
I want to use sklearn's implementation.
Thanks!
No, you will need the non-thresholded data. The fact that you already have predictions A and B means that you have already applied some kind of threshold, deciding which output belongs to which class.
A ROC curve is supposed to help you find exactly that threshold at which your model works best for you.
Depending on which model/implementation/code you work with, there is surely some way to get the probabilities.
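A short sketch of what that looks like with sklearn, using a toy probabilistic classifier (LogisticRegression here is just an example):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score

X, y = make_classification(n_samples=500, random_state=0)
clf = LogisticRegression().fit(X, y)

y_score = clf.predict_proba(X)[:, 1]          # continuous scores: this is what roc_curve needs
fpr, tpr, thresholds = roc_curve(y, y_score)  # one point per candidate threshold
print(roc_auc_score(y, y_score))
# hard labels like y_pred = ['A', 'B', 'A', 'A'] would only give a single fpr/tpr point, not a curve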

Why is a negative (MSE or MAE) scoring parameter like neg_mean_absolute_error used in sklearn for regression model evaluation?

I am a novice in machine learning, and while going through a course I came across the "scoring" parameter. I understood that for regression model evaluation, we consider the negatives of mean squared error, mean absolute error, etc.
When I wanted to know the reason, I went through the sklearn documentation, which says: "All scorer objects follow the convention that higher return values are better than lower return values. Thus metrics which measure the distance between the model and the data, like metrics.mean_squared_error, are available as neg_mean_squared_error which return the negated value of the metric."
This explanation does not completely answer my "why", and I am confused. Logically, if the difference in prediction is large, whether negative or positive, it makes our model equally bad. So why is the scoring parameter focused on negative values?
I think there is a slight misunderstanding of what neg_mean_absolute_error (NMAE) means. neg_mean_absolute_error is computed as follows:
NMAE = -\frac{1}{N}\sum_{i=1}^{N} \lvert Y_i - Y_i^p \rvert
where N is the total number of data points, Y_i is the true value and Y_i^p is the predicted value.
We still penalize the model equally whether it predicts higher or lower than the true value; we just multiply the final result by -1 to follow the convention that sklearn has set. So if one model gives you an MAE of, say, 0.55 and another gives you an MAE of, say, 0.78, their NMAE values are flipped to -0.55 and -0.78, and by the higher-is-better convention we pick the former model, which has the higher NMAE of -0.55.
You can make a similar argument for MSE.
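A small sketch to make the sign convention concrete (toy data, plain LinearRegression as an example model):

from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=200, noise=10, random_state=0)

nmae = cross_val_score(LinearRegression(), X, y, scoring="neg_mean_absolute_error", cv=5)
mae = -nmae                       # identical numbers, sign flipped back
print(nmae.mean(), mae.mean())    # e.g. an MAE of 0.55 shows up as an NMAE of -0.55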
It's simple: minimizing MSE is equivalent to maximizing negative MSE.
Providing an objective function that the scorer can maximize is just the "convention", as the sklearn documentation says.

sklearn roc_auc_score with multi_class=="ovr" should have None average available

I'm trying to compute the AUC score for a multiclass problem using the sklearn's roc_auc_score() function.
I have a prediction matrix of shape [n_samples, n_classes] and a ground truth vector of shape [n_samples], named np_pred and np_label respectively.
What I'm trying to achieve is the set of AUC scores, one for each class that I have.
To do so I would like to use the average parameter option None and multi_class parameter set to "ovr", but if I run
roc_auc_score(y_score=np_pred, y_true=np_label, multi_class="ovr",average=None)
I get back
ValueError: average must be one of ('macro', 'weighted') for multiclass problems
This error is expected from the sklearn function in the multiclass case; but if you take a look at the roc_auc_score source code, you can see that if the multi_class parameter is set to "ovr" and the average is one of the accepted ones, the multiclass case is treated as a multilabel one, and the internal multilabel function accepts None as the average parameter.
So, by looking at the code, it seems that I should be able to run the multiclass case with a None average in a one-vs-rest setting, but the ifs in the source code do not allow such a combination.
Am I wrong?
In case I'm wrong, from a theoretical point of view, should I fake a multilabel case just to get the different AUCs for the different classes, or should I write my own function that cycles over the classes and outputs the AUCs?
Thanks
According to the sklearn documentation, the default value of the multi_class parameter is 'raise'; the documentation mentions that the default will throw an exception, so you have to pass 'ovr' or 'ovo' explicitly, e.g. multi_class='ovr'.
As you already know, right now sklearn's multiclass ROC AUC only handles the macro and weighted averages, even though per-class scores could in principle be returned individually.
Theoretically speaking, you could implement OvR yourself and calculate a per-class roc_auc_score, as:
roc = {}
for label in multi_class_series.unique():
    # fit a binary one-vs-rest classifier for this label
    selected_classifier.fit(train_set_dataframe, train_class == label)
    predictions_proba = selected_classifier.predict_proba(test_set_dataframe)
    # score against the binarized ground truth for this label
    roc[label] = roc_auc_score(test_class == label, predictions_proba[:, 1])
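If you do go the "fake a multilabel case" route, binarizing the ground truth is enough to make roc_auc_score take the multilabel path, which does accept average=None and returns one AUC per class. A sketch with toy stand-ins for the question's np_pred and np_label:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import label_binarize

# toy stand-ins for np_pred ([n_samples, n_classes] probabilities) and np_label ([n_samples] labels)
X, np_label = make_classification(n_samples=600, n_classes=3, n_informative=6, random_state=0)
np_pred = LogisticRegression(max_iter=1000).fit(X, np_label).predict_proba(X)

# binarizing the ground truth makes sklearn treat this as a multilabel problem
np_label_bin = label_binarize(np_label, classes=np.unique(np_label))
per_class_auc = roc_auc_score(np_label_bin, np_pred, average=None)
print(per_class_auc)    # one AUC per class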

What is the difference between a linear regression classifier and linear regression for extracting a confidence interval?

I am a beginner with machine learning. I want to use time series linear regression to extract a confidence interval for my dataset. I don't need to use the linear regression as a classifier. First, what is the difference between the two cases? Second, is there a different way to implement them in Python?
The main difference is that a classifier computes a probability for a label, while a regression computes a quantitative output.
Generally, a classifier is used to compute the probability of a label, and a regression is used to compute a quantity. For instance, if you want to compute the price of a flat from some criteria you will use a regression; if you want to compute a label (luxurious, modest, ...) for the same flat from some criteria you will use a classifier.
But using a regression to compute a threshold that separates the observed labels is also a commonly used technique. That is the case for linear SVM, which computes a boundary between labels, called the decision boundary. Be warned that the main drawback of a linear model is exactly that it is linear: the boundary will necessarily be a straight line separating the labels. Sometimes that is good enough, sometimes it is not.
Logistic regression is an exception because it actually computes a probability. Its name is misleading.
For regression, when you want to compute a quantitative output, you can use a confidence interval to get an idea of the error. In classification there is no confidence interval; even with a linear SVM it is nonsensical. You can use the decision function, but it is difficult to interpret in practice, or you can use the predicted probabilities and check how often the label is wrong to compute an error ratio. There are plenty of such ratios available depending on your problem, and they are frankly the subject of a whole book.
Anyway, if you're modelling a time series, as far as I know your goal is to obtain a quantitative output, so you do not need a classifier, as you said. As for extracting the confidence interval, it depends entirely on the object you used to compute the model in Python, i.e. on the attributes that object exposes, and therefore on the library. So, to give you a better answer, it would help if you indicated which libraries and objects you are using.
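As an illustration only, assuming statsmodels is an option (the question does not say which library is used), an OLS fit exposes confidence intervals directly:

import numpy as np
import statsmodels.api as sm

# toy "time series"; statsmodels here is only an assumption about the library
t = np.arange(100)
y = 2.0 * t + np.random.default_rng(0).normal(scale=5.0, size=100)

X = sm.add_constant(t)              # intercept + time index as regressors
results = sm.OLS(y, X).fit()

pred = results.get_prediction(X)
print(pred.conf_int(alpha=0.05))    # 95% confidence interval for the fitted mean at each point
print(results.conf_int())           # confidence intervals for the coefficients themselves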
