I have not worked much with ROC. Is it possible to plot the ROC curve with just y_true = ['A','B','A','B'] and y_pred=['A','B','A','A']?
Or is it necessary to have the model to be able to get the scores?
I want to use sklearn's implementations.
Thanks!
No, you will need the non-thresholded scores. The fact that you already have the predictions A and B means you have already applied some kind of threshold to decide which output belongs to which class.
A ROC curve is supposed to help you find exactly that threshold at which your model works best for you.
Depending on which model/implementation/code you work with, there is surely some way to get the probabilities.
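Once you have access to the fitted model, a minimal sketch looks like the following (the data and classifier here are toy stand-ins, not anything from your question; the key point is to feed roc_curve the probability of the positive class from predict_proba rather than the hard labels):
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, auc
from sklearn.model_selection import train_test_split

# Toy data and model; substitute your own.
X, y = make_classification(n_samples=200, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = LogisticRegression().fit(X_train, y_train)

# Probability of the positive class, not the thresholded labels.
y_score = clf.predict_proba(X_test)[:, 1]

fpr, tpr, thresholds = roc_curve(y_test, y_score)
plt.plot(fpr, tpr, label='AUC = %.2f' % auc(fpr, tpr))
plt.plot([0, 1], [0, 1], linestyle='--')
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.legend()
plt.show()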
Related
I have two models that predict basically the same thing: one is a regression version, the other a multi-class classifier.
I want to make ROC curves for both of them. I have a function my_roc(y_true, y_pred) that returns the true positives and false positives for a given y_true/y_pred pair, and I wanted to know if there is any way to get a ROC plot when I provide y_true, y_pred, the my_roc(y_true, y_pred) function, and the trained model. The scikit-learn and Keras functions I have seen all assume I want the standard definition of tp/fp.
However, in my case the multi-class version does not have to predict the exact class: something close counts as a true positive. The same goes for the regression version, where "something close" is defined by me with a measure of distance.
Is there any simple way to do this?
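One rough sketch, assuming you can get continuous scores out of both models (the regression output, or per-class probabilities), is to sweep the threshold yourself and call your own my_roc at each step; my_roc and the way the scores are produced are placeholders for your custom tp/fp logic:
import numpy as np
import matplotlib.pyplot as plt

def custom_roc_plot(y_true, scores, my_roc, n_thresholds=101):
    # Sweep candidate thresholds over the continuous scores and collect
    # the custom true/false positives returned by my_roc.
    thresholds = np.linspace(scores.min(), scores.max(), n_thresholds)
    tps, fps = [], []
    for t in thresholds:
        y_pred = (scores >= t).astype(int)  # apply the candidate threshold
        tp, fp = my_roc(y_true, y_pred)     # your "something close" definition
        tps.append(tp)
        fps.append(fp)
    plt.plot(fps, tps, marker='.')
    plt.xlabel('False positives (custom)')
    plt.ylabel('True positives (custom)')
    plt.show()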
I am using LinearRegression() from sklearn to predict. I have created different features for X and am trying to understand how I can select the best features automatically. Let's say I have defined 50 different features for X and only one output for y. Is there a way to select the best-performing features automatically instead of doing it manually?
Also, I can get the RMSE using the following command:
scores = np.sqrt(-cross_val_score(lm, X, y, cv=20, scoring='neg_mean_squared_error')).mean()
From here on, how can I use this RMSE score? I mean, do I have to make multiple predictions? How am I going to use this RMSE? There must be a way to predict() using some optimisation, but I couldn't find it.
Actually, sklearn doesn't seem to have a stepwise selection algorithm, which would help in understanding the importance of features. However, it does provide recursive feature elimination (RFE), a greedy feature-elimination algorithm similar to sequential backward selection.
See the documentation here:
Recursive Feature Elimination
Note that this will not necessarily reduce your RMSE. You might also try other techniques such as Ridge and Lasso regression.
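A minimal sketch of RFECV (the cross-validated variant, which also picks the number of features) with LinearRegression; the synthetic data below stands in for your 50-feature X and y:
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LinearRegression

# Stand-in data: 50 features, only 10 of them informative.
X, y = make_regression(n_samples=200, n_features=50, n_informative=10,
                       noise=10.0, random_state=0)

selector = RFECV(LinearRegression(), step=1, cv=20,
                 scoring='neg_mean_squared_error')
selector.fit(X, y)

print('Optimal number of features:', selector.n_features_)
print('Selected feature mask:', selector.support_)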
RMSE measures the average magnitude of the prediction error.
RMSE gives high weight to large errors; lower values are always better. RMSE can be improved only if you have a decent model. For feature selection, you can use PCA, stepwise regression, or a basic correlation technique. If you see a lot of multicollinearity, then go for Lasso or Ridge regression. Also, make sure you have a decent split of test and train data; if you have bad testing data you will get poor results. Finally, check the training-data R-squared and testing-data R-squared to make sure the model doesn't overfit.
It would be helpful if you added information on the number of observations in your test and train data and the R-squared value. Hope this helps.
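To see whether Ridge or Lasso actually helps, one option is to reuse the same cross-validated RMSE from the question for each candidate model; here X and y are assumed to be the 50-feature matrix and target from the question, and the alpha values are only illustrative:
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.model_selection import cross_val_score

models = {'ols': LinearRegression(),
          'ridge': Ridge(alpha=1.0),
          'lasso': Lasso(alpha=0.1)}

for name, model in models.items():
    # Same RMSE recipe as in the question, applied to each model.
    rmse = np.sqrt(-cross_val_score(model, X, y, cv=20,
                                    scoring='neg_mean_squared_error')).mean()
    print(name, 'RMSE =', rmse)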
I have been working on a classification problem. With different classifiers [see figure below], the AUC scores I achieve range between 0.79 and 0.80, which is not very bad. However, I am trying to improve the performance of the classifier. To get some leads on how to do this, I have generated the following visualizations using this tutorial. Extra Trees seems to be the best, but I do not know how to move forward from this point. For example, can I inform a VotingClassifier using this figure? If so, how? I appreciate any suggestions.
The ROC_AUC score is sensitive only to the order of the probabilities, not to their absolute values. Literally, if you divide all your probabilities by 2, the ROC_AUC score will not change.
This means probability calibration is useless for improving AUC; you have to resort to different methods. I don't know what you have tried already, but the options include the following (a GridSearchCV sketch for the last one follows the list):
feature engineering
feature selection
GridSearch for optimal hyperparameters
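A minimal GridSearchCV sketch with AUC as the target metric, tuning the model that looked best in your figure (Extra Trees); the data and the parameter grid here are placeholders:
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import GridSearchCV

# Stand-in data; substitute your own X and y.
X, y = make_classification(n_samples=500, random_state=0)

param_grid = {
    'n_estimators': [100, 300, 500],
    'max_depth': [None, 5, 10],
    'min_samples_leaf': [1, 3, 5],
}
search = GridSearchCV(ExtraTreesClassifier(random_state=0),
                      param_grid, scoring='roc_auc', cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)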
I'd like to know ways to determine how well a Gaussian function is fitting my data.
Here are a few plots I've been testing methods against. Currently, I'm just using the RMSE of the fit versus the sample (red is fit, blue is sample).
For instance, here are 2 good fits:
And here are 2 terrible fits that should be flagged as bad data:
In general, I'm looking for suggestions of additional metrics to measure the goodness of fit. Additionally, as you can see in the second 'good' fit, there can sometimes be other peaks outside the data. Currently, these are penalized by the RMSE method, though they should not be.
I'm looking for suggestions of additional metrics to measure the goodness of fit.
The one-sample Kolmogorov-Smirnov (KS) test would be a good starting point.
I'd suggest the Wikipedia article as an introduction.
The test is available in SciPy as scipy.stats.kstest. The function computes and returns both the KS test statistic and the p-value.
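A small sketch, assuming you have already fitted the Gaussian and know its mean and standard deviation; the sample here is synthetic and stands in for your data:
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample = rng.normal(loc=2.0, scale=0.5, size=200)   # stand-in for your data

mu, sigma = 2.0, 0.5    # parameters of the fitted Gaussian
statistic, p_value = stats.kstest(sample, 'norm', args=(mu, sigma))
print(statistic, p_value)   # a small p-value suggests a poor fit (flag as bad data)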
You can try quantile-quantile (QQ) plots using probplot from scipy.stats:
import matplotlib.pyplot as plt
from scipy.stats import probplot

# Plot the sample quantiles against the quantiles of a normal distribution.
probplot(data, dist='norm', plot=plt)
plt.show()
Calculate quantiles for a probability plot, and optionally show the plot.
Generates a probability plot of sample data against the quantiles of a specified theoretical distribution (the normal distribution by default). probplot optionally calculates a best-fit line for the data and plots the results using Matplotlib or a given plot function.
There are other ways of evaluating a good fit, but most of them are not robust to outliers.
There is MSE (mean squared error), which you already know, and RMSE, which is its square root.
But you can also measure it using MAE (mean absolute error) and MAPE (mean absolute percentage error).
Also, there is the Kolmogorov-Smirnov test, which is far more complex and for which you would probably need a library, while MAE, MAPE and MSE you can implement yourself quite easily.
(If you are dealing with unsupervised data and/or classification, which is apparently not your case, ROC curves and the confusion matrix are also accuracy metrics.)
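Since MAE and MAPE are easy to implement, here is a small sketch; y_sample and y_fit are placeholders for the blue sample and the red fitted curve evaluated at the same x values:
import numpy as np

y_sample = np.array([1.0, 2.1, 2.9, 2.2, 1.1])   # stand-in for the observed data
y_fit    = np.array([1.1, 2.0, 3.0, 2.0, 1.0])   # stand-in for the fitted Gaussian

mse  = np.mean((y_sample - y_fit) ** 2)
rmse = np.sqrt(mse)
mae  = np.mean(np.abs(y_sample - y_fit))
mape = np.mean(np.abs((y_sample - y_fit) / y_sample)) * 100  # assumes no zeros in y_sample

print(rmse, mae, mape)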
I'm using regression SVMs in python and I am wondering if there is any way to get a "confidence-measure" value for its predictions.
Previously, when using SVMs for binary classification, I was able to compute a confidence-type value from the 'margin'. Here is some pseudo-code showing how I got a confidence value:
# Begin pseudo-code (old libsvm Python bindings)
import svm as svmlib
prob = svmlib.svm_problem(labels, data)
param = svmlib.svm_parameter(svm_type=svmlib.C_SVC, kernel_type=svmlib.RBF)
model = svmlib.svm_model(prob, param)
# get confidence: the raw decision values, i.e. the signed distance from the margin
confidence = model.predict_values_raw(sample_to_classify)
I imagine that the further the new sample is from the training data, the worse the confidence, but I'm looking for a function that might help compute a reasonable estimate for this.
My (high-level) problem is as follows:
I have a function F(x), where x is a high-dimensional vector
F(x) can be computed but it is very slow
I want to train a regression SVM to approximate it
If I can find values of 'x' that have low prediction confidence, I can add these points and retrain (i.e. active learning)
Has anyone obtained/used regression-SVM confidence/margin values before?
Have a look at this similar response on Stack Overflow from back in January. The chosen answer was spot on regarding how hard it is to get confidence measures with non-parametric fitting methods. There's probably some Bayesian-type thing you could do, but that's probably not possible with the Python SVM library: Prefer one class in libsvm (python).
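Not an SVM answer, but as one example of a "Bayesian-type thing": scikit-learn's GaussianProcessRegressor returns a predictive standard deviation that can serve as the confidence measure driving the active-learning loop described above. A rough sketch, with synthetic stand-ins for the slow F(x):
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(0)
X_train = rng.uniform(-3, 3, size=(30, 2))             # stand-in for the x's already evaluated
y_train = np.sin(X_train[:, 0]) + X_train[:, 1] ** 2   # stand-in for the slow F(x)

gp = GaussianProcessRegressor(kernel=RBF(), normalize_y=True).fit(X_train, y_train)

X_candidates = rng.uniform(-3, 3, size=(1000, 2))
y_mean, y_std = gp.predict(X_candidates, return_std=True)

# The points with the largest predictive std are the least confident ones:
# evaluate F(x) there, add them to the training set, and retrain.
least_confident = X_candidates[np.argsort(y_std)[-5:]]
print(least_confident)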