I understand that the Support Vector Machine algorithm does not compute probabilities, which are needed to find the AUC value. Is there any other way to get the AUC score?
from sklearn.svm import SVC
model_ksvm = SVC(kernel = 'rbf', random_state = 0)
model_ksvm.fit(X_train, y_train)
model_ksvm.predict_proba(X_test)
I can't get probability output from the SVM algorithm, and without it I can't compute the AUC score, which I can get with other algorithms.
You don't really need probabilities for the ROC, just any sort of confidence score. You need to rank-order the samples according to how likely they are to be in the positive class. Support Vector Machines can use the (signed) distance from the separating plane for that purpose, and indeed sklearn does that automatically under the hood when scoring with AUC: it uses the decision_function method, which is the signed distance.
You can also set the probability option in the SVC (docs), which fits a Platt calibration model on top of the SVM to produce probability outputs:
model_ksvm = SVC(kernel='rbf', probability=True, random_state=0)
But for a binary problem this will lead to the same AUC, because AUC depends only on how the samples are ranked and the Platt calibration just maps the signed distances to probabilities monotonically.
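A minimal sketch of both routes, assuming a synthetic binary dataset from make_classification in place of the asker's X_train/y_train (the data, the split, and the names auc_margin/auc_proba are illustrative, not from the question):

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in data (an assumption; the question's data is not shown).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Route 1: rank the test samples by their signed decision value.
model_ksvm = SVC(kernel="rbf", random_state=0).fit(X_train, y_train)
auc_margin = roc_auc_score(y_test, model_ksvm.decision_function(X_test))

# Route 2: enable Platt scaling and rank by the positive-class probability.
model_prob = SVC(kernel="rbf", probability=True, random_state=0).fit(X_train, y_train)
auc_proba = roc_auc_score(y_test, model_prob.predict_proba(X_test)[:, 1])

print(f"AUC from decision_function: {auc_margin:.4f}")
print(f"AUC from predict_proba:     {auc_proba:.4f}")
```

The two numbers should agree up to ties, so the first route is usually enough and avoids the extra internal cross-validation that probability=True triggers.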
Related
I'm using sklearn's SVC with an RBF kernel and the ovr decision function. While studying the decision_function I noticed that the label with the highest confidence score doesn't necessarily correspond to the prediction. Is such behavior normal? If so, why?
Thanks in advance.
Example: For the following decision function output
5.99088671, 3.96528944, 6.02144331, 1.94929957, 9.05033791,
9.04567359, 2.98166027, 1.97837266, 1.96593488, 9.07656409,
2.97453757
the SVM predicted the label with value 9.05033791
I have a dataframe X which is comprised of 60 features and ~ 450k outcomes. My response variable y is categorical (survival, no survival).
I would like to use RFECV to reduce the number of significant features for my estimator (right now, logistic regression) on Xtrain, which I would like to score by area under the ROC curve. features_selected is a list of all features.
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold, train_test_split
import sklearn.linear_model as lm
# Create train and test datasets to evaluate each model
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, train_size=0.70)
# Use RFECV to reduce features
# Create a logistic regression estimator
logreg = lm.LogisticRegression()
# Use RFECV to pick best features, using Stratified Kfold
rfecv = RFECV(estimator=logreg, cv=StratifiedKFold(n_splits=10), scoring='roc_auc')
# Fit the features to the response variable
X_new = rfecv.fit_transform(Xtrain[features_selected], ytrain)
I have a few questions:
a) X_new returns different features when run on separate occasions (one run returned 5 features, another returned 9, and neither is a subset of the other). Why would this be?
b) Does this imply an unstable solution? While using the same seed for StratifiedKFold should solve this problem, does this mean I need to reconsider the approach in totality?
c) In general, how do I approach tuning? E.g., features are selected BEFORE tuning in my current implementation. Would tuning affect the significance of certain features? Or should I tune simultaneously?
In k-fold cross-validation, the original sample is randomly partitioned into k equal size sub-samples. Therefore, it's not surprising to get different results every time you execute the algorithm. Source
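A rough sketch of how the run-to-run variation can be removed by pinning every random step to one seed. The synthetic data, the helper name select_features, and the fold count are assumptions for illustration. Note that in the question's code the unseeded train_test_split is itself a source of randomness, so it is seeded here as well:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, train_test_split

# Synthetic stand-in for the 60-feature dataframe (an assumption).
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=3, random_state=0
)

def select_features(seed):
    """Run the split and RFECV with every source of randomness tied to `seed`."""
    X_train, _, y_train, _ = train_test_split(
        X, y, train_size=0.70, stratify=y, random_state=seed
    )
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    rfecv = RFECV(
        estimator=LogisticRegression(max_iter=1000), cv=cv, scoring="roc_auc"
    )
    return rfecv.fit(X_train, y_train).support_

first = select_features(seed=0)
second = select_features(seed=0)
print("Selected mask:", first)
print("Identical across runs:", (first == second).all())
```

Calling select_features with several different seeds and comparing the masks is one way to judge how stable the selection really is.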
There is an approach based on Pearson's correlation coefficient: compute the correlation coefficient between each pair of features and remove one feature from each highly correlated pair. This can be considered a more stable solution to such a problem. Source
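A minimal sketch of such a correlation filter with pandas. The toy columns, the helper name drop_correlated, and the 0.9 threshold are all assumptions chosen for illustration:

```python
import numpy as np
import pandas as pd

# Toy data (an assumption): "b" is nearly a copy of "a", "c" is independent.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
df = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.01, size=200),
    "c": rng.normal(size=200),
})

def drop_correlated(frame, threshold=0.9):
    """Drop the later column of every pair whose |Pearson r| exceeds threshold."""
    corr = frame.corr(method="pearson").abs()
    # Keep only the strict upper triangle so each pair is inspected once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return frame.drop(columns=to_drop), to_drop

reduced, dropped = drop_correlated(df)
print("Dropped:", dropped)
print("Kept:", list(reduced.columns))
```

Unlike RFECV, this filter never looks at y, so it is deterministic for a given dataset but says nothing about predictive value on its own.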
I'm trying to predict a binary variable with both random forests and logistic regression. I've got heavily unbalanced classes (approx 1.5% of Y=1).
The default feature importance techniques in random forests are based on classification accuracy (error rate) - which has been shown to be a bad measure for unbalanced classes (see here and here).
The two standard VIMs for feature selection with RF are the Gini VIM and the permutation VIM. Roughly speaking the Gini VIM of a predictor of interest is the sum over the forest of the decreases of Gini impurity generated by this predictor whenever it was selected for splitting, scaled by the number of trees.
My question is: is that kind of method implemented in scikit-learn (like it is in the R package party)? Or is there a workaround?
PS: This question is loosely linked to another one.
scoring is just a performance evaluation tool applied to the test sample; it does not enter the internal DecisionTreeClassifier algorithm at each split node. You can only specify the criterion (a kind of internal loss function at each split node) to be either gini or information entropy for the tree algorithm.
scoring can be used in a cross-validation context where the goal is to tune some hyperparameters (like max_depth). In your case, you can use a GridSearchCV to tune some of your hyperparameters using the scoring function roc_auc.
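A short sketch of that split of roles, assuming synthetic imbalanced data and an arbitrary small grid (neither comes from the question):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Imbalanced synthetic data (an assumption standing in for the real dataset).
X, y = make_classification(
    n_samples=400, n_features=10, weights=[0.9, 0.1], random_state=0
)

# `criterion` drives the splits inside each tree; `scoring` only ranks the
# candidate hyperparameter settings on the held-out folds.
forest = RandomForestClassifier(n_estimators=50, criterion="gini", random_state=0)
param_grid = {"max_depth": [2, 4, None], "min_samples_leaf": [1, 5]}
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)

grid = GridSearchCV(forest, param_grid, scoring="roc_auc", cv=cv)
grid.fit(X, y)

print("Best params:", grid.best_params_)
print(f"Best mean CV AUC: {grid.best_score_:.4f}")
```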
After doing some research, this is what I came up with:
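import numpy as np
from collections import defaultdict
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
names = db_train.iloc[:, 1:].columns.tolist()
# -- Gridsearched parameters
model_rf = RandomForestClassifier(n_estimators=500,
                                  class_weight="balanced",  # named "auto" in old scikit-learn releases
                                  criterion='gini',
                                  bootstrap=True,
                                  max_features=10,
                                  min_samples_split=2,  # smallest value current releases accept
                                  min_samples_leaf=6,
                                  max_depth=3,
                                  n_jobs=-1)
scores = defaultdict(list)
# -- Fit the model (could be cross-validated)
rf = model_rf.fit(X_train, Y_train)
# Baseline AUC on the untouched test set.
# NOTE: rf.predict returns hard labels; rf.predict_proba(X_test)[:, 1] would give a ranking-based AUC.
acc = roc_auc_score(Y_test, rf.predict(X_test))
for i in range(X_train.shape[1]):
    X_t = X_test.copy()
    np.random.shuffle(X_t[:, i])  # permute column i in place (X_test must be a NumPy array)
    shuff_acc = roc_auc_score(Y_test, rf.predict(X_t))
    # Relative drop in AUC caused by breaking feature i
    scores[names[i]].append((acc - shuff_acc) / acc)
print("Features sorted by their score:")
print(sorted([(round(np.mean(score), 4), feat) for
              feat, score in scores.items()], reverse=True))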
Features sorted by their score:
[(0.0028999999999999998, 'Var1'), (0.0027000000000000001, 'Var2'), (0.0023999999999999998, 'Var3'), (0.0022000000000000001, 'Var4'), (0.0022000000000000001, 'Var5'), (0.0022000000000000001, 'Var6'), (0.002, 'Var7'), (0.002, 'Var8'), ...]
The output is not very pretty, but you get the idea. The weakness of this approach is that feature importance seems to be very parameter-dependent. I ran it with different parameters (max_depth, max_features, ...) and got very different results. So I decided to run a grid search on the parameters (scoring = 'roc_auc') and then apply this VIM (Variable Importance Measure) to the best model.
I took my inspiration from this (great) notebook.
All suggestions/comments are most welcome!
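For comparison, recent scikit-learn releases ship sklearn.inspection.permutation_importance, which does the same shuffle-and-rescore loop with a chosen scorer. A hedged sketch on synthetic data (the data, the forest settings, and the Var names are assumptions; note it reports the absolute drop in AUC rather than the relative drop used above):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data; with shuffle=False the informative features
# are the first three columns.
X, y = make_classification(
    n_samples=600, n_features=6, n_informative=3, n_redundant=0,
    weights=[0.9, 0.1], shuffle=False, random_state=0,
)
names = [f"Var{i + 1}" for i in range(X.shape[1])]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

rf = RandomForestClassifier(
    n_estimators=100, class_weight="balanced", random_state=0
).fit(X_train, y_train)

# Mean drop in test AUC when each column is shuffled, over 10 repeats.
result = permutation_importance(
    rf, X_test, y_test, scoring="roc_auc", n_repeats=10, random_state=0
)
ranking = sorted(zip(result.importances_mean, names), reverse=True)
for score, name in ranking:
    print(f"{name}: {score:.4f}")
```

Repeating the shuffle several times (n_repeats) also gives result.importances_std, which helps tell a real effect from noise.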
I'm building a logistic regression model as follows:
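from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
cross_validation_object = StratifiedKFold(n_splits=10)
scaler = MinMaxScaler(feature_range=(0, 1))
# liblinear supports both the l1 and l2 penalties searched below
logistic_fit = LogisticRegression(solver='liblinear')
pipeline_object = Pipeline([('scaler', scaler), ('model', logistic_fit)])
tuned_parameters = [{'model__C': [0.01, 0.1, 1, 10],
                     'model__penalty': ['l1', 'l2']}]
grid_search_object = GridSearchCV(pipeline_object, tuned_parameters, cv=cross_validation_object, scoring='roc_auc')
grid_search_object.fit(X, Y)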
I looked at the roc_auc score for the best estimator:
grid_search_object.best_score_
Out[195]: 0.94505225726738229
However, when I used the best estimator to score the full training set, I got a worse score:
grid_search_object.best_estimator_.score(X,Y)
Out[196]: 0.89636762322433028
How can this be? What am I doing wrong?
Edit: Never mind. I'm an idiot. grid_search_object.best_estimator_.score calculates accuracy, not roc_auc. Right?
But if that is the case, how does GridSearchCV compute the grid_scores_? Does it pick the best decision threshold for each parameter, or is the decision threshold always at 0.5? For area under the ROC curve, decision threshold doesn't matter, but it does for say, f1_score.
If you evaluated the best_estimator_ on the full training set it is not surprising that the scores are different from the best_score_, even if the scoring methods are the same:
The best_score_ is the average over your cross-validation fold scores of the best model (best in exactly that sense: scores highest on average over folds).
When scoring on the whole training set, your score may be higher or lower than this. Scores on the full set can be worse especially if your data has some sort of temporal structure and the splitting scheme does not respect it.
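To make the three numbers concrete, here is a small sketch on synthetic data (the data and the reduced grid are assumptions). It shows that best_estimator_.score is plain accuracy, while the search object's own score method reuses the roc_auc scorer:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data (an assumption).
X, Y = make_classification(n_samples=400, n_features=10, random_state=0)

pipeline = Pipeline([
    ("scaler", MinMaxScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])
grid = GridSearchCV(
    pipeline,
    {"model__C": [0.01, 0.1, 1, 10]},
    cv=StratifiedKFold(n_splits=5),
    scoring="roc_auc",
).fit(X, Y)

best = grid.best_estimator_
cv_auc = grid.best_score_                                # mean AUC over held-out folds
train_acc = best.score(X, Y)                             # classifier default: accuracy
train_auc = roc_auc_score(Y, best.decision_function(X))  # AUC on the data it was fit on
search_auc = grid.score(X, Y)                            # reuses the 'roc_auc' scorer

print(f"CV AUC {cv_auc:.4f} | train accuracy {train_acc:.4f} | train AUC {train_auc:.4f}")
```

On the threshold question: with scoring='roc_auc' the scorer ranks continuous scores, so no threshold is involved. As far as I know, a label-based scorer such as 'f1' simply calls predict, which for logistic regression means the default 0.5 cut-off; the search does not tune the threshold per parameter setting.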
I am training on my dataset using LinearSVC in scikit-learn. Can I calculate/get the probability with which a sample is classified under a given label?
For example, using SGDClassifier(loss="log") to fit the data, enables the predict_proba method, which gives a vector of probability estimates P(y|x) per sample x:
>>> clf = SGDClassifier(loss="log").fit(X, y)
>>> clf.predict_proba([[1., 1.]])
Output:
array([[ 0.0000005, 0.9999995]])
Is there any similar function which I can use to calculate the prediction probability while using svm.LinearSVC (multi-class classification)? I know there is a method decision_function to predict the confidence scores for samples in this case. But is there any way I can calculate probability estimates for the samples using this decision function?
No, LinearSVC will not compute probabilities because it's not trained to do so. Use sklearn.linear_model.LogisticRegression, which uses the same algorithm as LinearSVC but with the log loss. It uses the standard logistic function for probability estimates:
1. / (1 + exp(-decision_function(X)))
(For the same reason, SGDClassifier will only output probabilities when loss="log", not using its default loss function which causes it to learn a linear SVM.)
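A quick check of that formula for the binary case, on synthetic data (the data is an assumption; manual and builtin are illustrative names):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary data (an assumption).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X, y)

# Logistic function applied to the signed scores.
manual = 1.0 / (1.0 + np.exp(-clf.decision_function(X)))
builtin = clf.predict_proba(X)[:, 1]

print("Max abs difference:", np.abs(manual - builtin).max())
```

The match holds because the model was trained with the log loss; applying the same function to LinearSVC scores would give numbers in (0, 1) but not calibrated probabilities.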
Multi-class classification is done one-vs-all. For an SGDClassifier (with loss="modified_huber"), the distance to the hyperplane corresponding to each class is returned, and the probability is calculated as
(clip(decision_function(X), -1, 1) + 1) / 2
Refer to the code for details.
You can implement a similar function for LinearSVC; it seems reasonable to me, although it probably needs some justification. Refer to the paper mentioned in the docs:
Zadrozny and Elkan, “Transforming classifier scores into multiclass probability estimates”, SIGKDD‘02, http://www.research.ibm.com/people/z/zadrozny/kdd2002-Transf.pdf
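A rough sketch of what such a function could look like for a multi-class LinearSVC. The helper name heuristic_proba and the synthetic data are hypothetical, and the row normalization (with a uniform fallback for all-zero rows) is an assumption added here so that each row sums to one. The output is an uncalibrated heuristic, not a true probability estimate:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

# Synthetic three-class data (an assumption).
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=4, n_classes=3, random_state=0
)
clf = LinearSVC(max_iter=10000).fit(X, y)

def heuristic_proba(model, X):
    """Uncalibrated pseudo-probabilities from one-vs-rest margins."""
    scores = model.decision_function(X)        # shape (n_samples, n_classes)
    prob = (np.clip(scores, -1, 1) + 1) / 2    # map each margin into [0, 1]
    row_sums = prob.sum(axis=1, keepdims=True)
    # If every class scored <= -1 the row is all zeros: fall back to uniform.
    uniform = np.full_like(prob, 1.0 / prob.shape[1])
    safe_sums = np.where(row_sums > 0, row_sums, 1.0)
    return np.where(row_sums > 0, prob / safe_sums, uniform)

proba = heuristic_proba(clf, X)
print(proba[:3].round(3))
```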
P.s. A comment from "Is there 'predict_proba' for LinearSVC?":
if you want probabilities, you should either use Logistic regression or SVC. both can predict probabilities, but in very different ways.