How to get probabilities for SGDClassifier (LinearSVM) - python

I'm using SGDClassifier with loss function = "hinge". But hinge loss does not support probability estimates for class labels.
I need probabilities for calculating roc_curve. How can I get probabilities for hinge loss in SGDClassifier without using SVC from svm?
I've seen people mention about using CalibratedClassifierCV to get the probabilities but I've never used it and I don't know how it works.
I really appreciate the help. Thanks

In the strict sense, that's not possible.
Support vector machine classifiers are non-probabilistic: they use a hyperplane (a line in 2D, a plane in 3D and so on) to separate points into one of two classes. Points are only defined by which side of the hyperplane they are on., which forms the prediction directly.
This is in contrast with probabilistic classifiers like logistic regression and decision trees, which generate a probability for every point that is then converted to a prediction.
CalibratedClassifierCV is a sort of meta-estimator; to use it, you simply pass your instance of a base estimator to its constructor, so this will work:
base_model = SGDClassifier()
model = CalibratedClassifierCV(base_model)
model.fit(X, y)
model.predict_proba(X)
What it does is perform internal cross-validation to create a probability estimate. Note that this is equivalent to what sklearn.SVM.SVC does anyway.

Related

Nature and redundancy of classifiers

I am applying a set of linear and non-linear classification models in a classification task. The input data are language vectors (CountVectorizer, Word2Vec) and binary labels. In scikit-learn, I selected following estimators:
LogisticRegression(),
LinearSVC(),
XGBClassifier(),
SGDClassifier(),
SVC(), # Radial basis function kernel
BernoulliNB(), # Naive Bayes seems widely used for LV models
KNeighborsClassifier(),
RandomForestClassifier(),
MLPClassifier()
Question: Am I correct that LinearSVC() is a linear
classifier, at least for the case of a binary estimator?
Question: In view of experts, is there any significant redundancy among the classifiers?
Thanks for clarification.
LogisticRegression(), LinearSVC(), SGDClassifier() and BernoulliNB() are linear models.
With the default loss function SGDClassifier() works as a linear SVM, with log loss as a logistic regression, so one of these three is redundant. Also you could substitute LogisticRegression() for LogisticRegressionCV() which has built-in optimization for regularization hyperparameter.
XGBClassifier() and all the others are non-linear.
The list seems to include all the major sklearn classifiers.

How to do regression as opposed to classification using logistic regression and scikit learn

The target variable that I need to predict are probabilities (as opposed to labels). The corresponding column in my training data are also in this form. I do not want to lose information by thresholding the targets to create a classification problem out of it.
If I train the logistic regression classifier with binary labels, sk-learn logistic regression API allows obtaining the probabilities at prediction time. However, I need to train it with probabilities. Is there a way to do this in scikits-learn, or a suitable Python package that scales to 100K data points of 1K dimension.
I want the regressor to use the structure of the problem. One such
structure is that the targets are probabilities.
You can't have cross-entropy loss with non-indicator probabilities in scikit-learn; this is not implemented and not supported in API. It is a scikit-learn's limitation.
In general, according to scikit-learn's docs a loss function is of the form Loss(prediction, target), where prediction is the model's output, and target is the ground-truth value.
In the case of logistic regression, prediction is a value on (0,1) (i.e., a "soft label"), while target is 0 or 1 (i.e., a "hard label").
For logistic regression you can approximate probabilities as target by oversampling instances according to probabilities of their labels. e.g. if for given sample class_1 has probability 0.2, and class_2 has probability0.8, then generate 10 training instances (copied sample): 8 withclass_2as "ground truth target label" and 2 withclass_1`.
Obviously it is workaround and is not extremely efficient, but it should work properly.
If you're ok with upsampling approach, you can pip install eli5, and use eli5.lime.utils.fit_proba with a Logistic Regression classifier from scikit-learn.
Alternative solution is to implement (or find implementation?) of LogisticRegression in Tensorflow, where you can define loss function as you like it.
In compiling this solution I worked using answers from scikit-learn - multinomial logistic regression with probabilities as a target variable and scikit-learn classification on soft labels. I advise those for more insight.
This is an excellent question because (contrary to what people might believe) there are many legitimate uses of logistic regression as.... regression!
There are three basic approaches you can use if you insist on true logistic regression, and two additional options that should give similar results. They all assume your target output is between 0 and 1. Most of the time you will have to generate training/test sets "manually," unless you are lucky enough to be using a platform that supports SGD-R with custom kernels and X-validation support out-of-the-box.
Note that given your particular use case, the "not quite true logistic regression" options may be necessary. The downside of these approaches is that it is takes more work to see the weight/importance of each feature in case you want to reduce your feature space by removing weak features.
Direct Approach using Optimization
If you don't mind doing a bit of coding, you can just use scipy optimize function. This is dead simple:
Create a function of the following type:
y_o = inverse-logit (a_0 + a_1x_1 + a_2x_2 + ...)
where inverse-logit (z) = exp^(z) / (1 + exp^z)
Use scipy minimize to minimize the sum of -1 * [y_t*log(y_o) + (1-y_t)*log(1 - y_o)], summed over all datapoints. To do this you have to set up a function that takes (a_0, a_1, ...) as parameters and creates the function and then calculates the loss.
Stochastic Gradient Descent with Custom Loss
If you happen to be using a platform that has SGD regression with a custom loss then you can just use that, specifying a loss of y_t*log(y_o) + (1-y_t)*log(1 - y_o)
One way to do this is just to fork sci-kit learn and add log loss to the regression SGD solver.
Convert to Classification Problem
You can convert your problem to a classification problem by oversampling, as described by #jo9k. But note that even in this case you should not use standard X-validation because the data are not independent anymore. You will need to break up your data manually into train/test sets and oversample only after you have broken them apart.
Convert to SVM
(Edit: I did some testing and found that on my test sets sigmoid kernels were not behaving well. I think they require some special pre-processing to work as expected. An SVM with a sigmoid kernel is equivalent to a 2-layer tanh Neural Network, which should be amenable to a regression task structured where training data outputs are probabilities. I might come back to this after further review.)
You should get similar results to logistic regression using an SVM with sigmoid kernel. You can use sci-kit learn's SVR function and specify the kernel as sigmoid. You may run into performance difficulties with 100,000s of data points across 1000 features.... which leads me to my final suggestion:
Convert to SVM using Approximated Kernels
This method will give results a bit further away from true logistic regression, but it is extremely performant. The process is the following:
Use a sci-kit-learn's RBFsampler to explicitly construct an approximate rbf-kernel for your dataset.
Process your data through that kernel and then use sci-kit-learn's SGDRegressor with a hinge loss to realize a super-performant SVM on the transformed data.
The above is laid out with code here
Instead of using predict in the scikit learn library use predict_proba function
refer here:
http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression.predict_proba

Optimal SVM parameters for high recall

I'm using scikit-learn to perform classification using SVM. I'm performing a binary classification task.
0: Does not belong to class A
1: Belongs to class A
Now, I want to optimize the parameters such that I get high recall. I don't care much about a few false positives but the objects belonging to class A should not be labelled as not belonging to A often.
I use a SVM with linear kernel.
from sklearn import svm
clf = svm.SVC(kernel='linear')
clf.fit(X,Y)
clf.predict(...)
How should I choose other SVM parameters like C? Also, what is the difference between SVC with a linear kernel and LinearSVC?
The choice of the kernel is really dependent on the data, so picking the kernel based on a plot of the data might be the way to go. This could be automated by running through all kernel types and picking the one that gives you either high/low recall or bias, whatever you're looking for. You can see for yourself the visual difference of the kernels.
Depending on the kernel different arguments of the SVC constructor are important, but in general the C is possibly the most influential, as it's the penalty for getting it wrong. Decreasing C would increase the recall.
Other than that there's more ways to get a better fit, for example by adding more features to the n_features of the X matrix passed on to svm.fit(X,y).
And of course it can always be useful to plot the precision/recall to get a better feel of what the parameters are doing.
Generally speaking you can tackle this problem by penalizing the two types of errors differently during the learning procedure. If you take a look at the loss function, in particular in the primal/parametric setting, you can think of scaling the penalty of false-negatives by alpha and penalty of false-positives by (1 - alpha), where alpha is in [0 1]. (To similar effect would be duplicating the number of positive instances in your training set, but this makes your problem unnecessarily larger, which should be avoided for efficiency)
You can choose the SVM parameter C, which is basically your penalty term, by cross-validation. Here you can use K-Fold cross-validation. You can also use a sklearn class called gridsearchCV in which you can pass your model and then perform cross-validation on it using the cv parameter.
According to linearSVC documentation -
Similar to SVC with parameter kernel=’linear’, but implemented in terms of liblinear rather than libsvm, so it has more flexibility in the choice of penalties and loss functions and should scale better to large numbers of samples.

Working of Regression in sklearn.linear_model.LogisticRegression

How does scikit-learn's sklearn.linear_model.LogisticRegression class work with regression as well as classification problems?
As given on the Wikipedia page as well as a number of sources, since the output of Logistic Regression is based on the sigmoid function, it returns a probability. Then how does the sklearn class work as both a classifier and regressor?
Logistic regression is a method for classification, not regression. This goes for scikit-learn as for anywhere else.
If you have entered continuous values as the target vector y, then LogisticRegression will most probably fail, as it interprets the unique values of y, i.e. np.unique(y) as different classes. So you may end up having as many classes as samples.
TL;DR: Logistic regression needs a categorical target variable, because it is a classification method.

Keep the fitted parameters when using a cross_val_score in scikits learn

I'm trying to use scikits-learn to fit a linear model using Ridge regression. What I'd like to do is use cross validation to fit many different models, and then look at the parameter coefficients to see how stable they are across different CV splits. (or perhaps to average them all together).
When I try to fit the model with a cross-validation routine (e.g., using an instance of KFold and the cross_val_score function), I get back a list of the scores for each CV split, but I don't get back the fitted coefficient values that were calculated on each split. Is there a way for me to access this information? It's clearly being calculated on each iteration, so I assume there must be a way to report this back but I haven't been able to figure it out...
Edit: to clarify, I'm not looking for the parameters that I specified in the fitting (e.g., alpha values), I'm looking for the fitted coefficient values in the regression.
clf = linear_model.RidgeCV(...) # your own parameters setting
param = clf.get_params(deep=True)
See the document for more details.
To get the weight vector coefficient, use clf.coef_. Besides, cv_values_ and alpha_ are two other attributes of clf returns MSEs and estimated regularization parameter respectively.

Categories