Using l1 penalty with LogisticRegressionCV() in scikit-learn - python

I am using python scikit-learn library for classification.
As a feature selection step, I want to use RandomizedLogisticRegression().
So for finding best value of C by cross-validation, I used LogisticRegressionCV(penalty='l1', solver='liblinear').
However, all coefficients were all 0 in this case.
Using l2 penalty works without problem. Also, single run of LogisticRegression() with l1 penalty seems to give proper coeffients.
I am using RandomizedLasso and LassoCV() for work-around, but I am not sure whether it is proper to use LASSO for binary class label.
So my question is like these.
Is there some problem in using LogisticRegressionCV() in my case?
Is there another way to find best value of C_ for logistic regression except GridSearchCV()?
Is it possible to use LASSO for binary(not continuous) classification?

From what you describe, I can say that the coefficient of the l1 regularisation term is high in your case which you need to decrease.
When the coefficient is very high, the regularisation terms becomes more important than the error term and so your model just becomes very sparse and doesn't predict anything.
I checked the LogisticRegressionCV and it says that it will search from 1e-4 to 1e4 using the Cs argument. The documentation says that in order to have lower regularisation coefficients you need to have higher Cs if you provide an integer. Alternatively you can possibly provide the inverse of regularisation coefficients yourself as a list.
So play with the Cs parameter and try to lower the regularisation coefficient.

Related

Fitting Keras L1 models

I have a simple keras model (normal Lasso linear model) where the inputs are moved to a single 'neuron' Dense(1, kernel_regularizer=l1(fdr))(input_layer) but the weights from this model are never set exactly to zero. I find this interesting since scikit-learn's Lasso can set coefficients exactly to zero.
I have used Adam and tensorflow's FtrlOptimizer for optimisation and they have the same problem.
I've already checked this question already but this does not explain why sklearn can set values exactly to zero, no to mention how their models converge in ~500ms on my server when the same model in Keras takes 2.4secs with early terminations.
Is this all because of the optimizer being used or am I missing something?
Is this all because of the optimizer being used or am I missing
something?
Indeed. If you look into the actual function that gets called when you fit Lasso from scikit-learn (it's called from ElasticNet class) you see that it uses different optimization algorithm.
Coordinate Descent in scikit-learn's ElasticNet starts with coefficient vector equal to zero, and then considers adding nonzero entries one at a time (this is related to stepwise feature selection for linear regression).
Other methods that are used to optimize L1 regularized regression also are work in that way: for example LARS (Least-angle regression) can be also used from scikit-learn.
In contrast to that, a paper on FTRL algorithm says
Unfortunately, OGD is not particularly effective at producing
sparse models. In fact, simply adding a subgradient
of the L1 penalty to the gradient of the loss (Ow`t(w))
will essentially never produce coefficients that are exactly
zero.

How to do regression as opposed to classification using logistic regression and scikit learn

The target variable that I need to predict are probabilities (as opposed to labels). The corresponding column in my training data are also in this form. I do not want to lose information by thresholding the targets to create a classification problem out of it.
If I train the logistic regression classifier with binary labels, sk-learn logistic regression API allows obtaining the probabilities at prediction time. However, I need to train it with probabilities. Is there a way to do this in scikits-learn, or a suitable Python package that scales to 100K data points of 1K dimension.
I want the regressor to use the structure of the problem. One such
structure is that the targets are probabilities.
You can't have cross-entropy loss with non-indicator probabilities in scikit-learn; this is not implemented and not supported in API. It is a scikit-learn's limitation.
In general, according to scikit-learn's docs a loss function is of the form Loss(prediction, target), where prediction is the model's output, and target is the ground-truth value.
In the case of logistic regression, prediction is a value on (0,1) (i.e., a "soft label"), while target is 0 or 1 (i.e., a "hard label").
For logistic regression you can approximate probabilities as target by oversampling instances according to probabilities of their labels. e.g. if for given sample class_1 has probability 0.2, and class_2 has probability0.8, then generate 10 training instances (copied sample): 8 withclass_2as "ground truth target label" and 2 withclass_1`.
Obviously it is workaround and is not extremely efficient, but it should work properly.
If you're ok with upsampling approach, you can pip install eli5, and use eli5.lime.utils.fit_proba with a Logistic Regression classifier from scikit-learn.
Alternative solution is to implement (or find implementation?) of LogisticRegression in Tensorflow, where you can define loss function as you like it.
In compiling this solution I worked using answers from scikit-learn - multinomial logistic regression with probabilities as a target variable and scikit-learn classification on soft labels. I advise those for more insight.
This is an excellent question because (contrary to what people might believe) there are many legitimate uses of logistic regression as.... regression!
There are three basic approaches you can use if you insist on true logistic regression, and two additional options that should give similar results. They all assume your target output is between 0 and 1. Most of the time you will have to generate training/test sets "manually," unless you are lucky enough to be using a platform that supports SGD-R with custom kernels and X-validation support out-of-the-box.
Note that given your particular use case, the "not quite true logistic regression" options may be necessary. The downside of these approaches is that it is takes more work to see the weight/importance of each feature in case you want to reduce your feature space by removing weak features.
Direct Approach using Optimization
If you don't mind doing a bit of coding, you can just use scipy optimize function. This is dead simple:
Create a function of the following type:
y_o = inverse-logit (a_0 + a_1x_1 + a_2x_2 + ...)
where inverse-logit (z) = exp^(z) / (1 + exp^z)
Use scipy minimize to minimize the sum of -1 * [y_t*log(y_o) + (1-y_t)*log(1 - y_o)], summed over all datapoints. To do this you have to set up a function that takes (a_0, a_1, ...) as parameters and creates the function and then calculates the loss.
Stochastic Gradient Descent with Custom Loss
If you happen to be using a platform that has SGD regression with a custom loss then you can just use that, specifying a loss of y_t*log(y_o) + (1-y_t)*log(1 - y_o)
One way to do this is just to fork sci-kit learn and add log loss to the regression SGD solver.
Convert to Classification Problem
You can convert your problem to a classification problem by oversampling, as described by #jo9k. But note that even in this case you should not use standard X-validation because the data are not independent anymore. You will need to break up your data manually into train/test sets and oversample only after you have broken them apart.
Convert to SVM
(Edit: I did some testing and found that on my test sets sigmoid kernels were not behaving well. I think they require some special pre-processing to work as expected. An SVM with a sigmoid kernel is equivalent to a 2-layer tanh Neural Network, which should be amenable to a regression task structured where training data outputs are probabilities. I might come back to this after further review.)
You should get similar results to logistic regression using an SVM with sigmoid kernel. You can use sci-kit learn's SVR function and specify the kernel as sigmoid. You may run into performance difficulties with 100,000s of data points across 1000 features.... which leads me to my final suggestion:
Convert to SVM using Approximated Kernels
This method will give results a bit further away from true logistic regression, but it is extremely performant. The process is the following:
Use a sci-kit-learn's RBFsampler to explicitly construct an approximate rbf-kernel for your dataset.
Process your data through that kernel and then use sci-kit-learn's SGDRegressor with a hinge loss to realize a super-performant SVM on the transformed data.
The above is laid out with code here
Instead of using predict in the scikit learn library use predict_proba function
refer here:
http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression.predict_proba

What are the initial estimates taken in Logistic regression in Scikit-learn for the first iteration?

I am trying out logistic regression from scratch in python.(through finding probability estimates,cost function,applying gradient descent for increasing the maximum likelihood).But I have a confusion regarding which estimates should I take for the first iteration process.I took all the estimates as 0(including the intercept).But the results are different from that we get in Scikit-learn.I want to know which are the initial estimates taken in Scikit-learn for logistic regression?
First of all scikit learn's LogsiticRegerssion uses regularization. So unless you apply that too , it is unlikely you will get exactly the same estimates. if you really want to test your method versus scikit's , it is better to use their gradient decent implementation of Logistic regersion which is called SGDClassifier . Make certain you put loss='log' for logistic regression and set alpha=0 to remove regularization, but again you will need to adjust the iterations and eta as their implementation is likely to be slightly different than yours.
To answer specifically about the initial estimates, I don't think it matters, but most commonly you set everything to 0 (including the intercept) and should converge just fine.
Also bear in mind GD (gradient Decent) models are hard to tune sometimes and you may need to apply some scaling(like StandardScaler) to your data beforehand as very high values are very likely to drive your gradient out of its slope. Scikit's implementation adjusts for that.

Optimal SVM parameters for high recall

I'm using scikit-learn to perform classification using SVM. I'm performing a binary classification task.
0: Does not belong to class A
1: Belongs to class A
Now, I want to optimize the parameters such that I get high recall. I don't care much about a few false positives but the objects belonging to class A should not be labelled as not belonging to A often.
I use a SVM with linear kernel.
from sklearn import svm
clf = svm.SVC(kernel='linear')
clf.fit(X,Y)
clf.predict(...)
How should I choose other SVM parameters like C? Also, what is the difference between SVC with a linear kernel and LinearSVC?
The choice of the kernel is really dependent on the data, so picking the kernel based on a plot of the data might be the way to go. This could be automated by running through all kernel types and picking the one that gives you either high/low recall or bias, whatever you're looking for. You can see for yourself the visual difference of the kernels.
Depending on the kernel different arguments of the SVC constructor are important, but in general the C is possibly the most influential, as it's the penalty for getting it wrong. Decreasing C would increase the recall.
Other than that there's more ways to get a better fit, for example by adding more features to the n_features of the X matrix passed on to svm.fit(X,y).
And of course it can always be useful to plot the precision/recall to get a better feel of what the parameters are doing.
Generally speaking you can tackle this problem by penalizing the two types of errors differently during the learning procedure. If you take a look at the loss function, in particular in the primal/parametric setting, you can think of scaling the penalty of false-negatives by alpha and penalty of false-positives by (1 - alpha), where alpha is in [0 1]. (To similar effect would be duplicating the number of positive instances in your training set, but this makes your problem unnecessarily larger, which should be avoided for efficiency)
You can choose the SVM parameter C, which is basically your penalty term, by cross-validation. Here you can use K-Fold cross-validation. You can also use a sklearn class called gridsearchCV in which you can pass your model and then perform cross-validation on it using the cv parameter.
According to linearSVC documentation -
Similar to SVC with parameter kernel=’linear’, but implemented in terms of liblinear rather than libsvm, so it has more flexibility in the choice of penalties and loss functions and should scale better to large numbers of samples.

Logistic Regression function on sklearn

I am learning Logistic Regression from sklearn and came across this : http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression
I have a created an implementation which shows me the accuracy scores for training and testing. However it is very unclear how this was achieved. My question is : What is the Maximum likelihood estimate? How is this being calculated? What is the error measure? What is the optimisation algorithm used?
I know all of the above in theory, however I am not sure where and when and how scikit.learn calculates it, or if its something I need to implement at some point. I have an accuracy rate of 83% which was what I was aiming for but I am very confused about how this was achieved by scikit learn.
Would anyone be able to point me in the right direction?
I recently started studying LR myself, I still don't get many steps of the derivation but I think I understand which formulas are being used.
First of all let's assume that you are using the latest version of scikit-learn and that the solver being used is solver='lbfgs' (which is the default I believe).
The code is here: https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/linear_model/logistic.py
What is the Maximum likelihood estimate? How is this being calculated?
The function to compute the likelihood estimate is this one https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/linear_model/logistic.py#L57
The interesting line is:
# Logistic loss is the negative of the log of the logistic function.
out = -np.sum(sample_weight * log_logistic(yz)) + .5 * alpha * np.dot(w, w)
which is the formula 7 of this tutorial. The function also computes the gradient of the likelihood, which is then passed to the minimization function (see below). One important thing is that the intercept is w0 of the formulas in the tutorial. But that's only valid fit_intercept is True.
What is the error measure?
I'm sorry I'm not sure.
What is the optimisation algorithm used?
See the following lines in the code: https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/linear_model/logistic.py#L389
It's this function http://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.fmin_l_bfgs_b.html
One very important thing is that the classes are +1 or -1! (for the binary case, in the literature 0 and 1 are common, but it won't work)
Also notice that numpy broadcasting rules are used in all formulas. (That's why you don't see iteration)
This was my attempt at understanding the code. I slowly went mad till the point of ripping appart scikit-learn code (in only works for the binary case). Also this served as inspiration too
Hope it helps.
Check out Prof. Andrew Ng's machine learning notes on Logistic Regression (starting from page 16): http://cs229.stanford.edu/notes/cs229-notes1.pdf
In logistic regression you minimize cross entropy (which in turn maximizes the likelihood of y given x). In order to do this, the gradient of the cross entropy (cost) function is being computed and is used to update the weights of the algorithm which are assigned to each input. In simple terms, logistic regression comes up with a line that best discriminates your two binary classes by changing around its parameters such that the cross entropy keeps going down. The 83% accuracy (i'm not sure what accuracy that is; you should be diving your data into training/validation/testing) means the line Logistic Regression is using for classification can correctly separate the classes 83% of the time.
I would have a look at the following on github :
https://github.com/scikit-learn/scikit-learn/blob/965b109bf2ac3a61dcbd02bc29dd8c9598c2b54c/sklearn/linear_model/logistic.py
The link is to the implementation of sklearn logictic regression. It contains the optimization algorithms used which include newton conjugate gradient (newton-cg) and bfgs (broyden fletcher goldfarb shanno algorithm) all of which require the calculation of the hessian of the loss function (_logistic_loss) . _logistic_loss is your likelihood function.

Categories