I am learning Logistic Regression from sklearn and came across this : http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression
I have created an implementation that reports the accuracy scores for training and testing. However, it is very unclear how this was achieved. My question is: What is the maximum likelihood estimate? How is it calculated? What is the error measure? What is the optimisation algorithm used?
I know all of the above in theory, but I am not sure where, when and how scikit-learn calculates it, or whether it's something I need to implement at some point. I have an accuracy rate of 83%, which is what I was aiming for, but I am very confused about how scikit-learn achieved it.
Would anyone be able to point me in the right direction?
I recently started studying LR myself, I still don't get many steps of the derivation but I think I understand which formulas are being used.
First of all let's assume that you are using the latest version of scikit-learn and that the solver being used is solver='lbfgs' (which is the default I believe).
The code is here: https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/linear_model/logistic.py
What is the Maximum likelihood estimate? How is this being calculated?
The function to compute the likelihood estimate is this one https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/linear_model/logistic.py#L57
The interesting line is:
# Logistic loss is the negative of the log of the logistic function.
out = -np.sum(sample_weight * log_logistic(yz)) + .5 * alpha * np.dot(w, w)
which is formula 7 of this tutorial. In other words, the code minimizes the negative log-likelihood plus an L2 penalty, which is the same as maximizing the penalized likelihood. The function also computes the gradient of this loss, which is then passed to the minimization function (see below). One important thing is that the intercept is the w0 of the formulas in the tutorial, but that only applies if fit_intercept is True.
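As a rough sketch of what that line computes, here is the same penalized loss and its gradient written out in plain Python, with explicit loops instead of NumPy. It assumes labels in {-1, +1}; the names `logistic_loss_and_grad`, `X`, `y`, `w`, `c` and `alpha` are illustrative and this is not scikit-learn's actual code.

```python
import math

def log_logistic(t):
    """Numerically stable log(1 / (1 + exp(-t)))."""
    if t > 0:
        return -math.log1p(math.exp(-t))
    return t - math.log1p(math.exp(t))

def logistic_loss_and_grad(w, c, X, y, alpha):
    """Penalized logistic loss for labels y in {-1, +1}.

    w: list of weights, c: intercept (the w0 of the tutorial),
    alpha: L2 penalty strength (illustrative name).
    Returns (loss, grad_w, grad_c).
    """
    loss = 0.5 * alpha * sum(wj * wj for wj in w)
    grad_w = [alpha * wj for wj in w]
    grad_c = 0.0
    for xi, yi in zip(X, y):
        z = sum(wj * xj for wj, xj in zip(w, xi)) + c
        yz = yi * z
        loss -= log_logistic(yz)
        # d/dz of -log(sigmoid(y*z)) is (sigmoid(y*z) - 1) * y
        coeff = (math.exp(log_logistic(yz)) - 1.0) * yi
        for j, xj in enumerate(xi):
            grad_w[j] += coeff * xj
        grad_c += coeff
    return loss, grad_w, grad_c

# Made-up toy data, only to exercise the function.
X = [[0.5, 1.0], [1.5, -0.5], [-1.0, 0.3], [-0.2, -1.2]]
y = [1, 1, -1, -1]
loss, grad_w, grad_c = logistic_loss_and_grad([0.0, 0.0], 0.0, X, y, alpha=1.0)
print(loss, grad_w, grad_c)
```

An optimizer such as L-BFGS only needs a function of this shape: given the current parameters, return the loss value and its gradient.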
What is the error measure?
I'm sorry I'm not sure.
What is the optimisation algorithm used?
See the following lines in the code: https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/linear_model/logistic.py#L389
It's this function http://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.fmin_l_bfgs_b.html
One very important thing is that these formulas assume the classes are +1 or -1! (For the binary case, 0 and 1 are common in the literature, but they won't work in this formula.)
Also notice that numpy broadcasting rules are used in all formulas. (That's why you don't see iteration)
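To see how the two label conventions relate, here is a small check (a sketch with made-up scores, not library code) that the cross entropy written for 0/1 labels equals the logistic loss written for -1/+1 labels once the labels are mapped with y_pm = 2*y01 - 1. The function names are illustrative.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def cross_entropy_01(z, y01):
    """Cross entropy for a label in {0, 1} and a raw score z."""
    p = sigmoid(z)
    return -(y01 * math.log(p) + (1 - y01) * math.log(1.0 - p))

def logistic_loss_pm1(z, ypm):
    """Logistic loss log(1 + exp(-y*z)) for a label in {-1, +1}."""
    return math.log1p(math.exp(-ypm * z))

for z in (-2.0, -0.5, 0.0, 1.3):
    for y01 in (0, 1):
        ypm = 2 * y01 - 1  # map 0 -> -1 and 1 -> +1
        print(z, y01, cross_entropy_01(z, y01), logistic_loss_pm1(z, ypm))
```

The two columns agree up to rounding, so the choice of convention only changes how the formula is written, not what is being minimized.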
This was my attempt at understanding the code. I slowly went mad, to the point of ripping apart the scikit-learn code (it only works for the binary case). This also served as inspiration.
Hope it helps.
Check out Prof. Andrew Ng's machine learning notes on Logistic Regression (starting from page 16): http://cs229.stanford.edu/notes/cs229-notes1.pdf
In logistic regression you minimize the cross entropy (which in turn maximizes the likelihood of y given x). To do this, the gradient of the cross-entropy (cost) function is computed and used to update the weights assigned to each input. In simple terms, logistic regression comes up with a line that best discriminates your two classes by adjusting its parameters so that the cross entropy keeps going down. The 83% accuracy (I'm not sure which accuracy that is; you should be dividing your data into training/validation/testing sets) means the line logistic regression uses for classification correctly separates the classes 83% of the time.
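The loop described above can be sketched in a few lines of plain Python. This is a minimal illustration on made-up one-feature data with 0/1 labels, using plain gradient descent with a fixed learning rate; it is not what scikit-learn runs internally, and all names and numbers are assumptions chosen for the example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def cross_entropy(w, b, xs, ys):
    """Mean cross entropy for one feature and labels in {0, 1}."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(w * x + b)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(xs)

def fit(xs, ys, lr=0.5, steps=500):
    """Plain gradient descent on the mean cross entropy."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        grad_w = sum((sigmoid(w * x + b) - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(sigmoid(w * x + b) - y for x, y in zip(xs, ys)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

def accuracy(w, b, xs, ys):
    """Fraction of samples on the correct side of the decision boundary."""
    hits = sum((w * x + b > 0) == (y == 1) for x, y in zip(xs, ys))
    return hits / len(xs)

# Made-up data: the classes overlap slightly, so no line separates them perfectly.
x_train = [-2.0, -1.5, -1.0, -0.5, 0.5, 1.0, 1.5, 2.0]
y_train = [0, 0, 0, 1, 0, 1, 1, 1]
x_test = [-1.2, -0.3, 0.4, 1.8]
y_test = [0, 1, 1, 1]

w, b = fit(x_train, y_train)
print("cross entropy before:", cross_entropy(0.0, 0.0, x_train, y_train))
print("cross entropy after: ", cross_entropy(w, b, x_train, y_train))
print("train accuracy:", accuracy(w, b, x_train, y_train))
print("test accuracy: ", accuracy(w, b, x_test, y_test))
```

Reporting accuracy separately for the training and held-out samples is what makes a figure like 83% interpretable.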
I would have a look at the following on github :
https://github.com/scikit-learn/scikit-learn/blob/965b109bf2ac3a61dcbd02bc29dd8c9598c2b54c/sklearn/linear_model/logistic.py
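The link is to sklearn's implementation of logistic regression. It contains the optimization algorithms used, which include Newton conjugate gradient (newton-cg) and L-BFGS (a limited-memory variant of the Broyden-Fletcher-Goldfarb-Shanno algorithm). Both need the gradient of the loss function (_logistic_loss); newton-cg additionally uses its Hessian, while L-BFGS only approximates the Hessian from successive gradients. _logistic_loss is the negative log-likelihood (plus the regularization term), so minimizing it maximizes the likelihood.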
Why do we use gradient descent if sklearn can automatically find the best-fit line for our data? What is the purpose of gradient descent?
https://scikit-learn.org/stable/modules/sgd.html
If you want to use a gradient descent approach, consider SGDRegressor in sklearn, because sklearn gives two approaches to linear regression. The first is the LinearRegression class, which uses the ordinary least squares solver from scipy;
the other is the SGDRegressor class (SGDClassifier is its counterpart for classification), which is an implementation of stochastic gradient descent. So, to answer your question: if you are using SGDRegressor in sklearn, then you are using an implementation of gradient descent behind the scenes.
From Wikipedia itself
Gradient descent is a first-order iterative optimization algorithm for
finding the minimum of a function. To find a local minimum of a
function using gradient descent, one takes steps proportional to the
negative of the gradient (or approximate gradient) of the function at
the current point.
Gradient descent is just one approach to converging on the parameters that maximize the likelihood. There are alternatives, each with its own limitations.
The LinearRegression model in sklearn is just a fancy wrapper of the least squares solver (scipy.linalg.lstsq) built into scipy.
Gradient descent isn't much more than following the slope locally.
If you're using gradient descent, you're essentially taking into account how much the value of the function you're trying to minimize changes as a function of one or more parameters. You can use that information to make a better guess as to which parameter values are likely to work best at the next step.
The image here is in 2D (so you would only have 2 parameters determining the value of your function to minimize). However it still remains conceptually the same (just with harder maths behind the scenes) with as many parameters as you want to use.
So a gradient descent, essentially, means taking your bicycle and letting the local slope lead you to the nearest local minimum.
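As a minimal sketch of "following the slope", here is gradient descent on a made-up bowl-shaped function of two parameters. The function `f`, the starting point and the learning rate are all assumptions chosen for illustration.

```python
def f(a, b):
    """A simple bowl with its minimum at a = 3, b = -1."""
    return (a - 3.0) ** 2 + 2.0 * (b + 1.0) ** 2

def grad_f(a, b):
    """Partial derivatives of f with respect to a and b."""
    return 2.0 * (a - 3.0), 4.0 * (b + 1.0)

def gradient_descent(a, b, lr=0.1, steps=200):
    for _ in range(steps):
        ga, gb = grad_f(a, b)
        # Step against the gradient, i.e. downhill.
        a -= lr * ga
        b -= lr * gb
    return a, b

a_min, b_min = gradient_descent(0.0, 0.0)
print(a_min, b_min, f(a_min, b_min))
```

With more parameters the code looks the same; only the gradient gets longer.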
To answer your question directly: gradient descent can get a solution for many models - from Logistic Regression to neural networks, called Multi-Layer Perceptrons in SKlearn (MLP).
If you are solving only for a simple linear model, then using gradient descent (as in Basilisk's answer) has only minor benefits and costs performance (it is more flexible, but slower). I used it when I needed more flexibility.
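To make the trade-off concrete, here is a small sketch that fits the same straight line twice: once with the closed-form least-squares formulas and once with gradient descent on the mean squared error. The data are made up and the function names are illustrative; for a one-feature line the closed form is a couple of sums, while gradient descent needs many passes to reach the same answer.

```python
def fit_closed_form(xs, ys):
    """Least-squares slope and intercept for a single feature."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, my - slope * mx

def fit_gradient_descent(xs, ys, lr=0.05, steps=5000):
    """Gradient descent on the mean squared error of y = w*x + b."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        errs = [w * x + b - y for x, y in zip(xs, ys)]
        w -= lr * 2.0 * sum(e * x for e, x in zip(errs, xs)) / n
        b -= lr * 2.0 * sum(errs) / n
    return w, b

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 7.1, 8.8]

w_cf, b_cf = fit_closed_form(xs, ys)
w_gd, b_gd = fit_gradient_descent(xs, ys)
print("closed form:     ", w_cf, b_cf)
print("gradient descent:", w_gd, b_gd)
```

The flexibility mentioned above comes from the second function: swapping the squared error for another differentiable loss only changes the gradient lines, whereas the closed form exists only for least squares.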
Besides the point, note that this question is not about programming, it is about machine learning, and should go to Cross Validated instead of Stack Overflow - though you might also want to start with the fundamentals (think about this - what do we mean by "best fit line"?).
I have a simple keras model (normal Lasso linear model) where the inputs are moved to a single 'neuron' Dense(1, kernel_regularizer=l1(fdr))(input_layer) but the weights from this model are never set exactly to zero. I find this interesting since scikit-learn's Lasso can set coefficients exactly to zero.
I have used Adam and tensorflow's FtrlOptimizer for optimisation and they have the same problem.
I've already checked this question, but it does not explain why sklearn can set values exactly to zero, not to mention how their models converge in ~500 ms on my server when the same model in Keras takes 2.4 s with early termination.
Is this all because of the optimizer being used or am I missing something?
Is this all because of the optimizer being used or am I missing
something?
Indeed. If you look into the actual function that gets called when you fit Lasso from scikit-learn (it's called from the ElasticNet class), you see that it uses a different optimization algorithm.
Coordinate descent in scikit-learn's ElasticNet starts with the coefficient vector equal to zero, and then considers adding nonzero entries one at a time (this is related to stepwise feature selection for linear regression). Each coordinate update is a soft-thresholding step, which is what lets a coefficient land exactly on zero.
Other methods used to optimize L1-regularized regression also work that way: for example, LARS (least-angle regression) is also available in scikit-learn.
In contrast to that, a paper on FTRL algorithm says
Unfortunately, OGD is not particularly effective at producing
sparse models. In fact, simply adding a subgradient
of the L1 penalty to the gradient of the loss ($\nabla_w \ell_t(w)$)
will essentially never produce coefficients that are exactly
zero.
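The contrast in that quote can be sketched on a one-dimensional toy problem, minimizing 0.5*(w - a)**2 + lam*abs(w). The soft-thresholding formula below is the exact minimizer of this problem and is the kind of update a coordinate descent solver applies to one coefficient at a time; the second function follows the recipe from the quote and adds a subgradient of the L1 penalty to the gradient of the smooth part. The values of `a`, `lam`, the learning rate and the function names are assumptions for illustration.

```python
def soft_threshold(a, lam):
    """Exact minimizer of 0.5 * (w - a)**2 + lam * abs(w)."""
    if a > lam:
        return a - lam
    if a < -lam:
        return a + lam
    return 0.0

def sign(w):
    return (w > 0) - (w < 0)

def subgradient_descent(a, lam, w=1.0, lr=0.1, steps=200):
    """Gradient of the smooth part plus a subgradient of the L1 penalty."""
    for _ in range(steps):
        w -= lr * ((w - a) + lam * sign(w))
    return w

a, lam = 0.3, 0.5
w_cd = soft_threshold(a, lam)        # closed-form coordinate update
w_sub = subgradient_descent(a, lam)  # gradient-style update
print("soft-thresholding:", w_cd)
print("subgradient steps:", w_sub)
```

In this toy case the true minimizer is exactly 0: soft-thresholding returns it directly, while the subgradient iterates keep hopping from one side of zero to the other in small steps.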
I have a small data set with 47 samples. I'm running linear regression with 2 features.
After running LinearRegression I ran Ridge (with sag). I would expect it to converge quickly, and return exactly the same prediction as computed solving the normal equations.
But every time I run Ridge I get a different result, close to the result provided by LinearRegression but not exactly the same. It doesn't matter how many iterations I run. Is this expected? Why? In the past I've implemented regular gradient descent myself and it quickly converges in this data set.
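import sklearn.linear_model
from sklearn import preprocessing

# x_train and y_train hold the 47 samples (2 features, 1 target)
ols = sklearn.linear_model.LinearRegression()
model = ols.fit(x_train, y_train)
print(model.predict([[1650, 3]]))
# [[ 293081.4643349]]

# Ridge with the sag solver, fitted on standardized features
scaler = preprocessing.StandardScaler().fit(x_train)
x_scaled = scaler.transform(x_train)
ridge = sklearn.linear_model.Ridge(alpha=0, solver="sag", max_iter=99999999)
model = ridge.fit(x_scaled, y_train)
x_test = scaler.transform([[1650, 3]])
print(model.predict(x_test))
# [[ 293057.69986594]]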
Thank you all for your answers! After reading @sascha's response I read a little more about stochastic average gradient descent, and I think I've found the reason for this discrepancy: it seems to be due to the "stochastic" part of the algorithm.
Please check the wikipedia page:
https://en.wikipedia.org/wiki/Stochastic_gradient_descent
In regular gradient descent we update the weights on every iteration based on this formula:
$w \leftarrow w - \mu \nabla Q(w)$
where Q is the cost function, so the second term is the gradient of the cost function multiplied by a learning rate mu.
This is repeated until convergence, and it always gives the same result after the same number of iterations, given the same starting weights.
In stochastic gradient descent this is done instead in every iteration:
$w \leftarrow w - \mu \nabla Q_i(w)$
where Q_i is the cost on a single sample i, so the second term is the gradient for that one sample (multiplied by the learning rate mu). All the samples are shuffled at the beginning, and then the algorithm cycles through them, one per iteration.
So I think a couple of things contribute to the behavior I asked about:
(EDITED see replies below)
The point used to calculate the gradient at every iteration changes every time I re-run the fit function. That's why I don't obtain the same result every time.
(EDIT)(This can be made deterministic by setting random_state on the estimator)
I also realized that the number of iterations the algorithm runs varies between 10 and 15 (regardless of the max_iter I set). I couldn't find anywhere what the convergence criterion is in scikit-learn, but my guess is that if I could tighten it (i.e. run for more iterations) the answer I get would be much closer to that of the LinearRegression method.
(EDIT)(The convergence criterion depends on tol (the precision of the solution). By modifying this parameter (I set it to 1e-100) I was able to obtain the same solution as the one reported by LinearRegression)
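Both effects can be reproduced with a toy stochastic gradient descent written in plain Python. This is only a sketch under stated assumptions: it is plain SGD rather than scikit-learn's sag solver, the data are made up, and `seed` and `tol` are stand-ins that play the roles of random_state and tol.

```python
import random

def sgd_line_fit(xs, ys, seed, lr=0.01, tol=1e-3, max_epochs=2000):
    """SGD for y = w*x + b.

    Stops once a full pass over the data changes w and b by less than tol.
    Returns (w, b, number_of_epochs_run).
    """
    rng = random.Random(seed)  # fixing the seed fixes the visiting order
    order = list(range(len(xs)))
    w, b = 0.0, 0.0
    for epoch in range(max_epochs):
        rng.shuffle(order)
        w_old, b_old = w, b
        for i in order:
            err = w * xs[i] + b - ys[i]
            w -= lr * err * xs[i]
            b -= lr * err
        if max(abs(w - w_old), abs(b - b_old)) < tol:
            break
    return w, b, epoch + 1

# Made-up, roughly centred data; the least-squares line is about y = 1.39*x + 3.04.
xs = [-1.5, -0.5, 0.0, 0.5, 1.5]
ys = [1.0, 2.2, 2.9, 4.1, 5.0]

print("seed 0:    ", sgd_line_fit(xs, ys, seed=0))
print("seed 0:    ", sgd_line_fit(xs, ys, seed=0))   # same seed, same result
print("seed 1:    ", sgd_line_fit(xs, ys, seed=1))   # different visiting order
print("tight tol: ", sgd_line_fit(xs, ys, seed=0, tol=1e-12))
```

With a loose tolerance the loop stops after relatively few passes, at a point that still depends on the order in which the samples were visited; a tighter tolerance keeps it running longer.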
The difference between your two outputs may come from the preprocessing that you do only for the Ridge regression: scaler=preprocessing.StandardScaler().fit(x_train).
By doing such normalization you change the representation of your data, and it may lead to different results.
Note also that OLS minimizes only the squared difference between expected and predicted outputs, while Ridge adds an L2 penalty on the coefficients, weighted by alpha. With alpha=0, as in your code, that penalty vanishes and both should reach the same minimum.
I am trying out logistic regression from scratch in Python (finding probability estimates, defining the cost function, and applying gradient descent to maximize the likelihood). But I am confused about which estimates I should take for the first iteration. I took all the estimates as 0 (including the intercept), but the results are different from what we get in scikit-learn. I want to know: which initial estimates does scikit-learn use for logistic regression?
First of all, scikit-learn's LogisticRegression uses regularization, so unless you apply that too, it is unlikely you will get exactly the same estimates. If you really want to test your method against scikit-learn's, it is better to use their gradient descent implementation of logistic regression, which is called SGDClassifier. Make certain you set loss='log' (named 'log_loss' in recent releases) for logistic regression and set alpha=0 to remove regularization (alpha=0 needs a learning_rate other than the default 'optimal'), but again you will need to adjust the number of iterations and the learning rate (eta0), as their implementation is likely to be slightly different from yours.
To answer specifically about the initial estimates, I don't think it matters, but most commonly you set everything to 0 (including the intercept) and should converge just fine.
Also bear in mind that GD (gradient descent) models are sometimes hard to tune, and you may need to apply some scaling (like StandardScaler) to your data beforehand, as very large feature values are likely to make the gradient steps overshoot. Scikit's implementation adjusts for that.
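For a from-scratch implementation, the scaling step mentioned above can be sketched as follows. This mirrors what StandardScaler does (subtract the column mean, divide by the population standard deviation), but the helper names `fit_scaler` and `transform` and the data are made up for illustration.

```python
import math

def fit_scaler(rows):
    """Per-column mean and population standard deviation."""
    n = len(rows)
    cols = list(zip(*rows))
    means = [sum(c) / n for c in cols]
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / n)
            for c, m in zip(cols, means)]
    return means, stds

def transform(rows, means, stds):
    """Apply (value - mean) / std column by column."""
    return [[(v - m) / s for v, m, s in zip(row, means, stds)]
            for row in rows]

# Made-up training data with features on very different scales.
x_train = [[1000.0, 2.0], [2000.0, 3.0], [3000.0, 4.0]]
means, stds = fit_scaler(x_train)
x_scaled = transform(x_train, means, stds)
print(x_scaled)

# New points must be scaled with the statistics learned from the training data.
print(transform([[2000.0, 3.0]], means, stds))
```

After scaling, both columns have mean 0 and unit variance, so a single learning rate suits every coefficient.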
I am trying to do logistic regression on a huge data set using scikit-learn SGDClassifier (I am using partial_fit to be precise). The coefficients I obtained are of different sign, whereas I would like to force the classifier to look only for positive values (I know it may not be the best approach in terms of methodology however it is what would be ok for now).
My question is:
Is there any way to impose constraints on coefficients using SGDClassifier?
Thanks for your time
This is not possible with SGDClassifier in its current implementation.
If you wanted to implement this, you would have to add a penalty, call it e.g. 'positivity', which enforces the constraint by placing infinite cost on negative values.
It may be possible to implement this using e.g. this paper, Duchi 2009 (but I think there are follow-ups in newer literature that could be more up to the job). What you need to do at every mini-batch is project onto the positive orthant. This is done by simply setting to 0 all negative values that occur after a gradient step on the logistic loss.
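The projection step described above can be sketched in plain Python. This is a from-scratch illustration with made-up samples and a mini-batch size of one, not a modification of SGDClassifier; the function name `projected_sgd_step` is hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def projected_sgd_step(w, x, y, lr):
    """One SGD step on the logistic loss (y in {0, 1}), then projection."""
    p = sigmoid(sum(wj * xj for wj, xj in zip(w, x)))
    stepped = [wj - lr * (p - y) * xj for wj, xj in zip(w, x)]
    # Projection onto the non-negative orthant: clip negatives to zero.
    return [max(0.0, wj) for wj in stepped]

# Made-up samples: (features, label).
samples = [
    ([1.0, 2.0, -1.0], 1),
    ([0.5, -1.0, 2.0], 0),
    ([2.0, 0.5, -0.5], 1),
    ([-1.0, 1.0, 1.5], 0),
]

w = [0.0, 0.0, 0.0]
for _ in range(50):
    for x, y in samples:
        w = projected_sgd_step(w, x, y, lr=0.1)
print(w)
```

Because the clipping happens after every step, no coefficient is ever negative, at the price of a worse fit whenever the unconstrained solution would have used a negative weight.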