Python: L2 penalty for logistic regression model from statsmodels?

Is there a way to add an L2 penalty to the logistic regression model in statsmodels, through a parameter or otherwise? I only found the L1 penalty in the docs, but nothing for the L2 penalty.

The models in statsmodels.discrete, like Logit, Poisson and MNLogit, currently have only L1 penalization. However, elastic net for GLM and a few other models has recently been merged into statsmodels master.
GLM with a Binomial family and a binary response is the same model as discrete.Logit, although the implementation differs. See my answer for L2 penalization in "Is ridge binomial regression available in Python?"
What has not yet been merged into statsmodels is L2 penalization with a structured penalization matrix, as used for example as a roughness penalty in generalized additive models (GAM) and in spline fitting.
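For a concrete, minimal sketch of the GLM route, assuming a statsmodels version that already includes the elastic net code and that L1_wt is forwarded to the elastic net fitter the same way it is for OLS (the data below is made up purely for illustration):
import numpy as np
import statsmodels.api as sm

np.random.seed(0)
X = sm.add_constant(np.random.randn(200, 3))   # design matrix with an intercept column
y = (np.random.rand(200) < 0.5).astype(float)  # binary response

model = sm.GLM(y, X, family=sm.families.Binomial())
# L1_wt=0 turns the elastic net penalty into a pure L2 (ridge) penalty;
# alpha controls its overall strength.
result = model.fit_regularized(method='elastic_net', alpha=1.0, L1_wt=0.0)
print(result.params)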

If you look closely at the documentation for statsmodels.regression.linear_model.OLS.fit_regularized, you'll see that the current version of statsmodels allows for Elastic Net regularization, which is basically just a convex combination of the L1 and L2 penalties (though more robust implementations employ some post-processing to diminish undesired behaviors of the naive implementation; see "Elastic Net" on Wikipedia for details).
If you take a look at the parameters for fit_regularized in the documentation:
OLS.fit_regularized(method='elastic_net', alpha=0.0, L1_wt=1.0, start_params=None, profile_scale=False, refit=False, **kwargs)
you'll see that L1_wt is the weight given to the L1 part of the elastic-net penalty (the remaining 1 - L1_wt goes to the L2 part), while alpha scales the penalty as a whole. So to get the L2 penalty you're looking for, you just pass L1_wt=0 as an argument when you call the function. As an example:
import statsmodels.api as sm

model = sm.OLS(y, X)   # y: response, X: design matrix (use sm.add_constant(X) if you want an intercept)
results = model.fit_regularized(method='elastic_net', alpha=1.0, L1_wt=0.0)
print(results.params)  # regularized results expose the fitted coefficients via .params
should give you an L2 Penalized Regression predicting target y from input X.
Hope that helps!
P.S. Three final comments:
1) statsmodels currently only implements elastic_net as an option to the method argument. So that gives you L1 and L2 and any linear combination of them but nothing else (for OLS at least);
2) L1 Penalized Regression = LASSO (least absolute shrinkage and selection operator);
3) L2 Penalized Regression = Ridge Regression, also known as the Tikhonov–Miller method, the Phillips–Twomey method, constrained linear inversion, or simply linear regularization.

Related

Converting an sklearn logistic regression to a statsmodels logistic regression

I'm optimizing a logistic regression in sklearn through repeated k-fold cross-validation. I want to check out the confidence intervals and, based on other Stack Exchange answers, it seems easier to get that information from statsmodels.
Coming from an sklearn background, though, statsmodels is opaque. How can I translate the optimized settings for my logistic regression into statsmodels: things like the L2 penalty, the C value, the intercept, etc.?
I've done some research and it looks like statsmodels supports L2 indirectly through GLM with a Binomial family. The C value will need to be converted from C into alpha (whatever that means), and I have only a vague idea of how to specify the intercept (it looks like it has something to do with the add_constant function).
Can someone give an example of how to do this kind of translation into statsmodels? I'm sure once I see it done, a lot of it will naturally fall into place in my head.
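Not an authoritative answer, but a rough sketch of the translation one could try. The alpha ~ 1 / (C * n_obs) conversion is an assumption, based on sklearn summing per-sample losses while statsmodels' elastic net works with the averaged log-likelihood; note also that fit_regularized penalizes every column, including the added constant, unlike sklearn's intercept:
import numpy as np
import statsmodels.api as sm
from sklearn.linear_model import LogisticRegression

np.random.seed(0)
X = np.random.randn(300, 4)
y = (X[:, 0] + 0.5 * X[:, 1] + np.random.randn(300) > 0).astype(int)

# sklearn fit with some tuned C (C is the inverse regularization strength).
C = 0.5
skl = LogisticRegression(penalty='l2', C=C, solver='lbfgs').fit(X, y)

# Rough statsmodels counterpart: GLM with a Binomial family plus an elastic net
# fit with L1_wt=0 (pure L2). alpha ~ 1 / (C * n_obs) is an assumed conversion.
X_sm = sm.add_constant(X)                 # explicit intercept column (sklearn adds its own)
alpha = 1.0 / (C * X.shape[0])
glm = sm.GLM(y, X_sm, family=sm.families.Binomial())
res = glm.fit_regularized(method='elastic_net', alpha=alpha, L1_wt=0.0)

print(skl.intercept_, skl.coef_)
print(res.params)                          # first entry is the intercept (penalized here, unlike sklearn)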

Lasso regression won't remove 2 features which are highly correlated

I have two features, say F1 and F2, which have a correlation of about 0.9.
When I built my model, I first put all the features into my regression model. Once I had the model, I then ran Lasso regression on it, hoping that this would tackle any collinearity between the features. However, the Lasso regression kept both F1 and F2 in my model.
Two questions:
i) If F1 and F2 are highly correlated but Lasso regression still kept both of them, what could this mean? Does it mean regularization doesn't work in some cases?
ii) How do I adjust my model or the Lasso regression so that it drops F1 or F2? (I am using sklearn.linear_model.LogisticRegression, have set penalty='l1' or 'elasticnet', tried very large and very small C values, tried the 'liblinear' and 'saga' solvers, and set l1_ratio=1, but I still can't drop either F1 or F2 from my model.)
Answers to your questions:
i) Lasso shrinks coefficients gradually. You can find a nice picture in the books authored by Robert Tibshirani, the person behind the Lasso, showing how coefficients gradually fall to zero as the regularization coefficient increases (you can run such an experiment yourself). The fact that the model still keeps both features can mean one of two things: either the model deems both important, or there is not enough regularization to kill one of them.
ii) You're right to go with Lasso, i.e. L1 regularization; the relevant knob is the C parameter. The way it's coded in sklearn, the smaller the C, the stronger the regularization (C is the inverse of the regularization strength). That said, in machine learning your task is usually not to eliminate collinearity entirely ("to kill F1 or F2", in your words), but to find a model (or a set of parameters, if you wish) that generalizes best. That is done through model tuning via cross-validation. Warning: stronger regularization means more underfitting.
I would add, though, that collinearity is somewhat dangerous for linear regression because it can cause model instability (differing coefficients on different subsamples). So, with linear regression, you may wish to check for this too.
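To make this concrete, here is a small sketch with two synthetic, highly correlated features (names and numbers are made up for illustration); as C shrinks, the stronger L1 penalty typically drives one of the two coefficients exactly to zero:
import numpy as np
from sklearn.linear_model import LogisticRegression

np.random.seed(0)
n = 500
f1 = np.random.randn(n)
f2 = 0.9 * f1 + 0.45 * np.random.randn(n)      # correlation with f1 around 0.9
X = np.column_stack([f1, f2])
y = (f1 + np.random.randn(n) > 0).astype(int)  # only the shared signal matters

for C in [10.0, 1.0, 0.1, 0.01]:
    clf = LogisticRegression(penalty='l1', C=C, solver='liblinear').fit(X, y)
    print(C, clf.coef_.ravel())                # at small C one coefficient usually hits exactly 0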

Fitting Keras L1 models

I have a simple Keras model (a plain Lasso linear model) where the inputs are fed into a single-neuron layer, Dense(1, kernel_regularizer=l1(fdr))(input_layer), but the weights of this model are never set exactly to zero. I find this interesting since scikit-learn's Lasso can set coefficients exactly to zero.
I have used Adam and TensorFlow's FtrlOptimizer for optimisation, and both have the same problem.
I've already checked this question, but it does not explain why sklearn can set values exactly to zero, not to mention how its models converge in ~500 ms on my server when the same model in Keras takes 2.4 s with early termination.
Is this all because of the optimizer being used or am I missing something?
"Is this all because of the optimizer being used or am I missing something?"
Indeed. If you look into the actual function that gets called when you fit Lasso from scikit-learn (it is called from the ElasticNet class), you will see that it uses a different optimization algorithm.
Coordinate descent in scikit-learn's ElasticNet starts with the coefficient vector at zero and then considers adding nonzero entries one at a time (this is related to stepwise feature selection for linear regression); its updates can set a coefficient exactly to zero.
Other methods used to optimize L1-regularized regression work in the same way: for example, LARS (least-angle regression) can also be used from scikit-learn.
In contrast, a paper on the FTRL algorithm says:
"Unfortunately, OGD is not particularly effective at producing sparse models. In fact, simply adding a subgradient of the L1 penalty to the gradient of the loss (∇ℓ_t(w)) will essentially never produce coefficients that are exactly zero."
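To see why coordinate-descent-style solvers can produce exact zeros while simply following (sub)gradients cannot, here is a minimal, simplified Lasso coordinate descent sketch built around the soft-thresholding operator. It is an illustration of the idea only, not scikit-learn's actual implementation:
import numpy as np

def soft_threshold(z, t):
    # Proximal step for t*|.|: anything with |z| <= t is set exactly to zero.
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, alpha, n_iter=100):
    n, p = X.shape
    w = np.zeros(p)
    for _ in range(n_iter):
        for j in range(p):
            r = y - X @ w + X[:, j] * w[j]      # residual with feature j excluded
            rho = X[:, j] @ r / n
            w[j] = soft_threshold(rho, alpha) / (X[:, j] @ X[:, j] / n)
    return w

np.random.seed(0)
X = np.random.randn(200, 5)
y = 2.0 * X[:, 0] + 0.1 * np.random.randn(200)  # only feature 0 carries signal
print(lasso_cd(X, y, alpha=0.5))                # the other coefficients come out exactly 0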

linear regression model with AR errors python

Is there a Python package (statsmodels/scipy/pandas/etc.) with functionality for estimating coefficients for a linear regression model with autoregressive errors, such as the following SAS implementation? http://support.sas.com/documentation/cdl/en/etsug/63348/HTML/default/viewer.htm#etsug_autoreg_sect003.htm
statsmodels (http://www.statsmodels.org/dev/index.html) has ARMA, ARIMA and SARIMAX models that take explanatory variables to model the mean. This corresponds to a linear model, y = X b + e, where the error term e follows an ARMA or seasonal ARMA process. AR errors are the special case in which the moving average term has no lags.
statsmodels also has an autoregressive AR class, but it does not allow for explanatory variables.
In these time series models, prediction is a conditional prediction that takes the history into account for forecasting.
statsmodels also has a GLSAR class, which is a linear model that removes the effect of autocorrelated AR residuals. It uses feasible generalized least squares estimation and can only predict the unconditional term X b.
http://www.statsmodels.org/dev/generated/statsmodels.tsa.arima_model.ARMA.html#statsmodels.tsa.arima_model.ARMA
http://www.statsmodels.org/dev/statespace.html#seasonal-autoregressive-integrated-moving-average-with-exogenous-regressors-sarimax
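As a concrete illustration of the SARIMAX route, here is a minimal sketch of a linear regression with AR(1) errors on synthetic data (the coefficients and the AR parameter below are invented for the example):
import numpy as np
import statsmodels.api as sm

np.random.seed(0)
n = 300
x = np.random.randn(n, 2)                        # explanatory variables
e = np.zeros(n)
for t in range(1, n):                            # AR(1) errors with rho = 0.7
    e[t] = 0.7 * e[t - 1] + np.random.randn()
y = 1.0 + x @ np.array([2.0, -1.0]) + e          # y = X b + e

# order=(1, 0, 0) gives AR(1) errors; exog models the mean X b.
mod = sm.tsa.statespace.SARIMAX(y, exog=sm.add_constant(x), order=(1, 0, 0))
res = mod.fit(disp=False)
print(res.summary())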

Using l1 penalty with LogisticRegressionCV() in scikit-learn

I am using the Python scikit-learn library for classification.
As a feature selection step, I want to use RandomizedLogisticRegression().
So to find the best value of C by cross-validation, I used LogisticRegressionCV(penalty='l1', solver='liblinear').
However, all coefficients were 0 in this case.
Using the l2 penalty works without problems. Also, a single run of LogisticRegression() with the l1 penalty seems to give proper coefficients.
I am using RandomizedLasso and LassoCV() as a workaround, but I am not sure whether it is proper to use LASSO with binary class labels.
So my questions are these:
Is there some problem in using LogisticRegressionCV() in my case?
Is there another way to find the best value of C for logistic regression, other than GridSearchCV()?
Is it possible to use LASSO for binary (not continuous) classification?
From what you describe, the coefficient of the l1 regularisation term is too high in your case, and you need to decrease it.
When that coefficient is very high, the regularisation term becomes more important than the error term, so your model just becomes very sparse and doesn't predict anything.
I checked LogisticRegressionCV: it searches between 1e-4 and 1e4 via the Cs argument. According to the documentation, higher Cs values mean lower regularisation coefficients; if you provide an integer, that many values are chosen on a logarithmic grid between 1e-4 and 1e4. Alternatively, you can provide your own list of inverse regularisation coefficients.
So play with the Cs parameter and try to lower the regularisation coefficient.
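A small sketch of what that might look like, passing your own list of inverse regularisation strengths (the data and the Cs values below are made up for illustration):
import numpy as np
from sklearn.linear_model import LogisticRegressionCV

np.random.seed(0)
X = np.random.randn(400, 10)
y = (X[:, 0] - X[:, 1] + np.random.randn(400) > 0).astype(int)

# Larger C = weaker L1 regularisation; include large values so the grid
# is not dominated by heavily regularised (all-zero) fits.
Cs = [0.01, 0.1, 1, 10, 100, 1000]
clf = LogisticRegressionCV(Cs=Cs, penalty='l1', solver='liblinear', cv=5).fit(X, y)
print(clf.C_)      # selected C (one per class)
print(clf.coef_)   # should no longer be all zeros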
