Statsmodels - Negative Binomial doesn't converge while GLM does converge - python

I'm trying to do a Negative Binomial regression using Python's statsmodels package. The model estimates fine when using the GLM routine i.e.
model = smf.glm(formula="Sales_Focus_2016 ~ Sales_Focus_2015 + A_Calls + A_Ed", data=df, family=sm.families.NegativeBinomial()).fit()
model.summary()
However, the GLM routine doesn't estimate alpha, the dispersion term. I tried to use the Negative Binomial routine directly (which does estimate alpha) i.e.
nb = smf.negativebinomial(formula="Sales_Focus_2016 ~ Sales_Focus_2015 + A_Calls + A_Ed", data=df).fit()
nb.summary()
But this doesn't converge. Instead I get the message:
Warning: Desired error not necessarily achieved due to precision loss.
Current function value: nan
Iterations: 0
Function evaluations: 1
Gradient evaluations: 1
My questions are:
Do the two routines use different methods of estimation? Is there a way to make the smf.NegativeBinomial routine use the same estimation methods as the GLM routine?

discrete.NegativeBinomial uses either the Newton method implemented in statsmodels (the default) or the scipy optimizers. The main problem is that the exponential mean function can easily produce overflow, or very large gradients and Hessians, while we are still far away from the optimum. The fit method makes some attempts to find good starting values, but this does not always work.
A few possibilities that I usually try:
check that no regressor has large values, e.g. rescale so the maximum is below 10
use method='nm' (Nelder-Mead) as the initial optimizer and switch to Newton or BFGS after some iterations or after convergence (see the sketch after this list)
try to come up with better starting values (see, for example, the GLM-based approach below)
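For the second and third points, a rough sketch of the two-stage fit (assuming df and the formula from the question) could look like this:
import statsmodels.formula.api as smf

formula = "Sales_Focus_2016 ~ Sales_Focus_2015 + A_Calls + A_Ed"
mod = smf.negativebinomial(formula=formula, data=df)

# First pass with Nelder-Mead, which tolerates poor starting values...
res_nm = mod.fit(method='nm', maxiter=2000)

# ...then refine with a gradient-based optimizer, starting from those estimates.
res = mod.fit(start_params=res_nm.params, method='bfgs', maxiter=1000)
print(res.summary())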
GLM uses iteratively reweighted least squares (IRLS) by default, which is only standard for one-parameter families, i.e. it takes the dispersion parameter as given. So the same method cannot be used directly for the full MLE in the discrete NegativeBinomial.
GLM negative binomial still specifies the full loglikelihood. So it is possible to do a grid search over the dispersion parameter, using GLM.fit() to estimate the mean parameters for each value of the dispersion parameter. This should be equivalent to the corresponding discrete NegativeBinomial version (NB2, if I remember correctly). The result could also be used as start_params for the discrete version.
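A rough sketch of that grid-search idea (again assuming df and the formula from the question; the grid bounds are arbitrary):
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf

formula = "Sales_Focus_2016 ~ Sales_Focus_2015 + A_Calls + A_Ed"

# Profile the GLM loglikelihood over a grid of fixed dispersion values.
best = None
for a in np.linspace(0.01, 5.0, 50):
    res_a = smf.glm(formula=formula, data=df,
                    family=sm.families.NegativeBinomial(alpha=a)).fit()
    if best is None or res_a.llf > best[0]:
        best = (res_a.llf, a, res_a)

llf, alpha_hat, res_glm = best

# Use the profiled estimates as starting values for the full MLE
# (the discrete model's parameter vector is the mean parameters plus alpha).
start = np.append(res_glm.params, alpha_hat)
res_nb = smf.negativebinomial(formula=formula, data=df).fit(start_params=start)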
In the statsmodels master version, there is now a connection that allows arbitrary scipy optimizers instead of the ones that were hardcoded. scipy recently gained trust-region Newton methods, and will get more in the future, which should work for more cases than the simple Newton method in statsmodels.
(However, that most likely does not work for discrete NegativeBinomial at the moment; I just found out about a possible problem: https://github.com/statsmodels/statsmodels/issues/3747 )

Related

How to provide custom gradients to HMC sampler in tensorflow-probability?

I am trying to use the built-in HMC sampler of tensorflow-probability to generate samples from the posterior. According to the documentation, it seems one has to provide the (possibly unnormalized) log density of the posterior to the target_log_prob_fn callable, and tensorflow-probability automatically computes its gradient (with respect to the parameters to be inferred) to perform the Hamiltonian MCMC updates.
However, for my application the likelihood and the gradient of the resulting posterior are computed outside of tensorflow (it involves the solution of a partial differential equation, which I can compute efficiently using another Python library). So I was wondering: is there a way I can directly pass target_log_prob_fn the (unnormalized) log density of the posterior and its gradient? In other words, is there a way I can ask the HMC sampler to use the gradients provided by me to perform the MCMC update?
I found a related question over here, but it does not exactly answer my question.
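For reference, a toy sketch of the standard pattern described above (a standard-normal log density as a placeholder target; the gradient is obtained by automatic differentiation, which is exactly what the question wants to bypass):
import tensorflow as tf
import tensorflow_probability as tfp

def target_log_prob_fn(theta):
    # Unnormalized log density of a standard normal (placeholder posterior).
    return -0.5 * tf.reduce_sum(theta ** 2)

kernel = tfp.mcmc.HamiltonianMonteCarlo(
    target_log_prob_fn=target_log_prob_fn,   # gradient comes from autodiff
    step_size=0.1,
    num_leapfrog_steps=10)

samples = tfp.mcmc.sample_chain(
    num_results=1000,
    num_burnin_steps=200,
    current_state=tf.zeros(3),
    kernel=kernel,
    trace_fn=None)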

nelder-mead, finding error on fitted parameters

I've started using the minimizer in scipy.optimize, and for most parameters I've tried to fit, the default BFGS method has worked just fine. It helpfully reports the inverse of the Hessian matrix, and I can extract the errors on the fitted parameters from the diagonal of that matrix. However, for some new parameters I'm trying to fit, the values are quite small and I run into precision errors using BFGS.
Switching to Nelder-Mead does the job, but I don't know how to extract the uncertainties of the fitted parameters with this method.
How can I extract the uncertainties for the fitted parameters when using Nelder-Mead in scipy.optimize?
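For context, a toy sketch of the BFGS route the question describes (a Gaussian negative log-likelihood as a placeholder objective); when the objective is a negative log-likelihood, the diagonal of the returned inverse Hessian approximates the parameter variances:
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=0.5, size=200)

def nll(params):
    # Negative log-likelihood of a Gaussian (up to an additive constant).
    mu, log_sigma = params
    sigma = np.exp(log_sigma)
    return 0.5 * np.sum(((data - mu) / sigma) ** 2) + data.size * log_sigma

res = minimize(nll, x0=[0.0, 0.0], method='BFGS')
std_errs = np.sqrt(np.diag(res.hess_inv))   # approximate standard errors
print(res.x, std_errs)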

Modified negative binomial GLM in Python

Packages pymc3 and statsmodels can handle negative binomial GLMs in Python as shown here:
E(Y) = exp(beta_0 + sum_i(beta_i * X_i))
where the X_i are my predictor variables and Y is my dependent variable. Is there a way to force one of my variables (for example X_1) to have beta_1 = 1, so that the algorithm optimizes the other coefficients? I am open to using both pymc3 and statsmodels. Thanks.
GLM and the count models in statsmodels.discrete include an optional keyword offset, which is exactly for this use case. It is added to the linear prediction part and so corresponds to an additional variable with a fixed coefficient equal to 1.
http://www.statsmodels.org/devel/generated/statsmodels.genmod.generalized_linear_model.GLM.html
http://www.statsmodels.org/devel/generated/statsmodels.discrete.discrete_model.NegativeBinomial.html
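A minimal sketch of the offset approach (synthetic placeholder data; x1 is the variable whose coefficient is fixed at 1):
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
X = rng.normal(size=(500, 2))
mu = np.exp(0.5 + x1 + X @ np.array([0.3, -0.2]))
y = rng.poisson(mu * rng.gamma(2.0, 0.5, size=500))   # overdispersed counts

exog = sm.add_constant(X)   # remaining predictors

# GLM version: x1 enters the linear predictor with coefficient fixed at 1 via offset.
res_glm = sm.GLM(y, exog, offset=x1,
                 family=sm.families.NegativeBinomial()).fit()

# Discrete count-model version, which also estimates the dispersion parameter by MLE.
res_nb = sm.NegativeBinomial(y, exog, offset=x1).fit()
print(res_glm.params, res_nb.params)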
Aside: GLM with family NegativeBinomial takes the negative binomial dispersion parameter as fixed, while the discrete model NegativeBinomial estimates the dispersion parameter by MLE jointly with the mean parameters.
Another aside: GLM has a fit_constrained method for linear or affine restrictions on the parameters. It works by transforming the design matrix and using offset for the constant part. In the simple case of a fixed parameter, as in the question, this reduces to using offset in the same way as described above (although fit_constrained has to go through the more costly general case).
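A minimal sketch of the fit_constrained route (synthetic data; the formula and column names are placeholders):
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(500, 3)), columns=["x1", "x2", "x3"])
df["y"] = rng.poisson(np.exp(0.2 + df["x1"] + 0.3 * df["x2"]))

mod = smf.glm("y ~ x1 + x2 + x3", data=df,
              family=sm.families.NegativeBinomial())
res = mod.fit_constrained("x1 = 1")   # fix the coefficient of x1 at 1
print(res.summary())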

Lasso Generalized linear model in Python

I would like to fit a generalized linear model with negative binomial link function and L1 regularization (lasso) in python.
Matlab provides the nice function:
lassoglm(X,y, distr)
where distr can be poisson, binomial etc.
I had a look at both statsmodels and scikit-learn, but I did not find any ready-to-use function or example that could direct me towards a solution.
In matlab it seems they minimize this:
min (1/N * Deviance(β0,β) + λ * sum(abs(β)) )
where deviance depends on the link function.
Is there a way to implement this easily with scikit-learn or statsmodels, or should I go for cvxopt?
statsmodels has had a fit_regularized method for the discrete models, including NegativeBinomial, for some time:
http://statsmodels.sourceforge.net/devel/generated/statsmodels.discrete.discrete_model.NegativeBinomial.fit_regularized.html
which doesn't have its docstring yet (I just noticed). The docstring for Poisson has the same information: http://statsmodels.sourceforge.net/devel/generated/statsmodels.discrete.discrete_model.Poisson.fit_regularized.html
and there should be some examples available in the documentation or unit tests.
It uses an interior point algorithm with either scipy's slsqp or, optionally, cvxopt if it is installed. Compared to steepest descent or coordinate descent methods, this is only appropriate when the number of features/explanatory variables is not too large.
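A minimal sketch of the fit_regularized route (synthetic placeholder data; note that alpha here is the L1 penalty weight, not the negative binomial dispersion):
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = sm.add_constant(rng.normal(size=(500, 3)))
y = rng.poisson(np.exp(X @ np.array([0.5, 0.3, 0.0, -0.2])))

mod = sm.NegativeBinomial(y, X)
res = mod.fit_regularized(method='l1', alpha=1.0, trim_mode='auto')
print(res.params)   # small coefficients are trimmed to exactly zero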
Coordinate descent with an elastic net penalty for GLM is in a work-in-progress pull request and will most likely be available in statsmodels 0.8.

Logistic Regression function on sklearn

I am learning Logistic Regression from sklearn and came across this : http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression
I have created an implementation which shows me the accuracy scores for training and testing. However, it is very unclear how this was achieved. My questions are: What is the maximum likelihood estimate? How is it calculated? What is the error measure? What is the optimisation algorithm used?
I know all of the above in theory, but I am not sure where, when, and how scikit-learn calculates it, or if it is something I need to implement at some point. I have an accuracy rate of 83%, which was what I was aiming for, but I am very confused about how this was achieved by scikit-learn.
Would anyone be able to point me in the right direction?
I recently started studying LR myself; I still don't get many steps of the derivation, but I think I understand which formulas are being used.
First of all, let's assume that you are using the latest version of scikit-learn and that the solver being used is solver='lbfgs' (which I believe is the default).
The code is here: https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/linear_model/logistic.py
What is the Maximum likelihood estimate? How is this being calculated?
The function to compute the likelihood estimate is this one https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/linear_model/logistic.py#L57
The interesting line is:
# Logistic loss is the negative of the log of the logistic function.
out = -np.sum(sample_weight * log_logistic(yz)) + .5 * alpha * np.dot(w, w)
which is formula 7 of this tutorial. The function also computes the gradient of the likelihood, which is then passed to the minimization function (see below). One important thing is that the intercept is the w0 of the formulas in the tutorial. But that's only valid if fit_intercept is True.
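A small numpy sketch of that penalized loss (synthetic w, X, y; sample weights of 1), with labels already mapped to {-1, +1} as noted further below:
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = rng.choice([-1.0, 1.0], size=100)   # labels in {-1, +1}
w = np.array([0.5, -0.2, 0.1])
alpha = 1.0                             # L2 strength (1 / C)

yz = y * (X @ w)                              # margins
log_logistic_yz = -np.logaddexp(0.0, -yz)     # log(1 / (1 + exp(-yz)))
loss = -np.sum(log_logistic_yz) + 0.5 * alpha * np.dot(w, w)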
What is the error measure?
I'm sorry I'm not sure.
What is the optimisation algorithm used?
See the following lines in the code: https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/linear_model/logistic.py#L389
It's this function http://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.fmin_l_bfgs_b.html
One very important thing is that the classes are +1 or -1! (For the binary case: in the literature 0 and 1 are common, but that won't work here.)
Also notice that numpy broadcasting rules are used in all formulas (that's why you don't see explicit iteration).
This was my attempt at understanding the code. I slowly went mad to the point of ripping apart the scikit-learn code (it only works for the binary case). This also served as inspiration.
Hope it helps.
Check out Prof. Andrew Ng's machine learning notes on Logistic Regression (starting from page 16): http://cs229.stanford.edu/notes/cs229-notes1.pdf
In logistic regression you minimize cross entropy (which in turn maximizes the likelihood of y given x). In order to do this, the gradient of the cross-entropy (cost) function is computed and used to update the weights of the algorithm, which are assigned to each input. In simple terms, logistic regression comes up with a line that best discriminates your two binary classes by changing its parameters so that the cross entropy keeps going down. The 83% accuracy (I'm not sure what accuracy that is; you should be dividing your data into training/validation/testing sets) means the line logistic regression is using for classification can correctly separate the classes 83% of the time.
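As a toy illustration of that description (synthetic data, plain batch gradient descent on the cross-entropy loss; this is not what sklearn's lbfgs solver literally does):
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(float)   # labels in {0, 1}

w, b, lr = np.zeros(2), 0.0, 0.1
for _ in range(1000):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # predicted probabilities
    w -= lr * (X.T @ (p - y)) / len(y)       # gradient of mean cross entropy
    b -= lr * np.mean(p - y)

accuracy = np.mean((p > 0.5) == y)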
I would have a look at the following on github:
https://github.com/scikit-learn/scikit-learn/blob/965b109bf2ac3a61dcbd02bc29dd8c9598c2b54c/sklearn/linear_model/logistic.py
The link is to the implementation of sklearn's logistic regression. It contains the optimization algorithms used, which include Newton conjugate gradient (newton-cg) and BFGS (the Broyden-Fletcher-Goldfarb-Shanno algorithm), both of which require the gradient (and, for newton-cg, the Hessian) of the loss function (_logistic_loss). _logistic_loss is your (negative log-) likelihood function.
