Python statsmodels get_prediction function formula

What formula does this function use after computing a simple linear regression (OLS) on the data? There are many different prediction interval formulas, some using the RMSE (root mean square error), some using the standard deviation, etc.
http://www.statsmodels.org/dev/generated/statsmodels.regression.linear_model.OLSResults.get_prediction.html#statsmodels.regression.linear_model.OLSResults.get_prediction
In particular, I want to know if it's using the usual prediction-interval formula for simple linear regression or something else:
yhat_0 +/- t(alpha/2, n-2) * s * sqrt(1 + 1/n + (x_0 - xbar)^2 / sum((x_i - xbar)^2))
Note the term involving the spread of the x values (the sum of squared deviations of x).

It does use the same formula as shown above. The standard error of a predicted observation is calculated as sqrt(variance of the predicted mean + variance of the residuals), which simplifies to the expression in the question for simple linear regression.
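A minimal sketch of how to inspect both interval types with get_prediction (the toy data and variable names are illustrative, not from the question):

import numpy as np
import statsmodels.api as sm

# toy data for a simple linear regression
rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2.0 + 3.0 * x + rng.normal(scale=0.5, size=100)

X = sm.add_constant(x)
res = sm.OLS(y, X).fit()

# get_prediction returns both the confidence interval for the mean
# and the (wider) prediction interval for a new observation
pred = res.get_prediction(X)
frame = pred.summary_frame(alpha=0.05)
print(frame[['mean', 'mean_ci_lower', 'mean_ci_upper',
             'obs_ci_lower', 'obs_ci_upper']].head())

# the observation standard error combines the variance of the predicted
# mean and the residual variance: se_obs = sqrt(se_mean**2 + res.scale)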

Related

Is there a function for linear-link NB1 regression in python?

I would like to build a linear-link negative binomial 1 (NB1) GLM for some data, where negative binomial 1 means that the variance is of the form (1 + a)*mean. This means I need to simultaneously control both the form of the link function and the form of the dispersion.
Is there a library/function I can use to control both?
Python's statsmodels negative binomial regression implementation is an NB2 model, which assumes the variance is of the form mean + a*mean^2 (see https://www.statsmodels.org/stable/glm.html), and I don't see a way to change this.
I am also aware of NegativeBinomialP, which allows me to specify a form for the dispersion parameter, but not for the link function, which defaults to log-link.
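For reference, a minimal sketch of the NB1 setup via NegativeBinomialP as mentioned above (which, as noted, is tied to the log link; the data and names here are illustrative):

import numpy as np
import statsmodels.api as sm

# illustrative count data
rng = np.random.default_rng(0)
X = sm.add_constant(rng.normal(size=(200, 2)))
y = rng.poisson(lam=np.exp(X @ np.array([0.5, 0.3, -0.2])))

# p=1 gives the NB1 variance function var = mu + alpha * mu
mod = sm.NegativeBinomialP(y, X, p=1)
res = mod.fit()
print(res.summary())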

R squared weighted by separate variable

I am converting code from STATA to Python and trying to figure out how to calculate a regression and an R-squared based on STATA's aweight option.
I have my X variable, my Y variable, and want to weight the regression based on a separate Z variable (population). How do you do this in Python?
sklearn, a.k.a. scikit-learn, is my go-to library for modelling.
This tutorial might be your answer:
SGD: Weighted samples
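A minimal sketch of a weighted fit and a weighted R-squared with scikit-learn, using population as the weight (whether this exactly reproduces STATA's aweight normalization is worth checking; the variable names are illustrative):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# illustrative data: X, y, and a population weight Z
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 1))
y = 1.5 * X[:, 0] + rng.normal(size=100)
Z = rng.integers(1000, 100000, size=100)  # population weights

model = LinearRegression().fit(X, y, sample_weight=Z)
y_hat = model.predict(X)

# weighted R-squared
print(r2_score(y, y_hat, sample_weight=Z))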

Modified negative binomial GLM in Python

Packages pymc3 and statsmodels can handle negative binomial GLMs in Python as shown here:
E(Y) = e^(beta_0 + Sigma (X_i * beta_i))
where the X_i are my predictor variables and Y is my dependent variable. Is there a way to force one of my variables (for example X_1) to have beta_1 = 1, so that the algorithm optimizes the other coefficients? I am open to using both pymc3 and statsmodels. Thanks.
GLM and the count models in statsmodels.discrete include an optional keyword offset which is exactly for this use case. It is added to the linear prediction part, and so corresponds to an additional variable with a fixed coefficient equal to 1.
http://www.statsmodels.org/devel/generated/statsmodels.genmod.generalized_linear_model.GLM.html
http://www.statsmodels.org/devel/generated/statsmodels.discrete.discrete_model.NegativeBinomial.html
Aside: GLM with family NegativeBinomial takes the negative binomial dispersion parameter as fixed, while the discrete model NegativeBinomial estimates the dispersion parameter by MLE jointly with the mean parameters.
Another aside: GLM has a fit_constrained method for linear or affine restrictions on the parameters. This works by transforming the design matrix and using offset for the constant part. In the simple case of a fixed parameter as in the question, this reduces to using offset in the same way as described above (although fit_constrained has to go through the more costly general case.)
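A minimal sketch of the offset approach described above, forcing the coefficient of X_1 to 1 in a negative binomial GLM (the data and names are illustrative):

import numpy as np
import statsmodels.api as sm

# illustrative data: x1 gets a fixed coefficient of 1, x2 is estimated
rng = np.random.default_rng(0)
x1 = rng.normal(size=300)
x2 = rng.normal(size=300)
mu = np.exp(0.2 + 1.0 * x1 + 0.5 * x2)
y = rng.poisson(mu)

# pass x1 as offset so it enters the linear predictor with coefficient 1
exog = sm.add_constant(x2)
res = sm.GLM(y, exog, family=sm.families.NegativeBinomial(), offset=x1).fit()
print(res.summary())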

Statsmodels - Negative Binomial doesn't converge while GLM does converge

I'm trying to do a Negative Binomial regression using Python's statsmodels package. The model estimates fine when using the GLM routine, i.e.
model = smf.glm(formula="Sales_Focus_2016 ~ Sales_Focus_2015 + A_Calls + A_Ed", data=df, family=sm.families.NegativeBinomial()).fit()
model.summary()
However, the GLM routine doesn't estimate alpha, the dispersion term. I tried to use the Negative Binomial routine directly (which does estimate alpha) i.e.
nb = smf.negativebinomial(formula="Sales_Focus_2016 ~ Sales_Focus_2015 + A_Calls + A_Ed", data=df).fit()
nb.summary()
But this doesn't converge. Instead I get the message:
Warning: Desired error not necessarily achieved due to precision loss.
Current function value: nan
Iterations: 0
Function evaluations: 1
Gradient evaluations: 1
My question is:
Do the two routines use different methods of estimation? Is there a way to make the smf.NegativeBinomial routine use the same estimation methods as the GLM routine?
discrete.NegativeBinomial uses either a Newton method (the default in statsmodels) or the scipy optimizers. The main problem is that the exponential mean function can easily cause overflow, or very large gradients and Hessians, when we are still far away from the optimum. There are some attempts in the fit method to get good starting values, but they do not always work.
A few possibilities that I usually try:
check that no regressor has large values, e.g. rescale to have a maximum below 10
use method='nm' (Nelder-Mead) as the initial optimizer and switch to newton or bfgs after some iterations or after convergence (see the sketch below)
try to come up with better starting values (see, for example, the GLM point below)
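A minimal sketch of that two-stage optimization, reusing the formula and df from the question (the maxiter values are arbitrary):

import statsmodels.formula.api as smf

formula = "Sales_Focus_2016 ~ Sales_Focus_2015 + A_Calls + A_Ed"
mod = smf.negativebinomial(formula=formula, data=df)

# run Nelder-Mead first to get into the neighbourhood of the optimum,
# then polish with a gradient-based optimizer from those starting values
res_nm = mod.fit(method='nm', maxiter=2000, disp=False)
res = mod.fit(start_params=res_nm.params, method='bfgs', maxiter=1000)
print(res.summary())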
GLM by default uses iteratively reweighted least squares (IRLS), which is standard only for one-parameter families, i.e. it takes the dispersion parameter as given. So the same method cannot be used directly for the full MLE in discrete NegativeBinomial.
GLM negative binomial still specifies the full loglike. So it is possible to do a grid search over the dispersion parameter using GLM.fit() for estimating the mean parameters for each value of the dispersion parameter. This should be equivalent to the corresponding discrete NegativeBinomial version (nb2 ? I don't remember). It could also be used as start_params for the discrete version.
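A rough sketch of that grid search over the dispersion parameter (the alpha grid and the reuse of the result as start_params are my assumptions, not a statsmodels recipe):

import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf

formula = "Sales_Focus_2016 ~ Sales_Focus_2015 + A_Calls + A_Ed"

# profile the GLM log-likelihood over a grid of fixed dispersion values
best = None
for alpha in np.linspace(0.01, 5.0, 50):
    res = smf.glm(formula=formula, data=df,
                  family=sm.families.NegativeBinomial(alpha=alpha)).fit()
    if best is None or res.llf > best[1]:
        best = (alpha, res.llf, res.params)

best_alpha, _, best_params = best

# the profiled estimates can then be tried as start_params for the
# discrete model (its last parameter is the dispersion alpha)
start = np.append(best_params, best_alpha)
nb = smf.negativebinomial(formula=formula, data=df).fit(start_params=start)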
In the statsmodels master version, there is now a hook that allows arbitrary scipy optimizers instead of the ones that were hardcoded. scipy recently gained trust-region Newton methods, and will get more in the future, which should work for more cases than the simple Newton method in statsmodels.
(However, most likely that does not currently work for discrete NegativeBinomial; I just found out about a possible problem: https://github.com/statsmodels/statsmodels/issues/3747)

Evaluating a Gaussian Fit

I'd like to know ways to determine how well a Gaussian function is fitting my data.
Here are a few plots I've been testing methods against. Currently, I'm just using the RMSE of the fit versus the sample (red is fit, blue is sample).
For instance, here are 2 good fits:
And here are 2 terrible fits that should be flagged as bad data:
In general, I'm looking for suggestions of additional metrics to measure the goodness of fit. Additionally, as you can see in the second 'good' fit, there can sometimes be other peaks outside the data. Currently, these are penalized by the RMSE method, though they should not be.
I'm looking for suggestions of additional metrics to measure the goodness of fit.
The one-sample Kolmogorov-Smirnov (KS) test would be a good starting point.
I'd suggest the Wikipedia article as an introduction.
The test is available in SciPy as scipy.stats.kstest. The function computes and returns both the KS test statistic and the p-value.
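A quick sketch of applying the one-sample KS test against the fitted Gaussian (this assumes the data are a sample and that mu and sigma come from your fit; the names and values here are illustrative):

import numpy as np
from scipy import stats

# illustrative sample and fitted Gaussian parameters
rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=2.0, size=500)
mu, sigma = 5.0, 2.0  # replace with your fitted parameters

statistic, p_value = stats.kstest(data, 'norm', args=(mu, sigma))
print(statistic, p_value)  # a small p-value suggests a poor fit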
You can try quantile-quantile (QQ) plots using probplot from scipy.stats:
import pylab
from scipy.stats import probplot

probplot(data, dist='norm', plot=pylab)  # data is your sample
pylab.show()
probplot calculates quantiles for a probability plot, and optionally shows the plot. It generates a probability plot of sample data against the quantiles of a specified theoretical distribution (the normal distribution by default), and optionally calculates a best-fit line for the data and plots the results using Matplotlib or a given plot function.
There are other ways of evaluating a good fit, but most of them are not robust to outliers.
There is MSE (mean squared error), which you already know, and RMSE, which is its square root.
But you can also measure the fit using MAE (mean absolute error) and MAPE (mean absolute percentage error).
Also, there is the Kolmogorov-Smirnov test, which is more involved and for which you would probably want a library, whereas MAE, MAPE and MSE you can implement yourself quite easily (see the sketch below).
(If you are dealing with classification, which is apparently not your case, ROC curves and confusion matrices are also useful accuracy metrics.)
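A minimal sketch of those error metrics with NumPy, comparing the fitted curve against the sample (the arrays here are illustrative placeholders):

import numpy as np

# y_sample: observed values, y_fit: Gaussian fit evaluated at the same points
y_sample = np.array([0.1, 0.5, 1.0, 0.5, 0.1])
y_fit = np.array([0.12, 0.48, 0.95, 0.52, 0.11])

mse = np.mean((y_sample - y_fit) ** 2)            # mean squared error
rmse = np.sqrt(mse)                               # root mean squared error
mae = np.mean(np.abs(y_sample - y_fit))           # mean absolute error
mape = np.mean(np.abs((y_sample - y_fit) / y_sample)) * 100  # beware zeros in y_sample
print(mse, rmse, mae, mape)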
