Getting correct exogenous least squares prediction in Python statsmodels

I am having trouble getting a reasonable prediction behavior from least squares fits in statsmodels version 0.6.1. It does not seem to be providing a sensible value.
Consider the following data
import numpy as np
import statsmodels.api as sm # needed for sm.add_constant below
xx = np.array([1.1,2.2,3.3,4.4]) # Independent variable
XX = sm.add_constant(xx) # Include constant for matrix fitting in statsmodels
yy = np.array([2,1,5,6]) # Dependent variable
ww = np.array([0.1,1,3,0.5]) # Weights to try
wn = ww/ww.sum() # Normalized weights
zz = 1.9 # Independent variable value to predict for
We can use numpy to do a weighted fit and prediction
np_unw_value = np.polyval(np.polyfit(xx, yy, deg=1, w=1+0*ww), zz)
print("Unweighted fit prediction from numpy.polyval is {sp}".format(sp=np_unw_value))
and we find a prediction of 2.263636.
As a sanity check, we can also see what R has to say about the matter
import pandas as pd
import rpy2.robjects
from rpy2.robjects.packages import importr
import rpy2.robjects.pandas2ri
pdf = pd.DataFrame({'x':xx, 'y':yy, 'w':wn})
pdz = pd.DataFrame({'x':[zz], 'y':[np.inf]})
rfit = rpy2.robjects.r.lm('y~x', data=pdf, weights=1+0*pdf['w']**2)
rpred = rpy2.robjects.r.predict(rfit, pdz)[0]
print("Unweighted fit prediction from R is {sp}".format(sp=rpred))
and again we find a prediction of 2.263636. My problem is that we do not get that result from statsmodels OLS
import statsmodels.api as sm
from statsmodels.sandbox.regression.predstd import wls_prediction_std
owls = sm.OLS(yy, XX).fit()
sm_value_u, iv_lu, iv_uu = wls_prediction_std(owls, exog=np.array([[1,zz]]))
sm_unw_v = sm_value_u[0]
print("Unweighted OLS fit prediction from statsmodels.wls_prediction_std is {sp}".format(sp=sm_unw_v))
Instead I obtain a value 1.695814 (similar things happen with WLS()). Either there is a bug, or using statsmodels for prediction has some trick too obscure for me to find. What is going on?

The results classes have a predict method that provides the prediction for new values of the explanatory variables:
>>> print(owls.predict(np.array([[1,zz]])))
[ 2.26363636]
The first return value of wls_prediction_std is the standard error of the prediction, not the prediction itself.
>>> help(wls_prediction_std)
Help on function wls_prediction_std in module statsmodels.sandbox.regression.predstd:
wls_prediction_std(res, exog=None, weights=None, alpha=0.05)
calculate standard deviation and confidence interval for prediction
applies to WLS and OLS, not to general GLS,
that is independently but not identically distributed observations
res : regression result instance
results of WLS or OLS regression required attributes see notes
exog : array_like (optional)
exogenous variables for points to predict
weights : scalar or array_like (optional)
weights as defined for WLS (inverse of variance of observation)
alpha : float (default: alpha = 0.05)
confidence level for two-sided hypothesis
predstd : array_like, 1d
standard error of prediction
same length as rows of exog
interval_l, interval_u : array_like
lower und upper confidence bounds
The sandbox function will be replaced by a new method get_prediction of the results classes that provides the prediction and the extra results like standard deviation and confidence and prediction intervals.
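The difference between the two numbers can be checked by hand. Below is a minimal NumPy-only sketch, not statsmodels' own code, that recomputes both quantities for the data in the question using the textbook OLS formulas. The names `beta`, `mse`, `pred`, and `pred_se` are illustrative choices, not part of any library.

```python
import numpy as np

xx = np.array([1.1, 2.2, 3.3, 4.4])   # independent variable from the question
yy = np.array([2.0, 1.0, 5.0, 6.0])   # dependent variable from the question
zz = 1.9                              # point to predict for

X = np.column_stack([np.ones_like(xx), xx])   # design matrix with a constant column
x0 = np.array([1.0, zz])                      # new row: constant and zz

# OLS coefficients (intercept, slope)
beta, *_ = np.linalg.lstsq(X, yy, rcond=None)

# Residual variance and covariance matrix of the coefficients
resid = yy - X @ beta
mse = resid @ resid / (len(yy) - X.shape[1])
cov_beta = mse * np.linalg.inv(X.T @ X)

# Point prediction, and standard error of a new observation at x0
pred = x0 @ beta
pred_se = np.sqrt(mse + x0 @ cov_beta @ x0)

print("prediction:", pred)
print("standard error of prediction:", pred_se)
```

Under these assumptions the first number reproduces the 2.263636 from numpy and R, and the second reproduces the 1.695814 that was mistaken for a prediction.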


How to use scale and shape parameters of gamma GLM in statsmodels

The task
I have data that looks like this:
I want to fit a generalized linear model (glm) to this from a gamma family using statsmodels. Using this model, for each of my observations I want to calculate the probability of observing a value that is smaller than (or equal to) that value. In other words I want to calculate:
P(y <= y_i | x_i)
My questions
How do I get the shape and scale parameters from the fitted glm in statsmodels? According to this question the scale parameter in statsmodels is not parameterized in the normal way. Can I use it directly as input to a gamma distribution in scipy? Or do I need a transformation first?
How do I use these parameters (shape and scale) to get the probabilities? Currently I'm using scipy to generate a distribution for each x_i and get the probability from that. See implementation below.
My current implementation
import numpy as np
import scipy.stats as stat
import patsy
import statsmodels.api as sm
# Generate data in correct form
y, X = patsy.dmatrices('y ~ x', data=myData, return_type='dataframe')
# Fit model with gamma family and log link
mod = sm.GLM(y, X, family=sm.families.Gamma(sm.families.links.log())).fit()
# Predict mean
myData['mu'] = mod.predict(exog=X)
# Predict probabilities (note that for a gamma distribution mean = shape * scale)
probabilities = np.array(
    [stat.gamma(m_i/mod.scale, scale=mod.scale).cdf(y_i) for m_i, y_i in zip(myData['mu'], myData['y'])]
)
However, when I perform this procedure I get the following result:
Currently the predicted probabilities all seem really high. The red line in the graph is the predicted mean. But even for points below this line the predicted cumulative probability is around 80%. This makes me wonder whether the scale parameter I used is indeed the correct one.
In R, you can obtain an estimate of the shape as 1/dispersion (check this post). Unfortunately, statsmodels names the dispersion estimate scale, so you need to take its reciprocal to get the shape estimate. I show this with an example below:
from scipy.stats import gamma
import numpy as np
values = gamma.rvs(2,scale=5,size=500)
fit = sm.GLM(values, np.repeat(1,500), family=sm.families.Gamma(sm.families.links.log())).fit()
This is an intercept-only model, and we check the intercept and dispersion (named scale):
[fit.params, fit.scale]
[array([2.27875973]), 0.563667465203953]
So the mean is exp(2.2788) ≈ 9.76, close to the simulated mean of 2 × 5 = 10, and if we use shape as 1/dispersion, shape = 1/0.563667465203953 = 1.774096, which is close to the shape of 2 that we simulated.
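To make the mapping explicit, here is a small sketch using only scipy. It assumes the usual GLM convention that the gamma variance is dispersion × mean². The helper `gamma_from_glm` and the numbers are hypothetical illustrations, not fitted values.

```python
from scipy.stats import gamma

def gamma_from_glm(mu, dispersion):
    """Frozen scipy gamma with mean `mu` and variance `dispersion * mu**2`."""
    shape = 1.0 / dispersion
    return gamma(shape, scale=mu / shape)

mu, dispersion = 10.0, 0.5   # illustrative values only

right = gamma_from_glm(mu, dispersion)
# The mapping tried in the question: shape = mu / scale, scale = scale
wrong = gamma(mu / dispersion, scale=dispersion)

print(right.mean(), right.var())     # mean mu, variance dispersion * mu**2
print(wrong.mean(), wrong.var())     # same mean, but variance dispersion * mu
print(right.cdf(mu), wrong.cdf(mu))  # the two CDFs differ even at the mean
```

Both parameterizations give the right mean, which is why the mistake is easy to miss. Only the first has the variance the GLM implies, so only its CDF values are meaningful.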
If I use a simulated dataset, it works perfectly fine. This is what it looks like, with a shape of 10:
from scipy.stats import gamma
import numpy as np
import matplotlib.pyplot as plt
import patsy
import statsmodels.api as sm
import pandas as pd
_shape = 10
myData = pd.DataFrame({'x':np.random.uniform(0,10,size=500)})
myData['y'] = gamma.rvs(_shape,scale=np.exp(-myData['x']/3 + 0.5)/_shape,size=500)
Then we fit the model like you did:
y, X = patsy.dmatrices('y ~ x', data=myData, return_type='dataframe')
mod = sm.GLM(y, X, family=sm.families.Gamma(sm.families.links.log())).fit()
mu = mod.predict(exog=X)
shape_from_model = 1/mod.scale
probabilities = [gamma(shape_from_model, scale=m_i/shape_from_model).cdf(y_i) for m_i, y_i in zip(mu,myData['y'])]
And plot:
fig, ax = plt.subplots()
im = ax.scatter(myData["x"],myData["y"],c=probabilities)
im = ax.scatter(myData['x'],mu,c="r",s=1)
fig.colorbar(im, ax=ax)

How is the p value calculated for multiple variables in linear regression?

I am wondering how the p value is calculated for various variables in a multiple linear regression. I am sure upon reading several resources that <5% indicates the variable is significant for the model. But how is the p value calculated for each and every variable in the multiple linear regression?
I tried to see the statsmodels summary using the summary() function. I can just see the values. I didn't find any resource on how p value for various variables in a multiple linear regression is calculated.
import numpy as np
import statsmodels.api as sm
nsample = 100
x = np.linspace(0, 10, 100)
X = np.column_stack((x, x**2))
beta = np.array([1, 0.1, 10])
e = np.random.normal(size=nsample)
X = sm.add_constant(X)
y = np.dot(X, beta) + e
model = sm.OLS(y, X)
results = model.fit()
print(results.summary())
This question has no error but requires an intuition on how p value is calculated for various variables in a multiple linear regression.
Inferential statistics work by comparison to known distributions. In the case of regression, that distribution is typically the t-distribution.
You'll notice that each variable has an estimated coefficient from which an associated t-statistic is calculated (the coefficient divided by its standard error). x1, for example, has a t-value of -0.278. To get the p-value, we take that t-value, place it on the t-distribution, and calculate the probability of getting a value at least as extreme as the t-value you calculated. You can gain some intuition for this by noticing that the p-value column is called P>|t|.
An additional wrinkle here is that the exact shape of the t-distribution depends on the degrees of freedom.
So to calculate a p-value, you need two pieces of information: the t-statistic and the residual degrees of freedom of your model (97 in your case: 100 observations minus 3 estimated coefficients).
Taking x1 as an example, you can calculate the p-value in Python like this:
import scipy.stats
scipy.stats.t.sf(abs(-0.278), df=97)*2
The same is done for each of the other predictors using their respective t-values.
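For all coefficients at once, the whole chain (coefficient, standard error, t-value, p-value) can be sketched with NumPy and scipy. This is a hand-rolled illustration of the standard OLS formulas, not what statsmodels runs internally. The seed and the labels `const`, `x1`, `x2` are assumptions made for the example.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)          # seed chosen only for reproducibility
nsample = 100
x = np.linspace(0, 10, nsample)
X = np.column_stack((np.ones(nsample), x, x**2))   # constant, x, x^2
beta_true = np.array([1, 0.1, 10])
y = X @ beta_true + rng.normal(size=nsample)

# OLS estimate and residual degrees of freedom (n minus number of coefficients)
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta_hat
df_resid = nsample - X.shape[1]

# Standard errors from the diagonal of sigma^2 (X'X)^-1
sigma2 = resid @ resid / df_resid
se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))

# t-statistic per coefficient, then the two-sided tail probability
t_values = beta_hat / se
p_values = 2 * stats.t.sf(np.abs(t_values), df=df_resid)

for name, t, p in zip(["const", "x1", "x2"], t_values, p_values):
    print(f"{name}: t = {t:.3f}, P>|t| = {p:.3f}")
```

Because the noise is random, the exact t-values will differ from the ones in your summary table, but the columns are computed the same way.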

Python statsmodels WLS (weighted least squares) error independent of weights

I'm using Python's statsmodels to perform a weighted linear regression. Since this is my first time with this module, I ran some basic tests.
Doing these, I find that estimated errors on the parameters (i.e., uncertainties) are independent of the weight.
This doesn't fit my naive expectation (that larger error bars would produce a more uncertain result), or the definition I've seen in nonlinear fitting, which generally uses the weighted Hessian to report the uncertainty of the parameters.
I couldn't find an explicit description of the calculation of the parameter-errors (either in the statsmodels docs or support pages). Posting here to see if anyone has experience with this (and for future searchers).
Minimal code:
import numpy as np
import statsmodels.api as sm
escale = 1
xvals = np.arange(1,11)
yvals = xvals**2 # dummy nonlinear y-data
evals = escale*np.sqrt(yvals) # errorbars
X = sm.add_constant(xvals) # to get intercept term
model = sm.WLS(yvals, X, 1.0/evals**2) # weight inverse to error
result = model.fit()
print("escale = ", escale)
print("fit = ", result.params)
print("err = ", result.bse) # same value independent of escale
So is this a numerical bug in statsmodels, or PEBKAC (i.e., is this correct behavior and my expectations are wrong)?
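A likely explanation, assuming statsmodels follows the textbook WLS estimator: the weights are treated as relative, and the overall residual variance (reported as `scale`) is estimated from the weighted residuals. Multiplying every error bar by a constant then cancels out of the parameter covariance. The NumPy sketch below illustrates this under that assumption. `wls_errors` and its `fixed_scale` argument are hypothetical names for this example, not statsmodels API.

```python
import numpy as np

def wls_errors(x, y, sigma, fixed_scale=None):
    """Straight-line WLS fit; returns (params, standard errors)."""
    X = np.column_stack([np.ones(len(x)), x])
    w = 1.0 / sigma**2                         # weights = inverse variance
    XtWX = X.T @ (w[:, None] * X)
    beta = np.linalg.solve(XtWX, X.T @ (w * y))
    if fixed_scale is None:
        # Estimate the overall variance scale from the weighted residuals
        resid = y - X @ beta
        scale = np.sum(w * resid**2) / (len(y) - X.shape[1])
    else:
        # Treat the error bars as absolute
        scale = fixed_scale
    return beta, np.sqrt(np.diag(scale * np.linalg.inv(XtWX)))

xvals = np.arange(1, 11).astype(float)
yvals = xvals**2
for escale in (1.0, 10.0):
    evals = escale * np.sqrt(yvals)
    _, err_est = wls_errors(xvals, yvals, evals)
    _, err_fix = wls_errors(xvals, yvals, evals, fixed_scale=1.0)
    print("escale =", escale, "estimated scale:", err_est, "fixed scale:", err_fix)
```

With an estimated scale the errors are identical for both values of `escale`. With the scale fixed at 1 they grow in proportion to the error bars, which matches the expectation in the question. statsmodels documents a `cov_type='fixed scale'` option for `fit` that appears to serve this purpose; check the documentation for your version before relying on it.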

How does pymc represent the prior distribution and likelihood function?

If pymc implements the Metropolis-Hastings algorithm to come up with samples from the posterior density over the parameters of interest, then in order to decide whether to move to the next state in the Markov chain it must be able to evaluate something proportional to the posterior density for all given parameter values.
The posterior density is proportional to the likelihood function based on the observed data times the prior density.
How are each of these represented within pymc? How does it calculate each of these quantities from the model object?
I wonder if anyone can give me a high level description of the approach or point me to where I can find it.
To represent the prior, you need an instance of the Stochastic class, which has two primary attributes:
value : the variable's current value
logp : the log probability of the variable's current value given the values of its parents
You can initialize a prior with the name of the distribution you are using.
To represent the likelihood, you need a so-called Data Stochastic. That is, an instance of class Stochastic whose observed flag is set to True. The value of this variable cannot be changed and it will not be sampled. Again, you can initialize the likelihood with the name of the distribution you are using (but don't forget to set the observed flag to True).
Say we have the following setup:
import pymc as pm
import numpy as np
import theano.tensor as t
x = np.array([1,2,3,4,5,6])
y = np.array([0,1,0,1,1,1])
We can run a simple logistic regression with the following:
with pm.Model() as model:
    b0 = pm.Normal("b0", mu=0, tau=1e-6)
    b1 = pm.Normal("b1", mu=0, tau=1e-6)
    z = b0 + b1 * x
    yhat = pm.Bernoulli("yhat", 1 / (1 + t.exp(-z)), observed=y)
    # Sample from the posterior
    trace = pm.sample(10000, pm.Metropolis())
Most of the above came from Chris Fonnesbeck's iPython notebook here.
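To make the logp bookkeeping concrete, here is a hand-rolled sketch of the same logistic regression with NumPy and scipy. It is not pymc's implementation. It only illustrates the idea that the unnormalized log posterior is the sum of the log prior terms and the log likelihood of the observed data, and that a Metropolis step needs nothing more than that sum. The function names, step size, and seed are assumptions for the example.

```python
import numpy as np
from scipy.stats import norm

x = np.array([1, 2, 3, 4, 5, 6])
y = np.array([0, 1, 0, 1, 1, 1])

def log_prior(b0, b1, sd=1000.0):
    # Normal(mu=0, tau=1e-6) corresponds to a standard deviation of 1000
    return norm.logpdf(b0, 0.0, sd) + norm.logpdf(b1, 0.0, sd)

def log_likelihood(b0, b1):
    # Bernoulli log likelihood with a logistic link, in a numerically
    # stable form: sum(y*z - log(1 + exp(z)))
    z = b0 + b1 * x
    return np.sum(y * z - np.logaddexp(0.0, z))

def log_posterior(theta):
    # Log of (prior * likelihood), known only up to an additive constant
    return log_prior(*theta) + log_likelihood(*theta)

def metropolis(n_steps, step=0.5, seed=0):
    rng = np.random.default_rng(seed)
    theta = np.zeros(2)
    current = log_posterior(theta)
    trace = np.empty((n_steps, 2))
    accepted = 0
    for i in range(n_steps):
        proposal = theta + step * rng.normal(size=2)   # symmetric random walk
        candidate = log_posterior(proposal)
        # Accept with probability min(1, exp(candidate - current))
        if np.log(rng.uniform()) < candidate - current:
            theta, current = proposal, candidate
            accepted += 1
        trace[i] = theta
    return trace, accepted / n_steps

trace, acc_rate = metropolis(5000)
print("acceptance rate:", acc_rate)
print("posterior means (b0, b1):", trace[2500:].mean(axis=0))
```

Only the difference of two log posterior values enters the accept step, which is why the normalizing constant of the posterior never has to be computed.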

numpy.polyfit has no keyword 'cov'

I'm trying to use polyfit to find the best fitting straight line to a set of data, but I also need to know the uncertainty on the parameters, so I want the covariance matrix too. The online documentation suggests I write:
polyfit(x, y, 2, cov=True)
but this gives the error:
TypeError: polyfit() got an unexpected keyword argument 'cov'
And sure enough help(polyfit) shows no keyword argument 'cov'.
So does the online documentation refer to a previous release of numpy? (I have 1.6.1, the newest one). I could write my own polyfit function, but has anyone got any suggestions for why I don't have a covariance option on my polyfit?
For a solution that comes from a library, I find that using scikits.statsmodels is a convenient choice. In statsmodels, regression objects have callable attributes that return the parameters and standard errors. I put an example of how this would work for you below:
# Imports, I assume NumPy for forming your data.
import numpy as np
import scikits.statsmodels.api as sm
# Form the data here
(X, Y) = ....
# Assumes X and Y are column vectors of shape (N, 1)
deg = 2  # desired polynomial degree
reg_x_data = np.ones(X.shape)  # 0th degree term.
for ii in range(1, deg+1):
    reg_x_data = np.hstack(( reg_x_data, X**(ii) ))  # Append the ii^th degree term.
# Store OLS regression results into `result`
result = sm.OLS(Y,reg_x_data).fit()
# Print the estimated coefficients
print(result.params)
# Print the basic OLS standard error in the coefficients
print(result.bse)
# Print the estimated basic OLS covariance matrix
print(result.cov_params()) # <-- Notice, this one is a function by convention.
# Print the heteroskedasticity-consistent standard error
print(result.HC0_se)
# Print the heteroskedasticity-consistent covariance matrix
print(result.cov_HC0)
There are additional robust covariance attributes in the result object as well. You can see them by printing out dir(result). Also, by convention, the covariance matrices for the heteroskedasticity-consistent estimators are only available after you have accessed the corresponding standard error. For example, you must access result.HC0_se before result.cov_HC0, because the first reference causes the second one to be computed and stored.
Pandas is another library that probably provides more advanced support for these operations.
Non-library function
This might be useful when you don't want to / can't rely on an extra library function.
Below is a function that I wrote to return the OLS regression coefficients, as well as a bunch of stuff. It returns the residuals, the regression variance and standard error (standard error of the residuals-squared), the asymptotic formula for large-sample variance, the OLS covariance matrix, the heteroskedasticity-consistent "robust" covariance estimate (which is the OLS covariance but weighted according to the residuals), and the "White" or "bias-corrected" heteroskedasticity-consistent covariance.
import numpy as np
# Regression and standard error estimation functions
def ols_linreg(X, Y):
    """ ols_linreg(X,Y)

    Ordinary least squares regression estimator given explanatory variables
    matrix X and observations matrix Y. The length of the first dimension of
    X and Y must be the same (equal to the number of samples in the data set).

    Note: these methods should be adapted if you need to use this for large data.
    This is mostly for illustrating what to do for calculating the different
    classical standard errors. You would never really want to compute the inverse
    matrices for large problems.

    This was developed with NumPy 1.5.1.
    """
    (N, K) = X.shape
    t1 = np.linalg.inv( (np.transpose(X)).dot(X) )
    t2 = (np.transpose(X)).dot(Y)
    beta = t1.dot(t2)
    residuals = Y - X.dot(beta)
    sig_hat = (1.0/(N-K))*np.sum(residuals**2)
    sig_hat_asymptotic_variance = 2*sig_hat**2/N
    conv_st_err = np.sqrt(sig_hat)

    sum1 = 0.0
    for ii in range(N):
        sum1 = sum1 + np.outer(X[ii,:],X[ii,:])
    sum1 = (1.0/N)*sum1
    ols_cov = (sig_hat/N)*np.linalg.inv(sum1)

    # Projection ("hat") matrix X (X'X)^-1 X'
    PX = X.dot( np.linalg.inv(np.transpose(X).dot(X)).dot(np.transpose(X)) )
    robust_se_mat1 = np.linalg.inv(np.transpose(X).dot(X))
    robust_se_mat2 = np.transpose(X).dot(np.diag(residuals[:,0]**(2.0)).dot(X))
    robust_se_mat3 = np.transpose(X).dot(np.diag(residuals[:,0]**(2.0)/(1.0-np.diag(PX))).dot(X))
    # Sandwich estimators: (X'X)^-1 * middle * (X'X)^-1
    v_robust = robust_se_mat1.dot(robust_se_mat2.dot(robust_se_mat1))
    v_modified_robust = robust_se_mat1.dot(robust_se_mat3.dot(robust_se_mat1))

    """ Returns:
    beta -- The vector of coefficient estimates, ordered on the columns on X.
    residuals -- The vector of residuals, Y - X.beta
    sig_hat -- The sample variance of the residuals.
    conv_st_err -- The 'standard error of the regression', sqrt(sig_hat).
    sig_hat_asymptotic_variance -- The analytic formula for the large sample variance
    ols_cov -- The covariance matrix under the basic OLS assumptions.
    v_robust -- The "robust" covariance matrix, weighted to account for the residuals and heteroskedasticity.
    v_modified_robust -- The bias-corrected and heteroskedasticity-consistent covariance matrix.
    """
    return beta, residuals, sig_hat, conv_st_err, sig_hat_asymptotic_variance, ols_cov, v_robust, v_modified_robust
For your problem, you would use it like this:
import numpy as np
# Define or load your data:
(Y, X) = ....
# Desired polynomial degree
deg = 2
reg_x_data = np.ones(X.shape)  # 0th degree term.
for ii in range(1, deg+1):
    reg_x_data = np.hstack(( reg_x_data, X**(ii) ))  # Append the ii^th degree term.
# Get all of the regression data.
beta, residuals, sig_hat, conv_st_error, sig_hat_asymptotic_variance, ols_cov, v_robust, v_modified_robust = ols_linreg(reg_x_data,Y)
# Print the covariance matrix:
print(ols_cov)
If you spot any bugs in my computations (especially the heteroskedasticity-consistent estimators) please let me know and I'll fix it asap.
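If all you need is the plain OLS covariance of the polynomial coefficients, a much shorter NumPy-only sketch is possible. `polyfit_with_cov` below is a hypothetical helper written for this example, not a NumPy function, and the synthetic data and seed are assumptions. Later NumPy releases did add a `cov` argument to `polyfit`, so upgrading may make this unnecessary.

```python
import numpy as np

def polyfit_with_cov(x, y, deg):
    """Least-squares polynomial fit returning (coefficients, covariance).

    Coefficients are ordered highest power first, like np.polyfit.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    A = np.vander(x, deg + 1)                 # columns: x^deg, ..., x, 1
    coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coeffs
    dof = len(x) - (deg + 1)                  # residual degrees of freedom
    cov = (resid @ resid / dof) * np.linalg.inv(A.T @ A)
    return coeffs, cov

# Synthetic quadratic data, for illustration only
rng = np.random.default_rng(1)
x = np.linspace(0, 5, 30)
y = 0.5 * x**2 - 2.0 * x + 1.0 + rng.normal(scale=0.3, size=x.size)

coeffs, cov = polyfit_with_cov(x, y, 2)
print("coefficients:", coeffs)
print("standard errors:", np.sqrt(np.diag(cov)))
```

The square roots of the diagonal of `cov` are the parameter uncertainties asked about in the question. Like the longer function above, this inverts X'X directly, so it is meant for small problems.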
