Python statsmodels WLS (weighted least squares) error independent of weights - python

I'm using Python's statsmodels to perform a weighted linear regression. Since this is my first time with this module, I ran some basic tests.
Doing so, I find that the estimated errors on the parameters (i.e., the uncertainties) are independent of the weights.
This doesn't fit my naive expectation (that larger error bars would produce a more uncertain result), or the definition I've seen in nonlinear fitting, which generally uses the weighted Hessian to report the uncertainty of the parameters.
I couldn't find an explicit description of the calculation of the parameter-errors (either in the statsmodels docs or support pages). Posting here to see if anyone has experience with this (and for future searchers).
Minimal code:
import numpy as np
import statsmodels.api as sm
escale = 1
xvals = np.arange(1,11)
yvals = xvals**2 # dummy nonlinear y-data
evals = escale*np.sqrt(yvals) # errorbars
X = sm.add_constant(xvals) # to get intercept term
model = sm.WLS(yvals, X, weights=1.0/evals**2) # weights taken as inverse variance
result = model.fit()
print("escale = ", escale)
print("fit = ", result.params)
print("err = ", result.bse) # same value independent of escale
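To make that observation concrete, here is a small sketch (same setup as the minimal code above; the escale values are arbitrary choices) that repeats the fit for several error scales and prints the reported parameter errors:
import numpy as np
import statsmodels.api as sm
xvals = np.arange(1, 11)
yvals = xvals**2
X = sm.add_constant(xvals)
for escale in (0.1, 1.0, 10.0):
    evals = escale * np.sqrt(yvals)  # error bars scaled by escale
    result = sm.WLS(yvals, X, weights=1.0/evals**2).fit()
    print(escale, result.bse)  # bse comes out identical for every escale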
So is this a numerical bug in statsmodels, or PEBKAC (i.e., is this correct behavior and my expectations are wrong)?

Related

Calculate odds ratio with different method in python [duplicate]

When performing a logistic regression with the two APIs, they give different coefficients.
Even with this simple example they do not produce the same coefficients. I also followed advice from older answers on the same topic, such as setting a large value for the parameter C in sklearn, since it makes the penalization almost vanish (or setting penalty="none").
import pandas as pd
import numpy as np
import sklearn as sk
from sklearn.linear_model import LogisticRegression
import statsmodels.api as sm
n = 200
x = np.random.randint(0, 2, size=n)
y = (x > (0.5 + np.random.normal(0, 0.5, n))).astype(int)
display(pd.crosstab( y, x ))
max_iter = 100
#### Statsmodels
res_sm = sm.Logit(y, x).fit(method="ncg", maxiter=max_iter)
print(res_sm.params)
#### Scikit-Learn
res_sk = LogisticRegression( solver='newton-cg', multi_class='multinomial', max_iter=max_iter, fit_intercept=True, C=1e8 )
res_sk.fit( x.reshape(n, 1), y )
print(res_sk.coef_)
For example, I just ran the above code and got 1.72276655 for statsmodels and 1.86324749 for sklearn. When run multiple times, it always gives different coefficients (sometimes closer than others, but different anyway).
Thus, even with this toy example the two APIs give different coefficients (and therefore different odds ratios), and with real data (not shown here) it almost gets "out of control"...
Am I missing something? How can I get similar coefficients, agreeing to at least one or two decimal places?
There are some issues with your code.
To start with, the two models you show here are not equivalent: although you fit your scikit-learn LogisticRegression with fit_intercept=True (which is the default setting), you don't do so with your statsmodels one; from the statsmodels docs:
An intercept is not included by default and should be added by the user. See statsmodels.tools.add_constant.
It seems that this is a frequent point of confusion - see for example scikit-learn & statsmodels - which R-squared is correct? (and my own answer there as well).
The other issue is that, although you are in a binary classification setting, you ask for multi_class='multinomial' in your LogisticRegression, which should not be the case.
The third issue is that, as explained in the relevant Cross Validated thread Logistic Regression: Scikit Learn vs Statsmodels:
There is no way to switch off regularization in scikit-learn, but you can make it ineffective by setting the tuning parameter C to a large number.
which makes the two models non-comparable in principle, but you have successfully addressed this here by setting C=1e8. In fact, since then (2016), scikit-learn has indeed added a way to switch regularization off, by setting penalty='none'; according to the docs:
If ‘none’ (not supported by the liblinear solver), no regularization is applied.
which should now be considered the canonical way to switch off the regularization.
So, incorporating these changes in your code, we have:
np.random.seed(42) # for reproducibility
#### Statsmodels
# first artificially add intercept to x, as advised in the docs:
x_ = sm.add_constant(x)
res_sm = sm.Logit(y, x_).fit(method="ncg", maxiter=max_iter) # x_ here
print(res_sm.params)
Which gives the result:
Optimization terminated successfully.
Current function value: 0.403297
Iterations: 5
Function evaluations: 6
Gradient evaluations: 10
Hessian evaluations: 5
[-1.65822763 3.65065752]
with the first element of the array being the intercept and the second the coefficient of x. While for scikit-learn we have:
#### Scikit-Learn
res_sk = LogisticRegression(solver='newton-cg', max_iter=max_iter, fit_intercept=True, penalty='none')
res_sk.fit( x.reshape(n, 1), y )
print(res_sk.intercept_, res_sk.coef_)
with the result being:
[-1.65822806] [[3.65065707]]
These results are practically identical, within the machine's numeric precision.
Repeating the procedure for different values of np.random.seed() does not change the essence of the results shown above.
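Since the original question was about odds ratios: in logistic regression the odds ratio of a predictor is the exponential of its coefficient, so once the coefficients agree the odds ratios agree as well. A short follow-up sketch, reusing the fitted res_sm and res_sk objects from the corrected code above:
# Odds ratio of x from the statsmodels fit (params[0] is the intercept added by
# add_constant, params[1] is the coefficient of x)
print(np.exp(res_sm.params[1]))
# Odds ratio of x from the scikit-learn fit
print(np.exp(res_sk.coef_[0][0]))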

Fit gaussians (or other distributions) on my data using python

I have a database of features, a 2D np.array (2000 samples, each containing 100 features, i.e. 2000 x 100). I want to fit Gaussian distributions to my database using Python. My code is the following:
data = load_my_data() # loads a np.array with size 2000x100
clf = mixture.GaussianMixture(n_components= 50, covariance_type='full')
clf.fit(data)
I am not sure about the parameters, for example covariance_type, or how I can check whether the fit occurred successfully or not.
EDIT: I debugged the code to investigate what happens with clf.means_, and apparently it produces a matrix of shape n_components x size_of_features (50 x 100). Is there a way I can check that the fitting was successful, or plot the data? What are the alternatives to Gaussian mixtures (mixtures of exponentials, for example; I cannot find any available implementation)?
I think you are using the sklearn package.
Once you have fit the model, type
print(clf.means_)
If it prints output, the model has been fitted; if it raises an error, it has not.
Hope this helps.
You can do dimensionality reduction using PCA down to 3D space (say) and then plot the means and the data.
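A minimal sketch of that suggestion, assuming the sklearn GaussianMixture setup from the question (random data stands in for load_my_data(), and I project to 2D rather than 3D to keep the plot simple; converged_ is the attribute sklearn exposes for checking whether EM converged):
import numpy as np
import matplotlib.pyplot as plt
from sklearn import mixture
from sklearn.decomposition import PCA
data = np.random.rand(2000, 100)  # placeholder for load_my_data()
clf = mixture.GaussianMixture(n_components=50, covariance_type='full')
clf.fit(data)
print(clf.converged_)  # True if EM converged within the iteration limit
# Project both the data and the fitted component means into 2D for visual inspection
pca = PCA(n_components=2)
data_2d = pca.fit_transform(data)
means_2d = pca.transform(clf.means_)
plt.scatter(data_2d[:, 0], data_2d[:, 1], s=2, alpha=0.3, label='data')
plt.scatter(means_2d[:, 0], means_2d[:, 1], c='red', marker='x', label='component means')
plt.legend()
plt.show()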
It is always preferable to choose a reduced set of candidate distributions before trying to identify the distribution (in other words, use a Cullen & Frey graph to reject unlikely candidates), and then run a goodness-of-fit test to select the best result.
You can just create a list of all available distributions in scipy. An example with two distributions and random data:
import numpy as np
import scipy.stats as st

data = np.random.random(10000)

# Specify all candidate distributions here
distributions = [st.laplace, st.norm]

mles = []
for distribution in distributions:
    pars = distribution.fit(data)
    mle = distribution.nnlf(pars, data)  # negative log-likelihood at the fitted parameters
    mles.append(mle)

results = [(distribution.name, mle) for distribution, mle in zip(distributions, mles)]
best_fit = sorted(zip(distributions, mles), key=lambda d: d[1])[0]
print('Best fit reached using {}, MLE value: {}'.format(best_fit[0].name, best_fit[1]))
I understand you may want to compare two different distributions against each other by regression, rather than fit each to an analytic curve. If that is the case, you may be interested in plotting one against the other, fitting a linear (or polynomial) regression, and checking the coefficients: such a regression may tell you whether the two are linearly related or not (see the sketch after the link below).
Linear Regression using Scipy documentation
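A rough sketch of that idea, under my own interpretation (regress the sorted values of one sample against the sorted values of the other, QQ-plot style, using scipy.stats.linregress; the two samples here are arbitrary stand-ins):
import numpy as np
from scipy import stats
a = np.random.laplace(size=10000)
b = np.random.normal(size=10000)
# Compare the two empirical distributions quantile by quantile
qa, qb = np.sort(a), np.sort(b)
slope, intercept, r_value, p_value, std_err = stats.linregress(qa, qb)
print(slope, intercept, r_value**2)  # a high r**2 suggests an approximately linear relation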

Getting correct exogenous least squares prediction in Python statsmodels

I am having trouble getting a reasonable prediction behavior from least squares fits in statsmodels version 0.6.1. It does not seem to be providing a sensible value.
Consider the following data
import numpy as np
import statsmodels.api as sm  # needed below for add_constant
xx = np.array([1.1,2.2,3.3,4.4]) # Independent variable
XX = sm.add_constant(xx) # Include constant for matrix fitting in statsmodels
yy = np.array([2,1,5,6]) # Dependent variable
ww = np.array([0.1,1,3,0.5]) # Weights to try
wn = ww/ww.sum() # Normalized weights
zz = 1.9 # Independent variable value to predict for
We can use numpy to do a weighted fit and prediction
np_unw_value = np.polyval(np.polyfit(xx, yy, deg=1, w=1+0*ww), zz)
print("Unweighted fit prediction from numpy.polyval is {sp}".format(sp=np_unw_value))
and we find a prediction of 2.263636.
As a sanity check, we can also see what R has to say about the matter
import pandas as pd
import rpy2.robjects
from rpy2.robjects.packages import importr
import rpy2.robjects.pandas2ri
rpy2.robjects.pandas2ri.activate()
pdf = pd.DataFrame({'x':xx, 'y':yy, 'w':wn})
pdz = pd.DataFrame({'x':[zz], 'y':[np.Inf]})
rfit = rpy2.robjects.r.lm('y~x', data=pdf, weights=1+0*pdf['w']**2)
rpred = rpy2.robjects.r.predict(rfit, pdz)[0]
print("Unweighted fit prediction from R is {sp}".format(sp=rpred))
and again we find a prediction of 2.263636. My problem is that we do not get that result from statsmodels OLS
import statsmodels.api as sm
from statsmodels.sandbox.regression.predstd import wls_prediction_std
owls = sm.OLS(yy, XX).fit()
sm_value_u, iv_lu, iv_uu = wls_prediction_std(owls, exog=np.array([[1,zz]]))
sm_unw_v = sm_value_u[0]
print("Unweighted OLS fit prediction from statsmodels.wls_prediction_std is {sp}".format(sp=sm_unw_v))
Instead I obtain a value 1.695814 (similar things happen with WLS()). Either there is a bug, or using statsmodels for prediction has some trick too obscure for me to find. What is going on?
The results classes have a predict method that provides the prediction for new values of the explanatory variables:
>>> print(owls.predict(np.array([[1,zz]])))
[ 2.26363636]
The first return of wls_prediction_std is the standard error of the prediction, not the prediction itself.
>>> help(wls_prediction_std)
Help on function wls_prediction_std in module statsmodels.sandbox.regression.predstd:

wls_prediction_std(res, exog=None, weights=None, alpha=0.05)
    calculate standard deviation and confidence interval for prediction

    applies to WLS and OLS, not to general GLS,
    that is independently but not identically distributed observations

    Parameters
    ----------
    res : regression result instance
        results of WLS or OLS regression required attributes see notes
    exog : array_like (optional)
        exogenous variables for points to predict
    weights : scalar or array_like (optional)
        weights as defined for WLS (inverse of variance of observation)
    alpha : float (default: alpha = 0.05)
        confidence level for two-sided hypothesis

    Returns
    -------
    predstd : array_like, 1d
        standard error of prediction
        same length as rows of exog
    interval_l, interval_u : array_like
        lower und upper confidence bounds
The sandbox function will be replaced by a new method get_prediction of the results classes that provides the prediction and the extra results like standard deviation and confidence and prediction intervals.
http://www.statsmodels.org/dev/generated/statsmodels.regression.linear_model.RegressionResults.get_prediction.html
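A brief sketch of that newer interface, reusing the fitted owls model and the prediction point zz from the question (attribute names as in recent statsmodels releases):
new_X = np.array([[1, zz]])  # constant term plus the new x value
pred = owls.get_prediction(new_X)
print(pred.predicted_mean)  # the point prediction (2.26363636, as above)
print(pred.se_mean)  # standard error of the mean prediction
print(pred.conf_int())  # confidence interval for the mean prediction
# pred.summary_frame() collects these, plus prediction intervals, in a DataFrame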

How to perform a chi-squared goodness of fit test using scientific libraries in Python?

Let's assume I have some data I obtained empirically:
import numpy as np
from scipy import stats
size = 10000
x = 10 * stats.expon.rvs(size=size) + 0.2 * np.random.uniform(size=size)
It is exponentially distributed (with some noise) and I want to verify this using a chi-squared goodness of fit (GoF) test. What is the simplest way of doing this using the standard scientific libraries in Python (e.g. scipy or statsmodels) with the least amount of manual steps and assumptions?
I can fit a model with:
import matplotlib.pyplot as plt
param = stats.expon.fit(x)
grid = np.linspace(0, 100, 10000)  # x-grid for the fitted pdf
plt.hist(x, density=True, color='white', hatch='/')
plt.plot(grid, stats.expon.pdf(grid, *param))
Calculating the Kolmogorov-Smirnov test is very elegant:
>>> stats.kstest(x, lambda x : stats.expon.cdf(x, *param))
(0.0061000000000000004, 0.85077099515985011)
However, I can't find a good way of calculating the chi-squared test.
There is a chi-squared GoF function in statsmodels, but it assumes a discrete distribution (and the exponential distribution is continuous).
The official scipy.stats tutorial only covers a case for a custom distribution and probabilities are built by fiddling with many expressions (npoints, npointsh, nbound, normbound), so it's not quite clear to me how to do it for other distributions. The chisquare examples assume the expected values and DoF are already obtained.
Also, I am not looking for a way to "manually" perform the test as was already discussed here, but would like to know how to apply one of the available library functions.
An approximate solution for equal-probability bins (sketched in code below):
Estimate the parameters of the distribution.
Use the inverse cdf, ppf if it's a scipy.stats.distribution, to get the bin edges for a regular probability grid, e.g. distribution.ppf(np.linspace(0, 1, n_bins + 1), *args).
Then use np.histogram to count the number of observations in each bin.
Finally, use the chisquare test on the observed frequencies.
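A minimal sketch of that recipe, assuming the exponential sample x from the question (the number of bins is an arbitrary choice, and ddof accounts for the two estimated parameters):
import numpy as np
from scipy import stats
size = 10000
x = 10 * stats.expon.rvs(size=size) + 0.2 * np.random.uniform(size=size)
# 1. estimate the parameters
params = stats.expon.fit(x)
# 2. equal-probability bin edges from the inverse cdf (ppf)
n_bins = 20
edges = stats.expon.ppf(np.linspace(0, 1, n_bins + 1), *params)
edges[-1] = x.max() + 1.0  # ppf(1) is inf; cap the last edge so np.histogram gets finite bins
# 3. observed counts per bin
observed, _ = np.histogram(x, bins=edges)
# 4. chi-squared test against equal expected counts, with ddof for the estimated parameters
expected = np.full(n_bins, size / n_bins)
chi2_stat, p_value = stats.chisquare(observed, expected, ddof=len(params))
print(chi2_stat, p_value)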
An alternative would be to find the bin edges from the percentiles of the sorted data, and use the cdf to find the actual probabilities.
This is only approximate, since the theory for the chisquare test assumes that the parameters are estimated by maximum likelihood on the binned data. And I'm not sure whether the selection of binedges based on the data affects the asymptotic distribution.
I haven't looked into this in a long time.
If an approximate solution is not good enough, then I would recommend that you ask the question on stats.stackexchange.
Why do you need to "verify" that it's exponential? Are you sure you need a statistical test? I can pretty much guarantee that it isn't ultimately exponential and the test would be significant if you had enough data, making the logic of using the test rather forced. It may help you to read this CV thread: Is normality testing 'essentially useless'?, or my answer here: Testing for heteroscedasticity with many observations.
It is typically better to use a qq-plot and/or pp-plot (depending on whether you are concerned about the fit in the tails or middle of the distribution, see my answer here: PP-plots vs. QQ-plots). Information on how to make qq-plots in Python SciPy can be found in this SO thread: Quantile-Quantile plot using SciPy
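For completeness, a quick sketch of such a QQ-plot against the fitted exponential, using scipy.stats.probplot (passing the fitted parameters via sparams is my own choice):
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
size = 10000
x = 10 * stats.expon.rvs(size=size) + 0.2 * np.random.uniform(size=size)
param = stats.expon.fit(x)
stats.probplot(x, sparams=param, dist=stats.expon, plot=plt)
plt.show()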
I tried your problem with OpenTURNS.
The beginning is the same:
import numpy as np
from scipy import stats
size = 10000
x = 10 * stats.expon.rvs(size=size) + 0.2 * np.random.uniform(size=size)
If you suspect that your sample x is coming from an Exponential distribution, you can use ot.ExponentialFactory() to fit the parameters:
import openturns as ot
sample = ot.Sample([[p] for p in x])
distribution = ot.ExponentialFactory().build(sample)
As the factory needs an ot.Sample() as input, I needed to format x and reshape it as 10,000 points of dimension 1.
Let's now assess this fit using the ChiSquared test:
result = ot.FittingTest.ChiSquared(sample, distribution, 0.01)
print('Exponential?', result.getBinaryQualityMeasure(), ', P-value=', result.getPValue())
>>> Exponential? True , P-value= 0.9275212544642293
Very good!
And of course, print(distribution) will give you the fitted parameters:
>>> Exponential(lambda = 0.0982391, gamma = 0.0274607)

numpy.polyfit has no keyword 'cov'

I'm trying to use polyfit to find the best fitting straight line to a set of data, but I also need to know the uncertainty on the parameters, so I want the covariance matrix too. The online documentation suggests I write:
polyfit(x, y, 2, cov=True)
but this gives the error:
TypeError: polyfit() got an unexpected keyword argument 'cov'
And sure enough help(polyfit) shows no keyword argument 'cov'.
So does the online documentation refer to a previous release of numpy? (I have 1.6.1, the newest one). I could write my own polyfit function, but has anyone got any suggestions for why I don't have a covariance option on my polyfit?
Thanks
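As an aside, the cov keyword was added to numpy.polyfit in releases after 1.6.1, so on a newer NumPy the documented call does work; a minimal sketch with arbitrary example data:
import numpy as np
x = np.linspace(0, 10, 50)
y = 3.0 * x**2 + 2.0 * x + 1.0 + np.random.normal(scale=5.0, size=x.size)
coeffs, cov = np.polyfit(x, y, 2, cov=True)  # coefficients and their covariance matrix
print(coeffs)
print(np.sqrt(np.diag(cov)))  # 1-sigma uncertainties on the coefficients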
For a solution that comes from a library, I find that using scikits.statsmodels (nowadays simply statsmodels) is a convenient choice. In statsmodels, regression objects have callable attributes that return the parameters and standard errors. I put an example of how this would work for you below:
# Imports, I assume NumPy for forming your data.
import numpy as np
import scikits.statsmodels.api as sm

# Form the data here
(X, Y) = ....
deg = 2  # desired polynomial degree
reg_x_data = np.ones(X.shape)  # 0th degree term.
for ii in range(1, deg + 1):
    reg_x_data = np.hstack((reg_x_data, X**ii))  # Append the ii-th degree term.

# Store OLS regression results into `result`
result = sm.OLS(Y, reg_x_data).fit()

# Print the estimated coefficients
print(result.params)

# Print the basic OLS standard error in the coefficients
print(result.bse)

# Print the estimated basic OLS covariance matrix
print(result.cov_params())  # <-- Notice, this one is a function by convention.

# Print the heteroskedasticity-consistent standard error
print(result.HC0_se)

# Print the heteroskedasticity-consistent covariance matrix
print(result.cov_HC0)
There are additional robust covariance attributes in the result object as well. You can see them by printing out dir(result). Also, by convention, the covariance matrices for the heteroskedasticity-consistent estimators are only available after you have already accessed the corresponding standard error; for example, you must call result.HC0_se before result.cov_HC0, because the first reference causes the second one to be computed and stored.
Pandas is another library that probably provides more advanced support for these operations.
Non-library function
This might be useful when you don't want to / can't rely on an extra library function.
Below is a function that I wrote to return the OLS regression coefficients, as well as a bunch of stuff. It returns the residuals, the regression variance and standard error (standard error of the residuals-squared), the asymptotic formula for large-sample variance, the OLS covariance matrix, the heteroskedasticity-consistent "robust" covariance estimate (which is the OLS covariance but weighted according to the residuals), and the "White" or "bias-corrected" heteroskedasticity-consistent covariance.
import numpy as np

###
# Regression and standard error estimation functions
###
def ols_linreg(X, Y):
    """ ols_linreg(X,Y)

    Ordinary least squares regression estimator given explanatory variables
    matrix X and observations matrix Y. The length of the first dimension of
    X and Y must be the same (equal to the number of samples in the data set).

    Note: these methods should be adapted if you need to use this for large data.
    This is mostly for illustrating what to do for calculating the different
    classical standard errors. You would never really want to compute the inverse
    matrices for large problems.

    This was developed with NumPy 1.5.1.
    """
    (N, K) = X.shape
    t1 = np.linalg.inv((np.transpose(X)).dot(X))
    t2 = (np.transpose(X)).dot(Y)
    beta = t1.dot(t2)
    residuals = Y - X.dot(beta)
    sig_hat = (1.0 / (N - K)) * np.sum(residuals**2)
    sig_hat_asymptotic_variance = 2 * sig_hat**2 / N
    conv_st_err = np.sqrt(sig_hat)

    sum1 = 0.0
    for ii in range(N):
        sum1 = sum1 + np.outer(X[ii, :], X[ii, :])
    sum1 = (1.0 / N) * sum1
    ols_cov = (sig_hat / N) * np.linalg.inv(sum1)

    PX = X.dot(np.linalg.inv(np.transpose(X).dot(X)).dot(np.transpose(X)))
    robust_se_mat1 = np.linalg.inv(np.transpose(X).dot(X))
    robust_se_mat2 = np.transpose(X).dot(np.diag(residuals[:, 0]**2.0).dot(X))
    robust_se_mat3 = np.transpose(X).dot(np.diag(residuals[:, 0]**2.0 / (1.0 - np.diag(PX))).dot(X))
    v_robust = robust_se_mat1.dot(robust_se_mat2.dot(robust_se_mat1))
    v_modified_robust = robust_se_mat1.dot(robust_se_mat3.dot(robust_se_mat1))

    """ Returns:
    beta -- The vector of coefficient estimates, ordered on the columns of X.
    residuals -- The vector of residuals, Y - X.beta
    sig_hat -- The sample variance of the residuals.
    conv_st_error -- The 'standard error of the regression', sqrt(sig_hat).
    sig_hat_asymptotic_variance -- The analytic formula for the large sample variance.
    ols_cov -- The covariance matrix under the basic OLS assumptions.
    v_robust -- The "robust" covariance matrix, weighted to account for the residuals and heteroskedasticity.
    v_modified_robust -- The bias-corrected and heteroskedasticity-consistent covariance matrix.
    """

    return beta, residuals, sig_hat, conv_st_err, sig_hat_asymptotic_variance, ols_cov, v_robust, v_modified_robust
For your problem, you would use it like this:
import numpy as np

# Define or load your data:
(Y, X) = ....

# Desired polynomial degree
deg = 2

reg_x_data = np.ones(X.shape)  # 0th degree term.
for ii in range(1, deg + 1):
    reg_x_data = np.hstack((reg_x_data, X**ii))  # Append the ii-th degree term.

# Get all of the regression data.
beta, residuals, sig_hat, conv_st_error, sig_hat_asymptotic_variance, ols_cov, v_robust, v_modified_robust = ols_linreg(reg_x_data, Y)

# Print the covariance matrix:
print(ols_cov)
If you spot any bugs in my computations (especially the heteroskedasticity-consistent estimators) please let me know and I'll fix it asap.
