Statsmodels.formula.api OLS does not show statistical values of intercept - python

I am running the following source code:
import numpy as np
import statsmodels.formula.api as sm

# Add a column of ones for the intercept term
X = np.append(arr=np.ones((50, 1)).astype(int), values=X, axis=1)
regressor_OLS = sm.OLS(endog=y, exog=X).fit()
print(regressor_OLS.summary())
where
X is a 50x5 numpy array (before adding the intercept term) which looks like this:
[[0 1 165349.20 136897.80 471784.10]
[0 0 162597.70 151377.59 443898.53]...]
and y is a 50x1 numpy array with float values for the dependent variable.
The first two columns are for a dummy variable with three different values. The rest of the columns are three different independent variables.
Although statsmodels.formula.api.OLS is said to add an intercept term automatically (see stellacia's answer here: OLS using statsmodel.formula.api versus statsmodel.api), its summary does not show the statistical values of the intercept term, as is evident below in my case:
OLS Regression Results
==============================================================================
Dep. Variable: Profit R-squared: 0.988
Model: OLS Adj. R-squared: 0.986
Method: Least Squares F-statistic: 727.1
Date: Sun, 01 Jul 2018 Prob (F-statistic): 7.87e-42
Time: 21:40:23 Log-Likelihood: -545.15
No. Observations: 50 AIC: 1100.
Df Residuals: 45 BIC: 1110.
Df Model: 5
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
x1 3464.4536 4905.406 0.706 0.484 -6415.541 1.33e+04
x2 5067.8937 4668.238 1.086 0.283 -4334.419 1.45e+04
x3 0.7182 0.066 10.916 0.000 0.586 0.851
x4 0.3113 0.035 8.885 0.000 0.241 0.382
x5 0.0786 0.023 3.429 0.001 0.032 0.125
==============================================================================
Omnibus: 1.355 Durbin-Watson: 1.288
Prob(Omnibus): 0.508 Jarque-Bera (JB): 1.241
Skew: -0.237 Prob(JB): 0.538
Kurtosis: 2.391 Cond. No. 8.28e+05
==============================================================================
For this reason, I added to my source code the line:
X = np.append(arr= np.ones((50, 1)).astype(int), values=X, axis=1)
as you can see at the beginning of my post, and now the statistical values of the intercept/constant are shown, as below:
OLS Regression Results
==============================================================================
Dep. Variable: Profit R-squared: 0.951
Model: OLS Adj. R-squared: 0.945
Method: Least Squares F-statistic: 169.9
Date: Sun, 01 Jul 2018 Prob (F-statistic): 1.34e-27
Time: 20:25:21 Log-Likelihood: -525.38
No. Observations: 50 AIC: 1063.
Df Residuals: 44 BIC: 1074.
Df Model: 5
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 5.013e+04 6884.820 7.281 0.000 3.62e+04 6.4e+04
x1 198.7888 3371.007 0.059 0.953 -6595.030 6992.607
x2 -41.8870 3256.039 -0.013 0.990 -6604.003 6520.229
x3 0.8060 0.046 17.369 0.000 0.712 0.900
x4 -0.0270 0.052 -0.517 0.608 -0.132 0.078
x5 0.0270 0.017 1.574 0.123 -0.008 0.062
==============================================================================
Omnibus: 14.782 Durbin-Watson: 1.283
Prob(Omnibus): 0.001 Jarque-Bera (JB): 21.266
Skew: -0.948 Prob(JB): 2.41e-05
Kurtosis: 5.572 Cond. No. 1.45e+06
==============================================================================
Why are the statistical values of the intercept not shown when I do not add an intercept term myself, even though statsmodels.formula.api.OLS is said to add one automatically?

"No constant is added by the model unless you are using formulas."
Therefore, try something like the example below. Variable names should be defined according to your data set.
Use,
regressor_OLS = smf.ols(formula='Y_variable ~ X_variable', data=df).fit()
instead of,
regressor_OLS = sm.OLS(endog=y, exog=X).fit()
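For context, here is a minimal runnable sketch of the formula approach; the DataFrame and column names are made up, so substitute your own:

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data standing in for your DataFrame.
rng = np.random.default_rng(0)
df = pd.DataFrame({'x': rng.normal(size=50)})
df['y'] = 3.0 + 2.0 * df['x'] + rng.normal(size=50)

# The formula interface adds the intercept automatically,
# so the summary includes an "Intercept" row.
results = smf.ols('y ~ x', data=df).fit()
print(results.summary())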

Alternatively, you can add the intercept column with add_constant (note that this function lives in statsmodels.api, not statsmodels.formula.api):
X = sm.add_constant(X)
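For completeness, a self-contained sketch of that approach; the arrays below are placeholders for the question's 50x5 X and 50x1 y:

import numpy as np
import statsmodels.api as sm

# Placeholder data standing in for the question's arrays.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.3, 0.1]) + rng.normal(size=50)

X_const = sm.add_constant(X)                         # prepends a column of ones
regressor_OLS = sm.OLS(endog=y, exog=X_const).fit()
print(regressor_OLS.summary())                       # now includes a "const" row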

Related

Correct multiple polynomial regression formula with OLS in Python

I need to perform multiple polynomial regression and obtain statistics: p-values, AIC, etc.
As far as I understand, I can do that with OLS; however, I only found a way to write a formula with one independent variable, like this:
model = 'act_hours ~ h_hours + I(h_hours**2)'
hours_model = smf.ols(formula = model, data = df)
I tried to define a formula using two independent variables, but I could not work out whether that is the correct way and whether the results are reasonable. The line I am unsure about is model = 'Height ~ Diamet + I(Diamet**2) + area + I(area**2)'. The full code is this:
import pandas as pd
import statsmodels.formula.api as smf
train = pd.read_csv(r'W:\...file.csv')
model = 'Height ~ Diamet + I(Diamet**2) + area + I(area**2)'
hours_model = smf.ols(formula = model, data = train).fit()
print(hours_model.summary())
The summary of the regression is here:
OLS Regression Results
==============================================================================
Dep. Variable: Height R-squared: 0.611
Model: OLS Adj. R-squared: 0.609
Method: Least Squares F-statistic: 376.0
Date: Fri, 04 Feb 2022 Prob (F-statistic): 1.33e-194
Time: 08:50:17 Log-Likelihood: -5114.6
No. Observations: 963 AIC: 1.024e+04
Df Residuals: 958 BIC: 1.026e+04
Df Model: 4
Covariance Type: nonrobust
==================================================================================
coef std err t P>|t| [0.025 0.975]
----------------------------------------------------------------------------------
Intercept 13.9287 60.951 0.229 0.819 -105.684 133.542
Diamet 0.6027 0.340 1.770 0.077 -0.066 1.271
I(Diamet ** 2) 0.0004 0.002 0.262 0.794 -0.003 0.004
area 3.3553 5.307 0.632 0.527 -7.060 13.771
I(area** 2) 0.2519 0.108 2.324 0.020 0.039 0.465
==============================================================================
Omnibus: 60.996 Durbin-Watson: 1.889
Prob(Omnibus): 0.000 Jarque-Bera (JB): 86.039
Skew: 0.528 Prob(JB): 2.07e-19
Kurtosis: 4.015 Cond. No. 4.45e+05
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 4.45e+05. This might indicate that there are
strong multicollinearity or other numerical problems.
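One way to confirm that a formula like this expands to the design matrix you intend is to build that matrix explicitly. Below is a hedged sketch using patsy (the formula engine statsmodels has traditionally used under the hood); the Height, Diamet, and area data are made up:

import numpy as np
import pandas as pd
from patsy import dmatrices

# Made-up data with the same column names as the question.
rng = np.random.default_rng(0)
df = pd.DataFrame({'Diamet': rng.uniform(5, 50, 200),
                   'area': rng.uniform(1, 20, 200)})
df['Height'] = 10 + 0.5 * df['Diamet'] + 0.3 * df['area'] ** 2 + rng.normal(size=200)

# The expansion lists one column per term: Intercept, Diamet, I(Diamet ** 2),
# area, I(area ** 2) -- i.e. the two-variable quadratic specification.
y, X = dmatrices('Height ~ Diamet + I(Diamet**2) + area + I(area**2)', data=df)
print(X.design_info.column_names)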

Getting statsmodel RollingOLS results summary information

I am running rolling regressions using the RollingOLS function in statsmodels, and wondering if it's possible to get the summary statistics (betas, r^2, etc.) out for each regression done in the rolling regression.
Using a single OLS regression, you can get the summary information like this:
X_opt = X[:, [0,1,2,3]]
regressor_OLS = sm.OLS(endog= y, exog= X_opt).fit()
regressor_OLS.summary()
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.951
Model: OLS Adj. R-squared: 0.948
Method: Least Squares F-statistic: 296.0
Date: Wed, 08 Aug 2018 Prob (F-statistic): 4.53e-30
Time: 00:46:48 Log-Likelihood: -525.39
No. Observations: 50 AIC: 1059.
Df Residuals: 46 BIC: 1066.
Df Model: 3
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 5.012e+04 6572.353 7.626 0.000 3.69e+04 6.34e+04
x1 0.8057 0.045 17.846 0.000 0.715 0.897
x2 -0.0268 0.051 -0.526 0.602 -0.130 0.076
x3 0.0272 0.016 1.655 0.105 -0.006 0.060
==============================================================================
Omnibus: 14.838 Durbin-Watson: 1.282
Prob(Omnibus): 0.001 Jarque-Bera (JB): 21.442
Skew: -0.949 Prob(JB): 2.21e-05
Kurtosis: 5.586 Cond. No. 1.40e+06
==============================================================================
Is there a way to get this information for the regression run on each window of a rolling regression?
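For what it's worth, a hedged sketch of how the per-window estimates can be pulled out (assuming a statsmodels version that ships statsmodels.regression.rolling.RollingOLS); the fitted results expose attributes such as params and rsquared with one entry per window, rather than a full summary() per window. The data below are made up:

import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.regression.rolling import RollingOLS

# Made-up data standing in for the question's y and X_opt.
rng = np.random.default_rng(0)
X = sm.add_constant(pd.DataFrame({'x1': rng.normal(size=200)}))
y = 2.0 + 0.5 * X['x1'] + rng.normal(size=200)

# One regression per 60-observation window.
rres = RollingOLS(endog=y, exog=X, window=60).fit()

print(rres.params.tail())               # coefficients, one row per window
print(np.asarray(rres.rsquared)[-5:])   # R^2 for the last few windows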

How to get just condition number from statsmodels.api.OLS?

I'm doing a multiple linear regression, and trying to select the best subset of a number of independent variables. I would like to try to do all 1024 possible combinations in a "for" loop and save the best results based on condition number and r squared. I know it calculates both, giving results like:
model = sm.OLS(salarray, narraycareer)
results = model.fit()
print(results.summary())
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.425
Model: OLS Adj. R-squared: 0.409
Method: Least Squares F-statistic: 26.89
Date: Sat, 23 Sep 2017 Prob (F-statistic): 1.69e-27
Time: 00:58:14 Log-Likelihood: -1907.4
No. Observations: 263 AIC: 3831.
Df Residuals: 255 BIC: 3859.
Df Model: 7
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [95.0% Conf. Int.]
------------------------------------------------------------------------------
const 535.9259 21.387 25.058 0.000 493.808 578.044
x1 -675.5296 302.245 -2.235 0.026 -1270.744 -80.315
x2 182.7168 436.493 0.419 0.676 -676.874 1042.307
x3 -48.2603 126.141 -0.383 0.702 -296.671 200.151
x4 445.0863 218.373 2.038 0.043 15.043 875.130
x5 344.0092 219.896 1.564 0.119 -89.035 777.053
x6 -41.5168 71.925 -0.577 0.564 -183.159 100.126
x7 96.5430 30.595 3.156 0.002 36.293 156.793
==============================================================================
Omnibus: 96.442 Durbin-Watson: 1.973
Prob(Omnibus): 0.000 Jarque-Bera (JB): 440.598
Skew: 1.438 Prob(JB): 2.11e-96
Kurtosis: 8.651 Cond. No. 61.7
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
>>>
But I can't find any documentation on how to get the condition number or R-squared out.
Thanks!
I found it, or rather the Spyder IDE found it for me in the interpreter window:
>>> results.rsquared
0.42465891683421031
>>> results.condition_number
61.715714331759621
When I typed "results." it gave a bunch of suggestions. Something vim doesn't do!
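Building on that, here is a hedged sketch of the brute-force subset search described in the question, keeping the best R-squared among the reasonably conditioned fits; salarray and narraycareer are mocked up, and the condition-number cap of 100 is an arbitrary choice:

import itertools
import numpy as np
import statsmodels.api as sm

# Mock data standing in for the question's salarray / narraycareer.
rng = np.random.default_rng(0)
narraycareer = rng.normal(size=(263, 10))
salarray = 2.0 * narraycareer[:, 0] + rng.normal(size=263)

best = None
n_cols = narraycareer.shape[1]
for k in range(1, n_cols + 1):                     # all non-empty column subsets
    for subset in itertools.combinations(range(n_cols), k):
        X = sm.add_constant(narraycareer[:, list(subset)])
        res = sm.OLS(salarray, X).fit()
        if res.condition_number < 100 and (best is None or res.rsquared > best[0]):
            best = (res.rsquared, subset)

print(best)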

plot contrast of a linear model in python

In Python, I am trying to plot the effect of a linear model:
import pandas as pd
import statsmodels.api as sm
data = pd.read_excel(input_filename)
data.sexe = data.sexe.map({1:'m', 2:'f'})
data.diag = data.diag.map({1:'asd', 4:'hc'})
data.site = data.site.map({ 10:'USS', 20:'UYU', 30:'CAM', 40:'MAM', 2:'Cre'})
lm_full = sm.formula.ols(formula='L_bankssts_thickavg ~ diag + age + sexe + site', data=data).fit()
I used a linear model, which works well:
print(lm_full.summary())
Gives:
OLS Regression Results
===============================================================================
Dep. Variable: L_bankssts_thickavg R-squared: 0.156
Model: OLS Adj. R-squared: 0.131
Method: Least Squares F-statistic: 6.354
Date: Tue, 13 Dec 2016 Prob (F-statistic): 7.30e-07
Time: 15:40:28 Log-Likelihood: 98.227
No. Observations: 249 AIC: -180.5
Df Residuals: 241 BIC: -152.3
Df Model: 7
Covariance Type: nonrobust
===================================================================================
coef std err t P>|t| [95.0% Conf. Int.]
-----------------------------------------------------------------------------------
Intercept 2.8392 0.055 51.284 0.000 2.730 2.948
diag[T.hc] -0.0567 0.021 -2.650 0.009 -0.099 -0.015
sexe[T.m] -0.0435 0.029 -1.476 0.141 -0.102 0.015
site[T.Cre] -0.0069 0.036 -0.189 0.850 -0.078 0.065
site[T.MAM] -0.0635 0.040 -1.593 0.112 -0.142 0.015
site[T.UYU] -0.0948 0.038 -2.497 0.013 -0.170 -0.020
site[T.USS] 0.0145 0.037 0.396 0.692 -0.058 0.086
age -0.0059 0.001 -4.209 0.000 -0.009 -0.003
==============================================================================
Omnibus: 0.698 Durbin-Watson: 2.042
Prob(Omnibus): 0.705 Jarque-Bera (JB): 0.432
Skew: -0.053 Prob(JB): 0.806
Kurtosis: 3.175 Cond. No. 196.
==============================================================================
I now would like to plot the effect of, for example, the "diag" variable.
As it appears in my model, the diagnosis has an effect on the dependent variable, and I would like to plot this effect. I want a graphical representation with the two possible values of diag (i.e. 'asd' and 'hc') showing which group has the lowest value (i.e. a graphical representation of a contrast).
I would like something similar to the allEffects library in R.
Do you think there are similar functions in Python?
The best way to plot this effect is to make a CCPR plot with matplotlib.
import matplotlib.pyplot as plt
import statsmodels.api as sm

# Component-Component plus Residual (CCPR) plot (= partial residual plot)
fig, ax = plt.subplots(figsize=(5, 5))
fig = sm.graphics.plot_ccpr(lm_full, 'diag[T.hc]', ax=ax)
plt.show()
Which gives the CCPR (partial residual) plot for the diag term.
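If you want something closer to R's allEffects output (predicted group means rather than partial residuals), a hedged sketch is below. It reuses the fitted lm_full from the question and holds the other covariates at arbitrary reference values (age 30, male, site 'USS'), which you would replace with whatever reference makes sense:

import pandas as pd
import matplotlib.pyplot as plt

# Predict the outcome for each diag level at fixed reference covariates.
grid = pd.DataFrame({'diag': ['asd', 'hc'],
                     'age': [30, 30],
                     'sexe': ['m', 'm'],
                     'site': ['USS', 'USS']})
pred = lm_full.get_prediction(grid).summary_frame()   # mean and confidence interval

positions = [0, 1]
plt.errorbar(positions, pred['mean'],
             yerr=pred['mean'] - pred['mean_ci_lower'], fmt='o', capsize=4)
plt.xticks(positions, grid['diag'])
plt.xlim(-0.5, 1.5)
plt.ylabel('predicted L_bankssts_thickavg')
plt.show()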

Newey-West standard errors for OLS in Python?

I want to have a coefficient and the Newey-West standard error associated with it.
I am looking for a Python library (ideally, but any working solution is fine) that can do what the following R code is doing:
library(sandwich)
library(lmtest)
a <- matrix(c(1,3,5,7,4,5,6,4,7,8,9))
b <- matrix(c(3,5,6,2,4,6,7,8,7,8,9))
temp.lm = lm(a ~ b)
temp.summ <- summary(temp.lm)
temp.summ$coefficients <- unclass(coeftest(temp.lm, vcov. = NeweyWest))
print (temp.summ$coefficients)
Result:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.0576208 2.5230532 0.8155281 0.4358205
b 0.5594796 0.4071834 1.3740235 0.2026817
I get the coefficients and their associated standard errors.
I see the cov_hac function in statsmodels.stats.sandwich_covariance, but I don't see how to make it work with OLS.
Edited (10/31/2015) to reflect the preferred coding style for statsmodels as of fall 2015.
In statsmodels version 0.6.1 you can do the following:
import pandas as pd
import numpy as np
import statsmodels.formula.api as smf
df = pd.DataFrame({'a':[1,3,5,7,4,5,6,4,7,8,9],
'b':[3,5,6,2,4,6,7,8,7,8,9]})
reg = smf.ols('a ~ 1 + b',data=df).fit(cov_type='HAC',cov_kwds={'maxlags':1})
print(reg.summary())
OLS Regression Results
==============================================================================
Dep. Variable: a R-squared: 0.281
Model: OLS Adj. R-squared: 0.201
Method: Least Squares F-statistic: 1.949
Date: Sat, 31 Oct 2015 Prob (F-statistic): 0.196
Time: 03:15:46 Log-Likelihood: -22.603
No. Observations: 11 AIC: 49.21
Df Residuals: 9 BIC: 50.00
Df Model: 1
Covariance Type: HAC
==============================================================================
coef std err z P>|z| [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept 2.0576 2.661 0.773 0.439 -3.157 7.272
b 0.5595 0.401 1.396 0.163 -0.226 1.345
==============================================================================
Omnibus: 0.361 Durbin-Watson: 1.468
Prob(Omnibus): 0.835 Jarque-Bera (JB): 0.331
Skew: 0.321 Prob(JB): 0.847
Kurtosis: 2.442 Cond. No. 19.1
==============================================================================
Warnings:
[1] Standard Errors are heteroscedasticity and autocorrelation robust (HAC) using 1 lags and without small sample correction
Or you can use the get_robustcov_results method after fitting the model:
reg = smf.ols('a ~ 1 + b',data=df).fit()
new = reg.get_robustcov_results(cov_type='HAC',maxlags=1)
print(new.summary())
OLS Regression Results
==============================================================================
Dep. Variable: a R-squared: 0.281
Model: OLS Adj. R-squared: 0.201
Method: Least Squares F-statistic: 1.949
Date: Sat, 31 Oct 2015 Prob (F-statistic): 0.196
Time: 03:15:46 Log-Likelihood: -22.603
No. Observations: 11 AIC: 49.21
Df Residuals: 9 BIC: 50.00
Df Model: 1
Covariance Type: HAC
==============================================================================
coef std err z P>|z| [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept 2.0576 2.661 0.773 0.439 -3.157 7.272
b 0.5595 0.401 1.396 0.163 -0.226 1.345
==============================================================================
Omnibus: 0.361 Durbin-Watson: 1.468
Prob(Omnibus): 0.835 Jarque-Bera (JB): 0.331
Skew: 0.321 Prob(JB): 0.847
Kurtosis: 2.442 Cond. No. 19.1
==============================================================================
Warnings:
[1] Standard Errors are heteroscedasticity and autocorrelation robust (HAC) using 1 lags and without small sample correction
The defaults for statsmodels are slightly different from the defaults for the equivalent method in R. The R method can be made equivalent to the statsmodels default (what I did above) by changing the vcov. call to the following:
temp.summ$coefficients <- unclass(coeftest(temp.lm,
vcov. = NeweyWest(temp.lm,lag=1,prewhite=FALSE)))
print (temp.summ$coefficients)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.0576208 2.6605060 0.7733945 0.4591196
b 0.5594796 0.4007965 1.3959193 0.1962142
You can also still do Newey-West in pandas (0.17), although I believe the plan is to deprecate OLS in pandas:
print pd.stats.ols.OLS(df.a,df.b,nw_lags=1)
-------------------------Summary of Regression Analysis-------------------------
Formula: Y ~ <x> + <intercept>
Number of Observations: 11
Number of Degrees of Freedom: 2
R-squared: 0.2807
Adj R-squared: 0.2007
Rmse: 2.0880
F-stat (1, 9): 1.5943, p-value: 0.2384
Degrees of Freedom: model 1, resid 9
-----------------------Summary of Estimated Coefficients------------------------
Variable Coef Std Err t-stat p-value CI 2.5% CI 97.5%
--------------------------------------------------------------------------------
x 0.5595 0.4431 1.26 0.2384 -0.3090 1.4280
intercept 2.0576 2.9413 0.70 0.5019 -3.7073 7.8226
*** The calculations are Newey-West adjusted with lags 1
---------------------------------End of Summary---------------------------------
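Since the question mentions statsmodels.stats.sandwich_covariance.cov_hac directly, here is a hedged sketch of using it by hand on an already-fitted OLS result; the small-sample-correction defaults may differ slightly from fit(cov_type='HAC'), so the numbers may not match to the last digit:

import numpy as np
import statsmodels.api as sm
import statsmodels.stats.sandwich_covariance as sw

a = np.array([1, 3, 5, 7, 4, 5, 6, 4, 7, 8, 9], dtype=float)
b = np.array([3, 5, 6, 2, 4, 6, 7, 8, 7, 8, 9], dtype=float)

res = sm.OLS(a, sm.add_constant(b)).fit()
cov = sw.cov_hac(res, nlags=1)          # Newey-West (HAC) covariance matrix
print(np.sqrt(np.diag(cov)))            # HAC standard errors for const and b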
