I need to perform multiple polynomial regression and obtain statistics such as p-values, AIC, etc.
As far as I understand, I can do that with OLS; however, I have only found a way to write a formula with one independent variable, like this:
model = 'act_hours ~ h_hours + I(h_hours**2)'
hours_model = smf.ols(formula = model, data = df)
I tried to define a formula with two independent variables, but I could not tell whether this is the correct way and whether the results are reasonable. The line I doubt is model = 'Height ~ Diamet + I(Diamet**2) + area + I(area**2)'. The full code is this:
import pandas as pd
import statsmodels.formula.api as smf
train = pd.read_csv(r'W:\...file.csv')
model = 'Height ~ Diamet + I(Diamet**2) + area + I(area**2)'
hours_model = smf.ols(formula = model, data = train).fit()
print(hours_model.summary())
The summary of the regression is here:
OLS Regression Results
==============================================================================
Dep. Variable: Height R-squared: 0.611
Model: OLS Adj. R-squared: 0.609
Method: Least Squares F-statistic: 376.0
Date: Fri, 04 Feb 2022 Prob (F-statistic): 1.33e-194
Time: 08:50:17 Log-Likelihood: -5114.6
No. Observations: 963 AIC: 1.024e+04
Df Residuals: 958 BIC: 1.026e+04
Df Model: 4
Covariance Type: nonrobust
==================================================================================
coef std err t P>|t| [0.025 0.975]
----------------------------------------------------------------------------------
Intercept 13.9287 60.951 0.229 0.819 -105.684 133.542
Diamet 0.6027 0.340 1.770 0.077 -0.066 1.271
I(Diamet ** 2) 0.0004 0.002 0.262 0.794 -0.003 0.004
area 3.3553 5.307 0.632 0.527 -7.060 13.771
I(area ** 2) 0.2519 0.108 2.324 0.020 0.039 0.465
==============================================================================
Omnibus: 60.996 Durbin-Watson: 1.889
Prob(Omnibus): 0.000 Jarque-Bera (JB): 86.039
Skew: 0.528 Prob(JB): 2.07e-19
Kurtosis: 4.015 Cond. No. 4.45e+05
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 4.45e+05. This might indicate that there are
strong multicollinearity or other numerical problems.
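For reference, the statistics asked about at the top (p-values, AIC, etc.) can also be read directly off the fitted results object rather than parsed out of the printed summary. A minimal sketch reusing the hours_model fit above (these are standard statsmodels OLSResults attributes):
# Individual statistics from the fitted OLS results object
print(hours_model.aic)       # Akaike information criterion
print(hours_model.bic)       # Bayesian information criterion
print(hours_model.rsquared)  # R-squared
print(hours_model.params)    # estimated coefficients, indexed by term name
print(hours_model.pvalues)   # p-value for each term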
I am running rolling regressions using the RollingOLS function from statsmodels.api, and I am wondering whether it is possible to get the summary statistics (betas, R-squared, etc.) for each regression performed in the rolling window.
With a single OLS regression, you can get the summary information like this:
X_opt = X[:, [0,1,2,3]]
regressor_OLS = sm.OLS(endog= y, exog= X_opt).fit()
regressor_OLS.summary()
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.951
Model: OLS Adj. R-squared: 0.948
Method: Least Squares F-statistic: 296.0
Date: Wed, 08 Aug 2018 Prob (F-statistic): 4.53e-30
Time: 00:46:48 Log-Likelihood: -525.39
No. Observations: 50 AIC: 1059.
Df Residuals: 46 BIC: 1066.
Df Model: 3
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 5.012e+04 6572.353 7.626 0.000 3.69e+04 6.34e+04
x1 0.8057 0.045 17.846 0.000 0.715 0.897
x2 -0.0268 0.051 -0.526 0.602 -0.130 0.076
x3 0.0272 0.016 1.655 0.105 -0.006 0.060
==============================================================================
Omnibus: 14.838 Durbin-Watson: 1.282
Prob(Omnibus): 0.001 Jarque-Bera (JB): 21.442
Skew: -0.949 Prob(JB): 2.21e-05
Kurtosis: 5.586 Cond. No. 1.40e+06
==============================================================================
Is there a way to get this information for the regression run on each window of a rolling regression?
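As far as I know, RollingOLS does not produce a per-window summary() table, but the fitted rolling results object does expose the per-window numbers directly. A minimal sketch, assuming the y and X_opt arrays from the snippet above and an illustrative 30-observation window:
from statsmodels.regression.rolling import RollingOLS

rolling_res = RollingOLS(endog=y, exog=X_opt, window=30).fit()

# Each row corresponds to the window ending at that observation.
print(rolling_res.params)    # betas for every window
print(rolling_res.bse)       # standard errors for every window
print(rolling_res.rsquared)  # R-squared for every window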
I'm doing a multiple linear regression and trying to select the best subset of my independent variables. I would like to try all 1024 possible combinations in a for loop and save the best results based on condition number and R-squared. I know statsmodels calculates both, giving results like:
model = sm.OLS(salarray, narraycareer)
results = model.fit()
print(results.summary())
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.425
Model: OLS Adj. R-squared: 0.409
Method: Least Squares F-statistic: 26.89
Date: Sat, 23 Sep 2017 Prob (F-statistic): 1.69e-27
Time: 00:58:14 Log-Likelihood: -1907.4
No. Observations: 263 AIC: 3831.
Df Residuals: 255 BIC: 3859.
Df Model: 7
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [95.0% Conf. Int.]
------------------------------------------------------------------------------
const 535.9259 21.387 25.058 0.000 493.808 578.044
x1 -675.5296 302.245 -2.235 0.026 -1270.744 -80.315
x2 182.7168 436.493 0.419 0.676 -676.874 1042.307
x3 -48.2603 126.141 -0.383 0.702 -296.671 200.151
x4 445.0863 218.373 2.038 0.043 15.043 875.130
x5 344.0092 219.896 1.564 0.119 -89.035 777.053
x6 -41.5168 71.925 -0.577 0.564 -183.159 100.126
x7 96.5430 30.595 3.156 0.002 36.293 156.793
==============================================================================
Omnibus: 96.442 Durbin-Watson: 1.973
Prob(Omnibus): 0.000 Jarque-Bera (JB): 440.598
Skew: 1.438 Prob(JB): 2.11e-96
Kurtosis: 8.651 Cond. No. 61.7
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
But I can't find any documentation on how to extract the condition number or R-squared.
Thanks!
I found it, or rather the Spyder IDE found it for me in the interpreter window.
>>> results.rsquared
0.42465891683421031
>>> results.condition_number
61.715714331759621
When I typed "results." it gave a bunch of suggestions. Something vim doesn't do!
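Building on those attributes, the exhaustive search over variable subsets described in the question can be written as a loop over column combinations. A minimal sketch, reusing the salarray and narraycareer arrays from the question (the condition-number cutoff of 1000 is only an illustrative threshold):
from itertools import combinations
import statsmodels.api as sm

best = None
n_cols = narraycareer.shape[1]

# Try every non-empty subset of predictor columns.
for k in range(1, n_cols + 1):
    for cols in combinations(range(n_cols), k):
        res = sm.OLS(salarray, narraycareer[:, list(cols)]).fit()
        # Keep the best-fitting subset among reasonably conditioned fits.
        if res.condition_number < 1000 and (best is None or res.rsquared > best[0]):
            best = (res.rsquared, res.condition_number, cols)

print(best)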
In Python, I am trying to plot the effect of a linear model:
data = pd.read_excel(input_filename)
data.sexe = data.sexe.map({1:'m', 2:'f'})
data.diag = data.diag.map({1:'asd', 4:'hc'})
data.site = data.site.map({ 10:'USS', 20:'UYU', 30:'CAM', 40:'MAM', 2:'Cre'})
lm_full = sm.formula.ols(formula='L_bankssts_thickavg ~ diag + age + sexe + site', data=data).fit()
I used a linear model, which works well:
print(lm_full.summary())
Gives:
OLS Regression Results
===============================================================================
Dep. Variable: L_bankssts_thickavg R-squared: 0.156
Model: OLS Adj. R-squared: 0.131
Method: Least Squares F-statistic: 6.354
Date: Tue, 13 Dec 2016 Prob (F-statistic): 7.30e-07
Time: 15:40:28 Log-Likelihood: 98.227
No. Observations: 249 AIC: -180.5
Df Residuals: 241 BIC: -152.3
Df Model: 7
Covariance Type: nonrobust
===================================================================================
coef std err t P>|t| [95.0% Conf. Int.]
-----------------------------------------------------------------------------------
Intercept 2.8392 0.055 51.284 0.000 2.730 2.948
diag[T.hc] -0.0567 0.021 -2.650 0.009 -0.099 -0.015
sexe[T.m] -0.0435 0.029 -1.476 0.141 -0.102 0.015
site[T.Cre] -0.0069 0.036 -0.189 0.850 -0.078 0.065
site[T.MAM] -0.0635 0.040 -1.593 0.112 -0.142 0.015
site[T.UYU] -0.0948 0.038 -2.497 0.013 -0.170 -0.020
site[T.USS] 0.0145 0.037 0.396 0.692 -0.058 0.086
age -0.0059 0.001 -4.209 0.000 -0.009 -0.003
==============================================================================
Omnibus: 0.698 Durbin-Watson: 2.042
Prob(Omnibus): 0.705 Jarque-Bera (JB): 0.432
Skew: -0.053 Prob(JB): 0.806
Kurtosis: 3.175 Cond. No. 196.
==============================================================================
I now would like to plot the effect of, for example, the "diag" variable:
As it appears in my model, the diagnosis has an effect on the dependent variable, and I would like to plot this effect. I want a graphical representation with the two possible values of diag (i.e. 'asd' and 'hc') showing which group has the lowest value (i.e. a graphical representation of a contrast).
I would like something similar to the allEffects function from the effects package in R.
Do you think there are similar functions in Python?
The best way to plot this effect is to do a CCPR plot with matplotlib.
# Component-Component plus Residual (CCPR) plot (= partial residual plot)
import matplotlib.pyplot as plt
import statsmodels.api as sm

fig, ax = plt.subplots(figsize=(5, 5))
fig = sm.graphics.plot_ccpr(lm_full, 'diag[T.hc]', ax=ax)
plt.show()
Which gives the partial residual plot for the diag term.
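If something closer to the adjusted-means plots of R's allEffects is wanted, an alternative sketch (not the answer's method) is to plot the model's predicted mean for each diag level while holding the other covariates fixed. The covariate values below (mean age, sex 'f', reference site 'CAM') are illustrative choices:
import matplotlib.pyplot as plt
import pandas as pd

# Predicted mean outcome for each diag level, other covariates held fixed.
pred_frame = pd.DataFrame({
    'diag': ['asd', 'hc'],
    'age':  [data.age.mean()] * 2,
    'sexe': ['f'] * 2,
    'site': ['CAM'] * 2,
})
pred = lm_full.get_prediction(pred_frame).summary_frame()

# Point estimate with a 95% confidence interval for each group.
plt.errorbar([0, 1], pred['mean'],
             yerr=pred['mean'] - pred['mean_ci_lower'], fmt='o')
plt.xticks([0, 1], pred_frame['diag'])
plt.xlim(-0.5, 1.5)
plt.ylabel('Predicted L_bankssts_thickavg')
plt.show()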
I want to get a coefficient and the Newey-West standard error associated with it.
I am looking for a Python library (ideally, but any working solution is fine) that can do what the following R code does:
library(sandwich)
library(lmtest)
a <- matrix(c(1,3,5,7,4,5,6,4,7,8,9))
b <- matrix(c(3,5,6,2,4,6,7,8,7,8,9))
temp.lm = lm(a ~ b)
temp.summ <- summary(temp.lm)
temp.summ$coefficients <- unclass(coeftest(temp.lm, vcov. = NeweyWest))
print (temp.summ$coefficients)
Result:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.0576208 2.5230532 0.8155281 0.4358205
b 0.5594796 0.4071834 1.3740235 0.2026817
I get the coefficients and their associated standard errors.
I see the statsmodels.stats.sandwich_covariance.cov_hac function, but I don't see how to make it work with OLS.
Edited (10/31/2015) to reflect the preferred coding style for statsmodels as of fall 2015.
In statsmodels version 0.6.1 you can do the following:
import pandas as pd
import numpy as np
import statsmodels.formula.api as smf
df = pd.DataFrame({'a':[1,3,5,7,4,5,6,4,7,8,9],
'b':[3,5,6,2,4,6,7,8,7,8,9]})
reg = smf.ols('a ~ 1 + b', data=df).fit(cov_type='HAC', cov_kwds={'maxlags': 1})
print(reg.summary())
OLS Regression Results
==============================================================================
Dep. Variable: a R-squared: 0.281
Model: OLS Adj. R-squared: 0.201
Method: Least Squares F-statistic: 1.949
Date: Sat, 31 Oct 2015 Prob (F-statistic): 0.196
Time: 03:15:46 Log-Likelihood: -22.603
No. Observations: 11 AIC: 49.21
Df Residuals: 9 BIC: 50.00
Df Model: 1
Covariance Type: HAC
==============================================================================
coef std err z P>|z| [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept 2.0576 2.661 0.773 0.439 -3.157 7.272
b 0.5595 0.401 1.396 0.163 -0.226 1.345
==============================================================================
Omnibus: 0.361 Durbin-Watson: 1.468
Prob(Omnibus): 0.835 Jarque-Bera (JB): 0.331
Skew: 0.321 Prob(JB): 0.847
Kurtosis: 2.442 Cond. No. 19.1
==============================================================================
Warnings:
[1] Standard Errors are heteroscedasticity and autocorrelation robust (HAC) using 1 lags and without small sample correction
Or you can use the get_robustcov_results method after fitting the model:
reg = smf.ols('a ~ 1 + b', data=df).fit()
new = reg.get_robustcov_results(cov_type='HAC', maxlags=1)
print(new.summary())
OLS Regression Results
==============================================================================
Dep. Variable: a R-squared: 0.281
Model: OLS Adj. R-squared: 0.201
Method: Least Squares F-statistic: 1.949
Date: Sat, 31 Oct 2015 Prob (F-statistic): 0.196
Time: 03:15:46 Log-Likelihood: -22.603
No. Observations: 11 AIC: 49.21
Df Residuals: 9 BIC: 50.00
Df Model: 1
Covariance Type: HAC
==============================================================================
coef std err z P>|z| [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept 2.0576 2.661 0.773 0.439 -3.157 7.272
b 0.5595 0.401 1.396 0.163 -0.226 1.345
==============================================================================
Omnibus: 0.361 Durbin-Watson: 1.468
Prob(Omnibus): 0.835 Jarque-Bera (JB): 0.331
Skew: 0.321 Prob(JB): 0.847
Kurtosis: 2.442 Cond. No. 19.1
==============================================================================
Warnings:
[1] Standard Errors are heteroscedasticity and autocorrelation robust (HAC) using 1 lags and without small sample correction
The defaults in statsmodels are slightly different from the defaults of the equivalent method in R. The R call can be made equivalent to the statsmodels default (what I did above) by changing the vcov. argument as follows:
temp.summ$coefficients <- unclass(coeftest(temp.lm,
vcov. = NeweyWest(temp.lm,lag=1,prewhite=FALSE)))
print (temp.summ$coefficients)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.0576208 2.6605060 0.7733945 0.4591196
b 0.5594796 0.4007965 1.3959193 0.1962142
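Going the other direction, the summary footnote above notes that these HAC errors were computed without a small-sample correction; to my knowledge, statsmodels accepts a use_correction key in cov_kwds if you want it switched on. A minimal sketch reusing df from above (use_correction is an assumed-valid option):
# Same Newey-West fit, asking statsmodels for the small-sample correction.
reg_corr = smf.ols('a ~ 1 + b', data=df).fit(
    cov_type='HAC',
    cov_kwds={'maxlags': 1, 'use_correction': True},
)
print(reg_corr.bse)  # HAC standard errors with the correction applied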
You can also still do Newey-West in pandas (0.17), although I believe the plan is to deprecate OLS in pandas:
print(pd.stats.ols.OLS(df.a, df.b, nw_lags=1))
-------------------------Summary of Regression Analysis-------------------------
Formula: Y ~ <x> + <intercept>
Number of Observations: 11
Number of Degrees of Freedom: 2
R-squared: 0.2807
Adj R-squared: 0.2007
Rmse: 2.0880
F-stat (1, 9): 1.5943, p-value: 0.2384
Degrees of Freedom: model 1, resid 9
-----------------------Summary of Estimated Coefficients------------------------
Variable Coef Std Err t-stat p-value CI 2.5% CI 97.5%
--------------------------------------------------------------------------------
x 0.5595 0.4431 1.26 0.2384 -0.3090 1.4280
intercept 2.0576 2.9413 0.70 0.5019 -3.7073 7.8226
*** The calculations are Newey-West adjusted with lags 1
---------------------------------End of Summary---------------------------------
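Finally, since the question asks for a coefficient plus its Newey-West standard error rather than a printed table, those numbers can be read directly off the statsmodels results object. A short sketch reusing df and smf from above (reg_nw is just an illustrative variable name):
# Re-fit with Newey-West (HAC) errors and pull out the numbers directly.
reg_nw = smf.ols('a ~ 1 + b', data=df).fit(cov_type='HAC', cov_kwds={'maxlags': 1})
print(reg_nw.params['b'])  # coefficient: 0.5595
print(reg_nw.bse['b'])     # Newey-West standard error: 0.401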