Plot contrast of a linear model in Python

In Python I am trying to plot the effect of a linear model:
import pandas as pd
import statsmodels.api as sm

data = pd.read_excel(input_filename)
data.sexe = data.sexe.map({1: 'm', 2: 'f'})
data.diag = data.diag.map({1: 'asd', 4: 'hc'})
data.site = data.site.map({10: 'USS', 20: 'UYU', 30: 'CAM', 40: 'MAM', 2: 'Cre'})
lm_full = sm.formula.ols(formula='L_bankssts_thickavg ~ diag + age + sexe + site', data=data).fit()
I used a linear model, which works well:
print(lm_full.summary())
This gives:
OLS Regression Results
===============================================================================
Dep. Variable: L_bankssts_thickavg R-squared: 0.156
Model: OLS Adj. R-squared: 0.131
Method: Least Squares F-statistic: 6.354
Date: Tue, 13 Dec 2016 Prob (F-statistic): 7.30e-07
Time: 15:40:28 Log-Likelihood: 98.227
No. Observations: 249 AIC: -180.5
Df Residuals: 241 BIC: -152.3
Df Model: 7
Covariance Type: nonrobust
===================================================================================
coef std err t P>|t| [95.0% Conf. Int.]
-----------------------------------------------------------------------------------
Intercept 2.8392 0.055 51.284 0.000 2.730 2.948
diag[T.hc] -0.0567 0.021 -2.650 0.009 -0.099 -0.015
sexe[T.m] -0.0435 0.029 -1.476 0.141 -0.102 0.015
site[T.Cre] -0.0069 0.036 -0.189 0.850 -0.078 0.065
site[T.MAM] -0.0635 0.040 -1.593 0.112 -0.142 0.015
site[T.UYU] -0.0948 0.038 -2.497 0.013 -0.170 -0.020
site[T.USS] 0.0145 0.037 0.396 0.692 -0.058 0.086
age -0.0059 0.001 -4.209 0.000 -0.009 -0.003
==============================================================================
Omnibus: 0.698 Durbin-Watson: 2.042
Prob(Omnibus): 0.705 Jarque-Bera (JB): 0.432
Skew: -0.053 Prob(JB): 0.806
Kurtosis: 3.175 Cond. No. 196.
==============================================================================
I would now like to plot the effect of, for example, the "diag" variable.
As it appears in my model, the diagnosis has an effect on the dependent variable, and I would like to plot this effect: a graphical representation with the two possible values of diag (i.e. 'asd' and 'hc') showing which group has the lower value (i.e. a graphical representation of a contrast).
I would like something similar to the allEffects function (from the effects package) in R.
Do you think there are similar functions in Python?

A good way to plot this effect is a CCPR (partial residual) plot, using the statsmodels graphics helpers with matplotlib:
import matplotlib.pyplot as plt

# Component-Component plus Residual (CCPR) plot (= partial residual plot);
# the term name must match the fitted model, here 'diag[T.hc]'
fig, ax = plt.subplots(figsize=(5, 5))
fig = sm.graphics.plot_ccpr(lm_full, 'diag[T.hc]', ax=ax)
plt.show()
Which gives:

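If you want something closer to R's allEffects, i.e. predicted group means with confidence intervals rather than partial residuals, you can build it by hand from the fitted model. Here is a minimal sketch; the reference values chosen for age, sexe and site are illustrative assumptions, so adjust them to your data:

import pandas as pd
import matplotlib.pyplot as plt

# Predict the outcome for each diag level, holding the other covariates
# at typical values (assumed here: mean age, sexe='m', site='USS')
ref = pd.DataFrame({
    'diag': ['asd', 'hc'],
    'age':  [data.age.mean()] * 2,
    'sexe': ['m'] * 2,
    'site': ['USS'] * 2,
})
pred = lm_full.get_prediction(ref).summary_frame()  # mean and CI columns

plt.errorbar([0, 1], pred['mean'],
             yerr=[pred['mean'] - pred['mean_ci_lower'],
                   pred['mean_ci_upper'] - pred['mean']],
             fmt='o', capsize=4)
plt.xticks([0, 1], ['asd', 'hc'])
plt.xlabel('diag')
plt.ylabel('predicted L_bankssts_thickavg')
plt.show()

The plot then shows directly which of the two groups has the lower adjusted mean, which is exactly what a contrast plot is meant to convey.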
Related

Correct multiple polynomial regression formula with OLS in Python

I need to perform multiple polynomial regression and obtain statistics: p-values, AIC, etc.
As far as I understand, I can do that with OLS; however, I only found a way to build a formula using one independent variable, like this:
model = 'act_hours ~ h_hours + I(h_hours**2)'
hours_model = smf.ols(formula = model, data = df)
I tried to define a formula using two independent variables, but I could not tell whether that is the correct way and whether the results are reasonable. The line I doubt is model = 'Height ~ Diamet + I(Diamet**2) + area + I(area**2)'. The full code is this:
import pandas as pd
import statsmodels.formula.api as smf
train = pd.read_csv(r'W:\...file.csv')
model = 'Height ~ Diamet + I(Diamet**2) + area + I(area**2)'
hours_model = smf.ols(formula = model, data = train).fit()
print(hours_model.summary())
The summary of the regression is here:
OLS Regression Results
==============================================================================
Dep. Variable: Height R-squared: 0.611
Model: OLS Adj. R-squared: 0.609
Method: Least Squares F-statistic: 376.0
Date: Fri, 04 Feb 2022 Prob (F-statistic): 1.33e-194
Time: 08:50:17 Log-Likelihood: -5114.6
No. Observations: 963 AIC: 1.024e+04
Df Residuals: 958 BIC: 1.026e+04
Df Model: 4
Covariance Type: nonrobust
==================================================================================
coef std err t P>|t| [0.025 0.975]
----------------------------------------------------------------------------------
Intercept 13.9287 60.951 0.229 0.819 -105.684 133.542
Diamet 0.6027 0.340 1.770 0.077 -0.066 1.271
I(Diamet ** 2) 0.0004 0.002 0.262 0.794 -0.003 0.004
area 3.3553 5.307 0.632 0.527 -7.060 13.771
I(area** 2) 0.2519 0.108 2.324 0.020 0.039 0.465
==============================================================================
Omnibus: 60.996 Durbin-Watson: 1.889
Prob(Omnibus): 0.000 Jarque-Bera (JB): 86.039
Skew: 0.528 Prob(JB): 2.07e-19
Kurtosis: 4.015 Cond. No. 4.45e+05
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 4.45e+05. This might indicate that there are
strong multicollinearity or other numerical problems.
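The formula itself is valid Patsy syntax, so specifying several polynomial terms this way is correct; the large condition number mostly reflects how strongly each variable correlates with its own square. A common remedy, sketched below assuming the same column names as above, is to center the predictors before squaring:

import pandas as pd
import statsmodels.formula.api as smf

train = pd.read_csv(r'W:\...file.csv')  # placeholder path from the question

# Centering makes x and x**2 far less correlated, which typically
# lowers the condition number without changing the fitted curve
train['Diamet_c'] = train['Diamet'] - train['Diamet'].mean()
train['area_c'] = train['area'] - train['area'].mean()

model = 'Height ~ Diamet_c + I(Diamet_c**2) + area_c + I(area_c**2)'
print(smf.ols(formula=model, data=train).fit().summary())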

Getting statsmodels RollingOLS results summary information

I am running rolling regressions using the RollingOLS function in statsmodels, and I am wondering if it's possible to get the summary statistics (betas, R², etc.) for each regression done in the rolling window.
Using a single OLS regression, you can get the summary information like this:
X_opt = X[:, [0,1,2,3]]
regressor_OLS = sm.OLS(endog= y, exog= X_opt).fit()
regressor_OLS.summary()
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.951
Model: OLS Adj. R-squared: 0.948
Method: Least Squares F-statistic: 296.0
Date: Wed, 08 Aug 2018 Prob (F-statistic): 4.53e-30
Time: 00:46:48 Log-Likelihood: -525.39
No. Observations: 50 AIC: 1059.
Df Residuals: 46 BIC: 1066.
Df Model: 3
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 5.012e+04 6572.353 7.626 0.000 3.69e+04 6.34e+04
x1 0.8057 0.045 17.846 0.000 0.715 0.897
x2 -0.0268 0.051 -0.526 0.602 -0.130 0.076
x3 0.0272 0.016 1.655 0.105 -0.006 0.060
==============================================================================
Omnibus: 14.838 Durbin-Watson: 1.282
Prob(Omnibus): 0.001 Jarque-Bera (JB): 21.442
Skew: -0.949 Prob(JB): 2.21e-05
Kurtosis: 5.586 Cond. No. 1.40e+06
==============================================================================
Is there a way to get this information for the regression run on each window of a rolling regression?
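statsmodels ships a rolling estimator for exactly this. A sketch, assuming the attribute names of RollingRegressionResults in recent statsmodels versions and an arbitrary window size of 30:

import statsmodels.api as sm
from statsmodels.regression.rolling import RollingOLS

exog = sm.add_constant(X_opt)               # same design as the single OLS above
res = RollingOLS(y, exog, window=30).fit()  # window size is an assumption

# There is no per-window summary(), but the per-window statistics are
# exposed as arrays/DataFrames indexed by the window end point:
print(res.params[-5:])     # betas for the last five windows
print(res.bse[-5:])        # standard errors per window
print(res.rsquared[-5:])   # R-squared per window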

Difference in Linear Regression using Statsmodels between Patsy version and Dummy lists version

I am getting different coefficient values and coefficient errors from the smf.ols and sm.OLS functions of statsmodels, even though mathematically they should specify the same regression and give the same results.
I have made a 100% reproducible example of my question; the dataframe df can be downloaded from here: https://drive.google.com/drive/folders/1i67wztkrAeEZH2tv2hyOlgxG7N80V3pI?usp=sharing
Case 1: Linear Model using Patsy from Statsmodels
# First we load the libraries:
import statsmodels.api as sm
import statsmodels.formula.api as smf
import random
import pandas as pd
# We define a specific seed to have the same results:
random.seed(1234)
# Now we read the data that can be downloaded from Google Drive link provided above:
df = pd.read_csv("/Users/user/Documents/example/cars.csv", sep = "|")
# We create the linear regression:
lm1 = smf.ols('price ~ make + fuel_system + engine_type + num_of_doors + bore + compression_ratio + height + peak_rpm + 1', data = df)
# We see the results:
lm1.fit().summary()
The result of lm1 is:
OLS Regression Results
==============================================================================
Dep. Variable: price R-squared: 0.894
Model: OLS Adj. R-squared: 0.868
Method: Least Squares F-statistic: 35.54
Date: Mon, 18 Feb 2019 Prob (F-statistic): 5.24e-62
Time: 17:19:14 Log-Likelihood: -1899.7
No. Observations: 205 AIC: 3879.
Df Residuals: 165 BIC: 4012.
Df Model: 39
Covariance Type: nonrobust
=========================================================================================
coef std err t P>|t| [0.025 0.975]
-----------------------------------------------------------------------------------------
Intercept 1.592e+04 1.21e+04 1.320 0.189 -7898.396 3.97e+04
make[T.audi] 6519.7045 2371.807 2.749 0.007 1836.700 1.12e+04
make[T.bmw] 1.427e+04 2292.551 6.223 0.000 9740.771 1.88e+04
make[T.chevrolet] -571.8236 2860.026 -0.200 0.842 -6218.788 5075.141
make[T.dodge] -1186.3430 2261.240 -0.525 0.601 -5651.039 3278.353
make[T.honda] 2779.6496 2891.626 0.961 0.338 -2929.709 8489.009
make[T.isuzu] 3098.9677 2592.645 1.195 0.234 -2020.069 8218.004
make[T.jaguar] 1.752e+04 2416.313 7.252 0.000 1.28e+04 2.23e+04
make[T.mazda] 306.6568 2134.567 0.144 0.886 -3907.929 4521.243
make[T.mercedes-benz] 1.698e+04 2320.871 7.318 0.000 1.24e+04 2.16e+04
make[T.mercury] 2958.1002 3605.739 0.820 0.413 -4161.236 1.01e+04
make[T.mitsubishi] -1188.8337 2284.697 -0.520 0.604 -5699.844 3322.176
make[T.nissan] -1211.5463 2073.422 -0.584 0.560 -5305.405 2882.312
make[T.peugot] 3057.0217 4255.809 0.718 0.474 -5345.841 1.15e+04
make[T.plymouth] -894.5921 2332.746 -0.383 0.702 -5500.473 3711.289
make[T.porsche] 9558.8747 3688.038 2.592 0.010 2277.044 1.68e+04
make[T.renault] -2124.9722 2847.536 -0.746 0.457 -7747.277 3497.333
make[T.saab] 3490.5333 2319.189 1.505 0.134 -1088.579 8069.645
make[T.subaru] -1.636e+04 4002.796 -4.087 0.000 -2.43e+04 -8456.659
make[T.toyota] -770.9677 1911.754 -0.403 0.687 -4545.623 3003.688
make[T.volkswagen] 406.9179 2219.714 0.183 0.855 -3975.788 4789.623
make[T.volvo] 5433.7129 2397.030 2.267 0.025 700.907 1.02e+04
fuel_system[T.2bbl] 2142.1594 2232.214 0.960 0.339 -2265.226 6549.545
fuel_system[T.4bbl] 464.1109 3999.976 0.116 0.908 -7433.624 8361.846
fuel_system[T.idi] 1.991e+04 6622.812 3.007 0.003 6837.439 3.3e+04
fuel_system[T.mfi] 3716.5201 3936.805 0.944 0.347 -4056.488 1.15e+04
fuel_system[T.mpfi] 3964.1109 2267.538 1.748 0.082 -513.019 8441.241
fuel_system[T.spdi] 3240.0003 2719.925 1.191 0.235 -2130.344 8610.344
fuel_system[T.spfi] 932.1959 4019.476 0.232 0.817 -7004.041 8868.433
engine_type[T.dohcv] -1.208e+04 4205.826 -2.872 0.005 -2.04e+04 -3773.504
engine_type[T.l] -4833.9860 3763.812 -1.284 0.201 -1.23e+04 2597.456
engine_type[T.ohc] -4038.8848 1213.598 -3.328 0.001 -6435.067 -1642.702
engine_type[T.ohcf] 9618.9281 3504.600 2.745 0.007 2699.286 1.65e+04
engine_type[T.ohcv] 3051.7629 1445.185 2.112 0.036 198.323 5905.203
engine_type[T.rotor] 1403.9928 3217.402 0.436 0.663 -4948.593 7756.579
num_of_doors[T.two] -419.9640 521.754 -0.805 0.422 -1450.139 610.211
bore 3993.4308 1373.487 2.908 0.004 1281.556 6705.306
compression_ratio -1200.5665 460.681 -2.606 0.010 -2110.156 -290.977
height -80.7141 146.219 -0.552 0.582 -369.417 207.988
peak_rpm -0.5903 0.790 -0.747 0.456 -2.150 0.970
==============================================================================
Omnibus: 65.777 Durbin-Watson: 1.217
Prob(Omnibus): 0.000 Jarque-Bera (JB): 399.594
Skew: 1.059 Prob(JB): 1.70e-87
Kurtosis: 9.504 Cond. No. 3.26e+05
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 3.26e+05. This might indicate that there are
strong multicollinearity or other numerical problems.
"""
Case 2: Linear Model using Dummy Variables from Statsmodels as well
# We define a specific seed to have the same results:
random.seed(1234)
# First we check what `object` type variables we have in our dataset:
df.dtypes
# We create a list with the names of the `object` (categorical) columns;
# the list is named object_cols to avoid shadowing the Python builtin `object`:
object_cols = ['make',
               'fuel_system',
               'engine_type',
               'num_of_doors'
               ]
# Now we convert those columns to dummy variables with the get_dummies function to obtain a single numeric dataframe:
df_num = pd.get_dummies(df, columns=object_cols)
# We ensure the dataframe is numeric by casting all values to float64:
df_num = df_num[df_num.columns].apply(pd.to_numeric, errors='coerce', axis = 1)
# We define the predictive variables dataset:
X = df_num.drop('price', axis = 1)
# We define the response variable values:
y = df_num.price.values
# We add a constant as we did in the previous example (adding "+1" to Patsy):
Xc = sm.add_constant(X) # Adds a constant to the model
# We create the linear model and obtain results:
lm2 = sm.OLS(y, Xc)
lm2.fit().summary()
The result of lm2 is:
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.894
Model: OLS Adj. R-squared: 0.868
Method: Least Squares F-statistic: 35.54
Date: Mon, 18 Feb 2019 Prob (F-statistic): 5.24e-62
Time: 17:28:16 Log-Likelihood: -1899.7
No. Observations: 205 AIC: 3879.
Df Residuals: 165 BIC: 4012.
Df Model: 39
Covariance Type: nonrobust
======================================================================================
coef std err t P>|t| [0.025 0.975]
--------------------------------------------------------------------------------------
const 1.205e+04 6811.094 1.769 0.079 -1398.490 2.55e+04
bore 3993.4308 1373.487 2.908 0.004 1281.556 6705.306
compression_ratio -1200.5665 460.681 -2.606 0.010 -2110.156 -290.977
height -80.7141 146.219 -0.552 0.582 -369.417 207.988
peak_rpm -0.5903 0.790 -0.747 0.456 -2.150 0.970
make_alfa-romero -2273.9631 1865.185 -1.219 0.225 -5956.669 1408.743
make_audi 4245.7414 1324.140 3.206 0.002 1631.299 6860.184
make_bmw 1.199e+04 1232.635 9.730 0.000 9559.555 1.44e+04
make_chevrolet -2845.7867 1976.730 -1.440 0.152 -6748.733 1057.160
make_dodge -3460.3061 1170.966 -2.955 0.004 -5772.315 -1148.297
make_honda 505.6865 2049.865 0.247 0.805 -3541.661 4553.034
make_isuzu 825.0045 1706.160 0.484 0.629 -2543.716 4193.725
make_jaguar 1.525e+04 1903.813 8.010 0.000 1.15e+04 1.9e+04
make_mazda -1967.3063 982.179 -2.003 0.047 -3906.564 -28.048
make_mercedes-benz 1.471e+04 1423.004 10.338 0.000 1.19e+04 1.75e+04
make_mercury 684.1370 2913.361 0.235 0.815 -5068.136 6436.410
make_mitsubishi -3462.7968 1221.018 -2.836 0.005 -5873.631 -1051.963
make_nissan -3485.5094 946.316 -3.683 0.000 -5353.958 -1617.060
make_peugot 783.0586 3513.296 0.223 0.824 -6153.754 7719.871
make_plymouth -3168.5552 1293.376 -2.450 0.015 -5722.256 -614.854
make_porsche 7284.9115 2853.174 2.553 0.012 1651.475 1.29e+04
make_renault -4398.9354 2037.945 -2.159 0.032 -8422.747 -375.124
make_saab 1216.5702 1487.192 0.818 0.415 -1719.810 4152.950
make_subaru -1.863e+04 3263.524 -5.710 0.000 -2.51e+04 -1.22e+04
make_toyota -3044.9308 776.059 -3.924 0.000 -4577.218 -1512.644
make_volkswagen -1867.0452 1170.975 -1.594 0.113 -4179.072 444.981
make_volvo 3159.7498 1327.405 2.380 0.018 538.862 5780.638
fuel_system_1bbl -2790.4092 2230.161 -1.251 0.213 -7193.740 1612.922
fuel_system_2bbl -648.2498 1094.525 -0.592 0.554 -2809.330 1512.830
fuel_system_4bbl -2326.2983 3094.703 -0.752 0.453 -8436.621 3784.024
fuel_system_idi 1.712e+04 6154.806 2.782 0.006 4971.083 2.93e+04
fuel_system_mfi 926.1109 3063.134 0.302 0.763 -5121.881 6974.102
fuel_system_mpfi 1173.7017 1186.125 0.990 0.324 -1168.238 3515.642
fuel_system_spdi 449.5911 1827.318 0.246 0.806 -3158.349 4057.531
fuel_system_spfi -1858.2133 3111.596 -0.597 0.551 -8001.891 4285.464
engine_type_dohc 2703.6445 1803.080 1.499 0.136 -856.440 6263.729
engine_type_dohcv -9374.0342 3504.717 -2.675 0.008 -1.63e+04 -2454.161
engine_type_l -2130.3416 3357.283 -0.635 0.527 -8759.115 4498.431
engine_type_ohc -1335.2404 1454.047 -0.918 0.360 -4206.177 1535.696
engine_type_ohcf 1.232e+04 2850.883 4.322 0.000 6693.659 1.8e+04
engine_type_ohcv 5755.4074 1669.627 3.447 0.001 2458.820 9051.995
engine_type_rotor 4107.6373 3032.223 1.355 0.177 -1879.323 1.01e+04
num_of_doors_four 6234.8048 3491.722 1.786 0.076 -659.410 1.31e+04
num_of_doors_two 5814.8408 3337.588 1.742 0.083 -775.045 1.24e+04
==============================================================================
Omnibus: 65.777 Durbin-Watson: 1.217
Prob(Omnibus): 0.000 Jarque-Bera (JB): 399.594
Skew: 1.059 Prob(JB): 1.70e-87
Kurtosis: 9.504 Cond. No. 1.01e+16
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 5.38e-23. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.
"""
As we can see, some variables like height have the same coefficient. Nevertheless, some others don't (the isuzu level of make, the ohc level of engine_type, the intercept, etc.). Shouldn't both outputs give the same result? What am I missing here or doing wrong?
Thanks in advance for your help.
P.S. As clarified by @sukhbinder, even when using the Patsy formula without an intercept (adding "-1" to the formula, since Patsy includes one by default) and removing the constant from the dummy formulation, I still get different results.
The reason the results do not match is that the two design matrices contain different columns: the Patsy formula interface uses treatment (N-1) coding, dropping one reference level per categorical variable, whereas pd.get_dummies keeps all N levels, which makes the dummies collinear with the constant (note the singular-matrix warning above).
Exactly the same results are obtained by going through the regression summary, identifying the reference levels that the formula dropped, and removing them by hand:
deletex = [
'make_alfa-romero',
'fuel_system_1bbl',
'engine_type_dohc',
'num_of_doors_four'
]
df_num.drop( deletex, axis = 1, inplace = True)
df_num = df_num[df_num.columns].apply(pd.to_numeric, errors='coerce', axis = 1)
X = df_num.drop('price', axis = 1)
y = df_num.price.values
Xc = sm.add_constant(X) # Adds a constant to the model
random.seed(1234)
linear_regression = sm.OLS(y, Xc)
linear_regression.fit().summary()
Which prints the result:
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.894
Model: OLS Adj. R-squared: 0.868
Method: Least Squares F-statistic: 35.54
Date: Thu, 21 Feb 2019 Prob (F-statistic): 5.24e-62
Time: 18:16:08 Log-Likelihood: -1899.7
No. Observations: 205 AIC: 3879.
Df Residuals: 165 BIC: 4012.
Df Model: 39
Covariance Type: nonrobust
======================================================================================
coef std err t P>|t| [0.025 0.975]
--------------------------------------------------------------------------------------
const 1.592e+04 1.21e+04 1.320 0.189 -7898.396 3.97e+04
bore 3993.4308 1373.487 2.908 0.004 1281.556 6705.306
compression_ratio -1200.5665 460.681 -2.606 0.010 -2110.156 -290.977
height -80.7141 146.219 -0.552 0.582 -369.417 207.988
peak_rpm -0.5903 0.790 -0.747 0.456 -2.150 0.970
make_audi 6519.7045 2371.807 2.749 0.007 1836.700 1.12e+04
make_bmw 1.427e+04 2292.551 6.223 0.000 9740.771 1.88e+04
make_chevrolet -571.8236 2860.026 -0.200 0.842 -6218.788 5075.141
make_dodge -1186.3430 2261.240 -0.525 0.601 -5651.039 3278.353
make_honda 2779.6496 2891.626 0.961 0.338 -2929.709 8489.009
make_isuzu 3098.9677 2592.645 1.195 0.234 -2020.069 8218.004
make_jaguar 1.752e+04 2416.313 7.252 0.000 1.28e+04 2.23e+04
make_mazda 306.6568 2134.567 0.144 0.886 -3907.929 4521.243
make_mercedes-benz 1.698e+04 2320.871 7.318 0.000 1.24e+04 2.16e+04
make_mercury 2958.1002 3605.739 0.820 0.413 -4161.236 1.01e+04
make_mitsubishi -1188.8337 2284.697 -0.520 0.604 -5699.844 3322.176
make_nissan -1211.5463 2073.422 -0.584 0.560 -5305.405 2882.312
make_peugot 3057.0217 4255.809 0.718 0.474 -5345.841 1.15e+04
make_plymouth -894.5921 2332.746 -0.383 0.702 -5500.473 3711.289
make_porsche 9558.8747 3688.038 2.592 0.010 2277.044 1.68e+04
make_renault -2124.9722 2847.536 -0.746 0.457 -7747.277 3497.333
make_saab 3490.5333 2319.189 1.505 0.134 -1088.579 8069.645
make_subaru -1.636e+04 4002.796 -4.087 0.000 -2.43e+04 -8456.659
make_toyota -770.9677 1911.754 -0.403 0.687 -4545.623 3003.688
make_volkswagen 406.9179 2219.714 0.183 0.855 -3975.788 4789.623
make_volvo 5433.7129 2397.030 2.267 0.025 700.907 1.02e+04
fuel_system_2bbl 2142.1594 2232.214 0.960 0.339 -2265.226 6549.545
fuel_system_4bbl 464.1109 3999.976 0.116 0.908 -7433.624 8361.846
fuel_system_idi 1.991e+04 6622.812 3.007 0.003 6837.439 3.3e+04
fuel_system_mfi 3716.5201 3936.805 0.944 0.347 -4056.488 1.15e+04
fuel_system_mpfi 3964.1109 2267.538 1.748 0.082 -513.019 8441.241
fuel_system_spdi 3240.0003 2719.925 1.191 0.235 -2130.344 8610.344
fuel_system_spfi 932.1959 4019.476 0.232 0.817 -7004.041 8868.433
engine_type_dohcv -1.208e+04 4205.826 -2.872 0.005 -2.04e+04 -3773.504
engine_type_l -4833.9860 3763.812 -1.284 0.201 -1.23e+04 2597.456
engine_type_ohc -4038.8848 1213.598 -3.328 0.001 -6435.067 -1642.702
engine_type_ohcf 9618.9281 3504.600 2.745 0.007 2699.286 1.65e+04
engine_type_ohcv 3051.7629 1445.185 2.112 0.036 198.323 5905.203
engine_type_rotor 1403.9928 3217.402 0.436 0.663 -4948.593 7756.579
num_of_doors_two -419.9640 521.754 -0.805 0.422 -1450.139 610.211
==============================================================================
Omnibus: 65.777 Durbin-Watson: 1.217
Prob(Omnibus): 0.000 Jarque-Bera (JB): 399.594
Skew: 1.059 Prob(JB): 1.70e-87
Kurtosis: 9.504 Cond. No. 3.26e+05
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 3.26e+05. This might indicate that there are
strong multicollinearity or other numerical problems.
This result is exactly the same as the first statsmodels call:
random.seed(1234)
lm_python = smf.ols('price ~ make + fuel_system + engine_type + num_of_doors + bore + compression_ratio + height + peak_rpm + 1', data = df)
lm_python.fit().summary()
OLS Regression Results
==============================================================================
Dep. Variable: price R-squared: 0.894
Model: OLS Adj. R-squared: 0.868
Method: Least Squares F-statistic: 35.54
Date: Thu, 21 Feb 2019 Prob (F-statistic): 5.24e-62
Time: 18:17:37 Log-Likelihood: -1899.7
No. Observations: 205 AIC: 3879.
Df Residuals: 165 BIC: 4012.
Df Model: 39
Covariance Type: nonrobust
=========================================================================================
coef std err t P>|t| [0.025 0.975]
-----------------------------------------------------------------------------------------
Intercept 1.592e+04 1.21e+04 1.320 0.189 -7898.396 3.97e+04
make[T.audi] 6519.7045 2371.807 2.749 0.007 1836.700 1.12e+04
make[T.bmw] 1.427e+04 2292.551 6.223 0.000 9740.771 1.88e+04
make[T.chevrolet] -571.8236 2860.026 -0.200 0.842 -6218.788 5075.141
make[T.dodge] -1186.3430 2261.240 -0.525 0.601 -5651.039 3278.353
make[T.honda] 2779.6496 2891.626 0.961 0.338 -2929.709 8489.009
make[T.isuzu] 3098.9677 2592.645 1.195 0.234 -2020.069 8218.004
make[T.jaguar] 1.752e+04 2416.313 7.252 0.000 1.28e+04 2.23e+04
make[T.mazda] 306.6568 2134.567 0.144 0.886 -3907.929 4521.243
make[T.mercedes-benz] 1.698e+04 2320.871 7.318 0.000 1.24e+04 2.16e+04
make[T.mercury] 2958.1002 3605.739 0.820 0.413 -4161.236 1.01e+04
make[T.mitsubishi] -1188.8337 2284.697 -0.520 0.604 -5699.844 3322.176
make[T.nissan] -1211.5463 2073.422 -0.584 0.560 -5305.405 2882.312
make[T.peugot] 3057.0217 4255.809 0.718 0.474 -5345.841 1.15e+04
make[T.plymouth] -894.5921 2332.746 -0.383 0.702 -5500.473 3711.289
make[T.porsche] 9558.8747 3688.038 2.592 0.010 2277.044 1.68e+04
make[T.renault] -2124.9722 2847.536 -0.746 0.457 -7747.277 3497.333
make[T.saab] 3490.5333 2319.189 1.505 0.134 -1088.579 8069.645
make[T.subaru] -1.636e+04 4002.796 -4.087 0.000 -2.43e+04 -8456.659
make[T.toyota] -770.9677 1911.754 -0.403 0.687 -4545.623 3003.688
make[T.volkswagen] 406.9179 2219.714 0.183 0.855 -3975.788 4789.623
make[T.volvo] 5433.7129 2397.030 2.267 0.025 700.907 1.02e+04
fuel_system[T.2bbl] 2142.1594 2232.214 0.960 0.339 -2265.226 6549.545
fuel_system[T.4bbl] 464.1109 3999.976 0.116 0.908 -7433.624 8361.846
fuel_system[T.idi] 1.991e+04 6622.812 3.007 0.003 6837.439 3.3e+04
fuel_system[T.mfi] 3716.5201 3936.805 0.944 0.347 -4056.488 1.15e+04
fuel_system[T.mpfi] 3964.1109 2267.538 1.748 0.082 -513.019 8441.241
fuel_system[T.spdi] 3240.0003 2719.925 1.191 0.235 -2130.344 8610.344
fuel_system[T.spfi] 932.1959 4019.476 0.232 0.817 -7004.041 8868.433
engine_type[T.dohcv] -1.208e+04 4205.826 -2.872 0.005 -2.04e+04 -3773.504
engine_type[T.l] -4833.9860 3763.812 -1.284 0.201 -1.23e+04 2597.456
engine_type[T.ohc] -4038.8848 1213.598 -3.328 0.001 -6435.067 -1642.702
engine_type[T.ohcf] 9618.9281 3504.600 2.745 0.007 2699.286 1.65e+04
engine_type[T.ohcv] 3051.7629 1445.185 2.112 0.036 198.323 5905.203
engine_type[T.rotor] 1403.9928 3217.402 0.436 0.663 -4948.593 7756.579
num_of_doors[T.two] -419.9640 521.754 -0.805 0.422 -1450.139 610.211
bore 3993.4308 1373.487 2.908 0.004 1281.556 6705.306
compression_ratio -1200.5665 460.681 -2.606 0.010 -2110.156 -290.977
height -80.7141 146.219 -0.552 0.582 -369.417 207.988
peak_rpm -0.5903 0.790 -0.747 0.456 -2.150 0.970
==============================================================================
Omnibus: 65.777 Durbin-Watson: 1.217
Prob(Omnibus): 0.000 Jarque-Bera (JB): 399.594
Skew: 1.059 Prob(JB): 1.70e-87
Kurtosis: 9.504 Cond. No. 3.26e+05
==============================================================================
The takeaway is that you need to check that the predictor columns correspond: pd.get_dummies creates a dummy column for every level of each categorical variable, while statsmodels' formula interface keeps only N-1 levels per categorical variable (the reference level is absorbed into the intercept).
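A more compact alternative, assuming the same df, is to let pandas do the N-1 coding itself: get_dummies accepts drop_first=True, which drops the first level of every categorical variable and thus matches Patsy's treatment coding without any manual deletion:

import pandas as pd
import statsmodels.api as sm

cat_cols = ['make', 'fuel_system', 'engine_type', 'num_of_doors']
df_num = pd.get_dummies(df, columns=cat_cols, drop_first=True)

X = sm.add_constant(df_num.drop('price', axis=1).astype(float))
y = df_num['price'].values
print(sm.OLS(y, X).fit().summary())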

Statsmodels.formula.api OLS does not show statistical values of intercept

I am running the following source code:
import numpy as np
import statsmodels.formula.api as sm
# Add one column of ones for the intercept term
X = np.append(arr= np.ones((50, 1)).astype(int), values=X, axis=1)
regressor_OLS = sm.OLS(endog=y, exog=X).fit()
print(regressor_OLS.summary())
where
X is a 50x5 numpy array (before adding the intercept term) which looks like this:
[[0 1 165349.20 136897.80 471784.10]
[0 0 162597.70 151377.59 443898.53]...]
and y is a 50x1 numpy array with float values for the dependent variable.
The first two columns are for a dummy variable with three different values. The remaining columns are three different independent variables.
Although it is said that statsmodels.formula.api's ols adds an intercept automatically (see @stellacia's answer here: OLS using statsmodel.formula.api versus statsmodel.api), its summary does not show the statistical values of the intercept term, as is evident in my case below:
OLS Regression Results
==============================================================================
Dep. Variable: Profit R-squared: 0.988
Model: OLS Adj. R-squared: 0.986
Method: Least Squares F-statistic: 727.1
Date: Sun, 01 Jul 2018 Prob (F-statistic): 7.87e-42
Time: 21:40:23 Log-Likelihood: -545.15
No. Observations: 50 AIC: 1100.
Df Residuals: 45 BIC: 1110.
Df Model: 5
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
x1 3464.4536 4905.406 0.706 0.484 -6415.541 1.33e+04
x2 5067.8937 4668.238 1.086 0.283 -4334.419 1.45e+04
x3 0.7182 0.066 10.916 0.000 0.586 0.851
x4 0.3113 0.035 8.885 0.000 0.241 0.382
x5 0.0786 0.023 3.429 0.001 0.032 0.125
==============================================================================
Omnibus: 1.355 Durbin-Watson: 1.288
Prob(Omnibus): 0.508 Jarque-Bera (JB): 1.241
Skew: -0.237 Prob(JB): 0.538
Kurtosis: 2.391 Cond. No. 8.28e+05
==============================================================================
For this reason, I added to my source code the line:
X = np.append(arr= np.ones((50, 1)).astype(int), values=X, axis=1)
as you can see at the beginning of my post, and the statistical values of the intercept/constant are then shown below:
OLS Regression Results
==============================================================================
Dep. Variable: Profit R-squared: 0.951
Model: OLS Adj. R-squared: 0.945
Method: Least Squares F-statistic: 169.9
Date: Sun, 01 Jul 2018 Prob (F-statistic): 1.34e-27
Time: 20:25:21 Log-Likelihood: -525.38
No. Observations: 50 AIC: 1063.
Df Residuals: 44 BIC: 1074.
Df Model: 5
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 5.013e+04 6884.820 7.281 0.000 3.62e+04 6.4e+04
x1 198.7888 3371.007 0.059 0.953 -6595.030 6992.607
x2 -41.8870 3256.039 -0.013 0.990 -6604.003 6520.229
x3 0.8060 0.046 17.369 0.000 0.712 0.900
x4 -0.0270 0.052 -0.517 0.608 -0.132 0.078
x5 0.0270 0.017 1.574 0.123 -0.008 0.062
==============================================================================
Omnibus: 14.782 Durbin-Watson: 1.283
Prob(Omnibus): 0.001 Jarque-Bera (JB): 21.266
Skew: -0.948 Prob(JB): 2.41e-05
Kurtosis: 5.572 Cond. No. 1.45e+06
==============================================================================
Why are the statistical values of the intercept not shown when I do not add an intercept term myself, even though statsmodels.formula.api.OLS is said to add it automatically?
"No constant is added by the model unless you are using formulas."
Therefore try something like below example. Variable names should be defined according to your data set.
Use,
regressor_OLS = smf.ols(formula='Y_variable ~ X_variable', data=df).fit()
instead of,
regressor_OLS = sm.OLS(endog=y, exog=X).fit()
Alternatively, you can keep the array interface and add the constant explicitly:
X = sm.add_constant(X)
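Putting the array-interface fix together, a minimal sketch with X and y as in the question:

import statsmodels.api as sm

Xc = sm.add_constant(X)                   # prepends a 'const' column of ones
results = sm.OLS(endog=y, exog=Xc).fit()
print(results.summary())                  # the summary now includes a 'const' row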

Multiple linear regression with Python statsmodels shows predictorVariable[T.x] in the output

OLS Regression Results
==============================================================================
Dep. Variable: BTCUSD R-squared: 0.989
Model: OLS Adj. R-squared: 0.985
Method: Least Squares F-statistic: 260.6
Date: Sun, 22 Apr 2018 Prob (F-statistic): 1.87e-171
Time: 13:10:27 Log-Likelihood: -2119.3
No. Observations: 280 AIC: 4383.
Df Residuals: 208 BIC: 4644.
Df Model: 71
Covariance Type: nonrobust
==========================================================================================================
coef std err t P>|t| [0.025 0.975]
----------------------------------------------------------------------------------------------------------
Intercept -3.013e+05 1.8e+05 -1.674 0.096 -6.56e+05 5.36e+04
howtobuycryptocurrencyWorldwide[T.1] 284.2228 436.490 0.651 0.516 -576.289 1144.735
howtobuycryptocurrencyWorldwide[T.2] -834.5288 918.605 -0.908 0.365 -2645.499 976.442
howtobuycryptocurrencyWorldwide[T.3] -1639.0373 892.061 -1.837 0.068 -3397.677 119.603
howtobuycryptocurrencyWorldwide[T.4] -1822.9216 1349.968 -1.350 0.178 -4484.296 838.453
howtobuycryptocurrencyWorldwide[T.5] -461.3566 751.629 -0.614 0.540 -1943.144 1020.431
howtobuycryptocurrencyWorldwide[T.6] -1590.4795 1084.831 -1.466 0.144 -3729.153 548.194
howtobuycryptocurrencyWorldwide[T.7] -667.8484 506.288 -1.319 0.189 -1665.962 330.265
howtobuycryptocurrencyWorldwide[T.8] -575.7590 1297.502 -0.444 0.658 -3133.698 1982.180
howtobuycryptocurrencyWorldwide[T.9] -2449.3509 1565.416 -1.565 0.119 -5535.466 636.764
howtobuycryptocurrencyWorldwide[T.10] 1362.5353 1131.645 1.204 0.230 -868.429 3593.499
howtobuycryptocurrencyWorldwide[T.11] 1.206e+04 5006.070 2.408 0.017 2186.460 2.19e+04
howtobuycryptocurrencyWorldwide[T.13] -8135.2934 3056.663 -2.661 0.008 -1.42e+04 -2109.283
howtobuycryptocurrencyWorldwide[T.14] -333.8614 1012.361 -0.330 0.742 -2329.665 1661.943
howtobuycryptocurrencyWorldwide[T.17] -9448.2497 3586.911 -2.634 0.009 -1.65e+04 -2376.888
howtobuycryptocurrencyWorldwide[T.19] -8515.1383 3795.035 -2.244 0.026 -1.6e+04 -1033.475
howtobuycryptocurrencyWorldwide[T.35] -4.1140 1172.341 -0.004 0.997 -2315.308 2307.080
howtobuycryptocurrencyWorldwide[T.36] -1.713e+04 6089.825 -2.814 0.005 -2.91e+04 -5128.168
howtobuycryptocurrencyWorldwide[T.54] -1.193e+04 4885.490 -2.441 0.015 -2.16e+04 -2294.187
howtobuycryptocurrencyWorldwide[T.62] -1.653e+04 5836.682 -2.833 0.005 -2.8e+04 -5027.678
howtobuycryptocurrencyWorldwide[T.72] -1.193e+04 4509.585 -2.645 0.009 -2.08e+04 -3038.531
howtobuycryptocurrencyWorldwide[T.95] -8206.0353 3263.856 -2.514 0.013 -1.46e+04 -1771.556
howtobuycryptocurrencyWorldwide[T.100] -2.327e+04 8503.289 -2.737 0.007 -4e+04 -6507.457
howtobuycryptocurrencyWorldwide[T.<1] -72.6343 359.855 -0.202 0.840 -782.065 636.797
Python code to run the linear regression:
mod3 = smf.ols('BTCUSD ~ <other variables> +howtobuycryptocurrencyWorldwide+howtobuybitcoinWorldwide+bitcoinWorldwide+howtobuyethereumWorldwide+ethereumWorldwide+howtobuyrippleWorldwide+rippleWorldwide+howtobuylitecoinWorldwide+litecoinWorldwide+bitcoinWorldwideYoutube+ethereumWorldwideYoutube+rippleWorldwideYoutube+litecoinWorldwideYoutube+vitalikWorldwide+satoshiWorldwide',data=cryptos).fit()
print(mod3.summary())
I don't understand the predictorVariable[T.x] notation. Can someone help explain?
The problem was that the Google Trends data contains '<1' in the results, so pandas read the column as strings and Patsy treated it as a categorical variable, generating one treatment-coded dummy per distinct value; that is what the [T.x] terms are (T for treatment coding, x for the level).
The values had to be converted; I basically did the following, where cryptos was the DataFrame:
cryptos.replace('<1', 0.1, inplace=True)
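If the whole column is meant to be numeric, it also helps to convert it explicitly after the replacement so that Patsy treats it as continuous rather than categorical; a sketch assuming the column name from the summary above:

import pandas as pd

cryptos.replace('<1', 0.1, inplace=True)
cryptos['howtobuycryptocurrencyWorldwide'] = pd.to_numeric(
    cryptos['howtobuycryptocurrencyWorldwide'])
# with a numeric dtype the model fits a single slope instead of
# one [T.x] dummy per distinct value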
