My goal: extract the formula (not only the coefficients) after a linear regression done with statsmodels.
Context:
I have a pandas DataFrame:
df
x y z
0 0.0 2.0 54.200
1 0.0 2.2 70.160
2 0.0 2.4 89.000
3 0.0 2.6 110.960
I am doing a linear regression using statsmodels.api (2 variables, polynomial degree 3), and I am happy with this regression:
OLS Regression Results
==============================================================================
Dep. Variable: z R-squared: 1.000
Model: OLS Adj. R-squared: 1.000
Method: Least Squares F-statistic: 2.193e+29
Date: Sun, 31 May 2020 Prob (F-statistic): 0.00
Time: 22:04:49 Log-Likelihood: 9444.6
No. Observations: 400 AIC: -1.887e+04
Df Residuals: 390 BIC: -1.883e+04
Df Model: 9
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 0.2000 3.33e-11 6.01e+09 0.000 0.200 0.200
x1 2.0000 1.16e-11 1.72e+11 0.000 2.000 2.000
x2 1.0000 2.63e-11 3.8e+10 0.000 1.000 1.000
x3 4.0000 3.85e-12 1.04e+12 0.000 4.000 4.000
x4 12.0000 4.36e-12 2.75e+12 0.000 12.000 12.000
x5 3.0000 6.81e-12 4.41e+11 0.000 3.000 3.000
x6 6.0000 5.74e-13 1.05e+13 0.000 6.000 6.000
x7 13.0000 4.99e-13 2.6e+13 0.000 13.000 13.000
x8 14.0000 4.99e-13 2.81e+13 0.000 14.000 14.000
x9 5.0000 5.74e-13 8.71e+12 0.000 5.000 5.000
==============================================================================
Omnibus: 25.163 Durbin-Watson: 0.038
Prob(Omnibus): 0.000 Jarque-Bera (JB): 28.834
Skew: -0.655 Prob(JB): 5.48e-07
Kurtosis: 2.872 Cond. No. 6.66e+03
==============================================================================
I need to implement this outside of Python (in MS Excel), so I would like to know the formula.
I know it is a degree-3 polynomial, but I am wondering how to know which coefficient applies to which term of
the equation.
For example: is the x7 coefficient the one for x³, y², x²y, ...?
Note: this is a simplified version of my problem; in reality I have 3 variables at degree 3, so 20 coefficients.
Here is a simpler example of code to get started with my case:
# %% Question extract formula from linear regresion coeff
#Import
import numpy as np # version : '1.18.1'
import pandas as pd # version'1.0.0'
import statsmodels.api as sm # version : '0.10.1'
from sklearn.preprocessing import PolynomialFeatures
from itertools import product
#%% Creating the dummy data
def function_for_df(row):
    x = row['x']
    y = row['y']
    return unknow_function(x, y)
def unknow_function(x, y):
    """
    This generates the data; of course, in reality I do not know the formula.
    """
    r = 0.2 + \
        6*x**3 + 4*x**2 + 2*x + \
        5*y**3 + 3*y**2 + 1*y + \
        12*x*y + 13*x**2*y + 14*x*y**2
    return r
# input data
x_input = np.arange(0, 4 , 0.2)
y_input = np.arange(2, 6 , 0.2)
# create a simple dataframe with dummy data
df = pd.DataFrame(list(product(x_input, y_input)), columns=['x', 'y'])
df['z'] = df.apply(function_for_df, axis=1)
# In reality I start from here!
#%% creating the model
X = df[['x','y']].astype(float) #
Y = df['z'].astype(float)
polynomial_features_final= PolynomialFeatures(degree=3)
X3 = polynomial_features_final.fit_transform(X)
model = sm.OLS(Y, X3).fit()
predictions = model.predict(X3)
print_model = model.summary()
print(print_model)
#%% using the model to make predictions, no problem
def model_predict(x_sample, y_samples):
    df_input = pd.DataFrame({"x": x_sample, "y": y_samples}, index=[0])
    # transform (not fit_transform): reuse the already-fitted feature expansion
    X_input = polynomial_features_final.transform(df_input)
    prediction = model.predict(X_input)
    return prediction
print("prediction for x=2, y=3.2 :" ,model_predict(2 ,3.2))
# How to extract the formula for the "model" ?
#Thanks
Side notes:
A description like the one given by patsy's ModelDesc would be fine:
from patsy import ModelDesc
ModelDesc.from_formula("y ~ x")
# or even better :
desc = ModelDesc.from_formula("y ~ (a + b + c + d) ** 2")
desc.describe()
But I am not able to make the bridge between my model and patsy.ModelDesc.
Thanks for your help.
As Josef said in the comments, I had to look at sklearn's PolynomialFeatures.
Then I found this answer:
PolynomialFeatures(degree=3).get_feature_names()
In context:
#%% creating the model
X = df[['x','y']].astype(float) #
Y = df['z'].astype(float)
polynomial_features_final= PolynomialFeatures(degree=3)
#X3 = polynomial_features_final.fit_transform(X)
X3 = polynomial_features_final.fit_transform(df[['x', 'y']].to_numpy())
model = sm.OLS(Y, X3).fit()
predictions = model.predict(X3)
print_model = model.summary()
print(print_model)
print("\n-- ONE SOLUTION --\n Coef and Term name :")
# On scikit-learn >= 1.0 use get_feature_names_out(); get_feature_names() was removed in 1.2
results = list(zip(model.params, polynomial_features_final.get_feature_names()))
print(results)
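Since the end goal is an Excel formula, here is a minimal sketch of how the term order can be reproduced and turned into a formula string. It uses only the standard library and assumes PolynomialFeatures orders its columns by increasing degree, using combinations with replacement of the input columns within each degree (this matches the coefficients in the summary above). The helper names polynomial_terms and excel_formula are hypothetical, and the coefficients are copied by hand from the summary; in practice you would pass model.params.

```python
from itertools import combinations_with_replacement

def polynomial_terms(names, degree):
    """Monomial labels in the column order assumed for PolynomialFeatures."""
    terms = []
    for d in range(degree + 1):
        for combo in combinations_with_replacement(range(len(names)), d):
            # Exponent of each variable in this monomial
            powers = [combo.count(i) for i in range(len(names))]
            factors = [
                name if p == 1 else f"{name}^{p}"
                for name, p in zip(names, powers) if p > 0
            ]
            terms.append("*".join(factors) if factors else "1")
    return terms

def excel_formula(coefs, terms):
    """Join coefficient/term pairs into an Excel-style expression."""
    parts = [f"{c:g}" if t == "1" else f"{c:g}*{t}" for c, t in zip(coefs, terms)]
    return "=" + " + ".join(parts)

terms = polynomial_terms(["x", "y"], 3)
# Coefficients const, x1..x9 copied from the OLS summary in the question
coefs = [0.2, 2, 1, 4, 12, 3, 6, 13, 14, 5]
formula = excel_formula(coefs, terms)
print(terms)
print(formula)
```

In Excel, x and y would be replaced by cell references or named ranges; ^ is Excel's power operator.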
You can fit the model with the Patsy formula language via statsmodels.formula.api.
For instance:
import statsmodels.formula.api as smf
# You do not need fit_transform to generate poly features in df
# You can specify the model using vectorized functions, many transformations are supported
model = smf.ols(formula='z ~ x*y + I(x**2)*I(y**2) + I(x**3)*I(y**3)', data=df).fit()
print_model = model.summary()
print(print_model)
OLS Regression Results
==============================================================================
Dep. Variable: z R-squared: 1.000
Model: OLS Adj. R-squared: 1.000
Method: Least Squares F-statistic: 4.193e+05
Date: Tue, 15 Feb 2022 Prob (F-statistic): 0.00
Time: 16:53:19 Log-Likelihood: -1478.1
No. Observations: 400 AIC: 2976.
Df Residuals: 390 BIC: 3016.
Df Model: 9
Covariance Type: nonrobust
=======================================================================================
coef std err t P>|t| [0.025 0.975]
---------------------------------------------------------------------------------------
Intercept 229.3425 23.994 9.558 0.000 182.168 276.517
x -180.0081 11.822 -15.226 0.000 -203.251 -156.765
y -136.2619 19.039 -7.157 0.000 -173.694 -98.830
x:y 118.0431 2.864 41.210 0.000 112.411 123.675
I(x ** 2) 31.2537 4.257 7.341 0.000 22.884 39.624
I(y ** 2) 22.5973 4.986 4.532 0.000 12.795 32.399
I(x ** 2):I(y ** 2) 1.4176 0.213 6.671 0.000 1.000 1.835
I(x ** 3) 4.7562 0.595 7.991 0.000 3.586 5.926
I(y ** 3) 4.7601 0.423 11.250 0.000 3.928 5.592
I(x ** 3):I(y ** 3) 0.0166 0.006 2.915 0.004 0.005 0.028
==============================================================================
Omnibus: 28.012 Durbin-Watson: 0.182
Prob(Omnibus): 0.000 Jarque-Bera (JB): 71.076
Skew: -0.311 Prob(JB): 3.68e-16
Kurtosis: 4.969 Cond. No. 1.32e+05
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.32e+05. This might indicate that there are
strong multicollinearity or other numerical problems.
print(model.params)
Intercept 229.342451
x -180.008082
y -136.261886
x:y 118.043098
I(x ** 2) 31.253705
I(y ** 2) 22.597298
I(x ** 2):I(y ** 2) 1.417551
I(x ** 3) 4.756205
I(y ** 3) 4.760144
I(x ** 3):I(y ** 3) 0.016611
dtype: float64
Then a simple print(model.params) gives you a natural bridge between the model and patsy.ModelDesc (here it is analogous to the formula definition).
Note: raw polynomials are used in this demo. You may switch to orthogonal polynomials, which help explain the contribution of each term to the variance in the outcome (see StackExchange).
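One caveat: judging by the params listing, the formula above expands to x, y, x:y, x², y², x²y², x³, y³ and x³y³. It has no x²y or xy² term, which may be why the fitted coefficients differ from the ones in the question. A minimal sketch of building a formula string that contains every monomial up to a given degree is below. The helper name full_polynomial_formula is hypothetical, and each monomial is written as a plain product inside I(...), which patsy evaluates as ordinary Python.

```python
from itertools import combinations_with_replacement

def full_polynomial_formula(response, names, degree):
    """Build a patsy-style formula with all monomials up to `degree`."""
    terms = []
    for d in range(1, degree + 1):
        for combo in combinations_with_replacement(names, d):
            if d == 1:
                terms.append(combo[0])
            else:
                # I(...) protects the arithmetic from patsy's own operators
                terms.append("I(" + " * ".join(combo) + ")")
    # patsy adds the intercept on its own, so it is not listed here
    return f"{response} ~ " + " + ".join(terms)

formula = full_polynomial_formula("z", ["x", "y"], 3)
print(formula)
```

The resulting string can be passed as smf.ols(formula=formula, data=df), and the labels in model.params then name each monomial directly.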
I need to perform multiple polynomial regression and obtain statistics such as p-values, AIC, etc.
As far as I understood I can do that with OLS, however I found only a way to produce a formula using one independent variable, like this:
model = 'act_hours ~ h_hours + I(h_hours**2)'
hours_model = smf.ols(formula = model, data = df)
I tried to define a formula using two independent variables, but I am not sure whether this is the correct way or whether the results are reasonable. The line I doubt is model = 'Height ~ Diamet + I(Diamet**2) + area + I(area**2)'. The full code is:
import pandas as pd
import statsmodels.formula.api as smf
train = pd.read_csv(r'W:\...file.csv')
model = 'Height ~ Diamet + I(Diamet**2) + area + I(area**2)'
hours_model = smf.ols(formula = model, data = train).fit()
print(hours_model.summary())
The summary of the regression is here:
OLS Regression Results
==============================================================================
Dep. Variable: Height R-squared: 0.611
Model: OLS Adj. R-squared: 0.609
Method: Least Squares F-statistic: 376.0
Date: Fri, 04 Feb 2022 Prob (F-statistic): 1.33e-194
Time: 08:50:17 Log-Likelihood: -5114.6
No. Observations: 963 AIC: 1.024e+04
Df Residuals: 958 BIC: 1.026e+04
Df Model: 4
Covariance Type: nonrobust
==================================================================================
coef std err t P>|t| [0.025 0.975]
----------------------------------------------------------------------------------
Intercept 13.9287 60.951 0.229 0.819 -105.684 133.542
Diamet 0.6027 0.340 1.770 0.077 -0.066 1.271
I(Diamet ** 2) 0.0004 0.002 0.262 0.794 -0.003 0.004
area 3.3553 5.307 0.632 0.527 -7.060 13.771
I(area** 2) 0.2519 0.108 2.324 0.020 0.039 0.465
==============================================================================
Omnibus: 60.996 Durbin-Watson: 1.889
Prob(Omnibus): 0.000 Jarque-Bera (JB): 86.039
Skew: 0.528 Prob(JB): 2.07e-19
Kurtosis: 4.015 Cond. No. 4.45e+05
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 4.45e+05. This might indicate that there are
strong multicollinearity or other numerical problems.
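For what it is worth, the five rows in the summary (Intercept, Diamet, I(Diamet ** 2), area, I(area ** 2)) suggest the formula was parsed as intended: each term becomes one column of the design matrix. The NumPy sketch below builds those same columns by hand on synthetic, noise-free data (the value ranges and true coefficients are made up for illustration) and checks that least squares recovers them. Note that this formula has no cross term such as Diamet:area; whether to add one is a modeling choice.

```python
import numpy as np

rng = np.random.default_rng(0)
diamet = rng.uniform(10, 50, 200)   # synthetic stand-in for the Diamet column
area = rng.uniform(1, 10, 200)      # synthetic stand-in for the area column

# Hypothetical "true" coefficients, in the same order as the summary rows
true_coefs = np.array([14.0, 0.6, 0.0004, 3.4, 0.25])

# Columns: Intercept, Diamet, I(Diamet**2), area, I(area**2)
design = np.column_stack(
    [np.ones_like(diamet), diamet, diamet**2, area, area**2]
)
height = design @ true_coefs        # noise-free response

coefs, *_ = np.linalg.lstsq(design, height, rcond=None)
print(coefs)
```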
I've looked through the documentation and still can't figure this out. I want to run a WLS with multiple regressors.
statsmodels.api is imported as sm
Example with a single variable:
X = Height
Y = Weight
res = sm.OLS(Y,X,).fit()
res.summary()
Say I also have:
X2 = Age
How do I add this to my regression?
You can put them into a DataFrame and select the columns (this way the output looks nicer too):
import statsmodels.api as sm
import pandas as pd
import numpy as np
Height = np.random.uniform(0,1,100)
Weight = np.random.uniform(0,1,100)
Age = np.random.uniform(0,30,100)
df = pd.DataFrame({'Height':Height,'Weight':Weight,'Age':Age})
res = sm.OLS(df['Height'],df[['Weight','Age']]).fit()
In [10]: res.summary()
Out[10]:
<class 'statsmodels.iolib.summary.Summary'>
"""
OLS Regression Results
=======================================================================================
Dep. Variable: Height R-squared (uncentered): 0.700
Model: OLS Adj. R-squared (uncentered): 0.694
Method: Least Squares F-statistic: 114.3
Date: Mon, 24 Aug 2020 Prob (F-statistic): 2.43e-26
Time: 15:54:30 Log-Likelihood: -28.374
No. Observations: 100 AIC: 60.75
Df Residuals: 98 BIC: 65.96
Df Model: 2
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Weight 0.1787 0.090 1.988 0.050 0.000 0.357
Age 0.0229 0.003 8.235 0.000 0.017 0.028
==============================================================================
Omnibus: 2.938 Durbin-Watson: 1.813
Prob(Omnibus): 0.230 Jarque-Bera (JB): 2.223
Skew: -0.211 Prob(JB): 0.329
Kurtosis: 2.404 Cond. No. 49.7
==============================================================================
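The question asked about WLS rather than OLS. sm.WLS accepts the same multi-column exog plus a weights argument, so the DataFrame approach above carries over. As a rough sketch of what weighted least squares computes, the NumPy snippet below scales each row by the square root of its weight and solves an ordinary least-squares problem. The data, the observation weights and the true parameters are all synthetic assumptions, and a column of ones is included so the model has an intercept.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
height = rng.uniform(150, 200, n)
age = rng.uniform(18, 60, n)
obs_weights = rng.uniform(0.5, 2.0, n)   # hypothetical per-observation weights

# Design matrix: intercept column, then the two regressors
X = np.column_stack([np.ones(n), height, age])
true_params = np.array([-40.0, 0.6, 0.1])
body_weight = X @ true_params            # noise-free response for the demo

# WLS is OLS on rows scaled by sqrt(weight)
sqrt_w = np.sqrt(obs_weights)
params, *_ = np.linalg.lstsq(X * sqrt_w[:, None], body_weight * sqrt_w, rcond=None)
print(params)
```

With statsmodels the equivalent call would be sm.WLS(body_weight, X, weights=obs_weights).fit().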
I use a 2nd-order polynomial to predict how height and age affect weight for a soldier. You can pick up ansur_2_m.csv on my GitHub.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf
# plain int replaces np.integer, which is an abstract type rather than a concrete dtype
df=pd.read_csv('ANSUR_2_M.csv', encoding = "ISO-8859-1", usecols=['Weightlbs','Heightin','Age'], dtype={'Weightlbs':int,'Heightin':int,'Age':int})
df=df.dropna()
df=df.reset_index(drop=True)
# squared terms for the 2nd-order polynomial
df['Heightin2']=df['Heightin']**2
df['Age2']=df['Age']**2
formula="Weightlbs ~ Heightin+Heightin2+Age+Age2"
model_ols = smf.ols(formula,data=df).fit()
minHeight=df['Heightin'].min()
maxHeight=df['Heightin'].max()
avgAge = df['Age'].median()
print(minHeight,maxHeight,avgAge)
# prediction grid for a 28-year-old
df2=pd.DataFrame()
df2['Heightin']=np.linspace(60,100,50)
df2['Heightin2']=df2['Heightin']**2
df2['Age']=28
df2['Age2']=df2['Age']**2
# prediction grid for a 45-year-old
df3=pd.DataFrame()
df3['Heightin']=np.linspace(60,100,50)
df3['Heightin2']=df3['Heightin']**2
df3['Age']=45
df3['Age2']=df3['Age']**2
prediction28=model_ols.predict(df2)
prediction45=model_ols.predict(df3)
plt.clf()
plt.plot(df2['Heightin'],prediction28,label="Age 28")
plt.plot(df3['Heightin'],prediction45,label="Age 45")
plt.ylabel("Weight lbs")
plt.xlabel("Height in")
plt.legend()
plt.show()
print('A 45-year-old soldier is likely to weigh more than a 28-year-old soldier')
I am trying to run a regression on some data from a DataFrame, but I keep getting this weird shape error. Any idea what is wrong?
import pandas as pd
import io
import requests
import statsmodels.api as sm
# Read in a dataset
url="https://raw.githubusercontent.com/jldbc/coffee-quality-database/master/data/arabica_data_cleaned.csv"
s=requests.get(url).content
df=pd.read_csv(io.StringIO(s.decode('utf-8')))
# Select feature columns
X = df[['Body', 'Clean.Cup']]
# Select dv column
y = df['Cupper.Points']
# make model
mod = sm.OLS(X, y).fit()
I get this error:
shapes (1311,2) and (1311,2) not aligned: 2 (dim 1) != 1311 (dim 0)
You have your X and y terms in the wrong order in your sm.OLS command:
import pandas as pd
import io
import requests
import statsmodels.api as sm
# Read in a dataset
url="https://raw.githubusercontent.com/jldbc/coffee-quality-database/master/data/arabica_data_cleaned.csv"
s=requests.get(url).content
df=pd.read_csv(io.StringIO(s.decode('utf-8')))
# Select feature columns
X = df[['Body', 'Clean.Cup']]
# Select dv column
y = df['Cupper.Points']
# make model
mod = sm.OLS(y, X).fit()
mod.summary()
runs and returns
<class 'statsmodels.iolib.summary.Summary'>
"""
OLS Regression Results
==============================================================================
Dep. Variable: Cupper.Points R-squared: 0.998
Model: OLS Adj. R-squared: 0.998
Method: Least Squares F-statistic: 3.145e+05
Date: Sat, 06 Jul 2019 Prob (F-statistic): 0.00
Time: 19:42:59 Log-Likelihood: -454.94
No. Observations: 1311 AIC: 913.9
Df Residuals: 1309 BIC: 924.2
Df Model: 2
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Body 0.8464 0.016 53.188 0.000 0.815 0.878
Clean.Cup 0.1154 0.012 9.502 0.000 0.092 0.139
==============================================================================
Omnibus: 537.879 Durbin-Watson: 1.710
Prob(Omnibus): 0.000 Jarque-Bera (JB): 30220.027
Skew: 1.094 Prob(JB): 0.00
Kurtosis: 26.419 Cond. No. 26.2
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
"""
The order of y and X is wrong.
sm.OLS(y,X)
I am running the following source code:
import statsmodels.formula.api as sm
# Add one column of ones for the intercept term
X = np.append(arr= np.ones((50, 1)).astype(int), values=X, axis=1)
regressor_OLS = sm.OLS(endog=y, exog=X).fit()
print(regressor_OLS.summary())
where
X is a 50x5 (before adding the intercept term) NumPy array which looks like this:
[[0 1 165349.20 136897.80 471784.10]
[0 0 162597.70 151377.59 443898.53]...]
and y is a 50x1 NumPy array with float values for the dependent variable.
The first two columns encode a dummy variable with three different values. The remaining columns are three different independent variables.
Although it is said that statsmodels.formula.api.OLS automatically adds an intercept term (see stellacia's answer here: OLS using statsmodel.formula.api versus statsmodel.api), its summary does not show the statistics for the intercept term, as is evident below in my case:
OLS Regression Results
==============================================================================
Dep. Variable: Profit R-squared: 0.988
Model: OLS Adj. R-squared: 0.986
Method: Least Squares F-statistic: 727.1
Date: Sun, 01 Jul 2018 Prob (F-statistic): 7.87e-42
Time: 21:40:23 Log-Likelihood: -545.15
No. Observations: 50 AIC: 1100.
Df Residuals: 45 BIC: 1110.
Df Model: 5
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
x1 3464.4536 4905.406 0.706 0.484 -6415.541 1.33e+04
x2 5067.8937 4668.238 1.086 0.283 -4334.419 1.45e+04
x3 0.7182 0.066 10.916 0.000 0.586 0.851
x4 0.3113 0.035 8.885 0.000 0.241 0.382
x5 0.0786 0.023 3.429 0.001 0.032 0.125
==============================================================================
Omnibus: 1.355 Durbin-Watson: 1.288
Prob(Omnibus): 0.508 Jarque-Bera (JB): 1.241
Skew: -0.237 Prob(JB): 0.538
Kurtosis: 2.391 Cond. No. 8.28e+05
==============================================================================
For this reason, I added to my source code the line:
X = np.append(arr= np.ones((50, 1)).astype(int), values=X, axis=1)
as you can see at the beginning of my post, and the statistics for the intercept/constant are then shown as below:
OLS Regression Results
==============================================================================
Dep. Variable: Profit R-squared: 0.951
Model: OLS Adj. R-squared: 0.945
Method: Least Squares F-statistic: 169.9
Date: Sun, 01 Jul 2018 Prob (F-statistic): 1.34e-27
Time: 20:25:21 Log-Likelihood: -525.38
No. Observations: 50 AIC: 1063.
Df Residuals: 44 BIC: 1074.
Df Model: 5
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 5.013e+04 6884.820 7.281 0.000 3.62e+04 6.4e+04
x1 198.7888 3371.007 0.059 0.953 -6595.030 6992.607
x2 -41.8870 3256.039 -0.013 0.990 -6604.003 6520.229
x3 0.8060 0.046 17.369 0.000 0.712 0.900
x4 -0.0270 0.052 -0.517 0.608 -0.132 0.078
x5 0.0270 0.017 1.574 0.123 -0.008 0.062
==============================================================================
Omnibus: 14.782 Durbin-Watson: 1.283
Prob(Omnibus): 0.001 Jarque-Bera (JB): 21.266
Skew: -0.948 Prob(JB): 2.41e-05
Kurtosis: 5.572 Cond. No. 1.45e+06
==============================================================================
Why are the statistics for the intercept not shown when I do not add an intercept term myself, even though it is said that statsmodels.formula.api.OLS adds one automatically?
"No constant is added by the model unless you are using formulas."
Therefore, try something like the example below. Variable names should be defined according to your data set.
Use
regressor_OLS = smf.ols(formula='Y_variable ~ X_variable', data=df).fit()
instead of,
regressor_OLS = sm.OLS(endog=y, exog=X).fit()
You can also use this:
X = sm.add_constant(X)
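To illustrate what the constant does, here is a small NumPy-only sketch on synthetic data (the values and true coefficients are made up). sm.add_constant(X) prepends a column of ones by default, which is what the manual np.column_stack below mimics. Without that column the fit is forced through the origin and no intercept row can appear in the summary.

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(50, 2))    # two synthetic regressors
y = 5.0 + X @ np.array([1.5, -0.7])     # true intercept is 5.0

# Without a ones column the fit is forced through the origin
no_const, *_ = np.linalg.lstsq(X, y, rcond=None)

# Prepending a column of ones lets the first coefficient act as the intercept
X_const = np.column_stack([np.ones(len(X)), X])
with_const, *_ = np.linalg.lstsq(X_const, y, rcond=None)

print(no_const)     # two slopes, no intercept
print(with_const)   # intercept followed by the two slopes
```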
In Python I am trying to plot the effect of a linear model:
data = pd.read_excel(input_filename)
data.sexe = data.sexe.map({1:'m', 2:'f'})
data.diag = data.diag.map({1:'asd', 4:'hc'})
data.site = data.site.map({ 10:'USS', 20:'UYU', 30:'CAM', 40:'MAM', 2:'Cre'})
lm_full = sm.formula.ols(formula='L_bankssts_thickavg ~ diag + age + sexe + site', data=data).fit()
I used a linear model, which works well :
print(lm_full.summary())
Gives :
OLS Regression Results
===============================================================================
Dep. Variable: L_bankssts_thickavg R-squared: 0.156
Model: OLS Adj. R-squared: 0.131
Method: Least Squares F-statistic: 6.354
Date: Tue, 13 Dec 2016 Prob (F-statistic): 7.30e-07
Time: 15:40:28 Log-Likelihood: 98.227
No. Observations: 249 AIC: -180.5
Df Residuals: 241 BIC: -152.3
Df Model: 7
Covariance Type: nonrobust
===================================================================================
coef std err t P>|t| [95.0% Conf. Int.]
-----------------------------------------------------------------------------------
Intercept 2.8392 0.055 51.284 0.000 2.730 2.948
diag[T.hc] -0.0567 0.021 -2.650 0.009 -0.099 -0.015
sexe[T.m] -0.0435 0.029 -1.476 0.141 -0.102 0.015
site[T.Cre] -0.0069 0.036 -0.189 0.850 -0.078 0.065
site[T.MAM] -0.0635 0.040 -1.593 0.112 -0.142 0.015
site[T.UYU] -0.0948 0.038 -2.497 0.013 -0.170 -0.020
site[T.USS] 0.0145 0.037 0.396 0.692 -0.058 0.086
age -0.0059 0.001 -4.209 0.000 -0.009 -0.003
==============================================================================
Omnibus: 0.698 Durbin-Watson: 2.042
Prob(Omnibus): 0.705 Jarque-Bera (JB): 0.432
Skew: -0.053 Prob(JB): 0.806
Kurtosis: 3.175 Cond. No. 196.
==============================================================================
I would now like to plot the effect of, for example, the "diag" variable:
As it appears in my model, the diagnosis has an effect on the dependent variable, and I would like to plot this effect. I want a graphical representation of the two possible values of diag (i.e. 'asd' and 'hc') showing which group has the lowest value (i.e. a graphical representation of a contrast).
I would like something similar to the allEffect library in R.
Do you think there are similar functions in Python?
The best way to plot this effect is a CCPR plot with matplotlib.
# Component-Component plus Residual (CCPR) Plots (= partial residual plot)
fig, ax = plt.subplots(figsize=(5, 5))
# the term name must match a row of the summary, here 'diag[T.hc]'
fig = sm.graphics.plot_ccpr(lm_full, 'diag[T.hc]', ax=ax)
plt.show()
Which gives a partial residual plot for the diag contrast.