Multi Variable Regression statsmodels.api - python

I've looked through the documentation and still can't figure this out. I want to run a WLS with multiple independent variables (regressors).
statsmodels.api is imported as sm
Example with a single variable:
X = Height
Y = Weight
res = sm.OLS(Y, X).fit()
res.summary()
Say I also have:
X2 = Age
How do I add this into my regression?

You can put them into a DataFrame and refer to the columns by name (this way the output looks nicer too):
import statsmodels.api as sm
import pandas as pd
import numpy as np
Height = np.random.uniform(0,1,100)
Weight = np.random.uniform(0,1,100)
Age = np.random.uniform(0,30,100)
df = pd.DataFrame({'Height':Height,'Weight':Weight,'Age':Age})
res = sm.OLS(df['Height'],df[['Weight','Age']]).fit()
In [10]: res.summary()
Out[10]:
<class 'statsmodels.iolib.summary.Summary'>
"""
OLS Regression Results
=======================================================================================
Dep. Variable: Height R-squared (uncentered): 0.700
Model: OLS Adj. R-squared (uncentered): 0.694
Method: Least Squares F-statistic: 114.3
Date: Mon, 24 Aug 2020 Prob (F-statistic): 2.43e-26
Time: 15:54:30 Log-Likelihood: -28.374
No. Observations: 100 AIC: 60.75
Df Residuals: 98 BIC: 65.96
Df Model: 2
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Weight 0.1787 0.090 1.988 0.050 0.000 0.357
Age 0.0229 0.003 8.235 0.000 0.017 0.028
==============================================================================
Omnibus: 2.938 Durbin-Watson: 1.813
Prob(Omnibus): 0.230 Jarque-Bera (JB): 2.223
Skew: -0.211 Prob(JB): 0.329
Kurtosis: 2.404 Cond. No. 49.7
==============================================================================
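Since the question mentions WLS, the same pattern carries over; you just pass a weights array. A minimal sketch with placeholder uniform weights (swap in your own) and a constant added so the model has an intercept:
import numpy as np
import pandas as pd
import statsmodels.api as sm

df = pd.DataFrame({'Height': np.random.uniform(0, 1, 100),
                   'Weight': np.random.uniform(0, 1, 100),
                   'Age': np.random.uniform(0, 30, 100)})
X = sm.add_constant(df[['Weight', 'Age']])   # without a constant, R-squared is reported as uncentered
w = np.ones(len(df))                         # placeholder weights, replace with your own
res_wls = sm.WLS(df['Height'], X, weights=w).fit()
print(res_wls.summary())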

I use a 2nd-order polynomial to predict how height and age affect a soldier's weight. You can pick up ansur_2_m.csv on my GitHub.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf

df = pd.read_csv('ANSUR_2_M.csv', encoding="ISO-8859-1", usecols=['Weightlbs', 'Heightin', 'Age'],
                 dtype={'Weightlbs': int, 'Heightin': int, 'Age': int})
df = df.dropna()
df = df.reset_index(drop=True)
df['Heightin2'] = df['Heightin']**2  # squared terms for the 2nd-order polynomial
df['Age2'] = df['Age']**2
formula = "Weightlbs ~ Heightin + Heightin2 + Age + Age2"
model_ols = smf.ols(formula, data=df).fit()
minHeight=df['Heightin'].min()
maxHeight=df['Heightin'].max()
avgAge = df['Age'].median()
print(minHeight,maxHeight,avgAge)
df2 = pd.DataFrame()
df2['Heightin'] = np.linspace(60, 100, 50)
df2['Heightin2'] = df2['Heightin']**2
df2['Age'] = 28
df2['Age2'] = df2['Age']**2
df3 = pd.DataFrame()
df3['Heightin'] = np.linspace(60, 100, 50)
df3['Heightin2'] = df3['Heightin']**2
df3['Age'] = 45
df3['Age2'] = df3['Age']**2
prediction28=model_ols.predict(df2)
prediction45=model_ols.predict(df3)
plt.clf()
plt.plot(df2['Heightin'], prediction28, label="Age 28")
plt.plot(df3['Heightin'], prediction45, label="Age 45")
plt.ylabel("Weight lbs")   # ylabel/xlabel are functions, not attributes
plt.xlabel("Height in")
plt.legend()
plt.show()
print('A 45-year-old soldier is likely to weigh more than a 28-year-old soldier')
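Incidentally, the squared columns can also be written inline with I() in the formula, so the extra Heightin2/Age2 columns aren't needed. A sketch on the same df:
import statsmodels.formula.api as smf

# I(...) evaluates the expression in Python, so the squares are computed inside the formula
formula = "Weightlbs ~ Heightin + I(Heightin**2) + Age + I(Age**2)"
model_ols = smf.ols(formula, data=df).fit()
print(model_ols.summary())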

Related

Correct multiple polynomial regression formula with OLS in Python

I need to perform multiple polynomial regression and obtain statistics: p-values, AIC, etc.
As far as I understand, I can do that with OLS; however, I only found a way to write a formula with one independent variable, like this:
model = 'act_hours ~ h_hours + I(h_hours**2)'
hours_model = smf.ols(formula = model, data = df)
I tried to define a formula using two independent variables, but I can't tell whether this is the correct way and whether the results are reasonable. The line I doubt is model = 'Height ~ Diamet + I(Diamet**2) + area + I(area**2)'. The full code is this:
import pandas as pd
import statsmodels.formula.api as smf
train = pd.read_csv(r'W:\...file.csv')
model = 'Height ~ Diamet + I(Diamet**2) + area + I(area**2)'
hours_model = smf.ols(formula = model, data = train).fit()
print(hours_model.summary())
The summary of the regression is here:
OLS Regression Results
==============================================================================
Dep. Variable: Height R-squared: 0.611
Model: OLS Adj. R-squared: 0.609
Method: Least Squares F-statistic: 376.0
Date: Fri, 04 Feb 2022 Prob (F-statistic): 1.33e-194
Time: 08:50:17 Log-Likelihood: -5114.6
No. Observations: 963 AIC: 1.024e+04
Df Residuals: 958 BIC: 1.026e+04
Df Model: 4
Covariance Type: nonrobust
==================================================================================
coef std err t P>|t| [0.025 0.975]
----------------------------------------------------------------------------------
Intercept 13.9287 60.951 0.229 0.819 -105.684 133.542
Diamet 0.6027 0.340 1.770 0.077 -0.066 1.271
I(Diamet ** 2) 0.0004 0.002 0.262 0.794 -0.003 0.004
area 3.3553 5.307 0.632 0.527 -7.060 13.771
I(area** 2) 0.2519 0.108 2.324 0.020 0.039 0.465
==============================================================================
Omnibus: 60.996 Durbin-Watson: 1.889
Prob(Omnibus): 0.000 Jarque-Bera (JB): 86.039
Skew: 0.528 Prob(JB): 2.07e-19
Kurtosis: 4.015 Cond. No. 4.45e+05
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 4.45e+05. This might indicate that there are
strong multicollinearity or other numerical problems.
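One common way to shrink that condition number is to center the predictors before squaring them. A minimal sketch, assuming the same train DataFrame and column names as above:
# center each predictor so the linear and squared terms are less correlated
train['Diamet_c'] = train['Diamet'] - train['Diamet'].mean()
train['area_c'] = train['area'] - train['area'].mean()
model_c = 'Height ~ Diamet_c + I(Diamet_c**2) + area_c + I(area_c**2)'
hours_model_c = smf.ols(formula=model_c, data=train).fit()
print(hours_model_c.summary())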

I am getting a very low score on the sklearn diabetes data set using linear regression; please also guide me on how to plot a multiple linear regression

from sklearn import datasets
diabetes = datasets.load_diabetes()
diabetes.keys()
print(diabetes.feature_names)
diabetes.DESCR
diabetes.data
diabetes.target
import pandas as pd
file =pd.DataFrame(data=diabetes.data,columns=diabetes.feature_names)
file['DiseaseProgression']=diabetes.target
_______________________________________________________________________________________________________
import statsmodels.formula.api as ms
md=ms.ols(formula="DiseaseProgression~sex+bmi+bp+s1+s2+s3+s4",data=file)
reg=md.fit()  # fit the model to the data and estimate the coefficients
reg.summary()
The output is:
OLS Regression Results
Dep. Variable: DiseaseProgression R-squared: 0.494
Model: OLS Adj. R-squared: 0.486
Method: Least Squares F-statistic: 60.53
Date: Sat, 09 Nov 2019 Prob (F-statistic): 2.32e-60
Time: 00:51:16 Log-Likelihood: -2396.6
No. Observations: 442 AIC: 4809.
Df Residuals: 434 BIC: 4842.
Df Model: 7
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 152.1335 2.629 57.860 0.000 146.966 157.301
sex -233.5603 61.951 -3.770 0.000 -355.323 -111.798
bmi 576.7016 66.290 8.700 0.000 446.412 706.991
bp 360.0567 64.135 5.614 0.000 234.003 486.110
s1 866.2823 194.446 4.455 0.000 484.109 1248.455
s2 -875.1899 155.944 -5.612 0.000 -1181.690 -568.690
s3 -575.3824 151.388 -3.801 0.000 -872.927 -277.838
s4 125.8840 163.789 0.769 0.443 -196.034 447.802
Omnibus: 2.993 Durbin-Watson: 1.984
Prob(Omnibus): 0.224 Jarque-Bera (JB): 2.595
Skew: 0.094 Prob(JB): 0.273
Kurtosis: 2.676 Cond. No. 108.
Using sklearn LinearRegression:
from sklearn.linear_model import LinearRegression
file_inp = file.drop(columns=['DiseaseProgression'])  # assumed: all feature columns
file_op = file['DiseaseProgression']                  # assumed: target column
lr = LinearRegression()
lr.fit(file_inp, file_op).score(file_inp, file_op)
Output: 0.5177494254132934
And also, how do I plot this?
Again: I just want to know whether what I am doing is right or wrong, and if it is wrong, what I should do to get a better result. Please also tell me how to draw a multiple linear regression graph and a comparison graph between the predicted and actual output (a minimal sketch of such a comparison plot follows below).
Thanks in advance
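For the predicted-versus-actual comparison, a minimal sketch, assuming the fitted statsmodels results reg and the DataFrame file from above:
import matplotlib.pyplot as plt

pred = reg.predict(file)                       # predictions from the fitted formula model
actual = file['DiseaseProgression']
plt.scatter(actual, pred, alpha=0.5)
plt.plot([actual.min(), actual.max()],
         [actual.min(), actual.max()], color='red')   # 45-degree reference line
plt.xlabel('Actual disease progression')
plt.ylabel('Predicted disease progression')
plt.show()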

Pandas not working with linear regression

I am trying to run a regression on some data from a dataframe, but I keep getting this weird shape error. Any idea what is wrong?
import pandas as pd
import io
import requests
import statsmodels.api as sm
# Read in a dataset
url="https://raw.githubusercontent.com/jldbc/coffee-quality-database/master/data/arabica_data_cleaned.csv"
s=requests.get(url).content
df=pd.read_csv(io.StringIO(s.decode('utf-8')))
# Select feature columns
X = df[['Body', 'Clean.Cup']]
# Select dv column
y = df['Cupper.Points']
# make model
mod = sm.OLS(X, y).fit()
I get this error:
shapes (1311,2) and (1311,2) not aligned: 2 (dim 1) != 1311 (dim 0)
You have your X and y terms in the wrong order in your sm.OLS command:
import pandas as pd
import io
import requests
import statsmodels.api as sm
# Read in a dataset
url="https://raw.githubusercontent.com/jldbc/coffee-quality-database/master/data/arabica_data_cleaned.csv"
s=requests.get(url).content
df=pd.read_csv(io.StringIO(s.decode('utf-8')))
# Select feature columns
X = df[['Body', 'Clean.Cup']]
# Select dv column
y = df['Cupper.Points']
# make model
mod = sm.OLS(y, X).fit()
mod.summary()
runs and returns
<class 'statsmodels.iolib.summary.Summary'>
"""
OLS Regression Results
==============================================================================
Dep. Variable: Cupper.Points R-squared: 0.998
Model: OLS Adj. R-squared: 0.998
Method: Least Squares F-statistic: 3.145e+05
Date: Sat, 06 Jul 2019 Prob (F-statistic): 0.00
Time: 19:42:59 Log-Likelihood: -454.94
No. Observations: 1311 AIC: 913.9
Df Residuals: 1309 BIC: 924.2
Df Model: 2
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Body 0.8464 0.016 53.188 0.000 0.815 0.878
Clean.Cup 0.1154 0.012 9.502 0.000 0.092 0.139
==============================================================================
Omnibus: 537.879 Durbin-Watson: 1.710
Prob(Omnibus): 0.000 Jarque-Bera (JB): 30220.027
Skew: 1.094 Prob(JB): 0.00
Kurtosis: 26.419 Cond. No. 26.2
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
"""
The order of y and X is wrong; it should be:
sm.OLS(y, X)
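Note that this model has no intercept, so statsmodels reports uncentered R-squared, which tends to look very high. If you want an intercept, a minimal sketch using sm.add_constant on the same X and y:
import statsmodels.api as sm

X_const = sm.add_constant(X)          # prepends a 'const' column of ones
mod_const = sm.OLS(y, X_const).fit()
print(mod_const.summary())            # the summary now includes an intercept row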

Statsmodels.formula.api OLS does not show statistical values of intercept

I am running the following source code:
import statsmodels.formula.api as sm
# Add one column of ones for the intercept term
X = np.append(arr= np.ones((50, 1)).astype(int), values=X, axis=1)
regressor_OLS = sm.OLS(endog=y, exog=X).fit()
print(regressor_OLS.summary())
where
X is a 50x5 numpy array (before adding the intercept term) which looks like this:
[[0 1 165349.20 136897.80 471784.10]
[0 0 162597.70 151377.59 443898.53]...]
and y is a 50x1 numpy array with float values for the dependent variable.
The first two columns are for a dummy variable with three different values. The remaining columns are three different independent variables.
Although it is said that statsmodels.formula.api ols automatically adds an intercept term (see stellacia's answer here: OLS using statsmodel.formula.api versus statsmodel.api), its summary does not show the statistical values of the intercept term, as is evident below in my case:
OLS Regression Results
==============================================================================
Dep. Variable: Profit R-squared: 0.988
Model: OLS Adj. R-squared: 0.986
Method: Least Squares F-statistic: 727.1
Date: Sun, 01 Jul 2018 Prob (F-statistic): 7.87e-42
Time: 21:40:23 Log-Likelihood: -545.15
No. Observations: 50 AIC: 1100.
Df Residuals: 45 BIC: 1110.
Df Model: 5
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
x1 3464.4536 4905.406 0.706 0.484 -6415.541 1.33e+04
x2 5067.8937 4668.238 1.086 0.283 -4334.419 1.45e+04
x3 0.7182 0.066 10.916 0.000 0.586 0.851
x4 0.3113 0.035 8.885 0.000 0.241 0.382
x5 0.0786 0.023 3.429 0.001 0.032 0.125
==============================================================================
Omnibus: 1.355 Durbin-Watson: 1.288
Prob(Omnibus): 0.508 Jarque-Bera (JB): 1.241
Skew: -0.237 Prob(JB): 0.538
Kurtosis: 2.391 Cond. No. 8.28e+05
==============================================================================
For this reason, I added the following line to my source code:
X = np.append(arr= np.ones((50, 1)).astype(int), values=X, axis=1)
as you can see at the beginning of my post, and the statistical values of the intercept/constant are then shown as below:
OLS Regression Results
==============================================================================
Dep. Variable: Profit R-squared: 0.951
Model: OLS Adj. R-squared: 0.945
Method: Least Squares F-statistic: 169.9
Date: Sun, 01 Jul 2018 Prob (F-statistic): 1.34e-27
Time: 20:25:21 Log-Likelihood: -525.38
No. Observations: 50 AIC: 1063.
Df Residuals: 44 BIC: 1074.
Df Model: 5
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 5.013e+04 6884.820 7.281 0.000 3.62e+04 6.4e+04
x1 198.7888 3371.007 0.059 0.953 -6595.030 6992.607
x2 -41.8870 3256.039 -0.013 0.990 -6604.003 6520.229
x3 0.8060 0.046 17.369 0.000 0.712 0.900
x4 -0.0270 0.052 -0.517 0.608 -0.132 0.078
x5 0.0270 0.017 1.574 0.123 -0.008 0.062
==============================================================================
Omnibus: 14.782 Durbin-Watson: 1.283
Prob(Omnibus): 0.001 Jarque-Bera (JB): 21.266
Skew: -0.948 Prob(JB): 2.41e-05
Kurtosis: 5.572 Cond. No. 1.45e+06
==============================================================================
Why are the statistical values of the intercept not shown when I do not add an intercept term myself, even though statsmodels.formula.api ols is said to add it automatically?
"No constant is added by the model unless you are using formulas."
Therefore, try something like the example below (with smf standing for statsmodels.formula.api). Variable names should be set according to your data set.
Use,
regressor_OLS = smf.ols(formula='Y_variable ~ X_variable', data=df).fit()
instead of,
regressor_OLS = sm.OLS(endog=y, exog=X).fit()
Alternatively, you can use:
X = sm.add_constant(X)
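Putting that together with the array interface, a minimal sketch, assuming X and y are the arrays from the question before the manual column of ones is appended:
import statsmodels.api as sm

X_const = sm.add_constant(X)                        # prepends a 'const' column of ones
regressor_OLS = sm.OLS(endog=y, exog=X_const).fit()
print(regressor_OLS.summary())                      # the summary now includes a 'const' row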

How to use ols with groupby?

The following code is from "Python for Data Analysis", chapter 11, group transforms and analysis. The version of each library is shown below.
# -*- coding: utf-8 -*-
""" Created on Sun Jun 4 13:33:47 2017
"Python for Data Analysis",chp 11,group transforms and analysis.
"""
import numpy as np # np.__version__'1.12.1'
import pandas as pd # pd.__version__ '0.20.2'
import random; random.seed(a=0,version=2)
import statsmodels.api as sm # statsmodels.__version__ '0.8.0'
import string
# generate tickers from random
N=1000
def rands(n):
    choices = string.ascii_uppercase
    return ''.join([random.choice(choices) for _ in range(n)])
tickers=np.array([rands(5) for _ in range(N)])
# generate data for tickers
M=500
df = pd.DataFrame({'Momentum': np.random.randn(M)/200 + 0.03,
                   'Value': np.random.randn(M)/200 + 0.08,
                   'ShortInterest': np.random.randn(M)/200 - 0.02},
                  index=tickers[:M])
# create industry
ind_names=np.array(['Financial','Tech'])
sampler=np.random.randint(low=0,high=len(ind_names),size=N, dtype='l')
industries = pd.Series(ind_names[sampler], index=tickers, name='industry')
#%% factor analysis
fac1,fac2,fac3=np.random.rand(3,1000)
ticker_subset=tickers.take(np.random.permutation(N)[:1000])
port = pd.Series(0.7*fac1 - 1.2*fac2 + 0.3*fac3 + np.random.rand(1000),
                 index=ticker_subset)
factors = pd.DataFrame({'f1': fac1, 'f2': fac2, 'f3': fac3},
                       index=ticker_subset)
by_ind=port.groupby(industries)
This part is from the book, but pd.ols has since been deprecated.
#%% use pd.ols, which is deprecated.
# AttributeError: module 'pandas' has no attribute 'ols'
def beta_exposure(chuck, factors=None):
    return pd.ols(y=chuck, x=factors).beta
exposures_pd=by_ind.apply(beta_exposure,factors=factors)
print('\nexposures_pd\n',exposures_pd.unstack())
I would like to use sm.OLS instead, but I'm having trouble selecting the corresponding rows for x. How should I deal with it?
#%% use sm.OLS, which is not show in the book.
def exposure(chuck, factors):
    y = np.array(chuck).reshape(len(chuck), 1)
    # The following line is wrong: these are not the rows that correspond to y.
    # I use [:len(chuck)] just to keep x the same number of rows as y.
    x = factors[['f1', 'f2', 'f3']][:len(chuck)]
    print(x[:5])
    print(x.shape)
    sx = sm.OLS(y, x).fit()
    print(sx.summary())
    return sm.OLS(y, x).fit()
exposures_sm=exposure(port, factors)
After several tries, I think I can do it by combining the Series and the DataFrame.
factors_data = factors.copy()
factors_data['port'] = port
def group_ols(fts):
    results = []
    for ind, ft in fts:
        y = ft.loc[:, 'port']
        x = ft.loc[:, ['f1', 'f2', 'f3']]
        result = sm.OLS(y, x).fit()
        results.append((ind, result.summary()))
    return results
exposures_sm=group_ols(factors_data.groupby(industries))
exposures_sm
The result looks like this:
[('Financial', <class 'statsmodels.iolib.summary.Summary'>
"""
OLS Regression Results
==============================================================================
Dep. Variable: port R-squared: 0.746
Model: OLS Adj. R-squared: 0.744
Method: Least Squares F-statistic: 482.4
Date: Thu, 29 Jun 2017 Prob (F-statistic): 2.37e-146
Time: 17:13:34 Log-Likelihood: -134.55
No. Observations: 497 AIC: 275.1
Df Residuals: 494 BIC: 287.7
Df Model: 3
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
f1 1.0231 0.043 23.894 0.000 0.939 1.107
f2 -0.9639 0.042 -23.146 0.000 -1.046 -0.882
f3 0.6397 0.042 15.391 0.000 0.558 0.721
==============================================================================
Omnibus: 34.466 Durbin-Watson: 1.916
Prob(Omnibus): 0.000 Jarque-Bera (JB): 12.724
Skew: -0.063 Prob(JB): 0.00173
Kurtosis: 2.226 Cond. No. 3.24
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
"""), ('Tech', <class 'statsmodels.iolib.summary.Summary'>
"""
OLS Regression Results
==============================================================================
Dep. Variable: port R-squared: 0.738
Model: OLS Adj. R-squared: 0.736
Method: Least Squares F-statistic: 468.9
Date: Thu, 29 Jun 2017 Prob (F-statistic): 7.30e-145
Time: 17:13:34 Log-Likelihood: -172.76
No. Observations: 503 AIC: 351.5
Df Residuals: 500 BIC: 364.2
Df Model: 3
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
f1 1.0530 0.045 23.525 0.000 0.965 1.141
f2 -0.8811 0.045 -19.742 0.000 -0.969 -0.793
f3 0.5762 0.046 12.538 0.000 0.486 0.667
==============================================================================
Omnibus: 45.191 Durbin-Watson: 2.013
Prob(Omnibus): 0.000 Jarque-Bera (JB): 15.547
Skew: -0.123 Prob(JB): 0.000421
Kurtosis: 2.175 Cond. No. 3.29
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
""")]
I would suggest leveraging the apply function, as it is more scalable if you happen to deal with large data.
First, build a function:
from sklearn.linear_model import LinearRegression

def reg_function(x, y):
    # note: the original used normalize=True, which has been removed in recent scikit-learn versions
    regr = LinearRegression(fit_intercept=True)
    regr.fit(y, x)   # per the call below, y holds the feature matrix and x the target
    return regr.coef_
You can use the statsmodels library here instead, if you prefer.
df_ur.groupby([key_col]).apply(lambda x: reg_function(x[col_y].values.reshape(-1, 1), x[col_x].values.reshape(-1, 1)))
key_col is the column you would like to group by; col_x and col_y are your dependent and independent variables.
Here, because I am doing univariate regression, I have to use reshape; otherwise the values won't be treated as an n x 1 matrix.
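If you prefer to stay with statsmodels, a minimal sketch of the same group-wise fit, assuming the factors_data frame built above with the port column added:
import statsmodels.api as sm

def beta_exposure_sm(group):
    # regress each group's portfolio values on its factor columns
    y = group['port']
    X = group[['f1', 'f2', 'f3']]
    return sm.OLS(y, X).fit().params

exposures_sm = factors_data.groupby(industries).apply(beta_exposure_sm)
print(exposures_sm)   # one row of f1/f2/f3 betas per industry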
