I was asked to write a program for Linear Regression with the following steps.
Load the R data set mtcars as a pandas dataframe.
Build another linear regression model by considering the log of independent variable wt, and log of dependent variable mpg.
Fit the model with the data, and display the R-squared value.
I am a beginner at statistics with Python.
I tried taking the log values without first converting to a new DataFrame, but that raised "TypeError: 'OLS' object is not subscriptable".
import statsmodels.api as sa
import statsmodels.formula.api as sfa
import pandas as pd
import numpy as np
cars = sa.datasets.get_rdataset("mtcars")
cars_data = cars.data
lin_mod1 = sfa.ols("wt~mpg",cars_data)
lin_mod2 = pd.DataFrame(lin_mod1)
lin_mod2['wt'] = np.log(lin_mod2['wt'])
lin_mod2['mpg'] = np.log(lin_mod2['mpg'])
lin_res1 = lin_mod2.fit()
print(lin_res1.summary())
The expected result is the regression summary table, but the actual output is this error:
ValueError: DataFrame constructor not properly called!
This might work for you.
import statsmodels.api as sm
import numpy as np
mtcars = sm.datasets.get_rdataset('mtcars')
mtcars_data = mtcars.data
liner_model = sm.formula.ols('np.log(wt) ~ np.log(mpg)',mtcars_data)
liner_result = liner_model.fit()
print(liner_result.rsquared)
I broke your code down and ran it line by line.
The problem is here:
lin_mod1 = sfa.ols("wt~mpg",cars_data)
If you try to print it, the output is:
<statsmodels.regression.linear_model.OLS object at 0x7f1c64273eb8>
That object can't be interpreted as data to build a DataFrame from.
The solution is to get the results of the first linear model into a summary table, and then finally put that into a DataFrame:
results = lin_mod1.fit()
results_summary = results.summary()
If you print results_summary you will see that the variables are Intercept and mpg.
I don't know if this is a conceptual error or something else, since it's not the pair "wt" and "mpg".
# summary as an HTML table
results_as_html = results_summary.tables[1].as_html()
# dataframe from the html table
lin_mod2 = pd.read_html(results_as_html, header=0, index_col=0)[0]
The print of lin_mod2 is:
coef std err t P>|t| [0.025 0.975]
Intercept 6.0473 0.309 19.590 0.0 5.417 6.678
mpg -0.1409 0.015 -9.559 0.0 -0.171 -0.111
Here is the solution:
import statsmodels.api as sm
import statsmodels.formula.api as smf
import numpy as np
cars=sm.datasets.get_rdataset("mtcars")
cars_data=cars.data
lin_mod1=smf.ols('np.log(wt)~np.log(mpg)',cars_data)
lin_model_fit=lin_mod1.fit()
print(lin_model_fit.summary())
Change:
lin_mod2 = pd.DataFrame(lin_mod1)
To:
lin_mod2 = pd.DataFrame(data = lin_mod1)
Related
Is there a way to do multivariate Breusch-Godfrey Lagrange multiplier tests for residual serial correlation in vector autoregressions (VAR) using statsmodels? I would like to get the same output as EViews in View > Residual Tests > Autocorrelation LM Test.
I have tried using acorr_breusch_godfrey from statsmodels but it doesn't seem to give me any output. Am I misparameterizing it? Or do I need to loop through the variables somehow?
Below is an example using OLS (which works) and a second one using VAR (which doesn't).
import pandas as pd
import numpy as np
import statsmodels.api as sm
from statsmodels.tsa.api import VAR
from statsmodels.stats.diagnostic import acorr_breusch_godfrey
data = pd.read_csv('http://web.pdx.edu/~crkl/ceR/data/cjx.txt', sep=r'\s+', index_col='YEAR', nrows=39)
X = np.log(data.X)
L = np.log(data.L1)
K = np.log(data.K1)
df = pd.DataFrame({'X': X, 'L': L, 'K': K})
# OLS Regression
model_ols = sm.OLS.from_formula('X~L+K', df).fit()
# print(model_ols.summary())
print(sm.stats.diagnostic.acorr_breusch_godfrey(model_ols))
# Vector Auto Regression
model_var = VAR(endog=df[['L','K']],exog=df['X']).fit(maxlags=2)
# print(model_var.summary())
sm.stats.diagnostic.acorr_breusch_godfrey(model_var,nlags=15)
For the last one I've also tried the below to no avail:
sm.stats.diagnostic.acorr_breusch_godfrey(model_var.resid.loc[:,0],nlags=15)
I was asked to write a program for Linear Regression with the following steps.
Load the R data set mtcars as a pandas dataframe.
Build another linear regression model by considering the log of independent variable wt, and log of dependent variable mpg.
Fit the model with the data, and display the R-squared value.
I tried the following two models and the tests are not passing. Is there an issue with my code?
#case 1
import statsmodels.api as sm
import numpy as np
mtcars = sm.datasets.get_rdataset('mtcars')
mtcars_data = mtcars.data
liner_model = sm.formula.ols('np.log(wt) ~ np.log(mpg)',mtcars_data)
liner_result = liner_model.fit()
print(liner_result.rsquared)
#case 2
import statsmodels.api as sa
import statsmodels.formula.api as sfa
import numpy as np
import pandas as pd
mtcars = sa.datasets.get_rdataset('mtcars')
cars_data = mtcars.data
lin_mod2 = pd.DataFrame(cars_data)
lin_mod2['wt'] = np.log(lin_mod2['wt'])
lin_mod2['mpg'] = np.log(lin_mod2['mpg'])
lin_mod1 = sfa.ols("wt~mpg",lin_mod2)
print(lin_mod1.fit().rsquared)
#or
import statsmodels.api as sm
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
mtcars = sm.datasets.get_rdataset('mtcars','datasets',cache=True).data
df = pd.DataFrame(mtcars)
model = smf.ols(formula='np.log(wt) ~ np.log(mpg)', data=df).fit()
r = model.rsquared
print(r)
Well, it passed. I think the question was worded misleadingly: all I had to do was print
model.summary()
not
model.rsquared
I would like to perform a simple linear regression using statsmodels. I've tried several different methods by now, but I just can't get it to work. The code I have constructed now doesn't give me any errors, but it also doesn't show me the result.
I am trying to create a model for the variable "Direction", which takes the value 0 if the return for the corresponding date was negative and 1 if it was positive. The explanatory variables are the (5) lags of the returns. The DataFrame df13 contains the lags and also the direction for each observed date. I tried the code below and, as mentioned, it doesn't give an error but only says:
Optimization terminated successfully.
Current function value: 0.682314
Iterations 5
However, I would like to see the typical table with all the beta values, their significance, etc.
Also, what would you say: since Direction is a binary variable, might it be better to use a logit instead of a linear model? In the assignment, however, it appeared as a linear model.
And lastly, I am sorry it's not displayed here correctly, but I don't know how to format it as code or insert my dataframe.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
import os
import itertools
from sklearn import preprocessing
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import statsmodels.api as sm
import matplotlib.pyplot as plt
from statsmodels.sandbox.regression.predstd import wls_prediction_std
...
X = df13[['Lag1', 'Lag2', 'Lag3', 'Lag4', 'Lag5']]
Y = df13['Direction']
X = sm.add_constant(X)
model = sm.Logit(Y.astype(float), X.astype(float)).fit()
predictions = model.predict(X)
print_model = model.summary
print(print_model)
Edit: I'm sure it has to be a logit regression so I updated that part
I don't know if this is unintentional, but it looks like you need to define X and Y separately:
X = df13[['Lag1', 'Lag2', 'Lag3', 'Lag4', 'Lag5']]
Y = df13['Direction']
Secondly, I'm not too familiar with statsmodels, but I would try converting your dataframes to NumPy arrays. You can do this with
Xnum = X.to_numpy()
Ynum = Y.to_numpy()
And try passing those to the regressors.
I'm getting acquainted with statsmodels so as to shift my more complicated stats completely over to Python. However, I'm being cautious, so I'm cross-checking my results with SPSS, just to make sure I'm not making any obvious blunders. Most of the time there's no difference, but I have one example of a two-way ANOVA that's throwing up very different test statistics in statsmodels and SPSS. (Relevant point: the sample sizes in the ANOVA are mismatched, so ANOVA may not be the appropriate model here.)
I'm selecting my model as follows:
import pandas as pd
import scipy as sp
import numpy as np
import statsmodels.api as sm
import seaborn as sns
from statsmodels.formula.api import ols
import matplotlib.pyplot as plt
Body = pd.read_csv(filepath)
Body = Body.dropna()
Body_lm = ols('Effect ~ C(Fiction) + C(Condition) + C(Fiction)*C(Condition)', data = Body).fit()
table = sm.stats.anova_lm(Body_lm, typ=2)
The Statsmodels output is as below:
sum_sq df F PR(>F)
C(Fiction) 278.176684 1.0 307.624463 1.682042e-55
C(Condition) 4.294764 1.0 4.749408 2.971278e-02
C(Fiction):C(Condition) 10.776312 1.0 11.917092 5.970123e-04
Residual 520.861599 576.0 NaN NaN
The corresponding SPSS results (posted as an image in the original question) are quite different.
Can anyone help explain the difference? Is it perhaps the unequal sample sizes being treated differently under the hood? Or am I choosing the wrong model?
Any help appreciated!
You should use sum coding when comparing the means of the variables.
By the way, you don't need to specify each variable in the interaction term if the * multiply operator is used:
“:” adds a new column to the design matrix with the product of the other two columns.
“*” will also include the individual columns that were multiplied together.
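As a quick illustration of that expansion (using hypothetical columns a and b, not your data):

```python
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({'y': [1.0, 2.0, 3.0, 4.0],
                   'a': ['x', 'x', 'z', 'z'],
                   'b': ['p', 'q', 'p', 'q']})

# '*' expands to the main effects plus the interaction
m1 = smf.ols('y ~ C(a) * C(b)', df)
m2 = smf.ols('y ~ C(a) + C(b) + C(a):C(b)', df)
print(m1.exog_names)
print(m2.exog_names)  # same design-matrix columns as m1
```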
Your model should be:
Body_lm = ols('Effect ~ C(Fiction, Sum)*C(Condition, Sum)', data = Body).fit()
I am trying to fit some models (Spatial interaction models) according to some code which is provided in R. I have been able to get some of the code to work using statsmodels in a python framework but some of them do not match at all. I believe that the code I have for R and Python should give identical results. Does anyone see any differences? Or is there some fundamental differences which might be throwing things off? The R code is the original code which matches the numbers given in a tutorial (Found here: http://www.bartlett.ucl.ac.uk/casa/pdf/paper181).
R sample Code:
library(mosaic)
Data = fetchData('http://dl.dropbox.com/u/8649795/AT_Austria.csv')
Model = glm(Data~Origin+Destination+Dij+offset(log(Offset)), family=poisson(link="log"), data = Data)
cor = cor(Data$Data, Model$fitted, method = "pearson", use = "complete")
rsquared = cor * cor
rsquared
R output:
> Model = glm(Data~Origin+Destination+Dij+offset(log(Offset)), family=poisson(link="log"), data = Data)
Warning messages:
1: glm.fit: fitted rates numerically 0 occurred
2: glm.fit: fitted rates numerically 0 occurred
> cor = cor(Data$Data, Model$fitted, method = "pearson", use = "complete")
> rsquared = cor * cor
> rsquared
[1] 0.9753279
Python Code:
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
import statsmodels.api as sm
from scipy.stats.stats import pearsonr
Data= pd.DataFrame(pd.read_csv('http://dl.dropbox.com/u/8649795/AT_Austria.csv'))
Model = smf.glm('Data~Origin+Destination+Dij', data=Data, offset=np.log(Data['Offset']), family=sm.families.Poisson(link=sm.families.links.log)).fit()
cor = pearsonr(Model.fittedvalues, Data["Data"])[0]
print("R-squared for doubly-constrained model is: " + str(cor*cor))
Python Output:
R-squared for doubly-constrained model is: 0.104758481123
It looks like GLM has convergence problems here in statsmodels. Maybe in R too, but R only gives these warnings.
Warning messages:
1: glm.fit: fitted rates numerically 0 occurred
2: glm.fit: fitted rates numerically 0 occurred
That could mean something like perfect separation in Logit/Probit context. I'd have to think about it for a Poisson model.
R is doing a better, if subtle, job of telling you that something may be wrong in your fitting. If you look at the fitted likelihood in statsmodels for instance, it's -1.12e27. That should be a clue right there that something is off.
Using the Poisson model directly (I always prefer maximum likelihood to GLM when possible), I can replicate the R results (though I get a convergence warning). Tellingly, again, the default Newton-Raphson solver fails, so I use bfgs.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
import statsmodels.api as sm
from scipy.stats.stats import pearsonr
data= pd.DataFrame(pd.read_csv('http://dl.dropbox.com/u/8649795/AT_Austria.csv'))
mod = smf.poisson('Data~Origin+Destination+Dij', data=data, offset=np.log(data['Offset'])).fit(method='bfgs')
print(mod.mle_retvals['converged'])