I am working with an autoregressive model in Python using statsmodels. The package is great and I am getting exactly the results I need. However, testing the residuals for serial correlation (the Breusch-Godfrey LM test) doesn't seem to work; I get an error message.
My code:
import pandas as pd
import datetime
import numpy as np
from statsmodels.tsa.api import VAR
import statsmodels.api as sm
df = pd.read_csv('US_data.csv')
# converting str formatted dates to datetime and setting the index
j = []
for i in df['Date']:
    j.append(datetime.datetime.strptime(i, '%Y-%m-%d').date())
df['Date'] = j
df = df.set_index('Date')
# dataframe contains three columns (GDP, INV and CONS)
# log difference
df = pd.DataFrame(np.log(df)*100)
df = df.diff()
p = 4 # order
model = VAR(df[1:])
results = model.fit(p, method='ols')
sm.stats.diagnostic.acorr_breusch_godfrey(results)
Error message:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-6-11abf518baae> in <module>()
----> 1 sm.stats.diagnostic.acorr_breusch_godfrey(results)
/home/****/anaconda3/lib/python3.6/site-packages/statsmodels/sandbox/stats/diagnostic.py in acorr_breusch_godfrey(results, nlags, store)
501 nlags = int(nlags)
502
--> 503 x = np.concatenate((np.zeros(nlags), x))
504
505 #xdiff = np.diff(x)
ValueError: all the input arrays must have same number of dimensions
A similar question was asked here over five months ago, but with no luck. Does anybody have an idea how to resolve this? Thank you very much in advance!
Those diagnostic tests were designed for univariate models like OLS where we have a one-dimensional residual array.
The most likely workaround is to apply the test to a single equation of the VAR system, or to loop over the equations (residual columns) one at a time.
VARResults in statsmodels master has a test_whiteness_new method which is a test for no autocorrelation of the multivariate residuals of a VAR.
It uses a Portmanteau test, which I think is the same as Ljung-Box.
The statespace models also use Ljung-Box for the related tests.
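Until something like that lands in a release, here is a minimal sketch of the loop-over-equations workaround, assuming results is the fitted VARResults from the question, using Ljung-Box on each residual column:
import numpy as np
from statsmodels.stats.diagnostic import acorr_ljungbox
# resid may be an ndarray or a DataFrame depending on the statsmodels version.
resid = np.asarray(results.resid)
for i, name in enumerate(results.names):
    print(name, acorr_ljungbox(resid[:, i], lags=10))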
Related
I was asked to write a program for Linear Regression with the following steps.
Load the R data set mtcars as a pandas dataframe.
Build another linear regression model by considering the log of independent variable wt, and log of dependent variable mpg.
Fit the model with data, and display the R-squared value
I am a beginner at Statistics with Python.
I have tried getting the log values without converting to a new DataFrame, but that gave the error "TypeError: 'OLS' object is not subscriptable".
import statsmodels.api as sa
import statsmodels.formula.api as sfa
import pandas as pd
import numpy as np
cars = sa.datasets.get_rdataset("mtcars")
cars_data = cars.data
lin_mod1 = sfa.ols("wt~mpg",cars_data)
lin_mod2 = pd.DataFrame(lin_mod1)
lin_mod2['wt'] = np.log(lin_mod2['wt'])
lin_mod2['mpg'] = np.log(lin_mod2['mpg'])
lin_res1 = lin_mod2.fit()
print(lin_res1.summary())
The expected result is the table after linear regression but the actual output is an error
[ValueError: DataFrame constructor not properly called!]
This might work for you.
import statsmodels.api as sm
import numpy as np
mtcars = sm.datasets.get_rdataset('mtcars')
mtcars_data = mtcars.data
liner_model = sm.formula.ols('np.log(wt) ~ np.log(mpg)',mtcars_data)
liner_result = liner_model.fit()
print(liner_result.rsquared)
I broke your code down and ran it line by line.
The problem is here:
lin_mod1 = sfa.ols("wt~mpg",cars_data)
If you try to print it, the output is:
<statsmodels.regression.linear_model.OLS object at 0x7f1c64273eb8>
And it can't be interpreted correctly to build a data frame.
The solution is to fit the first linear model, get the result summary as a table, and finally put that table into a data frame:
results = lin_mod1.fit()
results_summary = results.summary()
If you print results_summary you will see that the variables are Intercept and mpg.
I don't know if that is a conceptual error or what, since it's not the pair "wt"-"mpg".
# summary as a html table
results_as_html = results_summary.tables[1].as_html()
# dataframe from the html table
lin_mod2 = pd.read_html(results_as_html, header=0, index_col=0)[0]
The print of lin_mod2 is:
coef std err t P>|t| [0.025 0.975]
Intercept 6.0473 0.309 19.590 0.0 5.417 6.678
mpg -0.1409 0.015 -9.559 0.0 -0.171 -0.111
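From there the coefficients can be pulled out like any other DataFrame; a quick usage example with the values from the table above:
# Access a single estimate from the rebuilt coefficient table.
print(lin_mod2.loc['mpg', 'coef'])   # -0.1409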
Here is the solution:
import statsmodels.api as sm
import statsmodels.formula.api as smf
import numpy as np
cars=sm.datasets.get_rdataset("mtcars")
cars_data=cars.data
lin_mod1=smf.ols('np.log(wt)~np.log(mpg)',cars_data)
lin_model_fit=lin_mod1.fit()
print(lin_model_fit.summary())
Change:
lin_mod2 = pd.DataFrame(lin_mod1)
To:
lin_mod2 = pd.DataFrame(data = lin_mod1)
I'm trying to replicate code from R that estimates a random intercept model. The R code is:
fit=lmer(resid~-1+(1|groupid),data=df)
I'm using the lmer command of the lme4 package to estimate random intercepts for the variable resid for observations in different groups (defined by groupid). There is no 'fixed effects' part, therefore no variable before the (1|groupid). Moreover, I do not want a constant estimated so that I get an intercept for each group.
Not sure how to do similar estimation in Python. I tried something like:
import pandas as pd
import numpy as np
import statsmodels.formula.api as smf
np.random.seed(12345)
df = pd.DataFrame(np.random.randn(25, 4), columns=list('ABCD'))
df['groupid'] = [1,1,1,1,1,2,2,2,2,2,3,3,3,3,3,4,4,4,4,4,5,5,5,5,5]
df['groupid'] = df['groupid'].astype('category')
###Random intercepts models
md = smf.mixedlm('A~B-1',data=df,groups=df['groupid'])
mdf = md.fit()
print(mdf.random_effects)
A is resid from the earlier example, while groupid is the same.
1) I am not sure whether the mdf.random_effects are the random intercepts I am looking for
2) I cannot remove the variable B, which I understand is the fixed effects part. If I try:
md = smf.mixedlm('A~-1',data=df,groups=df['groupid'])
I get an error that "Arrays cannot be empty".
Just trying to estimate the exact same model as in the R code. Any advice will be appreciated.
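For what it's worth, here is a minimal sketch of the closest fit I know of. MixedLM appears to require a non-empty fixed-effects part, so one workaround is to keep the overall intercept and recover per-group intercepts as the common intercept plus each group's random effect (whether this exactly reproduces the lmer fit is an assumption, not something I have verified):
md = smf.mixedlm('A ~ 1', data=df, groups=df['groupid'])
mdf = md.fit()
# random_effects maps each group to its estimated deviation; adding the
# common intercept gives one intercept per group (Series layout assumed).
intercepts = {g: mdf.params['Intercept'] + re.iloc[0]
              for g, re in mdf.random_effects.items()}
print(intercepts)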
I'm trying to run a multiple OLS regression using statsmodels and a pandas dataframe. There are missing values in different columns for different rows, and I keep getting the error message:
ValueError: array must not contain infs or NaNs
I saw this SO question, which is similar but doesn't exactly answer my question: statsmodel.api.Logit: valueerror array must not contain infs or nans
What I would like to do is run the regression and ignore all rows where there are missing variables for the variables I am using in this regression. Right now I have:
import pandas as pd
import numpy as np
import statsmodels.formula.api as sm
df = pd.read_csv('cl_030314.csv')
results = sm.ols(formula = "da ~ cfo + rm_proxy + cpi + year", data=df).fit()
I want something like missing = "drop".
Any suggestions would be greatly appreciated. Thanks so much.
You answered your own question. Just pass
missing = 'drop'
to ols
import statsmodels.formula.api as smf
...
results = smf.ols(formula = "da ~ cfo + rm_proxy + cpi + year",
                  data=df, missing='drop').fit()
If this doesn't work then it's a bug and please report it with a MWE on github.
FYI, note the import above. Not everything is available in the formula.api namespace, so you should keep it separate from statsmodels.api. Or just use
import statsmodels.api as sm
sm.formula.ols(...)
The answer from jseabold works very well, but it may not be enough if you want to do some computation on the predicted and true values afterwards, e.g. with a function like mean_squared_error: with missing='drop' the fitted values no longer line up row-for-row with the original frame. In that case, it may be better to get rid of the NaNs explicitly first:
df = pd.read_csv('cl_030314.csv')
df_cleaned = df.dropna()
results = sm.ols(formula = "da ~ cfo + rm_proxy + cpi + year", data=df_cleaned).fit()
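Note that dropna() with no arguments drops any row with a NaN in any column; to drop only rows that are missing one of the regression variables, pandas' subset argument restricts the check (a small sketch):
# Drop rows only when one of the model variables is missing.
cols = ['da', 'cfo', 'rm_proxy', 'cpi', 'year']
df_cleaned = df.dropna(subset=cols)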
I am trying to fit some models (spatial interaction models) based on code provided in R. I have been able to reproduce some of it using statsmodels in Python, but some of the results do not match at all. I believe the R and Python code should give identical results. Does anyone see any differences, or is there some fundamental difference that might be throwing things off? The R code is the original, and it matches the numbers given in a tutorial (found here: http://www.bartlett.ucl.ac.uk/casa/pdf/paper181).
R sample Code:
library(mosaic)
Data = fetchData('http://dl.dropbox.com/u/8649795/AT_Austria.csv')
Model = glm(Data~Origin+Destination+Dij+offset(log(Offset)), family=poisson(link="log"), data = Data)
cor = cor(Data$Data, Model$fitted, method = "pearson", use = "complete")
rsquared = cor * cor
rsquared
R output:
> Model = glm(Data~Origin+Destination+Dij+offset(log(Offset)), family=poisson(link="log"), data = Data)
Warning messages:
1: glm.fit: fitted rates numerically 0 occurred
2: glm.fit: fitted rates numerically 0 occurred
> cor = cor(Data$Data, Model$fitted, method = "pearson", use = "complete")
> rsquared = cor * cor
> rsquared
[1] 0.9753279
Python Code:
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
import statsmodels.api as sm
from scipy.stats.stats import pearsonr
Data= pd.DataFrame(pd.read_csv('http://dl.dropbox.com/u/8649795/AT_Austria.csv'))
Model = smf.glm('Data~Origin+Destination+Dij', data=Data, offset=np.log(Data['Offset']), family=sm.families.Poisson(link=sm.families.links.log)).fit()
cor = pearsonr(Model.fittedvalues, Data["Data"])[0]
print "R-squared for doubly-constrained model is: " + str(cor*cor)
Python Output:
R-squared for doubly-constrained model is: 0.104758481123
It looks like GLM has convergence problems here in statsmodels. Maybe in R too, but R only gives these warnings.
Warning messages:
1: glm.fit: fitted rates numerically 0 occurred
2: glm.fit: fitted rates numerically 0 occurred
That could mean something like perfect separation in Logit/Probit context. I'd have to think about it for a Poisson model.
R is doing a better, if subtle, job of telling you that something may be wrong in your fitting. If you look at the fitted likelihood in statsmodels for instance, it's -1.12e27. That should be a clue right there that something is off.
Using Poisson model directly (I always prefer maximum likelihood to GLM when possible), I can replicate the R results (but I get a convergence warning). Tellingly, again, the default newton-raphson solver fails, so I use bfgs.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
import statsmodels.api as sm
from scipy.stats.stats import pearsonr
data= pd.DataFrame(pd.read_csv('http://dl.dropbox.com/u/8649795/AT_Austria.csv'))
mod = smf.poisson('Data~Origin+Destination+Dij', data=data, offset=np.log(data['Offset'])).fit(method='bfgs')
print mod.mle_retvals['converged']
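To compare against the R number, the question's pseudo R-squared can be recomputed from this fit. A sketch; note that for the discrete Poisson results predict() returns the expected counts (fittedvalues is the linear predictor), and I am assuming the stored offset is applied automatically:
# Recompute the Pearson-correlation pseudo R-squared from the Poisson fit.
predicted = mod.predict()  # expected counts; assumes the offset is included
cor = pearsonr(predicted, data['Data'])[0]
print "R-squared for doubly-constrained model is: " + str(cor * cor)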
I have a data set with records at 30-second intervals, and I am trying to forecast it using the ARMA function from the time series module. Due to data privacy, I have used random data to reproduce the error:
import numpy as np
from pandas import *
import statsmodels.api as sm
data = np.random.rand(100000)
data_index = date_range('2013-5-26', periods = len(data), freq='30s')
data = np.array(data)
data_series = Series(data, index = data_index)
model = sm.tsa.ARMA(data_series,(1,0)).fit()
My package versions:
Python version 2.7.3
pandas version 0.11.0
statsmodels version 0.5.0
The main error message is as follows (I omitted part of the traceback):
ValueError Traceback (most recent call last)
<ipython-input-24-0f57c74f0fc9> in <module>()
6 data = np.array(data)
7 data_series = Series(data, index = data_index)
----> 8 model = sm.tsa.ARMA(data_series,(1,0)).fit()
...........
...........
ValueError: freq 30S not understood
It seems to me that ARMA does not support the frequency generated by pandas? If I remove the freq option in date_range, the command again fails for a long series, since the default daily frequency pushes the dates well past the pandas timestamp limit.
Any way to get around this? Thanks
Update:
OK, using data_series.values works, but then how do I do the prediction? My data_index runs from [2013-05-26 00:00:00, ..., 2013-06-29 17:19:30]
prediction = model.predict('2013-05-26 00:00:00', '2013-06-29 17:19:30', dynamic=False)
still gives me an error.
I know prediction = model.predict() would run and generate the whole in-sample prediction, which I could then match up with the dates, but overall that is not very convenient.
The problem is that this freq doesn't give back an offset from pandas for some reason, and we need an offset to be able to use the dates for anything. It looks like a pandas bug/not implemented to me.
from pandas.tseries.frequencies import get_offset
get_offset('30s')  # this is where it breaks: no offset comes back for '30s'
Perhaps we could improve the error message though.
[Edit: We don't really need the dates except for adding them back in for convenience in prediction, so you can still estimate the model by using data_series.values.]
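A sketch of that workaround: estimate on the plain values, predict by integer position, and re-attach the dates afterwards:
# Fit on the raw ndarray so no frequency handling is involved.
res = sm.tsa.ARMA(data_series.values, (1, 0)).fit()
# In-sample prediction by integer position instead of by date.
pred = res.predict(start=0, end=len(data_series) - 1)
# Re-attach the original DatetimeIndex for convenience.
pred_series = Series(pred, index=data_index)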