I estimate a VAR(1) model in statsmodels (the sample code is from the statsmodels user guide).
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.tsa.api import VAR
from statsmodels.tsa.base.datetools import dates_from_str
# prepare the data
mdata = sm.datasets.macrodata.load_pandas().data
dates = mdata[['year', 'quarter']].astype(int).astype(str)
quarterly = dates["year"] + "Q" + dates["quarter"]
quarterly = dates_from_str(quarterly)
mdata = mdata[['realgdp','realcons','realinv']]
mdata.index = pd.DatetimeIndex(quarterly)
data = np.log(mdata).diff().dropna()
# make a VAR model
model = VAR(data)
results = model.fit(1)
I want to compute the variance of the VAR model. Is there an attribute or property of the VARResults object that can give the variance directly?
I have found the answer.
results.acf(0)
The acf() method of the VARResults object computes the theoretical autocovariance function of the VAR model.
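For completeness, here is a minimal sketch of pulling the variance out of that result (assuming the VAR(1) fit above): acf(nlags) returns the autocovariances from lag 0 up to nlags as an array of shape (nlags + 1, neqs, neqs), so the lag-0 slice is the covariance matrix of the process and its diagonal holds the variances.
# lag-0 autocovariance = covariance matrix of the VAR process
sigma_y = results.acf(0)[0]
# the variances of realgdp, realcons and realinv sit on the diagonal
variances = np.diag(sigma_y)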
When fitting a GLM in H2O (H2O_cluster_version: 3.32.0.5) with
lambda_search = True, nlambdas = 20, and lambda_min_ratio = .0001,
my team and I receive 24 lambdas in our regularization path. The last 4 lambdas in the path are repeats of the first 4, the largest values.
Here is a reproducible example:
import pandas as pd
import numpy as np
import tweedie
import scipy
import os
import sys
import time
import h2o
from h2o.estimators.glm import H2OGeneralizedLinearEstimator
from h2o.grid.grid_search import H2OGridSearch
# environment-specific cluster startup (custom helper script) replaced with the standard call
h2o.init()
#sample data
resp = np.random.choice(range(0,100),size=1000)
pred1 = np.random.choice(range(40,50),size=1000)
pred2 = np.random.choice(range(20,30),size=1000)
pred3 = np.random.choice([1,2,3,4,5],size=1000)
weight = np.random.choice([1,1,1,.9,.37],size=1000)
folds = np.random.choice([1,2,3,4,5],size=1000)
data = pd.DataFrame({'resp': resp, 'pred1':pred1,'pred2':pred2,'pred3':pred3,'weight':weight,'fold_column':folds})
predictors = ['pred1','pred2','pred3']
# convert pandas df to h2oframe
H2Odata = h2o.H2OFrame(data, column_names=data.columns.tolist())
# set up model
model = H2OGeneralizedLinearEstimator(
    family="tweedie",
    tweedie_link_power=0,
    tweedie_variance_power=1.7,
    lambda_search=True,
    early_stopping=False,
    lambda_min_ratio=0.0001,
    nlambdas=20,
    alpha=0.5,
    standardize=True,
    weights_column='weight',
    solver='IRLSM',
    # beta_constraints=constraints,
    keep_cross_validation_models=True,
    keep_cross_validation_predictions=True,
    keep_cross_validation_fold_assignment=True
)
# Train the model with training and validation data
model.train(
    x=predictors,
    y='resp',
    training_frame=H2Odata,
    fold_column='fold_column'
)
# get full regularization paths
# regularization path of each cross-validation model
regpath_h2o_cv = []
for i in range(0, len(model.cross_validation_models())):
    regpath_h2o_cv.append(H2OGeneralizedLinearEstimator.getGLMRegularizationPath(model.cross_validation_models()[i]))
# lambdas returned for the first cross-validation model
print(H2OGeneralizedLinearEstimator.getGLMRegularizationPath(model.cross_validation_models()[0])['lambdas'])
When I run this, there is an extra lambda, a repeat of the first lambda.
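As a hedged illustration of that observation (reusing the regpath_h2o_cv list built above), one way to see the repeats is to compare how many lambdas come back against how many of them are distinct:
lams = regpath_h2o_cv[0]['lambdas']
# more entries than the 20 requested, and some of them are duplicates
print(len(lams), len(set(lams)))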
Can anyone provide guidance on why H2O is providing more lambdas than requested, and especially repeated lambdas?
Does this mean it is fitting unnecessary models?
Our real use case involves very large data, so any time we can save by avoiding unnecessary modeling will be helpful.
Is there a way to do multivariate Breusch-Godfrey Lagrange Multiplier tests for residual serial correlation in vector autoregressions (VAR) using statsmodels? I would like to get the same output as EViews under View > Residual Tests > Autocorrelation LM Test.
I have tried using acorr_breusch_godfrey from statsmodels, but it doesn't seem to give me any output. Am I misparameterizing it? Or do I need to loop through the variables somehow?
Below is an example using OLS (which works), followed by the VAR version (which doesn't).
import pandas as pd
import numpy as np
import statsmodels.api as sm
from statsmodels.tsa.api import VAR
from statsmodels.stats.diagnostic import acorr_breusch_godfrey
data = pd.read_csv('http://web.pdx.edu/~crkl/ceR/data/cjx.txt', sep=r'\s+', index_col='YEAR', nrows=39)
X = np.log(data.X)
L = np.log(data.L1)
K = np.log(data.K1)
df = pd.DataFrame({'X': X, 'L': L, 'K': K})
# OLS Regression
model_ols = sm.OLS.from_formula('X~L+K', df).fit()
# print(model_ols.summary())
print(sm.stats.diagnostic.acorr_breusch_godfrey(model_ols))
# Vector Auto Regression
model_var = VAR(endog=df[['L','K']],exog=df['X']).fit(maxlags=2)
# print(results_var.summary())
sm.stats.diagnostic.acorr_breusch_godfrey(model_var,nlags=15)
For the last one I've also tried the below to no avail:
sm.stats.diagnostic.acorr_breusch_godfrey(model_var.resid.loc[:,0],nlags=15)
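For what it's worth, here is a hedged per-equation sketch of what "looping through the variables" could look like: refit each VAR equation by OLS on two lags of both variables (ignoring the exogenous X for simplicity) and run the univariate Breusch-Godfrey test on each single-equation fit. This is my own workaround, not the system-wide LM test that EViews reports.
lagged = pd.concat([df[['L', 'K']].shift(1), df[['L', 'K']].shift(2)], axis=1).dropna()
lagged.columns = ['L_1', 'K_1', 'L_2', 'K_2']
for col in ['L', 'K']:
    # single-equation OLS version of the VAR equation for `col`
    eq = sm.OLS(df[col].loc[lagged.index], sm.add_constant(lagged)).fit()
    print(col, acorr_breusch_godfrey(eq, nlags=2))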
Could someone suggest how I can pass the left-out entry to the model for prediction? The model is fitted on all observations except the left-out entry, and I now want to check the prediction error for that left-out entry using the fitted model.
Sample code is below:
import pandas as pd # Reading Table
import numpy as np # Processing Array
import scipy.stats # Computing Statistic
import matplotlib.pyplot as plt # Drawing Graph
import statsmodels.api as sm # Statistical Models
# 'data' is assumed to be a DataFrame with columns "aa" and "bb"
n = len(data)
a = data["aa"]
b = data["bb"]
MSE_predict = np.zeros(n)
for i in np.arange(n):
    # drop observation i and fit on the remaining entries
    a_leaveOne = np.delete(a.values, i)
    b_leaveOne = np.delete(b.values, i)
    b_leaveOne = sm.add_constant(b_leaveOne)
    model = sm.OLS(a_leaveOne, b_leaveOne).fit()
    # predict the left-out entry: exog is [constant, b_i]
    # (the original call used an undefined `pres`; the held-out predictor value is used here)
    a_pre = model.predict([1, b.values[i]])
    # squared prediction error for the left-out entry
    MSE_predict[i] = np.square(a.values[i] - a_pre[0])
    print(MSE_predict[i])
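If a single leave-one-out MSE for the whole sample is wanted, the stored squared errors can simply be averaged (a small follow-up to the sketch above):
print(MSE_predict.mean())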
I was asked to write a program for Linear Regression with the following steps.
Load the R data set mtcars as a pandas dataframe.
Build another linear regression model by considering the log of the independent variable wt and the log of the dependent variable mpg.
Fit the model with data, and display the R-squared value.
I am a beginner at Statistics with Python.
I have tried getting the log values without converting to a new DataFrame but that gave an error saying "TypeError: 'OLS' object is not subscriptable"
import statsmodels.api as sa
import statsmodels.formula.api as sfa
import pandas as pd
import numpy as np
cars = sa.datasets.get_rdataset("mtcars")
cars_data = cars.data
lin_mod1 = sfa.ols("wt~mpg",cars_data)
lin_mod2 = pd.DataFrame(lin_mod1)
lin_mod2['wt'] = np.log(lin_mod2['wt'])
lin_mod2['mpg'] = np.log(lin_mod2['mpg'])
lin_res1 = lin_mod2.fit()
print(lin_res1.summary())
The expected result is the regression summary table, but the actual output is an error:
ValueError: DataFrame constructor not properly called!
This might work for you.
import statsmodels.api as sm
import numpy as np
mtcars = sm.datasets.get_rdataset('mtcars')
mtcars_data = mtcars.data
liner_model = sm.formula.ols('np.log(wt) ~ np.log(mpg)',mtcars_data)
liner_result = liner_model.fit()
print(liner_result.rsquared)
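If the full regression table is wanted as well (as in the original attempt), the same fitted result can be summarized:
print(liner_result.summary())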
I broke your code apart and ran it line by line.
The problem is here:
lin_mod1 = sfa.ols("wt~mpg",cars_data)
If you try to print it, the output is:
<statsmodels.regression.linear_model.OLS object at 0x7f1c64273eb8>
It can't be interpreted correctly to build a data frame.
The solution is to get the result of the first linear model into a summary table and then finally put that into a data frame:
results = lin_mod1.fit()
results_summary = results.summary()
If you print results_summary you will see that the variables are Intercept and mpg.
I don't know if it's a conceptual error or what, since it's not the pair "wt"-"mpg".
# summary as a html table
results_as_html = results_summary.tables[1].as_html()
# dataframe from the html table
lin_mod2 = pd.read_html(results_as_html, header=0, index_col=0)[0]
The print of lin_mod2 is:
coef std err t P>|t| [0.025 0.975]
Intercept 6.0473 0.309 19.590 0.0 5.417 6.678
mpg -0.1409 0.015 -9.559 0.0 -0.171 -0.111
Here is the solution:
import statsmodels.api as sm
import statsmodels.formula.api as smf
import numpy as np
cars=sm.datasets.get_rdataset("mtcars")
cars_data=cars.data
lin_mod1=smf.ols('np.log(wt)~np.log(mpg)',cars_data)
lin_model_fit=lin_mod1.fit()
print(lin_model_fit.summary())
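Since the task also asks to display the R-squared value, the fitted result exposes it directly (same attribute as in the earlier answer):
print(lin_model_fit.rsquared)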
Change:
lin_mod2 = pd.DataFrame(lin_mod1)
To:
lin_mod2 = pd.DataFrame(data = lin_mod1)
I am trying to fit some models (spatial interaction models) according to some code which is provided in R. I have been able to get some of the code to work using statsmodels in a Python framework, but some of the results do not match at all. I believe that the code I have for R and Python should give identical results. Does anyone see any differences? Or are there some fundamental differences which might be throwing things off? The R code is the original code, which matches the numbers given in a tutorial (found here: http://www.bartlett.ucl.ac.uk/casa/pdf/paper181).
R sample Code:
library(mosaic)
Data = fetchData('http://dl.dropbox.com/u/8649795/AT_Austria.csv')
Model = glm(Data~Origin+Destination+Dij+offset(log(Offset)), family=poisson(link="log"), data = Data)
cor = cor(Data$Data, Model$fitted, method = "pearson", use = "complete")
rsquared = cor * cor
rsquared
R output:
> Model = glm(Data~Origin+Destination+Dij+offset(log(Offset)), family=poisson(link="log"), data = Data)
Warning messages:
1: glm.fit: fitted rates numerically 0 occurred
2: glm.fit: fitted rates numerically 0 occurred
> cor = cor(Data$Data, Model$fitted, method = "pearson", use = "complete")
> rsquared = cor * cor
> rsquared
[1] 0.9753279
Python Code:
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
import statsmodels.api as sm
from scipy.stats.stats import pearsonr
Data= pd.DataFrame(pd.read_csv('http://dl.dropbox.com/u/8649795/AT_Austria.csv'))
Model = smf.glm('Data~Origin+Destination+Dij', data=Data, offset=np.log(Data['Offset']), family=sm.families.Poisson(link=sm.families.links.log)).fit()
# note: the posted snippet referenced an undefined `doubleConstrained`; the fitted GLM is `Model`
cor = pearsonr(Model.fittedvalues, Data["Data"])[0]
print("R-squared for doubly-constrained model is: " + str(cor*cor))
Python Output:
R-squared for doubly-constrained model is: 0.104758481123
It looks like GLM has convergence problems here in statsmodels. Maybe in R too, but R only gives these warnings.
Warning messages:
1: glm.fit: fitted rates numerically 0 occurred
2: glm.fit: fitted rates numerically 0 occurred
That could mean something like perfect separation in Logit/Probit context. I'd have to think about it for a Poisson model.
R is doing a better, if subtle, job of telling you that something may be wrong in your fitting. If you look at the fitted likelihood in statsmodels for instance, it's -1.12e27. That should be a clue right there that something is off.
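A quick way to see that clue (assuming the GLM fit from the question is stored in Model):
print(Model.llf)  # log-likelihood of the statsmodels GLM fit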
Using the Poisson model directly (I always prefer maximum likelihood to GLM when possible), I can replicate the R results (though I get a convergence warning). Tellingly, again, the default Newton-Raphson solver fails, so I use bfgs.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
import statsmodels.api as sm
from scipy.stats.stats import pearsonr
data= pd.DataFrame(pd.read_csv('http://dl.dropbox.com/u/8649795/AT_Austria.csv'))
mod = smf.poisson('Data~Origin+Destination+Dij', data=data, offset=np.log(data['Offset'])).fit(method='bfgs')
print(mod.mle_retvals['converged'])
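As a possible follow-up (my addition, assuming predict() with no arguments reuses the model's exog and offset, which recent statsmodels versions do), the asker's pseudo R-squared can be reproduced from the Poisson fit:
# predicted means including the offset, analogous to R's Model$fitted
fitted = mod.predict()
cor = pearsonr(fitted, data['Data'])[0]
print(cor * cor)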