Everytime I run a VARMAX model I get different coefficients.
Is there any way I could replicate my previous results without imposing a seed?
Thank you
I tried to replicate the VARMA(p,q) example posted on the statsmodels webpage: ( https://www.statsmodels.org/dev/examples/notebooks/generated/statespace_varmax.html ). In order to check the replicability of the results, I just added a loop to reestimate the model and a dataframe (parameters) for saving the results. So this is my code:
%matplotlib inline
import numpy as np
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt
dta = sm.datasets.webuse('lutkepohl2', 'https://www.stata-press.com/data/r12/')
dta.index = dta.qtr
endog = dta.loc['1960-04-01':'1978-10-01', ['dln_inv', 'dln_inc', 'dln_consump']]
exog = endog['dln_consump']
parameters=pd.DataFrame()
for p in range(10):
print(p)
mod = sm.tsa.VARMAX(endog[['dln_inv', 'dln_inc']], order=(1,1))
res = mod.fit(maxiter=1000, disp=False)
print(res.summary())
param= pd.DataFrame(res.params,columns= ["estimation "+str(p)])
parameters=pd.concat([parameters, param], axis=1)
print(parameters)
As you can see, the results change everytime I reestimate the model:
estimation 0 estimation 1 estimation 2 \
const.dln_inv 0.010974 0.010934 0.010934
const.dln_inc 0.016554 0.016536 0.016536
L1.dln_inv.dln_inv -0.010164 -0.010087 -0.010087
L1.dln_inc.dln_inv 0.360306 0.362187 0.362187
L1.dln_inv.dln_inc -0.032975 -0.033071 -0.033071
L1.dln_inc.dln_inc 0.230657 0.231421 0.231421
L1.e(dln_inv).dln_inv -0.249916 -0.250307 -0.250307
L1.e(dln_inc).dln_inv 0.125546 0.125581 0.125581
L1.e(dln_inv).dln_inc 0.088878 0.089001 0.089001
L1.e(dln_inc).dln_inc -0.235258 -0.235176 -0.235176
sqrt.var.dln_inv 0.044926 0.044927 0.044927
sqrt.cov.dln_inv.dln_inc 0.001670 0.001662 0.001662
sqrt.var.dln_inc 0.011554 0.011554 0.011554
Thank you. But I tried to replicate the VARMA(p,q) example posted on the statsmodels webpage: ( https://www.statsmodels.org/dev/examples/notebooks/generated/statespace_varmax.html ). In order to check the replicability of the results, I just added a loop to reestimate the model and a dataframe (parameters) for saving the results. So this is my code:
%matplotlib inline
import numpy as np
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt
dta = sm.datasets.webuse('lutkepohl2', 'https://www.stata-press.com/data/r12/')
dta.index = dta.qtr
endog = dta.loc['1960-04-01':'1978-10-01', ['dln_inv', 'dln_inc', 'dln_consump']]
exog = endog['dln_consump']
parameters=pd.DataFrame()
for p in range(10):
print(p)
mod = sm.tsa.VARMAX(endog[['dln_inv', 'dln_inc']], order=(1,1))
res = mod.fit(maxiter=1000, disp=False)
print(res.summary())
param= pd.DataFrame(res.params,columns= ["estimation "+str(p)])
parameters=pd.concat([parameters, param], axis=1)
print(parameters)
As you can see, the results change everytime I reestimate the model:
estimation 0 estimation 1 estimation 2 \
const.dln_inv 0.010974 0.010934 0.010934
const.dln_inc 0.016554 0.016536 0.016536
L1.dln_inv.dln_inv -0.010164 -0.010087 -0.010087
L1.dln_inc.dln_inv 0.360306 0.362187 0.362187
L1.dln_inv.dln_inc -0.032975 -0.033071 -0.033071
L1.dln_inc.dln_inc 0.230657 0.231421 0.231421
L1.e(dln_inv).dln_inv -0.249916 -0.250307 -0.250307
L1.e(dln_inc).dln_inv 0.125546 0.125581 0.125581
L1.e(dln_inv).dln_inc 0.088878 0.089001 0.089001
L1.e(dln_inc).dln_inc -0.235258 -0.235176 -0.235176
sqrt.var.dln_inv 0.044926 0.044927 0.044927
sqrt.cov.dln_inv.dln_inc 0.001670 0.001662 0.001662
sqrt.var.dln_inc 0.011554 0.011554 0.011554
Related
This is my code I do not know where the error is.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels
from sklearn.preprocessing import OrdinalEncoder,OneHotEncoder
Light = ['On','Off']
Watering = ['Low','High']
# create combinations for all parameters
experiments = [(x,y) for x in Light for y in Watering]
exp_df = pd.DataFrame(experiments,columns=['A','B'])
print(exp_df)
OHE_model = OneHotEncoder(handle_unknown = 'ignore')
enc = OrdinalEncoder(categories=[['A','B'],['Low','High']])
encoded_df = pd.DataFrame(enc.fit_transform(exp_df[['A','B']]),columns=['A','B'])
#define the experiments order which must be random
encoded_df['exp_order'] = np.random.choice(np.arange(4),4,replace=False)
encoded_df['outcome'] = [25,37,55,65]
and this is the error
ValueError: Found unknown categories ['Off', 'On'] in column 0 during fit
The error indicates that On and Off are expected in place off A and B.
Use this code for your enc variable instead
enc = OrdinalEncoder(categories=[['On','Off'],['Low','High']])
Using DB scan I am iterating through a csv with x, y, z, and id data. I would like to generate a new csv for a every combination of eps and minimum samples that occur within a set range.
The output csv should also include the number of points in the cluster, eps value, and minimum samples used as columns along with the original x, y, z, and id data. The code below will do this, but will only create a csv for the last cluster calculated.
%matplotlib notebook
from interp import *
import pandas as pd
import pandas as pdfrom sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors
from sklearn.metrics import silhouette_score
import itertools
points_file = "examples/example-1/fcr.csv"
eps_values = np.arange(0.1,0.5,0.1)
min_samples = np.arange(5,20)
dbscan_params = list(itertools.product(eps_values, min_samples))
cluster_n = []
epsvalues = []
min_samp = []
for p in dbscan_params:
dbscan_cluster = DBSCAN(eps=p[0],
min_samples=p[1]).fit(points)
epsvalues.append(p[0])
min_samp.append(p[1]), cluster_n.append(
len(np.unique(dbscan_cluster.labels_)))
df = pd.read_csv(points_file, header=None, names=["x", "y", "z", "id"])
df["cluster"] = clustering.labels_
df["eps"] = eps
df["min_sample"] = min_sam
csv_name = f'fcr_{eps}e{min_sam}m.csv'
df.to_csv(csv_name, index=False)
I'm trying to get my head around how to implement a Monte Carlo function in python using pymc to replicate a spreadsheet by Douglas Hubbard in his book How to Measure Anything
My attempt was:
import numpy as np
import pandas as pd
from pymc import DiscreteUniform, Exponential, deterministic, Poisson, Uniform, Normal, Stochastic, MCMC, Model
maintenance_saving_range = DiscreteUniform('maintenance_saving_range', lower=10, upper=21)
labour_saving_range = DiscreteUniform('labour_saving_range', lower=-2, upper=9)
raw_material_range = DiscreteUniform('maintenance_saving_range', lower=3, upper=10)
production_level_range = DiscreteUniform('maintenance_saving_range', lower=15000, upper=35000)
#deterministic(plot=False)
def rate(m = maintenance_saving_range, l = labour_saving_range, r=raw_material_range, p=production_level_range):
return (m + l + r) * p
model = Model([rate, maintenance_saving_range, labour_saving_range, raw_material_range, production_level_range])
mc = MCMC(model)
Unfortunately, I'm getting an error: ValueError: A tallyable PyMC object called maintenance_saving_range already exists. This will cause problems for some database backends.
What have I got wrong?
Ah, it was a copy and paste error.
I'd called three distributions by the same name.
Here's the code that works.
import numpy as np
import pandas as pd
from pymc import DiscreteUniform, Exponential, deterministic, Poisson, Uniform, Normal, Stochastic, MCMC, Model
%matplotlib inline
import matplotlib.pyplot as plt
maintenance_saving_range = DiscreteUniform('maintenance_saving_range', lower=10, upper=21)
labour_saving_range = DiscreteUniform('labour_saving_range', lower=-2, upper=9)
raw_material_range = DiscreteUniform('raw_material_range', lower=3, upper=10)
production_level_range = DiscreteUniform('production_level_range', lower=15000, upper=35000)
#deterministic(plot=False, name="rate")
def rate(m = maintenance_saving_range, l = labour_saving_range, r=raw_material_range, p=production_level_range):
#out = np.empty(10000)
out = (m + l + r) * p
return out
model = Model([rate, maintenance_saving_range, labour_saving_range, raw_material_range])
mc = MCMC(model)
mc.sample(iter=10000)
The following code gives me the following error: ValueError: Found array with 0 sample(s) (shape=(0, 3)) while a minimum of 1 is required.
The error is produced in line where prediction is invoked. I am assuming there's something wrong about the shape of the dataframe, 'obs_to_pred.' I checked the shape, which is (1046, 3).
What do you recommend so I can fix this and run the prediction?
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import statsmodels.api as sm
from patsy import dmatrices
from sklearn.linear_model import LogisticRegression
import scipy.stats as stats
from sklearn import linear_model
# Import Titanic Data
train_loc = 'C:/Users/Young/Desktop/Kaggle/Titanic/train.csv'
test_loc = 'C:/Users/Young/Desktop/Kaggle/Titanic/test.csv'
train = pd.read_csv(train_loc)
test = pd.read_csv(test_loc)
# Predict Missing Age Values Based on Factors Pclass, SibSp, and Parch.
# In the function, combine train and test data.
def regressionPred (traindata,testdata):
allobs = pd.concat([traindata, testdata])
allobs = allobs[~allobs.Age.isnull()]
y = allobs.Age
y, X = dmatrices('y ~ Pclass + SibSp + Parch', data = allobs, return_type = 'dataframe')
mod = sm.OLS(y,X)
res = mod.fit()
predictors = ['Pclass', 'SibSp', 'Parch']
regr = linear_model.LinearRegression()
regr.fit(allobs.ix[:,predictors], y)
obs_to_pred = allobs[allobs.Age.isnull()].ix[:,predictors]
prediction = regr.predict( obs_to_pred ) # Error Produced in This Line ***
return res.summary(), prediction
regressionPred(train,test)
In case you may want to look at the dataset, the link will take you there: https://www.kaggle.com/c/titanic/data
In the line
allobs = allobs[~allobs.Age.isnull()]
you define allobs as all the cases with no NaN in Age column.
Later, with:
obs_to_pred = allobs[allobs.Age.isnull()].ix[:,predictors]
you do not have any case to predict on as all allobs.Age.isnull() will be evaluated to False and you'll get an empty obs_to_pred. Thus your error:
array with 0 sample(s) (shape=(0, 3)) while a minimum of 1 is required.
Check the logic what you want with your predictions.
Here is the sample ADF test in python to check for Cointegration between two pairs. However the final result gives only the numeric value for co-integration. How to get the historical results of Co-integration.
Taken from http://www.leinenbock.com/adf-test-in-python/
import numpy as np
import statsmodels.api as stat
import statsmodels.tsa.stattools as ts
x = np.random.normal(0,1, 1000)
y = np.random.normal(0,1, 1000)
def cointegration_test(y, x):
result = stat.OLS(y, x).fit()
return ts.adfuller(result.resid)
I assume you want to test for expanding cointegration? Note that you should use sm.tsa.coint to test for cointegration. You could test for historical cointegrating relationship between realgdp and realdpi using pandas like so
import pandas as pd
import statsmodels.api as sm
data = sm.datasets.macrodata.load_pandas().data
def rolling_coint(x, y):
yy = y[:len(x)]
# returns only the p-value
return sm.tsa.coint(x, yy)[1]
historical_coint = pd.expanding_apply(data.realgdp, rolling_coint,
min_periods=36,
args=(data.realdpi,))