Is there a way to measure the accuracy of an ARMA-GARCH model in Python using a prediction interval (alpha=0.05)? I fitted an ARMA-GARCH model on log returns and used some classical metrics such as RMSE, MSE (out-of-sample), AIC (in-sample), check on residuals and so on. I would like to add a prediction interval as another measurement of accuracy based on my ARMA-GARCH model predictions. I used the armagarch library (https://github.com/iankhr/armagarch).
I already checked on how to use prediction intervals but not sure how to use it with ARMA-GARCH.
I found these formula searching online: Estimator +- 1.96 (for 95%) * Standard Error.
So far i got it, but i have several Standard Errors in my model output for each parameter in the ARMA and GARCH part, which one i have to use? Is there one Standard Error for the whole model itself?
I would be really happy if anyone could help.
ARMA-GARCH model output
So far I created an ARMA(2,2)-GARCH(1,1) model:
#final test of function
import armagarch as ag
#definitions framework
data = pd.DataFrame(data)
meanMdl = ag.ARMA(order = {'AR':2,'MA':2})
volMdl = ag.garch(order = {'p':1,'q':1})
distMdl = ag.normalDist()
model = ag.empModel(data, meanMdl, volMdl, distMdl)
model_fit = model.fit()
After the model fit defining prediction length and
Recieved two arrays as an output (mean + variance) put them into the correct length:
#first array is mean, second is variance
pred = model.predict(nsteps=len(df_test))
#correct the shapes!
df_pred_mean = pd.DataFrame(np.reshape(pred[0], (len(df_test),
1)))
df_pred_variance = pd.DataFrame(np.reshape(pred[1],
(len(df_test), 1)))
So far so good, now i would like to implement a prediction interval.
I got that one has to use the ARMA part +- 1.96 (95%)* GARCH prediction for each prediction. I implemented it for the upper and lower bound. It just shows the upper bound lower bound is same but using * (-1.96) at the end of the formula.
#upper bound
df_all["upper bound"] =df_all["pred_Mean"]+df_all["pred_Variance"]*1.96
Using it on the actual log returns i trained the model with fails in the way its completely wrong. Now I'm unsure if the main approach i used is wrong or the model I used means the package.
prediction interval vs. actual log return
Related
I am training and tuning a model in pycaret such as:
from pycaret.classification import *
clf1 = setup(data = train, target = 'target', feature_selection = True, test_data = test, remove_multicollinearity = True, multicollinearity_threshold = 0.4)
# create model
lr = create_model('lr')
# tune model
tuned_lr = tune_model(lr)
# optimize threshold
optimized_lr = optimize_threshold(tuned_lr)
I would like to get the parameters estimated for the features in the Logistic Regression, so I could proceed with understanding the effect size of each feature on the target. However, the object optimized_lr has a function optimized_lr.get_params() which returns the hyperparameters of the model, however, I am not quite interested in my tuning decisions, instead, I am very interested in the real parameters of the model, the ones estimated in Logistic Regression.
How could I get them to use pycaret? (I could easily get those using other packages such as statsmodels, but I want to know in pycaret)
how about
for f, c in zip (optimized_lr.feature_names_in_,tuned.coef_[0]):
print(f, c)
To get the coefficients, use this code:
tuned_lr.feature_importances_ #this will give you the coefficients
get_config('X_train').columns #this code will give you the names of the columns.
Now we can create a dataframe so that we could see clearly how it correlates with the independent variable.
Coeff=pd.DataFrame({"Feature":get_config('X_train').columns.tolist(),"Coefficients":tuned_lr.feature_importances_})
print(Coeff)
# It would give me the Coefficient with the names of the respective columns. Hope it helps.
I'm using ML to predict a numerical output (a percentage from 0-100) from a number of (discretized) input variables. I want to find the combination of input variables that the model thinks will give the maximum output.
I have a fitted RandomForestRegressor model in Python (3.7.10), and currently evaluate the model's predictions for every combination of input variables.
model = RandomForestRegressor(n_estimators ...)
modelfit = model.fit(a, b)
modelpredictions = model.predict(combinations)
However, this doesn't scale well with more variables/finer steps.
So, is there a simpler way to extract the global optimum of the model? E.g. an equivalent to argmax().
I'm trying to build old school model using only auto regression algorithm. I found out that there's an implementation of it in statsmodel package. I've read the documentation, and as I understand it should work as ARIMA. So, here's my code:
import statsmodels.api as sm
model = sm.tsa.AutoReg(df_train.beer, 12).fit()
And when I want to predict new values, I'm trying to follow the documentation:
y_pred = model.predict(start=df_test.index.min(), end=df_test.index.max())
# or
y_pred = model.predict(start=100, end=1000)
Both returns a list of NaNs.
Also, when I type model.predict(0, df_train.size - 1) it predicts real values, but model.predict(0, df_train.size) predicts NaNs list.
Am I doing something wrong?
P.S. I know there's ARIMA, ARMA or SARIMAX algorithms, that can be used as basic auto regression. But I need exactly AutoReg.
We can do the forecasting in couple of ways:
by directly using the predict() function and
by using the definition of AR(p) process and the parameters learnt with AutoReg(): this will be helpful for short-term predictions, as we shall see.
Let's start with a sample dataset from statsmodels, the data looks like the following:
import statsmodels.api as sm
data = sm.datasets.sunspots.load_pandas().data['SUNACTIVITY']
plt.plot(range(len(data)), data)
Let's fit an AR(p) process to model the time series and use partial autocorrelation plot to find the order p, as shown below
As seen from above, the first few PACF values remain significant, let's use p=10 for the AR(p).
Let's divide the data into training and validation (test) datasets and fit auto-regressive model of order 10 using the training data:
from statsmodels.tsa.ar_model import AutoReg
n = len(data)
ntrain = int(n*0.9)
ntest = n - ntrain
lag = 10
res = AutoReg(data[:ntrain], lags = lag).fit()
Now, use the predict() function for forecasting all values corresponding to the held-out dataset:
preds = res.model.predict(res.params, start=n-ntest, end=n)
Notice that we can get the exactly same predictions using the parameters from the trained model, as shown below:
x = data[ntrain-lag:ntrain].values
preds1 = []
for t in range(ntrain, n):
pred = res.params[0] + np.sum(res.params[1:]*x[::-1])
x[:lag-1], x[lag-1] = x[-(lag-1):], pred
preds1.append(pred)
Note that the forecast values generated this way is same as the ones obtained using the predict() function above.
np.allclose(preds.values, np.array(preds1))
# True
Now, let's plot the forecast values for the test data:
As can be seen, for long term prediction, quality of forecasting is not that good (since the forecasted values are used for long term prediction).
Let's instead go for short-term predictions now and use the last lag points from the dataset to forecast the next value, as shown in the next code snippet.
preds = []
for t in range(ntrain, n):
pred = res.params[0] + np.sum(res.params[1:]*data[t-lag:t].values[::-1])
preds.append(pred)
As can be seen from the next plot, short term forecasting works way better:
You can use this code for forecasting
import statsmodels as sm
model = sm.tsa.AutoReg(df_train.beer, 12).fit()
y_pred = model.model.predict(model.params, start=df_test.index.min(), end=df_test.index.max())
from statsmodels.tsa.ar_model import AutoReg
model=AutoReg(dataset[''],lags=1)
ARFit=model.fit()
forecasted=ARFit.predict(start=len(dataset),end=len(dataset)+12)
#visualizacion
dataset[''].plot(figsize=(12,8),legend=True)
forecasted.plot(legend=True)
I am currently trying to implement both direct and recursive multi-step forecasting strategies using the statsmodels ARIMA library and it has raised a few questions.
A recursive multi-step forecasting strategy would be training a one-step model, predicting the next value, appending the predicted value onto the end of my exogenous values fed into the forecast method and repeating. This is my recursive implementation:
def arima_forecast_recursive(history, horizon=1, config=None):
# make list so can add / remove elements
history = history.tolist()
model = ARIMA(history, order=config)
model_fit = model.fit(trend='nc', disp=0)
for i, x in enumerate(history):
yhat = model_fit.forecast(steps=1, exog=history[i:])
yhat.append(history)
return np.array(yhat)
def walk_forward_validation(dataframe, config=None):
n_train = 52 # Give a minimum of 2 forecasting periods to capture any seasonality
n_test = 26 # Test set should be the size of one forecasting horizon
n_records = len(dataframe)
tuple_list = []
for index, i in enumerate(range(n_train, n_records)):
# create the train-test split
train, test = dataframe[0:i], dataframe[i:i + n_test]
# Test set is less than forecasting horizon so stop here.
if len(test) < n_test:
break
yhat = arima_forecast_recursive(train, n_test, config)
results = smape3(test, yhat)
tuple_list.append(results)
return tuple_list
Similarly to perform a direct strategy I would just fit my model on the available training data and use this to predict the total multi-step forecast at once. I am not sure how to achieve this using the statsmodels library.
My attempt (which produces results) is below:
def walk_forward_validation(dataframe, config=None):
# This currently implements a direct forecasting strategy
n_train = 52 # Give a minimum of 2 forecasting periods to capture any seasonality
n_test = 26 # Test set should be the size of one forecasting horizon
n_records = len(dataframe)
tuple_list = []
for index, i in enumerate(range(n_train, n_records)):
# create the train-test split
train, test = dataframe[0:i], dataframe[i:i + n_test]
# Test set is less than forecasting horizon so stop here.
if len(test) < n_test:
break
yhat = arima_forecast_direct(train, n_test, config)
results = smape3(test, yhat)
tuple_list.append(results)
return tuple_list
def arima_forecast_direct(history, horizon=1, config=None):
model = ARIMA(history, order=config)
model_fit = model.fit(trend='nc', disp=0)
return model_fit.forecast(steps=horizon)[0]
What confuses me specifically is if the model should just be fit once for all predictions or multiple times to make a single prediction in the multi-step forecast? Taken from Souhaib Ben Taieb's doctoral thesis (page 35 paragraph 3) it is presented that direct model will estimate H models, where H is the length of the forecast horizon, so in my example with a forecast horizon of 26, 26 models should be estimated instead of just one. As shown above my current implementation only fits one model.
What I do not understand is how, if I call ARIMA.fit() method multiple times on the same training data I will get a model that I will get aa fit that is any different outside of the expected normal stochastic variation?
My final question is with regard to optimisation. Using a method such as walk forward validation gives me statistically very significant results, but for many time series it is very computationally expensive. Both of the above implementations are already called using the joblib parallel loop execution functionality which significantly reduced the runtime on my laptop. However I would like to know if there is anything that can be done with regard to the above implementations to make them even more efficient. When running these methods for ~2000 separate time series (~ 500,000 data points total across all series) there is a runtime of 10 hours. I have profiled the code and most of the execution time is spent in the statsmodels library, which is fine but there a discrepancy between the runtime of the walk_forward_validation() method and ARIMA.fit(). This is expected as obviously the walk_forward_validation() method does stuff other than just call the fit method, but if anything in it can be changed to speed up execution time then please let me know.
The idea of this code is to find an optimal arima order per time series as it isn't feasible to investigate 2000 time series individually and as such the walk_forward_validation() method is called 27 times per time series. So roughly 27,000 times overall. Therefore any performance saving that can be found within this method will have an impact no matter how small it is.
Normally, ARIMA can only perform recursive forecasting, not direct forecasting. There might some research done on variations of ARIMA for direct forecasting, but they wouldn't be implemented in Statsmodels. In statsmodels, (or in R auto.arima()), when you set a value for h > 1, it simply performs a recursive forecast to get there.
As far as I know, none of the standard forecasting libraries have direct forecasting implemented yet, you're going to have to code it yourself.
Taken from Souhaib Ben Taieb's doctoral thesis (page 35 paragraph 3) it is presented that direct model will estimate H models, where H is the length of the forecast horizon, so in my example with a forecast horizon of 26, 26 models should be estimated instead of just one.
I haven't read Ben Taieb's thesis, but from his paper "Machine Learning Strategies for Time Series Forecasting", for direct forecasting, there is only one model for one value of H. So for H=26, there will be only one model. There will be H models if you need to forecast for every value between 1 and H, but for one H, there is only one model.
I run into a common problem I'm wondering if someone can help with. I often would like to use pymc3 in two modes: training (i.e. actually running inference on parameters) and evaluation (i.e. using inferred parameters to generate predictions).
In general, I'd like a posterior over predictions, not just point-wise estimates (that's part of the benefit of the Bayesian framework, no?). When your training data is fixed, this is typically accomplished by adding simulated variable of a similar form to the observed variable. For example,
from pymc3 import *
with basic_model:
# Priors for unknown model parameters
alpha = Normal('alpha', mu=0, sd=10)
beta = Normal('beta', mu=0, sd=10, shape=2)
sigma = HalfNormal('sigma', sd=1)
# Expected value of outcome
mu = alpha + beta[0]*X1 + beta[1]*X2
# Likelihood (sampling distribution) of observations
Y_obs = Normal('Y_obs', mu=mu, sd=sigma, observed=Y)
Y_sim = Normal('Y_sim', mu=mu, sd=sigma, shape=len(X1))
start = find_MAP()
step = NUTS(scaling=start)
trace = sample(2000, step, start=start)
But what if my data changes? Say I want to generate predictions based on new data, but without running inference all over again. Ideally, I'd have a function like predict_posterior(X1_new, X2_new, 'Y_sim', trace=trace) or even predict_point(X1_new, X2_new, 'Y_sim', vals=trace[-1]) that would simply run the new data through the theano computation graph.
I suppose part of my question relates to how pymc3 implements the theano computation graph. I've noticed that the function model.Y_sim.eval seems similar to what I want, but it requires Y_sim as an input and seems just to return whatever you give it.
I imagine this process is extremely common, but I can't seem to find any way to do it. Any help is greatly appreciated. (Note also that I have a hack to do this in pymc2; it's more difficult in pymc3 because of theano.)
Note: This functionality is now incorporated in the core code as the pymc.sample_ppc method. Check out the docs for more info.
Based on this link (dead as of July 2017) sent to me by twiecki, there are a couple tricks to solve my issue. The first is to put the training data into a shared theano variable. This allows us to change the data later without screwing up the theano computation graph.
X1_shared = theano.shared(X1)
X2_shared = theano.shared(X2)
Next, build the model and run the inference as usual, but using the shared variables.
with basic_model:
# Priors for unknown model parameters
alpha = Normal('alpha', mu=0, sd=10)
beta = Normal('beta', mu=0, sd=10, shape=2)
sigma = HalfNormal('sigma', sd=1)
# Expected value of outcome
mu = alpha + beta[0]*X1_shared + beta[1]*X2_shared
# Likelihood (sampling distribution) of observations
Y_obs = Normal('Y_obs', mu=mu, sd=sigma, observed=Y)
start = find_MAP()
step = NUTS(scaling=start)
trace = sample(2000, step, start=start)
Finally, there's a function under development (will likely eventually get added to pymc3) that will allow to predict posteriors for new data.
from collections import defaultdict
def run_ppc(trace, samples=100, model=None):
"""Generate Posterior Predictive samples from a model given a trace.
"""
if model is None:
model = pm.modelcontext(model)
ppc = defaultdict(list)
for idx in np.random.randint(0, len(trace), samples):
param = trace[idx]
for obs in model.observed_RVs:
ppc[obs.name].append(obs.distribution.random(point=param))
return ppc
Next, pass in the new data that you want to run predictions on:
X1_shared.set_value(X1_new)
X2_shared.set_value(X2_new)
Finally, you can generate posterior predictive samples for the new data.
ppc = run_ppc(trace, model=model, samples=200)
The variable ppc is a dictionary with keys for each observed variable in the model. So, in this case ppc['Y_obs'] would contain a list of arrays, each of which is generated using a single set of parameters from trace.
Note that you can even modify the parameters extracted from the trace. For example, I had a model using a GaussianRandomWalk variable and I wanted to generate predictions into the future. While you could allow pymc3 to sample into the future (i.e. allow the random walk variable to diverge), I just wanted to use a fixed value of the coefficient corresponding to the last inferred value. This logic can implemented in the run_ppc function.
It's also worth mentioning that the run_ppc function is extremely slow. It takes about as much time as running the actual inference. I suspect this has to do with some inefficiency related to how theano is used.
EDIT: The link originally included seems to be dead.
Above answer from #santon is correct. I am just adding to that.
Now you don't need to write your own method run_ppc. pymc3 provides sample_posterior_predictive method which does the same.