Evaluating ARIMA models with the AIC - python

Having come across ARIMA/seasonal ARIMA recently, I am wondering why the AIC is chosen to judge how suitable a model is. According to Wikipedia, it evaluates the goodness of fit while penalizing non-parsimonious models in order to prevent overfitting. Many grid-search functions, such as auto_arima in Python or auto.arima in R, use it as an evaluation metric and suggest the model with the lowest AIC as the best fit.
However, in my case, choosing a simple model (the one with the lowest AIC, i.e. a small number of parameters) just results in a model that closely follows previous in-sample observations and performs very badly on the test data. I don't see how overfitting is prevented just by choosing a small number of parameters...
ARIMA(1,0,1)(0,0,0,53); AIC=-16.7
Am I misunderstanding something? What could be a workaround to prevent this?

In the case of an ARIMA model, whatever its parameters are, it will follow past observations, in the sense that you predict the next values given the previous values of your data. Now, auto.arima just tries a number of models and gives you the one with the lowest AIC by default, or with some other information criterion, e.g. the BIC. This does not mean anything more than what the definition of those criteria says: the model with the lowest AIC is simply the one that minimizes the AIC function. For time series analysis, after you make sure the series is stationary, I would recommend that you examine the ACF and PACF plots of your time series and read this
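For example, with statsmodels (a minimal sketch; y is a placeholder for your stationary series):

import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

fig, axes = plt.subplots(2, 1, figsize=(8, 6))
plot_acf(y, lags=40, ax=axes[0])    # a sharp cut-off here suggests the MA order q
plot_pacf(y, lags=40, ax=axes[1])   # a sharp cut-off here suggests the AR order p
plt.tight_layout()
plt.show()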
P.S. I don't understand the straight orange line in your plot after the dashed vertical line.

We usually use some form of cross-validation to protect against overfitting. It is well known that leave-one-out cross-validation is asymptotically equivalent to AIC under some assumptions about normality etc. Indeed, back when we had less computing power, AIC and other information criteria were handy exactly because they accomplish something very similar to cross-validation analytically.
Also, note that by their nature ARMA(1,1) models -- and other stationary ARMA models for that matter -- tend to converge to a constant (the unconditional mean) rather quickly. The easiest way to see this is to write down the expressions for y_t+1, y_t+2, ... as functions of y_t. You will see that the expressions contain powers of numbers less than 1 in absolute value (your AR and MA parameters), which quickly go to zero as the forecast horizon grows. Also see this discussion.
The reason why your 'observed' data (to the left of the dashed line) does not exhibit this behaviour is that for each period you get a new realisation of the random error term epsilon_t. To the right of the line, you do not get these realisations of the random shocks; instead they are replaced by their expected value, 0.
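A small simulation illustrating this convergence, assuming a recent statsmodels (the ARMA coefficients here are arbitrary):

import numpy as np
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.arima_process import arma_generate_sample

# simulate an ARMA(1,1) series with AR coefficient 0.6 and MA coefficient 0.3
np.random.seed(0)
y = arma_generate_sample(ar=[1, -0.6], ma=[1, 0.3], nsample=300)

res = ARIMA(y, order=(1, 0, 1)).fit()
print(res.forecast(steps=30))  # forecasts decay geometrically towards the estimated mean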

Related

Implementing ARIMA(X) model in TensorFlow

I'm currently scratching my head about how I might implement a classic ARIMA(X) model using base TensorFlow (and optionally Keras). The equation I am attempting to set up has the following form:
y'_t = c + phi_1 y'_{t-1} + ... + phi_p y'_{t-p} + theta_1 epsilon_{t-1} + ... + theta_q epsilon_{t-q} + beta X_t + epsilon_t, with y'_t the d-th difference of y_t.
Where d represents the level of differencing applied to the observed input series, p is the autoregressive order, and q is the moving-average order. The part that is currently stumping me is the calculation/estimation of the residuals epsilon. The autoregressive portion is a simple linear regression on the lagged samples, and the same is true for the terms involving the exogenous series (X). When estimating the residuals, should I simply feed the q previous steps through the current estimated parameters and compute the residuals as y_true - y_predict? This also raises the question: how does one estimate the residuals for observations that have no previous observations? Do we simply draw residuals 0 through q from a chosen random distribution with fixed variance (e.g. normal, Poisson, etc.) and mean 0?
I have looked at the source for the statsmodels package to try to understand it, but it is quite opaque. Part of the reason for implementing the model this way is that it needs to fit into a fairly standard ecosystem at the company I work for, and we need control over what slices of data the model is fitted to at a given time step. This is because some data may arrive (much) later than the time stamp it relates to, due to lag at the source etc.
Thank you for any help you might be able to offer.
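For what it's worth, the standard answer to the initialization question is the conditional sum of squares approach: the pre-sample residuals are simply set to their expected value of 0 rather than drawn from a distribution, and later residuals are computed recursively as y_true - y_predict. A NumPy sketch of that recursion (all names are illustrative; a TensorFlow version would express the same loop over tensors so the parameters stay trainable):

import numpy as np

def css_residuals(y, X, c, phi, theta, beta):
    """ARMAX residual recursion, conditioning on zero pre-sample shocks."""
    p, q = len(phi), len(theta)
    eps = np.zeros(len(y))                  # pre-sample residuals fixed at E[eps] = 0
    for t in range(max(p, q), len(y)):
        ar = phi @ y[t - p:t][::-1]         # AR part: linear regression on lagged y
        ma = theta @ eps[t - q:t][::-1]     # MA part: uses previously computed residuals
        y_hat = c + ar + ma + X[t] @ beta   # plus the exogenous part
        eps[t] = y[t] - y_hat               # residual = y_true - y_predict
    return eps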

How to perform statistical tests in time series applications

I received feedback on my paper about stock market forecasting with machine learning, and the reviewer asked the following:
I would like you to statistically test the out-of-sample performance of your methods. Hence 'differ significantly' in the original wording. I agree that some of the figures look awesome visually, but visually, random noise seems to contain patterns. I believe the Sortino ratio is the appropriate statistic to test, and it can be tested by using the bootstrap. I.e., a distribution is obtained for both BH and your strategy, and the overlap of these distributions is calculated.
My problem is that I have never done that for time series data. My validation procedure uses a strategy called walk-forward, where I shift the data in time 11 times, generating 11 different combinations of training and test sets with no overlap. So, here are my questions:
1- What would be the best (or most appropriate) statistical test to use, given what the reviewer is asking?
2- If I remember correctly, statistical tests require vectors as input, is that right? Can I generate a vector containing 11 Sortino ratios (one for each walk) and then compare them with the baselines, or should I run my code more than once? I am afraid the latter would be unfeasible given the short time to review.
So, what would be the correct way to compare machine learning approaches statistically in this time series scenario?
By pointing out that random noise seems to contain patterns, the reviewer means that your plots show nice patterns, but those patterns might just be random noise following some distribution (e.g. uniform random noise), which would make the results less convincing. One option is to split the data into k groups at random and then apply pairwise Z-tests or t-tests to compare the k groups.
The reviewer's focus on the Sortino ratio seems somewhat beside the point: since you are building a machine learning model for a forecasting task, what you actually care about is forecasting accuracy and reliability, which can be assessed with cross-validation (in convex optimization, the analogous tool is sensitivity analysis).
Update
The problem of serial dependency in time series data arises when the series is non-stationary, which does not seem to be the case for your data. Even if it were, it could be addressed by removing the trend, i.e. converting the non-stationary series into a stationary one (checking stationarity with the ADF test, for example), and you might also consider using ARIMA models.
Time shifting can sometimes be useful. It is not a good measurement of noise by itself, but it can help to improve model accuracy by shifting the data and extracting features (e.g. mean and variance over a window size, etc.).
There is nothing preventing you from trying the time-shifting approach, but you cannot rely on it as an accurate measurement, and you still need to support your statistical analysis with more robust techniques.
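To make the reviewer's concrete suggestion tangible, here is a rough bootstrap sketch; strategy_returns and bh_returns are placeholder arrays of out-of-sample returns, and i.i.d. resampling is assumed for simplicity (a block bootstrap would respect serial dependence):

import numpy as np

def sortino(returns, target=0.0):
    excess = returns - target
    downside = np.sqrt(np.mean(np.minimum(excess, 0.0) ** 2))  # downside deviation
    return excess.mean() / downside

def bootstrap_sortino(returns, n_boot=10000, seed=0):
    rng = np.random.default_rng(seed)
    n = len(returns)
    # resample with replacement; swap in a block bootstrap if returns are autocorrelated
    return np.array([sortino(returns[rng.integers(0, n, n)])
                     for _ in range(n_boot)])

strat_dist = bootstrap_sortino(strategy_returns)
bh_dist = bootstrap_sortino(bh_returns)
# rough overlap measure: how often a buy-and-hold draw beats a strategy draw
print(np.mean(bh_dist >= strat_dist))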

Predictions with ARIMA (python statsmodels)

I have some time series data which contains some seasonal trends and I want to use an ARIMA model to predict how this series will behave in the future.
In order to predict how my variable of interest (log_var) will behave I have taken a weekly, monthly and annual difference and then used these as the input to an ARIMA model.
Below is an example.
import numpy as np
from statsmodels.tsa.arima_model import ARIMA  # in current statsmodels: from statsmodels.tsa.arima.model import ARIMA

# the weekly, monthly and yearly differences enter as exogenous regressors
exog = np.column_stack([df_arima['log_var_diff_wk'],
                        df_arima['log_var_diff_mth'],
                        df_arima['log_var_diff_yr']])
model = ARIMA(df_arima['log_var'], exog=exog, order=(1, 0, 1))
results_ARIMA = model.fit()
I am doing this for several different data sources and in all of them I see great results, in the sense that if I plot log_var against results_ARIMA.fittedvalues for the training data then it matches very well (I tune p and q for each data source separately, but d is always 0 given that I have already taken the difference myself).
However, I then want to check what the predictions look like, and in order to do this I redefine exog to be just the 'test' dataset. For example, if I train the original ARIMA model on 2014-01-01 to 2016-01-01, the 'test' set would be 2016-01-01 onwards.
My approach has worked well for some data sources (in the sense that I plot the forecast against the known values and the trends look sensible) but badly for others, although they are all the same 'kind' of data and they have just been taken from different geographical locations. In some of the locations it completely fails to catch obvious seasonal trends that occur again and again in the training data on the same dates each year. The ARIMA model always fits the training data well, it just seems that in some cases the predictions are completely useless.
I am now wondering if I am actually following the correct procedure to predict values from the ARIMA model. My approach is basically:
# exogenous regressors for the forecast period only
exog = np.column_stack([df_arima_predict['log_val_diff_wk'],
                        df_arima_predict['log_val_diff_mth'],
                        df_arima_predict['log_val_diff_yr']])
# dynamic=False bases each prediction on observed lagged values rather than on earlier forecasts
arima_predict = results_ARIMA.predict(start=training_cut_date, end='2017-01-01',
                                      dynamic=False, exog=exog)
Is this the correct way to go about making predictions with ARIMA?
If so, is there a way I can try to understand why the predictions look very good in some datasets and terrible in others, when the ARIMA model seems to fit the training data just as well in both cases?
I have a similar problem at the moment which I have not entirely figured out yet. It seems that including multiple seasonal terms in Python is still a bit tricky. R does have this capacity, see here. So one suggestion I can give you is to try the more sophisticated functionality R provides for now (although that could require a large investment of time if you are not familiar with R yet).
Looking at your approach to modeling the seasonal patterns: taking nth-order difference scores does not give you seasonal constants, but rather some representation of the difference between the time points that you designate as seasonally related. If those differences are small, correcting for them might not have much impact on your modeling results, and in such cases the predictions might turn out fairly well. Conversely, if the differences are big, including them can easily distort the predictions. This could explain the variation you are seeing in your results. Conceptually, then, what you would want to do instead is represent the seasonal constants over time.
In the blog post referenced above, the author advocates the use of Fourier series to model the variance within each time period. Both the NumPy and SciPy packages offer routines for calculating the fast Fourier transform. However, as a non-mathematician I found it difficult to ascertain that the fast Fourier transform yielded the appropriate numbers.
In the end I opted to use the Welch signal decomposition from SciPy's signal module. What this does is return a spectral density estimate of your time series, from which you can deduce the signal strength at various frequencies.
If you identify the peaks in the spectral density analysis which correspond to the seasonal frequencies you are trying to account for in your time series, you can use their frequencies and amplitudes to construct sine waves representing the seasonal variations. You can then include these in your ARIMA as exogenous variables, much like the Fourier terms in the blog post.
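A rough sketch of this idea, assuming SciPy and a series y; the nperseg value and the peak-height threshold are arbitrary illustrative choices:

import numpy as np
from scipy import signal

freqs, psd = signal.welch(y, fs=1.0, nperseg=256)          # spectral density estimate
peaks, _ = signal.find_peaks(psd, height=psd.max() * 0.1)  # keep only prominent frequencies

# one sine and one cosine per detected frequency; amplitude and phase
# are then absorbed by the regression coefficients in the ARIMA
t = np.arange(len(y))
seasonal_terms = np.column_stack(
    [f(2 * np.pi * freqs[k] * t) for k in peaks for f in (np.sin, np.cos)])
# pass as exog: ARIMA(y, exog=seasonal_terms, order=(1, 0, 1))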
This is about as far as I have gotten myself at this point. Right now I am trying to figure out whether I can get the statsmodels ARIMA process to use these sine waves, which specify a seasonal trend, as exogenous variables in my model (the documentation specifies they should not represent trends, but hey, a guy can dream, right?). Edit: this blog post by Rob Hyndman is also highly relevant, and explains some of the rationale behind including Fourier terms.
Sorry I'm not able to give you a solution that's proven to be effective within Python, but I hope this gives you some new ideas to control for that pesky seasonal variance.
TL;DR:
It seems Python is not very well suited to handle multiple seasonal terms right now; R might be a better option (see reference);
Using difference scores to account for seasonal trends does not seem to capture the constant variation associated with the recurrence of the season;
One way to do this in Python could be to use Fourier series representing the seasonal trends (also see reference), which can be obtained using, among other methods, a Welch signal decomposition. How to use these as exogenous variables in an ARIMA to good effect remains an open question, though.
Best of luck,
Evert
p.s.: I'll update if I find a way to get this to work in Python

How exactly does BIC in the Augmented Dickey–Fuller test work in Python?

This question is about the Augmented Dickey–Fuller test implementation in the statsmodels.tsa.stattools Python library: adfuller().
In principle, AIC and BIC are supposed to compute the information criterion for a set of available models and pick the best one (the one with the lowest information loss).
But how do they operate in the context of Augmented Dickey–Fuller?
The thing I don't get: I set maxlag=30, and BIC chose lags=5 with some value of the information criterion. I then set maxlag=40, and BIC still chooses lags=5, but the value of the information criterion has changed! Why in the world would the information criterion for the same number of lags differ when maxlag is changed?
Sometimes this even changes the choice of model, e.g. BIC switches from lags=5 to lags=4 when maxlag is changed from 20 to 30, which makes no sense, since lags=4 was previously available as well.
When we request automatic lag selection in adfuller, the function needs to compare all models up to the given maxlag. For this comparison we need to use the same observations for all models. Because lagged observations enter the regressor matrix, we lose observations as initial conditions, corresponding to the largest lag included.
As a consequence, autolag uses nobs - maxlag observations for all models. For calculating the adfuller test statistic itself, we no longer need the model comparison, so we can use all observations available for the chosen lag, i.e. nobs - best_lag.
More generally, how to treat initial conditions (and differing numbers of initial conditions) is not always clear cut: autocorrelation and partial autocorrelation are largely based on using all available observations, full MLE for AR and ARMA models uses the stationary model to include the initial conditions, while conditional MLE or least squares drops them as necessary.
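You can see this directly from adfuller's return values (a sketch, assuming a series y; the third and fourth values are the chosen lag and the number of observations actually used):

from statsmodels.tsa.stattools import adfuller

for maxlag in (30, 40):
    adf_stat, pvalue, usedlag, nobs, crit, icbest = adfuller(y, maxlag=maxlag, autolag='BIC')
    # icbest changes with maxlag because the comparison sample is nobs - maxlag
    print(maxlag, usedlag, nobs, icbest)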

Python Statsmodels Testing Coefficients from Robust Linear Model based on M-Estimators

I have a linear model that I'm trying to fit to data with a good number of outliers in the endogenous variable, but not in the exogenous space. I've read that RLMs based on M-estimators are good in this situation.
When I fit an RLM to my data in the following way:
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf

# item is a categorical variable
modelspec = 'cost ~ np.log(units) + np.log(units):item + item'
results = smf.rlm(modelspec, data=dataset, M=sm.robust.norms.TukeyBiweight()).fit()
print(results.summary())
the summary shows a z statistic, and the coefficient significance tests seem to be based on this rather than on a t statistic. However, the following R manual (http://www.dst.unive.it/rsr/BelVenTutorial.pdf) shows the use of t statistics on pp. 19-21.
Two questions:
Can someone explain to me conceptually why statsmodels uses a z-test rather than a t-test?
All terms and interactions are highly significant in the results (|z| > 4). In most cases, each item has 40 or more observations, though some items have only 21-25. Is there reason to believe that RLM is not effective in a small-sample setting? The line it produces should be the best-fit line after down-weighting outliers, but is the z-test reliable for samples of this size? That is, is there reason to believe the confidence interval produced by smf.rlm() does not achieve 95% coverage? (I know this can potentially be an issue for t-tests.)
Thanks!
I have mostly only a general answer; I have never read any small-sample Monte Carlo studies for M-estimators.
To 1.
In many models, like M-estimators (RLM) or generalized linear models (GLM), we have only asymptotic results, except for maybe a few special cases. The asymptotic theory provides conditions under which the estimator is normally distributed. Given this, statsmodels defaults to using the normal distribution for all models outside the linear regression model (OLS) and similar, and the chisquare distribution instead of the F distribution for Wald tests with joint hypotheses.
There is some evidence that in many cases using the t or F distribution with appropriate choice of degrees of freedom provides a better small sample approximation to the distribution of the test statistic. This relies on Monte Carlo results and is not directly justified by the theory, as far as I know.
In the next release, and in the current development version, of statsmodels users can choose to use the t and F distribution for the results, instead of the normal and chisquare distribution. The defaults stay the same as they are now.
There are other cases where it is not clear whether the t distribution should be used, and if so, with which small-sample degrees of freedom. In many cases, statsmodels tries to follow the lead of Stata, for example for cluster-robust standard errors after OLS.
Another consequence is that sometimes equivalent models that are special cases of different models use different default assumptions on the distribution, both in Stata and in statsmodels.
I recently read the SAS documentation for M-estimators, and SAS is using the chisquare distribution, i.e. also the normal assumption, for the significance of the parameter estimates and for the confidence intervals.
To 2.
(see first sentence)
I think the same as for linear models also applies here. If the data is highly non-normal, then in small samples the tests could have incorrect size and the confidence intervals incorrect coverage. This can also be the case with some robust sandwich covariance estimators. On the other hand, if we don't use heteroscedasticity- or correlation-robust covariance estimators, then the tests can also be strongly biased.
For robust estimation methods like M-estimators, RLM, the effective sample size also depends on the number of inliers, or the weights assigned to the observations, not just the total number of observations.
For your case, I think the z-values and the sample size are large enough that, for example, using the t-distribution would not make them much less significant.
Comparing M-estimators with different norms and scale estimates would provide an additional check on the robustness of the assumptions about the outliers and on the choice of robust estimator. Another cross-check: does OLS with the outliers dropped (observations that receive small weights in the RLM estimate) give a similar answer?
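A sketch of those two cross-checks, reusing modelspec, dataset and the Tukey results from the question (the 0.5 weight cutoff is an arbitrary choice, and the weights are assumed to align positionally with dataset's rows):

# alternative norm as a robustness check on the choice of M-estimator
results_huber = smf.rlm(modelspec, data=dataset, M=sm.robust.norms.HuberT()).fit()
print(results_huber.params - results.params)  # compare coefficient estimates

# OLS on the observations the Tukey fit did not down-weight heavily
inliers = dataset[np.asarray(results.weights) > 0.5]
results_ols = smf.ols(modelspec, data=inliers).fit()
print(results_ols.summary())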
Finally, as a general caution:
The references on robust methods often warn that we should not use (outlier-)robust methods blindly. Using robust methods estimates a relationship based on "inliers". But is our discarding or down-weighting of outliers justified? Or, do we have missing non-linearities, missing variables, a mixture distribution or different regimes?
