I'm currently scratching my head about how I might implement a classic ARIMA(X) model using base TensorFlow (and optionally Keras). The equation I am attempting to set up has the following form:
$$\Delta^d y_t \;=\; c + \sum_{i=1}^{p} \phi_i\,\Delta^d y_{t-i} + \sum_{j=1}^{q} \theta_j\,\varepsilon_{t-j} + \sum_{k} \beta_k\,x_{k,t} + \varepsilon_t$$
Where d is the level of differencing applied to the observed input series, p is the auto-regressive order, and q is the moving-average order. The part that is currently stumping me is the calculation/estimation of the residuals epsilon. The auto-regressive portion is a simple linear regression on the lagged samples, and the same is true for the terms involving the exogenous series (X). When estimating the residuals, should I simply feed the q previous steps through the current parameter estimates and compute the residuals as y_true - y_predict? This also raises the question: how does one estimate the residuals for observations that have no previous observations? Do we simply draw residuals 0 through q from a chosen distribution with a set variance (e.g. Normal, Poisson, etc.) and a mean of 0?
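For concreteness, the kind of recursion I currently have in mind looks something like the rough sketch below (illustrative variable names, no exogenous terms or differencing yet). It pins the pre-sample residuals at their expected value of zero, in the spirit of a conditional sum-of-squares fit:

```python
# Rough sketch only: ARMA(p, q) residual recursion with trainable coefficients.
# Pre-sample residuals are fixed at their expectation (0); later residuals are
# computed recursively as y_true - y_pred so a loss can be built from `res`.
import numpy as np
import tensorflow as tf

p, q = 2, 2                              # example AR and MA orders
phi = tf.Variable(tf.zeros(p))           # AR coefficients phi_1..phi_p
theta = tf.Variable(tf.zeros(q))         # MA coefficients theta_1..theta_q
c = tf.Variable(0.0)                     # constant term

def arma_residuals(y):
    """y: 1-D array, already differenced d times."""
    y = np.asarray(y, dtype=np.float32)
    eps = [tf.constant(0.0)] * q         # residuals before the first usable step
    res = []
    for t in range(p, len(y)):
        ar_part = tf.reduce_sum(phi * y[t - p:t][::-1])
        ma_part = tf.reduce_sum(theta * tf.stack(eps[-q:])[::-1])
        e_t = y[t] - (c + ar_part + ma_part)
        eps.append(e_t)
        res.append(e_t)
    return tf.stack(res)

# loss = tf.reduce_mean(tf.square(arma_residuals(y_train)))  # then minimize with an optimizer
```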
I have looked at the source for the statsmodels package to try to understand it, but it is quite opaque. Part of the reason for implementing the model this way is that it needs to fit into a fairly standard ecosystem at the company I work for, and we need control over what slices of data the model is fitted to at a given time step. This is because some data may arrive (much) later than the time stamp it relates to, due to lag at the source etc.
Thank you for any help you might be able to offer.
I received feedback on my paper about stock market forecasting with machine learning, and the reviewer asked the following:
I would like you to statistically test the out-of-sample performance
of your methods. Hence 'differ significantly' in the original wording.
I agree that some of the figures look awesome visually, but visually,
random noise seems to contain patterns. I believe Sortino Ratio is the
appropriate statistic to test, and it can be tested by using
bootstrap. I.e., a distribution is obtained for both BH and your
strategy, and the overlap of these distributions is calculated.
My problem is that I have never done that for time series data. My validation procedure uses a walk-forward strategy, where I shift the data in time 11 times, generating 11 different training/test combinations with no overlap. So, here are my questions:
1- What would be the best (or most appropriate) statistical test to use, given what the reviewer is asking?
2- If I remember correctly, statistical tests require vectors as input, is that right? Can I generate a vector containing the 11 Sortino ratios (one for each walk) and then compare it with the baselines, or should I run my code more than once? I am afraid the latter would be unfeasible given the short time to review.
So, what would be the correct way to statistically compare machine learning approaches in this time series scenario?
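For context, my reading of the reviewer's bootstrap suggestion is something along these lines (just a sketch; `bh_returns` and `strategy_returns` are placeholder names for aligned arrays of out-of-sample period returns, not my actual code):

```python
# Sketch of the bootstrap the reviewer describes: resample period returns, compute
# the Sortino ratio for buy-and-hold (BH) and for the ML strategy, and compare the
# two bootstrap distributions.
import numpy as np

def sortino_ratio(returns, target=0.0):
    """Mean excess return over the target divided by downside deviation."""
    excess = returns - target
    downside = excess[excess < 0]
    downside_dev = np.sqrt(np.mean(downside ** 2)) if downside.size else np.nan
    return np.mean(excess) / downside_dev

def bootstrap_sortino(returns, n_boot=10000, seed=0):
    rng = np.random.default_rng(seed)
    n = len(returns)
    return np.array([sortino_ratio(returns[rng.integers(0, n, size=n)])
                     for _ in range(n_boot)])

# boot_bh = bootstrap_sortino(bh_returns)
# boot_ml = bootstrap_sortino(strategy_returns)
# np.mean(boot_ml <= np.percentile(boot_bh, 95))  # crude look at the overlap
```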
By pointing out that random noise seems to contain patterns, the reviewer means that your plots may show nice-looking patterns, but these might just be random noise following some distribution (e.g. uniform random noise), which makes the visual evidence less convincing. It might be a good idea to split the data into k groups at random and then apply a Z-test or t-test, comparing the k groups pairwise.
The reviewer points to the Sortino ratio, which seems a somewhat ambiguous choice given that you are building a machine learning model for a forecasting task: what you actually care about is forecasting accuracy and reliability, which can be assessed using cross-validation (in convex optimization the equivalent would be a sensitivity analysis).
Update
The problem of serial dependence in time series data arises when the series is non-stationary (weak patterns), which does not appear to be the case for your data. Even if it were, it could be addressed by removing trends, i.e. converting the non-stationary series into a stationary one (checked with an ADF test, for example), and you might also consider using ARIMA models.
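For example, a quick stationarity check with the ADF test (a sketch assuming a pandas Series named `y` and the statsmodels `adfuller` function) could look like:

```python
# Sketch: Augmented Dickey-Fuller test; difference once if we fail to reject the
# unit-root null. `y` is assumed to be a pandas Series of the observed values.
from statsmodels.tsa.stattools import adfuller

adf_stat, p_value, *_ = adfuller(y.dropna())
if p_value > 0.05:
    y_stationary = y.diff().dropna()   # remove the trend by first differencing
else:
    y_stationary = y
```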
Time shifting can sometimes be useful, but it is not considered a good measurement of noise; it might, however, help improve model accuracy by shifting the data and extracting features (e.g. mean and variance over a window size, etc.).
There is nothing preventing you from trying the time-shifting approach, but you cannot rely on it as an accurate measurement; you still need to back up your statistical analysis with more robust techniques.
Having come across ARIMA/seasonal ARIMA recently, I am wondering why the AIC is chosen as a criterion for how applicable a model is. According to Wikipedia, it evaluates the goodness of fit while penalizing non-parsimonious models in order to prevent overfitting. Many grid-search functions such as auto_arima in Python or R use it as an evaluation metric and suggest the model with the lowest AIC as the best fit.
However, in my case, choosing a simple model (lowest AIC -> small number of parameters) just results in a model that closely follows previous in-sample observations and performs very badly on the test data. I don't see how overfitting is prevented just by choosing a small number of parameters...
ARIMA(1,0,1)(0,0,0,53); AIC=-16.7
Am I misunderstanding something? What could be a workaround to prevent this?
In the case of an ARIMA model, whatever the parameters are, it will follow past observations, in the sense that you predict the next values given the previous values of your data. Now, auto.arima just tries a number of models and gives you the one with the lowest AIC by default, or with the lowest value of some other information criterion, e.g. BIC. This does not mean anything more than what the definition of those criteria says: the model with the lowest AIC is simply the one that minimizes the AIC function. For time series analysis, after you make sure the series is stationary, I would recommend that you examine the ACF and PACF plots of your time series and read this.
P.S. I don't understand the straight orange line in your plot after the dashed vertical line.
We usually use some form of cross-validation to protect against overfitting. It is well known that leave-one-out cross-validation is asymptotically equivalent to AIC under some assumptions about normality etc. Indeed, back when we had less computing power, AIC and other information criteria were handy exactly because they accomplish something very similar to cross-validation analytically.
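If computing power allows, a direct out-of-sample check along the same lines could look roughly like this (a sketch with a hypothetical helper, assuming `y` is a 1-D numpy array and the current statsmodels ARIMA API):

```python
# Sketch: expanding-window, one-step-ahead forecast errors for a candidate order.
# This is the cross-validation idea that AIC approximates analytically.
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

def rolling_one_step_mse(y, order, start=100):
    errors = []
    for t in range(start, len(y) - 1):
        fit = ARIMA(y[:t + 1], order=order).fit()     # refit on data up to time t
        errors.append(y[t + 1] - fit.forecast(1)[0])  # score the next, unseen point
    return np.mean(np.square(errors))

# compare e.g. rolling_one_step_mse(y, (1, 0, 1)) with richer candidate orders
```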
Also, note that by their nature ARMA(1,1) models -- and other stationary ARMA models for that matter -- tend to converge to a constant rather quickly. The easiest way to see this is to write down the expressions for y_t+1, y_t+2 as a function of y_t. You will see that the expression involves powers of numbers less than 1 in absolute value (your AR and MA parameters), which quickly go to zero as the forecast horizon grows. Also see this discussion.
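To make this concrete: for an ARMA(1,1) with mean $\mu$, AR parameter $\phi$ and MA parameter $\theta$, the $h$-step-ahead forecast is

$$\hat{y}_{t+h\mid t} \;=\; \mu + \phi^{h}\,(y_t - \mu) + \phi^{h-1}\,\theta\,\varepsilon_t, \qquad h \ge 1,$$

which decays geometrically towards the constant $\mu$ whenever $|\phi| < 1$.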
The reason why your 'observed' data (to the left of the dashed line) does not exhibit this behaviour is that for each period you get a new realisation of the random error term epsilon_t. On the right-hand side, you do not get these realisations of random shocks; instead, they are replaced by their expected value, 0.
I am building a linear regression model on a car dataset using the RFE technique and the statsmodels library. My final model has p-values well within 5% and a high F-statistic. The VIF values for the predictor variables are well below 5, but for the constant term (intercept) the VIF is 8.18. I have used the add_constant method to add the constant to the model. The following are my doubts:
What does a high VIF for the constant indicate?
Should I ignore the constant term while calculating VIF?
These are my results:
I am new to machine learning and am posting a question on this site for the first time. Kindly let me know if any more information is needed to answer my question.
Statistical questions are better asked on stats.stackexchange. However, I just went through this for statsmodels, e.g. https://github.com/statsmodels/statsmodels/issues/2376
First, there is no multicollinearity problem in your model and data. The p-values are low and the confidence intervals are pretty narrow, so the parameter estimates in the model should be good. A VIF of 8 is not large.
A large VIF for the constant indicates that the (slope) explanatory variables also have a large constant component. An example would be a variable that has a large mean but only a small variance. An example of perfect collinearity with the constant, and rank deficiency of the design matrix, is the dummy variable trap: if we do not remove one of the levels of a categorical variable in dummy encoding, the dummies sum to 1 and therefore replicate a constant.
The purpose of including the constant in the VIF computation is to discover this kind of problem with the design matrix exog provided by the user. It would not show up if we computed the VIF on demeaned or standardized explanatory variables.
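As an illustration (a sketch; the column names are hypothetical), the VIFs including the constant can be computed on the same design matrix that is passed to the model:

```python
# Sketch: VIFs computed on the full design matrix, including the add_constant column.
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

X = sm.add_constant(df[["engine_size", "curb_weight"]])   # hypothetical predictors
vifs = pd.Series([variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
                 index=X.columns)
print(vifs)  # a large VIF on "const" hints at predictors with large means relative to their variance
```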
There has been a long-standing debate in statistics and econometrics about whether multicollinearity measures should include a constant or work only with demeaned explanatory variables.
I am currently preparing an extension to statsmodels that gives users the option to compute both versions, with and without constant.
In some cases reparameterization, demeaning and scaling, can improve numerical precision and prediction. So we want to have measures that check the actual design matrix provided by users, but also check a standardized version of the data to see whether demeaning and scaling might improve numerical precision.
I am implementing an anomaly detection system that will be used on different time series (one observation every 15 min for a total of 5 months). All these time series have a common pattern: high levels during working hours and low levels otherwise.
The idea presented in many papers is the following: build a model to predict future values and calculate an anomaly score based on the residuals.
What I have so far
I use an LSTM to predict the next time step given the previous 96 (one day of observations), and then I calculate the anomaly score as the likelihood that the residual comes from one of two normal distributions fitted on the residuals obtained from the validation set. I use two different distributions, one for working hours and one for non-working hours.
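For reference, the setup is roughly the following (a simplified sketch of what I have, not the exact code):

```python
# Simplified sketch of the current setup: an LSTM predicts the next 15-min value
# from the previous 96 observations; residuals on held-out data feed the anomaly score.
import numpy as np
import tensorflow as tf

WINDOW = 96   # one day of 15-minute observations

model = tf.keras.Sequential([
    tf.keras.Input(shape=(WINDOW, 1)),
    tf.keras.layers.LSTM(64),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")

def make_windows(series):
    """Slice a 1-D array into (window, next value) pairs."""
    X = np.stack([series[i:i + WINDOW] for i in range(len(series) - WINDOW)])
    return X[..., None], series[WINDOW:]

# X_train, y_train = make_windows(train_series)
# model.fit(X_train, y_train, epochs=10, batch_size=64)
# residuals = y_val - model.predict(X_val).ravel()
```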
The model detects very well point anomalies, such as sudden falls and peaks, but it fails during holidays, for example.
If a holiday falls during the week, I expect my model to detect more anomalies, because it is an unusual daily pattern with respect to a normal working day.
But the predictions simply follow the previous observations.
My solution
Use a second and more lightweight model (based on time series decomposition) which is fed with daily aggregations instead of 15min aggregations to detect daily anomalies.
The question
This combination of two models lets me catch both kinds of anomalies, and it works very well, but my idea was to use only one model, because I expected the LSTM to be able to "learn" the weekly pattern as well. Instead, it strictly follows the previous time steps without taking into account that it is a working hour and the level should be much higher.
I tried adding exogenous variables to the input (hour of day, day of week) and adding layers and cells, but the situation is not much better.
Any consideration is appreciated.
Thank you
A note on your current approach
Training with MSE is equivalent to optimizing the likelihood of your data under a Gaussian with fixed variance and mean given by your model. So you are already training an autoencoder, though you do not formulate it so.
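In symbols, for a prediction $\hat{y}_t$ and a Gaussian with fixed variance $\sigma^2$,

$$-\log \mathcal{N}\!\left(y_t \mid \hat{y}_t, \sigma^2\right) = \frac{(y_t - \hat{y}_t)^2}{2\sigma^2} + \frac{1}{2}\log\!\left(2\pi\sigma^2\right),$$

so minimizing the MSE is the same as maximizing this likelihood, up to terms that do not depend on the model.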
About the things you do
You don't give the LSTM a chance
Since you provide data from last 24 hours only, the LSTM cannot possibly learn a weekly pattern.
It could at best learn that the value should be similar to what it was 24 hours before (though this is very unlikely, see the next point) -- and then you break that with the Fri-Sat and Sun-Mon data. From the LSTM's point of view, your holiday 'anomaly' looks pretty much the same as the weekend data you were providing during training.
So you would first need to provide longer contexts during learning (I assume that you carry the hidden state on during test time).
Even if you gave it a chance, it wouldn't care
Assuming that your data really follows a simple pattern -- high value during and only during working hours, plus some variations of smaller scale -- the LSTM doesn't need any long-term knowledge for most of the datapoints. Putting in all my human imagination, I can only envision the LSTM benefiting from long-term dependencies at the beginning of the working hours, so just for one or two samples out of the 96.
So even if the loss at those points would like to backpropagate through more than 7 * 96 timesteps to learn about your weekly pattern, there are 7 * 95 other loss terms that are likely to keep the LSTM from deviating from its current local optimum.
Thus it may help to weight the samples at the beginning of working hours more, so that the respective loss can actually influence representations from far history.
Your solution is a good thing
It is difficult to model sequences at multiple scales in a single model. Even you, as a human, need to "zoom out" to judge longer trends -- that's why all the Wall Street people have Month/Week/Day/Hour/... charts to watch their shares' prices on. Such multiscale modeling is especially difficult for an RNN, because it needs to process all the information, always, with the same weights.
If you really want one model to learn it all, you may have more success with deep feedforward architectures employing some sort of time-convolution, e.g. TDNNs, Residual Memory Networks (disclaimer: I'm one of the authors), or the recent one-architecture-to-rule-them-all, WaveNet. As these have skip connections over longer temporal contexts and apply different transformations at different levels, they have better chances of discovering and exploiting such an unexpected long-term dependency.
There are implementations of WaveNet in Keras lying around on GitHub, e.g. 1 or 2. I did not play with them (I actually moved away from Keras some time ago), but especially the second one seems really easy, with the AtrousConvolution1D.
If you want to stay with RNNs, Clockwork RNN is probably the model to fit your needs.
About things you may want to consider for your problem
So are there two data distributions?
This one is a bit philosophical.
Your current approach shows that you have a very strong belief that there are two different setups: workhours and the rest. You're even OK with changing part of your model (the Gaussian) according to it.
So perhaps your data actually comes from two distributions and you should therefore train two models and switch between them as appropriate?
Given what you have told us, I would actually go for this one (to have a theoretically sound system). You cannot expect your LSTM to learn that there will be low values on Dec 25. Or that there is a deadline and this weekend consists purely of working hours.
Or are there two definitions of anomaly?
One philosophical point more. Perhaps you personally consider two different types of anomaly:
A weird temporal trajectory, unexpected peaks, oscillations, whatever is unusual in your domain. Your LSTM supposedly handles these already.
And then there is a different notion of anomaly: a value outside certain bounds in certain time intervals. Perhaps a simple linear regression / small MLP from time to value would do here?
Let the NN do all the work
Currently, you effectively model the distribution of your quantity in two steps: First, the LSTM provides the mean. Second, you supply the variance.
You might instead let your NN (together with two additional affine transformations) directly provide you with a complete Gaussian by producing its mean and variance, much like in Variational AutoEncoders (https://arxiv.org/pdf/1312.6114.pdf, appendix C.2). Then you need to directly optimize the likelihood of the following sample under this NN-produced distribution, rather than just the MSE between the sample and the NN output.
This will allow your model to tell you when it is very strict about the following value and when "any" sample will be OK.
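A minimal sketch of this idea in Keras (assumed layer sizes, not a prescription):

```python
# Sketch: the network outputs a mean and a log-variance for the next value and is
# trained on the Gaussian negative log-likelihood instead of plain MSE.
import tensorflow as tf

WINDOW = 96

inputs = tf.keras.Input(shape=(WINDOW, 1))
h = tf.keras.layers.LSTM(64)(inputs)
mean = tf.keras.layers.Dense(1)(h)        # predicted mean of the next value
log_var = tf.keras.layers.Dense(1)(h)     # predicted log-variance (unconstrained)
model = tf.keras.Model(inputs, tf.keras.layers.Concatenate()([mean, log_var]))

def gaussian_nll(y_true, y_pred):
    mu, log_var = y_pred[:, :1], y_pred[:, 1:]
    return tf.reduce_mean(0.5 * (log_var + tf.square(y_true - mu) / tf.exp(log_var)))

model.compile(optimizer="adam", loss=gaussian_nll)
# anomaly score: likelihood of the observed value under N(mu, exp(log_var))
```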
Note, that you can take this approach further and have your NN produce "any" suitable distribution. E.g. if your data live in-/can be sensibly transformed to- a limited domain, you may try to produce a Categorical distribution over the space by having a Softmax on the output, much like WaveNet does (https://arxiv.org/pdf/1609.03499.pdf, Section 2.2).
I have some time series data which contains some seasonal trends and I want to use an ARIMA model to predict how this series will behave in the future.
In order to predict how my variable of interest (log_var) will behave I have taken a weekly, monthly and annual difference and then used these as the input to an ARIMA model.
Below is an example.
import numpy as np
from statsmodels.tsa.arima.model import ARIMA  # assuming the statsmodels ARIMA implementation

exog = np.column_stack([df_arima['log_var_diff_wk'],
                        df_arima['log_var_diff_mth'],
                        df_arima['log_var_diff_yr']])
model = ARIMA(df_arima['log_var'], exog=exog, order=(1, 0, 1))
results_ARIMA = model.fit()
I am doing this for several different data sources and in all of them I see great results, in the sense that if I plot log_var against results_ARIMA.fittedvalues for the training data then it matches very well (I tune p and q for each data source separately, but d is always 0 given that I have already taken the difference myself).
However, I then want to check what the predictions look like, and in order to do this I redefine exog to be just the 'test' dataset. For example, if I train the original ARIMA model on 2014-01-01 to 2016-01-01, the 'test' set would just be 2016-01-01 onwards.
My approach has worked well for some data sources (in the sense that I plot the forecast against the known values and the trends look sensible) but badly for others, although they are all the same 'kind' of data and they have just been taken from different geographical locations. In some of the locations it completely fails to catch obvious seasonal trends that occur again and again in the training data on the same dates each year. The ARIMA model always fits the training data well, it just seems that in some cases the predictions are completely useless.
I am now wondering if I am actually following the correct procedure to predict values from the ARIMA model. My approach is basically:
exog = np.column_stack([df_arima_predict['log_val_diff_wk'],
                        df_arima_predict['log_val_diff_mth'],
                        df_arima_predict['log_val_diff_yr']])
arima_predict = results_ARIMA.predict(start=training_cut_date, end='2017-01-01',
                                      dynamic=False, exog=exog)
Is this the correct way to go about making predictions with ARIMA?
If so, is there a way I can try to understand why the predictions look very good in some datasets and terrible in others, when the ARIMA model seems to fit the training data just as well in both cases?
I have a similar problem at the moment which I have not entirely figured out yet. It seems that including multiple seasonal terms in Python is still a bit tricky. R does seem to have this capacity, see here. So, one suggestion I can give you is to try this with the more sophisticated functionality R provides for now (although that could require a large investment of time if you are not yet familiar with R).
Looking at your approach for modeling the seasonal patterns, taking the nth order difference scores does not give you seasonal constants, but rather some representation of the difference between the time points that you designate as seasonally related. If those differences are small, correcting for them might not have much impact on your modeling results. In such cases, model prediction might turn out fairly well. Conversely, if the differences are big, including them can easily distort prediction results. This could explain the variation you are seeing in your modeling results. Conceptually, then, what you'd want to do instead is represent the constants over time.
In the blog post referenced above, the author advocates the use of Fourier series to model the variance within each time period. Both the NumPy and SciPy packages offer routines for calculating the fast Fourier transform. However, as a non-mathematician I found it difficult to ascertain that the fast Fourier transform yielded the appropriate numbers.
In the end I opted to use the Welch signal decomposition from SciPy's signal module. What this does is return a spectral density analysis of your time series, from which you can deduce the signal strength at various frequencies in your time series.
If you identify the peaks in the spectral density analysis which correspond to the seasonal frequencies you are trying to account for in your time series, you can use their frequencies and amplitudes to construct sine waves representing the seasonal variations. You can then include these in your ARIMA as exogenous variables, much like the Fourier terms in the blog post.
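To illustrate that last step (a rough sketch, not code I have validated end-to-end; the column name is taken from your question):

```python
# Sketch: estimate the spectral density with Welch's method, keep the strongest
# peaks, and turn them into sine/cosine regressors to pass to ARIMA via `exog`.
import numpy as np
from scipy.signal import welch

y = df_arima['log_var'].values
freqs, power = welch(y, fs=1.0)            # frequencies in cycles per observation
freqs, power = freqs[1:], power[1:]        # drop the zero-frequency (mean) component
top = freqs[np.argsort(power)[-3:]]        # e.g. the three strongest peaks

t = np.arange(len(y))
seasonal_exog = np.column_stack([np.sin(2 * np.pi * f * t) for f in top] +
                                [np.cos(2 * np.pi * f * t) for f in top])
# seasonal_exog can then be passed as the `exog` argument of the ARIMA call above
```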
This is about as far as I have gotten myself at this point - right now I am trying to figure out whether I can get the statsmodels ARIMA process to use these sine waves, which specify a seasonal trend, as exogenous variables in my model (the documentation specifies they should not represent trends, but hey, a guy can dream, right?). Edit: this blog post by Rob Hyndman is also highly relevant, and explains some of the rationale behind including Fourier terms.
Sorry I'm not able to give you a solution that's proven to be effective within Python, but I hope this gives you some new ideas to control for that pesky seasonal variance.
TL;DR:
It seems Python is not very well suited to handling multiple seasonal terms right now; R might be a better solution (see reference);
Using difference scores to account for seasonal trends seems not to capture the constant variance associated with the recurrence of the season;
One way to do this in python could be to use Fourier series representing seasonal trends (also see reference), which can be obtained using, among other ways, a Welch signal decomposition. How to use these as exogenous variables in an ARIMA to good effect is an open question, though.
Best of luck,
Evert
p.s.: I'll update if I find a way to get this to work in Python