I have time series data from which I need to remove the trend and seasonal components. I was wondering whether I could use the seasonal_decompose() function in Python and extract the residual as follows:
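from statsmodels.tsa.seasonal import seasonal_decompose
# `period` is the current name of this argument; older statsmodels releases called it `freq`
result = seasonal_decompose(self.series, model='additive', period=frequency)
residual = result.resid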
Or should I apply well-known detrending and deseasonalizing methods (such as differencing)? And if I were to apply such methods, should I detrend first and then deseasonalize, or vice versa?
As the no-free-lunch theorem suggests, no universal model beats all other models on every kind of data. You should definitely try differencing and seasonal ARIMA in addition to the seasonal decomposition you've already tried. The criterion for model selection is the model's performance on your data. With ARIMA, you don't need to detrend beforehand, since the differencing order d handles the trend inside the model. Check out this comprehensive tutorial.
Related
I'm currently scratching my head about how I might implement a classic ARIMA(X) model using base TensorFlow (and optionally Keras). The equation I am attempting to set up has the following form:
Here d represents the level of differencing applied to the input observed time series, p is the auto-regressive order, and q is the moving average order. The part which is stumping me currently is the calculation/estimation of the residuals epsilon. The auto-regression portion is a simple linear regression on the lagged samples, and the same is true for the terms involving the exogenous series (X). When I am estimating the residuals, should I simply feed the q previous steps into the current estimated parameters, and compute the residuals as y_true-y_predict? This also raises the question: how does one estimate the residuals for observations that have no previous observations? Do we simply draw residuals 0 through q from a chosen random distribution of set variance (e.g. Normal, Poisson, etc.) with a mean of 0?
I have looked at the source for the statsmodels package to try to understand it, but it is quite opaque. Part of the reason for implementing the model this way is that it needs to fit into a fairly standard ecosystem at the company I work for, and we need control over what slices of data the model is fitted to at a given time step. This is because some data may arrive (much) later than the time stamp it relates to, due to lag at the source etc.
Thank you for any help you might be able to offer.
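One common convention for the start-up problem, offered here as a sketch rather than as what statsmodels does internally, is the conditional approach: treat the pre-sample residuals as zero (their expected value) instead of drawing them at random, then compute every later residual recursively as the observation minus the one-step prediction. The helper name `arma_residuals` is hypothetical, and differencing and exogenous terms are left out to keep the recursion visible.

```python
import numpy as np

def arma_residuals(y, phi, theta, const=0.0):
    """Residuals of an ARMA(p, q) model by recursion (hypothetical helper).

    Pre-sample observations and residuals are treated as zero, which is an
    assumption of this sketch; another option is to start the loop at t = p.
    """
    y = np.asarray(y, dtype=float)
    p, q = len(phi), len(theta)
    eps = np.zeros(len(y))
    for t in range(len(y)):
        pred = const
        for i in range(1, p + 1):        # auto-regressive part
            if t - i >= 0:
                pred += phi[i - 1] * y[t - i]
        for j in range(1, q + 1):        # moving-average part uses earlier residuals
            if t - j >= 0:
                pred += theta[j - 1] * eps[t - j]
        eps[t] = y[t] - pred             # residual = y_true - y_predict
    return eps

# Simulate an ARMA(1, 1) series from known shocks, then recover the shocks.
rng = np.random.default_rng(0)
phi, theta = [0.6], [0.4]
shocks = rng.normal(size=200)
y = np.zeros(200)
for t in range(200):
    y[t] = shocks[t]
    if t >= 1:
        y[t] += phi[0] * y[t - 1] + theta[0] * shocks[t - 1]
recovered = arma_residuals(y, phi, theta)
```

Because each residual depends on the previous ones, the computation is inherently sequential; in a TensorFlow implementation the loop over t would have to be expressed as a recurrence, with the sum of squared residuals as the loss.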
Having come across ARIMA/seasonal ARIMA recently, I am wondering why the AIC is chosen as an estimator for the applicability of a model. According to Wikipedia, it evaluates the goodness of the fit while punishing non-parsimonious models in order to prevent overfitting. Many grid search functions such as auto_arima in Python or R use it as an evaluation metric and suggest the model with the lowest AIC as the best fit.
However, in my case, choosing a simple model (with the lowest AIC -> small number of parameters) just results in a model that closely follows previous in-sample observations and performs very badly on the test sample data. I don't see how overfitting is prevented just by choosing a small number of parameters...
[Plot: ARIMA(1,0,1)(0,0,0,53); AIC=-16.7]
Am I misunderstanding something? What could be a workaround to prevent this?
With an ARIMA model, whatever its parameters are, the fit will follow past observations, in the sense that you predict the next values from previous values of your data. Now, auto.arima just tries a number of models and returns the one with the lowest AIC by default, or the lowest value of some other information criterion, e.g. BIC. This means nothing more than what those criteria are defined to measure: the model with the lowest AIC is simply the one that minimizes the AIC function. For time series analysis, once you have made sure that the series is stationary, I would recommend that you examine the ACF and PACF plots of your time series and read this.
P.S. I don't understand the straight orange line in your plot after the dashed vertical line.
We usually use some form of cross-validation to protect against overfitting. It is well known that leave-one-out cross-validation is asymptotically equivalent to AIC under some assumptions about normality etc. Indeed, back when we had less computing power, AIC and other information criteria were handy exactly because they accomplish something very similar to cross-validation analytically.
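For time series, the cross-validation has to respect time order. A rough sketch of one such scheme (rolling-origin, one-step-ahead evaluation) follows. To keep it self-contained it fits plain AR(p) models by least squares instead of full ARIMA models, and the helper names and the simulated AR(1) data are assumptions for illustration.

```python
import numpy as np

def fit_ar(y, p):
    """Least-squares fit of an AR(p) model with intercept; returns [c, phi_1..phi_p]."""
    n = len(y)
    X = np.column_stack([np.ones(n - p)] + [y[p - i:n - i] for i in range(1, p + 1)])
    coef, *_ = np.linalg.lstsq(X, y[p:], rcond=None)
    return coef

def rolling_origin_mse(y, p, initial):
    """One-step-ahead out-of-sample MSE, refitting on an expanding window."""
    errors = []
    for end in range(initial, len(y)):
        coef = fit_ar(y[:end], p)            # only data before `end` is used
        lags = y[end - p:end][::-1]          # y[end-1], ..., y[end-p]
        pred = coef[0] + coef[1:] @ lags
        errors.append(y[end] - pred)
    return float(np.mean(np.square(errors)))

# Simulated AR(1) data, then out-of-sample scores for a few candidate orders.
rng = np.random.default_rng(1)
y = np.zeros(300)
for t in range(1, 300):
    y[t] = 0.7 * y[t - 1] + rng.normal()
scores = {p: rolling_origin_mse(y, p, initial=200) for p in (1, 2, 5)}
```

The candidate with the lowest out-of-sample error plays the role that the lowest AIC plays in an automated search, at the cost of refitting the model many times.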
Also, note that by their nature ARMA(1,1) models, and other stationary ARMA models for that matter, produce forecasts that converge to a constant rather quickly. The easiest way to see this is to write down the expressions for y_t+1, y_t+2 as a function of y_t. You will see that the expressions contain powers of numbers smaller than 1 in absolute value (your AR and MA parameters), which quickly converge to zero as the forecast horizon grows. Also see this discussion.
The reason why your 'observed' data (to the left of the dashed line) does not exhibit this behaviour is that for each period you get a new realisation of the random error term epsilon_t. On the right-hand side, you do not get these realisations of random shocks; instead they are replaced with their expected value, 0.
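The convergence can be checked numerically. The sketch below writes out multi-step forecasts for an ARMA(1,1) in mean-adjusted form, with future shocks replaced by their expected value of zero. The function name and the parameter values are made up for illustration.

```python
import numpy as np

def arma11_forecast(y_last, eps_last, phi, theta, mu, steps):
    """Forecasts of y_t - mu = phi * (y_{t-1} - mu) + eps_t + theta * eps_{t-1}."""
    forecasts = np.empty(steps)
    # One step ahead: the last residual is known, the future shock is replaced by 0.
    forecasts[0] = mu + phi * (y_last - mu) + theta * eps_last
    # Further ahead: every shock is replaced by 0, so only the AR part remains.
    for h in range(1, steps):
        forecasts[h] = mu + phi * (forecasts[h - 1] - mu)
    return forecasts

f = arma11_forecast(y_last=5.0, eps_last=1.0, phi=0.6, theta=0.3, mu=2.0, steps=20)
gap = f - 2.0  # distance from the mean shrinks by a factor phi at every step
```

Here the distance from the mean is multiplied by phi = 0.6 at each step, so after 20 steps the forecast is practically flat at the mean, which matches the straight line to the right of the dashed line in the plot.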
I am working on a Python program for time series forecasting of the number of events by date. For prediction I use an ARIMA model. I now have some results, but the predicted values are not very good.
First, I tried to make my time series stationary. I checked stationarity with the Dickey-Fuller test (0.5), then applied a Box-Cox transformation and checked the Dickey-Fuller value again (0.3). Then I tried first-order differencing. I didn't get good results.
My question is how to deal with a non-stationary time series. Which methods should I use to make it stationary?
Many time series problems are intrinsically difficult, if not unlearnable -- especially if one wants to prevent overfitting and have some predictive power. If results are poor with a simple model, they aren't likely to be leaps and bounds better with a more complicated model.
Your first step ought to be incorporating external data sources and coming up with a theoretical model for your predictive task. Training a model on those stronger-signaled inputs should work better than on your raw data (if the task is learnable).
You can explore a number of methods to handle non stationarity of time series below:
https://medium.com/analytics-vidhya/preprocessing-for-time-series-forecasting-3a331dbfb9c2?source=friends_link&sk=30aac82f09efbbe8f1b6549a8e367575
Key Points (for making stationary time series):
Self Lag Differencing: the difference between the present series and a lagged version of the series. The shift can be of order 1, 2, 3, 4, etc. For items that have no lagged counterpart, take the result as NULL.
Example: let your dataframe be 'Time' and the column with values be 'Temperature', indexed on date. Self differencing can then be done like this:
Time['Temperature_Diff'] = Time['Temperature'] - Time['Temperature'].shift(1)
if the lag used is 1
Time['Temperature_Diff'] = Time['Temperature'] - Time['Temperature'].shift(2)
if the lag used is 2
Log Self Differencing: the same difference between the present series and its lagged version, but taken after applying a log transformation to the actual series.
Use statsmodels.tsa.seasonal.seasonal_decompose and it will give you three components: Trend, Seasonality and Residuals. Take these residuals as the stationary time series for forecasting.
P.S. I am the author of the blog post.
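A runnable version of the two differencing variants above, on a small made-up temperature table (the values are invented; the names 'Time' and 'Temperature' follow the example). `Series.diff(k)` is used as a shorthand for subtracting `shift(k)`.

```python
import numpy as np
import pandas as pd

# Toy dataframe indexed on date (illustrative values only).
dates = pd.date_range("2020-01-01", periods=6, freq="D")
Time = pd.DataFrame({"Temperature": [20.0, 22.0, 21.0, 24.0, 26.0, 25.0]}, index=dates)

# Self lag differencing with lag 1; the first row has no lagged value and becomes NaN.
Time["Temperature_Diff"] = Time["Temperature"].diff(1)

# The explicit shift form gives the same column.
explicit = Time["Temperature"] - Time["Temperature"].shift(1)

# Log self differencing: take logs first, then difference (needs strictly positive values).
Time["Temperature_LogDiff"] = np.log(Time["Temperature"]).diff(1)
```

The log difference is approximately the relative change between consecutive observations, which is why it is often preferred when the variance grows with the level of the series.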
I have a time series (daily) dataset consisting of 1 label (integer) and 15 features over 5 years. I have no idea about the meaning of features, but I have to predict the labels based on those features.
To do so, first, I used the autocorrelation_plot from pandas.tools.plotting to figure out if I have any seasonality in my label (y) or not. Please see the figure below:
Then I used seasonal_decompose to find seasonal, trend and residual of my label (y) by sweeping the Freq parameter:
1) Could you please let me know which Freq is OK, and why?
2) What would be the next step? Do I need to remove both trend and seasonal terms from the data and then try to model and predict the residual factor by regression (e.g., SVR, linear, etc.)? Or do I need to predict the whole data (without removing seasonal and trend) by regression? I tried to predict the whole data (without removing seasonal and trend) by several regression techniques but the results are very bad. Finally, how can I predict the seasonal term at the end? Is ARIMA OK? What about the trend?
3) Am I on the right track (extracting seasonal, etc.), or should I consider the "date" as a feature besides the other 15 features, such as:
hour of the day (24 boolean features)
day of the week (7 boolean features)
day of the month (up to 31 boolean features)
month (12 boolean features)
year
Let me explain to you how seasonality is usually treated.
Most of the time, people try to extract a seasonal component and work with the corrected series for analysis. In North America, statistical agencies apply a sequence of symmetric moving average filters to estimate seasonal, trend-cycle and irregular components, and seasonally adjusted data correspond to the data minus the estimated seasonal component. Usually, they also provide raw data in other tables and, sometimes, they also provide the trend-cycle in yet other tables. In Australia, they prefer to present trend-cycles.
In Europe, decomposition is usually based upon a model: they specify an ARIMA model with seasonal components (it allows for integrated seasonal components, moving average components in the seasonal dynamics, etc.) and proceed to a decomposition by imposing hypotheses on the model to extract specific frequencies.
Now, the first thing you need to know is what exactly your function does. If it uses moving average filters, you have to be aware that those filters are symmetric, which forces the use of backcasts and forecasts: you need points before the beginning and after the end to apply symmetric filters. It's the same end-point problem faced by filters like the Hodrick-Prescott, for instance. So the procedure needs to specify a good ARIMA with seasonality as a proxy so that end points do not behave too poorly (or specify asymmetric filters for end points), and the symmetry implies a small data-snooping bias if you use the corrected dataset to compare forecasting models (because all new points contain future information). If you use an ARIMA model, the filter is asymmetric and corrected data points are not built using future points.
Now, to forecast, you have two options. (1) You can forecast the corrected values (and then forecast seasonality separately, if you absolutely need raw values); (2) you can forecast the raw series.
It's not obvious which is the best way to proceed. In theory, you want (2), but it can be very complicated (frontier-research complicated) unless you use an ARIMA with a seasonal component, or impose constant seasonality and use seasonal dummies.
As for the 'frequency' choice, I tend to use informal tests to determine what is appropriate. In the moving average literature, we pick how long or short we want our filters, and the goal is to produce estimated seasonals that entirely capture the seasonal regularities. You can use nonparametric tests on corrected data, like the Kruskal-Wallis test, but it is rather forgiving.
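The end-point problem is easy to see on a toy series. The sketch below compares a centered (symmetric) moving average with a trailing (one-sided) one using pandas; the series and the window length of 5 are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd

x = pd.Series(np.arange(10, dtype=float))

# Symmetric filter: the value at time t averages t-2 .. t+2, so it needs future points
# and is undefined for the first two and the last two observations.
centered = x.rolling(window=5, center=True).mean()

# One-sided filter: the value at time t averages t-4 .. t, so it uses past points only
# and is defined up to the last observation.
trailing = x.rolling(window=5).mean()

missing_centered = centered.isna().to_numpy().nonzero()[0]  # positions 0, 1, 8, 9
missing_trailing = trailing.isna().to_numpy().nonzero()[0]  # positions 0, 1, 2, 3
```

To fill in the last two centered values, a procedure has to either extend the series with forecasts or switch to an asymmetric filter near the end, which is exactly the trade-off described above.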
My advice, which I believe is preferable for forecasting, would be to find a package that allows you to work with parametric models with seasonality. Then, you'd have clear tests and information criteria to use to make decisions on sound statistical ground.
I have some time series data which contains some seasonal trends and I want to use an ARIMA model to predict how this series will behave in the future.
In order to predict how my variable of interest (log_var) will behave I have taken a weekly, monthly and annual difference and then used these as the input to an ARIMA model.
Below is an example.
exog = np.column_stack([df_arima['log_var_diff_wk'],
                        df_arima['log_var_diff_mth'],
                        df_arima['log_var_diff_yr']])
model = ARIMA(df_arima['log_var'], exog = exog, order=(1,0,1))
results_ARIMA = model.fit()
I am doing this for several different data sources and in all of them I see great results, in the sense that if I plot log_var against results_ARIMA.fittedvalues for the training data then it matches very well (I tune p and q for each data source separately, but d is always 0 given that I have already taken the difference myself).
However, I then want to check what the predictions look like, and in order to do this I redefine exog to just be the 'test' dataset. For example, if I train the original ARIMA model on 2014-01-01 to 2016-01-01, the 'test' set would just be 2016-01-01 onwards.
My approach has worked well for some data sources (in the sense that I plot the forecast against the known values and the trends look sensible) but badly for others, although they are all the same 'kind' of data and they have just been taken from different geographical locations. In some of the locations it completely fails to catch obvious seasonal trends that occur again and again in the training data on the same dates each year. The ARIMA model always fits the training data well, it just seems that in some cases the predictions are completely useless.
I am now wondering if I am actually following the correct procedure to predict values from the ARIMA model. My approach is basically:
exog = np.column_stack([df_arima_predict['log_val_diff_wk'],
                        df_arima_predict['log_val_diff_mth'],
                        df_arima_predict['log_val_diff_yr']])
arima_predict = results_ARIMA.predict(start=training_cut_date, end = '2017-01-01', dynamic = False, exog = exog)
Is this the correct way to go about making predictions with ARIMA?
If so, is there a way I can try to understand why the predictions look very good in some datasets and terrible in others, when the ARIMA model seems to fit the training data just as well in both cases?
I have a similar problem at the moment which I have not entirely figured out yet. It seems including multiple seasonal terms in Python is still a bit tricky. R does seem to have this capacity, see here. So, one suggestion I can give you is to try this with the more sophisticated functionality R provides for now (although that could require a large investment of time if you are not familiar with R yet).
Looking at your approach for modeling the seasonal patterns, taking the nth order difference scores does not give you seasonal constants, but rather some representation of the difference between the time points that you designate as seasonally related. If those differences are small, correcting for them might not have much impact on your modeling results. In such cases, model prediction might turn out fairly well. Conversely, if the differences are big, including them can easily distort prediction results. This could explain the variation you are seeing in your modeling results. Conceptually, then, what you'd want to do instead is represent the constants over time.
In the blog post referenced above, the author advocates the use of Fourier series to model the variance within each time period. Both the NumPy and SciPy packages offer routines for calculating the fast Fourier transform. However, as a non-mathematician I found it difficult to ascertain that the fast Fourier transform yielded the appropriate numbers.
In the end I opted to use Welch's method from SciPy's signal module (scipy.signal.welch). It returns an estimate of the power spectral density of your time series, from which you can read off the signal strength at various frequencies.
If you identify the peaks in the spectral density analysis which correspond to the seasonal frequencies you are trying to account for in your time series, you can use their frequencies and amplitudes to construct sine waves representing the seasonal variations. You can then include these in your ARIMA as exogenous variables, much like the Fourier terms in the blog post.
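A rough sketch of that procedure on simulated data follows. The weekly cycle, the segment length and the variable names are assumptions. One caveat built into the sketch: the spectral density gives the strength of a frequency but not its phase, so a sine and a cosine at the peak frequency are used together and their weights are estimated by least squares.

```python
import numpy as np
from scipy.signal import welch

# Simulated daily series with a weekly cycle of amplitude 3 plus noise.
rng = np.random.default_rng(0)
n = 1400
t = np.arange(n)
y = 3.0 * np.sin(2 * np.pi * t / 7 + 0.8) + rng.normal(0, 1, n)

# Power spectral density estimate; fs=1.0 means one sample per day.
freqs, power = welch(y, fs=1.0, nperseg=448)

# Strongest peak, skipping the zero-frequency bin.
peak = np.argmax(power[1:]) + 1
peak_freq = freqs[peak]
period = 1.0 / peak_freq  # in days

# Sine/cosine pair at the peak frequency, usable as exogenous regressors.
fourier_terms = np.column_stack([
    np.sin(2 * np.pi * peak_freq * t),
    np.cos(2 * np.pi * peak_freq * t),
])

# Least-squares weights recover the amplitude (and implicitly the phase).
design = np.column_stack([np.ones(n), fourier_terms])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
amplitude = np.hypot(coef[1], coef[2])
```

For forecasting, the same two columns can be generated for future time indices, since they depend only on t. Whether they behave well as `exog` in a particular ARIMA implementation is something I have not verified here.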
This is about as far as I have gotten myself at this point. Right now I am trying to figure out whether I can get the statsmodels ARIMA process to use these sine waves, which specify a seasonal trend, as exogenous variables in my model (the documentation specifies they should not represent trends but hey, a guy can dream, right?).
Edit: This blog post by Rob Hyndman is also highly relevant, and explains some of the rationale behind including Fourier terms.
Sorry I'm not able to give you a solution that's proven to be effective within Python, but I hope this gives you some new ideas to control for that pesky seasonal variance.
TL;DR:
It seems Python is not very well suited to handle multiple seasonal terms right now; R might be a better solution (see reference);
Using difference scores to account for seasonal trends seems not to capture the constant variance associated with the recurrence of the season;
One way to do this in Python could be to use Fourier series representing seasonal trends (also see reference), which can be obtained using, among other ways, a spectral density estimate from Welch's method. How to use these as exogenous variables in an ARIMA to good effect is an open question, though.
Best of luck,
Evert
p.s.: I'll update if I find a way to get this to work in Python