Regression with Date variable (python)


I have a time series (daily) dataset consisting of 1 label (integer) and 15 features over 5 years. I have no idea about the meaning of features, but I have to predict the labels based on those features.
To do so, first, I used the autocorrelation_plot from pandas.tools.plotting to figure out if I have any seasonality in my label (y) or not. Please see the figure below:
Then I used seasonal_decompose to find seasonal, trend and residual of my label (y) by sweeping the Freq parameter:
1) Could you please let me know which Freq is OK, and why?
2) What would be the next step? Do I need to remove both the trend and seasonal terms from the data and then model and predict the residual by regression (e.g., SVR, linear, etc.)? Or do I need to predict the whole series (without removing seasonal and trend) by regression? I tried the latter with several regression techniques, but the results are very bad. Finally, how can I predict the seasonal component at the end? Is ARIMA OK? What about the trend?
3) Am I on the right track (extracting the seasonal component, etc.), or should I consider the "date" as a feature besides the other 15 features, such as:
hour of the day (24 boolean features)
day of the week (7 boolean features)
day of the month (up to 31 boolean features)
month (12 boolean features)
year
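For illustration, the calendar features listed above could be built with pandas roughly as follows. This is a minimal sketch on a synthetic daily index; the helper name date_features and the column prefixes are hypothetical, and the hour-of-day and day-of-month dummies are left out because the data here is daily.

```python
import pandas as pd


def date_features(index):
    """Build one-hot calendar features from a DatetimeIndex (hypothetical helper)."""
    parts = pd.DataFrame(
        {
            "dow": index.dayofweek.to_numpy(),   # 0 = Monday ... 6 = Sunday
            "month": index.month.to_numpy(),     # 1 ... 12
        },
        index=index,
    )
    # One boolean column per day of week and per month
    features = pd.get_dummies(parts, columns=["dow", "month"], prefix=["dow", "month"])
    # Year is kept as a plain numeric column
    features["year"] = index.year.to_numpy()
    return features


# Synthetic daily index covering one leap year
idx = pd.date_range("2020-01-01", periods=366, freq="D")
features = date_features(idx)
print(features.shape)
```

These columns can then be concatenated with the 15 original features before fitting any regression model.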

Let me explain to you how seasonality is usually treated.
Most of the time, people try to extract a seasonal component and work with the corrected series for analysis. In North America, statistical agencies apply a sequence of symmetric moving average filters to estimate seasonal, trend-cycle and irregular components; seasonally adjusted data correspond to the data minus the estimated seasonal component. Usually, they also provide raw data in other tables and, sometimes, the trend-cycle in yet other tables. In Australia, they prefer to present trend-cycles.
In Europe, decomposition is usually based on a model: they specify an ARIMA model with seasonal components (it allows for integrated seasonal components, moving average components in the seasonal dynamics, etc.) and proceed to a decomposition by imposing hypotheses on the model to extract specific frequencies.
Now, the first thing you need to know is what exactly your function does. If it uses moving average filters, you have to be aware that those filters are symmetric, which forces the use of backcasts and forecasts (you need points before the beginning and after the end to apply symmetric filters; it's the same end-point problem faced by filters like Hodrick-Prescott, for instance). So it needs to specify a good ARIMA with seasonality as a proxy so that end points do not behave too poorly (or specify asymmetric filters for the end points), and the symmetry implies a small data-snooping bias if you use the corrected dataset to compare forecasting models (because all new points contain future information). If you use an ARIMA model, the filter is asymmetric and corrected data points are not built using future points.
Now, to forecast, you have two options: (1) you forecast the corrected values (and then forecast seasonality separately if you absolutely need raw values); (2) you forecast the raw series.
It's not obvious which is the best way to proceed. In theory, you want (2), but it can be very complicated (frontier-research complicated), unless you use an ARIMA with a seasonal component or impose constant seasonality and use seasonal dummies.
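To make the "constant seasonality with seasonal dummies" option concrete, here is a minimal sketch on synthetic daily data. It assumes a linear trend and a fixed weekly pattern; the weekly effects, the noise level and all variable names are made up for illustration, and ordinary least squares via NumPy stands in for whatever regression you prefer.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic daily series: linear trend + constant weekly pattern + noise
idx = pd.date_range("2020-01-01", periods=168, freq="D")
t = np.arange(len(idx))
weekly = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 8.0, 9.0])  # hypothetical day-of-week effects
dow = idx.dayofweek.to_numpy()
y = 0.05 * t + weekly[dow] + rng.normal(0.0, 0.1, len(idx))

# Design matrix: trend column + 7 day-of-week dummies (no separate intercept,
# so the dummies are not collinear with a constant)
X = np.column_stack([t, np.eye(7)[dow]])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

# Forecast the raw series for the next 7 days from the same two ingredients
future_idx = pd.date_range(idx[-1] + pd.Timedelta(days=1), periods=7, freq="D")
future_t = np.arange(len(idx), len(idx) + 7)
X_future = np.column_stack([future_t, np.eye(7)[future_idx.dayofweek.to_numpy()]])
forecast = pd.Series(X_future @ coef, index=future_idx)
print(forecast)
```

Because the seasonal pattern is assumed constant, the forecast of the raw series needs no separate seasonal model; that assumption is exactly what breaks down when seasonality drifts over time.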
As for the 'frequency' choice, I tend to use informal tests to determine what is appropriate. In the moving average literature, we pick how long or short we want our filters, and the goal is to produce estimated seasonals that entirely capture the seasonal regularities. You can use nonparametric tests on the corrected data, like the Kruskal-Wallis test, but it is rather forgiving.
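As a rough illustration of such a check, the sketch below runs scipy.stats.kruskal on a synthetic series grouped by day of week, before and after a crude seasonal correction. The correction used here (subtracting day-of-week means) is only a stand-in for a real seasonal adjustment, and the data and the helper name seasonality_pvalue are hypothetical.

```python
import numpy as np
import pandas as pd
from scipy.stats import kruskal

rng = np.random.default_rng(0)

# Synthetic daily series with a weekend effect
idx = pd.date_range("2020-01-06", periods=7 * 52, freq="D")
weekly = np.array([0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 2.0])
raw = pd.Series(weekly[idx.dayofweek.to_numpy()] + rng.normal(0.0, 1.0, len(idx)), index=idx)

# Crude correction: remove the mean of each day of the week
adjusted = raw - raw.groupby(raw.index.dayofweek).transform("mean")


def seasonality_pvalue(series):
    """Kruskal-Wallis p-value for 'all days of the week share one distribution'."""
    groups = [g.to_numpy() for _, g in series.groupby(series.index.dayofweek)]
    return kruskal(*groups).pvalue


p_raw = seasonality_pvalue(raw)
p_adjusted = seasonality_pvalue(adjusted)
print(p_raw, p_adjusted)
```

A small p-value on the raw series and a large one on the corrected series suggests that the chosen period captured the seasonal pattern, keeping in mind that the test is forgiving.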
My advice, which I believe is preferable for forecasting, would be to find a package that allows you to work with parametric models with seasonality. Then, you'd have clear tests and information criteria to use to make decisions on sound statistical ground.

Related

Best practice for timeseries prediction with help of indicators

I would like to predict values (e.g. transport volumes). As input data I have the volumes from the last two years. I already did some timeseries prediction on those values basically following the instruction on Basics of Time Series Prediction and Techniques for Time Series Prediction.
I now would like to go a step further and include some indicators (e.g. economic indicators) in the prediction to see if this will increase the accuracy of the predictions.
What is the right approach to do so? Looking around I found this post, basically describing the same use case. Unfortunately it got no responses.
One approach might be to do a "simple" prediction based on a model with the current volume and indicators as features and the future volume as label. But I then would lose the timeseries, the connection between the single data points so to say.
Do you have experience with such predictions? What did work in your case? Please point me in the right direction!

One approach might be to do a "simple" prediction based on a model
with the current volume and indicators as features and the future
volume as label. But I then would lose the timeseries, the connection
between the single data points so to say.
In this case a common solution is to include N 'lagging' values (i.e. volumes for N previous periods) as features for every observation, in addition to some indicator value features. This allows using pretty much any regression model for time series forecasting. Just make sure there's no data leakage of the 'future' values when calculating your indicators.
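A minimal sketch of the lagging idea with pandas follows. The column names volume and indicator, the helper make_lag_features and the toy numbers are all hypothetical.

```python
import pandas as pd


def make_lag_features(df, target, n_lags):
    """Add target_lag1 ... target_lagN columns and drop rows without full history."""
    out = df.copy()
    for k in range(1, n_lags + 1):
        # shift(k) moves past values forward, so row t only sees t-k
        out[f"{target}_lag{k}"] = out[target].shift(k)
    return out.dropna()


# Toy data: 10 periods of volumes plus one indicator column
df = pd.DataFrame(
    {
        "volume": [float(v) for v in range(10)],
        "indicator": [1.0, 1.1, 0.9, 1.2, 1.3, 1.0, 0.8, 1.1, 1.2, 1.4],
    },
    index=pd.date_range("2021-01-01", periods=10, freq="D"),
)

lagged = make_lag_features(df, "volume", n_lags=3)
X = lagged.drop(columns=["volume"])  # features: indicator + 3 lags
y = lagged["volume"]                 # label: volume of the current period
print(lagged.head())
```

If the indicator for period t is only published after t, it should be shifted as well so that each row uses only information available at prediction time.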

How to Make statistical tests in time series applications

I received feedback on my paper about stock market forecasting with Machine Learning, and the reviewer asked the following:
I would like you to statistically test the out-of-sample performance
of your methods. Hence 'differ significantly' in the original wording.
I agree that some of the figures look awesome visually, but visually,
random noise seems to contain patterns. I believe Sortino Ratio is the
appropriate statistic to test, and it can be tested by using
bootstrap. I.e., a distribution is obtained for both BH and your
strategy, and the overlap of these distributions is calculated.
My problem is that I have never done that for time series data. My validation procedure uses a strategy called walk forward, where I shift the data in time 11 times, generating 11 different combinations of training and test sets with no overlap. So, here are my questions:
1- What would be the best (or most appropriate) statistical test to use given what the reviewer is asking?
2- If I remember well, statistical tests require vectors as input, is that correct? Can I generate a vector containing 11 Sortino ratios (1 for each walk) and then compare them with the baselines? Or should I run my code more than once? I am afraid the last choice would be unfeasible given the short time available for the review.
So, what would be the correct actions to compare machine learning approaches statistically in this time series scenario?

By pointing out that random noise seems to contain patterns, the reviewer means that your plots show nice patterns, but these might be random noise following some distribution (e.g. uniform random noise), which makes the conclusions less reliable. It might be a good idea to split the data randomly into k groups, then apply a Z-test or t-test to compare the k groups pairwise.
The reviewer points to the Sortino ratio, which seems ambiguous: since you are building a machine learning model for a forecasting task, what you actually care about is forecasting accuracy and reliability, which can be assessed with cross-validation; in convex optimization, the equivalent is sensitivity analysis.
Update
The problem of serial dependency in time series data arises when the series is non-stationary (low patterns), which does not seem to be the case for your data. Even if it were, it could be solved by removing the trend, i.e. converting the non-stationary series into a stationary one (checking with the ADF test, for example); you might also consider using ARIMA models.
Time shifting can sometimes be useful, but it is not considered a good measure of noise; it might, however, help improve model accuracy by shifting the data and extracting features (e.g. mean and variance over a window).
Nothing prevents you from trying the time-shifting approach, but you can't rely on it as an accurate measurement, and you still need to back up your statistical analysis with more robust techniques.
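For reference, the bootstrap the reviewer describes could be sketched as below. This is only one possible reading: it assumes you have the out-of-sample per-period returns of both the strategy and buy-and-hold (BH) concatenated over the walks, it uses a circular block bootstrap (my choice, to preserve some serial dependence; the reviewer only says "bootstrap"), and it resamples both series with the same indices so the comparison stays paired. The returns, block length and function names are made up.

```python
import numpy as np


def sortino_ratio(returns, target=0.0):
    """Mean excess return divided by the downside deviation below `target`."""
    excess = returns - target
    downside_dev = np.sqrt(np.mean(np.minimum(excess, 0.0) ** 2))
    return np.nan if downside_dev == 0 else excess.mean() / downside_dev


def block_bootstrap_diff(strategy, benchmark, block_len=10, n_boot=2000, seed=0):
    """Bootstrap distribution of Sortino(strategy) - Sortino(benchmark)."""
    rng = np.random.default_rng(seed)
    n = len(strategy)
    n_blocks = int(np.ceil(n / block_len))
    diffs = np.empty(n_boot)
    for b in range(n_boot):
        starts = rng.integers(0, n, size=n_blocks)
        # Consecutive indices per block, wrapping around the end (circular blocks)
        idx = ((starts[:, None] + np.arange(block_len)[None, :]) % n).ravel()[:n]
        diffs[b] = sortino_ratio(strategy[idx]) - sortino_ratio(benchmark[idx])
    return diffs


# Synthetic out-of-sample returns, for illustration only
rng = np.random.default_rng(1)
benchmark = rng.normal(0.0003, 0.01, 750)
strategy = 0.5 * benchmark + rng.normal(0.0005, 0.005, 750)

diffs = block_bootstrap_diff(strategy, benchmark)
ci_low, ci_high = np.percentile(diffs, [2.5, 97.5])
print(ci_low, ci_high)
```

If the 95% interval of the difference excludes zero, that is one way to express "differ significantly"; with only 11 per-walk ratios, resampling the underlying returns as above gives the bootstrap far more to work with than a vector of 11 values.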

Removing Trend and Seasonality Time Series Python

I have time series data from which I need to remove the trend and seasonality components. I was wondering whether I could use the seasonal_decompose() function in Python and extract the residual as follows:
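from statsmodels.tsa.seasonal import seasonal_decompose
# Newer statsmodels releases name this argument `period`; older ones used `freq`
result = seasonal_decompose(self.series, model='additive', period=frequency)
residual = result.resid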
Or should I apply well-known detrending and deseasonalizing methods (such as differencing)? And if I were to apply such methods, should I detrend first and then deseasonalize, or vice versa?

As the no free lunch theorem suggests, there is no universal model that can beat all other models on any kind of data. You should definitely try differencing and seasonal ARIMA in addition to the seasonal decomposition you've already tried. The criterion for model selection is the model's performance on your data. With ARIMA, you don't need to detrend. Check out this comprehensive tutorial.
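On the ordering question for differencing specifically: first differencing and seasonal differencing are linear operators that commute, so either order gives the same series. A small sketch on synthetic data, assuming a weekly period of 7 (the data and variable names are illustrative):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic daily series: trend + weekly seasonality + noise
idx = pd.date_range("2021-01-01", periods=120, freq="D")
t = np.arange(len(idx))
series = pd.Series(
    0.3 * t + 5.0 * np.sin(2 * np.pi * t / 7) + rng.normal(0.0, 1.0, len(idx)),
    index=idx,
)

period = 7  # assumed seasonal period

# Detrend first (lag-1 difference), then deseasonalize (lag-7 difference)
trend_then_season = series.diff(1).diff(period)
# Deseasonalize first, then detrend
season_then_trend = series.diff(period).diff(1)

same = np.allclose(trend_then_season.dropna(), season_then_trend.dropna())
print(same, trend_then_season.dropna().shape)
```

This equivalence holds for differencing only; with decomposition-style methods (estimating a trend, then a seasonal pattern from what is left) the order can matter.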

TimeSeries Stationarity

I am working on a Python program for time series forecasting of the number of events by date. For prediction I use an ARIMA model. I now have some results, but the predicted values are not very good.
First, I tried to make my time series stationary. For this I checked stationarity with the Dickey-Fuller test (0,5), then applied a Box-Cox transformation and checked the Dickey-Fuller value again (0,3). Then I tried first-order differencing. I didn't get good results.
My question is how to deal with non-stationary time series. Which methods should I use to make it stationary?

Many time series problems are intrinsically difficult, if not unlearnable -- especially if one wants to prevent overfitting and have some predictive power. If results are poor with a simple model, they aren't likely to be leaps and bounds better with a more complicated model.
Your first step ought to be incorporating external data sources and coming up with a theoretical model for your predictive task. Training a model on those stronger-signaled inputs should work better than on your raw data (if the task is learnable).

You can explore a number of methods to handle non-stationarity of time series here:
https://medium.com/analytics-vidhya/preprocessing-for-time-series-forecasting-3a331dbfb9c2?source=friends_link&sk=30aac82f09efbbe8f1b6549a8e367575
Key Points (for making stationary time series):
Self Lag Differencing: the difference between the present series and a lagged version of the series. The shift can be of order 1, 2, 3, 4, etc. Items for which no lagged value exists are set to NULL.
Example: let your dataframe be 'Time' and the column with values be 'Temperature', indexed on date. Self differencing can then be done like this:
Time['Temperature_Diff'] = Time['Temperature'] - Time['Temperature'].shift(1)
if the lag used is 1
Time['Temperature_Diff'] = Time['Temperature'] - Time['Temperature'].shift(2)
if the lag used is 2
Log Self Differencing: the same difference between the present series and a lagged version of the series, but computed after applying a log transformation to the actual series.
Use statsmodels.tsa.seasonal.seasonal_decompose and it will give you three components: trend, seasonality and residuals. Take the residuals; they will be our stationary time series for forecasting.
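The log self differencing point can be sketched in the same style as the example above. The values below are made up, and the log transform assumes the column is strictly positive.

```python
import numpy as np
import pandas as pd

# Toy dataframe in the same shape as the example above (values are made up)
Time = pd.DataFrame(
    {"Temperature": [10.0, 11.0, 12.1, 13.31]},
    index=pd.date_range("2021-01-01", periods=4, freq="D"),
)

# Log transform first (requires strictly positive values), then lag-1 difference
log_temp = np.log(Time["Temperature"])
Time["Temperature_LogDiff"] = log_temp - log_temp.shift(1)

# The first row has no lagged value, so it is NaN
print(Time)
```

Each non-missing entry is the log of the ratio between consecutive values, so a series growing by a constant percentage becomes a constant after this transformation.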
P.S.- The blogpost has been authored by me.

Predictions with ARIMA (python statsmodels)

I have some time series data which contains some seasonal trends and I want to use an ARIMA model to predict how this series will behave in the future.
In order to predict how my variable of interest (log_var) will behave I have taken a weekly, monthly and annual difference and then used these as the input to an ARIMA model.
Below is an example.
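import numpy as np
# Import path assumes a recent statsmodels release
from statsmodels.tsa.arima.model import ARIMA

# Weekly, monthly and annual differences used as exogenous regressors
exog = np.column_stack([df_arima['log_var_diff_wk'],
                        df_arima['log_var_diff_mth'],
                        df_arima['log_var_diff_yr']])
model = ARIMA(df_arima['log_var'], exog=exog, order=(1, 0, 1))
results_ARIMA = model.fit()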
I am doing this for several different data sources and in all of them I see great results, in the sense that if I plot log_var against results_ARIMA.fittedvalues for the training data then it matches very well (I tune p and q for each data source separately, but d is always 0 given that I have already taken the difference myself).
However, I then want to check what the predictions look like, and in order to do this I redefine exog to just be the 'test' dataset. For example, if I train the original ARIMA model on 2014-01-01 to 2016-01-01, the 'test' set would just be 2016-01-01 onwards.
My approach has worked well for some data sources (in the sense that I plot the forecast against the known values and the trends look sensible) but badly for others, although they are all the same 'kind' of data and they have just been taken from different geographical locations. In some of the locations it completely fails to catch obvious seasonal trends that occur again and again in the training data on the same dates each year. The ARIMA model always fits the training data well, it just seems that in some cases the predictions are completely useless.
I am now wondering if I am actually following the correct procedure to predict values from the ARIMA model. My approach is basically:
exog = np.column_stack([df_arima_predict['log_val_diff_wk'],
                        df_arima_predict['log_val_diff_mth'],
                        df_arima_predict['log_val_diff_yr']])
arima_predict = results_ARIMA.predict(start=training_cut_date, end='2017-01-01',
                                      dynamic=False, exog=exog)
Is this the correct way to go about making predictions with ARIMA?
If so, is there a way I can try to understand why the predictions look very good in some datasets and terrible in others, when the ARIMA model seems to fit the training data just as well in both cases?

I have a similar problem at the moment which I have not entirely figured out yet. It seems including multiple seasonal terms in Python is still a bit tricky. R does seem to have this capacity, see here. So, one suggestion I can give you is to try this with the more sophisticated functionality R provides for now (although that could require a large investment of time if you are not familiar with R yet).
Looking at your approach for modeling the seasonal patterns, taking the nth order difference scores does not give you seasonal constants, but rather some representation of the difference between the time points that you designate as seasonally related. If those differences are small, correcting for them might not have much impact on your modeling results. In such cases, model prediction might turn out fairly well. Conversely, if the differences are big, including them can easily distort prediction results. This could explain the variation you are seeing in your modeling results. Conceptually, then, what you'd want to do instead is represent the constants over time.
In the blog post referenced above, the author advocates the use of Fourier series to model the variance within each time period. Both the NumPy and SciPy packages offer routines for calculating the fast Fourier transform. However, as a non-mathematician I found it difficult to ascertain that the fast Fourier transform yielded the appropriate numbers.
In the end I opted to use the Welch signal decomposition from SciPy's signal module. What this does is return a spectral density analysis of your time series, from which you can deduce signal strength at various frequencies in your time series.
If you identify the peaks in the spectral density analysis which correspond to the seasonal frequencies you are trying to account for in your time series, you can use their frequencies and amplitudes to construct sine waves representing the seasonal variations. You can then include these in your ARIMA as exogenous variables, much like the Fourier terms in the blog post.
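A rough sketch of this idea with scipy.signal.welch is given below, on a synthetic daily series with a weekly cycle. Two things differ from the description above and are my own choices: the sketch builds a sine and a cosine column per frequency instead of a single sine wave with an estimated amplitude, so that the regression coefficients can absorb amplitude and phase, and it picks only the single strongest peak. The helper name fourier_terms and all numbers are hypothetical.

```python
import numpy as np
from scipy.signal import welch

rng = np.random.default_rng(0)

# Synthetic daily series with a weekly cycle of unknown phase
n = 700
t = np.arange(n)
y = 2.0 * np.sin(2 * np.pi * t / 7 + 0.8) + rng.normal(0.0, 0.5, n)

# Spectral density estimate; fs=1.0 means one sample per day,
# so frequencies are in cycles per day
freqs, power = welch(y, fs=1.0, nperseg=350)

# Strongest peak, skipping the zero-frequency bin
peak = np.argmax(power[1:]) + 1
period = 1.0 / freqs[peak]


def fourier_terms(t, period, K=1):
    """Sine/cosine pairs for the first K harmonics of `period`."""
    cols = []
    for k in range(1, K + 1):
        cols.append(np.sin(2 * np.pi * k * t / period))
        cols.append(np.cos(2 * np.pi * k * t / period))
    return np.column_stack(cols)


# Candidate exogenous regressors for the training span and for a 14-day horizon
exog_train = fourier_terms(t, period)
exog_future = fourier_terms(np.arange(n, n + 14), period)
print(period, exog_train.shape, exog_future.shape)
```

Because these terms are deterministic functions of time, they can be computed for any future date, which is what makes them usable as exogenous variables at prediction time.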
This is about as far as I have gotten myself at this point. Right now I am trying to figure out whether I can get the statsmodels ARIMA process to use these sine waves, which specify a seasonal trend, as exogenous variables in my model (the documentation specifies they should not represent trends but hey, a guy can dream, right?). Edit: this blog post by Rob Hyndman is also highly relevant, and explains some of the rationale behind including Fourier terms.
Sorry I'm not able to give you a solution that's proven to be effective within Python, but I hope this gives you some new ideas to control for that pesky seasonal variance.
TL;DR:
It seems python is not very well suited to handle multiple seasonal terms right now, R might be a better solution (see reference);
Using difference scores to account for seasonal trends seems not to capture the constant variance associated with the recurrence of the season;
One way to do this in python could be to use Fourier series representing seasonal trends (also see reference), which can be obtained using, among other ways, a Welch signal decomposition. How to use these as exogenous variables in an ARIMA to good effect is an open question, though.
Best of luck,
Evert
p.s.: I'll update if I find a way to get this to work in Python
