Assume we have time-series data containing the daily order count for the last two years:
We can forecast future orders using Python's statsmodels library:
import statsmodels.api as sm

fit1 = sm.tsa.statespace.SARIMAX(
    train.Count, order=(2, 1, 4), seasonal_order=(0, 1, 1, 7)
).fit()
y_hat_avg['SARIMA'] = fit1.predict(
    start="2018-06-16", end="2018-08-14", dynamic=True
)
Result (don't mind the numbers):
Now assume that our input data shows some unusual increases or decreases because of holidays or promotions at the company. So we added two columns that indicate whether each day was a "holiday" and whether the company ran a "promotion" that day.
Is there a method (and a way of implementing it in Python) to use this new input data to help the model understand the reason for these outliers, and to predict future orders while providing the "holiday" and "promotion_day" information? Something like:
fit1.predict('2018-08-29', holiday=True, is_promotion=False)
# or
fit1.predict(start="2018-08-20", end="2018-08-25", holiday=[0,0,0,1,1,0], is_promotion=[0,0,1,1,0,1])
SARIMAX, as a generalisation of the SARIMA model, is designed to handle exactly this. From the docs,
Parameters:
endog (array_like) – The observed time-series process y;
exog (array_like, optional) – Array of exogenous regressors, shaped (nobs, k).
You could pass holiday and promotion_day as an array of shape (nobs, 2) to exog, which lets the model attribute part of the variation in the observations to those exogenous regressors.
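A minimal sketch of that, assuming the two indicator columns are stored in train as holiday and promotion_day and that the series has a daily DatetimeIndex:

import numpy as np
import statsmodels.api as sm

# 0/1 indicator columns for each day in the training period
exog_train = train[["holiday", "promotion_day"]]

fit1 = sm.tsa.statespace.SARIMAX(
    train.Count,
    exog=exog_train,
    order=(2, 1, 4),
    seasonal_order=(0, 1, 1, 7),
).fit()

# Forecasting requires exogenous values for the forecast horizon, i.e. you must
# supply the known or assumed future holidays/promotions (one row per day).
exog_future = np.array([
    [0, 0], [0, 0], [0, 1],
    [1, 1], [1, 0], [0, 1],
])  # shape (steps, 2): columns are holiday, promotion_day
forecast = fit1.predict(start="2018-08-20", end="2018-08-25", exog=exog_future)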
This problem has different names, such as anomaly detection, rare event detection, and extreme event detection.
There are some blog posts on the Uber Engineering blog that may be useful for understanding the problem and possible solutions. Please look here and here.
Although it's not from statsmodels, you can use Facebook's Prophet library for time series forecasting, which lets you pass dates with recurring events to your model.
See here.
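For example, a minimal sketch, assuming a DataFrame df with columns ds (date) and y (daily order count); the event dates below are made up purely for illustration:

import pandas as pd
from prophet import Prophet  # older versions: from fbprophet import Prophet

# Recurring events are passed as a "holidays" DataFrame
holidays = pd.DataFrame({
    "holiday": "promotion_day",
    "ds": pd.to_datetime(["2018-07-04", "2018-11-23", "2018-12-26"]),
    "lower_window": 0,
    "upper_window": 1,
})

m = Prophet(holidays=holidays)
m.fit(df)

future = m.make_future_dataframe(periods=60)
forecast = m.predict(future)  # yhat includes the estimated holiday effects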
Try this (it may or may not work based on your problem/data):
You can split your date into multiple features such as day of week, day of month, month of year, year, whether it is the last day of the month, whether it is the first day of the month, and many more if you think of it, and then use a standard ML algorithm like Random Forests, Gradient Boosted Trees, or Neural Networks (especially with embedding layers for your categorical features, e.g. day of week) to train your model.
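A rough sketch of that approach, assuming df has a daily DatetimeIndex, an orders column, and the two indicator columns from the question:

import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Calendar features derived from the date index
features = pd.DataFrame(index=df.index)
features["day_of_week"] = df.index.dayofweek
features["day_of_month"] = df.index.day
features["month"] = df.index.month
features["year"] = df.index.year
features["is_month_start"] = df.index.is_month_start.astype(int)
features["is_month_end"] = df.index.is_month_end.astype(int)
features["is_holiday"] = df["holiday"]
features["is_promotion"] = df["promotion_day"]

model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(features, df["orders"])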
Related
I am working on the following timeseries multi-class classification problem:
42 possible classes that are dependent on each other; I want to know the probability of each class for up to 56 days ahead
1 year of daily data so 365 observations
the class probabilities have a strong weekly seasonality
I have exogenous regressors that are strongly correlated with the output classes
I realise that I am trying to predict a lot of output classes with little data, but I am looking for a model (preferably with Python implementation) that is most suited for this use case.
Any recommendations on what model could work for this problem?
So far I have tried:
a tree-based model, but it struggles with the large number of classes and does not capture the time series component well
a VAR model, but the number of parameters to estimate becomes too high relative to the length of the series
predicting each class probability independently, but that assumes the series are independent, which is not the case
Struggling to build an ARIMA model in Python that is even close to useful for predicting household electricity usage. Would appreciate any thoughts and suggestions. (Might just be a silly error in my implementation!)
Some design thoughts:
Data is very messy in general but there is clearly daily seasonality (usage drops over night and while household at work/school) and a weekly seasonality (weekday usage differs from weekend)
Have tried statsmodels, sktime, fbprophet and pmdarima 'auto_arima' functions with no luck. Don't think these take seasonality into account particularly well
Currently trying to get a more manual approach to work: statsmodels' SARIMAX with only daily seasonality incorporated (see code and results below), and maybe adding Fourier terms as exogenous variables to handle weekly seasonality (a sketch of that idea follows the SARIMAX code below).
Will consider adding exogenous variables (like temperature) to account for annual seasonality but first just trying to get something reasonable on a smaller time scale (3-6 months).
Approach I am trying to get working: use the Box-Jenkins method to specify a SARIMA model for just the daily seasonality (images below).
(1) Looking at Dickey-Fuller and KPSS for the time series, there appears to be minimal trend to correct for (expected), but ACF and PACF charts show significant seasonality (daily, weekly).
(2) Taking differences to account for week and day seasonality, then taking a further first-order difference, we can quickly get to a dataset that has minimal remaining seasonality and is stationary. This should be a really good sign and suggests there is a model we can build to predict this behaviour!
One more plot to show the difference between the original and differenced data when we zoom in on a typical week.
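A sketch of the differencing in step (2), assuming usage is the 30-minute series:

from statsmodels.tsa.stattools import adfuller

# Seasonal differences at the daily (lag 48) and weekly (lag 336) periods,
# then a further first-order difference
diffed = usage.diff(48).diff(336).diff(1).dropna()

adf_stat, p_value, *_ = adfuller(diffed)
print(p_value)  # a small p-value rejects the unit-root null, i.e. stationary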
(3) Finally, I trained a SARIMA model in the following way, with results. I configured D=d=0 since there is no identifiable trend (expected), p=2 to give the model the opportunity to learn from the most recent behaviour, m=48 for the seasonal period (daily, since the data is in 30-minute intervals), and P=Q=1 to capture those seasonal effects at lag 48.
from statsmodels.tsa.statespace.sarimax import SARIMAX

model = SARIMAX(
    train_data,
    trend='n',                     # no deterministic trend
    order=(2, 0, 0),               # p=2, d=0, q=0
    seasonal_order=(1, 0, 1, 48),  # P=1, D=0, Q=1, m=48 (daily, 30-min data)
)
results = model.fit()
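For the weekly seasonality, the Fourier-term idea mentioned above would look roughly like this (a sketch assuming train_data has a regular 30-minute DatetimeIndex and statsmodels >= 0.12):

from statsmodels.tsa.deterministic import DeterministicProcess, Fourier
from statsmodels.tsa.statespace.sarimax import SARIMAX

# Weekly period with 30-minute data: 48 * 7 = 336 observations
fourier = Fourier(period=336, order=3)
dp = DeterministicProcess(index=train_data.index, additional_terms=[fourier])
exog_train = dp.in_sample()

model = SARIMAX(
    train_data,
    exog=exog_train,
    trend='n',
    order=(2, 0, 0),
    seasonal_order=(1, 0, 1, 48),
)
results = model.fit()

# For forecasting, generate matching Fourier terms for the horizon
exog_future = dp.out_of_sample(steps=48 * 7)  # one week ahead
forecast = results.forecast(steps=48 * 7, exog=exog_future)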
I am able to get an exponential smoothing model working, but I had expected a double-seasonal ARIMA to blow it out of the water. Any thoughts and suggestions most welcome. Thank you in advance!
If you want to check for an anomaly in stock data, many studies use a linear regression. Let's say you want to check whether there is a Monday effect, meaning that Monday is significantly worse than other days.
I understood that we can use a regression like: return = a + b * DummyMon + e
where a is the constant, b is the regression coefficient, DummyMon is the dummy for Monday, and e is the error term.
That's what I used in Python:
First you add a constant to the anomaly:
import statsmodels.api as sm
anomaly = sm.add_constant(anomaly)
Then you build the model:
model = sm.OLS(returns, anomaly)  # note: "return" is a reserved word in Python, so name the series e.g. "returns"
Then you fit the model:
results = model.fit()
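Put together, a runnable version of this might look as follows (assuming returns is a pandas Series of daily returns with a DatetimeIndex of trading days, so weekends are simply absent):

import pandas as pd
import statsmodels.api as sm

# Monday dummy: 1 if the observation falls on a Monday, 0 otherwise
dummy_mon = pd.Series(
    (returns.index.dayofweek == 0).astype(int),
    index=returns.index,
    name="DummyMon",
)
X = sm.add_constant(dummy_mon)

results = sm.OLS(returns, X).fit()
print(results.summary())  # the DummyMon coefficient is the Monday effect b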
I wonder if this is the correct model setup.
In this case a plot of the linear regression would just show two vertical areas above 0 (for no Monday) and 1 (for Monday) with all the returns. It looks pretty strange. Is this correct?
Should I somehow try to use the time (t) in the regression? If so, how can I do it with python? I thought about giving each date an increasing number, but then I wondered how to treat weekends.
I would assume that with many data points both approaches are similar if the time series is stationary, right? In the end I am doing a cross-sectional analysis and don't care about the time-series aspect in this case, correct? (I have heard about GARCH models etc., where this is different.)
Well, I am just learning and hope someone could give me some ideas about the topic.
Thank you very much in advance.
For time series analysis tasks (such as forecasting or anomaly detection), you may need a more advanced model, such as a Recurrent Neural Network (RNN) from deep learning. You can assign any time step to an RNN cell; in your case, each RNN cell can represent a day, an hour, half a day, etc.
The main purpose of RNNs is to make the model understand the time dependencies in the data. For example, if Monday has a negative effect, the corresponding RNN cells will learn parameters that reflect it. I would recommend doing some further research on this. Here are some resources that may help:
https://colah.github.io/posts/2015-08-Understanding-LSTMs/
(Also includes different types of RNN)
https://towardsdatascience.com/understanding-rnn-and-lstm-f7cdf6dfc14e
And you can use the TensorFlow, Keras, or PyTorch libraries.
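For example, a minimal Keras sketch, assuming series is a 1-D NumPy array of daily observations (the window size and layer sizes are arbitrary choices for illustration):

import numpy as np
import tensorflow as tf

# Predict the next daily value from the previous 14 days
window = 14
X = np.array([series[i:i + window] for i in range(len(series) - window)])
y = series[window:]
X = X[..., np.newaxis]  # shape (samples, timesteps, features)

model = tf.keras.Sequential([
    tf.keras.layers.LSTM(32, input_shape=(window, 1)),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
model.fit(X, y, epochs=20, batch_size=32, verbose=0)

next_value = model.predict(series[-window:].reshape(1, window, 1))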
I am developing an application to predict future hourly online orders on my e-commerce website (a time-series problem) using the canned estimator tf.estimator.DNNRegressor:
estimator = tf.estimator.DNNRegressor(
    feature_columns=my_feature_columns,
    hidden_units=hidden_units,
    model_dir=model_dir,
    optimizer=tf.train.ProximalAdagradOptimizer(
        learning_rate=0.01,
        l1_regularization_strength=0.001,
    ),
)
The features I am using are pretty much based on the date and time. For example, the CSV file with my training data looks like this:
year,month,day,weekday,isweekend,hr,weeknum,yearday,orders
2018,7,16,2,0,0,29,197,193
2018,7,16,2,0,1,29,197,131
2018,7,16,2,0,2,29,197,77
2018,7,16,2,0,3,29,197,59
.....
where orders column is the target for the model.
The model I have so far is working decently, but when I run predictions for a high-demand day like Black Friday, it under-predicts. For example, in the graph below, the predictions for Black Friday 2018 (dashed line) are not as high as we would intuitively expect, even though the model captures the shape nicely.
With all that said, I would appreciate any recommendation on what to add to my model so it can also correctly predict the growth factor and not only the trend.
This is a time series problem, so you're better off using tf.contrib.timeseries.ARRegressor (a neural network built specifically for time series) or tf.contrib.timeseries.StructuralEnsembleRegressor (a time series state space model) than a generic neural network.
Both models take an exogenous_feature_columns argument, you could populate that with 0 for normal days and 1 for event days like Black Friday. That would fix your under-predicting problem since otherwise the model would treat those spikes as outliers (you could do this even with a generic neural network - it's just easier to code with the time series specific functions).
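For the generic-neural-network route, a minimal sketch: add a 0/1 indicator column to the training CSV (here called is_event, a name chosen purely for illustration; set it to 1 on Black Friday and similar days) and expose it as an extra feature column:

import tensorflow as tf

# All existing calendar features plus the hypothetical event indicator
feature_names = ["year", "month", "day", "weekday",
                 "isweekend", "hr", "weeknum", "yearday", "is_event"]
my_feature_columns = [tf.feature_column.numeric_column(name)
                      for name in feature_names]

estimator = tf.estimator.DNNRegressor(
    feature_columns=my_feature_columns,
    hidden_units=hidden_units,
    model_dir=model_dir,
)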
On a more general note, I would recommend other tools besides TensorFlow for time series forecasting, such as Facebook Prophet or the statsmodels package.
I would go further and recommend that you don't use Python at all, and instead look at using some of the forecasting packages available in R.
Data:
I have time series data for different countries and factors, e.g. birth rate for "Afghanistan" for years from 1972 up until 2007 (source).
Goal:
Predict e.g. birth rate for 2008 and 2012
Question:
I am familiar with linear regressions, but need some help on how to work with time series data and predict future values.
Can you point me to examples or share code snippets?
Take a look at the statsmodels Time Series Analysis module. Time series models are often based around autocorrelation, and the module has the standard univariate AR(p) and MA(q) models (for individual time series), as well as the combined ARIMA version that allows for unit roots. You'll also find multivariate VAR models (for several interrelated time series).
And here's a time series tutorial for statistical analysis and forecasting using pandas and statsmodels.
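For example, a minimal forecasting sketch, assuming birth_rate is a pandas Series indexed by year for a single country; the (1, 1, 1) order is only a placeholder and would normally be chosen from ACF/PACF plots or by information criteria:

from statsmodels.tsa.arima.model import ARIMA

model = ARIMA(birth_rate, order=(1, 1, 1))
results = model.fit()

forecast = results.forecast(steps=5)  # e.g. 2008-2012
print(forecast)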
You can use the ARIMA model and the VAR model in R.
ARIMA: Auto Regressive Integrated Moving Average model
VAR: Vector Auto Regressive model
For ARIMA model: click here
For VAR model: click here
For a single time series, use an ARIMA model; if multiple time series are related to each other, use a VAR model.