TensorFlow: predict growth factor in a time series forecast - python

I am developing an application to predict future hourly online orders on my e-commerce website (a time-series problem) using the canned estimator tf.estimator.DNNRegressor:
estimator = tf.estimator.DNNRegressor(
    feature_columns=my_feature_columns,
    hidden_units=hidden_units,
    model_dir=model_dir,
    optimizer=tf.train.ProximalAdagradOptimizer(
        learning_rate=0.01,
        l1_regularization_strength=0.001))
The features I am using are pretty much all based on the date and time. For example, the CSV file with my training data looks like this:
year,month,day,weekday,isweekend,hr,weeknum,yearday,orders
2018,7,16,2,0,0,29,197,193
2018,7,16,2,0,1,29,197,131
2018,7,16,2,0,2,29,197,77
2018,7,16,2,0,3,29,197,59
.....
where the orders column is the target for the model.
The model I have so far works decently, but when I run predictions for a high-demand day like Black Friday, it under-predicts. For example, in the graph below we can see that predictions for Black Friday this year, 2018 (dashed line), are not as high as we would intuitively expect, even though the model predicts the shape nicely.
With all that being said, I would appreciate any recommendation for what to add to my model so that it correctly predicts the growth factor and not only the trend.

This is a time series problem, so you're better off using tf.contrib.timeseries.ARRegressor (an autoregressive model built specifically for time series) or tf.contrib.timeseries.StructuralEnsembleRegressor (a time series state space model) than a generic neural network.
Both models take an exogenous_feature_columns argument, which you could populate with 0 for normal days and 1 for event days like Black Friday. That should help with your under-predicting problem, since otherwise the model would treat those spikes as outliers. (You could do this even with a generic neural network; it's just easier to code with the time-series-specific functions.)
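As a rough illustration of the 0/1 event-day indicator described above, here is a minimal sketch using only the standard library. It assumes Black Friday falls on the day after the fourth Thursday of November (the US convention); the helper names black_friday and is_event_day and the is_event column are hypothetical, not part of the original training data.

```python
import datetime as dt


def black_friday(year: int) -> dt.date:
    """Day after the fourth Thursday of November (assumed US convention)."""
    nov1 = dt.date(year, 11, 1)
    # date.weekday(): Monday = 0 ... Thursday = 3
    days_to_thursday = (3 - nov1.weekday()) % 7
    fourth_thursday = nov1 + dt.timedelta(days=days_to_thursday + 21)
    return fourth_thursday + dt.timedelta(days=1)


def is_event_day(day: dt.date) -> int:
    """Return 1 on event days (only Black Friday in this sketch), else 0."""
    return int(day == black_friday(day.year))


# Extend one training row with the new indicator column.
row = {"year": 2018, "month": 11, "day": 23, "hr": 0}
row["is_event"] = is_event_day(dt.date(row["year"], row["month"], row["day"]))
```

The same column would have to be filled in for the future hours being predicted, which is possible because event dates are known in advance.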
On a more general note, I would recommend other tools besides tensorflow for time series forecasting, such as Facebook Prophet or Statsmodels package.
I would go further and recommend that you don't use Python at all, and instead look at using some of the forecasting packages available in R.

Related

Anomaly Testing - Linear Regression with t or not with t? Problems to understand the setup

If you want to check for an anomaly in stock data, many studies use a linear regression. Let's say you want to check whether there is a Monday effect, meaning that Monday is significantly worse than other days.
I understood that we can use a regression like: return = a + b * DummyMon + e
Here a is the constant, b is the regression coefficient, DummyMon is the dummy variable for Monday, and e is the error term.
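That's what I used in Python (with statsmodels.api imported as sm):
First you add a constant to the anomaly:
anomaly = sm.add_constant(anomaly)
Then you build the model (the variable is named returns because return is a reserved word in Python):
model = sm.OLS(returns, anomaly)
Then you fit the model:
results = model.fit()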
I wonder if this is the correct model setup.
In this case a plot of the linear regression would just show two vertical areas above 0 (for no Monday) and 1 (for Monday) with all the returns. It looks pretty strange. Is this correct?
Should I somehow try to use the time (t) in the regression? If so, how can I do it with python? I thought about giving each date an increasing number, but then I wondered how to treat weekends.
I would assume that with many data points both approaches are similar if the time series is stationary, right? In the end I do a cross-sectional analysis and don't care about the time series aspect in this case, correct? (I have heard about GARCH models etc., where this is different.)
Well, I am just learning and hope someone could give me some ideas about the topic.
Thank you very much in advance.
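To make the dummy-variable setup in the question concrete, here is a small sketch on synthetic returns. It swaps statsmodels for a plain NumPy least-squares fit (an assumption made to keep the example dependency-light) and suggests that, with a single dummy, the fit reduces to comparing group means: a is the mean of the non-Monday returns and b is the Monday mean minus that. No time index t enters the regression, which is why the scatter plot shows just two vertical strips at 0 and 1.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic daily returns for 200 trading days (illustrative data only).
n_days = 200
weekday = np.arange(n_days) % 5          # 0 = Monday ... 4 = Friday
dummy_mon = (weekday == 0).astype(float)
returns = 0.001 - 0.002 * dummy_mon + rng.normal(0.0, 0.01, n_days)

# Design matrix [1, DummyMon]; sm.add_constant builds the same layout.
X = np.column_stack([np.ones(n_days), dummy_mon])
(a, b), *_ = np.linalg.lstsq(X, returns, rcond=None)

# With a single dummy, OLS reduces to a comparison of group means.
mean_other = returns[dummy_mon == 0].mean()
mean_monday = returns[dummy_mon == 1].mean()
print(f"a = {a:.5f}, mean of non-Mondays = {mean_other:.5f}")
print(f"b = {b:.5f}, Monday minus other  = {mean_monday - mean_other:.5f}")
```

Because only trading days appear as rows, weekends need no special treatment in this setup.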
For time series analysis tasks (such as forecasting or anomaly detection), you may need a more advanced model, such as a recurrent neural network (RNN) from deep learning. You can assign any time step to an RNN cell; in your case, every RNN cell can represent a day, an hour, half a day, etc.
The main purpose of RNNs is to make the model understand the time dependencies in the data. For example, if Monday has a bad effect, then the corresponding RNN cells will learn parameters that reflect it. I would recommend doing some further research on this. Here are some resources that may help:
https://colah.github.io/posts/2015-08-Understanding-LSTMs/
(Also includes different types of RNN)
https://towardsdatascience.com/understanding-rnn-and-lstm-f7cdf6dfc14e
You can use the TensorFlow, Keras or PyTorch libraries.

How can you do time series forecasting in Tensorflow (or with other tools) where features of the label timestep are known?

This is a question about a general approach rather than a specific coding problem. I'm trying to do time series forecasting with Tensorflow where features of the label timestep are known to the model. E.g. a human trying to predict a variable a week from now would know things that are going to happen in the next week that will affect that variable. So a window of 20 timesteps where the label is the 20th timestep would look something like this:
Timesteps 1-19 would each have a set of features plus the timeseries data
Timestep 20 would have a set of features which are known, plus the timeseries label which is unknown
Is there a model that could handle this sort of data? I've gone through the Tensorflow time series forecasting tutorial, done a Coursera course on Tensorflow time series forecasting and searched elsewhere but I can't find anything. I'm fairly new to this so apologies for any imprecise language.
I once tried to solve this kind of TS problem by stacking a multivariate time series model and another machine learning model. My idea was to take the normal TS model's output and add it as another feature to a second model that only takes the last time step's info as input. In other words, I use the info from step 1 to window_size - 1 to predict a rough output at step window_size, then use the info at step window_size to reduce the residual between my TS model's output and the actual label. But it is complicated and might overfit a lot even if the second model is carefully regularized. I also don't think this approach is theoretically sound, and the result might be worse than using a TS model without feeding in the target step's info.
I don't think TensorFlow has any API for your problem, because this type of problem is not a normal TS problem. Usually people would just treat this kind of problem as a regression or classification problem.
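One way to read the "treat it as a regression problem" suggestion is sketched below: each training row concatenates the previous values of the series with the features that are already known for the target step. The helper make_rows, the synthetic event flag and the plain least-squares fit are all assumptions for illustration; any regressor could take the place of the least-squares step.

```python
import numpy as np


def make_rows(series, known, n_lags):
    """Build one regression row per target step.

    series: 1-D array of the observed variable.
    known:  2-D array (n_steps, n_known) of features known in advance.
    Each row holds the previous n_lags values of the series followed by
    the known features of the target step itself.
    """
    X, y = [], []
    for t in range(n_lags, len(series)):
        X.append(np.concatenate([series[t - n_lags:t], known[t]]))
        y.append(series[t])
    return np.asarray(X), np.asarray(y)


# Synthetic example: the series jumps by 5 whenever a known event flag is 1.
rng = np.random.default_rng(0)
n_steps = 300
event = rng.integers(0, 2, n_steps).astype(float)
series = 10.0 + 5.0 * event + rng.normal(0.0, 0.1, n_steps)

X, y = make_rows(series, event[:, None], n_lags=19)

# Ordinary least squares with an intercept; any regressor could be used here.
X1 = np.column_stack([np.ones(len(X)), X])
coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
print("weight on the target step's event flag:", round(float(coef[-1]), 2))
```

With 19 lags per row this mirrors the 20-step window from the question: steps 1 to 19 contribute the series values, and step 20 contributes only its known features.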
I am not an expert on this problem either; I just happened to attempt to solve the exact same problem, so this is just my personal experience...

Train machine learning model with scikit learn for time-series prediction

I need to train a model with scikit-learn to predict the times when there will be fewer people in a room.
Here is what my dataset looks like:
Time PeopleCount
---------------------------------------------
2019-12-29 12:40:10 50
2019-12-29 12:42:10 30
2019-12-29 12:44:10 10
2019-12-29 12:46:10 10
2019-12-29 12:48:10 80
and so on...
This data will be available for 30 days.
Once the model is trained, I will query it to get the likely time when there will be fewer people in the room between 10 AM and 8 PM. I expect the machine learning model to respond with 30-minute accuracy, i.e. "3:00 PM to 3:30 PM".
What algorithm can I use for this problem and how can I achieve the goal? Or are there any other Python libraries than SciKit-Learn which can be used for this purpose?
I am new to machine learning, sorry for a naive question.
First of all, time-series prediction rests on the assumption that the current value depends, more or less, on past values. For instance, the people count of 80 at 2019-12-29 12:48:10 would have to be strongly influenced by the people counts at 12:46:10, 12:44:10 or earlier, i.e. correlated with past values. If it is not, you would be better off using a different kind of algorithm for prediction.
scikit-learn contains various machine learning modules, many of them for classification. I think a classification algorithm could certainly meet your needs if your data is not really a time series. scikit-learn also has regression modules, though I think they are not especially well suited to predicting time series data.
For predicting time series data, RNN or LSTM models (deep learning) have been widely used, but scikit-learn does not provide a built-in implementation of them. So you might be better off studying the TensorFlow or PyTorch frameworks, which are common tools that enable you to build an RNN or LSTM model.
scikit-learn models do not recognize timestamps, so you will have to break down your timestamp column into a number of features, e.g. day of week, hour, etc. If you need 30-minute accuracy, then you will have to aggregate the data from the PeopleCount column somehow, e.g. record the average, minimum or maximum number of people within each 30-minute interval. It may also be a good idea to create lagged features, i.e. what the people count was in the previous time slot, or even 2, 3 or n time slots ago.
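The preparation steps above might look roughly like this with pandas. This is a sketch on synthetic readings: the column names people, lag_1 and lag_2 are made up for illustration, and the mean is only one of the possible aggregations.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the 2-minute PeopleCount readings (one day).
times = pd.date_range("2019-12-29 10:00:10", periods=300, freq="2min")
rng = np.random.default_rng(0)
raw = pd.DataFrame({"Time": times, "PeopleCount": rng.integers(0, 100, len(times))})

# Aggregate to 30-minute slots (mean is one choice; min or max also work).
slots = (
    raw.set_index("Time")["PeopleCount"]
    .resample("30min")
    .mean()
    .to_frame("people")
)

# Time features that a scikit-learn model can consume.
slots["day_of_week"] = slots.index.dayofweek
slots["hour"] = slots.index.hour
slots["minute"] = slots.index.minute

# Lagged features: the count one and two slots earlier.
slots["lag_1"] = slots["people"].shift(1)
slots["lag_2"] = slots["people"].shift(2)

# The first rows have no history, so drop them before training.
features = slots.dropna()
```

The people column would serve as the label and the remaining columns as inputs.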
Once you have your time features and labels (the corresponding people counts) ready, you can start training your models in the standard way:
split your data into training and validation sets,
train each model that you want to try and compare the results.
Any regressor should be suitable for this task, e.g. Ridge, Lasso, DecisionTreeRegressor, SVR etc. Note, however, that if you need to get the best time slot from a given range, you will need to make predictions for every slot in the range and pick the one which fits the criteria, although there may be cases where the smallest predicted value is not smaller than the value you compare it with.
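Picking the best slot from a range could be sketched as follows. The function fake_predict is a hypothetical stand-in for a trained regressor (it simply pretends the room is emptiest at 15:00); in practice it would build the feature row for a slot and call the model's predict method.

```python
import datetime as dt


def candidate_slots(day, start_hour=10, end_hour=20, minutes=30):
    """Start times of every slot between start_hour and end_hour."""
    start = dt.datetime.combine(day, dt.time(start_hour))
    count = (end_hour - start_hour) * 60 // minutes
    return [start + dt.timedelta(minutes=minutes * i) for i in range(count)]


def quietest_slot(day, predict_count, minutes=30):
    """Return (start, end) of the slot with the lowest predicted count."""
    best = min(candidate_slots(day, minutes=minutes), key=predict_count)
    return best, best + dt.timedelta(minutes=minutes)


def fake_predict(slot_start):
    """Hypothetical stand-in for a trained model: fewest people at 15:00."""
    return abs(slot_start.hour * 60 + slot_start.minute - 15 * 60)


start, end = quietest_slot(dt.date(2019, 12, 30), fake_predict)
print(f"{start:%H:%M} to {end:%H:%M}")
```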
If you do not get satisfying results with regressors, e.g. the mean or median squared errors are consistently too high, you could recast the problem as classification: instead of training a regressor to predict the number of people, you can train a classifier to predict whether the count is greater than 50 or not.
There are many ways to approach this problem. Once you try different models and examine the results, you will come up with ways to optimize the parameters, engineer features, pre-process the input etc.

Prediction Model to predict when the future events will happen next month

I have to develop a prediction model using Python to predict whether a site will crash next month or not, depending on the occurrences in the last 6 months. The input parameters are: Environment (Dev, Prod, Test), Region (NA, APAC, EMEA) and the date of the month.
I am using matplotlib, pandas and numpy. It will be a 2D DataFrame or a 3D Panel in pandas. I am not sure which, as there are 3 input parameters: Region, Env and Date.
I think the machine learning algorithm below should be used:
from sklearn.linear_model import LinearRegression
Please correct me if I am wrong.
Linear regression is fine, but calling it is just two lines of work. I would suggest trying multiple machine learning algorithms, tuning their hyperparameters and checking which gives the best performance. Moreover, you should look into feature engineering; maybe you could extract additional features from the data you already have.
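As one possible starting point for the feature engineering mentioned above, the categorical inputs can be one-hot encoded into a flat 2D table. The incident log below is invented for illustration, and the column names (environment, region, date, crashed) are assumptions rather than the asker's actual schema.

```python
import pandas as pd

# Hypothetical incident log; column names and rows are illustrative only.
log = pd.DataFrame({
    "environment": ["Dev", "Prod", "Test", "Prod"],
    "region": ["NA", "APAC", "EMEA", "NA"],
    "date": pd.to_datetime(["2019-01-03", "2019-01-17", "2019-02-08", "2019-02-21"]),
    "crashed": [0, 1, 0, 1],
})

# Numeric feature extracted from the date.
log["day_of_month"] = log["date"].dt.day

# One 0/1 column per category value; day_of_month is passed through as-is.
X = pd.get_dummies(
    log[["environment", "region", "day_of_month"]],
    columns=["environment", "region"],
)
y = log["crashed"]
```

X and y could then be passed to whichever estimators are being compared. One caveat when loading such data with pandas.read_csv: the string "NA" is parsed as a missing value by default, so the region column may need keep_default_na=False.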

Time Series prediction with multiple features in the input data

Assume we have time-series data that contains the daily order counts of the last two years:
We can predict future orders using Python's statsmodels library:
import statsmodels.api
fit1 = statsmodels.api.tsa.statespace.SARIMAX(
    train.Count, order=(2, 1, 4), seasonal_order=(0, 1, 1, 7)
).fit()
y_hat_avg['SARIMA'] = fit1.predict(
    start="2018-06-16", end="2018-08-14", dynamic=True
)
Result (don't mind the numbers):
Now assume that our input data has some unusual increases or decreases because of holidays or company promotions. So we added two columns that tell whether each day was a "holiday" and whether it was a day on which the company ran a "promotion".
Is there a method (and a way of implementing it in Python) to use this new type of input data to help the model understand the reason for the outliers, and also to predict future orders when "holiday" and "promotion_day" information is provided?
fit1.predict('2018-08-29', holiday=True, is_promotion=False)
# or
fit1.predict(start="2018-08-20", end="2018-08-25", holiday=[0,0,0,1,1,0], is_promotion=[0,0,1,1,0,1])
SARIMAX, as a generalisation of the SARIMA model, is designed to handle exactly this. From the docs,
Parameters:
endog (array_like) – The observed time-series process y;
exog (array_like, optional) – Array of exogenous regressors, shaped (nobs, k).
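You could pass holiday and promotion_day as an array of shape (nobs, 2) to exog, which will inform the model of the exogenous nature of some of these observations. When forecasting beyond the training data, the same two columns must also be supplied for the future dates through the exog argument of predict.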
This problem goes by different names, such as anomaly detection, rare event detection and extreme event detection.
There are some posts on the Uber Engineering blog that may be useful for understanding the problem and solution. Please look here and here.
Although it's not from statsmodels, you can use Facebook's Prophet library for time series forecasting, where you can pass dates with recurring events to your model.
See here.
Try this (it may or may not work based on your problem/data):
You can split your date into multiple features like day of week, day of month, month of year, year, is it the last day of the month?, is it the first day of the month?, and many more if you can think of them, and then use a normal ML algorithm like Random Forests, Gradient Boosting Trees or Neural Networks (especially with embedding layers for your categorical features, e.g. day of week) to train your model.
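The date splitting described above could be sketched with the standard library alone. The function name date_features and the dictionary keys are hypothetical; the resulting values would become the input columns of whichever model is chosen.

```python
import calendar
import datetime as dt


def date_features(day: dt.date) -> dict:
    """Split a date into simple calendar features."""
    # monthrange returns (weekday of first day, number of days in month).
    last_day = calendar.monthrange(day.year, day.month)[1]
    return {
        "day_of_week": day.weekday(),      # Monday = 0
        "day_of_month": day.day,
        "month": day.month,
        "year": day.year,
        "is_month_start": int(day.day == 1),
        "is_month_end": int(day.day == last_day),
    }


features = date_features(dt.date(2018, 8, 31))
```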
