Multivariate time series distribution forecasting problem


I am working on the following timeseries multi-class classification problem:
42 possible classes that are dependent on each other, I want to know the probability of each class for up to 56 days ahead
1 year of daily data so 365 observations
the class probabilities have a strong weekly seasonality
I have exogenous regressors that are strongly correlated with the output classes
I realise that I am trying to predict a lot of output classes with little data, but I am looking for a model (preferably with Python implementation) that is most suited for this use case.
Any recommendations on what model could work for this problem?
So far I have tried:
a tree-based model, but it struggles with the large number of classes and does not capture the time series component well
a VAR model, but the number of parameters to estimate becomes too high relative to the length of the series
predicting each class probability independently, but that assumes the series are independent, which is not the case

Related

Anomaly Testing - Linear Regression with t or not with t? Problems to understand the setup

If you want to check for an anomaly in stock data, many studies use a linear regression. Let's say you want to check whether there is a Monday effect, meaning that Monday is significantly worse than other days.
I understood that we can use a regression like: return = a + b DummyMon + e
a is the constant, b the regression coefficient, we have the Dummy for Monday and the error term e.
That's what I used in python:
First you add a constant to the anomaly:
anomaly = sm.add_constant(anomaly)
Then you build the model:
model = sm.OLS(returns, anomaly)  # "return" is a reserved word in Python, so the series is named "returns"
Then you fit the model:
results = model.fit()
I wonder if this is the correct model setup.
In this case a plot of the linear regression would just show two vertical areas above 0 (for no Monday) and 1 (for Monday) with all the returns. It looks pretty strange. Is this correct?
Should I somehow try to use the time (t) in the regression? If so, how can I do it with python? I thought about giving each date an increasing number, but then I wondered how to treat weekends.
I would assume that with many data points both approaches are similar if the time series is stationary, right? In the end I am doing a cross-section analysis and don't care about the time series aspect in this case, correct? (I heard about GARCH models etc., where this is different.)
Well, I am just learning and hope someone could give me some ideas about the topic.
Thank you very much in advance.

For time series analysis tasks (such as forecasting or anomaly detection), you may need a more advanced model, such as a recurrent neural network (RNN) from deep learning. Each RNN cell can be assigned to a time step; in your case, each cell can represent a day, an hour, half a day, etc.
The main purpose of RNNs is to let the model learn the time dependencies in the data. For example, if Monday has a bad effect, the corresponding RNN cells will learn parameters that reflect it. I would recommend doing some further research on this. Here are some resources that may help:
https://colah.github.io/posts/2015-08-Understanding-LSTMs/
(Also includes different types of RNN)
https://towardsdatascience.com/understanding-rnn-and-lstm-f7cdf6dfc14e
You can use the TensorFlow, Keras, or PyTorch libraries.
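For the dummy-variable regression described in the question, a minimal sketch may help show why the "two vertical areas" plot is expected. With a single 0/1 regressor, ordinary least squares gives a equal to the mean non-Monday return and b equal to the Monday mean minus the non-Monday mean, so no time index t is needed for this particular test. The data below is synthetic, and NumPy's least-squares solver is used here instead of statsmodels only to keep the example dependency-light; sm.OLS should produce the same coefficients and also report the t-statistic for b.

```python
import numpy as np
import pandas as pd

# Synthetic daily returns on business days (hypothetical data, not real prices).
rng = np.random.default_rng(0)
dates = pd.bdate_range("2020-01-01", periods=500)  # weekends are simply absent
returns = pd.Series(rng.normal(0.0005, 0.01, len(dates)), index=dates)

# Dummy regressor: 1 on Mondays, 0 otherwise.
monday = (returns.index.dayofweek == 0).astype(float)

# Design matrix [1, DummyMon]; solve return = a + b * DummyMon + e by least squares.
X = np.column_stack([np.ones(len(returns)), monday])
(a, b), *_ = np.linalg.lstsq(X, returns.to_numpy(), rcond=None)

# Group means for comparison with the fitted coefficients.
mean_other = returns[monday == 0].mean()
mean_monday = returns[monday == 1].mean()
print(f"a = {a:.6f}  (mean non-Monday return = {mean_other:.6f})")
print(f"b = {b:.6f}  (Monday minus non-Monday = {mean_monday - mean_other:.6f})")
```

Because the business-day index skips weekends, the question of how to number weekend dates does not arise in this setup.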

Train machine learning model with scikit learn for time-series prediction

I need to train a model with scikit-learn to predict the times when there will be fewer people in a room.
Here is what my dataset looks like:
Time PeopleCount
---------------------------------------------
2019-12-29 12:40:10 50
2019-12-29 12:42:10 30
2019-12-29 12:44:10 10
2019-12-29 12:46:10 10
2019-12-29 12:48:10 80
and so on...
This data will be available for 30 days.
Once the model is trained, I will query it for the likely time when there will be fewer people in the room between 10 AM and 8 PM. I expect the model to respond with 30-minute granularity, e.g. "3:00 PM to 3:30 PM".
What algorithm can I use for this problem and how can I achieve the goal? Or are there any other Python libraries than SciKit-Learn which can be used for this purpose?
I am new to machine learning, sorry for a naive question.

First of all, time-series prediction rests on the assumption that the current value depends, more or less, on past values. For instance, the count of 80 people at 2019-12-29 12:48:10 should be strongly influenced by the counts at 12:46:10, 12:44:10, and earlier. If there is no such correlation with past values, you would be better off using a different kind of algorithm for prediction.
While scikit-learn contains various machine learning modules, most of them specialize in classification. I think a classification algorithm would meet your needs if your data is not really a time series. scikit-learn also has some regression modules, although I do not think they are well suited to predicting time series data.
For predicting time series data, RNN and LSTM models (deep learning) are widely used, but scikit-learn does not provide a built-in implementation of them. So you might be better off studying TensorFlow or PyTorch, which are common frameworks for building RNN or LSTM models.

scikit-learn models do not recognize timestamps, so you will have to break down your timestamp column into a number of features, e.g. day of week, hour, etc. If you need 30-minute accuracy, you will have to aggregate the PeopleCount column somehow, e.g. record the average, minimum, or maximum number of people within each 30-minute interval. It may also be a good idea to create lagged features, i.e. the people count in the previous time slot, or 2, 3, or n time slots ago.
Once you have your time features and labels (the corresponding people counts) ready, you can start training your models in the standard way:
split your data into training and validation sets,
train each model that you want to try and compare the results.
Any regressor should be suitable for this task, e.g. Ridge, Lasso, DecisionTreeRegressor, SVR, etc. Note, however, that to get the best time slot from a given range you will need to make predictions for every slot in the range and pick the one that fits the criteria, although there may be cases where the smallest predicted value is not smaller than the value you compare it with.
If you do not get satisfying results with regressors, i.e. the mean or median squared errors are consistently too high, you could recast this as a classification problem: instead of training a regressor to predict the number of people, train a classifier to predict whether the count is greater than 50 or not.
There are many ways to approach this problem. Once you try different models and examine the results, you will come up with ways to optimize the parameters, engineer features, pre-process the input, etc.
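The steps above can be sketched roughly as follows. This is only an illustration under several assumptions: the readings are synthetic (with a made-up dip around 2 PM), the aggregation is the mean per 30-minute slot, the only lagged feature is the same slot on the previous day, and DecisionTreeRegressor is picked arbitrarily from the regressors listed above.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Synthetic 2-minute readings for 30 days (hypothetical data shaped like the question's).
rng = np.random.default_rng(0)
idx = pd.date_range("2019-12-01", periods=30 * 24 * 30, freq="2min")
hour = (idx.hour + idx.minute / 60).to_numpy()
count = 50 - 30 * np.exp(-((hour - 14) ** 2) / 2) + rng.normal(0, 5, len(idx))
raw = pd.DataFrame({"PeopleCount": np.clip(count, 0, None)}, index=idx)

# Aggregate to 30-minute slots (mean count per slot) and build features.
slots = raw["PeopleCount"].resample("30min").mean().to_frame("count")
slots["dayofweek"] = slots.index.dayofweek
slots["slot_of_day"] = slots.index.hour * 2 + slots.index.minute // 30
slots["lag_1d"] = slots["count"].shift(48)  # same slot on the previous day
slots = slots.dropna()

features = ["dayofweek", "slot_of_day", "lag_1d"]
# Chronological split: the last 5 days are held out (no shuffling for time series).
train, valid = slots.iloc[: -5 * 48], slots.iloc[-5 * 48 :]

model = DecisionTreeRegressor(max_depth=6, random_state=0)
model.fit(train[features], train["count"])
mae = np.mean(np.abs(model.predict(valid[features]) - valid["count"]))

# Predict every slot between 10:00 and 20:00 on the last day and pick the minimum.
days = valid.index.normalize()
window = valid[days == days.max()].between_time("10:00", "19:30")
pred = pd.Series(model.predict(window[features]), index=window.index)
best_start = pred.idxmin()
best_end = best_start + pd.Timedelta(minutes=30)
print(f"validation MAE: {mae:.2f}")
print(f"quietest slot: {best_start:%H:%M} to {best_end:%H:%M}")
```

The previous-day lag is used because it is already known when forecasting a whole day ahead; shorter lags (the previous slot, two slots ago) would only be available for predictions one step at a time.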

When should one use time series analysis vs. non-time series analysis?

I am trying to predict churn, so my dependent variable is binary. The independent variables can be categorical, integer, or time series data. I am at the feature selection stage and would like to know whether I should run correlation on the time series data. If I use a wrapper method with an ML algorithm for this problem, should I use models like ARIMA that are better suited to time series analysis, or a decision tree model?
I have tried Spearman correlation but am not finding any variables that are significantly correlated with the dependent variable.

You most likely should, since churn rate may be affected by macroeconomic factors that will show up in your autocorrelation function. I suggest looking at statsmodels and making sure you understand ACF and PACF plots (which can be produced with statsmodels quite easily) together with ARIMA models, so you can do some fine-tuning. As for feature selection, you can try using an overfitted neural network or a model with L1 regularization.
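As a rough illustration of what an autocorrelation function shows, the sketch below computes sample autocorrelations with pandas on a synthetic series that has a weekly cycle. The series and its parameters are made up for the example. For actual plots, statsmodels provides plot_acf and plot_pacf in statsmodels.graphics.tsaplots.

```python
import numpy as np
import pandas as pd

# Hypothetical daily series with a weekly pattern (stand-in for one time-series feature).
rng = np.random.default_rng(0)
t = np.arange(364)
series = pd.Series(10 + 3 * np.sin(2 * np.pi * t / 7) + rng.normal(0, 1, t.size))

# Sample autocorrelation at lags 1..14; a weekly cycle shows up as peaks at 7 and 14.
acf = pd.Series({lag: series.autocorr(lag=lag) for lag in range(1, 15)})
print(acf.round(2))
```

If a feature shows strong autocorrelation like this, a plain rank correlation against the churn label may miss structure that lagged or aggregated versions of the feature would pick up.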
https://www.statsmodels.org/stable/index.html

How to apply Gaussian naive bayes to predict traffic number in the future?

I have got some historical data on traffic and would like to predict the future.
I am using http://www.nuriaoliver.com/bicing/IJCAI09_Bicing.pdf as a reference. It applies a Bayesian network to predict the change in the number of bikes. That is where I got the Bayesian network from, and I would like to predict the changes using a Bayesian approach.
I ran into several problems. I tried to use naive Bayes to predict the number, but it seems naive Bayes only allows the output to be one of several discrete classes. In my case, the changes cannot be grouped into discrete classes (unlike predicting whether a person is "male" or "female", where the classifier has only 2 discrete outputs).
How can I apply the Bayesian approach in my case, and which Python packages could help me?

I would see this as a time series forecasting problem and not a classification problem. As you noted, you are not trying to label your data with a set of discrete classes. Given a series of observations x_1, x_2, ..., x_n, you are trying to predict x_(n+1), i.e. to forecast the next observation of the same variable in the series. Perhaps you could refer to this slide for a brief introduction to time series forecasting.
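To make the "predict x_(n+1) from x_1, ..., x_n" framing concrete, here is a minimal sketch of an autoregressive forecast fitted by ordinary least squares. It assumes the next value depends linearly on the previous two, and the series is simulated from a known AR(2) process purely for illustration. It is not the Bayesian network from the referenced paper.

```python
import numpy as np

# Hypothetical series x_1..x_n generated from a known AR(2) process.
rng = np.random.default_rng(0)
n = 500
x = np.zeros(n)
for i in range(2, n):
    x[i] = 0.6 * x[i - 1] - 0.2 * x[i - 2] + rng.normal()

# Regress x_t on [1, x_{t-1}, x_{t-2}] by ordinary least squares.
X = np.column_stack([np.ones(n - 2), x[1:-1], x[:-2]])
y = x[2:]
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

# One-step-ahead forecast of x_(n+1) from the last two observations.
x_next = coef @ np.array([1.0, x[-1], x[-2]])
print("estimated coefficients:", coef.round(2))
print("forecast for x_(n+1):", round(float(x_next), 3))
```

The output is a continuous number, which is the point made above: no discretisation into classes is needed.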
A quick start guide for time series forecasting with Python can be found here: https://machinelearningmastery.com/time-series-forecasting-methods-in-python-cheat-sheet/

Time Series prediction with multiple features in the input data

Assume we have time-series data that contains the daily order counts for the last two years:
We can predict future orders using Python's statsmodels library:
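import statsmodels.api as sm

# Fit a seasonal ARIMA model with a weekly (7-day) seasonal period
fit1 = sm.tsa.statespace.SARIMAX(
    train.Count, order=(2, 1, 4), seasonal_order=(0, 1, 1, 7)
).fit()
# Forecast the hold-out window from the fitted model
y_hat_avg['SARIMA'] = fit1.predict(
    start="2018-06-16", end="2018-08-14", dynamic=True
)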
Result (don't mind the numbers):
Now assume that our input data has some unusual increases or decreases because of holidays or company promotions. So we added two columns that indicate whether each day was a "holiday" and whether the company ran a "promotion" that day.
Is there a method (and a way of implementing it in Python) to use this new type of input data to help the model understand the reason for the outliers, and also to predict future orders given the "holiday" and "promotion_day" information?
fit1.predict('2018-08-29', holiday=True, is_promotion=False)
# or
fit1.predict(start="2018-08-20", end="2018-08-25", holiday=[0,0,0,1,1,0], is_promotion=[0,0,1,1,0,1])

SARIMAX, as a generalisation of the SARIMA model, is designed to handle exactly this. From the docs,
Parameters:
endog (array_like) – The observed time-series process y;
exog (array_like, optional) – Array of exogenous regressors, shaped (nobs, k).
You could pass the holiday and promotion_day as an array of size (nobs, 2) to exog, which will inform the model of the exogenous nature of some of these observations.

This problem goes by different names, such as anomaly detection, rare event detection, and extreme event detection.
There are some posts on the Uber Engineering blog that may be useful for understanding the problem and its solution. Please look here and here.

Although it's not from statsmodels, you can use Facebook's Prophet library for time series forecasting, where you can pass dates with recurring events to your model.
See here.

Try this (it may or may not work based on your problem/data):
You can split your date into multiple features, such as day of week, day of month, month of year, year, whether it is the last day of the month, whether it is the first day of the month, and many more if you can think of them. Then use a standard ML algorithm such as Random Forests, Gradient Boosting Trees, or Neural Networks (especially with embedding layers for your categorical features, e.g. day of week) to train your model.
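The date splitting could look something like the sketch below, which uses the pandas .dt accessor. The frame and column names are hypothetical, and only the calendar features are shown; the resulting columns would be joined with the "holiday" and "promotion" flags before training.

```python
import pandas as pd

# Hypothetical frame of dates; only the dates matter for this sketch.
df = pd.DataFrame({"date": pd.date_range("2018-01-29", periods=6, freq="D")})

# Expand each date into calendar features a tabular model can consume.
df["day_of_week"] = df["date"].dt.dayofweek  # Monday=0 ... Sunday=6
df["day_of_month"] = df["date"].dt.day
df["month"] = df["date"].dt.month
df["year"] = df["date"].dt.year
df["is_month_start"] = df["date"].dt.is_month_start
df["is_month_end"] = df["date"].dt.is_month_end
print(df)
```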
