LSTM - how to implement holiday features - python

The following picture shows a demand problem.
My question is how one can or should implement fixed holidays in an LSTM model. As seen here, those days have no demand and therefore cause sudden, strong one-day deviations from the average. I am specifically not referring to the change in trend between December and January.
An ARIMA model, for example, can handle such days well.
After hours of searching the internet, all I could find was advice on how to deal with a change in trend. However, that is not the case here: the trend remains the same and is only suspended for one day. I hope someone here has a paper or an approach for this kind of problem.

Since the holidays have predefined dates, why not change the value at each of those dates to one that wouldn't disturb the learning much, such as the previous value or the next one? Or you could simply remove the holidays from your data, so that the sequence is no longer disturbed by their drastic effect.
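A minimal sketch of both suggestions, assuming the demand is a daily array and a boolean holiday mask is already known. The function names and the toy numbers are hypothetical.

```python
import numpy as np

def fill_holidays_with_previous(demand, is_holiday):
    """Replace each holiday value with the most recent non-holiday value."""
    demand = np.asarray(demand, dtype=float)
    is_holiday = np.asarray(is_holiday, dtype=bool)
    filled = demand.copy()
    last_valid = np.nan  # stays NaN if the series starts on a holiday
    for i in range(len(filled)):
        if is_holiday[i]:
            filled[i] = last_valid
        else:
            last_valid = filled[i]
    return filled

def drop_holidays(demand, is_holiday):
    """Remove the holiday entries from the series altogether."""
    demand = np.asarray(demand, dtype=float)
    return demand[~np.asarray(is_holiday, dtype=bool)]

# Toy daily demand with one holiday (hypothetical values).
demand = [120, 118, 0, 125, 122]
is_holiday = [False, False, True, False, False]

filled = fill_holidays_with_previous(demand, is_holiday)
dropped = drop_holidays(demand, is_holiday)
```

Dropping the holidays shortens the sequence, so the same mask has to be applied again at prediction time.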

Related

how to analyze numerical and categorical variables at the same time?

I'm trying to analyze the data of a food ordering application.
The data consist of both numerical and categorical variables. The main variable I'm studying is the total delivery time of an order, which represents the time from placing the order to closing it. I want to find out which variables affect it the most.
An example of rows in the data is the following:
order id | branch id | date | time placed | day | period | items id | no. items | total no. items | total delivery time | total time in seconds
113113 | 31 | 2/2/2021 | 13:32:24 | Tuesday | afternoon | 571 | 4 | 11 | 00:46:19 | 2805
113113 | 31 | 2/2/2021 | 13:32:24 | Tuesday | afternoon | 573 | 4 | 11 | 00:46:19 | 2805
I want to study the effects of all the variables on the total time, even items id and branch id. Does a certain item affect the time? Do the day and the period of the day affect it as well?
I used linear regression to get the correlation between total time and the numerical variables, and tried one-way ANOVA for some categorical variables, but I didn't like the results. Is there a way to analyze all variables together without encoding the categorical variables?
I'm looking forward to seeing what other people say about this. Here's my two cents.
ML algos like Regression love numbers. ML algos like Classification love labels (non-numbers). You can certainly convert labeled data to 'numbered' data. One example is coding ['red','green','blue'] as [1,2,3], which produces weird things: 'red' is lower than 'blue', and if you average a 'red' and a 'blue' you get a 'green'. A more subtle example is coding ['low', 'medium', 'high'] as [1,2,3]. Here the ordering happens to make sense, but subtle inconsistencies can arise when 'medium' is not in the middle of 'low' and 'high'. Now, under the hood, I think classifiers convert labels to numbers, so if you feed in large, medium, and small, it isn't using large, medium, and small to do its analysis, it's converting those categories to numbers. I think. Maybe someone can confirm this for me.
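To make the 'red'/'green'/'blue' point concrete, here is a small sketch. The integer coding is the one described above. One-hot coding is not mentioned in this answer; it is shown only as a common alternative that avoids the artificial ordering, and the helper name is hypothetical.

```python
import numpy as np

categories = ["red", "green", "blue"]

# Integer coding as described above: red=1, green=2, blue=3.
code_of = {name: i + 1 for i, name in enumerate(categories)}

# The codes imply an order and a distance that the colors do not have:
# the average of 'red' and 'blue' lands exactly on the code of 'green'.
average_code = (code_of["red"] + code_of["blue"]) / 2

def one_hot(values, categories):
    """Give each category its own 0/1 column (hypothetical helper)."""
    index = {name: i for i, name in enumerate(categories)}
    encoded = np.zeros((len(values), len(categories)), dtype=int)
    for row, value in enumerate(values):
        encoded[row, index[value]] = 1
    return encoded

encoded = one_hot(["red", "blue", "green", "red"], categories)
```

With one column per category there is no implied ordering, at the cost of one extra column for every distinct value.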
Thus, I don't think it makes sense to try to measure any kind of relationship between IDs and specific outcomes, like 'totaltime', 'totaldays', etc. If you kick off a project on a Monday or a Friday, does the project end sooner or later than non-Monday-start or non-Friday-start projects? Well, maybe it does. But, is that correlation or causation? You can find correlations between all kinds of things, but these don't necessarily imply causation between these same things. Let's say you find a strong relationship between multiple projects that start on the second Monday of the month and all of these projects get finished off much faster than all other projects. This seems like pure coincidence, rather than causation. Or, there is some other factor impacting the outcome. Maybe projects that start on the second Monday of the month are typically small upgrades, rather than full-blown new undertakings, so the volume of work is less, and the project is done faster. However, starting the work on the second Monday of the month doesn't CAUSE the project to be finished off faster. Tell me if I am wrong. I'm always open to feedback.

Prophet model predicts negative values

I am new to Prophet (and Stack Overflow in general ;) ) and have some issues with creating a predictive model using Python. I am trying to predict daily sales of a product, using around 5 years of data. The data looks as follows: General data plot.
The company is closed on weekends and during holidays, so there are no orders on those days. I accounted for this by creating a dataframe with all the weekends/holidays and passing it as the argument for the holidays parameter. Other than that I didn't change anything in the model, so it looks like: Prophet(holidays = my weekend/holiday dataframe).
However, my model doesn't seem to work right and predicts negative values, see the following plot: Prediction 1. Here are the different component plots as extra information: trend, holidays, weekly, yearly. I also tried simply replacing the negative values in the prediction with 0, which gives a somewhat better result (see prediction 2), but I don't think this is the right way to tackle this problem. The last thing I tried was to remove all the weekends from the training and prediction data. The results weren't good either: prediction 3.
I would love to hear some tips from you guys, for things I could try to do. If anything is unclear or you need more information, just let me know. Thank you in advance!!
My suggestions:
Try normalization
If that doesn't work try using Recurrent Neural Networks

Deal with excessive number of zeros

ipdb> np.count_nonzero(test==0) / len(ytrue) * 100
76.44815766923736
I have a data file containing 24000 prices that I use for a time series forecasting problem. Instead of trying to predict the price, I tried to predict the log-return, i.e. log(P_t/P_{t-1}). I applied the log-return to the prices as well as to all the features. The predictions are not bad, but they tend toward zero. As you can see above, ~76% of the data are zeros.
Now the idea is probably to "look up for a zero-inflated estimator : first predict whether it's gonna be a zero; if not, predict the value".
In detail, what is the best way to deal with an excessive number of zeros? How can a zero-inflated estimator help me with that? Be aware that I am not a probabilist by training.
P.S. I am trying to predict the log-return at a time resolution of seconds for a high-frequency trading study. Be aware that it is a regression problem (not a classification problem).
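For illustration, the quoted two-step idea can be sketched as follows. This is a plain two-stage (hurdle-style) estimator written with NumPy only, not a full zero-inflated likelihood model. The function names, the hand-written logistic regression, and the synthetic data are all assumptions made for the example.

```python
import numpy as np

def _with_intercept(X):
    return np.column_stack([np.ones(len(X)), X])

def _sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_two_stage(X, y, steps=2000, lr=0.5):
    X1 = _with_intercept(X)
    is_nonzero = y != 0
    target = is_nonzero.astype(float)
    # Stage 1: logistic regression for P(y != 0), fitted by gradient descent.
    w_cls = np.zeros(X1.shape[1])
    for _ in range(steps):
        p = _sigmoid(X1 @ w_cls)
        w_cls -= lr * (X1.T @ (p - target)) / len(y)
    # Stage 2: ordinary least squares, fitted on the non-zero targets only.
    w_reg = np.linalg.lstsq(X1[is_nonzero], y[is_nonzero], rcond=None)[0]
    return w_cls, w_reg

def predict_two_stage(X, w_cls, w_reg, threshold=0.5):
    X1 = _with_intercept(X)
    p_nonzero = _sigmoid(X1 @ w_cls)
    # Predict exactly zero unless stage 1 expects a move.
    return np.where(p_nonzero >= threshold, X1 @ w_reg, 0.0)

# Synthetic stand-in data: the target is non-zero for roughly a quarter of the rows.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))
y = np.where(X[:, 0] > 0.7, 0.01 * X[:, 1], 0.0)

w_cls, w_reg = fit_two_stage(X, y)
pred = predict_two_stage(X, w_cls, w_reg)
```

The regression stage never sees the zeros, so it is not pulled toward predicting zero everywhere; the classifier alone decides when the output is zero.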
Update
That picture is probably the best prediction I have on the log-return, i.e. log(P_t/P_{t-1}). Although it is not bad, the remaining predictions tend to predict zero. As you can see in the question above, there are too many zeros. I probably have the same problem inside the features, since I take the log-return of the features as well, i.e. if F is a particular feature, then I apply log(F_t/F_{t-1}).
Here is one day of data, log_return_with_features.pkl, with shape (23369, 30, 161). Sorry, but I cannot tell you what the features are. Since I apply log(F_t/F_{t-1}) to all the features and to the target (i.e. the price), be aware that I added 1e-8 to all the features before applying the log-return operation to avoid division by 0.
Ok, so judging from your plot: it's the nature of the data, the price doesn't really change that often.
Try subsampling your original data a bit (perhaps by a factor of 5, just look at the data), so that you generally see a price movement with every time-tick. This should make any modeling much MUCH easier.
For the subsampling: I suggest you do simple regular downsampling in time domain. So if you have price data with a second resolution (i.e. one price tag every second), then simply take every fifth datapoint. Then proceed as you usually do, specifically, compute the log-increase in the price from this subsampled data. Remember that whatever you do, it must be reproducible during the test time.
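As a rough sketch of that procedure, assuming a one-dimensional price array sampled at regular intervals (the function name and the toy prices are made up for the example):

```python
import numpy as np

def subsampled_log_returns(prices, step=5):
    """Keep every `step`-th price, then compute log(P_t / P_{t-1}) on the result."""
    prices = np.asarray(prices, dtype=float)
    coarse_prices = prices[::step]
    return np.log(coarse_prices[1:] / coarse_prices[:-1])

# Toy prices that change only once every five ticks.
prices = np.array([100.0, 100.0, 100.0, 100.0, 100.0,
                   101.0, 101.0, 101.0, 101.0, 101.0,
                   100.5])

full_returns = np.log(prices[1:] / prices[:-1])    # mostly zeros
coarse_returns = subsampled_log_returns(prices, 5) # one move per step

zeros_before = np.count_nonzero(full_returns == 0)
zeros_after = np.count_nonzero(coarse_returns == 0)
```

The same `step` has to be used on the test data, otherwise the returns are not on the same time scale.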
If that is not an option for you for whatever reasons, have a look at something that can handle multiple time scales, e.g. WaveNet or Clockwork RNN.

Recurrent Neural Network for anomaly detection

I am implementing an anomaly detection system that will be used on different time series (one observation every 15 min for a total of 5 months). All these time series have a common pattern: high levels during working hours and low levels otherwise.
The idea presented in many papers is the following: build a model to predict future values and calculate an anomaly score based on the residuals.
What I have so far
I use an LSTM to predict the next time step given the previous 96 (1 day of observations) and then I calculate the anomaly score as the likelihood that the residuals come from one of the two normal distributions fitted on the residuals obtained with the validation set. I am using two different distributions, one for working hours and one for non-working hours.
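A minimal sketch of that scoring step, with synthetic validation residuals standing in for the real ones. Using the two-sided tail probability as the score is only one possible reading of "likelihood", and all names and numbers here are assumptions.

```python
import math
import numpy as np

def fit_gaussian(residuals):
    """Estimate mean and standard deviation of the validation residuals."""
    residuals = np.asarray(residuals, dtype=float)
    return residuals.mean(), residuals.std(ddof=1)

def anomaly_score(residual, mean, std):
    """Two-sided tail probability under N(mean, std^2); small means anomalous."""
    z = abs(residual - mean) / std
    return math.erfc(z / math.sqrt(2.0))

# Synthetic validation residuals: noisier during working hours.
rng = np.random.default_rng(0)
params = {
    True: fit_gaussian(rng.normal(0.0, 5.0, 500)),   # working hours
    False: fit_gaussian(rng.normal(0.0, 1.0, 500)),  # non-working hours
}

def score(residual, is_working_hour):
    mean, std = params[is_working_hour]
    return anomaly_score(residual, mean, std)

# The same residual is ordinary during working hours but unusual at night.
score_work = score(4.0, True)
score_night = score(4.0, False)
```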
The model detects point anomalies, such as sudden falls and peaks, very well, but it fails during holidays, for example.
If a holiday falls during the week, I expect my model to detect more anomalies, because it's an unusual daily pattern with respect to a normal working day.
But the predictions simply follow the previous observations.
My solution
Use a second and more lightweight model (based on time series decomposition) which is fed with daily aggregations instead of 15min aggregations to detect daily anomalies.
The question
This combination of two models lets me detect both kinds of anomalies and it works very well, but my idea was to use only one model, because I expected the LSTM to be able to "learn" the weekly pattern as well. Instead it strictly follows the previous time steps without taking into consideration that it is a working hour and the level should be much higher.
I tried adding exogenous variables to the input (hour of day, day of week) and increasing the number of layers and cells, but the situation is not much better.
Any consideration is appreciated.
Thank you
A note on your current approach
Training with MSE is equivalent to optimizing the likelihood of your data under a Gaussian with fixed variance and mean given by your model. So you are already training an autoencoder, though you do not formulate it so.
About the things you do
You don't give the LSTM a chance
Since you provide data from last 24 hours only, the LSTM cannot possibly learn a weekly pattern.
It could at best learn that the value should be similar as it was 24 hours before (though it is very unlikely, see next point) -- and then you break it with Fri-Sat and Sun-Mon data. From the LSTM's point of view, your holiday 'anomaly' looks pretty much the same as the weekend data you were providing during the training.
So you would first need to provide longer contexts during learning (I assume that you carry the hidden state on during test time).
Even if you gave it a chance, it wouldn't care
Assuming that your data really follows a simple pattern -- high value during and only during working hours, plus some variations of smaller scale -- the LSTM doesn't need any long-term knowledge for most of the datapoints. Putting in all my human imagination, I can only envision the LSTM benefiting from long-term dependencies at the beginning of the working hours, so just for one or two samples out of the 96.
So even if the loss value at the points would like to backpropagate through > 7 * 96 timesteps to learn about your weekly pattern, there are 7*95 other loss terms that are likely to prevent the LSTM from deviating from the current local optimum.
Thus it may help to weight the samples at the beginning of working hours more, so that the respective loss can actually influence representations from far history.
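One way such a weighting could look, as a sketch: a weighted MSE in which the 15-minute samples of the first working hour count more. The start hour of 8 and the boost factor of 10 are arbitrary assumptions, not values from the question.

```python
import numpy as np

def weighted_mse(y_true, y_pred, weights):
    """Mean squared error in which each sample contributes according to its weight."""
    y_true, y_pred, weights = (np.asarray(a, dtype=float)
                               for a in (y_true, y_pred, weights))
    return np.sum(weights * (y_true - y_pred) ** 2) / np.sum(weights)

def start_of_work_weights(hour_of_day, start_hour=8, boost=10.0):
    """Up-weight samples in the first working hour (start_hour and boost are assumed)."""
    hour_of_day = np.asarray(hour_of_day)
    return np.where(hour_of_day == start_hour, boost, 1.0)

# One day of 15-minute samples: 96 slots, four per hour.
hours = np.arange(96) // 4
weights = start_of_work_weights(hours)

# A single prediction error of 1.0 at 08:00, everything else perfect.
y_true = np.zeros(96)
y_pred = np.zeros(96)
y_pred[32] = 1.0

plain_loss = weighted_mse(y_true, y_pred, np.ones(96))
boosted_loss = weighted_mse(y_true, y_pred, weights)
```

Most deep learning libraries accept per-sample weights in some form, so the same idea carries over to the actual training loss.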
Your solution is a good thing
It is difficult to model sequences at multiple scales in a single model. Even you, as a human, need to "zoom out" to judge longer trends -- that's why all the Wall Street people have Month/Week/Day/Hour/... charts to watch their shares' prices on. Such multiscale modeling is especially difficult for an RNN, because it needs to process all the information, always, with the same weights.
If you really want one model to learn it all, you may have more success with deep feedforward architectures employing some sort of time-convolution, e.g. TDNNs, Residual Memory Networks (Disclaimer: I'm one of the authors.), or the recent one-architecture-to-rule-them-all, WaveNet. As these have skip connections over longer temporal context and apply different transformations at different levels, they have better chances of discovering and exploiting such an unexpected long-term dependency.
There are implementations of WaveNet in Keras lying around on GitHub, e.g. 1 or 2. I did not play with them (I've actually moved away from Keras some time ago), but esp. the second one seems really easy, with the AtrousConvolution1D.
If you want to stay with RNNs, Clockwork RNN is probably the model to fit your needs.
About things you may want to consider for your problem
So are there two data distributions?
This one is a bit philosophical.
Your current approach shows that you have a very strong belief that there are two different setups: workhours and the rest. You're even OK with changing part of your model (the Gaussian) according to it.
So perhaps your data actually comes from two distributions and you should therefore train two models and switch between them as appropriate?
Given what you have told us, I would actually go for this one (to have a theoretically sound system). You cannot expect your LSTM to learn that there will be low values on Dec 25. Or that there is a deadline and this weekend consists purely of working hours.
Or are there two definitions of anomaly?
One philosophical point more. Perhaps you personally consider two different types of anomaly:
A weird temporal trajectory, unexpected peaks, oscillations, whatever is unusual in your domain. Your LSTM supposedly handles these already.
And then, there is a different notion of anomaly: Value of certain bound in certain time intervals. Perhaps a simple linear regression / small MLP from time to value would do here?
Let the NN do all the work
Currently, you effectively model the distribution of your quantity in two steps: First, the LSTM provides the mean. Second, you supply the variance.
You might instead let your NN (together with additional 2 affine transformations) directly provide you with a complete Gaussian by producing its mean and variance; much like in Variational AutoEncoders (https://arxiv.org/pdf/1312.6114.pdf, appendix C.2). Then, you need to optimize directly the likelihood of your following sample under the NN-distribution, rather than just MSE between the sample and the NN output.
This will allow your model to tell you when it is very strict about the following value and when "any" sample will be OK.
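As a sketch of the loss this implies, here is the Gaussian negative log-likelihood written with NumPy. Having the network output the log-variance rather than the variance is an assumption on my side (it keeps the variance positive); in a real setup the same formula would be written in your deep learning framework.

```python
import numpy as np

def gaussian_nll(y, mean, log_var):
    """Mean negative log-likelihood of y under N(mean, exp(log_var))."""
    y, mean, log_var = (np.asarray(a, dtype=float) for a in (y, mean, log_var))
    per_sample = 0.5 * (np.log(2.0 * np.pi) + log_var
                        + (y - mean) ** 2 / np.exp(log_var))
    return np.mean(per_sample)

y = np.array([1.0, 2.0, 3.0])
mean = np.array([1.5, 2.0, 2.0])

# With the variance fixed to 1, the loss is MSE / 2 plus a constant.
mse = np.mean((y - mean) ** 2)
nll_fixed_var = gaussian_nll(y, mean, np.zeros(3))

# For a large error, a model that admits high variance is penalized less.
nll_confident = gaussian_nll([3.0], [0.0], [0.0])
nll_uncertain = gaussian_nll([3.0], [0.0], [np.log(9.0)])
```

The first comparison is the fixed-variance case from the note on MSE at the top of this answer; the second shows how a predicted variance lets the model say that "any" sample will be OK.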
Note, that you can take this approach further and have your NN produce "any" suitable distribution. E.g. if your data live in-/can be sensibly transformed to- a limited domain, you may try to produce a Categorical distribution over the space by having a Softmax on the output, much like WaveNet does (https://arxiv.org/pdf/1609.03499.pdf, Section 2.2).
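A rough sketch of the categorical variant: the bounded value is discretized into class indices and the network's logits are scored with a softmax cross-entropy. For simplicity this uses uniform bins, whereas WaveNet applies a mu-law transformation before quantizing; the bin count and function names are assumptions.

```python
import numpy as np

def to_class_index(values, low, high, n_bins):
    """Map values from [low, high] to integer bins 0 .. n_bins - 1 (uniform bins)."""
    values = np.clip(np.asarray(values, dtype=float), low, high)
    scaled = (values - low) / (high - low)
    return np.minimum((scaled * n_bins).astype(int), n_bins - 1)

def softmax(logits):
    shifted = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def cross_entropy(logits, class_index):
    """Mean negative log-probability that the softmax assigns to the true bins."""
    probs = softmax(logits)
    picked = probs[np.arange(len(class_index)), class_index]
    return -np.mean(np.log(picked))

classes = to_class_index([0.0, 0.49, 1.0], low=0.0, high=1.0, n_bins=4)

# Uninformative logits give a uniform distribution over the four bins.
uniform_loss = cross_entropy(np.zeros((3, 4)), classes)
```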

How to forecast in python using machine learning , from a given set of geographical data?

I was analyzing some geographical data and attempting to predict/forecast the next occurrence of an event with respect to time and its geographical position. The data was in the following format (with sample data)
Timestamp Latitude Longitude Event
13307266 102.86400972 70.64039541 "Event A"
13311695 102.8082912 70.47394645 "Event A"
13314940 102.82240522 70.6308513 "Event A"
13318949 102.83402128 70.64103035 "Event A"
13334397 102.84726242 70.66790352 "Event A"
First step was classifying it into 100 zones, so that reduces dimensions and complexity.
Timestamp Zone
13307266 47
13311695 65
13314940 51
13318949 46
13334397 26
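The question does not say how the 100 zones were formed. One possible way, shown here purely as an assumption, is a regular 10 x 10 grid over a bounding box; the box limits below are made up and the resulting zone numbers are not meant to match the table above.

```python
import numpy as np

def grid_zone(lat, lon, lat_range, lon_range, n_rows=10, n_cols=10):
    """Assign each point to a cell of an n_rows x n_cols grid, numbered row by row."""
    lat = np.asarray(lat, dtype=float)
    lon = np.asarray(lon, dtype=float)
    row = ((lat - lat_range[0]) / (lat_range[1] - lat_range[0]) * n_rows).astype(int)
    col = ((lon - lon_range[0]) / (lon_range[1] - lon_range[0]) * n_cols).astype(int)
    # Points on or outside the box edge are pushed into the nearest cell.
    row = np.clip(row, 0, n_rows - 1)
    col = np.clip(col, 0, n_cols - 1)
    return row * n_cols + col

# Sample coordinates from the question, with an assumed bounding box.
latitude = [102.86400972, 102.8082912, 102.82240522, 102.83402128, 102.84726242]
longitude = [70.64039541, 70.47394645, 70.6308513, 70.64103035, 70.66790352]

zones = grid_zone(latitude, longitude,
                  lat_range=(102.8, 102.9), lon_range=(70.4, 70.7))
```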
The next step was time series analysis, and I got stuck here for 2 months. I read a lot of literature and figured these were my options:
* ARIMA (auto-regression method)
* Machine Learning
I wanted to use machine learning to forecast in Python but couldn't really figure out how. Specifically, are there any Python libraries or open-source code for this use case that I can build upon?
EDIT 1:
To clarify, the data is loosely dependent on past data but over a period of time is uniformly distributed.
The best way to visualize the data is to imagine N agents controlled by an algorithm which allots them the task of picking up resources from grids. The resources are a function of the socioeconomic structure of society and also strongly dependent on geography. It is in the interest of the "algorithm" to be able to predict demand by zone and by time.
p.s:
For Auto-regressive models like ARIMA Python already has a library http://pypi.python.org/pypi/statsmodels .
Without example data or existing code I can't offer you anything concrete.
However, often it's helpful to re-phrase your problem in the nomenclature of the field you want to explore. In ML terms:
Your problem's features: How your inputs are specified. Timestamp is continuous, geographic zone is discrete.
Your problem's target label: an event, precisely whether or not a given event has occurred.
Your problem is supervised: target labels for previous data are available. You have previous instances of (timestamp, geographic zone) to event mappings.
The target label is discrete, so this is a classification problem (as opposed to a regression problem, where the output is continuous).
So I'd say you have a supervised classification problem. As an aside you may want to do some sort of time regularisation first; I'm guessing there are going to be patterns of the events depending on what time of the day, day of the month, or month of the year it is, and you may want to represent this as an additional feature.
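One common way to represent time of day as an additional feature is a sine/cosine pair, sketched below. This particular encoding is my choice rather than something stated above, and it assumes the timestamps are in seconds, which the question does not confirm.

```python
import numpy as np

SECONDS_PER_DAY = 24 * 60 * 60

def time_of_day_features(timestamps_s):
    """Encode time of day as a point on the unit circle (assumes seconds)."""
    timestamps_s = np.asarray(timestamps_s, dtype=float)
    phase = 2.0 * np.pi * (timestamps_s % SECONDS_PER_DAY) / SECONDS_PER_DAY
    return np.column_stack([np.sin(phase), np.cos(phase)])

# Midnight, 06:00, and midnight one day later.
features = time_of_day_features([0, 6 * 3600, SECONDS_PER_DAY])
```

Because the encoding is circular, 23:59 and 00:00 end up close together, which a raw timestamp or an hour number would not give you. The same pattern works for day of week or month of year with a different period.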
Taking a look at one of the popular Python ML libraries available, scikit-learn, here:
http://scikit-learn.org/stable/supervised_learning.html
and consulting a recent posting on a cheatsheet for scikit-learn by one of the contributors:
http://peekaboo-vision.blogspot.de/2013/01/machine-learning-cheat-sheet-for-scikit.html
Your first good bet would be to try Support Vector Machines (SVM), and if that fails maybe give k Nearest Neighbours (kNN) a shot as well. Note that using an ensemble classifier is usually superior to using just one instance of a given SVM/kNN.
How, exactly, to apply SVM/kNN with time as a feature may require more research, since AFAIK (and others will probably correct me) SVM/kNN require bounded inputs with a mean of zero (or normalised to have a mean of zero). Just doing some random Googling you may be able to find certain SVM kernels, for example a Fourier kernel, that can transform a time-series feature for you:
SVM Kernels for Time Series Analysis
http://www.stefan-rueping.de/publications/rueping-2001-a.pdf
scikit-learn handily allows you to specify a custom kernel for an SVM. See:
http://scikit-learn.org/stable/auto_examples/svm/plot_custom_kernel.html#example-svm-plot-custom-kernel-py
With your knowledge of ML nomenclature, and example data in hand, you may want to consider posting the question to Cross Validated, the statistics Stack Exchange.
EDIT 1: Thinking about this problem more, you really need to understand whether your features and corresponding labels are independent and identically distributed (IID) or not. For example, what if you were modelling how forest fires spread over time? It's clear that the likelihood that a given zone catches fire is contingent on its neighbours being on fire or not. AFAIK SVM and kNN assume the data is IID. At this point I'm starting to get out of my depth, but I think you should at least give several ML methods a shot and see what happens! Remember to cross-validate! (scikit-learn does this for you).
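A minimal sketch of trying both classifiers with cross-validation in scikit-learn. The features (hour of day, day of week), the toy label, and the synthetic data are placeholders, not the asker's data; the StandardScaler step is there because of the zero-mean concern raised above.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (hypothetical): events happen during the daytime.
rng = np.random.default_rng(0)
n_samples = 300
hour = rng.uniform(0, 24, n_samples)
day_of_week = rng.integers(0, 7, n_samples)
X = np.column_stack([hour, day_of_week])
y = ((hour > 8) & (hour < 18)).astype(int)

# Scale the inputs to zero mean and unit variance before each classifier.
models = {
    "svm": make_pipeline(StandardScaler(), SVC(kernel="rbf")),
    "knn": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
}

# 5-fold cross-validation; each entry holds the five per-fold accuracies.
cv_scores = {name: cross_val_score(model, X, y, cv=5)
             for name, model in models.items()}
```

Note that plain k-fold splitting treats the rows as IID, so for strongly time-dependent data the scores may be optimistic.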
