Has anyone tried to predict a specific pattern in time series data?
Example: At a specific time, there is a huge upward spike in certain variables in a time series...
How would I build a model to predict that spike the next time it occurs?
Please respond if anyone is working in this area.
I tried converting that particular series of data into a NumPy array and feeding it to the model, but the model does not accept it.
Here is what the data looks like:
This data was generated in a controlled manner so that the spikes are close to each other. In the actual case they could be random, and our main objective is to catch this pattern and count its occurrences.
Das, you could try implementing LSTM based Neural Network Models.
See:
https://machinelearningmastery.com/time-series-prediction-lstm-recurrent-neural-networks-python-keras/
It is still preferred that the data contains a trend. If the upward spike happens around the same time of the recurring time interval, it is more likely that you get a better prediction result.
In the image you shared, there seems to be a trend in the data. Hence LSTM models can extract the pattern fairly efficiently and output a prediction.
Statistical modelling of the data may also provide good results.
See: https://orangematter.solarwinds.com/2019/12/15/holt-winters-forecasting-simplified/
Das, if outputting the total number of peaks is the sole requirement, then I think heavy neural network models are a bit of an overkill. Neural network models can do the job pretty well, but they require a lot of training data and fine-tuning of the weights and biases to give a really good result.
How about implementing a thresholding-based technique, where you increment a counter every time the data value crosses a preset threshold? With such an approach you should make sure to group very nearby peaks together so that they count as just one. To do that, you could set a threshold on the x axis too.
For instance, with respect to the given plot, let the y-threshold be 4. You will get a count of 5 if you consider the y-axis threshold (y value 4) alone, because at x value 15:48.2 there are two peaks that cross y value 4. If you also set a threshold on the x axis, these nearby peaks are grouped together within the preset limit and the final count will be 4 (which is the requirement).
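A minimal sketch of that counting scheme, assuming the series is available as parallel lists of numeric timestamps and values. The function name `count_spikes` and the parameter `min_gap` (the "x-axis threshold") are made up for illustration, and the sample data is synthetic, not taken from your plot.

```python
def count_spikes(times, values, y_threshold, min_gap):
    """Count groups of samples above y_threshold.

    A sample above the threshold starts a new group only if it is more than
    min_gap after the previous above-threshold sample; otherwise it is merged
    into the current group. min_gap should be at least the sampling interval,
    or a wide peak spanning several samples would be counted more than once.
    """
    count = 0
    last_above_time = None
    for t, v in zip(times, values):
        if v > y_threshold:
            if last_above_time is None or t - last_above_time > min_gap:
                count += 1
            last_above_time = t
    return count


# Synthetic example: four narrow peaks, two of which are close together.
times = list(range(12))
values = [0, 5, 0, 0, 0, 5, 0, 5, 0, 0, 0, 5]

raw_count = count_spikes(times, values, y_threshold=4, min_gap=0)      # y threshold only
grouped_count = count_spikes(times, values, y_threshold=4, min_gap=2)  # nearby peaks merged
```

With the y threshold alone the two neighbouring peaks are counted separately; with `min_gap=2` they collapse into one event.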
I would like to predict values (e.g. transport volumes). As input data I have the volumes from the last two years. I have already done some time series prediction on those values, basically following the instructions in Basics of Time Series Prediction and Techniques for Time Series Prediction.
I now would like to go a step further and include some indicators (e.g. economic indicators) in the prediction to see if this will increase the accuracy of the predictions.
What is the right approach to do so? Looking around, I found this post, which describes basically the same use case. Unfortunately it got no responses.
One approach might be to do a "simple" prediction based on a model with the current volume and indicators as features and the future volume as label. But I would then lose the time series, the connection between the single data points, so to say.
Do you have experience with such predictions? What did work in your case? Please point me in the right direction!
One approach might be to do a "simple" prediction based on a model
with the current volume and indicators as features and the future
volume as label. But I would then lose the time series, the connection
between the single data points, so to say.
In this case a common solution is to include N 'lagging' values (i.e. volumes for N previous periods) as features for every observation, in addition to some indicator value features. This allows using pretty much any regression model for time series forecasting. Just make sure there's no data leakage of the 'future' values when calculating your indicators.
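A minimal sketch of building such a lagged dataset, assuming one indicator value per period and a one-step-ahead target. The function name `make_lagged_dataset` and the toy numbers are hypothetical.

```python
def make_lagged_dataset(volumes, indicators, n_lags):
    """Build feature rows and labels for one-step-ahead forecasting.

    Each row holds the n_lags most recent volumes up to and including
    period t, plus the indicator value known at period t. The label is
    the volume at period t + 1, so no future value leaks into a row.
    """
    rows, labels = [], []
    for t in range(n_lags - 1, len(volumes) - 1):
        lagged = list(volumes[t - n_lags + 1 : t + 1])
        rows.append(lagged + [indicators[t]])
        labels.append(volumes[t + 1])
    return rows, labels


volumes = [10, 12, 13, 15, 18]
indicators = [1.0, 1.1, 1.2, 1.3, 1.4]
X, y = make_lagged_dataset(volumes, indicators, n_lags=2)
```

The resulting `X` and `y` can be passed to any regression model. Keep the train/test split chronological so the model is never evaluated on periods earlier than the ones it was trained on.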
ipdb> np.count_nonzero(test==0) / len(ytrue) * 100
76.44815766923736
I have a data file containing 24000 prices, which I use for a time series forecasting problem. Instead of trying to predict the price, I tried to predict the log-return, i.e. log(P_t/P_{t-1}). I applied the log-return to the prices as well as to all the features. The predictions are not bad, but they tend towards zero. As you can see above, ~76% of the data are zeros.
Now the idea is probably to "look up for a zero-inflated estimator : first predict whether it's gonna be a zero; if not, predict the value".
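A rough sketch of how that two-stage idea could be wired together, assuming a stage-1 model that outputs the probability of a non-zero return and a stage-2 regressor trained only on the non-zero samples. Both models are left out here; the function names, the 0.5 cutoff and the numbers are hypothetical.

```python
def two_stage_predict(p_nonzero, value_if_nonzero, cutoff=0.5):
    """Hard decision: predict 0 unless stage 1 considers a move likely."""
    return [v if p >= cutoff else 0.0 for p, v in zip(p_nonzero, value_if_nonzero)]


def expected_value_predict(p_nonzero, value_if_nonzero):
    """Soft alternative: P(non-zero) times the value predicted given non-zero."""
    return [p * v for p, v in zip(p_nonzero, value_if_nonzero)]


# Made-up stage outputs for three time steps.
p_nonzero = [0.1, 0.9, 0.5]
value_if_nonzero = [0.002, -0.003, 0.002]

hard = two_stage_predict(p_nonzero, value_if_nonzero)
soft = expected_value_predict(p_nonzero, value_if_nonzero)
```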
In detail, what is the best way to deal with an excessive number of zeros? How can a zero-inflated estimator help me with that? Be aware that I am not a probabilist by training.
P.S. I am trying to predict the log-return, where the time unit is seconds, for a high-frequency trading study. Be aware that it is a regression problem (not a classification problem).
Update
That picture is probably the best prediction I have for the log-return, i.e. log(P_t/P_{t-1}). Although it is not bad, the remaining predictions tend towards zero. As you can see in the question above, there are too many zeros. I probably have the same problem in the features, since I take the log-return of the features as well, i.e. if F is a particular feature, then I apply log(F_t/F_{t-1}).
Here is one day of data, log_return_with_features.pkl, with shape (23369, 30, 161). Sorry, but I cannot tell you what the features are. Since I apply log(F_t/F_{t-1}) to all the features and to the target (i.e. the price), be aware that I added 1e-8 to all the features before applying the log-return operation to avoid division by 0.
Ok, so judging from your plot: it's the nature of the data, the price doesn't really change that often.
Try subsampling your original data a bit (perhaps by a factor of 5, just look at the data), so that you generally see a price movement with every time-tick. This should make any modeling much MUCH easier.
For the subsampling: I suggest you do simple regular downsampling in the time domain. So if you have price data with a one-second resolution (i.e. one price every second), then simply take every fifth datapoint. Then proceed as you usually do; specifically, compute the log-return of the price from this subsampled data. Remember that whatever you do must be reproducible at test time.
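A minimal sketch of that downsampling step, assuming the prices sit in a plain Python list at one-second resolution. The function name and the toy prices are made up, and the factor of 5 is only the starting point suggested above.

```python
import math


def downsampled_log_returns(prices, factor=5):
    """Keep every `factor`-th price, then compute log(P_t / P_{t-1}) on the kept prices."""
    sampled = prices[::factor]
    return [math.log(curr / prev) for prev, curr in zip(sampled, sampled[1:])]


# Toy series: the price moves only once every few seconds.
prices = [100.0] * 5 + [101.0] * 10
returns = downsampled_log_returns(prices, factor=5)
```

On the original one-second grid this toy series has 13 zero returns out of 14; after downsampling only one of the two returns is zero. Apply exactly the same function with the same factor at test time.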
If that is not an option for you for whatever reasons, have a look at something that can handle multiple time scales, e.g. WaveNet or Clockwork RNN.
I am implementing an anomaly detection system that will be used on different time series (one observation every 15 min for a total of 5 months). All these time series have a common pattern: high levels during working hours and low levels otherwise.
The idea presented in many papers is the following: build a model to predict future values and calculate an anomaly score based on the residuals.
What I have so far
I use an LSTM to predict the next time step given the previous 96 (1 day of observations), and then I calculate the anomaly score as the likelihood that the residual comes from one of two normal distributions fitted on the residuals obtained on the validation set. I use two different distributions, one for working hours and one for non-working hours.
The model detects point anomalies, such as sudden falls and peaks, very well, but it fails during holidays, for example.
If a holiday falls during the week, I expect my model to detect more anomalies, because it is an unusual daily pattern compared with a normal working day.
But the predictions simply follow the previous observations.
My solution
Use a second and more lightweight model (based on time series decomposition) which is fed with daily aggregations instead of 15min aggregations to detect daily anomalies.
The question
This combination of two models lets me detect both kinds of anomalies and it works very well, but my idea was to use only one model, because I expected the LSTM to be able to "learn" the weekly pattern as well. Instead it strictly follows the previous time steps without taking into account that it is a working hour and the level should be much higher.
I tried adding exogenous variables to the input (hour of day, day of week) and increasing the number of layers and cells, but the situation is not much better.
Any suggestions are appreciated.
Thank you
A note on your current approach
Training with MSE is equivalent to optimizing the likelihood of your data under a Gaussian with fixed variance and mean given by your model. So you are already training an autoencoder, though you do not formulate it so.
About the things you do
You don't give the LSTM a chance
Since you provide data from last 24 hours only, the LSTM cannot possibly learn a weekly pattern.
It could at best learn that the value should be similar as it was 24 hours before (though it is very unlikely, see next point) -- and then you break it with Fri-Sat and Sun-Mon data. From the LSTM's point of view, your holiday 'anomaly' looks pretty much the same as the weekend data you were providing during the training.
So you would first need to provide longer contexts during learning (I assume that you carry the hidden state on during test time).
Even if you gave it a chance, it wouldn't care
Assuming that your data really follows a simple pattern -- high value during and only during working hours, plus some variations of smaller scale -- the LSTM doesn't need any long-term knowledge for most of the datapoints. Putting in all my human imagination, I can only envision the LSTM benefiting from long-term dependencies at the beginning of the working hours, so just for one or two samples out of the 96.
So even if the loss value at the points would like to backpropagate through > 7 * 96 timesteps to learn about your weekly pattern, there are 7*95 other loss terms that are likely to prevent the LSTM from deviating from the current local optimum.
Thus it may help to weight the samples at the beginning of working hours more, so that the respective loss can actually influence representations from far history.
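A minimal sketch of such a weighting, assuming a per-sample weighted MSE. The function names, the boosted step indices (36 and 37, i.e. 09:00 and 09:15 on a 15-minute grid) and the boost factor are hypothetical choices, not values taken from your setup.

```python
def sample_weights(steps_of_day, boosted_steps=(36, 37), boost=10.0):
    """One weight per sample: `boost` at the chosen steps of the day, 1.0 elsewhere."""
    return [boost if s in boosted_steps else 1.0 for s in steps_of_day]


def weighted_mse(y_true, y_pred, weights):
    """Weighted mean of the squared errors."""
    weighted_errors = [w * (t - p) ** 2 for t, p, w in zip(y_true, y_pred, weights)]
    return sum(weighted_errors) / sum(weights)


# Toy example: the model misses the jump at the start of working hours.
steps = [35, 36, 37, 38]
y_true = [1.0, 8.0, 9.0, 9.0]
y_pred = [1.0, 1.0, 9.0, 9.0]

weights = sample_weights(steps)
loss_weighted = weighted_mse(y_true, y_pred, weights)
loss_plain = weighted_mse(y_true, y_pred, [1.0] * len(steps))
```

The single miss at step 36 contributes far more to `loss_weighted` than to `loss_plain`. Most deep learning frameworks accept per-sample weights during training, so the same idea carries over without a custom loss.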
Your solution is a good thing
It is difficult to model sequences at multiple scales in a single model. Even you, as a human, need to "zoom out" to judge longer trends -- that's why all the Wall Street people have Month/Week/Day/Hour/... charts to watch their shares' prices on. Such multiscale modeling is especially difficult for an RNN, because it needs to process all the information, always, with the same weights.
If you really want one model to learn it all, you may have more success with deep feedforward architectures employing some sort of time-convolution, e.g. TDNNs, Residual Memory Networks (Disclaimer: I'm one of the authors.), or the recent one-architecture-to-rule-them-all, WaveNet. As these have skip connections over longer temporal context and apply different transformations at different levels, they have better chances of discovering and exploiting such an unexpected long-term dependency.
There are implementations of WaveNet in Keras lying around on GitHub, e.g. 1 or 2. I did not play with them (I actually moved away from Keras some time ago), but especially the second one seems really easy, with the AtrousConvolution1D.
If you want to stay with RNNs, Clockwork RNN is probably the model to fit your needs.
About things you may want to consider for your problem
So are there two data distributions?
This one is a bit philosophical.
Your current approach shows that you have a very strong belief that there are two different setups: workhours and the rest. You're even OK with changing part of your model (the Gaussian) according to it.
So perhaps your data actually comes from two distributions and you should therefore train two models and switch between them as appropriate?
Given what you have told us, I would actually go for this one (to have a theoretically sound system). You cannot expect your LSTM to learn that there will be low values on Dec 25. Or that there is a deadline and this weekend consists purely of working hours.
Or are there two definitions of anomaly?
One more philosophical point. Perhaps you personally consider two different types of anomaly:
A weird temporal trajectory, unexpected peaks, oscillations, whatever is unusual in your domain. Your LSTM supposedly handles these already.
And then there is a different notion of anomaly: values crossing a certain bound in certain time intervals. Perhaps a simple linear regression / small MLP from time to value would do here?
Let the NN do all the work
Currently, you effectively model the distribution of your quantity in two steps: First, the LSTM provides the mean. Second, you supply the variance.
You might instead let your NN (together with two additional affine transformations) directly provide you with a complete Gaussian by producing its mean and variance, much like in Variational AutoEncoders (https://arxiv.org/pdf/1312.6114.pdf, appendix C.2). Then you need to directly optimize the likelihood of the following sample under the NN-distribution, rather than just the MSE between the sample and the NN output.
This will allow your model to tell you when it is very strict about the following value and when "any" sample will be OK.
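A minimal sketch of the corresponding loss for a single sample, assuming the network outputs the mean and the log-variance (a common parameterisation that keeps the variance positive). The function name `gaussian_nll` is made up, and in a real model you would write this with your framework's tensor operations.

```python
import math


def gaussian_nll(y, mean, log_var):
    """Negative log-likelihood of y under a Gaussian N(mean, exp(log_var))."""
    var = math.exp(log_var)
    return 0.5 * (math.log(2.0 * math.pi) + log_var + (y - mean) ** 2 / var)


# The same error of 3.0 costs less when the model admits it is uncertain...
confident_miss = gaussian_nll(3.0, 0.0, log_var=0.0)          # variance 1
uncertain_miss = gaussian_nll(3.0, 0.0, log_var=math.log(9))  # variance 9

# ...but a large variance costs more when the prediction is exact.
confident_hit = gaussian_nll(0.0, 0.0, log_var=0.0)
uncertain_hit = gaussian_nll(0.0, 0.0, log_var=math.log(9))
```

With the variance fixed, minimising this loss reduces to minimising the squared error, which is the MSE training you have now.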
Note that you can take this approach further and have your NN produce "any" suitable distribution. E.g. if your data live in, or can be sensibly transformed to, a limited domain, you may try to produce a Categorical distribution over the space by having a Softmax on the output, much like WaveNet does (https://arxiv.org/pdf/1609.03499.pdf, Section 2.2).
I have a list of time series of values measured in different places. These measurements may or may not be correlated (mostly depending on their relative positions, but it is plausible that some very close detectors would actually measure decorrelated series). I would like to predict the values of the whole set, taking into account the series of all of them and their correlation through time. If it is of any help, the values should also have relative periodicity.
EDIT: I have access to the generated power of several solar panels. These solar panels are spread spatially, and I would like to use them as 'irradiance detectors'. Knowing the sun illumination in several places in the past, I wish to identify correlations in between signals, which could then be used to make predictions of illumination.
Regardless of the usual patterns of production through a day (as seen in the image), what I am interested in is the information I can extract from one panel's past to predict another one's future.
I think I would need a neural network to solve this problem, but I am not sure how to feed it: I thought of using a temporal window and feeding my NN a few past values from A, B and C, but I am afraid that is a little weak.
The image shows an example of what my data looks like.
How can I predict the next values of curve A knowing past values of A, B and C?
How to handle this prediction?
I think the easiest way is to train 3 models with the same input but each will predict one value (A, B or C).
If you are sure about the correlation between the input variables and their impact on the predicted output, you may create one neural network with a common branch (probably an RNN over the 3 stacked inputs) followed by 3 different prediction heads, each producing one prediction: A, B or C. The Fast R-CNN architecture is a great example of this.
The best way to achieve this task is to use an RNN.
A good tutorial for learning how to develop such a neural network is here:
https://www.tensorflow.org/tutorials/recurrent
I also found this link, where they managed to train an RNN for a rather similar problem:
http://blog.datatonic.com/2016/11/traffic-in-london-episode-ii-predicting.html
An even better inspiration:
http://machinelearningmastery.com/time-series-prediction-lstm-recurrent-neural-networks-python-keras/
I have a classification problem with time series data.
Each example has 10 variables which are measured at irregular intervals and in the end the object is classified into 1 of the 2 possible classes (binary classification).
I have only the final class of the example to learn from during training. But when given a new example, I would like to make a prediction at each timestamp (in an online manner). So, if the new example had 25 measurements, I would like to make 25 predictions of its class; one at each timestamp.
The way I am implementing this currently is by using the min, mean and max of the measurements of its 10 variables up to that point as features for classification. Is this optimal? What would be a better way?
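For reference, a minimal sketch of those running features for a single variable, updated online as each measurement arrives. The function name is made up, and in the real setup this would be repeated for each of the 10 variables.

```python
def running_summary_features(measurements):
    """Return one (min, mean, max) tuple per timestamp, using only data seen so far."""
    features = []
    lowest = highest = None
    total = 0.0
    for n, x in enumerate(measurements, start=1):
        lowest = x if lowest is None else min(lowest, x)
        highest = x if highest is None else max(highest, x)
        total += x
        features.append((lowest, total / n, highest))
    return features


features = running_summary_features([2.0, 4.0, 0.0])
```

Each tuple depends only on past and current measurements, so one prediction can be made per timestamp without looking ahead.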
If you have to make predictions at each timestamp, then this doesn't become a time series problem (unless you plan to use the sequence of previous observations to make your next prediction, in which case you will need to train a sequence-based model). Assuming you can only train a model based on the final data you observe, there can be many approaches, but I'd recommend you use a Random Forest with a large number of trees and 3 or 4 variables in each tree. That way, even if some variables don't give you the desired input, other trees can still make predictions with fair accuracy. Besides this, there are many ensemble approaches.
Your current approach may be practical as a very loose approximation, but it doesn't make much statistical sense.