Keras - how can LSTM for time series be so accurate?

So I'm starting to test LSTMs for time series prediction, and I've found a few different notebooks to use with my own data (here's one example).
What they all have in common is that they predict one timestep into the future and do a really good job of matching the test data. I tried forcing an outlier in there, and the prediction almost perfectly matched it.
What's going on here? There's no way the model can learn this from the pattern of the data since it's a made up point, but supposedly by looking at the previous time steps this model will "know" an outlier is coming next? I must be missing something, because it predicts the data with an outlier just as well as the data without an outlier...
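A hedged way to see what is probably going on (a common explanation, not something taken from the linked notebook): a one-step-ahead model can score very well by learning something close to persistence, i.e. predicting a value near the previous observation. Even a naive lag-1 baseline appears to "match" an outlier, because the plotted prediction series is essentially the actual series shifted by one step:

import numpy as np

# hypothetical series with a forced outlier, mimicking the experiment above
series = np.sin(np.linspace(0, 20, 200))
series[50] = 5.0

# naive persistence baseline: the prediction for step t is the value at t-1
preds = series[:-1]
actual = series[1:]

# the spike shows up in the predictions one step late, which looks like a
# near-perfect match when the two curves are plotted on top of each other
print("baseline MAE:", np.mean(np.abs(preds - actual)))

If the LSTM's predictions, shifted back by one step, line up with the inputs, the model has mostly learned to copy the last value rather than to anticipate outliers.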

Related

In Leave One Out Cross Validation, how can I use the `shap.Explainer()` function to explain a machine learning model?

Background of the Problem
I want to explain the outcome of machine learning (ML) models using SHapley Additive exPlanations (SHAP), which is implemented in the shap library for Python. As a parameter of the function shap.Explainer(), I need to pass an ML model (e.g. XGBRegressor()). However, in each iteration of Leave One Out Cross Validation (LOOCV), the ML model will be different, since in each iteration I am training on a different dataset (one participant's data is left out). The model will also differ because I am doing feature selection in each iteration.
Then, My Question
In LOOCV, how can I use the shap.Explainer() function of the shap library to explain a machine learning model? Note that I have checked several tutorials (e.g. this one, this one) and also several SO questions (e.g. this one), but I failed to find an answer to this problem.
Thanks for reading!
Update
I know that in LOOCV, the model found in each iteration can be explained by shap.Explainer(). However, as there are 250 participants, if I apply SHAP to each model, there will be 250 outputs! Thus, I want a single output that summarizes the 250 models.
You seem to train the model on 250 data points while doing LOOCV. LOOCV is about choosing a model and hyperparameters that ensure the best generalization ability.
Model explanation is different from training: you are not sifting through different sets of hyperparameters (note, LOOCV with 250 folds is already overkill -- would you do that with 250,000 rows?); rather, you are trying to understand which features influence the output, in which direction, and by how much.
Training has its own limitations (availability of data, whether new data resembles the data the model was trained on, whether the model is good enough to pick up the peculiarities of the data and generalize well, etc.), but don't overestimate the explanation exercise either. It is still only an attempt to understand how inputs influence outputs. You may be willing to average 250 different matrices of SHAP values, but do you expect the result to be much different from a single random train/test split?
Note as well:
However, in each iteration of the Leave One Out Cross Validation (LOOCV), the ML model will be different as in each iteration, I am training on a different dataset (1 participant’s data will be different).
In each iteration of LOOCV the model is still the same (same features; the hyperparameters may differ, depending on your definition of iteration). It is still the same dataset (same features).
Also, the model will be different as I am doing feature selection in each iteration.
That doesn't matter. Feed the resulting model to the SHAP explainer and you will get what you want.
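If you do want one aggregate picture across folds, a minimal sketch (assuming numeric numpy arrays X and y and an XGBRegressor, as in the question) is to explain only the held-out row in each LOOCV iteration and stack those rows into a single matrix:

import numpy as np
import shap
from sklearn.model_selection import LeaveOneOut
from xgboost import XGBRegressor

def loocv_shap(X, y):
    rows = []
    for train_idx, test_idx in LeaveOneOut().split(X):
        model = XGBRegressor().fit(X[train_idx], y[train_idx])
        explainer = shap.Explainer(model)
        # SHAP values for the single held-out participant of this fold
        rows.append(explainer(X[test_idx]).values)
    return np.vstack(rows)  # shape: (n_participants, n_features)

shap.summary_plot(loocv_shap(X, y), X) then condenses the 250 per-fold outputs into one plot. If per-fold feature selection drops columns, the SHAP rows would need to be re-aligned to a common feature set first.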

Anomaly Testing - Linear regression with t or without t? Problems understanding the setup

If you want to check for an anomaly in stock data, many studies use a linear regression. Let's say you want to check if there is a Monday effect, meaning that Monday is significantly worse than the other days.
I understood that we can use a regression like: return = a + b * DummyMon + e
where a is the constant, b the regression coefficient, DummyMon the dummy for Monday, and e the error term.
That's what I used in Python (note that return is a reserved keyword in Python, so the return series needs another name, e.g. returns):
First you import statsmodels and add a constant to the anomaly dummy:
import statsmodels.api as sm
anomaly = sm.add_constant(anomaly)
Then you build the model:
model = sm.OLS(returns, anomaly)
Then you fit the model:
results = model.fit()
I wonder if this is the correct model setup.
In this case, a plot of the linear regression would just show two vertical bands of points above 0 (not Monday) and 1 (Monday) with all the returns. It looks pretty strange. Is this correct?
Should I somehow use the time (t) in the regression? If so, how can I do that in Python? I thought about giving each date an increasing number, but then I wondered how to treat weekends.
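For illustration, a minimal sketch (assuming returns is a pandas Series indexed by trading dates) of how both pieces could be built: the Monday dummy straight from the dates, and a trend term t that simply counts trading days, so weekends never enter the index at all:

import numpy as np
import pandas as pd
import statsmodels.api as sm

X = pd.DataFrame(index=returns.index)
X["DummyMon"] = (returns.index.dayofweek == 0).astype(int)  # Monday == 0
X["t"] = np.arange(len(returns))  # trading-day counter, skips weekends
X = sm.add_constant(X)
results = sm.OLS(returns, X).fit()
print(results.summary())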
I would assume that with many data points both approaches are similar, if the time series is stationary, right? In the end I am doing a cross-sectional analysis and don't care about the time series aspect in this case, correct? (I have heard about GARCH models etc., where this is different.)
Well, I am just learning and hope someone could give me some ideas about the topic.
Thank you very much in advance.
For time series analysis tasks (such as forecasting or anomaly detection), you may need a more advanced model, such as a Recurrent Neural Network (RNN) from deep learning. You can assign any time step to an RNN cell; in your case, every RNN cell can represent a day, or maybe an hour or half a day, etc.
The main purpose of RNNs is to let the model capture the time dependencies in the data. For example, if Monday has a bad effect, the corresponding RNN cells will learn parameters that reflect it. I would recommend doing some further research on this. Here are some resources that may help:
https://colah.github.io/posts/2015-08-Understanding-LSTMs/
(Also includes different types of RNN)
https://towardsdatascience.com/understanding-rnn-and-lstm-f7cdf6dfc14e
You can use the TensorFlow, Keras, or PyTorch libraries.
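As a starting point, a minimal Keras sketch (window size and layer sizes are assumptions, not from the links above) of an LSTM that reads a window of 30 daily returns and predicts the next one:

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(30, 1)),  # 30 time steps, 1 feature per step
    tf.keras.layers.LSTM(32),              # recurrent layer captures time dependencies
    tf.keras.layers.Dense(1),              # next-day return
])
model.compile(optimizer="adam", loss="mse")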

How can you do time series forecasting in Tensorflow (or with other tools) where features of the label timestep are known?

This is a question about a general approach rather than a specific coding problem. I'm trying to do time series forecasting with Tensorflow where features of the label timestep are known to the model. E.g. a human trying to predict a variable a week from now would know things that are going to happen in the next week that will affect that variable. So a window of 20 timesteps where the label is the 20th timestep would look something like this:
Timesteps 1-19 would each have a set of features plus the timeseries data
Timestep 20 would have a set of features which are known, plus the timeseries label which is unknown
Is there a model that could handle this sort of data? I've gone through the Tensorflow time series forecasting tutorial, done a Coursera course on Tensorflow time series forecasting and searched elsewhere but I can't find anything. I'm fairly new to this so apologies for any imprecise language.
I once tried this kind of TS problem by stacking a multivariate model with another machine learning model. My idea was to take the normal TS model's output and add it as another feature in a second model that only takes the last time step's info as input. But it is complicated, and it can overfit a lot even if the second model is carefully regularized. The idea is to use the info from step 1 to window_size - 1 to predict a rough output at step window_size, then use the info at step window_size to reduce the residual between the TS model's output and the actual label. But I don't think this approach is theoretically sound, and the result might be worse than using a TS model without feeding in the target step's info.
I don't think TensorFlow has an API for your problem, because this type of problem is not a normal TS problem. Usually people just treat this kind of problem as a regression or classification problem.
I am not an expert on this either, but I happened to attempt the exact same problem, so this is just my personal experience...
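A rough sketch of that two-stage idea (all names are hypothetical; stage 1 is whatever TS model you already have, and stage 2 corrects its forecast using the known features of the target step):

import numpy as np
from sklearn.linear_model import Ridge

def two_stage(rough_preds, known_feats, y):
    # rough_preds: stage-1 forecasts for step 20 from the TS model, shape (n,)
    # known_feats: features of step 20 that are known in advance, shape (n, k)
    # stage 2 learns to correct the rough forecast using the known features
    X2 = np.column_stack([rough_preds, known_feats])
    model2 = Ridge().fit(X2, y)
    return model2, model2.predict(X2)

As the answer warns, the second stage can overfit, so regularization (here the Ridge penalty) and a proper holdout set matter.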

Do we apply a Fourier/wavelet transform to the entire time series or only to the training set for forecasting?

I was trying to implement the WSAE-LSTM model from the paper A deep learning framework for financial time series using stacked autoencoders and long-short term memory. In the first step, a Wavelet Transform is applied to the time series, although the exact implementation is not outlined in the paper.
The paper hints at applying the Wavelet Transform to the whole dataset. I was wondering if this leaks data from the test set into training? This article also identifies the problem.
From the article:
I’m sure you’ve heard many times that whenever you’re normalizing a time series for a ML model to fit your normalizer on the train set first then apply it to the test set. The reason is quite simple, our ML model behaves like mean reverter so if we normalize our entire dataset in one go we’re basically giving our model the mean value it needs to revert to. I’ll give you a little clue if we knew the future mean value for a time series we wouldn’t need machine learning to tell us what trades to do ;)
It’s basically the exact same issue as normalising your train and test set in one go. You’re leaking future information into each time step and not even in a small way. In fact you can run a little experiment yourself; the higher a level wavelet transform you apply, miraculously the more “accurate” your ML model’s output becomes.
Can someone tell me whether the Wavelet Transform "normalizes" the dataset in a way that leads to data leakage when forecasting? Should it be applied to the whole dataset or only to the training set?
You are correct. Applying the transform to the whole dataset introduces "tachyons" (leakage from the future) and makes the analysis essentially worthless.
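A minimal sketch of the leak-free variant (using PyWavelets; the paper's exact scheme is not specified): transform only the training segment and leave the test segment untouched:

import numpy as np
import pywt

def denoise_train_only(series, split, wavelet="haar", level=2):
    train = series[:split]
    coeffs = pywt.wavedec(train, wavelet, level=level)
    coeffs[-1] = np.zeros_like(coeffs[-1])  # drop finest details as simple denoising
    smoothed = pywt.waverec(coeffs, wavelet)[: len(train)]
    return smoothed, series[split:]  # the test segment stays raw

A stricter alternative is a rolling or expanding-window transform, so that every transformed point only ever sees its own past.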

Formatting and combining word frequency with other data for machine learning in Python

I'm new to machine learning algorithms. I have read the scikit-learn website and other SO posts extensively, which led me to build my first machine learning model using RandomForestClassifier and LinearSVC.
I'm working on medical notes. Each patient stay is associated (or not) with a code corresponding to a complication (bleeding, infection, heart attack...).
Using the notes, fitted and transformed with CountVectorizer and TfidfTransformer, I can accurately predict most of the codes. However, I'd like to add more data to my training dataset: length of stay, number of operations, titles of operations, ICU stay duration, etc.
After searching the web and SO, I ended up appending all the continuous/binary/scaled values to my word frequency array,
e.g. [0, 0, 0.34, 0, 0.45, 0, 2, 45] (the last 2 numbers are added data, whereas the previous ones come from CountVectorizer and tfidf.fit_transform(train_set)).
However, this seems to me a crude way to combine data, and a huge number of words could mask the other data.
I tried to shape my data like [[0,0,0.34,0,0.45,0],[2],[45]], but it doesn't work.
I searched the web, but found no real clue, even though I'm probably not the first one facing this issue... :p
Thanks for your help
Edit:
Thanks for your detailed, valuable answer; I really appreciated it. However, what exactly is the range 0-1: is it the predict_proba value (http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier.predict)? I understood that the score is the accuracy of the prediction model. Then, when you have all your predictions depending on each variable, do you average all of them? Finally, I'm working with multiple outputs; I guess that's not a problem, since I can get a prediction for each of the outputs (btw, predict_proba(X) gives me an array like [array([[0., 1.]]), array([[0.2, 0.8]]), ...] with a random forest classifier; I guess one of the numbers is the probability of the output, but I haven't explored this yet!)
Your first solution of just appending to the list is the correct one. However, you should think about what this implies. If you have 100 words and add two additional features, each specific word gets the same "weight" as the added features, i.e. your added features won't be treated very strongly in the model. Additionally, you're saying that the last feature, with a value of 45, is 100x the value of the feature 4th from the end (0.45).
One common way to get around that is to use an ensemble model. Instead of adding those features to your list of words and predicting, first build a prediction model using just the words. That prediction will be in the range 0-1 and will capture the "sentiment" of the note. Then, scale your other variables (min-max scaler, normal distribution, etc.). Finally, combine the score from the words with the two scaled variables and run another prediction on a list like [.86, .2, .65]. In this way, you have transformed all of the words into a sentiment score, which you can use as a feature.
Hope that helps.
EDIT PER YOUR UPDATE ABOVE
Yes, in this instance you could use predict_proba, but really, if everything is scaled correctly and you are using 1/0 as the targets for a class, you don't need predict_proba. The idea is to take the prediction from the words and combine it with the other variables. You do not average the predictions: you make a prediction from the predictions! This is called ensemble learning (stacking). Train another model with the output of your predictions as the features. Here is the flow of what you need to do.
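A minimal sketch of that flow (names and models are assumptions; in practice the sentiment should come from out-of-fold predictions to avoid leaking the training labels):

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MinMaxScaler

def stacked_model(notes, extras, y):
    # stage 1: words-only model -> one probability ("sentiment") per note
    tfidf = TfidfVectorizer()
    X_words = tfidf.fit_transform(notes)
    word_clf = LogisticRegression().fit(X_words, y)
    sentiment = word_clf.predict_proba(X_words)[:, 1]
    # stage 2: combine the 0-1 score with min-max-scaled extra variables
    scaled = MinMaxScaler().fit_transform(extras)  # e.g. length of stay, ICU days
    X2 = np.column_stack([sentiment, scaled])
    return RandomForestClassifier().fit(X2, y)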
Thanks for your time and your detailed answer. I think I get it. In short:
Prediction based on words: for each bag of words of the training set (t1), you pull out a "sentiment".
Create a new array for each training-set row with the sentiment and the other values -> new training set (t2).
Make a prediction based on t2.
Apply the previous steps to the test set.
One more question, though!
What is the "sentiment" value?! For each bag of words, I have a sparse matrix (CountVectorizer + TF-IDF). So how do you calculate the sentiment? Do you run each row of the test set against the rest of the test set? And is your sentiment the clf.predict(X) value?
