I'm trying to use LSTM to predict information on timestep sequences.
My data looks like this: I have a few different samples of relatively long sequences (>100,000 timesteps) and I'm trying to solve an N-class classification problem where each sample is labeled with a different ID. Now I'm trying to understand how to properly prepare my data so the LSTM will train on each sample individually.
In the most basic case, I just feed each sample completely to the network:
from tensorflow.keras import models, layers

model = models.Sequential()
model.add(layers.Embedding(FEATURES_NUMBER, 30))
model.add(layers.LSTM(32, return_sequences=True))
model.compile(optimizer="adam",
              loss='categorical_crossentropy', metrics=['acc'])
history = model.fit(train_data,
                    train_labels,
                    epochs=10,
                    batch_size=128,
                    validation_data=(validation_data, validation_labels))
Where train_data is of shape: (4, 100000, 1).
But many blog posts (like here) say that training an LSTM on very long sequences might harm the training, so I don't understand how to properly split the data with respect to the LSTM's internal state.
I can split each 100,000-long sequence into 500-long sub-sequences, so my data would have shape (800, 500, 1). But can I tell the LSTM to still make sense of the larger sequences (keep internal state between sub-sequences of the same larger sequence and re-initialize it when switching to a new sequence)?
I'd be happy if someone could shed some light over that process!
The problem has been here for a while; not sure if you are still interested in my humble two cents. Here is the thing about the LSTM: as elegantly designed as it is, it still suffers from vanishing and/or exploding gradients. Adding a forget gate for the cell state alleviates the problem, because when a forget gate outputs 0, the back-propagation process stops "flowing" backwards at that gate. However, this does not mean the LSTM is free from exploding/vanishing gradients when you feed a very long sequence into it. Just imagine a case where the forget gate outputs 1 fifty times in a row: the gradients for the last output under BPTT would look very similar to those of the original RNN, because the gates are effectively "turned off".
Why would that happen? Because your initial set of parameters is likely not optimal, and because a long sequence requires BPTT through more time steps, which makes the chances of that happening higher. You can try training your LSTM on your segmented data subsets.
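One way to do that (a sketch, not the only option): split each long sequence into fixed-length sub-sequences, make the Keras LSTM stateful so its internal state carries over between consecutive batches, and call reset_states() yourself whenever you move on to the next original sequence. SUB_LEN, FEATURES_NUMBER and N_CLASSES below are illustrative placeholders, and one-hot labels are assumed:

import numpy as np
from tensorflow.keras import models, layers

SUB_LEN, FEATURES_NUMBER, N_CLASSES = 500, 1000, 4    # illustrative values

model = models.Sequential([
    layers.Embedding(FEATURES_NUMBER, 30, batch_input_shape=(1, SUB_LEN)),
    layers.LSTM(32, stateful=True),            # state persists across batches
    layers.Dense(N_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["acc"])

for epoch in range(10):
    for seq, label in zip(train_data, train_labels):      # one long sequence at a time
        for sub in np.split(seq, len(seq) // SUB_LEN):    # 500-step sub-sequences
            model.train_on_batch(sub[None, :, 0], label[None, :])  # assumes one-hot label
        model.reset_states()                   # re-initialize before the next sequence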
Related
I am training an LSTM to forecast the next value of a time series. Let's say I have training data of shape (2345, 95) and a total of 15 files with this data; this means I have 2345 windows with 50% overlap between them (the time series was divided into windows). Each window has 95 timesteps. If I use the following model:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.layers import Input, LSTM, Dense
from tensorflow.keras.models import Model

input1 = Input(shape=(95, 1))
lstm1 = LSTM(units=100, return_sequences=False,
             activation="tanh")(input1)
outputs = Dense(1, activation="sigmoid")(lstm1)
model = Model(inputs=input1, outputs=outputs)
model.compile(loss=keras.losses.BinaryCrossentropy(),
              optimizer=tf.keras.optimizers.Adam(learning_rate=0.01))
I am feeding this data using a generator that passes one whole file at a time, so one epoch has 15 steps. Now my question is: in a given epoch, for a given step, does the LSTM remember the previous window it saw, or is the memory of the LSTM reset after seeing each window? And if it remembers the previous windows, is the memory reset only at the end of an epoch?
I have seen similar questions, like TensorFlow: Remember LSTM state for next batch (stateful LSTM) or https://datascience.stackexchange.com/questions/27628/sliding-window-leads-to-overfitting-in-lstm, but I either did not quite understand the explanation or was unsure whether what I wanted was explained. I'm looking for a more technical explanation of where in the LSTM architecture the whole memory/hidden state is reset.
EDIT:
So from my understanding there are two concepts we can call "memory" here: the weights that are updated through BPTT, and the hidden state of the LSTM cell. For a given window of timesteps the LSTM can remember what the previous timestep was; this is what the hidden state is for, I think. The weight updates, if I'm understanding this correctly, do not directly reflect memory.
The size of the hidden state, in other words how much the LSTM remembers, is determined by the batch size, which in this case is one whole file. But other questions/answers (https://datascience.stackexchange.com/questions/27628/sliding-window-leads-to-overfitting-in-lstm and https://stackoverflow.com/a/50235563/13469674) state that if we have two windows, for instance [1,2,3] and [4,5,6], the LSTM does not know that 4 comes after 3 because they are in different windows, even though they belong to the same batch. So I'm still unsure how exactly memory is maintained in the LSTM.
It makes some sense that the hidden state is reset between windows when we look at the LSTM cell diagram. But then the weights are only updated after each step, so where does the hidden state come into play?
What you are describing is called "Back Propagation Through Time", you can google that for tutorials that describe the process.
Your concern is justified in one respect and unjustified in another respect.
The LSTM is capable of learning across multiple training iterations (e.g. multiple 15-step intervals), because the LSTM state is being passed forward from one iteration to the next. This feeds information forward across multiple training iterations.
Your concern is justified in that the model's weights are only updated with respect to the 15 steps (plus whatever batch size you have). As long as 15 steps is long enough for the model to catch valuable patterns, it will generally learn a good set of weights that generalize well beyond 15 steps. A good example of this is the Shakespeare character-level model described in Karpathy's "The Unreasonable Effectiveness of Recurrent Neural Networks".
In summary, the model is learning to create a good hidden state for the next step averaged over sets of 15 steps as you have defined. It is common that an LSTM will produce a good generalized solution by looking at data in these limited segments. Akin to batch training, but sequentially over time.
I might note that 100 is a more typical upper limit for the number of steps in an LSTM. At ~100 steps you start to see a vanishing gradient problem in which the earlier steps contribute nearly nothing to the gradient.
Note that it is important to ensure you are passing the LSTM state forward from training step to training step over the course of an episode (any contiguous sequence). If this step was missed the model would certainly suffer.
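One way to make that state passing explicit in Keras (a sketch assuming the 95-step, 100-unit model above; stepper and windows_of_one_file are made-up names) is to expose the LSTM state with return_state and feed it back in as initial_state for the next window:

import numpy as np
from tensorflow.keras import layers, Model

UNITS, STEPS, FEATURES = 100, 95, 1               # sizes taken from the model above

window = layers.Input(shape=(STEPS, FEATURES))
h_in = layers.Input(shape=(UNITS,))
c_in = layers.Input(shape=(UNITS,))
out, h_out, c_out = layers.LSTM(UNITS, return_state=True)(
    window, initial_state=[h_in, c_in])
pred = layers.Dense(1, activation="sigmoid")(out)
stepper = Model([window, h_in, c_in], [pred, h_out, c_out])

# Carry the state forward across consecutive windows of one file,
# then reset it to zeros before starting the next file (episode).
h = np.zeros((1, UNITS), dtype="float32")
c = np.zeros((1, UNITS), dtype="float32")
for w in windows_of_one_file:                     # hypothetical iterable of (95, 1) arrays
    p, h, c = stepper.predict([w[None, ...], h, c], verbose=0)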
I am trying to train the following RNN in TensorFlow. It takes an 11-D numeric vector as input and outputs a sequence of 10 multiclass probability vectors over 14 mutually exclusive classes.
from tensorflow import keras

model = keras.models.Sequential([
    keras.layers.SimpleRNN(30, return_sequences=False, input_shape=[1, 11]),
    keras.layers.RepeatVector(10),
    keras.layers.SimpleRNN(30, return_sequences=True),
    keras.layers.SimpleRNN(14, return_sequences=True, activation="softmax")
])
model.compile(loss="categorical_crossentropy",
              optimizer="adam")
history = model.fit(X_train, y_train, epochs=50, batch_size=32,
                    validation_split=0.2)
However, even for a small dataset of 10 points, it takes hundreds of epochs to fit. As you can see in the figure, the loss barely goes down with the training epochs:
When I try to train the real training set, the loss simply does not move. Any idea of how to successfully train this model?
You can find the first 10 datapoints here
And the first 100 datapoints here
To load the data just use:
import pickle

with open('train10.pickle', 'rb') as f:
    X_train, y_train = pickle.load(f)
Thank you very much for your help
EDIT:
To provide additional context, what I have in this problem is a continuous numeric embedding in 11-D to start with, and the output is a sequence of one-hot encodings. So you can think of this problem as training a decoder, or doing a decompression to get a sort of "word" back from a point in the numeric space (each one-hot vector in the output could be thought of as a "letter"). I previously tried to train a non-recurrent network outputting the full list of one-hot encodings (the whole "word") at once, but the performance was also very poor. I just do not see where the bottleneck is: the dimensionality of the numeric embedding, the training algorithm, etc. My tinkering so far with types of layers, numbers of layers, and learning rates did not produce substantial improvements. I am open to sharing the whole dataset if you think that can help. Thank you very much!
Each machine learning problem is unique and it is very difficult to say exactly what the issue is without having access to the full data set. Some possibilities are:
The model specification is suboptimal: try varying the number of hidden layers, the number of neurons in each layer, using GRU/LSTM layers instead of SimpleRNN, adding some dropout layers, etc. (see the sketch after this list).
The training algorithm needs to be adjusted - try using a different optimizer, a different batch size, a different train-test split ratio etc.
The input data needs more (or less) preprocessing - try normalizing/standardizing the input features if you haven't already.
You need to do more work on feature engineering - think deeply about all potential relationships between the input data and the target, and try combining columns to create ratios etc. While the NN can theoretically figure this out for itself, it is often effective to try and reduce the work it has to do in this respect.
Your problem may just be difficult or even unsolvable. There may just be no strong relationship between the input and the target.
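As one illustration of the first two suggestions, here is a hedged sketch of the model above with GRU layers swapped in for SimpleRNN, a dropout layer added, and an explicit learning rate (none of the sizes or rates are tuned values):

from tensorflow import keras

model = keras.models.Sequential([
    keras.layers.GRU(30, return_sequences=False, input_shape=[1, 11]),
    keras.layers.RepeatVector(10),
    keras.layers.GRU(30, return_sequences=True),
    keras.layers.Dropout(0.2),                    # illustrative dropout rate
    keras.layers.TimeDistributed(keras.layers.Dense(14, activation="softmax")),
])
model.compile(loss="categorical_crossentropy",
              optimizer=keras.optimizers.Adam(learning_rate=1e-3))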
I am attempting to use Keras to build an activity classifier from accelerometer signals. However, I am experiencing extreme overfitting of the data even with the simplest of models.
The input data is of shape (10,3) and contains roughly .1 second of data from the accelerometer in 3 dimensions. The model is simply
from keras.models import Sequential
from keras.layers import Flatten, Dense

model = Sequential()
model.add(Flatten(input_shape=(10, 3)))
model.add(Dense(2, activation='softmax'))
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
The model should output the label [1,0] for walking activities and [0,1] for non-walking activities. After training I get 99.8% accuracy (if only it was real...). When I attempt to predict on data that wasn't used for training, I get 50% accuracy, verifying that the net isn't really "learning" anything except to predict a single class value.
The data is prepared from 100 Hz triaxial accelerometer signals. I am not preprocessing the data in any way except for windowing it into bins of length 10 that overlap the previous bin by 50%. What measures can I take to make the network produce actual predictions? I have tried increasing the window size but the results remain the same. Any advice/general tips are greatly appreciated.
Ian
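For reference, the windowing described above could be built along these lines (a sketch; the function name and array shapes are made up for illustration):

import numpy as np

def make_windows(signal, window=10, overlap=0.5):
    """signal: array of shape (n_samples, 3). Returns (n_windows, window, 3)."""
    step = int(window * (1 - overlap))            # 5-sample hop for 50% overlap
    starts = range(0, len(signal) - window + 1, step)
    return np.stack([signal[s:s + window] for s in starts])

windows = make_windows(np.random.randn(1000, 3))  # -> shape (199, 10, 3)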
Try adding some hidden layers and dropout layers to your network. You could create a simple Multi Layer Perceptron (MLP) with a couple of extra lines in between your Flatten layer and Dense layer:
model.add(Dense(64, activation='relu', input_dim=30))
model.add(Dropout(0.25))
model.add(Dense(64, activation='relu'))
model.add(Dropout(0.1))
Or check out this guide, which explains how to create a simple MLP.
Without any hidden layers your model will not actually be 'learning' from the input data, rather it will be mapping the input number of features to the output number of features.
The more layers you add, the more intermediate features and patterns it should extract from the input data which should lead to better model predictions for test data. There will be a lot of trial and error to design the best model as too many layers can result in over fitting.
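Putting that suggestion together with the original model, a minimal sketch of a fully assembled version might look like this (layer sizes and dropout rates are illustrative, not tuned):

from keras.models import Sequential
from keras.layers import Flatten, Dense, Dropout

model = Sequential()
model.add(Flatten(input_shape=(10, 3)))        # 10 timesteps x 3 axes -> 30 features
model.add(Dense(64, activation='relu'))
model.add(Dropout(0.25))
model.add(Dense(64, activation='relu'))
model.add(Dropout(0.1))
model.add(Dense(2, activation='softmax'))      # [walking, non-walking]
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])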
You have not provided information about how you train the model, so that may also be the cause of the issue. You must ensure that the data is split into training, validation and test sets. Some possible split ratios for training, validation and test data are 60%:20%:20% or 70%:15%:15%. This is ultimately something you must decide.
The problem of overfitting was caused by the input data type. The values passed to the classifier should have been float values with 2 decimal places. Somewhere along the way, some of these values had been augmented and had significantly more than 2 decimal places. That is, the input should have looked like
[9.81, 10.22, 11.3]
but instead looked like
[9.81000000012, 10.220010431, 11.3000000101]
The classifier was making its prediction based on this feature, which is obviously not the desired behavior! Lesson learned: make sure the data preparation is consistent! Thanks to @umutto for the suggestion of random forests; the simple structure was helpful for diagnostic purposes.
This is a follow up to the following question: Confused about how to implement time-distributed LSTM + LSTM
The current draft structure that is working well:
The basic idea is that there is a TimeDistributed deep LSTM input layer that works on each epoch of raw time series data and outputs a vector of features for each output. Then, the "outer" deep LSTM layer takes 7 of those sequential outputs and tries to classify the center epoch (assumed that 1 epoch does not have enough information to be classified by itself, and needs surrounding epochs). I say this is a draft because I haven't yet explored the feature space required for this to work well on many subjects.
There are several issues that still need to be resolved, but the one I haven't found any clear-cut examples of online is how to train this model in two parts: 1) the TimeDistributed layer and 2) the "outer" layer. The reason is that as I increase the number of epochs needed to classify (currently 7, but I expect it may get up to 21 or higher), more duplicated data is loaded and the training speed decreases quickly.
One might propose an autoencoder for the first layer. However, I don't think this is the best solution. The reason I think so is that the features necessary to reproduce the input might very well be different from the features necessary, in combination with the surrounding epochs, to classify the center epoch. To expand: this is probable because the time series is semi-periodic, with most of each epoch providing little information other than the current period from important feature to important feature (and the number and location of these important features varies in each epoch).
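For concreteness, a minimal sketch of the two-level structure described above, assuming 7 surrounding epochs and made-up values for EPOCH_LEN, N_FEATURES and N_CLASSES:

from tensorflow import keras
from tensorflow.keras import layers

EPOCH_LEN, N_FEATURES, N_CLASSES = 3000, 1, 5    # illustrative assumptions

# "Inner" per-epoch feature extractor, applied to each raw epoch of the series.
inner = keras.Sequential([
    layers.LSTM(64, return_sequences=True, input_shape=(EPOCH_LEN, N_FEATURES)),
    layers.LSTM(32),                              # one feature vector per epoch
])

# "Outer" classifier over 7 consecutive epoch feature vectors.
model = keras.Sequential([
    layers.TimeDistributed(inner, input_shape=(7, EPOCH_LEN, N_FEATURES)),
    layers.LSTM(32),
    layers.Dense(N_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy")

# One possible way to train the two parts separately: pretrain `inner` on single
# epochs (e.g. inside a throwaway single-epoch classifier), then freeze it
# before fitting the outer model:
# inner.trainable = False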
I'm getting started with machine learning tools and I'd like to learn more about what the heck I'm doing. For instance, the script:
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation, BatchNormalization
from keras.initializers import RandomUniform
import numpy
numpy.random.seed(13)
RandomUniform(seed=13)
model = Sequential()
model.add(Dense(6, input_dim=6))
model.add(BatchNormalization())
model.add(Activation('tanh'))
model.add(Dropout(0.01))
model.add(Dense(11))
model.add(Activation('tanh'))
model.add(Dropout(0.01))
model.add(Dense(1))
model.add(Activation('sigmoid'))
model.compile(optimizer='sgd', loss='mean_absolute_error', metrics=['accuracy'])
data = numpy.loadtxt('train', delimiter=' ')
X = data[:, 0:6]
Y = data[:, 6]
model.fit(X, Y, batch_size=1, epochs=1000)
data = numpy.loadtxt('test', delimiter=' ')
X = data[:, 0:6]
Y = data[:, 6]
score = model.evaluate(X, Y, verbose=1)
print ('\n\nThe error is:\n', score, "\n")
print('\n\nPrediction:\n')
Y = model.predict(X, batch_size=1, verbose=1)
print('\nResult:\n', Y, '\n')
It's a Frankenstein I made from some examples I found on the internet and I have many unanswered questions about it:
The file train has 60 rows. Is 1000 epochs too little? Is it too much? Can I get an Underfit/Overfit?
What does the result I get from model.evaluate() mean? I know it's the loss but, if I get a [7.0506157875061035, 0.0], does it mean that my model has a 7% error?
And last, I'm getting a prediction of 0.99875391, 0.99875391, 0.9362126, 0.99875391, 0.99875391, 0.99875391, 0.93571019 when the expected values were anything close to 7.86, 3.57, 8.93, 6.57, 11.7, 8.53, 9.06, which means it's a real bad prediction. Clearly there's a lot of things I'm doing wrong. Could you guys give me a few pointers?
I know it all depends on the type of data I'm using, but is there anything I shouldn't do at all? Or maybe something I should be doing?
1
There is never a ready answer for how many epochs is a good number. It varies wildly depending on the size of your data, your model, and what you want to achieve. Normally, small models require fewer epochs and bigger models require more. Yours seems small enough, and 1000 epochs seems like way too much.
It also depends on the learning rate, a parameter given to the optimizer that defines how large the steps are that your model takes when updating its weights. Bigger learning rates mean fewer epochs, but there is a chance that you simply never find a good point because you're adjusting weights beyond what is good. Smaller learning rates mean more epochs and better learning.
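For example (assuming a recent Keras where the keyword is learning_rate; older versions use lr), the learning rate can be set explicitly on the optimizer instead of passing the string 'sgd':

from keras.optimizers import SGD

# 0.01 is only an illustration; the right value depends on the problem.
model.compile(optimizer=SGD(learning_rate=0.01),
              loss='mean_absolute_error', metrics=['accuracy'])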
Normally, if the loss reaches a limit, you're approaching a point where training is not useful anymore. (Of course, there may be problems with the model too, there is really no simple answer for this one).
To detect overfitting, you need besides the training data (X and Y), another group with test data (say Xtest and Ytest, for instance).
Then you use it in model.fit(X,Y, validation_data=(Xtest,Ytest), ...)
Test data is not given for training, it's kept separate just to see if your model can predict good things from data it has never seen in training.
If the training loss goes down, but the validation loss doesn't, you're overfitting (roughly, your model is capable of memorizing the training data without really understanding it).
An underfit, on the contrary, happens when you never achieve the accuracy you expect (of course we always want 100% accuracy with no mistakes, but good models get around the 90s; some applications do better than 99%, some worse; again, it's very subjective).
2
model.evaluate() gives you the losses and the metrics you added in the compile method.
The loss value is something your model will always try to decrease during training. It roughly means how distant your model is from the exact values. There is no rule for what the loss value means, it could even be negative (but usually keras uses positive losses). The point is: it must decrease during training, that means your model is evolving.
The accuracy value means how many right predictions your model outputs compared to the true values (Y). It seems your accuracy is 0%, your model is getting everything wrong. (You can see that from the values you typed).
3
In your model, you used activation functions. These normalize the results so they don't get too big. This avoids overflowing problems, numeric errors propagating, etc.
It's very very usual to work with values within such bounds.
tanh - outputs values between -1 and 1
sigmoid - outputs values between 0 and 1
Well, if you used a sigmoid activation in the last layer, your model will never output 3 for instance. It tries, but the maximum value is 1.
What you should do is prepare your data (Y) so it's contained between 0 and 1. (This is the best approach for classification problems, and is often done with images too.)
But if you actually want numerical values, then you should just remove the activation and let the output be free to reach higher values. (It all depends on what you want to achieve with your model)
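As a hedged sketch of those two options, using the X/Y arrays from the script above:

# Option 1: rescale the targets into [0, 1] so the sigmoid output can reach them.
y_min, y_max = Y.min(), Y.max()
Y_scaled = (Y - y_min) / (y_max - y_min)
model.fit(X, Y_scaled, batch_size=1, epochs=1000)
# Undo the scaling when reading predictions back:
# predictions = model.predict(X) * (y_max - y_min) + y_min

# Option 2: if you want unbounded numeric outputs instead, drop the final
# Activation('sigmoid') so the last Dense(1) layer produces a plain linear value.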
An epoch is a single pass through the full training set. In my mind 1000 seems like a lot, but you'd have to check for overfitting and evaluate the predictions. There are many ways of checking and controlling for overfitting in a model. If you understand the methods for doing so from here, coding them in Keras should be no problem.
According to the documentation .evaluate returns:
Scalar test loss (if the model has no metrics) or list of scalars (if the model computes other metrics)
so these are the evaluation metrics of your model, they tell you how good your model is given some notion of good. Those metrics depend on the model and type of data that you've used. Some explanation on those can be found here and here. As mentioned in the documentation,
The attribute model.metrics_names will give you the display labels for the scalar outputs.
So you can know what metric you are looking at. It is easier to do that interactively through the console (ipython, bpython) or Jupyter notebook.
I can't see your data, but if you are doing a classification problem, as suggested by metrics=['accuracy'], then loss='mean_absolute_error' doesn't make sense, since it is made for regression problems. To learn more, I refer you to here and here, which discuss classification and regression problems with Keras.
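For illustration, two compile calls that keep the loss and metrics consistent (the choices here are examples, not recommendations for this specific data):

# Regression (continuous targets such as 7.86, 3.57, ...): track an error metric
# rather than accuracy.
model.compile(optimizer='sgd', loss='mean_absolute_error', metrics=['mae'])

# Classification (discrete classes): pair accuracy with a crossentropy loss.
# model.compile(optimizer='sgd', loss='categorical_crossentropy', metrics=['accuracy'])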
PS: question 3 is not related to software per se, but to the theoretical construct supporting the software. In such cases, I'd recommend asking them at Cross Validated.