How do you make predictions with a stateful LSTM? - python

Okay, so I trained a stateful LSTM character-wise on https://cs.stanford.edu/people/karpathy/char-rnn/shakespear.txt. It didn't seem to do too badly in terms of accuracy, but now I want to generate my own Shakespeare works.
The question is, how do I go about actually generating predictions from it?
In particular, the model's batch input shape is (128, 128, 63) and the output shape is (128, 128, 63). (The first number is the batch size, the second is the length of the prediction input and output, and the third is the number of distinct characters in the text.)
For example, I would like to:
Generate various predictions starting from empty text
Generate predictions starting from a small starting text (such as "PYRULEZ:")
This should be possible given how LSTMs work.
Here's a snippet of the code used to generate and fit the model:
model = Sequential()
model.add(LSTM(dataY.shape[2], batch_input_shape=(128, dataX.shape[1], dataX.shape[2]), return_sequences = True, stateful=True, activation = "softmax"))
model.summary()
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics = ['acc'])
model.fit(dataX, dataY, epochs = 1, batch_size = 128, verbose=1, shuffle = False)
Looking at other code samples, it appears I'll need to modify this somehow, but I'm not sure how specifically.
I can include the whole code sample if that would be helpful. It is self-contained.

Simple: pass your input to model.predict() with appropriate parameters (see the documentation), then concatenate the input and output (the model predicts on progressively longer chains). Depending on how you organised training, the output will add one character at a time. To be more precise, if you train sequence-to-sequence shifted by one, your output sequence will ideally be your input sequence shifted by one element: PYRULEZ -> YRULEZ*. Hence you need to take the last character of the output and append it to your prior (input) sequence.
If you want long lines of text, you might want to limit the length of the sequence to some number of characters in the loop. Much of the long-term dependency in the text is carried through the state vector of the LSTM cell anyway (not something you interact with directly).
Pseudocode-ish:
for counter in range(output_length):
    output = model.predict(input_)
    # keep the predicted last step 3-D so the concatenation shapes line up
    input_ = np.concatenate((input_, output[:, -1:, :]), axis=1)
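Fleshing that out into a rough sketch (not tested against your exact setup): it assumes you rebuild the trained network for inference with batch size 1 and a flexible sequence length (e.g. batch_input_shape=(1, None, 63)), copy the weights over, and that char_to_idx / idx_to_char dictionaries map your 63 characters to indices and back. All of those names are assumptions, not part of your code.
import numpy as np

# Sketch only: 'model' is an inference copy of the trained network and
# char_to_idx / idx_to_char are assumed lookup tables for the 63 characters.
n_chars = 63
seed_text = "PYRULEZ:"
output_length = 200

def one_hot(idx):
    # shape (1, 1, n_chars): one batch entry, one timestep, one-hot character
    v = np.zeros((1, 1, n_chars), dtype=np.float32)
    v[0, 0, idx] = 1.0
    return v

# one-hot encode the seed text: shape (1, len(seed_text), n_chars)
input_ = np.concatenate([one_hot(char_to_idx[c]) for c in seed_text], axis=1)

generated = seed_text
for _ in range(output_length):
    output = model.predict(input_)            # (1, current_length, n_chars)
    next_idx = int(np.argmax(output[0, -1]))  # distribution over the next character
    generated += idx_to_char[next_idx]
    # append the last predicted step (kept 3-D) so the input grows by one
    input_ = np.concatenate((input_, output[:, -1:, :]), axis=1)

print(generated)
For the "empty text" case you can seed with a single random character instead of "PYRULEZ:".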

Related

Keras LSTM trained with masking and custom loss function breaks after first iteration

I am attempting to train an LSTM that reads a variable length input sequence and has a custom loss function applied to it. In order to be able to train on batches, I pad my inputs to all be the maximum length.
My input data is a float tensor of shape (7789, 491, 11) where the form is (num_samples, max_sequence_length, dimension).
Any sample that is shorter than the maximum length I pad with -float('inf'), so a sequence with 10 values would start with 481 sets of 11 '-inf' values followed by the real values at the end.
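Roughly, the padding looks like this (a simplified sketch with made-up helper names, not my exact code):
import numpy as np

max_len, dim = 491, 11
pad_value = -float('inf')

def pad_sequence(seq):
    # pre-pad a (length, dim) float array with pad_value up to max_len rows
    padding = np.full((max_len - len(seq), dim), pad_value, dtype=np.float32)
    return np.concatenate([padding, seq], axis=0)

# 'sequences' stands in for my list of variable-length samples
padded = np.stack([pad_sequence(s) for s in sequences])  # -> (7789, 491, 11)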
The way I am attempting to evaluate this model doesn't fit into any standard loss functions, so I had to make my own. I've tested it and it performs as expected on sample tensors. I don't believe this is the source of the issue so I won't go into details, but I could be wrong.
The problem I'm having comes from the model itself. Here is how I define and train it:
model = tf.keras.Sequential()
model.add(tf.keras.layers.Masking(mask_value=-float('inf'),
                                  input_shape=(train_X.shape[1], train_X.shape[2])))
model.add(tf.keras.layers.LSTM(32))
model.add(tf.keras.layers.Dense(64, activation='relu'))
model.add(tf.keras.layers.Dense(30,
                                kernel_initializer=tf.keras.initializers.zeros()))
model.add(tf.keras.layers.Reshape((3, 10)))
model.compile(loss=batched_custom_loss, optimizer='rmsprop', run_eagerly=True)
model.fit(x=train_X, y=train_y, validation_data=val, epochs=5, batch_size=32)
No errors are thrown when I try to fit the model, but it only works on the first batch of training. As soon as the second batch starts, the loss becomes 'nan'. Upon closer inspection, it seems like the LSTM layer is outputting 'nan' after the first epoch of training.
My two guesses for what is going on are:
I set up the masking layer wrong, and it for some reason fails to mask out all of the -inf values after the first training iteration. Thus, -inf gets passed through the LSTM and it goes haywire.
I did something wrong with the format of my loss function, and when the optimizer applies my calculated loss to the model it ruins the weights of the LSTM. For reference, my loss function outputs a 1D tensor with length equal to the number of samples in the batch. Each item in the output is a float with the loss of that sample.
I know that the math in my loss function is good since I've tested it on sample data, but maybe the output format is wrong even though it seems to match what I've found online.
Let me know if the problem is obvious from what I've shown or if you need more information.
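For reference, my loss follows this general format (the placeholder below only shows the expected shapes; the real computation is different):
import tensorflow as tf

def batched_custom_loss(y_true, y_pred):
    # y_true / y_pred have shape (batch, 3, 10) given the Reshape layer above;
    # reduce everything except the batch axis -> one float per sample, shape (batch,)
    return tf.reduce_mean(tf.square(y_true - y_pred), axis=[1, 2])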

Does Keras's LSTM really take into account the cell state and previous output?

I learned about LSTMs over the past day, and then I decided to look at a tutorial that uses Keras to create one. I looked at several tutorials and they all had a derivative of
model = Sequential()
model.add(LSTM(10, input_shape=(1,1)))
model.add(Dense(1, activation='linear'))
model.compile(loss='mse', optimizer='adam')
X,y = get_train()
model.fit(X, y, epochs=300, shuffle=False, verbose=0)
then they predicted using
model.predict(X, verbose=0)
my question is: don't you have to give the previous prediction along with input and cell state in order to predict the next outcome using an LSTM?
Also, what does the 10 represent in model.add(LSTM(10, input_shape=(1,1)))?
You have to feed the previous prediction back in as part of the input. If you just call predict, the LSTM will be initialized every time; it will not remember the state from previous predictions.
Typically (e.g. if you generate text with an LSTM) you have a loop where you do something like this:
import sys
import numpy

# pick a random seed
start = numpy.random.randint(0, len(dataX)-1)
pattern = dataX[start]
print("Seed:")
print("\"", ''.join([int_to_char[value] for value in pattern]), "\"")
# generate characters
for i in range(1000):
    x = numpy.reshape(pattern, (1, len(pattern), 1))
    x = x / float(n_vocab)
    prediction = model.predict(x, verbose=0)
    index = numpy.argmax(prediction)
    result = int_to_char[index]
    seq_in = [int_to_char[value] for value in pattern]
    sys.stdout.write(result)
    pattern.append(index)
    pattern = pattern[1:len(pattern)]
print("\nDone.")
(example copied from machinelearningmastery.com)
The important things are these lines:
pattern.append(index)
pattern = pattern[1:len(pattern)]
Here they append the next character to the pattern and then drop the first character, so the input length matches what the LSTM expects. Then they turn it into a numpy array (x = np.reshape(...)) and predict the next character from this updated pattern. So to answer your first question: yes, you need to feed the output back in as input.
For the second question, the 10 corresponds to the number of LSTM units you have in the layer. If you don't use return_sequences=True, it corresponds to the output size of that layer.
Let's break it down into pieces and look at it pictorially.
LSTM(10, input_shape=(3,1)): Defines an LSTM whose sequence length is 3, i.e. the LSTM will unroll for 3 timesteps. At each timestep the LSTM takes an input of size 1. The output (and also the size of the hidden state and all other LSTM gates) is 10 (a vector of size 10).
You don't have to do the unrolling manually (passing the current hidden state on to the next step); it is taken care of by the Keras/TensorFlow LSTM layer. All you have to do is pass in data in the (batch_size X time_steps X input_size) format.
Dense(1, activation='linear'): This is a dense layer with linear activation which takes as input the output of the previous layer (i.e. the output of the LSTM, which is the vector of size 10 from the last unrolling step). It will return a vector of size 1.
The same can be checked using model.summary()
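For example, a minimal sketch with the same layer sizes as above:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

model = Sequential()
model.add(LSTM(10, input_shape=(3, 1)))   # 3 timesteps, 1 feature, 10 units
model.add(Dense(1, activation='linear'))
model.summary()  # should report (None, 10) after the LSTM and (None, 1) after the Dense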
Your 1st question:
don't you have to give the previous prediction along with input and cell state in order to predict the next outcome using an LSTM?
No, you don't have to do that. As far as I understand, it is stored in the LSTM cell, which is why LSTMs use so much RAM.
If you have data with a shape like this:
(100, 1000)
and you plug that into the fit function, each epoch will run over 100 lists. The LSTM will remember 1000 data points before refreshing when it moves on to the next list.
2nd:
Also, what does the 10 represent in model.add(LSTM(10, input_shape(1,1))?
It is the size of the first layer after the input, so your model currently has the shape:
1,1
10
1
hope it helps :)

Unrolling, timesteps, batchsize and hidden unit

I read this blog here to understand the theoretical background of this, but after reading it I am a bit confused about what 1) timesteps, 2) unrolling, 3) number of hidden units and 4) batch size are. Maybe someone could explain this on a code basis as well, because when I look into the model config the code below does not unroll; but what is the timestep doing in this case? Let's say I have data with a length of 2,000 points, split into 40 time steps and one feature. E.g. the hidden units are 100. The batch size is not defined; what is happening in the model?
model = Sequential()
model.add(LSTM(100, input_shape=(n_timesteps_in, n_features)))
model.add(RepeatVector(n_timesteps_in))
model.add(LSTM(100, return_sequences=True))
model.add(TimeDistributed(Dense(n_features, activation='tanh')))
model.compile(loss='mse', optimizer='adam', metrics=['mae'])
history=model.fit(train, train, epochs=epochs, verbose=2, shuffle=False)
Is the code below still an encoder-decoder model without a RepeatVector?
model = Sequential()
model.add(LSTM(100, return_sequences=True, input_shape=(n_timesteps_in, n_features)))
model.add(LSTM(100, return_sequences=True))
model.add(TimeDistributed(Dense(n_features, activation='tanh')))
model.compile(loss='mse', optimizer='adam', metrics=['mae'])
history=model.fit(train, train, epochs=epochs, verbose=2, shuffle=False)
"Unroll" is just a mechanism to process the LSTMs in a way that makes them faster by occupying more memory. (The details are unknown for me... but it certainly has no influence in steps, shapes, etc.)
When you say "2000 points split in 40 time steps", I have absolutely no idea of what is going on.
The data must be meaningfully structured and saying "2000" data points is really lacking a lot of information.
Data structured for LSTMs is:
I have a certain number of individual sequences (data evolving with time)
Each sequence has a number of time steps (measures in time)
In each step we measured a number of different vars with different meanings (features)
Example:
2000 users in a website
They used the site for 40 days
In each day I measured the number of times they clicked a button
I can plot how this data evolves with time daily (each day is a step)
So, if you have 2000 sequences (also called "samples" in Keras), each sequence with length of 40 steps, and one single feature per step, this will happen:
Dimensions
Batch size is defined as 32 by default in the fit method. The model will process batches containing 32 sequences/users until it reaches 2000 sequences/users.
input_shape will be required to be (40, 1) (the batch size is free to choose in fit)
Steps
Your LSTMs will try to understand how clicks vary in time, step by step. That's why they're recurrent, they calculate things for a step and feed these things into the next step, until all 40 steps are processed. (You won't see this processing, though, it's internal)
With return_sequences=True, you will get the output for all steps.
Without it, you will get only the output for the last step.
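A quick sketch of the difference (shapes assume the (40, 1) input described above):
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM

with_steps = Sequential([LSTM(100, return_sequences=True, input_shape=(40, 1))])
last_only = Sequential([LSTM(100, input_shape=(40, 1))])

print(with_steps.output_shape)  # (None, 40, 100) -> one output per step
print(last_only.output_shape)   # (None, 100)     -> only the last step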
The model
The model will process 32 parallel (and independent) sequences/users together in each batch.
The first LSTM layer will process the entire sequence in recurrent steps and return a final result. (The sequence is killed, there are no steps left because you didn't use return_sequences=True)
Output shape = (batch, 100)
You create a new sequence with RepeatVector, but this sequence is constant in time.
Output shape = (batch, 40, 100)
The next LSTM layer processes this constant sequence and produces an output sequence, with all 40 steps
Output shape = (batch, 40, 100)
The TimeDistributed(Dense) will process each of these steps, but independently (in parallel), not recursively as the LSTMs would do.
Output shape = (batch, 40, n_features)
The output will be the total group of 2000 sequences (that were processed in groups of 32), each with 40 steps and n_features output features.
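Putting those shapes next to the code, a sketch of the question's model (assuming 40 steps and one feature, as in the example above):
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, RepeatVector, TimeDistributed, Dense

n_timesteps_in, n_features = 40, 1

model = Sequential()
model.add(LSTM(100, input_shape=(n_timesteps_in, n_features)))    # -> (batch, 100)
model.add(RepeatVector(n_timesteps_in))                            # -> (batch, 40, 100)
model.add(LSTM(100, return_sequences=True))                        # -> (batch, 40, 100)
model.add(TimeDistributed(Dense(n_features, activation='tanh')))   # -> (batch, 40, 1)
model.compile(loss='mse', optimizer='adam', metrics=['mae'])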
Cells, features, units
Everything is independent.
Input features is one thing, output features is another. There is no requirement for Dense to use the same number of features used in input_shape, unless that's what you want.
When you use 100 units in the LSTM layer, it will produce an output sequence of 100 features, shape (batch, 40, 100). If you use 200 units, it will produce an output sequence with 200 features, shape (batch, 40, 200). This is computing power. More neurons = more intelligence in the model.
Something strange in the model:
You should replace:
model.add(LSTM(100, input_shape=(n_timesteps_in, n_features)))
model.add(RepeatVector(n_timesteps_in))
With only:
model.add(LSTM(100, return_sequences=True,input_shape=(n_timesteps_in, n_features)))
Not returning sequences in the first layer and then creating a constant sequence with RepeatVector is sort of destroying the work of your first LSTM.

Training on multiple time-series of various length using recurrent layers in Keras

TL;DR - I have a couple of thousand speed-profiles (time-series where the speed of a car has been sampled) and I am unsure how to configure my models such that I can perform arbitrary forecasting (i.e. predict t+n samples given a sample t).
I have read numerous explanations (1, 2, 3, 4, 5) about how Keras implements statefulness in their recurrent layers, and how one should reset/not reset between iterations, etc..
However, I am unable to acquire the model shape that I want (I think).
As for now, I am only working with a subset of my profiles (denoted as routes in the code below).
Number of training routes: 90
Number of testing routes: 10
The routes vary in length, hence, the first thing I do is to iterate through all routes and pad them with 0, so they are all the same length. (I have assumed this is required, if I am wrong please let me know.) After the padding I convert the routes into a format better suited for the supervised learning task, as described HERE. In this case I have opted to forecast the succeeding 5 steps of the current sample.
The result is a tensor, as:
Shape of training_data: (90, 3186, 6) == (nb_routes, nb_samples/route, nb_timesteps)
which is split into X and y for training as:
Shape of X: (90, 3186, 1)
Shape of y: (90, 3186, 5)
My goal is to have the model take one route at the time and train on it. I have created a model like this:
# Create model
model = Sequential()
# Add recurrent layer
model.add(SimpleRNN(nb_cells, batch_input_shape=(1, X.shape[1], X.shape[2]), stateful=True))
# Add dense layer at the end to acquire correct kind of forecast
model.add(Dense(y.shape[2]))
# Compile model
model.compile(loss="mean_squared_error", optimizer="adam", metrics = ["accuracy"])
# Fit model
for _ in range(nb_epochs):
model.fit(X, y,
validation_split=0.1,
epochs=1,
batch_size=1,
verbose=1,
shuffle=False)
model.reset_states()
Which would imply that I have a model with nb_cells layers, the input of the model is (number_of_samples, number_of_timesteps) i.e. (3186, 1) and the output of the model is (number_of_timesteps_lagged) i.e. (5).
However, when running the above I get the following error:
ValueError: Error when checking target: expected dense_1 to have 2 dimensions, but got array with shape (90, 3186, 5)
I have tried different ways to solve the above, but I have been unsuccessful.
I have also tried other ways of structuring my data and my model. For instance merging my routes such that instead of (90, 3186, 6) I had (286740, 6). I simply took the data for each route and put it after the other. After fiddling with my model I got this to run, and I get a result that is quite good, but I really want to understand how this works - and I think the solution I am attempting above is better (if I can get it to work).
Update
Note: I am still looking for feedback.
I have reached a "solution" which I think does the trick.
I have abandoned the padding and instead opted for a one-sample-at-a-time approach. The reason is that I am trying to acquire a network that allows me to predict by providing the network with one sample at a time. I want to give the network sample t and have it predict t+1, t+2, ..., t+n, so it is my understanding that I must train the network on one sample at a time. I also assume that using:
stateful will allow me to keep the hidden state of the cells unspoiled between batches (meaning that I can determine the batch size to be len(route))
return_sequences will allow me to get the output vector that I desire
The changed code is given below. Unlike the original question, the shape of the input data is now (90,) (i.e. 90 routes of various length) but each training route still has only one feature per sample, and each label route has five samples per feature (the lagged time).
# Create model
model = Sequential()
# Add nn_type cells
model.add(SimpleRNN(nb_cells, return_sequences=True, stateful=True, batch_input_shape=(1, 1, nb_past_obs)))
# Add dense layer at the end to acquire correct kind of forecast
model.add(Dense(nb_future_obs))
# Compile model
model.compile(loss="mean_squared_error", optimizer="adam", metrics = ["accuracy"])
# Fit model
for e in range(nb_epochs):
    for r in range(len(training_data)):
        route = training_data[r]
        for s in range(len(route)):
            X = route[s, :nb_past_obs].reshape(1, 1, nb_past_obs)
            y = route[s, nb_past_obs:].reshape(1, 1, nb_future_obs)
            model.fit(X, y,
                      epochs=1,
                      batch_size=1,
                      verbose=0,
                      shuffle=False)
        # reset the hidden state before moving on to the next route
        model.reset_states()
return model

Keras sequence prediction with multiple simultaneous sequences

My question is very similar to what it seems this post is asking, although that post doesn't pose a satisfactory solution. To elaborate, I am currently using keras with tensorflow backend and a sequential LSTM model. The end goal is I have n time-dependent sequences with equal time steps (the same number of points on each sequence and the points are all the same time apart) and I would like to feed all n sequences into the same network so it can use correlations between the sequences to better predict the next step for each sequence. My ideal output would be an n-element 1-D array with array[0] corresponding to the next-step prediction for sequence_1, array[1] for sequence_2, and so on.
My inputs are sequences of single values, so each of n inputs can be parsed into a 1-D array.
I was able to get a working model for each sequence independently using the code at the end of this guide by Jakob Aungiers, although my difficulty is adapting it to accept multiple sequences at once and correlate between them (i.e. be analyzed in parallel). I believe the issue is related to the shape of my input data, which is currently in the form of a 4-D numpy array because of how Jakob's Guide splits the inputs into sub-sequences of 30 elements each to analyze incrementally, although I could also be completely missing the target here. My code (which is mostly Jakob's, not trying to take credit for anything that isn't mine) presently looks like this:
As-is this complains with "ValueError: Error when checking target: expected activation_1 to have shape (None, 4) but got array with shape (4, 490)", I'm sure there are plenty of other issues but I'd love some direction on how to achieve what I'm describing. Anything stick out immediately to anyone? Any help you could give will be greatly appreciated.
Thanks!
-Eric
Keras is already prepared to work with batches containing many sequences; there is no secret at all.
There are two possible approaches, though:
You input your entire sequences (all steps at once) and predict n results
You input only one step of all sequences and predict the next step in a loop
Suppose:
nSequences = 30
timeSteps = 50
features = 1 #(as you said: single values per step)
outputFeatures = 1
First approach: stateful=False:
inputArray = arrayWithShape((nSequences,timeSteps,features))
outputArray = arrayWithShape((nSequences,outputFeatures))
input_shape = (timeSteps,features)
#use layers like this:
LSTM(units) #if the first layer in a Sequential model, add the input_shape
#if you want to return the same number of steps (like a new sequence parallel to the input), use return_sequences=True
Train like this:
model.fit(inputArray,outputArray,....)
Predict like this:
newStep = model.predict(inputArray)
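A concrete, minimal version of this first approach (with random placeholder data and an arbitrary number of units) might look like:
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

nSequences, timeSteps, features, outputFeatures = 30, 50, 1, 1

# placeholder data just to make the shapes explicit
inputArray = np.random.rand(nSequences, timeSteps, features)
outputArray = np.random.rand(nSequences, outputFeatures)

model = Sequential()
model.add(LSTM(32, input_shape=(timeSteps, features)))  # 32 units is an arbitrary choice
model.add(Dense(outputFeatures))
model.compile(loss='mse', optimizer='adam')

model.fit(inputArray, outputArray, epochs=2, batch_size=8)
newStep = model.predict(inputArray)  # one next-step prediction per sequence -> (30, 1)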
Second approach: stateful=True:
inputArray = sameAsBefore
outputArray = inputArray[:,1:] #one step after input array
inputArray = inputArray[:,:-1] #eliminate the last step
batch_input = (nSequences, 1, features) #stateful layers require the batch size
#use layers like this:
LSTM(units, stateful=True) #if the first layer in a Sequential model, add batch_input_shape
Train like this:
model.reset_states() #you need this in stateful=True models
#if you don't reset states,
#the stateful model will think that your inputs are new steps of the same previous sequences
for step in range(inputArray.shape[1]): #for each time step
model.fit(inputArray[:,step:step+1], outputArray[:,step:step+1],shuffle=False,...)
Predict like this:
model.reset_states()
predictions = np.empty(inputArray.shape)
for step in range(inputArray.shape[1]): #for each time step
predictions[:,step] = model.predict(inputArray[:,step:step+1])
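For completeness, a sketch of the stateful model those loops assume (reusing the variables defined above; the unit count is arbitrary, and return_sequences=True keeps the per-step target shape):
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

model = Sequential()
model.add(LSTM(32, stateful=True, return_sequences=True,
               batch_input_shape=(nSequences, 1, features)))
model.add(Dense(outputFeatures))  # per-step output: (nSequences, 1, outputFeatures)
model.compile(loss='mse', optimizer='adam')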
