How to train an LSTM model with labels of different dimensions? - python
I am using keras (ver. 2.0.6 with TensorFlow backend) for a simple neural network:
model = Sequential()
model.add(LSTM(32, return_sequences=True, input_shape=(100, 5)))
model.add(LSTM(32, return_sequences=True))
model.add(TimeDistributed(Dense(5)))
model.add(Activation('softmax'))
model.compile(loss='categorical_crossentropy',
              optimizer='rmsprop',
              metrics=['accuracy'])
This is only a test for me; I am "training" the model with the following dummy data.
x_train = np.array([
[[0,0,0,0,1], [0,0,0,1,0], [0,0,1,0,0]],
[[1,0,0,0,0], [0,1,0,0,0], [0,0,1,0,0]],
[[0,1,0,0,0], [0,0,1,0,0], [0,0,0,1,0]],
[[0,0,1,0,0], [1,0,0,0,0], [1,0,0,0,0]],
[[0,0,0,1,0], [0,0,0,0,1], [0,1,0,0,0]],
[[0,0,0,0,1], [0,0,0,0,1], [0,0,0,0,1]]
])
y_train = np.array([
[[0,0,0,0,1], [0,0,0,1,0], [0,0,1,0,0]],
[[1,0,0,0,0], [0,1,0,0,0], [0,0,1,0,0]],
[[0,1,0,0,0], [0,0,1,0,0], [0,0,0,1,0]],
[[1,0,0,0,0], [1,0,0,0,0], [1,0,0,0,0]],
[[1,0,0,0,0], [0,0,0,0,1], [0,1,0,0,0]],
[[1,0,0,0,0], [0,0,0,0,1], [0,0,0,0,1]]
])
Then I do:
model.fit(x_train, y_train, batch_size=2, epochs=50, shuffle=False)
print(model.predict(x_train))
The result is:
[[[ 0.11855114 0.13603994 0.21069065 0.28492314 0.24979511]
[ 0.03013871 0.04114409 0.16499813 0.41659597 0.34712321]
[ 0.00194826 0.00351031 0.06993906 0.52274817 0.40185428]]
[[ 0.17915446 0.19629011 0.21316603 0.22450975 0.18687972]
[ 0.17935558 0.1994358 0.22070852 0.2309722 0.16952793]
[ 0.18571526 0.20774922 0.22724937 0.23079531 0.14849086]]
[[ 0.11163659 0.13263632 0.20109797 0.28029731 0.27433187]
[ 0.02216373 0.03424517 0.13683401 0.38068131 0.42607573]
[ 0.00105937 0.0023865 0.0521594 0.43946937 0.50492537]]
[[ 0.13276921 0.15531689 0.21852671 0.25823513 0.23515201]
[ 0.05750636 0.08210614 0.22636817 0.3303588 0.30366054]
[ 0.01128351 0.02332032 0.210263 0.3951444 0.35998878]]
[[ 0.15303896 0.18197381 0.21823004 0.23647803 0.21027911]
[ 0.10842207 0.15755147 0.23791778 0.26479205 0.23131666]
[ 0.06472684 0.12843341 0.26680911 0.28923658 0.25079405]]
[[ 0.19560908 0.20663913 0.21954383 0.21920268 0.15900527]
[ 0.22829761 0.22907974 0.22933882 0.20822221 0.10506159]
[ 0.27179539 0.25587022 0.22594844 0.18308094 0.063305 ]]]
OK, it works, but it is just a test; I really do not care about accuracy, etc. I would like to understand how I can work with outputs of a different size.
For example: passing a sequence (numpy.array) like:
[[0,0,0,0,1], [0,0,0,1,0], [0,0,1,0,0]]
I would like to get a prediction with 4 timesteps as output:
[[..first..], [..second..], [..third..], [..fourth..]]
Is that possible somehow? The size could vary: I would train the model with labels that can have different numbers of timesteps.
Thanks
This answer addresses the case of non-varying dimensions; for varying dimensions, the padding idea in Giuseppe's answer seems the way to go, perhaps with help from the "Masking" layer proposed in the Keras documentation.
The output shape in Keras depends entirely on the number of "units/neurons/cells" you put in the last layer and, of course, on the type of layer.
I can see that the data in your question does not match your code (your arrays have 3 timesteps, while input_shape declares 100), but suppose your code is right and forget the data for a while.
An input shape of (100, 5) in an LSTM layer means a tensor of shape (None, 100, 5), where:
None is the batch size. The first dimension of your data is reserved for the number of examples you have (X and Y must have the same number of examples).
Each example is a sequence with 100 timesteps.
Each timestep is a 5-dimension vector.
And the 32 cells in this same LSTM layer mean that the resulting vectors will change from 5-dimension to 32-dimension vectors. With return_sequences=True, all 100 timesteps will appear in the result. So the output shape of the first layer is (None, 100, 32):
Same number of examples (this will never change along the model)
Still 100 timesteps per example (because return_sequences=True)
Each timestep is a 32-dimension vector (because of 32 cells)
Now the second LSTM layer does exactly the same thing: it keeps the 100 timesteps, and since it also has 32 cells, it keeps the 32-dimension vectors, so the output is also (None, 100, 32).
Finally, the TimeDistributed Dense layer will also keep the 100 timesteps (because of TimeDistributed) and change your vectors back to 5-dimension vectors (because of 5 units), resulting in (None, 100, 5).
As you can see, you cannot change the number of timesteps directly with recurrent layers; you need to use other layers to change these dimensions. The way to do this is completely up to you; there are infinite ways of doing it.
But in all of them, you need to get rid of the timesteps and rebuild the data with another shape.
Suggestion
A suggestion from me (just one possibility) is to reshape your result and apply another Dense layer to achieve the final shape expected.
Suppose you want a result like (None, 4, 5) (never forget, the first dimension of your data is the number of examples, it can be any number, but you must take it into account when you organize your data). We can achieve this by reshaping the data to a shape containing 4 in the second dimension:
#after the Dense layer:
from keras.layers import Reshape

model.add(Reshape((4, 125)))  # the batch size doesn't appear here; just make
                              # sure you have 500 elements: 100*5 = 4*125
model.add(TimeDistributed(Dense(5)))
# this layer could also be model.add(LSTM(5, return_sequences=True)), for instance
# continue to the "Activation" layer
This will give you 4 timesteps (because the shape after Reshape is (None, 4, 125)), each step being a 5-dimension vector (because of Dense(5)).
Use the model.summary() command to see the shapes outputted by each layer.
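For reference, here is a minimal end-to-end sketch of this suggestion (a sketch only, assuming the (100, 5) input from the question and the (None, 4, 5) target shape chosen above), so you can verify every shape with model.summary():

from keras.models import Sequential
from keras.layers import LSTM, Dense, TimeDistributed, Reshape, Activation

model = Sequential()
model.add(LSTM(32, return_sequences=True, input_shape=(100, 5)))  # -> (None, 100, 32)
model.add(LSTM(32, return_sequences=True))                        # -> (None, 100, 32)
model.add(TimeDistributed(Dense(5)))                              # -> (None, 100, 5)
model.add(Reshape((4, 125)))                                      # -> (None, 4, 125); 100*5 == 4*125
model.add(TimeDistributed(Dense(5)))                              # -> (None, 4, 5)
model.add(Activation('softmax'))
model.summary()  # shows the output shape of every layer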
I don't know Keras, but from a practical and theoretical point of view this is absolutely possible.
The idea is that you have an input sequence and an output sequence. Commonly, the beginning and the end of each sequence are delimited by special symbols (e.g. the character sequence "cat" is translated into "^cat#", with a start symbol "^" and an end symbol "#"). Then the sequence is padded with another special symbol, up to a maximum sequence length (e.g. "^cat#$$$$$$" with a padding symbol "$").
If the padding symbol corresponds to a zero-vector, it will have no impact on your training.
Your output sequence can now assume any length up to the maximum one, because its real length is the span from the start symbol to the end symbol.
In other words, you always have the same nominal input and output sequence length (i.e. the maximum one), but the real length is the part between the start and end symbols.
(Obviously, anything after the end symbol in the output sequence should not be considered in the loss function.)
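A minimal sketch of that encoding, assuming a toy vocabulary (the symbols "^", "#", "$" and the word "cat" are the examples above; everything else is made up for illustration):

import numpy as np

# the pad symbol "$" gets index 0 so it encodes to an all-zeros vector
vocab = {'$': 0, '^': 1, '#': 2, 'c': 3, 'a': 4, 't': 5}
max_len = 10

def encode(word):
    # wrap the word in start/end symbols and pad to the maximum length
    chars = '^' + word + '#'
    chars += '$' * (max_len - len(chars))
    onehot = np.zeros((max_len, len(vocab) - 1))
    for t, ch in enumerate(chars):
        if vocab[ch] > 0:            # the pad symbol stays a zero-vector
            onehot[t, vocab[ch] - 1] = 1.0
    return onehot

print(encode('cat').shape)  # (10, 5)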
There seem to be two methods for the sequence-to-sequence task you're describing. The first uses Keras directly, following this example (code below):
from keras.layers import Input, LSTM, RepeatVector
from keras.models import Model

# timesteps, input_dim and latent_dim are integers you choose for your data
inputs = Input(shape=(timesteps, input_dim))
encoded = LSTM(latent_dim)(inputs)            # encode the sequence to one latent vector
decoded = RepeatVector(timesteps)(encoded)    # repeat it once per output timestep
decoded = LSTM(input_dim, return_sequences=True)(decoded)
sequence_autoencoder = Model(inputs, decoded)
encoder = Model(inputs, encoded)
Here the RepeatVector repeats the encoded vector n times to match the output's number of timesteps. This still means you need a fixed number of timesteps in your output vector; however, there may be a method of padding vectors that have fewer timesteps than your maximum number of timesteps.
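For that padding, Keras's pad_sequences utility could be used to bring shorter sequences up to a fixed length (a sketch; the example sequences are made up):

from keras.preprocessing.sequence import pad_sequences

# hypothetical targets of varying length, as lists of class indices
targets = [[4, 3, 2], [1, 2], [3, 0, 1, 2]]

# pad with zeros at the end, up to the longest sequence
padded = pad_sequences(targets, maxlen=4, padding='post', value=0)
print(padded.shape)  # (3, 4)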
Or you can use the seq2seq module, which is built on top of Keras.
Related
How is the Keras Conv1D input specified? I seem to be lacking a dimension
My input is an array of 64 integers.

model = Sequential()
model.add(Input(shape=(68,), name="input"))
model.add(Conv1D(64, 2, activation="relu", padding="same", name="convLayer"))

I have 10,000 of these arrays in my training set. Am I supposed to be specifying this in order for Conv1D to work? I am getting the dreaded error

ValueError: Input 0 of layer convLayer is incompatible with the layer: : expected min_ndim=3, found ndim=2. Full shape received: [None, 68]

and I really don't understand what I need to do.
Don't let the name confuse you. The layer tf.keras.layers.Conv1D needs the following shape: (time_steps, features). If your dataset is made of 10,000 samples with each sample having 64 values, then your data has the shape (10000, 64), which is not directly applicable to the tf.keras.layers.Conv1D layer: you are missing the time_steps dimension. What you can do is use tf.keras.layers.RepeatVector, which repeats your array input n times (5 in the example below). This way your Conv1D layer gets an input of the shape (5, 64). Check out the documentation for more information:

time_steps = 5
model = tf.keras.Sequential()
model.add(tf.keras.layers.Input(shape=(64,), name="input"))
model.add(tf.keras.layers.RepeatVector(time_steps))
model.add(tf.keras.layers.Conv1D(64, 2, activation="relu", padding="same", name="convLayer"))

As a side note, you should ask yourself whether a tf.keras.layers.Conv1D layer is the right option for your use case. This layer is usually used for NLP and other time-series tasks. For example, in sentence classification, each word in a sentence is usually mapped to a high-dimensional word vector representation, resulting in data with the shape (time_steps, features). If you want to use character one-hot encoded embeddings instead, a single sample would have the shape (10, 10): 10 characters along the time dimension and 10 features. It should help you understand the tutorial I mentioned a bit better.
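A quick sketch of such a character-level one-hot sample (the 10-character alphabet and text here are made up for illustration):

import numpy as np

alphabet = list("abcdefghij")                  # 10 hypothetical characters -> 10 features
char_to_idx = {c: i for i, c in enumerate(alphabet)}

sample_text = "abcdefghij"                     # one sample of 10 characters (10 timesteps)
sample = np.zeros((len(sample_text), len(alphabet)))
for t, ch in enumerate(sample_text):
    sample[t, char_to_idx[ch]] = 1.0

print(sample.shape)  # (10, 10) -> (time_steps, features)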
The Conv1D layer does temporal convolution, that is, along the first dimension (not the batch dimension, of course), so you should put something like this:

time_steps = 5
model = tf.keras.Sequential()
model.add(tf.keras.layers.Input(shape=(time_steps, 64), name="input"))
model.add(tf.keras.layers.Conv1D(64, 2, activation="relu", padding="same", name="convLayer"))

You will need to slice your data into time_steps temporal slices to feed the network. However, if your arrays don't have a temporal structure, then Conv1D is not the layer you are looking for.
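One way to do that slicing, as a sketch (assuming the (10000, 64) data from the question and non-overlapping windows):

import numpy as np

time_steps = 5
data = np.random.rand(10000, 64)  # stand-in for the real dataset

# group consecutive samples into non-overlapping windows of time_steps
n_windows = data.shape[0] // time_steps
sliced = data[:n_windows * time_steps].reshape(n_windows, time_steps, 64)
print(sliced.shape)  # (2000, 5, 64) -> (samples, time_steps, features)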
Bi-directional LSTM for entity recognition
Following a paper, I'm using word embeddings as a feature vector for entity recognition. I've attempted to architect the network using Keras but have run into a dimensionality problem I cannot seem to resolve. Take the following example sentence:

["I went to the shop"]

The sentence has 5 words, and after computing the feature matrix I am left with a matrix of dimension (1, 120, 1000) == (#examples, sequence_length, embedding). Note that sequence_length is reached by appending 0-padding when the sentence is not complete; in this example, the actual sequence length would be 5. My network architecture is as follows:

enc = encode()
claims_input = Input(shape=(120, 1000), dtype='float32', name='claims')
x = Masking(mask_value=0., input_shape=(120, 1000))(claims_input)
x = Bidirectional(LSTM(units=512, return_sequences=True, recurrent_dropout=0.2, dropout=0.2))(x)
x = Bidirectional(LSTM(units=512, return_sequences=True, recurrent_dropout=0.2, dropout=0.2))(x)
out = TimeDistributed(Dense(8, activation="softmax"))(x)
model = Model(inputs=claims_input, output=out)
model.compile(loss="sparse_categorical_crossentropy", optimizer='adam', metrics=["accuracy"])
model.fit(enc, y)

The architecture is straightforward: I mask specific timesteps, run two bidirectional LSTMs, and follow with a softmax output. My y variable in this case is a (9, 8) one-hot-encoded matrix corresponding to the gold label of each word. When trying to fit() this model, I run into a dimensionality problem relating to the TimeDistributed() layer, and I'm unsure how to resolve it, or even begin to debug it. Error:

ValueError: Error when checking target: expected time_distributed_1 to have 3 dimensions, but got array with shape (9, 8)

Any help would be appreciated.
You are doing entity recognition, so each element in your input sequence will be assigned an entity (probably some of them as null). If your model takes an input sample of shape (120, n_features), then the output must also be a sequence of length 120, i.e. one entity for each element. Therefore the labels, i.e. y, that you provide to the model must have a shape of (n_samples, 120, n_entities) (or (n_samples, 120, 1) if you are using sparse labeling). Side note: there is no difference between TimeDistributed(Dense(...)) and Dense(...), as the Dense layer is applied on the last axis.
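Concretely, for the single example above, that could mean padding the (9, 8) one-hot labels out to the full 120 timesteps and adding a batch axis (a sketch; padding with all-zero label rows is an assumption here):

import numpy as np

n_timesteps, n_entities = 120, 8
y_words = np.zeros((9, n_entities))  # stand-in for the (9, 8) gold labels

# pad with zero rows up to the sequence length and add a batch dimension
y_padded = np.zeros((1, n_timesteps, n_entities))
y_padded[0, :y_words.shape[0], :] = y_words
print(y_padded.shape)  # (1, 120, 8) -- with sparse labeling you would instead
                       # build integer class indices of shape (1, 120, 1)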
Does Keras's LSTM really take into account the cell state and previous output?
I learned about LSTMs over the past day, and then I decided to look at a tutorial that uses Keras to create one. I looked at several tutorials and they all had a derivative of:

model = Sequential()
model.add(LSTM(10, input_shape=(1,1)))
model.add(Dense(1, activation='linear'))
model.compile(loss='mse', optimizer='adam')
X, y = get_train()
model.fit(X, y, epochs=300, shuffle=False, verbose=0)

Then they predicted using:

model.predict(X, verbose=0)

My question is: don't you have to give the previous prediction along with the input and cell state in order to predict the next outcome using an LSTM? Also, what does the 10 represent in model.add(LSTM(10, input_shape=(1,1)))?
You have to feed the previous prediction back in as input. If you call predict, the LSTM will be initialized every time; it will not remember the state from previous predictions. Typically (e.g. if you generate text with an LSTM) you have a loop where you do something like this:

# pick a random seed
start = numpy.random.randint(0, len(dataX)-1)
pattern = dataX[start]
print("Seed:")
print("\"", ''.join([int_to_char[value] for value in pattern]), "\"")
# generate characters
for i in range(1000):
    x = numpy.reshape(pattern, (1, len(pattern), 1))
    x = x / float(n_vocab)
    prediction = model.predict(x, verbose=0)
    index = numpy.argmax(prediction)
    result = int_to_char[index]
    sys.stdout.write(result)
    pattern.append(index)
    pattern = pattern[1:len(pattern)]
print("\nDone.")

(example copied from machinelearningmastery.com)

The important lines are these:

pattern.append(index)
pattern = pattern[1:len(pattern)]

Here they append the predicted index to the pattern and then drop the first element, so the input length matches what the LSTM expects. They then reshape it into a numpy array (x = numpy.reshape(...)) and predict from the model with the generated output. So, to answer your first question: yes, you need to feed the output back in again. For the second question: the 10 corresponds to the number of LSTM cells you have in the layer. If you don't use return_sequences=True, it corresponds to the output size of that layer.
Let's break it down into pieces and look at it pictorially.

LSTM(10, input_shape=(3,1)): defines an LSTM whose sequence length is 3, i.e. the LSTM will unroll for 3 timesteps. At each timestep the LSTM takes an input of size 1. The output (and also the size of the hidden state and all other LSTM gates) is 10 (a vector of size 10). You don't have to do the unrolling manually (passing the current hidden state to the next step); it is taken care of by the Keras/TensorFlow LSTM layer. All you have to do is pass in data in the (batch_size x time_steps x input_size) format.

Dense(1, activation='linear'): a dense layer with linear activation which takes as input the output of the previous layer (i.e. the output of the LSTM, which is the vector of size 10 from the last unrolling step). It returns a vector of size 1.

The same can be checked using model.summary().
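A minimal sketch to verify those shapes (same toy dimensions as above):

from keras.models import Sequential
from keras.layers import LSTM, Dense

model = Sequential()
model.add(LSTM(10, input_shape=(3, 1)))    # output shape: (None, 10)
model.add(Dense(1, activation='linear'))   # output shape: (None, 1)
model.summary()                            # prints the per-layer output shapes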
Your 1st question: don't you have to give the previous prediction along with input and cell state in order to predict the next outcome using an LSTM?

No, you don't have to do that. As far as I understand, the state is stored in the LSTM cell, which is why LSTMs use so much RAM. If your data has a shape like (100, 1000) and you plug it into the fit function, each epoch will run over the 100 sequences. The LSTM will remember 1000 data points before resetting its state as it moves on to the next sequence.

Your 2nd question: what does the 10 represent in model.add(LSTM(10, input_shape=(1,1)))?

It is the size of the 1st layer after the input, so your model currently has these layer shapes:

1,1 (input)
10 (LSTM)
1 (Dense output)

Hope it helps :)
LSTM architecture in Keras implementation?
I am new to Keras and going through the LSTM implementation details in the Keras documentation. It was going easily, but suddenly I came across this SO post and the comment there. It has confused me about what the actual LSTM architecture is. Here is the code:

model = Sequential()
model.add(LSTM(32, input_shape=(10, 64)))
model.add(Dense(2))

As per my understanding, 10 denotes the number of timesteps, each one fed to its respective LSTM cell, and 64 denotes the number of features for each timestep. But the comment in the above post and the actual answer have confused me about the meaning of 32. Also, how is the output from the LSTM connected to the Dense layer? A hand-drawn diagrammatic explanation would be quite helpful in visualizing the architecture.

EDIT: As far as this other SO post is concerned, it means 32 represents the length of the output vector produced by each of the LSTM cells if return_sequences=True. If that's true, then how do we connect each of the 32-dimensional outputs produced by each of the 10 LSTM cells to the next Dense layer? Also, kindly tell me whether the first SO post's answer is ambiguous or not.
How do we connect each of the 32-dimensional outputs produced by each of the 10 LSTM cells to the next dense layer?

It depends on how you want to do it. Suppose you have:

model.add(LSTM(32, input_shape=(10, 64), return_sequences=True))

Then the output of that layer has shape (10, 32). At this point you can either use a Flatten layer to get a single vector with 320 components, or use a TimeDistributed layer to work on each of the 10 vectors independently:

model.add(TimeDistributed(Dense(15)))

The output shape of this layer is (10, 15), and the same weights are applied to the output of every LSTM unit.

It's easy to figure out the number of LSTM cells required for the input (specified by the timespan). How do you figure out the number of LSTM units required in the output?

You either get the output of the last LSTM cell (last timestep) or the output of every LSTM cell, depending on the value of return_sequences. As for the dimensionality of the output vector, that's just a choice you have to make, like the size of a dense layer or the number of filters in a conv layer.

How does each 32-dimensional vector from the 10 LSTM cells get connected to the TimeDistributed layer?

Following the previous example, you would have a (10, 32) tensor, i.e. a size-32 vector for each of the 10 LSTM cells. What TimeDistributed(Dense(15)) does is create a (15, 32) weight matrix and a bias vector of size 15, and do:

for h_t in lstm_outputs:
    dense_outputs.append(activation(dense_weights.dot(h_t) + dense_bias))

Hence, dense_outputs has size (10, 15), and the same weights were applied to every LSTM output, independently.

Note that everything still works when you don't know how many timesteps you need, e.g. for machine translation. In this case, you use None for the timestep; everything I wrote still applies, with the only difference that the number of timesteps is not fixed anymore. Keras will repeat LSTM, TimeDistributed, etc. for as many times as necessary (which depends on the input).
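A small numpy sketch of that loop, just to make the weight sharing concrete (shapes as in the answer; the identity activation is an arbitrary stand-in):

import numpy as np

lstm_outputs = np.random.rand(10, 32)   # one 32-dim vector per timestep
dense_weights = np.random.rand(15, 32)  # a single shared (15, 32) weight matrix
dense_bias = np.random.rand(15)

# the same weights are applied to every timestep independently
dense_outputs = np.array([dense_weights.dot(h_t) + dense_bias
                          for h_t in lstm_outputs])
print(dense_outputs.shape)  # (10, 15)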
How do you make predictions with a stateful LSTM?
Okay, so I trained a stateful LSTM character-wise on https://cs.stanford.edu/people/karpathy/char-rnn/shakespear.txt. It didn't seem to do too badly in terms of accuracy, but now I want to generate my own Shakespeare works. The question is, how do I go about actually generating predictions from it? In particular, the model's batch input shape is (128, 128, 63) and the output shape is (128, 128, 63). (The first number is the batch size, the second is the length of the prediction input and output, and the third is the number of distinct characters in the text.) For example, I would like to:

Generate various predictions starting from empty text
Generate predictions starting from a small starting text (such as "PYRULEZ:")

This should be possible given how LSTMs work. Here's a snippet of the code used to generate and fit the model:

model = Sequential()
model.add(LSTM(dataY.shape[2], batch_input_shape=(128, dataX.shape[1], dataX.shape[2]), return_sequences=True, stateful=True, activation="softmax"))
model.summary()
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['acc'])
model.fit(dataX, dataY, epochs=1, batch_size=128, verbose=1, shuffle=False)

Looking at other code samples, it appears I'll need to modify this somehow, but I'm not sure how specifically. I can include the whole code sample if that would be helpful; it is self-contained.
Simple. Put your input into model.predict() with appropriate parameters (see the documentation) and concatenate the input and output (the model predicts on progressively longer chains). Depending on how you organised training, the output will add one character at a time. To be more precise, if you train sequence to sequence shifted by one, your output sequence will ideally be your input sequence shifted by one element: PYRULEZ -> YRULEZ*. Hence you need to take the last character of the output and add it to your prior (input) sequence. If you want long lines of text, you might want to limit the length of your sequence to some number of characters in the loop. Much of the long-term dependency in the text is carried through the stateful vector of the LSTM cell anyway (not something you interact with directly). Pseudocode-ish:

for counter in range(output_length):
    output = model.predict(input_)
    # keep the time axis when taking the last step, so the shapes line up
    input_ = np.concatenate((input_, output[:, -1:, :]), axis=1)
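Fleshed out a little, under the assumption that you rebuild the trained model with batch size 1 and sequence length 1 (a common trick for stateful generation; every name below is illustrative, and in practice you would copy the trained weights over with set_weights):

import numpy as np
from keras.models import Sequential
from keras.layers import LSTM

n_chars = 63  # number of distinct characters, as in the question

# hypothetical generation model: same architecture, but one character at a time;
# real usage: gen_model.set_weights(model.get_weights())
gen_model = Sequential()
gen_model.add(LSTM(n_chars, batch_input_shape=(1, 1, n_chars),
                   return_sequences=True, stateful=True, activation='softmax'))

# hypothetical character maps; the real ones come from the training corpus
idx_to_char = {i: chr(33 + i) for i in range(n_chars)}
char_to_idx = {c: i for i, c in idx_to_char.items()}

def one_hot(idx):
    x = np.zeros((1, 1, n_chars))
    x[0, 0, idx] = 1.0
    return x

gen_model.reset_states()

# warm up the LSTM state on the seed text, one character at a time
seed = "PYRULEZ:"
for ch in seed:
    probs = gen_model.predict(one_hot(char_to_idx[ch]), verbose=0)

# then sample, feeding each predicted character back in as the next input
generated = seed
for _ in range(200):
    idx = int(np.argmax(probs[0, -1]))
    generated += idx_to_char[idx]
    probs = gen_model.predict(one_hot(idx), verbose=0)

print(generated)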