LSTM architecture in Keras implementation?

LSTM architecture in Keras implementation? - python

I am new to Keras and going through the LSTM and its implementation details in Keras documentation. It was going easy but suddenly I came through this SO post and the comment. It has confused me on what is the actual LSTM architecture:
Here is the code:
model = Sequential()
model.add(LSTM(32, input_shape=(10, 64)))
model.add(Dense(2))
As per my understanding, 10 denote the no. of time-steps and each one of them is fed to their respective LSTM cell; 64 denote the no. of features for each time-step.
But, the comment in the above post and the actual answer has confused me about the meaning of 32.
Also, how is the output from LSTM is getting connected to the Dense layer.
A hand-drawn diagrammatic explanation would be quite helpful in visualizing the architecture.
EDIT:
As far as this another SO post is concerned, then it means 32 represents the length of the output vector that is produced by each of the LSTM cells if return_sequences=True.
If that's true then how do we connect each of 32-dimensional output produced by each of the 10 LSTM cells to the next dense layer?
Also, kindly tell if the first SO post answer is ambiguous or not?

how do we connect each of 32-dimensional output produced by each of
the 10 LSTM cells to the next dense layer?
It depends on how you want to do it. Suppose you have:
model.add(LSTM(32, input_shape=(10, 64), return_sequences=True))
Then, the output of that layer has shape (10, 32). At this point, you can either use a Flatten layer to get a single vector with 320 components, or use a TimeDistributed to work on each of the 10 vectors independently:
model.add(TimeDistributed(Dense(15))
The output shape of this layer is (10, 15), and the same weights are applied to the output of every LSTM unit.
it's easy to figure out the no. of LSTM cells required for the input(specified in timespan)
How to figure out the no. of LSTM units required in the output?
You either get the output of the last LSTM cell (last timestep) or the output of every LSTM cell, depending on the value of return_sequences. As for the dimensionality of the output vector, that's just a choice you have to make, just like the size of a dense layer, or number of filters in a conv layer.
how each of the 32-dim vector from the 10 LSTM cells get connected to TimeDistributed layer?
Following the previous example, you would have a (10, 32) tensor, i.e. a size-32 vector for each of the 10 LSTM cells. What TimeDistributed(Dense(15)) does, is to create a (15, 32) weight matrix and a bias vector of size 15, and do:
for h_t in lstm_outputs:
dense_outputs.append(
activation(dense_weights.dot(h_t) + dense_bias)
)
Hence, dense_outputs has size (10, 15), and the same weights were applied to every LSTM output, independently.
Note that everything still works when you don't know how many timesteps you need, e.g. for machine translation. In this case, you use None for the timestep; everything that I wrote still applies, with the only difference that the number of timesteps is not fixed anymore. Keras will repeat LSTM, TimeDistributed, etc. for as many times as necessary (which depend on the input).

Related

How to make array have same number of samples?

I am a beginner using CNN and Keras and I am trying to make a program to predict whether someone could develop diabetes using data in a CSV file. I think I am getting confused with how to reshape the array as I am receiving the error:
ValueError: Data cardinality is ambiguous:
x sizes: 8
y sizes: 768
Make sure all arrays contain the same number of samples
Here is the code:
import pandas as pd
from keras.models import Sequential
from keras.layers import Dense, Conv2D, Flatten
# read in the csv file using pandas
data = pd.read_csv("diabetes.csv")
# extract the input and output columns from the dataframe
X = data.drop(columns=['Outcome'])
y = data['Outcome']
# reshape the input data into the shape expected by a CNN
X = X.values.reshape(8, 768, 1)
# create a Sequential model in Keras
model = Sequential()
# add a 2D convolutional layer with 32 filters and a kernel size of 3x3
model.add(Conv2D(32, kernel_size=(3, 3), activation="relu", input_shape=(8, 768, 1)))
# add a flatten layer to flatten the output from the convolutional layer
model.add(Flatten())
# add a fully-connected layer with 64 units and a ReLU activation
model.add(Dense(64, activation="relu"))
# add a fully-connected layer with 10 units and a softmax activation
model.add(Dense(10, activation="softmax"))
# compile the model using categorical crossentropy loss and an Adam optimizer
model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])
# fit the model using the input and output data
model.fit(X, y)
# print prediction
print(model.predict(10, 139, 80, 0, 0, 27.1, 1.441, 57))

Tldr, you probably don't want a CNN in this case.
First off, I’m assuming your data looks something like the following, if that’s not the case the rest of the post may be way off target:
enter image description here
So there are 768 rows or patients, 8 inputs for each row, and 1 output (known as the label).
Convolutional layers are used when there is an input signal that you wish to analyze. In 2d, this would be something like a grid of pixels, or in 1d it might be time series data. Your data is neither – each row of the data represents a single 8-dimensional data point (i.e. a single patient) at a single point in time, so you very likely don’t want to use a convolutional layer at all.
For more information, you can read up on the differences between convnets and fully connected neural networks here: https://ai.stackexchange.com/questions/5546/what-is-the-difference-between-a-convolutional-neural-network-and-a-regular-neur?rq=1
“CNN, in specific, has one or more layers of convolution units. A convolution unit receives its input from multiple units from the previous layer which together create a proximity. Therefore, the input units (that form a small neighborhood) share their weights.
The convolution units (as well as pooling units) are especially beneficial as:
• They reduce the number of units in the network (since they are many-to-one mappings). This means, there are fewer parameters to learn which reduces the chance of overfitting as the model would be less complex than a fully connected network.
• They consider the context/shared information in the small neighborhoods. This feature is very important in many applications such as image, video, text, and speech processing/mining as the neighboring inputs (eg pixels, frames, words, etc) usually carry related information."
A very naïve, very basic NN for a problem like this would just use Dense, i.e. fully connected layers.
In Keras, you can do the following:
model = Sequential()
model.add(Dense(64, activation="relu", input_shape=(8,)))
model.add(Dense(1, activation="sigmoid"))
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
Note that the last layer is a single neuron, since you have only one output. If you were classifying images as one of say 10 categories (dog, cat, bird, etc), then you would use 10 output nodes in the last layer, softmax them, and use categorical cross entropy. Here, with a single condition, you only need a single output node and note that the loss function should probably be binary crossentropy – i.e. you’re trying to detect the presence or absence of the condition.
Hope this helps.

how to Create Multi Head Convolutional layers merging as Temporal dimension in keras

I want to implement a time-series prediction model which has a window of non-image matrixes as input , each matrix to be processed by a Conv2d layer at the first layer and then the output of this conv layers merged as time dimension to be passed to a recurrent layer like LSTM,
one way is to use Time-Distribution technique but TimeDistributed layer apply the same layer to several inputs. And it produce one output per input to get the result in time, the Time-Distribution technique will share the same weights among all convolution heads which is not what I want, for example If you injects 5 Matrixes, the weights are not tweaked 5 times, but only once, and distributed to every blocks defined in the current Time Distributed layer. how can I avoid this and have independent Convolutional heads with outputs merging as time dimension for the next layer?
I have tried to implement it as following
Matrix_Dimention=20;
Input_Window=4;
Input_Matrixes=[]
ConvLayers=[]
for i in range(0 , Input_Window):
Inp_Matrix=layers.Input(shape=(Matrix_Dimention,Matrix_Dimention,1));
Input_Matrixes.append(Inp_Matrix);
conv=layers.Conv2D(64, 5, activation='relu', input_shape=(Matrix_Dimention,Matrix_Dimention,1))(Inp_Matrix)
ConvLayers.append(conv);
#Temporal Concatenation
Spatial_Layers_Concate = layers.Concatenate(ConvLayers); # this causes error : Inputs to a layer should be tensors
#Temporal Component
LSTM_Layer=layers.LSTM(activation='relu',return_sequences=False)(Spatial_Layers_Concate )
Model = keras.Model(Input_Matrixes, LSTM_Layer)
Model.compile(optimizer='adam', loss=keras.losses.MeanSquaredError)
it would be great if you provide your answer by correcting my implementation or provide your own if there is a better way to form this idea , tnx.

Tensorflow - building LSTM model - need for tf.keras.layers.Dense()

Python 3.7 tensorflow
I am experimenting Time series forecasting w Tensorflow
I understand the second line creates a LSTM RNN i.e. a Recurrent Neural Network of type Long Short Term Memory.
Why do we need to add a Dense(1) layer in the end?
single_step_model = tf.keras.models.Sequential()
single_step_model.add(tf.keras.layers.LSTM(32, input_shape=x_train_single.shape[-2:]))
single_step_model.add(tf.keras.layers.Dense(1))
Tutorial for Dense() says
Dense implements the operation: output = activation(dot(input, kernel) + bias) where activation is the element-wise activation function passed as the activation argument, kernel is a weights matrix created by the layer, and bias is a bias vector created by the layer (only applicable if use_bias is True).
would you like to rephrase or elaborate on need for Dense() here ?

The following line
single_step_model.add(tf.keras.layers.LSTM(32, input_shape=x_train_single.shape[-2:]))
creates an LSTM layer which transforms each input step of size #features into a latent representation of size 32. You want to predict a single value so you need to convert this latent representation of size 32 into a single value. Hence, you add the following line
single_step_model.add(tf.keras.layers.Dense(1))
which adds a Dense Layer (Fully-Connected Neural Network) with one neuron in the output which, obviously, produces a single value. Look at it as a way to transform an intermediate result of higher dimensionality into the final result.

Well in the tutorial you are following Time series forecasting, they are trying to forecast temperature (6 hrs ahead). For which they are using an LSTM followed by a Dense layer.
single_step_model = tf.keras.models.Sequential()
single_step_model.add(tf.keras.layers.LSTM(32, input_shape=x_train_single.shape[-2:]))
single_step_model.add(tf.keras.layers.Dense(1))
Dense layer is nothing but a regular fully-connected NN layer. In this case you are bringing down the output dimensionality to 1, which should represent some proportionality (need not be linear) to the temperature you are trying to predict. There are other layers you could use as well. Check out, Keras Layers.
If you are confused about the input and output shape of LSTM, check out
I/O Shape.

How to interpret clearly the meaning of the units parameter in Keras?

I am wondering how LSTM work in Keras. In this tutorial for example, as in many others, you can find something like this :
model.add(LSTM(4, input_shape=(1, look_back)))
What does the "4" mean. Is it the number of neuron in the layer. By neuron, I mean something that for each instance gives a single output ?
Actually, I found this brillant discussion but wasn't really convinced by the explanation mentioned in the reference given.
On the scheme, one can see the num_unitsillustrated and I think I am not wrong in saying that each of this unit is a very atomic LSTM unit (i.e. the 4 gates). However, how these units are connected ? If I am right (but not sure), x_(t-1)is of size nb_features, so each feature would be an input of a unit and num_unit must be equal to nb_features right ?
Now, let's talk about keras. I have read this post and the accepted answer and get trouble. Indeed, the answer says :
Basically, the shape is like (batch_size, timespan, input_dim), where input_dim can be different from the unit
In which case ? I am in trouble with the previous reference...
Moreover, it says,
LSTM in Keras only define exactly one LSTM block, whose cells is of unit-length.
Okay, but how do I define a full LSTM layer ? Is it the input_shape that implicitely create as many blocks as the number of time_steps (which, according to me is the first parameter of input_shape parameter in my piece of code ?
Thanks for lighting me
EDIT : would it also be possible to detail clearly how to reshape data of, say, size (n_samples, n_features) for a stateful LSTM model ? How to deal with time_steps and batch_size ?

First, units in LSTM is NOT the number of time_steps.
Each LSTM cell(present at a given time_step) takes in input x and forms a hidden state vector a, the length of this hidden unit vector is what is called the units in LSTM(Keras).
You should keep in mind that there is only one RNN cell created by the code
keras.layers.LSTM(units, activation='tanh', …… )
and RNN operations are repeated by Tx times by the class itself.
I've linked this to help you understand it better in with a very simple code.

You can (sort of) think of it exactly as you think of fully connected layers. Units are neurons.
The dimension of the output is the number of neurons, as with most of the well known layer types.
The difference is that in LSTMs, these neurons will not be completely independent of each other, they will intercommunicate due to the mathematical operations lying under the cover.
Before going further, it might be interesting to take a look at this very complete explanation about LSTMs, its inputs/outputs and the usage of stative = true/false: Understanding Keras LSTMs. Notice that your input shape should be input_shape=(look_back, 1). The input shape goes for (time_steps, features).
While this is a series of fully connected layers:
hidden layer 1: 4 units
hidden layer 2: 4 units
output layer: 1 unit
This is a series of LSTM layers:
Where input_shape = (batch_size, arbitrary_steps, 3)
Each LSTM layer will keep reusing the same units/neurons over and over until all the arbitrary timesteps in the input are processed.
The output will have shape:
(batch, arbitrary_steps, units) if return_sequences=True.
(batch, units) if return_sequences=False.
The memory states will have a size of units.
The inputs processed from the last step will have size of units.
To be really precise, there will be two groups of units, one working on the raw inputs, the other working on already processed inputs coming from the last step. Due to the internal structure, each group will have a number of parameters 4 times bigger than the number of units (this 4 is not related to the image, it's fixed).
Flow:
Takes an input with n steps and 3 features
Layer 1:
For each time step in the inputs:
Uses 4 units on the inputs to get a size 4 result
Uses 4 recurrent units on the outputs of the previous step
Outputs the last (return_sequences=False) or all (return_sequences = True) steps
output features = 4
Layer 2:
Same as layer 1
Layer 3:
For each time step in the inputs:
Uses 1 unit on the inputs to get a size 1 result
Uses 1 unit on the outputs of the previous step
Outputs the last (return_sequences=False) or all (return_sequences = True) steps

The number of units is the size (length) of the internal vector states, h and c of the LSTM. That is no matter the shape of the input, it is upscaled (by a dense transformation) by the various kernels for the i, f, and o gates. The details of how the resulting latent features are transformed into h and c are described in the linked post. In your example, the input shape of data
(batch_size, timesteps, input_dim)
will be transformed to
(batch_size, timesteps, 4)
if return_sequences is true, otherwise only the last h will be emmited making it (batch_size, 4). I would recommend using a much higher latent dimension, perhaps 128 or 256 for most problems.

I would put it this way - there are 4 LSTM "neurons" or "units", each with 1 Cell State and 1 Hidden State for each timestep they process. So for an input of 1 timestep processing , you will have 4 Cell States, and 4 Hidden States and 4 Outputs.
Actually the correct way to say this is - for one timestep sized input you 1 Cell State (a vector of size 4) and 1 Hidden State (a vector of size 4) and 1 Output (a vector of size 4).
So if you feed in a timeseries with 20 steps, you will have 20 (intermediate) Cell States, each of size 4. That is because the inputs in LSTM are processed sequentially, 1 after the other. Similarly you will have 20 Hidden States, each of size 4.
Usually, your output will be the output of the LAST step (a vector of size 4). However in case you want the outputs of each intermediate step(remember you have 20 timesteps to process), you can make return_sequences = TRUE. In which case you will have 20 , 4 sized vectors each telling you what was the output when each of those steps got processed as those 20 inputs came one after the other.
In case when you put return_states = TRUE , you get the last Hidden State of size = 4 and last Cell State of size 4.

Keras Bidirectional LSTM - Layer grouping

While working to implement a paper (Dialogue Act Sequence Labeling using Hierarchical encoder with CRF) using Keras, I need to implement a specific Bidirectional LSTM architecture.
I have to train the network on the concept of a Conversation. Conversations are composed of Utterances, and Utterances are composed of Words. Words are N-dimensional vectors. The model represented in the paper first reduces each Utterance to a single M-dimensional vector. To achieve this, it uses a Bidirectional LSTM layer. Let's call this layer A.
(For simplicity, let's assume that each Utterance has a length of |U| and each Conversation has a length of |C|)
Each Utterance is input to a Bi-LSTM layer with U timesteps, and the output of the last timestep is taken. The input size is (|U|, N), and the output size is (1, M).
This Bi-LSTM layer should be applied separately/simultaneously to each Utterance in the Conversation. Note that, since the network takes as input the entire Conversation, the dimensions for a single input to the network would be (|C|, |U|, N).
As the paper describes, I intend to feed each utterance (i.e. each (|U|, N)) of that input and feed it to a Bi-LSTM layer with |U| units. As there are |C| Utterances in a Conversation, this implies that there should be a total of |C| x |U| Bi-LSTM units, grouped into |C| different partitions for each Utterance. There should be no connection between the |C| groups of units. Once processed, the output of each of those C groups of Bidirectional LSTM units will then be fed into another Bi-LSTM layer, say B.
How is it possible to feed specific portions of the input only to specific portions of the layer A, and make sure that they are not interconnected? (i.e. the portion of Bi-LSTM units used for an Utterance should not be connected to the Bi-LSTM units used for another Utterance)
Is it possible to achieve this through keras.models.Sequential, or is there a specific way to achieve this using Functional API?
Here is what I have tried so far:
# ...
model = Sequential()
model.add(Bidirectional(LSTM(C * U), input_shape = (C, U, N),
merge_mode='concat'))
model.add(GlobalMaxPooling1D())
model.add(Bidirectional(LSTM(n, return_sequences = True), merge_mode='concat'))
# ...
model.compile(loss = loss_function,
optimizer = optimizer,
metrics=['accuracy'])
However, this code is currently receiving the following error:
ValueError: Input 0 is incompatible with layer bidirectional_1: expected ndim=3, found ndim=4
More importantly, the code above obviously does not do the grouping I mentioned. I am looking for a way to enhance the model as I described above.
Finally, below is the figure of the model I described above. It may possibly help clarify some of the written content narrated above. The layer tagged as "Utterance layer" is what I called the layer A. As you can see in the figure, each utterance u_i in the figure is composed of words w_j, which are N-dimensional vectors. (You may omit the embedding layer for the purposes of this question) Assuming, for simplicity, that each u_i has equal number of Words, then each group of Bidirectional LSTM nodes in the Utterance Layer will have an input size of (|U|, N). Yet, since there are |C| such utterances u_i in a Conversation, the dimensions of the entire input will be (|C|, |U|, N).

I'll create a net for what I see in the picture. For now I'm ignoring the "units" part I mentioned in my comment to your question.
This model does exactly what is shown in the picture. All utterances are completely separate from start to end.
model = Sequential()
#You have an extra time dimension that should be kept as is
#So we add a TimeDistributed` wrapper to the first layers
model.add(TimeDistributed(Embedding(dictionaryLength,N), input_shape=(C,U)))
#This is the utterance layer. It works in "word steps", keeping "utterance steps" untouched
model.add(TimeDistributed(Bidirectional(LSTM(M//2, return_sequences=False))))
#Is the pooling really demanded by the article?
#Or was it an attempt to remove one of the time dimensions?
#Not adding it here because I used `return_sequences=False`
model.add(Bidirectional(LSTM(someSize//2,return_sequences=True)))
model.add(Dense(anotherSize)) #is this a CRF layer???
model.summary()
Notice that in every Bidirectional layer, I divided the output size by two, so it's important that M and someSize are even numbers.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.