How does point-wise multiplication and addition take place in an LSTM? - python

timesteps=4, features=2 and LSTM(units=3)
The input to the LSTM will have shape (batch_size, timesteps, features), and the hidden state and cell state will each have shape (None, units). When the LSTM takes the input at t1, it is concatenated with the hidden state along the features axis. After that it passes through the sigmoid and tanh activations. Now my main confusion is: how does it do the point-wise operations (addition and multiplication) with the cell state?
How I look at this is shown in the attached figure. If there is anything I have gotten wrong, kindly correct me. Thanks in advance to everyone.
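To make the shapes concrete, here is a minimal NumPy sketch of one LSTM step with features=2 and units=3 as in the question. The random weights and variable names are illustrative only, not Keras internals; the point is that every operand of the point-wise operations has shape (units,), so * and + act element by element.

```python
import numpy as np

features, units = 2, 3
rng = np.random.default_rng(0)

x_t = rng.standard_normal(features)   # input at time t, shape (2,)
h_prev = np.zeros(units)              # previous hidden state, shape (3,)
c_prev = np.zeros(units)              # previous cell state, shape (3,)

# Each gate has a weight matrix of shape (units, features + units)
W_f, W_i, W_o, W_c = (rng.standard_normal((units, features + units))
                      for _ in range(4))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.concatenate([x_t, h_prev])     # concatenated input, shape (5,)

f = sigmoid(W_f @ z)                  # forget gate, shape (3,)
i = sigmoid(W_i @ z)                  # input gate, shape (3,)
o = sigmoid(W_o @ z)                  # output gate, shape (3,)
c_tilde = np.tanh(W_c @ z)            # candidate cell state, shape (3,)

# Point-wise ops with the cell state: all operands have shape (3,)
c_t = f * c_prev + i * c_tilde
h_t = o * np.tanh(c_t)

print(c_t.shape, h_t.shape)           # (3,) (3,)
```

Because the gate outputs and the cell state share the shape (units,), no extra reshaping is needed for the point-wise multiply and add.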

Related

Context vector shape using Bahdanau Attention

I am looking here at the Bahdanau attention class. I noticed that the final shape of the context vector is (batch_size, hidden_size). I am wondering how they got that shape given that attention_weights has shape (batch_size, 64, 1) and features has shape (batch_size, 64, embedding_dim). They multiplied the two (I believe it is a matrix product) and then summed up over the first axis. Where is the hidden size coming from in the context vector?
The context vector resulting from Bahdanau attention is a weighted average of all the hidden states of the encoder. The following image from Ref shows how this is calculated. Essentially we do the following.
Compute attention weights, which is a (batch size, encoder time steps, 1) sized tensor
Multiply each hidden state (batch size, hidden size) element-wise with the attention weights, resulting in (batch size, encoder timesteps, hidden size)
Average over the time dimension, resulting in (batch size, hidden size)
The answer given is incorrect. Let me explain why first, before I share what the actual answer is.
Take a look at the code concerned in the hyperlink provided. The 'hidden size' in the code refers to the dimensions of the hidden state of the decoder and NOT the hidden state(s) of the encoder, as the answer above has assumed. The above multiplication in the code will yield (batch_size, embedding_dim), as the question-framer mg_nt rightly points out. The context is a weighted sum of the encoder output and SHOULD have the SAME dimension as the encoder outputs. Mathematically as well, one should NOT get (batch size, hidden size).
Of course, in this case they are using attention over a CNN, so there is no encoder as such; instead the image is broken down into features. These features are collected from the second-to-last layer, and each feature is a specific component of the overall image. The hidden state from the decoder, i.e. the query, 'attends' to all these features and decides which ones are important and need to be given a higher weight to determine the next word in the caption. The features in the above code have shape (batch_size, 64, embedding_dim), and hence the context, after being magnified or diminished by the attention weights and summed over the 64 locations, will have shape (batch_size, embedding_dim)!
This is simply a mistake in the comments of the code concerned (the code functionality itself seems right). The shapes mentioned in the comments are incorrect. If you search the code for 'hidden_size', there is no such variable; it is only mentioned in the comments. If you further look at the declarations of the encoder and decoder, they use the same embedding size for both. So the code works, but the comments in the code are misleading and incorrect. That is all there is to it.
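To see the shapes concretely, here is a NumPy sketch of the weighted-sum computation both answers describe. The 64 locations come from the question; batch_size=8 and embedding_dim=256 are made-up values for illustration:

```python
import numpy as np

batch_size, locations, embedding_dim = 8, 64, 256
rng = np.random.default_rng(0)

# features: one embedding_dim-sized vector per image location
features = rng.standard_normal((batch_size, locations, embedding_dim))
scores = rng.standard_normal((batch_size, locations, 1))

# softmax over the locations axis: weights sum to 1 per example
attention_weights = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)

# element-wise multiply (broadcast over the last axis), then sum over locations
context = (attention_weights * features).sum(axis=1)
print(context.shape)   # (8, 256) == (batch_size, embedding_dim)
```

The context vector inherits the last dimension of `features` (embedding_dim here), not some separate hidden size, which is the correction the second answer makes.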

How to interpret clearly the meaning of the units parameter in Keras?

I am wondering how LSTMs work in Keras. In this tutorial for example, as in many others, you can find something like this:
model.add(LSTM(4, input_shape=(1, look_back)))
What does the "4" mean? Is it the number of neurons in the layer? By neuron, I mean something that gives a single output for each instance.
Actually, I found this brilliant discussion but wasn't really convinced by the explanation mentioned in the reference given.
On the scheme, one can see the num_units illustrated, and I think I am not wrong in saying that each of these units is a very atomic LSTM unit (i.e. the 4 gates). However, how are these units connected? If I am right (but I am not sure), x_(t-1) is of size nb_features, so each feature would be the input of a unit, and num_units must be equal to nb_features, right?
Now, let's talk about Keras. I have read this post and the accepted answer and ran into trouble. Indeed, the answer says:
Basically, the shape is like (batch_size, timespan, input_dim), where input_dim can be different from the unit
In which case? I am having trouble reconciling this with the previous reference...
Moreover, it says,
LSTM in Keras only define exactly one LSTM block, whose cells is of unit-length.
Okay, but how do I define a full LSTM layer? Is it the input_shape that implicitly creates as many blocks as the number of time_steps (which, according to me, is the first parameter of the input_shape argument in my piece of code)?
Thanks for enlightening me.
EDIT: would it also be possible to detail clearly how to reshape data of, say, size (n_samples, n_features) for a stateful LSTM model? How to deal with time_steps and batch_size?
First, units in LSTM is NOT the number of time_steps.
Each LSTM cell (present at a given time_step) takes in an input x and forms a hidden state vector a; the length of this hidden state vector is what is called the units in LSTM (Keras).
You should keep in mind that there is only one RNN cell created by the code
keras.layers.LSTM(units, activation='tanh', …… )
and the RNN operations are repeated Tx times by the class itself.
I've linked this to help you understand it better with a very simple code example.
You can (sort of) think of it exactly as you think of fully connected layers. Units are neurons.
The dimension of the output is the number of neurons, as with most of the well known layer types.
The difference is that in LSTMs, these neurons will not be completely independent of each other; they will intercommunicate due to the mathematical operations lying under the cover.
Before going further, it might be interesting to take a look at this very complete explanation about LSTMs, their inputs/outputs and the usage of stateful=True/False: Understanding Keras LSTMs. Notice that your input shape should be input_shape=(look_back, 1). The input shape goes as (time_steps, features).
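As a sketch of the corrected call from the question (look_back=3 and the all-zeros batch are arbitrary choices for illustration, and `Input(shape=...)` is the equivalent of passing input_shape=(look_back, 1)):

```python
import numpy as np
import tensorflow as tf

look_back = 3
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(look_back, 1)),  # (time_steps, features)
    tf.keras.layers.LSTM(4),                      # units=4
])

# Two samples of look_back steps with 1 feature each
y = model(np.zeros((2, look_back, 1), dtype="float32"))
print(y.shape)   # (2, 4): one length-4 vector per sample, whatever look_back is
```

Changing look_back changes nothing about the output size; only units does, which is the sense in which units behaves like the neuron count of a Dense layer.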
While this is a series of fully connected layers:
hidden layer 1: 4 units
hidden layer 2: 4 units
output layer: 1 unit
This is a series of LSTM layers:
Where the input has shape (batch_size, arbitrary_steps, 3)
Each LSTM layer will keep reusing the same units/neurons over and over until all the arbitrary timesteps in the input are processed.
The output will have shape:
(batch, arbitrary_steps, units) if return_sequences=True.
(batch, units) if return_sequences=False.
The memory states will have a size of units.
The inputs processed from the last step will have size of units.
To be really precise, there will be two groups of units: one working on the raw inputs, the other working on already processed inputs coming from the previous step. Due to the internal structure, each group will have a number of parameters 4 times bigger than the number of units (this 4 is not related to the image; it is fixed).
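The factor of 4 above is the four gates. Using the usual Keras parameterization (kernel + recurrent kernel + bias), the parameter count can be checked with a small helper (a sketch; `lstm_param_count` is not a Keras function):

```python
# 4 gates, each with: units*features (kernel) + units*units (recurrent
# kernel) + units (bias) parameters.
def lstm_param_count(units, features):
    return 4 * (units * features + units * units + units)

# e.g. units=4 on 3 input features, as in the layer-1 example above:
print(lstm_param_count(4, 3))   # 128
```

This matches what `model.summary()` reports for an LSTM layer in Keras.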
Flow:
Takes an input with n steps and 3 features
Layer 1:
For each time step in the inputs:
Uses 4 units on the inputs to get a size 4 result
Uses 4 recurrent units on the outputs of the previous step
Outputs the last (return_sequences=False) or all (return_sequences = True) steps
output features = 4
Layer 2:
Same as layer 1
Layer 3:
For each time step in the inputs:
Uses 1 unit on the inputs to get a size 1 result
Uses 1 unit on the outputs of the previous step
Outputs the last (return_sequences=False) or all (return_sequences = True) steps
The number of units is the size (length) of the internal vector states, h and c, of the LSTM. No matter the shape of the input, it is upscaled (by a dense transformation) by the various kernels for the i, f, and o gates. The details of how the resulting latent features are transformed into h and c are described in the linked post. In your example, the input shape of the data
(batch_size, timesteps, input_dim)
will be transformed to
(batch_size, timesteps, 4)
if return_sequences is true; otherwise only the last h will be emitted, making it (batch_size, 4). I would recommend using a much higher latent dimension, perhaps 128 or 256, for most problems.
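The two output shapes can be reproduced with a plain NumPy loop over timesteps (a sketch, not Keras internals: the random weights, zero bias and gate ordering are illustrative; batch=8, timesteps=20, input_dim=5 are made-up sizes, units=4 matches the example):

```python
import numpy as np

rng = np.random.default_rng(0)
batch, timesteps, input_dim, units = 8, 20, 5, 4

x = rng.standard_normal((batch, timesteps, input_dim))
W = rng.standard_normal((4 * units, input_dim + units)) * 0.1
b = np.zeros(4 * units)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

h = np.zeros((batch, units))
c = np.zeros((batch, units))
all_h = []
for t in range(timesteps):   # the same weights W are reused at every step
    z = np.concatenate([x[:, t, :], h], axis=1) @ W.T + b
    i, f, g, o = np.split(z, 4, axis=1)
    c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
    h = sigmoid(o) * np.tanh(c)
    all_h.append(h)

seq = np.stack(all_h, axis=1)
print(seq.shape)   # (8, 20, 4) -> what return_sequences=True would give
print(h.shape)     # (8, 4)     -> what return_sequences=False would give
```

Note that the `return_sequences=False` result is simply the last slice of the `return_sequences=True` result.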
I would put it this way - there are 4 LSTM "neurons" or "units", each with 1 Cell State and 1 Hidden State for each timestep they process. So for an input of 1 timestep, you will have 4 Cell States, 4 Hidden States and 4 Outputs.
Actually the correct way to say this is - for a one-timestep input you get 1 Cell State (a vector of size 4), 1 Hidden State (a vector of size 4) and 1 Output (a vector of size 4).
So if you feed in a timeseries with 20 steps, you will have 20 (intermediate) Cell States, each of size 4. That is because the inputs in LSTM are processed sequentially, 1 after the other. Similarly you will have 20 Hidden States, each of size 4.
Usually, your output will be the output of the LAST step (a vector of size 4). However, in case you want the output of each intermediate step (remember you have 20 timesteps to process), you can set return_sequences = TRUE, in which case you will have 20 vectors of size 4, each telling you what the output was as each of those 20 inputs got processed one after the other.
In case you set return_state = TRUE, you get the last Hidden State of size 4 and the last Cell State of size 4.

How to implement a Many-to-Many RNN in tensorflow?

The below code gives me all the hidden state values of the unrolled RNN.
hidden_states, final_hidden_state = tf.nn.dynamic_rnn(...)
How do I multiply each of the hidden_states with the weight "Why" shown in the figure?
I find this confusing because the shape of hidden_states is [mini_batch_size, max_seq_length, num_of_hidden_neurons].
Do I have to make some kind of for loop over each of the hidden states and then multiply by Why to get the y values?
Thanks !
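As a shape sketch (not an authoritative answer): no loop is needed, because a single batched matmul (or einsum) applies Why at every time step. All sizes below are made up for illustration:

```python
import numpy as np

batch, max_seq_length, hidden, n_outputs = 3, 7, 5, 2
rng = np.random.default_rng(0)

# stand-ins for dynamic_rnn's output and the output projection Why
hidden_states = rng.standard_normal((batch, max_seq_length, hidden))
Why = rng.standard_normal((hidden, n_outputs))

# matmul broadcasts over the leading (batch, time) axes
y = hidden_states @ Why
# equivalently: y = np.einsum('bth,ho->bto', hidden_states, Why)
print(y.shape)   # (3, 7, 2): one output per batch element per time step
```

The same idea applies to the TensorFlow tensors themselves (e.g. with `tf.matmul` or `tf.einsum`), since the multiplication is over the last axis only.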

What is the return_state output using Keras' RNN Layer

I checked the Keras documentation for the LSTM layer; the information about the return_state argument is as below:
keras.layers.LSTM(units, return_state=True)
Arguments:
return_state: Boolean. Whether to return the last state in addition to the output.
Output shape
if return_state: a list of tensors. The first tensor is the output. The remaining tensors are the last states, each with shape (batch_size, units)
And that's all of the info about return_state for RNNs. As a beginner, it's really hard to understand what exactly it means that "The remaining tensors are the last states, each with shape (batch_size, units)", isn't it?
I do understand that there is a cell state c and a hidden state a that are passed to the next time step.
But when I did the programming exercise for an online course, I encountered this question. Below is the hint given by the assignment. But I don't understand what these three outputs mean.
from keras.layers import LSTM
LSTM_cell = LSTM(n_a, return_state = True)
a, _, c = LSTM_cell(input_x, initial_state=[a, c])
Someone said, they are respectively (https://machinelearningmastery.com/return-sequences-and-return-states-for-lstms-in-keras/):
1 The LSTM hidden state output for the last time step.
2 The LSTM hidden state output for the last time step (again).
3 The LSTM cell state for the last time step.
I always regard output a as the hidden state output of the LSTM, and c as the cell state output. But this person said that the first output is the LSTM output, while the second one is the hidden state output, which is different from the hint given by the online course instructions (as the hint uses the first output as the hidden state output for the next time step).
Could anyone tell me more about this?
As a more general question: in cases like this, where Keras doesn't give beginner-friendly, understandable documentation or examples, how can I learn Keras more efficiently?
Think about how you would start an iteration of the LSTM. You have a cell state c and an input x, but you also need the alleged previous output h, which is concatenated with x. The LSTM therefore has two state tensors that need to be initialized: c and h. Now h happens to be the output of the previous step, which is why you pass it as input together with c. When you set return_state=True, both c and h are returned. Together with the output, you'll therefore receive 3 tensors:
output, h (hidden state), c (memory/cell state)
Take LSTM as an example; you can understand it like this:
c(t) depends on c(t-1);
o(t) depends on x(t) and h(t-1);
h(t) depends on o(t) and c(t);
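A NumPy sketch of why the first two returned tensors coincide (the random weights and gate ordering are illustrative, not Keras internals; all sizes are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
batch, timesteps, features, units = 2, 4, 3, 5

x = rng.standard_normal((batch, timesteps, features))
W = rng.standard_normal((4 * units, features + units)) * 0.1

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

h = np.zeros((batch, units))   # hidden state (a in the course's notation)
c = np.zeros((batch, units))   # cell state
for t in range(timesteps):
    z = np.concatenate([x[:, t, :], h], axis=1) @ W.T
    i, f, g, o = np.split(z, 4, axis=1)
    c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
    h = sigmoid(o) * np.tanh(c)

# With return_sequences=False and return_state=True, Keras returns
# [output, h, c] -- and output IS the last h, hence the duplication
# in the three-item list above.
output = h
print(output.shape, h.shape, c.shape)   # (2, 5) (2, 5) (2, 5)
```

This is also why the course hint `a, _, c = LSTM_cell(...)` can discard the second tensor: it is the same as the first.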

How to vectorize LSTMs?

In particular, I'm confused about what it means for an LSTM layer to have (say) 50 cells. Consider the following LSTM block from this awesome blog post:
Say my input xt is a (20,) vector and the hidden layer ht is a (50,) vector. Given that the cell state Ct undergoes only point-wise operations (point-wise tanh and *) before becoming the new hidden state, I gather that Ct.shape = ht.shape = (50,). Now the forget gate looks at the input concatenated with the hidden layer, which would be a (20+50,) = (70,) vector, which means the forget gate must have a weight matrix of shape (50, 70), such that dot(W, [xt, ht]).shape = (50,).
So my question at this point is: am I looking at an LSTM block with 50 cells when Ct.shape = (50,)? Or am I misunderstanding what it means for an LSTM layer to have 50 cells?
I understand what you are getting confused by. The black line connecting the two boxes at the top, which represents the cell state, is actually a bundle of 50 tiny lines grouped together. These get multiplied point-wise with the output of the forget gate, which consists of 50 values. Those 50 values multiply with the cell state point-wise.
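The arithmetic in the question checks out; here is a NumPy sketch of exactly those shapes (random values stand in for real weights and states):

```python
import numpy as np

input_dim, units = 20, 50
rng = np.random.default_rng(0)

x_t = rng.standard_normal(input_dim)                   # (20,)
h_t = rng.standard_normal(units)                       # (50,)
C_t = rng.standard_normal(units)                       # (50,)
W_f = rng.standard_normal((units, input_dim + units))  # (50, 70)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

f_t = sigmoid(W_f @ np.concatenate([x_t, h_t]))  # forget gate output, (50,)
C_new = f_t * C_t                                # 50 values * 50 values, point-wise
print(f_t.shape, C_new.shape)                    # (50,) (50,)
```

So yes: Ct.shape = (50,) is exactly what "50 cells" (units=50) means, one scalar lane of the cell state per cell.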
