The code below gives me all the hidden state values of the unrolled RNN.
hidden_states, final_hidden_state = tf.nn.dynamic_rnn(...)
How do I multiply each of the hidden_states by the weight "Why" shown in the figure?
I find this confusing because the shape of hidden_states is [mini_batch_size, max_seq_length, num_of_hidden_neurons].
Do I have to write some kind of for loop over each hidden state and then multiply by Why to get the y values?
Thanks !
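No loop should be needed: a single einsum (or tensordot) applies Why to every time step of every sequence at once. A minimal NumPy sketch, with the name Why and the shapes taken from the question and all values hypothetical:

```python
import numpy as np

batch, max_seq_len, n_hidden, n_out = 2, 5, 8, 3

# Stand-ins for the tensors in the question (random placeholder values).
hidden_states = np.random.randn(batch, max_seq_len, n_hidden)
Why = np.random.randn(n_hidden, n_out)

# One einsum applies Why at every time step:
# (batch, time, hidden) x (hidden, out) -> (batch, time, out)
y = np.einsum('bth,ho->bto', hidden_states, Why)
assert y.shape == (batch, max_seq_len, n_out)

# Equivalent explicit loop, for intuition only (slower):
y_loop = np.stack([hidden_states[:, t, :] @ Why
                   for t in range(max_seq_len)], axis=1)
assert np.allclose(y, y_loop)
```

In TensorFlow the same contraction can be written with tf.einsum or tf.tensordot, so the per-timestep multiplication happens in one vectorized op.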
timesteps=4, features=2 and LSTM(units=3)
The input to the LSTM has dimensions (batch_size, timesteps, features), and the hidden state and cell state each have shape (None, units). When the LSTM takes the input at t1, it gets concatenated with the hidden state along the features axis. After that it passes through sigmoid and tanh. Now my main confusion is: how does it do the point-wise operations (addition and multiplication) with the cell state?
How I look at this is shown in the attached figure. If there is anything I have wrong, kindly correct me. Thanks in advance to everyone.
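The point-wise operations in question can be sketched directly from the standard LSTM cell equations: the forget gate multiplies the old cell state element-wise, and the input gate's element-wise product with the candidate is added to it. A minimal NumPy sketch using the shapes from the question (features=2, units=3; the weight matrices are random placeholders and biases are omitted):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
features, units = 2, 3

x_t = rng.standard_normal((1, features))   # input at time t
h_prev = np.zeros((1, units))              # previous hidden state
c_prev = np.zeros((1, units))              # previous cell state

# Concatenate input and hidden state along the features axis.
z = np.concatenate([x_t, h_prev], axis=1)  # shape (1, features + units)

# One weight matrix per gate (hypothetical random values, no biases).
W_f, W_i, W_c, W_o = (rng.standard_normal((features + units, units))
                      for _ in range(4))

f = sigmoid(z @ W_f)        # forget gate
i = sigmoid(z @ W_i)        # input gate
c_tilde = np.tanh(z @ W_c)  # candidate cell state
o = sigmoid(z @ W_o)        # output gate

# The point-wise operations: element-wise multiply, then element-wise add.
c_t = f * c_prev + i * c_tilde   # new cell state
h_t = o * np.tanh(c_t)           # new hidden state

assert c_t.shape == h_t.shape == (1, units)
```

Because every gate output has shape (1, units), the same shape as the cell state, the `*` and `+` above are plain element-wise operations; no matrix product is involved at that step.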
I am currently building a neural network to predict features such as temperature. So the output for this could be a positive or negative value. I am normalizing my input data and using the tanh activation function in each hidden layer.
Should I use a linear activation function for the output layer to get an unbounded continuous output, OR should I use tanh for the output layer and then inverse-normalize the output? Could someone explain this? I don't think my understanding of it is correct.
You are actually heading in the correct direction.
Option 1:
You need to normalize the temperatures first and then feed them to the model. Say your temperature ranges over [-100, 100]; convert it to the scale [-1, 1], then use this scaled version of the temperature as your target variable.
At prediction time, just inverse-transform the output and you will get your desired result.
Option 2:
You create a regression-style neural network and don't apply any activation function to the output layer (meaning no bounds on the values; they can be positive or negative).
In this case you are not required to normalize the values.
Sample NN Spec:
Input Layer ==> # of neurons: one per feature
Hidden Layer ==> relu/selu as activation function | # of neurons/hidden layers as per your requirement
Output Layer ==> 1 neuron / no activation function required
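Option 2 could be sketched in Keras roughly as follows (the layer sizes and feature count are placeholders, not taken from the question):

```python
import numpy as np
from tensorflow import keras

n_features = 4  # one input neuron per feature (placeholder value)

model = keras.Sequential([
    keras.layers.Input(shape=(n_features,)),
    keras.layers.Dense(16, activation='relu'),  # hidden layer, size as needed
    keras.layers.Dense(1),                      # 1 output neuron, no activation (linear)
])
model.compile(optimizer='adam', loss='mse')

# With a linear output the network can emit any real value,
# positive or negative, without inverse-transforming targets.
x = np.random.randn(8, n_features).astype('float32')
y_pred = model.predict(x, verbose=0)
assert y_pred.shape == (8, 1)
```

Omitting the activation on the final Dense layer is what makes the output unbounded; everything else is an ordinary regression setup with mean-squared-error loss.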
I am looking here at the Bahdanau attention class. I noticed that the final shape of the context vector is (batch_size, hidden_size). I am wondering how they got that shape given that attention_weights has shape (batch_size, 64, 1) and features has shape (batch_size, 64, embedding_dim). They multiplied the two (I believe it is a matrix product) and then summed up over the first axis. Where is the hidden size coming from in the context vector?
The context vector resulting from Bahdanau attention is a weighted average of all the hidden states of the encoder. The following image from Ref shows how this is calculated. Essentially we do the following.
Compute attention weights, which is a (batch size, encoder time steps, 1) sized tensor
Multiply each hidden state (batch size, hidden size) element-wise with the e values, resulting in (batch size, encoder timesteps, hidden size)
Average over the time dimension, resulting in (batch size, hidden size)
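The three steps above can be sketched with NumPy (batch size and feature dimension are placeholders; 64 is the number of encoder positions from the question):

```python
import numpy as np

batch, time_steps, feat_dim = 2, 64, 10

features = np.random.randn(batch, time_steps, feat_dim)  # encoder outputs
scores = np.random.randn(batch, time_steps, 1)           # unnormalized e values

# 1. Attention weights: softmax over the time axis -> (batch, time, 1)
attention_weights = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)

# 2. Element-wise multiply, broadcasting (batch, time, 1) against
#    (batch, time, feat) -> (batch, time, feat)
weighted = attention_weights * features

# 3. Sum over the time axis -> (batch, feat)
context_vector = weighted.sum(axis=1)

assert context_vector.shape == (batch, feat_dim)
# Each weight vector sums to 1, so the sum is a weighted average.
assert np.allclose(attention_weights.sum(axis=1), 1.0)
```

Note that the multiplication in step 2 is broadcasting, not a matrix product, and since the weights sum to 1 over the time axis, the sum in step 3 is exactly the weighted average described above.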
The answer given is incorrect. Let me explain why first, before I share what the actual answer is.
Take a look at the code at the hyperlink provided. The 'hidden size' in the code refers to the dimensions of the hidden state of the decoder, NOT the hidden state(s) of the encoder as the answer above assumes. The multiplication in the code will yield (batch_size, embedding_dim), as the question-framer mg_nt rightly points out. The context is a weighted sum of the encoder outputs and SHOULD have the SAME dimension as the encoder outputs. Mathematically, too, one should NOT get (batch size, hidden size).
Of course, in this case they are using attention over a CNN, so there is no encoder as such; instead the image is broken down into features. These features are collected from the second-to-last layer, and each feature is a specific component of the overall image. The hidden state from the decoder, i.e. the query, 'attends' to all these features and decides which ones are important and need a higher weight to determine the next word in the caption. The features in the above code have last dimension embedding_dim, and hence the context, after being magnified or diminished by the attention weights, will also have shape (batch_size, embedding_dim)!
This is simply a mistake in the comments of the concerned code (the code functionality itself seems right). The shapes mentioned in the comments are incorrect. If you search the code for 'hidden_size' there is no such variable; it is only mentioned in the comments. If you further look at the declarations of the encoder and decoder, they use the same embedding size for both. So the code works, but the comments in it are misleading and incorrect. That is all there is to it.
I am currently setting the vector size using model.add(LSTM(50)), i.e. setting the value of the units argument, but I highly doubt its correctness (in the Keras documentation, units is explained as the dimensionality of the output space). Can anyone help me here?
If by vector size you mean the number of nodes in a layer, then yes, you are doing it right. The output dimensionality of your layer is the same as its number of nodes. The same applies to convolutional layers: the number of filters equals the output dimensionality along the last axis, i.e. the number of channels.
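A quick way to verify this in Keras (the input shape here is an arbitrary placeholder):

```python
import numpy as np
from tensorflow import keras

model = keras.Sequential([
    keras.layers.Input(shape=(4, 2)),  # (timesteps, features), arbitrary
    keras.layers.LSTM(50),             # units=50 -> output vector of size 50
])

x = np.zeros((1, 4, 2), dtype='float32')
out = model.predict(x, verbose=0)
assert out.shape == (1, 50)  # last axis equals units
```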
I checked the Keras documentation for the LSTM layer; the information about the return_state argument is as below:
keras.layers.LSTM(units, return_state=True)
Arguments:
return_state: Boolean. Whether to return the last state in addition to the output.
Output shape
if return_state: a list of tensors. The first tensor is the output. The remaining tensors are the last states, each with shape (batch_size, units)
And that's all the info about return_state for RNNs. As a beginner, it's really hard to understand what exactly "The remaining tensors are the last states, each with shape (batch_size, units)" means, isn't it?
I do understand there is a cell state c and a hidden state a that get passed to the next time step.
But when I did the programming exercise for an online course, I encountered this question. Below is the hint given by the assignment. But I don't understand what these three outputs mean.
from keras.layers import LSTM
LSTM_cell = LSTM(n_a, return_state = True)
a, _, c = LSTM_cell(input_x, initial_state=[a, c])
Someone said, they are respectively (https://machinelearningmastery.com/return-sequences-and-return-states-for-lstms-in-keras/):
1 The LSTM hidden state output for the last time step.
2 The LSTM hidden state output for the last time step (again).
3 The LSTM cell state for the last time step.
I have always regarded output a as the hidden state output of the LSTM, and c as the cell state output. But this person says that the first output is the LSTM output, while the second one is the hidden state output, which differs from the hint given by the online course (the hint uses the first output as the hidden state output for the next time step).
Could anyone tell me more about this?
As a more general question: in cases like this, Keras doesn't give beginner-friendly, understandable documentation or examples. How can I learn Keras more efficiently?
Think about how you would start an iteration of the LSTM. You have a cell state c and an input x, but you also need the previous output h, which is concatenated with x. The LSTM therefore has two state tensors that need to be initialized: c and h. Now h happens to be the output of the previous step, which is why you pass it as input together with c. When you set return_state=True, both c and h are returned. Together with the output, you'll therefore receive 3 tensors.
output, h (hidden state), c (memory/cell state)
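A small Keras check that the first two returned tensors are indeed identical when return_sequences is left at its default of False (the shapes here are arbitrary placeholders):

```python
import numpy as np
from tensorflow import keras

n_a = 3
inp = keras.layers.Input(shape=(4, 2))  # (timesteps, features), arbitrary
output, h, c = keras.layers.LSTM(n_a, return_state=True)(inp)
model = keras.Model(inp, [output, h, c])

x = np.random.randn(1, 4, 2).astype('float32')
out, h_val, c_val = model.predict(x, verbose=0)

# With return_sequences=False the "output" is exactly the last hidden
# state, which is why the first two returned tensors are identical.
assert np.allclose(out, h_val)
assert out.shape == h_val.shape == c_val.shape == (1, n_a)
```

This reconciles the two descriptions in the question: the "LSTM output" and the "hidden state output for the last time step" are the same tensor, returned twice.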
Take the LSTM as an example; you can understand it like this:
c(t) depends on c(t-1);
o(t) depends on x(t) and h(t-1);
h(t) depends on o(t) and c(t).