Context vector shape using Bahdanau Attention - python

I am looking here at the Bahdanau attention class. I noticed that the final shape of the context vector is (batch_size, hidden_size). I am wondering how they got that shape given that attention_weights has shape (batch_size, 64, 1) and features has shape (batch_size, 64, embedding_dim). They multiplied the two (I believe it is a matrix product) and then summed up over the first axis. Where is the hidden size coming from in the context vector?

The context vector resulting from Bahdanau attention is a weighted average of all the hidden states of the encoder. The following image from Ref shows how this is calculated. Essentially, we do the following:
Compute the attention weights, which form a (batch size, encoder time steps, 1) sized tensor
Multiply each hidden state (batch size, hidden size) element-wise with the e values, resulting in (batch size, encoder time steps, hidden size)
Average over the time dimension, resulting in (batch size, hidden size)

The answer given is incorrect. Let me explain why first, before I share what the actual answer is.
Take a look at the concerned code in the hyperlink provided. The 'hidden size' in the code refers to the dimensions of the hidden state of the decoder and NOT the hidden state(s) of the encoder, as the answer above has assumed. The above multiplication in the code will yield (batch_size, embedding_dim), as the question-framer mg_nt rightly points out. The context is a weighted sum of the encoder output and SHOULD have the SAME dimension as the encoder outputs. Mathematically, too, one should NOT get (batch size, hidden size).
Of course, in this case they are using attention over a CNN, so there is no encoder as such; instead, the image is broken down into features. These features are collected from the second-to-last layer, and each feature is a specific component of the overall image. The hidden state from the decoder, i.e. the query, attends to all these features and decides which ones are important and need to be given a higher weight to determine the next word in the caption. The features in the above code have shape (batch_size, 64, embedding_dim), and hence the context shape, after each feature is magnified or diminished by its attention weight and the results are summed, will be (batch_size, embedding_dim)!
This is simply a mistake in the comments of the concerned code (the code functionality itself seems right). The shapes mentioned in the comments are incorrect. If you search the code for 'hidden_size' there is no such variable; it is only mentioned in the comments. If you further look at the declarations of the encoder and decoder, they are using the same embedding size for both. So the code works, but the comments in the code are misleading and incorrect. That is all there is to it.
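To see the shapes concretely, here is a minimal numpy sketch of the weighted sum described above (the dimension names come from the question; the sizes and values are placeholder assumptions):
import numpy as np

batch_size, num_locations, embedding_dim = 2, 64, 256  # placeholder sizes

features = np.random.randn(batch_size, num_locations, embedding_dim)
attention_weights = np.random.rand(batch_size, num_locations, 1)

# broadcasting: (batch, 64, 1) * (batch, 64, embedding_dim) -> (batch, 64, embedding_dim)
context_vector = (attention_weights * features).sum(axis=1)

print(context_vector.shape)  # (2, 256), i.e. (batch_size, embedding_dim)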

Related

How does pointwise multiplication and addition take place in an LSTM?

timesteps=4, features=2 and LSTM(units=3)
The input to the LSTM will be (batch_size, timesteps, features); the dimensions of the hidden state and cell state will be (None, units). When the LSTM takes the input at t1, it gets concatenated with the hidden state along the features axis. After that, it passes through the sigmoid and tanh activations. Now my main confusion is: how does it do the pointwise operations (addition or multiplication) with the cell state?
My understanding is shown in the attached figure. If I have anything wrong, kindly correct me. Thanks in advance to everyone.
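The shapes can be traced with a minimal numpy sketch of a single LSTM cell step under the standard LSTM equations (biases omitted, weights are random placeholders): every gate projects the concatenated (1, 5) vector down to (1, units) = (1, 3), so the pointwise multiplication and addition with the cell state are plain elementwise operations between (1, 3) arrays.
import numpy as np

batch_size, features, units = 1, 2, 3

x_t = np.random.randn(batch_size, features)  # input at timestep t1
h_prev = np.zeros((batch_size, units))       # previous hidden state
c_prev = np.zeros((batch_size, units))       # previous cell state

concat = np.concatenate([h_prev, x_t], axis=-1)  # (1, units + features) = (1, 5)

# one (units + features, units) weight matrix per gate; biases omitted
W_f, W_i, W_c, W_o = (np.random.randn(units + features, units) for _ in range(4))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

f = sigmoid(concat @ W_f)         # forget gate, (1, 3)
i = sigmoid(concat @ W_i)         # input gate, (1, 3)
c_tilde = np.tanh(concat @ W_c)   # candidate cell state, (1, 3)
o = sigmoid(concat @ W_o)         # output gate, (1, 3)

# every operand is (batch_size, units), so * and + are elementwise
c_t = f * c_prev + i * c_tilde    # new cell state, (1, 3)
h_t = o * np.tanh(c_t)            # new hidden state, (1, 3)
print(c_t.shape, h_t.shape)       # (1, 3) (1, 3)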

The input to a CNN using Conv1D

I'm working in the field of machine learning.
To make the network stronger, I'm going to adopt techniques based on Conv1D.
The input data is a one-dimensional list, so I thought Conv1D would be the best choice.
What would happen if the input size is (1, 740)? Would it be okay for the input channel count to be 1?
I mean, I have a feeling that the Conv1D output for a (1, 740) tensor should be the same as that of a simple Linear network.
Of course I'll also include other Conv1d layers, like below.
self.conv1 = torch.nn.Conv1d(in_channels=1, out_channels=64, kernel_size=5)
self.conv2 = torch.nn.Conv1d(in_channels=64, out_channels=64, kernel_size=5)
self.conv3 = torch.nn.Conv1d(in_channels=64, out_channels=64, kernel_size=5)
self.conv4 = torch.nn.Conv1d(in_channels=64, out_channels=64, kernel_size=5)
Would it make sense when the input channel count is 1?
Thanks in advance. :)
I think it's fine.
Note that the input of Conv1D should be (B, N, M), where B is the batch size, N is the number of channels (e.g. 3 for RGB) and M is the length of the signal (the number of features).
The out_channels refers to the number of filters (each of size 5 here, since kernel_size=5) to use. Look at the output shape of the following code:
import torch
from torch import nn

k = nn.Conv1d(1, 64, kernel_size=5)
input = torch.randn(1, 1, 740)  # (batch, channels, width)
print(k(input).shape)  # -> torch.Size([1, 64, 736])
The 736 is the result of not using padding, so the width isn't kept: 740 - 5 + 1 = 736.
The nn.Conv1d layer takes an input of shape (b, c, w) (where b is the batch size, c the number of channels, and w the input width). Its kernel size is one-dimensional. It performs a convolution operation over the input dimension (batch and channel axes aside). This means the kernel applies the same operation over the whole input, like a 'sliding window' (whether 1D, 2D, or 3D). As such, each filter only has kernel_size weights per input channel, no matter how wide the input is. This is the main characteristic of a convolution layer.
Conv1d allows you to extract features from the input regardless of where they are located: at the beginning or at the end of your w-width input. This makes sense when your input is temporal (an input sequence over time) or spatial data (an image).
On the other hand, an nn.Linear takes a vector of w features as input and returns another vector. You could consider w to be the number of input neurons. You would end up having w*output_dim weights. If your input contains components which are independent from one another (like a one/multi-hot encoding), then a fully connected layer as nn.Linear implements would be preferred.
These two behave differently. When using an nn.Linear in scenarios where you should use an nn.Conv1d, training would ideally result in neurons of equal weights, mimicking the shared kernel, if that makes sense... but in practice it won't. Fully-densely-connected layers were used in the past in deep learning for computer vision. Today convolutions are used because they are much more efficient and better suited for these types of tasks.
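To make the efficiency point concrete, here is a small sketch (sizes taken from the question) comparing parameter counts; note that the Conv1d count is independent of the input width:
from torch import nn

w = 740  # input width from the question

conv = nn.Conv1d(in_channels=1, out_channels=64, kernel_size=5)
fc = nn.Linear(in_features=w, out_features=64)

conv_params = sum(p.numel() for p in conv.parameters())
fc_params = sum(p.numel() for p in fc.parameters())

print(conv_params)  # 384 = 64 * (1 * 5) weights + 64 biases; independent of w
print(fc_params)    # 47424 = 64 * 740 weights + 64 biases; grows with w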

How to properly reshape a 3D tensor to 2D for a linear layer, then reshape the new 3D tensor's fibers to correspond to the old 3D tensor

I have a 3D tensor of names that comes out of an LSTM that's of shape (batch size x name length x embedding size)
I've been reshaping it to 2D to put it through a linear layer, because I thought a linear layer requires (batch size, linear dimension size), using the following:
y0 = output.contiguous().view(-1, output.size(-1))
this converts the output to (batch size * name length, embedding size)
then I put y0 through a linear layer and then reshape it back to a 3D using
y = y0.contiguous().view(output.size(0), -1, y0.size(-1))
But I'm not really sure if the fibers of y correspond properly to the cells of output, and I worry this is messing up my learning, because a batch size of 1 actually generates proper names while any larger batch size generates nonsense.
So what I mean exactly is
output = (batch size * name length, embed size)
y = (batch size * name length, number of possible characters)
I need to make sure y[i,j,:] is the linearly transformed version of output[i,j,:]
It seems like you are using an older code example. Just 'comment out' the lines of code where you reshape the tensor as there is no need for them.
This link gives you a bit more explanation: https://discuss.pytorch.org/t/when-and-why-do-we-use-contiguous/47588
Try something like this instead and take the output from the LSTM directly into the linear layer:
output, hidden = self.lstm(x, hidden)
output = self.LinearLayer1(output)  # nn.Linear operates on the last dimension
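Here is a quick sketch (with made-up sizes) verifying that nn.Linear operates on the last dimension of a 3D tensor, so y[i,j,:] really is the transformed version of output[i,j,:]:
import torch
from torch import nn

batch_size, name_length, embed_size, n_chars = 4, 10, 32, 26  # made-up sizes

output = torch.randn(batch_size, name_length, embed_size)  # stand-in for the LSTM output
linear = nn.Linear(embed_size, n_chars)

y = linear(output)  # applied over the last dimension only
print(y.shape)  # torch.Size([4, 10, 26])

i, j = 2, 5
print(torch.allclose(y[i, j], linear(output[i, j])))  # True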

After going through convolution steps, what should be the shape of the tensor in fully connected layer?

So let's assume that I have RGB images of shape [128,128,3], and I want to create a CNN with two Conv-ReLU-MaxPool layers as below.
def cnn(input_data):
    # conv1
    conv1_weight = tf.Variable(tf.truncated_normal([4, 4, 3, 25], stddev=0.1), dtype=tf.float32)
    conv1_bias = tf.Variable(tf.zeros([25]), dtype=tf.float32)
    conv1 = tf.nn.conv2d(input_data, conv1_weight, [1, 1, 1, 1], 'SAME')
    relu1 = tf.nn.relu(tf.nn.bias_add(conv1, conv1_bias))
    max_pool1 = tf.nn.max_pool(relu1, [1, 2, 2, 1], [1, 1, 1, 1], 'SAME')
    # conv2
    conv2_weight = tf.Variable(tf.truncated_normal([4, 4, 25, 50], stddev=0.1), dtype=tf.float32)
    conv2_bias = tf.Variable(tf.zeros([50]), dtype=tf.float32)
    conv2 = tf.nn.conv2d(max_pool1, conv2_weight, [1, 1, 1, 1], 'SAME')
    relu2 = tf.nn.relu(tf.nn.bias_add(conv2, conv2_bias))
    max_pool2 = tf.nn.max_pool(relu2, [1, 2, 2, 1], [1, 1, 1, 1], 'SAME')
After this step, I need to transform the output into a 1xN vector for the next fully connected layer. However, I am not sure how I should determine what N is. Is there a specific formula involving the layer sizes, strides, max pool size, image size, etc.? I am pretty lost in this phase of the problem even though I think I get the intuition behind a CNN.
I understand that you want to transform the multiple 2D feature maps that come out of the last convolutional/pooling layer to a vector that can be fed into a fully-connected layer. Or to be precise and include the batch dimension, go from shape [batch, width, height, feature_maps] to [batch, N].
The above already implies that N = width * height * feature_maps, since reshaping keeps the overall number of elements the same (the batch dimension stays separate). width and height depend on the size of your inputs and the strides of your network layers (convolution and/or pooling).
A stride of x simply divides the size by x. You have inputs of size 128 in each dimension and two pooling layers. Note that as written your tf.nn.max_pool calls pass strides of [1,1,1,1], which would leave the size at 128x128; I'll assume you meant [1,2,2,1], i.e. stride 2, which is the usual choice. Then after the first pooling layer your images are 64x64 and after the second they are 32x32, so width = height = 32. Normally we would have to account for padding as well, but the point of SAME padding is precisely that we don't have to worry about that.
Finally, feature_maps is 50 since that is how many filters your last convolutional layer has (pooling doesn't modify this). So N = 32*32*50 = 51200.
Thus, you should be able to do tf.reshape(max_pool2, [-1, 51200]) (or tf.reshape(max_pool2, [-1, 32*32*50]) to keep it more interpretable) and feed the resulting 2D tensor through a fully-connected layer (i.e. tf.matmul).
The simplest way would be to just use tf.layers.flatten(max_pool2). This function does all the above for you and just gives you the [batch, N] result.
First of all, since you are starting out, I would recommend Keras instead of pure TensorFlow. And to answer your question regarding the shape, refer to this blog by Andrej Karpathy.
Quote from the blog:
We can compute the spatial size of the output volume as a function of the input volume size (W), the receptive field size of the Conv Layer neurons (F), the stride with which they are applied (S), and the amount of zero padding used (P) on the border. You can convince yourself that the correct formula for calculating how many neurons “fit” is given by (W−F+2P)/S+1. For example for a 7x7 input and a 3x3 filter with stride 1 and pad 0 we would get a 5x5 output. With stride 2 we would get a 3x3 output.
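The formula from the quote is easy to sanity-check in a few lines of Python:
def conv_output_size(W, F, S=1, P=0):
    # number of neurons that "fit": (W - F + 2P) / S + 1
    return (W - F + 2 * P) // S + 1

print(conv_output_size(7, 3, S=1, P=0))  # 5, as in the quote
print(conv_output_size(7, 3, S=2, P=0))  # 3, as in the quote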
Now coming to your TensorFlow implementation:
For the conv1 stage you have given 25 filters of size 4*4 (with input depth 3). Since you have used padding="SAME" and stride 1 for both conv1 and maxpooling1, your output 2D spatial dimensions will be the same as the input in both cases. That is, after conv1 your output size is 128*128*25. For the same reason the output of your maxpool1 layer is also 128*128*25 (note that your pooling strides are [1,1,1,1]; with stride 2 the size would be halved). Since you have also given padding "SAME" and stride 1 for conv2, your output shape is 128*128*50 (you changed the output channels). Thus after maxpool2 your dimensions are (batch_size, 128, 128, 50). Before adding a Dense layer you have 3 major options:
1) Flattening the tensor results in a shape: (batch_size, 128*128*50)
2) Global average pooling results in a shape: (batch_size, 50)
3) Global max pooling also results in a shape: (batch_size, 50)
Note:
A global average pooling layer is similar to average pooling, but we average the entire feature map instead of a window, hence the name global. For example, in your case you have (batch_size, 128, 128, 50) as your dimensions. This means you have 50 feature maps with spatial dimensions 128*128. What global average pooling does is average each 128*128 feature map down to a single number, so you will have 50 values in total. This is very useful in designing fully convolutional architectures like Inception, ResNet etc., because it makes the network's input size generic, meaning you can send any image size as input to the network. Global max pooling is very similar, but the slight difference is that it takes the max value of each feature map instead of the average.
Problems with this architecture:
Generally a maxpooling layer is expected to reduce the spatial dimensions, but with strides of [1,1,1,1] (and padding = "SAME") yours do not. If you look at the source code of VGG16 you will see that after each block (conv, relu and maxpooling with stride 2) the input size is halved. The general structure is to reduce the spatial dimensions while increasing the depth/channels.
Flattening the layer:
var_name = tf.layers.flatten(max_pool2)
Should work, and it's what almost every example of a Tensorflow CNN uses.

How to interpret clearly the meaning of the units parameter in Keras?

I am wondering how LSTMs work in Keras. In this tutorial, for example, as in many others, you can find something like this:
model.add(LSTM(4, input_shape=(1, look_back)))
What does the "4" mean? Is it the number of neurons in the layer? By neuron, I mean something that gives a single output for each instance.
Actually, I found this brilliant discussion but wasn't really convinced by the explanation mentioned in the reference given.
On the scheme, one can see num_units illustrated, and I think I am not wrong in saying that each of these units is a very atomic LSTM unit (i.e. with the 4 gates). However, how are these units connected? If I am right (but I'm not sure), x_(t-1) is of size nb_features, so each feature would be an input of a unit, and num_units must be equal to nb_features, right?
Now, let's talk about Keras. I have read this post and the accepted answer and am having trouble. Indeed, the answer says:
Basically, the shape is like (batch_size, timespan, input_dim), where input_dim can be different from the unit
In which case? This is hard to reconcile with the previous reference...
Moreover, it says,
LSTM in Keras only define exactly one LSTM block, whose cells is of unit-length.
Okay, but how do I define a full LSTM layer? Is it the input_shape that implicitly creates as many blocks as the number of time_steps (which, as I understand it, is the first element of the input_shape parameter in my piece of code)?
Thanks for enlightening me.
EDIT: would it also be possible to detail clearly how to reshape data of, say, size (n_samples, n_features) for a stateful LSTM model? How should one deal with time_steps and batch_size?
First, units in LSTM is NOT the number of time_steps.
Each LSTM cell (present at a given time_step) takes in an input x and forms a hidden state vector a; the length of this hidden state vector is what is called the units in (Keras) LSTM.
You should keep in mind that there is only one RNN cell created by the code
keras.layers.LSTM(units, activation='tanh', …… )
and the RNN operations are repeated Tx times by the class itself.
I've linked this to help you understand it better with some very simple code.
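For instance, a minimal tf.keras sketch (look_back is a placeholder value here) shows that the 4 ends up as the size of the output vector, not the number of timesteps:
from tensorflow import keras

look_back = 10  # placeholder; the tutorial defines its own value
model = keras.Sequential([keras.layers.LSTM(4, input_shape=(1, look_back))])
print(model.output_shape)  # (None, 4): one hidden state vector of length 4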
You can (sort of) think of it exactly as you think of fully connected layers. Units are neurons.
The dimension of the output is the number of neurons, as with most of the well known layer types.
The difference is that in LSTMs, these neurons will not be completely independent of each other, they will intercommunicate due to the mathematical operations lying under the cover.
Before going further, it might be interesting to take a look at this very complete explanation about LSTMs, their inputs/outputs and the usage of stateful=True/False: Understanding Keras LSTMs. Notice that your input shape should be input_shape=(look_back, 1). The input shape is (time_steps, features).
While this is a series of fully connected layers:
hidden layer 1: 4 units
hidden layer 2: 4 units
output layer: 1 unit
This is a series of LSTM layers:
Where the input shape is (batch_size, arbitrary_steps, 3)
Each LSTM layer will keep reusing the same units/neurons over and over until all the arbitrary timesteps in the input are processed.
The output will have shape:
(batch, arbitrary_steps, units) if return_sequences=True.
(batch, units) if return_sequences=False.
The memory states will have a size of units.
The inputs carried over from the last step will have a size of units.
To be really precise, there will be two groups of units, one working on the raw inputs, the other working on already processed inputs coming from the last step. Due to the internal structure, each group will have a number of weights 4 times bigger than the number of units, one set per internal gate (this 4 is not related to the image, it's fixed).
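That factor of 4 can be checked by counting parameters in a throwaway tf.keras layer (3 input features as in the picture and 4 units are assumed; this is a sketch, not code from the question):
from tensorflow import keras

units, features = 4, 3
model = keras.Sequential([keras.layers.LSTM(units, input_shape=(None, features))])

# 4 gates, each with a kernel (features x units), a recurrent kernel
# (units x units) and a bias (units)
expected = 4 * (features * units + units * units + units)
print(model.count_params(), expected)  # 128 128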
Flow (a code sketch of this stack follows the list):
Takes an input with n steps and 3 features
Layer 1:
For each time step in the inputs:
Uses 4 units on the inputs to get a size 4 result
Uses 4 recurrent units on the outputs of the previous step
Outputs the last (return_sequences=False) or all (return_sequences = True) steps
output features = 4
Layer 2:
Same as layer 1
Layer 3:
For each time step in the inputs:
Uses 1 unit on the inputs to get a size 1 result
Uses 1 unit on the outputs of the previous step
Outputs the last (return_sequences=False) or all (return_sequences = True) steps
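Assuming tf.keras, the stack described above could be written like this (None stands in for the arbitrary number of steps):
from tensorflow import keras

model = keras.Sequential([
    keras.layers.LSTM(4, return_sequences=True, input_shape=(None, 3)),  # layer 1
    keras.layers.LSTM(4, return_sequences=True),                         # layer 2
    keras.layers.LSTM(1),                                                # layer 3
])
model.summary()  # per-layer outputs: (None, None, 4) -> (None, None, 4) -> (None, 1)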
The number of units is the size (length) of the internal vector states h and c of the LSTM. That is, no matter the shape of the input, it is upscaled (by a dense transformation) by the various kernels for the i, f, and o gates (and the cell candidate). The details of how the resulting latent features are transformed into h and c are described in the linked post. In your example, the input shape of data
(batch_size, timesteps, input_dim)
will be transformed to
(batch_size, timesteps, 4)
if return_sequences is true; otherwise only the last h will be emitted, making it (batch_size, 4). I would recommend using a much higher latent dimension, perhaps 128 or 256, for most problems.
I would put it this way - there are 4 LSTM "neurons" or "units", each with 1 cell state and 1 hidden state for each timestep they process. So for an input of 1 timestep, you will have 4 cell state values, 4 hidden state values and 4 outputs.
Actually the correct way to say this is: for a one-timestep input you get 1 cell state (a vector of size 4), 1 hidden state (a vector of size 4) and 1 output (a vector of size 4).
So if you feed in a timeseries with 20 steps, you will have 20 (intermediate) cell states, each of size 4. That is because the inputs in an LSTM are processed sequentially, one after the other. Similarly, you will have 20 hidden states, each of size 4.
Usually, your output will be the output of the LAST step (a vector of size 4). However, in case you want the output of each intermediate step (remember you have 20 timesteps to process), you can set return_sequences=True, in which case you will have 20 vectors of size 4, each telling you what the output was as each of those steps got processed, as those 20 inputs came one after the other.
If you also set return_state=True, you get the last hidden state of size 4 and the last cell state of size 4.
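A short tf.keras sketch confirming these shapes for the 20-step example (note the argument is spelled return_state in Keras):
from tensorflow import keras

inputs = keras.Input(shape=(20, 1))  # 20 timesteps, 1 feature
outputs, last_h, last_c = keras.layers.LSTM(
    4, return_sequences=True, return_state=True)(inputs)

print(outputs.shape)  # (None, 20, 4): one size-4 output per timestep
print(last_h.shape)   # (None, 4): the last hidden state
print(last_c.shape)   # (None, 4): the last cell state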
