Taking the last state from BiLSTM (BiGRU) in PyTorch - python

After reading several articles, I am still quite confused about the correctness of my implementation for getting the last hidden states from a BiLSTM. These are the sources I have read:
Understanding Bidirectional RNN in PyTorch (TowardsDataScience)
PackedSequence for seq2seq model (PyTorch forums)
What's the difference between “hidden” and “output” in PyTorch LSTM? (StackOverflow)
Select tensor in a batch of sequences (PyTorch forums)
The approach from the last source (4) seems the cleanest to me, but I am still uncertain whether I understood the thread correctly. Am I using the right final hidden states from the LSTM and the reversed LSTM? This is my implementation:
# pos contains indices of words in the embedding matrix
# seqlengths contains info about sequence lengths;
# for instance, if batch_size is 2, pos = [4, 6, 9, 3, 1] and
# seqlengths = [3, 2], we have a batch with two samples of
# variable length: [4, 6, 9] and [3, 1]
all_in_embs = self.in_embeddings(pos)
in_emb_seqs = pack_sequence(torch.split(all_in_embs, seqlengths, dim=0))
output, lasthidden = self.rnn(in_emb_seqs)
if not self.data_processor.use_gru:
    lasthidden = lasthidden[0]  # nn.LSTM returns (h_n, c_n); keep h_n
# u_emb_batch has shape batch_size x embedding_dimension:
# sum the last states from the forward and backward directions
u_emb_batch = lasthidden[-1, :, :] + lasthidden[-2, :, :]
Is it correct?

In the general case, if you want to create your own BiLSTM network, you need to create two regular LSTMs, feed one with the regular input sequence, and feed the other with the inverted input sequence. After you finish feeding both sequences, you take the last states from both nets and tie them together (sum or concatenate).
As I understand it, you are using the built-in BiLSTM, as in this example (setting bidirectional=True in the nn.LSTM constructor). You then get the concatenated output after feeding the batch, as PyTorch handles all the hassle for you.
If that is the case, and you want to sum the hidden states, then you have to use
u_emb_batch = (lasthidden[0, :, :] + lasthidden[1, :, :])
assuming you have only one layer. If you have more layers, your variant seems better.
This is because the result is structured (see documentation):
h_n of shape (num_layers * num_directions, batch, hidden_size): tensor containing the hidden state for t = seq_len
By the way,
u_emb_batch_2 = output[-1, :, :HIDDEN_DIM] + output[-1, :, HIDDEN_DIM:]
should provide the same result.
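As a quick shape check of that h_n layout, here is a minimal sketch (the sizes are illustrative assumptions, not from the original post):
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=16, hidden_size=32, num_layers=2, bidirectional=True)
x = torch.randn(7, 3, 16)  # (seq_len, batch, input_size)
output, (h_n, c_n) = lstm(x)
print(output.shape)  # torch.Size([7, 3, 64]): (seq_len, batch, num_directions * hidden_size)
print(h_n.shape)     # torch.Size([4, 3, 32]): (num_layers * num_directions, batch, hidden_size)
# For one layer, h_n[0] / h_n[1] are the forward / backward last states;
# with more layers, the last layer's states are h_n[-2] (forward) and h_n[-1] (backward).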

Here's a detailed explanation for those working with unpacked sequences:
output is of shape (seq_len, batch, num_directions * hidden_size) (see documentation). This means that the outputs of the forward and backward passes of your GRU are concatenated along the 3rd dimension.
Assuming batch=2 and hidden_size=256 in your example, you can easily separate the outputs of both forward and backward passes by doing:
output = output.view(-1, 2, 2, 256) # (seq_len, batch_size, num_directions, hidden_size)
output_forward = output[:, :, 0, :] # (seq_len, batch_size, hidden_size)
output_backward = output[:, :, 1, :] # (seq_len, batch_size, hidden_size)
(Note: the -1 tells PyTorch to infer that dimension from the others. See this question.)
Equivalently, you can use the torch.chunk function on the original output of shape (seq_len, batch, num_directions * hidden_size):
# Split in 2 tensors along dimension 2 (num_directions)
output_forward, output_backward = torch.chunk(output, 2, 2)
Now you can torch.gather the last hidden state of the forward pass using seqlengths (after reshaping it), and get the last hidden state of the backward pass by selecting the element at position 0:
# First we unsqueeze seqlengths twice so it has the same number
# of dimensions as output_forward:
# (batch_size) -> (1, batch_size, 1)
lengths = seqlengths.unsqueeze(0).unsqueeze(2)
# Then we expand it accordingly:
# (1, batch_size, 1) -> (1, batch_size, hidden_size)
lengths = lengths.expand((1, -1, output_forward.size(2)))
last_forward = torch.gather(output_forward, 0, lengths - 1).squeeze(0)
last_backward = output_backward[0, :, :]
Note that I subtracted 1 from lengths because of the 0-based indexing.
At this point, both last_forward and last_backward are of shape (batch_size, hidden_size).
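Putting it all together, here is a minimal self-contained sketch (the layer sizes, the GRU itself, and seqlengths are illustrative assumptions, not from the original post):
import torch
import torch.nn as nn

seq_len, batch_size, input_size, hidden_size = 5, 2, 32, 256
gru = nn.GRU(input_size, hidden_size, bidirectional=True)
x = torch.randn(seq_len, batch_size, input_size)
seqlengths = torch.tensor([5, 3])  # true lengths of the two (padded) sequences

output, h_n = gru(x)  # output: (seq_len, batch, 2 * hidden_size)
output = output.view(seq_len, batch_size, 2, hidden_size)
output_forward, output_backward = output[:, :, 0, :], output[:, :, 1, :]

lengths = seqlengths.unsqueeze(0).unsqueeze(2).expand(1, -1, hidden_size)
last_forward = torch.gather(output_forward, 0, lengths - 1).squeeze(0)
# Caveat: without packing or masking, the backward pass has also consumed
# the padding positions before reaching index 0.
last_backward = output_backward[0, :, :]
print(last_forward.shape, last_backward.shape)  # both: (batch_size, hidden_size)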

I tested the biLSTM output and h_n:
# biLSTM is assumed to be a single-layer nn.LSTM with
# bidirectional=True and batch_first=True
# shape of x is (batch_size, time_steps, input_size)
# shape of output is (batch_size, time_steps, hidden_size * num_directions)
# shape of h_n is (num_directions, batch_size, hidden_size)
output, (h_n, _c_n) = biLSTM(x)
print('first step (element) of output from reverse == h_n from reverse?',
      torch.equal(output[:, 0, hidden_size:], h_n[1]))
print('last step (element) of output from reverse == h_n from reverse?',
      torch.equal(output[:, -1, hidden_size:], h_n[1]))
Output:
first step (element) of output from reverse == h_n from reverse? True
last step (element) of output from reverse == h_n from reverse? False
This confirmed that the h_n of the reverse direction is the hidden state of the first time step.
So, if you really need the hidden state of the last time step from both the forward and reverse directions, you should use:
sum_lasthidden = output[:, -1, :hidden_size] + output[:, -1, hidden_size:]
not
h_n[0,:,:] + h_n[1,:,:]
As h_n[1,:,:] is the hidden state of the first time step from the reverse direction.
So the answer from @igrinis,
u_emb_batch = (lasthidden[0, :, :] + lasthidden[1, :, :])
is not correct, as h_n[1,:,:] is the hidden state of the first time step from the reverse direction.
Bear in mind, though, that in theory the reverse direction's hidden state at the last time step only contains information from the last time step of the sequence, since that is where the backward pass starts.

Related

Use of PyTorch permute in RCNN

I am looking at an implementation of RCNN for text classification using PyTorch. Full Code. There are two points where the dimensions of tensors are permuted using the permute function. The first is after the LSTM layer and before tanh. The second is after a linear layer and before a max pooling layer.
Could you please explain why the permutation is necessary or useful?
Relevant Code
def forward(self, x):
    # x.shape = (seq_len, batch_size)
    embedded_sent = self.embeddings(x)
    # embedded_sent.shape = (seq_len, batch_size, embed_size)
    lstm_out, (h_n, c_n) = self.lstm(embedded_sent)
    # lstm_out.shape = (seq_len, batch_size, 2 * hidden_size)
    input_features = torch.cat([lstm_out, embedded_sent], 2).permute(1, 0, 2)
    # input_features.shape = (batch_size, seq_len, embed_size + 2*hidden_size)
    linear_output = self.tanh(
        self.W(input_features)
    )
    # linear_output.shape = (batch_size, seq_len, hidden_size_linear)
    linear_output = linear_output.permute(0, 2, 1)  # reshaping for max_pool
    max_out_features = F.max_pool1d(linear_output, linear_output.shape[2]).squeeze(2)
    # max_out_features.shape = (batch_size, hidden_size_linear)
    max_out_features = self.dropout(max_out_features)
    final_out = self.fc(max_out_features)
    return self.softmax(final_out)
Similar Code in other Repositories
Similar implementations of RCNN use permute or transpose. Here are examples:
https://github.com/prakashpandey9/Text-Classification-Pytorch/blob/master/models/RCNN.py
https://github.com/jungwhank/rcnn-text-classification-pytorch/blob/master/model.py
What permute does is rearrange the dimensions of the original tensor according to the ordering you specify. Note that permute is different from reshape: with permute, the elements of the tensor follow the index order you provide, whereas with reshape they do not.
Example code:
import torch

var = torch.randn(2, 4)
pe_var = var.permute(1, 0)
re_var = torch.reshape(var, (4, 2))
print("Original size:\n{}\nOriginal var:\n{}\n".format(var.size(), var) +
      "Permute size:\n{}\nPermute var:\n{}\n".format(pe_var.size(), pe_var) +
      "Reshape size:\n{}\nReshape var:\n{}\n".format(re_var.size(), re_var))
Outputs:
Original size:
torch.Size([2, 4])
Original var:
tensor([[ 0.8250, -0.1984,  0.5567, -0.7123],
        [-1.0503,  0.0470, -1.9473,  0.9925]])
Permute size:
torch.Size([4, 2])
Permute var:
tensor([[ 0.8250, -1.0503],
        [-0.1984,  0.0470],
        [ 0.5567, -1.9473],
        [-0.7123,  0.9925]])
Reshape size:
torch.Size([4, 2])
Reshape var:
tensor([[ 0.8250, -0.1984],
        [ 0.5567, -0.7123],
        [-1.0503,  0.0470],
        [-1.9473,  0.9925]])
With the role of permute in mind, we can see that the first permute reorders the concatenated tensor so that it fits the input format of self.W, i.e. with batch as the first dimension. The second permute does a similar thing: we want to max-pool linear_output along the sequence, and F.max_pool1d pools along the last dimension.
I am adding this answer to provide additional PyTorch-specific details.
It is necessary to use permute between nn.LSTM and nn.Linear because the output shape of LSTM does not correspond to the expected input shape of Linear.
nn.LSTM outputs output, (h_n, c_n). The output tensor has shape (seq_len, batch, num_directions * hidden_size) (see the nn.LSTM docs). nn.Linear expects an input tensor of shape (N, *, H_in), where N is the batch size and H_in is the number of input features (see the nn.Linear docs).
It is necessary to use permute between nn.Linear and nn.MaxPool1d because the output of nn.Linear is (N, L, C), where N is the batch size, C is the number of features, and L is the sequence length, while nn.MaxPool1d expects an input tensor of shape (N, C, L) (see the nn.MaxPool1d docs).
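To make the two shape handoffs concrete, here is a minimal sketch (the dimension sizes and names are illustrative assumptions, not the RCNN's actual values):
import torch
import torch.nn as nn
import torch.nn.functional as F

batch, seq_len, feats, hidden_linear = 4, 10, 48, 64
lstm_out = torch.randn(seq_len, batch, feats)     # (L, N, C), as nn.LSTM emits it
linear = nn.Linear(feats, hidden_linear)

x = lstm_out.permute(1, 0, 2)                     # (N, L, C), as nn.Linear expects
x = torch.tanh(linear(x))                         # (N, L, hidden_linear)
x = x.permute(0, 2, 1)                            # (N, C, L), as nn.MaxPool1d expects
pooled = F.max_pool1d(x, x.shape[2]).squeeze(2)   # (N, hidden_linear)
print(pooled.shape)                               # torch.Size([4, 64])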
I reviewed seven implementations of RCNN for text classification with PyTorch on GitHub and Gitee and found that permute and transpose are the usual ways to convert the output of one layer into the input of the next.

Which axis does Keras SimpleRNN / LSTM use as the temporal axis by default?

When using a SimpleRNN or LSTM for classical sentiment analysis algorithms (applied here to sentences of length <= 250 words/tokens):
model = Sequential()
model.add(Embedding(5000, 32, input_length=250)) # Output shape: (None, 250, 32)
model.add(SimpleRNN(100)) # Output shape: (None, 100)
model.add(Dense(1, activation='sigmoid')) # Output shape: (None, 1)
where is it specified which axis of the input of the RNN is used as the "temporal" axis?
To be more precise, after the Embedding layer, a given input sentence, e.g. "the cat sat on the mat", is encoded into a matrix x of shape (250, 32), where 250 is the max length (in words) of the input text, and 32 the dimension of the embedding. Then, where in Keras is it specified if this will be used:
h[t] = activation( W_h * x[:, t] + U_h * h[t-1] + b_h )
or this:
h[t] = activation( W_h * x[t, :] + U_h * h[t-1] + b_h )
(In both cases, y[t] = activation( W_y * h[t] + b_y ))
TL;DR: if an input for a RNN Keras layer is of size, say, (250, 32), which axis does it use as the temporal axis by default? Where is this detailed in the Keras or Tensorflow documentation?
PS: how do we explain the number of parameters (given by model.summary()), which is 13300? W_h has 100x32 coefficients, U_h has 100x100, and b_h has 100x1, so we already have 13300! There are no coefficients left for W_y and b_y. How can this be explained?
Temporal axis: it's always dim 1 (with inputs shaped (batch_size, timesteps, channels)), unless time_major=True, in which case it's dim 0; the Embedding layer outputs a 3D tensor. This can be seen here, where step_input_shape is the shape of the input fed to the RNN cell at each step of the recurrent loop. For your case, timesteps=250, and the SimpleRNN cell "sees" a tensor shaped (batch_size, 32) at each step.
# of params: you can see how the figure's derived by inspecting each layer's .build() code: Embedding, SimpleRNN, Dense, or likewise calling .weights on each layer. For your case, w/ l = model.layers[1]:
l.weights[0].shape == (32, 100) --> 3200 params (kernel)
l.weights[1].shape == (100, 100) --> 10000 params (recurrent_kernel)
l.weights[2].shape == (100,) --> 100 params (bias) (sum: 13,300)
Computation logic: there is no W_y or b_y; the "y" is actually the hidden state h, for all recurrent layers. What you cite is likely from generic RNN formulae, so the "in both cases..." statement is false; to see what's actually happening, inspect the .call() code.
P.S. I recommend defining the full batch_shape of the model for debugging, as it eliminates the ambiguous None shapes.
SimpleRNN formula vs. code: as requested; note the h in source code is misleading, and is typically z in formulae ("pre-activation").
return_sequences=True -> outputs for all timesteps are returned: (batch_size, timesteps, channels)
return_sequences=False -> only the last timestep's output is returned: (batch_size, channels). See here
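A quick sketch to verify the weight shapes and the 13,300 figure (layer sizes are copied from the question; the dummy call is only there to build the model):
import numpy as np
import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Embedding, SimpleRNN, Dense

model = Sequential([
    Embedding(5000, 32, input_length=250),  # (None, 250, 32): time is axis 1
    SimpleRNN(100),                         # (None, 100)
    Dense(1, activation='sigmoid'),         # (None, 1)
])
model(np.zeros((1, 250), dtype='int32'))    # build with a dummy batch
rnn = model.layers[1]
print([tuple(w.shape) for w in rnn.weights])
# expected: [(32, 100), (100, 100), (100,)] -> 3200 + 10000 + 100 = 13300
model.summary()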

Keras Bidirectional LSTM returns two outputs (forward + backward); which step maps to which row in the output tensor?

Using the following Keras bidirectional LSTM:
bilstm = tf.keras.layers.Bidirectional(
    layer=tf.keras.layers.LSTM(units=128, return_sequences=True),
    merge_mode=None  # return separate forward and backward outputs
)
x_forward, x_backward = bilstm(inputs)
x_forward and x_backward are now tensors of shape [batch_size, seq_length, 128]. In x_forward, I presume that x_forward[:, 0, :] maps to the output for sequence step 0, and so on.
I'm unclear about x_backward. Does x_backward[:, 0, :] map to the first sequence step or the last sequence step?
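One way to settle this empirically is to also request the final states and see which row of x_backward the backward state matches. A minimal sketch (sizes are illustrative, and I am assuming the wrapper's usual behavior of returning [out_fw, out_bw, h_fw, c_fw, h_bw, c_bw] when return_state=True):
import numpy as np
import tensorflow as tf

bilstm = tf.keras.layers.Bidirectional(
    tf.keras.layers.LSTM(units=8, return_sequences=True, return_state=True),
    merge_mode=None
)
x = np.random.randn(2, 5, 3).astype('float32')  # (batch, seq_len, features)
y_fw, y_bw, h_fw, c_fw, h_bw, c_bw = bilstm(x)

# h_bw is the backward LSTM's final state, i.e. the one computed at input
# step 0. If this prints True, then x_backward[:, 0, :] corresponds to input
# step 0 (Keras re-reverses the backward outputs into input order).
print(np.allclose(y_bw[:, 0, :], h_bw))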

Tensorflow: How to add bias to outputs from RNN where the sequences have varying length

First let me explain the input and target values of the RNN. My dataset consists of sequences (e.g. 4, 7, 1, 23, 42, 69). The RNN is trained to predict the next value in each sequence, so all values except the last are inputs, and all values except the first are target values. Each value is represented as a 1-HOT vector.
I have an RNN in TensorFlow where the outputs from the RNN (tf.dynamic_rnn) are sent through a feedforward layer. The input sequences have varying length, so I use the sequence_length parameter to specify the length of each sequence in a batch. The output from the RNN layer is a tensor of outputs, one for each timestep. Most sequences have the same length, but some are shorter. When a shorter sequence is sent through, I get additional all-zero vectors (as padding).
The problem is that I want to send the output from the RNN layer through a feedforward layer. If I add bias in this feedforward layer, then the additional all-zero vectors become non-zero. With no bias, only weights, this works fine, since the all-zero vectors are not affected by multiplication. So without bias, I can set the target vectors as all-zero as well, and thus they will not affect the backward pass. But if bias is added, I don't know what to put in the padded/dummy target vectors.
So the network looks like this:
[INPUT (1-HOT vectors, one vector for each value in the sequence)]
V
[GRU layer (smaller size than the input layer)]
V
[Feedforward layer (outputs vectors of the same size as the input)]
And here is the code:
# x: [batch_size, max_sequence_length, size of 1-HOT vectors]
x = tf.placeholder(tf.float32, [None, max_length, n_classes])
y = tf.placeholder(tf.int32, [None, max_length, n_classes])
session_length = tf.placeholder(tf.int32, [None])

outputs, state = rnn.dynamic_rnn(
    rnn_cell.GRUCell(n_hidden),
    x,
    dtype=tf.float32,
    sequence_length=session_length
)

layer = {'weights': tf.Variable(tf.random_normal([n_hidden, n_classes])),
         'biases': tf.Variable(tf.random_normal([n_classes]))}

# Flatten to apply the same weights to all timesteps
outputs = tf.reshape(outputs, [-1, n_hidden])
prediction = tf.matmul(outputs, layer['weights'])  # + layer['biases']
error = tf.nn.softmax_cross_entropy_with_logits(prediction, y)
You can add the bias, but mask out the non-relevant sequence elements from the loss function.
See an example from the im2txt project:
weights = tf.to_float(tf.reshape(self.input_mask, [-1]))  # these are the masks
# Compute losses.
losses = tf.nn.sparse_softmax_cross_entropy_with_logits(logits, targets)
batch_loss = tf.div(tf.reduce_sum(tf.mul(losses, weights)),
                    tf.reduce_sum(weights),
                    name="batch_loss")  # the irrelevant sequence elements are masked out here
Also, for generating the mask, see the function batch_with_dynamic_pad in the same project, under ops/inputs.
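Applied to the code in the question, the same masking idea might look like this (a sketch against the same old TF 1.x API; prediction, y, session_length, max_length, and n_classes are the names from the question):
# Build a (batch_size, max_length) float mask from the true lengths, then
# flatten it so it lines up with the flattened per-timestep losses.
mask = tf.sequence_mask(session_length, max_length, dtype=tf.float32)
mask = tf.reshape(mask, [-1])

losses = tf.nn.softmax_cross_entropy_with_logits(
    logits=prediction,
    labels=tf.cast(tf.reshape(y, [-1, n_classes]), tf.float32))
# Padded positions contribute zero; normalize by the number of real elements.
masked_loss = tf.reduce_sum(losses * mask) / tf.reduce_sum(mask)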

Tensorflow Grid LSTM RNN TypeError

I'm trying to build an LSTM RNN that handles 3D data in TensorFlow. From this paper, Grid LSTM RNNs can be n-dimensional. The idea for my network is to have a 3D volume [depth, x, y], and the network should be [depth, x, y, n_hidden], where n_hidden is the number of LSTM cell recursive calls. The idea is that each pixel gets its own "string" of LSTM recursive calls.
The output should be [depth, x, y, n_classes]. I'm doing a binary segmentation -- think foreground and background, so the number of classes is just 2.
# Network parameters
n_depth = 5
n_input_x = 200
n_input_y = 200
n_hidden = 128  # hidden layer num of features
n_classes = 2   # binary segmentation: foreground vs. background

# tf Graph input
x = tf.placeholder("float", [None, n_depth, n_input_x, n_input_y])
y = tf.placeholder("float", [None, n_depth, n_input_x, n_input_y, n_classes])

# Define and initialize weights
weights = {}
biases = {}
for i in xrange(n_depth * n_input_x * n_input_y):
    weights[i] = tf.Variable(tf.random_normal([n_hidden, n_classes]))
    biases[i] = tf.Variable(tf.random_normal([n_classes]))
def RNN(x, weights, biases):
    # Prepare data shape to match `rnn` function requirements
    # Current data input shape: (batch_size, n_input_y, n_input_x)
    # Permuting batch_size and n_input_y
    x = tf.reshape(x, [-1, n_input_y, n_depth * n_input_x])
    x = tf.transpose(x, [1, 0, 2])
    # Reshaping to (n_input_y * batch_size, n_input_x)
    x = tf.reshape(x, [-1, n_input_x * n_depth])
    # Split to get a list of 'n_input_y' tensors of shape (batch_size, n_hidden)
    # This input shape is required by the `rnn` function
    x = tf.split(0, n_depth * n_input_x * n_input_y, x)

    # Define a grid LSTM cell with TensorFlow
    lstm_cell = grid_rnn_cell.GridRNNCell(n_hidden, input_dims=[n_depth, n_input_x, n_input_y])
    # lstm_cell = rnn_cell.MultiRNNCell([lstm_cell] * 12, state_is_tuple=True)
    # lstm_cell = rnn_cell.DropoutWrapper(lstm_cell, output_keep_prob=0.8)
    outputs, states = rnn.rnn(lstm_cell, x, dtype=tf.float32)

    # Linear activation, using the rnn inner loop's last output
    output = []
    for i in xrange(n_depth * n_input_x * n_input_y):
        # some sort of reshape may be needed here on outputs[i]
        output.append(tf.matmul(outputs[i], weights[i]) + biases[i])
    return output
pred = RNN(x, weights, biases)
pred = tf.transpose(tf.pack(pred), [1, 0, 2])
pred = tf.reshape(pred, [-1, n_depth, n_input_x, n_input_y, n_classes])
temp_pred = tf.reshape(pred, [-1, n_classes])
temp_y = tf.reshape(y, [-1, n_classes])  # flatten targets to match temp_pred
cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(temp_pred, temp_y))
Currently I'm getting the error: TypeError: unsupported operand type(s) for +: 'int' and 'NoneType'.
It occurs after the RNN initialization: outputs, states = rnn.rnn(lstm_cell, x, dtype=tf.float32)
x, of course, is of type float32.
I am unable to tell what type GridRNNCell returns; any help here? This could be the issue. Should I be defining more arguments to this? input_dims makes sense, but what should output_dims be?
Is this a bug in the contrib code?
GridRNNCell is located in contrib/grid_rnn/python/ops/grid_rnn_cell.py
I was unsure about some of the implementation decisions in the code, so I decided to roll my own. One thing to keep in mind is that this is an implementation of just the cell. It is up to you to build the actual machinery that handles the locations and interactions of the h and m vectors; it isn't as simple as passing in your data and expecting it to traverse the dimensions properly.
So for example, if you are working in two dimensions, start with the top left block, take the incoming x and y vectors, concat them together, then use your cell to compute the output (which includes outgoing vectors for both x and y); and it is up to you to store the output for later use in neighboring blocks. Pass those outputs individually to each corresponding dimension, and in each of those neighboring blocks, concat the incoming vectors (again, for each dimension) and compute the output for the neighboring blocks. To do this, you'll need two for-loops, one for each dimension.
Perhaps the version in contrib will work for this, but a couple problems I have with it (I could be wrong here, but as far as I can tell):
1) The vectors are handled using concat and slice rather than with tuples. This will likely result in slower performance.
2) It looks like the input is projected at each step, which doesn't sit well with me. In the paper, they only project into the network for incoming blocks along the edge of the grid, not throughout.
If you look at the code, it is actually very simple. Perhaps reading the paper and making adjustments to the code as needed, or rolling your own, is your best bet. And remember that the cell is only good for performing the recurrence at each step, not for managing the incoming and outgoing h and m vectors.
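To make the two-loop traversal concrete, here is a rough sketch of the 2D case with a stand-in cell function (everything here, including the cell's internals, is an illustrative assumption rather than the contrib API):
import numpy as np

def cell(x_in, y_in):
    # Stand-in for a 2D grid LSTM cell: it consumes the incoming hidden
    # vectors along each dimension and returns the outgoing ones.
    h = np.tanh(x_in + y_in)  # placeholder computation
    return h, h  # outgoing vectors for the x and y dimensions

H, W, D = 4, 4, 8              # grid height, width, hidden size
h_x = np.zeros((H, W + 1, D))  # incoming vectors along x
h_y = np.zeros((H + 1, W, D))  # incoming vectors along y

# Two nested loops, one per dimension, starting at the top-left block.
for i in range(H):
    for j in range(W):
        out_x, out_y = cell(h_x[i, j], h_y[i, j])
        h_x[i, j + 1] = out_x  # stored for the right-hand neighbor
        h_y[i + 1, j] = out_y  # stored for the neighbor below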
Which version of the Grid LSTM cells are you using?
If you are using https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/rnn/python/ops/rnn_cell.py
I think you can try to initialize 'feature_size' and 'frequency_skip'.
Also, I think there may exist another bug: feeding a dynamic shape into this version may cause a TypeError.
Yes, dynamic shape was the cause. There is a PR to fix this: https://github.com/tensorflow/tensorflow/pull/4631
@jstaker7: Thank you for trying it out. Re. problem 1, the above PR uses tuples for states and outputs; hopefully it can address the performance issue. GridRNNCell was created a while ago, when all the LSTMCells in TensorFlow were using concat/slice instead of tuples.
Re. problem 2, GridRNNCell will not project the input if you pass None. A dimension can be both input and recurrent, and when there is no input (inputs = None), it will use the recurrent tensors for the computation. We can also use 2 input dimensions by instantiating the GridRNNCell directly.
Of course, writing a generic class for all cases makes the code look a bit convoluted, and I think it needs better documentation.
Anyway, it would be great if you could share your improvements, or any ideas you might have to make it clearer/more useful. That is the nature of an open-source project, anyway.
