I am trying to implement BiLSTM-Max as described in the following paper:
Supervised Learning of Universal Sentence Representations from Natural Language Inference Data.
I am using TensorFlow for my implementation. I started from an original LSTM implementation and have successfully modified it to handle dynamic-length input and to run bidirectionally (i.e., a dynamic Bi-LSTM).
# Bi-LSTM; returns a list of n_step tensors of shape [batch_size, 2*n_hidden], plus the final states
outputs, _, _ = tf.contrib.rnn.static_bidirectional_rnn(lstm_fw_cell, lstm_bw_cell, x, dtype=tf.float32)
# Stack and transpose to [batch_size, n_step, 2*n_hidden]
outputs = tf.transpose(tf.stack(outputs), [1, 0, 2])
# Retrieve the last output corresponding to the length of each input sequence
batch_size_ = tf.shape(outputs)[0]
index = tf.range(0, batch_size_) * seq_max_len + (seqlen - 1)
outputs = tf.gather(tf.reshape(outputs, [-1, 2*n_hidden]), index)
Next, to turn it into Bi-LSTM-Max, I replaced taking the last output with taking the max across n_steps, as follows:
# Bi-LSTM; returns a list of n_step tensors of shape [batch_size, 2*n_hidden], plus the final states
outputs, _, _ = tf.contrib.rnn.static_bidirectional_rnn(lstm_fw_cell, lstm_bw_cell, x, dtype=tf.float32)
# Stack and transpose to [batch_size, n_step, 2*n_hidden]
outputs = tf.transpose(tf.stack(outputs), [1, 0, 2])
# Retrieve the max output across n_steps
outputs = tf.reduce_max(outputs, reduction_indices=[1])
When I took the max across the n_steps dimension, I had assumed that the entries at indices >= seqlen are 0s, so I could take the max across the entire dimension instead of only from 0 to seqlen. Upon closer inspection, I realised that the values at these unassigned indices may be non-zero, either due to random initialization or because they simply hold the last value assigned in memory.
This operation is trivial with Python arrays, but I can't find an easy way to do it with tensor operations. Does anyone have an idea for this?
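Probably the easiest thing to do would be to manually set the invalid outputs to zero or -∞ before finding the maximum (-∞ is the safer choice, since valid LSTM outputs can be negative and a zero would then win the max). You can do that quite easily with tf.sequence_mask and tf.where: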
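# Boolean mask of shape [batch_size, seq_max_len], True for valid time steps
seq_mask = tf.sequence_mask(seqlen, seq_max_len)
# Repeat the mask along the feature axis so it matches the shape of outputs
seq_mask = tf.tile(tf.expand_dims(seq_mask, -1), [1, 1, 2 * n_hidden])
# You can also use e.g. -np.inf * tf.ones_like(outputs)
outputs_masked = tf.where(seq_mask, outputs, tf.zeros_like(outputs))
outputs = tf.reduce_max(outputs_masked, axis=1)  # axis is preferred to reduction_indices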
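To make the masking step concrete, here is a minimal NumPy sketch of the same idea, not the TensorFlow code itself. The function name masked_max_over_time and the sample values are made up for illustration, and it assumes every sequence has length at least 1.

```python
import numpy as np

def masked_max_over_time(outputs, seqlen):
    """Max over the time axis, ignoring steps at or beyond each sequence's length.

    outputs: float array of shape [batch_size, n_step, n_features]
    seqlen:  int array of shape [batch_size], each entry in 1..n_step (assumed)
    """
    n_step = outputs.shape[1]
    # mask[b, t] is True for valid steps (t < seqlen[b])
    mask = np.arange(n_step)[None, :] < np.asarray(seqlen)[:, None]
    # Broadcast the mask over the feature axis and fill padded steps with -inf
    masked = np.where(mask[:, :, None], outputs, -np.inf)
    return masked.max(axis=1)

# Hypothetical data: the 9.0 entries sit in padded positions and must be ignored
outputs = np.array([[[-0.5, 0.2], [-0.1, 0.9], [0.7, 0.8]],
                    [[-0.3, -0.4], [9.0, 9.0], [9.0, 9.0]]])
seqlen = np.array([2, 1])
result = masked_max_over_time(outputs, seqlen)
```

Filling with -inf rather than 0 keeps the result correct when all valid values of a feature are negative, as in the second sequence here.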
I have a 4D tensor output from an object detector that produces per-pixel, per-class box predictions, i.e. of shape H x W x C x 6, where the innermost 6-wide dimension holds the box parameters for that class. When computing the loss, I want to update only the predictions for the ground truth class. To do this, I'd like to have a tensor of shape H x W whose elements are the ground truth class indices. This tensor is then used to extract only the relevant class from the input, producing a tensor of shape H x W x 6. I know this should be possible using gather or gather_nd, but I can't get the parameters right to get the desired output. I'm also confused about the purpose of the batch_dims parameter of gather_nd, which may be relevant to solving this. Any suggestions on how I can properly use these or some other tf function to achieve this result?
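As a reference for the intended result, here is a small NumPy sketch of the selection; the helper name select_class_boxes and the sample shapes are assumptions for illustration. In TensorFlow, tf.gather(preds, gt_class, axis=2, batch_dims=2) should give the same H x W x 6 result, because batch_dims=2 tells gather to treat the leading H and W axes of both tensors as matched pairs rather than taking every combination, but treat that call as something to verify against your TensorFlow version.

```python
import numpy as np

def select_class_boxes(preds, gt_class):
    """preds: [H, W, C, 6]; gt_class: int [H, W] -> boxes of shape [H, W, 6]."""
    h, w = gt_class.shape
    rows = np.arange(h)[:, None]   # shape [H, 1]
    cols = np.arange(w)[None, :]   # shape [1, W]
    # Advanced indexing over the first three axes; the trailing box axis is kept
    return preds[rows, cols, gt_class]

# Hypothetical sizes and values, only to show the shapes involved
H, W, C = 2, 3, 4
preds = np.arange(H * W * C * 6, dtype=np.float32).reshape(H, W, C, 6)
gt_class = np.array([[0, 1, 2], [3, 0, 1]])
boxes = select_class_boxes(preds, gt_class)
```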
I have two arrays: the 1st is a nested array of floats (which we will call the value array) and the 2nd is a 1d array of floats (which we will call the key array). The goal is to map each element within the value array to the numerically closest value on the key array.
To give some background, I am trying to map the weights of a CNN to discrete weights as part of a simulation project. The shape of the weights depends on the layer and the network definition. In this particular case, I am working with a tf.keras.applications.ResNet50V2 network with CIFAR-10 as the dataset, whose weights range from 3D to 1D. The weights are returned as nested lists, with each index indicating the layer. The number of elements in the value array is very large when completely flattened.
I currently have a working solution which I have included below, but I am wondering if anyone could think of any further optimizations. I keep getting warnings about the callback class taking longer than the actual training. This is a function that should be executed at the end of each training batch, so a little optimization can go a long way.
for ii in range(valueArray.size):
    # Flatten array to 1D
    flatArr = valueArray[ii].flatten()
    # Using searchsorted since our discrete values have been sorted
    idx = np.searchsorted(keyArray, flatArr, side="left")
    # Clip any values that exceed array indices
    np.clip(idx, 0, keyArray.size - 1, out=idx)
    flatMinVal = keyArray[idx]
    # Get bool array of idx that have values to the left (idx > 0)
    hasValLeft = idx > 0
    # Ensures that closerOnLeft is a bool array of equal size as the original
    # (copied so that hasValLeft is not modified below)
    closerOnLeft = hasValLeft.copy()
    # Check if abs value for right is greater than left (acts on values with idx > 0)
    closerOnLeftSub = np.abs(flatArr[hasValLeft] - keyArray[idx[hasValLeft]]) > \
        np.abs(flatArr[hasValLeft] - keyArray[idx[hasValLeft] - 1])
    # Only assign values that have a value on the left, else always false
    closerOnLeft[hasValLeft] = closerOnLeftSub
    # If left element is closer, use that as state
    flatMinVal[closerOnLeft] = keyArray[idx[closerOnLeft] - 1]
    # Return reshaped values
    valueArray[ii] = np.reshape(flatMinVal, valueArray[ii].shape)
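One possible simplification, offered as a sketch rather than a measured speed-up: search against the midpoints between neighbouring keys instead of the keys themselves, so a single searchsorted call returns the index of the nearest key directly and no flattening or left/right comparison is needed. The name snap_to_nearest and the sample data are made up for illustration, and keys is assumed to be sorted and 1-D.

```python
import numpy as np

def snap_to_nearest(values, keys):
    """Map every element of `values` to the closest entry of sorted 1-D `keys`."""
    keys = np.asarray(keys)
    # Decision boundaries halfway between neighbouring keys
    midpoints = (keys[:-1] + keys[1:]) / 2.0
    # side="right" sends exact ties to the right-hand key, like the loop above
    idx = np.searchsorted(midpoints, values, side="right")
    # Fancy indexing keeps the shape of `values`, so no reshape is required
    return keys[idx]

# Hypothetical key array and two "layers" of different rank
keys = np.array([-1.0, 0.0, 0.5, 2.0])
layers = [np.array([[-3.0, -0.6], [0.2, 0.3]]), np.array([1.0, 5.0])]
snapped = [snap_to_nearest(w, keys) for w in layers]
```

The midpoints only depend on the key array, so they could be computed once outside the callback and reused for every batch.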
What is the difference between attn_mask and key_padding_mask in PyTorch's MultiheadAttention?
key_padding_mask – if provided, specified padding elements in the key will be ignored by the attention. When given a binary mask and a value is True, the corresponding value on the attention layer will be ignored. When given a byte mask and a value is non-zero, the corresponding value on the attention layer will be ignored
attn_mask – 2D or 3D mask that prevents attention to certain positions. A 2D mask will be broadcasted for all the batches while a 3D mask allows to specify a different mask for the entries of each batch.
Thanks in advance.
The key_padding_mask is used to mask out positions that are padding, i.e., after the end of the input sequence. It is always specific to the input batch and depends on how long the sequences in the batch are compared to the longest one. It is a 2D tensor of shape batch size × input length.
On the other hand, attn_mask says which query-key pairs are valid. In a Transformer decoder, a triangular mask is used to simulate inference time and prevent attending to "future" positions. This is what attn_mask is usually used for. If it is a 2D tensor, the shape is input length × input length. You can also have a mask that is specific to every item in a batch. In that case, you can use a 3D tensor of shape (batch size × num heads) × input length × input length. (So, in theory, you can simulate key_padding_mask with a 3D attn_mask.)
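To illustrate the two shapes, here is a small NumPy sketch that builds a padding mask and a causal mask and combines them. It only mimics the boolean convention (True means "do not attend") and is not PyTorch code; the lengths and sizes are made-up example values.

```python
import numpy as np

lengths = np.array([3, 1])   # true lengths of the two sequences in the batch
seq_len = 3                  # padded length of the batch

# Padding mask, shape [batch, seq_len]: True marks a padded key position
key_padding_mask = np.arange(seq_len)[None, :] >= lengths[:, None]

# Causal mask, shape [seq_len, seq_len]: True above the diagonal blocks "future" keys
attn_mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)

# Union of both, shape [batch, seq_len, seq_len]: a per-item 3D mask
combined = attn_mask[None, :, :] | key_padding_mask[:, None, :]
```

The padding mask varies per batch item but is the same for every query, while the causal mask varies per query but is the same for every batch item; the 3D union carries both.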
I think they work the same way: both masks define which attention between query and key will not be used. The only difference between the two is the shape in which you find it more convenient to supply the mask.
According to the code, the two masks are merged (their union is taken), so they play the same role: deciding which attention between query and key will not be used. Because the union is taken, the two masks can hold different values if you really need both, or you can pass your mask through whichever argument has the more convenient shape. Here is part of the original code from pytorch/functional.py, around line 5227, in the function multi_head_attention_forward():
# merge key padding and attention masks
if key_padding_mask is not None:
    assert key_padding_mask.shape == (bsz, src_len), \
        f"expecting key_padding_mask shape of {(bsz, src_len)}, but got {key_padding_mask.shape}"
    key_padding_mask = key_padding_mask.view(bsz, 1, 1, src_len). \
        expand(-1, num_heads, -1, -1).reshape(bsz * num_heads, 1, src_len)
    if attn_mask is None:
        attn_mask = key_padding_mask
    elif attn_mask.dtype == torch.bool:
        attn_mask = attn_mask.logical_or(key_padding_mask)
    else:
        attn_mask = attn_mask.masked_fill(key_padding_mask, float("-inf"))
# so here only the merged/unioned mask is used to actually compute the attention
attn_output, attn_output_weights = _scaled_dot_product_attention(q, k, v, attn_mask, dropout_p)
Please correct me if you have different opinions or I am wrong.
What I am trying to do is have a weight matrix for my neural network which grows in size (i.e. a neuron is added to it each iteration). However, I do not want to create a new tf.Variable each time, as this would waste memory by copying the values of the previous matrix rather than expanding the matrix itself.
I have seen that people use tf.assign with validate_shape set to False. However, this does not change the recorded shape of the variable correctly. I believed this was a bug, but the TensorFlow GitHub maintainers did not seem to agree (I don't understand why from their reply).
Below is a simplified example of the problem. x is the matrix that I want to expand so that it can be added to z. If anyone knows a solution to what I am trying to achieve here I would be very grateful =)
import tensorflow as tf
import numpy as np
# Initialise some variables
sess = tf.Session()
x = tf.Variable(tf.truncated_normal([2, 4], stddev = 0.04))
z = tf.Variable(tf.truncated_normal([3, 4], stddev = 0.04))
sess.run(tf.variables_initializer([x, z]))
# Enlarge the matrix by assigning it a new set of values
sess.run(tf.assign(x, tf.concat((x, tf.cast(tf.truncated_normal([1, 4], stddev = 0.04), tf.float32)), 0), validate_shape=False))
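# Print shapes of matrices; notice that x's actual shape is different from the
# shape TensorFlow has recorded for it
print(x.get_shape())           # static shape recorded by TensorFlow
print(sess.run(tf.shape(x)))   # actual shape of the stored value
# Add two matrices with equal shapes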
print(tf.add(x, z).eval(session=sess))
Note: I realize that if I initialized z to the shape (2, 4) and then expanded it with tf.assign (as I do with x) the above example will work. But due to another constraint, I cannot control the original shape of z.
Tensors in TensorFlow are immutable, so you can't resize them easily.
You can attempt to pad with 0's and then access parts of the matrix with tf.gather(), as shown in "How to select rows from a 3-D Tensor in TensorFlow?", to obtain the "submatrix" within the larger padded matrix. This, however, does not seem to be an easy or elegant solution.
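The padding idea can be sketched in plain NumPy as follows. This is only an illustration of the bookkeeping, under the assumption that a maximum number of rows is known in advance; the class name GrowableMatrix and its methods are hypothetical. In TensorFlow the buffer would be a fixed-shape variable and the active rows would be selected with a slice or tf.gather.

```python
import numpy as np

class GrowableMatrix:
    """Pre-allocated buffer whose first `n_rows` rows form the 'real' matrix."""

    def __init__(self, max_rows, n_cols):
        # Allocate once at full capacity; unused rows stay zero (the padding)
        self.buffer = np.zeros((max_rows, n_cols), dtype=np.float32)
        self.n_rows = 0

    def append_row(self, row):
        if self.n_rows >= self.buffer.shape[0]:
            raise ValueError("capacity exceeded")
        # Write in place; no existing values are copied
        self.buffer[self.n_rows] = row
        self.n_rows += 1

    def active(self):
        # A view of the rows in use, not a copy
        return self.buffer[:self.n_rows]

w = GrowableMatrix(max_rows=3, n_cols=4)
w.append_row(np.ones(4))
w.append_row(2 * np.ones(4))
shape_before = w.active().shape
w.append_row(3 * np.ones(4))
shape_after = w.active().shape
```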
I am attempting to implement an RNN and have output predictions p_y of shape (batch_size, time_points, num_classes). I also have a target_output of shape (batch_size, time_points), where the value at a given index of target_output is an integer denoting the class (a value between 0 and num_classes-1). How can I index p_y with target_output to get the probabilities of the given classes, which I need in order to compute the cross-entropy?
I'm not even sure how to do this in numpy. The expression p_y[target_output] does not give the desired results.
You need to use advanced indexing (search for "advanced indexing" here). But Theano advanced indexing behaves differently to numpy so knowing how to do this in numpy may not be all that helpful!
Here's a function which does this for my setup, but note that the order of my dimensions differs from yours. I use (time points, batch_size, num_classes). This also assumes you want to use the 1-of-N categorical cross-entropy variant. You may not want sequence length padding either.
import theano.tensor

def categorical_crossentropy_3d(coding_dist, true_dist, lengths):
    # Zero out the false probabilities and sum the remaining true probabilities to remove the third dimension.
    indexes = theano.tensor.arange(coding_dist.shape[2])
    mask = theano.tensor.neq(indexes, true_dist.reshape((true_dist.shape[0], true_dist.shape[1], 1)))
    predicted_probabilities = theano.tensor.set_subtensor(coding_dist[theano.tensor.nonzero(mask)], 0.).sum(axis=2)
    # Pad short sequences with 1's (the pad locations are implicitly correct!)
    indexes = theano.tensor.arange(predicted_probabilities.shape[0]).reshape((predicted_probabilities.shape[0], 1))
    mask = indexes >= lengths
    predicted_probabilities = theano.tensor.set_subtensor(predicted_probabilities[theano.tensor.nonzero(mask)], 1.)
    return -theano.tensor.log(predicted_probabilities)
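For comparison, the plain NumPy indexing that the question asks about can be sketched as below, using the question's (batch_size, time_points, num_classes) layout. The function name true_class_probabilities and the sample numbers are made up for illustration, and this sketch does no sequence-length padding.

```python
import numpy as np

def true_class_probabilities(p_y, target_output):
    """p_y: [batch, time, classes]; target_output: int [batch, time] -> [batch, time]."""
    batch_idx = np.arange(p_y.shape[0])[:, None]   # shape [batch, 1]
    time_idx = np.arange(p_y.shape[1])[None, :]    # shape [1, time]
    # The three index arrays broadcast to [batch, time] and pick one class per cell
    return p_y[batch_idx, time_idx, target_output]

# Hypothetical predictions for 2 sequences, 2 time points, 3 classes
p_y = np.array([[[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]],
                [[0.3, 0.3, 0.4], [0.5, 0.25, 0.25]]])
target_output = np.array([[0, 1], [2, 0]])
probs = true_class_probabilities(p_y, target_output)
cross_entropy = -np.log(probs)
```

p_y[target_output] fails to do this because a single index array only indexes the first axis; one index array per axis is needed to pair each (batch, time) cell with its class.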