Masking: Mask everything after a specified token (eos)

My tgt tensor has shape [12, 32, 1], i.e. (sequence_length, batch_size, token_idx).
What is the best way to create a mask which has ones for entries with <eos> and before in sequence, and zeros afterwards?
Currently I'm calculating my mask like this, which simply puts zeros where <blank> is, ones otherwise.
mask = torch.zeros_like(tgt).masked_scatter_((tgt != tgt_padding), torch.ones_like(tgt))
The problem is that my tgt can also contain <blank> before <eos>, and in those cases I don't want to mask it out.
My temporary solution:
mask = torch.ones_like(tgt)
for eos_token in (tgt == tgt_eos).nonzero():
    mask[eos_token[0]+1:, eos_token[1]] = 0

I guess you are trying to create a mask for the PAD tokens. There are several ways. One of them is as follows.
# tensor is of shape [seq_len, batch_size, 1]
tensor = tensor.mul(tensor.ne(PAD).float())
Here, PAD stands for the index of the PAD_TOKEN. tensor.ne(PAD) creates a mask tensor (a byte tensor in older PyTorch releases, a bool tensor in recent ones) that is 0 at PAD_TOKEN positions and 1 elsewhere.
If you have examples like "<s> I think <pad> so </s> <pad> <pad>", I would suggest using different PAD tokens before and after </s>.
Or, if you have the length of each sentence (in the above example, the sentence length is 6), you can create the mask using the following function.
import torch

def sequence_mask(lengths, max_len=None):
    """
    Creates a boolean mask from sequence lengths.
    :param lengths: 1d tensor [batch_size]
    :param max_len: int
    :return: tensor [batch_size, max_len], True for valid positions (0 for pad positions)
    """
    batch_size = lengths.numel()
    # Convert to a plain int so it can be passed to torch.arange
    max_len = max_len or int(lengths.max())
    return (torch.arange(0, max_len, device=lengths.device)
            .type_as(lengths)
            .repeat(batch_size, 1)
            .lt(lengths.unsqueeze(1)))
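If the lengths are not known in advance, the mask can also be derived directly from the EOS positions without a Python loop. The following is a minimal sketch, assuming tgt has shape [seq_len, batch_size, 1] as in the question; the function name mask_through_eos and the argument eos_idx are hypothetical names chosen for illustration.

```python
import torch

def mask_through_eos(tgt, eos_idx):
    """Return 1 for positions up to and including the first EOS, 0 afterwards."""
    is_eos = (tgt == eos_idx).long()
    # Count the EOS tokens that occur strictly before each position
    # along the sequence dimension (dim 0).
    eos_before = is_eos.cumsum(dim=0) - is_eos
    # Keep every position that has no earlier EOS.
    return (eos_before == 0).to(tgt.dtype)

# Toy batch: token 2 plays the role of <eos>, token 0 the role of <blank>.
tgt = torch.tensor([[5, 5], [0, 2], [2, 7], [0, 0]]).unsqueeze(-1)
mask = mask_through_eos(tgt, eos_idx=2)
```

A <blank> that appears before <eos> stays unmasked, and a sequence without any <eos> keeps a mask of all ones, which matches the behaviour of the loop in the question.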

Related

Efficient mapping of values from one Numpy array to the closest value on another

I have two arrays: the 1st is a nested array of floats (which we will call the value array) and the 2nd is a 1d array of floats (which we will call the key array). The goal is to map each element within the value array to the numerically closest value on the key array.
To give some background, I am trying to map the weights of a CNN to discrete weights as part of a simulation project. The shape of the weights depends on the layer and network definition. In this particular case, I am working with a tf.keras.applications.ResNet50V2 network with CIFAR-10 as the dataset, which has weights that go from 3D to 1D. The weights are returned as nested lists with each index indicating the layer. The number of elements within the value array is very large when completely flattened.
I currently have a working solution which I have included below, but I am wondering if anyone could think of any further optimizations. I keep getting warnings about the callback class taking longer than the actual training. This is a function that should be executed at the end of each training batch, so a little optimization can go a long way.
for ii in range(valueArray.size):
    # Flatten array to 1D
    flatArr = valueArray[ii].flatten()
    # Using searchsorted since our discrete values have been sorted
    idx = np.searchsorted(keyArray, flatArr, side="left")
    # Clip any values that exceed array indices
    np.clip(idx, 0, keyArray.size - 1, out=idx)
    flatMinVal = keyArray[idx]
    # Get bool array of idx that have values to the left (idx > 0)
    hasValLeft = idx > 0
    # Ensures that closer on left is a bool array of equal size as the original
    closerOnLeft = hasValLeft
    # Check if abs value for right is greater than left (Acts on values with idx > 0)
    closerOnLeftSub = np.abs(flatArr[hasValLeft] - keyArray[idx[hasValLeft]]) > \
        np.abs(flatArr[hasValLeft] - keyArray[idx[hasValLeft]-1])
    # Only assign values that have a value on the left, else always false
    closerOnLeft[hasValLeft] = closerOnLeftSub
    # If left element is closer, use that as state
    flatMinVal[closerOnLeft] = keyArray[idx[closerOnLeft]-1]
    # Return reshaped values
    valueArray[ii] = np.reshape(flatMinVal, valueArray[ii].shape)
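One possible simplification, offered as a sketch rather than a measured speed-up: np.searchsorted accepts arrays of any shape, so the flatten and reshape steps can be skipped, and clipping the index to at least 1 removes the separate "has a value on the left" bookkeeping. The helper name snap_to_nearest is hypothetical, and it assumes the key array is sorted and has at least two entries.

```python
import numpy as np

def snap_to_nearest(values, keys):
    """Map each element of `values` to the closest entry of sorted 1-D `keys`."""
    idx = np.searchsorted(keys, values, side="left")
    # Force 1 <= idx <= len(keys) - 1 so both neighbours always exist.
    idx = np.clip(idx, 1, keys.size - 1)
    left = keys[idx - 1]
    right = keys[idx]
    # Step back to the left neighbour only when it is strictly closer,
    # so ties resolve to the right as in the loop above.
    pick_left = (values - left) < (right - values)
    idx = idx - pick_left.astype(idx.dtype)
    return keys[idx]

keys = np.array([0.0, 1.0, 2.0])
layer = np.array([[-1.0, 0.4], [0.6, 5.0]])
snapped = snap_to_nearest(layer, keys)
```

For a nested list of per-layer weights, the helper could be applied layer by layer, for example [snap_to_nearest(w, keys) for w in layers].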

What is the difference between attn_mask and key_padding_mask in MultiheadAttention

What is the difference between attn_mask and key_padding_mask in PyTorch's MultiheadAttention?
key_padding_mask – if provided, specified padding elements in the key will be ignored by the attention. When given a binary mask and a value is True, the corresponding value on the attention layer will be ignored. When given a byte mask and a value is non-zero, the corresponding value on the attention layer will be ignored
attn_mask – 2D or 3D mask that prevents attention to certain positions. A 2D mask will be broadcasted for all the batches while a 3D mask allows to specify a different mask for the entries of each batch.
Thanks in advance.

The key_padding_mask is used to mask out positions that are padding, i.e., after the end of the input sequence. This is always specific to the input batch and depends on how long the sequences in the batch are compared to the longest one. It is a 2D tensor of shape batch size × input length.
On the other hand, attn_mask says which query-key pairs are valid. In a Transformer decoder, a triangular mask is used to simulate inference time and prevent attending to "future" positions. This is what attn_mask is usually used for. If it is a 2D tensor, the shape is input length × input length. You can also have a mask that is specific to every item in a batch. In that case, you can use a 3D tensor of shape (batch size × num heads) × input length × input length. (So, in theory, you can simulate key_padding_mask with a 3D attn_mask.)
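To make the two shapes concrete, here is a small sketch that passes both masks to torch.nn.MultiheadAttention for self-attention. The sizes and the lengths tensor are made-up toy values, and the module is used with its default [seq_len, batch, embed] input layout.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
seq_len, batch_size, embed_dim, num_heads = 4, 2, 8, 2
mha = nn.MultiheadAttention(embed_dim, num_heads)
x = torch.randn(seq_len, batch_size, embed_dim)

# key_padding_mask: [batch_size, seq_len], True marks padded keys.
lengths = torch.tensor([4, 2])  # toy sequence lengths
key_padding_mask = torch.arange(seq_len)[None, :] >= lengths[:, None]

# attn_mask: [seq_len, seq_len], True marks query-key pairs that are not allowed.
# Here: a causal (upper triangular) mask shared by the whole batch.
attn_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)

out, weights = mha(x, x, x, key_padding_mask=key_padding_mask, attn_mask=attn_mask)
# weights: [batch_size, seq_len, seq_len], averaged over heads
```

In the returned weights, the second sequence puts zero weight on its two padded key positions, and every sequence puts zero weight on keys that lie in the "future" of the query.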

I think they work the same way: both masks define which attention weights between query and key will not be used. The only difference is the shape in which it is more convenient to supply the mask.
According to the code, the two masks are merged (their union is taken), so they play the same role: deciding which attention between query and key will not be used. Because the union is taken, the two mask inputs can hold different values if you really need two masks, or you can pass your mask through whichever argument has the more convenient required shape. Here is part of the original code from pytorch/functional.py, around line 5227, in the function multi_head_attention_forward():
...
# merge key padding and attention masks
if key_padding_mask is not None:
    assert key_padding_mask.shape == (bsz, src_len), \
        f"expecting key_padding_mask shape of {(bsz, src_len)}, but got {key_padding_mask.shape}"
    key_padding_mask = key_padding_mask.view(bsz, 1, 1, src_len). \
        expand(-1, num_heads, -1, -1).reshape(bsz * num_heads, 1, src_len)
    if attn_mask is None:
        attn_mask = key_padding_mask
    elif attn_mask.dtype == torch.bool:
        attn_mask = attn_mask.logical_or(key_padding_mask)
    else:
        attn_mask = attn_mask.masked_fill(key_padding_mask, float("-inf"))
...
# so here only the merged/unioned mask is used to actually compute the attention
attn_output, attn_output_weights = _scaled_dot_product_attention(q, k, v, attn_mask, dropout_p)
Please correct me if you have different opinions or I am wrong.

Use tf.gather to extract tensors row-wise based on another tensor row-wisely (first dimension)

I have two tensors, A with shape [B,3000,3] and C with shape [B,4000]. I want to use tf.gather() with each row of C as the indices and the corresponding row of A as the params, to get a result of shape [B,4000,3].
Here is an example to make this clearer. Say I have the tensors
A = [[1,2,3],[4,5,6],[7,8,9]],
C = [0,2,1,2,1],
result = [[1,2,3],[7,8,9],[4,5,6],[7,8,9],[4,5,6]],
which I get by using tf.gather(A,C). This works fine for tensors with fewer than 3 dimensions.
But in the case described at the beginning, applying tf.gather(A,C,axis=1) gives a result tensor of shape
[B,B,4000,3]
It seems that tf.gather() uses every element of C as an index into every batch entry of A. The only solution I can think of is a for loop that calls tf.gather(A[i,...],C[i,...]) for each i, but that would severely hurt performance. The result should have shape
[B,4000,3]
Is there a function that can do this directly?

You need to use tf.gather_nd:
import tensorflow as tf
A = ... # B x 3000 x 3
C = ... # B x 4000
s = tf.shape(C)
B, cols = s[0], s[1]
# Make indices for first dimension
idx = tf.tile(tf.expand_dims(tf.range(B, dtype=C.dtype), 1), [1, cols])
# Complete index for gather_nd
gather_idx = tf.stack([idx, C], axis=-1)
# Gather result
result = tf.gather_nd(A, gather_idx)
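To see what the (batch index, row index) pairs built above select, the same lookup can be written with NumPy advanced indexing. This is only an illustration with small made-up shapes, not TensorFlow code.

```python
import numpy as np

B, n_rows, n_idx = 2, 5, 4
A = np.arange(B * n_rows * 3).reshape(B, n_rows, 3)   # stands in for [B, 3000, 3]
C = np.array([[0, 4, 1, 1], [3, 3, 0, 2]])            # stands in for [B, 4000]

# Pair every index in row b of C with batch index b, as gather_idx does above.
batch_idx = np.arange(B)[:, None]    # [B, 1], broadcasts against C
result = A[batch_idx, C]             # [B, n_idx, 3]
```

In TensorFlow releases where tf.gather accepts a batch_dims argument, tf.gather(A, C, batch_dims=1) should express the same per-row lookup without building the index tensor by hand.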

Finding max across Tensor of specific indices. (Bi-LSTM-max implementation)

I am trying to implement BiLSTM-Max as described in the following paper:
Supervised Learning of Universal Sentence Representations from Natural Language Inference Data.
I am using Tensorflow for my implementation. I started off with an original LSTM code but have successfully made modifications so that it can run with dynamic length input and also bidirectional (i.e Dynamic Bi-LSTM).
# Bi-LSTM, returns output of shape [n_step, batch_size, n_input]
outputs = tf.contrib.rnn.static_bidirectional_rnn(lstm_fw_cell, lstm_bw_cell, x,dtype=tf.float32)
# Change output back to [batch_size, n_step, n_input]
outputs = tf.transpose(tf.stack(outputs), [1, 0, 2])
# Retrieve the last output corresponding the length of input sequence
batch_size_ = tf.shape(outputs)[0]
index = tf.range(0, batch_size_) * seq_max_len + (seqlen - 1)
outputs = tf.gather(tf.reshape(outputs, [-1, 2*n_hidden]), index)
Next, to modify it into Bi-LSTM-Max, I replaced taking the last output with finding the max across n_steps, as follows:
# Bi-LSTM, returns output of shape [n_step, batch_size, n_input]
outputs = tf.contrib.rnn.static_bidirectional_rnn(lstm_fw_cell, lstm_bw_cell, x,dtype=tf.float32)
# Change output back to [batch_size, n_step, n_input]
outputs = tf.transpose(tf.stack(outputs), [1, 0, 2])
# Retrieve the max output across n_steps
outputs = tf.reduce_max(outputs, reduction_indices=[1])
When I took the max across the n_steps dimensions, I had assumed that those indices>seqlen are 0s, so I could take the max across the entire dimension instead of taking max from 0 to seqlen. Upon closer inspection, I realised that the values of the non assigned indices may be non-zero due to random initialization or it may just be the last assigned value in memory.
This operation is trivial with Python arrays, but I can't find an easy way to do it with Tensor operations. Does anyone have an idea for this?

Probably the easiest thing to do would be to manually set the invalid outputs to zero or -∞ before finding the maximum. You can do that quite easily with tf.sequence_mask and tf.where:
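seq_mask = tf.sequence_mask(seqlen, seq_max_len)  # [batch_size, n_step]
# Repeat the mask along the feature axis so it has the same shape as outputs
seq_mask = tf.tile(tf.expand_dims(seq_mask, -1), [1, 1, tf.shape(outputs)[2]])
# You can also use e.g. -np.inf * tf.ones_like(outputs)
outputs_masked = tf.where(seq_mask, outputs, tf.zeros_like(outputs))
outputs = tf.reduce_max(outputs_masked, axis=1) # axis is preferred to reduction_indices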
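The effect of the masking can be checked on a tiny example. The sketch below uses NumPy rather than TensorFlow, and fills invalid steps with -inf so that padded values can never win the maximum, even when all valid outputs are negative. The function name masked_max is hypothetical.

```python
import numpy as np

def masked_max(outputs, seqlen):
    """Max over time steps, ignoring steps at or beyond each sequence length.

    outputs: [batch_size, n_step, n_features]
    seqlen:  [batch_size]
    """
    n_step = outputs.shape[1]
    # valid[b, t] is True while t < seqlen[b]
    valid = np.arange(n_step)[None, :] < seqlen[:, None]
    masked = np.where(valid[:, :, None], outputs, -np.inf)
    return masked.max(axis=1)

outputs = np.array([[[-1.0], [-2.0], [5.0]],
                    [[0.5], [0.7], [0.1]]])
seqlen = np.array([2, 3])
pooled = masked_max(outputs, seqlen)
```

For the first sequence only the first two steps are valid, so the stale value 5.0 at step 3 is ignored.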

Cross Entropy for batch with Theano

I am attempting to implement an RNN and have output predictions p_y of shape (batch_size, time_points, num_classes). I also have a target_output of shape (batch_size, time_points), where the value at a given index of target_output is an integer denoting the class (a value between 0 and num_classes-1). How can I index p_y with target_output to get the probabilities of the given class I need to compute Cross-Entropy?
I'm not even sure how to do this in numpy. The expression p_y[target_output] does not give the desired results.

You need to use advanced indexing (search for "advanced indexing" here). But Theano advanced indexing behaves differently to numpy so knowing how to do this in numpy may not be all that helpful!
Here's a function which does this for my setup, but note that the order of my dimensions differs from yours. I use (time points, batch_size, num_classes). This also assumes you want to use the 1-of-N categorical cross-entropy variant. You may not want sequence length padding either.
import theano.tensor  # also makes the top-level theano name available

def categorical_crossentropy_3d(coding_dist, true_dist, lengths):
    # Zero out the false probabilities and sum the remaining true probabilities to remove the third dimension.
    indexes = theano.tensor.arange(coding_dist.shape[2])
    mask = theano.tensor.neq(indexes, true_dist.reshape((true_dist.shape[0], true_dist.shape[1], 1)))
    predicted_probabilities = theano.tensor.set_subtensor(coding_dist[theano.tensor.nonzero(mask)], 0.).sum(axis=2)
    # Pad short sequences with 1's (the pad locations are implicitly correct!)
    indexes = theano.tensor.arange(predicted_probabilities.shape[0]).reshape((predicted_probabilities.shape[0], 1))
    mask = indexes >= lengths
    predicted_probabilities = theano.tensor.set_subtensor(predicted_probabilities[theano.tensor.nonzero(mask)], 1.)
    return -theano.tensor.log(predicted_probabilities)
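For reference, the indexing itself can be written in plain NumPy as follows. This sketch uses the question's (batch_size, time_points, num_classes) layout rather than the layout of the function above, ignores sequence-length padding, and the function name is hypothetical.

```python
import numpy as np

def sequence_cross_entropy(p_y, target_output):
    """Per-position cross-entropy.

    p_y:           [batch_size, time_points, num_classes] probabilities
    target_output: [batch_size, time_points] integer class ids
    """
    batch_idx = np.arange(p_y.shape[0])[:, None]   # [batch_size, 1]
    time_idx = np.arange(p_y.shape[1])[None, :]    # [1, time_points]
    # Advanced indexing picks p_y[b, t, target_output[b, t]] for every (b, t).
    picked = p_y[batch_idx, time_idx, target_output]
    return -np.log(picked)

p_y = np.array([[[0.2, 0.5, 0.3],
                 [0.1, 0.1, 0.8]]])
target_output = np.array([[1, 2]])
loss = sequence_cross_entropy(p_y, target_output)
```

The three index arrays broadcast to a common (batch_size, time_points) shape, which is why p_y[target_output] alone does not give the desired result: it only indexes the first axis.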
