What is the difference between attn_mask and key_padding_mask in MultiheadAttention? - python

What is the difference between attn_mask and key_padding_mask in MultiheadAttention of PyTorch? From the documentation:
key_padding_mask – if provided, specified padding elements in the key will be ignored by the attention. When given a binary mask and a value is True, the corresponding value on the attention layer will be ignored. When given a byte mask and a value is non-zero, the corresponding value on the attention layer will be ignored.
attn_mask – 2D or 3D mask that prevents attention to certain positions. A 2D mask will be broadcasted for all the batches while a 3D mask allows to specify a different mask for the entries of each batch.
Thanks in advance.

The key_padding_mask is used to mask out positions that are padding, i.e., positions after the end of the input sequence. This is always specific to the input batch and depends on how long the sequences in the batch are compared to the longest one. It is a 2D tensor of shape batch size × input length.
On the other hand, attn_mask says which key-value pairs are valid. In a Transformer decoder, a triangular mask is used to simulate the inference time and prevent attending to "future" positions. This is what attn_mask is usually used for. If it is a 2D tensor, the shape is input length × input length. You can also have a mask that is specific to every item in a batch. In that case, you can use a 3D tensor of shape (batch size × num heads) × input length × input length. (So, in theory, you can simulate key_padding_mask with a 3D attn_mask.)
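As a minimal sketch of how the two masks are typically passed together (the names mha, x and lengths below are made up for illustration, not from the question, and a recent PyTorch with batch_first support is assumed): a causal attn_mask blocks "future" positions for every sequence, while key_padding_mask hides the padded tail of each individual sequence in the batch.
import torch
import torch.nn as nn

embed_dim, num_heads, bsz, seq_len = 16, 4, 3, 5
mha = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
x = torch.randn(bsz, seq_len, embed_dim)

# attn_mask: (seq_len, seq_len); True above the diagonal blocks "future" positions
causal_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)

# key_padding_mask: (bsz, seq_len); True marks padding at the end of each sequence
lengths = torch.tensor([5, 3, 4])
key_padding_mask = torch.arange(seq_len)[None, :] >= lengths[:, None]

out, attn_weights = mha(x, x, x, attn_mask=causal_mask, key_padding_mask=key_padding_mask)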

I think they work the same way: both masks define which attention between query and key will not be used. The only difference between the two choices is the shape in which you find it more convenient to provide the mask.
According to the code, it seems the two masks are merged (a union is taken), so they play the same role: defining which attention between query and key will not be used. Because the union is taken, the two mask inputs can hold different values if you actually need both masks, or you can pass your mask through whichever argument has the more convenient required shape. Here is part of the original code from pytorch/functional.py, around line 5227, in the function multi_head_attention_forward():
...
# merge key padding and attention masks
if key_padding_mask is not None:
    assert key_padding_mask.shape == (bsz, src_len), \
        f"expecting key_padding_mask shape of {(bsz, src_len)}, but got {key_padding_mask.shape}"
    key_padding_mask = key_padding_mask.view(bsz, 1, 1, src_len). \
        expand(-1, num_heads, -1, -1).reshape(bsz * num_heads, 1, src_len)
    if attn_mask is None:
        attn_mask = key_padding_mask
    elif attn_mask.dtype == torch.bool:
        attn_mask = attn_mask.logical_or(key_padding_mask)
    else:
        attn_mask = attn_mask.masked_fill(key_padding_mask, float("-inf"))
...
# so here only the merged/unioned mask is used to actually compute the attention
attn_output, attn_output_weights = _scaled_dot_product_attention(q, k, v, attn_mask, dropout_p)
Please correct me if you have a different opinion or if I am wrong.
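To illustrate the point about the union with a small sketch (not from the original answer; it assumes a recent PyTorch where both masks can be boolean): the same padding pattern passed via key_padding_mask or via an equivalent 3D attn_mask should produce the same output, since both are merged into the same mask internally.
import torch
import torch.nn as nn

torch.manual_seed(0)
bsz, src_len, embed_dim, num_heads = 2, 4, 8, 2
mha = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
x = torch.randn(bsz, src_len, embed_dim)
pad = torch.tensor([[False, False, True, True],
                    [False, False, False, True]])  # True = ignore this key

out1, _ = mha(x, x, x, key_padding_mask=pad)

# The same padding expressed as a 3D attn_mask of shape (bsz * num_heads, tgt_len, src_len)
attn_mask = pad[:, None, None, :].expand(bsz, num_heads, src_len, src_len)
attn_mask = attn_mask.reshape(bsz * num_heads, src_len, src_len)
out2, _ = mha(x, x, x, attn_mask=attn_mask)

print(torch.allclose(out1, out2, atol=1e-6))  # expected: True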

Related

What does unsqueeze do in pytorch?

lower_bounds = torch.max(set_1[:, :2].unsqueeze(1),
set_2[:, :2].unsqueeze(0)) #(n1, n2, 2)
This code snippet uses unsqueeze(1) for one tensor, but unsqueeze(0) for another. What is the difference between them?
unsqueeze turns an n-dimensional tensor into an (n+1)-dimensional one by adding an extra dimension of size 1. However, since it is ambiguous which axis the new dimension should lie across (i.e. in which direction it should be "unsqueezed"), this needs to be specified by the dim argument.
Hence the resulting unsqueezed tensors have the same information, but the indices used to access them are different.
[Figure from the original answer: a visual representation of what squeeze/unsqueeze do for an effectively 2D matrix, going from a 2D tensor to a 3D one, with the 3 possible positions for the new dimension.]
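A small shape sketch of the same idea (set_1 and set_2 here are stand-in tensors, not the asker's data):
import torch

x = torch.arange(6).reshape(2, 3)   # shape (2, 3)
print(x.unsqueeze(0).shape)         # torch.Size([1, 2, 3]) -- new axis in front
print(x.unsqueeze(1).shape)         # torch.Size([2, 1, 3]) -- new axis in the middle

# In the snippet above, unsqueeze(1)/unsqueeze(0) set up broadcasting:
set_1 = torch.rand(4, 4)            # n1 = 4 rows, first two columns are coordinates
set_2 = torch.rand(5, 4)            # n2 = 5 rows
lower_bounds = torch.max(set_1[:, :2].unsqueeze(1),   # (4, 1, 2)
                         set_2[:, :2].unsqueeze(0))   # (1, 5, 2)
print(lower_bounds.shape)           # torch.Size([4, 5, 2]) = (n1, n2, 2)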

Switch randomly a predetermined number of elements of a tensor in Tensorflow

I am struggling to find a solution to this puzzle. I have:
mask: a 2D tensor of shape=(batch_size, n) made of ones and zeros
n_neg: a 1D tensor of shape=(batch_size,) of type int
What I'd like to do is to randomly change from 0 to 1 as many elements in each row of mask as indicated in the corresponding row of n_neg. By construction this is always possible, i.e. the number of zeros in the i-th row of mask is always greater than n_neg[i]. Since this happens inside my model's cost function, I'd like to do it using only TensorFlow ops.
Any idea how to do it? Thanks in advance.
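One possible sketch, not a definitive answer: it assumes a TensorFlow version with tf.argsort (1.15+/2.x), a float 0/1 mask, and that every row contains at least n_neg[i] zeros. The idea is to give every zero a random score, rank the positions of each row by that score, and flip the n_neg[i] best-ranked positions.
import tensorflow as tf

def flip_random_zeros(mask, n_neg):
    # Zeros get a random score in [0, 1); ones get -1 so they always rank last
    scores = tf.where(tf.equal(mask, 0.0),
                      tf.random.uniform(tf.shape(mask)),
                      -tf.ones_like(mask))
    # Rank the positions of each row by score, highest first
    order = tf.argsort(scores, axis=-1, direction='DESCENDING')
    ranks = tf.argsort(order, axis=-1)
    # The n_neg[i] best-ranked positions of row i are randomly chosen zeros
    flip = tf.cast(ranks < tf.cast(n_neg, tf.int32)[:, None], mask.dtype)
    return mask + flip

new_mask = flip_random_zeros(mask, n_neg)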

Finding max across Tensor of specific indices. (Bi-LSTM-max implementation)

I am trying to implement BiLSTM-Max as described in the following paper:
Supervised Learning of Universal Sentence Representations from Natural Language Inference Data.
I am using Tensorflow for my implementation. I started from an original LSTM code and have successfully modified it so that it runs with dynamic-length input and is also bidirectional (i.e. a dynamic Bi-LSTM).
# Bi-LSTM, returns outputs as a list of n_step tensors of shape [batch_size, 2*n_hidden]
outputs, _, _ = tf.contrib.rnn.static_bidirectional_rnn(lstm_fw_cell, lstm_bw_cell, x, dtype=tf.float32)
# Change output to [batch_size, n_step, 2*n_hidden]
outputs = tf.transpose(tf.stack(outputs), [1, 0, 2])
# Retrieve the last output corresponding to the length of the input sequence
batch_size_ = tf.shape(outputs)[0]
index = tf.range(0, batch_size_) * seq_max_len + (seqlen - 1)
outputs = tf.gather(tf.reshape(outputs, [-1, 2*n_hidden]), index)
Next, modifying it to Bi-LSTM-Max, I replaced taking the last output with finding the max across n_steps, as follows:
# Bi-LSTM, returns outputs as a list of n_step tensors of shape [batch_size, 2*n_hidden]
outputs, _, _ = tf.contrib.rnn.static_bidirectional_rnn(lstm_fw_cell, lstm_bw_cell, x, dtype=tf.float32)
# Change output to [batch_size, n_step, 2*n_hidden]
outputs = tf.transpose(tf.stack(outputs), [1, 0, 2])
# Retrieve the max output across n_steps
outputs = tf.reduce_max(outputs, reduction_indices=[1])
When I took the max across the n_steps dimension, I had assumed that the values at indices > seqlen are 0s, so I could take the max across the entire dimension instead of taking the max from 0 to seqlen. Upon closer inspection, I realised that the values at the non-assigned indices may be non-zero, due to random initialization, or may just be whatever value was last assigned in memory.
This operation is trivial with Python arrays; however, I can't find an easy way to do it with tensor operations. Does anyone have an idea?
Probably the easiest thing to do would be to manually set the invalid outputs to zero or -∞ before finding the maximum. You can do that quite easily with tf.sequence_mask and tf.where:
seq_mask = tf.sequence_mask(seqlen, seq_max_len)
# Expand the mask so its shape matches outputs, i.e. (batch_size, n_step, 2*n_hidden)
seq_mask = tf.tile(seq_mask[:, :, tf.newaxis], [1, 1, tf.shape(outputs)[-1]])
# You can also use e.g. -np.inf * tf.ones_like(outputs)
outputs_masked = tf.where(seq_mask, outputs, tf.zeros_like(outputs))
outputs = tf.reduce_max(outputs_masked, axis=1)  # axis is preferred to reduction_indices
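If the valid outputs can be negative, zero is not a safe filler; the -∞ variant hinted at in the comment above could look like this (a sketch using the same seq_mask, replacing the last two lines):
import numpy as np

outputs_masked = tf.where(seq_mask, outputs, -np.inf * tf.ones_like(outputs))
outputs = tf.reduce_max(outputs_masked, axis=1)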

Decomposing 3rd Order Tensor in Python

I have a tensor in the shape (n_samples, n_steps, n_features). I want to decompose this into a tensor of shape (n_samples, n_components).
I need a method of decomposition that has a .fit(...) so that I can apply the same decomposition to a new batch of samples. I have been looking at Tucker decomposition and PARAFAC decomposition, but neither has that crucial .fit(...) and .transform(...) functionality. (Or at least I think they don't?)
I could use PCA and train it on a representative sample and then call .transform(...) on the remaining samples, but I would rather have some sort of tensor decomposition that can handle all of the samples at once, so as to get a better idea of the differences between each sample.
This is what I mean by "tensor":
In fact tensors are merely a generalisation of scalars and vectors; a scalar is a zero rank tensor, and a vector is a first rank tensor. The rank (or order) of a tensor is defined by the number of directions (and hence the dimensionality of the array) required to describe it.
If you have any questions, please ask, I'll try to clarify my problem if needed.
EDIT: The best solution would be some type of kernel, but I have yet to find a kernel that can deal with n-rank tensors and not just 2D data.
You can do this using the development (master) version of TensorLy. Specifically, you can use the new partial_tucker function (it is not yet updated in the documentation...).
Note that the following solution preserves the structure of the tensor, i.e. a tensor of shape (n_samples, n_steps, n_features) is decomposed into a (smaller) tensor of shape (n_samples, n_components_1, n_components_2).
Code
Short answer: this is a very basic class that does what you want (and it would work on tensors of arbitrary order).
import tensorly as tl
from tensorly.decomposition._tucker import partial_tucker

class TensorPCA:
    def __init__(self, ranks, modes):
        self.ranks = ranks
        self.modes = modes

    def fit(self, tensor):
        self.core, self.factors = partial_tucker(tensor, modes=self.modes, ranks=self.ranks)
        return self

    def transform(self, tensor):
        return tl.tenalg.multi_mode_dot(tensor, self.factors, modes=self.modes, transpose=True)
Usage
Given an input tensor, you can use the previous class by first instantiating it with the desired ranks (size of the core tensor) and modes on which to perform the decomposition (in your 3D case, 1 and 2 since indexing starts at zero):
tpca = TensorPCA(ranks=[4, 5], modes=[1, 2])
tpca.fit(tensor)
Given a new tensor originally called new_tensor, you can project it using the transform method:
tpca.transform(new_tensor)
Explanation
Let's go through the code with an example: first let's import the necessary bits:
import numpy as np
import tensorly as tl
from tensorly.decomposition._tucker import partial_tucker
We then generate a random tensor:
tensor = np.random.random((10, 11, 12))
The next step is to decompose it along its second and third dimensions, or modes (as the first dimension corresponds to the samples):
core, factors = partial_tucker(tensor, modes=[1, 2], ranks=[4, 5])
The core corresponds to the transformed input tensor while factors is a list of two projection matrices, one for the second mode and one for the third mode. Given a new tensor, you can project it to the same subspace (the transform method) by projecting each of its last two dimensions:
tl.tenalg.multi_mode_dot(tensor, factors, modes=[1, 2], transpose=True)
The transposition here is equivalent to an inverse since the factors are orthogonal.
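A quick shape check of the example above makes the claim about preserved structure concrete (a sketch that only prints shapes):
print(core.shape)                   # (10, 4, 5): samples kept, modes 1 and 2 reduced to ranks 4 and 5
print([f.shape for f in factors])   # [(11, 4), (12, 5)]: one projection matrix per decomposed mode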
Finally, a note on the terminology: in general, even though it is sometimes done, it is probably best to not use interchangeably order and rank of a tensor. The order of a tensor is simply its number of dimensions while the rank of a tensor is usually a much more complicated notion which you could think of as a generalization of the notion of matrix rank.

Cross Entropy for batch with Theano

I am attempting to implement an RNN and have output predictions p_y of shape (batch_size, time_points, num_classes). I also have a target_output of shape (batch_size, time_points), where the value at a given index of target_output is an integer denoting the class (a value between 0 and num_classes-1). How can I index p_y with target_output to get the probabilities of the target classes that I need to compute the cross-entropy?
I'm not even sure how to do this in numpy. The expression p_y[target_output] does not give the desired results.
You need to use advanced indexing (search for "advanced indexing" here). But Theano advanced indexing behaves differently to numpy so knowing how to do this in numpy may not be all that helpful!
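For the numpy side of the question, a small advanced-indexing sketch (the sizes and names are made up) that picks p_y[b, t, target_output[b, t]] for every b and t:
import numpy as np

batch_size, time_points, num_classes = 2, 3, 4
p_y = np.random.rand(batch_size, time_points, num_classes)
target_output = np.random.randint(num_classes, size=(batch_size, time_points))

b = np.arange(batch_size)[:, None]   # (batch_size, 1), broadcasts against t
t = np.arange(time_points)[None, :]  # (1, time_points)
picked = p_y[b, t, target_output]    # (batch_size, time_points)
cross_entropy = -np.log(picked)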
Here's a function which does this for my setup, but note that the order of my dimensions differs from yours. I use (time points, batch_size, num_classes). This also assumes you want to use the 1-of-N categorical cross-entropy variant. You may not want sequence length padding either.
import theano

def categorical_crossentropy_3d(coding_dist, true_dist, lengths):
    # Zero out the false probabilities and sum the remaining true probabilities to remove the third dimension.
    indexes = theano.tensor.arange(coding_dist.shape[2])
    mask = theano.tensor.neq(indexes, true_dist.reshape((true_dist.shape[0], true_dist.shape[1], 1)))
    predicted_probabilities = theano.tensor.set_subtensor(coding_dist[theano.tensor.nonzero(mask)], 0.).sum(axis=2)
    # Pad short sequences with 1's (the pad locations are implicitly correct!)
    indexes = theano.tensor.arange(predicted_probabilities.shape[0]).reshape((predicted_probabilities.shape[0], 1))
    mask = indexes >= lengths
    predicted_probabilities = theano.tensor.set_subtensor(predicted_probabilities[theano.tensor.nonzero(mask)], 1.)
    return -theano.tensor.log(predicted_probabilities)
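A hedged usage sketch for the function above (the variable names are illustrative); note the (time_points, batch_size, ...) dimension order it expects:
import theano
import theano.tensor as T

coding_dist = T.tensor3('coding_dist')  # (time_points, batch_size, num_classes)
true_dist = T.imatrix('true_dist')      # (time_points, batch_size), integer class ids
lengths = T.ivector('lengths')          # (batch_size,), true sequence lengths

loss = categorical_crossentropy_3d(coding_dist, true_dist, lengths).sum()
f = theano.function([coding_dist, true_dist, lengths], loss)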
