Cross Entropy for batch with Theano

I am attempting to implement an RNN and have output predictions p_y of shape (batch_size, time_points, num_classes). I also have a target_output of shape (batch_size, time_points), where the value at a given index of target_output is an integer denoting the class (a value between 0 and num_classes-1). How can I index p_y with target_output to get the probabilities of the target classes that I need to compute the cross-entropy?
I'm not even sure how to do this in numpy. The expression p_y[target_output] does not give the desired results.

You need to use advanced indexing (search for "advanced indexing" here). But Theano advanced indexing behaves differently from numpy's, so knowing how to do this in numpy may not be all that helpful!
Here's a function which does this for my setup, but note that the order of my dimensions differs from yours: I use (time_points, batch_size, num_classes). This also assumes you want to use the 1-of-N categorical cross-entropy variant. You may not want the sequence-length padding either.
import theano.tensor

def categorical_crossentropy_3d(coding_dist, true_dist, lengths):
    # Zero out the false probabilities and sum the remaining true
    # probabilities to remove the third dimension.
    indexes = theano.tensor.arange(coding_dist.shape[2])
    mask = theano.tensor.neq(indexes, true_dist.reshape((true_dist.shape[0], true_dist.shape[1], 1)))
    predicted_probabilities = theano.tensor.set_subtensor(
        coding_dist[theano.tensor.nonzero(mask)], 0.).sum(axis=2)
    # Pad short sequences with 1's (the pad locations are implicitly correct!)
    indexes = theano.tensor.arange(predicted_probabilities.shape[0]).reshape(
        (predicted_probabilities.shape[0], 1))
    mask = indexes >= lengths
    predicted_probabilities = theano.tensor.set_subtensor(
        predicted_probabilities[theano.tensor.nonzero(mask)], 1.)
    return -theano.tensor.log(predicted_probabilities)
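For what it's worth, the indexing the question asks about can be done in plain numpy with advanced indexing. A minimal sketch with made-up shapes (the index helpers b and t are just for illustration):
import numpy as np

# Hypothetical shapes, for illustration only
batch_size, time_points, num_classes = 2, 3, 4
p_y = np.random.random((batch_size, time_points, num_classes))
p_y /= p_y.sum(axis=2, keepdims=True)  # normalize to probabilities
target_output = np.random.randint(0, num_classes, (batch_size, time_points))

# Advanced indexing: pick the probability of the target class
# at every (batch, time) position.
b = np.arange(batch_size)[:, None]     # shape (batch_size, 1)
t = np.arange(time_points)[None, :]    # shape (1, time_points)
true_probs = p_y[b, t, target_output]  # shape (batch_size, time_points)
cross_entropy = -np.log(true_probs).mean()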

Related

What is the difference between attn_mask and key_padding_mask in MultiheadAttention?

What is the difference between attn_mask and key_padding_mask in MultiheadAttention of PyTorch?
key_padding_mask – if provided, specified padding elements in the key will be ignored by the attention. When given a binary mask and a value is True, the corresponding value on the attention layer will be ignored. When given a byte mask and a value is non-zero, the corresponding value on the attention layer will be ignored
attn_mask – 2D or 3D mask that prevents attention to certain positions. A 2D mask will be broadcasted for all the batches while a 3D mask allows to specify a different mask for the entries of each batch.
Thanks in advance.
The key_padding_mask is used to mask out positions that are padding, i.e., after the end of the input sequence. This is always specific to the input batch and depends on how long the sequences in the batch are compared to the longest one. It is a 2D tensor of shape batch size × input length.
On the other hand, attn_mask says which key-value pairs are valid. In a Transformer decoder, a triangular mask is used to simulate inference time and prevent attending to the "future" positions. This is what attn_mask is usually used for. If it is a 2D tensor, the shape is input length × input length. You can also have a mask that is specific to every item in a batch. In that case, you can use a 3D tensor of shape (batch size × num heads) × input length × input length. (So, in theory, you can simulate key_padding_mask with a 3D attn_mask.)
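To make the two shapes concrete, here is a minimal sketch of passing both masks to nn.MultiheadAttention. The sizes are made up, and the boolean convention (True = position is ignored) follows the docs quoted above:
import torch
import torch.nn as nn

seq_len, batch, embed_dim, num_heads = 5, 2, 8, 2
mha = nn.MultiheadAttention(embed_dim, num_heads)
x = torch.randn(seq_len, batch, embed_dim)  # (length, batch, embed) layout

# Causal attn_mask: True above the diagonal, i.e. no attending to the future.
attn_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)

# key_padding_mask: mark the last two positions of the second sequence as padding.
key_padding_mask = torch.zeros(batch, seq_len, dtype=torch.bool)
key_padding_mask[1, -2:] = True

out, weights = mha(x, x, x, attn_mask=attn_mask, key_padding_mask=key_padding_mask)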
I think they work the same: both masks define which query-key attention pairs will not be used. The only difference between the two choices is which shape you find more convenient to provide the mask in.
According to the code, the two masks are merged (their union is taken), so they play the same role: defining which query-key attention pairs will not be used. Since the union is taken, the two mask inputs can hold different values if you actually need two masks, or you can supply the mask through whichever argument's required shape is more convenient. Here is part of the original code from pytorch/functional.py, around line 5227, in the function multi_head_attention_forward():
...
# merge key padding and attention masks
if key_padding_mask is not None:
    assert key_padding_mask.shape == (bsz, src_len), \
        f"expecting key_padding_mask shape of {(bsz, src_len)}, but got {key_padding_mask.shape}"
    key_padding_mask = key_padding_mask.view(bsz, 1, 1, src_len). \
        expand(-1, num_heads, -1, -1).reshape(bsz * num_heads, 1, src_len)
    if attn_mask is None:
        attn_mask = key_padding_mask
    elif attn_mask.dtype == torch.bool:
        attn_mask = attn_mask.logical_or(key_padding_mask)
    else:
        attn_mask = attn_mask.masked_fill(key_padding_mask, float("-inf"))
...
# so here only the merged/unioned mask is used to actually compute the attention
attn_output, attn_output_weights = _scaled_dot_product_attention(q, k, v, attn_mask, dropout_p)
Please correct me if you have a different opinion or if I am wrong.

Masking: Mask everything after a specified token (eos)

My tgt tensor has shape [12, 32, 1], i.e. sequence_length, batch_size, token_idx.
What is the best way to create a mask which has ones for entries with <eos> and before in sequence, and zeros afterwards?
Currently I'm calculating my mask like this, which simply puts zeros where <blank> is, ones otherwise.
mask = torch.zeros_like(tgt).masked_scatter_((tgt != tgt_padding), torch.ones_like(tgt))
But the problem is that my tgt can contain <blank> before <eos> as well, in which case I don't want to mask it out.
My temporary solution:
mask = torch.ones_like(tgt)
for eos_token in (tgt == tgt_eos).nonzero():
    mask[eos_token[0]+1:, eos_token[1]] = 0
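For reference, the same idea can be sketched without the Python loop, assuming at most one <eos> per sequence (a hypothetical cumsum-based variant):
# Positions strictly after the first <eos> get 0, everything else 1.
is_eos = (tgt == tgt_eos).long()            # [seq_len, batch, 1]
after_eos = is_eos.cumsum(dim=0) - is_eos   # > 0 strictly after <eos>
mask = (after_eos == 0).long()              # ones up to and including <eos>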
I guess you are trying to create a mask for the PAD tokens. There are several ways. One of them is as follows.
# tensor is of shape [seq_len, batch_size, 1]
tensor = tensor.mul(tensor.ne(PAD).float())
Here, PAD stands for the index of the PAD_TOKEN. tensor.ne(PAD) creates a byte tensor with 0 at PAD_TOKEN positions and 1 elsewhere.
If you have examples like "<s> I think <pad> so </s> <pad> <pad>", then I would suggest using different PAD tokens before and after </s>.
OR, if you have the length information for each sentence (in the above example, the sentence length is 6), then you can create the mask using the following function.
def sequence_mask(lengths, max_len=None):
    """
    Creates a boolean mask from sequence lengths.
    :param lengths: 1d tensor [batch_size]
    :param max_len: int
    """
    batch_size = lengths.numel()
    max_len = max_len or lengths.max()
    return (torch.arange(0, max_len, device=lengths.device)  # (0 for pad positions)
            .type_as(lengths)
            .repeat(batch_size, 1)
            .lt(lengths.unsqueeze(1)))
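A quick usage sketch with made-up lengths (the length 6 matches the example sentence above):
import torch

lengths = torch.tensor([6, 4, 8])
mask = sequence_mask(lengths, max_len=8)
# mask[i, j] is True for j < lengths[i] and False at the pad positions.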

Finding max across Tensor of specific indices. (Bi-LSTM-max implementation)

I am trying to implement BiLSTM-Max as described in the following paper:
Supervised Learning of Universal Sentence Representations from Natural Language Inference Data.
I am using Tensorflow for my implementation. I started from an existing LSTM implementation and successfully modified it so that it can run with dynamic-length input and is also bidirectional (i.e. a dynamic Bi-LSTM).
# Bi-LSTM, returns output of shape [n_step, batch_size, n_input]
outputs, _, _ = tf.contrib.rnn.static_bidirectional_rnn(lstm_fw_cell, lstm_bw_cell, x, dtype=tf.float32)
# Change output back to [batch_size, n_step, n_input]
outputs = tf.transpose(tf.stack(outputs), [1, 0, 2])
# Retrieve the last output corresponding to the length of the input sequence
batch_size_ = tf.shape(outputs)[0]
index = tf.range(0, batch_size_) * seq_max_len + (seqlen - 1)
outputs = tf.gather(tf.reshape(outputs, [-1, 2*n_hidden]), index)
Next, modifying it to Bi-LSTM-Max, I replaced taking the last output with finding the max across n_steps, as follows:
# Bi-LSTM, returns output of shape [n_step, batch_size, n_input]
outputs, _, _ = tf.contrib.rnn.static_bidirectional_rnn(lstm_fw_cell, lstm_bw_cell, x, dtype=tf.float32)
# Change output back to [batch_size, n_step, n_input]
outputs = tf.transpose(tf.stack(outputs), [1, 0, 2])
# Retrieve the max output across n_steps
outputs = tf.reduce_max(outputs, reduction_indices=[1])
When I took the max across the n_steps dimension, I had assumed that the outputs at indices ≥ seqlen are 0s, so I could take the max across the entire dimension instead of taking the max from 0 to seqlen. Upon closer inspection, I realised that the values at the unassigned indices may be non-zero, due to random initialization, or may just be the last value assigned to that memory.
This operation is trivial with Python arrays; however, I can't find an easy way to do it with Tensor operations. Does anyone have an idea?
Probably the easiest thing to do would be to manually set the invalid outputs to zero or -∞ before finding the maximum. You can do that quite easily with tf.sequence_mask and tf.where:
seq_mask = tf.sequence_mask(seqlen, seq_max_len)
# Expand the [batch, time] mask to the [batch, time, features] shape of outputs
seq_mask = tf.tile(tf.expand_dims(seq_mask, 2), [1, 1, 2 * n_hidden])
# You can also use e.g. -np.inf * tf.ones_like(outputs)
outputs_masked = tf.where(seq_mask, outputs, tf.zeros_like(outputs))
outputs = tf.reduce_max(outputs_masked, axis=1)  # axis is preferred to reduction_indices
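One caveat: if the RNN outputs can be negative (e.g. with tanh activations), the zeros written in by the mask can incorrectly win the max. The -∞ variant mentioned in the comment avoids this; a sketch that would replace the last two lines above:
import numpy as np

# -inf at padded positions can never be the maximum
neg_inf = -np.inf * tf.ones_like(outputs)
outputs_masked = tf.where(seq_mask, outputs, neg_inf)
outputs = tf.reduce_max(outputs_masked, axis=1)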

Jaccard's distance matrix with tensorflow

I would like to compute a distance matrix using the Jaccard distance. And do so as fast as possible. I used to use scikit-learn's pairwise_distances function. But scikit-learn doesn't plan to support GPU, and there's even a known bug that makes the function slower when run in parallel.
My only constraint is that the resulting distance matrix can then be fed to scikit-learn's DBSCAN clustering algorithm. I was thinking about implementing the computation with tensorflow but couldn't find a nice and simple way to do it.
PS: I have reasons to precompute the distance matrix instead of letting DBSCAN do it as needed.
Hi, I was facing the same problem.
Given that the Jaccard similarity is the ratio of true positives (tp) to the sum of true positives, false negatives (fn) and false positives (fp), I came up with this solution:
def jaccard_distance(self):
    # Row-wise Jaccard distance between target and prediction
    # (tf.multiply replaces the long-deprecated tf.mul).
    tp = tf.reduce_sum(tf.multiply(self.target, self.prediction), 1)
    fn = tf.reduce_sum(tf.multiply(self.target, 1 - self.prediction), 1)
    fp = tf.reduce_sum(tf.multiply(1 - self.target, self.prediction), 1)
    return 1 - (tp / (tp + fn + fp))
Hope this helps!
I am not a tensorflow expert, but here is the solution I came up with. As far as I know, the only ways in tensorflow to compute something over all pairs of a list are a matrix multiplication and the broadcasting rules; this solution uses both at some point.
So let's assume we have an input boolean matrix with n_samples rows, one per set, and n_features columns, one per possible element. A value True in the i-th row, j-th column means the i-th set contains element j, just like scikit-learn's pairwise_distances expects. We can then proceed as follows.
Cast the matrix to numbers, getting 1 for True and 0 for False.
Multiply the matrix by its own transpose. This produce a matrix where each element M[i][j] contains size of the intersection between the i-th and j-th sets.
Compute a cardinal vector that contains the cardinality of each set by summing the input matrix by rows.
Make a row vector and a column vector from cardinal.
Compute 1 - M / (cardinalrow + cardinalcol - M). The broadcasting rules do all the work when adding a row vector and a column vector.
This algorithm as a whole seems a bit hackish, but it works and produces results within a reasonable margin of those computed by scikit-learn's pairwise_distances function. A better algorithm would probably make a single pass over every pair of input vectors and compute only half of the matrix, as it is symmetric. Any improvement is welcome.
setsin = tf.placeholder(tf.bool, shape=(N, M))
sets = tf.cast(setsin, tf.float16)
mat = tf.matmul(sets, sets, transpose_b=True, name="Main_matmul")
#mat = tf.cast(mat, tf.float32, name="Upgrade_mat")
#sets = tf.cast(sets, tf.float32, name="Upgrade_sets")
cardinal = tf.reduce_sum(sets, 1, name="Richelieu")
cardinalrow = tf.expand_dims(cardinal, 0)
cardinalcol = tf.expand_dims(cardinal, 1)
mat = 1 - mat / (cardinalrow + cardinalcol - mat)
I used the float16 type as it seems much faster than float32. Casting to float32 might only be useful if the cardinalities are large enough to make float16 inaccurate, or if more precision is needed when performing the division. But even when the casts are needed, it still seems worthwhile to do the matrix multiplication in float16.
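As a sanity check of the formula (not of the tensorflow code itself), the same computation can be reproduced in plain numpy and compared against scikit-learn on small random data; the shapes and values here are made up:
import numpy as np
from sklearn.metrics import pairwise_distances

X = np.random.random((8, 16)) > 0.5            # boolean sets matrix
S = X.astype(np.float64)
M = S @ S.T                                    # pairwise intersection sizes
card = S.sum(axis=1)                           # cardinality of each set
D = 1 - M / (card[None, :] + card[:, None] - M)
assert np.allclose(D, pairwise_distances(X, metric="jaccard"))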

Decomposing 3rd Order Tensor in Python

I have a tensor in the shape (n_samples, n_steps, n_features). I want to decompose this into a tensor of shape (n_samples, n_components).
I need a method of decomposition that has a .fit(...) so that I can apply the same decomposition to a new batch of samples. I have been looking at Tucker decomposition and PARAFAC decomposition, but neither has that crucial .fit(...) and .transform(...) functionality. (Or at least I think they don't?)
I could use PCA and train it on a representative sample and then call .transform(...) on the remaining samples, but I would rather have some sort of tensor decomposition that can handle all of the samples at once, so as to get a better idea of the differences between each sample.
This is what I mean by "tensor":
In fact tensors are merely a generalisation of scalars and vectors; a scalar is a zero rank tensor, and a vector is a first rank tensor. The rank (or order) of a tensor is defined by the number of directions (and hence the dimensionality of the array) required to describe it.
If you have any questions, please ask, I'll try to clarify my problem if needed.
EDIT: The best solution would be some type of kernel, but I have yet to find a kernel that can deal with n-rank tensors and not just 2D data.
You can do this using the development (master) version of TensorLy. Specifically, you can use the new partial_tucker function (it is not yet updated in the documentation...).
Note that the following solution preserves the structure of the tensor, i.e. a tensor of shape (n_samples, n_steps, n_features) is decomposed into a (smaller) tensor of shape (n_samples, n_components_1, n_components_2).
Code
Short answer: this is a very basic class that does what you want (and it would work on tensors of arbitrary order).
import tensorly as tl
from tensorly.decomposition._tucker import partial_tucker

class TensorPCA:
    def __init__(self, ranks, modes):
        self.ranks = ranks
        self.modes = modes

    def fit(self, tensor):
        self.core, self.factors = partial_tucker(tensor, modes=self.modes, ranks=self.ranks)
        return self

    def transform(self, tensor):
        return tl.tenalg.multi_mode_dot(tensor, self.factors, modes=self.modes, transpose=True)
Usage
Given an input tensor, you can use the previous class by first instantiating it with the desired ranks (size of the core tensor) and modes on which to perform the decomposition (in your 3D case, 1 and 2 since indexing starts at zero):
tpca = TensorPCA(ranks=[4, 5], modes=[1, 2])
tpca.fit(tensor)
Given a new tensor originally called new_tensor, you can project it using the transform method:
tpca.transform(new_tensor)
Explanation
Let's go through the code with an example: first let's import the necessary bits:
import numpy as np
import tensorly as tl
from tensorly.decomposition._tucker import partial_tucker
We then generate a random tensor:
tensor = np.random.random((10, 11, 12))
The next step is to decompose it along its second and third dimensions, or modes (as the first dimension corresponds to the samples):
core, factors = partial_tucker(tensor, modes=[1, 2], ranks=[4, 5])
The core corresponds to the transformed input tensor while factors is a list of two projection matrices, one for the second mode and one for the third mode. Given a new tensor, you can project it to the same subspace (the transform method) by projecting each of its last two dimensions:
tl.tenalg.multi_mode_dot(tensor, factors, modes=[1, 2], transpose=True)
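As a quick shape check on the example above (the comments show the expected shapes, assuming ranks [4, 5] on modes [1, 2]):
print(core.shape)       # (10, 4, 5): samples kept, modes 1 and 2 reduced
projected = tl.tenalg.multi_mode_dot(tensor, factors, modes=[1, 2], transpose=True)
print(projected.shape)  # (10, 4, 5): same subspace as the core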
The transposition here is equivalent to an inverse since the factors are orthogonal.
Finally, a note on terminology: in general, even though it is sometimes done, it is probably best not to use "order" and "rank" of a tensor interchangeably. The order of a tensor is simply its number of dimensions, while the rank of a tensor is usually a much more complicated notion, which you could think of as a generalization of the notion of matrix rank.
