MultiHeadAttention giving very different values between versions (PyTorch/TensorFlow) - python

I'm trying to recreate a transformer that was written in PyTorch and port it to TensorFlow. Everything was going pretty well until each version of MultiHeadAttention started giving extremely different outputs. Both are implementations of multi-headed attention as described in the paper "Attention Is All You Need", so they should be able to produce the same output.
I'm converting
self_attn = nn.MultiheadAttention(dModel, nheads, dropout=dropout)
to
self_attn = MultiHeadAttention(num_heads=nheads, key_dim=dModel, dropout=dropout)
For my tests, dropout is 0.
I'm calling them with:
where x is a tensor with shape=(10, 128, 50)
As expected from the documentation, the PyTorch version returns a tuple, (the target sequence length, embedding dimension), both with dimensions [10, 128, 50].
I'm having trouble getting the TensorFlow version to do the same thing. With TensorFlow I only get one tensor back (size [10, 128, 50]), and it looks like neither the target sequence length nor the embedding dimension tensor from PyTorch.
Based on the Tensorflow documentation I should be getting something comparable.
How can I get them to operate the same way? I'm guessing I'm doing something wrong with Tensorflow but I can't figure out what.

By default, nn.MultiheadAttention returns a tuple of two tensors:
attn_output -- the result of the self-attention operation
attn_output_weights -- the attention weights averaged(!) over heads
By contrast, tf.keras.layers.MultiHeadAttention returns only one tensor by default, attention_output (which corresponds to PyTorch's attn_output). The attention weights of all heads are also returned if the parameter return_attention_scores is set to True, like:
output, scores = self_attn(x, x, x, return_attention_scores=True)
The scores tensor should also be averaged over the heads axis to fully correspond with PyTorch:
scores = tf.math.reduce_mean(scores, 1)
While rewriting, keep in mind that by default (as in the snippet in the question) nn.MultiheadAttention expects input of shape (seq_length, batch_size, embed_dim), whereas tf.keras.layers.MultiHeadAttention expects (batch_size, seq_length, embed_dim).
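A minimal sketch of the TensorFlow side that puts the points above together. The sizes follow the shape given in the question; num_heads=5 is an assumption (the question does not state nheads), and key_dim is set to embed_dim // num_heads on the assumption that the per-head size should mirror PyTorch, since in Keras key_dim is the size of each head rather than the full model width. The two layers are still initialized with different random weights, so matching values would additionally require copying the weights across.

```python
import numpy as np
import tensorflow as tf

# Sizes taken from the question's tensor shape; num_heads is an assumed value.
seq_len, batch_size, embed_dim, num_heads = 10, 128, 50, 5

# PyTorch-style layout: (seq_length, batch_size, embed_dim)
x_seq_first = np.random.rand(seq_len, batch_size, embed_dim).astype("float32")

# Keras expects (batch_size, seq_length, embed_dim), so swap the first two axes.
x_batch_first = tf.transpose(x_seq_first, perm=[1, 0, 2])

self_attn = tf.keras.layers.MultiHeadAttention(
    num_heads=num_heads,
    key_dim=embed_dim // num_heads,  # assumption: per-head size, as in PyTorch
    dropout=0.0,
)

# Positional arguments are (query, value, key); self-attention uses x for all three.
output, scores = self_attn(
    x_batch_first, x_batch_first, x_batch_first, return_attention_scores=True
)

# scores has shape (batch, heads, target_len, source_len); average over heads
# to get the counterpart of PyTorch's attn_output_weights.
avg_scores = tf.math.reduce_mean(scores, axis=1)

# Go back to the sequence-first layout if the rest of the code expects it.
output_seq_first = tf.transpose(output, perm=[1, 0, 2])
```

Here output_seq_first has the same layout as PyTorch's attn_output, and avg_scores has one (target_len, source_len) matrix per batch element.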


How does masking work in Tensorflow Keras

I have difficulty understanding how exactly masking works in TensorFlow/Keras. On the Keras website they simply say that the neural network layers skip/ignore the masked values, but it doesn't explain how. Does it force the weights to zero? (I know a boolean array is being created, but I don't know how it's being used.)
For example, consider this simple case:
embedding = tf.keras.layers.Embedding(input_dim=10, output_dim=3, mask_zero=True)
masked_output = embedding(np.array([[1,2,0]]))
I asked the Embedding layer to mask zero inputs. Now look at the output:
tf.Tensor(
[[[ 0.00300496 -0.02925059 -0.01254098]
  [ 0.04872786  0.01087702 -0.03656749]
  [ 0.00446818  0.00290152 -0.02269397]]], shape=(1, 3, 3), dtype=float32)
If you change the "mask_zero" argument to False you get the exact same results. Does anyone know what's happening behind the scenes? Any resources explaining the masking mechanism more thoroughly are highly appreciated.
P.S: This is also an example of a full Neural Network which gives an identical outcome with and without masking:
input = np.array([[1,2,0]]) # <--- 0 should be masked and ignored
embedding = tf.keras.layers.Embedding(input_dim=10, output_dim=3, mask_zero=True)
masked_output = embedding(input)
flatten = tf.keras.layers.Flatten()(masked_output)
dense_middle = tf.keras.layers.Dense(4)(flatten)
out = tf.keras.layers.Dense(1)(dense_middle)
In TensorFlow/Keras, masking lets you tell downstream layers to disregard certain timesteps of a sequence, typically the padding added to make sequences of varying length the same length. It does not force any weights to zero and it does not change the values in the tensor: the mask is a separate boolean tensor that is passed along with the data, and it only has an effect in layers that actually consume it.
In the example you provided, the Embedding layer is set to mask zeros via the mask_zero argument, yet the outcome is the same regardless of whether mask_zero is set to True or False. This is because Embedding never alters its own output: index 0 is still looked up like any other index. The only difference is that with mask_zero=True the layer also produces a mask, here [[True, True, False]], for later layers.
Underneath, the mask has shape (batch_size, timesteps) and holds True for steps to keep and False for steps to skip. Mask-aware layers such as SimpleRNN, LSTM and GRU skip the masked timesteps and carry their previous state forward. Layers that do not consume masks, such as Flatten and Dense, ignore it, which is why your full network also gives identical results with and without masking.
To see a difference, the mask has to reach a layer that uses it, for example a recurrent layer:
inputs = tf.keras.layers.Input(shape=(3,))
embedding = tf.keras.layers.Embedding(input_dim=10, output_dim=3, mask_zero=True)(inputs)
rnn = tf.keras.layers.SimpleRNN(4)(embedding)  # mask-aware: skips timesteps where the input was 0
output = tf.keras.layers.Dense(1)(rnn)
model = tf.keras.Model(inputs, output)
With this network, the recurrent layer processes the zeros like any other token when the "mask_zero" argument is set to False and skips them when it is True, so for the same weights the two settings give different predictions. A stack of Flatten and Dense layers does not consume the mask, so it gives the same output either way.
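As a small check of what mask_zero actually does, the sketch below compares two Embedding layers that share the same weights and inspects the mask via the layer's compute_mask method. The variable names are illustrative only.

```python
import numpy as np
import tensorflow as tf

tokens = np.array([[1, 2, 0]])  # the trailing 0 is padding

with_mask = tf.keras.layers.Embedding(input_dim=10, output_dim=3, mask_zero=True)
without_mask = tf.keras.layers.Embedding(input_dim=10, output_dim=3, mask_zero=False)

vectors_masked = np.asarray(with_mask(tokens))

# Build the second layer, then give it the same embedding table as the first.
without_mask(tokens)
without_mask.set_weights(with_mask.get_weights())
vectors_plain = np.asarray(without_mask(tokens))

# The looked-up vectors are identical; mask_zero only controls whether a
# boolean mask is produced for downstream layers.
same_values = np.allclose(vectors_masked, vectors_plain)

mask = with_mask.compute_mask(tokens)        # boolean tensor, False at the padded step
no_mask = without_mask.compute_mask(tokens)  # None when mask_zero=False

print(same_values)
print(np.asarray(mask))
print(no_mask)
```

The mask only matters once it reaches a layer that consumes it, such as a recurrent layer.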

No gradients provided for any variable error

I'm creating a model using the Keras functional API.
The layer architecture is as follows:
n = tf.keras.layers.Dense(1)(input)
for i in tf.range(n):
output = tf.keras.layers.Dense(4)(input)
I then concatenate the outputs and return a tensor with shape [1, None, 4], where [1] is the batch dimension, [None] is n, and [4] is the output from the second dense layer.
My loss function involves comparing the shape of the expected output, and comparing the outputs.
loss = tf.convert_to_tensor(abs(tf.shape(logits)[1] - tf.shape(expected)[1])) * 100.
When running this on a custom training loop, I'm getting the error
ValueError: No gradients provided for any variable: (['while/dense/kernel:0',
'while/dense/bias:0', 'while/while/dense_1/kernel:0', 'while/while/dense_1/bias:0'],).
Provided `grads_and_vars` is ((None, <tf.Variable 'while/dense/kernel:0' shape=(786432, 1)
Shape is not differentiable; you cannot do things like this with gradient-based learning. Problems like this need to be tackled with more powerful tools, e.g. reinforcement learning, where one treats n as an action and uses a policy gradient for it.
A rule of thumb to remember is that you cannot really backprop through discrete objects. You need to produce floats, as gradients require smooth functions. In your case n should be an integer (what does a loop over a float mean?), so this should be your first warning sign. The other is the shape itself, which is also an integer. A target can be discrete, but not the prediction. Note that even in classification we do not output a class; we output a probability, because a probability is smooth.
You could build your model by assuming some maximum value of N and treating it more like a classification problem where you supervise N directly, using some form of masking to keep all the results around.
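The first point can be checked directly with tf.GradientTape. The sketch below uses a toy variable w and input x (both hypothetical, not the model from the question) and compares a loss built from tf.shape with a loss built from the output values.

```python
import numpy as np
import tensorflow as tf

w = tf.Variable([[1.0], [2.0]])   # toy weights, shape (2, 1)
x = tf.constant([[3.0, 4.0]])     # toy input, shape (1, 2)

with tf.GradientTape(persistent=True) as tape:
    y = x @ w  # shape (1, 1)

    # Loss built only from the shape: an integer with no dependence on w's values.
    shape_loss = tf.cast(tf.shape(y)[1], tf.float32) * 100.0

    # Loss built from the output values: a smooth function of w.
    value_loss = tf.reduce_sum(y ** 2)

grad_shape = tape.gradient(shape_loss, w)  # None: nothing to differentiate
grad_value = tape.gradient(value_loss, w)  # a proper gradient tensor

print(grad_shape)
print(grad_value.numpy())
```

Passing a gradient of None to an optimizer is what produces the "No gradients provided for any variable" error.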

Optimizing Values that are on GPU

I am trying to optimize a PyTorch tensor which I am also using as input to a network. Let's call this tensor "shape". My optimizer is as follows:
optimizer = torch.optim.Adam(
I am also getting vertex values using this "shape" tensor:
vertices = model(shape)
And my loss function calculates the loss as the difference between the inferred vertices and the ground truth vertices:
loss = torch.sqrt(((gt_vertices - vertices) ** 2).sum(2)).mean(1).mean()
So what I am doing is actually estimating shape value. I am only interested in shape values. This works perfectly fine when everything is on CPU. However, when I put my shape and models on GPU by calling to("cuda"), I am getting the classic non-leaf Tensor error:
ValueError: can't optimize a non-leaf Tensor
Calling .detach().cpu() on shape inside the optimizer solves the issue, but then gradients cannot flow as they should and my values are not updated. How can I make this work?
When calling .to('cuda'), e.g. shape_p = shape.to('cuda'), you are making a copy of shape. While shape remains a leaf tensor, shape_p is not one, because its 'parent' tensor is shape. Therefore shape_p is not a leaf, and trying to optimize it raises the error.
Sending it to the CUDA device after having set up the optimizer would solve the issue (though there are cases where this is not possible). Note that the optimizer then keeps updating the original CPU tensor, so the CUDA copy has to be re-created from it after each step.
>>> optimizer = torch.optim.Adam([shape], lr=0.0001)
>>> shape = shape.cuda()
The best option though, in my opinion, is to send it directly on init:
>>> shape = torch.rand(1, requires_grad=True, device='cuda')
>>> optimizer = torch.optim.Adam([shape], lr=0.0001)
This way it remains a leaf tensor.
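The leaf/non-leaf distinction can be inspected with the is_leaf attribute. In the sketch below, a dtype conversion stands in for the CPU-to-CUDA copy so that it also runs on a machine without a GPU; this substitution is an assumption for illustration, and both operations are tracked by autograd in the same way.

```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# A tensor created directly by the user is a leaf.
shape = torch.rand(1, requires_grad=True)

# Any autograd-tracked copy of it is not. A dtype conversion is used here as a
# stand-in for shape.to("cuda"), so the example does not need a GPU.
shape_copy = shape.to(torch.float64)
print(shape.is_leaf, shape_copy.is_leaf)

# Handing the non-leaf copy to an optimizer reproduces the error.
try:
    torch.optim.Adam([shape_copy], lr=1e-4)
    error_message = None
except ValueError as err:
    error_message = str(err)
print(error_message)

# Creating the tensor on the target device keeps it a leaf.
shape_on_device = torch.rand(1, requires_grad=True, device=device)
optimizer = torch.optim.Adam([shape_on_device], lr=1e-4)

# One optimization step against a toy target to confirm the value is updated.
target = torch.ones(1, device=device)
before = shape_on_device.detach().clone()
loss = ((shape_on_device - target) ** 2).sum()
loss.backward()
optimizer.step()
changed = not torch.equal(before, shape_on_device.detach())
print(shape_on_device.is_leaf, changed)
```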

Masking inputs to keras RNN written in Functional API

I'm having some problems making masking work with a keras RNN written in Functional API. The idea is to mask a tensor, zero-padded, with shape (batch_size, timesteps, 100) and feed it into a SimpleRNN. Right now I have the following:
input = keras.layers.Input(shape=(None, 100))
mask_layer = keras.layers.Masking(mask_value=0.)
mask = mask_layer(input)
rnn = keras.layers.SimpleRNN(20)
x = rnn(input, mask=mask)
However, this does not work, because it raises the following InvalidArgumentError:
InvalidArgumentError: Dimension 1 in both shapes must be equal, but are 20 and 2000. Shapes are [?,20] and [?,2000]. for 'Select' (op: 'Select') with input shapes: [?,2000], [?,20], [?,20].
By changing my Input's shape into (None, 1) - a sequential input where each element is a single integer, instead of n-dimensional embeddings - I've gotten this code to work. I've also gotten the same idea to work with the Sequential API, but I cannot do this, as my final model will have multiple inputs and outputs. I also do not want to force my Input's shape to be (None, 1), as I want to swap out different embedding models (Word2Vec, etc) during preprocessing, which means my Inputs will be embedding vectors from the start.
Can anyone help me with using masks with RNNs when using keras's functional API?
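According to Masking and Padding with Keras, you don't need to set the mask on the RNN layer manually. In your code, mask is the masked tensor returned by the Masking layer rather than a boolean mask, which is why the shapes don't match. In the following code, the RNN layer receives the mask automatically: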
import keras
input_layer = keras.layers.Input(shape=(None, 100))
masked_layer = keras.layers.Masking(mask_value=0.)(input_layer)
rnn_layer = keras.layers.SimpleRNN(20)(masked_layer)
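One way to convince yourself that the mask is applied is to feed the same sequence with and without zero padding and compare the outputs. This is a sketch using tf.keras and randomly generated inputs; the sequence lengths are arbitrary choices for illustration.

```python
import numpy as np
import tensorflow as tf

inputs = tf.keras.layers.Input(shape=(None, 100))
masked = tf.keras.layers.Masking(mask_value=0.0)(inputs)
outputs = tf.keras.layers.SimpleRNN(20)(masked)
model = tf.keras.Model(inputs, outputs)

rng = np.random.default_rng(0)

# Three real timesteps of 100-dimensional embeddings.
real = rng.normal(size=(1, 3, 100)).astype("float32")

# The same sequence followed by two all-zero padding timesteps.
padding = np.zeros((1, 2, 100), dtype="float32")
padded = np.concatenate([real, padding], axis=1)

out_real = np.asarray(model(real))
out_padded = np.asarray(model(padded))

# With the mask in place, the padded timesteps are skipped, so both
# inputs give the same final RNN state.
print(np.allclose(out_real, out_padded, atol=1e-5))
```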

Tensorflow: Loss function for Binary classification (without one hot labels)

I am trying to use binary cross-entropy for a binary classification problem and keep running into the following error. I have tried type casting as well as reshaping the tensor to shape [-1, 1], but nothing seems to work.
My Last 2 Layers are defined as,
dense_fin2 = tf.layers.dense(inputs = dense_fin, units = 128, name = "dense_fin2")
logits = tf.sigmoid(tf.layers.dense(inputs = dense_fin2, units = 1, name = "logits"))
Loss function,
loss = labels * -tf.log(logits) + (1 - labels) * -tf.log(1 - logits)
loss = tf.reduce_mean(loss)
Error thrown by tensorflow,
ValueError: Tensor conversion requested dtype int32 for Tensor with dtype float32: 'Tensor("Neg:0", shape=(?, 1), dtype=float32)'
Extra information,
I am using Estimator API coupled with Dataset API. I have integer labels i.e. 0 or 1. They are NOT one-hot encoded. I understand this is doable by one hot encoding my labels but I do not want to take that path.
This error likely comes from trying to multiply integer-type labels with float-type logits. You can explicitly cast the labels to float via tf.cast(labels, dtype=tf.float32). Unfortunately, your question does not reveal whether you tried this specific cast.
However, for reasons of numerical stability I would advise you to use tf.nn.sigmoid_cross_entropy_with_logits instead (or tf.losses.sigmoid_cross_entropy). These functions expect raw logits (the log-odds before the sigmoid) and apply the sigmoid internally. Your variable named logits already holds probabilities, because tf.sigmoid is applied to the output layer, so your hand-written formula is mathematically correct; if you switch to the built-in functions, pass them the output of the dense layer without the tf.sigmoid. The built-in functions are preferable for stability.
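To illustrate the stability argument, here is a NumPy sketch comparing the hand-written formula with the rearranged form max(x, 0) - x * z + log(1 + exp(-|x|)) that the TensorFlow documentation gives for sigmoid_cross_entropy_with_logits. The function names and sample values are made up for the example, and the astype call plays the role of tf.cast.

```python
import numpy as np

def naive_bce(labels, probs):
    """Hand-written cross-entropy on probabilities, as in the question."""
    with np.errstate(divide="ignore", invalid="ignore"):
        return labels * -np.log(probs) + (1 - labels) * -np.log(1 - probs)

def stable_bce_from_logits(labels, logits):
    """Rearranged form that never takes the log of 0."""
    return (
        np.maximum(logits, 0)
        - logits * labels
        + np.log1p(np.exp(-np.abs(logits)))
    )

# Integer 0/1 labels, cast to float before mixing with float logits.
labels = np.array([0, 1, 0]).astype(np.float64)

# The last logit is large and confidently wrong for its label.
logits = np.array([0.5, -2.0, 100.0])
probs = 1.0 / (1.0 + np.exp(-logits))

naive = naive_bce(labels, probs)
stable = stable_bce_from_logits(labels, logits)

print(naive)
print(stable)
```

For moderate logits the two agree, but for the large logit the sigmoid rounds to exactly 1.0, so the hand-written version takes log(0) and returns an infinite loss, while the rearranged form stays finite.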
