I have difficulty understanding how exactly masking works in TensorFlow/Keras. The Keras guide (https://www.tensorflow.org/guide/keras/masking_and_padding) simply says that the neural network layers skip/ignore the masked values, but it doesn't explain how. Does it force the weights to zero? (I know a boolean array is being created, but I don't know how it's being used.)
For example check this simple example:
import numpy as np
import tensorflow as tf

tf.random.set_seed(1)
embedding = tf.keras.layers.Embedding(input_dim=10, output_dim=3, mask_zero=True)
masked_output = embedding(np.array([[1,2,0]]))
print(masked_output)
I asked the Embedding layer to mask zero inputs. Now look at the output:
tf.Tensor(
[[[ 0.00300496 -0.02925059 -0.01254098]
[ 0.04872786 0.01087702 -0.03656749]
[ 0.00446818 0.00290152 -0.02269397]]], shape=(1, 3, 3), dtype=float32)
If you change the "mask_zero" argument to False you get the exact same results. Does anyone know what's happening behind the scenes? Any resources explaining the masking mechanism more thoroughly are highly appreciated.
P.S.: Here is also an example of a full neural network which gives an identical outcome with and without masking:
tf.random.set_seed(1)
input = np.array([[1,2,0]]) # <--- 0 should be masked and ignored
embedding = tf.keras.layers.Embedding(input_dim=10, output_dim=3, mask_zero=True)
masked_output = embedding(input)
flatten = tf.keras.layers.Flatten()(masked_output)
dense_middle = tf.keras.layers.Dense(4)(flatten)
out = tf.keras.layers.Dense(1)(dense_middle)
print(out)
In TensorFlow/Keras, masking lets downstream layers ignore certain timesteps of a sequence, typically padding, during the forward pass. This is helpful when dealing with sequences of varying length, where padding is used to make all sequences the same length.
In the example you provided, the outcome is the same regardless of whether mask_zero is set to True or False because the Embedding layer never changes its output values when masking is enabled; the padded timestep is still embedded like any other index. What masking adds is metadata attached to the output.
Under the hood, the Embedding layer computes a boolean mask (inputs != 0) with the same shape as the input and propagates it alongside its output (as the _keras_mask attribute). Only mask-consuming layers such as LSTM, GRU or attention layers receive this mask in their call and skip the masked timesteps; layers such as Flatten and Dense do not consume the mask, which is why your full-network example also prints identical results with and without masking.
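A quick way to see this (a minimal sketch; it is essentially what the linked guide does):

import numpy as np
import tensorflow as tf

embedding = tf.keras.layers.Embedding(input_dim=10, output_dim=3, mask_zero=True)
inputs = np.array([[1, 2, 0]])
out = embedding(inputs)

# The embedded values are identical with or without mask_zero; what changes is the
# boolean mask that gets computed and attached to the output tensor.
print(embedding.compute_mask(inputs))  # [[ True  True False]]
print(out._keras_mask)                 # the same mask, carried along with the output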
For the mask to actually change a model's behavior, a downstream layer has to consume it, for example:
inputs = tf.keras.layers.Input(shape=(3,))
embedding = tf.keras.layers.Embedding(input_dim=10, output_dim=3, mask_zero=True)(inputs)
lstm = tf.keras.layers.LSTM(4)(embedding)   # the LSTM consumes the mask and skips the masked timesteps
output = tf.keras.layers.Dense(1)(lstm)
model = tf.keras.Model(inputs, output)
With a mask-consuming layer like the LSTM in place, the network processes the zero timestep like any other token when mask_zero is False and skips it when mask_zero is True, so the two settings produce different predictions.
I'm trying to recreate a transformer that was written in PyTorch and port it to TensorFlow. Everything was going pretty well until the two versions of multi-head attention started giving extremely different outputs. Both are implementations of multi-headed attention as described in the paper "Attention Is All You Need", so they should be able to produce the same output.
I'm converting
self_attn = nn.MultiheadAttention(dModel, nheads, dropout=dropout)
to
self_attn = MultiHeadAttention(num_heads=nheads, key_dim=dModel, dropout=dropout)
For my tests, dropout is 0.
I'm calling them with:
self_attn(x,x,x)
where x is a tensor with shape=(10, 128, 50)
As expected from the documentation, the PyTorch version returns a tuple of two tensors (the attention output and the attention weights); the attention output has shape [10, 128, 50].
I'm having trouble getting the TensorFlow version to do the same thing. With TensorFlow I only get one tensor back (of shape [10, 128, 50]), and its values match neither of the two tensors returned by PyTorch.
Based on the TensorFlow documentation, I should be getting something comparable.
How can I get them to operate the same way? I'm guessing I'm doing something wrong with TensorFlow, but I can't figure out what.
nn.MultiheadAttention by default outputs a tuple of two tensors:
attn_output -- the result of the self-attention operation
attn_output_weights -- the attention weights, averaged(!) over the heads
At the same time, tf.keras.layers.MultiHeadAttention by default outputs only one tensor, attention_output (which corresponds to attn_output in PyTorch). The attention weights of all heads will also be returned if the parameter return_attention_scores is set to True, like:
output, scores = self_attn(x, x, x, return_attention_scores=True)
The scores tensor should also be averaged over the heads axis to achieve full correspondence with PyTorch:
scores = tf.math.reduce_mean(scores, 1)
While rewriting, keep in mind that by default (as in the snippet in the question) nn.MultiheadAttention expects input in the form (seq_length, batch_size, embed_dim), whereas tf.keras.layers.MultiHeadAttention expects it in the form (batch_size, seq_length, embed_dim).
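Putting these pieces together, a minimal sketch of the TensorFlow side (dModel=50 comes from the shapes in the question; nheads=5 is just an illustrative choice):

import tensorflow as tf

seq_len, batch, d_model, n_heads = 10, 128, 50, 5
x = tf.random.normal((seq_len, batch, d_model))       # PyTorch-style (seq, batch, embed) layout

self_attn = tf.keras.layers.MultiHeadAttention(num_heads=n_heads, key_dim=d_model, dropout=0.0)

x_tf = tf.transpose(x, perm=[1, 0, 2])                # Keras wants (batch, seq, embed)
output, scores = self_attn(x_tf, x_tf, x_tf, return_attention_scores=True)

scores = tf.math.reduce_mean(scores, axis=1)          # average over heads -> (128, 10, 10)
output = tf.transpose(output, perm=[1, 0, 2])         # back to (10, 128, 50) if you want PyTorch's layout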
I am trying to predict a single image. But my model returns a prediction array with the shape (1,1,1,2048) when it should be (1,10). Any idea what I am doing wrong? My x input shape is correct at (1,32,32,3).
def ResNet50V2():
    IMG_SHAPE = (32, 32, 3)
    return tf.keras.applications.ResNet50V2(input_shape=IMG_SHAPE, include_top=False, weights=None, classes=10)
model = ResNet50V2()
x = x[None, :]
predictions = model.predict(x)
You are loading your Keras model with the parameter
include_top=False
which cuts off the fully-connected projection layer that is responsible for projecting the model output to your expected number of classes. Change the parameter to True.
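A minimal sketch of that change (with weights=None a custom number of classes and a 32x32 input are allowed):

def ResNet50V2():
    IMG_SHAPE = (32, 32, 3)
    # include_top=True keeps the global-average-pooling + Dense(classes) classification head.
    return tf.keras.applications.ResNet50V2(input_shape=IMG_SHAPE, include_top=True,
                                            weights=None, classes=10)

model = ResNet50V2()
predictions = model.predict(x)   # x of shape (1, 32, 32, 3) -> predictions of shape (1, 10)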
That's because you are disabling the top with include_top=False, which removes the final classification layer. You need to either add your own head with 10 classes, or set include_top to True and retrain the network with the desired inputs.
An image classification network usually works in two processing steps.
The first one is feature extraction; we call that the "base", and it consists of a stack of layers that find and reinforce patterns in the image (Conv2D, ReLU and MaxPool).
The second one is the "head", and it is used to classify the features extracted in the previous step.
Your code is getting the raw output of the "base", with no classification, and as the other gentle people stated, the solution is adding a custom "head" or changing the include_top parameter to True.
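Alternatively, a minimal sketch of keeping include_top=False and attaching a custom "head" (the pooling and Dense layers here are illustrative choices, not the only option):

import tensorflow as tf

IMG_SHAPE = (32, 32, 3)
base = tf.keras.applications.ResNet50V2(input_shape=IMG_SHAPE, include_top=False, weights=None)

inputs = tf.keras.Input(shape=IMG_SHAPE)
x = base(inputs)                                              # (None, 1, 1, 2048) feature maps
x = tf.keras.layers.GlobalAveragePooling2D()(x)               # (None, 2048)
outputs = tf.keras.layers.Dense(10, activation='softmax')(x)  # (None, 10)
model = tf.keras.Model(inputs, outputs)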
I may be mistaken, but it seems that PyTorch Transformers are autoregressive, which is what masking is for. However, I've seen some implementations where people use just the Encoder and output that directly to a Linear layer.
In my case, I'm trying to convert a spectrogram (rows are frequencies and columns are timesteps) to another spectrogram of the same dimensions. I'm having an impossible time trying to figure out how to do this.
For my model, I have:
import torch
import torch.nn as nn

# PositionalEncoding is assumed to be defined elsewhere (e.g. the sinusoidal encoding
# from the PyTorch transformer tutorial).

class TransformerReconstruct(nn.Module):
    def __init__(self, feature_size=250, num_layers=1, dropout=0.1, nhead=10, output_dim=1):
        super(TransformerReconstruct, self).__init__()
        self.model_type = 'Transformer'
        self.src_mask = None
        self.pos_encoder = PositionalEncoding(feature_size)
        self.encoder_layer = nn.TransformerEncoderLayer(d_model=feature_size, nhead=nhead, dropout=dropout)
        self.transformer_encoder = nn.TransformerEncoder(self.encoder_layer, num_layers=num_layers)
        self.decoder = nn.Linear(feature_size, output_dim)
        self.init_weights()

    def init_weights(self):
        initrange = 0.1
        self.decoder.bias.data.zero_()
        self.decoder.weight.data.uniform_(-initrange, initrange)

    def forward(self, src):
        if self.src_mask is None or self.src_mask.size(0) != len(src):
            device = src.device
            mask = self._generate_square_subsequent_mask(len(src)).to(device)
            self.src_mask = mask
        src = self.pos_encoder(src)
        output = self.transformer_encoder(src, self.src_mask)
        output = self.decoder(output)
        return output

    def _generate_square_subsequent_mask(self, sz):
        mask = (torch.triu(torch.ones(sz, sz)) == 1).transpose(0, 1)
        mask = mask.float().masked_fill(mask == 0, float('-inf')).masked_fill(mask == 1, float(0.0))
        return mask
And when training, I have:
model = TransformerReconstruct(feature_size=128, nhead=8, output_dim=128, num_layers=6).to(device)
This returns the right shape, but doesn't seem to learn.
My basic training loop looks like:
for i in range(0, len(data_source) - 1, input_window):
    data, target = get_batch(data_source, i, 1)
    output = recreate_model(data)
I'm using an MSELoss and I'm trying to learn a very simple identity mapping, where the input and output are the same; however, the model is not learning. What could I be doing wrong? Thanks in advance.
Most of the models in Huggingface Transformers are some version of BERT and thus not autoregressive; the only exceptions are decoder-only models (GPT and similar) and sequence-to-sequence models.
There are two conceptually different types of masks. One is the input mask, which is specific to the input batch; its purpose is to allow using sequences of different lengths in a single batch. When the sequences get padded to the same length, the self-attention should not attend to the padding positions. This is what you are supposed to use when you call self.transformer_encoder in the forward method.
In addition, the autoregressive Transformer decoder uses another type of mask: the triangular mask that prevents the self-attention from attending to tokens to the right of the current position (at inference time, the words to the right of the current position are unknown before they are actually generated). This is what you have in the _generate_square_subsequent_mask method, and this is what makes the model autoregressive. It is constant and does not depend on the input batch.
To summarize: to have a bidirectional Transformer, just get rid of the triangular mask. If your input sequences are of different lengths, use batch-specific (padding) masking; if not, simply pass no attention mask at all (mask=None).
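For reference, a sketch of what such a batch-specific padding mask could look like (the lengths are made up; nn.TransformerEncoder expects src_key_padding_mask of shape (batch, seq_len) with True marking the padded positions to ignore):

import torch

lengths = torch.tensor([5, 3, 7])       # hypothetical true lengths of the sequences in the batch
max_len = 7
# True marks padded positions that self-attention should not attend to.
src_key_padding_mask = torch.arange(max_len)[None, :] >= lengths[:, None]   # shape (3, 7)

# output = self.transformer_encoder(src, mask=None, src_key_padding_mask=src_key_padding_mask)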
If you want the model to stop behaving in an autoregressive manner, you need to 'unhide' the tokens to the right of the current token, i.e. modify/remove _generate_square_subsequent_mask.
How you modify this depends on the task. Are you trying to recover 'corrupted' input sequences? Then mask a random subset of tokens and treat it as an autoencoder.
If you just wish to approximate the identity function, remove the mask completely.
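For example, a minimal sketch of the forward method with the causal mask removed (everything else in the TransformerReconstruct class above left unchanged):

def forward(self, src):
    src = self.pos_encoder(src)
    # No attention mask: every position can attend to every other position (bidirectional).
    output = self.transformer_encoder(src)
    return self.decoder(output)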
I am using the Keras functional API. I have some model which outputs a probability distribution by means of a softmax layer:
action_logits = Dense(units=self.action_space, activation='softmax')(prev_layer)
Next, I mask out illegal actions (or classes, if you will) by multiplying the logits with a bitvector representing the legal actions:
mask_illegal_moves = keras.layers.multiply([action_logits, valid_actions])
Finally, I want to renormalize the logits, now that I've set the output for some actions to 0. This seems like a very simple thing to do, yet I can't get it to work. For example, another softmax layer did not yield the desired results. Moreover, googling any 'normalization' layer in Keras mostly led me to BatchNorm, which is not what I'm interested in here.
Any tips would be greatly appreciated!
You can do the following,
action_logits = Dense(units=self.action_space)(prev_layer)
# push illegal actions to a very large negative logit so that softmax gives them ~0 probability
action_logits_masked = Lambda(lambda t: t[0] + (1.0 - t[1]) * -1e9)([action_logits, valid_actions])
action_probs = Activation('softmax')(action_logits_masked)
Explained:
First we get the logits (note that "logits" is what the values are called before applying softmax), so don't use activation='softmax' on this Dense layer.
Mask the logits additively. Adding a large negative number to the illegal entries drives their softmax output to (practically) zero; simply multiplying the logits by a 0/1 mask would not work, because a zeroed logit still contributes exp(0) = 1 to the softmax denominator.
Apply softmax to the masked logits.
I should have clarified my original question. See the helpful response by thushv89. I have found the solution, which was to use a Lambda layer:
action_probs = Dense(units=self.action_space, activation='softmax')(skip_2)
action_probs_masked = Multiply()([action_probs, valid_actions])
layer = Lambda(lambda x: x / keras.backend.sum(x, axis=1)[:,None])
actions = layer(action_probs_masked)
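Both approaches give the same distribution: renormalizing the masked probabilities is mathematically equivalent to applying the softmax over only the legal logits. A quick sanity check with made-up numbers (a sketch, not part of the original model):

import numpy as np

logits = np.array([2.0, -1.0, 3.0])
mask = np.array([1.0, 1.0, 0.0])           # third action is illegal

# (a) mask the logits additively, then softmax
masked_logits = logits + (1.0 - mask) * -1e9
p_a = np.exp(masked_logits) / np.exp(masked_logits).sum()

# (b) softmax first, then zero out illegal actions and renormalize
p = np.exp(logits) / np.exp(logits).sum()
p_b = (p * mask) / (p * mask).sum()

print(np.allclose(p_a, p_b))               # True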
I want to implement a model like DSSM (Deep Semantic Similarity Model).
I want to train one RNN model and use this model to get three hidden vectors for three different inputs, and use these hidden vectors to compute the loss function.
I tried to code it in a variable scope with reuse=None, like:
gru_cell = tf.nn.rnn_cell.GRUCell(size)
gru_cell = tf.nn.rnn_cell.DropoutWrapper(gru_cell,output_keep_prob=0.5)
cell = tf.nn.rnn_cell.MultiRNNCell([gru_cell] * 2, state_is_tuple=True)
embedding = tf.get_variable("embedding", [vocab_size, wordvec_size])
inputs = tf.nn.embedding_lookup(embedding, self._input_data)
inputs = tf.nn.dropout(inputs, 0.5)
with tf.variable_scope("rnn"):
    _, rnn_states_1 = tf.nn.dynamic_rnn(cell, inputs, sequence_length=self.lengths, dtype=tf.float32)
    self._states_1 = rnn_states_1[config.num_layers-1]
with tf.variable_scope("rnn", reuse=True):
    _, rnn_states_2 = tf.nn.dynamic_rnn(cell, inputs, sequence_length=self.lengths, dtype=tf.float32)
    self._states_2 = rnn_states_2[config.num_layers-1]
I use the same inputs and reuse the RNN model, but when I print self._states_1 and self._states_2, these two vectors are different.
I use with tf.variable_scope("rnn", reuse=True): to compute rnn_states_2 because I want to use the same RNN model as for rnn_states_1.
But why do I get different hidden vectors with the same inputs and the same model?
Where did I go wrong?
Thanks for your answers.
Update:
I found that the reason may be tf.nn.rnn_cell.DropoutWrapper: when I remove the dropout wrapper, the hidden vectors are the same; when I add the dropout wrapper, these vectors become different.
So the new questions are:
How can I fix the part of the vector which gets 'dropped out'? By setting the 'seed' parameter?
When training a DSSM, should I fix the dropout action?
If you structure your code to use tf.contrib.rnn.DropoutWrapper, you can set variational_recurrent=True in your wrapper, which causes the same dropout mask to be used at all steps, i.e. the dropout mask will be constant. Is that what you want?
Setting the seed parameter in tf.nn.dropout will just make sure that you get the same sequence of dropout masks every time you run with that seed. That does not mean the dropout mask will be constant, just that you'll always see the same dropout mask at a particular iteration. The mask will be different for every iteration.
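A minimal sketch of that option (TF 1.x contrib API as mentioned above, reusing the size and keep probability from the question):

gru_cell = tf.contrib.rnn.GRUCell(size)
gru_cell = tf.contrib.rnn.DropoutWrapper(
    gru_cell,
    output_keep_prob=0.5,
    variational_recurrent=True,   # reuse the same dropout mask at every timestep
    dtype=tf.float32)             # dtype is required when variational_recurrent=True
cell = tf.contrib.rnn.MultiRNNCell([gru_cell] * 2, state_is_tuple=True)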