Masking layer vs attention_mask parameter in MultiHeadAttention

I use the MultiHeadAttention layer in my transformer model (the model is very similar to named entity recognition models). Because my data comes in different lengths, I use padding and the attention_mask parameter of MultiHeadAttention to mask the padding. If I used a Masking layer before MultiHeadAttention, would it have the same effect as the attention_mask parameter? Or should I use both: attention_mask and the Masking layer?

The TensorFlow documentation on Masking and padding with Keras may be helpful.
The following is an excerpt from that guide.
When using the Functional API or the Sequential API, a mask generated
by an Embedding or Masking layer will be propagated through the
network for any layer that is capable of using them (for example, RNN
layers). Keras will automatically fetch the mask corresponding to an
input and pass it to any layer that knows how to use it.
tf.keras.layers.MultiHeadAttention also supports automatic mask propagation as of TF 2.10.0.
Improved masking support for tf.keras.layers.MultiHeadAttention.
Implicit masks for query, key and value inputs will automatically be
used to compute a correct attention mask for the layer. These padding
masks will be combined with any attention_mask passed in directly when
calling the layer. This can be used with tf.keras.layers.Embedding
with mask_zero=True to automatically infer a correct padding mask.
Added a use_causal_mask call time argument to the layer. Passing
use_causal_mask=True will compute a causal attention mask, and
optionally combine it with any attention_mask passed in directly when
calling the layer.
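
In other words, on TF 2.10+ you can let the padding mask flow into the attention computation implicitly. A minimal sketch (the vocabulary size and dimensions here are illustrative):
import tensorflow as tf

# Embedding(mask_zero=True) marks token id 0 as padding; on TF >= 2.10
# MultiHeadAttention picks this implicit mask up automatically, and any
# attention_mask passed explicitly is combined with it.
inputs = tf.keras.Input(shape=(None,), dtype="int32")
x = tf.keras.layers.Embedding(input_dim=1000, output_dim=64,
                              mask_zero=True)(inputs)
attn = tf.keras.layers.MultiHeadAttention(num_heads=4, key_dim=64)(x, x)
model = tf.keras.Model(inputs, attn)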

The Masking layer keeps the input values as they are and creates a mask vector that is propagated to the following layers if they need one (like RNN layers). You can use it if you implement your own model. If you use models from Hugging Face, you can use a Masking layer if, for example, you want to save the mask vector for future use; otherwise the masking operations are already built in, so there is no need to add any Masking layer at the beginning.
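As a small illustration of that pass-through behavior (layer sizes here are arbitrary), a Masking layer leaves the values untouched and only attaches a mask that a downstream mask-consuming layer such as LSTM picks up:
import tensorflow as tf

# Masking passes values through unchanged and attaches a mask for
# timesteps whose features are all equal to mask_value.
inputs = tf.keras.Input(shape=(None, 16))
x = tf.keras.layers.Masking(mask_value=0.0)(inputs)
outputs = tf.keras.layers.LSTM(32)(x)   # LSTM consumes the propagated mask
model = tf.keras.Model(inputs, outputs)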

Related

Quantized Convolution Layer Operation in TensorFlow Lite

I want to understand the basic operation done in a convolution layer of a quantized model in TensorFlow Lite.
As a baseline, I chose a pretrained TensorFlow model, EfficientNet-lite0-int8, and used a sample image as input for the model's inference. Thereafter, I managed to extract the output tensor of the first fused ReLU6 convolution layer and compared that output with the one from my custom Python implementation.
The deviation between the two tensors was large, and something I cannot explain is that TensorFlow's output tensor was not within the range [0, 6] as expected (I expected that because of the fused ReLU6 layer in the Conv layer).
Could you please give me a more detailed description of how a quantized fused ReLU6 Conv2D layer operates in TensorFlow Lite?
After studying TensorFlow's GitHub repository carefully, I found the kernel_util.cc file and the CalculateActivationRangeUint8 function. Using this function, I managed to understand why the quantized fused ReLU6 Conv2D layer's output tensor is clipped not to [0, 6] but to [-128, 127]. For the record, I managed to implement a Conv2D layer's operation in Python in a few simple steps.
First, retrieve the layer's parameters (kernel, bias, scales, offsets) using the interpreter.get_tensor_details() command, and calculate the output_multiplier using the GetQuantizedConvolutionMultipler and QuantizeMultiplierSmallerThanOne functions.
After that, subtract the input offset from the input before padding it, and implement a simple convolution.
Next, use the MultiplyByQuantizedMultiplierSmallerThanOne function, which relies on SaturatingRoundingDoublingHighMul and RoundingDivideByPOT from the gemmlowp/fixedpoint.h library.
Finally, add output_offset to the result and clip it using the values obtained from the CalculateActivationRangeUint8 function.
Link to the issue on the project's GitHub page
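To make those steps concrete, here is a rough single-channel NumPy sketch of an int8 fused-ReLU6 convolution. The shapes and the plain float multiply are simplifications of TFLite's fixed-point code (SaturatingRoundingDoublingHighMul / RoundingDivideByPOT), so treat it as illustrative, not exact:
import numpy as np

def conv2d_int8(inp, kernel, bias, input_offset, output_offset,
                real_multiplier, act_min=-128, act_max=127):
    # kernel, bias, offsets and real_multiplier would come from
    # interpreter.get_tensor_details(); here they are plain arguments.
    x = inp.astype(np.int32) - input_offset        # subtract input offset
    k = kernel.astype(np.int32)
    kh, kw = k.shape
    h, w = inp.shape
    x = np.pad(x, ((kh // 2,), (kw // 2,)))        # 'SAME' padding
    out = np.zeros((h, w), dtype=np.int64)
    for i in range(h):                             # naive convolution
        for j in range(w):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k) + bias
    # Float stand-in for the gemmlowp fixed-point requantization:
    out = np.round(out * real_multiplier) + output_offset
    # act_min/act_max come from CalculateActivationRangeUint8; for int8
    # ReLU6 they are the quantized bounds, not literal 0 and 6.
    return np.clip(out, act_min, act_max).astype(np.int8)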

TimeDistributed layer, but with different weights

I'm trying to apply a separate convolution to each layer of a 3-dimensional array, which brought me to the Keras TimeDistributed layer. But the documentation notes that:
"Because TimeDistributed applies the same instance of Conv2D to each of the
timestamps, the same set of weights are used at each timestamp."
However, I want to perform a separate convolution (with independently defined weights/filters) for each layer of the array, not the same set of weights. Is there some built-in way to do this? Any help is appreciated!
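As far as I know there is no built-in per-slice-weights variant of TimeDistributed, but one possible workaround (a sketch with illustrative shapes) is to slice the stacked axis and give each slice its own Conv2D:
import tensorflow as tf

N = 3  # number of slices along the stacked axis, illustrative
inputs = tf.keras.Input(shape=(N, 32, 32, 1))
# One independently weighted Conv2D per slice, unlike TimeDistributed.
convs = [tf.keras.layers.Conv2D(8, 3, padding="same") for _ in range(N)]
slices = [convs[i](inputs[:, i]) for i in range(N)]
outputs = tf.stack(slices, axis=1)  # re-stack to (batch, N, 32, 32, 8)
model = tf.keras.Model(inputs, outputs)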

How is this generating an image?

I went through the GAN tutorial using TensorFlow on the official TensorFlow site.
There I came across this snippet:
generator = make_generator_model()
noise = tf.random.normal([1, 100])
generated_image = generator(noise, training=False)
plt.imshow(generated_image[0, :, :, 0], cmap='gray')
make_generator_model() returns a Sequential model. Yeah, that's cool. But what about generated_image? Isn't it a tensor? How can we just generate an image and inspect it when we have not run a session, and how is it that the matplotlib pyplot function is plotting a tensor object? It should be a numpy array; as far as I know, pyplot accepts numpy arrays to plot an image, doesn't it? Can anyone help me with this?
That method is defined as
import tensorflow as tf
from tensorflow.keras import layers

def make_generator_model():
    model = tf.keras.Sequential()
    model.add(layers.Dense(4*4*1024, use_bias=False, input_shape=(100,)))
    model.add(layers.BatchNormalization())
    model.add(layers.LeakyReLU())
    # ... the full tutorial continues with Reshape and Conv2DTranspose layers
    return model
As you can see, what you get is a tf.keras.Sequential
Dense Layer
In Keras, you create layers to build models. A model is usually a network of layers; the most common type is a stack of layers.
Adding a densely-connected layer as the first layer means the model takes as input arrays of shape (*, 100), and the shape of the data will be (*, 4*4*1024) after that layer. Beyond that you won't need to specify the size of the input, because of automatic shape inference.
Batch normalization functions similarly to preprocessing at every layer of the network.
ReLU is linear for all positive values and set to zero for all negative values. Leaky ReLU has a smaller slope for negative values, instead of altogether zero.
For example, leaky ReLU may have y = 0.01x when x < 0
More info https://towardsdatascience.com/developing-a-dcgan-model-in-tensorflow-2-0-396bc1a101b2
The tutorial uses TF 2.0, which employs eager execution by default. This means that ops are run as they are defined, similar to e.g. PyTorch. Because of this, you can think about control flow in a much more "natural" way (as with numpy functions). Calling the generator immediately returns a tensor with values (which plt.imshow converts to a numpy array); there are no more sessions. I encourage you to check out the tutorials on the TF website that discuss the 2.0 changes.
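For example (the shape is illustrative), in eager mode a tensor's values exist as soon as the op runs, and matplotlib can consume the tensor directly:
import tensorflow as tf
import matplotlib.pyplot as plt

t = tf.random.normal([1, 28, 28, 1])    # the op runs immediately, no session
print(t.numpy().shape)                  # the values are already available
plt.imshow(t[0, :, :, 0], cmap='gray')  # imshow converts the tensor itself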

Customized convolutional layer in TensorFlow

Let's assume I want to make the following layer in a neural network: instead of having a square convolutional filter that moves over some image, I want the filter to have some other shape, say a rectangle, circle, or triangle (this is of course a silly example; the real case I have in mind is different). How would I implement such a layer in TensorFlow?
I found that one can define custom layers in Keras by extending tf.keras.layers.Layer, but the documentation is quite limited and has few examples. A Python implementation of a convolutional layer, for example by extending tf.keras.layers.Layer, would probably help as well, but it seems that the convolutional layers are implemented in C++. Does this mean that I have to implement my custom layer in C++ to get any reasonable speed, or would Python TensorFlow operations be enough?
Edit: Perhaps it is enough if I can define a tensor of weights in which certain entries are forced to be identically zero and some weights appear in multiple places; then I should be able to build a convolutional layer (and other layers) by hand. How would I do this, and how would I include these variables in training?
Edit2: Let me add some more clarification. Take the example of building a 5x5 convolutional layer with one output channel from scratch. If the input is, say, 10x10 (plus padding, so the output is also 10x10), I would imagine doing this by creating a matrix of size 100x100. Then I would fill in the 25 weights at the correct locations in this matrix (so some entries are zero, and some entries are equal, i.e. all 25 weights show up in many locations). I would then multiply the input by this matrix to get the output. So my question is twofold: 1. How do I do this in TensorFlow? 2. Would this be very inefficient, and is some other approach recommended (assuming I later want to customize what the filter looks like, so the standard conv2d is not good enough)?
Edit3: It seems doable using sparse tensors and assigning values via a previously defined tf.Variable. However, I don't know whether this approach will suffer from performance issues.
Just use regular conv. layers with square filters, and zero out some values after each weight update:
g = tf.get_default_graph()
# TF1-style: run one training step through the session
sess.run(train_step, feed_dict={x: batch_xs, y_: batch_ys})
# fetch the first conv layer's kernel and re-apply the binary mask
conv1_filter = g.get_tensor_by_name('conv1:0')
sess.run(tf.assign(conv1_filter, tf.multiply(conv1_filter, my_mask)))
where my_mask is a binary tensor (of the same shape and type as your filters) that matches the desired pattern.
EDIT: If you're not familiar with TensorFlow, you might be confused by the code above. I recommend looking at this example, and specifically at the way the model is constructed (if you do it like this, you can access the first layer's filters as 'conv1/weights'). Also, I recommend switching to PyTorch :)
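If you are on TF2/Keras rather than the session API above, a similar effect can be sketched with a kernel constraint, which Keras re-applies after every weight update (the mask must have the kernel's shape; this is an illustrative sketch, not the answer's original code):
import tensorflow as tf

class MaskWeights(tf.keras.constraints.Constraint):
    """Re-apply a fixed binary mask to the kernel after each update."""
    def __init__(self, mask):
        self.mask = tf.constant(mask, dtype=tf.float32)
    def __call__(self, w):
        return w * self.mask

# my_mask must have the kernel's shape: (kh, kw, in_channels, out_channels)
# conv = tf.keras.layers.Conv2D(16, 5, kernel_constraint=MaskWeights(my_mask))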

Tensorflow: How does gen_nn_ops.max_pool_grad_v2() work?

I am working on a project where I need deconvolution. I read that gen_nn_ops.max_pool_grad_v2() can do that, so I load the function from tensorflow.python.ops.
As far as I understand, the function takes an input and an output tensor, where the input is the convolutional layer before max pooling and the output is the result of the max pooling operation. But what is grad? And what exactly does the output of the function represent?
ksize = [1,2,2,1]
strides = [1,2,2,1]
padding = 'SAME'
u = gen_nn_ops.max_pool_grad_v2(input, output, grad, ksize, strides, padding)
Unfortunately I did not find anything useful on the Internet.
Regarding deconvolution, max_pool_grad_v2 is probably not the op you're looking for; you probably want the Keras layer Conv2DTranspose instead.
max_pool_grad_v2 is a gradient function for computing the gradient of the max pooling function (you'll see it used for that very purpose internally within TensorFlow). A gradient function such as _MaxPoolGradGrad computes gradients with respect to the op's inputs given gradients with respect to the op's outputs. You don't really need to understand how gradients are implemented in TensorFlow in order to use it, unless you want to implement some of your own; if you do, there is a guide on the main TensorFlow site.
