Quantized Convolution Layer Operation in TensorFlow Lite - Python

I want to understand the basic operation performed in a convolution layer of a quantized model in TensorFlow Lite.
As a baseline, I chose a pretrained TensorFlow model, EfficientNet-lite0-int8, and used a sample image as input for the model's inference. I then extracted the output tensor of the first fused ReLU6 convolution layer and compared it with the output of my custom Python implementation of the same layer.
The deviation between the two tensors was large, and something I cannot explain is that TensorFlow's output tensor was not in the range [0, 6] as expected (I expected that because of the ReLU6 fused into the Conv layer).
Could you please give me a more detailed description of how a quantized fused ReLU6 Conv2D layer operates in TensorFlow Lite?

After studying TensorFlow's GitHub repository carefully, I found the kernel_util.cc file and the CalculateActivationRangeUint8 function. Using it, I understood why the quantized fused ReLU6 Conv2D layer's output tensor is clipped not to [0, 6] but to [-128, 127]: the output's quantization parameters already map the real range [0, 6] onto the full int8 range. For the record, I managed to implement a Conv2D layer's operation in Python with a few simple steps.
First, get the layer's parameters (kernel, bias, scales, offsets) using interpreter.get_tensor_details() and compute the output_multiplier using the GetQuantizedConvolutionMultipler and QuantizeMultiplierSmallerThanOne functions.
After that, subtract the input offset from the input tensor before padding it, and run a plain integer convolution.
Next, apply MultiplyByQuantizedMultiplierSmallerThanOne, which uses SaturatingRoundingDoublingHighMul and RoundingDivideByPOT from the gemmlowp/fixedpoint.h library.
Finally, add the output_offset to the result and clip it to the range returned by CalculateActivationRangeUint8.
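Putting these steps together, here is a rough NumPy sketch of the arithmetic. This is my own simplification, not the actual TFLite kernel: the weight layout is assumed to be HWIO, and the fixed-point output_multiplier is approximated with a plain float.

import numpy as np

def quantized_conv2d_sketch(x_q, w_q, bias_q,
                            in_offset, w_offset, out_offset,
                            in_scale, w_scale, out_scale,
                            act_min=-128, act_max=127, stride=1, pad=0):
    # Rough sketch only: the real TFLite kernel rescales with a fixed-point
    # output_multiplier (SaturatingRoundingDoublingHighMul +
    # RoundingDivideByPOT); here a float multiplier approximates it.
    # x_q: (h, w, cin) int8, w_q: (kh, kw, cin, cout) int8, bias_q: (cout,) int32.
    x = x_q.astype(np.int64) - in_offset            # remove input offset
    w = w_q.astype(np.int64) - w_offset             # remove weight offset
    x = np.pad(x, ((pad, pad), (pad, pad), (0, 0)))
    kh, kw, cin, cout = w.shape
    oh = (x.shape[0] - kh) // stride + 1
    ow = (x.shape[1] - kw) // stride + 1
    acc = np.zeros((oh, ow, cout), dtype=np.int64)
    for i in range(oh):                             # plain integer convolution
        for j in range(ow):
            patch = x[i*stride:i*stride+kh, j*stride:j*stride+kw, :]
            acc[i, j, :] = np.tensordot(patch, w, axes=([0, 1, 2], [0, 1, 2])) + bias_q
    multiplier = (in_scale * w_scale) / out_scale   # approximates output_multiplier
    out = np.round(acc * multiplier) + out_offset   # rescale, add output offset
    # Clip to the range from CalculateActivationRange; with a fused ReLU6 the
    # quantized [0, 6] interval typically spans the whole int8 range.
    return np.clip(out, act_min, act_max).astype(np.int8)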
Link to the issue on the project's GitHub page

Related

Masking layer vs attention_mask parameter in MultiHeadAttention

I use a MultiHeadAttention layer in my transformer model (my model is very similar to named entity recognition models). Because my data comes in different lengths, I use padding and the attention_mask parameter of MultiHeadAttention to mask the padding. If I used a Masking layer before MultiHeadAttention, would it have the same effect as the attention_mask parameter? Or should I use both: attention_mask and the Masking layer?
The TensorFlow documentation on Masking and padding with Keras may be helpful.
The following is an excerpt from the document.
When using the Functional API or the Sequential API, a mask generated by an Embedding or Masking layer will be propagated through the network for any layer that is capable of using them (for example, RNN layers). Keras will automatically fetch the mask corresponding to an input and pass it to any layer that knows how to use it.
tf.keras.layers.MultiHeadAttention also supports automatic mask propagation as of TF 2.10.0.
Improved masking support for tf.keras.layers.MultiHeadAttention.
Implicit masks for query, key and value inputs will automatically be used to compute a correct attention mask for the layer. These padding masks will be combined with any attention_mask passed in directly when calling the layer. This can be used with tf.keras.layers.Embedding with mask_zero=True to automatically infer a correct padding mask.
Added a use_causal_mask call time argument to the layer. Passing use_causal_mask=True will compute a causal attention mask, and optionally combine it with any attention_mask passed in directly when calling the layer.
The Masking layer keeps the input tensor as it is and creates a mask vector that is propagated to the following layers if they need one (like RNN layers). You can use it if you implement your own model. If you use models from Hugging Face, you could add a Masking layer, for example, if you want to keep the mask vector for later use; otherwise the masking operations are already built in, so there is no need to add a Masking layer at the beginning.
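As an illustration of the automatic mask propagation described above, here is a minimal sketch assuming TF >= 2.10 (vocabulary size, sequence length and layer sizes are made up):

import tensorflow as tf

# Toy NER-style encoder: the Embedding layer's implicit mask (mask_zero=True)
# is picked up automatically by MultiHeadAttention in TF >= 2.10, so no
# explicit attention_mask or Masking layer is required.
vocab_size, max_len, d_model = 1000, 50, 64

inputs = tf.keras.Input(shape=(max_len,), dtype="int32")
x = tf.keras.layers.Embedding(vocab_size, d_model, mask_zero=True)(inputs)
# query = value = x: the propagated padding mask is combined into the
# attention mask internally.
attn = tf.keras.layers.MultiHeadAttention(num_heads=4, key_dim=d_model)(x, x)
outputs = tf.keras.layers.Dense(10, activation="softmax")(attn)
model = tf.keras.Model(inputs, outputs)
model.summary()

If an explicit attention_mask were passed in the call, it would simply be combined with the mask propagated from the Embedding layer, per the release note quoted above.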

TimeDistributed layer, but with different weights

I'm trying to apply a separate convolution to each layer of a 3-dimensional array, which brought me to the Keras TimeDistributed layer. But the documentation notes that:
"Because TimeDistributed applies the same instance of Conv2D to each of the
timestamps, the same set of weights are used at each timestamp."
However, I want to perform a separate convolution (with independently defined weights / filters) for each layer of the array, not using the same set of weights. Is there some built in way to do this? Any help is appreciated!
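One possible workaround, since nothing built in seems to do this: slice the stack along its first axis and give each slice its own Conv2D instance. A minimal sketch (all shapes and names are illustrative):

import tensorflow as tf

# Each of the num_slices 2D slices gets its own, independently trained Conv2D.
num_slices, h, w, c = 8, 32, 32, 3
inputs = tf.keras.Input(shape=(num_slices, h, w, c))

outputs = []
for i in range(num_slices):
    slice_i = inputs[:, i]                       # (batch, h, w, c)
    conv_i = tf.keras.layers.Conv2D(16, 3, padding="same",
                                    name="conv_slice_%d" % i)(slice_i)
    outputs.append(conv_i)

# Re-stack to (batch, num_slices, h, w, filters)
stacked = tf.keras.layers.Lambda(lambda t: tf.stack(t, axis=1))(outputs)
model = tf.keras.Model(inputs, stacked)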

Tensorflow: How does gen_nn_ops.max_pool_grad_v2() work?

I am working on a project where I need deconvolution. I read that gen_nn_ops.max_pool_grad_v2() can do that. I load the function from tensorflow.python.ops.
As far as I understand, the function takes an input and output tensor, where the input is a convolutional layer before max pooling and the output is the result of the max pooling operation. But what is grad? And what exactly does the output of the function represent?
ksize = [1,2,2,1]
strides = [1,2,2,1]
padding = 'SAME'
u = gen_nn_ops.max_pool_grad_v2(input, output, grad, ksize, strides, padding)
Unfortunately I did not find anything useful on the Internet.
Regarding deconvolution, max_pool_grad_v2 is probably not the op you're looking for; for deconvolution, you probably want the Keras layer Conv2DTranspose instead.
max_pool_grad_v2 is a gradient function for computing the gradient of the max-pooling op (you'll see that it's used for that very purpose internally within TensorFlow). A gradient function such as _MaxPoolGradGrad computes gradients with respect to the op's inputs given gradients with respect to the op's outputs. You don't really need to understand how gradients are implemented in TensorFlow in order to use it, unless you want to implement some of your own; if you do, there is a guide on the main TensorFlow site.
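A minimal sketch of the Conv2DTranspose alternative, assuming the goal is to undo a 2x2 max-pooling's spatial reduction (shapes are illustrative):

import tensorflow as tf

# Learnable upsampling ("deconvolution") with Conv2DTranspose: a 2x2 kernel
# with stride 2 roughly inverts a 2x2 max-pooling in terms of spatial size.
x = tf.keras.Input(shape=(14, 14, 64))
up = tf.keras.layers.Conv2DTranspose(filters=32, kernel_size=2,
                                     strides=2, padding="same")(x)
model = tf.keras.Model(x, up)
print(model.output_shape)  # (None, 28, 28, 32)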

Tensorflow: Array activation1, which is an input to the Div operator producing the output array dropout/div, is lacking min/max data

I'm using tensorflow 1.8.0rc1. I'm trying to save a very simple NN model to tflite format, with weight quantization, following this documentation: https://www.tensorflow.org/performance/quantization.
However, when converting with toco, I receive this error:
Array Relu, which is an input to the Div operator producing the output array dropout/div, is lacking min/max data, which is necessary for quantization. Either target a non-quantized output format, or change the input graph to contain min/max information, or pass --default_ranges_min= and --default_ranges_max= if you do not care about the accuracy of results.
And this is the graph:
At some point it was not complaining about the ReLU ops but about Assign operations (I fixed that, though I'm not sure how), and if I remove the ReLU layers it complains about the Add layers instead. Any idea what's going on?
EDIT:
Just realized that between dropout_1 and activation2 (see picture) there's an act_quant node that must be the fake quantization of activation2 (a ReLU). This is not happening in the first layer, between dropout and activation1; I guess this is the problem? According to the TensorFlow quantization tutorial (linked above), the scripts described there should rewrite the graph with all the information toco needs to quantize the weights.
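For reference, and assuming the tf.contrib.quantize rewrite from that tutorial is what is being used, the training-graph rewriting step looks roughly like this in TF 1.x; the layers below are illustrative only, not the asker's actual graph:

import tensorflow as tf

# Sketch (TF 1.x / tf.contrib): insert fake-quant nodes so every activation,
# including each ReLU, carries the min/max data that toco is asking for.
g = tf.Graph()
with g.as_default():
    x = tf.placeholder(tf.float32, [None, 784], name="input")
    h = tf.layers.dense(x, 128, activation=tf.nn.relu, name="activation1")
    h = tf.layers.dropout(h, rate=0.5, name="dropout")
    logits = tf.layers.dense(h, 10, name="logits")

    # Rewrite the training graph with fake quantization ops...
    tf.contrib.quantize.create_training_graph(input_graph=g)
    # ...and call create_eval_graph(input_graph=...) on a fresh copy of the
    # graph before freezing it and running toco.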

Print layer outputs in Keras during training

I am new to Keras. How can I print the outputs of a layer, both intermediate and final, during the training phase?
I am trying to debug my neural network and want to know how the layers behave during training. To do so, I am trying to extract the input and output of a layer during training, for every step.
The FAQ (https://keras.io/getting-started/faq/#how-can-i-obtain-the-output-of-an-intermediate-layer) has a method to extract output of intermediate layer for building another model but that is not what I want. I don't need to use the intermediate layer output as input to other layer, I just need to print their values out and perhaps graph/chart/visualize it.
I am using Keras 2.1.4
I think I have found an answer myself, although not strictly accomplished by Keras.
Basically, to access layer output during training, one needs to modify the computation graph by adding a print node.
A more detailed description can be found in this StackOverflow question:
How can I print the intermediate variables in the loss function in TensorFlow and Keras?
I will quote an example here: say you would like to have your loss printed at every step; you need to define a custom loss function as follows.
For the Theano backend:
import theano
from keras import backend as K
def custom_loss(y_true, y_pred):
    diff = y_pred - y_true
    diff = theano.printing.Print('shape of diff', attrs=['shape'])(diff)
    return K.square(diff)
For the TensorFlow backend:
import tensorflow as tf
from keras import backend as K
def custom_loss(y_true, y_pred):
    diff = y_pred - y_true
    diff = tf.Print(diff, [tf.shape(diff)])  # prints the shape at every step
    return K.square(diff)
Outputs of other layers can be accessed similarly.
There is also a nice video tutorial from Google about using tf.Print():
Using tf.Print() in TensorFlow
If you want more information about each neuron, you can use the following to get its weights and bias.
weights = model.layers[0].get_weights()[0]
biases = model.layers[0].get_weights()[1]
Index 0 holds the weights and index 1 the biases.
You can also do this per layer:
for layer in model.layers:
    weights = layer.get_weights()  # list of numpy arrays
After each training step, if you access each layer and pull its weights and biases into NumPy arrays, you should be able to visualize how the neurons evolve.
Hope it helps.
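As another option, a custom callback built on K.function can print an intermediate layer's output on a fixed probe batch during training; the layer index and probe data below are illustrative, assuming Keras 2.x with the TensorFlow backend:

from keras import backend as K
from keras.callbacks import Callback

class PrintLayerOutput(Callback):
    # Prints one layer's output on a fixed probe batch after every epoch.
    def __init__(self, layer_index, probe_x):
        super(PrintLayerOutput, self).__init__()
        self.layer_index = layer_index
        self.probe_x = probe_x

    def on_train_begin(self, logs=None):
        # Backend function: model input (+ learning phase) -> chosen layer's output.
        self.fetch = K.function([self.model.input, K.learning_phase()],
                                [self.model.layers[self.layer_index].output])

    def on_epoch_end(self, epoch, logs=None):
        out = self.fetch([self.probe_x, 0])[0]  # 0 = test phase
        print('epoch %d, layer %d output shape %s, mean %.4f'
              % (epoch, self.layer_index, out.shape, out.mean()))

# Hypothetical usage:
# model.fit(x_train, y_train, epochs=5,
#           callbacks=[PrintLayerOutput(layer_index=1, probe_x=x_train[:4])])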
