INT8 quantization for FP32 matrix multiplication

I tried to apply INT8 quantization before an FP32 matrix multiplication, and then requantize the accumulated INT32 output to INT8. I suspect there are a couple of mix-ups somewhere in the process, but I'm stuck trying to spot them.
data flow [Affine Quantization]:
input(fp32) -> quant(int8) ____\ matmul(int32) -> requant(int8) ->deq(fp32)
input(fp32) -> quant(int8) ----/
My Pseudo Code
INPUT(FP32) :
Embedded Words in Tensor (shape : [1, 4, 1024, 256]) A and B (B is the same as A)
input A(=B) : [image omitted]
EXPECTING OUTPUT(FP32) :
Embedded Words in Tensor (shape : [1, 4, 1024, 1024]) AB(after matrix multiplication to itself)
do while(true):
# convert A and B of FP32 into INT8
A_zero_offset = torch.empty(A.shape)
A_zero_offset = torch.zeros_like(A_zero_offset) # offset to be zero **[Question1]**
scale = 255 / (torch.max(A) - torch.min(B)) # 2^8 - 1 = 255
A_quantized = np.round((A - A_zero_offset) * scale)
# likewise
B_quantized = A_quantized
AB = A_quantized.matmul(B_quantized.transpose(-1, -2))
# now accumulated datatype is INT32
AB_offset = torch.empty(AB.shape)
AB_offset = AB_offset.new_full(AB.shape, torch.min(AB)) # offset to be AB's min element **[Question 1]**
scale_AB = 255 / (torch.max(AB) - torch.min(AB)) **[Question 2]**
AB_requantized = np.round((AB - AB_offset) * scale_AB)
# dequantize AB(INT8 at the status quo) into FP32
**[Question 3]**
[Question 1] : does it make sense to set A's offset to be zero and AB's to be min(AB)?
[Question 2] : What operation should I follow with the scale calculation, "max(AB) - min(AB)" or any otherwise method?
[Question 3] : After all, what operation do I have to follow especially with the scale and offset calculation when to dequantize the result into FP32?

I believe this approach is wrong because every embedded-word tensor has different max and min values, so a scale computed from them changes the continuity of your data. I assume you are aware that you lose information anyway, because you can't squeeze (map) FP32 values into INT8 for tensors of the same shape.
import torch
import numpy as np
# create Pseudo tensor
a = torch.tensor([[0.654654, 1.654687, -0.5645365],
[5.687646, -5.662354, 0.6546646]], dtype=torch.float32)
print(a.dtype)
print(a)
# torch.float32
# tensor([[ 0.6547, 1.6547, -0.5645],
# [ 5.6876, -5.6624, 0.6547]])
b = a.clone().int()
print(b)
# tensor([[ 0, 1, 0],
# [ 5, -5, 0]], dtype=torch.int32)
# converting to int8; note that the range here is -128 to +127
c = a.clone().to(torch.int8)
print(c)
# tensor([[ 0, 1, 0],
# [ 5, -5, 0]], dtype=torch.int8)
# converting to uint8 please note range is here 0 to 255
d = a.clone().byte()
print(d)
# tensor([[ 0, 1, 0],
# [ 5, 251, 0]], dtype=torch.uint8)
Your approach(wrong)
A, B = a
A_zero_offset = torch.empty(A.shape)
A_zero_offset = torch.zeros_like(A_zero_offset) # offset to be zero **[Question1]**
scale = 255 / (torch.max(A) - torch.min(B)) # 2^8 - 1 = 255
A_quantized = np.round((A - A_zero_offset) * scale)
print(A_quantized.dtype)
print(A_quantized)
# torch.float32
# tensor([ 23., 58., -20.])
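For comparison, here is a minimal NumPy sketch of one common affine-quantization scheme for the data flow in the question (quantize, INT32 matmul, requantize, dequantize). It is an illustration under stated assumptions, not the asker's or the answerer's code: the helper names `quantize` and `dequantize`, the small random input, and the choice to derive the output scale from the observed result range are all hypothetical. The key points are that the zero point is subtracted before the integer matmul, and that the INT32 accumulator corresponds to real values through the product of the two input scales.

```python
import numpy as np

def quantize(x, num_bits=8):
    """Affine-quantize a float array to int8. Returns (q, scale, zero_point)."""
    qmin, qmax = -2 ** (num_bits - 1), 2 ** (num_bits - 1) - 1  # -128, 127
    # Extend the range to include 0 so that 0.0 is exactly representable.
    x_min = min(float(x.min()), 0.0)
    x_max = max(float(x.max()), 0.0)
    scale = (x_max - x_min) / (qmax - qmin)
    if scale == 0.0:          # constant all-zero input
        scale = 1.0
    zero_point = int(round(qmin - x_min / scale))
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.int8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    """Map int8 codes back to approximate float values."""
    return scale * (q.astype(np.float32) - zero_point)

# Hypothetical small input standing in for the embedding tensor A (= B).
rng = np.random.default_rng(0)
A = rng.standard_normal((4, 16)).astype(np.float32)

qA, sA, zA = quantize(A)

# Integer matmul with INT32 accumulation: (qA - zA) @ (qA - zA)^T
A_centered = qA.astype(np.int32) - zA
acc = A_centered @ A_centered.T          # dtype int32

# The accumulator represents real values with scale sA * sA.
AB_approx = (sA * sA) * acc

# Requantize to int8 with an output scale/zero point taken from the result
# range (an assumption; calibrated ranges are another option), then dequantize.
qO, sO, zO = quantize(AB_approx)
AB_deq = dequantize(qO, sO, zO)

AB_ref = A @ A.T                         # FP32 reference
max_err = float(np.max(np.abs(AB_deq - AB_ref)))
print("max abs error:", max_err)
```

Under this scheme, dequantizing the final INT8 result only needs the output scale and zero point; the input scales are already folded into the requantization step.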

Related

Lorentzian inner product in matrix form using Pytorch

I want to compute the Lorentzian inner product, that is <x,y> = -x1y1 + x2y2 + x3y3 +...
I have the code
res = torch.sum(x * y, dim=-1) - 2 * x[..., 0] * y[..., 0]
But this fails to work, because I keep getting this error -
RuntimeError: The size of tensor a (450) must match the size of tensor b (30) at non-singleton dimension 0
I need the inner product in the matrix form. So I did this -
res = torch.matmul(x,torch.transpose(y,0,1)) - 2*torch.matmul(x[...,0],torch.transpose(y[...,0],0,0))
But I get a new error
RuntimeError: inconsistent tensor size, expected tensor [450] and src [30] to have the same number of elements, but got 450 and 30 elements respectively.
I have tried this on a simple toy example -
x = torch.tensor([[1, 2, 3]])
y = torch.tensor([[2, 2, 2]])
prod = torch.matmul(x,torch.transpose(y,0,1))-2*torch.matmul(x[...,0],torch.transpose(y[...,0],0,0))
print(prod)
Output : tensor([[8]]) which is right. But somehow doesn't seem to work in the application I am working on.
I am not sure how to solve this. Any insights are welcome please!

So I did this -
import torch
x = torch.tensor([[1, 2, 3]])
y = torch.tensor([[2, 2, 2]])
x[...,0] *= -1
res = torch.matmul(x,torch.transpose(y,0,1))
print (res)
It started working for my application.
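Note that `x[...,0] *= -1` modifies x in place. A non-mutating variant can be sketched as follows, written with NumPy for brevity and assuming x has shape (n, d) and y has shape (m, d); the function name `lorentz_gram` is hypothetical. The same idea (flip the sign of the first coordinate, then do an ordinary matrix product) should carry over to torch tensors.

```python
import numpy as np

def lorentz_gram(x, y):
    """Pairwise Lorentzian inner products.

    out[i, j] = -x[i, 0] * y[j, 0] + sum over k >= 1 of x[i, k] * y[j, k]
    """
    signs = np.ones(x.shape[-1], dtype=x.dtype)
    signs[0] = -1                      # metric signature (-, +, +, ...)
    return (x * signs) @ y.T           # x itself is left unchanged

x = np.array([[1.0, 2.0, 3.0]])
y = np.array([[2.0, 2.0, 2.0]])
print(lorentz_gram(x, y))  # [[8.]]

# Shapes from the error message in the question: 450 and 30 rows.
big = lorentz_gram(np.ones((450, 5)), np.ones((30, 5)))
print(big.shape)  # (450, 30)
```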

Conv2D produces weird output

I'm trying to use a Laplace Filter via TensorFlow tf.nn.conv2d on my image. But the output is super weird and I don't have a clue what I did wrong.
I load my picture via:
file = tf.io.read_file("corgi.jpg")
uint_image = tf.io.decode_jpeg(file, 1)
image = tf.cast(uint_image,tf.float32)
kernel = tf.constant(np.array([[1, 1, 1],
[1, -8, 1],
[1, 1, 1]]), dtype=tf.float32)
convoluted_image = self.convoluteTest(image, kernel)
rs_convoluted_image = tf.reshape(convoluted_image,
[tf.shape(image)[0] - tf.shape(kernel)[0] + 1,
tf.shape(image)[1] - tf.shape(kernel)[0] + 1, 1])
casted_image = tf.cast(rs_convoluted_image, tf.uint8)
encoded = tf.io.encode_jpeg(casted_image)
tf.io.write_file("corgi-tensor-laplace.jpg", encoded)
But the image can't be passed directly to the tf.nn.conv2d function, since it requires the image to be a 4-D tensor.
This function here reshapes and applies my laplace filter:
def convoluteTest(image_tensor, kernel_tensor):
    shape = tf.shape(image_tensor)
    reshaped_image_tensor = tf.reshape(image_tensor, [1, shape[0].numpy(), shape[1].numpy(), 1])
    reshaped_kernel_tensor = tf.reshape(kernel_tensor,
                                        [tf.shape(kernel_tensor)[0].numpy(), tf.shape(kernel_tensor)[0].numpy(), 1,
                                         1])
    convoluted = tf.nn.conv2d(reshaped_image_tensor, reshaped_kernel_tensor, strides=[1, 1, 1, 1], padding='VALID')
    return convoluted
[Images omitted: original picture; failed Laplace output; after an update, a greyish output]
What did I do wrong? I can't wrap my head around this...

I believe the problem is that casted_image = tf.cast(rs_convoluted_image, tf.uint8) truncates data outside of [0, 255] to pure black or pure white (0 and 255).
I think you are missing a normalization step back to the [0, 255] range before casting to uint8.
Try
normalized_convolved = (rs_convoluted_image - tf.reduce_min(rs_convoluted_image)) / (tf.reduce_max(rs_convoluted_image) - tf.reduce_min(rs_convoluted_image))
normalized_convolved = normalized_convolved * 255
casted_image = tf.cast(normalized_convolved, tf.uint8)
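The min-max normalization above can be checked on a tiny array. This is a NumPy sketch rather than TensorFlow code; the helper name `to_uint8` and the sample values are hypothetical, and it rounds before casting (the cast in the snippet above truncates instead). It also guards against a constant image, where max equals min and the division would fail.

```python
import numpy as np

def to_uint8(img):
    """Rescale a float image linearly so min -> 0 and max -> 255, then cast."""
    lo, hi = float(img.min()), float(img.max())
    if hi == lo:                               # constant image: avoid 0 / 0
        return np.zeros(img.shape, dtype=np.uint8)
    scaled = (img - lo) / (hi - lo) * 255.0
    return np.round(scaled).astype(np.uint8)

# A Laplace response can be strongly negative or far above 255.
response = np.array([[-800.0, 0.0], [620.0, 2040.0]], dtype=np.float32)
out = to_uint8(response)
print(out.min(), out.max())  # 0 255
```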

binary matrix aa (only contain 0, 1), why sum(sum(aa)) isn't equal sum(sum(aa>0))?

I have a binary mask named crop_mask, which only contains 0 and 1. Why isn't sum(sum(aa1)) equal to sum(sum(aa2))?
aa1 = crop_mask
aa2 = (aa1>0)
print(sum(sum(aa1)), sum(sum(aa2)))
This might be a minor issue, but I am just so confused now. Thanks for any help. I made a screenshot of the result in the attached figure.
[updated screenshot omitted]

By definition the sum should be the same.
The only thing I can think of is that the dtype of your array (assuming you are using a numpy array) is not int or float.
Did you check that the "True"s in aa2 match the "1" in aa1?
EDIT:
dtype = np.uint8 limits the maximum value of each column sum to 255 (2^8 - 1); anything larger wraps around modulo 256. So the sum(sum(a)) --> sum([0,160,0,...]) (160 is the remainder of 4000/256)
aa1 = aa1.astype(int) will solve your issue
a = np.zeros((4000, 4000)).astype(np.uint8)
a[:,1] = 1
a[:,4] = 1
b = (a > 0)
sum(sum(b)) # 8000
sum(sum(a)) # 320
a = a.astype(int)
sum(sum(a)) #8000
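The wrap-around can be reproduced without relying on how the built-in sum promotes NumPy scalars, which I believe differs between NumPy versions. In this sketch (array shape chosen only for illustration) the accumulator is forced to uint8 with the `dtype` argument of `ndarray.sum`; the default accumulator for unsigned integers is wider, so `mask.sum()` gives the true count.

```python
import numpy as np

mask = np.zeros((4000, 6), dtype=np.uint8)
mask[:, 1] = 1
mask[:, 4] = 1

# Column sums accumulated in 8 bits wrap around modulo 256.
wrapped = mask.sum(axis=0, dtype=np.uint8)
print(int(wrapped[1]))          # 160  (4000 % 256)

# ndarray.sum widens the accumulator by default, so nothing overflows.
print(int(mask.sum()))          # 8000
print(np.count_nonzero(mask))   # 8000
```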

Assuming your cropmask is indeed a 2-dimensional ndarray with only 1s and 0s, this works:
import numpy as np
cropmask = np.array([[1, 1, 1], [1, 1, 0], [1, 0, 0]], np.uint8)
x = (cropmask > 0)
print(sum(sum(cropmask)), sum(sum(x)))
Result:
6 6
The most likely cause here is that you're wrong and your cropmask doesn't actually contain only 1s and 0s.
Have you tried:
print(sum(sum(np.logical_and((0 != crop_mask), (1 != crop_mask)))))
If that comes up greater than 0, there's something else in there.

An equivalent but differentiable argmax expression in Tensorflow

I need a one-hot representation for the maximum value in a tensor.
For example, consider a tensor 2 x 3:
[ [1, 5, 2],
[0, 3, 7] ]
The one-hot-argmax representation I am aiming for looks like this:
[ [0, 1, 0],
[0, 0, 1] ]
I can do it as follows, where my_tensor is a N x 3 tensor:
position = tf.argmax(my_tensor, axis=1) # Shape (N, )
one_hot_pos = tf.one_hot(position, depth=3) # Shape (N x 3)
But this part of the code need be differentiable since I'm training over it.
My workaround was as follows, where EPSILON = 1e-3 is a small constant:
max_value = tf.reduce_max(my_tensor, axis=1, keepdims=True)
clip_min = max_value - EPSILON
one_hot_pos = (tf.clip_by_value(my_tensor, clip_min, max_value) - clip_min) / (max_value - clip_min)
The workaround works most of the time, but - as expected - it has some issues:
Sensitive to EPSILON: if it is too small, a division by zero might happen
Can't solve ties: argmax only chooses one even in a tie situation
Do you know any better way of simulating the argmax followed by one_hot situation, while fixing the two mentioned issues, but using only differentiable Tensorflow functions?

Do some maximum, tile and multiplication operations. Like:
a = tf.Variable([ [1, 5, 2], [0, 3, 7] ]) # your tensor
m = tf.reduce_max(a, axis=1) # [5,7]
m = tf.expand_dims(m, -1) # [[5],[7]]
m = tf.tile(m, [1,3]) # [[5,5,5],[7,7,7]]
y = tf.cast(tf.equal(a,m), tf.float32) # [[0,1,0],[0,0,1]]
This is a tricky multiplication operation that is differentiable.
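One caveat: as far as I know, an equality comparison does not propagate a useful gradient back to the input. A different, commonly used relaxation (not the method of the answer above) is a softmax with a low temperature, which approaches the one-hot argmax as the temperature goes to zero and splits the mass evenly between tied maxima. The sketch below uses NumPy only to show the arithmetic; the function name `soft_one_hot` and the temperature value are assumptions, and in TensorFlow the analogous expression would be tf.nn.softmax(my_tensor / temperature, axis=1).

```python
import numpy as np

def soft_one_hot(x, temperature=0.1):
    """Row-wise softmax of x / temperature (a smooth stand-in for one-hot argmax)."""
    # Subtract the row maximum first for numerical stability.
    z = (x - x.max(axis=1, keepdims=True)) / temperature
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

x = np.array([[1.0, 5.0, 2.0],
              [0.0, 3.0, 7.0]])
p = soft_one_hot(x)
print(np.round(p, 3))      # rounds to the hard one-hot rows from the question

tied = soft_one_hot(np.array([[5.0, 5.0, 2.0]]))
print(np.round(tied, 3))   # the two tied maxima share the mass equally
```

A lower temperature gives a sharper output but smaller gradients away from the maximum, so the value usually needs tuning.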

numpy vectorized way to change multiple rows of array(rows can be repeated)

I run into this problem when implementing the vectorized svm gradient for cs231n assignment1.
here is an example:
ary = np.array([[1,-9,0],
[1,2,3],
[0,0,0]])
ary[[0,1]] += np.ones((2,3),dtype='int')
and it outputs:
array([[ 2, -8, 1],
[ 2, 3, 4],
[ 0, 0, 0]])
Everything is fine until the rows are not unique:
ary[[0,1,1]] += np.ones((3,3),dtype='int')
Although it didn't throw an error, the output was really strange:
array([[ 2, -8, 1],
[ 2, 3, 4],
[ 0, 0, 0]])
and I expect the second row should be [3,4,5] rather than [2,3,4],
the naive way I used to solve this problem is using a for loop like this:
ary = np.array([[ 2, -8, 1],
[ 2, 3, 4],
[ 0, 0, 0]], dtype=float) # float, so that float changes can be added in place
# the rows I want to change
rows = [0,1,2,1,0,1]
# the change matrix
change = np.random.randn(6, 3)
for i,row in enumerate(rows):
    ary[row] += change[i]
so I really don't know how to vectorize this for loop, is there a better way to do this in NumPy?
and why it's wrong to do something like this?:
ary[rows] += change
In case anyone is curious why I want to do so, here is my implementation of svm_loss_vectorized function, I need to compute the gradients of weights based on labels y:
def svm_loss_vectorized(W, X, y, reg):
    """
    Structured SVM loss function, vectorized implementation.
    Inputs and outputs are the same as svm_loss_naive.
    """
    loss = 0.0
    dW = np.zeros(W.shape) # initialize the gradient as zero
    # transpose X and W
    # D means input dimensions, N means number of train example
    # C means number of classes
    # X.shape will be (D,N)
    # W.shape will be (C,D)
    X = X.T
    W = W.T
    dW = dW.T
    num_train = X.shape[1]
    # transpose W_y shape to (D,N)
    W_y = W[y].T
    S_y = np.sum(W_y*X ,axis=0)
    margins = np.dot(W,X) + 1 - S_y
    mask = np.array(margins>0)
    # get the impact of num_train examples made on W's gradient
    # that is,only when the mask is positive
    # the train example has impact on W's gradient
    dW_j = np.dot(mask, X.T)
    dW += dW_j
    mul_mask = np.sum(mask, axis=0, keepdims=True).T
    # dW[y] -= mul_mask * X.T
    dW_y = mul_mask * X.T
    for i,label in enumerate(y):
        dW[label] -= dW_y[i]
    loss = np.sum(margins*mask) - num_train
    loss /= num_train
    dW /= num_train
    # add regularization term
    loss += reg * np.sum(W*W)
    dW += reg * 2 * W
    dW = dW.T
    return loss, dW

Using built-in np.add.at
The built-in for such tasks is np.add.at, i.e.
np.add.at(ary, rows, change)
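As for why `ary[rows] += change` gives the unexpected result: with fancy indexing, the statement behaves like `ary[rows] = ary[rows] + change`, so the right-hand side is computed once from the original rows and a repeated index is simply assigned more than once instead of accumulating. A small sketch with made-up values shows the difference from the unbuffered np.add.at:

```python
import numpy as np

ary = np.zeros((3, 3), dtype=int)
rows = [0, 1, 1]                      # row 1 appears twice
change = np.ones((3, 3), dtype=int)

buffered = ary.copy()
buffered[rows] += change              # row 1 ends up incremented only once
print(buffered[1])                    # [1 1 1]

unbuffered = ary.copy()
np.add.at(unbuffered, rows, change)   # every occurrence of row 1 is applied
print(unbuffered[1])                  # [2 2 2]
```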
But, since we are working with a 2D array, that might not be the most performant one.
Leveraging fast matrix-multiplication
As it turns out, we can leverage very efficient matrix multiplication for such a case as well, and given a large enough number of repeated rows to sum, it can be really good. Here's how we can use it -
mask = rows == np.arange(len(ary))[:,None]
ary += mask.dot(change)
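A quick way to convince yourself that the mask-based product matches np.add.at is to compare both on small random inputs; the sizes below are arbitrary. Here mask[i, j] is True when rows[j] == i, so mask.dot(change) sums all change rows that target row i. Keep in mind that mask has shape (len(ary), len(rows)), which can become large.

```python
import numpy as np

rng = np.random.default_rng(0)
ary = rng.random((5, 4))
rows = np.array([0, 1, 2, 1, 0, 1])       # repeated targets
change = rng.random((6, 4))

# Reference result using the unbuffered ufunc method.
expected = ary.copy()
np.add.at(expected, rows, change)

# Mask-based version: boolean (5, 6) selector times (6, 4) change matrix.
mask = rows == np.arange(len(ary))[:, None]
result = ary + mask.dot(change)

print(np.allclose(result, expected))       # True
```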
Benchmarking
Let's time np.add.at method against matrix-multiplication based one for bigger arrays -
In [681]: ary = np.random.rand(1000,1000)
In [682]: rows = np.random.randint(0,len(ary),(10000))
In [683]: change = np.random.rand(10000,1000)
In [684]: %timeit np.add.at(ary, rows, change)
1 loop, best of 3: 604 ms per loop
In [687]: def matmul_addat(ary, rows, change):
     ...:     mask = rows == np.arange(len(ary))[:,None]
     ...:     ary += mask.dot(change)
In [688]: %timeit matmul_addat(ary, rows, change)
10 loops, best of 3: 158 ms per loop
