An equivalent but differentiable argmax expression in Tensorflow

I need a one-hot representation for the maximum value in a tensor.
For example, consider a tensor 2 x 3:
[ [1, 5, 2],
[0, 3, 7] ]
The one-hot-argmax representation I am aiming for looks like this:
[ [0, 1, 0],
[0, 0, 1] ]
I can do it as follows, where my_tensor is a N x 3 tensor:
position = tf.argmax(my_tensor, axis=1)  # Shape (N,)
one_hot_pos = tf.one_hot(position, depth=3) # Shape (N x 3)
But this part of the code needs to be differentiable since I'm training over it.
My workaround was as follows, where EPSILON = 1e-3 is a small constant:
max_value = tf.reduce_max(my_tensor, axis=1, keepdims=True)
clip_min = max_value - EPSILON
one_hot_pos = (tf.clip_by_value(my_tensor, clip_min, max_value) - clip_min) / (max_value - clip_min)
The workaround works most of the time, but - as expected - it has some issues:
Sensitive to EPSILON: if it is too small, a division by zero might happen
Can't break ties: argmax chooses only one position even in a tie, while the workaround marks every tied position
Do you know any better way of simulating the argmax followed by one_hot situation, while fixing the two mentioned issues, but using only differentiable Tensorflow functions?

Do some maximum, tile and multiplication operations. Like:
a = tf.Variable([ [1, 5, 2], [0, 3, 7] ]) # your tensor
m = tf.reduce_max(a, axis=1) # [5,7]
m = tf.expand_dims(m, -1) # [[5],[7]]
m = tf.tile(m, [1,3]) # [[5,5,5],[7,7,7]]
y = tf.cast(tf.equal(a,m), tf.float32) # [[0,1,0],[0,0,1]]
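Note that tf.equal and tf.cast do not propagate gradients by themselves, so y acts as a constant mask; multiply it by another tensor (for example a itself) if gradients need to flow through the selected entries.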
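As a point of comparison, one commonly used differentiable relaxation of argmax followed by one_hot is a softmax with a small temperature. The sketch below uses NumPy only to show the arithmetic; soft_one_hot and temperature are hypothetical names that appear in neither the question nor the answer above, and in TensorFlow the same expression could be written with tf.nn.softmax.

```python
import numpy as np

def soft_one_hot(t, temperature=0.1):
    """Row-wise softmax with temperature; approaches a one-hot argmax as temperature -> 0."""
    z = t / temperature
    z = z - z.max(axis=1, keepdims=True)  # subtract the row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

my_tensor = np.array([[1.0, 5.0, 2.0],
                      [0.0, 3.0, 7.0]])
approx = soft_one_hot(my_tensor)
print(np.round(approx, 3))
```

With a tie, the mass is split between the tied positions instead of one being picked, and the denominator is at least 1 once the row maximum is subtracted, so no division by zero occurs. The output is only approximately one-hot, which may or may not be acceptable for a given model.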

Related

INT8 quantization for FP32 matrix multiplication

I tried to apply INT8 quantization before an FP32 matrix multiplication, and then requantize the accumulated INT32 output to INT8. I suspect there are a couple of mix-ups somewhere in the process, but I am stuck trying to spot them.
data flow [Affine Quantization]:
input(fp32) -> quant(int8) ____\ matmul(int32) -> requant(int8) ->deq(fp32)
input(fp32) -> quant(int8) ----/
My Pseudo Code
INPUT(FP32) :
Embedded Words in Tensor (shape : [1, 4, 1024, 256]) A and B (B is the same as A)
EXPECTING OUTPUT(FP32) :
Embedded Words in Tensor (shape : [1, 4, 1024, 1024]) AB(after matrix multiplication to itself)
do while(true):
# convert A and B of FP32 into INT8
A_zero_offset = torch.empty(A.shape)
A_zero_offset = torch.zeros_like(A_zero_offset) # offset to be zero **[Question1]**
scale = 255 / (torch.max(A) - torch.min(B)) # 2^8 - 1 = 255
A_quantized = np.round((A - A_zero_offset) * scale)
# likewise
B_quantized = A_quantized
AB = A_quantized.matmul(B_quantized.transpose(-1, -2))
# now accumulated datatype is INT32
AB_offset = torch.empty(AB.shape)
AB_offset = AB_offset.new_full(AB.shape, torch.min(AB)) # offset to be AB's min element **[Question 1]**
scale_AB = 255 / (torch.max(AB) - torch.min(AB)) **[Question 2]**
AB_requantized = np.round((AB - AB_offset) * scale_AB)
# dequantize AB(INT8 at the status quo) into FP32
**[Question 3]**
[Question 1] : Does it make sense to set A's offset to zero and AB's offset to min(AB)?
[Question 2] : How should the scale be calculated: with "max(AB) - min(AB)" or some other method?
[Question 3] : Finally, which operation, in terms of scale and offset, should I use to dequantize the result into FP32?

I believe this approach is fundamentally wrong: every embedded-word tensor has different max and min values, so the scale differs from tensor to tensor and the quantized values are no longer comparable. I assume you are aware that you lose information anyway, because you can't squeeze (map) fp32 values into int8 in the same tensor shape without loss.
import torch
import numpy as np
# create Pseudo tensor
a = torch.tensor([[0.654654, 1.654687, -0.5645365],
[5.687646, -5.662354, 0.6546646]], dtype=torch.float32)
print(a.dtype)
print(a)
# torch.float32
# tensor([[ 0.6547, 1.6547, -0.5645],
# [ 5.6876, -5.6624, 0.6547]])
b = a.clone().int()
print(b)
# tensor([[ 0, 1, 0],
# [ 5, -5, 0]], dtype=torch.int32)
# converting to int8; note that the range here is -128 to +127
c = a.clone().to(torch.int8)
print(c)
# tensor([[ 0, 1, 0],
# [ 5, -5, 0]], dtype=torch.int8)
# converting to uint8 please note range is here 0 to 255
d = a.clone().byte()
print(d)
# tensor([[ 0, 1, 0],
# [ 5, 251, 0]], dtype=torch.uint8)
Your approach(wrong)
A, B = a
A_zero_offset = torch.empty(A.shape)
A_zero_offset = torch.zeros_like(A_zero_offset) # offset to be zero **[Question1]**
scale = 255 / (torch.max(A) - torch.min(B)) # 2^8 - 1 = 255
A_quantized = np.round((A - A_zero_offset) * scale)
print(A_quantized.dtype)
print(A_quantized)
# torch.float32
# tensor([ 23., 58., -20.])
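For reference, here is a minimal NumPy sketch of one common scheme: symmetric per-tensor quantization with a zero offset. It is an illustration under stated assumptions, not the asker's pipeline. The helper name quantize_symmetric, the integer range of 127, and the small random input are all hypothetical. The scale here is the float value of one integer step, which is the inverse of the convention used in the question. With zero offsets, the INT32 accumulator can be dequantized by multiplying with the product of the two input scales.

```python
import numpy as np

def quantize_symmetric(x, num_bits=8):
    """Map a float array to signed integers with a zero offset (hypothetical helper)."""
    qmax = 2 ** (num_bits - 1) - 1          # 127 for int8
    scale = np.abs(x).max() / qmax          # float value represented by one integer step
    q = np.clip(np.round(x / scale), -qmax, qmax).astype(np.int8)
    return q, scale

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 16)).astype(np.float32)

A_q, scale_A = quantize_symmetric(A)
# Accumulate in int32 so that sums of int8 products cannot overflow.
AB_q = A_q.astype(np.int32) @ A_q.astype(np.int32).T
# Each int8 value stands for q * scale, so every product carries scale_A * scale_A.
AB_deq = AB_q.astype(np.float32) * (scale_A * scale_A)

AB_ref = A @ A.T
max_abs_err = np.abs(AB_deq - AB_ref).max()
print(max_abs_err)
```

Requantizing AB to INT8 would need a separate scale computed from the range of AB itself; that step is left out here to keep the dequantization rule visible.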

Lorentzian inner product in matrix form using Pytorch

I want to compute the Lorentzian inner product, that is <x,y> = -x1y1 + x2y2 + x3y3 +...
I have the code
res = torch.sum(x * y, dim=-1) - 2 * x[..., 0] * y[..., 0]
But this fails to work, because I keep getting this error -
RuntimeError: The size of tensor a (450) must match the size of tensor b (30) at non-singleton dimension 0
I need the inner product in the matrix form. So I did this -
res = torch.matmul(x,torch.transpose(y,0,1)) - 2*torch.matmul(x[...,0],torch.transpose(y[...,0],0,0))
But I get a new error
RuntimeError: inconsistent tensor size, expected tensor [450] and src [30] to have the same number of elements, but got 450 and 30 elements respectively.
I have tried this on a simple toy example -
x = torch.tensor([[1, 2, 3]])
y = torch.tensor([[2, 2, 2]])
prod = torch.matmul(x,torch.transpose(y,0,1))-2*torch.matmul(x[...,0],torch.transpose(y[...,0],0,0))
print(prod)
Output : tensor([[8]]) which is right. But somehow doesn't seem to work in the application I am working on.
I am not sure how to solve this. Any insights are welcome please!

So I did this -
import torch
x = torch.tensor([[1, 2, 3]])
y = torch.tensor([[2, 2, 2]])
x[...,0] *= -1
res = torch.matmul(x,torch.transpose(y,0,1))
print (res)
It started working for my application.
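A variant that avoids modifying x in place is to multiply by a signature vector before the matrix product. The sketch below uses NumPy to show the shapes; lorentz_gram is a hypothetical name, and the same expression should carry over to torch tensors with torch.matmul.

```python
import numpy as np

def lorentz_gram(x, y):
    """Pairwise Lorentzian products: out[i, j] = -x[i,0]*y[j,0] + sum over k>=1 of x[i,k]*y[j,k]."""
    signs = np.ones(x.shape[-1])
    signs[0] = -1.0                      # metric signature (-, +, +, ...)
    return (x * signs) @ y.T             # x itself is left unchanged

x = np.array([[1.0, 2.0, 3.0]])
y = np.array([[2.0, 2.0, 2.0]])
print(lorentz_gram(x, y))                # [[8.]]
```

For inputs of shape (450, d) and (30, d) this returns a (450, 30) matrix, one entry per pair of rows.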

numpy vectorized way to change multiple rows of array(rows can be repeated)

I run into this problem when implementing the vectorized svm gradient for cs231n assignment1.
here is an example:
ary = np.array([[1,-9,0],
[1,2,3],
[0,0,0]])
ary[[0,1]] += np.ones((2,3),dtype='int')
and it outputs:
array([[ 2, -8, 1],
[ 2, 3, 4],
[ 0, 0, 0]])
Everything is fine until the rows are not unique:
ary[[0,1,1]] += np.ones((3,3),dtype='int')
Although it didn't throw an error, the output was really strange:
array([[ 2, -8, 1],
[ 2, 3, 4],
[ 0, 0, 0]])
and I expect the second row should be [3,4,5] rather than [2,3,4],
the naive way I used to solve this problem is using a for loop like this:
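ary = np.array([[ 2, -8, 1],
                [ 2, 3, 4],
                [ 0, 0, 0]], dtype=float)  # float dtype so float changes can be added in place
# the rows I want to change
rows = [0,1,2,1,0,1]
# the change matrix
change = np.random.randn(6, 3)  # randn takes the dimensions as separate arguments
for i,row in enumerate(rows):
    ary[row] += change[i]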
I don't know how to vectorize this for loop. Is there a better way to do this in NumPy?
And why is it wrong to do something like this?
ary[rows] += change
In case anyone is curious why I want to do so, here is my implementation of svm_loss_vectorized function, I need to compute the gradients of weights based on labels y:
def svm_loss_vectorized(W, X, y, reg):
    """
    Structured SVM loss function, vectorized implementation.
    Inputs and outputs are the same as svm_loss_naive.
    """
    loss = 0.0
    dW = np.zeros(W.shape) # initialize the gradient as zero
    # transpose X and W
    # D means input dimensions, N means number of train example
    # C means number of classes
    # X.shape will be (D,N)
    # W.shape will be (C,D)
    X = X.T
    W = W.T
    dW = dW.T
    num_train = X.shape[1]
    # transpose W_y shape to (D,N)
    W_y = W[y].T
    S_y = np.sum(W_y*X ,axis=0)
    margins = np.dot(W,X) + 1 - S_y
    mask = np.array(margins>0)
    # get the impact of num_train examples made on W's gradient
    # that is,only when the mask is positive
    # the train example has impact on W's gradient
    dW_j = np.dot(mask, X.T)
    dW += dW_j
    mul_mask = np.sum(mask, axis=0, keepdims=True).T
    # dW[y] -= mul_mask * X.T
    dW_y = mul_mask * X.T
    for i,label in enumerate(y):
        dW[label] -= dW_y[i]
    loss = np.sum(margins*mask) - num_train
    loss /= num_train
    dW /= num_train
    # add regularization term
    loss += reg * np.sum(W*W)
    dW += reg * 2 * W
    dW = dW.T
    return loss, dW

Using built-in np.add.at
The built-in for such tasks is np.add.at, i.e.
np.add.at(ary, rows, change)
But, since we are working with a 2D array, that might not be the most performant one.
Leveraging fast matrix-multiplication
As it turns out, we can leverage very efficient matrix multiplication for such a case as well, and given enough repeated rows for summation, it could be really good. Here's how we can use it -
mask = rows == np.arange(len(ary))[:,None]
ary += mask.dot(change)
Benchmarking
Let's time np.add.at method against matrix-multiplication based one for bigger arrays -
In [681]: ary = np.random.rand(1000,1000)
In [682]: rows = np.random.randint(0,len(ary),(10000))
In [683]: change = np.random.rand(10000,1000)
In [684]: %timeit np.add.at(ary, rows, change)
1 loop, best of 3: 604 ms per loop
In [687]: def matmul_addat(ary, rows, change):
...: mask = rows == np.arange(len(ary))[:,None]
...: ary += mask.dot(change)
In [688]: %timeit matmul_addat(ary, rows, change)
10 loops, best of 3: 158 ms per loop
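To see the difference between the approaches on a small case, the following sketch compares the explicit loop, plain fancy-index +=, np.add.at, and the mask-based matrix product. The arrays are made up for illustration and are not the benchmark data above.

```python
import numpy as np

ary = np.zeros((3, 3))
rows = np.array([0, 1, 2, 1, 0, 1])
change = np.arange(18, dtype=float).reshape(6, 3)

# Reference: explicit loop, every repeated row is accumulated.
expected = ary.copy()
for i, row in enumerate(rows):
    expected[row] += change[i]

# Buffered fancy-index update: for a repeated row only one update survives.
buffered = ary.copy()
buffered[rows] += change

# Unbuffered accumulation.
with_add_at = ary.copy()
np.add.at(with_add_at, rows, change)

# Mask-based matrix product: mask[r, i] is True when rows[i] == r.
mask = rows == np.arange(len(ary))[:, None]
with_matmul = ary + mask.dot(change)

print(np.array_equal(with_add_at, expected))
print(np.allclose(with_matmul, expected))
print(np.array_equal(buffered, expected))
```

The first two checks agree with the loop, while the buffered update does not, because ary[rows] += change reads all selected rows once, adds, and writes the results back, so later writes to a repeated row overwrite earlier ones.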

Why does a linear regression placeholder have shape [1, 1] in tensorflow?

I've been reading this guide on tensorflow: https://medium.com/all-of-us-are-belong-to-machines/the-gentlest-introduction-to-tensorflow-248dc871a224
...and mostly, I see what's happening.
However, the example code defines the linear model like this:
# Model linear regression y = Wx + b
x = tf.placeholder(tf.float32, [None, 1])
W = tf.Variable(tf.zeros([1,1]))
b = tf.Variable(tf.zeros([1]))
product = tf.matmul(x,W)
y = product + b
y_ = tf.placeholder(tf.float32, [None, 1])
# Cost function sum((y_-y)**2)
cost = tf.reduce_mean(tf.square(y_-y))
# Training using Gradient Descent to minimize cost
train_step = tf.train.GradientDescentOptimizer(0.0000001).minimize(cost)
The question is: Why is Wx + b represented with these values:
x = tf.placeholder(tf.float32, [None, 1])
W = tf.Variable(tf.zeros([1,1]))
b = tf.Variable(tf.zeros([1]))
? [None, 1], [1, 1]? Why [None, 1] for x and [1, 1] for W?
If [1, 1] is 1 element of size 1, then why is b just [1], what does that mean? 1 element of size 0?
For W = tf.Variable, the first '1' is feature, house size, and the 2nd '1' is output, house price.
Does that mean if I was trying to represent the model, say:
y = Ax + Bz
That means I have two 'features' (x and z) and that my A and B values should be shaped [2, 1]? It doesn't seem right...
This seems utterly unlike what is done in polynomial regression, where weight factors are shape [1]. Why is this different?

I think maybe you should learn something like linear algebra.
Let's start with this line # Model linear regression y = Wx + b which is the first line in the code you post. Actually, it means two matrix operations.
The first one is Wx, the matrix product of x and W (tf.matmul(x, W) in the code). In your case, that means:
[x11, x21, x31, ..., xn1]T * [w] = [x11*w, x21*w, x31*w, ..., xn1*w]T
Calling the result of Wx R, we can rewrite Wx + b as R + b. This is the second matrix operation. In your case, that means:
[x11*w, x21*w, x31*w, ..., xn1*w]T + [b] = [x11*w + b, x21*w + b, x31*w + b, ..., xn1*w + b]T
So if you have more than one feature in your input and want to output multiple results, the model should be defined as:
x = tf.placeholder(tf.float32, [None, your_input_features])
W = tf.Variable(tf.zeros([your_input_features, your_output_features]))
b = tf.Variable(tf.zeros([your_output_features]))
product = tf.matmul(x,W)
y = product + b

The original author presumably chose the shape [1, 1] because they wanted a more general function than a plain scalar product.
This way, you can change the shape of W to [d, 1] to have d features for each sample.
Of course, the shape of x should then also change, to [None, d].

Are you familiar with linear algebra ?
A placeholder of shape [None, 1] means unlimited rows and 1 column.
A placeholder of shape [1, 1] means 1 row and 1 column.
Shape [1, 1] and [1] are different in that sense:
[1] => plh = [x]
[1, 1] => plh = [[x]]
Then tf.matmul computes the matrix product x.W, and b is added to the result.
tf.matmul requires both operands to be matrices (rank 2) with matching inner dimensions, which is why W has shape [1, 1] and not just [1].
Let us have:
x = [[1], [2], [3]]
W = [[10]]
b = [[9], [8], [7]]
Then:
tf.matmul(x, W) = [[10], [20], [30]]
tf.matmul(x, W) + b = [[19], [28], [37]]
I hope this answers your question.
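The shape bookkeeping discussed in these answers can be checked with NumPy, whose matrix product and broadcasting rules behave like tf.matmul and + for this case. This is a sketch with made-up numbers: b has shape [1] as in the tutorial code, and the two-feature weights are arbitrary values chosen for illustration.

```python
import numpy as np

x = np.array([[1.0], [2.0], [3.0]])   # shape (3, 1): three samples, one feature
W = np.array([[10.0]])                # shape (1, 1): one input feature, one output
b = np.array([9.0])                   # shape (1,): one bias per output

product = x @ W                       # (3, 1) @ (1, 1) -> (3, 1)
y = product + b                       # b is broadcast across the rows

print(product.shape, y.shape)
print(y.ravel())

# Two features (y = A*x + B*z): the inputs become (N, 2) and the weights (2, 1).
xz = np.array([[1.0, 2.0], [3.0, 4.0]])
AB = np.array([[10.0], [100.0]])
y2 = xz @ AB + b
print(y2.ravel())
```

So for y = Ax + Bz, stacking A and B into a weight matrix of shape [2, 1] is consistent with the [features, outputs] layout used in the tutorial.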

Tensorflow - pick values from indicies, what is the operation called?

An example
Suppose I have a tensor values with shape (2,2,2)
values = [[[0, 1],[2, 3]],[[4, 5],[6, 7]]]
And a tensor indicies with shape (2,2) which describes what values to be selected in the innermost dimension
indicies = [[1,0],[0,0]]
Then the result will be a (2,2) matrix with these values
result = [[1,2],[4,6]]
What is this operation called in tensorflow and how to do it?
General
Note that the above shape (2,2,2) is only an example, it can be any dimension. Some conditions for this operation:
ndim(values) -1 = ndim(indicies)
values.shape[:-1] == indicies.shape == result.shape
indicies.max() <= values.shape[-1] - 1

I think you can emulate this with tf.gather_nd. You will just have to convert "your" indices to a representation that is suitable for tf.gather_nd. The following example here is tied to your specific example, i.e. input tensors of shape (2, 2, 2) but I think this gives you an idea how you could write the conversion for input tensors with arbitrary shape, although I am not sure how easy it would be to implement this (haven't thought about it too long). Also, I'm not claiming that this is the easiest possible solution.
import tensorflow as tf
import numpy as np
values = np.array([[[0, 1], [2, 3]], [[4, 5], [6, 7]]])
values_tf = tf.constant(values)
indices = np.array([[1, 0], [0, 0]])
converted_idx = []
for k in range(values.shape[0]):
    outer = []
    for l in range(values.shape[1]):
        # full index into values: [first axis, second axis, chosen last-axis position]
        inds = [k, l, indices[k][l]]
        outer.append(inds)
    converted_idx.append(outer)
with tf.Session() as sess:
    result = tf.gather_nd(values_tf, converted_idx)
    print(sess.run(result))
This prints
[[1 2]
[4 6]]
Edit: To handle arbitrary shapes here is a recursive solution that should work (only tested on your example):
def convert_idx(last_dim_vals, ori_indices, access_to_ori, depth):
    if depth == len(last_dim_vals.shape) - 1:
        inds = access_to_ori + [ori_indices[tuple(access_to_ori)]]
        return inds
    outer = []
    for k in range(ori_indices.shape[depth]):
        inds = convert_idx(last_dim_vals, ori_indices, access_to_ori + [k], depth + 1)
        outer.append(inds)
    return outer
You can use this together with the original code I posted like so:
...
converted_idx = convert_idx(values, indices, [], 0)
with tf.Session() as sess:
    result = tf.gather_nd(values_tf, converted_idx)
    print(sess.run(result))
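For comparison, NumPy has a built-in for this access pattern, np.take_along_axis, and the index layout that the loops above build for tf.gather_nd can also be produced without loops or recursion. The sketch below applies both to the example from the question. It is a NumPy illustration only, and the name gather_nd_idx is hypothetical.

```python
import numpy as np

values = np.array([[[0, 1], [2, 3]], [[4, 5], [6, 7]]])
indices = np.array([[1, 0], [0, 0]])

# take_along_axis expects indices with the same number of dimensions as values,
# so add a trailing axis, gather along the last axis, then drop that axis again.
result = np.take_along_axis(values, indices[..., None], axis=-1)[..., 0]
print(result)

# Vectorized construction of the [k, l, indices[k][l]] triples for any number of axes.
grid = np.indices(indices.shape)                      # one coordinate array per leading axis
gather_nd_idx = np.stack([*grid, indices], axis=-1)   # shape (2, 2, 3)
print(gather_nd_idx.tolist())
```

The array gather_nd_idx has the same nested structure as converted_idx, so it could be passed to tf.gather_nd in its place.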
