Implementation of the Dense Synthesizer

Implementation of the Dense Synthesizer - python

I’m trying to understand the Synthesizer paper (https://arxiv.org/pdf/2005.00743.pdf 1) and there’s a description of the dense synthesizer mechanism that should replace the traditional attention model as described in the Transformer architecture.
The Dense Synthesizer is described as such:
So I tried to implement the layer and it looks like this but I’m not sure whether I’m getting it right:
class DenseSynthesizer(nn.Module):
def __init__(self, l, d):
super(DenseSynthesizer, self).__init__()
self.linear1 = nn.Linear(d, l)
self.linear2 = nn.Linear(l, l)
def forward(self, x, v):
# Equation (1) and (2)
# Shape: l x l
b = self.linear2(F.relu(self.linear1(x)))
# Equation (3)
# [l x l] x [l x d] -> [l x d]
return torch.matmul(F.softmax(b), v)
Usage:
l, d = 4, 5
x, v = torch.rand(l, d), torch.rand(l, d)
synthesis = DenseSynthesizer(l, d)
synthesis(x, v)
Example:
x and v are tensors:
x = tensor([[0.0844, 0.2683, 0.4299, 0.1827, 0.1188],
[0.2793, 0.0389, 0.3834, 0.9897, 0.4197],
[0.1420, 0.8051, 0.1601, 0.3299, 0.3340],
[0.8908, 0.1066, 0.1140, 0.7145, 0.3619]])
v = tensor([[0.3806, 0.1775, 0.5457, 0.6746, 0.4505],
[0.6309, 0.2790, 0.7215, 0.4283, 0.5853],
[0.7548, 0.6887, 0.0426, 0.1057, 0.7895],
[0.1881, 0.5334, 0.6834, 0.4845, 0.1960]])
And passing through a forward pass through the dense synthesis, it returns:
>>> synthesis = DenseSynthesizer(l, d)
>>> synthesis(x, v)
tensor([[0.5371, 0.4528, 0.4560, 0.3735, 0.5492],
[0.5426, 0.4434, 0.4625, 0.3770, 0.5536],
[0.5362, 0.4477, 0.4658, 0.3769, 0.5468],
[0.5430, 0.4461, 0.4559, 0.3755, 0.5551]], grad_fn=<MmBackward>)
Is the implementation and understanding of the dense synthesizer correct?
Theoretically, how is that different from a multi-layered perceptron that takes in two different inputs and makes uses of it at different point in the forward propagation?

Is the implementation and understanding of the dense synthesizer correct?
Not exactly, linear1 = nn.Linear(d,d) according to the paper and not (d,l).
Of course this does not work if X.shape = (l,d) according to matrix multiplication rules.
This is because :
So F is applied to each Xi in X for i in [1,l]
The resulting matrix B is then passed to the softmax function and multiplied by G(x).
So you'd have to modify your code to sequentially process the input then use the returned matrix to compute Y.
how is that different from a multi-layered perceptron that takes in two different inputs and makes uses of it at different point in the forward propagation?
To understand, we need to put things into context, the idea of introducing attention mechanism was first described here in the context of Encoder - Decoder : https://arxiv.org/pdf/1409.0473.pdf
The core idea is to allow the model to have control over how the context vector from the encoder is retrieved using a neural network instead of relying solely on the last encoded state :
see this post for more detail.
The Transformers introduced the idea of using "Multi-Head Attention" (see graph below) to reduce the computational burden and focus solely on the attention mechanism itself. post
https://arxiv.org/pdf/1706.03762.pdf
So where does the Dense synthesizer fits into all of that ?
It simply replaces the Dot product (as illustrated in the first pictures in your post) by F(.). If you replace what's inside the softmax by F you get the equation for Y
Conclusion
This is an MLP but applied step wise to the input in the context of sequence processing.
Thank you

Related

How to build TF tensor with ones in specified locations - batch compatible

I apologize for the poor question title but I'm not sure quite how to phrase it. Here's the problem I'm trying to solve: I have two NNs working off of the same input dataset in my code. One of them is a traditional network while the other is used to limit the acceptable range of the first. This works by using a tf.where() statement which works fine in most cases, such as this toy example:
pcts= [0.04,0.06,0.06,0.06,0.06,0.06,0.06,0.04,0.04,0.04]
legal_actions = tf.where(pcts>=0.05, tf.ones_like(pcts), tf.zeros_like(pcts))
Which gives the correct result: legal_actions = [0,1,1,1,1,1,1,0,0,0]
I can then multiply this by the output of my first network to limit its Q values to only those of the legal actions. In a case like the above this works great.
However, it is also possible that my original vector looks something like this, with low values in the middle of the high values: pcts= [0.04,0.06,0.06,0.04,0.04,0.06,0.06,0.04,0.04,0.04]
Using the same code as above my legal_actions comes out as this: legal_actions = [0,1,1,0,0,1,1,0,0,0]
Based on the code I have this is correct, however, I'd like to include any zeros in the middle as part of my legal_actions. In other words, I'd like this second example to be the same as the first. Working in basic TF this is easy to do in several different ways, such as in this reproducible example (it's also easy to do with sparse tensors):
import tensorflow as tf
pcts= tf.placeholder(tf.float32, shape=(10,))
legal_actions = tf.where(pcts>=0.05, tf.ones_like(pcts), tf.zeros_like(pcts))
mask = tf.where(tf.greater(legal_actions,0))
legals = tf.cast(tf.range(tf.reduce_min(mask),tf.reduce_max(mask)+1),tf.int64)
oh = tf.one_hot(legals,10)
oh = tf.reduce_sum(oh,0)
with tf.Session() as sess:
print(sess.run(oh,feed_dict={pcts:[0.04,0.06,0.06,0.04,0.04,0.06,0.06,0.04,0.04,0.04]}))
The problem that I'm running into is when I try to apply this to my actual code which is reading in batches from a file. I can't figure out a way to fill in the "gaps" in my tensor without the range function and/or I can't figure out how to make the range function work with batches (it will only make one range at a time, not one per batch, as near as I can tell). Any suggestions on how to either make what I'm working on work or how to solve the problem a completely different way would be appreciated.

Try this code:
import tensorflow as tf
pcts = tf.random.uniform((2,3,4))
a = pcts>=0.5
shape = tf.shape(pcts)[-1]
a = tf.reshape(a, (-1, shape))
a = tf.cast(a, dtype=tf.float32)
def rng(t):
left = tf.scan(lambda a, x: max(a, x), t)
right = tf.scan(lambda a, x: max(a, x), t, reverse=True)
return tf.minimum(left, right)
a = tf.map_fn(lambda x: rng(x), a)
a = tf.reshape(a, (tf.shape(pcts)))

How does Tensorflow construct a graph from your code arguments?

I am trying to understand how Tensorflow is able to accept an expression as an argument and turn it into a graph in Python
I have tried looking at the Tensorflow code base (e.g. https://github.com/tensorflow/tensorflow/blob/v1.13.1/tensorflow/python/ops/nn_ops.py), and maybe I'm just being dense but I can't see how they are doing it from this.
An example might be:
b = tf.Variable(tf.zeros((100,)))
W = tf.Variable(tf.random_uniform((784, 100), -1, 1))
x = tf.placeholder(tf.float32,(100,784))
h = tf.nn.relu(tf.matmul(x, W) + b)
It is the definition of 'h' here that interests me. How does tf/Python facilitate turning matmul + b into 2 operands and an add operation in the graph? I haven't found any declaration for Python that allows tf.matmul(x, W) + b to be tokenized and registered as graph elements as I thought I might expect to see in the declaration of tf.nn.relu().

What is the best approach to deal with batches within a Lambda layer?

I created a neural network with Keras, and added a Lambda layer to perform some calculations, but it is showing a poor performance on inferences.
I was able to make the inferences successfully using a batch of one input and added one more loop to handle multiple inputs. Everything works fine, but the performance is somewhat poor. I figured using a larger batch would make things a lot faster. My question is whether I am handling batches correctly (is it really necessary to use another loop?) as I have not found any keras or tensorflow documentation dealing with this topic in more depth.
Below is a code with a structure similar to the one I'm using in the Lambda layer.
def GenericFunc(x, batch=10, channels=64):
y, group = [], []
for i in range(batch):
for j in range(channels):
y.append(backend.sum(x[0, :, :, j]))
group.append(tf.convert_to_tensor(y, dtype=np.float32))
y = []
yy = backend.stack(group, axis=0)
tensor_stack = backend.reshape(yy, [batch,channels])
return tensor_stack
Any suggestions will be welcome!

Never use loops. Tensors are made for tensor operations.
def GenericFunc(x):
y = backend.sum(x, axis=1)
y = backend.sum(y, axis=1)
return y
Probably also works with
def GenericFunc(x):
return backend.sum(x, axis=[1,2])

Avoiding duplicating graph in tensorflow (LSTM model)

I have the following simplified code (actually, unrolled LSTM model):
def func(a, b):
with tf.variable_scope('name'):
res = tf.add(a, b)
print(res.name)
return res
func(tf.constant(10), tf.constant(20))
Whenever I run the last line, it seems that it changes the graph. But I don't want the graph changes. Actually my code is different and is a neural network model but it is too huge, so I've added the above code. I want to call the func without changing the graph of model but it changes. I read about variable scope in TensorFlow but it seems that I've not understand it at all.

You should take a look at the source code of tf.nn.dynamic_rnn, specifically _dynamic_rnn_loop function at python/ops/rnn.py - it's solving the same problem. In order not blow up the graph, it's using tf.while_loop to reuse the same graph ops for new data. But this approach adds several restrictions, namely the shape of tensors that are passing through in a loop must be invariant. See the examples in tf.while_loop documentation:
i0 = tf.constant(0)
m0 = tf.ones([2, 2])
c = lambda i, m: i < 10
b = lambda i, m: [i+1, tf.concat([m, m], axis=0)]
tf.while_loop(
c, b, loop_vars=[i0, m0],
shape_invariants=[i0.get_shape(), tf.TensorShape([None, 2])])

Using op inputs when defining custom gradients in TensorFlow

I'm trying to define a gradient method for my custom TF operation. Most of the solutions I have found online seem to based on a gist by harpone. I'm reluctant to use that approach as it uses py_func which won't run on GPU. I found another solution here that uses tf.identity() that looks more elegant and I think will run on GPU. However, I have some problems accessing inputs of the ops in my custom gradient function. Here's my code:
#tf.RegisterGradient('MyCustomGradient')
def _custom_gradient(op, gradients):
x = op.inputs[0]
return(x)
def my_op(w):
return tf.pow(w,3)
var_foo = tf.Variable(5, dtype=tf.float32)
bar = my_op(var_foo)
g = tf.get_default_graph()
with g.gradient_override_map({'Identity': 'MyCustomGradient'}):
bar = tf.identity(bar)
g = tf.gradients(bar, var_foo)
with tf.Session() as sess:
sess.run(tf.global_variables_initializer())
print(sess.run(g))
I was expecting _custom_gradient() to return the input to the op (5 in this example) but instead it seems to return op output x gradient. My custom my_op will have non-differentiable operations like tf.sign and I'd like to define my custom gradient based on the inputs. What am I doing wrong?

There is no problem with your code:
Let's first do the forward pass:
var_foo = 5 -> bar = 125 -> tf.identity(bar) = 125
Now let's backpropagate:
The gradient of tf.identity(bar) with respect to its argument bar equals (by your definition) to bar, that is, 125. The gradient of bar with respect to var_foo equals 3 times the square of var_foo which is 75. Multiply, and you get 9375, which is indeed the output of your code.
op.inputs[0] contains the forward-pass value of the op. In this case, the forward pass of the identity op is 125.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Implementation of the Dense Synthesizer - python

Related

How to build TF tensor with ones in specified locations - batch compatible

How does Tensorflow construct a graph from your code arguments?

What is the best approach to deal with batches within a Lambda layer?

Avoiding duplicating graph in tensorflow (LSTM model)

Using op inputs when defining custom gradients in TensorFlow

Categories

Resources