I apologize for the poor question title, but I'm not quite sure how to phrase it. Here's the problem I'm trying to solve: I have two NNs working off of the same input dataset in my code. One of them is a traditional network, while the other is used to limit the acceptable range of the first. This works via a tf.where() statement, which is fine in most cases, such as this toy example:
pcts = tf.constant([0.04, 0.06, 0.06, 0.06, 0.06, 0.06, 0.06, 0.04, 0.04, 0.04])
legal_actions = tf.where(pcts >= 0.05, tf.ones_like(pcts), tf.zeros_like(pcts))
This gives the correct result: legal_actions = [0,1,1,1,1,1,1,0,0,0]
I can then multiply this by the output of my first network to limit its Q values to only those of the legal actions. In a case like the above this works great.
However, it is also possible that my original vector looks something like this, with low values in the middle of the high values: pcts= [0.04,0.06,0.06,0.04,0.04,0.06,0.06,0.04,0.04,0.04]
Using the same code as above my legal_actions comes out as this: legal_actions = [0,1,1,0,0,1,1,0,0,0]
Based on my code this is correct; however, I'd like to include any zeros in the middle as part of my legal_actions. In other words, I'd like this second example to produce the same result as the first. Working in basic TF this is easy to do in several different ways, such as in this reproducible example (it's also easy to do with sparse tensors):
import tensorflow as tf

pcts = tf.placeholder(tf.float32, shape=(10,))
legal_actions = tf.where(pcts >= 0.05, tf.ones_like(pcts), tf.zeros_like(pcts))
mask = tf.where(tf.greater(legal_actions, 0))
legals = tf.cast(tf.range(tf.reduce_min(mask), tf.reduce_max(mask) + 1), tf.int64)
oh = tf.one_hot(legals, 10)
oh = tf.reduce_sum(oh, 0)

with tf.Session() as sess:
    print(sess.run(oh, feed_dict={pcts: [0.04, 0.06, 0.06, 0.04, 0.04, 0.06, 0.06, 0.04, 0.04, 0.04]}))
The problem I'm running into is when I try to apply this to my actual code, which is reading in batches from a file. I can't figure out a way to fill in the "gaps" in my tensor without the range function, and I can't figure out how to make the range function work with batches (as near as I can tell, it will only make one range at a time, not one per batch). Any suggestions on how to make this approach work, or how to solve the problem a completely different way, would be appreciated.
Try this code; it fills the gaps by taking a running maximum from the left and from the right and keeping their elementwise minimum:
import tensorflow as tf

pcts = tf.random.uniform((2, 3, 4))
a = pcts >= 0.5
shape = tf.shape(pcts)[-1]
a = tf.reshape(a, (-1, shape))
a = tf.cast(a, dtype=tf.float32)

def rng(t):
    # Running max from the left and from the right; their elementwise
    # minimum is 1 exactly between the first and last legal entries.
    left = tf.scan(lambda acc, x: tf.maximum(acc, x), t)
    right = tf.scan(lambda acc, x: tf.maximum(acc, x), t, reverse=True)
    return tf.minimum(left, right)

a = tf.map_fn(rng, a)
a = tf.reshape(a, tf.shape(pcts))
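A quick 1-D check of the trick on the example from the question (a minimal sketch, TF 2.x eager assumed): the left running max marks everything from the first legal action onward, the right running max marks everything up to the last one, and their minimum fills the interior gaps.

import tensorflow as tf

mask = tf.constant([0., 1., 1., 0., 0., 1., 1., 0., 0., 0.])
left = tf.scan(lambda acc, x: tf.maximum(acc, x), mask)                 # [0,1,1,1,1,1,1,1,1,1]
right = tf.scan(lambda acc, x: tf.maximum(acc, x), mask, reverse=True)  # [1,1,1,1,1,1,1,0,0,0]
print(tf.minimum(left, right).numpy())  # [0. 1. 1. 1. 1. 1. 1. 0. 0. 0.]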
In this notebook, the author writes the following Nesterov update:
def nesterov_update(w, dw, v, lr, weight_decay, momentum):
    dw.add_(weight_decay, w).mul_(-lr)
    v.mul_(momentum).add_(dw)
    w.add_(dw.add_(momentum, v))
As I understand it, a.add(b) in PyTorch implements a + b, and a.add(b, c) implements a + (b * c), because b lands in the slot of the alpha parameter. And lastly, add_ is the in-place version of add.
Q: Am I right so far?
Then, if I were to sketch the above nesterov update in an expanded form that illustrates the logic, I would write:
dw = -lr*(dw + weight_decay*w)
v = v*momentum + dw
w = w + dw + momentum*v
Q: is this correct?
I'm not planning to use the expanded "code" above; I'm just writing it this way to communicate my understanding of what it's doing, as a check.
It is important to note the PyTorch version (1.1.0) that the tutorial is using. In 1.1.0, the function prototype for torch.add is torch.add(input, value=1, other, out=None). So, your interpretation of the following line:
dw.add_(weight_decay, w)
as dw = dw + weight_decay * w is correct. So, the answer to your first question is: yes, you are right.
However, with the latest versions of PyTorch, you would get an error if torch.add is used in the same fashion.
a = torch.FloatTensor([0, 1.0, 2.0, 3.0])
b = torch.FloatTensor([0, 4.0, 5.0, 6.0])
c = 1.0
z = a.add(b, c)
The above code gives (in PyTorch 1.5.0):
TypeError: add() takes 1 positional argument but 2 were given
However, if you perform the following, then it works fine.
z = a.add(b, alpha=c)
Note that the prototype of torch.add is now: torch.add(input, other, *, alpha=1, out=None)
The answer to your second question is, yes, you are right.
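For reference, the same update could be sketched against the current prototype, where alpha is keyword-only. This is my rewrite for newer PyTorch versions, not the notebook's code:

import torch

def nesterov_update(w, dw, v, lr, weight_decay, momentum):
    dw.add_(w, alpha=weight_decay).mul_(-lr)   # dw = -lr * (dw + weight_decay*w)
    v.mul_(momentum).add_(dw)                  # v  = momentum*v + dw
    w.add_(dw.add_(v, alpha=momentum))         # w  = w + dw + momentum*v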
I'm trying to understand the Synthesizer paper (https://arxiv.org/pdf/2005.00743.pdf), and there's a description of the Dense Synthesizer mechanism that is meant to replace the traditional attention model described in the Transformer architecture.
The Dense Synthesizer is described as such (Equations (1)-(3) in the paper): B = F(X), where F applies a two-layer feed-forward network with a ReLU in between to each position X_i, and Y = softmax(B) G(X), where G(X) plays the role of the values.
So I tried to implement the layer, and it looks like this, but I'm not sure whether I'm getting it right:
import torch
import torch.nn as nn
import torch.nn.functional as F

class DenseSynthesizer(nn.Module):
    def __init__(self, l, d):
        super(DenseSynthesizer, self).__init__()
        self.linear1 = nn.Linear(d, l)
        self.linear2 = nn.Linear(l, l)

    def forward(self, x, v):
        # Equations (1) and (2): B has shape l x l
        b = self.linear2(F.relu(self.linear1(x)))
        # Equation (3): [l x l] x [l x d] -> [l x d]
        return torch.matmul(F.softmax(b, dim=-1), v)
Usage:
l, d = 4, 5
x, v = torch.rand(l, d), torch.rand(l, d)
synthesis = DenseSynthesizer(l, d)
synthesis(x, v)
Example:
x and v are tensors:
x = tensor([[0.0844, 0.2683, 0.4299, 0.1827, 0.1188],
[0.2793, 0.0389, 0.3834, 0.9897, 0.4197],
[0.1420, 0.8051, 0.1601, 0.3299, 0.3340],
[0.8908, 0.1066, 0.1140, 0.7145, 0.3619]])
v = tensor([[0.3806, 0.1775, 0.5457, 0.6746, 0.4505],
[0.6309, 0.2790, 0.7215, 0.4283, 0.5853],
[0.7548, 0.6887, 0.0426, 0.1057, 0.7895],
[0.1881, 0.5334, 0.6834, 0.4845, 0.1960]])
And passing them through a forward pass of the dense synthesizer returns:
>>> synthesis = DenseSynthesizer(l, d)
>>> synthesis(x, v)
tensor([[0.5371, 0.4528, 0.4560, 0.3735, 0.5492],
[0.5426, 0.4434, 0.4625, 0.3770, 0.5536],
[0.5362, 0.4477, 0.4658, 0.3769, 0.5468],
[0.5430, 0.4461, 0.4559, 0.3755, 0.5551]], grad_fn=<MmBackward>)
Is the implementation and understanding of the dense synthesizer correct?
Theoretically, how is that different from a multi-layered perceptron that takes in two different inputs and makes use of them at different points in the forward propagation?
Is the implementation and understanding of the dense synthesizer correct?
Not exactly: according to the paper, linear1 should be nn.Linear(d, d), not nn.Linear(d, l).
Of course, this does not work if X.shape = (l, d) according to matrix multiplication rules.
This is because F is defined per position: B_i = F(X_i), so F is applied to each X_i in X for i in [1, l].
The resulting matrix B is then passed to the softmax function and multiplied by G(X).
So you'd have to modify your code to apply F to each position of the input, then use the resulting matrix B to compute Y. A sketch of this follows below.
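As a minimal sketch of what that could look like, assuming W1: d -> d and W2: d -> l as described above. Note that nn.Linear acts on the last dimension, so it already applies the same map to each X_i independently, which realizes the per-position F:

import torch
import torch.nn as nn
import torch.nn.functional as F

class DenseSynthesizer(nn.Module):
    def __init__(self, l, d):
        super().__init__()
        self.linear1 = nn.Linear(d, d)  # first projection of each X_i
        self.linear2 = nn.Linear(d, l)  # maps each position to l attention logits

    def forward(self, x, v):
        b = self.linear2(F.relu(self.linear1(x)))      # B: l x l
        return torch.matmul(F.softmax(b, dim=-1), v)   # Y: l x d

With x and v of shape (l, d), B comes out (l, l) and the output (l, d), matching the shapes in the paper.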
How is that different from a multi-layered perceptron that takes in two different inputs and makes use of them at different points in the forward propagation?
To understand, we need to put things into context: the idea of an attention mechanism was first described in the context of encoder-decoder architectures: https://arxiv.org/pdf/1409.0473.pdf
The core idea is to allow the model to control how the context vector from the encoder is retrieved, using a neural network, instead of relying solely on the last encoded state.
The Transformer then introduced the idea of using "Multi-Head Attention" to reduce the computational burden and focus solely on the attention mechanism itself:
https://arxiv.org/pdf/1706.03762.pdf
So where does the Dense Synthesizer fit into all of that?
It simply replaces the dot product (as illustrated in the first pictures in your post) with F(·). If you replace what's inside the softmax with F(X), you get the equation for Y.
Conclusion
This is an MLP, but one applied position-wise to the input in the context of sequence processing.
I have the following simplified code (actually, an unrolled LSTM model):
import tensorflow as tf

def func(a, b):
    with tf.variable_scope('name'):
        res = tf.add(a, b)
    print(res.name)
    return res

func(tf.constant(10), tf.constant(20))
Whenever I run the last line, it seems that it changes the graph, but I don't want the graph to change. Actually, my code is different: it's a neural network model, but it is too huge, so I've substituted the code above. I want to call func without changing the model's graph, but it changes. I've read about variable scope in TensorFlow, but it seems I haven't understood it at all.
You should take a look at the source code of tf.nn.dynamic_rnn, specifically the _dynamic_rnn_loop function in python/ops/rnn.py; it solves the same problem. In order not to blow up the graph, it uses tf.while_loop to reuse the same graph ops for new data. This approach adds several restrictions, namely that the shapes of the tensors passing through the loop must be invariant (or declared via shape_invariants). See the examples in the tf.while_loop documentation:
i0 = tf.constant(0)
m0 = tf.ones([2, 2])
c = lambda i, m: i < 10
b = lambda i, m: [i + 1, tf.concat([m, m], axis=0)]
tf.while_loop(
    c, b, loop_vars=[i0, m0],
    shape_invariants=[i0.get_shape(), tf.TensorShape([None, 2])])
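As a minimal sketch of the same idea applied to the func above (the inputs and the accumulator are made up for illustration): the add op is built once inside the loop body and reused for every (a, b) pair, instead of growing the graph with each Python call.

import tensorflow as tf

a = tf.constant([10, 20, 30])
b = tf.constant([1, 2, 3])

i0 = tf.constant(0)
acc0 = tf.zeros([0], dtype=tf.int32)

cond = lambda i, acc: i < tf.shape(a)[0]

def body(i, acc):
    res = tf.add(a[i], b[i])  # the op from func(), built once
    return [i + 1, tf.concat([acc, tf.stack([res])], axis=0)]

_, results = tf.while_loop(
    cond, body, loop_vars=[i0, acc0],
    shape_invariants=[i0.get_shape(), tf.TensorShape([None])])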
As you can see in the picture, I want to compute the total cost during one training run, but tf.equal returns a tensor, so tf.equal(y1[i], y2[i]) can never be the Python value True.
How can I use the data in the tensor?
You cannot do this in TensorFlow (PyTorch can do something like this). The reason is that TF requires a static graph, while you are trying to do dynamic evaluations.
Many people claim this static graph is a disadvantage of TF, but in fact it enables many cool features. For your use case, though, it makes the solution a little bit cumbersome.
You need to write it like:
z = tf.zeros_like(y1)
label_a = z + 2
label_b = z + 20
case_001 = tf.where(tf.equal(y1, z), label_a, z)
case_002 = tf.where(tf.equal(y2, z), label_b, z)
# '...' is the value to use when y1 == y2, left open here
switch_op = tf.where(tf.equal(y1, y2), ..., case_001 + case_002)
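For instance, a runnable version of the pattern with made-up inputs; the value used when y1 == y2 is set to z here purely for illustration, in place of the `...` above:

import tensorflow as tf

y1 = tf.constant([0., 1., 2.])
y2 = tf.constant([0., 0., 3.])

z = tf.zeros_like(y1)
case_001 = tf.where(tf.equal(y1, z), z + 2, z)    # 2 wherever y1 == 0
case_002 = tf.where(tf.equal(y2, z), z + 20, z)   # 20 wherever y2 == 0
switch_op = tf.where(tf.equal(y1, y2), z, case_001 + case_002)

with tf.Session() as sess:
    print(sess.run(switch_op))  # [ 0. 20.  0.]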