Avoiding duplicating graph in tensorflow (LSTM model)

I have the following simplified code (actually, unrolled LSTM model):
def func(a, b):
    with tf.variable_scope('name'):
        res = tf.add(a, b)
        print(res.name)
        return res
func(tf.constant(10), tf.constant(20))
Whenever I run the last line, it seems to change the graph, but I don't want the graph to change. My actual code is a neural network model that is too large to post, so I've reduced it to the code above. I want to call func without changing the model's graph, but the graph does change. I have read about variable scopes in TensorFlow, but it seems I haven't understood them at all.

You should take a look at the source code of tf.nn.dynamic_rnn, specifically the _dynamic_rnn_loop function in python/ops/rnn.py, which solves the same problem. To avoid blowing up the graph, it uses tf.while_loop to reuse the same graph ops for new data. This approach adds several restrictions, however: the shapes of the tensors passed through the loop must stay invariant across iterations, unless they are relaxed explicitly through shape_invariants. See this example from the tf.while_loop documentation:
i0 = tf.constant(0)
m0 = tf.ones([2, 2])
c = lambda i, m: i < 10
b = lambda i, m: [i+1, tf.concat([m, m], axis=0)]
tf.while_loop(
    c, b, loop_vars=[i0, m0],
    shape_invariants=[i0.get_shape(), tf.TensorShape([None, 2])])

Related

How to build TF tensor with ones in specified locations - batch compatible

I apologize for the poor question title but I'm not sure quite how to phrase it. Here's the problem I'm trying to solve: I have two NNs working off of the same input dataset in my code. One of them is a traditional network while the other is used to limit the acceptable range of the first. This works by using a tf.where() statement which works fine in most cases, such as this toy example:
pcts = tf.constant([0.04, 0.06, 0.06, 0.06, 0.06, 0.06, 0.06, 0.04, 0.04, 0.04])
legal_actions = tf.where(pcts >= 0.05, tf.ones_like(pcts), tf.zeros_like(pcts))
Which gives the correct result: legal_actions = [0,1,1,1,1,1,1,0,0,0]
I can then multiply this by the output of my first network to limit its Q values to only those of the legal actions. In a case like the above this works great.
However, it is also possible that my original vector looks something like this, with low values in the middle of the high values: pcts= [0.04,0.06,0.06,0.04,0.04,0.06,0.06,0.04,0.04,0.04]
Using the same code as above my legal_actions comes out as this: legal_actions = [0,1,1,0,0,1,1,0,0,0]
Based on the code I have this is correct, however, I'd like to include any zeros in the middle as part of my legal_actions. In other words, I'd like this second example to be the same as the first. Working in basic TF this is easy to do in several different ways, such as in this reproducible example (it's also easy to do with sparse tensors):
import tensorflow as tf
pcts= tf.placeholder(tf.float32, shape=(10,))
legal_actions = tf.where(pcts>=0.05, tf.ones_like(pcts), tf.zeros_like(pcts))
mask = tf.where(tf.greater(legal_actions,0))
legals = tf.cast(tf.range(tf.reduce_min(mask),tf.reduce_max(mask)+1),tf.int64)
oh = tf.one_hot(legals,10)
oh = tf.reduce_sum(oh,0)
with tf.Session() as sess:
    print(sess.run(oh, feed_dict={pcts: [0.04, 0.06, 0.06, 0.04, 0.04, 0.06, 0.06, 0.04, 0.04, 0.04]}))
The problem that I'm running into is when I try to apply this to my actual code which is reading in batches from a file. I can't figure out a way to fill in the "gaps" in my tensor without the range function and/or I can't figure out how to make the range function work with batches (it will only make one range at a time, not one per batch, as near as I can tell). Any suggestions on how to either make what I'm working on work or how to solve the problem a completely different way would be appreciated.

Try this code:
import tensorflow as tf
pcts = tf.random.uniform((2,3,4))
a = pcts>=0.5
shape = tf.shape(pcts)[-1]
a = tf.reshape(a, (-1, shape))
a = tf.cast(a, dtype=tf.float32)
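def rng(t):
    # Running maximum from the left and from the right; tf.maximum works on
    # tensors in both eager and graph mode, unlike Python's built-in max.
    left = tf.scan(lambda a, x: tf.maximum(a, x), t)
    right = tf.scan(lambda a, x: tf.maximum(a, x), t, reverse=True)
    # A position is 1 only if there is a 1 at or before it and at or after it.
    return tf.minimum(left, right)
a = tf.map_fn(lambda x: rng(x), a)
a = tf.reshape(a, tf.shape(pcts))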
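The idea behind the answer can be checked without TensorFlow. Below is a minimal plain-Python sketch of the same logic (running maximum from the left, running maximum from the right, then the elementwise minimum), applied row by row to a batch. The helper names fill_gaps and fill_gaps_batch are made up for this illustration and are not part of any library.

```python
from itertools import accumulate


def fill_gaps(row):
    """Set every position between the first and last 1 of a 0/1 row to 1."""
    # Running maximum scanning left to right: 1 from the first 1 onwards.
    left = list(accumulate(row, max))
    # Running maximum scanning right to left: 1 up to the last 1.
    right = list(accumulate(reversed(row), max))[::-1]
    # Only positions covered by both scans lie inside the [first, last] span.
    return [min(l, r) for l, r in zip(left, right)]


def fill_gaps_batch(batch):
    """Apply fill_gaps independently to each row of a batch."""
    return [fill_gaps(row) for row in batch]


batch = [
    [0, 1, 1, 0, 0, 1, 1, 0, 0, 0],
    [0, 1, 1, 1, 1, 1, 1, 0, 0, 0],
]
print(fill_gaps_batch(batch))
```

Both rows come out as [0, 1, 1, 1, 1, 1, 1, 0, 0, 0], which is the behaviour the question asks for.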

Tracing in Tensorflow

I read the following statement when covering AutoGraph and tracing in TensorFlow.
TensorFlow will only capture for loops that iterate over a tensor or a
dataset. So make sure you use for i in tf.range(x) rather than for i
in range(x), or else the loop will not be captured in the graph.
Instead, it will run during tracing.
(This may be what you want if the for loop is meant to build the graph, for example to create each layer in a neural network.)
I am confused as to what exactly happens. If the loop runs during tracing, how is it not registered in the graph, and how would the for loop then build the graph?

An example which shows the difference between a tf.range loop and a range loop:
for i in tf.range(3):
    x = tf.add(x, i)
results in a graph which contains a tf.while_loop that matches the for loop; this is the translation that AutoGraph makes:
def cond(i, x):
    return tf.less(i, 3)
def body(i, x):
    x = tf.add(x, i)
    return i + 1, x
tf.while_loop(cond, body, ...)
In turn:
for i in range(3):
    x = tf.add(x, i)
results in a graph which contains three tf.add calls, with i substituted by constants and without any loop ops:
x = tf.add(x, 0)
x = tf.add(x, 1)
x = tf.add(x, 2)
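To make "it runs during tracing" concrete, here is a toy sketch that does not use TensorFlow at all. ToyGraph is a hypothetical stand-in for a graph under construction: calling its add method only records an op. Because the Python range loop executes while the graph is being built, each iteration records its own op and no loop op is ever recorded.

```python
class ToyGraph:
    """Hypothetical stand-in for a traced graph: it only records op names."""

    def __init__(self):
        self.ops = []

    def add(self, x, i):
        # "Running during tracing" here just means appending an op record.
        self.ops.append(f"add({x}, {i})")
        # Return a symbolic name for the op's output.
        return f"t{len(self.ops)}"


def trace_python_loop(graph, x="x"):
    # This is an ordinary Python loop, so it executes now, at trace time.
    for i in range(3):
        x = graph.add(x, i)
    return x


graph = ToyGraph()
trace_python_loop(graph)
print(graph.ops)
```

The recorded list holds three add entries with the constants 0, 1 and 2 baked in, which mirrors the unrolled graph shown above. This is also why a Python loop is the natural way to create one layer per iteration.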

Implementation of the Dense Synthesizer

I'm trying to understand the Synthesizer paper (https://arxiv.org/pdf/2005.00743.pdf), which describes a Dense Synthesizer mechanism that is meant to replace the traditional attention mechanism described in the Transformer architecture.
The Dense Synthesizer is described as such:
So I tried to implement the layer and it looks like this but I’m not sure whether I’m getting it right:
import torch
import torch.nn as nn
import torch.nn.functional as F

class DenseSynthesizer(nn.Module):
    def __init__(self, l, d):
        super(DenseSynthesizer, self).__init__()
        self.linear1 = nn.Linear(d, l)
        self.linear2 = nn.Linear(l, l)

    def forward(self, x, v):
        # Equation (1) and (2)
        # Shape: l x l
        b = self.linear2(F.relu(self.linear1(x)))
        # Equation (3)
        # [l x l] x [l x d] -> [l x d]
        return torch.matmul(F.softmax(b, dim=-1), v)
Usage:
l, d = 4, 5
x, v = torch.rand(l, d), torch.rand(l, d)
synthesis = DenseSynthesizer(l, d)
synthesis(x, v)
Example:
x and v are tensors:
x = tensor([[0.0844, 0.2683, 0.4299, 0.1827, 0.1188],
[0.2793, 0.0389, 0.3834, 0.9897, 0.4197],
[0.1420, 0.8051, 0.1601, 0.3299, 0.3340],
[0.8908, 0.1066, 0.1140, 0.7145, 0.3619]])
v = tensor([[0.3806, 0.1775, 0.5457, 0.6746, 0.4505],
[0.6309, 0.2790, 0.7215, 0.4283, 0.5853],
[0.7548, 0.6887, 0.0426, 0.1057, 0.7895],
[0.1881, 0.5334, 0.6834, 0.4845, 0.1960]])
A forward pass through the dense synthesizer returns:
>>> synthesis = DenseSynthesizer(l, d)
>>> synthesis(x, v)
tensor([[0.5371, 0.4528, 0.4560, 0.3735, 0.5492],
[0.5426, 0.4434, 0.4625, 0.3770, 0.5536],
[0.5362, 0.4477, 0.4658, 0.3769, 0.5468],
[0.5430, 0.4461, 0.4559, 0.3755, 0.5551]], grad_fn=<MmBackward>)
Is the implementation and understanding of the dense synthesizer correct?
Theoretically, how is that different from a multi-layered perceptron that takes in two different inputs and makes use of them at different points in the forward propagation?

Is the implementation and understanding of the dense synthesizer correct?
Not exactly, linear1 = nn.Linear(d,d) according to the paper and not (d,l).
Of course this does not work if X.shape = (l,d) according to matrix multiplication rules.
This is because, in the paper, F is applied to each X_i in X, for i in [1, l].
The resulting matrix B is then passed to the softmax function and multiplied by G(x).
So you'd have to modify your code to sequentially process the input then use the returned matrix to compute Y.
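The equation images from the original post are not preserved, so the following is only a reconstruction from the description above: F is applied to each row X_i, the stacked result B goes through a softmax, and the result multiplies G(X). The two-layer ReLU form of F and the parameter names W_1, b_1, W_2, b_2 are assumptions chosen to match the question's code, with W_1 taken as d x d (as this answer states) and W_2 as mapping d to l so that B is l x l.

```latex
% Assumed notation: X \in \mathbb{R}^{\ell \times d}, \sigma_R is ReLU.
B_i = F(X_i) = W_2\,\sigma_R\!\left(W_1 X_i + b_1\right) + b_2,
\qquad i = 1, \dots, \ell,
\qquad B \in \mathbb{R}^{\ell \times \ell}

Y = \operatorname{Softmax}(B)\, G(X),
\qquad Y \in \mathbb{R}^{\ell \times d}
```

Here G(X) plays the role that v plays in the question's forward method.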
how is that different from a multi-layered perceptron that takes in two different inputs and makes use of them at different points in the forward propagation?
To understand, we need to put things into context, the idea of introducing attention mechanism was first described here in the context of Encoder - Decoder : https://arxiv.org/pdf/1409.0473.pdf
The core idea is to allow the model to have control over how the context vector from the encoder is retrieved using a neural network instead of relying solely on the last encoded state :
see this post for more detail.
The Transformer introduced the idea of using "Multi-Head Attention" to reduce the computational burden and focus solely on the attention mechanism itself:
https://arxiv.org/pdf/1706.03762.pdf
So where does the Dense Synthesizer fit into all of that?
It simply replaces the Dot product (as illustrated in the first pictures in your post) by F(.). If you replace what's inside the softmax by F you get the equation for Y
Conclusion
This is an MLP, but applied stepwise to the input in the context of sequence processing.
Thank you

What is the best approach to deal with batches within a Lambda layer?

I created a neural network with Keras, and added a Lambda layer to perform some calculations, but it is showing a poor performance on inferences.
I was able to make the inferences successfully using a batch of one input and added one more loop to handle multiple inputs. Everything works fine, but the performance is somewhat poor. I figured using a larger batch would make things a lot faster. My question is whether I am handling batches correctly (is it really necessary to use another loop?) as I have not found any keras or tensorflow documentation dealing with this topic in more depth.
Below is code with a structure similar to what I'm using in the Lambda layer.
def GenericFunc(x, batch=10, channels=64):
    y, group = [], []
    for i in range(batch):
        for j in range(channels):
            y.append(backend.sum(x[0, :, :, j]))
        group.append(tf.convert_to_tensor(y, dtype=np.float32))
        y = []
    yy = backend.stack(group, axis=0)
    tensor_stack = backend.reshape(yy, [batch, channels])
    return tensor_stack
Any suggestions will be welcome!

Never use loops. Tensors are made for tensor operations.
def GenericFunc(x):
    y = backend.sum(x, axis=1)
    y = backend.sum(y, axis=1)
    return y
Probably also works with
def GenericFunc(x):
    return backend.sum(x, axis=[1, 2])
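For checking shapes, it can help to spell out what summing over axes 1 and 2 computes. The sketch below is a plain-Python reference on nested lists, assuming the input is laid out as [batch][height][width][channels]; the function name sum_over_height_and_width is made up for this illustration, and the code is not meant to replace the tensor version inside the Lambda layer.

```python
def sum_over_height_and_width(x):
    """Reduce nested lists [batch][height][width][channels] to [batch][channels]."""
    result = []
    for sample in x:
        channels = len(sample[0][0])
        # For each channel, add up that channel's value at every pixel.
        result.append([
            sum(pixel[c] for row in sample for pixel in row)
            for c in range(channels)
        ])
    return result


# Two samples of 2 x 2 pixels with 3 channels; every pixel of sample b
# holds the values [b * 100 + 0, b * 100 + 1, b * 100 + 2].
x = [[[[b * 100 + c for c in range(3)] for _ in range(2)] for _ in range(2)]
     for b in range(2)]
print(sum_over_height_and_width(x))
```

Each sample has four pixels, so the result is one row per sample and one column per channel: [[0, 4, 8], [400, 404, 408]].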

The right way to define a function in theano?

Background:
Usually I define a theano function with an input like 'x = fmatrix()'. However, while modifying keras (a deep learning library based on theano) to make it work with a CTC cost, I noticed a very weird problem: if one input of the cost function is declared as
x = tensor.zeros(shape=[M,N], dtype='float32')
instead of
x = fmatrix()
the training process will converge much faster.
A simplified problem:
The full code is quite big, so I have tried to simplify the problem as follows. Take a function for computing the Levenshtein edit distance:
import theano
from theano import tensor
from theano.ifelse import ifelse
def editdist(s, t):
    def update(x, previous_row, target):
        current_row = previous_row + 1
        current_row = tensor.set_subtensor(current_row[1:], tensor.minimum(current_row[1:], tensor.add(previous_row[:-1], tensor.neq(target,x))))
        current_row = tensor.set_subtensor(current_row[1:], tensor.minimum(current_row[1:], current_row[0:-1] + 1))
        return current_row
    source, target = ifelse(tensor.lt(s.shape[0], t.shape[0]), (t, s), (s, t))
    previous_row = tensor.arange(target.size + 1, dtype=theano.config.floatX)
    result, updates = theano.scan(fn = update, sequences=source, outputs_info=previous_row, non_sequences=target, name='editdist')
    return result[-1,-1]
then I define two functions f1 and f2 like:
x1 = tensor.fvector()
x2 = tensor.fvector()
r1 = editdist(x1,x2)
f1 = theano.function([x1,x2], r1)
x3 = tensor.zeros(3, dtype='float32')
x4 = tensor.zeros(3, dtype='float32')
r2 = editdist(x3,x4)
f2 = theano.function([x3,x4], r2)
When computing with f1 and f2, the results are different:
>>> f1([1,2,3],[1,3,3])
array(1.0)
>>> f2([1,2,3],[1,3,3])
array(3.0)
f1 gives the right result, but f2 doesn't.
So my question is: what is the right way to define a theano function? And what actually went wrong with f2?
Update:
I'm using theano version 0.8.0.dev0. I just tried theano 0.7.0, and there both f1 and f2 give the correct result. Maybe this is a bug in theano?
Update_1st 1-27-2016:
According to @lamblin's explanation on this issue (https://github.com/Theano/Theano/issues/3925#issuecomment-175088918), this was actually a bug in theano, and it has been fixed in the latest (1-26-2016) version. For convenience, lamblin's explanation is quoted here:
"The first way is the most natural one, but in theory both should be equivalent.
x3 and x4 are created as the output of an "alloc" operation, the input of which would be the constant 3, rather than free inputs like x1 and x2, but that should not matter since you pass [x3, x4] as inputs to theano.function, which should cut the computation graph right there.
My guess is that scan is optimizing prematurely, believing that x3 or x4 is guaranteed to always be the constant 0, and does some simplifications that proved incorrect when values are provided for them. That would be an actual bug in scan."
Update_2nd 1-27-2016:
Unfortunately the bug is not totally fixed yet. In the background section I mentioned that if one input of the cost function is declared as tensor.zeros(), training converges much faster. I've found the reason: when the input is declared as tensor.zeros(), the cost function gives an incorrect result, though mysteriously this helps convergence.
I have put together a simplified reproduction here (https://github.com/daweileng/TheanoDebug); run ctc_bench.py to see the results.

theano.tensor.zeros(...) can't take any other value than 0.
Unless you add nodes to the graph of course and modify parts of the zeros tensor using theano.tensor.set_subtensor.
The input tensor theano.tensor.fmatrix can take any value you input.
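As an independent check of which of the two results quoted in the question is the expected one, here is a plain-Python reference for the Levenshtein distance. It is the textbook two-row algorithm, not a port of the theano scan code above, and the name edit_distance is chosen only for this sketch.

```python
def edit_distance(source, target):
    """Classic two-row Levenshtein distance between two sequences."""
    # Distance from the empty prefix of source to each prefix of target.
    previous_row = list(range(len(target) + 1))
    for i, s in enumerate(source, start=1):
        # Turning the first i source items into an empty target costs i deletions.
        current_row = [i]
        for j, t in enumerate(target, start=1):
            substitution = previous_row[j - 1] + (s != t)
            deletion = previous_row[j] + 1
            insertion = current_row[j - 1] + 1
            current_row.append(min(substitution, deletion, insertion))
        previous_row = current_row
    return previous_row[-1]


print(edit_distance([1, 2, 3], [1, 3, 3]))
```

For the inputs used in the question this prints 1, which agrees with the output of f1.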
