The right way to define a function in theano? - python

Background:
Usually I define a Theano function with inputs like 'x = fmatrix()'. However, while modifying Keras (a deep learning library based on Theano) to make it work with the CTC cost, I noticed a very weird problem: if one input of the cost function is declared as
x = tensor.zeros(shape=[M,N], dtype='float32')
instead of
x = fmatrix()
the training process will converge much faster.
A simplified problem:
The full code is quite large, so I have tried to reduce the problem to a minimal example: a function for computing the Levenshtein edit distance:
import theano
from theano import tensor
from theano.ifelse import ifelse
def editdist(s, t):
    def update(x, previous_row, target):
        current_row = previous_row + 1
        current_row = tensor.set_subtensor(current_row[1:], tensor.minimum(current_row[1:], tensor.add(previous_row[:-1], tensor.neq(target, x))))
        current_row = tensor.set_subtensor(current_row[1:], tensor.minimum(current_row[1:], current_row[0:-1] + 1))
        return current_row
    source, target = ifelse(tensor.lt(s.shape[0], t.shape[0]), (t, s), (s, t))
    previous_row = tensor.arange(target.size + 1, dtype=theano.config.floatX)
    result, updates = theano.scan(fn=update, sequences=source, outputs_info=previous_row, non_sequences=target, name='editdist')
    return result[-1, -1]
then I define two functions f1 and f2 like:
x1 = tensor.fvector()
x2 = tensor.fvector()
r1 = editdist(x1,x2)
f1 = theano.function([x1,x2], r1)
x3 = tensor.zeros(3, dtype='float32')
x4 = tensor.zeros(3, dtype='float32')
r2 = editdist(x3,x4)
f2 = theano.function([x3,x4], r2)
When computing with f1 and f2, the results are different:
>>> f1([1,2,3], [1,3,3])
array(1.0)
>>> f2([1,2,3], [1,3,3])
array(3.0)
f1 gives the right result, but f2 doesn't.
So my question is: what is the right way to define a Theano function? And what actually went wrong with f2?
Update:
I'm using Theano version 0.8.0.dev0. I just tried Theano 0.7.0, and both f1 and f2 give the correct result. Maybe this is a bug in Theano?
Update_1st 1-27-2016:
According to the explanation by @lamblin on this issue (https://github.com/Theano/Theano/issues/3925#issuecomment-175088918), this was actually a bug in Theano, and it has been fixed in the latest (1-26-2016) version. For convenience, lamblin's explanation is quoted here:
"The first way is the most natural one, but in theory both should be equivalent.
x3 and x4 are created as the output of an "alloc" operation, the input of which would be the constant 3, rather than free inputs like x1 and x2, but that should not matter since you pass [x3, x4] as inputs to theano.function, which should cut the computation graph right there.
My guess is that scan is optimizing prematurely, believing that x3 or x4 is guaranteed to always be the constant 0, and does some simplifications that proved incorrect when values are provided for them. That would be an actual bug in scan."
Update_2nd 1-27-2016:
Unfortunately the bug is not totally fixed yet. In the background section I mentioned that declaring one input of the cost function as tensor.zeros() makes convergence much faster. I've now found the reason: when the input is declared as tensor.zeros(), the cost function gives an incorrect result, though mysteriously this helps convergence.
I put together a simplified reproduction demo here (https://github.com/daweileng/TheanoDebug); run ctc_bench.py and you can see the results.

theano.tensor.zeros(...) can't take any value other than 0.
Unless, of course, you add nodes to the graph and modify parts of the zeros tensor with theano.tensor.set_subtensor.
The input tensor theano.tensor.fmatrix, on the other hand, can take any value you feed it.
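To make the distinction concrete, here is a minimal sketch of the two kinds of declarations (names are illustrative):
import theano
from theano import tensor

# a free symbolic input: its value is supplied when the compiled function is called
x = tensor.fmatrix('x')

# a graph node (an Alloc op) that always evaluates to zeros, unless you
# explicitly overwrite parts of it with tensor.set_subtensor
z = tensor.zeros((3, 3), dtype='float32')

f = theano.function([x], x.sum())  # x can take any value at call time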

Related

How to build TF tensor with ones in specified locations - batch compatible

I apologize for the poor question title but I'm not sure quite how to phrase it. Here's the problem I'm trying to solve: I have two NNs working off of the same input dataset in my code. One of them is a traditional network while the other is used to limit the acceptable range of the first. This works by using a tf.where() statement which works fine in most cases, such as this toy example:
pcts = tf.constant([0.04,0.06,0.06,0.06,0.06,0.06,0.06,0.04,0.04,0.04])
legal_actions = tf.where(pcts >= 0.05, tf.ones_like(pcts), tf.zeros_like(pcts))
Which gives the correct result: legal_actions = [0,1,1,1,1,1,1,0,0,0]
I can then multiply this by the output of my first network to limit its Q values to only those of the legal actions. In a case like the above this works great.
However, it is also possible that my original vector looks something like this, with low values in the middle of the high values: pcts= [0.04,0.06,0.06,0.04,0.04,0.06,0.06,0.04,0.04,0.04]
Using the same code as above my legal_actions comes out as this: legal_actions = [0,1,1,0,0,1,1,0,0,0]
Based on the code I have this is correct, however, I'd like to include any zeros in the middle as part of my legal_actions. In other words, I'd like this second example to be the same as the first. Working in basic TF this is easy to do in several different ways, such as in this reproducible example (it's also easy to do with sparse tensors):
import tensorflow as tf
pcts = tf.placeholder(tf.float32, shape=(10,))
legal_actions = tf.where(pcts >= 0.05, tf.ones_like(pcts), tf.zeros_like(pcts))
mask = tf.where(tf.greater(legal_actions, 0))
legals = tf.cast(tf.range(tf.reduce_min(mask), tf.reduce_max(mask) + 1), tf.int64)
oh = tf.one_hot(legals, 10)
oh = tf.reduce_sum(oh, 0)
with tf.Session() as sess:
    print(sess.run(oh, feed_dict={pcts: [0.04,0.06,0.06,0.04,0.04,0.06,0.06,0.04,0.04,0.04]}))
The problem I'm running into is when I try to apply this to my actual code, which reads in batches from a file. I can't figure out a way to fill in the "gaps" in my tensor without the range function, and I can't figure out how to make the range function work with batches (as near as I can tell, it will only make one range at a time, not one per batch). Any suggestions on how to make this approach work, or how to solve the problem in a completely different way, would be appreciated.
Try this code:
import tensorflow as tf

pcts = tf.random.uniform((2, 3, 4))
a = pcts >= 0.5
# flatten everything except the last axis, so each row is processed independently
shape = tf.shape(pcts)[-1]
a = tf.reshape(a, (-1, shape))
a = tf.cast(a, dtype=tf.float32)

def rng(t):
    # cumulative max from the left and from the right;
    # their minimum is 1 exactly between the first and the last 1 in t
    left = tf.scan(lambda acc, x: tf.maximum(acc, x), t)
    right = tf.scan(lambda acc, x: tf.maximum(acc, x), t, reverse=True)
    return tf.minimum(left, right)

a = tf.map_fn(lambda x: rng(x), a)
a = tf.reshape(a, tf.shape(pcts))
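If it helps, here is a quick sanity check of rng against the vector from the question (a sketch, assuming TF 2.x eager execution):
mask = tf.constant([[0., 1., 1., 0., 0., 1., 1., 0., 0., 0.]])
print(tf.map_fn(rng, mask))  # expected: [[0. 1. 1. 1. 1. 1. 1. 0. 0. 0.]]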

How should Euler integration be implemented in TensorFlow?

I want to write a crude Euler simulation of a set of PDEs. I read the PDE tutorial on tensorflow.org and I am a little puzzled about how to do this properly. I have two specific questions but would welcome further feedback if there is anything I have overlooked or misunderstood.
The following code is from the tutorial:
# Discretized PDE update rules
U_ = U + eps * Ut
Ut_ = Ut + eps * (laplace(U) - damping * Ut)
# Operation to update the state
step = tf.group(
    U.assign(U_),
    Ut.assign(Ut_))
Question 1
Isn't there a bug here? Once U.assign(U_) has been evaluated, surely the next evaluation of Ut_ will use the updated value of U rather than the value from the same time step? I would have thought that the correct way to do it would be as follows:
delta_U = tf.Variable(dU_init)
delta_Ut = tf.Variable(dUt_init)
delta_step = tf.group(
    delta_U.assign(Ut),
    delta_Ut.assign(laplace(U) - damping * Ut)
)
update_step = tf.group(
    U.assign_add(eps * delta_U),
    Ut.assign_add(eps * delta_Ut)
)
We could then run Euler integration steps by alternating evaluations of delta_step and update_step. If I understand correctly, this could be done via separate invocations of Session.run():
with tf.Session() as sess:
    ...
    for i in range(1000):
        sess.run(delta_step)
        sess.run(update_step)
Question 2
It seems frustrating that a single operation can't be defined that combines both steps in a fixed order, e.g.
combined_update = tf.group(delta_step, update_step)
with tf.Session() as sess:
    ...
    for i in range(1000):
        sess.run(combined_update)
but according to an answer on this thread, tf.group() does not guarantee any particular evaluation order. The approach described on that thread for controlling evaluation order involves something called "control dependencies"; can they be used in this instance, where we want to ensure that repeated evaluations of two tensors are made in a fixed order?
If not, is there another way to control the order of evaluation of these tensors, beyond explicitly using sequential Session.run() calls?
Update (12/02/2019)
Based on jdehesa's answer, I investigated in greater detail. The results support my original intuition that there is a bug in the PDE tutorial which produces incorrect results, due to the inconsistent evaluation order of tf.assign() calls, and this is not resolved by using control dependencies. However, the method from the PDE tutorial usually produces correct results, and I don't understand why.
I checked the results of running the assignment operations in an explicit order, using the following code:
import tensorflow as tf
import numpy as np

# define two variables a and b, and the PDEs that govern them
a = tf.Variable(0.0)
b = tf.Variable(1.0)
da_dt_ = b * 2
db_dt_ = 10 - a * b
dt = 0.1  # integration step size

# after one step of Euler integration, we should have
# a = 0.2 [ = 0.0 + (1.0 * 2) * 0.1 ]
# b = 2.0 [ = 1.0 + (10 - 0.0 * 1.0) * 0.1 ]

# using the method from the PDE tutorial, define updated values for a and b
a_ = a + da_dt_ * dt
b_ = b + db_dt_ * dt

# and define the update operations
assignA = a.assign(a_)
assignB = b.assign(b_)

# define a higher-order function that runs a particular simulation n times
# and summarises the results
def summarise(simulation, n=500):
    runs = np.array([simulation() for i in range(n)])
    summary = dict({(tuple(run), 0) for run in np.unique(runs, axis=0)})
    for run in runs:
        summary[tuple(run)] += 1
    return summary

# check the results of running the assignment operations in an explicit order
def explicitOrder(first, second):
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        sess.run(first)
        sess.run(second)
        return (sess.run(a), sess.run(b))

print( summarise(lambda: explicitOrder(assignA, assignB)) )
# prints {(0.2, 1.98): 500}
print( summarise(lambda: explicitOrder(assignB, assignA)) )
# prints {(0.4, 2.0): 500}
As expected, if we evaluate assignA first then a gets updated to 0.2, and this updated value is then used to update b to 1.98. If we evaluate assignB first, b is first updated to 2.0, and this updated value is then used to update a to 0.4. These are both the wrong answer to the Euler integration: what we ought to get is a = 0.2, b = 2.0.
I tested what happens when we allow the order of evaluation to be controlled implicitly by tf.group(), without using control dependencies.
noCDstep = tf.group(assignA, assignB)

def implicitOrder():
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        sess.run(noCDstep)
        return (sess.run(a), sess.run(b))

print( summarise(lambda: implicitOrder()) )
# prints, e.g. {(0.4, 2.0): 37, (0.2, 1.98): 1, (0.2, 2.0): 462}
Occasionally, this produces the same result as evaluating assignB followed by assignA, or (more rarely) evaluating assignA followed by assignB. But most of the time, there is an entirely unexpected result: the correct answer to the Euler integration step. This behaviour is both inconsistent and surprising.
I tried to resolve this inconsistent behaviour by introducing control dependencies as suggested by jdehesa, using the following code:
with tf.control_dependencies([a_, b_]):
    cdStep = tf.group(assignA, assignB)

def cdOrder():
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        sess.run(cdStep)
        return (sess.run(a), sess.run(b))

print( summarise(lambda: cdOrder()) )
# prints, e.g. {(0.4, 2.0): 3, (0.2, 1.98): 3, (0.2, 2.0): 494}
It appears that control dependencies do not resolve this inconsistency, and it is not clear that they make any difference at all. I then tried implementing the approach originally suggested in my question, which uses additional variables to enforce the computation of deltas and updates independently:
da_dt = tf.Variable(0.0)
db_dt = tf.Variable(0.0)
assignDeltas = tf.group( da_dt.assign(da_dt_), db_dt.assign(db_dt_) )
assignUpdates = tf.group( a.assign_add(da_dt * dt), b.assign_add(db_dt * dt) )

def explicitDeltas():
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        sess.run(assignDeltas)
        sess.run(assignUpdates)
        return (sess.run(a), sess.run(b))

print( summarise(lambda: explicitDeltas()) )
# prints {(0.2, 2.0): 500}
As expected, this consistently computes the Euler integration step correctly.
I can understand why sometimes tf.group(assignA, assignB) produces an answer consistent with running assignA followed by assignB, and why it sometimes produces an answer consistent with running assignB followed by assignA, but I don't understand why it usually produces an answer that is magically correct (for the Euler integration case) and consistent with neither of these orders. What is going on?
Indeed, you can make sure that things run in the order that you want using control dependencies. In this case, you just need to make sure that U_ and Ut_ are computed before the assignment operations are executed. I think (although I'm not absolutely sure) that the code in the tutorial is probably correct, and that for Ut_ to be computed with the updated U you would need to have something like:
U_ = U + eps * Ut
U = U.assign(U_)
Ut_ = Ut + eps * (laplace(U) - damping * Ut)
step = Ut.assign(Ut_)
However, whenever you want to make sure that something gets executed before something else, you can just write the dependencies explicitly:
# Discretized PDE update rules
U_ = U + eps * Ut
Ut_ = Ut + eps * (laplace(U) - damping * Ut)
# Operation to update the state
with tf.control_dependencies([U_, Ut_]):
    step = tf.group(
        U.assign(U_),
        Ut.assign(Ut_))
This will make sure that, before any of the assignment operations are executed, both U_ and Ut_ will have been computed first.
EDIT: Some additional explanation about the new snippets.
In the first snippet in your update (12/02/2019), the code runs first one assignment, then the next. As you said, this is obviously wrong, since the second update will use the already updated value of the other variable.
The second snippet, if I'm not mistaken (correct me if I'm wrong), is what the tutorial proposes: grouping the assignment operations. Since you say you have seen instances of this producing the wrong result, I suppose it is not always safe to evaluate it like this. However, it is not surprising that you frequently get the right result. Here TensorFlow will compute all the necessary values to update both variables. Since evaluation order is not deterministic (when there are no explicit dependencies), it may happen that the update of a happens before b_ is computed, for example, in which case you would get the wrong result. But it is reasonable to expect that, many times, a_ and b_ will both be computed before a and b are updated.
In the third snippet, you use control dependencies, but not in an effective manner. What your code indicates is that the group operation should not run before a_ and b_ are computed. However, that does not mean much; the group operation is pretty much a no-op with dependencies on its inputs. The control dependencies there only affect this no-op, and do not prevent the assignment operations from running earlier. As I suggested originally, you should instead put the assignment operations within the control dependencies block, to make sure that the assignments do not happen sooner than they should (in my snippet I also put the group operation within the block just for convenience, but it does not really matter whether that is in or out).
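For concreteness, here is a minimal sketch of that fix applied to the a/b example from your update (the _cd names are just illustrative):
# create the assignment ops *inside* the block, so the dependency applies
# to the assignments themselves, not just to the group no-op
with tf.control_dependencies([a_, b_]):
    assignA_cd = a.assign(a_)
    assignB_cd = b.assign(b_)
    cdStepFixed = tf.group(assignA_cd, assignB_cd)
This guarantees that both a_ and b_ are computed from the old values before either variable is overwritten, which is exactly the property the Euler step needs.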

TensorFlow gradient with tf.where returns NaN when it shouldn't

Below is reproducible code. If you run it, you will see that in the first session run the result is nan, whereas the second case gives the correct gradient value of 0.5. But given the tf.where condition specified, they should return the same value. I also simply don't understand why the gradient of tf.where is nan at 1 or -1, which seem like perfectly fine input values to me.
import numpy as np
import tensorflow as tf

tf.reset_default_graph()
x = tf.get_variable('x', shape=[1])
condition = tf.less(x, 0.0)
output = tf.where(condition, -tf.log(-x + 1), tf.log(x + 1))
deriv = tf.gradients(output, x)
with tf.Session() as sess:
    print(sess.run(deriv, {x: np.array([-1])}))

logg = -tf.log(-x + 1)
derivv = tf.gradients(logg, x)
with tf.Session() as sess:
    print(sess.run(derivv, {x: np.array([-1])}))
Thanks for comments!
As explained in the GitHub issue referenced by @mikkola, the problem stems from the internal implementation of tf.where. Basically, both alternatives (and their gradients) are computed, and only the correct part is chosen by multiplication with the condition. Alas, if the gradient is inf or nan for the part that is not selected, even when multiplied by 0 you get a nan that eventually propagates to the result.
Since the issue was filed in May 2016 (that's TensorFlow v0.7!) and has not been patched since, one can safely assume that it won't be fixed anytime soon, and start looking for workarounds.
The easiest way to fix it is to modify your statements so that they are always valid and differentiable, even for values that are not meant to be selected.
A general technique would be to clip the input value inside its valid domain. So in your case for example, you could use
cond = tf.less(x, 0.0)
output = tf.where(cond,
                  -tf.log(-tf.where(cond, x, tf.zeros_like(x)) + 1),
                  tf.log(tf.where(cond, tf.zeros_like(x), x) + 1))
In your particular case, however, it would be simpler to just use
output = tf.sign(x) * tf.log(tf.abs(x) + 1)
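As a quick sanity check of the clipped version (a sketch, assuming the same TF 1.x graph-mode setup as in the question), the gradient at x = -1 now comes out as the expected 0.5 rather than nan:
import numpy as np
import tensorflow as tf

tf.reset_default_graph()
x = tf.get_variable('x', shape=[1])
cond = tf.less(x, 0.0)
# the unselected branch now sees a safe input (0), so its gradient is finite
output = tf.where(cond,
                  -tf.log(-tf.where(cond, x, tf.zeros_like(x)) + 1),
                  tf.log(tf.where(cond, tf.zeros_like(x), x) + 1))
deriv = tf.gradients(output, x)
with tf.Session() as sess:
    print(sess.run(deriv, {x: np.array([-1])}))  # expected: [array([0.5], dtype=float32)]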

Speed of scipy fsolve in vectorised code

I have an array of size (254, 80) on which I need to use scipy's fsolve. I have found that using fsolve on a vector is quicker than a for loop, but only for vectors up to about 100 values long. After this, the speed quickly drops off and becomes very slow, sometimes stopping completely.
I'm currently looping through one dimension of the array and using a vectorised fsolve on the smaller dimension, but it's still taking longer than I would expect/like.
Does anyone have a good workaround for this, or know of a similar function that is happy handling a larger vector? Or perhaps I am doing something wrong...
Here's the current code:
for i in range(array.shape[0]):
    f = lambda y: a[i] - m[i]*y - md[i]*(( y**4 + 2*(y**2)*np.cos(Thetas[i,:]) )**0.25)
    ystar[i,:] = fsolve(f, y0[i])
(The rest of the variables are all a similar size)
Digging into this further, I have found that a function such as
f = lambda y: y*np.tanh(y) - a0/(m**2)
is faster to solve than
f = lambda y: (m**2)*y*np.tanh(y) - a0
where m and a0 are large 2D np arrays.
Can anyone explain why this is?
Thanks,
Rachael
Although no one answered, I found a workaround which avoids the fsolve function and uses interpolation instead. Luckily the initial guess is good enough that only a few y values are needed. If the initial guess is poor, this method is probably not appropriate. Note this still has some issues, but for my purposes it performs well...
ystar = np.empty((A,B))  # empty array for the solutions
num_ys = 20  # number of points to find where the solution is
y0_u = y0  # just so the calculated initial guess isn't overwritten
for i in range(Thetas.shape[1]):
    ys = np.linspace(-.05,.2,num_ys)[:,None]*np.ones((num_ys,Thetas.shape[0])) + y0_u
    vals = (np.squeeze(eta) - np.squeeze(m)*ys*np.sqrt(g*np.tanh(ys**2*depth)) - np.squeeze(md)*np.sqrt(g*np.tanh(depth*np.sqrt(ys**4+2*(ys**2)*kB*np.cos(Thetas[:,i]+phi_bi)+kB**2)))*(( ys**4+2*(ys**2)*kB*np.cos(Thetas[:,i]+phi_bi)+kB**2 )**0.25))
    # mark the pairs of adjacent grid points where vals changes sign
    idxs_important = -1*(np.clip(np.vstack(((np.sign(vals[:-1]*vals[1:])-1),np.zeros((1,Thetas[:,i].size)))),-1,0) + np.clip(np.vstack((np.zeros((1,Thetas[:,i].size)),(np.sign(vals[:-1]*vals[1:]))-1)),-1,0))
    ys_chosen = idxs_important*ys
    ys_chosen[ys_chosen==0] = 10000
    sorted_ys_idx = np.argsort(ys_chosen.T, axis = 1)
    sorted_ys = ((ys_chosen.T)[np.arange(np.shape(ys_chosen.T)[0])[:,np.newaxis],sorted_ys_idx]).T
    sorted_vals = (((vals*idxs_important).T)[np.arange(np.shape(vals.T)[0])[:,np.newaxis],sorted_ys_idx]).T
    # interpolation bit: linear interpolation between the two bracketing points
    x_id = 0
    yposs = sorted_ys[:2,:]
    valposs = sorted_vals[:2,:]
    y = yposs[0,:] + (yposs[1,:] - yposs[0,:])*(x_id - valposs[0,:])/(valposs[1,:] - valposs[0,:])
    ystar[:,i] = np.squeeze(y)
    y0_u = ystar[:,i]

Defining a gradient with respect to a subtensor in Theano

I have what is conceptually a simple question about Theano but I haven't been able to find the answer (I'll confess upfront to not really understanding how shared variables work in Theano, despite many hours with the tutorials).
I'm trying to implement a "deconvolutional network"; specifically I have a 3-tensor of inputs (each input is a 2D image) and a 4-tensor of codes; for the ith input codes[i] represents a set of codewords which together code for input i.
I've been having a lot of trouble figuring out how to do gradient descent on the codewords. Here are the relevant parts of my code:
idx = T.lscalar()
pre_loss_conv = conv2d(input = codes[idx].dimshuffle('x', 0, 1, 2),
                       filters = dicts.dimshuffle('x', 0, 1, 2),
                       border_mode = 'valid')
loss_conv = pre_loss_conv.reshape((pre_loss_conv.shape[2], pre_loss_conv.shape[3]))
loss_in = inputs[idx]
loss = T.sum(1./2.*(loss_in - loss_conv)**2)

del_codes = T.grad(loss, codes[idx])
delc_fn = function([idx], del_codes)
train_codes = function([input_index], loss, updates = [
    [codes, T.set_subtensor(codes[input_index], codes[input_index] -
                            learning_rate*del_codes[input_index]) ]])
(here codes and dicts are shared tensor variables). Theano is unhappy with this, specifically with defining
del_codes = T.grad(loss, codes[idx])
The error message I'm getting is: theano.gradient.DisconnectedInputError: grad method was asked to compute the gradient with respect to a variable that is not part of the computational graph of the cost, or is used only by a non-differentiable operator: Subtensor{int64}.0
I'm guessing that it wants a symbolic variable instead of codes[idx], but then I'm not sure how to get everything connected to get the intended effect. I'm guessing I'll need to change the final line to something like:
train_codes = function([input_index], loss, updates = [
    [codes, T.set_subtensor(codes[input_index], codes[input_index] -
                            learning_rate*del_codes) ]])
Can someone give me some pointers as to how to define this function properly? I think I'm probably missing something basic about working with Theano but I'm not sure what.
Thanks in advance!
-Justin
Update: Kyle's suggestion worked very nicely. Here's the specific code I used:
current_codes = T.tensor3('current_codes')
current_codes = codes[input_index]
pre_loss_conv = conv2d(input = current_codes.dimshuffle('x', 0, 1, 2),
                       filters = dicts.dimshuffle('x', 0, 1, 2),
                       border_mode = 'valid')
loss_conv = pre_loss_conv.reshape((pre_loss_conv.shape[2], pre_loss_conv.shape[3]))
loss_in = inputs[input_index]
loss = T.sum(1./2.*(loss_in - loss_conv)**2)
del_codes = T.grad(loss, current_codes)

train_codes = function([input_index], loss)
train_dicts = theano.function([input_index], loss, updates = [[dicts, dicts - learning_rate*del_dicts]])
codes_update = (codes, T.set_subtensor(codes[input_index], codes[input_index] - learning_rate*del_codes))
codes_update_fn = function([input_index], updates = [codes_update])

for i in xrange(num_inputs):
    current_loss = train_codes(i)
    codes_update_fn(i)
To summarize the findings:
Assigning grad_var = codes[idx], then making a new variable such as:
subgrad = T.set_subtensor(codes[input_index], codes[input_index] - learning_rate*del_codes[input_index])
Then calling
train_codes = function([input_index], loss, updates = [[codes, subgrad]])
seemed to do the trick. In general, I try to make variables for as many things as possible. Sometimes tricky problems can arise from trying to do too much in a single statement, plus it is hard to debug and understand later! Also, in this case I think theano needs a shared variable, but has issues if the shared variable is created inside the function that requires it.
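For reference, here is a compact, self-contained sketch of this pattern (the names W, i, row are hypothetical, not from the code above): take the subtensor slice once, build the loss from that same variable, take the gradient with respect to it, and write it back with set_subtensor.
import numpy as np
import theano
import theano.tensor as T
from theano import function

W = theano.shared(np.ones((4, 3), dtype=theano.config.floatX), name='W')
i = T.lscalar('i')
row = W[i]             # take the slice once and reuse this exact variable
loss = T.sum(row ** 2)
g = T.grad(loss, row)  # works: 'row' is part of the graph of 'loss'
step = function([i], loss, updates=[(W, T.set_subtensor(W[i], W[i] - 0.1 * g))])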
Glad this worked for you!
