How does tensorflow handle non differentiable nodes during gradient calculation?

I understand the concept of automatic differentiation, but I couldn't find any explanation of how TensorFlow calculates the error gradient for non-differentiable functions, for example tf.where in my loss function or tf.cond in my graph. It works just fine, but I would like to understand how TensorFlow backpropagates the error through such nodes, since there is no formula to calculate the gradient from them.

In the case of tf.where, you have a function with three inputs (the condition C, the value on true T and the value on false F) and one output Out. The gradient function receives one value (the gradient with respect to Out) and has to return three values, one per input. Currently, no gradient is computed for the condition (that would hardly make sense), so you only need the gradients for T and F. Assuming the inputs and the output are vectors, imagine C[0] is True. Then Out[0] comes from T[0], and its gradient should propagate back. On the other hand, F[0] was discarded, so its gradient should be zero. If C[1] were False, then the gradient for F[1] should propagate but not the one for T[1]. In short, for T you propagate the given gradient where C is True and make it zero where C is False, and the opposite for F. If you look at the implementation of the gradient of tf.where (the Select operation), it does exactly that:
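@ops.RegisterGradient("Select")
def _SelectGrad(op, grad):
    c = op.inputs[0]  # condition
    x = op.inputs[1]  # value on true, used only as a template for the zeros
    zeros = array_ops.zeros_like(x)
    # No gradient for the condition; grad goes to T where c is True, to F elsewhere
    return (None, array_ops.where(c, grad, zeros),
            array_ops.where(c, zeros, grad))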
Note that the input values themselves are not used in the computation; that is handled by the gradients of the operations producing those inputs. For tf.cond, the code is a bit more complicated, because the same operation (Merge) is used in different contexts, and tf.cond also uses Switch operations internally. However, the idea is the same. Essentially, a Switch operation is used for each input, so the branch that was activated (the first if the condition was True and the second otherwise) gets the received gradient, and the other branch gets a "switched off" gradient (like None) that does not propagate back further.
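To make the routing rule for tf.where concrete, here is a minimal framework-free sketch in plain Python. It is not TensorFlow's code: the names select and select_grad are made up for illustration, and plain lists stand in for tensors.

```python
def select(cond, t, f):
    """Forward pass: take t[i] where cond[i] is True, f[i] otherwise."""
    return [ti if ci else fi for ci, ti, fi in zip(cond, t, f)]


def select_grad(cond, grad):
    """Backward pass: route the incoming gradient to the input that was chosen.

    Returns one entry per input (cond, t, f); the condition gets no gradient.
    """
    grad_t = [gi if ci else 0.0 for ci, gi in zip(cond, grad)]
    grad_f = [0.0 if ci else gi for ci, gi in zip(cond, grad)]
    return None, grad_t, grad_f


cond = [True, False, True]
out = select(cond, [1.0, 2.0, 3.0], [10.0, 20.0, 30.0])
grad_c, grad_t, grad_f = select_grad(cond, [0.1, 0.2, 0.3])
print(out, grad_t, grad_f)
```

As in the real gradient function, the values of T and F never appear in the backward pass; only the condition decides where the incoming gradient goes.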

Related

Why the gradients are unconnected in the following function?

I am implementing a custom operation whose gradients must be calculated. The following is the function:
def difference(prod,box):
    result = tf.Variable(tf.zeros((prod.shape[0],box.shape[1]),dtype=tf.float16))
    for i in tf.range(0,prod.shape[0]):
        for j in tf.range(0,box.shape[1]):
            result[i,j].assign((tf.reduce_prod(box[:,j])-tf.reduce_prod(prod[i,:]))/tf.reduce_prod(box[:,j]))
    return result
I am unable to calculate the gradients with respect to box; tape.gradient() returns None. Here is the code I wrote to calculate the gradients:
prod = tf.constant([[3,4,5],[4,5,6],[1,3,3]],dtype=tf.float16)
box = tf.Variable([[4,5],[5,6],[5,7]],dtype=tf.float16)
with tf.GradientTape() as tape:
    tape.watch(box)
    loss = difference(prod,box)
print(tape.gradient(loss,box))
I am not able to find the reason for unconnected gradients. Is the result variable causing it? Kindly suggest an alternative implementation.

Yes. In order to calculate gradients, we need a set of (differentiable) operations on your variables.
You should rewrite difference as a function of the two input tensors. I think (though I am happy to confess I am not 100% sure!) that it is the use of assign that makes the gradient tape fall over.
Perhaps something like this:
def difference(prod, box):
    box_red = tf.reduce_prod(box, axis=0)    # product of each column of box
    prod_red = tf.reduce_prod(prod, axis=1)  # product of each row of prod
    return (tf.expand_dims(box_red, 0) - tf.expand_dims(prod_red, 1)) / tf.expand_dims(box_red, 0)
would get you the desired result
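As a sanity check on that rewrite, the following plain-Python sketch (no TensorFlow, nested lists standing in for tensors) compares the original double loop with the reduce-then-broadcast form. The helper names _product, difference_loop and difference_vectorized are hypothetical, and the sketch only checks that both forms compute the same values; it says nothing about gradient tracking.

```python
from functools import reduce
from operator import mul


def _product(values):
    """Product of an iterable of numbers."""
    return reduce(mul, values, 1)


def difference_loop(prod, box):
    """Element-by-element version, mirroring the loop in the question."""
    rows, cols = len(prod), len(box[0])
    result = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            box_col = _product(row[j] for row in box)
            result[i][j] = (box_col - _product(prod[i])) / box_col
    return result


def difference_vectorized(prod, box):
    """Reduce first, then combine every row product with every column product."""
    box_red = [_product(row[j] for row in box) for j in range(len(box[0]))]
    prod_red = [_product(row) for row in prod]
    return [[(b - p) / b for b in box_red] for p in prod_red]


prod = [[3, 4, 5], [4, 5, 6], [1, 3, 3]]
box = [[4, 5], [5, 6], [5, 7]]
print(difference_loop(prod, box) == difference_vectorized(prod, box))
```

Because the second form is built only from reductions, subtraction and division on the inputs, its TensorFlow counterpart gives the tape a chain of differentiable operations from box to the result.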

Using stop_gradient with AdamOptimizer in TensorFlow

I am trying to implement a training/fine-tuning framework in which a certain set of parameters stays fixed in each backpropagation iteration. I want to be able to change the set of updated or fixed parameters from iteration to iteration. The TensorFlow method tf.stop_gradient, which apparently forces the gradients of some parameters to stay zero, is very useful for this purpose, and it works perfectly well with different optimizers as long as the set of updated or fixed parameters does not change between iterations. It can also handle a varying set of updated or fixed parameters if it is used with stochastic gradient descent. My problem is that tf.stop_gradient cannot handle such cases when used with the Adam optimizer. More specifically, it does keep the gradients of the fixed parameters at zero in the output of compute_gradients, but when the gradients are applied (apply_gradients), the values of the fixed parameters do change. I suppose this is because the optimization step in the Adam optimizer is not zero even if the gradient is zero (based on Algorithm 1 in Kingma and Ba's paper). Is there a cheap way of freezing a variable set of parameters in each Adam iteration, without explicitly saving the previous iteration's values of the fixed parameters?
More Details:
Suppose I have a single-layer network with a weight matrix variable W and a binary mask matrix placeholder MW that specifies which elements of W should get updated in each iteration (a value of 1 marks an element to update). Instead of using W directly to write the input/output relationship of this layer, I modify it as below:
masked_W = MW*W + tf.stop_gradient(tf.abs(1-MW)*W)
to mask certain elements of W from having non-zero gradients. Then I use masked_W to form the output of the layer, and consequently the loss of the network depends on this masked variable. The point is that MW changes in each iteration. Suppose W is a vector of 4 elements initialized to the all-zero vector. Here is what happens:
opt = tf.train.AdamOptimizer(1e-5)
sess.run(tf.global_variables_initializer())
grads_vars = opt.compute_gradients(loss, W)
# initial value of W=[0,0,0,0]
# first iteration:
MW_val = [0,1,1,0]
feed_dict = {MW: MW_val, x: batch_of_data, y_: batch_of_labels}
sess.run(opt.apply_gradients(grads_vars), feed_dict=feed_dict)
# gradient of W=[0,xx,xx,0]
# new value of W=[0,a,b,0]
where xx are some non-zero gradient values, and a and b are new values of updating elements of W. In the second iteration, we change the value assigned to the binary mask matrix MW to [1,0,0,1], hence we expect to have fixed values for W[1] and W[2] and updating values for W[0] and W[3]. But this is what happens:
# second iteration
MW_val = [1,0,0,1]
feed_dict = {MW: MW_val, x: batch_of_data, y_: batch_of_labels}
sess.run(opt.apply_gradients(grads_vars), feed_dict=feed_dict)
# gradient of W=[xx,0,0,xx]
# new value of W=[c,aa,bb,d]
That is, although the gradients of W[1] and W[2] are zero, they get new values (aa != a and bb != b). When changing the optimizer from Adam to SGD, the values of fixed parameters stay the same as expected.

I found a solution to my question and am sharing it here in case others find it useful. After the first iteration, the moment estimates of the parameters that were updated in the first iteration are already non-zero. Therefore, even if their gradients are set to zero in the second iteration, they will still be updated because of their non-zero moment tensors. To prevent the updates, using tf.stop_gradient alone is not enough; we have to reset their moments as well. For the Adam optimizer, this can be done through the optimizer's get_slot method: opt.get_slot(par, 'm') and opt.get_slot(par, 'v') give access to the first and second moment tensors of parameter par, respectively. In the example from the question, we have to add the following lines to freeze W[1] and W[2] in the second iteration:
# moments of W after the first iteration, fetched as NumPy arrays
m_vals = sess.run(opt.get_slot(W, 'm'))
v_vals = sess.run(opt.get_slot(W, 'v'))
# mask copies of them before running the second iteration
masked_m_vals = m_vals.copy()
masked_v_vals = v_vals.copy()
masked_m_vals[[1,2]] = 0
masked_v_vals[[1,2]] = 0
sess.run(opt.get_slot(W, 'm').assign(masked_m_vals))
sess.run(opt.get_slot(W, 'v').assign(masked_v_vals))
It is better to save the masked moments (in the example above, m_vals[[1,2]] and v_vals[[1,2]]), so that if we relax the fixing constraint on W[1] and W[2] in the third iteration, we can restore their moments to the values they had after the first iteration.
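The effect can be reproduced without TensorFlow. The following is a minimal scalar sketch of the Adam update from Algorithm 1 in Kingma and Ba's paper; the function name adam_step and the hyperparameter values are assumptions chosen for illustration, not TensorFlow's implementation.

```python
import math


def adam_step(w, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One scalar Adam update at step t (1-based). Returns the new (w, m, v)."""
    m = b1 * m + (1 - b1) * g          # first moment estimate
    v = b2 * v + (1 - b2) * g * g      # second moment estimate
    m_hat = m / (1 - b1 ** t)          # bias correction
    v_hat = v / (1 - b2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v


# Iteration 1: the parameter receives a non-zero gradient.
w1, m1, v1 = adam_step(0.0, 0.5, 0.0, 0.0, t=1)

# Iteration 2, gradient forced to zero but moments kept: the parameter still moves.
w2_kept, _, _ = adam_step(w1, 0.0, m1, v1, t=2)

# Iteration 2, gradient forced to zero and moments reset: the parameter stays put.
w2_reset, _, _ = adam_step(w1, 0.0, 0.0, 0.0, t=2)

print(w2_kept != w1, w2_reset == w1)
```

With plain SGD the update is just the learning rate times the gradient, which is why a zero gradient alone is enough to freeze a parameter there.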

Alternatively you can pass different subsets of the variables to apply_gradients when you want to update different subsets of variables.

How to use tensorflow to approximate hessian matrix's norm

I wonder whether there is any method to recompute gradients with updated weights within a graph, or whether there is a better way to do this. For example, to estimate the Hessian norm, we need to compute
delta ~ N(0, I)
hessian_norm = 1/M \sum_{1}^{M} (gradient(f(x+delta)) - gradient(f(x-delta))) / (2*delta)
so we need the gradient value at x+delta. Currently we get None if we use tf.gradients on var+delta directly.
More specifically, if we define
a = tf.Variable
b = some_function(a)
grad = tf.gradients(b, a)
that's a normal gradient computation but if we do
grad_delta = tf.gradients(b, a+delta)
it will return None. This feature seems to make it impossible to approximate the hessian norm using the above method.

b is not a function of a+delta, so you get None. You either need to create a new value b2 that depends on a+delta, or just move your variable a by delta and evaluate again to get the second value.
This is similar to how you would do a line search in TensorFlow.
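The idea of evaluating the gradient again at shifted points can be sketched in plain Python for a scalar function. Everything here is an assumption for illustration: f(x) = x**3 is an arbitrary example, grad_f is its hand-written gradient, and second_derivative_estimate is a hypothetical helper. In graph terms, each call to grad_f at a shifted point plays the role of a new output (such as b2) built from a+delta.

```python
def grad_f(x):
    """Hand-written gradient of the example function f(x) = x**3."""
    return 3.0 * x ** 2


def second_derivative_estimate(grad, x, delta):
    """Central difference of the gradient, evaluated at x + delta and x - delta."""
    return (grad(x + delta) - grad(x - delta)) / (2.0 * delta)


estimate = second_derivative_estimate(grad_f, 2.0, 1e-3)
print(estimate)  # close to the exact second derivative 6 * x = 12 at x = 2
```

The key point is that the gradient function is called on the shifted input, rather than asking for the gradient of an existing output with respect to a tensor it never depended on.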

Alternative plan of tf.floor

One of my operations needs an integer, but the output of the convolution is a float.
That means I need to use tf.floor, tf.ceil, tf.cast, etc. to handle it.
But these operations produce None gradients, since operations like tf.floor are not differentiable.
So I tried the following.
First: a detour
out1 = tf.subtract(vif, tf.subtract(vif, tf.floor(vif)))
But the output of test.compute_gradient_error is 500 or 0; I don't think this is a reasonable gradient.
Second: override the gradient function of floor
import numpy as np
import tensorflow as tf
from tensorflow.python.framework import ops

@ops.RegisterGradient("CustomFloor")
def _custom_floor_grad(op, grads):
    # pass the incoming gradient through unchanged
    return [grads]

A, B = 50, 7
shape = [A, B]
f = np.ones(shape, dtype=np.float32)
vif = tf.constant(f, dtype=tf.float32)
# out1 = tf.subtract(vif, tf.subtract(vif, tf.floor(vif)))
with tf.get_default_graph().gradient_override_map({"Floor": "CustomFloor"}):
    out1 = tf.floor(vif)
with tf.Session() as sess:
    err1 = tf.test.compute_gradient_error(vif, shape, out1, shape)
print(err1)
The output of test.compute_gradient_error is 500 or 1, so this doesn't work either.
Question: is there a way to get integer values while keeping backpropagation working? (Values like 2.0 or 5.0 are OK.)

In general, it is inadvisable to solve discrete problems with gradient descent. You should be able to express integer solvers in TF to some extent, but you are more or less on your own.
FWIW, the floor function looks like a staircase. Its derivative is zero everywhere except at the integers, where it has a Dirac impulse pointing upwards, like a rake if you wish. The Dirac functional has a finite integral but no finite value. (The saw is x - floor(x): its derivative is the constant 1, with a downward Dirac impulse at every integer.)
The canonical way to tackle these problems is to relax the hard floor constraint with something that is (at least once) differentiable (smooth).
There are multiple ways to do this. Perhaps the most popular are:
Hack up a function that looks like what you want. For instance a piece-wise linear function that slopes down quickly, but not vertically.
Replace step functions by sigmoids
Use a filter approximation which is well understood if it's a time series
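As an illustration of the sigmoid option, here is a minimal plain-Python sketch of a smooth stand-in for floor, built as one sigmoid step per integer. The names soft_floor and soft_floor_grad, the sharpness value and the range limit max_k are all assumptions made for this example; it is one possible relaxation, not a standard TensorFlow function.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def soft_floor(x, max_k=10, sharpness=20.0):
    """Smooth approximation of floor(x) for 0 <= x < max_k.

    Each integer k contributes a sigmoid step centered at k; a larger
    sharpness makes every step closer to a hard jump.
    """
    return sum(sigmoid(sharpness * (x - k)) for k in range(1, max_k + 1))


def soft_floor_grad(x, max_k=10, sharpness=20.0):
    """Analytic derivative of soft_floor: finite everywhere, largest at integers."""
    total = 0.0
    for k in range(1, max_k + 1):
        s = sigmoid(sharpness * (x - k))
        total += sharpness * s * (1.0 - s)
    return total


print(soft_floor(2.5), soft_floor_grad(2.5), soft_floor_grad(3.0))
```

The trade-off is the usual one for relaxations: a sharper step is closer to the true floor but has a near-zero gradient away from the integers, while a softer step gives more gradient signal at the cost of non-integer outputs near each jump.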

Tensorflow's gradient_override_map function

Can someone explain the gradient_override_map function in TensorFlow to me?
I couldn't understand its usage precisely.
I see code that uses it like this:
with G.gradient_override_map({"Floor": "Identity"}):
    return tf.reduce_mean(SomeVals) * SomeOtherVal
What exactly is happening here? What is Identity?

Both "Floor" and "Identity" are type strings of operations; the former corresponds to tf.floor and the latter to tf.identity. So the purpose of your code, I guess, is to substitute tf.identity's back-propagated gradient (BPG for short) calculation mechanism for the BPG calculation mechanism of tf.floor operations within graph G, while passing the output of tf.reduce_mean forward. It seems a little weird, since in all applications of gradient_override_map I've found so far, the key of op_type_map is always identical to the type string of the operation used to produce an output in the context. By this I mean I'm more familiar with scenarios where tf.floor(SomeVals) is returned, instead of tf.reduce_mean(SomeVals).
What gradient_override_map({op_A_type: op_B_type}) does is replace op_A's BPG calculation mechanism with op_B's while retaining op_A_type's forward propagation calculation mechanism. A common application of gradient_override_map is shown in lahwran's answer.
@tf.RegisterGradient("CustomGrad")
def _const_mul_grad(unused_op, grad):
    return 5.0 * grad

g = tf.get_default_graph()
with g.gradient_override_map({"Identity": "CustomGrad"}):
    output = tf.identity(input, name="Identity")
In
@tf.RegisterGradient("CustomGrad")
def _const_mul_grad(unused_op, grad):
    return 5.0 * grad
the decorator tf.RegisterGradient("CustomGrad") registers the gradient function defined by _const_mul_grad(unused_op, grad) for a custom op type, "CustomGrad",
while
g = tf.get_default_graph()
with g.gradient_override_map({"Identity": "CustomGrad"}):
    output = tf.identity(input, name="Identity")
ensures that the outputs of all operations with type string "Identity" (tf.identity) created inside the with block (in graph g) are unchanged, whereas the BPG calculation mechanism of tf.identity is replaced by the BPG calculation mechanism of the operation with type string "CustomGrad".
P.S.
The type string of an op corresponds to the OpDef.name field of the proto that defines the operation. To find an op's OpDef.name, please refer to MingXing's answer under this question.
It is not necessary to declare the name of the tf.identity operation, since the name argument of tf.identity is optional.
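The "forward pass of op A, backward pass of op B" idea can be mimicked with a toy registry in plain Python. This is only an analogy under stated assumptions: OPS and run_op are hypothetical names, the registry is not how TensorFlow stores gradients, and the toy "Floor" gradient is simply taken to be zero.

```python
import math

# Hypothetical toy registry: op type string -> (forward function, gradient function).
OPS = {
    "Floor": (math.floor, lambda x, grad: 0.0),        # assumed zero gradient in this toy
    "Identity": (lambda x: x, lambda x, grad: grad),   # passes the gradient through
}


def run_op(op_type, x, upstream_grad, override_map=None):
    """Return (forward value, gradient w.r.t. x), honoring an optional override map.

    The forward function always comes from op_type; only the gradient
    function is looked up through the override map.
    """
    forward, _ = OPS[op_type]
    grad_type = (override_map or {}).get(op_type, op_type)
    _, backward = OPS[grad_type]
    return forward(x), backward(x, upstream_grad)


plain = run_op("Floor", 2.7, 1.0)
overridden = run_op("Floor", 2.7, 1.0, {"Floor": "Identity"})
print(plain, overridden)
```

In both calls the forward value is the floor of the input; only the gradient changes, which is the behaviour the answer describes for gradient_override_map.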

As best as I can tell, gradient_override_map allows you to say "in this context, any time you would use the gradient of X, instead use the gradient of Y", which means you still need the gradient of Y to be the gradient you want to use.
This is an example I've seen floating around while looking for how this works:
@tf.RegisterGradient("CustomGrad")
def _const_mul_grad(unused_op, grad):
    return 5.0 * grad

g = tf.get_default_graph()
with g.gradient_override_map({"Identity": "CustomGrad"}):
    output = tf.identity(input, name="Identity")
cite: https://stackoverflow.com/a/43948872/1102705
RegisterGradient() allows you to register the gradient of a new op you're defining, thereby allowing you to have an op that has the gradient you wanted, and then you can use that op in the gradient override map. It's kind of clunky - you're defining an op with no forward pass.
Something I'm not clear on is whether the name="Identity" is actually necessary.
