Tensorflow's gradient_override_map function - python

Can someone explain the gradient_override_map function in TensorFlow to me?
I couldn't understand its usage precisely.
I see code usage as:
with G.gradient_override_map({"Floor": "Identity"}):
    return tf.reduce_mean(SomeVals) * SomeOtherVal
What exactly is happening here? What is Identity?

Both "Floor" and "Identity" are type strings of operations, the former is corresponding to tf.floor while the latter tf.identity. So the function of your code, I guess, is to substitute tf.identity's back-propagated gradient(BPG for short) calculation mechanism for BPG calculation mechanism of tf.floor operations within graph G while passing forward output of tf.reduce_mean. It seems a little weird since in all applications of gradient_override_map I've found so far, the key of op_type_map is always identical to the type string of the operation used to produce an output in the context. By this I mean I'm more familiar with scenarios with tf.floor(SomeVals) returned, instead of tf.reduce_mean(SomeVals).
What gradient_override_map({op_A_type: op_B_type}) does is replace op_A's BPG calculation mechanism with op_B's, while keeping op_A's forward propagation calculation mechanism. A common application of gradient_override_map is shown in lahwran's answer.
#tf.RegisterGradient("CustomGrad")
def _const_mul_grad(unused_op, grad):
return 5.0 * grad
g = tf.get_default_graph()
with g.gradient_override_map({"Identity": "CustomGrad"}):
output = tf.identity(input, name="Identity")
by
#tf.RegisterGradient("CustomGrad")
def _const_mul_grad(unused_op, grad):
return 5.0 * grad
the decorator tf.RegisterGradient("CustomGrad") registers the gradient function defined by _const_mul_grad(unused_op, grad) for the customized op type "CustomGrad",
while
g = tf.get_default_graph()
with g.gradient_override_map({"Identity": "CustomGrad"}):
    output = tf.identity(input, name="Identity")
ensures that the outputs of all operations (in graph g) with type string "Identity" (tf.identity) are unchanged, while the BPG calculation mechanism of those tf.identity ops is replaced by the BPG calculation mechanism of the operation with type string "CustomGrad".
P.S.
The type string of an op corresponds to the OpDef.name field of the proto that defines the operation. To find an op's OpDef.name, please refer to MingXing's answer under this question.
It is not necessary to declare the name of tf.identity operation since the arg 'name' in tf.identity is optional.
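In TF 1.x you can also just read the type string off a concrete op, e.g. (a quick sketch, names illustrative):
x = tf.placeholder(tf.float32)
print(tf.floor(x).op.type)     # "Floor"
print(tf.identity(x).op.type)  # "Identity"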

As best as I can tell, gradient_override_map allows you to say "in this context, any time you would use the gradient of X, instead use the gradient of Y", which means you still need the gradient of Y to be the gradient you want to use.
This is an example I've seen floating around while looking for how this works:
#tf.RegisterGradient("CustomGrad")
def _const_mul_grad(unused_op, grad):
return 5.0 * grad
g = tf.get_default_graph()
with g.gradient_override_map({"Identity": "CustomGrad"}):
output = tf.identity(input, name="Identity")
cite: https://stackoverflow.com/a/43948872/1102705
RegisterGradient() allows you to register the gradient of a new op you're defining, thereby allowing you to have an op that has the gradient you wanted, and then you can use that op in the gradient override map. It's kind of clunky - you're defining an op with no forward pass.
Something I'm not clear on is whether the name="Identity" is actually necessary.
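One way to check (the answer above says the name argument is optional) is to build the op without name= and inspect the gradient; a rough TF 1.x sketch with made-up values, assuming the "CustomGrad" registration from the example:
inp = tf.constant([1.0, 2.0, 3.0])
g = tf.get_default_graph()
with g.gradient_override_map({"Identity": "CustomGrad"}):
    out = tf.identity(inp)       # no explicit name
grad = tf.gradients(out, inp)[0]
with tf.Session() as sess:
    print(sess.run(grad))        # expect 5x the incoming ones, i.e. [5. 5. 5.]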

Related

What is the correct way to use a PyTorch Module inside a PyTorch Function?

We have a custom torch.autograd.Function z(x, t) which computes an output y in a way not amenable to direct automatic differentiation, and have computed the Jacobian of the operation with respect to its inputs x and t, so we can implement the backward method.
However, the operation involves making several internal calls to a neural network, which we have implemented for now as a stack of torch.nn.Linear objects, wrapped in net, a torch.nn.Module. Mathematically, these are parameterized by t.
Is there any way that we can have net itself be an input to the forward method of z? Then, we would return from our backward the list of products of the upstream gradient Dy and the parameter Jacobians dydt_i, one for each of the parameters t_i that are children of net (in addition to Dy*dydx, although x is data and does not need gradient accumulation).
Or do we really instead need to take t (actually a list of individual t_i), and reconstruct internally in z.forward the actions of all the Linear layers in net?
I guess you could create a custom functor that inherits from torch.autograd.Function and make the forward and backward methods non-static (i.e. remove the @staticmethod decorators in this example), so that net can be an attribute of your functor. That would look like:
class MyFunctor(torch.autograd.Function):
    def __init__(self, net):
        self.net = net

    def forward(self, x, t):
        # store x and t on self in whatever way you find useful
        # (not sure how t is involved here)
        return self.net(x)

    def backward(self, grad):
        # do your backward stuff
        ...

net = nn.Sequential(nn.Linear(...), ...)
z = MyFunctor(net)
y = z(x, t)
This will yield a warning that you are using a deprecated, legacy way of creating autograd functions (because of the non-static methods), and you need to be extra careful about zeroing the gradients in net after having backpropagated. So it is not really convenient, but I am not aware of any better way to have a stateful autograd function.
I'm doing something similar, where the static restrictions on PyTorch functions are cumbersome. Similar in spirit to trialNerror's answer, I instead keep the PyTorch function methods static and pass in functions for them to use, which gets around the issues with making the functor non-static:
class NonStaticBackward(Function):
    @staticmethod
    def forward(ctx, backward_fn, input):
        ctx.backward_fn = backward_fn
        # ... do other stuff
        return input

    @staticmethod
    def backward(ctx, grad_output):
        # Call into our non-static backward function.
        # Since we passed the backward function in as an input,
        # PyTorch expects a placeholder grad for it.
        return None, ctx.backward_fn(ctx, grad_output)
Passing in the backwards function every time gets annoying, so I usually wrap it:
def my_non_static_backward(ctx, grad_output):
    print("Hello from backward!")
    return grad_output

my_fn = lambda x: NonStaticBackward.apply(my_non_static_backward, x)
y = my_fn(Tensor([1, 2, 3]))
This way, you can write the grad function somewhere where it has access to what it needs: no need to pass net.
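For example (my own illustrative sketch, not part of the original answer): the backward function can simply close over a module, so nothing stateful has to pass through the Function itself:
import torch
import torch.nn as nn

net = nn.Linear(3, 3)

def backward_with_net(ctx, grad_output):
    # free to use anything in scope, e.g. the captured module's weights
    return grad_output * net.weight.sum()

x = torch.ones(3, requires_grad=True)
y = NonStaticBackward.apply(backward_with_net, x)
y.sum().backward()
print(x.grad)  # the upstream ones, scaled by net.weight.sum()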

Why the gradients are unconnected in the following function?

I am implementing a custom operation whose gradients must be calculated. The following is the function:
def difference(prod, box):
    result = tf.Variable(tf.zeros((prod.shape[0], box.shape[1]), dtype=tf.float16))
    for i in tf.range(0, prod.shape[0]):
        for j in tf.range(0, box.shape[1]):
            result[i, j].assign((tf.reduce_prod(box[:, j]) - tf.reduce_prod(prod[i, :])) / tf.reduce_prod(box[:, j]))
    return result
I am unable to calculate the gradients with respect to box; tape.gradient() is returning None. Here is the code I have written for calculating the gradients:
prod = tf.constant([[3, 4, 5], [4, 5, 6], [1, 3, 3]], dtype=tf.float16)
box = tf.Variable([[4, 5], [5, 6], [5, 7]], dtype=tf.float16)
with tf.GradientTape() as tape:
    tape.watch(box)
    loss = difference(prod, box)
print(tape.gradient(loss, box))
I am not able to find the reason for unconnected gradients. Is the result variable causing it? Kindly suggest an alternative implementation.
Yes, in order to calculate gradients we need a set of (differentiable) operations on your variables.
You should re-write difference as a function of the 2 input tensors. I think (though happy to confess I am not 100% sure!) that it is the use of 'assign' that makes the gradient tape fall over.
Perhaps something like this:
def difference(prod, box):
    box_red = tf.reduce_prod(box, axis=0)
    prod_red = tf.reduce_prod(prod, axis=1)
    return (tf.expand_dims(box_red, 0) - tf.expand_dims(prod_red, 1)) / tf.expand_dims(box_red, 0)
This would get you the desired result.
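With that version, re-running the snippet from the question (a sketch, assuming TF 2.x eager execution) gives a connected gradient instead of None:
prod = tf.constant([[3, 4, 5], [4, 5, 6], [1, 3, 3]], dtype=tf.float16)
box = tf.Variable([[4, 5], [5, 6], [5, 7]], dtype=tf.float16)
with tf.GradientTape() as tape:
    loss = difference(prod, box)  # box is a Variable, so it is watched automatically
print(tape.gradient(loss, box))   # a (3, 2) tensor of gradients, not None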

How does tensorflow handle non differentiable nodes during gradient calculation?

I understood the concept of automatic differentiation, but couldn't find any explanation of how TensorFlow calculates the error gradient for non-differentiable functions such as tf.where in my loss function or tf.cond in my graph. It works just fine, but I would like to understand how TensorFlow backpropagates the error through such nodes, since there is no formula to calculate the gradient for them.
In the case of tf.where, you have a function with three inputs, condition C, value on true T and value on false F, and one output Out. The gradient receives one value and has to return three values. Currently, no gradient is computed for the condition (that would hardly make sense), so you just need to do the gradients for T and F. Assuming the inputs and the output are vectors, imagine C[0] is True. Then Out[0] comes from T[0], and its gradient should propagate back. On the other hand, F[0] would have been discarded, so its gradient should be made zero. If C[1] were False, then the gradient for F[1] should propagate but not for T[1]. So, in short, for T you should propagate the given gradient where C is True and make it zero where it is False, and the opposite for F. If you look at the implementation of the gradient of tf.where (Select operation), it does exactly that:
#ops.RegisterGradient("Select")
def _SelectGrad(op, grad):
c = op.inputs[0]
x = op.inputs[1]
zeros = array_ops.zeros_like(x)
return (None, array_ops.where(c, grad, zeros), array_ops.where(
c, zeros, grad))
Note the input values themselves are not used in the computation; that will be done by the gradients of the operation producing those inputs. For tf.cond, the code is a bit more complicated, because the same operation (Merge) is used in different contexts, and tf.cond also uses Switch operations inside. However, the idea is the same. Essentially, Switch operations are used for each input, so the input that was activated (the first if the condition was True and the second otherwise) gets the received gradient, and the other input gets a "switched off" gradient (like None) and does not propagate back further.
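You can also see the masking behaviour directly; a quick sketch in TF 2.x eager style, with made-up values:
c = tf.constant([True, False])
t = tf.constant([1.0, 2.0])
f = tf.constant([3.0, 4.0])
with tf.GradientTape() as tape:
    tape.watch([t, f])
    out = tf.where(c, t, f)
grad_t, grad_f = tape.gradient(out, [t, f])
print(grad_t.numpy())  # [1. 0.] -- gradient flows only where c is True
print(grad_f.numpy())  # [0. 1.] -- gradient flows only where c is False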

How to use tensorflow to approximate hessian matrix's norm

I wonder whether there is any method to recompute gradients with updated weights within a graph, or if there is a better way to do this. For example, for estimating the Hessian norm, we need to compute
delta ~ N(0, I)
hessian_norm = 1/M * \sum_{i=1}^{M} (gradient(f(x + delta)) - gradient(f(x - delta))) / (2 * delta)
so we need the gradient value at x + delta. Currently we get a None type if we use tf.gradients on var + delta directly.
More specifically, if we define
a = tf.Variable
b = some_function(a)
grad = tf.gradients(b, a)
that's a normal gradient computation but if we do
grad_delta = tf.gradients(b, a+delta)
it will return None. This feature seems to make it impossible to approximate the hessian norm using the above method.
b is not a function of a + delta, so you get Nones. You either need to create a new value b2 which depends on a + delta, or just move your variable a by delta and evaluate again to get the second value.
This is similar to how you do line search in TensorFlow.
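A minimal TF 1.x-style sketch of the first option (some_function and the shapes here are placeholders I made up):
a = tf.Variable([1.0, 2.0])
delta = tf.random_normal(tf.shape(a))

def some_function(x):
    return tf.reduce_sum(tf.square(x))  # stand-in for f

b2 = some_function(a + delta)    # b2 is explicitly built from a + delta
grad_plus = tf.gradients(b2, a)  # no longer None: the path a -> a + delta -> b2 exists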

Alternative plan of tf.floor

One of my operations needs integers, but the output of a convolution is float.
That means I need to use tf.floor, tf.ceil, tf.cast, etc. to handle it.
But these operations cause None gradients, since operations like tf.floor are not differentiable.
So, I tried something like below
First: detour
out1 = tf.subtract(vif, tf.subtract(vif, tf.floor(vif)))
But the output of test.compute_gradient_error is 500 or 0; I don't think this is a reasonable gradient.
Second: override the gradient function of floor
#ops.RegisterGradient("CustomFloor")
def _custom_floor_grad(op, grads):
return [grads]
A, B = 50, 7
shape = [A, B]
f = np.ones(shape, dtype=np.float32)
vif = tf.constant(f, dtype=tf.float32)
# out1 = tf.subtract(vif, tf.subtract(vif, tf.floor(vif)))
with tf.get_default_graph().gradient_override_map({"Floor": "CustomFloor"}):
out1 = tf.floor(vif)
with tf.Session() as sess:
err1 = tf.test.compute_gradient_error(vif, shape, out1, shape)
print err1
The output of test.compute_gradient_error is 500 or 1, so that doesn't work either.
Question: is there a way to get integers and keep back-propagation working fine (values like 2.0, 5.0 are ok)?
In general, it's not advisable to solve a discrete problem with gradient descent. You should be able to express, to some extent, integer solvers in TF, but you're more or less on your own.
FWIW, the floor function looks like a saw. Its derivative is a constant function at 1 with little holes at every integer. At these positions you have a Dirac functional pointing downwards, like a rake if you wish. The Dirac functional has finite energy but no finite value.
The canonical way to tackle these problems is to relax the problem, replacing the hard floor constraint with something that is (at least once) differentiable (smooth).
There are multiple ways to do this. Perhaps the most popular are:
Hack up a function that looks like what you want. For instance a piece-wise linear function that slopes down quickly, but not vertically.
Replace step functions by sigmoids
Use a filter approximation which is well understood if it's a time series
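Not mentioned in this answer, but another trick commonly used for the question's goal (integer value in the forward pass, identity-style gradient in the backward pass) is tf.stop_gradient; a minimal sketch:
A, B = 50, 7
f = np.ones([A, B], dtype=np.float32)
vif = tf.constant(f, dtype=tf.float32)
# forward pass evaluates to tf.floor(vif); backward pass only sees the identity term
out1 = vif + tf.stop_gradient(tf.floor(vif) - vif)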
