I would like to write a TensorFlow op in python, but I would like it to be differentiable (to be able to compute a gradient).
This question asks how to write an op in python, and the answer suggests using py_func (which has no gradient): Tensorflow: Writing an Op in Python
The TF documentation describes how to add an op starting from C++ code only: https://www.tensorflow.org/versions/r0.10/how_tos/adding_an_op/index.html
In my case, I am prototyping so I don't care about whether it runs on GPU, and I don't care about it being usable from anything other than the TF python API.
Yes, as mentioned in @Yaroslav's answer, it is possible, and the key is the links he references: here and here. I want to elaborate on this answer by giving a concrete example.
Modulo operation: Let's implement the element-wise modulo operation in tensorflow (it already exists but its gradient is not defined; for the example we will implement it from scratch).
Numpy function: The first step is to define the operation we want for numpy arrays. The element-wise modulo operation is already implemented in numpy, so it is easy:
import numpy as np

def np_mod(x, y):
    return (x % y).astype(np.float32)
The reason for the .astype(np.float32) is that tensorflow takes float32 types by default, and if you give it float64 (the numpy default) it will complain.
Gradient function: Next we need to define the gradient function for our operation with respect to each input, as a tensorflow function. The function needs to take a very specific form: it takes the tensorflow representation of the operation op and the gradient of the output grad, and says how to propagate the gradients. In our case, the gradients of the mod operation are easy: the derivative is 1 with respect to the first argument and -floor(x/y) with respect to the second (almost everywhere; it is infinite at a finite number of spots, but let's ignore that, see https://math.stackexchange.com/questions/1849280/derivative-of-remainder-function-wrt-denominator for details). So we have
def modgrad(op, grad):
    x = op.inputs[0]  # the first argument (normally you need these to calculate the gradient, like the gradient of x^2 is 2x)
    y = op.inputs[1]  # the second argument
    return grad * 1, grad * tf.neg(tf.floordiv(x, y))  # the propagated gradients with respect to the first and second argument respectively
The grad function needs to return an n-tuple where n is the number of arguments of the operation. Notice that we need to return tensorflow functions of the inputs.
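For illustration, here is a minimal sketch of what such a gradient function looks like for a hypothetical element-wise square op (in the spirit of the _MySquareGrad example referenced in a code comment below):

def squaregrad(op, grad):
    # The op has a single input, so a single tensor is returned, built from
    # tensorflow ops applied to op.inputs and the incoming gradient.
    x = op.inputs[0]     # the forward-pass input of the op
    return grad * 2 * x  # chain rule: d(x^2)/dx = 2x, scaled by the upstream gradient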
Making a TF function with gradients: As explained in the sources mentioned above, there is a hack to define gradients of a function using tf.RegisterGradient [doc] and tf.Graph.gradient_override_map [doc].
Copying the code from harpone we can modify the tf.py_func function to make it define the gradient at the same time:
import tensorflow as tf

def py_func(func, inp, Tout, stateful=True, name=None, grad=None):
    # Need to generate a unique name to avoid duplicates:
    rnd_name = 'PyFuncGrad' + str(np.random.randint(0, 1E+8))
    tf.RegisterGradient(rnd_name)(grad)  # see _MySquareGrad for grad example
    g = tf.get_default_graph()
    with g.gradient_override_map({"PyFunc": rnd_name}):
        return tf.py_func(func, inp, Tout, stateful=stateful, name=name)
The stateful option tells tensorflow whether the function always gives the same output for the same input (stateful=False), in which case tensorflow can simplify the graph; this is our case and will probably be the case in most situations.
Combining it all together: Now that we have all the pieces, we can combine them all together:
from tensorflow.python.framework import ops

def tf_mod(x, y, name=None):
    with ops.op_scope([x, y], name, "mod") as name:
        z = py_func(np_mod,
                    [x, y],
                    [tf.float32],
                    name=name,
                    grad=modgrad)  # <-- here's the call to the gradient
        return z[0]
tf.py_func acts on lists of tensors (and returns a list of tensors), that is why we have [x,y] (and return z[0]).
And now we are done, and we can test it.
Test:
with tf.Session() as sess:
    x = tf.constant([0.3, 0.7, 1.2, 1.7])
    y = tf.constant([0.2, 0.5, 1.0, 2.9])
    z = tf_mod(x, y)
    gr = tf.gradients(z, [x, y])
    tf.initialize_all_variables().run()
    print(x.eval(), y.eval(), z.eval(), gr[0].eval(), gr[1].eval())
[ 0.30000001 0.69999999 1.20000005 1.70000005] [ 0.2 0.5 1. 2.9000001] [ 0.10000001 0.19999999 0.20000005 1.70000005] [ 1. 1. 1. 1.] [ -1. -1. -1. 0.]
Success!
Here's an example of adding a gradient to a specific py_func:
https://gist.github.com/harpone/3453185b41d8d985356cbe5e57d67342
Here's the issue discussion
Related
I am using PyTorch 1.6.0 to learn a tensor (let's say x) with autograd.
After x is learnt, how can I reset .requires_grad of every tensor that was a node in the autograd computation graph to False?
I know about torch.detach() and about setting .requires_grad to False manually. I am searching for a one-shot instruction.
P.S.: I want to do that because I still want to use these tensors after the part of my code that learns x is executed. Plus, some are to be converted to numpy.
There is no "one-shot instruction" to switch .requires_grad for all tensors in the graph.
Usually parameters are kept in torch.nn.Module instances, but in case they are elsewhere, you can always add them to some list and iterate over it. I'd do something like this:
import torch

class Leafs:
    def __init__(self):
        self.leafs = []

    def add(self, tensor):
        self.leafs.append(tensor)
        return tensor

    def clear(self):
        for leaf in self.leafs:
            leaf.requires_grad_(False)

keeper = Leafs()
x = keeper.add(torch.tensor([1.2], requires_grad=True))
y = keeper.add(torch.tensor([1.3], requires_grad=True))
print(x.requires_grad, y.requires_grad)
keeper.clear()
print(x.requires_grad, y.requires_grad)
Usually there is no need for that; also, if you don't want gradients for some part of the computation, you can always use the with torch.no_grad() context manager.
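For example, a minimal sketch of the torch.no_grad() approach (values are just for illustration): nothing computed inside the context is recorded by autograd, so the results can be used or converted to numpy without detaching anything afterwards.

import torch

x = torch.tensor([1.2], requires_grad=True)

with torch.no_grad():
    y = x * 2                # no graph is recorded for this computation
    print(y.requires_grad)   # False
    print(y.numpy())         # works; y is not part of any autograd graph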
I'm trying to understand the Synthesizer paper (https://arxiv.org/pdf/2005.00743.pdf) and there's a description of the dense synthesizer mechanism that should replace the traditional attention model as described in the Transformer architecture.
The Dense Synthesizer is described as such:
So I tried to implement the layer and it looks like this but I’m not sure whether I’m getting it right:
import torch
import torch.nn as nn
import torch.nn.functional as F

class DenseSynthesizer(nn.Module):
    def __init__(self, l, d):
        super(DenseSynthesizer, self).__init__()
        self.linear1 = nn.Linear(d, l)
        self.linear2 = nn.Linear(l, l)

    def forward(self, x, v):
        # Equations (1) and (2)
        # Shape: l x l
        b = self.linear2(F.relu(self.linear1(x)))
        # Equation (3)
        # [l x l] x [l x d] -> [l x d]
        return torch.matmul(F.softmax(b, dim=-1), v)
Usage:
l, d = 4, 5
x, v = torch.rand(l, d), torch.rand(l, d)
synthesis = DenseSynthesizer(l, d)
synthesis(x, v)
Example:
x and v are tensors:
x = tensor([[0.0844, 0.2683, 0.4299, 0.1827, 0.1188],
[0.2793, 0.0389, 0.3834, 0.9897, 0.4197],
[0.1420, 0.8051, 0.1601, 0.3299, 0.3340],
[0.8908, 0.1066, 0.1140, 0.7145, 0.3619]])
v = tensor([[0.3806, 0.1775, 0.5457, 0.6746, 0.4505],
[0.6309, 0.2790, 0.7215, 0.4283, 0.5853],
[0.7548, 0.6887, 0.0426, 0.1057, 0.7895],
[0.1881, 0.5334, 0.6834, 0.4845, 0.1960]])
And passing them through a forward pass of the dense synthesizer returns:
>>> synthesis = DenseSynthesizer(l, d)
>>> synthesis(x, v)
tensor([[0.5371, 0.4528, 0.4560, 0.3735, 0.5492],
[0.5426, 0.4434, 0.4625, 0.3770, 0.5536],
[0.5362, 0.4477, 0.4658, 0.3769, 0.5468],
[0.5430, 0.4461, 0.4559, 0.3755, 0.5551]], grad_fn=<MmBackward>)
Is the implementation and understanding of the dense synthesizer correct?
Theoretically, how is that different from a multi-layer perceptron that takes in two different inputs and makes use of them at different points in the forward propagation?
Is the implementation and understanding of the dense synthesizer correct?
Not exactly: linear1 = nn.Linear(d, d) according to the paper, not (d, l).
Of course this does not work if X.shape = (l,d) according to matrix multiplication rules.
This is because :
So F is applied to each Xi in X for i in [1,l]
The resulting matrix B is then passed to the softmax function and multiplied by G(x).
So you'd have to modify your code to sequentially process the input, then use the returned matrix to compute Y; a sketch of that modification follows.
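Here is a minimal sketch of that modification, assuming the reading above (linear1 maps d -> d and linear2 maps d -> l, so F produces one length-l row of B per Xi); this is my illustration, not the paper's reference code:

import torch
import torch.nn as nn
import torch.nn.functional as F

class DenseSynthesizerSeq(nn.Module):
    def __init__(self, l, d):
        super().__init__()
        self.linear1 = nn.Linear(d, d)  # first projection of F, per the reading above
        self.linear2 = nn.Linear(d, l)  # maps each Xi to a length-l row of B

    def forward(self, x, v):
        # Apply F to each Xi sequentially (an explicit loop for clarity; nn.Linear
        # also broadcasts over the rows of x and would give the same result).
        b = torch.stack([self.linear2(F.relu(self.linear1(x_i))) for x_i in x])  # l x l
        return torch.matmul(F.softmax(b, dim=-1), v)                             # l x d

l, d = 4, 5
x, v = torch.rand(l, d), torch.rand(l, d)
print(DenseSynthesizerSeq(l, d)(x, v).shape)  # torch.Size([4, 5])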
how is that different from a multi-layer perceptron that takes in two different inputs and makes use of them at different points in the forward propagation?
To understand, we need to put things into context: the idea of introducing an attention mechanism was first described in the context of the Encoder-Decoder here: https://arxiv.org/pdf/1409.0473.pdf
The core idea is to allow the model to have control over how the context vector from the encoder is retrieved, using a neural network, instead of relying solely on the last encoded state:
see this post for more detail.
The Transformers introduced the idea of using "Multi-Head Attention" (see the graph below) to reduce the computational burden and focus solely on the attention mechanism itself; see this post.
https://arxiv.org/pdf/1706.03762.pdf
So where does the Dense Synthesizer fit into all of that?
It simply replaces the dot product (as illustrated in the first pictures in your post) by F(.). If you replace what's inside the softmax by F, you get the equation for Y.
Conclusion
This is an MLP, but applied step-wise to the input in the context of sequence processing.
Thank you
Inside my custom loss function I need to call a pure python function passing in the computed TD errors and some indexes. The function doesn't need to return anything or be differentiated. Here's the function I want to call:
def update_priorities(self, traces_idxs, td_errors):
    """Updates the priorities of the traces with specified indexes."""
    self.priorities[traces_idxs] = td_errors + eps
I've tried using tf.py_function to call a wrapper function, but it only gets called if it's embedded in the graph, i.e. if it has inputs and outputs and the outputs are used. Therefore I tried to pass some of the tensors through without performing any operations on them, and the function now gets called. Here's my entire custom loss function:
def masked_q_loss(data, y_pred):
    """Computes the MSE between the Q-values of the actions that were taken and the cumulative
    discounted rewards obtained after taking those actions. Updates trace priorities.
    """
    action_batch, target_qvals, traces_idxs = data[:, 0], data[:, 1], data[:, 2]
    seq = tf.cast(tf.range(0, tf.shape(action_batch)[0]), tf.int32)
    action_idxs = tf.transpose(tf.stack([seq, tf.cast(action_batch, tf.int32)]))
    qvals = tf.gather_nd(y_pred, action_idxs)

    def update_priorities(_qvals, _target_qvals, _traces_idxs):
        """Computes the TD error and updates memory priorities."""
        td_error = _target_qvals - _qvals
        _traces_idxs = tf.cast(_traces_idxs, tf.int32)
        mem.update_priorities(_traces_idxs, td_error)
        return _qvals

    qvals = tf.py_function(func=update_priorities, inp=[qvals, target_qvals, traces_idxs], Tout=[tf.float32])
    return tf.keras.losses.mse(qvals, target_qvals)
However, I get the following error due to the call mem.update_priorities(_traces_idxs, td_error):
ValueError: An operation has `None` for gradient. Please make sure that all of your ops have a gradient defined (i.e. are differentiable). Common ops without gradient: K.argmax, K.round, K.eval.
I don't need to compute gradients for update_priorities, I just want to call it at a specific point in the graph computation and forget about it. How can I do that?
Using .numpy() on the tensors inside the wrapper function fixed the problem:
def update_priorities(_qvals, _target_qvals, _traces_idxs):
    """Computes the TD error and updates memory priorities."""
    td_error = np.abs((_target_qvals - _qvals).numpy())
    _traces_idxs = (tf.cast(_traces_idxs, tf.int32)).numpy()
    mem.update_priorities(_traces_idxs, td_error)
    return _qvals
I am migrating my training loop to Tensorflow 2.0 API. In eager execution mode, tf.GradientTape replaces tf.gradients. The question is, do they have the same functionality? Specifically:
In function gradient():
Is the parameter output_gradients equivalent to grad_ys in the old API?
What about the parameters colocate_gradients_with_ops, aggregation_method, and gate_gradients of tf.gradients? Are they deprecated due to lack of use? Can they be replaced by other methods in the 2.0 API? Are they needed in Eager Execution at all?
Is function jacobian() equivalent to tf.python.ops.parallel_for.gradients?
Please find the response below.
Regarding output_gradients and grad_ys: Yes, they can be considered the same.
Detailed Explanation: Info about Output Gradients is mentioned in Github -> imperative_grad.py as shown below.
output_gradients: if not None, a list of gradients provided for each target, or None if we are to use the target's computed downstream gradient.
Info about grad_ys is mentioned in TF Site as shown below:
grad_ys is a list of tensors of the same length as ys that holds the initial gradients for each y in ys. When grad_ys is None, we fill in a tensor of '1's of the shape of y for each y in ys. A user can provide their own initial grad_ys to compute the derivatives using a different initial gradient for each y (e.g., if one wanted to weight the gradient differently for each value in each y).
From the above explanations, and from the code below (from page 394 of the book Hands-On ML with Scikit-Learn & TensorFlow), we can conclude that the initial value of theta can be a random value, and we can pass it using the parameter output_gradients or grad_ys.
theta = tf.Variable(tf.random_uniform([n + 1, 1], -1.0, 1.0), name="theta")
gradients = tf.gradients(mse, [theta])[0]
training_op = tf.assign(theta, theta - learning_rate * gradients)
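As a minimal sketch of the correspondence (the variable names and values here are my own, not from the book): the per-output weighting that grad_ys provided to tf.gradients can be passed to GradientTape.gradient via output_gradients in TF 2.x.

import tensorflow as tf

x = tf.Variable([1.0, 2.0])
weights = tf.constant([0.5, 2.0])  # plays the role of grad_ys / output_gradients

with tf.GradientTape() as tape:
    y = x * x  # dy/dx = 2x element-wise

# Equivalent to tf.gradients(y, x, grad_ys=weights) in graph mode.
grad = tape.gradient(y, x, output_gradients=weights)
print(grad)  # [0.5 * 2.0, 2.0 * 4.0] = [1.0, 8.0]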
Regarding colocate_gradients_with_ops: Yes, it is not needed for Eager Execution, as it is related to the control flow context of graphs.
Detailed explanation: colocate_gradients_with_ops points to the code below, from Github -> ops.py. The control flow context is related to the concept of Context, which is related to Graphs, as explained in TF Site -> Graphs.
def _colocate_with_for_gradient(self, op, gradient_uid, ignore_existing=False):
    with self.colocate_with(op, ignore_existing):
        if gradient_uid is not None and self._control_flow_context is not None:
            self._control_flow_context.EnterGradientColocation(op, gradient_uid)
            try:
                yield
            finally:
                self._control_flow_context.ExitGradientColocation(op, gradient_uid)
        else:
            yield
Regarding aggregation_method: The equivalent of this parameter has been implemented in 2.0, named _aggregate_grads, as shown in the Github link.
Regarding gate_gradients: Not needed for Eager, as this also is related to graph context.
Detailed explanation: As shown in the code below from Github -> gradients_utils.py, if gate_gradients is True, then some operations are added to the graph using the function _colocate_with_for_gradient, which in turn depends on the control flow context of graphs.
if gate_gradients and len([x for x in in_grads if x is not None]) > 1:
    with ops.device(None):
        with ops._colocate_with_for_gradient(  # pylint: disable=protected-access
            None,
            gradient_uid,
            ignore_existing=True):
            in_grads = control_flow_ops.tuple(in_grads)
Regarding jacobian: Yes, they are the same.
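A minimal sketch of the TF 2.x side (example values are my own): tape.jacobian computes the full Jacobian, which is what parallel_for.gradients.jacobian did for graphs.

import tensorflow as tf

x = tf.constant([1.0, 2.0, 3.0])
with tf.GradientTape() as tape:
    tape.watch(x)
    y = x * x

# Full Jacobian dy_i/dx_j: a 3x3 matrix with 2*x on the diagonal.
print(tape.jacobian(y, x))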
I'm trying to define a gradient method for my custom TF operation. Most of the solutions I have found online seem to be based on a gist by harpone. I'm reluctant to use that approach, as it uses py_func, which won't run on the GPU. I found another solution here that uses tf.identity(), which looks more elegant and, I think, will run on the GPU. However, I have some problems accessing the inputs of the op in my custom gradient function. Here's my code:
@tf.RegisterGradient('MyCustomGradient')
def _custom_gradient(op, gradients):
    x = op.inputs[0]
    return x

def my_op(w):
    return tf.pow(w, 3)

var_foo = tf.Variable(5, dtype=tf.float32)
bar = my_op(var_foo)

g = tf.get_default_graph()
with g.gradient_override_map({'Identity': 'MyCustomGradient'}):
    bar = tf.identity(bar)

g = tf.gradients(bar, var_foo)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(g))
I was expecting _custom_gradient() to return the input to the op (5 in this example), but instead it seems to return the op's output multiplied by the gradient. My custom my_op will have non-differentiable operations like tf.sign, and I'd like to define my custom gradient based on the inputs. What am I doing wrong?
There is no problem with your code:
Let's first do the forward pass:
var_foo = 5 -> bar = 125 -> tf.identity(bar) = 125
Now let's backpropagate:
The gradient of tf.identity(bar) with respect to its argument bar equals (by your definition) bar, that is, 125. The gradient of bar with respect to var_foo equals 3 times the square of var_foo, which is 75. Multiply, and you get 9375, which is indeed the output of your code.
op.inputs[0] contains the forward-pass value of the op's input. In this case, the input to the identity op is 125.
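If the goal is a custom gradient that depends on the op's inputs, the gradient function should still multiply by the incoming gradients so the chain rule keeps working upstream. A minimal, hypothetical sketch (a straight-through style gradient for a sign-like op; the registered name is my own, not from the question):

@tf.RegisterGradient('MyInputBasedGradient')  # hypothetical name
def _input_based_gradient(op, gradients):
    x = op.inputs[0]  # the forward-pass input of the overridden op
    # Pass the upstream gradient through only where |x| < 1 (straight-through style).
    return gradients * tf.cast(tf.abs(x) < 1.0, tf.float32)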