Using op inputs when defining custom gradients in TensorFlow

I'm trying to define a gradient method for my custom TF operation. Most of the solutions I have found online seem to be based on a gist by harpone. I'm reluctant to use that approach as it uses py_func, which won't run on GPU. I found another solution here that uses tf.identity(), which looks more elegant and which I think will run on GPU. However, I have some problems accessing the inputs of the op in my custom gradient function. Here's my code:
import tensorflow as tf
@tf.RegisterGradient('MyCustomGradient')
def _custom_gradient(op, gradients):
    x = op.inputs[0]
    return x
def my_op(w):
    return tf.pow(w, 3)
var_foo = tf.Variable(5, dtype=tf.float32)
bar = my_op(var_foo)
g = tf.get_default_graph()
with g.gradient_override_map({'Identity': 'MyCustomGradient'}):
    bar = tf.identity(bar)
    g = tf.gradients(bar, var_foo)
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(g))
I was expecting _custom_gradient() to return the input to the op (5 in this example) but instead it seems to return op output x gradient. My custom my_op will have non-differentiable operations like tf.sign and I'd like to define my custom gradient based on the inputs. What am I doing wrong?

There is no problem with your code:
Let's first do the forward pass:
var_foo = 5 -> bar = 125 -> tf.identity(bar) = 125
Now let's backpropagate:
The gradient of tf.identity(bar) with respect to its argument bar equals (by your definition) bar, that is, 125. The gradient of bar with respect to var_foo equals 3 times the square of var_foo, which is 75. Multiply the two and you get 9375, which is indeed the output of your code.
op.inputs[0] contains the forward-pass input of the op the gradient is attached to. Here that op is the identity, not my_op, so its input is 125 rather than 5.
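The arithmetic above can be checked without TensorFlow. The following is a minimal pure-Python sketch (the function names are made up for illustration and are not part of any API) that mimics what the overridden identity gradient does: it returns the identity op's input instead of passing the incoming gradient through.

```python
def forward(var_foo):
    """Forward pass: bar = var_foo ** 3, followed by an identity."""
    bar = var_foo ** 3
    return bar, bar  # (input of the identity, output of the identity)


def custom_identity_grad(identity_input, upstream_grad):
    """Mimics the question's _custom_gradient: ignores upstream_grad, returns the op input."""
    return identity_input


def pow3_grad(var_foo, upstream_grad):
    """Ordinary gradient of w ** 3, scaled by the gradient arriving from above."""
    return upstream_grad * 3 * var_foo ** 2


var_foo = 5.0
identity_input, out = forward(var_foo)
grad_wrt_bar = custom_identity_grad(identity_input, upstream_grad=1.0)
grad_wrt_var_foo = pow3_grad(var_foo, grad_wrt_bar)
print(grad_wrt_bar, grad_wrt_var_foo)  # prints 125.0 9375.0
```

To base the gradient on var_foo itself, the override would have to be attached to an op whose input is var_foo, since the gradient function only sees the inputs of the op it replaces.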


How to trigger a python function inside a tf.keras custom loss function?

Inside my custom loss function I need to call a pure python function passing in the computed TD errors and some indexes. The function doesn't need to return anything or be differentiated. Here's the function I want to call:
def update_priorities(self, traces_idxs, td_errors):
    """Updates the priorities of the traces with specified indexes."""
    self.priorities[traces_idxs] = td_errors + eps
I've tried using tf.py_function to call a wrapper function but it only gets called if it's embedded in the graph i.e. if it has inputs and outputs and the outputs are used. Therefore I tried to pass through some of the tensors without performing any operations on them and the function now gets called. Here's my entire custom loss function:
def masked_q_loss(data, y_pred):
    """Computes the MSE between the Q-values of the actions that were taken and the cumulative
    discounted rewards obtained after taking those actions. Updates trace priorities.
    """
    action_batch, target_qvals, traces_idxs = data[:,0], data[:,1], data[:,2]
    seq = tf.cast(tf.range(0, tf.shape(action_batch)[0]), tf.int32)
    action_idxs = tf.transpose(tf.stack([seq, tf.cast(action_batch, tf.int32)]))
    qvals = tf.gather_nd(y_pred, action_idxs)
    def update_priorities(_qvals, _target_qvals, _traces_idxs):
        """Computes the TD error and updates memory priorities."""
        td_error = _target_qvals - _qvals
        _traces_idxs = tf.cast(_traces_idxs, tf.int32)
        mem.update_priorities(_traces_idxs, td_error)
        return _qvals
    qvals = tf.py_function(func=update_priorities, inp=[qvals, target_qvals, traces_idxs], Tout=[tf.float32])
    return tf.keras.losses.mse(qvals, target_qvals)
However I get the following error due to the call mem.update_priorities(_traces_idxs, td_error)
ValueError: An operation has `None` for gradient. Please make sure that all of your ops have a gradient defined (i.e. are differentiable). Common ops without gradient: K.argmax, K.round, K.eval.
I don't need to compute gradients for update_priorities, I just want to call it at a specific point in the graph computation and forget about it. How can I do that?
Using .numpy() on the tensors inside the wrapper function fixed the problem:
def update_priorities(_qvals, _target_qvals, _traces_idxs):
    """Computes the TD error and updates memory priorities."""
    td_error = np.abs((_target_qvals - _qvals).numpy())
    _traces_idxs = (tf.cast(_traces_idxs, tf.int32)).numpy()
    mem.update_priorities(_traces_idxs, td_error)
    return _qvals

Is tf.GradientTape in TF 2.0 equivalent to tf.gradients?

I am migrating my training loop to Tensorflow 2.0 API. In eager execution mode, tf.GradientTape replaces tf.gradients. The question is, do they have the same functionality? Specifically:
In function gradient():
Is the parameter output_gradients equivalent to grad_ys in the old API?
What about the parameters colocate_gradients_with_ops, aggregation_method and gate_gradients of tf.gradients? Are they deprecated due to lack of use? Can they be replaced by other methods in the 2.0 API? Are they needed in Eager Execution at all?
Is function jacobian() equivalent to tf.python.ops.parallel_for.gradients?
Please find the response below.
Regarding Output Gradients and grad_ys: Yes, they can be considered the same.
Detailed Explanation: output_gradients is described in the TensorFlow source on GitHub as shown below.
output_gradients: if not None, a list of gradient provided for each
Target, or None if we are to use the target's computed downstream gradient,
Info about grad_ys is given on the TF site as shown below:
grad_ys: is a list of tensors of the same length as ys that holds the
initial gradients for each y in ys. When grad_ys is None, we fill in a
tensor of '1's of the shape of y for each y in ys. A user can provide
their own initial grad_ys to compute the derivatives using a different
initial gradient for each y (e.g., if one wanted to weight the
gradient differently for each value in each y).
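From the above explanations, both parameters describe the same thing: the gradient used to seed the backward pass at each target. Note that this is an initial gradient, not the initial value of a variable. In the code below, from page 394 of the book Hands on ML using Scikit-Learn & Tensorflow,
theta is initialized with random values and tf.gradients is called without grad_ys, so the seed defaults to a tensor of ones.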
theta = tf.Variable(tf.random_uniform([n + 1, 1], -1.0, 1.0), name="theta")
gradients = tf.gradients(mse, [theta])[0]
training_op = tf.assign(theta, theta - learning_rate * gradients)
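To make the role of grad_ys / output_gradients concrete, here is a pure-Python analogy (not TensorFlow code; the helper name weighted_gradient is hypothetical). For two outputs y1 = x^2 and y2 = x^3, the seed gradients weight each output's contribution to the final gradient, and the default seed of ones gives the plain sum, matching the grad_ys description quoted above.

```python
def weighted_gradient(x, grad_ys=None):
    """Return sum_i grad_ys[i] * dy_i/dx for the outputs y = (x**2, x**3)."""
    dys_dx = [2 * x, 3 * x ** 2]       # analytic derivative of each output
    if grad_ys is None:
        grad_ys = [1.0] * len(dys_dx)  # default seed: ones, one per output
    return sum(g * d for g, d in zip(grad_ys, dys_dx))


default = weighted_gradient(2.0)               # 1*4 + 1*12
weighted = weighted_gradient(2.0, [0.5, 2.0])  # 0.5*4 + 2*12
print(default, weighted)  # prints 16.0 26.0
```

In this picture, passing a custom seed is only a way of weighting the outputs differently; it does not change how the variable itself is initialized.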
Regarding colocate_gradients_with_ops: Yes, it is not needed for Eager Execution, as it is related to the Control Flow Context of Graphs.
Detailed Explanation: colocate_gradients_with_ops leads to the code below from the TensorFlow source on GitHub. The Control Flow Context is related to the concept of Context, which is related to Graphs, as explained in the Graphs guide on the TF site.
def _colocate_with_for_gradient(self, op, gradient_uid,
                                ignore_existing=False):
    with self.colocate_with(op, ignore_existing):
        if gradient_uid is not None and self._control_flow_context is not None:
            self._control_flow_context.EnterGradientColocation(op, gradient_uid)
            # ... (lines omitted in this excerpt)
            self._control_flow_context.ExitGradientColocation(op, gradient_uid)
Regarding aggregation_method: The equivalent of this parameter has been implemented in 2.0 as _aggregate_grads, as shown in the TensorFlow source on GitHub.
Regarding gate_gradients: Not needed for Eager, as this is also related to the Graph Context.
Detailed Explanation: As shown in the code below from the TensorFlow source on GitHub, if gate_gradients is True, some operations are added to the graph using the function _colocate_with_for_gradient, which in turn depends on the Control Flow Context of Graphs.
if gate_gradients and len([x for x in in_grads
                           if x is not None]) > 1:
    with ops.device(None):
        with ops._colocate_with_for_gradient(  # pylint: disable=protected-access
                ...):  # arguments cut off in this excerpt
            in_grads = control_flow_ops.tuple(in_grads)
Regarding jacobian: Yes, they are the same.

sess.run() does not run?

I'm new here, studying TensorFlow, and I've run into a problem.
import model_method
The above is in the calling script, which imports the function fittt from model_method:
def fittt(model,...):
    ...  # body not shown
build() in the model class:
def build(self,...):
    self.op_C,self.op_A = self.function_A(...)
    self.op_B = self.function_B(self.op_C,...)
fit() in the model class:
def fit(self,...):
    sess = tf.Session(graph=self.graph,config=config)
    BB,AA = sess.run([self.op_B,self.op_A],feed_dict)
To check the running process, I added pdb.set_trace() at the beginning of function_A() and function_B(), as follows:
def function_A(self,...):
    pdb.set_trace()
    ...
def function_B(self,...):
    pdb.set_trace()
    ...
The two pdb.set_trace() calls only stopped when build() was called and did not trigger when sess.run([self.op_B,self.op_A],feed_dict) was called. So it seems sess.run() didn't actually run function_A() and function_B(). Why is that, and how can I make the two functions run?
By calling build() you create the computation graph. During this call every line of Python code is executed (which is why pdb stopped).
However, sess.run() executes only those parts of the computational graph that are necessary to compute the fetched values (self.op_A and self.op_B in your example). It does not execute the build() function, or the Python functions it calls, again.
Therefore pdb.set_trace() did not trigger when you called sess.run(): the pdb calls are plain Python statements, not Tensor objects or ops, and hence not part of the computational graph.
Consider the following:
import numpy as np
import tensorflow as tf
class My_Model:
    def __init__(self):
        self.np_input = np.random.normal(size=(10,2)) # 10x2
    def build(self):
        self._in = tf.placeholder(dtype=tf.float32, shape=[10, None]) # matrix 10xN
        W_exception = tf.random_normal(dtype=tf.float32, shape=[3,3]) # matrix 3x3
        W_success = tf.random_normal(dtype=tf.float32, shape=[2,3]) # matrix 2x3
        self.op_exception = tf.matmul(self._in, W_exception) # [10x2] x [3x3] = ERROR
        self.op_success = tf.matmul(self._in, W_success) # [10x2] x [2x3] = [10x3]
        print('Computational Graph Built')
    def fit_success(self):
        with tf.Session() as sess:
            res = sess.run(self.op_success, feed_dict={self._in : self.np_input})
            print('Result shape: {}'.format(res.shape))
    def fit_exception(self):
        with tf.Session() as sess:
            res = sess.run(self.op_exception, feed_dict={self._in : self.np_input})
            print('Result shape: {}'.format(res.shape))
and then calling:
m = My_Model()
m.build()
#> Computational Graph Built
m.fit_success()
#> Result shape: (10, 3)
m.fit_exception()
#> InvalidArgumentError: Matrix size-incompatible: In[0]: [10,2], In[1]: [3,3]
So to explain what you see there. We first define the computational graph in the build() function. The _in is our input tensor; None means the dimension 1 is determined dynamically - that is once we provide a tensor with specified values.
Then we defined two matrices W_exception and W_success which have all dimensions specified and their values will be randomly generated.
Then we define two operations, matrix multiplication, that each returns a tensor.
We called the build() function and created the computational graph. The print() call is also executed, but it is NOT added to the graph. Nothing is computed here. In fact, nothing could be, because the values of _in are not yet specified.
Now, to show that only the parts required for the computation are evaluated, we call the fit_success() function, which simply multiplies the input tensor _in by the W_success tensor (with correct dimensions). We receive a tensor with the correct shape: [10x3]. Note that we get no error about op_exception being impossible to compute due to mismatched dimensions. That's because we do not need it to evaluate op_success.
Lastly, fit_exception() shows that an exception is indeed thrown when we try to evaluate op_exception with the same input tensor.
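The build-then-run behaviour described above can be imitated in a few lines of plain Python. This is only a toy sketch (the Node class and run helper are invented for illustration and have nothing to do with TensorFlow's real implementation): building nodes runs ordinary Python once, while run evaluates only the nodes the fetched value depends on.

```python
class Node:
    """A tiny stand-in for a graph op: stores a function and its input nodes."""

    def __init__(self, name, fn, inputs=()):
        self.name, self.fn, self.inputs = name, fn, inputs


def run(fetch, evaluated):
    """Evaluate `fetch` and, recursively, only the nodes it depends on."""
    args = [run(node, evaluated) for node in fetch.inputs]
    evaluated.append(fetch.name)
    return fetch.fn(*args)


def fail(_value):
    raise ValueError("incompatible shapes")


# "Build" phase: plain Python runs here, but no node function is called yet.
x = Node("x", lambda: 3)
op_success = Node("op_success", lambda v: v * 2, inputs=(x,))
op_exception = Node("op_exception", fail, inputs=(x,))

# "Run" phase: fetching op_success never touches op_exception.
evaluated = []
result = run(op_success, evaluated)
print(result, evaluated)  # prints 6 ['x', 'op_success']
```

Fetching op_exception with the same run helper would raise the ValueError, just as fit_exception() does in the answer.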

Initializing variables, variable scope and import_graph_def in tensorflow

I have a number of related questions about tensorflow behavior when attempting to do graph surgery using import_graph_def.
[Image: 2 different graph surgeries]
In the image above, I represent with bold red arrows 2 different graph surgeries. On the left, there are 2 graphs, g1 and g2, and the surgery consists of replacing a node in graph g2 by a node - and everything below it - from graph g1. How to do that is explained in this post. The surgery on the right, which involves replacing nodes that belong to the same graph, I haven't been able to figure out how to perform, or even if it is at all possible. I ended up with this minimal example
import numpy as np
import tensorflow as tf
with tf.Graph().as_default() as g1:
    with tf.variable_scope('foo', reuse=tf.AUTO_REUSE):
        x = tf.placeholder(dtype=tf.float64, shape=[2], name='x')
        c = tf.get_variable('c', initializer=tf.cast(1.0, tf.float64))
        y = tf.identity(2*x, 'y')
        z = tf.identity(3*x*c, 'z')
        g1_def = g1.as_graph_def()
        z1, = tf.import_graph_def(g1_def, input_map={'foo/x:0' : y}, return_elements=["foo/z:0"],
                                  name='z1')
        init_op = tf.global_variables_initializer()
        print(tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope='foo'))
with tf.Session(graph=g1) as sess:
    sess.run(init_op)
    print(sess.run(z, feed_dict={'foo/x:0' : np.array([1.0, 2.0])}) )
    print(sess.run(tf.report_uninitialized_variables()))
    # z1 = sess.run(z1, feed_dict={'foo/x:0' : np.array([1.0, 2.0])})
This code runs as it is. The 3 prints yield respectively:
[<tf.Variable 'foo/c:0' shape=() dtype=float64_ref>]
[ 3. 6.]
[]
In particular, the last print shows that there are no uninitialized variables. However, uncommenting the last line yields the error
FailedPreconditionError (see above for traceback): Attempting to use uninitialized value foo/z1/foo/c
Note that if I remove c from the definition of z above, this would also work. However, I would like to understand this error. To begin with, why is the variable reported as foo/z1/foo/c? Why does the scope foo appear twice? Why is nothing reported when I print the uninitialized variables? Why is only foo/c reported when I print the GLOBAL_VARIABLES collection under the scope foo?
PS: I guess that there is a simpler way to ask the question which is, what is the tensorflow analogue of
theano.clone(some_tensor, replace={input_var : replace_var})
To begin with, why is the variable reported as foo/z1/foo/c?
Why does the scope foo appear twice?
After you've called tf.import_graph_def(...), the graph got duplicated. The first graph is defined in the foo scope. The second subgraph has been imported under the scope foo/z1 (because name='z1', plus foo is preserved from the enclosing scope). So the graph g1 now contains both the original tensors under foo (such as foo/c) and their imported copies under foo/z1/foo (such as foo/z1/foo/c).
The first foo/c is initialized, but the second foo/z1/foo/c is not (see below).
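How the doubled scope in the error message comes about can be spelled out with a small string-only sketch. The helper imported_name is hypothetical and only reproduces the naming pattern described above; it is not how TensorFlow computes names internally.

```python
def imported_name(enclosing_scope, import_name, original_name):
    """Join the non-empty scope parts in the order they appear in the error message."""
    parts = [enclosing_scope, import_name, original_name]
    return "/".join(p for p in parts if p)


# variable 'foo/c' from the graph def, imported with name='z1' inside scope 'foo'
print(imported_name("foo", "z1", "foo/c"))  # prints foo/z1/foo/c
```

The first foo comes from the scope that is active when the import is performed, the second from the name stored in the graph def.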
Why is nothing reported when I print the uninitialized variables? Why is only foo/c reported when I print the GLOBAL_VARIABLES collection under the scope foo?
Since report_uninitialized_variables() scans LOCAL_VARIABLES and GLOBAL_VARIABLES by default, this is basically the same question.
And it probably is a bug: the GLOBAL_VARIABLES collection isn't updated after the tf.import_graph_def call. I say probably because GLOBAL_VARIABLES was designed as a mere convenience collection. Tensorflow tries to keep it up to date, but probably doesn't guarantee that it always has all variables. The fact that tf.add_to_collection exists publicly supports this idea: one can add any value to any collection if they want to. Bottom line: this behavior may or may not change in future versions, but as of 1.5 the client is responsible for updating the global variables after a graph import.
In particular, the last print informs that there are no uninitialized variables. However, uncommenting the last line, yields the error
To fix this error, you simply need to run the initializer for the z1 subgraph. Like this:
# note that it's defined before `g1.as_graph_def()` to be a part of graph def
init_op = tf.global_variables_initializer()
g1_def = g1.as_graph_def()
z1, = tf.import_graph_def(g1_def, input_map={'foo/x:0': y}, return_elements=["foo/z:0"],
                          name='z1')
# find the init op
z1_init_op = tf.get_default_graph().get_operation_by_name('foo/z1/foo/init')
# later, inside the session and before evaluating z1:
# sess.run(z1_init_op)
And voila! You have the duplicated graphs, just like you wanted to.
I faced a similar issue but simply running the init operation didn't work.
I fixed it by manually running all "Assign" ops of the global variables of the imported graph.
In my scenario I want to run an encoding op 'z' with input 'patch:0' using two different input tensors.
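with tf.Session(graph=tf.get_default_graph()).as_default() as sess:
    g = tf.Graph()
    saved_model = predictor.from_saved_model(args.export_dir, graph=g)
    variables = [v.name for v in g.get_collection(tf.GraphKeys.GLOBAL_VARIABLES)]
    fetch_ops = ['z:0', 'init']
    # one "<variable name>/Assign" op per global variable (output suffix ":0" dropped)
    fetch_ops.extend([v.rsplit(":", 1)[0] + "/Assign" for v in variables])
    image_graph = tf.graph_util.import_graph_def(
        # (the other arguments are missing from the source)
        input_map={'patch:0': image},
    )
    warped_graph = tf.graph_util.import_graph_def(
        # (the other arguments are missing from the source)
        input_map={'patch:0': warped_image},
    )
    loss = tf.reduce_sum(tf.math.squared_difference(image_graph[0], warped_graph[0]))
    optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.0001)
    compute_gradients = optimizer.compute_gradients(
        # (arguments are missing from the source)
    )
    apply_gradients = optimizer.apply_gradients(compute_gradients, global_step=step)
    sess.run(image_graph[1:])   # run 'init' and the Assign ops of the first imported copy
    sess.run(warped_graph[1:])  # same for the second imported copy
    gradients = sess.run(...)   # fetch is missing from the source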
When extracting the operation and running it by feeding my tensors with feed_dict, gradient_computation doesn't work, that's why I used tf.graph_util.import_graph_def(...).
Hope this might help anyone facing the same issue.

Tensorflow: How to write op with gradient in python?

I would like to write a TensorFlow op in python, but I would like it to be differentiable (to be able to compute a gradient).
This question asks how to write an op in python, and the answer suggests using py_func (which has no gradient): Tensorflow: Writing an Op in Python
The TF documentation describes how to add an op starting from C++ code only.
In my case, I am prototyping so I don't care about whether it runs on GPU, and I don't care about it being usable from anything other than the TF python API.
Yes, as mentioned in @Yaroslav's answer, it is possible, and the key is the links he references: here and here. I want to elaborate on this answer by giving a concrete example.
Modulo operation: Let's implement the element-wise modulo operation in tensorflow (it already exists, but its gradient is not defined; for the sake of the example we will implement it from scratch).
Numpy function: The first step is to define the operation we want for numpy arrays. The element-wise modulo operation is already implemented in numpy, so it is easy:
import numpy as np
def np_mod(x,y):
    return (x % y).astype(np.float32)
The .astype(np.float32) is there because by default tensorflow takes float32 types, and if you give it float64 (the numpy default) it will complain.
Gradient Function: Next we need to define the gradient function of our operation, for each input of the operation, as a tensorflow function. The function needs to take a very specific form: it takes the tensorflow representation of the operation, op, and the gradient of the output, grad, and says how to propagate the gradients. In our case the gradients of the mod operation are easy: the derivative is 1 with respect to the first argument and
$-\lfloor x/y \rfloor$ with respect to the second (almost everywhere, and infinite at a finite number of spots, but let's ignore that). So we have
def modgrad(op, grad):
    x = op.inputs[0] # the first argument (normally you need those to calculate the gradient, like the gradient of x^2 is 2x. )
    y = op.inputs[1] # the second argument
    # tf.neg was renamed to tf.negative in TF 1.0
    return grad * 1, grad * tf.negative(tf.floordiv(x, y)) # the propagated gradient with respect to the first and second argument respectively
The grad function needs to return an n-tuple where n is the number of arguments of the operation. Notice that we need to return tensorflow functions of the input.
Making a TF function with gradients: As explained in the sources mentioned above, there is a hack to define gradients of a function using tf.RegisterGradient and tf.Graph.gradient_override_map.
Copying the code from harpone we can modify the tf.py_func function to make it define the gradient at the same time:
import tensorflow as tf
def py_func(func, inp, Tout, stateful=True, name=None, grad=None):
    # Need to generate a unique name to avoid duplicates:
    rnd_name = 'PyFuncGrad' + str(np.random.randint(0, int(1e8)))
    tf.RegisterGradient(rnd_name)(grad) # see _MySquareGrad for grad example
    g = tf.get_default_graph()
    with g.gradient_override_map({"PyFunc": rnd_name}):
        return tf.py_func(func, inp, Tout, stateful=stateful, name=name)
The stateful option tells tensorflow whether the function always gives the same output for the same input (stateful=False), in which case tensorflow can simplify the graph. This is our case and will probably be the case in most situations.
Combining it all together: Now that we have all the pieces, we can combine them all together:
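from tensorflow.python.framework import ops
def tf_mod(x,y, name=None):
    # ops.op_scope is deprecated; ops.name_scope(name, default_name, values) replaces it
    with ops.name_scope(name, "mod", [x,y]) as name:
        z = py_func(np_mod,
                    [x,y],
                    [tf.float32],
                    name=name,
                    grad=modgrad) # <-- here's the call to the gradient
        return z[0]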
tf.py_func acts on lists of tensors (and returns a list of tensors), that is why we have [x,y] (and return z[0]).
And now we are done. And we can test it.
with tf.Session() as sess:
    x = tf.constant([0.3,0.7,1.2,1.7])
    y = tf.constant([0.2,0.5,1.0,2.9])
    z = tf_mod(x,y)
    gr = tf.gradients(z, [x,y])
    print(x.eval(), y.eval(),z.eval(), gr[0].eval(), gr[1].eval())
[ 0.30000001 0.69999999 1.20000005 1.70000005] [ 0.2 0.5 1. 2.9000001] [ 0.10000001 0.19999999 0.20000005 1.70000005] [ 1. 1. 1. 1.] [ -1. -1. -1. 0.]
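The gradients printed above can be sanity-checked without TensorFlow. The sketch below uses only the Python standard library, with helper names made up for illustration. It compares the analytic gradients used in modgrad, 1 and -floor(x / y), against central finite differences on the same test values.

```python
import math


def mod_grads(x, y):
    """Analytic gradients of x % y as used in modgrad: (1, -floor(x / y))."""
    return 1.0, -math.floor(x / y)


def numeric_grads(x, y, h=1e-6):
    """Central finite differences of x % y with respect to x and to y."""
    dx = ((x + h) % y - (x - h) % y) / (2 * h)
    dy = (x % (y + h) - x % (y - h)) / (2 * h)
    return dx, dy


xs = [0.3, 0.7, 1.2, 1.7]
ys = [0.2, 0.5, 1.0, 2.9]
for x, y in zip(xs, ys):
    analytic = mod_grads(x, y)
    numeric = numeric_grads(x, y)
    print(x, y, analytic, [round(g, 4) for g in numeric])
```

The check is only meaningful away from the points where x / y is an integer, since x % y jumps there and the finite difference straddles the discontinuity.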
Here's an example of adding gradient to a specific py_func
Here's the issue discussion
