How to use tensorflow to approximate hessian matrix's norm - python

I wonder is there any method to recompute gradients with updated weights within a graph or if there is any better way to do this. For example, for estimating hessian norm, we need to compute
delta ~ N(0, I)
hessian_norm = 1/M \sum_{1}^{M} gradient(f(x+delta))- gradient(f(x-delta))/(2*delta)
we need to gradient value on x+delta. Currently we will get None type if we use tf.gradient on var+delta directly.
More specifally speaking, if we define
a = tf.Variable
b = some_function(a)
grad = tf.gradients(b, a)
that's a normal gradient computation but if we do
grad_delta = tf.gradients(b, a+delta)
it will return None. This feature seems to make it impossible to approximate the hessian norm using the above method.

b is not a function of a+delta, so you get Nones. You either need to create new value b2 which depends on a+delta, or just move your a variable by delta and eval again to get second value.
This is similar to how you do line search in TensorFlow.

Related

Why the gradients are unconnected in the following function?

I am implementing a customer operation whose gradients must be calculated. The following is the function:
def difference(prod,box):
result = tf.Variable(tf.zeros((prod.shape[0],box.shape[1]),dtype=tf.float16))
for i in tf.range(0,prod.shape[0]):
for j in tf.range(0,box.shape[1]):
result[i,j].assign((tf.reduce_prod(box[:,j])-tf.reduce_prod(prod[i,:]))/tf.reduce_prod(box[:,j]))
return result
I am unable to calculate the gradients with respect to box, the tape.gradient() is returning None, here is the code I have written for calculating gradients
prod = tf.constant([[3,4,5],[4,5,6],[1,3,3]],dtype=tf.float16)
box = tf.Variable([[4,5],[5,6],[5,7]],dtype=tf.float16)
with tf.GradientTape() as tape:
tape.watch(box)
loss = difference(prod,box)
print(tape.gradient(loss,box))
I am not able to find the reason for unconnected gradients. Is the result variable causing it? Kindly suggest an alternative implementation.
Yes, in order to calculate gradients we need a set of (differentiable) operations on your variables.
You should re-write difference as a function of the 2 input tensors. I think (though happy to confess I am not 100% sure!) that it is the use of 'assign' that makes the gradient tape fall over.
Perhaps something like this:
def difference(prod, box):
box_red = tf.reduce_prod(box, axis=0)
prod_red = tf.reduce_prod(prod, axis=1)
return (tf.expand_dims(box_red, 0) - tf.expand_dims(prod_red, 1)) / tf.expand_dims(box_red, 0)
would get you the desired result

Fitting with funtional parameter constraints in Python

I have some data {x_i,y_i} and I want to fit a model function y=f(x,a,b,c) to find the best fitting values of the parameters (a,b,c); however, the three of them are not totally independent but constraints to 1<b , 0<=c<1 and g(a,b,c)>0, where g is a "good" function. How could I implement this in Python since with curve_fit one cannot put the parametric constraints directly?
I have been reading with lmfit but I only see numerical constraints like 1<b, 0<=c<1 and not the one with g(a,b,c)>0, which is the most important.
If I understand correctly, you have
def g(a,b,c):
c1 = (1.0 - c)
cx = 1/c1
c2 = 2*c1
g = a*a*b*gamma(2+cx)*gamma(cx)/gamma(1+3/c2)-b*b/(1+b**c2)**(1/c2)
return g
If so, and if get the math right, this could be represented as
a = sqrt((g+b*b/(1+b**c2)**(1/c2))*gamma(1+3/c2)/(b*gamma(2+cx)*gamma(cx)))
Which is to say that you could think about your problem as having a variable g which is > 0 and a value for a derived from b, c, and g by the above expression.
And that you can do with lmfit and its expression-based constraint mechanism. You would have to add the gamma function, as with
from lmfit import Parameters
from scipy.special import gamma
params = Parameters()
params._asteval.symtable['gamma'] = gamma
and then set up the parameters with bounds and constraints. I would probably follow the math above to allow better debugging and use something like:
params.add('b', 1.5, min=1)
params.add('c', 0.4, min=0, max=1)
params.add('g', 0.2, min=0)
params.add('c1', expr='1-c')
params.add('cx', expr='1.0/c1')
params.add('c2', expr='2*c1')
params.add('gprod', expr='b*gamma(2+cx)*gamma(cx)/gamma(1+3/c2)')
params.add('bfact', expr='(1+b**c2)**(1/c2)')
params.add('a', expr='sqrt(g+b*b/(bfact*gprod))')
Note that this gives 3 actual variables (now g, b, and c) with plenty of derived values calculated from these, including a. I would certainly check all that math. It looks like you're safe from negative**fractional_power, sqrt(negitive), and gamma(-1), but be aware of these possibilities that will kill the fit.
You could embed all of that into your fitting function, but using constraint expressions gives you the ability to constrain parameter values independently of how the fitting or model function is defined.
Hope that helps. Again, if this does not get to what you are trying to do, post more details about the constraint you are trying to impose.
Like James Phillips, I was going to suggest SciPy's curve_fit. But the way that you have defined your function, one of the constraints is on the function itself, and SciPy's bounds are defined only in terms of input variables.
What, exactly, are the forms of your functions? Can you transform them so that you can use a standard definition of bounds, and then reverse the transformation to give a function in the original form that you wanted?
I have encountered a related problem when trying to fit exponential regressions using SciPy's curve_fit. The parameter search algorithms vary in a linear fashion, and it's really easy to fail to establish a gradient. If I write a function which fits the logarithm of the function I want, it's much easier to make curve_fit work. Then, for my final work, I take the exponent of my fitted function.
This same strategy could work for you. Predict ln(y). The value of that function can be unbounded. Then for your final result, output exp(ln(y)) = y.

Using automatic differentiation libraries to compute partial derivatives of an arbitrary tensor

(Note: this is not a question about back-propagation.)
I am trying so solve on a GPU a non-linear PDE using PyTorch tensors in place of Numpy arrays. I want to calculate the partial derivatives of an arbitrary tensor, akin to the action of the center finite-difference numpy.gradient function. I have other ways around this problem, but since I am already using PyTorch, I'm wondering if it is possible use the autograd module (or, in general, any other autodifferentiation module) to perform this action.
I have created a tensor-compatible version of the numpy.gradient function - which runs a lot faster. But perhaps there is a more elegant way of doing this. I can't find any other sources that address this question, either to show that it's possible or impossible; perhaps this reflects my ignorance with the autodifferentiation algorithms.
I've had this same question myself: when numerically solving PDEs, we need access to spatial gradients (which the numpy.gradients function can give us) all the time - could it be possible to use automatic differentiation to compute the gradients, instead of using finite-difference or some flavor of it?
"I'm wondering if it is possible use the autograd module (or, in general, any other autodifferentiation module) to perform this action."
The answer is no: as soon as you discretize your problem in space or time, then time and space become discrete variables with a grid-like structure, and are not explicit variables which you feed into some function to compute the solution to the PDE.
For example, if I wanted to compute the velocity field of some fluid flow u(x,t), I would discretize in space and time, and I would have u[:,:] where the indices represent positions in space and time.
Automatic differentiation can compute the derivative of a function u(x,t). So why can't it compute the spatial or time derivative here? Because you've discretized your problem. This means you don't have a function for u for arbitrary x, but rather a function of u at some grid points. You can't differentiate automatically with respect to the spacing of the grid points.
As far as I can tell, the tensor-compatible function you've written is probably your best bet. You can see that a similar question has been asked in the PyTorch forums here and here. Or you could do something like
dx = x[:,:,1:]-x[:,:,:-1]
if you're not worried about the endpoints.
You can use PyTorch to compute the gradients of a tensor with respect to another tensor under some constraints. If you're careful to stay within the tensor framework to ensure a computation graph is created, then by repeatedly calling backward on each element of the output tensor and zeroing the grad member of the independent variable you can iteratively query the gradient of each entry. This approach allows you to gradually build the gradient of a vector valued function.
Unfortunately this approach requires calling backward many times which may be slow in practice and may result in very large matrices.
import torch
from copy import deepcopy
def get_gradient(f, x):
""" computes gradient of tensor f with respect to tensor x """
assert x.requires_grad
x_shape = x.shape
f_shape = f.shape
f = f.view(-1)
x_grads = []
for f_val in f:
if x.grad is not None:
x.grad.data.zero_()
f_val.backward(retain_graph=True)
if x.grad is not None:
x_grads.append(deepcopy(x.grad.data))
else:
# in case f isn't a function of x
x_grads.append(torch.zeros(x.shape).to(x))
output_shape = list(f_shape) + list(x_shape)
return torch.cat((x_grads)).view(output_shape)
For example, given the following function:
f(x0,x1,x2) = (x0*x1*x2, x1^2, x0+x2)
The Jacobian at x0, x1, x2 = (1, 2, 3) could be computed as follows
x = torch.tensor((1.0, 2.0, 3.0))
x.requires_grad_(True) # must be set before further computation
f = torch.stack((x[0]*x[1]*x[2], x[1]**2, x[0]+x[2]))
df_dx = get_gradient(f, x)
print(df_dx)
which results in
tensor([[6., 3., 2.],
[0., 4., 0.],
[1., 0., 1.]])
For your case if you can define an output tensor with respect to an input tensor you could use such a function to compute the gradient.
A useful feature of PyTorch is the ability to compute the vector-Jacobian product. The previous example required lots of re-applications of the chain rule (a.k.a. back propagation) via the backward method to compute the Jacobian directly. But PyTorch allows you to compute the matrix/vector product of the Jacobian with an arbitrary vector which is much more efficient than actually building the Jacobian. This may be more in line with what you're looking for since you can finagle it to compute multiple gradients at various values of a function, similar to the way I believe numpy.gradient operates.
For example, here we compute f(x) = x^2 + sqrt(x) for x = 1, 1.1, ..., 1.8 and compute the derivative (which is f'(x) = 2x + 0.5/sqrt(x)) at each of these points
dx = 0.1
x = torch.arange(1, 1.8, dx, requires_grad=True)
f = x**2 + torch.sqrt(x)
f.backward(torch.ones(f.shape))
x_grad = x.grad
print(x_grad)
which results in
tensor([2.5000, 2.6767, 2.8564, 3.0385, 3.2226, 3.4082, 3.5953, 3.7835])
Compare this to numpy.gradient
dx = 0.1
x_np = np.arange(1, 1.8, dx)
f_np = x_np**2 + np.sqrt(x_np)
x_grad_np = np.gradient(f_np, dx)
print(x_grad_np)
which results in the following approximation
[2.58808848 2.67722558 2.85683288 3.03885421 3.22284723 3.40847554 3.59547805 3.68929417]

Alternative plan of tf.floor

One of my operaction need integer, but output of convolution is float.
It means I need to use tf.floor, tf.ceil, tf.cast...etc to handle it.
But these operactions cause None gradients, since operactions like tf.floor are not differentiable
So, I tried something like below
First. detour
out1 = tf.subtract(vif, tf.subtract(vif, tf.floor(vif)))
But output of test.compute_gradient_error is 500 or 0, I don't think this is a reasonable gradient.
Second. override gradient function of floor
#ops.RegisterGradient("CustomFloor")
def _custom_floor_grad(op, grads):
return [grads]
A, B = 50, 7
shape = [A, B]
f = np.ones(shape, dtype=np.float32)
vif = tf.constant(f, dtype=tf.float32)
# out1 = tf.subtract(vif, tf.subtract(vif, tf.floor(vif)))
with tf.get_default_graph().gradient_override_map({"Floor": "CustomFloor"}):
out1 = tf.floor(vif)
with tf.Session() as sess:
err1 = tf.test.compute_gradient_error(vif, shape, out1, shape)
print err1
output of test.compute_gradient_error is 500 or 1, doesn't work too.
Question: A way to get integer and keep back propagation work fine (value like 2.0, 5.0 is ok)
In general, it's not inadvisable to solve discrete problem with gradient descent. You should be able express, to some extent integer solvers in TF but you're more or less on your own.
FWIW, the floor function looks like a saw. Its derivative is a constant function at 1 with little holes at every integer. At these positions you have a Dirac functional pointing downwards, like a rake if you wish. The Dirac functional has finite energy but no finite value.
The canonical way to tackle these problems is to relax the problem by "relaxiing" the hard floor constraint with something that is (at least once) differentiable (smooth).
There are multiple ways to do this. Perhaps the most popular are:
Hack up a function that looks like what you want. For instance a piece-wise linear function that slopes down quickly, but not vertically.
Replace step functions by sigmoids
Use a filter approximation which is well understood if it's a time series

How to calculate gradients in a numerically stable fashion

I would like to compute derivatives of a ratio f = - a / b in a numerically stable fashion using tensorflow but am running into problems when a and b are small (<1e-20 when using 32-bit floating point representation). Of course, the derivative of f is df_db = a / b ** 2 but because of the operator precedence, the square in the denominator is computed first, underflows, and leads to an undefined gradient.
If the derivative was calculated as df_db = (a / b) / b, the underflow would not occur and the gradient would be well-defined as illustrated in the figure below which shows the gradient as a function of a = b. The blue line corresponds to the domain in which tensorflow can calculate the derivative. The orange line corresponds to the domain in which the denominator underflows yielding an infinite gradient. The green line corresponds to the domain in which the denominator overflows yielding a zero gradient. In both problematic domains, the gradient can be calculated using the modified expression above.
I've been able to get a more numerically stable expression by using the ugly hack
g = exp(log(a) - log(b))
which is equivalent to f but yields a different tensorflow graph. But I run into the same problem if I want to calculate a higher-order derivative. The code to reproduce the problem can be found here.
Is there a recommended approach to alleviate such problems? Is it possible to explicitly define a derivative of an expression in tensorflow if one doesn't want to rely on autodifferentiation?
Thanks to Yaroslav Bulatov's pointer, I was able to implement a custom function with the desired gradient.
# Define the division function and its gradient
#function.Defun(tf.float32, tf.float32, tf.float32)
def newDivGrad(x, y, grad):
return tf.reciprocal(y) * grad, - tf.div(tf.div(x, y), y) * grad
#function.Defun(tf.float32, tf.float32, grad_func=newDivGrad)
def newDiv(x, y):
return tf.div(x, y)
Full notebook is here. PR is here.

Categories