How to calculate gradients in a numerically stable fashion

I would like to compute derivatives of a ratio f = - a / b in a numerically stable fashion using tensorflow but am running into problems when a and b are small (<1e-20 when using 32-bit floating point representation). Of course, the derivative of f is df_db = a / b ** 2 but because of the operator precedence, the square in the denominator is computed first, underflows, and leads to an undefined gradient.
If the derivative was calculated as df_db = (a / b) / b, the underflow would not occur and the gradient would be well-defined as illustrated in the figure below which shows the gradient as a function of a = b. The blue line corresponds to the domain in which tensorflow can calculate the derivative. The orange line corresponds to the domain in which the denominator underflows yielding an infinite gradient. The green line corresponds to the domain in which the denominator overflows yielding a zero gradient. In both problematic domains, the gradient can be calculated using the modified expression above.
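The effect can be reproduced outside TensorFlow. The following is a minimal NumPy sketch, assuming float32 arrays and the illustrative value a = b = 1e-25 (chosen here for demonstration, not taken from the question's notebook). It compares the two algebraically equivalent ways of evaluating df_db.

```python
import numpy as np

# Illustrative inputs (an assumption): small enough that b*b underflows in float32.
a = np.array([1e-25], dtype=np.float32)
b = np.array([1e-25], dtype=np.float32)

with np.errstate(divide="ignore", under="ignore", over="ignore"):
    naive = a / (b * b)    # b*b underflows to 0, so the quotient becomes inf
    stable = (a / b) / b   # a/b is about 1, so the result stays finite

print(naive, stable)
```

The only difference between the two expressions is the order of evaluation, which is exactly what a custom gradient lets you control.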
I've been able to get a more numerically stable expression by using the ugly hack
g = exp(log(a) - log(b))
which is equivalent to f up to sign but yields a different tensorflow graph. But I run into the same problem if I want to calculate a higher-order derivative. The code to reproduce the problem can be found here.
Is there a recommended approach to alleviate such problems? Is it possible to explicitly define a derivative of an expression in tensorflow if one doesn't want to rely on autodifferentiation?

Thanks to Yaroslav Bulatov's pointer, I was able to implement a custom function with the desired gradient.
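import tensorflow as tf
from tensorflow.python.framework import function

# Define the division function and its gradient (TensorFlow 1.x API)
@function.Defun(tf.float32, tf.float32, tf.float32)
def newDivGrad(x, y, grad):
    # d(x/y)/dx = 1/y and d(x/y)/dy = -(x/y)/y, which avoids forming y**2
    return tf.reciprocal(y) * grad, - tf.div(tf.div(x, y), y) * grad

@function.Defun(tf.float32, tf.float32, grad_func=newDivGrad)
def newDiv(x, y):
    return tf.div(x, y)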
Full notebook is here. PR is here.

Related

Mixed partial derivative w.r.t. tensor in Pytorch

Question:
Is there any working method to calculate the gradient of a (non-scalar) tensor function?
Example
Given n by n symmetric matrices X, Y and matrix function Z(X, Y) = torch.mm(X.mm(X), Y) calculate d(dZ/dX)/dY.
Expected answer
d(dZ/dX)/dY = d(2*XY)/dY = 2*X
Attempts
Because torch's .backward() works only for scalar variables I've tried to calculate derivative by applying torch.autograd.grad() to each element of tensor Z, but this approach is not correct, because it gives d(X^2)/dX = X + 2*D where D is a diagonal matrix with diagonal values of X. For me it's a bit weird that torch has an ability to build a computational graph, but can't track tensor through it as a variable to get tensor derivative.
Edit
Question was not very clear, so I decided to give more details.
My aim is to get partial derivative of loss function, which involves two matrices as variables. It looks like that:
loss = torch.linalg.norm(my_formula(X, Y) , ord='fro')
And I need to find
1. d^2(loss)/d(Y^2)
2. d/dX[d(loss)/dY]
Torch is capable of calculating 1. by using .backward() two times, but it's problematic to find 2. because torch.autograd.grad() expects a scalar input and not a tensor.

TL;DR
For a function f which takes a matrix and gives a scalar:
1. Find the first-order derivative; let's name it dX.
2. Take its trace: Tr(dX).
3. To get the mixed partial derivative, differentiate the trace from above: d/dY[df/dX] = d/dY[Tr(df/dX)]
Intro
At the moment of posting the question I was not really that good at theory of matrix derivatives, but now I know much more all thanks to this Yandex ml book (unfortunately, I didn't find the english equivalent). This is an attempt to give a full answer to my question.
Basic Theory
Forgive me, Lord, for ugly representation of latex
Let's say you have a function which takes a matrix X and returns its squared Frobenius norm: f(X) = ||X||_F^2
It is a well-known fact that: ||X||_F^2 = Tr(X X^T)
Let's define the derivative as shown in the same book: D[f] at X_0 is the part of f(X_0 + H) - f(X_0) that is linear in the increment H
We are ready to find df(X)/dX:
df(X)/dX = dTr(X X^T)/dX =
(using Trace's feature)
= Tr(d/dX[X X^T]) = Tr(dX/dX X^T + X d[X^T]/dX ) =
(then we should use the definition of derivative from above)
= Tr(HX^T + XH^T) = Tr(HX^T) + Tr(XH^T) =
(now the main trick is to get all matrices H on the right side and get something like
Tr(g(X) H) or Tr(g(X) H^T), where g(X) will be the derivative we are looking for; since Tr(A) = Tr(A^T), we have Tr(HX^T) = Tr(XH^T))
= Tr(HX^T) + Tr(XH^T) = Tr(XH^T) + Tr(XH^T) = Tr(2*XH^T)
That means: df(X)/dX = 2X
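The result df(X)/dX = 2X can be sanity-checked numerically. This is a minimal sketch using NumPy and central finite differences; the helper name numerical_gradient, the 3 by 3 size, and the random seed are arbitrary choices made for illustration.

```python
import numpy as np

def f(X):
    """Squared Frobenius norm, written as Tr(X X^T)."""
    return np.trace(X @ X.T)

def numerical_gradient(func, X, h=1e-6):
    """Central finite differences, perturbing one matrix entry at a time."""
    G = np.zeros_like(X)
    for i in range(X.shape[0]):
        for j in range(X.shape[1]):
            E = np.zeros_like(X)
            E[i, j] = h
            G[i, j] = (func(X + E) - func(X - E)) / (2 * h)
    return G

rng = np.random.default_rng(0)   # arbitrary seed, for reproducibility only
X = rng.standard_normal((3, 3))
G = numerical_gradient(f, X)
print(np.allclose(G, 2 * X, atol=1e-6))
```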
Second order derivative
Now, after we found out how to get matrix derivatives, let's try to find second order derivative of the same function f(X):
d/dX[df(X)/dX] = d/dX[Tr(2XH_1^T)] = Tr(d/dX[2XH_1^T]) =
= Tr(2I H_2 H_1^T)
We found out that d/dX[df(X)/dX] = 2I where I stands for Identity matrix. But how will it help us to find derivatives in Pytorch?
Trace is the trick
As we can see from the formulas, both first and second order derivatives have Trace inside them, but when we take first order derivative we just instantly get matrix as a result. To get a higher order derivative we just need to take the derivative of trace of first order derivative:
d/dY[df/dX] = d/dY[Tr(df/dX)]
The thing is I was using JAX autograd library when this trick came to my mind, so the code with a function f(X,Y) will look like this:
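import jax.numpy as jnp
from jax import grad

# f(X, Y) is the scalar-valued function being differentiated
def scalarized_dy(X, Y):
    dY = grad(f, argnums=1)(X, Y)  # first-order derivative w.r.t. Y
    return jnp.trace(dY)

dYX = grad(scalarized_dy, argnums=0)(X, Y)
dYY = grad(scalarized_dy, argnums=1)(X, Y)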
In case of Pytorch I guess we will need to look after tensors' gradients (let loss be a function with X and Y as arguments):
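import torch

# X and Y must have been created with requires_grad=True
loss = f(X, Y)
# Use autograd.grad rather than calling .backward() twice: a second .backward()
# would accumulate into X.grad and Y.grad on top of the first-order gradients.
(dX,) = torch.autograd.grad(loss, X, create_graph=True)
tr = torch.trace(dX)
dXX, dXY = torch.autograd.grad(tr, (X, Y), allow_unused=True)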
Epilogue
I thought that the question itself is in some way interesting. Also, it took me several months to figure things out, so I decided to give my current point of view on this problem. I will not mark my answer as correct yet in hope that I will get some kind of feedback or, perhaps, even better answers or ideas.

Differential entropy is calculated with integrate.quad in scipy.stats?

The entropy method of the continuous distributions in scipy.stats calculates the differential entropy for a continuous random variable. By which estimation method, and which formula, exactly is it calculating differential entropy? (i.e. the differential entropy of a normal distribution versus that of the beta distribution)
Below is its github code. Differential entropy is the negative integral of the p.d.f. multiplied by the log p.d.f., but nowhere do I see this or the log written. Could it be in the call to integrate.quad?
def _entropy(self, *args):
    def integ(x):
        val = self._pdf(x, *args)
        return entr(val)

    # upper limit is often inf, so suppress warnings when integrating
    _a, _b = self._get_support(*args)
    with np.errstate(over='ignore'):
        h = integrate.quad(integ, _a, _b)[0]

    if not np.isnan(h):
        return h
    else:
        # try with different limits if integration problems
        low, upp = self.ppf([1e-10, 1. - 1e-10], *args)
        if np.isinf(_b):
            upper = upp
        else:
            upper = _b
        if np.isinf(_a):
            lower = low
        else:
            lower = _a
        return integrate.quad(integ, lower, upper)[0]
Source (lines 2501 - 2524): https://github.com/scipy/scipy/blob/master/scipy/stats/_distn_infrastructure.py

You have to store a continuous random variable in some parametrized way anyway, unless you work with an approximation. In that case, you usually work with distribution objects; and for known distributions, formulae for the differential entropy in terms of the parameters exist.
Scipy accordingly provides an entropy method for rv_continuous that calculates the differential entropy where possible:
In [5]: import scipy.stats as st
In [6]: rv = st.beta(0.5, 0.5)
In [7]: rv.entropy()
Out[7]: array(-0.24156448)
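As for where the logarithm hides in the quoted _entropy: as far as I can tell, the entr it calls is scipy.special.entr, which evaluates -p * log(p) elementwise. The generic fallback therefore integrates -pdf * log(pdf) over the support with integrate.quad. The sketch below reproduces that by hand for a standard normal (an illustrative choice) and compares it with the known closed form 0.5 * log(2 * pi * e).

```python
import numpy as np
from scipy import integrate, stats
from scipy.special import entr

# entr(p) evaluates -p * log(p) elementwise, with entr(0) defined as 0.
p = 0.25
print(entr(p), -p * np.log(p))

# Integrate entr(pdf(x)) over the support, as the generic _entropy does.
rv = stats.norm()
h_quad = integrate.quad(lambda x: entr(rv.pdf(x)), -np.inf, np.inf)[0]

# Closed form for a standard normal: 0.5 * log(2 * pi * e).
h_exact = 0.5 * np.log(2 * np.pi * np.e)
print(h_quad, h_exact, rv.entropy())
```

Individual distributions may override _entropy with such a closed form, in which case no numerical integration takes place.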

The actual question here is how do you store a continuous variable in memory. You might use some discretization techniques and calculate entropy for a discrete random variable.
You also may check Tensorflow Probability, which treats distributions essentially as tensors and has a method entropy() for a Distribution class.

How does pytorch compute derivatives for simple functions?

When we talk about auto-differentiation in pytorch, we are usually presented with a graph of tensors built from their formulas, and pytorch computes the gradients by traversing that graph using the chain rule. However, I want to know what happens at the leaf nodes. Does pytorch hardcode a whole list of basic functions with their analytical derivatives, or does it compute the gradients using numerical methods? A quick example:
import torch

def f(x):
    return x ** 2

x = torch.tensor([1.0], requires_grad=True)
y = f(x)
y.backward()
print(x.grad)  # tensor([2.])
In this example, does pytorch compute the derivative by $$ (x^2)' = 2x = 2 * 1 = 2 $$, or does pytorch compute in a way similar to $$ (1.00001^2 - 1^2) / (1.00001 - 1) ~ 2 $$ ?
Thanks!

See this paper for exact answer, specifically section 2.1 or figure 2.
In short, PyTorch has a list of basic functions and the expressions of their derivatives. So, what is done in your case (y = x*x) is evaluating $$ y' = 2x $$.
The numerical method you mentioned is called numerical differentiation or finite differences, and it is an approximation of the derivative. But it is not what PyTorch does.
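To make the distinction concrete, here is a toy sketch in plain Python. The rule table DERIVATIVE_RULES and both helper functions are hypothetical names invented for this illustration; they are not PyTorch internals, only a picture of "stored formula" versus "finite difference".

```python
import math

# Hypothetical rule table: each primitive is paired with its analytic derivative.
DERIVATIVE_RULES = {
    "square": (lambda x: x ** 2, lambda x: 2 * x),
    "sin": (math.sin, math.cos),
}

def analytic_derivative(name, x):
    """Look up the stored derivative formula and evaluate it at x."""
    _, dfunc = DERIVATIVE_RULES[name]
    return dfunc(x)

def finite_difference(func, x, h=1e-6):
    """Central-difference approximation of the derivative at x."""
    return (func(x + h) - func(x - h)) / (2 * h)

square = DERIVATIVE_RULES["square"][0]
print(analytic_derivative("square", 1.0))   # evaluates the formula 2*x
print(finite_difference(square, 1.0))       # approximation, subject to rounding
```

The first call needs no step size and carries no truncation error, which is why autodiff frameworks prefer stored derivative rules combined through the chain rule.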

How to use tensorflow to approximate hessian matrix's norm

I wonder whether there is any method to recompute gradients with updated weights within a graph, or whether there is a better way to do this. For example, to estimate the hessian norm, we need to compute
delta ~ N(0, I)
hessian_norm = 1/M \sum_{1}^{M} (gradient(f(x+delta)) - gradient(f(x-delta)))/(2*delta)
so we need the gradient value at x+delta. Currently we get None if we use tf.gradients on var+delta directly.
More specifically, if we define
a = tf.Variable
b = some_function(a)
grad = tf.gradients(b, a)
that's a normal gradient computation but if we do
grad_delta = tf.gradients(b, a+delta)
it will return None. This feature seems to make it impossible to approximate the hessian norm using the above method.

b is not a function of a+delta, so you get Nones. You either need to create new value b2 which depends on a+delta, or just move your a variable by delta and eval again to get second value.
This is similar to how you do line search in TensorFlow.
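The idea of "evaluate the gradient again at the shifted point" can be sketched without TensorFlow. The example below uses NumPy and a toy quadratic f(x) = 0.5 * x^T A x, whose gradient A @ x is known in closed form; the matrix A, the helper names, and the un-normalized central difference are all assumptions made for illustration, not the formula from the question.

```python
import numpy as np

# Toy problem (an assumption): f(x) = 0.5 * x^T A x, so the Hessian is A.
A = np.array([[2.0, 0.5],
              [0.5, 1.0]])

def grad_f(x):
    """Gradient of the toy quadratic, A @ x."""
    return A @ x

def hessian_vector_estimate(grad_fn, x, delta):
    """Central difference of the gradient: approximates H @ delta."""
    # The gradient is re-evaluated at x + delta and x - delta; nothing is
    # differentiated with respect to the shifted point itself.
    return (grad_fn(x + delta) - grad_fn(x - delta)) / 2.0

rng = np.random.default_rng(0)   # arbitrary seed
x = np.array([1.0, -1.0])
delta = rng.standard_normal(2)
hv = hessian_vector_estimate(grad_f, x, delta)
print(hv, A @ delta)
```

In a graph framework the equivalent step is to build a second output that depends on the shifted variable and take gradients of that output, as the answer describes.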

On ordinary differential equations (ODE) and optimization, in Python

I want to solve this kind of problem:
dy/dt = 0.01*y*(1-y), find t when y = 0.8 (0<t<3000)
I've tried the ode function in Python, but it can only calculate y when t is given.
So are there any simple ways to solve this problem in Python?
PS: This function is just a simple example. My real problem is so complex that it can't be solved analytically, so I want to know how to solve it numerically. I think this problem is more like an optimization problem:
Objective function y(t) = 0.8, Subject to dy/dt = 0.01*y*(1-y), and 0<t<3000
PPS: My real problem is:
objective function: F(t) = 0.85,
subject to: F(t) = sqrt(x(t)^2+y(t)^2+z(t)^2),
x''(t) = (1/F(t)-1)*250*x(t),
y''(t) = (1/F(t)-1)*250*y(t),
z''(t) = (1/F(t)-1)*250*z(t)-10,
x(0) = 0, y(0) = 0, z(0) = 0.7,
x'(0) = 0.1, y'(0) = 1.5, z'(0) = 0,
0<t<5

This differential equation can be solved analytically quite easily:
dy/dt = 0.01 * y * (1-y)
rearrange to gather y and t terms on opposite sides
0.01 dt = 1/(y * (1-y)) dy
The lhs integrates trivially to 0.01 * t; the rhs is slightly more complicated. We can write the reciprocal of a product of two linear factors as a sum of partial fractions with some constants:
1/(y * (1-y)) = A/y + B/(1-y)
The values for A and B can be worked out by putting the rhs over a common denominator and comparing the constant and first-order y terms on both sides. In this case it is simple: A = B = 1. Thus we have to integrate
(1/y + 1/(1-y)) dy
The first term integrates to ln(y); the second term can be integrated with the change of variables u = 1-y to give -ln(1-y). Our integrated equation therefore looks like:
0.01 * t + C = ln(y) - ln(1-y)
not forgetting the constant of integration (it is convenient to write it on the lhs here). We can combine the two logarithm terms:
0.01 * t + C = ln( y / (1-y) )
In order to solve for t at an exact value of y, we first need to work out the value of C. We do this using the initial conditions. It is clear that if y starts at 0 or 1, dy/dt = 0 and the value of y never changes. Otherwise, plug in the values for y and t at the beginning
0.01 * 0 + C = ln( y(0) / (1 - y(0)) )
This will give a value for C (assuming y(0) is not 0 or 1); then use y = 0.8 to get a value for t, namely t = 100 * (ln(4) - C). Note that because of the logarithm, y will reach 0.8 well within 0 < t < 3000 unless the initial value of y is incredibly small. It is of course also straightforward to rearrange the equation above to express y in terms of t, then you can plot the function as well.
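Separating variables in dy/dt = 0.01 * y * (1-y) gives 0.01 * t + C = ln(y / (1-y)), which can be solved for t directly. Below is a small sketch of that calculation; the function name time_to_reach and the starting value y(0) = 0.5 are illustrative assumptions, since the question does not state an initial condition.

```python
import math

def time_to_reach(y_target, y0, rate=0.01):
    """Solve rate * t + C = ln(y / (1 - y)) for t, with C fixed by y(0) = y0."""
    if not (0.0 < y0 < 1.0 and 0.0 < y_target < 1.0):
        raise ValueError("y0 and y_target must lie strictly between 0 and 1")

    def logit(y):
        return math.log(y / (1.0 - y))

    return (logit(y_target) - logit(y0)) / rate

t_hit = time_to_reach(0.8, y0=0.5)   # y0 = 0.5 is an illustrative choice
print(t_hit)
```

With y(0) = 0.5 the constant C is ln(1) = 0, so the result is simply 100 * ln(4).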
Edit: Numerical integration
For a more complex ODE which cannot be solved analytically, you will have to try numerically. Initially we only know the value of the function at zero time y(0) (we have to know at least that in order to uniquely define the trajectory of the function), and how to evaluate the gradient. The idea of numerical integration is that we can use our knowledge of the gradient (which tells us how the function is changing) to work out what the value of the function will be in the vicinity of our starting point. The simplest way to do this is Euler integration:
y(dt) = y(0) + dy/dt * dt
Euler integration assumes that the gradient is constant between t=0 and t=dt. Once y(dt) is known, the gradient can be calculated there also and in turn used to calculate y(2 * dt) and so on, gradually building up the complete trajectory of the function. If you are looking for a particular target value, just wait until the trajectory goes past that value, then interpolate between the last two positions to get the precise t.
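The Euler procedure with a final interpolation step can be written in a few lines of plain Python. This is only a sketch: the function name euler_crossing, the step size dt = 0.01, and the starting value y(0) = 0.5 are assumptions chosen for illustration.

```python
def deriv(t, y):
    return 0.01 * y * (1 - y)

def euler_crossing(deriv, y0, target, dt, t_max):
    """Step forward with Euler's method until y crosses target, then
    linearly interpolate between the last two points to estimate t."""
    t, y = 0.0, y0
    while t < t_max:
        y_next = y + deriv(t, y) * dt          # gradient assumed constant over dt
        if y != y_next and (y - target) * (y_next - target) <= 0:
            frac = (target - y) / (y_next - y)  # fraction of the step to the target
            return t + frac * dt
        t, y = t + dt, y_next
    return None                                 # target not reached before t_max

t_cross = euler_crossing(deriv, y0=0.5, target=0.8, dt=0.01, t_max=3000.0)
print(t_cross)
```

A smaller dt reduces the error at the cost of more steps; a larger dt makes the constant-gradient assumption worse.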
The problem with Euler integration (and with all other numerical integration methods) is that its results are only accurate when its assumptions are valid. Because the gradient is not constant between pairs of time points, a certain amount of error will arise for each integration step, which over time will build up until the answer is completely inaccurate. In order to improve the quality of the integration, it is necessary to use more sophisticated approximations to the gradient. Check out for example the Runge-Kutta methods, which are a family of integrators which remove progressive orders of error term at the cost of increased computation time. If your function is differentiable, knowing the second or even third derivatives can also be used to reduce the integration error.
Fortunately of course, somebody else has done the hard work here, and you don't have to worry too much about solving problems like numerical stability or have an in depth understanding of all the details (although understanding roughly what is going on helps a lot). Check out http://docs.scipy.org/doc/scipy/reference/generated/scipy.integrate.ode.html#scipy.integrate.ode for an example of an integrator class which you should be able to use straightaway. For instance
from scipy.integrate import ode

def deriv(t, y):
    return 0.01 * y * (1 - y)

my_integrator = ode(deriv)
my_integrator.set_initial_value(0.5)  # y(0) = 0.5, starting at t = 0

t = 0.1  # start with a small value of time
while t < 3000:
    y = my_integrator.integrate(t)[0]  # integrate() returns an array
    if y > 0.8:
        print("y(%f) = %f" % (t, y))
        break
    t += 0.1
This code will print out the first t value when y passes 0.8 (or nothing if it never reaches 0.8). If you want a more accurate value of t, keep the y of the previous t as well and interpolate between them.

As an addition to Krastanov's answer:
Aside from PyDSTool there are other packages, like Pysundials and Assimulo, which provide bindings to the solver IDA from Sundials. This solver has root finding capabilities.

Use scipy.integrate.odeint to handle your integration, and analyse the results afterward.
import numpy as np
from scipy.integrate import odeint

ts = np.arange(0, 3000, 1)  # time series - start, stop, step

def rhs(y, t):
    return 0.01 * y * (1 - y)

y0 = np.array([0.5])  # initial value; y = 0 and y = 1 are fixed points, so start strictly between them
ys = odeint(rhs, y0, ts)
Then analyse the numpy array ys to find your answer (ys has one row for each entry of ts). (This may not work first time because I am constructing from memory).
This might involve using the scipy interpolate function for the ys array, such that you get a result at time t.
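One way to carry out that analysis is to locate the first sample at or above the target and interpolate linearly between it and the previous sample. The sketch below does this with NumPy alone rather than scipy.interpolate; the starting value y(0) = 0.5 and the variable names are illustrative assumptions.

```python
import numpy as np
from scipy.integrate import odeint

def rhs(y, t):
    return 0.01 * y * (1 - y)

ts = np.arange(0, 3000, 1.0)
ys = odeint(rhs, [0.5], ts)[:, 0]    # y(0) = 0.5 is an illustrative choice

target = 0.8
above = ys >= target
if above.any() and not above[0]:
    i = int(np.argmax(above))        # first sample at or past the target
    frac = (target - ys[i - 1]) / (ys[i] - ys[i - 1])
    t_cross = ts[i - 1] + frac * (ts[i] - ts[i - 1])
else:
    t_cross = None                   # never crossed, or already past at t = 0
print(t_cross)
```

Bracketing the crossing first avoids feeding a non-monotonic array to an interpolation routine, which can happen once y saturates near 1.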
EDIT: I see that you wish to solve a spring in 3D. This should be fine with the above method; Odeint on the scipy website has examples for systems such as coupled springs that can be solved for, and these could be extended.

What you are asking for is a ODE integrator with root finding capabilities. They exist and the low-level code for such integrators is supplied with scipy, but they have not yet been wrapped in python bindings.
For more information see this mailing list post that provides a few alternatives: http://mail.scipy.org/pipermail/scipy-user/2010-March/024890.html
You can use the following example implementation which uses backtracking (hence it is not optimal as it is a bolt-on addition to an integrator that does not have root finding on its own): https://github.com/scipy/scipy/pull/4904/files
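More recent SciPy releases (1.0 and later, as far as I know) include scipy.integrate.solve_ivp, whose events argument performs root finding during the integration. A minimal sketch for the simple example from the question follows; the starting value y(0) = 0.5, the tolerances, and the function name hit_target are illustrative assumptions.

```python
import numpy as np
from scipy.integrate import solve_ivp

def rhs(t, y):
    return 0.01 * y * (1 - y)

def hit_target(t, y):
    """Event function: zero when y reaches 0.8."""
    return y[0] - 0.8

hit_target.terminal = True    # stop the integration at the first root
hit_target.direction = 1      # only trigger when crossing from below

sol = solve_ivp(rhs, (0.0, 3000.0), [0.5], events=hit_target,
                rtol=1e-8, atol=1e-10)   # y(0) = 0.5 is an illustrative choice
t_hit = sol.t_events[0][0] if sol.t_events[0].size else None
print(t_hit)
```

The same pattern should extend to the 3D system in the question by rewriting the second-order equations as six first-order ones and using sqrt(x^2 + y^2 + z^2) - 0.85 as the event function.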
