When we talk about auto-differentiation in PyTorch, we are usually presented with a graph structure of tensors based on their formulas, and PyTorch computes the gradients by tracing through that graph with the chain rule. However, I want to know what happens at the leaf nodes: does PyTorch hardcode a whole list of basic functions with their analytical derivatives, or does it compute the gradients using numerical methods? A quick example:
import torch

def f(x):
    return x ** 2

x = torch.tensor([1.0], requires_grad=True)
y = f(x)
y.backward()
print(x.grad)  # 2.0
In this example, does PyTorch compute the derivative analytically, as $$ (x^2)' = 2x = 2 \cdot 1 = 2 $$, or does it compute it numerically, in a way similar to $$ (1.00001^2 - 1^2) / (1.00001 - 1) \approx 2 $$ ?
Thanks!
See this paper for the exact answer, specifically section 2.1 or figure 2.
In short, PyTorch has a list of basic functions together with the expressions of their derivatives. So what is done in your case (y = x ** 2) is evaluating $$ y' = 2x $$.
The numerical method you mentioned is called numerical differentiation (or finite differences), and it is only an approximation of the derivative. It is not what PyTorch does.
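To see the difference concretely, here is a minimal sketch comparing the exact gradient returned by autograd with a hand-rolled finite-difference estimate (the step size 1e-5 is an arbitrary choice):

import torch

x = torch.tensor([1.0], requires_grad=True)
y = x ** 2
y.backward()
print(x.grad)  # tensor([2.]) -- exact, from the stored rule d(x^2)/dx = 2x

# Finite-difference approximation of the same derivative (not what autograd does)
h = 1e-5
with torch.no_grad():
    approx = ((x + h) ** 2 - x ** 2) / h
print(approx)  # close to 2, but only approximately

PyTorch only resorts to finite differences in testing utilities such as torch.autograd.gradcheck, which compares the analytical gradient against a numerical approximation.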
Related
I have the following three functions implemented in JAX.
def helper_1(params1):
    ...calculations...
    return z

def helper_2(z, params2):
    ...calculations...
    return y

def main(params1, params2):
    z = helper_1(params1)
    y = helper_2(z, params2)
    return z, y
I am interested in the partial derivatives of the outputs of main, i.e. z and y, with respect to both params1 and params2. As params1 and params2 are low-dimensional and z and y are high-dimensional, I am using the jax.jacfwd function.
When calling
jax.jacfwd(main, argnums=(0, 1))(params1, params2)
JAX computes the derivatives of z with respect to params1 (and params2, which in this case is just a bunch of zeros). My question is: does JAX recompute dz/dparams1 for the derivatives of y with respect to params1 and params2, or does it somehow figure out that this has already been computed?
I don't know if this is relevant, but the helper_1 function contains functions from the TensorFlow library for JAX. Thanks!
In general, in the situation you describe JAX's forward-mode autodiff approach will re-use the derivative of z when computing the derivative of y. If you wish, you can confirm this by looking at the jaxpr of your differentiated function:
print(jax.make_jaxpr(jax.jacfwd(main, (0, 1)))(params1, params2))
Though if your function is more than moderately complicated, the output might be hard to understand.
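For instance, with small stand-in versions of the helpers (the functions below are hypothetical placeholders, not your actual code), you can inspect the jaxpr and check whether helper_1's tangent computation is duplicated or appears only once feeding both outputs:

import jax
import jax.numpy as jnp

def helper_1(params1):
    # hypothetical stand-in for the real helper_1
    return jnp.sin(params1) * 2.0

def helper_2(z, params2):
    # hypothetical stand-in for the real helper_2
    return z ** 2 + params2

def main(params1, params2):
    z = helper_1(params1)
    y = helper_2(z, params2)
    return z, y

params1 = jnp.array([0.1, 0.2])
params2 = jnp.array([1.0, 2.0])
print(jax.make_jaxpr(jax.jacfwd(main, argnums=(0, 1)))(params1, params2))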
As a general note, though, JAX's autodiff implementation does tend to produce a small number of unnecessary or duplicated computations. As a simple example consider this:
import jax
print(jax.make_jaxpr(jax.grad(jax.lax.sin))(1.0))
# { lambda ; a:f32[]. let
# _:f32[] = sin a
# b:f32[] = cos a
# c:f32[] = mul 1.0 b
# in (c,) }
Here the primal value sin(a) is computed even though it is never used in computing the final output.
In practice this can be addressed by wrapping your computation in jit, in which case the XLA compiler takes care of optimization, including dead code elimination and de-duplication when applicable:
result = jax.jit(jax.jacfwd(main, (0, 1)))(params1, params2)
Question:
Is there any working method to calculate the gradient of a (non-scalar) tensor function?
Example
Given n by n symmetric matrices X, Y and the matrix function Z(X, Y) = torch.mm(X.mm(X), Y), calculate d(dZ/dX)/dY.
Expected answer
d(dZ/dX)/dY = d(2*XY)/dY = 2*X
Attempts
Because torch's .backward() works only for scalar variables, I've tried to calculate the derivative by applying torch.autograd.grad() to each element of the tensor Z, but this approach is not correct, because it gives d(X^2)/dX = X + 2*D, where D is a diagonal matrix with the diagonal values of X. For me it's a bit weird that torch is able to build a computational graph, but can't track a tensor through it as a variable to get a tensor derivative.
Edit
Question was not very clear, so I decided to give more details.
My aim is to get the partial derivatives of a loss function, which involves two matrices as variables. It looks like this:
loss = torch.linalg.norm(my_formula(X, Y), ord='fro')
And I need to find
1. d^2(loss)/d(Y^2)
2. d/dX[d(loss)/dY]
Torch is capable of calculating 1. by calling .backward() twice, but it's problematic to find 2. because torch.autograd.grad() expects a scalar input, not a tensor.
TL;DR
For a function f which takes a matrix and returns a scalar:
1. Find the first-order derivative, let's name it dX
2. Take the trace: Tr(dX)
3. To get the mixed partial derivative, just take the derivative of the trace from above: d/dY[df/dX] = d/dY[Tr(df/dX)]
Intro
At the moment of posting the question I was not really that good at the theory of matrix derivatives, but now I know much more, all thanks to this Yandex ML book (unfortunately, I didn't find an English equivalent). This is an attempt to give a full answer to my question.
Basic Theory
Forgive me, Lord, for ugly representation of latex
Let's say you have a function which takes a matrix X and returns its squared Frobenius norm: f(X) = ||X||_F^2
It is a well-known fact that: ||X||_F^2 = Tr(X X^T)
Let's define the derivative as it is done in the same book: D[f] at X_0, applied to an increment H, is the linear part of f(X_0 + H) - f(X_0).
We are ready to find df(X)/dX:
df(X)/dX = dTr(X X^T)/dX =
(using the linearity of the trace, which lets us move the differential inside)
= Tr(d/dX[X X^T]) = Tr(dX/dX X^T + X d[X^T]/dX ) =
(now we use the definition of the derivative from above, replacing the differentials with the increment H)
= Tr(HX^T + XH^T) = Tr(HX^T) + Tr(XH^T) =
(now the main trick is to get all matrices H on the right side and get something like
Tr(g(X) H) or Tr(g(X) H^T), where g(X) will be the derivative we are looking for)
= Tr(HX^T) + Tr(XH^T) = Tr(XH^T) + Tr(XH^T) = Tr(2*XH^T)
That means: df(X)/dX = 2X
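A quick numerical sanity check of this result (a minimal sketch; any small matrix will do):

import torch

X = torch.randn(3, 3, requires_grad=True)
f = torch.linalg.norm(X, ord='fro') ** 2   # ||X||_F^2 = Tr(X X^T)
f.backward()
print(torch.allclose(X.grad, 2 * X))       # True: df(X)/dX = 2X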
Second order derivative
Now that we have found out how to take matrix derivatives, let's try to find the second-order derivative of the same function f(X):
d/dX[df(X)/dX] = d/dX[Tr(2XH_1^T)] = Tr(d/dX[2XH_1^T]) =
= Tr(2I H_2 H_1^T)
We found out that d/dX[df(X)/dX] = 2I, where I stands for the identity matrix. But how will it help us find derivatives in PyTorch?
Trace is the trick
As we can see from the formulas, both the first- and second-order derivatives have a trace inside them, but when we take the first-order derivative we just instantly get a matrix as a result. To get a higher-order derivative we just need to take the derivative of the trace of the first-order derivative:
d/dY[df/dX] = d/dY[Tr(df/dX)]
The thing is, I was using the JAX autograd library when this trick came to my mind, so the code for a function f(X, Y) will look like this:
import jax.numpy as jnp
from jax import grad

def scalarized_dy(X, Y):
    dY = grad(f, argnums=1)(X, Y)
    return jnp.trace(dY)

dYX = grad(scalarized_dy, argnums=0)(X, Y)
dYY = grad(scalarized_dy, argnums=1)(X, Y)
In the case of PyTorch, I guess we will need to manage the tensors' gradients ourselves (let f be a function with X and Y as arguments):
loss = f(X, Y)
loss.backward(create_graph=True)
dX = torch.trace(X.grad)
# clear the accumulated first-order gradients so that the next backward
# pass leaves only the second-order / mixed terms in .grad
X.grad = None
Y.grad = None
dX.backward()
dXX = X.grad
dXY = Y.grad
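To check that the recipe does what we expect, here is a self-contained sketch with the concrete choice f(X, Y) = Tr(XY) (an assumption made only for this check). Then df/dX = Y^T, so Tr(df/dX) = Tr(Y), and the mixed partial d/dY[Tr(df/dX)] should be the identity matrix:

import torch

X = torch.randn(3, 3, requires_grad=True)
Y = torch.randn(3, 3, requires_grad=True)

loss = torch.trace(X @ Y)      # f(X, Y) = Tr(XY), so df/dX = Y^T
loss.backward(create_graph=True)

dX = torch.trace(X.grad)       # Tr(df/dX) = Tr(Y^T) = Tr(Y)
X.grad = None
Y.grad = None
dX.backward()

print(Y.grad)                  # identity matrix: d/dY[Tr(Y)] = I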
Epilogue
I think the question itself is interesting in its own way. Also, it took me several months to figure things out, so I decided to share my current point of view on this problem. I will not mark my answer as correct yet, in the hope that I will get some feedback or, perhaps, even better answers or ideas.
(Note: this is not a question about back-propagation.)
I am trying to solve a non-linear PDE on a GPU using PyTorch tensors in place of NumPy arrays. I want to calculate the partial derivatives of an arbitrary tensor, akin to the action of the centered finite-difference numpy.gradient function. I have other ways around this problem, but since I am already using PyTorch, I'm wondering if it is possible to use the autograd module (or, in general, any other autodifferentiation module) to perform this action.
I have created a tensor-compatible version of the numpy.gradient function, which runs a lot faster. But perhaps there is a more elegant way of doing this. I can't find any other sources that address this question, either to show that it's possible or that it's impossible; perhaps this reflects my ignorance of autodifferentiation algorithms.
I've had this same question myself: when numerically solving PDEs, we need access to spatial gradients (which the numpy.gradient function can give us) all the time. Could it be possible to use automatic differentiation to compute the gradients, instead of finite differences or some flavor of them?
"I'm wondering if it is possible use the autograd module (or, in general, any other autodifferentiation module) to perform this action."
The answer is no: as soon as you discretize your problem in space or time, time and space become discrete variables with a grid-like structure, and are no longer explicit variables that you feed into some function to compute the solution to the PDE.
For example, if I wanted to compute the velocity field of some fluid flow u(x,t), I would discretize in space and time, and I would have u[:,:] where the indices represent positions in space and time.
Automatic differentiation can compute the derivative of a function u(x,t). So why can't it compute the spatial or time derivative here? Because you've discretized your problem. This means you don't have a function giving u for arbitrary x, but rather values of u at some grid points. You cannot differentiate automatically with respect to the spacing of the grid points.
As far as I can tell, the tensor-compatible function you've written is probably your best bet. You can see that a similar question has been asked in the PyTorch forums here and here. Or you could do something like
dx = x[:,:,1:]-x[:,:,:-1]
if you're not worried about the endpoints.
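If you do want the full numpy.gradient-style behaviour (central differences in the interior, one-sided differences at the ends), a minimal tensor version might look like the sketch below; the function name and the last-axis convention are my own choices:

import torch

def gradient_last_dim(u, dx):
    # central differences in the interior, one-sided at the ends,
    # along the last dimension of tensor u (same idea as numpy.gradient)
    grad = torch.empty_like(u)
    grad[..., 1:-1] = (u[..., 2:] - u[..., :-2]) / (2 * dx)
    grad[..., 0] = (u[..., 1] - u[..., 0]) / dx
    grad[..., -1] = (u[..., -1] - u[..., -2]) / dx
    return grad

# quick check against a known derivative: d/dx sin(x) = cos(x)
x = torch.linspace(0, 3.14, 100)
u = torch.sin(x)
print(torch.allclose(gradient_last_dim(u, x[1] - x[0]), torch.cos(x), atol=1e-2))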
You can use PyTorch to compute the gradients of a tensor with respect to another tensor, under some constraints. If you're careful to stay within the tensor framework so that a computation graph is created, then by repeatedly calling backward on each element of the output tensor and zeroing the grad member of the independent variable, you can iteratively query the gradient of each entry. This approach lets you gradually build up the gradient of a vector-valued function.
Unfortunately this approach requires calling backward many times which may be slow in practice and may result in very large matrices.
import torch
from copy import deepcopy

def get_gradient(f, x):
    """ computes gradient of tensor f with respect to tensor x """
    assert x.requires_grad
    x_shape = x.shape
    f_shape = f.shape
    f = f.view(-1)
    x_grads = []
    for f_val in f:
        if x.grad is not None:
            x.grad.data.zero_()
        f_val.backward(retain_graph=True)
        if x.grad is not None:
            x_grads.append(deepcopy(x.grad.data))
        else:
            # in case f isn't a function of x
            x_grads.append(torch.zeros(x.shape).to(x))
    output_shape = list(f_shape) + list(x_shape)
    return torch.cat(x_grads).view(output_shape)
For example, given the following function:
f(x0,x1,x2) = (x0*x1*x2, x1^2, x0+x2)
The Jacobian at x0, x1, x2 = (1, 2, 3) could be computed as follows
x = torch.tensor((1.0, 2.0, 3.0))
x.requires_grad_(True) # must be set before further computation
f = torch.stack((x[0]*x[1]*x[2], x[1]**2, x[0]+x[2]))
df_dx = get_gradient(f, x)
print(df_dx)
which results in
tensor([[6., 3., 2.],
        [0., 4., 0.],
        [1., 0., 1.]])
For your case if you can define an output tensor with respect to an input tensor you could use such a function to compute the gradient.
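As an aside, more recent PyTorch releases ship torch.autograd.functional.jacobian, which performs essentially this element-by-element procedure for you; a hedged sketch reproducing the example above:

import torch
from torch.autograd.functional import jacobian

def f(x):
    return torch.stack((x[0] * x[1] * x[2], x[1] ** 2, x[0] + x[2]))

x = torch.tensor((1.0, 2.0, 3.0))
print(jacobian(f, x))
# tensor([[6., 3., 2.],
#         [0., 4., 0.],
#         [1., 0., 1.]])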
A useful feature of PyTorch is the ability to compute the vector-Jacobian product. The previous example required lots of re-applications of the chain rule (a.k.a. back propagation) via the backward method to compute the Jacobian directly. But PyTorch allows you to compute the matrix/vector product of the Jacobian with an arbitrary vector which is much more efficient than actually building the Jacobian. This may be more in line with what you're looking for since you can finagle it to compute multiple gradients at various values of a function, similar to the way I believe numpy.gradient operates.
For example, here we compute f(x) = x^2 + sqrt(x) for x = 1, 1.1, ..., 1.7 and compute the derivative (which is f'(x) = 2x + 0.5/sqrt(x)) at each of these points
dx = 0.1
x = torch.arange(1, 1.8, dx, requires_grad=True)
f = x**2 + torch.sqrt(x)
f.backward(torch.ones(f.shape))
x_grad = x.grad
print(x_grad)
which results in
tensor([2.5000, 2.6767, 2.8564, 3.0385, 3.2226, 3.4082, 3.5953, 3.7835])
Compare this to numpy.gradient
import numpy as np

dx = 0.1
x_np = np.arange(1, 1.8, dx)
f_np = x_np**2 + np.sqrt(x_np)
x_grad_np = np.gradient(f_np, dx)
print(x_grad_np)
which results in the following approximation
[2.58808848 2.67722558 2.85683288 3.03885421 3.22284723 3.40847554 3.59547805 3.68929417]
I want to write some custom CUDA kernels for neural networks to speed up computations, but I don't want to spend time differentiating tensor expressions by hand if there are packages that can do it automatically.
Is there a Python package that can show the expression for symbolic matrix differentiation?
I know sympy can do it for non-matrix expressions like this:
import inspect
import sympy as sp

def func(x):
    return 1 / x

arg_symbols = sp.symbols(inspect.getfullargspec(func).args)
sym_func = func(*arg_symbols)
s = ''
for arg in arg_symbols:
    s += '{}\n'.format(sp.Lambda(arg_symbols, sym_func.diff(arg)))

# this is what I need:
print(s)
>>> Lambda(x, -1/x**2)
I know the autograd package can compute the derivatives of matrix expressions:
After the function is evaluated, autograd has a list of all operations
that were performed and which nodes they depended on. This is the
computational graph of the function evaluation. To compute the
derivative, we simply apply the rules of differentiation to each node
in the graph.
But is there a way to get this differentiation computational graph from it, or from some similar package?
There are some significant differences between the packages you cited. These differences are the reason why you cannot (AFAIK) get the computational graph directly from an automatic-differentiation library, but you can get it from a symbolic one.
In short:
numerical differentiation: numpy is enough
symbolic differentiation: sympy
automatic differentiation: autograd is one example
There are three kinds of differentiation methods:
numerical differentiation: this evaluates Delta(f(x)) / Delta(x), where Delta(x) is a small difference that represents the variation of x while remaining in the domain of f. This is not what you need, and you don't need a package for it.
symbolic differentiation: this is based upon the construction of a graph that represents the symbolic application of functions (I have an article on the implementation of a symbolic engine in Ruby here). In this case the differentiation is performed through the recursive application of the chain rule:
f(g(x))' = f'(g(x)) * g'(x)
When this rule is applied to the whole symbolic graph, the result is a new symbolic graph containing the derivative. The advantage is that the derivative is exact, but for very complex graphs the final derivative graph may become intractable (exceeding memory limits, or stack limits due to too-deep recursion). In Python, sympy implements this kind of derivation. On the other hand, if you have the graph of the derivative you can perform further operations on it, such as simplification or substitution.
from sympy import *
import numpy as np

x = Symbol('x')
f = 1 / x
df = diff(f, x)
print(df)
# -1/x**2

ldf = lambdify((x), df)
# Now ldf is a lambda that also works on numpy arrays
x_ary = np.array([
    [[1, 2, 3], [1, 2, 3]],
    [[1, 2, 3], [1, 2, 3]]
])
y_ary = ldf(x_ary)
print(y_ary.shape)
# (2, 2, 3)
print(y_ary)
# array([[[-1.        , -0.25      , -0.11111111],
#         [-1.        , -0.25      , -0.11111111]],
#        [[-1.        , -0.25      , -0.11111111],
#         [-1.        , -0.25      , -0.11111111]]])
As you can see it works with numpy, but it only covers the basic cases, not everything; in fact sympy.Matrix together with sympy.Symbol should be used for particular graphs (for example, I don't think it can handle diff(x.T A x, x) = (A + A.T) x directly).
It is also possible to export the graph as C code, but the capabilities are limited and for your application you will certainly have to modify the result:
from sympy.utilities.codegen import codegen
[(cf, cc), (hf, hc)] = codegen(("df", df), "C", "df")
print(hc, cc)
Prints out:
/*****************************************************
* Code generated with sympy 1.1.1 *
* See http://www.sympy.org/ for more information. *
* This file is part of 'project' *
*****************************************************/
#ifndef PROJECT__DIFF__H
#define PROJECT__DIFF__H
double df(double x);
#endif
/*****************************************************
* Code generated with sympy 1.1.1 *
* See http://www.sympy.org/ for more information. *
* This file is part of 'project' *
*****************************************************/
#include "diff.h"
#include <math.h>
double df(double x) {
double df_result;
df_result = -1/pow(x, 2);
return df_result;
}
automatic differentiation: this is what autograd does. In this case the best of both worlds is combined. On one side it is not necessary to explicitly build the graph; on the other side you cannot perform further operations on the derivative as an expression, although the derivative itself is still exact. This is done (usually) by augmenting the float definition with an additional field (something like float[2] or more), where the additional field contains the derivative. For example, in an automatic differentiation environment the sin function may be overloaded with:
def sin(x):
    return [sin(x[0]), x[1] * cos(x[0])]
But as you can understand, in this way no computational graph is available; you directly get the value of the exact derivative along x (all functions must be overloaded). I have a more complete example (in C, using only macros) here. Notice that internally TensorFlow uses automatic differentiation instead of symbolic differentiation, but suggests that users directly provide an "explicit version" that handles numerical instabilities! Automatic differentiation usually does not handle numerical instability.
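To make the dual-number idea concrete, here is a hedged, minimal sketch of forward-mode automatic differentiation in plain Python (a real library overloads far more operations and handles arrays, but the principle is the same):

import math

class Dual:
    # a value together with its derivative; every operation propagates both
    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val + other.val, self.dot + other.dot)

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # product rule: (uv)' = u'v + uv'
        return Dual(self.val * other.val,
                    self.dot * other.val + self.val * other.dot)

def sin(x):
    # chain rule: sin(u)' = cos(u) * u'
    return Dual(math.sin(x.val), math.cos(x.val) * x.dot)

# derivative of f(x) = sin(x) * x + 2 at x = 1.0, seeded with dx/dx = 1
x = Dual(1.0, 1.0)
y = sin(x) * x + 2
print(y.val, y.dot)   # f(1) and f'(1) = cos(1) * 1 + sin(1)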
I would like to compute derivatives of a ratio f = -a / b in a numerically stable fashion using TensorFlow, but am running into problems when a and b are small (< 1e-20 when using 32-bit floating-point representation). Of course, the derivative of f with respect to b is df_db = a / b ** 2, but because of operator precedence the square in the denominator is computed first, underflows, and leads to an undefined gradient.
If the derivative were calculated as df_db = (a / b) / b, the underflow would not occur and the gradient would be well-defined, as illustrated in the figure below, which shows the gradient as a function of a = b. The blue line corresponds to the domain in which TensorFlow can calculate the derivative. The orange line corresponds to the domain in which the denominator underflows, yielding an infinite gradient. The green line corresponds to the domain in which the denominator overflows, yielding a zero gradient. In both problematic domains, the gradient can be calculated using the modified expression above.
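The underflow itself is easy to reproduce in isolation; a small sketch in plain NumPy with float32 (the values are chosen arbitrarily small):

import numpy as np

a = np.float32(1e-25)
b = np.float32(1e-25)
print(b * b)        # 0.0 -- b**2 underflows in float32
print(a / (b * b))  # inf -- the naive derivative a / b**2
print((a / b) / b)  # 1e+25 -- the rewritten form stays finite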
I've been able to get a more numerically stable expression by using the ugly hack
g = exp(log(a) - log(b))
which is equivalent to f but yields a different tensorflow graph. But I run into the same problem if I want to calculate a higher-order derivative. The code to reproduce the problem can be found here.
Is there a recommended approach to alleviate such problems? Is it possible to explicitly define a derivative of an expression in tensorflow if one doesn't want to rely on autodifferentiation?
Thanks to Yaroslav Bulatov's pointer, I was able to implement a custom function with the desired gradient.
import tensorflow as tf
from tensorflow.python.framework import function

# Define the division function and its gradient
@function.Defun(tf.float32, tf.float32, tf.float32)
def newDivGrad(x, y, grad):
    return tf.reciprocal(y) * grad, - tf.div(tf.div(x, y), y) * grad

@function.Defun(tf.float32, tf.float32, grad_func=newDivGrad)
def newDiv(x, y):
    return tf.div(x, y)
Full notebook is here. PR is here.
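For newer TensorFlow versions, the same thing can be done with tf.custom_gradient instead of the Defun machinery; a hedged sketch of the stable division with its hand-written gradient (the function name stable_div and the test values are my own choices):

import tensorflow as tf

@tf.custom_gradient
def stable_div(a, b):
    y = tf.divide(a, b)
    def grad(upstream):
        # d(a/b)/da = 1/b ;  d(a/b)/db = -(a/b)/b, written to avoid b**2 underflowing
        return upstream / b, -upstream * tf.divide(tf.divide(a, b), b)
    return y, grad

a = tf.constant(1e-25)
b = tf.constant(1e-25)
with tf.GradientTape() as tape:
    tape.watch(b)
    f = -stable_div(a, b)
print(tape.gradient(f, b))  # finite: a / b**2 evaluated as (a/b)/b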