I am interested in computing the derivative of a matrix determinant using TensorFlow. I can see from experimentation that TensorFlow has not implemented a method of differentiating through a determinant:
LookupError: No gradient defined for operation 'MatrixDeterminant'
(op type: MatrixDeterminant)
A little further investigation revealed that it is actually possible to compute the derivative; see, for example, Jacobi's formula. To implement this way of differentiating through a determinant, I determined that I need to use the function decorator:
@tf.RegisterGradient("MatrixDeterminant")
def _sub_grad(op, grad):
    ...
However, I am not familiar enough with TensorFlow to understand how this can be accomplished. Does anyone have any insight on this matter?
Here's an example where I run into this issue:
x = tf.Variable(tf.ones(shape=[1]))
y = tf.Variable(tf.ones(shape=[1]))
A = tf.reshape(
    tf.pack([tf.sin(x), tf.zeros([1, ]), tf.zeros([1, ]), tf.cos(y)]), (2,2)
)
loss = tf.square(tf.matrix_determinant(A))
optimizer = tf.train.GradientDescentOptimizer(0.001)
train = optimizer.minimize(loss)
init = tf.initialize_all_variables()
sess = tf.Session()
sess.run(init)
for step in xrange(100):
    sess.run(train)
print sess.run(x)
Please check "Implement Gradient in Python" section here
In particular, you can implement it as follows
from tensorflow.python.framework import ops

@ops.RegisterGradient("MatrixDeterminant")
def _MatrixDeterminantGrad(op, grad):
    """Gradient for MatrixDeterminant. Use formula from 2.2.4 from
    An extended collection of matrix derivative results for forward and reverse
    mode algorithmic differentiation by Mike Giles
    -- http://eprints.maths.ox.ac.uk/1079/1/NA-08-01.pdf
    """
    A = op.inputs[0]   # input matrix
    C = op.outputs[0]  # its determinant
    Ainv = tf.matrix_inverse(A)
    # d det(A) / dA = det(A) * inv(A)^T, scaled by the upstream gradient
    return grad*C*tf.transpose(Ainv)
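The formula used in the gradient above, d det(A)/dA = det(A) * inv(A)^T, can also be sanity-checked outside TensorFlow. The following is only a sketch in plain NumPy; the helper names `det_grad` and `det_grad_fd` are made up for illustration, and the comparison is against a central finite difference rather than anything TensorFlow computes.

```python
import numpy as np

def det_grad(A):
    """Analytic gradient of det(A) w.r.t. A: det(A) * inv(A)^T."""
    return np.linalg.det(A) * np.linalg.inv(A).T

def det_grad_fd(A, eps=1e-6):
    """Central finite-difference estimate of the same gradient."""
    G = np.zeros_like(A)
    for i in range(A.shape[0]):
        for j in range(A.shape[1]):
            E = np.zeros_like(A)
            E[i, j] = eps  # perturb a single entry
            G[i, j] = (np.linalg.det(A + E) - np.linalg.det(A - E)) / (2 * eps)
    return G

A = np.array([[1.0, 2.0], [3.0, 4.0]])
analytic = det_grad(A)
numeric = det_grad_fd(A)
print(np.allclose(analytic, numeric, atol=1e-5))
```

For a 2x2 matrix [[a, b], [c, d]] the analytic result is simply [[d, -c], [-b, a]], which makes the check easy to do by hand as well.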
Then a simple training loop to check that it works:
a0 = np.array([[1,2],[3,4]]).astype(np.float32)
a = tf.Variable(a0)
b = tf.square(tf.matrix_determinant(a))
init_op = tf.initialize_all_variables()
sess = tf.InteractiveSession()
init_op.run()
minimization_steps = 50
learning_rate = 0.001
optimizer = tf.train.GradientDescentOptimizer(learning_rate)
train_op = optimizer.minimize(b)
losses = []
for i in range(minimization_steps):
    train_op.run()
    losses.append(b.eval())
Then you can visualize your loss over time
import matplotlib.pyplot as plt
plt.ylabel("Determinant Squared")
plt.xlabel("Iterations")
plt.plot(losses)
You should see something like this (a plot of "Determinant Squared" against "Iterations").
I think you are confused about what the derivative of a matrix determinant is.
The matrix determinant is a function calculated over the elements of the matrix by some formula. So if all the elements of the matrix are numbers, the determinant will give you just one number and the derivative will be 0. When some of the elements are variables, you will get an expression in these variables. For example:
x, x^2
1, sin(x)
The determinant will be x*sin(x) - x^2 and the derivative is sin(x) + x*cos(x) - 2x. Jacobi's formula just connects the derivative of the determinant with the adjugate matrix.
In your example your matrix A consists of only numbers and therefore the determinant is just a number and the loss is just a number as well. GradientDescentOptimizer needs to have some free variables to minimize and does not have any because your loss is just a number.
For those who are interested, I found a solution that works for my problem:
@tf.RegisterGradient("MatrixDeterminant")
def _MatrixDeterminant(op, grad):
    """Gradient for MatrixDeterminant."""
    # Note: the upstream gradient `grad` is unused here; multiply by it to apply the chain rule.
    return op.outputs[0] * tf.transpose(tf.matrix_inverse(op.inputs[0]))
Related
I'm trying to implement the backward pass for an NN with a final layer of Softmax and loss function of Cross Entropy. I'm following the notes in this article (particularly the "Matrix Multiplication" section).
I'd first like to make sure I'm calculating the derivative of the error with respect to the final outputs correctly. I'm working on the MNIST classification problem, and so y represents a one-hot encoding of the target and y_hat is my predicted probabilities.
def cross_entropy(y, y_hat):
    value = np.log2(np.sum(y*y_hat))
    return value
def d_cross_entropy(y, y_hat):
    return -y/y_hat*np.log(2)
I'm a lot more confused about getting the gradient of Softmax. If we say that A = Softmax(Wx+b), then taking the gradient of A with respect to X is more difficult because Ai does not just depend on Xi but on all elements of the X vector. This means that rather than getting a simple 10-dimensional dA/dX term, I get a 10x10 matrix, which throws off the matrix multiplication. I tried taking the sum to reduce this to a 10-dimensional vector, but this seems incorrect.
def softmax(x):
    exp = np.exp(x)
    return exp/np.sum(exp)
def d_softmax(x):
    softmax_x = softmax(x)
    jacobian = np.outer(softmax_x, -softmax_x)
    adj = np.eye(x.shape[0])*softmax_x
    jacobian += adj
    return jacobian.reshape((x.shape[0], x.shape[0])).sum()
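One way to handle the 10x10 Jacobian is to keep it as a matrix and multiply it by the incoming gradient, rather than summing it. The sketch below assumes a natural-log cross-entropy (the code above uses log base 2) and a made-up 3-element input; `softmax_jacobian` is a hypothetical helper name. Under those assumptions the chain rule collapses to the familiar `y_hat - y`.

```python
import numpy as np

def softmax(x):
    # Subtracting the max improves numerical stability and leaves the result unchanged.
    exp = np.exp(x - np.max(x))
    return exp / np.sum(exp)

def softmax_jacobian(x):
    # J[i, j] = dA_i/dx_j = A_i * (delta_ij - A_j)
    a = softmax(x)
    return np.diag(a) - np.outer(a, a)

def cross_entropy(y, y_hat):
    # Natural-log cross-entropy (an assumption made for this sketch).
    return -np.sum(y * np.log(y_hat))

def d_cross_entropy(y, y_hat):
    # dL/dy_hat for the natural-log loss above.
    return -y / y_hat

x = np.array([0.5, -1.0, 2.0])  # hypothetical pre-activations (Wx + b)
y = np.array([0.0, 0.0, 1.0])   # one-hot target
y_hat = softmax(x)

# Chain rule: dL/dx = J^T @ dL/dy_hat, a vector with the same shape as x.
dL_dx = softmax_jacobian(x).T @ d_cross_entropy(y, y_hat)
print(np.allclose(dL_dx, y_hat - y))
```

The result has the same shape as `x`, so it slots into the rest of the backward pass without any reduction over the Jacobian.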
I have two variables, x and theta. I am trying to minimise my loss with respect to theta only, but as part of my loss function I need the derivative of a different function (f) with respect to x. This derivative itself is not relevant to the minimisation, only its output. However, when implementing this in PyTorch I am getting a Runtime error.
A minimal example is as follows:
# minimal example of two different autograds
import torch
from torch.autograd.functional import jacobian
def f(theta, x):
    return torch.sum(theta * x ** 2)
def df(theta, x):
    J = jacobian(lambda x: f(theta, x), x)
    return J
# example evaluations of the autograd gradient
x = torch.tensor([1., 2.])
theta = torch.tensor([1., 1.], requires_grad = True)
# derivative should be 2*theta*x (same as an analytical)
with torch.no_grad():
    print(df(theta, x))
    print(2*theta*x)
tensor([2., 4.])
tensor([2., 4.])
# define some arbitrary loss as a fn of theta
loss = torch.sum(df(theta, x)**2)
loss.backward()
gives the following error
RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn
If I provide an analytic derivative (2*theta*x), it works fine:
loss = torch.sum((2*theta*x)**2)
loss.backward()
Is there a way to do this in PyTorch? Or am I limited in some way?
Let me know if anyone needs any more details.
PS
I am imagining the solution is something similar to the way that JAX does autograd, as that is what I am more familiar with. What I mean here is that in JAX I believe you would just do:
from jax import grad
df = grad(lambda x: f(theta, x))
and then df would just be a function that can be called at any point. But is PyTorch the same? Or is there some conflict within .backward() that causes this error?
PyTorch's jacobian does not create a computation graph unless you explicitly ask for it with the create_graph argument:
J = jacobian(lambda x: f(theta, x), x, create_graph=True)
The documentation is quite clear about it:
create_graph (bool, optional) – If True, the Jacobian will be computed in a differentiable manner
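Putting this together with the minimal example from the question, a sketch of the corrected version might look as follows. It reuses the question's `f`, `x` and `theta`; the only change is `create_graph=True`, and the comparison value 8*theta*x**2 is the hand-derived gradient of sum((2*theta*x)**2).

```python
import torch
from torch.autograd.functional import jacobian

def f(theta, x):
    return torch.sum(theta * x ** 2)

def df(theta, x):
    # create_graph=True keeps the Jacobian differentiable w.r.t. theta
    return jacobian(lambda x_: f(theta, x_), x, create_graph=True)

x = torch.tensor([1., 2.])
theta = torch.tensor([1., 1.], requires_grad=True)

loss = torch.sum(df(theta, x) ** 2)  # same value as sum((2*theta*x)**2)
loss.backward()
print(theta.grad)                    # compare with the analytic 8*theta*x**2
```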
I am learning about custom gradients in TensorFlow 1.14. I am testing them out by defining a custom gradient for a simple ReLU function as follows:
import numpy as np
import tensorflow as tf
@tf.custom_gradient
def rateFunction(v_):
    z_ = tf.nn.relu(v_)
    def grad(dy):
        dz_dv = tf.where(tf.greater(v_, 0.), tf.ones_like(v_), tf.zeros_like(v_))
        dv = dy * dz_dv
        return [dv]
    return z_, grad
# define test input
vv = tf.random.normal((32,100))
# output from customized gradient
z1 = rateFunction(vv)
and I expect the gradient computed using the custom gradient to match the gradient of the actual ReLU, but it does not:
# output of actual relu
z2 = tf.nn.relu(vv)
# Compute the gradient
sess = tf.Session()
dzdv1=sess.run(tf.gradients(z1, vv)[0])
dzdv2=sess.run(tf.gradients(z2, vv)[0])
# Expect to match, i.e. difference to be 0
print(np.mean(np.abs(dzdv1-dzdv2)))
but the difference between the expected and actual is not zero. I got a mean absolute difference of about 0.49. Can someone please explain to me why this is happening? Thanks a lot!
The problem comes from
vv = tf.random.normal((32,100))
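A different input is generated each time the graph is run, so the two sess.run calls compute their gradients at different random inputs. Fetch both gradients in a single sess.run call to compare them on the same input.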
I am implementing the Elastic Weight Consolidation method, and for that I need to compute the Fisher matrix. As I understand it, the Fisher matrix is just the Hessian of the likelihood with respect to the weights of the neural network. There is a convenient function for this: torch.autograd.functional.hessian(func, inputs, create_graph=False, strict=False)
So I want to compute hessian(loss,weights) where loss = torch.nn.CrossEntropyLoss(). I also flattened the weights of the network into one long 1D tensor, so that I can simply take the diagonal elements of the Hessian, like this:
def flat_param(model_param = yann_lecun.parameters()):
    ans_data = []
    ans_data = torch.tensor(ans_data, requires_grad=True)
    ans_data = ans_data.to(device)
    for p in model_param:
        temp_data = p.data.flatten()
        ans_data = torch.cat((ans_data,temp_data))
    return ans_data
ans = flat_param(yann_lecun.parameters())
Then I tried hessian(loss, inputs = ans), but the problem is that loss also takes the targets, and I don't want to compute the Hessian with respect to them. The task is MNIST classification, so the targets are integers 0,...,9.
If I add y_train to the inputs, as in hessian(loss,inputs = (ans,y_train_01)),
it crashes with the message "can't take gradient from integer". I also tried setting y_train_01.requires_grad = False, but it didn't help. I understand that loss also depends on y_train_01, but is there any way to specify that the targets are constants in my case?
You can create a new 'wrapper' function where the targets are fixed:
def original_model(features, targets):
    ...
    return ...
def feature_differentiable_model(features):
    fixed_targets = ...
    return original_model(features, fixed_targets)
And then call:
hessian(feature_differentiable_model, features_vals)
The second order partial derivatives from this will be equivalent to the analogous ones of the full Hessian product at the location (features_vals, fixed_targets).
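As a concrete sketch of the wrapper idea, here is a version where the differentiated input is a flat weight vector, as in the question. Everything specific in it is an assumption made for illustration: a single hypothetical linear layer stands in for the real network, and the names `loss_of_weights`, `features`, `w0` and the shapes are invented. The integer targets are captured by the closure, so hessian only ever differentiates with respect to the weights.

```python
import torch
from torch.autograd.functional import hessian

torch.manual_seed(0)
features = torch.randn(8, 3)          # hypothetical inputs: 8 samples, 3 features
targets = torch.randint(0, 4, (8,))   # integer class labels, treated as constants
loss_fn = torch.nn.CrossEntropyLoss()

def loss_of_weights(w_flat):
    # Rebuild the (hypothetical) 3x4 weight matrix from the flat vector.
    W = w_flat.view(3, 4)
    logits = features @ W
    # targets come from the enclosing scope, not from hessian's inputs
    return loss_fn(logits, targets)

w0 = torch.randn(12)                  # flat weights at which to evaluate
H = hessian(loss_of_weights, w0)      # shape (12, 12)
diag = torch.diagonal(H)              # diagonal entries only
print(H.shape, diag.shape)
```

Note that the function passed to hessian has to compute the loss from the flat tensor it receives; a flat copy built from `p.data` is detached from the model and would not influence the model's output.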
I'm trying to implement training an MLP with Newton's method in Theano. I've managed to find out how to obtain the inverse hessian (it's pretty easy, just needed to remove some asserts in the nlinalg module). However, when I try to update my weight vectors with my gradient_descent function, I get the following error:
numpy.linalg.linalg.LinAlgError: Singular matrix
Apply node that caused the error: MatrixInverse(InplaceDimShuffle{x,0,1}.0)
Obviously, if my hessian is singular, it can't be inverted. But the weights are random, so how come it's always singular? I don't know much about matrix computation, but it seems to me that there must be a flaw in my process. Here's the relevant part of my code:
def gradient_descent(error, weights, w_flat, learning_rate=0.1):
    # decrease flattened weights by the product of the inverse Hessian and the gradient
    hess = T.hessian(cost=error, wrt=[w_flat])
    grad = T.grad(error, wrt=[w_flat])
    mult = T.nlinalg.matrix_inverse(hess) * grad
    update = w_flat - learning_rate * mult
    return T.flatten(update)
#initialize weight matrices
w_h_flat = theano.shared(np.array(np.random.randn(11,6), dtype=theano.config.floatX).flatten())
w_hidden = w_h_flat.reshape((11,6))
w_o_flat = theano.shared(np.array(np.random.randn(7,1), dtype=theano.config.floatX).flatten())
w_output = w_o_flat.reshape((7,1))
# ... other mlp stuff, define the cost, etc ...
train = theano.function(inputs=[x, y], outputs=cost, updates=[(w_h_flat, gradient_descent(cost, w_hidden, w_h_flat)), (w_o_flat, gradient_descent(cost, w_output, w_o_flat))])
Would really appreciate any pointers as to what I'm doing wrong.
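One general point that may be worth checking, offered as a sketch rather than a confirmed diagnosis of the error above: a Newton step is the matrix-vector product of the inverse Hessian with the gradient, whereas `*` on tensors multiplies elementwise. The NumPy example below uses a hypothetical 2-D quadratic cost (not the MLP from the question) and solves the linear system with np.linalg.solve instead of forming the inverse explicitly.

```python
import numpy as np

# Hypothetical quadratic cost: f(w) = 0.5 * w^T Q w - b^T w
Q = np.array([[3.0, 1.0], [1.0, 2.0]])  # symmetric positive definite
b = np.array([1.0, -1.0])

def grad(w):
    return Q @ w - b

def hessian(w):
    return Q  # constant for a quadratic cost

def newton_step(w, learning_rate=1.0):
    # Solve H d = g for the step direction d instead of inverting H.
    d = np.linalg.solve(hessian(w), grad(w))
    return w - learning_rate * d

w0 = np.array([5.0, -7.0])
w1 = newton_step(w0)
print(np.allclose(grad(w1), 0.0))
```

On a quadratic cost a full Newton step lands on the minimizer directly, which makes this a convenient test case before moving to a real network.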