Hessian matrix always singular in Theano - python

I'm trying to train an MLP with Newton's method in Theano. I've managed to find out how to obtain the inverse Hessian (it's pretty easy, I just needed to remove some asserts in the nlinalg module). However, when I try to update my weight vectors with my gradient_descent function, I get the following error:
numpy.linalg.linalg.LinAlgError: Singular matrix
Apply node that caused the error: MatrixInverse(InplaceDimShuffle{x,0,1}.0)
Obviously, if my Hessian is singular, it can't be inverted. But the weights are random, so how come it's always singular? I don't know much about matrix computation, but it seems to me that there must be a flaw in my process. Here's the relevant part of my code:
def gradient_descent(error, weights, w_flat, learning_rate=0.1):
    # decrease flattened weights by the product of the inverse Hessian and the gradient
    hess = T.hessian(cost=error, wrt=[w_flat])
    grad = T.grad(error, wrt=[w_flat])
    mult = T.nlinalg.matrix_inverse(hess) * grad
    update = w_flat - learning_rate * mult
    return T.flatten(update)
#initialize weight matrices
w_h_flat = theano.shared(np.array(np.random.randn(11,6), dtype=theano.config.floatX).flatten())
w_hidden = w_h_flat.reshape((11,6))
w_o_flat = theano.shared(np.array(np.random.randn(7,1), dtype=theano.config.floatX).flatten())
w_output = w_o_flat.reshape((7,1))
# ... other mlp stuff, define the cost, etc ...
train = theano.function(inputs=[x, y], outputs=cost, updates=[(w_h_flat, gradient_descent(cost, w_hidden, w_h_flat)), (w_o_flat, gradient_descent(cost, w_output, w_o_flat))])
Would really appreciate any pointers as to what I'm doing wrong.
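For comparison, here is a minimal NumPy sketch of a single Newton step, not a Theano fix. It assumes the gradient is a plain vector and the Hessian a plain 2-D matrix, and it uses a linear solve (a matrix-vector operation) rather than an explicit inverse; note that `*` in the code above is an elementwise product, and that the traceback shows `MatrixInverse` receiving a `DimShuffle{x,0,1}` result, which may indicate an extra leading axis coming from `wrt=[w_flat]`. The `damping` parameter and the toy quadratic are hypothetical additions for illustration only.

```python
import numpy as np

def newton_step(grad, hess, w, learning_rate=1.0, damping=0.0):
    """Return w - learning_rate * H^{-1} g using a linear solve.

    grad: gradient vector, shape (n,)
    hess: Hessian matrix, shape (n, n)
    damping: hypothetical ridge term added to the diagonal so that a
             singular or badly conditioned Hessian can still be solved.
    """
    n = w.shape[0]
    h = hess + damping * np.eye(n)
    # Matrix-vector solve, not an elementwise product with the inverse.
    direction = np.linalg.solve(h, grad)
    return w - learning_rate * direction

# Toy quadratic: f(w) = 0.5 * w^T A w - b^T w, so grad = A w - b and hess = A.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
w0 = np.zeros(2)
w1 = newton_step(A @ w0 - b, A, w0)
```

On a quadratic, one full step (learning_rate=1.0) lands on the minimiser, so `A @ w1` matches `b`. With a singular Hessian, a small positive `damping` keeps the solve well defined, at the cost of no longer being an exact Newton step.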

Related

Backpropagation for cross entropy and softmax?

I'm trying to implement the backward pass for an NN with a final layer of Softmax and loss function of Cross Entropy. I'm following the notes in this article (particularly the "Matrix Multiplication" section).
I'd first like to make sure I'm calculating the derivative of the error with respect to the final outputs correctly. I'm working on the MNIST classification problem, and so y represents a one-hot encoding of the target and y_hat is my predicted probabilities.
def cross_entropy(y, y_hat):
    value = np.log2(np.sum(y*y_hat))
    return value
def d_cross_entropy(y, y_hat):
    return -y/y_hat*np.log(2)
I'm a lot more confused on getting the gradient of Softmax. If we say that A = Softmax(Wx+b), then taking the gradient of A with respect to X is more difficult because Ai does not just depend on Xi but on all elements of the X vector. This means that rather than getting a simple 10-dimensional dA/dX term, I get a 10x10 matrix which throws off the matrix multiplication. I tried taking the sum to reduce this to a 10-dimensional vector, but this seems incorrect
def softmax(x):
    exp = np.exp(x)
    return exp/np.sum(exp)
def d_softmax(x):
    softmax_x = softmax(x)
    jacobian = np.outer(softmax_x, -softmax_x)
    adj = np.eye(x.shape[0])*softmax_x
    jacobian += adj
    return jacobian.reshape((x.shape[0], x.shape[0])).sum()
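One way to see why summing the Jacobian loses information is to apply the chain rule explicitly. The sketch below is an illustration under stated assumptions, not the article's method: it uses the natural-log cross entropy L = -sum(y * log(A)) (the code above uses log base 2, which only rescales the result by a constant), and the helper names are made up for this example. It contracts the full Jacobian with dL/dA, which gives a vector of the same length as the logits.

```python
import numpy as np

def softmax(z):
    # Subtract the max for numerical stability; the result is unchanged.
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

def softmax_jacobian(z):
    # J[i, j] = dA_i / dz_j = A_i * (delta_ij - A_j)
    a = softmax(z)
    return np.diag(a) - np.outer(a, a)

def cross_entropy_grad_wrt_logits(z, y):
    # Natural-log cross entropy L = -sum(y * log(A)), so dL/dA = -y / A.
    a = softmax(z)
    dL_dA = -y / a
    # Chain rule: contract the Jacobian with dL/dA instead of summing it.
    return softmax_jacobian(z).T @ dL_dA

z = np.array([1.0, 2.0, 0.5])
y = np.array([0.0, 1.0, 0.0])
g = cross_entropy_grad_wrt_logits(z, y)
```

For a one-hot `y`, the contraction simplifies to `softmax(z) - y`, so the full matrix never has to be formed in practice. Each row of the Jacobian sums to zero, which is why reducing it with a plain sum cannot give a useful gradient.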

Optimizing conditional multiclass softmax objective function in XGBoost

I have successfully implemented a custom multiclass softmax objective function in XGBoost based on this tutorial. The reason for the customization is that the classes I want to predict are conditional on some data inputs, i.e. of the 24 possible classes being predicted, only a certain subset is valid. valid_transitions are lists of indices corresponding to the classes we want to make predictions on, and invalid_transitions are the complementary sets of indices.
I have implemented .fit() and .predict_proba() such that they take valid_transitions and invalid_transitions as arguments, which tell softprob_obj() and softmax() which classes to null out during training and prediction.
def softmax(x, valid_transitions, invalid_transitions):
    for i in range(len(x)):
        e = np.exp(x[i,valid_transitions[i]])
        x[i, valid_transitions[i]] = e/np.sum(e)
        x[i, invalid_transitions[i]] = 0
    return x
def softprob_obj(labels, predt, data, valid_transitions, invalid_transitions):
    '''Loss function. Computing the gradient and approximated hessian (diagonal).
    Reimplements the `multi:softprob` inside XGBoost.
    '''
    kRows = len(data)
    kClasses = len(np.unique(labels))
    # The prediction is of shape (rows, classes), each element in a row
    # represents a raw prediction (leaf weight, hasn't gone through softmax
    # yet). In XGBoost 1.0.0, the prediction is transformed by a softmax
    # function, fixed in later versions.
    assert predt.shape == (kRows, kClasses)
    eps = 1e-6
    # compute the gradient and hessian, slow iterations in Python, only
    # suitable for demo. Also the one in native XGBoost core is more robust to
    # numeric overflow as we don't do anything to mitigate the `exp` in
    # `softmax` here.
    probs = softmax(predt, valid_transitions, invalid_transitions)
    labels = labels.astype(int)
    hess = np.maximum((2.0 * probs * (1.0 - probs)), eps)
    probs[np.arange(len(probs)),labels] -= 1
    # Right now (XGBoost 1.0.0), reshaping is necessary
    grad = probs.reshape((kRows * kClasses, 1))
    hess = hess.reshape((kRows * kClasses, 1))
    return grad, hess
This works, but is pretty slow in training, presumably because the core xgboost functions I'm replacing are not written in python. I made some attempts to try to vectorize the calculation in numpy to avoid the for loop in softmax(), but ran into some issues with the ragged arrays that valid_transition and invalid_transition create. Was wondering if anyone had any ideas on how to optimize this within python. Thanks!
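One possible way around the ragged arrays is to convert them once into a dense boolean mask of shape (rows, classes) and run the softmax on whole arrays. The sketch below assumes every row has at least one valid class; `to_mask` and `masked_softmax` are hypothetical helper names, not part of XGBoost. Unlike the loop version above, it returns a new array instead of overwriting `x`.

```python
import numpy as np

def to_mask(valid_transitions, n_classes):
    # Build a (rows, classes) boolean mask from ragged lists of valid indices.
    mask = np.zeros((len(valid_transitions), n_classes), dtype=bool)
    for i, idx in enumerate(valid_transitions):
        mask[i, idx] = True
    return mask

def masked_softmax(x, mask):
    # Invalid classes get -inf so that exp() maps them to exactly 0.
    z = np.where(mask, x, -np.inf)
    # Subtract the row-wise max to stabilise exp (assumes one valid class per row).
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

x = np.array([[1.0, 2.0, 3.0], [0.5, 0.1, -1.0]])
valid = [[0, 2], [1]]
mask = to_mask(valid, n_classes=3)
probs = masked_softmax(x, mask)
```

The mask still takes one Python loop to build, but it only depends on the data, so it can be computed once before training and reused in every call to the objective.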

What is the alternative of elementwise_grad of autograd in JAX?

I want to solve a second-order differential equation with a neural network. For automatic differentiation I am using the JAX library. In an example, elementwise_grad was used to compute the first- and second-order derivatives of my target variable 'u', i.e. du/dx and d2u/dx2. What is its alternative in JAX?
For example, the neural network function that evaluates 'u' is defined as below:
'''
def u(params, inputs):
    for Weights, biases in params:
        outputs = np.dot(inputs, Weights) + biases
        inputs = sigmoid(outputs)
    return outputs
'''
u has two arguments: params is the set of weights and biases, and inputs is the range of x values with respect to which u will be differentiated.
Suppose x has a length of 50, so the output u will also have size 50*1.
Now I have to differentiate all 50 values of u at once. Which JAX functions should I use to calculate du/dx and d2u/dx2? The grad function is not working:
dudx = grad(u,1)(x)
d2udx2 = grad(grad(u,1)(x))(x)
These are giving some errors
This isn't really a function that has a meaningful elementwise gradient. It's mapping one vector space to another vector space, and the appropriate derivative for this kind of operation is a jacobian:
dudx = jax.jacobian(u, 1)(params, x)
The result is a matrix whose entries are the derivative of the ith output with respect to the jth input.
Note that if you had a truly element-wise function and wanted to compute the element-wise gradient, you could do so with vmap; for example:
def f(x):
    return jnp.exp(x) - 1
df_dx = jax.vmap(jax.grad(f))(x)
That doesn't work for your function, because the mapping to the output vector space is determined by the contents of params, and vmap cannot easily account for that.
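To make the structure of that Jacobian concrete, here is a small numerical illustration in plain NumPy using central finite differences; it is not JAX code and not a replacement for `jax.jacobian`. It assumes a network shaped like the one in the question with a (n, 1) input, and the layer sizes, random weights, and `numerical_jacobian` helper are made up for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def u(params, inputs):
    # Same structure as the network in the question; inputs has shape (n, 1).
    for weights, biases in params:
        outputs = np.dot(inputs, weights) + biases
        inputs = sigmoid(outputs)
    return outputs

def numerical_jacobian(params, x, h=1e-5):
    # J[i, j] approximates d u_i / d x_j by central differences.
    n = x.shape[0]
    jac = np.zeros((n, n))
    for j in range(n):
        step = np.zeros_like(x)
        step[j, 0] = h
        diff = u(params, x + step) - u(params, x - step)
        jac[:, j] = diff[:, 0] / (2 * h)
    return jac

rng = np.random.default_rng(0)
params = [(rng.normal(size=(1, 4)), rng.normal(size=4)),
          (rng.normal(size=(4, 1)), rng.normal(size=1))]
x = np.linspace(-1.0, 1.0, 5).reshape(-1, 1)
jac = numerical_jacobian(params, x)
dudx = np.diag(jac)  # the element-wise derivative du_i/dx_i
```

Because each row of the input is pushed through the network independently, the off-diagonal entries are zero here, and the diagonal of the Jacobian holds the per-point derivatives du_i/dx_i.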

PyTorch: Calculate Hessian only for a subset of parameters?

I am implementing the Elastic Weight Consolidation method, and for that I need to compute the Fisher matrix. As I understand it, the Fisher matrix is just the Hessian of the likelihood with respect to the weights of the neural network. There is a convenient function for this: torch.autograd.functional.hessian(func, inputs, create_graph=False, strict=False)
So I want to compute hessian(loss, weights), where loss = torch.nn.CrossEntropyLoss(). I also flattened the weights of the network into one long 1D tensor, so that I can simply take the diagonal elements of the Hessian, like this:
def flat_param(model_param = yann_lecun.parameters()):
    ans_data = []
    ans_data = torch.tensor(ans_data, requires_grad=True)
    ans_data = ans_data.to(device)
    for p in model_param:
        temp_data = p.data.flatten()
        ans_data = torch.cat((ans_data,temp_data))
    return ans_data
ans = flat_param(yann_lecun.parameters())
Then I tried hessian(loss, inputs=ans), but the problem is that loss also takes the targets, and I don't want to compute the Hessian with respect to them. The task is MNIST classification, so the targets are integers 0,...,9.
If I add y_train to the inputs, as in hessian(loss, inputs=(ans, y_train_01)),
it crashes with the message "can't take gradient from integer". I also tried setting y_train_01.requires_grad = False, but it didn't help. I understand that loss also depends on y_train_01, but is there any way to declare that the targets are constants in my case?
You can create a new 'wrapper' function where the targets are fixed:
def original_model(features, targets):
    ...
    return ...
def feature_differentiable_model(features):
    fixed_targets = ...
    return original_model(features, fixed_targets)
And then call:
hessian(feature_differentiable_model, features_vals)
The second-order partial derivatives obtained this way equal the corresponding entries of the full Hessian evaluated at (features_vals, fixed_targets).
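The wrapper pattern can be tried out without PyTorch. In the sketch below, a central-difference Hessian in NumPy stands in for `torch.autograd.functional.hessian`, and `functools.partial` plays the role of the wrapper function; the quadratic `loss`, the `numerical_hessian` helper, and the sample values are all hypothetical and chosen only so the result can be checked by hand.

```python
import numpy as np
from functools import partial

def loss(weights, targets):
    # Hypothetical stand-in loss: squared error of a linear score against targets.
    scores = weights[0] * np.arange(len(targets)) + weights[1]
    return np.sum((scores - targets) ** 2)

def numerical_hessian(func, w, h=1e-3):
    # Central-difference Hessian of a scalar function of one vector argument.
    n = w.size
    steps = np.eye(n) * h
    hess = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            ei, ej = steps[i], steps[j]
            hess[i, j] = (func(w + ei + ej) - func(w + ei - ej)
                          - func(w - ei + ej) + func(w - ei - ej)) / (4 * h * h)
    return hess

# Integer targets are bound once and never differentiated.
targets = np.array([0, 1, 2, 3])
loss_of_weights = partial(loss, targets=targets)
hess = numerical_hessian(loss_of_weights, np.array([0.5, -0.2]))
```

The Hessian routine only ever sees a function of the weights, so the integer dtype of the targets is never an issue. The same idea should carry over to a closure or lambda passed as `func` to the PyTorch function.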

matrix determinant differentiation in tensorflow

I am interested in computing the derivative of a matrix determinant using TensorFlow. I can see from experimentation that TensorFlow has not implemented a method of differentiating through a determinant:
LookupError: No gradient defined for operation 'MatrixDeterminant'
(op type: MatrixDeterminant)
A little further investigation revealed that it is actually possible to compute the derivative; see for example Jacobi's formula. I determined that, in order to differentiate through a determinant this way, I need to use the function decorator,
@tf.RegisterGradient("MatrixDeterminant")
def _sub_grad(op, grad):
    ...
However, I am not familiar enough with tensor flow to understand how this can be accomplished. Does anyone have any insight on this matter?
Here's an example where I run into this issue:
x = tf.Variable(tf.ones(shape=[1]))
y = tf.Variable(tf.ones(shape=[1]))
A = tf.reshape(
    tf.pack([tf.sin(x), tf.zeros([1, ]), tf.zeros([1, ]), tf.cos(y)]), (2,2)
)
loss = tf.square(tf.matrix_determinant(A))
optimizer = tf.train.GradientDescentOptimizer(0.001)
train = optimizer.minimize(loss)
init = tf.initialize_all_variables()
sess = tf.Session()
sess.run(init)
for step in xrange(100):
    sess.run(train)
    print sess.run(x)
Please check "Implement Gradient in Python" section here
In particular, you can implement it as follows
@ops.RegisterGradient("MatrixDeterminant")
def _MatrixDeterminantGrad(op, grad):
    """Gradient for MatrixDeterminant. Use formula from 2.2.4 from
    An extended collection of matrix derivative results for forward and reverse
    mode algorithmic differentiation by Mike Giles
    -- http://eprints.maths.ox.ac.uk/1079/1/NA-08-01.pdf
    """
    A = op.inputs[0]
    C = op.outputs[0]
    Ainv = tf.matrix_inverse(A)
    return grad*C*tf.transpose(Ainv)
Then a simple training loop to check that it works:
a0 = np.array([[1,2],[3,4]]).astype(np.float32)
a = tf.Variable(a0)
b = tf.square(tf.matrix_determinant(a))
init_op = tf.initialize_all_variables()
sess = tf.InteractiveSession()
init_op.run()
minimization_steps = 50
learning_rate = 0.001
optimizer = tf.train.GradientDescentOptimizer(learning_rate)
train_op = optimizer.minimize(b)
losses = []
for i in range(minimization_steps):
    train_op.run()
    losses.append(b.eval())
Then you can visualize your loss over time
import matplotlib.pyplot as plt
plt.ylabel("Determinant Squared")
plt.xlabel("Iterations")
plt.plot(losses)
You should see something like this: [plot of "Determinant Squared" against "Iterations"]
I think you are confused about what the derivative of a matrix determinant is.
The determinant is a function calculated from the elements of the matrix by a fixed formula. So if all the elements of the matrix are numbers, the determinant will give you just one number and the derivative will be 0. When some of the elements are variables, you will get an expression in those variables. For example:
x, x^2
1, sin(x)
The determinant will be x*sin(x) - x^2 and the derivative is sin(x) + x*cos(x) - 2x. Jacobi's formula just relates the derivative of the determinant to the adjugate matrix.
In your example your matrix A consists of only numbers and therefore the determinant is just a number and the loss is just a number as well. GradientDescentOptimizer needs to have some free variables to minimize and does not have any because your loss is just a number.
For those who are interested, I found a solution that works for my problem:
@tf.RegisterGradient("MatrixDeterminant")
def _MatrixDeterminant(op, grad):
    """Gradient for MatrixDeterminant."""
    return op.outputs[0] * tf.transpose(tf.matrix_inverse(op.inputs[0]))
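The formula behind these gradient functions can be checked independently of TensorFlow. The following NumPy sketch compares Jacobi's formula for an invertible matrix, d det(A)/dA = det(A) * inv(A)^T, against central finite differences; the helper names are made up for this check. In reverse mode the result is additionally multiplied by the upstream gradient, as the `grad*C*tf.transpose(Ainv)` version above does.

```python
import numpy as np

def det_grad(a):
    # Jacobi's formula for an invertible matrix: d det(A) / dA = det(A) * inv(A)^T
    return np.linalg.det(a) * np.linalg.inv(a).T

def numerical_det_grad(a, h=1e-6):
    # Central differences on each entry of A.
    grad = np.zeros_like(a)
    for i in range(a.shape[0]):
        for j in range(a.shape[1]):
            step = np.zeros_like(a)
            step[i, j] = h
            grad[i, j] = (np.linalg.det(a + step) - np.linalg.det(a - step)) / (2 * h)
    return grad

a = np.array([[1.0, 2.0], [3.0, 4.0]])
analytic = det_grad(a)
numeric = numerical_det_grad(a)
```

For this 2x2 matrix the gradient can also be read off by hand from det(A) = a11*a22 - a12*a21, which gives [[a22, -a21], [-a12, a11]] = [[4, -3], [-2, 1]].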
