PyTorch: Calculate Hessian only for a subset of parameters? - python

I am implementing Elastic Weight Consolidation, and for that I need to compute the Fisher matrix. As I understand it, the Fisher matrix is just the Hessian of the likelihood with respect to the weights of the neural network. There is a convenient function for this: torch.autograd.functional.hessian(func, inputs, create_graph=False, strict=False)
So I want to compute hessian(loss, weights) where loss = torch.nn.CrossEntropyLoss(). I also flattened the weights of the network into one long 1D tensor, so that I can simply take the diagonal elements of the Hessian, like this:
def flat_param(model_param = yann_lecun.parameters()):
    ans_data = []
    ans_data = torch.tensor(ans_data, requires_grad=True)
    ans_data = ans_data.to(device)
    for p in model_param:
        temp_data = p.data.flatten()
        ans_data = torch.cat((ans_data,temp_data))
    return ans_data
ans = flat_param(yann_lecun.parameters())
Then I tried hessian(loss, inputs=ans), but the problem is that loss also takes targets, and I don't want to compute the Hessian with respect to them. The task is MNIST classification, so the targets are integers 0,...,9.
If I add y_train to the inputs, as in hessian(loss, inputs=(ans, y_train_01)),
it crashes with the message "can't take gradient from integer". I also tried setting y_train_01.requires_grad = False, but it didn't help. I understand that loss also depends on y_train_01, but is there any way to declare that the targets are constants in my case?

You can create a new 'wrapper' function where the targets are fixed:
def original_model(features, targets):
    ...
    return ...
def feature_differentiable_model(features):
    fixed_targets = ...
    return original_model(features, fixed_targets)
And then call:
hessian(feature_differentiable_model, features_vals)
The second-order partial derivatives obtained this way are the same as the corresponding entries of the full Hessian evaluated at the point (features_vals, fixed_targets).
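As a rough illustration of the wrapper idea applied to the question's setting, here is a minimal sketch. It assumes a hypothetical linear classifier on random data (the names x, y, loss_from_flat and the sizes are made up for the example, not taken from the question). The function handed to hessian has to compute the loss from the flat tensor it receives, which is why the weights are rebuilt from flat_w inside it; the integer targets are simply captured from the enclosing scope and never passed as inputs.

```python
import torch
from torch.autograd.functional import hessian

torch.manual_seed(0)

# Hypothetical toy problem: 4 input features, 3 classes, 8 samples.
n_features, n_classes = 4, 3
x = torch.randn(8, n_features)
y = torch.randint(0, n_classes, (8,))  # integer labels, never differentiated

def loss_from_flat(flat_w):
    # Rebuild weight matrix and bias from the flat vector so that the
    # loss is connected to flat_w in the autograd graph.
    W = flat_w[: n_features * n_classes].reshape(n_classes, n_features)
    b = flat_w[n_features * n_classes:]
    logits = x @ W.T + b
    # x and y come from the enclosing scope and act as constants.
    return torch.nn.functional.cross_entropy(logits, y)

flat_w = torch.randn(n_features * n_classes + n_classes)
H = hessian(loss_from_flat, flat_w)  # square matrix over all parameters
diag = torch.diagonal(H)             # per-parameter second derivatives
```

For a real network the same pattern would need a way to run the model from a flat parameter vector, which this sketch sidesteps by using a single linear layer.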

Related

Backpropagation for cross entropy and softmax?

I'm trying to implement the backward pass for an NN with a final layer of Softmax and loss function of Cross Entropy. I'm following the notes in this article (particularly the "Matrix Multiplication" section).
I'd first like to make sure I'm calculating the derivative of the error with respect to the final outputs correctly. I'm working on the MNIST classification problem, and so y represents a one-hot encoding of the target and y_hat is my predicted probabilities.
def cross_entropy(y, y_hat):
    value = np.log2(np.sum(y*y_hat))
    return value
def d_cross_entropy(y, y_hat):
    return -y/y_hat*np.log(2)
I'm a lot more confused about getting the gradient of Softmax. If we say that A = Softmax(Wx+b), then taking the gradient of A with respect to X is more difficult because Ai does not just depend on Xi but on all elements of the X vector. This means that rather than getting a simple 10-dimensional dA/dX term, I get a 10x10 matrix, which throws off the matrix multiplication. I tried taking the sum to reduce this to a 10-dimensional vector, but this seems incorrect:
def softmax(x):
    exp = np.exp(x)
    return exp/np.sum(exp)
def d_softmax(x):
    softmax_x = softmax(x)
    jacobian = np.outer(softmax_x, -softmax_x)
    adj = np.eye(x.shape[0])*softmax_x
    jacobian += adj
    return jacobian.reshape((x.shape[0], x.shape[0])).sum()
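One way to see how the 10x10 Jacobian fits into the chain rule is to multiply it against the upstream gradient instead of summing it. The sketch below is only an illustration under stated assumptions: it uses the natural-log cross-entropy L = -sum(y * log(a)) rather than the log2 variant above, and the names z, a and softmax_jacobian are made up for the example. With those assumptions the product collapses to the familiar a - y.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))  # shift for numerical stability
    return e / e.sum()

def softmax_jacobian(z):
    # J[i, j] = d a_i / d z_j = a_i * (delta_ij - a_j)
    a = softmax(z)
    return np.diag(a) - np.outer(a, a)

rng = np.random.default_rng(0)
z = rng.normal(size=10)   # pre-softmax activations (Wx + b)
y = np.zeros(10)
y[3] = 1.0                # one-hot target

a = softmax(z)
dL_da = -y / a                          # gradient of -sum(y * log(a)) w.r.t. a
dL_dz = softmax_jacobian(z).T @ dL_da   # chain rule: 10x10 matrix times 10-vector
shortcut = a - y                        # closed form of the same product
```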

What is the alternative of elementwise_grad of autograd in JAX?

I want to solve a second-order differential equation with a neural network. For automatic differentiation I am using the JAX library. To compute the first- and second-order derivatives of my target variable 'u', i.e. du/dx and d2u/dx2, an example I found uses elementwise_grad. What is its alternative in JAX?
For example, the neural network function that evaluates 'u' is defined as below:
def u(params, inputs):
    for Weights, biases in params:
        outputs = np.dot(inputs, Weights) + biases
        inputs = sigmoid(outputs)
    return outputs
u has two arguments: params is the set of weights and biases, and inputs is the range of x values with respect to which u will be differentiated.
Suppose x has a length of 50, so the output u will also have size 50*1.
Now I have to differentiate all 50 values of u at once. Which JAX functions should I use to calculate du/dx and d2u/dx2? The grad function is not working:
dudx = grad(u,1)(x)
d2udx2 = grad(grad(u,1)(x))(x)
These give errors.
This isn't really a function that has a meaningful elementwise gradient. It's mapping one vector space to another vector space, and the appropriate derivative for this kind of operation is a jacobian:
dudx = jax.jacobian(u, 1)(params, x)
The result is a matrix whose entries are the derivative of the ith output with respect to the jth input.
Note that if you had a truly element-wise function and wanted to compute the element-wise gradient, you could do so with vmap; for example:
def f(x):
    return jnp.exp(x) - 1
df_dx = jax.vmap(jax.grad(f))(x)
That doesn't work for your function, because the mapping to the output vector space is determined by the contents of params, and vmap cannot easily account for that.
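To make the relationship between the two approaches concrete, here is a small sketch for the truly element-wise case only. It reuses f from above; the sample points x are an assumption chosen for the example. For such a function the Jacobian is diagonal, and its diagonal matches what vmap of grad returns. Nesting grad gives the second derivative in the same way.

```python
import jax
import jax.numpy as jnp

def f(x):
    return jnp.exp(x) - 1

x = jnp.linspace(0.0, 1.0, 5)  # hypothetical sample points

# Element-wise first and second derivatives: grad acts on a scalar,
# vmap applies it to every entry of x.
df_dx = jax.vmap(jax.grad(f))(x)
d2f_dx2 = jax.vmap(jax.grad(jax.grad(f)))(x)

# Full Jacobian of the vector-to-vector map; off-diagonal entries are
# zero because output i depends only on input i.
J = jax.jacobian(f)(x)
```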

Efficient batch derivative operations in PyTorch

I am using Pytorch to implement a neural network that has (say) 5 inputs and 2 outputs
class myNetwork(nn.Module):
    def __init__(self):
        super(myNetwork,self).__init__()
        self.layer1 = nn.Linear(5,32)
        self.layer2 = nn.Linear(32,2)
    def forward(self,x):
        x = torch.relu(self.layer1(x))
        x = self.layer2(x)
        return x
Obviously, I can feed this an (N x 5) Tensor and get an (N x 2) result,
net = myNetwork()
nbatch = 100
inp = torch.rand([nbatch,5])
inp.requires_grad = True
out = net(inp)
I would now like to compute the derivatives of the NN output with respect to one element of the input vector (let's say the 5th element), for each example in the batch. I know I can calculate the derivatives of one element of the output with respect to all inputs using torch.autograd.grad, and I could use this as follows:
deriv = torch.zeros([nbatch,2])
for i in range(nbatch):
    for j in range(2):
        deriv[i,j] = torch.autograd.grad(out[i,j],inp,retain_graph=True)[0][i,4]
However, this seems very inefficient: it calculates the gradient of out[i,j] with respect to every single element in the batch, and then discards all except one. Is there a better way to do this?
Because of how backpropagation works, computing the gradient with respect to only a single input would not save much: you would only save some work in the first layer, and all later layers need to be backpropagated through either way.
So this may not be the optimal way, but it doesn't actually create much overhead, especially if your network has many layers.
By the way, is there a reason that you need to loop over nbatch? If you wanted the gradient of each element of a batch w.r.t. a parameter, I could understand that, because PyTorch will lump them together, but you seem to be solely interested in the input...
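The loop over nbatch can be avoided with a sketch like the following. It relies on an assumption that holds for the network in the question but not in general: each output row depends only on the matching input row (no cross-sample layers such as batch normalization in training mode). Under that assumption, the gradient of a column sum with respect to inp already contains every per-sample derivative. The nn.Sequential model here is a stand-in with the same layer sizes.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-in for the question's network: 5 inputs, 2 outputs.
net = nn.Sequential(nn.Linear(5, 32), nn.ReLU(), nn.Linear(32, 2))

nbatch = 100
inp = torch.rand(nbatch, 5, requires_grad=True)
out = net(inp)

deriv = torch.zeros(nbatch, 2)
for j in range(2):
    # Row i of out depends only on row i of inp, so row i of the gradient
    # of the column sum equals the gradient of out[i, j] w.r.t. inp[i].
    (g,) = torch.autograd.grad(out[:, j].sum(), inp, retain_graph=True)
    deriv[:, j] = g[:, 4]  # derivative w.r.t. the 5th input element
```

This needs one backward pass per output column rather than one per sample and column.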

Keras custom loss function with samples from complete input dataset

I am trying to devise a custom loss function for a variational autoencoder in Keras with two parts: a reconstruction loss and a divergence loss. However, instead of using the Gaussian distribution for the divergence loss, I want to sample randomly from the input and then compute the divergence loss based on the sampled inputs. However, I do not know how to sample inputs from the complete dataset and then compute a loss with respect to them. The encoder model is:
x_input = Input((input_size,))
enc1 = Dense(encoder_size[0], activation='relu')(x_input)
drop = Dropout(keep_prob)(enc1)
enc2 = Dense(encoder_size[1], activation='relu')(drop)
drop = Dropout(keep_prob)(enc2)
mu = Dense(latent_dim, activation='linear', name='encoder_mean')(drop)
encoder = Model(x_input,mu)
The structure of loss should be:
# the input is the placeholder for the complete input
def loss(x, y, input):
    reconstruction_loss = mean_squared_error(x, y)
    sample_num = 100
    sample_input = sample_from_input(input, sample_num)
    sample_encoded = encoder.predict(sample_input)  # <-- this would not work with placeholder
    sample_prior = gaussian(mean=0, std=1)
    # perform KL divergence between sample_encoded and sample_prior
I have not found anything similar. It would be great if somebody could point me in the right direction.
There are a couple of problems in your code. First, a custom loss function in Keras is expected to take only two parameters, y_true and y_pred, so you cannot pass input explicitly as a third parameter. If you want to pass additional parameters, you have to use a nested function (a closure).
Second, you cannot pass TensorFlow placeholders to predict; it needs NumPy arrays. So I would recommend rewriting sample_from_input so that it samples from a set of input file paths, reads the files, and returns a NumPy array of their data. For the input_data parameter, pass the file paths where your data is stored.
I have included only the relevant parts of the code.
def custom_loss(input_data):
    def loss(y_true, y_pred):
        reconstruction_loss = mean_squared_error(y_true, y_pred)
        sample_num = 100
        sample_input = sample_from_input(input_data)
        # sample_input is a Numpy array
        sample_encoded = encoder.predict(sample_input)
        sample_prior = gaussian(mean=0, std=1)
        # perform KL divergence between sample_encoded and sample_prior
        divergence_loss = ...  # Your logic returning a numeric value
        return reconstruction_loss + divergence_loss
    return loss
encoder.compile(optimizer='adam', loss=custom_loss('<<input_data_path>>'))

Hessian matrix always singular in Theano

I'm trying to implement training an MLP with Newton's method in Theano. I've managed to find out how to obtain the inverse hessian (it's pretty easy, just needed to remove some asserts in the nlinalg module). However, when I try to update my weight vectors with my gradient_descent function, I get the following error:
numpy.linalg.linalg.LinAlgError: Singular matrix
Apply node that caused the error: MatrixInverse(InplaceDimShuffle{x,0,1}.0)
Obviously, if my hessian is singular, it can't be inverted. But the weights are random, so how come it's always singular? I don't know much about matrix computation, but it seems to me that there must be a flaw in my process. Here's the relevant part of my code:
def gradient_descent(error, weights, w_flat, learning_rate=0.1):
    # decrease flattened weights by the product of the inverse Hessian and the gradient
    hess = T.hessian(cost=error, wrt=[w_flat])
    grad = T.grad(error, wrt=[w_flat])
    mult = T.nlinalg.matrix_inverse(hess) * grad
    update = w_flat - learning_rate * mult
    return T.flatten(update)
#initialize weight matrices
w_h_flat = theano.shared(np.array(np.random.randn(11,6), dtype=theano.config.floatX).flatten())
w_hidden = w_h_flat.reshape((11,6))
w_o_flat = theano.shared(np.array(np.random.randn(7,1), dtype=theano.config.floatX).flatten())
w_output = w_o_flat.reshape((7,1))
# ... other mlp stuff, define the cost, etc ...
train = theano.function(inputs=[x, y], outputs=cost, updates=[(w_h_flat, gradient_descent(cost, w_hidden, w_h_flat)), (w_o_flat, gradient_descent(cost, w_output, w_o_flat))])
Would really appreciate any pointers as to what I'm doing wrong.
