Gradient of neural network with respect to inputs - python

I am working on a neural network in PyTorch which simply maps points from the plane to real numbers, for example
model = nn.Sequential(nn.Linear(2,2),nn.ReLU(),nn.Linear(2,1))
Since this network defines a map h: R^2 -> R, what I want to do is compute the gradient of this mapping h in the training loop. So, for example:
for it in range(epochs):
    pred = model(X_train)
    grad = torch.autograd.grad(pred, X_train)
    ....
The training set has been defined as a tensor requiring the gradient. My problem is that even though the output for each fixed point is a scalar, since I am propagating a set of N = 100 points, the output is actually an Nx1 tensor. This leads to the error: autograd can only compute gradients of scalar functions.
In fact, with the small change
pred = torch.sum(model(X_train))
everything works perfectly. However, I am interested in the individual gradients, so is there a way to compute them all together?
Computing the sum as presented above does give exactly the result I expect, of course, but I wanted to know if this is the only possibility.

There are other possibilities, but using .sum() is the simplest way. Calling .sum() on the final output vector and then computing dpred/dinput gives you the desired result. Here is why:
Since pred = sum_i f(x_i), where i indexes the input points,
dpred/dinput is the matrix [dpred/dx_0, dpred/dx_1, ...].
Consider dpred/dx_0: it equals df(x_0)/dx_0, since df(x_i)/dx_0 = 0 for every i != 0. So the rows of this matrix are exactly the individual per-point gradients, and nothing is mixed between points.
PS: Please excuse the crappy mathematical expressions... SO does not support latex/math expressions.
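If you want the per-point gradients without changing your loss, you can also pass a grad_outputs vector of ones to torch.autograd.grad. A minimal sketch based on the setup in the question (the random X_train here is just a placeholder):
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(2, 2), nn.ReLU(), nn.Linear(2, 1))
X_train = torch.randn(100, 2, requires_grad=True)   # placeholder training points

pred = model(X_train)                                # shape (100, 1)
# grad_outputs=ones is equivalent to differentiating pred.sum(); because each
# output row depends only on its own input row, row i of grads is the gradient
# of h at the i-th point.
grads, = torch.autograd.grad(pred, X_train, grad_outputs=torch.ones_like(pred))
print(grads.shape)                                   # torch.Size([100, 2])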

Related

No gradients provided for any variable error

I'm creating a model using the Keras functional API.
The layer architecture is as follows:
n = tf.keras.layers.Dense(1)(input)
for i in tf.range(n):
    output = tf.keras.layers.Dense(4)(input)
I then concatenate the outputs and return a tensor with shape [1, None, 4], where 1 is the batch dimension, None is n, and 4 is the output size of the second dense layer.
My loss function compares the shape of the expected output with the shape of the actual output, and also compares the outputs themselves.
loss = tf.convert_to_tensor(abs(tf.shape(logits)[1] - tf.shape(expected)[1])) * 100.
When running this on a custom training loop, I'm getting the error
ValueError: No gradients provided for any variable: (['while/dense/kernel:0',
'while/dense/bias:0', 'while/while/dense_1/kernel:0', 'while/while/dense_1/bias:0'],).
Provided `grads_and_vars` is ((None, <tf.Variable 'while/dense/kernel:0' shape=(786432, 1)
Shape is not differentiable; you cannot do things like this with gradient-based learning. Problems like this need to be tackled with more powerful tools, e.g. reinforcement learning, where one treats n as an action and gets a policy gradient for it.
A rule of thumb to remember is that you cannot really backprop through discrete objects. You need to produce floats, as gradients require smooth functions. In your case n should be an integer (what would a loop over a float even mean?), so this should be your first warning sign. The other is the shape itself, which is also an integer. A target can be discrete, but not the prediction; note that even in classification we do not output the class, we output a probability, because a probability is smooth.
You could build your model by assuming some maximum number N, treat it more like a classification problem where you supervise N directly, and use some form of masking to keep all the results around, as in the sketch below.
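A rough sketch of that idea, assuming a maximum of N_MAX outputs and a hypothetical input size (neither is from the original post), could look like this:
import tensorflow as tf

N_MAX = 8                                      # assumed upper bound on n
inputs = tf.keras.Input(shape=(16,))           # hypothetical input size
# Predict a distribution over the possible counts 0..N_MAX instead of an
# integer n, so "how many outputs" is supervised like a classification target.
count_logits = tf.keras.layers.Dense(N_MAX + 1)(inputs)
# Always produce N_MAX candidate rows; at loss time, mask out the rows beyond
# the true count instead of changing the tensor's shape.
candidates = tf.keras.layers.Dense(N_MAX * 4)(inputs)
candidates = tf.keras.layers.Reshape((N_MAX, 4))(candidates)
model = tf.keras.Model(inputs, [count_logits, candidates])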

PyTorch: Is retain_graph=True necessary in alternating optimization?

I'm trying to optimize two models in an alternating fashion using PyTorch. The first is a neural network that changes the representation of my data (i.e. a map f(x) on my input data x, parameterized by some weights W). The second is a Gaussian mixture model operating on the f(x) points, i.e. in the neural-network space rather than clustering points in the input space. I am optimizing the GMM using expectation maximization, so the parameter updates are derived analytically rather than by gradient descent.
I have two loss functions here: the first is a function of the distances ||f(x) - f(y)||, and the second is the loss function of the Gaussian mixture model (i.e. how 'clustered' everything looks in the NN representation space). What I want to do is take a step in the NN optimization using both of the above loss functions (since it depends on both), and then do an expectation-maximization step for the GMM. The code looks like this (I have removed a lot, since there is a ton of code):
data, labels = load_dataset()
net = NeuralNetwork()
net_optim = torch.optim.Adam(net.parameters(), lr=0.05, weight_decay=1)
# initialize weights, means, and covariances for the Gaussian clusters
concentrations, means, covariances, precisions = initialization(net.forward_one(data))

for i in range(1000):
    net_optim.zero_grad()
    pairs, pair_labels = pairGenerator(data, labels)  # samples some pairs of datapoints
    outputs = net(pairs[:, 0, :], pairs[:, 1, :])  # computes pairwise distances
    net_loss = NeuralNetworkLoss(outputs, pair_labels)  # loss function based on pairwise dist.

    embedding = net.forward_one(data)  # embeds all data in the NN space
    log_prob, log_likelihoods = expectation_step(embedding, means, precisions, concentrations)
    concentrations, means, covariances, precisions = maximization_step(embedding, log_likelihoods)
    gmm_loss = GMMLoss(log_likelihoods, log_prob, precisions, concentrations)

    net_loss.backward(retain_graph=True)
    gmm_loss.backward(retain_graph=True)
    net_optim.step()
Essentially, this is what is happening:
1. Sample some pairs of points from the dataset
2. Push the pairs of points through the NN and compute the network loss based on those outputs
3. Embed all datapoints using the NN and perform a clustering EM step in that embedding space
4. Compute the variational loss (ELBO) based on the clustering parameters
5. Update the neural network parameters using both the variational loss and the network loss
However, to perform (5), I am required to add the flag retain_graph=True, otherwise I get the error:
RuntimeError: Trying to backward through the graph a second time, but the buffers have already been freed. Specify retain_graph=True when calling backward the first time.
It seems like having two loss functions means that I need to retain the computational graph?
I am not sure how to work around this, as with retain_graph=True each iteration takes ~30 minutes to complete by around iteration 400. Does anyone know how I might fix this? I apologize in advance; I am still very new to automatic differentiation.
I would recommend doing
total_loss = net_loss + gmm_loss
total_loss.backward()
Note that the gradient of net_loss with respect to the GMM parameters is 0, so summing the losses won't have any unwanted effect there.
Here is a good thread on pytorch regarding the retain_graph. https://discuss.pytorch.org/t/what-exactly-does-retain-variables-true-in-loss-backward-do/3508/24
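Applied to the loop from the question (same helper functions as above), that might look roughly like this:
for i in range(1000):
    net_optim.zero_grad()
    pairs, pair_labels = pairGenerator(data, labels)
    outputs = net(pairs[:, 0, :], pairs[:, 1, :])
    net_loss = NeuralNetworkLoss(outputs, pair_labels)

    embedding = net.forward_one(data)
    log_prob, log_likelihoods = expectation_step(embedding, means, precisions, concentrations)
    concentrations, means, covariances, precisions = maximization_step(embedding, log_likelihoods)
    gmm_loss = GMMLoss(log_likelihoods, log_prob, precisions, concentrations)

    # One backward pass through a single combined loss; no retain_graph needed,
    # so the graph buffers are freed at the end of every iteration.
    total_loss = net_loss + gmm_loss
    total_loss.backward()
    net_optim.step()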

Use PyTorch to adjust Tensor matrix values based on numbers I calculate from the Tensors?

I have two tensors (matrices) that I've initialized:
sm=Var(torch.randn(20,1),requires_grad=True)
sm = torch.mm(sm,sm.t())
freq_m=Var(torch.randn(12,20),requires_grad=True)
I am creating two lists from the data inside these 2 matrices, and I am using spearmanr to get a correlation value between these 2 lists. How I am creating the lists is not important, but the goal is to adjust the values inside the matrices so that the calculated correlation value is as close to 1 as possible.
If I were to solve this problem manually, I would tweak values in the matrices by .01 (or some small number) each time and recalculate the lists and correlation score. If the new correlation value is higher than the previous one, I would save the 2 matrices and tweak a different value until I get the 2 matrices that give me the highest correlation score possible.
Is PyTorch capable of doing this automatically? I know PyTorch can adjust values based on an equation, but the way I want to adjust the matrix values is not against an equation; it's against a correlation value that I calculate. Any guidance with this is greatly appreciated!
PyTorch has an autograd package: if you have variables and pass them through differentiable functions to get a scalar result, you can perform gradient descent to update the variables and thereby lower or raise that scalar result.
So what you need to do is to define a function f that works on tensor level such that f(sm, freq_m) will give you the desired correlation.
Then, you should do something like:
lr = 1e-3
for i in range(100):
    # 100 updates
    loss = 1 - f(sm, freq_m)
    print(loss)
    loss.backward()
    with torch.no_grad():
        sm -= lr * sm.grad
        freq_m -= lr * freq_m.grad

        # Manually zero the gradients after updating weights
        sm.grad.zero_()
        freq_m.grad.zero_()
The learning rate is basically the size of the step you take; a learning rate that is too high will cause the loss to explode, and one that is too small will cause slow convergence, so I suggest you experiment.
Edit: to answer the comment on loss.backward(): for any differentiable function f of multiple tensors t1, ..., tn with requires_grad=True, you can calculate the gradient of the loss with respect to each of those tensors. When you call loss.backward(), it calculates those gradients and stores them in t1.grad, ..., tn.grad. You then update t1, ..., tn using gradient descent in order to lower the value of f. This update doesn't need a computational graph, which is why it is wrapped in with torch.no_grad().
At the end of the loop, you zero the gradients because .backward() doesn't overwrite the gradients but rather adds the new gradients to them. More on that here: https://discuss.pytorch.org/t/why-do-we-need-to-set-the-gradients-manually-to-zero-in-pytorch/4903
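For illustration only, here is a minimal sketch of what the differentiable f mentioned above could look like. It uses the Pearson correlation as a smooth stand-in (SciPy's spearmanr works on ranks, so it would break the gradient), and build_lists is a hypothetical placeholder for however the two lists are actually derived from the matrices; it must use only differentiable torch operations:
import torch

def pearson(x, y):
    # Differentiable Pearson correlation between two equal-length 1-D tensors.
    xc = x - x.mean()
    yc = y - y.mean()
    return (xc * yc).sum() / (xc.norm() * yc.norm() + 1e-8)

def f(sm, freq_m):
    # build_lists is a hypothetical stand-in for however the two lists are
    # built; it must return two equal-length 1-D tensors built with torch ops.
    x, y = build_lists(sm, freq_m)
    return pearson(x, y)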

Pytorch: Gradient of output w.r.t parameters

I'm interested in finding the gradient of a neural network's output with respect to its parameters (weights and biases).
More specifically, assume I have the following network structure: [6, 4, 3, 1]. The input sample size is 20. What I'm interested in is finding the gradient of the network output w.r.t. the weights (and biases), of which, if I'm not mistaken, there are 47 in this case. In the literature, this gradient is sometimes known as the weight Jacobian.
I'm using PyTorch version 0.4.0 with Python 3.6 in a Jupyter Notebook.
The Code that I have produced is this:
import numpy.random as npr
import torch
from torch.autograd import Variable

def init_params(layer_sizes, scale=0.1, rs=npr.RandomState(0)):
    return [(rs.randn(insize, outsize) * scale,  # weight matrix
             rs.randn(outsize) * scale)          # bias vector
            for insize, outsize in zip(layer_sizes[:-1], layer_sizes[1:])]

layers = [6, 4, 3, 1]
w = init_params(layers)
first_layer_w = Variable(torch.tensor(w[0][0], requires_grad=True))
first_layer_bias = Variable(torch.tensor(w[0][1], requires_grad=True))
second_layer_w = Variable(torch.tensor(w[1][0], requires_grad=True))
second_layer_bias = Variable(torch.tensor(w[1][1], requires_grad=True))
third_layer_w = Variable(torch.tensor(w[2][0], requires_grad=True))
third_layer_bias = Variable(torch.tensor(w[2][1], requires_grad=True))
X = Variable(torch.tensor(X_batch), requires_grad=True)
output = torch.tanh(torch.mm(torch.tanh(torch.mm(torch.tanh(torch.mm(X, first_layer_w) + first_layer_bias), second_layer_w) + second_layer_bias), third_layer_w) + third_layer_bias)
output.backward()
output.backward()
As is obvious from the code, I'm using the hyperbolic tangent as the nonlinearity. The code produces an output vector of length 20. Now I'm interested in finding the gradient of this output vector w.r.t. all the weights (all 47 of them). I have read the PyTorch documentation here. I have also seen similar questions, for example here. However, I have failed to find the gradient of the output vector w.r.t. the parameters.
If I use the Pytorch function backward(), it generates an error as
RuntimeError: grad can be implicitly created only for scalar outputs
My question is: is there a way to calculate the gradient of the output vector w.r.t. the parameters, which could essentially be represented as a 20x47 matrix, since the output vector has size 20 and the parameter vector has size 47? If so, how? Is there anything wrong in my code? You can take any example of X as long as its dimension is 20x6.
You're trying to compute the Jacobian of a function, while PyTorch expects you to compute vector-Jacobian products. You can see an in-depth discussion of computing Jacobians with PyTorch here.
You have two options. Your first option is to use JAX or autograd and call their jacobian() function. Your second option is to stick with PyTorch and compute 20 vector-Jacobian products, by calling backward(vec) 20 times, where vec is a length-20 one-hot vector whose nonzero index ranges from 0 to 19. If this is confusing, I recommend reading the autodiff cookbook from the JAX tutorials.
The matrix of partial derivatives of a function with respect to its parameters is known as the Jacobian, and can be computed in PyTorch with:
torch.autograd.functional.jacobian(func, inputs)
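As a rough sketch (assuming a newer PyTorch than the 0.4.0 mentioned in the question, since torch.autograd.functional was added later), the full 20x47 weight Jacobian for a [6, 4, 3, 1] tanh network could be assembled like this:
import torch

X = torch.randn(20, 6)                        # any 20x6 input batch

W1, b1 = torch.randn(6, 4), torch.randn(4)
W2, b2 = torch.randn(4, 3), torch.randn(3)
W3, b3 = torch.randn(3, 1), torch.randn(1)

def forward(W1, b1, W2, b2, W3, b3):
    h1 = torch.tanh(X @ W1 + b1)
    h2 = torch.tanh(h1 @ W2 + b2)
    return torch.tanh(h2 @ W3 + b3).squeeze(-1)   # shape (20,)

# One Jacobian per parameter, e.g. shape (20, 6, 4) for W1.
jacs = torch.autograd.functional.jacobian(forward, (W1, b1, W2, b2, W3, b3))
# Flatten each one and concatenate to get the 20 x 47 weight Jacobian.
weight_jacobian = torch.cat([j.reshape(20, -1) for j in jacs], dim=1)
print(weight_jacobian.shape)                   # torch.Size([20, 47])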

Sparse Cross Entropy in Tensorflow

Using tf.nn.sparse_softmax_cross_entropy_with_logits in TensorFlow, it's possible to calculate the loss only for specific rows by setting the class label to -1 (it is otherwise expected to be in the range 0 to num_classes - 1).
Unfortunately this breaks the gradient computations (as is mentioned in the comments in the source nn_ops.py).
What I would like to do is something like the following:
raw_classification_output1 = [0,1,0]
raw_classification_output2 = [0,0,1]
classification_output =tf.concat(0,[raw_classification_output1,raw_classification_output2])
classification_labels = [1,-1]
classification_loss = tf.nn.sparse_softmax_cross_entropy_with_logits(classification_output,classification_labels)
total_loss = tf.reduce_sum(classification_loss) + tf.reduce_sum(other_loss)
optimizer = tf.train.GradientDescentOptimizer(1e-3)
grads_and_vars = optimizer.compute_gradients(total_loss)
changed_grads_and_vars = #do something to 0 the incorrect gradients
optimizer.apply_gradients(changed_grads_and_vars)
What's the most straightforward way to zero those gradients?
The easiest method is to just multiply the classification loss by a similar tensor of ones where the loss is desired and zeros where it isn't. This is made easier by the fact that the loss is already zero where you don't want it to be updated; this is basically just a workaround for the fact that the sparse softmax still shows some weird gradient behavior when the loss is zero.
Add this line after tf.nn.sparse_softmax_cross_entropy_with_logits:
classification_loss_zeroed = tf.mul(classification_loss, tf.to_float(tf.not_equal(classification_loss, 0)))
It should zero out the gradients as well.
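Put together with the snippet from the question, the masked loss would then feed into the total loss like this (a sketch using the same old-style TF API and variable names as above):
classification_loss = tf.nn.sparse_softmax_cross_entropy_with_logits(
    classification_output, classification_labels)
# Keep only the rows whose loss is nonzero, i.e. the rows with a real label.
classification_loss_zeroed = tf.mul(
    classification_loss, tf.to_float(tf.not_equal(classification_loss, 0)))
total_loss = tf.reduce_sum(classification_loss_zeroed) + tf.reduce_sum(other_loss)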
