I'm interested in finding the gradient of a neural network's output with respect to its parameters (weights and biases).
More specifically, assume I have the following network structure: [6, 4, 3, 1]. The input sample size is 20. What I'm interested in is the gradient of the network output w.r.t. the weights (and biases), which, if I'm not mistaken, number 47 in this case (6*4 + 4*3 + 3*1 = 39 weights plus 4 + 3 + 1 = 8 biases). In the literature, this gradient is sometimes known as the weight Jacobian.
I'm using PyTorch version 0.4.0 with Python 3.6 in a Jupyter Notebook.
The code that I have produced is this:
import numpy.random as npr

def init_params(layer_sizes, scale=0.1, rs=npr.RandomState(0)):
    return [(rs.randn(insize, outsize) * scale,  # weight matrix
             rs.randn(outsize) * scale)          # bias vector
            for insize, outsize in zip(layer_sizes[:-1], layer_sizes[1:])]
layers = [6, 4, 3, 1]
w = init_params(layers)
import torch
from torch.autograd import Variable

first_layer_w = Variable(torch.tensor(w[0][0], requires_grad=True))
first_layer_bias = Variable(torch.tensor(w[0][1], requires_grad=True))
second_layer_w = Variable(torch.tensor(w[1][0], requires_grad=True))
second_layer_bias = Variable(torch.tensor(w[1][1], requires_grad=True))
third_layer_w = Variable(torch.tensor(w[2][0], requires_grad=True))
third_layer_bias = Variable(torch.tensor(w[2][1], requires_grad=True))
X = Variable(torch.tensor(X_batch), requires_grad=True)
hidden1 = torch.tanh(torch.mm(X, first_layer_w) + first_layer_bias)
hidden2 = torch.tanh(torch.mm(hidden1, second_layer_w) + second_layer_bias)
output = torch.tanh(torch.mm(hidden2, third_layer_w) + third_layer_bias)
output.backward()
As is obvious from the code, I'm using the hyperbolic tangent as the nonlinearity. The code produces an output vector of length 20. Now I'm interested in finding the gradient of this output vector w.r.t. all the weights (all 47 of them). I have read the PyTorch documentation here, and I have also seen similar questions, for example here. However, I have failed to find the gradient of the output vector w.r.t. the parameters.
If I use the PyTorch function backward(), it generates the following error:
RuntimeError: grad can be implicitly created only for scalar outputs
My question is: is there a way to calculate the gradient of the output vector w.r.t. the parameters, which could essentially be represented as a 20x47 matrix, since the output vector has size 20 and the parameter vector has size 47? If so, how? Is there anything wrong in my code? You can take any example of X as long as its dimension is 20x6.
You're trying to compute the Jacobian of a function, while PyTorch expects you to compute vector-Jacobian products. You can see an in-depth discussion of computing Jacobians with PyTorch here.
You have two options. Your first option is to use JAX or autograd and call their jacobian() function. Your second option is to stick with PyTorch and compute 20 vector-Jacobian products, by calling backward(vec) 20 times, where vec is a length-20 one-hot vector whose nonzero index ranges from 0 to 19. If this is confusing, I recommend reading the autodiff cookbook from the JAX tutorials.
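For illustration, here is a minimal sketch of the second option against the question's network. It assumes output and the six parameter tensors above are already defined, and uses torch.autograd.grad instead of backward() so the 20 per-row gradients don't accumulate into .grad:

import torch

params = [first_layer_w, first_layer_bias,
          second_layer_w, second_layer_bias,
          third_layer_w, third_layer_bias]

jac_rows = []
for i in range(output.numel()):
    vec = torch.zeros_like(output)  # one-hot vector selecting output i
    vec.view(-1)[i] = 1.0
    grads = torch.autograd.grad(output, params,
                                grad_outputs=vec, retain_graph=True)
    # flatten and concatenate the gradients of all 47 parameters
    jac_rows.append(torch.cat([g.reshape(-1) for g in grads]))

jacobian = torch.stack(jac_rows)  # shape: (20, 47)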
The matrix of partial derivatives of a function with respect to its parameters is known as the Jacobian, and can be computed in PyTorch with:
torch.autograd.functional.jacobian(func, inputs)
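A minimal usage sketch, assuming a recent PyTorch release (this API did not exist in the question's 0.4.0); the one-layer func here is a hypothetical stand-in for the full network:

import torch

x = torch.randn(20, 6)  # stand-in for the 20x6 input batch
w = torch.randn(6, 1, requires_grad=True)

def func(w):
    # network output viewed as a function of the parameter of interest
    return torch.tanh(x @ w).squeeze(-1)  # output of shape (20,)

jac = torch.autograd.functional.jacobian(func, w)
print(jac.shape)  # torch.Size([20, 6, 1])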
Related
I was wondering whether there is any way to manually add a gradient for one step in PyTorch while otherwise using autograd. There is one middle step in my loss function that I cannot compute without converting the data out of a tensor, so autograd never sees that component and the gradient doesn't get computed correctly. However, I could compute that gradient manually. How would I go about incorporating it into the gradient graph in PyTorch? All the guides I've found don't use autograd at all (as far as I understand them).
The specific issue I am trying to solve is normalizing a function over some interval. The following example does this for a sum of Gaussians. The tensor m is [[m1, m2, m3, m4, ...]] and represents the means, s represents the standard deviations, and p represents the weights; p, m, and s are all outputs from my model. I want the integral between the lower and higher cutoffs to be 1, which I can get by taking the CDF at the higher cutoff, subtracting the CDF at the lower cutoff, and dividing all the ps by that value. I would then use these new values of p (along with m, s, and a target) to calculate the loss. When I then call loss.backward(), I would like to get the correct gradients, including the part of the gradient that comes from the normalization factor changing as p, m, and s change.
normFactor = 0
for gaussianInd in range(numberGaussians):
    # .cpu().detach() severs these terms from the autograd graph
    mean = m[0][gaussianInd].cpu().detach()
    std = s[0][gaussianInd].cpu().detach() + 1e-6
    normFactor += (spstats.norm.cdf(higherCutoff, mean, std)
                   - spstats.norm.cdf(lowerCutoff, mean, std)) * p[0][gaussianInd]
p = p / normFactor
Edit: Added specific example
Of course, you can always modify the grad attribute of the tensor of interest; your optimizer taps into this attribute to update the corresponding tensor.
For illustration purposes:
>>> p = nn.Linear(10, 1, bias=False)
>>> p.weight
Parameter containing:
tensor([[ 0.3148, -0.2287, 0.1254, -0.1360, 0.2799, -0.0225, -0.3006, -0.0605,
-0.2784, -0.2618]], requires_grad=True)
>>> optim = torch.optim.SGD(p.parameters(), lr=.1)
Manually modify the gradient:
>>> p.weight.grad = torch.rand_like(p.weight)
Update with the optimizer:
>>> optim.step()
The parameter will get updated:
>>> p.weight
Parameter containing:
tensor([[ 0.2514, -0.2555, 0.1026, -0.1881, 0.2529, -0.0497, -0.3750, -0.1489,
-0.3762, -0.2839]], requires_grad=True)
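To combine this with autograd for the rest of the graph, one pattern (a sketch, not the only way) is to let backward() fill .grad first and then add the manually computed component before stepping; here torch.rand_like stands in for whatever gradient you compute by hand:

import torch
import torch.nn as nn

p = nn.Linear(10, 1, bias=False)
optim = torch.optim.SGD(p.parameters(), lr=.1)

x = torch.randn(4, 10)
loss = p(x).pow(2).mean()  # hypothetical autograd-tracked part of the loss
loss.backward()            # autograd fills p.weight.grad

p.weight.grad += torch.rand_like(p.weight)  # add the hand-computed part
optim.step()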
I need to calculate the Aitchison distance as a loss function between input and output datasets.
While calculating this metric I need to compute the geometric mean of each row (where [batches x features] is the size of a dataset during the loss computation).
In the simplest case, we can imagine that there is only one batch, so I just need to calculate one geomean for the input dataset and one for the output dataset.
So how can this be done in TensorFlow? I didn't find any suitable metric or reduce functions.
You can easily calculate the geometric mean of a tensor as a loss function (or, in your case, as part of the loss function) in TensorFlow using the numerically stable formula highlighted here. The provided code fragment closely resembles the PyTorch solution posted here, which follows the above-mentioned formula (and the scipy implementation).
from tensorflow.python.keras import backend as K

def gmean_loss(y_true, y_pred, dim=1):
    # geometric mean via the numerically stable exp(mean(log(x))) formula;
    # note this assumes the entries are positive
    error = y_pred - y_true
    logx = K.log(error)
    return K.exp(K.mean(logx, axis=dim))
You can define dim according to your needs or integrate it into your code.
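As a quick sanity check (a hypothetical example, not part of the original answer): the geometric mean of [1, 2, 4] is 2, and the same exp-mean-log formula written with plain TensorFlow ops reproduces it:

import tensorflow as tf

x = tf.constant([[1.0, 2.0, 4.0]])
gmean = tf.exp(tf.reduce_mean(tf.math.log(x), axis=1))  # per-row geomean
print(gmean)  # [2.]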
I am working on a NN with PyTorch which simply maps points from the plane to real numbers, for example
model = nn.Sequential(nn.Linear(2,2),nn.ReLU(),nn.Linear(2,1))
What I want to do, since this network defines a map h: R^2 -> R, is to compute the gradient of this mapping h in the training loop. So, for example:
for it in range(epochs):
    pred = model(X_train)
    grad = torch.autograd.grad(pred, X_train)
    ...
The training set has been defined as a tensor requiring gradient. My problem is that even though the output for each fixed point is a scalar, since I am propagating a set of N=100 points, the output is actually an Nx1 tensor. This leads to the error: autograd can compute the gradient only of scalar functions.
In fact, with the small change
pred = torch.sum(model(X_train))
everything works perfectly. However, I am interested in all the individual gradients, so is there a way to compute all of these gradients together?
Of course, computing the sum as presented above gives exactly the result I expect, but I wanted to know whether this is the only possibility.
There are other possibilities, but using .sum() is the simplest way. Calling .sum() on the final prediction vector and computing dpred/dinput will give you the desired output. Here is why:
Since pred = sum(f(xi)), where i indexes the inputs x, dpred/dinput will be a matrix [dpred/dx0, dpred/dx1, ...].
Consider dpred/dx0: it equals df(x0)/dx0, since every other df(xi)/dx0 is 0.
PS: Please excuse the crappy mathematical expressions... SO does not support latex/math expressions.
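A minimal sketch of this argument, reusing the model and X_train names from the question (N = 100 points in R^2):

import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(2, 2), nn.ReLU(), nn.Linear(2, 1))
X_train = torch.randn(100, 2, requires_grad=True)

pred = torch.sum(model(X_train))
grad = torch.autograd.grad(pred, X_train)[0]  # shape: (100, 2)
# grad[i] is the gradient of h at point i, because df(xj)/dxi = 0 for j != i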
I have two tensors (matrices) that I've initialized:
from torch.autograd import Variable as Var

sm = Var(torch.randn(20, 1), requires_grad=True)
sm = torch.mm(sm, sm.t())
freq_m = Var(torch.randn(12, 20), requires_grad=True)
I am creating two lists from the data inside these two matrices, and I am using spearmanr to get a correlation value between the two lists. How I create the lists is not important; the goal is to adjust the values inside the matrices so that the calculated correlation value is as close to 1 as possible.
If I were to solve this problem manually, I would tweak matrix values by .01 (or some small number) at a time, recalculate the lists and the correlation score, and save the two matrices whenever the new correlation value is higher than the previous one, tweaking a different value until I find the two matrices that give me the highest correlation score possible.
Is PyTorch capable of doing this automatically? I know PyTorch can adjust based on an equation, but the way I want to adjust the matrix values is not against an equation, it's against a correlation value that I calculate. Any guidance with this is greatly appreciated!
PyTorch has an autograd package; that means that if you have variables and you pass them through differentiable functions to get a scalar result, you can perform gradient descent to update those variables and lower or raise that scalar result.
So what you need to do is define a function f that works at the tensor level, such that f(sm, freq_m) gives you the desired correlation.
Then, you should do something like:
lr = 1e-3
for i in range(100):  # 100 updates
    loss = 1 - f(sm, freq_m)
    print(loss)
    loss.backward()
    with torch.no_grad():
        sm -= lr * sm.grad
        freq_m -= lr * freq_m.grad
        # Manually zero the gradients after updating weights
        sm.grad.zero_()
        freq_m.grad.zero_()
The learning rate is basically the size of the step you take: a learning rate that is too high will cause the loss to explode, and one that is too low will cause slow convergence. I suggest you experiment.
Edit: to answer the comment on loss.backward: for any differentiable function f of multiple tensors t1, ..., tn with requires_grad=True, you can calculate the gradient of the loss with respect to each of those tensors. When you call loss.backward(), it calculates those gradients and stores them in t1.grad, ..., tn.grad. You then update t1, ..., tn using gradient descent to lower the value of f. This update doesn't need a computational graph, which is why you wrap it in torch.no_grad().
At the end of the loop, you zero the gradients because .backward doesn't overwrite the gradients but rather adds the new gradients to them. More on that here: https://discuss.pytorch.org/t/why-do-we-need-to-set-the-gradients-manually-to-zero-in-pytorch/4903
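One caveat: scipy's spearmanr is not differentiable, so f must be built from tensor operations end to end. As a hypothetical stand-in, a plain Pearson correlation over two equal-length 1-D tensors (extracted from sm and freq_m with tensor ops) is differentiable:

import torch

def f(a, b):
    # differentiable Pearson correlation; a stand-in for the rank-based
    # Spearman score, which autograd cannot differentiate through
    a = a.flatten() - a.flatten().mean()
    b = b.flatten() - b.flatten().mean()
    return (a * b).sum() / (a.norm() * b.norm() + 1e-8)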
I'm working in Keras with TensorFlow under the hood. I have a deep neural model (a predictive autoencoder). I'm doing something somewhat similar to this: https://arxiv.org/abs/1612.00796 -- I'm trying to understand the influence of variables in a given layer on the output.
For this I need to find the 2nd derivative (Hessian) of the loss L with respect to the output s of a particular layer: d2L/ds2.
Diagonal entries would be sufficient. L is a scalar, and s is 1 by n.
What I tried first:
dLds = tf.gradients(L, s) # works fine to get first order derivatives
d2Lds2 = tf.gradients(dLds, s) # throws an error
TypeError: Second-order gradient for while loops not supported.
I also tried:
d2Lds2 = tf.hessians(L, s)
ValueError: Computing hessians is currently only supported for one-dimensional tensors. Element number 0 of `xs` has 2 dimensions.
I cannot change the shape of s because it's part of the neural network (an LSTM's state). The first dimension (batch_size) is already set to 1; I don't think I can get rid of it.
I cannot reshape s because it breaks the flow of the gradients, e.g.:
tf.gradients(L, tf.reduce_sum(s, axis=0))
gives:
[None]
Any ideas on what I can do in this situation?
This is not supported at the moment. See this report.