Adding manual gradient to step - python

I was wondering if there is any way to manually add a gradient to a step in PyTorch while otherwise using autograd. There is one intermediate step in my loss function that I cannot compute without converting the data out of a tensor, so autograd never sees that component and the gradient doesn't get computed correctly. However, I could compute that gradient manually. How would I go about incorporating it into the gradient graph in PyTorch? All the guides I've found don't use autograd at all (as far as I understand them).
The specific issue I am trying to solve is normalizing a function over some interval. The following example does this for a sum of Gaussians. The tensor m is [[m1, m2, m3, m4, ...]] and represents the means, s represents the standard deviations, and p represents the weights; p, m, and s are all outputs of my model. I want the integral between the lower and higher cutoffs to be 1, so I take the CDF at the higher cutoff, subtract the CDF at the lower cutoff, and divide all the ps by that value. I would then use these new values of p (along with m, s, and a target) to calculate the loss. When I call loss.backward() I would then get the correct gradients, including the part of the gradient that comes from the normalization factor changing as p, m, and s change.
normFactor = 0
for gaussianInd in range(numberGaussians):
    normFactor += (spstats.norm.cdf(higherCutoff, m[0][gaussianInd].cpu().detach(), s[0][gaussianInd].cpu().detach() + 1e-6)
                   - spstats.norm.cdf(lowerCutoff, m[0][gaussianInd].cpu().detach(), s[0][gaussianInd].cpu().detach() + 1e-6)) * p[0][gaussianInd]
p = p / normFactor
Edit: Added specific example
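(For what it's worth, torch.distributions.Normal has a differentiable cdf, so a version of the snippet above can in principle stay inside autograd entirely. A minimal sketch, assuming m, s, and p have shape (1, numberGaussians) and the cutoffs are plain floats:)
import torch
from torch.distributions import Normal

dist = Normal(m[0], s[0] + 1e-6)
normFactor = ((dist.cdf(torch.tensor(higherCutoff))
               - dist.cdf(torch.tensor(lowerCutoff))) * p[0]).sum()  # scalar, differentiable
p = p / normFactor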

Of course, you can always modify the grad attribute of the tensor of interest; your optimizer will read this attribute to update the corresponding tensor.
For illustration purposes:
>>> p = nn.Linear(10, 1, bias=False)
>>> p.weight
Parameter containing:
tensor([[ 0.3148, -0.2287, 0.1254, -0.1360, 0.2799, -0.0225, -0.3006, -0.0605,
-0.2784, -0.2618]], requires_grad=True)
>>> optim = torch.optim.SGD(p.parameters(), lr=.1)
Manually modify the gradient:
>>> p.weight.grad = torch.rand_like(p.weight)
Update with the optimizer:
>>> optim.step()
The parameter will get updated:
>>> p.weight
Parameter containing:
tensor([[ 0.2514, -0.2555, 0.1026, -0.1881, 0.2529, -0.0497, -0.3750, -0.1489,
-0.3762, -0.2839]], requires_grad=True)
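If you need the manual gradient to flow through the graph itself (rather than be written into .grad by hand), the usual route is a custom torch.autograd.Function: implement forward however you like, and supply your hand-derived gradient in backward. A minimal sketch, assuming elementwise tensors m and s and a fixed cutoff of 1.0 for brevity:
import math
import torch
from scipy import stats as spstats

class TruncFactor(torch.autograd.Function):
    @staticmethod
    def forward(ctx, m, s):
        ctx.save_for_backward(m, s)
        # leave the tensor world, e.g. via scipy
        val = spstats.norm.cdf(1.0, m.detach().cpu().numpy(), s.detach().cpu().numpy())
        return torch.as_tensor(val, dtype=m.dtype)

    @staticmethod
    def backward(ctx, grad_out):
        m, s = ctx.saved_tensors
        # hand-derived gradients of Phi((1 - m) / s) w.r.t. m and s
        z = (1.0 - m) / s
        pdf = torch.exp(-0.5 * z ** 2) / math.sqrt(2 * math.pi)
        return grad_out * (-pdf / s), grad_out * (-pdf * z / s)

factor = TruncFactor.apply(m, s)  # participates in autograd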

Related

Gradient of neural network with respect to inputs

I am working on a NN with Pytorch which simply maps points from the plane into real numbers, for example
model = nn.Sequential(nn.Linear(2,2),nn.ReLU(),nn.Linear(2,1))
What I want to do, since this network defines a map h:R^2->R, is to compute the gradient of this mapping h in the training loop. So for example
for it in range(epochs):
    pred = model(X_train)
    grad = torch.autograd.grad(pred, X_train)
    ....
The training set has been defined as a tensor requiring gradient. My problem is that even though the output for each fixed point is a scalar, since I am propagating a set of N=100 points, the output is actually an Nx1 tensor. This leads to the error: autograd can only compute the gradient of scalar functions.
In fact, trying with the little change
pred = torch.sum(model(X_train))
everything works perfectly. However, I am interested in all the individual gradients, so is there a way to compute them all together?
Actually, computing the sum as above gives exactly the result I expect, of course, but I wanted to know if this is the only possibility.
There are other possibilities, but using .sum() is the simplest way. Calling .sum() on the final loss vector and computing dpred/dinput will give you the desired output. Here is why:
Since pred = sum(loss) = sum(f(x_i)),
where i is the index of input x,
dpred/dinput will be a matrix [dpred/dx_0, dpred/dx_1, dpred/dx_...].
Consider dpred/dx_0: it is equal to df(x_0)/dx_0, since every other df(x_i)/dx_0 is 0.
PS: Please excuse the plain-text math; SO does not support LaTeX/math expressions.
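To make this concrete, here is a minimal sketch of extracting the per-point gradients with the .sum() trick, using the setup from the question:
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(2, 2), nn.ReLU(), nn.Linear(2, 1))
X_train = torch.randn(100, 2, requires_grad=True)

pred = torch.sum(model(X_train))
grad, = torch.autograd.grad(pred, X_train)
# grad has shape (100, 2): row i is the gradient of h at point x_i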

PyTorch: Is retain_graph=True necessary in alternating optimization?

I'm trying to optimize two models in an alternating fashion using PyTorch. The first is a neural network that changes the representation of my data (i.e. a map f(x) on my input data x, parameterized by some weights W). The second is a Gaussian mixture model that operates on the f(x) points, i.e. in the neural network space (rather than clustering points in the input space). I am optimizing the GMM using expectation maximization, so the parameter updates are analytically derived rather than obtained by gradient descent.
I have two loss functions here: the first is a function of the distances ||f(x) - f(y)||, and the second is the loss function of the Gaussian mixture model (ie how 'clustered' everything looks in the NN representation space). What I want to do is take a step in the NN optimization using both of the above loss functions (since it depends on both), and then do an expectation-maximization step for the GMM. The code looks like this (I have removed a lot since there is a ton of code):
data, labels = load_dataset()
net = NeuralNetwork()
net_optim = torch.optim.Adam(net.parameters(), lr=0.05, weight_decay=1)
# initialize weights, means, and covariances for the Gaussian clusters
concentrations, means, covariances, precisions = initialization(net.forward_one(data))

for i in range(1000):
    net_optim.zero_grad()
    pairs, pair_labels = pairGenerator(data, labels)  # samples some pairs of datapoints
    outputs = net(pairs[:, 0, :], pairs[:, 1, :])  # computes pairwise distances
    net_loss = NeuralNetworkLoss(outputs, pair_labels)  # loss function based on pairwise dist.
    embedding = net.forward_one(data)  # embeds all data in the NN space
    log_prob, log_likelihoods = expectation_step(embedding, means, precisions, concentrations)
    concentrations, means, covariances, precisions = maximization_step(embedding, log_likelihoods)
    gmm_loss = GMMLoss(log_likelihoods, log_prob, precisions, concentrations)
    net_loss.backward(retain_graph=True)
    gmm_loss.backward(retain_graph=True)
    net_optim.step()
Essentially, this is what is happening:
Sample some pairs of points from the dataset
Push pairs of points through the NN and compute network loss based on those outputs
Embed all datapoints using the NN and perform a clustering EM step in that embedding space
Compute variational loss (ELBO) based on clustering parameters
Update neural network parameters using both the variational loss and the network loss
However, to perform (5), I am required to add the flag retain_graph=True, otherwise I get the error:
RuntimeError: Trying to backward through the graph a second time, but the buffers have already been freed. Specify retain_graph=True when calling backward the first time.
It seems like having two loss functions means that I need to retain the computational graph?
I am not sure how to work around this, as with retain_graph=True, around iteration 400, each iteration is taking ~30 minutes to complete. Does anyone know how I might fix this? I apologize in advance – I am still very new to automatic differentiation.
I would recommend doing
total_loss = net_loss + gmm_loss
total_loss.backward()
Note that the gradient of net_loss w.r.t. the GMM weights is 0, so summing the losses won't have any unwanted side effect.
Here is a good thread on pytorch regarding the retain_graph. https://discuss.pytorch.org/t/what-exactly-does-retain-variables-true-in-loss-backward-do/3508/24
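In context, the tail of the loop above would then read (a sketch using the question's variable names):
    total_loss = net_loss + gmm_loss
    total_loss.backward()  # single backward pass; no retain_graph needed
    net_optim.step()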

Use PyTorch to adjust Tensor matrix values based on numbers I calculate from the Tensors?

I have two tensors (matrices) that I've initialized:
import torch

sm = torch.randn(20, 1, requires_grad=True)
sm = torch.mm(sm, sm.t())
freq_m = torch.randn(12, 20, requires_grad=True)
I am creating two lists from the data inside these 2 matrices, and I am using spearmanr to get a correlation value between these 2 lists. How I am creating the lists is not important, but the goal is to adjust the values inside the matrices so that the calculated correlation value is as close to 1 as possible.
If I were to solve this problem manually, I would tweak values in the matrices by .01 (or some small number) each time and recalculate the lists and correlation score. If the new correlation value is higher than the previous one, I would save the 2 matrices and tweak a different value until I get the 2 matrices that give me the highest correlation score possible.
Is PyTorch capable of doing this automatically? I know PyTorch can adjust based on an equation but the way I want to adjust the matrix values is not against an equation, it's against a correlation value that I calculate. Any guidance with this is greatly appreciated!
PyTorch has an autograd package: if you take variables and pass them through differentiable functions, arriving at a scalar result, you can perform gradient descent to update the variables so as to lower or raise that scalar.
So what you need to do is to define a function f that works on tensor level such that f(sm, freq_m) will give you the desired correlation.
Then, you should do something like:
lr = 1e-3
for i in range(100):  # 100 updates
    loss = 1 - f(sm, freq_m)
    print(loss)
    loss.backward()
    with torch.no_grad():
        sm -= lr * sm.grad
        freq_m -= lr * freq_m.grad
        # Manually zero the gradients after updating weights
        sm.grad.zero_()
        freq_m.grad.zero_()
The learning rate is basically the size of the step you take: a learning rate that is too high will cause the loss to explode, and one that is too small will cause slow convergence. I suggest you experiment.
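For concreteness, here is one possible shape for f: a Pearson correlation written entirely with tensor operations. (Spearman's rank correlation involves a non-differentiable ranking step, and scipy's spearmanr converts tensors to NumPy, which breaks the graph, so a differentiable stand-in like this is one option; how you build the two 1-D tensors from sm and freq_m is up to you, as in the question.)
import torch

def pearson(a, b):
    # a, b: 1-D tensors of equal length built from sm and freq_m
    a = a - a.mean()
    b = b - b.mean()
    return (a * b).sum() / (a.norm() * b.norm() + 1e-8)

# usage inside the loop above: loss = 1 - pearson(a, b)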
Edit: To answer the comment on loss.backward: a differentiable function f is a function of multiple tensors t1, ..., tn with requires_grad=True; as a result, you can calculate the gradient of the loss with respect to each of those tensors. When you call loss.backward(), it calculates those gradients and stores them in t1.grad, ..., tn.grad. You then update t1, ..., tn by gradient descent in order to lower the value of f. This update doesn't need a computational graph, which is why you use with torch.no_grad().
At the end of the loop, you zero the gradients because .backward doesn't overwrite the gradients but rather adds the new gradients to them. More on that here: https://discuss.pytorch.org/t/why-do-we-need-to-set-the-gradients-manually-to-zero-in-pytorch/4903

Implementing a Gaussian-based loss function in Keras

I'm trying to implement my custom loss function in Keras using the TensorFlow backend. The idea is for the neural network to output coefficients for Gaussians and compare the sum of four Gaussians to the target data. So we're fitting Gaussians to the data. I'd like to have y_pred in the form [a_0, b_0, c_0, a_1, ..., c_3], calculate the sum of a_i * e^(-(x - b_i)^2 / (2 c_i)) for i = 0, 1, 2, 3, and then work out, for example, the mean absolute error comparing this function to y_true. What I tried was
def gauss_loss(y_true, y_pred):
    # zs is the size of y_true
    # the size of y_pred is 12
    xs = np.linspace(0, 1, zs)
    gauss_sum = 0
    for i in range(0, 12, 3):
        gauss_sum += y_pred[:, i] * K.exp(-(xs - y_pred[:, i+1])**2 / (2 * y_pred[:, i+2]))
    return 1. / zs * sum(K.abs(y_true - gauss_sum))
I get the error "TypeError: Tensor objects are not iterable when eager execution is not enabled. To iterate over this tensor use tf.map_fn".
However, I don't think I can use tf.map_fn either because it only accepts one argument so I can't use the first entry of y_pred as coefficient a and the next as b in the same formula.
All examples I find just use tensor operations for the entire matrix. It seems to me that this might not even be possible in Keras. Is this possible and if so, how is it done?
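For reference, the loss described above can be written with whole-tensor operations, avoiding both the Python-level loop over coefficients and the built-in sum that triggers the error. A minimal sketch, assuming zs is a known constant:
import numpy as np
from keras import backend as K

zs = 100  # assumed: the (known) length of y_true

def gauss_loss(y_true, y_pred):
    xs = K.constant(np.linspace(0, 1, zs))        # shape (zs,)
    params = K.reshape(y_pred, (-1, 4, 3))        # (batch, 4, 3): rows of [a_i, b_i, c_i]
    a = K.expand_dims(params[:, :, 0], -1)        # (batch, 4, 1)
    b = K.expand_dims(params[:, :, 1], -1)
    c = K.expand_dims(params[:, :, 2], -1)
    gauss_sum = K.sum(a * K.exp(-(xs - b) ** 2 / (2 * c)), axis=1)  # (batch, zs)
    return K.mean(K.abs(y_true - gauss_sum), axis=-1)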

Pytorch: Gradient of output w.r.t parameters

I'm interested in finding the gradient of a neural network's output with respect to the parameters (weights and biases).
More specifically, assume I have the following neural network structure [6, 4, 3, 1]. The input sample size is 20. What I'm interested in is finding the gradient of the neural network output w.r.t. the weights (and biases), of which, if I'm not mistaken, there are 47 in this case. In the literature, this gradient is sometimes known as the weight Jacobian.
I'm using Pytorch version 0.4.0 on Python 3.6 on Jupyter Notebook.
The Code that I have produced is this:
import numpy.random as npr
import torch

def init_params(layer_sizes, scale=0.1, rs=npr.RandomState(0)):
    return [(rs.randn(insize, outsize) * scale,  # weight matrix
             rs.randn(outsize) * scale)          # bias vector
            for insize, outsize in zip(layer_sizes[:-1], layer_sizes[1:])]

layers = [6, 4, 3, 1]
w = init_params(layers)

first_layer_w = torch.tensor(w[0][0], requires_grad=True)
first_layer_bias = torch.tensor(w[0][1], requires_grad=True)
second_layer_w = torch.tensor(w[1][0], requires_grad=True)
second_layer_bias = torch.tensor(w[1][1], requires_grad=True)
third_layer_w = torch.tensor(w[2][0], requires_grad=True)
third_layer_bias = torch.tensor(w[2][1], requires_grad=True)

X = torch.tensor(X_batch, requires_grad=True)

output = torch.tanh(
    torch.mm(
        torch.tanh(
            torch.mm(
                torch.tanh(torch.mm(X, first_layer_w) + first_layer_bias),
                second_layer_w) + second_layer_bias),
        third_layer_w) + third_layer_bias)
output.backward()
As is obvious from the code, I'm using the hyperbolic tangent as the nonlinearity. The code produces an output vector of length 20. Now I'm interested in finding the gradient of this output vector w.r.t. all the weights (all 47 of them). I have read the PyTorch documentation here, and I have also seen similar questions, for example here. However, I have failed to find the gradient of the output vector w.r.t. the parameters.
If I use the Pytorch function backward(), it generates an error as
RuntimeError: grad can be implicitly created only for scalar outputs
My question is: is there a way to calculate the gradient of the output vector w.r.t. the parameters, which could essentially be represented as a 20*47 matrix, as the output vector has size 20 and the parameter vector has size 47? If so, how? Is there anything wrong with my code? You can take any example of X as long as its dimension is 20*6.
You're trying to compute a Jacobian of a function, while PyTorch is expecting you to compute vector-Jacobian products. You can see an in-depth discussion of computing Jacobians with PyTorch here.
You have two options. Your first option is to use JAX or autograd and use the jacobian() function. Your second option is to stick with PyTorch and compute 20 vector-Jacobian products, by calling backward(vec) 20 times, where vec is a length-20 one-hot vector whose nonzero index ranges from 0 to 19. If this is confusing, I recommend reading the autodiff cookbook from the JAX tutorials.
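A minimal sketch of the second option, reusing output and first_layer_w from the question (the same loop works for each of the other parameter tensors):
import torch

rows = []
for i in range(20):
    vec = torch.zeros_like(output)
    vec[i] = 1.0  # pick out the i-th output component
    grad, = torch.autograd.grad(output, first_layer_w,
                                grad_outputs=vec, retain_graph=True)
    rows.append(grad.flatten())
jac_first_layer = torch.stack(rows)  # shape (20, 24) for the 6x4 weight matrix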
The matrix of partial derivatives of a function with respect to its parameters is known as the Jacobian, and can be computed in PyTorch with:
torch.autograd.functional.jacobian(func, inputs)
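A minimal sketch (this function is available in PyTorch 1.5+), treating the weight tensors themselves as the inputs of the function being differentiated:
import torch

X = torch.randn(20, 6)
w1, b1 = torch.randn(6, 4), torch.randn(4)
w2, b2 = torch.randn(4, 3), torch.randn(3)
w3, b3 = torch.randn(3, 1), torch.randn(1)

def f(w1, b1, w2, b2, w3, b3):
    h1 = torch.tanh(X @ w1 + b1)
    h2 = torch.tanh(h1 @ w2 + b2)
    return torch.tanh(h2 @ w3 + b3)  # shape (20, 1)

jac = torch.autograd.functional.jacobian(f, (w1, b1, w2, b2, w3, b3))
# jac is a tuple with one Jacobian per parameter, e.g. jac[0].shape == (20, 1, 6, 4)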