Inner workings of pytorch autograd.grad for inner derivatives - python

Consider the following code:
x = torch.tensor(2.0, requires_grad=True)
y = torch.square(x)
grad = autograd.grad(y, x)
x = x + grad[0]
y = torch.square(x)
grad2 = autograd.grad(y, x)
First, we have that ∇(x^2)=2x. In my understanding, grad2=∇((x + ∇(x^2))^2)=∇((x+2x)^2)=∇((3x)^2)=9∇x^2=18x . As expected, grad=4.0=2x, but grad2=12.0=6x, which I don't understand where it comes from. It feels as though the 3 comes from the expression I had, but it is not squared, and the 2 comes from the traditional derivative. Could somebody help me understand why this is happening? Furthermore, how far back does the computational graph that stores the gradients go?
Specifically, I am coming from a meta learning perspective, where one is interested in computing a quantity of the following form ∇ L(theta - alpha * ∇ L(theta))=(1 + ∇^2 L(theta)) ∇L(theta - alpha * ∇ L(theta) (here the derivative is with respect to theta). Therefore, the computation, let's call it A, includes a second derivative. Computation is quite different than the following ∇_{theta - alpha ∇ L(theta)}L(\theta - alpha * ∇ L(theta))=∇_beta L(beta), which I will call B.
Hopefully, it is clear how the snippet I had is related to what I described in the second paragraph. My overall question is: under what circumstances does pytorch realize computation A vs computation B when using autograd.grad? I'd appreciate any explanation that goes into technical details about how this particular case is handled by autograd.
PD. The original code I was following made me wonder this is here; in particular, lines 69 through 106, and subsequently line 193, which is when they use autograd.grad. For the code is even more unclear because they do a lot of model.clone() and so on.
If the question is unclear in any way, please let me know.

I made a few changes:
I am not sure what torch.rand(2.0) is supposed to do. According to the text I simply set it to 2.
An intermediate variable z is added so that we can compute gradient w.r.t. to the original variable. Yours is overwritten.
Set create_graph=True to compute higher order gradients. See https://pytorch.org/docs/stable/generated/torch.autograd.grad.html
import torch
from torch import autograd
x = torch.ones(1, requires_grad=True)*2
y = torch.square(x)
grad = autograd.grad(y, x, create_graph=True)
z = x + grad[0]
y = torch.square(z)
grad2 = autograd.grad(y, x)
# yours is more like autograd.grad(y, z)
print(x)
print(grad)
print(grad2)

Related

Mixed partial dervative w.r.t. tensor in Pytorch

Question:
Is there any working method to calculate gradient of (non-scalar) tensor function?
Example
Given n by n symmetric matrices X, Y and matrix function Z(X, Y) = torch.mm(X.mm(X), Y) calculate d(dZ/dX)/dY.
Expected answer
d(dZ/dX)/dY = d(2*XY)/dY = 2*X
Attempts
Because torch's .backward() works only for scalar variables I've tried to calculate derivative by applying torch.autograd.grad() to each element of tensor Z, but this approach is not correct, because it gives d(X^2)/dX = X + 2*D where D is a diagonal matrix with diagonal values of X. For me it's a bit weird that torch has an ability to build a computational graph, but can't track tensor through it as a variable to get tensor derivative.
Edit
Question was not very clear, so I decided to give more details.
My aim is to get partial derivative of loss function, which involves two matrices as variables. It looks like that:
loss = torch.linalg.norm(my_formula(X, Y) , ord='fro')
And I need to find
d^2(loss)/d(Y^2)
d/dX[d(loss)/dY]
Torch is capable of calculating 1. by using .backward() two times, but it's problematic to find 2. because torch.autograd.grad() expects scalar input and not the tensor
TL;DR
For function f which takes a matrix and gives a scalar:
Find first order derivative, let's name it dX
Take trace: Tr(dX)
To get mixed partial derivative just use the trace from above: d/dY[df/dX] = d/dY[Tr(df/dX)]
Intro
At the moment of posting the question I was not really that good at theory of matrix derivatives, but now I know much more all thanks to this Yandex ml book (unfortunately, I didn't find the english equivalent). This is an attempt to give a full answer to my question.
Basic Theory
Forgive me, Lord, for ugly representation of latex
Let's say you have a function which takes matrix X and returns it's squared Frobenius norm: f(X) = ||X||_F^2
It is a well-known fact that: ||X||_F^2 = Tr(X X^T)
Let's define derivative as shown in same book: D[f] at X_0 = f(X + H) - f(X)
We are ready to find dg(X)/dX:
df(X)/dX = dTr(X X^T)/dX =
(using Trace's feature)
= Tr(d/dX[X X^T]) = Tr(dX/dX X^T + X d[X^T]/dX ) =
(then we should use the definition of derivative from above)
= Tr(HX^T + XH^T) = Tr(HX^T) + Tr(XH^T) =
(now the main trick is to get all matrices H on the right side and get something like
Tr(g(X) H) or Tr(g(X) H^T), where g(X) will be the derivative we are looking for)
= Tr(HX^T) + Tr(XH^T) = Tr(XH^T) + Tr(XH^T) = Tr(2*XH^T)
That means: df(X)/dX = 2X
Second order derivative
Now, after we found out how to get matrix derivatives, let's try to find second order derivative of the same function f(X):
d/dX[df(X)/dX] = d/dX[Tr(2XH_1^T)] = Tr(d/dX[2XH_1^T]) =
= Tr(2I H_2 H_1^T)
We found out that d/dX[df(X)/dX] = 2I where I stands for Identity matrix. But how will it help us to find derivatives in Pytorch?
Trace is the trick
As we can see from the formulas, both first and second order derivatives have Trace inside them, but when we take first order derivative we just instantly get matrix as a result. To get a higher order derivative we just need to take the derivative of trace of first order derivative:
d/dY[df/dX] = d/dY[Tr(df/dX)]
The thing is I was using JAX autograd library when this trick came to my mind, so the code with a function f(X,Y) will look like this:
def scalarized_dy(X, Y):
dY = grad(f, argnums=1)(X, Y)
return jnp.trace(dY)
dYX = grad(scalarized_dy, argnums=0)(X, Y)
dYY = grad(scalarized_dy, argnums=1)(X, Y)
In case of Pytorch I guess we will need to look after tensors' gradients (let loss be a function with X and Y as arguments):
loss = f(X, Y)
loss.backward(create_graph = True)
dX = torch.trace(X.grad)
dX.backward()
dXX = X.grad
dXY = Y.grad
Epilogue
I thought that the question itself is in some way interesting. Also, it took me several months to figure things out, so I decided to give my current point of view on this problem. I will not mark my answer as correct yet in hope that I will get some kind of feedback or, perhaps, even better answers or ideas.

Gaussian process regression - explain behaviour

I'm looking into GP regression, but I'm getting some behaviour that I do not understand.
Basically, I wanted to show convergence for GP on the osciallatory Genz function (basically a period wave), which led me to this picture Gp convergence, sorry for the missing labels (x axis: num samples, y axis: relative error measure in 2000 points)
This is OK, but I was curious why it took so long before the error started to drop. Plotting the resulting GP fit I got this (busy) plot GP fit is orange, true function is blue. What I don't understand is what happens up until it starts to capture the true function. I assumed it had something to do with the kernel. The plot here uses a RBF kernel with length_scale = 1 (I also tried both higher and lower values, but got the same results).
I kind of expected it to have a more smooth behaviour even if it couldn't capture the true model.
So, to my question: why do I see this "spikey" behaviour? And can I do something to change it (kernel-wise or other)?
kernel = RBF(length_scale = 1, length_scale_bounds = (1e-2, 1e2))
gp = GaussianProcessRegressor(kernel=kernel)
gp.fit(X, y)
def genz(x, method = 'default'):
d = x.shape[1]
a = 10/d
w = 1/2
num_points = x.shape[0]
funcval = np.empty([1,num_points])
for i in range(num_points):
funcval[0,i] = np.cos(2 * np.pi * w + np.sum(a * x[i,:]))
return funcval
It seems like the optimized length scale is very small compared to its domain space. I also felt very weird when I was digging into this library; changing some hyperparameters and the number of optimization didn't work for me as well. It might be helpful to change your kernel function to matern with changing the gamma value but not very much. If you really want to customize as you want, I might recommend you to use gpytorch similar to torch implementation or the GPML matlab toolbox.

Calculating decision function of SVM manually

I'm attempting to calculate the decision_function of a SVC classifier MANUALLY (as opposed to using the inbuilt method) using the the python library SKLearn.
I've tried several methods, however, I can only ever get the manual calculation to match when I don't scale my data.
z is a test datum (that's been scaled) and I think the other variables speak for themselves (also, I'm using an rbf kernel if thats not obvious from the code).
Here are the methods that I've tried:
1 Looping method:
dec_func = 0
for j in range(np.shape(sup_vecs)[0]):
norm2 = np.linalg.norm(sup_vecs[j, :] - z)**2
dec_func = dec_func + dual_coefs[0, j] * np.exp(-gamma*norm2)
dec_func += intercept
2 Vectorized Method
diff = sup_vecs - z
norm2 = np.sum(np.sqrt(diff*diff), 1)**2
dec_func = dual_coefs.dot(np.exp(-gamma_params*norm2)) + intercept
However, neither of these ever returns the same value as decision_function. I think it may have something to do with rescaling my values or more likely its something silly that I've been over looking!
Any help would be appreciated.
So after a bit more digging and head scratching, I've figured it out.
As I mentioned above z is a test datum that's been scaled. To scale it I had to extract .mean_ and .std_ attributes from the preprocessing.StandardScaler() object (after calling .fit() on my training data of course).
I was then using this scaled z as an input to both my manual calculations and to the inbuilt function. However the inbuilt function was a part of a pipeline which already had StandardScaler as its first 'pipe' in the pipeline and as a result z was getting scaled twice!
Hence, when I removed scaling from my pipeline, the manual answers "matched" the inbuilt function's answer.
I say "matched" in quotes by the way as I found I always had to flip the sign of my manual calculations to match the inbuilt version. Currently I have no idea why this is the case.
To conclude, I misunderstood how pipelines worked.
For those that are interested, here's the final versions of my manual methods:
diff = sup_vecs - z_scaled
# Looping Method
dec_func_loop = 0
for j in range(np.shape(sup_vecs)[0]):
norm2 = np.linalg.norm(diff[j,:])
dec_func_loop = dec_func_loop + dual_coefs[j] * np.exp(-gamma*(norm2**2))
dec_func_loop = -1 * (dec_func_loop - intercept)
# Vectorized method
norm2 = np.array([np.linalg.norm(diff[n, :]) for n in range(np.shape(sup_vecs)[0])])
dec_func_vec = -1 * (dual_coefs.dot(np.exp(-gamma*(norm2**2))) - intercept)
Addendum
For those who are interested in implementing a manual method for a multiclass SVC, the following link is helpful: https://stackoverflow.com/a/27752709/1182556

On ordinary differential equations (ODE) and optimization, in Python

I want to solve this kind of problem:
dy/dt = 0.01*y*(1-y), find t when y = 0.8 (0<t<3000)
I've tried the ode function in Python, but it can only calculate y when t is given.
So are there any simple ways to solve this problem in Python?
PS: This function is just a simple example. My real problem is so complex that can't be solve analytically. So I want to know how to solve it numerically. And I think this problem is more like an optimization problem:
Objective function y(t) = 0.8, Subject to dy/dt = 0.01*y*(1-y), and 0<t<3000
PPS: My real problem is:
objective function: F(t) = 0.85,
subject to: F(t) = sqrt(x(t)^2+y(t)^2+z(t)^2),
x''(t) = (1/F(t)-1)*250*x(t),
y''(t) = (1/F(t)-1)*250*y(t),
z''(t) = (1/F(t)-1)*250*z(t)-10,
x(0) = 0, y(0) = 0, z(0) = 0.7,
x'(0) = 0.1, y'(0) = 1.5, z'(0) = 0,
0<t<5
This differential equation can be solved analytically quite easily:
dy/dt = 0.01 * y * (1-y)
rearrange to gather y and t terms on opposite sides
100 dt = 1/(y * (1-y)) dy
The lhs integrates trivially to 100 * t, rhs is slightly more complicated. We can always write a product of two quotients as a sum of the two quotients * some constants:
1/(y * (1-y)) = A/y + B/(1-y)
The values for A and B can be worked out by putting the rhs on the same denominator and comparing constant and first order y terms on both sides. In this case it is simple, A=B=1. Thus we have to integrate
1/y + 1/(1-y) dy
The first term integrates to ln(y), the second term can be integrated with a change of variables u = 1-y to -ln(1-y). Our integrated equation therefor looks like:
100 * t + C = ln(y) - ln(1-y)
not forgetting the constant of integration (it is convenient to write it on the lhs here). We can combine the two logarithm terms:
100 * t + C = ln( y / (1-y) )
In order to solve t for an exact value of y, we first need to work out the value of C. We do this using the initial conditions. It is clear that if y starts at 1, dy/dt = 0 and the value of y never changes. Thus plug in the values for y and t at the beginning
100 * 0 + C = ln( y(0) / (1 - y(0) )
This will give a value for C (assuming y is not 0 or 1) and then use y=0.8 to get a value for t. Note that because of the logarithm and the factor 100 multiplying t y will reach 0.8 within a relatively short range of t values, unless the initial value of y is incredibly small. It is of course also straightforward to rearrange the equation above to express y in terms of t, then you can plot the function as well.
Edit: Numerical integration
For a more complexed ODE which cannot be solved analytically, you will have to try numerically. Initially we only know the value of the function at zero time y(0) (we have to know at least that in order to uniquely define the trajectory of the function), and how to evaluate the gradient. The idea of numerical integration is that we can use our knowledge of the gradient (which tells us how the function is changing) to work out what the value of the function will be in the vicinity of our starting point. The simplest way to do this is Euler integration:
y(dt) = y(0) + dy/dt * dt
Euler integration assumes that the gradient is constant between t=0 and t=dt. Once y(dt) is known, the gradient can be calculated there also and in turn used to calculate y(2 * dt) and so on, gradually building up the complete trajectory of the function. If you are looking for a particular target value, just wait until the trajectory goes past that value, then interpolate between the last two positions to get the precise t.
The problem with Euler integration (and with all other numerical integration methods) is that its results are only accurate when its assumptions are valid. Because the gradient is not constant between pairs of time points, a certain amount of error will arise for each integration step, which over time will build up until the answer is completely inaccurate. In order to improve the quality of the integration, it is necessary to use more sophisticated approximations to the gradient. Check out for example the Runge-Kutta methods, which are a family of integrators which remove progressive orders of error term at the cost of increased computation time. If your function is differentiable, knowing the second or even third derivatives can also be used to reduce the integration error.
Fortunately of course, somebody else has done the hard work here, and you don't have to worry too much about solving problems like numerical stability or have an in depth understanding of all the details (although understanding roughly what is going on helps a lot). Check out http://docs.scipy.org/doc/scipy/reference/generated/scipy.integrate.ode.html#scipy.integrate.ode for an example of an integrator class which you should be able to use straightaway. For instance
from scipy.integrate import ode
def deriv(t, y):
return 0.01 * y * (1 - y)
my_integrator = ode(deriv)
my_integrator.set_initial_value(0.5)
t = 0.1 # start with a small value of time
while t < 3000:
y = my_integrator.integrate(t)
if y > 0.8:
print "y(%f) = %f" % (t, y)
break
t += 0.1
This code will print out the first t value when y passes 0.8 (or nothing if it never reaches 0.8). If you want a more accurate value of t, keep the y of the previous t as well and interpolate between them.
As an addition to Krastanov`s answer:
Aside of PyDSTool there are other packages, like Pysundials and Assimulo which provide bindings to the solver IDA from Sundials. This solver has root finding capabilites.
Use scipy.integrate.odeint to handle your integration, and analyse the results afterward.
import numpy as np
from scipy.integrate import odeint
ts = np.arange(0,3000,1) # time series - start, stop, step
def rhs(y,t):
return 0.01*y*(1-y)
y0 = np.array([1]) # initial value
ys = odeint(rhs,y0,ts)
Then analyse the numpy array ys to find your answer (dimensions of array ts matches ys). (This may not work first time because I am constructing from memory).
This might involve using the scipy interpolate function for the ys array, such that you get a result at time t.
EDIT: I see that you wish to solve a spring in 3D. This should be fine with the above method; Odeint on the scipy website has examples for systems such as coupled springs that can be solved for, and these could be extended.
What you are asking for is a ODE integrator with root finding capabilities. They exist and the low-level code for such integrators is supplied with scipy, but they have not yet been wrapped in python bindings.
For more information see this mailing list post that provides a few alternatives: http://mail.scipy.org/pipermail/scipy-user/2010-March/024890.html
You can use the following example implementation which uses backtracking (hence it is not optimal as it is a bolt-on addition to an integrator that does not have root finding on its own): https://github.com/scipy/scipy/pull/4904/files

Multiple linear regression in python without fitting the origin?

I found this chunk of code on http://rosettacode.org/wiki/Multiple_regression#Python, which does a multiple linear regression in python. Print b in the following code gives you the coefficients of x1, ..., xN. However, this code is fitting the line through the origin (i.e. the resulting model does not include a constant).
All I'd like to do is the exact same thing except I do not want to fit the line through the origin, I need the constant in my resulting model.
Any idea if it's a small modification to do this? I've searched and found numerous documents on multiple regressions in python, except they are lengthy and overly complicated for what I need. This code works perfect, except I just need a model that fits through the intercept not the origin.
import numpy as np
from numpy.random import random
n=100
k=10
y = np.mat(random((1,n)))
X = np.mat(random((k,n)))
b = y * X.T * np.linalg.inv(X*X.T)
print(b)
Any help would be appreciated. Thanks.
you only need to add a row to X that is all 1.
Maybe a more stable approach would be to use a least squares algorithm anyway. This can also be done in numpy in a few lines. Read the documentation about numpy.linalg.lstsq.
Here you can find an example implementation:
http://glowingpython.blogspot.de/2012/03/linear-regression-with-numpy.html
What you have written out, b = y * X.T * np.linalg.inv(X * X.T), is the solution to the normal equations, which gives the least squares fit with a multi-linear model. swang's response is correct (and EMS's elaboration)---you need to add a row of 1's to X. If you want some idea of why it works theoretically, keep in mind that you are finding b_i such that
y_j = sum_i b_i x_{ij}.
By adding a row of 1's, you are are setting x_{(k+1)j} = 1 for all j, which means that you are finding b_i such that:
y_j = (sum_i b_i x_{ij}) + b_{k+1}
because the k+1st x_ij term is always equal to one. Thus, b_{k+1} is your intercept term.

Categories