Question:
Is there any working method to calculate gradient of (non-scalar) tensor function?
Example
Given n by n symmetric matrices X, Y and matrix function Z(X, Y) = torch.mm(X.mm(X), Y) calculate d(dZ/dX)/dY.
Expected answer
d(dZ/dX)/dY = d(2*XY)/dY = 2*X
Attempts
Because torch's .backward() works only for scalar variables I've tried to calculate derivative by applying torch.autograd.grad() to each element of tensor Z, but this approach is not correct, because it gives d(X^2)/dX = X + 2*D where D is a diagonal matrix with diagonal values of X. For me it's a bit weird that torch has an ability to build a computational graph, but can't track tensor through it as a variable to get tensor derivative.
Edit
Question was not very clear, so I decided to give more details.
My aim is to get partial derivative of loss function, which involves two matrices as variables. It looks like that:
loss = torch.linalg.norm(my_formula(X, Y) , ord='fro')
And I need to find
d^2(loss)/d(Y^2)
d/dX[d(loss)/dY]
Torch is capable of calculating 1. by using .backward() two times, but it's problematic to find 2. because torch.autograd.grad() expects scalar input and not the tensor
TL;DR
For function f which takes a matrix and gives a scalar:
Find first order derivative, let's name it dX
Take trace: Tr(dX)
To get mixed partial derivative just use the trace from above: d/dY[df/dX] = d/dY[Tr(df/dX)]
Intro
At the moment of posting the question I was not really that good at theory of matrix derivatives, but now I know much more all thanks to this Yandex ml book (unfortunately, I didn't find the english equivalent). This is an attempt to give a full answer to my question.
Basic Theory
Forgive me, Lord, for ugly representation of latex
Let's say you have a function which takes matrix X and returns it's squared Frobenius norm: f(X) = ||X||_F^2
It is a well-known fact that: ||X||_F^2 = Tr(X X^T)
Let's define derivative as shown in same book: D[f] at X_0 = f(X + H) - f(X)
We are ready to find dg(X)/dX:
df(X)/dX = dTr(X X^T)/dX =
(using Trace's feature)
= Tr(d/dX[X X^T]) = Tr(dX/dX X^T + X d[X^T]/dX ) =
(then we should use the definition of derivative from above)
= Tr(HX^T + XH^T) = Tr(HX^T) + Tr(XH^T) =
(now the main trick is to get all matrices H on the right side and get something like
Tr(g(X) H) or Tr(g(X) H^T), where g(X) will be the derivative we are looking for)
= Tr(HX^T) + Tr(XH^T) = Tr(XH^T) + Tr(XH^T) = Tr(2*XH^T)
That means: df(X)/dX = 2X
Second order derivative
Now, after we found out how to get matrix derivatives, let's try to find second order derivative of the same function f(X):
d/dX[df(X)/dX] = d/dX[Tr(2XH_1^T)] = Tr(d/dX[2XH_1^T]) =
= Tr(2I H_2 H_1^T)
We found out that d/dX[df(X)/dX] = 2I where I stands for Identity matrix. But how will it help us to find derivatives in Pytorch?
Trace is the trick
As we can see from the formulas, both first and second order derivatives have Trace inside them, but when we take first order derivative we just instantly get matrix as a result. To get a higher order derivative we just need to take the derivative of trace of first order derivative:
d/dY[df/dX] = d/dY[Tr(df/dX)]
The thing is I was using JAX autograd library when this trick came to my mind, so the code with a function f(X,Y) will look like this:
def scalarized_dy(X, Y):
dY = grad(f, argnums=1)(X, Y)
return jnp.trace(dY)
dYX = grad(scalarized_dy, argnums=0)(X, Y)
dYY = grad(scalarized_dy, argnums=1)(X, Y)
In case of Pytorch I guess we will need to look after tensors' gradients (let loss be a function with X and Y as arguments):
loss = f(X, Y)
loss.backward(create_graph = True)
dX = torch.trace(X.grad)
dX.backward()
dXX = X.grad
dXY = Y.grad
Epilogue
I thought that the question itself is in some way interesting. Also, it took me several months to figure things out, so I decided to give my current point of view on this problem. I will not mark my answer as correct yet in hope that I will get some kind of feedback or, perhaps, even better answers or ideas.
Related
Consider the following code:
x = torch.tensor(2.0, requires_grad=True)
y = torch.square(x)
grad = autograd.grad(y, x)
x = x + grad[0]
y = torch.square(x)
grad2 = autograd.grad(y, x)
First, we have that ∇(x^2)=2x. In my understanding, grad2=∇((x + ∇(x^2))^2)=∇((x+2x)^2)=∇((3x)^2)=9∇x^2=18x . As expected, grad=4.0=2x, but grad2=12.0=6x, which I don't understand where it comes from. It feels as though the 3 comes from the expression I had, but it is not squared, and the 2 comes from the traditional derivative. Could somebody help me understand why this is happening? Furthermore, how far back does the computational graph that stores the gradients go?
Specifically, I am coming from a meta learning perspective, where one is interested in computing a quantity of the following form ∇ L(theta - alpha * ∇ L(theta))=(1 + ∇^2 L(theta)) ∇L(theta - alpha * ∇ L(theta) (here the derivative is with respect to theta). Therefore, the computation, let's call it A, includes a second derivative. Computation is quite different than the following ∇_{theta - alpha ∇ L(theta)}L(\theta - alpha * ∇ L(theta))=∇_beta L(beta), which I will call B.
Hopefully, it is clear how the snippet I had is related to what I described in the second paragraph. My overall question is: under what circumstances does pytorch realize computation A vs computation B when using autograd.grad? I'd appreciate any explanation that goes into technical details about how this particular case is handled by autograd.
PD. The original code I was following made me wonder this is here; in particular, lines 69 through 106, and subsequently line 193, which is when they use autograd.grad. For the code is even more unclear because they do a lot of model.clone() and so on.
If the question is unclear in any way, please let me know.
I made a few changes:
I am not sure what torch.rand(2.0) is supposed to do. According to the text I simply set it to 2.
An intermediate variable z is added so that we can compute gradient w.r.t. to the original variable. Yours is overwritten.
Set create_graph=True to compute higher order gradients. See https://pytorch.org/docs/stable/generated/torch.autograd.grad.html
import torch
from torch import autograd
x = torch.ones(1, requires_grad=True)*2
y = torch.square(x)
grad = autograd.grad(y, x, create_graph=True)
z = x + grad[0]
y = torch.square(z)
grad2 = autograd.grad(y, x)
# yours is more like autograd.grad(y, z)
print(x)
print(grad)
print(grad2)
I'm trying to implement the following formula in python for X and Y points
I have tried following approach
def f(c):
"""This function computes the curvature of the leaf."""
tt = c
n = (tt[0]*tt[3] - tt[1]*tt[2])
d = (tt[0]**2 + tt[1]**2)
k = n/d
R = 1/k # Radius of Curvature
return R
There is something incorrect as it is not giving me correct result. I think I'm making some mistake while computing derivatives in first two lines. How can I fix that?
Here are some of the points which are in a data frame:
pts = pd.DataFrame({'x': x, 'y': y})
x y
0.089631 97.710199
0.089831 97.904541
0.090030 98.099313
0.090229 98.294513
0.090428 98.490142
0.090627 98.686200
0.090827 98.882687
0.091026 99.079602
0.091225 99.276947
0.091424 99.474720
0.091623 99.672922
0.091822 99.871553
0.092022 100.070613
0.092221 100.270102
0.092420 100.470020
0.092619 100.670366
0.092818 100.871142
0.093017 101.072346
0.093217 101.273979
0.093416 101.476041
0.093615 101.678532
0.093814 101.881451
0.094013 102.084800
0.094213 102.288577
pts_x = np.gradient(x_c, t) # first derivatives
pts_y = np.gradient(y_c, t)
pts_xx = np.gradient(pts_x, t) # second derivatives
pts_yy = np.gradient(pts_y, t)
After getting the derivatives I am putting the derivatives x_prim, x_prim_prim, y_prim, y_prim_prim in another dataframe using the following code:
d = pd.DataFrame({'x_prim': pts_x, 'y_prim': pts_y, 'x_prim_prim': pts_xx, 'y_prim_prim':pts_yy})
after having everything in the data frame I am calling function for each row of the data frame to get curvature at that point using following code:
# Getting the curvature at each point
for i in range(len(d)):
temp = d.iloc[i]
c_temp = f(temp)
curv.append(c_temp)
You do not specify exactly what the structure of the parameter pts is. But it seems that it is a two-dimensional array where each row has two values x and y and the rows are the points in your curve. That itself is problematic, since the documentation is not quite clear on what exactly is returned in such a case.
But you clearly are not getting the derivatives of x or y. If you supply only one array to np.gradient then numpy assumes that the points are evenly spaced with a distance of one. But that is probably not the case. The meaning of x' in your formula is the derivative of x with respect to t, the parameter variable for the curve (which is separate from the parameters to the computer functions). But you never supply the values of t to numpy. The values of t must be the second parameter passed to the gradient function.
So to get your derivatives, split the x, y, and t values into separate one-dimensional arrays--lets call them x and y and t. Then get your first and second derivatives with
pts_x = np.gradient(x, t) # first derivatives
pts_y = np.gradient(y, t)
pts_xx = np.gradient(pts_x, t) # second derivatives
pts_yy = np.gradient(pts_y, t)
Then continue from there. You no longer need the t values to calculate the curvatures, which is the point of the formula you are using. Note that gradient is not really designed to calculate the second derivatives, and it absolutely should not be used to calculate third or higher-order derivatives. More complex formulas are needed for those. Numpy's gradient uses "second order accurate central differences" which are pretty good for the first derivative, poor for the second derivative, and worthless for higher-order derivatives.
I think your problem is that x and y are arrays of double values.
The array x is the independent variable; I'd expect it to be sorted into ascending order. If I evaluate y[i], I expect to get the value of the curve at x[i].
When you call that numpy function you get an array of derivative values that are the same shape as the (x, y) arrays. If there are n pairs from (x, y), then
y'[i] gives the value of the first derivative of y w.r.t. x at x[i];
y''[i] gives the value of the second derivative of y w.r.t. x at x[i].
The curvature k will also be an array with n points:
k[i] = abs(x'[i]*y''[i] -y'[i]*x''[i])/(x'[i]**2 + y'[i]**2)**1.5
Think of x and y as both being functions of a parameter t. x' = dx/dt, etc. This means curvature k is also a function of that parameter t.
I like to have a well understood closed form solution available when I program a solution.
y(x) = sin(x) for 0 <= x <= pi
y'(x) = cos(x)
y''(x) = -sin(x)
k = sin(x)/(1+(cos(x))**2)**1.5
Now you have a nice formula for curvature as a function of x.
If you want to parameterize it, use
x(t) = pi*t for 0 <= t <= 1
x'(t) = pi
x''(t) = 0
See if you can plot those and make your Python solution match it.
I am trying to calculate the wind gradient given u-wind and v-wind. The u and v values have a 3d-array with the following shape:
u(122,9,9) such that u(time,latitude,longitude). The same applies for v.
I have also calculated the dx and dy values (in 2-d array for both lat and lon direction)
The sample of my code is as below at time 0 for example:
dudx = np.gradient(u[0,0,:], dx[0,0], edge_order=2)
dvdy = np.gradient(v[0,:,0], dy[0,0], edge_order=2)
I can then sum dudx and dvdy to get the gradient. I have a data that has already calculated the divergence, and upon comparing my calculation with the divergence data, i expected the values to be the same, but they're not. I can't seem to figure out where i went wrong besides using the np.gradient function incorrectly.
I would like to know if my methods above to calculate the gradient of u and v winds are correct.
Cheers.
Edit
The full code i am using to calculate the wind gradient is as below:
dqu_dx = np.zeros((122,9,9))
dqv_dy = np.zeros((122,9,9))
for i in range(122):
for j in range(9):
for k in range(9):
dqu_dx[i,j,:] = np.gradient(dqu_18hr[i,j,:], dx[0,k], edge_order=2)
dqv_dy[i,:,k] = np.gradient(dqv_18hr[i,:,k], dy[j,0], edge_order=2)
Unfortunately I can't comment on your question to ask for explanations because I don't have enough reputation, so I am forced to make some assumptions. Feel free to correct me if I am wrong.
I will assume that dqu_18hr and dqv_18hr are arrays storing the value of two different functions, u(t, y, x) and v(t, y, x). If I understand correctly you want to calculate du/dx and dv/dy.
I don't know what are the dx and dy values that you store in the arrays, also because you define them as 2D-arrays but use them as 1D-arrays. I will assume that dx and dy are coordinates of the points at which you computed u and v, and that the grid they produce is regular.
A first problem with your code is that you are passing a single scalar number as the second argument of np.gradient. When this is done, numpy assumes that this is the distance between points. However, this distance changes at every iteration. I can think of a quite convoluted case for which the definition of dx is such that this gives the correct result, but generally this is a mistake.
Another problem with the code is that it doesn't take advantage of numpy vectorization, using explicitly three for loops. This is extremely inefficient computationally.
I would suggest you the following code:
x = dx[0, :] # or whatever is the correct definition
y = dy[:, 0] # not enough info in the post to understand it
a = np.gradient(dqu_18hr, x, axis=2, edge_order=2)
b = np.gradient(dqv_18hr, y, axis=1, edge_order=2)
Please also notice that in your code x is associated to the axis 2 and y to the axis 1, which is absolutely legit but unusual so you might want to check if that's a mistake.
I am trying to find higher order derivatives of a dataset (x,y). x and y are 1D arrays of length N.
Let's say I generate them as :
xder0=np.linspace(0,10,1000)
yder0=np.sin(xder0)
I define the derivative function which takes in 2 array (x,y) and returns (x1, y1) where y1 is the derivative calculated at each index as : (y[i+1]-y[i])/(x[i+1]-x[i]). x1 is just the mean of x[i+1] and x[i]
Here is the function that does it:
def deriv(x,y):
delx =np.zeros((len(x)-1), dtype=np.longdouble)
ydiff=np.zeros((len(x)-1), dtype=np.longdouble)
for i in range(len(x)-1):
delx[i] =(x[i+1]+x[i])/2.0
ydiff[i] =(y[i+1]-y[i])/(x[i+1]-x[i])
return delx, ydiff
Now to calculate the first derivative, I call this function as:
xder1, yder1 = deriv(xder0, yder0)
Similarly for second derivative, I call this function giving first derivatives as input:
xder2, yder2 = deriv(xder1, yder1)
And it goes on:
xder3, yder3 = deriv(xder2, yder2)
xder4, yder4 = deriv(xder3, yder3)
xder5, yder5 = deriv(xder4, yder4)
xder6, yder6 = deriv(xder5, yder5)
xder7, yder7 = deriv(xder6, yder6)
xder8, yder8 = deriv(xder7, yder7)
xder9, yder9 = deriv(xder8, yder8)
Something peculiar happens after I reach order 7. The 7th order becomes very noisy! Earlier derivatives are all either sine or cos functions as expected. However 7th order is a noisy sine. And hence all derivatives after that blow up.
Any idea what is going on?
This is a well known stability issue with numerical interpolation using equally-spaced points. Read the answers at http://math.stackexchange.com.
To overcome this problem you have to use non-equally-spaced points, like the roots of Lagendre polynomial. The instability occurs due to the unavailability of information at the boundaries, thus more concentration of points at the boundaries is required, as per the roots of say Lagendre polynomials or others with similar properties such as Chebyshev polynomial.
The leastsq method in scipy lib fits a curve to some data. And this method implies that in this data Y values depends on some X argument. And calculates the minimal distance between curve and the data point in the Y axis (dy)
But what if I need to calculate minimal distance in both axes (dy and dx)
Is there some ways to implement this calculation?
Here is a sample of code when using one axis calculation:
import numpy as np
from scipy.optimize import leastsq
xData = [some data...]
yData = [some data...]
def mFunc(p, x, y):
return y - (p[0]*x**p[1]) # is takes into account only y axis
plsq, pcov = leastsq(mFunc, [1,1], args=(xData,yData))
print plsq
I recently tryed scipy.odr library and it returns the proper results only for linear function. For other functions like y=a*x^b it returns wrong results. This is how I use it:
def f(p, x):
return p[0]*x**p[1]
myModel = Model(f)
myData = Data(xData, yData)
myOdr = ODR(myData, myModel , beta0=[1,1])
myOdr.set_job(fit_type=0) #if set fit_type=2, returns the same as leastsq
out = myOdr.run()
out.pprint()
This returns wrong results, not desired, and in some input data not even close to real.
May be, there is some special ways of using it, what do I do wrong?
I've found the solution. Scipy Odrpack works noramally but it needs a good initial guess for correct results. So I divided the process into two steps.
First step: find the initial guess by using ordinaty least squares method.
Second step: substitude these initial guess in ODR as beta0 parameter.
And it works very well with an acceptable speed.
Thank you guys, your advice directed me to the right solution
scipy.odr implements the Orthogonal Distance Regression. See the instructions for basic use in the docstring and documentation.
If/when you are able to invert the function described by p you may just include x-pinverted(y) in mFunc, I guess as sqrt(a^2+b^2), so (pseudo code)
return sqrt( (y - (p[0]*x**p[1]))^2 + (x - (pinverted(y))^2)
for example for
y=kx+m p=[m,k]
pinv=[-m/k,1/k]
return sqrt( (y - (p[0]+x*p[1]))^2 + (x - (pinv[0]+y*pinv[1]))^2)
But what you ask for is in some cases problematic. For example, if a polynomial (or your x^j) curve has a minimum ym at y(m) and you have a point x,y lower than ym, what kind of value do you want to return? There's not always a solution.
you can use the ONLS package in R.