Need help understanding the gradient function in pytorch - python

The following code
import numpy as np
import torch

w = np.array([[2., 2.], [2., 2.]])
x = np.array([[3., 3.],[3., 3.]])
b = np.array([[4., 4.],[4., 4.]])
w = torch.tensor(w, requires_grad=True)
x = torch.tensor(x, requires_grad=True)
b = torch.tensor(b, requires_grad=True)
y = w*x + b
print(y)
# tensor([[10., 10.],
# [10., 10.]], dtype=torch.float64, grad_fn=<AddBackward0>)
y.backward(torch.tensor([[1., 1.], [1., 1.]], dtype=torch.float64))
print(w.grad)
# tensor([[3., 3.],
# [3., 3.]], dtype=torch.float64)
print(x.grad)
# tensor([[2., 2.],
# [2., 2.]], dtype=torch.float64)
print(b.grad)
# tensor([[1., 1.],
# [1., 1.]], dtype=torch.float64)
As the tensor argument passed to backward() is an all-ones tensor with the shape of the input tensor, my understanding was that:
1. w.grad means the derivative of y w.r.t. w, and produces b,
2. x.grad means the derivative of y w.r.t. x, and produces b, and
3. b.grad means the derivative of y w.r.t. b, and produces all ones.
Of these, only the answer to point 3 matches my expected result. Can someone help me understand the first two answers? I think I understand the accumulation part, but I don't think that is happening here.

To find the correct derivatives in this example, we need to take the sum and product rule into consideration.
Sum rule: (f + g)' = f' + g'
Product rule: (f * g)' = f' * g + f * g'
That means the derivatives of your equation are calculated as follows.
With respect to x: dy/dx = d(w*x)/dx + db/dx = (dw/dx * x + w * dx/dx) + 0 = w
With respect to w: dy/dw = d(w*x)/dw + db/dw = (dw/dw * x + w * dx/dw) + 0 = x
With respect to b: dy/db = d(w*x)/db + db/db = 0 + 1 = 1
The gradients reflect exactly that:
torch.equal(w.grad, x) # => True
torch.equal(x.grad, w) # => True
torch.equal(b.grad, torch.tensor([[1, 1], [1, 1]], dtype=torch.float64)) # => True
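As a side note, the same gradients can be obtained without the explicit all-ones argument by reducing y to a scalar first, because backward() on a scalar implicitly uses a gradient of 1. A minimal sketch (not from the original answer, using fresh tensors so gradients don't accumulate):
import torch
w = torch.full((2, 2), 2., requires_grad=True)
x = torch.full((2, 2), 3., requires_grad=True)
b = torch.full((2, 2), 4., requires_grad=True)
(w * x + b).sum().backward()
print(torch.equal(w.grad, x))  # True: dy/dw = x
print(torch.equal(x.grad, w))  # True: dy/dx = w
print(b.grad)                  # all ones: dy/db = 1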

Related

How to add two torch tensors along given dimension?

I have a torch tensor, pred, of shape (B, 2, H, W), and I want to add two different values, val1 and val2, to the two channels along axis 1.
I managed to do it in a "mechanical" way by accessing the individual channels directly, e.g.:
def thresh_format(pred, val1, val2):
    tr = torch.zeros_like(pred)
    tr[:, 0, :, :] = tr[:, 0, :, :].add(val1)
    tr[:, 1, :, :] = tr[:, 1, :, :].add(val2)
    return pred + tr
However I'm wondering if there's a "better" way to do it, e.g. by exploiting broadcasting. My understanding from the documentation is that broadcasting happens from trailing dimensions, so in this case I'm puzzled how to make it work for dimension 1.
Any ideas?
The easiest way to achieve this is to stack val1 and val2 in a tensor and reshape it to match the shape of the pred tensor along the common dimension.
pred + torch.tensor([val1, val2]).reshape((1,-1,1,1))
This way, for the addition, torch automatically broadcasts the values along the dimensions where pred has higher order.
It's pretty similar to what happens when you just add a simple scalar value to a tensor, like:
>>> torch.ones((2, 2)) + 3.
tensor([[4., 4.],
[4., 4.]])
But instead of broadcasting the one scalar value to every element of the tensor during the addition, in the aforementioned case the values are broadcasted along the dimensions that do not already match.
>>> B=1; W=2; H=2; val1=3; val2=7
>>> pred = torch.zeros((B,2,W,H))
>>> val = torch.tensor([val1, val2]).reshape((1,-1,1,1))
>>> pred
tensor([[[[0., 0.],
          [0., 0.]],
         [[0., 0.],
          [0., 0.]]]])
>>> val
tensor([[[[3]],
         [[7]]]])
>>> pred + val
tensor([[[[3., 3.],
          [3., 3.]],
         [[7., 7.],
          [7., 7.]]]])
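An equivalent way to get the broadcastable shape (a small sketch, not from the original answer) is to insert the singleton dimensions by indexing with None instead of spelling out the reshape:
val = torch.tensor([val1, val2])[None, :, None, None]  # shape (1, 2, 1, 1)
print(torch.equal(pred + val, pred + torch.tensor([val1, val2]).reshape((1, -1, 1, 1))))  # True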

What is the derivative of Shannon's Entropy?

I have the following simple python function that calculates the entropy of a single input X according to Shannon's Theory of Information:
import numpy as np
def entropy(X: 'numpy array'):
    _, frequencies = np.unique(X, return_counts=True)
    probabilities = frequencies / X.shape[0]
    return -np.sum(probabilities * np.log2(probabilities))
a = np.array([1., 1., 1., 3., 3., 2.])
b = np.array([1., 1., 1., 3., 3., 3.])
c = np.array([1., 1., 1., 1., 1., 1.])
print(f"entropy(a): {entropy(a)}")
print(f"entropy(b): {entropy(b)}")
print(f"entropy(c): {entropy(c)}")
With the output being the following:
entropy(a): 1.4591479170272446
entropy(b): 1.0
entropy(c): -0.0
However, I also need to calculate the derivative with respect to x:
d entropy / dx
This is not an easy task, since the main formula
-np.sum(probabilities*np.log2(probabilities))
takes in probabilities, not x values, so it is not clear how to differentiate with respect to x.
Does anyone have an idea on how to do this?
One way to solve this is to use finite differences to compute the derivative numerically.
In this context, we can define a small constant to help us compute the numerical derivative. This function takes a one-argument function and computes its derivative for input x:
ε = 1e-12
def derivative(f, x):
    return (f(x + ε) - f(x)) / ε
To make our work easier, let us define a function that computes the innermost operation of the entropy:
def inner(x):
    return x * np.log2(x)
Recall that the derivative of the sum is the sum of derivatives. Therefore, the real derivative computation takes place in the inner function we just defined.
So, the numerical derivative of the entropy is:
def numerical_dentropy(X):
    _, frequencies = np.unique(X, return_counts=True)
    probabilities = frequencies / X.shape[0]
    return -np.sum([derivative(inner, p) for p in probabilities])
Can we do better? Of course we can! The key insight here is the product rule: (f g)' = fg' + gf', where f=x and g=np.log2(x). (Also notice that d[log_a(x)]/dx = 1/(x ln(a)).)
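Applied here, with f = p and g = log2(p), that gives d/dp [p * log2(p)] = log2(p) + p * (1 / (p * ln 2)) = log2(p) + 1/ln(2), which is exactly the term summed in the code below.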
So, the analytical entropy can be computed as:
import math
def dentropy(X):
    _, frequencies = np.unique(X, return_counts=True)
    probabilities = frequencies / X.shape[0]
    return -np.sum([(1 / math.log(2, math.e) + np.log2(p)) for p in probabilities])
Using the sample vectors for testing, we have:
a = np.array([1., 1., 1., 3., 3., 2.])
b = np.array([1., 1., 1., 3., 3., 3.])
c = np.array([1., 1., 1., 1., 1., 1.])
print(f"numerical d[entropy(a)]: {numerical_dentropy(a)}")
print(f"numerical d[entropy(b)]: {numerical_dentropy(b)}")
print(f"numerical d[entropy(c)]: {numerical_dentropy(c)}")
print(f"analytical d[entropy(a)]: {dentropy(a)}")
print(f"analytical d[entropy(b)]: {dentropy(b)}")
print(f"analytical d[entropy(c)]: {dentropy(c)}")
Which, when executed, gives us:
numerical d[entropy(a)]: 0.8417710972707937
numerical d[entropy(b)]: -0.8854028621385623
numerical d[entropy(c)]: -1.4428232973189605
analytical d[entropy(a)]: 0.8418398787754222
analytical d[entropy(b)]: -0.8853900817779268
analytical d[entropy(c)]: -1.4426950408889634
As a bonus, we can test whether this is correct with an automatic differentiation library:
import torch
a, b, c = torch.from_numpy(a), torch.from_numpy(b), torch.from_numpy(c)
def torch_entropy(X):
    _, frequencies = torch.unique(X, return_counts=True)
    frequencies = frequencies.type(torch.float32)
    probabilities = frequencies / X.shape[0]
    probabilities.requires_grad_(True)
    return -(probabilities * torch.log2(probabilities)).sum(), probabilities
for v in a, b, c:
    h, p = torch_entropy(v)
    print(f'torch entropy: {h}')
    h.backward()
    print(f'torch derivative: {p.grad.sum()}')
Which gives us:
torch entropy: 1.4591479301452637
torch derivative: 0.8418397903442383
torch entropy: 1.0
torch derivative: -0.885390043258667
torch entropy: -0.0
torch derivative: -1.4426950216293335

Why is the derivative of f(x) with respect to x equal to x, and not 1, in pytorch?

I am trying to understand pytorch's autograd in full, and I stumbled on this: let f(x) = x. From basic maths we know that f'(x) = 1; however, when I do that exercise in pytorch I get f'(x) = x.
z = torch.linspace(-1, 1, steps=5, requires_grad=True)
y = z
y.backward(z)
print("Z tensor is: {} \n Gradient of y with respect to z is: {}".format(z, z.grad))
I would expect to get a tensor of size 5 full of ones, but instead I get:
Z tensor is: tensor([-1.0000, -0.5000, 0.0000, 0.5000, 1.0000], requires_grad=True)
Gradient of y with respect to z is: tensor([-1.0000, -0.5000, 0.0000, 0.5000, 1.0000])
Why is this the behavior of pytorch?
First of all, given z = torch.linspace(-1, 1, steps=5, requires_grad=True) and y = z, the function is vector-valued, so the derivative of y w.r.t. z is not simply 1 but a Jacobian matrix. In your case z = [z1, z2, z3, z4, z5]^T (the superscript T denotes the transpose, i.e. a column vector), and the Jacobian of y = z is the 5x5 identity matrix.
Secondly, notice what the official doc says: "Now in this case y is no longer a scalar. torch.autograd could not compute the full Jacobian directly, but if we just want the vector-Jacobian product, simply pass the vector to backward as argument." In that case z.grad is not the actual gradient value (the Jacobian matrix) but the vector-Jacobian product.
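To make this concrete, here is a small sketch (not from the original answer) using torch.autograd.functional.jacobian: for y = z the full Jacobian is the 5x5 identity, so the vector-Jacobian product with v = z simply gives back z, which is exactly what ends up in z.grad:
import torch
z = torch.linspace(-1, 1, steps=5, requires_grad=True)
J = torch.autograd.functional.jacobian(lambda t: t, z)  # 5x5 identity: dy_i/dz_j
v = z.detach()
print(v @ J)  # tensor([-1.0000, -0.5000,  0.0000,  0.5000,  1.0000]), the vector-Jacobian product, equal to z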
EDIT:
z.grad is the actual gradient if your output y is a scalar.
See the example here:
z = torch.linspace(-1, 1, steps=5, requires_grad=True)
y = torch.sum(z)
y.backward()
z.grad
This will output:
tensor([1., 1., 1., 1., 1.])
As you can see, it is the actual gradient. Notice that the only difference is that y is a scalar value here, while it was a vector value in your example: grad can be implicitly created only for scalar outputs.
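That last clause is in fact the error message PyTorch raises if you call backward() on a non-scalar output without passing a gradient argument. A quick illustration (not from the original answer):
z = torch.linspace(-1, 1, steps=5, requires_grad=True)
y = z
y.backward()  # RuntimeError: grad can be implicitly created only for scalar outputs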
You might wonder what happens if the gradient is not constant but depends on the input z, as in this case:
z = torch.linspace(-1, 1, steps=5, requires_grad=True)
y = torch.sum(torch.pow(z,2))
y.backward()
z.grad
The output is:
tensor([-2., -1., 0., 1., 2.])
It is the same as
z = torch.linspace(-1, 1, steps=5, requires_grad=True)
y = torch.sum(torch.pow(z,2))
y.backward(torch.tensor(1.))
z.grad
The blitz tutorial is kind of brief so it is actually quite hard to understand for beginners.
After discussing this with a colleague, he found that the backward() method actually multiplies the gradient evaluated at z with z itself (the vector passed as its argument). This makes sense for neural network applications. A short code snippet to understand this is the following:
z = torch.linspace(1, 5, steps=5, requires_grad=True)
y = torch.pow(z,2)
y.backward(z)
print("Z tensor is: {} \n Gradient of y with respect to z is: {}".format(z, z.grad/z))
The output is:
Z tensor is: tensor([1., 2., 3., 4., 5.], requires_grad=True)
Gradient of y with respect to z is: tensor([ 2., 4., 6., 8., 10.], grad_fn=<DivBackward0>)
In this case, you can see that z.grad divided by z is the actual expected gradient of z, which is 2*z.
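Equivalently (a small sketch, not from the original answer), passing an all-ones tensor to backward() yields the elementwise gradient directly, without the division by z:
z = torch.linspace(1, 5, steps=5, requires_grad=True)
y = torch.pow(z, 2)
y.backward(torch.ones_like(z))
print(z.grad)  # tensor([ 2.,  4.,  6.,  8., 10.]), i.e. 2*z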

Does pytorch do eager pruning of its computational graph?

This is a very simple example:
import torch
x = torch.tensor([1., 2., 3., 4., 5.], requires_grad=True)
y = torch.tensor([2., 2., 2., 2., 2.], requires_grad=True)
z = torch.tensor([1., 1., 0., 0., 0.], requires_grad=True)
s = torch.sum(x * y * z)
s.backward()
print(x.grad)
This will print,
tensor([2., 2., 0., 0., 0.]),
since, of course, ds/dx is zero for the entries where z is zero.
My question is: is pytorch smart enough to stop the computations when it reaches a zero? Or does it in fact do the calculation "2 * 5", only to later do "10 * 0 = 0"?
In this simple example it doesn't make a big difference, but in the (bigger) problem I am looking at, this will make a difference.
Thank you for any input.
No, pytorch does no such thing as pruning subsequent calculations when a zero is reached. Even worse, due to how floating-point arithmetic works, all subsequent multiplications by zero take roughly the same time as any regular multiplication.
For some cases there are ways around it, though: for example, if you want to use a masked loss you can simply set the masked outputs to zero, or detach them from the gradient computation.
This example makes the difference clear:
import time
import torch

def time_backward(do_detach):
    x = torch.rand(100000000, requires_grad=True)
    y = torch.rand(100000000, requires_grad=True)
    s2 = torch.sum(x * y)
    s1 = torch.sum(x * y)
    if do_detach:
        s2 = s2.detach()
    s = s1 + 0 * s2
    t = time.time()
    s.backward()
    print(time.time() - t)

time_backward(do_detach=False)
time_backward(do_detach=True)
outputs:
0.502875089645
0.198422908783
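To illustrate the masked-loss idea mentioned above (a rough sketch of my own, with a made-up mask and loss, not code from the answer): detaching the masked entries keeps them out of the backward pass instead of multiplying their contribution by zero.
import torch

pred = torch.randn(5, requires_grad=True)
target = torch.randn(5)
mask = torch.tensor([True, True, False, False, False])

# Gradient only flows through the positions where mask is True;
# the detached copies contribute no gradient at all.
masked_pred = torch.where(mask, pred, pred.detach())
loss = ((masked_pred - target) ** 2).mean()
loss.backward()
print(pred.grad)  # zeros at the masked-out positions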

Create a Transformation Matrix out of Scalar Angle Tensors

Original Question
I want to create a custom Lambda function using keras that does the forward kinematics of an articulated arm.
This function has a set of angles as input and should output a vector containing the position and orientation of the end effector.
I could create this function in numpy easily; but when I wanted to move it to Keras, things got hard.
Since the input and the output of the lambda function are tensors, all operations should be done using tensors and the backend operations.
The problem is that I have to create a transformation matrix out of the input angles.
I could use K.cos and K.sin (K is the TensorFlow backend) to compute the cosines and sines of the angles. But the problem is how to create a 4x4 matrix tensor in which some cells are plain numbers (0 or 1) while the others are parts of a tensor.
For example for a Z rotation :
T = tf.convert_to_tensor([[c, -s, 0, dX],
                          [s,  c, 0, dY],
                          [0,  0, 1, dZ],
                          [0,  0, 0, 1]])
Here c and s are computed using K.cos(input[3]) and K.sin(input[3]).
This does not work. I get:
ValueError: Shapes must be equal rank, but are 1 and 0
From merging shape 1 with other shapes. for 'lambda_1/packed/0' (op: 'Pack') with input shapes: [5], [5], [], [].
Any suggestions?
Further Problems
The code provided by @Aldream did work fine.
The problem is when I embed this into a Lambda layer, I get an error when I compile the model.
...
self.model.add(Lambda(self.FK_Keras))
self.model.compile(optimizer="adam", loss='mse', metrics=['mse'])
As you can see, I use a class that holds the model and the various functions.
First I have a helper function that computes the transformation matrix:
def trig_K(angle):
    r = angle * np.pi / 180.0
    return K.cos(r), K.sin(r)

def T_matrix_K(rotation, axis="z", translation=K.constant([0, 0, 0])):
    c, s = trig_K(rotation)
    dX = translation[0]
    dY = translation[1]
    dZ = translation[2]
    if axis == "z":
        T = K.stack([[c, -s, 0., dX],
                     [s,  c, 0., dY],
                     [0., 0., 1., dZ],
                     [0., 0., 0., 1.]], axis=0)
    if axis == "y":
        T = K.stack([[c,  0., -s, dX],
                     [0., 1., 0., dY],
                     [s,  0.,  c, dZ],
                     [0., 0., 0., 1.]], axis=0)
    if axis == "x":
        T = K.stack([[1., 0., 0., dX],
                     [0., c, -s, dY],
                     [0., s,  c, dZ],
                     [0., 0., 0., 1.]], axis=0)
    return T
Then FK_Keras computes the end effector transformation:
def FK_Keras(self, angs):
    # Compute local transformations
    base_T = T_matrix_K(angs[0], "z", self.base_pos_K)
    shoulder_T = T_matrix_K(angs[1], "y", self.shoulder_pos_K)
    elbow_T = T_matrix_K(angs[2], "y", self.elbow_pos_K)
    wrist_1_T = T_matrix_K(angs[3], "y", self.wrist_1_pos_K)
    wrist_2_T = T_matrix_K(angs[4], "x", self.wrist_2_pos_K)
    # Compute end effector transformation
    end_effector_T = K.dot(base_T, K.dot(shoulder_T, K.dot(elbow_T, K.dot(wrist_1_T, wrist_2_T))))
    # Compute yaw, pitch, roll of the end effector
    y = K.tf.atan2(end_effector_T[1, 0], end_effector_T[1, 1])
    p = K.tf.atan2(-end_effector_T[2, 0], K.tf.sqrt(end_effector_T[2, 1] * end_effector_T[2, 1] + end_effector_T[2, 2] * end_effector_T[2, 2]))
    r = K.tf.atan2(end_effector_T[2, 1], end_effector_T[2, 2])
    # Construct the output tensor [x, y, z, yaw, pitch, roll]
    output = K.stack([end_effector_T[0, 3], end_effector_T[1, 3], end_effector_T[2, 3], y, p, r], axis=0)
    return output
Here self.base_pos_K and the other translation vectors are constants:
self.base_pos_K = K.constant(np.array([x,y,z]))
The code gets stuck in the compile function and returns this error:
ValueError: Shapes must be equal rank, but are 1 and 0
From merging shape 1 with other shapes. for 'lambda_1/stack_1' (op: 'Pack') with input shapes: [5], [5], [], [].
I tried to create a fast test code like this :
arm = Bot("")
# Articulation angles
input_data =np.array([90., 180., 45., 25., 25.])
sess = K.get_session()
inp = K.placeholder(shape=(5), name="inp")
res = sess.run(arm.FK_Keras(inp),{inp: input_data})
This code does work with no errors.
So there is something about integrating this into a Lambda layer of a sequential model.
Problem Solved
Indeed, the problem was related to the way Keras deals with data: it adds a batch dimension, which should be taken into consideration while implementing the function.
I dealt with this in a different way, which involved reimplementing T_matrix_K to deal with this extra dimension, but I think the way proposed by @Aldream is more elegant.
Many thanks to @Aldream. His answers were quite helpful.
Using K.stack():
import keras
import keras.backend as K
input = K.constant([3.14, 0., 0, 3.14])
dX, dY, dZ = K.constant(1.), K.constant(2.), K.constant(3.)
c, s = K.cos(input[3]), K.sin(input[3])
T = K.stack([[ c, -s, 0., dX],
             [ s,  c, 0., dY],
             [0., 0., 1., dZ],
             [0., 0., 0., 1.]], axis=0)
sess = K.get_session()
res = sess.run(T)
print(res)
# [[ -9.99998748e-01 -1.59254798e-03 0.00000000e+00 1.00000000e+00]
# [ 1.59254798e-03 -9.99998748e-01 0.00000000e+00 2.00000000e+00]
# [ 0.00000000e+00 0.00000000e+00 1.00000000e+00 3.00000000e+00]
# [ 0.00000000e+00 0.00000000e+00 0.00000000e+00 1.00000000e+00]]
How to use with Lambda:
Keras layers expect and deal with batched data. Keras would, for instance, assume that the input (angs) of your Lambda(FK_Keras) layer is of shape (batch_size, 5). Your FK_Keras() thus needs to be adapted to deal with such inputs.
A rather straightforward way to do so, requiring only minor edits to your T_matrix_K(), is to use K.map_fn() to loop over every list of angles in the batch and apply the proper T_matrix_K() function to each.
Other minor changes to deal with batches:
Using K.batch_dot() instead of K.dot()
Broadcasting your constant tensors accordingly, e.g. self.base_pos_K
Taking into account the additional first (batch) dimension of batched tensors, e.g. replacing end_effector_T[1,0] by end_effector_T[:, 1,0]
Find below a shortened working code (extending to all joints is left to you):
import keras
import keras.backend as K
from keras.layers import Lambda, Dense
from keras.models import Model, Sequential
import numpy as np
def trig_K(angle):
    r = angle * np.pi / 180.0
    return K.cos(r), K.sin(r)

def T_matrix_K_z(x):
    rotation, translation = x[0], x[1]
    c, s = trig_K(rotation)
    T = K.stack([[c, -s, 0., translation[0]],
                 [s,  c, 0., translation[1]],
                 [0., 0., 1., translation[2]],
                 [0., 0., 0., 1.]], axis=0)
    # We have 2 inputs, so have to return 2 outputs for `K.map_fn()`:
    return T, 0.

def T_matrix_K_y(x):
    rotation, translation = x[0], x[1]
    c, s = trig_K(rotation)
    T = K.stack([[c,  0., -s, translation[0]],
                 [0., 1., 0., translation[1]],
                 [s,  0.,  c, translation[2]],
                 [0., 0., 0., 1.]], axis=0)
    # We have 2 inputs, so have to return 2 outputs for `K.map_fn()`:
    return T, 0.

def FK_Keras(angs):
    base_pos_K = K.constant(np.array([1, 2, 3]))      # replace with your self.base_pos_K
    shoulder_pos_K = K.constant(np.array([1, 2, 3]))  # replace with your self.shoulder_pos_K
    # Manually broadcast your constants to batches:
    batch_size = K.shape(angs)[0]
    base_pos_K = K.tile(K.expand_dims(base_pos_K, 0), (batch_size, 1))
    shoulder_pos_K = K.tile(K.expand_dims(shoulder_pos_K, 0), (batch_size, 1))
    # Compute local transformations, for each list of angles in the batch:
    base_T, _ = K.map_fn(T_matrix_K_z, (angs[:, 0], base_pos_K))
    shoulder_T, _ = K.map_fn(T_matrix_K_y, (angs[:, 1], shoulder_pos_K))
    # ... (repeat with your other joints)
    # Compute end effector transformation, over the batch:
    end_effector_T = K.batch_dot(base_T, shoulder_T)  # add your other joints
    # Compute yaw, pitch, roll of the end effector
    y = K.tf.atan2(end_effector_T[:, 1, 0], end_effector_T[:, 1, 1])
    p = K.tf.atan2(-end_effector_T[:, 2, 0],
                   K.tf.sqrt(end_effector_T[:, 2, 1] * end_effector_T[:, 2, 1] +
                             end_effector_T[:, 2, 2] * end_effector_T[:, 2, 2]))
    r = K.tf.atan2(end_effector_T[:, 2, 1], end_effector_T[:, 2, 2])
    # Construct the output tensor [x, y, z, yaw, pitch, roll]
    output = K.stack([end_effector_T[:, 0, 3], end_effector_T[:, 1, 3], end_effector_T[:, 2, 3], y, p, r], axis=1)
    return output

# Demonstration:
input_data = np.array([[90., 180., 45., 25., 25.], [90., 180., 45., 25., 25.]])
sess = K.get_session()
inp = K.placeholder(shape=(None, 5), name="inp")
res = sess.run(FK_Keras(inp), {inp: input_data})

model = Sequential()
model.add(Dense(5, input_dim=5))
model.add(Lambda(FK_Keras))
model.compile(optimizer="adam", loss='mse', metrics=['mse'])
