Does pytorch do eager pruning of its computational graph?

This is a very simple example:
import torch
x = torch.tensor([1., 2., 3., 4., 5.], requires_grad=True)
y = torch.tensor([2., 2., 2., 2., 2.], requires_grad=True)
z = torch.tensor([1., 1., 0., 0., 0.], requires_grad=True)
s = torch.sum(x * y * z)
s.backward()
print(x.grad)
This prints
tensor([2., 2., 0., 0., 0.])
since, of course, ds/dx is zero for the entries where z is zero.
My question is: is PyTorch smart enough to stop the computation when it reaches a zero? Or does it in fact do the calculation "2*5", only to later do "10 * 0 = 0"?
In this simple example it doesn't make a big difference, but in the (bigger) problem I am looking at, this will make a difference.
Thank you for any input.

No, PyTorch does not prune subsequent calculations when a zero is reached. Even worse, because of how floating-point arithmetic works, a multiplication by zero takes roughly the same time as any other multiplication.
In some cases there are ways around it, though. For example, if you want to use a masked loss you can set the masked outputs to zero, or detach them from the graph so that no gradient flows through them.
This example makes the difference clear:
import time
import torch

def time_backward(do_detach):
    # Two large leaf tensors that track gradients
    x = torch.rand(100000000, requires_grad=True)
    y = torch.rand(100000000, requires_grad=True)
    s2 = torch.sum(x * y)
    s1 = torch.sum(x * y)
    if do_detach:
        # Cut s2 out of the graph so backward() skips its branch
        s2 = s2.detach()
    s = s1 + 0 * s2
    t = time.time()
    s.backward()
    print(time.time() - t)

time_backward(do_detach=False)
time_backward(do_detach=True)
outputs:
0.502875089645
0.198422908783
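When the zero pattern is known before the forward pass, as with z in the question, one option is to index out the nonzero entries so that the skipped elements never enter the graph. The following is a minimal sketch under that assumption; the name idx is introduced here for illustration, z is treated as a fixed mask without gradients, and whether this is actually faster depends on how sparse the mask is, since the indexing has its own cost.

```python
import torch

x = torch.tensor([1., 2., 3., 4., 5.], requires_grad=True)
y = torch.tensor([2., 2., 2., 2., 2.], requires_grad=True)
z = torch.tensor([1., 1., 0., 0., 0.])  # assumed to be a fixed mask

# Positions where the mask is nonzero (hypothetical helper name)
idx = torch.nonzero(z, as_tuple=True)[0]

# Only the selected entries take part in the forward and backward pass
s = torch.sum(x[idx] * y[idx] * z[idx])
s.backward()

# Entries that were never selected receive a zero gradient
print(x.grad)  # tensor([2., 2., 0., 0., 0.])
```

The gradient matches the original example, but the products for the masked positions are never computed.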

Related

Need help understanding the gradient function in pytorch

Consider the following code:
import numpy as np
import torch

w = np.array([[2., 2.],[2., 2.]])
x = np.array([[3., 3.],[3., 3.]])
b = np.array([[4., 4.],[4., 4.]])
w = torch.tensor(w, requires_grad=True)
x = torch.tensor(x, requires_grad=True)
b = torch.tensor(b, requires_grad=True)
y = w*x + b
print(y)
# tensor([[10., 10.],
# [10., 10.]], dtype=torch.float64, grad_fn=<AddBackward0>)
y.backward(torch.FloatTensor([[1, 1],[ 1, 1]]))
print(w.grad)
# tensor([[3., 3.],
# [3., 3.]], dtype=torch.float64)
print(x.grad)
# tensor([[2., 2.],
# [2., 2.]], dtype=torch.float64)
print(b.grad)
# tensor([[1., 1.],
# [1., 1.]], dtype=torch.float64)
Since the tensor passed to backward() is an all-ones tensor with the same shape as the input, my understanding is that
1. w.grad is the derivative of y w.r.t. w, and should produce b,
2. x.grad is the derivative of y w.r.t. x, and should produce b, and
3. b.grad is the derivative of y w.r.t. b, and should produce all ones.
Of these, only the result for point 3 matches my expectation. Can someone help me understand the first two? I think I understand gradient accumulation, but I don't think that is what is happening here.

To find the correct derivatives in this example, we need to take the sum rule and the product rule into consideration.
Sum rule: d/dx [f(x) + g(x)] = f'(x) + g'(x)
Product rule: d/dx [f(x) g(x)] = f'(x) g(x) + f(x) g'(x)
That means the derivatives of your equation y = w*x + b are calculated as follows.
With respect to x: dy/dx = w
With respect to w: dy/dw = x
With respect to b: dy/db = 1
The gradients reflect exactly that:
torch.equal(w.grad, x) # => True
torch.equal(x.grad, w) # => True
torch.equal(b.grad, torch.tensor([[1, 1], [1, 1]], dtype=torch.float64)) # => True
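The tensor passed to backward() acts as a weight on each element of y: the stored gradient is that tensor multiplied elementwise by the local derivative. A small sketch with a non-uniform weight tensor (the name v and its values are made up for illustration) makes this visible:

```python
import torch

w = torch.full((2, 2), 2.0, dtype=torch.float64, requires_grad=True)
x = torch.full((2, 2), 3.0, dtype=torch.float64, requires_grad=True)
b = torch.full((2, 2), 4.0, dtype=torch.float64, requires_grad=True)

y = w * x + b

# Hypothetical upstream gradient with a different weight per element
v = torch.tensor([[1.0, 10.0], [100.0, 1000.0]], dtype=torch.float64)
y.backward(v)

print(torch.equal(w.grad, v * x.detach()))  # v * dy/dw = v * x
print(torch.equal(x.grad, v * w.detach()))  # v * dy/dx = v * w
print(torch.equal(b.grad, v))               # v * dy/db = v * 1
```

With an all-ones v, as in the question, these reduce to x, w and ones respectively.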

What is the derivative of Shannon's Entropy?

I have the following simple python function that calculates the entropy of a single input X according to Shannon's Theory of Information:
import numpy as np
def entropy(X:'numpy array'):
    # Count how often each distinct value occurs
    _, frequencies = np.unique(X, return_counts=True)
    probabilities = frequencies/X.shape[0]
    return -np.sum(probabilities*np.log2(probabilities))
a = np.array([1., 1., 1., 3., 3., 2.])
b = np.array([1., 1., 1., 3., 3., 3.])
c = np.array([1., 1., 1., 1., 1., 1.])
print(f"entropy(a): {entropy(a)}")
print(f"entropy(b): {entropy(b)}")
print(f"entropy(c): {entropy(c)}")
With the output being the following:
entropy(a): 1.4591479170272446
entropy(b): 1.0
entropy(c): -0.0
However, I also need to calculate the derivative with respect to x:
d entropy / dx
This is not an easy task, since the main formula
-np.sum(probabilities*np.log2(probabilities))
takes in probabilities, not x values, so it is not clear how to differentiate with respect to x.
Does anyone have an idea on how to do this?

One way to solve this is to use finite differences to compute the derivative numerically.
In this context, we can define a small constant to help us compute the numerical derivative. This function takes a one-argument function and computes its derivative for input x:
ε = 1e-12
def derivative(f, x):
    # Forward-difference approximation of f'(x)
    return (f(x + ε) - f(x)) / ε
To make our work easier, let us define a function that computes the innermost operation of the entropy:
def inner(x):
    return x * np.log2(x)
Recall that the derivative of the sum is the sum of derivatives. Therefore, the real derivative computation takes place in the inner function we just defined.
So, the numerical derivative of the entropy is:
def numerical_dentropy(X):
    _, frequencies = np.unique(X, return_counts=True)
    probabilities = frequencies / X.shape[0]
    return -np.sum([derivative(inner, p) for p in probabilities])
Can we do better? Of course we can! The key insight here is the product rule: (f g)' = fg' + gf', where f=x and g=np.log2(x). (Also notice that d[log_a(x)]/dx = 1/(x ln(a)).) Applying it gives d[x log2(x)]/dx = log2(x) + 1/ln(2).
So, the analytical derivative of the entropy can be computed as:
import math

def dentropy(X):
    _, frequencies = np.unique(X, return_counts=True)
    probabilities = frequencies / X.shape[0]
    # d[p * log2(p)]/dp = 1/ln(2) + log2(p)
    return -np.sum([(1/math.log(2, math.e) + np.log2(p)) for p in probabilities])
Using the sample vectors for testing, we have:
a = np.array([1., 1., 1., 3., 3., 2.])
b = np.array([1., 1., 1., 3., 3., 3.])
c = np.array([1., 1., 1., 1., 1., 1.])
print(f"numerical d[entropy(a)]: {numerical_dentropy(a)}")
print(f"numerical d[entropy(b)]: {numerical_dentropy(b)}")
print(f"numerical d[entropy(c)]: {numerical_dentropy(c)}")
print(f"analytical d[entropy(a)]: {dentropy(a)}")
print(f"analytical d[entropy(b)]: {dentropy(b)}")
print(f"analytical d[entropy(c)]: {dentropy(c)}")
Which, when executed, gives us:
numerical d[entropy(a)]: 0.8417710972707937
numerical d[entropy(b)]: -0.8854028621385623
numerical d[entropy(c)]: -1.4428232973189605
analytical d[entropy(a)]: 0.8418398787754222
analytical d[entropy(b)]: -0.8853900817779268
analytical d[entropy(c)]: -1.4426950408889634
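The numerical and analytical values above differ in about the fourth decimal place. A likely cause is the very small step of 1e-12 in the forward difference, where floating-point rounding dominates. As a sketch of one possible refinement, a central difference with a larger step usually tracks the analytical value more closely; the step 1e-6 and the helper names below are assumptions made for illustration.

```python
import numpy as np

def inner(p):
    return p * np.log2(p)

def central_derivative(f, x, h=1e-6):
    # Central difference: truncation error shrinks with h**2
    return (f(x + h) - f(x - h)) / (2 * h)

def analytical_inner_derivative(p):
    # d[p * log2(p)]/dp = log2(p) + 1/ln(2)
    return np.log2(p) + 1 / np.log(2)

# Probabilities of the sample vector a = [1, 1, 1, 3, 3, 2]
probabilities = np.array([3 / 6, 1 / 6, 2 / 6])

numerical = -sum(central_derivative(inner, p) for p in probabilities)
analytical = -sum(analytical_inner_derivative(p) for p in probabilities)

print(numerical, analytical)
print(abs(numerical - analytical))
```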
As a bonus, we can test whether this is correct with an automatic differentiation library:
import torch
a, b, c = torch.from_numpy(a), torch.from_numpy(b), torch.from_numpy(c)
def torch_entropy(X):
    _, frequencies = torch.unique(X, return_counts=True)
    frequencies = frequencies.type(torch.float32)
    probabilities = frequencies / X.shape[0]
    probabilities.requires_grad_(True)
    return -(probabilities * torch.log2(probabilities)).sum(), probabilities

for v in a, b, c:
    h, p = torch_entropy(v)
    print(f'torch entropy: {h}')
    h.backward()
    print(f'torch derivative: {p.grad.sum()}')
Which gives us:
torch entropy: 1.4591479301452637
torch derivative: 0.8418397903442383
torch entropy: 1.0
torch derivative: -0.885390043258667
torch entropy: -0.0
torch derivative: -1.4426950216293335

PyCUDA when using multiple blocks to deal with matrix operation, why does matrix size have to be divisible by the block size?

I am learning GPU programming with PyCUDA, and I am a bit confused about how a matrix operation is split across blocks. As an example, I want to reproduce the calculation
a = np.array([1,2,3,4,5,6])
c = a[:,np.newaxis] - a
which should give
c = [[0,-1,-2,-3,-4,-5],
[1,0,-1,-2,-3,-4],
[2,1,0,-1,-2,-3],
[3,2,1,0,-1,-2],
[4,3,2,1,0,-1],
[5,4,3,2,1,0]]
on the GPU.
With the code below, if I use the same size for the matrix and the block, everything works fine. But when I set the block size to 4 to test the computation across multiple blocks, things go wrong. I checked blockDim for each entry of the output c, and some entries show a blockDim of 0 when they should all be 4.
array([[4., 4., 4., 4., 4., 4.],
[0., 0., 4., 4., 4., 4.],
[0., 0., 4., 4., 4., 4.],
[0., 0., 4., 4., 4., 4.],
[4., 4., 4., 4., 4., 4.],
[4., 4., 4., 4., 4., 4.]], dtype=float32)
and threadIdx.x shows wrong values at the same positions:
array([[0., 1., 2., 3., 0., 1.],
[0., 0., 2., 3., 0., 1.],
[0., 0., 2., 3., 0., 1.],
[0., 0., 2., 3., 0., 1.],
[0., 1., 2., 3., 0., 1.],
[0., 1., 2., 3., 0., 1.]], dtype=float32)
This is very strange.
Reproducible code is as follows.
import numpy as np
from pycuda import compiler, gpuarray, tools
import pycuda.driver as drv
# -- initialize the device
import pycuda.autoinit
kernel_code_template = """
__global__ void com_t(float *a, float *c)
{
    // 2D Thread ID
    int tx = blockDim.x*blockIdx.x + threadIdx.x; // Compute column index
    int ty = blockDim.y*blockIdx.y + threadIdx.y; // Compute row index
    // Pvalue is used to store the element of the matrix
    // that is computed by the thread
    float Pvalue = 0;
    // For debugging, each thread stores blockDim.x
    // instead of the actual difference.
    float Aelement = blockDim.x;
    float Belement = 0;
    Pvalue = Aelement - Belement;
    // Write the matrix to device memory;
    // each thread writes one element
    c[ty * %(MATRIX_SIZE)s + tx] = Pvalue;
}
"""
MATRIX_SIZE = 6
BLOCK_SIZE = 6
start = drv.Event()
end = drv.Event()
# # create a random vector
a_cpu = np.array([i for i in range(MATRIX_SIZE)]).astype(np.float32)
# compute reference on the CPU to verify GPU computation
start.record() # start timing
start.synchronize()
c_cpu = a_cpu[:,np.newaxis] - a_cpu
end.record() # end timing
# calculate the run length
end.synchronize()
secs = start.time_till(end)*1e-3
print("CPU time:")
print("%fs" % (secs))
# transfer host (CPU) memory to device (GPU) memory
a_gpu = gpuarray.to_gpu(a_cpu)
# create empty gpu array for the result (C = A * B)
c_gpu = gpuarray.empty((MATRIX_SIZE, MATRIX_SIZE), np.float32)
# get the kernel code from the template
# by specifying the constant MATRIX_SIZE
kernel_code = kernel_code_template % {
'MATRIX_SIZE': MATRIX_SIZE
}
# compile the kernel code
mod = compiler.SourceModule(kernel_code)
# get the kernel function from the compiled module
matrixmul = mod.get_function("com_t")
start.record() # start timing
# set grid size (round up so the whole matrix is covered)
if MATRIX_SIZE%BLOCK_SIZE != 0:
    grid=(MATRIX_SIZE//BLOCK_SIZE+1,MATRIX_SIZE//BLOCK_SIZE+1,1)
else:
    grid=(MATRIX_SIZE//BLOCK_SIZE,MATRIX_SIZE//BLOCK_SIZE,1)
# call the kernel on the card
matrixmul(
    # inputs
    a_gpu,
    # output
    c_gpu,
    grid = grid,
    # each block has BLOCK_SIZE x BLOCK_SIZE threads
    block = (BLOCK_SIZE, BLOCK_SIZE, 1),
    )
end.record() # end timing
end.synchronize()
secs = start.time_till(end)*1e-3
print("GPU time:")
print("%fs" % (secs))
# print the results
print("-" * 80)
print("Matrix A (GPU):")
print(a_gpu.get())
print("-" * 80)
print("Matrix C (GPU):")
print(c_gpu.get())
print("-" * 80)
print("CPU-GPU difference:")
print(c_cpu - c_gpu.get())
np.allclose(c_cpu, c_gpu.get())

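Problem solved. The write to matrix c should be guarded by a bounds check like
if((ty <matrixsize) && (tx < matrixsize))
where matrixsize is the matrix dimension (MATRIX_SIZE in the template above).
Otherwise the surplus threads, those with tx or ty beyond the matrix size, still run and write to c: their flat index ty * matrixsize + tx either lands in another row and overwrites a correct entry, or falls outside the array.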
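To see which entries are affected without a GPU, the index arithmetic of the kernel can be replayed on the CPU. The sketch below is plain Python, not PyCUDA; it assumes the question's setup with MATRIX_SIZE = 6 and BLOCK_SIZE = 4, and the function name simulate_writes is made up for illustration.

```python
MATRIX_SIZE = 6
BLOCK_SIZE = 4
GRID = -(-MATRIX_SIZE // BLOCK_SIZE)  # ceiling division, 2 blocks per axis

def simulate_writes(guard):
    """Map each flat index of c to the (tx, ty) threads that write to it."""
    writes = {}
    for block_y in range(GRID):
        for block_x in range(GRID):
            for thread_y in range(BLOCK_SIZE):
                for thread_x in range(BLOCK_SIZE):
                    tx = BLOCK_SIZE * block_x + thread_x
                    ty = BLOCK_SIZE * block_y + thread_y
                    # The bounds check from the answer
                    if guard and not (tx < MATRIX_SIZE and ty < MATRIX_SIZE):
                        continue
                    writes.setdefault(ty * MATRIX_SIZE + tx, []).append((tx, ty))
    return writes

unguarded = simulate_writes(guard=False)
guarded = simulate_writes(guard=True)

# Entries inside the 6 x 6 matrix that two different threads write to
clashes = sorted(i for i, w in unguarded.items()
                 if len(w) > 1 and i < MATRIX_SIZE ** 2)
print([divmod(i, MATRIX_SIZE) for i in clashes])  # (row, column) pairs

# Writes that fall past the end of the 36-element array
print(sorted(i for i in unguarded if i >= MATRIX_SIZE ** 2))

# With the guard, every entry has exactly one writer
print(all(len(w) == 1 for w in guarded.values()), len(guarded))
```

Without the guard, the clashing entries are columns 0 and 1 of rows 1 to 5, which covers the region that came out wrong in the question's output. Which of the two writers wins on a real device depends on scheduling.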

Expanding tensor using native tensorflow ops

I have one-dimensional data (floats) as shown below:
[-8., 18., 9., -3., 12., 11., -13., 38., ...]
I want to replace each negative element with as many zeros as its absolute value (so -8 becomes eight zeros).
For the example above, my result would look something like this:
[0., 0., 0., 0., 0., 0., 0., 0., 18., 9., 0., 0., 0., 12., ...]
I am able to do this in Tensorflow by using tf.py_func().
But it turns out the graph is not serializable if I use that method.
Are there native tensorflow ops that can help me get the same result?

Not a straightforward task! Here is a pure TensorFlow implementation:
import tensorflow as tf
# Input vector
inp = tf.placeholder(tf.int32, [None])
# Find positive and negative indices
mask = inp < 0
num_inputs = tf.size(inp)
pos_idx, neg_idx = tf.dynamic_partition(tf.range(num_inputs), tf.cast(mask, tf.int32), 2)
# Negative values
negs = -tf.gather(inp, neg_idx)
total_neg = tf.reduce_sum(negs)
cum_neg = tf.cumsum(negs)
# Compute the final index of each positive element
pos_neg_idx = tf.cast(pos_idx[:, tf.newaxis] > neg_idx, inp.dtype)
neg_ref = tf.reduce_sum(pos_neg_idx, axis=1)
shifts = tf.gather(tf.concat([[0], cum_neg], axis=0), neg_ref) - neg_ref
final_pos_idx = pos_idx + shifts
# Compute the final size
final_size = num_inputs + total_neg - tf.size(negs)
# Make final vector by scattering positive values
result = tf.scatter_nd(final_pos_idx[:, tf.newaxis], tf.gather(inp, pos_idx), [final_size])
with tf.Session() as sess:
    print(sess.run(result, feed_dict={inp: [-1, 1, -2, 2, 1, -3]}))
Output:
[0 1 0 0 2 1 0 0 0]
This solution does more work than strictly necessary: computing the final indices of the positive elements through pos_neg_idx is O(n^2), while it could be done iteratively in O(n). However, I cannot think of a way to express that loop with vectorized ops, and a TensorFlow loop (using tf.while_loop) would be awkward and slow. In any case, unless you are using quite large vectors (with evenly distributed positive and negative values) it should not be a big issue.
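For checking the graph's output on small inputs, the same expansion can be written in a few lines of NumPy. This is only a reference sketch outside TensorFlow (the function name expand_negatives is made up here), so it does not address the serializability requirement by itself.

```python
import numpy as np

def expand_negatives(values):
    """Replace each negative entry -k with k zeros; keep other entries."""
    values = np.asarray(values)
    negative = values < 0
    # A negative entry -k is repeated k times, everything else once
    counts = np.where(negative, -values, 1).astype(int)
    # The repeated value for a negative entry is zero
    filled = np.where(negative, 0, values)
    return np.repeat(filled, counts)

print(expand_negatives([-1, 1, -2, 2, 1, -3]))  # [0 1 0 0 2 1 0 0 0]
```

The result agrees with the output of the TensorFlow graph above for the same input.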

How to minimize a quadratic objective function with constraint violation using penalty method

I have compared several quadratic programming (QP) solvers such as cvxopt, qpoases and osqp, and found that osqp is the fastest and works best for my application.
Now I want to minimize an indefinite quadratic function with both equality and inequality constraints that may get violated depending on various factors. So I want to use an l1 penalty method that penalizes the violated constraints.
For example, I have modified an existing example so that its constraints conflict:
import osqp
import scipy.sparse as sparse
import numpy as np
# Define problem data
P = sparse.csc_matrix([[4., 1.], [1., 2.]])
q = np.array([1., 1.])
A = sparse.csc_matrix([[1., 0.], [0., 1.], [1., 0.], [0., 1.]])
l = np.array([0., 0., 0.2, 1.1])
u = np.array([1., 1., 0.2, 1.1])
# Create an OSQP object
prob = osqp.OSQP()
# Setup workspace and change alpha parameter
prob.setup(P, q, A, l, u, alpha=1.0)
# Solve problem
res = prob.solve()
print(res.x)
Obviously, this is an infeasible problem, so we need to change the objective function to penalize the constraint violation.
I need help formulating this problem so that it can be solved using osqp's Python interface.
Alternatively, please let me know if there is any other Python interface that can solve this kind of constraint violation problem.

In general abs functions can be dangerous (they are non-differentiable). A standard way to deal with this is to add slacks. E.g.
g(x) <= 0
becomes
g(x) <= s
s >= 0
Now add a term mu*s to the objective.
For
h(x) = 0
one could do
h(x) = s1 - s2
s1, s2 >= 0
and add mu*(s1+s2) to the objective.
As usual: this is just one approach (there are other formulations).
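One way to apply this idea to constraints written in the two-sided form l <= A x <= u is sketched below with dense NumPy arrays. It uses a single slack per constraint row (l - s <= A x <= u + s, s >= 0), which is a variant of the formulation above rather than a literal transcription; the function name soften_constraints and the value mu=10.0 are assumptions made for illustration. The arrays could be converted with scipy.sparse.csc_matrix before being handed to a solver.

```python
import numpy as np

def soften_constraints(P, q, A, l, u, mu):
    """Add slacks s >= 0 with l - s <= A x <= u + s and penalty mu * sum(s)."""
    m, n = A.shape
    # Decision vector becomes z = [x, s]; the slacks get no quadratic cost
    P_aug = np.zeros((n + m, n + m))
    P_aug[:n, :n] = P
    q_aug = np.concatenate([q, mu * np.ones(m)])
    I = np.eye(m)
    A_aug = np.vstack([
        np.hstack([A, I]),                 # A x + s >= l
        np.hstack([A, -I]),                # A x - s <= u
        np.hstack([np.zeros((m, n)), I]),  # s >= 0
    ])
    l_aug = np.concatenate([l, np.full(m, -np.inf), np.zeros(m)])
    u_aug = np.concatenate([np.full(m, np.inf), u, np.full(m, np.inf)])
    return P_aug, q_aug, A_aug, l_aug, u_aug

# Dense copy of the question's conflicting problem data
P = np.array([[4., 1.], [1., 2.]])
q = np.array([1., 1.])
A = np.array([[1., 0.], [0., 1.], [1., 0.], [0., 1.]])
l = np.array([0., 0., 0.2, 1.1])
u = np.array([1., 1., 0.2, 1.1])

P_aug, q_aug, A_aug, l_aug, u_aug = soften_constraints(P, q, A, l, u, mu=10.0)
print(A_aug.shape)  # (12, 6)
```

At any fixed x, the smallest feasible slack for a row equals the amount by which that row's bounds are violated, so the added linear term is mu times the total violation.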

I had the same problem and this question helped a lot. This is how I solved it with the OSQP interface.
I redefined the example to be:
# Define problem data
P = sparse.csc_matrix([[4., 1.], [1., 2.]])
q = np.array([1., 1.])
A = sparse.csc_matrix([[1., 0.], [0., 1.], [1., 1.]])
l = np.array([0., 0., 3])
u = np.array([1., 1., 3])
Here the first and second variables are constrained to be at most 1, but their sum must equal 3. This makes the problem infeasible.
Now let's transform the inequality constraints as Erwin suggested by adding two slack variables.
# Redefine problem data with 2 slack variables
# Added quadratic penalties to variables s1 and s2 with penalty coefficient == 1
P = sparse.csc_matrix([[4., 1., 0., 0.], [1., 2., 0., 0.], [0., 0., 1., 0.], [0., 0., 0., 1.]])
# Zero linear penalties for s1 and s2.
q = np.array([1., 1., 0., 0.])
# First constraint is x1 <= s1, second is s1 >= 0.
# Third constraint is x2 <= s2, fourth is s2 >= 0.
A = sparse.csc_matrix([[1., 0., -1., 0.], [0., 0., 1., 0.], [0., 1., 0., -1.], [0., 0., 0., 1.], [1., 1., 0., 0.]])
l = np.array([-np.inf, 0., -np.inf, 0., 3])
u = np.array([0., np.inf, 0., np.inf, 3])
When I run the solver, the problem now has a solution: x1 and x2 are penalised softly through the slacks s1 and s2 instead of being held to hard upper bounds.
iter objective pri res dua res rho time
1 -4.9403e-03 3.00e+00 5.99e+02 1.00e-01 8.31e-04s
50 1.3500e+01 1.67e-07 7.91e-08 9.96e-01 8.71e-04s
status: solved
number of iterations: 50
optimal objective: 13.5000
run time: 8.93e-04s
optimal rho estimate: 1.45e+00
[1.00 2.00 1.00 2.00]
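The reported solution can be checked by hand with NumPy, without running the solver. The sketch below assumes the OSQP convention of minimizing 0.5 * x' P x + q' x subject to l <= A x <= u, and uses dense copies of the matrices defined above; the tolerance 1e-9 is an arbitrary choice.

```python
import numpy as np

P = np.array([[4., 1., 0., 0.], [1., 2., 0., 0.], [0., 0., 1., 0.], [0., 0., 0., 1.]])
q = np.array([1., 1., 0., 0.])
A = np.array([[1., 0., -1., 0.], [0., 0., 1., 0.], [0., 1., 0., -1.],
              [0., 0., 0., 1.], [1., 1., 0., 0.]])
l = np.array([-np.inf, 0., -np.inf, 0., 3.])
u = np.array([0., np.inf, 0., np.inf, 3.])

# Solution printed by the solver: [x1, x2, s1, s2]
x = np.array([1., 2., 1., 2.])

objective = 0.5 * x @ P @ x + q @ x
Ax = A @ x
feasible = bool(np.all(Ax >= l - 1e-9) and np.all(Ax <= u + 1e-9))

print(objective)  # 13.5
print(feasible)   # True
```

The objective agrees with the "optimal objective: 13.5000" line in the solver log.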
Hope this helps somebody.
