Parallel loops with Numba -- no parallelization with prange

Parallel loops with Numba -- no parallelization with prange - python

I am implementing a sequential algorithm (Kalman Filter) with a particular structure where a lot of inner looping can be done in parallel. I need to get as much performance out of this function as possible. Currently, it runs in about 600ms on my machine with representative data inputs (n, p = 12, d = 3, T = 3000)
I have used #numba.jit with nopython=True, parallel=True and annotated my ranges with numba.prange. However, even with very large data inputs (n > 5000) there is clearly no parallelism occurring (based on just looking at cores with top).
There is quite a bit of code here, I'm showing only the main chunk. Is there a reason Numba wouldn't be able to parallelize the array operations under the prange? I have also checked numba.config.NUMBA_NUM_THREADS (it is 8) and played with different numba.config.THREADING_LAYER (it is currently 'tbb'). I have also tried with both the openblas and the MKL versions of numpy+scipy, the MKL version appears to be slightly slower, and still no parallelization.
The annotation is:
#numba.jit(nopython=True, cache=False, parallel=True,
fastmath=True, nogil=True)
And the main part of the function:
P = np.empty((T + 1, n, p, d, d))
m = np.empty((T + 1, n, p, d))
P[0] = P0
m[0] = m0
phi = 0.0
Xt = np.empty((n, p)
for t in range(1, T + 1):
sum_P00 = 0.0
v = y[t - 1]
# Purely for convenience, little performance impact
for tau in range(1, p + 1):
Xt[:, tau - 1] = X[p + t - 1 - tau]
# Predict
for i in numba.prange(n):
for tau in range(p):
# Prediction step
m[t, i, tau] = Phi[i, tau] # m[t - 1, i, tau]
P[t, i, tau] = Phi[i, tau] # P[t - 1, i, tau] # Phi[i, tau].T
# Auxiliary gain variables
for i in numba.prange(n):
for tau in range(p):
v = v - Xt[i, tau] * m[t, i, tau, 0]
sum_P00 = sum_P00 + P[t, i, tau, 0, 0]
# Energy function update
s = np.linalg.norm(Xt)**2 * sum_P00 + sv2
phi += np.pi * s + 0.5 * v**2 / s
# Update
for i in numba.prange(n):
for tau in range(p):
k = Xt[i, tau] * P[t, i, tau, :, 0] # Gain
m[t, i, tau] = m[t, i, tau] + (v / s) * k
P[t, i, tau] = P[t, i, tau] + (k / s) # k.T

It appears to simply have been a problem with running interactively in Ipython. Running a test script from the console leads to parallel execution, as expected.

Related

Vectorizing theano/aesara variable operations

I am trying to compute the function below with theano/aesara in an preferably vectorized manner:
![image|620x182](upload://9Px5wAGjZdkBXVBg4fqmuSPorPr.png)
The solution i have is not vectorized and therefore way too slow:
def apply_adstock_with_lag(x, L, P, D):
"""
params:
x: original array
L: length
P: peak, delay in effect
D: decay, retain
"""
x = np.append(np.zeros(L - 1), x)
weights = [0 for _ in range(L)]
for l in range(L):
weight = D ** ((l - P) ** 2)
weights[L - 1 - l] = weight
weights = np.array(weights)
adstocked_x = []
for i in range(L - 1, len(x)):
x_array = x[i - L + 1:i + 1]
xi = sum(x_array * weights) / sum(weights)
adstocked_x.append(xi)
adstocked_x = tt.as_tensor_variable(adstocked_x)
return adstocked_x
An similar function although simplier and its vectorized solution can be found below, note that this is much much quicker probably due to the vectorized operations:
![image|252x39](upload://ucZeqCmCXcBRAHLdA7lJ0crs1Oz.png)
def adstock_geometric_theano_pymc3(x, theta):
x = tt.as_tensor_variable(x)
def adstock_geometric_recurrence_theano(index, input_x, decay_x, theta):
return tt.set_subtensor(decay_x[index], tt.sum(input_x + theta * decay_x[index - 1]))
len_observed = x.shape[0]
x_decayed = tt.zeros_like(x)
x_decayed = tt.set_subtensor(x_decayed[0], x[0])
output, _ = theano.scan(
fn=adstock_geometric_recurrence_theano,
sequences=[tt.arange(1, len_observed), x[1:len_observed]],
outputs_info=x_decayed,
non_sequences=theta,
n_steps=len_observed - 1
)
return output[-1]
I cant come up with the vectorized solution to my adstock-function, can anyone give it a go?

Have you tried:
def apply_adstock_with_lag(x, L, P, D):
adstocked_x = np.convolve(x, D**((np.arange(0, L, 1) - P)**2))[:-(L-1)] / sum(D**((np.arange(0, L, 1) - P)**2))
adstocked_x = at.as_tensor_variable(adstocked_x)
return adstocked_x
This should work

Is there no faster way to convert (BGR) OpenCV image to CMYK?

I have an OpenCV image, as usual in BGR color space, and I need to convert it to CMYK. I searched online but found basically only (slight variations of) the following approach:
def bgr2cmyk(cv2_bgr_image):
bgrdash = cv2_bgr_image.astype(float) / 255.0
# Calculate K as (1 - whatever is biggest out of Rdash, Gdash, Bdash)
K = 1 - numpy.max(bgrdash, axis=2)
with numpy.errstate(divide="ignore", invalid="ignore"):
# Calculate C
C = (1 - bgrdash[..., 2] - K) / (1 - K)
C = 255 * C
C = C.astype(numpy.uint8)
# Calculate M
M = (1 - bgrdash[..., 1] - K) / (1 - K)
M = 255 * M
M = M.astype(numpy.uint8)
# Calculate Y
Y = (1 - bgrdash[..., 0] - K) / (1 - K)
Y = 255 * Y
Y = Y.astype(numpy.uint8)
return (C, M, Y, K)
This works fine, however, it feels quite slow - for an 800 x 600 px image it takes about 30 ms on my i7 CPU. Typical operations with cv2 like thresholding and alike take only a few ms for the same image, so since this is all numpy I was expecting this CMYK conversion to be faster.
However, I haven't found anything that makes this significantly fater. There is a conversion to CMYK via PIL.Image, but the resulting channels do not look as they do with the algorithm listed above.
Any other ideas?

There are several things you should do:
shake the math
use integer math where possible
optimize beyond what numpy can do
Shaking the math
Given
RGB' = RGB / 255
K = 1 - max(RGB')
C = (1-K - R') / (1-K)
M = (1-K - G') / (1-K)
Y = (1-K - B') / (1-K)
You see what you can factor out.
RGB' = RGB / 255
J = max(RGB')
K = 1 - J
C = (J - R') / J
M = (J - G') / J
Y = (J - B') / J
Integer math
Don't normalize to [0,1] for these calculations. The max() can be done on integers. The differences can too. K can be calculated entirely with integer math.
J = max(RGB)
K = 255 - J
C = 255 * (J - R) / J
M = 255 * (J - G) / J
Y = 255 * (J - B) / J
Numba
import numba
Numba will optimize that code beyond simply using numpy library routines. It will also do the parallelization as indicated. Choosing the numpy error model and allowing fastmath will cause division by zero to not throw an exception or warning, but also make the math a little faster.
Both variants significantly outperform a plain python/numpy solution. Much of that is due to better use of CPU registers caches, rather than intermediate arrays, as is usual with numpy.
First variant: ~1.9 ms
#numba.njit(parallel=True, error_model="numpy", fastmath=True)
def bgr2cmyk_v4(bgr_img):
bgr_img = np.ascontiguousarray(bgr_img)
(height, width) = bgr_img.shape[:2]
CMYK = np.empty((height, width, 4), dtype=np.uint8)
for i in numba.prange(height):
for j in range(width):
B,G,R = bgr_img[i,j]
J = max(R, G, B)
K = np.uint8(255 - J)
C = np.uint8(255 * (J - R) / J)
M = np.uint8(255 * (J - G) / J)
Y = np.uint8(255 * (J - B) / J)
CMYK[i,j] = (C,M,Y,K)
return CMYK
Thanks to Cris Luengo for pointing out further refactoring potential (pulling out 255/J), leading to a second variant. It takes ~1.6 ms
#numba.njit(parallel=True, error_model="numpy", fastmath=True)
def bgr2cmyk_v5(bgr_img):
bgr_img = np.ascontiguousarray(bgr_img)
(height, width) = bgr_img.shape[:2]
CMYK = np.empty((height, width, 4), dtype=np.uint8)
for i in numba.prange(height):
for j in range(width):
B,G,R = bgr_img[i,j]
J = np.uint8(max(R, G, B))
Jinv = np.uint16((255*256) // J) # fixed point math
K = np.uint8(255 - J)
C = np.uint8(((J - R) * Jinv) >> 8)
M = np.uint8(((J - G) * Jinv) >> 8)
Y = np.uint8(((J - B) * Jinv) >> 8)
CMYK[i,j] = (C,M,Y,K)
return CMYK
This fixed point math causes floor rounding. For round-to-nearest, the expression must be ((J - R) * Jinv + 128) >> 8. That would cost a bit more time then (~1.8 ms).
What else?
I think that numba/LLVM didn't apply SIMD here. Some investigation revealed that the Loop Vectorizer doesn't like any of the instances it was asked to consider.
An OpenCL kernel might be even faster. OpenCL can run on CPUs.
Numba can also use CUDA.

I would start by profiling which part is the bottleneck.
e.g how fast is it without the / (1 - K)calculation?
-> precalculate 1/(1-K) might help. Even precalculation of 255/(1-K) is possible.
K = 1 - numpy.max(bgrdash, axis=2)
kRez255=255/(1 - K)
with numpy.errstate(divide="ignore", invalid="ignore"):
# Calculate C
C = (1 - bgrdash[..., 2] - K) * kRez255
C = C.astype(numpy.uint8)
# Calculate M
M = (1 - bgrdash[..., 1] - K) * kRez255
M = M.astype(numpy.uint8)
# Calculate Y
Y = (1 - bgrdash[..., 0] - K) * kRez255
Y = Y.astype(numpy.uint8)
return (C, M, Y, K)
But only profiling can show if it is the calculation at all which slows down the conversion.

Error in implementation of Crank-Nicolson method applied to 1D TDSE?

This is more of a computational physics problem, and I've asked it on physics stack exchange, but no answers on there. This is, I suppose, a mix of the disciplines on here and there (and maybe even mathematics stack exchange), so finding the right place to post is a task in of itself apparently...
I'm attempting to use Crank-Nicolson scheme to solve the TDSE in 1D. The initial wave is a real Gaussian that has been normalised wrt its probability density. As the solution evolves, a depression grows in the central peak of the real part of the wave, and the imaginary part's central trough is perhaps a bit higher than I expect (image below).
Does this behaviour seem reasonable? I have searched around and not seen questions/figures that are similar. I've tested another person's code from Github and it exhibits the same behaviour, which makes me feel a bit better. But I still think the center peak should just decrease in height and increase in width. The likelihood of me getting a physics-based explanation is relatively low here I'd assume, but a computational-based explanation on errors I may have made is more likely.
I'm happy to give more information, for example my code, or the matrices used in the scheme, etc. Thanks in advance!
Here's a link to GIF of time evolution:
And the part of my code relevant to solving the 1D TDSE:
(pretty much the entire thing except the plotting)
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.animation import FuncAnimation
# Define function for norm.
def normf(dxc, uc, ic):
return sum(dxc * np.square(np.abs(uc[ic, :])))
# Define function for expectation value of position.
def xexpf(dxc, xc, uc, ic):
return sum(dxc * xc * np.square(np.abs(uc[ic, :])))
# Define function for expectation value of squared position.
def xexpsf(dxc, xc, uc, ic):
return sum(dxc * np.square(xc) * np.square(np.abs(uc[ic, :])))
# Define function for standard deviation.
def sdaf(xexpc, xexpsc, ic):
return np.sqrt(xexpsc[ic] - np.square(xexpc[ic]))
# Time t: t0 =< t =< tf. Have N steps at which to evaluate the CN scheme. The
# time interval is dt. decp: variable for plotting to certain number of decimal
# places.
t0 = 0
tf = 20
N = 200
dt = tf / N
t = np.linspace(t0, tf, num = N + 1, endpoint = True)
decp = str(dt)[::-1].find('.')
# Initialise array for filling with norm values at each time step.
norm = np.zeros(len(t))
# Initialise array for expectation value of position.
xexp = np.zeros(len(t))
# Initialise array for expectation value of squared position.
xexps = np.zeros(len(t))
# Initialise array for alternate standard deviation.
sda = np.zeros(len(t))
# Position x: -a =< x =< a. M is an even number. There are M + 1 total discrete
# positions, for the points to be symmetric and centred at x = 0.
a = 100
M = 1200
dx = (2 * a) / M
x = np.linspace(-a, a, num = M + 1, endpoint = True)
# The gaussian function u diffuses over time. sd sets the width of gaussian. u0
# is the initial gaussian at t0.
sd = 1
var = np.power(sd, 2)
mu = 0
u0 = np.sqrt(1 / np.sqrt(np.pi * var)) * np.exp(-np.power(x - mu, 2) / (2 * \
var))
u = np.zeros([len(t), len(x)], dtype = 'complex_')
u[0, :] = u0
# Normalise u.
u[0, :] = u[0, :] / np.sqrt(normf(dx, u, 0))
# Set coefficients of CN scheme.
alpha = dt * -1j / (4 * np.power(dx, 2))
beta = dt * 1j / (4 * np.power(dx, 2))
# Tridiagonal matrices Al and AR. Al to be solved using Thomas algorithm.
Al = np.zeros([len(x), len(x)], dtype = 'complex_')
for i in range (0, M):
Al[i + 1, i] = alpha
Al[i, i] = 1 - (2 * alpha)
Al[i, i + 1] = alpha
# Corner elements for BC's.
Al[M, M], Al[0, 0] = 1 - alpha, 1 - alpha
Ar = np.zeros([len(x), len(x)], dtype = 'complex_')
for i in range (0, M):
Ar[i + 1, i] = beta
Ar[i, i] = 1 - (2 * beta)
Ar[i, i + 1] = beta
# Corner elements for BC's.
Ar[M, M], Ar[0, 0] = 1 - 2*beta, 1 - beta
# Thomas algorithm variables. Following similar naming as in Wiki article.
a = np.diag(Al, -1)
b = np.diag(Al)
c = np.diag(Al, 1)
NT = len(b)
cp = np.zeros(NT - 1, dtype = 'complex_')
for n in range(0, NT - 1):
if n == 0:
cp[n] = c[n] / b[n]
else:
cp[n] = c[n] / (b[n] - (a[n - 1] * cp[n - 1]))
d = np.zeros(NT, dtype = 'complex_')
dp = np.zeros(NT, dtype = 'complex_')
# Iterate over each time step to solve CN method. Maintain boundary
# conditions. Keep track of standard deviation.
for i in range(0, N):
# BC's.
u[i, 0], u[i, M] = 0, 0
# Find RHS.
d = np.dot(Ar, u[i, :])
for n in range(0, NT):
if n == 0:
dp[n] = d[n] / b[n]
else:
dp[n] = (d[n] - (a[n - 1] * dp[n - 1])) / (b[n] - (a[n - 1] * \
cp[n - 1]))
nc = NT - 1
while nc > -1:
if nc == NT - 1:
u[i + 1, nc] = dp[nc]
nc -= 1
else:
u[i + 1, nc] = dp[nc] - (cp[nc] * u[i + 1, nc + 1])
nc -= 1
norm[i] = normf(dx, u, i)
xexp[i] = xexpf(dx, x, u, i)
xexps[i] = xexpsf(dx, x, u, i)
sda[i] = sdaf(xexp, xexps, i)
# Fill in final norm value.
norm[N] = normf(dx, u, N)
# Fill in final position expectation value.
xexp[N] = xexpf(dx, x, u, N)
# Fill in final squared position expectation value.
xexps[N] = xexpsf(dx, x, u, N)
# Fill in final standard deviation value.
sda[N] = sdaf(xexp, xexps, N)

How to find fundamental matrix based on other fundamental matrix and camera movement?

I am trying to speed up some multi-camera system that relies on calculation of fundamental matrices between each camera pair.
Please notice the following is pseudocode. # means matrix multiplication, | means concatenation.
I have code to calculate F for each pair calculate_f(camera_matrix1_3x4, camera_matrix1_3x4), and the naiive solution is
for c1 in cameras:
for c2 in cameras:
if c1 != c2:
f = calculate_f(c1.proj_matrix, c2.proj_matrix)
This is slow, and I would like to speed it up. I have ~5000 cameras.
I have pre calculated all rotations and translations (in world coordinates) between every pair of cameras, and internal parameters k, such that for each camera c, it holds that c.matrix = c.k # (c.rot | c.t)
Can I use the parameters r, t to help speed up following calculations for F?
In mathematical form, for 3 different cameras c1, c2, c3 I have
f12=(c1.proj_matrix, c2.proj_matrix), and I want f23=(c2.proj_matrix, c3.proj_matrix), f13=(c1.proj_matrix, c3.proj_matrix) with some function f23, f13 = fast_f(f12, c1.r, c1.t, c2.r, c2.t, c3.r, c3.t)?
A working function for calculating the fundamental matrix in numpy:
def fundamental_3x3_from_projections(p_left_3x4: np.array, p_right__3x4: np.array) -> np.array:
# The following is based on OpenCv-contrib's c++ implementation.
# see https://github.com/opencv/opencv_contrib/blob/master/modules/sfm/src/fundamental.cpp#L109
# see https://sourishghosh.com/2016/fundamental-matrix-from-camera-matrices/
# see https://answers.opencv.org/question/131017/how-do-i-compute-the-fundamental-matrix-from-2-projection-matrices/
f_3x3 = np.zeros((3, 3))
p1, p2 = p_left_3x4, p_right__3x4
x = np.empty((3, 2, 4), dtype=np.float)
x[0, :, :] = np.vstack([p1[1, :], p1[2, :]])
x[1, :, :] = np.vstack([p1[2, :], p1[0, :]])
x[2, :, :] = np.vstack([p1[0, :], p1[1, :]])
y = np.empty((3, 2, 4), dtype=np.float)
y[0, :, :] = np.vstack([p2[1, :], p2[2, :]])
y[1, :, :] = np.vstack([p2[2, :], p2[0, :]])
y[2, :, :] = np.vstack([p2[0, :], p2[1, :]])
for i in range(3):
for j in range(3):
xy = np.vstack([x[j, :], y[i, :]])
f_3x3[i, j] = np.linalg.det(xy)
return f_3x3

Numpy is clearly not optimized for working on small matrices. The parsing of CPython input objects, internal checks and function calls introduce a significant overhead which is far bigger than the execution time need to perform the actual computation. Not to mention the creation of many temporary arrays is also expensive. One solution to solve this problem is to use Numba or Cython.
Moreover, the computation of the determinant can be optimized a lot since you know the exact size of the matrix and a part of the matrix does not always change. Indeed, using a basic algebraic expression for the 4x4 determinant help compilers to optimize a lot the overall computation thanks to the common sub-expression elimination (not performed by the CPython interpreter) and the removal of complex loops/conditionals in np.linalg.det.
Here is the resulting code:
import numba as nb
#nb.njit('float64(float64[:,::1])')
def det_4x4(mat):
a, b, c, d = mat[0,0], mat[0,1], mat[0,2], mat[0,3]
e, f, g, h = mat[1,0], mat[1,1], mat[1,2], mat[1,3]
i, j, k, l = mat[2,0], mat[2,1], mat[2,2], mat[2,3]
m, n, o, p = mat[3,0], mat[3,1], mat[3,2], mat[3,3]
return a * (f * (k*p - l*o) + g * (l*n - j*p) + h * (j*o - k*n)) + \
b * (e * (l*o - k*p) + g * (i*p - l*m) + h * (k*m - i*o)) + \
c * (e * (j*p - l*n) + f * (l*m - i*p) + h * (i*n - j*m)) + \
d * (e * (k*n - j*o) + f * (i*o - k*m) + g * (j*m - i*n))
#nb.njit('float64[:,::1](float64[:,::1], float64[:,::1])')
def fundamental_3x3_from_projections(p_left_3x4, p_right_3x4):
f_3x3 = np.empty((3, 3))
p1, p2 = p_left_3x4, p_right_3x4
x = np.empty((3, 2, 4), dtype=np.float64)
x[0, 0, :] = p1[1, :]
x[0, 1, :] = p1[2, :]
x[1, 0, :] = p1[2, :]
x[1, 1, :] = p1[0, :]
x[2, 0, :] = p1[0, :]
x[2, 1, :] = p1[1, :]
y = np.empty((3, 2, 4), dtype=np.float64)
y[0, 0, :] = p2[1, :]
y[0, 1, :] = p2[2, :]
y[1, 0, :] = p2[2, :]
y[1, 1, :] = p2[0, :]
y[2, 0, :] = p2[0, :]
y[2, 1, :] = p2[1, :]
xy = np.empty((4, 4), dtype=np.float64)
for i in range(3):
xy[2:4, :] = y[i, :, :]
for j in range(3):
xy[0:2, :] = x[j, :, :]
f_3x3[i, j] = det_4x4(xy)
return f_3x3
This is 130 times faster on my machine (85.6 us VS 0.66 us).
You can speed up the process even more by a factor of two if the applied function is commutative (ie. f(c1, c2) == f(c2, c1)). If so, you could compute only the upper part. It turns out that your function have some interesting property since f(c1, c2) == f(c2, c1).T appear to be always true. Another possible optimization is to run the loop in parallel.
With all these optimizations, the resulting program should be about 3 order of magnitude faster.
Analysis of the accuracy of the approach
The precision provided appear to be similar than the original one. Regarding the input matrix, results are sometime more accurate and sometimes less accurate than the Numpy method. This is specifically due to the computation of the determinant. With 24-digit decimals, there is no visible error compared to the reliable result of Wolphram Alpha. This show that the method is correct, results as not the same due to numerical stability details. Here is the code used to test the accuracy of the methods:
# Imports
from decimal import Decimal
import numba as nb
# Definitions
def det_4x4(mat):
a, b, c, d = mat[0,0], mat[0,1], mat[0,2], mat[0,3]
e, f, g, h = mat[1,0], mat[1,1], mat[1,2], mat[1,3]
i, j, k, l = mat[2,0], mat[2,1], mat[2,2], mat[2,3]
m, n, o, p = mat[3,0], mat[3,1], mat[3,2], mat[3,3]
return a * (f * (k*p - l*o) + g * (l*n - j*p) + h * (j*o - k*n)) + \
b * (e * (l*o - k*p) + g * (i*p - l*m) + h * (k*m - i*o)) + \
c * (e * (j*p - l*n) + f * (l*m - i*p) + h * (i*n - j*m)) + \
d * (e * (k*n - j*o) + f * (i*o - k*m) + g * (j*m - i*n))
#nb.njit('float64(float64[:,::1])')
def det_4x4_numba(mat):
a, b, c, d = mat[0,0], mat[0,1], mat[0,2], mat[0,3]
e, f, g, h = mat[1,0], mat[1,1], mat[1,2], mat[1,3]
i, j, k, l = mat[2,0], mat[2,1], mat[2,2], mat[2,3]
m, n, o, p = mat[3,0], mat[3,1], mat[3,2], mat[3,3]
return a * (f * (k*p - l*o) + g * (l*n - j*p) + h * (j*o - k*n)) + \
b * (e * (l*o - k*p) + g * (i*p - l*m) + h * (k*m - i*o)) + \
c * (e * (j*p - l*n) + f * (l*m - i*p) + h * (i*n - j*m)) + \
d * (e * (k*n - j*o) + f * (i*o - k*m) + g * (j*m - i*n))
# Example matrix
precise_xy = np.array(
[[Decimal('42'),Decimal('-6248'),Decimal('4060'),Decimal('845')],
[Decimal('-0.00992'),Decimal('-0.704'),Decimal('-0.71173298417'),Decimal('300.532')],
[Decimal('-8.94274'),Decimal('-7554.39'),Decimal('604.57'),Decimal('706282')],
[Decimal('-0.0132'),Decimal('-0.2757'),Decimal('-0.961'),Decimal('247.65')]]
)
xy = precise_xy.astype(np.float64)
res_numpy = Decimal(np.linalg.det(xy))
res_numba = Decimal(det_4x4_numba(xy))
res_precise = det_4x4(precise_xy)
# The Wolphram Alpha expression used is:
# det({{42,-6248,4060,845},
# {-0.00992,-0.704,-0.71173298417,300.532},
# {-8.94274,-7554.39,604.57,706282},
# {-0.0132,-0.2757,-0.961,247.65}})
res_wolframalpha = Decimal('-323312.2164828991329828243')
# The result got from Wolfram-Alpha have a 25-digit precision
# and is exactly the same than the one of det_4x4 using 24-digit decimals.
assert res_precise == res_wolframalpha
print(abs((res_numpy-res_precise)/res_precise)) # 1.7E-14
print(abs((res_numba-res_precise)/res_precise)) # 3.1E-14
# => Similar relative error (Numba slightly less accurate
# but both are not close to the 1e-16 relative epsilon)

Can I speed up this aerodynamics calculation with Numba, vectorization, or multiprocessing?

Problem:
I am trying to increase the speed of an aerodynamics function in Python.
Function Set:
import numpy as np
from numba import njit
def calculate_velocity_induced_by_line_vortices(
points, origins, terminations, strengths, collapse=True
):
# Expand the dimensionality of the points input. It is now of shape (N x 1 x 3).
# This will allow NumPy to broadcast the upcoming subtractions.
points = np.expand_dims(points, axis=1)
# Define the vectors from the vortex to the points. r_1 and r_2 now both are of
# shape (N x M x 3). Each row/column pair holds the vector associated with each
# point/vortex pair.
r_1 = points - origins
r_2 = points - terminations
r_0 = r_1 - r_2
r_1_cross_r_2 = nb_2d_explicit_cross(r_1, r_2)
r_1_cross_r_2_absolute_magnitude = (
r_1_cross_r_2[:, :, 0] ** 2
+ r_1_cross_r_2[:, :, 1] ** 2
+ r_1_cross_r_2[:, :, 2] ** 2
)
r_1_length = nb_2d_explicit_norm(r_1)
r_2_length = nb_2d_explicit_norm(r_2)
# Define the radius of the line vortices. This is used to get rid of any
# singularities.
radius = 3.0e-16
# Set the lengths and the absolute magnitudes to zero, at the places where the
# lengths and absolute magnitudes are less than the vortex radius.
r_1_length[r_1_length < radius] = 0
r_2_length[r_2_length < radius] = 0
r_1_cross_r_2_absolute_magnitude[r_1_cross_r_2_absolute_magnitude < radius] = 0
# Calculate the vector dot products.
r_0_dot_r_1 = np.einsum("ijk,ijk->ij", r_0, r_1)
r_0_dot_r_2 = np.einsum("ijk,ijk->ij", r_0, r_2)
# Calculate k and then the induced velocity, ignoring any divide-by-zero or nan
# errors. k is of shape (N x M)
with np.errstate(divide="ignore", invalid="ignore"):
k = (
strengths
/ (4 * np.pi * r_1_cross_r_2_absolute_magnitude)
* (r_0_dot_r_1 / r_1_length - r_0_dot_r_2 / r_2_length)
)
# Set the shape of k to be (N x M x 1) to support numpy broadcasting in the
# subsequent multiplication.
k = np.expand_dims(k, axis=2)
induced_velocities = k * r_1_cross_r_2
# Set the values of the induced velocity to zero where there are singularities.
induced_velocities[np.isinf(induced_velocities)] = 0
induced_velocities[np.isnan(induced_velocities)] = 0
if collapse:
induced_velocities = np.sum(induced_velocities, axis=1)
return induced_velocities
#njit
def nb_2d_explicit_norm(vectors):
return np.sqrt(
(vectors[:, :, 0]) ** 2 + (vectors[:, :, 1]) ** 2 + (vectors[:, :, 2]) ** 2
)
#njit
def nb_2d_explicit_cross(a, b):
e = np.zeros_like(a)
e[:, :, 0] = a[:, :, 1] * b[:, :, 2] - a[:, :, 2] * b[:, :, 1]
e[:, :, 1] = a[:, :, 2] * b[:, :, 0] - a[:, :, 0] * b[:, :, 2]
e[:, :, 2] = a[:, :, 0] * b[:, :, 1] - a[:, :, 1] * b[:, :, 0]
return e
Context:
This function is used by Ptera Software, an open-source solver for flapping wing aerodynamics. As shown by the profile output below, it is by far the largest contributor to Ptera Software's run time.
Currently, Ptera Software takes just over 3 minutes to run a typical case, and my goal is to get this below 1 minute.
The function takes in a group of points, origins, terminations, and strengths. At every point, it finds the induced velocity due to the line vortices, which are characterized by the groups of origins, terminations, and strengths. If collapse is true, then the output is the cumulative velocity induced at each point due to the vortices. If false, the function outputs each vortex's contribution to the velocity at each point.
During a typical run, the velocity function is called approximately 2000 times. At first, the calls involve vectors with relatively small input arguments (around 200 points, origins, terminations, and strengths). Later calls involve large input arguments (around 400 points and around 6,000 origins, terminations, and strengths). An ideal solution would be fast for all size inputs, but increasing the speed of large input calls is more important.
For testing, I recommend running the following script with your own implementation of the function:
import timeit
import matplotlib.pyplot as plt
import numpy as np
n_repeat = 2
n_execute = 10 ** 3
min_oom = 0
max_oom = 3
times_py = []
for i in range(max_oom - min_oom + 1):
n_elem = 10 ** i
n_elem_pretty = np.format_float_scientific(n_elem, 0)
print("Number of elements: " + n_elem_pretty)
# Benchmark Python.
print("\tBenchmarking Python...")
setup = '''
import numpy as np
these_points = np.random.random((''' + str(n_elem) + ''', 3))
these_origins = np.random.random((''' + str(n_elem) + ''', 3))
these_terminations = np.random.random((''' + str(n_elem) + ''', 3))
these_strengths = np.random.random(''' + str(n_elem) + ''')
def calculate_velocity_induced_by_line_vortices(points, origins, terminations,
strengths, collapse=True):
pass
'''
statement = '''
results_orig = calculate_velocity_induced_by_line_vortices(these_points, these_origins,
these_terminations,
these_strengths)
'''
times = timeit.repeat(repeat=n_repeat, stmt=statement, setup=setup, number=n_execute)
time_py = min(times)/n_execute
time_py_pretty = np.format_float_scientific(time_py, 2)
print("\t\tAverage Time per Loop: " + time_py_pretty + " s")
# Record the times.
times_py.append(time_py)
sizes = [10 ** i for i in range(max_oom - min_oom + 1)]
fig, ax = plt.subplots()
ax.plot(sizes, times_py, label='Python')
ax.set_xscale("log")
ax.set_xlabel("Size of List or Array (elements)")
ax.set_ylabel("Average Time per Loop (s)")
ax.set_title(
"Comparison of Different Optimization Methods\nBest of "
+ str(n_repeat)
+ " Runs, each with "
+ str(n_execute)
+ " Loops"
)
ax.legend()
plt.show()
Previous Attempts:
My prior attempts at speeding up this function involved vectorizing it (which worked great, so I kept those changes) and trying out Numba's JIT compiler. I had mixed results with Numba. When I tried to use Numba on a modified version of the entire velocity function, my results were much slower than before. However, I found that Numba significantly sped up the cross-product and norm functions, which I implemented above.
Updates:
Update 1:
Based on Mercury's comment (which has since been deleted), I replaced
points = np.expand_dims(points, axis=1)
r_1 = points - origins
r_2 = points - terminations
with two calls to the following function:
#njit
def subtract(a, b):
c = np.empty((a.shape[0], b.shape[0], 3))
for i in range(a.shape[0]):
for j in range(b.shape[0]):
for k in range(3):
c[i, j, k] = a[i, k] - b[j, k]
return c
This resulted in a speed increase from 227 s to 220 s. This is better! However, it is still not fast enough.
I also have tried setting the njit fastmath flag to true, and using a numba function instead of calls to np.einsum. Neither increased the speed.
Update 2:
With Jérôme Richard's answer, the run time is now 156 s, which is a decrease of 29%! I'm satisfied enough to accept this answer, but feel free to make other suggestions if you think you can improve on their work!

First of all, Numba can perform parallel computations resulting in a faster code if you manually request it using mainly parallel=True and prange. This is useful for big arrays (but not for small ones).
Moreover, your computation is mainly memory bound. Thus, you should avoid creating big arrays when they are not reused multiple times, or more generally when they cannot be recomputed on the fly (in a relatively cheap way). This is the case for r_0 for example.
In addition, memory access pattern matters: vectorization is more efficient when accesses are contiguous in memory and the cache/RAM is use more efficiently. Consequently, arr[0, :, :] = 0 should be faster then arr[:, :, 0] = 0. Similarly, arr[:, :, 0] = arr[:, :, 1] = 0 should be mush slower than arr[:, :, 0:2] = 0 since the former performs to noncontinuous memory passes while the latter performs only one more contiguous memory pass. Sometimes, it can be beneficial to transpose your data so that the following calculations are much faster.
Moreover, Numpy tends to create many temporary arrays that are costly to allocate. This is a huge problem when the input arrays are small. The Numba jit can avoid that in most cases.
Finally, regarding your computation, it may be a good idea to use GPUs for big arrays (definitively not for small ones). You can give a look to cupy or clpy to do that quite easily.
Here is an optimized implementation working on the CPU:
import numpy as np
from numba import njit, prange
#njit(parallel=True)
def subtract(a, b):
c = np.empty((a.shape[0], b.shape[0], 3))
for i in prange(c.shape[0]):
for j in range(c.shape[1]):
for k in range(3):
c[i, j, k] = a[i, k] - b[j, k]
return c
#njit(parallel=True)
def nb_2d_explicit_norm(vectors):
res = np.empty((vectors.shape[0], vectors.shape[1]))
for i in prange(res.shape[0]):
for j in range(res.shape[1]):
res[i, j] = np.sqrt(vectors[i, j, 0] ** 2 + vectors[i, j, 1] ** 2 + vectors[i, j, 2] ** 2)
return res
# NOTE: better memory access pattern
#njit(parallel=True)
def nb_2d_explicit_cross(a, b):
e = np.empty(a.shape)
for i in prange(e.shape[0]):
for j in range(e.shape[1]):
e[i, j, 0] = a[i, j, 1] * b[i, j, 2] - a[i, j, 2] * b[i, j, 1]
e[i, j, 1] = a[i, j, 2] * b[i, j, 0] - a[i, j, 0] * b[i, j, 2]
e[i, j, 2] = a[i, j, 0] * b[i, j, 1] - a[i, j, 1] * b[i, j, 0]
return e
# NOTE: avoid the slow building of temporary arrays
#njit(parallel=True)
def cross_absolute_magnitude(cross):
return cross[:, :, 0] ** 2 + cross[:, :, 1] ** 2 + cross[:, :, 2] ** 2
# NOTE: avoid the slow building of temporary arrays again and multiple pass in memory
# Warning: do the work in-place
#njit(parallel=True)
def discard_singularities(arr):
for i in prange(arr.shape[0]):
for j in range(arr.shape[1]):
for k in range(3):
if np.isinf(arr[i, j, k]) or np.isnan(arr[i, j, k]):
arr[i, j, k] = 0.0
#njit(parallel=True)
def compute_k(strengths, r_1_cross_r_2_absolute_magnitude, r_0_dot_r_1, r_1_length, r_0_dot_r_2, r_2_length):
return (strengths
/ (4 * np.pi * r_1_cross_r_2_absolute_magnitude)
* (r_0_dot_r_1 / r_1_length - r_0_dot_r_2 / r_2_length)
)
#njit(parallel=True)
def rDotProducts(b, c):
assert b.shape == c.shape and b.shape[2] == 3
n, m = b.shape[0], b.shape[1]
ab = np.empty((n, m))
ac = np.empty((n, m))
for i in prange(n):
for j in range(m):
ab[i, j] = 0.0
ac[i, j] = 0.0
for k in range(3):
a = b[i, j, k] - c[i, j, k]
ab[i, j] += a * b[i, j, k]
ac[i, j] += a * c[i, j, k]
return (ab, ac)
# Compute `np.sum(arr, axis=1)` in parallel.
#njit(parallel=True)
def collapseArr(arr):
assert arr.shape[2] == 3
n, m = arr.shape[0], arr.shape[1]
res = np.empty((n, 3))
for i in prange(n):
res[i, 0] = np.sum(arr[i, :, 0])
res[i, 1] = np.sum(arr[i, :, 1])
res[i, 2] = np.sum(arr[i, :, 2])
return res
def calculate_velocity_induced_by_line_vortices(points, origins, terminations, strengths, collapse=True):
r_1 = subtract(points, origins)
r_2 = subtract(points, terminations)
# NOTE: r_0 is computed on the fly by rDotProducts
r_1_cross_r_2 = nb_2d_explicit_cross(r_1, r_2)
r_1_cross_r_2_absolute_magnitude = cross_absolute_magnitude(r_1_cross_r_2)
r_1_length = nb_2d_explicit_norm(r_1)
r_2_length = nb_2d_explicit_norm(r_2)
radius = 3.0e-16
r_1_length[r_1_length < radius] = 0
r_2_length[r_2_length < radius] = 0
r_1_cross_r_2_absolute_magnitude[r_1_cross_r_2_absolute_magnitude < radius] = 0
r_0_dot_r_1, r_0_dot_r_2 = rDotProducts(r_1, r_2)
with np.errstate(divide="ignore", invalid="ignore"):
k = compute_k(strengths, r_1_cross_r_2_absolute_magnitude, r_0_dot_r_1, r_1_length, r_0_dot_r_2, r_2_length)
k = np.expand_dims(k, axis=2)
induced_velocities = k * r_1_cross_r_2
discard_singularities(induced_velocities)
if collapse:
induced_velocities = collapseArr(induced_velocities)
return induced_velocities
On my machine, this code is 2.5 times faster than the initial implementation on arrays of size 10**3. It also use a bit less memory.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Parallel loops with Numba -- no parallelization with prange - python

It appears to simply have been a problem with running interactively in Ipython. Running a test script from the console leads to parallel execution, as expected.

Related

Vectorizing theano/aesara variable operations

Is there no faster way to convert (BGR) OpenCV image to CMYK?

Error in implementation of Crank-Nicolson method applied to 1D TDSE?

How to find fundamental matrix based on other fundamental matrix and camera movement?

Can I speed up this aerodynamics calculation with Numba, vectorization, or multiprocessing?

Categories

Resources