I have a 3D array and need to iterate over it, extract a 2x2x2-voxel region and check if any voxel is non-zero. Of these locations, I need the unique elements of the region and the index:
import time

import numpy as np

np.random.seed(1234)


def _naive_iterator(array):
    lookup = np.pad(array, (1, 1), 'constant')  # Border is filled with 0
    nx, ny, nz = lookup.shape
    for i in range(nx - 1):
        for j in range(ny - 1):
            for k in range(nz - 1):
                n = lookup[i:i + 2, j:j + 2, k:k + 2]
                if n.any():  # check if any value in the region is non-zero
                    yield n.ravel(), i, j, k
                    # yield set(n.ravel()), i, j, k  # `set()` alone takes some time - for testing purposes exclude this

# arrays that shall be used here are in the region of (1000, 1000, 1000) or larger
# arrays are asserted to contain only integer values >= 0.

array = np.random.randint(0, 2, (200, 200, 200), dtype=np.uint8)

for fun in (_naive_iterator, ):
    print(f"* {fun}")
    for _ in range(2):
        tic = time.time()
        [x for x in fun(array)]
        print(f" ** execution took {time.time() - tic}")
On my PC, this loop takes about 24s to run. (Interesting sidenote: without the n.any(), the loop needs only 8s, so maybe there is some optimization potential as well?)
I thought about how I could make this faster, potentially by running it in parallel. But I cannot figure out how I could do that without pre-generating all the 2x2x2 arrays.
I also thought about using scipy.ndimage.generic_filter, but with that I can only get an image which has, for example, 1 on all pixels that I want to include; I would still have to iterate over the original image to get n.ravel(). (Ideally, one would use generic_filter directly, but I cannot get the index inside the called function.)
How can I speed up this loop, potentially by parallelizing the iteration?
without the n.any(), the loop needs only 8s, so maybe there is some optimization potential as well?
This is because NumPy functions have a big overhead for very small arrays like 2x2x2. The overhead of a NumPy function call is about a few microseconds, while the actual n.any() computation should take no more than a dozen nanoseconds on a mainstream processor. The usual solution is to vectorize the operation so as to avoid many NumPy calls. You can use Numba to speed up this code and remove most of the CPython/NumPy overheads. Note that Numba does not currently support all functions (like np.pad), so a workaround is needed. Here is the resulting code:
import time

import numpy as np
import numba as nb

np.random.seed(1234)


@nb.njit('(uint8[:,:,::1],)')
def numba_iterator(lookup):
    nx, ny, nz = lookup.shape
    for i in range(nx - 1):
        for j in range(ny - 1):
            for k in range(nz - 1):
                n = lookup[i:i + 2, j:j + 2, k:k + 2]
                if n.any():
                    yield n.ravel(), i, j, k

array = np.random.randint(0, 2, (200, 200, 200), dtype=np.uint8)

for fun in (numba_iterator, ):
    print(f"* {fun}")
    for _ in range(2):
        tic = time.time()
        lookup = np.pad(array, (1, 1), 'constant')  # Border is filled with 0
        [x for x in fun(lookup)]
        print(f" ** execution took {time.time() - tic}")
This is significantly faster on my machine (but still quite slow).
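To see the per-call overhead in isolation, here is a small timing sketch of my own (the exact numbers are machine-dependent):

import timeit

import numpy as np

# Timing n.any() on a single 2x2x2 block: the cost is dominated by the
# Python/NumPy call dispatch, not by checking the 8 elements themselves.
n = np.zeros((2, 2, 2), dtype=np.uint8)
per_call = timeit.timeit(n.any, number=100_000) / 100_000
print(f"{per_call * 1e6:.2f} µs per call")  # typically on the order of a few microseconds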
I thought about how I could make this faster, potentially by running it in parallel.
This is not possible as long as the yield is used since generators are inherently sequential.
How can I speed up this loop
One solution could be to generate the whole output as a NumPy array in Numba, so as to avoid the creation of 8 million NumPy objects stored in a CPython list, which is the main source of slowdown of the code once optimized with Numba (each call to n.ravel creates a new array). Note that generators are generally slow since they often require a context switch (of a kind of lightweight thread / coroutine). The best solution in terms of performance is to compute data on the fly in the loop.
Additionally, n.any and n.ravel can be manually rewritten in Numba so as to be more efficient. Indeed, the n array views are very small and using 3 nested loops with a constant compile-time bound helps the compiler to produce fast code (i.e. it can unroll the loops and generate only a few instructions that the processor can execute very efficiently).
Here is the modified, improved code (which computes the padded array manually):
@nb.njit('(uint8[:,:,::1],)')
def fast_compute(array):
    nx, ny, nz = array.shape

    # Padding (with zeros)
    lookup = np.zeros((nx+2, ny+2, nz+2), dtype=np.uint8)
    for i in range(nx):
        for j in range(ny):
            for k in range(nz):
                lookup[i+1, j+1, k+1] = array[i, j, k]

    # Actual computation
    size = (nx + 1) * (ny + 1) * (nz + 1)
    result = np.empty((size, 8), dtype=np.uint8)
    indices = np.empty((size, 3), dtype=np.uint32)
    cur = 0

    for i in range(nx + 1):
        for j in range(ny + 1):
            for k in range(nz + 1):
                n = lookup[i:i+2, j:j+2, k:k+2]

                # Fast manual n.any()
                found = False
                for i2 in range(2):
                    for j2 in range(2):
                        for k2 in range(2):
                            found |= n[i2, j2, k2]

                if found:
                    # Fast manual n.ravel()
                    cur2 = 0
                    for i2 in range(2):
                        for j2 in range(2):
                            for k2 in range(2):
                                result[cur, cur2] = n[i2, j2, k2]
                                cur2 += 1

                    indices[cur, 0] = i
                    indices[cur, 1] = j
                    indices[cur, 2] = k
                    cur += 1

    return result[:cur].reshape(cur, 2, 2, 2), indices[:cur]
The resulting code is quite big, but this is the price to pay for high performance computing.
As pointed out by @norok2, result[:cur] and indices[:cur] are views referencing the allocated arrays. The views can be quite small compared to the allocated arrays. If this is a problem, you can return a copy (e.g. result[:cur].copy()) so as to avoid a possible memory overconsumption. In practice, it should not be a problem since the arrays are allocated in virtual memory and only the written pages are mapped in physical memory on mainstream systems (e.g. Windows & Linux). Pages of virtual memory are only mapped to physical memory during the first touch (i.e. when items are written for the first time). Modern platforms can allocate huge amounts of virtual memory (e.g. 131072 GiB on my mainstream x86-64 Windows, and even more on mainstream x86-64 Linux) while the physical memory is much more scarce (e.g. 16 GiB on my machine). The underlying array is freed when there is no view referencing it anymore.
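As a small illustration of that point, a slice keeps the whole allocation alive through its base attribute, while a copy does not:

import numpy as np

big = np.empty((1_000_000, 8), dtype=np.uint8)  # ~8 MB buffer
view = big[:10]
print(view.base is big)    # True: the slice is a view, the full buffer stays referenced
copy = big[:10].copy()
print(copy.base is None)   # True: the copy owns its own (tiny) buffer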
Benchmark
_naive_iterator: 21.25 s
numba_iterator: 8.10 s
get_windows_and_indices: 1.35 s
fast_compute: 0.13 s
The last Numba function is 163 times faster than the initial one and 10 times faster than the vectorized NumPy implementation of @flawr.
The Numba implementation could certainly be multi-threaded, but it is not easy to do since threads need to write to the output and the location of the written items (i.e. cur) depends on the other threads. Moreover, it would make the code significantly more complex.
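To illustrate what such a multi-threaded version would involve, here is a rough, untested sketch of mine (not part of the benchmark above) of a two-pass scheme: a first parallel pass counts the non-empty windows per i-slice, a prefix sum turns the counts into independent write offsets, and a second parallel pass fills the output. It shows how the shared cur counter has to be replaced by per-slice offsets:

import numpy as np
import numba as nb

@nb.njit(parallel=True)
def fast_compute_parallel(array):  # sketch only, not benchmarked
    nx, ny, nz = array.shape

    # Padding (with zeros), as in fast_compute()
    lookup = np.zeros((nx + 2, ny + 2, nz + 2), dtype=np.uint8)
    for i in range(nx):
        for j in range(ny):
            for k in range(nz):
                lookup[i+1, j+1, k+1] = array[i, j, k]

    # Pass 1: count the non-empty 2x2x2 windows of each i-slice in parallel
    counts = np.zeros(nx + 1, dtype=np.int64)
    for i in nb.prange(nx + 1):
        c = 0
        for j in range(ny + 1):
            for k in range(nz + 1):
                if lookup[i:i+2, j:j+2, k:k+2].any():
                    c += 1
        counts[i] = c

    # Exclusive prefix sum: independent write offset for each i-slice
    offsets = np.zeros(nx + 2, dtype=np.int64)
    for i in range(nx + 1):
        offsets[i + 1] = offsets[i] + counts[i]
    total = offsets[nx + 1]

    result = np.empty((total, 2, 2, 2), dtype=np.uint8)
    indices = np.empty((total, 3), dtype=np.uint32)

    # Pass 2: every i-slice writes into its own, non-overlapping output region
    for i in nb.prange(nx + 1):
        cur = offsets[i]
        for j in range(ny + 1):
            for k in range(nz + 1):
                n = lookup[i:i+2, j:j+2, k:k+2]
                if n.any():
                    for i2 in range(2):
                        for j2 in range(2):
                            for k2 in range(2):
                                result[cur, i2, j2, k2] = n[i2, j2, k2]
                    indices[cur, 0] = i
                    indices[cur, 1] = j
                    indices[cur, 2] = k
                    cur += 1
    return result, indices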
Whenever you're working with numpy, you should try to avoid explicit loops. These loops are written in python and therefore usually slower than anything you can do with vectorization. That way you defer the looping to the underlying C functions that are pretty much as fast as anything can be. So I would approach your problem with something like the following. This function does roughly the same thing as your _naive_iterator but in a vectorized manner without any python loops:
from numpy.lib.stride_tricks import sliding_window_view

def get_windows_and_indices(array):
    lookup = np.pad(array, (1, 1), 'constant')  # Border is filled with 0
    nx, ny, nz = lookup.shape
    x, y, z = np.mgrid[0:nx, 0:ny, 0:nz]
    lookup = np.stack([lookup, x, y, z])
    out = sliding_window_view(lookup, (2, 2, 2), axis=(1, 2, 3)).reshape(4, -1, 2, 2, 2)
    windows = out[0, ...]
    ic = out[1, ..., 0, 0, 0]
    jc = out[2, ..., 0, 0, 0]
    kc = out[3, ..., 0, 0, 0]
    mask = windows.any(axis=(1, 2, 3))
    return windows[mask], ic[mask], jc[mask], kc[mask]
Of course you will also need to think about the rest of the code a little bit differently, but vectorization is really something you need to get used to if you want to (efficiently) work with numpy.
Also I'm pretty sure that even this function above is not optimal and can definitely be improved further.
The simplest approach to speed up your code while retaining the features is with Numba. I assume the padding to be essentially a decorating step, and I will deal with it separately at the end of the answer.
Here is a cleaner implementation of the originally proposed code and the naïve Numba acceleration:
import numpy as np
import numba as nb


def i_cubicles_3d_set_OP(arr, size=2):
    nx, ny, nz = arr.shape
    nx += 1 - size
    ny += 1 - size
    nz += 1 - size
    for i in range(nx):
        for j in range(ny):
            for k in range(nz):
                window = arr[i:i + size, j:j + size, k:k + size]
                if window.any():
                    yield set(window.ravel()), (i, j, k)


i_cubicles_3d_set_OP_nb = nb.njit(i_cubicles_3d_set_OP)
i_cubicles_3d_set_OP_nb.__name__ = "i_cubicles_3d_set_OP_nb"
If one is interested in a dimension-agnostic version of it (which comes at the cost of some speed), one could write:
def i_cubicles_set_nb(arr, size=2):
    window = (size,) * arr.ndim
    window_size = size ** arr.ndim
    reduced_shape = tuple(dim - size + 1 for dim, size in zip(arr.shape, window))
    view = np.lib.stride_tricks.as_strided(
        arr, shape=reduced_shape + window, strides=arr.strides * 2, writeable=False)
    return _i_cubicles_set_nb(view.reshape((-1, window_size)), reduced_shape)


@nb.njit
def unravel_index(x, shape):
    result = np.zeros(len(shape), dtype=np.int_)
    for i, dim in enumerate(shape[::-1], 1):
        result[-i] = x % dim
        x //= dim
    return result


@nb.njit
def not_only_zeros(seq):
    # assumes seq is not empty
    count = 0
    for x in seq:
        if x == 0:
            count += 1
            break  # because only unique values
    return len(seq) != count


@nb.njit
def _i_cubicles_set_nb(arr, shape):
    for i, x in enumerate(arr):
        uniques = set(x)
        if not_only_zeros(uniques):
            yield uniques, unravel_index(i, shape)
This introduces the important trick of generating a strided (read-only) view of the input, which can be used to conceptually simplify all the looping, at the cost of having to manually unravel the index.
This is a similar idea to the one proposed in @flawr's answer.
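As a tiny illustration of the trick (on a toy 2D array): each row of the flattened view is one 2x2 window, and the window origin is recovered by unraveling the flat index:

import numpy as np

a = np.arange(12).reshape(3, 4)
reduced_shape = (a.shape[0] - 1, a.shape[1] - 1)  # (2, 3): one origin per window
view = np.lib.stride_tricks.as_strided(
    a, shape=reduced_shape + (2, 2), strides=a.strides * 2, writeable=False)
flat = view.reshape(-1, 4)                        # one row per 2x2 window

print(flat[4])                              # [ 5  6  9 10]: the window starting at (1, 1)
print(np.unravel_index(4, reduced_shape))   # (1, 1)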
On a 50³-sized input, I get the following timings:
np.random.seed(42)

n = 50
arr = np.random.randint(0, 3, (n, n, n), dtype=np.uint8)


def is_equal_i_set(a, b):
    return all(x[0] == y[0] and np.allclose(x[1], y[1]) for x, y in zip(a, b))


funcs = i_cubicles_3d_set_OP_nb, i_cubicles_3d_set_OP, i_cubicles_set_nb

base = list(funcs[0](arr))
for func in funcs:
    res = list(func(arr))
    print(f"{func.__name__:>24} {is_equal_i_set(base, res)!s:>5}", end=' ')
    # %timeit -n 1 -r 1 list(func(arr))
    %timeit list(func(arr))
# i_cubicles_3d_set_OP_nb  True  1 loop, best of 5: 130 ms per loop
#    i_cubicles_3d_set_OP  True  1 loop, best of 5: 776 ms per loop
#       i_cubicles_set_nb  True  10 loops, best of 5: 151 ms per loop
Indicating the use of Numba to be quite effective.
No uniques
If one is willing to forego the requirement of returning only unique elements inside a cubicle, replacing them with all the elements inside the cubicles, one does gain some (but not much) speed:
@nb.njit
def i_cubicles_3d_nb(arr, size=2):
    nx, ny, nz = arr.shape
    nx += 1 - size
    ny += 1 - size
    nz += 1 - size
    for i in range(nx):
        for j in range(ny):
            for k in range(nz):
                window = arr[i:i + size, j:j + size, k:k + size]
                if window.any():
                    yield window.ravel(), (i, j, k)


def i_cubicles_nb(arr, size=2):
    window = (size,) * arr.ndim
    window_size = size ** arr.ndim
    reduced_shape = tuple(dim - size + 1 for dim, size in zip(arr.shape, window))
    view = np.lib.stride_tricks.as_strided(
        arr, shape=reduced_shape + window, strides=arr.strides * 2, writeable=False)
    return _i_cubicles_nb(view.reshape((-1, window_size)), reduced_shape)


@nb.njit
def unravel_index(x, shape):
    result = np.zeros(len(shape), dtype=np.int_)
    for i, dim in enumerate(shape[::-1], 1):
        result[-i] = x % dim
        x //= dim
    return result


@nb.njit
def any_nb(arr):
    for x in arr:
        if x:
            return True
    return False


@nb.njit
def _i_cubicles_nb(arr, shape):
    for i, x in enumerate(arr):
        if any_nb(x):
            yield x, unravel_index(i, shape)
as evidenced by the following benchmark (on the same 50³-sized input as before):
def is_equal_i(a, b):
    return all(np.allclose(x[0], y[0]) and np.allclose(x[1], y[1]) for x, y in zip(a, b))


funcs = i_cubicles_3d_nb, i_cubicles_nb

base = list(funcs[0](arr))
for func in funcs:
    res = list(func(arr))
    print(f"{func.__name__:>24} {is_equal_i(base, res)!s:>5}", end=' ')
    # %timeit -n 1 -r 1 list(func(arr))
    %timeit list(func(arr))
    # print()
# i_cubicles_3d_nb  True  10 loops, best of 5: 116 ms per loop
#    i_cubicles_nb  True  10 loops, best of 5: 125 ms per loop
No yield (and no uniques)
While it is clear that a function matching exactly the OP output can be made faster only with Numba / Cython, a number of fast approaches can be obtained by foregoing some features of the OP code.
In particular, when creating generators, a significant amount of time is spent on creating the actual objects to yield.
The same information can be returned (and most importantly allocated) all at once with substantial speed gain, especially if we skip creating the containers for computing the unique elements.
Once we accept returning all elements inside a cubicle instead of its unique elements, it is also possible to devise NumPy-only vectorized (fast and dimension-agnostic) approaches, alongside faster Numba (3D-specific) implementations:
def cubicles_np(arr, size=2):
    window = (size,) * arr.ndim
    window_size = size ** arr.ndim
    reduced_shape = tuple(dim - size + 1 for dim, size in zip(arr.shape, window))
    view = np.lib.stride_tricks.as_strided(
        arr, shape=reduced_shape + window, strides=arr.strides * 2, writeable=False)
    mask = np.any(view, axis=tuple(range(-arr.ndim, 0)))
    return view[mask, ...], np.array(np.nonzero(mask)).transpose()


def cubicles_tr_np(arr, size=2):
    window = (size,) * arr.ndim
    window_size = size ** arr.ndim
    reduced_shape = tuple(dim - size + 1 for dim, size in zip(arr.shape, window))
    view = np.lib.stride_tricks.as_strided(
        arr, shape=window + reduced_shape, strides=arr.strides * 2, writeable=False)
    mask = np.any(view, axis=tuple(range(arr.ndim)))
    return (
        view[..., mask].reshape((window_size, -1)).transpose().reshape((-1, *window)),
        np.array(np.nonzero(mask)).transpose())


def cubicles_nb(arr, size=2):
    window = (size,) * arr.ndim
    window_size = size ** arr.ndim
    reduced_shape = tuple(dim - size + 1 for dim, size in zip(arr.shape, window))
    view = np.lib.stride_tricks.as_strided(
        arr, shape=reduced_shape + window, strides=arr.strides * 2, writeable=False)
    values, indexes = _cubicles_nb(view.reshape((-1, window_size)), reduced_shape, arr.ndim)
    return values.reshape((-1, *window)), indexes


@nb.njit
def any_nb(arr):
    for x in arr:
        if x:
            return True
    return False


@nb.njit
def _cubicles_nb(arr, shape, ndim):
    n, k = arr.shape
    indexes = np.empty((n, ndim), dtype=np.int_)
    result = np.empty((n, k), dtype=arr.dtype)
    count = 0
    for i in range(n):
        x = arr[i]
        if any_nb(x):
            indexes[count] = unravel_index(i, shape)
            result[count] = x
            count += 1
    return result[:count].copy(), indexes[:count].copy()


@nb.njit
def any_cubicle_3d_nb(arr, size):
    for i in range(size):
        for j in range(size):
            for k in range(size):
                if arr[i, j, k]:
                    return True
    return False


@nb.njit
def cubicles_3d_nb(arr, size=2):
    nx, ny, nz = arr.shape
    nx += 1 - size
    ny += 1 - size
    nz += 1 - size
    nn = nx * ny * nz
    indexes = np.empty((nn, 3), dtype=np.int_)
    result = np.empty((nn, size, size, size), dtype=arr.dtype)
    count = 0
    for i in range(nx):
        for j in range(ny):
            for k in range(nz):
                x = arr[i:i + size, j:j + size, k:k + size]
                if any_cubicle_3d_nb(x, size):
                    result[count] = x
                    indexes[count] = i, j, k
                    count += 1
    return result[:count].copy(), indexes[:count].copy()
The timings, obtained again on the 50³-sized input, indicate that for the Numba-based approaches spelling out the loops is significantly faster than looping through a view.
In fact, without explicitly looping along the dimensions, the NumPy-only approaches can be faster than the Numba-accelerated one.
Note that cubicles_3d_nb() can be seen essentially as a cleaned-up version of @JérômeRichard's answer.
(Actually, the timing for JérômeRichard's fast_compute() on my machine and input -- with the addition of the extra .copy() -- seems to indicate that cubicles_3d_nb() is more efficient -- possibly because of the short-circuiting in the "any" code, and the lack of need to ravel the values manually.)
def is_equal(a, b):
    return all(np.allclose(x[0], y[0]) and np.allclose(x[1], y[1]) for x, y in zip(a, b))


funcs = cubicles_3d_nb, cubicles_nb, cubicles_np, cubicles_tr_np

base = funcs[0](arr)
for func in funcs:
    res = func(arr)
    print(f"{func.__name__:>24} {is_equal(base, res)!s:>5}", end=' ')
    %timeit func(arr)
# cubicles_3d_nb  True  100 loops, best of 5: 3.82 ms per loop
#    cubicles_nb  True  10 loops, best of 5: 23 ms per loop
#    cubicles_np  True  10 loops, best of 5: 24.7 ms per loop
# cubicles_tr_np  True  100 loops, best of 5: 16.5 ms per loop
Notes on indexes
If one is to give the result all at once, then the indexes themselves are not particularly efficient to store the information as to where the non-zero cubicles are, unless there are few of them.
Instead, a boolean array is more memory efficient.
The indexing requires index_size * ndim * num (num being the number of non-zero cubicles, bounded to be 0 < num < prod(shape)).
The masking requires bool_size * prod(shape).
For NumPy, bool_size = 8 bits while index_size = 64 bits (this can be tweaked but is typically at least 16), so index_size = bool_size * k with k = 8.
So the indexing is more memory-efficient as long as:
num < prod(shape) // (k * ndim)
For 3D and a typical index_size = 64, this means that (num / prod(shape)) < (1 / 24), so indexing is only more efficient if the non-zero cubicles are ~5% or fewer.
Speed-wise, using a boolean mask instead of the indexes could lead to implementations that are faster by a small but fair margin (~5 to ~20%) as long as the non-zero cubicles are not too few.
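For reference, converting between the two representations is straightforward; here is a small sketch (indexes_to_mask() / mask_to_indexes() are just illustrative helper names, not part of the code above):

import numpy as np

def indexes_to_mask(indexes, reduced_shape):
    # boolean grid over the window origins, True where a cubicle is non-zero
    mask = np.zeros(reduced_shape, dtype=bool)
    mask[tuple(indexes.T)] = True
    return mask

def mask_to_indexes(mask):
    # back to one (i, j, k) row per non-zero cubicle, in C order
    return np.array(np.nonzero(mask)).transpose()

idx = np.array([[0, 1, 2], [3, 4, 5]])
m = indexes_to_mask(idx, (6, 6, 6))
print(np.array_equal(mask_to_indexes(m), idx))  # True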
Addendum: Padding
While np.pad() is not supported by Numba, it is quite simple to call any padding function outside of Numba.
Additionally, for some combinations of inputs, np.pad() is slower than a simple assignment on a sliced output:
import numpy as np
import numba as nb


@nb.njit
def pad_3d_nb(arr, size=1):
    nx, ny, nz = arr.shape
    result = np.zeros((nx + 2 * size, ny + 2 * size, nz + 2 * size), dtype=arr.dtype)
    for i in range(nx):
        for j in range(ny):
            for k in range(nz):
                result[i + size, j + size, k + size] = arr[i, j, k]
    return result


def const_pad(arr, size=1, value=0):
    shape = tuple(dim + 2 * size for dim in arr.shape)
    mask = tuple(slice(size, dim + size) for dim in arr.shape)
    result = np.full(shape, value, dtype=arr.dtype)
    result[mask] = arr
    return result


np.random.seed(42)

n = 200
k = 10
arr = np.random.randint(0, 3, (n, n, n), dtype=np.uint8)

base = np.pad(arr, (k, k))
print(np.allclose(pad_3d_nb(arr, k), base))
# True
print(np.allclose(const_pad(arr, k), base))
# True

%timeit np.pad(arr, (k, k))
# 100 loops, best of 5: 3.01 ms per loop
%timeit pad_3d_nb(arr, k)
# 100 loops, best of 5: 11.5 ms per loop
%timeit const_pad(arr, k)
# 100 loops, best of 5: 2.53 ms per loop
Related
I'm new to Numba and I'm trying to implement an old Fortran code in Python using Numba (version 0.54.1), but when I add parallel=True the program actually slows down. My program is very simple: I change the positions x and y in an L x L grid and for each position in the grid I perform a summation:
import numpy as np
import numba as nb

@nb.njit(parallel=True)
def lyapunov_grid(x_grid, y_grid, k, N):
    L = len(x_grid)
    lypnv = np.zeros((L, L))
    for ii in nb.prange(L):
        for jj in range(L):
            x = x_grid[ii]
            y = y_grid[jj]
            beta0 = 0
            sumT11 = 0
            for j in range(N):
                y = (y - k*np.sin(x)) % (2*np.pi)
                x = (x + y) % (2*np.pi)
                J = np.array([[1.0, -k*np.cos(x)], [1.0, 1.0 - k*np.cos(x)]])
                beta = np.arctan((-J[1,0]*np.cos(beta0) + J[1,1]*np.sin(beta0))/(J[0,0]*np.cos(beta0) - J[0,1]*np.sin(beta0)))
                T11 = np.cos(beta0)*(J[0,0]*np.cos(beta) - J[1,0]*np.sin(beta)) - np.sin(beta0)*(J[0,1]*np.cos(beta) - J[1,1]*np.sin(beta))
                sumT11 += np.log(abs(T11))/np.log(2)
                beta0 = beta
            lypnv[ii, jj] = sumT11/N
    return lypnv

# Compile
_ = lyapunov_grid(np.linspace(0, 1, 10), np.linspace(0, 1, 10), 1, 10)

# Parameters
N = int(1e3)
L = 128
pi = np.pi
k = 1.5
# Limits of the phase space
x0 = -pi
xf = pi
y0 = -pi
yf = pi
# Grid positions
x = np.linspace(x0, xf, L, endpoint=True)
y = np.linspace(y0, yf, L, endpoint=True)

lypnv = lyapunov_grid(x, y, k, N)
With parallel=False it takes about 8s to run, however with parallel=True it takes about 14s. I also tested with another code from https://github.com/animator/mandelbrot-numba and in this case the parallelization works.
import math
import numpy as np
import numba as nb

WIDTH = 1000
MAX_ITER = 1000

@nb.njit(parallel=True)
def mandelbrot(width, max_iter):
    pixels = np.zeros((width, width, 3), dtype=np.uint8)
    for y in nb.prange(width):
        for x in range(width):
            c0 = complex(3.0*x/width - 2, 3.0*y/width - 1.5)
            c = 0
            for i in range(1, max_iter):
                if abs(c) > 2:
                    log_iter = math.log(i)
                    pixels[y, x, :] = np.array([int(255*(1+math.cos(3.32*log_iter))/2),
                                                int(255*(1+math.cos(0.774*log_iter))/2),
                                                int(255*(1+math.cos(0.412*log_iter))/2)],
                                               dtype=np.uint8)
                    break
                c = c * c + c0
    return pixels

# compile
_ = mandelbrot(WIDTH, 10)

calcpixels = mandelbrot(WIDTH, MAX_ITER)
One main issue is that the second function call compiles the function again. Indeed, the types of the provided arguments change: in the first call the third argument is an integer (int transformed to an np.int_), while in the second call the third argument (k) is a floating-point number (float transformed to an np.float64). Numba recompiles the function for different parameter types because they are deduced from the types of the arguments, and it does not know you want to use an np.float64 type for the third argument (since the first time the function is compiled for an np.int_ type). One simple solution to fix the problem is to change the first call to:
_ = lyapunov_grid(np.linspace(0, 1, 10), np.linspace(0, 1, 10), 1.0, 10)
However, this is not a robust way to fix the problem. You can specify the parameter types to Numba so it will compile the function at declaration time. This also removes the need to artificially call the function (with useless parameters).
@nb.njit('float64[:,:](float64[::1], float64[::1], float64, float64)', parallel=True)
Note that (J[0,0]*np.cos(beta0) - J[0,1]*np.sin(beta0)) is zero the first time resulting in a division by 0.
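To see the type-specialization behaviour in isolation, here is a tiny standalone example (scale() is just a made-up function for illustration):

import numba as nb

@nb.njit
def scale(x, k):
    return x * k

scale(1.0, 1)    # first call: compiled for (float64, int64)
scale(1.0, 1.5)  # different argument type: compiled again for (float64, float64)
print(scale.signatures)  # two compiled specializations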
Another main issue comes from the allocation of many small arrays in the loop, causing contention of the standard allocator (see this post for more information). While Numba could theoretically optimize it (i.e. replace the array with local variables), it actually does not, resulting in a huge slowdown and contention. Fortunately, in your case, you do not need to actually create the array in every iteration. Instead, you can create it once in the encompassing loop and modify it in the innermost loop. Here is the optimized code:
@nb.njit('float64[:,:](float64[::1], float64[::1], float64, float64)', parallel=True)
def lyapunov_grid(x_grid, y_grid, k, N):
    L = len(x_grid)
    lypnv = np.zeros((L, L))
    for ii in nb.prange(L):
        J = np.ones((2, 2), dtype=np.float64)
        for jj in range(L):
            x = x_grid[ii]
            y = y_grid[jj]
            beta0 = 0
            sumT11 = 0
            for j in range(N):
                y = (y - k*np.sin(x)) % (2*np.pi)
                x = (x + y) % (2*np.pi)
                J[0, 1] = -k*np.cos(x)
                J[1, 1] = 1.0 - k*np.cos(x)
                beta = np.arctan((-J[1,0]*np.cos(beta0) + J[1,1]*np.sin(beta0))/(J[0,0]*np.cos(beta0) - J[0,1]*np.sin(beta0)))
                T11 = np.cos(beta0)*(J[0,0]*np.cos(beta) - J[1,0]*np.sin(beta)) - np.sin(beta0)*(J[0,1]*np.cos(beta) - J[1,1]*np.sin(beta))
                sumT11 += np.log(abs(T11))/np.log(2)
                beta0 = beta
            lypnv[ii, jj] = sumT11/N
    return lypnv
Here are the results on an old 2-core machine (with 4 hardware threads):
Original sequential: 15.9 s
Original parallel: 11.9 s
Fix-build sequential: 15.7 s
Fix-build parallel: 10.1 s
Optimized sequential: 2.73 s
Optimized parallel: 0.94 s
The optimized implementation is much faster than the others. The parallel optimized version scales very well compared to the original one (2.9 times faster than the sequential one). Finally, the best version is about 12 times faster than the original parallel version. I expect a much faster computation on a recent machine with many more cores.
So I have two matrices, A and B, and I want to compute the min-plus product as given here: Min-plus matrix multiplication. For that I've implemented the following:
import numpy as np

def min_plus_product(A, B):
    B = np.transpose(B)
    Y = np.zeros((len(B), len(A)))
    for i in range(len(B)):
        Y[i] = (A + B[i]).min(1)
    return np.transpose(Y)
This works fine, but it is slow for big matrices. Is there a way to make it faster? I've heard that implementing it in C or using the GPU might be good options.
Here is an algorithm that saves a bit of work if the middle dimension is large enough and the entries are uniformly distributed. It exploits the fact that the smallest sum will typically come from two small terms.
import numpy as np

def min_plus_product(A, B):
    B = np.transpose(B)
    Y = np.zeros((len(B), len(A)))
    for i in range(len(B)):
        Y[i] = (A + B[i]).min(1)
    return np.transpose(Y)

def min_plus_product_opt(A, B, chop=None):
    if chop is None:
        # not sure this is optimal
        chop = int(np.ceil(np.sqrt(A.shape[1])))
    B = np.transpose(B)
    Amin = A.min(1)
    Y = np.zeros((len(B), len(A)))
    for i in range(len(B)):
        o = np.argsort(B[i])
        Y[i] = (A[:, o[:chop]] + B[i, o[:chop]]).min(1)
        if chop < len(o):
            idx = np.where(Amin + B[i, o[chop]] < Y[i])[0]
            for j in range(chop, len(o), chop):
                if len(idx) == 0:
                    break
                x, y = np.ix_(idx, o[j : j + chop])
                slmin = (A[x, y] + B[i, o[j : j + chop]]).min(1)
                slmin = np.minimum(Y[i, idx], slmin)
                Y[i, idx] = slmin
                nidx = np.where(Amin[idx] + B[i, o[j + chop]] < Y[i, idx])[0]
                idx = idx[nidx]
    return np.transpose(Y)

A = np.random.random(size=(1000, 1000))
B = np.random.random(size=(1000, 2000))

print(np.allclose(min_plus_product(A, B), min_plus_product_opt(A, B)))

import time
t = time.time(); min_plus_product(A, B); print('naive {}sec'.format(time.time() - t))
t = time.time(); min_plus_product_opt(A, B); print('opt {}sec'.format(time.time() - t))
Sample output:
True
naive 7.794037580490112sec
opt 1.65810227394104sec
A possible simple route is to use numba.
from numba import njit
import numpy as np

@njit
def min_plus_product(A, B):
    n = A.shape[0]
    C = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            minimum = A[i, 0] + B[0, j]
            for k in range(1, n):
                minimum = min(A[i, k] + B[k, j], minimum)
            C[i, j] = minimum
    return C
Timings on 1000x1000 A,B matrices are:
1 loops, best of 3: 4.28 s per loop for the original code
1 loops, best of 3: 2.32 s per loop for the numba code
Here is a succinct and fully numpy solution, without any python-based loops:
(np.expand_dims(a, 0) + np.expand_dims(b.T, 1)).min(axis=2).T
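As a quick sanity check of my own (assuming square matrices, so that the min_plus_product implementations above apply unchanged), the one-liner agrees with the loop-based version on random input:

import numpy as np

a = np.random.random((100, 100))
b = np.random.random((100, 100))

loop_result = min_plus_product(a, b)
oneliner = (np.expand_dims(a, 0) + np.expand_dims(b.T, 1)).min(axis=2).T
print(np.allclose(loop_result, oneliner))  # True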
I have a very large array with only a few small areas of interest. I need to calculate the gradient of this array, but for performance reasons I need this calculation to be restricted to these areas of interest.
I can't do something like this:
phi_grad0[mask] = np.gradient(phi[mask], axis=0)
Because of how fancy indexing works, phi[mask] just becomes a 1D array of the masked pixels, losing spatial information and making the gradient calculation worthless.
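A minimal illustration of that flattening (example values chosen arbitrarily):

import numpy as np

phi = np.arange(16).reshape(4, 4)
mask = phi % 3 == 0
print(phi[mask])        # [ 0  3  6  9 12 15] -- a flat 1D selection
print(phi[mask].shape)  # (6,): the 2D neighbourhood structure is gone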
np.gradient does handle np.ma.masked_arrays, but the performance is an order of magnitude worse:
import numpy as np
from timeit_context import timeit_context

phi = np.random.randint(low=-100, high=100, size=[100, 100])
phi_mask = np.random.randint(low=0, high=2, size=phi.shape, dtype=bool)

with timeit_context('full array'):
    for i2 in range(1000):
        phi_masked_grad1 = np.gradient(phi)

with timeit_context('masked_array'):
    phi_masked = np.ma.masked_array(phi, ~phi_mask)
    for i1 in range(1000):
        phi_masked_grad2 = np.gradient(phi_masked)
This produces the output below:
[full array] finished in 143 ms
[masked_array] finished in 1961 ms
I think it's because operations run on masked_arrays are not vectorized, but I'm not sure.
Is there any way of restricting np.gradient so as to achieve better performance?
This timeit_context is a handy timer that works like this, if anyone is interested:
from contextlib import contextmanager
import time

@contextmanager
def timeit_context(name):
    """
    Use it to time a specific code snippet
    Usage: 'with timeit_context('Testcase1'):'
    :param name: Name of the context
    """
    start_time = time.time()
    yield
    elapsed_time = time.time() - start_time
    print('[{}] finished in {} ms'.format(name, int(elapsed_time * 1000)))
Not exactly an answer, but this is what I've managed to patch together for my situation, which works pretty well:
I get 1D indices of the pixels where the condition is true (in this case the condition being < 5 for example):
def get_indices_1d(image, band_thickness):
    return np.where(image.reshape(-1) < 5)[0]
This gives me a 1D array with those indices.
Then I manually calculate the gradient at those positions, in different ways:
def gradient_at_points1(image, indices_1d):
    width = image.shape[1]
    size = image.size
    # Using this instead of ravel() is more likely to produce a view instead of a copy
    raveled_image = image.reshape(-1)
    res_x = 0.5 * (raveled_image[(indices_1d + 1) % size] - raveled_image[(indices_1d - 1) % size])
    res_y = 0.5 * (raveled_image[(indices_1d + width) % size] - raveled_image[(indices_1d - width) % size])
    return [res_y, res_x]


def gradient_at_points2(image, indices_1d):
    indices_2d = np.unravel_index(indices_1d, shape=image.shape)
    # Even without doing the actual deltas this is already slower, and we'll have to check boundary conditions, etc
    res_x = 0.5 * (image[indices_2d] - image[indices_2d])
    res_y = 0.5 * (image[indices_2d] - image[indices_2d])
    return [res_y, res_x]


def gradient_at_points3(image, indices_1d):
    width = image.shape[1]
    raveled_image = image.reshape(-1)
    res_x = 0.5 * (raveled_image.take(indices_1d + 1, mode='wrap') - raveled_image.take(indices_1d - 1, mode='wrap'))
    res_y = 0.5 * (raveled_image.take(indices_1d + width, mode='wrap') - raveled_image.take(indices_1d - width, mode='wrap'))
    return [res_y, res_x]


def gradient_at_points4(image, indices_1d):
    width = image.shape[1]
    raveled_image = image.ravel()
    res_x = 0.5 * (raveled_image.take(indices_1d + 1, mode='wrap') - raveled_image.take(indices_1d - 1, mode='wrap'))
    res_y = 0.5 * (raveled_image.take(indices_1d + width, mode='wrap') - raveled_image.take(indices_1d - width, mode='wrap'))
    return [res_y, res_x]
My test arrays look like this:
a = np.random.randint(-10, 10, size=[512, 512])
# Force edges to not pass the condition
a[:, 0] = 99
a[:, -1] = 99
a[0, :] = 99
a[-1, :] = 99
indices = get_indices_1d(a, 5)
mask = a < 5
Then I can run these tests:
with timeit_context('full gradient'):
    for i in range(100):
        grad1 = np.gradient(a)

with timeit_context('With masked_array'):
    for im in range(100):
        ma = np.ma.masked_array(a, mask)
        grad6 = np.gradient(ma)

with timeit_context('gradient at points 1'):
    for i1 in range(100):
        grad2 = gradient_at_points1(image=a, indices_1d=indices)

with timeit_context('gradient at points 2'):
    for i2 in range(100):
        grad3 = gradient_at_points2(image=a, indices_1d=indices)

with timeit_context('gradient at points 3'):
    for i3 in range(100):
        grad4 = gradient_at_points3(image=a, indices_1d=indices)

with timeit_context('gradient at points 4'):
    for i4 in range(100):
        grad5 = gradient_at_points4(image=a, indices_1d=indices)
Which give the following results:
[full gradient] finished in 576 ms
[With masked_array] finished in 3455 ms
[gradient at points 1] finished in 421 ms
[gradient at points 2] finished in 451 ms
[gradient at points 3] finished in 112 ms
[gradient at points 4] finished in 102 ms
As you can see, method 4 is by far the best (I don't care much about how much memory it's consuming, however).
This probably only holds because my 2D array is relatively small (512x512). Maybe with much larger arrays this won't be true.
Another caveat is that ndarray.take(indices, mode='wrap') will do some weird stuff around the image edges (one row will 'loop' into the next, etc) to maintain good performance, so if edges are ever important for your application you might want to pad the input array with 1 pixel around the edges.
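For example, one possible way to do that (a rough sketch, untested):

import numpy as np

# Pad by one pixel so the +/-1 and +/-width offsets used above never cross a
# border, then recompute the indices in the padded image's coordinates.
a_padded = np.pad(a, 1, mode='edge')          # replicate the border values
indices_padded = get_indices_1d(a_padded, 5)  # indices valid for the padded image
grad = gradient_at_points4(image=a_padded, indices_1d=indices_padded)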
Still, it's super interesting how slow masked_arrays are. Pulling the constructor ma = np.ma.masked_array(a, mask) outside the loop doesn't affect the time, since the masked_array itself just keeps references to the array and its mask.
I currently implemented an algorithm which calculates a quality assessment of disparity maps based on total variation.
I'm relatively new to Python, but I have already read numerous threads on speeding up NumPy code: views vs. fancy indexing, using Cython, vectorization of nested loops, etc. I achieved a bit of a speed-up, but altogether I ended up with more and more messy code without achieving a proper speed-up.
I wonder if someone can give me a hint whether there is a clean and easy way to speed up this 2D loop.
TV is a 2D array with ~ 15k x 15k elements
footprint_ix and footprint_iy are two lists of arrays which contain the index offsets to the neighbor pixels of pixel x, y in a ring-shaped manner. With m = 1 the 8 neighbor pixels are selected, with m = 2 the next 16, and so on.
The algorithm sums up the neighbor pixels of x,y and increases m when a threshold TAU is not exceeded.
The best solution I have come up with so far uses row-wise multiprocessing.
# parameters
m_classes = 21

# create footprints
footprint_ix = []
footprint_iy = []
for m in range(1, m_classes):
    fp = np.ones((2 * m + 1, 2 * m + 1), dtype=int)
    fp[1:-1, 1:-1] = 0
    i, j = np.nonzero(fp)
    i = i - m
    j = j - m
    footprint_ix.append(i)
    footprint_iy.append(j)

for x in range(0, rows):
    for y in range(0, cols):
        if disp[x, y] == np.inf:
            continue
        else:
            tv_m = 0
            for m_i in range(0, m_classes - 1):
                m = m_i + 1
                try:
                    tv_m += np.sum(tv[footprint_ix[m_i] + x, footprint_iy[m_i] + y]) / (8 * m)
                except IndexError:
                    tv_m = np.inf
                if tv_m >= TAU:
                    tv_classes[x, y] = m
                    break
                if m == m_classes - 1:
                    tv_classes[x, y] = m
I am trying to optimize a snippet that gets called a lot (millions of times) so any type of speed improvement (hopefully removing the for-loop) would be great.
I am computing a correlation function of some j'th particle with all others
C_j(|r-r'|) = sqrt(E((s_j(r')-s_k(r))^2)) averaged over k.
My idea is to have a variable corrfun which bins data into some bins (the r, defined elsewhere). I find what bin of r each s_k belongs to, and this is stored in ind. So ind[0] is the index of r (and thus of corrfun) to which the j=0 point corresponds. Multiple points can fall into the same bin (in fact I want bins to be big enough to contain multiple points), so I sum together all of the (s_j(r')-s_k(r))^2 and then divide by the number of points in that bin (stored in variable rw). The code I ended up making for this is the following (np is numpy):
for k, v in enumerate(ind):
    if j == k:
        continue
    corrfun[v] += (s[k] - s[j])**2
    rw[v] += 1
rw2 = rw.copy()  # copy, so the returned rw still distinguishes 0 from 1
rw2[rw < 1] = 1
corrfun = np.sqrt(np.divide(corrfun, rw2))
Note, the rw2 business was because I want to avoid divide by 0 problems but I do return the rw array and I want to be able to differentiate between the rw=0 and rw=1 elements. Perhaps there is a more elegant solution for this as well.
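One possibly more elegant variant would be to divide only where the count is non-zero (shown here just as a sketch):

import numpy as np

# divide only where rw > 0; empty bins stay at 0 and rw keeps its true counts
corrfun = np.sqrt(np.divide(corrfun, rw, out=np.zeros_like(corrfun), where=rw > 0))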
Is there a way to make the for-loop faster? While I would like to not add the self interaction (j==k) I am even ok with having self interaction if it means I can get significantly faster calculation (length of ind ~ 1E6 so self interaction is probably insignificant anyways).
Thank you!
Ilya
Edit:
Here is the full code. Note, in the full code I am averaging over j as well.
import numpy as np

def twopointcorr(x, y, s, dr):
    width = np.max(x) - np.min(x)
    height = np.max(y) - np.min(y)
    n = len(x)
    maxR = np.sqrt((width/2)**2 + (height/2)**2)

    r = np.arange(0, maxR, dr)
    print(r)
    corrfun = r*0
    rw = r*0
    print(maxR)

    ''' go through all points'''
    for j in range(0, n-1):
        hypot = np.sqrt((x[j]-x)**2 + (y[j]-y)**2)
        ind = [np.abs(r - h).argmin() for h in hypot]
        for k, v in enumerate(ind):
            if j == k:
                continue
            corrfun[v] += (s[k] - s[j])**2
            rw[v] += 1

    rw2 = rw.copy()  # copy, so the returned rw keeps the true counts
    rw2[rw < 1] = 1
    corrfun = np.sqrt(np.divide(corrfun, rw2))
    return r, corrfun, rw
I test it the following way:
from twopointcorr import twopointcorr
import numpy as np
import matplotlib.pyplot as plt
import time

n = 1000
x = np.random.rand(n)
y = np.random.rand(n)
s = np.random.rand(n)

print('running two point corr function')
start_time = time.time()
r, corrfun, rw = twopointcorr(x, y, s, 0.1)
print("--- Execution time is %s seconds ---" % (time.time() - start_time))

fig1 = plt.figure()
plt.plot(r, corrfun, '-x')
fig2 = plt.figure()
plt.plot(r, rw, '-x')
plt.show()
Again, the main issue is that in the real dataset n~1E6. I can resample to make it smaller, of course, but I would love to actually crank through the dataset.
Here is code that uses broadcasting, hypot, round, and bincount to remove all the loops:
def twopointcorr2(x, y, s, dr):
    width = np.max(x) - np.min(x)
    height = np.max(y) - np.min(y)
    n = len(x)
    maxR = np.sqrt((width/2)**2 + (height/2)**2)
    r = np.arange(0, maxR, dr)

    osub = lambda x: np.subtract.outer(x, x)

    ind = np.clip(np.round(np.hypot(osub(x), osub(y)) / dr), 0, len(r)-1).astype(int)
    rw = np.bincount(ind.ravel())
    rw[0] -= len(x)
    corrfun = np.bincount(ind.ravel(), (osub(s)**2).ravel())
    return r, corrfun, rw
To compare, I modified your code as follows:
def twopointcorr(x, y, s, dr):
    width = np.max(x) - np.min(x)
    height = np.max(y) - np.min(y)
    n = len(x)
    maxR = np.sqrt((width/2)**2 + (height/2)**2)
    r = np.arange(0, maxR, dr)
    corrfun = r*0
    rw = r*0

    for j in range(0, n):
        hypot = np.sqrt((x[j]-x)**2 + (y[j]-y)**2)
        ind = [np.abs(r - h).argmin() for h in hypot]
        for k, v in enumerate(ind):
            if j == k:
                continue
            corrfun[v] += (s[k] - s[j])**2
            rw[v] += 1
    return r, corrfun, rw
and here is the code to check the results:
import numpy as np

n = 1000
x = np.random.rand(n)
y = np.random.rand(n)
s = np.random.rand(n)

r1, corrfun1, rw1 = twopointcorr(x, y, s, 0.1)
r2, corrfun2, rw2 = twopointcorr2(x, y, s, 0.1)

assert np.allclose(r1, r2)
assert np.allclose(corrfun1, corrfun2)
assert np.allclose(rw1, rw2)
and the %timeit results:
%timeit twopointcorr(x,y,s,0.1)
%timeit twopointcorr2(x,y,s,0.1)
outputs:
1 loop, best of 3: 5.16 s per loop
10 loops, best of 3: 134 ms per loop
Your original code on my system runs in about 5.7 seconds. I fully vectorized the inner loop and got it to run in 0.39 seconds. Simply replace your "go through all points" loop with this:
import scipy.spatial.distance

points = np.column_stack((x, y))
hypots = scipy.spatial.distance.cdist(points, points)
inds = np.rint(hypots.clip(max=maxR) / dr).astype(int)

# go through all points
for j in range(n):  # n.b. previously n-1, not sure why
    ind = inds[j]
    np.add.at(corrfun, ind, (s - s[j])**2)
    np.add.at(rw, ind, 1)
    rw[ind[j]] -= 1  # subtract self
The first observation was that your hypot code was computing 2D distances, so I replaced that with cdist from SciPy to do it all in a single call. The second was that the inner for loop was slow, and thanks to an insightful comment from @hpaulj I vectorized that as well using np.add.at().
Since you asked how to vectorize the inner loop as well, I did that later. It now takes 0.25 seconds to run, for a total speedup of over 20x. Here's the final code:
points = np.column_stack((x, y))
hypots = scipy.spatial.distance.cdist(points, points)
inds = np.rint(hypots.clip(max=maxR) / dr).astype(int)

sn = np.tile(s, (n, 1))       # n copies of s
diffs = (sn - sn.T)**2        # squares of pairwise differences
np.add.at(corrfun, inds, diffs)
rw = np.bincount(inds.flatten(), minlength=len(r))
np.subtract.at(rw, inds.diagonal(), 1)  # subtract self
This uses more memory but does produce a substantial speedup vs. the single-loop version above.
OK, so as it turns out, outer products are incredibly memory expensive. However, using the answers from @HYRY and @JohnZwinck I was able to make code that is still roughly linear in n in memory and computes fast (0.5 seconds for the test case):
import numpy as np

def twopointcorr(x, y, s, dr, maxR=-1):
    width = np.max(x) - np.min(x)
    height = np.max(y) - np.min(y)
    n = len(x)
    if maxR < dr:
        maxR = np.sqrt((width/2)**2 + (height/2)**2)

    r = np.arange(0, maxR + dr, dr)
    corrfun = r*0
    rw = r*0

    for j in range(0, n):
        ind = np.clip(np.round(np.hypot(x[j]-x, y[j]-y) / dr), 0, len(r)-1).astype(int)
        np.add.at(corrfun, ind, (s - s[j])**2)
        np.add.at(rw, ind, 1)

    rw[0] -= n
    corrfun = np.sqrt(np.divide(corrfun, np.maximum(rw, 1)))

    r = np.delete(r, -1)
    rw = np.delete(rw, -1)
    corrfun = np.delete(corrfun, -1)
    return r, corrfun, rw