I have an 3D array and need to iterate over it, extract a 2x2x2 voxel large region and check if any voxel is non-zero. Of these locations, I need the unique elements of the region and the index:
import time
import numpy as np
np.random.seed(1234)
def _naive_iterator(array):
lookup = np.pad(array, (1, 1), 'constant') # Border is filled with 0
nx, ny, nz = lookup.shape
for i in range(nx - 1):
for j in range(ny - 1):
for k in range(nz - 1):
n = lookup[i:i + 2, j:j + 2, k:k + 2]
if n.any(): # check if any value in the region is non-zero
yield n.ravel(), i, j, k
# yield set(n.ravel()), i, j, k # `set()` alone takes some time - for testing purposes exclude this
# arrays that shall be used here are in the region of (1000, 1000, 1000) or larger
# arrays are asserted to contain only integer values >= 0.
array = np.random.randint(0, 2, (200, 200, 200), dtype=np.uint8)
for fun in (_naive_iterator, ):
print(f"* {fun}")
for _ in range(2):
tic = time.time()
[x for x in fun(array)]
print(f" ** execution took {time.time() - tic}")
On my PC, this loop takes about 24s to run. (Interesting sidenote: without the n.any(), the loop needs only 8s, so maybe there is some optimization potential as well?)
I thought about how I could make this faster, potentially by running it in parallel. But, I can not figure out how I could do that, without pre-generating all the 2x2x2 arrays.
I also thought about using scipy.ndimage.generic_filter but with that I can only get an image which has for example 1 on all pixels that I want to include, but I would had to iterate over the original image to get n.ravel() (Ideally, one would use generic_filter directly, but I can not get the index inside the called function).
How can I speed up this loop, potentially by parallelizing the iteration?
without the n.any(), the loop needs only 8s, so maybe there is some optimization potential as well?
This is because Numpy function have a big overhead for very small arrays like 2x2x2. The overhead of a Numpy function is about few microseconds while the actual n.any() computation should take no more than a dozen of nanoseconds on a mainstream processor. The usual solution is to vectorize the operation so to avoid many Numpy calls. You can use Numba to speed up this code and removes most of the CPython/Numpy overheads. Note that Numba does not support all function like pad currently so a workaround is needed. Here is the resulting code:
import time
import numpy as np
import numba as nb
np.random.seed(1234)
#nb.njit('(uint8[:,:,::1],)')
def numba_iterator(lookup):
nx, ny, nz = lookup.shape
for i in range(nx - 1):
for j in range(ny - 1):
for k in range(nz - 1):
n = lookup[i:i + 2, j:j + 2, k:k + 2]
if n.any():
yield n.ravel(), i, j, k
array = np.random.randint(0, 2, (200, 200, 200), dtype=np.uint8)
for fun in (numba_iterator, ):
print(f"* {fun}")
for _ in range(2):
tic = time.time()
lookup = np.pad(array, (1, 1), 'constant') # Border is filled with 0
[x for x in fun(lookup)]
print(f" ** execution took {time.time() - tic}")
This is significantly times faster on my machine (but still quite slow).
I thought about how I could make this faster, potentially by running it in parallel.
This is not possible as long as the yield is used since generators are inherently sequential.
How can I speed up this loop
One solution could be to generate the whole output as a Numpy array in Numba so to avoid the creation of 8 million Numpy objects stored in a CPython list which is the main source of slowdown of the code once optimized with Numba (each call to n.ravel creates a new array). Note that generators are generally slow since they often requires a context-switch (of a kind of lightweight-thread / coroutine). The best solution in term of performance is to compute data on-the-fly in the loop.
Additionally, n.any and n.ravel can be manually rewritten in Numba so to be more efficient. Indeed, the n array views are very small and using 3 nested loops with a constant compile-time bound help the compiler to produce a fast code (ie. it can unroll the loops and generate only few instructions the processor can execute very efficiently).
Here is a modified improved code (that compute the padded array manually):
#nb.njit('(uint8[:,:,::1],)')
def fast_compute(array):
nx, ny, nz = array.shape
# Padding (with zeros)
lookup = np.zeros((nx+2, ny+2, nz+2), dtype=np.uint8)
for i in range(nx):
for j in range(ny):
for k in range(nz):
lookup[i+1, j+1, k+1] = array[i, j, k]
# Actual computation
size = (nx + 1) * (ny + 1) * (nz + 1)
result = np.empty((size, 8), dtype=np.uint8)
indices = np.empty((size, 3), dtype=np.uint32)
cur = 0
for i in range(nx + 1):
for j in range(ny + 1):
for k in range(nz + 1):
n = lookup[i:i+2, j:j+2, k:k+2]
# Fast manual n.any()
found = False
for i2 in range(2):
for j2 in range(2):
for k2 in range(2):
found |= n[i2, j2, k2]
if found:
# Fast manual n.ravel()
cur2 = 0
for i2 in range(2):
for j2 in range(2):
for k2 in range(2):
result[cur, cur2] = n[i2, j2, k2]
cur2 += 1
indices[cur, 0] = i
indices[cur, 1] = j
indices[cur, 2] = k
cur += 1
return result[:cur].reshape(cur,2,2,2), indices[:cur]
The resulting code is quite big, but this is the price to pay for high performance computing.
As pointed out by #norok2, result[:cur] and indices[:cur] are views referencing arrays. The view can be quite small compared to the allocated arrays. If this is a problem, you can return a copy (eg. result[:cur].copy()) so to avoid a possible memory overconsumption. In practice, it should not be a problem since the array is allocated in virtual memory and only the written pages are mapped in physical memory on mainstream systems (eg. Windows & Linux). Page of virtual memory are only mapped to physical memory during the first touch (ie. when items are written for the first time). Modern platforms can allocate huge amount of virtual memory (eg. 131072 GiB on my mainstream x86-64 Windows, and even more on mainstream x86-64 Linux) while the physical memory is much more scarce (eg. 16 GiB on my machine). The underlying array is freed when there is no view referencing it anymore.
Benchmark
_naive_iterator: 21.25 s
numba_iterator: 8.10 s
get_windows_and_indices: 1.35 s
fast_compute: 0.13 s
The last Numba function is 163 times faster than the initial one and 10 times faster than the vectorized Numpy implementation of #flawr.
The Numba implementation could certainly be multi-threaded, but it is not easy to do since threads need to write the output and the location of the written items (ie. cur) is dependent of the other threads. Moreover, it would make the code significantly more complex.
Whenever you're working with numpy, you should try to avoid explicit loops. These loops are written in python and therefore usually slower than anything you can do with vectorization. That way you defer the looping to the underlying C functions that are pretty much as fast as anything can be. So I would approach your problem with something like the following. This function does roughly the same thing as your _naive_iterator but in a vectorized manner without any python loops:
from numpy.lib.stride_tricks import sliding_window_view
def get_windows_and_indices(array):
lookup = np.pad(array, (1, 1), 'constant') # Border is filled with 0
nx, ny, nz = lookup.shape
x, y, z = np.mgrid[0:nx, 0:ny, 0:nz]
lookup = np.stack([lookup, x, y, z])
out = sliding_window_view(lookup, (2, 2, 2), axis=(1, 2, 3)).reshape(4, -1, 2, 2, 2)
windows = out[0, ...]
ic = out[1, ..., 0, 0, 0]
jc = out[2, ..., 0, 0, 0]
kc = out[3, ..., 0, 0, 0]
mask = windows.any(axis=(1, 2, 3))
return windows[mask], ic[mask], jc[mask], kc[mask]
Of course you will also need to think abou the rest of the code a little bit differently but vectorization is really something you need to get used to if you want to (efficiently) work with numpy.
Also I'm pretty sure that even this function above is not optimal and can definitely be improved further.
The simplest approach to speed up your code while retaining the features is with Numba. I assume the padding to be essentially a decorating step, and I will deal with it separately at end of the answer.
Here is a cleaner implementation of the originally proposed code and the naïve Numba acceleration:
import numpy as np
import numba as nb
def i_cubicles_3D_set_OP(arr, size=2):
nx, ny, nz = arr.shape
nx += 1 - size
ny += 1 - size
nz += 1 - size
for i in range(nx):
for j in range(ny):
for k in range(nz):
window = arr[i:i + size, j:j + size, k:k + size]
if window.any():
yield set(window.ravel()), (i, j, k)
i_cubicles_3D_set_OP_nb = nb.njit(i_cubicles_3D_set_OP)
i_cubicles_3D_set_OP_nb.__name__ = "i_cubicles_3D_set_OP_nb"
If one is interested is a dimension-agnostic version of it (which comes at the cost of some speed) one could write:
def i_cubicles_set_nb(arr, size=2):
window = (size,) * arr.ndim
window_size = size ** arr.ndim
reduced_shape = tuple(dim - size + 1 for dim, size in zip(arr.shape, window))
view = np.lib.stride_tricks.as_strided(
arr, shape=reduced_shape + window, strides=arr.strides * 2, writeable=False)
return _i_cubicles_set_nb(view.reshape((-1, window_size)), reduced_shape)
#nb.njit
def unravel_index(x, shape):
result = np.zeros(len(shape), dtype=np.int_)
for i, dim in enumerate(shape[::-1], 1):
result[-i] = x % dim
x //= dim
return result
#nb.njit
def not_only_zeros(seq):
# assumes seq is not empty
count = 0
for x in seq:
if x == 0:
count += 1
break # because only unique values
return len(seq) != count
#nb.njit
def _i_cubicles_set_nb(arr, shape):
for i, x in enumerate(arr):
uniques = set(x)
if not_only_zeros(uniques):
yield uniques, unravel_index(i, shape)
This introduces the important trick of generating a strided (read-only) view of the input, which can be used to simplify conceptually all the looping, at the cost of having to manually unravel the index.
This is a similar idea as the one proposed in #flawr's answer.
On a 50³-sized index, I get the following timings:
np.random.seed(42)
n = 50
arr = np.random.randint(0, 3, (n, n, n), dtype=np.uint8)
def is_equal_i_set(a, b):
return all(x[0] == y[0] and np.allclose(x[1], y[1]) for x, y in zip(a, b))
funcs = i_cubicles_3d_set_OP_nb, i_cubicles_3d_set_OP, i_cubicles_set_nb
base = list(funcs[0](arr))
for func in funcs:
res = list(func(arr))
print(f"{func.__name__:>24} {is_equal_i_set(base, res)!s:>5}", end=' ')
# %timeit -n 1 -r 1 list(func(arr))
%timeit list(func(arr))
# i_cubicles_3d_set_OP_nb True 1 loop, best of 5: 130 ms per loop
# i_cubicles_3d_set_OP True 1 loop, best of 5: 776 ms per loop
# i_cubicles_set_nb True 10 loops, best of 5: 151 ms per loop
Indicating the use of Numba to be quite effective.
No uniques
If one is willing to forego the requirement of returning only unique elements inside a cubicle, replacing them with all the elements inside the cubicles, one does gain some (but not much) speed:
#nb.njit
def i_cubicles_3d_nb(arr, size=2):
nx, ny, nz = arr.shape
nx += 1 - size
ny += 1 - size
nz += 1 - size
for i in range(nx):
for j in range(ny):
for k in range(nz):
window = arr[i:i + size, j:j + size, k:k + size]
if window.any():
yield window.ravel(), (i, j, k)
def i_cubicles_nb(arr, size=2):
window = (size,) * arr.ndim
window_size = size ** arr.ndim
reduced_shape = tuple(dim - size + 1 for dim, size in zip(arr.shape, window))
view = np.lib.stride_tricks.as_strided(
arr, shape=reduced_shape + window, strides=arr.strides * 2, writeable=False)
return _i_cubicles_nb(view.reshape((-1, window_size)), reduced_shape)
#nb.njit
def unravel_index(x, shape):
result = np.zeros(len(shape), dtype=np.int_)
for i, dim in enumerate(shape[::-1], 1):
result[-i] = x % dim
x //= dim
return result
#nb.njit
def any_nb(arr):
for x in arr:
if x:
return True
return False
#nb.njit
def _i_cubicles_nb(arr, shape):
for i, x in enumerate(arr):
if any_nb(x):
yield x, unravel_index(i, shape)
as evidenced by the following benchmark (on the same 50³-sized input as before):
def is_equal_i(a, b):
return all(np.allclose(x[0], y[0]) and np.allclose(x[1], y[1]) for x, y in zip(a, b))
funcs = i_cubicles_3d_nb, i_cubicles_nb
base = list(funcs[0](arr))
for func in funcs:
res = list(func(arr))
print(f"{func.__name__:>24} {is_equal_i(base, res)!s:>5}", end=' ')
# %timeit -n 1 -r 1 list(func(arr))
%timeit list(func(arr))
# print()
# i_cubicles_3d_nb True 10 loops, best of 5: 116 ms per loop
# i_cubicles_nb True 10 loops, best of 5: 125 ms per loop
No yield (and no uniques)
While it is clear that a function matching exactly the OP output can be made faster only with Numba / Cython, a number of fast approaches can be obtained by foregoing some features of the OP code.
In particular, when creating generators, a significant amount of time is spent on creating the actual objects to yield.
The same information can be returned (and most importantly allocated) all at once with substantial speed gain, especially if we skip creating the containers for computing the unique elements.
Once we are accepting to return all elements inside a cubicle instead its unique elements, it is possible to devise also NumPy-only vectorized (fast and dimension-agnostic) approaches, alongside faster Numba (3d-specific) implementations:
def cubicles_np(arr, size=2):
window = (size,) * arr.ndim
window_size = size ** arr.ndim
reduced_shape = tuple(dim - size + 1 for dim, size in zip(arr.shape, window))
view = np.lib.stride_tricks.as_strided(
arr, shape=reduced_shape + window, strides=arr.strides * 2, writeable=False)
mask = np.any(view, axis=tuple(range(-arr.ndim, 0)))
return view[mask, ...], np.array(np.nonzero(mask)).transpose()
def cubicles_tr_np(arr, size=2):
window = (size,) * arr.ndim
window_size = size ** arr.ndim
reduced_shape = tuple(dim - size + 1 for dim, size in zip(arr.shape, window))
view = np.lib.stride_tricks.as_strided(
arr, shape=window + reduced_shape, strides=arr.strides * 2, writeable=False)
mask = np.any(view, axis=tuple(range(arr.ndim)))
return (
view[..., mask].reshape((window_size, -1)).transpose().reshape((-1, *window)),
np.array(np.nonzero(mask)).transpose())
def cubicles_nb(arr, size=2):
window = (size,) * arr.ndim
window_size = size ** arr.ndim
reduced_shape = tuple(dim - size + 1 for dim, size in zip(arr.shape, window))
view = np.lib.stride_tricks.as_strided(
arr, shape=reduced_shape + window, strides=arr.strides * 2, writeable=False)
values, indexes = _cubicles_nb(view.reshape((-1, window_size)), reduced_shape, arr.ndim)
return values.reshape((-1, *window)), indexes
#nb.njit
def any_nb(arr):
for x in arr:
if x:
return True
return False
#nb.njit
def _cubicles_nb(arr, shape, ndim):
n, k = arr.shape
indexes = np.empty((n, ndim), dtype=np.bool_)
result = np.empty((n, k), dtype=arr.dtype)
count = 0
for i in range(n):
x = arr[i]
if any_nb(x):
indexes[count] = unravel_index(i, shape)
result[count] = x
count += 1
return result[:count].copy(), indexes[:count].copy()
#nb.njit
def any_cubicle_3d_nb(arr, size):
for i in range(size):
for j in range(size):
for k in range(size):
if arr[i, j, k]:
return True
return False
#nb.njit
def cubicles_3d_nb(arr, size=2):
nx, ny, nz = arr.shape
nx += 1 - size
ny += 1 - size
nz += 1 - size
nn = nx * ny * nz
indexes = np.empty((nn, 3), dtype=np.bool_)
result = np.empty((nn, size, size, size), dtype=arr.dtype)
count = 0
for i in range(nx):
for j in range(ny):
for k in range(nz):
x = arr[i:i + size, j:j + size, k:k + size]
if any_cubicle_3d_nb(x, size):
result[count] = x
indexes[count] = i, j, k
count += 1
return result[:count].copy(), indexes[:count].copy()
The timings, obtained again on the 50³-sized input, do indicate for the Numba-based approaches that spelling out the loops is significantly faster than looping through a view.
In fact, without explicitly looping along the dimensions, the NumPy-only approaches can be faster than the Numba accelerated one.
Note that cubicles_3d_nb() can be seen essentially as a cleaned-up version of #JérômeRichard's answer.
(Actually, the timing for JérômeRichard's fast_compute() on my machine and input -- with the addition of the extra .copy() -- seem to indicate that cubicles_3d_nb() is more efficient -- possibly because of the short-circuiting in the "any" code, and the lack of need to ravel the values manually).
def is_equal(a, b):
return all(np.allclose(x[0], y[0]) and np.allclose(x[1], y[1]) for x, y in zip(a, b))
funcs = cubicles_3d_nb, cubicles_nb, cubicles_np, cubicles_tr_np
base = funcs[0](arr)
for func in funcs:
res = func(arr)
print(f"{func.__name__:>24} {is_equal(base, res)!s:>5}", end=' ')
%timeit func(arr)
# cubicles_3d_nb True 100 loops, best of 5: 3.82 ms per loop
# cubicles_nb True 10 loops, best of 5: 23 ms per loop
# cubicles_np True 10 loops, best of 5: 24.7 ms per loop
# cubicles_tr_np True 100 loops, best of 5: 16.5 ms per loop
Notes on indexes
If one is to give the result all at once, then the indexes themselves are not particularly efficient to store the information as to where the non-zero cubicles are, unless there are few of them.
Instead, a boolean array is more memory efficient.
The indexing requires index_size * ndim * num (num being the number of non-zero cubicles, bounded to be 0 < num < prod(shape)).
The masking requires bool_size * prod(shape).
For NumPy bool_size = 8 while index_size = 64 (can be tweaked but typically at least 16), so: index_size = bool_size * k.
So the masking is more efficient as long as:
num < prod(shape) // (k * ndim)
For 3D and typical index_size = 64, this means that (num / prod(shape)) < (1 / 24), so that indexing is efficient if non-zero cubicles are ~5% or less.
Speed-wise, using a boolean mask instead of the indexes could lead to implementations that are faster by a small but fair margin (~5 to ~20%) as long as the non-zero cubicles are not too few.
Addendum: Padding
While np.pad() is not supported by Numba, it is quite simple to call any padding function outside of Numba.
Additionally, for some combination of inputs np.pad() is slower then simple assign on a sliced output:
import numpy as np
import numba as nb
#nb.njit
def pad_3d_nb(arr, size=1):
nx, ny, nz = arr.shape
result = np.zeros((nx + 2 * size, ny + 2 * size, nz + 2 * size), dtype=arr.dtype)
for i in range(nx):
for j in range(ny):
for k in range(nz):
result[i + size, j + size, k + size] = arr[i, j, k]
return result
def const_pad(arr, size=1, value=0):
shape = tuple(dim + 2 * size for dim in arr.shape)
mask = tuple(slice(size, dim + size) for dim in arr.shape)
result = np.full(shape, value, dtype=arr.dtype)
result[mask] = arr
return result
np.random.seed(42)
n = 200
k = 10
arr = np.random.randint(0, 3, (n, n, n), dtype=np.uint8)
base = np.pad(arr, (k, k))
print(np.allclose(pad_3d_nb(arr, k), base))
# True
print(np.allclose(const_pad(arr, k), base))
# True
%timeit np.pad(arr, (k, k))
# 100 loops, best of 5: 3.01 ms per loop
%timeit pad_3d_nb(arr, k)
# 100 loops, best of 5: 11.5 ms per loop
%timeit const_pad(arr, k)
# 100 loops, best of 5: 2.53 ms per loop
My problem is the following. I have two arrays X and Y of shape n, p where p >> n (e.g. n = 50, p = 10000).
I also have a mask mask (1-d array of booleans of size p) with respect to p, of small density (e.g. np.mean(mask) is 0.05).
I try to compute, as fast as possible, the inner product of X and Y with respect to mask: the output inner is an array of shape n, n, and is such that inner[i, j] = np.sum(X[i, np.logical_not(mask)] * Y[j, np.logical_not(mask)]).
I have tried using the numpy.ma library, but it is quite slow for my use:
import numpy as np
import numpy.ma as ma
n, p = 50, 10000
density = 0.05
mask = np.array(np.random.binomial(1, density, size=p), dtype=np.bool_)
mask_big = np.ones(n)[:, None] * mask[None, :]
X = np.random.randn(n, p)
Y = np.random.randn(n, p)
X_ma = ma.array(X, mask=mask_big)
Y_ma = ma.array(Y, mask=mask_big)
But then, on my machine, X_ma.dot(Y_ma.T) is about 5 times slower than X.dot(Y.T)...
To begin with, I think it is a problem that .dot does not know that the mask is only with respect to p but I don't if its possible to use this information.
I'm looking for a way to perform the computation without being much slower than the naive dot.
Thanks a lot !
We can use matrix-multiplication with and without the masked versions as the masked subtraction from the full version yields to us the desired output -
inner = X.dot(Y.T)-X[:,mask].dot(Y[:,mask].T)
Or simply use the reversed mask, would be slower though for a sparsey mask -
inner = X[:,~mask].dot(Y[:,~mask].T)
Timings -
In [34]: np.random.seed(0)
...: p,n = 10000,50
...: X = np.random.rand(n,p)
...: Y = np.random.rand(n,p)
...: mask = np.random.rand(p)>0.95
In [35]: mask.mean()
Out[35]: 0.0507
In [36]: %timeit X.dot(Y.T)-X[:,mask].dot(Y[:,mask].T)
100 loops, best of 3: 2.54 ms per loop
In [37]: %timeit X[:,~mask].dot(Y[:,~mask].T)
100 loops, best of 3: 4.1 ms per loop
In [39]: %%timeit
...: inner = np.empty((n,n))
...: for i in range(X.shape[0]):
...: for j in range(X.shape[0]):
...: inner[i, j] = np.sum(X[i, ~mask] * Y[j, ~mask])
1 loop, best of 3: 302 ms per loop
In python I have a three dimensional array T with size (n,n,n), and a 2-D array, with size (n,k).
In linear algebra, the multilinear map defined by T and applied to W, in code would be:
X3 = np.zeros((k,k,k))
for i in xrange(k):
for j in xrange(k):
for t in xrange(k):
for l in xrange(n):
for m in xrange(n)
for h in xrange(n):
X3[i, j, t] += M3[l, m, h] * W[l, i] * W[m, j] * W[h, t]
See
https://en.wikipedia.org/wiki/Multilinear_map
For a reference.
This is very slow. I am wondering if there exist any alternative or any pre build function in numpy that can speed up the operations.
Here's an approach using a series of dot-products -
# Get partial products and thus reach to final output
p1 = np.tensordot(W,M3,axes=(0,0))
p2 = np.tensordot(p1,W,axes=(1,0))
X3out = np.tensordot(p2,W,axes=(1,0))
Einstein summation convention? Use np.einsum!
X3 = np.einsum('lmh,li,mj,ht->ijt', M3, W, W, W)
EDIT:
Out of curiosity, I just ran some benchmarks comparing np.einsum to Divakar's approach. The difference is hugely against np.einsum!
import numpy as np
def approach1(a, b):
x = np.einsum('lmh,li,mj,ht->ijt', a, b, b, b)
def approach2(a, b):
p1 = np.tensordot(b, a, axes=(0,0))
p2 = np.tensordot(p1, b, axes=(1,0))
x = np.tensordot(p2, b, axes=(1,0))
n = 100
k = 10
a = np.random.random((n, n, n))
b = np.random.random((n, k))
%timeit approach1(a, b) # => 1 loop, best of 3: 26 s per loop
%timeit approach2(a, b) # => 100 loops, best of 3: 4.23 ms per loop
There's some discussion about it in this question. It all seems to come down to the generality that np.einsum tries to achieve — at the cost of being able to offload computations to low-level linear algebra packages.
I want to calculate the squared euclidean distance between two sets of points, inputs and testing. inputs is typically a real array of size ~(200, N), whereas testing is typically ~(1e8, N), and N is around 10. The distances should be scaled in each dimension in N, so I'd be aggregating the expression scale[j]*(inputs[i,j] - testing[ii,j])**2 (where scale is the scaling vector) over N times. I am trying to make this as fast as possible, particularly as N can be large. My first test is
def old_version (inputs, testing, x0):
nn, d1 = testing.shape
n, d1 = inputs.shape
b = np.zeros((n, nn))
for d in xrange(d1):
b += x0[d] * (((np.tile(inputs[:, d], (nn, 1)) -
np.tile (testing[:, d], (n, 1)).T))**2).T
return b
Nothing too fancy. I then tried using scipy.spatial.distance.cdist, although I still have to loop through it to get the scaling right
def new_version (inputs, testing, x0):
# import scipy.spatial.distance as dist
nn, d1 = testing.shape
n, d1 = inputs.shape
b = np.zeros ((n, nn))
for d in xrange(d1):
b += x0[d] * dist.cdist(inputs[:, d][:, None],
testing[:, d][:, None], 'sqeuclidean')
return b
It would appear that new_version scales better (as N > 1000), but I'm not sure that I've gone as fast as possible here. Any further ideas much appreciated!
This code gave me a factor of 10 over your implementation, give it a try:
x = np.random.randn(200, 10)
y = np.random.randn(1e5, 10)
scale = np.abs(np.random.randn(1, 10))
scale_sqrt = np.sqrt(scale)
dist_map = dist.cdist(x*scale_sqrt, y*scale_sqrt, 'sqeuclidean')
These are the test results:
In [135]: %timeit suggested_version(inputs, testing, x0)
1 loops, best of 3: 341 ms per loop
In [136]: %timeit op_version(inputs, testing, x00) (NOTICE: x00 is a reshape of x0)
1 loops, best of 3: 3.37 s per loop
Just make sure than when you go for the larger N you don't get low on memory. It can really slow things down.
A numerical integration is taking exponentially longer than I expect it to. I would like to know if the way that I implement the iteration over the mesh could be a contributing factor. My code looks like this:
import numpy as np
import itertools as it
U = np.linspace(0, 2*np.pi)
V = np.linspace(0, np.pi)
for (u, v) in it.product(U,V):
# values = computation on each grid point, does not call any outside functions
# solution = sum(values)
return solution
I left out the computations because they are long and my question is specifically about the way that I have implemented the computation over the parameter space (u, v). I know of alternatives such as numpy.meshgrid; however, these all seem to create instances of (very large) matrices, and I would guess that storing them in memory would slow things down.
Is there an alternative to it.product that would speed up my program, or should I be looking elsewhere for the bottleneck?
Edit: Here is the for loop in question (to see if it can be vectorized).
import random
import numpy as np
import itertools as it
##########################################################################
# Initialize the inputs with random (to save space)
##########################################################################
mat1 = np.array([[random.random() for i in range(3)] for i in range(3)])
mat2 = np.array([[random.random() for i in range(3)] for i in range(3)])
a1, a2, a3 = np.array([random.random() for i in range(3)])
plane_normal = np.array([random.random() for i in range(3)])
plane_point = np.array([random.random() for i in range(3)])
d = np.dot(plane_normal, plane_point)
truthval = True
##########################################################################
# Initialize the loop
##########################################################################
N = 100
U = np.linspace(0, 2*np.pi, N + 1, endpoint = False)
V = np.linspace(0, np.pi, N + 1, endpoint = False)
U = U[1:N+1] V = V[1:N+1]
Vsum = 0
Usum = 0
##########################################################################
# The for loops starts here
##########################################################################
for (u, v) in it.product(U,V):
cart_point = np.array([a1*np.cos(u)*np.sin(v),
a2*np.sin(u)*np.sin(v),
a3*np.cos(v)])
surf_normal = np.array(
[2*x / a**2 for (x, a) in zip(cart_point, [a1,a2,a3])])
differential_area = \
np.sqrt((a1*a2*np.cos(v)*np.sin(v))**2 + \
a3**2*np.sin(v)**4 * \
((a2*np.cos(u))**2 + (a1*np.sin(u))**2)) * \
(np.pi**2 / (2*N**2))
if (np.dot(plane_normal, cart_point) - d > 0) == truthval:
perp_normal = plane_normal
f = np.dot(np.dot(mat2, surf_normal), perp_normal)
Vsum += f*differential_area
else:
perp_normal = - plane_normal
f = np.dot(np.dot(mat2, surf_normal), perp_normal)
Usum += f*differential_area
integral = abs(Vsum) + abs(Usum)
If U.shape == (nu,) and (V.shape == (nv,), then the following arrays vectorize most of your calculations. With numpy you get the best speed by using arrays for the largest dimensions, and looping on the small ones (e.g. 3x3).
Corrected version
A = np.cos(U)[:,None]*np.sin(V)
B = np.sin(U)[:,None]*np.sin(V)
C = np.repeat(np.cos(V)[None,:],U.size,0)
CP = np.dstack([a1*A, a2*B, a3*C])
SN = np.dstack([2*A/a1, 2*B/a2, 2*C/a3])
DA1 = (a1*a2*np.cos(V)*np.sin(V))**2
DA2 = a3*a3*np.sin(V)**4
DA3 = (a2*np.cos(U))**2 + (a1*np.sin(U))**2
DA = DA1 + DA2 * DA3[:,None]
DA = np.sqrt(DA)*(np.pi**2 / (2*Nu*Nv))
D = np.dot(CP, plane_normal)
S = np.sign(D-d)
F1 = np.dot(np.dot(SN, mat2.T), plane_normal)
F = F1 * DA
#F = F * S # apply sign
Vsum = F[S>0].sum()
Usum = F[S<=0].sum()
With the same random values, this produces the same values. On a 100x100 case, it is 10x faster. It's been fun playing with these matrices after a year.
In ipython I did simple sum calculations on your 50 x 50 gridspace
In [31]: sum(u*v for (u,v) in it.product(U,V))
Out[31]: 12337.005501361698
In [33]: UU,VV = np.meshgrid(U,V); sum(sum(UU*VV))
Out[33]: 12337.005501361693
In [34]: timeit UU,VV = np.meshgrid(U,V); sum(sum(UU*VV))
1000 loops, best of 3: 293 us per loop
In [35]: timeit sum(u*v for (u,v) in it.product(U,V))
100 loops, best of 3: 2.95 ms per loop
In [38]: timeit list(it.product(U,V))
1000 loops, best of 3: 213 us per loop
In [45]: timeit UU,VV = np.meshgrid(U,V); (UU*VV).sum().sum()
10000 loops, best of 3: 70.3 us per loop
# using numpy's own sum is even better
product is slower (by factor 10), not because product itself is slow, but because of the point by point calculation. If you can vectorize your calculations so they use the 2 (50,50) arrays (without any sort of looping) it should speed up the overall time. That's the main reason for using numpy.
[k for k in it.product(U,V)] runs in 2ms for me, and the itertool package is made to be efficient, e.g. it does not create a long array first (http://docs.python.org/2/library/itertools.html).
The culprit seems to be your code inside the iteration, or your using a lot of points in linspace.