A matrix/vector multiplication looks like this and works fine:
A = np.array([[1, 0],
[0,1]]) # matrix
u = np.array([[3],[9]]) # column vector
U = np.dot(A,u) # U = A*u
U
Is there a way to keep the same structure if the elements Xx,Xy,Yx,Yy, u,v are 2- or 3-dimensional arrays each ?
A = np.array([[Xx, Xy],
[Yx,Yy]])
u = np.array([[u],[v]])
U = np.dot(A,u)
The desired result is the following, which works fine (so an answer to my question is rather 'nice to have'):
U = Xx*u + Xy*v
V = Yx*u + Yy*v
Success Story
Thanks to @loopy walt and to @Ivan!
1. It works:
np.einsum('ijmn,jkmn->ikmn', A, w) and (w.T @ A.T).T both reproduce the correct contravariant coordinate transformation.
2. Minor issue:
np.einsum('ijmn,jkmn->ikmn', A, w) and (w.T @ A.T).T both return an array with one pair of square brackets too many (it should be 2-dimensional). See @Ivan's answer for the reason:
array([[[ 0, 50, 200],
[ 450, 800, 1250],
[1800, 2450, 3200]]])
3. To optimize: building the block matrix is time-consuming:
A = np.array([[Xx, Xy],
[Yx,Yy]])
w = np.array([[u],[v]])
So the overall run time of np.einsum('ijmn,jkmn->ikmn', A, w) and (w.T @ A.T).T is larger. Perhaps this can be optimized.
- p_einstein np.einsum('ijmn,jkmn->ikmn', A, w) 9.07 µs ± 267 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
- p_tensor: (w.T @ A.T).T 7.89 µs ± 225 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
- p_vect: [Xx*u1 + Xy*v1, Yx*u1 + Yy*v1] 2.63 µs ± 173 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
- p_vect: including A = np.array([[Xx, Xy],[Yx,Yy]]) 7.66 µs ± 290 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
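Regarding the extra pair of brackets from point 2: the result has shape (i, k, m, n) with k = 1, so indexing that axis away should give plain 2-dimensional U and V. A minimal sketch, assuming made-up sample blocks (the values of Xx, Xy, Yx, Yy, u, v below are hypothetical):

```python
import numpy as np

# Hypothetical sample blocks; the values are made up for illustration.
m, n = 3, 3
Xx, Xy, Yx, Yy = (np.full((m, n), c) for c in (1, 2, 3, 4))
u = np.arange(m * n).reshape(m, n)
v = np.ones((m, n), dtype=int)

A = np.array([[Xx, Xy], [Yx, Yy]])   # shape (2, 2, m, n)
w = np.array([[u], [v]])             # shape (2, 1, m, n)

out = np.einsum('ijmn,jkmn->ikmn', A, w)   # shape (2, 1, m, n)
U, V = out[:, 0]                           # drop the length-1 k axis

print(out.shape, U.shape)   # (2, 1, 3, 3) (3, 3)
```

Here U and V are views into out, so no extra copy is made.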
This is essentially a matrix block operation which you can generalize from a simple matrix operation. Your example is good because you defined your block matrices as such (I renamed them A_p and u_p to not conflict with matrix u):
A_p = np.array([[Xx, Xy],
[Yx, Yy]])
u_p = np.array([[u], [v]])
If you look at a standard matrix operation, in terms of shapes you have (r, s) meets (s, t) which ultimately results in a matrix of shape (r, t). The 's' dimension gets reduced in the process. If you're not yet familiar with the Einstein Summation, and np.einsum, you should take a quick look. Given matrices M and N, the matrix operation would look like this:
out = np.zeros((r, t))
for i in range(r):
    for j in range(s):
        for k in range(t):
            out[i, k] += M[i, j]*N[j, k]
This becomes very intuitive when you come to realize how the operation takes place: covering the entire multi-dimension space, we accumulate products on the output matrix.
With np.einsum, this would be very similar indeed (here M=A, and N=u):
>>> np.einsum('ij,jk->ik', A, u)
array([[3],
[9]])
It is useful to think with coordinates: i, j, and k rather than dimension sizes: r, s, and t respectively, just like when using np.einsum above. So I'll keep the former notation for the rest of the answer.
Now looking at the block matrices, the input matrices contain blocks of matrices themselves. Matrix M has shape (i, j, (m, n)) (i.e. (i, j, m, n)), where (m, n) refers to the shape of the blocks. Similarly, for N we have (j, k, m, n). We've basically replaced the scalar of shape () with a block of shape (m, n) (2x2 in this example), that's all.
Therefore you can directly infer that the A_p*u_p operation is:
>>> np.einsum('ijmn,jkmn->ikmn', A_p, u_p)
array([[[[ 0, 50],
[200, 450]]],
[[[ 0, 110],
[440, 990]]]])
Which you can compare with the 'hand-made' result:
>>> np.stack((Xx*u + Xy*v, Yx*u + Yy*v))[:, None]
Notice this can be used for any (m, n) block and any matrix size.
You can now try implementing A_p*u_p with a loop just like A*u.
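One possible way to write that loop, as a sketch rather than a definitive implementation: the helper name block_matmul and the random sample data are assumptions for illustration only.

```python
import numpy as np

def block_matmul(M, N):
    """Loop version of np.einsum('ijmn,jkmn->ikmn', M, N)."""
    r, s = M.shape[:2]
    s2, t = N.shape[:2]
    assert s == s2 and M.shape[2:] == N.shape[2:]
    out = np.zeros((r, t) + M.shape[2:], dtype=np.result_type(M, N))
    for i in range(r):
        for j in range(s):
            for k in range(t):
                # Blocks are combined element-wise, not with a matrix product.
                out[i, k] += M[i, j] * N[j, k]
    return out

# Made-up sample data with (m, n) = (3, 4) blocks.
rng = np.random.default_rng(0)
A_p = rng.integers(0, 10, (2, 2, 3, 4))
u_p = rng.integers(0, 10, (2, 1, 3, 4))

print(np.array_equal(block_matmul(A_p, u_p),
                     np.einsum('ijmn,jkmn->ikmn', A_p, u_p)))
```

The only difference from the scalar loop is that each out[i, k] is now an (m, n) block and the product M[i, j] * N[j, k] is taken element-wise.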
Edit: Expanding on @loopy walt's answer who showed you can achieve the same result with:
>>> (u_p.T @ A_p.T).T
Indeed __matmul__ can handle multi-dimensional arrays, and will do so in the following manner (for a number of dimensions higher than 2): both inputs are treated as a stack of matrices residing in the last two indexes. The input shapes are (i, j, m, n) and (j, k, m, n). In order to use __matmul__, we need to pull the block dimensions, namely (m, n), into the first positions, and leave the matrix dimensions, (i, j) and (j, k) respectively, last. The following steps need to be taken:
Transposing has the effect of inverting the order of dimensions: (n, m, j, i) and (n, m, k, j) respectively.
We then swap the two operands so that we compute something of the form (*, k, j) x (*, j, i) = (*, k, i), where * = (n, m).
Transposing the output leads to a shape of (i, k, m, n).
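The shape bookkeeping in these steps can be checked directly. This is a small sketch with made-up sizes (i, j, k, m, n are arbitrary choices, and A_p, u_p are filled with random values):

```python
import numpy as np

rng = np.random.default_rng(0)
i, j, k, m, n = 2, 2, 1, 3, 4        # made-up sizes
A_p = rng.random((i, j, m, n))
u_p = rng.random((j, k, m, n))

print(A_p.T.shape)                  # (4, 3, 2, 2)  -> (n, m, j, i)
print(u_p.T.shape)                  # (4, 3, 1, 2)  -> (n, m, k, j)
prod = u_p.T @ A_p.T                # stack of (k, j) @ (j, i) products
print(prod.shape)                   # (4, 3, 1, 2)  -> (n, m, k, i)
print(prod.T.shape)                 # (2, 1, 3, 4)  -> (i, k, m, n)
```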
Borrowing A_p and u_p from @Ivan I'd like to advertise the following much simpler approach that makes use of __matmul__'s inbuilt broadcasting. All we need to do is reverse the order of dimensions:
(u_p.T @ A_p.T).T
This gives the same result as Ivan's einsum but is a bit more flexible, giving us full broadcasting without us having to figure out the corresponding format string every time.
np.allclose(np.einsum('ijmn,jkmn->ikmn', A_p, u_p), (u_p.T @ A_p.T).T)
# True
Related
I have a 2x2 reference tensor and a batch of candidate 2x2 tensors. I would like to find the closest candidate tensor to the reference tensor by summed euclidean distance over the identically indexed (except for the batch index) elements.
For example:
ref = torch.as_tensor([[1, 2], [3, 4]])
candidates = torch.rand(100, 2, 2)
I would like to find the 2x2 tensor index in candidates that minimizes:
(ref[0][0] - candidates[index][0][0])**2 +
(ref[0][1] - candidates[index][0][1])**2 +
(ref[1][0] - candidates[index][1][0])**2 +
(ref[1][1] - candidates[index][1][1])**2
Ideally, this solution would work for arbitrary dimension reference tensor of size (b, c, d, ...., z) and an arbitrary batch_size of candidate tensors with equal dimensions to the reference tensor (batch_size, b, c, d,..., z)
Elaborating on @ndrwnaguib's answer, it should rather be:
dist = torch.cdist( ref.float().flatten().unsqueeze(0), candidates.flatten(start_dim=1))
print(torch.square( dist ))
torch.argmin( dist )
tensor([[23.3516, 21.8078, 25.5247, 26.3465, 21.3161, 17.7537, 24.1075, 22.4388,
22.7513, 16.8489]])
tensor(9)
other options, worth noting:
torch.square(ref.float()- candidates).sum( dim=(1,2) )
tensor([[23.3516, 21.8078, 25.5247, 26.3465, 21.3161, 17.7537, 24.1075, 22.4388,
22.7513, 16.8489]])
diff = ref.float()- candidates
torch.einsum( "abc,abc->a" ,diff, diff)
tensor([[23.3516, 21.8078, 25.5247, 26.3465, 21.3161, 17.7537, 24.1075, 22.4388,
22.7513, 16.8489]])
The following line returns the index of the tensor in candidates that minimizes the summation of the element-wise Euclidean distance to ref
In [1]: import torch
In [2]: ref = torch.as_tensor([[1, 2], [3, 4]])
...: candidates = torch.rand(100, 2, 2)
In [3]: %timeit torch.argmin(((ref - candidates) ** 2).sum((1, 2)))
16.9 µs ± 350 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
I would like to write a function f that takes an arbitrary function g with the signature g : R^n -> R^n -> int that "lifts" g so that it operates on (R^{nxm}, R^{kxm}) by behaving like a dot product. Meaning I want f to have the signature f : R^{nxm} -> R^{mxk} -> R^{nxk} by applying g to all pairs of rows and columns in constructing a matrix M where M_ij = g(A[i,:], B[:,j]).
Is that possible?
For example scipy.spatial.distance.cosine expects two vectors. Now I would lift cosine with f:
from scipy.spatial.distance import cosine
A = np.random.randint(0, 3, (3,4))
B = np.random.randint(0, 3, (5,4))
cosine_lifted = f(cosine)
cosine_lifted(A, B)
This would then produce the same output as
def sim(A, B):
ignored_states = np.seterr(divide='raise')
return 1 - np.divide(np.dot(A, B.T), np.outer(np.linalg.norm(A, axis=1), np.linalg.norm(B, axis=1)))
Which is the same as sklearn.metrics.pairwise.cosine_similarity plus the 1 - blah part.
But if there were no sklearn.metrics.pairwise.cosine_similarity, I would have to implement this lifted version of cosine myself (which I of course did here...). But I don't want to do that for every function that behaves basically the same as the dot product with regard to how it mechanically processes its arguments. Therefore, I would like to have this f function.
I wrote my other answer assuming your
np.dot(A, B.T)
with a (3,4) and (5,4) inputs was the primary dot functionality that you were trying to emulate. In other words, (3,4), (4,5) => (3,5) with summation on the common size 4 dimension. My answer showed how that 2d calculation can be performed with element-wise multiplications.
For what it's worth, np.dot gets much of its speed by passing the task to BLAS (or similar) optimized libraries. These have been written in C or Fortran, and optimized by generations of numerical-analysis coders.
But your signature description may be talking about a different thing. It's a bit confusing.
g : R^n -> R^n -> int
Does this mean that g(x,y) takes two (n,) shape arrays, and returns an integer? And it can't be generalized to work with 2d arrays?
f : R^{nxm} -> R^{kxm} -> R^{nxm}
Does this mean f(A, B) takes a (n,m) shape, and a (k,m) shape, and returns a (n,m) shape? What happened to the k shape? Is that k a typo?
Alternatively you talk about doing (I believe)
M = np.zeros((N,N)) # (N,M) ok?
for i in range(N):
    for j in range(N):
        x = A[i,:]; y = B[:,j]
        M[i,j] = g(x, y)
alternatively:
M = np.array([[g(x,y) for y in B.T] for x in A])
Assuming g is a Python function that can only work with two 1d arrays (of matching length), and cannot be generalized to 2d arrays, there isn't any mechanism in numpy to compile the above double loop. g has to be evaluated N**2 times. And assuming g is not trivial, those N**2 evaluations will dominate the total evaluation time, not the iteration mechanism.
np.vectorize normally takes a function that accepts scalar inputs, but with a signature parameter it can work with your g:
f = np.vectorize(g, signature='(n),(n)->()')
M = f(A[:, None, :], B.T[None, :, :])
but in my testing vectorize has always been slower than an explicit iteration. With a signature it's even slower. So I kind of hesitate even mentioning it.
Are you asking for a function with a signature as simple as what follows (able to multiply two matrices), or do you want to emulate the entire np.dot api surface?
def lift(f):
    def dot(A, B):
        return np.array([[f(v,w) for w in zip(*B)] for v in A])
    return dot
A major source of inefficiency in the above code is the allocations for all the intermediate lists. Since we know the final return value those are easy to avoid:
def lift(f):
    def dot(A, B):
        result = np.empty((A.shape[0], B.shape[1]))
        for i,v in enumerate(A):
            for j,w in enumerate(zip(*B)):
                result[i,j] = f(v,w)
        return result
    return dot
Loops are fairly expensive in Python, but since f is operating on k elements it seems reasonable to assume that this overhead is small. You could reduce it further by compiling with pypy or cython.
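As a usage sketch for the question's cosine example: the cosine_distance function below is a hand-written stand-in for scipy.spatial.distance.cosine (an assumption, to keep the example NumPy-only), the sample data is made up, and the inner loop iterates over B.T instead of zip(*B) so that f receives arrays rather than tuples.

```python
import numpy as np

def lift(f):
    def dot(A, B):
        result = np.empty((A.shape[0], B.shape[1]))
        for i, v in enumerate(A):
            for j, w in enumerate(B.T):      # columns of B
                result[i, j] = f(v, w)
        return result
    return dot

def cosine_distance(x, y):
    # Hand-written stand-in for scipy.spatial.distance.cosine (assumption).
    return 1.0 - np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

# Made-up data; entries start at 1 so that no row has zero norm.
rng = np.random.default_rng(0)
A = rng.integers(1, 4, (3, 4))
B = rng.integers(1, 4, (5, 4))

lifted = lift(cosine_distance)(A, B.T)
expected = 1 - (A @ B.T) / np.outer(np.linalg.norm(A, axis=1),
                                    np.linalg.norm(B, axis=1))
print(np.allclose(lifted, expected))
```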
matmul has been cast as a ufunc, and formally has a signature. np.dot is an earlier version and doesn't have a signature.
But given 2d arrays, np.dot is effectively a broadcasted form of multiplication followed by summation, or 'sum of products':
In [587]: A = np.arange(12).reshape(3,4)
In [588]: B = np.arange(8).reshape(2,4)
In [589]: np.dot(A, B.T)
Out[589]:
array([[ 14, 38],
[ 38, 126],
[ 62, 214]])
equivalent:
In [591]: (A[:,None,:]*B[None,:,:]).sum(axis=2)
Out[591]:
array([[ 14, 38],
[ 38, 126],
[ 62, 214]])
Some find the einsum style of signature easier to follow:
In [594]: np.einsum('ij,kj->ik', A, B)
Out[594]:
array([[ 14, 38],
[ 38, 126],
[ 62, 214]])
where the repeated j signals dot like summation.
===
Illustrating the iteration in my other answer:
In [601]: def g(x,y):
     ...:     return (x*y).sum()
     ...:
In [602]: A.shape, B.shape
Out[602]: ((3, 4), (2, 4))
In [603]: np.array([[g(x,y) for y in B] for x in A])
Out[603]:
array([[ 14, 38],
[ 38, 126],
[ 62, 214]])
and the vectorize version:
In [614]: f = np.vectorize(g, signature='(n),(n)->()')
In [615]: f(A[:,None,:], B[None,:,:])
Out[615]:
array([[ 14, 38],
[ 38, 126],
[ 62, 214]])
comparative times:
In [616]: timeit f(A[:,None,:], B[None,:,:])
255 µs ± 6.67 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [617]: timeit np.array([[g(x,y) for y in B] for x in A])
69.4 µs ± 116 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [618]: timeit np.dot(A, B.T)
3.15 µs ± 128 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
and using @hans' 2nd lift:
In [623]: h = lift(g)
In [624]: h(A,B.T)
Out[624]:
array([[ 14., 38.],
[ 38., 126.],
[ 62., 214.]])
In [625]: timeit h(A,B.T)
102 µs ± 56.5 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Assume you have a numpy array with shape (a,b,c) and a boolean mask of shape (a,b,c,d).
I would like to apply the mask to the array iterating over the last axis, sum the masked array along the first three axes, and obtain a list (or an array) of length/shape (d,).
I tried to do this with a list comprehension:
Result = [np.sum(Array[Mask[:,:,:,i]], axis=(0,1,2)) for i in range(d)]
It works, but it does not look very pythonic and it is a bit slow as well.
I also tried something like
Array = Array[:,:,:,np.newaxis]
Result = np.sum(Array[Mask], axis=(0,1,2))
but of course this doesn't work, since the dimension of the Mask along the last axis, d, is larger than the dimension of the last axis of the Array, 1.
Also, consider that each axis could have dimension of order 100 or 200, so repeating the Array d times along a new last axis using np.repeat would be really memory intensive, and I would like to avoid this.
Are there any other faster and more pythonic alternatives to the list comprehension?
The most straightforward way of broadcasting an N-dimensional array to a matching (N+1)-dimensional array is to use np.broadcast_to():
import numpy as np
arr = np.random.randint(0, 100, (2, 3))
mask = np.random.randint(0, 2, (2, 3, 4), dtype=bool)
b_arr = np.broadcast_to(arr[..., None], mask.shape)
print(mask.shape == b_arr.shape)
# True
However, as @hpaulj already pointed out, you cannot use mask for slicing b_arr without losing the dimensions.
Given that you want to just sum the elements together and summing zeroes "does not hurt", you could simply multiply element-wise your array and your mask so as to keep the correct dimension but the elements that are False in the mask are irrelevant for the subsequent sum of the corresponding array elements:
result = np.sum(b_arr * mask, axis=tuple(range(mask.ndim - 1)))
or, since * will do the broadcasting automatically:
result = np.sum(arr[..., None] * mask, axis=tuple(range(mask.ndim - 1)))
without the need to use np.broadcast_to() in the first place (but you still need to match the number of dimension, i.e. using arr[..., None] and not just arr).
As @PaulPanzer already pointed out, since you want to sum over all but one dimension, this can be further simplified using np.matmul()/@:
result2 = arr.ravel() @ mask.reshape(-1, mask.shape[-1])
print(np.all(result == result2))
# True
For fancier operations involving the summation, please have a look at np.einsum().
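For this particular sum, an einsum formulation could look like the sketch below. The sample data is made up, and the explicit cast of the mask is a cautious choice of mine to keep both operands the same dtype; it costs a temporary as large as the mask, so this is not a memory-frugal variant.

```python
import numpy as np

rng = np.random.default_rng(0)
arr = rng.integers(0, 100, (2, 3, 4))
mask = rng.integers(0, 2, (2, 3, 4, 5)).astype(bool)

# Sum arr over a, b, c wherever mask is True, separately for each d.
# The explicit cast keeps the operand dtypes equal; it allocates a
# temporary the size of mask.
result = np.einsum('abc,abcd->d', arr, mask.astype(arr.dtype))

# Reference: the list comprehension from the question.
reference = np.array([arr[mask[..., i]].sum() for i in range(mask.shape[-1])])
print(np.array_equal(result, reference))
```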
EDIT
The catch with broadcasting is that it will create temporary arrays during the evaluation of your expressions.
With the numbers you seem to be dealing with, I simply cannot use the broadcasted arrays, as I run into MemoryError, but time-wise the element-wise multiplication may still be a better approach than what you originally proposed.
Alternatively, if you are after speed, you could do this at a somewhat lower level with explicit looping in Cython or Numba.
Below you can find a couple of Numba-based solutions (working on ravel()-ed data):
_vector_matrix_product(): does not use any temporary array
_vector_matrix_product_mp(): same as above but using parallel execution
_vector_matrix_product_sum(): uses np.sum() and parallel execution
import numpy as np
import numba as nb
@nb.jit(nopython=True)
def _vector_matrix_product(
        vect_arr,
        mat_arr,
        result_arr):
    rows, cols = mat_arr.shape
    if vect_arr.shape == result_arr.shape:
        for i in range(rows):
            for j in range(cols):
                result_arr[i] += vect_arr[j] * mat_arr[i, j]
    else:
        for i in range(rows):
            for j in range(cols):
                result_arr[j] += vect_arr[i] * mat_arr[i, j]

@nb.jit(nopython=True, parallel=True)
def _vector_matrix_product_mp(
        vect_arr,
        mat_arr,
        result_arr):
    rows, cols = mat_arr.shape
    if vect_arr.shape == result_arr.shape:
        for i in nb.prange(rows):
            for j in nb.prange(cols):
                result_arr[i] += vect_arr[j] * mat_arr[i, j]
    else:
        for i in nb.prange(rows):
            for j in nb.prange(cols):
                result_arr[j] += vect_arr[i] * mat_arr[i, j]

@nb.jit(nopython=True, parallel=True)
def _vector_matrix_product_sum(
        vect_arr,
        mat_arr,
        result_arr):
    rows, cols = mat_arr.shape
    if vect_arr.shape == result_arr.shape:
        for i in nb.prange(rows):
            result_arr[i] = np.sum(vect_arr * mat_arr[i, :])
    else:
        for j in nb.prange(cols):
            result_arr[j] = np.sum(vect_arr * mat_arr[:, j])

def vector_matrix_product(
        vect_arr,
        mat_arr,
        swap=False,
        dtype=None,
        mode=None):
    rows, cols = mat_arr.shape
    if not dtype:
        dtype = (vect_arr[0] * mat_arr[0, 0]).dtype
    if not swap:
        result_arr = np.zeros(cols, dtype=dtype)
    else:
        result_arr = np.zeros(rows, dtype=dtype)
    if mode == 'sum':
        _vector_matrix_product_sum(vect_arr, mat_arr, result_arr)
    elif mode == 'mp':
        _vector_matrix_product_mp(vect_arr, mat_arr, result_arr)
    else:
        _vector_matrix_product(vect_arr, mat_arr, result_arr)
    return result_arr
np.random.seed(0)
arr = np.random.randint(0, 100, (2, 3, 4))
mask = np.random.randint(0, 2, (2, 3, 4, 5), dtype=bool)
target = arr.ravel() @ mask.reshape(-1, mask.shape[-1])
print(target)
# [820 723 861 486 408]
result1 = vector_matrix_product(arr.ravel(), mask.reshape(-1, mask.shape[-1]))
print(result1)
# [820 723 861 486 408]
result2 = vector_matrix_product(arr.ravel(), mask.reshape(-1, mask.shape[-1]), mode='mp')
print(result2)
# [820 723 861 486 408]
result3 = vector_matrix_product(arr.ravel(), mask.reshape(-1, mask.shape[-1]), mode='sum')
print(result3)
# [820 723 861 486 408]
with improved timing over any list-comprehension-based solutions:
arr = np.random.randint(0, 100, (256, 256, 256))
mask = np.random.randint(0, 2, (256, 256, 256, 128), dtype=bool)
%timeit np.sum(arr[..., None] * mask, axis=tuple(range(mask.ndim - 1)))
# MemoryError
%timeit arr.ravel() @ mask.reshape(-1, mask.shape[-1])
# MemoryError
%timeit np.array([np.sum(arr * mask[..., i], axis=tuple(range(mask.ndim - 1))) for i in range(mask.shape[-1])])
# 24.1 s ± 105 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit np.array([np.sum(arr[mask[..., i]]) for i in range(mask.shape[-1])])
# 46 s ± 119 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit vector_matrix_product(arr.ravel(), mask.reshape(-1, mask.shape[-1]))
# 408 ms ± 2.12 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit vector_matrix_product(arr.ravel(), mask.reshape(-1, mask.shape[-1]), mode='mp')
# 1.63 s ± 3.58 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit vector_matrix_product(arr.ravel(), mask.reshape(-1, mask.shape[-1]), mode='sum')
# 7.17 s ± 258 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
As expected, the JIT accelerated version is the fastest, and enforcing parallelism on the code does not result in improved speed-ups.
Note also that the approach with element-wise multiplication is faster than slicing (approx. twice as fast for these benchmarks).
EDIT 2
Following @max9111's suggestion, looping first by rows and then by cols causes the most time-consuming loop to run on contiguous data, resulting in a significant speed-up.
Without this trick, _vector_matrix_product_sum() and _vector_matrix_product_mp() would run at essentially the same speed.
How about
Array.reshape(-1) @ Mask.reshape(-1, d)
Since you are summing over the first three axes anyway, you may as well merge them, after which it is easy to see that the operation can be written as a matrix-vector product.
Example:
a,b,c,d = 4,5,6,7
Mask = np.random.randint(0,2,(a,b,c,d),bool)
Array = np.random.randint(0,10,(a,b,c))
[np.sum(Array[Mask[:,:,:,i]]) for i in range(d)]
# [310, 237, 253, 261, 229, 268, 184]
Array.reshape(-1) @ Mask.reshape(-1, d)
# array([310, 237, 253, 261, 229, 268, 184])
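If the full product does not fit in memory for large inputs, one option that might help is to apply the same matrix-vector product to a few mask columns at a time. The sketch below is only an illustration under that assumption: the helper name masked_sums_chunked and its chunk parameter are hypothetical, and the memory behaviour has not been measured here.

```python
import numpy as np

def masked_sums_chunked(array, mask, chunk=16):
    """Sum `array` where `mask` is True, separately for each index of the
    last mask axis, handling `chunk` of those indices at a time."""
    d = mask.shape[-1]
    flat = array.reshape(-1)
    out = np.empty(d, dtype=array.dtype)
    for start in range(0, d, chunk):
        stop = min(start + chunk, d)
        # Only this slice of the mask is copied and passed to matmul.
        block = mask[..., start:stop].reshape(flat.size, -1)
        out[start:stop] = flat @ block
    return out

a, b, c, d = 4, 5, 6, 7
rng = np.random.default_rng(0)
Mask = rng.integers(0, 2, (a, b, c, d)).astype(bool)
Array = rng.integers(0, 10, (a, b, c))

print(np.array_equal(masked_sums_chunked(Array, Mask, chunk=3),
                     Array.reshape(-1) @ Mask.reshape(-1, d)))
```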
I have an input matrix A of size I*J
And an output matrix B of size N*M
And some precalculated map of size N*M*2 that dictates for each coordinate in B, which coordinate in A to take. The map has no specific rule or linearity that I can use. Just a map that seems random.
The matrices are pretty big (~5000*~3000) so creating a mapping matrix is out of the question (5000*3000*5000*3000)
I managed to do it using a simple map and loop:
for i in range(N):
for j in range(M):
B[i, j] = A[mapping[i, j, 0], mapping[i, j, 1]]
And I managed to do it using indexing:
B[coords_y, coords_x] = A[some_mapping[:, 0], some_mapping[:, 1]]
# Where coords_x, coords_y are defined as all of the coordinates:
# [[0,0],[0,1]..[0,M-1],[1,0],[1,1]...[N-1,M-1]]
This works much better, but still kind of slow.
I have infinite time in advance to calculate the mapping or any other utility calculation. But after these precalculations, this mapping should happen as fast as possible.
Currently, the only other option that I see is just to reimplement this in C or something faster...
(Just to make it clear if someone is curious, I'm creating an image out of some other, differently shaped and oriented image with some encoding. But its' mapping is very complicated and not something simple or linear that can be used)
If you have infinite time for precomputing, you can get a slight speedup by going to flat indexing:
map_f = np.ravel_multi_index((*np.moveaxis(mapping, 2, 0),), A.shape)
Then simply do:
A.ravel()[map_f]
Please note that this speedup is on top of the large speedup we get from fancy indexing. For example:
>>> A = np.random.random((5000, 3000))
>>> mapping = np.random.randint(0, 15000, (5000, 3000, 2)) % [5000, 3000]
>>>
>>> map_f = np.ravel_multi_index((*np.moveaxis(mapping, 2, 0),), A.shape)
>>>
>>> np.all(A.ravel()[map_f] == A[mapping[..., 0], mapping[..., 1]])
True
>>>
>>> timeit('A[mapping[:, :, 0], mapping[:, :, 1]]', globals=globals(), number=10)
4.101239089999581
>>> timeit('A.ravel()[map_f]', globals=globals(), number=10)
2.7831342950012186
If we were to compare to the original loopy code, the speedup would be more like ~40x.
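To make the precomputation concrete, here is a tiny sketch of what np.ravel_multi_index produces for a C-ordered 2-D array, namely row * n_cols + col for every (row, col) pair. The small A and mapping below are made up for illustration.

```python
import numpy as np

A = np.arange(12).reshape(3, 4)
mapping = np.array([[[2, 1], [0, 3]],
                    [[1, 1], [2, 0]]])        # shape (2, 2, 2): (row, col) pairs

rows, cols = np.moveaxis(mapping, 2, 0)       # two (2, 2) index arrays
map_f = np.ravel_multi_index((rows, cols), A.shape)

print(np.array_equal(map_f, rows * A.shape[1] + cols))   # True
print(A.ravel()[map_f])
# [[9 3]
#  [5 8]]
```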
Finally, note that this solution does not only avoid the additional dependency and potential installation nightmare that is numba, but is also simpler, shorter and faster:
numba:
precomp: 132.957 ms
main 238.359 ms
flat indexing:
precomp: 76.223 ms
main: 219.910 ms
Code:
import numpy as np
from numba import jit
@jit
def fast(A, B, mapping):
    N, M = B.shape
    for i in range(N):
        for j in range(M):
            B[i, j] = A[mapping[i, j, 0], mapping[i, j, 1]]
    return B
from timeit import timeit
A = np.random.random((5000, 3000))
mapping = np.random.randint(0, 15000, (5000, 3000, 2)) % [5000, 3000]
a = np.random.random((5, 3))
m = np.random.randint(0, 15, (5, 3, 2)) % [5, 3]
print('numba:')
print(f"precomp: {timeit('b = fast(a, np.empty_like(a), m)', globals=globals(), number=1)*1000:10.3f} ms")
print(f"main {timeit('B = fast(A, np.empty_like(A), mapping)', globals=globals(), number=10)*100:10.3f} ms")
print('\nflat indexing:')
print(f"precomp: {timeit('map_f = np.ravel_multi_index((*np.moveaxis(mapping, 2, 0),), A.shape)', globals=globals(), number=10)*100:10.3f} ms")
map_f = np.ravel_multi_index((*np.moveaxis(mapping, 2, 0),), A.shape)
print(f"main: {timeit('B = A.ravel()[map_f]', globals=globals(), number=10)*100:10.3f} ms")
One very nice solution to these types of performance-critical problems is to keep it simple and utilize one of the high-performance packages. The easiest might be Numba, which provides the jit decorator that compiles array- and loop-heavy code to optimized machine code via LLVM. Below is a full example:
from time import time
import numpy as np
from numba import jit
# Function doing the computation
def normal(A, B, mapping):
    N, M = B.shape
    for i in range(N):
        for j in range(M):
            B[i, j] = A[mapping[i, j, 0], mapping[i, j, 1]]
    return B
# The same exact function, but with the Numba jit decorator
@jit
def fast(A, B, mapping):
    N, M = B.shape
    for i in range(N):
        for j in range(M):
            B[i, j] = A[mapping[i, j, 0], mapping[i, j, 1]]
    return B
# Create sample data
def create_sample_data(I, J, N, M):
    A = np.random.random((I, J))
    B = np.empty((N, M))
    mapping = np.asarray(np.stack((
        np.random.random((N, M))*I,
        np.random.random((N, M))*J,
        ), axis=2), dtype=int)
    return A, B, mapping
A, B, mapping = create_sample_data(500, 600, 700, 800)
# Run normally
t0 = time()
B = normal(A, B, mapping)
t1 = time()
print('normal took', t1 - t0, 'seconds')
# Run using Numba.
# First we should run the function with smaller arrays,
# just to compile the code.
fast(*create_sample_data(5, 6, 7, 8))
# Now, run with real data
t0 = time()
B = fast(A, B, mapping)
t1 = time()
print('fast took', t1 - t0, 'seconds')
This uses your own looping solution, which is inherently slow using standard Python, but as fast as C when using Numba. On my machine the normal function executes in 0.270 seconds, while the fast function executes in 0.00248 seconds. That is, Numba gives us a 109x speedup (!) pretty much for free.
Note that the fast Numba function is called twice, first with small input arrays and only then with the real data. This is a critical step which is often neglected. Without it, you will find that the performance increase is not nearly as good, as the first call is used to compile the code. The types and dimensions of the input arrays should be the same in this initial call, but the size in each dimension is not important.
I created B outside of the function(s) and passed it as an argument (to be "filled with values"). You might just as well allocate B inside of the function; Numba does not care.
The easiest way to get Numba is probably via the Anaconda distribution.
One option would be to use numba, which can often provide substantial improvements in this kind of simple algorithmic code.
import numpy as np
from numba import njit
I, J = 5000, 5000
N, M = 3000, 3000
A = np.random.randint(0, 10, [I, J])
B = np.random.randint(0, 10, [N, M])
mapping = np.dstack([np.random.randint(0, I - 1, (N, M)),
np.random.randint(0, J - 1, (N, M))])
B0 = B.copy()
def orig(A, B, mapping):
    for i in range(N):
        for j in range(M):
            B[i, j] = A[mapping[i, j, 0], mapping[i, j, 1]]
new = njit(orig)
which gives us matching results:
In [313]: Bold = B0.copy()
In [314]: orig(A, Bold, mapping)
In [315]: Bnew = B0.copy()
In [316]: new(A, Bnew, mapping)
In [317]: (Bold == Bnew).all()
Out[317]: True
and is much faster:
In [320]: %time orig(A, B0.copy(), mapping)
Wall time: 6.11 s
In [321]: %time new(A, B0.copy(), mapping)
Wall time: 257 ms
and faster still after the first call, when it has to do its jit work:
In [322]: %time new(A, B0.copy(), mapping)
Wall time: 171 ms
In [323]: %time new(A, B0.copy(), mapping)
Wall time: 163 ms
for a 30x improvement for adding two lines of code.
The most straightforward optimization you can do is drop the native python loops and use fancy numpy indexing. You already have the array to do that:
import numpy as np
A = np.random.rand(2000,3000)
B = np.empty((2500,3500)) # just for shape, really
# this is the same as your original, but with random indices
mapping = np.stack([np.random.randint(0, A.shape[0] - 1, B.shape),
np.random.randint(0, A.shape[1] - 1, B.shape)],
axis=-1)
# your loopy original
def loopy(A, B, mapping):
    B = B.copy()
    for i in range(B.shape[0]):
        for j in range(B.shape[1]):
            B[i, j] = A[mapping[i, j, 0], mapping[i, j, 1]]
    return B
# vectorization with fancy indexing
def fancy(A, mapping):
    return A[mapping[...,0], mapping[...,1]]
Note that the fancy advanced-indexing function doesn't need preallocation of a B array, as a new array is constructed by the indexing operation.
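A quick way to convince yourself that the two versions agree is to compare them on small made-up data. This is only a sanity-check sketch with arbitrary sizes, not a benchmark:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((20, 30))
N, M = 25, 35
mapping = np.stack([rng.integers(0, A.shape[0], (N, M)),
                    rng.integers(0, A.shape[1], (N, M))], axis=-1)

# Explicit double loop, as in the question.
B_loop = np.empty((N, M))
for i in range(N):
    for j in range(M):
        B_loop[i, j] = A[mapping[i, j, 0], mapping[i, j, 1]]

# Advanced indexing builds the (N, M) result in one step.
B_fancy = A[mapping[..., 0], mapping[..., 1]]
print(np.array_equal(B_loop, B_fancy))
```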
There's a slight variation of the fancy indexing version which could be marginally more efficient: put your last dimension of mapping first, in this way both indexing arrays are contiguous blocks of memory. It turns out from my timing test that this happens to be slower in the above setup. Anyway:
mapping_T = mapping.transpose(2, 0, 1).copy() # but it's actually `mapping` without axis=-1 kwarg
# has shape (2, N, M)
def fancy_T(A, mapping_T):
    return A[tuple(mapping_T)]
As Paul Panzer noted in a comment, just calling .transpose on mapping will not create a copy, but rather implement the transpose using stride tricks. In order to end up with a contiguous array (which is the point of the optimization) we need to force the creation of a copy.
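The difference between the transposed view and the forced copy can be inspected through the array flags. A small sketch with a made-up mapping array:

```python
import numpy as np

mapping = np.zeros((5, 3, 2), dtype=int)

view = mapping.transpose(2, 0, 1)     # no data is moved, only strides change
copy = view.copy()                    # new C-contiguous buffer

print(np.shares_memory(view, mapping))   # True
print(view.flags['C_CONTIGUOUS'])        # False
print(copy.flags['C_CONTIGUOUS'])        # True
```

np.ascontiguousarray(view) is another way to get the contiguous version; it copies only when needed.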
I get the following timings in ipython:
# loopy(A, B, mapping)
6.63 s ± 141 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# fancy(A, mapping)
250 ms ± 3.79 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# fancy_T(A, mapping_T)
277 ms ± 1.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
To be honest I don't understand why the original array order is faster compared to the transposed, but there's that.
I noticed that indexing a multi dimensional array takes more time than indexing a single dimensional array
a1 = np.arange(1000000)
a2 = np.arange(1000000).reshape(1000, 1000)
a3 = np.arange(1000000).reshape(100, 100, 100)
When I index a1
%%timeit
a1[500000]
The slowest run took 39.17 times longer than the fastest. This could mean that an intermediate result is being cached.
10000000 loops, best of 3: 84.6 ns per loop
%%timeit
a2[500, 0]
The slowest run took 31.85 times longer than the fastest. This could mean that an intermediate result is being cached.
10000000 loops, best of 3: 102 ns per loop
%%timeit
a3[50, 0, 0]
The slowest run took 46.72 times longer than the fastest. This could mean that an intermediate result is being cached.
10000000 loops, best of 3: 119 ns per loop
At what point should I consider an alternative way to index or slice a multi-dimensional array? What are the circumstances that make it worth the effort and loss of transparency?
One alternative to slicing an (n, m) array is to flatten the array and derive what its one-dimensional position must be.
consider a = np.array([[0, 1, 2], [3, 4, 5], [6, 7, 8]])
we can get the 2nd row, 3rd column with a[1, 2] and get 5
or we can calculate that 1 * a.shape[1] + 2 is the one dimensional position if we flatten a with order='C'
thus we can perform the equivalent slice with a.ravel()[1 * a.shape[1] + 2]
Is this efficient? No, for indexing a single number from an array, it isn't worth the trouble.
What about if we want to slice many numbers from the array? I devised the following test for a 2-D array
2-D test
from timeit import timeit
import numpy as np
import pandas as pd
n, m = 10000, 10000
a = np.random.rand(n, m)
r = pd.DataFrame(index=np.power(10, np.arange(7)), columns=['Multi', 'Flat'])
for k in r.index:
    b = np.random.randint(n, size=k)
    c = np.random.randint(m, size=k)
    kw = dict(setup='from __main__ import a, b, c', number=100)
    r.loc[k, 'Multi'] = timeit('a[b, c]', **kw)
    r.loc[k, 'Flat'] = timeit('a.ravel()[b * a.shape[1] + c]', **kw)
r.div(r.sum(1), 0).plot.bar()
It appears that when slicing more than 100,000 numbers, it's better to flatten the array.
What about 3-D
3-D test
from timeit import timeit
import numpy as np
import pandas as pd
l, n, m = 1000, 1000, 1000
a = np.random.rand(l, n, m)
r = pd.DataFrame(index=np.power(10, np.arange(7)), columns=['Multi', 'Flat'])
for k in r.index:
    b = np.random.randint(l, size=k)  # axis 0
    c = np.random.randint(n, size=k)  # axis 1
    d = np.random.randint(m, size=k)  # axis 2
    kw = dict(setup='from __main__ import a, b, c, d', number=100)
    r.loc[k, 'Multi'] = timeit('a[b, c, d]', **kw)
    # C-order flat position: b * (n * m) + c * m + d
    r.loc[k, 'Flat'] = timeit('a.ravel()[b * a.shape[1] * a.shape[2] + c * a.shape[2] + d]', **kw)
r.div(r.sum(1), 0).plot.bar()
Similar results, maybe more dramatic.
Conclusion
For 2-dimensional arrays, consider flattening and deriving flattened positions if you need to pull more than 100,000 elements from the array.
For 3 or more dimensions, it seems clear that flattening the array is almost always better.
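When deriving flat positions by hand for 3 or more dimensions, it is easy to mix up which axis length multiplies which index. The sketch below uses deliberately unequal, made-up axis sizes to check the C-order formula against np.ravel_multi_index:

```python
import numpy as np

rng = np.random.default_rng(0)
l, n, m = 4, 5, 6                     # deliberately unequal sizes
a = rng.random((l, n, m))
k = 10
b = rng.integers(0, l, k)             # indices along axis 0
c = rng.integers(0, n, k)             # indices along axis 1
d = rng.integers(0, m, k)             # indices along axis 2

# C order: each index is scaled by the product of the trailing axis lengths.
flat = b * a.shape[1] * a.shape[2] + c * a.shape[2] + d
print(np.array_equal(flat, np.ravel_multi_index((b, c, d), a.shape)))  # True
print(np.array_equal(a.ravel()[flat], a[b, c, d]))                     # True
```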
Criticism is welcome
Did I do something wrong? Did I not think of something obvious?