I have some large numpy arrays of complex numbers I need to perform computations on.
import numpy as np
# Reduced sizes -- real ones are orders of magnitude larger
n, d, l = 50000, 3, 1000
# Two complex matrices and a column vector
x = np.random.rand(n, d) + 1j*np.random.rand(n, d)
s = np.random.rand(n, d) + 1j*np.random.rand(n, d)
v = np.random.rand(l)[:, np.newaxis]
The function is basically x*v*s for each row of x (and s) and then that product is summed across the row. Because the arrays are different sizes, I can't figure out a way to vectorize the computation and it's way too slow to use a for-loop.
My current implementation is this (~3.5 seconds):
h = []
for i in range(len(x)):
    h.append(np.sum(x[i,:]*v*s[i,:], axis=1))
h = np.asarray(h)
I also tried using np.apply_along_axis() with an augmented matrix but it's only slightly faster (~2.6s) and not that readable.
def func(m, v):
    return np.sum(m[:d]*v*m[d:], axis=1)

h = np.apply_along_axis(func, 1, np.hstack([x, s]), v)
What's a much quicker way to compute this result? I can leverage other packages such as dask if that helps.
With broadcasting this should work:
np.sum((x*s)[...,None]*v[:,0], axis=1)
but with your sample dimensions I'm getting a memory error. The 'outer' broadcasted array of shape (n, d, l) is too large for my memory.
I can reduce memory usage by iterating on the smaller d dimension:
res = np.zeros((n,l), dtype=x.dtype)
for i in range(d):
    res += (x[:,i]*s[:,i])[:,None]*v[:,0]
This tests the same as your h, but I wasn't able to complete the time tests at full size. Generally, iterating on the smaller dimension is faster. I'll repeat things with smaller dimensions.
This probably can also be expressed as an einsum problem, though it may not help with these dimensions.
In [1]: n, d, l = 5000, 3, 1000
...:
...: # Two complex matrices and a column vector
...: x = np.random.rand(n, d) + 1j*np.random.rand(n, d)
...: s = np.random.rand(n, d) + 1j*np.random.rand(n, d)
...: v = np.random.rand(l)[:, np.newaxis]
In [2]: h = []
...: for i in range(len(x)):
...:     h.append(np.sum(x[i,:]*v*s[i,:], axis=1))
...:
...: h = np.asarray(h)
In [3]: h.shape
Out[3]: (5000, 1000)
In [4]: res = np.zeros((n,l), dtype=x.dtype)
...: for i in range(d):
...:     res += (x[:,i]*s[:,i])[:,None]*v[:,0]
...:
In [5]: res.shape
Out[5]: (5000, 1000)
In [6]: np.allclose(res,h)
Out[6]: True
In [7]: %%timeit
...: h = []
...: for i in range(len(x)):
...:     h.append(np.sum(x[i,:]*v*s[i,:], axis=1))
...: h = np.asarray(h)
...:
...:
490 ms ± 3.36 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [8]: %%timeit
...: res = np.zeros((n,l), dtype=x.dtype)
...: for i in range(d):
...:     res += (x[:,i]*s[:,i])[:,None]*v[:,0]
...:
354 ms ± 1.44 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [9]: np.sum((x*s)[...,None]*v[:,0], axis=1).shape
Out[9]: (5000, 1000)
In [10]: out = np.sum((x*s)[...,None]*v[:,0], axis=1)
In [11]: np.allclose(h,out)
Out[11]: True
In [12]: timeit out = np.sum((x*s)[...,None]*v[:,0], axis=1)
310 ms ± 964 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
Some time savings, but not big.
And the einsum version:
In [13]: np.einsum('ij,ij,k->ik',x,s,v[:,0]).shape
Out[13]: (5000, 1000)
In [14]: np.allclose(np.einsum('ij,ij,k->ik',x,s,v[:,0]),h)
Out[14]: True
In [15]: timeit np.einsum('ij,ij,k->ik',x,s,v[:,0]).shape
167 ms ± 137 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Good time savings. But I don't know how it will scale.
But the einsum made me realize that we can sum on d dimension earlier, before multiplying by v - and gain a lot in time and memory usage:
In [16]: np.allclose(np.sum(x*s, axis=1)[:,None]*v[:,0],h)
Out[16]: True
In [17]: timeit np.sum(x*s, axis=1)[:,None]*v[:,0]
68.4 ms ± 1.58 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
@cs95 got there first!
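The same early reduction can also be spelled with np.outer, which may read a bit more clearly (a sketch, not separately timed here):

# outer product of the row-reduced x*s with v; mathematically the same result
out2 = np.outer(np.sum(x*s, axis=1), v.ravel())
np.allclose(out2, h)    # expected True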
As per @PaulPanzer's comment, the optimize flag helps. It's probably making the same deduction - that we can sum on j early:
In [18]: timeit np.einsum('ij,ij,k->ik',x,s,v[:,0],optimize=True).shape
91.6 ms ± 991 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
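To check what optimize is doing, np.einsum_path reports the contraction order it picked (a quick inspection, not part of the original session):

# the printed report should show an early 'ij,ij->i' style reduction
path, report = np.einsum_path('ij,ij,k->ik', x, s, v[:,0], optimize=True)
print(report)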
Related
I have two 3d arrays of shape (N, M, D) and I want to perform an efficient row wise (over N) matrix multiplication such that the resulting array is of shape (N, D, D).
An inefficient code sample showing what I'm trying to achieve is given by:
N = 100
M = 10
D = 50
arr1 = np.random.normal(size=(N, M, D))
arr2 = np.random.normal(size=(N, M, D))
result = []
for i in range(N):
    result.append(arr1[i].T @ arr2[i])
result = np.array(result)
However, this approach is quite slow for large N due to the loop. Is there a more efficient way to achieve this computation without using loops? I already tried to find a solution via tensordot and einsum, to no avail.
The vectorized solution is to swap the last two axes of arr1:
>>> N, M, D = 2, 3, 4
>>> np.random.seed(0)
>>> arr1 = np.random.normal(size=(N, M, D))
>>> arr2 = np.random.normal(size=(N, M, D))
>>> arr1.transpose(0, 2, 1) @ arr2
array([[[ 6.95815626,  0.38299107,  0.40600482,  0.35990016],
        [-0.95421604, -2.83125879, -0.2759683 , -0.38027618],
        [ 3.54989101, -0.31274318,  0.14188485,  0.19860495],
        [ 3.56319723, -6.36209602, -0.42687188, -0.24932248]],

       [[ 0.67081341, -0.08816343,  0.35430089,  0.69962394],
        [ 0.0316968 ,  0.15129449, -0.51592291,  0.07118177],
        [-0.22274906, -0.28955683, -1.78905988,  1.1486345 ],
        [ 1.68432706,  1.93915798,  2.25785798, -2.34404577]]])
A simple benchmark with a much larger N:
In [225]: arr1.shape
Out[225]: (100000, 10, 50)
In [226]: %%timeit
...: result = []
...: for i in range(N):
...:     result.append(arr1[i].T @ arr2[i])
...: result = np.array(result)
...:
...:
12.4 s ± 224 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [227]: %timeit arr1.transpose(0, 2, 1) @ arr2
843 ms ± 7.79 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
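For reference, since the question mentions trying einsum without luck, the same batched contraction can be written as (a sketch; the transpose-plus-@ form above is typically at least as fast):

# n is the batch axis, m is summed: result[n,d,e] = sum_m arr1[n,m,d]*arr2[n,m,e]
res = np.einsum('nmd,nme->nde', arr1, arr2)
np.allclose(res, arr1.transpose(0, 2, 1) @ arr2)   # expected True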
Use a pre-allocated list and skip the data conversion after the loop ends. The performance here is not much worse than vectorization, which means that most of the overhead comes from the final data conversion:
In [375]: %%timeit
...: result = [None] * N
...: for i in range(N):
...:     result[i] = arr1[i].T @ arr2[i]
...: # result = np.array(result)
...:
...:
1.22 s ± 16.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Performance of the loop solution with data conversion:
In [376]: %%timeit
...: result = [None] * N
...: for i in range(N):
...:     result[i] = arr1[i].T @ arr2[i]
...: result = np.array(result)
...:
...:
11.3 s ± 179 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Another test follows the answer of @9769953 and adds further optimization. To my surprise, its performance is almost the same as the vectorized solution:
In [378]: %%timeit
...: result = np.empty_like(arr1, shape=(N, D, D))
...: for res, ar1, ar2 in zip(result, arr1.transpose(0, 2, 1), arr2):
...:     np.matmul(ar1, ar2, out=res)
...:
843 ms ± 4.04 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
For interest, I wondered about the loop overhead, which I guess is minimal compared to the matrix multiplication; but in particular, whether the loop overhead is minimal compared to the potential reallocation of the list memory, which with N = 10000 could be significant.
Using a pre-allocated array instead of a list, I compared the loop result and the solution provided by Mechanic Pig, and achieved the following results on my machine:
In [10]: %timeit result1 = arr1.transpose(0, 2, 1) @ arr2
33.7 ms ± 117 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
versus
In [14]: %%timeit
...: result = np.empty_like(arr1, shape=(N, D, D))
...: for i in range(N):
...:     result[i, ...] = arr1[i].T @ arr2[i]
...:
...:
48.5 ms ± 184 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
The pure NumPy solution is still faster, so that's good, but only by a factor of about 1.5. Not too bad. Depending on the needs, the loop may be clearer as to what it intends (and easier to modify, in case there's a need for an if-statement or other shenanigans).
And naturally, a simple comment above the faster solution can easily point out what it actually replaces.
Following the comments to this answer by Mechanic Pig, I've added below the timing results of a loop without preallocating an array (but with a preallocated list) and without conversion to a NumPy array, mainly so the results can be compared on the same machine:
In [11]: %%timeit
...: result = [None] * N
...: for i in range(N):
...:     result[i] = arr1[i].T @ arr2[i]
...:
49.5 ms ± 672 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
So interestingly, this result, without conversion, is (a tiny bit) slower than the one with a pre-allocated array and direct assignment into the array.
I'm trying to compute an inner product between tensors in numpy.
I have a vector x of shape (n,) and a tensor y of shape d*(n,) with d > 1 and would like to compute $\langle y, x^{\otimes d} \rangle$. That is, I want to compute the sum
$$\langle y, x^{\otimes d} \rangle = \sum_{i_1,\dots,i_d \in \{1,\dots,n\}} y[i_1, \dots, i_d]\, x[i_1] \cdots x[i_d].$$
A working implementation I have uses a function to first compute $x^{\otimes d}$ and then uses np.tensordot:
def d_fold_tensor_product(x, d) -> np.ndarray:
    """
    Compute d-fold tensor product of a vector.
    """
    assert d > 1, "Tensor order must be bigger than 1."
    xd = np.tensordot(x, x, axes=0)
    while d > 2:
        xd = np.tensordot(xd, x, axes=0)
        d -= 1
    return xd
n = 10
d = 4
x = np.random.random(n)
y = np.random.random(d * (n,))
result = np.tensordot(y, d_fold_tensor_product(x, d), axes=d)
Is there a more efficient and pythonic way? Perhaps without having to compute $x^{\otimes d}$.
The math is hard to read, so I'm going to skip that. Instead let's look at the sample calculation
In [168]: n = 10
...: d = 4
...: x = np.random.random(n)
...: y = np.random.random(d * (n,))
In [169]: x.shape
Out[169]: (10,)
In [171]: d_fold_tensor_product(x,d).shape
Out[171]: (10, 10, 10, 10)
In [172]: result = np.tensordot(y, d_fold_tensor_product(x, d), axes=d)
In [174]: result
Out[174]: array(384.20478955)
In [175]: y.shape
Out[175]: (10, 10, 10, 10)
tensordot can be a complex call, though it all reduces to a call to dot. I once dug through its action with a single axis value. But without revisiting that, or even looking at the docs (shame on me, I know :), this flattened dot does the same thing:
In [176]: np.dot(y.ravel(), d_fold_tensor_product(x, d).ravel())
Out[176]: 384.20478955316673
So the d_fold... has somehow expanded or replicated x to a 4d array. Guess I'll have to digest that action :(
That function is doing repeated outer products:
In [177]: np.tensordot(x,x,axes=0).shape
Out[177]: (10, 10)
In [178]: np.allclose(np.tensordot(x,x,axes=0), x[:,None]*x)
Out[178]: True
In [181]: temp = d_fold_tensor_product(x,d)
In [182]: np.allclose(temp, x[:,None,None,None]*x[:,None,None]*x[:,None]*x)
Out[182]: True
or put all together:
In [184]: np.dot((x[:,None,None,None]*x[:,None,None]*x[:,None]*x).ravel(),y.ravel())
Out[184]: 384.20478955316673
So that eliminates the repeated tensordot, but isn't easily generalizable to other d.
Another way - still not generalizable, but may help visualize the task:
In [186]: np.einsum('ijkl,i,j,k,l',y,x,x,x,x)
Out[186]: 384.2047895531675
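For what it's worth, that einsum can be made generic in d by building the subscript string (a sketch of mine, limited to d <= 26 single-letter indices):

from string import ascii_lowercase

def inner_einsum(y, x):
    d = y.ndim
    idx = ascii_lowercase[:d]
    subs = idx + ',' + ','.join(idx)   # 'ijkl,i,j,k,l' for d = 4
    return np.einsum(subs, y, *(x,) * d, optimize=True)

With optimize=True, einsum is free to contract the x factors one axis at a time rather than forming the full outer product.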
Some timings - your use of tensordot is slower than the most direct outer product:
In [193]: timeit temp = d_fold_tensor_product(x,d)
151 µs ± 127 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
In [194]: timeit x[:,None,None,None]*x[:,None,None]*x[:,None]*x
61.3 µs ± 1.36 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
A generalization of the outer product is in between:
In [195]: timeit np.multiply.reduce(np.array(np.ix_(x,x,x,x),object))
85.1 µs ± 57.2 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
A more general way of doing the repeated outer product:
def foo(x, d):
    x1 = np.expand_dims(x, tuple(range(1, d)))  # make (10,1,1,1)
    res = x1
    for _ in range(1, d):
        x1 = x1[..., 0]
        res = res*x1
    return res
In [219]: foo(x,d).shape
Out[219]: (10, 10, 10, 10)
Times are almost as good as the explicit version:
In [220]: timeit foo(x,d)
72.7 µs ± 47.3 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
In [221]: np.dot(foo(x,d).ravel(),y.ravel())
Out[221]: 384.20478955316673
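One more observation, not timed above: since res @ x contracts the trailing axis of res with x, applying it d times collapses y to the scalar without ever materializing the outer product (a sketch):

# contract one axis per step: (n,)*d -> (n,)*(d-1) -> ... -> scalar
res = y
for _ in range(d):
    res = res @ x
# res should agree with np.dot(foo(x,d).ravel(), y.ravel())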
Related question BLAS with symmetry in higher order tensor in Fortran
I tried to use Python code to exploit the symmetry in the tensor contraction A[a,b] B[b,c,d] = C[a,c,d], where B[b,c,d] = B[b,d,c] and hence C[a,c,d] = C[a,d,c]. (Einstein summation convention assumed, i.e., the repeated index b is summed over.)
With the following code
import numpy as np
import time

# A[a,b] * B[b,c,d] = C[a,c,d]
na = nb = nc = nd = 100
A = np.random.random((na,nb))
B = np.random.random((nb,nc,nd))
C = np.zeros((na,nc,nd))
C2 = np.zeros((na,nc,nd))
C3 = np.zeros((na,nc,nd))

# symmetrize B
for c in range(nc):
    for d in range(c):
        B[:,c,d] = B[:,d,c]

start_time = time.time()
C2 = np.einsum('ab,bcd->acd', A, B)
finish_time = time.time()
print('time einsum', finish_time - start_time)

start_time = time.time()
for c in range(nc):
    # c+1 is needed, since range(0) will be skipped
    for d in range(c+1):
        #C3[:,c,d] = np.einsum('ab,b->a', A[:,:], B[:,c,d])
        C3[:,c,d] = np.matmul(A[:,:], B[:,c,d])

for c in range(nc):
    for d in range(c+1,nd):
        C3[:,c,d] = C3[:,d,c]
finish_time = time.time()
print('time partial einsum', finish_time - start_time)

for a in range(int(na/10)):
    for c in range(int(nc/10)):
        for d in range(int(nd/10)):
            if abs((C3-C2)[a,c,d]) > 1.0e-12:
                print('warning', a, c, d, (C3-C2)[a,c,d])
it seems to me that np.matmul is faster than np.einsum, e.g., by using np.matmul, I got
time einsum 0.07406115531921387
time partial einsum 0.0553278923034668
by using np.einsum, I got
time einsum 0.0751657485961914
time partial einsum 0.11624622344970703
Is the above performance difference general? I often took einsum for granted.
As a general rule I expect matmul to be faster, though in simpler cases einsum appears to actually use matmul. But here are my timings:
In [20]: C2 = np.einsum('ab,bcd->acd', A, B)
In [21]: timeit C2 = np.einsum('ab,bcd->acd', A, B)
126 ms ± 1.3 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Your symmetry try with einsum:
In [22]: %%timeit
...: for c in range(nc):
...:     # c+1 is needed, since range(0) will be skipped
...:     for d in range(c+1):
...:         C3[:,c,d] = np.einsum('ab,b->a', A[:,:], B[:,c,d])
...:         #C3[:,c,d] = np.matmul(A[:,:], B[:,c,d])
...:
...: for c in range(nc):
...:     for d in range(c+1,nd):
...:         C3[:,c,d] = C3[:,d,c]
...:
128 ms ± 3.39 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Same with matmul:
In [23]: %%timeit
...: for c in range(nc):
...:     # c+1 is needed, since range(0) will be skipped
...:     for d in range(c+1):
...:         #C3[:,c,d] = np.einsum('ab,b->a', A[:,:], B[:,c,d])
...:         C3[:,c,d] = np.matmul(A[:,:], B[:,c,d])
...:
...: for c in range(nc):
...:     for d in range(c+1,nd):
...:         C3[:,c,d] = C3[:,d,c]
...:
81.3 ms ± 1.14 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
And direct matmul:
In [24]: C4 = np.matmul(A, B.reshape(100,-1)).reshape(100,100,100)
In [25]: np.allclose(C2,C4)
Out[25]: True
In [26]: timeit C4 = np.matmul(A, B.reshape(100,-1)).reshape(100,100,100)
14.9 ms ± 167 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
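The reshape trick isn't specific to the 100-cube; with the names from the setup it reads (a sketch; assumes B is contiguous, so the reshapes are views):

# fold (nc,nd) into one axis, do a single BLAS matmul, then unfold
C4 = (A @ B.reshape(nb, nc*nd)).reshape(na, nc, nd)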
einsum also has an optimize flag. I thought that only mattered when there are 3 or more arguments, but it seems to help here:
In [27]: timeit C2 = np.einsum('ab,bcd->acd', A, B, optimize=True)
20.3 ms ± 688 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Sometimes when the arrays are very big, some iteration is faster because it reduces memory-management complexity. But I don't think it's worth it when trying to exploit symmetry. Other SO answers have shown that in some cases matmul can detect symmetry and use a custom BLAS call, but I don't think that's the case here (it can't detect symmetry in B without an expensive comparison).
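If you do want to exploit the symmetry while keeping a single BLAS call, one option (a sketch of mine, untimed, assuming nc == nd as in the setup) is to gather only the lower-triangle columns, multiply once, and mirror:

# only nc*(nc+1)/2 of the nc*nd columns are actually computed
ci, di = np.tril_indices(nc)
C5 = np.empty((na, nc, nd))
C5[:, ci, di] = A @ B[:, ci, di]   # one (na,nb) @ (nb,K) matmul
C5[:, di, ci] = C5[:, ci, di]      # mirror into the upper triangle
# C5 should match C2 for the symmetrized B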
I have an array of angles that I want to group into arrays with a max difference of 2 deg between them.
eg: input:
angles = np.array([[1],[2],[3],[4],[4],[5],[10]])
output
('group', 1)
[[1]
 [2]
 [3]]
('group', 2)
[[4]
 [4]
 [5]]
('group', 3)
[[10]]
numpy.diff gets the difference of the next element from the current one; I need the difference of the next elements from the first of the group.
itertools.groupby groups the elements, but not within a definable range.
numpy.digitize groups the elements by a predefined range, not by a range specified by the elements of the array.
(Maybe I can use this by getting the unique values of angles, grouping them by their difference and using that as the predefined range?)
My approach, which works but seems extremely inefficient and non-pythonic:
(I am using expand_dims and vstack because I'm working with 1d arrays (not just angles), but I've reduced them to simplify things for this question.)
angles = np.array([[1],[2],[3],[4],[4],[5],[10]])

groupedangles = []
idx1 = 0
diffAngleMax = 2

while idx1 < len(angles):
    angleA = angles[idx1]
    group = np.expand_dims(angleA, axis=0)
    for idx2 in range(idx1+1, len(angles)):
        angleB = angles[idx2]
        diffAngle = angleB - angleA
        if abs(diffAngle) <= diffAngleMax:
            group = np.vstack((group, angleB))
        else:
            idx1 = idx2
            groupedangles.append(group)
            break
    if idx2 == len(angles) - 1:
        if idx1 == idx2:
            angleA = angles[idx1]
            group = np.expand_dims(angleA, axis=0)
        groupedangles.append(group)
        break

for idx, x in enumerate(groupedangles):
    print('group', idx+1)
    print(x)
for idx, x in enumerate(groupedangles):
print('group', idx+1)
print(x)
What is a better and faster way to do this?
Update: Here is some Cython treatment.
In [1]: import cython
In [2]: %load_ext Cython
In [3]: %%cython
...: import numpy as np
...: cimport numpy as np
...: def cluster(np.ndarray array, np.float64_t maxdiff):
...:     cdef np.ndarray[np.float64_t, ndim=1] flat = np.sort(array.flatten())
...:     cdef list breakpoints = []
...:     cdef np.float64_t seed = flat[0]
...:     cdef np.int64_t i = 0
...:     for i in range(0, len(flat)):
...:         if (flat[i] - seed) > maxdiff:
...:             breakpoints.append(i)
...:             seed = flat[i]
...:     return np.split(array, breakpoints)
...:
Sparsity test
In [4]: angles = np.random.choice(np.arange(5000), 500).astype(np.float64)[:, None]
In [5]: %timeit cluster(angles, 2)
422 µs ± 12.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Duplication test
In [6]: angles = np.random.choice(np.arange(500), 1500).astype(np.float64)[:, None]
In [7]: %timeit cluster(angles, 2)
263 µs ± 14.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Both tests show a significant improvement. The algorithm now sorts the input and makes a single pass over the sorted array, which makes it a stable O(N log N).
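For reference, the same single pass in plain Python, for readers who don't want to compile Cython (a sketch; note it splits the sorted values, which is what the breakpoints index into):

def cluster_py(array, maxdiff):
    flat = np.sort(array.flatten())
    breakpoints, seed = [], flat[0]
    for i in range(len(flat)):
        if flat[i] - seed > maxdiff:
            breakpoints.append(i)
            seed = flat[i]
    return np.split(flat[:, None], breakpoints)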
Pre-update
This is a variation on seed clustering. It requires no sorting:
def cluster(array, maxdiff):
    tmp = array.copy()
    groups = []
    while len(tmp):
        # select seed
        seed = tmp.min()
        mask = (tmp - seed) <= maxdiff
        groups.append(tmp[mask, None])
        tmp = tmp[~mask]
    return groups
Example:
In [27]: cluster(angles, 2)
Out[27]:
[array([[1],
        [2],
        [3]]), array([[4],
        [4],
        [5]]), array([[10]])]
A benchmark for 500, 1000 and 1500 angles:
In [4]: angles = np.random.choice(np.arange(500), 500)[:, None]
In [5]: %timeit cluster(angles, 2)
1.25 ms ± 60.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [6]: angles = np.random.choice(np.arange(500), 1000)[:, None]
In [7]: %timeit cluster(angles, 2)
1.46 ms ± 37 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [8]: angles = np.random.choice(np.arange(500), 1500)[:, None]
In [9]: %timeit cluster(angles, 2)
1.99 ms ± 72.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
While the algorithm is O(N^2) in the worst case and O(N) in the best case, the benchmarks above clearly show near-linear time growth, because the actual runtime depends on the structure of your data: sparsity and the duplication rate. In most real cases you won't hit the worst case.
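For intuition, my own worst-case illustration (not from the benchmarks above): when every angle lands in its own group, each pass peels off a single element, so the masking work grows quadratically.

# gaps of 3 exceed maxdiff=2, so all 500 angles become singleton groups
worst = np.arange(0, 1500, 3, dtype=float)[:, None]
groups = cluster(worst, 2)
len(groups)   # 500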
Some sparsity benchmarks
In [4]: angles = np.random.choice(np.arange(500), 500)[:, None]
In [5]: %timeit cluster(angles, 2)
1.06 ms ± 27.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [6]: angles = np.random.choice(np.arange(1000), 500)[:, None]
In [7]: %timeit cluster(angles, 2)
1.79 ms ± 117 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [8]: angles = np.random.choice(np.arange(1500), 500)[:, None]
In [9]: %timeit cluster(angles, 2)
2.16 ms ± 90.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [10]: angles = np.random.choice(np.arange(5000), 500)[:, None]
In [11]: %timeit cluster(angles, 2)
3.21 ms ± 139 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Here is a sorting based solution. One could try and be a bit smarter and use bincount and argpartition to avoid the sorting, but at N <= 500 it's not worth the trouble.
import numpy as np
def flexibin(a):
    idx0 = np.argsort(a)
    as_ = a[idx0]
    A = np.r_[as_, as_+2]
    idx = np.argsort(A)
    uinv = np.flatnonzero(idx >= len(a))
    linv = np.empty_like(idx)
    linv[np.flatnonzero(idx < len(a))] = np.arange(len(a))
    bins = [0]
    curr = 0
    while True:
        for j in range(uinv[idx[curr]], len(idx)):
            if idx[j] < len(a) and A[idx[j]] > A[idx[curr]] + 2:
                bins.append(j)
                curr = j
                break
        else:
            return np.split(idx0, linv[bins[1:]])
a = 180 * np.random.random((500,))
bins = flexibin(a)
mn, mx = zip(*((np.min(a[b]), np.max(a[b])) for b in bins))
assert np.all(np.diff(mn) > 2)
assert np.all(np.subtract(mx, mn) <= 2)
print('all ok')
I have a three-dimensional array like
A = np.array([[[1,1],
               [1,0]],
              [[1,2],
               [1,0]],
              [[1,0],
               [0,0]]])
Now I would like to obtain an array that has a nonzero value in a given position if only a unique nonzero value (or zero) occurs in that position. It should have zero if only zeros or more than one nonzero value occur in that position. For the example above, I would like
[[1,0],
[1,0]]
since
in A[:,0,0] there are only 1s
in A[:,0,1] there are 0, 1 and 2, so more than one nonzero value
in A[:,1,0] there are 0 and 1, so 1 is retained
in A[:,1,1] there are only 0s
I can find how many nonzero elements there are with np.count_nonzero(A, axis=0), but I would like to keep 1s or 2s even if there are several of them. I looked at np.unique but it doesn't seem to support what I'd like to do.
Ideally, I'd like a function like np.count_unique(A, axis=0) which would return an array in the original shape, e.g. [[1, 3],[2, 1]], so I could check whether 3 or more occur and then ignore that position.
All I could come up with was a list comprehension iterating over the two axes of the output I'd like to obtain:
[[len(np.unique(A[:, i, j])) for j in range(A.shape[2])] for i in range(A.shape[1])]
Any other ideas?
You can use np.diff to stay at numpy level for the second task.
def diffcount(A):
    B = A.copy()
    B.sort(axis=0)
    C = np.diff(B, axis=0) > 0
    D = C.sum(axis=0) + 1
    return D

# [[1 3]
#  [2 1]]
It seems to be a little faster on big arrays:
In [62]: A=np.random.randint(0,100,(100,100,100))
In [63]: %timeit diffcount(A)
46.8 ms ± 769 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [64]: timeit [[len(np.unique(A[:, i, j])) for j in range(A.shape[2])]\
for i in range(A.shape[1])]
149 ms ± 700 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Finally, counting unique values is simpler than sorting, so a ln(A.shape[0]) factor can be won.
A way to win this factor is to use the set mechanism:
In [81]: %timeit np.apply_along_axis(lambda a: len(set(a)), 0, A)
183 ms ± 1.17 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Unfortunately, this is not faster.
Another way is to do it by hand:
def countunique(A, Amax):
    res = np.empty(A.shape[1:], A.dtype)
    c = np.empty(Amax+1, A.dtype)
    for i in range(A.shape[1]):
        for j in range(A.shape[2]):
            T = A[:,i,j]
            for k in range(c.size):
                c[k] = 0
            for x in T:
                c[x] = 1
            res[i,j] = c.sum()
    return res
At python level:
In [70]: %timeit countunique(A,100)
429 ms ± 18.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Which is not so bad for a pure Python approach. Then just shift this code to low level with numba:
import numba
countunique2=numba.jit(countunique)
In [71]: %timeit countunique2(A,100)
3.63 ms ± 70.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Which will be difficult to improve a lot.
One approach would be to use A as the first-axis indices for setting a boolean array of the same lengths along the other two axes, and then simply count the non-zeros along its first axis. Two variants would be possible - one keeping it 3D, and another reshaping into 2D for some performance benefit, as indexing into 2D is faster. Thus, the two implementations would be -
def nunique_axis0_maskcount_app1(A):
    m,n = A.shape[1:]
    mask = np.zeros((A.max()+1,m,n), dtype=bool)
    mask[A, np.arange(m)[:,None], np.arange(n)] = 1
    return mask.sum(0)

def nunique_axis0_maskcount_app2(A):
    m,n = A.shape[1:]
    A.shape = (-1,m*n)
    maxn = A.max()+1
    N = A.shape[1]
    mask = np.zeros((maxn,N), dtype=bool)
    mask[A, np.arange(N)] = 1
    A.shape = (-1,m,n)
    return mask.sum(0).reshape(m,n)
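A quick check on the question's example helps see how the indexed assignment counts distinct values: each A[k,i,j] sets one cell of the boolean table for position (i,j), duplicates coincide, and the column sum is the unique count. A usage sketch:

A = np.array([[[1, 1], [1, 0]],
              [[1, 2], [1, 0]],
              [[1, 0], [0, 0]]])
print(nunique_axis0_maskcount_app1(A))
# [[1 3]
#  [2 1]]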
Runtime test -
In [154]: A = np.random.randint(0,100,(100,100,100))
# @B. M.'s soln
In [155]: %timeit f(A)
10 loops, best of 3: 28.3 ms per loop
# @B. M.'s soln using slicing : (B[1:] != B[:-1]).sum(0)+1
In [156]: %timeit f2(A)
10 loops, best of 3: 26.2 ms per loop
In [157]: %timeit nunique_axis0_maskcount_app1(A)
100 loops, best of 3: 12 ms per loop
In [158]: %timeit nunique_axis0_maskcount_app2(A)
100 loops, best of 3: 9.14 ms per loop
Numba method
Using the same strategy as used for nunique_axis0_maskcount_app2 with directly getting the counts at C-level with numba, we would have -
from numba import njit

@njit
def nunique_loopy_func(mask, N, A, p, count):
    for j in range(N):
        mask[:] = True
        mask[A[0,j]] = False
        c = 1
        for i in range(1,p):
            if mask[A[i,j]]:
                c += 1
                mask[A[i,j]] = False
        count[j] = c
    return count

def nunique_axis0_numba(A):
    p,m,n = A.shape
    A.shape = (-1,m*n)
    maxn = A.max()+1
    N = A.shape[1]
    mask = np.empty(maxn, dtype=bool)
    count = np.empty(N, dtype=int)
    out = nunique_loopy_func(mask, N, A, p, count).reshape(m,n)
    A.shape = (-1,m,n)
    return out
Runtime test -
In [328]: np.random.seed(0)
In [329]: A = np.random.randint(0,100,(100,100,100))
In [330]: %timeit nunique_axis0_maskcount_app2(A)
100 loops, best of 3: 11.1 ms per loop
# @B.M.'s numba soln
In [331]: %timeit countunique2(A,A.max()+1)
100 loops, best of 3: 3.43 ms per loop
# Numba soln posted in this post
In [332]: %timeit nunique_axis0_numba(A)
100 loops, best of 3: 2.76 ms per loop