Performance difference between einsum and matmul

Related question BLAS with symmetry in higher order tensor in Fortran
I tried to use python code to exploit the symmetry in tensor contraction, A[a,b] B[b,c,d] = C[a,c,d] when B[b,c,d] = B[b,d,c] hence C[a,c,d] = C[a,d,c]. (Einstein summation convention assumed, i.e., repeated b means summation over it)
I used the following code:
import numpy as np
import time
# A[a,b] * B[b,c,d] = C[a,c,d]
na = nb = nc = nd = 100
A = np.random.random((na,nb))
B = np.random.random((nb,nc,nd))
C = np.zeros((na,nc,nd))
C2= np.zeros((na,nc,nd))
C3= np.zeros((na,nc,nd))
# symmetrize B
for c in range(nc):
    for d in range(c):
        B[:,c,d] = B[:,d,c]
start_time = time.time()
C2 = np.einsum('ab,bcd->acd', A, B)
finish_time = time.time()
print('time einsum', finish_time - start_time )
start_time = time.time()
for c in range(nc):
    # c+1 is needed, since range(0) will be skipped
    for d in range(c+1):
        #C3[:,c,d] = np.einsum('ab,b->a', A[:,:],B[:,c,d] )
        C3[:,c,d] = np.matmul(A[:,:],B[:,c,d] )
for c in range(nc):
    for d in range(c+1,nd):
        C3[:,c,d] = C3[:,d,c]
finish_time = time.time()
print( 'time partial einsum', finish_time - start_time )
for a in range(int(na/10)):
    for c in range(int(nc/10)):
        for d in range(int(nd/10)):
            if abs((C3-C2)[a,c,d])> 1.0e-12:
                print('warning', a,c,d, (C3-C2)[a,c,d])
It seems to me that np.matmul is faster than np.einsum. For example, using np.matmul in the inner loop, I got
time einsum 0.07406115531921387
time partial einsum 0.0553278923034668
and using np.einsum in the inner loop instead, I got
time einsum 0.0751657485961914
time partial einsum 0.11624622344970703
Is the above performance difference general? I often took einsum for granted.

As a general rule I expect matmul to be faster, though with simpler cases it appears that einsum actually uses matmul.
But here are my timings:
In [20]: C2 = np.einsum('ab,bcd->acd', A, B)
In [21]: timeit C2 = np.einsum('ab,bcd->acd', A, B)
126 ms ± 1.3 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Your symmetry try with einsum:
In [22]: %%timeit
...: for c in range(nc):
...:     # c+1 is needed, since range(0) will be skipped
...:     for d in range(c+1):
...:         C3[:,c,d] = np.einsum('ab,b->a', A[:,:],B[:,c,d] )
...:         #C3[:,c,d] = np.matmul(A[:,:],B[:,c,d] )
...:
...: for c in range(nc):
...:     for d in range(c+1,nd):
...:         C3[:,c,d] = C3[:,d,c]
...:
128 ms ± 3.39 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Same with matmul:
In [23]: %%timeit
...: for c in range(nc):
...:     # c+1 is needed, since range(0) will be skipped
...:     for d in range(c+1):
...:         #C3[:,c,d] = np.einsum('ab,b->a', A[:,:],B[:,c,d] )
...:         C3[:,c,d] = np.matmul(A[:,:],B[:,c,d] )
...:
...: for c in range(nc):
...:     for d in range(c+1,nd):
...:         C3[:,c,d] = C3[:,d,c]
...:
81.3 ms ± 1.14 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
And direct matmul:
In [24]: C4 = np.matmul(A, B.reshape(100,-1)).reshape(100,100,100)
In [25]: np.allclose(C2,C4)
Out[25]: True
In [26]: timeit C4 = np.matmul(A, B.reshape(100,-1)).reshape(100,100,100)
14.9 ms ± 167 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
einsum also has an optimize flag. I thought that only mattered when there are 3 or more arguments, but it seems to help here:
In [27]: timeit C2 = np.einsum('ab,bcd->acd', A, B, optimize=True)
20.3 ms ± 688 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Sometimes, when the arrays are very big, some iteration is faster because it reduces memory management complexities. But I don't think it's worth it when trying to exploit symmetry. Other SO answers have shown that in some cases matmul can detect symmetry and use a custom BLAS call, but I don't think that's the case here (it can't detect symmetry in B without an expensive comparison).
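As a rough sketch of the comparison above, the three formulations below compute the same contraction. The small sizes and the names C_einsum, C_matmul and C_tensordot are illustrative choices, not part of the original post. Timings depend on the machine and the NumPy/BLAS build, so none are quoted here.

```python
import numpy as np

rng = np.random.default_rng(0)
na = nb = nc = nd = 20  # small illustrative sizes, not the 100 used above
A = rng.random((na, nb))
B = rng.random((nb, nc, nd))

# Direct einsum; optimize=True lets einsum choose a contraction strategy.
C_einsum = np.einsum('ab,bcd->acd', A, B, optimize=True)

# Flatten the trailing axes of B so the contraction becomes one 2-D matmul.
C_matmul = np.matmul(A, B.reshape(nb, -1)).reshape(na, nc, nd)

# tensordot with axes=1 contracts the last axis of A with the first axis of B.
C_tensordot = np.tensordot(A, B, axes=1)

print(np.allclose(C_einsum, C_matmul), np.allclose(C_einsum, C_tensordot))
```

Wrapping each of the three expressions in timeit on the real problem size is probably the most reliable way to decide between them.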

Related

Efficient numpy row-wise matrix multiplication using 3d arrays

I have two 3d arrays of shape (N, M, D) and I want to perform an efficient row wise (over N) matrix multiplication such that the resulting array is of shape (N, D, D).
An inefficient code sample showing what I try to achieve is given by:
N = 100
M = 10
D = 50
arr1 = np.random.normal(size=(N, M, D))
arr2 = np.random.normal(size=(N, M, D))
result = []
for i in range(N):
    result.append(arr1[i].T @ arr2[i])
result = np.array(result)
However, this application is quite slow for large N due to the loop. Is there a more efficient way to achieve this computation without using loops? I already tried to find a solution via tensordot and einsum to no avail.

The vectorization solution is to swap the last two axes of arr1:
>>> N, M, D = 2, 3, 4
>>> np.random.seed(0)
>>> arr1 = np.random.normal(size=(N, M, D))
>>> arr2 = np.random.normal(size=(N, M, D))
>>> arr1.transpose(0, 2, 1) @ arr2
array([[[ 6.95815626, 0.38299107, 0.40600482, 0.35990016],
[-0.95421604, -2.83125879, -0.2759683 , -0.38027618],
[ 3.54989101, -0.31274318, 0.14188485, 0.19860495],
[ 3.56319723, -6.36209602, -0.42687188, -0.24932248]],
[[ 0.67081341, -0.08816343, 0.35430089, 0.69962394],
[ 0.0316968 , 0.15129449, -0.51592291, 0.07118177],
[-0.22274906, -0.28955683, -1.78905988, 1.1486345 ],
[ 1.68432706, 1.93915798, 2.25785798, -2.34404577]]])
A simple benchmark for a very large N:
In [225]: arr1.shape
Out[225]: (100000, 10, 50)
In [226]: %%timeit
...: result = []
...: for i in range(N):
...:     result.append(arr1[i].T @ arr2[i])
...: result = np.array(result)
...:
...:
12.4 s ± 224 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [227]: %timeit arr1.transpose(0, 2, 1) @ arr2
843 ms ± 7.79 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Use a preallocated list and skip the conversion to an array after the loop ends. The performance is then not much worse than the vectorized version, which means that most of the overhead comes from the final data conversion:
In [375]: %%timeit
...: result = [None] * N
...: for i in range(N):
...:     result[i] = arr1[i].T @ arr2[i]
...: # result = np.array(result)
...:
...:
1.22 s ± 16.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Performance of loop solution with data conversion:
In [376]: %%timeit
...: result = [None] * N
...: for i in range(N):
...:     result[i] = arr1[i].T @ arr2[i]
...: result = np.array(result)
...:
...:
11.3 s ± 179 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Another variant follows the answer by @9769953 and adds a further optimization. To my surprise, its performance is almost the same as the vectorized solution:
In [378]: %%timeit
...: result = np.empty_like(arr1, shape=(N, D, D))
...: for res, ar1, ar2 in zip(result, arr1.transpose(0, 2, 1), arr2):
...:     np.matmul(ar1, ar2, out=res)
...:
843 ms ± 4.04 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
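Since the question mentions trying einsum without success, here is a small sketch of how the same batched product might be written as an einsum and checked against both the loop and the transposed matmul. The sizes are deliberately tiny and the names looped, batched and via_einsum are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, D = 4, 3, 5
arr1 = rng.normal(size=(N, M, D))
arr2 = rng.normal(size=(N, M, D))

# Reference: the explicit loop from the question.
looped = np.array([arr1[i].T @ arr2[i] for i in range(N)])

# Batched matmul after swapping the last two axes of arr1.
batched = arr1.transpose(0, 2, 1) @ arr2

# The same contraction as an einsum: sum over m, keep the batch axis n.
via_einsum = np.einsum('nmd,nme->nde', arr1, arr2)

print(batched.shape, np.allclose(looped, batched), np.allclose(looped, via_einsum))
```

Whether the einsum form is faster than the matmul form is not something this sketch establishes; it only shows that the subscripts 'nmd,nme->nde' describe the same computation.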

Out of interest, I wondered about the loop overhead, which I guess is minimal compared to the matrix multiplication. In particular, it should be minimal compared to the potential reallocation of the list memory, which with N = 10000 could be significant.
Using a pre-allocated array instead of a list, I compared the loop result and the solution provided by Mechanic Pig, and achieved the following results on my machine:
In [10]: %timeit result1 = arr1.transpose(0, 2, 1) @ arr2
33.7 ms ± 117 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
versus
In [14]: %%timeit
...: result = np.empty_like(arr1, shape=(N, D, D))
...: for i in range(N):
...:     result[i, ...] = arr1[i].T @ arr2[i]
...:
...:
48.5 ms ± 184 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
The pure NumPy solution is still faster, so that's good, but only by a factor of about 1.5. Not too bad. Depending on the needs, the loop may be clearer about what it intends (and easier to modify, in case there's a need for an if-statement or other shenanigans).
And naturally, a simple comment above the faster solution can easily point out what it actually replaces.
Following the comments to this answer by Mechanic Pig, I've added below the timing results of a loop without preallocating an array (but with a preallocated list) and without conversion to a NumPy array. Mainly so the results are compared for the same machine:
In [11]: %%timeit
...: result = [None] * N
...: for i in range(N):
...:     result.append(arr1[i].T @ arr2[i])
...:
49.5 ms ± 672 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
So interestingly, this result, without conversion, is (a tiny bit) slower than the one with a pre-allocated array and direct assignment into the array.

Tensor dot product with rank one tensor

I'm trying to compute an inner product between tensors in numpy.
I have a vector x of shape (n,) and a tensor y of shape d*(n,) with d > 1 and would like to compute $\langle y, x^{\otimes d} \rangle$. That is, I want to compute the sum
$$\langle y, x^{\otimes d} \rangle = \sum_{i_1,\dots,i_d\in\{1,\dots,n\}} y[i_1, \dots, i_d]\, x[i_1] \cdots x[i_d].$$
A working implementation I have uses a function to first compute $x^{\otimes d}$ and then uses np.tensordot:
import numpy as np

def d_fold_tensor_product(x, d) -> np.ndarray:
    """
    Compute d-fold tensor product of a vector.
    """
    assert d > 1, "Tensor order must be bigger than 1."
    xd = np.tensordot(x, x, axes=0)
    while d > 2:
        xd = np.tensordot(xd, x, axes=0)
        d -= 1
    return xd
n = 10
d = 4
x = np.random.random(n)
y = np.random.random(d * (n,))
result = np.tensordot(y, d_fold_tensor_product(x, d), axes=d)
Is there a more efficient and pythonic way? Perhaps without having to compute $x^{\otimes d}$.

The math is hard to read, so I'm going to skip that. Instead let's look at the sample calculation
In [168]: n = 10
...: d = 4
...: x = np.random.random(n)
...: y = np.random.random(d * (n,))
In [169]: x.shape
Out[169]: (10,)
In [171]: d_fold_tensor_product(x,d).shape
Out[171]: (10, 10, 10, 10)
In [172]: result = np.tensordot(y, d_fold_tensor_product(x, d), axes=d)
In [174]: result
Out[174]: array(384.20478955)
In [175]: y.shape
Out[175]: (10, 10, 10, 10)
tensordot can be a complex call, though it all reduces to a call to dot. I once dug through the action with a single axis value. But without revisiting that, or even looking at the docs (shame on me, I know :), this flattened dot does the same thing:
In [176]: np.dot(y.ravel(), d_fold_tensor_product(x, d).ravel())
Out[176]: 384.20478955316673
So the d_fold... has somehow expanded or replicated x to a 4d array. Guess I'll have to digest that action :(
That function is doing repeated outer products:
In [177]: np.tensordot(x,x,axes=0).shape
Out[177]: (10, 10)
In [178]: np.allclose(np.tensordot(x,x,axes=0), x[:,None]*x)
Out[178]: True
In [181]: temp = d_fold_tensor_product(x,d)
In [182]: np.allclose(temp, x[:,None,None,None]*x[:,None,None]*x[:,None]*x)
Out[182]: True
or put all together:
In [184]: np.dot((x[:,None,None,None]*x[:,None,None]*x[:,None]*x).ravel(),y.ravel())
Out[184]: 384.20478955316673
So that eliminates the repeated tensordot, but isn't easily generalizable to other d.
Another way - still not generalizable, but may help visualize the task:
In [186]: np.einsum('ijkl,i,j,k,l',y,x,x,x,x)
Out[186]: 384.2047895531675
Some timings - your use of tensordot is slower than the most direct outer product:
In [193]: timeit temp = d_fold_tensor_product(x,d)
151 µs ± 127 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
In [194]: timeit x[:,None,None,None]*x[:,None,None]*x[:,None]*x
61.3 µs ± 1.36 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
A generalization of the outer product is in between:
In [195]: timeit np.multiply.reduce(np.array(np.ix_(x,x,x,x),object))
85.1 µs ± 57.2 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
A more general way of doing the repeated outer product:
def foo(x,d):
    x1 = np.expand_dims(x, tuple(range(1,d))) # make (10,1,1,1)
    res = x1
    for _ in range(1,d):
        x1 = x1[...,0]
        res = res*x1
    return res
In [219]: foo(x,d).shape
Out[219]: (10, 10, 10, 10)
It times almost as well as the explicit version:
In [220]: timeit foo(x,d)
72.7 µs ± 47.3 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
In [221]: np.dot(foo(x,d).ravel(),y.ravel())
Out[221]: 384.20478955316673
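The question also asks whether the full outer product can be avoided. One possible sketch, not taken from the answer above, is to contract y with x one axis at a time, so the largest intermediate has n**(d-1) entries instead of a second n**d array. The helper name contract_with_vector is hypothetical.

```python
import numpy as np

def contract_with_vector(y, x):
    """Contract every axis of y with x, one axis at a time (illustrative helper)."""
    out = y
    for _ in range(y.ndim):
        # For an N-D array and a 1-D vector, np.dot sums over the last axis.
        out = np.dot(out, x)
    return out

rng = np.random.default_rng(0)
n, d = 6, 4
x = rng.random(n)
y = rng.random(d * (n,))

result = contract_with_vector(y, x)
reference = np.einsum('ijkl,i,j,k,l', y, x, x, x, x)
print(np.isclose(result, reference))
```

Because the loop runs y.ndim times, the same function should work for any d without writing out the subscripts by hand.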

How to vectorize computation on arrays of different dimensions?

I have some large numpy arrays of complex numbers I need to perform computations on.
import numpy as np
# Reduced sizes -- real ones are orders of magnitude larger
n, d, l = 50000, 3, 1000
# Two complex matrices and a column vector
x = np.random.rand(n, d) + 1j*np.random.rand(n, d)
s = np.random.rand(n, d) + 1j*np.random.rand(n, d)
v = np.random.rand(l)[:, np.newaxis]
The function is basically x*v*s for each row of x (and s) and then that product is summed across the row. Because the arrays are different sizes, I can't figure out a way to vectorize the computation and it's way too slow to use a for-loop.
My current implementation is this (~3.5 seconds):
h = []
for i in range(len(x)):
    h.append(np.sum(x[i,:]*v*s[i,:], axis=1))
h = np.asarray(h)
I also tried using np.apply_along_axis() with an augmented matrix but it's only slightly faster (~2.6s) and not that readable.
def func(m, v):
    return np.sum(m[:d]*v*m[d:], axis=1)
h = np.apply_along_axis(func, 1, np.hstack([x, s]), v)
What's a much quicker way to compute this result? I can leverage other packages such as dask if that helps.

With broadcasting this should work:
np.sum((x*s)[...,None]*v[:,0], axis=1)
but with your sample dimensions I'm getting a memory error. The 'outer' broadcasted array (n,d,l) shape is too large for my memory.
I can reduce memory usage by iterating on the smaller d dimension:
res = np.zeros((n,l), dtype=x.dtype)
for i in range(d):
    res += (x[:,i]*s[:,i])[:,None]*v[:,0]
This tests the same as your h, but I wasn't able to complete time tests. Generally iterating on the smaller dimension is faster.
Below I repeat things with smaller dimensions.
This probably can also be expressed as an einsum problem, though it may not help with these dimensions.
In [1]: n, d, l = 5000, 3, 1000
...:
...: # Two complex matrices and a column vector
...: x = np.random.rand(n, d) + 1j*np.random.rand(n, d)
...: s = np.random.rand(n, d) + 1j*np.random.rand(n, d)
...: v = np.random.rand(l)[:, np.newaxis]
In [2]: h = []
...: for i in range(len(x)):
...:     h.append(np.sum(x[i,:]*v*s[i,:], axis=1))
...:
...: h = np.asarray(h)
In [3]: h.shape
Out[3]: (5000, 1000)
In [4]: res = np.zeros((n,l), dtype=x.dtype)
...: for i in range(d):
...:     res += (x[:,i]*s[:,i])[:,None]*v[:,0]
...:
In [5]: res.shape
Out[5]: (5000, 1000)
In [6]: np.allclose(res,h)
Out[6]: True
In [7]: %%timeit
...: h = []
...: for i in range(len(x)):
...:     h.append(np.sum(x[i,:]*v*s[i,:], axis=1))
...: h = np.asarray(h)
...:
...:
490 ms ± 3.36 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [8]: %%timeit
...: res = np.zeros((n,l), dtype=x.dtype)
...: for i in range(d):
...:     res += (x[:,i]*s[:,i])[:,None]*v[:,0]
...:
354 ms ± 1.44 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [9]: np.sum((x*s)[...,None]*v[:,0], axis=1).shape
Out[9]: (5000, 1000)
In [10]: out = np.sum((x*s)[...,None]*v[:,0], axis=1)
In [11]: np.allclose(h,out)
Out[11]: True
In [12]: timeit out = np.sum((x*s)[...,None]*v[:,0], axis=1)
310 ms ± 964 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
Some time savings, but not big.
And the einsum version:
In [13]: np.einsum('ij,ij,k->ik',x,s,v[:,0]).shape
Out[13]: (5000, 1000)
In [14]: np.allclose(np.einsum('ij,ij,k->ik',x,s,v[:,0]),h)
Out[14]: True
In [15]: timeit np.einsum('ij,ij,k->ik',x,s,v[:,0]).shape
167 ms ± 137 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Good time savings. But I don't know how it will scale.
But the einsum made me realize that we can sum on d dimension earlier, before multiplying by v - and gain a lot in time and memory usage:
In [16]: np.allclose(np.sum(x*s, axis=1)[:,None]*v[:,0],h)
Out[16]: True
In [17]: timeit np.sum(x*s, axis=1)[:,None]*v[:,0]
68.4 ms ± 1.58 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
@cs95 got there first!
As per @PaulPanzer's comment, the optimize flag helps. It's probably making the same deduction - that we can sum on j early:
In [18]: timeit np.einsum('ij,ij,k->ik',x,s,v[:,0],optimize=True).shape
91.6 ms ± 991 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
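The early-summation trick works because v does not depend on the summed index: h[i,k] = sum_j x[i,j]*v[k]*s[i,j] = v[k] * sum_j x[i,j]*s[i,j]. A small sketch of that factorization, with tiny illustrative sizes and np.outer used in place of the broadcasting expression above:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, l = 40, 3, 7  # tiny sizes chosen only for a quick check
x = rng.random((n, d)) + 1j * rng.random((n, d))
s = rng.random((n, d)) + 1j * rng.random((n, d))
v = rng.random(l)[:, np.newaxis]

# Row-by-row reference, as in the question.
h = np.asarray([np.sum(x[i, :] * v * s[i, :], axis=1) for i in range(n)])

# Sum over d first, then take the outer product with v.
row_sums = np.sum(x * s, axis=1)       # shape (n,)
h_fast = np.outer(row_sums, v[:, 0])   # shape (n, l)

print(h_fast.shape, np.allclose(h, h_fast))
```

With this ordering the (n, d, l) intermediate is never formed, which is where the memory error in the first broadcasting attempt came from.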

Numpy - How to remove trailing N*8 zeros

I have 1d array, I need to remove all trailing blocks of 8 zeros.
[0,1,1,0,1,0,0,0, 0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0]
->
[0,1,1,0,1,0,0,0]
a.shape[0] % 8 == 0 always, so no worries about that.
Is there a better way to do it?
import numpy as np
P = 8
arr1 = np.random.randint(2,size=np.random.randint(5,10) * P)
arr2 = np.random.randint(1,size=np.random.randint(5,10) * P)
arr = np.concatenate((arr1, arr2))
indexes = []
arr = np.flip(arr).reshape(arr.shape[0] // P, P)
for i, f in enumerate(arr):
    if (f == 0).all():
        indexes.append(i)
    else:
        break
arr = np.delete(arr, indexes, axis=0)
arr = np.flip(arr.reshape(arr.shape[0] * P))

You can do it without allocating more space by using views and np.argmax to get the last nonzero element:
index = arr.size - np.argmax(arr[::-1])
Rounding up to the nearest multiple of eight is easy:
index = int(np.ceil(index / 8) * 8)
Now chop off the rest:
arr = arr[:index]
Or as a one-liner:
arr = arr[:int(np.ceil((arr.size - np.argmax(arr[::-1])) / 8)) * 8]
This version is O(n) in time and O(1) in space because it reuses the same buffers for everything (including the output).
This has the additional advantage that it will work correctly even if there are no trailing zeros. Using argmax does rely on all the nonzero elements having the same value, though (as in a 0/1 array), because argmax returns the first occurrence of the maximum. If that is not the case, you will need to compute a mask first, e.g. with arr.astype(bool).
If you want to use your original approach, you could vectorize that too, although there will be a bit more overhead:
view = arr.reshape(-1, 8)
mask = view.any(axis = 1)
index = view.shape[0] - np.argmax(mask[::-1])
arr = arr[:index * 8]
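Putting the steps together, a minimal sketch might look like the function below. The name trim_trailing_blocks and the explicit all-zero branch are additions for illustration. Unlike the view-only version, it builds a boolean mask so that arbitrary nonzero values are handled, which costs O(n) extra memory.

```python
import numpy as np

def trim_trailing_blocks(arr, block=8):
    """Drop trailing all-zero blocks of length `block` (hypothetical helper)."""
    nonzero = arr != 0
    if not nonzero.any():
        # argmax would return 0 here and nothing would be trimmed,
        # so an all-zero input is handled separately.
        return arr[:0]
    # One past the position of the last nonzero element.
    last = arr.size - np.argmax(nonzero[::-1])
    # Round up to the next multiple of `block` with integer arithmetic.
    end = -(-last // block) * block
    return arr[:end]

example = np.array([0, 1, 1, 0, 1, 0, 0, 0] + [0] * 16)
print(trim_trailing_blocks(example))
```

The returned array is a view of the input, so no data is copied by the final slice.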

There is a numpy function that does almost what you want: np.trim_zeros. We can use that:
import numpy as np
def trim_mod(a, m=8):
    t = np.trim_zeros(a, 'b')
    return a[:len(a)-(len(a)-len(t))//m*m]
def test(a, t, m=8):
    assert (len(a) - len(t)) % m == 0
    assert len(t) < m or np.any(t[-m:])
    assert not np.any(a[len(t):])
for _ in range(1000):
    a = (np.random.random(np.random.randint(10, 100000))<0.002).astype(int)
    m = np.random.randint(4, 20)
    t = trim_mod(a, m)
    test(a, t, m)
print("Looks correct")
Prints:
Looks correct
It seems to scale linearly in the number of trailing zeros (plot not reproduced here), but it feels rather slow in absolute terms (units are ms per trial), so maybe np.trim_zeros is just a Python loop.
Code for the picture:
from timeit import timeit
A = (np.random.random(1000000)<0.02).astype(int)
m = 8
T = []
for last in range(1, 1000, 9):
    A[-last:] = 0
    A[-last] = 1
    T.append(timeit(lambda: trim_mod(A, m), number=100)*10)
import pylab
pylab.plot(range(1, 1000, 9), T)
pylab.show()

A low level approach :
import numba
@numba.njit
def trim8(a):
    n=a.size-1
    while n>=0 and a[n]==0 : n-=1
    c= (n//8+1)*8
    return a[:c]
Some tests :
In [194]: A[-1]=1 # best case
In [196]: %timeit trim_mod(A,8)
5.7 µs ± 323 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [197]: %timeit trim8(A)
714 ns ± 33.7 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [198]: %timeit A[:(A.size - np.argmax(A[::-1]) // 8) * 8]
4.83 ms ± 479 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [202]: A[:]=0 #worst case
In [203]: %timeit trim_mod(A,8)
2.5 s ± 49.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [204]: %timeit trim8(A)
1.14 ms ± 71.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [205]: %timeit A[:(A.size - np.argmax(A[::-1]) // 8) * 8]
5.5 ms ± 950 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
It has a short circuit mechanism like trim_zeros, but is much faster.

Numpy grouping by range of difference between elements

I have an array of angles that I want to group into arrays with a max difference of 2 deg between them.
eg: input:
angles = np.array([[1],[2],[3],[4],[4],[5],[10]])
output
('group', 1)
[[1]
[2]
[3]]
('group', 2)
[[4]
[4]
[5]]
('group', 3)
[[10]]
numpy.diff gets the difference of the next element from the current, I need the difference of the next elements from the first of the group
itertools.groupby groups the elements not within a definable range
numpy.digitize groups the elements by a predefined range, not by the range specified by the elements of the array.
(Maybe I can use this by getting the unique values of angles, grouping them by their difference and using that as the predefined range?)
My approach which works but seems extremely inefficient and non-pythonic:
(I am using expand_dims and vstack because I'm working with 1d arrays (not just angles), but I've reduced them to simplify this question.)
angles = np.array([[1],[2],[3],[4],[4],[5],[10]])
groupedangles = []
idx1 = 0
diffAngleMax = 2
while(idx1 < len(angles)):
    angleA = angles[idx1]
    group = np.expand_dims(angleA, axis=0)
    for idx2 in range(idx1+1,len(angles)):
        angleB = angles[idx2]
        diffAngle = angleB - angleA
        if abs(diffAngle) <= diffAngleMax:
            group = np.vstack((group,angleB))
        else:
            idx1 = idx2
            groupedangles.append(group)
            break
    if idx2 == len(angles) - 1:
        if idx1 == idx2:
            angleA = angles[idx1]
            group = np.expand_dims(angleA, axis=0)
        groupedangles.append(group)
        break
for idx, x in enumerate(groupedangles):
    print('group', idx+1)
    print(x)
What is a better and faster way to do this?

Update: here is a Cython treatment.
In [1]: import cython
In [2]: %load_ext Cython
In [3]: %%cython
...: import numpy as np
...: cimport numpy as np
...: def cluster(np.ndarray array, np.float64_t maxdiff):
...:     cdef np.ndarray[np.float64_t, ndim=1] flat = np.sort(array.flatten())
...:     cdef list breakpoints = []
...:     cdef np.float64_t seed = flat[0]
...:     cdef np.int64_t int = 0
...:     for i in range(0, len(flat)):
...:         if (flat[i] - seed) > maxdiff:
...:             breakpoints.append(i)
...:             seed = flat[i]
...:     return np.split(array, breakpoints)
...:
Sparsity test
In [4]: angles = np.random.choice(np.arange(5000), 500).astype(np.float64)[:, None]
In [5]: %timeit cluster(angles, 2)
422 µs ± 12.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Duplication test
In [6]: angles = np.random.choice(np.arange(500), 1500).astype(np.float64)[:, None]
In [7]: %timeit cluster(angles, 2)
263 µs ± 14.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Both tests show a significant improvement. The algorithm now sorts the input and makes a single run over the sorted array, which makes it stable O(N*log(N)).
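For readers without a Cython toolchain, the sort-then-scan idea can be sketched in plain Python and NumPy. The name cluster_sorted is hypothetical, and this sketch splits the sorted values rather than the input array, on the assumption that the original order of the input does not need to be preserved.

```python
import numpy as np

def cluster_sorted(values, maxdiff):
    """Group sorted values so each group spans at most maxdiff from its first element."""
    flat = np.sort(np.asarray(values, dtype=float).ravel())
    if flat.size == 0:
        return []
    breakpoints = []
    seed = flat[0]
    for i, val in enumerate(flat):
        # Start a new group once the distance from the group's seed exceeds maxdiff.
        if val - seed > maxdiff:
            breakpoints.append(i)
            seed = val
    return np.split(flat, breakpoints)

angles = np.array([[1], [2], [3], [4], [4], [5], [10]])
for idx, group in enumerate(cluster_sorted(angles, 2), start=1):
    print('group', idx, group)
```

The scan itself is a single pass, so the sort dominates the cost. The pure-Python loop will be slower per element than the compiled version timed above.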
Pre-update
This is a variation on seed clustering. It requires no sorting.
def cluster(array, maxdiff):
    tmp = array.copy()
    groups = []
    while len(tmp):
        # select seed
        seed = tmp.min()
        mask = (tmp - seed) <= maxdiff
        groups.append(tmp[mask, None])
        tmp = tmp[~mask]
    return groups
Example:
In [27]: cluster(angles, 2)
Out[27]:
[array([[1],
[2],
[3]]), array([[4],
[4],
[5]]), array([[10]])]
A benchmark for 500, 1000 and 1500 angles:
In [4]: angles = np.random.choice(np.arange(500), 500)[:, None]
In [5]: %timeit cluster(angles, 2)
1.25 ms ± 60.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [6]: angles = np.random.choice(np.arange(500), 1000)[:, None]
In [7]: %timeit cluster(angles, 2)
1.46 ms ± 37 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [8]: angles = np.random.choice(np.arange(500), 1500)[:, None]
In [9]: %timeit cluster(angles, 2)
1.99 ms ± 72.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
While the algorithm is O(N^2) in the worst case and O(N) in the best case, the benchmarks above clearly show near-linear time growth, because the actual runtime depends on the structure of your data: sparsity and the duplication rate. In most real cases you won't hit the worst case.
Some sparsity benchmarks
In [4]: angles = np.random.choice(np.arange(500), 500)[:, None]
In [5]: %timeit cluster(angles, 2)
1.06 ms ± 27.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [6]: angles = np.random.choice(np.arange(1000), 500)[:, None]
In [7]: %timeit cluster(angles, 2)
1.79 ms ± 117 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [8]: angles = np.random.choice(np.arange(1500), 500)[:, None]
In [9]: %timeit cluster(angles, 2)
2.16 ms ± 90.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [10]: angles = np.random.choice(np.arange(5000), 500)[:, None]
In [11]: %timeit cluster(angles, 2)
3.21 ms ± 139 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Here is a sorting based solution. One could try and be a bit smarter and use bincount and argpartition to avoid the sorting, but at N <= 500 it's not worth the trouble.
import numpy as np
def flexibin(a):
    idx0 = np.argsort(a)
    as_ = a[idx0]
    A = np.r_[as_, as_+2]
    idx = np.argsort(A)
    uinv = np.flatnonzero(idx >= len(a))
    linv = np.empty_like(idx)
    linv[np.flatnonzero(idx < len(a))] = np.arange(len(a))
    bins = [0]
    curr = 0
    while True:
        for j in range(uinv[idx[curr]], len(idx)):
            if idx[j] < len(a) and A[idx[j]] > A[idx[curr]] + 2:
                bins.append(j)
                curr = j
                break
        else:
            return np.split(idx0, linv[bins[1:]])
a = 180 * np.random.random((500,))
bins = flexibin(a)
mn, mx = zip(*((np.min(a[b]), np.max(a[b])) for b in bins))
assert np.all(np.diff(mn) > 2)
assert np.all(np.subtract(mx, mn) <= 2)
print('all ok')
