Let's say I have an array L = [1,0,5,1] and I want to put it into two bins, I would like to get out Lbin = [1,6]. Similarly let's say L = [1,3,5,2,6,7] and I want to put it into three bins, I would like to get out Lbin = [4,7,13].
If b is the number of bins and we assume that b divides len(L), is there a numpy function to do this?
My array L will be large and I have a lot of them so I need a linear time solution to the problem.
The answer by Divakar is very nice. As an addition:
Is there an easy way to deal with the situation where b doesn't divide len(L), so that the last bin just has fewer elements in it? So L=[1,0,5,1,4] with b = 2 would give you [6,5].
We can reshape the array so that each row holds one group, and then sum each row to get the desired output, like so -
np.reshape(L,(num_bins,-1)).sum(1)
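As a minimal sketch of the reshape idea applied to the two examples from the question (the helper name sum_bins_reshape and its divisibility check are assumptions made for illustration, not part of the answer):

```python
import numpy as np

def sum_bins_reshape(values, num_bins):
    """Sum consecutive, equally sized groups of `values` (hypothetical helper)."""
    values = np.asarray(values)
    if values.size % num_bins != 0:
        raise ValueError("num_bins must divide the array length")
    # Each row of the reshaped array is one bin; summing along axis 1 collapses it.
    return values.reshape(num_bins, -1).sum(axis=1)

two_bins = sum_bins_reshape([1, 0, 5, 1], 2)
three_bins = sum_bins_reshape([1, 3, 5, 2, 6, 7], 3)
print(two_bins, three_bins)
```

The reshape itself does not copy a contiguous array, so the whole thing is a single linear pass over the data.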
For arrays with lengths not necessarily divisible by the number of bins -
def sum_groups(L, num_bins):
    n = len(L)
    grp_len = int(np.ceil(n/float(num_bins)))
    b = int(n%num_bins!=0)
    lim = grp_len*(num_bins-b)
    p0 = np.reshape(L[:lim],(-1,grp_len)).sum(1)
    if b!=0:
        p1 = np.sum(L[lim:])
        return np.r_[p0,p1]
    else:
        return p0
Bringing in np.einsum for cases when the binned summations are within the input array dtype precision -
def sum_groups_einsum(L, num_bins):
    n = len(L)
    grp_len = int(np.ceil(n/float(num_bins)))
    b = int(n%num_bins!=0)
    lim = grp_len*(num_bins-b)
    p0 = np.einsum('ij->i',np.reshape(L[:lim],(-1,grp_len)))
    if b!=0:
        p1 = np.einsum('i->',L[lim:])
        return np.r_[p0,p1]
    else:
        return p0
Benchmarking
Following closely the OP's timing setup -
In [404]: # Setup
...: np.random.seed(0)
...: L = np.random.randint(0,high = 6, size = 10000000)
...: b = 20
In [405]: %timeit sum_groups(L, num_bins=b)
...: %timeit sum_groups_einsum(L, num_bins=b)
...: %timeit np.array([t.sum() for t in np.array_split(L, b)])
...: %timeit np.add.reduceat(L, np.linspace(0.5, L.size+0.5, b, False, dtype=int))
100 loops, best of 3: 6.2 ms per loop
100 loops, best of 3: 6 ms per loop
100 loops, best of 3: 6.25 ms per loop # @user2699's soln
100 loops, best of 3: 6.19 ms per loop # @Paul Panzer's soln
For the case when the array length is not divisible by the number of bins, let's add a few more elements to the input array -
In [406]: # Setup
...: np.random.seed(0)
...: L = np.random.randint(0,high = 6, size = 10000012)
...: b = 20
In [407]: %timeit sum_groups(L, num_bins=b)
...: %timeit sum_groups_einsum(L, num_bins=b)
...: %timeit np.array([t.sum() for t in np.array_split(L, b)])
...: %timeit np.add.reduceat(L, np.linspace(0.5, L.size+0.5, b, False, dtype=int))
100 loops, best of 3: 6.45 ms per loop
100 loops, best of 3: 6.05 ms per loop
100 loops, best of 3: 6.45 ms per loop
100 loops, best of 3: 6.51 ms per loop
Running those a few more times, the first one and the last two had very comparable runtimes, and the second one with einsum was a tiny bit faster than the rest.
The following works,
array([t.sum() for t in array_split(L, b)])
And if, as you stated, you know that b divides L evenly, you can replace array_split with the split function.
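A small illustration of the difference between the two functions, using the uneven example from the question (5 elements, 2 bins). This is only a sketch; the variable names are made up here:

```python
import numpy as np

L = np.array([1, 0, 5, 1, 4])
b = 2

# array_split tolerates uneven division: the leading parts get the extra elements.
parts = np.array_split(L, b)
part_sizes = [len(p) for p in parts]
bin_sums = np.array([p.sum() for p in parts])
print(part_sizes, bin_sums)

# split insists on an even division and raises otherwise.
try:
    np.split(L, b)
    split_failed = False
except ValueError:
    split_failed = True
print(split_failed)
```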
Here are some benchmarks, with b=100 and L = randint(0, 100, 1000)
%timeit sum_groups(L, b) # Defined in Divakar's answer
8.09 µs ± 293 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit array([t.sum() for t in array_split(L, b)])
260 µs ± 2.12 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit np.add.reduceat(L, np.linspace(0.5, L.size+0.5, b, False, dtype=int))
15.9 µs ± 1.45 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
and with b=3 and L = randint(0, 100, 1000)
%timeit sum_groups(L, b)
23.2 µs ± 317 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit array([t.sum() for t in array_split(L, b)])
16.2 µs ± 171 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit np.add.reduceat(L, np.linspace(0.5, L.size+0.5, b, False, dtype=int))
15 µs ± 1.77 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Depending on your data, it looks like Divakar's answer using reshaping may be the best approach.
You could use np.add.reduceat:
>>> np.add.reduceat(L, np.linspace(0, L.size, nbin, False, dtype=int))
It rounds the bin edges differently to your example, though:
>>> L = np.array([1,0,5,1,4])
>>> np.add.reduceat(L, np.linspace(0, L.size, nbin, False, dtype=int))
array([ 1, 10])
To get your rounding:
>>> np.add.reduceat(L, np.linspace(0.5, L.size+0.5, nbin, False, dtype=int))
array([6, 5])
To squeeze out a bit more performance we can avoid linspace and use integer arithmetic:
>>> np.add.reduceat(L, np.arange(nbin//2, L.size * nbin, L.size) // nbin)
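To make the three ways of computing the bin edges concrete, here is a sketch that evaluates each of them on the five-element example above (the variable names are illustrative only):

```python
import numpy as np

L = np.array([1, 0, 5, 1, 4])
nbin = 2

# Start index of every bin under the three schemes discussed above.
edges_floor = np.linspace(0, L.size, nbin, False, dtype=int)
edges_round = np.linspace(0.5, L.size + 0.5, nbin, False, dtype=int)
edges_int = np.arange(nbin // 2, L.size * nbin, L.size) // nbin

# reduceat sums L[edges[i]:edges[i+1]], with the last bin running to the end.
sums_floor = np.add.reduceat(L, edges_floor)
sums_round = np.add.reduceat(L, edges_round)
sums_int = np.add.reduceat(L, edges_int)
print(edges_floor, sums_floor)
print(edges_round, sums_round)
print(edges_int, sums_int)
```

For this input the integer-arithmetic edges coincide with the rounded linspace edges; the sketch does not claim that the two agree bit-for-bit for every size, since linspace works in floating point.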
It is worth mentioning that reshape based solutions do not always give the same result as the others, in fact, there are quite a few cases where reshape simply doesn't work. Example: 50 elements, 20 groups. This requires groups of 2 and 3 elements, 10 groups each. Obviously, this cannot be done by reshaping.
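The 50-element, 20-group case can be checked directly. The sketch below computes the bin sizes produced by the integer-arithmetic edges and then attempts the slice-and-reshape step used by the reshape-based functions (with group length 3 and one leftover bin); the variable names are assumptions for illustration:

```python
import numpy as np

n_elements, n_bins = 50, 20
L = np.arange(n_elements)

# Bin start indices from the integer-arithmetic formula, then the bin sizes.
starts = np.arange(n_bins // 2, n_elements * n_bins, n_elements) // n_bins
sizes = np.diff(np.append(starts, n_elements))
size_counts = {int(s): int((sizes == s).sum()) for s in np.unique(sizes)}
print(size_counts)

# The reshape route would need 19 full groups of ceil(50/20) = 3 elements,
# i.e. 57 elements, which the array does not have.
grp_len = int(np.ceil(n_elements / n_bins))
try:
    L[:grp_len * (n_bins - 1)].reshape(-1, grp_len)
    reshape_ok = True
except ValueError:
    reshape_ok = False
print(reshape_ok)
```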
Performance comparison (10 bins, element count not a multiple):
Benchmarking code:
import perfplot
import numpy as np
def sg_reshape(args):
    L, num_bins = args
    n = len(L)
    grp_len = int(np.ceil(n/float(num_bins)))
    b = int(n%num_bins!=0)
    lim = grp_len*(num_bins-b)
    p0 = np.reshape(L[:lim],(-1,grp_len)).sum(1)
    if b!=0:
        p1 = np.sum(L[lim:])
        return np.r_[p0,p1]
    else:
        return p0

def sg_einsum(args):
    L, num_bins = args
    n = len(L)
    grp_len = int(np.ceil(n/float(num_bins)))
    b = int(n%num_bins!=0)
    lim = grp_len*(num_bins-b)
    p0 = np.einsum('ij->i',np.reshape(L[:lim],(-1,grp_len)))
    if b!=0:
        p1 = np.sum(L[lim:])
        return np.r_[p0,p1]
    else:
        return p0

def sg_addred(args):
    L, nbin = args
    return np.add.reduceat(L, np.linspace(0.5, L.size+0.5, nbin, False, dtype=int))

def sg_intarith(args):
    L, nbin = args
    return np.add.reduceat(L, np.arange(nbin//2, L.size * nbin, L.size) // nbin)

def sg_arrsplit(args):
    L, b = args
    return np.array([t.sum() for t in np.array_split(L, b)])

perfplot.save('cho10.png',
    setup=lambda n: (np.random.randint(0, 9, (n,)), 10),
    n_range=[2**k for k in range(8, 23)],
    kernels=[
        sg_reshape,
        sg_einsum,
        sg_addred,
        sg_intarith,
        sg_arrsplit
    ],
    logx=True,
    logy=True,
    xlabel='#elements',
    equality_check=None
)
Related
Say I have an array of distances x=[1,2,1,3,3,2,1,5,1,1].
I want to get the indices from x where cumsum reaches 10, in this case, idx=[4,9].
So the cumsum restarts after the condition is met.
I can do it with a loop, but loops are slow for large arrays and I was wondering if I could do it in a vectorized way.
A fun method
sumlm = np.frompyfunc(lambda a,b:a+b if a < 10 else b,2,1)
newx=sumlm.accumulate(x, dtype=object)
newx
array([1, 3, 4, 7, 10, 2, 3, 8, 9, 10], dtype=object)
np.nonzero(newx==10)
(array([4, 9]),)
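The snippet above hard-codes the target 10 and looks for an exact hit. A hedged variation that takes the target as a parameter and treats any running total at or above it as a breach could look like this (the function name cumsum_breach_frompyfunc is made up for this sketch):

```python
import numpy as np

def cumsum_breach_frompyfunc(x, target):
    """Indices where a restarting running sum reaches `target` (illustrative sketch)."""
    # Keep adding while the running total is below the target, otherwise restart.
    step = np.frompyfunc(lambda acc, val: acc + val if acc < target else val, 2, 1)
    running = step.accumulate(np.asarray(x, dtype=object), dtype=object)
    # The comparison uses >= so that overshooting the target also counts.
    return np.flatnonzero(running.astype(np.int64) >= target)

exact_hits = cumsum_breach_frompyfunc([1, 2, 1, 3, 3, 2, 1, 5, 1, 1], 10)
overshoot = cumsum_breach_frompyfunc([6, 6, 1, 9, 1], 10)
print(exact_hits, overshoot)
```

This still calls a Python lambda once per element, so it is a compact way to write the loop rather than a way to avoid it.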
Here's one with numba and array-initialization -
from numba import njit
@njit
def cumsum_breach_numba2(x, target, result):
    total = 0
    iterID = 0
    for i,x_i in enumerate(x):
        total += x_i
        if total >= target:
            result[iterID] = i
            iterID += 1
            total = 0
    return iterID

def cumsum_breach_array_init(x, target):
    x = np.asarray(x)
    result = np.empty(len(x),dtype=np.uint64)
    idx = cumsum_breach_numba2(x, target, result)
    return result[:idx]
Timings
Including @piRSquared's solutions and using the benchmarking setup from the same post -
In [58]: np.random.seed([3, 1415])
...: x = np.random.randint(100, size=1000000).tolist()
# @piRSquared soln1
In [59]: %timeit list(cumsum_breach(x, 10))
10 loops, best of 3: 73.2 ms per loop
# @piRSquared soln2
In [60]: %timeit cumsum_breach_numba(np.asarray(x), 10)
10 loops, best of 3: 69.2 ms per loop
# From this post
In [61]: %timeit cumsum_breach_array_init(x, 10)
10 loops, best of 3: 39.1 ms per loop
Numba : Appending vs. array-initialization
For a closer look at how the array-initialization helps, which seems to be the big difference between the two numba implementations, let's time these on the array data, as the array data creation was in itself heavy on runtime and they both depend on it -
In [62]: x = np.array(x)
In [63]: %timeit cumsum_breach_numba(x, 10)# with appending
10 loops, best of 3: 31.5 ms per loop
In [64]: %timeit cumsum_breach_array_init(x, 10)
1000 loops, best of 3: 1.8 ms per loop
To force the output to have its own memory space, we can make a copy. It won't change things in a big way though -
In [65]: %timeit cumsum_breach_array_init(x, 10).copy()
100 loops, best of 3: 2.67 ms per loop
Loops are not always bad (especially when you need one). Also, there is no tool or algorithm that will make this quicker than O(n). So let's just make a good loop.
Generator Function
def cumsum_breach(x, target):
    total = 0
    for i, y in enumerate(x):
        total += y
        if total >= target:
            yield i
            total = 0
list(cumsum_breach(x, 10))
[4, 9]
Just In Time compiling with Numba
Numba is a third party library that needs to be installed.
Numba can be persnickety about what features are supported. But this works.
Also, as pointed out by Divakar, Numba performs better with arrays
from numba import njit
@njit
def cumsum_breach_numba(x, target):
    total = 0
    result = []
    for i, y in enumerate(x):
        total += y
        if total >= target:
            result.append(i)
            total = 0
    return result
cumsum_breach_numba(x, 10)
Testing the Two
Because I felt like it ¯\_(ツ)_/¯
Setup
np.random.seed([3, 1415])
x0 = np.random.randint(100, size=1_000_000)
x1 = x0.tolist()
Accuracy
i0 = cumsum_breach_numba(x0, 200_000)
i1 = list(cumsum_breach(x1, 200_000))
assert i0 == i1
Time
%timeit cumsum_breach_numba(x0, 200_000)
%timeit list(cumsum_breach(x1, 200_000))
582 µs ± 40.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
64.3 ms ± 5.66 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Numba was on the order of 100 times faster.
For a more true apples to apples test, I convert a list to a Numpy array
%timeit cumsum_breach_numba(np.array(x1), 200_000)
%timeit list(cumsum_breach(x1, 200_000))
43.1 ms ± 202 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
62.8 ms ± 327 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Which brings them to about even.
I have some large numpy arrays of complex numbers I need to perform computations on.
import numpy as np
# Reduced sizes -- real ones are orders of magnitude larger
n, d, l = 50000, 3, 1000
# Two complex matrices and a column vector
x = np.random.rand(n, d) + 1j*np.random.rand(n, d)
s = np.random.rand(n, d) + 1j*np.random.rand(n, d)
v = np.random.rand(l)[:, np.newaxis]
The function is basically x*v*s for each row of x (and s) and then that product is summed across the row. Because the arrays are different sizes, I can't figure out a way to vectorize the computation and it's way too slow to use a for-loop.
My current implementation is this (~3.5 seconds):
h = []
for i in range(len(x)):
    h.append(np.sum(x[i,:]*v*s[i,:], axis=1))
h = np.asarray(h)
I also tried using np.apply_along_axis() with an augmented matrix but it's only slightly faster (~2.6s) and not that readable.
def func(m, v):
    return np.sum(m[:d]*v*m[d:], axis=1)
h = np.apply_along_axis(func, 1, np.hstack([x, s]), v)
What's a much quicker way to compute this result? I can leverage other packages such as dask if that helps.
With broadcasting this should work:
np.sum((x*s)[...,None]*v[:,0], axis=1)
but with your sample dimensions I'm getting a memory error. The 'outer' broadcasted array (n,d,l) shape is too large for my memory.
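A back-of-the-envelope check of that memory error, assuming complex128 data (16 bytes per element, which is what rand + 1j*rand produces) and the sizes from the question:

```python
import numpy as np

n, d, l = 50000, 3, 1000
itemsize = np.dtype(np.complex128).itemsize  # bytes per complex number

# The broadcasted product has shape (n, d, l) before the sum over d.
intermediate_bytes = n * d * l * itemsize
# The final result only has shape (n, l).
result_bytes = n * l * itemsize

print(intermediate_bytes / 1e9, "GB for the (n, d, l) intermediate")
print(result_bytes / 1e9, "GB for the (n, l) result")
```

So the temporary is d times larger than the output, which is why avoiding it matters once the real sizes grow by orders of magnitude.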
I can reduce memory usage by iterating on the smaller d dimension:
res = np.zeros((n,l), dtype=x.dtype)
for i in range(d):
    res += (x[:,i]*s[:,i])[:,None]*v[:,0]
This tests the same as your h, but I wasn't able to complete time tests. Generally iterating on the smaller dimension is faster.
I may repeat things with small dimensions.
This probably can also be expressed as an einsum problem, though it may not help with these dimensions.
In [1]: n, d, l = 5000, 3, 1000
...:
...: # Two complex matrices and a column vector
...: x = np.random.rand(n, d) + 1j*np.random.rand(n, d)
...: s = np.random.rand(n, d) + 1j*np.random.rand(n, d)
...: v = np.random.rand(l)[:, np.newaxis]
In [2]: h = []
...: for i in range(len(x)):
...:     h.append(np.sum(x[i,:]*v*s[i,:], axis=1))
...:
...: h = np.asarray(h)
In [3]: h.shape
Out[3]: (5000, 1000)
In [4]: res = np.zeros((n,l), dtype=x.dtype)
...: for i in range(d):
...:     res += (x[:,i]*s[:,i])[:,None]*v[:,0]
...:
In [5]: res.shape
Out[5]: (5000, 1000)
In [6]: np.allclose(res,h)
Out[6]: True
In [7]: %%timeit
...: h = []
...: for i in range(len(x)):
...:     h.append(np.sum(x[i,:]*v*s[i,:], axis=1))
...: h = np.asarray(h)
...:
...:
490 ms ± 3.36 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [8]: %%timeit
...: res = np.zeros((n,l), dtype=x.dtype)
...: for i in range(d):
...:     res += (x[:,i]*s[:,i])[:,None]*v[:,0]
...:
354 ms ± 1.44 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [9]: np.sum((x*s)[...,None]*v[:,0], axis=1).shape
Out[9]: (5000, 1000)
In [10]: out = np.sum((x*s)[...,None]*v[:,0], axis=1)
In [11]: np.allclose(h,out)
Out[11]: True
In [12]: timeit out = np.sum((x*s)[...,None]*v[:,0], axis=1)
310 ms ± 964 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
Some time savings, but not big.
And the einsum version:
In [13]: np.einsum('ij,ij,k->ik',x,s,v[:,0]).shape
Out[13]: (5000, 1000)
In [14]: np.allclose(np.einsum('ij,ij,k->ik',x,s,v[:,0]),h)
Out[14]: True
In [15]: timeit np.einsum('ij,ij,k->ik',x,s,v[:,0]).shape
167 ms ± 137 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Good time savings. But I don't know how it will scale.
But the einsum made me realize that we can sum on d dimension earlier, before multiplying by v - and gain a lot in time and memory usage:
In [16]: np.allclose(np.sum(x*s, axis=1)[:,None]*v[:,0],h)
Out[16]: True
In [17]: timeit np.sum(x*s, axis=1)[:,None]*v[:,0]
68.4 ms ± 1.58 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
@cs95 got there first!
As per @PaulPanzer's comment, the optimize flag helps. It's probably making the same deduction - that we can sum on j early:
In [18]: timeit np.einsum('ij,ij,k->ik',x,s,v[:,0],optimize=True).shape
91.6 ms ± 991 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
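For reference, the three formulations discussed in this answer can be checked against the original loop in one self-contained script. The sizes below are deliberately tiny and chosen only for illustration; nothing here says anything about timings:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, l = 200, 3, 50  # small illustrative sizes, not the real ones

x = rng.random((n, d)) + 1j * rng.random((n, d))
s = rng.random((n, d)) + 1j * rng.random((n, d))
v = rng.random(l)[:, np.newaxis]

# Row-by-row loop from the question.
h_loop = np.asarray([np.sum(x[i, :] * v * s[i, :], axis=1) for i in range(n)])

# Full broadcast: builds an (n, d, l) temporary, then sums over d.
h_broadcast = np.sum((x * s)[..., None] * v[:, 0], axis=1)

# einsum over the same indices.
h_einsum = np.einsum('ij,ij,k->ik', x, s, v[:, 0])

# Sum over d first, then take the outer product with v: no large temporary.
h_early = np.sum(x * s, axis=1)[:, None] * v[:, 0]

print(np.allclose(h_loop, h_broadcast),
      np.allclose(h_loop, h_einsum),
      np.allclose(h_loop, h_early))
```

The last form works because v does not depend on the summed index, so it can be factored out of the sum over d.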
I have 1d array, I need to remove all trailing blocks of 8 zeros.
[0,1,1,0,1,0,0,0, 0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0]
->
[0,1,1,0,1,0,0,0]
a.shape[0] % 8 == 0 always, so no worries about that.
Is there a better way to do it?
import numpy as np
P = 8
arr1 = np.random.randint(2,size=np.random.randint(5,10) * P)
arr2 = np.random.randint(1,size=np.random.randint(5,10) * P)
arr = np.concatenate((arr1, arr2))
indexes = []
arr = np.flip(arr).reshape(arr.shape[0] // P, P)
for i, f in enumerate(arr):
if (f == 0).all():
indexes.append(i)
else:
break
arr = np.delete(arr, indexes, axis=0)
arr = np.flip(arr.reshape(arr.shape[0] * P))
You can do it without allocating more space by using views and np.argmax to get the last nonzero element:
index = arr.size - np.argmax(arr[::-1])
Rounding up to the nearest multiple of eight is easy:
index = int(np.ceil(index / 8)) * 8
Now chop off the rest:
arr = arr[:index]
Or as a one-liner:
arr = arr[:int(np.ceil((arr.size - np.argmax(arr[::-1])) / 8)) * 8]
This version is O(n) in time and O(1) in space because it reuses the same buffers for everything (including the output).
This has the additional advantage that it will work correctly even if there are no trailing zeros. Using argmax does rely on all the nonzero elements being the same, though. If that is not the case, you will need to compute a mask first, e.g. with arr.astype(bool).
If you want to use your original approach, you could vectorize that too, although there will be a bit more overhead:
view = arr.reshape(-1, 8)
mask = view.any(axis = 1)
index = view.shape[0] - np.argmax(mask[::-1])
arr = arr[:index * 8]
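Here is the block-wise version wrapped into a function, as a sketch. The function name, the block parameter and the explicit all-zero guard are additions for illustration; the guard assumes that an array consisting only of zero blocks should be trimmed to empty, as the loop in the question does:

```python
import numpy as np

def trim_trailing_zero_blocks(arr, block=8):
    """Drop trailing all-zero blocks of length `block` (illustrative sketch)."""
    arr = np.asarray(arr)
    # One boolean per block: True if the block contains any nonzero element.
    rows_nonzero = arr.reshape(-1, block).any(axis=1)
    if not rows_nonzero.any():
        # argmax would return 0 here and keep everything, so handle it explicitly.
        return arr[:0]
    n_keep = rows_nonzero.size - np.argmax(rows_nonzero[::-1])
    return arr[:n_keep * block]

example = np.array([0, 1, 1, 0, 1, 0, 0, 0] + [0] * 16)
no_trailing = np.array([0, 0, 0, 0, 0, 0, 0, 1])
all_zero = np.zeros(16, dtype=int)
mixed_values = np.array([0, 3, 0, 0, 0, 0, 7, 0] + [0] * 8)

print(trim_trailing_zero_blocks(example))
print(trim_trailing_zero_blocks(no_trailing))
print(trim_trailing_zero_blocks(all_zero))
print(trim_trailing_zero_blocks(mixed_values))
```

Because any() reduces each block to a boolean, this variant does not depend on the nonzero elements all having the same value.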
There is a numpy function that does almost what you want: np.trim_zeros. We can use that:
import numpy as np
def trim_mod(a, m=8):
    t = np.trim_zeros(a, 'b')
    return a[:len(a)-(len(a)-len(t))//m*m]

def test(a, t, m=8):
    assert (len(a) - len(t)) % m == 0
    assert len(t) < m or np.any(t[-m:])
    assert not np.any(a[len(t):])

for _ in range(1000):
    a = (np.random.random(np.random.randint(10, 100000))<0.002).astype(int)
    m = np.random.randint(4, 20)
    t = trim_mod(a, m)
    test(a, t, m)
print("Looks correct")
Prints:
Looks correct
It seems to scale linearly in the number of trailing zeros:
But feels rather slow in absolute terms (units are ms per trial), so maybe np.trim_zeros is just a python loop.
Code for the picture:
from timeit import timeit
A = (np.random.random(1000000)<0.02).astype(int)
m = 8
T = []
for last in range(1, 1000, 9):
    A[-last:] = 0
    A[-last] = 1
    T.append(timeit(lambda: trim_mod(A, m), number=100)*10)
import pylab
pylab.plot(range(1, 1000, 9), T)
pylab.show()
A low level approach :
import numba
@numba.njit
def trim8(a):
    n=a.size-1
    while n>=0 and a[n]==0 : n-=1
    c= (n//8+1)*8
    return a[:c]
Some tests :
In [194]: A[-1]=1 # best case
In [196]: %timeit trim_mod(A,8)
5.7 µs ± 323 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [197]: %timeit trim8(A)
714 ns ± 33.7 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [198]: %timeit A[:(A.size - np.argmax(A[::-1]) // 8) * 8]
4.83 ms ± 479 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [202]: A[:]=0 #worst case
In [203]: %timeit trim_mod(A,8)
2.5 s ± 49.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [204]: %timeit trim8(A)
1.14 ms ± 71.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [205]: %timeit A[:(A.size - np.argmax(A[::-1]) // 8) * 8]
5.5 ms ± 950 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
It has a short circuit mechanism like trim_zeros, but is much faster.
I have an array of angles that I want to group into arrays with a max difference of 2 deg between them.
eg: input:
angles = np.array([[1],[2],[3],[4],[4],[5],[10]])
output
('group', 1)
[[1]
[2]
[3]]
('group', 2)
[[4]
[4]
[5]]
('group', 3)
[[10]]
numpy.diff gets the difference of the next element from the current, I need the difference of the next elements from the first of the group
itertools.groupby groups the elements not within a definable range
numpy.digitize groups the elements by a predefined range, not by the range specified by the elements of the array.
(Maybe I can use this by getting the unique values of angles, grouping them by their difference and using that as the predefined range?)
My approach which works but seems extremely inefficient and non-pythonic:
(I am using expand_dims and vstack because I'm working with 1d arrays (not just angles), but I've reduced them to simplify it for this question)
angles = np.array([[1],[2],[3],[4],[4],[5],[10]])
groupedangles = []
idx1 = 0
diffAngleMax = 2
while(idx1 < len(angles)):
    angleA = angles[idx1]
    group = np.expand_dims(angleA, axis=0)
    for idx2 in xrange(idx1+1,len(angles)):
        angleB = angles[idx2]
        diffAngle = angleB - angleA
        if abs(diffAngle) <= diffAngleMax:
            group = np.vstack((group,angleB))
        else:
            idx1 = idx2
            groupedangles.append(group)
            break
    if idx2 == len(angles) - 1:
        if idx1 == idx2:
            angleA = angles[idx1]
            group = np.expand_dims(angleA, axis=0)
        groupedangles.append(group)
        break
for idx, x in enumerate(groupedangles):
    print('group', idx+1)
    print(x)
What is a better and faster way to do this?
Update: here is some Cython treatment
In [1]: import cython
In [2]: %load_ext Cython
In [3]: %%cython
...: import numpy as np
...: cimport numpy as np
...: def cluster(np.ndarray array, np.float64_t maxdiff):
...:     cdef np.ndarray[np.float64_t, ndim=1] flat = np.sort(array.flatten())
...:     cdef list breakpoints = []
...:     cdef np.float64_t seed = flat[0]
...:     cdef np.int64_t int = 0
...:     for i in range(0, len(flat)):
...:         if (flat[i] - seed) > maxdiff:
...:             breakpoints.append(i)
...:             seed = flat[i]
...:     return np.split(array, breakpoints)
...:
Sparsity test
In [4]: angles = np.random.choice(np.arange(5000), 500).astype(np.float64)[:, None]
In [5]: %timeit cluster(angles, 2)
422 µs ± 12.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Duplication test
In [6]: angles = np.random.choice(np.arange(500), 1500).astype(np.float64)[:, None]
In [7]: %timeit cluster(angles, 2)
263 µs ± 14.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Both tests show a significant improvement. The algorithm now sorts the input and makes a single run over the sorted array, which makes it stable O(N*log(N)).
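For readers without a Cython toolchain, the sort-then-single-pass idea can be sketched in plain Python and NumPy. The function name cluster_sorted is made up here, and note one deliberate difference: this sketch splits the sorted values, so it also works when the input is not already ordered:

```python
import numpy as np

def cluster_sorted(values, maxdiff):
    """Greedy grouping: start a new group once a value exceeds seed + maxdiff."""
    flat = np.sort(np.asarray(values, dtype=float).ravel())
    if flat.size == 0:
        return []
    breakpoints = []
    seed = flat[0]  # first (smallest) element of the current group
    for i, value in enumerate(flat):
        if value - seed > maxdiff:
            breakpoints.append(i)
            seed = value
    # Split the sorted array at the positions where new groups begin.
    return np.split(flat, breakpoints)

groups = cluster_sorted([[1], [2], [3], [4], [4], [5], [10]], 2)
shuffled_groups = cluster_sorted([10, 4, 1, 5, 3, 4, 2], 2)
for idx, group in enumerate(groups, start=1):
    print('group', idx, group)
```

The sort dominates the cost at O(N log N); the pass that follows it is linear.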
Pre-update
This is a variation on seed clustering. It requires no sorting
def cluster(array, maxdiff):
    tmp = array.copy()
    groups = []
    while len(tmp):
        # select seed
        seed = tmp.min()
        mask = (tmp - seed) <= maxdiff
        groups.append(tmp[mask, None])
        tmp = tmp[~mask]
    return groups
Example:
In [27]: cluster(angles, 2)
Out[27]:
[array([[1],
[2],
[3]]), array([[4],
[4],
[5]]), array([[10]])]
A benchmark for 500, 1000 and 1500 angles:
In [4]: angles = np.random.choice(np.arange(500), 500)[:, None]
In [5]: %timeit cluster(angles, 2)
1.25 ms ± 60.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [6]: angles = np.random.choice(np.arange(500), 1000)[:, None]
In [7]: %timeit cluster(angles, 2)
1.46 ms ± 37 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [8]: angles = np.random.choice(np.arange(500), 1500)[:, None]
In [9]: %timeit cluster(angles, 2)
1.99 ms ± 72.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
While the algorithm is O(N^2) in the worst case and O(N) in the best case, the benchmarks above clearly show near-linear time growth, because the actual runtime depends on the structure of your data: sparsity and the duplication rate. In most real cases you won't hit the worst case.
Some sparsity benchmarks
In [4]: angles = np.random.choice(np.arange(500), 500)[:, None]
In [5]: %timeit cluster(angles, 2)
1.06 ms ± 27.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [6]: angles = np.random.choice(np.arange(1000), 500)[:, None]
In [7]: %timeit cluster(angles, 2)
1.79 ms ± 117 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [8]: angles = np.random.choice(np.arange(1500), 500)[:, None]
In [9]: %timeit cluster(angles, 2)
2.16 ms ± 90.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [10]: angles = np.random.choice(np.arange(5000), 500)[:, None]
In [11]: %timeit cluster(angles, 2)
3.21 ms ± 139 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Here is a sorting based solution. One could try and be a bit smarter and use bincount and argpartition to avoid the sorting, but at N <= 500 it's not worth the trouble.
import numpy as np
def flexibin(a):
    idx0 = np.argsort(a)
    as_ = a[idx0]
    A = np.r_[as_, as_+2]
    idx = np.argsort(A)
    uinv = np.flatnonzero(idx >= len(a))
    linv = np.empty_like(idx)
    linv[np.flatnonzero(idx < len(a))] = np.arange(len(a))
    bins = [0]
    curr = 0
    while True:
        for j in range(uinv[idx[curr]], len(idx)):
            if idx[j] < len(a) and A[idx[j]] > A[idx[curr]] + 2:
                bins.append(j)
                curr = j
                break
        else:
            return np.split(idx0, linv[bins[1:]])
a = 180 * np.random.random((500,))
bins = flexibin(a)
mn, mx = zip(*((np.min(a[b]), np.max(a[b])) for b in bins))
assert np.all(np.diff(mn) > 2)
assert np.all(np.subtract(mx, mn) <= 2)
print('all ok')
I have a three-dimensional array like
A=np.array([[[1,1],
[1,0]],
[[1,2],
[1,0]],
[[1,0],
[0,0]]])
Now I would like to obtain an array that has a nonzero value in a given position if only a unique nonzero value (or zero) occurs in that position. It should have zero if only zeros or more than one nonzero value occur in that position. For the example above, I would like
[[1,0],
[1,0]]
since
in A[:,0,0] there are only 1s
in A[:,0,1] there are 0, 1 and 2, so more than one nonzero value
in A[:,1,0] there are 0 and 1, so 1 is retained
in A[:,1,1] there are only 0s
I can find how many nonzero elements there are with np.count_nonzero(A, axis=0), but I would like to keep 1s or 2s even if there are several of them. I looked at np.unique but it doesn't seem to support what I'd like to do.
Ideally, I'd like a function like np.count_unique(A, axis=0) which would return an array in the original shape, e.g. [[1, 3],[2, 1]], so I could check whether 3 or more occur and then ignore that position.
All I could come up with was a list comprehension iterating over every position to compute the counts that I'd like to obtain:
[[len(np.unique(A[:, i, j])) for j in range(A.shape[2])] for i in range(A.shape[1])]
Any other ideas?
You can use np.diff to stay at numpy level for the second task.
def diffcount(A):
    B=A.copy()
    B.sort(axis=0)
    C=np.diff(B,axis=0)>0
    D=C.sum(axis=0)+1
    return D
# [[1 3]
# [2 1]]
It seems to be a little faster on big arrays:
In [62]: A=np.random.randint(0,100,(100,100,100))
In [63]: %timeit diffcount(A)
46.8 ms ± 769 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [64]: timeit [[len(np.unique(A[:, i, j])) for j in range(A.shape[2])]\
for i in range(A.shape[1])]
149 ms ± 700 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Finally, counting unique values is a simpler problem than sorting, so a factor of ln(A.shape[0]) could in principle be saved.
One way to save this factor is to use the set mechanism:
In [81]: %timeit np.apply_along_axis(lambda a:len(set(a)),0,A)
183 ms ± 1.17 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Unfortunately, this is not faster.
Another way is to do it by hand :
def countunique(A,Amax):
    res=np.empty(A.shape[1:],A.dtype)
    c=np.empty(Amax+1,A.dtype)
    for i in range(A.shape[1]):
        for j in range(A.shape[2]):
            T=A[:,i,j]
            for k in range(c.size): c[k]=0
            for x in T:
                c[x]=1
            res[i,j]= c.sum()
    return res
At python level:
In [70]: %timeit countunique(A,100)
429 ms ± 18.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Which is not so bad for a pure python approach. Then just move this code to a lower level with numba:
import numba
countunique2=numba.jit(countunique)
In [71]: %timeit countunique2(A,100)
3.63 ms ± 70.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Which will be difficult to improve a lot.
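The counts above cover the second task. For the first task in the question (keep the value where exactly one distinct nonzero value occurs along axis 0, otherwise zero), one possible sketch compares the smallest and largest nonzero value at each position. The function name unique_nonzero_or_zero is hypothetical and this is not taken from any of the answers:

```python
import numpy as np

def unique_nonzero_or_zero(A):
    """Keep a position's nonzero value only if it is the sole distinct one."""
    A = np.asarray(A)
    nonzero = A != 0
    # Replace zeros by the global max/min so they cannot win the min/max below.
    lo = np.where(nonzero, A, A.max()).min(axis=0)  # smallest nonzero value
    hi = np.where(nonzero, A, A.min()).max(axis=0)  # largest nonzero value
    keep = nonzero.any(axis=0) & (lo == hi)
    return np.where(keep, lo, 0)

A = np.array([[[1, 1], [1, 0]],
              [[1, 2], [1, 0]],
              [[1, 0], [0, 0]]])
result = unique_nonzero_or_zero(A)
print(result)
```

If the smallest and largest nonzero values agree, there is only one distinct nonzero value, so no sorting or per-position unique call is needed.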
One approach would be to use A as first axis indices for setting a boolean array of the same lengths along the other two axes and then simply counting the non-zeros along the first axis of it. Two variants would be possible - One keeping it as 3D and another would be to reshape into 2D for some performance benefit as indexing into 2D would be faster. Thus, the two implementations would be -
def nunique_axis0_maskcount_app1(A):
    m,n = A.shape[1:]
    mask = np.zeros((A.max()+1,m,n),dtype=bool)
    mask[A,np.arange(m)[:,None],np.arange(n)] = 1
    return mask.sum(0)

def nunique_axis0_maskcount_app2(A):
    m,n = A.shape[1:]
    A.shape = (-1,m*n)
    maxn = A.max()+1
    N = A.shape[1]
    mask = np.zeros((maxn,N),dtype=bool)
    mask[A,np.arange(N)] = 1
    A.shape = (-1,m,n)
    return mask.sum(0).reshape(m,n)
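As a quick sanity check of the mask idea, the sketch below applies it to the array from the question and cross-checks it against the list comprehension on random data. It assumes non-negative integer values (they are used as indices); the names nunique_mask and nunique_reference are made up for this illustration:

```python
import numpy as np

def nunique_mask(A):
    """Count distinct values along axis 0 by marking them in a boolean array."""
    m, n = A.shape[1:]
    seen = np.zeros((A.max() + 1, m, n), dtype=bool)
    # Value A[k, i, j] marks seen[A[k, i, j], i, j]; repeats hit the same cell.
    seen[A, np.arange(m)[:, None], np.arange(n)] = True
    return seen.sum(axis=0)

def nunique_reference(A):
    """The list comprehension from the question, used as a reference."""
    return np.array([[len(np.unique(A[:, i, j])) for j in range(A.shape[2])]
                     for i in range(A.shape[1])])

A = np.array([[[1, 1], [1, 0]],
              [[1, 2], [1, 0]],
              [[1, 0], [0, 0]]])
counts = nunique_mask(A)
print(counts)

rng = np.random.default_rng(0)
B = rng.integers(0, 20, size=(30, 4, 5))
print(np.array_equal(nunique_mask(B), nunique_reference(B)))
```

The boolean array has A.max()+1 layers, so memory grows with the largest value in the input rather than with the number of distinct values.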
Runtime test -
In [154]: A = np.random.randint(0,100,(100,100,100))
# @B. M.'s soln
In [155]: %timeit f(A)
10 loops, best of 3: 28.3 ms per loop
# @B. M.'s soln using slicing : (B[1:] != B[:-1]).sum(0)+1
In [156]: %timeit f2(A)
10 loops, best of 3: 26.2 ms per loop
In [157]: %timeit nunique_axis0_maskcount_app1(A)
100 loops, best of 3: 12 ms per loop
In [158]: %timeit nunique_axis0_maskcount_app2(A)
100 loops, best of 3: 9.14 ms per loop
Numba method
Using the same strategy as used for nunique_axis0_maskcount_app2 with directly getting the counts at C-level with numba, we would have -
from numba import njit
@njit
def nunique_loopy_func(mask, N, A, p, count):
    for j in range(N):
        mask[:] = True
        mask[A[0,j]] = False
        c = 1
        for i in range(1,p):
            if mask[A[i,j]]:
                c += 1
            mask[A[i,j]] = False
        count[j] = c
    return count

def nunique_axis0_numba(A):
    p,m,n = A.shape
    A.shape = (-1,m*n)
    maxn = A.max()+1
    N = A.shape[1]
    mask = np.empty(maxn,dtype=bool)
    count = np.empty(N,dtype=int)
    out = nunique_loopy_func(mask, N, A, p, count).reshape(m,n)
    A.shape = (-1,m,n)
    return out
Runtime test -
In [328]: np.random.seed(0)
In [329]: A = np.random.randint(0,100,(100,100,100))
In [330]: %timeit nunique_axis0_maskcount_app2(A)
100 loops, best of 3: 11.1 ms per loop
# @B.M.'s numba soln
In [331]: %timeit countunique2(A,A.max()+1)
100 loops, best of 3: 3.43 ms per loop
# Numba soln posted in this post
In [332]: %timeit nunique_axis0_numba(A)
100 loops, best of 3: 2.76 ms per loop