Say I have an array of distances x=[1,2,1,3,3,2,1,5,1,1].
I want to get the indices from x where cumsum reaches 10, in this case, idx=[4,9].
So the cumsum restarts after the condition is met.
I can do it with a loop, but loops are slow for large arrays and I was wondering if I could do it in a vectorized way.
A fun method
sumlm = np.frompyfunc(lambda a,b:a+b if a < 10 else b,2,1)
newx=sumlm.accumulate(x, dtype=object)  # np.object was removed in NumPy 1.24; the builtin object works everywhere
newx
array([1, 3, 4, 7, 10, 2, 3, 8, 9, 10], dtype=object)
np.nonzero(newx==10)
(array([4, 9]),)
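The accumulate trick above hard-codes the threshold 10 and tests for equality, so it only reports resets where the running sum lands exactly on 10. Here is a minimal sketch of a more general form, assuming the goal is "reaches or exceeds the target". The name cumsum_breach_frompyfunc and its target parameter are illustrative and not part of the original answer:

```python
import numpy as np

def cumsum_breach_frompyfunc(x, target):
    # Hypothetical helper: running sum that restarts once it has reached target.
    step = np.frompyfunc(lambda a, b: a + b if a < target else b, 2, 1)
    # frompyfunc ufuncs work on Python objects, so accumulate over an object array.
    running = step.accumulate(np.asarray(x, dtype=object))
    # Compare with >= so that overshooting the target also counts as a breach.
    return np.flatnonzero(running >= target)

print(cumsum_breach_frompyfunc([1, 2, 1, 3, 3, 2, 1, 5, 1, 1], 10))
```

This still calls a Python function once per element, so it is more of a compact spelling of the loop than a real speed-up.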
Here's one with numba and array-initialization -
import numpy as np
from numba import njit
@njit
def cumsum_breach_numba2(x, target, result):
    total = 0
    iterID = 0
    for i,x_i in enumerate(x):
        total += x_i
        if total >= target:
            result[iterID] = i
            iterID += 1
            total = 0
    return iterID
def cumsum_breach_array_init(x, target):
    x = np.asarray(x)
    # Pre-allocate the worst case (one breach per element), then trim
    result = np.empty(len(x),dtype=np.uint64)
    idx = cumsum_breach_numba2(x, target, result)
    return result[:idx]
Timings
Including @piRSquared's solutions and using the benchmarking setup from the same post -
In [58]: np.random.seed([3, 1415])
...: x = np.random.randint(100, size=1000000).tolist()
# @piRSquared soln1
In [59]: %timeit list(cumsum_breach(x, 10))
10 loops, best of 3: 73.2 ms per loop
# @piRSquared soln2
In [60]: %timeit cumsum_breach_numba(np.asarray(x), 10)
10 loops, best of 3: 69.2 ms per loop
# From this post
In [61]: %timeit cumsum_breach_array_init(x, 10)
10 loops, best of 3: 39.1 ms per loop
Numba : Appending vs. array-initialization
For a closer look at how the array-initialization helps, which seems to be the big difference between the two numba implementations, let's time these on the array data, as the array creation was in itself heavy on runtime and they both depend on it -
In [62]: x = np.array(x)
In [63]: %timeit cumsum_breach_numba(x, 10)# with appending
10 loops, best of 3: 31.5 ms per loop
In [64]: %timeit cumsum_breach_array_init(x, 10)
1000 loops, best of 3: 1.8 ms per loop
To force the output to have its own memory space, we can make a copy. It won't change things in a big way though -
In [65]: %timeit cumsum_breach_array_init(x, 10).copy()
100 loops, best of 3: 2.67 ms per loop
Loops are not always bad (especially when you need one). Also, there is no tool or algorithm that will make this quicker than O(n). So let's just make a good loop.
Generator Function
def cumsum_breach(x, target):
    total = 0
    for i, y in enumerate(x):
        total += y
        if total >= target:
            yield i
            total = 0
list(cumsum_breach(x, 10))
[4, 9]
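When every value is non-negative, the running sum never decreases, which allows a different trade-off from the per-element loop: compute one cumulative sum and jump from breach to breach with a binary search. The following is a sketch under that assumption (and a positive target). The function name is hypothetical, and this is not one of the posted solutions:

```python
import numpy as np

def cumsum_breach_searchsorted(x, target):
    # Assumes all elements of x are non-negative, so csum is sorted.
    if target <= 0:
        raise ValueError("target must be positive")
    csum = np.cumsum(np.asarray(x))
    hits = []
    base = 0  # value of the overall running sum at the last reset
    while True:
        # First position where the sum since the last reset reaches target.
        i = int(np.searchsorted(csum, base + target, side="left"))
        if i >= csum.size:
            break
        hits.append(i)
        base = csum[i]
    return hits

print(cumsum_breach_searchsorted([1, 2, 1, 3, 3, 2, 1, 5, 1, 1], 10))
```

The cost is one O(n) cumulative sum plus one binary search per breach, so it is likely to help most when breaches are rare relative to the array length.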
Just In Time compiling with Numba
Numba is a third party library that needs to be installed.
Numba can be persnickety about what features are supported. But this works.
Also, as pointed out by Divakar, Numba performs better with arrays
from numba import njit
@njit
def cumsum_breach_numba(x, target):
    total = 0
    result = []
    for i, y in enumerate(x):
        total += y
        if total >= target:
            result.append(i)
            total = 0
    return result
cumsum_breach_numba(x, 10)
Testing the Two
Because I felt like it ¯\_(ツ)_/¯
Setup
np.random.seed([3, 1415])
x0 = np.random.randint(100, size=1_000_000)
x1 = x0.tolist()
Accuracy
i0 = cumsum_breach_numba(x0, 200_000)
i1 = list(cumsum_breach(x1, 200_000))
assert i0 == i1
Time
%timeit cumsum_breach_numba(x0, 200_000)
%timeit list(cumsum_breach(x1, 200_000))
582 µs ± 40.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
64.3 ms ± 5.66 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Numba was on the order of 100 times faster.
For a fairer apples-to-apples test, I convert the list to a NumPy array inside the timed statement
%timeit cumsum_breach_numba(np.array(x1), 200_000)
%timeit list(cumsum_breach(x1, 200_000))
43.1 ms ± 202 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
62.8 ms ± 327 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Which brings them to about even.
Related
I need to find a fast way to get the indices of neighbors that have the same value as the current element
For example:
arr = [0, 0, 0, 1, 0, 1, 1, 1, 1, 0]
indicies = func(arr, 6)
# [5, 6, 7, 8]
The element at index 6 has value 1, so I need the full slice containing index 6 and all its neighbors with the same value
It is like part of a flood-fill algorithm. Is there a way to do it fast in numpy?
Is there a way to do it for a 2D array?
EDIT
Let's see some performance tests:
import numpy as np
import random
np.random.seed(1488)
arr = np.zeros(5000)
for x in np.random.randint(0, 5000, size = 100):
arr[x:x+50] = 1
I will compare the function from @Ehsan:
def func_Ehsan(arr, idx):
    change = np.insert(np.flatnonzero(np.diff(arr)), 0, -1)
    loc = np.searchsorted(change, idx)
    start = change[max(loc-1,0)]+1 if loc<len(change) else change[loc-1]
    end = change[min(loc, len(change)-1)]
    return (start, end)
change = np.insert(np.flatnonzero(np.diff(arr)), 0, -1)
def func_Ehsan_same_arr(arr, idx):
    loc = np.searchsorted(change, idx)
    start = change[max(loc-1,0)]+1 if loc<len(change) else change[loc-1]
    end = change[min(loc, len(change)-1)]
    return (start, end)
with my pure python function:
def my_func(arr, index):
    val = arr[index]
    size = arr.size
    end = index + 1
    while end < size and arr[end] == val:
        end += 1
    start = index - 1
    while start > -1 and arr[start] == val:
        start -= 1
    return start + 1, end
Take a look:
np.random.seed(1488)
%timeit my_func(arr, np.random.randint(0, 5000))
# 42.4 µs ± 700 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
np.random.seed(1488)
%timeit func_Ehsan(arr, np.random.randint(0, 5000))
# 115 µs ± 1.92 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
np.random.seed(1488)
%timeit func_Ehsan_same_arr(arr, np.random.randint(0, 5000))
# 18.1 µs ± 953 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Is there a way to use the same logic in numpy, without a C module, Cython, Numba or Python loops, and make it faster?
I don't know how to solve this problem with numpy, but if you use pandas, you might get the result that you want with this:
import pandas as pd
df=pd.DataFrame(arr,columns=["data"])
df["new"] = df["data"].diff().ne(0).cumsum()
[{i[0]:j.index.tolist()} for i,j in df.groupby(["data","new"],sort=False)]
Output:
[{0: [0, 1, 2]}, {1: [3]}, {0: [4]}, {1: [5, 6, 7, 8]}, {0: [9]}]
The main problem is that Numpy is not currently designed to solve this problem efficiently. A "find first index of value" function, or a similar lazy function call, would be required to solve it efficiently. However, while this feature has been discussed for about 10 years, it is still not available in Numpy. See this post for more information. I do not expect any change soon. Until then, the best solution on relatively big arrays appears to be an iterative one using relatively slow pure-Python loops and slow Numpy calls/accesses.
Besides this, one way to speed up the computation is to work on small chunks. Here is an implementation:
def my_func_opt1(arr, index):
    val = arr[index]
    size = arr.size
    chunkSize = 128
    # Scan forward, one chunk at a time, for the first differing value
    end = index + 1
    while end < size:
        chunk = arr[end:end+chunkSize]
        locations = (chunk != val).nonzero()[0]
        if len(locations) > 0:
            foundCount = locations[0]
            end += foundCount
            break
        else:
            end += len(chunk)
    # Scan backward for the last differing value before index
    start = index
    while start > 0:
        chunk = arr[max(start-chunkSize,0):start]
        locations = (chunk != val).nonzero()[0]
        if len(locations) > 0:
            foundCount = locations[-1]
            # len(chunk), not chunkSize: the chunk is shorter near the start of the array
            start -= len(chunk) - 1 - foundCount
            break
        else:
            start -= len(chunk)
    return start, end
Here are performance results on my machine:
func_Ehsan: 53.8 µs ± 449 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
my_func: 17.5 µs ± 97.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
my_func_opt1: 7.31 µs ± 52.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Note that the results are a bit biased, since np.random.randint itself takes 2.01 µs. Without this Numpy call included in the benchmark, here are the results:
func_Ehsan: 51.8 µs
my_func: 15.5 µs
my_func_opt1: 5.31 µs
As a result, my_func_opt1 is about 3 times faster than my_func. It is very difficult to write faster code, as any Numpy call introduces a relatively big overhead of 0.5-1.0 µs on my machine whatever the array size (e.g. even for empty arrays) due to internal checks.
The following is useful for people who are interested in speeding up the operation and who can use Numba.
The simplest solution consists of using Numba's JIT, and more specifically just adding a decorator. This solution is also very fast.
import numba as nb
@nb.njit('UniTuple(i8,2)(f8[::1], i8)')
def my_func_opt2(arr, index):
    val = arr[index]
    size = arr.size
    end = index + 1
    while end < size and arr[end] == val:
        end += 1
    start = index - 1
    while start > -1 and arr[start] == val:
        start -= 1
    return start + 1, end
On my machine my_func_opt2 takes only 0.63 µs (with the random call excluded). As a result, my_func_opt2 is about 25 times faster than my_func. I highly doubt there is a faster solution, since any Numpy call on my machine takes at least 0.5 µs and an empty Numba function takes 0.25 µs to call.
Besides this, please note that arr contains double-precision values, which are pretty expensive to compute with. It should be faster to use integers if you can. Also, please note that an array of 0 and 1 values can be stored as int8, which takes 8 times less memory and is often significantly faster to process (due to CPU caches, the smaller the array the faster the computation). You can specify the type during the creation of the array: np.zeros(5000, dtype=np.int8)
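As a quick check of the memory claim, here is a small sketch comparing the two dtypes. The variable names are made up for the illustration:

```python
import numpy as np

arr_f8 = np.zeros(5000)                  # default dtype is float64: 8 bytes per element
arr_i1 = np.zeros(5000, dtype=np.int8)   # 1 byte per element

print(arr_f8.dtype, arr_f8.nbytes)
print(arr_i1.dtype, arr_i1.nbytes)
```

If the explicit Numba signature of my_func_opt2 is kept, its array type would presumably need to be changed to match the new dtype.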
Here is a numpy solution. I think you can improve it by a little more work:
def func(arr, idx):
    change = np.insert(np.flatnonzero(np.diff(arr)), 0, -1)
    loc = np.searchsorted(change, idx)
    start = change[max(loc-1,0)]+1 if loc<len(change) else change[loc-1]
    end = change[min(loc, len(change)-1)]
    return np.arange(start, end)
sample output:
indices = func(arr, 6)
#[5 6 7 8]
This would be especially fast if you have few changes in your arr (relative to the array size) and you are doing several of those index searches in the same array. Otherwise, faster solutions come to mind.
Performance comparison:
If you are running this on the same array multiple times, simply move the first line out of the function, like this, to avoid recomputing it.
change = np.insert(np.flatnonzero(np.diff(arr)), 0, -1)
def func(arr, idx):
    loc = np.searchsorted(change, idx)
    start = change[max(loc-1,0)]+1 if loc<len(change) else change[loc-1]
    end = change[min(loc, len(change)-1)]
    return np.arange(start, end)
For same input as OP's example:
np.random.seed(1488)
%timeit func_OP(arr, np.random.randint(0, 5000))
#23.5 µs ± 631 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
np.random.seed(1488)
%timeit func_Ehsan(arr, np.random.randint(0, 5000))
#7.89 µs ± 113 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
np.random.seed(1488)
%timeit func_Jérôme_opt1(arr, np.random.randint(0, 5000))
#12.1 µs ± 757 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit func_Jérôme_opt2(arr, np.random.randint(0, 5000))
#3.45 µs ± 179 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
With func_Ehsan being the fastest (excluding Numba). Please keep in mind that, again, the performance of these functions varies significantly with the number of changes in the array, the array size and how many times the function is called on the same array. And of course Numba is faster than all of them (roughly 2x faster than func_Ehsan). And if you are going to run it many times, build the groups in O(n) and use a hash map to look up the indices in O(1).
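The last suggestion (build the groups once, then answer each query in O(1)) can be sketched as follows. This is only one possible reading of it: it uses two plain lookup arrays instead of a hash map, and build_run_lookup is a hypothetical name:

```python
import numpy as np

def build_run_lookup(arr):
    # One O(n) pass: for every index, the start and the exclusive end of its run.
    arr = np.asarray(arr)
    is_start = np.r_[True, arr[1:] != arr[:-1]]   # True where a new run begins
    starts = np.flatnonzero(is_start)
    ends = np.r_[starts[1:], arr.size]            # each run ends where the next one starts
    lengths = ends - starts
    return np.repeat(starts, lengths), np.repeat(ends, lengths)

arr = np.array([0, 0, 0, 1, 0, 1, 1, 1, 1, 0])
run_start, run_end = build_run_lookup(arr)

idx = 6
print(np.arange(run_start[idx], run_end[idx]))    # indices of the run containing idx
```

After the one-off preprocessing, each query is just two array reads, which only pays off when many lookups are made on the same array.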
Using the following code, I am trying to convert a list of numbers into binary strings, but I am getting an error:
import numpy as np
lis=np.array([1,2,3,4,5,6,7,8,9])
a=np.binary_repr(lis,width=32)
The error after running the program is:
Traceback (most recent call last):
File "", line 4, in
a=np.binary_repr(lis,width=32)
File "C:\Users.......",
in binary_repr
if num == 0:
ValueError: The truth value of an array with more than one element is
ambiguous. Use a.any() or a.all()
Is there any way to fix this?
You can use np.vectorize to overcome this issue.
>>> lis=np.array([1,2,3,4,5,6,7,8,9])
>>> a=np.binary_repr(lis,width=32)
>>> binary_repr_vec = np.vectorize(np.binary_repr)
>>> binary_repr_vec(lis, width=32)
array(['00000000000000000000000000000001',
'00000000000000000000000000000010',
'00000000000000000000000000000011',
'00000000000000000000000000000100',
'00000000000000000000000000000101',
'00000000000000000000000000000110',
'00000000000000000000000000000111',
'00000000000000000000000000001000',
'00000000000000000000000000001001'], dtype='<U32')
Approach #1
Here's a vectorized one for an array of numbers, upon leveraging broadcasting -
def binary_repr_ar(A, W):
    p = (((A[:,None] & (1 << np.arange(W-1,-1,-1)))!=0)).view('u1')
    return p.astype('S1').view('S'+str(W)).ravel()
Sample run -
In [67]: A
Out[67]: array([1, 2, 3, 4, 5, 6, 7, 8, 9])
In [68]: binary_repr_ar(A,32)
Out[68]:
array(['00000000000000000000000000000001',
'00000000000000000000000000000010',
'00000000000000000000000000000011',
'00000000000000000000000000000100',
'00000000000000000000000000000101',
'00000000000000000000000000000110',
'00000000000000000000000000000111',
'00000000000000000000000000001000',
'00000000000000000000000000001001'], dtype='|S32')
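To see what the broadcasting step in Approach #1 produces, here is a small walk-through with a width of 4 instead of 32. It is a sketch for illustration only: the intermediate names (shifts, mask, digits) are not from the original, and it adds the ASCII offset directly rather than going through astype('S1'):

```python
import numpy as np

A = np.array([1, 5, 9])
W = 4

shifts = np.arange(W - 1, -1, -1)             # [3, 2, 1, 0]: most significant bit first
mask = (A[:, None] & (1 << shifts)) != 0      # shape (len(A), W): one row of bit flags per number
digits = mask.astype(np.uint8) + 48           # 48 and 49 are the ASCII codes of '0' and '1'
strings = digits.view('S' + str(W)).ravel()   # reinterpret each row of W bytes as one byte string

print(mask.astype(int))
print(strings)
```

Each row of mask holds the bits of one number, and the final view only reinterprets the bytes that are already in memory; no per-element Python call is involved.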
Approach #2
Another vectorized one with array-assignment -
def binary_repr_ar_v2(A, W):
    mask = (((A[:,None] & (1 << np.arange(W-1,-1,-1)))!=0))
    out = np.full((len(A),W),48, dtype=np.uint8)
    out[mask] = 49
    return out.view('S'+str(W)).ravel()
Alternatively, use the mask directly to get the string array -
def binary_repr_ar_v3(A, W):
    mask = (((A[:,None] & (1 << np.arange(W-1,-1,-1)))!=0))
    return (mask+np.array([48],dtype=np.uint8)).view('S'+str(W)).ravel()
Note that the final output would be a view into one of the intermediate outputs. So, if you need it to have its own memory space, simply append .copy().
Timings on a large sized input array -
In [49]: np.random.seed(0)
...: A = np.random.randint(1,1000,(100000))
...: W = 32
In [50]: %timeit binary_repr_ar(A, W)
...: %timeit binary_repr_ar_v2(A, W)
...: %timeit binary_repr_ar_v3(A, W)
1 loop, best of 3: 854 ms per loop
100 loops, best of 3: 14.5 ms per loop
100 loops, best of 3: 7.33 ms per loop
From other posted solutions -
In [22]: %timeit [np.binary_repr(i, width=32) for i in A]
10 loops, best of 3: 97.2 ms per loop
In [23]: %timeit np.frompyfunc(np.binary_repr,2,1)(A,32).astype('U32')
10 loops, best of 3: 80 ms per loop
In [24]: %timeit np.vectorize(np.binary_repr)(A, 32)
10 loops, best of 3: 69.8 ms per loop
On @Paul Panzer's solutions -
In [5]: %timeit bin_rep(A,32)
548 µs ± 1.02 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [6]: %timeit bin_rep(A,31)
2.2 ms ± 5.55 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
As the documentation on binary_repr says:
num : int
Only an integer decimal number can be
used.
You can however vectorize this operation, like:
np.vectorize(np.binary_repr)(lis, 32)
this then gives us:
>>> np.vectorize(np.binary_repr)(lis, 32)
array(['00000000000000000000000000000001',
'00000000000000000000000000000010',
'00000000000000000000000000000011',
'00000000000000000000000000000100',
'00000000000000000000000000000101',
'00000000000000000000000000000110',
'00000000000000000000000000000111',
'00000000000000000000000000001000',
'00000000000000000000000000001001'], dtype='<U32')
or if you need this often, you can store the vectorized variant in a variable:
binary_repr_vector = np.vectorize(np.binary_repr)
binary_repr_vector(lis, 32)
Which of course gives the same result:
>>> binary_repr_vector = np.vectorize(np.binary_repr)
>>> binary_repr_vector(lis, 32)
array(['00000000000000000000000000000001',
'00000000000000000000000000000010',
'00000000000000000000000000000011',
'00000000000000000000000000000100',
'00000000000000000000000000000101',
'00000000000000000000000000000110',
'00000000000000000000000000000111',
'00000000000000000000000000001000',
'00000000000000000000000000001001'], dtype='<U32')
Here is a fast method using np.unpackbits
(np.unpackbits(lis.astype('>u4').view(np.uint8))+ord('0')).view('S32')
# array([b'00000000000000000000000000000001',
# b'00000000000000000000000000000010',
# b'00000000000000000000000000000011',
# b'00000000000000000000000000000100',
# b'00000000000000000000000000000101',
# b'00000000000000000000000000000110',
# b'00000000000000000000000000000111',
# b'00000000000000000000000000001000',
# b'00000000000000000000000000001001'], dtype='|S32')
More general:
def bin_rep(A,n):
    if n in (8,16,32,64):
        return (np.unpackbits(A.astype(f'>u{n>>3}').view(np.uint8))+ord('0')).view(f'S{n}')
    nb = max((n-1).bit_length()-3,0)
    return (np.unpackbits(A.astype(f'>u{1<<nb}')[...,None].view(np.uint8),axis=1)[...,-n:]+ord('0')).ravel().view(f'S{n}')
Note: special casing n = 8,16,32,64 is absolutely worth it since it gives a severalfold speedup for these numbers.
Also note that this method only handles values below 2^64; larger ints require a different approach.
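The reason for the big-endian '>u4' cast may not be obvious, so here is a small sketch that splits the one-liner into steps and compares the result with np.binary_repr. The intermediate names are made up for the illustration:

```python
import numpy as np

lis = np.array([1, 2, 3])

# '>u4' = unsigned 32-bit, big-endian: the most significant byte is stored first.
as_bytes = lis.astype('>u4').view(np.uint8)       # 4 bytes per number
bits = np.unpackbits(as_bytes).reshape(-1, 32)    # one row of 32 bits per number
rows = [''.join(map(str, row)) for row in bits.tolist()]
print(rows[2])

# With little-endian storage the four bytes of each number come out in reverse order,
# so the bit string would no longer read from most to least significant bit.
le_bits = np.unpackbits(lis.astype('<u4').view(np.uint8)).reshape(-1, 32)
print(le_bits[0][:8])
```

np.unpackbits expands each byte from its most significant bit down, so storing the bytes in big-endian order is what makes the concatenated bits match the usual binary notation.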
In [193]: alist = [1,2,3,4,5,6,7,8,9]
np.vectorize is convenient, but not fast:
In [194]: np.vectorize(np.binary_repr)(alist, 32)
Out[194]:
array(['00000000000000000000000000000001',
'00000000000000000000000000000010',
'00000000000000000000000000000011',
....
'00000000000000000000000000001001'], dtype='<U32')
In [195]: timeit np.vectorize(np.binary_repr)(alist, 32)
71.8 µs ± 1.88 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
plain old list comprehension is better:
In [196]: [np.binary_repr(i, width=32) for i in alist]
Out[196]:
['00000000000000000000000000000001',
'00000000000000000000000000000010',
'00000000000000000000000000000011',
...
'00000000000000000000000000001001']
In [197]: timeit [np.binary_repr(i, width=32) for i in alist]
11.5 µs ± 181 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
another iterator:
In [200]: timeit np.frompyfunc(np.binary_repr,2,1)(alist,32).astype('U32')
30.1 µs ± 1.79 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Let's say I have an array L = [1,0,5,1] and I want to put it into two bins, I would like to get out Lbin = [1,6]. Similarly let's say L = [1,3,5,2,6,7] and I want to put it into three bins, I would like to get out Lbin = [4,7,13].
If b is the number of bins and we assume that b divides len(L), is
there a numpy function to do this?
My array L will be large and I have a lot of them so I need a linear time solution to the problem.
The answer by Divakar is very nice. As an addition:
Is there an easy way to deal with the situation where b doesn't
divide len(L) so the last bin just has fewer elements in it? So L=[1,0,5,1,4] with b = 2 would give you [6,5].
We could simply reshape to basically split into rows of such groups and hence sum each row for the desired output, like so -
np.reshape(L,(num_bins,-1)).sum(1)
For arrays with lengths not necessarily divisible by the number of bins -
def sum_groups(L, num_bins):
    n = len(L)
    grp_len = int(np.ceil(n/float(num_bins)))
    b = int(n%num_bins!=0)
    lim = grp_len*(num_bins-b)
    p0 = np.reshape(L[:lim],(-1,grp_len)).sum(1)
    if b!=0:
        p1 = np.sum(L[lim:])
        return np.r_[p0,p1]
    else:
        return p0
Bringing in np.einsum for cases when the binned summations are within the input array dtype precision -
def sum_groups_einsum(L, num_bins):
    n = len(L)
    grp_len = int(np.ceil(n/float(num_bins)))
    b = int(n%num_bins!=0)
    lim = grp_len*(num_bins-b)
    p0 = np.einsum('ij->i',np.reshape(L[:lim],(-1,grp_len)))
    if b!=0:
        p1 = np.einsum('i->',L[lim:])
        return np.r_[p0,p1]
    else:
        return p0
Benchmarking
Following closely the OP's timing setup -
In [404]: # Setup
...: np.random.seed(0)
...: L = np.random.randint(0,high = 6, size = 10000000)
...: b = 20
In [405]: %timeit sum_groups(L, num_bins=b)
...: %timeit sum_groups_einsum(L, num_bins=b)
...: %timeit np.array([t.sum() for t in np.array_split(L, b)])
...: %timeit np.add.reduceat(L, np.linspace(0.5, L.size+0.5, b, False, dtype=int))
100 loops, best of 3: 6.2 ms per loop
100 loops, best of 3: 6 ms per loop
100 loops, best of 3: 6.25 ms per loop # @user2699's soln
100 loops, best of 3: 6.19 ms per loop # @Paul Panzer's soln
For the case when the array length is not divisible by the number of bins, let's add a few more elements to the input array -
In [406]: # Setup
...: np.random.seed(0)
...: L = np.random.randint(0,high = 6, size = 10000012)
...: b = 20
In [407]: %timeit sum_groups(L, num_bins=b)
...: %timeit sum_groups_einsum(L, num_bins=b)
...: %timeit np.array([t.sum() for t in np.array_split(L, b)])
...: %timeit np.add.reduceat(L, np.linspace(0.5, L.size+0.5, b, False, dtype=int))
100 loops, best of 3: 6.45 ms per loop
100 loops, best of 3: 6.05 ms per loop
100 loops, best of 3: 6.45 ms per loop
100 loops, best of 3: 6.51 ms per loop
Running those a few more times, the first one and the last two had very comparable runtimes, and the second one with einsum was a tiny bit faster than the rest.
The following works,
array([t.sum() for t in array_split(L, b)])
And if, as you stated, you know that b divides L evenly, you can replace array_split with the split function.
Here are some benchmarks, with b=100 and L = randint(0, 100, 1000)
%timeit sum_groups(L, b) # Defined in Divakar's answer
8.09 µs ± 293 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit array([t.sum() for t in array_split(L, b)])
260 µs ± 2.12 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit np.add.reduceat(L, np.linspace(0.5, L.size+0.5, b, False, dtype=int))
15.9 µs ± 1.45 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
and with b=3 and L = randint(0, 100, 1000)
%timeit sum_groups(L, b)
23.2 µs ± 317 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit array([t.sum() for t in array_split(L, b)])
16.2 µs ± 171 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit np.add.reduceat(L, np.linspace(0.5, L.size+0.5, b, False, dtype=int))
15 µs ± 1.77 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Depending on your data, it looks like Divakar's answer using reshaping may be the best approach.
You could use np.add.reduceat:
>>> np.add.reduceat(L, np.linspace(0, L.size, nbin, False, dtype=int))
It rounds the bin edges differently to your example, though:
>>> L = np.array([1,0,5,1,4])
>>> np.add.reduceat(L, np.linspace(0, L.size, nbin, False, dtype=int))
array([ 1, 10])
To get your rounding:
>>> np.add.reduceat(L, np.linspace(0.5, L.size+0.5, nbin, False, dtype=int))
array([6, 5])
To squeeze out a bit more performance we can avoid linspace and use integer arithmetic:
>>> np.add.reduceat(L, np.arange(nbin//2, L.size * nbin, L.size) // nbin)
It is worth mentioning that reshape based solutions do not always give the same result as the others, in fact, there are quite a few cases where reshape simply doesn't work. Example: 50 elements, 20 groups. This requires groups of 2 and 3 elements, 10 groups each. Obviously, this cannot be done by reshaping.
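To make the bin-edge arithmetic concrete, here is a short sketch that prints the start index of each bin for the two reduceat variants above, first on the five-element example and then for the case of 50 elements in 20 groups. The variable names are illustrative:

```python
import numpy as np

L = np.array([1, 0, 5, 1, 4])
nbin = 2

# Start index of each bin, computed both ways
edges_linspace = np.linspace(0.5, L.size + 0.5, nbin, False, dtype=int)
edges_intarith = np.arange(nbin // 2, L.size * nbin, L.size) // nbin
print(edges_linspace, edges_intarith)        # [0 3] [0 3]
print(np.add.reduceat(L, edges_linspace))    # [6 5]

# 50 elements in 20 bins: the bin sizes are the gaps between consecutive edges
edges = np.arange(20 // 2, 50 * 20, 50) // 20
sizes = np.diff(np.r_[edges, 50])
print(sizes)
```

The sizes come out as a mix of 3 and 2, ten of each, which is exactly the layout that a single reshape cannot express.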
Performance comparison (10 bins, element count not a multiple):
Benchmarking code:
import perfplot
import numpy as np
def sg_reshape(args):
    L, num_bins = args
    n = len(L)
    grp_len = int(np.ceil(n/float(num_bins)))
    b = int(n%num_bins!=0)
    lim = grp_len*(num_bins-b)
    p0 = np.reshape(L[:lim],(-1,grp_len)).sum(1)
    if b!=0:
        p1 = np.sum(L[lim:])
        return np.r_[p0,p1]
    else:
        return p0
def sg_einsum(args):
    L, num_bins = args
    n = len(L)
    grp_len = int(np.ceil(n/float(num_bins)))
    b = int(n%num_bins!=0)
    lim = grp_len*(num_bins-b)
    p0 = np.einsum('ij->i',np.reshape(L[:lim],(-1,grp_len)))
    if b!=0:
        p1 = np.sum(L[lim:])
        return np.r_[p0,p1]
    else:
        return p0
def sg_addred(args):
    L, nbin = args
    return np.add.reduceat(L, np.linspace(0.5, L.size+0.5, nbin, False, dtype=int))
def sg_intarith(args):
    L, nbin = args
    return np.add.reduceat(L, np.arange(nbin//2, L.size * nbin, L.size) // nbin)
def sg_arrsplit(args):
    L, b = args
    return np.array([t.sum() for t in np.array_split(L, b)])
perfplot.save('cho10.png',
              setup=lambda n: (np.random.randint(0, 9, (n,)), 10),
              n_range=[2**k for k in range(8, 23)],
              kernels=[
                  sg_reshape,
                  sg_einsum,
                  sg_addred,
                  sg_intarith,
                  sg_arrsplit
              ],
              logx=True,
              logy=True,
              xlabel='#elements',
              equality_check=None
              )
I have a numpy array and I would like to check if it is sorted.
>>> a = np.array([1,2,3,4,5])
array([1, 2, 3, 4, 5])
np.all(a[:-1] <= a[1:])
Examples:
is_sorted = lambda a: np.all(a[:-1] <= a[1:])
>>> a = np.array([1,2,3,4,9])
>>> is_sorted(a)
True
>>> a = np.array([1,2,3,4,3])
>>> is_sorted(a)
False
With NumPy tools:
np.all(np.diff(a) >= 0)
but numpy solutions are all O(n).
If you want quick code and very quick conclusion on unsorted arrays:
import numba
@numba.jit
def is_sorted(a):
    for i in range(a.size-1):
        if a[i+1] < a[i] :
            return False
    return True
which is O(1) on average on random arrays, since the first out-of-order pair usually shows up within the first few elements.
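If Numba is not available, a middle ground between the fully vectorized check and the element-by-element loop is to test the array in blocks and stop at the first block that contains a descent. This is only a sketch of that idea, not one of the posted answers. The function name and the default block size are arbitrary choices:

```python
import numpy as np

def is_sorted_chunked(a, chunk=4096):
    # Hypothetical helper: vectorized comparison per block, early exit between blocks.
    a = np.asarray(a)
    for start in range(0, a.size - 1, chunk):
        # One extra element so the pair straddling two blocks is also compared.
        block = a[start:start + chunk + 1]
        if np.any(block[:-1] > block[1:]):
            return False
    return True

print(is_sorted_chunked(np.array([1, 2, 3, 4, 9])))
print(is_sorted_chunked(np.array([1, 2, 3, 4, 3])))
```

On a random array this returns after the first block, while on a sorted array it does the same comparisons as the one-line NumPy check plus a small per-block overhead.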
The inefficient but easy-to-type solution:
(a == np.sort(a)).all()
For completeness, the iterative divide-and-conquer solution, described here as O(log n), is found below. The recursive version is slower and crashes with big vector sizes. Note that O(log n) can only describe how quickly it may hit an out-of-order pair: to confirm that an array is sorted it still has to compare every adjacent pair, which is O(n). On fully sorted input it is slower than the native numpy check np.all(a[:-1] <= a[1:]), most likely due to modern CPU optimizations. The only cases where it is faster are the "average" random case and the "almost" sorted case. If you suspect your array is already fully sorted then np.all will be faster.
def is_sorted(a):
    idx = [(0, a.size - 1)]
    while idx:
        i, j = idx.pop(0)  # Breadth-First will find almost-sorted in O(log N)
        if i >= j:
            continue
        elif a[i] > a[j]:
            return False
        elif i + 1 == j:
            continue
        else:
            mid = (i + j) >> 1  # Division by 2 with floor
            idx.append((i, mid))
            idx.append((mid, j))
    return True
is_sorted2 = lambda a: np.all(a[:-1] <= a[1:])
Here are the results:
# Already sorted array - np.all is super fast
sorted_array = np.sort(np.random.rand(1000000))
%timeit is_sorted(sorted_array)
659 ms ± 3.28 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit is_sorted2(sorted_array)
431 µs ± 35.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# Here I included the random call in each command, so we need to subtract its timing
%timeit np.random.rand(1000000)
6.08 ms ± 17.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit is_sorted(np.random.rand(1000000))
6.11 ms ± 58.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# Without random part, it took 6.11 ms - 6.08 ms = 30µs per loop
%timeit is_sorted2(np.random.rand(1000000))
6.83 ms ± 75.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# Without random part, it took 6.83 ms - 6.08 ms = 750µs per loop
In short, O(n) vectorized code is better than the O(log n) algorithm, unless you are working with arrays of more than 100 million elements.
I have a three-dimensional array like
A=np.array([[[1,1],
[1,0]],
[[1,2],
[1,0]],
[[1,0],
[0,0]]])
Now I would like to obtain an array that has a nonzero value in a given position if only a unique nonzero value (or zero) occurs in that position. It should have zero if only zeros or more than one nonzero value occur in that position. For the example above, I would like
[[1,0],
[1,0]]
since
in A[:,0,0] there are only 1s
in A[:,0,1] there are 0, 1 and 2, so more than one nonzero value
in A[:,1,0] there are 0 and 1, so 1 is retained
in A[:,1,1] there are only 0s
I can find how many nonzero elements there are with np.count_nonzero(A, axis=0), but I would like to keep 1s or 2s even if there are several of them. I looked at np.unique but it doesn't seem to support what I'd like to do.
Ideally, I'd like a function like np.count_unique(A, axis=0) which would return an array in the original shape, e.g. [[1, 3],[2, 1]], so I could check whether 3 or more occur and then ignore that position.
All I could come up with was a list comprehension iterating over the positions of the array that I'd like to obtain
[[len(np.unique(A[:, i, j])) for j in range(A.shape[2])] for i in range(A.shape[1])]
Any other ideas?
You can use np.diff to stay at numpy level for the second task.
def diffcount(A):
    B=A.copy()
    B.sort(axis=0)
    C=np.diff(B,axis=0)>0
    D=C.sum(axis=0)+1
    return D
# [[1 3]
# [2 1]]
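The counts above cover the second task. For the first task from the question (keep the value where exactly one distinct nonzero value occurs, and put zero elsewhere), one possible sketch compares the largest and the smallest nonzero value at each position. It assumes non-negative entries, and the function name is hypothetical:

```python
import numpy as np

def unique_nonzero_or_zero(A):
    # Assumes A holds non-negative numbers, so zero is the smallest possible value.
    mx = A.max(axis=0)
    # Smallest nonzero value per position: zeros are replaced by the maximum first.
    mn = np.where(A != 0, A, mx).min(axis=0)
    # If largest and smallest nonzero values agree, only one distinct nonzero value occurs.
    return np.where(mx == mn, mx, 0)

A = np.array([[[1, 1], [1, 0]],
              [[1, 2], [1, 0]],
              [[1, 0], [0, 0]]])
print(unique_nonzero_or_zero(A))
```

Positions containing only zeros give a maximum of zero and therefore stay zero, which matches the example in the question.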
It seems to be faster on big arrays:
In [62]: A=np.random.randint(0,100,(100,100,100))
In [63]: %timeit diffcount(A)
46.8 ms ± 769 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [64]: timeit [[len(np.unique(A[:, i, j])) for j in range(A.shape[2])]\
for i in range(A.shape[1])]
149 ms ± 700 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Finally, counting unique values is simpler than sorting, so a factor of ln(A.shape[0]) can be won.
One way to win this factor is to use the set mechanism:
In [81]: %timeit np.apply_along_axis(lambda a:len(set(a)),0,A)
183 ms ± 1.17 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Unfortunately, this is not faster.
Another way is to do it by hand :
def countunique(A,Amax):
    res=np.empty(A.shape[1:],A.dtype)
    c=np.empty(Amax+1,A.dtype)
    for i in range(A.shape[1]):
        for j in range(A.shape[2]):
            T=A[:,i,j]
            for k in range(c.size): c[k]=0
            for x in T:
                c[x]=1
            res[i,j]= c.sum()
    return res
At python level:
In [70]: %timeit countunique(A,100)
429 ms ± 18.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Which is not so bad for a pure Python approach. Then just move this code to a lower level with numba:
import numba
countunique2=numba.jit(countunique)
In [71]: %timeit countunique2(A,100)
3.63 ms ± 70.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
This will be difficult to improve on by much.
One approach would be to use A as first axis indices for setting a boolean array of the same lengths along the other two axes and then simply counting the non-zeros along the first axis of it. Two variants would be possible - One keeping it as 3D and another would be to reshape into 2D for some performance benefit as indexing into 2D would be faster. Thus, the two implementations would be -
def nunique_axis0_maskcount_app1(A):
    m,n = A.shape[1:]
    mask = np.zeros((A.max()+1,m,n),dtype=bool)
    mask[A,np.arange(m)[:,None],np.arange(n)] = 1
    return mask.sum(0)
def nunique_axis0_maskcount_app2(A):
    m,n = A.shape[1:]
    A.shape = (-1,m*n)
    maxn = A.max()+1
    N = A.shape[1]
    mask = np.zeros((maxn,N),dtype=bool)
    mask[A,np.arange(N)] = 1
    A.shape = (-1,m,n)
    return mask.sum(0).reshape(m,n)
Runtime test -
In [154]: A = np.random.randint(0,100,(100,100,100))
# @B. M.'s soln
In [155]: %timeit f(A)
10 loops, best of 3: 28.3 ms per loop
# @B. M.'s soln using slicing : (B[1:] != B[:-1]).sum(0)+1
In [156]: %timeit f2(A)
10 loops, best of 3: 26.2 ms per loop
In [157]: %timeit nunique_axis0_maskcount_app1(A)
100 loops, best of 3: 12 ms per loop
In [158]: %timeit nunique_axis0_maskcount_app2(A)
100 loops, best of 3: 9.14 ms per loop
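To see the mask-counting idea at work, here is the 3D variant applied step by step to the small array from the question. This is a sketch for illustration: it assumes non-negative integer values, since the values are used as indices:

```python
import numpy as np

A = np.array([[[1, 1], [1, 0]],
              [[1, 2], [1, 0]],
              [[1, 0], [0, 0]]])

m, n = A.shape[1:]
# One boolean layer per possible value 0..A.max(); layer v marks the positions where v occurs.
mask = np.zeros((A.max() + 1, m, n), dtype=bool)
mask[A, np.arange(m)[:, None], np.arange(n)] = True
# Counting the marked layers per position gives the number of distinct values there.
counts = mask.sum(axis=0)
print(counts)
```

The mask has A.max()+1 layers, so the memory cost grows with the largest value in A rather than with the number of distinct values.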
Numba method
Using the same strategy as used for nunique_axis0_maskcount_app2 with directly getting the counts at C-level with numba, we would have -
from numba import njit
@njit
def nunique_loopy_func(mask, N, A, p, count):
    for j in range(N):
        mask[:] = True
        mask[A[0,j]] = False
        c = 1
        for i in range(1,p):
            if mask[A[i,j]]:
                c += 1
            mask[A[i,j]] = False
        count[j] = c
    return count
def nunique_axis0_numba(A):
    p,m,n = A.shape
    A.shape = (-1,m*n)
    maxn = A.max()+1
    N = A.shape[1]
    mask = np.empty(maxn,dtype=bool)
    count = np.empty(N,dtype=int)
    out = nunique_loopy_func(mask, N, A, p, count).reshape(m,n)
    A.shape = (-1,m,n)
    return out
Runtime test -
In [328]: np.random.seed(0)
In [329]: A = np.random.randint(0,100,(100,100,100))
In [330]: %timeit nunique_axis0_maskcount_app2(A)
100 loops, best of 3: 11.1 ms per loop
# @B.M.'s numba soln
In [331]: %timeit countunique2(A,A.max()+1)
100 loops, best of 3: 3.43 ms per loop
# Numba soln posted in this post
In [332]: %timeit nunique_axis0_numba(A)
100 loops, best of 3: 2.76 ms per loop