I want to add all nonzero elements from a numpy array arr to a list out_list. Previous research suggests that for numpy arrays, using np.nonzero is most efficient. (My own benchmark below actually suggests it can be slightly improved using np.delete).
However, in my case I want my output to be a list, because I am combining many arrays for which I don't know the number of nonzero elements (so I can't effectively preallocate a numpy array for them). Hence, I was wondering whether there are some synergies that can be exploited to speed up the process. While my naive list comprehension approach is much slower than the pure numpy approach, I got some promising results combining list comprehension with numba.
Here's what I found so far:
import numpy as np
n = 60_000 # size of array
nz = 0.3 # fraction of zero elements
arr = (np.random.random_sample(n) - nz).clip(min=0)
# method 1
def add_to_list1(arr, out):
    out.extend(list(arr[np.nonzero(arr)]))

# method 2
def add_to_list2(arr, out):
    out.extend(list(np.delete(arr, arr == 0)))

# method 3
def add_to_list3(arr, out):
    out += [x for x in arr if x != 0]

# method 4 (not sure how to get numba to accept an empty list as argument;
# see the typed-list note below)
from numba import njit

@njit
def add_to_list4(arr):
    return [x for x in arr if x != 0]
out_list = []
%timeit add_to_list1(arr, out_list)
out_list = []
%timeit add_to_list2(arr, out_list)
out_list = []
%timeit add_to_list3(arr, out_list)
_ = add_to_list4(arr) # call once to compile
out_list = []
%timeit out_list.extend(add_to_list4(arr))
Yielding the following results:
2.51 ms ± 137 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
2.19 ms ± 133 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
15.6 ms ± 183 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
1.63 ms ± 158 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Not surprisingly, numba outperforms all other methods. Among the rest, method 2 (using np.delete) is the best. Am I missing any obvious alternative that exploits the fact that I am converting to a list afterwards? Can you think of anything to further speed up the process?
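As a note on method 4: numba cannot infer the element type of an empty plain Python list, but it does accept a numba typed list. A minimal sketch, assuming a numba.typed.List is acceptable as the output container (add_to_list4b is a hypothetical name; converting the typed list back to a plain list afterwards costs extra):
from numba import njit, types
from numba.typed import List

@njit
def add_to_list4b(arr, out):
    # out is a numba typed list; an empty plain Python list cannot be type-inferred
    for x in arr:
        if x != 0:
            out.append(x)

out_typed = List.empty_list(types.float64)  # empty, but with a declared element type
add_to_list4b(arr, out_typed)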
Edit 1:
Performance of .tolist():
# method 5
def add_to_list5(arr, out):
    out += arr[arr != 0].tolist()

# method 6
def add_to_list6(arr, out):
    out += np.delete(arr, arr == 0).tolist()

# method 7
def add_to_list7(arr, out):
    out += arr[arr.astype(bool)].tolist()
Timings are on par with numba:
1.62 ms ± 118 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.65 ms ± 104 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.78 ms ± 119 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Edit 2:
Here's some benchmarking using Mad Physicist's suggestion to use np.concatenate to construct a numpy array instead.
import time

# construct numpy array using np.concatenate
out_list = []
t = time.perf_counter()
for i in range(100):
    out_list.append(arr[arr != 0])
result = np.concatenate(out_list)
print(f"Time elapsed: {time.perf_counter() - t:.4f}s")

# compare with best list-based method
out_list = []
t = time.perf_counter()
for i in range(100):
    out_list += arr[arr != 0].tolist()
print(f"Time elapsed: {time.perf_counter() - t:.4f}s")
Concatenating numpy arrays indeed yields another significant speed-up, although it is not directly comparable since the output is a numpy array instead of a list. So which approach is best will depend on the precise use case.
Time elapsed: 0.0400s
Time elapsed: 0.1430s
TLDR;
1/ using arr[arr != 0] is the fastest of all the indexing options
2/ using .tolist() instead of list(.) speeds things up by a factor of 1.3 - 1.5
3/ with the gains of 1/ and 2/ combined, the speed is on par with numba
4/ if having a numpy array instead of a list is acceptable, then using np.concatenate yields another gain in speed by a factor of ~3.5 compared to the best alternative
I submit that the method of choice, if you are indeed looking for a list output, is:
def f(arr, out_list):
    out_list += arr[arr != 0].tolist()
It seems to beat all the other methods mentioned so far in the OP's question or in other responses (at the time of this writing).
If, however, you are looking for a result as a numpy array, then @MadPhysicist's version (slightly modified to use arr[arr != 0] instead of np.nonzero()) is almost 6x faster; see the end of this post.
Side note: I would avoid using %timeit out_list.extend(some_list): it keeps adding to out_list during the many loops of timeit. Example:
out_list = []
%timeit out_list.extend([1,2,3])
and now:
>>> len(out_list)
243333333 # yikes
Timings
On 60K items on my machine, I see:
out_list = []
a = %timeit -o out_list + arr[arr != 0].tolist()
b = %timeit -o out_list + arr[np.nonzero(arr)].tolist()
c = %timeit -o out_list + list(arr[np.nonzero(arr)])
Yields:
1.23 ms ± 10.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.53 ms ± 2.53 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
4.29 ms ± 3.02 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
And:
>>> c.average / a.average
3.476
>>> b.average / a.average
1.244
For a numpy array result instead
Following @MadPhysicist, you can get some extra boost by not turning the arrays into lists, but using np.concatenate() instead:
def all_nonzero(arr_iter):
    """return non zero elements of all arrays as a np.array"""
    return np.concatenate([a[a != 0] for a in arr_iter])

def all_nonzero_list(arr_iter):
    """return non zero elements of all arrays as a list"""
    out_list = []
    for a in arr_iter:
        out_list += a[a != 0].tolist()
    return out_list
from itertools import repeat
ta = %timeit -o all_nonzero(repeat(arr, 100))
tl = %timeit -o all_nonzero_list(repeat(arr, 100))
Yields:
39.7 ms ± 107 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
227 ms ± 680 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
and
>>> tl.average / ta.average
5.75
Instead of extending a list with all of the elements of a new array, append the array itself. This will make for far fewer and smaller reallocations. You can also pre-allocate a list of Nones up front, or even use an object array, if you have an upper bound on the number of arrays you will process (see the sketch further below).
When you're done, call np.concatenate on the list.
So instead of this:
L = []
for i in range(10):
    arr = (np.random.random_sample(n) - nz).clip(min=0)
    L.extend(arr[np.nonzero(arr)])
result = np.array(L)
Try this:
L = []
for i in range(10):
    arr = (np.random.random_sample(n) - nz).clip(min=0)
    L.append(arr[np.nonzero(arr)])
result = np.concatenate(L)
Since you're keeping arrays around, the final concatenation will be a series of buffer copies (which is fast), rather than a bunch of Python-to-numpy type conversions (which won't be). The exact method you choose for deletion is of course still up to the result of your benchmark.
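The pre-allocation idea mentioned above might look like this (a sketch; n_arrays is a hypothetical upper bound, and n and nz are as defined in the question):
n_arrays = 10                 # assumed upper bound on the number of arrays (hypothetical)
L = [None] * n_arrays         # pre-allocated slots: no list resizing while filling
for i in range(n_arrays):
    arr = (np.random.random_sample(n) - nz).clip(min=0)
    L[i] = arr[np.nonzero(arr)]
result = np.concatenate(L)    # slice L first (e.g. L[:count]) if fewer slots were actually filled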
Also, here's another method to add to your benchmark:
def add_to_list5(arr, out):
    out.extend(list(arr[arr.astype(bool)]))
I don't expect this to be overwhelmingly fast, but it's interesting to see how masking stacks up next to indexing.
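For reference, the two approaches select exactly the same elements, one via a boolean mask and one via integer indices (a small sanity check):
mask = arr.astype(bool)    # boolean mask (True where arr != 0)
idx = np.nonzero(arr)      # tuple of integer index arrays
assert (arr[mask] == arr[idx]).all()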
I need to find a fast way to get the indices of neighbors that have the same value as the current element.
For example:
arr = [0, 0, 0, 1, 0, 1, 1, 1, 1, 0]
indicies = func(arr, 6)
# [5, 6, 7, 8]
The 6th element has value 1, so I need the full slice containing the 6th element and all of its neighbors with the same value.
It is like part of a flood fill algorithm. Is there a way to do it fast in numpy?
Is there a way for 2D array?
EDIT
Let's see some performance tests:
import numpy as np
import random
np.random.seed(1488)
arr = np.zeros(5000)
for x in np.random.randint(0, 5000, size = 100):
    arr[x:x+50] = 1
I will compare the function from @Ehsan:
def func_Ehsan(arr, idx):
    change = np.insert(np.flatnonzero(np.diff(arr)), 0, -1)
    loc = np.searchsorted(change, idx)
    start = change[max(loc-1,0)]+1 if loc<len(change) else change[loc-1]
    end = change[min(loc, len(change)-1)]
    return (start, end)

change = np.insert(np.flatnonzero(np.diff(arr)), 0, -1)

def func_Ehsan_same_arr(arr, idx):
    loc = np.searchsorted(change, idx)
    start = change[max(loc-1,0)]+1 if loc<len(change) else change[loc-1]
    end = change[min(loc, len(change)-1)]
    return (start, end)
with my pure python function:
def my_func(arr, index):
    val = arr[index]
    size = arr.size
    end = index + 1
    while end < size and arr[end] == val:
        end += 1
    start = index - 1
    while start > -1 and arr[start] == val:
        start -= 1
    return start + 1, end
Take a look:
np.random.seed(1488)
%timeit my_func(arr, np.random.randint(0, 5000))
# 42.4 µs ± 700 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
np.random.seed(1488)
%timeit func_Ehsan(arr, np.random.randint(0, 5000))
# 115 µs ± 1.92 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
np.random.seed(1488)
%timeit func_Ehsan_same_arr(arr, np.random.randint(0, 5000))
# 18.1 µs ± 953 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Is there a way to use the same logic with numpy, without a C module/Cython/Numba/Python loops? And make it faster!
I don't know how to solve this problem with numpy, but if you use pandas, you might get the result that you want with this:
import pandas as pd
df=pd.DataFrame(arr,columns=["data"])
df["new"] = df["data"].diff().ne(0).cumsum()
[{i[0]:j.index.tolist()} for i,j in df.groupby(["data","new"],sort=False)]
Output:
[{0: [0, 1, 2]}, {1: [3]}, {0: [4]}, {1: [5, 6, 7, 8]}, {0: [9]}]
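To answer the original query for a specific index from this output, one could then pick the group containing it (my own addition, reusing the grouping above):
groups = [j.index.tolist() for _, j in df.groupby(["data", "new"], sort=False)]
idx = 6
print(next(g for g in groups if idx in g))   # [5, 6, 7, 8]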
The main problem is that Numpy is not currently designed to solve this problem efficiently. A "find first index of value fast" or any similar lazy function call is required to solve this problem efficiently. However, while this feature has been discussed for about 10 years, it is still not available in Numpy. See this post for more information. I do not expect any change soon. Until then, the best solution on relatively big arrays appears to be an iterative one using relatively slow pure-Python loops and slow Numpy calls/accesses.
Besides this, one way to speed up the computation is to work on small chunks. Here is an implementation:
def my_func_opt1(arr, index):
    val = arr[index]
    size = arr.size
    chunkSize = 128

    # scan forward in chunks until a different value is found
    end = index + 1
    while end < size:
        chunk = arr[end:end+chunkSize]
        locations = (chunk != val).nonzero()[0]
        if len(locations) > 0:
            foundCount = locations[0]
            end += foundCount
            break
        else:
            end += len(chunk)

    # scan backward in chunks until a different value is found
    start = index
    while start > 0:
        chunk = arr[max(start-chunkSize,0):start]
        locations = (chunk != val).nonzero()[0]
        if len(locations) > 0:
            foundCount = locations[-1]
            # use len(chunk), not chunkSize, so a partial chunk near index 0 is handled correctly
            start -= len(chunk) - 1 - foundCount
            break
        else:
            start -= len(chunk)

    return start, end
Here are performance results on my machine:
func_Ehsan: 53.8 µs ± 449 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
my_func: 17.5 µs ± 97.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
my_func_opt1: 7.31 µs ± 52.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
The thing is, the results are a bit biased since np.random.randint itself takes about 2.01 µs. With this Numpy call excluded from the benchmark, here are the results:
func_Ehsan: 51.8 µs
my_func: 15.5 µs
my_func_opt1: 5.31 µs
As a result, my_func_opt1 is about 3 times faster than my_func. It is very difficult to write faster code, as any Numpy call introduces a relatively big overhead of 0.5-1.0 µs on my machine whatever the array size (e.g. even on empty arrays) due to internal checks.
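To get a feel for that fixed per-call overhead, one can time a Numpy call on an empty array (a quick sketch; exact numbers depend on the machine):
empty = np.empty(0)
%timeit (empty != 0).nonzero()   # even with no data, each call pays on the order of a microsecond of overhead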
The following is useful for people interested in speeding up the operation who can use Numba.
The simplest solution consists in using Numba's JIT, and more specifically just adding a decorator. This solution is also very fast.
import numba as nb

@nb.njit('UniTuple(i8,2)(f8[::1], i8)')
def my_func_opt2(arr, index):
    val = arr[index]
    size = arr.size
    end = index + 1
    while end < size and arr[end] == val:
        end += 1
    start = index - 1
    while start > -1 and arr[start] == val:
        start -= 1
    return start + 1, end
On my machine my_func_opt2 takes only 0.63 µs (with the random call excluded). As a result, my_func_opt2 is about 25 times faster than my_func. I highly doubt there is a faster solution since any Numpy call on my machine takes at least 0.5 µs and an empty Numba function takes 0.25 µs to call.
Besides this, please note that arr contains double-precision values, which are pretty expensive to compute. It should be faster to use integers if you can. Also note that an array of 0 and 1 values can be stored with an int8 dtype, which takes 8 times less memory and is often significantly faster to compute (due to CPU caches, the smaller the array the faster the computation). You can specify the type during the creation of the array: np.zeros(5000, dtype=np.int8)
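For instance, a quick check of the memory footprint (a minimal sketch):
print(np.zeros(5000).nbytes)                 # 40000 bytes (float64)
print(np.zeros(5000, dtype=np.int8).nbytes)  # 5000 bytes, 8x smaller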
Here is a numpy solution. I think you can improve it with a little more work:
def func(arr, idx):
    change = np.insert(np.flatnonzero(np.diff(arr)), 0, -1)
    loc = np.searchsorted(change, idx)
    start = change[max(loc-1,0)]+1 if loc<len(change) else change[loc-1]
    end = change[min(loc, len(change)-1)]
    return np.arange(start, end + 1)  # end is the inclusive last index of the run, hence the +1
sample output:
indices = func(arr, 6)
#[5 6 7 8]
This would be especially fast if you have few changes in your arr (relative to the array size) and you are doing multiple such index searches on the same array. Otherwise, faster solutions come to mind.
Performance comparison:
If you are querying the same array multiple times, simply move the first line out of the function, like this, to avoid repetition.
change = np.insert(np.flatnonzero(np.diff(arr)), 0, -1)

def func(arr, idx):
    loc = np.searchsorted(change, idx)
    start = change[max(loc-1,0)]+1 if loc<len(change) else change[loc-1]
    end = change[min(loc, len(change)-1)]
    return np.arange(start, end + 1)
For the same input as the OP's example:
np.random.seed(1488)
%timeit func_OP(arr, np.random.randint(0, 5000))
#23.5 µs ± 631 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
np.random.seed(1488)
%timeit func_Ehsan(arr, np.random.randint(0, 5000))
#7.89 µs ± 113 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
np.random.seed(1488)
%timeit func_Jérôme_opt1(arr, np.random.randint(0, 5000))
#12.1 µs ± 757 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit func_Jérôme_opt2(arr, np.random.randint(0, 5000))
#3.45 µs ± 179 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
func_Ehsan is the fastest (excluding Numba). Please keep in mind that, again, the performance of these functions varies significantly with the number of changes in the array, the array size, and how many times the function is called on the same array. And of course Numba is faster than all of them (almost 2x faster than func_Ehsan). And if you are going to run it many times, build the groups in O(n) and use a hash map from index to group for O(1) lookups, as sketched below.
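A minimal sketch of that last idea (my own illustration; build_runs and query are hypothetical helpers, and a run-id label array stands in for the hash map):
def build_runs(arr):
    # O(n): label every position with the id of the run it belongs to
    change = np.flatnonzero(np.diff(arr)) + 1      # start index of every new run
    run_id = np.zeros(arr.size, dtype=np.int64)
    run_id[change] = 1
    run_id = np.cumsum(run_id)
    starts = np.r_[0, change]
    ends = np.r_[change, arr.size]                 # exclusive end of each run
    return run_id, starts, ends

def query(run_id, starts, ends, idx):
    # O(1): look up the run containing idx
    r = run_id[idx]
    return starts[r], ends[r]

run_id, starts, ends = build_runs(np.array([0, 0, 0, 1, 0, 1, 1, 1, 1, 0]))
print(query(run_id, starts, ends, 6))              # (5, 9) -> indices 5..8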
I need to calculate values for a lot of angles in degrees. In order to build up the coarse shape fast, and the fine bits in between later, I want to calculate the shape in this order (0°, 180°, 90°, 270°, 45°, 135°, ...).
The following code does what I want. I wonder: is there a more straightforward way to do this? It needs to work for any whole number (e.g. 72, or 7465).
Thanks for your help.
import numpy as np
def evenly_spaced_star_order(number):
    Total = np.linspace(0, 360, number, endpoint=False)
    Res = []
    for devider in [2**_ for _ in range(1000)]:
        for counter in range(devider):
            Number = (counter*len(Total))//devider
            if np.isfinite(Total[Number]):
                Res.append(Total[Number])
                Total[Number] = np.nan
        if np.all(np.isnan(Total)):
            break
    return(Res)
print(evenly_spaced_star_order(16))
My solution recursively separates the even- and odd-numbered indexes. The odd-numbered elements are then put at the end of the final list (in order), and the even-numbered elements are recursively split apart again.
My order is consistent with your original function, but it is a lot faster (by an order of magnitude or more) and it does indeed work for any whole number.
# recursive evenly_spaced_star_order()
def esso(number):
    def interleave(arr):
        return arr if len(arr) <= 1 else np.append(interleave(arr[0::2]), arr[1::2])
    return interleave(np.linspace(0, 360, number, endpoint=False))
print(esso(16))
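To see why the order comes out right, here is a small hand trace (my own illustration) for number = 8, where the input angles are 0, 45, ..., 315:
interleave([0, 45, 90, 135, 180, 225, 270, 315])
  = append(interleave([0, 90, 180, 270]), [45, 135, 225, 315])
  = append([0, 180, 90, 270], [45, 135, 225, 315])
  = [0, 180, 90, 270, 45, 135, 225, 315]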
My timings:
%timeit evenly_spaced_star_order(16)
885 µs ± 8.68 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit esso(16)
60.1 µs ± 998 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit evenly_spaced_star_order(1000)
5.88 ms ± 192 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit esso(1000)
111 µs ± 10.9 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Mine will perform better and better as the number of points increases (as compared to the original code).
Second solution
It's not nearly as pretty, but the order is closer and it is still faster.
def esso2(number):
    def interleave(arr):
        if arr.shape[0] <= 1:
            return arr
        mid = arr.shape[0] // 2
        it1 = iter(interleave(arr[0:mid]))
        it2 = iter(interleave(arr[mid:]))
        return sum(zip(it1, it2), ()) + tuple(it2)
    return np.array(interleave(np.linspace(0, 360, number, endpoint=False)))
print(esso2(72))
I discovered that when matmul-ing two numpy arrays, if one of the two is the real or imaginary part of a bigger complex array, the operation can be tens or even hundreds of times slower than using the original complex array.
Consider the following example:
import numpy as np
from time import time

class timeit():
    def __init__(self, string):
        self.string = string

    def __enter__(self):
        self.t0 = time()

    def __exit__(self, *args):
        print(f'{self.string} : {time() - self.t0}')

A = np.random.rand(200, 1000) + 0j
B = np.random.rand(1000, 5000)

with timeit('with complex'):
    out = A @ B

Ar = A.real
with timeit('after .real'):
    out = Ar @ B

Ai = (A * 1j).imag
with timeit('after .imag'):
    out = Ai @ B

with timeit('after .astype(float)'):
    out = A.astype(np.float64) @ B

with timeit('after .real.astype(float)'):
    out = A.real.astype(np.float64) @ B
The output is
with complex : 0.09374785423278809
after .real : 1.9792003631591797
after .imag : 1.717487096786499
after .astype(float) : 0.016920804977416992
after .real.astype(float) : 0.017952442169189453
Note how when one of the two arrays is A.real or A.imag the operation is 20 times slower (it can be hundreds of times slower if the arrays are bigger).
Using A.astype(np.float64) is very fast, but it throws a warning every time, even if the imaginary part is zero.
The only fast and quiet solution seems to be A.real.astype(float), but, to be honest, it looks quite ugly to me.
Checking the memory addresses of these arrays, I obtain the following:
def aid(x):
    # This function returns the memory
    # block address of an array.
    return x.__array_interface__['data'][0]

print(f'ID(A.real) == ID(A): {aid(A.real) == aid(A)}')
print(f'ID(A.imag) == ID(A): {aid(A.imag) == aid(A)}')
print(f'ID(A.astype) == ID(A): {aid(A.astype(np.float64)) == aid(A)}')
print(f'ID(A.real.astype) == ID(A): {aid(A.real.astype(np.float64)) == aid(A)}')
this returns
ID(A.real) == ID(A): True
ID(A.imag) == ID(A): False
ID(A.astype) == ID(A): False
ID(A.real.astype) == ID(A): False
This seems to indicate that A.real has the same memory address as A, while A.astype(np.float64) does not. Could this be causing this behavior? However, A and A.imag have different memory addresses, and the matmul is still very slow.
Is this a bug?
Is the solution with A.real.astype(np.float64) the one I should use?
It's not memory location or layout that matters. It's the route that @ chooses depending on the type of input. A.real is not a "new" array; it's a method of accessing the real values of the complex dtype. I don't have time to present my full timing results, but here are a couple of quick results:
In [2]: timeit A@B
436 ms ± 32.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
adding the real slows this way down:
In [3]: timeit A.real@B
3.93 s ± 4.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [4]: %%timeit a = A.real
...: a@B
3.92 s ± 3.29 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
But making a new float array speeds things up:
In [5]: %%timeit a = A.real.copy()
...: a@B
101 ms ± 496 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
A.real does not actually do any calculations:
In [6]: timeit A.real
203 ns ± 9.42 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
And making a new array from the real values isn't that slow:
In [7]: timeit A.real.copy()
239 µs ± 3.54 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Including the copy inline doesn't hurt time:
In [8]: timeit A.real.copy()@B
102 ms ± 1.08 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
dot isn't bothered by the real:
In [9]: timeit A.real.dot(B)
106 ms ± 3.49 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
A.dot(B) times the same as A@B.
So complex evaluation is about 4x slower than float. Given that A has twice as many values, that sounds reasonable.
dot handles the real correctly, extracting the real values without much fuss.
@ has some sort of bug, sending it off on a slow track.
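For completeness, a quick way to inspect what A.real actually is, and a tidier spelling of the copy-based workaround (my own sketch; np.ascontiguousarray behaves like .copy() here since the view is non-contiguous):
print(A.real.flags['C_CONTIGUOUS'])     # False: A.real is a strided view into the complex buffer
print(A.real.strides)                   # (16000, 16), while a plain float64 array would have (8000, 8)
out = np.ascontiguousarray(A.real) @ B  # contiguous float64 copy, comparable to A.real.copy() @ B above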
With numpy, what's the fastest way to generate an array from -n to n, excluding 0, where n is an integer?
Here is one solution, but I am not sure it is the fastest:
n = 100000
np.concatenate((np.arange(-n, 0), np.arange(1, n+1)))
An alternative approach is to create the range -n to n-1, then add 1 to the elements starting from zero.
def non_zero_range(n):
    # The 2nd argument to np.arange is exclusive, so it should be n and not n-1
    a = np.arange(-n, n)
    a[n:] += 1
    return a
n=1000000
%timeit np.concatenate((np.arange(-n,0), np.arange(1,n+1)))
# 4.28 ms ± 9.46 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit non_zero_range(n)
# 2.84 ms ± 13.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
I think the reduced response time is due to only creating one array, not three as in the concatenate approach.
Edit
Thanks, everyone. I edited my post and updated the test timings.
Interesting problem.
Experiment
I did it in my Jupyter notebook. All of the solutions use the numpy API. You can run the experiments in the following code yourself.
For time measurement in a Jupyter notebook, please see: Simple way to measure cell execution time in ipython notebook
Original np.concatenate
%%timeit
n = 100000
t = np.concatenate((np.arange(-n, 0), np.arange(1, n+1)))
#175 µs ± 2.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Sol 1. np.delete
%%timeit
n = 100000
a = np.arange(-n, n+1)
b = np.delete(a, n)
# 179 µs ± 5.66 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Sol 2. List comprehension + np.array
%%timeit
n = 100000
c = np.array([x for x in range(-n, n+1) if x != 0])
# 16.6 ms ± 693 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Conclusion
There's no big difference between the original and solution 1, but solution 2 is by far the worst of the three. I'm looking for faster solutions, too.
Reference
For those who are:
interested in initializing and filling a numpy array: Best way to initialize and fill an numpy array?
confused about is vs ==: The Difference Between "is" and "==" in Python
I have a numpy array and I'd like to check if it is sorted.
>>> a = np.array([1,2,3,4,5])
>>> a
array([1, 2, 3, 4, 5])
np.all(a[:-1] <= a[1:])
Examples:
is_sorted = lambda a: np.all(a[:-1] <= a[1:])
>>> a = np.array([1,2,3,4,9])
>>> is_sorted(a)
True
>>> a = np.array([1,2,3,4,3])
>>> is_sorted(a)
False
With NumPy tools:
np.all(np.diff(a) >= 0)
but numpy solutions are all O(n).
If you want short code and a very quick answer on unsorted arrays:
import numba

@numba.jit
def is_sorted(a):
    for i in range(a.size-1):
        if a[i+1] < a[i]:
            return False
    return True
which is O(1) on average on random arrays (it returns as soon as the first out-of-order pair is found).
The inefficient but easy-to-type solution:
(a == np.sort(a)).all()
For completeness, the O(log n) iterative solution is found below. The recursive version is slower and crashes with big vector sizes. However, it is still slower than the native numpy using np.all(a[:-1] <= a[1:]) most likely due to modern CPU optimizations. The only case where the O(log n) is faster is on the "average" random case or if it is "almost" sorted. If you suspect your array is already fully sorted then np.all will be faster.
def is_sorted(a):
    idx = [(0, a.size - 1)]
    while idx:
        i, j = idx.pop(0)  # Breadth-First will find almost-sorted in O(log N)
        if i >= j:
            continue
        elif a[i] > a[j]:
            return False
        elif i + 1 == j:
            continue
        else:
            mid = (i + j) >> 1  # Division by 2 with floor
            idx.append((i, mid))
            idx.append((mid, j))
    return True
is_sorted2 = lambda a: np.all(a[:-1] <= a[1:])
Here are the results:
# Already sorted array - np.all is super fast
sorted_array = np.sort(np.random.rand(1000000))
%timeit is_sorted(sorted_array)
659 ms ± 3.28 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit is_sorted2(sorted_array)
431 µs ± 35.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# Here I included the random in each command so we need to subtract its timing
%timeit np.random.rand(1000000)
6.08 ms ± 17.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit is_sorted(np.random.rand(1000000))
6.11 ms ± 58.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# Without random part, it took 6.11 ms - 6.08 ms = 30µs per loop
%timeit is_sorted2(np.random.rand(1000000))
6.83 ms ± 75.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# Without random part, it took 6.83 ms - 6.08 ms = 750µs per loop
Net, an O(n) vectorized code is better than an O(log n) algorithm, unless you will be running arrays of more than ~100 million elements.
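As a side note, one can also combine the vectorized check with early exit by scanning in chunks, similar in spirit to the chunked approach shown earlier on this page (a sketch; the chunk size is an arbitrary choice):
def is_sorted_chunked(a, chunk=4096):
    # vectorized comparison within each chunk, early exit as soon as an
    # out-of-order pair is found; the +1 overlap covers chunk boundaries
    for i in range(0, a.size - 1, chunk):
        block = a[i:i + chunk + 1]
        if not np.all(block[:-1] <= block[1:]):
            return False
    return True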