I want to add all nonzero elements from a numpy array arr to a list out_list. Previous research suggests that for numpy arrays, using np.nonzero is most efficient. (My own benchmark below actually suggests it can be slightly improved using np.delete).
However, in my case I want my output to be a list, because I am combining many arrays for which I don't know the number of nonzero elements (so I can't effectively preallocate a numpy array for them). Hence, I was wondering whether there are some synergies that can be exploited to speed up the process. While my naive list comprehension approach is much slower than the pure numpy approach, I got some promising results combining list comprehension with numba.
Here's what I found so far:
import numpy as np
from numba import njit
n = 60_000 # size of array
nz = 0.3 # fraction of zero elements
arr = (np.random.random_sample(n) - nz).clip(min=0)
# method 1
def add_to_list1(arr, out):
    out.extend(list(arr[np.nonzero(arr)]))

# method 2
def add_to_list2(arr, out):
    out.extend(list(np.delete(arr, arr == 0)))

# method 3
def add_to_list3(arr, out):
    out += [x for x in arr if x != 0]

# method 4 (not sure how to get numba to accept an empty list as argument)
@njit
def add_to_list4(arr):
    return [x for x in arr if x != 0]
out_list = []
%timeit add_to_list1(arr, out_list)
out_list = []
%timeit add_to_list2(arr, out_list)
out_list = []
%timeit add_to_list3(arr, out_list)
_ = add_to_list4(arr) # call once to compile
out_list = []
%timeit out_list.extend(add_to_list4(arr))
Yielding the following results:
2.51 ms ± 137 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
2.19 ms ± 133 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
15.6 ms ± 183 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
1.63 ms ± 158 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Not surprisingly, numba outperforms all other methods. Among the rest, method 2 (using np.delete) is the best. Am I missing any obvious alternative that exploits the fact that I am converting to a list afterwards? Can you think of anything to further speed up the process?
Edit 1:
Performance of .tolist():
# method 5
def add_to_list5(arr, out):
    out += arr[arr != 0].tolist()

# method 6
def add_to_list6(arr, out):
    out += np.delete(arr, arr == 0).tolist()

# method 7
def add_to_list7(arr, out):
    out += arr[arr.astype(bool)].tolist()
Timings are on par with numba:
1.62 ms ± 118 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.65 ms ± 104 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.78 ms ± 119 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Edit 2:
Here's some benchmarking using Mad Physicist's suggestion to use np.concatenate to construct a numpy array instead.
import time

# construct numpy array using np.concatenate
out_list = []
t = time.perf_counter()
for i in range(100):
    out_list.append(arr[arr != 0])
result = np.concatenate(out_list)
print(f"Time elapsed: {time.perf_counter() - t:.4f}s")

# compare with best list-based method
out_list = []
t = time.perf_counter()
for i in range(100):
    out_list += arr[arr != 0].tolist()
print(f"Time elapsed: {time.perf_counter() - t:.4f}s")
Concatenating numpy arrays indeed yields another significant speed-up, although it is not directly comparable since the output is a numpy array instead of a list. So which approach is best will depend on the precise use case.
Time elapsed: 0.0400s
Time elapsed: 0.1430s
TLDR;
1/ using arr[arr != 0] is the fastest of all the indexing options
2/ using .tolist() instead of list(.) speeds things up by a factor of 1.3 - 1.5
3/ with the gains of 1/ and 2/ combined, the speed is on par with numba
4/ if having a numpy array instead of a list is acceptable, then using np.concatenate yields another gain in speed by a factor of ~3.5 compared to the best alternative
I submit that the method of choice, if you are indeed looking for a list output, is:
def f(arr, out_list):
    out_list += arr[arr != 0].tolist()
It seems to beat all the other methods mentioned so far in the OP's question or in other responses (at the time of this writing).
If, however, you are looking for a result as a numpy array, then @MadPhysicist's version (slightly modified to use arr[arr != 0] instead of np.nonzero()) is almost 6x faster; see the end of this post.
Side note: I would avoid using %timeit out_list.extend(some_list): it keeps adding to out_list during the many loops of timeit. Example:
out_list = []
%timeit out_list.extend([1,2,3])
and now:
>>> len(out_list)
243333333 # yikes
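One way around this (a sketch of my own, not part of the timings below) is to time an expression that does not mutate any accumulator, e.g. with the plain timeit module:

import timeit
import numpy as np

# `[] + ...` builds a fresh throwaway list on every repetition, so nothing
# accumulates across timing runs; arr mirrors the question's setup.
arr = (np.random.random_sample(60_000) - 0.3).clip(min=0)
t = timeit.timeit(lambda: [] + arr[arr != 0].tolist(), number=1000)
print(f"{t / 1000 * 1e3:.3f} ms per call")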
Timings
On 60K items on my machine, I see:
out_list = []
a = %timeit -o out_list + arr[arr != 0].tolist()
b = %timeit -o out_list + arr[np.nonzero(arr)].tolist()
c = %timeit -o out_list + list(arr[np.nonzero(arr)])
Yields:
1.23 ms ± 10.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.53 ms ± 2.53 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
4.29 ms ± 3.02 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
And:
>>> c.average / a.average
3.476
>>> b.average / a.average
1.244
For a numpy array result instead
Following @MadPhysicist, you can get some extra boost by not turning the arrays into lists, but using np.concatenate() instead:
def all_nonzero(arr_iter):
    """return non zero elements of all arrays as a np.array"""
    return np.concatenate([a[a != 0] for a in arr_iter])

def all_nonzero_list(arr_iter):
    """return non zero elements of all arrays as a list"""
    out_list = []
    for a in arr_iter:
        out_list += a[a != 0].tolist()
    return out_list
from itertools import repeat
ta = %timeit -o all_nonzero(repeat(arr, 100))
tl = %timeit -o all_nonzero_list(repeat(arr, 100))
Yields:
39.7 ms ± 107 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
227 ms ± 680 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
and
>>> tl.average / ta.average
5.75
Instead of extending a list by all of the elements of a new array, append the array itself. This will make for far fewer and smaller reallocations. You can also pre-allocate a list of Nones up-front, or even use an object array, if you have an upper bound on the number of arrays you will process (a sketch of that variant appears further below).
When you're done, call np.concatenate on the list.
So instead of this:
L = []
for i in range(10):
    arr = (np.random.random_sample(n) - nz).clip(min=0)
    L.extend(arr[np.nonzero(arr)])
result = np.array(L)
Try this:
L = []
for i in range(10):
    arr = (np.random.random_sample(n) - nz).clip(min=0)
    L.append(arr[np.nonzero(arr)])
result = np.concatenate(L)
Since you're keeping arrays around, the final concatenation will be a series of buffer copies (which is fast), rather than a bunch of Python-to-numpy type conversions (which won't be). The exact method you choose for deletion is of course still up to the result of your benchmark.
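The pre-allocation variant mentioned above could look roughly like this (a sketch under the assumption that you know an upper bound on the number of arrays; n_chunks and the other names are mine, not from the answer):

import numpy as np

n, nz = 60_000, 0.3
n_chunks = 10                  # assumed upper bound on the number of arrays
chunks = [None] * n_chunks     # pre-allocated slots, so the list never reallocates
used = 0
for i in range(n_chunks):
    arr = (np.random.random_sample(n) - nz).clip(min=0)
    chunks[used] = arr[np.nonzero(arr)]
    used += 1
result = np.concatenate(chunks[:used])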
Also, here's another method to add to your benchmark:
def add_to_list5(arr, out):
    out.extend(list(arr[arr.astype(bool)]))
I don't expect this to be overwhelmingly fast, but it's interesting to see how masking stacks up next to indexing.
With numpy, what's the fastest way to generate an array from -n to n, excluding 0, where n is an integer?
Here is one solution, but I am not sure it is the fastest:
n = 100000
np.concatenate((np.arange(-n, 0), np.arange(1, n+1)))
An alternative approach is to create the range -n to n-1, then add 1 to the elements from zero onward.
def non_zero_range(n):
    # the 2nd argument to np.arange is exclusive, so it should be n and not n-1
    a = np.arange(-n, n)
    a[n:] += 1
    return a
n=1000000
%timeit np.concatenate((np.arange(-n,0), np.arange(1,n+1)))
# 4.28 ms ± 9.46 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit non_zero_range(n)
# 2.84 ms ± 13.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
I think the reduced response time is due to only creating one array, not three as in the concatenate approach.
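As a quick sanity check (my own addition, not part of the original answer), both constructions agree element by element:

import numpy as np

def non_zero_range(n):
    a = np.arange(-n, n)
    a[n:] += 1
    return a

n = 5
expected = np.concatenate((np.arange(-n, 0), np.arange(1, n + 1)))
np.testing.assert_array_equal(non_zero_range(n), expected)
print(non_zero_range(n))  # [-5 -4 -3 -2 -1  1  2  3  4  5]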
Edit
Thanks, everyone. I edited my post and updated the test timings.
Interesting problem.
Experiment
I ran the experiments in my Jupyter notebook. All of them use the numpy API, and you can reproduce them yourself with the code below.
For time measurement in a Jupyter notebook, please see: Simple way to measure cell execution time in ipython notebook
Original np.concatenate
%%timeit
n = 100000
t = np.concatenate((np.arange(-n, 0), np.arange(1, n+1)))
#175 µs ± 2.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Sol 1. np.delete
%%timeit
n = 100000
a = np.arange(-n, n+1)
b = np.delete(a, n)
# 179 µs ± 5.66 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Sol 2. List comprehension + np.array
%%timeit
n = 100000
c = np.array([x for x in range(-n, n+1) if x != 0])
# 16.6 ms ± 693 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Conclusion
There's no big difference between the original and solution 1, but solution 2 is clearly the worst of the three. I'm looking for faster solutions, too.
Reference
For those who are:
interested in how to initialize and fill a numpy array: Best way to initialize and fill an numpy array?
confused about is vs ==: The Difference Between "is" and "==" in Python
I have a 1d array, and I need to remove all trailing blocks of 8 zeros.
[0,1,1,0,1,0,0,0, 0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0]
->
[0,1,1,0,1,0,0,0]
a.shape[0] % 8 == 0 always, so no worries about that.
Is there a better way to do it?
import numpy as np

P = 8
arr1 = np.random.randint(2, size=np.random.randint(5, 10) * P)
arr2 = np.random.randint(1, size=np.random.randint(5, 10) * P)
arr = np.concatenate((arr1, arr2))

indexes = []
arr = np.flip(arr).reshape(arr.shape[0] // P, P)
for i, f in enumerate(arr):
    if (f == 0).all():
        indexes.append(i)
    else:
        break
arr = np.delete(arr, indexes, axis=0)
arr = np.flip(arr.reshape(arr.shape[0] * P))
You can do it without allocating more space by using views and np.argmax to get the last nonzero element:
index = arr.size - np.argmax(arr[::-1])
Rounding up to the nearest multiple of eight is easy:
index = int(np.ceil(index / 8)) * 8
Now chop off the rest:
arr = arr[:index]
Or as a one-liner:
arr = arr[:int(np.ceil((arr.size - np.argmax(arr[::-1])) / 8)) * 8]
This version is O(n) in time and O(1) in space because it reuses the same buffers for everything (including the output).
This has the additional advantage that it will work correctly even if there are no trailing zeros. Using argmax this way does rely on all the nonzero elements having the same value (here 1), though. If that is not the case, you will need to compute a boolean mask first, e.g. with arr.astype(bool).
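For illustration, here is a hedged sketch of that mask-based variant; the example values are mine, not from the question:

import numpy as np

# Without the mask, argmax would locate the largest value (the 7) instead of
# the last nonzero element (the 2), and the second block would be dropped.
arr = np.array([0., 7., 0., 0., 0., 0., 0., 0.,
                0., 0., 0., 2., 0., 0., 0., 0.,
                0., 0., 0., 0., 0., 0., 0., 0.])
index = arr.size - np.argmax(arr[::-1].astype(bool))  # one past the last nonzero
index = int(np.ceil(index / 8)) * 8                   # round up to a multiple of 8
print(arr[:index])  # keeps the first two blocks of 8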
If you want to use your original approach, you could vectorize that too, although there will be a bit more overhead:
view = arr.reshape(-1, 8)
mask = view.any(axis = 1)
index = view.shape[0] - np.argmax(mask[::-1])
arr = arr[:index * 8]
There is a numpy function that does almost what you want, np.trim_zeros. We can use it:
import numpy as np

def trim_mod(a, m=8):
    t = np.trim_zeros(a, 'b')
    return a[:len(a) - (len(a) - len(t)) // m * m]

def test(a, t, m=8):
    assert (len(a) - len(t)) % m == 0
    assert len(t) < m or np.any(t[-m:])
    assert not np.any(a[len(t):])

for _ in range(1000):
    a = (np.random.random(np.random.randint(10, 100000)) < 0.002).astype(int)
    m = np.random.randint(4, 20)
    t = trim_mod(a, m)
    test(a, t, m)
print("Looks correct")
Prints:
Looks correct
It seems to scale linearly in the number of trailing zeros (plot not reproduced here; units are ms per trial), but it feels rather slow in absolute terms, so maybe np.trim_zeros is just a Python loop.
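One quick way to check that suspicion (my own aside) is to inspect the implementation directly:

import inspect
import numpy as np

# np.trim_zeros is a Python-level function, so its source can be printed;
# this shows whether it loops in Python or defers to vectorized numpy calls.
print(inspect.getsource(np.trim_zeros))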
Code for the picture:
from timeit import timeit

A = (np.random.random(1000000) < 0.02).astype(int)
m = 8
T = []
for last in range(1, 1000, 9):
    A[-last:] = 0
    A[-last] = 1
    T.append(timeit(lambda: trim_mod(A, m), number=100) * 10)

import pylab
pylab.plot(range(1, 1000, 9), T)
pylab.show()
A low-level approach:
import numba

@numba.njit
def trim8(a):
    n = a.size - 1
    while n >= 0 and a[n] == 0:
        n -= 1
    c = (n // 8 + 1) * 8
    return a[:c]
Some tests:
In [194]: A[-1]=1 # best case
In [196]: %timeit trim_mod(A,8)
5.7 µs ± 323 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [197]: %timeit trim8(A)
714 ns ± 33.7 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [198]: %timeit A[:(A.size - np.argmax(A[::-1]) // 8) * 8]
4.83 ms ± 479 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [202]: A[:]=0 #worst case
In [203]: %timeit trim_mod(A,8)
2.5 s ± 49.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [204]: %timeit trim8(A)
1.14 ms ± 71.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [205]: %timeit A[:(A.size - np.argmax(A[::-1]) // 8) * 8]
5.5 ms ± 950 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
It has a short circuit mechanism like trim_zeros, but is much faster.
I have a numpy array and I would like to check whether it is sorted.
>>> a = np.array([1, 2, 3, 4, 5])
>>> a
array([1, 2, 3, 4, 5])
np.all(a[:-1] <= a[1:])
Examples:
is_sorted = lambda a: np.all(a[:-1] <= a[1:])
>>> a = np.array([1,2,3,4,9])
>>> is_sorted(a)
True
>>> a = np.array([1,2,3,4,3])
>>> is_sorted(a)
False
With NumPy tools:
np.all(np.diff(a) >= 0)
but numpy solutions are all O(n).
If you want short code and a very quick verdict on unsorted arrays:
import numba

@numba.jit
def is_sorted(a):
    for i in range(a.size - 1):
        if a[i + 1] < a[i]:
            return False
    return True
which is O(1) on average for random arrays.
The inefficient but easy-to-type solution:
(a == np.sort(a)).all()
For completeness, the O(log n) iterative solution is found below. The recursive version is slower and crashes with big vector sizes. However, it is still slower than the native numpy using np.all(a[:-1] <= a[1:]) most likely due to modern CPU optimizations. The only case where the O(log n) is faster is on the "average" random case or if it is "almost" sorted. If you suspect your array is already fully sorted then np.all will be faster.
def is_sorted(a):
    idx = [(0, a.size - 1)]
    while idx:
        i, j = idx.pop(0)  # Breadth-First will find almost-sorted in O(log N)
        if i >= j:
            continue
        elif a[i] > a[j]:
            return False
        elif i + 1 == j:
            continue
        else:
            mid = (i + j) >> 1  # Division by 2 with floor
            idx.append((i, mid))
            idx.append((mid, j))
    return True
is_sorted2 = lambda a: np.all(a[:-1] <= a[1:])
Here are the results:
# Already sorted array - np.all is super fast
sorted_array = np.sort(np.random.rand(1000000))
%timeit is_sorted(sorted_array)
659 ms ± 3.28 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit is_sorted2(sorted_array)
431 µs ± 35.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# Here I included the random generation in each command, so we need to subtract its timing
%timeit np.random.rand(1000000)
6.08 ms ± 17.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit is_sorted(np.random.rand(1000000))
6.11 ms ± 58.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# Without random part, it took 6.11 ms - 6.08 ms = 30µs per loop
%timeit is_sorted2(np.random.rand(1000000))
6.83 ms ± 75.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# Without random part, it took 6.83 ms - 6.08 ms = 750µs per loop
Net, an O(n) vectorized implementation is better than the O(log n) algorithm, unless you will be running it on arrays of more than ~100 million elements.
I have a three-dimensional array like
A = np.array([[[1, 1],
               [1, 0]],
              [[1, 2],
               [1, 0]],
              [[1, 0],
               [0, 0]]])
Now I would like to obtain an array that has a nonzero value in a given position if only a unique nonzero value (or zero) occurs in that position. It should have zero if only zeros or more than one nonzero value occur in that position. For the example above, I would like
[[1,0],
[1,0]]
since
in A[:,0,0] there are only 1s
in A[:,0,1] there are 0, 1 and 2, so more than one nonzero value
in A[:,1,0] there are 0 and 1, so 1 is retained
in A[:,1,1] there are only 0s
I can find how many nonzero elements there are with np.count_nonzero(A, axis=0), but I would like to keep 1s or 2s even if there are several of them. I looked at np.unique but it doesn't seem to support what I'd like to do.
Ideally, I'd like a function like np.count_unique(A, axis=0) which would return an array in the original shape, e.g. [[1, 3],[2, 1]], so I could check whether 3 or more occur and then ignore that position.
All I could come up with was a list comprehension iterating over the two trailing dimensions of the array I'd like to obtain:
[[len(np.unique(A[:, i, j])) for j in range(A.shape[2])] for i in range(A.shape[1])]
Any other ideas?
You can use np.diff to stay at numpy level for the second task.
def diffcount(A):
    B = A.copy()
    B.sort(axis=0)
    C = np.diff(B, axis=0) > 0
    D = C.sum(axis=0) + 1
    return D

# [[1 3]
#  [2 1]]
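To see why this counts unique values, here is a walk-through on one column of the question's array (my own illustration):

import numpy as np

col = np.array([1, 2, 0])             # A[:, 0, 1] from the question
col.sort()                             # [0, 1, 2]
# each strictly positive step between consecutive sorted elements marks a new
# distinct value; add 1 for the first value
print((np.diff(col) > 0).sum() + 1)    # 3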
It seems to be a little faster on big arrays:
In [62]: A=np.random.randint(0,100,(100,100,100))
In [63]: %timeit diffcount(A)
46.8 ms ± 769 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [64]: %timeit [[len(np.unique(A[:, i, j])) for j in range(A.shape[2])] for i in range(A.shape[1])]
149 ms ± 700 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Finally, counting unique values is simpler than sorting, so in principle a factor of ln(A.shape[0]) can be won.
One way to try to win this factor is to use the set mechanism:
In [81]: %timeit np.apply_along_axis(lambda a: len(set(a)), 0, A)
183 ms ± 1.17 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Unfortunately, this is not faster.
Another way is to do it by hand:
def countunique(A, Amax):
    res = np.empty(A.shape[1:], A.dtype)
    c = np.empty(Amax + 1, A.dtype)
    for i in range(A.shape[1]):
        for j in range(A.shape[2]):
            T = A[:, i, j]
            for k in range(c.size):
                c[k] = 0
            for x in T:
                c[x] = 1
            res[i, j] = c.sum()
    return res
At python level:
In [70]: %timeit countunique(A,100)
429 ms ± 18.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Which is not so bad for a pure Python approach. Then just push this code down to low level with numba:
import numba
countunique2=numba.jit(countunique)
In [71]: %timeit countunique2(A,100)
3.63 ms ± 70.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Which will be difficult to improve a lot.
One approach would be to use A as the first-axis indices for setting a boolean array of the same lengths along the other two axes, and then simply counting the non-zeros along its first axis. Two variants are possible: one keeping it 3D, and another reshaping into 2D for some performance benefit, as indexing into 2D is faster. Thus, the two implementations would be -
def nunique_axis0_maskcount_app1(A):
    m, n = A.shape[1:]
    mask = np.zeros((A.max()+1, m, n), dtype=bool)
    mask[A, np.arange(m)[:, None], np.arange(n)] = 1
    return mask.sum(0)

def nunique_axis0_maskcount_app2(A):
    m, n = A.shape[1:]
    A.shape = (-1, m*n)
    maxn = A.max()+1
    N = A.shape[1]
    mask = np.zeros((maxn, N), dtype=bool)
    mask[A, np.arange(N)] = 1
    A.shape = (-1, m, n)
    return mask.sum(0).reshape(m, n)
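As a quick usage check (my addition, assuming the two functions above are in scope), both variants reproduce the unique-count grid from the question:

import numpy as np

A = np.array([[[1, 1], [1, 0]],
              [[1, 2], [1, 0]],
              [[1, 0], [0, 0]]])
print(nunique_axis0_maskcount_app1(A))  # [[1 3]
                                        #  [2 1]]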
Runtime test -
In [154]: A = np.random.randint(0,100,(100,100,100))
# @B. M.'s soln
In [155]: %timeit f(A)
10 loops, best of 3: 28.3 ms per loop
# @B. M.'s soln using slicing: (B[1:] != B[:-1]).sum(0)+1
In [156]: %timeit f2(A)
10 loops, best of 3: 26.2 ms per loop
In [157]: %timeit nunique_axis0_maskcount_app1(A)
100 loops, best of 3: 12 ms per loop
In [158]: %timeit nunique_axis0_maskcount_app2(A)
100 loops, best of 3: 9.14 ms per loop
Numba method
Using the same strategy as used for nunique_axis0_maskcount_app2 with directly getting the counts at C-level with numba, we would have -
from numba import njit
@njit
def nunique_loopy_func(mask, N, A, p, count):
    for j in range(N):
        mask[:] = True
        mask[A[0, j]] = False
        c = 1
        for i in range(1, p):
            if mask[A[i, j]]:
                c += 1
                mask[A[i, j]] = False
        count[j] = c
    return count
def nunique_axis0_numba(A):
    p, m, n = A.shape
    A.shape = (-1, m*n)
    maxn = A.max() + 1
    N = A.shape[1]
    mask = np.empty(maxn, dtype=bool)
    count = np.empty(N, dtype=int)
    out = nunique_loopy_func(mask, N, A, p, count).reshape(m, n)
    A.shape = (-1, m, n)
    return out
Runtime test -
In [328]: np.random.seed(0)
In [329]: A = np.random.randint(0,100,(100,100,100))
In [330]: %timeit nunique_axis0_maskcount_app2(A)
100 loops, best of 3: 11.1 ms per loop
# @B.M.'s numba soln
In [331]: %timeit countunique2(A,A.max()+1)
100 loops, best of 3: 3.43 ms per loop
# Numba soln posted in this post
In [332]: %timeit nunique_axis0_numba(A)
100 loops, best of 3: 2.76 ms per loop