Efficiently count zero elements in numpy array? - python

I need to count the number of zero elements in numpy arrays. I'm aware of the numpy.count_nonzero function, but there appears to be no analog for counting zero elements.
My arrays are not very large (typically less than 1E5 elements), but the operation is performed several million times.
Of course I could use len(arr) - np.count_nonzero(arr), but I wonder if there's a more efficient way to do it.
Here's a MWE of how I do it currently:
import numpy as np
import timeit

arrs = []
for _ in range(1000):
    arrs.append(np.random.randint(-5, 5, 10000))

def func1():
    for arr in arrs:
        zero_els = len(arr) - np.count_nonzero(arr)

print(timeit.timeit(func1, number=10))

An approach about 2x faster is to still use np.count_nonzero(), but pass it the condition you actually want to count:
In [3]: arr
Out[3]:
array([[1, 2, 0, 3],
[3, 9, 0, 4]])
In [4]: np.count_nonzero(arr==0)
Out[4]: 2
In [5]: def func_cnt():
   ...:     for arr in arrs:
   ...:         zero_els = np.count_nonzero(arr == 0)
   ...:         # here, it actually counts the zeros: arr == 0 is True exactly where arr is 0
You can also use np.where(), but it's slower than np.count_nonzero():
In [6]: np.where(arr == 0)
Out[6]: (array([0, 1]), array([2, 2]))
In [7]: len(np.where(arr == 0)[0])
Out[7]: 2
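(The timings below also reference a func_where helper that is not shown in the original; a minimal sketch of what it presumably looks like, mirroring func1 and func_cnt:)

def func_where():
    for arr in arrs:
        zero_els = len(np.where(arr == 0)[0])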
Efficiency (fastest first):
In [8]: %timeit func_cnt()
10 loops, best of 3: 29.2 ms per loop
In [9]: %timeit func1()
10 loops, best of 3: 46.5 ms per loop
In [10]: %timeit func_where()
10 loops, best of 3: 61.2 ms per loop
More speedups with accelerators
It is now possible to get a speedup of more than three orders of magnitude with JAX if you have access to accelerators (GPU/TPU). Another advantage of JAX is that the NumPy code needs very little modification to become JAX compatible. Below is a reproducible example:
In [1]: import numpy as np
In [2]: import jax.numpy as jnp
In [3]: from jax import jit

# set up inputs
In [4]: arrs = []
In [5]: for _ in range(1000):
   ...:     arrs.append(np.random.randint(-5, 5, 10000))

# JIT'd function that performs the counting task
In [6]: @jit
   ...: def func_cnt():
   ...:     for arr in arrs:
   ...:         zero_els = jnp.count_nonzero(arr == 0)

# efficiency test
In [7]: %timeit func_cnt()
15.6 µs ± 391 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
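One caveat worth noting (my addition, not from the original answer): JAX dispatches work asynchronously, so for a fair measurement of a single call you generally want to move the data onto the device once and call block_until_ready() on the result. A minimal sketch, reusing the setup above:

import numpy as np
import jax.numpy as jnp
from jax import jit

@jit
def count_zeros(x):
    return jnp.count_nonzero(x == 0)

arr = np.random.randint(-5, 5, 10000)
arr_dev = jnp.asarray(arr)                           # copy the data to the accelerator once
n_zeros = count_zeros(arr_dev).block_until_ready()   # block so timings include the actual compute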

Related

Fastest way in numpy to get distance of product of n pairs in array

I have N points, for example:
A = [2, 3]
B = [3, 4]
C = [3, 3]
.
.
.
And they're in an array like so:
arr = np.array([[2, 3], [3, 4], [3, 3]])
I need as output all pairwise distances in BFS (Breadth First Search) order to track which distance is which, like: A->B, A->C, B->C. For the above example data, the result would be [1.41, 1.0, 1.0].
EDIT: I have to accomplish it with numpy or core libraries.
If you can use it, SciPy has a function for this:
In [2]: from scipy.spatial.distance import pdist
In [3]: pdist(arr)
Out[3]: array([1.41421356, 1. , 1. ])
Here's a numpy-only solution (fair warning: it requires a lot of memory, unlike pdist)...
dists = np.triu(np.linalg.norm(arr - arr[:, None], axis=-1)).flatten()
dists = dists[dists != 0]
Demo:
In [4]: arr = np.array([[2, 3], [3, 4], [3, 3], [5, 2], [4, 5]])
In [5]: pdist(arr)
Out[5]:
array([1.41421356, 1. , 3.16227766, 2.82842712, 1. ,
2.82842712, 1.41421356, 2.23606798, 2.23606798, 3.16227766])
In [6]: dists = np.triu(np.linalg.norm(arr - arr[:, None], axis=-1)).flatten()
In [7]: dists = dists[dists != 0]
In [8]: dists
Out[8]:
array([1.41421356, 1. , 3.16227766, 2.82842712, 1. ,
2.82842712, 1.41421356, 2.23606798, 2.23606798, 3.16227766])
Timings (with the solution above wrapped in a function called triu):
In [9]: %timeit pdist(arr)
7.27 µs ± 738 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [10]: %timeit triu(arr)
25.5 µs ± 4.58 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
As an alternative, similar to ddejohn's answer, we can use np.triu_indices, which returns just the upper-triangular indices of the matrix and may be more memory-efficient:
np.linalg.norm(arr - arr[:, None], axis=-1)[np.triu_indices(arr.shape[0], 1)]
This avoids the extra flattening and boolean-indexing steps. Its performance is similar to the previous answer for large data (e.g. with arr = np.random.rand(10000, 2) on Colab, both finish in about 4.6 s); it may beat the np.triu-and-flatten approach on even larger data.
I tested the memory usage once with memory-profiler (results not reproduced here), but this should be re-checked if memory usage is important to you (I'm not sure).
Update:
I tried limiting the calculations to just the upper triangle, which speeds the code up 2 to 3 times on the tested arrays. As the array size grows, the performance gap between this loop and the previous np.triu_indices and np.triu methods grows and becomes more obvious:
ind = np.arange(arr.shape[0] - 1)
sub_ind = ind + 1
result = np.zeros(sub_ind.sum())
j = 0
for i in range(ind.shape[0]):
    result[j:j+ind[-1-i]+1] = np.linalg.norm(arr[ind[i]] - arr[sub_ind[i]:], axis=-1)
    j += ind[-1-i]+1
This also reduces memory consumption by at least ~4x, so this method makes it possible to work on larger arrays, and more quickly.
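For reuse, the loop above can be wrapped in a function (a straightforward repackaging of the snippet, not part of the original answer); it yields distances in the same order as pdist:

import numpy as np

def pairwise_upper(arr):
    n = arr.shape[0]
    result = np.zeros(n * (n - 1) // 2)
    j = 0
    for i in range(n - 1):
        # distances from point i to every later point
        block = np.linalg.norm(arr[i] - arr[i + 1:], axis=-1)
        result[j:j + block.shape[0]] = block
        j += block.shape[0]
    return result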
Benchmarks:
# arr = np.random.rand(100, 2)
100 loops, best of 5: 459 µs per loop (ddejohn's --> np.triu & flatten)
100 loops, best of 5: 528 µs per loop (mine --> np.triu_indices)
100 loops, best of 5: 1.42 ms per loop (This method)
--------------------------------------
# arr = np.random.rand(1000, 2)
10 loops, best of 5: 49.9 ms per loop
10 loops, best of 5: 49.7 ms per loop
10 loops, best of 5: 30.4 ms per loop (~x1.7) The fastest
--------------------------------------
# arr = np.random.rand(10000, 2)
2 loops, best of 5: 4.56 s per loop
2 loops, best of 5: 4.6 s per loop
2 loops, best of 5: 1.85 s per loop (~x2.5) The fastest

Multiple cumulative sum within a numpy array

I'm sort of a newbie in numpy, so I'm sorry if this question was already asked. I'm looking for a vectorized solution that runs multiple cumulative sums of different sizes within a one-dimensional numpy array.
my_vector=np.array([1,2,3,4,5])
size_of_groups=np.array([3,2])
I would like something like
np.cumsum.group(my_vector,size_of_groups)
[1,3,6,4,9]
I do not want a solution with loops; only numpy functions or numpy operations.
Not sure about numpy, but pandas can do this pretty easily with a groupby + cumsum:
import pandas as pd
s = pd.Series(my_vector)
s.groupby(s.index.isin(size_of_groups.cumsum()).cumsum()).cumsum()
0 1
1 3
2 6
3 4
4 9
dtype: int64
Here's a vectorized solution -
def intervaled_cumsum(ar, sizes):
    # Make a copy to be used as output array
    out = ar.copy()

    # Get cumulative values of array
    arc = ar.cumsum()

    # Get cumsummed indices to be used to place differentiated values into
    # input array's copy
    idx = sizes.cumsum()

    # Place differentiated values that, when cumulatively summed later on, will
    # give us the desired intervaled cumsum
    out[idx[0]] = ar[idx[0]] - arc[idx[0]-1]
    out[idx[1:-1]] = ar[idx[1:-1]] - np.diff(arc[idx[:-1]-1])
    return out.cumsum()
Sample run -
In [114]: ar = np.array([1,2,3,4,5,6,7,8,9,10,11,12])
...: sizes = np.array([3,2,2,3,2])
In [115]: intervaled_cumsum(ar, sizes)
Out[115]: array([ 1, 3, 6, 4, 9, 6, 13, 8, 17, 27, 11, 23])
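A quick way to sanity-check the result (a naive reference implementation, not part of the original answer) is to cumsum each group separately and concatenate:

import numpy as np

def naive_intervaled_cumsum(ar, sizes):
    bounds = np.r_[0, sizes.cumsum()]
    return np.concatenate([ar[s:e].cumsum() for s, e in zip(bounds[:-1], bounds[1:])])

# matches intervaled_cumsum(ar, sizes) for the sample above:
# array([ 1,  3,  6,  4,  9,  6, 13,  8, 17, 27, 11, 23])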
Benchmarking
Other approach(es) -
# @cᴏʟᴅsᴘᴇᴇᴅ's solution
import pandas as pd
def pandas_soln(my_vector, sizes):
    s = pd.Series(my_vector)
    return s.groupby(s.index.isin(sizes.cumsum()).cumsum()).cumsum().values
The given sample used intervals of lengths 2 and 3. Keeping that, we simply give it a larger number of groups for timing purposes.
Timings -
In [146]: N = 10000 # number of groups
...: np.random.seed(0)
...: sizes = np.random.randint(2,4,(N))
...: ar = np.random.randint(0,N,sizes.sum())
In [147]: %timeit intervaled_cumsum(ar, sizes)
...: %timeit pandas_soln(ar, sizes)
10000 loops, best of 3: 178 µs per loop
1000 loops, best of 3: 1.82 ms per loop
In [148]: N = 100000 # number of groups
...: np.random.seed(0)
...: sizes = np.random.randint(2,4,(N))
...: ar = np.random.randint(0,N,sizes.sum())
In [149]: %timeit intervaled_cumsum(ar, sizes)
...: %timeit pandas_soln(ar, sizes)
100 loops, best of 3: 3.91 ms per loop
100 loops, best of 3: 17.3 ms per loop
In [150]: N = 1000000 # number of groups
...: np.random.seed(0)
...: sizes = np.random.randint(2,4,(N))
...: ar = np.random.randint(0,N,sizes.sum())
In [151]: %timeit intervaled_cumsum(ar, sizes)
...: %timeit pandas_soln(ar, sizes)
10 loops, best of 3: 31.6 ms per loop
1 loop, best of 3: 357 ms per loop
Here is an unconventional solution. Not very fast, though. (Even a bit slower than pandas).
>>> from scipy import linalg
>>>
>>> N = len(my_vector)
>>> D = np.repeat((*zip((1,-1)),), N, axis=1)
>>> D[1, np.cumsum(size_of_groups) - 1] = 0
>>>
>>> linalg.solve_banded((1, 0), D, my_vector)
array([1., 3., 6., 4., 9.])
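Why this works (my reading, not spelled out in the original answer): solve_banded((1, 0), D, my_vector) solves a lower-bidiagonal system with ones on the diagonal and -1 on the subdiagonal, except that the subdiagonal entry is zeroed at each group boundary. That means x[i] = v[i] + x[i-1] inside a group and x[i] = v[i] at the start of a group, which is exactly the intervaled cumsum. A small sketch that builds the dense equivalent for inspection:

import numpy as np

my_vector = np.array([1, 2, 3, 4, 5])
size_of_groups = np.array([3, 2])

N = len(my_vector)
D = np.repeat((*zip((1, -1)),), N, axis=1)    # row 0: main diagonal, row 1: subdiagonal
D[1, np.cumsum(size_of_groups) - 1] = 0       # break the chain at each group boundary

dense = np.diag(D[0]) + np.diag(D[1, :-1], -1)   # dense equivalent of the banded matrix
print(np.linalg.solve(dense, my_vector))         # [1. 3. 6. 4. 9.], same as solve_banded((1, 0), D, my_vector)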

Vectorizing nearest neighbor computation

I have the following function which is returning an array calculating the nearest neighbor:
def p_batch(U, X, Y):
    return [nearest(u, X, Y) for u in U]
I would like to replace the for loop using numpy. I've been looking into numpy.vectorize() as this seems to be the right approach, but I can't get it to work. This is what I've tried so far:
def n_batch(U, X, Y):
    vbatch = np.vectorize(nearest)
    return vbatch(U, X, Y)
Can anyone give me a hint where I went wrong?
Edit:
Implementation of nearest:
def nearest(u, X, Y):
    return Y[np.argmin(np.sqrt(np.sum(np.square(np.subtract(u, X)), axis=1)))]
Function for U,X,Y (with M=20,N=100,d=50):
U = numpy.random.mtrand.RandomState(123).uniform(0,1,[M,d])
X = numpy.random.mtrand.RandomState(456).uniform(0,1,[N,d])
Y = numpy.random.mtrand.RandomState(789).randint(0,2,[N])
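As an aside on the np.vectorize attempt (my note, not from the answer below): by default np.vectorize broadcasts every input element by element, so it ends up passing scalars to nearest. You can make it treat whole rows and arrays as the units via the signature argument, although this is still a Python-level loop and brings no real speedup. A sketch using the definitions above:

import numpy as np

vbatch = np.vectorize(nearest, signature='(d),(n,d),(n)->()')
result = vbatch(U, X, Y)   # shape (M,), same values as p_batch(U, X, Y)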
Approach #1
You could use SciPy's cdist to generate all those Euclidean distances and then simply use argmin and index into Y -
from scipy.spatial.distance import cdist
out = Y[cdist(U,X).argmin(1)]
Sample run -
In [76]: M,N,d = 5,6,3
...: U = np.random.mtrand.RandomState(123).uniform(0,1,[M,d])
...: X = np.random.mtrand.RandomState(456).uniform(0,1,[N,d])
...: Y = np.random.mtrand.RandomState(789).randint(0,2,[N])
...:
# Using a list comprehension to verify values
In [77]: [nearest(U[i], X,Y) for i in range(len(U))]
Out[77]: [1, 0, 0, 1, 1]
In [78]: Y[cdist(U,X).argmin(1)]
Out[78]: array([1, 0, 0, 1, 1])
Approach #2
Another way is with sklearn's pairwise_distances_argmin, which gives us those argmin indices directly -
from sklearn.metrics import pairwise
Y[pairwise.pairwise_distances_argmin(U,X)]
Runtime test with M=20,N=100,d=50 -
In [90]: M,N,d = 20,100,50
...: U = np.random.mtrand.RandomState(123).uniform(0,1,[M,d])
...: X = np.random.mtrand.RandomState(456).uniform(0,1,[N,d])
...: Y = np.random.mtrand.RandomState(789).randint(0,2,[N])
...:
Testing between cdist and pairwise_distances_argmin -
In [91]: %timeit cdist(U,X).argmin(1)
10000 loops, best of 3: 55.2 µs per loop
In [92]: %timeit pairwise.pairwise_distances_argmin(U,X)
10000 loops, best of 3: 90.6 µs per loop
Timings against loopy version -
In [93]: %timeit [nearest(U[i], X,Y) for i in range(len(U))]
1000 loops, best of 3: 298 µs per loop
In [94]: %timeit Y[cdist(U,X).argmin(1)]
10000 loops, best of 3: 55.6 µs per loop
In [95]: %timeit Y[pairwise.pairwise_distances_argmin(U,X)]
10000 loops, best of 3: 91.1 µs per loop
In [96]: 298.0/55.6 # Speedup with cdist over loopy one
Out[96]: 5.359712230215827
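For completeness (my addition, not part of the original answer): if SciPy and scikit-learn are unavailable, the same argmin indexing can be done with plain NumPy broadcasting, at the cost of materializing the full (M, N) distance matrix:

import numpy as np

# pure-NumPy equivalent of Y[cdist(U, X).argmin(1)], using the same U, X, Y as above
out = Y[np.linalg.norm(U[:, None, :] - X[None, :, :], axis=-1).argmin(axis=1)]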

Cython numpy array indexer speed improvement

I wrote the following code in pure Python; the description of what it does is in the docstring:
import numpy as np
from scipy.ndimage.measurements import find_objects
import itertools
def alt_indexer(arr):
    """
    Returns a dictionary with the elements of arr as keys
    and the corresponding slices as values.

    Note:
        This function assumes arr is sorted.

    Example:
        >>> arr = [0,0,3,2,1,2,3]
        >>> loc = alt_indexer(arr)
        >>> loc
        {0: (slice(0L, 2L, None),),
         1: (slice(2L, 3L, None),),
         2: (slice(3L, 5L, None),),
         3: (slice(5L, 7L, None),)}
        >>> arr = sorted(arr)
        >>> arr[loc[3][0]]
        [3, 3]
        >>> arr[loc[2][0]]
        [2, 2]
    """
    unique, counts = np.unique(arr, return_counts=True)
    labels = np.arange(1, len(unique)+1)
    labels = np.repeat(labels, counts)
    slicearr = find_objects(labels)
    index_dict = dict(itertools.izip(unique, slicearr))
    return index_dict
Since I will be indexing very large arrays, I wanted to speed up the operations by using Cython; here is the equivalent implementation:
import numpy as np
cimport numpy as np

def _indexer(arr):
    cdef tuple unique_counts = np.unique(arr, return_counts=True)
    cdef np.ndarray[np.int32_t, ndim=1] unique = unique_counts[0]
    cdef np.ndarray[np.int32_t, ndim=1] counts = unique_counts[1].astype(int)
    cdef int start = 0
    cdef int end
    cdef int i
    cdef dict d = {}
    for i in xrange(len(counts)):
        if i > 0:
            start = counts[i-1] + start
        end = counts[i] + start
        d[unique[i]] = slice(start, end)
    return d
Benchmarks
I compared the time it took to complete both operations:
In [26]: import numpy as np
In [27]: rr=np.random.randint(0,1000,1000000)
In [28]: %timeit _indexer(rr)
10 loops, best of 3: 40.5 ms per loop
In [29]: %timeit alt_indexer(rr) #pure python
10 loops, best of 3: 51.4 ms per loop
As you can see, the speed improvement is minimal. I do realize that my code was already partly optimized, since I used numpy.
Is there a bottleneck that I am not aware of?
Should I not use np.unique and write my own implementation instead?
Thanks.
With arr containing non-negative, not very large, and heavily repeated integers, here's an alternative approach using np.bincount to simulate the same behavior as np.unique(arr, return_counts=True) -
def unique_counts(arr):
    counts = np.bincount(arr)
    mask = counts != 0
    unique = np.nonzero(mask)[0]
    return unique, counts[mask]
Runtime test
Case #1 :
In [83]: arr = np.random.randint(0,100,(1000)) # Input array
In [84]: unique, counts = np.unique(arr, return_counts=True)
...: unique1, counts1 = unique_counts(arr)
...:
In [85]: np.allclose(unique,unique1)
Out[85]: True
In [86]: np.allclose(counts,counts1)
Out[86]: True
In [87]: %timeit np.unique(arr, return_counts=True)
10000 loops, best of 3: 53.2 µs per loop
In [88]: %timeit unique_counts(arr)
100000 loops, best of 3: 10.2 µs per loop
Case #2:
In [89]: arr = np.random.randint(0,1000,(10000)) # Input array
In [90]: %timeit np.unique(arr, return_counts=True)
1000 loops, best of 3: 713 µs per loop
In [91]: %timeit unique_counts(arr)
10000 loops, best of 3: 39.1 µs per loop
Case #3: Let's run a case where unique has some missing numbers in the min-to-max range and verify the results against the np.unique version as a sanity check. We won't have many repeated numbers in this case, so it isn't expected to do better on performance.
In [98]: arr = np.random.randint(0,10000,(1000)) # Input array
In [99]: unique, counts = np.unique(arr, return_counts=True)
...: unique1, counts1 = unique_counts(arr)
...:
In [100]: np.allclose(unique,unique1)
Out[100]: True
In [101]: np.allclose(counts,counts1)
Out[101]: True
In [102]: %timeit np.unique(arr, return_counts=True)
10000 loops, best of 3: 61.9 µs per loop
In [103]: %timeit unique_counts(arr)
10000 loops, best of 3: 71.8 µs per loop
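As an aside (not part of the original answer), unique_counts can also replace the find_objects step entirely: since arr is assumed sorted, the slices follow directly from the cumulative counts. A hypothetical helper, reusing the unique_counts function above:

import numpy as np

def bincount_indexer(arr):
    # assumes arr is sorted, like the original alt_indexer
    unique, counts = unique_counts(arr)
    ends = counts.cumsum()
    starts = ends - counts
    return {u: slice(s, e) for u, s, e in zip(unique, starts, ends)}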

Python numpy.square vs **

Is there a difference between numpy.square and using the ** operator on a Numpy array?
From what I can see it yields the same result.
Any differences in efficiency of execution?
An example for clarification:
In [1]: import numpy as np
In [2]: A = np.array([[2, 2],[2, 2]])
In [3]: np.square(A)
Out[3]:
array([[4, 4],
[4, 4]])
In [4]: A ** 2
Out[4]:
array([[4, 4],
[4, 4]])
You can check the execution times to get a clearer picture:
In [2]: import numpy as np
In [3]: A = np.array([[2, 2],[2, 2]])
In [7]: %timeit np.square(A)
1000000 loops, best of 3: 923 ns per loop
In [8]: %timeit A ** 2
1000000 loops, best of 3: 668 ns per loop
For most applications, both will give you the same results.
Generally, the standard pythonic a*a or a**2 is faster than numpy.square() or numpy.power(), but the numpy functions are often more flexible and precise.
If you do calculations that need to be very accurate, stick to numpy and possibly even use higher-precision dtypes such as np.float96 or np.float128 (where available).
For normal usage, a**2 will do the job, and much faster than numpy.
The folks in this thread gave some good examples for a similar question.
@saimadhu.polamuri and @foehnx/@Lumos
On my machine, currently, NumPy performs faster than **.
In [1]: import numpy as np
In [2]: A = np.array([[1,2],[3,4]])
In [3]: %timeit A ** 2
256 ns ± 0.922 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [4]: %timeit np.square(A)
240 ns ± 0.759 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
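If you want to reproduce the comparison on your own machine (results vary with NumPy version, dtype, and array size), here is a minimal benchmark sketch:

import numpy as np
import timeit

A = np.random.rand(1000, 1000)

for stmt in ("A * A", "A ** 2", "np.square(A)"):
    t = timeit.timeit(stmt, globals={"A": A, "np": np}, number=100)
    print(f"{stmt:>14}: {t / 100 * 1e3:.3f} ms per call")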
