I have a 2D matrix and I need to sum a subset of the matrix elements, given two lists of indices imp_list and bath_list. Here is what I'm doing right now:
s = 0.0
for i in imp_list:
for j in bath_list:
s += K[i,j]
which appears to be very slow. What would be a better solution to perform the sum?
If you're working with large arrays, you should get a huge speed boost by using NumPy's own indexing routines over Python's for loops.
In the general case you can use np.ix_ to select a subarray of the matrix to sum:
K[np.ix_(imp_list, bath_list)].sum()
Note that np.ix_ carries some overhead, so if your two lists contain consecutive or evenly-spaced values, it's worth using regular slicing to index the array instead (see method3() below).
Here's some data to illustrate the improvements:
K = np.arange(1000000).reshape(1000, 1000)
imp_list = range(100) # [0, 1, 2, ..., 99]
bath_list = range(200) # [0, 1, 2, ..., 199]
def method1():
s = 0
for i in imp_list:
for j in bath_list:
s += K[i,j]
return s
def method2():
return K[np.ix_(imp_list, bath_list)].sum()
def method3():
return K[:100, :200].sum()
Then:
In [80]: method1() == method2() == method3()
Out[80]: True
In [91]: %timeit method1()
10 loops, best of 3: 9.93 ms per loop
In [92]: %timeit method2()
1000 loops, best of 3: 884 µs per loop
In [93]: %timeit method3()
10000 loops, best of 3: 34 µs per loop
Related
I'm sort of newbie in numpy so I'm sorry if this question was already asked. I'm looking for a vectorization solution which enable to run multiple cumsum of different size within a one dimension numpy array.
my_vector=np.array([1,2,3,4,5])
size_of_groups=np.array([3,2])
I would like something like
np.cumsum.group(my_vector,size_of_groups)
[1,3,6,4,9]
I do not want a solution with loops. Either numpy functions or numpy operations.
Not sure about numpy, but pandas can do this pretty easily with a groupby + cumsum:
import pandas as pd
s = pd.Series(my_vector)
s.groupby(s.index.isin(size_of_groups.cumsum()).cumsum()).cumsum()
0 1
1 3
2 6
3 4
4 9
dtype: int64
Here's a vectorized solution -
def intervaled_cumsum(ar, sizes):
# Make a copy to be used as output array
out = ar.copy()
# Get cumumlative values of array
arc = ar.cumsum()
# Get cumsumed indices to be used to place differentiated values into
# input array's copy
idx = sizes.cumsum()
# Place differentiated values that when cumumlatively summed later on would
# give us the desired intervaled cumsum
out[idx[0]] = ar[idx[0]] - arc[idx[0]-1]
out[idx[1:-1]] = ar[idx[1:-1]] - np.diff(arc[idx[:-1]-1])
return out.cumsum()
Sample run -
In [114]: ar = np.array([1,2,3,4,5,6,7,8,9,10,11,12])
...: sizes = np.array([3,2,2,3,2])
In [115]: intervaled_cumsum(ar, sizes)
Out[115]: array([ 1, 3, 6, 4, 9, 6, 13, 8, 17, 27, 11, 23])
Benchmarking
Other approach(es) -
# #cᴏʟᴅsᴘᴇᴇᴅ's solution
import pandas as pd
def pandas_soln(my_vector, sizes):
s = pd.Series(my_vector)
return s.groupby(s.index.isin(sizes.cumsum()).cumsum()).cumsum().values
The given sample used two intervals of lengths 2 and 3 Keeping that and simply giving it more number of groups for timing purpose.
Timings -
In [146]: N = 10000 # number of groups
...: np.random.seed(0)
...: sizes = np.random.randint(2,4,(N))
...: ar = np.random.randint(0,N,sizes.sum())
In [147]: %timeit intervaled_cumsum(ar, sizes)
...: %timeit pandas_soln(ar, sizes)
10000 loops, best of 3: 178 µs per loop
1000 loops, best of 3: 1.82 ms per loop
In [148]: N = 100000 # number of groups
...: np.random.seed(0)
...: sizes = np.random.randint(2,4,(N))
...: ar = np.random.randint(0,N,sizes.sum())
In [149]: %timeit intervaled_cumsum(ar, sizes)
...: %timeit pandas_soln(ar, sizes)
100 loops, best of 3: 3.91 ms per loop
100 loops, best of 3: 17.3 ms per loop
In [150]: N = 1000000 # number of groups
...: np.random.seed(0)
...: sizes = np.random.randint(2,4,(N))
...: ar = np.random.randint(0,N,sizes.sum())
In [151]: %timeit intervaled_cumsum(ar, sizes)
...: %timeit pandas_soln(ar, sizes)
10 loops, best of 3: 31.6 ms per loop
1 loop, best of 3: 357 ms per loop
Here is an unconventional solution. Not very fast, though. (Even a bit slower than pandas).
>>> from scipy import linalg
>>>
>>> N = len(my_vector)
>>> D = np.repeat((*zip((1,-1)),), N, axis=1)
>>> D[1, np.cumsum(size_of_groups) - 1] = 0
>>>
>>> linalg.solve_banded((1, 0), D, my_vector)
array([1., 3., 6., 4., 9.])
I need to count the number of zero elements in numpy arrays. I'm aware of the numpy.count_nonzero function, but there appears to be no analog for counting zero elements.
My arrays are not very large (typically less than 1E5 elements) but the operation is performed several millions of times.
Of course I could use len(arr) - np.count_nonzero(arr), but I wonder if there's a more efficient way to do it.
Here's a MWE of how I do it currently:
import numpy as np
import timeit
arrs = []
for _ in range(1000):
arrs.append(np.random.randint(-5, 5, 10000))
def func1():
for arr in arrs:
zero_els = len(arr) - np.count_nonzero(arr)
print(timeit.timeit(func1, number=10))
A 2x faster approach would be to just use np.count_nonzero() but with the condition as needed.
In [3]: arr
Out[3]:
array([[1, 2, 0, 3],
[3, 9, 0, 4]])
In [4]: np.count_nonzero(arr==0)
Out[4]: 2
In [5]:def func_cnt():
for arr in arrs:
zero_els = np.count_nonzero(arr==0)
# here, it counts the frequency of zeroes actually
You can also use np.where() but it's slower than np.count_nonzero()
In [6]: np.where( arr == 0)
Out[6]: (array([0, 1]), array([2, 2]))
In [7]: len(np.where( arr == 0))
Out[7]: 2
Efficiency: (in descending order)
In [8]: %timeit func_cnt()
10 loops, best of 3: 29.2 ms per loop
In [9]: %timeit func1()
10 loops, best of 3: 46.5 ms per loop
In [10]: %timeit func_where()
10 loops, best of 3: 61.2 ms per loop
more speedups with accelerators
It is now possible to achieve more than 3 orders of magnitude speed boost with the help of JAX if you've access to accelerators (GPU/TPU). Another advantage of using JAX is that the NumPy code needs very little modification to make it JAX compatible. Below is a reproducible example:
In [1]: import jax.numpy as jnp
In [2]: from jax import jit
# set up inputs
In [3]: arrs = []
In [4]: for _ in range(1000):
...: arrs.append(np.random.randint(-5, 5, 10000))
# JIT'd function that performs the counting task
In [5]: #jit
...: def func_cnt():
...: for arr in arrs:
...: zero_els = jnp.count_nonzero(arr==0)
# efficiency test
In [8]: %timeit func_cnt()
15.6 µs ± 391 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
I wrote the following code in pure python, the description of what it does is in the docstrings:
import numpy as np
from scipy.ndimage.measurements import find_objects
import itertools
def alt_indexer(arr):
"""
Returns a dictionary with the elements of arr as key
and the corresponding slice as value.
Note:
This function assumes arr is sorted.
Example:
>>> arr = [0,0,3,2,1,2,3]
>>> loc = _indexer(arr)
>>> loc
{0: (slice(0L, 2L, None),),
1: (slice(2L, 3L, None),),
2: (slice(3L, 5L, None),),
3: (slice(5L, 7L, None),)}
>>> arr = sorted(arr)
>>> arr[loc[3][0]]
[3, 3]
>>> arr[loc[2][0]]
[2, 2]
"""
unique, counts = np.unique(arr, return_counts=True)
labels = np.arange(1,len(unique)+1)
labels = np.repeat(labels,counts)
slicearr = find_objects(labels)
index_dict = dict(itertools.izip(unique,slicearr))
return index_dict
Since i will be indexing very large arrays, i wanted to speed up the operations by using cython, here is the equivalent implementation:
import numpy as np
cimport numpy as np
def _indexer(arr):
cdef tuple unique_counts = np.unique(arr, return_counts=True)
cdef np.ndarray[np.int32_t,ndim=1] unique = unique_counts[0]
cdef np.ndarray[np.int32_t,ndim=1] counts = unique_counts[1].astype(int)
cdef int start=0
cdef int end
cdef int i
cdef dict d ={}
for i in xrange(len(counts)):
if i>0:
start = counts[i-1]+start
end=counts[i]+start
d[unique[i]]=slice(start,end)
return d
Benchmarks
I compared the time it took to complete both operations:
In [26]: import numpy as np
In [27]: rr=np.random.randint(0,1000,1000000)
In [28]: %timeit _indexer(rr)
10 loops, best of 3: 40.5 ms per loop
In [29]: %timeit alt_indexer(rr) #pure python
10 loops, best of 3: 51.4 ms per loop
As you can see the speed improvements are minimal. I do realize that my code was already partly optimized since i used numpy.
Is there a bottleneck that i am not aware of?
Should i not use np.unique and write my own implementation instead?
Thanks.
With arr having non-negative, not very large and many repeated int numbers, here's an alternative approach using np.bincount to simulate the same behavior as np.unique(arr, return_counts=True) -
def unique_counts(arr):
counts = np.bincount(arr)
mask = counts!=0
unique = np.nonzero(mask)[0]
return unique, counts[mask]
Runtime test
Case #1 :
In [83]: arr = np.random.randint(0,100,(1000)) # Input array
In [84]: unique, counts = np.unique(arr, return_counts=True)
...: unique1, counts1 = unique_counts(arr)
...:
In [85]: np.allclose(unique,unique1)
Out[85]: True
In [86]: np.allclose(counts,counts1)
Out[86]: True
In [87]: %timeit np.unique(arr, return_counts=True)
10000 loops, best of 3: 53.2 µs per loop
In [88]: %timeit unique_counts(arr)
100000 loops, best of 3: 10.2 µs per loop
Case #2:
In [89]: arr = np.random.randint(0,1000,(10000)) # Input array
In [90]: %timeit np.unique(arr, return_counts=True)
1000 loops, best of 3: 713 µs per loop
In [91]: %timeit unique_counts(arr)
10000 loops, best of 3: 39.1 µs per loop
Case #3: Let's run a case with unique having some missing numbers in the min to max range and verify the results against np.unique version as a sanity check. We won't have a lot of repeated numbers in this case and as such isn't expected to be better on performance.
In [98]: arr = np.random.randint(0,10000,(1000)) # Input array
In [99]: unique, counts = np.unique(arr, return_counts=True)
...: unique1, counts1 = unique_counts(arr)
...:
In [100]: np.allclose(unique,unique1)
Out[100]: True
In [101]: np.allclose(counts,counts1)
Out[101]: True
In [102]: %timeit np.unique(arr, return_counts=True)
10000 loops, best of 3: 61.9 µs per loop
In [103]: %timeit unique_counts(arr)
10000 loops, best of 3: 71.8 µs per loop
I have the following numpy arrays:
arr_1 = [[1,2],[3,4],[5,6]] # 3 X 2
arr_2 = [[0.5,0.6],[0.7,0.8],[0.9,1.0],[1.1,1.2],[1.3,1.4]] # 5 X 2
arr_1 is clearly a 3 X 2 array, whereas arr_2 is a 5 X 2 array.
Now without looping, I want to element-wise multiply arr_1 and arr_2 so that I apply a sliding window technique (window size 3) to arr_2.
Example:
Multiplication 1: np.multiply(arr_1,arr_2[:3,:])
Multiplication 2: np.multiply(arr_1,arr_2[1:4,:])
Multiplication 3: np.multiply(arr_1,arr_2[2:5,:])
I want to do this in some sort of a matrix multiplication form to make it faster than my current solution which is of the form:
for i in (2):
np.multiply(arr_1,arr_2[i:i+3,:])
So if the number of rows in arr_2 are large (of the order of tens of thousands), this solution doesn't really scale very well.
Any help would be much appreciated.
We can use NumPy broadcasting to create those sliding windowed indices in a vectorized manner. Then, we can simply index into arr_2 with those to create a 3D array and perform element-wise multiplication with 2D array arr_1, which in turn will bring on broadcasting again.
So, we would have a vectorized implementation like so -
W = arr_1.shape[0] # Window size
idx = np.arange(arr_2.shape[0]-W+1)[:,None] + np.arange(W)
out = arr_1*arr_2[idx]
Runtime test and verify results -
In [143]: # Input arrays
...: arr_1 = np.random.rand(3,2)
...: arr_2 = np.random.rand(10000,2)
...:
...: def org_app(arr_1,arr_2):
...: W = arr_1.shape[0] # Window size
...: L = arr_2.shape[0]-W+1
...: out = np.empty((L,W,arr_1.shape[1]))
...: for i in range(L):
...: out[i] = np.multiply(arr_1,arr_2[i:i+W,:])
...: return out
...:
...: def vectorized_app(arr_1,arr_2):
...: W = arr_1.shape[0] # Window size
...: idx = np.arange(arr_2.shape[0]-W+1)[:,None] + np.arange(W)
...: return arr_1*arr_2[idx]
...:
In [144]: np.allclose(org_app(arr_1,arr_2),vectorized_app(arr_1,arr_2))
Out[144]: True
In [145]: %timeit org_app(arr_1,arr_2)
10 loops, best of 3: 47.3 ms per loop
In [146]: %timeit vectorized_app(arr_1,arr_2)
1000 loops, best of 3: 1.21 ms per loop
This is a nice case to test the speed of as_strided and Divakar's broadcasting.
In [281]: %%timeit
...: out=np.empty((L,W,arr1.shape[1]))
...: for i in range(L):
...: out[i]=np.multiply(arr1,arr2[i:i+W,:])
...:
10 loops, best of 3: 48.9 ms per loop
In [282]: %%timeit
...: idx=np.arange(L)[:,None]+np.arange(W)
...: out=arr1*arr2[idx]
...:
100 loops, best of 3: 2.18 ms per loop
In [283]: %%timeit
...: arr3=as_strided(arr2, shape=(L,W,2), strides=(16,16,8))
...: out=arr1*arr3
...:
1000 loops, best of 3: 805 µs per loop
Create Numpy array without enumerating array for more of a comparison of these methods.
I need the minimal distance between elements of an array.
I did:
numpy.min(numpy.ediff1d(numpy.sort(x)))
Is there a better / more efficient / more elegant / faster way of doing this?
If you are after sheer speed, here are some timings:
In [13]: a = np.random.rand(1000)
In [14]: %timeit np.sort(a)
10000 loops, best of 3: 31.9 us per loop
In [15]: %timeit np.ediff1d(a)
100000 loops, best of 3: 15.2 us per loop
In [16]: %timeit np.diff(a)
100000 loops, best of 3: 7.76 us per loop
In [17]: %timeit np.min(a)
100000 loops, best of 3: 3.19 us per loop
In [18]: %timeit np.unique(a)
10000 loops, best of 3: 53.8 us per loop
The timing of unique was in hopes that it would be comparably fast to sort, and you could break out early without the calls to diff and min if the length of the unique array was shorter than the array itself (as that would mean your answer was 0). But the overhead of unique is more than any gain to be made.
So it seems the only potential improvement I can offer is replacing ediff1d with diff:
In [19]: %timeit np.min(np.diff(np.sort(a)))
10000 loops, best of 3: 47.7 us per loop
In [20]: %timeit np.min(np.ediff1d(np.sort(a)))
10000 loops, best of 3: 57.1 us per loop
Your current approach is definitely optimal. By sorting first, you're reducing the space in between each element and ediff1d will return a difference array. Here's a suggestion:
Since we know that the difference must be positive since we have an ascending-order sort, we can implement ediff1d manually and include a break where the difference is zero. That way, if you have the sorted array x:
[1, 1, 2, 3, 4, 5, 6, 7, ... , n]
Rather than going through n elements, your ediff1d function breaks early and covers only the first two elements, returning [0]. This also reduces the size of the difference array, reducing the amount of iterations required by your min call.
Here is an example without the use of numpy:
x = [1, 12, 3, 8, 4, 1, 4, 9, 1, 29, 210, 313, 12]
def ediff1d_custom(x):
darr = []
for i in xrange(len(x)):
if i != len(x) - 1:
diff = x[i + 1] - x[i]
darr.append(diff)
if diff == 0:
break
return darr
print min(ediff1d_custom(sorted(x))) # prints 0
try:
min(x[i+1]-x[i] for i in xrange(0, len(x)-1))
except ValueError:
print 'Array contains less than two values.'