Vectorize addition into array indexed by another array - python

I am trying to get a fast vectorized version of the following loop:
for i in xrange(N1):
    A[y[i]] -= B[i,:]
Here A.shape = (N2, N3), y.shape = (N1,) with y taking values in [0, N2), and B.shape = (N1, N3). You can think of the entries of y as indices into the rows of A. Here N1 is large, N2 is fairly small, and N3 is moderate.
I thought simply doing
A[y] -= B
would work, but the issue is that there are repeated entries in y, so this does not do the right thing (i.e., if y = [1, 1] then A[1] is only updated once, not twice). It also does not seem to be any faster than the unvectorized for loop.
Is there a better way of doing this?
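To make the repeated-index behaviour concrete, here is a tiny sketch (illustrative values only, not my actual data):
import numpy as np

A = np.zeros((2, 3))
y = np.array([1, 1])          # the same row index appears twice
B = np.ones((2, 3))

A[y] -= B                     # buffered fancy-index assignment
print(A[1])                   # [-1. -1. -1.]  -- row 1 was only decremented once

A = np.zeros((2, 3))
for i in range(len(y)):       # the loop applies both updates
    A[y[i]] -= B[i, :]
print(A[1])                   # [-2. -2. -2.]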
EDIT: YXD linked to this answer in the comments, which at first seems to fit the bill. It would seem you can do exactly what I want with
np.subtract.at(A, y, B)
and it does work; however, when I run it, it is significantly slower than the unvectorized version. So the question remains: is there a more performant way of doing this?
EDIT2: An example, to make things concrete:
n1,n2,n3 = 10000, 10, 500
A = np.random.rand(n2,n3)
y = np.random.randint(n2, size=n1)
B = np.random.rand(n1,n3)
The for loop, when run using %timeit in ipython gives on my machine:
10 loops, best of 3: 19.4 ms per loop
The subtract.at version produces the same value for A in the end, but is much slower:
1 loops, best of 3: 444 ms per loop

The code for the original for-loop based approach would look something like this -
def for_loop(A):
    N1 = B.shape[0]
    for i in xrange(N1):
        A[y[i]] -= B[i,:]
    return A
Case #1
If n2 >> n3, I would suggest this vectorized approach -
def bincount_vectorized(A):
    n3 = A.shape[1]
    nrows = y.max()+1
    id = y[:,None] + nrows*np.arange(n3)
    A[:nrows] -= np.bincount(id.ravel(),B.ravel()).reshape(n3,nrows).T
    return A
Runtime tests -
In [203]: n1,n2,n3 = 10000, 500, 10
...: A = np.random.rand(n2,n3)
...: y = np.random.randint(n2, size=n1)
...: B = np.random.rand(n1,n3)
...:
...: # Make copies
...: Acopy1 = A.copy()
...: Acopy2 = A.copy()
...:
In [204]: %timeit for_loop(Acopy1)
10 loops, best of 3: 19 ms per loop
In [205]: %timeit bincount_vectorized(Acopy2)
1000 loops, best of 3: 779 µs per loop
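As a quick sanity check (a sketch using the setup from In [203] above; np.subtract.at serves as the slow-but-correct reference):
ref = A.copy()
np.subtract.at(ref, y, B)               # correct but slow reference
out = bincount_vectorized(A.copy())
print(np.allclose(ref, out))            # expected: True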
Case #2
If n2 << n3, a modified for-loop approach with fewer loop iterations could be suggested -
def for_loop_v2(A):
    n2 = A.shape[0]
    for i in range(n2):
        A[i] -= np.einsum('ij->j',B[y==i]) # OR (B[y==i]).sum(0)
    return A
Runtime tests -
In [206]: n1,n2,n3 = 10000, 10, 500
...: A = np.random.rand(n2,n3)
...: y = np.random.randint(n2, size=n1)
...: B = np.random.rand(n1,n3)
...:
...: # Make copies
...: Acopy1 = A.copy()
...: Acopy2 = A.copy()
...:
In [207]: %timeit for_loop(Acopy1)
10 loops, best of 3: 24.2 ms per loop
In [208]: %timeit for_loop_v2(Acopy2)
10 loops, best of 3: 20.3 ms per loop


Comparing numpy array with itself by element efficiently

I am performing a large number of these calculations:
A == A[np.newaxis].T
where A is a dense numpy array which frequently has common values.
For benchmarking purposes we can use:
n = 30000
A = np.random.randint(0, 1000, n)
A == A[np.newaxis].T
When I perform this calculation, I run into memory issues. I believe this is because the output isn't in a more efficient bitarray or np.packbits format. A secondary concern is that we are performing twice as many comparisons as necessary, since the resulting Boolean array is symmetric.
The questions I have are:
Is it possible to produce the Boolean numpy array output in a more memory-efficient fashion without sacrificing speed? The options I know about are bitarray and np.packbits, but I only know how to apply these after the large Boolean array is created.
Can we utilise the symmetry of our calculation to halve the number of comparisons processed, again without sacrificing speed?
I will need to be able to perform & and | operations on Boolean arrays output. I have tried bitarray, which is super-fast for these bitwise operations. But it is slow to pack np.ndarray -> bitarray and then unpack bitarray -> np.ndarray.
[Edited to provide clarification.]
Here's one with numba to give us a NumPy boolean array as output -
from numba import njit
@njit
def numba_app1(idx, n, s, out):
    for i,j in zip(idx[:-1],idx[1:]):
        s0 = s[i:j]
        c = 0
        for p1 in s0[c:]:
            for p2 in s0[c+1:]:
                out[p1,p2] = 1
                out[p2,p1] = 1
            c += 1
    return out
def app1(A):
    s = A.argsort()
    b = A[s]
    n = len(A)
    idx = np.flatnonzero(np.r_[True,b[1:] != b[:-1],True])
    out = np.zeros((n,n),dtype=bool)
    numba_app1(idx, n, s, out)
    out.ravel()[::out.shape[1]+1] = 1
    return out
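To see what idx encodes, here is a small illustrative sketch (toy values, not from the benchmark): after argsorting, equal values sit next to each other, and idx marks the boundaries of each run of equal values, which is what numba_app1 iterates over.
import numpy as np

A = np.array([3, 1, 3, 2, 1])
s = A.argsort()                                    # e.g. [1, 4, 3, 0, 2]
b = A[s]                                           # [1, 1, 2, 3, 3]
idx = np.flatnonzero(np.r_[True, b[1:] != b[:-1], True])
print(idx)                                         # [0 2 3 5] -> runs b[0:2], b[2:3], b[3:5]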
Timings -
In [287]: np.random.seed(0)
...: n = 30000
...: A = np.random.randint(0, 1000, n)
# Original soln
In [288]: %timeit A == A[np.newaxis].T
1 loop, best of 3: 317 ms per loop
# @Daniel F's soln-1 that skips assigning lower triangle in output
In [289]: %timeit sparse_outer_eq(A)
1 loop, best of 3: 450 ms per loop
# @Daniel F's soln-2 (complete one)
In [291]: %timeit sparse_outer_eq(A)
1 loop, best of 3: 634 ms per loop
# Solution from this post
In [292]: %timeit app1(A)
10 loops, best of 3: 66.9 ms per loop
This isn't even a numpy answer, but it should keep your data requirements down by using a bit of homebrewed sparse notation.
from numba import jit

@jit  # because this is gonna be loopy
def sparse_outer_eq(A):
    n = A.size
    c = []
    for i in range(n):
        for j in range(i + 1, n):
            if A[i] == A[j]:
                c.append((i, j))
    return c
Now c is a list of coordinate tuples (i, j) with i < j that correspond to the coordinates in your boolean array that are True. You can easily do AND and OR operations on these setwise:
list(set(c1) & set(c2))
list(set(c1) | set(c2))
Later, when you want to apply this mask to an array, you can back out the coordinates and use them for fancy indexing instead:
i_, j_ = list(np.array(c).T)
i = np.r_[i_, j_, np.arange(n)]
j = np.r_[j_, i_, np.arange(n)]
You can then np.lexsort i and j if you care about order.
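If you later need the dense boolean mask itself, it can be rebuilt from c; a sketch under the assumptions above (c, n, and A as defined earlier, first sparse_outer_eq variant):
import numpy as np

mask = np.zeros((n, n), dtype=bool)
if c:                                  # c may be empty if all values are unique
    i_, j_ = np.array(c).T
    mask[i_, j_] = True
    mask[j_, i_] = True                # mirror, since only i < j pairs are stored
np.fill_diagonal(mask, True)           # A[i] == A[i] always holds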
Alternatively, you can define sparse_outer_eq as:
@jit
def sparse_outer_eq(A):
    n = A.size
    c = []
    for i in range(n):
        for j in range(n):
            if A[i] == A[j]:
                c.append((i, j))
    return c
This keeps >2x the data, but then the coordinates come out simply:
i, j = list(np.array(c).T)
If you've done any set operations, this will still need to be lexsorted if you want a rational order.
If your coordinates are each n-bit integers, this should be more space-efficient than boolean format as long as your sparsity is less than 1/n -> 3% or so for 32-bit.
As for time, thanks to numba it's even faster than broadcasting:
n = 3000
A = np.random.randint(0, 1000, n)
%timeit sparse_outer_eq(A)
100 loops, best of 3: 4.86 ms per loop
%timeit A == A[:, None]
100 loops, best of 3: 11.8 ms per loop
and comparisons:
a = A == A[:, None]
b = B == B[:, None]
a_ = sparse_outer_eq(A)
b_ = sparse_outer_eq(B)
%timeit a & b
100 loops, best of 3: 5.9 ms per loop
%timeit list(set(a_) & set(b_))
1000 loops, best of 3: 641 µs per loop
%timeit a | b
100 loops, best of 3: 5.52 ms per loop
%timeit list(set(a_) | set(b_))
1000 loops, best of 3: 955 µs per loop
EDIT: if you want to do &~ (as per your comment) use the second sparse_outer_eq method (so you don't have to keep track of the diagonal) and just do:
list(set(a_) - set(b_))
Here is the more or less canonical argsort solution:
import numpy as np
def f_argsort(A):
    idx = np.argsort(A)
    As = A[idx]
    ne_ = np.r_[True, As[:-1] != As[1:], True]
    bnds = np.flatnonzero(ne_)
    valid = np.diff(bnds) != 1
    return [idx[bnds[i]:bnds[i+1]] for i in np.flatnonzero(valid)]
n = 30000
A = np.random.randint(0, 1000, n)
groups = f_argsort(A)
for grp in groups:
    print(len(grp), set(A[grp]), end=' ')
print()
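If the full boolean matrix is still needed, the groups can be expanded back into it; a sketch assuming groups, A, and n from the snippet above (f_argsort drops size-1 groups, so the diagonal has to be set separately):
mask = np.zeros((n, n), dtype=bool)
for grp in groups:
    mask[np.ix_(grp, grp)] = True        # every pair within a group compares equal
np.fill_diagonal(mask, True)             # singletons are equal only to themselves
# mask now equals A == A[np.newaxis].T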
I'm adding a solution to my question because it satisfies these 3 properties:
Low, fixed, memory requirement
Fast bitwise operations (&, |, ~, etc)
Low storage, 1-bit per Boolean via packing integers
The downside is that the result is stored in np.packbits format. It is substantially slower than the other methods (especially argsort), but if speed is not an issue the algorithm should work well. If anyone figures out a way to optimise it further, that would be very helpful.
Update: A more efficient version of the below algorithm can be found here: Improving performance on comparison algorithm np.packbits(A==A[:, None], axis=1).
import numpy as np
from numba import jit
@jit(nopython=True)
def bool2int(x):
    y = 0
    for i, j in enumerate(x):
        if j: y += int(j)<<(7-i)
    return y
@jit(nopython=True)
def compare_elementwise(arr, result, section):
    n = len(arr)
    for row in range(n):
        for col in range(n):
            section[col%8] = arr[row] == arr[col]
            if ((col + 1) % 8 == 0) or (col == (n-1)):
                result[row, col // 8] = bool2int(section)
                section[:] = 0
    return result
A = np.random.randint(0, 10, 100)
n = len(A)
result_arr = np.zeros((n, n // 8 if n % 8 == 0 else n // 8 + 1)).astype(np.uint8)
selection_arr = np.zeros(8).astype(np.uint8)
packed = compare_elementwise(A, result_arr, selection_arr)
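To read an individual comparison back out of the packed result, here is a sketch assuming packed, A, and n from the snippet above (is_equal is a hypothetical helper; bits are stored most-significant-first, matching np.packbits):
def is_equal(packed, i, j):
    # Test bit j of row i in the packed representation.
    return bool(packed[i, j // 8] & (128 >> (j % 8)))

# Or unpack a whole row back to booleans (trailing padding bits are dropped).
row0 = np.unpackbits(packed[0])[:n].astype(bool)
print(is_equal(packed, 0, 0), row0[0])   # both True, since A[0] == A[0]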

Improving performance on comparison algorithm np.packbits(A==A[:, None], axis=1)

I am looking to memory-optimise np.packbits(A==A[:, None], axis=1), where A is a dense array of integers of length n. A==A[:, None] is memory hungry for large n, since the resulting Boolean array is stored inefficiently, with each Boolean value costing 1 byte.
I wrote the script below to achieve the same result while packing bits one section at a time. It is, however, around 3x slower, so I am looking for ways to speed it up, or alternatively a better algorithm with small memory overhead.
Note: this is a follow-up question to one I asked earlier: Comparing numpy array with itself by element efficiently.
Reproducible code below for benchmarking.
import numpy as np
from numba import jit
@jit(nopython=True)
def bool2int(x):
    y = 0
    for i, j in enumerate(x):
        if j: y += int(j)<<(7-i)
    return y
@jit(nopython=True)
def compare_elementwise(arr, result, section):
    n = len(arr)
    for row in range(n):
        for col in range(n):
            section[col%8] = arr[row] == arr[col]
            if ((col + 1) % 8 == 0) or (col == (n-1)):
                result[row, col // 8] = bool2int(section)
                section[:] = 0
    return result
n = 10000
A = np.random.randint(0, 1000, n)
result_arr = np.zeros((n, n // 8 if n % 8 == 0 else n // 8 + 1)).astype(np.uint8)
selection_arr = np.zeros(8).astype(np.uint8)
# memory efficient version, but slow
packed = compare_elementwise(A, result_arr, selection_arr)
# memory inefficient version, but fast
packed2 = np.packbits(A == A[:, None], axis=1)
assert (packed == packed2).all()
%timeit compare_elementwise(A, result_arr, selection_arr) # 1.6 seconds
%timeit np.packbits(A == A[:, None], axis=1) # 0.460 second
Here is a solution 3 times faster than the numpy one (a.size must be a multiple of 8; see below):
import numba as nb

@nb.njit
def comp(a):
    res = np.zeros((a.size,a.size//8),np.uint8)
    for i,x in enumerate(a):
        for j,y in enumerate(a):
            if x==y: res[i,j//8] |= 128 >> j%8
    return res
This works because the array is scanned only once, where your version scans it many times, and almost all terms are null.
In [122]: %timeit np.packbits(A == A[:, None], axis=1)
389 ms ± 57.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [123]: %timeit comp(A)
123 ms ± 24.4 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
If a.size % 8 > 0, the cost of recovering the information will be higher. The best approach in this case is to pad the initial array with up to 7 zeros.
For completeness, the padding could be done like so:
if A.size % 8 != 0: A = np.pad(A, (0, 8 - A.size % 8), 'constant', constant_values=0)
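Once two comparisons are packed this way, the bitwise operations from the original question work directly on the uint8 arrays; a sketch under the assumptions above (B and the res names are hypothetical; comp is the function from this answer and A comes from the question):
B = np.random.randint(0, 1000, A.size)   # hypothetical second array, same length

res_a = comp(A)
res_b = comp(B)

both   = res_a & res_b    # bitwise AND, still 1 bit per comparison
either = res_a | res_b    # bitwise OR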

Efficient Double Sum of Products

Consider two ndarrays of length n, arr1 and arr2. I'm computing the following sum of products, and doing it num_runs times to benchmark:
import numpy as np
import time
num_runs = 1000
n = 100
arr1 = np.random.rand(n)
arr2 = np.random.rand(n)
start_comp = time.clock()
for r in xrange(num_runs):
    sum_prods = np.sum( [arr1[i]*arr2[j] for i in xrange(n)
                                          for j in xrange(i+1, n)] )
print "total time for comprehension = ", time.clock() - start_comp
start_loop = time.clock()
for r in xrange(num_runs):
    sum_prod = 0.0
    for i in xrange(n):
        for j in xrange(i+1, n):
            sum_prod += arr1[i]*arr2[j]
print "total time for loop = ", time.clock() - start_loop
The output is
total time for comprehension = 3.23097066953
total time for loop = 3.9045544426
so using list comprehension appears faster.
Is there a much more efficient implementation, using Numpy routines perhaps, to calculate such a sum of products?
Rearrange the operation into an O(n) runtime algorithm instead of O(n^2), and take advantage of NumPy for the products and sums:
# arr1_weights[i] is the sum of all terms arr1[i] gets multiplied by in the
# original version
arr1_weights = arr2[::-1].cumsum()[::-1] - arr2
sum_prods = arr1.dot(arr1_weights)
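The rearrangement works because, for each i, arr1[i] is multiplied by the suffix sum of arr2 over j > i, and the reversed cumsum minus arr2 computes exactly those suffix sums. A quick check (a self-contained sketch mirroring the question's setup):
import numpy as np

n = 100
arr1 = np.random.rand(n)
arr2 = np.random.rand(n)

brute = sum(arr1[i] * arr2[j] for i in range(n) for j in range(i + 1, n))
arr1_weights = arr2[::-1].cumsum()[::-1] - arr2   # suffix sums of arr2, excluding j == i
fast = arr1.dot(arr1_weights)
print(np.allclose(brute, fast))                    # expected: True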
Timing shows this to be about 200 times faster than the list comprehension for n == 100.
In [21]: %%timeit
....: np.sum([arr1[i] * arr2[j] for i in range(n) for j in range(i+1, n)])
....:
100 loops, best of 3: 5.13 ms per loop
In [22]: %%timeit
....: arr1_weights = arr2[::-1].cumsum()[::-1] - arr2
....: sum_prods = arr1.dot(arr1_weights)
....:
10000 loops, best of 3: 22.8 µs per loop
A vectorized way: np.sum(np.triu(np.multiply.outer(arr1,arr2),1)), for a 30x improvement:
In [9]: %timeit np.sum(np.triu(np.multiply.outer(arr1,arr2),1))
1000 loops, best of 3: 272 µs per loop
In [10]: %timeit np.sum( [arr1[i]*arr2[j] for i in range(n)
    ...:                  for j in range(i+1, n)] )
100 loops, best of 3: 7.9 ms per loop
In [11]: allclose(np.sum(np.triu(np.multiply.outer(arr1,arr2),1)),
    ...:          np.sum([arr1[i]*arr2[j] for i in range(n) for j in range(i+1, n)]))
Out[11]: True
Another fast approach is to use numba:
from numba import jit

@jit
def t(arr1,arr2):
    s = 0
    for i in range(n):
        for j in range(i+1,n):
            s += arr1[i]*arr2[j]
    return s
for another 10x factor:
In [12]: %timeit t(arr1,arr2)
10000 loops, best of 3: 21.1 µs per loop
And using @user2357112's minimal answer,
@jit
def t2357112(arr1,arr2):
    s = 0
    c = 0
    for i in range(n-2,-1,-1):
        c += arr2[i+1]
        s += arr1[i]*c
    return s
we get
In [13]: %timeit t2357112(arr1,arr2)
100000 loops, best of 3: 2.33 µs per loop
by doing only the necessary operations.
You can use the following broadcasting trick:
a = np.sum(np.triu(arr1[:,None]*arr2[None,:],1))
b = np.sum( [arr1[i]*arr2[j] for i in xrange(n) for j in xrange(i+1, n)] )
print a == b # True
Basically, I'm paying the price of computing all pairwise products of arr1 and arr2 (including the ones I don't need) in order to take advantage of numpy broadcasting/vectorization, which runs much faster in low-level code.
And timings:
%timeit np.sum(np.triu(arr1[:,None]*arr2[None,:],1))
10000 loops, best of 3: 55.9 µs per loop
%timeit np.sum( [arr1[i]*arr2[j] for i in xrange(n) for j in xrange(i+1, n)] )
1000 loops, best of 3: 1.45 ms per loop

Optimizing python one-liner

I profiled my program, and more than 80% of the time is spent in this one-line function! How can I optimize it? I am running with PyPy, so I'd rather not use NumPy, but since my program is spending almost all of its time there, giving up PyPy for NumPy might be worth it. However, I would prefer to use CFFI, since that's more compatible with PyPy.
#x, y, are lists of 1s and 0s. c_out is a positive int. bit is 1 or 0.
def findCarryIn(x, y, c_out, bit):
    return (2 * c_out +
            bit -
            sum(map(lambda x_bit, y_bit: x_bit & y_bit, x, reversed(y)))) #note this is basically a dot product.
Without using NumPy, after testing with timeit, the fastest method for the summing you are doing seems to be a simple for loop accumulating over the elements. Example -
def findCarryIn(x, y, c_out, bit):
    s = 0
    for i,j in zip(x, reversed(y)):
        s += i & j
    return (2 * c_out + bit - s)
Though this did not increase the performance by a lot (maybe 20% or so).
The results of the timing tests (with different methods; func4 contains the method described above) -
def func1(x,y):
    return sum(map(lambda x_bit, y_bit: x_bit & y_bit, x, reversed(y)))

def func2(x,y):
    return sum([i & j for i,j in zip(x,reversed(y))])

def func3(x,y):
    return sum(x[i] & y[-1-i] for i in range(min(len(x),len(y))))

def func4(x,y):
    s = 0
    for i,j in zip(x, reversed(y)):
        s += i & j
    return s
In [125]: %timeit func1(x,y)
100000 loops, best of 3: 3.02 µs per loop
In [126]: %timeit func2(x,y)
The slowest run took 6.42 times longer than the fastest. This could mean that an intermediate result is being cached
100000 loops, best of 3: 2.9 µs per loop
In [127]: %timeit func3(x,y)
100000 loops, best of 3: 4.31 µs per loop
In [128]: %timeit func4(x,y)
100000 loops, best of 3: 2.2 µs per loop
This can for sure be sped up a lot using numpy. You could define your function something like this:
def find_carry_numpy(x, y, c_out, bit):
    return 2 * c_out + bit - np.sum(x & y[::-1])
Create some random data:
In [36]: n = 100; c = 15; bit = 1
In [37]: x_arr = np.random.rand(n) > 0.5
In [38]: y_arr = np.random.rand(n) > 0.5
In [39]: x_list = list(x_arr)
In [40]: y_list = list(y_arr)
Check that results are the same:
In [42]: find_carry_numpy(x_arr, y_arr, c, bit)
Out[42]: 10
In [43]: findCarryIn(x_list, y_list, c, bit)
Out[43]: 10
Quick speed test:
In [44]: timeit find_carry_numpy(x_arr, y_arr, c, bit)
10000 loops, best of 3: 19.6 µs per loop
In [45]: timeit findCarryIn(x_list, y_list, c, bit)
1000 loops, best of 3: 409 µs per loop
So you gain a factor of 20 in speed! That is a pretty typical speedup when converting Python code to Numpy.

How to transform negative elements to zero without a loop?

If I have an array like
a = np.array([2, 3, -1, -4, 3])
I want to set all the negative elements to zero: [2, 3, 0, 0, 3]. How can I do it with numpy, without an explicit for loop? I need to use the modified a in a computation, for example
c = a * b
where b is another array with the same length as the original a.
Conclusion
import numpy as np
from time import time
a = np.random.uniform(-1, 1, 20000000)
t = time(); b = np.where(a>0, a, 0); print ("1. ", time() - t)
a = np.random.uniform(-1, 1, 20000000)
t = time(); b = a.clip(min=0); print ("2. ", time() - t)
a = np.random.uniform(-1, 1, 20000000)
t = time(); a[a < 0] = 0; print ("3. ", time() - t)
a = np.random.uniform(-1, 1, 20000000)
t = time(); a[np.where(a<0)] = 0; print ("4. ", time() - t)
a = np.random.uniform(-1, 1, 20000000)
t = time(); b = [max(x, 0) for x in a]; print ("5. ", time() - t)
1.38629984856
0.516846179962 <- faster a.clip(min=0);
0.615426063538
0.944557905197
51.7364809513
a = a.clip(min=0)
I would do this:
a[a < 0] = 0
If you want to keep the original a and only set the negative elements to zero in a copy, you can copy the array first:
c = a.copy()
c[c < 0] = 0
Another trick is to use multiplication. This actually seems to be much faster than every other method here. For example
b = a*(a>0) # copies data
or
a *= (a>0) # in-place zero-ing
I ran tests with timeit, pre-calculating the < and > masks because some of these methods modify a in place and that would greatly affect the results. In all cases a was np.random.uniform(-1, 1, 20000000), but with the negatives already set to 0; L = a < 0 and G = a > 0 were computed before a was changed. clip is relatively disadvantaged since it doesn't get to use L or G (however, calculating those on the same machine took only 17 ms each, so that is not the major cause of the speed difference).
%timeit b = np.where(G, a, 0) # 132ms copies
%timeit b = a.clip(min=0) # 165ms copies
%timeit a[L] = 0 # 158ms in-place
%timeit a[np.where(L)] = 0 # 122ms in-place
%timeit b = a*G # 87.4ms copies
%timeit np.multiply(a,G,a) # 40.1ms in-place (normal code would use `a*=G`)
When choosing to penalize the in-place methods instead of clip, the following timings come up:
%timeit b = np.where(a>0, a, 0) # 152ms
%timeit b = a.clip(min=0) # 165ms
%timeit b = a.copy(); b[a<0] = 0 # 231ms
%timeit b = a.copy(); b[np.where(a<0)] = 0 # 205ms
%timeit b = a*(a>0) # 108ms
%timeit b = a.copy(); b*=a>0 # 121ms
Non-in-place methods are penalized by 20 ms (the time required to calculate a>0 or a<0) and the in-place methods are penalized 73-83 ms (so it takes about 53-63 ms to do b.copy()).
Overall the multiplication methods are much faster than clip. If not in-place, it is 1.5x faster. If you can do it in-place then it is 2.75x faster.
Use where
a[numpy.where(a<0)] = 0
Based on my answer here, using np.maximum is the fastest possible way.
a = np.random.random(1000) - 0.5
%%timeit
a_ = a.copy()
a_ = np.maximum(a_,0)
# 15.6 µs ± 2.14 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%%timeit
a_ = a.copy()
a_ = a_.clip(min=0)
# 54.2 µs ± 10.4 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
And just for the sake of completeness, I would like to add the use of the Heaviside function (a step function) to achieve a similar outcome, as follows:
Let's say, for continuity, we have
a = np.array([2, 3, -1, -4, 3])
Then, using the step function np.heaviside(), one can try
b = a * np.heaviside(a, 0)
Note something interesting about this operation: the negative signs are preserved (the zeroed elements come out as -0.0)! Not ideal for most situations, I would say.
This can then be corrected for by
b = abs(b)
So this is probably a rather long way to do it without invoking some loop.
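A tiny demonstration of the signed-zero artifact mentioned above (a sketch with the same example array):
import numpy as np

a = np.array([2, 3, -1, -4, 3])
b = a * np.heaviside(a, 0)
print(b)        # [ 2.  3. -0. -0.  3.]  -- negative zeros appear
print(abs(b))   # [2. 3. 0. 0. 3.]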
