Numba: Manual looping faster than a += c * b with numpy arrays?

I would like to do a 'daxpy' (add to a vector the scalar multiple of a second vector and assign the result to the first) with numpy using numba. Doing the following test, I noticed that writing the loop myself was much faster than doing a += c * b.
I was not expecting this. What is the reason for this behavior?
import numpy as np
from numba import jit

x = np.random.random(int(1e6))
o = np.random.random(int(1e6))
c = 3.4

@jit(nopython=True)
def test1(a, b, c):
    a += c * b
    return a

@jit(nopython=True)
def test2(a, b, c):
    for i in range(len(a)):
        a[i] += c * b[i]
    return a
%timeit -n100 -r10 test1(x, o, c)
>>> 100 loops, best of 10: 2.48 ms per loop
%timeit -n100 -r10 test2(x, o, c)
>>> 100 loops, best of 10: 1.2 ms per loop

One thing to keep in mind is that 'manual looping' in numba is very fast, essentially the same as the C loop used by numpy operations.
In the first example there are two operations: a temporary array (c * b) is allocated and computed, then that temporary array is added to a. In the second example, both calculations happen in the same loop with no intermediate result.
In theory, numba could fuse the loops and optimize #1 to do the same as #2, but it doesn't seem to be doing so. If you just want to optimize numpy ops, numexpr may also be worth a look, as it was designed for exactly that - though it probably won't do any better than the explicit fused loop.
In [17]: import numexpr as ne
In [18]: %timeit -r10 test2(x, o, c)
1000 loops, best of 10: 1.36 ms per loop
In [19]: %timeit ne.evaluate('x + o * c', out=x)
1000 loops, best of 3: 1.43 ms per loop
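If you want to stay with whole-array numpy operations, you can at least remove the temporary allocation by writing c * b into a preallocated buffer with the out= argument. A minimal sketch, assuming a reusable scratch array (test1_noalloc and buf are my own names, not part of the benchmark above); this avoids the allocation but still makes two passes over the data, so the fused numba loop should remain faster:

import numpy as np

x = np.random.random(int(1e6))
o = np.random.random(int(1e6))
c = 3.4

# Preallocate a scratch buffer once and reuse it across calls.
buf = np.empty_like(x)

def test1_noalloc(a, b, c, buf):
    np.multiply(b, c, out=buf)  # buf = c * b, no fresh temporary
    a += buf                    # in-place add, but still a second pass over memory
    return a

test1_noalloc(x, o, c, buf)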

Related

Algorithm for finding the overlap of two arrays

I have a bottleneck in my code which I am struggling with.
Take an array A, of size (N x M), only containing 1s and 0s. I need an algorithm which takes all combinations of two rows of A and counts the overlaps between them.
More specifically, I need a faster alternative to the following algorithm:
for i in range(A.shape[0]):
    for j in range(A.shape[0]):
        a = b = c = d = 0
        for k in range(A.shape[1]):
            if A[i][k] == 1 and A[j][k] == 1:
                a += 1
            if A[i][k] == 0 and A[j][k] == 0:
                b += 1
            if A[i][k] == 1 and A[j][k] == 0:
                c += 1
            if A[i][k] == 0 and A[j][k] == 1:
                d += 1
        print(a, b, c, d)
Thanks for any replies!
Since a, b, c, d are computed inside the loop, I assume you want them for each pair i, j. I am going to build a matrix for each of them, with the element at [i, j] being the corresponding value of a, b, c, d in your loop iteration i, j, without ANY loops. For example, a[i, j] is your value of a in iteration i, j:
A_c = 1-A
a = np.dot(A, A.T)
b = np.dot(A_c, A.T)
c = np.dot(A, A_c.T)
d = np.dot(A_c, A_c.T)
If you care even more about speed, you can factor out and reuse some of the calculations in the equations above; a sketch of that follows.
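For instance, since every row has fixed length M, a single Gram matrix plus the per-row counts of ones is enough to recover all four quantities. A minimal sketch of that factorization (variable names s, N, M are mine; the asserts check it against the four dot products above):

import numpy as np

N, M = 100, 100
A = np.random.randint(2, size=(N, M))

s = A.sum(axis=1)                     # number of ones in each row
a = A @ A.T                           # the only dot product we actually need
c = s[:, None] - a                    # ones in row i, zeros in row j
b = s[None, :] - a                    # zeros in row i, ones in row j
d = M - s[:, None] - s[None, :] + a   # zeros in both rows

# same results as the four explicit dot products
A_c = 1 - A
assert np.array_equal(b, A_c @ A.T)
assert np.array_equal(c, A @ A_c.T)
assert np.array_equal(d, A_c @ A_c.T)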
While the above answer is absolutely correct, I'd like to follow up with a slightly more technical answer - mostly because I was doing something very similar to the problem in your question last week, and learned some cool stuff along the way.
First of all, yes, matrix multiplications and vectorization are the right way to go. However, these can get a bit expensive when the matrices become large. Let me show a small benchmark for N=100 and M=100:
N, M = 100, 100
A = np.random.randint(2, size=(N, M))

def type1():
    A_c = 1 - A
    a = np.dot(A, A.T)
    b = np.dot(A_c, A.T)
    c = np.dot(A, A_c.T)
    d = np.dot(A_c, A_c.T)
    return a, b, c, d
%timeit -n 100 type1()
>>>3.76 ms ± 48.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
One easy speedup comes from the fact that a + b + c + d = M. We don't actually need to compute d, so we can drop one expensive dot product here!
def type2():
    A_c = 1 - A
    a = np.dot(A, A.T)
    b = np.dot(A_c, A.T)
    c = np.dot(A, A_c.T)
    return a, b, c, M - (a + b + c)
%timeit -n 100 type2()
>>>2.81 ms ± 15.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
That shaved off almost a millisecond, but we can do even better. Numpy arrays come in two orders: C-Contiguous and F-Contiguous. You can check this by printing A.flags; A is a C-Contiguous array by default. However, its transpose A.T is represented as an F-Contiguous array, and when we pass them to dot, an internal copy is created for A.T since the ordering doesn't match.
One way to bypass this is by going over to scipy and hooking up our program with BLAS (https://en.wikipedia.org/wiki/Basic_Linear_Algebra_Subprograms), particularly, the general matrix multiplication gemm routine.
from scipy.linalg import blas as B

def type3():
    A_c = 1 - A
    a = B.dgemm(alpha=1.0, a=A, b=A, trans_b=True)
    b = B.dgemm(alpha=1.0, a=A_c, b=A, trans_b=True)
    c = B.dgemm(alpha=1.0, a=A, b=A_c, trans_b=True)
    return a, b, c, M - (a + b + c)
%timeit -n 100 type3()
>>>449 µs ± 27 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
And the time has gone down directly from milliseconds to microseconds, which is pretty awesome.
Just for the sport of it, here is a method that is three times faster than the current fastest. We take advantage of matrix computations being faster on floats, in particular float32. Further, we do only one matrix multiplication, inferring the other numbers by much cheaper means:
def pp():
    A1 = np.count_nonzero(A, 1)
    Af = A.astype('f4')
    a = Af @ Af.T
    b = A1 - a
    c = b.T
    d = M - a - b - c
    return a, b, c, d
[*map(np.array_equal,pp(),type3())]
# [True, True, True, True]
timeit(pp,number=1000)
# 0.14910832402529195
timeit(type3,number=1000)
# 0.4432948770117946

Add scipy sparse row matrix to another sparse matrix

I have a csr_matrix A of shape (70000, 80000) and another csr_matrix B of shape (1, 80000). How can I efficiently add B to every row of A? One idea is to somehow create a sparse matrix B' whose rows are repetitions of B, but numpy.repeat does not work, and using a matrix of ones to create B' is very memory inefficient.
I also tried iterating through every row of A and adding B to it, but that again is very time inefficient.
Update:
I tried something very simple which seems to be much more efficient than the ideas I mentioned above. The idea is to use scipy.sparse.vstack:
C = sparse.vstack([B for x in range(A.shape[0])])
A + C
This performs well for my task! A few more observations: I initially tried an iterative approach where I called vstack multiple times; that approach is slower than calling it just once.
A + B[np.zeros(A.shape[0])] is another way to expand B to the same shape as A.
It has about the same performance and memory footprint as Warren Weckesser's solution:
import numpy as np
import scipy.sparse as sparse
N, M = 70000, 80000
A = sparse.rand(N, M, density=0.001).tocsr()
B = sparse.rand(1, M, density=0.001).tocsr()
In [185]: %timeit u = sparse.csr_matrix(np.ones((A.shape[0], 1), dtype=B.dtype)); Bp = u * B; A + Bp
1 loops, best of 3: 284 ms per loop
In [186]: %timeit A + B[np.zeros(A.shape[0])]
1 loops, best of 3: 280 ms per loop
and appears to be faster than using sparse.vstack:
In [187]: %timeit A + sparse.vstack([B for x in range(A.shape[0])])
1 loops, best of 3: 606 ms per loop
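Spelled out, the ones-column construction from the first timing above (Warren Weckesser's approach) looks like the following sketch, reusing the same setup; u and Bp are the names used in the timing line:

import numpy as np
import scipy.sparse as sparse

N, M = 70000, 80000
A = sparse.rand(N, M, density=0.001).tocsr()
B = sparse.rand(1, M, density=0.001).tocsr()

# An (N, 1) sparse column of ones times the (1, M) row B is a sparse
# outer product: an (N, M) matrix whose every row is a copy of B.
u = sparse.csr_matrix(np.ones((A.shape[0], 1), dtype=B.dtype))
Bp = u * B
C = A + Bp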

Fastest way to populate a matrix with a function on pairs of elements in two numpy vectors?

I have two 1-dimensional numpy vectors va and vb which are being used to populate a matrix by passing all pair combinations to a function.
na = len(va)
nb = len(vb)
D = np.zeros((na, nb))
for i in range(na):
    for j in range(nb):
        D[i, j] = foo(va[i], vb[j])
As it stands, this piece of code takes a very long time to run because va and vb are relatively large (4626 and 737). However, I am hoping this can be improved, given that a similar procedure is performed by the cdist method from scipy with very good performance.
D = cdist(va, vb, metric)
I am obviously aware that scipy has the benefit of running this piece of code in C rather than in Python - but I'm hoping there is some numpy function I'm unaware of that can execute this quickly.
One of the least known numpy functions, for what the docs call functional programming routines, is np.frompyfunc. This creates a numpy ufunc from a Python function. Not some other object that closely simulates a numpy ufunc, but a proper ufunc with all its bells and whistles. While the behavior is in many aspects very similar to np.vectorize, it has some distinct advantages that the following code hopefully highlights:
In [2]: def f(a, b):
   ...:     return a + b
   ...:
In [3]: f_vec = np.vectorize(f)
In [4]: f_ufunc = np.frompyfunc(f, 2, 1) # 2 inputs, 1 output
In [5]: a = np.random.rand(1000)
In [6]: b = np.random.rand(2000)
In [7]: %timeit np.add.outer(a, b) # a baseline for comparison
100 loops, best of 3: 9.89 ms per loop
In [8]: %timeit f_vec(a[:, None], b) # 50x slower than np.add
1 loops, best of 3: 488 ms per loop
In [9]: %timeit f_ufunc(a[:, None], b) # ~20% faster than np.vectorize...
1 loops, best of 3: 425 ms per loop
In [10]: %timeit f_ufunc.outer(a, b) # ...and you get to use ufunc methods
1 loops, best of 3: 427 ms per loop
So while it is still clearly inferior to a properly vectorized implementation, it is a little faster (the looping is in C, but you still have the Python function call overhead).
cdist is fast because it is written in highly-optimized C code (as you already pointed out), and it only supports a small predefined set of metrics.
Since you want to apply the operation generically, to any given foo function, you have no choice but to call that function na * nb times. That part is not likely to be further optimizable.
What's left to optimize are the loops and the indexing. Some suggestions to try out:
Use xrange instead of range (in Python 2.x; in Python 3, range is already generator-like)
Use enumerate instead of range + explicit indexing
Use a Python speed "magic", such as cython or numba, to speed up the looping process (a numba sketch follows below)
If you can make further assumptions about foo, it might be possible to speed it up further.
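For example, if foo can itself be compiled in nopython mode, a numba version of the double loop is a straightforward sketch (the foo shown here is a placeholder, not your actual function):

import numpy as np
from numba import jit

# Hypothetical pairwise function; replace with your own foo.
# Numba can only compile the pattern below if foo itself is jitted.
@jit(nopython=True)
def foo(a, b):
    return (a - b) ** 2

@jit(nopython=True)
def pairwise(va, vb):
    D = np.empty((va.shape[0], vb.shape[0]))
    for i in range(va.shape[0]):
        for j in range(vb.shape[0]):
            D[i, j] = foo(va[i], vb[j])
    return D

va = np.random.rand(4626)
vb = np.random.rand(737)
D = pairwise(va, vb)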
Like @shx2 said, it all depends on what foo is. If you can express it in terms of numpy ufuncs, then use the outer method:
In [11]: N = 400
In [12]: B = np.empty((N, N))
In [13]: x = np.random.random(N)
In [14]: y = np.random.random(N)
In [15]: %%timeit
   ....: for i in range(N):
   ....:     for j in range(N):
   ....:         B[i, j] = x[i] - y[j]
   ....:
10 loops, best of 3: 87.2 ms per loop
In [16]: %timeit A = np.subtract.outer(x, y) # <--- np.subtract is a ufunc
1000 loops, best of 3: 294 µs per loop
Otherwise you can push the looping down to the Cython level. Continuing the trivial example above:
In [45]: %%cython
   ....: cimport cython
   ....: @cython.boundscheck(False)
   ....: @cython.wraparound(False)
   ....: def foo(double[::1] x, double[::1] y, double[:, ::1] out):
   ....:     cdef int i, j
   ....:     for i in xrange(x.shape[0]):
   ....:         for j in xrange(y.shape[0]):
   ....:             out[i, j] = x[i] - y[j]
   ....:
In [46]: foo(x, y, B)
In [47]: np.allclose(B, np.subtract.outer(x, y))
Out[47]: True
In [48]: %timeit foo(x, y, B)
10000 loops, best of 3: 149 µs per loop
The cython example is deliberately made overly simplistic: in reality you might want to add some shape/stride checks, allocate the memory within your function etc.
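One common pattern is to keep the compiled kernel bare and do the checks and allocation in a thin Python wrapper. A minimal sketch along those lines, assuming the %%cython kernel foo defined above (the wrapper name outer_diff is my own):

import numpy as np

def outer_diff(x, y):
    # Thin wrapper around the compiled kernel foo above: validate inputs,
    # allocate the output, then hand contiguous float64 buffers to the kernel.
    x = np.ascontiguousarray(x, dtype=np.float64)
    y = np.ascontiguousarray(y, dtype=np.float64)
    if x.ndim != 1 or y.ndim != 1:
        raise ValueError("x and y must be 1-D arrays")
    out = np.empty((x.shape[0], y.shape[0]))
    foo(x, y, out)
    return out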

How does parakeet differ from Numba? Because I didn't see any improvements on some NumPy expressions

I am wondering if anyone knows some of the key differences between the parakeet and the Numba jit? I am curious because I was comparing Numexpr to Numba and parakeet for a particular expression, one that I expected to perform very well with Numexpr since it is the kind of expression mentioned in its documentation. The results (plot not reproduced here) and the functions I tested (via timeit, minimum of 3 repetitions and 10 loops per function) are:
import numpy as np
import numexpr as ne
from numba import jit as numba_jit
from parakeet import jit as para_jit

def numpy_complex_expr(A, B):
    return A*B - 4.1*A > 2.5*B

def numexpr_complex_expr(A, B):
    return ne.evaluate('A*B-4.1*A > 2.5*B')

@numba_jit
def numba_complex_expr(A, B):
    return A*B - 4.1*A > 2.5*B

@para_jit
def parakeet_complex_expr(A, B):
    return A*B - 4.1*A > 2.5*B
You can also grab the IPython notebook if you'd like to double-check the results on your machine.
In case someone is wondering whether Numba is installed correctly: I think so, since it performed as expected in my previous benchmarks.
As of the current release of Numba (which you are using in your tests), there is incomplete support for ufuncs with the @jit function. On the other hand, you can use @vectorize, and it is faster:
import numpy as np
from numba import jit, vectorize
import numexpr as ne

def numpy_complex_expr(A, B):
    return A*B + 4.1*A > 2.5*B

def numexpr_complex_expr(A, B):
    return ne.evaluate('A*B+4.1*A > 2.5*B')

@jit
def numba_complex_expr(A, B):
    return A*B + 4.1*A > 2.5*B

@vectorize(['u1(float64, float64)'])
def numba_vec(A, B):
    return A*B + 4.1*A > 2.5*B

n = 1000
A = np.random.rand(n, n)
B = np.random.rand(n, n)
Timing results:
%timeit numba_complex_expr(A,B)
1 loops, best of 3: 49.8 ms per loop
%timeit numpy_complex_expr(A,B)
10 loops, best of 3: 43.5 ms per loop
%timeit numexpr_complex_expr(A,B)
100 loops, best of 3: 3.08 ms per loop
%timeit numba_vec(A,B)
100 loops, best of 3: 9.8 ms per loop
If you want to leverage numba to its fullest, then you want to unroll any vectorized operations:
@jit
def numba_unroll2(A, B):
    C = np.empty(A.shape, dtype=np.uint8)
    for i in xrange(A.shape[0]):
        for j in xrange(A.shape[1]):
            C[i, j] = A[i, j]*B[i, j] + 4.1*A[i, j] > 2.5*B[i, j]
    return C
%timeit numba_unroll2(A,B)
100 loops, best of 3: 5.96 ms per loop
Also note that if you set the number of threads that numexpr uses to 1, then you'll see that its main speed advantage is that it's parallelized:
ne.set_num_threads(1)
%timeit numexpr_complex_expr(A,B)
100 loops, best of 3: 8.87 ms per loop
By default numexpr uses ne.detect_number_of_cores() as the number of threads. For my original timing on my machine, it was using 8.
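If you do that experiment yourself, you can restore numexpr's default parallelism afterwards with the same pair of functions mentioned above:

import numexpr as ne

# Restore numexpr's default thread count after the single-threaded test.
ne.set_num_threads(ne.detect_number_of_cores())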

Fast check for NaN in NumPy

I'm looking for the fastest way to check for the occurrence of NaN (np.nan) in a NumPy array X. np.isnan(X) is out of the question, since it builds a boolean array of shape X.shape, which is potentially gigantic.
I tried np.nan in X, but that seems not to work because np.nan != np.nan. Is there a fast and memory-efficient way to do this at all?
(To those who would ask "how gigantic": I can't tell. This is input validation for library code.)
Ray's solution is good. However, on my machine it is about 2.5x faster to use numpy.sum in place of numpy.min:
In [13]: %timeit np.isnan(np.min(x))
1000 loops, best of 3: 244 us per loop
In [14]: %timeit np.isnan(np.sum(x))
10000 loops, best of 3: 97.3 us per loop
Unlike min, sum doesn't require branching, which on modern hardware tends to be pretty expensive. This is probably the reason why sum is faster.
Edit: the above test was performed with a single NaN right in the middle of the array.
It is interesting to note that min is slower in the presence of NaNs than in their absence. It also seems to get slower as NaNs get closer to the start of the array. On the other hand, sum's throughput seems constant regardless of whether there are NaNs and where they're located:
In [40]: x = np.random.rand(100000)
In [41]: %timeit np.isnan(np.min(x))
10000 loops, best of 3: 153 us per loop
In [42]: %timeit np.isnan(np.sum(x))
10000 loops, best of 3: 95.9 us per loop
In [43]: x[50000] = np.nan
In [44]: %timeit np.isnan(np.min(x))
1000 loops, best of 3: 239 us per loop
In [45]: %timeit np.isnan(np.sum(x))
10000 loops, best of 3: 95.8 us per loop
In [46]: x[0] = np.nan
In [47]: %timeit np.isnan(np.min(x))
1000 loops, best of 3: 326 us per loop
In [48]: %timeit np.isnan(np.sum(x))
10000 loops, best of 3: 95.9 us per loop
I think np.isnan(np.min(X)) should do what you want.
There are two general approaches here:
Check each array item for nan and take any.
Apply some cumulative operation that preserves nans (like sum) and check its result.
While the first approach is certainly the cleanest, the heavy optimization of some of the cumulative operations (particularly the ones that are executed in BLAS, like dot) can make those quite fast. Note that dot, like some other BLAS operations, is multithreaded under certain conditions. This explains the difference in speed between different machines.
import numpy as np
import perfplot

def min(a):
    return np.isnan(np.min(a))

def sum(a):
    return np.isnan(np.sum(a))

def dot(a):
    return np.isnan(np.dot(a, a))

def any(a):
    return np.any(np.isnan(a))

def einsum(a):
    return np.isnan(np.einsum("i->", a))

b = perfplot.bench(
    setup=np.random.rand,
    kernels=[min, sum, dot, any, einsum],
    n_range=[2 ** k for k in range(25)],
    xlabel="len(a)",
)
b.save("out.png")
b.show()
Even though there exists an accepted answer, I'd like to demonstrate the following (with Python 2.7.2 and Numpy 1.6.0 on Vista):
In []: x= rand(1e5)
In []: %timeit isnan(x.min())
10000 loops, best of 3: 200 us per loop
In []: %timeit isnan(x.sum())
10000 loops, best of 3: 169 us per loop
In []: %timeit isnan(dot(x, x))
10000 loops, best of 3: 134 us per loop
In []: x[5e4]= NaN
In []: %timeit isnan(x.min())
100 loops, best of 3: 4.47 ms per loop
In []: %timeit isnan(x.sum())
100 loops, best of 3: 6.44 ms per loop
In []: %timeit isnan(dot(x, x))
10000 loops, best of 3: 138 us per loop
Thus, the really efficient way might be heavily dependent on the operating system. Anyway, the dot(.)-based approach seems to be the most stable one.
If you're comfortable with numba, it allows you to create a fast short-circuiting function (one that stops as soon as a NaN is found):
import numba as nb
import math

@nb.njit
def anynan(array):
    array = array.ravel()
    for i in range(array.size):
        if math.isnan(array[i]):
            return True
    return False
If there is no NaN, the function might actually be slower than np.min; I think that's because np.min uses multiprocessing for large arrays:
import numpy as np
array = np.random.random(2000000)
%timeit anynan(array) # 100 loops, best of 3: 2.21 ms per loop
%timeit np.isnan(array.sum()) # 100 loops, best of 3: 4.45 ms per loop
%timeit np.isnan(array.min()) # 1000 loops, best of 3: 1.64 ms per loop
But if there is a NaN in the array, especially if its position is at a low index, then it's much faster:
array = np.random.random(2000000)
array[100] = np.nan
%timeit anynan(array) # 1000000 loops, best of 3: 1.93 µs per loop
%timeit np.isnan(array.sum()) # 100 loops, best of 3: 4.57 ms per loop
%timeit np.isnan(array.min()) # 1000 loops, best of 3: 1.65 ms per loop
Similar results may be achieved with Cython or a C extension; these are a bit more complicated (or readily available as bottleneck.anynan) but ultimately do the same as my anynan function.
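For reference, the bottleneck variant mentioned above is a one-liner (assuming bottleneck is installed):

import numpy as np
import bottleneck as bn

array = np.random.random(2000000)
array[100] = np.nan
bn.anynan(array)  # True; short-circuits at the first NaN like the numba anynan above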
Use .any():
if numpy.isnan(myarray).any()
numpy.isfinite may be better than isnan for checking:
if not np.isfinite(prop).all()
Related to this is the question of how to find the first occurrence of NaN. This is the fastest way to handle that that I know of:
index = next((i for (i,n) in enumerate(iterable) if n!=n), None)
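A short-circuiting compiled variant of the same idea is straightforward with numba; this is my own sketch, mirroring the anynan function shown earlier (first_nan is a hypothetical name):

import numba as nb
import math

# Return the flat index of the first NaN, or -1 if there is none.
@nb.njit
def first_nan(array):
    array = array.ravel()
    for i in range(array.size):
        if math.isnan(array[i]):
            return i
    return -1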
Adding to @nico-schlömer's and @mseifert's answers, I measured the performance of a numba test has_nan with early stopping, compared to some of the functions that traverse the full array.
On my machine, for an array without nans, the break-even happens for ~10^4 elements.
import perfplot
import numpy as np
import numba
import math

def min(a):
    return np.isnan(np.min(a))

def dot(a):
    return np.isnan(np.dot(a, a))

def einsum(a):
    return np.isnan(np.einsum("i->", a))

@numba.njit
def has_nan(a):
    for i in range(a.size):
        if math.isnan(a[i]):
            return True
    return False

def array_with_missing_values(n, p):
    """Return an array of size n with p % of its entries set to NaN.
    Ex: n=1e6, p=1 : 1e4 NaNs assigned at random positions."""
    a = np.random.rand(n)
    idx = np.random.randint(0, len(a), int(p * len(a) / 100))
    a[idx] = np.nan
    return a
#%%
perfplot.show(
    setup=lambda n: array_with_missing_values(n, 0),
    kernels=[min, dot, has_nan],
    n_range=[2 ** k for k in range(20)],
    logx=True,
    logy=True,
    xlabel="len(a)",
)
What happens if the array has NaNs? I investigated the impact of the NaN coverage of the array.
For arrays of length 1,000,000, has_nan becomes the better option if there are ~10^-3 % NaNs (so ~10 NaNs) in the array.
#%%
N = 1000000 # 100000
perfplot.show(
    setup=lambda p: array_with_missing_values(N, p),
    kernels=[min, dot, has_nan],
    n_range=np.array([2 ** k for k in range(20)]) / 2**20 * 0.01,
    logy=True,
    xlabel=f"% of nan in array (N = {N})",
)
If in your application most arrays have NaNs and you're looking for the ones without, then has_nan is the best approach.
Otherwise, dot seems to be the best option.
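To wrap up, a tiny dispatcher following that rule of thumb could look like the sketch below; fast_hasnan and the 10^4 cut-off are my own, purely illustrative choices based on the break-even observed above on one machine, and a NaN-heavy workload would instead call the numba has_nan directly:

import numpy as np

def fast_hasnan(a):
    # For small arrays the exact strategy barely matters; for large,
    # mostly NaN-free float arrays the BLAS-backed dot reduction tends to win.
    a = np.asarray(a, dtype=np.float64).ravel()
    if a.size < 10_000:  # rough break-even point, machine-dependent
        return bool(np.isnan(a).any())
    return bool(np.isnan(np.dot(a, a)))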
