Why is numba faster than numpy here?

I can't figure out why numba is beating numpy here (over 3x). Did I make some fundamental error in how I am benchmarking here? Seems like the perfect situation for numpy, no? Note that as a check, I also ran a variation combining numba and numpy (not shown), which as expected was the same as running numpy without numba.
(btw this is a followup question to: Fastest way to numerically process 2d-array: dataframe vs series vs array vs numba )
import numpy as np
from numba import jit
nobs = 10000
def proc_numpy(x,y,z):
    x = x*2 - ( y * 55 )   # these 4 lines represent use cases
    y = x + y*2            # where the processing time is mostly
    z = x + y + 99         # a function of, say, 50 to 200 lines
    z = z * ( z - .88 )    # of fairly simple numerical operations
    return z
@jit
def proc_numba(xx,yy,zz):
    for j in range(nobs):        # as pointed out by Llopis, this for loop
        x, y = xx[j], yy[j]      # is not needed here. it is here by
                                 # accident because in the original benchmarks
        x = x*2 - ( y * 55 )     # I was doing data creation inside the function
        y = x + y*2              # instead of passing it in as an array
        z = x + y + 99           # in any case, this redundant code seems to
        z = z * ( z - .88 )      # have something to do with the code running
                                 # faster. without the redundant code, the
        zz[j] = z                # numba and numpy functions are exactly the same.
    return zz
x = np.random.randn(nobs)
y = np.random.randn(nobs)
z = np.zeros(nobs)
res_numpy = proc_numpy(x,y,z)
z = np.zeros(nobs)
res_numba = proc_numba(x,y,z)
results:
In [356]: np.all( res_numpy == res_numba )
Out[356]: True
In [357]: %timeit proc_numpy(x,y,z)
10000 loops, best of 3: 105 µs per loop
In [358]: %timeit proc_numba(x,y,z)
10000 loops, best of 3: 28.6 µs per loop
I ran this on a 2012 macbook air (13.3), standard anaconda distribution. I can provide more detail on my setup if it's relevant.

I think this question highlights (somewhat) the limitations of calling out to precompiled functions from a higher level language. Suppose in C++ you write something like:
for (int i = 0; i != N; ++i) a[i] = b[i] + c[i] + 2 * d[i];
The compiler sees all this at compile time, the whole expression. It can do a lot of really intelligent things here, including optimizing out temporaries (and loop unrolling).
In Python, however, consider what's happening: when you use numpy, each "+" uses operator overloading on the np array types (which are just thin wrappers around contiguous blocks of memory, i.e. arrays in the low-level sense) and calls out to a Fortran (or C++) function that does the addition super fast. But it does just one addition, and spits out a temporary array.
So while numpy is awesome, convenient, and pretty fast, it slows things down here: it does call into a fast compiled language for the hard work, but the compiler never gets to see the whole program; it is only fed isolated little pieces. And that is hugely detrimental to a compiler, especially a modern one that is intelligent enough to retire multiple instructions per cycle when the code is written well.
Numba, on the other hand, uses a JIT. At runtime it can figure out that the temporaries are not needed and optimize them away. Basically, Numba gets to compile the program as a whole, whereas numpy can only call small atomic blocks which themselves have been pre-compiled.

When you ask numpy to do:
x = x*2 - ( y * 55 )
It is internally translated to something like:
tmp1 = y * 55
tmp2 = x * 2
tmp3 = tmp2 - tmp1
x = tmp3
Each of those temps is an array that has to be allocated, operated on, and then deallocated. Numba, on the other hand, handles things one item at a time and doesn't have to deal with that overhead.
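To make that concrete, the loop the JIT effectively compiles carries only scalar temporaries. Here is a minimal sketch of the idea (essentially the proc_numba function above without the stray redundant lines; this is an illustration, not Numba's actual generated code):
from numba import jit

@jit(nopython=True)
def proc_fused(x, y, zz):
    # one pass over the data: intermediate values live in scalars
    # (registers), not in freshly allocated temporary arrays
    for j in range(x.shape[0]):
        xj = x[j]*2 - y[j]*55
        yj = xj + y[j]*2
        zj = xj + yj + 99
        zz[j] = zj * (zj - .88)
    return zz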

Numba is generally faster than NumPy and even Cython (at least on Linux).
Here's a plot from "Numba vs. Cython: Take 2" (the figure itself is not reproduced here).
In that benchmark, pairwise distances were computed, so the result may depend on the algorithm.
Note that this may come out differently on other platforms; see the WinPython Cython tutorial for a WinPython comparison (figure likewise not reproduced).

Instead of cluttering the original question further, I'll add some more stuff here in response to Jeff, Jaime, Veedrac:
def proc_numpy2(x,y,z):
    np.subtract( np.multiply(x,2), np.multiply(y,55), out=x)
    np.add( x, np.multiply(y,2), out=y)
    np.add( x, np.add(y,99), out=z)
    np.multiply( z, np.subtract(z,.88), out=z)
    return z
def proc_numpy3(x,y,z):
    x *= 2
    x -= y*55
    y *= 2
    y += x
    z = x + y
    z += 99
    z *= (z-.88)
    return z
My machine seems to be running a tad faster today than yesterday, so here they are in comparison to proc_numpy (proc_numba is timing the same as before):
In [611]: %timeit proc_numpy(x,y,z)
10000 loops, best of 3: 103 µs per loop
In [612]: %timeit proc_numpy2(x,y,z)
10000 loops, best of 3: 92.5 µs per loop
In [613]: %timeit proc_numpy3(x,y,z)
10000 loops, best of 3: 85.1 µs per loop
Note that as I was writing proc_numpy2/3 I started seeing some side effects, so I made copies of x, y, z and passed the copies instead of re-using x, y, z. Also, the different functions sometimes had slight differences in precision, so some of them didn't pass the equality tests, but if you diff them they are really close. I assume that is due to creating (or not creating) temp variables. E.g.:
In [458]: (res_numpy2 - res_numba)[:12]
Out[458]:
array([ -7.27595761e-12, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, -7.27595761e-12, 0.00000000e+00])
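For reference, a minimal sketch of how the copies were passed and compared (the exact calls aren't shown above, so treat the names here as illustrative):
# hypothetical re-run: pass copies so the in-place versions don't clobber
# the original x, y, z
res_numpy2 = proc_numpy2(x.copy(), y.copy(), z.copy())
res_numpy3 = proc_numpy3(x.copy(), y.copy(), z.copy())
# compare with a tolerance rather than strict equality, since the presence or
# absence of temporaries changes the rounding slightly
print(np.allclose(res_numpy2, res_numba))
print(np.allclose(res_numpy3, res_numba))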
Also, it's pretty minor (about 10 µs) but using float literals (55. instead of 55) will also save a little time for numpy but doesn't help numba.

Related

Equivalent of `math.remainder` in NumPy?

What is the equivalent of the math.remainder() function in NumPy? Basically I would like to compute y = x - np.around(x) for a NumPy array x. (It is important that all elements of y have absolute value at most 1/2.) Looking at the NumPy documentation, neither np.fmod nor np.remainder does the job.
Obviously, I thought about writing x - np.around(x) but I am afraid that for large values of x subtraction produces floating-point inaccuracies. For example:
import numpy as np
a = np.arange(1000) * 1e-9
x = a / 1e-9
y = x - np.around(x)
This should produce an all-zero vector y, but in practice there are some errors (which get larger if I increase the size of the arrays from 1000 to 10000).
The reason I am asking this question is to figure out whether there is a NumPy function for this purpose that directly calls the C math library's remainder (as math.remainder must do), so as to minimize the floating-point errors.
I don't think this currently exists in numpy. That said, the automagic numpy.vectorize call does the right thing for me, e.g.:
import math
import numpy as np
ieee_remainder = np.vectorize(math.remainder)
ieee_remainder(np.arange(5), 5)
this broadcasts over the parameters nicely, giving:
array([ 0., 1., 2., -2., -1.])
which might be what you want.
Performance is about 10 times slower than you'd get with a native implementation. Given an array of 10,000 elements, my laptop takes approximately the following times:
1200 µs for ieee_remainder
150 µs for a Cython version I hacked together
120 µs for a C program doing a naive loop
80 µs for numpy.fmod
Given the complexity of glibc's remainder implementation, I'm amazed it's as fast as it is.
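For completeness, here is a pure-NumPy sketch that mimics IEEE remainder semantics (round half to even). This is my own addition and an assumption that it matches math.remainder closely enough for ordinary finite inputs; it has not been checked against edge cases:
import numpy as np

def ieee_remainder_np(x, p):
    # np.round rounds halves to even, which matches the IEEE remainder definition
    n = np.round(np.asarray(x, dtype=float) / p)
    return x - n * p

y = ieee_remainder_np(np.arange(5), 5)
print(y)   # [ 0.  1.  2. -2. -1.]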

Computing the spectral norms of ~1m Hermitian matrices: `numpy.linalg.norm` is too slow

I would like to calculate the spectral norms of N 8x8 Hermitian matrices, with N being close to 1E6. As an example, take these 1 million random complex 8x8 matrices:
import numpy as np
array = np.random.rand(8, 8, 10**6) + 1j*np.random.rand(8, 8, 10**6)
It currently takes me almost 10 seconds using numpy.linalg.norm:
np.linalg.norm(array, ord=2, axis=(0,1))
I tried using the Cython code below, but this gave me only a negligible performance improvement:
import numpy as np
cimport numpy as np
cimport cython
np.import_array()
DTYPE = np.complex64
@cython.boundscheck(False)
@cython.wraparound(False)
def function(np.ndarray[np.complex64_t, ndim=3] Array):
    assert Array.dtype == DTYPE
    cdef int shape0 = Array.shape[2]
    cdef np.ndarray[np.float32_t, ndim=1] normarray = np.zeros(shape0, dtype=np.float32)
    normarray = np.linalg.norm(Array, ord=2, axis=(0, 1))
    return normarray
I also tried numba and some other scipy functions (such as scipy.linalg.svdvals) to calculate the singular values of these matrices. Everything is still too slow.
Is it not possible to make this any faster? Is numpy already optimized to the extent that no speed gains are possible by using Cython or numba? Or is my code highly inefficient and I am doing something fundamentally wrong?
I noticed that only two of my CPU cores are 100% utilized while doing the calculation. With that in mind, I looked at these previous StackOverflow questions:
why isn't numpy.mean multithreaded?
Why does multiprocessing use only a single core after I import numpy?
multithreaded blas in python/numpy (didn't help)
and several others, but unfortunately I still don't have a solution.
I considered splitting my array into smaller chunks, and processing these in parallel (perhaps on the GPU using CUDA). Is there a way within numpy/Python to do this? I don't yet know where the bottleneck is in my code, i.e. whether it is CPU or memory-bound, or perhaps something different.
Digging into the code for np.linalg.norm, I've deduced that, for these parameters, it is finding the maximum of the matrix singular values over the N dimension.
First generate a small sample array. Make N the first dimension to eliminate a rollaxis operation:
In [268]: N=10; A1 = np.random.rand(N,8,8)+1j*np.random.rand(N,8,8)
In [269]: np.linalg.norm(A1,ord=2,axis=(1,2))
Out[269]:
array([ 5.87718306, 5.54662999, 6.15018125, 5.869058 , 5.80882818,
5.86060462, 6.04997992, 5.85681085, 5.71243196, 5.58533323])
the equivalent operation:
In [270]: np.amax(np.linalg.svd(A1,compute_uv=0),axis=-1)
Out[270]:
array([ 5.87718306, 5.54662999, 6.15018125, 5.869058 , 5.80882818,
5.86060462, 6.04997992, 5.85681085, 5.71243196, 5.58533323])
same values, and same time:
In [271]: timeit np.linalg.norm(A1,ord=2,axis=(1,2))
1000 loops, best of 3: 398 µs per loop
In [272]: timeit np.amax(np.linalg.svd(A1,compute_uv=0),axis=-1)
1000 loops, best of 3: 389 µs per loop
And most of the time is spent in svd, which produces an (N,8) array:
In [273]: timeit np.linalg.svd(A1,compute_uv=0)
1000 loops, best of 3: 366 µs per loop
So if you want to speed up the norm, you have to look further into speeding up this svd. svd uses the np.linalg._umath_linalg functions, which live in a compiled .so file.
The C code is at https://github.com/numpy/numpy/blob/97c35365beda55c6dead8c50df785eb857f843f0/numpy/linalg/umath_linalg.c.src
It sure looks like this is the fastest you'll get. There's no Python-level loop. Any looping is in that C code, or in the LAPACK function it calls.
np.linalg.norm(A, ord=2) computes the spectral norm by finding the largest singular value using SVD. However, since your 8x8 submatrices are Hermitian, their largest singular values will be equal to the maximum of their absolute eigenvalues (see here):
import numpy as np
def random_symmetric(N, k):
    A = np.random.randn(N, k, k)
    A += A.transpose(0, 2, 1)
    return A
N = 100000
k = 8
A = random_symmetric(N, k)
norm1 = np.abs(np.linalg.eigvalsh(A)).max(1)
norm2 = np.linalg.norm(A, ord=2, axis=(1, 2))
print(np.allclose(norm1, norm2))
# True
Eigendecomposition on a Hermitian matrix is quite a bit faster than SVD:
In [1]: %%timeit A = random_symmetric(N, k)
np.linalg.norm(A, ord=2, axis=(1, 2))
....:
1 loops, best of 3: 1.54 s per loop
In [2]: %%timeit A = random_symmetric(N, k)
np.abs(np.linalg.eigvalsh(A)).max(1)
....:
1 loops, best of 3: 757 ms per loop
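Applied to a complex (8, 8, N) stack like the one in the question, the same idea would look roughly like the sketch below. Note that the sketch symmetrizes the random matrices first, because the question's random test matrices are not actually Hermitian and eigvalsh silently uses only one triangle of each matrix:
import numpy as np

n = 10**5                                     # smaller than 1e6, just to keep the sketch quick
A = np.random.rand(8, 8, n) + 1j*np.random.rand(8, 8, n)
A = A.transpose(2, 0, 1)                      # shape (n, 8, 8): matrices on the last two axes
H = A + A.conj().transpose(0, 2, 1)           # make the stack genuinely Hermitian
norms = np.abs(np.linalg.eigvalsh(H)).max(axis=1)   # spectral norm of each matrix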

Many small matrices speed-up for loops

I have a large coordinate grid (vectors a and b), for which I generate and solve a (10x10) matrix equation. Is there a way for scipy.linalg.solve to accept vector input? So far my solution has been to run for loops over the coordinate arrays.
import time
import numpy as np
import scipy.linalg
N = 10
a = np.linspace(0, 1, 10**3)
b = np.linspace(0, 1, 2*10**3)
A = np.random.random((N, N)) # input matrix, not static
def f(a,b,n):  # array-filling function
    return a*b*n
def sol(x,y):  # matrix solver
    D = np.arange(0,N)
    B = f(x,y,D)**2 + f(x-1, y+1, D)  # source vector
    X = scipy.linalg.solve(A,B)
    return X  # output an N-size vector
start = time.time()
answer = np.zeros(shape=(a.size, b.size))  # predefine output array
for egg in range(a.size):  # an ugly double-for cycle
    for ham in range(b.size):
        aa = a[egg]
        bb = b[ham]
        answer[egg,ham] = sol(aa,bb)[0]
print time.time() - start
To illustrate my point about generalized ufuncs, and the ability to eliminate the loop over egg and ham, consider the following piece of code:
import numpy as np
A = np.random.randn(4,4,10,10)
AI = np.linalg.inv(A)
#check that generalized ufuncs work as expected
I = np.einsum('xyij,xyjk->xyik', A, AI)
print np.allclose(I, np.eye(10)[np.newaxis, np.newaxis, :, :])
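Along the same lines, np.linalg.solve is also a generalized ufunc, so a whole stack of systems can be solved in one call. A minimal sketch, assuming you can afford to build the stacked arrays up front:
import numpy as np

M, N = 1000, 10
A_stack = np.random.randn(M, N, N)   # one (possibly different) matrix per grid point
b_stack = np.random.randn(M, N)      # one right-hand side per grid point

# np.linalg.solve broadcasts over the leading axis: b_stack is treated as a
# stack of vectors because its ndim is one less than A_stack's
x_stack = np.linalg.solve(A_stack, b_stack)          # shape (M, N)

# sanity check: A_i x_i == b_i for every i
print(np.allclose(np.einsum('mij,mj->mi', A_stack, x_stack), b_stack))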
@yevgeniy You are right, efficiently solving multiple independent linear systems A x = b with scipy is a bit tricky (assuming an A array that changes for every iteration).
For instance, here is a benchmark for solving 1000 systems of the form A x = b, where A is a 10x10 matrix and b a 10-element vector. Surprisingly, the approach of putting all of this into one block-diagonal matrix and calling scipy.linalg.solve once is indeed slower overall, both with dense and with sparse matrices.
import numpy as np
from scipy.linalg import block_diag, solve
from scipy.sparse import block_diag as sp_block_diag
from scipy.sparse.linalg import spsolve
N = 10
M = 1000 # number of coordinates
Ai = np.random.randn(N, N)  # we can compute the inverse here,
# but let's assume that the Ai are different matrices inside the for loop
bi = np.random.randn(N)
%timeit [solve(Ai, bi) for el in range(M)]
# 10 loops, best of 3: 32.1 ms per loop
Afull = sp_block_diag([Ai]*M, format='csr')
bfull = np.tile(bi, M)
%timeit Afull = sp_block_diag([Ai]*M, format='csr')
%timeit spsolve(Afull, bfull)
# 1 loops, best of 3: 303 ms per loop
# 100 loops, best of 3: 5.55 ms per loop
Afull = block_diag(*[Ai]*M)
%timeit Afull = block_diag(*[Ai]*M)
%timeit solve(Afull, bfull)
# 100 loops, best of 3: 14.1 ms per loop
# 1 loops, best of 3: 23.6 s per loop
Solving the linear system with sparse arrays is faster, but the time to create this block-diagonal array is actually very slow. As for dense arrays, they are simply slower in this case (and take lots of RAM).
Maybe I'm missing something about how to make this work efficiently with sparse arrays, but if you are keeping the for loops, there are two things that you could do to optimize.
From pure Python, look at the source code of scipy.linalg.solve: remove unnecessary tests and factor all repeated operations out of your loops. For instance, assuming your matrices are not symmetric positive-definite, we could do:
from scipy.linalg import get_lapack_funcs
from numpy.linalg import LinAlgError
gesv, = get_lapack_funcs(('gesv',), (Ai, bi))
def solve_opt(A, b, gesv=gesv):
    # not sure if copying A and B is necessary, but just in case (faster if arrays are not copied)
    lu, piv, x, info = gesv(A.copy(), b.copy(), overwrite_a=False, overwrite_b=False)
    if info == 0:
        return x
    if info > 0:
        raise LinAlgError("singular matrix")
    raise ValueError('illegal value in %d-th argument of internal gesv|posv' % -info)
%timeit [solve(Ai, bi) for el in range(M)]
%timeit [solve_opt(Ai, bi) for el in range(M)]
# 10 loops, best of 3: 30.1 ms per loop
# 100 loops, best of 3: 3.77 ms per loop
which results in a 6.5x speed up.
If you need even better performance, you would have to port this for loop to Cython and interface the gesv LAPACK function directly in C, as discussed here, or better, use the Cython API for BLAS/LAPACK in SciPy 0.16.
Edit: As @Eelco Hoogendoorn mentioned, if your A matrix is fixed, there is a much simpler and more efficient approach.
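The referenced comment isn't quoted above, but one plausible reading (an assumption on my part, not the commenter's exact suggestion) is that with a fixed A you can either gather all the right-hand sides into one matrix or factor A once and reuse the factorization; a minimal sketch:
import numpy as np
from scipy.linalg import lu_factor, lu_solve, solve

N, M = 10, 1000
A = np.random.randn(N, N)    # the fixed matrix
B = np.random.randn(N, M)    # all right-hand sides stacked as columns

# option 1: a single call with multiple right-hand sides
X1 = solve(A, B)             # shape (N, M)

# option 2: factor once, then reuse the LU factorization for every solve
lu, piv = lu_factor(A)
X2 = lu_solve((lu, piv), B)

print(np.allclose(X1, X2))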

Efficient way to work with python and numpy for simple to medium calculations

Assume a few functions that are called many times. These functions do things such as multiply, divide, and add on a 3D vector (a 1x3 array).
Given:
import numpy as np
import math
x = [0,1,2]
y = [3,2,1]
a = 1.2
Based on my testing, it is faster for python math library to do:
math.sin(a)
than for numpy to do:
np.sin(a)
Additionally, simple algorithms such as normalization are faster in plain Python than with np.linalg.norm, using the method discussed in this conversation.
Now if we add a bit of complexity to the data, such as doing matrix multiplication for 3d, where we have a rotation matrix of 3x3 that is then multiplied by another matrix and transposed, numpy starts to gain the advantage.
Currently, doing operations such as:
L = math.sqrt(V[0] * V[0] + V[1] * V[1] + V[2] * V[2])
V = (V[0] / L, V[1] / L, V[2] / L)
are much faster when called repeatedly (I assume because there is no overhead from creating a numpy array).
However, in order to use the numpy matrix functions, the array needs to be a numpy array. Using np.asarray() has significant overhead, which leaves the efficiency on a borderline between not using numpy at all, accepting the overhead of creating the array, or accepting the slower numpy math functions on scalars and using numpy everywhere.
Of course I can try out all of these methods, but in a large algorithm the possible combinations are too many. Is there any strategy for efficiently switching between Python and numpy in this situation?
EDIT:
From some comments, it seems the question is not clear enough. I understand numpy is more efficient with big sets, which is why this question exists. The algorithm is NOT ONLY calculating sine. The following code might make it easier to understand:
x = [2,1,2]
math.sin(x[0])
L = math.sqrt(x[0] * x[0] + x[1] * x[1] + x[2] * x[2])
V = (x[0] / L, x[1] / L, x[2] / L)
math.sin(V[0])
#Do something else here
When working with single values, and small arrays, the np.array overhead certainly slows things down compared to using the math. equivalents. But with many values, the array approach quickly becomes better.
For example in Ipython I can time sin for 50 values:
In [444]: %%timeit x=np.arange(50)
np.sin(x)
100000 loops, best of 3: 8.5 us per loop
In [445]: %%timeit x=range(50)
[math.sin(i) for i in x]
100000 loops, best of 3: 18.1 us per loop
Your V calculation is 20x faster than
Va=Va/math.sqrt((Va*Va).sum())
But if I do that on 20 sets of values, the times are about equal. And I don't have to change the expression to handle Va=np.ones((20,3), float). To time your V I had to wrap it in a function and time [foo(i) for i in V].
You might even gain more speed by doing the indexing only once, e.g.
v1, v2, v3 = V
L = math.sqrt(v1*v1+ v2*v2+v3*v3)
V = (v1/L, v2/L, v3/L)
I'd expect more gain when using arrays than lists.
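For instance, if many vectors are collected into one (n, 3) array, the normalization from the question vectorizes directly; a minimal sketch (not benchmarked against the asker's exact workload):
import numpy as np

V = np.random.randn(20, 3)                       # 20 vectors at once
L = np.sqrt((V * V).sum(axis=1, keepdims=True))  # per-row lengths, shape (20, 1)
V_unit = V / L                                   # broadcasts the division row-wise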

Elegant expression for row-wise dot product of two matrices

I have two 2-d numpy arrays with the same dimensions, A and B, and am trying to calculate the row-wise dot product of them. I could do:
np.sum(A * B, axis=1)
Is there another way to do this so that numpy is doing the row-wise dot product in one step rather than two? Maybe with tensordot?
This is a good application for numpy.einsum.
a = np.random.randint(0, 5, size=(6, 4))
b = np.random.randint(0, 5, size=(6, 4))
res1 = np.einsum('ij, ij->i', a, b)
res2 = np.sum(a*b, axis=1)
print(res1)
# [18 6 20 9 16 24]
print(np.allclose(res1, res2))
# True
einsum also tends to be a bit faster.
a = np.random.normal(size=(5000, 1000))
b = np.random.normal(size=(5000, 1000))
%timeit np.einsum('ij, ij->i', a, b)
# 100 loops, best of 3: 8.4 ms per loop
%timeit np.sum(a*b, axis=1)
# 10 loops, best of 3: 28.4 ms per loop
Even faster is inner1d from numpy.core.umath_tests (see the benchmark plot, which is not reproduced here).
Code to reproduce the plot:
import numpy
from numpy.core.umath_tests import inner1d
import perfplot
perfplot.show(
    setup=lambda n: (numpy.random.rand(n, 3), numpy.random.rand(n, 3)),
    kernels=[
        lambda a: numpy.sum(a[0]*a[1], axis=1),
        lambda a: numpy.einsum('ij, ij->i', a[0], a[1]),
        lambda a: inner1d(a[0], a[1])
    ],
    labels=['sum', 'einsum', 'inner1d'],
    n_range=[2**k for k in range(20)],
    xlabel='len(a), len(b)',
    logx=True,
    logy=True
)
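One caveat (my addition, not part of the original answer): numpy.core.umath_tests is a private module, and inner1d is deprecated or unavailable in newer NumPy releases, so for new code the einsum form above or an equivalent matmul expression is a safer bet; a minimal sketch:
import numpy as np

a = np.random.rand(1000, 3)
b = np.random.rand(1000, 3)

# row-wise dot product via matmul: (n,1,k) @ (n,k,1) -> (n,1,1), then flatten
rowdot = (a[:, None, :] @ b[:, :, None]).ravel()

print(np.allclose(rowdot, np.einsum('ij, ij->i', a, b)))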
Even though it is significantly slower for even moderate data sizes, I would use
np.diag(A.dot(B.T))
while you are developing the library and worry about optimizing it later when it will run in a production setting, or after the unit tests are written.
To most people who come upon your code, this will be more understandable than einsum, and also doesn't require you to break some best practices by embedding your calculation within a mini DSL string to serve as the argument to some function call.
I agree that computing the off-diagonal elements is worth avoiding for large cases. It would have to be really, really large for me to care about that, though, and the trade-off of paying the awful price of expressing the calculation in an embedded string in einsum is pretty severe.
