Performance loss in numba compiled logic comparison

What could be a reason for performance degradation in the following numba compiled function for logic comparison:
from numba import njit
t = (True, 'and_', False)
#@njit(boolean(boolean, unicode_type, boolean))
@njit
def f(a,b,c):
    if b == 'and_':
        out = a&c
    elif b == 'or_':
        out = a|c
    return out
x = f(*t)
%timeit f(*t)
#1.78 µs ± 9.52 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
%timeit f.py_func(*t)
#108 ns ± 0.0042 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
To test this at scale as suggested in the answer:
import numpy as np
x = np.random.choice([True,False], 1000000)
y = np.random.choice(["and_","or_"], 1000000)
z = np.random.choice([False, True], 1000000)
#using jit compiled f
def f2(x,y,z):
    L = x.shape[0]
    out = np.empty(L)
    for i in range(L):
        out[i] = f(x[i],y[i],z[i])
    return out
%timeit f2(x,y,z)
#2.79 s ± 86.4 ms per loop
#using pure Python f
def f3(x,y,z):
    L = x.shape[0]
    out = np.empty(L)
    for i in range(L):
        out[i] = f.py_func(x[i],y[i],z[i])
    return out
%timeit f3(x,y,z)
#572 ms ± 24.3 ms per loop
Am I missing something? Is there a way to compile a "fast" version? This is going to be part of a loop executed ~1e6 times.

You are working at too fine a granularity. Numba is not designed for that. Almost all the execution time you see comes from the overhead of wrapping/unwrapping parameters, type checks, Python function wrapping, reference counting, etc. Moreover, the benefit of using Numba is very small here since Numba barely optimizes unicode string operations.
One way to check this hypothesis is to just execute the following trivial function:
@njit
def f(a,b,c):
    return a
x = f(True, 'and_', False)
%timeit f(True, 'and_', False)
Both the trivial function and the original version take 1.34 µs on my machine.
Additionally, you can disassemble the Numba function to see how many instructions are executed to perform just one call and understand more deeply where the overhead comes from.
If you want Numba to be useful, you need to add more work in the compiled function, possibly by working directly on arrays/lists. If this is not possible because of the dynamic nature of the input types, then Numba may not be the right tool here. You could try to rework your code a bit and use PyPy instead. Writing a native C/C++ module may help a bit, but most of the time will be spent manipulating dynamic objects and unicode strings as well as doing type introspection, unless you rewrite the whole code.
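As a rough illustration of "adding more work in the compiled function", here is a minimal sketch that moves the whole loop into one compiled call. It assumes the operator strings can be mapped to integer codes beforehand; the names `AND`, `OR`, `apply_ops` and `op_codes` are made up for this example, and the `try/except` fallback is only there so the sketch still runs without Numba installed.

```python
import numpy as np

try:
    from numba import njit
except ImportError:
    # Assumption for this sketch: without Numba, run the same code as plain Python.
    def njit(func):
        return func

# Hypothetical integer codes standing in for the 'and_' / 'or_' strings.
AND, OR = 0, 1

@njit
def apply_ops(x, op_codes, z):
    # One call crosses the Python/Numba boundary once; the loop runs compiled.
    out = np.empty(x.shape[0], dtype=np.bool_)
    for i in range(x.shape[0]):
        if op_codes[i] == AND:
            out[i] = x[i] & z[i]
        else:
            out[i] = x[i] | z[i]
    return out

rng = np.random.default_rng(0)
x = rng.random(1000) < 0.5
z = rng.random(1000) < 0.5
ops = rng.choice(np.array(["and_", "or_"]), 1000)

# Translate the strings to codes once, outside the compiled function.
op_codes = np.where(ops == "and_", AND, OR)
result = apply_ops(x, op_codes, z)
```

The string comparison happens once in NumPy, so the compiled loop only deals with integers and booleans.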
UPDATE
The above overhead is only paid when transitioning from Python types to Numba (and the other way around). You can see that with the following benchmark:
from numba import njit, jit
@njit
def f(a,b,c):
    if b == 'and_':
        out = a&c
    elif b == 'or_':
        out = a|c
    return out
@jit
def manyCalls(a, b, c):
    res = True
    for i in range(1_000_000):
        res ^= f(a, b, c ^ res)
    return res
t = (True, 'and_', False)
x = manyCalls(*t)
%timeit manyCalls(*t)
Calling manyCalls takes 3.62 ms on my machine. This means each call to f takes 3.6 ns on average (16 cycles), so the overhead is paid only once (when manyCalls is called).

Related

The function np.dot multiplies the GF4 field matrices for a very long time

Multiplying large matrices takes a very long time. How can this problem be solved? I use the galois library with numpy, which I think should work reliably. I also tried implementing my own GF4 arithmetic and multiplying the matrices with numpy, but that takes even longer. Thank you for your reply.
For r = 2, 3, 4, 5, 6 the multiplication is fast; beyond that it takes a long time. To me, these are not very large matrices. This is just a code snippet: given r, I get the sizes n, k of the matrices of a certain family, and I need to multiply matrices with those parameters.
import numpy as np
import galois
def family_Hamming(q,r):
    n = int((q**r-1)/(q-1))
    k = int((q**r-1)/(q-1)-r)
    res = (n,k)
    return res
q = 4
r = 7
n,k = family_Hamming(q,r)
GF = galois.GF(2**2)
#(5454,5454)
a = GF(np.random.randint(4, size=(k, k)))
#(5454,5461)
b = GF(np.random.randint(4, size=(k, n)))
c = np.dot(a,b)
print(c)

I'm not sure if it is actually faster, but np.dot should be used for the dot product of two vectors; for matrix multiplication use A @ B. That's as efficient as you can get with Python as far as I know.

I'm the author of galois. I added performance improvements to matrix multiplication in v0.3.0 by parallelizing the arithmetic over multiple cores. The next performance improvement will come once GPU support is added.
I'm open to other performance improvement suggestions, but as far as I know the algorithm is running as fast as possible on a CPU.
In [1]: import galois
In [2]: GF = galois.GF(2**2)
In [3]: A = GF.Random((300, 400), seed=1)
In [4]: B = GF.Random((400, 500), seed=2)
# v0.2.0
In [5]: %timeit A @ B
1.02 s ± 7.35 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# v0.3.0
In [5]: %timeit A @ B
99 ms ± 1.86 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
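For readers wondering what a GF(4) matrix product actually computes, here is a small plain-NumPy sketch. It is not how galois implements it; it assumes elements are stored as the integers 0..3 with GF(4) built from x^2 + x + 1, so addition is XOR and multiplication is a 4x4 lookup table. The names `MUL` and `gf4_matmul` are made up for this illustration.

```python
import numpy as np

# Multiplication table of GF(4) with elements 0, 1, 2 (= a), 3 (= a + 1),
# where a**2 = a + 1. Addition in GF(4) is bitwise XOR.
MUL = np.array(
    [[0, 0, 0, 0],
     [0, 1, 2, 3],
     [0, 2, 3, 1],
     [0, 3, 1, 2]],
    dtype=np.uint8,
)

def gf4_matmul(a, b):
    """Multiply two matrices with entries in {0, 1, 2, 3} over GF(4)."""
    out = np.zeros((a.shape[0], b.shape[1]), dtype=np.uint8)
    for t in range(a.shape[1]):
        # Outer product of column t of a and row t of b via table lookup,
        # accumulated with XOR (field addition).
        out ^= MUL[a[:, t][:, None], b[t, :][None, :]]
    return out

rng = np.random.default_rng(0)
a = rng.integers(0, 4, size=(5, 6), dtype=np.uint8)
b = rng.integers(0, 4, size=(6, 7), dtype=np.uint8)
c = rng.integers(0, 4, size=(7, 3), dtype=np.uint8)
product = gf4_matmul(a, b)
```

The Python-level loop runs only over the shared dimension, so each step is a vectorized table lookup; whether this beats galois for a given size would have to be measured.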

Try using jax on a CUDA runtime. For example, you can try it out on Google Colab's free GPU. (Open a notebook -> Runtime -> Change runtime type -> GPU).
import jax.numpy as jnp
from jax import device_put
a = GF(np.random.randint(4, size=(k, k)))
b = GF(np.random.randint(4, size=(k, n)))
a, b = device_put(a), device_put(b)
c = jnp.dot(a, b)
c = np.asarray(c)
Timing test:
%timeit jnp.dot(a, b).block_until_ready()
# 765 ms ± 96.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

How to remove nested for loop?

I have the following nested loop:
sum_tot = 0.0
for i in range(len(N)-1):
    for j in range(len(N)-1):
        sum_tot = sum_tot + N[i]**2*N[j]**2*W[i]*W[j]*x_i[j][-1] / (N[j]**2 - x0**2) *(z_i[i][j] - z_j[i][j])*x_j[i][-1] / (N[i]**2 - x0**2)
It's basically a mathematical function that has a double summation. Each sum goes up to the length of N. I've been trying to figure out if there was a way to write this without using a nested for-loop in order to reduce computational time. I tried using list comprehension, but the computational time is similar if not the same. Is there a way to write this expression as matrices to avoid the loops?

Note that your current loop stops at index len(N)-2: range goes up to but not including its argument. You probably mean to write for i in range(len(N)).
It's also difficult to reduce summation: the actual time it takes is based on the number of terms computed, so if you write it a different way which still involves the same number of terms, it will take just as long. However, O(n^2) isn't exactly bad: it looks like the best you can do in this situation unless you find a mathematical simplification of the problem.
You might consider checking this post to gather ways to write out the summation in a neater fashion.

@Kraigolas makes valid points. But let's try a few benchmarks on a dummy, double nested operation, either way. (Hint: Numba might help you speed things up)
Note, I would avoid numpy arrays specifically because all of the cross-product between the range is going to be in memory at once. If this is a massive range, you may run out of memory.
Nested for loops
n = 5000
s1 = 0
for i in range(n):
    for j in range(n):
        s1 += (i/2) + (j/3)
print(s1)
#2.26 s ± 101 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
List comprehension
n = 5000
s2 = 0
s2 = sum([i/2+j/3 for i in range(n) for j in range(n)])
print(s2)
#3.2 s ± 307 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Itertools product
from itertools import product
n = 5000
s3 = 0
for i,j in product(range(n),repeat=2):
    s3 += (i/2) + (j/3)
print(s3)
#2.35 s ± 186 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Note: When using Numba, you would want to run the code at least once before, because the first time it compiles the code and therefore the speed is slow. The real speedup comes second run onwards.
Numba njit (SIMD)
from numba import njit
n=5000
@njit
def f(n):
    s = 0
    for i in range(n):
        for j in range(n):
            s += (i/2) + (j/3)
    return s
s4 = f(n)
#29.4 ms ± 1.85 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Numba njit parallel with prange
An excellent suggestion by @Tim, added to benchmarks
from numba import njit, prange
@njit(parallel=True)
def f(n):
    s = 0
    for i in prange(n):
        for j in prange(n):
            s += (i/2) + (j/3)
    return s
s5 = f(n)
#21.8 ms ± 4.81 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
A significant speedup with Numba, as expected. Maybe try that?
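As a side note on "mathematical simplification": the dummy sum used in these benchmarks happens to be separable, so it can be computed without any loop. This only applies to this toy expression, not necessarily to the sum in the question; the function names below are made up for the sketch.

```python
import math

def nested_sum(n):
    # Reference implementation: the double loop from the benchmarks.
    s = 0.0
    for i in range(n):
        for j in range(n):
            s += (i / 2) + (j / 3)
    return s

def closed_form_sum(n):
    # sum_{i,j} (i/2 + j/3) = n * (1/2 + 1/3) * sum_{i<n} i,
    # and sum_{i<n} i = n * (n - 1) / 2.
    triangular = n * (n - 1) // 2
    return n * triangular * (1 / 2 + 1 / 3)

n = 200
# The two agree up to floating-point rounding.
print(math.isclose(nested_sum(n), closed_form_sum(n), rel_tol=1e-9))
```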

To convert this to matrix calculations, I would suggest combining some terms first.
If these objects are not numpy arrays, it's better to convert them to numpy arrays, as they support element-wise operations.
To convert, simply do
import numpy
N = numpy.array(N)
W = numpy.array(W)
x_i = numpy.array(x_i)
x_j = numpy.array(x_j)
z_i = numpy.array(z_i)
z_j = numpy.array(z_j)
Then,
common_terms = N**2*W/(N**2-x0**2)
i_terms = common_terms*x_j[:,-1]
j_terms = common_terms*x_i[:,-1]
i_j_matrix = z_i - z_j
sum_output = (i_terms.reshape((1,-1)) @ i_j_matrix @ j_terms.reshape((-1,1)))[0,0]
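One way to gain confidence in this rewrite is to compare it against the nested loop on small random data. The sketch below does that; the array sizes and random inputs are assumptions made up for the check, the names follow the question, and both versions sum over the full index range (range(len(N))).

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 6, 4                      # hypothetical sizes for the check
x0 = 1.0
N = rng.uniform(2.0, 3.0, n)     # keeps N**2 - x0**2 away from zero
W = rng.uniform(0.1, 1.0, n)
x_i = rng.normal(size=(n, m))
x_j = rng.normal(size=(n, m))
z_i = rng.normal(size=(n, n))
z_j = rng.normal(size=(n, n))

# Reference: the double loop from the question.
loop_sum = 0.0
for i in range(n):
    for j in range(n):
        loop_sum += (N[i]**2 * N[j]**2 * W[i] * W[j]
                     * x_i[j][-1] / (N[j]**2 - x0**2)
                     * (z_i[i][j] - z_j[i][j])
                     * x_j[i][-1] / (N[i]**2 - x0**2))

# Matrix form: row vector @ matrix @ column vector.
common = N**2 * W / (N**2 - x0**2)
i_terms = common * x_j[:, -1]
j_terms = common * x_i[:, -1]
matrix_sum = i_terms @ (z_i - z_j) @ j_terms

print(np.isclose(loop_sum, matrix_sum))
```

With 1-D vectors the `@` operator already returns a scalar, so the reshape calls are optional.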

Optimizing while cycle with numba for error tolerance

I have a question about using numba for optimization. I am coding a fixed point iteration to calculate the value of a certain array, named gamma, which satisfies the equation f(gamma)=gamma. I am trying to optimize this function with the Python package Numba. It looks as follows.
from numba import jit
@jit
def fixed_point(gamma_guess):
    for i in range(17):
        gamma_guess=f(gamma_guess)
    return gamma_guess
Numba is capable of optimizing this function well, because it knows how many times it will perform the operation (17 times), and it runs fast. But I need to control the error tolerance of my desired gamma; that is, the difference between one gamma and the next one obtained by the fixed point iteration should be less than some number epsilon=0.01. So I tried
import numpy as np
@jit
def fixed_point(gamma_guess):
    err=1000
    gamma_old=gamma_guess.copy()
    while(err>0.01):
        gamma_guess=f(gamma_guess)
        err=np.max(abs(gamma_guess-gamma_old))
        gamma_old=gamma_guess.copy()
    return gamma_guess
It also works and calculates the desired result, but it is much slower than the previous implementation. I think this is because Numba cannot optimize the while loop well, since we do not know when it will stop. Is there a way to optimize this so that it runs as fast as the previous implementation?
Edit:
Here is the f that I'm using
from scipy import fftpack as sp
S=0.01
Amu=0.7
@jit
def f(gammaa,z,zal,kappa):
    ka=sp.diff(kappa)
    gamma0=gammaa
    for i in range(N):
        suma=0
        for j in range(N):
            if (abs(j-i))%2 ==1:
                if((z[i]-z[j])==0):
                    suma+=(gamma0[j]/(z[i]-z[j]))
        gamma0[i]=2.0*Amu*np.real(-(zal[i]/z[i])+zal[i]*(1.0/(2*np.pi*1j))*suma*2*h)+S*ka[i]
    return gamma0
I always use np.ones(2048)*0.5 as initial guess and the other parameters that I pass to my function are z=np.cos(alphas)+1j*(np.sin(alphas)+0.1) , zal=-np.sin(alphas)+1j*np.cos(alphas) , kappa=np.ones(2048) and alphas=np.arange(0,2*np.pi,2*np.pi/2048)

I made a small test script, to see if I could reproduce your error:
import numba as nb
from IPython import get_ipython
ipython = get_ipython()
@nb.jit(nopython=True)
def f(x):
    return (x+1)/x
def fixed_point_for(x):
    for _ in range(17):
        x = f(x)
    return x
@nb.jit(nopython=True)
def fixed_point_for_nb(x):
    for _ in range(17):
        x = f(x)
    return x
def fixed_point_while(x):
    error=1
    x_old = x
    while error>0.01:
        x = f(x)
        error = abs(x_old-x)
        x_old = x
    return x
@nb.jit(nopython=True)
def fixed_point_while_nb(x):
    error=1
    x_old = x
    while error>0.01:
        x = f(x)
        error = abs(x_old-x)
        x_old = x
    return x
print("for loop without numba:")
ipython.magic("%timeit fixed_point_for(10)")
print("for loop with numba:")
ipython.magic("%timeit fixed_point_for_nb(10)")
print("while loop without numba:")
ipython.magic("%timeit fixed_point_while(10)")
print("while loop with numba:")
ipython.magic("%timeit fixed_point_while_nb(10)")
As I don't know your f, I just used the simplest stabilizing function I could think of. I then ran tests with and without numba, in both cases with for and while loops. The results on my machine are:
for loop without numba:
3.35 µs ± 8.72 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
for loop with numba:
282 ns ± 1.07 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
while loop without numba:
1.86 µs ± 7.09 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
while loop with numba:
214 ns ± 1.36 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
The following thoughts arise:
It can't be that your function is not optimizable, since your for loop is fast (at least you said so; have you tested without numba?).
It could be that your function takes many more iterations to converge than you might think.
We may be using different software versions. My versions are:
numba 0.49.0
numpy 1.18.3
python 3.8.2
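To check the second point, it can help to count how many iterations the tolerance-based loop really performs. Below is a minimal pure-Python sketch using the toy f from the test script; `fixed_point_counted`, `tol` and `max_iter` are names invented for this illustration, and the iteration cap is an added safeguard, not something from the question.

```python
def f(x):
    # Toy contraction from the test script; its fixed point is the golden ratio.
    return (x + 1) / x

def fixed_point_counted(x, tol=0.01, max_iter=10_000):
    """Iterate x = f(x) until successive values differ by less than tol.

    Returns the final value and the number of iterations performed.
    """
    for iteration in range(1, max_iter + 1):
        x_new = f(x)
        if abs(x_new - x) < tol:
            return x_new, iteration
        x = x_new
    raise RuntimeError("no convergence within max_iter iterations")

x_star, n_iter = fixed_point_counted(10.0)
print(x_star, n_iter)
```

If the count reported for the real f is far above 17, the while version is simply doing more work than the fixed 17-step loop, regardless of how well Numba compiles it.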

is np.array == num comparison very slow? Can multiprocessing be used to accelerate it?

Is an == comparison between a large np.array and a single number very slow in Python? I used line_profiler to locate the bottleneck in my code. The bottleneck is just a simple comparison between a 1d np.array and a constant number. It accounts for 80% of the total runtime. Did I do anything wrong that makes it so slow? Is there any way to accelerate it?
I tried to use multiprocessing, however, in the test code (snippet 2), using multiprocessing is slower than running in sequence and using map directly. Could anyone explain this phenomenon?
Any comments or suggestions are sincerely appreciated.
Snippet 1:
Line #    Hits         Time   Per Hit  % Time  Line Contents
    38   12635  305767927.0   24200.1    80.0  res = map(logicalEqual,assembly)
def logicalEqual(x):
    return F[:,-1] == x
assembly = [1,2,3,4,5,7,8,9,...,25]
F is an int typed (281900, 6) np.array
Snippet 2:
import numpy as np
from multiprocessing import Pool
import time
y=np.random.randint(2, 20, size=10000000)
def logicalEqual(x):
    return y == x
p=Pool()
start = time.time()
res0=p.map(logicalEqual, [1,2,3,4,5,7,8,9,10,11,12,13,14,15])
# p.close()
# p.join()
runtime = time.time()-start
print(f'runtime using multiprocessing.Pool is {runtime}')
res1 = []
start = time.time()
for x in [1,2,3,4,5,7,8,9,10,11,12,13,14,15]:
    res1.append(logicalEqual(x))
runtime = time.time()-start
print(f'sequential runtime is {runtime}')
start = time.time()
res2=list(map(logicalEqual,[1,2,3,4,5,7,8,9,10,11,12,13,14,15]))
runtime = time.time()-start
print(f'runtime is {runtime}')
runtime using multiprocessing.Pool is 0.3612203598022461
sequential runtime is 0.17401981353759766
runtime is 0.19697237014770508

Array comparison is fast, since it is done in C code, not Python.
x = np.random.rand(1000000)
y = 4.5
test = 0.55
%timeit x == test
386 µs ± 4.68 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit y == test
33.2 ns ± 0.121 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
So, comparing one Python float to another takes 33*10^-9 s, while comparing 1E6 numpy floats takes only 386 µs / 33 ns ~= 11700 times longer, despite comparing 1000000 times as many values. The same is true for ints (377 µs vs 34 ns). But as Dunes mentioned in a comment, comparing a lot of values takes a lot of cycles. There is nothing you can do about that.
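If the goal is one boolean mask per target value, NumPy broadcasting can produce all of them in a single call. This is only a sketch on smaller, made-up data (the names `targets` and `masks` are not from the question); it performs the same number of element comparisons, so it should not be expected to be dramatically faster, and the result needs len(targets) * len(y) bytes of memory.

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.integers(2, 20, size=100_000)
targets = np.array([1, 2, 3, 4, 5, 7, 8, 9, 10, 11, 12, 13, 14, 15])

# Broadcasting a (1, n) row against a (k, 1) column yields a (k, n) boolean
# array whose row i is the mask y == targets[i].
masks = y[None, :] == targets[:, None]

print(masks.shape)
```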

Performance of map vs starmap?

I was trying to make a pure-python (without external dependencies) element-wise comparison of two sequences. My first solution was:
list(map(operator.eq, seq1, seq2))
Then I found starmap function from itertools, which seemed pretty similar to me. But it turned out to be 37% faster on my computer in worst case. As it was not obvious to me, I measured the time necessary to retrieve 1 element from a generator (don't know if this way is correct):
from operator import eq
from itertools import starmap
seq1 = [1,2,3]*10000
seq2 = [1,2,3]*10000
seq2[-1] = 5
gen1 = map(eq, seq1, seq2)
gen2 = starmap(eq, zip(seq1, seq2))
%timeit -n1000 -r10 next(gen1)
%timeit -n1000 -r10 next(gen2)
271 ns ± 1.26 ns per loop (mean ± std. dev. of 10 runs, 1000 loops each)
208 ns ± 1.72 ns per loop (mean ± std. dev. of 10 runs, 1000 loops each)
In retrieving elements the second solution is 24% more performant. After that, they both produce the same results for list. But from somewhere we gain extra 13% in time:
%timeit list(map(eq, seq1, seq2))
%timeit list(starmap(eq, zip(seq1, seq2)))
5.24 ms ± 29.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
3.34 ms ± 84.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
I don't know how to dig deeper when profiling such nested code. So my question is: why is the second iterator so much faster at retrieving elements, and where does the extra 13% in the list function come from?
EDIT:
My first intention was to perform element-wise comparison instead of all, so the all function was replaced with list. This replacement does not affect the timing ratio.
CPython 3.6.2 on Windows 10 (64bit)

There are several factors that contribute (in conjunction) to the observed performance difference:
zip re-uses the returned tuple if it has a reference count of 1 when the next __next__ call is made.
map builds a new tuple that is passed to the "mapped function" every time a __next__ call is made. Actually it probably won't create a new tuple from scratch because Python maintains a storage for unused tuples. But in that case map has to find an unused tuple of the right size.
starmap checks if the next item in the iterable is of type tuple and if so it just passes it on.
Calling a C function from within C code with PyObject_Call won't create a new tuple that is passed to the callee.
So starmap with zip will only use one tuple over and over again that is passed to operator.eq thus reducing the function call overhead immensely. map on the other hand will create a new tuple (or fill a C array from CPython 3.6 on) every time operator.eq is called. So what is actually the speed difference is just the tuple creation overhead.
Instead of linking to the source code I'll provide some Cython code that can be used to verify this:
In [1]: %load_ext cython
In [2]: %%cython
   ...:
   ...: from cpython.ref cimport Py_DECREF
   ...:
   ...: cpdef func(zipper):
   ...:     a = next(zipper)
   ...:     print('a', a)
   ...:     Py_DECREF(a)
   ...:     b = next(zipper)
   ...:     print('a', a)
In [3]: func(zip([1, 2], [1, 2]))
a (1, 1)
a (2, 2)
Yes, tuples aren't really immutable; a simple Py_DECREF was sufficient to "trick" zip into believing no one else holds a reference to the returned tuple!
As for the "tuple-pass-thru":
In [4]: %%cython
   ...:
   ...: def func_inner(*args):
   ...:     print(id(args))
   ...:
   ...: def func(*args):
   ...:     print(id(args))
   ...:     func_inner(*args)
In [5]: func(1, 2)
1404350461320
1404350461320
So the tuple is passed right through (just because these are defined as C functions!) This doesn't happen for pure Python functions:
In [6]: def func_inner(*args):
   ...:     print(id(args))
   ...:
   ...: def func(*args):
   ...:     print(id(args))
   ...:     func_inner(*args)
   ...:
In [7]: func(1, 2)
1404350436488
1404352833800
Note that it also doesn't happen if the called function isn't a C function even if called from a C function:
In [8]: %%cython
   ...:
   ...: def func_inner_c(*args):
   ...:     print(id(args))
   ...:
   ...: def func(inner, *args):
   ...:     print(id(args))
   ...:     inner(*args)
   ...:
In [9]: def func_inner_py(*args):
   ...:     print(id(args))
   ...:
   ...:
In [10]: func(func_inner_py, 1, 2)
1404350471944
1404353010184
In [11]: func(func_inner_c, 1, 2)
1404344354824
1404344354824
So there are a lot of "coincidences" leading up to the point that starmap with zip is faster than calling map with multiple arguments when the called function is also a C function...
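To reproduce the comparison outside IPython, the standard-library timeit module is enough. The sketch below uses the sequences from the question; the repetition count of 20 is an arbitrary choice, and the measured ratio will depend on the Python version and machine, so no particular numbers are claimed here.

```python
from itertools import starmap
from operator import eq
from timeit import timeit

seq1 = [1, 2, 3] * 10000
seq2 = [1, 2, 3] * 10000
seq2[-1] = 5

# Both spellings produce the same element-wise comparison.
via_map = list(map(eq, seq1, seq2))
via_starmap = list(starmap(eq, zip(seq1, seq2)))

# Time each variant over the same number of repetitions.
t_map = timeit(lambda: list(map(eq, seq1, seq2)), number=20)
t_starmap = timeit(lambda: list(starmap(eq, zip(seq1, seq2))), number=20)

print(f"map: {t_map:.4f} s, starmap+zip: {t_starmap:.4f} s")
```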

One difference I can notice is how map retrieves items from the iterables. Both map and zip create a tuple of iterators, one for each iterable passed. zip maintains a result tuple internally that is populated every time next is called; map, on the other hand, creates a new array* with each next call and deallocates it afterwards.
*As pointed out by MSeifert, until 3.5.4 map_next used to allocate a new Python tuple every time. This changed in 3.6: for up to 5 iterables the C stack is used, and for anything larger the heap is used. Related PRs: Issue #27809: map_next() uses fast call and Add _PY_FASTCALL_SMALL_STACK constant | Issue: https://bugs.python.org/issue27809
