I am trying to do some timing comparisons using numba.
What I don't understand about the following mwe.py is why I get such different results:
from __future__ import print_function
import numpy as np
from numba import autojit
import time

def timethis(method):
    '''decorator for timing function calls'''
    def timed(*args, **kwargs):
        ts = time.time()
        result = method(*args, **kwargs)
        te = time.time()
        print('{!r} {:f} s'.format(method.__name__, te - ts))
        return result
    return timed
def pairwise_pure(x):
    '''sample function: compute pairwise distances, see jakevdp.github.io/blog/2013/06/15/numba-vs-cython-take-2/'''
    M, N = x.shape
    D = np.empty((M, M), dtype=np.float64)
    for i in range(M):
        for j in range(M):
            d = 0.
            for k in range(N):
                tmp = x[i, k] - x[j, k]
                d += tmp * tmp
            D[i, j] = np.sqrt(d)
    return D
# first version
@timethis
@autojit
def pairwise_numba(args):
    return pairwise_pure(args)

# second version
@timethis
def pairwise_numba_alt(args):
    return autojit(pairwise_pure)(args)

x = np.random.random((1000, 10))
pairwise_numba(x)
pairwise_numba_alt(x)
Running python3 mwe.py gives this output:
'pairwise_numba' 5.971631 s
'pairwise_numba_alt' 0.191500 s
In the first version, I decorate the method with timethis to measure the timing and with autojit to speed up the code, whereas in the second one I decorate the function with timethis only and call autojit(...) inside it.
Does someone have an explanation?
Actually, the documentation explicitly states that, for optimization, every function called "inside" a decorated function should be decorated as well, or the call is not optimized.
For many functions, such as numpy functions, that isn't necessary, since they are already highly optimized, but for native Python functions it is.
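For example (a minimal sketch; autojit has since been removed from numba, so this uses njit, its modern equivalent):

import numpy as np
from numba import njit

@njit
def inner(x):
    # compiled, so calls from other jitted functions stay in machine code
    return x * x

@njit
def outer(arr):
    total = 0.0
    for i in range(arr.size):
        total += inner(arr[i])  # resolved as a fast, typed call
    return total

outer(np.arange(10.0))  # the first call triggers compilation of both functions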
Related
I'm trying to return a nested list, but I'm running into a conversion error. Below is a small piece of code that reproduces the error.
from numba import njit, prange

@njit("ListType(ListType(ListType(int32)))(int32, int32)", fastmath=True, parallel=True, cache=True)
def test(x, y):
    a = []
    for i in prange(10):
        b = []
        for j in range(4):
            c = []
            for k in range(5):
                c.append(k)
            b.append(c)
        a.append(b)
    return a
Error
I try to avoid using empty lists with numba, mainly because an empty list cannot be typed; check out nb.typeof([]).
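For example (on the numba versions I have tried, typing an empty list raises an error, while a non-empty one infers its element type):

import numba as nb

print(nb.typeof([1, 2, 3]))  # reflected list(int64): element type inferred
try:
    nb.typeof([])            # no elements, so there is no element type to infer
except Exception as exc:     # a ValueError on the versions I have seen
    print(exc)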
I am not sure whether your output can be preallocated, but you could consider arrays; there would also be massive performance benefits. Here is an attempt:
from numba import njit, prange, int32
import numpy as np

@njit(int32[:, :, :](int32, int32), fastmath=True, parallel=True, cache=True)
def test(x, y):
    out = np.zeros((10, x, y), dtype=int32)
    for i in prange(10):
        for j in range(x):
            for k in range(y):
                out[i, j, k] = k
    return out
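A quick check of the array version (assuming the signature above compiles on your setup):

out = test(4, 5)
print(out.shape)  # (10, 4, 5)
print(out[0, 0])  # [0 1 2 3 4]: each innermost row counts k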
That said, you might indeed need lists for your application, in which case this answer might not be of much use.
This worked for me.
from numba import njit, prange
from numba.typed import List

@njit(fastmath=True, parallel=True, cache=True)
def test(x, y):
    a = List()
    for i in prange(10):
        b = List()
        for j in range(4):
            c = List()
            for k in range(5):
                c.append(k)
            b.append(c)
        a.append(b)
    return a
Your signature is fine, but you need to match the type of list that you create inside the function: a numba.typed.List instead of [].
from numba import njit, prange
from numba.typed import List
from numba.types import int32

@njit("ListType(ListType(ListType(int32)))(int32, int32)", fastmath=True, parallel=True, cache=True)
def test(x, y):
    a = List.empty_list(List.empty_list(List.empty_list(int32)))
    for i in prange(10):
        b = List.empty_list(List.empty_list(int32))
        for j in range(4):
            c = List.empty_list(int32)
            for k in range(5):
                c.append(int32(k))
            b.append(c)
        a.append(b)
    return a
That said, I wouldn't expect much from appending to a List inside a prange loop in this case; appends are not a parallel-friendly operation.
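If the caller needs plain Python lists, the typed Lists convert back easily (a small usage sketch; x and y are unused in the function above, so any ints work):

res = test(0, 0)
plain = [[list(c) for c in b] for b in res]  # typed Lists are iterable
print(plain[0][0])  # [0, 1, 2, 3, 4]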
The numba documentation specifies that compiled functions can call, and be inlined into, other compiled functions. This does not seem to be true when compiling ahead of time.
For example, here are two functions that compute the inner dot product between two vector arrays; one does the actual product, the other calls it within a loop:
# Module test.py
import numpy as np
from numba import njit, float64

@njit(float64(float64[:], float64[:]))
def product(a, b):
    prod = 0
    for i in range(a.size):
        prod += a[i] * b[i]
    return prod

@njit(float64[:](float64[:, :], float64[:, :]))
def n_inner1d(a, b):
    prod = np.empty(a.shape[0])
    for i in range(a.shape[0]):
        prod[i] = product(a[i], b[i])
    return prod
As is, I can do import test and use test.n_inner1d perfectly fine. Now let's make some modifications so this can be compiled to a .pyd:
# Module test.py
import numpy as np
from numba import float64
from numba.pycc import CC

cc = CC('test')
cc.verbose = True

@cc.export('product', 'float64(float64[:], float64[:])')
def product(a, b):
    prod = 0
    for i in range(a.size):
        prod += a[i] * b[i]
    return prod

@cc.export('n_inner1d', 'float64[:](float64[:,:], float64[:,:])')
def n_inner1d(a, b):
    prod = np.empty(a.shape[0])
    for i in range(a.shape[0]):
        prod[i] = product(a[i], b[i])
    return prod

if __name__ == "__main__":
    cc.compile()
When trying to compile, I get the following error:
# python test.py
Failed at nopython (nopython frontend)
Untyped global name 'product': cannot determine Numba type of <type 'function'>
File "test.py", line 20
QUESTION
For a module compiled ahead of time, is it possible for functions defined within to call one another and be used inline?
I reached out to the numba devs, and they kindly answered that adding the @njit decorator after @cc.export makes function-call type resolution work.
So, for example:
@cc.export('product', 'float64(float64[:], float64[:])')
@njit
def product(a, b):
    prod = 0
    for i in range(a.size):
        prod += a[i] * b[i]
    return prod
will make the product function available to the others. The caveat is that the inlined function may well end up with a different type signature from the one declared for AOT compilation.
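With both decorators in place, compiling and using the module looks like this (a sketch; the compiled extension's name comes from CC('test') above):

# Build the extension first: python test.py
# Then, from another script in the same directory:
import numpy as np
import test

a = np.random.random((100, 3))
b = np.random.random((100, 3))
print(test.n_inner1d(a, b)[:5])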
What is the best way to get better, dynamic control over the decorators, choosing between numba.cuda.jit, numba.jit, and none (pure Python)? [Please note that a project can have tens or hundreds of functions, so this should be easy to apply to all of them.]
Here is an example from the numba website:
import numba as nb
import numpy as np

# global control of this --> @nb.jit or @nb.cuda.jit or none
# some functions with @nb.jit or @nb.cuda.jit with kwargs like (nopython=True, **other_kwargs)
def sum2d(arr):
    M, N = arr.shape
    result = 0.0
    for i in range(M):
        for j in range(N):
            result += arr[i, j]
    return result

a = np.arange(81).reshape(9, 9)
sum2d(a)
You may want something more sophisticated, but a relatively simple solution is redefining jit based on settings. For example
def _noop_jit(f=None, *args, **kwargs):
    """returns function unmodified, discarding decorator args"""
    if f is None:
        return lambda x: x
    return f

# some config flag
if settings.PURE_PYTHON_MODE:
    jit = _noop_jit
else:  # etc
    from numba import jit

@jit(nopython=True)
def f(a):
    return a + 1
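A variation that reads the flag from the environment instead of a settings module (a sketch; NUMBA_PURE_PYTHON is a name I made up, not a flag numba itself recognizes):

import os

if os.environ.get('NUMBA_PURE_PYTHON') == '1':
    jit = _noop_jit  # the no-op decorator defined above
else:
    from numba import jit

@jit(nopython=True)
def g(a):
    return a * 2

Note that numba.cuda.jit is not a drop-in replacement for numba.jit: CUDA kernels cannot return values and are launched with a grid configuration, so switching backends usually needs per-function wrappers rather than a simple rebinding of jit.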
I'm testing some functionality of IPython and I think I'm doing something wrong.
I'm testing three different ways to execute some math operations:
1st: using @parallel.parallel(view=dview, block=True) and the function map
2nd: using a single-core function (a normal Python function)
3rd: using the client's load-balanced view
I have this code:
from IPython import parallel
import numpy as np
import multiprocessing as mp
import time

rc = parallel.Client(block=True)
dview = rc[:]
lbview = rc.load_balanced_view()

@parallel.require(np)
def suma_pll(a, b):
    return a + b

@parallel.require(np)
def producto_pll(a, b):
    return a * b

def suma(a, b):
    return a + b

def producto(a, b):
    return a * b

@parallel.parallel(view=dview, block=True)
@parallel.require(np)
@parallel.require(suma_pll)
@parallel.require(producto_pll)
def a_calc_pll(a, b):
    result = []
    for i, v in enumerate(a):
        result.append(
            producto_pll(suma_pll(a[i], a[i]), suma_pll(b[i], b[i])) // 100
        )
    return result

@parallel.require(suma)
@parallel.require(producto)
def a_calc_remote(a, b):
    result = []
    for i, v in enumerate(a):
        result.append(
            producto(suma(a[i], a[i]), suma(b[i], b[i])) // 100
        )
    return result

def a_calc(a, b):
    return producto(suma(a, a), suma(b, b)) // 100

def main_pll(a, b):
    return a_calc_pll.map(a, b)

def main_lb(a, b):
    c = lbview.map(a_calc_remote, a, b, block=True)
    return c

def main(a, b):
    c = []
    for i in range(len(a)):
        c += [a_calc(a[i], b[i]).tolist()]
    return c

if __name__ == '__main__':
    a, b = [], []
    for i in range(1, 1000):
        a.append(np.array(range(i, i + 10)))
        b.append(np.array(range(i + 10, i + 20)))

    t = time.time()
    c1 = main_pll(a, b)
    t1 = time.time() - t
    t = time.time()
    c2 = main(a, b)
    t2 = time.time() - t
    t = time.time()
    c3 = main_lb(a, b)
    t3 = time.time() - t

    print(str(c1) == str(c2))
    print(str(c3) == str(c2))
    print('%f secs (multicore)' % t1)
    print('%f secs (singlecore)' % t2)
    print('%f secs (multicore_load_balance)' % t3)
My result is:
True
True
0.040741 secs (multicore)
0.004004 secs (singlecore)
1.286592 secs (multicore_load_balance)
Why are my multicore routines slower than my single core routine? What is wrong with this approach? What can I do to fix it?
Some information: python3.4.1, ipython 2.2.0, numpy 1.9.0, ipcluster starting 8 Engines with LocalEngineSetLauncher
It seems to me that you are trying to parallelise something that takes too little time to execute on a single core. In Python, any form of "true" parallelism is multi-process, which means that you have to spawn multiple Python interpreters, transfer the data via pickling/unpickling, etc.
This is going to result in a noticeable overhead for small workloads. On my system, just starting and then immediately stopping a Python interpreter takes around 1/100 of a second:
# time python -c "pass"
real 0m0.018s
user 0m0.012s
sys 0m0.005s
I am not sure what the decorators you are using are doing behind the scenes, but as you can see just setting up the infrastructure for parallel work can take quite a bit of time.
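A quick way to see the dispatch overhead on its own (a sketch, assuming an ipcluster is already running):

import time
from IPython import parallel

rc = parallel.Client()
dview = rc[:]

t = time.time()
dview.map_sync(lambda x: x, range(10))  # trivial work: the time is almost all overhead
print('remote map: %f secs' % (time.time() - t))

t = time.time()
list(map(lambda x: x, range(10)))
print('local map: %f secs' % (time.time() - t))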
Edit:
On further inspection, it looks like you are already setting up the workers before running your code, so the overhead hinted at above might be out of the picture.
You are, though, moving data to the worker processes: two lists of 1000 NumPy arrays. Pickling a and b to a string on my system takes ~0.13 seconds with pickle and ~0.046 seconds with cPickle. The pickling time can be reduced by storing your arrays in NumPy arrays instead of lists:
a = np.array(a)
b = np.array(b)
This cuts down the cPickle time to ~0.029 seconds.
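A quick way to measure the serialization cost yourself (a sketch; absolute numbers will differ per machine):

import pickle
import time
import numpy as np

a = [np.array(range(i, i + 10)) for i in range(1, 1000)]

t = time.time()
pickle.dumps(a)
print('list of arrays: %f secs' % (time.time() - t))

t = time.time()
pickle.dumps(np.array(a))
print('single 2-D array: %f secs' % (time.time() - t))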
I rewrote my neural net from pure Python to numpy, but now it is running even slower. So I tried these two functions:
import numpy as np

def d():
    a = [1,2,3,4,5]
    b = [10,20,30,40,50]
    c = [i*j for i,j in zip(a,b)]
    return c

def e():
    a = np.array([1,2,3,4,5])
    b = np.array([10,20,30,40,50])
    c = a*b
    return c
timeit d = 1.77135205057
timeit e = 17.2464673758
Numpy is 10 times slower. Why is that, and how do I use numpy properly?
I would assume that the discrepancy is because you're constructing lists and arrays in e whereas you're only constructing lists in d. Consider:
import numpy as np

def d():
    a = [1,2,3,4,5]
    b = [10,20,30,40,50]
    c = [i*j for i,j in zip(a,b)]
    return c

def e():
    a = np.array([1,2,3,4,5])
    b = np.array([10,20,30,40,50])
    c = a*b
    return c

# Warning: functions with mutable default arguments are below.
# This code is only for testing and would be bad practice in production!
def f(a=[1,2,3,4,5], b=[10,20,30,40,50]):
    c = [i*j for i,j in zip(a,b)]
    return c

def g(a=np.array([1,2,3,4,5]), b=np.array([10,20,30,40,50])):
    c = a*b
    return c

import timeit
print(timeit.timeit('d()', 'from __main__ import d'))
print(timeit.timeit('e()', 'from __main__ import e'))
print(timeit.timeit('f()', 'from __main__ import f'))
print(timeit.timeit('g()', 'from __main__ import g'))
Here the functions f and g avoid recreating the lists/arrays each time around and we get very similar performance:
1.53083586693
15.8963699341
1.33564996719
1.69556999207
Note that list-comp + zip still wins. However, if we make the arrays sufficiently big, numpy wins hands down:
t1 = [1,2,3,4,5] * 100
t2 = [10,20,30,40,50] * 100
t3 = np.array(t1)
t4 = np.array(t2)
print(timeit.timeit('f(t1,t2)', 'from __main__ import f,t1,t2', number=10000))
print(timeit.timeit('g(t3,t4)', 'from __main__ import g,t3,t4', number=10000))
My results are:
0.602419137955
0.0263929367065
import time, numpy

def d():
    a = range(100000)
    b = range(0, 1000000, 10)
    c = [i*j for i,j in zip(a,b)]
    return c

def e():
    a = numpy.array(range(100000))
    b = numpy.array(range(0, 1000000, 10))
    c = a*b
    return c

# python ['0.04s', '0.04s', '0.04s']
# numpy ['0.02s', '0.02s', '0.02s']
Try it with bigger arrays: even with the overhead of creating the arrays, numpy is much faster.
Numpy data structures are slower at appending/constructing. Here are some tests:
from timeit import Timer

setup1 = '''import numpy as np
a = np.array([])'''
stmnt1 = 'np.append(a, 1)'
t1 = Timer(stmnt1, setup1)

setup2 = 'l = list()'
stmnt2 = 'l.append(1)'
t2 = Timer(stmnt2, setup2)

print('appending to empty list:')
print(t1.repeat(number=1000))
print(t2.repeat(number=1000))

setup1 = '''import numpy as np
a = np.array(range(999999))'''
stmnt1 = 'np.append(a, 1)'
t1 = Timer(stmnt1, setup1)

setup2 = 'l = [x for x in range(999999)]'
stmnt2 = 'l.append(1)'
t2 = Timer(stmnt2, setup2)

print('appending to large list:')
print(t1.repeat(number=1000))
print(t2.repeat(number=1000))
Results:
appending to empty list:
[0.008171333983972538, 0.0076482562944814175, 0.007862921943675175]
[0.00015624398517267296, 0.0001191077336243837, 0.000118654852507942]
appending to large list:
[2.8521017080411304, 2.8518707386717446, 2.8022625940577477]
[0.0001643958452675065, 0.00017888804099541744, 0.00016711313196715594]
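The gap exists because np.append copies the entire array on every call, while list.append is amortized O(1). The idiomatic fix is to preallocate the array, or to build a Python list and convert once at the end (a minimal sketch):

import numpy as np

n = 100000

# Preallocate when the final size is known:
out = np.empty(n)
for i in range(n):
    out[i] = i  # O(1) writes, no copying

# Otherwise, grow a list and convert once:
tmp = []
for i in range(n):
    tmp.append(i)
out2 = np.array(tmp)  # a single copy at the end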
I don't think the point is that numpy is slow; you must also take into account the time required to write and debug a program. The longer the program, the more difficult it is to find problems or add new features (programmer time). Using a higher-level language therefore allows, at equal development time and skill, the creation of programs that are complex and potentially more efficient.
Anyway, some interesting tools for optimization are:
- Psyco is a JIT (just-in-time) compiler that optimizes code at runtime.
- Numexpr: parallelization is a good way to speed up the execution of a program, provided it is sufficiently separable.
- weave is a module within SciPy for communicating between Python and C. One of its functions is blitz, which takes a line of Python, transparently translates it to C, and executes an optimized version on each call. The first conversion takes around a second, but it generally achieves higher speeds than all of the above. It's not bytecode like Numexpr or Psyco, nor a C interface like NumPy; it's your own function written directly in C, fully compiled and optimized.