I am trying to run the following function on multiple cores for a speed-up using Numba. Unfortunately, it seems to run on only one core when I tested it. Can someone explain why, and whether there is a way to get it running on multiple cores?
Minimal working example:
import numpy as np
import numba

a = np.random.rand(100000)

@numba.jit(nopython=True, parallel=True)
def func(x):
    result = np.zeros_like(x)
    for delta in range(1, len(x)):
        thisresult = 0
        for i in range(delta, len(x)):
            thisresult += (x[i] - x[i-delta])**2
        result[delta] = thisresult / (len(x) - delta)
    return result

print(func(a))
Explicit Parallelization
I would always recommend parallelizing code explicitly. With parallel=True, Numba tries to parallelize some code parts automatically, but that won't always work or lead to the best performance.
import numpy as np
import numba

a = np.random.rand(100000)

@numba.jit(nopython=True, parallel=True)
def func(x):
    result = np.zeros_like(x, dtype=x.dtype)
    for delta in numba.prange(1, len(x)):
        thisresult = 0
        for i in range(delta, len(x)):
            thisresult += (x[i] - x[i-delta])**2
        result[delta] = thisresult / (len(x) - delta)
    return result

print(func(a))
For more details have a look at the documentation.
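If you want to confirm that the prange loop was actually parallelized, a quick check (assuming the decorated func above and a reasonably recent Numba version) is to ask Numba how many threads its parallel backend uses and to print the parallel-transform diagnostics of the compiled function:

print(numba.get_num_threads())      # threads available to Numba's parallel backend

func(a)                             # call once so the function gets compiled
func.parallel_diagnostics(level=1)  # report of what Numba managed to parallelize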
Related
I'm testing Python's "multiprocessing" module. I'm trying to compute pi using a Monte Carlo technique on the 12 threads of my Ryzen 5 5600.
The problem is that my CPU is not fully used: only 47% is used. I leave my code below. Changing the value of n_cpu leads to not-so-different core usage, whereas increasing N by one order of magnitude can increase the load up to 77%... but I believed that N shouldn't affect the number of processes...
Please help me understand how to correctly parallelize my code, thanks.
import random
import math
import numpy as np
import multiprocessing
from multiprocessing import Pool

def sample(n):
    n_inside_circle = 0
    for i in range(n):
        x = random.random()
        y = random.random()
        if x**2 + y**2 < 1.0:
            n_inside_circle += 1
    return n_inside_circle

N_test = 1000
N = 12*10**4
n_cpu = 12
pi = 0

for j in range(N_test):
    part_count = [int(N/n_cpu)] * n_cpu
    pool = Pool(processes=n_cpu)
    results = pool.map(sample, part_count)
    pool.close()
    pi += sum(results)/(N*1.0)*4

print(pi/N_test)
The lack of CPU use is because you are creating a new process pool on every iteration and sending each one a small chunk of work, instead of sending all the work at once to a single process pool.
Simply using
pool = Pool(processes=n_cpu)
for j in range(N_test):
    part_count = [int(N/n_cpu)] * n_cpu
    results = pool.map(sample, part_count)
    pi += sum(results)/(N*1.0)*4
pool.close()
should give you some speedup.
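A note on portability: with the spawn start method (the default on Windows and on recent macOS), the pool creation has to sit under an if __name__ == "__main__": guard, otherwise every worker re-imports the script and tries to start its own pool. A minimal sketch of the reorganized loop under that assumption, reusing the sample function defined above:

from multiprocessing import Pool

if __name__ == "__main__":
    N_test = 1000
    N = 12*10**4
    n_cpu = 12
    pi = 0
    # create the pool once and reuse it for all iterations
    with Pool(processes=n_cpu) as pool:
        for j in range(N_test):
            part_count = [int(N / n_cpu)] * n_cpu
            results = pool.map(sample, part_count)
            pi += sum(results) / (N * 1.0) * 4
    print(pi / N_test)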
To optimize this further:
We can change the way the jobs are split up so that a single worker handles more samples per call.
We can use NumPy's vectorized random functions, which run faster than random.random().
Finally, for the last bit of speed, we can use Numba with a thread pool to reduce overhead even more.
import time
import numpy as np
from multiprocessing.pool import ThreadPool
from numba import jit

@jit(nogil=True, parallel=True, fastmath=True)
def sample(n):
    x = np.random.random(n)
    y = np.random.random(n)
    inside_circle = np.square(x) + np.square(y) < 1.0
    return int(np.sum(inside_circle))

total_samples = int(3e9)
function_limit = int(1e7)
n_cpu = 12
pi = 0

assert total_samples % function_limit == 0

start = time.perf_counter()
with ThreadPool(n_cpu) as pool:
    part_count = [function_limit] * (total_samples//function_limit)
    results = pool.map(sample, part_count)
    pi = 4*sum(results)/(total_samples)
end = time.perf_counter()

print(pi)
print(round(end-start, 3), "seconds taken")
resulting in
3.141589756
6.982 seconds taken
The following is a normal Python loop (I copied the example from the official docs: https://numba.readthedocs.io/en/stable/user/parallel.html)
import numpy as np

def two_d_array_reduction_prod(n):
    shp = (13, 17)
    result1 = 2 * np.ones(shp, np.int_)
    tmp = 2 * np.ones_like(result1)
    for i in range(n):
        result1 *= tmp
    return result1
I called the function like:
two_d_array_reduction_prod(50000)
It takes around 0.7482060070033185 seconds.
Numba parallel code
import numpy as np
import numba as nb

@nb.njit(parallel=True)
def two_d_array_reduction_prod(n):
    shp = (13, 17)
    result1 = 2 * np.ones(shp, np.int_)
    tmp = 2 * np.ones_like(result1)
    for i in nb.prange(n):
        result1 *= tmp
    return result1
I called the function like:
two_d_array_reduction_prod(50000)
It takes 3.9858204890042543 seconds.
My environment:
Amazon Linux 2, x86_64 processor
8 CPUs
32G memory
I can't replicate this. Using parallel=True gives a slight performance improvement, but every variant is significantly faster than pure Python for me.
Using:
from numba import njit, prange
import numpy as np

def two_d_array_reduction_prod(n):
    shp = (13, 17)
    result1 = 2 * np.ones(shp, np.int_)
    tmp = 2 * np.ones_like(result1)
    for i in prange(n):  # or: for i in range(n)
        result1 *= tmp
    return result1

two_d_array_reduction_prod_numba = njit(parallel=False)(two_d_array_reduction_prod)
Even with parallel=False and prange, or with parallel=False and range, I get over a 3x improvement. All these timings are done with a warm-up, pre-compiling the Numba function first.
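For reference, a minimal sketch of this kind of timing, assuming the definitions above (the first call triggers JIT compilation, so it is kept out of the timed run; outside a jitted function, prange simply behaves like range):

import time

two_d_array_reduction_prod_numba(10)  # warm-up: triggers compilation

start = time.perf_counter()
two_d_array_reduction_prod_numba(50000)
print("numba:", time.perf_counter() - start, "seconds")

start = time.perf_counter()
two_d_array_reduction_prod(50000)  # pure Python version
print("pure python:", time.perf_counter() - start, "seconds")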
I have a piecewise function with 3 parts that I'm trying to write in Python using Numba's @njit decorator. The function is calculated over an array. The function is defined by:
@njit(parallel=True)
def f(x_vec):
    N = len(x_vec)
    y_vec = np.zeros(N)
    for i in prange(N):
        x = x_vec[i]
        if x <= 2000:
            y = 64/x
        elif x >= 4000:
            y = np.log(x)
        else:
            y = np.log(1.2*x)
        y_vec[i] = y
    return y_vec
I'm using Numba to make this code very fast and run it on all 8 threads of my CPU.
Now, my question is: if I wanted to define each part of the function separately as f1, f2 and f3, and put those inside the if statements (while still benefiting from Numba's speed), how can I do that? The reason is that the subfunctions can be more complicated and I don't want to make my code hard to read. I want it to be as fast as this one (or only slightly slower, but not a lot).
In order to test the function, we can use this array:
Np=10000000
x_vec=100*np.power(1e8/100,np.random.rand(Np))
%timeit f(x_vec)  # 0.06 s on an Intel Core i7 3610
For completeness, the following libraries are imported:
import numpy as np
from numba import njit, prange
So in this case, the functions would be:
def f1(x):
    return 64/x

def f2(x):
    return np.log(x)

def f3(x):
    return np.log(1.2*x)
The actual functions are these, which compute the smooth-pipe friction factor for the laminar, transition and turbulent regimes:
@njit
def f1(x):
    return 64/x

@njit
def f2(x):
    # x is the Reynolds number (Re), y is the Darcy friction factor (f)
    # for transition, we can assume Re=4000 (max possible friction)
    y = 0.02
    y = (-2/np.log(10))*np.log(2.51/(4000*np.sqrt(y)))
    return 1/(y*y)

@njit
def f3(x):  # Colebrook-White approximation
    # x is the Reynolds number (Re), y is the Darcy friction factor (f)
    y = 0.02
    y = (-2/np.log(10))*np.log(2.51/(x*np.sqrt(y)))
    return 1/(y*y)
Thanks for the contributions from everyone. This is the NumPy solution (the last three lines are slow for some reason, but it doesn't need warm-up):
y = np.empty_like(x_vec)
a1=np.where(x_vec<=2000,True,False)
a3=np.where(x_vec>=4000,True,False)
a2=~(a1 | a3)
y[a1] = f1(x_vec[a1])
y[a2] = f2(x_vec[a2])
y[a3] = f3(x_vec[a3])
The fastest Numba solution, which allows passing function names and takes advantage of prange (but is hindered by JIT warm-up), is this, and it can be as fast as the first solution (top of the question):
@njit(parallel=True)
def f(x_vec, f1, f2, f3):
    N = len(x_vec)
    y_vec = np.zeros(N)
    for i in prange(N):
        x = x_vec[i]
        if x <= 2000:
            y = f1(x)
        elif x >= 4000:
            y = f3(x)
        else:
            y = f2(x)
        y_vec[i] = y
    return y_vec
You can write f() to accept function parameters, e.g.:
@njit
def f(arr, f1, f2, f3):
    N = len(arr)
    y_vec = np.zeros(N)
    for i in range(N):
        x = arr[i]
        if x <= 2000:
            y = f1(x)
        elif x >= 4000:
            y = f2(x)
        else:
            y = f3(x)
        y_vec[i] = y
    return y_vec
Make sure that the functions you pass are Numba-compatible.
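As an illustration, a usage sketch under the assumption that the subfunctions are themselves compiled with @njit (Numba can take jit-compiled functions as arguments to another jit-compiled function):

import numpy as np
from numba import njit

@njit
def f1(x):
    return 64 / x

@njit
def f2(x):
    return np.log(x)

@njit
def f3(x):
    return np.log(1.2 * x)

x_vec = 100 * np.power(1e8 / 100, np.random.rand(10000))
y_vec = f(x_vec, f1, f2, f3)  # f being the jitted function defined above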
Is this too slow? This can be done in pure numpy, by avoiding loops and using masks for indexing:
def f(x):
    y = np.empty_like(x)
    mask = x <= 2000
    y[mask] = 64 / x[mask]
    mask = (x > 2000) & (x < 4000)
    y[mask] = np.log(1.2 * x[mask])
    mask = x >= 4000
    y[mask] = np.log(x[mask])
    return y
You can also handle the "else" case by first applying the middle branch to the whole array without any mask; it's probably a bit slower:
def f_else(x):
    y = np.log(1.2 * x)
    mask = x <= 2000
    y[mask] = 64 / x[mask]
    mask = x >= 4000
    y[mask] = np.log(x[mask])
    return y
With
Np=10000000
x_vec=100*np.power(1e8/100,np.random.rand(Np))
I get (laptop with i7-8850H with 6 + 6VT cores)
f1: 1 loop, best of 5: 294 ms per loop
f_else: 1 loop, best of 5: 400 ms per loop
If your intended subfunctions are mainly NumPy operations, this will still be fast.
I just wrote some code to compare the speed of calculation between a function written with NumPy and a function that uses ufuncify from SymPy:
import numpy as np
from sympy import symbols, Matrix
from sympy.utilities.autowrap import ufuncify

u, v, e, a1, a0 = symbols('u v e a1 a0')
dudt = u - u**3 - v
dvdt = e*(u - a1*v - a0)
p = {'a1': 0.5, 'a0': 1.5, 'e': 0.1}

eqs = Matrix([dudt, dvdt])
numeqs = eqs.subs([(a1, p['a1']), (a0, p['a0']), (e, p['e'])])

print(eqs)
print(numeqs)

dudt = ufuncify([u, v], numeqs[0])
dvdt = ufuncify([u, v], numeqs[1])

def syrhs(u, v):
    return dudt(u, v), dvdt(u, v)

def nprhs(u, v, p):
    dudt = u - u**3 - v
    dvdt = p['e']*(u - p['a1']*v - p['a0'])
    return dudt, dvdt

def compare(n=10000):
    import time
    timer_np = 0
    timer_sy = 0
    error = np.zeros(n)
    for i in range(n):
        u = np.random.random((128, 128))
        v = np.random.random((128, 128))
        start_time = time.time()
        npcalc = np.ravel(nprhs(u, v, p))
        mid_time = time.time()
        sycalc = np.ravel(syrhs(u, v))
        end_time = time.time()
        timer_np += (mid_time - start_time)
        timer_sy += (end_time - mid_time)
        error[i] = np.max(np.abs(npcalc - sycalc))
    print("Max difference is", np.max(error), ", and mean difference is", np.mean(error))
    print("Average speed for numpy", timer_np/float(n))
    print("Average speed for sympy", timer_sy/float(n))
On my machine the result is:
In [21]: compare()
Max difference is 5.55111512313e-17 , and mean difference is 5.55111512313e-17
Average speed for numpy 0.00128133814335
Average speed for sympy 0.00127074036598
Any suggestions on how to make either of the above functions faster are welcome!
After further exploration it seems that ufuncify and regular NumPy functions give more or less the same speed of computation. Using Numba or printing to a Theano function did not result in faster code. So the other option to make things faster is either Cython or wrapping C or FORTRAN code.
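For reference, the Numba attempt mentioned above might look roughly like the sketch below (an illustration only, with the parameters passed as plain scalars since nopython mode cannot use the Python dict p directly; because the right-hand side is already a handful of vectorized NumPy operations, there is little for the JIT to gain here):

import numpy as np
from numba import njit

@njit
def nprhs_numba(u, v, e, a1, a0):
    # same right-hand side as nprhs, with the parameters as scalar arguments
    dudt = u - u**3 - v
    dvdt = e*(u - a1*v - a0)
    return dudt, dvdt

u = np.random.random((128, 128))
v = np.random.random((128, 128))
out = nprhs_numba(u, v, 0.1, 0.5, 1.5)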
While evaluating possibilities to speed up Python code, I came across this Stack Overflow post: Comparing Python, Numpy, Numba and C++ for matrix multiplication
I was quite impressed with Numba's performance and implemented some of our functions in Numba. Unfortunately the speedup was only there for very small matrices, and for large matrices the code became very slow compared to our previous scipy.sparse implementation. I thought this made sense, but nevertheless I repeated the test from the original post (code below).
When using a 1000 x 1000 matrix, according to that post even the Python implementation should take roughly 0.01 s. Here are my results though:
python : 769.6387 seconds
numpy : 0.0660 seconds
numba : 3.0779 seconds
scipy : 0.0030 seconds
What am I doing wrong to get results so different from the original post? I copied the functions and did not change anything. I tried both Python 3.5.1 (64 bit) and Python 2.7.10 (32 bit), and a colleague tried the same code with the same results. This is the result for a 100x100 matrix:
python : 0.6916 seconds
numpy : 0.0035 seconds
numba : 0.0015 seconds
scipy : 0.0035 seconds
Did I make some obvious mistake?
import numpy as np
import numba as nb
import scipy.sparse
import time

class benchmark(object):
    def __init__(self, name):
        self.name = name
    def __enter__(self):
        self.start = time.time()
    def __exit__(self, ty, val, tb):
        end = time.time()
        print("%s : %0.4f seconds" % (self.name, end-self.start))
        return False

def dot_py(A, B):
    m, n = A.shape
    p = B.shape[1]
    C = np.zeros((m, p))
    for i in range(0, m):
        for j in range(0, p):
            for k in range(0, n):
                C[i, j] += A[i, k] * B[k, j]
    return C

def dot_np(A, B):
    C = np.dot(A, B)
    return C

def dot_scipy(A, B):
    C = A * B
    return C

dot_nb = nb.jit(nb.float64[:,:](nb.float64[:,:], nb.float64[:,:]), nopython=True)(dot_py)

dim_x = 1000
dim_y = 1000

a = scipy.sparse.rand(dim_x, dim_y, density=0.01)
b = scipy.sparse.rand(dim_x, dim_y, density=0.01)
a_full = a.toarray()
b_full = b.toarray()

print("starting test")

with benchmark("python"):
    dot_py(a_full, b_full)

with benchmark("numpy"):
    dot_np(a_full, b_full)

with benchmark("numba"):
    dot_nb(a_full, b_full)

with benchmark("scipy"):
    dot_scipy(a, b)

print("finishing test")
Edit: for anyone seeing this at a later time, these are the results I got when using sparse n x n matrices (1% of elements are nonzero).
In the linked Stack Overflow question where you got the code from, m = n = 3 and p is variable, whereas you are using m = n = p = 1000. Since the pure-Python triple loop performs m*n*p multiply-adds, that is 9*p operations there versus 10^9 here, which is going to make a huge difference in the timings.