Why is Numba parallel slower than a normal Python loop?

The following is a normal Python loop (I copied the example from the official docs: https://numba.readthedocs.io/en/stable/user/parallel.html):
def two_d_array_reduction_prod(n):
    shp = (13, 17)
    result1 = 2 * np.ones(shp, np.int_)
    tmp = 2 * np.ones_like(result1)
    for i in range(n):
        result1 *= tmp
    return result1
I called the function like:
two_d_array_reduction_prod(50000)
It takes around 0.7482060070033185 seconds.
Numba parallel code
@nb.njit(parallel=True)
def two_d_array_reduction_prod(n):
    shp = (13, 17)
    result1 = 2 * np.ones(shp, np.int_)
    tmp = 2 * np.ones_like(result1)
    for i in nb.prange(n):
        result1 *= tmp
    return result1
I called the function like:
two_d_array_reduction_prod(50000)
It takes 3.9858204890042543 seconds.
My environment:
Amazon Linux 2, x86_64 processor
8 CPUs
32G memory

I can't replicate this. Using parallel=True gives a slight performance improvement, but any variant is significantly faster than pure Python for me.
Using:
from numba import njit, prange
import numpy as np

def two_d_array_reduction_prod(n):
    shp = (13, 17)
    result1 = 2 * np.ones(shp, np.int_)
    tmp = 2 * np.ones_like(result1)
    for i in prange(n):  # or: for i in range(n)
        result1 *= tmp
    return result1

two_d_array_reduction_prod_numba = njit(parallel=False)(two_d_array_reduction_prod)
Even with parallel=False and prange, or with parallel=False and range, I get over a 3x improvement. All these timings include a warm-up: the Numba function is pre-compiled before timing.
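For completeness, a minimal sketch of the kind of harness behind those numbers (the timeit usage here is illustrative, not the exact script):

import timeit

two_d_array_reduction_prod_numba(10)  # warm-up: compile outside the timed region

t_py = timeit.timeit(lambda: two_d_array_reduction_prod(50000), number=1)        # pure Python (prange falls back to range outside jit)
t_nb = timeit.timeit(lambda: two_d_array_reduction_prod_numba(50000), number=1)  # compiled
print(f"pure Python: {t_py:.3f} s, Numba: {t_nb:.3f} s")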

Related

Numba parallel code slower than its sequential counterpart

I'm new to Numba and I'm trying to implement an old Fortran code in Python using Numba (version 0.54.1), but when I add parallel=True the program actually slows down. My program is very simple: I vary the positions x and y in an L x L grid, and for each position in the grid I perform a summation:
import numpy as np
import numba as nb

@nb.njit(parallel=True)
def lyapunov_grid(x_grid, y_grid, k, N):
    L = len(x_grid)
    lypnv = np.zeros((L, L))
    for ii in nb.prange(L):
        for jj in range(L):
            x = x_grid[ii]
            y = y_grid[jj]
            beta0 = 0
            sumT11 = 0
            for j in range(N):
                y = (y - k*np.sin(x)) % (2*np.pi)
                x = (x + y) % (2*np.pi)
                J = np.array([[1.0, -k*np.cos(x)], [1.0, 1.0 - k*np.cos(x)]])
                beta = np.arctan((-J[1,0]*np.cos(beta0) + J[1,1]*np.sin(beta0))/(J[0,0]*np.cos(beta0) - J[0,1]*np.sin(beta0)))
                T11 = np.cos(beta0)*(J[0,0]*np.cos(beta) - J[1,0]*np.sin(beta)) - np.sin(beta0)*(J[0,1]*np.cos(beta) - J[1,1]*np.sin(beta))
                sumT11 += np.log(abs(T11))/np.log(2)
                beta0 = beta
            lypnv[ii, jj] = sumT11/N
    return lypnv

# Compile
_ = lyapunov_grid(np.linspace(0, 1, 10), np.linspace(0, 1, 10), 1, 10)

# Parameters
N = int(1e3)
L = 128
pi = np.pi
k = 1.5

# Limits of the phase space
x0 = -pi
xf = pi
y0 = -pi
yf = pi

# Grid positions
x = np.linspace(x0, xf, L, endpoint=True)
y = np.linspace(y0, yf, L, endpoint=True)

lypnv = lyapunov_grid(x, y, k, N)
With parallel=False it takes about 8s to run, whereas with parallel=True it takes about 14s. I also tested with another code from https://github.com/animator/mandelbrot-numba and in that case the parallelization works:
import math
import numpy as np
import numba as nb

WIDTH = 1000
MAX_ITER = 1000

@nb.njit(parallel=True)
def mandelbrot(width, max_iter):
    pixels = np.zeros((width, width, 3), dtype=np.uint8)
    for y in nb.prange(width):
        for x in range(width):
            c0 = complex(3.0*x/width - 2, 3.0*y/width - 1.5)
            c = 0
            for i in range(1, max_iter):
                if abs(c) > 2:
                    log_iter = math.log(i)
                    pixels[y, x, :] = np.array([int(255*(1+math.cos(3.32*log_iter))/2),
                                                int(255*(1+math.cos(0.774*log_iter))/2),
                                                int(255*(1+math.cos(0.412*log_iter))/2)],
                                               dtype=np.uint8)
                    break
                c = c * c + c0
    return pixels

# compile
_ = mandelbrot(WIDTH, 10)

calcpixels = mandelbrot(WIDTH, MAX_ITER)
One main issue is that the second function call compiles the function again. Indeed, the types of the provided arguments change: in the first call the third argument is an integer (an int converted to np.int_), while in the second call the third argument (k) is a floating-point number (a float converted to np.float64). Numba recompiles the function for different parameter types because it deduces them from the arguments, and it does not know you want an np.float64 for the third argument (the first time, the function is compiled for an np.int_ type). One simple way to fix the problem is to change the first call to:
_ = lyapunov_grid(np.linspace(0, 1, 10), np.linspace(0, 1, 10), 1.0, 10)
However, this is not a robust way to fix the problem. You can specify the parameter types to Numba so that it compiles the function at declaration time. This also removes the need to artificially call the function (with useless parameters):
@nb.njit('float64[:,:](float64[::1], float64[::1], float64, float64)', parallel=True)
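You can confirm the recompilation by inspecting the dispatcher's compiled signatures; here is a small self-contained sketch (independent of the Lyapunov code itself):

import numpy as np
import numba as nb

@nb.njit
def f(x, k):
    return x * k

f(np.ones(4), 1)     # first call: compiled for (float64[:], int64)
f(np.ones(4), 1.0)   # second call: recompiled for (float64[:], float64)
print(f.signatures)  # two entries -> the function was compiled twice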
Note that (J[0,0]*np.cos(beta0) - J[0,1]*np.sin(beta0)) is zero on the first iteration, resulting in a division by zero.
Another main issue comes from the allocation of many small arrays inside the loop, which causes contention in the standard allocator (see this post for more information). While Numba could theoretically optimize this away (i.e. replace the array with local variables), it currently does not, resulting in a huge slowdown and contention. Fortunately, in your case, you do not need to allocate the array in the innermost loop: you can create it once in the enclosing loop and modify it in place in the innermost loop. Here is the optimized code:
@nb.njit('float64[:,:](float64[::1], float64[::1], float64, float64)', parallel=True)
def lyapunov_grid(x_grid, y_grid, k, N):
    L = len(x_grid)
    lypnv = np.zeros((L, L))
    for ii in nb.prange(L):
        J = np.ones((2, 2), dtype=np.float64)
        for jj in range(L):
            x = x_grid[ii]
            y = y_grid[jj]
            beta0 = 0
            sumT11 = 0
            for j in range(N):
                y = (y - k*np.sin(x)) % (2*np.pi)
                x = (x + y) % (2*np.pi)
                J[0, 1] = -k*np.cos(x)
                J[1, 1] = 1.0 - k*np.cos(x)
                beta = np.arctan((-J[1,0]*np.cos(beta0) + J[1,1]*np.sin(beta0))/(J[0,0]*np.cos(beta0) - J[0,1]*np.sin(beta0)))
                T11 = np.cos(beta0)*(J[0,0]*np.cos(beta) - J[1,0]*np.sin(beta)) - np.sin(beta0)*(J[0,1]*np.cos(beta) - J[1,1]*np.sin(beta))
                sumT11 += np.log(abs(T11))/np.log(2)
                beta0 = beta
            lypnv[ii, jj] = sumT11/N
    return lypnv
Here are the results on an old 2-core machine (with 4 hardware threads):
Original sequential: 15.9 s
Original parallel: 11.9 s
Fix-build sequential: 15.7 s
Fix-build parallel: 10.1 s
Optimized sequential: 2.73 s
Optimized parallel: 0.94 s
The optimized implementation is much faster than the others. The optimized parallel version scales very well compared to the original one (it is 2.9 times faster than the optimized sequential one). Finally, the best version is about 12 times faster than the original parallel version. I expect a much faster computation on a recent machine with many more cores.

Solving Linear Equations on the GPU with NumPy and PyTorch

I am trying to solve a lot of linear equations as fast as possible. To find out the fastest way I benchmarked NumPy and PyTorch, each on the CPU and on my GeForce 1080 GPU (using Numba for NumPy). The results really confused me.
This is the code I used with Python 3.8:
import timeit
import torch
import numpy
from numba import njit

def solve_numpy_cpu(dim: int = 5):
    a = numpy.random.rand(dim, dim)
    b = numpy.random.rand(dim)
    for _ in range(1000):
        numpy.linalg.solve(a, b)

def solve_numpy_njit_a(dim: int = 5):
    njit(solve_numpy_cpu, dim=dim)

@njit
def solve_numpy_njit_b(dim: int = 5):
    a = numpy.random.rand(dim, dim)
    b = numpy.random.rand(dim)
    for _ in range(1000):
        numpy.linalg.solve(a, b)

def solve_torch_cpu(dim: int = 5):
    a = torch.rand(dim, dim)
    b = torch.rand(dim, 1)
    for _ in range(1000):
        torch.solve(b, a)

def solve_torch_gpu(dim: int = 5):
    torch.set_default_tensor_type("torch.cuda.FloatTensor")
    solve_torch_cpu(dim=dim)

def main():
    for f in (solve_numpy_cpu, solve_torch_cpu, solve_torch_gpu, solve_numpy_njit_a, solve_numpy_njit_b):
        time = timeit.timeit(f, number=1)
        print(f"{f.__name__:<20s}: {time:f}")

if __name__ == "__main__":
    main()
And these are the results:
solve_numpy_cpu : 0.007275
solve_torch_cpu : 0.012244
solve_torch_gpu : 5.239126
solve_numpy_njit_a : 0.000158
solve_numpy_njit_b : 1.273660
The slowest is CUDA accelerated PyTorch. I verified that PyTorch is using my GPU with
import torch
torch.cuda.is_available()
torch.cuda.get_device_name(0)
returning
True
'GeForce GTX 1080'
I can accept that, on the CPU, PyTorch is slower than NumPy. What I cannot understand is why PyTorch on the GPU is so much slower. Not that important, but even more confusing, is that Numba's njit decorator makes performance orders of magnitude slower, until you stop using the @ decorator syntax.
Is it my setup? Occasionally I get a weird message about the Windows page/swap file not being big enough. In case I've taken a completely obscure path to solving linear equations on the GPU, I'd be happy to be pointed in another direction.
Edit
So, I focused on Numba and changed my benchmarking a bit. As suggested by @max9111, I rewrote the functions to receive input and produce output because, in the end, that's what anyone would want to use them for. Now I also perform a first compile run for the Numba-accelerated function, so the subsequent timing is fairer. Finally, I checked the performance against matrix size and plotted the results.
TL/DR: Up to matrix sizes of 500x500, Numba acceleration doesn't really make a difference for numpy.linalg.solve.
Here is the code:
import time
from typing import Tuple

import numpy
from matplotlib import pyplot
from numba import jit

@jit(nopython=True)
def solve_numpy_njit(a: numpy.ndarray, b: numpy.ndarray) -> numpy.ndarray:
    parameters = numpy.linalg.solve(a, b)
    return parameters

def solve_numpy(a: numpy.ndarray, b: numpy.ndarray) -> numpy.ndarray:
    parameters = numpy.linalg.solve(a, b)
    return parameters

def get_data(dim: int) -> Tuple[numpy.ndarray, numpy.ndarray]:
    a = numpy.random.random((dim, dim))
    b = numpy.random.random(dim)
    return a, b

def main():
    a, b = get_data(10)
    # compile numba function
    p = solve_numpy_njit(a, b)

    matrix_size = [(x + 1) * 10 for x in range(50)]
    non_accelerated = []
    accelerated = []
    results = non_accelerated, accelerated

    for j, each_matrix_size in enumerate(matrix_size):
        for m, f in enumerate((solve_numpy, solve_numpy_njit)):
            average_time = -1.
            for k in range(5):
                time_start = time.time()
                for i in range(100):
                    a, b = get_data(each_matrix_size)
                    p = f(a, b)
                d_t = time.time() - time_start
                print(f"{each_matrix_size:d} {f.__name__:<30s}: {d_t:f}")
                average_time = (average_time * k + d_t) / (k + 1)
            results[m].append(average_time)

    pyplot.plot(matrix_size, non_accelerated, label="not numba")
    pyplot.plot(matrix_size, accelerated, label="numba")
    pyplot.legend()
    pyplot.show()

if __name__ == "__main__":
    main()
And these are the results (runtime against matrix edge length):
Edit 2
Seeing that Numba doesn't make much of a difference in my case, I came back to benchmarking PyTorch. And indeed, it appears to be roughly 4x faster than Numpy without even using a CUDA device.
Here is the code I used:
import time
from typing import Tuple

import numpy
import torch
from matplotlib import pyplot

def solve_numpy(a: numpy.ndarray, b: numpy.ndarray) -> numpy.ndarray:
    parameters = numpy.linalg.solve(a, b)
    return parameters

def get_data(dim: int) -> Tuple[numpy.ndarray, numpy.ndarray]:
    a = numpy.random.random((dim, dim))
    b = numpy.random.random(dim)
    return a, b

def get_data_torch(dim: int) -> Tuple[torch.tensor, torch.tensor]:
    a = torch.rand(dim, dim)
    b = torch.rand(dim, 1)
    return a, b

def solve_torch(a: torch.tensor, b: torch.tensor) -> torch.tensor:
    parameters, _ = torch.solve(b, a)
    return parameters

def experiment_numpy(matrix_size: int, repetitions: int = 100):
    for i in range(repetitions):
        a, b = get_data(matrix_size)
        p = solve_numpy(a, b)

def experiment_pytorch(matrix_size: int, repetitions: int = 100):
    for i in range(repetitions):
        a, b = get_data_torch(matrix_size)
        p = solve_torch(a, b)

def main():
    matrix_size = [x for x in range(5, 505, 5)]
    experiments = experiment_numpy, experiment_pytorch
    results = tuple([] for _ in experiments)

    for i, each_experiment in enumerate(experiments):
        for j, each_matrix_size in enumerate(matrix_size):
            time_start = time.time()
            each_experiment(each_matrix_size, repetitions=100)
            d_t = time.time() - time_start
            print(f"{each_matrix_size:d} {each_experiment.__name__:<30s}: {d_t:f}")
            results[i].append(d_t)

    for each_experiment, each_result in zip(experiments, results):
        pyplot.plot(matrix_size, each_result, label=each_experiment.__name__)
    pyplot.legend()
    pyplot.show()

if __name__ == "__main__":
    main()
And here's the result (runtime against matrix edge length):
So for now, I'll be sticking with torch.solve. However, the original question remains:
How can I exploit my GPU to solve linear equations even faster?

Not able to automatically parallelize for loop with numba

I am trying to run the following on multiple cores for a speed-up using Numba. Unfortunately, the function seems to run on only one core when I tested it. Can someone explain why, and whether there is a way to get it running on multiple cores?
Minimal working example:
import numpy as np
import numba

a = np.random.rand(100000)

@numba.jit(nopython=True, parallel=True)
def func(x):
    result = np.zeros_like(x)
    for delta in range(1, len(x)):
        thisresult = 0
        for i in range(delta, len(x)):
            thisresult += (x[i] - x[i-delta])**2
        result[delta] = thisresult / (len(x) - delta)
    return result

print(func(a))
Explicit Parallelization
I would always recommend parallelizing code explicitly. Numba tries to parallelize some code parts automatically, but that won't always work or lead to the best performance.
import numpy as np
import numba

a = np.random.rand(100000)

@numba.jit(nopython=True, parallel=True)
def func(x):
    result = np.zeros_like(x, dtype=x.dtype)
    for delta in numba.prange(1, len(x)):
        thisresult = 0
        for i in range(delta, len(x)):
            thisresult += (x[i] - x[i-delta])**2
        result[delta] = thisresult / (len(x) - delta)
    return result

print(func(a))
For more details have a look at the documentation.
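You can also ask Numba what it actually parallelized; a quick check (assuming func has already been compiled by a call, as above):

func(a)                             # make sure the function has been compiled
func.parallel_diagnostics(level=2)  # reports which loops were parallelized and fused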

Cannot replicate results comparing Python, Numpy and Numba matrix multiplication

So, while evaluating possibilities to speed up Python code, I came across this Stack Overflow post: Comparing Python, Numpy, Numba and C++ for matrix multiplication
I was quite impressed with Numba's performance and implemented some of our functions in Numba. Unfortunately, the speedup was only there for very small matrices; for large matrices the code became very slow compared to the previous scipy.sparse implementation. I thought this made sense, but nevertheless I repeated the test from the original post (code below).
When using a 1000 x 1000 matrix, according to that post even the Python implementation should take roughly 0.01 s. Here are my results, though:
python : 769.6387 seconds
numpy : 0.0660 seconds
numba : 3.0779 seconds
scipy : 0.0030 seconds
What am I doing wrong to get such different results from the original post? I copied the functions and did not change anything. I tried both Python 3.5.1 (64-bit) and Python 2.7.10 (32-bit); a colleague tried the same code with the same results. This is the result for a 100x100 matrix:
python : 0.6916 seconds
numpy : 0.0035 seconds
numba : 0.0015 seconds
scipy : 0.0035 seconds
Did I make some obvious mistake?
import numpy as np
import numba as nb
import scipy.sparse
import time

class benchmark(object):
    def __init__(self, name):
        self.name = name

    def __enter__(self):
        self.start = time.time()

    def __exit__(self, ty, val, tb):
        end = time.time()
        print("%s : %0.4f seconds" % (self.name, end-self.start))
        return False

def dot_py(A, B):
    m, n = A.shape
    p = B.shape[1]
    C = np.zeros((m, p))
    for i in range(0, m):
        for j in range(0, p):
            for k in range(0, n):
                C[i, j] += A[i, k] * B[k, j]
    return C

def dot_np(A, B):
    C = np.dot(A, B)
    return C

def dot_scipy(A, B):
    C = A * B
    return C

dot_nb = nb.jit(nb.float64[:,:](nb.float64[:,:], nb.float64[:,:]), nopython=True)(dot_py)

dim_x = 1000
dim_y = 1000

a = scipy.sparse.rand(dim_x, dim_y, density=0.01)
b = scipy.sparse.rand(dim_x, dim_y, density=0.01)
a_full = a.toarray()
b_full = b.toarray()

print("starting test")

with benchmark("python"):
    dot_py(a_full, b_full)

with benchmark("numpy"):
    dot_np(a_full, b_full)

with benchmark("numba"):
    dot_nb(a_full, b_full)

with benchmark("scipy"):
    dot_scipy(a, b)

print("finishing test")
Edit:
For anyone seeing this at a later time, these are the results I got when using sparse n x n matrices (1% of elements are nonzero).
In the linked Stack Overflow question where you got the code from, m = n = 3 and p is variable, whereas you are using m = n = 1000, which makes a huge difference in the timings: the pure-Python triple loop executes its inner statement m * n * p times.
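For a sense of scale, some illustrative arithmetic (p = 1000 is picked arbitrarily here; it was variable in the linked post):

# Inner-loop iterations of the pure-Python triple loop
ops_linked_post = 3 * 3 * 1000            # m = n = 3, example p
ops_this_test   = 1000 * 1000 * 1000      # m = n = p = 1000
print(ops_this_test // ops_linked_post)   # ~111,111x more work in pure Python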

cuda python GPU numbapro 3d loop poor performance

I am trying to set up a 3D loop with the assignment
C(i,j,k) = A(i,j,k) + B(i,j,k)
using Python on my GPU. This is my GPU:
http://www.geforce.com/hardware/desktop-gpus/geforce-gt-520/specifications
The sources I'm looking at / comparing with are:
http://nbviewer.ipython.org/gist/harrism/f5707335f40af9463c43
http://nbviewer.ipython.org/github/ContinuumIO/numbapro-examples/blob/master/webinars/2014_06_17/intro_to_gpu_python.ipynb
It's possible that I've imported more modules than necessary. This is my code:
import numpy as np
import numbapro
import numba
import math
from timeit import default_timer as timer
from numbapro import cuda
from numba import *

@autojit
def myAdd(a, b):
    return a + b

myAdd_gpu = cuda.jit(restype=f8, argtypes=[f8, f8], device=True)(myAdd)

@cuda.jit(argtypes=[float32[:,:,:], float32[:,:,:], float32[:,:,:]])
def myAdd_kernel(a, b, c):
    tx = cuda.threadIdx.x
    ty = cuda.threadIdx.y
    tz = cuda.threadIdx.z
    bx = cuda.blockIdx.x
    by = cuda.blockIdx.y
    bz = cuda.blockIdx.z
    bw = cuda.blockDim.x
    bh = cuda.blockDim.y
    bd = cuda.blockDim.z
    i = tx + bx * bw
    j = ty + by * bh
    k = tz + bz * bd
    if i >= c.shape[0]:
        return
    if j >= c.shape[1]:
        return
    if k >= c.shape[2]:
        return
    for i in xrange(0, c.shape[0]):
        for j in xrange(0, c.shape[1]):
            for k in xrange(0, c.shape[2]):
                # c[i,j,k] = a[i,j,k] + b[i,j,k]
                c[i, j, k] = myAdd_gpu(a[i, j, k], b[i, j, k])

def main():
    my_gpu = numba.cuda.get_current_device()
    print "Running on GPU:", my_gpu.name
    cores_per_capability = {1: 8, 2: 32, 3: 192}
    cc = my_gpu.compute_capability
    print "Compute capability: ", "%d.%d" % cc, "(Numba requires >= 2.0)"
    majorcc = cc[0]
    print "Number of streaming multiprocessor:", my_gpu.MULTIPROCESSOR_COUNT
    cores_per_multiprocessor = cores_per_capability[majorcc]
    print "Number of cores per mutliprocessor:", cores_per_multiprocessor
    total_cores = cores_per_multiprocessor * my_gpu.MULTIPROCESSOR_COUNT
    print "Number of cores on GPU:", total_cores

    N = 100
    thread_ct = my_gpu.WARP_SIZE
    block_ct = int(math.ceil(float(N) / thread_ct))
    print "Threads per block:", thread_ct
    print "Block per grid:", block_ct

    a = np.ones((N, N, N), dtype=np.float32)
    b = np.ones((N, N, N), dtype=np.float32)
    c = np.zeros((N, N, N), dtype=np.float32)

    start = timer()
    cg = cuda.to_device(c)
    myAdd_kernel[block_ct, thread_ct](a, b, cg)
    cg.to_host()
    dt = timer() - start

    print "Wall clock time with GPU in %f s" % dt
    print 'c[:3,:,:] = ' + str(c[:3,1,1])
    print 'c[-3:,:,:] = ' + str(c[-3:,1,1])

if __name__ == '__main__':
    main()
My result from running this is the following:
Running on GPU: GeForce GT 520
Compute capability: 2.1 (Numba requires >= 2.0)
Number of streaming multiprocessor: 1
Number of cores per mutliprocessor: 32
Number of cores on GPU: 32
Threads per block: 32
Block per grid: 4
Wall clock time with GPU in 1.104860 s
c[:3,:,:] = [ 2. 2. 2.]
c[-3:,:,:] = [ 2. 2. 2.]
When I run the examples in the sources, I see significant speedup. I don't think my example is running properly since the wall clock time is much longer than I would expect. I've modeled this mostly from the "even bigger speedups with cuda python" section in the first example link.
I believe I've indexed correctly and safely. Maybe the problem is with my blockdim? or griddim? Or maybe I'm using the wrong types for my GPU. I think I read that they must be a certain type. I'm very new to this so the problem very well could be trivial!
Any and all help is greatly appreciated!
You are creating your indexes correctly but then you're ignoring them.
Running the nested loop
for i in xrange(0, c.shape[0]):
    for j in xrange(0, c.shape[1]):
        for k in xrange(0, c.shape[2]):
is forcing all your threads to loop through all values in all dimensions, which is not what you want. You want each thread to compute one value in a block and then move on.
I think something like this should work better...
i = tx + bx * bw
while i < c.shape[0]:
    j = ty + by * bh
    while j < c.shape[1]:
        k = tz + bz * bd
        while k < c.shape[2]:
            c[i, j, k] = myAdd_gpu(a[i, j, k], b[i, j, k])
            k += cuda.blockDim.z * cuda.gridDim.z
        j += cuda.blockDim.y * cuda.gridDim.y
    i += cuda.blockDim.x * cuda.gridDim.x
Try to compile and run it. Also make sure to validate it, as I have not.
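If you keep a loop like the one above, you will also want to launch the kernel over a 3D grid rather than the current 1D configuration, so the threads start out spread over the volume. Something along these lines (the block shape here is an illustrative choice, not tuned for your GPU):

threads_per_block = (8, 8, 4)  # 256 threads per block; an arbitrary but reasonable 3D shape
blocks_per_grid = tuple(int(math.ceil(float(N) / t)) for t in threads_per_block)
myAdd_kernel[blocks_per_grid, threads_per_block](a, b, cg)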
I don't see you using imshow or show, so there is no need to import those.
Your import of math appears to be used only for math.ceil in main.
Your imports from numba and numbapro seem repetitive. Your "from numba import cuda" overrides your "from numbapro import cuda", since it comes after it, so your calls to cuda use the cuda in numba, not numbapro. When you call "from numba import *", you import everything from numba, not just cuda. Also, (I believe) "import numba.cuda" is equivalent to "from numba import cuda". Why not consolidate all your imports from numba and numbapro?
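For example, the imports could be consolidated to something like this (a sketch, assuming the old numba/numbapro API this question uses; the extra names cover what the rest of the posted code references):

import math
import numpy as np
from timeit import default_timer as timer
from numba import cuda, autojit, f8, float32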
