Parallel processing with numpy or numba


I have a simple problem. A function receives an array [a, b] of two numbers, and it returns another array [a*a, a*b]. The sample code is
import numpy as np
def func(array_1):
    array_2 = np.zeros_like(array_1)
    array_2[0] = array_1[0]*array_1[0]
    array_2[1] = array_1[0]*array_1[1]
    return array_2
array_1 = np.array([3., 4.]) # sample test array [a, b]
print(array_1) # prints this test array [a, b]
print(func( array_1 ) ) # prints [a*a, a*b]
The two lines inside the function func
array_2[0] = array_1[0]*array_1[0]
array_2[1] = array_1[0]*array_1[1]
are independent and I want to parallelize them.
Please tell me
how to parallelize this (without Numba)?
how to parallelize this (with Numba)?

It does not make sense to parallelise this code using multiple threads or processes because the arrays are far too small for such parallelisation to be useful. Indeed, creating a thread typically takes about 1-100 microseconds on a mainstream machine, while this code should take well under a microsecond in Numba. In fact, the two computing lines should take less than 0.01 microsecond. Thus, creating threads will make the execution far slower.
Assuming your array were much bigger, the typical way to parallelize a Python script is to use multiprocessing (which creates processes). For Numba code, it is prange + parallel=True (which creates threads).
If you execute a jitted Numba function, then the code already runs somewhat in parallel. Indeed, modern mainstream processors already execute instructions in parallel. This is called instruction-level parallelism. More specifically, modern processors pipeline the instructions and execute several of them at once thanks to superscalar, out-of-order execution. All of this is completely automatic. You just need to avoid having dependencies between the executed instructions.
Finally, if you want to speed up this function, then you need to use Numba in the caller function because a function call from CPython is far more expensive than computing and storing two floats. Note also that allocating an array is pretty expensive too so it is better to reuse buffers at this granularity.
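As a rough sketch of the prange approach for the case where many [a, b] pairs are processed at once (the names func_many, pairs and out are made up for illustration, and the try/except fallback to a plain serial loop is only there so the snippet also runs where Numba is not installed):

```python
import numpy as np

try:
    from numba import njit, prange
except ImportError:
    # Assumption for this sketch only: without Numba, fall back to a plain
    # (slow, serial) Python loop so the example still runs.
    prange = range

    def njit(*args, **kwargs):
        return lambda fn: fn


@njit(parallel=True)
def func_many(pairs, out):
    # pairs and out both have shape (n, 2); row i holds one [a, b] input.
    # Rows are independent, so the loop can be split across threads.
    for i in prange(pairs.shape[0]):
        a = pairs[i, 0]
        b = pairs[i, 1]
        out[i, 0] = a * a
        out[i, 1] = a * b


rng = np.random.default_rng(0)
pairs = rng.random((100_000, 2))
out = np.empty_like(pairs)   # allocated once, reusable across calls
func_many(pairs, out)        # the first call also pays the compilation cost
```

The output buffer is passed in rather than allocated inside the function, in line with the advice above about reusing buffers. Whether this is faster than a serial loop depends on the array size and the machine, so it should be measured.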

Related

What actually happens when setting parallel=True in @njit numba?

Could someone please explain roughly what happens when one runs a @njit-ted python function which contains a nested for loop (each iteration from each of the loops is independent of the others) and sets parallel=True and puts prange instead of range?
import numpy as np
from numba import njit, prange
@njit(parallel=True)
def f():
    C = np.empty((80, 20, 18), dtype=np.complex128)
    for i in prange(80):
        for j in prange(20):
            for k in range(18):
                C[i, j, k] = do_smth(i, j, k) # where do_smth(i, j, k) is @njit-ted and will further call other functions
Similarly, what happens when using prange only for the outermost loop? (i.e. letting for j in range(20): ... )
I understand what a thread is and I set NUMBA_NUM_THREADS (the environment variable) to the number of cores of the processor.
I did some profiling using the timeit module and it seems that the parallel=True keyword only slows the execution of the f() function when the .py script is called on a machine with 20 cores (by a considerable amount (even 4 times slower)).
f() above further calls more functions (the first one being do_smth()) whose structure resembles that of f() (nested for loops which, at each of their iterations, call other @njit-ted functions).
I checked them as above. Is my approach good? I.e. to profile them with timeit while changing the keyword parameters inside their @njit decorator (I played with parallel, fastmath and nogil) and creating a table in which I note the execution times. My aim was to find the best execution time from the results I obtain.

Could someone please explain roughly what happens when one runs a @njit-ted python function which contains a nested for loop and sets parallel=True and puts prange instead of range?
This is explained in the documentation, but basically, when parallel=True is set, prange splits the loop iterations into blocks so they are executed in multiple threads. The exact scheduling depends on the underlying parallel runtime (e.g. TBB, OpenMP, etc.). Numba analyzes the loops to determine whether a reduction is needed or not (not all patterns are allowed). It can also fuse parallel loops if needed (though it does not work on my machine with Numba 0.55.2, even on trivial reduction loops: only the outer loop is parallelized). Note that it takes time to create threads, and the more cores there are, the longer it takes. This is why a multi-threaded computation should last for a relatively long time for multiple threads to be useful.
Similarly, what happens when using prange only for the outermost loop? (i.e. letting for j in range(20): ... )
In theory, it is generally better to specify more parallelism. In practice, it is not always useful and sometimes even detrimental because the runtime can use inefficient methods (loop fusion can cause slow modulus to be used with some OpenMP backends).
If you use it only on the outer i-based loop, then only this loop is parallelized (using all the cores by default, so with 80 iterations on 20 cores that is 4 iterations per thread if a static schedule is selected by the backend).
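To make the "static schedule" arithmetic concrete, here is a small illustration of how 80 iterations could be divided into contiguous blocks across 20 threads. It only demonstrates the idea; static_chunks is a made-up helper, and the real partitioning is decided by the threading backend.

```python
def static_chunks(n_iterations, n_threads):
    """Split range(n_iterations) into n_threads contiguous blocks,
    as evenly as possible (the first blocks get the leftovers)."""
    base, extra = divmod(n_iterations, n_threads)
    chunks = []
    start = 0
    for t in range(n_threads):
        size = base + (1 if t < extra else 0)
        chunks.append(range(start, start + size))
        start += size
    return chunks


chunks = static_chunks(80, 20)
print(len(chunks), len(chunks[0]))  # 20 4
```

With so few iterations per thread, each iteration has to do a substantial amount of work for the thread start-up cost to pay off.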
I did some profiling using the timeit module and it seems that the parallel=True keyword only slows the execution of the f() function when the .py script is called on a machine with 20 cores (by a considerable amount (even 4 times slower)).
Parallel programming is not easy. It is far harder than most people think. This is why research teams have worked on it for decades and it is still an active field of research.
There are many effects that can be responsible for this, including:
Allocator contention (very frequent)
Undefined behavior in the code (frequent): typically a race condition
False-sharing (quite frequent)
NUMA effects (e.g. access to remote pages)
Other resource saturation (e.g. memory), though it generally makes the code barely scale and does not cause a slowdown (unless there is contention)
A bug in Numba (quite rare)
Also note that the first call causes the function to be compiled, so it is slower (and parallel code is even slower to compile).
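Because of that first-call compilation cost, timings are only comparable if the function has been called once before the measurement starts. A minimal sketch of such a measurement using only timeit (benchmark is a hypothetical helper, not part of Numba):

```python
import timeit


def benchmark(fn, *args, repeat=5, number=10):
    """Return the best per-call time in seconds, excluding the first call."""
    fn(*args)  # warm-up: for a jitted function this triggers compilation
    times = timeit.repeat(lambda: fn(*args), repeat=repeat, number=number)
    # The minimum is the run least disturbed by other activity on the machine.
    return min(times) / number


print(benchmark(sum, range(10_000)))
```

The same helper can then be applied to each variant of the decorated function (parallel on or off, and so on) with identical inputs.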

Why is it better to use synchronous programming for in-memory operations?

I have a complex nested data structure. I iterate through it and perform some calculations on each possible unique pair of elements. It's all in-memory mathematical functions. I don't read from files or do networking.
It takes a few hours to run, with do_work() being called 25,000 times. I am looking for ways to speed it up.
Although Pool.map() seems useful for my lists, it's proving to be difficult because I need to pass extra arguments into the function being mapped.
I thought using the Python multiprocessing library would help, but when I use Pool.apply_async() to call do_work(), it actually takes longer.
I did some googling and a blogger says "Use sync for in-memory operations — async is a complete waste when you aren’t making blocking calls." Is this true? Can someone explain why? Do the RAM read & write operations interfere with each other? Why does my code take longer with async calls? do_work() writes calculation results to a database, but it doesn't modify my data structure.
Surely there is a way to utilize my processor cores instead of just linearly iterating through my lists.
My starting point, doing it synchronously:
main_list = [ [ [a,b,c,[x,y,z], ... ], ... ], ... ] # list of identical structures
helper_list = [1,2,3]
z = 2
for i_1 in range(0, len(main_list)):
    for i_2 in range(0, len(main_list)):
        if i_1 < i_2: # only unique combinations
            for m in range(0, len(main_list[i_1])):
                for h, helper in enumerate(helper_list):
                    do_work(
                        main_list[i_1][m][0], main_list[i_2][m][0], # unique combo
                        main_list[i_1][m][1], main_list[i_1][m][2],
                        main_list[i_1][m][3][z], main_list[i_2][m][3][h],
                        helper_list[h]
                    )
Variable names have been changed to make it more readable.

This is just a general answer, but too long for a comment...
First of all, I think your biggest bottleneck at this very moment is Python itself. I don't know what do_work() does, but if it's CPU intensive, you have the GIL, which completely prevents effective parallelisation inside one process. No matter what you do, threads will fight for the GIL and it will eventually make your code even slower. Remember: Python threads are real OS threads, but within a single process only one of them can execute Python bytecode at a time.
I recommend checking out the page of David M Beazley: http://dabeaz.com/GIL/gilvis who put a lot of effort into visualising the GIL behaviour in Python.
On the other hand, the module multiprocessing allows you to run multiple processes and "circumvent" the GIL downsides, but it will be tricky to get access to the same memory locations without bigger penalties or trade-offs.
Second: if you use heavily nested loops, you should think about using numba and trying to fit your data structures inside numpy (structured) arrays. This can quite easily give you an order-of-magnitude speedup. Python is slow as hell for such things, but luckily there are ways to squeeze out a lot when using appropriate libraries.
To sum up, I think the code you are running could be orders of magnitude faster with numba and numpy structures.
Alternatively, you can try to rewrite the code in a language like Julia (very similar syntax to Python and the community is extremely helpful) and quickly check how fast it is in order to explore the limits of the performance. It's always a good idea to get a feeling for how fast something (or part of a program) can be in a language that does not have such complex performance-critical aspects as Python.
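As a small illustration of what "fitting the data into numpy arrays" can look like for the unique-pairs pattern: since do_work() is unknown, the squared difference below is only a placeholder calculation, and values stands in for one numeric field extracted from the nested lists.

```python
import numpy as np

# Hypothetical data: one numeric field per element of the main list.
values = np.array([1.0, 4.0, 9.0, 16.0])

# Index arrays for every pair (i, j) with i < j, i.e. each unique combination once.
i, j = np.triu_indices(len(values), k=1)

# Placeholder for the real calculation, evaluated for all pairs in one vectorized step.
result = (values[i] - values[j]) ** 2
print(result)  # [  9.  64. 225.  25. 144.  49.]
```

Whether the real do_work() can be expressed this way depends on what it computes; if it cannot be vectorized, the same index arrays can still drive a loop inside a Numba-compiled function.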

Your task is CPU bound rather than dependent on I/O operations. Asynchronous execution makes sense when you have long I/O operations, e.g. sending or receiving something over the network.
What you can do is split the task into chunks and use threads or multiprocessing (to run on different CPU cores).
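One possible way to do that splitting while still passing several arguments per call is Pool.starmap, which unpacks each task tuple into the function's parameters. This is only a sketch: do_work here is a stand-in with a made-up signature and calculation, and the worker returns its result so that the parent process can handle the database writes.

```python
from itertools import combinations
from multiprocessing import Pool


def do_work(a, b, scale):
    # Hypothetical stand-in for the real CPU-bound calculation.
    return (a - b) ** 2 * scale


def build_tasks(items, scale):
    # combinations() yields each unordered pair once, replacing the i_1 < i_2 test.
    return [(a, b, scale) for a, b in combinations(items, 2)]


def run_serial(tasks):
    return [do_work(*task) for task in tasks]


def run_parallel(tasks, processes=2):
    # chunksize batches many small tasks per message to reduce communication overhead.
    with Pool(processes=processes) as pool:
        return pool.starmap(do_work, tasks, chunksize=64)


if __name__ == "__main__":
    tasks = build_tasks(list(range(50)), scale=3)
    assert run_parallel(tasks) == run_serial(tasks)
    print(len(tasks))  # 1225 unique pairs
```

If each call is very cheap, the cost of pickling arguments and results can outweigh the gain, which may be why apply_async per call turned out slower; larger chunks of work per task usually help.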

Parallel execution Python function which uses Cython memoryviews internally

I am trying to parallelize a routine using Joblib's Parallel, but it doesn't work because one of the underlying functions uses Cython's memoryviews, resulting in the error "buffer source array is read-only" (the same issue has been discussed here).
I would like to understand why it happens (my knowledge of the low level machinery of Joblib is limited), and what I could do to work around it.
I know that using Numpy buffers avoids this problem (as mentioned here) but that is not really a solution. The function in question slices an array inside a parallel prange loop: using Numpy buffers would mean I cannot slice the array without the GIL, thus I would not be able to parallelize the loop using prange.

Python memory usage for large numpy array in repeated function call

I wrote a function in Python that returns a large 2d numpy array (2**13,2**13), call it pdd.
import pdd
array=pdd.function(some stuff)
If I call the function once the memory usage jumps to a few gigabytes. Then if I run the same command again
array=pdd.function(some stuff)
The memory usage roughly doubles, as if a second array of that size were created rather than the current one being overwritten. The problem with this is that I want to use this function with an MCMC sampler, which means many repeated calls to the function, and that obviously can't work as it is.
So is there some way to free the memory, or something in the function to optimize or minimize the usage??
EDIT
I appear to have fixed the problem. After trying several things, it seems to be scipy's fault. Inside the function there are several 2D FFTs and I was using scipy's fftpack fft2 and ifft2. This resulted in the creation of some large arrays using lots of memory, which left over a GB of additional memory in use after each function call. When I switched to numpy's fft2 and ifft2, the problem went away. Now, after the function ends, I'm left with my one array taking a few hundred MB of memory, and no more is added with subsequent function calls.
I don't know or understand why this is, and found it surprising that numpy would be better in this case than scipy but there it is.
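One way to check whether repeated calls keep accumulating memory is to record the traced memory after each call with the standard tracemalloc module. This is only a diagnostic sketch: memory_growth is a made-up helper, and memory that a C extension allocates directly, outside the allocators Python tracks, may not show up in these numbers.

```python
import tracemalloc


def memory_growth(fn, n_calls=5):
    """Return the traced memory (in bytes) still in use after each call to fn."""
    tracemalloc.start()
    retained = []
    result = None
    for _ in range(n_calls):
        result = fn()  # rebinding releases the previous result
        current, _peak = tracemalloc.get_traced_memory()
        retained.append(current)
    tracemalloc.stop()
    return retained


# Hypothetical stand-in for the real function returning a large array.
print(memory_growth(lambda: bytearray(10**6)))
```

Roughly constant values suggest that each call's result replaces the previous one; steadily increasing values point at something holding on to memory between calls.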

Parallelise python loop with numpy arrays and shared-memory

I am aware of several questions and answers on this topic, but haven't found a satisfactory answer to this particular problem:
What is the easiest way to do a simple shared-memory parallelisation of a python loop where numpy arrays are manipulated through numpy/scipy functions?
I am not looking for the most efficient way, I just want something simple to implement that doesn't require a significant rewrite when the loop is not run in parallel, just as OpenMP provides in lower-level languages.
The best answer I've seen in this regard is this one, but this is a rather clunky way that requires one to express the loop into a function that takes a single argument, several lines of shared-array converting crud, seems to require that the parallel function is called from __main__, and it doesn't seem to work well from the interactive prompt (where I spend a lot of my time).
With all of Python's simplicity, is this really the best way to parallelise a loop? Really? This is something trivial to parallelise in OpenMP fashion.
I have painstakingly read through the opaque documentation of the multiprocessing module, only to find out that it is so general that it seems suited to everything but a simple loop parallelisation. I am not interested in setting up Managers, Proxies, Pipes, etc. I just have a simple loop, fully parallel that doesn't have any communication between tasks. Using MPI to parallelise such a simple situation seems like overkill, not to mention it would be memory-inefficient in this case.
I haven't had time to learn about the multitude of different shared-memory parallel packages for Python, but was wondering if someone has more experience in this and can show me a simpler way. Please do not suggest serial optimisation techniques such as Cython (I already use it), or using parallel numpy/scipy functions such as BLAS (my case is more general, and more parallel).

With Cython parallel support:
# asd.pyx
from cython.parallel cimport prange
import numpy as np
def foo():
    cdef int i, j, n
    x = np.zeros((200, 2000), float)
    n = x.shape[0]
    for i in prange(n, nogil=True):
        with gil:
            for j in range(100):
                x[i,:] = np.cos(x[i,:])
    return x
On a 2-core machine:
$ cython asd.pyx
$ gcc -fPIC -fopenmp -shared -o asd.so asd.c -I/usr/include/python2.7
$ export OMP_NUM_THREADS=1
$ time python -c 'import asd; asd.foo()'
real 0m1.548s
user 0m1.442s
sys 0m0.061s
$ export OMP_NUM_THREADS=2
$ time python -c 'import asd; asd.foo()'
real 0m0.602s
user 0m0.826s
sys 0m0.075s
This runs fine in parallel, since np.cos (like other ufuncs) releases the GIL.
If you want to use this interactively:
# asd.pyxbld
def make_ext(modname, pyxfilename):
    from distutils.extension import Extension
    return Extension(name=modname,
                     sources=[pyxfilename],
                     extra_link_args=['-fopenmp'],
                     extra_compile_args=['-fopenmp'])
and (remove asd.so and asd.c first):
>>> import pyximport
>>> pyximport.install(reload_support=True)
>>> import asd
>>> q1 = asd.foo()
# Go to an editor and change asd.pyx
>>> reload(asd)
>>> q2 = asd.foo()
So yes, in some cases you can parallelize just by using threads. OpenMP is just a fancy wrapper for threading, and Cython is therefore only needed here for the easier syntax. Without Cython, you can use the threading module, which works similarly to multiprocessing (and probably more robustly), but you don't need to do anything special to declare arrays as shared memory.
However, not all operations release the GIL, so YMMV for the performance.
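A minimal sketch of the threads-only variant, using concurrent.futures from the standard library instead of Cython. The function names are made up for illustration, and any speedup relies on np.cos releasing the GIL as described above, so it should be measured rather than assumed.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np


def cos_rows_inplace(x, rows, repeats):
    block = x[rows]  # basic slicing returns a view, so writes land in x itself
    for _ in range(repeats):
        np.cos(block, out=block)


def parallel_cos(x, n_threads=2, repeats=100):
    # Split the row range into n_threads contiguous, non-overlapping blocks.
    bounds = np.linspace(0, x.shape[0], n_threads + 1).astype(int)
    slices = [slice(a, b) for a, b in zip(bounds[:-1], bounds[1:])]
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        # list() waits for all blocks and re-raises any worker exception.
        list(pool.map(lambda s: cos_rows_inplace(x, s, repeats), slices))
    return x


x = parallel_cos(np.zeros((200, 2000)), n_threads=2, repeats=100)
```

Threads share the array directly, so no copying or shared-memory setup is needed; each thread only writes to its own block of rows.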
***
And another possibly useful link scraped from other Stackoverflow answers --- another interface to multiprocessing: http://packages.python.org/joblib/parallel.html

Using a mapping operation (in this case multiprocessing.Pool.map()) is more or less the canonical way to parallelize a loop on a single machine, unless and until the built-in map() is ever parallelized.
An overview of the different possibilities can be found here.
You can use OpenMP with Python (or rather Cython), but it doesn't look exactly easy.
IIRC, the point of only running multiprocessing stuff from __main__ is a necessity because of compatibility with Windows. Since Windows lacks fork(), it starts a new Python interpreter and has to import the code in it.
Edit
Numpy can parallelize some operations like dot(), vdot() and inner(), when configured with a good multithreading BLAS library such as OpenBLAS. (See also this question.)
Since numpy array operations are mostly element-wise, it seems possible to parallelize them. But this would involve setting up either a shared memory segment for Python objects, or dividing the arrays up into pieces and feeding them to the different processes, not unlike what multiprocessing.Pool does. No matter what approach is taken, it would incur memory and processing overhead to manage all that. One would have to run extensive tests to see for which sizes of arrays this would actually be worth the effort. The outcome of those tests would probably vary considerably with hardware architecture, operating system and amount of RAM.
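For reference, a minimal sketch of the Pool.map() pattern with the __main__ guard mentioned above. The function names are illustrative; functools.partial is used to fix an extra argument, since Pool.map() passes a single argument per item.

```python
from functools import partial
from multiprocessing import Pool


def row_sum_scaled(row, scale):
    return scale * sum(row)


def parallel_row_sums(matrix, scale, processes=2):
    worker = partial(row_sum_scaled, scale=scale)  # fixes the extra argument
    with Pool(processes=processes) as pool:
        return pool.map(worker, matrix)  # one row per task, results in input order


if __name__ == "__main__":
    # The guard keeps child processes from re-running this block when the
    # module is imported in them (required where fork() is unavailable).
    matrix = [[4 * r + c for c in range(4)] for r in range(6)]
    print(parallel_row_sums(matrix, scale=1))  # [6, 22, 38, 54, 70, 86]
```

Each row is pickled and sent to a worker process here, which is exactly the copying overhead discussed above; for large arrays that cost can dominate.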

The .map( ) method of the mathDict( ) class in ParallelRegression does exactly what you are looking for in two lines of code that should be very easy at an interactive prompt. It uses true multiprocessing, so the requirement that the function to be run in parallel is pickle-able is unavoidable, but this does provide an easy way to loop over a matrix in shared memory from multiple processes.
Say you have a pickle-able function:
def sum_row( matrix, row ):
    return( sum( matrix[row,:] ) )
Then you just need to create a mathDict( ) object representing it, and use mathDict( ).map( ):
matrix = np.array( [i for i in range( 24 )] ).reshape( (6, 4) )
RA, MD = mathDictMaker.fromMatrix( matrix, integer=True )
res = MD.map( [(i,) for i in range( 6 )], sum_row, ordered=True )
print( res )
# [6, 22, 38, 54, 70, 86]
The documentation (link above) explains how to pass a combination of positional and keyword arguments into your function, including the matrix itself at any position or as a keyword argument. This should enable you to use pretty much any function you've already written without modifying it.
