Parallel execution of a Python function which uses Cython memoryviews internally

I am trying to parallelize a routine using Joblib's Parallel, but it doesn't work because one of the underlying functions uses Cython memoryviews, resulting in the error "buffer source array is read-only" (the same issue has been discussed here).
I would like to understand why this happens (my knowledge of the low-level machinery of Joblib is limited), and what I could do to work around it.
I know that using Numpy buffers avoids this problem (as mentioned here) but that is not really a solution. The function in question slices an array inside a parallel prange loop: using Numpy buffers would mean I cannot slice the array without the GIL, thus I would not be able to parallelize the loop using prange.
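For context, joblib memory-maps large input arrays as read-only by default, and a non-const Cython memoryview refuses a read-only buffer. Below is a minimal Python-level sketch of two possible workarounds; process_chunk is a hypothetical stand-in for the real Cython function, and newer Cython versions can also accept read-only input if the memoryview is declared const.

import numpy as np
from joblib import Parallel, delayed

def process_chunk(buf, i):
    # Hypothetical stand-in for the real Cython function, which is assumed
    # to require a writable buffer (a non-const memoryview rejects read-only input).
    return float(buf[i].sum())

def worker(arr, i):
    # Workaround 1: joblib may hand the worker a read-only memmap of `arr`;
    # copying it gives a writable buffer that the Cython function will accept.
    return process_chunk(arr.copy(), i)

data = np.random.rand(1000, 1000)
results = Parallel(n_jobs=4)(delayed(worker)(data, i) for i in range(8))

# Workaround 2: disable joblib's automatic memmapping of large arrays,
# at the cost of pickling the full array to every worker.
results = Parallel(n_jobs=4, max_nbytes=None)(
    delayed(process_chunk)(data, i) for i in range(8))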

Why is numba not speeding up the following piece of code?
import numpy as np
from numba import jit

@jit(nopython=True)
def sort(x):
    for i in range(1000):
        np.sort(x)
I thought numba was made for these sorts of tasks, where you have for loops combined with numpy operations. Yet this jitted function is 2-3x slower than the pure Python variant (i.e. the same function but without the jit), and yes I have run it after it was compiled.
Am I doing something wrong?
EDIT:
x has len = 5000; dtype is int32 and float64 (I tried both).
The Numba implementation is not meant to be faster for relatively big arrays (e.g. > 1024 elements). Both Numpy and Numba call a compiled sorting algorithm (except that Numba compiles it with a JIT). Numba can only be better here for small arrays, because it mostly removes the overhead of calling a Numpy function from the CPython interpreter (and of the many input checks). For an array of size 5000, the running time is dominated by the sorting calls themselves, not by the overhead of the loop (see below).
Besides this, the two implementations appear to use slightly different sorting algorithms (or at least not the same thresholds). As a result, they perform differently, and the difference depends on the input array: some sorting algorithms are fast on one kind of distribution and slow on another, and vice versa.
Here is the runtime of the two implementations plotted against the array size, measured on random arrays on my machine (32-bit integers from 0 to 1,000,000,000):
One can see that Numba is faster for small arrays and slower for big ones. When len=5000, the Numba implementation is 50% slower.
Note that you can tune the algorithm used with the kind parameter of np.sort. Note also that some optimized Numpy builds use parallelism so that primitives run faster. In that case, the comparison with the Numba implementation is not fair, since Numba uses a sequential implementation (especially if parallel=True is not set). Besides this, this appears to be a well-known issue and the developers are working on it.
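For illustration of the kind parameter mentioned above, a small sketch (results depend on the machine and on the input distribution):

import numpy as np

x = np.random.randint(0, 1_000_000_000, size=5000, dtype=np.int32)

# np.sort accepts a `kind` argument: 'quicksort' (the default, actually an
# introsort), 'stable' (mergesort/radix sort), or 'heapsort'. Different kinds
# can behave quite differently depending on the input distribution.
for kind in ("quicksort", "stable", "heapsort"):
    out = np.sort(x, kind=kind)
    assert np.all(out[:-1] <= out[1:])  # sanity check: result is sorted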
I wouldn't expect any performance benefit either. Numba isn't a magic wand: adding it does not automatically make code faster, and it has overhead that can easily sneak up on you. It helps to understand what Numba actually does: it parses the AST of a Python function and compiles it to native code using LLVM. For a lot of non-trivial cases this makes a huge difference, because pure Python is slow at complex math and branching; that is a reasonable trade-off of its design choices. Look at your code, though: it is a Numpy sort call inside a for loop. Think about what optimisation Numba could possibly make here. Numpy's sort is already compiled and fast, and Numba can't really affect that performance. So you have essentially added overhead to the most critical part of your code, hence the loss in performance.
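For contrast, a minimal sketch of the kind of code where Numba typically does pay off: an explicit element-wise loop that would otherwise run in the interpreter (the function and sizes are illustrative, not from the question):

import numpy as np
from numba import njit

@njit
def moving_sum(x, window):
    # An explicit Python-level loop: slow in CPython, fast once compiled,
    # because the loop body no longer goes through the interpreter.
    out = np.empty(x.size - window + 1)
    acc = x[:window].sum()
    out[0] = acc
    for i in range(1, out.size):
        acc += x[i + window - 1] - x[i - 1]
        out[i] = acc
    return out

x = np.random.rand(1_000_000)
moving_sum(x, 50)            # first call triggers compilation
result = moving_sum(x, 50)   # subsequent calls run the compiled code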

Python: How to synchronize access to a writable array of large numpy arrays (multiprocessing)

I am implementing a specific cache data structure in a machine learning application. The core consists of a list of (large) numpy arrays. The numpy arrays need to be replaced as often and as quickly as possible, given the IO limitations. Therefore, there are a few workers that constantly read and prepare data. My current solution is to push a result produced by a worker into a shared queue. A separate process then receives the result from the queue and replaces one of the numpy arrays in the list (which is owned by that central process).
I am now wondering whether that is the most elegant solution or whether there are faster solutions. In particular:
1.) As far as I understand the docs, going through a queue amounts to a serialization and de-serialization step, which could be slower than using shared memory.
2.) There is some memory overhead if the workers have several objects in the queue (which could have been replaced directly in the list).
I have thought about using a multiprocessing array or the numpy-sharedmem module but both did not really address my scenario. First, my list does not only contain ctypes. Second, each numpy array has a different size and all are independent. Third, I do not need write access to the numpy arrays but only to the 'wrapper' list organizing them.
Also, it should be noted that I am using multiprocessing and not threading as the workers heavily make use of numpy, which should invoke the global interpreter lock essentially all the time.
Questions:
- Is there a way to have a list of numpy arrays in shared memory?
- Is there a 'better' solution compared to the one described above?
Many thanks...
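A minimal sketch of the queue-based arrangement described above (prepare_array, N_SLOTS, and the array sizes are hypothetical stand-ins for the real IO-bound preparation):

import multiprocessing as mp
import numpy as np

N_SLOTS = 4

def prepare_array(slot):
    # Stand-in for the IO-bound data preparation done by a worker.
    return slot, np.random.rand(1000, 1000)

def worker(queue, slots):
    for slot in slots:
        queue.put(prepare_array(slot))   # the array is pickled on its way in

if __name__ == "__main__":
    # The wrapper list is owned by this central process only.
    cache = [np.zeros((1000, 1000)) for _ in range(N_SLOTS)]
    queue = mp.Queue()
    procs = [mp.Process(target=worker, args=(queue, range(i, N_SLOTS, 2)))
             for i in range(2)]
    for p in procs:
        p.start()
    for _ in range(N_SLOTS):
        slot, arr = queue.get()   # deserialized here
        cache[slot] = arr         # replace the entry in the wrapper list
    for p in procs:
        p.join()

As point 1.) notes, every put/get pickles and unpickles a full array, which is exactly where a shared-memory approach could save time.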

Using scipy routines outside of the GIL

This is sort of a general question related to a specific implementation I have in mind, about whether it's safe to use python routines designed for use inside the GIL in a shared memory environment. Specifically what I'd like to do is use scipy.optimize.curve_fit on a large array inside a cython function.
The data can be expressed as a 2d numpy array (say, of floats) with the axis to be fit along and the other the serialized axis to be parallelized over. Then I'd just like to release the GIL and start looping through the data with a cython.parallel.prange (the idea being then that I can have all my cores working on fitting at once).
The main issue I can foresee is that curve_fit does not operate "in place"; it returns the fit values of the parameters (and optionally their covariance matrix) and so has to allocate that memory at some point. (Of course I also have no idea about any intermediate memory allocation the routine performs.) I'm worried about how this will operate outside the GIL with many threads working concurrently.
I realize that the answer could just be "it should work fine, go try it," but I'm hoping to get some idea of what to look out for. I also realize that this question is similar to others about parallelizing scipy/numpy routines, but I think this one is worded differently in that it falls within the Cython scope of a C environment for Python.
Thanks for any help/suggestions.
Not safe. If CPython could safely run that kind of code without the GIL, we wouldn't have the GIL in the first place.
You may find the following discussion to be of interest on Parallel Programming in SciPy.
[I would have posted this as merely a comment, but I lack the requisite reputation.]
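Since calling back into scipy.optimize.curve_fit from a nogil prange loop is not safe, the usual alternative is to keep the parallelism at the Python level and give each slice to a separate process. A rough sketch (the model, shapes, and names are illustrative, not from the question):

import numpy as np
from scipy.optimize import curve_fit
from multiprocessing import Pool

def model(x, a, b):
    # Illustrative model; the question does not specify one.
    return a * np.exp(-b * x)

x = np.linspace(0, 4, 200)
data = np.array([model(x, 2.5, 1.3) + 0.05 * np.random.randn(x.size)
                 for _ in range(100)])          # shape (n_series, n_points)

def fit_row(row):
    popt, _ = curve_fit(model, x, row, p0=(1.0, 1.0))
    return popt

if __name__ == "__main__":
    with Pool() as pool:
        # Each worker process has its own interpreter and GIL,
        # so the fits run concurrently across cores.
        params = np.array(pool.map(fit_row, data))
    print(params.shape)   # (100, 2)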

Python memory usage for large numpy array in repeated function call

I wrote a function in Python that returns a large 2d numpy array (2**13,2**13), call it pdd.
import pdd
array=pdd.function(some stuff)
If I call the function once the memory usage jumps to a few gigabytes. Then if I run the same command again
array=pdd.function(some stuff)
The memory usage roughly doubles, as if a second array of that size were allocated rather than the current one being overwritten. The problem is that I want to use this function with an MCMC sampler, so there will be many repeated calls to the function, which obviously can't work as it is.
So is there some way to free the memory, or something in the function to optimize or minimize the usage??
EDIT
I appear to have fixed the problem. After trying several things, it seems to be scipy's fault. Inside the function there are several 2D FFTs, and I was using scipy's fftpack fft2 and ifft2. These created some large arrays that left over a GB of additional memory allocated after each function call. When I switched to numpy's fft2 and ifft2, the problem went away. Now, after the function ends, I am left with my one array taking a few hundred MB, and no more memory is added by subsequent function calls.
I don't know or understand why this is, and I found it surprising that numpy would do better here than scipy, but there it is.
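A minimal sketch of the switch described in the edit (sizes reduced for illustration; the original arrays were 2**13 x 2**13):

import numpy as np

def compute(field):
    # The version that fixed the memory growth for the asker: numpy's FFT.
    spectrum = np.fft.fft2(field)
    # ... processing in Fourier space would happen here ...
    return np.fft.ifft2(spectrum).real

# Previously: from scipy import fftpack; fftpack.fft2(field) / fftpack.ifft2(spectrum)

field = np.random.rand(2**12, 2**12)
result = compute(field)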

How to use custom data structure with multiprocessing in python

I'm using python 2.7 and numpy on a linux machine.
I am running a program which involves a time-consuming function computeGP(level, grid), which takes its input in the form of a numpy array level and an object grid that is not modified by this function.
My goal is to parallelize computeGP (locally, so on different cores) for different levels but the same grid. Since grid stays invariant, this can be done without synchronization hassle using shared memory. I've read a bit about threading in Python and the GIL, and it seems to me that I should go with the multiprocessing module rather than threading. This answer and this one recommend using multiprocessing.Array to share data efficiently, while noting that on Unix machines it is the default behaviour that the object is not copied.
My problem is that the object grid is not a numpy array.
It is a list of numpy arrays, because the way my data structure works is that I need to access array (list element) N and then access its row K.
Basically the list just fakes pointers to the arrays.
So my questions are:
- My understanding is that on Unix machines I can share the object grid without any further use of the multiprocessing datatypes Array (or Value). Is that correct?
- Is there a better way to implement this pointer-to-array data structure which can use the more efficient multiprocessing.Array?
I don't want to assemble one large array containing the smaller ones from the list, because the smaller ones are not really small either...
Any thoughts welcome!
This SO question is very similar to yours: Share Large, Read-Only Numpy Array Between Multiprocessing Processes
There are a few answers in there, but the simplest, if you are only using Linux, is to just make the data structure a global variable. Linux will fork() the process, which gives all worker processes copy-on-write access to the main process's memory (globals).
In this case you don't need to use any special multiprocessing classes or pass any data to the worker processes except level.
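A minimal sketch of that fork-and-globals approach (the grid contents and the computeGP body are stand-ins; the signature is reduced to computeGP(level) because grid is picked up from the global scope, as the answer suggests):

import numpy as np
from multiprocessing import Pool

# The grid lives in a module-level global. On Linux, Pool workers are
# created with fork(), so they see this list copy-on-write: as long as
# nothing writes to it, the arrays are never copied.
grid = [np.random.rand(n, 8) for n in (100, 250, 400)]   # differently sized arrays

def computeGP(level):
    # Stand-in for the real time-consuming function: it reads from `grid`
    # but never modifies it.
    n, k = level
    return grid[n][k].sum()

if __name__ == "__main__":
    levels = [(0, 5), (1, 17), (2, 300)]
    with Pool() as pool:
        results = pool.map(computeGP, levels)   # only `level` is pickled
    print(results)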
