Using scipy routines outside of the GIL - python

This is sort of a general question related to a specific implementation I have in mind, about whether it's safe to use python routines designed for use inside the GIL in a shared memory environment. Specifically what I'd like to do is use scipy.optimize.curve_fit on a large array inside a cython function.
The data can be expressed as a 2d numpy array (say, of floats) with the axis to be fit along and the other the serialized axis to be parallelized over. Then I'd just like to release the GIL and start looping through the data with a cython.parallel.prange (the idea being then that I can have all my cores working on fitting at once).
The main issue I can foresee is that curve_fit does not operate "in place"; it returns the fit values of the parameters (and optionally their covariance matrix) and so has to allocate that memory at some point. (Of course I also have no idea about any intermediate memory allocation the routine performs.) I'm worried about how this will operate outside the GIL with many threads working concurrently.
I realize that the answer could just be "it should work fine go try it," but I'm hoping to get some idea of what to look out for. I also realize that this question is similar to others about parallelizing scipy/numpy routines, but I think this one is worded differently in that falls within the cython scope of a C environment for python.
Thanks for any help/suggestions.

Not safe. If CPython could safely run that kind of code without the GIL, we wouldn't have the GIL in the first place.

You may find the following discussion to be of interest on Parallel Programming in SciPy.
[I would have posted this as merely a comment, but I lack the requisite reputation.]

Related

Scipy minimize function seems to be creating multiple threads by itself?

I am using the scipy minimize function. The function that it's calling was compiled with Cython and has an underlying C++ implementation that I wrote, but that shouldn't really matter. For some reason when I run my program, it creates as many threads as it can to fill all my cpus. For example if I run top I see that 800% of a cpu is being used or on htop I can see that 8 individual processors are being used, when I only created the program to be run on one. I didn't think that scipy even had parallel processing functionality and I can't find any documentation related to this. What could possible be going on and is there any way to control it?
If some BLAS-implementation (with threading-support) is available (default on Ubuntu for example), some expressions like np.dot() (only the dense case as far as i know) will automatically be run in parallel (reference). Another possible example is sparse-matrix factorization with SuperLU.
Of course different minimizers will behave different.
Newton-type methods (core: solve a system of sparse linear-equations) are probably based on SuperLU (if the code is not one of the common old Fortran/C ones, where the whole code is self-contained). CG-type methods are heavily based on matrix-vector products (np.dot; so the dense-case will be parallel).
For some control over this, start with this SO question.

Parallel execution Python function which uses Cython memoryviews internally

I am trying to parallelize a routine using Joblib's Parallel, but it doesn't work because one of the underlying functions uses Cython's memoryviews, resulting in buffer source array is read-only (the same issue has been discussed here).
I would like to understand why it happens (my knowledge of the low level machinery of Joblib is limited), and what I could do to work around it.
I know that using Numpy buffers avoids this problem (as mentioned here) but that is not really a solution. The function in question slices an array inside a parallel prange loop: using Numpy buffers would mean I cannot slice the array without the GIL, thus I would not be able to parallelize the loop using prange.

Python + Numpy: When is it useful to manually collect garbage?

I read that one could manually collect garbage using
gc.collect()
Now I'm wondering when it is useful to do so. I suppose it is to some extent general Python logic. Say I have a large loop and in each loop will use big matrices Z and rewrite them again and again. Is it useful to remove the matrices and collect garbage in the end, if I don't change the size of Z?
The general question Under which circumstances can one actually observe the impact of forced garbage collection, especially when doing lots of numerical computation within numpy?
As you can see in the answers in the comment, the simplest way to release memory is del array and let the garbage collector do its job.

Parallel computing

I have a two dimensional table (Matrix)
I need to process each line in this matrix independently from the others.
The process of each line is time consuming.
I'd like to use parallel computing resources in our university (Canadian Grid something)
Can I have some advise on how to start ? I never used parallel computing before.
Thanks :)
Start here: http://docs.python.org/library/multiprocessing.html
Be sure to read this: http://docs.python.org/library/multiprocessing.html#examples
This may be helpful: http://www.slideshare.net/pvergain/multiprocessing-with-python-presentation.
While excellent, it includes threads and multiprocessing, even though multiprocessing is often far, far superior to attempting multi-threading.
For Grid computing, multi-threading is largely useless.
Also, you probably also want to read up on celery.
I am one of the developper of a new library called scoop.
It was built exactly for this purpose (grid or super-computing, scientific computing). I suggest you give it a try.
In your case, all you would have to do is a call like this:
futures.map(YourFunc, matrixLine)
It will then be distributed on your grid or whatever environment you choose.
Like the commentators have said, find someone to talk to in your university. The answer to your question will be specific to what software is installed on the grid. If you have access to a grid, it's highly likely you also have access to a person whose job it is to answer your questions (and they will be pleased to help) - find this person!
From what you describe, I would say: first have a look at numpy.
Numpy provides methods to compute the columns and rows in a vectorized manner with nearly C speed. Depending on your problem, this could be faster than parallel computation with pure CPython.
You can than use parallel computing with numpy-arrays to get a really big speed up.
Possible ways to do this is using multiprocessing or Ipython on a cluster.
It is recommended that you use C++/C for performing this computation. You can use the OpenMP API using the #include<omp.h> header. You can start your parallel region using the #pragma amp parallel directive. Since you are parallelising a for-loop for computing your matrix multiplication, you can use #pragma omp parallel for { } to start your for-loop inside the parallel region. OpenMP will automatically take care of the process synchronisation.
Check this out for a sample code: https://gist.github.com/metallurgix/0dfafc03215ce89fc595
Remember to use a big matrix to see actual improvements in speed. A smaller matrix will perform poorly in fact due to increased task overhead created due to forking and joining the multiple threads.
You can also check out MPI if you want to parallelise your code using multiple processors instead of multiple threads.

numpy and Global Interpreter Lock

I am about to write some computationally-intensive Python code that'll almost certainly spend most of its time inside numpy's linear algebra functions.
The problem at hand is embarrassingly parallel. Long story short, the easiest way for me to take advantage of that would be by using multiple threads. The main barrier is almost certainly going to be the Global Interpreter Lock (GIL).
To help design this, it would be useful to have a mental model for which numpy operations can be expected to release the GIL for their duration. To this end, I'd appreciate any rules of thumb, dos and don'ts, pointers etc.
In case it matters, I'm using 64-bit Python 2.7.1 on Linux, with numpy 1.5.1 and scipy 0.9.0rc2, built with Intel MKL 10.3.1.
Quite some numpy routines release GIL, so they can be efficiently parallel in threads (info). Maybe you don't need to do anything special!
You can use this question to find whether the routines you need are among the ones that release GIL. In short, search for ALLOW_THREADS or nogil in the source.
(Also note that MKL has the ability to use multiple threads for a routine, so that's another easy way to get parallelism, although possibly not the fastest kind).
You will probably find answers to all your questions regarding NumPy and parallel programming on the official wiki.
Also, have a look at this recipe page -- it contains example code on how to use NumPy with multiple threads.
Embarrassingly parallel? Numpy? Sounds like a good candidate for PyCUDA or PyOpenCL.

Categories