numpy and Global Interpreter Lock

I am about to write some computationally-intensive Python code that'll almost certainly spend most of its time inside numpy's linear algebra functions.
The problem at hand is embarrassingly parallel. Long story short, the easiest way for me to take advantage of that would be by using multiple threads. The main barrier is almost certainly going to be the Global Interpreter Lock (GIL).
To help design this, it would be useful to have a mental model for which numpy operations can be expected to release the GIL for their duration. To this end, I'd appreciate any rules of thumb, dos and don'ts, pointers etc.
In case it matters, I'm using 64-bit Python 2.7.1 on Linux, with numpy 1.5.1 and scipy 0.9.0rc2, built with Intel MKL 10.3.1.

Quite a few numpy routines release the GIL, so they can run efficiently in parallel threads (info). Maybe you don't need to do anything special!
You can use this question to find out whether the routines you need are among the ones that release the GIL. In short, search for ALLOW_THREADS or nogil in the source.
(Also note that MKL has the ability to use multiple threads for a routine, so that's another easy way to get parallelism, although possibly not the fastest kind).
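For example, here is a minimal sketch (not from the question) of two threads calling numpy.dot concurrently; with a BLAS/MKL-backed build, the dot calls release the GIL and can overlap on separate cores:

    import threading
    import numpy as np

    def work(a, b, out, idx):
        # np.dot on large float arrays is dispatched to BLAS/MKL,
        # which releases the GIL for the duration of the call
        out[idx] = np.dot(a, b)

    a = np.random.rand(2000, 2000)
    b = np.random.rand(2000, 2000)
    results = [None, None]

    threads = [threading.Thread(target=work, args=(a, b, results, i)) for i in range(2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()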

You will probably find answers to all your questions regarding NumPy and parallel programming on the official wiki.
Also, have a look at this recipe page -- it contains example code on how to use NumPy with multiple threads.

Embarrassingly parallel? Numpy? Sounds like a good candidate for PyCUDA or PyOpenCL.

Related

GPU parallelization of code that uses numpy functions

I'm trying to parallelize a Python function I wrote to run on multiple GPU cores simultaneously, but it seems like current methods for doing so, such as vectorize and guvectorize from numba, don't allow for anything more sophisticated than simple arithmetic operations in the function (https://github.com/numba/numba/issues/2736).
My question is, is there a package or technique other than numba capable of handling functions that call numpy functions such as numpy.where or numpy.intersect1d?
Completely new to GPU programming here and not sure what the state-of-the-art capabilities are, so sorry if this question seems dumb.
Thanks a lot!

Using scipy routines outside of the GIL

This is sort of a general question related to a specific implementation I have in mind, about whether it's safe to use python routines designed for use inside the GIL in a shared memory environment. Specifically what I'd like to do is use scipy.optimize.curve_fit on a large array inside a cython function.
The data can be expressed as a 2D numpy array (say, of floats), with one axis to fit along and the other axis to parallelize over. Then I'd just like to release the GIL and start looping through the data with a cython.parallel.prange (the idea being that I can have all my cores working on fitting at once).
The main issue I can foresee is that curve_fit does not operate "in place"; it returns the fit values of the parameters (and optionally their covariance matrix) and so has to allocate that memory at some point. (Of course I also have no idea about any intermediate memory allocation the routine performs.) I'm worried about how this will operate outside the GIL with many threads working concurrently.
I realize that the answer could just be "it should work fine, go try it," but I'm hoping to get some idea of what to look out for. I also realize that this question is similar to others about parallelizing scipy/numpy routines, but I think this one is worded differently in that it falls within the Cython scope of a C environment for Python.
Thanks for any help/suggestions.
Not safe. If CPython could safely run that kind of code without the GIL, we wouldn't have the GIL in the first place.
You may find the following discussion to be of interest on Parallel Programming in SciPy.
[I would have posted this as merely a comment, but I lack the requisite reputation.]
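Given that, a process-based alternative is often the pragmatic route; below is a minimal multiprocessing sketch (the linear model and array shapes are made-up placeholders, not from the question):

    import numpy as np
    from multiprocessing import Pool
    from scipy.optimize import curve_fit

    def model(x, a, b):
        # hypothetical model; replace with the function you are actually fitting
        return a * x + b

    def fit_row(row):
        x = np.arange(row.size)            # assumed x-grid: the column index
        popt, _ = curve_fit(model, x, row)
        return popt

    if __name__ == "__main__":
        data = np.random.rand(1000, 50)    # one independent fit per row
        pool = Pool()                      # one worker process per core by default
        params = np.array(pool.map(fit_row, data))
        pool.close()
        pool.join()
        print(params.shape)                # (1000, 2)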

Parallel computing

I have a two-dimensional table (matrix).
I need to process each line in this matrix independently of the others.
The processing of each line is time-consuming.
I'd like to use the parallel computing resources at our university (Canadian Grid something).
Can I have some advice on how to start? I have never used parallel computing before.
Thanks :)
Start here: http://docs.python.org/library/multiprocessing.html
Be sure to read this: http://docs.python.org/library/multiprocessing.html#examples
This may be helpful: http://www.slideshare.net/pvergain/multiprocessing-with-python-presentation.
While excellent, it includes threads and multiprocessing, even though multiprocessing is often far, far superior to attempting multi-threading.
For Grid computing, multi-threading is largely useless.
Also, you probably also want to read up on celery.
I am one of the developers of a new library called SCOOP.
It was built exactly for this purpose (grid or super-computing, scientific computing). I suggest you give it a try.
In your case, all you would have to do is a call like this:
futures.map(YourFunc, matrixLine)
It will then be distributed on your grid or whatever environment you choose.
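For context, a fuller sketch based on SCOOP's documented futures.map interface might look like the following (process_line is just a placeholder for your per-row work); the script is launched with python -m scoop:

    # launch with:  python -m scoop your_script.py
    from scoop import futures
    import numpy as np

    def process_line(line):
        # placeholder for your time-consuming per-row computation
        return line.sum()

    if __name__ == "__main__":
        matrix = np.random.rand(100, 1000)
        results = list(futures.map(process_line, matrix))
        print(len(results))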
Like the commentators have said, find someone to talk to in your university. The answer to your question will be specific to what software is installed on the grid. If you have access to a grid, it's highly likely you also have access to a person whose job it is to answer your questions (and they will be pleased to help) - find this person!
From what you describe, I would say: first have a look at numpy.
Numpy provides methods to compute the columns and rows in a vectorized manner with nearly C speed. Depending on your problem, this could be faster than parallel computation with pure CPython.
You can then use parallel computing with numpy arrays to get a really big speed-up.
Possible ways to do this are using multiprocessing or IPython on a cluster.
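As a rough sketch of the multiprocessing route (process_row is a stand-in for your real per-row computation):

    import numpy as np
    from multiprocessing import Pool

    def process_row(row):
        # stand-in for the real, time-consuming per-row work,
        # written with vectorized numpy operations
        return np.sqrt(row).sum()

    if __name__ == "__main__":
        matrix = np.random.rand(500, 10000)
        pool = Pool()                              # defaults to one worker per core
        results = pool.map(process_row, matrix)    # iterating a 2-D array yields rows
        pool.close()
        pool.join()
        print(len(results))                        # 500 results, one per row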
It is recommended that you use C/C++ for performing this computation. You can use the OpenMP API via the #include <omp.h> header. You can start your parallel region using the #pragma omp parallel directive. Since you are parallelising a for-loop over your matrix, you can place #pragma omp parallel for just before the loop. OpenMP will automatically take care of the thread synchronisation.
Check this out for a sample code: https://gist.github.com/metallurgix/0dfafc03215ce89fc595
Remember to use a big matrix to see actual improvements in speed. A smaller matrix will in fact perform poorly, due to the overhead of forking and joining the multiple threads.
You can also check out MPI if you want to parallelise your code using multiple processors instead of multiple threads.

How much of NumPy and SciPy is in C?

Are parts of NumPy and/or SciPy programmed in C/C++?
And how does the overhead of calling C from Python compare to the overhead of calling C from Java and/or C#?
I'm just wondering if Python is a better option than Java or C# for scientific apps.
If I look at the shootouts, Python loses by a huge margin. But I guess this is because they don't use 3rd-party libraries in those benchmarks.
I would question any benchmark which doesn't show the source for each implementation (or did I miss something)? It's entirely possible that either or both of those solutions are coded badly which would result in an unfair appraisal of either or both language's performance. [Edit] Oops, now I see the source. As others have pointed out though, it's not using the NumPy/SciPy libraries so those benchmarks are not going to help you make a decision.
I believe the vast majority of NumPy and SciPy is written in C and wrapped in Python for ease of use.
It probably depends what you're doing in any of those languages as to how much overhead there is for a particular application.
I've used Python for data processing and analysis for a couple of years now so I would say it's certainly fit for purpose.
What are you trying to achieve at the end of the day? If you want a fast way to develop readable code, Python is an excellent option and certainly fast enough for a first stab at whatever it is you're trying to solve.
Why not have a bash at each for a small subset of your problem and benchmark the results in terms of development time and run time? Then you can make an objective decision based on some relevant data ...or at least that's what I'd do :-)
There is a better comparison here (not a benchmark but it shows ways of speeding up Python). NumPy is mostly written in C. The main advantage of Python is that there are a number of ways of very easily extending your code with C (ctypes, swig, f2py) / C++ (boost.python, weave.inline, weave.blitz) / Fortran (f2py) - or even just by adding type annotations to Python so it can be processed to C (cython). I don't think there are many things comparably easy for C# or Java - at least that so seamlessly handle passing numerical arrays of different types (although I guess proponents would argue that since they don't have the performance penalty of Python there is less need to).
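As a tiny illustration of the ctypes route (a toy example calling cos from the system C math library on a Unix-like system, nothing NumPy-specific):

    import ctypes
    import ctypes.util

    # locate and load the C math library; the exact path is platform-dependent
    libm = ctypes.CDLL(ctypes.util.find_library("m"))
    libm.cos.argtypes = [ctypes.c_double]
    libm.cos.restype = ctypes.c_double

    print(libm.cos(0.0))   # -> 1.0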
A lot of it is written in C or Fortran. You can re-write the hot loops in C (or use one of the gazillion ways to speed Python up; boost/weave is my favorite), but does it really matter?
Your scientific app will be run once. The rest is just debugging and development, and those can be much quicker on Python.
Most of NumPy is in C, but a large portion of the C code is "boilerplate" to handle all the dirty details of the Python/C interface. I think the ratio of C to Python is around 50/50 at the moment for NumPy.
I am not too familiar with VM-based low-level details, but I believe the interface cost would be higher because of the restrictions put on the JVM and the CLR. One of the reasons why numpy is often faster than similar environments is the memory representation and how arrays are shared/passed between functions. Whereas most environments (Matlab and R as well, I believe) use copy-on-write to pass arrays between functions, NumPy uses references. Doing so in e.g. the JVM would be hard (because of restrictions on how to use pointers, etc.). It is doable (an early port of NumPy for Jython exists), but I don't know how they solved this issue. Maybe C++/CLI would make this easier, but I have zero experience with that environment.
It always depends on your own ability to use the language in a way that lets it generate fast code. In my experience, numpy is several times slower than good .NET implementations, and I expect Java to be similarly fast. Their optimizing JIT compilers have improved significantly over the years and produce very efficient instructions.
numpy, on the other hand, comes with a syntax that is easier to use for those who are attuned to scripting languages. But when it comes to application development, those advantages often turn into obstacles and you will yearn for type safety and enterprise IDEs. Also, the syntactic gap is already closing with C#. A growing number of scientific libraries exist for Java and .NET. Personally, I tend towards C#, because it provides better syntax for multidimensional arrays and somehow feels more 'modern'. But of course, this is only my personal experience.

practice with threads in python

I know that Python has a global lock and I've read Glyph's explanation of Python multithreading. But I still want to try it out. What I decided to do as a conceptually easy task was horizontal and vertical edge detection on a picture.
Here's what's happening (pseudocode):
for pixels in picture:
    apply sobel operator horizontal
for pixels in picture:
    apply sobel operator vertical
info on sobel operator.
These two loops can run completely independently of each other, and so would be prime candidates for multithreading (running these two loops on any significantly large picture can take 10+ seconds). However, when I tried to use the threading module in Python, it took twice as long because of the global lock. My question is: should I abandon all hope of doing this in two threads in Python and try another language? If I can forge ahead, what module(s) should I use? If not, what language should I experiment in?
Python 2.6 now includes the multiprocessing module (formerly known as the processing module in older versions of Python).
It has essentially the same interface as the threading module, but launches the execution into separate processes rather than threads. This allows Python to take advantage of multiple cores/CPUs and scales well for CPU-intensive tasks compared to the threading module approach.
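A minimal sketch of that idea for the two Sobel passes (using scipy.ndimage.sobel as a stand-in for the poster's own filter code) might look like this:

    import numpy as np
    from multiprocessing import Process, Queue
    from scipy import ndimage

    def sobel_pass(image, axis, queue):
        # run one Sobel pass in its own process and send the result back
        queue.put((axis, ndimage.sobel(image, axis=axis)))

    if __name__ == "__main__":
        image = np.random.rand(2000, 2000)
        q = Queue()
        procs = [Process(target=sobel_pass, args=(image, ax, q)) for ax in (0, 1)]
        for p in procs:
            p.start()
        # fetch the results before join() so the children can flush the queue
        results = dict(q.get() for _ in procs)
        for p in procs:
            p.join()
        print(results[0].shape, results[1].shape)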
If the sobel operator is CPU-bound, then you won't get any benefit from multiple threads because Python threads do not take advantage of multiple cores.
Conceivably you could spin off multiple processes, though I'm not sure if that would be practical for working on a single image.
10 seconds doesn't seem like a lot of time to waste. If you're concerned about time because you'll be processing many images, then it might be easier to run multiple processes and have each process deal with a separate subset of the images.
I recommend using NumPy as well. Not only will it probably be faster, but many of its operations release the global lock, so threads can actually help.
I'll also suggest using multiprocessing as Jay suggests.
Anyways, if you really want to practice threading, I'd suggest playing around with PThreads in C. PThreads are insanely simple to use for basic cases and used all over the place.
Bulk matrix operations like the Sobel operator will definitely realize significant speed gains by (correctly) using Matlab/Octave. It is possible that NumPy may provide similar speedups for matrix/array ops.
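For reference, the fully vectorized NumPy/SciPy version of the two passes (no threads at all) is roughly the following sketch; the per-pixel loops move into compiled code:

    import numpy as np
    from scipy import ndimage

    image = np.random.rand(2000, 2000)

    # each per-pixel Python loop becomes a single call into compiled filter code
    horizontal = ndimage.sobel(image, axis=0)
    vertical = ndimage.sobel(image, axis=1)
    magnitude = np.hypot(horizontal, vertical)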
Python multiprocessing is the right choice if you want to practice parallel programming with Python. If you don't have Python 2.6 (which you don't if you're using Ubuntu, for example), you can use the Google Code backported version of multiprocessing. It is available on PyPI, which means you can easily install it using EasyInstall (which is part of the python-setuptools package in Ubuntu).
