Parallel computing - python

I have a two-dimensional table (matrix).
I need to process each line in this matrix independently of the others.
Processing each line is time-consuming.
I'd like to use the parallel computing resources at our university (Canadian Grid something).
Can I have some advice on how to start? I've never used parallel computing before.
Thanks :)

Start here: http://docs.python.org/library/multiprocessing.html
Be sure to read this: http://docs.python.org/library/multiprocessing.html#examples
This may be helpful: http://www.slideshare.net/pvergain/multiprocessing-with-python-presentation.
While excellent, it covers both threads and multiprocessing, even though multiprocessing is often far, far superior to attempting multithreading.
For grid computing, multithreading is largely useless.
You probably also want to read up on celery.
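As a concrete starting point, here is a minimal sketch along the lines of those multiprocessing examples; process_line and the sample matrix are placeholders for your own per-row function and data:

import multiprocessing

def process_line(line):
    # placeholder for your real, time-consuming per-row computation
    return sum(x * x for x in line)

if __name__ == '__main__':
    matrix = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]   # your two-dimensional table
    pool = multiprocessing.Pool()                # one worker per CPU core by default
    try:
        results = pool.map(process_line, matrix) # one result per line, order preserved
    finally:
        pool.close()
        pool.join()
    print(results)

Note that this only parallelises across the cores of a single machine; spreading the work across multiple grid nodes needs something like SCOOP, celery, or MPI on top of it.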

I am one of the developers of a new library called scoop.
It was built exactly for this purpose (grid or super-computing, scientific computing). I suggest you give it a try.
In your case, all you would have to do is a call like this:
futures.map(YourFunc, matrixLine)
It will then be distributed on your grid or whatever environment you choose.
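A minimal sketch of what that might look like in practice (assuming scoop is installed, the script is launched with python -m scoop, and process_line stands in for your per-row function):

from scoop import futures

def process_line(line):
    # placeholder for your time-consuming per-row computation
    return sum(line)

if __name__ == '__main__':
    matrix = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
    results = list(futures.map(process_line, matrix))
    print(results)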

Like the commentators have said, find someone to talk to in your university. The answer to your question will be specific to what software is installed on the grid. If you have access to a grid, it's highly likely you also have access to a person whose job it is to answer your questions (and they will be pleased to help) - find this person!

From what you describe, I would say: first have a look at numpy.
NumPy provides methods to operate on columns and rows in a vectorized manner at nearly C speed. Depending on your problem, this could be faster than parallel computation with pure CPython.
You can then use parallel computing with NumPy arrays to get a really big speedup.
Possible ways to do this are using multiprocessing or IPython on a cluster.
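As a rough illustration of the vectorization idea (the per-row operation here is an invented placeholder; the point is that the whole matrix is handled in one NumPy call instead of a Python-level loop):

import numpy as np

matrix = np.random.rand(1000, 500)

# Python-level loop: one interpreted iteration per row
row_norms_loop = [np.sqrt((row ** 2).sum()) for row in matrix]

# Vectorized: the same per-row result computed in compiled code for all rows at once
row_norms_vec = np.sqrt((matrix ** 2).sum(axis=1))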

It is recommended that you use C/C++ for performing this computation. You can use the OpenMP API via the #include <omp.h> header. You can start your parallel region using the #pragma omp parallel directive. Since you are parallelising a for loop over your matrix, you can place #pragma omp parallel for directly before that loop. OpenMP will automatically take care of thread synchronisation.
Check this out for a sample code: https://gist.github.com/metallurgix/0dfafc03215ce89fc595
Remember to use a big matrix to see actual improvements in speed. A smaller matrix may in fact perform worse, due to the overhead of forking and joining the multiple threads.
You can also check out MPI if you want to parallelise your code across multiple processes (and machines) instead of multiple threads.

Related

Why is it better to use synchronous programming for in-memory operations?

I have a complex nested data structure. I iterate through it and perform some calculations on each possible unique pair of elements. It's all in-memory mathematical functions. I don't read from files or do networking.
It takes a few hours to run, with do_work() being called 25,000 times. I am looking for ways to speed it up.
Although Pool.map() seems useful for my lists, it's proving to be difficult because I need to pass extra arguments into the function being mapped.
I thought using Python's multiprocessing library would help, but when I use Pool.apply_async() to call do_work(), it actually takes longer.
I did some googling and a blogger says "Use sync for in-memory operations — async is a complete waste when you aren’t making blocking calls." Is this true? Can someone explain why? Do the RAM read & write operations interfere with each other? Why does my code take longer with async calls? do_work() writes calculation results to a database, but it doesn't modify my data structure.
Surely there is a way to utilize my processor cores instead of just linearly iterating through my lists.
My starting point, doing it synchronously:
main_list = [ [ [a,b,c,[x,y,z], ... ], ... ], ... ] # list of identical structures
helper_list = [1,2,3]
z = 2
for i_1 in range(0, len(main_list)):
    for i_2 in range(0, len(main_list)):
        if i_1 < i_2: # only unique combinations
            for m in range(0, len(main_list[i_1])):
                for h, helper in enumerate(helper_list):
                    do_work(
                        main_list[i_1][m][0], main_list[i_2][m][0], # unique combo
                        main_list[i_1][m][1], main_list[i_1][m][2],
                        main_list[i_1][m][3][z], main_list[i_2][m][3][h],
                        helper_list[h]
                    )
Variable names have been changed to make it more readable.
This is just a general answer, but too long for a comment...
First of all, I think your biggest bottleneck at this very moment is Python itself. I don't know what do_work() does, but if it's CPU-intensive, the GIL completely prevents effective parallelisation inside one process. No matter what you do, threads will fight for the GIL, and that will eventually make your code even slower. Remember: Python has real OS threads, but only one of them can execute Python bytecode at a time within a single process.
I recommend checking out the page of David M. Beazley: http://dabeaz.com/GIL/gilvis who put a lot of effort into visualising GIL behaviour in Python.
On the other hand, the multiprocessing module allows you to run multiple processes and "circumvent" the GIL downsides, but sharing the same memory locations between processes is tricky without bigger penalties or trade-offs.
Second: if you are running heavy nested loops, you should think about using numba and trying to fit your data structures into NumPy (structured) arrays. This can easily give you an order of magnitude of speedup. Python is slow as hell for such things, but luckily there are ways to squeeze out a lot when using appropriate libraries.
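As a very rough sketch of what the numba route can look like (the pairwise computation here is an invented stand-in, not your do_work()):

import numpy as np
from numba import njit

@njit
def pairwise_sum(data):
    # toy stand-in for a heavy nested loop over unique pairs of elements
    n = data.shape[0]
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):      # unique combinations only
            total += (data[i] - data[j]) ** 2
    return total

data = np.random.rand(5000)
print(pairwise_sum(data))   # compiled on the first call, then runs at near-C speed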
To sum up, I think the code you are running could be orders of magnitude faster with numba and NumPy structures.
Alternatively, you can try to rewrite the code in a language like Julia (very similar syntax to Python, and the community is extremely helpful) and quickly check how fast it is, in order to explore the limits of the performance. It's always a good idea to get a feeling for how fast something (or parts of a code) can be in a language that doesn't have Python's performance-critical complications.
Your task is CPU-bound rather than I/O-bound. Asynchronous execution makes sense when you have long I/O operations, i.e. sending/receiving something over the network, etc.
What you can do is split the task into chunks and use multiprocessing so the chunks run on different CPU cores.
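A minimal sketch of that idea, using functools.partial to deal with the extra-argument problem mentioned in the question (do_work and the argument tuples here are simplified placeholders):

import functools
import multiprocessing

def do_work(a, b, helper, z):
    # placeholder for the real in-memory calculation
    return (a - b) * helper + z

def run_parallel(pairs, helper, z):
    worker = functools.partial(do_work, helper=helper, z=z)  # bind the extra arguments once
    pool = multiprocessing.Pool()
    try:
        # starmap unpacks each (a, b) tuple into worker's positional arguments
        results = pool.starmap(worker, pairs)
    finally:
        pool.close()
        pool.join()
    return results

if __name__ == '__main__':
    pairs = [(1.0, 2.0), (3.0, 4.0), (5.0, 6.0)]
    print(run_parallel(pairs, helper=3, z=2))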

Scipy minimize function seems to be creating multiple threads by itself?

I am using the scipy minimize function. The function that it's calling was compiled with Cython and has an underlying C++ implementation that I wrote, but that shouldn't really matter. For some reason, when I run my program, it creates as many threads as it can to fill all my CPUs. For example, if I run top I see that 800% of a CPU is being used, or in htop I can see that 8 individual processors are being used, when I only wrote the program to run on one. I didn't think that scipy even had parallel processing functionality, and I can't find any documentation related to this. What could possibly be going on, and is there any way to control it?
If some BLAS implementation (with threading support) is available (the default on Ubuntu, for example), some expressions like np.dot() (only the dense case, as far as I know) will automatically be run in parallel (reference). Another possible example is sparse-matrix factorization with SuperLU.
Of course, different minimizers will behave differently.
Newton-type methods (core: solve a system of sparse linear equations) are probably based on SuperLU (if the code is not one of the common old Fortran/C ones, where the whole code is self-contained). CG-type methods are heavily based on matrix-vector products (np.dot), so the dense case will run in parallel.
For some control over this, start with this SO question.
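One common way to get that control is to cap the BLAS/OpenMP thread pools through environment variables before NumPy is imported; which variable actually matters depends on the BLAS you are linked against, so treat this as a best-guess sketch:

import os

# limit the BLAS/OpenMP thread pools to one thread; must be set before importing numpy
os.environ["OMP_NUM_THREADS"] = "1"
os.environ["OPENBLAS_NUM_THREADS"] = "1"
os.environ["MKL_NUM_THREADS"] = "1"

import numpy as np
from scipy.optimize import minimize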

Using scipy routines outside of the GIL

This is sort of a general question related to a specific implementation I have in mind, about whether it's safe to use python routines designed for use inside the GIL in a shared memory environment. Specifically what I'd like to do is use scipy.optimize.curve_fit on a large array inside a cython function.
The data can be expressed as a 2D numpy array (say, of floats), with one axis to be fit along and the other the (currently serial) axis to be parallelized over. Then I'd just like to release the GIL and start looping through the data with a cython.parallel.prange (the idea being that I can have all my cores working on fitting at once).
The main issue I can foresee is that curve_fit does not operate "in place"; it returns the fit values of the parameters (and optionally their covariance matrix) and so has to allocate that memory at some point. (Of course I also have no idea about any intermediate memory allocation the routine performs.) I'm worried about how this will operate outside the GIL with many threads working concurrently.
I realize that the answer could just be "it should work fine, go try it," but I'm hoping to get some idea of what to look out for. I also realize that this question is similar to others about parallelizing scipy/numpy routines, but I think this one is worded differently in that it falls within the Cython scope of a C environment for Python.
Thanks for any help/suggestions.
Not safe. If CPython could safely run that kind of code without the GIL, we wouldn't have the GIL in the first place.
You may find the following discussion to be of interest on Parallel Programming in SciPy.
[I would have posted this as merely a comment, but I lack the requisite reputation.]
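If prange is ruled out, one GIL-friendly alternative is to parallelize over the second axis with processes rather than threads, so each fit runs in its own interpreter. A rough sketch under made-up data and model assumptions:

import numpy as np
from multiprocessing import Pool
from scipy.optimize import curve_fit

def model(x, a, b):
    return a * np.exp(-b * x)

x = np.linspace(0, 4, 50)

def fit_one(y):
    # fit a single row of the 2-D array; return only the optimal parameters
    popt, _ = curve_fit(model, x, y)
    return popt

if __name__ == '__main__':
    # one noisy curve per row, standing in for the real data
    data = np.array([model(x, a, 0.5) + 0.01 * np.random.randn(x.size)
                     for a in (1.0, 2.0, 3.0)])
    pool = Pool()
    try:
        params = pool.map(fit_one, data)   # each row fitted in a separate process
    finally:
        pool.close()
        pool.join()
    print(params)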

Python Multiprocessing/EM

I wrote a machine learning Expectation Maximization algorithm in Python, basically an implementation of IBM Model 1 for doing machine translation (here is my GitHub if you want to look at the code) and it works, but reeeaaaaallly sloowwwlly. I'm taking a class now in parallel computing and I was wondering if I could use Python multiprocessing to reach convergence faster. Can anyone give me any pointers or tips? I don't even know where to start.
EDIT: I was reading around and found this paper on using EM with MapReduce to do parallelization -- maybe this is a better idea?
Most of your problem is that Python is really slow. Remember, your code is executing in an interpreter. When you write code (such as line 82) where you perform a numerical computation one element at a time, you pay for that one computation - and all the overhead of the Python interpreter.
The first thing you will want to do is vectorize your code with numpy. Unlike your normal Python code, numpy calls out to precompiled, efficient binary code. The more work you can push into numpy, the less time you will waste in the interpreter.
Once you vectorize your code, you can then start profiling it if it's still too slow. You should be able to find a lot of simple examples on how to vectorize Python, and some of the alternative options.
EDIT: Let me clarify: parallelizing inherently slow code is mostly pointless. First, there is the issue that parallelizing slow code gives the false impression that you have made an improvement. The "scaling up" of parallel code should always be measured against the fastest possible single-threaded version of the same code (within reason, no need to write everything in assembly before starting any parallel code). For example, consider a lock under contention. The more threads fighting for the lock, the slower the code will run, and you will get no (or negative) performance gains. One way to reduce contention for the lock is to simply slow down the code competing for it. This makes it appear as if there is no overhead from lock contention, when in actuality you have gained nothing, because the fastest single-threaded version of your code will outperform your parallel code.
Also, Python really isn't a great language to learn how to write parallel code in. Python has the GIL, which essentially forces all multithreaded code in Python to run as if there were but one CPU core. This means bizarre hacks (such as the one you linked) must be used, which have their own additional drawbacks and issues (there are times when such tricks are needed / used, but they shouldn't be the default for running code on a single machine). Don't expect what you learn writing any parallel Python code to carry over to other languages or help you with your course.
I think you will have some good success depending on where your bottleneck is. One caveat - when I do code optimization, I always like to profile the code, even informally, to get an idea of where the bottlenecks are. This will help identify where the time is being spent, e.g. file I/O, network latency, resource contention, not enough CPU cycles, etc.
For others who may not be familiar with the Expectation Maximization algorithm a very nice introduction is in Motion Segmentation using EM - a short tutorial, by Yair Weiss. Let us assume we have M data points and N classes/models.
In the EM algorithm there are two steps: Computing the distance between data points and models and Updating our model weights using weighted least squares.
Step 1 - Expectation stage
for data_point in M:
    for current_model in N:
        compute distance or residual between data_point and current_model
Step 2 - Maximization stage
for each model, compute weighted least squares, solving for the model parameters
This requires solving N weighted least squares problems, where the size is dependent on the number of parameters in the model that will be solved for.
Your bottleneck may be in stage 1, the E-step: computing the residuals or distances between the data points and the models. In this stage the computations are all independent. I would consider the first stage embarrassingly parallel and quite amenable to parallel computation using a parallel map-reduce or some other tools in Python. I have had good success using IPython for such tasks, but there are other good Python packages as well.
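As a rough sketch of how those E-step residuals can be vectorized with NumPy broadcasting before (or instead of) parallelising them (the data and models here are invented one-dimensional toys):

import numpy as np

M, N = 10000, 5
data = np.random.rand(M)      # M data points
models = np.random.rand(N)    # N model centres (toy 1-D "models")

# residuals[i, j] = distance between data point i and model j,
# computed for all M * N pairs in a single broadcasted operation
residuals = np.abs(data[:, None] - models[None, :])
print(residuals.shape)        # (M, N)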

practice with threads in python

I know that Python has a global lock and I've read Glyph's explanation of Python multithreading. But I still want to try it out. What I decided to do as an easy (conceptually) task was to do horizontal and vertical edge detection on a picture.
Here's what's happening (pseudocode):
for pixels in picture:
    apply sobel operator horizontal
for pixels in picture:
    apply sobel operator vertical
info on sobel operator.
These two loops can run completely independently of each other, and so would be prime candidates for multithreading. (Running these two loops on any significantly large picture can take 10+ seconds.) However, when I have tried to use the threading module in Python, it takes twice as long because of the global lock. My question is: should I abandon all hope of doing this in two threads in Python and try another language? If I can forge ahead, what module(s) should I use? If not, what language should I experiment in?
Python 2.6 now includes the multiprocessing module (formerly the processing module in older versions of Python).
It has essentially the same interface as the threading module, but launches the execution into separate processes rather than threads. This allows Python to take advantage of multiple cores/CPUs and scales well for CPU-intensive tasks compared to the threading module approach.
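A hedged sketch of that approach for this particular task, assuming SciPy is available and using scipy.ndimage.sobel for the two passes (the image here is a random placeholder):

import numpy as np
from multiprocessing import Pool
from scipy import ndimage

image = np.random.rand(2000, 2000)   # placeholder for the real picture

def sobel_pass(axis):
    # axis=0 and axis=1 give the two directional passes
    return ndimage.sobel(image, axis=axis)

if __name__ == '__main__':
    pool = Pool(processes=2)
    try:
        horizontal, vertical = pool.map(sobel_pass, [0, 1])  # the two passes run in separate processes
    finally:
        pool.close()
        pool.join()
    edges = np.hypot(horizontal, vertical)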
If the Sobel operator is CPU-bound, then you won't get any benefit from multiple threads, because Python threads do not take advantage of multiple cores.
Conceivably you could spin off multiple processes, though I'm not sure if that would be practical for working on a single image.
10 seconds doesn't seem like a lot of time to waste. If you're concerned about time because you'll be processing many images, then it might be easier to run multiple processes and have each process deal with a separate subset of the images.
I recommend using NumPy as well. Not only will it probably be faster, but many of its operations release the global lock, so threads can actually help when the heavy work happens inside NumPy.
I'll also suggest using multiprocessing as Jay suggests.
Anyways, if you really want to practice threading, I'd suggest playing around with PThreads in C. PThreads are insanely simple to use for basic cases and used all over the place.
Bulk matrix operations like the Sobel operator will definitely realize significant speed gains by (correctly) using Matlab/Octave. It is possible that NumPy may provide similar speedups for matrix/array ops.
Python multiprocessing is the right choice if you want to practice parallel programming with Python. If you don't have Python 2.6 (which you don't if you're using Ubuntu, for example), you can use the Google Code backported version of multiprocessing. It is on PyPI, which means you can easily install it using EasyInstall (which is part of the python-setuptools package in Ubuntu).
