I know that Python has a global interpreter lock, and I've read Glyph's explanation of Python multithreading. But I still want to try it out. What I decided on as a conceptually easy task was horizontal and vertical edge detection on a picture.
Here's what's happening (pseudocode):
for pixels in picture:
    apply sobel operator horizontal
for pixels in picture:
    apply sobel operator vertical
These two loops can run completely independently of each other, so they would be prime candidates for multithreading (running them on any significantly large picture can take 10+ seconds). However, when I tried to use the threading module in Python, it took twice as long because of the global lock. My question is: should I abandon all hope of doing this in two threads in Python and try another language? If I can forge ahead, what module(s) should I use? If not, what language should I experiment in?
Python 2.6 now includes the multiprocessing module (formerly the processing module in older versions of Python).
It has essentially the same interface as the threading module, but launches the execution into separate processes rather than threads. This allows Python to take advantage of multiple cores/CPUs and scales well for CPU-intensive tasks compared to the threading module approach.
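A rough sketch of how that could look for the two Sobel passes (scipy.ndimage.sobel and the random test image below are stand-ins for your per-pixel code, not part of the original question):

    # Rough sketch: run the two Sobel passes in separate worker processes.
    # scipy.ndimage.sobel stands in for whatever per-pixel code you have,
    # and the random array is a placeholder for the real picture.
    import numpy as np
    from multiprocessing import Pool
    from scipy import ndimage

    def sobel_horizontal(image):
        return ndimage.sobel(image, axis=1)   # horizontal edges

    def sobel_vertical(image):
        return ndimage.sobel(image, axis=0)   # vertical edges

    if __name__ == "__main__":
        image = np.random.rand(4000, 4000)
        pool = Pool(processes=2)
        h = pool.apply_async(sobel_horizontal, (image,))
        v = pool.apply_async(sobel_vertical, (image,))
        horizontal, vertical = h.get(), v.get()
        pool.close()
        pool.join()

Note that the image is pickled and copied to each worker process, so for a single picture the copying overhead eats into the gains.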
If the Sobel operator is CPU-bound, then you won't get any benefit from multiple threads, because the GIL prevents pure-Python threads from running on multiple cores at the same time.
Conceivably you could spin off multiple processes, though I'm not sure if that would be practical for working on a single image.
10 seconds doesn't seem like a lot of time to waste. If you're concerned about time because you'll be processing many images, then it might be easier to run multiple processes and have each process deal with a separate subset of the images.
I recommend using NumPy as well. Not only will it probably be faster, but many of its operations release the global lock, so threads can actually run in parallel while they execute.
I'll also second Jay's suggestion of multiprocessing.
Anyway, if you really want to practice threading, I'd suggest playing around with pthreads in C. pthreads are insanely simple to use for basic cases and are used all over the place.
Bulk matrix operations like the Sobel operator will definitely realize significant speed gains by (correctly) using Matlab/Octave. It is possible that NumPy may provide similar speedups for matrix/array ops.
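As a hedged illustration, here is roughly what a fully vectorized Sobel pass could look like in NumPy/SciPy (the kernel values are the standard Sobel weights; the test image is a placeholder, not taken from the question):

    # Sketch: a whole-image Sobel pass as one vectorized convolution,
    # replacing the per-pixel Python loop.
    import numpy as np
    from scipy.signal import convolve2d

    sobel_x = np.array([[-1, 0, 1],
                        [-2, 0, 2],
                        [-1, 0, 1]], dtype=float)

    image = np.random.rand(2000, 2000)                               # placeholder picture
    gx = convolve2d(image, sobel_x, mode="same", boundary="symm")    # horizontal edges
    gy = convolve2d(image, sobel_x.T, mode="same", boundary="symm")  # vertical edges
    magnitude = np.hypot(gx, gy)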
Python multiprocessing is the right choice if you want to practice parallel programming with Python. If you don't have Python 2.6 (which you don't if you're using Ubuntu, for example), you can use the Google Code backport of multiprocessing. It is available on PyPI, which means you can easily install it using EasyInstall (part of the python-setuptools package in Ubuntu).
First of all, I've read multiple forums, papers and articles on the subject.
Until now I haven't needed to use a GPU in my processes, but they have become more demanding. The problem is that I have a somewhat complex function written in Python, vectorized with NumPy and using @jit to make it run faster.
However, my GPU (AMD) shows no usage in the task manager (0%). I have seen PyOpenCL, but I want to know if there is something simpler than translating the code. The function itself is fast; the problem is that I want to iterate it 18 million times, which currently takes me 5 hours split across multiple processes. I know that I can use multiprocessing on the CPU, but I want to use my GPU. Is there some 'easy' way to split the task onto the GPU?
We had some discussion about whether Numba can compile code for the GPU automatically. I think it used to be able to, but that approach is now deprecated. The other approach is to use @numba.cuda.jit and write the code in terms of CUDA blocks, threads and so on. It works well. With it you enter the big and fascinating (I am not joking) world of CUDA programming. You could, for example, parallelize the runs of your big function with different parameters; maybe you won't even need to rewrite the function itself for that.
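Here is a minimal sketch of the @numba.cuda.jit style (the kernel body is illustrative, not your function, and numba.cuda targets NVIDIA GPUs, so it will not drive an AMD card):

    # Sketch of a CUDA kernel with Numba; doubling each element stands in
    # for the real per-element work.
    import numpy as np
    from numba import cuda

    @cuda.jit
    def double_kernel(x, out):
        i = cuda.grid(1)              # global thread index
        if i < x.shape[0]:
            out[i] = x[i] * 2.0

    x = np.arange(1_000_000, dtype=np.float64)
    out = np.empty_like(x)
    threads_per_block = 256
    blocks = (x.shape[0] + threads_per_block - 1) // threads_per_block
    double_kernel[blocks, threads_per_block](x, out)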
I am using the scipy minimize function. The function it's calling was compiled with Cython and has an underlying C++ implementation that I wrote, but that shouldn't really matter. For some reason, when I run my program it creates as many threads as it can to fill all my CPUs. For example, if I run top I see that 800% of a CPU is being used, or on htop I can see that 8 individual processors are being used, when I only created the program to run on one. I didn't think that scipy even had parallel processing functionality, and I can't find any documentation related to this. What could possibly be going on, and is there any way to control it?
If some BLAS implementation (with threading support) is available (the default on Ubuntu, for example), some expressions like np.dot() (only the dense case, as far as I know) will automatically run in parallel (reference). Another possible example is sparse-matrix factorization with SuperLU.
Of course, different minimizers will behave differently.
Newton-type methods (core: solving a system of sparse linear equations) are probably based on SuperLU (if the code is not one of the common old Fortran/C ones, where the whole code is self-contained). CG-type methods are heavily based on matrix-vector products (np.dot), so the dense case will be parallel.
For some control over this, start with this SO question.
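As one common control knob (which variable actually applies depends on which BLAS your NumPy links against, so treat these names as assumptions about your setup), you can cap the BLAS thread pool through environment variables before importing numpy:

    # Sketch: limit the threads used by the BLAS backend. Set the variables
    # before importing numpy/scipy, since the pools are sized at import time.
    import os
    os.environ["OMP_NUM_THREADS"] = "1"        # OpenMP-based backends
    os.environ["OPENBLAS_NUM_THREADS"] = "1"   # OpenBLAS
    os.environ["MKL_NUM_THREADS"] = "1"        # Intel MKL

    import numpy as np
    from scipy.optimize import minimize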
I have a two-dimensional table (matrix).
I need to process each line of this matrix independently of the others.
Processing each line is time-consuming.
I'd like to use the parallel computing resources at our university (Canadian Grid something).
Can I have some advice on how to start? I have never used parallel computing before.
Thanks :)
Start here: http://docs.python.org/library/multiprocessing.html
Be sure to read this: http://docs.python.org/library/multiprocessing.html#examples
This may be helpful: http://www.slideshare.net/pvergain/multiprocessing-with-python-presentation.
While excellent, it includes threads and multiprocessing, even though multiprocessing is often far, far superior to attempting multi-threading.
For Grid computing, multi-threading is largely useless.
You probably also want to read up on Celery.
I am one of the developers of a new library called scoop.
It was built exactly for this purpose (grid or super-computing, scientific computing). I suggest you give it a try.
In your case, all you would have to do is a call like this:
    from scoop import futures
    results = list(futures.map(YourFunc, matrixLine))
It will then be distributed on your grid or whatever environment you choose.
As the commenters have said, find someone to talk to in your university. The answer to your question will be specific to the software installed on the grid. If you have access to a grid, it's highly likely you also have access to a person whose job it is to answer your questions (and they will be pleased to help) - find this person!
From what you describe, I would say: first have a look at numpy.
Numpy provides methods to compute over columns and rows in a vectorized manner at nearly C speed. Depending on your problem, this could be faster than parallel computation with pure CPython.
You can then use parallel computing with numpy arrays to get a really big speedup.
Possible ways to do this are multiprocessing or IPython on a cluster.
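For example, a minimal multiprocessing sketch, where process_row is a hypothetical stand-in for your time-consuming per-line computation:

    # Sketch: farm the rows of a matrix out to a pool of worker processes.
    import numpy as np
    from multiprocessing import Pool

    def process_row(row):
        # placeholder for the real, time-consuming per-line computation
        return row.sum()

    if __name__ == "__main__":
        matrix = np.random.rand(1000, 500)
        pool = Pool()                              # one worker per CPU core by default
        results = pool.map(process_row, matrix)    # iterates over the rows
        pool.close()
        pool.join()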
It is recommended that you use C/C++ for this computation. You can use the OpenMP API via the #include <omp.h> header. You can open a parallel region with the #pragma omp parallel directive. Since you are parallelising a for-loop over your matrix computation, you can place #pragma omp parallel for directly before the loop so that its iterations are divided among the threads. OpenMP will automatically take care of the thread synchronisation.
Check this out for a sample code: https://gist.github.com/metallurgix/0dfafc03215ce89fc595
Remember to use a big matrix to see actual improvements in speed. A smaller matrix will in fact perform poorly, due to the overhead of forking and joining the multiple threads.
You can also check out MPI if you want to parallelise your code using multiple processors instead of multiple threads.
I am about to write some computationally-intensive Python code that'll almost certainly spend most of its time inside numpy's linear algebra functions.
The problem at hand is embarrassingly parallel. Long story short, the easiest way for me to take advantage of that would be by using multiple threads. The main barrier is almost certainly going to be the Global Interpreter Lock (GIL).
To help design this, it would be useful to have a mental model for which numpy operations can be expected to release the GIL for their duration. To this end, I'd appreciate any rules of thumb, dos and don'ts, pointers etc.
In case it matters, I'm using 64-bit Python 2.7.1 on Linux, with numpy 1.5.1 and scipy 0.9.0rc2, built with Intel MKL 10.3.1.
Quite a few numpy routines release the GIL, so they can run efficiently in parallel threads (info). Maybe you don't need to do anything special!
You can use this question to find whether the routines you need are among the ones that release GIL. In short, search for ALLOW_THREADS or nogil in the source.
(Also note that MKL has the ability to use multiple threads for a routine, so that's another easy way to get parallelism, although possibly not the fastest kind).
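A small sketch of that pattern, with illustrative sizes and a throwaway worker function:

    # Sketch: several threads each call np.dot on their own chunk. Because
    # the BLAS-backed dot releases the GIL, the calls can overlap on
    # multiple cores.
    import threading
    import numpy as np

    def worker(a, b, results, i):
        results[i] = np.dot(a, b)

    chunks = [np.random.rand(2000, 2000) for _ in range(4)]
    b = np.random.rand(2000, 2000)
    results = [None] * len(chunks)
    threads = [threading.Thread(target=worker, args=(a, b, results, i))
               for i, a in enumerate(chunks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()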
You will probably find answers to all your questions regarding NumPy and parallel programming on the official wiki.
Also, have a look at this recipe page -- it contains example code on how to use NumPy with multiple threads.
Embarrassingly parallel? Numpy? Sounds like a good candidate for PyCUDA or PyOpenCL.
I wrote some number-crunching Python code, and the calculations involved can take hours. Is it possible to somehow compile it to a binary?
Thanks
Not in any useful (for you) way, but moving the calculations into NumPy or Cython will speed them up.
First, you can try Psyco; that may give you a speedup of as much as 10x, but 2x is more typical.
If you can post the code up somewhere, perhaps someone can point out how to leverage numpy.
If your task doesn't map well onto numpy, then Cython is a good choice for converting an intensive function or two into C code just by adding a few cdefs.
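A hedged sketch of the idea (the function is a generic example, not the poster's code), saved as a .pyx file and compiled with Cython:

    # hot_loop.pyx: adding cdefs so the loop runs at C speed
    def sum_of_squares(xs):
        cdef int i
        cdef int n = len(xs)
        cdef double total = 0.0
        for i in range(n):
            total += xs[i] * xs[i]
        return total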
If you can show us the code (even just the hot spots) we can probably give you better advice.
Perhaps you can modify your algorithm.
Shedskin might be worth a try.
From their front page blurb:
Shed Skin is an experimental compiler, that can translate pure, but implicitly statically typed Python programs into optimized C++. It can generate stand-alone programs or extension modules that can be imported and used in larger Python programs.
Besides the typing restriction, programs cannot freely use the Python standard library (although about 20 common modules, such as random and re, are currently supported). Also, not all Python features, such as nested functions and variable numbers of arguments, are supported (see the tutorial for details).
For a set of 44 non-trivial test programs (at over 10,000 lines in total (sloccount)), measurements show a typical speedup of 2-40 times over Psyco, and 2-220 times over CPython. Because Shed Skin is still in an early stage of development, however, many other programs will not compile out-of-the-box.