I am starting to read up over possible ways to parallelise Python code.
DISCLAIMER. This is NOT a question about Multiprocessing vs Multithreading.
At this link https://ipyparallel.readthedocs.io/en/latest/demos.html one finds references to several
concurrency packages for Python to avoid the GIL: https://scipy.github.io/old-wiki/pages/ParallelProgramming
- IPython1
- mpi4py
- Parallel Python
- Numba
There is also a multiprocessing package:
https://docs.python.org/3/library/multiprocessing.html
And another one called processing:
https://pypi.org/project/processing/
First of all, the difference between the latter two is not at all clear to me: what is the difference between using the multiprocessing module and using the processing module?
In general, I fail to understand the differences between all of these packages. There must be some, given that developers made the effort to create mpi4py, a Python version of the MPI used in C++. I guess this is not just about the dualism between "threading" and "multiprocessing" approaches, where in one case the memory is shared while in the other each process has its own memory and interpreter; something more must differ between all of these packages.
Thanks to all of those who will dedicate time to answer this!
The difference is that the last version of processing was released in April of 2008 and multiprocessing was added in Python 2.6 in October 2008.
processing was a library that was used before multiprocessing was distributed with Python.
As for the specific differences between the other modules designed for multiprocessing: the SciPy page you linked says that "This is a subject for graduate courses in computer science, and I'm not going to address it here....there are some python tools you can use to implement the things you learn in that graduate course." While they admit that may be a bit of an exaggeration, discerning the differences between these libraries will require some independent study of multiprocessing in general. You should probably stick to the built-in multiprocessing module for your initial experiments while you learn how it works. Once you're more comfortable with multiprocessing, you might want to check out the pathos framework.
But here are the basics for the packages you mention:
Numba adds decorators that automatically compile functions to make them run faster; it isn't really a multiprocessing tool so much as a JIT compilation tool.
Parallel Python overcomes the GIL to utilize multiple cores or multiple computers; it's designed to be easy to use and to handle all the complex stuff behind the scenes.
MPI for Python (mpi4py) is like Parallel Python with less emphasis on simplicity.
IPython is a toolkit with many features, including a shell and a Jupyter kernel; it's also not really a multiprocessing tool.
Keep in mind that plenty of libraries/modules do the same thing; there doesn't need to be a reason for more than one to exist. Use whatever works for you.
Related
This is an assignment I am working on.
I have been asked to write a sample parallel processing program using native Python features. I can write the code, but the problem is that even after searching I cannot find a native parallel programming feature in Python.
Since we have to import the multiprocessing module, it is not native. I just cannot find which feature is available to use.
Already checked following threads but they use multiprocessing:
Parallel programming in python
Python multiprocessing for parallel processes
How to do parallel programming in Python
I think your definition of "native" is too narrow, or your understanding of the term "import" is mistaken.
The multiprocessing module is part of Python's standard library. Every Python implementation should have it. It is a native feature of Python.
The term "import" should be understood as "make this module available in this program", not as "add this non-native feature to the language". Importing a module does not change the language.
Edit:
In Python 3 you can write concurrent programs with async def and await. But that shouldn't be considered real parallel processing. You might call it cooperative "multitasking", but it isn't really: it's task switching.
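A small sketch makes the "task switching" point concrete: two coroutines defined with async def take turns on a single thread, interleaving at each await point, so nothing ever runs in parallel (the coroutine names and the shared list are illustrative).

```python
# Cooperative "multitasking" with async def: two coroutines take turns
# on one thread; execution interleaves at await points, nothing is parallel.
import asyncio

order = []

async def worker(name, count):
    for i in range(count):
        order.append(f"{name}{i}")
        await asyncio.sleep(0)  # yield control back to the event loop

async def main():
    # gather schedules both coroutines on the same event loop
    await asyncio.gather(worker("A", 2), worker("B", 2))

asyncio.run(main())
print(order)  # -> ['A0', 'B0', 'A1', 'B1']
```

The interleaved order shows the event loop switching between tasks at each await, which is exactly task switching rather than parallel execution.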
I just posted a question here asking why Python imports take as long as they do. Are there environments that don't require reinitializing modules? If so, what are they?
Details: I'm trying to learn basic Python syntax while using extended libraries (matplotlib, mayavi), and each time I test my code I wait several (!) seconds for the modules to load. There must be a faster way to do this, but I don't know which environments are well suited. Suggestions?
Take a look at IPython and pandas; they might be closer to what you want. Python does have a way to reload modules, but I'm not sure how well it works, so anything that keeps a single Python instance running and doesn't spawn Python child processes is likely to fit the bill (sorry, I'm not sure what's available in that area).
http://ipython.org/
http://pandas.pydata.org/
Any environment with client/server architecture (short-lived cli/gui/web-clients, long-lived computational kernels) such as https://jupyter.org/ will do.
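The reload mechanism mentioned above can be sketched as follows: in a long-running interpreter, heavy modules are imported once, and a module you are actively editing can be re-executed in place with importlib.reload instead of restarting the whole process (json here is just a stand-in for the module you are editing).

```python
# Sketch: inside a long-running interpreter, a module is imported once;
# importlib.reload re-executes its top-level code in place, avoiding a
# full interpreter restart (json stands in for a module you are editing).
import importlib
import json

json = importlib.reload(json)  # re-runs the module's top-level code
print(json.dumps({"ok": True}))  # -> {"ok": true}
```

Note that reload only re-executes the named module itself, not its dependencies, which is one reason it works less smoothly for large packages like matplotlib.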
Today I stumbled over a post in stackoverflow (see also here):
We are developing opencl4py, higher level bindings. This project uses CFFI, so it works on Pypy.
The major issue we encountered with pyopencl is that 'import pyopencl' performs OpenCL initialization and takes up the whole virtual memory in the case of the NVIDIA driver, preventing correct forking and effectively disabling multiprocessing (yes, we claim that using pyopencl disables multiprocessing, at least with NVIDIA). opencl4py uses lazy OpenCL initialization, resolving this "import hell".
Later, it gained some nice features, such as super easy binary program caching. Unfortunately, the documentation is somewhat brief. The best way to learn how it works is to go through the tests.
As there is also pyOpenCL, I was wondering what the difference between these two packages is. Does anybody know where I can find an overview of the pros and cons of both packages?
Edit: To include benshope's comment as I would also be interested: what does "disable[s] multiprocessing" mean? Like, it can't run kernels on several devices at one time?
As far as I know, there is no such overview. I'll try to list some key points:
pyOpenCL is a mature project with a relatively large user base. There are tutorials, an FAQ, etc. opencl4py appeared in 03/2014; there are no tutorials or FAQ, only unit tests and docstrings.
pyOpenCL is a native CPython extension, whereas opencl4py uses cffi, so that it works on PyPy (pyOpenCL does NOT) and does not need to be recompiled each time CPython changes version.
pyOpenCL has extras, such as a random number generator and OpenGL interoperability.
opencl4py is extensively tested in real-world Samsung production scenarios and is being actively developed.
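The lazy initialization that opencl4py is said to use can be sketched generically (the names below are hypothetical, not the actual opencl4py code): expensive setup runs on first use rather than at import time, so importing the module stays safe before fork() or multiprocessing.

```python
# Generic sketch of the lazy-initialization pattern (hypothetical names,
# not the actual opencl4py code): the expensive setup runs on first use,
# not at import time, so merely importing this module stays fork-safe.
_context = None

def _expensive_setup():
    # placeholder for driver/device initialization
    return {"initialized": True}

def get_context():
    global _context
    if _context is None:       # only the first call pays the cost
        _context = _expensive_setup()
    return _context            # later calls reuse the cached context

print(get_context()["initialized"])  # -> True
```

Because nothing touches the driver at import time, a parent process can import the module, fork workers, and let each worker trigger its own initialization on first use.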
what does "disable[s] multiprocessing" mean? Like, it can't run kernels on several devices at one time?
Of course it can; I was trying to say that after importing pyopencl, os.fork() or multiprocessing.Process() leads to crashes inside NVIDIA's OpenCL userspace library. It is always a bad idea to do work during import.
Does Enthought Canopy support parallel code execution on the CPU, perhaps using OpenMPI, or on the GPU using OpenCV or CUDA?
I am looking into switching from C++ to Python, as I want to make a GUI for my parallel code.
Is this a good idea? Does Python support parallel computation?
Yes, Python does support this. There are three layers to processes with Python:
subprocess: runs an external command in a separate child process; it does not parallelize the Python code within your program
threading: starts a new thread and leaves the old one alone. Because of the GIL, there are frequent reports that this does not necessarily lead to better performance for CPU-bound code.
multiprocessing: starts separate Python processes that can run truly in parallel; this is what you are after
Here is an intro to parallel processing on Python.
The official docs for the multiprocessing module are here.
The ever so useful discussions on the Python Module of the Week are also worth a look.
Edit:
The python libraries mentioned by HT #jonathan are likely to be:
Cuda:
http://mathema.tician.de/software/pycuda
OpenCV:
http://code.google.com/p/pyopencv/
There is a nice tutorial for this here.
And Message Passing Interface:
http://mpi4py.scipy.org/docs/usrman/intro.html
I'm new to Python, and it seems that the multiprocessing and threading modules are not very interesting and suffer from the same problems as threads in Perl. Is there a technical reason why the interpreter can't use lightweight threads such as POSIX threads to make an efficient thread implementation that really runs on several cores?
It is already using POSIX threads. The problem is the GIL: only one thread can execute Python bytecode at a time.
Note that the GIL is not part of the Python spec --- it's part of the CPython reference implementation. Jython, for example, does not suffer from this problem.
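The effect of the GIL can be seen directly by timing the same CPU-bound function on two threads versus two processes (the loop body and sizes are arbitrary; only the relative timings matter).

```python
# Sketch: the GIL in action. CPU-bound work gains little from threads in
# CPython, because only one thread executes bytecode at a time, while
# separate processes can use several cores.
import threading
import multiprocessing
import time

def burn(n):
    total = 0
    for i in range(n):
        total += i * i
    return total

def timed(run):
    start = time.perf_counter()
    run()
    return time.perf_counter() - start

N = 1_000_000

def with_threads():
    ts = [threading.Thread(target=burn, args=(N,)) for _ in range(2)]
    for t in ts: t.start()
    for t in ts: t.join()

def with_processes():
    ps = [multiprocessing.Process(target=burn, args=(N,)) for _ in range(2)]
    for p in ps: p.start()
    for p in ps: p.join()

if __name__ == "__main__":
    # Expect the thread version to take roughly as long as two sequential
    # runs, while the processes overlap (minus process startup cost).
    print(f"threads:   {timed(with_threads):.2f}s")
    print(f"processes: {timed(with_processes):.2f}s")
```

On an implementation without a GIL (such as Jython, mentioned above), the thread version would also be expected to scale across cores.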
That said, have you looked into Stackless?
Piotr,
You might want to take a look at Stackless (http://www.stackless.com/), which is a modified version of Python that runs lightweight tasklets in a message-passing (Erlang-style) fashion.
I'm not sure if you're looking for a multicore solution, but poking around in Stackless might give you what you're looking for.
Ben