I was doing multiprocessing in python and hit a pickling error. Which makes me wonder why do we need to pickle the object in order to do multiprocessing? isn't fork() enough?
Edit: I kind of get why we need pickle to do interprocess communication, but that is just for the data you want to transfer right? why does the multiprocessing module also try to pickle stuff like functions etc?
Which makes me wonder why do we need to pickle the object in order to
do multiprocessing?
We don't need pickle, but we do need to communicate between processes, and pickle happens to be a very convenient, fast, and general serialization method for Python. Serialization is one way to communicate between processes. Memory sharing is the other. Unlike memory sharing, the processes don't even need to be on the same machine to communicate. For example, PySpark using serialization very heavily to communicate between executors (which are typically different machines).
Addendum: There are also issues with the GIL (Global Interpreter Lock) when sharing memory in Python (see comments below for detail).
isn't fork() enough?
Not if you want your processes to communicate and share data after they've forked. fork() clones the current memory space, but changes in one process won't be reflected in another after the fork (unless we explicitly share data, of course).
I kind of get why we need pickle to do interprocess communication, but
that is just for the data you want to transfer right? why does the
multiprocessing module also try to pickle stuff like functions etc?
Sometimes complex objects (i.e. "other stuff"? not totally clear on what you meant here) contain the data you want to manipulate, so we'll definitely want to be able to send that "other stuff".
Being able to send a function to another process is incredibly useful. You can create a bunch of child processes and then send them all a function to execute concurrently that you define later in your program. This is essentially the crux of PySpark (again a bit off topic, since PySpark isn't multiprocessing, but it feels strangely relevant).
There are some functional purists (mostly the LISP people) that make arguments that code and data are the same thing. So it's not much of a line to draw for some.
Related
I'm playing with Python multiprocessing module to have a (read-only) array shared among multiple processes. My goal is to use multiprocessing.Array to allocate the data and then have my code forked (forkserver) so that each worker can read straight from the array to do their job.
While reading the Programming guidelines I got a bit confused.
It is first said:
Avoid shared state
As far as possible one should try to avoid shifting large amounts of
data between processes.
It is probably best to stick to using queues or pipes for
communication between processes rather than using the lower level
synchronization primitives.
And then, a couple of lines below:
Better to inherit than pickle/unpickle
When using the spawn or forkserver start methods many types from
multiprocessing need to be picklable so that child processes can use
them. However, one should generally avoid sending shared objects to
other processes using pipes or queues. Instead you should arrange the
program so that a process which needs access to a shared resource
created elsewhere can inherit it from an ancestor process.
As far as I understand, queues and pipes pickle objects. If so, aren't those two guidelines conflicting?
Thanks.
The second guideline is the one relevant to your use case.
The first is reminding you that this isn't threading where you manipulate shared data structures with locks (or atomic operations). If you use Manager.dict() (which is actually SyncManager.dict) for everything, every read and write has to access the manager's process, and you also need the synchronization typical of threaded programs (which itself may come at a higher cost from being cross-process).
The second guideline suggests inheriting shared, read-only objects via fork; in the forkserver case, this means you have to create such objects before the call to set_start_method, since all workers are children of a process created at that time.
The reports on the usability of such sharing are mixed at best, but if you can use a small number of any of the C-like array types (like numpy or the standard array module), you should see good performance (because the majority of pages will never be written to deal with reference counts). Note that you do not need multiprocessing.Array here (though it may work fine), since you do not need writes in one concurrent process to be visible in another.
Is it possible to deal in python with shared memory parallel tasks? My task should be parallel over several cores (though threading module does not fit here, as far as I know the only facilities to do that is multiprocessing). There are lots of tasks for which I create a thread (in a python case process) pool. Then I need to initialize this threads (processes) with a lot of data from the main thread (process). Threads process this data and return new ones (again a lot of) to the main thread (process). What I see a huge overhead is that each task must copy the data to the new process in case of processes and do that same after finishing. But in case of threads this is eliminated. And it should be a huge speed up. Can I achieve this speed up with python?
Yes, the multiprocessing module does support placing objects in shared memory. See the documentation.
Threads won't help you due to the GIL serializing them. However, you can still use multiprocessing and share the data between processes using the mmap module or equivalent. This will require the data to be structured to be readable from the file, so you won't be able to use, say, dictionaries - but using a file-based storage such as sqlite will be fine.
I'm trying to find a reasonable approach in Python for a real-time application, multiprocessing and large files.
A parent process spawn 2 or more child. The first child reads data, keep in memory, and the others process it in a pipeline fashion. The data should be organized into an object,sent to the following process, processed,sent, processed and so on.
Available methodologies such as Pipe, Queue, Managers seem not adequate due to overheads (serialization, etc).
Is there an adequate approach for this?
I've used Celery and Redis for real-time multiprocessing in high memory applications, but it really depends on what you're trying to accomplish.
The biggest benefits I've found in Celery over built-in multiprocessing tools (Pipe/Queue) are:
Low overhead. You call a function directly, no need to serialize data.
Scaling. Need to ramp up worker processes? Just add more workers.
Transparency. Easy to inspect tasks/workers and find bottlenecks.
For really squeezing out performance, ZMQ is my go to. A lot more work to set up and fine-tune, but it's as close to bare sockets as you can safely get.
Disclaimer: This is all anecdotal. It really comes down to what your specific needs are. I'd benchmark different options with sample data before you go down any path.
First, a suspicion that message-passing may be inadequate because of all the overhead is not a good reason to overcomplicate your program. It's a good reason to build a proof of concept and come up with some sample data and start testing. If you're spending 80% of your time pickling things or pushing stuff through queues, then yes, that's probably going to be a problem in your real life code—assuming the amount of work your proof of concept does is reasonably comparable to your real code. But if you're spending 98% of your time doing the real work, then there is no problem to solve. Message passing will be simpler, so just use it.
Also, even if you do identify a problem here, that doesn't mean that you have to abandon message passing; it may just be a problem with what's built in to multiprocessing. Technologies like 0MQ and Celery may have lower overhead than a simple queue. Even being more careful about what you send over the queue can make a huge difference.
But if message passing is out, the obvious alternative is data sharing. This is explained pretty well in the multiprocessing docs, along with the pros and cons of each.
Sharing state between processes describes the basics of how to do it. There are other alternatives, like using mmapped files of platform-specific shared memory APIs, but there's not much reason to do that over multiprocessing unless you need, e.g., persistent storage between runs.
There are two big problems to deal with, but both can be dealt with.
First, you can't share Python objects, only simple values. Python objects have internal references to each other all over the place, the garbage collector can't see references to objects in other processes' heaps, and so on. So multiprocessing.Value can only hold the same basic kinds of native values as array.array, and multiprocessing.Array can hold (as you'd guess by the name) 1D arrays of the same values, and that's it. For anything more complicated, if you can define it in terms of a ctypes.Structure, you can use https://docs.python.org/3/library/multiprocessing.html#module-multiprocessing.sharedctypes, but this still means that any references between objects have to be indirect. (For example, you often have to store indices into an array.) (Of course none of this is bad news if you're using NumPy, because you're probably already storing most of your data in NumPy arrays of simple values, which are sharable.)
Second, shared data are of course subject to race conditions. And, unlike multithreading within a single process, you can't rely on the GIL to help protect you here; there are multiple interpreters that can all be trying to modify the same data at the same time. So you have to use locks or conditions to protect things.
For multiprocessing pipeline check out MPipe.
For shared memory (specifically NumPy arrays) check out numpy-sharedmem.
I've used these to do high-performance realtime, parallel image processing (average accumulation and face detection using OpenCV) while squeezing out all available resources from a multi-core CPU system. Check out Sherlock if interested. Hope this helps.
One option is to use something like brain-plasma that maintains a shared-memory object namespace that is independent of the Python process or thread. Kind of like Redis but can be used with big objects and has a simple API, built on top of Apache Arrow.
$ pip install brain-plasma
# process 1
from brain_plasma import Brain
brain = Brain()
brain['myvar'] = 657
# process 2
from brain_plasma import Brain
brain = Brain()
brain['myvar']
# >>> 657
Python 3.8 now offers shared memory access between processes using multiprocessing.shared_memory. All you hand off between processes is a string that references the shared memory block. In the consuming process you get a memoryview object which supports slicing without copying the data like byte arrays do. If you are using numpy it can reference the memory block in an O(1) operation, allowing fast transfers of large blocks of numeric data. As far as I understand generic objects still need to be deserialized since a raw byte array is what's received by the consuming process.
Why does the multiprocessing package for python pickle objects to pass them between processes, i.e. to return results from different processes to the main interpreter process? This may be an incredibly naive question, but why can't process A say to process B "object x is at point y in memory, it's yours now" without having to perform the operation necessary to represent the object as a string.
multiprocessing runs jobs in different processes. Processes have their own independent memory spaces, and in general cannot share data through memory.
To make processes communicate, you need some sort of channel. One possible channel would be a "shared memory segment", which pretty much is what it sounds like. But it's more common to use "serialization". I haven't studied this issue extensively but my guess is that the shared memory solution is too tightly coupled; serialization lets processes communicate without letting one process cause a fault in the other.
When data sets are really large, and speed is critical, shared memory segments may be the best way to go. The main example I can think of is video frame buffer image data (for example, passed from a user-mode driver to the kernel or vice versa).
http://en.wikipedia.org/wiki/Shared_memory
http://en.wikipedia.org/wiki/Serialization
Linux, and other *NIX operating systems, provide a built-in mechanism for sharing data via serialization: "domain sockets" This should be quite fast.
http://en.wikipedia.org/wiki/Unix_domain_socket
Since Python has pickle that works well for serialization, multiprocessing uses that. pickle is a fast, binary format; it should be more efficient in general than a serialization format like XML or JSON. There are other binary serialization formats such as Google Protocol Buffers.
One good thing about using serialization: it's about the same to share the work within one computer (to use additional cores) or to share the work between multiple computers (to use multiple computers in a cluster). The serialization work is identical, and network sockets work about like domain sockets.
EDIT: #Mike McKerns said, in a comment below, that multiprocessing can use shared memory sometimes. I did a Google search and found this great discussion of it: Python multiprocessing shared memory
i just recently read an article about the GIL (Global Interpreter Lock) in python.
Which seems to be some big issue when it comes to Python performance. So i was wondering myself
what would be the best practice to archive more performance. Would it be threading or
either multiprocessing? Because i hear everybody say something different, it would be
nice to have one clear answer. Or at least to know the pros and contras of multithreading
against multiprocessing.
Kind regards,
Dirk
It depends on the application, and on the python implementation that you are using.
In CPython (the reference implementation) and pypy the GIL only allows one thread at a time to execute Python bytecode. Other threads might be doing I/O or running extensions written in C.
It is worth noting that some other implementations like IronPython and JPython don't have a GIL.
A characteristic of threading is that all threads share the same interpreter and all the live objects. So threads can share global data almost without extra effort. You need to use locking to serialize access to data, though! Imagine what would happen if two threads would try to modifiy the same list.
Multiprocessing actually runs in different processes. That sidesteps the GIL, but if large amounts of data need to be shared between processes that data has to be pickled and transported to another process via IPC where it has to be unpickled again. The multiprocessing module can take care of the messy details for you, but it still adds overhead.
So if your program wants to run Python code in parallel but doesn't need to share huge amounts of data between instances (e.g. just filenames of files that need to be processed), multiprocessing is a good choice.
Currently multiprocessing is the only way that I'm aware of in the standard library to use all the cores of your CPU at the same time.
On the other hand if your tasks need to share a lot of data and most of the processing is done in extension or is I/O, threading would be a good choice.