Python multiprocessing Queue vs Pipe vs SharedMemory

I want to run two Python processes in parallel, with each being able to send data to and receive data from the other at any time. Python's multiprocessing package seems to have multiple solutions to this, such as Queue, Pipe, and SharedMemory. What are the pros and cons of using each of these, and which one would be best for accomplishing this specific goal?

It comes down to what you want to share, who you want to share it with, how often you want to share it, what your latency requirements are, and also your skill set, your maintainability needs and your preferences. Then there are the usual trade-offs to be made between performance, legibility, upgradeability and so on.
If you are sharing native Python objects, they are generally most simply shared via a multiprocessing Queue, because they are pickled automatically before transmission and unpickled on receipt.
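As a minimal sketch of that pattern (the names here are just for illustration), a pair of queues gives two processes a simple two-way channel:

from multiprocessing import Process, Queue

def worker(inbox, outbox):
    # objects arriving on a Queue are pickled by the sender and unpickled here
    msg = inbox.get()
    outbox.put({"reply": msg["value"] * 2})

if __name__ == "__main__":
    to_child, from_child = Queue(), Queue()
    p = Process(target=worker, args=(to_child, from_child))
    p.start()
    to_child.put({"value": 21})   # any picklable Python object
    print(from_child.get())       # {'reply': 42}
    p.join()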
If you are sharing large arrays, such as images, you will likely find that multiprocessing shared memory has the least overhead because there is no pickling involved. However, if you want to share such arrays with other machines across a network, shared memory will not work, so you may need to resort to Redis or some other technology. Generally, multiprocessing shared memory takes more setting up and requires you to do more to synchronise access, but it is more performant for larger data sets.
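A rough sketch of the shared-memory approach, assuming NumPy and using made-up names: one process creates a named block and builds an array on top of it, and another process attaches to the same block by name, with no pickling or copying of the pixel data.

import numpy as np
from multiprocessing import shared_memory

# producer: create a named block and fill it with an "image"
frame = np.zeros((1080, 1920, 3), dtype=np.uint8)
shm = shared_memory.SharedMemory(create=True, size=frame.nbytes, name="frame_buf")
shared = np.ndarray(frame.shape, dtype=frame.dtype, buffer=shm.buf)
shared[:] = frame[:]

# consumer (normally in another process): attach by name, no copy is made
existing = shared_memory.SharedMemory(name="frame_buf")
view = np.ndarray((1080, 1920, 3), dtype=np.uint8, buffer=existing.buf)
# ... read or process view in place, with your own locking around it ...
existing.close()

shm.close()
shm.unlink()   # the creating process releases the block when everyone is done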
If you are sharing between Python and C/C++ or another language, you may elect to use protocol buffers and pipes, or again Redis.
As I said, there are many trade-offs and opinions - far more than I have addressed here. The first thing, though, is to determine your needs in terms of bandwidth, latency and flexibility, and then think about the most appropriate technology.

Related

Fastest way to exchange data between C++ and Python?

I'm working on a project that is written in C++ and Python. The communication between the 2 sides is made through TCP sockets. Both processes run on the same machine.
The problem is that it is too slow for the current needs. What is the fastest way to exchange information between C++ and Python?
I heard about ZeroMQ, but would it be noticeably faster than plain TCP sockets?
Edit: The OS is Linux, and the data that should be transferred consists of multiple floats (let's say around 100 numbers) every 0.02 s, both ways. So 50 times per second, the Python code sends 100 float numbers to C++, and the C++ code then responds with 100 float numbers.
If performance is the only metric you care about, shared memory is going to be the fastest way to share data between two processes running on the same machine. You can use a semaphore in shared memory for synchronization.
TCP sockets will work as well, and are probably fast enough. But since you are on Linux, I would just use pipes: they are the simplest approach and they will outperform TCP sockets. This should get you started: http://man7.org/linux/man-pages/man2/pipe.2.html
For more background information, I recommend Advanced Programming in the UNIX Environment.
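On the Python side, one simple way to use pipes is to run the C++ program as a child process and exchange fixed-size packed records over its stdin/stdout. This is only a sketch; the ./engine binary and the 100-double record format are assumptions taken from the question.

import struct
import subprocess

N = 100
fmt = "%dd" % N                    # one record = 100 doubles
record_size = struct.calcsize(fmt)

proc = subprocess.Popen(["./engine"], stdin=subprocess.PIPE, stdout=subprocess.PIPE)

outgoing = [0.0] * N
proc.stdin.write(struct.pack(fmt, *outgoing))   # Python -> C++
proc.stdin.flush()

raw = proc.stdout.read(record_size)             # C++ -> Python
incoming = struct.unpack(fmt, raw)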
If you're on the same machine, use named shared memory; it's a very fast option. In Python you have multiprocessing.shared_memory, and in C++ you can use POSIX shared memory, since you're on Linux.
Short answer: no, but ZeroMQ can have other advantages. Let's get straight to it: if you are on Linux and want fast data transfer, you go with shared memory. But it will not be as easy as with ZeroMQ.
That is because ZeroMQ is a message queue. It solves (and solves well) different problems. It is able to use IPC between C++ and Python, which can be noticeably faster than using sockets (for the same usages), and it gives you a window for network features in your future developments. It is reliable and quite easy to use, with the same API in Python and C++. It is often used with Protobuf to serialize and send data, even for high throughput.
The first problem with IPC in ZeroMQ is that it lacks Windows support, because Windows is not a POSIX-compliant system. But the biggest problem may not be there: ZeroMQ is slow because it wraps your message in an envelope. You can enjoy the benefits of that, but it can hurt performance. The best way to check is, as always, to test it yourself with IPC-BENCH, as I am not sure the benchmark I provided in the previous link was using IPC. The average gain of IPC over local TCP is not fantastic, though.
As I previously said, I am quite sure shared memory will be the fastest, apart from one last possibility: developing your own C++ wrapper for Python. I would bet that is the fastest solution, but it will require a bit of engineering to multithread if needed, because both C++ and Python would run in the same process. And of course you would need to adjust the existing C++ code if it is already written.
And as usual, remember that optimization always happens in a context. If data transfer is only a small fraction of the running time compared to the processing you do afterwards, or if you can afford to wait the 0.00001 s that shared memory would save you, it might be worth going directly to ZeroMQ, because it will be simpler, more scalable and more reliable.
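If ZeroMQ turns out to be good enough, the Python side of a minimal REQ/REP exchange over the ipc transport looks roughly like this (the socket path and the flat 100-double message layout are placeholders; the C++ side would bind a REP socket with the same API):

import struct
import zmq

ctx = zmq.Context()
sock = ctx.socket(zmq.REQ)
sock.connect("ipc:///tmp/engine.sock")      # C++ process binds a REP socket here

values = [0.0] * 100
sock.send(struct.pack("100d", *values))     # request: 100 doubles
reply = struct.unpack("100d", sock.recv())  # reply: 100 doubles back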

How to share objects and data between python processes in real-time?

I'm trying to find a reasonable approach in Python for a real-time application that uses multiprocessing and large files.
A parent process spawns 2 or more children. The first child reads data and keeps it in memory, and the others process it in a pipeline fashion. The data should be organized into an object, sent to the following process, processed, sent, processed and so on.
Available methodologies such as Pipe, Queue and Managers seem inadequate due to overheads (serialization, etc.).
Is there an adequate approach for this?
I've used Celery and Redis for real-time multiprocessing in high memory applications, but it really depends on what you're trying to accomplish.
The biggest benefits I've found in Celery over built-in multiprocessing tools (Pipe/Queue) are:
Low overhead. You call a function directly, no need to serialize data.
Scaling. Need to ramp up worker processes? Just add more workers.
Transparency. Easy to inspect tasks/workers and find bottlenecks.
For really squeezing out performance, ZMQ is my go to. A lot more work to set up and fine-tune, but it's as close to bare sockets as you can safely get.
Disclaimer: This is all anecdotal. It really comes down to what your specific needs are. I'd benchmark different options with sample data before you go down any path.
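For reference, a minimal Celery setup looks something like this (assuming a local Redis broker; the task itself is made up). Workers are started separately with something like celery -A tasks worker.

# tasks.py
from celery import Celery

app = Celery("tasks",
             broker="redis://localhost:6379/0",
             backend="redis://localhost:6379/0")

@app.task
def process_chunk(chunk):
    # real work would go here
    return sum(chunk)

# in the caller: the call returns immediately, and the result comes back
# through the backend once a worker has finished
# result = process_chunk.delay([1, 2, 3]).get()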
First, a suspicion that message-passing may be inadequate because of all the overhead is not a good reason to overcomplicate your program. It's a good reason to build a proof of concept and come up with some sample data and start testing. If you're spending 80% of your time pickling things or pushing stuff through queues, then yes, that's probably going to be a problem in your real life code—assuming the amount of work your proof of concept does is reasonably comparable to your real code. But if you're spending 98% of your time doing the real work, then there is no problem to solve. Message passing will be simpler, so just use it.
Also, even if you do identify a problem here, that doesn't mean that you have to abandon message passing; it may just be a problem with what's built in to multiprocessing. Technologies like 0MQ and Celery may have lower overhead than a simple queue. Even being more careful about what you send over the queue can make a huge difference.
But if message passing is out, the obvious alternative is data sharing. This is explained pretty well in the multiprocessing docs, along with the pros and cons of each.
Sharing state between processes describes the basics of how to do it. There are other alternatives, like using mmapped files or platform-specific shared memory APIs, but there's not much reason to do that over multiprocessing unless you need, e.g., persistent storage between runs.
There are two big problems to deal with, but both can be dealt with.
First, you can't share Python objects, only simple values. Python objects have internal references to each other all over the place, the garbage collector can't see references to objects in other processes' heaps, and so on. So multiprocessing.Value can only hold the same basic kinds of native values as array.array, and multiprocessing.Array can hold (as you'd guess by the name) 1D arrays of the same values, and that's it. For anything more complicated, if you can define it in terms of a ctypes.Structure, you can use multiprocessing.sharedctypes (https://docs.python.org/3/library/multiprocessing.html#module-multiprocessing.sharedctypes), but this still means that any references between objects have to be indirect. (For example, you often have to store indices into an array.) (Of course none of this is bad news if you're using NumPy, because you're probably already storing most of your data in NumPy arrays of simple values, which are sharable.)
Second, shared data are of course subject to race conditions. And, unlike multithreading within a single process, you can't rely on the GIL to help protect you here; there are multiple interpreters that can all be trying to modify the same data at the same time. So you have to use locks or conditions to protect things.
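A small sketch of both points, using multiprocessing.Value and multiprocessing.Array with an explicit lock (the names and sizes are arbitrary):

from multiprocessing import Process, Value, Array, Lock

def bump(counter, samples, lock):
    with lock:                  # read-modify-write needs explicit protection
        counter.value += 1
    samples[0] = 3.14           # plain C doubles, not Python objects

if __name__ == "__main__":
    lock = Lock()
    counter = Value("i", 0)     # a single C int
    samples = Array("d", 10)    # a flat 1D array of C doubles
    procs = [Process(target=bump, args=(counter, samples, lock)) for _ in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    print(counter.value, samples[0])   # 4 3.14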
For multiprocessing pipeline check out MPipe.
For shared memory (specifically NumPy arrays) check out numpy-sharedmem.
I've used these to do high-performance realtime, parallel image processing (average accumulation and face detection using OpenCV) while squeezing out all available resources from a multi-core CPU system. Check out Sherlock if interested. Hope this helps.
One option is to use something like brain-plasma that maintains a shared-memory object namespace that is independent of the Python process or thread. Kind of like Redis but can be used with big objects and has a simple API, built on top of Apache Arrow.
$ pip install brain-plasma
# process 1
from brain_plasma import Brain
brain = Brain()
brain['myvar'] = 657
# process 2
from brain_plasma import Brain
brain = Brain()
brain['myvar']
# >>> 657
Python 3.8 now offers shared memory access between processes via multiprocessing.shared_memory. All you hand off between processes is a string naming the shared memory block. In the consuming process you get a memoryview object, which supports slicing without copying the data the way byte arrays do. If you are using NumPy, it can reference the memory block in an O(1) operation, allowing fast transfers of large blocks of numeric data. As far as I understand, generic objects still need to be deserialized, since a raw byte array is what the consuming process receives.
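As a small illustration of that hand-off (the block name here is hypothetical and would have been created by the producing process):

from multiprocessing import shared_memory

# consuming process: all it received from the producer is the name string
shm = shared_memory.SharedMemory(name="psm_demo")
header = shm.buf[:16]    # slicing the memoryview does not copy the bytes
# ... interpret the header/payload in place ...
del header               # release the exported view before closing
shm.close()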

Is the Multi interface in curl faster or more efficient than using multiple easy interfaces?

I am making something which involves pycurl. Since pycurl depends on libcurl, I was reading through its documentation and came across the Multi interface, where you can perform several transfers using a single multi object. I was wondering whether this is faster or more memory-efficient than having multiple easy interfaces? What is the advantage of this approach, since the site barely says,
"Enable multiple simultaneous transfers in the same thread without making it complicated for the application."
You are trying to optimize something that doesn't matter at all.
If you want to download 200 URLs as fast as possible, you are going to spend 99.99% of your time waiting for those 200 requests, limited by your network and/or the server(s) you're downloading from. The key to optimizing that is to make the right number of concurrent requests. Anything you can do to cut down the last 0.01% will have no visible effect on your program. (See Amdahl's Law.)
Different sources give different guidelines, but typically it's somewhere between 6-12 concurrent requests, with no more than 2-4 to the same server. Since you're pulling them all from Google, I'd suggest starting with 4 concurrent requests, then, if that's not fast enough, tweaking that number until you get the best results.
As for space, the cost of storing 200 pages is going to far outstrip the cost of a few dozen bytes here and there for overhead. Again, what you want to optimize is those 200 pages—by storing them to disk instead of in memory, by parsing them as they come in instead of downloading everything and then parsing everything, etc.
Anyway, instead of looking at what command-line tools you have and trying to find a library that's similar to those, look for libraries directly. pycurl can be useful in some cases, e.g., when you're trying to do something complicated and you already know how to do it with libcurl, but in general, it's going to be a lot easier to use either stdlib modules like urllib or third-party modules designed to be as simple as possible like requests.
The main example for ThreadPoolExecutor in the docs shows how to do exactly what you want to do. (If you're using Python 2.x, you'll have to pip install futures to get the backport for ThreadPoolExecutor, and use urllib2 instead of urllib.request, but otherwise the code will be identical.)
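That docs example boils down to something like the following sketch (the URLs and worker count are placeholders):

import concurrent.futures
import urllib.request

URLS = ["http://example.com/page%d" % i for i in range(200)]

def fetch(url):
    with urllib.request.urlopen(url, timeout=30) as conn:
        return conn.read()

with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
    futures = {executor.submit(fetch, url): url for url in URLS}
    for future in concurrent.futures.as_completed(futures):
        page = future.result()   # parse or store each page as it arrives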
Having multiple easy interfaces running concurrently in the same thread means building your own reactor and driving curl at a lower level. That's painful in C, and just as painful in Python, which is why libcurl offers, and recommends, multi.
But that "in the same thread" is key here. You can also create a pool of threads and throw the easy instances into that. In C, that can still be painful; in Python, it's dead simple. In fact, the first example in the docs for using a concurrent.futures.ThreadPoolExecutor does something similar, but actually more complicated than you need here, and it's still just a few lines of code.
If you're comparing multi vs. easy with a manual reactor, the simplicity is the main benefit. In C, you could easily implement a more efficient reactor than the one libcurl uses; in Python, that may or may not be true. But in either language, the performance cost of switching among a handful of network requests is going to be so tiny compared to everything else you're doing—especially waiting for those network requests—that it's unlikely to ever matter.
If you're comparing multi vs. easy with a thread pool, then a reactor can definitely outperform threads (except on platforms where you can tie a thread pool to a proactor, as with Windows I/O completion ports), especially for huge numbers of concurrent connections. Also, each thread needs its own stack, which typically means about 1MB of memory pages allocated (although not all of them used), which can be a serious problem in 32-bit land for huge numbers of connections. That's why very few serious servers use threads for connections. But in a client making a handful of connections, none of this matters; again, the costs incurred by wasting 8 threads vs. using a reactor will be so small compared to the real costs of your program that they won't matter.

Why does python multiprocessing pickle objects to pass objects between processes?

Why does the multiprocessing package for Python pickle objects to pass them between processes, i.e. to return results from different processes to the main interpreter process? This may be an incredibly naive question, but why can't process A say to process B "object x is at point y in memory, it's yours now" without having to perform the operation necessary to represent the object as a string?
multiprocessing runs jobs in different processes. Processes have their own independent memory spaces, and in general cannot share data through memory.
To make processes communicate, you need some sort of channel. One possible channel would be a "shared memory segment", which pretty much is what it sounds like. But it's more common to use "serialization". I haven't studied this issue extensively but my guess is that the shared memory solution is too tightly coupled; serialization lets processes communicate without letting one process cause a fault in the other.
When data sets are really large, and speed is critical, shared memory segments may be the best way to go. The main example I can think of is video frame buffer image data (for example, passed from a user-mode driver to the kernel or vice versa).
http://en.wikipedia.org/wiki/Shared_memory
http://en.wikipedia.org/wiki/Serialization
Linux, and other *NIX operating systems, provide a built-in mechanism for sharing serialized data: Unix domain sockets. These should be quite fast.
http://en.wikipedia.org/wiki/Unix_domain_socket
Since Python has pickle, which works well for serialization, multiprocessing uses that. pickle is a fast, binary format; it should in general be more efficient than a serialization format like XML or JSON. There are other binary serialization formats, such as Google Protocol Buffers.
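To make that concrete, here is a tiny sketch of pickling an object and pushing it through a pair of connected Unix domain sockets; this is roughly what a multiprocessing connection does under the hood, minus the message framing:

import pickle
import socket

a, b = socket.socketpair()    # a connected pair of Unix domain sockets
payload = pickle.dumps({"frame": 42, "values": [1.0, 2.0, 3.0]})
a.sendall(payload)
print(pickle.loads(b.recv(4096)))   # {'frame': 42, 'values': [1.0, 2.0, 3.0]}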
One good thing about using serialization: it's about the same to share the work within one computer (to use additional cores) or to share the work between multiple computers (to use multiple computers in a cluster). The serialization work is identical, and network sockets work about like domain sockets.
EDIT: @Mike McKerns said, in a comment below, that multiprocessing can use shared memory sometimes. I did a Google search and found this great discussion of it: Python multiprocessing shared memory

Python multithreading best practices

I just recently read an article about the GIL (Global Interpreter Lock) in Python, which seems to be a big issue when it comes to Python performance. So I was wondering what the best practice would be to achieve more performance: threading or multiprocessing? I hear everybody say something different, so it would be nice to have one clear answer, or at least to know the pros and cons of multithreading versus multiprocessing.
Kind regards,
Dirk
It depends on the application, and on the python implementation that you are using.
In CPython (the reference implementation) and PyPy, the GIL allows only one thread at a time to execute Python bytecode. Other threads might be doing I/O or running extensions written in C.
It is worth noting that some other implementations, like IronPython and Jython, don't have a GIL.
A characteristic of threading is that all threads share the same interpreter and all the live objects, so threads can share global data almost without extra effort. You still need to use locking to serialize access to that data, though! Imagine what would happen if two threads tried to modify the same list.
Multiprocessing actually runs in different processes. That sidesteps the GIL, but if large amounts of data need to be shared between processes that data has to be pickled and transported to another process via IPC where it has to be unpickled again. The multiprocessing module can take care of the messy details for you, but it still adds overhead.
So if your program wants to run Python code in parallel but doesn't need to share huge amounts of data between instances (e.g. just filenames of files that need to be processed), multiprocessing is a good choice.
Currently multiprocessing is the only way that I'm aware of in the standard library to use all the cores of your CPU at the same time.
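For that CPU-bound, little-shared-data case, the usual pattern is a process pool that only passes small items (like filenames) and small results between processes; a minimal sketch with placeholder file names:

from multiprocessing import Pool

def process_file(path):
    # CPU-heavy, pure-Python work on one file; only the short path string
    # and the small result cross the process boundary
    with open(path, "rb") as f:
        return path, len(f.read())

if __name__ == "__main__":
    filenames = ["a.dat", "b.dat", "c.dat"]   # placeholder inputs
    with Pool() as pool:                      # one worker per CPU core by default
        for path, size in pool.map(process_file, filenames):
            print(path, size)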
On the other hand, if your tasks need to share a lot of data and most of the processing is done in extensions or is I/O-bound, threading would be a good choice.
