I'm working on a project that is written in C++ and Python. The communication between the two sides is done through TCP sockets. Both processes run on the same machine.
The problem is that it is too slow for the current needs. What is the fastest way to exchange information between C++ and Python?
I heard about ZeroMQ, but would it be noticeably faster than plain TCP sockets?
Edit: The OS is Linux, and the data to be transferred consists of multiple floats (let's say around 100 numbers) every 0.02 s, both ways. So 50 times per second, the Python code sends 100 floats to C++, and the C++ code then responds with 100 floats.
If performance is the only metric you care about, shared memory is going to be the fastest way to share data between two processes running on the same machine. You can use a semaphore in shared memory for synchronization.
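A minimal Python-side sketch of that approach (assumptions: the segment name demo_floats is just an example, and Python 3.8+ is needed for multiprocessing.shared_memory; the C++ side could map the same segment with shm_open). The semaphore still has to be layered on top, e.g. via a small C extension or a third-party module such as posix_ipc, since the standard library has no named POSIX semaphore:

    # Sketch only: create a segment, write 100 doubles, read them back.
    # In real use the C++ process would do one side of this.
    import struct
    from multiprocessing import shared_memory

    N = 100
    shm = shared_memory.SharedMemory(name="demo_floats", create=True, size=N * 8)
    try:
        struct.pack_into("%dd" % N, shm.buf, 0, *(float(i) for i in range(N)))
        values = struct.unpack_from("%dd" % N, shm.buf, 0)
        print(values[:5])
    finally:
        shm.close()
        shm.unlink()  # only the creating side should unlink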
TCP sockets will work as well, and are probably fast enough. Since you are using Linux, I would just use pipes: they are the simplest approach, and they will outperform TCP sockets. This should get you started: http://man7.org/linux/man-pages/man2/pipe.2.html
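A hedged sketch of that pattern in Python: anonymous pipes only connect related processes, so this forks a child (for two independent programs you would use a named pipe/FIFO instead):

    # Child writes 100 packed doubles, parent reads them (Linux/Unix only).
    import os
    import struct

    N = 100
    r, w = os.pipe()
    pid = os.fork()
    if pid == 0:                        # child: writer
        os.close(r)
        os.write(w, struct.pack("%dd" % N, *(float(i) for i in range(N))))
        os.close(w)
        os._exit(0)
    else:                               # parent: reader
        os.close(w)
        data = b""
        while len(data) < N * 8:        # reads may return short chunks
            data += os.read(r, N * 8 - len(data))
        os.close(r)
        os.waitpid(pid, 0)
        print(struct.unpack("%dd" % N, data)[:5])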
For more background information, I recommend Advanced Programming in the UNIX Environment.
If you're on the same machine, use named shared memory; it's a very fast option. In Python you have multiprocessing.shared_memory, and in C++ you can use POSIX shared memory, since you're on Linux.
Short answer: no, but ZeroMQ can have other advantages. Let's get straight to it: if you are on Linux and want fast data transfer, you go with shared memory. But it will not be as easy as with ZeroMQ.
Why? Because ZeroMQ is a message queue. It solves (and solves well) different problems. It can use IPC between C++ and Python, which can be noticeably faster than using sockets (for the same usage), and it leaves the door open for network features in your future development. It is reliable and quite easy to use, with the same API in Python and C++. It is often used with Protobuf to serialize and send data, even at high throughput.
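For reference, here is a minimal pyzmq sketch of the 100-floats-each-way exchange over the ipc:// transport. The endpoint path and the REQ/REP pattern are just example choices; the C++ side would bind a REP socket on the same endpoint:

    # Python side: send 100 doubles, receive 100 doubles back.
    import struct
    import zmq

    N = 100
    ctx = zmq.Context()
    sock = ctx.socket(zmq.REQ)
    sock.connect("ipc:///tmp/cpp_py.sock")   # example endpoint

    sock.send(struct.pack("%dd" % N, *(float(i) for i in range(N))))
    reply = struct.unpack("%dd" % N, sock.recv())
    print(reply[:5])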
The first problem with ZeroMQ's IPC transport is that it lacks Windows support, because Windows is not a POSIX-compliant system. But the biggest problem may lie elsewhere: ZeroMQ is slower because it wraps your data in message envelopes. You get the benefits that come with that, but it can hurt performance. The best way to check this is, as always, to test it yourself with IPC-BENCH, as I am not sure the benchmark I provided in the previous link was using the IPC transport. The average gain of IPC over local TCP is not fantastic, though.
As I said above, I am quite sure shared memory will be the fastest, excluding one last possibility: writing your own Python bindings for the C++ code. I would bet that is the fastest solution, but it will require a bit of engineering to multithread if needed, because both C++ and Python will run in the same process. And of course you will need to adjust the existing C++ code if it has already been started.
And as usual, remember that optimization always happens in a context. If data transfer is only a fraction of the running time compared to the processing you do afterwards, or if you can afford to lose the 0.00001 s that shared memory would gain you, it may be worth going directly to ZeroMQ, because it will be simpler, more scalable and more reliable.
I want to run two Python processes in parallel, with each being able to send data to and receive data from the other at any time. Python's multiprocessing package seems to have multiple solutions to this, such as Queue, Pipe, and SharedMemory. What are the pros and cons of using each of these, and which one would be best for accomplishing this specific goal?
It comes down to what you want to share, who you want to share it with, how often you want to share it, what your latency requirements are, your skill set, your maintainability needs, and your preferences. Then there are the usual tradeoffs to be made between performance, legibility, upgradeability and so on.
If you are sharing native Python objects, they will generally be most simply shared via a "multiprocessing queue" because they will be packaged up before transmission and unpackaged on receipt.
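For example, here is a minimal sketch where a dict is pickled on put() and unpickled on get():

    from multiprocessing import Process, Queue

    def worker(q):
        q.put({"status": "done", "values": [1.0, 2.0, 3.0]})  # pickled here

    if __name__ == "__main__":
        q = Queue()
        p = Process(target=worker, args=(q,))
        p.start()
        print(q.get())   # unpickled back into a dict
        p.join()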
If you are sharing large arrays, such as images, you will likely find that "multiprocessing shared memory" has least overhead because there is no pickling involved. However, if you want to share such arrays with other machines across a network, shared memory will not work, so you may need to resort to Redis or some other technology. Generally, "multiprocessing shared memory" takes more setting up, and requires you to do more to synchronise access, but is more performant for larger data-sets.
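A sketch of the zero-copy idea, assuming NumPy is available (the shape and dtype are arbitrary examples): the array is a view directly over the shared-memory buffer, so attaching from another process involves no pickling:

    import numpy as np
    from multiprocessing import shared_memory

    shape, dtype = (1000, 1000), np.float64
    shm = shared_memory.SharedMemory(create=True, size=int(np.prod(shape)) * 8)
    try:
        img = np.ndarray(shape, dtype=dtype, buffer=shm.buf)
        img[:] = 0.0   # writes land directly in shared memory
        # Another process would attach by name and build the same view:
        #   shm2 = shared_memory.SharedMemory(name=shm.name)
        #   img2 = np.ndarray(shape, dtype=np.float64, buffer=shm2.buf)
    finally:
        shm.close()
        shm.unlink()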
If you are sharing between Python and C/C++ or another language, you may elect to use protocol buffers and pipes, or again Redis.
As I said, there are many tradeoffs and opinions - far more than I have addressed here. The first thing though is to determine your needs in terms of bandwidth, latency, flexibility and then think about the most appropriate technology.
I have two code bases, one in Python, one in C++. I want to share real-time data between them. I am trying to evaluate which option will work best for my specific use case:
many small data updates from the C++ program to the Python program
they both run on the same machine
reliability is important
low latency is nice to have
I can see a few options:
One process writes to a flat file, the other process reads it. It is not scalable, slow, and prone to I/O errors.
One process writes to a database, the other process reads it. This makes it more scalable, slightly less error prone, but still very slow.
Embed my Python program into the C++ one or the other way around. I rejected that solution because both code bases are reasonably complex, and I preferred to keep them separated for maintainability reasons.
I use some sockets in both programs, and send messages directly. This seems to be a reasonable approach, but it does not leverage the fact that they are on the same machine (it is optimized slightly by using localhost as the destination, but it still feels cumbersome).
Use shared memory. So far I think this is the most satisfying solution I have found, but it has the drawback of being slightly more complex to implement.
Are there other solutions I should consider?
First of all, this question is highly opinion-based!
The cleanest way would be to run them in the same process and have them communicate directly. The only complexity is implementing a proper API and the C++ -> Python calls. Drawbacks are maintainability, as you noted, potentially lower robustness (both crash together, though that's not a problem in most cases), and lower flexibility (are you sure you'll never need to run them on different machines?). Extensibility is the best, as it's very simple to add more communication or change what exists. You might also reconsider the maintainability point: can your Python app be used without its C++ counterpart? If not, I wouldn't worry about maintainability so much.
Then shared memory is the next choice, with better maintainability but the same other drawbacks. Extensibility is a little worse but still not bad. It can be complicated; I don't know Python's support for shared-memory operations, but for C++ you can have a look at Boost.Interprocess. The main question I'd check first is synchronisation between processes.
Then there's network communication. There are lots of choices here, from the simplest possible binary protocol implemented at the socket level (as sketched below) to the higher-level options mentioned in the comments. It depends on how complex your C++ <-> Python communication is now and may become in the future. This approach can be more complicated to implement and can require 3rd-party libraries, but once done it's extensible and flexible. The usual 3rd-party libraries are based on code generation (Thrift, Protobuf), which doesn't simplify your build process.
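To give a feel for the simplest variant, here is a hedged Python-side sketch: fixed-size frames of 100 doubles read from a Unix domain socket (the socket path and frame layout are example assumptions; the C++ writer would use the same layout):

    import socket
    import struct

    FRAME = struct.Struct("100d")        # 800 bytes per update

    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    sock.connect("/tmp/updates.sock")    # example path

    buf = b""
    while True:
        chunk = sock.recv(4096)
        if not chunk:
            break                        # writer closed the connection
        buf += chunk
        while len(buf) >= FRAME.size:    # peel off complete frames
            values = FRAME.unpack_from(buf)
            buf = buf[FRAME.size:]
            # ... hand `values` to the rest of the program ...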
I wouldn't seriously consider file system or database for this case.
I am making something which involves pycurl. Since pycurl depends on libcurl, I was reading through libcurl's documentation and came across the multi interface, where you can perform several transfers using a single multi object. I was wondering if this is faster or more memory-efficient than having multiple easy interfaces, and what the advantage of this approach is, since the site says only:
"Enable multiple simultaneous transfers in the same thread without making it complicated for the application."
You are trying to optimize something that doesn't matter at all.
If you want to download 200 URLs as fast as possible, you are going to spend 99.99% of your time waiting for those 200 requests, limited by your network and/or the server(s) you're downloading from. The key to optimizing that is to make the right number of concurrent requests. Anything you can do to cut down the last 0.01% will have no visible effect on your program. (See Amdahl's Law.)
Different sources give different guidelines, but typically it's somewhere between 6-12 requests, with no more than 2-4 to the same server. Since you're pulling them all from Google, I'd suggest starting with 4 concurrent requests and then, if that's not fast enough, tweaking that number until you get the best results.
As for space, the cost of storing 200 pages is going to far outstrip the cost of a few dozen bytes here and there for overhead. Again, what you want to optimize is those 200 pages—by storing them to disk instead of in memory, by parsing them as they come in instead of downloading everything and then parsing everything, etc.
Anyway, instead of looking at what command-line tools you have and trying to find a library that's similar to those, look for libraries directly. pycurl can be useful in some cases, e.g., when you're trying to do something complicated and you already know how to do it with libcurl, but in general, it's going to be a lot easier to use either stdlib modules like urllib or third-party modules designed to be as simple as possible like requests.
The main example for ThreadPoolExecutor in the docs shows how to do exactly what you want to do. (If you're using Python 2.x, you'll have to pip install futures to get the backport for ThreadPoolExecutor, and use urllib2 instead of urllib.request, but otherwise the code will be identical.)
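A hedged sketch in the spirit of that docs example (the URLs are placeholders):

    import urllib.request
    from concurrent.futures import ThreadPoolExecutor, as_completed

    urls = ["http://example.com/page%d" % i for i in range(200)]  # placeholders

    def fetch(url):
        with urllib.request.urlopen(url, timeout=30) as resp:
            return url, resp.read()

    with ThreadPoolExecutor(max_workers=4) as pool:  # ~4 concurrent requests
        futures = [pool.submit(fetch, u) for u in urls]
        for fut in as_completed(futures):
            url, body = fut.result()
            print(url, len(body))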
Having multiple easy interfaces running concurrently in the same thread means building your own reactor and driving curl at a lower level. That's painful in C, and just as painful in Python, which is why libcurl offers, and recommends, multi.
But that "in the same thread" is key here. You can also create a pool of threads and throw the easy instances into that. In C, that can still be painful; in Python, it's dead simple. In fact, the first example in the docs for using a concurrent.futures.ThreadPoolExecutor does something similar, but actually more complicated than you need here, and it's still just a few lines of code.
If you're comparing multi vs. easy with a manual reactor, the simplicity is the main benefit. In C, you could easily implement a more efficient reactor than the one libcurl uses; in Python, that may or may not be true. But in either language, the performance cost of switching among a handful of network requests is going to be so tiny compared to everything else you're doing—especially waiting for those network requests—that it's unlikely to ever matter.
If you're comparing multi vs. easy with a thread pool, then a reactor can definitely outperform threads (except on platforms where you can tie a thread pool to a proactor, as with Windows I/O completion ports), especially for huge numbers of concurrent connections. Also, each thread needs its own stack, which typically means about 1MB of memory pages allocated (although not all of them used), which can be a serious problem in 32-bit land for huge numbers of connections. That's why very few serious servers use threads for connections. But in a client making a handful of connections, none of this matters; again, the costs incurred by wasting 8 threads vs. using a reactor will be so small compared to the real costs of your program that they won't matter.
I'm working on a simple experiment in Python. I have a "master" process, in charge of all the others, and every single process has a connection via unix socket to the master process. I would like the master process to be able to monitor all of the sockets for a response, but there could theoretically be almost a hundred of them. How would threads impact the memory and performance of the application? What would be the best solution? Thanks a lot!
One hundred simultaneous threads might be pushing the reasonable limits of threading. If you find this is the cleanest way to organize your code, I'd say give it a try, but threading really doesn't scale very far.
What works better is to use a technique like select to wait for one of the sockets to become readable, writable, or to have an error to report. This mechanism lets you go to sleep until something interesting happens, handle as many sockets as have content to handle, and then go back to sleep again, all in a single thread of execution. Removing the multi-threading can often reduce the chances for errors, and this style of programming should get you into the hundreds of connections with no trouble. (If you want to go beyond about 100, I'd use the poll functionality instead of select -- constantly rebuilding the list of interesting file descriptors takes time that poll does not require.)
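A hedged sketch of that single-threaded loop (socket setup omitted; client_sockets is your list of connections and handle_message is a hypothetical handler):

    import select

    def serve(client_sockets):
        while client_sockets:
            readable, _, errored = select.select(client_sockets, [],
                                                 client_sockets)
            for sock in readable:
                data = sock.recv(4096)
                if data:
                    handle_message(sock, data)   # hypothetical handler
                else:                            # peer closed the connection
                    client_sockets.remove(sock)
                    sock.close()
            for sock in errored:
                if sock in client_sockets:
                    client_sockets.remove(sock)
                sock.close()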
Something to consider is the Python Twisted Framework. They've gone to some length to provide a consistent way to hook callbacks onto events for this exact sort of programming. (If you're familiar with node.js, it's a bit like that, but Python.) I must admit a slight aversion to Twisted -- I never got very far in their documentation without being utterly baffled -- but a lot of people made it further in the docs than I did. You might find it a better fit than I have.
The easiest way to conduct comparative tests of threads versus processes for socket handling is to use the SocketServer in Python's standard library. You can easily switch approaches (while keeping everything else the same) by inheriting from either ThreadingMixIn or ForkingMixIn. Here is a simple example to get you started.
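For instance, switching approaches is a one-line change (a sketch using the Python 3 module name, socketserver; the echo handler is just an example):

    import socketserver

    class EchoHandler(socketserver.StreamRequestHandler):
        def handle(self):
            for line in self.rfile:        # one handler per connection
                self.wfile.write(line)

    # Swap ThreadingMixIn for ForkingMixIn to compare threads vs processes.
    class ThreadedServer(socketserver.ThreadingMixIn, socketserver.TCPServer):
        pass

    if __name__ == "__main__":
        with ThreadedServer(("127.0.0.1", 9999), EchoHandler) as server:
            server.serve_forever()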
Another alternative is a select/poll approach using non-blocking sockets in a single process and a single thread.
If you're interested in software that is already fully developed and highly evolved, consider these high-performance Python based server packages:
The Twisted framework uses the async single process, single thread style.
The Tornado framework is similar (less evolved, less full featured, but easier to understand)
And Gunicorn, which is a high-performance forking server.
I was recently reading this document which lists a number of strategies that could be employed to implement a socket server. Namely, they are:
Serve many clients with each thread, and use nonblocking I/O and level-triggered readiness notification
Serve many clients with each thread, and use nonblocking I/O and readiness change notification
Serve many clients with each server thread, and use asynchronous I/O
Serve one client with each server thread, and use blocking I/O
Build the server code into the kernel
Now, I would appreciate a hint on which should be used in CPython, which we know has some good points, and some bad points. I am mostly interested in performance under high concurrency, and yes a number of the current implementations are too slow.
So if I may start with the easy one, "5" is out, as I am not going to be hacking anything into the kernel.
"4" Also looks like it must be out because of the GIL. Of course, you could use multiprocessing in place of threads here, and that does give a significant boost. Blocking IO also has the advantage of being easier to understand.
And here my knowledge wanes a bit:
"1" is traditional select or poll which could be trivially combined with multiprocessing.
"2" is the readiness-change notification, used by the newer epoll and kqueue
"3" I am not sure there are any kernel implementations for this that have Python wrappers.
So, in Python we have a bag of great tools like Twisted. Perhaps they are a better approach, though I have benchmarked Twisted and found it too slow on a multi-processor machine. Perhaps running four Twisted instances behind a load balancer would do it; I don't know. Any advice would be appreciated.
asyncore is basically "1" - it uses select internally, and you just have one thread handling all requests. According to the docs it can also use poll. (EDIT: removed the Twisted reference; I thought it used asyncore, but I was wrong.)
"2" might be implemented with python-epoll (Just googled it - never seen it before).
EDIT: (from the comments) In Python 2.6 the select module has epoll, kqueue and kevent built-in (on supported platforms). So you don't need any external libraries to do edge-triggered serving.
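A hedged sketch of the built-in interface (level-triggered by default; OR select.EPOLLET into the mask for edge-triggered readiness-change notification, i.e. strategy 2):

    # Minimal level-triggered epoll echo loop (Linux only).
    import select
    import socket

    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind(("127.0.0.1", 8888))
    server.listen(128)
    server.setblocking(False)

    ep = select.epoll()
    ep.register(server.fileno(), select.EPOLLIN)
    conns = {}

    while True:
        for fd, event in ep.poll():
            if fd == server.fileno():            # new connection
                conn, _ = server.accept()
                conn.setblocking(False)
                conns[conn.fileno()] = conn
                ep.register(conn.fileno(), select.EPOLLIN)
            elif event & select.EPOLLIN:
                conn = conns[fd]
                data = conn.recv(4096)
                if data:
                    conn.send(data)              # naive echo
                else:                            # peer closed
                    ep.unregister(fd)
                    conn.close()
                    del conns[fd]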
Don't rule out "4", as the GIL will be dropped when a thread is actually doing or waiting for I/O operations (which is probably most of the time). It doesn't make sense if you've got huge numbers of connections, of course. If you've got lots of processing to do, then Python may not make sense with any of these schemes.
For flexibility maybe look at Twisted?
In practice your problem boils down to how much processing you are going to do for requests. If you've got a lot of processing, and need to take advantage of multi-core parallel operation, then you'll probably need multiple processes. On the other hand if you just need to listen on lots of connections, then select or epoll, with a small number of threads should work.
How about "fork"? (I assume that is what the ForkingMixIn does) If the requests are handled in a "shared nothing" (other than DB or file system) architecture, fork() starts pretty quickly on most *nixes, and you don't have to worry about all the silly bugs and complications from threading.
Threads are a design illness forced on us by OSes with too-heavy-weight processes, IMHO. Cloning a page table with copy-on-write attributes seems a small price, especially if you are running an interpreter anyway.
Sorry I can't be more specific, but I'm more of a Perl-transitioning-to-Ruby programmer (when I'm not slaving over masses of Java at work).
Update: I finally did some timings on thread vs fork in my "spare time". Check it out:
http://roboprogs.com/devel/2009.04.html
Expanded:
http://roboprogs.com/devel/2009.12.html
One solution is gevent. gevent marries libevent-based event polling with lightweight cooperative task switching implemented by greenlet.
What you get is all the performance and scalability of an event system with the elegance and straightforward model of blocking I/O programming.
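A small sketch of that model (modern gevent uses libev rather than libevent, but the idea is the same): handlers are written in plain blocking style, and each connection runs in its own greenlet:

    from gevent.server import StreamServer

    def handle(sock, address):
        # recv() "blocks" only this greenlet; other connections keep running.
        while True:
            data = sock.recv(4096)
            if not data:
                break
            sock.sendall(data)

    if __name__ == "__main__":
        StreamServer(("127.0.0.1", 7777), handle).serve_forever()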
(I don't know what the SO convention on answering really old questions is, but I decided to add my 2 cents anyway.)
Can I suggest additional links?
cogen is a cross-platform library for network-oriented, coroutine-based programming using the enhanced generators from Python 2.5. On the main page of the cogen project there are links to several projects with a similar purpose.
I like Douglas' answer, but as an aside...
You could use a centralized dispatch thread/process that listens for readiness notifications using select and delegates to a pool of worker threads/processes to help accomplish your parallelism goals.
As Douglas mentioned, however, the GIL won't be held during most lengthy I/O operations (since no Python-API things are happening), so if it's response latency you're concerned about, you can try moving the critical portions of your code into a C extension via the CPython API.
http://docs.python.org/library/socketserver.html#asynchronous-mixins
As for multi-processor (multi-core) machines: with CPython, due to the GIL, you'll need at least one process per core to scale. Since you say you need CPython, you might try benchmarking that with ForkingMixIn. On Linux 2.6 it might give some interesting results.
Another way is to use Stackless Python. That's how EVE solved it. But I understand that it's not always possible.