Sharing information between Python code and C++ code (IPC) - python

I have 2 code bases, one in Python, one in C++. I want to share real-time data between them. I am trying to evaluate which option will work best for my specific use case:
many small data updates from the C++ program to the python program
they both run on the same machine
reliability is important
low latency is nice to have
I can see a few options:
One process writes to a flat file, the other process reads it. It is non-scalable, slow, and prone to I/O errors.
One process writes to a database, the other process reads it. This makes it more scalable and slightly less error-prone, but still very slow.
Embed my Python program into the C++ one, or the other way round. I rejected that solution because both code bases are reasonably complex, and I preferred to keep them separated for maintainability reasons.
Use sockets in both programs and send messages directly. This seems a reasonable approach, but it does not leverage the fact that they are on the same machine (it can be optimized slightly by using localhost as the destination, but it still feels cumbersome).
Use shared memory. So far I think this is the most satisfying solution I have found, but it has the drawback of being slightly more complex to implement.
Are there other solutions I should consider?

First of all, this question is highly opinion-based!
The cleanest way would be to run them in the same process and have them communicate directly. The only complexity is implementing a proper API and the C++ -> Python calls. Drawbacks are maintainability, as you noted, potentially lower robustness (both crash together; not a problem in most cases), and lower flexibility (are you sure you'll never need to run them on different machines?). Extensibility is the best, as it's very simple to add more communication or to change what exists. You could also reconsider the maintainability point: can your Python app be used without its C++ counterpart? If not, I wouldn't worry about maintainability so much.
Then shared memory is the next choice, with better maintainability but the same other drawbacks. Extensibility is a little worse, but still not bad. It can be complicated to get right; I don't know Python's support for shared-memory operations off-hand, but for C++ you can have a look at Boost.Interprocess. The main question I'd check first is synchronisation between the processes.
Then, network communication. There are lots of choices here, from the simplest possible binary protocol implemented at the socket level to the higher-level options mentioned in the comments. It depends on how complex your C++ <-> Python communication is now and may become in the future. This approach can be more complicated to implement and can require third-party libraries, but once done it's extensible and flexible. The usual third-party options are based on code generation (Thrift, Protobuf), which doesn't simplify your build process.
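To make the socket-level option concrete, here is a minimal sketch of the Python side of a length-prefixed binary protocol. The port and the packed-doubles payload format are assumptions for illustration; the C++ sender would mirror the same framing with send()/recv().

import socket
import struct

HOST, PORT = "127.0.0.1", 5555  # arbitrary values for this sketch

def recv_exact(sock, n):
    # recv() may return short reads, so loop until n bytes arrive.
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed the connection")
        buf += chunk
    return buf

with socket.create_connection((HOST, PORT)) as sock:
    while True:
        # Each update is a 4-byte big-endian length, then the payload.
        (length,) = struct.unpack("!I", recv_exact(sock, 4))
        payload = recv_exact(sock, length)
        values = struct.unpack(f"!{len(payload) // 8}d", payload)
        print(values)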
I wouldn't seriously consider file system or database for this case.

Related

Fastest way to exchange data between C++ and Python?

I'm working on a project that is written in C++ and Python. The communication between the two sides is done through TCP sockets. Both processes run on the same machine.
The problem is that it is too slow for the current needs. What is the fastest way to exchange information between C++ and Python?
I heard about ZeroMQ, but would it be noticeably faster than plain TCP sockets?
Edit: the OS is Linux, and the data to be transferred consists of multiple floats (let's say around 100 numbers) every 0.02 s, both ways. So 50 times per second, the Python code sends 100 float numbers to C++, and the C++ code then responds with 100 float numbers.
If performance is the only metric you care about, shared memory is going to be the fastest way to share data between two processes running on the same machine. You can use a semaphore in shared memory for synchronization.
TCP sockets will work as well, and are probably fast enough. But since you are using Linux, I would just use pipes; it is the simplest approach, and they will outperform TCP sockets. This should get you started: http://man7.org/linux/man-pages/man2/pipe.2.html
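One caveat: pipe(2) itself connects related processes; between two independently started programs the equivalent is a named pipe (FIFO). A minimal sketch of the Python reader side, where the FIFO path and the 100-float update format are assumptions for this example:

import os
import struct

FIFO_PATH = "/tmp/cxx_to_py.fifo"  # hypothetical path for this sketch

# Create the FIFO once; the C++ writer opens the same path with open(2).
if not os.path.exists(FIFO_PATH):
    os.mkfifo(FIFO_PATH)

# Opening for reading blocks until a writer connects.
with open(FIFO_PATH, "rb") as fifo:
    while True:
        raw = fifo.read(100 * 4)  # one update = 100 single-precision floats
        if len(raw) < 100 * 4:
            break  # the writer closed its end
        values = struct.unpack("<100f", raw)
        print(values)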
For more background information, I recommend Advanced Programming in the UNIX Environment.
If you're on the same machine, use named shared memory; it's a very fast option. In Python you have multiprocessing.shared_memory, and in C++ you can use POSIX shared memory, since you're on Linux.
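A minimal sketch of the Python reader side, assuming the C++ program creates a POSIX segment (shm_open) named "telemetry" and writes 100 little-endian floats into it; the name and layout are assumptions, and synchronization (e.g. the semaphore mentioned above) is left out:

import struct
from multiprocessing import shared_memory

# Attach to an existing segment; on Linux this maps /dev/shm/telemetry.
shm = shared_memory.SharedMemory(name="telemetry", create=False)
try:
    values = struct.unpack_from("<100f", shm.buf, 0)
    print(values)
finally:
    shm.close()  # detach only; the creating side calls unlink()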
Short answer: no, but ZeroMQ can have other advantages. Let's get straight to it: if you are on Linux and want fast data transfer, you go shared memory. But it will not be as easy as with ZeroMQ.
ZeroMQ is a message queue, and it solves (and solves well) different problems. It can use IPC between C++ and Python, which can be noticeably faster than using sockets (for the same usage), and it gives you a window towards network features in your future developments. It is reliable and quite easy to use, with the same API in Python and C++. It is often used with Protobuf to serialize and send data, even at high throughput.
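To show how little code the Python side takes, here is a minimal REQ/REP sketch over ZeroMQ's ipc transport, matching the 100-floats-each-way exchange from the question. The endpoint name is an arbitrary assumption; the C++ side would bind a REP socket to the same endpoint.

import struct
import zmq

ctx = zmq.Context()
sock = ctx.socket(zmq.REQ)
sock.connect("ipc:///tmp/feed.ipc")  # hypothetical endpoint; C++ binds it

outgoing = [0.0] * 100  # placeholder data
sock.send(struct.pack("<100f", *outgoing))
reply = struct.unpack("<100f", sock.recv())
print(reply[:5])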
The first problem with ZeroMQ's IPC transport is that it lacks Windows support, because Windows is not a POSIX-compliant system. But the biggest problem is maybe not there: ZeroMQ adds overhead because it frames your messages. You can enjoy the benefits of that, but it can hurt performance. The best way to check is, as always, to test it yourself with IPC-BENCH, as I am not sure the benchmark I linked above was using the IPC transport. The average gain of IPC over local TCP is not fantastic, though.
As I said before, I am quite sure shared memory will be the fastest, excluding one last possibility: wrapping your C++ code as a Python extension. I would bet that is the fastest solution, but it will require a bit of engineering to multithread if needed, because both the C++ and the Python code will run in the same process. And of course you will need to adjust the existing C++ code.
And as usual, remember that optimization always happens in a context. If data transfer is only a small fraction of the running time compared to the processing you do afterwards, or if you can wait the 0.00001 s that shared memory would help you gain, it might be worth going directly with ZeroMQ, because it will be simpler, more scalable, and more reliable.

Externalising CPU computation from Python for multi-core concurrency

I have a PyQt5 application which runs perfectly on my development machine (Core i7 Windows 7), but has performance issues on my target platform (Linux Embedded ARM). I've been researching Python concurrency in further detail, prior to 'optimising' my current code (i.e. ensuring all UI code is in the MainThread, with all logic code in separate threads). I've learnt that the GIL largely prevents the CPython interpreter from realising true concurrency.
My question: would I be better off using IronPython or Cython as the interpreter, or sending all the logic to an external non-Python function which can make use of multiple cores, and leave the PyQt application to simply update the UI? If the latter, which language would be well suited to high-speed, concurrent calculation?
If the latter, which language would be well suited to high-speed, concurrent calculation?
You've written a lot about your system and yet not enough about what it actually does: what kind of "calculations" are you doing? If you're doing anything heavily computational, it's very likely someone has worked very hard to make a hardware-optimized library for those kinds of calculations, e.g. BLAS via scipy/numpy (see Arm's own website). You want to push as much work as possible out of your own Python code and into their hands. The language you use to call these libraries is much less important. Python is already great for this kind of "gluing" work. Note that even using built-in Python functions, such as sum(value for value in some_iter) instead of summing in a Python for loop, pushes computation out of slow interpretation and into highly optimized C code.
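As a small illustration of pushing work out of the interpreter, here is the same reduction written both ways; numpy stands in here for whatever optimized library fits your actual workload:

import numpy as np

data = np.random.rand(1_000_000)

# Interpreted loop: every iteration pays Python interpreter overhead.
total = 0.0
for x in data:
    total += x

# The same reduction pushed down into numpy's optimized C code.
total_fast = data.sum()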
Otherwise, without profiling your actual code, it's hard to say what would be best. After doing the above and formulating your calculations so that optimized libraries can do their work efficiently (e.g. by properly vectorizing them), you can then use Python's multiprocessing to divide whatever Python logic is causing a bottleneck from that which isn't (see this answer on why multiprocessing is often better than threading). I'd wager this would be much more beneficial than just swapping out CPython for another implementation.
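A minimal multiprocessing sketch of that division of labour; heavy_calculation is a hypothetical stand-in for whatever your profiler identifies as the bottleneck:

from multiprocessing import Pool

def heavy_calculation(chunk):
    # Hypothetical CPU-bound work on one independent chunk of the input.
    return sum(x * x for x in chunk)

if __name__ == "__main__":
    chunks = [range(i, i + 250_000) for i in range(0, 1_000_000, 250_000)]
    with Pool() as pool:  # defaults to one worker process per core
        results = pool.map(heavy_calculation, chunks)
    print(sum(results))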
Only once you've delegated as much computation as possible to external libraries and parallelized as well as possible using multiprocessing would I then start writing these computation-heavy processes in Cython, which can be considered a low-level optimization on top of the aforementioned architectural improvements.
Echoing @errantlinguist: please be aware that parallel performance is highly application-dependent.
To maintain GUI responsiveness, yes, I would just use a separate "worker" thread to keep the main thread available to handle GUI events.
To do something "insanely parallel", like a Monte Carlo computation, where you have many many completely independent tasks which have minimal communication between them, I might try multiprocessing.
If I were doing something like very large matrix operations, I would do it multithreaded. Anaconda will automatically parallelize some numpy operations via MKL on Intel processors (but this will not help you on ARM). I believe you could look at something like numba to help with this if you stay in Python. If you are unhappy with performance, you may want to try implementing in C++. If you use almost all vectorized numpy operations, you should not see a big difference versus C++, but as Python loops etc. start to creep in, you will probably begin to see big differences in performance (beyond the at most 4x you will gain by parallelizing your Python code over 4 cores). If you switch to C++ for matrix operations, I highly recommend the Eigen library. It's very fast and easy to understand at a high level.
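To make the numba suggestion concrete, a small sketch under stated assumptions: row_norms is an invented example, and @njit(parallel=True) with prange is what spreads the outer loop across cores.

import numpy as np
from numba import njit, prange

@njit(parallel=True)  # compiled to machine code; prange parallelizes the loop
def row_norms(matrix):
    out = np.empty(matrix.shape[0])
    for i in prange(matrix.shape[0]):
        s = 0.0
        for j in range(matrix.shape[1]):
            s += matrix[i, j] * matrix[i, j]
        out[i] = s ** 0.5
    return out

print(row_norms(np.random.rand(1000, 1000))[:5])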
Please be aware that when you use multithreading, you are usually in a shared-memory context, which eliminates a lot of the expensive I/O you will encounter in multiprocessing, but it also introduces classes of bugs you are not used to encountering in serial programs (when two threads access the same resources). In multiprocessing, memory is usually separate, except for explicitly defined communication between the processes. In that sense, I find multiprocessing code typically easier to understand and debug.
Also, there are frameworks out there for handling complex computational graphs with many steps, which may include both multithreading and multiprocessing (try dask).
Good luck!

Tricks to improve performance of python backend

I am using Python programs for nearly everything:
deploy scripts
nagios routines
website backend (web2py)
The reason I am doing this is that I can reuse the code to provide different kinds of services.
Since a while ago I have noticed that those scripts are putting a high CPU load on my servers. I have taken several steps to mitigate this:
late initialization, using cached_property (see here and here), so that only the objects that are actually needed get initialized (including the import of the related modules); a sketch follows below
turning some of my scripts into HTTP services (with a simple web.py implementation wrapping up my classes). The services are then triggered (by Nagios, for example) with simple curl calls.
This has reduced the load dramatically, going from a CPU load of over 20 to well under 1. It seems Python startup is very resource-intensive for complex programs with lots of inter-dependencies.
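To illustrate the cached_property point from the list above, a minimal sketch; Service and heavy_database_module are hypothetical names (use functools.cached_property on Python 3.8+, or the standalone cached-property package before that):

from functools import cached_property

class Service:
    @cached_property
    def db(self):
        # The import and connection cost is paid only if .db is ever
        # touched, and the result is cached on the instance afterwards.
        import heavy_database_module  # hypothetical expensive import
        return heavy_database_module.connect()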
I would like to know what other strategies people here are implementing to improve the performance of Python software.
An easy one-off improvement is to use PyPy instead of the standard CPython for long-lived scripts and daemons (for short-lived scripts it's unlikely to help and may actually have longer startup times). Other than that, it sounds like you've already hit upon one of the biggest improvements for short-lived system scripts, which is to avoid the overhead of starting the Python interpreter for frequently-invoked scripts.
For example, if you invoke one script from another and they're both in Python, you should definitely consider importing the other script as a module and calling its functions directly, as opposed to using subprocess or similar.
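A hedged before/after sketch of that point; check_disk is a hypothetical script name:

import subprocess

# Slow: each invocation pays the full interpreter and import startup cost.
subprocess.run(["python", "check_disk.py", "/var"])

# Fast: import once, then call directly inside the same interpreter.
import check_disk  # hypothetical module
check_disk.run("/var")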
I appreciate that it's not always possible to do this, since some use-cases rely on external scripts being invoked; Nagios checks, for example, are going to be tricky to keep resident at all times. Your approach of making the actual check a simple HTTP request seems reasonable enough, but the approach I took was to use passive checks and run an external service to periodically update the status. This allows the service generating the check results to stay resident as a daemon rather than requiring Nagios to invoke a script for each check.
Also, watch your system to see whether the slowness really is CPU overload or an I/O issue. You can use utilities like vmstat to watch your I/O usage. If you're I/O-bound then optimising your code won't necessarily help a lot. In that case, if you're doing something like processing lots of text files (e.g. log files), you can store them gzipped and access them directly using Python's gzip module. This increases CPU load but reduces I/O load, because you only need to transfer the compressed data from disk. You can also write output files directly in gzipped format using the same approach.
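For example, a short sketch of reading a gzipped log directly; the path and the filter string are illustrative:

import gzip

# The file stays compressed on disk; decompression costs CPU, but far
# fewer bytes have to be transferred from disk.
with gzip.open("/var/log/app.log.gz", "rt") as f:
    error_lines = [line for line in f if "ERROR" in line]
print(len(error_lines))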
I'm afraid I'm not particularly familiar with web2py specifically, but you can investigate whether it's easy to put a caching layer in front of it if the freshness of the data isn't totally critical. Try to make sure both your server and clients use conditional requests correctly, which will reduce request processing time. If they're using a back-end database, you could investigate whether something like memcached will help. These measures are only likely to give you real benefit if you're experiencing a reasonably high volume of requests or if each request is expensive to handle.
I should also add that generally reducing system load in other ways can occasionally give surprising benefits. I used to have a relatively small server running Apache and I found moving to nginx helped a surprising amount - I believe it was partly more efficient request handling, but primarily it freed up some memory that the filesystem cache could then use to further boost IO-bound operations.
Finally, if overhead is still a problem then carefully profile your most expensive scripts and optimise the hotspots. This could mean improving your Python code, or it could mean pushing code out to C extensions if that's an option for you. I've had some great performance gains by pushing data-path code out into C extensions for large-scale log processing and similar tasks (talking about hundreds of GB of logs at a time). However, this is a heavy-duty, time-consuming approach and should be reserved for the few places where you really need the speed boost. It also depends on whether you have someone available who's familiar enough with C to do it.

How would an IRC bot written in Tcl stack up against a Python/Node.js clone?

I believe eggdrop is the most active/popular bot, and it's written in Tcl (and according to the wiki the core is C, but I haven't confirmed that).
I'm wondering if there would be any performance benefit to recoding its functionality in Node.js or Python, in addition to making it more accessible, since Python and JS are arguably more popular languages and not many people are familiar with Tcl.
So, how would they stack up against Tcl in general, performance-wise?
As you suspected, eggdrop is not written in Tcl; it is written in C. However, it does use Tcl as its scripting/extension language.
I would expect that in the case of an eggdrop, the performance difference between using Tcl as the scripting language and using Python, Lua, JS, or virtually anything else would be negligible, as eggdrops generally aren't performing high-load tasks.
In the event it really was an issue, your question would need more specifics. Performance for what task under what conditions? Memory use? CPU efficiency? Latency? And the answer would probably be "measure and find out". Given the typical use of an eggdrop, it doesn't take particularly efficient code to respond to the occasional IRC trigger command once every few minutes or hours.
As a more general case, I'm sure you could find benchmark comparisons of specific algorithms or tasks performed by various scripting languages on particular operating systems or environments, at which point it wouldn't really have anything to do with IRC or eggdrop.
If you're not doing much other than waiting on a quiet channel for something to happen, performance is pretty much irrelevant. You could probably write that in BF (well, with network connectivity primitives added) and have it perform OK.
If you're running on lots of busy channels with many things being watched for, that's different. Tcl is very good at event-driven I/O, which is ideal for this sort of situation. (Python can do that too, but needs external libraries, as does Lua. I don't know JS well enough to comment.)
If you need to do significant non-I/O-bound processing for some message responses, you're into needing threads. Both Tcl and Python support threads, but with utterly different threading models: Python has a shared-memory model, which makes it easier to pass some types of task around, especially when the data is large, while Tcl has an apartment model, which greatly reduces the amount of locking required in the implementation, giving a good performance boost in CPU-bound code.
How is that relevant for IRC bots? Well, it all depends on what you're doing in the bot.

What would I use Stackless Python for?

There are many questions related to Stackless Python, but none of them answer this question of mine, I think (correct me if I'm wrong, please!). There's some buzz about it all the time, so I'm curious to know: what would I use Stackless for? How is it better than CPython?
Yes, it has green threads (stackless) that allow you to quickly create many lightweight threads, as long as no operations are blocking (something like Ruby's threads?). What is this great for? What other features does it have that I'd want to use over CPython?
It allows you to work with massive amounts of concurrency. Nobody sane would create one hundred thousand system threads, but you can do this using Stackless.
This article tests doing just that, creating one hundred thousand tasklets in both Python and Google Go (a new programming language): http://dalkescientific.com/writings/diary/archive/2009/11/15/100000_tasklets.html
Surprisingly, even though Google Go is compiled to native code and they tout their coroutine implementation, Python still wins.
Stackless would be good for implementing a map/reduce algorithm, where you can have a very large number of reducers depending on your input data.
Stackless Python's main benefit is the support for very lightweight coroutines. CPython doesn't support coroutines natively (although I expect someone to post a generator-based hack in the comments) so Stackless is a clear improvement on CPython when you have a problem that benefits from coroutines.
I think the main area where they excel is when you have many concurrent tasks running within your program. Examples might be game entities that run a looping script for their AI, or a web server that is servicing many clients with pages that are slow to create.
You still have many of the typical concurrency-correctness problems regarding shared data, but the deterministic task switching makes it easier to write safe code, since you know exactly where control will be transferred and therefore know the exact points at which the shared state must be up to date.
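To show what that deterministic switching looks like, a minimal sketch against the Stackless API (it only runs on a Stackless interpreter, and the worker function is invented for illustration):

import stackless

def worker(name, count):
    for i in range(count):
        print(name, i)
        stackless.schedule()  # the only points where control can switch

# Two cooperating tasklets, interleaved exactly at the schedule() calls.
stackless.tasklet(worker)("a", 3)
stackless.tasklet(worker)("b", 3)
stackless.run()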
Thirler already mentioned that Stackless is used in EVE Online. Keep in mind that:
(..) stackless adds a further twist to this by allowing tasks to be separated into smaller tasks, Tasklets, which can then be split off the main program to execute on their own. This can be used for fire-and-forget tasks, like sending off an email, or dispatching an event, or for IO operations, e.g. sending and receiving network packets. One tasklet waits for a packet from the network while others continue running the game loop.
It is in some ways like threads, but is non-preemptive and explicitly scheduled, so there are fewer issues with synchronization. Also, switching between tasklets is much faster than thread switching, and you can have a huge number of active tasklets whereas the number of threads is severely limited by the computer hardware.
(got this citation from here)
At PyCon 2009 there was a very interesting talk describing why and how Stackless is used at CCP Games.
There is also some very good introductory material describing why Stackless is a good solution for your applications (it may be somewhat old, but I think it is worth reading).
EVE Online is largely programmed in Stackless Python. They have several dev blogs on their use of it. It seems it is very useful for high-performance computing.
While I've not used Stackless itself, I have used Greenlet for implementing highly-concurrent network applications. Some of the use cases Linden Lab has put it towards are: high-performance smart proxies, a fast system for distributing commands over huge numbers of machines, and an application that does a ton of database writes and reads (at a ratio of about 1:2, which is very write-heavy, so it's spending most of its time waiting for the database to return), and a web-crawler-type-thing for internal web data. Basically any app that's expecting to have to do a lot of network I/O will benefit from being able to create a bajillion lightweight threads. 10,000 connected clients doesn't seem like a huge deal to me.
Stackless and Greenlet aren't really complete solutions on their own, though. They are very low-level, and you're going to have to do a lot of monkeywork to build an application that uses them to their fullest. I know this because I maintain a library that provides a networking and scheduling layer on top of Greenlet, specifically because writing apps is so much easier with it. There are a bunch of these now; I maintain Eventlet, but there are also Concurrence, Chiral, and probably a few more that I don't know about.
If the sort of app you want to write sounds like what I wrote about, consider one of these libraries. The choice of Stackless vs Greenlet is somewhat less important than deciding what library best suits the needs of what you want to do.
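As a taste of what those layers buy you, a minimal Eventlet-style sketch of highly concurrent network I/O; the URL list and pool size are illustrative assumptions:

import eventlet
from eventlet.green.urllib import request  # green (non-blocking) urllib

urls = ["http://example.com"] * 10  # illustrative workload

def fetch(url):
    return request.urlopen(url).read()

pool = eventlet.GreenPool(200)  # up to 200 concurrent green threads
for body in pool.imap(fetch, urls):
    print(len(body))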
The basic usefulness of green threads, the way I see it, is for implementing a system in which you have a large number of objects that perform high-latency operations. A concrete example would be communicating with other machines:
def Run():
    # Do stuff
    request_information()  # This call might block
    # Proceed doing more stuff
Threads let you write the above code naturally, but if the number of objects is large enough, threads just cannot perform adequately. But you can use green threads even in really large numbers. The request_information() above could switch out to a scheduler where other work is waiting, and return later. You get all the benefits of being able to call "blocking" functions as if they return immediately, without using threads.
This is obviously very useful for any kind of distributed computing if you want to write code in a straightforward way.
It is also interesting on multiple cores, to mitigate waiting for locks:
def Run():
    # Do some calculations
    green_lock(the_foo)
    # Do some more calculations
The green_lock function would basically attempt to acquire the lock and just switch out to a main scheduler if it fails due to other cores using the object.
Again, green threads are being used to mitigate blocking, allowing code to be written naturally and still perform well.
