High-performance replacement for multiprocessing.Queue

High-performance replacement for multiprocessing.Queue - python

My distributed application consists of many producers that push tasks into several FIFO queues, and multiple consumers for every one of these queues. All these components live on a single node, so no networking involved.
This pattern is perfectly supported by Python's built-in multiprocessing.Queue, however when I am scaling up my application the queue implementation seems to be a bottleneck. I am not sending large amounts of data, so memory sharing does not solve the problem. What I need is fast guaranteed delivery of 10^4-10^5 small messages per second. Each message is about 100 bytes.
I am new to the world of fast distributed computing and I am very confused by the sheer amount of options. There is RabbitMQ, Redis, Kafka, etc.
ZeroMQ is a more focused and compact alternative, which also has successors such as nanomsg and nng. Also, implementing something like a many-to-many queue with a guaranteed delivery seems nontrivial without a broker.
I would really appreciate if someone could point me to a "standard" way of doing something like this with one of the faster frameworks.

After trying a few available implementations and frameworks, I still could not find anything that would be suitable for my task. Either too slow or too heavy.
To solve the issue my colleagues and I developed this: https://github.com/alex-petrenko/faster-fifo
faster-fifo is a drop-in replacement for Python's multiprocessing.Queue and is significantly faster. In fact, it is up to 30x faster in the configurations I cared about (many producers, few consumers) because it additionally supports get_many() method on the consumer side.
It is brokereless, lightweight, supports arbitrary many-to-many configurations, implemented for Posix systems using pthread synchronization primitives.

I think that a lot of it depends partly on what sort of importance you place on individual messages.
If each and every one is vital, and you have to consider what happens to them in the event of some failure somewhere, then frameworks like RabbitMQ can be useful. RabbitMQ has a broker, and it's possible to configure this for some sort of high availability, high reliability mode. With the right queue settings, RabbitMQ will look after your messages up until some part of your system consumes them.
To do all this, RabbitMQ needs a broker. This makes it fairly slow. Though at one point there was talk about reimplementing RabbitMQ on top of ZeroMQ's underlying protocols (zmtp) and doing away with the broker, implementing all the functionality in the endpoints instead.
In contrast, ZeroMQ does far less to guarantee that, in the event of failures, your messages will actually, eventually, get through to the intended destination. If a process dies, or a network connection fails, then there's a high chance that messages have got lost. More recent versions can be set up to actively monitor connections, so that if a network cable breaks or a process dies somewhere, the endpoints at the other end of the sockets can be informed about this pretty quickly. If one then implements a communicating sequential processes framework on top of ZMQ's actor framework (think: message acknowledgements, etc. This will slow it down) you can end up with a system whereby endpoints can know for sure that messages have been transfered to intended destinations.
Being brokerless allows zmq to be pretty fast. And it's efficient across a number of different transports, ranging from inproc to tcp, all of which can be blended together. If you're not worried about processes crashing or network connections failing, ZMQ gives you a guarantee to deliver messages right out of the box.
So, deciding what it is that's important in your application helps choose what technology you're doing to use as part of it - RabbitMQ, ZeroMQ, etc. Once you've decided that, then the problem of "how to get the patterns I need" is reduced to "what patterns does that technology support". RabbitMQ is, AFAIK, purely pub/sub (there can be a lot of each), whereas ZeroMQ has many more.

I have tried Redis Server queuing in order to replace Python standard multiprocessing Queue. It is s NO GO for Redis ! Python is best, fastest and can accept any kind of data type you throw at it, where with Redis and complex datatype such as dict with lot of numpy array etc... you have to pickle or json dumps/loads which add up overhead to the process.
Cheers,
Steve

Related

Twisted, genvent, asyncoro - are they what I might need?

Learning Python and trying to do something ambitious (perhaps too much).
The application (console, that runs silently like a server), needs to talk to 2 serial ports, needs to deal with timers, needs to push information on Redis KV-store, write logs, and interact with bunch of other similar applications using unix IPC (or socket comm.)
The easier way (to my mind) to think of such an application is to work with threads and event queues. However due to what I understand as GIL enforced limitation with threading, it is not quite an option with Python (unless, I misunderstood things). The alternative way, what I understood - is to work with asynchronous I/O framework, green-threads, coroutines etc.
Are twisted, gevent and asyncoro really alternatives in Python for asynchronous event-driven programming that I intend to write ?
Since learning twisted seems to be such a big investment (in terms of time/effort), I was wondering if gevent and asyncoro could be easier and better alternative ? From the bit of superficial document reading done so far, asyncoro seems to be simplest, with very limited amount of new learning, and Twisted is other extreme, with gevent being somewhere in the middle -- but then I am not sure, if they are really comparable.
Here's an example of what the application would do if were multi-threaded:
Thread:1 - Monitor health of serial port, periodically i.e. with a timer. Say check every 2 minutes if last state was healthy. If last state was unhealthy then check every 30 seconds for first 5 mins, then every minute for next 10 mins... like in exponential backoff. Note that there are multiple such serial ports.
Thread:2 - Monitor state of application-level sessions that come-and-go from time to time, over the serial ports, and the communication that happens over it. Redis is (planned) to be used to write to distributed KV-store s.t. other instances of application (running on same or other servers), can coordinate certain other actions.
Thread:3 - Performs some other housekeeping tasks.
All of the threads need to do logging, all the threads use timers (& other events) to do certain things. Timers are used for periodic execution of some logic and as timeouts to guard certain actions (blocking or non-blocking).
My experience with Python is extremely limited, but I have experience writing similar programs in C/C++ and Java. Using Python for this, to learn.

You can use any of the libraries you've mentioned here to implement the application you've described. You can also use traditional threads. The GIL prevents you from achieving hardware-level parallelism in the execution of Python byte code operations (as distinct from, say, native code being invoked from your Python program). It does not prevent you from performing parallel I/O operations - which is what it sounds like your application is primarily concerned with.
There isn't enough detail in your question to provide a recommendation of one of these tools over another (and if there were enough detail, the question would probably be enormous and the effort to answer it correctly would probably discourage anyone on SO from doing so). It's typically safe to say that the threading approach is probably the worst, though (for a variety of reasons I won't even attempt to expain here; they're documented well enough on the internet at large).

Maximally simple django timed/scheduled tasks (e.g.: reminders)?

Question is relevant to this and this;
the difference is, I'd prefer something with possibly more precision and low load (per-minute cron job isn't preferable for those) and with minimal overhead (i.e. installing celery with rabbitmq seems like a big overkill).
An example task for such is personal reminders server (with reminders that could be edited over web and sent out through e-mail or XMPP).
I'm probably looking for something more like node.js's setTimeout but for django (and though I might prefer to implement reminders in node.js anyway, it's still a possibly interesting question).
For example, it's possible to start new threads in django app (with functions consisting of sleep() and send()); in what ways this can be bad?

The problem with using threads for this solution are the typical problems with Python threads that always drive people towards multi-process solutions instead. The problem is compounded here by the fact your thread isn't driven by the normal request-response cycle. This is summarized nicely by Malcolm Tredinnick here:
Have to disagree. Threads are not a good solution to this problem. The
issue is process management. As written, your threads will never be
rejoined. Webserver processes have a lifecycle uncontrollable by you
(the MaxRequestsPerChild Apache parameter and similar things in other
servers) and you are messing with that by using threads.
If you need a process with a lifecycle that is not matched by the
request-response path — something long running and independent of the
response — a completely separate process is definitely the right model
to use. Using a thread is tying it to the response lifecycle, which
wil have unintended side-effects.
A possible solution for you might be to have a long running process performing your tasks which gets a wake-up signal from a light cron process.
Another possibility would be build something using 0mq, which is much lighter than AMQP style queues (at the cost of some features of course). Tarek Ziade is working on a Mozilla project called powerhose that uses 0mq, looks super simple, and has a heartbeat capability with resolution to the second.

Python: OpenMPI Vs. RabbitMQ

Suppose that one is interested to write a python app where there should be communication between different processes. The communications will be done by sending strings and/or numpy arrays.
What are the considerations to prefer OpenMPI vs. a tool like RabbitMQ?

There is no single correct answer to such question. It all depends on a big number of different factors. For example:
What kind of communications do you have? Are you sending large packets or small packets, do you need good bandwidth or low latency?
What kind of delivery guarantees do you need?
OpenMPI can instantly deliver messages only to a running process, while different MQ solutions can queue messages and allow fancy producer-consumer configurations.
What kind of network do you have? If you are running on the localhost, something like ZeroMQ would probably be the fastest. If you are running on the set of hosts, depends on the interconnections available. E.g. OpenMPI can utilize infiniband/mirynet links.
What kind of processing are you doing? With MPI all processes are usually started at the same time, do the processing and terminate all at once.

This is exactly the scenario I was in a few months ago and I decided to use AMQP with RabbitMQ using topic exchanges, in addition to memcache for large objects.
The AMQP messages are all strings, in JSON object format so that it is easy to add attributes to a message (like number of retries) and republish it. JSON objects are a subset of JSON that correspond to Python dicts. For instance {"recordid": "272727"} is a JSON object with one attribute. I could have just pickled a Python dict but that would have locked us into only using Python with the message queues.
The large objects don't get routed by AMQP, instead they go into a memcache where they are available for another process to retrieve them. You could just as well use Redis or Tokyo Tyrant for this job. The idea is that we did not want short messages to get queued behind large objects.
In the end, my Python processes ended up using both AMQP and ZeroMQ for two different aspects of the architecture. You may find that it makes sense to use both OpenMPI and AMQP but for different types of jobs.
In my case, a supervisor process runs forever, starts a whole flock of worker who also run forever unless they die or hang, in which case the supervisor restarts them. The work constantly flows in as messages via AMQP, and each process handles just one step of the work, so that when we identify a bottleneck we can have multiple instances of the process, possibly on separate machines, to remove the bottleneck. In my case, I have 15 instances of one process, 4 of two others, and about 8 other single instances.

Python server for hardware control (possibly with Twisted?)

I'm currently in the process of programming a server which can let clients interact with a piece of hardware. For the interested readers it's a device which monitors the wavelength of a set of lasers concurrently (and controls the lasers). The server should be able to broadcast the wavelengths (a list of floats) on a regular basis and let the clients change the settings of the device through dll calls.
My initial idea was to write a custom protocol to handle the communication, but after thinking about how to handle TCP fragmentation and data encoding I bumped into Twisted, and it looks like most of the work is already done if I use perspective broker to share the data and call server methods directly from the clients. This solution might be a bit overkill, but for me it appeared obvious, what do you think?
My main concern arrose when I thought about the clients. Basically I need two types of clients, one which just displays the wavelengths (this should be straight forward) and a second which can change the device settings and get feedback when it's changed. My idea was to create a single client capable of both, but thinking about combining it with our previous system got me thinking... The second client should be controlled from an already rather complex python framework which controls a lot of independant hardware with relatively strict timing requirements, and the settings of the wavelengthmeter should then be called within this sequential code. Now the thing is, how do I mix this with the Twisted client? As I understand Twisted is not threadsafe, so I can't simply spawn a new thread running the reactor and then inteact with it from my main thread, can I?
Any suggestions for writing this server/client framework through different means than Twisted are very welcome!
Thanks

You can start the reactor in a dedicated thread, and then issue calls to it with blockingCallFromThread from your existing "sequential" code.
Also, I'd recommend AMP for the protocol rather than PB, since AMP is more amenable to heterogeneous environments (see amp-protocol.net for independent protocol information), and it sounds like you have a substantial amount of other technology you might want to integrate with this system.

Have you tried zeromq?
It's a library that simplifies working with sockets. It can operate over TCP and implements several topologies, such as publisher/subscriber (for broadcasting data, such as your laser readings) and request/response (that you can use for you control scheme).
There are bindings for several languages and the site is full of examples. Also, it's amazingly fast.
Good stuff.

Writing a socket-based server in Python, recommended strategies?

I was recently reading this document which lists a number of strategies that could be employed to implement a socket server. Namely, they are:
Serve many clients with each thread, and use nonblocking I/O and level-triggered readiness notification
Serve many clients with each thread, and use nonblocking I/O and readiness change notification
Serve many clients with each server thread, and use asynchronous I/O
serve one client with each server thread, and use blocking I/O
Build the server code into the kernel
Now, I would appreciate a hint on which should be used in CPython, which we know has some good points, and some bad points. I am mostly interested in performance under high concurrency, and yes a number of the current implementations are too slow.
So if I may start with the easy one, "5" is out, as I am not going to be hacking anything into the kernel.
"4" Also looks like it must be out because of the GIL. Of course, you could use multiprocessing in place of threads here, and that does give a significant boost. Blocking IO also has the advantage of being easier to understand.
And here my knowledge wanes a bit:
"1" is traditional select or poll which could be trivially combined with multiprocessing.
"2" is the readiness-change notification, used by the newer epoll and kqueue
"3" I am not sure there are any kernel implementations for this that have Python wrappers.
So, in Python we have a bag of great tools like Twisted. Perhaps they are a better approach, though I have benchmarked Twisted and found it too slow on a multiple processor machine. Perhaps having 4 twisteds with a load balancer might do it, I don't know. Any advice would be appreciated.

asyncore is basically "1" - It uses select internally, and you just have one thread handling all requests. According to the docs it can also use poll. (EDIT: Removed Twisted reference, I thought it used asyncore, but I was wrong).
"2" might be implemented with python-epoll (Just googled it - never seen it before).
EDIT: (from the comments) In python 2.6 the select module has epoll, kqueue and kevent build-in (on supported platforms). So you don't need any external libraries to do edge-triggered serving.
Don't rule out "4", as the GIL will be dropped when a thread is actually doing or waiting for IO-operations (most of the time probably). It doesn't make sense if you've got huge numbers of connections of course. If you've got lots of processing to do, then python may not make sense with any of these schemes.
For flexibility maybe look at Twisted?
In practice your problem boils down to how much processing you are going to do for requests. If you've got a lot of processing, and need to take advantage of multi-core parallel operation, then you'll probably need multiple processes. On the other hand if you just need to listen on lots of connections, then select or epoll, with a small number of threads should work.

How about "fork"? (I assume that is what the ForkingMixIn does) If the requests are handled in a "shared nothing" (other than DB or file system) architecture, fork() starts pretty quickly on most *nixes, and you don't have to worry about all the silly bugs and complications from threading.
Threads are a design illness forced on us by OSes with too-heavy-weight processes, IMHO. Cloning a page table with copy-on-write attributes seems a small price, especially if you are running an interpreter anyway.
Sorry I can't be more specific, but I'm more of a Perl-transitioning-to-Ruby programmer (when I'm not slaving over masses of Java at work)
Update: I finally did some timings on thread vs fork in my "spare time". Check it out:
http://roboprogs.com/devel/2009.04.html
Expanded:
http://roboprogs.com/devel/2009.12.html

One sollution is gevent. Gevent maries a libevent based event polling with lightweight cooperative task switching implemented by greenlet.
What you get is all the performance and scalability of an event system with the elegance and straightforward model of blocking IO programing.
(I don't know what the SO convention about answering to realy old questions is, but decided I'd still add my 2 cents)

Can I suggest additional links?
cogen is a crossplatform library for network oriented, coroutine based programming using the enhanced generators from python 2.5. On the main page of cogen project there're links to several projects with similar purpose.

I like Douglas' answer, but as an aside...
You could use a centralized dispatch thread/process that listens for readiness notifications using select and delegates to a pool of worker threads/processes to help accomplish your parallelism goals.
As Douglas mentioned, however, the GIL won't be held during most lengthy I/O operations (since no Python-API things are happening), so if it's response latency you're concerned about you can try moving the critical portions of your code to CPython API.

http://docs.python.org/library/socketserver.html#asynchronous-mixins
As for multi-processor (multi-core) machines. With CPython due to GIL you'll need at least one process per core, to scale. As you say that you need CPython, you might try to benchmark that with ForkingMixIn. With Linux 2.6 might give some interesting results.
Other way is to use Stackless Python. That's how EVE solved it. But I understand that it's not always possible.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.