Implementing a client protocol in Twisted, my current code does a lot of work on each protocol unit received. It doesn't use I/O, so no Deferreds are currently used.
The processing is not meant to be intensive, but it is open to pluggable interfaces.
Is there a threshold at which I should partition the processing into Deferreds?
You might consider "deferring" at points where plugins are going to be called, since you can't predict whether they will do any I/O to databases, remote processes, web services, whatever.
Look into using the inlineCallbacks decorator, which will simplify your life in terms of breaking up your processing into Deferreds simply by using Python's yield statement. Then you can experiment with breaking up your compute-intensive work in various ways, perhaps to give other protocol handlers a chance to run and complete, especially if some handlers are compute-intensive and others are not.
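For instance, here's a minimal sketch of what that might look like; parse() and run_plugins() are hypothetical stand-ins for your own protocol stages:

```python
from twisted.internet import defer

@defer.inlineCallbacks
def process_unit(unit):
    # Each yield hands control back to the reactor, so other
    # protocol handlers get a chance to run between stages.
    parsed = parse(unit)                # hypothetical cheap, synchronous step
    result = yield run_plugins(parsed)  # hypothetical plugin hook; may return a Deferred
    defer.returnValue(result)
```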
My distributed application consists of many producers that push tasks into several FIFO queues, and multiple consumers for every one of these queues. All these components live on a single node, so no networking involved.
This pattern is perfectly supported by Python's built-in multiprocessing.Queue; however, when I scale up my application, the queue implementation seems to be a bottleneck. I am not sending large amounts of data, so memory sharing does not solve the problem. What I need is fast, guaranteed delivery of 10^4-10^5 small messages per second. Each message is about 100 bytes.
I am new to the world of fast distributed computing and I am very confused by the sheer number of options: RabbitMQ, Redis, Kafka, etc.
ZeroMQ is a more focused and compact alternative, which also has successors such as nanomsg and nng. Also, implementing something like a many-to-many queue with guaranteed delivery seems nontrivial without a broker.
I would really appreciate if someone could point me to a "standard" way of doing something like this with one of the faster frameworks.
After trying a few available implementations and frameworks, I still could not find anything suitable for my task: everything was either too slow or too heavyweight.
To solve the issue, my colleagues and I developed this: https://github.com/alex-petrenko/faster-fifo
faster-fifo is a drop-in replacement for Python's multiprocessing.Queue and is significantly faster. In fact, it is up to 30x faster in the configurations I cared about (many producers, few consumers) because it additionally supports a get_many() method on the consumer side.
It is brokerless and lightweight, supports arbitrary many-to-many configurations, and is implemented for POSIX systems using pthread synchronization primitives.
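A quick sketch of typical usage, following the project's README; the exact constructor and keyword names (max_size_bytes, max_messages_to_get) are from the README and may differ between versions:

```python
from faster_fifo import Queue  # pip install faster-fifo

# Same interface as multiprocessing.Queue, plus batched reads.
q = Queue(max_size_bytes=1000 * 1000)

q.put({'task_id': 1, 'payload': b'x' * 100})

# get_many() drains a batch of messages in a single call, which is
# where most of the speedup comes from in many-producer setups.
messages = q.get_many(max_messages_to_get=100)
```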
I think that a lot of it depends partly on what sort of importance you place on individual messages.
If each and every one is vital, and you have to consider what happens to them in the event of some failure somewhere, then frameworks like RabbitMQ can be useful. RabbitMQ has a broker, and it's possible to configure this for some sort of high availability, high reliability mode. With the right queue settings, RabbitMQ will look after your messages up until some part of your system consumes them.
To do all this, RabbitMQ needs a broker, which makes it fairly slow. At one point there was talk of reimplementing RabbitMQ on top of ZeroMQ's underlying protocol (ZMTP) and doing away with the broker, implementing all the functionality in the endpoints instead.
In contrast, ZeroMQ does far less to guarantee that, in the event of failures, your messages will actually, eventually, get through to the intended destination. If a process dies, or a network connection fails, there's a high chance that messages get lost. More recent versions can be set up to actively monitor connections, so that if a network cable breaks or a process dies somewhere, the endpoints at the other end of the sockets are informed about it pretty quickly. If one then implements a communicating-sequential-processes framework on top of ZMQ's actor framework (think message acknowledgements, etc.; this will slow it down), you can end up with a system in which endpoints know for sure that messages have been transferred to their intended destinations.
Being brokerless allows ZeroMQ to be pretty fast. And it's efficient across a number of different transports, ranging from inproc to TCP, all of which can be blended together. If you're not worried about processes crashing or network connections failing, ZeroMQ delivers messages reliably right out of the box.
So, deciding what it is that's important in your application helps choose what technology you're going to use as part of it - RabbitMQ, ZeroMQ, etc. Once you've decided that, the problem of "how to get the patterns I need" is reduced to "what patterns does that technology support". RabbitMQ is, AFAIK, purely pub/sub (there can be a lot of each), whereas ZeroMQ has many more.
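For the many-producers/many-consumers FIFO case, the usual ZeroMQ pattern is a "streamer" device. Here's a minimal sketch using pyzmq; the ipc endpoints are hypothetical:

```python
import zmq

# Streamer device: producers connect PUSH sockets to the frontend
# address, consumers connect PULL sockets to the backend. Incoming
# messages are fair-queued and load-balanced across consumers.
ctx = zmq.Context()
frontend = ctx.socket(zmq.PULL)
frontend.bind("ipc:///tmp/tasks-in")   # hypothetical endpoint
backend = ctx.socket(zmq.PUSH)
backend.bind("ipc:///tmp/tasks-out")   # hypothetical endpoint
zmq.proxy(frontend, backend)           # blocks forever, shuttling messages
```

Note the irony: this tiny forwarder is itself a broker of sorts, which is why truly brokerless many-to-many delivery is the nontrivial part.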
I have tried Redis server queuing as a replacement for Python's standard multiprocessing.Queue. Redis is a NO GO! Python's queue is better and faster, and it can accept any data type you throw at it, whereas with Redis, complex data types such as dicts containing many NumPy arrays have to be pickled or JSON dumped/loaded, which adds overhead to the process.
Cheers,
Steve
Learning Python and trying to do something ambitious (perhaps too much).
The application (a console app that runs silently, like a server) needs to talk to two serial ports, deal with timers, push information to a Redis KV-store, write logs, and interact with a bunch of other, similar applications using Unix IPC (or socket communication).
The easiest way (to my mind) to think about such an application is in terms of threads and event queues. However, due to what I understand to be a GIL-enforced limitation on threading, that is not really an option in Python (unless I have misunderstood things). The alternative, as I understand it, is to work with an asynchronous I/O framework, green threads, coroutines, etc.
Are Twisted, gevent, and asyncoro really alternatives in Python for the kind of asynchronous, event-driven programming I intend to do?
Since learning Twisted seems to be such a big investment (in terms of time and effort), I was wondering whether gevent and asyncoro could be easier and better alternatives. From the superficial document reading done so far, asyncoro seems to be the simplest, with a very limited amount of new learning; Twisted is the other extreme; and gevent is somewhere in the middle -- but then I am not sure whether they are really comparable.
Here's an example of what the application would do if it were multi-threaded:
Thread:1 - Monitors the health of a serial port periodically, i.e. with a timer: say, check every 2 minutes if the last state was healthy. If the last state was unhealthy, check every 30 seconds for the first 5 minutes, then every minute for the next 10 minutes, as in exponential backoff. Note that there are multiple such serial ports.
Thread:2 - Monitors the state of application-level sessions that come and go from time to time over the serial ports, and the communication that happens over them. Redis is planned to be used to write to a distributed KV-store so that other instances of the application (running on the same or other servers) can coordinate certain other actions.
Thread:3 - Performs some other housekeeping tasks.
All of the threads need to do logging, and all of them use timers (and other events) to do certain things. Timers are used for periodic execution of some logic and as timeouts to guard certain actions (blocking or non-blocking).
My experience with Python is extremely limited, but I have experience writing similar programs in C/C++ and Java. I am using Python for this in order to learn.
You can use any of the libraries you've mentioned here to implement the application you've described. You can also use traditional threads. The GIL prevents you from achieving hardware-level parallelism in the execution of Python byte code operations (as distinct from, say, native code being invoked from your Python program). It does not prevent you from performing parallel I/O operations - which is what it sounds like your application is primarily concerned with.
There isn't enough detail in your question to provide a recommendation of one of these tools over another (and if there were enough detail, the question would probably be enormous and the effort to answer it correctly would probably discourage anyone on SO from doing so). It's typically safe to say that the threading approach is probably the worst, though (for a variety of reasons I won't even attempt to explain here; they're documented well enough on the internet at large).
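To give a taste of how the timer-driven pieces look without threads, here is a hypothetical sketch of "Thread:1" from the question using Twisted's reactor.callLater; probe() is a placeholder for the real serial-port check, and the backoff schedule is simplified:

```python
from twisted.internet import reactor

def check_port(port, delay=30):
    # reactor.callLater reschedules the next check; the delay is
    # chosen from the last observed state instead of blocking a thread.
    if probe(port):  # hypothetical serial-port health probe
        reactor.callLater(120, check_port, port, 30)  # healthy: recheck in 2 minutes
    else:
        reactor.callLater(delay, check_port, port, min(delay * 2, 120))  # back off
```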
I am writing an implementation of a NAT. My algorithm is as follows (a minimal sketch in code appears after the list):
Packet comes in
Check against lookup table if external, add to lookup table if internal
Swap the source address and send the packet on its way
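Here is that minimal sketch, assuming a dict keyed by (src_ip, src_port); a real NAT also needs entry timeouts, checksum updates, and the reverse mapping, and allocate_external_port() is a hypothetical helper:

```python
nat_table = {}

def translate(src_ip, src_port, external_ip):
    # Add to the lookup table if this is a new internal flow,
    # then rewrite the source address and port.
    key = (src_ip, src_port)
    if key not in nat_table:
        nat_table[key] = allocate_external_port()  # hypothetical helper
    return external_ip, nat_table[key]
```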
I have been reading about Twisted, and I was curious whether Twisted takes advantage of multicore CPUs. Assume the system has thousands of users and packets arrive one right after the other. With Twisted, can the lookup-table operations take place at the same time on each core? I hear that with threads the GIL will not allow this anyway. Perhaps I could benefit from multiprocessing?
Nginx is asynchronous and happily serves thousands of users at the same time.
Using threads with Twisted is discouraged. It has very good performance when used asynchronously, but the code you write in request handlers must not block. So if your handler is a pretty big piece of code, break it up into smaller parts and use Twisted's famous Deferreds to attach the other parts via callbacks. It certainly requires somewhat different thinking than most programmers are used to, but it has benefits.

If the code has blocking parts, like database operations or accessing other resources over the network, try to find asynchronous libraries for those tasks too, so you can use Deferreds in those cases as well. If you can't find asynchronous libraries, you can fall back on the deferToThread function, which runs the function you want to call in a different thread, returns a Deferred for it, and fires your callback when it finishes. But it's better to treat that as a last resort, if nothing else can be done.
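For illustration, a minimal deferToThread sketch; blocking_db_lookup, handle_result, and handle_error are hypothetical stand-ins for your own code:

```python
from twisted.internet import threads

# deferToThread runs a blocking call in the reactor's thread pool
# and fires the resulting Deferred with its return value.
d = threads.deferToThread(blocking_db_lookup, key)  # hypothetical blocking function
d.addCallback(handle_result)                        # hypothetical success callback
d.addErrback(handle_error)                          # hypothetical failure callback
```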
Here is the official tutorial for Deferreds:
http://twistedmatrix.com/documents/10.1.0/core/howto/deferredindepth.html
And another nice guide, which can help to get used to think in "async mode":
http://ezyang.com/twisted/defer2.html
I have to write a CPU-bound server in Python, to distribute workloads for many cores. I want to use Twisted as the server (requests coming in via TCP).
Are there any better options - using Ampoule, perhaps? I also saw an article using Twisted's pb for communication, in conjunction with Popen - or perhaps combining it with multiprocessing?
Ampoule is a good building block for a multiprocess CPU-bound server. It uses the simpler AMP protocol, rather than PB (the complexity of which is usually not needed just to move job data into another process and then retrieve the results). It handles process creation, lifetime management, restarting, etc.
You generally want to avoid using Popen or the multiprocessing standard library module if you're using Twisted. They can cooperate, but they both present blocking-oriented APIs which somewhat defeat the purpose of using Twisted in the first place. Twisted's native child-process API, reactor.spawnProcess, is just as capable and avoids blocking. Ampoule is based on it.
Ampoule is not as widely used as multiprocessing, though. You may find it has some quirks in your development or deployment environments, but I don't think these will be obstacles you can't overcome. I developed a service which used Ampoule to distribute the work of parsing large quantities of HTML across multiple CPUs, and it eventually worked fine. If you do come across any problems, though, I encourage you to report them upstream! Eventually I would like to be able to say that Ampoule is as robust as anything (or more so) instead of attaching a disclaimer about its use. :)
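A sketch adapted from Ampoule's documented example; the exact API details (e.g. ProcessPool arguments, the need to call start() first) may vary between versions:

```python
from twisted.protocols import amp
from ampoule import child, pool

# An AMP command describing one unit of work.
class Sum(amp.Command):
    arguments = [('a', amp.Integer()), ('b', amp.Integer())]
    response = [('total', amp.Integer())]

# The code that runs in each child process.
class SumChild(child.AMPChild):
    @Sum.responder
    def sum(self, a, b):
        return {'total': a + b}

pp = pool.ProcessPool(SumChild)

def compute(a, b):
    # doWork returns a Deferred that fires with the response dict;
    # pp.start() must have been called (it also returns a Deferred).
    return pp.doWork(Sum, a=a, b=b)
```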
I was recently reading this document which lists a number of strategies that could be employed to implement a socket server. Namely, they are:
Serve many clients with each thread, and use nonblocking I/O and level-triggered readiness notification
Serve many clients with each thread, and use nonblocking I/O and readiness change notification
Serve many clients with each server thread, and use asynchronous I/O
Serve one client with each server thread, and use blocking I/O
Build the server code into the kernel
Now, I would appreciate a hint on which should be used in CPython, which we know has some good points and some bad points. I am mostly interested in performance under high concurrency, and yes, a number of the current implementations are too slow.
So if I may start with the easy one, "5" is out, as I am not going to be hacking anything into the kernel.
"4" Also looks like it must be out because of the GIL. Of course, you could use multiprocessing in place of threads here, and that does give a significant boost. Blocking IO also has the advantage of being easier to understand.
And here my knowledge wanes a bit:
"1" is traditional select or poll which could be trivially combined with multiprocessing.
"2" is the readiness-change notification, used by the newer epoll and kqueue
"3" I am not sure there are any kernel implementations for this that have Python wrappers.
So, in Python we have a bag of great tools like Twisted. Perhaps they are a better approach, though I have benchmarked Twisted and found it too slow on a multiple processor machine. Perhaps having 4 twisteds with a load balancer might do it, I don't know. Any advice would be appreciated.
asyncore is basically "1" - it uses select internally, and you just have one thread handling all requests. According to the docs it can also use poll. (EDIT: removed Twisted reference; I thought it used asyncore, but I was wrong.)
"2" might be implemented with python-epoll (Just googled it - never seen it before).
EDIT: (from the comments) In Python 2.6 the select module has epoll, kqueue, and kevent built in (on supported platforms), so you don't need any external libraries to do edge-triggered serving.
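For the curious, a minimal epoll echo server using only the stdlib (Linux, Python 2.6+), with error handling omitted; it registers level-triggered, and you'd or-in select.EPOLLET for edge-triggered mode, in which case you must recv() in a loop until EAGAIN:

```python
import select
import socket

server = socket.socket()
server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
server.bind(('', 8000))
server.listen(128)
server.setblocking(False)

ep = select.epoll()
ep.register(server.fileno(), select.EPOLLIN)
conns = {}

while True:
    for fd, events in ep.poll():
        if fd == server.fileno():
            conn, _ = server.accept()
            conn.setblocking(False)
            ep.register(conn.fileno(), select.EPOLLIN)
            conns[conn.fileno()] = conn
        else:
            data = conns[fd].recv(4096)
            if data:
                conns[fd].send(data)  # naive echo; a real server buffers writes
            else:
                ep.unregister(fd)
                conns.pop(fd).close()
```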
Don't rule out "4", as the GIL will be dropped when a thread is actually doing or waiting for IO-operations (most of the time probably). It doesn't make sense if you've got huge numbers of connections of course. If you've got lots of processing to do, then python may not make sense with any of these schemes.
For flexibility maybe look at Twisted?
In practice your problem boils down to how much processing you are going to do for requests. If you've got a lot of processing, and need to take advantage of multi-core parallel operation, then you'll probably need multiple processes. On the other hand if you just need to listen on lots of connections, then select or epoll, with a small number of threads should work.
How about "fork"? (I assume that is what the ForkingMixIn does) If the requests are handled in a "shared nothing" (other than DB or file system) architecture, fork() starts pretty quickly on most *nixes, and you don't have to worry about all the silly bugs and complications from threading.
Threads are a design illness forced on us by OSes with too-heavy-weight processes, IMHO. Cloning a page table with copy-on-write attributes seems a small price, especially if you are running an interpreter anyway.
Sorry I can't be more specific, but I'm more of a Perl-transitioning-to-Ruby programmer (when I'm not slaving over masses of Java at work).
Update: I finally did some timings on thread vs fork in my "spare time". Check it out:
http://roboprogs.com/devel/2009.04.html
Expanded:
http://roboprogs.com/devel/2009.12.html
One solution is gevent. gevent marries libevent-based event polling with lightweight cooperative task switching implemented by greenlet.
What you get is all the performance and scalability of an event system with the elegance and straightforward model of blocking-I/O programming.
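A minimal sketch of that blocking-style model, a gevent echo server:

```python
from gevent.server import StreamServer

# Each connection is handled in its own greenlet; the handler reads
# like plain blocking code while gevent's event loop multiplexes the I/O.
def handle(sock, address):
    f = sock.makefile(mode='rb')
    for line in f:
        sock.sendall(line)

StreamServer(('', 8000), handle).serve_forever()
```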
(I don't know what the SO convention on answering really old questions is, but I decided I'd still add my 2 cents.)
Can I suggest additional links?
cogen is a cross-platform library for network-oriented, coroutine-based programming using the enhanced generators from Python 2.5. On the main page of the cogen project there are links to several projects with a similar purpose.
I like Douglas' answer, but as an aside...
You could use a centralized dispatch thread/process that listens for readiness notifications using select and delegates to a pool of worker threads/processes to help accomplish your parallelism goals.
As Douglas mentioned, however, the GIL won't be held during most lengthy I/O operations (since no Python API things are happening), so if response latency is what you're concerned about, you can try moving the critical portions of your code to a C extension via the CPython API.
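A hedged sketch of the dispatch pattern described above: one thread select()s for readiness and hands ready sockets to a pool of workers; process() is a hypothetical stand-in for your (possibly C-level, GIL-releasing) processing, and a real server would re-register sockets after handling and deal with disconnects:

```python
import select
import socket
import threading
import queue  # "Queue" on Python 2

jobs = queue.Queue()

def worker():
    # Workers pull ready sockets off the queue and do the heavy lifting.
    while True:
        conn = jobs.get()
        data = conn.recv(4096)
        if data:
            conn.sendall(process(data))  # hypothetical processing step

for _ in range(4):
    threading.Thread(target=worker, daemon=True).start()

server = socket.socket()
server.bind(('', 8000))
server.listen(128)

clients = [server]
while True:
    ready, _, _ = select.select(clients, [], [])
    for s in ready:
        if s is server:
            conn, _ = server.accept()
            clients.append(conn)
        else:
            clients.remove(s)  # hand the socket to a worker
            jobs.put(s)
```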
http://docs.python.org/library/socketserver.html#asynchronous-mixins
As for multi-processor (multi-core) machines: with CPython, due to the GIL, you'll need at least one process per core to scale. As you say you need CPython, you might try benchmarking that with ForkingMixIn. With Linux 2.6 it might give some interesting results.
Another way is to use Stackless Python. That's how EVE solved it. But I understand that it's not always possible.