Implementing a multi-process server in Python, with Twisted

I have to write a CPU-bound server in Python, to distribute workloads for many cores. I want to use Twisted as the server (requests coming in via TCP).
Are there any better options - using Ampoule, perhaps? I also saw an article using Twisted's pb for communication, in conjunction with Popen - or perhaps combine it with multiprocessing?

Ampoule is a good building block for a multiprocess CPU-bound server. It uses the simpler AMP protocol, rather than PB (the complexity of which is usually not needed just to move job data into another process and then retrieve the results). It handles process creation, lifetime management, restarting, etc.
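For concreteness, here is a rough sketch of the Ampoule pattern, written from memory of its documented examples; the Sum command and the pool sizes are invented for illustration, so check names like ProcessPool and doWork against the Ampoule README before relying on them:

    from twisted.internet import defer, reactor
    from twisted.protocols import amp
    from ampoule import child, pool

    class Sum(amp.Command):
        # Hypothetical job definition: send two integers, get their sum back.
        arguments = [(b'a', amp.Integer()), (b'b', amp.Integer())]
        response = [(b'total', amp.Integer())]

    class SumChild(child.AMPChild):
        @Sum.responder
        def sum(self, a, b):
            # CPU-bound work happens here, in a separate worker process.
            return {'total': a + b}

    @defer.inlineCallbacks
    def main():
        pp = pool.ProcessPool(SumChild, min=1, max=4)
        yield pp.start()
        result = yield pp.doWork(Sum, a=3, b=4)
        print(result['total'])
        yield pp.stop()
        reactor.stop()

    reactor.callWhenRunning(main)
    reactor.run()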
You generally want to avoid using Popen or the multiprocessing standard library module if you're using Twisted. They can cooperate, but they both present blocking-oriented APIs which somewhat defeat the purpose of using Twisted in the first place. Twisted's native child process API, reactor.spawnProcess, is just as capable and avoids blocking. Ampoule is built on top of it.
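For reference, the bare spawnProcess API looks roughly like this; worker.py is a hypothetical stand-in for your job-processing script, and Ampoule layers pooling and job dispatch on top of this same mechanism:

    import sys
    from twisted.internet import protocol, reactor

    class WorkerProtocol(protocol.ProcessProtocol):
        def connectionMade(self):
            # Feed the job to the child's stdin, then close it.
            self.transport.write(b"some job data\n")
            self.transport.closeStdin()

        def outReceived(self, data):
            print("result from worker:", data)

        def processEnded(self, reason):
            reactor.stop()

    reactor.spawnProcess(WorkerProtocol(), sys.executable,
                         args=[sys.executable, "worker.py"], env=None)
    reactor.run()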
Ampoule is not as widely used as multiprocessing, though. You may find it to have some quirks in your development or deployment environments. I don't think these will be obstacles you can't overcome, though. I developed a service which used Ampoule to distribute the work of parsing large quantities of HTML across multiple CPUs and it eventually worked fine. If you do come across any problems though I encourage you to report them upstream! Eventually I would like to be able to say that Ampoule is as robust as anything (or more so) instead of attaching a disclaimer about its use. :)

Related

Threads vs Asynchronous Networking (Twisted) Python

I am writing an implementation of a NAT. My algorithm is as follows:
Packet comes in
Check against lookup table if external, add to lookup table if internal
Swap the source address and send the packet on its way
I have been reading about Twisted. I was curious whether Twisted takes advantage of multicore CPUs. Assume the system has thousands of users and one packet comes right after the other. With Twisted, can the lookup table operations take place at the same time on each core? I hear that with threads the GIL will not allow this anyway. Perhaps I could benefit from multiprocessing?
Nginx is asynchronous and happily serves thousands of users at the same time.
Using threads with Twisted is discouraged. It has very good performance when used asynchronously, but the code you write for the request handlers must not block. So if your handler is a pretty big piece of code, break it up into smaller parts and use Twisted's famous Deferreds to attach the other parts via callbacks. It certainly requires somewhat different thinking than most programmers are used to, but it has benefits. If the code has blocking parts, like database operations or accessing other resources over the network, try to find asynchronous libraries for those tasks too, so you can use Deferreds in those cases as well. If you can't find asynchronous libraries, you can fall back on the deferToThread function, which runs the function you want to call in a different thread, returns a Deferred for it, and fires your callback when it finishes; but it's better to treat that as a last resort, if nothing else can be done.
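As a tiny sketch of that last-resort pattern, with blocking_lookup standing in for whatever blocking call you are stuck with:

    import time
    from twisted.internet import reactor, threads

    def blocking_lookup(address):
        # Hypothetical stand-in for a blocking call (database query, etc.).
        time.sleep(1)
        return {"address": address, "mapped": "10.0.0.1"}

    def on_result(entry):
        print("lookup finished:", entry)

    def on_error(failure):
        print("lookup failed:", failure)

    d = threads.deferToThread(blocking_lookup, "192.168.1.5")  # runs in the reactor's thread pool
    d.addCallbacks(on_result, on_error)                        # callbacks fire back in the reactor thread
    reactor.callLater(3, reactor.stop)
    reactor.run()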
Here is the official tutorial for Deferreds:
http://twistedmatrix.com/documents/10.1.0/core/howto/deferredindepth.html
And another nice guide, which can help you get used to thinking in "async mode":
http://ezyang.com/twisted/defer2.html

Python - Waiting for input from lots of sockets

I'm working on a simple experiment in Python. I have a "master" process, in charge of all the others, and every single process has a connection via a Unix socket to the master process. I would like the master process to be able to monitor all of the sockets for a response - but there could theoretically be almost a hundred of them. How would threads impact the memory and performance of the application? What would be the best solution? Thanks a lot!
One hundred simultaneous threads might be pushing the reasonable limits of threading. If you find this is the cleanest way to organize your code, I'd say give it a try, but threading really doesn't scale very far.
What works better is to use a technique like select to wait for one of the sockets to become readable, writable, or to report an error. This mechanism lets you sleep until something interesting happens, handle as many sockets as have content to handle, and then go back to sleep again, all in a single thread of execution. Removing the multi-threading can often reduce the chances for errors, and this style of programming should get you into the hundreds of connections with no trouble. (If you want to go beyond about 100, I'd use the poll functionality instead of select -- constantly rebuilding the list of interesting file descriptors takes time that poll does not require.)
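A minimal single-threaded select loop for the master process might look like this; handle_message is a hypothetical placeholder for your application logic, and worker_sockets is assumed to be the list of already-connected sockets:

    import select

    def serve(worker_sockets):
        while worker_sockets:
            # Sleep until at least one socket is readable or has an error.
            readable, _, errored = select.select(worker_sockets, [], worker_sockets, 1.0)
            for sock in readable:
                data = sock.recv(4096)
                if data:
                    handle_message(sock, data)    # your application logic (placeholder)
                else:
                    worker_sockets.remove(sock)   # peer closed the connection
                    sock.close()
            for sock in errored:
                worker_sockets.remove(sock)
                sock.close()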
Something to consider is the Python Twisted Framework. They've gone to some length to provide a consistent way to hook callbacks onto events for this exact sort of programming. (If you're familiar with node.js, it's a bit like that, but Python.) I must admit a slight aversion to Twisted -- I never got very far in their documentation without being utterly baffled -- but a lot of people made it further in the docs than I did. You might find it a better fit than I have.
The easiest way to conduct comparative tests of threads versus processes for socket handling is to use the SocketServer module in Python's standard library. You can easily switch approaches (while keeping everything else the same) by inheriting from either ThreadingMixIn or ForkingMixIn. Here is a simple example to get you started.
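A minimal sketch of that idea, using the Python 3 spelling socketserver (in Python 2 the module is SocketServer):

    import socketserver

    class EchoHandler(socketserver.StreamRequestHandler):
        def handle(self):
            # Echo one line back, uppercased.
            line = self.rfile.readline().strip()
            self.wfile.write(line.upper() + b"\n")

    # Swap ThreadingMixIn for ForkingMixIn to compare threads against processes.
    class ThreadedTCPServer(socketserver.ThreadingMixIn, socketserver.TCPServer):
        allow_reuse_address = True

    if __name__ == "__main__":
        server = ThreadedTCPServer(("localhost", 9999), EchoHandler)
        server.serve_forever()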
Another alternative is a select/poll approach using non-blocking sockets in a single process and a single thread.
If you're interested in software that is already fully developed and highly evolved, consider these high-performance Python-based server packages:
The Twisted framework uses the async single-process, single-thread style.
The Tornado framework is similar (less evolved, less full-featured, but easier to understand).
And Gunicorn, which is a high-performance forking server.

Twisted Threading + MapReduce on a single node/server?

I'm confused about Twisted threading.
I've read more than a few articles and books, and sat through a few presentations, on the subject of threading vs processes in Python. It just seems to me that unless one is doing lots of IO or wants to use shared memory across jobs, the right choice is to use multiprocessing.
However, from what I've seen so far, it seems like Twisted uses threads (pthreads, via the Python threading module). And Twisted seems to perform really, really well when processing lots of data.
I've got a fairly large number of processes that I'd like to distribute processing to using the MapReduce pattern in Python on a single node/server. They don't do any IO really, they just do a lot of processing.
Is the Twisted reactor the right tool for this job?
The short answer to your question: no, twisted threading is not the right solution for heavy processing.
If you have a lot of processing to do, Twisted's threading will still be subject to the GIL (Global Interpreter Lock). Without going into a long, in-depth explanation, the GIL is what allows only one thread at a time to execute Python code. What this means in effect is that you will not be able to take advantage of multiple cores with a single multi-threaded Twisted process. That said, some C modules (such as bits of SciPy) can release the GIL and run multi-threaded, though the associated Python code is still effectively single-threaded.
What Twisted's threading is mainly useful for is working alongside blocking I/O-based modules. A prime example of this is database APIs, because the DB-API spec doesn't account for asynchronous use cases, and most database modules adhere to the spec. Thus, to use PostgreSQL, for example, from a Twisted app, one has to either block or use something like twisted.enterprise.adbapi, which is a wrapper that uses twisted.internet.threads.deferToThread to let a SQL query execute while other stuff is going on. This can allow other Python code to run because the socket module (among most others involving operating-system I/O) releases the GIL while in a system call.
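For illustration, the adbapi pattern looks roughly like this; the connection parameters and table name are placeholders:

    from twisted.enterprise import adbapi
    from twisted.internet import reactor

    # runQuery runs the blocking query in a thread pool, so the reactor
    # stays responsive while the database call is in flight.
    dbpool = adbapi.ConnectionPool("psycopg2", database="mydb", user="me")

    def show(rows):
        print(rows)
        reactor.stop()

    dbpool.runQuery("SELECT count(*) FROM jobs").addCallback(show)
    reactor.run()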
That said, you can use Twisted to write a network application talking to many Twisted (or non-Twisted, if you'd like) workers. Each worker could then work on little bits of work, and you would not be restricted by the GIL, because each worker would be its own completely isolated process. The master process can then make use of many of Twisted's asynchronous primitives. For example, you could use a DeferredList to wait on a number of results coming from any number of workers, and then run a response handler when all of the Deferreds complete (thus allowing you to do your map call). If you want to go down this route, I recommend looking at twisted.protocols.amp, which is their Asynchronous Messaging Protocol and can be used very straightforwardly to implement network-based RPC or map-reduce.
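The map step with a DeferredList might look something like this; send_to_worker is a hypothetical function that returns a Deferred for one worker's numeric result (for instance an AMP callRemote under the hood):

    from twisted.internet import defer

    def map_reduce(chunks, send_to_worker):
        # Fan the chunks out to the workers; each call returns a Deferred.
        deferreds = [send_to_worker(chunk) for chunk in chunks]
        dl = defer.DeferredList(deferreds, consumeErrors=True)

        def reduce_results(results):
            # results is a list of (success, value) pairs in the original order;
            # this placeholder reduce just sums the successful values.
            return sum(value for success, value in results if success)

        dl.addCallback(reduce_results)
        return dl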
The downsides of running many disparate processes versus something like multiprocessing are that:
1) you lose out on simple process management, and
2) the subprocesses can't share memory as they would if they were forked on a Unix system.
Though for modern systems, 2) is rarely a problem unless you are running hundreds of subprocesses. And problem 1) can be solved by using a process management system like supervisord.
Edit: For more on Python and the GIL, you should watch Dave Beazley's talks on the subject (website, video, slides).

having to run multiple instances of a web service for ruby/python seems like a hack to me

Is it just me or is having to run multiple instances of a web server to scale a hack?
Am I wrong in this?
Clarification
I am referring to how I read people run multiple instances of a web service on a single server. I am not talking about a cluster of servers.
Not really; people were running multiple frontends across a cluster of servers before multicore CPUs became widespread.
So all the infrastructure for supporting sessions properly across multiple frontends had been in place for quite some time before it became really advantageous to run a bunch of threads on one machine.
In fact, using asynchronous-style frontends gives better performance on the same hardware than a multi-threaded approach, so I would say that not running multiple instances in favour of a multi-threaded monster is the real hack.
Since we are now moving towards more cores rather than faster processors, scaling further and further will mean running more instances.
So yes, I reckon you are wrong.
This does not by any means condone brain-dead programming with the excuse that you can just scale it horizontally; that attitude just seems reckless.
With no details, it is very difficult to see what you are getting at. That being said, it is quite possible that you are simply not using the right approach for your problem.
Sometimes multiple separate instances are better. Sometimes, your Python services are actually better deployed behind a single Apache instance (using mod_wsgi), which may elect to use more than a single process. I don't know enough about Ruby to offer an opinion there.
In short, if you want to make your service scalable, the way to do so depends heavily on additional details. Is it scaling up or scaling out? What is the operating system, and what server software is available or installable? Is the service itself easily parallelized, and how database-dependent is it? How is the database deployed?
Even if the Ruby/Python interpreters were perfect and could utilize all available CPU with a single process, you would still reach the maximum capacity of a single server sooner or later and have to scale across several machines, going back to running several instances of your app.
I would hesitate to say that the issue is a "hack". Or indeed that threaded solutions are necessarily superior.
The situation is a result of design decisions used in the interpreters of languages like Ruby and Python.
I work with Ruby, so the details may be different for other languages.
BUT ... essentially, Ruby uses a Global Interpreter Lock to prevent threading issues:
http://en.wikipedia.org/wiki/Global_Interpreter_Lock
The side-effect of this is that, to achieve concurrency with frameworks like Rails, rather than relying on multiple threads within the VM, we use multiple processes, each with its own interpreter and its own instance of your framework and application code.
Each instance of the app handles a single request at a time. To achieve concurrency we have to spin up multiple instances.
In the olden days (2-3 years ago) we would run multiple Mongrel (or similar) instances behind a proxy (generally Apache). Passenger changed some of this because it is smart enough to manage the processes itself, rather than requiring manual setup. You tell Passenger how many processes it can use and off it goes.
The whole structure is actually not as bad as the thread-orthodoxy would have you believe. For a start, it's pretty easy to make this type of architecture work in a multicore environment. Any modern database is designed to handle highly concurrent loads, so having multiple processes has very little if any effect at that level.
If you use a language like JRuby you can deploy into a threaded app server like Tomcat and have a deployment that looks much more "java-like". However, this is not as big a win as you might think, because now your application needs to be much more thread-aware and you can see side effects and strangeness from threading issues.
Your assumption that Tomcat's and IIS's single-process-per-server model is superior is flawed. The choice between a multi-threaded server and a multi-process server depends on a lot of variables.
One main thing is the underlying operating system. Unix systems have always had great support for multi-processing because of the copy-on-write nature of the fork system call. This makes multiple processes a really attractive option, because web serving is usually very shared-nothing and you don't have to worry about locking. Windows, on the other hand, had much heavier processes and lighter threads, so programs like IIS would gravitate to a multi-threading model.
As for the question of whether it's a hack to run multiple servers, it really depends on your perspective. If you look at Apache, it comes with a variety of pluggable engines to choose from. The MPM-prefork one is the default because it allows the programmer to easily use non-thread-safe C/Perl/database libraries without having to throw locks and semaphores all over the place. To some, that might be a hack to work around poorly implemented libraries. To me, it's a brilliant way of leaving it to the OS to handle the problems and letting me get back to work.
Also a multi-process model comes with a few features that would be very difficult to implement in a multi-threaded server. Because they are just processes, zero-downtime rolling-updates are trivial. You can do it with a bash script.
It also has its shortcomings. In a single-process model, setting up a singleton that holds some global state is trivial, while in a multi-process model you have to serialize that state to a database or Redis server. (Of course, if your single-process server outgrows a single machine you'll have to do that anyway.)
Is it a hack? Yes and no. Both original implementations (MRI and CPython) have Global Interpreter Locks that will prevent a multi-core server from operating at its full potential. On the other hand, multi-process has its advantages (especially on the Unix side of the fence).
There's also nothing inherent in the languages themselves that makes them require a GIL, so you can run your application with Jython, JRuby, IronPython or IronRuby if you really want to share state inside a single process.

Writing a socket-based server in Python, recommended strategies?

I was recently reading this document which lists a number of strategies that could be employed to implement a socket server. Namely, they are:
Serve many clients with each thread, and use nonblocking I/O and level-triggered readiness notification
Serve many clients with each thread, and use nonblocking I/O and readiness change notification
Serve many clients with each server thread, and use asynchronous I/O
Serve one client with each server thread, and use blocking I/O
Build the server code into the kernel
Now, I would appreciate a hint on which should be used in CPython, which we know has some good points, and some bad points. I am mostly interested in performance under high concurrency, and yes a number of the current implementations are too slow.
So if I may start with the easy one, "5" is out, as I am not going to be hacking anything into the kernel.
"4" Also looks like it must be out because of the GIL. Of course, you could use multiprocessing in place of threads here, and that does give a significant boost. Blocking IO also has the advantage of being easier to understand.
And here my knowledge wanes a bit:
"1" is traditional select or poll which could be trivially combined with multiprocessing.
"2" is the readiness-change notification, used by the newer epoll and kqueue
"3" I am not sure there are any kernel implementations for this that have Python wrappers.
So, in Python we have a bag of great tools like Twisted. Perhaps they are a better approach, though I have benchmarked Twisted and found it too slow on a multi-processor machine. Perhaps having four Twisted processes behind a load balancer might do it, I don't know. Any advice would be appreciated.
asyncore is basically "1" - it uses select internally, and you just have one thread handling all requests. According to the docs it can also use poll. (EDIT: Removed the Twisted reference; I thought it used asyncore, but I was wrong.)
"2" might be implemented with python-epoll (Just googled it - never seen it before).
EDIT: (from the comments) In Python 2.6 the select module has epoll, kqueue and kevent built in (on supported platforms), so you don't need any external libraries to do edge-triggered serving.
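A bare-bones epoll echo server (Linux only), as a sketch of the readiness-notification style; the mode shown is the default level-triggered one, and passing select.EPOLLET on register switches to edge-triggered, in which case you must drain each socket completely on every event:

    import select
    import socket

    listener = socket.socket()
    listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    listener.bind(("", 8000))
    listener.listen(128)
    listener.setblocking(False)

    ep = select.epoll()
    ep.register(listener.fileno(), select.EPOLLIN)
    connections = {}

    while True:
        for fd, events in ep.poll(1.0):
            if fd == listener.fileno():
                conn, _ = listener.accept()
                conn.setblocking(False)
                ep.register(conn.fileno(), select.EPOLLIN)
                connections[conn.fileno()] = conn
            elif events & select.EPOLLIN:
                conn = connections[fd]
                data = conn.recv(4096)
                if data:
                    conn.send(data)                  # trivial echo (ignores partial sends)
                else:
                    ep.unregister(fd)
                    connections.pop(fd).close()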
Don't rule out "4", as the GIL will be dropped when a thread is actually doing or waiting for IO-operations (most of the time probably). It doesn't make sense if you've got huge numbers of connections of course. If you've got lots of processing to do, then python may not make sense with any of these schemes.
For flexibility maybe look at Twisted?
In practice your problem boils down to how much processing you are going to do for requests. If you've got a lot of processing, and need to take advantage of multi-core parallel operation, then you'll probably need multiple processes. On the other hand if you just need to listen on lots of connections, then select or epoll, with a small number of threads should work.
How about "fork"? (I assume that is what the ForkingMixIn does) If the requests are handled in a "shared nothing" (other than DB or file system) architecture, fork() starts pretty quickly on most *nixes, and you don't have to worry about all the silly bugs and complications from threading.
Threads are a design illness forced on us by OSes with too-heavy-weight processes, IMHO. Cloning a page table with copy-on-write attributes seems a small price, especially if you are running an interpreter anyway.
Sorry I can't be more specific, but I'm more of a Perl-transitioning-to-Ruby programmer (when I'm not slaving over masses of Java at work)
Update: I finally did some timings on thread vs fork in my "spare time". Check it out:
http://roboprogs.com/devel/2009.04.html
Expanded:
http://roboprogs.com/devel/2009.12.html
One solution is gevent. Gevent marries libevent-based event polling with lightweight cooperative task switching implemented by greenlet.
What you get is all the performance and scalability of an event system with the elegance and straightforward model of blocking IO programming.
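A minimal sketch of what that looks like in practice; each connection gets its own greenlet, and the handler reads like plain blocking code:

    from gevent.server import StreamServer

    def handle(sock, address):
        # Looks like blocking I/O, but only this greenlet is suspended while waiting.
        f = sock.makefile(mode="rb")
        for line in f:
            sock.sendall(line.upper())

    server = StreamServer(("0.0.0.0", 8000), handle)
    server.serve_forever()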
(I don't know what the SO convention on answering really old questions is, but I decided to add my 2 cents anyway.)
Can I suggest additional links?
cogen is a cross-platform library for network-oriented, coroutine-based programming using the enhanced generators from Python 2.5. On the main page of the cogen project there are links to several projects with a similar purpose.
I like Douglas' answer, but as an aside...
You could use a centralized dispatch thread/process that listens for readiness notifications using select and delegates to a pool of worker threads/processes to help accomplish your parallelism goals.
As Douglas mentioned, however, the GIL won't be held during most lengthy I/O operations (since no Python-API things are happening), so if it's response latency you're concerned about, you can try moving the critical portions of your code into a C extension using the CPython API.
http://docs.python.org/library/socketserver.html#asynchronous-mixins
As for multi-processor (multi-core) machines: with CPython, due to the GIL, you'll need at least one process per core to scale. As you say that you need CPython, you might try to benchmark that with ForkingMixIn. With Linux 2.6 it might give some interesting results.
Other way is to use Stackless Python. That's how EVE solved it. But I understand that it's not always possible.
