How does a Python web server overcome the GIL?

Any web server may have to handle many requests at the same time. Since the Python interpreter is constrained by the GIL, how is concurrency implemented?
Do they use multiple processes with IPC for state sharing?

You usually have many workers (e.g. with gunicorn), each dispatched independent requests. Everything else (the concurrency-related parts) is handled by the database, so it is abstracted away from you.
You don't need IPC; you just need a "single source of truth", which will be the RDBMS, a cache server (Redis, memcached), etc.
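As a concrete sketch, here is a minimal gunicorn config along those lines; the module path myapp:app and the worker count are illustrative assumptions, not something from the question:

```python
# gunicorn.conf.py - a minimal sketch; "myapp:app" and the worker count
# are illustrative placeholders.
import multiprocessing

bind = "127.0.0.1:8000"
# Each worker is a separate OS process with its own interpreter (and its own GIL).
workers = multiprocessing.cpu_count() * 2 + 1
# The default "sync" worker handles one request at a time per process.
worker_class = "sync"
```

Run it with `gunicorn -c gunicorn.conf.py myapp:app`; shared state lives in the database or cache, not in the worker processes.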

First of all, requests can be handled independently. However, servers want to handle them concurrently in order to keep throughput (the number of requests handled per unit of time) at a maximum.
The implementation of this concept of concurrency depends on the webserver.
Some implementations have a fixed number of threads or processes for handling requests. If all are in use, additional requests have to wait until a worker becomes free.
Another possibility is that a process or thread is spawned for each request. Spawning a process per request leads to absurd memory and CPU overhead. Spawning lightweight threads is better; doing so, you can serve hundreds of clients per second. However, threads also bring their own management overhead, which manifests itself in high memory and CPU consumption.
For serving thousands of clients per second, an event-driven architecture based on asynchronous coroutines is a state-of-the-art solution. It enables the server to serve clients at a high rate without spawning zillions of threads. On the Wikipedia page for the so-called C10k problem you will find a list of web servers, many of which use this architecture.
Coroutines are available for Python, too. Have a look at http://www.gevent.org/. That's why a Python WSGI app based on, e.g., uWSGI + gevent is an extremely performant solution.
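For a taste, here is a minimal gevent WSGI server sketch (my own illustration, not code from the answer; the address and response body are placeholders):

```python
# Minimal gevent-based WSGI server sketch.
from gevent import monkey
monkey.patch_all()          # make blocking socket calls cooperative

from gevent.pywsgi import WSGIServer

def application(environ, start_response):
    # Each request runs in its own greenlet; blocking I/O yields to the event loop.
    start_response('200 OK', [('Content-Type', 'text/plain')])
    return [b'hello\n']

WSGIServer(('127.0.0.1', 8000), application).serve_forever()
```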

As normal. Web serving is mostly I/O-bound, and the GIL is released during I/O operations. So either threading is used without any special accommodations, or an event loop (such as Twisted) is used.
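To illustrate why plain threads are enough for I/O-bound work, here is a small sketch (mine, not the answerer's) where fetches overlap because the GIL is released inside the blocking calls; the URLs are placeholders:

```python
# Sketch: threads overlap while waiting on I/O because the GIL is released
# inside blocking socket/HTTP calls.
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

URLS = ["http://example.com/"] * 8  # placeholder workload

def fetch(url):
    with urlopen(url) as resp:       # GIL released while waiting on the socket
        return len(resp.read())

with ThreadPoolExecutor(max_workers=8) as pool:
    print(list(pool.map(fetch, URLS)))
```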

Related

tornado websocket server - connections queue

I have a tornado.websocket.WebSocketHandler which processes data. The idea is to instantiate a limited number of handlers (e.g. so they are bounded by the number of CPU cores). I would like to put the rest of the connections in a queue (as soon as they are opened) so one of them is activated when another finishes.
I was trying to do that via threading.Semaphore, but it seems that Tornado socket handlers run in a single thread, so everything hangs. How can I achieve this?
Tornado has its own asynchronous semaphore class in tornado.locks.Semaphore.
Tornado is designed to make connections very cheap - one connection per core would be an extremely low limit. I suggest not limiting the number of connections per se, but limiting what you do with those connections (and remember the GIL - unless you're calling out to C extensions for your CPU-intensive work, you can't make use of multiple CPU cores from Python anyway). Doing your CPU-intensive work on a bounded ThreadPoolExecutor may be the best way to do what it sounds like you're trying to do.
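A rough sketch of that last suggestion, assuming Tornado 5+ and a plain RequestHandler rather than the asker's WebSocketHandler; crunch and the pool size are made-up placeholders:

```python
# Accept all connections, but bound the CPU-heavy part with a fixed-size
# ThreadPoolExecutor.
from concurrent.futures import ThreadPoolExecutor
import tornado.ioloop
import tornado.web

executor = ThreadPoolExecutor(max_workers=4)   # roughly one per core

def crunch(payload):
    # CPU-bound work; the GIL still serializes pure-Python code, so real
    # parallelism needs C extensions or separate processes.
    return payload.upper()

class Handler(tornado.web.RequestHandler):
    async def get(self):
        result = await tornado.ioloop.IOLoop.current().run_in_executor(
            executor, crunch, self.get_argument("data", ""))
        self.write(result)

app = tornado.web.Application([(r"/", Handler)])
app.listen(8888)
tornado.ioloop.IOLoop.current().start()
```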

Should I use epoll or just blocking recv in threads?

I'm trying to write a scalable custom web server.
Here's what I have so far:
The main loop and request interpreter are in Cython. The main loop accepts connections and assigns the sockets to one of the processes in the pool (has to be processes, threads won't get any benefit from multi-core hardware because of the GIL).
Each process has a thread pool. The process assigns the socket to a thread.
The thread calls recv (blocking) on the socket and waits for data. When some shows up, it gets piped into the request interpreter, and then sent via WSGI to the application running in that thread.
Now I've heard about epoll and am a little confused. Is there any benefit to using epoll to get socket data and then pass that directly to the processes? Or should I just go the usual route of having each thread wait on recv?
PS: What is epoll actually used for? It seems like multithreading and blocking fd calls would accomplish the same thing.
If you're already using multiple threads, epoll doesn't offer you much additional benefit.
The point of epoll is that a single thread can listen for activity on many file descriptors simultaneously (and respond to events on each as they occur), and thus provide event-driven multitasking without requiring the spawning of additional threads. Threads are relatively cheap (compared to spawning processes), but each one does require some overhead (after all, they each have to maintain a call stack).
If you wanted to, you could rewrite your pool processes to be single-threaded using epoll, which would reduce your overall thread usage count, but of course you'd have to consider whether that's something you care about or not - in general, for low numbers of simultaneous requests on each worker, the overhead of spawning threads wouldn't matter, but if you want each worker to be able to handle 1000s of open connections, that overhead can become significant (and that's where epoll shines).
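For illustration, here is a single-threaded sketch using the standard-library selectors module (which sits on top of epoll on Linux): one thread, many sockets, no locks. It's a simple echo server, not the asker's design:

```python
# Single-threaded event loop sketch using selectors (epoll on Linux).
import selectors
import socket

sel = selectors.DefaultSelector()

def accept(server_sock):
    conn, _ = server_sock.accept()
    conn.setblocking(False)
    sel.register(conn, selectors.EVENT_READ, handle)

def handle(conn):
    data = conn.recv(4096)
    if data:
        conn.sendall(data)            # echo the data back
    else:
        sel.unregister(conn)
        conn.close()

server = socket.socket()
server.bind(("127.0.0.1", 9000))
server.listen()
server.setblocking(False)
sel.register(server, selectors.EVENT_READ, accept)

while True:
    for key, _ in sel.select():       # one thread, many sockets
        key.data(key.fileobj)
```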
But...
What you're describing sounds suspiciously like you're basically reinventing the wheel - your:
main loop and request interpreter
pool of processes
sounds almost exactly like:
nginx (or any other load balancer/reverse proxy)
A pre-forking tornado app
Tornado is a single-threaded Python web server module using epoll, and it has pre-forking built in (meaning that it spawns multiple copies of itself as separate processes, effectively creating a process pool). Tornado is based on the tech created to power FriendFeed - they needed a way to handle huge numbers of open connections for long-polling clients looking for new real-time updates.
If you're doing this as a learning process, then by all means, reinvent away! It's a great way to learn. But if you're actually trying to build an application on top of these kinds of things, I'd highly recommend considering using the existing, stable, communally-developed projects - it'll save you a lot of time, false starts, and potential gotchas.
(P.S. I approve of your avatar. <3)
The epoll function (and the other functions in the same family, poll and select) allows you to write single-threaded networking code that manages multiple network connections. Since there is no threading, there is no need for synchronisation as would be required in a multi-threaded program (which can be difficult to get right).
On the other hand, you'll need an explicit state machine for each connection. In a threaded program, this state machine is implicit.
These functions just offer another way to multiplex multiple connections within a process. Sometimes it is easier not to use threads; other times you're already using threads, and then it is easier to just use blocking sockets (which release the GIL in Python).

Twisted Threading + MapReduce on a single node/server?

I'm confused about Twisted threading.
I've heard and read more than a few articles and books, and sat through a few presentations, on the subject of threading vs. processes in Python. It just seems to me that unless one is doing lots of I/O or wanting to use shared memory across jobs, the right choice is multiprocessing.
However, from what I've seen so far, it seems like Twisted uses threads (pthreads from the Python threading module). And Twisted seems to perform really, really well at processing lots of data.
I've got a fairly large number of processes that I'd like to distribute processing to using the MapReduce pattern in Python on a single node/server. They don't do any IO really, they just do a lot of processing.
Is the Twisted reactor the right tool for this job?
The short answer to your question: no, twisted threading is not the right solution for heavy processing.
If you have a lot of processing to do, Twisted's threading will still be subject to the GIL (Global Interpreter Lock). Without going into a long, in-depth explanation, the GIL is what allows only one thread at a time to execute Python code. What this means in effect is that you will not be able to take advantage of multiple cores with a single multi-threaded Twisted process. That said, some C modules (such as parts of SciPy) can release the GIL and run multi-threaded, though the associated Python code is still effectively single-threaded.
What Twisted's threading is mainly useful for is combining it with blocking I/O based modules. A prime example of this is database APIs, because the db-api spec doesn't account for asynchronous use cases, and most database modules adhere to the spec. Thus, to use PostgreSQL from a Twisted app, for example, one has to either block or use something like twisted.enterprise.adbapi, which is a wrapper that uses twisted.internet.threads.deferToThread to allow a SQL query to execute while other stuff is going on. This can allow other Python code to run because the socket module (among most others involving operating system I/O) releases the GIL while in a system call.
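A minimal sketch of deferToThread keeping the reactor responsive while a blocking call runs; slow_query below is a stand-in, not a real db-api call:

```python
# Defer a blocking call to Twisted's thread pool so the reactor keeps running.
from twisted.internet import reactor, threads

def slow_query():
    import time
    time.sleep(1)          # pretend blocking database work
    return 42

def done(result):
    print("got", result)
    reactor.stop()

d = threads.deferToThread(slow_query)
d.addCallback(done)
reactor.run()
```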
That said, you can use Twisted to write a network application talking to many Twisted (or non-Twisted, if you'd like) workers. Each worker could then work on little bits of work, and you would not be restricted by the GIL, because each worker would be its own completely isolated process. The master process can then make use of many of Twisted's asynchronous primitives. For example, you could use a DeferredList to wait on a number of results coming from any number of workers, and then run a response handler when all of the Deferreds complete (thus letting you perform your map step). If you want to go down this route, I recommend looking at twisted.protocols.amp, which is their Asynchronous Message Protocol and can be used very trivially to implement a network-based RPC or map-reduce.
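A small sketch of that map/reduce idea with DeferredList; here the "workers" are faked with deferToThread, whereas in the architecture described above each Deferred would come back from an AMP call to a separate worker process:

```python
# Fire off several pieces of work, then reduce once all results are in.
from twisted.internet import reactor, threads
from twisted.internet.defer import DeferredList

def work(chunk):
    return sum(chunk)                      # stand-in for a worker's job

chunks = [[1, 2], [3, 4], [5, 6]]
dl = DeferredList([threads.deferToThread(work, c) for c in chunks])

def reduce_step(results):
    # results is a list of (success, value) pairs
    print("total:", sum(value for ok, value in results if ok))
    reactor.stop()

dl.addCallback(reduce_step)
reactor.run()
```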
The downsides of running many disparate processes versus something like multiprocessing are that:
1) you lose out on simple process management, and
2) the subprocesses can't share memory as they would if they were forked on a Unix system.
For modern systems, though, 2) is rarely a problem unless you are running hundreds of subprocesses, and problem 1) can be solved by using a process management system like supervisord.
Edit: For more on Python and the GIL, you should watch Dave Beazley's talks on the subject (website, video, slides).

how to process long-running requests in python workers?

I have a Python (well, it's PHP now but we're rewriting) function that takes some parameters (A and B) and computes some results (finds the best path from A to B in a graph; the graph is read-only). In a typical scenario one call takes 0.1 s to 0.9 s to complete. This function is accessed by users as a simple REST web service (GET bestpath.php?from=A&to=B). The current implementation is quite stupid - it's a simple PHP script + Apache + mod_php + APC; every request needs to load all the data (over 12 MB in PHP arrays), create all structures, compute a path, and exit. I want to change it.
I want a setup with N independent workers (X per server, with Y servers); each worker is a Python app running in a loop (get request -> process -> send reply -> get request...), and each worker can process one request at a time. I need something that will act as a frontend: get requests from users, manage a queue of requests (with a configurable timeout), and feed my workers one request at a time.
How do I approach this? Can you propose a setup? nginx + FastCGI or WSGI or something else? HAProxy? As you can see, I'm a newbie in Python, reverse proxies, etc. I just need a starting point for the architecture (and data flow).
By the way, the workers use read-only data, so there is no need for locking or communication between them.
The typical way to handle this sort of arrangement using threads in Python is to use the standard library module Queue. An example of using the Queue module for managing workers can be found here: Queue Example
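For illustration, here is a bare-bones worker pool on top of queue.Queue (the payloads and pool size are placeholders, not from the question):

```python
# Minimal worker-pool sketch with the standard library queue module
# (spelled "Queue" in Python 2).
import queue
import threading

tasks = queue.Queue()

def worker():
    while True:
        a, b = tasks.get()
        try:
            print(f"best path {a} -> {b}")   # stand-in for the path computation
        finally:
            tasks.task_done()

for _ in range(4):
    threading.Thread(target=worker, daemon=True).start()

for req in [("A", "B"), ("C", "D")]:
    tasks.put(req)
tasks.join()
```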
Looks like you need the "workers" to be separate processes (at least some of them, and therefore you might as well make them all separate processes rather than bunches of threads divided into several processes). The multiprocessing module in the standard library of Python 2.6 and later offers good facilities to spawn a pool of processes and communicate with them via FIFO "queues"; if for some reason you're stuck with Python 2.5 or even earlier, there are versions of multiprocessing on the PyPI repository that you can download and use with those older versions of Python.
The "frontend" can and should be pretty easily made to run with WSGI (with either Apache or Nginx), and it can deal with all communications to/from worker processes via multiprocessing, without the need to use HTTP, proxying, etc, for that part of the system; only the frontend would be a web app per se, the workers just receive, process and respond to units of work as requested by the frontend. This seems the soundest, simplest architecture to me.
There are other distributed processing approaches available in third party packages for Python, but multiprocessing is quite decent and has the advantage of being part of the standard library, so, absent other peculiar restrictions or constraints, multiprocessing is what I'd suggest you go for.
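A sketch of that multiprocessing arrangement, with the read-only graph loaded once per worker process; load_graph and best_path are hypothetical names standing in for your real code:

```python
# Pool of worker processes, each loading the read-only graph once at startup.
import multiprocessing

def load_graph():
    # placeholder for reading the ~12 MB read-only graph
    return {"A": ["B"], "C": ["D"]}

def init_worker():
    global GRAPH
    GRAPH = load_graph()          # loaded once per worker process

def best_path(req):
    a, b = req
    # placeholder: a real implementation would search GRAPH here
    return (a, b, GRAPH.get(a, []))

if __name__ == "__main__":
    with multiprocessing.Pool(processes=4, initializer=init_worker) as pool:
        for result in pool.imap_unordered(best_path, [("A", "B"), ("C", "D")]):
            print(result)
```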
There are many FastCGI modules with a preforked mode and a WSGI interface for Python around; the best known is flup. My personal preference for such a task is superfcgi with nginx. Both will launch several processes and dispatch requests to them. 12 MB is not that much to load separately in each process, but if you'd like to share data among workers you need threads, not processes. Note that heavy math in Python with a single process and many threads won't use several CPUs/cores efficiently because of the GIL. Probably the best approach is to use several processes (as many as you have cores), each running several threads (the default mode in superfcgi).
The simplest solution in this case is to use the web server to do all the heavy lifting. Why should you handle threads and/or processes when the web server will do all that for you?
The standard arrangement in Python deployments is:
1. The web server starts a number of processes, each running a complete Python interpreter and loading all your data into memory.
2. An HTTP request comes in and gets dispatched to one of the processes.
3. The process does your calculation and returns the result directly to the web server and user.
4. When you need to change your code or the graph data, you restart the web server and go back to step 1.
This is the architecture used by Django and other popular web frameworks.
I think you can configure mod_wsgi/Apache so it will have several "hot" Python interpreters in separate processes ready to go at all times, and also reuse them for new requests (and spawn a new one if they are all busy). In this case you could load all the preprocessed data as module globals; they would only get loaded once per process and be reused for each new request. In fact, I'm not sure this isn't the default configuration for mod_wsgi/Apache.
The main problem here is that you might end up consuming a lot of "core" memory (but that may not be a problem either).
I think you can also configure mod_wsgi for a single process with multiple threads, but in that case you may only be using one CPU because of the Python Global Interpreter Lock (the infamous GIL), I think.
Don't be afraid to ask on the mod_wsgi mailing list - they are very responsive and friendly.
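To illustrate the module-globals idea, here is a tiny WSGI app sketch (the names are illustrative) where the data is loaded once per process at import time and reused by every request:

```python
# myapp.py - sketch of "load once per process" for mod_wsgi daemon mode
# (or any preforking WSGI server).

def load_graph():
    # Expensive; happens once when the module is imported, i.e. once per
    # worker process, then the result is reused for every request.
    return {"A": ["B", "C"]}

GRAPH = load_graph()

def application(environ, start_response):
    start_response("200 OK", [("Content-Type", "text/plain")])
    # Each request just reads the already-loaded structure.
    return [repr(GRAPH.get("A", [])).encode("utf-8")]
```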
You could use an nginx load balancer to proxy to PythonPaste's paster (which serves WSGI, for example Pylons), which launches each request in a separate thread anyway.
Another option is a queue table in the database.
The worker processes run in a loop or off cron and poll the queue table for new jobs.

Writing a socket-based server in Python, recommended strategies?

I was recently reading this document which lists a number of strategies that could be employed to implement a socket server. Namely, they are:
Serve many clients with each thread, and use nonblocking I/O and level-triggered readiness notification
Serve many clients with each thread, and use nonblocking I/O and readiness change notification
Serve many clients with each server thread, and use asynchronous I/O
Serve one client with each server thread, and use blocking I/O
Build the server code into the kernel
Now, I would appreciate a hint on which should be used in CPython, which we know has some good points, and some bad points. I am mostly interested in performance under high concurrency, and yes a number of the current implementations are too slow.
So if I may start with the easy one, "5" is out, as I am not going to be hacking anything into the kernel.
"4" Also looks like it must be out because of the GIL. Of course, you could use multiprocessing in place of threads here, and that does give a significant boost. Blocking IO also has the advantage of being easier to understand.
And here my knowledge wanes a bit:
"1" is traditional select or poll which could be trivially combined with multiprocessing.
"2" is the readiness-change notification, used by the newer epoll and kqueue
"3" I am not sure there are any kernel implementations for this that have Python wrappers.
So, in Python we have a bag of great tools like Twisted. Perhaps they are a better approach, though I have benchmarked Twisted and found it too slow on a multi-processor machine. Perhaps having four Twisted instances with a load balancer might do it, I don't know. Any advice would be appreciated.
asyncore is basically "1" - It uses select internally, and you just have one thread handling all requests. According to the docs it can also use poll. (EDIT: Removed Twisted reference, I thought it used asyncore, but I was wrong).
"2" might be implemented with python-epoll (Just googled it - never seen it before).
EDIT: (from the comments) In Python 2.6 the select module has epoll, kqueue and kevent built in (on supported platforms). So you don't need any external libraries to do edge-triggered serving.
Don't rule out "4", as the GIL will be dropped when a thread is actually doing or waiting for IO-operations (most of the time probably). It doesn't make sense if you've got huge numbers of connections of course. If you've got lots of processing to do, then python may not make sense with any of these schemes.
For flexibility maybe look at Twisted?
In practice your problem boils down to how much processing you are going to do for requests. If you've got a lot of processing, and need to take advantage of multi-core parallel operation, then you'll probably need multiple processes. On the other hand if you just need to listen on lots of connections, then select or epoll, with a small number of threads should work.
How about "fork"? (I assume that is what the ForkingMixIn does) If the requests are handled in a "shared nothing" (other than DB or file system) architecture, fork() starts pretty quickly on most *nixes, and you don't have to worry about all the silly bugs and complications from threading.
Threads are a design illness forced on us by OSes with too-heavy-weight processes, IMHO. Cloning a page table with copy-on-write attributes seems a small price, especially if you are running an interpreter anyway.
Sorry I can't be more specific, but I'm more of a Perl-transitioning-to-Ruby programmer (when I'm not slaving over masses of Java at work)
Update: I finally did some timings on thread vs fork in my "spare time". Check it out:
http://roboprogs.com/devel/2009.04.html
Expanded:
http://roboprogs.com/devel/2009.12.html
One solution is gevent. Gevent marries libevent-based event polling with lightweight cooperative task switching implemented by greenlet.
What you get is all the performance and scalability of an event-driven system with the elegance and straightforward model of blocking-I/O programming.
(I don't know what the SO convention on answering really old questions is, but I decided I'd still add my 2 cents.)
Can I suggest additional links?
cogen is a cross-platform library for network-oriented, coroutine-based programming using the enhanced generators from Python 2.5. On the main page of the cogen project there are links to several projects with a similar purpose.
I like Douglas' answer, but as an aside...
You could use a centralized dispatch thread/process that listens for readiness notifications using select and delegates to a pool of worker threads/processes to help accomplish your parallelism goals.
As Douglas mentioned, however, the GIL won't be held during most lengthy I/O operations (since no Python-API things are happening), so if it's response latency you're concerned about, you can try moving the critical portions of your code into a C extension via the CPython API.
http://docs.python.org/library/socketserver.html#asynchronous-mixins
As for multi-processor (multi-core) machines: with CPython, due to the GIL, you'll need at least one process per core to scale. Since you say that you need CPython, you might try benchmarking that with ForkingMixIn. With Linux 2.6 it might give some interesting results.
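For reference, a minimal ForkingMixIn sketch of strategy "4" done with processes instead of threads; it's a simple echo server, purely illustrative (the module is spelled SocketServer in Python 2, socketserver in Python 3):

```python
# One child process per connection, blocking I/O inside each child.
import socketserver

class EchoHandler(socketserver.StreamRequestHandler):
    def handle(self):
        for line in self.rfile:        # blocking reads, one client per child
            self.wfile.write(line)

class ForkingServer(socketserver.ForkingMixIn, socketserver.TCPServer):
    pass

if __name__ == "__main__":
    with ForkingServer(("127.0.0.1", 9999), EchoHandler) as server:
        server.serve_forever()
```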
Another way is to use Stackless Python. That's how EVE solved it. But I understand that it's not always possible.
