Gunicorn Worker Class
Gunicorn has a worker_class setting. Some possible values are:
sync
gthread
gevent
Definitions from Luis Sena's nice blog
sync
This is the default worker class. Each process will handle one request at a time, and you can use the -w parameter to set the number of workers.
The recommendation for the number of workers is 2–4 x $(NUM_CORES), although it will depend on how your application works.
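For example, on a 4-core machine you might start 8 sync workers like this (module and app names are placeholders):

gunicorn myapp:app --bind 0.0.0.0:8000 -w 8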
gthread
If you use gthread, Gunicorn will allow each worker to have multiple threads. In this case, the Python application is loaded once per worker, and each of the threads spawned by the same worker shares the same memory space.
Those threads will be at the mercy of the GIL, but it’s still useful for when you have some I/O blocking happening. It will allow you to handle more concurrency without increasing your memory too much.
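A sketch of a gthread configuration (names are placeholders); with 4 workers and 4 threads each, up to 16 requests can be in flight at once:

gunicorn myapp:app --worker-class gthread --workers 4 --threads 4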
gevent
Eventlet and gevent make use of “green threads” or “pseudo threads” and are based on greenlet.
In practice, if your application's work is mainly I/O bound, this will allow it to scale to potentially thousands of concurrent requests on a single process.
Even with the rise of async frameworks (FastAPI, Sanic, etc.), this is still relevant today, since it allows you to optimize for I/O without the extra code complexity.
The way they manage to do it is by “monkey patching” your code, mainly replacing blocking parts with compatible cooperative counterparts from the gevent package.
It uses epoll or kqueue (through an event loop library such as libev) for highly scalable non-blocking I/O. Coroutines let the developer keep a blocking style of programming similar to threading while still getting the benefits of non-blocking I/O.
This is usually the most efficient way to run your django/flask/etc web application, since most of the time the bulk of the latency comes from I/O related work.
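Concretely, the monkey patching mentioned above goes at the very top of the entrypoint, before blocking modules are imported anywhere else. A minimal sketch:

from gevent import monkey
monkey.patch_all()  # swaps socket, ssl, time.sleep, etc. for cooperative versions

import urllib.request  # now uses the patched socket module and yields during I/O

When you pick Gunicorn's gevent worker with --worker-class gevent, Gunicorn applies this patching for you.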
Workers Value
When using gevent workers, no thread count is configured. The documentation for the threads setting says that threads are only relevant for gthread workers. So with gevent we only have workers, and each worker is a separate operating system process, as far as I know. And the recommended worker count is 8 to 16 for a 4-core machine.
Where does the performance gain come from?
So are there really no threads being executed? If there are no threads, how do the gevent workers gain performance? There should be some pseudo threads executing concurrently: while one pseudo thread is doing I/O, that should be detected and another pseudo thread should run. So where is this other pseudo thread? Is it the other worker processes, or does Gunicorn create pseudo threads within a worker process?
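To make the “pseudo thread” idea concrete: Gunicorn's gevent worker runs greenlets inside each worker process. A minimal standalone sketch with gevent itself:

import gevent

def task(n):
    print(f"task {n}: start")
    gevent.sleep(1)  # simulated I/O; yields control to other greenlets
    print(f"task {n}: done")

# all three greenlets finish in about 1 second, inside a single OS process
gevent.joinall([gevent.spawn(task, i) for i in range(3)])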
Related
I'm trying to build a python webserver using Django and Waitress, but I'd like to know how Waitress handles concurrent requests, and when blocking may occur.
While the Waitress documentation mentions that multiple worker threads are available, it doesn't provide a lot of information on how they are implemented and how the Python GIL affects them (emphasis my own):
When a channel determines the client has sent at least one full valid HTTP request, it schedules a "task" with a "thread dispatcher". The thread dispatcher maintains a fixed pool of worker threads available to do client work (by default, 4 threads). If a worker thread is available when a task is scheduled, the worker thread runs the task. The task has access to the channel, and can write back to the channel's output buffer. When all worker threads are in use, scheduled tasks will wait in a queue for a worker thread to become available.
There doesn't seem to be much information on Stack Overflow either. From the question "Is Gunicorn's gthread async worker analogous to Waitress?":
Waitress has a master async thread that buffers requests, and enqueues each request to one of its sync worker threads when the request I/O is finished.
These statements don't address the GIL (at least from my understanding) and it'd be great if someone could elaborate more on how worker threads work for Waitress. Thanks!
Here's how event-driven asynchronous servers generally work:
Start a process and listen to incoming requests. Utilizing the event notification API of the operating system makes it very easy to serve thousands of clients from single thread/process.
Since there's only one process managing all the connections, you don't want to perform any slow (or blocking) tasks in this process, because that would block the program for every client.
To perform blocking tasks, the server delegates the tasks to "workers". Workers can be threads (running in the same process) or separate processes (or subprocesses). Now the main process can keep on serving clients while workers perform the blocking tasks.
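As a rough illustration (not Waitress's actual implementation), the pattern can be sketched with the standard library: one selectors-based event loop accepts connections and hands each received request to a small thread pool:

import selectors
import socket
from concurrent.futures import ThreadPoolExecutor

sel = selectors.DefaultSelector()
pool = ThreadPoolExecutor(max_workers=4)  # cf. Waitress's default of 4 worker threads

def handle(conn, request):
    # blocking application work happens here, off the event loop
    conn.sendall(b"HTTP/1.1 200 OK\r\nContent-Length: 2\r\n\r\nok")
    conn.close()

def on_readable(conn):
    request = conn.recv(4096)  # toy example: assume one recv yields a full request
    sel.unregister(conn)
    conn.setblocking(True)
    pool.submit(handle, conn, request)  # delegate the blocking task to a worker thread

def on_accept(server):
    conn, _ = server.accept()
    conn.setblocking(False)
    sel.register(conn, selectors.EVENT_READ, on_readable)

server = socket.socket()
server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
server.bind(("127.0.0.1", 8080))
server.listen()
server.setblocking(False)
sel.register(server, selectors.EVENT_READ, on_accept)

while True:  # the single event-loop thread managing all connections
    for key, _ in sel.select():
        key.data(key.fileobj)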
How does Waitress handle concurrent tasks?
Pretty much the same way I just described above. And for workers it creates threads, not processes.
how the Python GIL affects them
Waitress uses threads for workers. So yes, they are affected by the GIL in that they aren't truly parallel, though they appear to be. "Concurrent, but not parallel" is the more accurate description.
Python threads run inside a single process, and because only one thread can hold the GIL at a time, they don't execute Python bytecode in parallel. A thread acquires the GIL for a very small amount of time, executes its code, and then the GIL is acquired by another thread.
But since the GIL is released during network I/O, a waiting thread can acquire the GIL whenever there's a network event (such as an incoming request), so you can stay assured that the GIL will not affect network-bound operations (like receiving requests or sending responses).
On the other hand, Python processes can run in parallel on multiple cores. But Waitress doesn't use processes.
Should you be worried?
If you're just doing small blocking tasks like database read/writes and serving only a few hundred users per second, then using threads isn't really that bad.
For serving a large volume of users or doing long running blocking tasks, you can look into using external task queues like Celery. This will be much better than spawning and managing processes yourself.
Hint: Those were my comments to the accepted answer and the conversation below, moved to a separate answer for space reasons.
Wait... the 5th request will stay in the queue until one of the 4 threads is done with its previous request and has gone back to the pool. One thread will only ever serve one request at a time. "I/O bound" tasks only help in that a thread waiting for I/O will implicitly (e.g. by calling time.sleep) tell Python's internal scheduler that it has nothing to do, so the GIL can be passed along to another thread and the others get more CPU time for their work. At the thread level this is fully sequential, which is still concurrent and asynchronous at the process level, just not parallel. Just to get the wording straight.
Also, Python threads are "standard" OS threads (like those in C), so they will be scheduled across all CPU cores. The only thing restricting them is that they need to hold the GIL when calling Python C-API functions, because the API in general is not thread-safe. On the other hand, calls to non-Python functions (i.e. functions in C extensions like NumPy, but also many database APIs, including anything loaded via ctypes) do not hold the GIL while running. Why should they? They are running external C binaries that know nothing of the Python interpreter in the parent process. Therefore, such tasks will run truly in parallel when called from a WSGI app hosted by Waitress. And if you've got more cores available, turn the thread number up to that amount (the threads=X kwarg on waitress.create_server).
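A rough way to observe this (assuming NumPy is installed): time the same number of tasks across threads, once with pure-Python work that holds the GIL and once with NumPy work that releases it. Exact numbers depend on your machine and BLAS build, which may itself be multithreaded:

import time
from concurrent.futures import ThreadPoolExecutor

import numpy as np

def holds_gil(_):
    return sum(i * i for i in range(2_000_000))  # pure Python bytecode, serialized by the GIL

def releases_gil(_):
    a = np.random.rand(1500, 1500)
    return (a @ a).sum()  # the matmul runs in C and releases the GIL

for fn in (holds_gil, releases_gil):
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=4) as pool:
        list(pool.map(fn, range(4)))
    print(fn.__name__, round(time.perf_counter() - start, 2), "s")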
I am using aiohttp to create an Async/IO webserver. However, to my understanding, Async/IO means the server can only run on one processing core. Regular, synchronous servers like uwsgi, on the other hand, can fully utilize the computer's computing resources with truly parallel threads and processes. Why, then, is Async/IO new and trendy if it is less parallel than multiprocessing? Can async servers like aiohttp be multi-processed?
Why, then, is Async/IO new and trendy if it is less parallel than multiprocessing?
The two solve different problems. Asyncio allows writing asynchronous code sans the "callback hell". await allows the use of constructs like loops, ifs, try/except, and so on, with automatic task switching at await points. This enables servicing a large number of connections without needing to spawn a thread per connection, but with maintainable code that looks as if it were written for blocking connections. Thus asyncio mostly helps with the code whose only bottleneck is waiting for external events, such as network IO and timeouts.
Multiprocessing, on the other hand, is about parallelizing execution of CPU-bound code, such as scientific calculations. Since OS threads do not help due to the GIL, multiprocessing spawns separate OS processes and distributes the work among them. This comes at the cost of the processes not being able to easily share data - all communication is done either by serialization through pipes, or with dedicated proxies.
A multi-threaded asyncio-style framework is possible in theory - for example, Rust's tokio is like that - but would be unlikely to bring performance benefits due to Python's GIL preventing utilization of multiple cores. Combining asyncio and multiprocessing can work on asyncio code that doesn't depend on shared state, which is supported by asyncio through run_in_executor and ProcessPoolExecutor.
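A minimal sketch of that combination, using only the standard library; the CPU-bound function runs in separate processes while the event loop stays free:

import asyncio
from concurrent.futures import ProcessPoolExecutor

def cpu_bound(n):
    return sum(i * i for i in range(n))  # pure-Python CPU work

async def main():
    loop = asyncio.get_running_loop()
    with ProcessPoolExecutor() as pool:
        results = await asyncio.gather(
            *(loop.run_in_executor(pool, cpu_bound, 5_000_000) for _ in range(4))
        )
    print(results)

if __name__ == "__main__":  # guard required for process pools on spawn-based platforms
    asyncio.run(main())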
As for multi-processing async servers like aiohttp, Gunicorn can help you:
gunicorn module:app --bind 0.0.0.0:8080 --worker-class aiohttp.GunicornWebWorker --workers 4
I've read in the dask.distributed documentation that:
Worker and Scheduler nodes operate concurrently. They serve several overlapping requests and perform several overlapping computations at the same time without blocking.
I've always thought single-threaded concurrent programming is best suited for I/O-expensive, not CPU-bound, jobs. However, I expect many dask tasks (e.g. dask.dataframe, dask.array) to be CPU intensive.
Does distributed only use Tornado for client/server communication, with separate processes/threads to run the dask tasks? Actually dask-worker has --nprocs and --nthreads arguments so I expect this to be the case.
How do the concurrency of Tornado coroutines and the more common processes/threads that run each dask task live together in distributed?
You are correct.
Each distributed.Worker object contains a concurrent.futures.ThreadPoolExecutor with multiple threads. Tasks are run on this ThreadPoolExecutor for parallel performance. All communication and coordination tasks are managed by the Tornado IOLoop.
Generally this solution allows computation to happen separately from communication and administration. This allows parallel computing within a worker and allows workers to respond to server requests even while computing tasks.
Command line options
When you make the following call:
dask-worker --nprocs N --nthreads T
It starts N separate distributed.Worker objects in separate Python processes. Each of these workers has a ThreadPoolExecutor with T threads.
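You can reproduce the same layout from Python with a local cluster; a small sketch (assumes dask.distributed is installed):

from dask.distributed import Client

# 2 worker processes, each running tasks on a 4-thread ThreadPoolExecutor
client = Client(n_workers=2, threads_per_worker=4)

future = client.submit(sum, range(1_000_000))  # executes on a worker thread
print(future.result())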
I need to create a semaphore to restrict how many instances of a particular subprocess run in parallel. I am using gunicorn with eventlet workers and allow many simultaneous connections. Mostly these are waiting on remote data. However, they all enter a processing phase at some point, and this involves calling a subprocess. This subprocess should not be run too often in parallel, as it is memory/CPU hungry.
Is threading.Semaphore correctly monkey_patch'd and usable with eventlet inside gunicorn?
As I understand the problem:
one gunicorn process (this is crucial) spawns N green threads
each worker may spawn one or more subprocesses
you want to limit the total number of subprocesses
In this case, yes, a semaphore will work as expected.
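A sketch of that single-process case (heavy-tool and the limit are placeholders). Under Gunicorn's eventlet worker the standard library is monkey patched, so threading.Semaphore here is the green-thread-aware version:

import threading
import subprocess

MAX_PARALLEL = 2  # hypothetical cap on concurrent subprocesses
_sem = threading.Semaphore(MAX_PARALLEL)

def run_heavy_tool(args):
    with _sem:  # green threads beyond the cap wait here cooperatively
        return subprocess.run(["heavy-tool", *args], capture_output=True)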
However, if you have more than one process, each will have its own instance of the semaphore, and you will observe more subprocesses than the limit. In that case, I recommend moving the subprocess responsibility into a separate application running on the same machine and calling it via an API of your choice (RPC/socket/message queue/dbus/etc). You could design the system like this:
user -> gunicorn (any number of processes)
gunicorn -> one subprocess manager
manager -> N subprocesses
The manager listens for jobs from gunicorn, spawns a subprocess if needed, and may reuse existing subprocesses. You may like a job queue system such as Beanstalk, Celery, or Gearman, or you may wish to build a custom solution on top of existing message transports like NSQ, RabbitMQ, or ZeroMQ.
I'm building an app with Flask, but I don't know much about WSGI or its HTTP base, Werkzeug. When I start serving a Flask application with gunicorn and 4 worker processes, does this mean that I can handle 4 concurrent requests?
I do mean concurrent requests, and not requests per second or anything else.
When running the development server, which is what you get by running app.run(), you get a single synchronous process, which means at most 1 request is being processed at a time.
By sticking Gunicorn in front of it in its default configuration and simply increasing the number of --workers, what you get is essentially a number of processes (managed by Gunicorn) that each behave like the app.run() development server. 4 workers == 4 concurrent requests. This is because Gunicorn uses its included sync worker type by default.
It is important to note that Gunicorn also includes asynchronous workers, namely eventlet and gevent (and also tornado, but that's best used with the Tornado framework, it seems). By specifying one of these async workers with the --worker-class flag, what you get is Gunicorn managing a number of async processes, each of which manages its own concurrency. These processes don't use threads, but instead coroutines. Basically, within each process, still only 1 thing can be happening at a time (1 thread), but tasks can be 'paused' while they are waiting on external processes to finish (think database queries or waiting on network I/O).
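For example, a gevent configuration might look like this (the app name is a placeholder; worker-connections caps the number of simultaneous greenlets per worker):

gunicorn myapp:app --worker-class gevent --workers 4 --worker-connections 1000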
This means, if you're using one of Gunicorn's async workers, each worker can handle many more than a single request at a time. Just how many workers is best depends on the nature of your app, its environment, the hardware it runs on, etc. More details can be found on Gunicorn's design page and notes on how gevent works on its intro page.
Currently there is a far simpler solution than the ones already provided. When running your application you just have to pass along the threaded=True parameter to the app.run() call, like:
app.run(host="your.host", port=4321, threaded=True)
Another option, as per the werkzeug docs, is to use the processes parameter, which receives a number > 1 indicating the maximum number of concurrent processes to handle requests:
threaded – should the process handle each request in a separate thread?
processes – if greater than 1 then handle each request in a new process up to this maximum number of concurrent processes.
Something like:
app.run(host="your.host", port=4321, processes=3) #up to 3 processes
More info on the run() method here, and the blog post that led me to find the solution and API references.
Note: on the Flask docs on the run() methods it's indicated that using it in a Production Environment is discouraged because (quote): "While lightweight and easy to use, Flask’s built-in server is not suitable for production as it doesn’t scale well."
However, they do point to their Deployment Options page for the recommended ways to do this when going for production.
Flask will process one request per thread at the same time. If you have 2 processes with 4 threads each, that's 8 concurrent requests.
Flask doesn't spawn or manage threads or processes; that's the responsibility of the WSGI gateway (e.g. gunicorn).
No, you can definitely handle more than that.
It's important to remember that, deep down, assuming you are running a single-core machine, the CPU really only runs one instruction* at a time.
Namely, the CPU can only execute a very limited set of instructions, and it can't execute more than one instruction per clock tick (many instructions even take more than 1 tick).
Therefore, most concurrency we talk about in computer science is software concurrency.
In other words, there are layers of software implementation that abstract the bottom level CPU from us and make us think we are running code concurrently.
These "things" can be processes, which are units of code that get run concurrently in the sense that each process thinks its running in its own world with its own, non-shared memory.
Another example is threads, which are units of code inside processes that allow concurrency as well.
The reason your 4 worker processes will be able to handle more than 4 requests is that they will fire off threads to handle more and more requests.
The actual request limit depends on the HTTP server chosen, I/O, OS, hardware, network connection, etc.
Good luck!
*Instructions are the very basic commands a CPU can run. Examples: add two numbers, jump from one instruction to another.