I have a Pylons web application served by Apache (mod_wsgi, prefork). Because of Apache, there are multiple separate processes running my application code concurrently. Some of the non-critical tasks that the application does I want to defer for processing in background to improve "live" response times. So I'm thinking of task queue, many Apache processes adding tasks to this queue, a single separate Python process processing them one-by-one and removing from queue.
The queue should preferably be persisted to disk so queued unprocessed tasks are not lost because of power outage, server restart etc. The question is what would be a reasonable way to implement such queue?
As for the things I've tried: I started with simple SQLite database and single table in it for storing queue items. In load testing, when increasing level of concurrency, I started getting "database locked" errors, as expected. The quick'n'dirty fix was to replace SQLite with MySQL--it handles concurrency issues well but feels like an overkill for the simple thing I need to do. Queue-related DB operations also show up prominently in my profiling reports.
A message broker like Apache's ActiveMQ is an ideal solution here.
The pipeline could be following:
Application process that is responsible for handling HTTP requests generates replies quickly and sends low-priority, heavy tasks to AMQ queue.
One or more another processes are subscribed to consume AMQ queue and do what is intended to do with these heavy tasks.
The requirement of queue persistence is fulfilled out of the box since ActiveMQ stores messages that are not yet consumed in persistent storage. Furthermore it scales quite well since you're free to deploy multiple HTTP-apps, multiple consumer apps and AMQ itself on different machines each.
We use something like this in our project written in Python utilizing STOMP as underlying communication protocol.
A web server (any web server) is multi-producer, single-consumer process.
A simple solution is to build a wsgiref or Werkzeug backend server to handle your backend requests.
Since this "backend" server is build using WSGI technology, it's very, very similar to the front-end web server. Except. It doesn't produce HTML responses (JSON is usually simpler). Other than that, it's very straightforward.
You design RESTful transactions for this backend. You use all of the various WSGI features for URI parsing, authorization, authentication, etc. You -- generally -- don't need session management, since RESTful servers don't usually offer sessions.
If you get into serious scalability issues, you simply wrap your backend server in lighttpd or some other web engine to create a multi-threaded backend.
Related
We are working on an Internet and Intranet platform, that serves client-requests over website applications.
There are heavy-weight computations on database entries and files. We want to update the state of those computations via push-notification to the client and make changes to files without the risk of race-conditions. The architecture is supposed to run on both, low- scaled one-server environments and high-scaled cluster environments.
So far, we are running a Django Webserver with Postgresql, the Python-Library Channels and RabbitMQ as Messagebroker.
Once a HTTP-Request from a client arrives in Django, we trigger the task via task.delay() and immediatly return the task_id to the client. The client then opens a websocket to another Django-route and hands over the task_ids he is interested in. Django then polls the state of the task via AsyncResult(task_id).state. Once the state changes, we read the results via AsyncResult(task_id).get and push the task_results to the client.
Here a similar sequence diagramm, from another project I found online.
Source(18.09.21)
Something that is not seen on the diagram, the channels_worker have to fetch the file they are working on from Django. A part of the result is not for the client, but to update the file. Django locks and updates the file localy as soon, as the client asks for and Django receives the task_results from celery (the changes only add attributes and will not be in conflict with each other).
My thoughts about this architecture are:
monitoring of the celery-events is bad so far.
It is only triggered by the client, which has to know about the tasks to begin with.
Django is not suited for monitoring
and polling is not efficient in general.
The file management seems fishy.
I would prefer a proper monitoring, where events are pushed to Django and the client. The client have to be able to consume the events at any time later.
I have some thoughts about solutions, but I would like to hear your opinion first. Later I can bring them in the discussion too.
Greetings
Python
Edit 1
From other sources I got helpful information regarding a good strategy.
Instead of Django "monitoring" the celery tasks, we can use a dedicated Websocket-Service, like FastAPI thand monitors task events and propagates them to the clients via websocket.
The client doesn't have to know about it's running tasks per se. Instead we can have ownership of tasks and the client only has to authenticate himself. The whole Security Blog will be implemented anyways and its supported by Celery.
For the file management, we should use a dedicated object storage like minio. This service can become subscriber to task-events related to files.
We all like Python, but we don't have to re-invent the wheel whenever we want a better monitoring or more control on the behavior of our systems.
That being said, I would recommend re-architecture the solution to decrease the complexity of your django application by exploring what native cloud solutions are offering in terms of micro-service architecture (API-Gateway), AWS SQS and SNS, computation, and storage options for your files.
Such an approach will carry out a lot of the monitoring, configurations, file management activities, and most importantly your monolithic application could scale without code changes or additional configurations.
I know GIL blocks python from running its threads across cores. If it does so, why python is being used in webservers, how are the companies like youtube, instagram handling it.
PS: I know alternatives like multiprocessing can solve it. But it would be great if anyone can post it with a scenario that was handled by them.
Python is used for server-side handling in webservers, but not (usually) as webserver.
On normal setup: we have have Apache or other webserver to handles a lot of processes (server-side) (python uses usually wsgi). Note usually apache handles directly "static" files. So we have one apache server, many parallel apache processes (to handle connection and basic http) and many python processes which handles one connection per time.
Each of such process are independent each others (they just use the same resources), so you can program your server side part easily, without worrying about deadlocks. It is mostly a trade-off: performance of code, and easy and quickly to produce code without huge problems. But usually webserver with python scale very well (also on large sites), and servers are cheaper then programmers.
Note: security is also increased by having just one request in a process.
GIL exists in CPython, (Python interpreter made in C and most used), other interpreter versions such as Jython or IronPython don't have such problem, because they don't have GIL.
Even though, using CPython you can still have concurrency, just do your thing in C and then "link it" in your Python code, just like Numpy or similar do.
Other thing is, even though you have your page using Flask or Django, when you set up it in a production server, you have an Apache or Nginx, etc which has a real charge balancer (or load balancer, I can't remember the name in english now) that can serve the page to many people at the same time.
Take it from the Flask docs (link):
Flask’s built-in server is not suitable for production as it doesn’t scale well and by default serves only one request at a time.
[...]
If you want to deploy your Flask application to a WSGI server not listed here, look up the server documentation about how to use a WSGI app with it. Just remember that your Flask application object is the actual WSGI application.
Although a bit late, but I will try to give a generic and useful answer.
#Giacomo Catenazzi's answer is a good one but some part of it is factually incorrect.
API requests (or other form of web requests) are served from an already running process. The creation of this 'already running' process is handled by some webserver like gunicorn which on startup creates specified number of processes that are running the code in your web application continuously waiting to serve any incoming request.
Needless to say, each of these processes are limited by the GIL to only run one thread at a time. But one process in its lifetime handles more than one (normally many) request. Here it would be better if we could understand the flow of a request.
We will take an example of flask but this is applicable to most web frameworks. When a request comes from Nginx, it is handed over to gunicorn which interacts with your web application via wsgi. When the request reaches to the framework, an app context is created and some variables are pushed into the app-context. Then it follows the normal route that mostly people are familiar with: routing, db calls, response creation and so on. The response is then handed back to the gunicorn via wsgi again. At the time of handing over the response, the app context is teared down. So it's the app context, not the process that is created on every new request.
Also, I have talked only about the sync worker in gunicorn but it also has an option of async worker which can handle multiple requests in parallel through coroutines. But thats a separate topic.
So answering your question:
Nginx (Capable of handling multiple requests at a time)
Gunicorn creates a pool of n number of processes at the start and also manages the pool in the sense that if a process exits or gets stuck, it kills/recreates ans adds that to the pool.
Each process handling 1 request at a time.
Read more about gunicorn's design and how it can be used to help you achieve your requirements. This is a good thread about gunicorn with flask understanding. And this is a great resource to understand flask app context
Recently I came to know about Django channels.
Can somebody tell me the difference between channels and celery, as well as where to use celery and channels.
Channels in Django are meant for asynchronous handling of requests.
The standard model Django uses is Request-Response but that has significant limitations. We cannot do anything outside the restrictions of that model.
Channels came about to allow Web Socket support and building complex applications around Web Sockets, so that we can send multiple messages, manage sessions, etc.
Celery is a completely different thing, it is an asynchronous task queue/job queue based on distributed message passing. It is primarily for queuing tasks and scheduling them to run at specific intervals.
Simply put Channels are used when you need asynchronous data communication like a chat application, and Celery is for scheduling tasks and events like a server scraping the web for a certain type of news at fixed intervals.
Channels in Django is for WebSocket, long-poll HTTP.
Celery is for background task, queue.
Django Channels:
beyond HTTP - to handle WebSockets, chat protocols, IoT protocols, and
more.
Passes messages between client and server (Full duplex connection)
Handle HTTP and Web-sockets requests
Asynchronous
Example :
Real time chat application
Update social feeds
Multiplayer game
Sending Notifications
Celery :
It’s a task queue with focus on real-time processing, while also supporting task scheduling.
Perform long running background tasks
Perform periodic tasks
Asynchronous
Example :
Processing Videos/images
Sending bulk emails
Further reading
Example of Celery and Django Channels
Asynchronous vs Synchronous
Channels is a project that takes Django and extends its abilities beyond HTTP - to handle Web-sockets, chat protocols, IoT protocols, and more. It’s built on a Python specification called ASGI.
Channels changes Django to weave asynchronous code underneath and through Django’s synchronous core, allowing Django projects to handle not only HTTP, but protocols that require long-running connections too - WebSockets, MQTT, chatbots, amateur radio, and more.
It does this while preserving Django’s synchronous and easy-to-use nature, allowing you to choose how you write your code - synchronous in a style like Django views, fully asynchronous, or a mixture of both. On top of this, it provides integrations with Django’s auth system, session system, and more, making it easier than ever to extend your HTTP-only project to other protocols.
It also bundles this event-driven architecture with channel layers, a system that allows you to easily communicate between processes, and separate your project into different processes.
Celery is an asynchronous task queue based on distributed message passing. It provides functionalities to run real-time operations and schedule some tasks to be executed later. These tasks can be executed asynchronous or synchronous, this means you may prefer to run them at the background or chain them to make one task to be fulfilled after the successful execution of an another task.
Other answers, greatly explained the diff, but in facts Channels & Celery can both do asynchronous pooled tasks in common.
Channels and Celery both use a backend for messages and worker daemon(s). So the same kind of thing could be implemented with both.
But keep in mind that Celery is primary made for and can handle most issue of task pooling (retries, result backend, etc), where Channels is absolutely not made for.
Django channels gives to Django the ability to handle more than just plain HTTP requests, including Websockets and HTTP2. Think of this as 2 way duplex communication that happens asynchronously
No browser refreshing. Multiple clients can send and receive data via websocket and django channels orchestrates this intercommunication example a group chat with simultaneously clients accessing at the same time. You can achieve background processing of long running code simliar to that of a celery to a certain extent, but the application of channels is different to that of celery.
Celery is an asynchronous task queue/job queue based on distributed message passing. As well as scheduling. In layman's terms, I want to fire and run a task in the background or I want to have a periodical task that fires and runs in the back on a set interval. You can also fire task in a synchronous way as well fire and wait until complete and continue.
So the key difference is in the use case they serve and objectives of the frameworks
I am creating an application that basically has multiple connections to a third party Chat Streaming API(Socket based).
The way it works is - Every user has an account on my app and another account on the third party app. He gives me an access token for the third party chat app and I connect to the third party API to stream his chats. This happens for hundreds of users.
I need to create a socket connection pool for every user and run parallel threads. I am using a python library(for that API) and am able to achieve real time feeds for single users. How do I implement an asynchronous socket connection pool in Python or NodeJS? I have a Linux micro instance on EC2 and I need to run this application for 1000 users.
I am exploring Redis+Tornado to implement this. Are there any better alternatives?
This will be messy and also a couple of things to consider.
If you are going to use multiple threads remember that you can only run so many per CPU as the OS permits, rather go multiprocessing.
If you are going async with long polling processes it will prevent other clients from processing requests.
Solution
When your application absolutely needs to be real-time I would suggest websockets for server-client interaction.
Then from your clients request start a single process that listens\polls on your streaming API using multiprocessing in python. So you will essentially create a separate process for each client.
And now, to make your WebSocketHandler and Background API Streamer interact with each other you can use the Observer Pattern (https://en.wikipedia.org/wiki/Observer_pattern) to notify the WebSocket that you have received data from the API.
Make sure that you assign a unique ID to every client and make sure that you only post the data to the intended client when using websockets.
EDIT:
Web:
Also on your question regarding Tornado. It is a good lightweight framework for running a couple of users maybe 1000. But anything more than that I would suggest looking at Django as it will allow you to be more productive in producing code and also there are lots of tools out there that the community have developed over time.
Database:
Red.is is a good choice if you need a very fast no-sql db, also have a look at mongodb. If you require a multi-region DB I would suggest going with Cassandra or CouchDB due to the partitioned nodes. The image below might help you better decide which DB to use.
I am trying to design a web application that processes large quantities of large mixed-media files coming from asynchronous processes. Each process can take several minutes.
The files are either uploaded as a POST body or pulled by the web server according to a source URL provided. The files can be processed by a variety of external tools in a synchronous or asynchronous way.
I need to be able to load balance this application so I can process multiple large files simultaneously for as much as I can afford to scale.
I think Python is my best choice for this project, but beside this, I am open to any solution. The app can either deliver the file back or rely on a messaging channel to notify the clients about the process completion.
Some approaches I thought I might use:
1) Use a non-blocking web server such as Tornado that keeps the connection open until the file processing is done. The external processing command is launched and the web server waits until the file is ready and pipes the resulting IO stream directly back to the web app that returns it. Since the processes sending requests are asynchronous, they might afford to wait (unless memory or some other issues come up).
2) Use a regular web server like Cherrypy (which I am more confident with) and have the webapp use a messaging channel to report the processing progress. The web server returns a HTTP response as soon as it receives the file, validates it and sends it to a background process. At the same time it sends a message notifying the process start. The background process then takes care of delivering the file to an available location and sending another message to the channel notifying the location of the new file. This solution looks more flexible than 1), but requires writing a separate script to handle the messages outside the web application, as well as a separate storage space for the temp files that have to be cleaned up at a certain point.
3) Use some internal messaging capability of any of the webserves mentioned above, which I am not familiar with...
Edit: something like CherryPy's pub-sub engine (http://cherrypy.readthedocs.org/en/latest/extend.html?highlight=messaging#publish-subscribe-pattern) could be a good solution.
Any suggestions?
Thank you,
gm
I had a similar situation come up with a really large scale data processing engine that my team implemented. We wanted to build our api calls in Flask, some of which can take many hours to complete, but have a way to notify the user in real time what is going on.
Basically what I came up with is was what you described as option 2. On the same machine that I am serving the flask app through apache, I created a tornado app that serves up a websocket that reports progress to the end user. Once my main page is served, it establishes the websocket connection to the tornado server, and the flask app periodically sends updates to the tornado app, and down to the end user. Even if the browser is closed during the long running application, apache keeps the request alive and processing, and upon logging back in, I can still see the current progress.
I wrote about this solution in some more detail here:
http://jonfeatherstone.com/2013/08/01/mongo-and-websockets-for-application-logging/
Good luck!