Creating queue for Flask backend that can handle multiple users


I am creating a robot that has a Flask and React based interface (running on a Raspberry Pi Zero) for users to request it to perform tasks. When a user requests a task, I want the backend to put it in a queue and to work through the queue constantly, one task at a time. Each task can take anywhere from 15 to 60 seconds, so they are pretty lengthy.
Currently I just run the task immediately in the same Python process that is running the Flask server. From testing locally, it seems I can open the React app in two different browsers and request tasks at the same time, and the Raspberry Pi appears to try to run them in parallel (judging from the printed logs).
What is the best way to allow multiple users to go to the front end and queue up tasks? When multiple users open the React app, I assume they all connect to the same instance of the back end. So is it enough just to add a deque to the back end and protect it with a mutex lock (what is the Pythonic way to use mutexes)? Or is this too simple? Do I need some other process or method to implement the task queue (such as writing to and reading from an external file that acts as the queue)?

In general, the most popular way to run background tasks in Python is Celery. It is a Python framework whose workers run in a separate process, continuously checking a queue held by a message broker (such as Redis or an AMQP broker like RabbitMQ) for tasks. When a worker finds one, it executes it and logs the result to a "result backend" (such as a database, or Redis again). The Flask servers then only have to push tasks onto the queue.
In order to notify the users, you could use polling from the React app, which is just requesting an update every 5 seconds until you see from the result backend that the task has completed successfully. As soon as you see that, stop polling and show the user the notification.
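The polling loop described here can be sketched as follows. In the real app it would live in the React front end; it is written in Python only to keep the examples in one language. The `get_status` callable and the shape of the dictionary it returns are assumptions for illustration; the state names follow Celery's `PENDING`, `SUCCESS` and `FAILURE`.

```python
import time


def poll_until_done(get_status, interval=5.0, timeout=120.0, sleep=time.sleep):
    """Call get_status() every `interval` seconds until the task has finished.

    get_status is a hypothetical callable that asks the backend for the task
    state and returns a dict such as {"state": "PENDING"} or
    {"state": "SUCCESS", "result": ...}.
    """
    waited = 0.0
    while waited <= timeout:
        status = get_status()
        if status["state"] in ("SUCCESS", "FAILURE"):
            return status  # stop polling as soon as the task has finished
        sleep(interval)
        waited += interval
    raise TimeoutError("task did not finish within the timeout")
```

A timeout is worth having so that a lost task does not leave the client polling forever.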
You can easily run multiple worker processes in parallel if the app becomes large enough to need it. In general, just remember to have each process do only its own job: Flask servers should answer web requests, and Celery workers should process tasks. Not the other way around.
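For comparison, the in-process approach the question asks about can be sketched with the standard library's `queue.Queue`, which does its own locking, so no explicit mutex is needed. This is a minimal sketch and not the Celery setup described above: `submit`, `worker` and `results` are hypothetical names, and it assumes the Flask app runs as a single process.

```python
import queue
import threading

task_queue = queue.Queue()  # thread-safe FIFO; handles its own locking
results = {}                # task_id -> result, written only by the worker


def worker():
    """Pull tasks off the queue and run them strictly one at a time."""
    while True:
        task_id, func, args = task_queue.get()
        try:
            results[task_id] = func(*args)
        except Exception as exc:  # keep the worker alive if one task fails
            results[task_id] = exc
        finally:
            task_queue.task_done()


# A single daemon thread means only one task ever runs at a time.
threading.Thread(target=worker, daemon=True).start()


def submit(task_id, func, *args):
    """Called from a Flask view: enqueue the task and return immediately."""
    task_queue.put((task_id, func, args))
```

The trade-off is that queued tasks are lost if the server restarts, and the queue is not shared if the app is later served by several processes. That is where a separate broker and worker earn their keep.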

Related

Which requests should be handled by the webserver and which by a task queue worker?

I am working on a Python web app that uses Celery to schedule and execute user job requests.
Most of the time the requests submitted by a user can't be resolved immediately and thus it makes sense to me to schedule them in a queue.
However, now that I have the whole queuing architecture in place, I'm confused about whether I should delegate all the request processing logic to the queue/workers or if I should leave some of the work to the webserver itself.
For example, apart from the job scheduling, there are times where a user only needs to perform a simple database query, or retrieve a static JSON file. Should I also delegate these "synchronous" requests to the queue/workers?
Right now, my webserver controllers don't do anything except validating incoming JSON request schemas and forwarding them to the queue. What are the pros and cons of having a dumb webserver like this?

I believe the way you have it right now, plus also giving the workers the small jobs, is good. That way the workers, rather than the webserver, would be overloaded first in the event of an attack or a huge influx of requests. :)

Best approach to tackle long polling in server side

I have a use case where I need to poll an API every second (basically an infinite while loop). Polling is initiated dynamically by a user through an external system, which means multiple polling loops can be running at the same time. A polling loop completes when the API returns 400. Anyway, my current implementation looks something like this:
A Flask app deployed on Heroku.
The Flask app has an endpoint which the external system calls to start polling.
That Flask endpoint adds a message to the queue, and as soon as a worker gets it, the worker starts polling. I am using the Heroku Redis To Go add-on. Under the hood it uses python-rq and Redis.
The problem is that when one polling job runs for a long time, the other jobs just sit in the queue. I want all of the polling to run concurrently.
What's the best approach to tackle this problem? Fire up multiple workers?
What if there could potentially be more than 100 concurrent polling jobs?

You could implement a "weighted"/priority queue. There may be multiple ways of implementing this, but the simplest example that comes to my mind is using a min or max heap.
You should keep track of how many events are in the queue for each process. As the number of events for one process grows, the weight of its newly inserted events should decrease. Every time an event has been processed, you pick the queued event with the greatest weight next.
PS: More workers will also speed up the work.
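One way to read the weighting idea above, sketched with `heapq`. The class and method names are hypothetical. The sort key is the number of events the same source already has queued at insertion time, so a smaller key stands for a greater weight.

```python
import heapq
import itertools
from collections import defaultdict


class FairQueue:
    """Priority queue in which a source's events lose priority as its backlog grows."""

    def __init__(self):
        self._heap = []
        self._pending = defaultdict(int)   # source -> events currently queued
        self._counter = itertools.count()  # tie-breaker that keeps FIFO order

    def push(self, source, event):
        # heapq is a min-heap, so the smallest key is popped first.
        key = self._pending[source]
        self._pending[source] += 1
        heapq.heappush(self._heap, (key, next(self._counter), source, event))

    def pop(self):
        _key, _order, source, event = heapq.heappop(self._heap)
        self._pending[source] -= 1
        return source, event
```

With three events queued for one source and then one for another, the second source's event is served second rather than last.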

ThreadPoolExecutor on long running process

I want to use ThreadPoolExecutor in a web app (Django).
All the examples I have seen use the thread pool like this:
with ThreadPoolExecutor(max_workers=1) as executor:
    code
I tried storing the thread pool as a member of a class and using its map function,
but I got a memory leak. The only way I could use it was with the with statement.
So I have two questions:
Each time I run with ThreadPoolExecutor, does it create the threads again and then release them? In other words, is this operation expensive?
If I avoid using with, how can I release the memory of the threads?
Thanks

Normally, web applications are stateless. That means every object you create should live in a request and die at the end of the request. That includes your ThreadPoolExecutor. Having an executor at the application level may work, but it will be embedded into your web application instead of running as a separate group of processes.
So if you want to take the workers down or restart them, your web app will have to restart as well.
There will also be stability concerns: no main process watches over the child processes to detect which ones have gone stale, so it takes a lot of code to get multiprocessing right.
Alternatively, if you want a persistent group of processes to listen to a job queue and run your tasks, there are several projects that do that for you. All you need to do is set up a server that takes care of queueing and locking, such as Redis or RabbitMQ, then point your project at that server and start the workers. Some projects even let you use the database as a job queue backend.
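On the two questions themselves: a `with ThreadPoolExecutor(...)` block calls `shutdown(wait=True)` when it exits, so the worker threads are created again for each new block and joined at the end of it. If an executor is kept at the application level instead, as the first paragraph of this answer allows, `shutdown()` has to be called explicitly. A minimal sketch, assuming a module-level pool and a hypothetical `handle_request` helper:

```python
import atexit
from concurrent.futures import ThreadPoolExecutor

# One shared pool created at import time (an assumption about the app layout).
# Threads are started lazily, on submit, up to max_workers.
EXECUTOR = ThreadPoolExecutor(max_workers=4)

# Without a `with` block, shutdown() must be called explicitly to join the
# worker threads and release their resources.
atexit.register(EXECUTOR.shutdown, wait=True)


def handle_request(numbers):
    """Hypothetical view helper: fan the work out to the shared pool."""
    # list() consumes the map iterator so no results are left pending.
    return list(EXECUTOR.map(lambda n: n * n, numbers))
```

The caveats above still apply: this pool lives and dies with the web process.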

Background thread behind django project

I have never worked on the web application/service side, and I am not sure if this is the right approach for my work:
I have a data collection system that collects data from a serial port, and I also want to present the data to users through a web service. I'm thinking of creating a Django project to show my data on a website. To collect the data, I need a background thread that starts running when the website starts. I'm trying to reuse the models defined in my Django project in the data-collecting thread.
First, I'd like to know whether this is a reasonable design. If yes, is there an easy way to do it? I have seen a lot of topics about background tasks using Celery, but those are very complicated scenarios. Isn't there an easy way to do this?

Celery is good if you have tasks that need to run in the background. For example, these could be tasks triggered by web workers (like sending emails or massive updates in stores), or parallel tasks, where one master worker sends tasks to a Celery server (or servers).
In your case, I think a better solution is:
Create one daemon, which talks to your serial port in an infinite loop and saves the data somewhere.
Web workers, which read this data and present it to the user.
If you later need something like long queries with heavy calculations for users, you can add Celery to your stack. The Celery workers would then do what the web workers do: just read the data, and return the results to the web workers.
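The daemon-plus-reader split can be sketched like this, using SQLite as the "somewhere" the data is saved. Everything here is an assumption for illustration: the table layout, the function names, and `read_sample`, a hypothetical callable that wraps the actual serial port read. In a Django project the same loop could live in a custom management command so that it can reuse the project's models; that part is not shown.

```python
import sqlite3
import time


def init_db(path):
    """Open the database and make sure the samples table exists."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS samples (ts REAL NOT NULL, value TEXT NOT NULL)"
    )
    conn.commit()
    return conn


def collect(conn, read_sample, max_samples=None, delay=0.0):
    """Daemon loop: read from the device and append each sample to the table.

    max_samples=None loops forever, as a real daemon would.
    """
    count = 0
    while max_samples is None or count < max_samples:
        conn.execute(
            "INSERT INTO samples (ts, value) VALUES (?, ?)",
            (time.time(), read_sample()),
        )
        conn.commit()
        count += 1
        time.sleep(delay)


def latest(conn, limit=10):
    """What a web view would call: the newest samples first."""
    rows = conn.execute(
        "SELECT value FROM samples ORDER BY rowid DESC LIMIT ?", (limit,)
    )
    return [value for (value,) in rows]
```

The daemon and the web workers never call each other; they only share the stored data.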

Monitor python scraper programs on multiple Amazon EC2 servers with a single web interface written in Django

I have a web scraper (command-line scripts) written in Python that runs on 4-5 Amazon EC2 instances.
What I do is place a copy of these Python scripts on each of these EC2 servers and run them.
So the next time I change the program, I have to do it for all the copies.
So you can see the problems of redundancy, management and monitoring.
To reduce the redundancy and make management easier, I want to place the code on a separate server, from which it can be executed on the other EC2 servers, and also monitor these Python programs and the logs they create through a Django/web interface hosted on this server.

There are at least two issues you're dealing with:
monitoring of execution of the scraping tasks
deployment of code to multiple servers
and each of them requires a different solution.
In general I would recommend using a task queue for this kind of assignment (I have tried Celery running on Amazon EC2 and was very pleased with it).
One advantage of the task queue is that it abstracts the definition of the task from the worker which actually performs it. So you send the tasks to the queue, and then a variable number of workers (servers with multiple workers) process those tasks by asking for them one at a time. Each worker if it's idle will connect to the queue and ask for some work. If it receives it (a task) it will start processing it. Then it might send the results back and it will ask for another task and so on.
This means that the number of workers can change over time, and they will process the tasks from the queue automatically until there are no more tasks to process. A good use case for this is Amazon's Spot Instances, which will greatly reduce the cost. Just send your tasks to the queue, create X spot requests and watch the servers process your tasks. You don't really need to care about the servers going up and down at any moment because the price went above your bid. That's nice, isn't it?
Now, this implicitly takes care of monitoring, because Celery has tools for monitoring the queue and the processing. It can even be integrated with Django using django-celery.
When it comes to deployment of code to multiple servers, Celery doesn't support that. The reasons behind this are of a different nature, see e.g. this discussion. One of them might be that it's just difficult to implement.
I think it's possible to live without it, but if you really care, I think there's a relatively simple DIY solution. Put your code under a VCS (I recommend Git) and check for updates on a regular basis. If there's an update, run a bash script which kills your workers, applies all the updates and starts the workers again so that they can process more tasks. Given Celery's ability to handle failure, this should work just fine.
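The pull model described above can be illustrated without Celery. In this minimal sketch, threads stand in for worker servers and `queue.Queue` stands in for the broker; `run_workers` is a hypothetical name, and recording which worker took a task is a stand-in for the real scraping work.

```python
import queue
import threading


def run_workers(tasks, num_workers):
    """Process every task with `num_workers` threads; return {task: worker_name}."""
    todo = queue.Queue()
    for task in tasks:
        todo.put(task)

    handled = {}
    lock = threading.Lock()

    def worker(name):
        while True:
            try:
                task = todo.get_nowait()  # an idle worker asks for one task
            except queue.Empty:
                return                    # nothing left, so the worker exits
            with lock:
                handled[task] = name      # stand-in for the real work

    threads = [
        threading.Thread(target=worker, args=(f"worker-{i}",))
        for i in range(num_workers)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return handled
```

The code that fills the queue does not change with the number of workers, which is the property that makes workers coming and going harmless.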
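A rough sketch of that DIY update check, written in Python rather than bash. It assumes the checkout tracks an upstream branch and that it is run periodically (for example from cron). `restart_workers` is a hypothetical callable that stops and restarts your workers. The `run` parameter exists only so the git calls can be replaced in tests.

```python
import subprocess


def git(repo, *args, run=subprocess.run):
    """Run a git command inside `repo` and return its stripped stdout."""
    done = run(
        ["git", "-C", repo, *args],
        check=True, capture_output=True, text=True,
    )
    return done.stdout.strip()


def update_if_changed(repo, restart_workers, run=subprocess.run):
    """Fetch, and if upstream has moved, fast-forward and restart the workers."""
    git(repo, "fetch", run=run)
    local = git(repo, "rev-parse", "HEAD", run=run)
    remote = git(repo, "rev-parse", "@{u}", run=run)  # upstream of current branch
    if local == remote:
        return False  # nothing new, leave the workers alone
    git(repo, "merge", "--ff-only", "@{u}", run=run)
    restart_workers()
    return True
```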
