I have a use case where I need to poll an API every second (basically an infinite while loop). Polling is initiated dynamically by a user through an external system, so there can be multiple polls running at the same time. A poll is finished when the API returns 400. My current implementation looks something like this:
Flask app deployed on Heroku.
The Flask app has an endpoint which the external system calls to start polling.
That Flask endpoint adds a message to the queue, and as soon as a worker picks it up, it starts polling. I am using the Heroku Redis To Go add-on; under the hood the queue uses python-rq and Redis.
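Roughly, a simplified sketch of that setup (the route, the Redis URL and the function names are illustrative, not my exact code):

import time

import requests
from flask import Flask, request
from redis import Redis
from rq import Queue

app = Flask(__name__)
q = Queue(connection=Redis.from_url("redis://..."))  # Redis To Go URL in practice

def poll_external_api(url):
    # Poll once per second until the API returns 400.
    while True:
        if requests.get(url).status_code == 400:
            return
        time.sleep(1)

@app.route("/start-poll", methods=["POST"])
def start_poll():
    q.enqueue(poll_external_api, request.json["url"])
    return "polling queued", 202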
The problem is that when one polling job runs for a long time, the other jobs just sit in the queue. I want all of the polls to run concurrently.
What's the best approach to tackle this problem? Fire up multiple workers?
What if there could potentially be more than 100 concurrent polls?
You could implement a "weighted"/priority queue. There may be multiple ways of implementing this, but the simplest example that comes to my mind is using a min or max heap.
You should keep track of how many events are in the queue for each process; as the number of events for one process grows, the weight of newly inserted events should decrease. Every time an event is processed, you start processing the one with the greatest weight, as in the sketch below.
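A rough sketch of that idea using Python's heapq (the weighting formula is just one possible choice; heapq is a min-heap, so weights are negated to pop the greatest one first):

import heapq
from collections import defaultdict

class WeightedQueue(object):
    def __init__(self):
        self._heap = []                    # entries: (-weight, counter, process_id, event)
        self._pending = defaultdict(int)   # events currently queued per process
        self._counter = 0                  # tie-breaker so events themselves are never compared

    def push(self, process_id, event):
        self._pending[process_id] += 1
        weight = 1.0 / self._pending[process_id]   # more pending events -> lower weight
        heapq.heappush(self._heap, (-weight, self._counter, process_id, event))
        self._counter += 1

    def pop(self):
        # Return the queued event with the greatest weight.
        _, _, process_id, event = heapq.heappop(self._heap)
        self._pending[process_id] -= 1
        return process_id, event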
PS: More workers will also speed up the work.
I am creating a robot that has a Flask and React based interface (running on a Raspberry Pi Zero) for users to request it to perform tasks. When a user requests a task I want the backend to put it in a queue, and have the backend constantly watching the queue and processing tasks on a one-by-one basis. Each task can take anywhere from 15-60 seconds, so they are pretty lengthy.
Currently I just immediately do the task in the same Python process that is running the Flask server. From testing locally, it seems like I can go to the React app in two different browsers and request tasks at the same time, and it looks like the Raspberry Pi tries to run them in parallel (from what I'm seeing in the printed logs).
What is the best way to allow multiple users to go to the front-end and queue up tasks? When multiple users go to the React app I assume they all connect to the same instance of the back-end. So is it enough just to add a deque to the back-end and protect it with a mutex lock (and what is the Pythonic way to use mutexes?)? Or is this too simple? Do I need some other process or method to implement the task queue (such as writing/reading an external file to act as the queue)?
In general, the most popular way to run tasks in Python is Celery. It is a Python framework that runs in a separate process, continuously checking a queue (like Redis or AMQP) for tasks. When it finds one, it executes it and logs the result to a "result backend" (like a database, or Redis again). Your Flask server then just pushes tasks onto the queue.
In order to notify the users, you could use polling from the React app, which is just requesting an update every 5 seconds until you see from the result backend that the task has completed successfully. As soon as you see that, stop polling and show the user the notification.
You can easily have multiple worker processes running in parallel if the app ever becomes large enough to need it. In general, you just need to remember to have every process do what it is meant to do: Flask servers should answer web requests, and Celery workers should process tasks, not the other way around.
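A rough sketch of how the pieces fit together (the task body, the Redis URL and the route names below are placeholders, not something specific to your robot):

# tasks.py
from celery import Celery

celery_app = Celery("robot",
                    broker="redis://localhost:6379/0",
                    backend="redis://localhost:6379/0")

@celery_app.task
def perform_robot_task(name):
    # ... the 15-60 second job goes here ...
    return "done"

# app.py
from flask import Flask, jsonify, request
from tasks import celery_app, perform_robot_task

app = Flask(__name__)

@app.route("/tasks", methods=["POST"])
def enqueue_task():
    result = perform_robot_task.delay(request.json["name"])
    return jsonify({"task_id": result.id}), 202

@app.route("/tasks/<task_id>")
def task_status(task_id):
    # The React app polls this every few seconds until state == "SUCCESS"
    return jsonify({"state": celery_app.AsyncResult(task_id).state})

Starting the worker with celery -A tasks worker --concurrency=1 gives you exactly the one-by-one processing you described, because only one task is pulled off the queue at a time.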
I'm working with Django 1.8 and Python 2.7.
In a certain part of the project, I open a socket and send some data through it. Due to the way the other end works, I need to leave some time (let's say 10 milliseconds) between each piece of data that I send:
from time import sleep

while True:
    send(data)     # send the next chunk of data
    sleep(0.01)    # 10 ms pause so the receiver can keep up
So my question is: is it considered bad practice to simply use sleep() to create that pause? Is there maybe a more efficient approach?
UPDATED:
The reason why I need to create that pause is that the other end of the socket is an external service that takes some time to process the chunks of data I send. I should also point out that it doesn't return anything after having received, let alone processed, the data. Leaving that brief pause ensures that each chunk of data I send gets properly processed by the receiver.
EDIT: changed the sleep to 0.01.
Yes, this is bad practice and an anti-pattern. You will tie up the "worker" which is processing this request for an unknown period of time, which will make it unavailable to serve other requests. The classic pattern for web applications is to service a request as-fast-as-possible, as there is generally a fixed or max number of concurrent workers. While this worker is continually sleeping, it's effectively out of the pool. If multiple requests hit this endpoint, multiple workers are tied up, so the rest of your application will experience a bottleneck. Beyond that, you also have potential issues with database locks or race conditions.
The standard approach to handling your situation is to use a task queue like Celery. Your web-application would tell Celery to initiate the task and then quickly finish with the request logic. Celery would then handle communicating with the 3rd party server. Django works with Celery exceptionally well, and there are many tutorials to help you with this.
If you need to provide information to the end-user, then you can generate a unique ID for the task and poll the result backend for an update by having the client refresh the URL every so often. (I think Celery will automatically generate a guid, but I usually specify one.)
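As a sketch of that shape (the host, port and chunk-building logic are placeholders for whatever your view currently does):

# tasks.py
import socket
import time

from celery import shared_task

@shared_task
def send_chunks(host, port, chunks):
    # Send each chunk to the external service, pausing 10 ms between sends.
    sock = socket.create_connection((host, port))
    try:
        for chunk in chunks:
            sock.sendall(chunk.encode("utf-8"))
            time.sleep(0.01)
    finally:
        sock.close()

# views.py
import uuid

from celery.result import AsyncResult
from django.http import JsonResponse

from .tasks import send_chunks

def start_send(request):
    task_id = str(uuid.uuid4())
    send_chunks.apply_async(args=["10.0.0.5", 4000, build_chunks(request)],
                            task_id=task_id)   # build_chunks is your own logic
    return JsonResponse({"task_id": task_id})

def task_status(request, task_id):
    # The client refreshes this URL every so often until the state is SUCCESS.
    return JsonResponse({"state": AsyncResult(task_id).state})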
Like most things, short answer: it depends.
Slightly longer answer:
If you're running it in an environment where you have many (50+, for example) connections to the webserver, all of which are triggering the sleep code, you're really not going to like the behavior. I would strongly recommend looking at something like Celery/RabbitMQ so Django can hand the time-delayed part off to something else and then quickly respond with a "task started" message.
If this is production, but you're the only person hitting the webserver, it still isn't great design, but if it works, it's going to be hard to justify the extra complexity of the task queue approach mentioned above.
I'm doing some metric analysis on my web app, which makes extensive use of celery. I have one metric which measures the full trip from a post_save signal, through a celery task (which itself calls a number of different celery tasks), to the end of that task. I've been hitting the server with up to 100 requests in 5 seconds.
What I find interesting is that when I hit the server with hundreds of requests (which entails thousands of celery tasks being queued), the time it takes for the trip from post_save to the end of the main celery task increases significantly, even though I never do any additional database calls and none of the celery tasks should be blocking the main task.
Could the fact that there are so many celery tasks in the queue when I make a bunch of requests really quickly be slowing down the logic in my post_save function and main celery task? That is, could the processing associated with getting the sub-tasks that the main celery task creates onto a crowded queue be having a significant impact on the time it takes to reach the end of the main celery task?
It's impossible to really answer your question without an in-depth analysis of your actual code AND benchmark protocol, and while I have some working experience with Python, Django and Celery, I wouldn't be able to do such an in-depth analysis. That said, there are a couple of very obvious points:
if your workers are running on the same computer as your Django instance, they will compete with the Django process(es) for CPU, RAM and IO.
if the benchmark "client" is also running on the same computer, then you have a "heisenbench" case - bombing a server with hundreds of HTTP requests per second also uses a serious amount of resources...
To make a long story short: concurrent / parallel programming won't give you more processing power, it will only allow you to (more or less) easily scale horizontally.
I'm not sure about slowing down, but it can cause your application to hang. I've had this problem where one application would back up several other queues that had no workers. My application could then no longer queue messages.
If you open up a Django shell, try to queue a task, and then hit Ctrl+C, you should get a stack trace. I can't quite remember what it should look like, but if you post it here I could confirm it.
I am working on a project, to be deployed on Heroku with Django, which has around 12 update functions. Each takes around 15 minutes to run. Let's call them update1(), update2()...update10().
I am deploying with one worker dyno on Heroku, and I would like to run n or more of these at once (they are not really computationally intensive, they are all HTML parsers, but the data is time-sensitive, so I would like them to be called as often as possible).
I've read a lot of Celery and APScheduler documentation, but I'm not really sure which is the best/easiest for me. Do scheduled tasks run concurrently if their times overlap with one another (i.e. if I run one every 2 minutes and another every 3 minutes), or do they wait until each one finishes?
Is there any way I can queue these functions so that at least a few of them run at once? What is the suggested number of simultaneous calls for this use case?
Based on your use case description you do not need a scheduler, so APScheduler will not match your requirements well.
Do you have a web dyno besides your worker dyno? The usual design pattern for this type of processing is to set up a control thread or control process (your web dyno) that accepts requests. These requests are then placed on a request queue.
This queue is read by one or more worker threads or worker processes (your worker dyno). I have not worked with Celery, but it looks like a match for your requirements. How many worker threads or worker dynos you will need is difficult to determine from your description; you would also need to specify how many update requests you have to process per second, and whether each request is CPU bound or IO bound.
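If you do go with Celery, my understanding is the rough shape would be something like this (the schedule interval and the task bodies are placeholders):

# tasks.py
from celery import Celery

app = Celery("updates", broker="redis://...")  # e.g. a Heroku Redis URL

@app.task
def update1():
    pass  # HTML parsing for source 1 (takes ~15 minutes)

@app.task
def update2():
    pass  # HTML parsing for source 2

# Trigger each update on its own interval; they do not wait for each other.
app.conf.beat_schedule = {
    "update1-every-20-min": {"task": "tasks.update1", "schedule": 20 * 60},
    "update2-every-20-min": {"task": "tasks.update2", "schedule": 20 * 60},
}

Because the updates are mostly network-bound, a single worker dyno can run several of them at once, e.g. celery -A tasks worker --concurrency=4, with celery -A tasks beat running to fire the schedule.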
I need to import some data to show to the user, but the page execution time exceeds the 30-second limit. So I decided to split my big code into several tasks and try Task Queues. I add about 10-20 tasks to the queue, and App Engine executes the tasks in parallel while the user is waiting for the data. How can I determine that my tasks are complete, so I can show the user the data ASAP? Can I somehow iterate over the active tasks?
I've solved this in the past by keeping the status for the tasks in memcached, and polling (via Ajax) to determine when the tasks are finished.
If you go this way, it's best if you can always determine the status of the tasks "manually", without looking in memcache, since there's always the (slim) chance that memcache will go down or get cleared while a task is running.
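Roughly, using the App Engine Python memcache API (the handler names, key scheme and import logic are just examples):

from google.appengine.api import memcache
import webapp2

class ImportWorker(webapp2.RequestHandler):
    # Each task marks itself done in memcache when it finishes.
    def post(self):
        task_id = self.request.get("task_id")
        do_import_chunk(task_id)  # your actual import logic
        memcache.set("task_done:%s" % task_id, True, time=3600)

class TaskStatus(webapp2.RequestHandler):
    # The page polls this via Ajax until every task reports done.
    def get(self):
        task_ids = self.request.get_all("task_id")
        done = [t for t in task_ids if memcache.get("task_done:%s" % t)]
        self.response.write("%d/%d" % (len(done), len(task_ids)))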