I need to import some data to show it to the user, but the page execution time exceeds the 30-second limit. So I decided to split my big code into several tasks and try Task Queues. I add about 10-20 tasks to the queue, and App Engine executes them in parallel while the user waits for the data. How can I determine that my tasks have completed, so that I can show the user the data ASAP? Can I somehow iterate over active tasks?
I've solved this in the past by keeping the status for the tasks in memcached, and polling (via Ajax) to determine when the tasks are finished.
If you go this way, it's best if you can always "manually" determine the status of the tasks without looking in memcached, since there's always the (slim) chance that memcache will go down or will get cleared or something as a task is running.
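A minimal sketch of that idea, assuming a dict-like cache as a stand-in for memcache and a hypothetical `check_manually` callable that determines a task's status from a durable source (for example, by checking whether the task's output was written). Neither name comes from the original answer.

```python
class TaskStatusTracker:
    """Keeps completion flags in a cache, falling back to a manual check."""

    def __init__(self, cache, check_manually):
        self.cache = cache                    # dict-like stand-in for memcache
        self.check_manually = check_manually  # hypothetical: task_id -> bool

    def _key(self, task_id):
        return "task_done:%s" % task_id

    def mark_done(self, task_id):
        # Called by a task as its last step.
        self.cache[self._key(task_id)] = True

    def is_done(self, task_id):
        flag = self.cache.get(self._key(task_id))
        if flag is None:
            # No flag: the task has not finished, or the cache was cleared.
            # Either way, ask the durable source instead of trusting the cache.
            return self.check_manually(task_id)
        return flag

    def all_done(self, task_ids):
        # This is what the Ajax polling endpoint would return.
        return all(self.is_done(task_id) for task_id in task_ids)
```

The polling handler would call `all_done` with the IDs of the tasks it enqueued and render the data once it returns `True`.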
Related
We are running an API server where users submit jobs for calculation, which take between 1 second and 1 hour. They then make requests to check the status and get their results, which could be (much) later, or even never.
Currently jobs are added to a pub/sub queue, and processed by various worker processes. These workers then send pub/sub messages back to a listener, which stores the status/results in a postgres database.
I am looking into using Celery to simplify things and allow for easier scaling.
Submitting jobs and getting results isn't a problem in Celery, using celery_app.send_task. However, I am not sure how best to ensure the results are stored, particularly for long-running or possibly abandoned jobs.
Some solutions I considered include:
1. Give all workers access to the database and let them handle updates. The main limitation to this seems to be the db connection pool limit, as worker processes can scale to 50 replicas in some cases.
2. Listen to celery events in a separate pod, and write changes based on this to the jobs db. Only 1 connection needed, but as far as I understand, this would miss out on events while this pod is redeploying.
3. Only check job results when the user asks for them. It seems this could lead to lost results when the user takes too long, or slowly clog the results cache.
4. As in (3), but periodically check on all jobs not marked completed in the db. A tad complicated, but doable?
Is there a standard pattern for this, or am I trying to do something unusual with Celery? Any advice on how to tackle this is appreciated.
In the past I solved a similar problem by modifying tasks so that they not only return the result of the computation, but also store it in a cache server (Redis) right before returning. I had a task that periodically (every 5 minutes) collected these results and wrote the data to a relational database (in bulk, so quite efficiently). This worked well until we started filling the cache with hundreds of thousands of results, so we implemented a tiny service that does this instead of a periodically running task.
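The buffer-then-bulk-write pattern can be sketched as follows. This is only an illustration under stated assumptions: a plain Python list stands in for the Redis buffer, SQLite stands in for the relational database, and the `results` table and function names are hypothetical.

```python
import json
import sqlite3


def init_db(conn):
    # Hypothetical schema: one row per finished task.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS results (task_id TEXT PRIMARY KEY, result TEXT)"
    )


def buffer_result(buffer, task_id, result):
    # What a task would do right before returning.
    buffer.append(json.dumps({"task_id": task_id, "result": result}))


def flush_results(buffer, conn):
    """Write all buffered results in one bulk statement, then trim the buffer."""
    snapshot = list(buffer)
    rows = []
    for raw in snapshot:
        item = json.loads(raw)
        rows.append((item["task_id"], json.dumps(item["result"])))
    if rows:
        with conn:  # commits on success, rolls back on error
            conn.executemany(
                "INSERT OR REPLACE INTO results (task_id, result) VALUES (?, ?)",
                rows,
            )
        # Remove only what was written, and only after the commit succeeded.
        del buffer[: len(snapshot)]
    return len(rows)
```

Trimming the buffer after the commit means a failed write leaves the results in place for the next flush, at the cost of possibly writing a row twice, which the `INSERT OR REPLACE` tolerates.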
I have a use case where I need to poll an API every second (basically an infinite while loop). The polling is initiated dynamically by a user through an external system, which means multiple polling loops can be running at the same time. A polling loop is complete when the API returns 400. My current implementation looks something like this:
Flask APP deployed on heroku.
Flask APP has an endpoint which external system calls to start polling.
That Flask endpoint adds a message to a queue, and as soon as a worker picks it up, it starts polling. I am using the Heroku Redis To Go add-on. Under the hood it uses python-rq and Redis.
The problem is that when one polling process runs for a long time, the others just sit in the queue. I want to be able to run all of the polling concurrently.
What's the best approach to tackle this problem? Fire up multiple workers?
What if there could potentially be more than 100 concurrent processes?
You could implement a "weighted"/priority queue. There may be multiple ways of implementing this, but the simplest example that comes to my mind is using a min or max heap.
You should keep track of how many events are in the queue for each process: as the number of events for one process grows, the weight of its newly inserted events should decrease. Every time an event is processed, you start processing the next one with the greatest weight.
PS More workers will also speed up the work.
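A minimal sketch of such a weighted queue, built on a min-heap from the standard library. The class and method names are hypothetical. Since `heapq` pops the smallest item first, the number of pending events for a process is used directly as the priority: fewer pending events means a greater weight.

```python
import heapq
import itertools
from collections import defaultdict


class FairQueue:
    """Priority queue that favours processes with fewer queued events."""

    def __init__(self):
        self._heap = []
        self._pending = defaultdict(int)   # process_id -> events still queued
        self._counter = itertools.count()  # tie-breaker: keeps FIFO order

    def push(self, process_id, event):
        self._pending[process_id] += 1
        # More pending events for this process -> larger number -> lower weight.
        priority = self._pending[process_id]
        heapq.heappush(
            self._heap, (priority, next(self._counter), process_id, event)
        )

    def pop(self):
        _, _, process_id, event = heapq.heappop(self._heap)
        self._pending[process_id] -= 1
        return process_id, event

    def __len__(self):
        return len(self._heap)
```

With three events pushed for process `"a"` followed by one for process `"b"`, the pop order is `a`'s first event, then `b`'s event, then `a`'s remaining two, so a busy process cannot starve a quiet one.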
I have tasks that do a get request to an API.
I have around 70,000 requests that I need to make, and I want to spread them out over 24 hours, so that not all 70k requests run at, for example, 10 AM.
How would I do that in Celery with Django? I have been searching for hours but can't find a good, simple solution.
The database has a list of games that needs to be refreshed. Currently I have a cron that creates tasks every hour. But is it better to create a task for every game and make it repeat every hour?
The typical approach is to send them whenever you need some work done, no matter how many there are (even hundreds of thousands). The execution, however, is controlled by how many workers (and worker processes) you have subscribed to a dedicated queue. The key here is the dedicated queue: it is a common way of preventing all workers from starting to execute the newly created tasks. This goes beyond basic Celery usage. You need to use celery multi for this use case, or create two or more separate Celery workers manually with different queues.
If you do not want to over-complicate things, you can use your current setup but give these tasks the lowest priority, so that if any new, more important task gets created, it will be executed first. The problem with this approach is that only the Redis and RabbitMQ backends support priorities, as far as I know.
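As a different technique from the dedicated queue and priorities described above, the requests could also be spread evenly by giving each task its own delay. The sketch below only computes the delays; the function name is hypothetical, and the assumption is that each value would be passed as the `countdown` argument of Celery's `apply_async` when the task is sent.

```python
from datetime import timedelta


def spread_countdowns(n_tasks, window=timedelta(hours=24)):
    """Return one delay in seconds per task, evenly spaced across the window."""
    interval = window.total_seconds() / n_tasks
    # Task 0 starts immediately; the last task starts one interval
    # before the window closes.
    return [i * interval for i in range(n_tasks)]


countdowns = spread_countdowns(70000)
# Each value could then be used as: some_task.apply_async(countdown=value)
# where some_task is a hypothetical Celery task.
```

For 70,000 tasks in 24 hours, this works out to one task roughly every 1.23 seconds.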
I'm creating a Celery task in a situation where there are more task producers than consumers (workers). Since my queues are filling up and the workers consume in FCFS order, can I execute a specific task (given a task_id) instantly?
For example:
My tasks are queued in the following order: [1,2,3,4,5,6,7,8,9,0]. Tasks are fetched starting from the zeroth index. Now a situation arises where I want to execute task 8 before all the others. How can I do this?
The worker need not execute that task (because there can be situation where a worker is already occupied). It can be run directly from the application. And when the task is completed (either from the worker or directly from the application), it should get deleted from the queue.
I know how to forcefully revoke a task (given a task_id) but how can I execute a task given an id ?
how can I execute a task given an id ?
The short answer is you can't. Celery workers pull tasks off the broker backend as they become available.
Why not?
Note that's not a limitation of Celery as such; rather, it is a characteristic of message queuing systems (MQS) in general. The point of an MQS is to decouple an application's components so that the producer can go on to do other work while workers execute the tasks asynchronously. In other words, once a task has been sent off it cannot be modified (but it can be removed as long as it has not been started yet).
What options are there?
Celery offers you several options to deal with lower- vs. higher-priority or short- and long-running tasks, at task submission time:
Routing - tasks can be routed to different workers. So if your tasks [0 .. 9] are all long-running, except for task 8, you could route task 8 to a worker, or a set of workers, that deal with short-running tasks.
Timed execution - specify a countdown or estimated time of arrival (eta) for each task. That's a good option if you know that some tasks can be delayed for later execution i.e. when the system will be less busy. This leaves workers ready for those tasks that need to be executed immediately.
Task expiry - specify an expiry countdown or time with a callback. This way the task will be revoked if it didn't execute within the time allotted to it and the callback can start an alternative course of action.
Check on task results periodically, revoke a task if it didn't start executing within some time. Note this is different from task expiry where the revoking only happens once a worker has fetched the task from the queue - if the queue is full the revoking may happen too late for your use case. Checking results periodically means you have another component in your system that does this and determines an alternate course of action.
I'm running Django, Celery and RabbitMQ. What I'm trying to achieve is to ensure that tasks related to one user are executed in order (specifically, one at a time; I don't want task concurrency per user).
Whenever a new task is added for a user, it should depend on the most recently added task. Additional functionality might include not adding a task to the queue if a task of this type is already queued for this user and has not yet started.
I've done some research and:
I couldn't find a way to link a newly created task with an already queued one in Celery itself; chains seem to be able to link only new tasks.
I think that both functionalities could be implemented with a custom RabbitMQ message handler, though it might be hard to code.
I've also read about celery-tasktree, and this might be the easiest way to ensure execution order, but how do I link a new task with an already "applied_async" task_tree or queue? Is there any way that I could implement the additional no-duplicate functionality using this package?
Edit: There is also this "lock" example in the Celery cookbook. The concept is fine, but I can't see a way to make it work as intended in my case: if I can't acquire the lock for a user, the task would have to be retried, which means pushing it to the end of the queue.
What would be the best course of action here?
If you configure the celery workers so that they can only execute one task at a time (see worker_concurrency setting), then you could enforce the concurrency that you need on a per user basis. Using a method like
NUMBER_OF_CELERY_WORKERS = 10

def get_task_queue_for_user(user):
    return "user_queue_{}".format(user.id % NUMBER_OF_CELERY_WORKERS)

to get the task queue based on the user id, every task for a given user will be assigned to the same queue. The workers would need to be configured to only consume tasks from a single task queue.
It would play out like this:
User 49 triggers a task
The task is sent to user_queue_9
When the one and only celery worker that is listening to user_queue_9 is ready to consume a new task, the task is executed
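The walk-through above can be checked with a small self-contained sketch. The `User` tuple is a hypothetical stand-in for the Django user model; the assumption is that the returned name would be passed as the `queue` argument of `apply_async`, and that each worker would be started with `-Q` naming its single queue and `--concurrency=1`.

```python
from collections import namedtuple

NUMBER_OF_CELERY_WORKERS = 10

# Hypothetical stand-in for a Django user; only the id matters here.
User = namedtuple("User", "id")


def get_task_queue_for_user(user):
    # The same user always maps to the same queue, so that user's tasks
    # are consumed one at a time by the single worker on that queue.
    return "user_queue_{}".format(user.id % NUMBER_OF_CELERY_WORKERS)


queues = [get_task_queue_for_user(User(i)) for i in (49, 59, 7)]
# queues == ["user_queue_9", "user_queue_9", "user_queue_7"]
```

Note that users 49 and 59 share `user_queue_9`, so their tasks are also serialized against each other, not only per user.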
This is a hacky answer though, because
requiring just a single celery worker for each queue is a brittle system -- if the celery worker stops, the whole queue stops
the workers are running inefficiently