I am new to Celery and I am trying to understand how queues work. If I have two tasks, say task1 and task2, and I place them in different queues, and on task1 I use only a single worker while on task2 I use multiple workers, will task1 only run one at a time since I only have a single worker? And will task2 run as many times in parallel as I have workers? Is my understanding correct?
You are close. You can route tasks to specific queues, configure workers to listen only to specific queues, and scale the number of workers listening to each queue independently. More workers generally means more tasks can execute concurrently.
However, having only a single worker assigned to a particular queue/task alone does not guarantee that the task will only execute one at a time.
By default, workers have concurrency enabled, meaning a single worker can use multiple processes to execute tasks concurrently. There are other worker settings to consider too, such as prefetching and early acknowledgement.
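For illustration, a minimal sketch of a worker configuration aimed at processing one task at a time; the app name, broker URL, and queue name are placeholders, and the settings shown are standard Celery configuration options:

# Hypothetical configuration for a worker meant to process one task at a time.
from celery import Celery

celery_app = Celery("example", broker="redis://localhost:6379/0")
celery_app.conf.update(
    worker_prefetch_multiplier=1,  # do not prefetch extra messages
    task_acks_late=True,           # acknowledge only after the task finishes
)
# Start the worker with a single process, listening to one queue:
#   celery -A example worker -Q single --concurrency=1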
If you want to ensure that a task only executes one at a time, you should not rely on the (lack of) availability of worker processes. Instead, a locking mechanism, like the one described in the Celery docs for ensuring a task is only executed one at a time, would be a recommended approach.
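A minimal sketch of such a lock, assuming a Redis instance is available; the task name, Redis URL, lock key, and timeout are placeholders:

# Redis-based lock so only one instance of the task runs at once.
import redis
from celery import Celery

app = Celery("example", broker="redis://localhost:6379/0")
r = redis.Redis.from_url("redis://localhost:6379/1")

@app.task(bind=True)
def task1(self):
    # SET with nx=True acquires the lock only if nothing else holds it;
    # ex=300 expires the lock after 5 minutes so a crash cannot deadlock it.
    if not r.set("lock:task1", self.request.id, nx=True, ex=300):
        return "skipped: another task1 is already running"
    try:
        pass  # ... do the actual work here ...
    finally:
        r.delete("lock:task1")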
task2 will not be executed as many times as you have workers on that queue; the task will be delivered to one of the workers that is available at that moment.
Related
I have a couple of workers deployed in Kubernetes. I want to write a customized exporter for Prometheus so I need to check all workers' availability.
I have some huge tasks in one queue, which take 200 seconds (for example). The workers attached to this queue run with the eventlet pool and a concurrency of 1000. This worker is deployed in a workload with 2 pods.
Because of the huge tasks, light tasks sometimes get stuck in these workers and are not processed until the huge tasks are done (I have another queue for light tasks, but some light tasks still have to go through this queue).
How can I check all workers' performance and whether they are up?
I came across Bootstrap in Celery, but I do not know whether it helps me or not, because I want a task that runs on every worker (and every queue), and I want it to run in between the huge tasks, not separately from them.
For more details: I want to save this data in Redis and read it in my exporter.
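For context, a rough sketch of the kind of check I have in mind, assuming the app is called app and Redis runs locally (the Redis key and the schedule are placeholders; app.control.ping() is Celery's remote-control ping):

import json
import redis
from celery import Celery

app = Celery("example", broker="redis://localhost:6379/0")
r = redis.Redis()

@app.task
def record_worker_health():
    # ping() broadcasts to all workers; each reachable worker answers with a pong.
    replies = app.control.ping(timeout=2.0) or []
    status = {name: "up" for reply in replies for name in reply}
    # The exporter would read this key and turn it into Prometheus metrics.
    r.set("celery:worker_status", json.dumps(status))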
I'm using APScheduler to schedule tasks in Python, and these tasks need to run independently and concurrently with other tasks.
The main rule is that these tasks have to be executed at the exact moment they were scheduled and cannot be blocked or delayed because of another task.
The tasks are dynamically scheduled by the users of my application.
For that, when the task execution time arrives, I start a new sub-process to execute it:
from multiprocessing import Process

def _initialize_order_process(user, order):
    # spawn a dedicated process for the order and wait for it to finish
    p = Process(target=do_scheduled_order, args=(user, order))
    p.start()
    p.join()
It's important to know that each subprocess opens a connection to a server.
And I'm scheduling my tasks like this:
scheduler.add_job(_initialize_order_process, 'date', run_date=start_time, args=[user, order], id=job_id)
My problem is that when a large number of tasks are scheduled for the same time, the number of processes overwhelms the server and it crashes.
So, I need this application to be scalable to support many users.
Does anyone know how to create a scalable solution for my use case?
One solution would be to scale horizontally by adding more hardware (more servers).
You add requests to a task queue (for example, using Redis as the broker), delegate the tasks to Celery workers, and run many worker instances in parallel to pick up the workload.
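A minimal sketch of that idea, assuming do_scheduled_order can be turned into a Celery task (the broker URL is a placeholder); the worker pool then bounds how many orders run at once instead of an unbounded number of processes:

from celery import Celery

app = Celery("orders", broker="redis://localhost:6379/0")

@app.task
def do_scheduled_order(user, order):
    # existing order logic goes here; each worker keeps its own server connection
    ...

# Instead of spawning a Process at run time, enqueue the task to run at the
# scheduled time (eta = estimated time of arrival):
# do_scheduled_order.apply_async(args=[user, order], eta=start_time)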
Another solution would be to set up a cluster for Apache Airflow and run the tasks through it.
cannot be blocked or delay execution because of another task
Unfortunately, that's not how task scheduling works. Eventually you will have jobs that depend on each other, so you'll have to have a DAG of job processes.
I have two types of tasks, task1 and task2; I also have two types of consumers, ConsumerA and ConsumerB.
ConsumerA can process only tasks of type task1. ConsumerB can process both types of tasks: task1 and task2.
I want ConsumerA to process only tasks of type task1.
I want ConsumerB to process tasks of type task2 first and to process tasks of type task1 only when there are no type2 tasks.
How do I design this solution with RabbitMQ? I am thinking about using two queues, but I don't know how to ensure that ConsumerB processes tasks of type task1 only when there are no tasks of type task2.
I have searched all over Stack Overflow without any success.
Has anyone solved this problem?
Hello @Dima Voronenkov,
You may want to check priority queues and also round-robin dispatching in the RabbitMQ documentation.
With both, you could create one queue per worker, distribute task1 to both queues, and process task2 with a higher priority than task1.
This could be expanded further if this approach meets your requirements.
(A diagram of the round-robin method was included here.)
Edit 1: To avoid task1 messages piling up for ConsumerB, you can assign task1 a time-to-live so that when they "expire" they are sent to a "dead-letter queue" that actually points to ConsumerA's queue. That way, when ConsumerB is busy, those task1 messages are redirected to the other queue.
Going further, you can also give "more tickets" to ConsumerA's queue, at least in the standard round-robin definition. I do not know how to do that in RabbitMQ directly, but it can be achieved by creating more ConsumerA-like queues and grouping them in the same cluster of queues.
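A minimal sketch of the priority plus TTL/dead-letter setup using pika; the queue names, priority values, and TTL are placeholders, not a definitive design:

import pika

conn = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
ch = conn.channel()

# ConsumerA's queue only ever receives task1 messages.
ch.queue_declare(queue="consumer_a")

# ConsumerB's queue supports priorities; expired messages are dead-lettered
# into ConsumerA's queue via the default exchange.
ch.queue_declare(queue="consumer_b", arguments={
    "x-max-priority": 10,
    "x-dead-letter-exchange": "",
    "x-dead-letter-routing-key": "consumer_a",
})

# task2 gets a higher priority; task1 carries a per-message TTL so it is
# redirected to ConsumerA if ConsumerB does not pick it up in time.
ch.basic_publish(exchange="", routing_key="consumer_b", body=b"task2",
                 properties=pika.BasicProperties(priority=9))
ch.basic_publish(exchange="", routing_key="consumer_b", body=b"task1",
                 properties=pika.BasicProperties(priority=1, expiration="30000"))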
Approach 2
Using topics (check the link at the bottom) together with priorities could also be a good solution. I have not tested both on the same exercise, but I cannot see why it would not work.
In this scenario, only one queue with one exchange is needed, and ConsumerA will consume only task1 messages while ConsumerB will consume both, giving a higher priority to task2.
Topics: https://www.rabbitmq.com/tutorials/tutorial-five-python.html
I'm creating Celery tasks in a situation where there are more task producers than consumers (workers). Since my queues are getting filled up and the workers consume in FCFS order, can I execute a specific task (given a task_id) instantly?
For example:
My tasks are queued in the following fashion: [1,2,3,4,5,6,7,8,9,0]. The tasks are fetched from the zeroth index. Now a situation arises where I want to execute task 8 ahead of all the others. How can I do this?
The worker need not execute that task (because there can be a situation where the workers are already occupied); it can be run directly from the application. And when the task is completed (either by a worker or directly by the application), it should be deleted from the queue.
I know how to forcefully revoke a task (given a task_id), but how can I execute a task given an id?
how can I execute a task given an id ?
The short answer is: you can't. Celery workers pull tasks off the broker as they become available.
Why not?
Note that this is not a limitation of Celery as such; rather, it is a characteristic of message queuing systems (MQS) in general. The point of an MQS is to decouple an application's components so that the producer can go on to do other work while workers execute the tasks asynchronously. In other words, once a task has been sent off it cannot be modified (but it can be removed, as long as it has not started yet).
What options are there?
Celery offers you several options to deal with lower- vs. higher-priority or short- and long-running tasks at task submission time (see the sketch after this list):
Routing - tasks can be routed to different workers. So if your tasks [0 .. 9] are all long-running, except for task 8, you could route task 8 to a worker, or a set of workers, that deal with short-running tasks.
Timed execution - specify a countdown or estimated time of arrival (eta) for each task. That's a good option if you know that some tasks can be delayed for later execution i.e. when the system will be less busy. This leaves workers ready for those tasks that need to be executed immediately.
Task expiry - specify an expiry countdown or time with a callback. This way the task will be revoked if it didn't execute within the time allotted to it, and the callback can start an alternative course of action.
Check on task results periodically, revoke a task if it didn't start executing within some time. Note this is different from task expiry where the revoking only happens once a worker has fetched the task from the queue - if the queue is full the revoking may happen too late for your use case. Checking results periodically means you have another component in your system that does this and determines an alternate course of action.
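A minimal sketch of those submission-time options, assuming a task named process_item and a dedicated "fast" queue (both names, and the broker URL, are placeholders):

from celery import Celery

app = Celery("example", broker="redis://localhost:6379/0")

@app.task
def process_item(item_id):
    ...

# Routing: send the urgent task to a queue served by dedicated short-task workers.
process_item.apply_async(args=[8], queue="fast")

# Timed execution: delay a less urgent task so workers stay free for urgent ones.
process_item.apply_async(args=[3], countdown=600)

# Task expiry: revoke the task if it has not started within 5 minutes.
process_item.apply_async(args=[5], expires=300)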
Is there any way to make Celery recheck if there are any tasks in the main queue ready to be started? Will the remote command add_consumer() do the job?
The reason: I am running several concurrent tasks, which spawn multiple sub-processes. When the tasks are done, the sub-processes sometimes take a few seconds to finish, so because the concurrency limit is maxed out by the sub-processes, a new task from the queue is never started. And because Celery does not check again when the sub-processes finish, the queue gets stalled with no active tasks. Therefore I want to add a periodic task that tells Celery to recheck the queue and start the next task. How do I tell Celery this?
From the docs:
The add_consumer control command will tell one or more workers to start consuming from a queue. This operation is idempotent.
Yes, add_consumer does what you want. You could also combine that with a periodic task to "recheck the queue and start the next task" every so often (depending on your needs).
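A minimal sketch of wiring that up with a periodic task; the queue name, module name, and schedule are placeholders, not a definitive setup:

# tasks.py (assumed module name, so the registered task is "tasks.recheck_queue")
from celery import Celery

app = Celery("example", broker="redis://localhost:6379/0")

@app.task
def recheck_queue():
    # Tell the workers to (re)start consuming from the main queue.
    # The command is idempotent, so repeating it is safe.
    app.control.add_consumer("celery", reply=True)

# Run the recheck every 30 seconds via the beat scheduler.
app.conf.beat_schedule = {
    "recheck-main-queue": {"task": "tasks.recheck_queue", "schedule": 30.0},
}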