It seems that Celery (v4.1) can either be used with some prefetching of tasks, or with CELERY_ACKS_LATE=True (discussed here)
We currently work with CELERY_ACKS_LATE=False and CELERYD_PREFETCH_MULTIPLIER=1
In both cases, there are unacknowledged messages in Rabbit.
At times we suffer from network issues that cause Celery to lose the connection to Rabbit for a few seconds, producing warnings like: consumer: Connection to broker lost. Trying to re-establish the connection...
When this happens, the unacknowledged messages turn back to Ready (which seems to be the standard behaviour) and are consumed by another consumer.
This causes tasks to be executed multiple times: the consumer had already started the prefetched task in a worker process, but couldn't ack it to Rabbit.
Since it seems impossible to guarantee that tasks are executed exactly once in Celery without external tools, how is it possible to ensure that tasks are executed at most once?
----- Edit ----
One approach I'm considering is to use the task's self.request.delivery_info['redelivered'] flag and fail tasks that were redelivered.
While this achieves the goal of "executing at most once", it will have a high rate of false positives (redelivered tasks that hadn't actually been executed).
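A minimal sketch of that approach, assuming a RabbitMQ broker (the task name and do_work() are placeholders, and note that Reject is tied to late acks):

    from celery import Celery
    from celery.exceptions import Reject

    app = Celery('tasks', broker='amqp://localhost')

    def do_work():
        ...  # placeholder for the actual task body

    @app.task(bind=True, acks_late=True)
    def at_most_once_task(self):
        # RabbitMQ sets the redelivered flag on any message it returned to
        # the queue, e.g. after a dropped connection.
        if self.request.delivery_info.get('redelivered'):
            # Drop without requeueing: at-most-once semantics, at the cost
            # of the false positives described above.
            raise Reject('message was redelivered, possibly already executed',
                         requeue=False)
        do_work()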
There is a difference between a task being executed exactly once and a task having no extra side effects when executed multiple times; i.e., if your tasks are not idempotent, then executing the same task twice will lead to bugs.
What I recommend is to allow tasks to execute several times, but to make them idempotent, so that re-execution has no effect.
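For illustration, a hedged sketch of one way to achieve idempotency, using a completion marker in Redis (the Redis URLs, key format, and process_payment() are assumptions, not a prescribed design):

    import redis
    from celery import Celery

    app = Celery('tasks', broker='redis://localhost:6379/0')
    marker_store = redis.Redis.from_url('redis://localhost:6379/1')

    def process_payment(order_id):
        ...  # hypothetical side-effecting work

    @app.task
    def charge_customer(order_id):
        # SET NX succeeds only for the first caller; any redelivered or
        # duplicated copy of this task sees a falsy result and exits
        # without side effects.
        if not marker_store.set(f'done:charge:{order_id}', 1, nx=True, ex=86400):
            return 'already processed'
        process_payment(order_id)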
Related
After playing with some "defect" scenarios in Celery (with Redis as the broker, for what it's worth), we came to the understanding that there is effectively no sense in setting acks_late=true without also setting task_reject_on_worker_lost=true, because the task won't be rescheduled (again, in our tests): it stays in the "unacked" category forever.
At the same time, everybody says that acks_late makes the task subject to rescheduling on the same or another worker, so the question is: when does that happen?
The official docs say that:
Note that the worker will acknowledge the message if the child process executing the task is terminated (either by the task calling sys.exit(), or by signal) even when acks_late is enabled. This behavior is intentional as…
We don’t want to rerun tasks that forces the kernel to send a SIGSEGV (segmentation fault) or similar signals to the process.
We assume that a system administrator deliberately killing the task does not want it to automatically restart.
A task that allocates too much memory is in danger of triggering the kernel OOM killer, the same may happen again.
A task that always fails when redelivered may cause a high-frequency message loop taking down the system.
If you really want a task to be redelivered in these scenarios you should consider enabling the task_reject_on_worker_lost setting.
What are possible examples of "something went wrong" that don't fall into the "worker terminated deliberately or due to a signal caught" category?
Reboot, power outage, hardware failure. n.b., all of your examples assume that the prefetch multiplier is 1.
Note that there is a difference between the Celery worker process and the child processes that actually execute the tasks.
By default, when you create a Celery worker, it will create one "parent" process and x child processes which execute the tasks, where x is the number of CPUs you have (you can read more about this, and how to configure it, in the docs)
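For illustration, the pool size can also be pinned explicitly instead of defaulting to the CPU count; a minimal sketch ('proj' is a placeholder app name):

    from celery import Celery

    app = Celery('proj', broker='amqp://localhost')
    app.conf.worker_concurrency = 4  # number of prefork child processes
    # Equivalent to starting the worker with: celery -A proj worker --concurrency=4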
I have tested all the different scenarios, these are my conclusions:
acks_late is about what happens when the worker dies. task_reject_on_worker_lost is about the actual process executing the task.
For example, if I have a k8s pod running a Celery worker: if I send SIGKILL (cold shutdown) to the pod, having acks_late set to true will make sure that the task is picked up by a different worker.
But if I somehow kill the child process executing the task (by going inside the pod and killing the child process, for example, or if the process exits by itself), the task will not be picked up even if acks_late is true.
If you set task_reject_on_worker_lost to true, the task will be picked up again.
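For reference, a sketch of the two settings being combined (the app name and broker URL are placeholders):

    from celery import Celery

    app = Celery('proj', broker='amqp://localhost')
    app.conf.task_acks_late = True               # ack only after the task finishes
    app.conf.task_reject_on_worker_lost = True   # requeue if the child process dies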
Hope that clarifies everything.
Is there a way for me to configure Celery to just drop tasks in case of a non-graceful shutdown of a worker? It's more critical for me that tasks are not repeated than that they are always delivered.
As mentioned in the docs:
If a task isn’t acknowledged within the Visibility Timeout the task will be redelivered to another worker and executed.
This causes problems with ETA/countdown/retry tasks where the time to execute exceeds the visibility timeout; in fact if that happens it will be executed again, and again in a loop.
So you have to increase the visibility timeout to match the time of the longest ETA you’re planning to use.
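A sketch of raising that timeout with a Redis broker, as the quoted docs suggest (the one-day value is illustrative):

    from celery import Celery

    app = Celery('proj', broker='redis://localhost:6379/0')
    app.conf.broker_transport_options = {
        'visibility_timeout': 60 * 60 * 24,  # seconds; must exceed your longest ETA
    }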
My use case is that I am using a visibility_timeout of 1 day, but even that is not enough in some cases: I want to schedule tasks even further in the future. "Power failure" or any other event causing a non-graceful shutdown is very rare, and I'm fine with tasks being dropped in, say, 0.01% of cases. Moreover, a task executed 1 day later than it was supposed to run is as bad as the task not being run at all.
One obvious, hacky, way is to set visibility_timeout to 100 years. Is there a better way?
There's an acks_late configuration, but the default value is false (so make sure you didn't enable it):
The acks_late setting would be used when you need the task to be executed again if the worker (for some reason) crashes mid-execution. It's important to note that the worker isn't known to crash, and if it does it's usually an unrecoverable error that requires human intervention (bug in the worker, or task code).
(quote from here)
The definition of task_acks_late (it seems the name has changed in a recent version, hence the mismatch) can be found here.
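For reference, a sketch of the setting under its current name: since Celery 4 the old upper-case CELERY_ACKS_LATE became task_acks_late, and it can also be set per task:

    from celery import Celery

    app = Celery('proj', broker='amqp://localhost')
    app.conf.task_acks_late = False  # global default; was CELERY_ACKS_LATE before Celery 4

    @app.task(acks_late=True)        # per-task override
    def careful_task():
        ...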
What are the implications of disabling gossip, mingle, and heartbeat on my celery workers?
In order to reduce the number of messages sent to CloudAMQP to stay within the free plan, I decided to follow these recommendations. I therefore used the options --without-gossip --without-mingle --without-heartbeat. Since then, I have been using these options by default for all my celery projects but I am not sure if there are any side-effects I am not aware of.
Please note:
we have now moved to a Redis broker and no longer have such tight limitations on the number of messages sent to the broker
we have several instances running multiple celery workers with multiple queues
This is the base documentation, which doesn't give us much info:
heartbeat
Is related to communication between the worker and the broker (in your case the broker is CloudAMQP).
See explanation
With --without-heartbeat, the worker won't send heartbeat events.
mingle
It only asks for "logical clocks" and "revoked tasks" from other workers on startup.
Taken from whatsnew-3.1
The worker will now attempt to synchronize with other workers in the same cluster.
Synchronized data currently includes revoked tasks and logical clock.
This only happens at startup and causes a one second startup delay to collect broadcast responses from other workers.
You can disable this bootstep using the --without-mingle argument.
Also see docs
gossip
Workers send events to all other workers, and this is currently used for "clock synchronization", but it's also possible to write your own handlers for events such as on_node_join (see docs).
Taken from whatsnew-3.1
Workers are now passively subscribing to worker related events like heartbeats.
This means that a worker knows what other workers are doing and can detect if they go offline. Currently this is only used for clock synchronization, but there are many possibilities for future additions and you can write extensions that take advantage of this already.
Some ideas include consensus protocols, reroute task to best worker (based on resource usage or data locality) or restarting workers when they crash.
We believe that although this is a small addition, it opens amazing possibilities.
You can disable this bootstep using the --without-gossip argument.
Celery workers started up with the --without-mingle option, as #ofirule mentioned above, will not receive synchronization data from other workers, particularly revoked tasks. So if you revoke a task, all workers currently running will receive that broadcast and store it in memory so that when one of them eventually picks up the task from the queue, it will not execute it:
https://docs.celeryproject.org/en/stable/userguide/workers.html#persistent-revokes
But if a new worker starts up before that task has been dequeued by a worker that received the broadcast, it doesn't know to revoke the task. If it eventually picks up the task, then the task is executed. You will see this behavior if you're running in an environment where you are dynamically scaling in and out celery workers constantly.
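One partial mitigation, per the persistent-revokes docs linked above, is to give each worker a state database so its own revokes survive a restart; a sketch (the path is illustrative, and note this does not deliver missed broadcasts to a brand-new worker):

    from celery import Celery

    app = Celery('proj', broker='amqp://localhost')
    # Same effect as starting the worker with --statedb=/var/run/celery/worker.state
    app.conf.worker_state_db = '/var/run/celery/worker.state'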
I wanted to know if the --without-heartbeat flag would impact the worker's ability to detect broker disconnects and attempt to reconnect. The documentation referenced above only opaquely refers to these heartbeats acting at the application layer rather than the TCP/IP layer. What I really want to know is: does eliminating these messages affect my worker's ability to function, specifically its ability to detect a broker disconnect and then reconnect appropriately?
I ran a few quick tests myself and found that with the --without-heartbeat flag passed, workers still detect broker disconnect very quickly (initiated by me shutting down the RabbitMQ instance), and they attempt to reconnect to the broker and do so successfully when I restart the RabbitMQ instance. So my basic testing suggests the heartbeats are not necessary for basic health checks and functionality. What's the point of them anyway? It's unclear to me, but they don't appear to have an impact on worker functionality.
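One plausible explanation, offered as an assumption rather than a confirmed reading of the docs: --without-heartbeat suppresses the application-level worker heartbeat events, while the connection-level AMQP heartbeat that helps detect dead links is configured separately. A sketch of the latter:

    from celery import Celery

    app = Celery('proj', broker='amqp://localhost')
    # AMQP connection-level heartbeat, negotiated with RabbitMQ; distinct from
    # the worker *event* heartbeats that --without-heartbeat disables.
    app.conf.broker_heartbeat = 120  # seconds; 0 disables it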
I'm creating a Celery task in a situation where there are more task producers than consumers (workers). Since my queues are getting filled up and the workers consume in FCFS order, can I execute a specific task (given a task_id) immediately?
For example:
My tasks are queued in the following fashion: [1,2,3,4,5,6,7,8,9,0], and tasks are fetched from the zeroth index. Now a situation arises where I want to execute task 8 above all others. How can I do this?
The worker need not execute that task (because there can be a situation where all workers are already occupied); it can be run directly from the application. And when the task is completed (either by a worker or directly by the application), it should be deleted from the queue.
I know how to forcefully revoke a task (given a task_id), but how can I execute a task given an id?
how can I execute a task given an id?
The short answer is: you can't. Celery workers pull tasks off the broker backend as they become available.
Why not?
Note that this is not a limitation of Celery as such; rather, it is a characteristic of message queuing systems (MQS) in general. The point of an MQS is to decouple an application's components so that the producer can go on to do other work while workers execute the tasks asynchronously. In other words, once a task has been sent off it cannot be modified (but it can be removed, as long as it has not been started yet).
What options are there?
Celery offers you several options for dealing with lower- vs. higher-priority or short- and long-running tasks at task submission time (a sketch follows this list):
Routing - tasks can be routed to different workers. So if your tasks [0 .. 9] are all long-running, except for task 8, you could route task 8 to a worker, or a set of workers, that deal with short-running tasks.
Timed execution - specify a countdown or estimated time of arrival (eta) for each task. That's a good option if you know that some tasks can be delayed for later execution i.e. when the system will be less busy. This leaves workers ready for those tasks that need to be executed immediately.
Task expiry - specify an expires countdown or time with a callback. This way the task will be revoked if it didn't execute within the time allotted to it, and the callback can start an alternative course of action.
Check on task results periodically, revoke a task if it didn't start executing within some time. Note this is different from task expiry where the revoking only happens once a worker has fetched the task from the queue - if the queue is full the revoking may happen too late for your use case. Checking results periodically means you have another component in your system that does this and determines an alternate course of action.
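A sketch of the first three options at submission time (quick_task, slow_task, and the 'fast' queue are placeholders; the expiry callback is left out):

    from proj.tasks import quick_task, slow_task

    # Routing: send the short-running task to a queue served by dedicated workers.
    quick_task.apply_async(queue='fast')

    # Timed execution: defer a task that can wait ten minutes.
    slow_task.apply_async(countdown=600)

    # Task expiry: revoke the task if no worker has started it within an hour.
    slow_task.apply_async(expires=3600)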
I'm working on a Python based system, to enqueue long running tasks to workers.
The tasks originate from an outside service that generates a "token", but once they're created based on that token, they should run continuously, and be stopped only when explicitly removed by code.
The task starts a WebSocket and loops on it. If the socket is closed, it reopens it. Basically, the task shouldn't reach conclusion.
My goals in architecting this solution are:
When gracefully restarting a worker (for example to load new code), the task should be re-added to the queue, and picked up by some worker.
The same should happen when an ungraceful shutdown happens.
2 workers shouldn't work on the same token.
Other processes may create more tasks that should be directed to the same worker that's handling a specific token. This will be resolved by sending those tasks to a queue named after the token, which the worker should start listening to after starting the token's task. I am listing this requirement as an explanation of why a task engine is even required here.
Independent servers, fast code reload, etc. - Minimal downtime per task.
All our server side is Python, and it looks like Celery is the best platform for it.
Are we using the right technology here? Any other architectural choices we should consider?
Thanks for your help!
According to the docs
When shutdown is initiated the worker will finish all currently executing tasks before it actually terminates, so if these tasks are important you should wait for it to finish before doing anything drastic (like sending the KILL signal).
If the worker won’t shutdown after considerate time, for example because of tasks stuck in an infinite-loop, you can use the KILL signal to force terminate the worker, but be aware that currently executing tasks will be lost (unless the tasks have the acks_late option set).
You may get something like what you want by using retry or acks_late.
Overall I reckon you'll need to implement some extra application-side job control, plus, maybe, a lock service.
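For the "2 workers shouldn't work on the same token" requirement, a hedged sketch of such a lock service using Redis (key naming, timeouts, and run_websocket_loop() are assumptions, not a prescribed design):

    import redis
    from celery import Celery

    app = Celery('proj', broker='redis://localhost:6379/0')
    locks = redis.Redis.from_url('redis://localhost:6379/1')

    def run_websocket_loop(token):
        ...  # hypothetical: open the WebSocket for this token and loop on it

    @app.task(bind=True, acks_late=True)
    def handle_token(self, token):
        # blocking_timeout=0: give up immediately if another worker holds the lock.
        lock = locks.lock(f'token-lock:{token}', timeout=3600, blocking_timeout=0)
        if not lock.acquire():
            return  # another worker already owns this token
        try:
            # A real implementation would also renew the lock periodically,
            # since this task is intended never to finish.
            run_websocket_loop(token)
        finally:
            lock.release()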
But, yes, overall you can do this with celery. Whether there are better technologies... that's out of the scope of this site.