Python Celery - Raise if queue is not available

I have defined a route in my Celery configs:
task_routes = {'tasks.add': {'queue': 'calculate'}}
So that only a specific worker will run that task. I start my worker:
celery -A myproj worker -n worker1@%h -Q calculate
And then run my task:
add.apply_async((2, 2), time_limit=5)
Everything goes well. But now, let's say my worker dies and I try to run my task again. It hangs forever. time_limit doesn't do me any good, since the task never gets consumed from its queue. How can I define a timeout in this case? In other words, if no queue is available in the next X seconds, I'd like to raise an error. Is that possible?

I'm assuming you are using RabbitMQ as the message broker, and if you are, there are a few subtleties about how RabbitMQ (and other AMQP-like message queues) work.
First of all, when you send a message, your process sends it to an exchange, which in turn routes the message to zero or more queues. Your queue may or may not have a consumer (i.e. a Celery worker) consuming messages, but as a sender you have no control over the receiving side unless there is an active reply from that worker.
However, I think it is possible to achieve what you want by doing the following (assuming you have a result backend):
Make sure your queue is declared with a message TTL of your choice (let's say 60 seconds). Also make sure it is not declared as auto-delete when no consumers are attached. Also declare a dead-letter exchange for it.
Have a Celery worker listening on your dead-letter exchange, but have that worker raise an appropriate exception whenever it receives a message. The easiest way is probably to listen for the messages but not have any tasks loaded; this will result in a FAILURE in your backend complaining about an unregistered task.
If your original worker dies, any message in the queue will expire after your selected TTL and be sent to your dead-letter exchange, at which point the second worker (the auto-failing one) will receive the message and fail the task.
Note that you need to set your TTL well above the time you expect the message to linger in the RabbitMQ queue, as it will expire regardless of whether a worker is consuming from the queue or not.
To set up the first queue, I think you need a configuration looking something like:
from kombu import Queue

Queue(
    default_queue_name,
    default_exchange,
    routing_key=default_routing_key,
    queue_arguments={
        'x-message-ttl': 60000,  # milliseconds
        'x-dead-letter-exchange': deadletter_exchange_name,
        'x-dead-letter-routing-key': deadletter_routing_key,
    })
The dead-letter queue would look more like a standard Celery worker queue configuration, but you may want to have a separate config for it, since you don't want to load any tasks for that worker.
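A sketch of that dead-letter side (the exchange and routing-key placeholders are the ones from the snippet above; deadletter_queue_name and the broker/backend URLs are additional placeholders):

from celery import Celery
from kombu import Exchange, Queue

deadletter_exchange = Exchange(deadletter_exchange_name, type='direct')

deadletter_queue = Queue(
    deadletter_queue_name,
    deadletter_exchange,
    routing_key=deadletter_routing_key)

# A separate app for the auto-failing worker: it consumes only the
# dead-letter queue and registers no tasks, so the result backend
# records a FAILURE for every message that ends up here.
dead_app = Celery('deadletter', broker=broker_url, backend=backend_url)
dead_app.conf.task_queues = (deadletter_queue,)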
So to sum up, yes it is possible but it is not as straightforward as one might think.

Related

Tasks linger in celery amqp when publisher is terminated

I am using Celery with a RabbitMQ server. I have a publisher, which could potentially be terminated by a SIGKILL and since this signal cannot be watched, I cannot revoke the tasks. What would be a common approach to revoke the tasks where the publisher is not alive anymore?
I experimented with an interval on the worker side, but the publisher is obviously not registered as a worker, so I don't know how I can detect a timeout.
There's nothing built into Celery to monitor producer/publisher status; only worker/consumer status is tracked. One alternative you can consider is an expiring Redis key that the publisher refreshes periodically, which serves as a proxy for whether the publisher is still alive. The task then checks whether that key still exists in Redis, and if it doesn't, the task returns without doing anything.
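A minimal sketch of that idea, assuming a plain redis-py client; the key names, TTL and task are made up for illustration:

import redis
from celery import Celery

app = Celery('myproj', broker='redis://localhost:6379/0')
r = redis.Redis()

def mark_publisher_alive(publisher_id, ttl=30):
    # Publisher side: refresh an expiring "alive" flag periodically while running.
    r.set(f'publisher-alive:{publisher_id}', '1', ex=ttl)

@app.task
def process(publisher_id, payload):
    # Worker side: if the flag has expired, the publisher is gone; skip the work.
    if not r.exists(f'publisher-alive:{publisher_id}'):
        return None
    # ... do the real work ...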
I am pretty sure what you want is not possible with Celery, so I suggest you to shift your logic around and redesign everything to be part of a Celery workflow (or several Celery canvases depends on the actual use-case). My experience with Celery is that you can build literally any workflow you can imagine with those Celery primitives and/or custom Celery signatures.
Another solution, which works in my case, is to enqueue the next task only once the currently processed ones have finished. That way the queue doesn't fill up.
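One way to express this, sketched with Celery's chain primitive (the task names here are made up):

from celery import chain

# process_item / process_next_item are hypothetical tasks. Each link in the
# chain is only published to the broker once the previous one has finished,
# so the queue never holds more than one pending message per chain.
chain(process_item.s(item_id), process_next_item.s()).apply_async()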

What are the consequences of disabling gossip, mingle and heartbeat for celery workers?

What are the implications of disabling gossip, mingle, and heartbeat on my celery workers?
In order to reduce the number of messages sent to CloudAMQP to stay within the free plan, I decided to follow these recommendations. I therefore used the options --without-gossip --without-mingle --without-heartbeat. Since then, I have been using these options by default for all my celery projects but I am not sure if there are any side-effects I am not aware of.
Please note:
we now moved to a Redis broker and do not have that much limitations on the number of messages sent to the broker
we have several instances running multiple celery workers with multiple queues
This is the base documentation which doesn't give us much info
heartbeat
Is related to communication between the worker and the broker (in your case the broker is CloudAMQP).
See explanation
With --without-heartbeat, the worker won't send heartbeat events.
mingle
It only asks for "logical clocks" and "revoked tasks" from other workers on startup.
Taken from whatsnew-3.1
The worker will now attempt to synchronize with other workers in the same cluster.
Synchronized data currently includes revoked tasks and logical clock.
This only happens at startup and causes a one second startup delay to collect broadcast responses from other workers.
You can disable this bootstep using the --without-mingle argument.
Also see docs
gossip
Workers send events to all other workers and this is currently used for "clock synchronization", but it's also possible to write your own handlers for events such as on_node_join. See docs
Taken from whatsnew-3.1
Workers are now passively subscribing to worker related events like heartbeats.
This means that a worker knows what other workers are doing and can detect if they go offline. Currently this is only used for clock synchronization, but there are many possibilities for future additions and you can write extensions that take advantage of this already.
Some ideas include consensus protocols, reroute task to best worker (based on resource usage or data locality) or restarting workers when they crash.
We believe that although this is a small addition, it opens amazing possibilities.
You can disable this bootstep using the --without-gossip argument.
Celery workers started with the --without-mingle option, as #ofirule mentioned above, will not receive synchronization data from other workers, in particular revoked tasks. So if you revoke a task, all workers currently running will receive that broadcast and store it in memory, so that when one of them eventually picks up the task from the queue, it will not execute it:
https://docs.celeryproject.org/en/stable/userguide/workers.html#persistent-revokes
But if a new worker starts up before that task has been dequeued by a worker that received the broadcast, it doesn't know to revoke the task. If it eventually picks up the task, then the task is executed. You will see this behavior if you're running in an environment where you are dynamically scaling in and out celery workers constantly.
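For reference, this is what the revoke broadcast looks like, and revokes can be persisted to disk so they also survive worker restarts (the task id below is made up):

from celery import Celery

app = Celery('myproj', broker='redis://localhost:6379/0')

# Broadcast a revoke; every worker that is currently running stores the id in memory.
app.control.revoke('d9078da5-9915-40a0-bfa1-392c7bde42ed')

# Workers started with a state file persist revokes across restarts:
#   celery -A myproj worker --statedb=/var/run/celery/worker.state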
I wanted to know if the --without-heartbeat flag would impact the worker's ability to detect broker disconnects and attempt to reconnect. The documentation referenced above only opaquely refers to these heartbeats acting at the application layer rather than the TCP/IP layer. What I really want to know is: does eliminating these messages affect my worker's ability to function, specifically to detect a broker disconnect and then try to reconnect appropriately?
I ran a few quick tests myself and found that with the --without-heartbeat flag passed, workers still detect broker disconnects very quickly (initiated by me shutting down the RabbitMQ instance), and they attempt to reconnect to the broker and do so successfully when I restart the RabbitMQ instance. So my basic testing suggests the heartbeats are not necessary for basic health checks and functionality. What's the point of them anyway? It's unclear to me, but they don't appear to have any impact on worker functionality.

Celery doesn't acknowledge tasks if stopped too quickly

For a project using Celery, I would like to test the execution of a task.
I know that the documentation advises mocking it, but since I'm not using the official client, I want to check with a dedicated test that everything works well.
Then I set up a very simple task that takes as parameters an Unix socket name and a message to write to it: the task opens the connection on the socket, writes the message and closes the connection.
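The task itself looks roughly like this (a sketch; the names are mine, not the actual test code):

import socket
from celery import Celery

app = Celery('tests', broker='amqp://localhost')

@app.task
def write_to_socket(socket_name, message):
    # Open the connection on the Unix socket, write the message, close it.
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        sock.connect(socket_name)
        sock.sendall(message.encode())
    finally:
        sock.close()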
Inside the tests, the Celery worker is launched with a subprocess: I start it before sending the task, send it a SIGTERM when I receive the message on the socket and then wait for the process to close.
Everything goes well: the message is received, it matches what is expected and the worker correctly terminates.
But I found that when the tests stop, a message still remains within the RabbitMQ queue, as if the task had never been acknowledged.
I confirmed this by looking at the RabbitMQ graphical interface: a "Deliver" occurs after the task is executed but no "Acknowledge".
This seems strange because, with the default configuration, the acknowledgement should be sent before task execution.
Going further in my investigations I noticed that if I add a sleep of a split second just before sending SIGTERM to the worker, the task is acknowledged.
I tried to inspect the executions with or without sleep using strace, here are the logs:
Execution with a sleep of 0.5s.
Execution without sleep.
The only noticeable difference I see is that with sleep the worker has time to start a new communication with the broker. It receives an EAGAIN from a recvfrom and sends a frame "\1\0\1\0\0\0\r\0<\0P\0\0\0\0\0\0\0\1\0\316".
Is this the acknowledge? Why does this occur so late?
I give you the parameters with which I launch the Celery worker: celery worker --app tests.functional.tasks.app --concurrency 1 --pool solo --without-heartbeat.
The --without-heartbeat is just here to reduce differences between executions with or without sleep. Otherwise an additional heartbeat frame would occur in the execution with sleep.
Thanks.

How to detect a Celery task doing a similar job before running another task?

My Celery task does time-consuming calculations on a database-stored entity. The workflow is: get information from the database, compile it into a serializable object, save the object. Other tasks then do other calculations (like rendering images) on the loaded object.
But serialization is time-consuming, so I'd like to have one long-running task per entity, which holds the serialized object in memory and processes client requests delivered through a messaging queue (Redis pub/sub). If there are no requests for a while, the task exits. After that, if a client needs some job done, it starts another task, which loads the object, processes it, and stays around for a while for further jobs. At startup this task should check that it is the only one working on this particular entity, to avoid collisions. So what is the best strategy to check whether another task is already running for this entity?
1) The first idea is to send a message to a channel associated with the entity and wait for a response. Bad idea: the target task can be busy with calculations, and waiting for a response with a timeout just wastes time.
2) Storing the Celery task id in the database is even worse: the task can be killed but the record will remain, so we would still need to make sure the target task is alive.
3) The third idea is to inspect workers for running tasks, checking their state for the entity id (which each task would report at startup). It also seems that collisions can happen here, e.g. if several tasks are scheduled but not running yet.
For now I think idea 1 is the best, with a modification: on startup the task sends a message with its startup time to the entity channel, but then immediately starts working instead of waiting for a response. It then checks the message queue, and if someone responds, the two tasks compare timestamps and the one with the later timestamp quits. This seems complicated enough; is there a better solution?
The final solution is to start a supervisor thread in the task, which replies to 'discover' messages from competing tasks.
So the workflow is like this:
Task starts, then subscribes to a Redis pub/sub channel keyed by the entity ID
Task sends a 'discover' message to the channel
Task waits a little bit
Task looks for a 'reply' among the incoming messages on the channel; if found, it exits
Task starts a supervisor thread, which answers with 'reply' to all incoming 'discover' messages
This works fine except when several tasks start simultaneously, e.g. after a worker restart. To avoid this, the subscription process needs to be made atomic, using a Redis lock:
from redis import StrictRedis

class RedisChannel:
    def __init__(self, channel_id):
        self.channel_id = channel_id
        self.redis = StrictRedis()
        self.channel = self.redis.pubsub()
        # Take a Redis lock so that subscribing is atomic across competing tasks
        with self.redis.lock(channel_id):
            self.channel.subscribe(channel_id)
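And a rough sketch of the discover/reply logic on top of that channel (the method names and timings are mine, not from the original code):

import threading
import time

class DiscoverableChannel(RedisChannel):
    def is_alone(self, wait=1.0):
        # Announce ourselves, then wait briefly for a competing task to answer.
        self.redis.publish(self.channel_id, 'discover')
        deadline = time.time() + wait
        while time.time() < deadline:
            msg = self.channel.get_message(timeout=0.1)
            if msg and msg['type'] == 'message' and msg['data'] == b'reply':
                return False  # another task already owns this entity
        return True

    def start_supervisor(self):
        # Reply to every incoming 'discover' so that later tasks exit.
        def loop():
            while True:
                msg = self.channel.get_message(timeout=1.0)
                if msg and msg['type'] == 'message' and msg['data'] == b'discover':
                    self.redis.publish(self.channel_id, 'reply')
        threading.Thread(target=loop, daemon=True).start()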

What happens to a Celery Worker's scheduled (eta) tasks when it shuts down?

I've been learning about celery and haven't been able to find the answer to a conceptual question and have had odd results experimenting.
When there are scheduled tasks (by scheduled, I don't mean periodic but scheduled to run in the future using eta=x) submitted to Celery, they seem to be consumed from the queue by a worker right away (rather than staying in the Redis default celery key/queue). Presumably, the worker will actually execute the tasks at eta.
What happens if that worker were to be shut down or restarted (to update its registered tasks, for example)? Would those scheduled tasks be lost? They are not "running", so a warm shutdown wouldn't wait for them to finish, of course.
Is there a way to force those tasks to be returned to the queue and consumed by the next available worker?
I suppose, manually, one could dump the tasks before shutting down a worker:
http://celery.readthedocs.org/en/latest/userguide/workers.html#inspecting-workers
and resubmit them when a new worker is back up... but is this supposed to happen automatically?
Would really appreciate any help with this
Thanks
Take a look at acks_late
http://celery.readthedocs.org/en/latest/reference/celery.app.task.html#celery.app.task.Task.acks_late
If set to true, Celery will keep the message in the queue until the task has been successfully executed.
Update: Celery 5.1
Even with acks_late enabled, workers will acknowledge the message when the worker process executing the task abruptly exits or is killed. This is the default and intentional behaviour set forth by the library. [Ref]
To change the default settings and re-queue your unfinished tasks, you can use the task_reject_on_worker_lost config. [Ref]
Keep in mind, though, that this could lead to a message loop and can cause unintended effects if your tasks are not idempotent.
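A minimal sketch of the two settings together (new-style lowercase config names):

# celeryconfig.py
task_acks_late = True               # ack after the task finishes, not before it starts
task_reject_on_worker_lost = True   # re-queue the message if the worker process dies mid-task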
Specifically for eta tasks, the queue waits for workers to acknowledge the tasks before deleting them. With default settings, Celery workers ack right before the task is executed, and with acks_late, when the task has finished executing.
So when a worker fails to ack a task, whether because of a shutdown, restart or lost connection, or because the Redis/SQS visibility_timeout is exceeded [ref], the queue will redeliver the message to another available worker.
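With Redis or SQS as the broker, the visibility timeout is worth setting explicitly for eta tasks; a sketch (the value is only an example and must exceed the longest ETA you plan to use):

# celeryconfig.py
broker_transport_options = {'visibility_timeout': 43200}  # seconds (12 hours)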
