Strategy for worker addressing - Python

I have an application which is delegating some operations to a celery task. The operations must be performed by different workers, depending on some parameters. I have thought about implementing this using queues. My idea is the following:
The client requests actions from a specific message queue1
If worker1 (exclusively responsible for queue1) is already active, it will process the request
If no worker is listening to queue1, a catch-all worker (worker-main) will instantiate worker1. The request will be forwarded to worker1.
worker1 will shut itself down after some time without being used
My understanding of celery is limited, and I have several questions.
How do I implement worker-main in celery? This is a worker listening to all queues, but with lower priority than any other worker; that is, it will only act if the request is not taken by any other worker.
How does worker-main create worker1? Once created, worker1 must be associated with queue1, with higher precedence than worker-main.
Can a request be forwarded from worker-main to worker1? The reply should be sent to the client directly.
Can worker1 shut itself down?
You can see a graphical description of the architecture that I am trying to implement in the image below:

You could link together "worker main" and "worker1" in a sequential workflow so that "worker main" always handles the job as step 1, but simply returns and does nothing if it detects that "worker1" is already up.
So the task hits "worker main" first: "worker main" checks whether the server that worker1 runs on is up and, if it is not, brings it up, waits for it to be fully up, and then returns. I tested a proof of concept to see how link works in Celery to create a sequential workflow; somebody with more real-world experience may have better solutions. It also contains error handling, in case bringing the worker up fails.
Note that there is no concept of queues in this approach. Furthermore, you could give worker1 and worker2 different task names instead of differentiating on parameters; the client can parse the parameters and then select which Celery task to execute.
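A rough sketch of this link approach (the task names and the helpers that check and start worker1's host are hypothetical):

from celery import Celery

app = Celery('tasks', broker='amqp://')

@app.task
def ensure_worker1_up(params):
    # Step 1: do nothing if worker1's host is already up; otherwise
    # bring it up and wait until it is ready (helpers are assumed).
    if not worker1_host_is_up():
        start_worker1_host_and_wait()
    return params

@app.task
def process_request(params):
    # Step 2: the actual work that is meant to end up on worker1.
    ...

@app.task
def handle_bootstrap_error(request, exc, traceback):
    # Error handling: called if ensure_worker1_up raises, e.g. the host never came up.
    print('Could not bring worker1 up: {0!r}'.format(exc))

# The client always submits step 1; step 2 is only dispatched once step 1 succeeds.
ensure_worker1_up.apply_async(
    args=({'some': 'params'},),
    link=process_request.s(),
    link_error=handle_bootstrap_error.s(),
)

Since there are no queues in this variant, any available worker may pick up process_request; as noted above, the client can instead differentiate by choosing different task names.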


Python Celery - Raise if queue is not available

I have defined a route in my Celery configs:
task_routes = {'tasks.add': {'queue': 'calculate'}}
So that only a specific worker will run that task. I start my worker:
celery -A myproj worker -n worker1@%h -Q calculate
And then run my task:
add.apply_async((2, 2), time_limit=5)
Everything goes well. But now, let's say my worker dies and I try to run my task again. It hangs, forever. time_limit doesn't do me any good, since the task never reaches its queue. How can I define a timeout in this case? In other words, if no queue is available in the next X seconds, I'd like to raise an error. Is that possible?
I'm assuming you are using Rabbitmq as the message broker, and if you are, there are a few subtleties about how Rabbitmq (and other AMQP-like message queues) work.
First of all, when you send a message, your process sends it to an exchange, which in turn routes the message to 0 or more queues. Your queue may or may not have a consumer (i.e. a celery worker) consuming messages, but as a sender you have no control of the receiving side unless there is an active reply from that worker.
However, I think it is possible to achieve what you want by doing the following (assuming you have a result backend):
Make sure your queue is declared with a message TTL of your choice (let's say 60 seconds). Also make sure it is not declared as auto-delete when no consumers are attached, and declare a dead-letter exchange.
Have a celery worker listening to your dead-letter exchange, but have that worker raise an appropriate exception whenever it receives a message. The easiest way is probably to listen for the messages but not have any tasks loaded; this will result in a FAILURE in your backend saying something about a task that is not registered.
If your original worker dies, any message in the queue will expire after your selected TTL and be sent to your dead-letter exchange, at which point the second worker (the auto-failing one) will receive the message and fail the task.
Note that you need to set your TTL well above the time you expect the message to linger in the Rabbitmq queue, as it will expire regardless of there being a worker consuming from the queue or not.
To set up the first queue, I think you need a configuration looking something like:
from kombu import Queue

Queue(
    default_queue_name,
    default_exchange,
    routing_key=default_routing_key,
    queue_arguments={
        'x-message-ttl': 60000,  # milliseconds
        'x-dead-letter-exchange': deadletter_exchange_name,
        'x-dead-letter-routing-key': deadletter_routing_key,
    })
The dead letter queue would look more like a standard celery worker queue configuration, but you may want to have a separate config for it, since you don't want to load any tasks for this worker.
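For illustration, the dead-letter side might be declared along these lines, reusing the placeholder names from the snippet above:

from kombu import Exchange, Queue

deadletter_exchange = Exchange(deadletter_exchange_name, type='direct')

deadletter_queue = Queue(
    deadletter_queue_name,
    deadletter_exchange,
    routing_key=deadletter_routing_key,
)

The worker consuming this queue would be started from a config that loads no tasks, so every dead-lettered message fails with an unregistered-task error, as described above.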
So to sum up, yes it is possible but it is not as straightforward as one might think.

What are the consequences of disabling gossip, mingle and heartbeat for celery workers?

What are the implications of disabling gossip, mingle, and heartbeat on my celery workers?
In order to reduce the number of messages sent to CloudAMQP to stay within the free plan, I decided to follow these recommendations. I therefore used the options --without-gossip --without-mingle --without-heartbeat. Since then, I have been using these options by default for all my celery projects but I am not sure if there are any side-effects I am not aware of.
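For reference, the worker invocation with those flags looks like this (project name assumed):

celery -A myproj worker --without-gossip --without-mingle --without-heartbeat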
Please note:
we now moved to a Redis broker and do not have that much limitations on the number of messages sent to the broker
we have several instances running multiple celery workers with multiple queues
This is the base documentation which doesn't give us much info
heartbeat
Is related to communication between the worker and the broker (in your case the broker is CloudAMQP).
See explanation
With --without-heartbeat, the worker won't send heartbeat events.
mingle
It only asks for "logical clocks" and "revoked tasks" from other workers on startup.
Taken from whatsnew-3.1
The worker will now attempt to synchronize with other workers in the same cluster.
Synchronized data currently includes revoked tasks and logical clock.
This only happens at startup and causes a one second startup delay to collect broadcast responses from other workers.
You can disable this bootstep using the --without-mingle argument.
Also see docs
gossip
Workers send events to all other workers, and this is currently used for "clock synchronization", but it's also possible to write your own handlers for events such as on_node_join. See the docs.
Taken from whatsnew-3.1
Workers are now passively subscribing to worker related events like heartbeats.
This means that a worker knows what other workers are doing and can detect if they go offline. Currently this is only used for clock synchronization, but there are many possibilities for future additions and you can write extensions that take advantage of this already.
Some ideas include consensus protocols, reroute task to best worker (based on resource usage or data locality) or restarting workers when they crash.
We believe that although this is a small addition, it opens amazing possibilities.
You can disable this bootstep using the --without-gossip argument.
Celery workers started up with the --without-mingle option, as @ofirule mentioned above, will not receive synchronization data from other workers, particularly revoked tasks. So if you revoke a task, all workers currently running will receive that broadcast and store it in memory so that when one of them eventually picks up the task from the queue, it will not execute it:
https://docs.celeryproject.org/en/stable/userguide/workers.html#persistent-revokes
But if a new worker starts up before that task has been dequeued by a worker that received the broadcast, it doesn't know to revoke the task. If it eventually picks up the task, then the task is executed. You will see this behavior if you're running in an environment where you are dynamically scaling in and out celery workers constantly.
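For illustration, the broadcast in question is just a revoke call (the task id here is made up, and app is your Celery application instance):

# Every worker running at this moment receives the broadcast and remembers the id,
# so it will skip the task if it later dequeues it; a worker started afterwards won't know.
app.control.revoke('d9078da5-9915-40a0-bfa1-392c7bde42ed')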
I wanted to know if the --without-heartbeat flag would impact the worker's ability to detect a broker disconnect and attempt to reconnect. The documentation referenced above only opaquely refers to these heartbeats acting at the application layer rather than the TCP/IP layer. What I really want to know is: does eliminating these messages affect my worker's ability to function, specifically to detect a broker disconnect and then try to reconnect appropriately?
I ran a few quick tests myself and found that with the --without-heartbeat flag passed, workers still detect a broker disconnect very quickly (initiated by me shutting down the RabbitMQ instance), and they attempt to reconnect to the broker and do so successfully when I restart the RabbitMQ instance. So my basic testing suggests the heartbeats are not necessary for basic health checks and functionality. What's the point of them anyway? It's unclear to me, but they don't appear to have an impact on worker functionality.

Understanding celery worker nodes

I am trying to understand the working of celery and AMQP here.
My scenario
I install celery in my machine
pip install celery
I define a task using:
from celery import Celery

app = Celery('tasks', backend='amqp', broker='amqp://')

@app.task
def print_hello():
    print('hello there')
As far as I understand, Celery converts this task into a message and sends it to the broker (Redis or RabbitMQ) via the AMQP protocol. These messages are then queued and delivered to worker nodes, which process them.
My questions are,
Suppose I created the task in a Java environment and the message is sent to an external worker node; does that mean the worker node's server must have Java installed on it to execute the task?
If the message is picked up by an external worker node, how do the worker node and the broker find each other? In the above code I only have the broker address for storing tasks.
Also, why are we storing the tasks in a broker? Why couldn't we implement the exchange algorithm in Celery and send the message directly to workers?
What is the difference between SOAP and AMQP?
The workers need not only Python, but all the code for the tasks you want to run on them.
But you don't address the nodes specifically; that is precisely why there is a broker. You put your tasks on the queue, and the workers pick them up.
I have no idea why you've mentioned SOAP in this context. It has nothing whatsoever to do with anything.
The specific answers to your questions are:
"if the message is sent to an external worker node" is slightly misleading. A message is not sent to a worker node per se. It is sent to the Broker (identified by a URL), and specifically to an Exchange on that broker with a Routing Key that sees it landing in a Queue. Workers are all configured with the same Broker URL and read this Queue, and it's very much a case of first in, best dressed: the first Worker to consume the message gets it (in AMQP, reading a message removes it from the Queue in one atomic operation). The messages are language independent. The Workers, however, are written in Python and the task definition must be in Python, though the Python task definition can of course call out to any other library by whatever means to execute the task. But in a sense yes: whatever runtime libraries your task needs in order to run must be present on the same machine as the Worker, and they must have a Python wrapper around them so the Worker can load them.
"If the message is picked up by an external worker node, how do the worker node and broker find each other?" - They don't find each other. The Worker is configured with exactly the same Broker URL as the Client; it has to know the URL. The way Celery typically solves this in Python is that the code snippet you shared is loaded by both the Client and the Worker. This is in fact one of the beauties of Celery: you write your tasks in Python and load the definitions in the Worker unaltered, so Client and Worker use the same Broker and have the same Task defined. The @app.task decorator actually creates a Task class instance which has two very important methods: apply_async(), which creates and sends the message requesting the task, and run(), which runs the decorated function. The former is called in the Client, the latter by the Worker (to actually run the task).
"Why are we storing the tasks in a broker?" - Tasks are not stored in a broker. The task is defined in a Python file like your code snippet. As described in point 2, the same definition is read by both Client and Worker, and a message is sent from Client to Worker asking it to run the task.
"Why couldn't we implement the exchange algorithm in Celery and send the message directly to workers?" - I'll have to take a guess here, but I would ask: why reinvent the wheel? There is a standard, AMQP (the Advanced Message Queuing Protocol), and there are a number of implementations of that standard. Why write yet another one? Celery is FOSS, and like so much FOSS I imagine the people who started writing it wanted to focus on task management rather than message management, and chose to lean on AMQP for the latter. A fair choice. But for what it's worth, Celery does implement quite a lot in Kombu, to provide a Python API to AMQP.
SOAP (abbreviation for Simple Object Access Protocol) is a messaging protocol specification for exchanging structured information in the implementation of web services in computer networks.
AMQP (abbreviation for Advanced Message Queuing Protocol) is an open standard application layer protocol for message-oriented middleware. The defining features of AMQP are message orientation, queuing, routing (including point-to-point and publish-and-subscribe), reliability and security.
SOAP typically sits much higher in the protocol stack. The differences are described here:
https://www.amqp.org/product/different
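Returning to the shared-definition point above, a rough illustration (module name and broker URL assumed): the same tasks.py is imported unchanged by both the Client and the Worker.

# tasks.py -- loaded by both the client and the worker
from celery import Celery

app = Celery('tasks', backend='rpc://', broker='amqp://guest@localhost//')

@app.task
def add(x, y):
    return x + y

The client only publishes a message; a worker started with celery -A tasks worker imports the same module and executes the task when the message arrives:

from tasks import add

result = add.apply_async((2, 3))   # sends a message to the broker
print(result.get(timeout=10))      # prints 5 once some worker has run the task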

Celery resume consuming from a queue

I have different celery queues, and at a certain point I want workers to stop consuming from my queues:
celery_app.control.cancel_consumer(consumer_queue)
After a while I want to be able to resume the consumers, and I do that with the following command:
celery.control.add_consumer(
    consumer_queue,
    routing_key=consumer_queue,
    destination=['worker-name'],
)
At this point I expect worker-name to be fetching tasks from consumer_queue, to which my custom router directs tasks by routing_key. But instead I get this output from celery inspect:
celery.control.inspect().active_queues()
{'celery@worker-name': []}
Some details
Celery: celery==3.1.23
Kombu: kombu==3.0.35
billiard: billiard==3.3.0.23
Note: adding the consumer via celery flower (flower==0.8.4) works, even though the command is the same.
What am I doing wrong, and how do I re-enable consuming in a proper way?
OK, it was a premature question with a simple solution: I provided the wrong name for the worker; instead of worker-name I should have provided the celery@worker-name identifier.
For debugging purposes it's also useful to set the reply=True argument:
response = celery.control.add_consumer(
    consumer_queue,
    routing_key=consumer_queue,
    destination=['celery@{}'.format(consumer)],
    reply=True,
)
print(response)
and you'll see whether the operation was successful or not:
[{u'celery@worker-name': {u'ok': u'add consumer consumer-queue'}}]

Persistent Long Running Tasks in Celery

I'm working on a Python based system, to enqueue long running tasks to workers.
The tasks originate from an outside service that generates a "token", but once they're created based on that token, they should run continuously, and stop only when explicitly removed by code.
The task starts a WebSocket and loops on it. If the socket is closed, it reopens it. Basically, the task shouldn't reach conclusion.
My goals in architecting this solution are:
When gracefully restarting a worker (for example to load new code), the task should be re-added to the queue, and picked up by some worker.
The same thing should happen when an ungraceful shutdown happens.
Two workers shouldn't work on the same token.
Other processes may create more tasks that should be directed to the same worker that's handling a specific token. This will be resolved by sending those tasks to a queue named after the token, which the worker should start listening to after starting the token's task. I am listing this requirement as an explanation of why a task engine is even required here.
Independent servers, fast code reload, etc. - Minimal downtime per task.
All our server side is Python, and it looks like Celery is the best platform for it.
Are we using the right technology here? Any other architectural choices we should consider?
Thanks for your help!
According to the docs
When shutdown is initiated the worker will finish all currently executing tasks before it actually terminates, so if these tasks are important you should wait for it to finish before doing anything drastic (like sending the KILL signal).
If the worker won’t shutdown after considerate time, for example because of tasks stuck in an infinite-loop, you can use the KILL signal to force terminate the worker, but be aware that currently executing tasks will be lost (unless the tasks have the acks_late option set).
You may get something like what you want by using retry or acks_late.
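A minimal sketch of those two options, with the WebSocket loop and its exception type assumed:

from celery import Celery

app = Celery('tasks', broker='amqp://')

@app.task(bind=True, acks_late=True, max_retries=None)
def consume_token(self, token):
    # acks_late: the message is only acknowledged once the task returns,
    # so an abrupt worker death leaves it on the queue for redelivery.
    try:
        run_websocket_loop(token)      # assumed helper that loops on the socket
    except SocketClosedError as exc:   # assumed exception type
        # Explicitly put the task back on the queue so any worker can resume it.
        raise self.retry(exc=exc, countdown=5)

For graceful restarts you would still need the worker to finish or re-queue the running task itself, which is part of the extra application-side job control mentioned next.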
Overall I reckon you'll need to implement some extra application-side job control, plus, maybe, a lock service.
But, yes, overall you can do this with celery. Whether there are better technologies... that's out of the scope of this site.
