Celery routing to multiple tasks rather than hosts - python

I am working on porting an application I wrote from Golang (using redis) to Python and I would love to use Celery to accomplish my task queuing, but I have a question regarding routing...
My application receives "events" via REST POSTs, where each "event" can be of a different type. I then want to have workers in the background wait for events of certain types. The caveat here is that ONE event could result in MORE than ONE task handling the event. For example:
some/lib/a/tasks.py
@task
def handle_event_typeA(a, b, c):
    # handles event...
    pass

@task
def handle_event_typeB(a, b, c):
    # handles other event...
    pass
some/lib/b/tasks.py
@task
def handle_event_typeA(a, b, c):
    # handles event slightly differently... but still same event...
    pass
In summary... I want to be able to run N number of workers (across X number of machines), and each one of these workers will have Y number of tasks registered, such as a.handle_event_typeA, b.handle_event_typeA, etc... and I want to be able to insert a task into a queue and have one worker pick up the task and route it to more than one task in the worker (i.e. to both a.handle_event_typeA and b.handle_event_typeA).
I have read over the documentation of Kombu here and Celery's routing documentation here and I can't seem to figure out how to configure this correctly.
I have been using Celery for some time now for more traditional workflows and I am very happy with its feature set, performance, and stability. I would implement what I need using Kombu directly or some homebrew solution, but I would like to use Celery if at all possible.
Thanks guys! I hope I don't waste anyone's time with this question.
Edit 1
After some more time thinking about this issue I have come up with a workaround to implement what I want with Celery. It's not the most elegant solution, but it's working well. I am using Django and its caching abstraction (you can use something like memcached or redis directly instead). Here's the snippet I came up with:
from django.core.cache import cache
from celery.execute import send_task

SUBSCRIBERS_KEY = 'task_subscribers.{0}'

def subscribe_task(key, task):
    # get current list of subscribers
    cache_key = SUBSCRIBERS_KEY.format(key)
    subscribers = cache.get(cache_key) or []
    # get the task name (accept either a task object or a dotted name)
    if hasattr(task, 'delay'):
        name = task.name
    else:
        name = task
    # add to list
    if name not in subscribers:
        subscribers.append(name)
    # set cache
    cache.set(cache_key, subscribers)

def publish_task(key, *args):
    # get current list of subscribers
    cache_key = SUBSCRIBERS_KEY.format(key)
    subscribers = cache.get(cache_key) or []
    # iterate through all subscribers and execute task
    for task in subscribers:
        # send celery task by name
        send_task(task, args=args, kwargs={})
What I then do is subscribe to tasks in different modules by doing the following:
subscribe_task('typeA', 'some.lib.b.tasks.handle_event_typeA')
Then I can call the publish task method when handling the REST events.
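For illustration, here is a rough sketch of what the publishing side could look like inside a Django view (the EventView class and the payload field names are hypothetical, not part of my actual app):
from django.http import JsonResponse
from django.views import View

class EventView(View):
    def post(self, request):
        # hypothetical payload: an event type plus the task arguments
        event_type = request.POST.get('type')  # e.g. 'typeA'
        a = request.POST.get('a')
        b = request.POST.get('b')
        c = request.POST.get('c')
        # fan the event out to every task subscribed to this event type
        publish_task(event_type, a, b, c)
        return JsonResponse({'status': 'queued'})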

Related

How do I fetch results of a Celery task within another Celery task?

Pardon my ignorance as I am learning how I can use celery for my purposes.
Suppose I have two tasks: create_ticket and add_message_to_ticket. Usually the create_ticket task is created and completed before the add_message_to_ticket tasks (which may be created multiple times) arrive.
import random
import time

@app.task
def create_ticket(ticket_id):
    time.sleep(random.uniform(1.0, 4.0))  # replace with code that processes ticket creation
    return f"Successfully processed ticket creation: {ticket_id}"

@app.task
def add_message_to_ticket(ticket_id, who, when, message_contents):
    # TODO add code that checks whether the create_ticket task for ticket_id has already completed
    time.sleep(random.uniform(1.0, 4.0))  # replace with code that handles the added message
    return f"Successfully processed message for ticket {ticket_id} by {who} at {when}"
Now suppose that these tasks are created out of order because the Python server receives the events from an external web service out of order. For example, one add_message_to_ticket.delay(82, "auroranil", 1599039427, "This issue also occurs on Microsoft Edge on Windows 10.") gets called a few seconds before create_ticket.delay(82) gets called. How would I solve the following problems?
How would I fetch the result of the create_ticket task by specifying ticket_id within the add_message_to_ticket task? All I can think of is to maintain a database that stores ticket state and check whether a particular ticket has been created, but I want to know if I am able to use celery's result backend somehow.
If I receive an add_message_to_ticket task with a ticket id whose corresponding create_ticket task has not completed, do I reject that task and put it back in the queue?
Do I need to ensure that the tasks are idempotent? I know that is good practice, but is it a requirement for this to work?
Is there a better approach to solving this problem? I am aware of the Celery Canvas workflow with primitives such as chain, but I am not sure how I can ensure that these events are processed in order, or be able to put tasks in a pending state while they wait for the tasks they depend on to be completed, based on the arguments I want celery to check, which in this case is ticket_id.
I am not particularly worried if I receive multiple user messages for a particular ticket with timestamps out of order, as that is not as important as knowing that a ticket has been created before messages are added to it. The point I am making is that I am coding up several tasks where some events crucially depend on others, whereas the ordering of other events does not matter as much for the server to function.
Edit:
Partial solutions:
Use task_id to identify Celery tasks, with a formatted string containing argument values which identifies that task. For example, task_id="create_ticket(\"TICKET000001\")"
Retry tasks that do not meet dependency requirements. Blocking for subtasks to be completed is bad, as subtask may never complete, and will hog a process in one of the worker machines.
Store arguments as part of result of a completed task, so that you can use that information not available in later tasks.
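A rough sketch of how the first two partial solutions could fit together (the ticket_task_id helper is just an illustrative name, and the broker/backend URLs are assumptions):
from celery import Celery
from celery.result import AsyncResult

app = Celery(broker="redis://localhost:6379/0",
             backend="redis://localhost:6379/1")

def ticket_task_id(ticket_id):
    # deterministic task_id built from the argument values
    return 'create_ticket("{}")'.format(ticket_id)

@app.task
def create_ticket(ticket_id):
    ...  # process ticket creation

@app.task(bind=True, max_retries=10, default_retry_delay=5)
def add_message_to_ticket(self, ticket_id, who, when, message_contents):
    # look up the create_ticket task by its deterministic id
    if not AsyncResult(ticket_task_id(ticket_id), app=app).successful():
        # dependency not met yet: retry later instead of blocking a worker
        raise self.retry()
    return f"Successfully processed message for ticket {ticket_id} by {who} at {when}"

# enqueue create_ticket with a pinned task_id so it can be found later:
# create_ticket.apply_async(args=("TICKET000001",), task_id=ticket_task_id("TICKET000001"))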
Relevant links:
Where do you set the task_id of a celery task?
Retrieve result from 'task_id' in Celery from unknown task
Find out whether celery task exists
More questions:
How do I ensure that I send a task only once per task_id? For instance, I want the create_ticket task to be applied asynchronously only once. This is an alternative to making all tasks idempotent.
How do I use AsyncResult in add_message_to_ticket to check for status of create_ticket task? Is it possible to specify a chain somehow even though the first task may have already been completed?
How do I fetch all results of tasks given task name derived from the name of the function definition?
Most importantly, should I use Celery results backend to abstract stored data away from dealing with a database? Or should I scratch this idea and just go ahead with designing a database schema instead?

Huey not calling tasks in Django

I have a Django rest framework app that calls 2 huey tasks in succession in a serializer create method like so:
...
def create(self, validated_data):
    user = self.context['request'].user
    player_ids = validated_data.get('players', [])
    game = Game.objects.create()
    tasks.make_players_friends_task(player_ids)
    tasks.send_notification_task(user.id, game.id)
    return game
# tasks.py
from huey.contrib.djhuey import db_task

@db_task()
def make_players_friends_task(ids):
    players = User.objects.filter(id__in=ids)
    # process players

@db_task()
def send_notification_task(user_id, game_id):
    user = User.objects.get(id=user_id)
    game = Game.objects.get(id=game_id)
    # send notifications
When I hit this endpoint while running the huey process in the terminal, I can see that only one or the other of the tasks is ever called, but never both. I am running huey with the default settings (redis with 1 worker thread).
If I alter the code so that I am passing in the objects themselves as parameters, rather than the ids, and remove the django queries in the @db_task methods, things seem to work alright.
The reason I initially used the ids as parameters is because I assumed (or read somewhere) that huey uses json serialization as default, but after looking into it, pickle is actually the default serializer.
One theory is that since I am only running one worker, and also have a @db_periodic_task method in the app, the process can only handle listening for tasks or executing them at any time, but not both. This is the way celery seems to work, where you need a separate process each for the scheduler and the worker, but this isn't mentioned in huey's documentation.
If you run the huey consumer it will actually spawn a separate scheduler alongside the number of workers you've specified, so that's not going to be your problem.
You're not giving enough information to properly see what's going wrong, so check the following:
If you run the huey consumer in the terminal, observe whether all your tasks show up as properly registered so that the consumer is actually capable of consuming them.
Check whether your redis process is running.
Try performing the tasks with a blocking call to see on which tasks it fails:
task_result = tasks.make_players_friends_task(player_ids)
task_result.get(blocking=True)
task_result = tasks.send_notification_task(user.id, game.id)
task_result.get(blocking=True)
Do this with a debugger or print statements to see whether it makes it to the end of your function or where it gets stuck.
Make sure to always restart your consumer when you change code. It doesn't automatically pick up new code like the django dev server does. The fact that your code works as intended when pickling whole objects instead of passing ids could point to this, as it would be really weird for that change alone to break it. On the other hand, you shouldn't pass django ORM objects into tasks; your id approach makes way more sense.

How to ensure task execution order per user using Celery, RabbitMQ and Django?

I'm running Django, Celery and RabbitMQ. What I'm trying to achieve is to ensure that tasks related to one user are executed in order (specifically, one at a time; I don't want task concurrency per user):
whenever a new task is added for a user, it should depend on the most recently added task. Additional functionality might include not adding a task to the queue if a task of this type is already queued for this user and has not yet started.
I've done some research and:
I couldn't find a way to link newly created task with already queued one in Celery itself, chains seem to be only able to link new tasks.
I think that both functionalities are possible to implement with custom RabbitMQ message handler, though it might be hard to code after all.
I've also read about celery-tasktree and this might be the easiest way to ensure execution order, but how do I link a new task with an already "applied_async" task_tree or queue? Is there any way I could implement that additional no-duplicate functionality using this package?
Edit: There is also this "lock" example in the celery cookbook, and while the concept is fine, I can't see a possible way to make it work as intended in my case - simply put, if I can't acquire the lock for a user, the task would have to be retried, but that means pushing it to the end of the queue.
What would be the best course of action here?
If you configure the celery workers so that they can only execute one task at a time (see worker_concurrency setting), then you could enforce the concurrency that you need on a per user basis. Using a method like
NUMBER_OF_CELERY_WORKERS = 10

def get_task_queue_for_user(user):
    return "user_queue_{}".format(user.id % NUMBER_OF_CELERY_WORKERS)
to get the task queue based on the user id, every task for a given user will be assigned to the same queue. The workers would need to be configured to only consume tasks from a single task queue.
It would play out like this:
User 49 triggers a task
The task is sent to user_queue_9
When the one and only celery worker that is listening to user_queue_9 is ready to consume a new task, the task is executed
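As a sketch only (the task name, project name and broker URL below are made up), routing a task to the user's queue could look like this; each worker is then started consuming a single queue:
from celery import Celery

app = Celery("proj", broker="redis://localhost:6379/0")  # broker is an assumption

@app.task
def process_user_action(user_id, payload):
    ...  # per-user work that must not run concurrently

def enqueue_for(user, payload):
    # route the task to that user's dedicated queue at call time,
    # using the get_task_queue_for_user() helper defined above
    process_user_action.apply_async(
        args=(user.id, payload),
        queue=get_task_queue_for_user(user),
    )

# each worker then consumes exactly one of those queues, e.g.:
#   celery -A proj worker -Q user_queue_9 --concurrency=1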
This is a hacky answer though, because
requiring just a single celery worker for each queue is a brittle system -- if the celery worker stops, the whole queue stops
the workers are running inefficiently

In celery how to get task position in queue?

I'm using Celery with Redis as broker and I can see that the queue is actually a redis list with the serialized task as the items.
My question is, if I have an AsyncResult object as a result of calling <task>.delay(), is there a way to determine the item's position in the queue?
UPDATE:
I'm finally able to get the position using:
from celery.task.control import inspect
i = inspect()
i.reserved()
but it's a bit slow since it needs to communicate with all the workers.
The inspect.reserved()/scheduled() calls you mention may work, but they are not always accurate since they only take into account the tasks that the workers have prefetched.
Celery does not allow out-of-band operations on the queue, like removing messages from the queue or reordering them, because that will not scale in a distributed system. The messages may not have reached the queue yet, which can result in race conditions, and in practice it is not a sequential queue with transactional operations, but a stream of messages originating from several locations. That is, the Celery API is based around strict message passing semantics.
It is possible to access the queue directly on some of the brokers Celery supports (like Redis or Database), but that is not part of the public API and you are discouraged from doing so; of course, if you are not planning on supporting operations at scale you should do whatever is the most convenient for you and discard my advice.
If this is just to give the user some idea of when their job will be completed, then I'm sure you could come up with an algorithm to predict when the task will be executed, if you just had the length of the queue and the time at which each task was inserted. The first is just a redis.llen("celery"), and the latter you could record yourself by listening to the task_sent signal:
from time import time

from celery.signals import task_sent

@task_sent.connect
def record_insertion_time(task_id=None, **kwargs):
    # `redis` is assumed to be a configured redis-py client;
    # store the task id in a sorted set scored by its insertion time
    redis.zadd("celery.insertion_times", {task_id: time()})
Using a sorted set here: http://redis.io/commands/zadd
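A rough sketch of the prediction part, assuming the default "celery" queue and the sorted set populated above (the estimated_position helper is just for illustration):
import redis

r = redis.Redis()

def estimated_position(task_id):
    # how many tasks were inserted before this one (None if we never saw it)
    rank = r.zrank("celery.insertion_times", task_id)
    if rank is None:
        return None
    total_inserted = r.zcard("celery.insertion_times")
    still_queued = r.llen("celery")
    consumed = total_inserted - still_queued
    # tasks ahead of this one that have not been consumed yet
    return max(rank - consumed, 0)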
For a pure message passing solution you could use a dedicated monitor that consumes the Celery event stream and predicts when tasks will finish.
http://docs.celeryproject.org/en/latest/userguide/monitoring.html#event-reference
(I just noticed that the task-sent event is missing the timestamp field in the documentation, but a timestamp is sent with that event, so I will fix it.)
The events also contain a "clock" field, which is a logical clock (see http://en.wikipedia.org/wiki/Lamport_timestamps); this can be used to detect the order of events in a distributed system without depending on the system time of each machine being in sync (which is ~impossible to achieve).
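A minimal sketch of such a monitor, using the event receiver from the Celery app API (the broker URL is an assumption, and the handler just prints what it sees):
from celery import Celery

app = Celery(broker="redis://localhost:6379/0")

def on_task_sent(event):
    # each event carries uuid, a timestamp and the logical "clock" field
    print(event["uuid"], event.get("timestamp"), event.get("clock"))

with app.connection() as connection:
    receiver = app.events.Receiver(connection, handlers={
        "task-sent": on_task_sent,
    })
    # note: task-sent events are only emitted when the
    # task_send_sent_event setting is enabled
    receiver.capture(limit=None, timeout=None, wakeup=True)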

How to track revoked tasks across multiple celeryd processes

I have a reminder type app that schedules tasks in celery using the "eta" argument. If the parameters in the reminder object changes (e.g. time of reminder), then I revoke the task previously sent and queue a new task.
I was wondering if there's any good way of keeping track of revoked tasks across celeryd restarts. I'd like to have the ability to scale celeryd processes up/down on the fly, and it seems that any celeryd processes started after the revoke command was sent will still execute that task.
One way of doing it is to keep a list of revoked task ids, but this method will result in the list growing arbitrarily. Pruning this list requires guarantees that the task is no longer in the RabbitMQ queue, which doesn't seem to be possible.
I've also tried using a shared --statedb file for each of the celeryd workers, but it seems that the statedb file is only updated on termination of the workers and thus not suitable for what I would like to accomplish.
Thanks in advance!
Interesting problem, I think it should be easy to solve using broadcast commands.
When a new worker starts up, it can request all the other workers to dump their revoked tasks to the new worker. This means adding two new remote control commands; you can easily add new commands by using @Panel.register.
Module control.py:
from celery.worker import state
from celery.worker.control import Panel

@Panel.register
def bulk_revoke(panel, ids):
    state.revoked.update(ids)

@Panel.register
def broadcast_revokes(panel, destination):
    panel.app.control.broadcast(
        "bulk_revoke",
        arguments={"ids": list(state.revoked)},
        destination=destination,
    )
Add it to CELERY_IMPORTS:
CELERY_IMPORTS = ("control", )
The only missing piece now is to connect it so that the new worker triggers broadcast_revokes at startup. I guess you could use the worker_ready signal for this:
from celery import current_app as celery
from celery.signals import worker_ready

def request_revokes_at_startup(sender=None, **kwargs):
    celery.control.broadcast("broadcast_revokes",
                             destination=[sender.hostname])

worker_ready.connect(request_revokes_at_startup)
I had to do something similar in my project and used celerycam with django-admin-monitor. The monitor takes a snapshot of tasks and saves them in the database periodically, and there is a nice user interface to browse and check the status of all tasks. You can use it even if your project is not Django based.
I implemented something similar to this some time ago, and the solution I came up with was very similar to yours.
The way I solved this problem was to have the worker fetch the Task object from the database when the job ran (by passing it the primary key, as the documentation recommends). In your case, before the reminder is sent the worker should perform a check to ensure that the task is "ready" to be run. If not, it should simply return without doing any work (assuming that the ETA has changed and another worker will pick up the new job).
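A small sketch of that "check before doing work" pattern, assuming a hypothetical Reminder model with a scheduled_time field:
from celery import shared_task

@shared_task
def send_reminder(reminder_pk, scheduled_for):
    # Reminder is a hypothetical Django model; re-fetch the current state
    # from the database instead of trusting the queued payload
    reminder = Reminder.objects.filter(pk=reminder_pk).first()
    if reminder is None or reminder.scheduled_time != scheduled_for:
        # the reminder was deleted or rescheduled; a newer task will handle it
        return
    reminder.send()  # do the work only if the task is still "ready" to run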
