Celery groups and chains - python

I need to arrange some tasks in Celery so that some of them run as a single sequence and some run in parallel, and when the tasks in a group have completed, execution should pass to the next step:
chain(
    task1.s(),
    task2.s(),
    group(task3.s(), task4.s()),
    group(task5.s(), task6.s(), task7.s()),
    task7.s()
).delay()
But I think what I did is wrong. Does anybody have an idea how to do this?
Also, I don't care about sending the result of each task to the others.

This one finally worked:
chain(
    task1.s(),
    task2.s(),
    chord([task3.s(), task4.s()], body=task_result.s(), immutable=True),
    chord([task5.s(), task6.s(), task7.s()], body=task_result.s(), immutable=True),
    task7.s()
).delay()

This sounds like a chord, i.e. where you execute tasks in parallel and have a callback into another task when the parallel tasks are finished: http://docs.celeryproject.org/en/latest/userguide/canvas.html#chords
So you might have to change it to something like:
chain(task1.s(), task2.s(), chord(task3.s(), task4.s())(chord(task5.s(), task6.s(), task7.s())(task7.s())))
Also, chains/groups etc. always return their results and pass them on to the child task(s), so you have to model the task arguments accordingly.
As it's quite a complex workflow, you might be better off calling the next task from within the previous task (like calling task2.s().delay() at the end of task1) - but I guess there's no way around modelling the chord.
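Since the question notes that results don't need to be passed between tasks, a minimal sketch using immutable signatures (.si()) may be all that's needed; in recent Celery versions, chaining a group into a following signature automatically upgrades the group to a chord (task names are the ones from the question):
from celery import chain, group

# .si() creates an immutable signature, so each task ignores its
# parent's result instead of receiving it as an argument
chain(
    task1.si(),
    task2.si(),
    # a group followed by another signature inside a chain is
    # automatically upgraded to a chord by Celery
    group(task3.si(), task4.si()),
    group(task5.si(), task6.si(), task7.si()),
    task7.si(),
).delay()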

Related

Celery execute tasks in the order they get called (during runtime)

I have a task consisting of subtasks in a chain. How can I ensure a second call of this task does not start before the first one has finished?
@shared_task
def task(user):
    res = chain(subtask_1.s(),  # each subtask takes ~1 hour
                subtask_2.s(),
                subtask_3.s())
    return res.apply_async()
A django view might now trigger to call this task:
# user A visits page that triggers task
task.delay(userA)
# 10 seconds later, while task() is still executing, user B visits page
task.delay(userB)
This leads to the tasks racing each other instead of being executed in sequential order. E.g. once a worker has finished with subtask_1() of the first task, it begins working on subtask_1() of the second task, instead of subtask_2() and subtask_3() of the first one.
Is there a way to elegantly avoid this? I guess the problem is the order in which the subtasks get added to the queue.
I have already set worker --concurrency=1, however that still doesn't change the order in which the worker consumes from the queue.
The official docs (task cookbook) seem to offer a solution, which I don't understand and which unfortunately doesn't work for me.
Perhaps include a blocking mechanism within the task, after the chain, with a while not res.ready(): sleep(1) kind of hack?
You can wait for the first task to finish and then execute the second one, like this:
res = task.delay(userA)
res.get() # will block until finished
task.delay(userB)
But this will block the calling thread until the first task has finished. You can chain the tasks to avoid blocking, but for that you have to modify the task signature a little to accept a task result as an argument.
@shared_task
def task(_, user):  # the signature takes one extra argument
    # skipped
and
from celery.canvas import chain
chain(task.s(None, userA), task.s(userB))()
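Alternatively, if threading a None through the signature feels awkward, immutable signatures achieve the same sequencing without modifying the task at all; a minimal sketch, assuming the original single-argument task:
from celery import chain

# .si() makes each signature immutable, so the second task never
# receives the first task's return value and task() keeps its
# original single-argument signature
chain(task.si(userA), task.si(userB))()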

How to monitor a group of tasks in celery?

I have a situation where a periodic monthly big_task reads a file and enqueues one chained task per row in this file, where the chained tasks are small_task_1 and small_task_2:
from celery import chain, group
from celery.schedules import crontab
from celery.task import PeriodicTask

class BigTask(PeriodicTask):
    run_every = crontab(hour=0, minute=0, day_of_month=1)

    def run(self):
        task_list = []
        with open("the_file.csv") as f:
            for row in f:
                t = chain(
                    small_task_1.s(row),
                    small_task_2.s(),
                )
                task_list.append(t)
        gr = group(*task_list)
        r = gr.apply_async()
I would like to get statistics about the number of enqueued, failed tasks (and detail about the exception) for each small_task, as soon as all of them are finished (whatever the status is) to send a summary email to the project admins.
I first thought of using a chord, but the callback is not executed if any of the header tasks fails, which will surely happen in my case.
I could also use r.get() in the BigTask, which is very convenient, but it's not recommended to wait for a task's result inside another task (even if, here, I guess the risk of a worker deadlock is low since the task will be executed only once a month).
Important note: input file contains ~700k rows.
How would you recommend I proceed?
I'm not sure it helps with the monitoring, but regarding the chord-and-callback issue, you could use a link_error callback (for catching exceptions). In your case, for example, you can use it like:
small_task_1.s(row).set(link_error=error_task.s())
and implement a celery error_task that sends you a notification or whatever.
In Celery 4, you can set it once for the whole canvas (but it didn't work for me in 3.1):
r = gr.apply_async(link_error=error_task)
For the monitoring part, you can use flower of course.
Hope that helps.
EDIT: An alternative (without using additional persistence) would be to catch the exception and add some logic to the result and the callback. For example:
def small_task_1():
    try:
        # do stuff
        return 'success', result
    except Exception:
        return 'fail', result
and then, in your callback task, iterate over the result tuples and check for failures before doing the actual logic.
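A minimal sketch of such a callback, assuming every header task returns a (status, result) tuple as above (the task name task_summary is hypothetical):
from celery import task

@task
def task_summary(results):
    # results is the list of (status, result) tuples returned by
    # the header tasks of the chord
    failures = [res for status, res in results if status == 'fail']
    successes = [res for status, res in results if status == 'success']
    # build and send the summary email to the admins from here
    print("%d succeeded, %d failed" % (len(successes), len(failures)))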
I found the best solution to be to iterate over the group results after the group is ready.
When you issue a group, you get back a ResultSet object. You can .save() this object to retrieve it later and check whether it's .ready(), or you can call .join() and wait for the results.
When it ends, you can access .results and you have a list of AsyncResult objects. These objects all have a .state property that you can check to see whether the task was successful or not.
However, you can only check the results after the group ends. While it's in progress, you can read .completed_count() to get an idea of the group's progress.
https://docs.celeryproject.org/en/latest/reference/celery.result.html#celery.result.ResultSet
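A minimal sketch of that approach, assuming the GroupResult from BigTask is saved and later restored from a separate summary task (the email assembly itself is left hypothetical):
from celery.result import GroupResult

# in BigTask.run(), after building the group:
r = gr.apply_async()
r.save()  # needs a result backend that can store group results

# later, e.g. in a periodic summary task, restore by group id:
restored = GroupResult.restore(r.id)
if restored.ready():  # every header task has finished, whatever its state
    failed = [res for res in restored.results if res.failed()]
    succeeded = restored.completed_count()
    # assemble the admin email from failed / succeeded here
else:
    # still running; completed_count() gives a progress estimate
    print("%d of %d done" % (restored.completed_count(), len(restored.results)))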
The solution we use for a partly similar problem, where Celery's built-in machinery (task states etc.) doesn't really cut it, is to manually store the desired information in Redis and retrieve it when needed.

Python celery - how to wait for all subtasks in chord

I am unit testing celery tasks.
I have chained tasks that also include groups, so the result is a chord.
The test should look like:
run celery task (delay)
wait for task and all subtasks
assert
I tried the following:
def wait_for_result(result):
    result.get()
    for child in result.children or list():
        if isinstance(child, GroupResult):
            # tried looping over task results in the group
            # until tasks are ready, but without success
            pass
        wait_for_result(child)
This creates a deadlock, with chord_unlock being retried forever.
I am not interested in task results.
How can I wait for all the subtasks to finish?
Although this is an old question, I just wanted to share how I got rid of the deadlock issue, just in case it helps somebody.
As the Celery logs say, never use get() inside a task. That will indeed create a deadlock.
I have a similar set of Celery tasks which includes a chain of group tasks, hence making it a chord. I'm calling these tasks using Tornado, by making an HTTP request. So what I did was something like this:
@task
def someFunction():
    ...

@task
def someTask():
    ...

@task
def celeryTask():
    groupTask = group([someFunction.s(i) for i in range(10)])
    job = (groupTask | someTask.s())
    return job
When celeryTask() is called by Tornado, the chain will start executing, and the UUID of someTask() will be held in job. It will look something like:
AsyncResult: 765b29a8-7873-4b28-b05c-7e19c33e950c
This UUID is returned, and celeryTask() exits before the chain even starts executing (ideally), hence leaving room for another process to run.
I then used the Tornado layer to check the status of the task. Details on the Tornado layer can be found in this Stack Overflow question.
Have you tried chord + callback?
http://docs.celeryproject.org/en/latest/userguide/canvas.html#chords
>>> callback = tsum.s()
>>> header = [add.s(i, i) for i in range(100)]
>>> result = chord(header)(callback)
>>> result.get()
9900
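For the unit-test case specifically, one way around the deadlock is to poll from the test process itself (which is not a worker) rather than calling get() inside a task. A minimal sketch, assuming result is the AsyncResult returned by .delay() (wait_until_done is a hypothetical helper):
import time

def wait_until_done(result, timeout=60, interval=0.5):
    # Polls from the test process, not from inside a worker, so it
    # cannot deadlock the worker pool the way get() in a task does.
    deadline = time.time() + timeout
    while not result.ready():
        if time.time() > deadline:
            raise AssertionError("task did not finish within %ss" % timeout)
        time.sleep(interval)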

Get current celery task id anywhere in the thread

I'd like to get the task id inside a running task,
without knowing which task I'm in.
(That's why I can't use https://stackoverflow.com/a/8096086/245024)
I'd like it to be something like this:
@task
def my_task():
    foo()

def foo():
    logger.log(current_task_id)
This pattern recurs in many different tasks, and I don't want to carry the task context into every inner method call.
One option could be to use thread-local storage, but then I would need to initialize it before the task starts and clean it up after it finishes.
Is there something simpler?
from celery import current_task
print(current_task.request.id)
I'm just copying this from the comment, because it should be an answer, so thanks to @asksol.
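Applied to the pattern from the question, a minimal sketch might look like this (the logger setup is an assumption):
import logging
from celery import current_task, task

logger = logging.getLogger(__name__)

@task
def my_task():
    foo()

def foo():
    # current_task is a context-local proxy, so it resolves to the
    # running task even in helpers that know nothing about Celery
    logger.info("running inside task %s", current_task.request.id)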

django celery: how to set task to run at specific interval programmatically

I found that I can set a task to run at a specific interval at specific times from here, but that was only done during task declaration. How do I set a task to run periodically, dynamically?
The schedule is derived from a setting, and thus seems to be immutable at runtime.
You can probably accomplish what you're looking for using Task ETAs. This guarantees that your task won't run before the desired time, but doesn't promise to run the task at the designated time; if the workers are overloaded at the designated ETA, the task may run later.
If that restriction isn't an issue, you could write a task which would first run itself like:
@task
def mytask():
    keep_running = ...  # Boolean: should the task keep running?
    if keep_running:
        run_again = ...  # calculate when to run again
        mytask.apply_async(eta=run_again)
    # ... do the stuff you came here to do ...
The major downside of this approach is that you are relying on the task store to remember the tasks in flight. If one of them fails before firing off the next one, then the task will never run again. If your broker isn't persisted to disk and it dies (taking all in-flight tasks with it), then none of those tasks will run again.
You could solve these issues with some kind of transaction logging and a periodic "nanny" task whose job it is to find such repeating tasks that died an untimely death and revive them.
If I had to implement what you've described, I think this is how I would approach it.
celery.task.base.PeriodicTask defines is_due which determines when the next run should be. You could override this function to contain your custom dynamic running logic. See the docs here: http://docs.celeryproject.org/en/latest/reference/celery.task.base.html?highlight=is_due#celery.task.base.PeriodicTask.is_due
An example:
import random
from celery.task import PeriodicTask

class MyTask(PeriodicTask):
    # PeriodicTask requires a run_every; is_due() below decides
    # when the task actually fires
    run_every = 60

    def run(self, **kwargs):
        logger = self.get_logger(**kwargs)
        logger.info("Running my task")

    def is_due(self, last_run_at):
        # Add your logic for when to run. Mine is random
        if random.random() < 0.5:
            # Run now and check again in a minute
            return (True, 60)
        else:
            # Don't run now, but check again in 10 secs
            return (False, 10)
See here: http://docs.celeryproject.org/en/latest/userguide/periodic-tasks.html
I think you can't change it dynamically... the best way is to create a task from within a task :D
For example, if you want to run something X seconds later, you create a new task with an X-second delay, and in that task create another task with an N*X-second delay...
This should help you some... http://celery.readthedocs.org/en/latest/faq.html#can-i-change-the-interval-of-a-periodic-task-at-runtime
Once you've defined a custom schedule, assign it to your task as asksol has suggested above.
CELERYBEAT_SCHEDULE = {
    "my_name": {
        "task": "myapp.tasks.task",
        "schedule": myschedule(),
    },
}
You might also want to modify CELERYBEAT_MAX_LOOP_INTERVAL if you want your schedule to update more often than every five minutes.
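The myschedule() above would be a custom schedule class; a minimal sketch, assuming Celery's celery.schedules.schedule base class (the interval lookup get_current_interval_seconds is hypothetical):
from datetime import timedelta
from celery.schedules import schedule

class myschedule(schedule):
    # A schedule whose interval can change while beat is running.

    def is_due(self, last_run_at):
        # Hypothetical: read the current interval from a DB or config
        # source that can be updated at runtime.
        interval = get_current_interval_seconds()
        due_in = (last_run_at + timedelta(seconds=interval)) - self.now()
        if due_in.total_seconds() <= 0:
            # run now, and check again after the (possibly new) interval
            return True, interval
        # not due yet; re-check once the remaining time has elapsed
        return False, due_in.total_seconds()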
