Python Celery - how to wait for all subtasks in a chord

I am unit testing Celery tasks.
I have chained tasks that also contain groups, so the result is a chord.
The test should look like:
run the celery task (delay)
wait for the task and all subtasks
assert
I tried the following:
from celery.result import GroupResult

def wait_for_result(result):
    result.get()
    for child in result.children or []:
        if isinstance(child, GroupResult):
            # tried looping over the task results in the group
            # until they are ready, but without success
            pass
        wait_for_result(child)
This creates a deadlock, with chord_unlock being retried forever.
I am not interested in the task results.
How can I wait for all the subtasks to finish?

Although this is an old question, I just wanted to share how I got rid of the deadlock issue, in case it helps somebody.
As the Celery logs say, never call get() inside a task; that will indeed create a deadlock.
I have a similar set of Celery tasks, including a chain of group tasks, which makes it a chord. I call these tasks from Tornado by making an HTTP request. What I did was something like this:
from celery import group
# assumes `task` is the task decorator from your Celery app setup (e.g. app.task)

@task
def someFunction(i):
    ...

@task
def someTask(results):
    ...

@task
def celeryTask():
    groupTask = group([someFunction.s(i) for i in range(10)])
    job = (groupTask | someTask.s()).apply_async()  # start the chord without waiting on it
    return job.id
When celeryTask() is called by Tornado, the chord starts executing and the UUID of someTask() is held in job. It will look something like:
AsyncResult: 765b29a8-7873-4b28-b05c-7e19c33e950c
This UUID is returned, and celeryTask() exits before the chain has even started executing (ideally), leaving the worker free for other work.
I then used the Tornado layer to check the status of the task. Details on the Tornado layer can be found in this Stack Overflow question.
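For reference, the status check on the Tornado side can be as simple as looking up the returned UUID; a minimal sketch, assuming the id returned by celeryTask() is available to the client and app is your Celery instance:
from celery.result import AsyncResult

def task_status(task_id):
    # look up the chord callback (someTask) by the UUID returned from celeryTask()
    res = AsyncResult(task_id, app=app)
    return {"ready": res.ready(), "state": res.state}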

Have you tried a chord with a callback?
http://docs.celeryproject.org/en/latest/userguide/canvas.html#chords
>>> callback = tsum.s()
>>> header = [add.s(i, i) for i in range(100)]
>>> result = chord(header)(callback)
>>> result.get()
9900
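The add and tsum tasks in that snippet are the examples from the Celery docs; a self-contained sketch (the broker/backend URLs are assumptions) might look like:
from celery import Celery, chord

app = Celery('tasks', broker='amqp://', backend='rpc://')

@app.task
def add(x, y):
    return x + y

@app.task
def tsum(numbers):
    # chord callback: receives the list of results from the header group
    return sum(numbers)

# from client or test code (not from inside another task):
result = chord(add.s(i, i) for i in range(100))(tsum.s())
print(result.get())  # 9900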

Related

Celery execute tasks in the order they get called (during runtime)

I have a task consisting of subtasks in a chain. How can I ensure a second call of this task does not start before the first one has finished?
from celery import shared_task, chain

@shared_task
def task(user):
    res = chain(subtask_1.s(),  # each subtask takes ~1 hour
                subtask_2.s(),
                subtask_3.s())
    return res.apply_async()
A Django view might now trigger this task:
# user A visits page that triggers task
task.delay(userA)
# 10 seconds later, while task() is still executing, user B visits page
task.delay(userB)
This leads to the tasks racing each other instead of being executed in sequential order. E.g. once a worker has finished subtask_1() of the first task, it begins working on subtask_1() of the second task instead of subtask_2() and subtask_3() of the first one.
Is there a way to elegantly avoid this? I guess the problem is the order in which the subtasks get added to the queue.
I have already set worker --concurrency=1, but that still doesn't change the order in which the worker consumes from the queue.
The official docs (task cookbook) seem to offer a solution which I don't understand and which unfortunately doesn't work for me.
Perhaps include a blocking mechanism within the task, after the chain, with a while not res.ready(): sleep(1) kind of hack?
You can wait for the first task to finish and then execute the second one, like this:
res = task.delay(userA)
res.get() # will block until finished
task.delay(userB)
But this will block the calling thread until the first task has finished. You can chain the tasks to avoid blocking, but for that you have to modify the task signature a little to accept a task result as an argument.
@shared_task
def task(_, user):  # signature takes one extra argument for the previous result
    # skipped
    ...
and
from celery.canvas import chain
chain(task.s(None, userA), task.s(userB))()
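As a side note (not part of the original answer), immutable signatures achieve the same ordering without changing the task's signature, because the previous result is simply not passed along; a minimal sketch assuming the original task(user) signature:
from celery import chain
# .si() builds an immutable signature, so task() keeps its single `user` argument
chain(task.si(userA), task.si(userB))()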

multiprocessing.Pool: get notified when a task is started

I use multiprocessing.Pool like so to execute a number of tasks.
def execute(task):
    # run task, return result
    ...

def on_completion(task_result):
    # process task result
    ...

async_results = [pool.apply_async(execute,
                                  args=[task],
                                  callback=on_completion)
                 for task in self.tasks]
# wait for results
My completion handler is invoked by the pool in a nice, serialized way so I don't have to worry about thread safety in its implementation.
However, I would also like to be notified when a task is started. Is there an elegant way to accomplish the following?
def on_start(arg):  # whatever arg(s) were passed to the execute function
    # called when the task starts to run
    ...

pool.apply_async(run_task,
                 args=[task],
                 start_callback=on_start,            # desired API; Pool has no such argument
                 completion_callback=on_completion)
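multiprocessing.Pool has no start callback, but one possible workaround (purely a sketch; execute_with_notify, do_work and the queue wiring are illustrative, not Pool features) is to have the worker report "started" on a shared queue that the parent drains, keeping the start notification serialized in the parent just like the completion callback:
import multiprocessing

def do_work(task):
    # hypothetical task body
    return task * 2

def execute_with_notify(task, start_queue):
    start_queue.put(task)   # report "started" to the parent before doing the work
    return do_work(task)

def on_start(task):
    print("started:", task)

def on_completion(result):
    print("finished:", result)

if __name__ == "__main__":
    manager = multiprocessing.Manager()
    start_queue = manager.Queue()   # a Manager queue proxy can be passed to workers
    with multiprocessing.Pool() as pool:
        tasks = [1, 2, 3]
        async_results = [
            pool.apply_async(execute_with_notify, args=(t, start_queue),
                             callback=on_completion)
            for t in tasks
        ]
        # drain one start notification per task, then wait for the results
        for _ in tasks:
            on_start(start_queue.get())
        for r in async_results:
            r.wait()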

Process Celery task results in arrival order

I am new to Celery. I have a "celery server", like the code below, which returns results after some time, depending on the calculation. I have emulated this behaviour in the simple program below with the sleep function. What I want is to process the early-returning results before the "heavy" results. I have written a simple program, see the snippet below, which intentionally creates the "heavy load" task as the first call.
Note that the subsequent calls create "lighter" tasks and therefore the Celery server returns them earlier. So I want to process the returning results in the order they arrive at the client. Right now (see the client code) it waits until the heavy task has returned.
But with the examples from the Celery docs, I am supposed to wait for results by checking the id, or poll for them (which is dumb, because the Celery client then has to somehow check the id of the "first" arrived result, I guess).
How can I process the Celery results in the order they arrive at the client? I don't want to poll result.ready() in an endless loop, as that IMHO defeats the point of async processing.
I found no solution in the docs. What I want to do is "get the first arrived result and its id", compare it to my result.id (did I send the task?) and then process it accordingly.
#
# Name this code "tasks.py" and run it with:
# celery worker -A tasks --loglevel=info
#
from celery import Celery
import time

app = Celery('tasks', backend='amqp', broker='amqp://guest:guest@127.0.0.1:5672/%2F')

@app.task()
def add(x, y):
    print("x=%s y=%s" % (x, y))
    time.sleep(x)
    return x + y
The second program is the client. This works like the Celery docs, however Celery has already completed tasks 0, 1 and 2 (and the client should therefore already be working on them).
#!/usr/bin/python3
from tasks import add

results = []
max = 4

for i in range(0, max):
    print(max - (i + 1))
    result = add.delay(max - (i + 1), 0)
    results.append(result)

print("")

for i in range(0, max):
    result = results[i].get(timeout=10)
    print(result)
Result (the last 4 numbers should appear in arrival order, which would be 0, 1, 2, 3):
3
2
1
0
3
2
1
0
You should implement a callback rather than looping through the results in the order that they were sent to the queue:
http://celery.readthedocs.org/en/latest/userguide/calling.html#linking-callbacks-errbacks
In tasks.py:
@app.task()
def process_add(result):
    print(result)
In client.py:
from tasks import add, process_add

results = []
max = 4

for i in range(0, max):
    print(max - (i + 1))
    add.apply_async((max - (i + 1), 0), link=process_add.s())
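For reference, the same wiring can be expressed as a chain; the linked callback runs as its own task on a worker (not in the client process), receiving the parent's return value as its argument. A minimal sketch reusing the tasks above:
from celery import chain
from tasks import add, process_add

for i in range(0, 4):
    # process_add(result) runs on a worker once add(...) has finished
    chain(add.s(4 - (i + 1), 0), process_add.s()).apply_async()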

Get current celery task id anywhere in the thread

I'd like to get the task id inside a running task,
without knowing which task I'm in.
(That's why I can't use https://stackoverflow.com/a/8096086/245024)
I'd like it to be something like this:
@task
def my_task():
    foo()

def foo():
    logger.info(current_task_id)  # pseudocode: I want the id of the task currently running
This pattern recurs in many different tasks, and I don't want to carry the task context into every inner method call.
One option could be to use thread-local storage, but then I would need to initialize it before the task starts and clean it up after it finishes.
Is there something simpler?
from celery import current_task
print(current_task.request.id)
I'm just copying this from a comment because it should be an answer, so thanks to @asksol.
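A minimal sketch of the helper pattern from the question, built on current_task (the helper name and message format are illustrative):
from celery import current_task

def log_with_task_id(message):
    # current_task is a proxy to the task executing in this worker context;
    # request.id holds its UUID. Outside a task this resolves to None.
    request = getattr(current_task, "request", None)
    task_id = getattr(request, "id", None)
    print("[%s] %s" % (task_id, message))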

How to route a chain of tasks to a specific queue in celery?

When I route a task to a particular queue it works:
task.apply_async(queue='beetroot')
But if I create a chain:
chain = task | task
And then I write
chain.apply_async(queue='beetroot')
It seems to ignore the queue keyword and assigns the tasks to the default 'celery' queue.
It would be nice if celery supported routing in chains - all tasks executed sequentially in the same queue.
I do it like this:
subtask = task.s(*myargs, **mykwargs).set(queue=myqueue)
mychain = celery.chain(subtask, subtask2, ...)
mychain.apply_async()
Ok I got this one figured out.
You have to add the required execution options like queue= or countdown= to the subtask definition, or through a partial:
subtask definition:
from celery import subtask
chain = subtask('task', queue = 'beetroot') | subtask('task', queue = 'beetroot')
partial:
chain = task.s().apply_async(queue = 'beetroot') | task.s().apply_async(queue = 'beetroot')
Then you execute the chain through:
chain.apply_async()
or,
chain.delay()
And the tasks will be sent to the 'beetroot' queue. Extra execution arguments in this last command will not do anything. It would have been nice to be able to apply all of those execution arguments at the chain (or group, or any other canvas primitive) level.
This is rather late, but I don't think the code provided by @mpaf is entirely correct.
Context: In my case, I have two subtasks, of which the first provides a return value that is passed on to the second as its input argument. I was having trouble getting the second task to execute - I saw in the logs that Celery would acknowledge the second task as a callback of the first, but it would never execute it.
This was my non-working chain code:
from celery import chain

chain(
    module.task1.s(arg),
    module.task2.s()
).apply_async(countdown=0.1, queue='queuename')
Using the syntax provided in @mpaf's answer, I got both tasks to execute, but the execution order was haphazard and the second subtask was not acknowledged as a callback of the first. That gave me the idea to browse the docs for how to explicitly set a queue on a subtask.
This is the working code:
chain(
    module.task1.s(arg).set(queue='queuename'),
    module.task2.s().set(queue='queuename')
).apply_async(countdown=0.1)
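One follow-up note: tasks routed to a custom queue are only executed if a worker actually consumes that queue, e.g. (using the queue name assumed above):
celery -A module worker -Q queuename --loglevel=info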
