I have a task consisting of subtasks in a chain. How can I ensure a second call of this task does not start before the first one has finished?
from celery import shared_task, chain

@shared_task
def task(user):
    res = chain(subtask_1.s(),  # each subtask takes ~1 hour
                subtask_2.s(),
                subtask_3.s())
    return res.apply_async()
A Django view might now trigger this task:
# user A visits page that triggers task
task.delay(userA)
# 10 seconds later, while task() is still executing, user B visits page
task.delay(userB)
This leads to the tasks racing each other instead of being executed in sequential order. E.g. once a worker has finished with subtask_1() of the first task, it begins working on subtask_1() of the second task, instead of subtask_2() and subtask_3() of the first one.
Is there a way to avoid this elegantly? I guess the problem is the order in which the subtasks get added to the queue.
I have already set worker --concurrency=1, but that still doesn't change the order in which the worker consumes from the queue.
The official docs (task cookbook) seem to offer a solution, but I don't understand it and unfortunately it doesn't work for me.
Perhaps include a blocking mechanism within the task, after the chain, with a while not res.ready(): sleep(1) kind of hack?
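For what it's worth, that hack would look roughly like the sketch below (reusing the subtasks from the question). Note that Celery explicitly warns against waiting on a result inside a task, and with --concurrency=1 the worker blocking here is the same one that would have to run the subtasks, so this can deadlock:
from time import sleep

@shared_task
def task(user):
    res = chain(subtask_1.s(),
                subtask_2.s(),
                subtask_3.s()).apply_async()
    # Hack from the question: hold this task open until the chain is done,
    # so a second call cannot start in the meantime. Risky, see note above.
    while not res.ready():
        sleep(1)
    return res.result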
You can wait for the first task to finish and then execute the second one, like this:
res = task.delay(userA)
res.get() # will block until finished
task.delay(userB)
But this will block the calling thread until the first task has finished. You can chain the tasks to avoid blocking, but for that you have to modify the task signature a little to accept the previous task's result as an argument.
@shared_task
def task(_, user):  # signature takes one extra (ignored) argument
    # skipped
and
from celery.canvas import chain
chain(task.s(None, userA), task.s(userB))()
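If changing the signature is undesirable, an immutable signature is another option; this is only a sketch of the equivalent call, assuming the unchanged task(user) from the question:
from celery import chain

# .si() creates an immutable signature, so the parent's return value is not
# passed in and task(user) can keep its original single argument.
chain(task.si(userA), task.si(userB)).apply_async()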
Related
Suppose I have a report-building workflow roughly described as follows:
@shared_task
def report_builder(*args, **kwargs):
    return chain(
        group(gather_data_from_service_a.s(), ...),
        merge_data.s(),
        export_to_html.s(),
        export_to_pdf.s(),
        store_pdf_on_s3.s(),
    ).apply_async()
Called by: task = report_builder.s().apply_async()
How do I keep track of the main task (here: report_builder)? The problem is that the main task succeeds shortly after it has been called, because all it does is return an AsyncResult.
What I want to achieve is to poll status information about the main task, transitioning from PENDING to STARTED to (SUCCESS or FAILURE), and in the end get the S3 location of the last task in the chain as the result of the main task.
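For context, polling a Celery result by id generally looks like the sketch below; the open problem here is which id to poll, since report_builder's own result succeeds immediately and the interesting id belongs to the inner chain. The poll() helper and its return format are made up for illustration:
from celery.result import AsyncResult

def poll(task_id):
    # task_id would need to be the id of the inner chain's last task
    # (store_pdf_on_s3), not the id of report_builder itself.
    res = AsyncResult(task_id)
    if res.state == 'SUCCESS':
        return {'state': res.state, 's3_location': res.result}
    if res.state == 'FAILURE':
        return {'state': res.state, 'error': str(res.result)}
    return {'state': res.state}  # PENDING, STARTED, RETRY, ...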
I use multiprocessing.Pool like so to execute a number of tasks.
def execute(task):
    # run task, return result

def on_completion(task_result):
    # process task result

async_results = [pool.apply_async(execute,
                                  args=[task],
                                  callback=on_completion)
                 for task in self.tasks]
# wait for results
My completion handler is invoked by the pool in a nice, serialized way so I don't have to worry about thread safety in its implementation.
However, I would also like to be notified when a task is started. Is there an elegant way to accomplish the following?
def on_start(arg):  # whatever arg(s) were passed to the execute function
    # called when the task starts to run

pool.apply_async(run_task,
                 args=[task],
                 start_callback=on_start,
                 completion_callback=on_completion)
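multiprocessing.Pool has no such start_callback argument, so it has to be emulated. One possible sketch, with made-up names (execute_with_start_notice, run_tasks): each worker announces itself on a managed queue before doing the real work, and a thread in the parent drains that queue and calls on_start. Note that on_start then runs in that drain thread while on_completion still runs in the pool's result-handler thread, so the two callbacks are no longer serialized with each other.
import multiprocessing
import threading

def execute_with_start_notice(task, start_queue):
    # Runs in the worker process: announce the start, then do the real work.
    start_queue.put(task)
    return execute(task)  # execute() is the function from the question

def run_tasks(tasks, on_start, on_completion):
    with multiprocessing.Manager() as manager:
        start_queue = manager.Queue()

        def drain():
            # Runs in the parent: deliver start notifications one at a time.
            while True:
                item = start_queue.get()
                if item is None:  # sentinel: everything has finished
                    break
                on_start(item)

        drainer = threading.Thread(target=drain)
        drainer.start()

        with multiprocessing.Pool() as pool:
            async_results = [pool.apply_async(execute_with_start_notice,
                                              args=[task, start_queue],
                                              callback=on_completion)
                             for task in tasks]
            for result in async_results:
                result.wait()

        start_queue.put(None)
        drainer.join()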
I need to arrange some tasks in Celery so that some of them run as a single task and some run in parallel, and only when the tasks in a group have completed should execution pass to the next step:
chain(
    task1.s(),
    task2.s(),
    group(task3.s(), task4.s()),
    group(task5.s(), task6.s(), task7.s()),
    task7.s()
).delay()
But I think what I did is wrong. Does anybody have an idea how to do this?
Also, I don't care about sending the result of each task to the others.
This one finally worked:
chain(
    task1.s(),
    task2.s(),
    chord([task3.s(), task4.s()], body=task_result.s(), immutable=True),
    chord([task5.s(), task6.s(), task7.s()], body=task_result.s(), immutable=True),
    task7.s()
).delay()
This sounds like a chord, i.e. where you execute tasks in parallel and have a callback into another task when the parallel tasks have finished: http://docs.celeryproject.org/en/latest/userguide/canvas.html#chords
So you might have to change it to something like:
chain(task1.s(), task2.s(), chord(task3.s(), task4.s())(chord(task5.s(), task6.s(), task7.s())(task7.s())))
Also, chains/groups etc. always return their results and pass them on to the child task(s), so you have to model the task arguments accordingly.
As it's quite a complex workflow, you might be better off calling the next task from within the previous task (like calling task2.s().delay() at the end of task1) - but I guess there's no way around modelling the chord.
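Since the question also says the results don't need to be passed from one task to the next, immutable signatures (.si()) are one way to express that; a minimal sketch in the spirit of the snippet that finally worked above:
from celery import chain, chord

# task_result plays the same collector role as in the working snippet;
# .si() makes each signature immutable so no parent result is passed in.
chain(
    task1.si(),
    task2.si(),
    chord([task3.si(), task4.si()], task_result.si()),
    chord([task5.si(), task6.si(), task7.si()], task_result.si()),
    task7.si(),
).delay()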
I am unit testing celery tasks.
I have chained tasks that also contain groups, so the result is a chord.
The test should look like:
run the Celery task (delay)
wait for the task and all subtasks
assert
I tried the following:
from celery.result import GroupResult

def wait_for_result(result):
    result.get()
    for child in result.children or list():
        if isinstance(child, GroupResult):
            # tried looping over the task results in the group
            # until the tasks are ready, but without success
            pass
        wait_for_result(child)
This creates a deadlock, chord_unlock being retried forever.
I am not interested in task results.
How can I wait for all the subtasks to finish?
Although this is an old question, I just wanted to share how I got rid of the deadlock issue, just in case it helps somebody.
As the Celery logs say, never use get() inside a task. This will indeed create a deadlock.
I have a similar set of Celery tasks which includes a chain of group tasks, hence making it a chord. I'm calling these tasks from Tornado, by making an HTTP request. What I did was something like this:
@task
def someFunction():
    ...

@task
def someTask():
    ...

@task
def celeryTask():
    groupTask = group([someFunction.s(i) for i in range(10)])
    # start the chain; job holds the AsyncResult of someTask()
    job = (groupTask | someTask.s()).apply_async()
    return job
When celeryTask() is called by Tornado, the chain starts executing and the UUID of someTask() is held in job. It will look something like:
AsyncResult: 765b29a8-7873-4b28-b05c-7e19c33e950c
This UUID is returned and celeryTask() exits (ideally) before the chain has even started executing, leaving room for another process to run.
I then used the tornado layer to check the status of the task. Details on the tornado layer can be found in this stackoverflow question
Have you tried chord + callback?
http://docs.celeryproject.org/en/latest/userguide/canvas.html#chords
>>> callback = tsum.s()
>>> header = [add.s(i, i) for i in range(100)]
>>> result = chord(header)(callback)
>>> result.get()
9900
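Note that in this example result.get() is called by the client code, outside of any task, which is why it does not run into the deadlock described above.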
I am using a task (a queueing task) to queue multiple other tasks (fan-out). When I try to use Queue.add with the task argument being a list of Task instances with more than 5 elements, and in a transaction, I get this error:
JointException: taskqueue.DatastoreError caused by:
<class 'google.appengine.api.datastore_errors.BadRequestError'>
Too many messages, maximum allowed 5
Is there another way to queue more than 5 tasks in a transaction?
Or...
Maybe I don't need a transaction, because:
I don't care if any of those tasks gets queued twice anyway, and
if queueing fails for any of them, then the whole queueing task will be re-run.
So tell me how to queue more than 5 tasks in a transaction, or tell me not to use a transaction because I don't really need one.
One solution that comes close to solving your problem is to add a single transactional task that fans out the remaining tasks. Just add that one fan-out task in your existing transaction.
Unless there is a business-logic reason to do so, do not re-run a task that has already run. Preventing tasks from being re-inserted (i.e. duplicated) is straightforward and saves resources. Your fan-out task will basically look like this:
from google.appengine.api import taskqueue
from google.appengine.api.taskqueue import TaskAlreadyExistsError
from google.appengine.ext import webapp

class FanOutTask(webapp.RequestHandler):
    def get(self):
        name = self.request.get('name')
        params = deserialize(self.request.get('params'))
        try:
            task_params = params.get('stuff')
            taskqueue.add(url='/worker/1', name=name + '-1', params=task_params)
        except TaskAlreadyExistsError:
            pass
        try:
            task_params = params.get('more')
            taskqueue.add(url='/worker/2', name=name + '-2', params=task_params)
        except TaskAlreadyExistsError:
            pass
Adding the fan-out task transactionally ensures it is enqueued. Errors caused by a sub-task having already been added are caught and ignored; other errors cause the fan-out task to re-run. With this pattern you can insert many sub-tasks pretty easily.
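For completeness, enqueueing that single fan-out task inside the existing transaction might look like the sketch below. The handler URL, job name, and serialize() are placeholders; also note that transactionally added tasks cannot be named, which is why de-duplication happens via the named sub-tasks in FanOutTask instead:
from google.appengine.api import taskqueue
from google.appengine.ext import db

def queue_work(params):
    def txn():
        # ... the datastore writes this transaction already performs ...
        # Only this single task is enqueued inside the transaction, so the
        # 5-task limit is never hit; FanOutTask adds the real workers later.
        taskqueue.add(url='/fanout',                        # placeholder handler URL
                      params={'name': 'job-42',             # placeholder job name
                              'params': serialize(params)}, # counterpart to deserialize()
                      transactional=True)
    db.run_in_transaction(txn)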