Suppose I have a report-building workflow roughly described as follows:
@shared_task
def report_builder(*args, **kwargs):
    return chain(
        group(gather_data_from_service_a.s(), ...),
        merge_data.s(),
        export_to_html.s(),
        export_to_pdf.s(),
        store_pdf_on_s3.s(),
    ).apply_async()
It is called with task = report_builder.s().apply_async().
How do I keep track of the main task (here: report_builder)? The problem is that the main task succeeds shortly after it has been called, because all it does is return an AsyncResult.
What I want to achieve is to poll status information about the main task, transitioning from PENDING to STARTED to (SUCCESS or FAILURE), and in the end get the S3 location produced by the last task in the chain as the result of the main task.
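One way to get close to this (a rough sketch, not from the question itself): have report_builder return the id of the chain's AsyncResult, which corresponds to the last task in the chain (store_pdf_on_s3), and poll that id instead of the main task. The poll_report helper below is hypothetical.
from celery import shared_task, chain, group
from celery.result import AsyncResult

@shared_task
def report_builder(*args, **kwargs):
    workflow = chain(
        group(gather_data_from_service_a.s()),  # plus the other gather tasks from the question
        merge_data.s(),
        export_to_html.s(),
        export_to_pdf.s(),
        store_pdf_on_s3.s(),
    )
    # apply_async() on a chain returns the AsyncResult of the *last* task,
    # so its id is what a caller should poll.
    return workflow.apply_async().id

# Caller side (hypothetical helper):
def poll_report(builder_result):
    chain_id = builder_result.get()  # report_builder itself finishes quickly
    chain_result = AsyncResult(chain_id)
    return chain_result.state, chain_result.result  # e.g. ('SUCCESS', 's3://bucket/report.pdf')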
Related
I have a task consisting of subtasks in a chain. How can I ensure a second call of this task does not start before the first one has finished?
@shared_task
def task(user):
    res = chain(subtask_1.s(),  # each subtask takes ~1 hour
                subtask_2.s(),
                subtask_3.s())
    return res.apply_async()
A Django view might now trigger this task:
# user A visits page that triggers task
task.delay(userA)
# 10 seconds later, while task() is still executing, user B visits page
task.delay(userB)
This leads to the tasks racing each other instead of being executed sequentially. E.g., once a worker has finished subtask_1() of the first task, it begins working on subtask_1() of the second task instead of subtask_2() and subtask_3() of the first one.
Is there a way to elegantly avoid this? I guess the problem is the order in which the subtasks get added to the queue.
I have already set worker --concurrency=1, but that still doesn't change the order in which it consumes from the queue.
The official docs (task cookbook) seem to offer a solution, which I unfortunately don't understand and which doesn't work for me.
Perhaps include a blocking mechanism within the task, after the chain, with a while not res.ready(): sleep(1) kind of hack?
You can wait for the first task to finish and then execute the second one, like this:
res = task.delay(userA)
res.get() # will block until finished
task.delay(userB)
But it will block the calling thread until the first one has finished. You can chain the tasks to avoid blocking, but for that you have to modify the task signature a little to accept a task result as an argument.
@shared_task
def task(_, user):  # signature takes one extra argument
    ...  # skipped
and
from celery.canvas import chain
chain(task.s(None, userA), task.s(userB))()
I have a situation where a periodic monthly big_task reads a file and enqueues one chained task per row of this file, where the chained tasks are small_task_1 and small_task_2:
from celery import chain, group
from celery.schedules import crontab
from celery.task import PeriodicTask

class BigTask(PeriodicTask):
    run_every = crontab(hour=0, minute=0, day_of_month=1)

    def run(self):
        task_list = []
        with open("the_file.csv") as f:
            for row in f:
                t = chain(
                    small_task_1.s(row),
                    small_task_2.s(),
                )
                task_list.append(t)
        gr = group(*task_list)
        r = gr.apply_async()
I would like to get statistics about the number of enqueued and failed tasks (and details about the exceptions) for each small_task, as soon as all of them are finished (whatever their status is), in order to send a summary email to the project admins.
I first thought of using a chord, but the callback is not executed if any of the header tasks fails, which will surely happen in my case.
I could also use r.get() in the BigTask, which is very convenient, but waiting for a task result inside another task is not recommended (even if, here, I guess the risk of worker deadlock is low since the task will be executed only once a month).
Important note: input file contains ~700k rows.
How would you recommend to proceed?
I'm not sure it helps with the monitoring part, but regarding the chord-and-callback issue you could use a link_error callback (for catching exceptions). In your case, for example, you can use it like this:
small_task_1.s(row).set(link_error=error_task)
and implement a Celery error_task that sends you a notification or whatever.
In Celery 4 you can set it once for the whole canvas (but it didn't work for me in 3.1):
r = gr.apply_async(link_error=error_task)
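As a rough sketch of what that error_task could look like (not part of the original answer; link_error is typically given a signature such as error_task.s(), and a single-argument errback receives the id of the failed task; notify_admins is an assumed helper):
from celery import shared_task
from celery.result import AsyncResult

@shared_task
def error_task(failed_task_id):
    result = AsyncResult(failed_task_id)
    # result.result holds the raised exception, result.traceback the traceback string
    notify_admins("Task %s failed: %r" % (failed_task_id, result.result))  # notify_admins is assumed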
For the monitoring part you can of course use Flower.
Hope that helps.
EDIT: An alternative (without using additional persistence) would be to catch the exception and add some logic to the result and the callback. For example:
def small_task_1():
    try:
        # do stuff
        return 'success', result
    except Exception:
        return 'fail', result
and then, in your callback task, iterate over the result tuples and check for failures before doing the actual logic.
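A hypothetical callback along those lines (the send_summary name and the use of Django's mail_admins are assumptions, not part of the answer):
from celery import shared_task
from django.core.mail import mail_admins

@shared_task
def send_summary(results):
    # `results` is the list of ('success'/'fail', result) tuples returned by the header tasks
    failed = [res for status, res in results if status == 'fail']
    succeeded = [res for status, res in results if status == 'success']
    mail_admins(
        "Monthly import finished",
        "%d rows succeeded, %d rows failed" % (len(succeeded), len(failed)),
    )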
I found the best solution to be to iterate over the group results after the group is ready.
When you issue a group, you get a ResultSet object back. You can .save() this object to fetch it later and check whether it is ready, or you can call .join() and wait for the results.
When it ends, you can access .results and get a list of AsyncResult objects. These objects all have a .state property that you can check to see whether the task was successful or not.
However, you can only check the results after the group ends. During the process, you can call .completed_count() to get an idea of the group's progress.
https://docs.celeryproject.org/en/latest/reference/celery.result.html#celery.result.ResultSet
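A minimal sketch of that approach, reusing the small_task names from the question above (it assumes a result backend that supports saving and restoring group results):
from celery import chain, group
from celery.result import GroupResult

result = group(chain(small_task_1.s(row), small_task_2.s()) for row in rows).apply_async()
result.save()            # persist the group's task ids in the result backend
group_id = result.id

# Later, e.g. from a monitoring task or a view:
restored = GroupResult.restore(group_id)
if restored.ready():     # every chain in the group has finished
    failures = [r for r in restored.results if r.failed()]
    print("done, %d failures" % len(failures))
else:
    print("%d of %d finished" % (restored.completed_count(), len(restored.results)))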
The solution we use for a partly similar problem, where Celery's built-in machinery (task states etc.) doesn't really cut it, is to manually store the desired information in Redis and retrieve it when needed.
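For illustration only, a rough sketch of that manual bookkeeping (the key layout and the record_result/job_summary helpers are made up):
import redis

r = redis.Redis()

def record_result(job_id, task_id, state, detail=""):
    # call this from each task, e.g. in a finally block or a failure handler
    r.hset("job:%s:tasks" % job_id, task_id, "%s|%s" % (state, detail))
    r.incr("job:%s:%s" % (job_id, state.lower()))

def job_summary(job_id):
    return {
        "success": int(r.get("job:%s:success" % job_id) or 0),
        "failure": int(r.get("job:%s:failure" % job_id) or 0),
    }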
I am unit testing Celery tasks.
I have chained tasks that also contain groups, so the result is a chord.
The test should look like:
run the Celery task (delay)
wait for the task and all subtasks
assert
I tried the following:
from celery.result import GroupResult

def wait_for_result(result):
    result.get()
    for child in result.children or []:
        if isinstance(child, GroupResult):
            # tried looping over the task results in the group
            # until they are ready, but without success
            pass
        wait_for_result(child)
This creates a deadlock, with chord_unlock being retried forever.
I am not interested in task results.
How can I wait for all the subtasks to finish?
Although this is an old question, I just wanted to share how I got rid of the deadlock issue, just in case it helps somebody.
As the Celery logs say, never use get() inside a task; this will indeed create a deadlock.
I have a similar set of Celery tasks which includes a chain of group tasks, hence making it a chord. I'm calling these tasks from Tornado, by making an HTTP request. So what I did was something like this:
@task
def someFunction():
    ...

@task
def someTask():
    ...

@task
def celeryTask():
    groupTask = group([someFunction.s(i) for i in range(10)])
    job = (groupTask | someTask.s()).apply_async()
    return job
When celeryTask() is called by Tornado, the chain will start executing and the UUID of someTask() will be held in job. It will look something like
AsyncResult: 765b29a8-7873-4b28-b05c-7e19c33e950c
This UUID is returned, and celeryTask() exits before the chain even starts executing (ideally), hence leaving room for another process to run.
I then used the Tornado layer to check the status of the task. Details on the Tornado layer can be found in this Stack Overflow question.
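The status check itself can be as simple as the following sketch (not from the answer; it assumes the UUID is sent back in a later HTTP request):
from celery.result import AsyncResult

def task_status(task_id):
    result = AsyncResult(task_id)
    return {"state": result.state, "ready": result.ready()}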
Have you tried chord + callback?
http://docs.celeryproject.org/en/latest/userguide/canvas.html#chords
>>> callback = tsum.s()
>>> header = [add.s(i, i) for i in range(100)]
>>> result = chord(header)(callback)
>>> result.get()
9900
Let's say I add 100 push tasks (as group 1) to my task queue. Then I add another 200 tasks (as group 2) to the same queue. How can I tell when all tasks of group 1 are finished?
Looks like QueueStatistics will not help here, and tag works only with pull queues.
And I cannot have separate queues (since I may have hundreds of groups).
I would probably solve it by using a sharded counter in the datastore, like @mgilson said, and decorate my deferred functions to run a callback when the tasks are done running.
I think something like this is what you are looking for, if you include the code at https://cloud.google.com/appengine/articles/sharding_counters?hl=en and write a decrement function to complement the increment one.
import logging
import random
import time

from google.appengine.ext import deferred

def done_work():
    logging.info('work done!')

def worker(callback=None):
    def fst(f):
        def snd(*args, **kwargs):
            key = kwargs['shard_key']
            del kwargs['shard_key']
            retval = f(*args, **kwargs)
            decrement(key)  # the decrement counterpart you write yourself
            if get_count(key) == 0:
                callback()
            return retval
        return snd
    return fst

def func(n):
    # do some work
    time.sleep(random.randint(1, 10) / 10.0)
    logging.info('task #{:d}'.format(n))

def make_some_tasks():
    wrapped = worker(callback=done_work)(func)
    key = random.randint(0, 1000)
    for n in xrange(0, 100):
        increment(key)
        deferred.defer(wrapped, n, shard_key=key)
Tasks are not guaranteed to run only once; occasionally even successfully executed tasks may be repeated. Here's such an example: GAE deferred task retried due to "instance unavailable" despite having already succeeded.
Because of this, using a counter that is incremented at task enqueueing and decremented at task completion wouldn't work: it would be decremented twice in such a duplicate-execution case, throwing the whole computation off.
The only reliable way of keeping track of task completion (that I can think of) is to independently track each individual enqueued task. You can do that using the task names (either specified or auto-assigned after successful enqueueing), which are unique for a given queue. The task names to be tracked can be kept in task lists persisted in the datastore, for example.
Note: this is just the theoretical answer I got to when I asked myself the same question, I didn't get to actually test it.
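To make the idea concrete, here is an equally untested sketch along those lines (the TaskGroup model, mark_done helper and all_done hook are inventions for illustration; deferred.defer's _name option sets the task name):
import uuid
from google.appengine.ext import deferred, ndb

class TaskGroup(ndb.Model):
    pending = ndb.StringProperty(repeated=True)

def enqueue_group(group_key, payloads):
    names = ["job-%s-%s" % (group_key, uuid.uuid4().hex) for _ in payloads]
    TaskGroup(id=group_key, pending=names).put()   # record the names before enqueueing
    for name, payload in zip(names, payloads):
        deferred.defer(run_one, payload, group_key, name, _name=name)

@ndb.transactional
def mark_done(group_key, name):
    grp = TaskGroup.get_by_id(group_key)
    if name in grp.pending:          # idempotent: a duplicate run changes nothing
        grp.pending.remove(name)
        grp.put()
    return not grp.pending           # True once every task has reported in

def run_one(payload, group_key, name):
    # ... do the actual work, then mark this task as completed ...
    if mark_done(group_key, name):
        all_done(group_key)          # hypothetical "whole group finished" hook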
I found that I can set a task to run at a specific interval or at specific times from here, but that was only done at task declaration. How do I set a task to run periodically, dynamically?
The schedule is derived from a setting, and thus seems to be immutable at runtime.
You can probably accomplish what you're looking for using task ETAs. This guarantees that your task won't run before the desired time, but doesn't promise to run the task at the designated time: if the workers are overloaded at the designated ETA, the task may run later.
If that restriction isn't an issue, you could write a task which re-runs itself, like:
@task
def mytask():
    keep_running = ...  # Boolean: should the task keep running?
    if keep_running:
        run_again = ...  # calculate when to run again
        mytask.apply_async(eta=run_again)
    # ... do the stuff you came here to do ...
The major downside of this approach is that you are relying on the task store to remember the tasks in flight. If one of them fails before firing off the next one, then the task will never run again. If your broker isn't persisted to disk and it dies (taking all in-flight tasks with it), then none of those tasks will run again.
You could solve these issues with some kind of transaction logging and a periodic "nanny" task whose job it is to find such repeating tasks that died an untimely death and revive them.
If I had to implement what you've described, I think this is how I would approach it.
celery.task.base.PeriodicTask defines is_due which determines when the next run should be. You could override this function to contain your custom dynamic running logic. See the docs here: http://docs.celeryproject.org/en/latest/reference/celery.task.base.html?highlight=is_due#celery.task.base.PeriodicTask.is_due
An example:
import random
from celery.task import PeriodicTask

class MyTask(PeriodicTask):

    def run(self, **kwargs):
        logger = self.get_logger(**kwargs)
        logger.info("Running my task")

    def is_due(self, last_run_at):
        # Add your logic for when to run. Mine is random.
        if random.random() < 0.5:
            # Run now and check again in a minute
            return (True, 60)
        else:
            # Don't run now, but check again in 10 secs
            return (False, 10)
See here: http://docs.celeryproject.org/en/latest/userguide/periodic-tasks.html
I don't think you can make it dynamic... the best way is to create a task from within a task :D
For example, if you want to run something X seconds later, you create a new task with an X-second delay, and in that task create another task with an N*X-second delay...
This should help you some... http://celery.readthedocs.org/en/latest/faq.html#can-i-change-the-interval-of-a-periodic-task-at-runtime
Once you've defined a custom schedule, assign it to your task as asksol has suggested above.
CELERYBEAT_SCHEDULE = {
    "my_name": {
        "task": "myapp.tasks.task",
        "schedule": myschedule(),
    },
}
You might also want to modify CELERYBEAT_MAX_LOOP_INTERVAL if you want your schedule to update more often than every five minutes.
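For completeness, a rough sketch of what such a myschedule() could return (the DynamicSchedule name and the load_interval_from_db lookup are made up; celery.schedules.schedule exposes the same is_due(last_run_at) hook as PeriodicTask above):
from celery.schedules import schedule

class DynamicSchedule(schedule):
    def is_due(self, last_run_at):
        interval = load_interval_from_db()   # hypothetical runtime lookup
        elapsed = (self.now() - last_run_at).total_seconds()
        if elapsed >= interval:
            return True, interval            # run now, check back in `interval` seconds
        return False, interval - elapsed     # not yet due; check back when it is

def myschedule():
    return DynamicSchedule()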