In my Python project I want to use Celery to create a pipeline of tasks: some tasks will be grouped, and this group is part of a chain. Schema of the pipeline:
task_chain = chain(
    group(
        chain(task1.s(uid=uid, index=i), task2.s(uid=uid, index=i))
        for i in range(len(collection))
    ),
    task3.s(uid=uid),
    task4.s(uid=uid),
    reduce_job_results_from_pages.s(job_uid=job_uid),
    push_metrics.s(job_uid=job_uid),
)
Should I use a result backend in this case, or is the broker alone enough?
I don't understand what mechanism Celery uses to synchronize task results and pass the result of a previous task (or group of tasks) to the next one in the chain.
Thank you!
The answer is somewhat provided in the Important Notes section on the Canvas page:
Tasks used within a chord must not ignore their results. In practice this means that you must enable a result_backend in order to use chords. Additionally, if task_ignore_result is set to True in your configuration, be sure that the individual tasks to be used within the chord are defined with ignore_result=False. This applies to both Task subclasses and decorated tasks.
You may think you could get away without one since you do not use a chord explicitly. However, I believe Celery will transform any chain with a group in it into a chord.
Related
I have to spawn certain tasks and have them execute in parallel. However, I also need the results of all these tasks to be collected centrally.
Is it possible to access the results of all these tasks within a parent task somehow? I know I can't call task_result.get() from a task since Celery doesn't allow it; is there any other way to achieve this?
You can make Celery wait for the result of a subtask (see disable_sync_subtasks parameter to get()), it's just not recommended because you could deadlock the worker (see here for more details). So if you use it, you should know what you are doing.
The recommended way for your use case is to use a chord:
A chord is just like a group but with a callback. A chord consists of a header group and a body, where the body is a task that should execute after all of the tasks in the header are complete.
This would indeed require you to refactor your logic a bit so you don't need the subtasks' results inside the parent task but to process it in the chord's body.
Pardon my ignorance as I am learning how I can use celery for my purposes.
Suppose I have two tasks: create_ticket and add_message_to_ticket. Usually the create_ticket task is created and completed before the add_message_to_ticket tasks are created (multiple times).
@app.task
def create_ticket(ticket_id):
    time.sleep(random.uniform(1.0, 4.0))  # replace with code that processes ticket creation
    return f"Successfully processed ticket creation: {ticket_id}"

@app.task
def add_message_to_ticket(ticket_id, who, when, message_contents):
    # TODO add code that checks whether the create_ticket task for ticket_id has already completed
    time.sleep(random.uniform(1.0, 4.0))  # replace with code that handles the added message
    return f"Successfully processed message for ticket {ticket_id} by {who} at {when}"
Now suppose that these tasks are created out of order because Python's server receives the events from an external web service out of order. For example, one add_message_to_ticket.delay(82, "auroranil", 1599039427, "This issue also occurs on Microsoft Edge on Windows 10.") gets called a few seconds before create_ticket.delay(82) gets called. How would I solve the following problems?
How would I fetch the result of the create_ticket Celery task by specifying ticket_id within the add_message_to_ticket task? All I can think of is to maintain a database that stores ticket state and check whether a particular ticket has been created, but I want to know if I can use Celery's result backend somehow.
If I receive an add_message_to_ticket task with a ticket id for which the corresponding create_ticket task has not completed, do I reject that task and put it back in the queue?
Do I need to ensure that the tasks are idempotent? I know that is good practice, but is it a requirement for this to work?
Is there a better approach at solving this problem? I am aware of Celery Canvas workflow with primitives such as chain, but I am not sure how I can ensure that these events are processed in order, or be able to put tasks on pending state while it waits for tasks it depends on to be completed based on arguments I want celery to check, which in this case is ticket_id.
I am not particularly worried if I receive multiple user messages for a particular ticket with timestamps out of order, as that is not as important as knowing that a ticket has been created before messages are added to it. The point I am making is that I am coding up several tasks where some events crucially depend on others, whereas the ordering of other events does not matter as much for the Python server to function.
Edit:
Partial solutions:
Use task_id to identify Celery tasks, with a formatted string containing argument values which identifies that task. For example, task_id="create_ticket(\"TICKET000001\")"
Retry tasks that do not meet dependency requirements. Blocking while waiting for subtasks to complete is bad, as a subtask may never complete and would hog a process on one of the worker machines.
Store arguments as part of the result of a completed task, so that information not otherwise available can be used in later tasks.
Relevant links:
Where do you set the task_id of a celery task?
Retrieve result from 'task_id' in Celery from unknown task
Find out whether celery task exists
More questions:
How do I ensure that I send a task only once per task_id? For instance, I want the create_ticket task to be applied asynchronously only once. This is an alternative to making all tasks idempotent.
How do I use AsyncResult in add_message_to_ticket to check for status of create_ticket task? Is it possible to specify a chain somehow even though the first task may have already been completed?
How do I fetch all results of tasks given task name derived from the name of the function definition?
Most importantly, should I use Celery results backend to abstract stored data away from dealing with a database? Or should I scratch this idea and just go ahead with designing a database schema instead?
I am using celery 3 with Django.
I have a list of jobs in a database. A user can start a particular job, which starts a Celery task.
Now I want the user to be able to start multiple jobs; they should be added to the Celery queue and processed one after the other, not in parallel.
I am trying to create a job scheduler with celery where user can select the jobs to execute and they will be executed in sequential fashion.
If I use chain() then I cannot add new tasks to the chain dynamically.
What is the best solution?
A better primitive for you to use would be link instead of chain in Celery.
From the documentation:
s = add.s(2, 2)
s.link(mul.s(4))
s.link(log_result.s())
You can see how this allows you to dynamically add a task to be executed by iterating through the required tasks in a loop, and linking each one's signature. After the loop you would want to call something like s.apply_async(...) to execute them.
If I understood the tutorial correctly, a Celery subtask supports almost the same API as a task, but has the additional advantage that it can be passed around to other functions or processes.
Clearly, if that were the case, Celery would have simply replaced tasks with subtasks instead of keeping both (e.g., the @app.task decorator would have converted a function to a subtask instead of a task, etc.). So I must be misunderstanding something.
What can a task do that a subtask can't?
Celery API changed quite a bit; my question is specific to version 3.1 (currently, the latest).
Edit:
I know the docs say subtasks are intended to be called from other tasks. My question is what prevents Celery from getting rid of tasks completely and using subtasks everywhere? They seem to be strictly more flexible/powerful than tasks:
# tasks.py
from celery import Celery

app = Celery(backend='rpc://')

@app.task
def add(x, y):
    # just print out a log line for testing purposes
    print(x, y)

# client.py
from tasks import add

add_subtask = add.subtask()

# in this context, it seems the following two lines do the same thing
add.delay(2, 2)
add_subtask.delay(2, 2)

# when we need to pass arguments to other tasks, we must use add_subtask
# so it seems add_subtask is strictly better than add
You will appreciate the difference once you start using complex workflows with Celery.
A signature() wraps the arguments, keyword arguments, and execution
options of a single task invocation in a way such that it can be
passed to functions or even serialized and sent across the wire.
Signatures are often nicknamed “subtasks” because they describe a task
to be called within a task.
Also:
subtasks are objects used to pass around the signature of a task invocation (for example, to send it over the network)
A task is just a function definition wrapped with a decorator, whereas a subtask is a task with parameters bound but not yet started. You can transfer a subtask serialized over the network or, more commonly, call it within a group/chain/chord.
I am a beginner in Django and Celery and I'm trying to chain three tasks as follows:
tasks = chain(task_analyze1.s(), task_analyze2.s(), task_combined.s())
where I do further processing on outputs of task_analyze1 and task_analyze2 inside task_combined.
But from what I have read online so far, it seems that in a chain the output of one task is passed to the next task, so I will only get the output of task_analyze2 in task_combined.
Is there a way, either using chain or some other primitive, to get the output of both tasks?
Edit:
One possible workaround that comes to mind is to include the output of the first task in the output of the second task as well. But since the second task is used in a few other places in my case, changing it would break a lot of other code.
I was interested to know whether Celery's chain has a mechanism that would allow outputs to propagate further down the chain, not only to the immediately following task.