I would like to know when some tasks have finished executing, something I can achieve in celery with:
task.ready()
I really don't care about the actual results; I only need to know whether the tasks are completed.
Storing the results is not an option, because they are complex objects from an external library, and they are not serializable.
So, is it possible to know when a task is ready, without having to store the results?
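The closest workaround I can think of is to keep the result backend but return nothing, so only the task's state is stored; a sketch (do_expensive_work and data are placeholders):

@app.task
def process(data):
    do_expensive_work(data)  # produces the complex, non-serializable objects
    return None              # store no payload; the SUCCESS state is still recorded

result = process.delay(data)
result.ready()  # True once the task has finished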
Pardon my ignorance, as I am learning how to use Celery for my purposes.
Suppose I have two tasks: create_ticket and add_message_to_ticket. Usually the create_ticket task is created and completed before multiple add_message_to_ticket tasks are created.
import random
import time

@app.task
def create_ticket(ticket_id):
    time.sleep(random.uniform(1.0, 4.0))  # replace with code that processes ticket creation
    return f"Successfully processed ticket creation: {ticket_id}"

@app.task
def add_message_to_ticket(ticket_id, who, when, message_contents):
    # TODO add code that checks whether the create_ticket task for ticket_id has already completed
    time.sleep(random.uniform(1.0, 4.0))  # replace with code that handles the added message
    return f"Successfully processed message for ticket {ticket_id} by {who} at {when}"
Now suppose that these tasks are created out of order because Python's server receives the events from an external web service out of order. For example, one add_message_to_ticket.delay(82, "auroranil", 1599039427, "This issue also occurs on Microsoft Edge on Windows 10.") gets called a few seconds before create_ticket.delay(82) gets called. How would I solve the following problems?
How would I fetch the result of the create_ticket Celery task for a given ticket_id from within add_message_to_ticket? All I can think of is to maintain a database that stores ticket state and check whether a particular ticket has been created, but I want to know if I am able to use Celery's result backend somehow.
If I receive an add_message_to_ticket task for a ticket id whose corresponding create_ticket task has not completed, do I reject that task and put it back in the queue?
Do I need to ensure that the tasks are idempotent? I know that is good practice, but is it a requirement for this to work?
Is there a better approach to solving this problem? I am aware of the Celery Canvas workflow with primitives such as chain, but I am not sure how I can ensure that these events are processed in order, or put tasks in a pending state while they wait for the tasks they depend on to complete, keyed on the arguments I want Celery to check, which in this case is ticket_id.
I am not particularly worried if I receive multiple user messages for a particular ticket with out-of-order timestamps, as that is not as important as knowing that a ticket has been created before messages are added to it. The point I am making is that I am coding up several tasks where some events crucially depend on others, whereas the ordering of other events does not matter as much for the Python server to function.
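For reference, this is roughly what the chain primitive looks like when both signatures can be created together; a minimal sketch, which by itself doesn't solve the out-of-order delivery problem:

from celery import chain

# si() makes the second signature immutable, so it keeps its own
# arguments instead of receiving create_ticket's return value.
chain(
    create_ticket.s(82),
    add_message_to_ticket.si(82, "auroranil", 1599039427,
                             "This issue also occurs on Microsoft Edge on Windows 10."),
).delay()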
Edit:
Partial solutions:
Use task_id to identify Celery tasks, with a formatted string containing the argument values that identify the task, for example task_id="create_ticket(\"TICKET000001\")" (see the sketch after this list).
Retry tasks that do not meet dependency requirements. Blocking while waiting for subtasks to complete is bad, as a subtask may never complete, and waiting would hog a process on one of the worker machines.
Store the arguments as part of the result of a completed task, so that later tasks can use information that is otherwise unavailable to them.
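A sketch of the first partial solution (the id format is just a convention I made up):

def ticket_task_id(ticket_id):
    # Deterministic task id derived from the arguments, e.g.
    # 'create_ticket("TICKET000001")', so other tasks can rebuild it.
    return f'create_ticket("{ticket_id}")'

create_ticket.apply_async(args=("TICKET000001",),
                          task_id=ticket_task_id("TICKET000001"))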
Relevant links:
Where do you set the task_id of a celery task?
Retrieve result from 'task_id' in Celery from unknown task
Find out whether celery task exists
More questions:
How do I ensure that I send a task only once per task_id? For instance, I want the create_ticket task to be applied asynchronously only once. This is an alternative to making all tasks idempotent.
How do I use AsyncResult in add_message_to_ticket to check the status of the create_ticket task (see the sketch after this list)? Is it possible to specify a chain somehow even though the first task may have already been completed?
How do I fetch all results of tasks given a task name derived from the name of the function definition?
Most importantly, should I use the Celery result backend to abstract stored data away from dealing with a database? Or should I scrap this idea and just go ahead with designing a database schema instead?
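For the AsyncResult question above, a sketch of what I imagine: rebuild the deterministic task id from the earlier sketch and poll it, retrying instead of blocking (assumes a result backend is configured):

from celery.result import AsyncResult

@app.task(bind=True, max_retries=None)
def add_message_to_ticket(self, ticket_id, who, when, message_contents):
    dep = AsyncResult(f'create_ticket("{ticket_id}")', app=self.app)
    if not dep.successful():
        # Dependency not done yet: requeue this task rather than block a worker.
        raise self.retry(countdown=5)
    # ... handle the message as before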
I currently have a Celery beat task that periodically computes some result. Externally, I may have API calls (at an unknown frequency) that query for this result. I was thinking of using the "last run" result, so that when the API makes the query, Celery could simply return the last result the beat task produced.
I, however, do not see any documentation for this behavior. I have occasionally seen posts linking to the celery "task result store", but unfortunately all the links have given me a 404 Error.
I think it's not possible.
Even worker inspection doesn't give the list of finished tasks, nor their corresponding ids. Maybe the best way is to write the data directly to Redis and read it back later.
Another approach that may work is to share the task id, which is accessible in the task (more), and get the result as described in Retrieve task result by id in Celery.
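A sketch of the Redis idea (assumes redis-py and a reachable Redis instance; the key name is made up):

import json
import redis

r = redis.Redis(host="localhost", port=6379, db=0)

@app.task
def periodic_compute():
    result = compute_result()  # hypothetical: the beat task's computation
    # Overwrite the latest result under a fixed, well-known key.
    r.set("beat:last_result", json.dumps(result))

# In the API handler, read the last stored result back:
def get_last_result():
    raw = r.get("beat:last_result")
    return json.loads(raw) if raw is not None else None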
I am starting with Luigi, and I wonder how Luigi knows that it shouldn't re-run a task that has already run successfully with the same parameters. I read through the docs but didn't find the answer.
Hypotheses:
Does Luigi store the state (task instances and their results) in memory, without using a database? If so, when I restart the scheduler, does it forget everything and re-run all tasks?
Or does Luigi always run task.complete for every scheduled task to see whether the task should be run? That would mean the complete handler should be really quick.
Or, does it work in a different way?
Thanks for help!
Aha, found this in task.output:
The output of the Task determines if the Task needs to be run–the task is considered finished iff the outputs all exist. Subclasses should override this method to return a single Target or a list of Target instances.
So it means that complete or output.exists should be really, really fast.
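Accordingly, a typical task looks something like this (a minimal sketch; the path is made up), where output().exists() is the cheap completeness check:

import luigi

class MyTask(luigi.Task):
    param = luigi.Parameter()

    def output(self):
        # Luigi calls output().exists() (via complete()) to decide
        # whether to run this task, so this must be cheap.
        return luigi.LocalTarget(f"/tmp/mytask_{self.param}.done")

    def run(self):
        with self.output().open("w") as f:
            f.write("done")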
My app spiders various websites, and uses Celery to do it in a nice, distributed way.
The spidering can be split into stages, kind of like a chain, except I don't know exactly what tasks are going to be in each stage ahead of time. For example, a lot of spiders run one task to get a list of results, then run another task for each result to get more information on the result. I'm going to call this new kind of thing an "unknown chain".
My problem is how to implement the unknown chain. I'd like to be able to use it wherever a chain can be used, such as waiting for it synchronously, running it with a callback, or (most importantly) putting it into a chord.
My current solution is to have the task for each stage return the signature for the next stage. I can then create one function that synchronously waits for the unknown chain to complete:
from celery.canvas import Signature

def run_unknown_chain_sync(unknown_chain):
    # Keep running stages until a stage returns a final result
    # instead of the signature of the next stage.
    result = unknown_chain.delay().get()
    while isinstance(result, Signature):
        result = result.delay().get()
    return result
And another function + task that does it asynchronously with a callback:
from celery import chain
from celery.canvas import Signature

def run_query_async(unknown_chain, callback):
    unknown_chain_advance.delay(unknown_chain, callback)

@app.task
def unknown_chain_advance(unknown_chain, callback):
    if isinstance(unknown_chain, Signature):
        # Not finished yet: run the next stage, then advance on its result.
        chain(unknown_chain, unknown_chain_advance.s(callback)).delay()
    else:
        # unknown_chain is now the final result; hand it to the callback.
        callback.delay(unknown_chain)
The main problem with this solution is that the unknown chain can't be used in a chord.
Other ideas I came up with:
Do some kind of yucky messing around with Celery's innards and somehow create a new kind of task that represents an unknown chain. If it looks like a task, it should work like a task.
It would intercept whatever is reporting the task as finished, and check if the task is actually done or just returning the next stage. If it's returning the next stage, it would "forget" to report the task as finished and start the next stage, and chain something onto that which repeats the process.
Not a very good idea because it will break when I update Celery. Also, I haven't looked too close at the Celery codebase, but I suspect this might be impossible.
Create a new kind of primitive, kind of like chain, but called unknown_chain. I doubt this can be done because from my reading of the celery code, Celery is not designed to allow you to make new kinds of signatures like this.
Invent my own way of chording unknown chains, like I invented my own way of running them with a callback. The question is, how the hell would you do that?
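One more idea, hedged: if Task.replace (Celery 4+) works the way I read the docs, a bound task can replace itself with the signature of the next stage, and an enclosing chord waits for the replacement rather than the original task. A rough sketch (fetch_result_list and get_details are hypothetical):

from celery import group

@app.task(bind=True)
def list_results(self):
    results = fetch_result_list()  # hypothetical first spidering stage
    # Replace this task with the fan-out stage; a surrounding chord
    # would then wait on the group instead of this task.
    return self.replace(group(get_details.s(r) for r in results))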
I have a certain type of task that does something that I would like refreshed a few minutes after it originally ran, if a certain condition is met.
As far as I can see, there's no way to rerun a task that has previously run, since the information about the task request (args, kwargs, priority...) is not saved anywhere.
I can see that it appears in Flower, but I assume that's because it uses Celery events.
Is there any way to accomplish what I want? I could add a post-task hook which saves the request info, but that seems a bit off.
I'm using RabbitMQ as the broker and MongoDB as the results backend.
As per the docs, apply_async has a countdown option allowing you to delay the execution by a certain number of seconds.
You could just make a recursive task:
@app.task
def my_task(an_arg):
    # do something
    my_task.apply_async(countdown=120, kwargs={"an_arg": an_arg})  # re-run in two minutes
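If the re-run should stop once the condition no longer holds, a variant using Celery's built-in retry keeps the original args/kwargs automatically (needs_refresh is a hypothetical predicate):

@app.task(bind=True, max_retries=None)
def my_task(self, an_arg):
    # do something
    if needs_refresh(an_arg):  # hypothetical: the condition from the question
        raise self.retry(countdown=120)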