I have pretty unique behavior I need to achieve with celery. I understand that it is not recommended to have tasks block at all, however I think it is necessary here as I describe below. Pseudocode:
Task 1:
Set event to false
Start group of task 2
Scrape website every few seconds to check for changes
If changes found, set event
Task 2:
Log into website with selenium.
Block until event from Task 1 is set
Perform website action with selenium
I would want task2 to be executed multiple times in parallel for multiple users. Therefore checking the website for updates in each instance of task2 would result in a large number of requests to the website which is not acceptable.
For a normal flow like this, I would to use task1 to start login tasks in a group and start another group after the condition has been met to execute the action tasks. However, the web action is time-sensitive and I don't want to re-open a new selenium instance (which would defeat the purpose of having this structure in the first place).
I've seen examples like this: Flask Celery task locking but using a Redis cache seems unnecessary for this application (and it does not need to be atomic because the 'lock' is only modified by task1). I've also looked into Celery's remote control but I'm not sure if there is the capability to block until a signal is received.
There is a similar question here which was solved by splitting the task I want to block into 2 separate tasks, but again I can't do this.
Celery tasks can themselves enqueue tasks, so it's possible to wait for an event like "it's 9am", and then spawn off a bunch of parallel tasks. If you need to launch an additional task on the completion of a group of parallel tasks (i.e., if you need a fan-in task at the completion of all fan-out tasks), the mechanism you want is chords.
Related
I am new to both python and celery, and not quite sure how to define my problem regarding celery.
I will try to explain my problem with an example;
Lets say there is a system where users buy tickets for events. Each event has a fixed cap of tickets(10). Requests to API is synchronous and buy requests return a success or fail depending of the event's capacity. Obviously this might create a race-condition for that resource.
There are two solutions I could think of:
Somehow add a database constraint that will block the insert requests when event is full.(Could not find any applicable solution btw)
Add a sequential task queue for buy ticket events. This way it is ensured that there will be no race. However since I am waiting for the API response, with enough load it might take ages for the task to be executed, if there is only 1 queue. To reduce this waiting time, I have to implement a event-id-based queue mechanism, where each event has its own sequential queue created dynamically, and will have its own consumer. Once the event is full consumer should be dissolved. Is there any way to achieve this with celery?
Thanks.
Pardon my ignorance as I am learning how I can use celery for my purposes.
Suppose I have two tasks: create_ticket and add_message_to_ticket. Usually create_ticket task is created and completed before add_message_to_ticket tasks are created multiple times.
#app.task
def create_ticket(ticket_id):
time.sleep(random.uniform(1.0, 4.0)) # replace with code that processes ticket creation
return f"Successfully processed ticket creation: {ticket_id}"
#app.task
def add_message_to_ticket(ticket_id, who, when, message_contents):
# TODO add code that checks to see if create_ticket task for ticket_id has already been completed
time.sleep(random.uniform(1.0, 4.0)) # replace with code that handles added message
return f"Successfully processed message for ticket {ticket_id} by {who} at {when}"
Now suppose that these tasks are created out of order due to Python's server receiving the events from an external web service out of order. For example, one add_message_to_ticket.delay(82, "auroranil", 1599039427, "This issue also occurs on Microsoft Edge on Windows 10.") gets called few seconds before create_ticket.delay(82) gets called. How would I solve the following problems?
How would I fetch results of celery task create_ticket by specifying ticket_id within task add_message_to_ticket? All I can think of is to maintain a database that stores tickets state, and checks to see if a particular ticket has been created, but I want to know if I am able to use celery's result backend somehow.
If I receive an add_message_to_ticket task with a ticket id where I find out that corresponding ticket does not have create_ticket task completed, do I reject that task, and put that back in the queue?
Do I need to ensure that the tasks are idempotent? I know that is good practice, but is it a requirement for this to work?
Is there a better approach at solving this problem? I am aware of Celery Canvas workflow with primitives such as chain, but I am not sure how I can ensure that these events are processed in order, or be able to put tasks on pending state while it waits for tasks it depends on to be completed based on arguments I want celery to check, which in this case is ticket_id.
I am not particularly worried if I receive multiple user messages for a particular ticket with timestamps out of order, as it is not as important as knowing that a ticket has been created before messages are added to that ticket. The point I am making is that I am coding up several tasks where some events crucially depend on others, whereas the ordering of other events do not matter as much for the Python's server to function.
Edit:
Partial solutions:
Use task_id to identify Celery tasks, with a formatted string containing argument values which identifies that task. For example, task_id="create_ticket(\"TICKET000001\")"
Retry tasks that do not meet dependency requirements. Blocking for subtasks to be completed is bad, as subtask may never complete, and will hog a process in one of the worker machines.
Store arguments as part of result of a completed task, so that you can use that information not available in later tasks.
Relevant links:
Where do you set the task_id of a celery task?
Retrieve result from 'task_id' in Celery from unknown task
Find out whether celery task exists
More questions:
How do I ensure that I send task once per task_id? For instance, I want create_ticket task to be applied asynchronous only once. This is an alternative to making all tasks idempotent.
How do I use AsyncResult in add_message_to_ticket to check for status of create_ticket task? Is it possible to specify a chain somehow even though the first task may have already been completed?
How do I fetch all results of tasks given task name derived from the name of the function definition?
Most importantly, should I use Celery results backend to abstract stored data away from dealing with a database? Or should I scratch this idea and just go ahead with designing a database schema instead?
I have a use case where I need to poll the API every 1 sec (basically infinite while loop). The polling will be initiated dynamically by user through an external system. This means there can be multiple polling running at the same time. The polling will be completed when the API returns 400. Anyways, my current implementation looks something like:
Flask APP deployed on heroku.
Flask APP has an endpoint which external system calls to start polling.
That flask endpoint will add the message to queue and as soon as worker gets it, it will start polling. I am using Heroku Redis to Go addons. Under the hood it uses python-rq and redis.
The problem is when some polling process goes on for a long time, the other process just sits on the queue. I want to be able to do all of the polling in a concurrent process.
What's the best approach to tackle this problem? Fire up multiple workers?
What if there could be potentially more than 100 concurrent processes.
You could implement a "weighted"/priority queue. There may be multiple ways of implementing this, but the simplest example that comes to my mind is using a min or max heap.
You shoud keep track of how many events are in the queue for each process, as the number of events for one process grows, the weight of the new inserted events should decrease. Everytime an event is processed, you start processing the following one with the greatest weight.
PS More workers will also speed up the the work.
I'm running Django, Celery and RabbitMQ. What I'm trying to achieve is to ensure, that tasks related to one user are executed in order (specifically, one at the time, I don't want task concurrency per user)
whenever new task is added for user, it should depend on the most recently added task. Additional functionality might include not adding task to queue, if task of this type is queued for this user and has not yet started.
I've done some research and:
I couldn't find a way to link newly created task with already queued one in Celery itself, chains seem to be only able to link new tasks.
I think that both functionalities are possible to implement with custom RabbitMQ message handler, though it might be hard to code after all.
I've also read about celery-tasktree and this might be an easiest way to ensure execution order, but how do I link new task with already "applied_async" task_tree or queue? Is there any way that I could implement that additional no-duplicate functionality using this package?
Edit: There is this also this "lock" example in celery cookbook and as the concept is fine, I can't see a possible way to make it work as intended in my case - simply if I can't acquire lock for user, task would have to be retried, but this means pushing it to the end of queue.
What would be the best course of action here?
If you configure the celery workers so that they can only execute one task at a time (see worker_concurrency setting), then you could enforce the concurrency that you need on a per user basis. Using a method like
NUMBER_OF_CELERY_WORKERS = 10
def get_task_queue_for_user(user):
return "user_queue_{}".format(user.id % NUMBER_OF_CELERY_WORKERS)
to get the task queue based on the user id, every task will be assigned to the same queue for each user. The workers would need to be configured to only consume tasks from a single task queue.
It would play out like this:
User 49 triggers a task
The task is sent to user_queue_9
When the one and only celery worker that is listening to user_queue_9 is ready to consume a new task, the task is executed
This is a hacky answer though, because
requiring just a single celery worker for each queue is a brittle system -- if the celery worker stops, the whole queue stops
the workers are running inefficiently
I need to import some data to show it for user but page execution time exceeds 30 second limit. So I decided to split my big code into several tasks and try Task Queues. I add about 10-20 tasks to queue and app engine executes tasks in parallel while user is waiting for data. How can I determine that my tasks are completed to show user data ASAP? Can I somehow iterate over active tasks?
I've solved this in the past by keeping the status for the tasks in memcached, and polling (via Ajax) to determine when the tasks are finished.
If you go this way, it's best if you can always "manually" determine the status of the tasks without looking in memcached, since there's always the (slim) chance that memcache will go down or will get cleared or something as a task is running.