I'm building a basic ETL pipeline that hits a main endpoint which holds a list of IDs (variable amounts on each call) for processing. My current thinking is to use RabbitMQ as a queue system and have three tasks (Extract, Transform, Load) consume from RabbitMQ. Most tutorials I've seen online showcase a simple sequential execution of tasks before they exit. I've tried to construct a DAG that does this sequential action on each ID we receive. But I've run into problems trying to figure out how to schedule all these tasks through airflow, when I don't know how many IDs exist.
This is the general Tree view of the DAG in question:
I've begun the process of utilizing RabbitMQ to push these IDs to a queue and have celery spool up a variable amount of workers to handle the load. The problem I've run into is I don't know how to break out of the "consumption" loop. For example (I'm using pseudocode for some abstraction over RabbitMQ):
def extract():
    # callback function when messages are sent to this worker
    def _extract(channel, url, rest):
        resp = requests.get(url)
        channel.publish('transform_queue', resp)

    # attach the callback to the queue
    channel.basic_consume('extract_queue', callback=_extract)
    channel.start_consuming()  # runs a pseudo loop waiting for messages here
Just as a note, some of the variables (such as the channel underneath _extract) are implicit, but will most likely be wrapped in a Custom Operator.
The Load and Transform functions work similarly. The problem I've run into is that once the function starts consuming, it doesn't stop until it's shut down. I've been able to send sentinel messages that let the function "exit"; however, this causes the Task to be marked as Failed and sent to retry. For example, here's the code for the sentinel shutdown:
def extract():
    # callback function when messages are sent to this worker
    def _extract(channel, message, rest):
        if message == SHUTDOWN:
            exit()
        resp = requests.get(message.url)
        channel.publish('transform_queue', resp)

    # attach the callback to the queue
    channel.basic_consume('extract_queue', callback=_extract)
    channel.start_consuming()  # runs a pseudo loop waiting for messages here
There's also the option to selectively cancel consumers, however this would just add more complexity as there is still the issue of polling for cancellation, and then that task would end up with the same issue above.
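For illustration, here is a minimal sketch of a sentinel shutdown that returns normally instead of calling exit(). It uses the standard-library queue.Queue as a stand-in for the RabbitMQ channel; the SHUTDOWN object, the fetch parameter, and the queue arguments are assumptions made for the example, not part of the setup above.

```python
import queue

# Hypothetical sentinel. A real broker message would need a serializable marker.
SHUTDOWN = object()


def extract(extract_queue, transform_queue, fetch):
    """Consume messages until the sentinel arrives, then return normally."""
    processed = 0
    while True:
        message = extract_queue.get()  # blocks until a message is available
        if message is SHUTDOWN:
            # Leave the loop rather than calling exit(), so the caller
            # sees an ordinary return value instead of SystemExit.
            break
        transform_queue.put(fetch(message))
        processed += 1
    return processed
```

With pika's blocking channel, the analogous move would presumably be to call channel.stop_consuming() from inside the callback, so that start_consuming() returns and the task function can finish on its own.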
The main questions I have are:
Is there a way to exit with success in this setup?
Is this the best way to approach this problem? I imagine this is a common use case for airflow, so there must be some best practices or common setups. However, I haven't been able to find it.
Here is what I understood from your question, along with some suggestions to explore.
You are not sure of the number of inputs, and hence the number of times the flow needs to run
You can create a custom operator (say FindIDs) which starts with finding out how many IDs you need to execute for and pushes the values to XComs. These messages can then be used for your other functions (say extractor, transformer and loader) and they can be set in a sequence as below
start >> findIds >> extractor >> transformer >> loader >> end
Check https://airflow.apache.org/docs/apache-airflow/stable/concepts.html?#xcoms
You need to skip certain executions in case there are no inputs (or IDs, in your case)
I would use ShortCircuitOperator in this case and conditionally skip the execution for the DAG.
Check: https://github.com/apache/airflow/blob/master/airflow/example_dags/example_short_circuit_operator.py
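The two suggestions above can be sketched in plain Python as the callables that such operators might wrap. This is not Airflow code: the names find_ids, has_ids and run_pipeline are hypothetical, and the wiring into a DAG is left out. The idea is that the return value of find_ids would be passed along (for example via XCom), and a check like has_ids would decide whether the rest of the run is skipped.

```python
def find_ids(fetch_ids):
    """Return the IDs for this run as a list (hypothetical helper)."""
    return list(fetch_ids())


def has_ids(ids):
    """Truth value that a short-circuit style check could use to skip the run."""
    return len(ids) > 0


def run_pipeline(ids, extract, transform, load):
    """Apply extract -> transform -> load to every ID, however many there are."""
    return [load(transform(extract(item_id))) for item_id in ids]
```

Because the number of IDs is only known at run time, the per-ID loop lives inside the callable here rather than in the shape of the DAG.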
How can I run another method, action(), automatically when a set of celery tasks has finished? Is there a simple way to trigger another function call on completion?
# tasks.py
@app.task
def rank(item):
    # Update database
    pass

# main.py
import tqdm
from tasks import rank

def action():
    print('Tasks have been finished.')

ans = list()
for item in tqdm.tqdm(all_items):
    rank.apply_async(([{"_id": item["_id"], "max": item["max"]}]))
In the previous message that is very similar to this one, which you deleted, I explained how to do this without using the Chord workflow primitive that you for some reason decided to avoid... You even left some parts of that code here that does nothing (ans = list()). I will put that part of the answer here, as it explains how what you need can be accomplished:
Without some code changes your code will not work. For starters, apply_async() does not return result. So, after you modify the code to ans.append(rank.apply_async(([{"_id": item["_id"], "max": item["max"]}])).get()) it should work as you want, but unfortunately it will not distribute tasks (which is why we use Celery!), so in order to emulate the logic that Chord does, you would need to call apply_async() as you do, store the task IDs, and periodically poll for state. If the task is finished, get the result and do this until all are finished.
Solution B would be to use the Group primitive: schedule your tasks to be executed in a group, obtain the GroupResult object, and do the same as I described above, periodically polling for individual results.
If you do this polling in a loop, then you can simply call action() after the loop, as it will be called after all tasks are finished. Once you implement this you will understand why many experienced Celery users use Chord instead...
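The polling loop described above might look roughly like this. The sketch uses concurrent.futures from the standard library as a stand-in for Celery, so it can run without a broker; with Celery, the stored handles would be the AsyncResult objects returned by apply_async(), and ready() would play the role of done(). The bodies of rank and action are placeholders.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def rank(item):
    # Placeholder for the real task body (e.g. a database update).
    return item["max"] * 2


def action(results):
    # Runs exactly once, after every task has finished.
    return sorted(results)


def run_all_then_action(items, poll_interval=0.01):
    with ThreadPoolExecutor(max_workers=4) as pool:
        # Schedule everything first and keep the handles.
        pending = [pool.submit(rank, item) for item in items]
        results = []
        while pending:
            still_running = []
            for handle in pending:
                if handle.done():
                    results.append(handle.result())
                else:
                    still_running.append(handle)
            pending = still_running
            if pending:
                time.sleep(poll_interval)  # avoid a busy loop between polls
    return action(results)
```

The bookkeeping in the loop is exactly what a chord takes off your hands.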
I am currently working on a test system that uses selenium grid for WhatsApp automation.
WhatsApp requires a QR code scan to log in, but once the code has been scanned, the session persists as long as the cookies remain saved in the browser's user data directory.
I would like to run a series of tests concurrently while making sure that every session is only used by one thread at any given time.
I would also like to be able to add additional tests to the queue while tests are being run.
So far I have considered using the ThreadPoolExecutor context manager in order to limit the maximum available workers to the maximum number of sessions. Something like this:
import concurrent.futures
import queue
from concurrent.futures import ThreadPoolExecutor

def make_queue(questions):
    q = queue.Queue()
    for question in questions:
        q.put(question)
    return q

def test_conversation(q):
    item = q.get()
    # Whatsapp test happens here
    q.task_done()

def run_tests(questions):
    q = make_queue(questions)
    futures = []
    with ThreadPoolExecutor(max_workers=number_of_sessions) as executor:
        # one task per queued question
        for _ in range(q.qsize()):
            futures.append(executor.submit(test_conversation, q))
        for f in concurrent.futures.as_completed(futures):
            f.result()  # save results somewhere
It does not include some way to make sure that every thread gets its own session though and as far as I know I can only send one parameter to the function that the executor calls.
I could make some complicated checkout system that works like borrowing books from a library so that every session can only be checked out once at any given time, but I'm not confident in making something that is thread safe and works in all cases. Even the ones I can't think of until they happen.
I am also not sure how I would keep the thing going while adding items to the queue without it locking up my entire application. Would I have to run run_tests() in its own thread?
Is there an established way to do this? Any help would be much appreciated.
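One possible shape for the "library checkout" idea is to keep the sessions themselves in a queue.Queue, which is already thread-safe: a worker takes a session out, uses it, and puts it back when done. The sketch below is only an illustration under that assumption. Sessions are plain strings and the test body is a placeholder that returns which session handled which question.

```python
import queue
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_tests(questions, sessions):
    # The pool holds idle sessions. get() hands each one to a single thread.
    pool = queue.Queue()
    for session in sessions:
        pool.put(session)

    def test_conversation(question):
        session = pool.get()  # blocks until some session is free
        try:
            # Placeholder for the real WhatsApp test using this session.
            return (session, question)
        finally:
            pool.put(session)  # always return the session, even on error

    with ThreadPoolExecutor(max_workers=len(sessions)) as executor:
        futures = [executor.submit(test_conversation, q) for q in questions]
        return [f.result() for f in as_completed(futures)]
```

Note that executor.submit() forwards any extra positional and keyword arguments to the function, so the callable is not limited to a single parameter.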
Pardon my ignorance as I am learning how I can use celery for my purposes.
Suppose I have two tasks: create_ticket and add_message_to_ticket. Usually create_ticket task is created and completed before add_message_to_ticket tasks are created multiple times.
@app.task
def create_ticket(ticket_id):
    time.sleep(random.uniform(1.0, 4.0))  # replace with code that processes ticket creation
    return f"Successfully processed ticket creation: {ticket_id}"

@app.task
def add_message_to_ticket(ticket_id, who, when, message_contents):
    # TODO add code that checks to see if create_ticket task for ticket_id has already been completed
    time.sleep(random.uniform(1.0, 4.0))  # replace with code that handles added message
    return f"Successfully processed message for ticket {ticket_id} by {who} at {when}"
Now suppose that these tasks are created out of order because the Python server receives the events from an external web service out of order. For example, one add_message_to_ticket.delay(82, "auroranil", 1599039427, "This issue also occurs on Microsoft Edge on Windows 10.") gets called a few seconds before create_ticket.delay(82) gets called. How would I solve the following problems?
How would I fetch results of celery task create_ticket by specifying ticket_id within task add_message_to_ticket? All I can think of is to maintain a database that stores tickets state, and checks to see if a particular ticket has been created, but I want to know if I am able to use celery's result backend somehow.
If I receive an add_message_to_ticket task with a ticket id where I find out that corresponding ticket does not have create_ticket task completed, do I reject that task, and put that back in the queue?
Do I need to ensure that the tasks are idempotent? I know that is good practice, but is it a requirement for this to work?
Is there a better approach at solving this problem? I am aware of Celery Canvas workflow with primitives such as chain, but I am not sure how I can ensure that these events are processed in order, or be able to put tasks on pending state while it waits for tasks it depends on to be completed based on arguments I want celery to check, which in this case is ticket_id.
I am not particularly worried if I receive multiple user messages for a particular ticket with timestamps out of order, as that is not as important as knowing that a ticket has been created before messages are added to it. The point I am making is that I am coding up several tasks where some events crucially depend on others, whereas the ordering of other events does not matter as much for the Python server to function.
Edit:
Partial solutions:
Use task_id to identify Celery tasks, with a formatted string containing argument values which identifies that task. For example, task_id="create_ticket(\"TICKET000001\")"
Retry tasks that do not meet dependency requirements. Blocking for subtasks to be completed is bad, as subtask may never complete, and will hog a process in one of the worker machines.
Store arguments as part of the result of a completed task, so that later tasks can use information that would otherwise not be available to them.
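The retry idea from the partial solutions can be illustrated with a small in-memory simulation. This is a sketch, not Celery code: the deque stands in for the task queue, the created set stands in for whatever store records that a ticket exists, and process_events and max_retries are hypothetical names. In Celery, the requeue step would roughly correspond to a bound task calling self.retry() with a countdown.

```python
from collections import deque


def process_events(events, max_retries=3):
    """Process (kind, ticket_id) events, retrying messages whose ticket is missing."""
    created = set()  # stand-in for a ticket-state store
    log = []
    pending = deque((event, 0) for event in events)
    while pending:
        (kind, ticket_id), attempts = pending.popleft()
        if kind == "create_ticket":
            created.add(ticket_id)
            log.append(("created", ticket_id))
        elif ticket_id in created:
            log.append(("message", ticket_id))
        elif attempts < max_retries:
            # Dependency not met yet: put the event back instead of blocking.
            pending.append(((kind, ticket_id), attempts + 1))
        else:
            # Give up after max_retries so a missing ticket cannot loop forever.
            log.append(("dropped", ticket_id))
    return log
```

The retry cap matters here for the same reason blocking is discouraged above: the create_ticket event may never arrive.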
Relevant links:
Where do you set the task_id of a celery task?
Retrieve result from 'task_id' in Celery from unknown task
Find out whether celery task exists
More questions:
How do I ensure that I send a task only once per task_id? For instance, I want the create_ticket task to be applied asynchronously only once. This is an alternative to making all tasks idempotent.
How do I use AsyncResult in add_message_to_ticket to check for status of create_ticket task? Is it possible to specify a chain somehow even though the first task may have already been completed?
How do I fetch all results of tasks given task name derived from the name of the function definition?
Most importantly, should I use Celery results backend to abstract stored data away from dealing with a database? Or should I scratch this idea and just go ahead with designing a database schema instead?
I have pretty unique behavior I need to achieve with celery. I understand that it is not recommended to have tasks block at all, however I think it is necessary here as I describe below. Pseudocode:
Task 1:
Set event to false
Start group of task 2
Scrape website every few seconds to check for changes
If changes found, set event
Task 2:
Log into website with selenium.
Block until event from Task 1 is set
Perform website action with selenium
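As an in-process illustration of the pseudocode above, here is a sketch using threading.Event: one poller checks for changes and sets the event, while several workers block on it. This only works between threads of a single process, which is an assumption of the example. Celery workers running in separate processes would need a shared store or a broker message to carry the same signal. The names run and check_for_changes are hypothetical.

```python
import threading
import time


def run(num_users, check_for_changes, poll_interval=0.01):
    changed = threading.Event()  # starts unset, i.e. "event is false"
    actions = []
    lock = threading.Lock()

    def task2(user):
        # The login step would happen here, before waiting.
        changed.wait()  # block until task 1 sets the event
        with lock:
            actions.append(user)  # placeholder for the website action

    workers = [threading.Thread(target=task2, args=(u,)) for u in range(num_users)]
    for worker in workers:
        worker.start()

    # Task 1: a single poller, so the site is checked once per interval
    # regardless of how many users are waiting.
    while not check_for_changes():
        time.sleep(poll_interval)
    changed.set()

    for worker in workers:
        worker.join()
    return sorted(actions)
```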
I would want task2 to be executed multiple times in parallel for multiple users. Therefore checking the website for updates in each instance of task2 would result in a large number of requests to the website which is not acceptable.
For a normal flow like this, I would use task1 to start the login tasks in a group, and then start another group to execute the action tasks once the condition has been met. However, the web action is time-sensitive and I don't want to open a new selenium instance (which would defeat the purpose of having this structure in the first place).
I've seen examples like this: Flask Celery task locking but using a Redis cache seems unnecessary for this application (and it does not need to be atomic because the 'lock' is only modified by task1). I've also looked into Celery's remote control but I'm not sure if there is the capability to block until a signal is received.
There is a similar question here which was solved by splitting the task I want to block into 2 separate tasks, but again I can't do this.
Celery tasks can themselves enqueue tasks, so it's possible to wait for an event like "it's 9am", and then spawn off a bunch of parallel tasks. If you need to launch an additional task on the completion of a group of parallel tasks (i.e., if you need a fan-in task at the completion of all fan-out tasks), the mechanism you want is chords.
I'm running Django, Celery and RabbitMQ. What I'm trying to achieve is to ensure that tasks related to one user are executed in order (specifically, one at a time; I don't want task concurrency per user).
Whenever a new task is added for a user, it should depend on the most recently added task. Additional functionality might include not adding a task to the queue if a task of this type is already queued for this user and has not yet started.
I've done some research and:
I couldn't find a way to link newly created task with already queued one in Celery itself, chains seem to be only able to link new tasks.
I think that both functionalities are possible to implement with custom RabbitMQ message handler, though it might be hard to code after all.
I've also read about celery-tasktree, and this might be the easiest way to ensure execution order, but how do I link a new task with an already "applied_async" task_tree or queue? Is there any way that I could implement the additional no-duplicate functionality using this package?
Edit: There is also this "lock" example in the celery cookbook, and while the concept is fine, I can't see a way to make it work as intended in my case: if I can't acquire the lock for a user, the task would have to be retried, but this means pushing it to the end of the queue.
What would be the best course of action here?
If you configure the celery workers so that they can only execute one task at a time (see worker_concurrency setting), then you could enforce the concurrency that you need on a per user basis. Using a method like
NUMBER_OF_CELERY_WORKERS = 10

def get_task_queue_for_user(user):
    return "user_queue_{}".format(user.id % NUMBER_OF_CELERY_WORKERS)
to get the task queue based on the user id, every task will be assigned to the same queue for each user. The workers would need to be configured to only consume tasks from a single task queue.
It would play out like this:
User 49 triggers a task
The task is sent to user_queue_9
When the one and only celery worker that is listening to user_queue_9 is ready to consume a new task, the task is executed
This is a hacky answer though, because
requiring just a single celery worker for each queue is a brittle system -- if the celery worker stops, the whole queue stops
the workers are running inefficiently