Automatically generating ExternalTaskSensor where execution date depends only on DAG id - python

Help me crack this one. I am looking for an elegant solution for dynamically generating ExternalTaskSensor tasks in Airflow with unique execution_date_fn functions while avoiding problems arising from function scopes.
I am trying to create a DAG that depends on several other DAGs, in the sense that it shouldn't run while any of them is running. I am using ExternalTaskSensor operators, one per upstream DAG:
Each sensor task, e.g. sense_dag_1, relies on the DagRun model to find the latest execution_date of the corresponding DAG, e.g. dag_1. This lookup, called get_execution_date in my code, is passed to the ExternalTaskSensor via the execution_date_fn kwarg. This works well when "manually" creating tasks, as shown in this example:
sensor = ExternalTaskSensor(
    task_id='sense_dag_1',
    external_dag_id='dag_1',
    execution_date_fn=lambda dt: get_execution_date('dag_1'),
    dag=dag
)
Note that a little workaround is used: get_execution_date is provided to ExternalTaskSensor as a lambda function that takes dt as input. This solution, however, runs into a problem when automatically generating multiple ExternalTaskSensor tasks for a list of DAG names, as in this naive example:
for dag_id in ['dag_1', 'dag_2']:
    sensor = ExternalTaskSensor(
        task_id='sense_' + dag_id,
        external_dag_id=dag_id,
        execution_date_fn=lambda dt: get_execution_date(dag_id),
        dag=dag
    )
When creating tasks this way, all sensors end up with the same execution_date, namely the one belonging to the last dag_id in the list. I am aware that this problem has to do with closures and variable scope, as described in Python's official documentation.
We end up with two limiting factors:
execution_date_fn only accepts dt and **context as arguments
Lambda functions (and functions in general) capture loop variables by reference, not by value, so each one needs the variable bound in its own scope to pick up the intended value (see the sketch below).
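A minimal plain-Python sketch of the late-binding behaviour (no Airflow involved), just to make the failure mode concrete:

# All lambdas close over the same loop variable; by the time they run it holds 'dag_2'.
fns = []
for dag_id in ['dag_1', 'dag_2']:
    fns.append(lambda: dag_id)
print([f() for f in fns])  # ['dag_2', 'dag_2']

# Binding the value as a default argument captures it at definition time instead.
fns = [lambda dag_id=dag_id: dag_id for dag_id in ['dag_1', 'dag_2']]
print([f() for f in fns])  # ['dag_1', 'dag_2']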
QUESTION: How do I dynamically generate ExternalTaskSensor tasks where execution_date only depends on dag_id?
I have one solution in mind: adding tasks upstream of the sensors that fetch the execution_date and push it into the task context. However, in this case the order of task execution becomes important, and the resulting DAG grows an extra helper task in front of every sensor.
This is not aesthetically pleasing, and I am looking for something better.

I remembered a working solution: move sensor instantiation into a separate function. This way the loop variable and the task creation live in separate scopes, so each sensor gets its own dag_id. Code example:
def create_sensor(dag_id):
    task = ExternalTaskSensor(
        task_id='sense_' + dag_id,
        external_dag_id=dag_id,
        execution_date_fn=lambda dt: get_execution_date(dag_id),
        dag=dag
    )
    return task

for dag_id in ['dag_1', 'dag_2']:
    sensor = create_sensor(dag_id)
More on the topic here: Unexpected Airflow behaviour in dynamic task generation
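An alternative to the factory function, if you prefer to keep the loop flat, is the standard default-argument trick: bind the current value of dag_id as a default parameter of the lambda so it is captured at definition time. A minimal sketch, assuming the same get_execution_date helper and that execution_date_fn is called with the date as its only positional argument (as in the workaround above):

for dag_id in ['dag_1', 'dag_2']:
    sensor = ExternalTaskSensor(
        task_id='sense_' + dag_id,
        external_dag_id=dag_id,
        # dag_id=dag_id freezes the current value at definition time, so each
        # sensor keeps its own DAG id instead of the last one in the loop.
        execution_date_fn=lambda dt, dag_id=dag_id: get_execution_date(dag_id),
        dag=dag
    )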

Related

Return values from a child dag triggered without execution date using TriggerDagRunOperator

Context:
I have a set of operators in a DAG that perform operations based on the dag_run configuration (let's call it the child DAG: dag_c). The child DAG can be triggered by other DAGs (let's call them the parent DAGs: dag_a and dag_b). The parents dag_a and dag_b run at the same time.
Use case:
My use case is to get a return value from dag_c back into dag_a and dag_b. The return value is unique to the parent DAG that triggered the child DAG. Since dag_a and dag_b can run at the same time, I cannot use execution_date. The code in dag_c is common to both parents, and I don't want to copy-paste the same code into both DAGs.
Problem Statement:
Without setting the execution date, I am not able to get the return value using the xcom_pull method.
So far I have tried:
Pushing the return value with xcom_push in the child DAG, but the child DAG's execution date is in the future and the value is not available to the parent DAG.
Doing an xcom_push using the parent DAG's execution date, but it failed since an XCom cannot be set in the past; the job fails with the error 'execution_date can not be in the past'.
Using SubDAGs to generate dag_c and calling them from dag_a and dag_b. This solved my use case, but SubDAGs are not recommended in Airflow, so I would like to know if there is a better way to solve this problem.
I am using Airflow 2.0.2.
This was answered on the Apache Airflow GitHub Discussions board, but to bring these threads together for everyone:
Maybe try Airflow Variables instead of XCom in this case. You could use the Variable.set() method to write the return value required. There would not be any execution_date constraints on the value that's set and the value is still centrally accessible by dag_a and dag_b.
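A minimal sketch of that idea, assuming the child DAG can derive a key that identifies the triggering parent (for example from dag_run.conf); the key name and conf field here are hypothetical:

from airflow.models import Variable

# In dag_c: store the result under a key that encodes which parent triggered this run.
def write_result(**context):
    parent = context["dag_run"].conf.get("parent_dag_id", "unknown")  # hypothetical conf field
    Variable.set(f"dag_c_result_{parent}", "some return value")

# In dag_a (or dag_b): read the value back after dag_c has finished.
def read_result(**context):
    value = Variable.get("dag_c_result_dag_a", default_var=None)
    print(value)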

Return value from one Airflow DAG into another one

My DAG (let's call it DAG_A) starts another DAG (DAG_B) using the TriggerDagRunOperator. DAG_B's tasks use XCom, and I would like to obtain an XCom value from one of the tasks of DAG_B's run (exactly the one I've started) upon completion.
Use of XCOM is not a hard requirement - basically any (reasonable) mechanism that Airflow itself provides would work. I can change DAG_B if needed.
Can't find any examples of such cases, so appreciate the help.
Plan B would be to make DAG_B save XCOM values into some persistent storage like DB or file together with some run id, and DAG_A will take it from there. But I would like to avoid such complications if some built-in mechanisms were available.
You can pull XCom values from another DAG by passing the dag_id to xcom_pull() (see the task_instance.xcom_pull() function documentation). This works as long as you triggered the other DAG using the same execution date as your current DAG. That's trivially achieved by templating the execution_date value:
trigger = TriggerDagRunOperator(
    task_id="trigger_dag_b",
    trigger_dag_id="DAG_B",
    execution_date="{{ execution_date }}",
    ...
)
Then, provided you used an ExternalTaskSensor sensor to wait for the specific task to have completed or used wait_for_completion=True in your TriggerDagRunOperator() task, you can later on pull the XCOM with task_instance.xcom_pull(dag_id="DAG_B", ...) (add in task ids and/or the XCOM key you want to pull).
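For illustration, a minimal sketch of the pulling side using a templated field, assuming Airflow 2-style imports and that DAG_B has a task called produce_value whose return value was pushed as the default XCom (both names are hypothetical):

from airflow.operators.bash import BashOperator

# Pull DAG_B's XCom for the run that shares this DAG's execution date and echo it.
use_value = BashOperator(
    task_id="use_value_from_dag_b",
    bash_command="echo {{ ti.xcom_pull(dag_id='DAG_B', task_ids='produce_value') }}",
    dag=dag,  # assumes the DAG_A object is called `dag`
)
# Run after the trigger above has completed (e.g. with wait_for_completion=True).
trigger >> use_value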
If you are not averse to coding a Python operator, you can also import the XCom model and just use its XCom.get_one() method directly:
from airflow.models import XCom

value = XCom.get_one(
    execution_date=ti.execution_date,
    key="target key",
    task_id="some.task.id",
    dag_id="DAG_B",
)
I've used similar techniques using a multi-dagrun trigger (to process a variable number of resources); this is trickier as in that case you can't re-use the execution date (each dagrun must have a unique (dag_id, execution_date) tuple).
In those cases I either used direct queries (joining the SQLAlchemy XCom model against the DagRun model using dagrun ids stored in an XCom by the trigger, instead of relying on the execution date matching), or avoided the whole issue by configuring the triggered DAGs up front. The latter is achieved by setting up the triggered DAG with configuration telling it where to output results, which the parent DAG then picks up. The documentation doesn't appear to mention this properly, but the conf argument to TriggerDagRunOperator() supports templating too, so you can generate a dictionary there as input to the triggered DAG; its tasks then reference that configuration via params.
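A minimal sketch of the configure-up-front approach, assuming a version where conf is a templated field (as the answer notes), with a hypothetical output_path value passed through conf and read back in the triggered DAG via dag_run.conf (one common way to access the triggering configuration):

from airflow.operators.bash import BashOperator
from airflow.operators.trigger_dagrun import TriggerDagRunOperator

# In the parent DAG: pass a templated conf dictionary telling the child where to write.
trigger = TriggerDagRunOperator(
    task_id="trigger_dag_b",
    trigger_dag_id="DAG_B",
    conf={"output_path": "/data/results/{{ ds }}"},  # hypothetical key and path
    wait_for_completion=True,
    dag=dag_a,  # assumes dag_a is defined elsewhere
)

# In DAG_B: a task reads the output location out of the triggering configuration.
write_results = BashOperator(
    task_id="write_results",
    bash_command="my_job --output {{ dag_run.conf['output_path'] }}",  # hypothetical command
    dag=dag_b,  # assumes dag_b is defined elsewhere
)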

Airflow dynamic tasks at runtime

Other questions about 'dynamic tasks' seem to address dynamic construction of a DAG at schedule or design time. I'm interested in dynamically adding tasks to a DAG during execution.
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.python_operator import PythonOperator
from datetime import datetime

dag = DAG('test_dag', description='a test',
          schedule_interval='0 0 * * *',
          start_date=datetime(2018, 1, 1),
          catchup=False)

def make_tasks():
    du1 = DummyOperator(task_id='dummy1', dag=dag)
    du2 = DummyOperator(task_id='dummy2', dag=dag)
    du3 = DummyOperator(task_id='dummy3', dag=dag)
    du1 >> du2 >> du3

p = PythonOperator(
    task_id='python_operator',
    dag=dag,
    python_callable=make_tasks)
This naive implementation doesn't seem to work - the dummy tasks never show up in the UI.
What's the correct way to add new operators to the DAG during execution? Is it possible?
It is not possible to modify the DAG during its execution (without a lot more work).
The dag = DAG(...) definition is picked up in a loop by the scheduler. It contains the task instance 'python_operator', which gets scheduled in a dag run and executed by a worker or executor. Since DAG models in the Airflow DB are only updated by the scheduler, the dummy tasks added at run time will not be persisted to the DAG nor scheduled to run; they are forgotten when the worker exits. Unless you copy all of the scheduler's code for persisting and updating the model… but even that would be undone the next time the scheduler re-parses the DAG file, which could happen once a minute, once a second, or faster depending on how many other DAG files there are to parse.
Airflow actually wants each DAG to keep approximately the same layout between runs, and it reloads/parses DAG files constantly. So although you could write a DAG file that determines its tasks dynamically from some external data on every parse (preferably cached in a file or pyc module, not fetched over the network like a DB lookup, or you'll slow down the whole scheduling loop for all DAGs), it's not a good plan: your Graph and Tree views will get confusing, and scheduler parsing will be taxed further by that lookup.
You could make the callable run each task…
def make_tasks(**context):
    du1 = DummyOperator(task_id='dummy1', dag=dag)
    du2 = DummyOperator(task_id='dummy2', dag=dag)
    du3 = DummyOperator(task_id='dummy3', dag=dag)
    # Run the operators inline, inside this single task, instead of scheduling them.
    du1.execute(context)
    du2.execute(context)
    du3.execute(context)

p = PythonOperator(
    task_id='python_operator',
    provide_context=True,  # passes the task context into make_tasks (Airflow 1.x style)
    python_callable=make_tasks,
    dag=dag)
But that runs them sequentially, and you have to work out how to parallelise them in Python yourself (futures?); if any of them raises an exception, the whole task fails. It is also bound to a single executor or worker, so it doesn't use Airflow's task distribution (Kubernetes, Mesos, Celery).
The other way to work with this is to add a fixed number of tasks (the maximum number you expect) and use the callable(s) to short-circuit the unneeded tasks, or to push arguments to each of them with XCom, changing their behaviour at run time without changing the DAG.
Regarding your code sample, you never call your function which registers your tasks in your DAG.
To get a kind of dynamic tasks, you can have a single operator that does something different depending on some state, or a handful of operators that can be skipped depending on the state, using a ShortCircuitOperator.
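A minimal sketch of that skipping pattern, assuming a fixed pool of slots and a hypothetical should_run() check that decides which slots are needed for a given run:

from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.python_operator import ShortCircuitOperator

def should_run(slot, **context):
    # Hypothetical check: decide at run time whether this slot has work to do.
    # If it returns False, everything downstream of the gate is skipped.
    return slot < 2

# Declare the maximum number of slots up front; unneeded ones are skipped per run.
for slot in range(5):
    gate = ShortCircuitOperator(
        task_id=f'gate_{slot}',
        python_callable=should_run,
        op_kwargs={'slot': slot},
        dag=dag,  # assumes the `dag` object from the question
    )
    work = DummyOperator(task_id=f'work_{slot}', dag=dag)
    gate >> work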
I appreciate all the work everybody has done here, as I have the same challenge of creating dynamically structured DAGs. I have made enough mistakes to know not to use software against its design. If I can't inspect the whole run in the UI and zoom in and out, i.e. actually use the Airflow features that are the main reason I use it in the first place, then I might as well write multiprocessing code inside a function and be done with it.
That all being said, my solution is to use a resource manager such as Redis locking: one DAG writes to the resource manager data describing what to run and how to run it, and another DAG (or DAGs) runs at regular intervals, polling the resource manager, locking entries before running them and removing them when finished. This way I at least use Airflow as intended, even though its design doesn't exactly meet my needs, and I break the problem down into more definable chunks. The other solutions are creative, but they go against the design and are not tested by the developers, who specifically say to keep workflows fixed in structure. I cannot put untested, against-design workaround code into production unless I rewrite the core Airflow code and test it myself. I understand my solution brings complexity with locking and so on, but at least I know where its boundaries are.

In airflow can end user pass parameters to keys which are associated with some specific dag

I have searched many links but didn't find any solution to my problem. I have seen the option to pass keys/variables in the Airflow UI, but it is really confusing for an end user to work out which key is associated with which DAG. Is there any way to implement functionality like this:
While running an Airflow job, the end user is asked for the values of some parameters, and after entering those details Airflow runs the job.
Unfortunately, it's not possible to wait for user input, say in the Airflow UI. DAGs are programmatically authored, meaning defined as code, and they should not be dynamic, since they are imported by the web server, scheduler and workers at the same time and have to be identical.
There are two workarounds I came up with; we have used the first in production for a while.
1) Create a small wrapper around Variables. For each DAG, load its Variables and compose arguments, which are then passed into operators via default_args (a minimal sketch follows after this list).
2) Add a Slack operator which can be programmatically configured to wait for user input. Afterwards, propagate that information via XCom into the next operator.
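A minimal sketch of the first workaround, assuming a per-DAG Variable naming convention like <dag_id>_config; the key names and defaults here are hypothetical:

import json
from airflow.models import Variable

def load_dag_config(dag_id, defaults=None):
    # Load a per-DAG JSON Variable (e.g. 'my_dag_config') that the end user sets in the UI.
    raw = Variable.get(f"{dag_id}_config", default_var="{}")
    config = dict(defaults or {})
    config.update(json.loads(raw))
    return config

# Compose the values into default_args so every operator in the DAG sees them.
config = load_dag_config("my_dag", defaults={"target_table": "staging"})
default_args = {
    "owner": "airflow",
    "params": config,  # operators can then reference {{ params.target_table }} in templates
}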

What's the difference between Celery task and subtask?

If I understood the tutorial correctly, Celery subtask supports almost the same API as task, but has the additional advantage that it can be passed around to other functions or processes.
Clearly, if that was the case, Celery would have simply replaced tasks with subtasks instead of keeping both (e.g., the @app.task decorator would have converted a function to a subtask instead of to a task, etc.). So I must be misunderstanding something.
What can a task do that a subtask can't?
Celery API changed quite a bit; my question is specific to version 3.1 (currently, the latest).
Edit:
I know the docs say subtasks are intended to be called from other tasks. My question is what prevents Celery from getting rid of tasks completely and using subtasks everywhere? They seem to be strictly more flexible/powerful than tasks:
# tasks.py
from celery import Celery

app = Celery(backend='rpc://')

@app.task
def add(x, y):
    # just print out a log line for testing purposes
    print(x, y)

# client.py
from tasks import add

add_subtask = add.subtask()

# in this context, it seems the following two lines do the same thing
add.delay(2, 2)
add_subtask.delay(2, 2)

# when we need to pass argument to other tasks, we must use add_subtask
# so it seems add_subtask is strictly better than add
You will appreciate the difference once you start using complex workflows with Celery.
A signature() wraps the arguments, keyword arguments, and execution
options of a single task invocation in a way such that it can be
passed to functions or even serialized and sent across the wire.
Signatures are often nicknamed “subtasks” because they describe a task
to be called within a task.
Also:
subtask's are objects used to pass around the signature of a task invocation (for example to send it over the network)
A task is just a function definition wrapped with the decorator, while a subtask is a task with its parameters bound but not yet executed. You may transfer the subtask serialized over the network or, more commonly, call it within a group/chain/chord.
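A minimal sketch of where the distinction matters, reusing the add task from the question; the chain here is purely illustrative (this add returns nothing, so the second call simply receives None as its first argument):

# client_workflows.py -- hypothetical client script, reusing tasks.py from above
from celery import chain, group
from tasks import add

# A signature/subtask freezes the arguments without running anything yet,
# so it can be stored, serialized, or handed to a canvas primitive.
pending = add.s(2, 2)           # shorthand for add.subtask((2, 2))
pending.delay()                 # runs like add.delay(2, 2)

# Canvas primitives only accept signatures, never a plain add(...) call:
chain(add.s(2, 2), add.s(4))()  # the first task's return value is prepended to the second call's args
group(add.s(i, i) for i in range(3))()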
