Airflow dynamic tasks at runtime - python

Other questions about 'dynamic tasks' seem to address dynamic construction of a DAG at schedule or design time. I'm interested in dynamically adding tasks to a DAG during execution.
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.python_operator import PythonOperator
from datetime import datetime

dag = DAG('test_dag', description='a test',
          schedule_interval='0 0 * * *',
          start_date=datetime(2018, 1, 1),
          catchup=False)

def make_tasks():
    du1 = DummyOperator(task_id='dummy1', dag=dag)
    du2 = DummyOperator(task_id='dummy2', dag=dag)
    du3 = DummyOperator(task_id='dummy3', dag=dag)
    du1 >> du2 >> du3

p = PythonOperator(
    task_id='python_operator',
    dag=dag,
    python_callable=make_tasks)
This naive implementation doesn't seem to work - the dummy tasks never show up in the UI.
What's the correct way to add new operators to the DAG during execution? Is it possible?

It is not possible to modify the DAG during its execution (without a lot more work).
The dag = DAG(... definition is picked up in a loop by the scheduler, with the task 'python_operator' in it. That task instance gets scheduled in a dag run and executed by a worker or executor. Since DAG models in the Airflow DB are only updated by the scheduler, the added dummy tasks will not be persisted to the DAG nor scheduled to run; they are forgotten when the worker exits. You could copy all the code from the scheduler concerned with persisting and updating the model, but that would be undone the next time the scheduler visits the DAG file for parsing, which could happen once a minute, once a second, or faster, depending on how many other DAG files there are to parse.
Airflow actually wants each DAG to keep approximately the same layout between runs, and it wants to reload/parse DAG files constantly. So although you could write a DAG file that determines its tasks dynamically on each parse based on some external data (preferably cached in a file or pyc module, not network I/O like a DB lookup, which would slow down the whole scheduling loop for all the DAGs), it's not a good plan: your graph and tree views will get confusing, and scheduler parsing will be further taxed by that lookup.
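For illustration, a parse-time dynamic DAG driven by cached data might be sketched like this. This is only a sketch: the file path, DAG id, and task names are invented, and the layout-churn caveats above still apply.

```python
# Sketch: deciding tasks at parse time from cached external data.
# The path, DAG id, and task ids are invented for illustration.
# This code runs on every scheduler parse, so it must stay cheap
# (read a local file; avoid DB or network lookups here).
import json
from datetime import datetime

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator

dag = DAG('cached_dynamic_dag',
          schedule_interval='0 0 * * *',
          start_date=datetime(2018, 1, 1),
          catchup=False)

with open('/path/to/task_list.json') as f:  # hypothetical cached file
    task_names = json.load(f)               # e.g. ["a", "b", "c"]

previous = None
for name in task_names:
    task = DummyOperator(task_id='process_' + name, dag=dag)
    if previous is not None:
        previous >> task  # chain the tasks sequentially
    previous = task
```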
You could make the callable run each task…
def make_tasks(**context):
    du1 = DummyOperator(task_id='dummy1', dag=dag)
    du2 = DummyOperator(task_id='dummy2', dag=dag)
    du3 = DummyOperator(task_id='dummy3', dag=dag)
    du1.execute(context)
    du2.execute(context)
    du3.execute(context)

p = PythonOperator(
    task_id='python_operator',
    dag=dag,
    provide_context=True,
    python_callable=make_tasks)
But that's sequential, and you have to work out how to make them run in parallel in Python (use futures?), and if any raises an exception the whole task fails. It is also bound to one executor or worker, so it doesn't use Airflow's task distribution (Kubernetes, Mesos, Celery).
The other way to work with this is to add a fixed number of tasks (the maximum you will ever need) and use the callable(s) to short-circuit the unneeded tasks, or push arguments with XCom to each of them, changing their behavior at run time but not changing the DAG.
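A minimal sketch of that fixed-slot pattern, assuming a maximum of three work items; the DAG id, the get_work_items lookup, and the slot naming are all invented for illustration:

```python
# Sketch: a fixed maximum number of task slots, with ShortCircuitOperator
# skipping the slots that have no work at run time.
from datetime import datetime

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.python_operator import ShortCircuitOperator

dag = DAG('fixed_slots_dag',              # hypothetical DAG id
          schedule_interval='0 0 * * *',
          start_date=datetime(2018, 1, 1),
          catchup=False)

MAX_TASKS = 3  # the most tasks this DAG will ever need

def get_work_items():
    # Hypothetical runtime lookup (e.g. an XCom pull or an API call).
    return ['item_a', 'item_b']

def should_run(slot, **context):
    # If this returns False, ShortCircuitOperator skips its downstream tasks.
    return slot < len(get_work_items())

for slot in range(MAX_TASKS):
    check = ShortCircuitOperator(
        task_id='check_slot_%d' % slot,
        python_callable=should_run,
        op_kwargs={'slot': slot},
        provide_context=True,
        dag=dag)
    work = DummyOperator(task_id='work_slot_%d' % slot, dag=dag)
    check >> work
```

The DAG shape stays constant between runs, so the graph and tree views remain stable; only which slots do work changes.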

Regarding your code sample, you never call your function which registers your tasks in your DAG.
To have a kind of dynamic tasks, you can have a single operator which does something different depending on some state or you can have a handful of operators which can be skipped depending on the state, with a ShortCircuitOperator.

I appreciate all the work everybody has done here, as I have the same challenge of creating dynamically structured DAGs. I have made enough mistakes to know not to use software against its design. If I can't inspect the whole run in the UI, zoom in and out, and generally use Airflow's features, which are the main reason I use it anyway, then I might as well just write multiprocessing code inside a function and be done with it.
That all being said, my solution is to use a resource manager such as Redis locking: one DAG writes to this resource manager with data about what to run and how to run it, and another DAG (or DAGs) runs at certain intervals polling the resource manager, locking entries before running them and removing them on finish. This way I at least use Airflow as expected, even though its specifications don't exactly meet my needs, and I break the problem down into more definable chunks. The creative workarounds are against the design and not tested by the developers, who specifically say to have fixed, structured workflows. I can't ship workaround code that is untested and against the design unless I rewrite the core Airflow code and test it myself. I understand my solution brings complexity with locking and all that, but at least I know the boundaries of it.

Related

Is it possible to kill the previous DAG run if it's still running when it's time for the latest run?

Our Airflow is forced to interact with a company with a very poor system. It's not unusual for our DAG to get stuck waiting for a report that never actually gets completed. This DAG runs daily, pulling the same information, so when it's time for the next run it would be nice to just kill the last run and move on with the new one. I haven't found anything saying Airflow has a DAG argument that can achieve this. Is there a quick, easy setting for this behavior, or would it need to be done logically in the sensor that checks whether the report is complete?
If your DAG is scheduled daily, how about setting dagrun_timeout to 24 hours? I believe this should in effect kill the previous dag run around when it kicks off a new one. Related question about setting DAG timeouts.
Alternatively, you could either use a PythonOperator, define your own operator, or extend the report sensor you describe to kill the previous DagRun programmatically. I believe that this would look like...
Get the current dag run from the Airflow context
Get the previous dag run with dag_run.get_previous_dagrun()
Set the state on the previous dag run with prev_dag_run.set_state
My recommendation would be to set the timeout, given these two options. I agree that there is no specific kill_previous_run dag argument.
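Those three steps might be sketched roughly as follows for an Airflow 1.x task context. The task and function names are invented, the dag object is assumed to exist, and depending on the Airflow version a DB session may additionally be needed to persist the state change.

```python
# Sketch: a task that marks the previous DagRun failed if it is still running.
# kill_previous_run is an illustrative name, not an Airflow API.
from airflow.operators.python_operator import PythonOperator
from airflow.utils.state import State

def kill_previous_run(**context):
    dag_run = context['dag_run']              # current DagRun from the context
    prev_run = dag_run.get_previous_dagrun()  # step 2
    if prev_run is not None and prev_run.get_state() == State.RUNNING:
        prev_run.set_state(State.FAILED)      # step 3: kill the stuck run

kill_task = PythonOperator(
    task_id='kill_previous_run',
    python_callable=kill_previous_run,
    provide_context=True,
    dag=dag)  # assumes an existing `dag` object
```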

Automatically generating ExternalTaskSensor where execution date depends only on DAG id

Help me crack this one. I am looking for an elegant solution for dynamically generating ExternalTaskSensor tasks in Airflow with unique execution_date_fn functions while avoiding problems arising from function scopes.
I am trying to create a DAG that depends on several other DAGs, in the sense that they shouldn't run simultaneously. I am using ExternalTaskSensor operators, best illustrated visually:
Sensor tasks, e.g. sense_dag_1, rely on the DagRun object to find the latest execution_date of e.g. dag_1. This method, called get_execution_date in my code, is created and passed to the ExternalTaskSensor via the execution_date_fn kwarg. This works well when "manually" creating tasks, as shown in this example:
sensor = ExternalTaskSensor(
    task_id='sense_dag_1',
    external_dag_id='dag_1',
    execution_date_fn=lambda dt: get_execution_date('dag_1'),
    dag=dag
)
Note that a little workaround is used, providing get_execution_date to ExternalTaskSensor as a lambda function with input dt. This solution, however, faces a problem when automatically generating multiple ExternalTaskSensor tasks for a list of DAG names, as seen in naive example:
for dag_id in ['dag_1', 'dag_2']:
    sensor = ExternalTaskSensor(
        task_id='sense_' + dag_id,
        external_dag_id=dag_id,
        execution_date_fn=lambda dt: get_execution_date(dag_id),
        dag=dag
    )
When creating tasks this way, all sensors end up calling get_execution_date with the same dag_id, the last one in the list, so they all resolve to the same execution_date. I am aware that this problem is to do with closures and variable scope, as described in Python's official documentation.
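The late-binding behavior can be reproduced without Airflow at all. A minimal sketch, which also shows the common default-argument workaround:

```python
# Minimal demonstration of Python's late-binding closures, independent of Airflow.

# Late binding: each lambda captures the *variable* dag_id, not its value,
# so after the loop all of them see the last value assigned.
late = [lambda dt: dag_id for dag_id in ['dag_1', 'dag_2']]
print([f(None) for f in late])   # ['dag_2', 'dag_2']

# Fix: bind the current value as a default argument at definition time.
bound = [lambda dt, dag_id=dag_id: dag_id for dag_id in ['dag_1', 'dag_2']]
print([f(None) for f in bound])  # ['dag_1', 'dag_2']
```

Binding dag_id as a default argument is an alternative to wrapping the sensor creation in a function, since default values are evaluated when the lambda is defined rather than when it is called.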
We end up with two limiting factors:
execution_date_fn only accepts dt and **context as arguments
Lambda functions (and functions in general) look up free variables at call time, so values must be bound in the function's own scope (for example via default arguments) in order to capture the intended value.
QUESTION: How do I dynamically generate ExternalTaskSensor tasks where execution_date only depends on dag_id?
I have one solution in mind: adding tasks upstream of sensors that get the execution_date and push it to context. However, in this case order of task execution becomes important, making the end DAG look something like this:
This is not aesthetically pleasing and I am looking for something better
I remembered a working solution: move sensor instantiation to a separate function. This way the variable scope is separated for loop and task creation. Code example:
def create_sensor(dag_id):
    task = ExternalTaskSensor(
        task_id='sense_' + dag_id,
        external_dag_id=dag_id,
        execution_date_fn=lambda dt: get_execution_date(dag_id),
        dag=dag
    )
    return task

for dag_id in ['dag_1', 'dag_2']:
    sensor = create_sensor(dag_id)
More on the topic here: Unexpected Airflow behaviour in dynamic task generation

Triggering the external dag using another dag in Airflow

I have a list of tasks which call different dags from a master dag. I'm using the TriggerDagRunOperator to accomplish this, but I'm facing a few issues.
TriggerDagRunOperator doesn't wait for completion of the external dag; it triggers the next task immediately. I want it to wait until completion, with the next task triggering based on the status. I came across ExternalTaskSensor, but it makes the process complicated. Is there any other solution to fix this?
If I trigger the master dag again, I want the task to restart from where it failed. Right now it's not restarting, though it will for a time-based schedule.
.. I want that to wait until completion .. Came across
ExternalTaskSensor. It is making the process complicated ..
I'm unaware of any other way to achieve this. I myself did this the same way.
If I trigger the master dag again, I want the task to restart from
where it is failed...
This requirement of yours goes against the principle of idempotency that Airflow demands. I'd suggest you try to rework your jobs to incorporate idempotency (for instance, you need idempotency anyway in case of retries). Meanwhile you can take inspiration from some people and try to achieve something similar (but it will be pretty complicated).
With Airflow 2.0.1, the triggering dag can be made to wait for completion of target dag with parameter wait_for_completion
ref: here
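Assuming Airflow 2.0.1 or later, a minimal sketch might look like this; the DAG ids are invented and the dag object is assumed to exist:

```python
# Sketch: triggering a child DAG and blocking until it finishes (Airflow 2.x).
from airflow.operators.trigger_dagrun import TriggerDagRunOperator

trigger = TriggerDagRunOperator(
    task_id='trigger_child',
    trigger_dag_id='child_dag',   # hypothetical target DAG id
    wait_for_completion=True,     # block until the triggered run finishes
    poke_interval=30,             # seconds between state checks while waiting
    dag=dag)                      # assumes an existing `dag` object
```

Downstream tasks of this operator then only run after the child DAG run has completed.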

Airflow: how to specify condition for hourly workflow like trigger only if currently same instance is not running?

I have created a workflow (containing a few tasks) with hourly execution. The workflow should be triggered only if another instance of the workflow is not running at the same time; if one is running, the workflow execution should be skipped for that hour.
I tried "depends_on_past" but couldn't get it to work.
Set max_active_runs on your DAG to 1 and also set catchup to False.
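A minimal sketch of that suggestion, with an invented DAG id:

```python
# Sketch: at most one concurrent run of an hourly DAG.
from datetime import datetime

from airflow import DAG

dag = DAG(
    'hourly_no_overlap',    # hypothetical DAG id
    schedule_interval='@hourly',
    start_date=datetime(2018, 1, 1),
    max_active_runs=1,      # never start a run while another is active
    catchup=False)          # don't backfill missed hours
```

Note that with max_active_runs=1 the next run queues until the previous one finishes rather than being skipped; if you need a true skip, you would have to add your own check (for example with a ShortCircuitOperator).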
From the official Airflow documentation for trigger rules:
The depends_on_past (boolean), when set to True, keeps a task from getting triggered if the previous schedule for the task hasn’t succeeded.
It will work if you use it in the definition of the task. You can pair it with wait_for_downstream=True as well to guarantee that the new run's instance will not begin until the last run's instance of the task has completed execution.
task_depends = DummyOperator(
    task_id='task_depend',
    dag=dag,
    depends_on_past=True)
However another way to work around this assuming that you only need the latest run to work is using the Latest Run Only concept:
Standard workflow behavior involves running a series of tasks for a particular date/time range. Some workflows, however, perform tasks that are independent of run time but need to be run on a schedule, much like a standard cron job. In these cases, backfills or running jobs missed during a pause just wastes CPU cycles.
For situations like this, you can use the LatestOnlyOperator to skip tasks that are not being run during the most recent scheduled run for a DAG. The LatestOnlyOperator skips all immediate downstream tasks, and itself, if the time right now is not between its execution_time and the next scheduled execution_time.

Airflow Scheduler creating PID for same dag to generate tasks every time

I am using the LocalExecutor. I have a situation where unique dags get generated for each request id, e.g. 1.py, 2.py.
Assume 1.py has two tasks and 2.py has three. I would also receive more dags periodically, e.g. 3.py, 4.py, etc.
Is there any problem with creating a dag for every new id/request id?
I have observed that Scheduler keeps giving this log.
Started a process (PID: 92186) to generate tasks for /Users/nshar141/airflow/dags/3.py - logging into /Users/nshar141/airflow/logs/scheduler/2018-05-07/3.py.log
My question here is why the scheduler keeps spawning separate PIDs to generate tasks. I tried changing different concurrency and parallelism parameters in the config, but the scheduler seems to execute that statement every time for every dag present in the dags folder.
I am attaching my dag definition. I want the dag to run as soon as it is created. What values should I give start_date and schedule_interval?
dag = DAG('3', description='Sample DAG',
          schedule_interval='@once',
          start_date=datetime(2018, 5, 7),
          catchup=False)
Since I need to generate dags dynamically with unique dag ids and place them in the dags folder, my concern is that the scheduler would generate too many process IDs for every dag in the folder, including dags that have already been executed.
Why do you want to create a new DAG for every request? I think that the most appropriate way would be to store requests and have a single DAG perform your logic for multiple requests at the same time, in a batch fashion. You can run your DAG very often if you want.
You seem to want tasks to be executed as soon as possible. If you're interested in near real-time with a lot of throughput, Airflow may not be appropriate and you'll want to use a message queue instead.
