How to create a looping task that runs multiple times in Airflow? - python

I need to do this:
for i in range(5):
    step1 = PythonOperator(
        ....
    )

# dependencies
step1 >> step1 >> step1 >> step1 >> step1

If you want task i to depend on task i-1, here is a simple solution that keeps a reference to the previous task and uses it to set the dependency for the current task:
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator


def job(i):
    print(f"Executing task {i}")


with DAG(
    "SO_74208401",
    start_date=datetime.now() - timedelta(days=2),
    schedule=None,
) as dag:
    previous_task = None
    for i in range(5):
        task = PythonOperator(
            task_id=f"task_{i}",
            python_callable=job,
            op_args=[i],
        )
        if previous_task:
            previous_task >> task
        previous_task = task
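If the tasks only ever run strictly one after another, the previous_task bookkeeping can also be replaced with Airflow's chain helper. A minimal sketch, assuming the same job callable as above (the DAG id here is just illustrative):

from airflow.models.baseoperator import chain

with DAG(
    "SO_74208401_chain",
    start_date=datetime.now() - timedelta(days=2),
    schedule=None,
) as dag:
    tasks = [
        PythonOperator(task_id=f"task_{i}", python_callable=job, op_args=[i])
        for i in range(5)
    ]
    # Equivalent to tasks[0] >> tasks[1] >> tasks[2] >> tasks[3] >> tasks[4]
    chain(*tasks)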

Related

Airflow 1.10.15 dynamic task creation

I'm trying to create a DAG that will spawn N tasks depending on the result of the previous task. The problem is that I cannot use the value returned by the previous task (via XCom) outside of an Operator.
Is there a way to make this work?
with DAG(
    "spawn_dag",
    start_date=datetime(2022, 1, 1)
) as dag:

    # Calculates the number of tasks based on some previous task run
    count_number_of_tasks = PythonOperator(
        task_id='count_number_of_tasks',
        python_callable=count_tasks_function,
        dag=dag,
        xcom_push=True,
        provide_context=True
    )

    # Generates tasks and chains them
    def dynamic_spawn_func(parent_dag_name, child_dag_name, start_date, args, **kwargs):
        subdag = DAG(
            dag_id=f"{parent_dag_name}.{child_dag_name}",
            default_args=args,
            start_date=start_date,
            schedule_interval=None
        )

        # Here is the problem: the following variable cannot be used in a loop to spawn tasks
        number_of_tasks = kwargs['ti'].xcom_pull(dag_id='spawn_dag', task_ids='count_number_of_tasks')

        # This is where that variable is used
        for j in range(number_of_tasks):
            task = PythonOperator(
                task_id='processor_' + str(j),
                python_callable=some_func,
                op_kwargs={"val": j},
                dag=subdag,
                provide_context=True)

            task_2 = PythonOperator(
                task_id='wait_for_processor_' + str(j),
                python_callable=some_func,
                op_kwargs={"val": j},
                dag=subdag,
                provide_context=True)

            task >> task_2

        return subdag

    dynamic_spawn_op = SubDagOperator(
        task_id='dynamic_spawn',
        subdag=dynamic_spawn_func("spawn_dag", "dynamic_spawn", dag.start_date, args=default_args),
        dag=dag,
        provide_context=True
    )

    generate >> count_number_of_tasks >> dynamic_spawn_op
No. Migrate to Airflow 2.3+. Airflow 1.10 has been end-of-life for two years now, and you are shooting yourself in the foot by not upgrading. Not only do you lack new features (like Dynamic Task Mapping), you also make yourself extremely vulnerable to potential security problems (dozens of CVEs have been fixed since 1.10), and you put yourself in this position:
https://xkcd.com/979/
because you are one of the last people in the world still running Airflow 1.10.
Not upgrading at this stage is simply the wrong decision, because not upgrading costs you a LOT more than the migration would. Multiple times more.
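For reference, the pattern asked about here is exactly what Dynamic Task Mapping covers in Airflow 2.3+. A minimal sketch using the TaskFlow API (the DAG id, task names and the hard-coded count are illustrative, not taken from the question; the schedule argument assumes Airflow 2.4+):

from datetime import datetime

from airflow.decorators import dag, task


@dag(start_date=datetime(2022, 1, 1), schedule=None, catchup=False)
def spawn_dag_mapped():

    @task
    def count_tasks():
        # Return one element per downstream task instance to create.
        n = 4  # placeholder for the real counting logic
        return list(range(n))

    @task
    def processor(val):
        print(f"processing {val}")

    # One 'processor' task instance is mapped per element of the returned list.
    processor.expand(val=count_tasks())


spawn_dag_mapped()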

conditionally_trigger for TriggerDagRunOperator

I have 2 DAGs: dag_a and dag_b (dag_a -> dag_b)
After dag_a is executed, TriggerDagRunOperator is called, which starts dag_b. The problem is, when dag_b is off (paused), dag_a's TriggerDagRunOperator creates scheduled runs in dag_b that queue up for as long as dag_a is running. After turning dag_b back ON, the execution of tasks from the queue begins.
I'm trying to find a solution for TriggerDagRunOperator, namely a conditionally_trigger function that would skip the execution of the TriggerDagRunOperator task if dag_b is paused (OFF). How can I do this?
You can use ShortCircuitOperator to execute or skip the downstream dag_b trigger. Then use the Airflow REST API (or the shell/CLI) to figure out whether dag_b is paused.
dag_a = TriggerDagRunOperator(
    trigger_dag_id='dag_a',
    ...
)

pause_check = ShortCircuitOperator(
    task_id='pause_check',
    python_callable=is_dag_paused,
    op_kwargs={
        'dag_id': 'dag_b'
    }
)

dag_b = TriggerDagRunOperator(
    trigger_dag_id='dag_b',
    ...
)

dag_a >> pause_check >> dag_b
And the is_dag_paused function can look like this (here I use the REST API):
def is_dag_paused(**kwargs):
    import requests
    from requests.auth import HTTPBasicAuth

    dag_id = kwargs['dag_id']
    res = requests.get(f'http://{airflow_host}/api/v1/dags/{dag_id}/details',
                       auth=HTTPBasicAuth('username', 'password'))  # The auth method could be different for you.
    if res.status_code == 200:
        rjson = res.json()
        # If you return True, the downstream tasks will be executed;
        # if False, they will be skipped.
        return not rjson['is_paused']
    else:
        print('Error: ', res)
        exit(1)
import airflow.settings
from airflow.models import DagModel


def check_status_dag(*op_args):
    session = airflow.settings.Session()
    qry = session.query(DagModel).filter(DagModel.dag_id == op_args[0])
    if not qry.value(DagModel.is_paused):
        return op_args[1]
    else:
        return op_args[2]
Here check_status_dag is the callable that decides which branch to execute next: op_args[0] is the dag_id of the DAG whose pause status is being checked, and op_args[1] and op_args[2] are the task_ids to follow, in line with how BranchPythonOperator works.
start = DummyOperator(
    task_id='start',
    dag=dag
)

check_dag_B = BranchPythonOperator(
    task_id='check_dag_B',
    python_callable=check_status_dag,
    op_args=['dag_B', 'trigger_dag_B', 'skip_trigger_dag_B'],
    trigger_rule='all_done',
    dag=dag
)

trigger_dag_B = TriggerDagRunOperator(
    task_id='trigger_dag_B',
    trigger_dag_id='dag_B',
    dag=dag
)

skip_trigger_dag_B = DummyOperator(
    task_id='skip_trigger_dag_B',
    dag=dag
)

finish = DummyOperator(
    task_id='finish',
    trigger_rule='all_done',
    dag=dag
)

start >> check_dag_B >> [trigger_dag_B, skip_trigger_dag_B] >> finish  # or continue working

How to use TaskGroup and PythonBranchOperator together?

There is a trigger_task that starts the DAG at a certain time. If the condition is met, it returns start_tasks, a group of tasks that executes sequentially; otherwise it returns the stop_tasks task, which stops the execution of the entire DAG. The problem is that BranchPythonOperator branches incorrectly when the target is a TaskGroup rather than a single task.
It lines up like this:
trigger_task >> (all tasks)
It should be like this:
trigger_task >> stop_tasks or start_tasks (depending on the output of trigger_task; inside start_tasks all the tasks from the group run sequentially)
Below is the code
trigger_task = BranchPythonOperator(
    task_id='task_trigger',
    python_callable=task_trigger,
    dag=dag
)

stop_tasks = PythonOperator(
    task_id='stop_tasks',
    python_callable=stop_func,
    dag=dag
)

with TaskGroup('start_tasks', dag=dag) as start_tasks:
    get_1c_saldo_contractor = PythonOperator(
        task_id='get_1c_saldo_contractor',
        python_callable=get_1c_saldo_contractor,
        dag=dag
    )

    sql_sensor_dm_partner_balance = SqlSensor(
        task_id='sensor_task_dm_partner_balance',
        poke_interval=60,
        conn_id='airflow_db',
        sql=sql_query.format('DM_PartnerBalance', dt),
        on_failure_callback=custom_failure_function,
        on_success_callback=custom_success_function,
        dag=dag)

    saldo_comparison_task = PythonOperator(
        task_id='saldo_comparison',
        python_callable=saldo_comparison,
        dag=dag
    )

trigger_task >> [stop_tasks, start_tasks]


def task_trigger():
    if datetime.now().strftime('%Y-%m-%d %H:%M') == '2022-08-18 17:03':
        return 'start_tasks'
    else:
        return 'stop_tasks'


def stop_func():
    return 0
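No answer is recorded here, but two Airflow details are relevant: BranchPythonOperator can only follow task_ids that are directly downstream of it, and tasks inside a TaskGroup are addressed with the group id as a prefix. A minimal sketch of how the branch callable and the group chaining could look under that assumption (task names reused from the question; this is not a confirmed fix from the original thread):

# Chain the group members so they run sequentially and the group has a single root task.
with TaskGroup('start_tasks', dag=dag) as start_tasks:
    # ... operator definitions exactly as in the question ...
    get_1c_saldo_contractor >> sql_sensor_dm_partner_balance >> saldo_comparison_task


def task_trigger():
    if datetime.now().strftime('%Y-%m-%d %H:%M') == '2022-08-18 17:03':
        # Tasks inside a TaskGroup are addressed as '<group_id>.<task_id>'.
        return 'start_tasks.get_1c_saldo_contractor'
    return 'stop_tasks'


trigger_task >> [stop_tasks, start_tasks]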

How do I trigger a backfill with TriggerDagRunOperator?

I have a requirement where the DAG triggered by TriggerDagRunOperator needs to execute a backfill, not just a single run for the same execution date.
The TriggerDagRunOperator is set up as follows:
trigger1 = TriggerDagRunOperator(
    task_id='trigger1',
    trigger_dag_id='target_dag',
    conf={'message': 'Starting target 1'},
    reset_dag_run=True,
    wait_for_completion=True
)
Target dag is basically:
starting_date = datetime.strptime("2021-11-15", "%Y-%m-%d")

with DAG("target_dag", default_args=default_args, schedule_interval='@daily', max_active_runs=10) as dag:
    start = DummyOperator(
        task_id='start'
    )

    t1 = PythonOperator(
        task_id="t1",
        provide_context=True,
        python_callable=t1
    )

    finish = DummyOperator(
        task_id='finish'
    )

    start >> t1 >> finish
target_dag is only executing for today's date and not backfilling.
How do I force it to backfill regardless of past DAG runs? I'm using Airflow 2.0.
This might be late now, but I have come up with two different solutions.
The first one (and probably the better one) would be as follows:
from airflow.operators.latest_only_operator import LatestOnlyOperator
t1 = LatestOnlyOperator(task_id="ensure_backfill_complete")
I was stuck on a similar conundrum, and this suddenly popped into my head.
The second one is basically wrapping the operator in a loop within a Python function, which is honestly terrible, and yet it seems to work.
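For completeness, a rough sketch of what that second, loop-based approach could look like: one TriggerDagRunOperator per date in the backfill window, chained so the runs execute in order (the five-day window and task ids are illustrative, not taken from the original answer):

from datetime import datetime, timedelta

from airflow.operators.trigger_dagrun import TriggerDagRunOperator

backfill_start = datetime(2021, 11, 15)
previous = None
for offset in range(5):  # illustrative: five days starting at backfill_start
    run_date = backfill_start + timedelta(days=offset)
    trigger = TriggerDagRunOperator(
        task_id=f"trigger_target_{run_date:%Y_%m_%d}",
        trigger_dag_id="target_dag",
        execution_date=run_date,  # each trigger creates a run for its own date
        reset_dag_run=True,
        wait_for_completion=True,
    )
    if previous:
        previous >> trigger  # run the backfill dates sequentially
    previous = trigger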

Apache Airflow Timeout error when dynamically creating tasks in DAG

In my old DAG, I created tasks like so:
start_task = DummyOperator(task_id = "start_task")
t1 = PythonOperator(task_id = "t1", python_callable = get_t1)
t2 = PythonOperator(task_id = "t2", python_callable = get_t2)
t3 = PythonOperator(task_id = "t3", python_callable = get_t3)
t4 = PythonOperator(task_id = "t4", python_callable = get_t4)
t5 = PythonOperator(task_id = "t5", python_callable = get_t5)
t6 = PythonOperator(task_id = "t6", python_callable = get_t6)
t7 = PythonOperator(task_id = "t7", python_callable = get_t7)
t8 = PythonOperator(task_id = "t8", python_callable = get_t8)
t9 = PythonOperator(task_id = "t9", python_callable = get_t9)
t10 = PythonOperator(task_id = "t10", python_callable = get_t10)
t11 = PythonOperator(task_id = "t11", python_callable = get_t11)
end_task = DummyOperator(task_id = "end_task")
start_task >> [t1, t2, t3, t4, t5, t6, t7, t8, t9, t10, t11] >> end_task
Each of these tasks runs a different query, and each task is run concurrently. I have revised my code because much of it was redundant and could be put inside functions. In my new code, I also attempted to create tasks dynamically by reading in the queries and metadata for each task from a .json.
New Code:
loaded_info = load_info()  # function call to load .json data into a list

start_task = DummyOperator(task_id="start_task")
end_task = DummyOperator(task_id="end_task")

tasks = []  # empty list to append tasks to in the for loop

for x in loaded_info:
    qce = QCError(**x)
    id = qce.column
    task = PythonOperator(task_id=id, python_callable=create_task(qce))
    tasks.append(task)

start_task >> tasks >> end_task
This new code appears fine, however it prevents me from running airflow initdb. After running the command, the terminal waits and never finishes until I finally CTRL+C to kill it, and then eventually gives me an error after the kill:
raise AirflowTaskTimeout(self.error_message)
pandas.io.sql.DatabaseError: Execution failed on sql 'select ..., count(*) as frequency from ... where ... <> all (array['...', '...', etc.]) or ... is null group by ... order by ... asc': Timeout, PID: 315
(Note: the query in the error statement above is just the first query in the .json). Considering I never had this error with the old DAG, I'm assuming this is due to the dynamic task creation, but I need help identifying what exactly is causing this error.
What I have tried:
Running each query individually in Airflow Webserver Admin Ad-Hoc (they all work fine)
Creating a test script to run locally and output the contents of the .json to ensure everything is correctly formatted, etc.
I finally managed to get airflow initdb to run (but I have not yet tested my job, and will update on its status later).
It turns out that when defining a PythonOperator, you cannot call the callable with an argument the way I was doing:
task = PythonOperator(task_id = id, python_callable = create_task(qce))
Passing qce into create_task this way is what was causing the error: the function gets executed while the DAG file is being parsed instead of being handed over as a callable. To pass arguments into your tasks, see here.
For those of you who want to see the fix for my exact case, I have this:
with DAG("dva_event_analysis_dag", default_args = DEFAULT_ARGS, schedule_interval = None, catchup = False) as dag:
loaded_info = load_info()
start_task = DummyOperator(task_id = "start_task")
end_task = DummyOperator(task_id = "end_task")
tasks = []
for x in loaded_info:
id = x["column"]
task = PythonOperator(task_id = id, provide_context = True, python_callable = create_task, op_kwargs = x)
tasks.append(task)
start_task >> tasks >> end_task
Update (7/03/2019): Job status is successful. This was indeed the fix to my error. Hopefully this helps out others with similar issues.
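As a side note (not part of the original answer): if the callable really needs a pre-built object such as qce rather than plain keyword arguments, functools.partial produces a deferred callable without executing anything at parse time. A minimal sketch reusing the names from the question:

from functools import partial

for x in loaded_info:
    qce = QCError(**x)
    task = PythonOperator(
        task_id=qce.column,
        # partial() defers the call: create_task(qce) only runs when the task executes
        python_callable=partial(create_task, qce),
    )
    tasks.append(task)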
