How to use TaskGroup and BranchPythonOperator together? - python

There is a trigger_task that starts the DAG at a certain time: if the time condition is met, it returns start_tasks, a group of tasks that should run sequentially; otherwise it returns the stop_tasks task, which stops the execution of the entire DAG. The problem is that BranchPythonOperator branches incorrectly when the target is a TaskGroup rather than a single task.
The dependencies line up like this:
trigger_task >> (all tasks)
They should look like this:
trigger_task >> stop_tasks or start_tasks (depending on the output of trigger_task, with all tasks in start_tasks running sequentially)
Below is the code
trigger_task = BranchPythonOperator(
    task_id='task_trigger',
    python_callable=task_trigger,
    dag=dag
)

stop_tasks = PythonOperator(
    task_id='stop_tasks',
    python_callable=stop_func,
    dag=dag
)

with TaskGroup('start_tasks', dag=dag) as start_tasks:
    get_1c_saldo_contractor = PythonOperator(
        task_id='get_1c_saldo_contractor',
        python_callable=get_1c_saldo_contractor,
        dag=dag
    )
    sql_sensor_dm_partner_balance = SqlSensor(
        task_id='sensor_task_dm_partner_balance',
        poke_interval=60,
        conn_id='airflow_db',
        sql=sql_query.format('DM_PartnerBalance', dt),
        on_failure_callback=custom_failure_function,
        on_success_callback=custom_success_function,
        dag=dag
    )
    saldo_comparison_task = PythonOperator(
        task_id='saldo_comparison',
        python_callable=saldo_comparison,
        dag=dag
    )

trigger_task >> [stop_tasks, start_tasks]

def task_trigger():
    if datetime.now().strftime('%Y-%m-%d %H:%M') == '2022-08-18 17:03':
        return 'start_tasks'
    else:
        return 'stop_tasks'

def stop_func():
    return 0
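Note that BranchPythonOperator selects the branch by task_id, and tasks inside a TaskGroup get ids prefixed with the group id, so returning the bare group name does not match any task. Below is a minimal, self-contained sketch of one common workaround, assuming Airflow 2.x; the callables are hypothetical placeholders, and the chaining inside the group is also an assumption (the snippet above never sets it):

from datetime import datetime
from airflow import DAG
from airflow.operators.python import BranchPythonOperator, PythonOperator
from airflow.utils.task_group import TaskGroup

def task_trigger():
    # Return the fully qualified id '<group_id>.<task_id>' of the first task
    # inside the group, not the group name itself.
    if datetime.now().strftime('%Y-%m-%d %H:%M') == '2022-08-18 17:03':
        return 'start_tasks.get_1c_saldo_contractor'
    return 'stop_tasks'

with DAG('branch_into_taskgroup', start_date=datetime(2022, 8, 1),
         schedule_interval=None, catchup=False) as dag:
    trigger_task = BranchPythonOperator(task_id='task_trigger',
                                        python_callable=task_trigger)
    stop_tasks = PythonOperator(task_id='stop_tasks', python_callable=lambda: 0)

    with TaskGroup('start_tasks') as start_tasks:
        t1 = PythonOperator(task_id='get_1c_saldo_contractor',
                            python_callable=lambda: None)
        t2 = PythonOperator(task_id='saldo_comparison',
                            python_callable=lambda: None)
        # chain the group's tasks so they actually run sequentially
        t1 >> t2

    # only the group's first task is a direct downstream of the branch,
    # so the branch decision cleanly picks either stop_tasks or the group
    trigger_task >> [stop_tasks, start_tasks]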

Related

Airflow 1.10.15 dynamic task creation

I'm trying to create a DAG that will spawn N tasks depending on the result of the previous task. The problem is that I cannot use the value returned from the previous task (via XCom) outside of an operator.
Is there a way to make this work?
with DAG(
    "spawn_dag",
    start_date=datetime(2022, 1, 1)
) as dag:

    # Calculates the number of tasks based on some previous task run
    count_number_of_tasks = PythonOperator(
        task_id='count_number_of_tasks',
        python_callable=count_tasks_function,
        dag=dag,
        xcom_push=True,
        provide_context=True
    )

    # Generates tasks and chains them
    def dynamic_spawn_func(parent_dag_name, child_dag_name, start_date, args, **kwargs):
        subdag = DAG(
            dag_id=f"{parent_dag_name}.{child_dag_name}",
            default_args=args,
            start_date=start_date,
            schedule_interval=None
        )
        # Here is the problem: the following variable cannot be used in a loop to spawn tasks
        number_of_tasks = kwargs['ti'].xcom_pull(dag_id='spawn_dag', task_ids='count_number_of_tasks')
        # This is where that variable is used
        for j in range(number_of_tasks):
            task = PythonOperator(
                task_id='processor_' + str(j),
                python_callable=some_func,
                op_kwargs={"val": j},
                dag=subdag,
                provide_context=True)
            task_2 = PythonOperator(
                task_id='wait_for_processor_' + str(j),
                python_callable=some_func,
                op_kwargs={"val": j},
                dag=subdag,
                provide_context=True)
            task >> task_2
        return subdag

    dynamic_spawn_op = SubDagOperator(
        task_id='dynamic_spawn',
        subdag=dynamic_spawn_func("spawn_dag", "dynamic_spawn", dag.start_date, args=default_args),
        dag=dag,
        provide_context=True
    )

    generate >> count_number_of_tasks >> dynamic_spawn_op
No. Migrate to Airflow 2.3+. Airflow 1.10 has been end-of-life for two years now and you are shooting yourself in the foot by not upgrading. Not only do you lack new features (like Dynamic Task Mapping) and make yourself vulnerable to potential security problems (there have been tens of CVEs fixed since 1.10), you also put yourself in this position:
https://xkcd.com/979/
because you are one of the last people in the world running Airflow 1.10.
Not upgrading at this stage is just a very wrong decision, because not upgrading costs you a lot more than the migration does. Multiple times more.
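For reference, with Dynamic Task Mapping in Airflow 2.3+ the same pattern looks roughly like this. This is a minimal sketch with hypothetical callables standing in for the asker's logic:

from datetime import datetime
from airflow.decorators import dag, task

@dag(start_date=datetime(2022, 1, 1), schedule_interval=None, catchup=False)
def spawn_dag_mapped():

    @task
    def count_number_of_tasks():
        # hypothetical stand-in: return one work item per task to spawn
        return list(range(4))

    @task
    def processor(val):
        print(f"processing {val}")

    # expand() creates one mapped task instance per element returned upstream,
    # so the number of tasks is decided at runtime from the XCom value
    processor.expand(val=count_number_of_tasks())

spawn_dag_mapped()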

conditionally_trigger for TriggerDagRunOperator

I have 2 DAGs: dag_a and dag_b (dag_a -> dag_b)
After dag_a is executed, TriggerDagRunOperator is called, which starts dag_b. The problem is, when dag_b is off (paused), dag_a's TriggerDagRunOperator creates scheduled runs in dag_b that queue up for as long as dag_a is running. After turning dag_b back ON, the execution of tasks from the queue begins.
I'm trying to find a solution for TriggerDagRunOperator, namely a conditionally_trigger function that would skip the TriggerDagRunOperator task when dag_b is paused (OFF). How can I do this?
You can use ShortCircuitOperator to execute or skip the downstream dag_b trigger. Then use the Airflow REST API (or the shell/CLI) to figure out whether dag_b is paused or not.
dag_a = TriggerDagRunOperator(
    trigger_dag_id='dag_a',
    ...
)
pause_check = ShortCircuitOperator(
    task_id='pause_check',
    python_callable=is_dag_paused,
    op_kwargs={
        'dag_id': 'dag_b'
    }
)
dag_b = TriggerDagRunOperator(
    trigger_dag_id='dag_b',
    ...
)
dag_a >> pause_check >> dag_b
and the is_dag_paused function can look like this (here I use the REST API):
def is_dag_paused(**kwargs):
    import requests
    from requests.auth import HTTPBasicAuth

    dag_id = kwargs['dag_id']
    res = requests.get(f'http://{airflow_host}/api/v1/dags/{dag_id}/details',
                       auth=HTTPBasicAuth('username', 'password'))  # The auth method could be different for you.
    if res.status_code == 200:
        rjson = res.json()
        # if you return True, the downstream tasks will be executed
        # if False, they will be skipped
        return not rjson['is_paused']
    else:
        print('Error: ', res)
        exit(1)
import airflow.settings
from airflow.models import DagModel

def check_status_dag(*op_args):
    session = airflow.settings.Session()
    qry = session.query(DagModel).filter(DagModel.dag_id == op_args[0])
    if not qry.value(DagModel.is_paused):
        return op_args[1]
    else:
        return op_args[2]
Here check_status_dag is the callable that decides which branch to execute: op_args[0] is the dag_id of the DAG whose pause status is checked, and op_args[1] and op_args[2] are the task ids returned according to the BranchPythonOperator logic.
start = DummyOperator(
    task_id='start',
    dag=dag
)
check_dag_B = BranchPythonOperator(
    task_id='check_dag_B',
    python_callable=check_status_dag,
    op_args=['dag_B', 'trigger_dag_B', 'skip_trigger_dag_B'],
    trigger_rule='all_done',
    dag=dag
)
trigger_dag_B = TriggerDagRunOperator(
    task_id='trigger_dag_B',
    trigger_dag_id='dag_B',
    dag=dag
)
skip_trigger_dag_B = DummyOperator(
    task_id='skip_trigger_dag_B',
    dag=dag
)
finish = DummyOperator(
    task_id='finish',
    trigger_rule='all_done',
    dag=dag
)

start >> check_dag_B >> [trigger_dag_B, skip_trigger_dag_B] >> finish  # or continue working

How to create a looping task to run multiple times in Airflow?

I need to do this:
for i in range(5):
    step1 = PythonOperator(
        ....
    )

# dependencies
step1 >> step1 >> step1 >> step1 >> step1
If you want task i to depend on task i-1, here is a simple solution that keeps a reference to the previous task and uses it to set the dependency for the current task:
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

def job(i):
    print(f"Executing task {i}")

with DAG(
    "SO_74208401",
    start_date=datetime.now() - timedelta(days=2),
    schedule=None,
) as dag:
    previous_task = None
    for i in range(5):
        task = PythonOperator(
            task_id=f"task_{i}",
            python_callable=job,
            op_args=[i]
        )
        if previous_task:
            previous_task >> task
        previous_task = task
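If you prefer not to track the previous task by hand, a roughly equivalent sketch (assuming Airflow 2.x) uses the chain() helper to wire the tasks in order:

from airflow.models.baseoperator import chain

# inside the same `with DAG(...)` block as above
tasks = [
    PythonOperator(task_id=f"task_{i}", python_callable=job, op_args=[i])
    for i in range(5)
]
# chain(t0, t1, ..., t4) sets t0 >> t1 >> ... >> t4
chain(*tasks)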

How do I trigger a backfill with TriggerDagRunOperator?

I have a requirement where I need the DAG triggered by TriggerDagRunOperator to execute a backfill, not just a run for the same execution date.
The TriggerDagRunOperator is set up as follows:
trigger1 = TriggerDagRunOperator(
    task_id='trigger1',
    trigger_dag_id='target_dag',
    conf={'message': 'Starting target 1'},
    reset_dag_run=True,
    wait_for_completion=True
)
Target dag is basically:
starting_date = datetime.strptime("2021-11-15", "%Y-%m-%d")

with DAG("target_dag", default_args=default_args, schedule_interval='@daily', max_active_runs=10) as dag:
    start = DummyOperator(
        task_id='start'
    )
    t1 = PythonOperator(
        task_id="t1",
        provide_context=True,
        python_callable=t1
    )
    finish = DummyOperator(
        task_id='finish'
    )
    start >> t1 >> finish
target_dag only executes for today's date and does not backfill.
How do I force it to backfill regardless of past DAG runs? I'm using Airflow 2.0.
This might be late now, but I have come up with 2 different solutions.
The first one (and probably the better one) would be as follows:
from airflow.operators.latest_only_operator import LatestOnlyOperator

t1 = LatestOnlyOperator(task_id="ensure_backfill_complete")
I was stuck on a similar conundrum, and this suddenly popped into my head.
The second one is basically wrapping the operator in a loop within a Python function, which is honestly terrible, but it seems to work.
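For completeness, that second, loop-based approach looks roughly like the sketch below: one TriggerDagRunOperator per date in the range to be backfilled, chained so the runs execute one after another. This is only a sketch under the assumption of Airflow 2.x, with the date range and dag id as placeholders:

from datetime import datetime, timedelta
from airflow.models.baseoperator import chain
from airflow.operators.trigger_dagrun import TriggerDagRunOperator

# placeholders: the real start date and number of days depend on your use case;
# these operators are assumed to live inside the parent DAG definition
start = datetime(2021, 11, 15)
backfill_triggers = []
for n in range(5):
    run_date = start + timedelta(days=n)
    backfill_triggers.append(
        TriggerDagRunOperator(
            task_id=f"trigger_target_{run_date:%Y_%m_%d}",
            trigger_dag_id="target_dag",
            execution_date=run_date.isoformat(),  # pin the logical date of the triggered run
            reset_dag_run=True,
            wait_for_completion=True,
        )
    )
# run the triggered "backfill" runs sequentially
chain(*backfill_triggers)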

airflow - creating DAGs and tasks dynamically to build the pipeline for each object

In Airflow I want to export some tables from Postgres (PG) to BigQuery (BQ).
task1: get the max id from BQ
task2: export the data from PG (id>maxid)
task3: GCS to BQ stage
task4: BQ stage to BQ main
But there is a slight challenge: the schedule interval differs per table. So I created a JSON file that specifies the sync interval. If it is 2mins, the table goes into the DAG upsert_2mins; otherwise it uses the 10-minute interval (upsert_10mins). I used this syntax to generate it dynamically.
JSON config file:
{
    "tbl1": ["update_timestamp", "2mins", "stg"],
    "tbl2": ["update_timestamp", "2mins", "stg"]
}
Code:
import json
from airflow import DAG
from datetime import datetime, timedelta
from airflow.utils.dates import days_ago
from airflow.contrib.hooks.bigquery_hook import BigQueryHook
from airflow.contrib.operators.bigquery_operator import BigQueryOperator
from customoperator.custom_PostgresToGCSOperator import custom_PostgresToGCSOperator
from airflow.contrib.operators.gcs_to_bq import custom_PostgresToGoogleCloudStorageOperator
from airflow.operators.python_operator import PythonOperator
from airflow.operators.dummy_operator import DummyOperator

table_list = ['tbl1', 'tbl2']

# DAG details
docs = """test"""

# Add args and Dag
default_args = {
    'owner': 'DBteam',
    'depends_on_past': False,
    'start_date': days_ago(1),
    'email_on_failure': True,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=1)
}

with open('/home/airflow/gcs/dags/upsert_dag/config.json', 'r') as conf:
    config = json.loads(conf.read())

def get_max_ts(dag, tablename, **kwargs):
    # code to find the max record
    return records[0][0]

def pgexport(dag, tablename, **kwargs):
    # code for exporting the data from PG to GCS
    export_tables.execute(None)

def stg_bqimport(dag, tablename, **kwargs):
    # code to import GCS to BQ
    bqload.execute(None)

def prd_merge(dag, tablename, **kwargs):
    # code to merge the BQ stage into the main BQ table
    bqmerge.execute(context=kwargs)

for table_name in table_list:
    sync_interval = config[table_name][1]
    cron_time = ''
    if sync_interval == '2mins':
        cron_time = '*/20 * * * *'
    else:
        cron_time = '*/10 * * * *'

    dag = DAG(
        'upsert_every_{}'.format(sync_interval),
        default_args=default_args,
        description='Incremental load - Every 10mins',
        schedule_interval=cron_time,
        catchup=False,
        max_active_runs=1,
        doc_md=docs
    )

    max_ts = PythonOperator(
        task_id="get_maxts_{}".format(table_name),
        python_callable=get_max_ts,
        op_kwargs={'tablename': table_name, 'dag': dag},
        provide_context=True,
        dag=dag
    )
    export_gcs = PythonOperator(
        task_id='export_gcs_{}'.format(table_name),
        python_callable=pgexport,
        op_kwargs={'tablename': table_name, 'dag': dag},
        provide_context=True,
        dag=dag
    )
    stg_load = PythonOperator(
        task_id='stg_load_{}'.format(table_name),
        python_callable=stg_bqimport,
        op_kwargs={'tablename': table_name, 'dag': dag},
        provide_context=True,
        dag=dag
    )
    merge = PythonOperator(
        task_id='merge_{}'.format(table_name),
        python_callable=prd_merge,
        op_kwargs={'tablename': table_name, 'dag': dag},
        provide_context=True,
        dag=dag
    )

    globals()[sync_interval] = dag
    max_ts >> export_gcs >> stg_load >> merge
It actually created the DAG, but the issue is that in the web UI I can only see the tasks for the last table. It should show the tasks for both tables.
Your code is creating two DAGs, one for each table, but the second overwrites the first because both get the same dag_id.
My suggestion is to change the format of the JSON file to:
{
    "2mins": {
        "tbl1": ["update_timestamp", "stg"],
        "tbl2": ["update_timestamp", "stg"]
    },
    "10mins": {
        "tbl3": ["update_timestamp", "stg"],
        "tbl4": ["update_timestamp", "stg"]
    }
}
And have your code iterate over the schedules and create the needed tasks for each table (you will need two loops):
# looping on the schedules to create two dags
for schedule, tables in config.items():
    cron_time = '*/10 * * * *'
    if schedule == '2mins':
        cron_time = '*/20 * * * *'

    dag_id = 'upsert_every_{}'.format(schedule)
    dag = DAG(
        dag_id,
        default_args=default_args,
        description='Incremental load - Every 10mins',
        schedule_interval=cron_time,
        catchup=False,
        max_active_runs=1,
        doc_md=docs
    )

    # Looping over the tables to create the tasks for
    # each table in the current schedule
    for table_name, table_config in tables.items():
        max_ts = PythonOperator(
            task_id="get_maxts_{}".format(table_name),
            python_callable=get_max_ts,
            op_kwargs={'tablename': table_name, 'dag': dag},
            provide_context=True,
            dag=dag
        )
        export_gcs = PythonOperator(
            task_id='export_gcs_{}'.format(table_name),
            python_callable=pgexport,
            op_kwargs={'tablename': table_name, 'dag': dag},
            provide_context=True,
            dag=dag
        )
        stg_load = PythonOperator(
            task_id='stg_load_{}'.format(table_name),
            python_callable=stg_bqimport,
            op_kwargs={'tablename': table_name, 'dag': dag},
            provide_context=True,
            dag=dag
        )
        merge = PythonOperator(
            task_id='merge_{}'.format(table_name),
            python_callable=prd_merge,
            op_kwargs={'tablename': table_name, 'dag': dag},
            provide_context=True,
            dag=dag
        )

        # Tasks for the same table will be chained
        max_ts >> export_gcs >> stg_load >> merge

    # DAG is created among the global objects
    globals()[dag_id] = dag
