Airflow external_task_sensor never stops poking - python

I need to wait for a task in another DAG before I can trigger my own task. But my external sensor never stops poking. I have already read some of the related questions here and made sure I adjusted the execution_delta, but I still have the same issue.
Here are my two dags
Parent Dag:
import datetime
import pendulum
from airflow import models
from airflow.operators.python_operator import PythonOperator

local_tz = pendulum.timezone("Europe/Berlin")

args = {
    "start_date": datetime.datetime(2022, 1, 25, tzinfo=local_tz),
    "provide_context": True,
}

def start_job(process_name, **kwargs):
    print('do something: ' + process_name)
    return True

with models.DAG(
    dag_id="test_parent",
    default_args=args,
    # catchup=False,
) as dag:
    task_parent_1 = PythonOperator(
        task_id="task_parent_1",
        python_callable=start_job,
        op_kwargs={"process_name": "my parent task 1"},
        provide_context=True,
    )
    task_parent_2 = PythonOperator(
        task_id="task_parent_2",
        python_callable=start_job,
        op_kwargs={"process_name": "my parent task 2"},
        provide_context=True,
    )

    task_parent_1 >> task_parent_2
And my child dag:
import datetime
import pendulum
from airflow import models
from airflow.operators.python_operator import PythonOperator
from airflow.operators.sensors import ExternalTaskSensor

local_tz = pendulum.timezone("Europe/Berlin")

args = {
    "start_date": datetime.datetime(2022, 1, 25, tzinfo=local_tz),
    "provide_context": True,
}

def start_job(process_name, **kwargs):
    print('do something: ' + process_name)
    return True

with models.DAG(
    dag_id="test_child",
    default_args=args,
    # catchup=False,
) as dag:
    wait_for_parent_task = ExternalTaskSensor(
        task_id='wait_for_parent_task',
        external_dag_id='test_parent',
        external_task_id='task_parent_2',
        execution_delta=datetime.timedelta(hours=24),
        # execution_date_fn=lambda dt: dt - datetime.timedelta(hours=24),
    )
    task_child_1 = PythonOperator(
        task_id="task_child_1",
        python_callable=start_job,
        op_kwargs={"process_name": "my child task 1"},
        provide_context=True,
    )
    task_child_2 = PythonOperator(
        task_id="task_child_2",
        python_callable=start_job,
        op_kwargs={"process_name": "my child task 2"},
        provide_context=True,
    )

    task_child_1 >> wait_for_parent_task >> task_child_2

Code-wise it looks correct, but the start_date is set to today. With execution_delta set, the ExternalTaskSensor checks for the task with execution date execution_date - execution_delta. I.e. the first DAG run will start on the 26th at 00:00 with an execution_date of the 25th 00:00, so the ExternalTaskSensor will check for a task with execution_date 25th 00:00 - 24 hours = 24th 00:00. Since that's before your DAG's start date, there won't be a task for that execution_date.
In the logs you should see the DAG/task/date it's checking: Poking for tasks %s in dag %s on %s .... You could set your DAG's starting date to a few days ago or let it run for a few days to debug the issue.
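To see the mismatch concretely, here is a minimal sketch (dates taken from the question) of the logical date the sensor is actually poking for; you can compare the printed value with the "Poking for tasks ..." line in the log:
import datetime
import pendulum

local_tz = pendulum.timezone("Europe/Berlin")

# execution_date of the first child run (started on the 26th at 00:00)
execution_date = datetime.datetime(2022, 1, 25, tzinfo=local_tz)
execution_delta = datetime.timedelta(hours=24)

# ExternalTaskSensor looks for task_parent_2 in test_parent at this logical date:
target = execution_date - execution_delta
print(target)  # 2022-01-24 00:00 - before test_parent's start_date, so no run ever exists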

Alternatively, I also found a way that worked for me: set the execution date to the scheduled date of the parent DAG via execution_date_fn. The advantage: you can also trigger the DAG manually.
Assume the parent DAG is scheduled at 6:00 AM in the timezone "Europe/Berlin".
# local_tz_london was not defined in the original snippet; assuming the standard London timezone
local_tz_london = pendulum.timezone("Europe/London")

wait_for_parent_task = ExternalTaskSensor(
    task_id='wait_for_parent_task',
    external_dag_id='test_parent',
    external_task_id='task_parent_2',
    check_existence=True,
    # execution_date needs to be exact (the scheduled time) and in the London timezone
    # Remember: the scheduled run is always one step further in the past -
    # for a daily schedule: - datetime.timedelta(days=1)
    execution_date_fn=lambda dt: (datetime.datetime(year=dt.year, month=dt.month, day=dt.day, tzinfo=local_tz)
                                  + datetime.timedelta(hours=6, minutes=0)
                                  - datetime.timedelta(days=1)
                                  ).astimezone(local_tz_london),
)
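Independent of which execution-date mapping you use, it can also help to bound the sensor so that it eventually fails instead of poking forever; a sketch using the standard BaseSensorOperator arguments (the intervals are illustrative values, not from the original question):
wait_for_parent_task = ExternalTaskSensor(
    task_id='wait_for_parent_task',
    external_dag_id='test_parent',
    external_task_id='task_parent_2',
    execution_delta=datetime.timedelta(hours=24),
    check_existence=True,   # fail fast if the external DAG/task does not exist
    poke_interval=5 * 60,   # check every 5 minutes
    timeout=6 * 60 * 60,    # give up after 6 hours instead of poking indefinitely
    mode='reschedule',      # free the worker slot between pokes
)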

Related

How to handle Inter State DAG execution

Both DAG A and DAG B run at the same time, for example 10 AM. DAG A completes in 5 minutes, and DAG B waits for the execution state of DAG A: if the state is success, DAG B moves to the next step, otherwise it throws an error. DAG B should always take the execution state of DAG A from the same day and time. For example, suppose DAG A ran successfully yesterday but has not started today due to some issue, while DAG B has started: DAG B should not consider the previous run's state but DAG A's current execution state.
If the execution state is something other than failed or success, how should the code handle it?
I am new to Airflow and don't know how to handle this.
Code
def status(**context):
    try:
        TI = context["task_instance"]
        exuection_date = context["execution_date"]
        run_state_intra = []
        run_id_intra = []
        for data_tuple in (
            settings.Session()
            .query(DR.dag_id, DR.execution_date, DR.state, DR.run_id)
            .order_by(DR.execution_date.desc())
            .limit(1)
        ):
In DagB you can create a BranchPythonOperator that finds the last run of DagA and decides whether to continue with the next task or raise an exception.
In the example below, I check whether DagA ran on the same day (midnight to now) with state=success. If there are results, I return the name of the next task to run; otherwise I raise an exception and DagB fails.
from datetime import datetime

from airflow import DAG
from airflow.models import DagRun, TaskInstance
from airflow.operators.python import BranchPythonOperator, PythonOperator
from airflow.exceptions import AirflowException
from airflow.utils import timezone
from airflow.utils.trigger_rule import TriggerRule

with DAG(
    dag_id="DagB",
    start_date=datetime(2022, 1, 1),
    schedule_interval=None,
    render_template_as_native_obj=True,
    tags=["DagB"],
) as dag:

    def check_success_dag_a(**context):
        ti: TaskInstance = context['ti']
        dag_run: DagRun = context['dag_run']
        date: datetime = ti.execution_date
        ts = timezone.make_aware(datetime(date.year, date.month, date.day, 0, 0, 0))
        dag_a = dag_run.find(
            dag_id='DagA',
            state='success',
            execution_start_date=ts,
            execution_end_date=ti.execution_date)
        if dag_a:
            return "taskB"
        raise AirflowException("DagA failed")

    check_success = BranchPythonOperator(
        task_id="check_success_dag_a",
        python_callable=check_success_dag_a,
    )

    def run():
        print('DagB')

    run_task = PythonOperator(
        task_id="taskB",
        dag=dag,
        python_callable=run,
        trigger_rule=TriggerRule.ONE_SUCCESS
    )

    (check_success >> [run_task])

How do I trigger a backfill with TriggerDagRunOperator?

I have a requirement where I need the dag triggered by TriggerDagRunOperator to execute a backfill and not just for the same execution date.
The TriggerDagRunOperator is set as follows:
trigger1 = TriggerDagRunOperator(
    task_id = 'trigger1',
    trigger_dag_id = 'target_dag',
    conf = {'message': 'Starting target 1'},
    reset_dag_run = True,
    wait_for_completion = True
)
Target dag is basically:
starting_date = datetime.strptime("2021-11-15", "%Y-%m-%d")

with DAG("target_dag", default_args=default_args, schedule_interval='@daily', max_active_runs=10) as dag:
    start = DummyOperator(
        task_id = 'start'
    )
    t1 = PythonOperator(
        task_id = "t1",
        provide_context=True,
        python_callable=t1
    )
    finish = DummyOperator(
        task_id = 'finish'
    )
    start >> t1 >> finish
target_dag is only executing for today's date and not backfilling.
How do I force it to backfill regardless of past DAG runs? I'm using Airflow 2.0.
This might be late now, but I have come up with two different solutions.
The first one (and probably the better one) is as follows:
from airflow.operators.latest_only_operator import LatestOnlyOperator
t1 = LatestOnlyOperator(task_id="ensure_backfill_complete")
I was stuck on a similar conundrum, and this suddenly popped into my head.
The second one is basically wrapping the operator in a loop within a Python function, which is honestly terrible, but it seems to work.
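A rough sketch of what that loop-based approach could look like (this is hypothetical code, not the answer author's: it creates one TriggerDagRunOperator per day in the backfill window and passes an explicit execution_date, a parameter TriggerDagRunOperator accepts in Airflow 2):
from datetime import datetime, timedelta

from airflow.operators.trigger_dagrun import TriggerDagRunOperator

def make_backfill_triggers(start: datetime, end: datetime):
    # One trigger task per day in [start, end] - illustrative only.
    # Call this inside the `with DAG(...)` block of the triggering DAG.
    tasks = []
    day = start
    while day <= end:
        tasks.append(
            TriggerDagRunOperator(
                task_id=f"trigger_target_{day:%Y%m%d}",
                trigger_dag_id="target_dag",
                execution_date=day.isoformat(),  # run target_dag for this logical date
                reset_dag_run=True,
                wait_for_completion=True,
            )
        )
        day += timedelta(days=1)
    return tasks

# e.g. trigger target_dag runs for 2021-11-15 .. 2021-11-17
backfill_triggers = make_backfill_triggers(datetime(2021, 11, 15), datetime(2021, 11, 17))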

Google Cloud Composer DAG is not getting triggered

I'm scheduling a DAG to run at 04:00 AM, Tuesday through Saturday, Eastern Standard Time (NY), starting from today, 2020/08/11. After writing up the code and deploying it, I expected the DAG to get triggered. I refreshed my Airflow UI page a couple of times, but it's still not triggering. I am using Airflow version v1.10.9-composer with Python 3.
This is my DAG code:
"""
This DAG executes a retrieval job
"""
# Required packages to execute DAG
from __future__ import print_function
import pendulum
from airflow.models import DAG
from airflow.models import Variable
from datetime import datetime, timedelta
from airflow.contrib.operators.ssh_operator import SSHOperator
from airflow.operators.dummy_operator import DummyOperator
from airflow.utils.trigger_rule import TriggerRule
local_tz = pendulum.timezone("America/New_York")
# DAG parameters
default_args = {
'owner': 'Me',
'depends_on_past': False,
'start_date': datetime(2020, 8, 10, 4, tzinfo=local_tz),
'dagrun_timeout': None,
'email': Variable.get('email'),
'email_on_failure': True,
'email_on_retry': False,
'provide_context': True,
'retries': None,
'retry_delay': timedelta(minutes=5)
}
# create DAG object with Name and default_args
with DAG(
'retrieve_files',
schedule_interval='0 4 * * 2-6',
description='Retrieves files from sftp',
max_active_runs=1,
catchup=True,
default_args=default_args
) as dag:
# Define tasks - below are dummy tasks and a task instantiated by SSHOperator- calling methods written in other py class
start_dummy = DummyOperator(
task_id='start',
dag=dag
)
end_dummy = DummyOperator(
task_id='end',
trigger_rule=TriggerRule.NONE_FAILED,
dag=dag
)
retrieve_file = SSHOperator(
ssh_conn_id="my_conn",
task_id='retrieve_file',
command='/usr/bin/python3 /path_to_file/getFile.py',
dag=dag)
dag.doc_md = __doc__
retrieve_file.doc_md = """\
#### Task Documentation
Connects to sftp and retrieves files.
"""
start_dummy >> retrieve_file >> end_dummy
Referring to the official documentation:
The scheduler runs your job one schedule_interval AFTER the start date.
If your start_date is 2020-01-01 and schedule_interval is @daily, the
first run will be created on 2020-01-02 i.e., after your start date
has passed.
To run a DAG at a specific time every day (including today), the start_date needs to be set to a time in the past, and schedule_interval needs to express the desired time in cron format. It is very important to set that past datetime properly, or the trigger won't work.
In this case, we should set the start_date to a Tuesday from the previous week, i.e. (2020, 8, 4), so that at least one full schedule interval has already passed since the start date.
Let's take a look at the following example, which shows how to run a job at 04:00 AM, Tuesday through Saturday, EST:
from datetime import datetime, timedelta
from airflow import models
import pendulum
from airflow.operators import bash_operator

local_tz = pendulum.timezone("America/New_York")

default_dag_args = {
    'start_date': datetime(2020, 8, 4, 4, tzinfo=local_tz),
    'retries': 0,
}

with models.DAG(
        'Test',
        default_args=default_dag_args,
        schedule_interval='00 04 * * 2-6') as dag:
    # DAG code
    print_dag_run_conf = bash_operator.BashOperator(
        task_id='print_dag_run_conf', bash_command='echo {{ dag_run.id }}')
I recommend checking the "What's the deal with start_date?" documentation.
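To make the "one schedule_interval after the start date" rule concrete for this schedule, here is a small illustrative sketch (the dates mirror the example above):
import pendulum

local_tz = pendulum.timezone("America/New_York")

start_date = pendulum.datetime(2020, 8, 4, 4, tz=local_tz)  # Tuesday 04:00
first_interval_end = start_date.add(days=1)                 # Wednesday 04:00

# With schedule '00 04 * * 2-6', the run covering [Tue 04:00, Wed 04:00) is only
# created at Wed 04:00 - one interval AFTER start_date - which is why a
# start_date of "today" never produces a run today.
print(start_date, "->", first_interval_end)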

How to trigger an Airflow task only when a new partition/data is available in an AWS Athena table, using a DAG in Python?

I have a scenario like the one below:
Trigger Task 1 and Task 2 only when new data is available for them in the source table (Athena). The trigger for Task 1 and Task 2 should happen when a new data partition arrives for the day.
Trigger Task 3 only on the completion of Task 1 and Task 2.
Trigger Task 4 only on the completion of Task 3.
My code
from airflow import DAG
from airflow.contrib.sensors.aws_glue_catalog_partition_sensor import AwsGlueCatalogPartitionSensor
from datetime import datetime, timedelta
from airflow.operators.postgres_operator import PostgresOperator
from utils import FAILURE_EMAILS

yesterday = datetime.combine(datetime.today() - timedelta(1), datetime.min.time())

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': yesterday,
    'email': FAILURE_EMAILS,
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5)
}

dag = DAG('Trigger_Job', default_args=default_args, schedule_interval='@daily')

Athena_Trigger_for_Task1 = AwsGlueCatalogPartitionSensor(
    task_id='athena_wait_for_Task1_partition_exists',
    database_name='DB',
    table_name='Table1',
    expression='load_date={{ ds_nodash }}',
    timeout=60,
    dag=dag)

Athena_Trigger_for_Task2 = AwsGlueCatalogPartitionSensor(
    task_id='athena_wait_for_Task2_partition_exists',
    database_name='DB',
    table_name='Table2',
    expression='load_date={{ ds_nodash }}',
    timeout=60,
    dag=dag)

execute_Task1 = PostgresOperator(
    task_id='Task1',
    postgres_conn_id='REDSHIFT_CONN',
    sql="/sql/flow/Task1.sql",
    params={'limit': '50'},
    trigger_rule='all_success',
    dag=dag
)

execute_Task2 = PostgresOperator(
    task_id='Task2',
    postgres_conn_id='REDSHIFT_CONN',
    sql="/sql/flow/Task2.sql",
    params={'limit': '50'},
    trigger_rule='all_success',
    dag=dag
)

execute_Task3 = PostgresOperator(
    task_id='Task3',
    postgres_conn_id='REDSHIFT_CONN',
    sql="/sql/flow/Task3.sql",
    params={'limit': '50'},
    trigger_rule='all_success',
    dag=dag
)

execute_Task4 = PostgresOperator(
    task_id='Task4',
    postgres_conn_id='REDSHIFT_CONN',
    sql="/sql/flow/Task4",
    params={'limit': '50'},
    dag=dag
)

execute_Task1.set_upstream(Athena_Trigger_for_Task1)
execute_Task2.set_upstream(Athena_Trigger_for_Task2)
execute_Task3.set_upstream(execute_Task1)
execute_Task3.set_upstream(execute_Task2)
execute_Task4.set_upstream(execute_Task3)
What is best optimal way of achieving it?
I believe your question addresses two major problems:
1. forgetting to configure the schedule_interval explicitly, so @daily sets up something you're not expecting;
2. how to trigger and retry the execution of the DAG properly when you depend on an external event to complete the execution.
The short answer: set your schedule_interval explicitly in cron format and use sensor operators to check from time to time:
default_args = {
    'retries': int((endtime - starttime) * 60 / poke_interval)
}

dag = DAG('Trigger_Job', default_args=default_args, schedule_interval='0 10 * * *')

Athena_Trigger_for_Task1 = AwsGlueCatalogPartitionSensor(
    ....
    poke_interval=60*5,  # <---- how often the sensor checks, in seconds
    dag=dag)
where starttime is the time of day your checking window opens, endtime is the last time of day you should check whether the event happened before flagging the run as failed, and poke_interval is the interval at which your sensor checks whether the event happened.
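As a concrete, illustrative calculation (with explicit units; the numbers are not from the question):
# Poke every 5 minutes between 10:00 and 16:00.
starttime = 10          # hour of day the checking window opens
endtime = 16            # hour of day after which the run should be flagged as failed
poke_interval = 5 * 60  # seconds between sensor pokes

window_seconds = (endtime - starttime) * 60 * 60
retries = int(window_seconds / poke_interval)  # 21600 / 300 = 72 checks fit in the window
print(retries)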
How to address the cron job explicitly
Whenever you set your DAG to @daily like you did:
dag = DAG('Trigger_Job', default_args=default_args, schedule_interval='@daily')
from the docs you can see you are actually doing:
@daily - Run once a day at midnight
which explains why you're getting the timeout error and why it fails after 5 minutes, because you set 'retries': 1 and 'retry_delay': timedelta(minutes=5). It tries running the DAG at midnight and fails, retries 5 minutes later and fails again, so it is flagged as failed.
So basically @daily sets an implicit cron job of:
@daily -> Run once a day at midnight -> 0 0 * * *
The cron format is shown below, and you set a field to * whenever you want to say "all":
Minute Hour Day_of_Month Month Day_of_Week
So @daily basically says: run this at minute 0, hour 0, on all days_of_month, all months and all days_of_week.
Your case is: run this at minute 0, hour 10, on all days_of_month, all months and all days_of_week. That translates to the cron expression:
0 10 * * *
How to trigger and retry the execution of the DAG properly when you depend on an external event to complete the execution
You could trigger a DAG in Airflow from an external event by using the CLI command airflow trigger_dag. This would be possible if you could somehow have a lambda function / Python script target your Airflow instance.
If you can't trigger the DAG externally, then use a sensor operator like the OP did, set a poke_interval on it, and set a reasonably high number of retries.
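Putting the pieces together, the sensor from the question could be configured roughly like the sketch below (the window and intervals are illustrative; poke_interval, timeout and mode are standard BaseSensorOperator arguments):
Athena_Trigger_for_Task1 = AwsGlueCatalogPartitionSensor(
    task_id='athena_wait_for_Task1_partition_exists',
    database_name='DB',
    table_name='Table1',
    expression='load_date={{ ds_nodash }}',
    poke_interval=5 * 60,   # check for the partition every 5 minutes
    timeout=6 * 60 * 60,    # keep checking for up to 6 hours before failing
    mode='reschedule',      # release the worker slot between pokes
    dag=dag)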

Apache Airflow - Prescript rerun at each task of the dag and date change

I am new to using Airflow.
I noticed that if you define a global variable (a timestamp) in the code, its value changes for each task. For example, in the very basic example below, I define a variable now, but each time I print it in a task, its value has changed.
from datetime import timedelta

from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from airflow.utils.dates import days_ago
import time

now = int(time.time() * 1000)
RANGE = range(1, 10)

def init_step():
    print("Run on RANGE {}".format(RANGE))
    print("Date of the Scans {}".format(now))
    return RANGE

def trigger_step(index):
    time.sleep(10)
    print("index {} - date {}".format(index, now))
    return index

default_args = {
    'owner': 'airflow',
    'start_date': days_ago(1),
    'retries': 2,
    'retry_delay': timedelta(minutes=15)
}

with DAG('test',
         default_args=default_args,
         schedule_interval='0 16 */7 * *',
         ) as dag:
    init = PythonOperator(task_id='init',
                          python_callable=init_step,
                          dag=dag)
    for index in init_step():
        run = PythonOperator(task_id='trigger-port-' + str(index),
                             op_kwargs={'index': index},
                             python_callable=trigger_step, dag=dag)
        dag >> init >> run
Is this normal behavior? Is there a way to change it?
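This is expected: the top level of a DAG file is re-executed every time the scheduler or a worker parses it, so now is recomputed for every task. One common workaround, shown as a minimal sketch (with provide_context=True on older Airflow versions), is to derive the timestamp from the run's execution date passed in through the task context, so every task in the same run sees the same value:
def trigger_step(index, **context):
    # Derive a per-run timestamp from the logical/execution date of the DAG run.
    run_ts = int(context["execution_date"].timestamp() * 1000)
    print("index {} - date {}".format(index, run_ts))
    return index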
