Google Cloud Composer DAG is not getting triggered - python

I'm scheduling a DAG to run at 04:00 AM, Tuesday through Saturday, Eastern Standard Time (NY), starting today, 2020/08/11. After writing and deploying the code, I expected the DAG to be triggered. I refreshed the Airflow UI page a couple of times, but it still isn't triggering. I am using Airflow v1.10.9-composer with Python 3.
This is my DAG code:
"""
This DAG executes a retrieval job
"""
# Required packages to execute DAG
from __future__ import print_function
import pendulum
from airflow.models import DAG
from airflow.models import Variable
from datetime import datetime, timedelta
from airflow.contrib.operators.ssh_operator import SSHOperator
from airflow.operators.dummy_operator import DummyOperator
from airflow.utils.trigger_rule import TriggerRule
local_tz = pendulum.timezone("America/New_York")
# DAG parameters
default_args = {
    'owner': 'Me',
    'depends_on_past': False,
    'start_date': datetime(2020, 8, 10, 4, tzinfo=local_tz),
    'dagrun_timeout': None,
    'email': Variable.get('email'),
    'email_on_failure': True,
    'email_on_retry': False,
    'provide_context': True,
    'retries': None,
    'retry_delay': timedelta(minutes=5)
}
# create DAG object with Name and default_args
with DAG(
    'retrieve_files',
    schedule_interval='0 4 * * 2-6',
    description='Retrieves files from sftp',
    max_active_runs=1,
    catchup=True,
    default_args=default_args
) as dag:
    # Define tasks - below are dummy tasks and a task instantiated by SSHOperator- calling methods written in other py class
    start_dummy = DummyOperator(
        task_id='start',
        dag=dag
    )
    end_dummy = DummyOperator(
        task_id='end',
        trigger_rule=TriggerRule.NONE_FAILED,
        dag=dag
    )
    retrieve_file = SSHOperator(
        ssh_conn_id="my_conn",
        task_id='retrieve_file',
        command='/usr/bin/python3 /path_to_file/getFile.py',
        dag=dag)
    dag.doc_md = __doc__
    retrieve_file.doc_md = """\
#### Task Documentation
Connects to sftp and retrieves files.
"""
    start_dummy >> retrieve_file >> end_dummy

Referring to the official documentation:
The scheduler runs your job one schedule_interval AFTER the start date. If your start_date is 2020-01-01 and schedule_interval is @daily, the first run will be created on 2020-01-02, i.e., after your start date has passed.
To run a DAG at a specific time every day (including today), set start_date to a time in the past and put the desired time in schedule_interval in cron format. The start_date has to lie at least one full schedule interval before the first run you want, or that run won't be triggered.
In your case, the run you expected on Tuesday 2020-08-11 at 04:00 belongs to the interval that begins at the previous schedule point, Saturday 2020-08-08 at 04:00, so start_date must be no later than that. With start_date on Monday 2020-08-10, the first interval begins on Tuesday 2020-08-11 and its run is only created on Wednesday 2020-08-12. Setting start_date to the Tuesday of the previous week, (2020, 8, 4), is a safe choice. With catchup enabled (as in your DAG), the scheduler also creates runs for the intervals that have already elapsed since start_date.
The following example shows how to run a job at 04:00 AM, Tuesday through Saturday, New York time:
from datetime import datetime, timedelta
from airflow import models
import pendulum
from airflow.operators import bash_operator
local_tz = pendulum.timezone("America/New_York")
default_dag_args = {
    'start_date': datetime(2020, 8, 4, 4, tzinfo=local_tz),
    'retries': 0,
}
with models.DAG(
        'Test',
        default_args=default_dag_args,
        schedule_interval='00 04 * * 2-6') as dag:
    # DAG code
    print_dag_run_conf = bash_operator.BashOperator(
        task_id='print_dag_run_conf', bash_command='echo {{ dag_run.id }}')
I recommend reading the "What's the deal with start_date?" section of the Airflow documentation.
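The "one schedule_interval after start_date" rule can be checked with a rough stdlib-only sketch. It assumes naive datetimes standing in for New York local time, and the helper names schedule_points and first_runs are hypothetical (they are not Airflow API); it only imitates the cron expression 0 4 * * 2-6.

```python
from datetime import datetime, timedelta

# Python's weekday() counts Monday=0 ... Sunday=6, so cron's 2-6 (Tue-Sat) maps to 1-5.
TUE_TO_SAT = {1, 2, 3, 4, 5}


def schedule_points(start, count, hour=4, weekdays=TUE_TO_SAT):
    """Return the first `count` schedule points at or after `start`."""
    points = []
    candidate = start.replace(hour=hour, minute=0, second=0, microsecond=0)
    if candidate < start:
        candidate += timedelta(days=1)
    while len(points) < count:
        if candidate.weekday() in weekdays:
            points.append(candidate)
        candidate += timedelta(days=1)
    return points


def first_runs(start, count):
    """Pair each execution_date with the moment its run is actually created."""
    points = schedule_points(start, count + 1)
    # A run for an interval is created when that interval ends,
    # i.e. at the next schedule point.
    return list(zip(points[:-1], points[1:]))


# start_date from the question: Monday 2020-08-10 04:00
for execution_date, created_at in first_runs(datetime(2020, 8, 10, 4), 1):
    print("question:", execution_date, "-> run created", created_at)

# start_date from the answer: Tuesday 2020-08-04 04:00
for execution_date, created_at in first_runs(datetime(2020, 8, 4, 4), 5):
    print("answer:  ", execution_date, "-> run created", created_at)
```

Under these assumptions, the question's start_date yields a first run on Wednesday 2020-08-12, while the earlier start_date includes an interval (Saturday 2020-08-08) whose run is created on Tuesday 2020-08-11.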

Related

Airflow external_task_sensor never stops poking

I need to wait for a task in another DAG before I can trigger my own task, but my external task sensor never stops poking. I have already read some of the related questions here and made sure I adjusted the execution_delta, but I still have the same issue.
Here are my two DAGs.
Parent DAG:
import datetime
import pendulum
from airflow import models
from airflow.operators.python_operator import PythonOperator
local_tz = pendulum.timezone("Europe/Berlin")
args = {
    "start_date": datetime.datetime(2022, 1, 25, tzinfo=local_tz),
    "provide_context": True,
}
def start_job(process_name, **kwargs):
    print('do something: ' + process_name)
    return True
with models.DAG(
    dag_id="test_parent",
    default_args=args,
    # catchup=False,
) as dag:
    task_parent_1 = PythonOperator(
        task_id="task_parent_1",
        python_callable=start_job,
        op_kwargs={"process_name": "my parent task 1"},
        provide_context=True,
    )
    task_parent_2 = PythonOperator(
        task_id="task_parent_2",
        python_callable=start_job,
        op_kwargs={"process_name": "my parent task 2"},
        provide_context=True,
    )
    task_parent_1 >> task_parent_2
And my child dag:
import datetime
import pendulum
from airflow import models
from airflow.operators.python_operator import PythonOperator
from airflow.operators.sensors import ExternalTaskSensor
local_tz = pendulum.timezone("Europe/Berlin")
args = {
    "start_date": datetime.datetime(2022, 1, 25, tzinfo=local_tz),
    "provide_context": True,
}
def start_job(process_name, **kwargs):
    print('do something: ' + process_name)
    return True
with models.DAG(
    dag_id="test_child",
    default_args=args,
    # catchup=False,
) as dag:
    wait_for_parent_task = ExternalTaskSensor(
        task_id='wait_for_parent_task',
        external_dag_id='test_parent',
        external_task_id='task_parent_2',
        execution_delta=datetime.timedelta(hours=24),
        # execution_date_fn=lambda dt: dt - datetime.timedelta(hours=24),
    )
    task_child_1 = PythonOperator(
        task_id="task_child_1",
        python_callable=start_job,
        op_kwargs={"process_name": "my child task 1"},
        provide_context=True,
    )
    task_child_2 = PythonOperator(
        task_id="task_child_2",
        python_callable=start_job,
        op_kwargs={"process_name": "my child task 2"},
        provide_context=True,
    )
    task_child_1 >> wait_for_parent_task >> task_child_2
Code-wise it looks correct, but the start_date is set to today. With execution_delta set, the ExternalTaskSensor will check for the task with execution date execution_date - execution_delta. I.e. the first DAG run will start on the 26th at 00:00, and the ExternalTaskSensor will check for a task with execution_date of 25th 00:00 - 24 hours = 24th 00:00. Since that's before your DAG's starting date, there won't be a task for that execution_date.
In the logs you should see the DAG/task/date it's checking: Poking for tasks %s in dag %s on %s .... You could set your DAG's starting date to a few days ago or let it run for a few days to debug the issue.
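The date arithmetic described above can be reproduced without Airflow. This is only a sketch: date_sensor_pokes is a hypothetical helper that mirrors the subtraction the sensor performs, and naive datetimes stand in for the Europe/Berlin values.

```python
from datetime import datetime, timedelta


def date_sensor_pokes(child_execution_date, execution_delta):
    """Execution date of the parent run the sensor looks for."""
    return child_execution_date - execution_delta


dag_start = datetime(2022, 1, 25)   # start_date of both DAGs
first_child_run = dag_start         # execution_date of the first child run
target = date_sensor_pokes(first_child_run, timedelta(hours=24))

print("sensor pokes for parent run:", target)
print("before the parent's start_date:", target < dag_start)
```

Since both DAGs share the same start_date and schedule, their runs carry the same execution_date, so dropping execution_delta (a delta of zero) may be what is wanted here; that is an inference from the code shown, not something the logs confirm.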
Alternatively, I also found an approach that worked for me: set the sensor's execution date to the scheduled date of the parent DAG. The advantage is that you can also trigger the DAG manually.
This assumes that the parent DAG is scheduled at 6:00 AM in the "Europe/Berlin" timezone.
# Assumption: local_tz_london was not defined in the original snippet;
# pendulum.timezone("Europe/London") is the presumed definition.
local_tz_london = pendulum.timezone("Europe/London")
wait_for_parent_task = ExternalTaskSensor(
    task_id='wait_for_parent_task',
    external_dag_id='test_parent',
    external_task_id='task_parent_2',
    check_existence=True,
    # execution_date needs to be exact (the scheduled time) and in the London timezone
    # Remember: the scheduled start is always one step further in the past -
    # For a daily schedule: - datetime.timedelta(days=1)
    execution_date_fn=lambda dt: (
        datetime.datetime(year=dt.year, month=dt.month, day=dt.day, tzinfo=local_tz)
        + datetime.timedelta(hours=6, minutes=0)
        - datetime.timedelta(days=1)
    ).astimezone(local_tz_london),
)

Unix TimeStamp conversion into Date/Time field failed in Airflow DAG

I am trying to convert a Unix timestamp into a date/time format in an Airflow DAG.
The function get_execution_time() in the Airflow script below leads to a parsing error:
ERROR - HTTP error: Bad Request
[2021-10-04 21:39:46,187] ERROR - {"error":"unable to parse: metric parse error: expected field at 1:94: \"model_i7,model_owner=cgrm_developer,model_name=m1,execution_time=**<function get_execution_time at 0x7f46df0b1b70>\**""}
[2021-10-04 21:39:46,216] ERROR - 400:Bad Request
The parser fails to resolve the Unix timestamp into a date/time string and instead receives execution_time=<function get_execution_time at 0x7f46df0b1b70>.
Here is the executable script:
from airflow import DAG
from random import random, seed, choice
from datetime import datetime, timedelta
from airflow.operators.python_operator import PythonOperator
from airflow.operators.postgres_operator import PostgresOperator
from airflow.operators.http_operator import SimpleHttpOperator
import time
import warnings
import requests
from datetime import datetime
import calendar
from functools import wraps
warnings.filterwarnings("ignore",category=DeprecationWarning)
def get_execution_time():
    ts = int("1284101485")
    return datetime.utcfromtimestamp(ts).strftime('%Y-%m-%d %H:%M:%S')
model_list = ['m1', 'm2', 'm3', 'm4']
def get_data_str(model_name):
    table_name = 'model_i7,'
    execution_time=get_execution_time()
    parameters = ',model_name={model_name},execution_time={execution_time}'.format(\
        model_name=model_name,execution_time=get_execution_time)
    return table_name + 'model_owner=test_developer' + parameters
default_args = {
    'owner': 'developer',
    'depends_on_past': False,
    'start_date': datetime(2021, 10, 4),
    'retries': 1,
    'retry_delay': timedelta(minutes=1),
    'catchup' : False,
    'dagrun_timeout' : timedelta(hours=3),
    'email_on_success': False,
    'email_on_failure': False,
    'email_on_retry': False
}
with DAG(
    dag_id='testi7',
    schedule_interval=None,
    tags=['testi7'],
    access_control = {'developer':{'can_dag_read','can_dag_edit'}},
    default_args=default_args
) as dag:
    for model_name in model_list:
        extracting_metrics = SimpleHttpOperator(
            task_id='extracting_metrics',
            http_conn_id ='metrics_api',
            endpoint='/write',
            python_callable=get_data_str,
            provide_context=True,
            data=get_data_str(model_name)
        )
Expected output:
table_name    model_owner  model_name  execution_time
model_output  developer    model1      2010-09-10T10:59:33+00:00
I would appreciate it if someone could help resolve the above exception.
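The log line hints at what went wrong: execution_time=<function get_execution_time at 0x...> is the repr of the function object itself, so format() appears to have been handed get_execution_time rather than the string it returns. Below is a minimal sketch of the two helpers with only that changed, runnable without Airflow. The ts parameter is an addition made here purely so the function is easy to test, and whether the receiving service accepts this date format is not something the sketch checks.

```python
from datetime import datetime, timezone


def get_execution_time(ts=1284101485):
    """Convert a Unix timestamp (seconds) to a UTC date/time string."""
    return datetime.fromtimestamp(ts, tz=timezone.utc).strftime('%Y-%m-%d %H:%M:%S')


def get_data_str(model_name):
    table_name = 'model_i7,'
    # Call the function and keep the string it returns.
    execution_time = get_execution_time()
    parameters = ',model_name={model_name},execution_time={execution_time}'.format(
        model_name=model_name,
        execution_time=execution_time)  # the string, not the function object
    return table_name + 'model_owner=test_developer' + parameters


print(get_data_str('m1'))
```

Separately, the loop in the script creates every SimpleHttpOperator with the same task_id; deriving the id from model_name (for example 'extracting_metrics_' + model_name) would likely be needed, since task ids have to be unique within a DAG.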

Scheduling Airflow DAGs to run exclusively Monday through Friday i.e only weekdays

I have a DAG executing a Python script which takes a date argument (the current date). I'm scheduling the DAG to run at 6:00 AM Monday through Friday, i.e. weekdays, Eastern Standard Time. The DAG has to run the Python script on Monday with Monday's date as the argument, and likewise for Tuesday through Friday, with Friday's date as the argument on Friday.
I noticed that a schedule interval of '0 6 * * 1-5' didn't work because Friday's execution didn't occur until the following Monday.
I changed the schedule interval to '0 6 * * *' to run every day at 6:00 AM and, at the start of my DAG, filter for dates that match '0 6 * * 1-5', so effectively Monday to Friday. For Saturday and Sunday, the downstream tasks should be skipped.
This is my code:
from __future__ import print_function
import pendulum
import logging
from airflow.models import DAG
from airflow.models import Variable
from datetime import datetime, timedelta
from airflow.contrib.operators.ssh_operator import SSHOperator
from airflow.operators.python_operator import ShortCircuitOperator
from airflow.operators.dummy_operator import DummyOperator
from airflow.utils.trigger_rule import TriggerRule
from croniter import croniter
log = logging.getLogger(__name__)
def filter_processing_date(**context):
    execution_date = context['execution_date']
    cron = croniter('0 6 * * 1-5', execution_date)
    log.info('cron is: {}'.format(cron))
    log.info('execution date is: {}'.format(execution_date))
    #prev_date = cron.get_prev(datetime)
    #log.info('prev_date is: {}'.format(prev_date))
    return execution_date == cron.get_next(datetime).get_prev(datetime)
local_tz = pendulum.timezone("America/New_York")
# DAG parameters
default_args = {
    'owner': 'Managed Services',
    'depends_on_past': False,
    'start_date': datetime(2020, 8, 3, tzinfo=local_tz),
    'dagrun_timeout': None,
    'email': Variable.get('email'),
    'email_on_failure': True,
    'email_on_retry': False,
    'provide_context': True,
    'retries': 12,
    'retry_delay': timedelta(minutes=5)
}
with DAG(
    'execute_python',
    schedule_interval='0 6 * * *',
    default_args=default_args
) as dag:
    start_dummy = DummyOperator(
        task_id='start',
        dag=dag
    )
    end_dummy = DummyOperator(
        task_id='end',
        trigger_rule=TriggerRule.NONE_FAILED,
        dag=dag
    )
    weekdays_only = ShortCircuitOperator(
        task_id='weekdays_only',
        python_callable=filter_processing_date,
        dag=dag
    )
    run_python = SSHOperator(
        ssh_conn_id="oci_connection",
        task_id='run_python',
        command='/usr/bin/python3 /home/sb/local/bin/runProcess.py -d {{ ds_nodash }}',
        dag=dag)
    start_dummy >> weekdays_only >> run_python >> end_dummy
Unfortunately, the weekdays_only task is failing with the error message below. What is going wrong?
[Screenshots: Airflow error message]
Airflow version: v1.10.9-composer
Python 3.
I managed to solve my problem by hacking something together: a function checks whether the next execution date is a weekday and returns True if it is, False otherwise. I call the function from a ShortCircuitOperator, which proceeds with the downstream tasks on True and skips them on False.
My code is below, but I'm open to better solutions.
from __future__ import print_function
import pendulum
import logging
from airflow.models import DAG
from airflow.models import Variable
from datetime import datetime, timedelta
from airflow.contrib.operators.ssh_operator import SSHOperator
from airflow.operators.python_operator import ShortCircuitOperator
from airflow.operators.dummy_operator import DummyOperator
from airflow.utils.trigger_rule import TriggerRule
log = logging.getLogger(__name__)
def checktheday(**context):
    next_execution_date = context['next_execution_date']
    log.info('next_execution_date is: {}'.format(next_execution_date))
    date_check = next_execution_date.weekday()
    log.info('date_check is: {}'.format(date_check))
    if date_check == 0 or date_check == 1 or date_check == 2 or date_check == 3 or date_check == 4:
        decision = True
    else:
        decision = False
    log.info('decision is: {}'.format(decision))
    return decision
local_tz = pendulum.timezone("America/New_York")
# DAG parameters
default_args = {
    'owner': 'Managed Services',
    'depends_on_past': False,
    'start_date': datetime(2020, 8, 3, tzinfo=local_tz),
    'dagrun_timeout': None,
    'email': Variable.get('email'),
    'email_on_failure': True,
    'email_on_retry': False,
    'provide_context': True,
    'retries': 12,
    'retry_delay': timedelta(minutes=5)
}
with DAG(
    'execute_python',
    schedule_interval='0 6 * * *',
    default_args=default_args
) as dag:
    start_dummy = DummyOperator(
        task_id='start',
        dag=dag
    )
    end_dummy = DummyOperator(
        task_id='end',
        trigger_rule=TriggerRule.NONE_FAILED,
        dag=dag
    )
    weekdays_only = ShortCircuitOperator(
        task_id='weekdays_only',
        python_callable=checktheday,
        dag=dag
    )
    run_python = SSHOperator(
        ssh_conn_id="oci_connection",
        task_id='run_python',
        command='/usr/bin/python3 /home/sb/local/bin/runProcess.py -d {{ macros.ds_format(macros.ds_add(ds, 1), "%Y-%m-%d", "%Y%m%d") }}',
        dag=dag)
    start_dummy >> weekdays_only >> run_python >> end_dummy
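As one possible tidy-up of checktheday, the chain of comparisons can be reduced to a single test, because weekday() returns 0 for Monday through 6 for Sunday. This is a sketch that runs without Airflow; is_weekday is a hypothetical helper name, and the context is simulated with a plain keyword argument.

```python
from datetime import datetime


def is_weekday(moment):
    """Return True for Monday (0) through Friday (4)."""
    return moment.weekday() < 5


def checktheday(**context):
    # Same decision as comparing date_check against 0, 1, 2, 3 and 4 one by one.
    return is_weekday(context['next_execution_date'])


# Simulated context values: a Friday and the following Saturday.
print(checktheday(next_execution_date=datetime(2020, 8, 7, 6)))
print(checktheday(next_execution_date=datetime(2020, 8, 8, 6)))
```

The callable can be passed to ShortCircuitOperator exactly as before, since it still takes the context as keyword arguments and returns a boolean.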

How to trigger an Airflow task only when a new partition/data is available in an AWS Athena table, using a DAG in Python?

I have a scenario like the one below:
Trigger Task 1 and Task 2 only when new data is available for them in the source table (Athena). The trigger for Task 1 and Task 2 should fire when a new data partition arrives for the day.
Trigger Task 3 only on completion of Task 1 and Task 2.
Trigger Task 4 only on completion of Task 3.
My code:
from airflow import DAG
from airflow.contrib.sensors.aws_glue_catalog_partition_sensor import AwsGlueCatalogPartitionSensor
from datetime import datetime, timedelta
from airflow.operators.postgres_operator import PostgresOperator
from utils import FAILURE_EMAILS
yesterday = datetime.combine(datetime.today() - timedelta(1), datetime.min.time())
default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': yesterday,
    'email': FAILURE_EMAILS,
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5)
}
dag = DAG('Trigger_Job', default_args=default_args, schedule_interval='@daily')
Athena_Trigger_for_Task1 = AwsGlueCatalogPartitionSensor(
    task_id='athena_wait_for_Task1_partition_exists',
    database_name='DB',
    table_name='Table1',
    expression='load_date={{ ds_nodash }}',
    timeout=60,
    dag=dag)
Athena_Trigger_for_Task2 = AwsGlueCatalogPartitionSensor(
    task_id='athena_wait_for_Task2_partition_exists',
    database_name='DB',
    table_name='Table2',
    expression='load_date={{ ds_nodash }}',
    timeout=60,
    dag=dag)
execute_Task1 = PostgresOperator(
    task_id='Task1',
    postgres_conn_id='REDSHIFT_CONN',
    sql="/sql/flow/Task1.sql",
    params={'limit': '50'},
    trigger_rule='all_success',
    dag=dag
)
execute_Task2 = PostgresOperator(
    task_id='Task2',
    postgres_conn_id='REDSHIFT_CONN',
    sql="/sql/flow/Task2.sql",
    params={'limit': '50'},
    trigger_rule='all_success',
    dag=dag
)
execute_Task3 = PostgresOperator(
    task_id='Task3',
    postgres_conn_id='REDSHIFT_CONN',
    sql="/sql/flow/Task3.sql",
    params={'limit': '50'},
    trigger_rule='all_success',
    dag=dag
)
execute_Task4 = PostgresOperator(
    task_id='Task4',
    postgres_conn_id='REDSHIFT_CONN',
    sql="/sql/flow/Task4",
    params={'limit': '50'},
    dag=dag
)
execute_Task1.set_upstream(Athena_Trigger_for_Task1)
execute_Task2.set_upstream(Athena_Trigger_for_Task2)
execute_Task3.set_upstream(execute_Task1)
execute_Task3.set_upstream(execute_Task2)
execute_Task4.set_upstream(execute_Task3)
What is the best way of achieving this?
I believe your question covers two main problems:
The schedule_interval is not configured explicitly, so @daily sets up a schedule you're not expecting.
How to trigger and retry the DAG properly when it depends on an external event to complete.
The short answer: set schedule_interval explicitly in cron format, and use sensor operators to check for the event periodically.
default_args = {
    # pseudocode: how many pokes fit between starttime and endtime (both in hours),
    # with poke_interval in seconds
    'retries': (endtime - starttime) * 3600 / poke_interval
}
dag = DAG('Trigger_Job', default_args=default_args, schedule_interval='0 10 * * *')
Athena_Trigger_for_Task1 = AwsGlueCatalogPartitionSensor(
    ....
    poke_interval=60 * 5,  # <---- set the poke interval in seconds
    dag=dag)
Here starttime is the hour at which your daily run starts, endtime is the latest hour of the day at which you still want to check for the event before the run is flagged as failed, and poke_interval (the name of the sensor argument, in seconds) is how often the sensor checks whether the event has happened.
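The arithmetic behind that retry count can be sketched in plain Python. The helper name pokes_in_window and the 10:00 to 18:00 window are illustrative assumptions, not Airflow API; the sketch only counts how many checks of a given spacing fit into a daily window.

```python
from datetime import timedelta


def pokes_in_window(start_hour, end_hour, poke_interval_seconds):
    """Number of checks that fit between start_hour and end_hour."""
    window = timedelta(hours=end_hour - start_hour)
    return int(window.total_seconds() // poke_interval_seconds)


# Hypothetical window: start at 10:00, give up at 18:00, check every 5 minutes.
print(pokes_in_window(10, 18, 60 * 5))
```

A sensor also has its own timeout argument (60 seconds in the question's code), so a window like this can alternatively be covered by one long-running sensor attempt with a larger timeout rather than by many retries; which of the two suits a deployment is a judgment call.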
How to address the cron job explicitly
Whenever you set your DAG to @daily, as you did:
dag = DAG('Trigger_Job', default_args=default_args, schedule_interval='@daily')
the docs show what you are actually asking for:
@daily - Run once a day at midnight
That explains the timeout error: the sensors have timeout=60, and with 'retries': 1 and 'retry_delay': timedelta(minutes=5) the DAG runs at midnight, fails, retries 5 minutes later, fails again, and is then flagged as failed.
So @daily sets an implicit cron schedule of:
@daily -> Run once a day at midnight -> 0 0 * * *
A cron expression has the five fields below, and you set a field to * when you mean "all".
Minute Hour Day_of_Month Month Day_of_Week
So @daily is basically saying: run this at minute 0, hour 0, on all days of the month, in all months, on all days of the week.
Your case is: run this at minute 0, hour 10, on all days of the month, in all months, on all days of the week. In cron format this translates to:
0 10 * * *
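To make the field order concrete, here is a small sketch that labels the five cron fields. It handles only the single @daily preset mentioned above plus plain five-field expressions; CRON_FIELDS, PRESETS and describe are hypothetical names, and no validation of the individual field values is attempted.

```python
CRON_FIELDS = ("minute", "hour", "day_of_month", "month", "day_of_week")

# Only the preset discussed above; other presets are deliberately left out.
PRESETS = {"@daily": "0 0 * * *"}


def describe(expression):
    """Map a cron expression (or the @daily preset) to named fields."""
    expression = PRESETS.get(expression, expression)
    parts = expression.split()
    if len(parts) != len(CRON_FIELDS):
        raise ValueError("expected 5 cron fields, got {}".format(len(parts)))
    return dict(zip(CRON_FIELDS, parts))


print(describe("@daily"))
print(describe("0 10 * * *"))
```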
How to trigger and retry the DAG properly when it depends on an external event
You could trigger a DAG from an external event with the command airflow trigger_dag. This is possible if you can somehow have a Lambda function or a Python script target your Airflow instance.
If you can't trigger the DAG externally, use a sensor operator as the OP did, set a poke_interval on it, and set a reasonably high number of retries.

Apache Airflow - Prescript rerun at each task of the dag and date change

I am new to Airflow.
I noticed that if you define a global variable (a timestamp) in the code, its value changes for each task. For example, in the very basic example below, I define a variable now, but each time I print it in a task, the value is different.
from datetime import timedelta
from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from airflow.utils.dates import days_ago
import time
now = int(time.time() * 1000)
RANGE = range(1, 10)
def init_step():
    print("Run on RANGE {}".format(RANGE))
    print("Date of the Scans {}".format(now))
    return RANGE
def trigger_step(index):
    time.sleep(10)
    print("index {} - date {}".format(index, now))
    return index
default_args = {
    'owner': 'airflow',
    'start_date': days_ago(1),
    'retries': 2,
    'retry_delay': timedelta(minutes=15)
}
with DAG('test',
         default_args=default_args,
         schedule_interval='0 16 */7 * *',
         ) as dag:
    init = PythonOperator(task_id='init',
                          python_callable=init_step,
                          dag=dag)
    for index in init_step():
        run = PythonOperator(task_id='trigger-port-' + str(index),
                             op_kwargs={'index': index},
                             python_callable=trigger_step, dag=dag)
        dag >> init >> run
Is this normal behavior? Is there a way to change it?
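As far as I understand it, this is expected: Airflow loads the DAG file again in each process that needs it, including the process started for every task, so module-level code such as now = int(time.time() * 1000) is evaluated anew each time. One way around it, sketched below without Airflow, is to derive the value from the run's execution_date, which is the same for all tasks of a run. The helper name scan_timestamp_ms is hypothetical, the context is simulated with a keyword argument, and in Airflow 1.10 the PythonOperator would need provide_context=True for the callable to receive it.

```python
from datetime import datetime, timezone


def scan_timestamp_ms(execution_date):
    """Milliseconds since the epoch for the run's execution_date."""
    return int(execution_date.timestamp() * 1000)


def trigger_step(index, **context):
    # Derived from the run, not from the moment the file was parsed.
    now = scan_timestamp_ms(context['execution_date'])
    print("index {} - date {}".format(index, now))
    return index, now


# Simulated context: two tasks of the same run see the same value.
run_date = datetime(2021, 1, 1, tzinfo=timezone.utc)
first = trigger_step(1, execution_date=run_date)
second = trigger_step(2, execution_date=run_date)
```

If the wall-clock time of the first task is what matters rather than the execution date, having the init task push the value to XCom and the other tasks pull it would be another option.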
