I am trying to get the Airflow ExternalTaskSensor to work, but so far I have not been able to get it to complete. It always gets stuck in the running state and never finishes, so the DAG cannot move on to the next task.
Here is the code I am using to test:
# Import paths below assume Airflow 1.10.x
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.operators.dagrun_operator import TriggerDagRunOperator
from airflow.operators.dummy_operator import DummyOperator
from airflow.sensors.external_task_sensor import ExternalTaskSensor

DEFAULT_ARGS = {
    'owner': 'NAME',
    'depends_on_past': False,
    'start_date': datetime(2019, 9, 9),
    'email': ['airflow@example.com'],
    'email_on_failure': False,
    'email_on_retry': False
}
external_watch_dag = DAG(
    'DAG-External_watcher-Test',
    default_args=DEFAULT_ARGS,
    dagrun_timeout=timedelta(hours=1),
    schedule_interval=None
)
start_op = DummyOperator(
    task_id='start_op',
    dag=external_watch_dag
)
trigger_external = TriggerDagRunOperator(
    task_id='trigger_external',
    trigger_dag_id='DAG-Dummy',
    dag=external_watch_dag
)
external_watch_op = ExternalTaskSensor(
    task_id='external_watch_op',
    external_dag_id='DAG-Dummy',
    external_task_id='dummy_task',
    check_existence=True,
    execution_delta=timedelta(minutes=-1),
    # execution_date_fn=datetime(2019, 9, 25),
    execution_timeout=timedelta(minutes=30),
    dag=external_watch_dag
)
end_op = DummyOperator(
    task_id='end_op',
    dag=external_watch_dag
)
start_op >> trigger_external >> external_watch_op >> end_op
# start_op >> [external_watch_op, trigger_external]
# external_watch_op >> end_op
# Below is the setup for the dummy DAG that is called above by the Trigger and watched by the TaskSensor
dummy_dag = DAG(
    'DAG-Dummy',
    default_args=DEFAULT_ARGS,
    dagrun_timeout=timedelta(hours=1),
    schedule_interval=None
)
dummy_task = BashOperator(
    task_id='dummy_task',
    bash_command='sleep 10',
    dag=dummy_dag
)
I have tried tweaking this code in a number of ways but have not had any success with the ExternalTaskSensor.
Does anyone know how to solve this problem and get the ExternalTaskSensor to work properly? I have also read that issues can arise from schedule intervals when using the ExternalTaskSensor. Is it possible that part of the issue is that both DAGs have schedule_interval=None?
I did get this to work with both DAGs set to the exact same schedule_interval, but that will not work in production. The goal is to have the main DAG (external_watch_dag) run on a regular schedule and trigger DAG-Dummy during its run, with DAG-Dummy itself having schedule_interval=None.
Any help is greatly appreciated.
By default, the ExternalTaskSensor monitors the external_dag_id run that has the same execution date as the sensor DAG. With execution_delta you can set a time difference between the sensor DAG and the external DAG so the sensor looks for the correct execution_date. This works well when both DAGs run on a schedule, because you know that time difference exactly.
The problem: when a DAG is triggered manually or by another DAG, you cannot know for sure the exact execution date of either DAG.
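To make the date matching concrete, here is a minimal plain-Python sketch (it does not use Airflow). It assumes the sensor polls for the external run whose execution date equals its own execution date minus execution_delta, which is how I understand the sensor to behave. The helper name and the sample timestamps are made up for illustration.

```python
from datetime import datetime, timedelta


def date_sensor_looks_for(sensor_execution_date, execution_delta=None):
    """Hypothetical helper: the external execution date the sensor would poll for."""
    if execution_delta is None:
        # Default behaviour: same execution date as the sensor's own DAG run
        return sensor_execution_date
    # Assumption: the delta is subtracted, so a negative delta looks forward in time
    return sensor_execution_date - execution_delta


# Sample execution date of the DAG run that contains the sensor
sensor_run = datetime(2019, 9, 25, 12, 0, 0)

same_date = date_sensor_looks_for(sensor_run)
shifted = date_sensor_looks_for(sensor_run, timedelta(minutes=-1))

# Sample timestamp of a run that was triggered without an explicit execution_date
triggered_run = datetime(2019, 9, 25, 12, 0, 7)

print(same_date)
print(shifted)
print(shifted == triggered_run)
```

Unless the triggered run happens to get exactly the date the sensor computes, the sensor keeps waiting for a run that never appears.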
The solution: because you are using the TriggerDagRunOperator, you can set its execution_date parameter. This makes sure that the execution date of your DAG and of the external DAG is the same. From the docs:
execution_date (str or datetime.datetime) – Execution date for the dag (templated)
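So your code will look like this (execution_delta is no longer needed on the sensor, because both runs now share the same execution date):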
trigger_external = TriggerDagRunOperator(
    task_id='trigger_external',
    trigger_dag_id='DAG-Dummy',
    dag=external_watch_dag,
    execution_date="{{ execution_date }}",  # Use the template to get the current execution date
)
external_watch_op = ExternalTaskSensor(
    task_id='external_watch_op',
    external_dag_id='DAG-Dummy',
    external_task_id='dummy_task',
    check_existence=True,
    execution_timeout=timedelta(minutes=30),
    dag=external_watch_dag
)
While developing some code on Airflow, I saw that all my PythonOperator task parameters that were empty strings ('') are being replaced with None when passed to the python_callable.
To reproduce, take the following function (which will be the python_callable):
def print_something(something):
    print('Something: ', something)
And then, the following DAG and Task:
with DAG(
    dag_id='print_test',
    tags=['Test'],
    start_date=days_ago(1),
    schedule_interval=None,
    default_args={'owner': 'rand'},
    catchup=False,
    render_template_as_native_obj=True,
) as dag:
    print_task = PythonOperator(
        task_id=f'task_print_test',
        dag=dag,
        python_callable=print_something,
        op_kwargs={'something': {'test':''}}
    )
    print_task
When I go to the task execution logs:
[2022-07-13, 12:04:14 -03] {logging_mixin.py:115} INFO - Something: {'test': None}
So Airflow is actually replacing empty strings with None values.
Is there any way to prevent this?
Yes, there is. Remove this DAG configuration:
render_template_as_native_obj=True
With this option enabled, Airflow renders templated fields (op_kwargs and others) into native Python objects instead of strings, and an empty string ends up as None.
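As a rough illustration of why an empty string can come back as None, here is a simplified plain-Python model of native-type rendering. It is not Airflow's or Jinja's actual code. The function name is hypothetical, and it rests on the assumption that an empty template produces no output chunks, so there is nothing to turn into a string.

```python
from ast import literal_eval


def native_render_sketch(chunks):
    """Simplified, hypothetical model of rendering a template to a native object."""
    if not chunks:
        # Assumption: no rendered output at all is reported as None
        return None
    raw = "".join(str(chunk) for chunk in chunks)
    try:
        # Text that looks like a Python literal becomes that literal
        return literal_eval(raw)
    except (ValueError, SyntaxError):
        # Anything else stays a plain string
        return raw


print(native_render_sketch([]))       # models the empty string ''
print(native_render_sketch(["42"]))   # models the string '42'
print(native_render_sketch(["abc"]))  # models the string 'abc'
```

Under this model, only values that are empty or look like Python literals change type; ordinary text is passed through untouched.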
I'm a newbie in Apache Airflow.
There are a lot of examples of basic DAGs on the Internet.
Unfortunately, I didn't find any examples of single-task DAGs.
Most DAG examples contain a bitshift operator at the end of the .py script, which defines the task order.
For example:
# ...our DAG's code...
task1 >> task2 >> task3
But what if my DAG has just a single task at the moment?
My question is: do I need to put this single task's name at the end of the Python file?
Or, if we have only one task in scope, will Airflow handle it itself, making the last line of the code below redundant?
from datetime import timedelta
from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.utils.dates import days_ago
default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'email': ['airflow@example.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
}
with DAG(
    'tutorial',
    default_args=default_args,
    description='A simple tutorial DAG',
    schedule_interval=timedelta(days=1),
    start_date=days_ago(2),
    tags=['example'],
) as dag:
    t1 = BashOperator(
        task_id='print_date',
        bash_command='date',
    )
    t1 # IS THIS LINE OF CODE NECESSARY?
The answer is NO, you don't need to include the last line. You could also skip assigning the variable t1, leaving the DAG like this:
with DAG(
    'tutorial',
    default_args=default_args,
    description='A simple tutorial DAG',
    schedule_interval=timedelta(days=1),
    start_date=days_ago(2),
    tags=['example'],
) as dag:
    BashOperator(
        task_id='print_date',
        bash_command='date',
    )
The reason to assign an instance of an Operator (such as BashOperator) to a variable (called a Task in this scope) is similar to that for any other object in OOP: you keep a reference so you can use it later. In your example no other operation is performed on the t1 variable (you are not reading it or calling any method on it), so there is no reason to declare it.
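If it helps to see why the task still ends up in the DAG without a variable, here is a toy model of the context-manager pattern in plain Python. MiniDag and MiniTask are hypothetical stand-ins written for this illustration, not Airflow classes, and the real implementation differs in its details.

```python
class MiniDag:
    """Toy stand-in for a DAG; not Airflow's implementation."""

    _current = None  # the DAG whose `with` block is currently active

    def __init__(self, dag_id):
        self.dag_id = dag_id
        self.tasks = {}

    def __enter__(self):
        MiniDag._current = self
        return self

    def __exit__(self, exc_type, exc, tb):
        MiniDag._current = None
        return False


class MiniTask:
    """Toy stand-in for an operator that registers itself on creation."""

    def __init__(self, task_id):
        self.task_id = task_id
        active_dag = MiniDag._current
        if active_dag is not None:
            # Registration happens in the constructor, not through a variable
            active_dag.tasks[task_id] = self


with MiniDag("tutorial") as dag:
    MiniTask("print_date")  # no assignment and no trailing bare expression

print(sorted(dag.tasks))
```

The task is attached when it is constructed inside the `with` block, so a trailing line that only names the task adds nothing.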
When starting with Airflow, I find it very helpful to use the DebugExecutor to run quick tests like this and understand how everything works. If you are using VS Code, you can find an example config file here.
I have a problem with a DAG in Airflow: I've tried changing the start_date twice, to a week before today, and it still doesn't run. The schedule interval is set to '5 9 * * *'.
Here is the code
from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from datetime import datetime, timedelta
# code
default_args = {
    'owner': 'Lucas',
    'email': ['//email'],
    'email_on_failure': True,
    'start_date': datetime(2021, 7, 9),
    'retry_delay': timedelta(minutes=5)
}
with DAG('instagram', default_args=default_args, schedule_interval='5 9 * * *', catchup=False) as dag:
    token = get_token()
    # code
It is really strange because it is not a problem with the DAG itself: I can trigger the DAG manually without any error, and the start_date and schedule_interval seem fine. Any ideas?
The solution was to delete the DAG and create another one with a different name and the same code. I still have no idea what problem in Airflow caused this.
I'm trying to build a DAG with two operators that are created dynamically, depending on the number of "pipelines" in a JSON config file. The contents of this file are stored in the variable dag_datafusion_args. I also have a standard Bash operator, and a task called success at the end that sends a message to Slack saying that the DAG is over. The other two tasks, which are Python operators, are generated dynamically and run in parallel.
I'm using Composer. When I put the DAG in the bucket it appears in the webserver UI, but when I click to view the DAG the following message appears: 'DAG "dag_lucas4" seems to be missing.' If I test the tasks directly from the CLI on the Kubernetes cluster, it works, but I can't get the DAG to show in the web UI. Following a suggestion from some people here on SO, I tried restarting the webserver by installing a Python package. I tried three times, without success. Does anyone know what it could be?
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.operators.python_operator import PythonOperator
from aux_py_files.med.med_airflow_functions import *
from google.cloud import storage
from datetime import timedelta
TEMPLATE_SEARCH_PATH = '/home/airflow/gcs/plugins/'
INDEX=1
default_args = {
    'owner':'lucas',
    'start_date': '2021-01-10',
    'email': ['xxxx'],
    'email_on_failure': False,
    'email_on_success': False,
    'retries': 3,
    'retry_delay': timedelta(minutes=2),
    'on_failure_callback': post_message_fail_to_slack
}
dag_datafusion_args=return_datafusion_config_file('med')
with DAG('dag_lucas4', default_args = default_args, schedule_interval="30 23 * * *", template_searchpath = [TEMPLATE_SEARCH_PATH]) as dag:
    extract_ftp_csv_files_load_in_gcs = BashOperator(
        task_id='extract_ftp_csv_files_load_in_gcs',
        bash_command='aux_sh_files/med/script.sh'
    )
    success = PythonOperator(
        task_id='success',
        python_callable=post_message_success_to_slack,
        op_kwargs={'dag_name':'dag_lucas2'}
    )
    for pipeline,args in dag_datafusion_args.items():
        configure_pipeline=PythonOperator(
            task_id=f'configure_pipeline{str(INDEX)}',
            python_callable=setPipelineArguments,
            op_kwargs={'dag_name':'med', 'pipeline_name':pipeline},
            provide_context=True
        )
        start_pipeline = PythonOperator(
            task_id= f'start_pipeline{str(INDEX)}',
            python_callable=start_pipeline_wrapper,
            op_kwargs={'configure_pipeline_task':f'configure_pipeline{str(INDEX)}'},
            retries=3,
            provide_context=True
        )
        [extract_ftp_csv_files_load_in_gcs,configure_pipeline] >> start_pipeline >> success
        INDEX += 1
It appears that the Airflow webserver in Cloud Composer runs in the tenant project, while the workers and scheduler run in the customer project. The tenant project is a Google-managed environment for some of the Airflow components. Because the webserver does not run inside your project's environment, the web UI doesn't have complete access to your project's resources, so it can't read my config JSON file with return_datafusion_config_file. The best way is to create an ENV variable with that file.
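A minimal sketch of that idea, assuming the JSON content of the config file is stored in an environment variable. The variable name DAG_DATAFUSION_ARGS, the helper load_pipeline_config, and the sample pipelines are all hypothetical; the variable is set inside the script only so the example runs on its own.

```python
import json
import os


def load_pipeline_config(env_var="DAG_DATAFUSION_ARGS"):
    """Hypothetical helper: read the pipeline config from an environment variable."""
    # Fall back to an empty config so the DAG file still parses if the variable is unset
    raw = os.environ.get(env_var, "{}")
    return json.loads(raw)


# Set here only for the demo; in practice it would come from the environment's configuration
os.environ["DAG_DATAFUSION_ARGS"] = json.dumps(
    {"pipeline_a": {"param": 1}, "pipeline_b": {"param": 2}}
)

dag_datafusion_args = load_pipeline_config()

# Same loop shape as in the DAG file, with enumerate instead of a manual INDEX counter
for index, (pipeline, args) in enumerate(dag_datafusion_args.items(), start=1):
    print(f"configure_pipeline{index}", pipeline, args)
```

Since every component that parses the DAG file sees the same value, the dynamically generated tasks would no longer depend on file access from the webserver.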
I'm running Airflow 1.9.0 with LocalExecutor and a PostgreSQL database on a Linux AMI. I want to manually trigger DAGs, but whenever I create a DAG that has schedule_interval set to None or to @once, the webserver tree view crashes with the following error (I only show the last call):
File "/usr/local/lib/python2.7/site-packages/croniter/croniter.py", line 467, in expand
raise CroniterBadCronError(cls.bad_length)
CroniterBadCronError: Exactly 5 or 6 columns has to be specified for iteratorexpression.
Furthermore, when I manually trigger the DAG, a DAG run starts but the tasks themselves are never scheduled. I've looked around, but it seems that I'm the only one with this type of error. Has anyone encountered this error before and found a fix?
Minimal example triggering the problem:
import datetime as dt
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
default_args = {
    'owner': 'me'
}
bash_command = """
echo "this is a test task"
"""
with DAG('schedule_test',
         default_args=default_args,
         start_date = dt.datetime(2018, 7, 24),
         schedule_interval='None',
         catchup=False
         ) as dag:
    first_task = BashOperator(task_id = "first_task", bash_command = bash_command)
Try this:
Set your schedule_interval to None without the '', or simply do not specify schedule_interval in your DAG. It is set to None as a default. More information on that here: airflow docs -- search for schedule_interval
Set orchestration for your tasks at the bottom of the dag.
Like so:
# Import the class (not the module) so that datetime(2018, 7, 24) is callable
from datetime import datetime
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.operators.dummy_operator import DummyOperator
default_args = {
    'owner': 'me'
}
bash_command = """
echo "this is a test task"
"""
with DAG('schedule_test',
         default_args=default_args,
         start_date = datetime(2018, 7, 24),
         schedule_interval=None,
         catchup=False
         ) as dag:
    t1 = DummyOperator(
        task_id='extract_data',
        dag=dag
    )
    t2 = BashOperator(
        task_id = "first_task",
        bash_command = bash_command
    )
    #####ORCHESTRATION#####
    ## It is saying that in order for t2 to run, t1 must be done.
    t2.set_upstream(t1)
The None value should not be in quotes.
It should be like this:
schedule_interval=None
Here is the documentation link: https://airflow.apache.org/docs/apache-airflow/1.10.1/scheduler.html#:~:text=Note%3A%20Use%20schedule_interval%3DNone%20and%20not%20schedule_interval%3D%27None%27%20when%20you%20don%E2%80%99t%20want%20to%20schedule%20your%20DAG
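To see why the quoted value triggers the croniter error shown in the question, here is a small plain-Python sketch. It only mirrors the rule stated in the error message (a cron expression needs 5 or 6 columns). The helper looks_like_cron is hypothetical, is not croniter's actual code, and does not cover presets such as @once.

```python
def looks_like_cron(schedule):
    """Hypothetical check: a cron expression has 5 or 6 whitespace-separated columns."""
    return isinstance(schedule, str) and len(schedule.split()) in (5, 6)


for value in ("5 9 * * *", "None", None):
    # The string 'None' is a single column, so it cannot be parsed as a cron expression
    print(repr(value), looks_like_cron(value))
```

The string 'None' is treated as a schedule expression with one column, while the real None value means there is no schedule at all.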