Apache Airflow DAG with single task - python

I'm new to Apache Airflow.
There are a lot of examples of basic DAGs on the Internet.
Unfortunately, I didn't find any examples of single-task DAGs.
Most DAG examples contain a bitshift operator at the end of the .py script, which defines the task order.
For example:
# ...our DAG's code...
task1 >> task2 >> task3
But what if my DAG has just a single task at the moment?
My question is: do I need to reference this single task's name at the end of the Python file?
Or, if there is only one task in scope, will Airflow handle it by itself, making the last line of the code below redundant?
from datetime import timedelta
from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.utils.dates import days_ago
default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'email': ['airflow@example.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
}
with DAG(
    'tutorial',
    default_args=default_args,
    description='A simple tutorial DAG',
    schedule_interval=timedelta(days=1),
    start_date=days_ago(2),
    tags=['example'],
) as dag:
    t1 = BashOperator(
        task_id='print_date',
        bash_command='date',
    )
    t1  # IS THIS LINE OF CODE NECESSARY?

The answer is NO, you don't need to include the last line. You could also skip the assignment to the variable t1, leaving the DAG like this:
with DAG(
    'tutorial',
    default_args=default_args,
    description='A simple tutorial DAG',
    schedule_interval=timedelta(days=1),
    start_date=days_ago(2),
    tags=['example'],
) as dag:
    BashOperator(
        task_id='print_date',
        bash_command='date',
    )
The reason to assign an instance of an Operator (such as BashOperator) to a variable (called a Task in this scope) is the same as for any other object in OOP. In your example no other operation is performed on the t1 variable (you are not reading it or calling any method on it), so there is no reason to declare it.
When starting with Airflow, I find it very helpful to use the DebugExecutor to run quick tests like this and understand how everything works. If you are using VS Code, you can find an example config file here.
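To see why neither the variable nor the trailing bare name matters, it helps to remember that an operator created inside a `with DAG(...)` block is attached to that DAG when it is constructed. The sketch below is only a toy model of that idea in plain Python: `ToyDAG` and `ToyOperator` are hypothetical stand-ins written for this illustration, not Airflow classes, and Airflow's real implementation is more involved.

```python
class ToyDAG:
    """Toy stand-in for a DAG: collects tasks created inside its `with` block."""

    _current = None  # the ToyDAG whose `with` block is currently active

    def __init__(self, dag_id):
        self.dag_id = dag_id
        self.tasks = {}

    def __enter__(self):
        ToyDAG._current = self
        return self

    def __exit__(self, exc_type, exc, tb):
        ToyDAG._current = None
        return False  # do not swallow exceptions


class ToyOperator:
    """Toy stand-in for an operator: registers itself when it is constructed."""

    def __init__(self, task_id):
        self.task_id = task_id
        if ToyDAG._current is not None:
            ToyDAG._current.tasks[task_id] = self


with ToyDAG("tutorial") as dag:
    # No variable and no trailing bare name: construction alone registers the task.
    ToyOperator(task_id="print_date")

print(sorted(dag.tasks))  # ['print_date']
```

A variable only becomes useful once you need to refer to the task again, for example to write `t1 >> t2`.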

Related

Airflow is failing my DAG when I use external scripts giving ModuleNotFoundError: No module named

I am new to Airflow, and I am trying to automate the scheduling of a Python pipeline. My project youtubecollection01 uses custom modules, so when I run the DAG it fails with ModuleNotFoundError: No module named 'Authentication'.
This is how my project is structured:
This is my dag file:
# This initializes the file as a DAG file
from airflow import DAG
from datetime import datetime, timedelta
from airflow.operators.python import PythonOperator
# from airflow.utils.dates import days_ago
from youtubecollectiontier01.src.__main__ import main
default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    # 'start_date': days_ago(1),
    'email': ['airflow@example.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
    # 'priority_weight': 10,
    # 'end_date': datetime(2016, 1, 1),
}
# curate dag
with DAG('collect_layer_01', start_date=datetime(2022,7,25),
         schedule_interval='@daily', catchup=False, default_args=default_args) as dag:
    curate = PythonOperator(
        task_id='collect_tier_01',  # name for the task you would like to execute
        python_callable=main,  # the name of your python function
        provide_context=True,
        dag=dag)
I am importing the main function from __main__.py; however, inside main I import other classes from Authentication.py, ChannelClass.py and Common.py, and those are the imports Airflow does not recognize.
Why are the imports failing? Is it a directory issue or an Airflow issue? I tried moving the project under plugins and running it, but it did not work. Any feedback would be highly appreciated!
Thank you!
Up until the last part, you got everything set up according to the tutorials! Also, thank you for a well-documented question.
If you have not changed the PYTHONPATH for Airflow, you can check the defaults with:
$ airflow info
In the paths info part, you get "airflow_home", "system_path", "python_path" and "airflow_on_path".
Within "python_path", you'll see that Airflow is set up to check everything inside the /dags, /plugins and /config folders.
More about this topic is in the documentation page called "Module Management".
Now, I think the problem with your code can be fixed with a small change.
In your main code you import:
from Authentication import Authentication
In a default setup, Airflow doesn't know where that is!
If you import it this way instead:
from youtubecollectiontier01.src.Authentication import Authentication
just like you did in the DAG file, I believe it will work. The same goes for your other classes: ChannelClass, Common, etc.
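The difference between the two import styles can be reproduced without Airflow. The sketch below builds a throwaway folder that plays the role of the dags folder and puts only that folder on sys.path. The package name demo_project is hypothetical and exists only for this illustration; only the Authentication module name comes from the question.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Throwaway folder standing in for the dags folder.
dags_dir = Path(tempfile.mkdtemp())
src = dags_dir / "demo_project" / "src"  # hypothetical package layout
src.mkdir(parents=True)
(dags_dir / "demo_project" / "__init__.py").write_text("")
(src / "__init__.py").write_text("")
(src / "Authentication.py").write_text("class Authentication:\n    pass\n")

# Only the dags folder itself is on sys.path, not demo_project/src.
sys.path.insert(0, str(dags_dir))
importlib.invalidate_caches()

# Bare import: no sys.path entry contains Authentication.py directly, so it fails.
try:
    importlib.import_module("Authentication")
    bare_import_ok = True
except ModuleNotFoundError:
    bare_import_ok = False

# Fully qualified import: resolved starting from the folder that is on sys.path.
module = importlib.import_module("demo_project.src.Authentication")
qualified_import_ok = hasattr(module, "Authentication")

print(bare_import_ok, qualified_import_ok)  # False True
```

The bare import would only work if the src folder itself were added to the Python path, which is the other possible fix.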
Waiting to hear from you!

Airflow replacing single quotes by None on PythonOperator

While developing some code on Airflow, I saw that all my PythonOperator task parameters that were '' (an empty string, written as two single quotes) are being replaced with None when passed to the python_callable.
To reproduce, take the following function (which will be the python_callable):
def print_something(something):
    print('Something: ', something)
And then, the following DAG and Task:
from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.utils.dates import days_ago
with DAG(
    dag_id='print_test',
    tags=['Test'],
    start_date=days_ago(1),
    schedule_interval=None,
    default_args={'owner': 'rand'},
    catchup=False,
    render_template_as_native_obj=True,
) as dag:
    print_task = PythonOperator(
        task_id='task_print_test',
        dag=dag,
        python_callable=print_something,
        op_kwargs={'something': {'test': ''}}
    )
    print_task
When I go to the task execution logs:
[2022-07-13, 12:04:14 -03] {logging_mixin.py:115} INFO - Something: {'test': None}
So Airflow is actually replacing the empty strings with None values.
Is there any way to prevent this?
Yes, there is. Remove this DAG setting:
render_template_as_native_obj=True
It makes Airflow render templated arguments (such as op_kwargs) as native Python objects instead of strings, and an empty string rendered that way comes back as None.
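As a rough illustration of why the empty string turns into None, here is a simplified imitation of native-object rendering in plain Python. It is an assumption-laden sketch, not the templating engine's actual code: the hypothetical function `native_like` assumes that an empty rendered result collapses to None and that any other result is parsed as a Python literal where possible.

```python
import ast


def native_like(rendered):
    """Simplified imitation of native-object rendering (illustrative only)."""
    if not rendered:
        # Nothing was rendered, so there is no literal to evaluate.
        return None
    try:
        # Try to turn the rendered text into a Python object.
        return ast.literal_eval(rendered)
    except (ValueError, SyntaxError):
        # Not a Python literal: keep the rendered text as a string.
        return rendered


print(native_like(""))        # None
print(native_like("[1, 2]"))  # [1, 2]
print(native_like("hello"))   # hello
```

If you need native rendering for other arguments, one option is to handle None inside the callable, for example with `something.get('test') or ''`.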

Get DAG Email from another DAG in Airflow

I want to get the email addresses listed in this DAG's default args from another DAG in Airflow. How can I do that? Please help, I am new to Airflow!
from airflow.models import DagRun
from airflow.operators.python_operator import PythonOperator
from datetime import datetime, timedelta
from airflow import DAG
def first_function(**context):
    print("hello")
default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'email': ['example@gmail.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5)
}
with DAG(
    'main',
    default_args=default_args,
    description='Sample DAG',
    schedule_interval=timedelta(days=1),
    start_date=datetime(2022,6,10),
    catchup=False
) as dag:
    first_function = PythonOperator(
        task_id="first_function",
        python_callable=first_function,
    )
    first_function
You can use a custom module in Airflow to share config options, operators, or any arbitrary Python code across DAGs.
Typically you would create a directory inside your DAGs directory (which by default is {AIRFLOW_HOME}/dags).
To share default_args between DAGs, you could create the following layout:
Create {AIRFLOW_HOME}/dags/custom/__init__.py
Create {AIRFLOW_HOME}/dags/custom/shared_config.py
Create {AIRFLOW_HOME}/dags/.airflowignore
Add the directory name custom to the first line of .airflowignore.
Cut and paste your default_args dictionary from your DAG into {AIRFLOW_HOME}/dags/custom/shared_config.py
You can see this layout suggested in the Airflow documentation here.
The .airflowignore tells the scheduler to skip the custom directory when it parses your DAGs (which by default happens every 30s). Because the custom directory contains Python but never any DAGs, the scheduler should skip these files to avoid unnecessary load. This is explained in the documentation link above.
You need to add an __init__.py to the custom directory. Airflow requires it, even though Python 3 itself does not because of implicit namespace packages (again, this is explained in the same link above).
From your dag you can then import as needed:
from custom.shared_config import default_args
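As a sketch of what the shared module might contain, the block below shows a possible custom/shared_config.py holding the default_args from the question. The helper `alert_emails` is a hypothetical convenience added for illustration, not something Airflow provides.

```python
# Possible contents of {AIRFLOW_HOME}/dags/custom/shared_config.py
from datetime import timedelta

# Shared by every DAG that imports this module.
default_args = {
    "owner": "airflow",
    "depends_on_past": False,
    "email": ["example@gmail.com"],
    "email_on_failure": False,
    "email_on_retry": False,
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
}


def alert_emails(args=None):
    """Return a copy of the configured e-mail recipients (hypothetical helper)."""
    args = default_args if args is None else args
    return list(args.get("email", []))


print(alert_emails())  # ['example@gmail.com']
```

Both DAGs can then import default_args from custom.shared_config, and the second DAG reads the addresses from default_args['email'] without having to inspect the first DAG at all.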

How to parse nested macros in Airflow

I'm currently facing a challenge in terms of parsing nested macros. Below is my DAG File
from airflow import DAG
from airflow.utils.dates import days_ago
from datetime import timedelta
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.python_operator import PythonOperator
from airflow.models import Variable
from apty.utils.date import date_ref_now
default_args = {
    "owner": "Akhil",
    "depends_on_past": False,
    "start_date": days_ago(0),
    "email_on_failure": False,
    "email_on_retry": False,
    "retries": 0,
    "retry_delay": timedelta(minutes=5),
}
dag = DAG(
    "user_sample",
    default_args=default_args,
    description="test",
    schedule_interval=None,
    catchup=False,
)
def sample_app(hello=None):
    return hello
extra_attrs = {"date_stamp": "{{ds}}",
               "foo": "bar"}
start = DummyOperator(task_id="start", dag=dag)
python = PythonOperator(
    python_callable=sample_app,
    task_id="mid",
    dag=dag,
    params={"date_stamp": extra_attrs["date_stamp"]},
    op_kwargs={"hello": "{{params.date_stamp}}"},
)
start >> python
I have a scenario where I need to pass {{ds}} as one of the parameters to my operator, after which I'll use that parameter as I wish, passing it either as op_kwargs or op_args. (I have used the PythonOperator as an example, but I would be using my own custom operator.)
To be clear, {{ds}} is passed as a parameter value only; I don't want it written anywhere else, i.e. in op_kwargs as in this example.
When I run it, the return value from the Python callable is the literal {{ds}} rather than the current date stamp.
Please help me out.
Template or macro variables are only rendered for parameters that are listed in template_fields on the operator class in use. The exact list depends on the version and implementation of Airflow you're using, but here is the definition for the PythonOperator at the time of writing: https://github.com/apache/airflow/blob/98896e4e327f256fd04087a49a13e16a246022c9/airflow/operators/python.py#L72. Since, as you say, you control the operator in question, you can list any fields you want in the class definition's template_fields. (This all assumes your class inherits from BaseOperator.)
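The effect of template_fields can be pictured with a toy model in plain Python. Everything here is hypothetical and only for illustration: `ToyOperator` is not an Airflow class, and a small regex substitution stands in for the real Jinja rendering. The point it shows is that only attributes named in template_fields get rendered, while everything else keeps its literal text.

```python
import re


def render(value, context):
    """Replace {{ name }} placeholders with context values (single pass, toy version)."""
    return re.sub(r"\{\{\s*(\w+)\s*\}\}", lambda m: str(context[m.group(1)]), value)


class ToyOperator:
    # Only the attributes named here are rendered before execution.
    template_fields = ("date_stamp",)

    def __init__(self, date_stamp, label):
        self.date_stamp = date_stamp
        self.label = label

    def render_template_fields(self, context):
        for name in self.template_fields:
            setattr(self, name, render(getattr(self, name), context))


op = ToyOperator(date_stamp="{{ ds }}", label="{{ ds }}")
op.render_template_fields({"ds": "2022-07-13"})

print(op.date_stamp)  # 2022-07-13
print(op.label)       # {{ ds }}
```

In a real custom operator, the equivalent step would be adding the attribute name that receives {{ds}} to the class's template_fields.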

Airflow ExternalTaskSensor Stuck

I am trying to get the Airflow ExternalTaskSensor to work, but so far I have not been able to get it to complete. It always seems to get stuck running and never finishes, so the DAG cannot move on to the next task.
Here is the code I am using to test:
DEFAULT_ARGS = {
    'owner': 'NAME',
    'depends_on_past': False,
    'start_date': datetime(2019, 9, 9),
    'email': ['airflow@example.com'],
    'email_on_failure': False,
    'email_on_retry': False
}
external_watch_dag = DAG(
    'DAG-External_watcher-Test',
    default_args=DEFAULT_ARGS,
    dagrun_timeout=timedelta(hours=1),
    schedule_interval=None
)
start_op = DummyOperator(
    task_id='start_op',
    dag=external_watch_dag
)
trigger_external = TriggerDagRunOperator(
    task_id='trigger_external',
    trigger_dag_id='DAG-Dummy',
    dag=external_watch_dag
)
external_watch_op = ExternalTaskSensor(
    task_id='external_watch_op',
    external_dag_id='DAG-Dummy',
    external_task_id='dummy_task',
    check_existence=True,
    execution_delta=timedelta(minutes=-1),
    # execution_date_fn=datetime(2019, 9, 25),
    execution_timeout=timedelta(minutes=30),
    dag=external_watch_dag
)
end_op = DummyOperator(
    task_id='end_op',
    dag=external_watch_dag
)
start_op >> trigger_external >> external_watch_op >> end_op
# start_op >> [external_watch_op, trigger_external]
# external_watch_op >> end_op
# Below is the setup for the dummy DAG that is called above by the Trigger and watched by the TaskSensor
dummy_dag = DAG(
    'DAG-Dummy',
    default_args=DEFAULT_ARGS,
    dagrun_timeout=timedelta(hours=1),
    schedule_interval=None
)
dummy_task = BashOperator(
    task_id='dummy_task',
    bash_command='sleep 10',
    dag=dummy_dag
)
I have tried tweaking this code a number of ways but have not gotten any success with the ExternalTaskSensor.
Does anyone know how to solve this problem and get the ExternalTaskSensor to work properly? I have also read that issues can arise from scheduling intervals when using the ExternalTaskSensor. Is it possible that part of the issue is that both DAGs have schedule_interval=None?
I had gotten this to work with both DAGs set to the exact same schedule_interval, but that will not work in production. The goal is to have the main DAG, external-watch-dag, on a regular schedule and have it trigger DAG-Dummy during its run, with DAG-Dummy itself having schedule_interval=None.
Any help is greatly appreciated.
By default the ExternalTaskSensor monitors the external_dag_id run with the same execution date as the sensor DAG. With execution_delta you can set a time delta between the sensor DAG and the external DAG so that it looks for the correct execution_date to monitor. This works great when both DAGs run on a schedule, because you know this time delta exactly.
The problem: when a DAG is triggered manually or by another DAG, you cannot know for sure the exact execution date of either of these two DAGs.
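The date matching can be sketched with plain datetime arithmetic. This is a simplified model with a hypothetical helper name, `watched_execution_date`, and a made-up trigger time; it assumes the sensor looks for the external run whose execution date equals its own execution date minus execution_delta.

```python
from datetime import datetime, timedelta


def watched_execution_date(sensor_execution_date, execution_delta=None):
    """Execution date the sensor looks for in the external DAG (simplified model)."""
    if execution_delta is None:
        return sensor_execution_date
    return sensor_execution_date - execution_delta


sensor_run = datetime(2019, 9, 25, 10, 0, 0)
# A triggered run gets whatever timestamp the trigger fired at (made-up value).
triggered_run = datetime(2019, 9, 25, 10, 0, 7)

# With execution_delta=timedelta(minutes=-1) the sensor looks one minute ahead.
looked_for = watched_execution_date(sensor_run, timedelta(minutes=-1))
print(looked_for)                   # 2019-09-25 10:01:00
print(looked_for == triggered_run)  # False: the sensor never finds the run

# If the trigger passes its own execution date on, the dates line up.
pinned_run = sensor_run
print(watched_execution_date(sensor_run) == pinned_run)  # True
```

Because the sensor compares exact timestamps, a guessed delta almost never matches a run that was triggered at an arbitrary moment.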
The solution: because you are using the TriggerDagRunOperator, you can set its execution_date parameter. This makes sure that the execution dates of your DAG and of the external DAG are the same. From the docs:
execution_date (str or datetime.datetime) – Execution date for the dag (templated)
So your code will look like this:
trigger_external = TriggerDagRunOperator(
    task_id='trigger_external',
    trigger_dag_id='DAG-Dummy',
    dag=external_watch_dag,
    execution_date="{{ execution_date }}",  # Use the template to get the current execution date
)
external_watch_op = ExternalTaskSensor(
    task_id='external_watch_op',
    external_dag_id='DAG-Dummy',
    external_task_id='dummy_task',
    check_existence=True,
    execution_timeout=timedelta(minutes=30),
    dag=external_watch_dag
)
