I'm having trouble parsing nested macros. Below is my DAG file:
from airflow import DAG
from airflow.utils.dates import days_ago
from datetime import timedelta
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.python_operator import PythonOperator
from airflow.models import Variable
from apty.utils.date import date_ref_now
default_args = {
"owner": "Akhil",
"depends_on_past": False,
"start_date": days_ago(0),
"email_on_failure": False,
"email_on_retry": False,
"retries": 0,
"retry_delay": timedelta(minutes=5),
}
dag = DAG(
"user_sample",
default_args=default_args,
description="test",
schedule_interval=None,
catchup=False,
)
def sample_app(hello=None):
    return hello
extra_attrs = {"date_stamp":"{{ds}}",
"foo":"bar"}
start = DummyOperator(task_id="start", dag=dag)
python = PythonOperator(
python_callable=sample_app,
task_id="mid",
dag=dag,
params={"date_stamp": extra_attrs["date_stamp"]},
op_kwargs={"hello": "{{params.date_stamp}}"},
)
start >> python
I need to pass {{ds}} as one of the parameters to my operator and then use that parameter as I wish, passing it on through either op_kwargs or op_args. (I have used PythonOperator as an example, but I will be using my own custom operator.)
To be clear, {{ds}} should appear only as a parameter value; I don't want to write it anywhere else, e.g. directly in op_kwargs in this example.
When I run the DAG, the return value from the Python callable is the literal string {{ds}} rather than the current date stamp.
Please help me out.
Template or macro variables are only rendered for attributes that are listed in template_fields on the operator class in use. Which attributes those are depends on the specific version and implementation of Airflow you're using, but here's the latest for the PythonOperator: https://github.com/apache/airflow/blob/98896e4e327f256fd04087a49a13e16a246022c9/airflow/operators/python.py#L72. Since, as you say, you control the operator in question, you can list any fields you want in the class definition's template_fields. (This all assumes your class inherits from BaseOperator.)
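To make the mechanism concrete, here is a toy model in plain Python. It is not Airflow's implementation: ToyOperator and render_once are hypothetical names, and the regex stands in for Jinja. It only illustrates two points: attributes not named in template_fields are left untouched, and rendering happens in a single pass, so a value that itself contains {{ ds }} comes out as the literal string.

```python
import re


def render_once(text, context):
    """Replace each {{ name }} with its context value in a single pass."""
    return re.sub(r"\{\{\s*(\w+)\s*\}\}", lambda m: str(context[m.group(1)]), text)


class ToyOperator:
    """Toy stand-in for an operator class; not Airflow's BaseOperator."""

    # Only attributes named here are rendered, mirroring the role of template_fields.
    template_fields = ("date_stamp",)

    def __init__(self, date_stamp, note):
        self.date_stamp = date_stamp
        self.note = note

    def render(self, context):
        """Render every templated attribute exactly once against the context."""
        for name in self.template_fields:
            setattr(self, name, render_once(getattr(self, name), context))


op = ToyOperator(date_stamp="{{ ds }}", note="{{ ds }}")
op.render({"ds": "2022-07-25"})
print(op.date_stamp)  # 2022-07-25 (listed in template_fields)
print(op.note)        # {{ ds }}   (not listed, so never rendered)

# A template whose value is itself a template is not rendered a second time.
nested = render_once("{{ stamp }}", {"stamp": "{{ ds }}", "ds": "2022-07-25"})
print(nested)         # {{ ds }}
```

In a real custom operator the equivalent step would be adding the attribute name to template_fields on your BaseOperator subclass, so that {{ds}} is rendered directly in that attribute instead of being routed through another template.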
Related
I am new to Airflow, and I am trying to automate the scheduling of a Python pipeline. My project youtubecollection01 uses custom modules, and when I run the DAG it fails with ModuleNotFoundError: No module named 'Authentication'.
This is how my project is structured:
This is my dag file:
# This initializes the file as a DAG file
from airflow import DAG
from datetime import datetime, timedelta
from airflow.operators.python import PythonOperator
# from airflow.utils.dates import days_ago
from youtubecollectiontier01.src.__main__ import main
default_args = {
'owner': 'airflow',
'depends_on_past': False,
# 'start_date': days_ago(1),
'email': ['airflow@example.com'],
'email_on_failure': False,
'email_on_retry': False,
'retries': 1,
'retry_delay': timedelta(minutes=5),
# 'priority_weight': 10,
# 'end_date': datetime(2016, 1, 1),
}
# curate dag
with DAG('collect_layer_01', start_date=datetime(2022,7,25),
         schedule_interval='@daily', catchup=False, default_args=default_args) as dag:
    curate = PythonOperator(
        task_id='collect_tier_01', # name for the task you would like to execute
        python_callable=main, # the name of your python function
        provide_context=True,
        dag=dag)
I am importing the main function from __main__.py; however, inside main I import other classes from Authentication.py, ChannelClass.py and Common.py, and that is where Airflow fails to recognize them.
Why is it failing on the imports? Is it a directory issue or an Airflow issue? I tried moving the project under plugins and running it, but it did not work. Any feedback would be highly appreciated!
Thank you!
Up until the last part, you got everything set up according to the tutorials! Also, thank you for a well-documented question.
If you have not changed the PYTHONPATH for Airflow, you can check the default with:
$ airflow info
In the paths info section, you get "airflow_home", "system_path", "python_path" and "airflow_on_path".
Within "python_path", you'll see that Airflow is set up to check everything inside the /dags, /plugins and /config folders.
There is more about this topic in the documentation page called "Module Management".
Now, I think the problem with your code can be fixed with a small change.
In your main code you import:
from Authentication import Authentication
In a default setup, Airflow doesn't know where that is!
Import it this way instead:
from youtubecollectiontier01.src.Authentication import Authentication
This is just like the import you did in the DAG file, and I believe it will work. The same goes for your other classes: ChannelClass, Common, etc.
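If it helps to see the effect in isolation, here is a small self-contained sketch that needs no Airflow at all. It builds a throwaway package in a temp directory (demo_pkg is a hypothetical stand-in for your project, not your real layout) and puts only the root folder on sys.path, much like the dags folder in a default setup. The bare module name cannot be resolved, while the package-qualified name can.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Throwaway layout: <root>/demo_pkg/src/Authentication.py
# (demo_pkg is a hypothetical stand-in for the project package).
root = Path(tempfile.mkdtemp())
src = root / "demo_pkg" / "src"
src.mkdir(parents=True)
(root / "demo_pkg" / "__init__.py").write_text("")
(src / "__init__.py").write_text("")
(src / "Authentication.py").write_text("class Authentication:\n    pass\n")

# Only the root is importable, similar to how only the dags folder is on the path.
sys.path.insert(0, str(root))
importlib.invalidate_caches()

try:
    importlib.import_module("Authentication")
    bare_import_ok = True
except ModuleNotFoundError:
    # src/ itself is not on sys.path, so the bare name cannot be found.
    bare_import_ok = False

# The fully qualified name is resolved relative to the root folder.
module = importlib.import_module("demo_pkg.src.Authentication")

print("bare import worked:", bare_import_ok)
print("qualified import found:", module.Authentication.__name__)
```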
Waiting to hear from you!
I want to get the email mentioned in this DAG's default args using another DAG in the airflow. How can I do that? Please help, I am new to airflow!
from airflow.models import DagRun
from airflow.operators.python_operator import PythonOperator
from datetime import datetime, timedelta
from airflow import DAG
def first_function(**context):
    print("hello")
default_args = {
'owner': 'airflow',
'depends_on_past': False,
'email': ['example@gmail.com'],
'email_on_failure': False,
'email_on_retry': False,
'retries': 1,
'retry_delay': timedelta(minutes=5)
}
with DAG(
    'main',
    default_args=default_args,
    description='Sample DAG',
    schedule_interval=timedelta(days=1),
    start_date=datetime(2022,6,10),
    catchup=False
) as dag:
    first_function = PythonOperator(
        task_id="first_function",
        python_callable=first_function,
    )
    first_function
You can use a custom module in Airflow to share config options, operators, or any arbitrary Python code across DAGs.
Typically you would create a directory inside your DAGs directory (which by default is {AIRFLOW_HOME}/dags).
To share default_args between DAGs, you could create the following layout:
Create {AIRFLOW_HOME}/dags/custom/__init__.py
Create {AIRFLOW_HOME}/dags/custom/shared_config.py
Create {AIRFLOW_HOME}/dags/.airflowignore
Add the directory name custom to the first line of .airflowignore.
Cut and paste your default_args dictionary from your DAG into {AIRFLOW_HOME}/dags/custom/shared_config.py
You can see this layout suggested in the Airflow documentation here.
The .airflowignore tells the scheduler to skip the custom directory when it parses your DAGs (which by default happens every 30s) - because the custom directory contains Python, but never any DAGs, the scheduler should skip these files to avoid adding unnecessary load to the scheduler. This is explained in the documentation link above.
You need to add an __init__.py to the custom directory - airflow requires it even though when writing in Python3 you don't need it because of implicit namespaces (again this is explained in the same link above).
From your dag you can then import as needed:
from custom.shared_config import default_args
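As a rough sketch of what custom/shared_config.py could contain, using the values from your DAG: build_default_args is a hypothetical helper I am adding for illustration, not something Airflow provides. It hands each DAG its own copy, so one DAG changing a value (or appending to the email list) does not silently affect the others.

```python
import copy
from datetime import timedelta

# Shared settings, moved out of the individual DAG file.
default_args = {
    "owner": "airflow",
    "depends_on_past": False,
    "email": ["example@gmail.com"],
    "email_on_failure": False,
    "email_on_retry": False,
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
}


def build_default_args(**overrides):
    """Return an independent copy of default_args with optional overrides.

    Hypothetical helper: the deep copy keeps nested values such as the
    email list from being shared between DAGs.
    """
    args = copy.deepcopy(default_args)
    args.update(overrides)
    return args


# What a second DAG file might do after importing from custom.shared_config.
other_dag_args = build_default_args(retries=3)
print(other_dag_args["email"])   # ['example@gmail.com']
print(default_args["retries"])   # 1, the shared value is unchanged
```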
I'm a newbie to Apache Airflow.
There are a lot of examples of basic DAGs on the Internet.
Unfortunately, I didn't find any examples of single-task DAGs.
Most DAG examples contain a bitshift operator at the end of the .py script, which defines the task order.
For example:
# ...our DAG's code...
task1 >> task2 >> task3
But what if my DAG has just a single task at the moment?
My question is: do I need to put this single task's name at the end of the Python file?
Or, if there is only one task in scope, will Airflow handle it itself, making the last line of the code below redundant?
from datetime import timedelta
from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.utils.dates import days_ago
default_args = {
'owner': 'airflow',
'depends_on_past': False,
'email': ['airflow@example.com'],
'email_on_failure': False,
'email_on_retry': False,
'retries': 1,
}
with DAG(
    'tutorial',
    default_args=default_args,
    description='A simple tutorial DAG',
    schedule_interval=timedelta(days=1),
    start_date=days_ago(2),
    tags=['example'],
) as dag:
    t1 = BashOperator(
        task_id='print_date',
        bash_command='date',
    )
    t1 # IS THIS LINE OF CODE NECESSARY?
The answer is NO, you don't need to include the last line. You could also skip the assignment to the variable t1, leaving the DAG like this:
with DAG(
    'tutorial',
    default_args=default_args,
    description='A simple tutorial DAG',
    schedule_interval=timedelta(days=1),
    start_date=days_ago(2),
    tags=['example'],
) as dag:
    BashOperator(
        task_id='print_date',
        bash_command='date',
    )
The reason to assign an instance of an Operator (such as BashOperator) to a variable (called a Task in this scope) is similar to that for any other object in OOP. In your example no other operation is performed on the t1 variable (you are not reading it or calling any method on it), so there is no reason to declare it.
When starting with Airflow, I think it is very helpful to use the DebugExecutor to perform quick tests like this and understand how everything works. If you are using VS Code you can find an example config file here.
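If you are wondering why the task still ends up in the DAG without being referenced again, the idea can be modelled in a few lines of plain Python. This is a simplified sketch, not Airflow's actual code (ToyDag and ToyOperator are hypothetical names): the with block records which DAG is currently open, and the operator attaches itself to that DAG when it is constructed.

```python
class ToyDag:
    """Simplified stand-in for a DAG used as a context manager."""

    _current = None  # the DAG whose `with` block is currently open, if any

    def __init__(self, dag_id):
        self.dag_id = dag_id
        self.tasks = {}

    def __enter__(self):
        ToyDag._current = self
        return self

    def __exit__(self, exc_type, exc, tb):
        ToyDag._current = None
        return False


class ToyOperator:
    """Simplified stand-in for an operator."""

    def __init__(self, task_id):
        self.task_id = task_id
        # Registration happens here, at construction time, so no variable
        # or trailing bare expression is needed afterwards.
        if ToyDag._current is not None:
            ToyDag._current.tasks[task_id] = self


with ToyDag("tutorial") as dag:
    ToyOperator(task_id="print_date")

print(list(dag.tasks))  # ['print_date']
```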
I am trying to trigger a DAG from another DAG (target_dag from parent_dag).
My parent_dag code is:
from datetime import datetime
from airflow.models import DAG
from airflow.operators.python_operator import PythonOperator
from airflow.models import Variable
from airflow.operators.trigger_dagrun import TriggerDagRunOperator
def read_metadata(**kwargs):
    asqldb_kv = Variable.get("asql_kv")
    # Perform some operations based on asqldb_kv and populate the
    # result_dictionary list (that logic is elided; empty placeholder here).
    result_dictionary = []
    if len(result_dictionary) > 0:
        my_var = Variable.set("num_runs", len(result_dictionary))
        ti = kwargs['ti']
        ti.xcom_push(key='res', value=result_dictionary)
default_args = {
'start_date': datetime(year=2021, month=6, day=19),
'provide_context': True
}
with DAG(
    dag_id='parent_dag',
    default_args=default_args,
    schedule_interval='@once',
    description='Test Trigger DAG'
) as dag:
    trigger = TriggerDagRunOperator(
        task_id="test_trigger_dagrun",
        python_callable=read_metadata,
        trigger_dag_id="target_dag"
    )
I am getting the below error :
airflow.exceptions.AirflowException: Invalid arguments were passed to TriggerDagRunOperator (task_id: test_trigger_dagrun). Invalid arguments were:
**kwargs: {'python_callable': <function read_metadata at 0x7ff5f4159620>}
Any help appreciated.
Edit :
python_callable is no longer accepted by TriggerDagRunOperator in Airflow 2.0.
My requirement is :
I need to access Azure Synapse and get a variable (Say 3). Based on retrieved variable, I need to create tasks dynamically. Say, if Synapse has 3 , then I need to create 3 tasks.
My idea was :
DAG 1 - Access Azure synapse and get Variable. Update this to Airflow Variable. Trigger DAG2 using TriggerDagRunOperator.
DAG 2 - Create tasks depending on the Airflow Variable updated in DAG 1.
Any input on how I can achieve this?
Consider the following example of a DAG where the first task, get_id_creds, extracts a list of credentials from a database. This operation tells me what users in my database I am able to run further data preprocessing on and it writes those ids to the file /tmp/ids.txt. I then scan those ids into my DAG and use them to generate a list of upload_transaction tasks that can be run in parallel.
My question is: is there a more idiomatic, dynamic way to do this using Airflow? What I have here feels clumsy and brittle. How can I directly pass a list of valid IDs from one process to the one that defines the subsequent downstream processes?
from datetime import datetime, timedelta
import os
import sys
from airflow.models import DAG
from airflow.operators.python_operator import PythonOperator
import ds_dependencies
SCRIPT_PATH = os.getenv('DASH_PREPROC_PATH')
if SCRIPT_PATH:
    sys.path.insert(0, SCRIPT_PATH)
    import dash_workers
else:
    print('Define DASH_PREPROC_PATH value in environmental variables')
    sys.exit(1)
default_args = {
'start_date': datetime.now(),
'schedule_interval': None
}
DAG = DAG(
dag_id='dash_preproc',
default_args=default_args
)
get_id_creds = PythonOperator(
task_id='get_id_creds',
python_callable=dash_workers.get_id_creds,
provide_context=True,
dag=DAG)
with open('/tmp/ids.txt', 'r') as infile:
    ids = infile.read().splitlines()
for uid in ids:
    upload_transactions = PythonOperator(
        task_id=uid,
        python_callable=dash_workers.upload_transactions,
        op_args=[uid],
        dag=DAG)
    # get_id_creds has to run before each upload task
    upload_transactions.set_upstream(get_id_creds)
Per @Juan Riza's suggestion I checked out this link: Proper way to create dynamic workflows in Airflow. This was pretty much the answer, although I was able to simplify the solution enough that I thought I would offer my own modified version of the implementation here:
from datetime import datetime
import os
import sys
from airflow.models import DAG
from airflow.operators.python_operator import PythonOperator
import ds_dependencies
SCRIPT_PATH = os.getenv('DASH_PREPROC_PATH')
if SCRIPT_PATH:
    sys.path.insert(0, SCRIPT_PATH)
    import dash_workers
else:
    print('Define DASH_PREPROC_PATH value in environmental variables')
    sys.exit(1)
ENV = os.environ
default_args = {
# 'start_date': datetime.now(),
'start_date': datetime(2017, 7, 18)
}
DAG = DAG(
dag_id='dash_preproc',
default_args=default_args
)
clear_tables = PythonOperator(
task_id='clear_tables',
python_callable=dash_workers.clear_db,
dag=DAG)
def id_worker(uid):
    return PythonOperator(
        task_id=uid,
        python_callable=dash_workers.main_preprocess,
        op_args=[uid],
        dag=DAG)
# dash_workers is the module imported above
for uid in dash_workers.get_id_creds():
    clear_tables >> id_worker(uid)
clear_tables cleans the database that will be rebuilt as a result of the process. id_worker is a function that dynamically generates new preprocessing tasks, based on the array of ID values returned from get_id_creds. The task ID is just the corresponding user ID, though it could easily have been an index, i, as in the example mentioned above.
NOTE That bitshift operator (<<) looks backwards to me, as the clear_tables task should come first, but it's what seems to be working in this case.
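For anyone else unsure about the direction, here is a toy model of the two operators in plain Python. ToyTask is a hypothetical class, not Airflow's operator; it only mirrors the convention that a >> b makes b run after a, and a << b makes a run after b.

```python
class ToyTask:
    """Toy stand-in for a task that records what runs after it."""

    def __init__(self, task_id):
        self.task_id = task_id
        self.downstream = []

    def __rshift__(self, other):
        """self >> other: other runs after self."""
        self.downstream.append(other)
        return other

    def __lshift__(self, other):
        """self << other: self runs after other."""
        other.downstream.append(self)
        return other


clear_tables = ToyTask("clear_tables")
for uid in ["id_1", "id_2"]:
    # clear_tables comes first; each ID task is downstream of it.
    clear_tables >> ToyTask(uid)

print([t.task_id for t in clear_tables.downstream])  # ['id_1', 'id_2']
```

Read this way, clear_tables >> id_worker(uid) says that clear_tables runs first and every ID task follows it.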
Apache Airflow is a workflow management tool, i.e. it determines the dependencies between the tasks that the user defines. Compare this with Apache NiFi, which is a dataflow management tool, i.e. the dependencies there are the data transferred through the tasks.
That said, I think your approach is quite right (my comment is based on the code posted), but Airflow offers a concept called XCom. It allows tasks to "cross-communicate" by passing some data between them. How big should the passed data be? It is up to you to test, but generally it should not be large. I think it takes the form of key/value pairs and is stored in the Airflow metadata database, i.e. you can't pass files, for example, but a list of IDs could work.
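To give a feel for the idea, here is a toy model in plain Python. It is not Airflow's XCom API: the table layout, the xcom_push and xcom_pull helpers, and the use of JSON for serialization are all assumptions made for illustration. The point is only that a small value such as a list of IDs is serialized, stored under a key in a database, and read back by a later task.

```python
import json
import sqlite3

# In-memory database standing in for the metadata database.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE xcom (task_id TEXT, key TEXT, value TEXT)")


def xcom_push(task_id, key, value):
    """Store a small, serializable value under (task_id, key)."""
    db.execute(
        "INSERT INTO xcom VALUES (?, ?, ?)",
        (task_id, key, json.dumps(value)),
    )


def xcom_pull(task_id, key):
    """Read the value back, or None if nothing was pushed."""
    row = db.execute(
        "SELECT value FROM xcom WHERE task_id = ? AND key = ?",
        (task_id, key),
    ).fetchone()
    return json.loads(row[0]) if row else None


# The upstream task publishes the IDs; a downstream task reads them.
xcom_push("get_id_creds", "ids", ["id_1", "id_2", "id_3"])
ids = xcom_pull("get_id_creds", "ids")
print(ids)  # ['id_1', 'id_2', 'id_3']
```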
Like I said, you should test it yourself. I would be very happy to hear about your experience. Here is an example DAG which demonstrates the use of XCom and here is the necessary documentation. Cheers!