I want to read the email address set in this DAG's default_args from another DAG in Airflow. How can I do that? Please help, I am new to Airflow!
from airflow.models import DagRun
from airflow.operators.python_operator import PythonOperator
from datetime import datetime, timedelta
from airflow import DAG
def first_function(**context):
    print("hello")
default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'email': ['example@gmail.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5)
}
with DAG(
    'main',
    default_args=default_args,
    description='Sample DAG',
    schedule_interval=timedelta(days=1),
    start_date=datetime(2022,6,10),
    catchup=False
) as dag:
    first_function = PythonOperator(
        task_id="first_function",
        python_callable=first_function,
    )
    first_function
You can use a custom module in Airflow to share config options, Operators, or any arbitrary Python code across DAGs.
Typically you would create a directory inside your DAGs directory (which by default is {AIRFLOW_HOME}/dags).
To share default_args between DAGs, you could create the following layout:
Create {AIRFLOW_HOME}/dags/custom/__init__.py
Create {AIRFLOW_HOME}/dags/custom/shared_config.py
Create {AIRFLOW_HOME}/dags/.airflowignore
Add the directory name custom to the first line of .airflowignore.
Cut and paste your default_args dictionary from your DAG into {AIRFLOW_HOME}/dags/custom/shared_config.py
You can see this layout suggested in the Airflow documentation here.
The .airflowignore file tells the scheduler to skip the custom directory when it parses your DAGs (which by default happens every 30s). Because the custom directory contains Python but never any DAGs, skipping it avoids adding unnecessary load to the scheduler. This is explained in the documentation link above.
You need to add an __init__.py to the custom directory. Airflow requires it, even though Python 3 itself does not because of implicit namespace packages (again, this is explained in the same link above).
From your dag you can then import as needed:
from custom.shared_config import default_args
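As a rough sketch of what custom/shared_config.py could contain: the values below simply mirror the dictionary from the question, and args_for is a hypothetical helper name, not part of Airflow. It hands each DAG its own copy, so one DAG overriding a value does not affect the others:

```python
from copy import deepcopy
from datetime import timedelta

# Shared defaults, moved here from the original DAG file.
default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'email': ['example@gmail.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}


def args_for(**overrides):
    """Hypothetical helper: return a private copy of default_args with overrides applied."""
    merged = deepcopy(default_args)
    merged.update(overrides)
    return merged
```

Any DAG that imports default_args as shown above can then read the address with default_args['email'], and a DAG that needs different settings could call args_for(retries=3) instead of editing the shared dictionary.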
I am new to Airflow, and I am trying to automate the scheduling of a Python pipeline. My project youtubecollection01 uses custom modules, so when I run the DAG it fails with ModuleNotFoundError: No module named 'Authentication'.
This is how my project is structured:
This is my dag file:
# This is to initialize the file as a DAG file
from airflow import DAG
from datetime import datetime, timedelta
from airflow.operators.python import PythonOperator
# from airflow.utils.dates import days_ago
from youtubecollectiontier01.src.__main__ import main
default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    # 'start_date': days_ago(1),
    'email': ['airflow@example.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
    # 'priority_weight': 10,
    # 'end_date': datetime(2016, 1, 1),
}
# curate dag
with DAG('collect_layer_01', start_date=datetime(2022,7,25),
         schedule_interval='@daily', catchup=False, default_args=default_args) as dag:
    curate = PythonOperator(
        task_id='collect_tier_01',  # name for the task you would like to execute
        python_callable=main,  # the name of your python function
        provide_context=True,
        dag=dag)
I am importing the main function from __main__.py; however, inside main I am importing other modules such as Authentication.py, ChannelClass.py and Common.py, and those are the imports Airflow does not recognize.
Why are the imports failing? Is it a directory issue or an Airflow issue? I tried moving the project under plugins and running it from there, but that did not work. Any feedback would be highly appreciated!
Thank you!
Up until the last part, you got everything set up according to the tutorials! Also, thank you for a well-documented question.
If you have not changed the PYTHONPATH for Airflow, you can check the defaults with:
$ airflow info
In the paths info section, you get "airflow_home", "system_path", "python_path" and "airflow_on_path".
Within "python_path", you'll see that Airflow is set up to look inside the /dags, /plugins and /config folders.
The Airflow documentation covers this topic under "Module Management".
I think the problem with your code can be fixed with a small change.
In your main code you import:
from Authentication import Authentication
In a default setup, Airflow doesn't know where that is!
If you import it the same way you did in the DAG file, I believe it will work:
from youtubecollectiontier01.src.Authentication import Authentication
The same goes for your other classes: ChannelClass, Common, etc.
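The difference between the two import styles can be reproduced outside Airflow. The following is a minimal sketch, assuming that only the top-level folder (standing in for the dags folder) is on sys.path; the temporary directory and the empty Authentication class are made up for illustration:

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Build a throwaway layout that mimics <dags>/youtubecollectiontier01/src/.
root = Path(tempfile.mkdtemp())  # stands in for the dags folder
src = root / "youtubecollectiontier01" / "src"
src.mkdir(parents=True)
(root / "youtubecollectiontier01" / "__init__.py").write_text("")
(src / "__init__.py").write_text("")
(src / "Authentication.py").write_text("class Authentication:\n    pass\n")

# Only the top-level folder is added to sys.path, not the nested src folder.
sys.path.insert(0, str(root))

# The bare import fails because src/ itself is not on sys.path.
try:
    importlib.import_module("Authentication")
    bare_import_ok = True
except ModuleNotFoundError:
    bare_import_ok = False

# The fully qualified import resolves from the top-level folder.
module = importlib.import_module("youtubecollectiontier01.src.Authentication")
qualified_import_ok = hasattr(module, "Authentication")
```

Running `python __main__.py` directly hides the problem, because Python then puts the script's own folder on sys.path, which Airflow does not do when it imports main from the DAG file.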
Waiting to hear from you!
I'm a newbie to Apache Airflow.
There are a lot of examples of basic DAGs on the Internet.
Unfortunately, I didn't find any examples of single-task DAGs.
Most DAG examples contain a bitshift operator at the end of the .py script, which defines the task order.
For example:
# ...our DAG's code...
task1 >> task2 >> task3
But what if my DAG has just a single task at the moment?
My question is: do I need to put this single task's name at the end of the Python file?
Or, if there is only one task in scope, will Airflow handle it itself, making the last line of the code below redundant?
from datetime import timedelta
from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.utils.dates import days_ago
default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'email': ['airflow@example.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
}
with DAG(
    'tutorial',
    default_args=default_args,
    description='A simple tutorial DAG',
    schedule_interval=timedelta(days=1),
    start_date=days_ago(2),
    tags=['example'],
) as dag:
    t1 = BashOperator(
        task_id='print_date',
        bash_command='date',
    )
    t1 # IS THIS LINE OF CODE NECESSARY?
The answer is NO, you don't need to include the last line. You could also skip the assignment to the variable t1, leaving the DAG like this:
with DAG(
    'tutorial',
    default_args=default_args,
    description='A simple tutorial DAG',
    schedule_interval=timedelta(days=1),
    start_date=days_ago(2),
    tags=['example'],
) as dag:
    BashOperator(
        task_id='print_date',
        bash_command='date',
    )
The reason to assign an Operator instance (such as BashOperator) to a variable (called a task in this scope) is the same as for any other object in OOP. In your example no other operation is performed on the t1 variable (you are not reading it or calling any method on it), so there is no reason to declare it.
When starting with Airflow, I think it is very helpful to use the DebugExecutor to run quick tests like this and understand how everything works. If you are using VS Code you can find an example config file here.
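To see why neither the variable nor the trailing expression is needed, here is a toy model of the idea in plain Python. ToyDag and ToyOperator are made-up classes and not Airflow's actual implementation; the sketch only assumes that an operator created inside a `with DAG(...)` block attaches itself to that DAG when it is constructed:

```python
class ToyDag:
    """Made-up stand-in for a DAG used as a context manager."""

    _active = []  # stack of DAGs currently open in a `with` block

    def __init__(self, dag_id):
        self.dag_id = dag_id
        self.tasks = {}

    def __enter__(self):
        ToyDag._active.append(self)
        return self

    def __exit__(self, exc_type, exc, tb):
        ToyDag._active.pop()
        return False


class ToyOperator:
    """Made-up stand-in for an operator."""

    def __init__(self, task_id):
        self.task_id = task_id
        # Registration happens here, at construction time.
        if ToyDag._active:
            ToyDag._active[-1].tasks[task_id] = self


with ToyDag("tutorial") as dag:
    ToyOperator(task_id="print_date")  # no variable, no trailing expression

registered = sorted(dag.tasks)
```

In this model the task is already known to the DAG once the constructor returns, so a bare `t1` line afterwards is just an expression statement with no effect.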
I'm currently facing a challenge with parsing nested macros. Below is my DAG file:
from airflow import DAG
from airflow.utils.dates import days_ago
from datetime import timedelta
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.python_operator import PythonOperator
from airflow.models import Variable
from apty.utils.date import date_ref_now
default_args = {
    "owner": "Akhil",
    "depends_on_past": False,
    "start_date": days_ago(0),
    "email_on_failure": False,
    "email_on_retry": False,
    "retries": 0,
    "retry_delay": timedelta(minutes=5),
}
dag = DAG(
    "user_sample",
    default_args=default_args,
    description="test",
    schedule_interval=None,
    catchup=False,
)
def sample_app(hello=None):
    return hello
extra_attrs = {"date_stamp": "{{ds}}",
               "foo": "bar"}
start = DummyOperator(task_id="start", dag=dag)
python = PythonOperator(
    python_callable=sample_app,
    task_id="mid",
    dag=dag,
    params={"date_stamp": extra_attrs["date_stamp"]},
    op_kwargs={"hello": "{{params.date_stamp}}"},
)
start >> python
I have a scenario where I need to pass {{ds}} as one of the parameters to my operator, after which I'll use that parameter however I like, passing it on through op_kwargs or op_args. (I have used PythonOperator as an example, but I will be using my own custom operator.)
To be clear, {{ds}} must be passed only as a parameter value; I don't want to write it anywhere else, e.g. in op_kwargs in this example.
When I run it, the Python callable returns the literal string {{ds}} rather than the current date stamp.
Please help me out.
Templates and macros are only rendered for parameters that are listed in template_fields on the operator class in use. The exact list depends on the version and implementation of Airflow you're using, but here is the definition for PythonOperator at the time of writing: https://github.com/apache/airflow/blob/98896e4e327f256fd04087a49a13e16a246022c9/airflow/operators/python.py#L72. Since, as you say, you control the operator in question, you can list any fields you want in template_fields on the class definition. (This all assumes your class inherits from BaseOperator.)
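To illustrate the idea without Airflow installed, here is a toy model. ToyOperator and render are hypothetical, and Airflow itself uses Jinja rather than this regex. The sketch assumes, as I understand Airflow's behaviour, that only fields named in template_fields are rendered and that rendering is a single pass, which would explain why {{params.date_stamp}} comes back as the literal {{ds}}:

```python
import re


def render(text, context):
    """Replace each {{ name }} or {{ a.b }} once; results are not re-scanned."""
    def lookup(match):
        value = context
        for part in match.group(1).split("."):
            value = value[part]
        return str(value)
    return re.sub(r"\{\{\s*([\w.]+)\s*\}\}", lookup, text)


class ToyOperator:
    """Made-up operator: only attributes named in template_fields get rendered."""

    template_fields = ("hello",)

    def __init__(self, hello, params=None):
        self.hello = hello
        self.params = params or {}  # params is passed through untouched

    def render_template_fields(self, context):
        full_context = dict(context, params=self.params)
        for name in self.template_fields:
            setattr(self, name, render(getattr(self, name), full_context))


context = {"ds": "2022-06-10"}  # made-up run date

# The macro is hidden inside params, which is not a templated field.
nested = ToyOperator(hello="{{params.date_stamp}}", params={"date_stamp": "{{ds}}"})
nested.render_template_fields(context)

# The macro is placed directly on a templated field.
direct = ToyOperator(hello="{{ds}}")
direct.render_template_fields(context)
```

In this model nested.hello ends up as the unrendered text {{ds}}, while direct.hello holds the date. In a custom operator, listing the attribute that receives {{ds}} in template_fields lets the macro be rendered on that attribute directly.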
I am new to Python and Airflow DAGs.
I am following the link below and the code mentioned in its answer section.
How to pass dynamic arguments Airflow operator?
I am having an issue reading a YAML file. The YAML file holds some configuration-related arguments.
configs:
  cluster_name: "test-cluster"
  project_id: "t***********"
  zone: "europe-west1-c"
  num_workers: 2
  worker_machine_type: "n1-standard-1"
  master_machine_type: "n1-standard-1"
In the DAG script I have created one task that creates the cluster. Before executing this task we need all the arguments that we pass to it through the default_args parameter, like cluster_name, project_id, etc. To read those parameters I have created a readYML method. See the code below:
from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from datetime import datetime, timedelta
from zipfile import ZipFile
from airflow.contrib.operators import dataproc_operator
from airflow.models import Variable
import yaml
def readYML():
    print("inside readYML")
    global cfg
    file_name = "/home/airflow/gcs/data/cluster_config.yml"
    with open(file_name, 'r') as ymlfile:
        # safe_load parses plain YAML without constructing arbitrary objects
        cfg = yaml.safe_load(ymlfile)
    print(cfg['configs']['cluster_name'])
# Default Arguments
readYML()
dag_name = Variable.get("dag_name")
default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime.now(),
    'email': ['airflow@example.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
    #'cluster_name': cfg['configs']['cluster_name'],
}
# Instantiate a DAG
dag = DAG(dag_id='read_yml', default_args=default_args,
          schedule_interval=timedelta(days=1))
# Creating Tasks
Task1 = dataproc_operator.DataprocClusterCreateOperator(
    task_id='create_cluster',
    dag=dag
)
This code raises no error. When I upload it to the GCP Composer environment, no error notification is shown, but the DAG is not runnable: the Run button does not appear.
See the attached screenshot.
I am using Python 3 and Airflow version composer-1.7.2-airflow-1.10.2.
According to the Data Stored in Cloud Storage page in the Cloud Composer docs:
To avoid a webserver error, make sure that data the webserver needs to parse a DAG (not run) is available in the dags/ folder. Otherwise, the webserver can't access the data or load the Airflow web interface.
Your DAG is attempting to open the YAML file under /home/airflow/gcs/data, which isn't present on the webserver. Put the file under the dags/ folder in your GCS bucket, and it will be accessible to the scheduler, workers, and webserver, and the DAG will work in the Web UI.
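One way to avoid hard-coding the data/ path is to resolve the file relative to the DAG file itself. A minimal sketch, assuming the YAML file is uploaded next to the DAG in the dags/ folder (config_path and the read_yml.py file name are hypothetical):

```python
from pathlib import Path


def config_path(dag_file, file_name="cluster_config.yml"):
    """Return the path of a config file that sits in the same folder as the DAG file."""
    return Path(dag_file).parent / file_name


# Example with a made-up DAG file location inside the dags/ folder.
example = config_path("/home/airflow/gcs/dags/read_yml.py")
```

In the DAG you would call config_path(__file__) and pass the result to open(), so the same code finds the file on the scheduler, the workers and the webserver.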
Below is the Airflow DAG code. It runs perfectly both when Airflow is hosted locally and on Cloud Composer. However, the DAG itself isn't clickable in the Composer UI.
I found a similar question and tried the accepted answer linked from it, but the problem remains.
import airflow
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.python_operator import PythonOperator
from airflow.operators.mysql_operator import MySqlOperator
from airflow.contrib.operators.dataproc_operator import DataprocClusterCreateOperator
from airflow.contrib.operators.dataproc_operator import DataprocClusterDeleteOperator
from airflow.contrib.operators.dataproc_operator import DataProcSparkOperator
from datetime import datetime, timedelta
import sys
# copy this package to dag directory in GCP composer bucket
from schemas.schemaValidator import loadSchema
from schemas.schemaValidator import sparkArgListToMap
# change these paths to point to GCP Composer data directory
## cluster config
clusterConfig = loadSchema("somePath/jobConfig/cluster.yaml", "cluster")
## per job yaml config
autoLoanCsvToParquetConfig = loadSchema("somePath/jobConfig/job.yaml", "job")
default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2019, 1, 1),
    'retries': 1,
    'retry_delay': timedelta(minutes=3)
}
dag = DAG('usr_job', default_args=default_args, schedule_interval=None)
t1 = DummyOperator(task_id="start", dag=dag)
t2 = DataprocClusterCreateOperator(
    task_id="CreateCluster",
    cluster_name=clusterConfig["cluster"]["cluster_name"],
    project_id=clusterConfig["project_id"],
    num_workers=clusterConfig["cluster"]["worker_config"]["num_instances"],
    image_version=clusterConfig["cluster"]["dataproc_img"],
    master_machine_type=clusterConfig["cluster"]["worker_config"]["machine_type"],
    worker_machine_type=clusterConfig["cluster"]["worker_config"]["machine_type"],
    zone=clusterConfig["region"],
    dag=dag
)
t3 = DataProcSparkOperator(
    task_id="csvToParquet",
    main_class=autoLoanCsvToParquetConfig["job"]["main_class"],
    arguments=autoLoanCsvToParquetConfig["job"]["args"],
    cluster_name=clusterConfig["cluster"]["cluster_name"],
    dataproc_spark_jars=autoLoanCsvToParquetConfig["job"]["jarPath"],
    dataproc_spark_properties=sparkArgListToMap(autoLoanCsvToParquetConfig["spark_params"]),
    dag=dag
)
t4 = DataprocClusterDeleteOperator(
    task_id="deleteCluster",
    cluster_name=clusterConfig["cluster"]["cluster_name"],
    project_id=clusterConfig["project_id"],
    dag=dag
)
t5 = DummyOperator(task_id="stop", dag=dag)
t1 >> t2 >> t3 >> t4 >> t5
The UI gives this error: "This DAG isn't available in the webserver DAG bag object. It shows up in this list because the scheduler marked it as active in the metadata database."
And yet, when I triggered the DAG manually on Composer, the log files showed that it ran successfully.
The issue was with the path I was providing for the configuration files. I was pointing to the data folder in GCS. As per the Google documentation, only the dags folder is synced to all nodes, not the data folder.
Because the failure happened at DAG parsing time, the DAG did not appear correctly in the UI. More interestingly, these debug messages were not exposed in Composer 1.5 and earlier. Now they are available to the end user to help with debugging. Thanks anyway to everyone who helped.