I am trying to create a DAG that generates tasks dynamically based on a JSON file located in storage. I followed this guide step-by-step:
https://bigdata-etl.com/apache-airflow-create-dynamic-dag/
But the DAG gets stuck with the following message:
Is it possible to read an external file and use it to create tasks dynamically in Composer? I can do this when I read data only from an Airflow Variable, but when I read an external file, the DAG gets stuck in the "isn't available in the web server's DagBag object" state. I need to read from an external file because the contents of the JSON will change with every execution.
I am using composer-1.8.2-airflow-1.10.2.
I read this answer to a similar question:
Dynamic task definition in Airflow
But I am not trying to create the tasks based on a separate task, only based on the external file.
This is my second approach, which also gets stuck in that error state:
import datetime
import json
import os

import airflow
from airflow.operators import bash_operator
from airflow.operators.dummy_operator import DummyOperator
from airflow.models import Variable

products = json.loads(Variable.get("products"))

default_args = {
    'owner': 'Composer Example',
    'depends_on_past': False,
    'email': [''],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 0,
    'retry_delay': datetime.timedelta(minutes=5),
    'start_date': datetime.datetime(2020, 1, 10),
}

with airflow.DAG(
        'json_test2',
        default_args=default_args,
        # Not scheduled, trigger only
        schedule_interval=None) as dag:

    def read_json_file(file_path):
        if os.path.exists(file_path):
            with open(file_path, 'r') as f:
                return json.load(f)

    def get_run_list(files):
        run_list = []
        # The file is uploaded to the storage bucket used as a volume by Composer
        last_exec_json = read_json_file("/home/airflow/gcs/data/last_execution.json")
        date = last_exec_json["date"]
        hour = last_exec_json["hour"]
        for file in files:
            # Testing by adding just date and hour
            name = file['name'] + f'_{date}_{hour}'
            run_list.append(name)
        return run_list

    rl = get_run_list(products)

    start = DummyOperator(task_id='start', dag=dag)
    end = DummyOperator(task_id='end', dag=dag)

    for name in rl:
        tsk = DummyOperator(task_id=name, dag=dag)
        start >> tsk >> end
It is possible to create a DAG that generates tasks dynamically based on a JSON file located in a Cloud Storage bucket. I followed the guide you provided, and it works perfectly in my case.
First, upload your JSON configuration file to the $AIRFLOW_HOME/dags directory, and then upload the DAG Python file to the same path (you can find the path in the airflow.cfg file, which is located in the bucket).
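As an illustration, here is a minimal sketch of how the read from the question could look once the file lives in the dags/ folder (the path /home/airflow/gcs/dags/last_execution.json is an assumption based on the default Composer mount point; adjust it to wherever you actually upload the file):

import json
import os

def read_json_file(file_path):
    # Assumption: last_execution.json was uploaded into the dags/ folder of the
    # Composer bucket, which is synced to the webserver, scheduler and workers.
    if os.path.exists(file_path):
        with open(file_path, 'r') as f:
            return json.load(f)
    return {}

last_exec_json = read_json_file("/home/airflow/gcs/dags/last_execution.json")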
Later on, you will be able to see the DAG in the Airflow UI.
As the message "DAG isn't available in the web server's DagBag object" indicates, the DAG isn't available on the Airflow web server. However, the DAG can still be scheduled as active, because the Airflow scheduler works independently of the Airflow web server.
When a lot of DAGs are loaded into a Composer environment at once, it may overload the environment. Because the Airflow webserver runs in a Google-managed project, only certain types of updates cause the webserver container to be restarted, such as adding or upgrading a PyPI package or changing an Airflow setting. The workaround is to add a dummy environment variable:
Open the Composer instance in GCP
Go to the ENVIRONMENT VARIABLE tab
Click Edit, add an environment variable, and Submit
Alternatively, you can use the following commands to restart it from the command line:
gcloud composer environments update ${ENVIRONMENT_NAME} --location=${ENV_LOCATION} --update-airflow-configs=core-dummy=true
gcloud composer environments update ${ENVIRONMENT_NAME} --location=${ENV_LOCATION} --remove-airflow-configs=core-dummy
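If you prefer to add and remove the dummy environment variable itself from the command line instead of the Cloud Console, something along these lines should also trigger a webserver restart (DUMMY_RESTART is an arbitrary, otherwise unused variable name):

gcloud composer environments update ${ENVIRONMENT_NAME} --location=${ENV_LOCATION} --update-env-variables=DUMMY_RESTART=1
gcloud composer environments update ${ENVIRONMENT_NAME} --location=${ENV_LOCATION} --remove-env-variables=DUMMY_RESTART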
I hope you find the above pieces of information useful.
I am new to Airflow, and I am trying to create a Python pipeline scheduling automation process. My project youtubecollection01 uses custom-created modules, so when I run the DAG it fails with ModuleNotFoundError: No module named 'Authentication'.
This is how my project is structured:
This is my dag file:
# This is to initialize the file as a DAG file
from airflow import DAG
from datetime import datetime, timedelta
from airflow.operators.python import PythonOperator
# from airflow.utils.dates import days_ago

from youtubecollectiontier01.src.__main__ import main

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    # 'start_date': days_ago(1),
    'email': ['airflow@example.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
    # 'priority_weight': 10,
    # 'end_date': datetime(2016, 1, 1),
}

# curate dag
with DAG('collect_layer_01', start_date=datetime(2022, 7, 25),
         schedule_interval='@daily', catchup=False, default_args=default_args) as dag:

    curate = PythonOperator(
        task_id='collect_tier_01',  # name for the task you would like to execute
        python_callable=main,       # the name of your python function
        provide_context=True,
        dag=dag)
I am importing the main function from __main__.py; however, inside main I import other classes such as Authentication.py, ChannelClass.py, and Common.py, and that is where Airflow fails to resolve the modules.
Why is it failing on the imports: is it a directory issue or an Airflow issue? I tried moving the project under plugins and running it, but it did not work. Any feedback would be highly appreciated!
Thank you!
Up until the last part, you got everything set up according to the tutorials! Also, thank you for a well-documented question.
If you have not changed the Python path for Airflow, you can get the defaults with:
$ airflow info
In the paths section you get "airflow_home", "system_path", "python_path" and "airflow_on_path".
Within "python_path" you'll see that Airflow is set up to check everything inside the /dags, /plugins and /config folders.
There is more about this topic in the docs under "Module Management".
Now, I think the problem with your code can be fixed with a little change.
In your main code you import:
from Authentication import Authentication
In a default setup, Airflow doesn't know where that is!
If you import it this way:
from youtubecollectiontier01.src.Authentication import Authentication
just like the import you did in the DAG file, I believe it will work. The same goes for the other classes you have: ChannelClass, Common, etc.
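For completeness, here is a hypothetical sketch of how the imports inside __main__.py could look after that change, assuming the youtubecollectiontier01 package sits directly under the dags/ folder and each module defines a class of the same name:

# Hypothetical adjusted imports in youtubecollectiontier01/src/__main__.py
from youtubecollectiontier01.src.Authentication import Authentication
from youtubecollectiontier01.src.ChannelClass import ChannelClass
from youtubecollectiontier01.src.Common import Common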
Waiting to hear from you!
I upgraded my Airflow to 2.0. After upgrading, my kubernetes_pod_operator is not working and gives me the following error. How do I fix this upgrade issue? What do I need to change in the code to make it work in Airflow 2.0?
Error:
Broken DAG: [/home/airflow/gcs/dags/daily_data_dag.py] Traceback (most recent call last):
File "/opt/python3.8/lib/python3.8/site-packages/airflow/models/baseoperator.py", line 178, in apply_defaults
result = func(self, *args, **kwargs)
File "/opt/python3.8/lib/python3.8/site-packages/airflow/models/baseoperator.py", line 506, in __init__
raise AirflowException(
airflow.exceptions.AirflowException: Invalid arguments were passed to KubernetesPodOperator (task_id: snowpack_daily_data_pipeline_16_11_2021). Invalid arguments were:
**kwargs: {'email_on_success': True}
Code:
import datetime

from airflow import models
from airflow.contrib.operators import kubernetes_pod_operator
from kubernetes.client import models as k8s

# DEFINE VARS HERE:
dag_name = "daily_data_pipeline"
schedule_interval = '@daily'
email = ["xxx@gmail.com"]
# get this from Gitlab
docker_image = "registry.gitlab.com/xxxx:dev"

default_dag_args = {
    # The start_date describes when a DAG is valid / can be run. Set this to a
    # fixed point in time rather than dynamically, since it is evaluated every
    # time a DAG is parsed. See:
    # https://airflow.apache.org/faq.html#what-s-the-deal-with-start-date
    'start_date': datetime.datetime(2021, 6, 26),
    'depends_on_past': False,
    'max_active_runs': 1,
    # 'concurrency': 1
}

# Define a DAG (directed acyclic graph) of tasks.
# Any task you create within the context manager is automatically added to the
# DAG object.
with models.DAG(
        dag_name,
        schedule_interval=schedule_interval,
        default_args=default_dag_args,
        catchup=False) as dag:

    kubernetes_min_pod = kubernetes_pod_operator.KubernetesPodOperator(
        # The ID specified for the task.
        task_id=f'{dag_name}_{datetime.datetime.now().strftime("%d_%m_%Y")}',
        # Name of task you want to run, used to generate Pod ID.
        name=dag_name,
        # Entrypoint of the container, if not specified the Docker container's
        # entrypoint is used. The cmds parameter is templated.
        # cmds=['echo'],
        # The namespace to run within Kubernetes, default namespace is
        # `default`. There is the potential for the resource starvation of
        # Airflow workers and scheduler within the Cloud Composer environment,
        # the recommended solution is to increase the amount of nodes in order
        # to satisfy the computing requirements. Alternatively, launching pods
        # into a custom namespace will stop fighting over resources.
        namespace='default',
        # Setup email on failure and success
        email_on_failure=True,
        email_on_success=True,
        email=email,
        # Docker image specified. Defaults to hub.docker.com, but any fully
        # qualified URLs will point to a custom repository. Supports private
        # gcr.io images if the Composer Environment is under the same
        # project-id as the gcr.io images and the service account that Composer
        # uses has permission to access the Google Container Registry
        # (the default service account has permission)
        image=docker_image,
        image_pull_secrets='gitlab',
        image_pull_policy='Always',
        # if you have a larger image, you may need to increase the default from 120
        startup_timeout_seconds=600)
The problem has to do with the email_on_success parameter: as you can see in the BaseOperator documentation, only email_on_retry and email_on_failure are supported.
If you need to send email on success, you may use on_success_callback instead. Consider, for instance, this example obtained from this GitHub issue:
from airflow.models import TaskInstance
from airflow.utils.email import send_email

def on_success_callback(context):
    ti: TaskInstance = context["ti"]
    dag_id = ti.dag_id
    task_id = ti.task_id
    msg = "DAG succeeded"
    subject = f"Success {dag_id}.{task_id}"
    # EMAIL_LIST is assumed to be defined elsewhere in the original example
    send_email(to=EMAIL_LIST, subject=subject, html_content=msg)
Please note that the callback in the example is defined at the DAG level. The Airflow documentation provides several additional examples.
In your specific use case, it could look like the following:
import datetime

from airflow import models
from airflow.contrib.operators import kubernetes_pod_operator
from airflow.utils.email import send_email
from kubernetes.client import models as k8s

# DEFINE VARS HERE:
dag_name = "daily_data_pipeline"
schedule_interval = '@daily'
email = ["xxx@gmail.com"]
# get this from Gitlab
docker_image = "registry.gitlab.com/xxxx:dev"

default_dag_args = {
    # The start_date describes when a DAG is valid / can be run. Set this to a
    # fixed point in time rather than dynamically, since it is evaluated every
    # time a DAG is parsed. See:
    # https://airflow.apache.org/faq.html#what-s-the-deal-with-start-date
    'start_date': datetime.datetime(2021, 6, 26),
    'depends_on_past': False,
    'max_active_runs': 1,
    # 'concurrency': 1
}

# Define a DAG (directed acyclic graph) of tasks.
# Any task you create within the context manager is automatically added to the
# DAG object.
with models.DAG(
        dag_name,
        schedule_interval=schedule_interval,
        default_args=default_dag_args,
        catchup=False) as dag:

    def notify_successful_execution(context):
        # Access the information provided in context if required
        send_email(
            to=email,  # email is already a list of addresses
            subject='Successful execution',
            html_content='The process was executed successfully'
        )

    kubernetes_min_pod = kubernetes_pod_operator.KubernetesPodOperator(
        # The ID specified for the task.
        task_id=f'{dag_name}_{datetime.datetime.now().strftime("%d_%m_%Y")}',
        # Name of task you want to run, used to generate Pod ID.
        name=dag_name,
        # Entrypoint of the container, if not specified the Docker container's
        # entrypoint is used. The cmds parameter is templated.
        # cmds=['echo'],
        # The namespace to run within Kubernetes, default namespace is
        # `default`. There is the potential for the resource starvation of
        # Airflow workers and scheduler within the Cloud Composer environment,
        # the recommended solution is to increase the amount of nodes in order
        # to satisfy the computing requirements. Alternatively, launching pods
        # into a custom namespace will stop fighting over resources.
        namespace='default',
        # Setup email on failure and success
        email_on_failure=True,
        # Comment out the following parameter, it is unsupported
        # email_on_success=True,
        email=email,
        # Docker image specified. Defaults to hub.docker.com, but any fully
        # qualified URLs will point to a custom repository. Supports private
        # gcr.io images if the Composer Environment is under the same
        # project-id as the gcr.io images and the service account that Composer
        # uses has permission to access the Google Container Registry
        # (the default service account has permission)
        image=docker_image,
        image_pull_secrets='gitlab',
        image_pull_policy='Always',
        # if you have a larger image, you may need to increase the default from 120
        startup_timeout_seconds=600,
        # Use on success callback instead for your email notifications
        on_success_callback=notify_successful_execution)
I'm trying to make a DAG that has two operators that are created dynamically, depending on the number of "pipelines" in a JSON config file. This file is stored in the variable dag_datafusion_args. Then I have a standard bash operator, and a task called success at the end that sends a message to Slack saying that the DAG is over. The other two tasks, which are Python operators, are generated dynamically and run in parallel.
I'm using Composer. When I put the DAG in the bucket it appears in the webserver UI, but when I click to see the DAG the following message appears: 'DAG "dag_lucas4" seems to be missing.' If I test the tasks directly via the CLI on the Kubernetes cluster, it works! But I can't seem to make it appear in the web UI. I tried, as some people here on SO suggested, to restart the webserver by installing a Python package; I tried three times but without success. Does anyone know what it could be?
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.operators.python_operator import PythonOperator
from aux_py_files.med.med_airflow_functions import *
from google.cloud import storage
from datetime import timedelta

TEMPLATE_SEARCH_PATH = '/home/airflow/gcs/plugins/'
INDEX = 1

default_args = {
    'owner': 'lucas',
    'start_date': '2021-01-10',
    'email': ['xxxx'],
    'email_on_failure': False,
    'email_on_success': False,
    'retries': 3,
    'retry_delay': timedelta(minutes=2),
    'on_failure_callback': post_message_fail_to_slack
}

dag_datafusion_args = return_datafusion_config_file('med')

with DAG('dag_lucas4', default_args=default_args, schedule_interval="30 23 * * *",
         template_searchpath=[TEMPLATE_SEARCH_PATH]) as dag:

    extract_ftp_csv_files_load_in_gcs = BashOperator(
        task_id='extract_ftp_csv_files_load_in_gcs',
        bash_command='aux_sh_files/med/script.sh'
    )

    success = PythonOperator(
        task_id='success',
        python_callable=post_message_success_to_slack,
        op_kwargs={'dag_name': 'dag_lucas2'}
    )

    for pipeline, args in dag_datafusion_args.items():

        configure_pipeline = PythonOperator(
            task_id=f'configure_pipeline{str(INDEX)}',
            python_callable=setPipelineArguments,
            op_kwargs={'dag_name': 'med', 'pipeline_name': pipeline},
            provide_context=True
        )

        start_pipeline = PythonOperator(
            task_id=f'start_pipeline{str(INDEX)}',
            python_callable=start_pipeline_wrapper,
            op_kwargs={'configure_pipeline_task': f'configure_pipeline{str(INDEX)}'},
            retries=3,
            provide_context=True
        )

        [extract_ftp_csv_files_load_in_gcs, configure_pipeline] >> start_pipeline >> success
        INDEX += 1
It appears that the Airflow webserver in Cloud Composer runs in the tenant project, while the workers and scheduler run in the customer project. The tenant project is simply a Google-managed environment for some of the Airflow components, so the webserver UI does not have complete access to your project's resources, since it does not run under your project's environment. That is why it can't read my config JSON file via return_datafusion_config_file. The best way is to create an ENV variable with the contents of that file.
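As a rough sketch of that workaround (the variable name below is an assumption, not your actual setup), the configuration could be injected through a Composer environment variable and parsed at the top of the DAG file, so the webserver never has to read files from the plugins/ path:

import json
import os

# Assumption: the JSON that return_datafusion_config_file('med') used to read
# has been set as a Composer environment variable named DAG_DATAFUSION_ARGS_MED.
dag_datafusion_args = json.loads(os.environ.get("DAG_DATAFUSION_ARGS_MED", "{}"))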
I am new to Python and Airflow DAGs.
I am following the link below and the code mentioned in its answer section:
How to pass dynamic arguments Airflow operator?
I am facing an issue reading a YAML file. In the YAML file I have some configuration-related arguments:
configs:
  cluster_name: "test-cluster"
  project_id: "t***********"
  zone: "europe-west1-c"
  num_workers: 2
  worker_machine_type: "n1-standard-1"
  master_machine_type: "n1-standard-1"
In the DAG script I have created one task which will create a cluster. Before executing this task, we need all the arguments to pass into the default_args parameter, like cluster_name, project_id, etc. For reading those parameters I have created a readYML method; see the code below.
from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from datetime import datetime, timedelta
from zipfile import ZipFile
from airflow.contrib.operators import dataproc_operator
from airflow.models import Variable
import yaml

def readYML():
    print("inside readYML")
    global cfg
    file_name = "/home/airflow/gcs/data/cluster_config.yml"
    with open(file_name, 'r') as ymlfile:
        cfg = yaml.load(ymlfile)
    print(cfg['configs']['cluster_name'])

# Default Arguments
readYML()
dag_name = Variable.get("dag_name")

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime.now(),
    'email': ['airflow@example.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
    # 'cluster_name': cfg['configs']['cluster_name'],
}

# Instantiate a DAG
dag = DAG(dag_id='read_yml', default_args=default_args,
          schedule_interval=timedelta(days=1))

# Creating Tasks
Task1 = dataproc_operator.DataprocClusterCreateOperator(
    task_id='create_cluster',
    dag=dag
)
There is no error in this code. When I upload it to the GCP Composer environment, no error notification is shown, but the DAG is not runnable: there is no Run button.
See the attached screenshot.
I am using Python 3 and composer-1.7.2-airflow-1.10.2.
According to the Data Stored in Cloud Storage page in the Cloud Composer docs:
To avoid a webserver error, make sure that data the webserver needs to parse a DAG (not run) is available in the dags/ folder. Otherwise, the webserver can't access the data or load the Airflow web interface.
Your DAG is attempting to open the YAML file under /home/airflow/gcs/data, which isn't present on the webserver. Put the file under the dags/ folder in your GCS bucket, and it will be accessible to the scheduler, workers, and webserver, and the DAG will work in the Web UI.
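For instance, a minimal sketch of the adjusted read, assuming cluster_config.yml is uploaded next to the DAG file into the dags/ folder (the path below mirrors the default Composer mount point):

import yaml

def readYML():
    global cfg
    # Assumption: cluster_config.yml was moved from the data/ folder to the
    # dags/ folder, which is synced to the webserver as well as the workers.
    file_name = "/home/airflow/gcs/dags/cluster_config.yml"
    with open(file_name, 'r') as ymlfile:
        cfg = yaml.safe_load(ymlfile)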
Below is the airflow DAG code. It runs perfectly both when airflow is hosted locally, and on cloud composer. However, the DAG itself isn't clickable in the Composer UI.
I found a similar question and tried the accepted answer linked in it; my problem is similar.
import airflow
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.python_operator import PythonOperator
from airflow.operators.mysql_operator import MySqlOperator
from airflow.contrib.operators.dataproc_operator import DataprocClusterCreateOperator
from airflow.contrib.operators.dataproc_operator import DataprocClusterDeleteOperator
from airflow.contrib.operators.dataproc_operator import DataProcSparkOperator
from datetime import datetime, timedelta
import sys

# copy this package to the dag directory in the GCP Composer bucket
from schemas.schemaValidator import loadSchema
from schemas.schemaValidator import sparkArgListToMap

# change these paths to point to the GCP Composer data directory

## cluster config
clusterConfig = loadSchema("somePath/jobConfig/cluster.yaml", "cluster")

## per job yaml config
autoLoanCsvToParquetConfig = loadSchema("somePath/jobConfig/job.yaml", "job")

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2019, 1, 1),
    'retries': 1,
    'retry_delay': timedelta(minutes=3)
}

dag = DAG('usr_job', default_args=default_args, schedule_interval=None)

t1 = DummyOperator(task_id="start", dag=dag)

t2 = DataprocClusterCreateOperator(
    task_id="CreateCluster",
    cluster_name=clusterConfig["cluster"]["cluster_name"],
    project_id=clusterConfig["project_id"],
    num_workers=clusterConfig["cluster"]["worker_config"]["num_instances"],
    image_version=clusterConfig["cluster"]["dataproc_img"],
    master_machine_type=clusterConfig["cluster"]["worker_config"]["machine_type"],
    worker_machine_type=clusterConfig["cluster"]["worker_config"]["machine_type"],
    zone=clusterConfig["region"],
    dag=dag
)

t3 = DataProcSparkOperator(
    task_id="csvToParquet",
    main_class=autoLoanCsvToParquetConfig["job"]["main_class"],
    arguments=autoLoanCsvToParquetConfig["job"]["args"],
    cluster_name=clusterConfig["cluster"]["cluster_name"],
    dataproc_spark_jars=autoLoanCsvToParquetConfig["job"]["jarPath"],
    dataproc_spark_properties=sparkArgListToMap(autoLoanCsvToParquetConfig["spark_params"]),
    dag=dag
)

t4 = DataprocClusterDeleteOperator(
    task_id="deleteCluster",
    cluster_name=clusterConfig["cluster"]["cluster_name"],
    project_id=clusterConfig["project_id"],
    dag=dag
)

t5 = DummyOperator(task_id="stop", dag=dag)

t1 >> t2 >> t3 >> t4 >> t5
The UI gives this error - "This DAG isn't available in the webserver DAG bag object. It shows up in this list because the scheduler marked it as active in the metadata database."
And yet, when I triggered the DAG manually on Composer, I found it ran successfully through the log files.
The issue was with the path provided for picking up the configuration files. I was giving the path to the data folder in GCS. As per the Google documentation, only the dags folder is synced to all nodes, not the data folder.
Needless to say, it was an issue encountered at DAG parsing time; hence, the DAG did not appear correctly in the UI. More interestingly, these debug messages were not exposed in Composer 1.5 and earlier. Now they are available to the end user to help with debugging. Thanks anyway to everyone who helped.
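A pattern that avoids hard-coding the bucket layout, sketched here under the assumption that the jobConfig/ folder is copied into dags/ next to the DAG file, is to resolve the config path relative to the DAG file itself:

import os

# Assumption: jobConfig/ lives alongside this DAG file inside the dags/ folder,
# so the same relative layout works on the scheduler, workers, and webserver.
DAG_DIR = os.path.dirname(os.path.abspath(__file__))
clusterConfig = loadSchema(os.path.join(DAG_DIR, "jobConfig/cluster.yaml"), "cluster")
autoLoanCsvToParquetConfig = loadSchema(os.path.join(DAG_DIR, "jobConfig/job.yaml"), "job")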