I am new to Python and Airflow DAGs.
I am following the link below and the code given in its answer section:
How to pass dynamic arguments Airflow operator?
I am having trouble reading a YAML file that holds some configuration arguments:
configs:
  cluster_name: "test-cluster"
  project_id: "t***********"
  zone: "europe-west1-c"
  num_workers: 2
  worker_machine_type: "n1-standard-1"
  master_machine_type: "n1-standard-1"
In the DAG script I have created one task that creates a cluster. Before this task runs, we need all the arguments that have to be passed to it through the default_args parameter, such as cluster_name, project_id, etc. To read those parameters I have created a readYML method. See the code below:
from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from datetime import datetime, timedelta
from zipfile import ZipFile
from airflow.contrib.operators import dataproc_operator
from airflow.models import Variable
import yaml
def readYML():
    print("inside readYML")
    global cfg
    file_name = "/home/airflow/gcs/data/cluster_config.yml"
    with open(file_name, 'r') as ymlfile:
        # safe_load parses plain data and does not need an explicit Loader
        cfg = yaml.safe_load(ymlfile)
    print(cfg['configs']['cluster_name'])
# Default Arguments
readYML()
dag_name = Variable.get("dag_name")
default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime.now(),
    'email': ['airflow@example.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
    #'cluster_name': cfg['configs']['cluster_name'],
}
# Instantiate a DAG
dag = DAG(dag_id='read_yml', default_args=default_args,
          schedule_interval=timedelta(days=1))
# Creating Tasks
# The operator class lives in the imported dataproc_operator module
Task1 = dataproc_operator.DataprocClusterCreateOperator(
    task_id='create_cluster',
    dag=dag
)
The code raises no error. When I upload it to the GCP Composer environment, no error notification is shown, but the DAG is not runnable: the Run button does not appear.
See the attached screenshot.
I am using Python 3 and Airflow on composer-1.7.2-airflow-1.10.2.
According to the Data Stored in Cloud Storage page in the Cloud Composer docs:
To avoid a webserver error, make sure that data the webserver needs to parse a DAG (not run) is available in the dags/ folder. Otherwise, the webserver can't access the data or load the Airflow web interface.
Your DAG is attempting to open the YAML file under /home/airflow/gcs/data, which isn't present on the webserver. Put the file under the dags/ folder in your GCS bucket, and it will be accessible to the scheduler, workers, and webserver, and the DAG will work in the Web UI.
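A minimal sketch of that fix, assuming cluster_config.yml has been uploaded next to the DAG file in the dags/ folder. The helper names config_path_next_to and read_yml are hypothetical; yaml.safe_load is the PyYAML call for loading plain data.

```python
import os


def config_path_next_to(dag_file, config_name):
    """Return the path of a config file stored beside the given DAG file."""
    dag_dir = os.path.dirname(os.path.abspath(dag_file))
    return os.path.join(dag_dir, config_name)


def read_yml(path):
    """Parse a YAML file and return its contents (a dict for the config above)."""
    import yaml  # PyYAML; imported here so the path helper stays stdlib-only

    with open(path, "r") as ymlfile:
        return yaml.safe_load(ymlfile)


# Hypothetical usage inside the DAG file:
# cfg = read_yml(config_path_next_to(__file__, "cluster_config.yml"))
# cluster_name = cfg["configs"]["cluster_name"]
```

Resolving the path from the DAG file's own location means the same code finds the file on the scheduler, the workers, and the webserver, since all of them receive the dags/ folder.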
Related
I want to read the email address set in this DAG's default_args from another DAG in Airflow. How can I do that? Please help, I am new to Airflow!
from airflow.models import DagRun
from airflow.operators.python_operator import PythonOperator
from datetime import datetime, timedelta
from airflow import DAG
def first_function(**context):
    print("hello")

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'email': ['example@gmail.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5)
}
with DAG(
    'main',
    default_args=default_args,
    description='Sample DAG',
    schedule_interval=timedelta(days=1),
    start_date=datetime(2022,6,10),
    catchup=False
) as dag:
    first_function = PythonOperator(
        task_id="first_function",
        python_callable=first_function,
    )
    first_function
You can use a custom module in Airflow to share config options, operators, or any arbitrary Python code across DAGs.
Typically you would create a directory inside your DAGs directory (which by default is {AIRFLOW_HOME}/dags).
To share default_args between DAGs, you could create the following layout:
Create {AIRFLOW_HOME}/dags/custom/__init__.py
Create {AIRFLOW_HOME}/dags/custom/shared_config.py
Create {AIRFLOW_HOME}/dags/.airflowignore
Add the directory name custom to the first line of .airflowignore.
Cut and paste your default_args dictionary from your DAG into {AIRFLOW_HOME}/dags/custom/shared_config.py
You can see this layout suggested in the Airflow documentation here.
The .airflowignore tells the scheduler to skip the custom directory when it parses your DAGs (which by default happens every 30s) - because the custom directory contains Python, but never any DAGs, the scheduler should skip these files to avoid adding unnecessary load to the scheduler. This is explained in the documentation link above.
You need to add an __init__.py to the custom directory - airflow requires it even though when writing in Python3 you don't need it because of implicit namespaces (again this is explained in the same link above).
From your dag you can then import as needed:
from custom.shared_config import default_args
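As a rough sketch, custom/shared_config.py could look like the following. The get_default_args helper is a hypothetical addition, not something Airflow provides, and the email address is a placeholder. It hands each DAG its own copy so that one DAG changing a value does not affect the others.

```python
import copy
from datetime import timedelta

# Shared defaults, moved out of the individual DAG files.
default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'email': ['alerts@example.com'],  # placeholder address
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}


def get_default_args(**overrides):
    """Return an independent copy of default_args with optional overrides."""
    args = copy.deepcopy(default_args)
    args.update(overrides)
    return args
```

Another DAG can then read the shared email list with default_args['email'] after the import shown above, or call get_default_args(retries=3) when it needs a slightly different set of defaults.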
I am new to airflow. I have a simple python script my_python_script.py located inside a GCP bucket. I would like to trigger this python script with program arguments using airflow.
My Airflow Python code looks somewhat like this:
import pytz
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from datetime import datetime
from airflow.providers.google.cloud.operators.dataproc import DataprocSubmitJobOperator
from pytz import timezone
from helper import help
import pendulum
config = help.loadJSON("batch/path/to/json")
executor_config = config["executor"]
common_task_args = {
    'owner': 'my_name',
    'depends_on_past': False,
    'email': config["mailingList"],
    'email_on_failure': True,
    'start_date': pendulum.datetime(2022, 4, 10, tz=time_zone),
    'executor_config': executor_config,
    'gcp_conn_id': config["connectionId"],
    'project_id': config["projectId"],
    'location': config["location"]
}
dag = DAG('my_dag',
          default_args=common_task_args,
          is_paused_upon_creation=True,
          catchup=False,
          schedule_interval=None)
simple_python_task = {
    "reference": {"project_id": config["projectId"]},
    "placement": {"cluster_name": config["clusterName"]},
    <TODO: initialise my_python_script.py script located on GCP bucket with program arguments>
}
job_to_be_triggered = DataprocSubmitJobOperator(
    task_id="simple_python_task",
    job=simple_python_task,
    dag=dag
)
job_to_be_triggered
What am I supposed to do in the TODO section in the code snippet above? The idea is to trigger my_python_script.py script located in a GCP dataproc bucket [gs://path/to/python/script.py] with program arguments [to the python script].
PS: It is important for the python script to be on GCP.
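One possible way to fill in the TODO, sketched under the assumption that the script can run as a Dataproc PySpark job. The Dataproc Job resource has a pyspark_job field whose main_python_file_uri points at the script in the bucket and whose args list carries the program arguments. The helper name build_pyspark_job and all the values passed to it below are placeholders.

```python
def build_pyspark_job(project_id, cluster_name, main_python_file_uri, args):
    """Build a Dataproc job dict that runs a Python file stored in GCS."""
    return {
        "reference": {"project_id": project_id},
        "placement": {"cluster_name": cluster_name},
        "pyspark_job": {
            # gs:// URI of the script to run on the cluster
            "main_python_file_uri": main_python_file_uri,
            # Program arguments, passed to the script as its argv
            "args": list(args),
        },
    }


# Placeholder values; in the DAG above they would come from `config`.
simple_python_task = build_pyspark_job(
    project_id="my-project",
    cluster_name="my-cluster",
    main_python_file_uri="gs://path/to/python/script.py",
    args=["--run-date", "2022-04-10"],  # hypothetical arguments
)
```

The resulting dict can be passed as job=simple_python_task to DataprocSubmitJobOperator, as in the snippet above.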
I'm trying to build a DAG with two operators that are created dynamically, depending on the number of "pipelines" in a JSON config file. The file's contents are stored in the variable dag_datafusion_args. I also have a standard Bash operator, and a final task called success that sends a message to Slack saying that the DAG is over. The other two tasks, which are Python operators, are generated dynamically and run in parallel. I'm using Composer. When I put the DAG in the bucket it appears in the webserver UI, but when I click on the DAG the following message appears: 'DAG "dag_lucas4" seems to be missing.' If I test the tasks directly through the CLI on the Kubernetes cluster, they work! But I can't get the DAG to show up in the web UI. Following suggestions from some people here on SO, I tried to restart the webserver by installing a Python package. I tried three times, without success. Does anyone know what the problem could be?
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.operators.python_operator import PythonOperator
from aux_py_files.med.med_airflow_functions import *
from google.cloud import storage
from datetime import timedelta
TEMPLATE_SEARCH_PATH = '/home/airflow/gcs/plugins/'
INDEX=1
default_args = {
    'owner':'lucas',
    'start_date': '2021-01-10',
    'email': ['xxxx'],
    'email_on_failure': False,
    'email_on_success': False,
    'retries': 3,
    'retry_delay': timedelta(minutes=2),
    'on_failure_callback': post_message_fail_to_slack
}
dag_datafusion_args=return_datafusion_config_file('med')
with DAG('dag_lucas4', default_args = default_args, schedule_interval="30 23 * * *", template_searchpath = [TEMPLATE_SEARCH_PATH]) as dag:
    extract_ftp_csv_files_load_in_gcs = BashOperator(
        task_id='extract_ftp_csv_files_load_in_gcs',
        bash_command='aux_sh_files/med/script.sh'
    )
    success = PythonOperator(
        task_id='success',
        python_callable=post_message_success_to_slack,
        op_kwargs={'dag_name':'dag_lucas2'}
    )
    for pipeline,args in dag_datafusion_args.items():
        configure_pipeline=PythonOperator(
            task_id=f'configure_pipeline{str(INDEX)}',
            python_callable=setPipelineArguments,
            op_kwargs={'dag_name':'med', 'pipeline_name':pipeline},
            provide_context=True
        )
        start_pipeline = PythonOperator(
            task_id= f'start_pipeline{str(INDEX)}',
            python_callable=start_pipeline_wrapper,
            op_kwargs={'configure_pipeline_task':f'configure_pipeline{str(INDEX)}'},
            retries=3,
            provide_context=True
        )
        [extract_ftp_csv_files_load_in_gcs,configure_pipeline] >> start_pipeline >> success
        INDEX += 1
It appears that the Airflow webserver in Cloud Composer runs in the tenant project, while the workers and scheduler run in the customer project. The tenant project is simply a Google-managed environment for some of the Airflow components. Because the webserver does not run inside your project's environment, the web UI doesn't have complete access to your project resources. That is why the webserver could not read my JSON config file through return_datafusion_config_file. The best way around this is to create an environment variable with the contents of that file.
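A small sketch of that approach, assuming the JSON document is stored as the value of an environment variable on the Composer environment. The function name load_config_from_env and the variable name DATAFUSION_CONFIG_MED are hypothetical.

```python
import json
import os


def load_config_from_env(var_name, default=None):
    """Parse a JSON document stored in an environment variable.

    Returns `default` (or an empty dict) when the variable is not set, so
    that DAG parsing does not fail on a component that lacks the variable.
    """
    raw = os.environ.get(var_name)
    if raw is None:
        return {} if default is None else default
    return json.loads(raw)


# Hypothetical usage in the DAG file, replacing the file-based loader:
# dag_datafusion_args = load_config_from_env("DATAFUSION_CONFIG_MED")
```

Returning an empty dict when the variable is missing means the loop over dag_datafusion_args.items() simply creates no dynamic tasks instead of breaking the whole DAG.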
I am trying to create a DAG that generates tasks dynamically based on a JSON file located in storage. I followed this guide step-by-step:
https://bigdata-etl.com/apache-airflow-create-dynamic-dag/
But the DAG gets stuck with the following message:
Is it possible to read an external file and use it to create tasks dynamically in Composer? I can do this when I read data only from an Airflow Variable, but when I read an external file, the DAG gets stuck in the "isn't available in the web server's DagBag object" state. I need to read from an external file as the contents of the JSON will change with every execution.
I am using composer-1.8.2-airflow-1.10.2.
I read this answer to a similar question:
Dynamic task definition in Airflow
But I am not trying to create the tasks based on a separate task, only based on the external file.
This is my second approach, which also gets stuck in that error state:
import datetime
import airflow
from airflow.operators import bash_operator
from airflow.operators.dummy_operator import DummyOperator
from airflow.models import Variable
import json
import os
products = json.loads(Variable.get("products"))
default_args = {
    'owner': 'Composer Example',
    'depends_on_past': False,
    'email': [''],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 0,
    'retry_delay': datetime.timedelta(minutes=5),
    'start_date': datetime.datetime(2020, 1, 10),
}
with airflow.DAG(
        'json_test2',
        default_args=default_args,
        # Not scheduled, trigger only
        schedule_interval=None) as dag:
    # Print the dag_run's configuration, which includes information about the
    # Cloud Storage object change.
    def read_json_file(file_path):
        if os.path.exists(file_path):
            with open(file_path, 'r') as f:
                return json.load(f)

    def get_run_list(files):
        run_list = []
        #The file is uploaded in the storage bucket used as a volume by Composer
        last_exec_json = read_json_file("/home/airflow/gcs/data/last_execution.json")
        date = last_exec_json["date"]
        hour = last_exec_json["hour"]
        for file in files:
            #Testing by adding just date and hour
            name = file['name']+f'_{date}_{hour}'
            run_list.append(name)
        return run_list

    rl = get_run_list(products)
    start = DummyOperator(task_id='start', dag=dag)
    end = DummyOperator(task_id='end', dag=dag)
    for name in rl:
        tsk = DummyOperator(task_id=name, dag=dag)
        start >> tsk >> end
It is possible to create a DAG that generates tasks dynamically based on a JSON file located in a Cloud Storage bucket. I followed the guide that you provided, and it works perfectly in my case.
First, upload your JSON configuration file to the $AIRFLOW_HOME/dags directory, and then upload the DAG Python file to the same path (you can find the path in the airflow.cfg file, which is located in the bucket).
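The file-reading part of the question's DAG could then be sketched as follows. This assumes last_execution.json sits in the same directory as the DAG file. The fallback argument is a hypothetical addition that keeps parsing from failing when the file is missing, and get_run_list takes the parsed JSON as a parameter instead of reading it itself.

```python
import json
import os


def read_json_file(file_path, fallback):
    """Return the parsed JSON file, or `fallback` if the file does not exist."""
    if not os.path.exists(file_path):
        return fallback
    with open(file_path, "r") as f:
        return json.load(f)


def get_run_list(products, last_exec):
    """Build one task name per product, suffixed with the last run's date and hour."""
    return [f"{p['name']}_{last_exec['date']}_{last_exec['hour']}" for p in products]


# Hypothetical usage inside the DAG file:
# here = os.path.dirname(os.path.abspath(__file__))
# last_exec = read_json_file(os.path.join(here, "last_execution.json"),
#                            fallback={"date": "unknown", "hour": "00"})
# rl = get_run_list(products, last_exec)
```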
Later on, you will be able to see DAG in Airflow UI:
As the log message "DAG isn't available in the web server's DagBag object" indicates, the DAG isn't available on the Airflow web server. However, the DAG can still be scheduled as active, because the Airflow scheduler works independently of the Airflow web server.
When a lot of DAGs are loaded into a Composer environment at once, they may overload the environment. As the Airflow webserver runs in a Google-managed project, only certain types of updates cause the webserver container to be restarted, such as adding or upgrading one of the PyPI packages or changing an Airflow setting. The workaround is to add a dummy environment variable:
Open the Composer instance in GCP
Go to the ENVIRONMENT VARIABLES tab
Click Edit, then add an environment variable and Submit
Alternatively, you can use the following commands to restart it:
gcloud composer environments update ${ENVIRONMENT_NAME} --location=${ENV_LOCATION} --update-airflow-configs=core-dummy=true
gcloud composer environments update ${ENVIRONMENT_NAME} --location=${ENV_LOCATION} --remove-airflow-configs=core-dummy
I hope you find the above pieces of information useful.
Below is the airflow DAG code. It runs perfectly both when airflow is hosted locally, and on cloud composer. However, the DAG itself isn't clickable in the Composer UI.
I found a similar question and tried the accepted answer as linked in this question. My problem is similar.
import airflow
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.python_operator import PythonOperator
from airflow.operators.mysql_operator import MySqlOperator
from airflow.contrib.operators.dataproc_operator import DataprocClusterCreateOperator
from airflow.contrib.operators.dataproc_operator import DataprocClusterDeleteOperator
from airflow.contrib.operators.dataproc_operator import DataProcSparkOperator
from datetime import datetime, timedelta
import sys
#copy this package to dag directory in GCP composer bucket
from schemas.schemaValidator import loadSchema
from schemas.schemaValidator import sparkArgListToMap
#change these paths to point to GCP Composer data directory
## cluster config
clusterConfig= loadSchema("somePath/jobConfig/cluster.yaml","cluster")
##per job yaml config
autoLoanCsvToParquetConfig= loadSchema("somePath/jobConfig/job.yaml","job")
default_args= {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2019, 1, 1),
    'retries': 1,
    'retry_delay': timedelta(minutes=3)
}
dag= DAG('usr_job', default_args=default_args, schedule_interval=None)
t1= DummyOperator(task_id= "start", dag=dag)
t2= DataprocClusterCreateOperator(
    task_id= "CreateCluster",
    cluster_name= clusterConfig["cluster"]["cluster_name"],
    project_id= clusterConfig["project_id"],
    num_workers= clusterConfig["cluster"]["worker_config"]["num_instances"],
    image_version= clusterConfig["cluster"]["dataproc_img"],
    master_machine_type= clusterConfig["cluster"]["worker_config"]["machine_type"],
    worker_machine_type= clusterConfig["cluster"]["worker_config"]["machine_type"],
    zone= clusterConfig["region"],
    dag=dag
)
t3= DataProcSparkOperator(
    task_id= "csvToParquet",
    main_class= autoLoanCsvToParquetConfig["job"]["main_class"],
    arguments= autoLoanCsvToParquetConfig["job"]["args"],
    cluster_name= clusterConfig["cluster"]["cluster_name"],
    dataproc_spark_jars= autoLoanCsvToParquetConfig["job"]["jarPath"],
    dataproc_spark_properties= sparkArgListToMap(autoLoanCsvToParquetConfig["spark_params"]),
    dag=dag
)
t4= DataprocClusterDeleteOperator(
    task_id= "deleteCluster",
    cluster_name= clusterConfig["cluster"]["cluster_name"],
    project_id= clusterConfig["project_id"],
    dag= dag
)
t5= DummyOperator(task_id= "stop", dag=dag)
t1>>t2>>t3>>t4>>t5
The UI gives this error - "This DAG isn't available in the webserver DAG bag object. It shows up in this list because the scheduler marked it as active in the metadata database."
And yet, when I triggered the DAG manually on Composer, I found it ran successfully through the log files.
The issue was the path I was providing for the configuration files. I was giving a path in the data folder in GCS. As per the Google documentation, only the dags folder is synced to all nodes, not the data folder.
Needless to say, this was an issue encountered at DAG parsing time, so the DAG did not appear correctly in the UI. More interestingly, these debug messages were not exposed in Composer 1.5 and earlier. Now they are available to the end user to help with debugging. Thanks anyway to everyone who helped.
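To catch this kind of mistake early, one option is a small guard that rejects config paths outside the dags folder before they are opened. This is only a sketch: the function name require_in_dags_folder is hypothetical, and the dags folder location is passed in by the caller rather than assumed.

```python
import os


def require_in_dags_folder(path, dags_folder):
    """Return the absolute path if it lies inside dags_folder, else raise ValueError."""
    abs_path = os.path.abspath(path)
    abs_dags = os.path.abspath(dags_folder)
    # commonpath compares whole path components, so "dags2" does not match "dags"
    if os.path.commonpath([abs_path, abs_dags]) != abs_dags:
        raise ValueError(
            f"{abs_path} is outside {abs_dags}; files read at DAG parsing "
            "time must live in the dags folder"
        )
    return abs_path


# Hypothetical usage before calling loadSchema:
# cluster_yaml = require_in_dags_folder(
#     os.path.join(os.path.dirname(__file__), "jobConfig", "cluster.yaml"),
#     os.path.dirname(__file__),
# )
```

A failure then shows up as an explicit parsing error that names the offending path, instead of a DAG that is listed but cannot be opened.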