I'm trying to run a DAG using a custom operator for a task, but the job_name parameter (set automatically to match the task_id) seems to be read as a folder path instead of a plain string. For example, "example_task" is read as "\example_task\". By default, Airflow does not accept the "\" character in job_name.
Here is the code:
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from datamechanics_airflow_plugin.operator import DataMechanicsOperator
from airflow.utils.dates import days_ago
import pendulum

local_tz = pendulum.timezone("America/Sao_Paulo")

with DAG(
    dag_id="processing_dag",
    start_date=days_ago(1).astimezone(tz=local_tz),
    schedule_interval="@daily",
) as dag:
    start = DummyOperator(task_id="start")
    end = DummyOperator(task_id="end")
    landing_to_processing = DataMechanicsOperator(
        task_id="landing_to_processing",
        config_template_name="spark-gcp-hudi",
        config_overrides={
            "mainApplicationFile": "/home/edu/airflow/dags/scripts/process_landing_data.py",
        },
    )
    start >> landing_to_processing >> end
The DataMechanicsOperator comes from https://www.datamechanics.co/ and the plugin has been correctly installed.
Here is part of the Airflow UI error message:
airflow.exceptions.AirflowException: Response: b'{"errors": {"jobName": "\'landing_to_processing\' does not match \'^[a-z0-9]([-a-z0-9]*[a-z0-9])?(\\\\\\\\\\\\\\\\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*$\'"}, "message": "Input payload validation failed"}\n', Status Code: 400
I've tested the code in a local Airflow Server and in a Docker container.
I really can't see what could possibly be causing this.
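One way to read the error message above is to test the task name against the pattern it quotes. The sketch below assumes the long runs of backslashes in the message are escaping layers around a single `\.`; under that assumption the pattern permits only lowercase letters, digits, hyphens and dots, so the underscores in the name are what fail. `to_valid_job_name` is a hypothetical helper, not part of the plugin, and whether the operator accepts an explicit job_name is not confirmed here.

```python
import re

# Pattern quoted in the 400 response, with the escaping layers collapsed.
# Assumption: the repeated backslashes stand for one literal "\.".
JOB_NAME_PATTERN = re.compile(
    r"^[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*$"
)


def is_valid_job_name(name):
    """Return True if the name matches the pattern from the error message."""
    return JOB_NAME_PATTERN.match(name) is not None


def to_valid_job_name(task_id):
    """Hypothetical helper: derive a name the pattern accepts from a task_id."""
    return task_id.lower().replace("_", "-")


print(is_valid_job_name("landing_to_processing"))  # underscores are rejected
print(is_valid_job_name(to_valid_job_name("landing_to_processing")))
```

If that reading is right, a task_id (or an explicitly passed job_name) written with hyphens instead of underscores would pass this validation.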
Related
I'm trying to build a DAG with two operators that are created dynamically, depending on the number of "pipelines" in a JSON config file. This file is stored in the variable dag_datafusion_args. I also have a standard BashOperator, and a final task called success that sends a message to Slack saying that the DAG has finished. The other two tasks, which are PythonOperators, are generated dynamically and run in parallel. I'm using Composer. When I put the DAG in the bucket it appears in the webserver UI, but when I click to view the DAG the following message appears: 'DAG "dag_lucas4" seems to be missing.' If I test the tasks directly through the CLI on the Kubernetes cluster, they work, but I can't get the DAG to show in the web UI. Following a suggestion from some people here on SO, I tried restarting the webserver by installing a Python package. I tried three times, without success. Does anyone know what it could be?
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.operators.python_operator import PythonOperator
from aux_py_files.med.med_airflow_functions import *
from google.cloud import storage
from datetime import timedelta

TEMPLATE_SEARCH_PATH = '/home/airflow/gcs/plugins/'
INDEX = 1
default_args = {
    'owner': 'lucas',
    'start_date': '2021-01-10',
    'email': ['xxxx'],
    'email_on_failure': False,
    'email_on_success': False,
    'retries': 3,
    'retry_delay': timedelta(minutes=2),
    'on_failure_callback': post_message_fail_to_slack
}
dag_datafusion_args = return_datafusion_config_file('med')

with DAG('dag_lucas4', default_args=default_args, schedule_interval="30 23 * * *", template_searchpath=[TEMPLATE_SEARCH_PATH]) as dag:
    extract_ftp_csv_files_load_in_gcs = BashOperator(
        task_id='extract_ftp_csv_files_load_in_gcs',
        bash_command='aux_sh_files/med/script.sh'
    )
    success = PythonOperator(
        task_id='success',
        python_callable=post_message_success_to_slack,
        op_kwargs={'dag_name': 'dag_lucas2'}
    )
    for pipeline, args in dag_datafusion_args.items():
        configure_pipeline = PythonOperator(
            task_id=f'configure_pipeline{str(INDEX)}',
            python_callable=setPipelineArguments,
            op_kwargs={'dag_name': 'med', 'pipeline_name': pipeline},
            provide_context=True
        )
        start_pipeline = PythonOperator(
            task_id=f'start_pipeline{str(INDEX)}',
            python_callable=start_pipeline_wrapper,
            op_kwargs={'configure_pipeline_task': f'configure_pipeline{str(INDEX)}'},
            retries=3,
            provide_context=True
        )
        [extract_ftp_csv_files_load_in_gcs, configure_pipeline] >> start_pipeline >> success
        INDEX += 1
It appears that the Airflow webserver in Cloud Composer runs in the tenant project, while the workers and scheduler run in the customer project. The tenant project is a Google-managed environment for some of the Airflow components. Because the webserver does not run in your project's environment, the webserver UI doesn't have complete access to your project resources, so it can't read my config JSON file with return_datafusion_config_file. The best way is to create an ENV variable with that file.
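A minimal sketch of the environment-variable approach, assuming the JSON config is small enough to store as a string in one variable. The variable name `DATAFUSION_CONFIG_JSON` and the function `load_pipeline_config` are hypothetical; the original return_datafusion_config_file is not shown, so this is only a stand-in for it.

```python
import json
import os


def load_pipeline_config(env_var="DATAFUSION_CONFIG_JSON"):
    """Read the pipeline config from an environment variable.

    Returns an empty dict when the variable is unset, so parsing the DAG
    file never fails just because the config is missing; the DAG then
    simply has no dynamically generated tasks.
    """
    raw = os.environ.get(env_var)
    if not raw:
        return {}
    return json.loads(raw)


# Example with a made-up config value (hypothetical pipeline names).
os.environ["DATAFUSION_CONFIG_JSON"] = json.dumps(
    {"pipeline_a": {"table": "t1"}, "pipeline_b": {"table": "t2"}}
)
for pipeline, args in load_pipeline_config().items():
    print(pipeline, args)
```

Since an environment variable is visible to every Airflow component, the webserver can parse the DAG without reaching into the project's storage.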
I'm trying to discover why we are getting this error. Is it a missing dependency? A version problem? Why is this happening with just this one DAG and not the others?
The error is:
FileNotFoundError: [Errno 2] No such file or directory: /home/airflow/composer_kube_config
Here's our DAG:
import datetime
from airflow import DAG
from airflow.contrib.operators.kubernetes_pod_operator import KubernetesPodOperator
from airflow.contrib.kubernetes.secret import Secret
from airflow.contrib.kubernetes.volume import Volume
from airflow.contrib.kubernetes.volume_mount import VolumeMount
# from airflow.contrib.kubernetes.pod import Port
from utils.constants import DEFAULT_ARGS, DUMB_BUCKET, SCHEMA_BUCKET, PROJECT, \
    CLOUD_COMPOSER_SERVICE_ACCOUNT_SECRET, STAGING_BUCKET

volume_mount = VolumeMount(
    'secret',
    mount_path='/etc/secret',
    sub_path=None,
    read_only=True
)
volume_config = {'persistentVolumeClaim': {'claimName': 'all-ftp'}}
volume = Volume(name='secret', configs=volume_config)

with DAG('ftp_file_poller', schedule_interval="55 6 * * *", start_date=datetime.datetime(2020,7,1)) as dag:
    poller = KubernetesPodOperator(
        secrets=[CLOUD_COMPOSER_SERVICE_ACCOUNT_SECRET],
        task_id='ftp-file-poller',
        name='ftp-polling',
        cmds=['ftp-poller'],
        namespace='default',
        image='us.gcr.io/<our gcp project>/ftp-poller:v7',
        is_delete_operator_pod=True,
        get_logs=True,
        volumes=[volume],
        volume_mounts=[volume_mount]
    )
    poller.doc = """
    about this DAG info
    """
Here's a quote I found in the docs about this file:
# Only name, namespace, image, and task_id are required to create a
# KubernetesPodOperator. In Cloud Composer, currently the operator defaults
# to using the config file found at `/home/airflow/composer_kube_config` if
# no `config_file` parameter is specified. By default it will contain the
# credentials for Cloud Composer's Google Kubernetes Engine cluster that is
# created upon environment creation.
This was solved by adding the DEFAULT_ARGS constant to the DAG definition like so:
with DAG(dag_id='ftp_file_poller',
         schedule_interval="55 6 * * *",
         start_date=datetime.datetime(2020,7,1),
         default_args=DEFAULT_ARGS) as dag:
I want to customize my DAGs to send an email when a run fails or succeeds. I'm trying to use on_success_callback and on_failure_callback in the DAG constructor, but they don't fire for the DAG. At the same time, they work for the DummyOperator that I put inside my DAG.
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.python_operator import PythonOperator
from datetime import datetime
from utils import get_report_operator, DagStatus

TEST_DAG_NAME = 'test_dag'
TEST_DAG_REPORT_SUBSCRIBERS = ['MY_EMAIL']

def send_success_report(context):
    subject = 'Airflow report: {0} run success'.format(TEST_DAG_NAME)
    email_operator = get_report_operator(subject, TEST_DAG_REPORT_SUBSCRIBERS, TEST_DAG_NAME, DagStatus.SUCCESS)
    email_operator.execute(context)

def send_failed_report(context):
    subject = 'Airflow report: {0} run failed'.format(TEST_DAG_NAME)
    email_operator = get_report_operator(subject, TEST_DAG_REPORT_SUBSCRIBERS, TEST_DAG_NAME, DagStatus.FAILED)
    email_operator.execute(context)

dag = DAG(dag_id=TEST_DAG_NAME,
          schedule_interval=None,
          start_date=datetime(2019,6,6),
          on_success_callback=send_success_report,
          on_failure_callback=send_failed_report)

DummyOperator(task_id='task',
              on_success_callback=send_success_report,
              on_failure_callback=send_failed_report,
              dag=dag)
I've also implemented a wrapper around the Airflow EmailOperator to send the report. I don't think the error is in this part, but here it is anyway.
import os
from datetime import datetime
from enum import Enum

from airflow import DAG
from airflow.operators.email_operator import EmailOperator

class DagStatus(Enum):
    SUCCESS = 0
    FAILED = 1

def get_report_operator(sbjct, to_lst, dag_id, dag_status):
    status = 'SUCCESS' if dag_status == DagStatus.SUCCESS else 'FAILED'
    status_color = '#87C540' if dag_status == DagStatus.SUCCESS else '#FF1717'
    # Fill the HTML report template that sits next to this module.
    with open(os.path.join(os.path.dirname(__file__), 'airflow_report.html'), 'r', encoding='utf-8') as report_file:
        report_mask = report_file.read()
    report_text = report_mask.format(dag_id, status, status_color)
    # The operator needs a DAG to attach to, so a throwaway one is created.
    tmp_dag = DAG(dag_id='tmp_dag', start_date=datetime(year=2019, month=9, day=12), schedule_interval=None)
    return EmailOperator(task_id='send_email',
                         to=to_lst,
                         subject=sbjct,
                         html_content=report_text.encode('utf-8'),
                         dag=tmp_dag)
What am I doing wrong?
Instead, put on_failure_callback as an argument in the default_args dictionary and pass it to the DAG.
All arguments in default_args passed to a DAG are applied to all of the DAG's operators. It's the only way, as of now, to apply a common parameter to all the operators in the DAG.
dag = DAG(dag_id=TEST_DAG_NAME,
          schedule_interval=None,
          start_date=datetime(2019,6,6),
          default_args={
              'on_success_callback': send_success_report,
              'on_failure_callback': send_failed_report
          })
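Because callbacks set through default_args are attached to each operator, they fire once per task rather than once per DAG run. The sketch below shows a callback reading the task's identity from the context dictionary it receives. It assumes the context exposes a "task_instance" entry with dag_id and task_id attributes; `build_report_subject` and the stand-in context are hypothetical, and a real callback would hand the subject to the email operator instead of returning it.

```python
from types import SimpleNamespace


def build_report_subject(context, status):
    """Build an email subject from the context passed to a task callback."""
    ti = context["task_instance"]
    return "Airflow report: {0}.{1} {2}".format(ti.dag_id, ti.task_id, status)


def send_success_report(context):
    # A real callback would send the email here; this sketch returns the subject.
    return build_report_subject(context, "success")


def send_failed_report(context):
    return build_report_subject(context, "failed")


# These are applied to every operator created in the DAG.
default_args = {
    "on_success_callback": send_success_report,
    "on_failure_callback": send_failed_report,
}

# Stand-in for the context Airflow would supply (hypothetical values).
fake_context = {"task_instance": SimpleNamespace(dag_id="test_dag", task_id="task")}
print(default_args["on_failure_callback"](fake_context))
```

With several tasks in the DAG, this design sends one report per task, so including the task_id in the subject keeps the emails distinguishable.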
I am new to Python and Airflow DAGs.
I am following the link below and the code mentioned in its answer section.
How to pass dynamic arguments Airflow operator?
I am having trouble reading a YAML file. The YAML file holds some configuration-related arguments.
configs:
  cluster_name: "test-cluster"
  project_id: "t***********"
  zone: "europe-west1-c"
  num_workers: 2
  worker_machine_type: "n1-standard-1"
  master_machine_type: "n1-standard-1"
In the DAG script I created one task that creates the cluster. Before executing this task we need all the arguments that must be passed in its default_args parameter, such as cluster_name, project_id, etc. To read those parameters I created a readYML method; see the code below.
from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from datetime import datetime, timedelta
from zipfile import ZipFile
from airflow.contrib.operators.dataproc_operator import DataprocClusterCreateOperator
from airflow.models import Variable
import yaml

def readYML():
    print("inside readYML")
    global cfg
    file_name = "/home/airflow/gcs/data/cluster_config.yml"
    with open(file_name, 'r') as ymlfile:
        cfg = yaml.load(ymlfile)
    print(cfg['configs']['cluster_name'])

# Default Arguments
readYML()
dag_name = Variable.get("dag_name")
default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime.now(),
    'email': ['airflow@example.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
    #'cluster_name': cfg['configs']['cluster_name'],
}

# Instantiate a DAG
dag = DAG(dag_id='read_yml', default_args=default_args,
          schedule_interval=timedelta(days=1))

# Creating Tasks
Task1 = DataprocClusterCreateOperator(
    task_id='create_cluster',
    dag=dag
)
There is no error in this code. When I upload it to the GCP Composer environment, no error notification is shown, but the DAG is not runnable: no Run button appears.
See attached screenshot.
I am using Python 3 and composer-1.7.2-airflow-1.10.2.
According to the Data Stored in Cloud Storage page in the Cloud Composer docs:
To avoid a webserver error, make sure that data the webserver needs to parse a DAG (not run) is available in the dags/ folder. Otherwise, the webserver can't access the data or load the Airflow web interface.
Your DAG is attempting to open the YAML file under /home/airflow/gcs/data, which isn't present on the webserver. Put the file under the dags/ folder in your GCS bucket, and it will be accessible to the scheduler, workers, and webserver, and the DAG will work in the Web UI.
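A minimal sketch of resolving the config file next to the DAG file instead of under the data/ folder, assuming cluster_config.yml has been uploaded to the same dags/ directory as the DAG script. `config_path` is a hypothetical helper, and the file names passed to it are illustrative.

```python
from pathlib import Path


def config_path(dag_file, filename="cluster_config.yml"):
    """Return the path of a config file stored beside the given DAG file."""
    return Path(dag_file).resolve().parent / filename


# In a DAG file this would be called as config_path(__file__).
example = config_path("/home/airflow/gcs/dags/read_yml.py")
print(example.name, example.parent.name)
```

In the DAG script, the result can be opened as before, for example `with open(config_path(__file__)) as ymlfile: cfg = yaml.safe_load(ymlfile)`, so the path no longer depends on where each component mounts the bucket.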
Below is the airflow DAG code. It runs perfectly both when airflow is hosted locally, and on cloud composer. However, the DAG itself isn't clickable in the Composer UI.
I found a similar question and tried the accepted answer as linked in this question. My problem is similar.
import airflow
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.python_operator import PythonOperator
from airflow.operators.mysql_operator import MySqlOperator
from airflow.contrib.operators.dataproc_operator import DataprocClusterCreateOperator
from airflow.contrib.operators.dataproc_operator import DataprocClusterDeleteOperator
from airflow.contrib.operators.dataproc_operator import DataProcSparkOperator
from datetime import datetime, timedelta
import sys

# copy this package to dag directory in GCP composer bucket
from schemas.schemaValidator import loadSchema
from schemas.schemaValidator import sparkArgListToMap

# change these paths to point to GCP Composer data directory
## cluster config
clusterConfig = loadSchema("somePath/jobConfig/cluster.yaml", "cluster")
## per job yaml config
autoLoanCsvToParquetConfig = loadSchema("somePath/jobConfig/job.yaml", "job")

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2019, 1, 1),
    'retries': 1,
    'retry_delay': timedelta(minutes=3)
}

dag = DAG('usr_job', default_args=default_args, schedule_interval=None)

t1 = DummyOperator(task_id="start", dag=dag)

t2 = DataprocClusterCreateOperator(
    task_id="CreateCluster",
    cluster_name=clusterConfig["cluster"]["cluster_name"],
    project_id=clusterConfig["project_id"],
    num_workers=clusterConfig["cluster"]["worker_config"]["num_instances"],
    image_version=clusterConfig["cluster"]["dataproc_img"],
    master_machine_type=clusterConfig["cluster"]["worker_config"]["machine_type"],
    worker_machine_type=clusterConfig["cluster"]["worker_config"]["machine_type"],
    zone=clusterConfig["region"],
    dag=dag
)

t3 = DataProcSparkOperator(
    task_id="csvToParquet",
    main_class=autoLoanCsvToParquetConfig["job"]["main_class"],
    arguments=autoLoanCsvToParquetConfig["job"]["args"],
    cluster_name=clusterConfig["cluster"]["cluster_name"],
    dataproc_spark_jars=autoLoanCsvToParquetConfig["job"]["jarPath"],
    dataproc_spark_properties=sparkArgListToMap(autoLoanCsvToParquetConfig["spark_params"]),
    dag=dag
)

t4 = DataprocClusterDeleteOperator(
    task_id="deleteCluster",
    cluster_name=clusterConfig["cluster"]["cluster_name"],
    project_id=clusterConfig["project_id"],
    dag=dag
)

t5 = DummyOperator(task_id="stop", dag=dag)

t1 >> t2 >> t3 >> t4 >> t5
The UI gives this error - "This DAG isn't available in the webserver DAG bag object. It shows up in this list because the scheduler marked it as active in the metadata database."
And yet, when I triggered the DAG manually on Composer, the log files showed that it ran successfully.
The issue was with the path provided for picking up the configuration files. I was giving a path in the data folder in GCS. As per the Google documentation, only the dags folder is synced to all nodes, not the data folder.
Needless to say, it was an issue encountered at DAG parsing time, hence the DAG did not appear correctly in the UI. More interestingly, these debug messages were not exposed in Composer 1.5 and earlier. Now they are available to the end user to help with debugging. Thanks anyway to everyone who helped.