I am working on an Airflow POC and have written a DAG that runs a script over SSH on a server. It sends an alert if the script fails to execute, but if the script runs and a command inside the script fails, no mail is sent.
Example: I'm running a script that takes a backup of a DB2 database. If the instance is down, the backup command fails, but we get no alert because the script itself exits successfully.
from airflow.models import DAG, Variable
from airflow.contrib.operators.ssh_operator import SSHOperator
from datetime import datetime, timedelta
import airflow
import os
import logging
import airflow.settings
from airflow.utils.dates import days_ago
from airflow.operators.email_operator import EmailOperator
from airflow.models import TaskInstance
from airflow.operators.python_operator import BranchPythonOperator
from airflow.operators.dummy_operator import DummyOperator
START_DATE = airflow.utils.dates.days_ago(1)
SCHEDULE_INTERVAL = "5,10,15,20,30,40 * * * * *"
log = logging.getLogger(__name__)
# Use this to grab the pushed value and determine your path
def determine_branch(**kwargs):
    """Define the pathway of the branch operator based on the return code pushed to XCom by the SSHOperator."""
    return_code = kwargs['ti'].xcom_pull(task_ids='SSHTest')
    print("From Kwargs: ", return_code)
    # The returned value must match the task_id of a downstream task.
    if return_code == '1':
        return 'send_email'
    return 'end'
# DAG for airflow task
dag_email_recipient = ["mailid"]
# These args will get passed on to the SSH operator
default_args = {
    'owner': 'afuser',
    'depends_on_past': False,
    'email': ['mailid'],
    'email_on_failure': True,
    'email_on_retry': False,
    'start_date': START_DATE,
    'retries': 1,
    'retry_delay': timedelta(minutes=1),
}
with DAG(
    'DB2_SSH_TEST',
    default_args=default_args,
    description='How to use the SSH Operator?',
    schedule_interval=SCHEDULE_INTERVAL,
    start_date=START_DATE,
    catchup=False,
) as dag:
    # Be sure to add '; echo $?' to the end of your bash script for this to work.
    t1 = SSHOperator(
        ssh_conn_id='DB2_Jabxusr_SSH',
        task_id='SSHTest',
        xcom_push=True,
        command='/path/backup_script',
        provide_context=True,
    )
    # Function defined above called here
    t2 = BranchPythonOperator(
        task_id='check_ssh_output',
        python_callable=determine_branch,
        provide_context=True,
    )
    # If we don't want to send email
    t3 = DummyOperator(
        task_id='end'
    )
    # If we do
    t4 = EmailOperator(
        task_id='send_email',
        to=dag_email_recipient,
        subject='Airflow Success: process incoming files',
        files=None,
        html_content='',
        #...
    )
    t1 >> t2
    t2 >> t3
    t2 >> t4
You can configure email notifications per task. First you need to configure your email provider, and then when you create your task you can set the "email_on_retry" and "email_on_failure" flags, or you can even write your own custom "on_failure_callback" hook where you code your own logic deciding when and how to notify.
Here is a very nice Astronomer article explaining all the ins and outs of notifications: https://www.astronomer.io/guides/error-notifications-in-airflow
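For example, here is a minimal sketch of such a custom on-failure hook wired in through default_args. The recipient list and the helper name notify_on_failure are placeholders, and it assumes SMTP is already configured in airflow.cfg so Airflow can actually send mail:

from airflow.utils.email import send_email

def notify_on_failure(context):
    """Called by Airflow with the task context when a task instance fails."""
    ti = context['task_instance']
    subject = f"Airflow failure: {ti.dag_id}.{ti.task_id}"
    body = (
        f"Task {ti.task_id} in DAG {ti.dag_id} failed for "
        f"execution date {context['execution_date']}.<br>"
        f"Log URL: {ti.log_url}"
    )
    send_email(to=['mailid'], subject=subject, html_content=body)

default_args = {
    # built-in flags: Airflow mails the 'email' list automatically on failure/retry
    'email': ['mailid'],
    'email_on_failure': True,
    'email_on_retry': False,
    # custom hook: called with the task context whenever a task fails
    'on_failure_callback': notify_on_failure,
}

Note that any of these mechanisms only fire if the task itself fails, so the remote script still has to exit with a non-zero status when the backup command fails (for example by running it under set -e, or by checking the backup command's exit code explicitly).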
I am completely new to Apache Airflow. I have a situation. The code I have used is
import airflow
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from datetime import datetime, timedelta
from airflow.contrib.operators.ssh_operator import SSHOperator
from airflow.contrib.hooks.ssh_hook import SSHHook
default_args = {
    'owner': 'john',
    'depends_on_past': False,
    'email': [''],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 0,
    'retry_delay': timedelta(minutes=5)
}
dag = DAG(
    'tutorial',
    default_args=default_args,
    description='A simple tutorial DAG',
    schedule_interval=None)
bash_tutorial = """
echo "Execute shell file: /A/B/server/tutorial.ksh"
echo "{{ macros.ds_format(ds, "%Y-%m-%d", "%m-%d-%Y") }}"
source /home/johnbs/.profile
/A/B/server/tutorial.ksh
"""
t1 = SSHOperator(
    ssh_conn_id='dev',
    task_id='tutorial.ksh',
    command=bash_tutorial,
    dag=dag
)
Using Airflow, I want to trigger a ksh script on different servers, such as the dev and test servers.
tutorial.ksh is present on the dev server (conn_id 'dev') at the path /A/B/C/tutorial.ksh, and on the test server (conn_id 'test') at the path /A/B/D/tutorial.ksh. Notice folder C on dev and folder D on test. Which part of the code should I update?
Each instance of SSHOperator executes a command on a single server.
You will need to define each connection separately, as explained in the docs, and then you can do:
server_connection_ids = ['dev', 'test']

start_op = DummyOperator(task_id="start_task", dag=dag)

for conn in server_connection_ids:
    bash_tutorial = f"""
    echo "Execute shell file: /A/B/{conn}/tutorial.ksh"
    echo "{{{{ macros.ds_format(ds, "%Y-%m-%d", "%m-%d-%Y") }}}}"
    source /home/johnbs/.profile
    /A/B/{conn}/tutorial.ksh
    """
    ssh_op = SSHOperator(
        ssh_conn_id=conn,
        task_id=f'ssh_{conn}_task',
        command=bash_tutorial,
        dag=dag
    )
    start_op >> ssh_op
This will create a task per server.
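If the script lives in a different folder on each server (C on the dev server, D on the test server, as in the question), one hedged variation is to map each connection id to its own directory instead of reusing the connection name in the path. The dictionary below is an assumption about your layout and reuses the imports from the snippet above:

# Hypothetical mapping of connection id -> directory that holds tutorial.ksh.
server_script_dirs = {
    'dev': '/A/B/C',
    'test': '/A/B/D',
}

start_op = DummyOperator(task_id="start_task", dag=dag)

for conn, script_dir in server_script_dirs.items():
    bash_tutorial = f"""
    echo "Execute shell file: {script_dir}/tutorial.ksh"
    echo "{{{{ macros.ds_format(ds, "%Y-%m-%d", "%m-%d-%Y") }}}}"
    source /home/johnbs/.profile
    {script_dir}/tutorial.ksh
    """
    ssh_op = SSHOperator(
        ssh_conn_id=conn,
        task_id=f'ssh_{conn}_task',
        command=bash_tutorial,
        dag=dag,
    )
    start_op >> ssh_op  # each SSH task runs after the start task

The quadrupled braces are needed because this is an f-string: {{{{ ... }}}} renders as {{ ... }}, which Jinja then evaluates at runtime.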
I am trying to run a simple DAG in Airflow to execute a Python file, and it is throwing the error: can't open the file '/User/....'.
Below is the script I am using.
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.bash_operator import BashOperator
from airflow.operators.python_operator import PythonOperator
from datetime import datetime,timedelta
default_args = {
    'owner': 'airflow',
    'start_date': datetime(2021, 3, 2),
    'depends_on_past': False,
    'retries': 0
}
dag=DAG(dag_id='DummyDag',default_args=default_args,catchup=False,schedule_interval='@once')
command_load='python /usr/local/airflow/dags/dummy.py '
load=BashOperator(task_id='loadfile',bash_command='python /Users/<user-id>/MyDrive/DataEngineeringAssignment/dummydata.py',dag=dag)
start=DummyOperator(task_id='Start',dag=dag)
end=DummyOperator(task_id='End',dag=dag)
start >> load >> end
Is there anywhere I am going wrong?
Option 1:
The file location (dummydata.py) is relative to the directory containing the pipeline file (DAG file).
dag = DAG(
    dag_id='DummyDag',
    ...
)
load = BashOperator(
    task_id='loadfile',
    bash_command='python dummydata.py',
    dag=dag
)
Option 2:
Define template_searchpath to point to any folder locations in the DAG constructor call.
dag = DAG(
    dag_id='DummyDag',
    ...,
    template_searchpath=['/Users/<user-id>/MyDrive/DataEngineeringAssignment/']
)
load = BashOperator(
    task_id='loadfile',
    # "dummydata.py" is a file under "/Users/<user-id>/MyDrive/DataEngineeringAssignment/"
    bash_command='python dummydata.py ',  # Note: the trailing space is important!
    dag=dag
)
For more information you can read about it in the docs
I'm trying to make a DAG that has two operators created dynamically, depending on the number of "pipelines" in a JSON config file stored in the variable dag_datafusion_args. Then I have a standard BashOperator, plus a task called success at the end that sends a message to Slack saying that the DAG is over. The two PythonOperator tasks are generated dynamically and run in parallel. I'm using Cloud Composer: when I put the DAG in the bucket it appears in the webserver UI, but when I click to see the DAG, the message 'DAG "dag_lucas4" seems to be missing.' appears. If I test the tasks directly via the CLI on the Kubernetes cluster it works, but I can't get the web UI to show the DAG. As suggested in other SO answers, I tried restarting the webserver by installing a Python package; I tried three times without success. Does anyone know what it can be?
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.operators.python_operator import PythonOperator
from aux_py_files.med.med_airflow_functions import *
from google.cloud import storage
from datetime import timedelta
TEMPLATE_SEARCH_PATH = '/home/airflow/gcs/plugins/'
INDEX=1
default_args = {
    'owner': 'lucas',
    'start_date': '2021-01-10',
    'email': ['xxxx'],
    'email_on_failure': False,
    'email_on_success': False,
    'retries': 3,
    'retry_delay': timedelta(minutes=2),
    'on_failure_callback': post_message_fail_to_slack
}
dag_datafusion_args=return_datafusion_config_file('med')
with DAG('dag_lucas4', default_args=default_args, schedule_interval="30 23 * * *", template_searchpath=[TEMPLATE_SEARCH_PATH]) as dag:

    extract_ftp_csv_files_load_in_gcs = BashOperator(
        task_id='extract_ftp_csv_files_load_in_gcs',
        bash_command='aux_sh_files/med/script.sh'
    )

    success = PythonOperator(
        task_id='success',
        python_callable=post_message_success_to_slack,
        op_kwargs={'dag_name': 'dag_lucas2'}
    )

    for pipeline, args in dag_datafusion_args.items():

        configure_pipeline = PythonOperator(
            task_id=f'configure_pipeline{str(INDEX)}',
            python_callable=setPipelineArguments,
            op_kwargs={'dag_name': 'med', 'pipeline_name': pipeline},
            provide_context=True
        )

        start_pipeline = PythonOperator(
            task_id=f'start_pipeline{str(INDEX)}',
            python_callable=start_pipeline_wrapper,
            op_kwargs={'configure_pipeline_task': f'configure_pipeline{str(INDEX)}'},
            retries=3,
            provide_context=True
        )

        [extract_ftp_csv_files_load_in_gcs, configure_pipeline] >> start_pipeline >> success
        INDEX += 1
It appears that the Airflow webserver in Cloud Composer runs in the tenant project, while the workers and the scheduler run in the customer project. The tenant project is just a Google-managed environment for some of the Airflow components, so the webserver UI does not have complete access to your project's resources, since it does not run under your project's environment. That is why it cannot read the config JSON file through return_datafusion_config_file. The best way is to put that configuration in an environment variable instead.
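As a minimal sketch of that workaround, assuming the pipeline configuration is stored as JSON either in an Airflow Variable named dag_datafusion_args (which lives in the metadata database, so the webserver can read it) or in an environment variable set on the Composer environment; both names here are placeholders:

import json
import os

from airflow.models import Variable

# Option A: read the JSON from an Airflow Variable (Admin -> Variables in the UI).
dag_datafusion_args = Variable.get('dag_datafusion_args', deserialize_json=True)

# Option B: read the JSON from an environment variable set on the Composer environment.
# dag_datafusion_args = json.loads(os.environ['DAG_DATAFUSION_ARGS'])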
I want to customize my DAGs to send an email when they fail or succeed. I'm trying to use on_success_callback and on_failure_callback in the DAG constructor, but they don't work for the DAG. At the same time, they work for a DummyOperator that I put inside the DAG.
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.python_operator import PythonOperator
from datetime import datetime
from utils import get_report_operator, DagStatus
TEST_DAG_NAME='test_dag'
TEST_DAG_REPORT_SUBSCRIBERS = ['MY_EMAIL']
def send_success_report(context):
    subject = 'Airflow report: {0} run success'.format(TEST_DAG_NAME)
    email_operator = get_report_operator(subject, TEST_DAG_REPORT_SUBSCRIBERS, TEST_DAG_NAME, DagStatus.SUCCESS)
    email_operator.execute(context)

def send_failed_report(context):
    subject = 'Airflow report: {0} run failed'.format(TEST_DAG_NAME)
    email_operator = get_report_operator(subject, TEST_DAG_REPORT_SUBSCRIBERS, TEST_DAG_NAME, DagStatus.FAILED)
    email_operator.execute(context)
dag = DAG(dag_id=TEST_DAG_NAME,
          schedule_interval=None,
          start_date=datetime(2019, 6, 6),
          on_success_callback=send_success_report,
          on_failure_callback=send_failed_report)

DummyOperator(task_id='task',
              on_success_callback=send_success_report,
              on_failure_callback=send_failed_report,
              dag=dag)
I've also implemented a small add-on over the Airflow EmailOperator to send the report. I don't think the error is in this part, but here it is anyway.
from enum import Enum
import os
from datetime import datetime

from airflow import DAG
from airflow.operators.email_operator import EmailOperator

class DagStatus(Enum):
    SUCCESS = 0
    FAILED = 1

def get_report_operator(sbjct, to_lst, dag_id, dag_status):
    status = 'SUCCESS' if dag_status == DagStatus.SUCCESS else 'FAILED'
    status_color = '#87C540' if dag_status == DagStatus.SUCCESS else '#FF1717'
    with open(os.path.join(os.path.dirname(__file__), 'airflow_report.html'), 'r', encoding='utf-8') as report_file:
        report_mask = report_file.read()
    report_text = report_mask.format(dag_id, status, status_color)
    tmp_dag = DAG(dag_id='tmp_dag', start_date=datetime(year=2019, month=9, day=12), schedule_interval=None)
    return EmailOperator(task_id='send_email',
                         to=to_lst,
                         subject=sbjct,
                         html_content=report_text.encode('utf-8'),
                         dag=tmp_dag)
What am I doing wrong?
Instead, put on_success_callback and on_failure_callback as arguments in the default_args dictionary and pass it to the DAG.
All arguments in default_args passed to a DAG are applied to all of the DAG's operators. It's the only way, as of now, to apply a common parameter to all the operators in the DAG.
dag = DAG(dag_id=TEST_DAG_NAME,
          schedule_interval=None,
          start_date=datetime(2019, 6, 6),
          default_args={
              'on_success_callback': send_success_report,
              'on_failure_callback': send_failed_report
          })
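With that in place, every operator created under this DAG inherits both callbacks from default_args, so they no longer need to be set per task. A short sketch of what that looks like:

from airflow.operators.dummy_operator import DummyOperator

# No callback arguments needed here -- they come from the DAG's default_args.
DummyOperator(task_id='task', dag=dag)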
I'm running Airflow 1.9.0 with LocalExecutor and a PostgreSQL database in a Linux AMI. I want to manually trigger DAGs, but whenever I create a DAG that has schedule_interval set to None or to '@once', the webserver tree view crashes with the following error (I only show the last call):
File "/usr/local/lib/python2.7/site-packages/croniter/croniter.py", line 467, in expand
raise CroniterBadCronError(cls.bad_length)
CroniterBadCronError: Exactly 5 or 6 columns has to be specified for iteratorexpression.
Furthermore, when I manually trigger the DAG, a DAG run starts but the tasks themselves are never scheduled. I've looked around, but it seems that I'm the only one with this type of error. Has anyone encountered this error before and found a fix?
Minimal example triggering the problem:
import datetime as dt
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
default_args = {
    'owner': 'me'
}
bash_command = """
echo "this is a test task"
"""
with DAG('schedule_test',
         default_args=default_args,
         start_date=dt.datetime(2018, 7, 24),
         schedule_interval='None',
         catchup=False
         ) as dag:

    first_task = BashOperator(task_id="first_task", bash_command=bash_command)
Try this:
Set your schedule_interval to None without the quotes, or simply do not specify schedule_interval in your DAG; it is set to None by default. More information on that here: airflow docs -- search for schedule_interval.
Set the orchestration for your tasks at the bottom of the DAG.
Like so:
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.operators.dummy_operator import DummyOperator

default_args = {
    'owner': 'me'
}

bash_command = """
echo "this is a test task"
"""

with DAG('schedule_test',
         default_args=default_args,
         start_date=datetime(2018, 7, 24),
         schedule_interval=None,
         catchup=False
         ) as dag:

    t1 = DummyOperator(
        task_id='extract_data',
        dag=dag
    )

    t2 = BashOperator(
        task_id="first_task",
        bash_command=bash_command
    )

    ##### ORCHESTRATION #####
    ## t2 runs only after t1 is done.
    t2.set_upstream(t1)
The None value should not be in quotes.
It should be like this:
schedule_interval=None
Here is the documentation link: https://airflow.apache.org/docs/apache-airflow/1.10.1/scheduler.html#:~:text=Note%3A%20Use%20schedule_interval%3DNone%20and%20not%20schedule_interval%3D%27None%27%20when%20you%20don%E2%80%99t%20want%20to%20schedule%20your%20DAG