(First-time user of Cloud Composer.) All the examples I have seen define very simple Python functions within the DAG.
I have multiple lengthy Python scripts I want to run. Can I put these inside a task?
If so, is it better to use the PythonOperator or to call them from the BashOperator?
E.g. something like
default_dag_args = {}
with models.DAG('jobname',
                schedule_interval=datetime.timedelta(days=1),
                default_args=default_dag_args) as dag:
    do_stuff1 = python_operator.PythonOperator(
        task_id='task_1',
        python_callable=myscript1.py)
    do_stuff2 = python_operator.PythonOperator(
        task_id='task_2',
        python_callable=myscript2.py)
If you keep your Python scripts in separate files, you can use either the PythonOperator or the BashOperator to execute them.
Let's assume you place your Python scripts under the following folder structure:
dags/
my_dag.py
tasks/
myscript1.py
myscript2.py
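For the PythonOperator approach below, each script needs a callable entrypoint you can import. A minimal sketch of what tasks/myscript1.py could look like (the main() name is only an assumption; any callable works):
def main():
    # wrap the existing script logic in a function so it can be imported
    # and handed to PythonOperator as python_callable
    print("running myscript1")

if __name__ == "__main__":
    # still lets you run the script directly: python myscript1.py
    main()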
Using PythonOperator in my_dag.py
from datetime import timedelta
from airflow.models import DAG
from airflow.operators.python_operator import PythonOperator
from tasks import myscript1, myscript2  # the dags/ folder is on sys.path, so tasks/ is importable
default_dag_args = {}
with DAG(
"jobname",
schedule_interval=timedelta(days=1),
default_args=default_dag_args,
) as dag:
do_stuff1 = PythonOperator(
task_id="task_1",
python_callable=myscript1.main, # assume entrypoint is main()
)
do_stuff2 = PythonOperator(
task_id="task_2",
python_callable=myscript2.main, # assume entrypoint is main()
)
Using BashOperator in my_dag.py
from datetime import timedelta
from airflow.models import DAG
from airflow.operators.bash_operator import BashOperator
default_dag_args = {}
with DAG(
"jobname",
schedule_interval=timedelta(days=1),
default_args=default_dag_args,
) as dag:
do_stuff1 = BashOperator(
task_id="task_1",
bash_command="python /path/to/myscript1.py",
)
do_stuff2 = BashOperator(
task_id="task_2",
bash_command="python /path/to/myscript2.py",
)
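In either variant, if the second script has to run after the first, you can declare the dependency inside the with block, for example:
    # run task_2 only after task_1 has finished successfully
    do_stuff1 >> do_stuff2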
Related
I am working on an Airflow POC and have written a DAG that runs a script over SSH on another server. It sends an alert if the script fails to execute, but if the script runs and a step inside it fails, no mail is sent.
Example: I'm running a script that takes a backup of a DB2 database. If the instance is down, the backup command fails, but we get no alert because the script itself executed successfully.
from airflow.models import DAG, Variable
from airflow.contrib.operators.ssh_operator import SSHOperator
from datetime import datetime, timedelta
import airflow
import os
import logging
import airflow.settings
from airflow.utils.dates import days_ago
from airflow.operators.email_operator import EmailOperator
from airflow.models import TaskInstance
from airflow.operators.python_operator import BranchPythonOperator
from airflow.operators.dummy_operator import DummyOperator
START_DATE = airflow.utils.dates.days_ago(1)
SCHEDULE_INTERVAL = "5,10,15,20,30,40 * * * * *"
log = logging.getLogger(__name__)
# Use this to grab the pushed value and determine your path
def determine_branch(**kwargs):
    """Define the pathway of the branch operator based on the return code pushed by the SSHOperator."""
    # pull the value pushed to XCom by the SSHOperator task
    return_code = kwargs['ti'].xcom_pull(task_ids='SSHTest')
    print("From kwargs: ", return_code)
    if return_code == '1':
        return 't4'
    return 't3'
# DAG for airflow task
dag_email_recipient = ["mailid"]
# These args will get passed on to the SSH operator
default_args = {
'owner': 'afuser',
'depends_on_past': False,
'email': ['mailid'],
'email_on_failure': True,
'email_on_retry': False,
'start_date': START_DATE,
'retries': 1,
'retry_delay': timedelta(minutes=1),
}
with DAG(
'DB2_SSH_TEST',
default_args=default_args,
description='How to use the SSH Operator?',
schedule_interval=SCHEDULE_INTERVAL,
start_date=START_DATE,
catchup=False,
) as dag:
# Be sure to add '; echo $?' to the end of your bash script for this to work.
t1 = SSHOperator(
ssh_conn_id='DB2_Jabxusr_SSH',
task_id='SSHTest',
xcom_push=True,
command='/path/backup_script',
provide_context=True,
)
# Function defined above called here
t2 = BranchPythonOperator(
task_id='check_ssh_output',
python_callable=determine_branch,
)
# If we don't want to send email
t3 = DummyOperator(
    task_id='t3'  # must match the id returned by determine_branch
)
# If we do
t4 = EmailOperator(
    task_id='t4',  # required, and must match the id returned by determine_branch
    to=dag_email_recipient,
    subject='Airflow Success: process incoming files',
    files=None,
    html_content='',
    #...
)
t1 >> t2
t2 >> t3
t2 >> t4
You can configure email notifications per task. First you need to configure your email provider, and then when you create your task you can set the "email_on_retry" and "email_on_failure" flags, or you can even write your own custom "on_failure_callback" in which you code your own logic deciding when and how to notify.
Here is a very nice Astronomer article explaining all the ins and outs of notifications: https://www.astronomer.io/guides/error-notifications-in-airflow
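As a rough sketch of what that can look like (the DAG name and recipient address below are placeholders, and a BashOperator is used for brevity instead of the SSHOperator): the built-in email_on_failure flag and a custom on_failure_callback are both set on the task.
from datetime import timedelta

from airflow.models import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.utils.dates import days_ago
from airflow.utils.email import send_email

def notify_failure(context):
    # custom logic: build a message from the task context and mail it
    subject = 'Task {} failed'.format(context['task_instance'].task_id)
    send_email(to=['mailid'], subject=subject, html_content=subject)

default_args = {
    'email': ['mailid'],
    'email_on_failure': True,   # built-in failure mail
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=1),
}

with DAG('notification_example',
         default_args=default_args,
         start_date=days_ago(1),
         schedule_interval=None) as dag:
    backup = BashOperator(
        task_id='run_backup',
        bash_command='/path/backup_script',  # must exit non-zero when the backup fails
        on_failure_callback=notify_failure,  # custom callback in addition to email_on_failure
    )
Note that neither mechanism helps if the backup script always exits with status 0: Airflow only marks the task as failed (and fires these notifications) when the command returns a non-zero exit code.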
I am trying to run a simple DAG in Airflow to execute a Python file, and it throws the error: can't open file '/User/....'.
Below is the script I am using.
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.bash_operator import BashOperator
from airflow.operators.python_operator import PythonOperator
from datetime import datetime,timedelta
default_args = {
'owner': 'airflow',
'start_date': datetime(2021,3,2),
'depends_on_past': False,
'retries': 0
}
dag=DAG(dag_id='DummyDag',default_args=default_args,catchup=False,schedule_interval='@once')
command_load='python /usr/local/airflow/dags/dummy.py '
#load=BashOperator(task_id='loadfile',bash_command='python /Users/<user-id>/MyDrive/DataEngineeringAssignment/dummydata.py',dag=dag)
start=DummyOperator(task_id='Start',dag=dag)
end=DummyOperator(task_id='End',dag=dag)
dir >> start >> end
Where am I going wrong?
Option 1:
The file location (dummydata.py) is relative to the directory containing the pipeline file (DAG file).
dag=DAG(
dag_id='DummyDag',
...
)
load=BashOperator(task_id='loadfile',
bash_command='python dummydata.py',
dag=dag
)
Option 2:
Define template_searchpath in the DAG constructor call, pointing to the folder location(s) containing your script.
dag=DAG(
dag_id='DummyDag',
...,
template_searchpath=['/Users/<user-id>/MyDrive/DataEngineeringAssignment/']
)
load=BashOperator(task_id='loadfile',
# "dummydata.py" is a file under "/Users/<user-id>/MyDrive/DataEngineeringAssignment/"
bash_command='python dummydata.py ', # Note: Space is important!
dag=dag
)
For more information, you can read about it in the docs.
I want to customize my DAGs to send an email when they fail or succeed. I'm trying to use on_success_callback and on_failure_callback in the DAG constructor, but they don't work for the DAG. At the same time, they do work for a DummyOperator that I put inside my DAG.
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.python_operator import PythonOperator
from datetime import datetime
from utils import get_report_operator, DagStatus
TEST_DAG_NAME='test_dag'
TEST_DAG_REPORT_SUBSCRIBERS = ['MY_EMAIL']
def send_success_report(context):
subject = 'Airflow report: {0} run success'.format(TEST_DAG_NAME)
email_operator = get_report_operator(subject, TEST_DAG_REPORT_SUBSCRIBERS, TEST_DAG_NAME, DagStatus.SUCCESS)
email_operator.execute(context)
def send_failed_report(context):
subject = 'Airflow report: {0} run failed'.format(TEST_DAG_NAME)
email_operator = get_report_operator(subject, TEST_DAG_REPORT_SUBSCRIBERS, TEST_DAG_NAME, DagStatus.FAILED)
email_operator.execute(context)
dag = DAG(dag_id=TEST_DAG_NAME,
schedule_interval=None,
start_date=datetime(2019,6,6),
on_success_callback=send_success_report,
on_failure_callback=send_failed_report)
DummyOperator(task_id='task',
on_success_callback=send_success_report,
on_failure_callback=send_failed_report,
dag = dag)
I've also implemented a small add-on over the Airflow EmailOperator for sending the report. I don't think the error is in this part, but here it is anyway.
from enum import Enum
from datetime import datetime
import os

from airflow import DAG
from airflow.operators.email_operator import EmailOperator

class DagStatus(Enum):
SUCCESS = 0
FAILED = 1
def get_report_operator(sbjct, to_lst, dag_id, dag_status):
status = 'SUCCESS' if dag_status == DagStatus.SUCCESS else 'FAILED'
status_color = '#87C540' if dag_status == DagStatus.SUCCESS else '#FF1717'
with open(os.path.join(os.path.dirname(__file__), 'airflow_report.html'), 'r', encoding='utf-8') as report_file:
report_mask = report_file.read()
report_text = report_mask.format(dag_id, status, status_color)
tmp_dag = DAG(dag_id='tmp_dag', start_date=datetime(year=2019, month=9, day=12), schedule_interval=None)
return EmailOperator(task_id='send_email',
to=to_lst,
subject=sbjct,
html_content=report_text.encode('utf-8'),
dag = tmp_dag)
What am I doing wrong?
Instead, put on_success_callback and on_failure_callback as arguments in the default_args dictionary and pass it to the DAG.
All arguments in default_args passed to a DAG are applied to all of the DAG's operators. It is, as of now, the only way to apply a common parameter to all the operators in the DAG.
dag = DAG(dag_id=TEST_DAG_NAME,
schedule_interval=None,
start_date=datetime(2019,6,6),
default_args={
'on_success_callback': send_success_report,
'on_failure_callback': send_failed_report
})
I'm running Airflow 1.9.0 with LocalExecutor and a PostgreSQL database on a Linux AMI. I want to trigger DAGs manually, but whenever I create a DAG that has schedule_interval set to None or to '@once', the webserver tree view crashes with the following error (I only show the last call):
File "/usr/local/lib/python2.7/site-packages/croniter/croniter.py", line 467, in expand
raise CroniterBadCronError(cls.bad_length)
CroniterBadCronError: Exactly 5 or 6 columns has to be specified for iterator expression.
Furthermore, when I manually trigger the DAG, a DAG run starts but the tasks themselves are never scheduled. I've looked around, but it seems that I'm the only one with this type of error. Has anyone encountered this error before and found a fix?
Minimal example triggering the problem:
import datetime as dt
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
default_args = {
'owner': 'me'
}
bash_command = """
echo "this is a test task"
"""
with DAG('schedule_test',
default_args=default_args,
start_date = dt.datetime(2018, 7, 24),
schedule_interval='None',
catchup=False
) as dag:
first_task = BashOperator(task_id = "first_task", bash_command = bash_command)
Try this:
Set your schedule_interval to None without the quotes (note that simply omitting it is not equivalent: the default schedule_interval is timedelta(days=1), i.e. a daily schedule). More information on that here: airflow docs -- search for schedule_interval.
Set the orchestration for your tasks at the bottom of the DAG.
Like so:
from datetime import datetime
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.operators.dummy_operator import DummyOperator
default_args = {
'owner': 'me'
}
bash_command = """
echo "this is a test task"
"""
with DAG('schedule_test',
default_args=default_args,
start_date = datetime(2018, 7, 24),
schedule_interval=None,
catchup=False
) as dag:
t1 = DummyOperator(
task_id='extract_data',
dag=dag
)
t2 = BashOperator(
task_id = "first_task",
bash_command = bash_command
)
#####ORCHESTRATION#####
## It is saying that in order for t2 to run, t1 must be done.
t2.set_upstream(t1)
The None value should not be in quotes.
It should be like this:
schedule_interval=None
Here is the documentation link: https://airflow.apache.org/docs/apache-airflow/1.10.1/scheduler.html#:~:text=Note%3A%20Use%20schedule_interval%3DNone%20and%20not%20schedule_interval%3D%27None%27%20when%20you%20don%E2%80%99t%20want%20to%20schedule%20your%20DAG
I have the following DAG, which executes the different methods of a class dedicated to a data preprocessing routine:
from datetime import datetime
import os
import sys
from airflow.models import DAG
from airflow.operators.python_operator import PythonOperator
import ds_dependencies
SCRIPT_PATH = os.getenv('MARKETING_PREPROC_PATH')
if SCRIPT_PATH:
sys.path.insert(0, SCRIPT_PATH)
from table_builder import OnlineOfflinePreprocess
else:
print('Define MARKETING_PREPROC_PATH value in environmental variables')
sys.exit(1)
default_args = {
'start_date': datetime.now(),
'max_active_runs': 1,
'concurrency': 4
}
worker = OnlineOfflinePreprocess()
DAG = DAG(
dag_id='marketing_data_preproc',
default_args=default_args,
start_date=datetime.today()
)
import_online_data = PythonOperator(
task_id='import_online_data',
python_callable=worker.import_online_data,
dag=DAG)
import_offline_data = PythonOperator(
task_id='import_offline_data',
python_callable=worker.import_offline_data,
dag=DAG)
merge_aurum_to_sherlock = PythonOperator(
task_id='merge_aurum_to_sherlock',
python_callable=worker.merge_aurum_to_sherlock,
dag=DAG)
merge_sherlock_to_aurum = PythonOperator(
task_id='merge_sherlock_to_aurum',
python_callable=worker.merge_sherlock_to_aurum,
dag=DAG)
upload_au_to_sh = PythonOperator(
task_id='upload_au_to_sh',
python_callable=worker.upload_table,
op_args=['aurum_to_sherlock'],  # op_args must be a list
dag=DAG)
upload_sh_to_au = PythonOperator(
task_id='upload_sh_to_au',
python_callable=worker.upload_table,
op_args=['sherlock_to_aurum'],  # op_args must be a list
dag=DAG)
import_online_data >> merge_aurum_to_sherlock
import_offline_data >> merge_aurum_to_sherlock
merge_aurum_to_sherlock >> merge_sherlock_to_aurum
merge_aurum_to_sherlock >> upload_au_to_sh
merge_sherlock_to_aurum >> upload_sh_to_au
This produces the following error:
[2017-09-07 19:32:09,587] {base_task_runner.py:97} INFO - Subtask: AttributeError: 'OnlineOfflinePreprocess' object has no attribute 'online_info'
This is actually fairly predictable given how Airflow works: the outputs of the different class methods that are called aren't stored on the shared class object initialized at the top of the file.
Can I solve this with XCom? Overall, what is the thinking on how to blend the coherence of OOP with Airflow?
It's less an issue of OOP with Airflow and more one of state with Airflow.
Any state that needs to be passed between tasks must be stored persistently. This is because each Airflow task is an independent process (which could even be running on a different machine!), so in-memory communication is not possible.
You are correct that you can use XCom to pass this state (if it's small, since it gets stored in the Airflow database). If it's large, you probably want to store it somewhere else: a filesystem, S3, HDFS, or a specialized database.
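As a minimal sketch of that pattern (the task ids and payload here are made up, and it targets the Airflow 1.x API used above): the upstream task returns a small value, which Airflow pushes to XCom automatically, and the downstream task pulls it by task_id instead of reading an attribute on a shared worker object.
from datetime import datetime

from airflow.models import DAG
from airflow.operators.python_operator import PythonOperator

def import_online_data(**context):
    # in the real DAG this would query the source system; here we just
    # return a small payload, which Airflow stores as an XCom
    return {'rows': 42}

def merge_data(**context):
    # pull the payload pushed by the upstream task, addressed by its task_id
    online_info = context['ti'].xcom_pull(task_ids='import_online_data')
    print('merging', online_info)

dag = DAG(
    dag_id='xcom_state_example',
    start_date=datetime(2017, 9, 1),
    schedule_interval=None,
)

t1 = PythonOperator(
    task_id='import_online_data',
    python_callable=import_online_data,
    provide_context=True,
    dag=dag,
)

t2 = PythonOperator(
    task_id='merge_data',
    python_callable=merge_data,
    provide_context=True,
    dag=dag,
)

t1 >> t2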