Airflow ETL pipeline - using schedule date in functions? - python

Is it possible to refer to the default_args start_date in your Python function?
default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2016, 11, 21),
    'email': ['mmm.mm@mmm.com'],
    'email_on_failure': True,
    'email_on_retry': True,
    'retries': 1,
    'retry_delay': timedelta(minutes=1)
}
My python script primarily uses subprocess to issue this statement:
query = '"SELECT * FROM {}.dbo.{} WHERE row_date = \'{}\'"'.format(database, select_database(database)[table_int],
query_date)
command = 'BCP {} queryout \"{}\" -t, -c -a 10240 -S "server" -T'.format(query, os.path.join(path, filename))
The task I want to run uses BCP to query 'SELECT * FROM table WHERE date = {}'. Currently, my Python script holds all of the logic for the date variable (it defaults to yesterday). However, it would be nicer to refer to the default_args instead and have Airflow handle the dates.
So, to simplify: I want to use the default_args start_date and the schedule (it runs each day) to fill in the date variable in my BCP command. Is this the right approach, or should I keep the date logic in the Python script?

This is the right approach, but what you really need is execution_date, not start_date. You can get the execution date as the 'ds' default variable through the context of a PythonOperator by setting provide_context=True. That parameter passes the set of default variables used in Jinja templating to your callable via kwargs. You can read more about default variables and Jinja templating in the relevant sections of the documentation: https://airflow.incubator.apache.org/code.html#default-variables
https://airflow.incubator.apache.org/concepts.html#jinja-templating
Your code should look like the following:
def query_db(**kwargs):
    # get execution date in format YYYY-MM-DD
    query_date = kwargs.get('ds')
    # rest of your logic

t_query_db = PythonOperator(
    task_id='query_db',
    python_callable=query_db,
    provide_context=True,
    dag=dag)
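For reference, here is a rough sketch of what the "rest of your logic" could look like when plugged into the BCP command from the question. The database, table, path, and filename values are placeholders, and subprocess.check_call is just one way to run the command; adapt the select_database lookup from your own script:

import os
import subprocess

def query_db(**kwargs):
    # 'ds' is the execution date rendered as a YYYY-MM-DD string
    query_date = kwargs.get('ds')
    # Placeholder values -- replace with your own database, table, and output path
    database, table, path, filename = 'mydb', 'my_table', '/tmp/exports', 'export.csv'
    query = '"SELECT * FROM {}.dbo.{} WHERE row_date = \'{}\'"'.format(database, table, query_date)
    command = 'BCP {} queryout "{}" -t, -c -a 10240 -S "server" -T'.format(
        query, os.path.join(path, filename))
    # check_call raises if BCP exits non-zero, which fails the task
    subprocess.check_call(command, shell=True)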

Related

pass dag_run_id value from one task to another airflow task

def my_function(**kwargs):
    global dag_run_id
    dag_run_id = kwargs['dag_run'].run_id

example_task = PythonOperator(
    task_id='example_task',
    python_callable=my_function,
    provide_context=True,
    dag=dag)

bash_task = BashOperator(
    task_id='bash_task',
    bash_command='echo {{ dag_run_id }}',
    dag=dag,
)
bash_task does not print the value of dag_run_id, and I want to use dag_run_id in other tasks.
XCom (Cross-Communication) is the mechanism in Airflow that allows you to share data between tasks. Returning a value from a PythonOperator's callable automatically stores the value as an XCom. So your Python function could do:
def my_function(**kwargs):
    dag_run_id = kwargs["run_id"]
    return dag_run_id
Note that run_id is one of the templated variables given by Airflow, see the full list here: https://airflow.apache.org/docs/apache-airflow/stable/templates-ref.html#variables.
This stores the returned value as an "XCom" in Airflow. You can observe XComs via the Grid View -> select task -> XCom, or see all XCom values via Admin -> XComs.
You can then fetch (known as "pull" in Airflow) the value in another task:
bash_task = BashOperator(
    task_id="bash_task",
    bash_command="echo {{ ti.xcom_pull(task_ids='example_task') }}",
)
This will fetch the XCom value from the task with id example_task and echo it.
The full DAG code looks like this:
import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator

with DAG(
    dag_id="so_75213078",
    start_date=datetime.datetime(2023, 1, 1),
    schedule_interval=None,
):

    def my_function(**kwargs):
        dag_run_id = kwargs["run_id"]
        return dag_run_id

    example_task = PythonOperator(task_id="example_task", python_callable=my_function)

    bash_task = BashOperator(
        task_id="bash_task",
        bash_command="echo {{ ti.xcom_pull(task_ids='example_task') }}",
    )

    example_task >> bash_task
Tasks are executed by separate processes in Airflow (and sometimes on separate machines), so you cannot rely on, for example, a global variable or a local file path existing for all tasks.
To pass a variable value from one task to another, you can use Airflow cross-communication (XCom) as explained in the other answer.
But if you just want the dag_run id, you don't need any of this, since it is already available in the template context of every task:
bash_task = BashOperator(
    task_id='bash_task',
    bash_command='echo {{ dag_run.run_id }}',
    dag=dag,
)
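If you later need to pass something other than the run id between tasks, here is a minimal sketch of an explicit XCom push/pull pair, written in the question's Airflow 1.x style (provide_context=True); the task ids, the 'my_value' key, and the pushed string are purely illustrative:

from airflow.operators.bash_operator import BashOperator
from airflow.operators.python_operator import PythonOperator

def push_value(**kwargs):
    # Explicitly push a value under a custom key
    kwargs['ti'].xcom_push(key='my_value', value='computed in push_task')

push_task = PythonOperator(
    task_id='push_task',
    python_callable=push_value,
    provide_context=True,
    dag=dag)

echo_task = BashOperator(
    task_id='echo_task',
    # Pull the value pushed above by task id and key
    bash_command="echo {{ ti.xcom_pull(task_ids='push_task', key='my_value') }}",
    dag=dag)

push_task >> echo_task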

Airflow XCOM communication from BashOperator to SSHOperator

I just began learning Airflow, but it is quite difficult to grasp the concept of XCom, so I wrote a DAG like this:
from airflow import DAG
from airflow.utils.edgemodifier import Label
from datetime import datetime
from datetime import timedelta
from airflow.operators.bash import BashOperator
from airflow.contrib.operators.ssh_operator import SSHOperator
from airflow.contrib.hooks.ssh_hook import SSHHook

# For more default arguments for a task (or creating templates), please check this website:
# https://airflow.apache.org/docs/apache-airflow/stable/_api/airflow/models/index.html#airflow.models.BaseOperator
default_args = {
    'owner': '...',
    'email': ['...'],
    'email_on_retry': False,
    'email_on_failure': True,
    'retries': 3,
    'retry_delay': timedelta(minutes=5),
    'start_date': datetime(2021, 6, 10, 23, 0, 0, 0),
}

hook = SSHHook(
    remote_host='...',
    username='...',
    password='...',
    port=22,
)

with DAG(
    'test_dag',
    description='This is my first DAG to learn BASH operation, SSH connection, and transfer data among jobs',
    default_args=default_args,
    start_date=datetime(2021, 6, 10, 23, 0, 0, 0),
    schedule_interval="0 * * * *",
    tags=['Testing', 'Tutorial'],
) as dag:

    # Declare tasks
    Read_my_IP = BashOperator(
        # Task ID has to be a combination of alphanumeric chars, dashes, dots, and underscores
        task_id='Read_my_IP',
        # The last line of output will be pushed to the next task
        bash_command="hostname -i | awk '{print $1}'",
    )

    Read_remote_IP = SSHOperator(
        task_id='Read_remote_IP',
        ssh_hook=hook,
        environment={
            'Pi_IP': Read_my_IP.xcom_pull('Read_my_IP'),
        },
        command="echo {{Pi_IP}}",
    )

    # Declare the relationship between tasks
    Read_my_IP >> Label("PI's IP address") >> Read_remote_IP
The first task ran successfully, but I could not obtain the XCom return_value from task Read_my_IP, which is the IP address of the local machine. This might be very basic, but the documentation does not mention how to declare task_instance.
Please help to complete the Xcom flow and pass the IP address from the local machine to the remote machine for further procedure.
The command parameter of SSHOperator is templated, thus you can pull the XCom directly:
Read_remote_IP = SSHOperator(
    task_id='Read_remote_IP',
    ssh_hook=hook,
    command="echo {{ ti.xcom_pull(task_ids='Read_my_IP') }}"
)
Note that you also need to explicitly ask for the XCom to be pushed from the BashOperator (see the operator description):
Read_my_IP = BashOperator(
    # Task ID has to be a combination of alphanumeric chars, dashes, dots, and underscores
    task_id='Read_my_IP',
    # The last line of output will be pushed to the next task
    bash_command="hostname -i | awk '{print $1}'",
    do_xcom_push=True
)
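If you would rather stay close to the original environment-based attempt: in newer versions of the SSH provider the environment parameter of SSHOperator is also a templated field, so a Jinja expression there should be rendered as well. This is only a sketch under that assumption; note that the remote sshd must also be configured to accept the variable (AcceptEnv), which is why pulling the XCom directly in command is usually the more reliable option:

Read_remote_IP = SSHOperator(
    task_id='Read_remote_IP',
    ssh_hook=hook,  # the SSHHook defined in the question
    environment={
        # Rendered at runtime only if 'environment' is templated in your version
        'Pi_IP': "{{ ti.xcom_pull(task_ids='Read_my_IP') }}",
    },
    # $Pi_IP is expanded by the remote shell, so the variable must actually reach it
    command="echo $Pi_IP",
)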

How can I do a nohup command using airflows ssh_operator?

I'm new to Airflow and I'm trying to run a job on an EC2 instance using Airflow's SSHOperator, as shown below:
t2 = SSHOperator(
    ssh_conn_id='ec2_ssh_connection',
    task_id='execute_script',
    command="nohup python test.py &",
    retries=3,
    dag=dag)
The job takes a few hours and I want Airflow to execute the Python script and end. However, when the command is executed and the DAG completes, the script is terminated on the EC2 instance. I also noticed that the above code doesn't create a nohup.out file.
I'm looking at how to run nohup using the SSHOperator. It seems like this might be a Python-related issue, because I'm getting the following error on the EC2 script when nohup has been executed:
[Errno 32] Broken pipe
Thanks!
Airflow's SSHHook uses the Paramiko module for SSH connectivity. There is an SO question regarding Paramiko and nohup. One of the answers suggests adding sleep after the nohup command. I cannot explain exactly why, but it actually works. It is also necessary to set get_pty=True in SSHOperator.
Here is a complete example that demonstrates the solution:
from datetime import datetime

from airflow import DAG
from airflow.contrib.operators.ssh_operator import SSHOperator

default_args = {
    'start_date': datetime(2001, 2, 3, 4, 0),
}

with DAG(
    'a_dag', schedule_interval=None, default_args=default_args, catchup=False,
) as dag:
    op = SSHOperator(
        task_id='ssh',
        ssh_conn_id='ssh_default',
        command=(
            'nohup python -c "import time;time.sleep(30);print(1)" & sleep 10'
        ),
        get_pty=True,  # This is needed!
    )
The nohup.out file is written to the user's $HOME.
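If no nohup.out shows up, or you simply want the output in a predictable place, you can redirect it explicitly. Here is a variant of the operator above under the same assumptions (a sleep after the backgrounded command and get_pty=True); the script path and log file are placeholders:

op_redirect = SSHOperator(
    task_id='ssh_redirect',
    ssh_conn_id='ssh_default',
    command=(
        # Send stdout/stderr to an explicit log file instead of relying on nohup.out
        'nohup python /path/to/test.py > /tmp/test_py.log 2>&1 & sleep 10'
    ),
    get_pty=True,
)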

Airflow: sequential task backfill; ignore upstream dependency

The Airflow set-up I have is running with the CeleryExecutor and a PostgreSQL database. I am running production with this setting, hence I need the worker and scheduler to run multiple threads at once.
However, I also want to backfill just one task in a date range without affecting the state of anything else in the dag.
This is an example dag:
default_args = {
    'owner': 'user',
    'start_date': datetime(2020, 3, 27),
    'depends_on_past': True,
    'wait_for_downstream': True
}

dag = DAG(
    'test_dag1',
    schedule_interval='00 16 * * *',
    default_args=default_args,
    catchup=False,
    description='Running test_dag1',
)

dummy1 = BashOperator(
    task_id="Dummy1",
    bash_command="echo 0",
    dag=dag
)

dummy2 = BashOperator(
    task_id="Dummy2",
    bash_command="echo Hello",
    dag=dag
)

date_task = BashOperator(
    task_id="date_variables",
    bash_command="echo World",
    dag=dag
)

dummy1 >> dummy2 >> date_task
airflow backfill test_dag1 -t Dummy2 --reset_dagruns -I -s 2020-03-27 -e 2020-04-05
The above command takes the upstream dependency into consideration and runs Dummy1 and Dummy2 for the date range. This run also respects depends_on_past=True and wait_for_downstream=True and executes the DAG runs sequentially from the start date to the end date.
However, I want to ignore the upstream dependency of task Dummy2 and still run the tasks sequentially from the start date to the end date.
airflow backfill test_dag1 -t Dummy2 -i --reset_dagruns -I -s 2020-03-27 -e 2020-04-05
The above command does not respect depends_on_past=True and runs all the tasks at once, in parallel, using the CeleryExecutor.
Is there a way to backfill tasks within a DAG sequentially while also ignoring the upstream and downstream tasks associated with the task in question?
Any help would be appreciated.

Airflow BashOperator doesn't work but PythonOperator does

I seem to have a problem with BashOperator. I'm using Airflow 1.10 installed on CentOS in a Miniconda environment (Python 3.6) using the package on Conda Forge.
When I run airflow test tutorial pyHi 2018-01-01 the output is "Hello world!" as expected.
However, when I run airflow test tutorial print_date 2018-01-01 or airflow test tutorial templated 2018-01-01, nothing happens.
This is the Linux shell output:
(etl) [root@VIRT02 airflow]# airflow test tutorial sleep 2015-06-01
[2018-09-28 19:56:09,727] {__init__.py:51} INFO - Using executor SequentialExecutor
[2018-09-28 19:56:09,962] {models.py:258} INFO - Filling up the DagBag from /root/airflow/dags
My DAG configuration file, which is based on the Airflow tutorial, is shown below.
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.operators.python_operator import PythonOperator
from datetime import datetime, timedelta
import test

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2010, 1, 1),
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}

dag = DAG(
    'tutorial',
    'My first attempt',
    schedule_interval=timedelta(days=1),
    default_args=default_args,
)

# t1, t2 and t3 are examples of tasks created by instantiating operators
t1 = BashOperator(
    task_id='print_date',
    bash_command='date',
    dag=dag)

t2 = BashOperator(
    task_id='sleep',
    bash_command='sleep 5',
    retries=3,
    dag=dag)

templated_command = """
{% for i in range(5) %}
    echo "{{ ds }}"
    echo "{{ macros.ds_add(ds, 7)}}"
    echo "{{ params.my_param }}"
{% endfor %}
"""

t3 = BashOperator(
    task_id='templated',
    bash_command=templated_command,
    params={'my_param': 'Parameter I passed in'},
    dag=dag)

t4 = BashOperator(
    task_id='hi',
    bash_command='test.sh',
    dag=dag,
)

t5 = PythonOperator(
    task_id='pyHi',
    python_callable=test.main,
    dag=dag,
)

t2.set_upstream(t1)
t3.set_upstream(t1)
Technically it's not that the BashOperator doesn't work, it's just that you don't see the stdout of the Bash command in the Airflow logs. This is a known issue and a ticket has already been filed on Airflow's issue tracker: https://issues.apache.org/jira/browse/AIRFLOW-2674
The proof that the BashOperator does work is that if you run your sleep task with
airflow test tutorial sleep 2018-01-01
you will have to wait 5 seconds before it terminates, which is the behaviour you'd expect from the Bash sleep command.
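If you want to convince yourself that the command really runs even though its stdout is hidden, one workaround is to add a task to the tutorial DAG that writes its output to a file and inspect the file afterwards; the task id and /tmp path below are just examples:

t_check = BashOperator(
    task_id='print_date_to_file',
    # Write the output to a file we can inspect, since test mode does not show stdout
    bash_command='date > /tmp/airflow_print_date.txt',
    dag=dag)

After running airflow test tutorial print_date_to_file 2018-01-01, the file /tmp/airflow_print_date.txt should contain the date, confirming the BashOperator executed.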
