Pass dag_run_id value from one task to another Airflow task - Python

def my_function(**kwargs):
    global dag_run_id
    dag_run_id = kwargs['dag_run'].run_id

example_task = PythonOperator(
    task_id='example_task',
    python_callable=my_function,
    provide_context=True,
    dag=dag)

bash_task = BashOperator(
    task_id='bash_task',
    bash_command='echo {{ dag_run_id }}',
    dag=dag,
)
bash_task is not printing the value of dag_run_id, and I want to use dag_run_id in other tasks.

XCom (Cross-Communication) is the mechanism in Airflow that allows you to share data between tasks. Returning a value from a PythonOperator's callable automatically stores the value as an XCom. So your Python function could do:
def my_function(**kwargs):
    dag_run_id = kwargs["run_id"]
    return dag_run_id
Note that run_id is one of the templated variables given by Airflow, see the full list here: https://airflow.apache.org/docs/apache-airflow/stable/templates-ref.html#variables.
This stores the returned value as an "XCom" in Airflow. You can observe XComs via the Grid View -> select task -> XCom, or see all XCom values via Admin -> XComs.
You can then fetch (known as "pull" in Airflow) the value in another task:
bash_task = BashOperator(
    task_id="bash_task",
    bash_command="echo {{ ti.xcom_pull(task_ids='example_task') }}",
)
This will fetch the XCom value from the task with id example_task and echo it.
The full DAG code looks like this:
import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator

with DAG(
    dag_id="so_75213078",
    start_date=datetime.datetime(2023, 1, 1),
    schedule_interval=None,
):
    def my_function(**kwargs):
        dag_run_id = kwargs["run_id"]
        return dag_run_id

    example_task = PythonOperator(task_id="example_task", python_callable=my_function)
    bash_task = BashOperator(
        task_id="bash_task",
        bash_command="echo {{ ti.xcom_pull(task_ids='example_task') }}",
    )
    example_task >> bash_task
Tasks are executed by separate processes in Airflow (and sometimes on separate machines), so you cannot rely on, for example, a global variable or a local file path being available to all tasks.
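For completeness, a downstream PythonOperator can pull the same XCom programmatically from the task instance. A minimal sketch, assuming Airflow 2.x context passing (consume_run_id and consumer_task are illustrative names, not from the original answer):

def consume_run_id(ti, **kwargs):
    # Pull the value returned by example_task from XCom
    pulled_run_id = ti.xcom_pull(task_ids="example_task")
    print(f"run_id received from example_task: {pulled_run_id}")

consumer_task = PythonOperator(
    task_id="consumer_task",
    python_callable=consume_run_id,
)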

To pass a value from one task to another, you can use Airflow Cross-Communication (XCom) as explained in the other answer.
But if you just want the DAG run's id, you don't need any of that, because it is already available to every task through templating:
bash_task = BashOperator(
    task_id='bash_task',
    bash_command='echo {{ dag_run.run_id }}',
    dag=dag,
)
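The same dag_run object is also available inside a PythonOperator callable via the task context, so no XCom is needed there either. A minimal sketch, assuming Airflow 2.x (print_run_id and python_task are illustrative names):

def print_run_id(dag_run, **kwargs):
    # dag_run is injected from the task context; its run_id is the same value
    # the Jinja template {{ dag_run.run_id }} resolves to
    print(dag_run.run_id)

python_task = PythonOperator(
    task_id='python_task',
    python_callable=print_run_id,
    dag=dag,
)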

Related

How to run Airflow tasks synchronously

I have an Airflow DAG comprising 2-3 steps:
PythonOperator --> Runs a query on AWS Athena and stores the generated file at a specific S3 path
BashOperator --> Increments an Airflow variable for tracking
BashOperator --> Takes the output (response) of task 1 and runs some code on top of it.
What happens here is that the DAG completes within seconds even though the Athena query is still running.
I want to make sure the further steps run only after the file is generated. Basically, I want this to be synchronous.
You can set the tasks as:
def athena_task():
    # Add your code
    return

t1 = PythonOperator(
    task_id='athena_task',
    python_callable=athena_task,
)
t2 = BashOperator(
    task_id='variable_task',
    bash_command='',  # replace with relevant command
)
t3 = BashOperator(
    task_id='process_task',
    bash_command='',  # replace with relevant command
)
t1 >> t2 >> t3
t2 will run only after t1 is completed successfully and t3 will start only after t2 is completed successfully.
Note that Airflow has an AWSAthenaOperator, which might save you the trouble of writing the code yourself. The operator submits a query to Athena and saves the output to the S3 path given by the output_location parameter:
run_query = AWSAthenaOperator(
    task_id='athena_task',
    query='SELECT * FROM my_table',
    output_location='s3://some-bucket/some-path/',
    database='my_database'
)
Athena's query API is asynchronous. You start a query, get an ID back, and then you need to poll until the query has completed using the GetQueryExecution API call.
If you only start the query in the first task, there is no guarantee that the query has completed when the next task runs. Only when GetQueryExecution reports a terminal status (SUCCEEDED, FAILED, or CANCELLED) has the query actually finished, and only on SUCCEEDED can you expect the output file to exist.
As @Elad points out, AWSAthenaOperator does this polling for you, handles error cases, and more.
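If you do keep the query in your own PythonOperator instead of using AWSAthenaOperator, the polling loop might look roughly like this (a sketch using boto3; the query, database, and output location are placeholders carried over from the snippet above):

import time

import boto3


def run_athena_query():
    client = boto3.client("athena")
    # Start the query; this call returns immediately with an execution id
    execution_id = client.start_query_execution(
        QueryString="SELECT * FROM my_table",
        QueryExecutionContext={"Database": "my_database"},
        ResultConfiguration={"OutputLocation": "s3://some-bucket/some-path/"},
    )["QueryExecutionId"]

    # Poll until the query reaches a terminal state
    while True:
        state = client.get_query_execution(QueryExecutionId=execution_id)[
            "QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(5)

    if state != "SUCCEEDED":
        raise RuntimeError("Athena query finished with state " + state)
    return execution_id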

Airflow XCOM communication from BashOperator to SSHOperator

I just began learning Airflow, but it is quite difficult to grasp the concept of XCom, so I wrote a DAG like this:
from airflow import DAG
from airflow.utils.edgemodifier import Label
from datetime import datetime
from datetime import timedelta
from airflow.operators.bash import BashOperator
from airflow.contrib.operators.ssh_operator import SSHOperator
from airflow.contrib.hooks.ssh_hook import SSHHook

# For more default arguments for a task (or creating templates), please check this website:
# https://airflow.apache.org/docs/apache-airflow/stable/_api/airflow/models/index.html#airflow.models.BaseOperator

default_args = {
    'owner': '...',
    'email': ['...'],
    'email_on_retry': False,
    'email_on_failure': True,
    'retries': 3,
    'retry_delay': timedelta(minutes=5),
    'start_date': datetime(2021, 6, 10, 23, 0, 0, 0),
}

hook = SSHHook(
    remote_host='...',
    username='...',
    password='...',
    port=22,
)

with DAG(
    'test_dag',
    description='This is my first DAG to learn BASH operation, SSH connection, and transfer data among jobs',
    default_args=default_args,
    start_date=datetime(2021, 6, 10, 23, 0, 0, 0),
    schedule_interval="0 * * * *",
    tags=['Testing', 'Tutorial'],
) as dag:
    # Declare Tasks
    Read_my_IP = BashOperator(
        # Task ID has to be a combination of alphanumeric chars, dashes, dots, and underscores
        task_id='Read_my_IP',
        # The last line will be pushed to the next task
        bash_command="hostname -i | awk '{print $1}'",
    )

    Read_remote_IP = SSHOperator(
        task_id='Read_remote_IP',
        ssh_hook=hook,
        environment={
            'Pi_IP': Read_my_IP.xcom_pull('Read_my_IP'),
        },
        command="echo {{Pi_IP}}",
    )

    # Declare relationship between tasks
    Read_my_IP >> Label("PI's IP address") >> Read_remote_IP
The first task ran successfully, but I could not obtain the XCom return_value from task Read_my_IP, which is the IP address of the local machine. This might be very basic, but the documentation does not mention how to declare task_instance.
Please help me complete the XCom flow and pass the IP address from the local machine to the remote machine for further processing.
The command parameter of SSHOperator is templated, so you can pull the XCom directly:
Read_remote_IP = SSHOperator(
    task_id='Read_remote_IP',
    ssh_hook=hook,
    command="echo {{ ti.xcom_pull(task_ids='Read_my_IP') }}"
)
Note that you also need to explicitly ask for the XCom to be pushed from the BashOperator (see the operator description):
Read_my_IP = BashOperator(
    # Task ID has to be a combination of alphanumeric chars, dashes, dots, and underscores
    task_id='Read_my_IP',
    # The last line of output will be pushed as an XCom to the next task
    bash_command="hostname -i | awk '{print $1}'",
    do_xcom_push=True
)
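Putting the two snippets together, the relevant part of the DAG would look roughly like this (a sketch restating the answer's code inside the question's with DAG(...) as dag: block; hook is the SSHHook defined in the question, and the explicit dependency ensures the XCom exists before it is pulled):

    Read_my_IP = BashOperator(
        task_id='Read_my_IP',
        bash_command="hostname -i | awk '{print $1}'",
        do_xcom_push=True,
    )
    Read_remote_IP = SSHOperator(
        task_id='Read_remote_IP',
        ssh_hook=hook,
        command="echo {{ ti.xcom_pull(task_ids='Read_my_IP') }}",
    )
    # Read_my_IP must run first so its XCom is available to Read_remote_IP
    Read_my_IP >> Read_remote_IP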

Airflow: sequential task backfill; ignore upstream dependency

The Airflow set-up I have is running with the CeleryExecutor and a PostgreSQL database. I am running production with this setting, hence I need the worker and scheduler to run multiple threads at once.
However, I also want to backfill just one task in a date range without affecting the state of anything else in the DAG.
This is an example dag:
default_args = {
    'owner': 'user',
    'start_date': datetime(2020, 3, 27),
    'depends_on_past': True,
    'wait_for_downstream': True
}

dag = DAG(
    'test_dag1',
    schedule_interval='00 16 * * *',
    default_args=default_args,
    catchup=False,
    description='Running test_dag1',
)

dummy1 = BashOperator(
    task_id="Dummy1",
    bash_command="echo 0",
    dag=dag
)

dummy2 = BashOperator(
    task_id="Dummy2",
    bash_command="echo Hello",
    dag=dag
)

date_task = BashOperator(
    task_id="date_variables",
    bash_command="echo World",
    dag=dag
)

dummy1 >> dummy2 >> date_task
airflow backfill test_dag1 -t Dummy2 --reset_dagruns -I -s 2020-03-27 -e 2020-04-05
The above command takes the upstream dependency into consideration and runs Dummy1 and Dummy2 for the date range. This run also respects depends_on_past=True and wait_for_downstream=True and executes the DAG runs sequentially from the start date to the end date.
However, I want to ignore the upstream dependency of task Dummy2 and still run the tasks sequentially from the start date to the end date.
airflow backfill test_dag1 -t Dummy2 -i --reset_dagruns -I -s 2020-03-27 -e 2020-04-05
The above command does not respect depends_on_past=True and runs all the tasks at once, in parallel, using the CeleryExecutor.
Is there a way to backfill tasks within a DAG sequentially while ignoring the upstream and downstream tasks associated with the task in question?
Any help would be appreciated.

Airflow BashOperator doesn't work but PythonOperator does

I seem to have a problem with BashOperator. I'm using Airflow 1.10 installed on CentOS in a Miniconda environment (Python 3.6) using the package on Conda Forge.
When I run airflow test tutorial pyHi 2018-01-01 the output is "Hello world!" as expected.
However, when I run airflow test tutorial print_date 2018-01-01 or
airflow test tutorial templated 2018-01-01 nothing happens.
This is the Linux shell output:
(etl) [root@VIRT02 airflow]# airflow test tutorial sleep 2015-06-01
[2018-09-28 19:56:09,727] {__init__.py:51} INFO - Using executor SequentialExecutor
[2018-09-28 19:56:09,962] {models.py:258} INFO - Filling up the DagBag from /root/airflow/dags
My DAG configuration file, which is based on the Airflow tutorial, is shown below.
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.operators.python_operator import PythonOperator
from datetime import datetime, timedelta
import test

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2010, 1, 1),
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}

dag = DAG(
    'tutorial',
    'My first attempt',
    schedule_interval=timedelta(days=1),
    default_args=default_args,
)

# t1, t2 and t3 are examples of tasks created by instantiating operators
t1 = BashOperator(
    task_id='print_date',
    bash_command='date',
    dag=dag)

t2 = BashOperator(
    task_id='sleep',
    bash_command='sleep 5',
    retries=3,
    dag=dag)

templated_command = """
{% for i in range(5) %}
    echo "{{ ds }}"
    echo "{{ macros.ds_add(ds, 7)}}"
    echo "{{ params.my_param }}"
{% endfor %}
"""

t3 = BashOperator(
    task_id='templated',
    bash_command=templated_command,
    params={'my_param': 'Parameter I passed in'},
    dag=dag)

t4 = BashOperator(
    task_id='hi',
    bash_command='test.sh',
    dag=dag,
)

t5 = PythonOperator(
    task_id='pyHi',
    python_callable=test.main,
    dag=dag,
)

t2.set_upstream(t1)
t3.set_upstream(t1)
Technically it's not that the BashOperator doesn't work, it's just that you don't see the stdout of the Bash command in the Airflow logs. This is a known issue and a ticket has already been filed on Airflow's issue tracker: https://issues.apache.org/jira/browse/AIRFLOW-2674
The proof that BashOperator does work is that if you run your sleep task with
airflow test tutorial sleep 2018-01-01
you will have to wait 5 seconds before it terminates, which is the behaviour you'd expect from the Bash sleep command.
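If you want to see the command's output yourself despite the logging issue, one possible workaround (not part of the original answer) is to duplicate the output to a file you can inspect, for example:

# Workaround sketch: tee the output to a file so it can be inspected even
# when stdout does not show up in the task log.
t1 = BashOperator(
    task_id='print_date',
    bash_command='date | tee /tmp/print_date_output.txt',
    dag=dag)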

Airflow ETL pipeline - using schedule date in functions?

Is it possible to refer to the default_args start_date in your Python function?
default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2016, 11, 21),
    'email': ['mmm.mm@mmm.com'],
    'email_on_failure': True,
    'email_on_retry': True,
    'retries': 1,
    'retry_delay': timedelta(minutes=1)
}
My Python script primarily uses subprocess to issue this statement:
query = '"SELECT * FROM {}.dbo.{} WHERE row_date = \'{}\'"'.format(
    database, select_database(database)[table_int], query_date)
command = 'BCP {} queryout \"{}\" -t, -c -a 10240 -S "server" -T'.format(
    query, os.path.join(path, filename))
The task I want to run uses BCP to query 'SELECT * FROM table WHERE date = {}'. Currently, my Python script holds all of the logic for the date variable (it defaults to yesterday). However, it would be nice to refer to the default_arg instead and have Airflow handle the dates.
So, to simplify, I want to use the default_args start_date and the schedule (runs each day) to fill in the date variable in my BCP command. Is this the right approach, or should I keep the date logic in the Python script?
This is the right approach, but what you really need is execution_date, not start_date. You can get execution_date as the 'ds' default variable through the context in PythonOperator by using the provide_context=True parameter, which passes the set of default variables used in Jinja templating through the kwargs argument. You can read more about default variables and Jinja templating in the relevant sections of the documentation:
https://airflow.incubator.apache.org/code.html#default-variables
https://airflow.incubator.apache.org/concepts.html#jinja-templating
Your code should look like the following:
def query_db(**kwargs):
    # Get the execution date in format YYYY-MM-DD
    query_date = kwargs.get('ds')
    # rest of your logic

t_query_db = PythonOperator(
    task_id='query_db',
    python_callable=query_db,
    provide_context=True,
    dag=dag)
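For illustration, the asker's BCP call could then consume that date directly. A sketch, assuming database, select_database, table_int, path, and filename are defined elsewhere in the asker's script:

import os
import subprocess

def query_db(**kwargs):
    # 'ds' is the execution date in YYYY-MM-DD format
    query_date = kwargs.get('ds')
    query = '"SELECT * FROM {}.dbo.{} WHERE row_date = \'{}\'"'.format(
        database, select_database(database)[table_int], query_date)
    command = 'BCP {} queryout "{}" -t, -c -a 10240 -S "server" -T'.format(
        query, os.path.join(path, filename))
    # Run the BCP export and fail the task if it returns a non-zero exit code
    subprocess.run(command, shell=True, check=True)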
