Airflow: sequential task backfill; ignore upstream dependency - python

The Airflow setup I have runs with the CeleryExecutor and a PostgreSQL database. This is a production setting, so I need the workers and scheduler to run multiple tasks at once.
However, I also want to backfill just one task in a date range without affecting the state of anything else in the DAG.
This is an example DAG:
default_args = {
    'owner': 'user',
    'start_date': datetime(2020, 3, 27),
    'depends_on_past': True,
    'wait_for_downstream': True
}

dag = DAG(
    'test_dag1',
    schedule_interval='00 16 * * *',
    default_args=default_args,
    catchup=False,
    description='Running test_dag1',
)

dummy1 = BashOperator(
    task_id="Dummy1",
    bash_command="echo 0",
    dag=dag
)

dummy2 = BashOperator(
    task_id="Dummy2",
    bash_command="echo Hello",
    dag=dag
)

date_task = BashOperator(
    task_id="date_variables",
    bash_command="echo World",
    dag=dag
)

dummy1 >> dummy2 >> date_task
airflow backfill test_dag1 -t Dummy2 --reset_dagruns -I -s 2020-03-27 -e 2020-04-05
The above command takes the upstream dependency into consideration and runs Dummy1 and Dummy2 for the date range. This run also respects depends_on_past=True and wait_for_downstream=True, and executes the DAG runs sequentially from the start date to the end date.
However, I want to ignore the upstream dependency of task Dummy2 and still run the task instances sequentially from the start date to the end date.
airflow backfill test_dag1 -t Dummy2 -i --reset_dagruns -I -s 2020-03-27 -e 2020-04-05
The above command does not respect depends_on_past=True and runs all the task instances at once, in parallel, using the CeleryExecutor.
Is there a way to backfill tasks within a DAG sequentially while also ignoring the upstream and downstream tasks associated with the task in question?
Any help would be appreciated.
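One possible workaround, offered here only as a sketch and not taken from the original post, is to throttle the task itself rather than the backfill: assign Dummy2 to a pool with a single slot (created beforehand via Admin -> Pools), so that even when upstream dependencies are ignored and the CeleryExecutor schedules many task instances, only one instance of Dummy2 runs at a time. The pool name dummy2_serial below is made up for illustration.

# Hypothetical sketch: serialise Dummy2's task instances with a single-slot pool.
# The pool "dummy2_serial" (1 slot) must be created first via Admin -> Pools.
dummy2 = BashOperator(
    task_id="Dummy2",
    bash_command="echo Hello",
    pool="dummy2_serial",  # only one instance of this task can run at a time
    dag=dag
)

Note that a single-slot pool only limits concurrency to one instance at a time; it does not by itself guarantee that the backfilled dates execute in chronological order.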

Related

pass dag_run_id value from one task to another airflow task

def my_function(**kwargs):
    global dag_run_id
    dag_run_id = kwargs['dag_run'].run_id

example_task = PythonOperator(
    task_id='example_task',
    python_callable=my_function,
    provide_context=True,
    dag=dag)

bash_task = BashOperator(
    task_id='bash_task',
    bash_command='echo {{ dag_run_id }}',
    dag=dag,
)
bash_task is not printing the value of dag_run_id, and I want to use dag_run_id in other tasks.
XCom (Cross-Communication) is the mechanism in Airflow that allows you to share data between tasks. Returning a value from a PythonOperator's callable automatically stores the value as an XCom. So your Python function could do:
def my_function(**kwargs):
    dag_run_id = kwargs["run_id"]
    return dag_run_id
Note that run_id is one of the templated variables given by Airflow, see the full list here: https://airflow.apache.org/docs/apache-airflow/stable/templates-ref.html#variables.
This stores the returned value as an "XCom" in Airflow. You can observe the task-specific XComs via the Grid View -> select task -> XCom, or see all XCom values via Admin -> XComs.
You can then fetch (known as "pull" in Airflow) the value in another task:
bash_task = BashOperator(
    task_id="bash_task",
    bash_command="echo {{ ti.xcom_pull(task_ids='example_task') }}",
)
This will fetch the XCom value from the task with id example_task and echo it.
The full DAG code looks like this:
import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator

with DAG(
    dag_id="so_75213078",
    start_date=datetime.datetime(2023, 1, 1),
    schedule_interval=None,
):

    def my_function(**kwargs):
        dag_run_id = kwargs["run_id"]
        return dag_run_id

    example_task = PythonOperator(task_id="example_task", python_callable=my_function)

    bash_task = BashOperator(
        task_id="bash_task",
        bash_command="echo {{ ti.xcom_pull(task_ids='example_task') }}",
    )

    example_task >> bash_task
Tasks are executed by separate processes in Airflow (and sometimes on separate machines), therefore you cannot rely on e.g. a global variable or a local file path being available to all tasks.
To pass a variable value from one task to another, you can use Airflow cross-communication (XCom) as explained in the other answer.
But if you just want to pass the DAG run id, you don't need any of this, since it is available in the template context of every task:
bash_task = BashOperator(
    task_id='bash_task',
    bash_command='echo {{ dag_run.run_id }}',
    dag=dag,
)

How to run Airflow tasks synchronously

I have an Airflow DAG comprising 2-3 steps:
PythonOperator --> runs a query on AWS Athena and stores the generated file at a specific S3 path
BashOperator --> increments an Airflow Variable for tracking
BashOperator --> takes the output (response) of task 1 and runs some code on top of it
What happens here is that the DAG run completes within seconds even though the Athena query is still running.
I want to make sure that the further steps only run after the file is generated. Basically, I want this to be synchronous.
You can set the tasks as:
def athena_task():
    # Add your code
    return

t1 = PythonOperator(
    task_id='athena_task',
    python_callable=athena_task,
)

t2 = BashOperator(
    task_id='variable_task',
    bash_command='',  # replace with relevant command
)

t3 = BashOperator(
    task_id='process_task',
    bash_command='',  # replace with relevant command
)

t1 >> t2 >> t3
t2 will run only after t1 is completed successfully and t3 will start only after t2 is completed successfully.
Note that Airflow has an AWSAthenaOperator which might save you the trouble of writing the code yourself. The operator submits a query to Athena and saves the output to the S3 path set by the output_location parameter:
run_query = AWSAthenaOperator(
    task_id='athena_task',
    query='SELECT * FROM my_table',
    output_location='s3://some-bucket/some-path/',
    database='my_database'
)
Athena's query API is asynchronous. You start a query, get an ID back, and then you need to poll until the query has completed using the GetQueryExecution API call.
If you only start the query in the first task, then there is no guarantee that the query has completed when the next task runs. Only when GetQueryExecution has returned a status of SUCCEEDED (or FAILED/CANCELLED) can you expect the output file to exist.
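If you stay with a plain PythonOperator instead, the callable itself has to do that polling before it returns. A rough sketch with boto3 (not from the original answer; the query, database, and output location are placeholders) could look like this:

import time
import boto3

def run_athena_query_and_wait():
    client = boto3.client("athena")

    # Start the query; Athena returns immediately with an execution id.
    execution_id = client.start_query_execution(
        QueryString="SELECT * FROM my_table",
        QueryExecutionContext={"Database": "my_database"},
        ResultConfiguration={"OutputLocation": "s3://some-bucket/some-path/"},
    )["QueryExecutionId"]

    # Poll GetQueryExecution until the query reaches a terminal state.
    while True:
        response = client.get_query_execution(QueryExecutionId=execution_id)
        state = response["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(5)

    if state != "SUCCEEDED":
        raise RuntimeError(f"Athena query ended in state {state}")
    return execution_id

Only once this function returns does the downstream task start, which gives you the synchronous behaviour described in the question.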
As #Elad points out, AWSAthenaOperator does this for you, and handles error cases, and more.

Airflow XCOM communication from BashOperator to SSHOperator

I just began learning Airflow, but it is quite difficult to grasp the concept of XCom. Therefore I wrote a DAG like this:
from airflow import DAG
from airflow.utils.edgemodifier import Label
from datetime import datetime
from datetime import timedelta
from airflow.operators.bash import BashOperator
from airflow.contrib.operators.ssh_operator import SSHOperator
from airflow.contrib.hooks.ssh_hook import SSHHook

# For more default arguments for a task (or creating templates), please check this website
# https://airflow.apache.org/docs/apache-airflow/stable/_api/airflow/models/index.html#airflow.models.BaseOperator
default_args = {
    'owner': '...',
    'email': ['...'],
    'email_on_retry': False,
    'email_on_failure': True,
    'retries': 3,
    'retry_delay': timedelta(minutes=5),
    'start_date': datetime(2021, 6, 10, 23, 0, 0, 0),
}

hook = SSHHook(
    remote_host='...',
    username='...',
    password='...',
    port=22,
)

with DAG(
    'test_dag',
    description='This is my first DAG to learn BASH operation, SSH connection, and transfer data among jobs',
    default_args=default_args,
    start_date=datetime(2021, 6, 10, 23, 0, 0, 0),
    schedule_interval="0 * * * *",
    tags=['Testing', 'Tutorial'],
) as dag:
    # Declare Tasks
    Read_my_IP = BashOperator(
        # Task ID has to be a combination of alphanumeric chars, dashes, dots, and underscores
        task_id='Read_my_IP',
        # The last line will be pushed to the next task
        bash_command="hostname -i | awk '{print $1}'",
    )

    Read_remote_IP = SSHOperator(
        task_id='Read_remote_IP',
        ssh_hook=hook,
        environment={
            'Pi_IP': Read_my_IP.xcom_pull('Read_my_IP'),
        },
        command="echo {{Pi_IP}}",
    )

    # Declare Relationship between tasks
    Read_my_IP >> Label("PI's IP address") >> Read_remote_IP
The first task ran successfully, but I could not obtain the XCom return_value from task Read_my_IP, which is the IP address of the local machine. This might be very basic, but the documentation does not mention how to declare task_instance.
Please help to complete the Xcom flow and pass the IP address from the local machine to the remote machine for further procedure.
The command parameter of SSHOperator is templated, thus you can get the XCom directly:
Read_remote_IP = SSHOperator(
    task_id='Read_remote_IP',
    ssh_hook=hook,
    command="echo {{ ti.xcom_pull(task_ids='Read_my_IP') }}"
)
Note that you also need to explicitly ask for the XCom to be pushed from the BashOperator (see the operator description):
Read_my_IP = BashOperator(
    # Task ID has to be a combination of alphanumeric chars, dashes, dots, and underscores
    task_id='Read_my_IP',
    # The last line will be pushed to the next task
    bash_command="hostname -i | awk '{print $1}'",
    do_xcom_push=True
)

How can I run a nohup command using Airflow's ssh_operator?

I'm new to Airflow and I'm trying to run a job on an EC2 instance using Airflow's ssh_operator, as shown below:
t2 = SSHOperator(
    ssh_conn_id='ec2_ssh_connection',
    task_id='execute_script',
    command="nohup python test.py &",
    retries=3,
    dag=dag)
The job takes a few hours and I want Airflow to execute the Python script and finish. However, when the command is executed and the DAG completes, the script is terminated on the EC2 instance. I also noticed that the above code doesn't create a nohup.out file.
I'm looking at how to run nohup using SSHOperator. It seems like this might be a Python-related issue because I'm getting the following error from the EC2 script when nohup has been executed:
[Errno 32] Broken pipe
Thanks!
Airflow's SSHHook uses the Paramiko module for SSH connectivity. There is an SO question regarding Paramiko and nohup. One of the answers suggests adding sleep after the nohup command. I cannot explain exactly why, but it actually works. It is also necessary to set get_pty=True in SSHOperator.
Here is a complete example that demonstrates the solution:
from datetime import datetime
from airflow import DAG
from airflow.contrib.operators.ssh_operator import SSHOperator

default_args = {
    'start_date': datetime(2001, 2, 3, 4, 0),
}

with DAG(
    'a_dag', schedule_interval=None, default_args=default_args, catchup=False,
) as dag:
    op = SSHOperator(
        task_id='ssh',
        ssh_conn_id='ssh_default',
        command=(
            'nohup python -c "import time;time.sleep(30);print(1)" & sleep 10'
        ),
        get_pty=True,  # This is needed!
    )
The nohup.out file is written to the user's $HOME.
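If you would rather control where the output goes than rely on nohup.out in $HOME, a variant of the command above (the log path /tmp/test_py.log is just an example) redirects stdout and stderr explicitly:

op = SSHOperator(
    task_id='ssh',
    ssh_conn_id='ssh_default',
    # Redirect output to a known log file instead of nohup.out;
    # the trailing sleep is kept, as in the example above.
    command='nohup python test.py > /tmp/test_py.log 2>&1 & sleep 10',
    get_pty=True,
)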

Airflow BashOperator doesn't work but PythonOperator does

I seem to have a problem with BashOperator. I'm using Airflow 1.10 installed on CentOS in a Miniconda environment (Python 3.6) using the package on Conda Forge.
When I run airflow test tutorial pyHi 2018-01-01 the output is "Hello world!" as expected.
However, when I run airflow test tutorial print_date 2018-01-01 or airflow test tutorial templated 2018-01-01, nothing happens.
This is the Linux shell output:
(etl) [root@VIRT02 airflow]# airflow test tutorial sleep 2015-06-01
[2018-09-28 19:56:09,727] {__init__.py:51} INFO - Using executor SequentialExecutor
[2018-09-28 19:56:09,962] {models.py:258} INFO - Filling up the DagBag from /root/airflow/dags
My DAG configuration file, which is based on the Airflow tutorial, is shown below.
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.operators.python_operator import PythonOperator
from datetime import datetime, timedelta
import test

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2010, 1, 1),
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}

dag = DAG(
    'tutorial',
    'My first attempt',
    schedule_interval=timedelta(days=1),
    default_args=default_args,
)

# t1, t2 and t3 are examples of tasks created by instantiating operators
t1 = BashOperator(
    task_id='print_date',
    bash_command='date',
    dag=dag)

t2 = BashOperator(
    task_id='sleep',
    bash_command='sleep 5',
    retries=3,
    dag=dag)

templated_command = """
{% for i in range(5) %}
    echo "{{ ds }}"
    echo "{{ macros.ds_add(ds, 7)}}"
    echo "{{ params.my_param }}"
{% endfor %}
"""

t3 = BashOperator(
    task_id='templated',
    bash_command=templated_command,
    params={'my_param': 'Parameter I passed in'},
    dag=dag)

t4 = BashOperator(
    task_id='hi',
    bash_command='test.sh',
    dag=dag,
)

t5 = PythonOperator(
    task_id='pyHi',
    python_callable=test.main,
    dag=dag,
)

t2.set_upstream(t1)
t3.set_upstream(t1)
Technically it's not that the BashOperator doesn't work, it's just that you don't see the stdout of the Bash command in the Airflow logs. This is a known issue and a ticket has already been filed on Airflow's issue tracker: https://issues.apache.org/jira/browse/AIRFLOW-2674
The proof that the BashOperator does work is that if you run your sleep operator with
airflow test tutorial sleep 2018-01-01
you will have to wait 5 seconds before it terminates, which is the behaviour you'd expect from the Bash sleep command.
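If you want to confirm the command output yourself while that logging issue is open, one debugging sketch (not from the original answer; the path /tmp/print_date.out is just an example) is to tee the command's output to a file and inspect it afterwards:

t1 = BashOperator(
    task_id='print_date',
    # Write the output to a file as well, so it can be checked even if the
    # task log does not show the command's stdout.
    bash_command='date | tee /tmp/print_date.out',
    dag=dag)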
