Airflow Python operator passing parameters

I'm trying to write a Python operator in an Airflow DAG and pass certain parameters to the Python callable.
My code looks like this:
def my_sleeping_function(threshold):
    print(threshold)

fmfdependency = PythonOperator(
    task_id='poke_check',
    python_callable=my_sleeping_function,
    provide_context=True,
    op_kwargs={'threshold': 100},
    dag=dag)

end = BatchEndOperator(
    queue=QUEUE,
    dag=dag)

start.set_downstream(fmfdependency)
fmfdependency.set_downstream(end)
But I keep getting the error below:
TypeError: my_sleeping_function() got an unexpected keyword argument 'dag_run'
I am not able to figure out why.

With provide_context=True, Airflow passes the task context (including dag_run) to your callable as keyword arguments, so the callable has to accept them. Add **kwargs to your callable's parameter list after your threshold param.
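For example, a minimal sketch of the adjusted callable (assuming the rest of the DAG stays unchanged):

def my_sleeping_function(threshold, **kwargs):
    # provide_context=True makes Airflow pass context entries such as
    # dag_run and ti as keyword arguments; **kwargs absorbs them.
    print(threshold)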

This is how you can pass arguments to a Python operator in Airflow.
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.python_operator import PythonOperator
from time import sleep
from datetime import datetime

def my_func(*op_args):
    print(op_args)
    return op_args[0]

with DAG('python_dag', description='Python DAG', schedule_interval='*/5 * * * *',
         start_date=datetime(2018, 11, 1), catchup=False) as dag:
    dummy_task = DummyOperator(task_id='dummy_task', retries=3)
    python_task = PythonOperator(task_id='python_task', python_callable=my_func,
                                 op_args=['one', 'two', 'three'])
    dummy_task >> python_task
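If you prefer keyword arguments, op_kwargs works the same way. A small sketch (the callable and task names here are illustrative, not from the original post; the extra **kwargs is only needed if provide_context=True is also set):

def my_keyword_func(threshold, **kwargs):
    print(threshold)

python_kwargs_task = PythonOperator(task_id='python_kwargs_task',
                                    python_callable=my_keyword_func,
                                    op_kwargs={'threshold': 100})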

Related

Unix TimeStamp conversion into Date/Time field failed in Airflow DAG

I am trying to convert a Unix timestamp into date/time format in an Airflow DAG.
The function get_execution_time() in the Airflow script below is throwing a parsing error:
ERROR - HTTP error: Bad Request
[2021-10-04 21:39:46,187] ERROR - {"error":"unable to parse: metric parse error: expected field at 1:94: \"model_i7,model_owner=cgrm_developer,model_name=m1,execution_time=<function get_execution_time at 0x7f46df0b1b70>\""}
[2021-10-04 21:39:46,216] ERROR - 400:Bad Request
The parser fails to resolve the Unix timestamp into date/time format and instead receives execution_time=<function get_execution_time at 0x7f46df0b1b70>.
Here is the script:
from airflow import DAG
from random import random, seed, choice
from datetime import datetime, timedelta
from airflow.operators.python_operator import PythonOperator
from airflow.operators.postgres_operator import PostgresOperator
from airflow.operators.http_operator import SimpleHttpOperator
import time
import warnings
import requests
from datetime import datetime
import calendar
from functools import wraps

warnings.filterwarnings("ignore", category=DeprecationWarning)

def get_execution_time():
    ts = int("1284101485")
    return datetime.utcfromtimestamp(ts).strftime('%Y-%m-%d %H:%M:%S')

model_list = ['m1', 'm2', 'm3', 'm4']

def get_data_str(model_name):
    table_name = 'model_i7,'
    execution_time = get_execution_time()
    parameters = ',model_name={model_name},execution_time={execution_time}'.format(
        model_name=model_name, execution_time=get_execution_time)
    return table_name + 'model_owner=test_developer' + parameters

default_args = {
    'owner': 'developer',
    'depends_on_past': False,
    'start_date': datetime(2021, 10, 04),
    'retries': 1,
    'retry_delay': timedelta(minutes=1),
    'catchup': False,
    'dagrun_timeout': timedelta(hours=3),
    'email_on_success': False,
    'email_on_failure': False,
    'email_on_retry': False
}

with DAG(
    dag_id='testi7',
    schedule_interval=None,
    tags=['testi7'],
    access_control={'developer': {'can_dag_read', 'can_dag_edit'}},
    default_args=default_args
) as dag:
    for model_name in model_list:
        extracting_metrics = SimpleHttpOperator(
            task_id='extracting_metrics',
            http_conn_id='metrics_api',
            endpoint='/write',
            python_callable=get_data_str,
            provide_context=True,
            data=get_data_str(model_name)
        )
Expected output:
table_name   | model_owner | model_name | execution_time
model_output | developer   | model1     | 2010-09-10T10:59:33+00:00
I would appreciate it if someone could help resolve the above exception.
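There is no answer included here, but judging purely from the error text, the function object itself (<function get_execution_time at 0x...>) is being written into the metric line, which suggests format() is being given the function rather than its return value. A minimal sketch of get_data_str with the call added (my own illustration, assuming the rest of the DAG stays unchanged):

def get_data_str(model_name):
    table_name = 'model_i7,'
    execution_time = get_execution_time()  # call the function to get the timestamp string
    parameters = ',model_name={model_name},execution_time={execution_time}'.format(
        model_name=model_name, execution_time=execution_time)  # pass the value, not the function
    return table_name + 'model_owner=test_developer' + parameters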

Apache Airflow - Prescript rerun at each task of the dag and date change

I am new to using Airflow.
I noticed that if you define a global variable (a timestamp) in the code, its value changes for each task. For example, in the very basic example below I define a variable now, but each time I print it in a task, the value changes.
from datetime import timedelta
from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from airflow.utils.dates import days_ago
import time

now = int(time.time() * 1000)
RANGE = range(1, 10)

def init_step():
    print("Run on RANGE {}".format(RANGE))
    print("Date of the Scans {}".format(now))
    return RANGE

def trigger_step(index):
    time.sleep(10)
    print("index {} - date {}".format(index, now))
    return index

default_args = {
    'owner': 'airflow',
    'start_date': days_ago(1),
    'retries': 2,
    'retry_delay': timedelta(minutes=15)
}

with DAG('test',
         default_args=default_args,
         schedule_interval='0 16 */7 * *',
         ) as dag:
    init = PythonOperator(task_id='init',
                          python_callable=init_step,
                          dag=dag)
    for index in init_step():
        run = PythonOperator(task_id='trigger-port-' + str(index),
                             op_kwargs={'index': index},
                             python_callable=trigger_step, dag=dag)
        dag >> init >> run
Is this normal behavior? Is there a way to change it?
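For context (not from the question itself): Airflow re-parses the DAG file for the scheduler and for each task execution, so module-level code such as now = int(time.time() * 1000) is re-evaluated every time. A minimal sketch that pins a single timestamp per DAG run by computing it inside the init task and sharing it via XCom (names are illustrative):

import time

RANGE = range(1, 10)

def init_step(**context):
    # compute the timestamp once per DAG run inside the task, not at parse time
    scan_date = int(time.time() * 1000)
    context['ti'].xcom_push(key='scan_date', value=scan_date)
    print("Run on RANGE {}".format(RANGE))
    print("Date of the Scans {}".format(scan_date))
    return list(RANGE)

def trigger_step(index, **context):
    # pull the value the init task pushed, so every downstream task sees the same date
    scan_date = context['ti'].xcom_pull(task_ids='init', key='scan_date')
    print("index {} - date {}".format(index, scan_date))
    return index

# Note: with airflow.operators.python_operator.PythonOperator (Airflow 1.x),
# provide_context=True is needed for the **context kwargs to be passed in.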

Airflow - Access Xcom in BranchPythonOperator

I have searched extensively through Airflow blogs and documentation to debug a problem I am having.
What I am trying to solve
Check if a particular file exists on an ftp server
If it exists upload it to cloud
If it doesn't exist, send an email to the client reporting that no file is found
What I have
A custom operator extending the BaseOperator that uses the SSH Hook and pushes a value (true or false).
Task that uses BranchPythonOperator to pull the value from xcom and check if previous task returned true or false and make the decision about the next task.
Please look at the code below. This code is a simplified version of what I am trying to do.
If anyone is interested in my original code, please scroll down to the end of the question.
Here the custom operator simply pushes the string Even or Odd to XCom, based on whether the current minute is even or odd.
import logging

from airflow.models import BaseOperator
from airflow.plugins_manager import AirflowPlugin
from airflow.utils.decorators import apply_defaults
from datetime import datetime

log = logging.getLogger(__name__)

class MediumTestOperator(BaseOperator):

    @apply_defaults
    def __init__(self,
                 do_xcom_push=True,
                 *args,
                 **kwargs):
        super(MediumTestOperator, self).__init__(*args, **kwargs)
        self.do_xcom_push = do_xcom_push
        self.args = args
        self.kwargs = kwargs

    def execute(self, context):
        # from IPython import embed; embed()
        current_minute = datetime.now().minute
        context['ti'].xcom_push(key="Airflow", value="Apache Incubating")
        if current_minute % 2 == 0:
            context['ti'].xcom_push(key="minute", value="Even")
        else:
            context['ti'].xcom_push(key="minute", value="Odd")
        # from IPython import embed; embed()

class MediumTestOperatorPlugin(AirflowPlugin):
    name = "medium_test"
    operators = [MediumTestOperator]
File: caller.py
from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from airflow.operators.python_operator import BranchPythonOperator
from airflow.operators.dummy_operator import DummyOperator
from medium_payen_op import MediumTestOperator
from datetime import datetime, timedelta

default_args = {
    'owner': 'guillaume',
    'depends_on_past': False,
    'start_date': datetime(2018, 6, 18),
    'email': ['hello@moonshots.ai'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=1)
}

dag = DAG(
    'Weekday',
    default_args=default_args,
    schedule_interval="@once")

sample_task = MediumTestOperator(
    task_id='task_1',
    provide_context=True,
    dag=dag
)

def get_branch_follow(**kwargs):
    x = kwargs['ti'].xcom_pull(task_ids='task_1', key="minute")
    print("From Kwargs: ", x)
    if x == 'Even':
        return 'task_3'
    else:
        return 'task_4'

task_2 = BranchPythonOperator(
    task_id='task_2_branch',
    python_callable=get_branch_follow,
    provide_context=True,
    dag=dag
)

def get_dample(**kwargs):
    x = kwargs['ti'].xcom_pull(task_ids='task_1', key="minute")
    y = kwargs['ti'].xcom_pull(task_ids='task_1', key="Airflow")
    print("Minute is:", x, " Airflow is from: ", y)
    print("Task 3 Running")

task_3 = PythonOperator(
    python_callable=get_dample,
    provide_context=True,
    dag=dag,
    task_id='task_3'
)

def get_dample(**kwargs):
    x = kwargs['ti'].xcom_pull(task_ids='task_1', key="minute")
    y = kwargs['ti'].xcom_pull(task_ids='task_1', key="Airflow")
    print("Minute is:", x, " Airflow is from: ", y)
    print("Task 4 Running")

task_4 = PythonOperator(
    python_callable=get_dample,
    provide_context=True,
    dag=dag,
    task_id='task_4'
)

sample_task >> task_3
task_2 >> task_3
task_2 >> task_4
As you can see from the attached images, the XCom push did work and I can pull the values from the PythonOperator, but not from the BranchPythonOperator.
Any help is appreciated.
The XCom pull from inside the Python callable of the BranchPythonOperator always returns 'None', so the else block always runs.
(Screenshots: a tree view of the DAG, the XCom values from the Admin screen, and the XCom pull from the PythonOperator returning proper values.)
This is the original code that I am working with
The custom operator pushes a string, True or False, as an XCom value, which is then read by the BranchPythonOperator.
I want to read the value pushed by a task created using the above custom operator inside of a BranchPythonOperator task and choose a different path based on the returned value.
File: check_file_exists_operator.py
import logging
from tempfile import NamedTemporaryFile

from airflow.contrib.hooks.ssh_hook import SSHHook
from airflow.models import BaseOperator
from airflow.plugins_manager import AirflowPlugin
from airflow.utils.decorators import apply_defaults

log = logging.getLogger(__name__)

class CheckFileExistsOperator(BaseOperator):
    """
    This operator checks if a given file name exists on the
    sftp server.
    Returns true if it exists, false otherwise.

    :param sftp_path_prefix: The sftp remote path. This is the specified file path
        for downloading the file from the SFTP server.
    :type sftp_path_prefix: string
    :param file_to_be_processed: File that is to be searched
    :type file_to_be_processed: str
    :param sftp_conn_id: The sftp connection id. The name or identifier for
        establishing a connection to the SFTP server.
    :type sftp_conn_id: string
    :param timeout: timeout (in seconds) for executing the command.
    :type timeout: int
    :param do_xcom_push: return the stdout which also gets set in xcom by
        the airflow platform
    :type do_xcom_push: bool
    """

    FORWARD_SLASH_LITERAL = '/'

    template_fields = ('file_to_be_processed',)

    @apply_defaults
    def __init__(self,
                 sftp_path_prefix,
                 file_to_be_processed,
                 sftp_conn_id='ssh_default',
                 timeout=10,
                 do_xcom_push=True,
                 *args,
                 **kwargs):
        super(CheckFileExistsOperator, self).__init__(*args, **kwargs)
        self.sftp_path_prefix = sftp_path_prefix
        self.file_to_be_processed = file_to_be_processed
        self.sftp_conn_id = sftp_conn_id
        self.timeout = timeout
        self.do_xcom_push = do_xcom_push
        self.args = args
        self.kwargs = kwargs

    def execute(self, context):
        # Refer to https://docs.paramiko.org/en/2.4/api/sftp.html
        ssh_hook = SSHHook(ssh_conn_id=self.sftp_conn_id)
        sftp_client = ssh_hook.get_conn().open_sftp()
        sftp_file_absolute_path = self.sftp_path_prefix.strip() + \
                                  self.FORWARD_SLASH_LITERAL + \
                                  self.file_to_be_processed.strip()
        task_instance = context['task_instance']
        log.debug('Checking if the following file exists: %s', sftp_file_absolute_path)
        try:
            with NamedTemporaryFile("w") as temp_file:
                sftp_client.get(sftp_file_absolute_path, temp_file.name)
                # Return a string equivalent of the boolean.
                # Returning a boolean will make the key unreadable
                params = {'file_exists': True}
                self.kwargs['params'] = params
                task_instance.xcom_push(key="file_exists", value='True')
                log.info('File Exists, returning True')
                return 'True'
        except FileNotFoundError:
            params = {'file_exists': False}
            self.kwargs['params'] = params
            task_instance.xcom_push(key="file_exists", value='False')
            log.info('File Does not Exist, returning False')
            return 'False'

class CheckFilePlugin(AirflowPlugin):
    name = "check_file_exists"
    operators = [CheckFileExistsOperator]
File: airflow_dag_sample.py
import logging

from airflow import DAG
from check_file_exists_operator import CheckFileExistsOperator
from airflow.contrib.operators.sftp_to_s3_operator import SFTPToS3Operator
from airflow.operators.python_operator import BranchPythonOperator
from datetime import timedelta, datetime
from dateutil.relativedelta import relativedelta
from airflow.operators.email_operator import EmailOperator

log = logging.getLogger(__name__)

FORWARD_SLASH_LITERAL = '/'

default_args = {
    'owner': 'gvatreya',
    'depends_on_past': False,
    'start_date': datetime(2019, 1, 1),
    'email': ['***@***.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 5,
    'retry_delay': timedelta(minutes=2),
    'timeout': 10,
    'sftp_conn_id': 'sftp_local_cluster',
    'provide_context': True
}

dag = DAG('my_test_dag',
          description='Another tutorial DAG',
          schedule_interval='0 12 * * *',
          start_date=datetime(2017, 3, 20),
          default_args=default_args,
          template_searchpath='/Users/your_name/some_path/airflow_home/sql',
          catchup=False)

template_filename_from_xcom = """
    {{ task_instance.xcom_pull(task_ids='get_fname_ships', key='file_to_be_processed', dag_id='my_test_dag') }}
"""

template_file_prefix_from_xcom = """
    {{ task_instance.xcom_pull(task_ids='get_fname_ships', key="month_prefix_for_file", dag_id='my_test_dag') }}
"""

t_check_file_exists = CheckFileExistsOperator(
    sftp_path_prefix='/toDjembe',
    file_to_be_processed=template_filename_from_xcom.strip(),
    sftp_conn_id='sftp_local_cluster',
    task_id='check_file_exists',
    dag=dag
)

def branch(**kwargs):
    file_exist = kwargs['task_instance'].xcom_pull(task_ids='get_fname_ships', key="file_exists",
                                                   dag_id='my_test_dag')
    print(template_filename_from_xcom)
    from IPython import embed; embed()
    log.debug("FILE_EXIST(from branch): %s", file_exist)
    if file_exist:
        return 's3_upload'
    else:
        return 'send_file_not_found_email'

t_branch_on_file_existence = BranchPythonOperator(
    task_id='branch_on_file_existence',
    python_callable=branch,
    dag=dag
)

t_send_file_not_found_email = EmailOperator(
    task_id='send_file_not_found_email',
    to='***@***.com',
    subject=template_email_subject.format(state='FAILURE', filename=template_filename_from_xcom.strip(), content='Not found on SFTP Server'),
    html_content='File Not Found in SFTP',
    mime_charset='utf-8',
    dag=dag
)

t_upload_to_s3 = SFTPToS3Operator(
    task_id='s3_upload',
    sftp_conn_id='sftp_local_cluster',
    sftp_path='/djembe/' + template_filename_from_xcom.strip(),
    s3_conn_id='s3_conn',
    s3_bucket='djembe-users',
    s3_key='gvatreya/experiment/' + template_file_prefix_from_xcom.strip() + FORWARD_SLASH_LITERAL + template_filename_from_xcom.strip(),
    dag=dag
)

t_check_file_exists >> t_branch_on_file_existence
t_branch_on_file_existence >> t_upload_to_s3
t_branch_on_file_existence >> t_send_file_not_found_email
However, when I run the code, the branch operator always sees the string 'None'.
However, the XCom has the value 'True'.
I tried debugging using IPython's embed() and saw that the task instance does not hold the value of the XCom. I tried using params and other things that I could think of, but to no avail.
After spending days on this, I am now starting to think I have missed something crucial about XCom in Airflow.
Hoping anyone can help.
Thanks in advance.
I think the issue is with the dependencies.
You currently have the following:
sample_task >> task_3
task_2 >> task_3
task_2 >> task_4
Change it to the following, i.e. add a sample_task >> task_2 line:
sample_task >> task_3
sample_task >> task_2
task_2 >> task_3
task_2 >> task_4
The task that pushes to XCom has to run before the task that uses the BranchPythonOperator.
In your second example, the branch function uses xcom_pull(task_ids='get_fname_ships', ...), but I can't find any task with the task_id get_fname_ships.
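As an illustration of that second point (my own sketch, not part of the original answer): the pull would likely need to use the task_id that is actually defined in the DAG, check_file_exists, and compare against the string that the operator pushes:

def branch(**kwargs):
    # pull from the task that actually exists and pushes the 'file_exists' key
    file_exist = kwargs['task_instance'].xcom_pull(task_ids='check_file_exists', key='file_exists')
    log.debug("FILE_EXIST(from branch): %s", file_exist)
    if file_exist == 'True':  # the operator pushes the strings 'True'/'False'; note 'False' is truthy
        return 's3_upload'
    return 'send_file_not_found_email'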

Airflow, XCom and multiple task_ids

How do the task_ids work when multiple tasks are specified?
In this particular code example I expected to retrieve load_cycle_id_2 from both tasks as a tuple (5555, 22222), but instead it comes out as (None, 22222).
Why is that?
from airflow.models import DAG
from airflow.operators.python_operator import PythonOperator
from datetime import datetime

args = {
    'owner': 'airflow',
    'start_date': datetime.now(),
    'provide_context': True
}

demo_dag = DAG(dag_id='first', start_date=datetime.now(), schedule_interval='@once', default_args=args)

def push_load_id(**kwargs):
    kwargs['ti'].xcom_push(key='load_cycle_id_2', value=22222)
    kwargs['ti'].xcom_push(key='load_cycle_id_3', value=44444)

def another_push_load_id(**kwargs):
    kwargs['ti'].xcom_push(key='load_cycle_id_2', value=5555)
    kwargs['ti'].xcom_push(key='anotherload_cycle_id_3', value=6666)

def pull_load_id(**kwargs):
    ti = kwargs['ti'].xcom_pull(key='load_cycle_id_2', task_ids=['another_push_load_id', 'push_load_id'])
    print(ti)

push_operator = PythonOperator(task_id='push_load_id', python_callable=push_load_id, dag=demo_dag)
pull_operator = PythonOperator(task_id='pull_load_id', python_callable=pull_load_id, dag=demo_dag)

push_operator >> pull_operator
Your DAG runs only the push_load_id and pull_load_id functions. You never create an operator that uses the another_push_load_id function, so there is no task with that task_id to pull from.
The end of your code should look like:
push_operator = PythonOperator(task_id='push_load_id', python_callable=push_load_id, dag=demo_dag)
another_push_operator = PythonOperator(task_id='another_push_load_id', python_callable=another_push_load_id, dag=demo_dag)
pull_operator = PythonOperator(task_id='pull_load_id', python_callable=pull_load_id, dag=demo_dag)

push_operator >> another_push_operator >> pull_operator

Usage of variable in airflow DAG

I set a variable with the "airflow variables" command in the CLI, and I want to use this variable in a DAG.
I executed the following commands in the terminal:
airflow variables -s sh_path = "/tmp/echo_test.sh"
airflow scheduler
But the following error keeps occurring:
Broken DAG: [/root/airflow/dags/param_test.py] invalid syntax (param_test.py, line 13)
Here is the code:
from airflow import DAG
from airflow.models import Variable
from airflow.operators.bash_operator import BashOperator
tmpl_search_path = Variable.get ("sh_path")
dag = DAG ('param_test', schedule_interval = '* / 5 * * * *'
           start_date = datetime (2018,9,4), catchup = False)
bash_task = BashOperator (
      task_id = "bash_task"
      bash_command = 'sh '+ {{var.value.tmpl_search_path}},
      dag = dag)
bash_task.set_downstream (python_task)
bash_task1 = BashOperator (
      task_id = 'echo',
      bash_command = 'echo 1',
      dag = dag)
bash_task.set_downstream (bash_task1)
You need to quote the Jinja templating. Use it as below:
bash_task = BashOperator(
    task_id="bash_task",
    bash_command="sh {{ var.value.tmpl_search_path }}",
    dag=dag)
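One additional caveat, not part of the original answer: {{ var.value.<key> }} looks up an Airflow Variable by its key, and the variable created above was named sh_path (tmpl_search_path is only a Python variable in the DAG file), so the template would likely need to reference that key instead:

bash_task = BashOperator(
    task_id="bash_task",
    bash_command="sh {{ var.value.sh_path }}",  # sh_path is the Variable key set via the CLI
    dag=dag)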
