How to dynamically add bucket_key value in airflow's S3KeySensor - python

I'm trying to set S3KeySensor's bucket_key up based on dagrun input variable.
I have one dag "dag_trigger" that uses TriggerDagRunOperator to trigger dagrun for dag "dag_triggered". I'm trying to extend example https://github.com/apache/incubator-airflow/blob/master/airflow/example_dags/example_trigger_target_dag.py.
So I want to send a variable to the triggered dag, and according to the variable's value I want to set the bucket_key value in the S3KeySensor task. I know how to use the sent variable in a PythonOperator callable function, but I do not know how to use it on the sensor object.
dag_trigger dag:
import datetime
from airflow import DAG
from airflow.operators.dagrun_operator import TriggerDagRunOperator
default_args = {
    'owner': 'airflow',
    'start_date': datetime.datetime.now()}
dag = DAG('dag_trigger', default_args=default_args, schedule_interval="@hourly")
def task_1_run(context, dag_run_object):
    sent_variable = '2018_02_19'  # not important
    dag_run_object.payload = {'message': sent_variable}
    print("DAG dag_trigger triggered with payload: %s" % dag_run_object.payload)
    return dag_run_object
task_1 = TriggerDagRunOperator(task_id="task_1",
                               trigger_dag_id="dag_triggered",
                               provide_context=True,
                               python_callable=task_1_run,
                               dag=dag)
And dag_triggered dag:
import datetime
from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from airflow.operators.sensors import S3KeySensor
default_args = {
    'owner': 'airflow',
    'start_date': datetime.datetime.now()
}
dag = DAG('dag_triggered', default_args=default_args, schedule_interval=None)
wait_files_to_arrive_task = S3KeySensor(
    task_id='wait_file_to_arrive',
    bucket_key='file_%s' % '',  # Here I want to place conf['sent_variable']
    wildcard_match=True,
    bucket_name='test-bucket',
    s3_conn_id='test_s3_conn',
    timeout=18*60*60,
    poke_interval=120,
    dag=dag)
I tried to get the value from the dag object using dag.get_dagrun().conf['sent_variable'], but I am not sure how to set the dagrun create_date variable (dag_trigger will trigger dag_triggered every hour, and dag_triggered could wait longer than that for the file).
I also tried to create a PythonOperator upstream of wait_files_to_arrive_task. The Python callable could get the information about sent_variable. After that I tried to set bucket_key with something like bucket_key = callable_function(), but I have a problem with the arguments.
I also think global variables are not a good solution.
Maybe someone has an idea that works.

It's not possible to fetch a value in your DAG run conf directly in your DAG file. That's something that cannot be determined without context of which DAG run it's part of. One way to think about it is when you run python my_dag.py to test if your DAG file compiles, it has to initialize all these operators without needing to specify an execution date. So anything that could differ by DAG run can't be referenced directly.
So instead, you can pass it as a template value which will later get rendered with context when the task is actually being run.
wait_files_to_arrive_task = S3KeySensor(
    task_id='wait_file_to_arrive',
    bucket_key='file_{{ dag_run.conf["message"] }}',
    ...)
Note that only parameters listed in the template_fields of an operator will be rendered. Luckily someone anticipated this so bucket_key is indeed a template field.
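To make the parse-time versus run-time distinction concrete, here is a toy sketch in plain Python. ToySensor and its regex-based render method are hypothetical stand-ins written for this illustration; they are not Airflow's Jinja engine or its operator classes, and the placeholder syntax is simplified to {{ name }}. The sketch only shows the idea that attributes listed in template_fields are stored as raw strings when the DAG file is parsed and filled in later from the run's context.

```python
import re


class ToySensor:
    # Hypothetical stand-in for an operator; not an Airflow class.
    # Only attributes named here are rendered, mirroring the template_fields idea.
    template_fields = ("bucket_key",)

    def __init__(self, task_id, bucket_key):
        # At "parse time" the values are stored verbatim; no run context exists yet.
        self.task_id = task_id
        self.bucket_key = bucket_key

    def render(self, context):
        # At "run time" each templated field is filled in from the run's context.
        pattern = re.compile(r"\{\{\s*(\w+)\s*\}\}")
        for field in self.template_fields:
            raw = getattr(self, field)
            rendered = pattern.sub(lambda m: str(context[m.group(1)]), raw)
            setattr(self, field, rendered)


sensor = ToySensor(task_id="wait_{{ message }}", bucket_key="file_{{ message }}")
print(sensor.bucket_key)  # still the raw template string
sensor.render({"message": "2018_02_19"})
print(sensor.bucket_key)  # file_2018_02_19
print(sensor.task_id)     # unchanged, because task_id is not in template_fields
```

The same reasoning explains why a value placed in a non-templated parameter stays a literal string: nothing ever substitutes it.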

Related

How to run Airflow dag with conn_id in config template by PostgresOperator?

I have an Airflow DAG with a PostgresOperator that executes a SQL query. I want to switch between my test database and my prod database through the run config (run w/ config). But postgres_conn_id is not a template field, so PostgresOperator says "{{ dag_run.conf.get('CONN_ID_TEST', 'pg_database') }}" is not a connection.
I run this script with the config {"CONN_ID_TEST": "pg_database_test"}.
I tried to create a custom PostgreSQL operator with the same code as on the Airflow GitHub, adding template_fields: Sequence[str] = ("postgres_conn_id",) at the top of my class CustomPostgresOperator, but that doesn't work either (same error).
I have two conn_id env variables :
AIRFLOW_CONN_ID_PG_DATABASE (prod)
AIRFLOW_CONN_ID_PG_DATABASE_TEST (test)
My script looks like :
import datetime as dt
from airflow import DAG
from airflow.providers.postgres.operators.postgres import PostgresOperator
from airflow.operators.dummy import DummyOperator
DAG_ID = "init_database"
POSTGRES_CONN_ID = "{{ dag_run.conf.get('CONN_ID_TEST', 'pg_database') }}"
with DAG(
    dag_id=DAG_ID,
    description="My dag",
    schedule_interval="@once",
    start_date=dt.datetime(2022, 1, 1),
    catchup=False,
) as dag:
    start = DummyOperator(task_id='start')
    my_task = PostgresOperator(  #### OR CustomPostgresOperator
        task_id="select",
        sql="SELECT * FROM pets LIMIT 1;",
        postgres_conn_id=POSTGRES_CONN_ID,
        autocommit=True
    )
    start >> my_task
How can I solve my problem? And if it is not possible, how can I switch my PostgresOperator connection to my dev database without creating another DAG script?
Thanks, Léo
Subclassing is a solid way to modify the template_fields how you wish. Since template_fields is a class attribute your subclass only really needs to be the following (assuming you're just adding the connection ID to the existing template_fields):
from airflow.providers.postgres.operators.postgres import PostgresOperator as _PostgresOperator
class PostgresOperator(_PostgresOperator):
    template_fields = [*_PostgresOperator.template_fields, "conn_id"]
The above is using Postgres provider version 5.3.1 which actually uses the Common SQL provider under the hood so the connection parameter is actually conn_id. (template_fields refer to the instance attribute name rather than the parameter name.)
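The point about attribute names versus parameter names can be illustrated without Airflow installed. In this minimal sketch, BaseSqlOperator is a hypothetical stand-in for the provider class, and its constructor signature is an assumption made for the example: it accepts postgres_conn_id but stores the value as self.conn_id, so the name to add to template_fields is conn_id.

```python
class BaseSqlOperator:
    # Hypothetical stand-in for the provider operator; not the real class.
    template_fields = ("sql",)

    def __init__(self, sql, postgres_conn_id):
        self.sql = sql
        # The parameter is called postgres_conn_id, but the attribute is conn_id.
        self.conn_id = postgres_conn_id


class TemplatedConnOperator(BaseSqlOperator):
    # Build a new tuple on the subclass; the parent's tuple is left untouched.
    template_fields = (*BaseSqlOperator.template_fields, "conn_id")


op = TemplatedConnOperator(
    sql="SELECT 1;",
    postgres_conn_id="{{ dag_run.conf['environment'] }}",
)

# Every templated name has to exist as an instance attribute to be rendered.
print([name for name in op.template_fields if hasattr(op, name)])  # ['sql', 'conn_id']
print(hasattr(op, "postgres_conn_id"))                             # False
```

Listing "postgres_conn_id" in template_fields would point at an attribute that does not exist on the instance in this sketch, which is one plausible reason the attempt in the question had no effect.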
For example, assume the below DAG gets triggered with a run config of {"environment": "dev"}:
from pendulum import datetime
from airflow.decorators import dag
from airflow.providers.postgres.operators.postgres import PostgresOperator as _PostgresOperator
class PostgresOperator(_PostgresOperator):
    template_fields = [*_PostgresOperator.template_fields, "conn_id"]
@dag(start_date=datetime(2023, 1, 1), schedule=None)
def template_postgres_conn():
    PostgresOperator(task_id="run_sql", sql="SELECT 1;", postgres_conn_id="{{ dag_run.conf['environment'] }}")
template_postgres_conn()
Looking at the task log, the connection ID of "dev" is used to execute the SQL.

How to avoid dynamic execution of expression in dag parameter at Airflow?

I'm using a parameter that is the timestamp in a set of tasks:
default_dag_args = {'arg1': 'arg1-value',
'arg2': 'arg2-value',
'now': datetime.now()}
I would like the now parameter to have the same value for all the tasks. But what happens is that it's re-evaluated for each function.
Is there a way of doing it (executing once and using the same value through the dag)? I'm using the TaskFlow API for Airflow 2.0:
@task
def python_task():
    context = get_current_context()
    context_dag = context['dag']
    now = context_dag.default_args['now']
    print(now)
I tried to set the time constant, at the start of the dag file, like:
TIME = datetime.now()
and got the context inside of the tasks with get_current_context() just like you did.
Sadly, I think that because the DAG file is run from the start every time a task is defined in the script, the time was recalculated.
One idea I have is to use XComs to save the datetime to a variable and pull it in other tasks.
My sample code is below; I think you'll get the idea.
from airflow.decorators import task, dag
from datetime import datetime
import time
default_arguments = {
    'owner': 'admin',
    # This is the beginning, for more see: https://airflow.apache.org/faq.html#what-s-the-deal-with-start-date
    'start_date': datetime(2022, 5, 2)
}
@dag(
    schedule_interval=None,
    dag_id="Time_Example_Dag",
    default_args=default_arguments,
    catchup=False,
)
def the_global_time_checker_dag():
    @task
    def time_set():
        # To use XCOM to pass the value between tasks,
        # we have to parse the datetime to a string.
        now = str(datetime.now())
        return now
    @task
    def starting_task(datetime_string):
        important_number = 23
        # We can use this datetime object in whatever way we like.
        date_time_obj = datetime.strptime(datetime_string, '%Y-%m-%d %H:%M:%S.%f')
        print(date_time_obj)
        return important_number
    @task
    def important_task(datetime_string, number):
        # Passing some time
        time.sleep(10)
        # Again, we are free to do whatever we want with this object.
        date_time_obj = datetime.strptime(datetime_string, '%Y-%m-%d %H:%M:%S.%f')
        print(date_time_obj)
        print("The important number is: {}".format(number))
    time_right_now = time_set()
    start = starting_task(datetime_string=time_right_now)
    important = important_task(datetime_string=time_right_now, number=start)
time_checker = the_global_time_checker_dag()
Through the logs, you can see all the datetime values are the same.
For more information about XCOM in Taskflow API, you can check here.
When a worker gets a task instance to run, it rebuilds the whole DagBag from the Python files to get the DAG and task definition. So every time a task instance is run, your DAG file is sourced, rerunning your DAG definition code. The resulting DAG object is the one that defines that particular task instance.
It's critical to understand that the DAG definition is not simply built once for every execution date and then persisted and reused for all TIs within that DagRun. The DAG definition is constantly being recomputed from your Python code, and each TI is run in a separate process, independently and without state from other tasks. Thus, if your DAG definition includes non-deterministic results at DagBag build time, such as datetime.now(), every instantiation of your DAG, even for the same execution date, will have different values. You need to build your DAGs in a deterministic and idempotent manner.
The only way to share non-deterministic results is to store them in the DB and have your tasks fetch them, as @sezai-burak-kantarcı has pointed out. Best practice is to use task context-specific variables, like {{ ds }}, {{ execution_date }}, {{ data_interval_start }}, etc. These are the same for all tasks within a DAG run. You can see the available template variables in the Airflow templates reference.
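The effect described above can be reproduced in plain Python. In this sketch, build_dag_definition is a hypothetical stand-in for parsing the DAG file (it is not an Airflow function), and the logical_date key in the context dictionary is an assumed name for the run's date. Each simulated parse gets a fresh datetime.now(), while a value derived from the run's context is identical no matter how often the file is parsed.

```python
import time
from datetime import datetime


def build_dag_definition():
    # Stand-in for module-level DAG code: it runs again on every parse.
    return {"now": datetime.now()}


def run_task(context):
    # Stand-in for a task that derives its timestamp from the run's context.
    return context["logical_date"].strftime("%Y-%m-%d")


first_parse = build_dag_definition()
time.sleep(0.01)  # a second worker parses the file a little later
second_parse = build_dag_definition()
print(first_parse["now"] == second_parse["now"])  # False: each parse read the clock again

context = {"logical_date": datetime(2022, 5, 2)}  # fixed for the whole run
print(run_task(context) == run_task(context))     # True: same value in every task
```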

Get session parameter for airflow.models.dag get_last_dagrun()

I am trying to pass my custom Operator a parameter, which is the last run time of the DAG itself.
Following the documentation, I understand that I should use dag.get_last_dagrun() https://airflow.apache.org/docs/apache-airflow/1.10.6/_api/airflow/models/dag/index.html#airflow.models.dag.get_last_dagrun . However, I can't manage to pass the session parameter correctly.
Where can I find this?
When I use the function without parameters, it returns None.
I think that's because I triggered the DAG myself, so I want to set include_externally_triggered to True. But I still need to handle the session parameter first.
I tried to create the variable last_run before creating the DAG and also when defining the tasks. I suppose that inside the task, self is included and it will fetch correctly without any parameters.
But what about the one outside of the DAG?
I have also tried this solution, which gives me a time even the first time I run the DAG (I cleared the DAG log from the UI). Maybe it's the timestamp of the currently executing DAG? If so, would I need to compare the dates and skip the result if they are equal? https://stackoverflow.com/a/63930004/18036486
from datetime import datetime
from airflow import DAG
from DAG.operators.custom_operator1 import customOperator1
last_run = dag.get_last_dagrun()  # HERE
default_args = {
    "owner": "admin",
    "depends_on_past": False,
    "email": ["email@email.com"],
    "email_on_failure": False,
    "email_on_retry": False,
    "retries": 1,
}
with DAG(
    dag_id="Custom",
    schedule_interval="@once",
    description="Desc",
    start_date=datetime(2022, 3, 11),
    catchup=False,
    tags=["custom"],
    default_args=default_args) as dag:
    # Custom Operator
    custom = customOperator1(
        task_id='custom',
        last_run=dag.get_last_dagrun()  # OR HERE
    )
    custom
The answer at https://stackoverflow.com/a/63930004/18036486 includes the currently running DAG run. I therefore modified the function slightly to keep only runs whose state is 'success'; of course, you can add other conditions for the other DAG run states.
Now I can get the latest successful DAG run's execution_date to dynamically update my data!
from airflow.models import DagRun
def get_last_exec_date(dag_id):
    dag_runs = DagRun.find(dag_id=dag_id)
    # Keep only the runs that finished successfully.
    dags = []
    for dag in dag_runs:
        if dag.state == 'success':
            dags.append(dag)
    # Sort newest first so that index 0 is the latest successful run.
    dags.sort(key=lambda x: x.execution_date, reverse=True)
    return dags[0] if dags != [] else None
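The selection logic can be checked in isolation. The sketch below uses a hypothetical FakeRun record in place of Airflow's DagRun (only the state and execution_date attributes are assumed) and picks the most recent successful run with max() instead of sorting the whole list.

```python
from collections import namedtuple
from datetime import datetime

# Hypothetical stand-in for DagRun; only the two attributes used below.
FakeRun = namedtuple("FakeRun", ["state", "execution_date"])


def latest_successful(runs):
    # Keep successful runs only, then take the one with the newest date.
    successful = [run for run in runs if run.state == "success"]
    return max(successful, key=lambda run: run.execution_date, default=None)


runs = [
    FakeRun("success", datetime(2022, 3, 9)),
    FakeRun("success", datetime(2022, 3, 10)),
    FakeRun("running", datetime(2022, 3, 11)),  # the current run is ignored
]
print(latest_successful(runs).execution_date)  # 2022-03-10 00:00:00
print(latest_successful([]))                   # None
```

Returning None when no run has succeeded yet lets the caller handle the first execution explicitly.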

Airflow operator and dags and properly returning, exposing, and accessing values?

I need to create an Airflow operator that takes a few inputs and returns a string that will be used as an input for another operator that runs next. I'm new to Airflow DAGs and operators and am confused about how to do this properly. Since I'm building this for people who use Airflow and build DAGs, and I'm not an actual Airflow user or DAG developer myself, I want advice on doing it properly. I have created an operator that returns a token (just a string, so the hello world operator example works fine). With that in place, I can see the value in XCom for the DAG execution. But how would I properly retrieve that value and pass it into the next operator? For my example I just call the same operator again, but in practice it will be a different operator. I just do not know how to code this properly. Do I just add code to the DAG? Does the operator need code added? Or should something else change?
Example Dag:
import logging
import os
from airflow import DAG
from airflow.utils.dates import days_ago
from custom_operators.hello_world import HelloWorldOperator
from datetime import datetime
default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 0,
}
dag = DAG("hello_world_test",
          description='Testing out a operator',
          start_date=days_ago(1),
          schedule_interval=None,
          catchup=False,
          default_args=default_args)
# The keyword matches the some_input parameter of HelloWorldOperator below.
get_token = HelloWorldOperator(
    task_id='hello_world_task_1',
    some_input='My input to generate a token',
    dag=dag
)
token = "My token"  # Want this to be the return value from get_token
run_this = HelloWorldOperator(
    task_id='hello_world_task_2',
    some_input=token,
    dag=dag
)
logging.info("Start")
get_token >> run_this
logging.info("End")
Hello World operator:
from airflow.models.baseoperator import BaseOperator
class HelloWorldOperator(BaseOperator):
    def __init__(
            self,
            some_input: str,
            **kwargs) -> None:
        super().__init__(**kwargs)
        self.some_input = some_input
    def execute(self, context):
        # Bunch of business logic
        token = "MyGeneratedToken"
        return token
This is a good start :).
The right way to retrieve the token from another task is to use Jinja templating:
run_this = RetrieveToken(
    task_id='hello_world_task_2',
    # Passing a single task id (not a list) makes xcom_pull return the value itself.
    retrieved_token="{{ ti.xcom_pull(task_ids='hello_world_task_1') }}",
    dag=dag
)
However, you have to remember to add retrieved_token to the template_fields of your RetrieveToken operator: https://airflow.apache.org/docs/apache-airflow/stable/howto/custom-operator.html#templating
You can also call xcom_pull method explicitly in your "retrieve" operator and pass the "origin" task id to the operator to retrieve it from the right task.

Airflow 2.0.0+ - Pass a Dynamically Generated Dictionary to DAG Triggered by TriggerDagRunOperator

Previously, I was using the python_callable parameter of the TriggerDagRunOperator to dynamically alter the dag_run_obj payload that is passed to the newly triggered DAG.
Since its removal in Airflow 2.0.0 (Pull Req: https://github.com/apache/airflow/pull/6317), is there a way to do this, without creating a custom TriggerDagRunOperator?
For context, here is the flow of my code:
# Poll Amazon S3 bucket for new received files
fileSensor_tsk = S3KeySensor()
# Use chooseDAGBasedOnInput function to create dag_run object (previously python_callable was used directly in TriggerDagRunOperator to create the dag_run object for the new triggered DAG)
# dag_run object will pass received file name details to new DAG for reference in order to complete its own work
chooseDAGTrigger_tsk = BranchPythonOperator(
    task_id='chooseDAGTrigger_tsk',
    python_callable=chooseDAGBasedOnInput,
    provide_context=True
)
triggerNewDAG_tsk = TriggerDagRunOperator(
    task_id='triggerNewDAG_tsk',
    trigger_dag_id='1000_NEW_LOAD'
)
triggerNewDAG2_tsk = TriggerDagRunOperator(
    task_id='triggerNew2DAG_tsk',
    trigger_dag_id='1000_NEW2_LOAD'
) ...
Any help or commentary would be greatly appreciated!
EDIT - adding previously used python_callable function used in TriggerDagRunOperator:
def intakeFile(context, dag_run_obj):
    # read from S3, get filename and pass to triggered DAG
    bucket_name = os.environ.get('bucket_name')
    s3_hook = S3Hook(aws_conn_id='aws_default')
    s3_hook.copy_object()
    s3_hook.delete_objects()
    ...
    dag_run_obj.payload = {
        'filePath': workingPath,
        'source': source,
        'fileName': fileName
    }
    return dag_run_obj
The TriggerDagRunOperator now takes a conf parameter, to which a dictionary can be provided as the conf object for the DagRun. The documentation on triggering DAGs has more information which you may find helpful as well.
EDIT
Since you need to execute a function to determine which DAG to trigger and do not want to create a custom TriggerDagRunOperator, you could execute intakeFile() in a PythonOperator (or use the @task decorator with the TaskFlow API) and use the return value as the conf argument in the TriggerDagRunOperator. As of Airflow 2.0, return values are automatically pushed to XCom by many operators, the PythonOperator included.
Here is the general idea:
def intakeFile(*args, **kwargs):
    # read from S3, get filename and pass to triggered DAG
    bucket_name = os.environ.get("bucket_name")
    s3_hook = S3Hook(aws_conn_id="aws_default")
    s3_hook.copy_object()
    s3_hook.delete_objects()
    ...
    # Return a plain dict; the PythonOperator pushes it to XCom as return_value.
    return {
        "filePath": workingPath,
        "source": source,
        "fileName": fileName,
    }
get_dag_to_trigger = PythonOperator(
    task_id="get_dag_to_trigger",
    python_callable=intakeFile
)
triggerNewDAG_tsk = TriggerDagRunOperator(
    task_id="triggerNewDAG_tsk",
    trigger_dag_id="1000_NEW_LOAD",
    # conf is a templated field, so each value is pulled from XCom at run time.
    conf={
        "filePath": "{{ ti.xcom_pull(task_ids='get_dag_to_trigger')['filePath'] }}",
        "source": "{{ ti.xcom_pull(task_ids='get_dag_to_trigger')['source'] }}",
        "fileName": "{{ ti.xcom_pull(task_ids='get_dag_to_trigger')['fileName'] }}",
    },
)
get_dag_to_trigger >> triggerNewDAG_tsk
