How do I access an Airflow rendered template downstream?

Airflow version: 1.10.12
I'm having trouble passing a rendered template object for use downstream. I'm trying to grab two config variables from the Airflow conf.
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from datetime import datetime
with DAG(
    dag_id="example_trigger_target_dag",
    default_args={"owner": "airflow"},
    start_date=datetime(2021, 6, 24),
    schedule_interval='@once',
    tags=['example'],
) as dag:
    bash_task = BashOperator(
        task_id="bash_task",
        bash_command='echo "{{ conf.test }}:{{ conf.tag }}"',
        xcom_push=True
    )
Basically, what gets pushed to XCom is just ":" without the variable values. The full rendered string does show up in the Rendered tab, however. Am I missing something?
Edit: the variables exist in the conf, I just replaced them with test and tag for security reasons.

I just tested this code (I'm using Airflow 2.1.0) and got that result.
BashOperator(task_id='test_task',
             bash_command='echo " VARS: {{ conf.email.email_backend '
                          '}}:{{ conf.core.executor }}"',
             do_xcom_push=True)

Related

How to run Airflow dag with conn_id in config template by PostgresOperator?

I have an Airflow DAG with a PostgresOperator that executes a SQL query. I want to switch between my test database and my prod database through the run config (run w/ config). But postgres_conn_id is not a template field, so PostgresOperator says that "{{ dag_run.conf.get('CONN_ID_TEST', 'pg_database') }}" is not a connection.
I run this script with the config {"CONN_ID_TEST": "pg_database_test"}.
I tried to create a custom PostgreSQL operator with the same code as on the Airflow GitHub, and I added template_fields: Sequence[str] = ("postgres_conn_id",) at the top of my class CustomPostgresOperator, but that doesn't work either (same error).
I have two conn_id env variables:
AIRFLOW_CONN_ID_PG_DATABASE (prod)
AIRFLOW_CONN_ID_PG_DATABASE_TEST (test)
My script looks like:
import datetime as dt
from airflow import DAG
from airflow.providers.postgres.operators.postgres import PostgresOperator
from airflow.operators.dummy import DummyOperator
DAG_ID = "init_database"
POSTGRES_CONN_ID = "{{ dag_run.conf.get('CONN_ID_TEST', 'pg_database') }}"
with DAG(
    dag_id=DAG_ID,
    description="My dag",
    schedule_interval="@once",
    start_date=dt.datetime(2022, 1, 1),
    catchup=False,
) as dag:
    start = DummyOperator(task_id='start')
    my_task = PostgresOperator(  # or CustomPostgresOperator
        task_id="select",
        sql="SELECT * FROM pets LIMIT 1;",
        postgres_conn_id=POSTGRES_CONN_ID,
        autocommit=True
    )
    start >> my_task
How can I proceed to solve my problem? And if it is not possible, how can I switch my PostgresOperator connection to my dev database without creating another DAG script?
Thanks, Léo
Subclassing is a solid way to modify the template_fields how you wish. Since template_fields is a class attribute your subclass only really needs to be the following (assuming you're just adding the connection ID to the existing template_fields):
from airflow.providers.postgres.operators.postgres import PostgresOperator as _PostgresOperator
class PostgresOperator(_PostgresOperator):
    template_fields = [*_PostgresOperator.template_fields, "conn_id"]
The above is using Postgres provider version 5.3.1 which actually uses the Common SQL provider under the hood so the connection parameter is actually conn_id. (template_fields refer to the instance attribute name rather than the parameter name.)
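To see why the entry has to be the attribute name, here is a minimal sketch that runs without Airflow. BaseSketchOperator, SketchPostgresOperator, and render_fields are hypothetical stand-ins, not Airflow classes, and the renderer only does plain `{{ key }}` substitution rather than real Jinja. It assumes, as described above, that the parent stores the postgres_conn_id parameter on self.conn_id.

```python
import re


class BaseSketchOperator:
    """Hypothetical stand-in for an operator; not an Airflow class."""

    template_fields = ("sql",)

    def __init__(self, sql, postgres_conn_id):
        self.sql = sql
        # The constructor parameter is stored under a different attribute name.
        self.conn_id = postgres_conn_id

    def render_fields(self, context):
        """Render only the attributes listed in template_fields."""
        for name in self.template_fields:
            value = getattr(self, name)
            rendered = re.sub(
                r"\{\{\s*(\w+)\s*\}\}",
                lambda m: str(context[m.group(1)]),
                value,
            )
            setattr(self, name, rendered)


class SketchPostgresOperator(BaseSketchOperator):
    # Extend the inherited class attribute with the attribute name,
    # not the constructor parameter name.
    template_fields = (*BaseSketchOperator.template_fields, "conn_id")


context = {"environment": "dev"}

base = BaseSketchOperator(sql="SELECT 1;", postgres_conn_id="{{ environment }}")
base.render_fields(context)

custom = SketchPostgresOperator(sql="SELECT 1;", postgres_conn_id="{{ environment }}")
custom.render_fields(context)

print(base.conn_id)    # still the raw template string
print(custom.conn_id)  # rendered from the context
```

The base class leaves conn_id untouched because it is not listed in template_fields, while the subclass renders it. The parent's template_fields stays unchanged, since the subclass builds a new tuple instead of mutating the inherited one.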
For example, assume the below DAG gets triggered with a run config of {"environment": "dev"}:
from pendulum import datetime
from airflow.decorators import dag
from airflow.providers.postgres.operators.postgres import PostgresOperator as _PostgresOperator
class PostgresOperator(_PostgresOperator):
    template_fields = [*_PostgresOperator.template_fields, "conn_id"]
@dag(start_date=datetime(2023, 1, 1), schedule=None)
def template_postgres_conn():
    PostgresOperator(task_id="run_sql", sql="SELECT 1;", postgres_conn_id="{{ dag_run.conf['environment'] }}")
template_postgres_conn()
Looking at the task log, the connection ID of "dev" is used to execute the SQL.

Get session parameter for airflow.models.dag get_last_dagrun()

I am trying to pass a parameter into my custom Operator: the last run time of the DAG itself.
Following the documentation, I understand that I should use dag.get_last_dagrun() https://airflow.apache.org/docs/apache-airflow/1.10.6/_api/airflow/models/dag/index.html#airflow.models.dag.get_last_dagrun . However, I can't manage to pass the session parameter correctly.
Where can I find this?
When I use the function without parameters, it returns None.
I think that's because I triggered the DAG myself, so I want to set include_externally_triggered to True. But I still need to handle the session parameter first.
I tried to create the variable last_run before creating the DAG and also when defining the tasks. I suppose that inside the task, self is included and it will fetch correctly without any parameters.
But what about the one that is outside of the DAG?
I have also tried this solution, which gives me a time even the first time I run the DAG (I cleared the DAG log from the UI). Maybe it's the timestamp of the currently executing DAG? If so, would I need to compare the dates and skip the run if they are equal? https://stackoverflow.com/a/63930004/18036486
from datetime import datetime
from airflow import DAG
from DAG.operators.custom_operator1 import customOperator1
last_run = dag.get_last_dagrun()  # HERE
default_args = {
    "owner": "admin",
    "depends_on_past": False,
    "email": ["email@email.com"],
    "email_on_failure": False,
    "email_on_retry": False,
    "retries": 1,
}
with DAG(
    dag_id="Custom",
    schedule_interval="@once",
    description="Desc",
    start_date=datetime(2022, 3, 11),
    catchup=False,
    tags=["custom"],
    default_args=default_args) as dag:
    # Custom Operator
    custom = customOperator1(
        task_id='custom',
        last_run=dag.get_last_dagrun()  # OR HERE
    )
    custom
The answer at https://stackoverflow.com/a/63930004/18036486 also includes the currently running DAG run. Therefore, I slightly modified the function to keep only the runs whose state is 'success'; of course, you can add other conditions for the other DAG run states.
Now I can get the latest successful DAG execution_date to dynamically update my data!
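from airflow.models import DagRun
def get_last_exec_date(dag_id):
    dag_runs = DagRun.find(dag_id=dag_id)
    # Keep only successful runs, which excludes the run that is currently executing.
    dags = []
    for dag in dag_runs:
        if dag.state == 'success':
            dags.append(dag)
    # Sort newest first so that dags[0] is the most recent successful run.
    dags.sort(key=lambda x: x.execution_date, reverse=True)
    return dags[0] if dags != [] else None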
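The selection logic can be tried out without a running Airflow instance. Below is a minimal sketch under that assumption: FakeDagRun is a hypothetical stand-in that only carries the two attributes the function reads (state and execution_date), and latest_successful_run is an illustrative name. It uses max() instead of sorting, which picks the newest run directly.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass
class FakeDagRun:
    """Hypothetical stand-in for a DAG run; not the Airflow model."""

    state: str
    execution_date: datetime


def latest_successful_run(dag_runs: List[FakeDagRun]) -> Optional[FakeDagRun]:
    """Return the successful run with the newest execution_date, or None."""
    successful = [run for run in dag_runs if run.state == "success"]
    return max(successful, key=lambda run: run.execution_date, default=None)


runs = [
    FakeDagRun("success", datetime(2022, 3, 9)),
    FakeDagRun("success", datetime(2022, 3, 10)),
    FakeDagRun("running", datetime(2022, 3, 11)),  # the run in progress
]

latest = latest_successful_run(runs)
print(latest.execution_date if latest else None)
```

The run that is still in progress is skipped because its state is not 'success', and an empty history yields None, so the caller should handle a first run explicitly.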

Airflow operator and dags and properly returning, exposing, and accessing values?

I need to create an Airflow operator that takes a few inputs and returns a string that will be used as an input for another operator that runs next. I'm new to Airflow DAGs and operators and am confused about how to do this properly. Since I'm building this for people who use Airflow and build DAGs, and I'm not an actual Airflow user or DAG developer myself, I want advice on doing it properly. I have created an operator that returns a token (just a string, so the hello world operator example works fine). With it, I can see the value in XCom for the DAG execution. But how do I properly retrieve that value and feed it into the next operator? In my example I just call the same operator twice, but in reality the second one will be a different operator. I just do not know how to code this properly. Do I just add code to the DAG? Does the operator need code added? Or something else?
Example Dag:
import logging
import os
from airflow import DAG
from airflow.utils.dates import days_ago
from custom_operators.hello_world import HelloWorldOperator
from datetime import datetime
default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 0,
}
dag = DAG("hello_world_test",
          description='Testing out an operator',
          start_date=days_ago(1),
          schedule_interval=None,
          catchup=False,
          default_args=default_args)
# The keyword matches the some_input parameter of HelloWorldOperator.__init__.
get_token = HelloWorldOperator(
    task_id='hello_world_task_1',
    some_input='My input to generate a token',
    dag=dag
)
token = "My token"  # Want this to be the return value from get_token
run_this = HelloWorldOperator(
    task_id='hello_world_task_2',
    some_input=token,
    dag=dag
)
logging.info("Start")
get_token >> run_this
logging.info("End")
Hello World operator:
from airflow.models.baseoperator import BaseOperator
class HelloWorldOperator(BaseOperator):
    def __init__(
            self,
            some_input: str,
            **kwargs) -> None:
        super().__init__(**kwargs)
        self.some_input = some_input
    def execute(self, context):
        # Bunch of business logic
        token = "MyGeneratedToken"
        return token
This is a good start :).
The right way to retrieve the token from another task is to use Jinja templating:
run_this = RetrieveToken(
    task_id='hello_world_task_2',
    # A single task id (not a list) makes xcom_pull return the value itself.
    retrieved_token="{{ ti.xcom_pull(task_ids='hello_world_task_1') }}",
    dag=dag
)
However, you have to remember in your RetrieveToken to add retrieved_token to template_fields array: https://airflow.apache.org/docs/apache-airflow/stable/howto/custom-operator.html#templating
You can also call xcom_pull method explicitly in your "retrieve" operator and pass the "origin" task id to the operator to retrieve it from the right task.
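A minimal sketch of that second option follows, written so it runs without Airflow. In a real DAG, RetrieveToken would subclass BaseOperator and Airflow would supply the task instance under context["ti"]; here FakeTaskInstance is a hypothetical stand-in, and origin_task_id is an assumed parameter name for the upstream task id.

```python
class FakeTaskInstance:
    """Hypothetical stand-in for the task instance found in the context."""

    def __init__(self, xcoms):
        self._xcoms = xcoms

    def xcom_pull(self, task_ids):
        # Return the value "pushed" by the given task, or None if missing.
        return self._xcoms.get(task_ids)


class RetrieveToken:
    """Sketch only; in a DAG this would subclass BaseOperator."""

    def __init__(self, task_id, origin_task_id):
        self.task_id = task_id
        self.origin_task_id = origin_task_id

    def execute(self, context):
        # Pull the return value of the upstream task explicitly.
        token = context["ti"].xcom_pull(task_ids=self.origin_task_id)
        if token is None:
            raise ValueError(
                f"No XCom value found for task {self.origin_task_id!r}"
            )
        # Placeholder for the business logic that uses the token.
        return f"used:{token}"


context = {"ti": FakeTaskInstance({"hello_world_task_1": "MyGeneratedToken"})}
result = RetrieveToken("hello_world_task_2", "hello_world_task_1").execute(context)
print(result)
```

Compared with the templated field, this keeps the pull inside the operator, so DAG authors only pass a task id. The trade-off is that the dependency on XCom is hidden from the DAG file.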

How to dynamically add bucket_key value in airflow's S3KeySensor

I'm trying to set S3KeySensor's bucket_key based on a DAG run input variable.
I have one DAG, "dag_trigger", that uses TriggerDagRunOperator to trigger a DAG run for the DAG "dag_triggered". I'm trying to extend the example https://github.com/apache/incubator-airflow/blob/master/airflow/example_dags/example_trigger_target_dag.py.
So I want to send a variable to the triggered DAG and, according to the variable's value, set the bucket_key value in the S3KeySensor task. I know how to use the sent variable in a PythonOperator callable, but I do not know how to use it on the sensor object.
dag_trigger dag:
import datetime
from airflow import DAG
from airflow.operators.dagrun_operator import TriggerDagRunOperator
default_args = {
    'owner': 'airflow',
    'start_date': datetime.datetime.now()}
dag = DAG('dag_trigger', default_args=default_args, schedule_interval="@hourly")
def task_1_run(context, dag_run_object):
    sent_variable = '2018_02_19'  # not important
    dag_run_object.payload = {'message': sent_variable}
    print("DAG dag_trigger triggered with payload: %s" % dag_run_object.payload)
    return dag_run_object
task_1 = TriggerDagRunOperator(task_id="task_1",
                               trigger_dag_id="dag_triggered",
                               provide_context=True,
                               python_callable=task_1_run,
                               dag=dag)
And dag_triggered dag:
import datetime
from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from airflow.operators.sensors import S3KeySensor
default_args = {
    'owner': 'airflow',
    'start_date': datetime.datetime.now()
}
dag = DAG('dag_triggered', default_args=default_args, schedule_interval=None)
wait_files_to_arrive_task = S3KeySensor(
    task_id='wait_file_to_arrive',
    bucket_key='file_%s' % '',  # Here I want to place conf['sent_variable']
    wildcard_match=True,
    bucket_name='test-bucket',
    s3_conn_id='test_s3_conn',
    timeout=18*60*60,
    poke_interval=120,
    dag=dag)
I tried to get the value from the dag object using dag.get_dagrun().conf['sent_variable'], but I am not sure how to set the DAG run create_date variable (dag_trigger triggers dag_triggered every hour, and dag_triggered could wait longer than that for the file).
I also tried to create a PythonOperator upstream of wait_files_to_arrive_task. The Python callable could get the information about sent_variable. After that I tried to set the value for bucket_key like bucket_key = callable_function(), but I have a problem with the arguments.
I also think that global variables are not a good solution.
Maybe someone has an idea that works.
It's not possible to fetch a value in your DAG run conf directly in your DAG file. That's something that cannot be determined without context of which DAG run it's part of. One way to think about it is when you run python my_dag.py to test if your DAG file compiles, it has to initialize all these operators without needing to specify an execution date. So anything that could differ by DAG run can't be referenced directly.
So instead, you can pass it as a template value which will later get rendered with context when the task is actually being run.
wait_files_to_arrive_task = S3KeySensor(
    task_id='wait_file_to_arrive',
    bucket_key='file_{{ dag_run.conf["message"] }}',
    ...)
Note that only parameters listed in the template_fields of an operator will be rendered. Luckily someone anticipated this so bucket_key is indeed a template field.

Airflow custom jinja2 filters

I am trying to add custom filters for my airflow jinja2 templates.
Since my folders in S3 are laid out like /year/month/day/, my goal is to use yesterday_ds in the Variables screen like this:
s3://logs.web.com/AWSLogs/{{ yesterday_ds | get_year }}/{{ yesterday_ds | get_month }}/{{ yesterday_ds | get_day }}/
I have seen in a PR (which I think is already merged) that you can do this with the 'user_defined_filters' parameter, passed through the dag_args parameter when creating the DAG object.
The problem is that even when doing it, it says 'no filter named get_year', for example.
This is my code:
dag.py
dag = DAG(
    dag_id='dag-name',
    default_args=utils.get_dag_args(user_defined_filters=utils.get_date_filters()),
    template_searchpath=tmpl_search_path,
    schedule_interval=timedelta(days=1),
    max_active_runs=1,
)
utils.py
def get_dag_args(**kwargs):
    return {
        'owner': kwargs.get('owner', 'owner_name'),
        'depends_on_past': kwargs.get('depends_on_past', False),
        'start_date': kwargs.get('start_date', datetime.now() - timedelta(1)),
        'email': kwargs.get('email', ['blabla@blabla.com']),
        'retries': kwargs.get('retries', 5),
        'provide_context': kwargs.get('provide_context', True),
        'retry_delay': kwargs.get('retry_delay', timedelta(minutes=5)),
        'user_defined_filters': get_date_filters()
    }
def get_date_filters():
    return dict(
        get_year=lambda date_string: date_string.strftime('%Y'),
        get_month=lambda date_string: date_string.strftime('%m'),
        get_day=lambda date_string: date_string.strftime('%d'),
    )
Does anybody see where I am mistaken? Thank you!
EDIT
Printing this after the dag definition, shows no custom filters, unfortunately :(.
jinja_env = dag.get_template_env()
print(jinja_env.filters)
Also, if I try to add it directly as a DAG object parameter, as shown in the tests at tests/models.py, I get:
Broken DAG: [/home/ubuntu/airflow/dags/dag.py] __init__() got an unexpected keyword argument 'user_defined_filters'
EDIT 2
OK, what I see is that I have version 1.8.0, and this one does not have the filters. Does anybody know how to download the 1.8.2rc one via pip? Or can't we?
Airflow has support for custom filters and macros now.
Working code example:
from airflow import DAG
from datetime import datetime, timedelta
def first_day_of_month(any_day):
    return any_day.replace(day=1)
def last_day_of_month(any_day):
    next_month = any_day.replace(day=28) + timedelta(days=4)  # this will never fail
    return next_month - timedelta(days=next_month.day)
def isoformat_month(any_date):
    return any_date.strftime("%Y-%m")
with DAG(
    dag_id='generate_raw_logs',
    default_args=default_args,  # assumed to be defined elsewhere
    schedule_interval=timedelta(minutes=120),
    catchup=False,
    user_defined_macros={
        'first_day_of_month': first_day_of_month,
        'last_day_of_month': last_day_of_month,
    },
    user_defined_filters={
        'isoformat_month': isoformat_month
    }
) as dag:
    pass  # define tasks here
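One detail in the question's filters is worth noting: yesterday_ds is rendered as a YYYY-MM-DD string, so calling strftime on it would fail even once user_defined_filters is available. Below is a minimal sketch of string-based filters under that assumption; parse_ds is a hypothetical helper, and the path is built with plain string formatting here only to show what the filters return.

```python
from datetime import datetime


def parse_ds(date_string):
    """Parse a YYYY-MM-DD string such as the value of yesterday_ds."""
    return datetime.strptime(date_string, "%Y-%m-%d")


def get_date_filters():
    """Build filters that accept a date string rather than a datetime."""
    return {
        "get_year": lambda ds: parse_ds(ds).strftime("%Y"),
        "get_month": lambda ds: parse_ds(ds).strftime("%m"),
        "get_day": lambda ds: parse_ds(ds).strftime("%d"),
    }


filters = get_date_filters()
ds = "2018-02-19"  # example value in the yesterday_ds format
path = "s3://logs.web.com/AWSLogs/{}/{}/{}/".format(
    filters["get_year"](ds),
    filters["get_month"](ds),
    filters["get_day"](ds),
)
print(path)
```

A dictionary like this could then be passed as user_defined_filters on the DAG itself, as in the example above, rather than inside default_args.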
The Airflow package name has been changed on pip. 1.8.2rc1 can be downloaded using pip install apache-airflow instead.
Also, note that according to the mailing list, they are currently working on releasing 1.8.2rc4 as 1.8.2.
