Airflow Variable Usage in DAG file - python

I have a short DAG where I need to get a variable stored in Airflow (Airflow -> Admin -> Variables). But when I use it as a template I get the error below.
Sample code and error shown below:
import airflow
from airflow import DAG
from airflow.models import Variable
from airflow.operators.python import PythonOperator

def display_variable():
    my_var = Variable.get("my_var")
    print('variable' + my_var)
    return my_var

def display_variable1():
    my_var = {{ var.value.my_var }}
    print('variable' + my_var)
    return my_var

dag = DAG(dag_id="variable_dag",
          start_date=airflow.utils.dates.days_ago(14),
          schedule_interval='@daily')

task = PythonOperator(task_id='display_variable', python_callable=display_variable, dag=dag)
task1 = PythonOperator(task_id='display_variable1', python_callable=display_variable1, dag=dag)

task >> task1
Getting the value of the variable this way:
Variable.get("my_var")  --> is working
But I'm getting an error with the other way:
{{ var.value.my_var }}
Error:
File "/home/airflow_home/dags/variable_dag.py", line 12, in display_variable1
my_var = {{ var.value.my_var }}
NameError: name 'var' is not defined

Both display_variable functions run plain Python code, so Variable.get() works as intended. The {{ ... }} syntax, on the other hand, is Jinja templating and only works in templated strings. Most Airflow operators support templating on some of their arguments, which can be given as "{{ expression to be evaluated at runtime }}"; look up Jinja templating for more information. Templated strings are evaluated just before a task is executed. For example:
BashOperator(
    task_id="print_now",
    bash_command="echo It is currently {{ macros.datetime.now() }}",
)
Airflow evaluates bash_command just before executing the task, so at runtime the command holds, e.g., "echo It is currently 2021-08-30 10:00:00".
However, running {{ ... }} as if it were Python code actually tries to create a nested set:
{{ variable }}
^^
|└── inner set
└─── outer set
Since sets are not hashable in Python, a set containing a set can never be constructed, so this will never evaluate even if the expression inside is valid. (Here it fails even earlier, with NameError, because the name var only exists inside Jinja templates, not in your function's Python scope.)
Additional resources:
https://www.astronomer.io/guides/templating
The template_fields attribute on each operator defines which arguments are template-able; check the docs of your operator for its value of template_fields, e.g. for PythonOperator: https://airflow.apache.org/docs/apache-airflow/stable/_api/airflow/operators/python/index.html#airflow.operators.python.PythonOperator.template_fields
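For completeness, here is a minimal sketch (not part of the original question) of how the Jinja syntax could be used with PythonOperator: pass the template through one of its templated arguments (op_kwargs in this sketch) and read the already-rendered value inside the callable.
def display_variable2(my_var):
    # my_var arrives already rendered, because op_kwargs is a templated field
    print('variable ' + my_var)
    return my_var

task2 = PythonOperator(
    task_id='display_variable2',
    python_callable=display_variable2,
    op_kwargs={'my_var': '{{ var.value.my_var }}'},
    dag=dag,
)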

Related

How to use dag_run.conf for typed arguments

I have a DAG that creates a Google Dataproc cluster and submits a job to it.
I would like to be able to customize the cluster (number of workers) and the job (arguments passed to it) through the dag_run.conf parameter.
Cluster creation
For the cluster creation, I wrote a logic with something like:
DataprocCreateClusterOperator(
    ...
    cluster_config={
        ...
        "num_workers": "{% if 'cluster' is in dag_run.conf and 'secondary_worker_config' is in dag_run.conf['cluster'] and 'num_instances' is in dag_run.conf['cluster']['secondary_worker_config'] %}{{ dag_run.conf['cluster']['secondary_worker_config']['num_instances'] }}{% else %}16{% endif %}"
    }
)
That is to say: if cluster.secondary_worker_config.num_instances is available in dag_run.conf, use it, otherwise fall back on the default value 16.
However, when rendered, this expands to a Python string such as "16", which leads to a failure because the num_workers parameter must be an int or a long.
I cannot cast it to int at operator declaration time:
num_workers = int("{% ... %}")
because that would try to interpret the whole Jinja script as an integer (and not the resulting value).
Using the | int Jinja filter doesn't solve the problem either.
Job submission
I have a similar problem for job submission.
The operator expects a job dict argument, with a field spark.args to provide arguments to the Spark job. This field must be an iterable and is expected to be a list of strings, e.g. ["--arg=foo", "bar"].
I want to be able to add some arguments by providing them through dag_run.conf:
{
    "args": ["--new_arg=baz", "bar2"]
}
But appending these arguments to the initial list doesn't seem to be possible: you either get a single argument holding all of the additional args, ["--arg=foo", "bar", "--new_arg=baz bar2"], or a single string containing all the arguments.
In any case, the resulting job submission does not work as expected.
Is there an existing way to workaround this problem?
If not, is there a way to add a "casting step" after "template rendering" one, either in the provider operators or directly in the BaseOperator abstract class?
Edit
I think that the solution proposed by Josh Fell is the way to go. However, for those that don't want to upgrade Airflow, I tried to implement the solution proposed by Jarek.
import unittest
import datetime
from typing import Any
from airflow import DAG
from airflow.models import BaseOperator, TaskInstance

# Define an operator which checks its argument type at runtime (during "execute")
class TypedOperator(BaseOperator):
    def __init__(self, int_param: int, **kwargs):
        super(TypedOperator, self).__init__(**kwargs)
        self.int_param = int_param

    def execute(self, context: Any):
        assert type(self.int_param) is int

# Extend the "typed" operator with an operator handling templating
class TemplatedOperator(TypedOperator):
    template_fields = ['templated_param']

    def __init__(self,
                 templated_param: str = "{% if 'value' is in dag_run.conf %}{{ dag_run.conf['value'] }}{% else %}16{% endif %}",
                 **kwargs):
        super(TemplatedOperator, self).__init__(int_param=int(templated_param), **kwargs)

# Run a test, instantiating a task and executing it
class JinjaTest(unittest.TestCase):
    def test_templating(self):
        print("Start test")
        dag = DAG("jinja_test_dag", default_args=dict(
            start_date=datetime.date.today().isoformat()
        ))
        print("Task instantiation (regularly done by scheduler)")
        task = TemplatedOperator(task_id="my_task", dag=dag)
        print("Done")
        print("Task execution (only done when DAG triggered)")
        context = TaskInstance(task=task, execution_date=datetime.datetime.now()).get_template_context()
        task.execute(context)
        print("Done")
        self.assertTrue(True)
Which gives the output:
Start test
Task instantiation (regularly done by scheduler)
Ran 1 test in 0.006s
FAILED (errors=1)
Error
Traceback (most recent call last):
File "/home/alexis/AdYouLike/Repositories/data-airflow-dags/tests/data_airflow_dags/utils/tasks/test_jinja.py", line 38, in test_templating
task = TemplatedOperator(task_id="my_task", dag=dag)
File "/home/alexis/AdYouLike/Repositories/data-airflow-dags/.venv/lib/python3.6/site-packages/airflow/models/baseoperator.py", line 89, in __call__
obj: BaseOperator = type.__call__(cls, *args, **kwargs)
File "/home/alexis/AdYouLike/Repositories/data-airflow-dags/tests/data_airflow_dags/utils/tasks/test_jinja.py", line 26, in __init__
super(TemplatedOperator, self).__init__(int_param=int(templated_param), **kwargs)
ValueError: invalid literal for int() with base 10: "{% if 'value' is in dag_run.conf %}{{ dag_run.conf['value'] }}{% else %}16{% endif %}"
As you can see, this fails at the task instantiation step, because in TemplatedOperator.__init__ we try to cast the Jinja template itself to int (not the rendered value).
Maybe I missed a point in this solution, but it seems to be unusable as is.
Unfortunately all Jinja templates are rendered as strings, so the solution proposed by @JarekPotiuk is your best bet.
However, for anyone using Airflow 2.1+ (or willing to upgrade), there is a new parameter that can be set at the DAG level: render_template_as_native_obj.
When this parameter is enabled, the output of Jinja templating is returned as native Python types (e.g. list, tuple, int, etc.). Learn more here: https://airflow.apache.org/docs/apache-airflow/stable/concepts/operators.html#rendering-fields-as-native-python-objects
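A minimal, hedged sketch of how that could look (not from the original answer; the dag id, task id, and conf key are illustrative). With render_template_as_native_obj=True, a template that evaluates to a number is handed to the operator as an int rather than the string "16":
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

with DAG(
    dag_id="native_rendering_example",
    start_date=datetime(2021, 6, 24),
    schedule_interval=None,
    render_template_as_native_obj=True,  # templates render to native Python types
) as dag:

    def check_type(num_workers):
        # with native rendering enabled this receives an int (e.g. 16), not "16"
        print(type(num_workers), num_workers)

    check = PythonOperator(
        task_id="check_type",
        python_callable=check_type,
        op_kwargs={
            # falls back to 16 when nothing is passed in dag_run.conf
            "num_workers": "{{ (dag_run.conf or {}).get('num_workers', 16) }}",
        },
    )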
The easiest way is to define your own custom operator deriving from DataprocCreateClusterOperator. It's super easy and you can even do it within the DAG file.
Conceptually something like this:
class MyDataprocCreateClusterOperator(DataprocCreateClusterOperator):
    template_fields = DataprocCreateClusterOperator.template_fields + ['my_param']

    def __init__(self, my_param='{{ ... }}', .....):
        super().__init__(int_param=int(my_param), ....)
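As the question's edit shows, the cast cannot live in __init__, because the template has not been rendered yet at that point. Below is a hedged sketch (not from the original answer) of the same custom-operator idea applied to the question's own TypedOperator/TemplatedOperator test, deferring the cast to execute(), where Airflow has already rendered the templated field during a real DAG run (the unit test above would additionally need to call the task instance's render_templates() before execute()); int_param=0 is just a placeholder until the cast happens.
class TemplatedOperator(TypedOperator):
    template_fields = ['templated_param']

    def __init__(self,
                 templated_param: str = "{% if 'value' is in dag_run.conf %}{{ dag_run.conf['value'] }}{% else %}16{% endif %}",
                 **kwargs):
        # keep the raw Jinja string; Airflow renders template_fields just before execute()
        super().__init__(int_param=0, **kwargs)
        self.templated_param = templated_param

    def execute(self, context):
        # the field is rendered by now, so the cast sees e.g. "16", not the template
        self.int_param = int(self.templated_param)
        return super().execute(context)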

How do I access an airflow rendered template downstream?

Airflow version: 1.10.12
I'm having trouble passing a rendered template object for use downstream. I'm trying to grab two config variables from the Airflow conf.
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from datetime import datetime

with DAG(
    dag_id="example_trigger_target_dag",
    default_args={"owner": "airflow"},
    start_date=datetime(2021, 6, 24),
    schedule_interval='@once',
    tags=['example'],
) as dag:
    bash_task = BashOperator(
        task_id="bash_task",
        bash_command='echo "{{ conf.test }}:{{ conf.tag }}"',
        xcom_push=True
    )
Basically what it pushes to XCom is just ":", without the variables present. The full rendered string does show up in the Rendered tab, however. Am I missing something?
Edit: the variables exist in the conf; I just replaced them with test and tag for security reasons.
I just tested this code (I'm using Airflow 2.1.0) and got the expected result:
BashOperator(task_id='test_task',
             bash_command='echo " VARS: {{ conf.email.email_backend '
                          '}}:{{ conf.core.executor }}"',
             do_xcom_push=True)
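To actually use the rendered value downstream (which is what the title asks about), the pushed XCom can be pulled from a later task. A minimal sketch, assuming the task above keeps the task_id test_task:
downstream = BashOperator(
    task_id="print_vars",
    # with do_xcom_push=True the last line of test_task's stdout lands in XCom;
    # this template pulls that rendered string back out
    bash_command="echo pulled: {{ ti.xcom_pull(task_ids='test_task') }}",
)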

Airflow XCOM pull is not rendering

I have a custom operator where I use xcom_pull in the argument list to get values from XCom. But it is not rendered to the actual value; it remains the literal template string.
download = CustomSparkSubmitOperator(
    task_id='download',
    spark_args=command_func(
        env, application,
        '%s/spark_args' % key,
        ['--input_file', "{{ ti.xcom_pull('set_parameters', key='input_file') }}",
         '--output_file', "{{ ti.xcom_pull('set_parameters', key='output_file') }}"]
    ),
    provide_context=True,
    dag=dag)
The operator returns the following output:
spark-submit --deploy-mode cluster ..... --input_file "{{ ti.xcom_pull('set_parameters', key='input_file') }}" --output_file "{{ ti.xcom_pull('set_parameters', key='output_file') }}"
I was able to fix the problem of XCom values not rendering when used as arguments to my CustomSparkSubmitEMROperator. Internally the operator inherits from the EMR operators, for example:
class CustomSparkSubmitEMROperator(EmrAddStepsOperator, EmrStepSensor):
So I needed to add template_fields as shown below:
template_fields = ('job_flow_id', 'steps')
After adding the above line the XCom values were properly rendered and I could see the correct values in the resulting spark-submit command.
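Putting that together, a minimal sketch of the general pattern (not the asker's actual operator; class and argument names are illustrative): any constructor argument whose Jinja should be rendered must be stored on the operator instance and listed in template_fields.
from airflow.models import BaseOperator

class CustomSparkSubmitOperator(BaseOperator):
    # every attribute named here is run through Jinja before execute()
    template_fields = ('spark_args',)

    def __init__(self, spark_args=None, **kwargs):
        super().__init__(**kwargs)
        # may contain "{{ ti.xcom_pull(...) }}" strings; rendered at runtime
        self.spark_args = spark_args or []

    def execute(self, context):
        # by the time execute() runs, self.spark_args holds the rendered values
        print("spark-submit " + " ".join(self.spark_args))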

Python Airflow - Return result from PythonOperator

I have written a DAG with multiple PythonOperators
task1 = af_op.PythonOperator(task_id='Data_Extraction_Environment',
                             provide_context=True,
                             python_callable=Task1, dag=dag1)

def Task1(**kwargs):
    return kwargs['dag_run'].conf.get('file')
From the PythonOperator I am calling the "Task1" method. That method returns a value, and that value needs to be passed to the next PythonOperator. How can I get the value from the "task1" variable, or how can I get the value returned by the Task1 method?
Updated:
def Task1(**kwargs):
    file_name = kwargs['dag_run'].conf.get('file')
    task_instance = kwargs['task_instance']
    task_instance.xcom_push(key='file', value=file_name)
    return file_name

t1 = PythonOperator(task_id='Task1', provide_context=True, python_callable=Task1, dag=dag)

t2 = BashOperator(
    task_id='Moving_bucket',
    bash_command='python /home/raw.py {{ task_instance.xcom_pull(task_ids="Task1", key="file") }} ',
    dag=dag,
)

t2.set_upstream(t1)
You might want to check out Airflow's XCOM: https://airflow.apache.org/docs/apache-airflow/stable/concepts/xcoms.html
If you return a value from a callable, that value is stored in XCom (under the default key return_value). In your case, you could access it like so from other Python code:
task_instance = kwargs['task_instance']
task_instance.xcom_pull(task_ids='Task1')
or in a template like so:
{{ task_instance.xcom_pull(task_ids='Task1') }}
If you want to specify a key you can push into XCOM (being inside a task):
task_instance = kwargs['task_instance']
task_instance.xcom_push(key='the_key', value=my_str)
Then later on you can access it like so:
task_instance.xcom_pull(task_ids='my_task', key='the_key')
EDIT 1
Follow-up question: "Instead of using the value in another function, how can I pass the value to another operator, like t2 = BashOperator(task_id='Moving_bucket', bash_command='python /home/raw.py "%s" ' % file_name, dag=dag)? I want to access file_name, which is returned by Task1. How can this be achieved?"
First of all, it seems to me that the value is, in fact, not being passed to another PythonOperator but to a BashOperator.
Secondly, this is already covered in my answer above. The bash_command field is templated (see template_fields in the source: https://github.com/apache/incubator-airflow/blob/master/airflow/operators/bash_operator.py). Hence, we can use the templated version:
BashOperator(
    task_id='Moving_bucket',
    bash_command='python /home/raw.py {{ task_instance.xcom_pull(task_ids="Task1") }} ',
    dag=dag,
)
EDIT 2
Explanation:
Airflow works like this: it will execute Task1, then populate XCom, and then execute the next task. So for your example to work, Task1 must run first and Moving_bucket must be downstream of Task1.
Since you are returning the value from the function, you can also omit key='file' from xcom_pull and skip the manual push in the function, as in the sketch below.
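A minimal sketch of that simplification (not verbatim from the original thread): the return value lands in XCom automatically, so both the explicit xcom_push and the key can be dropped.
def Task1(**kwargs):
    # the return value is pushed to XCom under the default key "return_value"
    return kwargs['dag_run'].conf.get('file')

t1 = PythonOperator(task_id='Task1', provide_context=True, python_callable=Task1, dag=dag)

t2 = BashOperator(
    task_id='Moving_bucket',
    # pulling without a key fetches Task1's return value
    bash_command='python /home/raw.py {{ task_instance.xcom_pull(task_ids="Task1") }}',
    dag=dag,
)

t2.set_upstream(t1)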

Suppress "None" output as string in Jinja2

How do I persuade Jinja2 to not print "None" when the value is None?
I have a number of entries in a dictionary and I would like to output everything in a single loop instead of having special cases for different keywords. If I have a value of None (the NoneType not the string) then the string "None" is inserted into the template rendering results.
Trying to suppress it using {{ value or '' }} works too well, as it also replaces the numeric value zero.
Do I need to filter the dictionary before passing it to Jinja2 for rendering?
In new versions of Jinja2 (2.9+):
{{ value if value }}
In older versions of Jinja2 (prior to 2.9):
{{ value if value is not none }} works great.
If this raises an error about not having an else, add one:
{{ value if value is not none else '' }}
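A quick interactive check of that expression (not from the original answer; Python 3, hence no u'' prefix): None is suppressed while the numeric zero is kept.
>>> import jinja2
>>> e = jinja2.Environment()
>>> e.from_string("{{ this if this is not none else '' }} / {{ that if that is not none else '' }}").render(this=0, that=None)
'0 / '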
Another option is to use the finalize hook on the environment:
>>> import jinja2
>>> e = jinja2.Environment()
>>> e.from_string("{{ this }} / {{ that }}").render(this=0, that=None)
u'0 / None'
but:
>>> def my_finalize(thing):
... return thing if thing is not None else ''
...
>>> e = jinja2.Environment(finalize=my_finalize)
>>> e.from_string("{{ this }} / {{ that }}").render(this=0, that=None)
u'0 / '
Default filter:
{{ value|default("", True) }}
A custom filter can solve the problem. Declare it like this:
def filter_suppress_none(val):
    if val is not None:
        return val
    else:
        return ''
Install it like this:
templating_environment.filters['sn'] = filter_suppress_none
Use it like this:
{{value|sn}}
According to this post from the Pocoo mailing list: https://groups.google.com/d/msg/pocoo-libs/SQ9ubo_Kamw/TadIdab9eN8J
Armin Ronacher (creator of Jinja2, Flask, etc.) recommends the following "pythonic" snippet:
{{ variable or 0 }} {{ variable or '' }}
The notion here being that once again, explicit is preferable to implicit.
Edit: The selected answer is definitely the correct one. I haven't really come across a situation where a template variable would be either a string or the numeric zero, so the above snippets might help reduce the code noise in the template.
