Airflow XCOM pull is not rendering - python

I have a custom operator where, in the argument list, I am using xcom_pull to get values from XCom. But the template is not rendered to the actual value; it remains the literal string.
download = CustomSparkSubmitOperator(
    task_id='download',
    spark_args=command_func(
        env, application,
        '%s/spark_args' % key,
        ['--input_file', "{{ ti.xcom_pull('set_parameters', key='input_file') }}",
         '--output_file', "{{ ti.xcom_pull('set_parameters', key='output_file') }}"]),
    provide_context=True,
    dag=dag)
The operator returns the following output:
spark-submit --deploy-mode cluster ..... --input_file "{{ ti.xcom_pull('set_parameters', key='input_file') }}" --output_file "{{ ti.xcom_pull('set_parameters', key='output_file') }}"

I was able to fix the problem of XCom values not rendering when used as arguments in my CustomSparkSubmitEMROperator. Internally the operator inherits from the EMR operators, for example:
class CustomSparkSubmitEMROperator(EmrAddStepsOperator, EmrStepSensor):
So I needed to add template_fields as shown below:
template_fields = ('job_flow_id', 'steps')
After adding the line above, the XCom values were rendered properly and the resulting spark-submit command contained the correct values.
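As a general illustration (not the poster's actual EMR operator), here is a minimal sketch of how template_fields makes a constructor argument render; the class name, the spark_args field and the submit logic are placeholders:

from airflow.models.baseoperator import BaseOperator

class MySparkSubmitOperator(BaseOperator):
    # Any attribute listed here is rendered with Jinja just before execution,
    # so "{{ ti.xcom_pull(...) }}" strings inside spark_args resolve to real values.
    template_fields = ('spark_args',)

    def __init__(self, spark_args=None, **kwargs):
        super().__init__(**kwargs)
        self.spark_args = spark_args or []

    def execute(self, context):
        # By the time execute() runs, self.spark_args holds the rendered values;
        # the actual spark-submit invocation is omitted here.
        self.log.info("spark-submit args: %s", self.spark_args)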

Related

Run a YAML SSM Run Document from AWS Lambda with Parameters

I am trying to run a YAML SSM document from a Python AWS Lambda using boto3 ssm.send_command with parameters, but even when I just try to run the sample "Hello World" document, I get:
"errorMessage": "An error occurred (InvalidParameters) when calling the SendCommand operation: document TestMessage does not support parameters."
JSON Run Documents work without an issue, so it seems the parameters are being passed in JSON format. However, the document I intend this for contains a relatively long PowerShell script; squeezing it onto a single line of JSON would be awkward, and I would like to avoid running it from an S3 bucket. Can anyone suggest a way to run a YAML Run Document with parameters from the Lambda?
As far as I know, AWS Lambda always receives its event as JSON, already deserialized into a Python dict. My suggestion would be to declare a new variable in the lambda_handler.py file like this:
import yaml

def handler_name(event, context):
    # The Lambda runtime has already deserialized the JSON event into a dict
    yaml_event = yaml.dump(event)
    # rest of the code...
This way the event will be available in YAML format and you can use that variable instead of the raw event, which arrived as JSON.
Here is an example of running a YAML Run Command document using boto3 ssm.send_command in a Lambda running Python 3.8. Variables are passed to the Lambda using either environment variables or SSM Parameter Store. The script is retrieved from S3 and accepts a single parameter formatted as a JSON string which is passed to the bash script running on Linux (sorry I don't have one for PowerShell).
The SSM Document is deployed using CloudFormation but you could also create it through the console or CLI. Based on the error message you cited, perhaps verify the Document Type is set as "Command".
SSM Document (wrapped in CloudFormation template, refer to the Content property)
Neo4jLoadQueryDocument:
  Type: AWS::SSM::Document
  Properties:
    DocumentType: "Command"
    DocumentFormat: "YAML"
    TargetType: "/AWS::EC2::Instance"
    Content:
      schemaVersion: "2.2"
      description: !Sub "Load Neo4j for ${AppName}"
      parameters:
        sourceType:
          type: "String"
          description: "S3"
          default: "S3"
        sourceInfo:
          type: "StringMap"
          description: !Sub "Downloads all files under the ${AppName} scripts prefix"
          default:
            path: !Sub 'https://{{resolve:ssm:/${AppName}/${Stage}/${AWS::Region}/DataBucketName}}.s3.amazonaws.com/config/scripts/'
        commandLine:
          type: "String"
          description: "These commands are invoked by a Lambda script which sets the correct parameters (Refer to documentation)."
          default: 'bash start_task.sh'
        workingDirectory:
          type: "String"
          description: "Working directory"
          default: "/home/ubuntu"
        executionTimeout:
          type: "String"
          description: "(Optional) The time in seconds for a command to complete before it is considered to have failed. Default is 3600 (1 hour). Maximum is 28800 (8 hours)."
          default: "86400"
      mainSteps:
        - action: "aws:downloadContent"
          name: "downloadContent"
          inputs:
            sourceType: "{{ sourceType }}"
            sourceInfo: "{{ sourceInfo }}"
            destinationPath: "{{ workingDirectory }}"
        - action: "aws:runShellScript"
          name: "runShellScript"
          inputs:
            runCommand:
              - ""
              - "directory=$(pwd)"
              - "export PATH=$PATH:$directory"
              - " {{ commandLine }} "
              - ""
            workingDirectory: "{{ workingDirectory }}"
            timeoutSeconds: "{{ executionTimeout }}"
Lambda function
import os
import json
import logging

import boto3

logger = logging.getLogger()
logger.setLevel(logging.INFO)

neo4j_load_query_document_name = os.environ["NEO4J_LOAD_QUERY_DOCUMENT_NAME"]
# neo4j_database_instance_id = os.environ["NEO4J_DATABASE_INSTANCE_ID"]
neo4j_database_instance_id_param = os.environ["NEO4J_DATABASE_INSTANCE_ID_SSM_PARAM"]
load_neo4j_activity = os.environ["LOAD_NEO4J_ACTIVITY"]
app_name = os.environ["APP_NAME"]

# Get SSM Document Neo4jLoadQuery
ssm = boto3.client('ssm')
response = ssm.get_document(Name=neo4j_load_query_document_name)
neo4j_load_query_document_content = json.loads(response["Content"])

# Get Instance ID
neo4j_database_instance_id = ssm.get_parameter(Name=neo4j_database_instance_id_param)["Parameter"]["Value"]

# Extract document parameters
neo4j_load_query_document_parameters = neo4j_load_query_document_content["parameters"]
command_line_default = neo4j_load_query_document_parameters["commandLine"]["default"]
source_info_default = neo4j_load_query_document_parameters["sourceInfo"]["default"]


def lambda_handler(event, context):
    params = {
        "params": {
            "app_name": app_name,
            "activity_arn": load_neo4j_activity,
        }
    }

    # Include params JSON as command line argument
    cmd = f"{command_line_default} '{json.dumps(params)}'"
    try:
        response = ssm.send_command(
            InstanceIds=[
                neo4j_database_instance_id,
            ],
            DocumentName=neo4j_load_query_document_name,
            Parameters={
                "commandLine": [cmd],
                "sourceInfo": [json.dumps(source_info_default)]
            },
            MaxConcurrency='1')

        if response['ResponseMetadata']['HTTPStatusCode'] != 200:
            # DatetimeEncoder is a custom JSON encoder (defined elsewhere) for datetime objects
            logger.error(json.dumps(response, cls=DatetimeEncoder))
            raise Exception("Failed to send command")
        else:
            logger.info(f"Command `{cmd}` invoked on instance {neo4j_database_instance_id}")
    except Exception as err:
        logger.error(err)
        raise err

    return
Parameters in a JSON document are not necessarily JSON themselves; they can just as easily be string or numeric values (more likely, in my opinion). If you want to pass a parameter whose value is itself JSON (not the same thing as a JSON document), pay attention to quotes and escaping.
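For instance, here is a minimal sketch of calling send_command against a YAML Command document, assuming the document declares a single String parameter named Message (the document name, instance id and parameter name below are examples only):

import boto3

ssm = boto3.client("ssm")

# Parameters is always a dict of parameter name -> list of string values,
# regardless of whether the document itself is authored in JSON or YAML.
response = ssm.send_command(
    InstanceIds=["i-0123456789abcdef0"],      # example instance id
    DocumentName="TestMessage",               # YAML document declaring a "Message" parameter
    Parameters={"Message": ["Hello World"]},
)
command_id = response["Command"]["CommandId"]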

Airflow Variable Usage in DAG file

I have a short DAG where I need to get a variable stored in Airflow (Airflow -> Admin -> Variables). But when I use it as a template, I get the error below.
Sample code and error shown below:
import airflow.utils.dates
from airflow import DAG
from airflow.models import Variable
from airflow.operators.python import PythonOperator

def display_variable():
    my_var = Variable.get("my_var")
    print('variable' + my_var)
    return my_var

def display_variable1():
    my_var = {{ var.value.my_var }}  # this is the line that fails
    print('variable' + my_var)
    return my_var

dag = DAG(dag_id="variable_dag", start_date=airflow.utils.dates.days_ago(14),
          schedule_interval='@daily')

task = PythonOperator(task_id='display_variable', python_callable=display_variable, dag=dag)
task1 = PythonOperator(task_id='display_variable1', python_callable=display_variable1, dag=dag)
task >> task1
Getting the value of the variable using:
Variable.get("my_var") --> is working
But, I'm getting an error using the other way:
{{ var.value.my_var }}
Error:
File "/home/airflow_home/dags/variable_dag.py", line 12, in display_variable1
my_var = {{ var.value.my_var }}
NameError: name 'var' is not defined
Both display_variable functions run Python code, so Variable.get() works as intended. The {{ ... }} syntax is used for templated strings. Some arguments of most Airflow operators support templated strings, which can be given as "{{ expression to be evaluated at runtime }}". Look up Jinja templating for more information. Before a task is executed, templated strings are evaluated. For example:
BashOperator(
    task_id="print_now",
    bash_command="echo It is currently {{ macros.datetime.now() }}",
)
Airflow evaluates the bash_command just before executing it, so at runtime the bash_command holds something like "echo It is currently 2021-06-30 10:15:42".
However, running {{ ... }} as if it were Python code would actually try to create a nested set:
{{ variable }}
^^
|└── inner set
└── outer set
Since sets are not hashable in Python, this will never evaluate, even if the statement inside is valid.
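If you do want to use the {{ var.value.my_var }} syntax with a PythonOperator, pass it through one of the operator's templated fields (op_args, op_kwargs and templates_dict are templated); a minimal sketch, reusing the dag from the question:

def display_variable2(my_var):
    print('variable ' + my_var)
    return my_var

task2 = PythonOperator(
    task_id='display_variable2',
    python_callable=display_variable2,
    # op_args is a templated field, so the string is rendered before the callable runs
    op_args=["{{ var.value.my_var }}"],
    dag=dag,
)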
Additional resources:
https://www.astronomer.io/guides/templating
The template_fields attribute on each operator defines which attributes are template-able; see the docs for your operator for the value of template_fields: https://airflow.apache.org/docs/apache-airflow/stable/_api/airflow/operators/python/index.html#airflow.operators.python.PythonOperator.template_fields

Airflow: How can I grab the dag_run.conf value in an ECSOperator

I have a process that uses Airflow to execute Docker containers on AWS Fargate. The Docker containers are just running ETLs written in Python. In some of my Python scripts I want to allow team members to pass commands, and I think dag_run.conf will be a good way to accomplish this. I was wondering if there is a way to append the values from dag_run.conf to the command key in the ECSOperator's overrides clause. My overrides clause looks something like this:
"containerOverrides": [
{
"name": container_name,
"command": c.split(" ")
},
],```
Pass a JSON payload to dag_run.conf with a key overrides; that value is passed into the EcsOperator, which in turn passes it to the underlying boto3 client (during the run_task operation).
To override container commands, add the key containerOverrides (to the overrides dict) whose value is a list of dictionaries. Note: you must reference the specific container name.
An example input:
{
  "overrides": {
    "containerOverrides": [
      {
        "name": "my-container-name",
        "command": ["echo", "hello world"]
      }
    ]
  }
}
Notes:
Be sure to reference the exact container name
Command should be a list of strings.
I had a very similar problem and here's what I found:
You cannot pass a command as string and then do .split(" "). This is due to the fact that Airflow templating does not happen when the DAG is parsed. Instead, the literal {{ dag_run.conf['command']}} (or, in my formulation, {{ params.my_command }}) is passed to the EcsOperator and only evaluated just before the task is run. So we need to keep the definition (yes, as string) "{{ params.my_command }}" in the code and pass it through.
By default, all parameters for a DAG are passed as string types, but they don't have to be! After playing around with jsonschema a bit, I found that you can express "list of strings" as a parameter type like this: Param(type="array", items={"type": "string"}).
The above only ensures that the input can be a list of strings, but you also need to receive it as a list of strings. That functionality is simply switched on by setting render_template_as_native_obj=True.
All put together, you get something like this for your DAG:
from airflow.decorators import dag
from airflow.models.param import Param
from airflow.providers.amazon.aws.operators.ecs import EcsOperator
from airflow.utils.dates import days_ago

@dag(
    default_args={"owner": "airflow"},
    start_date=days_ago(2),
    schedule_interval=None,
    params={"my_command": Param(type="array", items={"type": "string"}, default=[])},
    render_template_as_native_obj=True,
)
def my_command():
    """run a command manually"""
    command = "{{ params.my_command }}"  # rendered to a real list of strings at runtime
    EcsOperator(
        task_id="my_command",
        overrides={
            "containerOverrides": [
                {"name": "my-container-name", "command": command}
            ]
        },
        # ... other required EcsOperator arguments (cluster, task_definition, launch_type, ...)
    )

dag = my_command()
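As a usage note, the DAG can then be triggered with a conf that supplies the list of strings, for example from another DAG; a minimal sketch where the conf values are hypothetical and "my_command" must match the Param name above:

from airflow.operators.trigger_dagrun import TriggerDagRunOperator

# Hypothetical trigger task living in another DAG
trigger_etl = TriggerDagRunOperator(
    task_id="trigger_my_command",
    trigger_dag_id="my_command",
    conf={"my_command": ["python", "etl.py", "--date", "2021-06-24"]},
)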

How do I access an airflow rendered template downstream?

Airflow version: 1.10.12
I'm having trouble passing a rendered template object for use downstream. I'm trying to grab two config variables from the Airflow conf.
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from datetime import datetime

with DAG(
    dag_id="example_trigger_target_dag",
    default_args={"owner": "airflow"},
    start_date=datetime(2021, 6, 24),
    schedule_interval='@once',
    tags=['example'],
) as dag:
    bash_task = BashOperator(
        task_id="bash_task",
        bash_command='echo "{{ conf.test }}:{{ conf.tag }}"',
        xcom_push=True
    )
Basically, what gets pushed to XCom is just ":" without the variables present. However, the full rendered string does show up in the Rendered tab. Am I missing something?
Edit: the variables exist in the conf, I just replaced them with test and tag for security reasons.
I just tested this code (I'm using Airflow 2.1.0) and got that result:
BashOperator(task_id='test_task',
             bash_command='echo " VARS: {{ conf.email.email_backend '
                          '}}:{{ conf.core.executor }}"',
             do_xcom_push=True)
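To actually consume the rendered value downstream, pull it from XCom in the next task. A minimal sketch using the bash_task from the question, inside the same with DAG(...) block (the BashOperator pushes its last line of stdout under the default return_value key):

    use_rendered = BashOperator(
        task_id="use_rendered_value",
        # xcom_pull without a key returns what bash_task pushed (its last line of stdout)
        bash_command="echo upstream said: {{ ti.xcom_pull(task_ids='bash_task') }}",
    )

    bash_task >> use_rendered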

Python Airflow - Return result from PythonOperator

I have written a DAG with multiple PythonOperators
task1 = af_op.PythonOperator(task_id='Data_Extraction_Environment',
                             provide_context=True,
                             python_callable=Task1, dag=dag1)

def Task1(**kwargs):
    return(kwargs['dag_run'].conf.get('file'))
From the PythonOperator I am calling the "Task1" method. That method returns a value, and I need to pass that value to the next PythonOperator. How can I get the value from the "task1" variable, or how can I get the value returned by the Task1 method?
Updated:
def Task1(**kwargs):
    file_name = kwargs['dag_run'].conf.get('file')
    task_instance = kwargs['task_instance']
    task_instance.xcom_push(key='file', value=file_name)
    return file_name

t1 = PythonOperator(task_id='Task1', provide_context=True, python_callable=Task1, dag=dag)

t2 = BashOperator(
    task_id='Moving_bucket',
    bash_command="python /home/raw.py {{ task_instance.xcom_pull(task_ids='Task1', key='file') }} ",
    dag=dag,
)

t2.set_upstream(t1)
You might want to check out Airflow's XCOM: https://airflow.apache.org/docs/apache-airflow/stable/concepts/xcoms.html
If you return a value from a function, this value is stored in xcom. In your case, you could access it like so from other Python code:
task_instance = kwargs['task_instance']
task_instance.xcom_pull(task_ids='Task1')
or in a template like so:
{{ task_instance.xcom_pull(task_ids='Task1') }}
If you want to specify a key you can push into XCOM (being inside a task):
task_instance = kwargs['task_instance']
task_instance.xcom_push(key='the_key', value=my_str)
Then later on you can access it like so:
task_instance.xcom_pull(task_ids='my_task', key='the_key')
EDIT 1
Follow-up question: Instead of using the value in another function, how can I pass the value to another operator, like t2 = BashOperator(task_id='Moving_bucket', bash_command='python /home/raw.py "%s" ' % file_name, dag=dag)? I want to access file_name, which is returned by "Task1". How can this be achieved?
First of all, it seems to me that the value is, in fact, not being passed to another PythonOperator but to a BashOperator.
Secondly, this is already covered in my answer above. The field bash_command is templated (see template_fields in the source: https://github.com/apache/incubator-airflow/blob/master/airflow/operators/bash_operator.py). Hence, we can use the templated version:
BashOperator(
    task_id='Moving_bucket',
    bash_command="python /home/raw.py {{ task_instance.xcom_pull(task_ids='Task1') }} ",
    dag=dag,
)
EDIT 2
Explanation:
Airflow works like this: It will execute Task1, then populate xcom and then execute the next task. So for your example to work you need Task1 executed first and then execute Moving_bucket downstream of Task1.
Since you are returning the value from the function, you could also omit key='file' from xcom_pull and skip the manual xcom_push inside the function.
