I have written a DAG with multiple PythonOperators:

task1 = af_op.PythonOperator(task_id='Data_Extraction_Environment',
                             provide_context=True,
                             python_callable=Task1, dag=dag1)

def Task1(**kwargs):
    return kwargs['dag_run'].conf.get('file')

The PythonOperator calls the Task1 method. That method returns a value, and I need to pass that value to the next PythonOperator. How can I get the value from the task1 variable, or how can I get the value returned by the Task1 method?
Updated:

def Task1(**kwargs):
    file_name = kwargs['dag_run'].conf.get('file')
    task_instance = kwargs['task_instance']
    task_instance.xcom_push(key='file', value=file_name)
    return file_name

t1 = PythonOperator(task_id='Task1', provide_context=True, python_callable=Task1, dag=dag)

t2 = BashOperator(
    task_id='Moving_bucket',
    bash_command='python /home/raw.py {{ task_instance.xcom_pull(task_ids="Task1", key="file") }} ',
    dag=dag,
)

t2.set_upstream(t1)
You might want to check out Airflow's XCOM: https://airflow.apache.org/docs/apache-airflow/stable/concepts/xcoms.html
If you return a value from a function, this value is stored in xcom. In your case, you could access it like so from other Python code:
task_instance = kwargs['task_instance']
task_instance.xcom_pull(task_ids='Task1')
or in a template like so:
{{ task_instance.xcom_pull(task_ids='Task1') }}
If you want to specify a key you can push into XCOM (being inside a task):
task_instance = kwargs['task_instance']
task_instance.xcom_push(key='the_key', value=my_str)
Then later on you can access it like so:
task_instance.xcom_pull(task_ids='my_task', key='the_key')
EDIT 1
Follow-up question: Instead of using the value in another function, how can I pass the value to another operator, e.g. t2 = BashOperator(task_id='Moving_bucket', bash_command='python /home/raw.py "%s" ' % file_name, dag=dag)? I want to access the file_name returned by Task1. How can this be achieved?
First of all, it seems to me that the value is, in fact, not being passed to another PythonOperator but to a BashOperator.
Secondly, this is already covered in my answer above. The field bash_command is templated (see template_fields in the source: https://github.com/apache/incubator-airflow/blob/master/airflow/operators/bash_operator.py). Hence, we can use the templated version:
BashOperator(
    task_id='Moving_bucket',
    bash_command='python /home/raw.py {{ task_instance.xcom_pull(task_ids="Task1") }} ',
    dag=dag,
)
EDIT 2
Explanation:
Airflow works like this: it executes Task1, populates XCom, and then executes the next task. So for your example to work, Task1 must run first and Moving_bucket must be downstream of Task1.
Since your callable returns the value, you could also omit key='file' from xcom_pull and skip the manual xcom_push in the function (a returned value is stored under the default key return_value).
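For reference, here is a minimal, self-contained sketch of that pattern, assuming Airflow 2.x import paths; the dag_id and the callable name are placeholders I chose, while the script path comes from the question:

from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.operators.bash import BashOperator

def task1_callable(**kwargs):
    # The returned value is stored in XCom under the key 'return_value'.
    return kwargs['dag_run'].conf.get('file')

with DAG(dag_id='xcom_example', start_date=datetime(2022, 1, 1),
         schedule_interval=None, catchup=False) as dag:
    t1 = PythonOperator(task_id='Task1', python_callable=task1_callable)
    t2 = BashOperator(
        task_id='Moving_bucket',
        # Pulls the value returned by Task1 when the command is rendered.
        bash_command='python /home/raw.py {{ task_instance.xcom_pull(task_ids="Task1") }}',
    )
    t1 >> t2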
Related
I'm using a parameter that is a timestamp in a set of tasks:

default_dag_args = {'arg1': 'arg1-value',
                    'arg2': 'arg2-value',
                    'now': datetime.now()}

I would like the now parameter to have the same value for all the tasks, but what happens is that it is re-evaluated for each function.
Is there a way of doing it (evaluating it once and using the same value throughout the DAG)? I'm using the TaskFlow API for Airflow 2.0:

@task
def python_task():
    context = get_current_context()
    context_dag = context['dag']
    now = context_dag.default_args['now']
    print(now)
I tried to set a time constant at the start of the DAG file, like:

TIME = datetime.now()

and got the context inside the tasks with get_current_context(), just like you did.
Sadly, because the DAG file is re-parsed from the start every time a task runs, the time was recalculated each time a task got defined in the script.
One idea I have is to use XComs in order to save the datetime in one task and pull it into the other tasks:
My sample code is below, I think you'll get the idea.
from airflow.decorators import task, dag
from datetime import datetime
import time

default_arguments = {
    'owner': 'admin',
    # This is the beginning, for more see: https://airflow.apache.org/faq.html#what-s-the-deal-with-start-date
    'start_date': datetime(2022, 5, 2)
}

@dag(
    schedule_interval=None,
    dag_id="Time_Example_Dag",
    default_args=default_arguments,
    catchup=False,
)
def the_global_time_checker_dag():

    @task
    def time_set():
        # To use XCom to pass the value between tasks,
        # we have to parse the datetime to a string.
        now = str(datetime.now())
        return now

    @task
    def starting_task(datetime_string):
        important_number = 23
        # We can use this datetime object in whatever way we like.
        date_time_obj = datetime.strptime(datetime_string, '%Y-%m-%d %H:%M:%S.%f')
        print(date_time_obj)
        return important_number

    @task
    def important_task(datetime_string, number):
        # Passing some time
        time.sleep(10)
        # Again, we are free to do whatever we want with this object.
        date_time_obj = datetime.strptime(datetime_string, '%Y-%m-%d %H:%M:%S.%f')
        print(date_time_obj)
        print("The important number is: {}".format(number))

    time_right_now = time_set()
    start = starting_task(datetime_string=time_right_now)
    important = important_task(datetime_string=time_right_now, number=start)

time_checker = the_global_time_checker_dag()
Through the logs, you can see all the datetime values are the same.
For more information about XCOM in Taskflow API, you can check here.
When a worker gets a task instance to run, it rebuilds the whole DagBag from the Python files to get the DAG and task definition. So every time a task instance is run, your DAG file is sourced, rerunning your DAG definition code. The resulting DAG object is the one that the particular task instance will be defined by.
It's critical to understand that the DAG definition is not simply built once for every execution date and then persisted/reused for all TIs within that DagRun. The DAG definition is constantly being recomputed from your Python code; each TI is run in a separate process, independently and without state from other tasks. Thus, if your DAG definition includes non-deterministic results at DagBag build time, such as datetime.now(), every instantiation of your DAG, even for the same execution date, will have different values. You need to build your DAGs in a deterministic and idempotent manner.
The only way to share non-deterministic results is to store them in the DB and have your tasks fetch them, as @sezai-burak-kantarcı has pointed out. Best practice is to use task context-specific variables, like {{ ds }}, {{ execution_date }}, {{ data_interval_start }}, etc. These are the same for all tasks within a DAG run. You can see the template variables available here: Airflow templates reference
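As a rough sketch of that best practice (assuming Airflow 2.2+ with the TaskFlow API, where data_interval_start is part of the task context; the DAG and task names below are mine), every task can read the same run-scoped timestamp from the context instead of calling datetime.now():

from datetime import datetime
from airflow.decorators import dag, task
from airflow.operators.python import get_current_context

@dag(schedule_interval=None, start_date=datetime(2022, 5, 2), catchup=False)
def context_time_example():

    @task
    def first_task():
        # data_interval_start is fixed per DAG run, so it is identical in every task.
        context = get_current_context()
        print(context['data_interval_start'])

    @task
    def second_task():
        context = get_current_context()
        print(context['data_interval_start'])  # same value as in first_task

    first_task() >> second_task()

context_time_example_dag = context_time_example()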
Good evening,
In Airflow I have a task group (tg1) that loops through a list and dynamically calls a Python method which then generates a series of tasks. The issue I am having is that I need access to XComs inside the Python method, and I keep seeing the error:
KeyError: 'task_instance' or KeyError: 'ti'. I have tried both keys to be sure.
Task Group Code:
...
for partitions in partition_list:
    t1 = PythonOperator(
        task_id='Refresh_Wrapper_{0}'.format(iteration),
        python_callable=refresh_task(task_id, partitions, dag, iteration),
        provide_context=True,
        dag=dag
    )
...
Python Method Code:
def refresh_task(task_group, data, dag, iteration, **context):
    foo = "baz{0}".format(str(iteration))
    bar = "Bar " + context['task_instance'].xcom_pull(task_ids=task_group.foo, key='return_value')
.....
In the PythonOperator you should pass only the name of the function to python_callable.
If you need to pass params to the function, use op_kwargs, like the following:
t1 = PythonOperator(
    task_id='Refresh_Wrapper_{0}'.format(iteration),
    python_callable=refresh_task,
    op_kwargs={"task_id": task_id, "partitions": partitions, "dag": dag, "iteration": iteration},
    provide_context=True,
    dag=dag
)
Also, you can get the dag from the context (context["dag_run"]).
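A rough sketch of the callable side under this fix (the parameter names mirror the op_kwargs above, while the pulled task id 'some_upstream_task' is a placeholder, not the asker's actual task): the op_kwargs values arrive as named parameters, and the task context, including task_instance, is only injected when the operator actually executes:

def refresh_task(task_id, partitions, dag, iteration, **context):
    # context['task_instance'] is only populated at execution time,
    # which is why calling refresh_task(...) at DAG-parse time raised KeyError.
    ti = context['task_instance']
    foo = "baz{0}".format(iteration)
    # 'some_upstream_task' stands in for whichever task pushed the value.
    bar = "Bar " + ti.xcom_pull(task_ids='some_upstream_task', key='return_value')
    print(foo, bar)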
I have a short DAG where I need to get a variable stored in Airflow (Airflow -> Admin -> Variables). But when I use it as a template I get the error below.
Sample code and error are shown below:
from airflow import DAG
from airflow.models import Variable
from airflow.operators.python import PythonOperator
import airflow.utils.dates

def display_variable():
    my_var = Variable.get("my_var")
    print('variable' + my_var)
    return my_var

def display_variable1():
    my_var = {{ var.value.my_var }}
    print('variable' + my_var)
    return my_var

dag = DAG(dag_id="variable_dag", start_date=airflow.utils.dates.days_ago(14),
          schedule_interval='@daily')

task = PythonOperator(task_id='display_variable', python_callable=display_variable, dag=dag)
task1 = PythonOperator(task_id='display_variable1', python_callable=display_variable1, dag=dag)

task >> task1
Getting the value of the variable using Variable.get("my_var") works.
But I'm getting an error using the other way:
{{ var.value.my_var }}
Error:
File "/home/airflow_home/dags/variable_dag.py", line 12, in display_variable1
my_var = {{ var.value.my_var }}
NameError: name 'var' is not defined
Both display_variable functions run Python code, so Variable.get() works as intended. The {{ ... }} syntax is used for templated strings. Some arguments of most Airflow operators support templated strings, which can be given as "{{ expression to be evaluated at runtime }}". Look up Jinja templating for more information. Before a task is executed, templated strings are evaluated. For example:
BashOperator(
    task_id="print_now",
    bash_command="echo It is currently {{ macros.datetime.now() }}",
)
Airflow evaluates the bash_command just before executing it, and as a result the bash_command will hold the rendered string, e.g. "echo It is currently 2022-06-01 10:00:00.000000".
However, running {{ ... }} as if it were Python code would actually try to create a nested set:
{{ variable }}
^^
|└── inner set
|
outer set
Since sets are not hashable in Python, this will never evaluate, even if the statement inside is valid.
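As a quick illustration (with a made-up variable name), even when the inner name is defined, the expression still fails, because the inner set cannot be stored inside the outer one:

x = "some value"
nested = {{ x }}  # raises TypeError: unhashable type: 'set'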
Additional resources:
https://www.astronomer.io/guides/templating
The template_fields attribute on each operator defines which attributes are template-able, see docs for your operator to see the value of template_fields: https://airflow.apache.org/docs/apache-airflow/stable/_api/airflow/operators/python/index.html#airflow.operators.python.PythonOperator.template_fields
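For example, since op_kwargs is one of PythonOperator's template_fields, one way to get a rendered variable into the callable is a sketch like this (my_var is the variable name from the question; the task and function names are illustrative):

def display_variable2(my_var):
    # my_var arrives already rendered from the Jinja template below.
    print('variable ' + my_var)
    return my_var

task2 = PythonOperator(
    task_id='display_variable2',
    python_callable=display_variable2,
    op_kwargs={'my_var': '{{ var.value.my_var }}'},
    dag=dag,
)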
Hi all, I have a function:
def get_campaign_active(ds, **kwargs):
    logging.info('Checking for inactive campaign types..')
    the_db = ds['client']
    db = the_db['misc-server']
    collection = db.campaigntypes
    campaign = list(collection.find({}))
    for item in campaign:
        if item['active'] == False:
            # storing false 'active' campaigns
            result = "'{}' active status set to False".format(item['text'])
            logging.info("'{}' active status set to False".format(item['text']))
mapped to an Airflow task:

get_campaign_active = PythonOperator(
    task_id='get_campaign_active',
    provide_context=True,
    python_callable=get_campaign_active,
    xcom_push=True,
    op_kwargs={'client': client_production},
    dag=dag)
As you can see, I pass the client_production variable into op_kwargs for the task. The hope is that this variable is passed in through the **kwargs parameter of the function when the task runs in Airflow.
However, for testing, when I try to call the function like so:
get_campaign_active({"client": client_production})
the client_production variable is found inside the ds parameter. I don't have a staging server for Airflow to test this out, but could someone tell me: if I deploy this function/task to Airflow, will it read the client_production variable from ds or from kwargs?
Right now, if I try to access the 'client' key in kwargs, kwargs is empty.
Thanks
You should do:
def get_campaign_active(ds, **kwargs):
    logging.info('Checking for inactive campaign types..')
    the_db = kwargs['client']
The ds and all other macros are passed into kwargs because you set provide_context=True; you can either use named params like you did, or let ds be passed into kwargs as well.
Since in your code you don't actually use ds nor any other macros, you can change your function signature to get_campaign_active(**kwargs) and remove provide_context=True. Note that from Airflow >= 2.0, provide_context=True is not needed at all.
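Put together, a minimal sketch of that suggestion (assuming Airflow 2.x, where op_kwargs entries arrive as regular keyword arguments; client_production and dag are assumed to be defined elsewhere in the DAG file, as in the question):

import logging
from airflow.operators.python import PythonOperator

def get_campaign_active(**kwargs):
    logging.info('Checking for inactive campaign types..')
    # 'client' comes from op_kwargs, not from the task context.
    the_db = kwargs['client']
    db = the_db['misc-server']
    collection = db.campaigntypes
    for item in collection.find({}):
        if item['active'] is False:
            logging.info("'{}' active status set to False".format(item['text']))

get_campaign_active_task = PythonOperator(
    task_id='get_campaign_active',
    python_callable=get_campaign_active,
    op_kwargs={'client': client_production},
    dag=dag,
)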
I am using DataFlowJavaOperator() in Airflow (Cloud Composer). Is there any way to get the ID of the executed Dataflow job in the next PythonOperator task? I would like to use the job_id to call a gcloud command to get the job result.
def check_dataflow(ds, **kwargs):
    # here I want to execute a gcloud command with the job ID to get the job result:
    # gcloud dataflow jobs describe <JOB_ID>
t1 = DataFlowJavaOperator(
    task_id='task1',
    jar='gs://path/to/jar/abc.jar',
    options={
        'stagingLocation': "gs://stgLocation/",
        'tempLocation': "gs://tmpLocation/",
    },
    provide_context=True,
    dag=dag,
)

t2 = PythonOperator(
    task_id='task2',
    python_callable=check_dataflow,
    provide_context=True,
    dag=dag,
)

t1 >> t2
As it appears, the job_name option in the DataFlowJavaOperator gets overridden by the task_id. The job name will have the task id as a prefix and a random ID appended as a suffix. If you still want a Dataflow job name that is actually different from the task ID, you can hard-code it in the Dataflow Java code:

options.setJobName("jobNameInCode")

Then, using the PythonOperator, you can retrieve the job ID from the prefix (either the job name provided in code or otherwise the Composer task id) as I explained here. Briefly, list jobs with:
result = dataflow.projects().locations().jobs().list(
    projectId=project,
    location=location,
).execute()
and then filter by prefix where job_prefix is the job_name defined when launching the job:
for job in result['jobs']:
    if re.findall(r'' + re.escape(job_prefix) + '', job['name']):
        job_id = job['id']
        break
The break statement is there to ensure we only get the latest job with that name, which should be the one just launched.
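Putting those pieces together, a sketch of the check_dataflow callable could look like this; the project, region, job prefix, and the client construction via googleapiclient.discovery.build are assumptions added for illustration, not code from the original answer:

import re
from googleapiclient.discovery import build

def check_dataflow(ds, **kwargs):
    # Assumed values; in Composer these would typically come from the environment or Variables.
    project = 'my-gcp-project'
    location = 'us-central1'
    job_prefix = 'jobNameInCode'  # or the task_id used by DataFlowJavaOperator

    # Build a Dataflow API client (v1b3 is the Dataflow REST API version).
    dataflow = build('dataflow', 'v1b3')
    result = dataflow.projects().locations().jobs().list(
        projectId=project,
        location=location,
    ).execute()

    job_id = None
    for job in result['jobs']:
        if re.findall(r'' + re.escape(job_prefix) + '', job['name']):
            job_id = job['id']
            # Break after the first match, which per the answer above should be
            # the most recently launched job with that name.
            break

    print("Found Dataflow job id: {}".format(job_id))
    # From here you could shell out to:
    # gcloud dataflow jobs describe <JOB_ID> --region <location>
    return job_id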