Good evening,
In Airflow I have a task_group (tg1) that loops through a list and dynamically calls a Python method which then generates a series of tasks. The issue I am having is that I need access to XComs inside the Python method, and I keep seeing the error:
KeyError: 'task_instance' or KeyError: 'ti'. I have called it both ways to be sure.
Task Group Code:
...
for partitions in partition_list:
    t1 = PythonOperator(
        task_id='Refresh_Wrapper_{0}'.format(iteration),
        python_callable=refresh_task(task_id, partitions, dag, iteration),
        provide_context=True,
        dag=dag
    )
...
Python Method Code:
def refresh_task(task_group, data, dag, iteration, **context):
    foo = "baz{0}".format(str(iteration))
    bar = "Bar " + context['task_instance'].xcom_pull(task_ids=task_group.foo, key='return_value')
.....
In the PythonOperator you should pass only the name of the function (a reference to it), not the result of calling it.
If you need to pass params to the function, use op_kwargs, like the following:
t1 = PythonOperator(
    task_id='Refresh_Wrapper_{0}'.format(iteration),
    python_callable=refresh_task,
    op_kwargs={"task_id": task_id, "partitions": partitions, "dag": dag, "iteration": iteration},
    provide_context=True,
    dag=dag
)
Also, you can get the dag from context["dag_run"], so you do not need to pass it in at all.
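For example, a rough sketch of how the callable could look once the params arrive through op_kwargs and the context is injected at runtime (the upstream task id 'some_upstream_task' is just a placeholder for whichever task actually pushed the XCom):

def refresh_task(task_id, partitions, dag, iteration, **context):
    foo = "baz{0}".format(str(iteration))
    # 'task_instance' (or 'ti') only exists here because Airflow itself calls
    # the function and injects the context when the task runs
    bar = "Bar " + context['task_instance'].xcom_pull(
        task_ids='some_upstream_task',  # placeholder: use your real upstream task_id
        key='return_value')
    ...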
Hi all, I have a function:
def get_campaign_active(ds, **kwargs):
    logging.info('Checking for inactive campaign types..')
    the_db = ds['client']
    db = the_db['misc-server']
    collection = db.campaigntypes
    campaign = list(collection.find({}))
    for item in campaign:
        if item['active'] == False:
            # storing false 'active' campaigns
            result = "'{}' active status set to False".format(item['text'])
            logging.info("'{}' active status set to False".format(item['text']))
mapped to an airflow task
get_campaign_active = PythonOperator(
    task_id='get_campaign_active',
    provide_context=True,
    python_callable=get_campaign_active,
    xcom_push=True,
    op_kwargs={'client': client_production},
    dag=dag)
As you can see, I pass the client_production variable into op_kwargs for the task. The hope is that this variable gets passed in through the **kwargs parameter of the function when the task runs in Airflow.
However, for testing, when I try to call the function like so:
get_campaign_active({"client":client_production})
the client_production variable is found inside the ds parameter. I don't have a staging server for Airflow to test this out, but could someone tell me: if I deploy this function/task to Airflow, will it read the client_production variable from ds or from kwargs?
Right now, if I try to access the 'client' key in kwargs, kwargs is empty.
Thanks
You should do:
def get_campaign_active(ds, **kwargs):
    logging.info('Checking for inactive campaign types..')
    the_db = kwargs['client']
The ds (and all the other macros) are passed into kwargs because you set provide_context=True; you can either use named parameters like you did with ds, or let them all be collected in kwargs.
Since your code doesn't actually use ds or any other macro, you can change the function signature to get_campaign_active(**kwargs) and remove provide_context=True. Note that from Airflow>=2.0, provide_context=True is not needed at all.
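A minimal sketch of that simplified version (assuming Airflow >= 2.0, where the context is injected automatically and op_kwargs entries land directly in kwargs; imports and DAG definition omitted as in the snippets above):

def get_campaign_active(**kwargs):
    logging.info('Checking for inactive campaign types..')
    the_db = kwargs['client']  # comes straight from op_kwargs
    db = the_db['misc-server']
    ...

get_campaign_active_task = PythonOperator(
    task_id='get_campaign_active',
    python_callable=get_campaign_active,
    op_kwargs={'client': client_production},
    dag=dag)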
Previously, I was using the python_callable parameter of the TriggerDagRunOperator to dynamically alter the dag_run_obj payload that is passed to the newly triggered DAG.
Since its removal in Airflow 2.0.0 (Pull Req: https://github.com/apache/airflow/pull/6317), is there a way to do this, without creating a custom TriggerDagRunOperator?
For context, here is the flow of my code:
# Poll Amazon S3 bucket for new received files
fileSensor_tsk = S3KeySensor()

# Use chooseDAGBasedOnInput function to create dag_run object (previously python_callable was used directly in TriggerDagRunOperator to create the dag_run object for the new triggered DAG)
# dag_run object will pass received file name details to new DAG for reference in order to complete its own work
chooseDAGTrigger_tsk = BranchPythonOperator(
    task_id='chooseDAGTrigger_tsk',
    python_callable=chooseDAGBasedOnInput,
    provide_context=True
)

triggerNewDAG_tsk = TriggerDagRunOperator(
    task_id='triggerNewDAG_tsk',
    trigger_dag_id='1000_NEW_LOAD'
)

triggerNewDAG2_tsk = TriggerDagRunOperator(
    task_id='triggerNew2DAG_tsk',
    trigger_dag_id='1000_NEW2_LOAD'
) ...
Any help or commentary would be greatly appreciated!
EDIT - adding previously used python_callable function used in TriggerDagRunOperator:
def intakeFile(context, dag_run_obj):
    # read from S3, get filename and pass to triggered DAG
    bucket_name = os.environ.get('bucket_name')
    s3_hook = S3Hook(aws_conn_id='aws_default')
    s3_hook.copy_object()
    s3_hook.delete_objects()
    ...
    dag_run_obj.payload = {
        'filePath': workingPath,
        'source': source,
        'fileName': fileName
    }
    return dag_run_obj
The TriggerDagRunOperator now takes a conf parameter to which a dictionary can be provided as the conf object for the DagRun. The Airflow documentation on triggering DAGs has more information which you may find helpful as well.
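For example, a minimal sketch of passing the conf dictionary directly (the keys simply mirror the payload from your previous callable and are otherwise arbitrary):

triggerNewDAG_tsk = TriggerDagRunOperator(
    task_id='triggerNewDAG_tsk',
    trigger_dag_id='1000_NEW_LOAD',
    # the triggered DAG can read these values from dag_run.conf
    conf={'filePath': workingPath, 'source': source, 'fileName': fileName},
)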
EDIT
Since you need to execute a function to determine which DAG to trigger and do not want to create a custom TriggerDagRunOperator, you could execute intakeFile() in a PythonOperator (or use the @task decorator with the TaskFlow API) and use the return value as the conf argument in the TriggerDagRunOperator. As of Airflow 2.0, return values are automatically pushed to XCom in many operators, the PythonOperator included.
Here is the general idea:
def intakeFile(*args, **kwargs):
    # read from S3, get filename and pass to triggered DAG
    bucket_name = os.environ.get("bucket_name")
    s3_hook = S3Hook(aws_conn_id="aws_default")
    s3_hook.copy_object()
    s3_hook.delete_objects()
    ...
    # dag_run_obj no longer exists in Airflow 2.0; simply return the value you
    # need downstream and it is pushed to XCom as "return_value" automatically
    return {
        "filePath": workingPath,
        "source": source,
        "fileName": fileName,
    }
get_dag_to_trigger = PythonOperator(
    task_id="get_dag_to_trigger",
    python_callable=intakeFile
)

triggerNewDAG_tsk = TriggerDagRunOperator(
    task_id="triggerNewDAG_tsk",
    trigger_dag_id="{{ ti.xcom_pull(task_ids='get_dag_to_trigger', key='return_value') }}",
)
get_dag_to_trigger >> triggerNewDAG_tsk
I am creating an Airflow @daily DAG. It has an upstream task get_daily_data (a BigQueryGetDataOperator) which fetches data based on the execution_date, and a downstream dependent task (a PythonOperator) that uses that data via xcom_pull. When I run the airflow backfill command, the downstream task process_data_from_bq, where I do the xcom_pull, gets the most recent data only, not the data of the execution date the downstream task is expecting.
The Airflow documentation says: "If xcom_pull is passed a single string for task_ids, then the most recent XCom value from that task is returned."
However, it does not say how to get the data of the same instance of the DAG execution.
I went through the similar question How to pull xcom value from other task instance in the same DAG run (not the most recent one)?, however the solution given there is what I am already doing, yet it does not seem to be the correct answer.
DAG definition:
dag = DAG(
    'daily_motor',
    default_args=default_args,
    schedule_interval='@daily'
)
# This task creates data in a BigQuery table based on execution date
extract_daily_data = BigQueryOperator(
    task_id='daily_data_extract',
    use_legacy_sql=False,
    write_disposition='WRITE_TRUNCATE',
    allow_large_results=True,
    sql=policy_by_transaction_date_sql('{{ ds }}'),
    destination_dataset_table='Test.daily_data_tmp',
    dag=dag)

get_daily_data = BigQueryGetDataOperator(
    task_id='get_daily_data',
    dataset_id='Test',
    table_id='daily_data_tmp',
    max_results='10000',
    dag=dag
)
# This is where I need to pull the data of the same execution date/same instance of DAG run, not the most recent task run
def process_bq_data(**kwargs):
    bq_data = kwargs['ti'].xcom_pull(task_ids='get_daily_data')
    # This bq_data is the most recent one, not the one of the same execution date
    obj_creator = IibListToObject()
    items = obj_creator.create(bq_data, 'daily')
    save_daily_date_wise(items)

process_data = PythonOperator(
    task_id='process_data_from_bq',
    python_callable=process_bq_data,
    provide_context=True,
    dag=dag
)
get_daily_data.set_upstream(extract_daily_data)
process_data.set_upstream(get_daily_data)
You must be receiving the latest XCom value, but that latest value is also from the same execution_date, as it is supposed to be; by default xcom_pull is restricted to the current execution_date, as the include_prior_dates docstring explains:
:param include_prior_dates:
    If False, only XComs from the current
    execution_date are returned.
    If True, XComs from previous dates
    are returned as well.
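So a sketch of the pull inside your callable, with the default value spelled out purely for illustration, would be:

def process_bq_data(**kwargs):
    bq_data = kwargs['ti'].xcom_pull(
        task_ids='get_daily_data',
        include_prior_dates=False,  # default: only XComs from this execution_date
    )
    ...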
I am using DataFlowJavaOperator() in Airflow (Cloud Composer). Is there any way to get the ID of the executed Dataflow job in the next PythonOperator task? I would like to use the job_id to call a gcloud command to get the job result.
def check_dataflow(ds, **kwargs):
    # here I want to execute a gcloud command with the job ID to get the job result:
    # gcloud dataflow jobs describe <JOB_ID>
t1 = DataFlowJavaOperator(
    task_id='task1',
    jar='gs://path/to/jar/abc.jar',
    options={
        'stagingLocation': "gs://stgLocation/",
        'tempLocation': "gs://tmpLocation/",
    },
    provide_context=True,
    dag=dag,
)

t2 = PythonOperator(
    task_id='task2',
    python_callable=check_dataflow,
    provide_context=True,
    dag=dag,
)
t1 >> t2
As it appears, the job_name option in the DataFlowJavaOperator gets overridden by the task_id. The job name will use the task ID as a prefix and append a random ID suffix. If you still want a Dataflow job name that is actually different from the task ID, you can hard-code it in the Dataflow Java code:
options.setJobName("jobNameInCode")
Then, using the PythonOperator, you can retrieve the job ID from the prefix (either the job name provided in code or otherwise the Composer task ID), as I explained here. Briefly, list the jobs with:
result = dataflow.projects().locations().jobs().list(
    projectId=project,
    location=location,
).execute()
and then filter by prefix where job_prefix is the job_name defined when launching the job:
for job in result['jobs']:
    if re.findall(r'' + re.escape(job_prefix) + '', job['name']):
        job_id = job['id']
        break
The break statement is there to ensure we only get the latest job with that name, which should be the one just launched.
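Putting it together, a rough sketch of what check_dataflow could look like, assuming the google-api-python-client library is available in the Composer environment; the project, location and job_prefix values are placeholders you would need to adapt:

import re

from googleapiclient.discovery import build

def check_dataflow(ds, **kwargs):
    project = 'my-gcp-project'   # placeholder: your GCP project
    location = 'us-central1'     # placeholder: region where the Dataflow job runs
    job_prefix = 'task1'         # the DataFlowJavaOperator task_id becomes the job name prefix

    # build a Dataflow API client using the environment's default credentials
    dataflow = build('dataflow', 'v1b3')

    result = dataflow.projects().locations().jobs().list(
        projectId=project,
        location=location,
    ).execute()

    job_id = None
    for job in result['jobs']:
        if re.findall(r'' + re.escape(job_prefix) + '', job['name']):
            job_id = job['id']
            break

    # use job_id here, e.g. shell out to:
    # gcloud dataflow jobs describe <job_id> --region <location>
    return job_id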
I have written a DAG with multiple PythonOperators
task1 = af_op.PythonOperator(task_id='Data_Extraction_Environment',
                             provide_context=True,
                             python_callable=Task1, dag=dag1)

def Task1(**kwargs):
    return kwargs['dag_run'].conf.get('file')
From the PythonOperator I am calling the Task1 method. That method returns a value, and that value needs to be passed to the next PythonOperator. How can I get the value from the task1 variable, or how can I get the value which is returned from the Task1 method?
Updated:
def Task1(**kwargs):
    file_name = kwargs['dag_run'].conf.get('file')
    task_instance = kwargs['task_instance']
    task_instance.xcom_push(key='file', value=file_name)
    return file_name

t1 = PythonOperator(task_id='Task1', provide_context=True, python_callable=Task1, dag=dag)

t2 = BashOperator(
    task_id='Moving_bucket',
    bash_command="python /home/raw.py {{ task_instance.xcom_pull(task_ids='Task1', key='file') }} ",
    dag=dag,
)
t2.set_upstream(t1)
You might want to check out Airflow's XCOM: https://airflow.apache.org/docs/apache-airflow/stable/concepts/xcoms.html
If you return a value from a function, this value is stored in xcom. In your case, you could access it like so from other Python code:
task_instance = kwargs['task_instance']
task_instance.xcom_pull(task_ids='Task1')
or in a template like so:
{{ task_instance.xcom_pull(task_ids='Task1') }}
If you want to specify a key you can push into XCOM (being inside a task):
task_instance = kwargs['task_instance']
task_instance.xcom_push(key='the_key', value=my_str)
Then later on you can access it like so:
task_instance.xcom_pull(task_ids='my_task', key='the_key')
EDIT 1
Follow-up question: Instead of using the value in another function, how can I pass the value to another PythonOperator, like t2 = BashOperator(task_id='Moving_bucket', bash_command='python /home/raw.py "%s" ' % file_name, dag=dag)? I want to access file_name, which is returned by Task1. How can this be achieved?
First of all, it seems to me that the value is, in fact, not being passed to another PythonOperator but to a BashOperator.
Secondly, this is already covered in my answer above. The field bash_command is templated (see template_fields in the source: https://github.com/apache/incubator-airflow/blob/master/airflow/operators/bash_operator.py). Hence, we can use the templated version:
BashOperator(
    task_id='Moving_bucket',
    bash_command="python /home/raw.py {{ task_instance.xcom_pull(task_ids='Task1') }} ",
    dag=dag,
)
EDIT 2
Explanation:
Airflow works like this: it will execute Task1, then populate XCom, and then execute the next task. So for your example to work, Task1 needs to execute first and Moving_bucket needs to run downstream of Task1.
Since you are returning the value from the function, you could also omit the key='file' from xcom_pull and not push it manually in the function.
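A minimal sketch of that simplified pattern, relying only on the automatic return_value XCom (imports and DAG definition omitted as in the snippets above):

def Task1(**kwargs):
    # the return value is pushed to XCom under the key 'return_value'
    return kwargs['dag_run'].conf.get('file')

t1 = PythonOperator(task_id='Task1', provide_context=True, python_callable=Task1, dag=dag)

t2 = BashOperator(
    task_id='Moving_bucket',
    # without an explicit key, xcom_pull returns Task1's 'return_value'
    bash_command="python /home/raw.py {{ task_instance.xcom_pull(task_ids='Task1') }} ",
    dag=dag,
)

t1 >> t2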