How to get job ID or result of airflow DataFlowJavaOperator()? - python

I am using DataFlowJavaOperator() in Airflow (Cloud Composer). Is there any way to get the ID of the executed Dataflow job in the next PythonOperator task? I would like to use the job_id in a gcloud command to get the job result.
def check_dataflow(ds, **kwargs):
    # Here I want to execute a gcloud command with the job ID to get the job result:
    # gcloud dataflow jobs describe <JOB_ID>
t1 = DataFlowJavaOperator(
    task_id='task1',
    jar='gs://path/to/jar/abc.jar',
    options={
        'stagingLocation': "gs://stgLocation/",
        'tempLocation': "gs://tmpLocation/",
    },
    provide_context=True,
    dag=dag,
)
t2 = PythonOperator(
    task_id='task2',
    python_callable=check_dataflow,
    provide_context=True,
    dag=dag,
)
t1 >> t2

As it appears, the job_name option in the DataFlowJavaOperator gets overridden by the task_id. The job name will have the task ID as the prefix with a random ID appended as a suffix. If you still want a Dataflow job name that is actually different from the task ID, you can hard-code it in the Dataflow Java code:
options.setJobName("jobNameInCode");
Then, using the PythonOperator you can retrieve the job ID from the prefix (either the job name provided in code or otherwise the Composer task id) as I explained here. Briefly, list jobs with:
result = dataflow.projects().locations().jobs().list(
    projectId=project,
    location=location,
).execute()
and then filter by prefix where job_prefix is the job_name defined when launching the job:
for job in result['jobs']:
    if re.findall(r'' + re.escape(job_prefix) + '', job['name']):
        job_id = job['id']
        break
The break statement is there to ensure we only get the latest job with that name, which should be the one just launched.
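Putting it together, a minimal sketch of the downstream callable could look like the following. It builds the Dataflow client with googleapiclient, filters the job list by the task-ID prefix as above, and then shells out to gcloud dataflow jobs describe. The project, region, and the use of subprocess are assumptions for illustration, not part of the original answer.
import re
import subprocess
from googleapiclient.discovery import build

def check_dataflow(ds, **kwargs):
    project = 'my-project'        # assumed project ID
    location = 'us-central1'      # assumed Dataflow region
    job_prefix = 'task1'          # task_id used as the job name prefix

    # List Dataflow jobs in the project/region.
    dataflow = build('dataflow', 'v1b3')
    result = dataflow.projects().locations().jobs().list(
        projectId=project,
        location=location,
    ).execute()

    # Pick the most recent job whose name matches the prefix.
    job_id = None
    for job in result['jobs']:
        if re.findall(re.escape(job_prefix), job['name']):
            job_id = job['id']
            break
    if job_id is None:
        raise ValueError('No Dataflow job found with prefix %s' % job_prefix)

    # Call gcloud to describe the job and return its output.
    output = subprocess.check_output(
        ['gcloud', 'dataflow', 'jobs', 'describe', job_id,
         '--region', location, '--project', project])
    return output.decode()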

Related

GCP Cloud Composer: Audit the status of the DAGs from Apache Airflow into BigQuery

I want to audit all the information about the DAGs' execution status into a BigQuery table, and I want to do this through Python code in the DAGs. The code I have already written loads data into a BigQuery table (as given below). I need help appending the audit logic to this existing code.
with models.DAG(
        'C360_GBL_CCN2DPN_CLASSIC',
        default_args=default_args,
        # schedule_interval='0 9 * * * ') as dag:
        schedule_interval=None) as dag:

    start = dummy_operator.DummyOperator(
        task_id='start',
        trigger_rule='all_success'
    )

    read_json_file(config_file_path)

    end = dummy_operator.DummyOperator(
        task_id='end',
        trigger_rule='all_success'
    )

    a = []
    if (len(configurations) > 1):
        for k in range(0, len(configurations)):
            config = configurations[k]
            project_id = config['Project_Id']
            staging_dataset = config['Dataset']
            table_name = config['Table-Name']
            write_disposition = config['write_disposition']
            sql = config['Sql']
            create_disposition = config['create_disposition']
            a.append(BigQueryOperator(
                task_id=table_name + '_ccn_2_dpn_bq',
                sql=sql,
                write_disposition=write_disposition,
                create_disposition=create_disposition,
                use_legacy_sql=False
            ))
            if k != 0:
                a[k-1].set_downstream(a[k])
            else:
                a[k].set_upstream(start)
        a[len(configurations)-1].set_downstream(end)
    else:
        config = configurations[0]
        project_id = config['Project_Id']
        staging_dataset = config['Dataset']
        table_name = config['Table-Name']
        write_disposition = config['write_disposition']
        sql = config['Sql']
        create_disposition = config['create_disposition']
        Task1 = BigQueryOperator(
            task_id=table_name + '_ccn_2_dpn_bq',
            sql=sql,
            write_disposition=write_disposition,
            create_disposition=create_disposition,
            use_legacy_sql=False
        )
        Task1.set_upstream(start)
        Task1.set_downstream(end)
Airflow writes all its audit logs to the log table in its metadata database. You can check whether that information is enough for your needs; if it is, you can export it to BigQuery either with a streaming app (if you need the records as soon as they are created) or with an Airflow DAG in two steps using the official operators (see the sketch after this list):
a task using PostgresToGCSOperator to export the data to GCS,
then a second task using GCSToBigQueryOperator to import the exported files into your BigQuery table,
or in one step using a new operator (which you would need to develop yourself).
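A minimal sketch of the two-step approach could look like the following. The connection ID (airflow_db), bucket, target table, and the dttm filter on the log table are all placeholders assumed for illustration, not anything from the original question.
from datetime import datetime
from airflow import models
from airflow.providers.google.cloud.transfers.postgres_to_gcs import PostgresToGCSOperator
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import GCSToBigQueryOperator

with models.DAG('export_airflow_audit_log',
                start_date=datetime(2023, 1, 1),
                schedule_interval='@daily') as dag:

    # Step 1: dump the metadata "log" table to GCS as newline-delimited JSON.
    export_log = PostgresToGCSOperator(
        task_id='export_log_to_gcs',
        postgres_conn_id='airflow_db',          # assumed connection to the metadata DB
        sql="SELECT * FROM log WHERE dttm >= '{{ ds }}'",
        bucket='my-audit-bucket',               # placeholder bucket
        filename='airflow_log/{{ ds }}.json',
        export_format='json',
    )

    # Step 2: load the exported file into a BigQuery table.
    load_to_bq = GCSToBigQueryOperator(
        task_id='load_log_to_bq',
        bucket='my-audit-bucket',
        source_objects=['airflow_log/{{ ds }}.json'],
        destination_project_dataset_table='my-project.audit.airflow_log',
        source_format='NEWLINE_DELIMITED_JSON',
        write_disposition='WRITE_APPEND',
        autodetect=True,
    )

    export_log >> load_to_bq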

Airflow passing 'ti' into a python method

Good evening,
In Airflow I have a task_group (tg1) that loops through a list and dynamically calls a Python method which then generates a series of tasks. The issue I am having is that I need access to XComs inside the Python method, and I keep seeing the error
KeyError: 'task_instance' or KeyError: 'ti'. I have tried both keys to be sure.
Task Group Code:
...
for partitions in partition_list:
    t1 = PythonOperator(
        task_id='Refresh_Wrapper_{0}'.format(iteration),
        python_callable=refresh_task(task_id, partitions, dag, iteration),
        provide_context=True,
        dag=dag
    )
...
Python Method Code:
def refresh_task(task_group, data, dag, iteration, **context):
    foo = "baz{0}".format(str(iteration))
    bar = "Bar " + context['task_instance'].xcom_pull(task_ids=task_group.foo, key='return_value')
    .....
In the PythonOperator you should pass only the name of the function to python_callable.
If you need to pass params to the function, use op_kwargs like the following:
t1 = PythonOperator(
    task_id='Refresh_Wrapper_{0}'.format(iteration),
    python_callable=refresh_task,
    op_kwargs={"task_id": task_id, "partitions": partitions, "dag": dag, "iteration": iteration},
    provide_context=True,
    dag=dag
)
Also, you can get the dag from the context instead of passing it in (context["dag"], or the current run via context["dag_run"]).
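For reference, a small sketch of how the callable itself might then look with the op_kwargs above; the upstream task ID and XCom key are placeholders, not from the original question:
def refresh_task(task_id, partitions, dag, iteration, **context):
    # With provide_context=True the runtime context (ti, dag_run, ...) is merged
    # into the keyword arguments, so 'ti' is available here.
    ti = context['ti']
    foo = "baz{0}".format(iteration)
    # Pull the upstream value; 'some_upstream_task' is a placeholder task ID.
    bar = ti.xcom_pull(task_ids='some_upstream_task', key='return_value')
    return "Bar {0} {1}".format(foo, bar)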

Will Airflow read from ds or **kwargs in function parameters

Hi all, I have a function
def get_campaign_active(ds, **kwargs):
    logging.info('Checking for inactive campaign types..')
    the_db = ds['client']
    db = the_db['misc-server']
    collection = db.campaigntypes
    campaign = list(collection.find({}))
    for item in campaign:
        if item['active'] == False:
            # storing false 'active' campaigns
            result = "'{}' active status set to False".format(item['text'])
            logging.info("'{}' active status set to False".format(item['text']))
mapped to an airflow task
get_campaign_active = PythonOperator(
    task_id='get_campaign_active',
    provide_context=True,
    python_callable=get_campaign_active,
    xcom_push=True,
    op_kwargs={'client': client_production},
    dag=dag)
As you can see, I pass the client_production variable into op_kwargs for the task. The hope is that this variable will be passed in through the **kwargs parameter of the function when the task runs in Airflow.
However, for testing, when I try to call the function like so:
get_campaign_active({"client": client_production})
the client_production variable is found inside the ds parameter. I don't have a staging server for Airflow to test this, but could someone tell me whether, if I deploy this function/task to Airflow, it will read the client_production variable from ds or from kwargs?
Right now, if I try to access the 'client' key in kwargs, kwargs is empty.
Thanks
You should do:
def get_campaign_active(ds, **kwargs):
    logging.info('Checking for inactive campaign types..')
    the_db = kwargs['client']
The ds (and all the other macros) are passed to kwargs because you set provide_context=True; you can either use named params, as you did for ds, or let them all be passed into kwargs.
Since in your code you don't actually use ds or any other macro, you can change your function signature to get_campaign_active(**kwargs) and remove provide_context=True. Note that from Airflow >= 2.0, provide_context=True is not needed at all.
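As a concrete sketch of that suggestion, assuming Airflow 2 import paths and that client_production and dag are defined elsewhere in the DAG file as in the question:
import logging
from airflow.operators.python import PythonOperator

def get_campaign_active(**kwargs):
    logging.info('Checking for inactive campaign types..')
    # 'client' arrives via op_kwargs; the context values (ds, ti, ...)
    # are also present in kwargs when the task runs.
    the_db = kwargs['client']
    db = the_db['misc-server']
    collection = db.campaigntypes
    for item in collection.find({}):
        if item['active'] == False:
            logging.info("'{}' active status set to False".format(item['text']))

get_campaign_active_task = PythonOperator(
    task_id='get_campaign_active',
    python_callable=get_campaign_active,
    op_kwargs={'client': client_production},  # client_production from the question's DAG file
    dag=dag)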

Airflow xcom_pull is not giving the data of same upstream task instance run, instead gives most recent data

I am creating an Airflow @daily DAG. It has an upstream task get_daily_data (a BigQueryGetDataOperator) which fetches data based on execution_date, and a downstream dependent task (a PythonOperator) uses that date-based data via xcom_pull. When I run the airflow backfill command, the downstream task process_data_from_bq, where I do the xcom_pull, gets only the most recent data, not the data of the execution date which the downstream task expects.
The Airflow documentation says: if xcom_pull is passed a single string for task_ids, then the most recent XCom value from that task is returned.
However, it does not say how to get the data of the same instance of the DAG execution.
I went through the similar question How to pull xcom value from other task instance in the same DAG run (not the most recent one)?; however, the solution given there is what I am already doing, so it does not seem to be the correct answer.
DAG definition:
dag = DAG(
    'daily_motor',
    default_args=default_args,
    schedule_interval='@daily'
)

# This task creates data in a BigQuery table based on execution date
extract_daily_data = BigQueryOperator(
    task_id='daily_data_extract',
    use_legacy_sql=False,
    write_disposition='WRITE_TRUNCATE',
    allow_large_results=True,
    sql=policy_by_transaction_date_sql('{{ ds }}'),
    destination_dataset_table='Test.daily_data_tmp',
    dag=dag)

get_daily_data = BigQueryGetDataOperator(
    task_id='get_daily_data',
    dataset_id='Test',
    table_id='daily_data_tmp',
    max_results='10000',
    dag=dag
)

# This is where I need to pull the data of the same execution date / same instance of the DAG run, not the most recent task run
def process_bq_data(**kwargs):
    bq_data = kwargs['ti'].xcom_pull(task_ids='get_daily_data')
    # This bq_data is the most recent one, not that of the same execution date
    obj_creator = IibListToObject()
    items = obj_creator.create(bq_data, 'daily')
    save_daily_date_wise(items)

process_data = PythonOperator(
    task_id='process_data_from_bq',
    python_callable=process_bq_data,
    provide_context=True,
    dag=dag
)
get_daily_data.set_upstream(extract_daily_data)
process_data.set_upstream(get_daily_data)
You must be receiving the latest XCom value. You also need to be sure that the values are actually from the same execution_date, as they are supposed to be (see the sketch below):
:param include_prior_dates:
    If False, only XComs from the current
    execution_date are returned.
    If True, XComs from previous dates
    are returned as well.
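For illustration, a minimal sketch of how the downstream callable could make that explicit; include_prior_dates already defaults to False, so this mainly documents the intent and logs which execution_date the pull happened for (the logging line is an addition, not part of the original answer):
import logging

def process_bq_data(**kwargs):
    ti = kwargs['ti']
    # Only consider XComs written for this DAG run's execution_date.
    bq_data = ti.xcom_pull(task_ids='get_daily_data', include_prior_dates=False)
    logging.info("Pulled XCom for execution_date=%s", kwargs['ds'])
    # ... continue processing bq_data as in the original process_bq_data ...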

Python Airflow - Return result from PythonOperator

I have written a DAG with multiple PythonOperators
task1 = af_op.PythonOperator(task_id='Data_Extraction_Environment',
                             provide_context=True,
                             python_callable=Task1, dag=dag1)

def Task1(**kwargs):
    return kwargs['dag_run'].conf.get('file')
From the PythonOperator I am calling the "Task1" method. That method returns a value, and I need to pass that value to the next PythonOperator. How can I get the value from the "task1" variable, or how can I get the value returned from the Task1 method?
Updated:
def Task1(**kwargs):
    file_name = kwargs['dag_run'].conf.get('file')
    task_instance = kwargs['task_instance']
    task_instance.xcom_push(key='file', value=file_name)
    return file_name

t1 = PythonOperator(task_id='Task1', provide_context=True, python_callable=Task1, dag=dag)

t2 = BashOperator(
    task_id='Moving_bucket',
    bash_command='python /home/raw.py {{ task_instance.xcom_pull(task_ids="Task1", key="file") }} ',
    dag=dag,
)
t2.set_upstream(t1)
You might want to check out Airflow's XCOM: https://airflow.apache.org/docs/apache-airflow/stable/concepts/xcoms.html
If you return a value from a function, this value is stored in xcom. In your case, you could access it like so from other Python code:
task_instance = kwargs['task_instance']
task_instance.xcom_pull(task_ids='Task1')
or in a template like so:
{{ task_instance.xcom_pull(task_ids='Task1') }}
If you want to specify a key you can push into XCOM (being inside a task):
task_instance = kwargs['task_instance']
task_instance.xcom_push(key='the_key', value=my_str)
Then later on you can access it like so:
task_instance.xcom_pull(task_ids='my_task', key='the_key')
EDIT 1
Follow-up question: Instead of using the value in another function, how can I pass the value to another operator, like t2 = BashOperator(task_id='Moving_bucket', bash_command='python /home/raw.py "%s" ' % file_name, dag=dag)? I want to access file_name, which is returned by "Task1". How can this be achieved?
First of all, it seems to me that the value is, in fact, not being passed to another PythonOperator but to a BashOperator.
Secondly, this is already covered in my answer above. The field bash_command is templated (see template_fields in the source: https://github.com/apache/incubator-airflow/blob/master/airflow/operators/bash_operator.py). Hence, we can use the templated version:
BashOperator(
    task_id='Moving_bucket',
    bash_command='python /home/raw.py {{ task_instance.xcom_pull(task_ids="Task1") }} ',
    dag=dag,
)
EDIT 2
Explanation:
Airflow works like this: It will execute Task1, then populate xcom and then execute the next task. So for your example to work you need Task1 executed first and then execute Moving_bucket downstream of Task1.
Since you are returning the value from the function, you could also omit key='file' from xcom_pull and avoid pushing it manually in the function.
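Putting that advice together, a minimal end-to-end sketch could look like this; it assumes Airflow 2 import paths and that the dag object and /home/raw.py path from the question exist:
from airflow.operators.python import PythonOperator
from airflow.operators.bash import BashOperator

def Task1(**kwargs):
    # Returning the value stores it in XCom under the default 'return_value' key.
    return kwargs['dag_run'].conf.get('file')

t1 = PythonOperator(task_id='Task1', python_callable=Task1, dag=dag)

t2 = BashOperator(
    task_id='Moving_bucket',
    # Pull Task1's default return value via templating; no explicit key needed.
    bash_command='python /home/raw.py {{ task_instance.xcom_pull(task_ids="Task1") }} ',
    dag=dag,
)

t1 >> t2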
