I am trying to pass into my custom Operator a parameter which is the last run time of the dag itself.
Following the documentation, I understand that I should use dag.get_last_dagrun() (https://airflow.apache.org/docs/apache-airflow/1.10.6/_api/airflow/models/dag/index.html#airflow.models.dag.get_last_dagrun). However, I can't manage to pass the session parameter correctly.
Where can I find this?
When I call the function without parameters, it returns None.
I think that's because I triggered the DAG myself, so I want to set include_externally_triggered to True. But I still need to sort out the session parameter first.
I tried to create the variable last_run before creating the dag and also when defining the tasks. I suppose that inside the task, self is included and it will fetch correctly without putting any parameters.
But what about the one which is outside of the dag?
I have also tried this solution, which gives me a time even when it's the first time I run the DAG (I have cleaned the DAG log from the UI). Maybe it's the timestamp of the currently executing DAG run? If so, would I need to compare the dates and skip the result when they are equal? https://stackoverflow.com/a/63930004/18036486
from datetime import datetime

from airflow import DAG
from DAG.operators.custom_operator1 import customOperator1

last_run = dag.get_last_dagrun()  # HERE

default_args = {
    "owner": "admin",
    "depends_on_past": False,
    "email": ["email@email.com"],
    "email_on_failure": False,
    "email_on_retry": False,
    "retries": 1,
}

with DAG(
    dag_id="Custom",
    schedule_interval="@once",
    description="Desc",
    start_date=datetime(2022, 3, 11),
    catchup=False,
    tags=["custom"],
    default_args=default_args,
) as dag:
    # Custom Operator
    custom = customOperator1(
        task_id="custom",
        last_run=dag.get_last_dagrun(),  # OR HERE
    )

    custom
The answer at https://stackoverflow.com/a/63930004/18036486 also includes the currently running DAG run. I therefore modified the function slightly to keep only runs whose state is 'success', which excludes the run that is still 'running'. You can of course add conditions for the other DAG run states:
Now I can get the latest successful DAG execution_date to dynamically update my data!
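from airflow.models import DagRun


def get_last_exec_date(dag_id):
    # Keep only the runs that finished successfully; this skips the run
    # that is currently executing.
    dag_runs = DagRun.find(dag_id=dag_id)
    successful_runs = [run for run in dag_runs if run.state == 'success']
    # Newest first, so index 0 is the latest successful run.
    successful_runs.sort(key=lambda run: run.execution_date, reverse=True)
    # Returns the DagRun object; read .execution_date for the timestamp.
    return successful_runs[0] if successful_runs else None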
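As a quick check of the selection logic, here is a minimal sketch that runs without Airflow. FakeRun is a hypothetical stand-in for DagRun (only the state and execution_date attributes are assumed), and latest_successful is an illustrative name, not an Airflow API. It ignores runs that are not in the 'success' state and returns the most recent successful one.

```python
from collections import namedtuple
from datetime import datetime

# Hypothetical stand-in for airflow.models.DagRun; only these two
# attributes are assumed to exist on the real object.
FakeRun = namedtuple("FakeRun", ["state", "execution_date"])


def latest_successful(runs):
    """Return the successful run with the newest execution_date, or None."""
    successful = [run for run in runs if run.state == "success"]
    # max() with a default avoids sorting and handles the empty case.
    return max(successful, key=lambda run: run.execution_date, default=None)


runs = [
    FakeRun("success", datetime(2022, 3, 11)),
    FakeRun("success", datetime(2022, 3, 12)),
    FakeRun("running", datetime(2022, 3, 13)),  # the run currently executing
]
print(latest_successful(runs).execution_date)
```

With real DagRun objects the same filter-then-pick-newest idea should apply, but that has not been verified against a live Airflow installation here.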
Related
I'm using a parameter that is the timestamp in a set of tasks:
default_dag_args = {'arg1': 'arg1-value',
                    'arg2': 'arg2-value',
                    'now': datetime.now()}
I would like the now parameter to have the same value for all the tasks, but it is re-evaluated for each function.
Is there a way of doing it (executing once and using the same value through the dag)? I'm using the TaskFlow API for Airflow 2.0:
@task
def python_task():
    context = get_current_context()
    context_dag = context['dag']
    now = context_dag.default_args['now']
    print(now)
I tried to set the time constant, at the start of the dag file, like:
TIME = datetime.now()
and read it inside the tasks via get_current_context(), just like you did.
Sadly, because the DAG file is executed from the top every time a task is run, the time was recalculated for each task.
One idea is to use XComs to save the datetime in one task and pull it from the other tasks.
My sample code is below; I think you'll get the idea.
from airflow.decorators import task, dag
from datetime import datetime
import time

default_arguments = {
    'owner': 'admin',
    # This is the beginning, for more see: https://airflow.apache.org/faq.html#what-s-the-deal-with-start-date
    'start_date': datetime(2022, 5, 2)
}


@dag(
    schedule_interval=None,
    dag_id="Time_Example_Dag",
    default_args=default_arguments,
    catchup=False,
)
def the_global_time_checker_dag():

    @task
    def time_set():
        # To use XCOM to pass the value between tasks,
        # we have to convert the datetime to a string.
        now = str(datetime.now())
        return now

    @task
    def starting_task(datetime_string):
        important_number = 23
        # We can use this datetime object in whatever way we like.
        date_time_obj = datetime.strptime(datetime_string, '%Y-%m-%d %H:%M:%S.%f')
        print(date_time_obj)
        return important_number

    @task
    def important_task(datetime_string, number):
        # Passing some time
        time.sleep(10)
        # Again, we are free to do whatever we want with this object.
        date_time_obj = datetime.strptime(datetime_string, '%Y-%m-%d %H:%M:%S.%f')
        print(date_time_obj)
        print("The important number is: {}".format(number))

    time_right_now = time_set()
    start = starting_task(datetime_string=time_right_now)
    important = important_task(datetime_string=time_right_now, number=start)


time_checker = the_global_time_checker_dag()
Through the logs, you can see all the datetime values are the same.
For more information about XCOM in Taskflow API, you can check here.
When a worker gets a task instance to run, it rebuilds the whole DagBag from the Python files to get the DAG and task definition. So every time a task instance is run, your DAG file is sourced, rerunning your DAG definition code. The resulting DAG object is the one that defines that particular task instance.
It's critical to understand that the DAG definition is not simply built once for every execution date and then persisted and reused for all TIs within that DagRun. The DAG definition is constantly being recomputed from your Python code, and each TI is run in a separate process, independently and without state from other tasks. Thus, if your DAG definition includes non-deterministic results at DagBag build time, such as datetime.now(), every instantiation of your DAG will have different values, even for the same execution date. You need to build your DAGs in a deterministic and idempotent manner.
The only way to share non-deterministic results is to store them in the DB and have your tasks fetch them, as @sezai-burak-kantarcı has pointed out. Best practice is to use task context-specific variables, like {{ ds }}, {{ execution_date }}, {{ data_interval_start }}, etc. These are the same for all tasks within a DAG run. You can see the available template variables here: Airflow templates reference
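To make the difference concrete, the sketch below simulates two separate parses of a DAG file without needing Airflow. parse_dag_file and run_timestamp are hypothetical names, and the plain dict passed to run_timestamp is only a stand-in for the task context; its data_interval_start key mirrors the template variable mentioned above.

```python
from datetime import datetime, timezone
import time


def parse_dag_file():
    """Simulate one parse of the DAG file: top-level code runs again."""
    return {"now": datetime.now()}  # non-deterministic at parse time


def run_timestamp(context):
    """Take the timestamp from the run's context so every task agrees."""
    return context["data_interval_start"]


# Each task instance re-parses the file, so "now" is recomputed per task.
first_parse = parse_dag_file()
time.sleep(0.05)
second_parse = parse_dag_file()
print(first_parse["now"], second_parse["now"])

# Stand-in for the context Airflow would hand to every task of one DAG run.
context = {"data_interval_start": datetime(2022, 5, 2, tzinfo=timezone.utc)}
print(run_timestamp(context) == run_timestamp(context))
```

The two parse-time values will normally differ, while the value taken from the context is fixed for the run no matter how often the file is parsed.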
I need to create an Airflow operator that takes a few inputs and returns a string, which will then be used as an input for the operator that runs next. I'm new to Airflow DAGs and operators and am confused about how to do this properly. Since I'm building this for people who use Airflow and build DAGs, and I'm not an actual Airflow user or DAG developer myself, I'd like advice on doing it the right way. I have created an operator that returns a token (just a string, so the hello world operator example works fine), and I can see the value in XCom for the DAG execution. But how do I properly retrieve that value and pass it into the next operator? In my example I just call the same operator twice, but in reality it will be a different operator. Do I just add code to the DAG? Does the operator need code added? Or is it something else?
Example Dag:
import logging
import os
from airflow import DAG
from airflow.utils.dates import days_ago
from custom_operators.hello_world import HelloWorldOperator
from datetime import datetime

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 0,
}

dag = DAG("hello_world_test",
          description='Testing out a operator',
          start_date=days_ago(1),
          schedule_interval=None,
          catchup=False,
          default_args=default_args)

# The keyword matches the some_input parameter of HelloWorldOperator.
get_token = HelloWorldOperator(
    task_id='hello_world_task_1',
    some_input='My input to generate a token',
    dag=dag
)

token = "My token"  # Want this to be the return value from get_token

run_this = HelloWorldOperator(
    task_id='hello_world_task_2',
    some_input=token,
    dag=dag
)

logging.info("Start")
get_token >> run_this
logging.info("End")
Hello World operator:
from airflow.models.baseoperator import BaseOperator


class HelloWorldOperator(BaseOperator):

    def __init__(
            self,
            some_input: str,
            **kwargs) -> None:
        super().__init__(**kwargs)
        self.some_input = some_input

    def execute(self, context):
        # Bunch of business logic
        token = "MyGeneratedToken"
        # The return value of execute() is what shows up in XCom.
        return token
This is a good start :).
The right way to retrieve the token from another task is to use Jinja templating:
run_this = RetrieveToken(
    task_id='hello_world_task_2',
    # A single task id (not a list) makes xcom_pull return a single value.
    retrieved_token="{{ ti.xcom_pull(task_ids='hello_world_task_1') }}",
    dag=dag
)
However, you have to remember to add retrieved_token to the template_fields of your RetrieveToken operator: https://airflow.apache.org/docs/apache-airflow/stable/howto/custom-operator.html#templating
You can also call the xcom_pull method explicitly in your "retrieve" operator and pass the "origin" task id to the operator, so that it retrieves the value from the right task.
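Calling xcom_pull explicitly inside the operator could look roughly like the sketch below. To keep it runnable without an Airflow installation, FakeTaskInstance is a hypothetical stand-in for the real task instance and RetrieveTokenSketch is a plain class; in a real DAG it would subclass BaseOperator and receive the context from Airflow. source_task_id is an assumed parameter name.

```python
class FakeTaskInstance:
    """Hypothetical stand-in for Airflow's task instance (context['ti'])."""

    def __init__(self, xcoms):
        self._xcoms = xcoms  # maps task_id -> value returned by that task

    def xcom_pull(self, task_ids):
        # In this stand-in, a single task id string yields a single value.
        return self._xcoms.get(task_ids)


class RetrieveTokenSketch:
    """In a real DAG this would subclass airflow's BaseOperator."""

    def __init__(self, task_id, source_task_id):
        self.task_id = task_id
        self.source_task_id = source_task_id  # the "origin" task to read from

    def execute(self, context):
        # Pull the value that the origin task returned from execute().
        token = context["ti"].xcom_pull(task_ids=self.source_task_id)
        return "used token: {}".format(token)


ti = FakeTaskInstance({"hello_world_task_1": "MyGeneratedToken"})
op = RetrieveTokenSketch(task_id="hello_world_task_2",
                         source_task_id="hello_world_task_1")
print(op.execute({"ti": ti}))
```

Passing the origin task id as a constructor argument keeps the operator reusable, since the DAG author decides which upstream task supplies the token.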
I am using Gcloud Composer to launch Dataflow jobs.
My DAG consist of two Dataflow jobs that should be run one after the other.
import datetime

from airflow.contrib.operators.dataflow_operator import DataflowTemplateOperator
from airflow import models

default_dag_args = {
    'start_date': datetime.datetime(2019, 10, 23),
    'dataflow_default_options': {
        'project': 'myproject',
        'region': 'europe-west1',
        'zone': 'europe-west1-c',
        'tempLocation': 'gs://somebucket/',
    }
}

with models.DAG(
        'some_name',
        schedule_interval=datetime.timedelta(days=1),
        default_args=default_dag_args) as dag:

    parameters = {'params': "param1"}

    t1 = DataflowTemplateOperator(
        task_id='dataflow_example_01',
        template='gs://path/to/template/template_001',
        parameters=parameters,
        dag=dag)

    parameters2 = {'params': "param2"}

    t2 = DataflowTemplateOperator(
        task_id='dataflow_example_02',
        template='gs://path/to/templates/template_002',
        parameters=parameters2,
        dag=dag
    )

    t1 >> t2
When I check in Dataflow, the job has succeeded and all the files it is supposed to produce are created, but it appears to have run in a US region, even though the Cloud Composer environment is in Europe west.
In Airflow I can see that the first job is still running, so the second one is never launched.
What should I add to the DAG to make it succeed? How do I run in Europe?
Any advice or solution on how to proceed would be most appreciated. Thanks!
I had to solve this issue in the past. In Airflow 1.10.2 (or lower) the code calls the service.projects().templates().launch() endpoint. This was fixed in 1.10.3, where the regional one is used instead: service.projects().locations().templates().launch().
As of October 2019, the latest Airflow version available for Composer environments is 1.10.2. If you need a solution immediately, the fix can be back-ported into Composer.
For this we can subclass DataflowTemplateOperator with our own version, called RegionalDataflowTemplateOperator:
class RegionalDataflowTemplateOperator(DataflowTemplateOperator):
    def execute(self, context):
        hook = RegionalDataFlowHook(gcp_conn_id=self.gcp_conn_id,
                                    delegate_to=self.delegate_to,
                                    poll_sleep=self.poll_sleep)

        hook.start_template_dataflow(self.task_id, self.dataflow_default_options,
                                     self.parameters, self.template)
This now makes use of the modified RegionalDataFlowHook, which overrides the _start_template_dataflow method of DataFlowHook to call the correct endpoint:
class RegionalDataFlowHook(DataFlowHook):
    def _start_template_dataflow(self, name, variables, parameters,
                                 dataflow_template):
        ...
        request = service.projects().locations().templates().launch(
            projectId=variables['project'],
            location=variables['region'],
            gcsPath=dataflow_template,
            body=body
        )
        ...
        return response
Then, we can create a task using our new operator and a Google-provided template (for testing purposes):
task = RegionalDataflowTemplateOperator(
    task_id=JOB_NAME,
    template=TEMPLATE_PATH,
    parameters={
        'inputFile': 'gs://dataflow-samples/shakespeare/kinglear.txt',
        'output': 'gs://{}/europe/output'.format(BUCKET)
    },
    dag=dag,
)
Full working DAG here. For a cleaner version, the operator can be moved into a separate module.
I am creating an Airflow @daily DAG. It has an upstream task get_daily_data, a BigQueryGetDataOperator that fetches data based on execution_date, and a downstream dependent task (a PythonOperator) that uses this date-based data via xcom_pull. When I run the airflow backfill command, the downstream task process_data_from_bq, where I do the xcom_pull, gets only the most recent data, not the data for the same execution date that the downstream task expects.
The Airflow documentation says: "If xcom_pull is passed a single string for task_ids, then the most recent XCom value from that task is returned."
However, it does not say how to get the data from the same instance of the DAG execution.
I went through a similar question, "How to pull xcom value from other task instance in the same DAG run (not the most recent one)?". However, the one solution given there is what I am already doing, and it does not seem to be the correct answer.
DAG definition:
dag = DAG(
    'daily_motor',
    default_args=default_args,
    schedule_interval='@daily'
)

# This task creates data in a BigQuery table based on execution date
extract_daily_data = BigQueryOperator(
    task_id='daily_data_extract',
    use_legacy_sql=False,
    write_disposition='WRITE_TRUNCATE',
    allow_large_results=True,
    sql=policy_by_transaction_date_sql('{{ ds }}'),
    destination_dataset_table='Test.daily_data_tmp',
    dag=dag)

get_daily_data = BigQueryGetDataOperator(
    task_id='get_daily_data',
    dataset_id='Test',
    table_id='daily_data_tmp',
    max_results='10000',
    dag=dag
)


# This is where I need to pull the data of the same execution date/same instance of DAG run, not the most recent task run
def process_bq_data(**kwargs):
    bq_data = kwargs['ti'].xcom_pull(task_ids='get_daily_data')
    # This bq_data is the most recent one, not the one from the same execution date
    obj_creator = IibListToObject()
    items = obj_creator.create(bq_data, 'daily')
    save_daily_date_wise(items)


process_data = PythonOperator(
    task_id='process_data_from_bq',
    python_callable=process_bq_data,
    provide_context=True,
    dag=dag
)

get_daily_data.set_upstream(extract_daily_data)
process_data.set_upstream(get_daily_data)
You are probably receiving the latest XCom value. You also need to make sure that the values actually come from the same execution_date. The xcom_pull docstring describes the relevant parameter:
:param include_prior_dates:
    If False, only XComs from the current
    execution_date are returned.
    If True, XComs from previous dates
    are returned as well.
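To illustrate what that flag changes, here is a toy in-memory model of the behaviour the docstring describes. XCOMS and pull are hypothetical names, not Airflow internals, and the sketch assumes that with include_prior_dates=True any row dated on or before the current execution date is eligible and the newest one wins.

```python
from datetime import datetime

# Toy XCom table: (task_id, execution_date, value). Not Airflow's storage.
XCOMS = [
    ("get_daily_data", datetime(2019, 1, 1), "rows for Jan 1"),
    ("get_daily_data", datetime(2019, 1, 2), "rows for Jan 2"),
    ("get_daily_data", datetime(2019, 1, 3), "rows for Jan 3"),
]


def pull(task_id, execution_date, include_prior_dates=False):
    """Return the newest matching value, or None if nothing matches."""
    if include_prior_dates:
        # Earlier dates are eligible as well as the current one.
        rows = [r for r in XCOMS if r[0] == task_id and r[1] <= execution_date]
    else:
        # Only rows written for exactly this execution date.
        rows = [r for r in XCOMS if r[0] == task_id and r[1] == execution_date]
    newest = max(rows, key=lambda r: r[1], default=None)
    return newest[2] if newest else None


print(pull("get_daily_data", datetime(2019, 1, 2)))
print(pull("get_daily_data", datetime(2019, 1, 4)))
print(pull("get_daily_data", datetime(2019, 1, 4), include_prior_dates=True))
```

In this model a backfilled run for Jan 2 only sees the Jan 2 row when the flag is False, which is the behaviour the question is after.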
I'm trying to set S3KeySensor's bucket_key up based on dagrun input variable.
I have one dag "dag_trigger" that uses TriggerDagRunOperator to trigger dagrun for dag "dag_triggered". I'm trying to extend example https://github.com/apache/incubator-airflow/blob/master/airflow/example_dags/example_trigger_target_dag.py.
So I want to send a variable to the triggered DAG and, depending on the variable's value, set the bucket_key value in the S3KeySensor task. I know how to use the sent variable in a PythonOperator callable function, but I do not know how to use it on the sensor object.
dag_trigger dag:
import datetime

from airflow import DAG
from airflow.operators.dagrun_operator import TriggerDagRunOperator

default_args = {
    'owner': 'airflow',
    'start_date': datetime.datetime.now()}

dag = DAG('dag_trigger', default_args=default_args, schedule_interval="@hourly")


def task_1_run(context, dag_run_object):
    sent_variable = '2018_02_19'  # not important
    dag_run_object.payload = {'message': sent_variable}
    print("DAG dag_trigger triggered with payload: %s" % dag_run_object.payload)
    return dag_run_object


task_1 = TriggerDagRunOperator(task_id="task_1",
                               trigger_dag_id="dag_triggered",
                               provide_context=True,
                               python_callable=task_1_run,
                               dag=dag)
And dag_triggered dag:
import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from airflow.operators.sensors import S3KeySensor

default_args = {
    'owner': 'airflow',
    'start_date': datetime.datetime.now()
}

dag = DAG('dag_triggered', default_args=default_args, schedule_interval=None)

wait_files_to_arrive_task = S3KeySensor(
    task_id='wait_file_to_arrive',
    bucket_key='file_%s' % '',  # Here I want to place conf['sent_variable']
    wildcard_match=True,
    bucket_name='test-bucket',
    s3_conn_id='test_s3_conn',
    timeout=18*60*60,
    poke_interval=120,
    dag=dag)
I tried to get the value from the dag object using dag.get_dagrun().conf['sent_variable'], but I am unsure how to set the DAG run's create_date variable (dag_trigger will trigger dag_triggered every hour, and dag_triggered could wait longer than that for the file).
I also tried to create a PythonOperator upstream of wait_files_to_arrive_task. The Python callable could get information about sent_variable. After that I tried to set the value of bucket_key like bucket_key = callable_function(), but I have a problem with the arguments.
I also think global variables are not a good solution.
Maybe someone has an idea that works.
It's not possible to fetch a value in your DAG run conf directly in your DAG file. That's something that cannot be determined without context of which DAG run it's part of. One way to think about it is when you run python my_dag.py to test if your DAG file compiles, it has to initialize all these operators without needing to specify an execution date. So anything that could differ by DAG run can't be referenced directly.
So instead, you can pass it as a template value which will later get rendered with context when the task is actually being run.
wait_files_to_arrive_task = S3KeySensor(
    task_id='wait_file_to_arrive',
    bucket_key='file_{{ dag_run.conf["message"] }}',
    ...)
Note that only parameters listed in the template_fields of an operator will be rendered. Luckily someone anticipated this so bucket_key is indeed a template field.