How to avoid dynamic evaluation of an expression in a DAG parameter in Airflow? - python

I'm using a parameter that holds a timestamp in a set of tasks:
default_dag_args = {'arg1': 'arg1-value',
                    'arg2': 'arg2-value',
                    'now': datetime.now()}
I would like the now parameter to have the same value for all the tasks, but it is re-evaluated for each function.
Is there a way of doing this (evaluating it once and using the same value throughout the DAG)? I'm using the TaskFlow API for Airflow 2.0:
@task
def python_task():
    context = get_current_context()
    context_dag = context['dag']
    now = context_dag.default_args['now']
    print(now)

I tried to set the time as a constant at the start of the DAG file, like:
TIME = datetime.now()
and got the context inside the tasks with get_current_context(), just like you did.
Sadly, because the DAG file is run from the start every time, the time was recalculated each time a task got defined in the script.
One idea is to use XComs to save the datetime once and pull it into the other tasks:
My sample code is below, I think you'll get the idea.
from airflow.decorators import task, dag
from datetime import datetime
import time

default_arguments = {
    'owner': 'admin',
    # This is the beginning, for more see: https://airflow.apache.org/faq.html#what-s-the-deal-with-start-date
    'start_date': datetime(2022, 5, 2)
}

@dag(
    schedule_interval=None,
    dag_id="Time_Example_Dag",
    default_args=default_arguments,
    catchup=False,
)
def the_global_time_checker_dag():

    @task
    def time_set():
        # To use XCOM to pass the value between tasks,
        # we have to parse the datetime to a string.
        now = str(datetime.now())
        return now

    @task
    def starting_task(datetime_string):
        important_number = 23
        # We can use this datetime object in whatever way we like.
        date_time_obj = datetime.strptime(datetime_string, '%Y-%m-%d %H:%M:%S.%f')
        print(date_time_obj)
        return important_number

    @task
    def important_task(datetime_string, number):
        # Passing some time
        time.sleep(10)
        # Again, we are free to do whatever we want with this object.
        date_time_obj = datetime.strptime(datetime_string, '%Y-%m-%d %H:%M:%S.%f')
        print(date_time_obj)
        print("The important number is: {}".format(number))

    time_right_now = time_set()
    start = starting_task(datetime_string=time_right_now)
    important = important_task(datetime_string=time_right_now, number=start)

time_checker = the_global_time_checker_dag()
Through the logs, you can see all the datetime values are the same.
For more information about XComs in the TaskFlow API, you can check the Airflow documentation.

When a worker gets a task instance to run, it rebuilds the whole DagBag from the Python files to get the DAG and task definitions. So every time a task instance is run, your DAG file is sourced, rerunning your DAG definition code. The resulting DAG object is the one that the particular task instance will be defined by.
It's critical to understand that the DAG definition is not simply built once for every execution date and then persisted/reused for all TIs within that DagRun. The DAG definition is constantly being recomputed from your Python code; each TI is run in a separate process, independently and without state from other tasks. Thus, if your DAG definition includes non-deterministic results at DagBag build time - such as datetime.now() - every instantiation of your DAG, even for the same execution date, will have different values. You need to build your DAGs in a deterministic and idempotent manner.
The only way to share non-deterministic results is to store them in the DB and have your tasks fetch them, as @sezai-burak-kantarcı has pointed out. Best practice is to use task context-specific variables, like {{ ds }}, {{ execution_date }}, {{ data_interval_start }}, etc. These are the same for all tasks within a DAG run. You can see the available template variables in the Airflow templates reference.
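For illustration, here is a minimal sketch (not part of the original answers) of reading a per-run value from the task context instead of default_args; it assumes Airflow 2.2+, where data_interval_start is available in the context and is identical for every task in the same DAG run:
from datetime import datetime

from airflow.decorators import dag, task
from airflow.operators.python import get_current_context

@dag(schedule_interval=None, start_date=datetime(2022, 5, 2), catchup=False)
def context_time_example():

    @task
    def first_task():
        # data_interval_start comes from the DAG run, so it is the same value
        # no matter how many times the file is re-parsed.
        context = get_current_context()
        print(context["data_interval_start"])

    @task
    def second_task():
        context = get_current_context()
        print(context["data_interval_start"])

    first_task() >> second_task()

context_time_example()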

Related

AzureML not able to Schedule Pipeline endpoint

I made a minimal pipeline with a single step in AML. I've published this pipeline and I have an id and a REST endpoint for it.
When I try to create a schedule on this pipeline, I get no error, but it never launches.
from azureml.core.runconfig import RunConfiguration
from azureml.pipeline.steps import PythonScriptStep
from azureml.pipeline.core import Pipeline

datastore = ws.get_default_datastore()
minimal_run_config = RunConfiguration()
minimal_run_config.environment = myenv  # Custom Env with Dockerfile from mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04:latest + openSDK 11 + pip/conda packages
step_name = experiment_name
script_step_1 = PythonScriptStep(
    name=step_name,
    script_name="main.py",
    arguments=args,
    compute_target=cpu_cluster,
    source_directory=str(source_path),
    runconfig=minimal_run_config,
)
pipeline = Pipeline(
    workspace=ws,
    steps=[
        script_step_1,
    ],
)
pipeline.validate()
pipeline.publish(name=experiment_name + "_pipeline")
I can trigger this pipeline over REST from Python:
from azureml.core.authentication import InteractiveLoginAuthentication
from azureml.pipeline.core import PublishedPipeline
import requests

auth = InteractiveLoginAuthentication()
aad_token = auth.get_authentication_header()
pipelines = PublishedPipeline.list(ws)
rest_endpoint1 = [p for p in pipelines if p.name == experiment_name + "_pipeline"][0]
response = requests.post(rest_endpoint1.endpoint,
                         headers=aad_token,
                         json={"ExperimentName": experiment_name,
                               "RunSource": "SDK",
                               "ParameterAssignments": {"KEY": "value"}})
But when I use Schedule, I get no warning, no error, and nothing is triggered if I set start_time in ScheduleRecurrence. If I don't use start_time, my pipeline is triggered and launches immediately, and I don't want this. For example, I'm running the Schedule setter today, but I want its first trigger to run only on the second of each month at 4 pm.
from azureml.pipeline.core.schedule import ScheduleRecurrence, Schedule
import datetime

first_run = datetime.datetime(2022, 10, 2, 16, 00)
schedule_name = f"Recocpc monthly run PP {first_run.day:02} {first_run.hour:02}:{first_run.minute:02}"
recurrence = ScheduleRecurrence(
    frequency="Month",
    interval=1,
    start_time=first_run,
)
recurrence.validate()
recurring_schedule = Schedule.create_for_pipeline_endpoint(
    ws,
    name=schedule_name,
    description="Recocpc monthly run PP",
    pipeline_endpoint_id=pipeline_endpoint.id,
    experiment_name=experiment_name,
    recurrence=recurrence,
    pipeline_parameters={"KEY": "value"}
)
If I comment out start_time, it works, but the first run is now, not when I want it.
So I was not aware of how start_time works. It uses the same DAG-style logic as Airflow.
Here is an example:
Today is 10-01-2022 (dd-mm-yyyy).
You want your pipeline to run every month, on the 10th of each month at 14:00.
Then your start_time is not 2022-01-10T14:00:00, but should be 2021-12-10T14:00:00.
Your schedule will only trigger once it has made a full revolution of the interval you asked for (here one month).
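A minimal sketch of that rule (not from the original post; it assumes the python-dateutil package is available): the start_time handed to ScheduleRecurrence sits one full interval before the first run you actually want.
import datetime

from dateutil.relativedelta import relativedelta
from azureml.pipeline.core.schedule import ScheduleRecurrence

# First run I actually want: the 10th of the month at 14:00 (the example above).
first_desired_run = datetime.datetime(2022, 1, 10, 14, 0)

# Monthly frequency, so start_time goes one interval (one month) back.
recurrence = ScheduleRecurrence(
    frequency="Month",
    interval=1,
    start_time=first_desired_run - relativedelta(months=1),  # 2021-12-10T14:00:00
)
recurrence.validate()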
Maybe the official documentation should be more explicit about this mechanism for newbies like me who have never used DAGs in their lives.

Get session parameter for airflow.models.dag get_last_dagrun()

I am trying to pass my custom Operator a parameter which is the last run time of the DAG itself.
Following the documentation, I understand that I should use dag.get_last_dagrun() (https://airflow.apache.org/docs/apache-airflow/1.10.6/_api/airflow/models/dag/index.html#airflow.models.dag.get_last_dagrun). However, I can't manage to pass the session parameter correctly.
Where can I find this?
When using the function without parameters, it returns None.
I think that's because I triggered the DAG myself, so I want to set include_externally_triggered to True. But I still need to handle the session parameter first.
I tried to create the variable last_run before creating the DAG and also when defining the tasks. I suppose that inside the task, self is included and it will fetch correctly without passing any parameters.
But what about the call that is outside of the DAG?
I have also tried this solution, which gives me a time even when it's the first time I run the DAG (I have cleared the DAG logs from the UI). Maybe it's the timestamp of the currently executing DAG run? If so, I would need to compare the dates and exclude it when they are equal: https://stackoverflow.com/a/63930004/18036486
from datetime import datetime

from airflow import DAG
from DAG.operators.custom_operator1 import customOperator1

last_run = dag.get_last_dagrun()  # HERE

default_args = {
    "owner": "admin",
    "depends_on_past": False,
    "email": ["email@email.com"],
    "email_on_failure": False,
    "email_on_retry": False,
    "retries": 1,
}

with DAG(
    dag_id="Custom",
    schedule_interval="@once",
    description="Desc",
    start_date=datetime(2022, 3, 11),
    catchup=False,
    tags=["custom"],
    default_args=default_args) as dag:

    # Custom Operator
    custom = customOperator1(
        task_id='custom',
        last_run=dag.get_last_dagrun()  # OR HERE
    )

    custom
The actual answer at https://stackoverflow.com/a/63930004/18036486 includes the currently running DAG run. Therefore, I slightly modified the function to keep only DAG runs whose state is 'success'; of course, you can add conditions for the other DAG run states:
Now I can get the latest successful DAG run's execution_date to dynamically update my data!
from airflow.models import DagRun

def get_last_exec_date(dag_id):
    dag_runs = DagRun.find(dag_id=dag_id)
    dags = []
    for dag in dag_runs:
        if dag.state == 'success':
            dags.append(dag)
    # Sort newest first so index 0 is the most recent successful run.
    dags.sort(key=lambda x: x.execution_date, reverse=True)
    return dags[0] if dags != [] else None
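A hedged usage sketch (the DAG and customOperator1 come from the question; the wiring is only illustrative): the helper can be called where the operator is defined, keeping in mind that it queries the metadata DB at every DAG parse and returns None before the first successful run.
with DAG(
    dag_id="Custom",
    schedule_interval="@once",
    description="Desc",
    start_date=datetime(2022, 3, 11),
    catchup=False,
    tags=["custom"],
    default_args=default_args) as dag:

    custom = customOperator1(
        task_id='custom',
        # None on the very first run, otherwise the latest successful DagRun.
        last_run=get_last_exec_date("Custom"),
    )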

Python Celery inconsistent cronjob timing for task scheduling with now function

The situation
I have a Celery task that I run at a different timezone for each customer.
Basically, for each customer in my database, I get the timezone, and then I set up the celery task this way.
'schedule': crontab(minute=30, hour=14, nowfun=self.now_function)
Basically, what I want is for the task to run at 14:30 in the customer's timezone. Hence the now_function.
My now_function just gets the current time in the customer's timezone.
def now_function(self):
    """
    Return the now function for the task.
    This is used to compute the time a task should be scheduled for a given customer.
    """
    return datetime.now(timezone(self.customer.c_timezone))
What is happening
I am getting inconsistencies in the times the tasks run. Some days they run at the expected time, i.e. 14:30 in the customer's timezone; if the timezone is America/Chicago, that means 20:30 UTC, which is my expected behavior.
Some other days, it runs at 14:30 UTC, which is just the crontab time interpreted in UTC.
I am tracking whether there is a pattern in the days the tasks run at the correct time versus the days they run at the incorrect time.
Additional Information
I have tried this on Celery 4.4.2 and 5.x, but it still has the same behavior.
Here is my celery config.
CELERY_REDIS_SCHEDULER_URL = redis_instance_url
logger.debug("****** CELERY_REDIS_SCHEDULER_URL: ", CELERY_REDIS_SCHEDULER_URL)
logger.debug("****** environment: ", environment)
redbeat_redis_url = CELERY_REDIS_SCHEDULER_URL
broker_url = CELERY_REDIS_SCHEDULER_URL
result_backend = CELERY_REDIS_SCHEDULER_URL
task_serializer = 'pickle'
result_serializer = 'pickle'
accept_content = ['pickle']
enable_utc = False
task_track_started = True
task_send_sent_event = True
You can notice enable_utc is set to False.
I am using Redis instance from AWS to run my task.
I am using the RedBeatScheduler scheduler from this package to schedule my tasks.
If anyone has experienced this issue or can help me to reproduce it, I will be very thankful.
Other edits:
I have other crons for the same job at the same time, running weekly and monthly, and they are working perfectly.
weekly_schedule : crontab(minute=30, hour=14, nowfun=self.now_function, day_of_week=1)
monthly_schedule : crontab(minute=30, hour=14, nowfun=self.now_function, day_of_month=1)
Sample Project
Here is a sample project on GitHub if you want to run and reproduce the issue.
RedBeat's encoder and decoder don't support nowfun.
Source code: https://github.com/sibson/redbeat/blob/e6d72e2/redbeat/decoder.py#L94-L102
The behaviour you see was described previously: sibson/redbeat#192 (comment 756397651)
You can subclass and replace RedBeatJSONDecoder and RedBeatJSONEncoder.
Since nowfun has to be JSON serializable, we can only support some special cases,
e.g. nowfun=partial(datetime.now, tz=pytz.timezone(self.customer.c_timezone))
from datetime import datetime
from functools import partial

from celery.schedules import crontab
import pytz
from pytz.tzinfo import DstTzInfo
from redbeat.decoder import RedBeatJSONDecoder, RedBeatJSONEncoder

class CustomJSONDecoder(RedBeatJSONDecoder):
    def dict_to_object(self, d):
        if '__type__' not in d:
            return d
        objtype = d.pop('__type__')
        if objtype == 'crontab':
            if d.get('nowfun', {}).get('keywords', {}).get('zone'):
                d['nowfun'] = partial(datetime.now, tz=pytz.timezone(d.pop('nowfun')['keywords']['zone']))
            return crontab(**d)
        d['__type__'] = objtype
        return super().dict_to_object(d)

class CustomJSONEncoder(RedBeatJSONEncoder):
    def default(self, obj):
        if isinstance(obj, crontab):
            d = super().default(obj)
            if 'nowfun' not in d and isinstance(obj.nowfun, partial) and obj.nowfun.func == datetime.now:
                zone = None
                if obj.nowfun.args and isinstance(obj.nowfun.args[0], DstTzInfo):
                    zone = obj.nowfun.args[0].zone
                elif isinstance(obj.nowfun.keywords.get('tz'), DstTzInfo):
                    zone = obj.nowfun.keywords['tz'].zone
                if zone:
                    d['nowfun'] = {'keywords': {'zone': zone}}
            return d
        return super().default(obj)
Replace the classes in redbeat.schedulers:
from redbeat import schedulers
schedulers.RedBeatJSONDecoder = CustomJSONDecoder
schedulers.RedBeatJSONEncoder = CustomJSONEncoder
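A hedged sketch of a schedule entry that would round-trip through these classes (the task name is hypothetical, and the timezone string stands in for self.customer.c_timezone from the question):
from datetime import datetime
from functools import partial

import pytz
from celery.schedules import crontab

customer_tz = "America/Chicago"  # would come from self.customer.c_timezone

beat_entry = {
    'daily-customer-task': {
        'task': 'tasks.send_daily_report',  # hypothetical task name
        # nowfun is built as partial(datetime.now, tz=...) so the custom encoder
        # can serialize the timezone and the decoder can rebuild the partial.
        'schedule': crontab(minute=30, hour=14,
                            nowfun=partial(datetime.now, tz=pytz.timezone(customer_tz))),
    },
}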

Airflow xcom_pull is not giving the data of same upstream task instance run, instead gives most recent data

I am creating an Airflow @daily DAG. It has an upstream task get_daily_data (a BigQueryGetDataOperator) which fetches data based on execution_date, and a downstream dependent task (a PythonOperator) that uses that date-based data via xcom_pull. When I run the airflow backfill command, the downstream task process_data_from_bq, where I am doing the xcom_pull, gets only the most recent data, not the data for the same execution date that the downstream task is expecting.
The Airflow documentation says: "If xcom_pull is passed a single string for task_ids, then the most recent XCom value from that task is returned."
However, it does not say how to get the data for the same DAG run.
I went through the similar question "How to pull xcom value from other task instance in the same DAG run (not the most recent one)?"; however, the solution given there is what I am already doing, so it does not seem to be the correct answer.
DAG definition:
dag = DAG(
    'daily_motor',
    default_args=default_args,
    schedule_interval='@daily'
)

# This task creates data in a BigQuery table based on execution date
extract_daily_data = BigQueryOperator(
    task_id='daily_data_extract',
    use_legacy_sql=False,
    write_disposition='WRITE_TRUNCATE',
    allow_large_results=True,
    sql=policy_by_transaction_date_sql('{{ ds }}'),
    destination_dataset_table='Test.daily_data_tmp',
    dag=dag)

get_daily_data = BigQueryGetDataOperator(
    task_id='get_daily_data',
    dataset_id='Test',
    table_id='daily_data_tmp',
    max_results='10000',
    dag=dag
)

# This is where I need to pull the data of the same execution date / same DAG run, not the most recent task run
def process_bq_data(**kwargs):
    bq_data = kwargs['ti'].xcom_pull(task_ids='get_daily_data')
    # This bq_data is the most recent one, not the one for the same execution date
    obj_creator = IibListToObject()
    items = obj_creator.create(bq_data, 'daily')
    save_daily_date_wise(items)

process_data = PythonOperator(
    task_id='process_data_from_bq',
    python_callable=process_bq_data,
    provide_context=True,
    dag=dag
)

get_daily_data.set_upstream(extract_daily_data)
process_data.set_upstream(get_daily_data)
You must be receiving the latest XCom value, but you should also make sure the values are actually from the same execution_date, as they are supposed to be by default; see the include_prior_dates parameter of xcom_pull:
:param include_prior_dates:
    If False, only XComs from the current
    execution_date are returned.
    If True, XComs from previous dates
    are returned as well.
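A small sketch (not from the original answer) of making that explicit in the downstream callable; include_prior_dates is a real xcom_pull argument, and the other names come from the question:
def process_bq_data(**kwargs):
    ti = kwargs['ti']
    # Restrict the pull to XComs written for this run's execution_date.
    bq_data = ti.xcom_pull(task_ids='get_daily_data', include_prior_dates=False)
    # Log the execution_date so backfilled runs can be checked against the pulled data.
    print("Processing data for execution_date:", kwargs['execution_date'])
    obj_creator = IibListToObject()
    items = obj_creator.create(bq_data, 'daily')
    save_daily_date_wise(items)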

How to dynamically add bucket_key value in airflow's S3KeySensor

I'm trying to set S3KeySensor's bucket_key based on a DAG run input variable.
I have one dag "dag_trigger" that uses TriggerDagRunOperator to trigger dagrun for dag "dag_triggered". I'm trying to extend example https://github.com/apache/incubator-airflow/blob/master/airflow/example_dags/example_trigger_target_dag.py.
So I want to send a variable to the triggered DAG, and set the bucket_key value in the S3KeySensor task according to the variable's value. I know how to use the sent variable in a PythonOperator callable function, but I do not know how to use it on the sensor object.
dag_trigger dag:
import datetime

from airflow import DAG
from airflow.operators.dagrun_operator import TriggerDagRunOperator

default_args = {
    'owner': 'airflow',
    'start_date': datetime.datetime.now()}

dag = DAG('dag_trigger', default_args=default_args, schedule_interval="@hourly")

def task_1_run(context, dag_run_object):
    sent_variable = '2018_02_19'  # not important
    dag_run_object.payload = {'message': sent_variable}
    print("DAG dag_trigger triggered with payload: %s" % dag_run_object.payload)
    return dag_run_object

task_1 = TriggerDagRunOperator(task_id="task_1",
                               trigger_dag_id="dag_triggered",
                               provide_context=True,
                               python_callable=task_1_run,
                               dag=dag)
And dag_triggered dag:
import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from airflow.operators.sensors import S3KeySensor

default_args = {
    'owner': 'airflow',
    'start_date': datetime.datetime.now()
}

dag = DAG('dag_triggered', default_args=default_args, schedule_interval=None)

wait_files_to_arrive_task = S3KeySensor(
    task_id='wait_file_to_arrive',
    bucket_key='file_%s' % '',  # Here I want to place conf['sent_variable']
    wildcard_match=True,
    bucket_name='test-bucket',
    s3_conn_id='test_s3_conn',
    timeout=18 * 60 * 60,
    poke_interval=120,
    dag=dag)
I tried to get the value from the dag object using dag.get_dagrun().conf['sent_variable'], but I am unsure how to set the dagrun create_date variable (dag_trigger will trigger dag_triggered every hour, and dag_triggered could wait longer for the file).
I also tried to create a PythonOperator upstream of wait_files_to_arrive_task. Its callable Python function could get the information about sent_variable. After that I tried to set the value for bucket_key like bucket_key = callable_function(), but I had problems with the arguments.
And I also think global variables are not a good solution.
Maybe someone has an idea that works.
It's not possible to fetch a value in your DAG run conf directly in your DAG file. That's something that cannot be determined without context of which DAG run it's part of. One way to think about it is when you run python my_dag.py to test if your DAG file compiles, it has to initialize all these operators without needing to specify an execution date. So anything that could differ by DAG run can't be referenced directly.
So instead, you can pass it as a template value which will later get rendered with context when the task is actually being run.
wait_files_to_arrive_task = S3KeySensor(
    task_id='wait_file_to_arrive',
    bucket_key='file_{{ dag_run.conf["message"] }}',
    ...)
Note that only parameters listed in the template_fields of an operator will be rendered. Luckily someone anticipated this so bucket_key is indeed a template field.
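If a parameter you need were not templated, a common workaround is to subclass the operator and extend template_fields; a minimal sketch (the extra field here is only an example, not something the question requires):
from airflow.operators.sensors import S3KeySensor

class TemplatedS3KeySensor(S3KeySensor):
    # Any attribute named here is rendered with Jinja before the sensor runs.
    template_fields = tuple(S3KeySensor.template_fields) + ('bucket_name',)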
