I am using Google Cloud Composer to launch Dataflow jobs.
My DAG consists of two Dataflow jobs that should run one after the other.
import datetime
from airflow.contrib.operators.dataflow_operator import DataflowTemplateOperator
from airflow import models
default_dag_args = {
    'start_date': datetime.datetime(2019, 10, 23),
    'dataflow_default_options': {
        'project': 'myproject',
        'region': 'europe-west1',
        'zone': 'europe-west1-c',
        'tempLocation': 'gs://somebucket/',
    }
}
with models.DAG(
        'some_name',
        schedule_interval=datetime.timedelta(days=1),
        default_args=default_dag_args) as dag:

    parameters = {'params': "param1"}
    t1 = DataflowTemplateOperator(
        task_id='dataflow_example_01',
        template='gs://path/to/template/template_001',
        parameters=parameters,
        dag=dag)

    parameters2 = {'params': "param2"}
    t2 = DataflowTemplateOperator(
        task_id='dataflow_example_02',
        template='gs://path/to/templates/template_002',
        parameters=parameters2,
        dag=dag)

    t1 >> t2
When I check in Dataflow, the job has succeeded and all the files it is supposed to create are there, but it appears to have run in the US region, while the Cloud Composer environment is in europe-west.
In Airflow I can see that the first job is still marked as running, so the second one is never launched.
What should I add to the DAG to make it succeed, and how do I make it run in Europe?
Any advice or solution on how to proceed would be most appreciated. Thanks!
I had to solve this issue in the past. In Airflow 1.10.2 (or lower) the code calls the service.projects().templates().launch() endpoint. This was fixed in 1.10.3, where the regional one is used instead: service.projects().locations().templates().launch().
As of October 2019, the latest Airflow version available for Composer environments is 1.10.2. If you need a solution immediately, the fix can be back-ported into Composer.
For this, we can subclass DataflowTemplateOperator into our own version, RegionalDataflowTemplateOperator:
class RegionalDataflowTemplateOperator(DataflowTemplateOperator):
    def execute(self, context):
        hook = RegionalDataFlowHook(gcp_conn_id=self.gcp_conn_id,
                                    delegate_to=self.delegate_to,
                                    poll_sleep=self.poll_sleep)
        hook.start_template_dataflow(self.task_id, self.dataflow_default_options,
                                     self.parameters, self.template)
This will now make use of the modified RegionalDataFlowHook, which overrides the _start_template_dataflow method of DataFlowHook so that the correct endpoint is called:
class RegionalDataFlowHook(DataFlowHook):
    def _start_template_dataflow(self, name, variables, parameters,
                                 dataflow_template):
        ...
        request = service.projects().locations().templates().launch(
            projectId=variables['project'],
            location=variables['region'],
            gcsPath=dataflow_template,
            body=body
        )
        ...
        return response
Then, we can create a task using our new operator and a Google-provided template (for testing purposes):
task = RegionalDataflowTemplateOperator(
    task_id=JOB_NAME,
    template=TEMPLATE_PATH,
    parameters={
        'inputFile': 'gs://dataflow-samples/shakespeare/kinglear.txt',
        'output': 'gs://{}/europe/output'.format(BUCKET)
    },
    dag=dag,
)
Full working DAG here. For a cleaner version, the operator can be moved into a separate module.
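If it helps, a rough sketch of that layout (the module name dataflow_regional is just a placeholder, not part of the original answer):
# dataflow_regional.py: holds RegionalDataFlowHook and
# RegionalDataflowTemplateOperator exactly as defined above.

# In the DAG file, import the operator instead of defining it inline:
from dataflow_regional import RegionalDataflowTemplateOperator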
I made a minimal pipeline with a single step in AML. I've published this pipeline and I have an id and a REST endpoint for it.
When I try to create a schedule for this pipeline, I get no error, but it never launches.
from azureml.core.runconfig import RunConfiguration
from azureml.pipeline.steps import PythonScriptStep
from azureml.pipeline.core import Pipeline

datastore = ws.get_default_datastore()

minimal_run_config = RunConfiguration()
minimal_run_config.environment = myenv  # Custom Env with Dockerfile from mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04:latest + openSDK 11 + pip/conda packages

step_name = experiment_name

script_step_1 = PythonScriptStep(
    name=step_name,
    script_name="main.py",
    arguments=args,
    compute_target=cpu_cluster,
    source_directory=str(source_path),
    runconfig=minimal_run_config,
)

pipeline = Pipeline(
    workspace=ws,
    steps=[
        script_step_1,
    ],
)

pipeline.validate()
pipeline.publish(name=experiment_name + "_pipeline")
I can trigger this pipeline over REST from Python:
from azureml.core.authentication import InteractiveLoginAuthentication
from azureml.pipeline.core import PublishedPipeline
import requests

auth = InteractiveLoginAuthentication()
aad_token = auth.get_authentication_header()

pipelines = PublishedPipeline.list(ws)
rest_endpoint1 = [p for p in pipelines if p.name == experiment_name + "_pipeline"][0]

response = requests.post(rest_endpoint1.endpoint,
                         headers=aad_token,
                         json={"ExperimentName": experiment_name,
                               "RunSource": "SDK",
                               "ParameterAssignments": {"KEY": "value"}})
But when I use a Schedule with a start_time in the ScheduleRecurrence, I get no warning, no error, and nothing is ever triggered. If I don't use start_time, my pipeline is triggered and launches immediately, which I don't want. For example, I'm creating the schedule today, but I want its first trigger to run only on the second of each month at 4 pm.
from azureml.pipeline.core.schedule import ScheduleRecurrence, Schedule
import datetime

first_run = datetime.datetime(2022, 10, 2, 16, 00)
schedule_name = f"Recocpc monthly run PP {first_run.day:02} {first_run.hour:02}:{first_run.minute:02}"

recurrence = ScheduleRecurrence(
    frequency="Month",
    interval=1,
    start_time=first_run,
)
recurrence.validate()

recurring_schedule = Schedule.create_for_pipeline_endpoint(
    ws,
    name=schedule_name,
    description="Recocpc monthly run PP",
    pipeline_endpoint_id=pipeline_endpoint.id,
    experiment_name=experiment_name,
    recurrence=recurrence,
    pipeline_parameters={"KEY": "value"},
)
If I comment out start_time, it works, but the first run happens now, not when I want it.
So I was not aware of how start_time works. It uses DAG-style logic, like in Airflow.
Here is an example:
Today is 10-01-2022 (dd-mm-yyyy).
You want your pipeline to run every month, once, on the 10th of each month at 14:00.
Then your start_time is not 2022-01-10T14:00:00, but should be 2021-12-10T14:00:00.
The scheduler only triggers once it has completed a full revolution of the interval you asked for (here, one month).
Maybe the official documentation should be more explicit about this mechanism for newbies like me who have never used DAGs.
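Applied to the original snippet above, a minimal sketch could look like this (the one-month offset is an assumption based on the behaviour described, not an official rule from the docs):
from azureml.pipeline.core.schedule import ScheduleRecurrence
import datetime

# Desired first trigger: 2022-10-02 at 16:00, recurring monthly.
# start_time is set one full interval (one month) earlier, so the scheduler
# has completed a whole revolution by the time of the desired first run.
recurrence = ScheduleRecurrence(
    frequency="Month",
    interval=1,
    start_time=datetime.datetime(2022, 9, 2, 16, 0),
)
recurrence.validate()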
I'm using a parameter that is the timestamp in a set of tasks:
default_dag_args = {'arg1': 'arg1-value',
                    'arg2': 'arg2-value',
                    'now': datetime.now()}
I would like the now parameter to have the same value for all the tasks, but what happens is that it's re-evaluated for each task.
Is there a way to do this (evaluate it once and use the same value throughout the DAG)? I'm using the TaskFlow API for Airflow 2.0:
from airflow.decorators import task
from airflow.operators.python import get_current_context


@task
def python_task():
    context = get_current_context()
    context_dag = context['dag']
    now = context_dag.default_args['now']
    print(now)
I tried to set the time as a constant at the start of the DAG file, like:
TIME = datetime.now()
and got the context inside the tasks with get_current_context(), just like you did.
Sadly, I think that because the DAG file is re-parsed from the start every time a task is run, the time was recalculated each time a task got defined in the script.
One idea I have is to use XComs in order to save the datetime in one task and pull it into the other tasks.
My sample code is below; I think you'll get the idea.
from airflow.decorators import task, dag
from datetime import datetime
import time

default_arguments = {
    'owner': 'admin',
    # This is the beginning, for more see: https://airflow.apache.org/faq.html#what-s-the-deal-with-start-date
    'start_date': datetime(2022, 5, 2)
}


@dag(
    schedule_interval=None,
    dag_id="Time_Example_Dag",
    default_args=default_arguments,
    catchup=False,
)
def the_global_time_checker_dag():

    @task
    def time_set():
        # To use XCom to pass the value between tasks,
        # we have to parse the datetime to a string.
        now = str(datetime.now())
        return now

    @task
    def starting_task(datetime_string):
        important_number = 23
        # We can use this datetime object in whatever way we like.
        date_time_obj = datetime.strptime(datetime_string, '%Y-%m-%d %H:%M:%S.%f')
        print(date_time_obj)
        return important_number

    @task
    def important_task(datetime_string, number):
        # Passing some time
        time.sleep(10)
        # Again, we are free to do whatever we want with this object.
        date_time_obj = datetime.strptime(datetime_string, '%Y-%m-%d %H:%M:%S.%f')
        print(date_time_obj)
        print("The important number is: {}".format(number))

    time_right_now = time_set()
    start = starting_task(datetime_string=time_right_now)
    important = important_task(datetime_string=time_right_now, number=start)


time_checker = the_global_time_checker_dag()
Through the logs, you can see all the datetime values are the same.
For more information about XComs in the TaskFlow API, you can check here.
When a worker gets a task instance to run, it rebuilds the whole DagBag from the Python files to get the DAG and task definition. So every time a task instance is run, your DAG file is sourced, re-running your DAG definition code. The resulting DAG object is the one the particular task instance will be defined by.
It's critical to understand that the DAG definition is not simply built once for every execution date and then persisted/reused for all TIs within that DagRun. The DAG definition is constantly being recomputed from your Python code, and each TI is run in a separate process, independently and without state from other tasks. Thus, if your DAG definition includes non-deterministic results at DagBag build time - such as datetime.now() - every instantiation of your DAG, even for the same execution date, will have different values. You need to build your DAGs in a deterministic and idempotent manner.
The only way to share non-deterministic results is to store them in the DB and have your tasks fetch them, as @sezai-burak-kantarcı has pointed out. Best practice is to use task-context-specific variables, like {{ ds }}, {{ execution_date }}, {{ data_interval_start }}, etc. These are the same for all tasks within a DAG run. You can see the template variables available here: Airflow templates reference
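For example, a minimal sketch of the original use case relying on the data interval instead of datetime.now() (the DAG and task names are illustrative; data_interval_start requires Airflow 2.2+, on older versions execution_date behaves the same way):
from datetime import datetime

from airflow.decorators import dag, task
from airflow.operators.python import get_current_context


@dag(schedule_interval=None, start_date=datetime(2022, 5, 2), catchup=False)
def shared_timestamp_dag():

    @task
    def first_task():
        # data_interval_start is fixed per DAG run, so every task in the
        # same run sees exactly the same value.
        context = get_current_context()
        print(context["data_interval_start"])

    @task
    def second_task():
        context = get_current_context()
        print(context["data_interval_start"])

    first_task() >> second_task()


shared_timestamp_dag()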
I am trying to pass a parameter into my custom operator which is the last run time of the DAG itself.
Following the documentation, I understand that I should use dag.get_last_dagrun() https://airflow.apache.org/docs/apache-airflow/1.10.6/_api/airflow/models/dag/index.html#airflow.models.dag.get_last_dagrun. However, I can't manage to pass the session parameter correctly.
Where can I find this?
When using the function without parameters, it returns None.
I think that's because I triggered the DAG myself, so I want to set include_externally_triggered to True. But I still need to handle the session parameter first.
I tried to create the variable last_run before creating the DAG and also when defining the tasks. I suppose that inside the task, self is available and the call will work without passing any parameters.
But what about the call that sits outside of the DAG?
I have also tried this solution, which gives me a time even on the very first run of the DAG (I had cleaned the DAG logs from the UI). Maybe it's the timestamp of the currently executing DAG run? If so, I would need to compare the dates and exclude it when they are equal: https://stackoverflow.com/a/63930004/18036486
from datetime import datetime

from airflow import DAG
from DAG.operators.custom_operator1 import customOperator1

last_run = dag.get_last_dagrun()  # HERE

default_args = {
    "owner": "admin",
    "depends_on_past": False,
    "email": ["email@email.com"],
    "email_on_failure": False,
    "email_on_retry": False,
    "retries": 1,
}

with DAG(
        dag_id="Custom",
        schedule_interval="@once",
        description="Desc",
        start_date=datetime(2022, 3, 11),
        catchup=False,
        tags=["custom"],
        default_args=default_args) as dag:

    # Custom Operator
    custom = customOperator1(
        task_id='custom',
        last_run=dag.get_last_dagrun()  # OR HERE
    )

    custom
The function in the actual answer at https://stackoverflow.com/a/63930004/18036486 includes the currently running DAG run. Therefore, I slightly modified it to keep only runs whose state is 'success' (you can of course add conditions for the other DAG states):
Now I can get the execution_date of the latest successful DAG run and use it to dynamically update my data!
from airflow.models import DagRun


def get_last_exec_date(dag_id):
    dag_runs = DagRun.find(dag_id=dag_id)
    dags = []
    for dag in dag_runs:
        if dag.state == 'success':
            dags.append(dag)
    # Sort newest first, so dags[0] is the latest successful run.
    dags.sort(key=lambda x: x.execution_date, reverse=True)
    return dags[0] if dags != [] else None
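A hedged sketch of how this could be wired into the DAG from the question (customOperator1 and its last_run parameter come from the question, so their exact interface is an assumption; note that DagRun.find runs a database query at parse time):
from datetime import datetime

from airflow import DAG
from DAG.operators.custom_operator1 import customOperator1  # asker's custom operator

# get_last_exec_date is the helper defined above.
last_success = get_last_exec_date("Custom")

with DAG(
        dag_id="Custom",
        schedule_interval="@once",
        start_date=datetime(2022, 3, 11),
        catchup=False) as dag:

    custom = customOperator1(
        task_id="custom",
        # execution_date of the latest successful run, or None on the very first run.
        last_run=last_success.execution_date if last_success else None,
    )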
I'm trying to create a Glue ETL job using boto3 with the script below. I want to create it as type=Spark, but the script creates a type=Python Shell job. It also doesn't disable bookmarks. Does anyone know what I need to add to make it a Spark job and disable bookmarks?
script:
response = glue_assumed_client.create_job(
    Name='mlxxxx',
    Role='Awsxxxx',
    Command={
        'Name': 'mlxxxx',
        'ScriptLocation': 's3://aws-glue-scripts-xxxxx-us-west-2/xxxx',
        'PythonVersion': '3'
    },
    Connections={
        'Connections': [
            'sxxxx',
            'spxxxxxx',
        ]
    },
    Timeout=2880,
    MaxCapacity=10
)
To create a Spark job, you have to set the command name to 'glueetl', as shown below. If you are not running a Python shell job, you need not specify the Python version in the Command parameters:
response = client.create_job(
    Name='mlxxxyu',
    Role='Awsxxxx',
    Command={
        'Name': 'glueetl',  # <-- set the name to glueetl to create a Spark job
        'ScriptLocation': 's3://aws-glue-scripts-xxxxx-us-west-2/xxxx'
    },
    Connections={
        'Connections': [
            'sxxxx',
            'spxxxxxx',
        ]
    },
    Timeout=2880,
    MaxCapacity=10
)
Regarding job bookmarks: they are disabled by default, so if you don't specify a job bookmark parameter, the created job will have bookmarks disabled.
If you want to disable bookmarks explicitly, you can specify it in the DefaultArguments as shown below.
response = client.create_job(
    Name='mlxxxyu',
    Role='Awsxxxx',
    Command={
        'Name': 'glueetl',
        'ScriptLocation': 's3://aws-glue-scripts-xxxxx-us-west-2/xxxx'
    },
    DefaultArguments={
        '--job-bookmark-option': 'job-bookmark-disable'
    },
    Timeout=2880,
    MaxCapacity=10
)
See the documentation.
Command (dict) -- [REQUIRED] The JobCommand that executes this job.
Name (string) -- The name of the job command. For an Apache Spark ETL job, this must be glueetl. For a Python shell job, it must be pythonshell.
You may reset the bookmark by using the function
client.reset_job_bookmark(
    JobName='string',
    RunId='string'
)
where JobName is required. It can be obtained from response['Name'] returned by create_job().
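For example, a minimal sketch tying the two calls together (using a plain boto3 Glue client and the placeholder names from above):
import boto3

client = boto3.client('glue')

response = client.create_job(
    Name='mlxxxyu',
    Role='Awsxxxx',
    Command={
        'Name': 'glueetl',
        'ScriptLocation': 's3://aws-glue-scripts-xxxxx-us-west-2/xxxx'
    },
    Timeout=2880,
    MaxCapacity=10
)

# create_job returns the job name, which is exactly what reset_job_bookmark needs.
client.reset_job_bookmark(JobName=response['Name'])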
I am trying to add custom filters for my Airflow Jinja2 templates.
Since my folders in S3 are like
/year/month/day/
my purpose is to use yesterday_ds in the Variables screen like this:
s3://logs.web.com/AWSLogs/{{ yesterday_ds | get_year }}/{{ yesterday_ds | get_month }}/{{ yesterday_ds | get_day }}/
I have seen in a PR (which I think has already been merged) that you can do this with the 'user_defined_filters' parameter in the dag_args used when creating the DAG object here.
The problem is that even when I do that, it says 'no filter named get_year', for example.
This is my code:
dag.py
dag = DAG(
    dag_id='dag-name',
    default_args=utils.get_dag_args(user_defined_filters=utils.get_date_filters()),
    template_searchpath=tmpl_search_path,
    schedule_interval=timedelta(days=1),
    max_active_runs=1,
)
utils.py
def get_dag_args(**kwargs):
    return {
        'owner': kwargs.get('owner', 'owner_name'),
        'depends_on_past': kwargs.get('depends_on_past', False),
        'start_date': kwargs.get('start_date', datetime.now() - timedelta(1)),
        'email': kwargs.get('email', ['blabla@blabla.com']),
        'retries': kwargs.get('retries', 5),
        'provide_context': kwargs.get('provide_context', True),
        'retry_delay': kwargs.get('retry_delay', timedelta(minutes=5)),
        'user_defined_filters': get_date_filters(),
    }


def get_date_filters():
    return dict(
        get_year=lambda date_string: date_string.strftime('%Y'),
        get_month=lambda date_string: date_string.strftime('%m'),
        get_day=lambda date_string: date_string.strftime('%d'),
    )
Does anybody see where I am mistaken? Thank you!
EDIT
Printing this after the DAG definition shows no custom filters, unfortunately :(.
jinja_env = dag.get_template_env()
print(jinja_env.filters)
Also, if I try to add it directly as a DAG object parameter, as shown in the tests (tests/models.py), I get:
Broken DAG: [/home/ubuntu/airflow/dags/dag.py] __init__() got an unexpected keyword argument 'user_defined_filters'
EDIT 2
OK, what I see is that I have version 1.8.0, and this version does not have the filters. Does anybody know how to install the 1.8.2rc release via pip? Or can't we?
Airflow now has support for custom filters and macros.
Working code example:
from airflow import DAG
from datetime import datetime, timedelta

# Assumed here for completeness; the original answer leaves default_args undefined.
default_args = {
    'start_date': datetime(2021, 1, 1),
}


def first_day_of_month(any_day):
    return any_day.replace(day=1)


def last_day_of_month(any_day):
    next_month = any_day.replace(day=28) + timedelta(days=4)  # this will never fail
    return next_month - timedelta(days=next_month.day)


def isoformat_month(any_date):
    return any_date.strftime("%Y-%m")


with DAG(
    dag_id='generate_raw_logs',
    default_args=default_args,
    schedule_interval=timedelta(minutes=120),
    catchup=False,
    user_defined_macros={
        'first_day_of_month': first_day_of_month,
        'last_day_of_month': last_day_of_month,
    },
    user_defined_filters={
        'isoformat_month': isoformat_month
    }
) as dag:
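    # Usage sketch (an assumption, not part of the original answer): the
    # user-defined filter and macros can then be applied in any templated
    # field, for instance in a BashOperator command.
    from airflow.operators.bash_operator import BashOperator

    print_dates = BashOperator(
        task_id='print_dates',
        bash_command=(
            'echo "{{ execution_date | isoformat_month }}: '
            'from {{ first_day_of_month(execution_date) }} '
            'to {{ last_day_of_month(execution_date) }}"'
        ),
    )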
The Airflow package name on pip has changed; 1.8.2rc1 can be installed using pip install apache-airflow instead.
Also, note that according to the mailing list, they are currently working on releasing 1.8.2rc4 as 1.8.2.