I am trying to add custom filters for my Airflow Jinja2 templates.
Since my folders in S3 follow the pattern /year/month/day/, my goal is to use yesterday_ds in the Variables screen like this:
s3://logs.web.com/AWSLogs/{{ yesterday_ds | get_year }}/{{ yesterday_ds | get_month }}/{{ yesterday_ds | get_day }}/
I have seen in a PR (which I think is already merged) that you can do this with the 'user_defined_filters' parameter in the dag_args used when creating the DAG object here.
The problem is that even when I do this, it says there is 'no filter named get_year', for example.
This is my code:
dag.py
dag = DAG(
    dag_id='dag-name',
    default_args=utils.get_dag_args(user_defined_filters=utils.get_date_filters()),
    template_searchpath=tmpl_search_path,
    schedule_interval=timedelta(days=1),
    max_active_runs=1,
)
utils.py
def get_dag_args(**kwargs):
    return {
        'owner': kwargs.get('owner', 'owner_name'),
        'depends_on_past': kwargs.get('depends_on_past', False),
        'start_date': kwargs.get('start_date', datetime.now() - timedelta(1)),
        'email': kwargs.get('email', ['blabla@blabla.com']),
        'retries': kwargs.get('retries', 5),
        'provide_context': kwargs.get('provide_context', True),
        'retry_delay': kwargs.get('retry_delay', timedelta(minutes=5)),
        'user_defined_filters': get_date_filters()
    }

def get_date_filters():
    return dict(
        get_year=lambda date_string: date_string.strftime('%Y'),
        get_month=lambda date_string: date_string.strftime('%m'),
        get_day=lambda date_string: date_string.strftime('%d'),
    )
Does anybody see where I am mistaken? Thank you!
EDIT
Printing this after the DAG definition shows no custom filters, unfortunately :(.
jinja_env = dag.get_template_env()
print(jinja_env.filters)
Also, if I try to add it directly as a DAG object parameter, as shown in the tests (tests/models.py), I get:
Broken DAG: [/home/ubuntu/airflow/dags/dag.py] __init__() got an unexpected keyword argument 'user_defined_filters'
EDIT 2
OK, what I see is that I have version 1.8.0, and this one does not have the filters. Does anybody know how to download the 1.8.2rc release via pip? Or can't we?
Airflow has support for custom filters and macros now.
Working code example:
from airflow import DAG
from datetime import datetime, timedelta

def first_day_of_month(any_day):
    return any_day.replace(day=1)

def last_day_of_month(any_day):
    next_month = any_day.replace(day=28) + timedelta(days=4)  # day 28 exists in every month, so this never fails
    return next_month - timedelta(days=next_month.day)

def isoformat_month(any_date):
    return any_date.strftime("%Y-%m")

default_args = {'start_date': datetime(2019, 1, 1)}  # minimal placeholder default_args

with DAG(
    dag_id='generate_raw_logs',
    default_args=default_args,
    schedule_interval=timedelta(minutes=120),
    catchup=False,
    user_defined_macros={
        'first_day_of_month': first_day_of_month,
        'last_day_of_month': last_day_of_month,
    },
    user_defined_filters={
        'isoformat_month': isoformat_month
    }
) as dag:
    pass  # define tasks here; the macros and filters are available in every templated field
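The registered macros and filter can then be referenced in any templated field. A minimal sketch of that, assuming the DAG above (the task id and command are made up for illustration):

from airflow.operators.bash_operator import BashOperator

# hypothetical task showing the custom filter and one custom macro in a templated field
print_dates = BashOperator(
    task_id='print_dates',
    bash_command='echo "{{ execution_date | isoformat_month }} {{ last_day_of_month(execution_date) }}"',
    dag=dag,
)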
The Airflow package name on pip has changed: 1.8.2rc1 can be installed using pip install apache-airflow instead.
Also, note that according to the mailing list, they are currently working on releasing 1.8.2rc4 as 1.8.2.
Related
I have an Airflow DAG with a PostgresOperator to execute a SQL query. I want to switch between my test database and my prod database with config (run w/ config). But postgres_conn_id is not a template field, so PostgresOperator says "{{ dag_run.conf.get('CONN_ID_TEST', 'pg_database') }}" is not a connection.
I run this script with the {"CONN_ID_TEST": "pg_database_test"} config.
I tried to create a custom PostgreSQL operator with the same code as the Airflow GitHub repository and added template_fields: Sequence[str] = ("postgres_conn_id",) at the top of my CustomPostgresOperator class, but that doesn't work either (same error).
I have two conn_id environment variables:
AIRFLOW_CONN_ID_PG_DATABASE (prod)
AIRFLOW_CONN_ID_PG_DATABASE_TEST (test)
My script looks like:
from airflow import DAG
from airflow.providers.postgres.operators.postgres import PostgresOperator
from airflow.operators.dummy import DummyOperator
import datetime as dt

DAG_ID = "init_database"
POSTGRES_CONN_ID = "{{ dag_run.conf.get('CONN_ID_TEST', 'pg_database') }}"

with DAG(
    dag_id=DAG_ID,
    description="My dag",
    schedule_interval="@once",
    start_date=dt.datetime(2022, 1, 1),
    catchup=False,
) as dag:

    start = DummyOperator(task_id='start')

    my_task = PostgresOperator(  # OR CustomPostgresOperator
        task_id="select",
        sql="SELECT * FROM pets LIMIT 1;",
        postgres_conn_id=POSTGRES_CONN_ID,
        autocommit=True
    )

    start >> my_task
How can I proceed to solve my problem? And if it is not possible, how can I switch my PostgresOperator connection to my dev database without recreating another DAG script?
Thanks, Léo
Subclassing is a solid way to modify template_fields as you wish. Since template_fields is a class attribute, your subclass only really needs to be the following (assuming you're just adding the connection ID to the existing template_fields):
from airflow.providers.postgres.operators.postgres import PostgresOperator as _PostgresOperator

class PostgresOperator(_PostgresOperator):
    template_fields = [*_PostgresOperator.template_fields, "conn_id"]
The above uses Postgres provider version 5.3.1, which actually uses the Common SQL provider under the hood, so the connection attribute is actually conn_id. (template_fields refers to the instance attribute name rather than the parameter name.)
For example, assume the below DAG gets triggered with a run config of {"environment": "dev"}:
from pendulum import datetime
from airflow.decorators import dag
from airflow.providers.postgres.operators.postgres import PostgresOperator as _PostgresOperator

class PostgresOperator(_PostgresOperator):
    template_fields = [*_PostgresOperator.template_fields, "conn_id"]

@dag(start_date=datetime(2023, 1, 1), schedule=None)
def template_postgres_conn():
    PostgresOperator(task_id="run_sql", sql="SELECT 1;", postgres_conn_id="{{ dag_run.conf['environment'] }}")

template_postgres_conn()
Looking at the task log, the connection ID of "dev" is used to execute the SQL.
I am trying to pass my custom operator a parameter which is the last run time of the DAG itself.
Following the documentation, I understand that I should use dag.get_last_dagrun() (https://airflow.apache.org/docs/apache-airflow/1.10.6/_api/airflow/models/dag/index.html#airflow.models.dag.get_last_dagrun). However, I can't manage to pass the session parameter correctly.
Where can I find this?
When using the function without parameters, it returns None.
I think that it's because I triggered the DAG myself, so I want to set include_externally_triggered to True. But I still need to handle the session parameter first.
I tried to create the last_run variable before creating the DAG and also when defining the tasks. I suppose that inside the task, self is included and it will fetch correctly without passing any parameters.
But what about the one which is outside of the DAG?
I have also tried this solution (https://stackoverflow.com/a/63930004/18036486), which gives me a time even if it's the first time I run the DAG (I have cleaned the DAG log from the UI). Maybe it's the currently executing DAG run's timestamp? If so, I would need to compare the dates and exclude it if they are equal.
from airflow import DAG
from datetime import datetime
from DAG.operators.custom_operator1 import customOperator1

last_run = dag.get_last_dagrun()  # HERE

default_args = {
    "owner": "admin",
    "depends_on_past": False,
    "email": ["email@email.com"],
    "email_on_failure": False,
    "email_on_retry": False,
    "retries": 1,
}

with DAG(
        dag_id="Custom",
        schedule_interval="@once",
        description="Desc",
        start_date=datetime(2022, 3, 11),
        catchup=False,
        tags=["custom"],
        default_args=default_args) as dag:

    # Custom Operator
    custom = customOperator1(
        task_id='custom',
        last_run=dag.get_last_dagrun()  # OR HERE
    )

    custom
The actual answer at https://stackoverflow.com/a/63930004/18036486 includes the currently running DAG run. Therefore, I slightly modified the function to filter out runs that are still 'running' and keep only successful ones; of course, you can add conditions for the other DAG run states:
Now I can get the latest successful DAG run's execution_date to dynamically update my data!
from airflow.models import DagRun

def get_last_exec_date(dag_id):
    dag_runs = DagRun.find(dag_id=dag_id)
    dags = []
    for dag_run in dag_runs:
        if dag_run.state == 'success':
            dags.append(dag_run)

    # sort newest first so dags[0] is the most recent successful run
    dags.sort(key=lambda x: x.execution_date, reverse=True)
    return dags[0] if dags != [] else None
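A rough sketch of how this could be wired into the DAG file from the question, assuming get_last_exec_date is defined (or imported) in the same file; customOperator1 and its last_run parameter are the ones from the question:

from airflow import DAG
from datetime import datetime
from DAG.operators.custom_operator1 import customOperator1

# look up the last successful run before the DAG object is built
last_success = get_last_exec_date("Custom")

with DAG(
        dag_id="Custom",
        schedule_interval="@once",
        start_date=datetime(2022, 3, 11),
        catchup=False) as dag:

    custom = customOperator1(
        task_id='custom',
        # execution_date of the last successful run, or None on the very first run
        last_run=last_success.execution_date if last_success else None,
    )

Note that the lookup happens at DAG parse time, so the value is refreshed whenever the scheduler re-parses the file.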
I need to create an Airflow operator that takes a few inputs and returns a string that will be used as an input for another operator that runs next. I'm new to Airflow DAGs and operators and am confused about how to do this properly. Since I'm building this for people who use Airflow and build DAGs, and I'm not an actual Airflow user or DAG developer myself, I want advice on doing it the right way.

I have created an operator and it returns a token (just a string, so the hello-world operator example works fine). Doing so, I see the value in the XCom values for the DAG execution. But how would I properly retrieve that value and use it as the input to the next operator? For my example I just call the same operator twice, but in reality it will be a different operator. I just do not know how to code this properly. Do I just add code to the DAG? Does the operator need code added? Or something else?
Example DAG:
import logging

from airflow import DAG
from airflow.utils.dates import days_ago
from custom_operators.hello_world import HelloWorldOperator

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 0,
}

dag = DAG("hello_world_test",
          description='Testing out an operator',
          start_date=days_ago(1),
          schedule_interval=None,
          catchup=False,
          default_args=default_args)

get_token = HelloWorldOperator(
    task_id='hello_world_task_1',
    some_input='My input to generate a token',
    dag=dag
)

token = "My token"  # Want this to be the return value from get_token

run_this = HelloWorldOperator(
    task_id='hello_world_task_2',
    some_input=token,
    dag=dag
)

logging.info("Start")

get_token >> run_this

logging.info("End")
Hello World operator:
from airflow.models.baseoperator import BaseOperator

class HelloWorldOperator(BaseOperator):

    def __init__(
            self,
            some_input: str,
            **kwargs) -> None:
        super().__init__(**kwargs)
        self.some_input = some_input

    def execute(self, context):
        # Bunch of business logic
        token = "MyGeneratedToken"
        return token
This is a good start :).
The right way to retrieve the token from another task is to use Jinja templating:
run_this = RetrieveToken(
    task_id='hello_world_task_2',
    retrieved_token="{{ ti.xcom_pull(task_ids='hello_world_task_1') }}",
    dag=dag
)
However, you have to remember to add retrieved_token to the template_fields array in your RetrieveToken operator: https://airflow.apache.org/docs/apache-airflow/stable/howto/custom-operator.html#templating
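A minimal sketch of what that operator could look like (RetrieveToken is the hypothetical operator name used above):

from airflow.models.baseoperator import BaseOperator

class RetrieveToken(BaseOperator):
    # declaring the field here makes Airflow render the Jinja expression before execute() runs
    template_fields = ("retrieved_token",)

    def __init__(self, retrieved_token: str, **kwargs) -> None:
        super().__init__(**kwargs)
        self.retrieved_token = retrieved_token

    def execute(self, context):
        # retrieved_token now holds the rendered value pulled from XCom
        self.log.info("Got token: %s", self.retrieved_token)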
You can also call the xcom_pull method explicitly in your "retrieve" operator and pass the "origin" task ID to the operator so it pulls the value from the right task.
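That alternative could look roughly like this (RetrieveTokenExplicit and origin_task_id are made-up names for illustration):

from airflow.models.baseoperator import BaseOperator

class RetrieveTokenExplicit(BaseOperator):

    def __init__(self, origin_task_id: str, **kwargs) -> None:
        super().__init__(**kwargs)
        self.origin_task_id = origin_task_id

    def execute(self, context):
        # pull the value returned by the upstream task directly from XCom
        token = context["ti"].xcom_pull(task_ids=self.origin_task_id)
        self.log.info("Got token: %s", token)
        return token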
Airflow version: 1.10.12
I'm having trouble passing a rendered template object for use downstream. I'm trying to grab two config variables from the Airflow conf.
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from datetime import datetime

with DAG(
    dag_id="example_trigger_target_dag",
    default_args={"owner": "airflow"},
    start_date=datetime(2021, 6, 24),
    schedule_interval='@once',
    tags=['example'],
) as dag:
    bash_task = BashOperator(
        task_id="bash_task",
        bash_command='echo "{{ conf.test }}:{{ conf.tag }}"',
        xcom_push=True
    )
Basically, what it passes to XCom is just ":", without the variables present. The full rendered string does show up, however, in the Rendered tab. Am I missing something?
Edit: the variables exist in the conf, I just replaced them with test and tag for security reasons.
I just tested this code (I'm using Airflow 2.1.0) and got that result.
BashOperator(task_id='test_task',
             bash_command='echo " VARS: {{ conf.email.email_backend }}:{{ conf.core.executor }}"',
             do_xcom_push=True)
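If the goal is to use the rendered value downstream, a follow-up task in the same DAG can pull it back out of XCom. A minimal sketch (the task id here is made up; with do_xcom_push the last line of the echo output is stored under the default 'return_value' key):

# hypothetical downstream task reading the value pushed by 'test_task'
use_vars = BashOperator(
    task_id='use_vars',
    bash_command='echo "got: {{ ti.xcom_pull(task_ids=\'test_task\') }}"',
)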
I am using Google Cloud Composer to launch Dataflow jobs.
My DAG consists of two Dataflow jobs that should be run one after the other.
import datetime

from airflow.contrib.operators.dataflow_operator import DataflowTemplateOperator
from airflow import models

default_dag_args = {
    'start_date': datetime.datetime(2019, 10, 23),
    'dataflow_default_options': {
        'project': 'myproject',
        'region': 'europe-west1',
        'zone': 'europe-west1-c',
        'tempLocation': 'gs://somebucket/',
    }
}

with models.DAG(
        'some_name',
        schedule_interval=datetime.timedelta(days=1),
        default_args=default_dag_args) as dag:

    parameters = {'params': "param1"}

    t1 = DataflowTemplateOperator(
        task_id='dataflow_example_01',
        template='gs://path/to/template/template_001',
        parameters=parameters,
        dag=dag)

    parameters2 = {'params': "param2"}

    t2 = DataflowTemplateOperator(
        task_id='dataflow_example_02',
        template='gs://path/to/templates/template_002',
        parameters=parameters2,
        dag=dag
    )

    t1 >> t2
When I check in Dataflow, the job has succeeded and all the files it is supposed to make are created, but it appears it ran in the US region, while the Cloud Composer environment is in europe-west.
In Airflow I can see that the first job is still running, so the second one is not launched.
What should I add to the DAG to make it succeed? How do I run in Europe?
Any advice or solution on how to proceed would be most appreciated. Thanks!
I had to solve this issue in the past. In Airflow 1.10.2 (or lower) the code calls the service.projects().templates().launch() endpoint. This was fixed in 1.10.3, where the regional one is used instead: service.projects().locations().templates().launch().
As of October 2019, the latest Airflow version available for Composer environments is 1.10.2. If you need a solution immediately, the fix can be back-ported into Composer.
For this, we can override DataflowTemplateOperator with our own version, RegionalDataflowTemplateOperator:
class RegionalDataflowTemplateOperator(DataflowTemplateOperator):
    def execute(self, context):
        hook = RegionalDataFlowHook(gcp_conn_id=self.gcp_conn_id,
                                    delegate_to=self.delegate_to,
                                    poll_sleep=self.poll_sleep)

        hook.start_template_dataflow(self.task_id, self.dataflow_default_options,
                                     self.parameters, self.template)
This will now make use of the modified RegionalDataFlowHook, which overrides the _start_template_dataflow method of DataFlowHook to call the correct endpoint:
class RegionalDataFlowHook(DataFlowHook):
    def _start_template_dataflow(self, name, variables, parameters,
                                 dataflow_template):
        ...
        request = service.projects().locations().templates().launch(
            projectId=variables['project'],
            location=variables['region'],
            gcsPath=dataflow_template,
            body=body
        )
        ...
        return response
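For reference, these two subclasses rely on the 1.10.x contrib import paths, roughly (assuming Airflow 1.10.2 on Composer):

from airflow.contrib.hooks.gcp_dataflow_hook import DataFlowHook
from airflow.contrib.operators.dataflow_operator import DataflowTemplateOperator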
Then, we can create a task using our new operator and a Google-provided template (for testing purposes):
task = RegionalDataflowTemplateOperator(
    task_id=JOB_NAME,
    template=TEMPLATE_PATH,
    parameters={
        'inputFile': 'gs://dataflow-samples/shakespeare/kinglear.txt',
        'output': 'gs://{}/europe/output'.format(BUCKET)
    },
    dag=dag,
)
Full working DAG here. For a cleaner version, the operator can be moved into a separate module.