How to invoke Python function in TriggerDagRunOperator - python

I am trying to call one DAG from another DAG (triggering target_dag from parent_dag).
My parent_dag code is:
from datetime import datetime
from airflow.models import DAG
from airflow.operators.python_operator import PythonOperator
from airflow.models import Variable
from airflow.operators.trigger_dagrun import TriggerDagRunOperator

def read_metadata(**kwargs):
    asqldb_kv = Variable.get("asql_kv")
    # perform some operations based on asqldb_kv and populate the result_dictionary list
    if len(result_dictionary) > 0:
        my_var = Variable.set("num_runs", len(result_dictionary))
        ti = kwargs['ti']
        ti.xcom_push(key='res', value=result_dictionary)

default_args = {
    'start_date': datetime(year=2021, month=6, day=19),
    'provide_context': True
}

with DAG(
    dag_id='parent_dag',
    default_args=default_args,
    schedule_interval='@once',
    description='Test Trigger DAG'
) as dag:
    trigger = TriggerDagRunOperator(
        task_id="test_trigger_dagrun",
        python_callable=read_metadata,
        trigger_dag_id="target_dag"
    )
I am getting the below error:
airflow.exceptions.AirflowException: Invalid arguments were passed to TriggerDagRunOperator (task_id: test_trigger_dagrun). Invalid arguments were:
**kwargs: {'python_callable': <function read_metadata at 0x7ff5f4159620>}
Any help appreciated.
Edit:
python_callable is deprecated in TriggerDagRunOperator in Airflow 2.0.
My requirement is :
I need to access Azure Synapse and get a variable (say 3). Based on the retrieved value, I need to create tasks dynamically: if Synapse returns 3, then I need to create 3 tasks.
My idea was :
DAG 1 - Access Azure Synapse and get the value. Store it in an Airflow Variable. Trigger DAG 2 using TriggerDagRunOperator.
DAG 2 - Create tasks depending on the Airflow Variable updated in DAG 1.
Any inputs on how I can achieve this?
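For reference, the pattern I have in mind for DAG 1 looks roughly like this (a minimal sketch assuming Airflow 2.x, where TriggerDagRunOperator no longer accepts python_callable, so the metadata read moves into its own PythonOperator; the Synapse lookup itself is only a placeholder):
from datetime import datetime

from airflow.models import DAG, Variable
from airflow.operators.python import PythonOperator
from airflow.operators.trigger_dagrun import TriggerDagRunOperator

def read_metadata(**kwargs):
    # Placeholder: query Azure Synapse, build result_dictionary,
    # then publish the count for DAG 2 to read.
    result_dictionary = []  # hypothetical result of the Synapse lookup
    if result_dictionary:
        Variable.set("num_runs", len(result_dictionary))
        kwargs["ti"].xcom_push(key="res", value=result_dictionary)

with DAG(
    dag_id="parent_dag",
    start_date=datetime(2021, 6, 19),
    schedule_interval="@once",
) as dag:
    read_metadata_task = PythonOperator(
        task_id="read_metadata",
        python_callable=read_metadata,
    )

    trigger = TriggerDagRunOperator(
        task_id="test_trigger_dagrun",
        trigger_dag_id="target_dag",
    )

    read_metadata_task >> trigger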

Related

Airflow reads task job_name parameter as a folder

I'm trying to run a DAG using a custom Operator for a task, but the job_name parameter (set automatically to be the same as the task_id) is being read as a folder structure instead of the string itself. Example: "example_task" is being read as "\example_task\". By default, Airflow does not accept the "\" character in job_name.
Here is the code:
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from datamechanics_airflow_plugin.operator import DataMechanicsOperator
from airflow.utils.dates import days_ago
import pendulum

local_tz = pendulum.timezone("America/Sao_Paulo")

with DAG(
    dag_id="processing_dag",
    start_date=days_ago(1).astimezone(tz=local_tz),
    schedule_interval="@daily",
) as dag:
    start = DummyOperator(task_id="start")
    end = DummyOperator(task_id="end")
    landing_to_processing = DataMechanicsOperator(
        task_id="landing_to_processing",
        config_template_name="spark-gcp-hudi",
        config_overrides={
            "mainApplicationFile": "/home/edu/airflow/dags/scripts/process_landing_data.py",
        },
    )

    start >> landing_to_processing >> end
The DataMechanicsOperator comes from https://www.datamechanics.co/ and the plugin has been correctly installed.
Here is part of the Airflow UI error message:
airflow.exceptions.AirflowException: Response: b'{"errors": {"jobName": "\'landing_to_processing\' does not match \'^[a-z0-9]([-a-z0-9]*[a-z0-9])?(\\\\\\\\\\\\\\\\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*$\'"}, "message": "Input payload validation failed"}\n', Status Code: 400
I've tested the code in a local Airflow Server and in a Docker container.
I really can't see what could possibly be causing this.
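One way to narrow this down is to test the generated job name against the validation pattern quoted in the error message (once the escaping is collapsed it looks like a Kubernetes-style DNS-1123 name, which allows only lowercase letters, digits, hyphens and dots, so the underscores in landing_to_processing would be rejected on their own); a quick local check:
import re

# Validation pattern taken from the API error message (escaping collapsed to a single \.)
JOB_NAME_PATTERN = re.compile(
    r"^[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*$"
)

for name in ["landing_to_processing", "landing-to-processing"]:
    print(name, "->", bool(JOB_NAME_PATTERN.match(name)))
# landing_to_processing -> False   (underscores are not allowed)
# landing-to-processing -> True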

Cloud Composer can't render dynamic DAG in webserver UI: "DAG seems to be missing"

I'm trying to make a DAG that has 2 operators that are created dynamically, depending on the number of "pipelines" that a JSON config file has. This file is stored in the variable dag_datafusion_args. Then I have a standard bash operator, and a task called success at the end that sends a message to Slack saying that the DAG is over. The other 2 tasks, which are Python operators, are generated dynamically and run in parallel. I'm using Composer; when I put the DAG in the bucket it appears on the webserver UI, but when I click to see the DAG the following message appears: 'DAG "dag_lucas4" seems to be missing.' If I test the tasks directly by CLI on the Kubernetes cluster it works! But I can't seem to make it appear in the web UI. I tried, as some people here on SO suggested, to restart the webserver by installing a Python package; I tried 3x but without success. Does anyone know what it can be?
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.operators.python_operator import PythonOperator
from aux_py_files.med.med_airflow_functions import *
from google.cloud import storage
from datetime import timedelta

TEMPLATE_SEARCH_PATH = '/home/airflow/gcs/plugins/'
INDEX = 1

default_args = {
    'owner': 'lucas',
    'start_date': '2021-01-10',
    'email': ['xxxx'],
    'email_on_failure': False,
    'email_on_success': False,
    'retries': 3,
    'retry_delay': timedelta(minutes=2),
    'on_failure_callback': post_message_fail_to_slack
}

dag_datafusion_args = return_datafusion_config_file('med')

with DAG('dag_lucas4', default_args=default_args, schedule_interval="30 23 * * *", template_searchpath=[TEMPLATE_SEARCH_PATH]) as dag:

    extract_ftp_csv_files_load_in_gcs = BashOperator(
        task_id='extract_ftp_csv_files_load_in_gcs',
        bash_command='aux_sh_files/med/script.sh'
    )

    success = PythonOperator(
        task_id='success',
        python_callable=post_message_success_to_slack,
        op_kwargs={'dag_name': 'dag_lucas2'}
    )

    for pipeline, args in dag_datafusion_args.items():

        configure_pipeline = PythonOperator(
            task_id=f'configure_pipeline{str(INDEX)}',
            python_callable=setPipelineArguments,
            op_kwargs={'dag_name': 'med', 'pipeline_name': pipeline},
            provide_context=True
        )

        start_pipeline = PythonOperator(
            task_id=f'start_pipeline{str(INDEX)}',
            python_callable=start_pipeline_wrapper,
            op_kwargs={'configure_pipeline_task': f'configure_pipeline{str(INDEX)}'},
            retries=3,
            provide_context=True
        )

        [extract_ftp_csv_files_load_in_gcs, configure_pipeline] >> start_pipeline >> success
        INDEX += 1
It appears that the Airflow webserver in Cloud Composer runs in the tenant project, while the worker and scheduler run in the customer project. The tenant project is simply a Google-managed environment for some of the Airflow components, so the webserver UI doesn't have complete access to your project's resources, as it doesn't run under your project's environment. That is why it can't read the config JSON file via return_datafusion_config_file. The best way is to create an ENV variable with the contents of that file.
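A minimal sketch of that suggestion, assuming the JSON config is stored in a Cloud Composer environment variable (the DAG_DATAFUSION_ARGS name is only illustrative):
import json
import os

# Read the pipeline config from an environment variable set on the Composer
# environment, so DAG parsing does not depend on files the webserver
# (running in the tenant project) cannot reach.
dag_datafusion_args = json.loads(os.environ.get("DAG_DATAFUSION_ARGS", "{}"))

for pipeline, args in dag_datafusion_args.items():
    ...  # build the configure/start PythonOperator pairs as in the DAG above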

How to parse nested macros in Airflow

I'm currently facing a challenge in parsing nested macros. Below is my DAG file:
from airflow import DAG
from airflow.utils.dates import days_ago
from datetime import timedelta
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.python_operator import PythonOperator
from airflow.models import Variable
from apty.utils.date import date_ref_now

default_args = {
    "owner": "Akhil",
    "depends_on_past": False,
    "start_date": days_ago(0),
    "email_on_failure": False,
    "email_on_retry": False,
    "retries": 0,
    "retry_delay": timedelta(minutes=5),
}

dag = DAG(
    "user_sample",
    default_args=default_args,
    description="test",
    schedule_interval=None,
    catchup=False,
)

def sample_app(hello=None):
    return hello

extra_attrs = {"date_stamp": "{{ds}}",
               "foo": "bar"}

start = DummyOperator(task_id="start", dag=dag)

python = PythonOperator(
    python_callable=sample_app,
    task_id="mid",
    dag=dag,
    params={"date_stamp": extra_attrs["date_stamp"]},
    op_kwargs={"hello": "{{params.date_stamp}}"},
)

start >> python
I have a scenario where I need to pass {{ds}} as one of the parameters to my operator, after which I'll use that parameter as I wish, passing it via op_kwargs or op_args. (I have used PythonOperator as an example, but I would be using my own custom operator.)
Here I would like to make it clear that {{ds}} is passed as a parameter value only; I don't want to write it anywhere else, e.g. directly in op_kwargs as in this example.
When I try to run it, the return value from the Python callable is the literal string {{ds}} rather than the current date stamp.
Please help me out.
Template or macro variables are only available for parameters that are specified as template_fields on the operator class in use. This depends on the specific version and implementation of Airflow you're using, but here's the latest https://github.com/apache/airflow/blob/98896e4e327f256fd04087a49a13e16a246022c9/airflow/operators/python.py#L72 for the PythonOperator. Since, as you say, you control the operator in question, you can specify any fields you want on the class definition's template_fields. (This all assumes your class inherits from BaseOperator.)
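To make that concrete, here is a minimal sketch of such a custom operator (the class name and the date_stamp field are illustrative, assuming an Airflow 2.x-style operator):
from airflow.models import BaseOperator

class DateStampOperator(BaseOperator):
    # Any field listed here is rendered by Jinja before execute() runs,
    # so a value like "{{ ds }}" arrives as the actual date string.
    template_fields = ("date_stamp",)

    def __init__(self, date_stamp=None, **kwargs):
        super().__init__(**kwargs)
        self.date_stamp = date_stamp

    def execute(self, context):
        # Already rendered at this point, e.g. "2021-06-19".
        return self.date_stamp
Usage would then be along the lines of DateStampOperator(task_id="mid", date_stamp="{{ ds }}", dag=dag).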

Apache Airflow DAG doesn't call on_success_callback and on_failure_callback

I want to customize my DAGs to send an email when they fail or succeed. I'm trying to use on_success_callback and on_failure_callback in the DAG constructor, but it doesn't work for the DAG. At the same time, it works for the DummyOperator that I put inside my DAG.
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.python_operator import PythonOperator
from datetime import datetime
from utils import get_report_operator, DagStatus

TEST_DAG_NAME = 'test_dag'
TEST_DAG_REPORT_SUBSCRIBERS = ['MY_EMAIL']

def send_success_report(context):
    subject = 'Airflow report: {0} run success'.format(TEST_DAG_NAME)
    email_operator = get_report_operator(subject, TEST_DAG_REPORT_SUBSCRIBERS, TEST_DAG_NAME, DagStatus.SUCCESS)
    email_operator.execute(context)

def send_failed_report(context):
    subject = 'Airflow report: {0} run failed'.format(TEST_DAG_NAME)
    email_operator = get_report_operator(subject, TEST_DAG_REPORT_SUBSCRIBERS, TEST_DAG_NAME, DagStatus.FAILED)
    email_operator.execute(context)

dag = DAG(dag_id=TEST_DAG_NAME,
          schedule_interval=None,
          start_date=datetime(2019, 6, 6),
          on_success_callback=send_success_report,
          on_failure_callback=send_failed_report)

DummyOperator(task_id='task',
              on_success_callback=send_success_report,
              on_failure_callback=send_failed_report,
              dag=dag)
I've also implemented a small add-on over the Airflow EmailOperator for sending the report. I don't think the error is in this part, but here it is anyway.
import os
from datetime import datetime
from enum import Enum

from airflow import DAG
from airflow.operators.email_operator import EmailOperator

class DagStatus(Enum):
    SUCCESS = 0
    FAILED = 1

def get_report_operator(sbjct, to_lst, dag_id, dag_status):
    status = 'SUCCESS' if dag_status == DagStatus.SUCCESS else 'FAILED'
    status_color = '#87C540' if dag_status == DagStatus.SUCCESS else '#FF1717'
    with open(os.path.join(os.path.dirname(__file__), 'airflow_report.html'), 'r', encoding='utf-8') as report_file:
        report_mask = report_file.read()
    report_text = report_mask.format(dag_id, status, status_color)
    tmp_dag = DAG(dag_id='tmp_dag', start_date=datetime(year=2019, month=9, day=12), schedule_interval=None)
    return EmailOperator(task_id='send_email',
                         to=to_lst,
                         subject=sbjct,
                         html_content=report_text.encode('utf-8'),
                         dag=tmp_dag)
What am I doing wrong?
Instead, put on_failure_callback as an argument in the default_args dictionary and pass it to the DAG.
All arguments in default_args passed to a DAG will be applied to all of the DAG's operators. It's the only way, as of now, to apply a common parameter to all the operators in the DAG.
dag = DAG(dag_id=TEST_DAG_NAME,
          schedule_interval=None,
          start_date=datetime(2019, 6, 6),
          default_args={
              'on_success_callback': send_success_report,
              'on_failure_callback': send_failed_report
          })

Airflow webserver gives cron error for dags with None as schedule interval

I'm running Airflow 1.9.0 with LocalExecutor and a PostgreSQL database in a Linux AMI. I want to manually trigger DAGs, but whenever I create a DAG that has schedule_interval set to None or to @once, the webserver tree view crashes with the following error (I only show the last call):
File "/usr/local/lib/python2.7/site-packages/croniter/croniter.py", line 467, in expand
raise CroniterBadCronError(cls.bad_length)
CroniterBadCronError: Exactly 5 or 6 columns has to be specified for iteratorexpression.
Furthermore, when I manually trigger the DAG, a DAG run starts but the tasks themselves are never scheduled. I've looked around, but it seems that I'm the only one with this type of error. Has anyone encountered this error before and found a fix?
Minimal example triggering the problem:
import datetime as dt
from airflow import DAG
from airflow.operators.bash_operator import BashOperator

default_args = {
    'owner': 'me'
}

bash_command = """
echo "this is a test task"
"""

with DAG('schedule_test',
         default_args=default_args,
         start_date=dt.datetime(2018, 7, 24),
         schedule_interval='None',
         catchup=False
         ) as dag:
    first_task = BashOperator(task_id="first_task", bash_command=bash_command)
Try this:
Set your schedule_interval to None without the quotes, or simply do not specify schedule_interval in your DAG. More information on that here: airflow docs -- search for schedule_interval.
Set the orchestration for your tasks at the bottom of the DAG.
Like so:
from datetime import datetime
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.operators.dummy_operator import DummyOperator

default_args = {
    'owner': 'me'
}

bash_command = """
echo "this is a test task"
"""

with DAG('schedule_test',
         default_args=default_args,
         start_date=datetime(2018, 7, 24),
         schedule_interval=None,
         catchup=False
         ) as dag:

    t1 = DummyOperator(
        task_id='extract_data',
        dag=dag
    )

    t2 = BashOperator(
        task_id="first_task",
        bash_command=bash_command
    )

    #####ORCHESTRATION#####
    ## It is saying that in order for t2 to run, t1 must be done.
    t2.set_upstream(t1)
The None value should not be in quotes.
It should be like this:
schedule_interval=None
Here is the documentation link: https://airflow.apache.org/docs/apache-airflow/1.10.1/scheduler.html#:~:text=Note%3A%20Use%20schedule_interval%3DNone%20and%20not%20schedule_interval%3D%27None%27%20when%20you%20don%E2%80%99t%20want%20to%20schedule%20your%20DAG
