Is it possible to create a Airflow DAG programmatically, by using just REST API?
Background
We have a collection of models, each model consists of:
A collection of SQL files that need to be run for the model
We also keep a JSON file for each model which defines the dependencies between each SQL file.
The scripts are run through a Python job.py file that takes a script file name as parameter.
Our models are updated by many individuals so we need to update our DAG daily. What we have done is created a scheduled Python script that reads all the JSON files and for each model creates in memory DAG that executes each model and its SQL scripts as per the defined dependencies in the JSON config files. What we want to do is to be able to recreate that DAG visually within Airflow DAG programmatically and then execute it, rerun failures etc.
I did some research and per my understanding Airflow DAGs can only be created by using decorators on top of Python files. Is there another approach I missed using REST API?
Here is an example of a JSON we have:
{
"scripts" :[
"Script 1": {
"script_task" : "job.py",
"script_params" : {
"param": "script 1.sql"
},
"dependencies": [
"Script 2",
"Script 3"
]
},
"Script 2": {
"script_task" : "job.py",
"script_params" : {
"param": "script 2.sql"
},
"dependencies": [
"Script 3"
]
},
"Script 3": {
"script_task" : "job.py",
"script_params" : {
"param": "script 3.sql"
},
"dependencies": [
]
}
]
}
Airflow dags are python objects, so you can create a dags factory and use any external data source (json/yaml file, a database, NFS volume, ...) as source for your dags.
Here are the steps to achieve your goal:
create a python script in your dags folder (assume its name is dags_factory.py)
create a python class or method which return a DAG object (assume it is a method and it is defined as create_dag(config_dict))
in the main, load your file/(any external data source) and loop over dags configs, and for each dag:
# this step is very important to persist the created dag and add it to the dag bag
globals()[<dag id>] = create_dag(dag_config)
So without passing in the details of your java file, if you have already a script which creates the dags in memory, try to apply those steps, and you will find the created dags in the metadata and the UI.
Here are some tips:
Airflow runs the dag file processor each X seconds (conf), so no need to use an API, instead, you can upload your files to S3/GCS or a git repository , and load them in the main script before calling the create_dag method.
Try to imporve your json schema, for ex, scripts can be an array
For the method create_dag, I will try to simplify the code (according to what I understood from your json file):
from datetime import datetime
from json import loads
from airflow import DAG
from airflow.operators.bash import BashOperator
def create_dag(dag_id, dag_conf) -> DAG:
scripts = dag_conf["scripts"]
tasks_dict = {}
dag = DAG(dag_id=dag_id, start_date=datetime(2022, 1, 1), schedule_interval=None) # configure your dag
for script_name, script_conf in scripts.items():
task = BashOperator(
bash_command=f"python {script_conf['script_task']} {(f'{k}={v}' for k, v in script_conf['script_params'])}",
dag=dag
)
tasks_dict[script_name] = {
"task": task,
"dependencies": script_conf["dependencies"]
}
for task_conf in tasks_dict.values():
for dependency in task_conf["dependencies"]:
task_conf["task"] << tasks_dict[dependency]["task"] # if you mean the inverse, you can replace << by >>
return dag
if __name__ == '__main__':
# create a loop if you have multiple file
# you can load the files from git or S3, I use local storage for testing
dag_conf_file = open("dag.json", "r")
dag_conf_dict = loads(dag_conf_file.read())
dag_id = "test_dag" # read it from the file
globals()[dag_id] = create_dag(dag_id, dag_conf_dict)
P.S: if you will create a big number of dags in the same script (one script to process multiple json file), you may have some performance issues because Airflow scheduler and workers will re-run the script for each task operation, so you will need to improve it using magic loop or the new syntax added in 2.4
Related
I am trying to submit parallel jobs in a loop on HTCondor, following is a simple example of the python script -
test_mus = np.linspace(0, 5, 10)
results = [pyhf.infer.hypotest(test_mu, data, model)
for test_mu in test_mus]
I would like to submit each job (results), over the for loop (so 10 jobs) simultaneously to 10 machines, and then combine all the results in a pickle.
I have the submission script for this job as below -
executable = CLs.sh
+JobFlavour = "testmatch"
arguments = $(ClusterId) $(ProcId)
Input = LHpruned_NEW_32.json
output = output_sigbkg/CLs.$(ClusterId).$(ProcId).out
error = output_sigbkg/CLs.$(ClusterId).$(ProcId).err
log = output_sigbkg/CLs.$(ClusterId).log
transfer_input_files = CLs_test1point.py, LHpruned_NEW_32.json
should_transfer_files = YES
queue
I would like to know how to submit 10 jobs, to parallelize the jobs.
Thank you !
Best,
Shreya
It's worth looking at the description of how to submit a HTCondor DAG via Python here. In your case, if you install the htcondor Python module, you could do something like:
import htcondor
from htcondor import dags
# create the submit script
sub = htcondor.Submit(
{
"executable": "CLs.sh",
"+JobFlavour": "testmatch",
"arguments": "$(ClusterId) $(ProcId)",
"Input": "LHpruned_NEW_32.json",
"output": "output_sigbkg/CLs.$(ClusterId).$(ProcId).out",
"error": "output_sigbkg/CLs.$(ClusterId).$(ProcId).err",
"log": "output_sigbkg/CLs.$(ClusterId).log",
"transfer_input_files": "CLs_test1point.py, LHpruned_NEW_32.json",
"should_transfer_files": "YES",
}
)
# create DAG
dag = dags.DAG()
# add layer with 10 jobs
layer = dag.layer(
name="CLs_layer",
submit_description=sub,
vars=[{} for i in range(10)],
)
# write out the DAG to current directory
dagfile = dags.write_dag(dag, ".")
You can use the vars argument to add macros giving values for each specific job if you want, e.g., if you wanted mu as one of the executable arguments you could switch this to:
import htcondor
from htcondor import dags
# create the submit script
sub = htcondor.Submit(
{
"executable": "CLs.sh",
"+JobFlavour": "testmatch",
"arguments": "$(ClusterId) $(ProcId) $(MU)",
"Input": "LHpruned_NEW_32.json",
"output": "output_sigbkg/CLs.$(ClusterId).$(ProcId).out",
"error": "output_sigbkg/CLs.$(ClusterId).$(ProcId).err",
"log": "output_sigbkg/CLs.$(ClusterId).log",
"transfer_input_files": "CLs_test1point.py, LHpruned_NEW_32.json",
"should_transfer_files": "YES",
}
)
# create DAG
dag = dags.DAG()
# add layer with 10 jobs
layer = dag.layer(
name="CLs_layer",
submit_description=sub,
vars=[{"MU": mu} for mu in np.linspace(0, 5, 10)],
)
# write out the DAG to current directory
dagfile = dags.write_dag(dag, ".")
Once the DAG is created, you can either submit it as normal with condor_submit_dag from the terminal, or submit it via Python using the instructions here.
Note: your +JobFlavour value for the submit file will actually get converted to MY.JobFlavour in the file that gets created, but that's ok and means the same thing.
I'm trying to make a dag that has 2 operators that are created dynamically, depending on the number of "pipelines" that a json config file has. this file is stored in the variable dag_datafusion_args. Then I have a standard bash operator, and I have a task called success at the end that sends a message to the slack saying that the dag is over. the other 2 tasks that are python operators are generated dynamically and run in parallel. I'm using the composer, when I put the dag in the bucket it appears on the webserver ui, but when I click to see the dag the following message appears'DAG "dag_lucas4" seems to be missing. ', If I test the tasks directly by CLI on the kubernetes cluster it works! But I can't seem to make the web UI appear. I tried to do as a suggestion of some people here in SO to restart the webserver by installing a python package, I tried 3x but without success. Does anyone know what can it be?
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.operators.python_operator import PythonOperator
from aux_py_files.med.med_airflow_functions import *
from google.cloud import storage
from datetime import timedelta
TEMPLATE_SEARCH_PATH = '/home/airflow/gcs/plugins/'
INDEX=1
default_args = {
'owner':'lucas',
'start_date': '2021-01-10',
'email': ['xxxx'],
'email_on_failure': False,
'email_on_success': False,
'retries': 3,
'retry_delay': timedelta(minutes=2),
'on_failure_callback': post_message_fail_to_slack
}
dag_datafusion_args=return_datafusion_config_file('med')
with DAG('dag_lucas4', default_args = default_args, schedule_interval="30 23 * * *", template_searchpath = [TEMPLATE_SEARCH_PATH]) as dag:
extract_ftp_csv_files_load_in_gcs = BashOperator(
task_id='extract_ftp_csv_files_load_in_gcs',
bash_command='aux_sh_files/med/script.sh'
)
success = PythonOperator(
task_id='success',
python_callable=post_message_success_to_slack,
op_kwargs={'dag_name':'dag_lucas2'}
)
for pipeline,args in dag_datafusion_args.items():
configure_pipeline=PythonOperator(
task_id=f'configure_pipeline{str(INDEX)}',
python_callable=setPipelineArguments,
op_kwargs={'dag_name':'med', 'pipeline_name':pipeline},
provide_context=True
)
start_pipeline = PythonOperator(
task_id= f'start_pipeline{str(INDEX)}',
python_callable=start_pipeline_wrapper,
op_kwargs={'configure_pipeline_task':f'configure_pipeline{str(INDEX)}'},
retries=3,
provide_context=True
)
[extract_ftp_csv_files_load_in_gcs,configure_pipeline] >> start_pipeline >> success
INDEX += 1
Appears that The Airflow-Webserver in Cloud Composer runs in the tenant project, the worker and scheduler runs in the customer project. Tenant project is nothing but its google side managed environment for some part of airflow components. So the Webserver UI doesn't have complete access to your project resources. As it doesn't run under your project's environment. So I can read my config json file with return_datafusion_config_file . Best way is create an ENV variable with that file.
Im trying to run scheduled task using Django-q I followed the docs but its not running
heres my config
CACHES = {
'default': {
'BACKEND': 'django.core.cache.backends.db.DatabaseCache',
'LOCATION': 'db_cache_table',
}
}
Q_CLUSTER = {
'name': 'DjangORM',
'workers': 4,
'timeout': 90,
'retry': 120,
'queue_limit': 50,
'bulk': 10,
'orm': 'default'
}
heres my scheduled task
Nothin is executing, please help
I also had problems with getting scheduled tasks processed in the first place, but finally found a workflow.
I run django-q on a windows machine, using the django ORM as a broker.
Before talking about the execution routine i came up, lets quickly check out my modules first, starting with ..
settings.py:
Q_CLUSTER = {
"name": "austrian_energy_monthly",
"workers": 1,
"timeout": 10,
"retry": 20,
"queue_limit": 50,
"bulk": 10,
"orm": "default",
"ack_failures": True,
"max_attempts": 1,
"attempt_count": 0,
}
.. and my folder structure:
As you can see, the folder of my django project is inside the src folder. Further, there's a folder for the app i created for this project, which is simply called "app". Inside the app folder i do have another folder called "cron", which includes the following files and functions related to the scheduling:
tasks.py
I do not use the schedule() method provided by the django-q, instead i go for the creating tables directly (see: django-q official schedule docs)
from django.utils import timezone
from austrian_energy_monthly.app.cron.func import create_text_file
from django_q.models import Schedule
Schedule.objects.create(
func="austrian_energy_monthly.app.cron.func.create_text_file",
kwargs={"content": "Insert this into a text file"},
hooks="austrian_energy_monthly.app.cron.hooks.print_result",
name="Text file creation process",
schedule_type=Schedule.ONCE,
next_run=timezone.now(),
)
Make sure you assign the "right" path to the "func" keyword. Just using "func.create_text_file",didn't work out for me, even though these files are laying in the same folder. Same for the "hooks" keyword.
(NOTE: I've set up my project as a development package via setup.py, such that i can call it from everywhere inside my src folder)
func.py:
Contains the function called by the schedule table object.
def create_text_file(content: str) -> str:
file = open(f"copy.txt", "w")
file.write(content)
file.close()
return "Created a text file"
hooks.py:
Contains the function called after the scheduled process finished.
def print_result(task):
print(task.result)
Let's now see how i managed to get the executions running for with the file examples described above:
First i've scheduled the "Text file creation process". Therefore I used "python manage.py shell" and imported the tasks.py module (you probably could schedule everythin via the admin page as well, but i didnt tested this yet):
You could now see the scheduled task, with a question mark on the success column in the admin page (tab "Scheduled tasks", as within your picture):
After that i opened a new terminal and started the cluster with "python manage.py qcluster", resulting in the following output in the terminal:
The successful execution can be inspected by looking at "13:22:17 [Q] INFO Processed [ten-virginia-potato-high]", alongside the hook print statement "Created a text file" in the terminal. Further you can check it at the admin page, under the tab "Successful Tasks", where you should see:
Hope that helped!
Django-q dont support windows. :)
I am trying to create a DAG that generates tasks dynamically based on a JSON file located in storage. I followed this guide step-by-step:
https://bigdata-etl.com/apache-airflow-create-dynamic-dag/
But the DAG gets stuck with the following message:
Is it possible to read an external file and use it to create tasks dynamically in Composer? I can do this when I read data only from an airflow Variable, but when I read an external file, the dag gets stuck in the isn't available in the web server's DagBag object state. I need to read from an external file as the contents of the JSON will change with every execution.
I am using composer-1.8.2-airflow-1.10.2.
I read this answer to a similar question:
Dynamic task definition in Airflow
But I am not trying to create the tasks based on a separate task, only based on the external file.
This is my second approach that also get's stuck in that error state:
import datetime
import airflow
from airflow.operators import bash_operator
from airflow.operators.dummy_operator import DummyOperator
from airflow.models import Variable
import json
import os
products = json.loads(Variable.get("products"))
default_args = {
'owner': 'Composer Example',
'depends_on_past': False,
'email': [''],
'email_on_failure': False,
'email_on_retry': False,
'retries': 0,
'retry_delay': datetime.timedelta(minutes=5),
'start_date': datetime.datetime(2020, 1, 10),
}
with airflow.DAG(
'json_test2',
default_args=default_args,
# Not scheduled, trigger only
schedule_interval=None) as dag:
# Print the dag_run's configuration, which includes information about the
# Cloud Storage object change.
def read_json_file(file_path):
if os.path.exists(file_path):
with open(file_path, 'r') as f:
return json.load(f)
def get_run_list(files):
run_list = []
#The file is uploaded in the storage bucket used as a volume by Composer
last_exec_json = read_json_file("/home/airflow/gcs/data/last_execution.json")
date = last_exec_json["date"]
hour = last_exec_json["hour"]
for file in files:
#Testing by adding just date and hour
name = file['name']+f'_{date}_{hour}'
run_list.append(name)
return run_list
rl = get_run_list(products)
start = DummyOperator(task_id='start', dag=dag)
end = DummyOperator(task_id='end', dag=dag)
for name in rl:
tsk = DummyOperator(task_id=name, dag=dag)
start >> tsk >> end
It is possible to create DAG that generates task dynamically based on a JSON file, which is located in a Cloud Storage bucket. I followed guide, that you provided, and it works perfectly in my case.
Firstly you need to upload your JSON configuration file to $AIRFLOW_HOME/dags directory, and then DAG python file to the same path (you can find the path in airflow.cfg file, which is located in the bucket).
Later on, you will be able to see DAG in Airflow UI:
As you can see the log DAG isn't available in the web server's DagBag object, the DAG isn't available on Airflow Web Server. However, the DAG can be scheduled as active because Airflow Scheduler is working independently with the Airflow Web Server.
When a lot of DAGs are loaded at once to a Composer environment, it may overload on the environment. As the Airflow webserver is on a Google-managed project, only certain types of updates will cause the webserver container to be restarted, like adding or upgrading one of the PyPI packages or changing an Airflow setting. The workaround is to add a dummy environment variable:
Open Composer instance in GCP
ENVIRONMENT VARIABLE tab
Edit, then add environment variable and Submit
You can use following command to restart it:
gcloud composer environments update ${ENVIRONMENT_NAME} --location=${ENV_LOCATION} --update-airflow-configs=core-dummy=true
gcloud composer environments update ${ENVIRONMENT_NAME} --location=${ENV_LOCATION} --remove-airflow-configs=core-dummy
I hope you find the above pieces of information useful.
I am new in python and airflow dag.
I am following below link and code which is mention in answer section.
How to pass dynamic arguments Airflow operator?
I am facing issue to reading yaml file, In yaml file I have some configuration related arguments.
configs:
cluster_name: "test-cluster"
project_id: "t***********"
zone: "europe-west1-c"
num_workers: 2
worker_machine_type: "n1-standard-1"
master_machine_type: "n1-standard-1"
In DAG script I have created one task which will be create cluster, before executing this task we need all the arguments which we need to pass on it default_args parameter like cluster-name, project_id etc.For reading those parameter I have created one readYML method.see below code
from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from datetime import datetime, timedelta
from zipfile import ZipFile
from airflow.contrib.operators import dataproc_operator
from airflow.models import Variable
import yaml
def readYML():
print("inside readYML")
global cfg
file_name = "/home/airflow/gcs/data/cluster_config.yml"
with open(file_name, 'r') as ymlfile:
cfg = yaml.load(ymlfile)
print(cfg['configs']['cluster_name'])
# Default Arguments
readYML()
dag_name = Variable.get("dag_name")
default_args = {
'owner': 'airflow',
'depends_on_past': False,
'start_date': datetime.now(),
'email': ['airflow#example.com'],
'email_on_failure': False,
'email_on_retry': False,
'retries': 1,
'retry_delay': timedelta(minutes=5),
#'cluster_name': cfg['configs']['cluster_name'],
}
# Instantiate a DAG
dag = DAG(dag_id='read_yml', default_args=default_args,
schedule_interval=timedelta(days=1))
# Creating Tasks
Task1 = DataprocClusterCreateOperator(
task_id='create_cluster',
dag=dag
)
In this code there is no error, When I am uploading in GCP composer environment, No error notification is showing but this DAG is no runnable there is no Run button is coming.
See attached screen shot.
I am using python 3 & airflow composer-1.7.2-airflow-1.10.2 version.
According to the Data Stored in Cloud Storage page in the Cloud Composer docs:
To avoid a webserver error, make sure that data the webserver needs to parse a DAG (not run) is available in the dags/ folder. Otherwise, the webserver can't access the data or load the Airflow web interface.
Your DAG is attempting to open the YAML file under /home/airflow/gcs/data, which isn't present on the webserver. Put the file under the dags/ folder in your GCS bucket, and it will be accessible to the scheduler, workers, and webserver, and the DAG will work in the Web UI.