Submitting parallel jobs on HTCondor, using Python

I am trying to submit parallel jobs in a loop on HTCondor; the following is a simple example of the Python script:
test_mus = np.linspace(0, 5, 10)
results = [pyhf.infer.hypotest(test_mu, data, model)
           for test_mu in test_mus]
I would like to submit each iteration of the loop as its own job (so 10 jobs) simultaneously to 10 machines, and then combine all the results into a pickle.
My submit description file for this job is below:
executable = CLs.sh
+JobFlavour = "testmatch"
arguments = $(ClusterId) $(ProcId)
Input = LHpruned_NEW_32.json
output = output_sigbkg/CLs.$(ClusterId).$(ProcId).out
error = output_sigbkg/CLs.$(ClusterId).$(ProcId).err
log = output_sigbkg/CLs.$(ClusterId).log
transfer_input_files = CLs_test1point.py, LHpruned_NEW_32.json
should_transfer_files = YES
queue
I would like to know how to submit 10 of these jobs so that they run in parallel.
Thank you !
Best,
Shreya

It's worth looking at the description of how to submit an HTCondor DAG via Python here. In your case, if you install the htcondor Python module, you could do something like:
import htcondor
from htcondor import dags
# create the submit script
sub = htcondor.Submit(
    {
        "executable": "CLs.sh",
        "+JobFlavour": "testmatch",
        "arguments": "$(ClusterId) $(ProcId)",
        "Input": "LHpruned_NEW_32.json",
        "output": "output_sigbkg/CLs.$(ClusterId).$(ProcId).out",
        "error": "output_sigbkg/CLs.$(ClusterId).$(ProcId).err",
        "log": "output_sigbkg/CLs.$(ClusterId).log",
        "transfer_input_files": "CLs_test1point.py, LHpruned_NEW_32.json",
        "should_transfer_files": "YES",
    }
)
# create DAG
dag = dags.DAG()
# add layer with 10 jobs
layer = dag.layer(
    name="CLs_layer",
    submit_description=sub,
    vars=[{} for i in range(10)],
)
# write out the DAG to current directory
dagfile = dags.write_dag(dag, ".")
You can use the vars argument to add macros giving values for each specific job if you want, e.g., if you wanted mu as one of the executable arguments you could switch this to:
import numpy as np

import htcondor
from htcondor import dags
# create the submit script
sub = htcondor.Submit(
    {
        "executable": "CLs.sh",
        "+JobFlavour": "testmatch",
        "arguments": "$(ClusterId) $(ProcId) $(MU)",
        "Input": "LHpruned_NEW_32.json",
        "output": "output_sigbkg/CLs.$(ClusterId).$(ProcId).out",
        "error": "output_sigbkg/CLs.$(ClusterId).$(ProcId).err",
        "log": "output_sigbkg/CLs.$(ClusterId).log",
        "transfer_input_files": "CLs_test1point.py, LHpruned_NEW_32.json",
        "should_transfer_files": "YES",
    }
)
# create DAG
dag = dags.DAG()
# add layer with 10 jobs
layer = dag.layer(
    name="CLs_layer",
    submit_description=sub,
    vars=[{"MU": mu} for mu in np.linspace(0, 5, 10)],
)
# write out the DAG to current directory
dagfile = dags.write_dag(dag, ".")
Once the DAG is created, you can either submit it as normal with condor_submit_dag from the terminal, or submit it via Python using the instructions here.
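If you want to do that submission step from Python as well, a minimal sketch (assuming a reasonably recent htcondor module that provides Submit.from_dag; older versions use the transaction-based submit API instead) could look like:
import htcondor

# build a submit object from the DAG file written out above
# (dagfile is the path returned by dags.write_dag in the snippets above)
dag_submit = htcondor.Submit.from_dag(str(dagfile), {"force": 1})

# hand it to the local schedd
schedd = htcondor.Schedd()
submit_result = schedd.submit(dag_submit)
print("DAGMan job cluster:", submit_result.cluster())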
Note: your +JobFlavour value for the submit file will actually get converted to MY.JobFlavour in the file that gets created, but that's ok and means the same thing.
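As a sketch of the other half of your workflow (the contents of CLs_test1point.py are not shown in the question, so the argument handling and output names here are assumptions): each job could read its mu value from the command line, run a single hypotest, and write its own pickle, which you then merge on the submit side once the DAG has finished.
# CLs_test1point.py (sketch): run one hypotest for the mu passed as the third argument
import json
import pickle
import sys

import pyhf

cluster_id, proc_id, mu = sys.argv[1], sys.argv[2], float(sys.argv[3])

# build the model and data from the transferred workspace file
workspace = pyhf.Workspace(json.load(open("LHpruned_NEW_32.json")))
model = workspace.model()
data = workspace.data(model)

result = pyhf.infer.hypotest(mu, data, model)
with open(f"CLs_{proc_id}.pkl", "wb") as f:
    pickle.dump({"mu": mu, "CLs": result}, f)
After all 10 jobs have returned their output files, a few lines of glob.glob("CLs_*.pkl") plus pickle.load on the submit machine are enough to combine them into a single pickle.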

Related

Can I create an Airflow DAG dynamically using the REST API?

Is it possible to create an Airflow DAG programmatically, by using just the REST API?
Background
We have a collection of models, each model consists of:
A collection of SQL files that need to be run for the model
We also keep a JSON file for each model which defines the dependencies between each SQL file.
The scripts are run through a Python job.py file that takes a script file name as a parameter.
Our models are updated by many individuals, so we need to update our DAG daily. What we have done is create a scheduled Python script that reads all the JSON files and, for each model, creates an in-memory DAG that executes the model's SQL scripts according to the dependencies defined in the JSON config files. What we want is to recreate that DAG programmatically within Airflow, so it can be seen visually in the Airflow UI, and then execute it, rerun failures, etc.
I did some research, and per my understanding Airflow DAGs can only be created by using decorators on top of Python files. Is there another approach I missed that uses the REST API?
Here is an example of a JSON we have:
{
  "scripts": {
    "Script 1": {
      "script_task": "job.py",
      "script_params": {
        "param": "script 1.sql"
      },
      "dependencies": [
        "Script 2",
        "Script 3"
      ]
    },
    "Script 2": {
      "script_task": "job.py",
      "script_params": {
        "param": "script 2.sql"
      },
      "dependencies": [
        "Script 3"
      ]
    },
    "Script 3": {
      "script_task": "job.py",
      "script_params": {
        "param": "script 3.sql"
      },
      "dependencies": []
    }
  }
}
Airflow DAGs are Python objects, so you can create a DAG factory and use any external data source (a JSON/YAML file, a database, an NFS volume, ...) as the source for your DAGs.
Here are the steps to achieve your goal:
create a Python script in your dags folder (assume its name is dags_factory.py)
create a Python class or method which returns a DAG object (assume it is a method defined as create_dag(config_dict))
in the main part of the script, load your file (or any other external data source) and loop over the DAG configs; for each DAG:
# this step is very important to persist the created dag and add it to the dag bag
globals()[<dag id>] = create_dag(dag_config)
So, without going into the details of your JSON files: if you already have a script which creates the DAGs in memory, try to apply those steps, and you will find the created DAGs in the metadata database and the UI.
Here are some tips:
Airflow runs the DAG file processor every X seconds (configurable), so there is no need to use an API; instead, you can upload your files to S3/GCS or a git repository and load them in the main script before calling the create_dag method.
Try to improve your JSON schema; for example, scripts could be a proper array.
For the method create_dag, I will try to simplify the code (according to what I understood from your JSON file):
from datetime import datetime
from json import loads
from airflow import DAG
from airflow.operators.bash import BashOperator

def create_dag(dag_id, dag_conf) -> DAG:
    scripts = dag_conf["scripts"]
    tasks_dict = {}
    dag = DAG(dag_id=dag_id, start_date=datetime(2022, 1, 1), schedule_interval=None)  # configure your dag
    for script_name, script_conf in scripts.items():
        task = BashOperator(
            # task_id is required; task ids may not contain spaces
            task_id=script_name.replace(" ", "_"),
            bash_command=f"python {script_conf['script_task']} "
                         + " ".join(f"{k}={v}" for k, v in script_conf["script_params"].items()),
            dag=dag,
        )
        tasks_dict[script_name] = {
            "task": task,
            "dependencies": script_conf["dependencies"],
        }
    for task_conf in tasks_dict.values():
        for dependency in task_conf["dependencies"]:
            task_conf["task"] << tasks_dict[dependency]["task"]  # if you mean the inverse, you can replace << by >>
    return dag
# note: this must run at import time, because Airflow imports DAG files as modules
# (code under an `if __name__ == '__main__':` guard would never execute here)
# create a loop if you have multiple files
# you can load the files from git or S3, I use local storage for testing
dag_conf_file = open("dag.json", "r")
dag_conf_dict = loads(dag_conf_file.read())
dag_id = "test_dag"  # read it from the file
globals()[dag_id] = create_dag(dag_id, dag_conf_dict)
P.S.: if you create a large number of DAGs in the same script (one script processing multiple JSON files), you may run into performance issues, because the Airflow scheduler and workers re-run the script for each task operation, so you will need to improve it using the magic loop pattern or the new syntax added in 2.4.
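If I recall correctly, the 2.4 syntax referred to above means DAGs created inside a with block are registered automatically, so the globals() trick is no longer needed; a rough sketch (Airflow >= 2.4 assumed, dependency wiring omitted for brevity) would be:
# dags_factory.py (sketch for Airflow >= 2.4)
from datetime import datetime
from json import loads

from airflow import DAG
from airflow.operators.bash import BashOperator

for config_path in ["dag.json"]:  # loop over however many config files you have
    dag_conf = loads(open(config_path).read())
    # DAGs created in a context manager are auto-registered in Airflow 2.4+
    with DAG(dag_id=f"model_{config_path.split('.')[0]}",
             start_date=datetime(2022, 1, 1),
             schedule_interval=None):
        for script_name, script_conf in dag_conf["scripts"].items():
            BashOperator(
                task_id=script_name.replace(" ", "_"),
                bash_command=f"python {script_conf['script_task']}",
            )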

airflow upgrade 2.0 kubernetes_pod_operator not working

I upgraded my Airflow to 2.0. After upgrading, my kubernetes_pod_operator is not working and gives me the following error. How do I fix this upgrade issue? What do I need to change in the code to make it work in Airflow 2.0?
Error:
Broken DAG: [/home/airflow/gcs/dags/daily_data_dag.py] Traceback (most recent call last):
File "/opt/python3.8/lib/python3.8/site-packages/airflow/models/baseoperator.py", line 178, in apply_defaults
result = func(self, *args, **kwargs)
File "/opt/python3.8/lib/python3.8/site-packages/airflow/models/baseoperator.py", line 506, in __init__
raise AirflowException(
airflow.exceptions.AirflowException: Invalid arguments were passed to KubernetesPodOperator (task_id: snowpack_daily_data_pipeline_16_11_2021). Invalid arguments were:
**kwargs: {'email_on_success': True}
Code:
import datetime

from airflow import models
from airflow.contrib.operators import kubernetes_pod_operator
from kubernetes.client import models as k8s

# DEFINE VARS HERE:
dag_name = "daily_data_pipeline"
schedule_interval = '@daily'
email = ["xxx@gmail.com"]
# get this from Gitlab
docker_image = "registry.gitlab.com/xxxx:dev"

default_dag_args = {
    # The start_date describes when a DAG is valid / can be run. Set this to a
    # fixed point in time rather than dynamically, since it is evaluated every
    # time a DAG is parsed. See:
    # https://airflow.apache.org/faq.html#what-s-the-deal-with-start-date
    'start_date': datetime.datetime(2021, 6, 26),
    'depends_on_past': False,
    'max_active_runs': 1,
    # 'concurrency': 1
}

# Define a DAG (directed acyclic graph) of tasks.
# Any task you create within the context manager is automatically added to the
# DAG object.
with models.DAG(
        dag_name,
        schedule_interval=schedule_interval,
        default_args=default_dag_args,
        catchup=False) as dag:

    kubernetes_min_pod = kubernetes_pod_operator.KubernetesPodOperator(
        # The ID specified for the task.
        task_id=f'{dag_name}_{datetime.datetime.now().strftime("%d_%m_%Y")}',
        # Name of task you want to run, used to generate Pod ID.
        name=dag_name,
        # Entrypoint of the container, if not specified the Docker container's
        # entrypoint is used. The cmds parameter is templated.
        # cmds=['echo'],
        # The namespace to run within Kubernetes, default namespace is
        # `default`. There is the potential for the resource starvation of
        # Airflow workers and scheduler within the Cloud Composer environment,
        # the recommended solution is to increase the amount of nodes in order
        # to satisfy the computing requirements. Alternatively, launching pods
        # into a custom namespace will stop fighting over resources.
        namespace='default',
        # Setup email on failure and success
        email_on_failure=True,
        email_on_success=True,
        email=email,
        # Docker image specified. Defaults to hub.docker.com, but any fully
        # qualified URLs will point to a custom repository. Supports private
        # gcr.io images if the Composer Environment is under the same
        # project-id as the gcr.io images and the service account that Composer
        # uses has permission to access the Google Container Registry
        # (the default service account has permission)
        image=docker_image,
        image_pull_secrets='gitlab',
        image_pull_policy='Always',
        # if you have a larger image, you may need to increase the default from 120
        startup_timeout_seconds=600)
The problem has to do with the email_on_success parameter: as you can see in the BaseOperator documentation, only email_on_retry and email_on_failure are supported.
If you need to send email on success, you may use on_success_callback. Please consider, for instance, this example obtained from this GitHub issue:
from airflow.models import TaskInstance
from airflow.utils.email import send_email

def on_success_callback(context):
    # EMAIL_LIST is defined elsewhere in the original example
    ti: TaskInstance = context["ti"]
    dag_id = ti.dag_id
    task_id = ti.task_id
    msg = "DAG succeeded"
    subject = f"Success {dag_id}.{task_id}"
    send_email(to=EMAIL_LIST, subject=subject, html_content=msg)
Please note that the callback in that example is defined at the DAG level. The Airflow documentation provides several additional examples.
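(As an aside, if you want one notification per DAG run rather than per task, a sketch of attaching the callback at the DAG level, assuming the callback function is defined before the DAG is created, would be:)
with models.DAG(
        dag_name,
        schedule_interval=schedule_interval,
        default_args=default_dag_args,
        catchup=False,
        # fires once when the whole DAG run succeeds
        on_success_callback=notify_successful_execution) as dag:
    ...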
In your specific use case, it could look like the following:
import datetime

from airflow import models
from airflow.contrib.operators import kubernetes_pod_operator
from airflow.utils.email import send_email
from kubernetes.client import models as k8s

# DEFINE VARS HERE:
dag_name = "daily_data_pipeline"
schedule_interval = '@daily'
email = ["xxx@gmail.com"]
# get this from Gitlab
docker_image = "registry.gitlab.com/xxxx:dev"

default_dag_args = {
    # The start_date describes when a DAG is valid / can be run. Set this to a
    # fixed point in time rather than dynamically, since it is evaluated every
    # time a DAG is parsed. See:
    # https://airflow.apache.org/faq.html#what-s-the-deal-with-start-date
    'start_date': datetime.datetime(2021, 6, 26),
    'depends_on_past': False,
    'max_active_runs': 1,
    # 'concurrency': 1
}

# Define a DAG (directed acyclic graph) of tasks.
# Any task you create within the context manager is automatically added to the
# DAG object.
with models.DAG(
        dag_name,
        schedule_interval=schedule_interval,
        default_args=default_dag_args,
        catchup=False) as dag:

    def notify_successful_execution(context):
        # Access the information provided in context if required
        send_email(
            to=email,
            subject='Successful execution',
            html_content='The process was executed successfully'
        )

    kubernetes_min_pod = kubernetes_pod_operator.KubernetesPodOperator(
        # The ID specified for the task.
        task_id=f'{dag_name}_{datetime.datetime.now().strftime("%d_%m_%Y")}',
        # Name of task you want to run, used to generate Pod ID.
        name=dag_name,
        # Entrypoint of the container, if not specified the Docker container's
        # entrypoint is used. The cmds parameter is templated.
        # cmds=['echo'],
        # The namespace to run within Kubernetes, default namespace is
        # `default`. There is the potential for the resource starvation of
        # Airflow workers and scheduler within the Cloud Composer environment,
        # the recommended solution is to increase the amount of nodes in order
        # to satisfy the computing requirements. Alternatively, launching pods
        # into a custom namespace will stop fighting over resources.
        namespace='default',
        # Setup email on failure and success
        email_on_failure=True,
        # Comment the following parameter, it is unsupported
        # email_on_success=True,
        email=email,
        # Docker image specified. Defaults to hub.docker.com, but any fully
        # qualified URLs will point to a custom repository. Supports private
        # gcr.io images if the Composer Environment is under the same
        # project-id as the gcr.io images and the service account that Composer
        # uses has permission to access the Google Container Registry
        # (the default service account has permission)
        image=docker_image,
        image_pull_secrets='gitlab',
        image_pull_policy='Always',
        # if you have a larger image, you may need to increase the default from 120
        startup_timeout_seconds=600,
        # Use on success callback instead for your email notifications
        on_success_callback=notify_successful_execution)

Cloud Composer can't render dynamic DAG in webserver UI: "DAG seems to be missing"

I'm trying to make a DAG that has 2 operators created dynamically, depending on the number of "pipelines" that a JSON config file has. This file is stored in the variable dag_datafusion_args. Then I have a standard bash operator, and a task called success at the end that sends a message to Slack saying that the DAG is done. The other 2 tasks, which are Python operators, are generated dynamically and run in parallel. I'm using Composer; when I put the DAG in the bucket it appears on the webserver UI, but when I click to see the DAG the following message appears: 'DAG "dag_lucas4" seems to be missing.' If I test the tasks directly via the CLI on the Kubernetes cluster it works, but I can't get the DAG to appear in the web UI. I tried, as some people here on SO suggested, to restart the webserver by installing a Python package; I tried 3 times but without success. Does anyone know what it can be?
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.operators.python_operator import PythonOperator
from aux_py_files.med.med_airflow_functions import *
from google.cloud import storage
from datetime import timedelta

TEMPLATE_SEARCH_PATH = '/home/airflow/gcs/plugins/'
INDEX = 1

default_args = {
    'owner': 'lucas',
    'start_date': '2021-01-10',
    'email': ['xxxx'],
    'email_on_failure': False,
    'email_on_success': False,
    'retries': 3,
    'retry_delay': timedelta(minutes=2),
    'on_failure_callback': post_message_fail_to_slack
}

dag_datafusion_args = return_datafusion_config_file('med')

with DAG('dag_lucas4', default_args=default_args, schedule_interval="30 23 * * *", template_searchpath=[TEMPLATE_SEARCH_PATH]) as dag:

    extract_ftp_csv_files_load_in_gcs = BashOperator(
        task_id='extract_ftp_csv_files_load_in_gcs',
        bash_command='aux_sh_files/med/script.sh'
    )

    success = PythonOperator(
        task_id='success',
        python_callable=post_message_success_to_slack,
        op_kwargs={'dag_name': 'dag_lucas2'}
    )

    for pipeline, args in dag_datafusion_args.items():

        configure_pipeline = PythonOperator(
            task_id=f'configure_pipeline{str(INDEX)}',
            python_callable=setPipelineArguments,
            op_kwargs={'dag_name': 'med', 'pipeline_name': pipeline},
            provide_context=True
        )

        start_pipeline = PythonOperator(
            task_id=f'start_pipeline{str(INDEX)}',
            python_callable=start_pipeline_wrapper,
            op_kwargs={'configure_pipeline_task': f'configure_pipeline{str(INDEX)}'},
            retries=3,
            provide_context=True
        )

        [extract_ftp_csv_files_load_in_gcs, configure_pipeline] >> start_pipeline >> success
        INDEX += 1
It appears that the Airflow webserver in Cloud Composer runs in the tenant project, while the workers and scheduler run in the customer project. The tenant project is simply a Google-managed environment for some of the Airflow components, so the webserver UI does not have complete access to your project's resources, because it does not run inside your project's environment. That is why it cannot read the config JSON file via return_datafusion_config_file. The best way is to create an environment variable (or an Airflow Variable) with that file's content.
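For example, a minimal sketch of that approach, assuming the JSON config is stored in an Airflow Variable named dag_datafusion_args (a hypothetical name) instead of being read from a file at parse time:
from airflow.models import Variable

# Variables live in the Airflow metadata database, which the webserver can
# read, unlike files that only exist in the workers'/scheduler's environment
dag_datafusion_args = Variable.get("dag_datafusion_args", deserialize_json=True)
You can populate the Variable once from your JSON file via the Airflow UI (Admin -> Variables) or the CLI.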

Django Scheduled task using Django-q

I'm trying to run a scheduled task using Django-q. I followed the docs, but it's not running.
Here's my config:
CACHES = {
    'default': {
        'BACKEND': 'django.core.cache.backends.db.DatabaseCache',
        'LOCATION': 'db_cache_table',
    }
}

Q_CLUSTER = {
    'name': 'DjangORM',
    'workers': 4,
    'timeout': 90,
    'retry': 120,
    'queue_limit': 50,
    'bulk': 10,
    'orm': 'default'
}
Here's my scheduled task:
Nothing is executing. Please help.
I also had problems with getting scheduled tasks processed in the first place, but finally found a workflow.
I run django-q on a Windows machine, using the Django ORM as a broker.
Before talking about the execution routine I came up with, let's quickly check out my modules first, starting with ...
settings.py:
Q_CLUSTER = {
    "name": "austrian_energy_monthly",
    "workers": 1,
    "timeout": 10,
    "retry": 20,
    "queue_limit": 50,
    "bulk": 10,
    "orm": "default",
    "ack_failures": True,
    "max_attempts": 1,
    "attempt_count": 0,
}
.. and my folder structure:
As you can see, the folder of my Django project is inside the src folder. Further, there's a folder for the app I created for this project, which is simply called "app". Inside the app folder I have another folder called "cron", which includes the following files and functions related to the scheduling:
tasks.py
I do not use the schedule() method provided by django-q; instead I create the Schedule table entries directly (see: the django-q official schedule docs)
from django.utils import timezone
from austrian_energy_monthly.app.cron.func import create_text_file
from django_q.models import Schedule

Schedule.objects.create(
    func="austrian_energy_monthly.app.cron.func.create_text_file",
    kwargs={"content": "Insert this into a text file"},
    hooks="austrian_energy_monthly.app.cron.hooks.print_result",
    name="Text file creation process",
    schedule_type=Schedule.ONCE,
    next_run=timezone.now(),
)
Make sure you assign the "right" path to the "func" keyword. Just using "func.create_text_file" didn't work out for me, even though these files are in the same folder. The same goes for the "hooks" keyword.
(NOTE: I've set up my project as a development package via setup.py, so that I can call it from everywhere inside my src folder.)
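(For completeness, a roughly equivalent sketch using the schedule() helper that django-q also provides; the same dotted-path rule applies there, and extra keyword arguments are forwarded to the scheduled function:)
from django.utils import timezone
from django_q.tasks import schedule
from django_q.models import Schedule

# one-off schedule via the helper; content=... is passed to create_text_file
schedule(
    "austrian_energy_monthly.app.cron.func.create_text_file",
    name="Text file creation process (helper)",
    schedule_type=Schedule.ONCE,
    next_run=timezone.now(),
    content="Insert this into a text file",
)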
func.py:
Contains the function called by the schedule table object.
def create_text_file(content: str) -> str:
    file = open("copy.txt", "w")
    file.write(content)
    file.close()
    return "Created a text file"
hooks.py:
Contains the function called after the scheduled process finished.
def print_result(task):
    print(task.result)
Let's now see how I managed to get the executions running with the file examples described above:
First I scheduled the "Text file creation process". To do so, I used "python manage.py shell" and imported the tasks.py module (you could probably schedule everything via the admin page as well, but I haven't tested that yet):
You can now see the scheduled task, with a question mark in the success column, on the admin page (tab "Scheduled tasks", as in your picture):
After that I opened a new terminal and started the cluster with "python manage.py qcluster", resulting in the following output in the terminal:
The successful execution can be inspected by looking at "13:22:17 [Q] INFO Processed [ten-virginia-potato-high]", alongside the hook print statement "Created a text file" in the terminal. You can also check it on the admin page, under the "Successful Tasks" tab, where you should see:
Hope that helped!
Django-q doesn't support Windows. :)

Python Windows task to run with admin

I am looking for some Python code to create a Windows task in the Task Scheduler; it needs to run at startup and have the highest permission level (admin).
Uses the new COM Task Scheduler Interface to create a new disabled scheduled task, then run it once as part of a script. Use this to launch interactive tasks, even remotely.
import win32com.client
computer_name = "" #leave all blank for current computer, current user
computer_username = ""
computer_userdomain = ""
computer_password = ""
action_id = "Test Task" #arbitrary action ID
action_path = r"c:\windows\system32\calc.exe" #executable path (could be python.exe)
action_arguments = r'' #arguments (could be something.py)
action_workdir = r"c:\windows\system32" #working directory for action executable
author = "Someone" #so that end users know who you are
description = "testing task" #so that end users can identify the task
task_id = "Test Task"
task_hidden = False #set this to True to hide the task in the interface
username = ""
password = ""
run_flags = "TASK_RUN_NO_FLAGS" #see dict below, use in combo with username/password
#define constants
TASK_TRIGGER_DAILY = 2
TASK_CREATE = 2
TASK_CREATE_OR_UPDATE = 6
TASK_ACTION_EXEC = 0
IID_ITask = "{148BD524-A2AB-11CE-B11F-00AA00530503}"
RUNFLAGSENUM = {
    "TASK_RUN_NO_FLAGS": 0,
    "TASK_RUN_AS_SELF": 1,
    "TASK_RUN_IGNORE_CONSTRAINTS": 2,
    "TASK_RUN_USE_SESSION_ID": 4,
    "TASK_RUN_USER_SID": 8
}
#connect to the scheduler (Vista/Server 2008 and above only)
scheduler = win32com.client.Dispatch("Schedule.Service")
scheduler.Connect(computer_name or None, computer_username or None, computer_userdomain or None, computer_password or None)
rootFolder = scheduler.GetFolder("\\")
#(re)define the task
taskDef = scheduler.NewTask(0)
colTriggers = taskDef.Triggers
trigger = colTriggers.Create(TASK_TRIGGER_DAILY)
trigger.DaysInterval = 100
trigger.StartBoundary = "2100-01-01T08:00:00-00:00" #never start
trigger.Enabled = False
colActions = taskDef.Actions
action = colActions.Create(TASK_ACTION_EXEC)
action.ID = action_id
action.Path = action_path
action.WorkingDirectory = action_workdir
action.Arguments = action_arguments
info = taskDef.RegistrationInfo
info.Author = author
info.Description = description
settings = taskDef.Settings
settings.Enabled = False
settings.Hidden = task_hidden
#register the task (create or update, just keep the task name the same)
result = rootFolder.RegisterTaskDefinition(task_id, taskDef, TASK_CREATE_OR_UPDATE, "", "", RUNFLAGSENUM[run_flags] ) #username, password
#run the task once
task = rootFolder.GetTask(task_id)
task.Enabled = True
runningTask = task.Run("")
task.Enabled = False
This code creates a task to run daily, but not at login & not as admin. Is there any way I can do this without requiring the UAC Prompt to open at startup?
EDIT: I am NOT looking to just make the program ask for administrator, as the prompt will pop up, as specified above. I need the task to have the highest execution level in the Windows Task Scheduler and to run at logon.
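(For reference, a minimal, untested sketch of how the COM-based script above could be adapted for this, assuming the standard Task Scheduler 2.0 constants TASK_TRIGGER_LOGON = 9 and TASK_RUNLEVEL_HIGHEST = 1:)
# sketch: switch to a logon trigger and request the highest run level
TASK_TRIGGER_LOGON = 9       # fire at user logon
TASK_RUNLEVEL_HIGHEST = 1    # "Run with highest privileges"

trigger = taskDef.Triggers.Create(TASK_TRIGGER_LOGON)
trigger.Enabled = True

principal = taskDef.Principal
principal.RunLevel = TASK_RUNLEVEL_HIGHEST

# then register with rootFolder.RegisterTaskDefinition(...) exactly as before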
I think it would be a good idea for you to build a batch file that runs Python with your script. Having done that, there are a few things you might want to try. I haven't personally had the time to try them out, but here you go:
Compile the .BAT file into an .exe; once you do that, I think you should be able to bring the Task Scheduler into the mix and get your code running whenever you need.
If nothing works, you can always mess around with your Windows registry to allow your script to bypass the prompt, using the Windows Registry Editor.
Do look it up before you execute it, as this is code I found elsewhere, but it should do the trick:
[HKEY_CURRENT_USER\Software\Microsoft\Windows\CurrentVersion\Group Policy Objects\{E2F13B98-650F-47DB-845A-420A1ED34EC7}User\Software\Microsoft\Windows\CurrentVersion\Policies\Associations]
"LowRiskFileTypes"=".exe;.bat;.cmd;.vbs"
[HKEY_CURRENT_USER\Software\Microsoft\Windows\CurrentVersion\Policies\Associations]
"LowRiskFileTypes"=".exe;.bat;.cmd;.vbs"
After searching Google more thoroughly, I came across a command that does exactly what I need.
schtasks.exe /create /S COMPUTERNAME /RU "NT AUTHORITY\SYSTEM" /RL HIGHEST /SC ONLOGON /TN "Administrative OnLogon Script" /TR "cscript.exe \"Path\To\Script.vbs\""
All I need to do is replace the computer name and the program to execute, and it creates the task perfectly. I can run it from an elevated Python environment to create the task.
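For example, a short sketch of wrapping that exact command in Python with subprocess (the computer name, task name and script path below are the placeholders from the command above):
import subprocess

# assemble the schtasks command shown above; replace COMPUTERNAME and the
# script path with your own values
cmd = [
    "schtasks.exe", "/create",
    "/S", "COMPUTERNAME",
    "/RU", r"NT AUTHORITY\SYSTEM",
    "/RL", "HIGHEST",
    "/SC", "ONLOGON",
    "/TN", "Administrative OnLogon Script",
    "/TR", r'cscript.exe "Path\To\Script.vbs"',
]

# run from an elevated Python process; check=True raises if schtasks fails
subprocess.run(cmd, check=True)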
