How to run/instantiate an Airflow DAG in parallel for multiple categories?
For example:
I have an Airflow DAG which I run on a regular basis.
How can I schedule the DAG to run for different batch names in parallel:
run the DAG for batch1 (pass the batch name in args)
run the DAG for batch2 (pass the batch name in args), which should run in parallel with batch1
...and so on.
I used an environment variable to pass the batch names and then ran the DAG in parallel using multiple tmux sessions on the server, but it got messed up.
Is there a better approach I could use to save time and run the DAG for multiple batch names in parallel?
Thanks for your time.
Since Airflow runs Python classes representing graphs of bash-shell commands, you can do this within Airflow by creating two independent tasks, one per batch. Here's a slight modification to the tutorial:
from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

templated_command = 'echo "processing {{ params.batch_name }}"'

dag = DAG(dag_id='batch', start_date=datetime(2024, 1, 1), schedule_interval=None)

tasks = [
    BashOperator(
        task_id='templated_' + batch_name.replace(' ', '_'),  # task_ids must be unique
        bash_command=templated_command,
        params={'batch_name': batch_name},
        dag=dag,  # passing dag=dag already registers the task; no add_task call needed
    )
    for batch_name in ["batch one", "batch two"]
]
Since there is no dependency between the tasks, they should run in parallel as long as Airflow has been set up that way. If you need to set a shell environment variable, add VAR={{ params.batch_name }} somewhere in the template.
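For example, a minimal sketch of that idea (process_batch.sh is a hypothetical script that reads the variable):

templated_command = 'BATCH_NAME={{ params.batch_name }} ./process_batch.sh'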
Assuming your program uses sys.argv, you could also use normal job control to launch:
python ~/airflow/dags/tutorial.py "batch one" &
python ~/airflow/dags/tutorial.py "batch two" &
wait
I wrote the code below, but it runs endlessly in Airflow, whereas on my own machine it takes about 5 minutes to run:
import pygsheets

gc = pygsheets.authorize(service_account_file='file.json')
sh3 = gc.open("city")                  # open the spreadsheet
wks3 = sh3.worksheet_by_title("test")  # select the worksheet
df = wks3.get_as_df()                  # read the current contents
df2 = demo_r                           # demo_r is a DataFrame built elsewhere in the script
wks3.clear()                           # wipe the sheet
wks3.set_dataframe(df2, (1, 1))        # write the new DataFrame starting at A1
Answering just the question in the title because we can't do anything about your code without more details (stack trace/full code sample/infra setup/etc).
Airflow is a Python framework and will run any code you give it. So there is no difference between a Python script run via an Airflow task or just on your laptop -- the same lines of code will be executed. However, do note that Airflow runs Python code in a separate process, and possibly on different machines, depending on your chosen executor. Airflow registers metadata in a database and manages logfiles from your tasks, so there's more happening around your task when you execute it in Airflow.
I have three DAGs (say, DAG1, DAG2, and DAG3). I have a monthly schedule for DAG1. DAG2 and DAG3 must not be run directly (no schedule for these) and must run only when DAG1 has completed successfully. That is, once DAG1 is complete, DAG2 and DAG3 need to start in parallel.
What is the best mechanism to do this? I came across the TriggerDagRunOperator and ExternalTaskSensor options. I want to understand the pros and cons of each and which one is best. I see a few questions around these; however, I am trying to find the answer for the latest stable Airflow version.
ExternalTaskSensor is not relevant for your use case as none of the DAGs you mention needs to wait for another DAG.
You need to add a TriggerDagRunOperator to the code of DAG1 that will trigger the DAG runs for DAG2 and DAG3.
A skeleton of the solution would be:
from datetime import datetime
from airflow import DAG
from airflow.operators.dummy import DummyOperator
from airflow.operators.trigger_dagrun import TriggerDagRunOperator

dag2 = DAG(dag_id="DAG2", schedule_interval=None, start_date=datetime(2023, 1, 1))
dag3 = DAG(dag_id="DAG3", schedule_interval=None, start_date=datetime(2023, 1, 1))

with DAG(dag_id="DAG1", schedule_interval="@monthly", start_date=datetime(2023, 1, 1)) as dag1:
    op_first = DummyOperator(task_id="first")  # replace with the operators of your DAG
    op_trig2 = TriggerDagRunOperator(task_id="trigger_dag2", trigger_dag_id="DAG2")
    op_trig3 = TriggerDagRunOperator(task_id="trigger_dag3", trigger_dag_id="DAG3")
    op_first >> [op_trig2, op_trig3]
Edit:
After discussing in the comments, and since you mentioned you cannot edit DAG1 because it's someone else's code, your best option is ExternalTaskSensor. You will have to set DAG2 and DAG3 to start on the same schedule as DAG1, and they will need to constantly poke DAG1 until it finishes. It will work, just not very efficiently.
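A rough sketch of that fallback for DAG2 (DAG3 would look the same); the schedule and start_date here are assumptions and must match DAG1's so the execution dates line up:

from datetime import datetime
from airflow import DAG
from airflow.operators.dummy import DummyOperator
from airflow.sensors.external_task import ExternalTaskSensor

with DAG(dag_id="DAG2", schedule_interval="@monthly", start_date=datetime(2023, 1, 1)) as dag2:
    wait_for_dag1 = ExternalTaskSensor(
        task_id="wait_for_dag1",
        external_dag_id="DAG1",
        external_task_id=None,  # None means wait for the whole DAG1 run, not a single task
        poke_interval=300,      # re-check every 5 minutes
    )
    do_work = DummyOperator(task_id="do_work")  # replace with DAG2's real tasks

    wait_for_dag1 >> do_work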
I have three Python scripts, 1.py, 2.py, and 3.py, each of which takes 3 runtime arguments.
All three Python programs are independent of each other. All three may run sequentially in a batch, or it may happen that only two of them run, depending on some configuration.
Manual approach:
Create an EC2 instance, run the Python script, shut it down.
Repeat the above step for the next Python script.
The automated way would be to trigger the above process through Lambda and replicate it using some combination of services.
What is the best way to implement this in AWS?
AWS Batch has a DAG scheduler; technically you could define job1, job2, and job3 and tell AWS Batch to run them in that order. But I wouldn't recommend that route.
For the above to work you would basically need to create 3 Docker images (image1, image2, image3) and then put these in ECR (Docker Hub can also work if not using the Fargate launch type).
I don't think that makes sense unless each job is bulky and has its own runtime that's different from the others.
Instead, I would write a Python program that calls 1.py, 2.py, and 3.py, put that in a Docker image, and run it as an AWS Batch job or just an ECS Fargate task.
main.py:
import subprocess

# 1.py shares main.py's stdout and stderr, so with Batch and Fargate you can
# retrieve its output from CloudWatch Logs
exit_code = subprocess.call("python3 /path/to/1.py", shell=True)

# decide, based on exit_code and your configuration, whether to call 2.py and so on ...
Now you have a Docker image that just needs to run somewhere. Fargate is fast to start up, a bit pricey, and has a 10 GB limit on temporary storage. AWS Batch is slow to start up on a cold start, but it can use Spot Instances in your account. You might need to build a custom AMI for AWS Batch to work, e.g. if you want more storage.
Note: for anyone who wants to scream about shell=True, both main.py and 1.py come from the same codebase. It's a batch job, not an internet-facing API passing that string in from a user request.
You can run your EC2 instance via a Python script, using the AWS boto3 library (https://aws.amazon.com/sdk-for-python/). So a possible solution would be to trigger a Lambda function periodically (you can use Amazon CloudWatch for periodic events), and inside that function you can boot up your EC2 instance using a Python script.
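A minimal sketch of such a Lambda handler, assuming boto3 is available and the instance already exists (the region and instance ID below are placeholders):

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")  # placeholder region

def lambda_handler(event, context):
    # start the stopped instance that runs the first script on boot
    ec2.start_instances(InstanceIds=["i-0123456789abcdef0"])  # placeholder instance ID
    return {"status": "started"}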
In your instance you can configure your OS to run a Python script every time it boots up; I would suggest using crontab (see this link: https://www.instructables.com/id/Raspberry-Pi-Launch-Python-script-on-startup/).
At the end of your script, you can send an Amazon SQS message to a function that will shut down your first instance and then call another function that will start the second script.
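For the end-of-script notification, something along these lines could work (a sketch; the queue URL is a placeholder):

import boto3

sqs = boto3.client("sqs", region_name="us-east-1")  # placeholder region
sqs.send_message(
    QueueUrl="https://sqs.us-east-1.amazonaws.com/123456789012/script-finished",  # placeholder queue
    MessageBody="1.py finished; shut this instance down and start the next one",
)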
You could use meadowrun - disclaimer: I am one of the maintainers, so I am obviously biased.
Meadowrun is a Python library/tool that manages EC2 instances for you, moves Python code and its environment dependencies to them, and runs a function without any hassle.
For example, you could put your scripts in a Git repo and run them like so:
import asyncio

from meadowrun import AllocCloudInstance, Deployment, run_function
from script_1 import run_1


async def main():
    results = await run_function(
        # the function to run on the EC2 instance
        lambda: run_1(arguments),  # 'arguments' stands in for 1.py's runtime args
        # properties of the VM that runs the function
        AllocCloudInstance(
            logical_cpu_required=2,
            memory_gb_required=16,
            interruption_probability_threshold=15,
            cloud_provider="EC2",
        ),
        # code + env to deploy on the VM; there are other options here
        Deployment.git_repo(
            "https://github.com/someuser/somerepo",
            conda_yml_file="env.yml",
        ),
    )


asyncio.run(main())
It will then create an EC2 instance with the given requirements for you (or reuse one if it's already there, which could be useful for running your scripts in sequence), set up the Python code and environment there, run the function, and return any results and output.
For 2022, depending on your infrastructure constraints, I'd say the easiest way would be to set the scripts up as Lambda functions and then call them from CloudWatch with the required parameters (create a rule):
https://docs.aws.amazon.com/AmazonCloudWatch/latest/events/RunLambdaSchedule.html
That way you can configure them to run independently or sequentially, without having to worry about setting up the infrastructure and turning it on and off.
This applies to scripts that are not too resource-intensive and that don't run for more than 15 minutes at a time (the Lambda time limit).
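A minimal sketch of one script wrapped as a Lambda handler; the argument names are placeholders and would come from the CloudWatch rule's constant JSON input:

def lambda_handler(event, context):
    # the three runtime arguments arrive in the event payload
    arg1 = event.get("arg1")
    arg2 = event.get("arg2")
    arg3 = event.get("arg3")
    # ... the body of 1.py goes here, using arg1/arg2/arg3 ...
    return {"status": "done", "args": [arg1, arg2, arg3]}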
I have a bunch of .py scripts as part of a project. Some of them I want to start and have running in the background whilst the others run through what they need to do.
For example, I have a script which takes a screenshot every 10 seconds until the script is closed, and I wish to have this running in the background whilst the other scripts get called and run through till they finish.
Another example is a script which calculates the hash of every file in a designated folder. This has the potential to run for a fair amount of time, so it would be good if the rest of the scripts could be kicked off at the same time, so they do not have to wait for the hash script to finish before they are invoked.
Is multiprocessing the right method for this kind of processing, or is there another way to achieve these results that would be better, such as this answer: Run multiple python scripts concurrently?
You could also use something like Celery to run the tasks asynchronously, and you'll be able to call tasks from within your Python code instead of through the shell.
It depends. With multiprocessing you can create a process manager, so it can spawn the processes the way you want, but there are more flexible ways to do it without coding. Multiprocessing is usually hard.
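If you did want to hand-roll it, a rough standard-library sketch could look like this (subprocess rather than multiprocessing, since the scripts are separate files; the script names are placeholders matching the examples below):

import subprocess

# start the long-running scripts in the background
background = [
    subprocess.Popen(["python", "snapshots.py"]),
    subprocess.Popen(["python", "hashing.py"]),
]

# run the remaining scripts to completion, one after another
for script in ["task_one.py", "task_two.py"]:
    subprocess.run(["python", script], check=True)

# stop the background scripts once the main work is done
for proc in background:
    proc.terminate()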
Check out circus, it's a process manager written in Python that you can use as a library, standalone or via remote API. You can define hooks to model dependencies between processes, see docs.
A simple configuration could be:
[watcher:one-shot-script]
cmd = python script.py
numprocesses = 1
warmup_delay = 30

[watcher:snapshots]
cmd = python snapshots.py
numprocesses = 1
warmup_delay = 30

[watcher:hash]
cmd = python hashing.py
numprocesses = 1
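Assuming that configuration is saved as circus.ini, you would then start everything with the circusd daemon:

circusd circus.ini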
I'm running a Python script manually that fetches data in JSON format. How do I automate this script to run on an hourly basis?
I'm working on Windows 7. Can I use tools like Task Scheduler? If I can use it, what do I need to put in the batch file?
Can I use tools like Task Scheduler?
Yes. Any tool that can run arbitrary programs can run your Python script. Pick the one you like best.
If I can use it, what do I need to put in the batch file?
What batch file? Task Scheduler takes anything that can be run, with arguments—a C program, a .NET program, even a document with a default app associated with it. So, there's no reason you need a batch file. Use C:\Python33\python.exe (or whatever the appropriate path is) as your executable, and your script's path (and its arguments, if any) as the arguments. Just as you do when running the script from the command line.
See Using the Task Scheduler in MSDN for some simple examples, and Task Scheduler Schema Elements or Task Scheduler Scripting Objects for reference (depending on whether you want to create the schedule in XML, or via the scripting interface).
You want to create an ExecAction with Path set to "C:\Python33\python.exe" and Arguments set to "C:\MyStuff\myscript.py", and a RepetitionPattern with Interval set to "PT1H". You should be able to figure out the rest from there.
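If you prefer a one-off command over the XML or scripting interfaces, the schtasks command-line tool can register an equivalent hourly task (the paths below are the same placeholder paths as above):

schtasks /Create /SC HOURLY /TN "FetchJsonHourly" /TR "C:\Python33\python.exe C:\MyStuff\myscript.py"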
As sr2222 points out in the comments, often you end up scheduling tasks frequently, and needing to programmatically control their scheduling. If you need this, you can control Task Scheduler's scripting interface from Python, or build something on top of Task Scheduler, or use a different tool that's a bit easier to get at from Python and has more helpful examples online, etc.—but when you get to that point, take a step back and look at whether you're over-using OS task scheduling. (If you start adding delays or tweaking times to make sure the daily foo1.py job never runs until 5 minutes after the most recent hourly foo0.py has finished its job, you're over-using OS task scheduling—but it's not always that obvious.)
May I suggest WinAutomation or AutoMate. These two do the exact same thing, except the UI is a little different. I prefer WinAutomation, because the scripts are a little easier to build.
Yes, you can use Task Scheduler to run the script on an hourly basis.
To execute a Python script via a batch file, use the following code:
start path_to_python_exe path_to_python_file
Example:
start C:\Users\harshgoyal\AppData\Local\Continuum\Anaconda3\python.exe %UserProfile%\Documents\test_script.py
If Python is on the Windows PATH environment variable, then you can reduce the syntax to:
start python %UserProfile%\Documents\test_script.py
What I generally do is run the batch file once via Task Scheduler, and within the Python script I use a thread/timer to re-run the work every hour.
class threading.Timer(interval, function, args=None, kwargs=None)
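For example, a minimal sketch of that pattern (fetch_json is a placeholder for the real work):

import threading

def fetch_json():
    print("fetching data ...")                 # replace with the real fetch logic
    threading.Timer(3600, fetch_json).start()  # schedule the next run in one hour

fetch_json()  # first run; each run schedules the next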