I have three DAGs (say, DAG1, DAG2 and DAG3). DAG1 has a monthly schedule. DAG2 and DAG3 must not be run directly (no schedule for these) and must run only when DAG1 has completed successfully. That is, once DAG1 is complete, DAG2 and DAG3 need to start in parallel.
What is the best mechanism to do this? I came across the TriggerDagRunOperator and ExternalTaskSensor options. I want to understand the pros and cons of each and which one is best. I see a few questions around these, but I am trying to find the answer for the latest stable Airflow version.
ExternalTaskSensor is not relevant for your use case as none of the DAGs you mention needs to wait for another DAG.
You need to add a TriggerDagRunOperator in the code of DAG1 that will trigger the DAG runs of DAG2 and DAG3.
A skeleton of the solution would be:
from datetime import datetime

from airflow import DAG
from airflow.operators.dagrun_operator import TriggerDagRunOperator
from airflow.operators.dummy_operator import DummyOperator

dag2 = DAG(dag_id="DAG2", schedule_interval=None, start_date=datetime(2021, 1, 1))
dag3 = DAG(dag_id="DAG3", schedule_interval=None, start_date=datetime(2021, 1, 1))

with DAG(dag_id="DAG1", schedule_interval="@monthly", start_date=datetime(2021, 1, 1)) as dag1:
    op_first = DummyOperator(task_id="first")  # replace with the operators of your DAG
    op_trig2 = TriggerDagRunOperator(task_id="trigger_dag2", trigger_dag_id="DAG2")
    op_trig3 = TriggerDagRunOperator(task_id="trigger_dag3", trigger_dag_id="DAG3")
    op_first >> [op_trig2, op_trig3]
Edit:
After discussing in the comments, and since you mentioned you cannot edit DAG1 because it is someone else's code, your best option is ExternalTaskSensor. You will have to set DAG2 and DAG3 to run on the same schedule as DAG1, and they will need to poke DAG1 repeatedly until it finishes. It will work, just not very efficiently.
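For completeness, a minimal sketch of that fallback, assuming DAG1's dag_id is "DAG1"; the external_task_id below is a placeholder for whatever DAG1's final task is actually called, and the sensor import path varies slightly between Airflow versions:

from datetime import datetime

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.sensors.external_task_sensor import ExternalTaskSensor

# DAG2 runs on the same monthly schedule as DAG1 so the execution dates line up,
# and its first task pokes DAG1 until that run has finished.
with DAG(dag_id="DAG2", schedule_interval="@monthly", start_date=datetime(2021, 1, 1)) as dag2:
    wait_for_dag1 = ExternalTaskSensor(
        task_id="wait_for_dag1",
        external_dag_id="DAG1",
        external_task_id="final_task",  # placeholder; use None to wait for the whole DAG run
        mode="reschedule",              # free the worker slot between pokes
        poke_interval=60,
    )
    do_work = DummyOperator(task_id="do_work")  # replace with DAG2's real tasks
    wait_for_dag1 >> do_work

DAG3 would be set up the same way.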
I was writing the code below, but it runs endlessly in Airflow, while on my system it takes 5 minutes to run:
import pygsheets

gc = pygsheets.authorize(service_account_file='file.json')
sh3 = gc.open("city")
wks3 = sh3.worksheet_by_title("test")
df = wks3.get_as_df()
df2 = demo_r  # demo_r is a DataFrame prepared elsewhere in the script
wks3.clear()
wks3.set_dataframe(df2, (1, 1))
Answering just the question in the title because we can't do anything about your code without more details (stack trace/full code sample/infra setup/etc).
Airflow is a Python framework and will run any code you give it. So there is no difference between a Python script run via an Airflow task or just on your laptop -- the same lines of code will be executed. However, do note that Airflow runs Python code in a separate process, and possibly on different machines, depending on your chosen executor. Airflow registers metadata in a database and manages logfiles from your tasks, so there's more happening around your task when you execute it in Airflow.
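One practical consequence is that everything the script needs (here the service-account JSON file) must be present, at the same path, on the machine where the Airflow task actually runs, not just on your laptop. A minimal sketch of wrapping such a script in a task, with illustrative DAG and task ids and the transformation step left as a comment:

from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator
import pygsheets

def update_sheet():
    # The same code that works locally; Airflow just runs it in a worker process.
    gc = pygsheets.authorize(service_account_file='file.json')  # path must exist on the worker
    wks = gc.open("city").worksheet_by_title("test")
    df = wks.get_as_df()
    # ... build the DataFrame you actually want to write back (df2 in the question) ...
    wks.clear()
    wks.set_dataframe(df, (1, 1))

with DAG(dag_id="update_city_sheet", schedule_interval="@daily",
         start_date=datetime(2021, 1, 1), catchup=False) as dag:
    PythonOperator(task_id="update_sheet", python_callable=update_sheet)

If the task hangs in Airflow but not locally, the usual suspects are credentials or network access that differ on the worker.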
I could not find in the Airflow docs how to set up the retention policy I need.
At the moment we keep all Airflow logs forever on our servers, which is not the best way to go.
I wish to create a global logging configuration for all the different logs I have.
How and where do I configure:
Number of days to keep
Max file size
I ran into the same situation yesterday; the solution for me was to use a DAG that handles all the log cleanup and to schedule it like any other DAG.
Check this repo; you will find a step-by-step guide on how to set it up. Basically, what you will achieve is deleting files located in airflow-home/log/ and airflow-home/log/scheduler based on a period defined in a Variable. The DAG dynamically creates one task for each directory targeted for deletion, based on that definition.
In my case, the only modification I made to the original DAG was to allow deletion only in the scheduler folder, by replacing the initial value of DIRECTORIES_TO_DELETE. All credit to the creators! It works very well out of the box and is easy to customize.
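If you only want the general idea rather than the full repo, a stripped-down sketch of such a cleanup DAG could look like the following (the dag_id, schedule, log path and the max_log_age_in_days Variable name are illustrative, not the repo's exact names):

from datetime import datetime

from airflow import DAG
from airflow.models import Variable
from airflow.operators.bash_operator import BashOperator

# Retention in days, read from an Airflow Variable with a fallback default.
MAX_LOG_AGE_IN_DAYS = Variable.get("max_log_age_in_days", default_var=30)

with DAG(dag_id="airflow_log_cleanup", schedule_interval="@daily",
         start_date=datetime(2021, 1, 1), catchup=False) as dag:
    cleanup_scheduler_logs = BashOperator(
        task_id="cleanup_scheduler_logs",
        # Delete scheduler log files older than the retention period, then prune empty
        # directories. Adjust the path to your base_log_folder.
        bash_command=(
            "find $AIRFLOW_HOME/logs/scheduler -type f -mtime +{days} -delete && "
            "find $AIRFLOW_HOME/logs/scheduler -type d -empty -delete"
        ).format(days=MAX_LOG_AGE_IN_DAYS),
    )

The repo's DAG does essentially this for each directory listed in DIRECTORIES_TO_DELETE, creating one task per directory.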
When I put a new DAG Python script in the dags folder, I can see a new entry for the DAG in the DAG UI, but it is not enabled automatically. On top of that, it does not seem to load properly either. I have to click the Refresh button a few times on the right side of the list and toggle the on/off button on the left side of the list to be able to schedule the DAG. This is a manual process I have to go through even though the DAG script has already been put inside the dags folder.
Can anyone help me with this? Did I miss something? Or is this the correct behavior in Airflow?
By the way, as mentioned in the post title, there is an indicator with the message "This DAG isn't available in the webserver DagBag object. It shows up in this list because the scheduler marked it as active in the metadata database" shown next to the DAG title before I go through this manual process.
It is not you, nor is it correct or expected behavior.
It is a current 'bug' with Airflow.
The webserver is caching the DagBag in a way that you cannot really use it as expected.
"Attempt removing DagBag caching for the webserver" remains on the official TODO as part of the roadmap, which indicates that this bug may not yet be fully resolved. Here are some suggestions on how to proceed:
Only use builders in Airflow v1.9+
Prior to Airflow v1.9, this occurs when a DAG is instantiated by a function which is imported into the file where instantiation happens, that is, when a builder or factory pattern is used. Some reports of this issue on GitHub and JIRA led to a fix released in Airflow v1.9.
If you are using an older version of Airflow, don't use builder functions.
Use airflow backfill to reload the cache
As Dmitri suggests, running airflow backfill '<dag_id>' -s '<date>' -e '<date>' with the same start and end date can sometimes help. Thereafter you may end up with the (non-)issue that Priyank points out, but that is expected behavior (state: paused or not), depending on the configuration of your installation.
Restarting the Airflow webserver solved my issue.
This error can be misleading. If hitting the refresh button or restarting the Airflow webserver doesn't fix the issue, check the DAG (Python script) for errors.
Running airflow list_dags can display the DAG errors (in addition to listing out the DAGs); you can also try running/testing your DAG as a normal Python script.
After fixing the error, this indicator should go away.
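One quick way to surface such errors outside the UI is to load the DagBag yourself and print its import errors, for example with a small throwaway script like this, run on the host whose dags folder Airflow reads:

# check_dags.py -- run with `python check_dags.py` on the Airflow host
from airflow.models import DagBag

dag_bag = DagBag()  # parses everything in the configured dags folder
print("DAGs found:", list(dag_bag.dags))
print("Import errors:", dag_bag.import_errors)  # maps file path -> traceback for broken DAGs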
The issue is that, by default, the DAG is put into the DagBag in a paused state, so that the scheduler is not overwhelmed with a lot of backfill activity on start/restart.
To work around this change the below setting in your airflow.cfg file:
# Are DAGs paused by default at creation
dags_are_paused_at_creation = False
Hope this helps. Cheers!
I have a theory about a possible cause of this issue in Google Composer. There is a section about DAG failures on the webserver in the troubleshooting documentation for Composer, which says:
Avoid running heavyweight computation at DAG parse time. Unlike the worker and scheduler nodes, whose machine types can be customized to have greater CPU and memory capacity, the webserver uses a fixed machine type, which can lead to DAG parsing failures if the parse-time computation is too heavyweight.
And I was trying to load configuration from an external source (which actually took a negligible amount of time compared to the other operations needed to create the DAG, but still broke something, because the Airflow webserver in Composer runs on App Engine, which has strange behaviours).
I found the workaround in the discussion of this Google issue: create a separate DAG with a task which loads all the data needed and stores that data in an Airflow Variable:
Variable.set("pipeline_config", config, serialize_json=True)
Then I could do
Variable.get("pipeline_config", deserialize_json=True)
And successfully generate the pipeline from that. An additional benefit is that I get logs from that task, which I do not get from the webserver because of this issue.
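A minimal sketch of that pattern, with hypothetical dag_ids, a hypothetical fetch_config callable and stand-in config contents:

from datetime import datetime

from airflow import DAG
from airflow.models import Variable
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.python_operator import PythonOperator

def fetch_config():
    # The heavy lifting (external API calls, DB reads, ...) happens inside a task,
    # not at DAG parse time, so the Composer webserver never has to do it.
    config = {"tables": ["a", "b", "c"]}  # stand-in for the externally loaded config
    Variable.set("pipeline_config", config, serialize_json=True)

# DAG 1: refreshes the config Variable on a schedule.
with DAG(dag_id="load_pipeline_config", schedule_interval="@daily",
         start_date=datetime(2021, 1, 1), catchup=False) as config_dag:
    PythonOperator(task_id="fetch_config", python_callable=fetch_config)

# DAG 2: parse-time work is only a cheap Variable read.
config = Variable.get("pipeline_config", deserialize_json=True, default_var={"tables": []})
with DAG(dag_id="generated_pipeline", schedule_interval="@daily",
         start_date=datetime(2021, 1, 1), catchup=False) as pipeline_dag:
    for table in config["tables"]:
        DummyOperator(task_id="process_{}".format(table))  # replace with real tasks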
How to run/instantiate an Airflow DAG in parallel for multiple categories?
For example :
I have an Airflow DAG which I run on a regular basis.
How can I schedule the DAG to run in parallel for different batch names?
run the DAG for batch1 (pass the batch name in args)
run the DAG for batch2 (pass the batch name in args); this should run in parallel with batch1
…and so on.
I used an environment variable to pass the batch names and then ran the DAG in parallel using multiple tmux sessions on the server, but it got messy.
Is there a better approach I could use to save time and run the DAG for multiple batch names in parallel?
Thanks for your time.
Since Airflow runs Python code defining graphs of tasks (for example, shell commands), you can do this within Airflow by creating independent tasks, one per batch, in the same DAG. Here's a slight modification of the tutorial:
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

# templated_command comes from the tutorial; a trivial stand-in is used here.
templated_command = 'echo "processing {{ params.batch_name }}"'

dag = DAG(dag_id='batch', start_date=datetime(2021, 1, 1), schedule_interval=None)

# Passing dag=dag already registers each task, so no dag.add_task() calls are needed;
# task_ids must be unique, hence the per-batch suffix.
tasks = [BashOperator(task_id='templated_{}'.format(i),
                      bash_command=templated_command,
                      params={'batch_name': batch_name},
                      dag=dag)
         for i, batch_name in enumerate(["batch one", "batch two"])]
Since there is no dependency between them, they should run in parallel, as long as Airflow has been configured to allow it. If you need to set a shell environment variable, add VAR={{ params.batch_name }} somewhere in the template.
Assuming your program uses sys.argv, you could also use normal shell job control to launch the runs:
python ~/airflow/dags/tutorial.py "batch one" &
python ~/airflow/dags/tutorial.py "batch two" &
wait
I'm starting to port a nightly data pipeline from a visual ETL tool to Luigi, and I really enjoy that there is a visualiser to see the status of jobs. However, I've noticed that a few minutes after the last job (named MasterEnd) completes, all of the nodes disappear from the graph except for MasterEnd. This is a little inconvenient, as I'd like to see that everything is complete for the day/past days.
Further, if in the visualiser I go directly to the last job's URL, it can't find any history that it ran: Couldn't find task MasterEnd(date=2015-09-17, base_url=http://aws.east.com/, log_dir=/home/ubuntu/logs/). I have verified that it ran successfully this morning.
One thing to note is that I have a cron that runs this pipeline every 15 minutes to check for a file on S3. If it exists, it runs, otherwise it stops. I'm not sure if that is causing the removal of tasks from the visualiser or not. I've noticed it generates a new PID every run, but I couldn't find a way to persist one PID/day in the docs.
So, my questions: Is it possible to persist the completed graph for the current day in the visualiser? And is there a way to see what has happened in the past?
Appreciate all the help
I'm not 100% positive that this is correct, but it is what I would try first: when you call luigi.run, pass it --scheduler-remove-delay. I'm guessing this is how long the scheduler waits before forgetting a task after all of its dependents have completed. If you look through Luigi's source, the default is 600 seconds. For example:
luigi.run(["--workers", "8", "--scheduler-remove-delay", "86400"], main_task_cls=task_name)
If you configure the remove_delay setting in your luigi.cfg then it will keep the tasks around for longer.
[scheduler]
record_task_history = True
state_path = /x/s/hadoop/luigi/var/luigi-state.pickle
remove_delay = 86400
Note: there is a typo in the documentation ("remove-delay" instead of "remove_delay"), which is being fixed under https://github.com/spotify/luigi/issues/2133