I have a master DAG whose tasks each call a different DAG. I'm using the TriggerDagRunOperator to accomplish this, but I'm facing a few issues.
The TriggerDagRunOperator doesn't wait for the external DAG to complete; it goes straight on to trigger the next task. I want it to wait until completion, and the next task should trigger based on the resulting status. I came across ExternalTaskSensor, but it makes the process complicated. Is there any other solution to fix this?
If I trigger the master DAG again, I want it to restart from the task where it failed. Right now it doesn't restart from there, although it does for time-based schedules.
.. I want that to wait until completion .. Came across
ExternalTaskSensor. It is making the process complicated ..
I'm unaware of any other way to achieve this. I myself did this the same way.
If I trigger the master dag again, I want the task to restart from
where it is failed...
This requirement of yours goes against the principle of idempotency that Airflow demands. I'd suggest you try to rework your jobs to incorporate idempotency (for instance, retries require it). Meanwhile, you can take inspiration from some people and try to achieve something similar (but it will be pretty complicated).
As of Airflow 2.0.1, the triggering DAG can be made to wait for the target DAG to complete by setting the wait_for_completion parameter on the TriggerDagRunOperator.
ref: here
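A minimal sketch of the wait_for_completion approach, assuming Airflow 2.x import paths. The DAG ids (master_dag, child_dag_a, child_dag_b), the helper names, and the 30-second poke_interval are placeholders chosen for illustration and do not come from the question. Airflow is imported inside the builder only so that the helper can be read and exercised without an Airflow installation.

```python
from datetime import datetime

# Hypothetical ids of the DAGs the master DAG should trigger, in order.
CHILD_DAG_IDS = ["child_dag_a", "child_dag_b"]


def trigger_kwargs(child_dag_id):
    """Arguments for one TriggerDagRunOperator that blocks until the child run ends."""
    return {
        "task_id": "trigger_" + child_dag_id,
        "trigger_dag_id": child_dag_id,
        "wait_for_completion": True,  # poll the triggered run instead of returning at once
        "poke_interval": 30,          # seconds between polls (assumed value)
    }


def build_master_dag():
    """Chain one blocking trigger task per child DAG."""
    from airflow import DAG
    from airflow.operators.trigger_dagrun import TriggerDagRunOperator

    with DAG(
        dag_id="master_dag",
        start_date=datetime(2021, 1, 1),
        schedule_interval=None,
        catchup=False,
    ) as dag:
        previous = None
        for child_dag_id in CHILD_DAG_IDS:
            trigger = TriggerDagRunOperator(**trigger_kwargs(child_dag_id))
            if previous is not None:
                previous >> trigger  # next trigger runs only after the previous one succeeds
            previous = trigger
    return dag
```

A DAG file would assign dag = build_master_dag() at module level so the scheduler can discover it. How a failed child run is reported back to the trigger task may differ between Airflow versions, so that part is worth checking against the operator's documentation.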
Our Airflow instance is forced to interact with a company that has a very poor system. It's not unusual for our DAG to get stuck waiting for a report that never actually gets completed. This DAG runs daily and pulls the same information, so if it's time for the next run, it would be nice to just kill the last run and move on with the new one. I haven't found anything saying Airflow has a DAG argument that can achieve this. Is there a quick, easy setting for this behavior, or would it need to be done logically in the sensor that checks whether the report is complete?
If your DAG is scheduled daily, how about setting dagrun_timeout to 24 hours? I believe this should in effect kill the previous dag run around when it kicks off a new one. Related question about setting DAG timeouts.
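A rough sketch of the timeout suggestion, assuming Airflow 2.x. The DAG id, start date, and builder name are made up for the example; dagrun_timeout, schedule_interval, and catchup are regular DAG arguments. Airflow is imported lazily so the settings can be inspected without it installed.

```python
from datetime import datetime, timedelta

# Assumed values: a daily schedule with a timeout of the same length.
DAG_KWARGS = {
    "dag_id": "daily_report",  # hypothetical DAG id
    "start_date": datetime(2021, 1, 1),
    "schedule_interval": "@daily",
    "catchup": False,
    "dagrun_timeout": timedelta(hours=24),  # fail a run that is still going after 24 hours
}


def build_dag():
    """Create the DAG with the timeout applied."""
    from airflow import DAG

    return DAG(**DAG_KWARGS)
```

As far as I know, the timeout is documented as applying to scheduled runs, so manually triggered runs may not be affected by it.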
Alternatively, you could either use a PythonOperator, define your own operator, or extend the report sensor you describe to kill the previous DagRun programmatically. I believe that this would look like...
Get the current dag run from the Airflow context
Get the previous dag run with dag_run.get_previous_dagrun()
Set the state on the previous dag run with prev_dag_run.set_state
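The three steps above can be sketched roughly as follows. The function names and the FakeDagRun stand-in are invented for the example; only get_previous_dagrun() and set_state come from the steps listed, and "running" and "failed" are the state strings Airflow uses. Depending on the Airflow version, the change may also need to be persisted through a database session, and running task instances may need to be stopped separately; neither is shown here.

```python
RUNNING = "running"
FAILED = "failed"


def fail_previous_run_if_stuck(dag_run):
    """Mark the previous DagRun as failed if it is still running.

    Returns True when a previous run was changed, False otherwise.
    """
    previous = dag_run.get_previous_dagrun()
    if previous is None or previous.state != RUNNING:
        return False
    previous.set_state(FAILED)
    return True


def kill_previous_run(**context):
    """Callable for a PythonOperator; the current run is read from the context."""
    return fail_previous_run_if_stuck(context["dag_run"])


class FakeDagRun:
    """Stand-in for a DagRun, only used to try the helper without Airflow."""

    def __init__(self, state, previous=None):
        self.state = state
        self._previous = previous

    def get_previous_dagrun(self):
        return self._previous

    def set_state(self, state):
        self.state = state
```

In a DAG, kill_previous_run would be the python_callable of the first task, so a stuck earlier run is marked failed before the new run starts waiting for its own report.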
Of these two options, my recommendation would be to set the timeout. I agree that there is no specific kill_previous_run DAG argument.
I have created a workflow that contains a few tasks and runs hourly. The workflow should be triggered only if another instance of it is not running at the same time. If one is running, the execution should be skipped for that hour.
I tried "depends_on_past" but couldn't get it to work.
Set max_active_runs on your DAG to 1 and also set catchup to False.
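A small sketch of those two settings on an hourly DAG, assuming Airflow 2.x. The DAG id, start date, and function names are placeholders; max_active_runs and catchup are the arguments named in the answer.

```python
from datetime import datetime


def hourly_dag_kwargs(dag_id):
    """DAG arguments that allow one active run at a time and no backfilling."""
    return {
        "dag_id": dag_id,
        "start_date": datetime(2021, 1, 1),
        "schedule_interval": "@hourly",
        "max_active_runs": 1,  # do not start a new run while one is still active
        "catchup": False,      # do not create runs for intervals that were missed
    }


def build_dag(dag_id="hourly_workflow"):
    """Create the DAG; Airflow is imported here so the helper works without it."""
    from airflow import DAG

    return DAG(**hourly_dag_kwargs(dag_id))
```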
From the official Airflow documentation for trigger rules:
The depends_on_past (boolean), when set to True, keeps a task from getting triggered if the previous schedule for the task hasn’t succeeded.
It will work if you use it in the definition of the task. You can pair it with wait_for_downstream=True as well to guarantee that the new run's instance will not begin until the last run's instance of the task has completed execution.
task_depends = DummyOperator(task_id="task_depend", dag=dag, depends_on_past=True)
However, assuming that you only need the latest run to work, another way around this is the Latest Run Only concept:
Standard workflow behavior involves running a series of tasks for a particular date/time range. Some workflows, however, perform tasks that are independent of run time but need to be run on a schedule, much like a standard cron job. In these cases, backfills or running jobs missed during a pause just wastes CPU cycles.
For situations like this, you can use the LatestOnlyOperator to skip tasks that are not being run during the most recent scheduled run for a DAG. The LatestOnlyOperator skips all immediate downstream tasks, and itself, if the time right now is not between its execution_time and the next scheduled execution_time.
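The rule quoted above can be illustrated with a small pure-Python check. This is only a sketch of the quoted wording, not the operator's implementation; the function name and the choice of which boundary is inclusive are assumptions.

```python
from datetime import datetime


def is_latest_run(now, execution_time, next_execution_time):
    """Return True if `now` falls after execution_time and up to next_execution_time."""
    return execution_time < now <= next_execution_time
```

For a daily DAG, a run for 2018-01-01 checked at noon that day would count as the latest run, while the same run replayed two days later would not, so its downstream tasks would be skipped. In a real DAG the LatestOnlyOperator itself would be placed upstream of the tasks that should be skipped.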
Other questions about 'dynamic tasks' seem to address dynamic construction of a DAG at schedule or design time. I'm interested in dynamically adding tasks to a DAG during execution.
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.python_operator import PythonOperator
from datetime import datetime

dag = DAG('test_dag', description='a test',
          schedule_interval='0 0 * * *',
          start_date=datetime(2018, 1, 1),
          catchup=False)

def make_tasks():
    du1 = DummyOperator(task_id='dummy1', dag=dag)
    du2 = DummyOperator(task_id='dummy2', dag=dag)
    du3 = DummyOperator(task_id='dummy3', dag=dag)
    du1 >> du2 >> du3

p = PythonOperator(
    task_id='python_operator',
    dag=dag,
    python_callable=make_tasks)
This naive implementation doesn't seem to work - the dummy tasks never show up in the UI.
What's the correct way to add new operators to the DAG during execution? Is it possible?
It is not possible to modify the DAG during its execution (without a lot more work).
The dag = DAG(... is picked up in a loop by the scheduler. It will have task instance 'python_operator' in it. That task instance gets scheduled in a dag run, and executed by a worker or executor. Since DAG models in the Airflow DB are only updated by the scheduler these added dummy tasks will not be persisted to the DAG nor scheduled to run. They will be forgotten when the worker exits. Unless you copy all the code from the scheduler regarding persisting & updating the model… but that will be undone the next time the scheduler visits the DAG file for parsing, which could be happening once a minute, once a second or faster depending how many other DAG files there are to parse.
Airflow actually wants each DAG to keep approximately the same layout between runs. It also wants to reload and parse DAG files constantly. So although you could write a DAG file that determines its tasks dynamically on each run based on some external data (preferably cached in a file or pyc module rather than fetched through network I/O such as a DB lookup, which would slow down the whole scheduling loop for all the DAGs), it's not a good plan: your graph and tree views will get confusing, and that lookup will put more load on scheduler parsing.
You could make the callable run each task…
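def make_tasks(**context):
    du1 = DummyOperator(task_id='dummy1', dag=dag)
    du2 = DummyOperator(task_id='dummy2', dag=dag)
    du3 = DummyOperator(task_id='dummy3', dag=dag)
    # Run each operator inline, one after another, inside this single task.
    du1.execute(context)
    du2.execute(context)
    du3.execute(context)

p = PythonOperator(
    task_id='python_operator',
    dag=dag,
    provide_context=True,  # Airflow 1.x: pass the task context as keyword arguments
    python_callable=make_tasks)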
But that's sequential, and you have to work out how to use python to make them parallel (use futures?) and if any raise an exception the whole task fails. Also it is bound to one executor or worker so not using airflow's task distribution (kubernetes, mesos, celery).
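One way the futures idea could look, as a sketch using only the standard library. The helper name and the default of three workers are arbitrary choices for the example. It shares the drawbacks just described: everything still runs inside one Airflow task on one worker, and the first exception fails the whole task.

```python
from concurrent.futures import ThreadPoolExecutor


def run_in_parallel(callables, max_workers=3):
    """Run zero-argument callables concurrently and return their results in order.

    An exception raised by any callable propagates out of result(),
    which would make the surrounding Airflow task fail.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(fn) for fn in callables]
        return [future.result() for future in futures]
```

Inside the callable above, the three execute calls could then be passed in as lambdas, for example run_in_parallel([lambda: du1.execute(context), lambda: du2.execute(context), lambda: du3.execute(context)]).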
The other way to work with this is to add a fixed number of tasks (the maximal number), and use the callable(s) to short circuit the unneeded tasks or push arguments with xcom for each of them, changing their behavior at run time but not changing the DAG.
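A sketch of the fixed-number-of-tasks idea, using the same Airflow 1.x style import paths as the code in this thread. MAX_TASKS, needed_task_count, the task ids, and the DAG id are hypothetical; the only Airflow pieces relied on are ShortCircuitOperator, which skips its downstream tasks when its callable returns a falsy value, and DummyOperator as a placeholder for the real work.

```python
from datetime import datetime

MAX_TASKS = 5  # assumed upper bound on how many tasks a run could ever need


def needed_task_count():
    """Hypothetical lookup of how many slots this run needs; replace with real logic."""
    return 3


def slot_is_needed(slot_index, needed=None):
    """Return True if the slot with this index should run in the current run."""
    if needed is None:
        needed = needed_task_count()
    return slot_index < needed


def build_dag():
    """Create MAX_TASKS gate/work pairs; unneeded pairs are skipped at run time."""
    from airflow import DAG
    from airflow.operators.dummy_operator import DummyOperator
    from airflow.operators.python_operator import ShortCircuitOperator

    dag = DAG('fixed_slots_dag',
              schedule_interval='0 0 * * *',
              start_date=datetime(2018, 1, 1),
              catchup=False)
    for i in range(MAX_TASKS):
        gate = ShortCircuitOperator(
            task_id='gate_%d' % i,
            python_callable=slot_is_needed,
            op_kwargs={'slot_index': i},
            dag=dag)
        work = DummyOperator(task_id='work_%d' % i, dag=dag)
        gate >> work  # work_i is skipped whenever gate_i returns False
    return dag
```

The DAG layout stays identical between runs, so the graph and tree views remain readable; only the number of skipped slots changes.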
Regarding your code sample, you never call your function which registers your tasks in your DAG.
To have a kind of dynamic tasks, you can have a single operator which does something different depending on some state or you can have a handful of operators which can be skipped depending on the state, with a ShortCircuitOperator.
I appreciate all the work everybody has done here, as I have the same challenge of creating dynamically structured DAGs. I have made enough mistakes to know not to use software against its design. If I can't inspect the whole run in the UI and zoom in and out, in other words use the Airflow features that are the main reason I use it in the first place, I might as well write multiprocessing code inside a function and be done with it.
That all being said, my solution is to use a resource manager, such as Redis locking. One DAG writes to this resource manager with data about what to run and how to run it, and another DAG (or several DAGs) runs at certain intervals, polling the resource manager, locking entries before running them and removing them on finish. This way I at least use Airflow as expected, even though its specifications don't exactly meet my needs, and I break the problem down into more definable chunks. The other solutions are creative, but they go against the design and are not tested by the developers, who specifically say to have fixed, structured workflows. I cannot put in workaround code that is untested and against the design unless I rewrite the core Airflow code and test it myself. I understand that my solution brings complexity with locking and all that, but at least I know its boundaries.
Here is my problem. When I create a new scheduled task using win32com in Python, there is no next run time for the task. It says 'never' in the Task Scheduler GUI.
My workflow of creating tasks:
try to make new task, if failed, get existing one for update,
create daily triggers for the task,
save it all.
Any advice?
So here is the simple solution.
I checked the default params for the trigger and then saw that Flags is set to 4, which means DISABLED.
It seems that's the default setting for a new trigger for a task.
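A hedged sketch of clearing that flag before saving the task. It assumes the trigger object exposes GetTrigger() and SetTrigger() and that the trigger data has a Flags attribute, as the pywin32 task scheduler interface does as far as I know. The constant name and both helper names are chosen for this example; the value 4 is the one reported above.

```python
TASK_TRIGGER_FLAG_DISABLED = 0x4  # value observed in the answer above


def clear_disabled_flag(flags):
    """Return the flags with the DISABLED bit removed and all other bits untouched."""
    return flags & ~TASK_TRIGGER_FLAG_DISABLED


def enable_trigger(task_trigger):
    """Read the trigger data, clear the DISABLED bit, and write it back.

    Assumes task_trigger offers GetTrigger() and SetTrigger().
    """
    trigger_data = task_trigger.GetTrigger()
    trigger_data.Flags = clear_disabled_flag(trigger_data.Flags)
    task_trigger.SetTrigger(trigger_data)
    return trigger_data.Flags
```

The task still has to be saved afterwards, as in the last step of the workflow described in the question.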
We need a service that we can use to schedule events. For instance, we might have a task that needs to run at 3 o'clock (one time) or that runs every 2 hours (multiple times). Preferably each task could be configured with an AMQP queue that it would publish to.
We could easily implement this by creating an OS timer event. My concern is how to recover if this service ever went down. We could use CRON if it was something that allowed scheduling on-the-fly.
I was looking for a way to avoid reinventing the wheel. If there isn't a project out there that does this already, we will just create one. This is a pretty common thing, though, so I'd be surprised if no one's put one out there by now.
Celery solves this problem.
celery.schedules lets you define periodic tasks. And you can override is_due to do things like schedule once a month. You can schedule tasks to execute at a specific time using periodic_task, or celery-beat (which I believe is now the standard approach). Yet another way is to use the eta argument to Task.apply_async.
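A minimal sketch of two of these options, under stated assumptions. The task name tasks.publish_event, the queue name events, and the helper names are hypothetical, and "3 o'clock" is read as 03:00. The schedule entry uses the beat_schedule format (task, schedule, options), and the one-off case uses the eta argument mentioned above. Time zone handling is deliberately left out.

```python
from datetime import datetime, timedelta

# Periodic case: run every 2 hours and route the task to a specific queue.
BEAT_SCHEDULE = {
    "every-two-hours": {
        "task": "tasks.publish_event",    # hypothetical task name
        "schedule": timedelta(hours=2),
        "options": {"queue": "events"},   # hypothetical queue name
    },
}


def next_three_oclock(now):
    """Return the next 03:00 strictly after `now`."""
    candidate = now.replace(hour=3, minute=0, second=0, microsecond=0)
    if candidate <= now:
        candidate += timedelta(days=1)
    return candidate


def configure(app):
    """Attach the periodic schedule to an existing Celery app."""
    app.conf.beat_schedule = BEAT_SCHEDULE


def schedule_once(task, now=None):
    """One-time case: ask Celery to run `task` at the next 03:00 via eta."""
    now = now or datetime.now()
    return task.apply_async(eta=next_three_oclock(now))
```

Because pending periodic runs are driven by the beat process and eta tasks are held by the broker and workers, how well this recovers from an outage depends on the broker and beat configuration, which is not covered here.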