I'm running Airflow in a Celery cluster and I'm having trouble getting cleared tasks to run after their dag is already marked as success.
Say I have a dag in the following state:
I want to run everything up to dummy_task_2 again for some reason. I run clear upstream on the task and everything before it is cleared:
In this state the scheduler will not schedule tasks 1 and 6 for execution, unless I also clear task 5 or 4:
Basically it seems as if the scheduler simply does not care about tasks in a dag if all the 'final' tasks are already marked as success. If I clear those, then it goes back and picks up all the cleared tasks.
I've googled the hell out of this and it doesn't seem like intended behavior. I've checked the scheduler logs and there's nothing unusual; it simply reports No tasks to send to the executor until I clear one of the last tasks in the dag.
Is this a bug? Are there any logs I could check to see what's happening? I've even tried manually changing the state of the dag in the metadata database from 'success' to 'running' but the scheduler just reverts it to 'success' immediately.
Our Airflow is forced to interact with a company with a very poor system. It's not unusual for our DAG to get stuck waiting for a report that never actually gets completed. This DAG runs daily, pulling the same information, so when it's time for the next run it would be nice to just kill the last run and move on with the new one. I haven't found anything saying Airflow has a DAG argument that can achieve this. Is there a quick, easy setting for this behavior, or would it need to be done logically in the sensor that checks whether the report is complete?
If your DAG is scheduled daily, how about setting dagrun_timeout to 24 hours? I believe this should in effect kill the previous dag run around when it kicks off a new one. Related question about setting DAG timeouts.
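For what it's worth, a minimal sketch of that (the dag_id and start_date are illustrative, not from your setup):
from datetime import datetime, timedelta
from airflow import DAG

dag = DAG(
    dag_id="daily_report_pull",          # illustrative name
    schedule_interval="@daily",
    start_date=datetime(2018, 1, 1),
    dagrun_timeout=timedelta(hours=24),  # fail a run that is still going after 24 hours
)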
Alternatively, you could either use a PythonOperator, define your own operator, or extend the report sensor you describe to kill the previous DagRun programmatically. I believe that this would look like...
Get the current dag run from the Airflow context
Get the previous dag run with dag_run.get_previous_dagrun()
Set the state on the previous dag run with prev_dag_run.set_state
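Putting those three steps together, a rough, untested sketch (assuming Airflow 1.x-style imports and a dag object defined elsewhere) might look like:
from airflow.operators.python_operator import PythonOperator
from airflow.utils.state import State

def kill_previous_run(**context):
    dag_run = context["dag_run"]                  # current DagRun from the Airflow context
    prev_dag_run = dag_run.get_previous_dagrun()  # previous run, or None on the first run
    if prev_dag_run and prev_dag_run.state == State.RUNNING:
        prev_dag_run.set_state(State.FAILED)      # mark the stuck run as failed

kill_previous = PythonOperator(
    task_id="kill_previous_run",
    python_callable=kill_previous_run,
    provide_context=True,  # needed on Airflow 1.x to receive the context
    dag=dag,
)
Depending on the Airflow version, the state change may still need to be committed to the metadata database, so treat this as a starting point rather than a drop-in solution.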
My recommendation would be to set the timeout, given these two options. I agree that there is no specific kill_previous_run DAG argument.
I have created a workflow (it contains a few tasks) that executes hourly. The workflow should be triggered only if another instance of the workflow is not running at the same time. If one is running, the workflow execution should be skipped for that hour.
I tried "depends_on_past" but couldn't get it to work.
Set the max_active_runs on your DAG to 1 and also set catchup to False.
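A minimal sketch of that, assuming an hourly schedule (dag_id and start_date are placeholders):
from datetime import datetime
from airflow import DAG

dag = DAG(
    dag_id="hourly_workflow",
    schedule_interval="@hourly",
    start_date=datetime(2018, 1, 1),
    max_active_runs=1,  # never more than one running DagRun at a time
    catchup=False,      # don't backfill runs missed while a previous one was still going
)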
From the official Airflow documentation for trigger rules:
The depends_on_past (boolean), when set to True, keeps a task from getting triggered if the previous schedule for the task hasn’t succeeded.
It will work if you use it in the definition of the task. You can pair it with wait_for_downstream=True as well, so that a new run's instance will also wait for the tasks immediately downstream of the previous instance to finish before it starts.
from airflow.operators.dummy_operator import DummyOperator  # Airflow 1.x import path
task_depends = DummyOperator(task_id="task_depend", dag=dag, depends_on_past=True)
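And paired with wait_for_downstream, as mentioned above (the task_id here is just illustrative):
task_depend_strict = DummyOperator(
    task_id="task_depend_strict",
    dag=dag,
    depends_on_past=True,
    wait_for_downstream=True,  # also wait for the previous instance's immediate downstream tasks
)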
However, another way to work around this, assuming you only need the latest run to do the work, is to use the Latest Run Only concept:
Standard workflow behavior involves running a series of tasks for a particular date/time range. Some workflows, however, perform tasks that are independent of run time but need to be run on a schedule, much like a standard cron job. In these cases, backfills or running jobs missed during a pause just wastes CPU cycles.
For situations like this, you can use the LatestOnlyOperator to skip tasks that are not being run during the most recent scheduled run for a DAG. The LatestOnlyOperator skips all immediate downstream tasks, and itself, if the time right now is not between its execution_time and the next scheduled execution_time.
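A minimal sketch of the pattern (Airflow 1.x import paths; the downstream task is a placeholder and the dag object is assumed to be defined as above):
from airflow.operators.latest_only_operator import LatestOnlyOperator
from airflow.operators.dummy_operator import DummyOperator

latest_only = LatestOnlyOperator(task_id="latest_only", dag=dag)
do_work = DummyOperator(task_id="do_work", dag=dag)

latest_only >> do_work  # do_work is skipped on any run that is not the most recent scheduled one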
With Airflow 1.8.1, I am using the LocalExecutor with max_active_runs_per_dag=16, and I use a for loop to dynamically create tasks (~100) with PythonOperators. Most of the time the tasks complete without any issues. However, a task occasionally gets stuck in the queued state and the scheduler seems to forget about it. I can clear the queued task and it reruns fine, but I would like to know how to avoid it getting stuck in the queue in the first place.
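For context, a hedged sketch of the kind of loop described (the callable, task names, and dag object are illustrative, not the actual code):
from airflow.operators.python_operator import PythonOperator

def process_chunk(chunk_id, **kwargs):
    print("processing chunk", chunk_id)

for i in range(100):
    PythonOperator(
        task_id="process_chunk_{}".format(i),
        python_callable=process_chunk,
        op_kwargs={"chunk_id": i},
        provide_context=True,
        dag=dag,
    )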
I have an email task in Celery that has an eta of 10 days from now(). However, I'm finding that some people are getting 5-6 duplicate emails at a time. I've come across this problem before when the visibility_timeout in BROKER_TRANSPORT_OPTIONS was set too low. Now I have this in my settings file:
BROKER_TRANSPORT_OPTIONS = {'visibility_timeout': 2592000} #30 days
So that shouldn't be a problem any more. I'm just wondering if there is anything else that can cause it, e.g. restarting Celery. Celery gets restarted every time I deploy new code, and that can happen 5 or more times a week, so it's the only thing I can think of.
Any ideas?
Thanks.
Task duplication is possible if the worker/beat processes were not stopped correctly. How do you restart the Celery workers/beat? Check the server for zombie Celery worker and beat processes. Try to stop all Celery processes, check that no Celery processes remain, and then start them again. Afterwards, check that ps ax | grep celery shows fresh workers and only one beat.
Tasks won't be re-run after an unclean worker stop if you set CELERY_ACKS_LATE = False. In that case the task is marked as acknowledged immediately after it is consumed. See the docs.
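For reference, a sketch of that setting in the same old-style settings file as above:
CELERY_ACKS_LATE = False  # acknowledge on consumption; an unclean worker stop will not redeliver the task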
Also make sure that your tasks have no retry enabled. If any exception happens inside a task, it might be retried with the same input arguments.
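As an illustration (the task name and the deliver_email helper are hypothetical), a task written so that a failure is logged rather than retried:
from celery import shared_task
import logging

logger = logging.getLogger(__name__)

@shared_task(bind=True)
def send_report_email(self, recipient_id):
    try:
        deliver_email(recipient_id)  # hypothetical helper that actually sends the mail
    except Exception:
        # log instead of calling self.retry(), so the same email is never re-enqueued
        logger.exception("email to %s failed", recipient_id)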
Another possible cause: your tasks are written incorrectly and each run selects the same set of recipients.
I'm adding a job to a scheduler using APScheduler from a script. Unfortunately, the job is not properly scheduled this way because I didn't start the scheduler:
scheduler = self.getscheduler()  # initializes and returns the scheduler
scheduler.add_job(func=function, trigger=trigger, jobstore='mongo')  # sample code; note that I did not call scheduler.start()
I'm seeing a message: apscheduler.scheduler - INFO - Adding job tentatively -- it will be properly scheduled when the scheduler starts
The script is supposed to add jobs to the scheduler (not to run the scheduler at that moment), and there is some other information that needs to be recorded whenever a job is added to the database. Is it possible to add a job and force the scheduler to write it to the jobstore without actually running the scheduler?
I know that it is possible to start and shut down the scheduler after adding each job to make it save the job information into the jobstore. Is that really a good approach?
Edit: My original intention was to isolate the initialization process of my software. I just wanted to add some jobs to a scheduler that is not yet started. The real issue is that I've given the user permission to start and stop the scheduler, so I cannot be sure that a scheduler instance is running in the system. I've temporarily fixed the problem by starting the scheduler and shutting it down after adding the jobs. It works.
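A hedged sketch of that temporary fix (the MongoDB jobstore configuration is assumed, and function and trigger are the variables from the snippet above); one variation is to start in paused mode so nothing can execute in the window before shutdown:
from apscheduler.schedulers.background import BackgroundScheduler
from apscheduler.jobstores.mongodb import MongoDBJobStore

scheduler = BackgroundScheduler(jobstores={'mongo': MongoDBJobStore()})
scheduler.start(paused=True)  # start without processing any jobs
scheduler.add_job(function, trigger=trigger, jobstore='mongo')  # job is persisted to the jobstore
scheduler.shutdown()  # the job remains stored in MongoDB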
You would have to have some way to notify the scheduler that a job has been added, so that it could wake up and adjust the delay to its next wakeup. It's better to do this via some sort of RPC mechanism. What kind of mechanism is appropriate for your particular use case, I don't know. But RPyC and Execnet are good candidates. Use one of them or something else to remotely control the scheduler process to add said jobs, and you'll be fine.
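A rough, untested sketch of that idea with RPyC (the service name and port are illustrative): the process that owns the running scheduler exposes add_job, and other scripts connect to it instead of creating their own scheduler.
import rpyc
from rpyc.utils.server import ThreadedServer
from apscheduler.schedulers.background import BackgroundScheduler

scheduler = BackgroundScheduler()
scheduler.start()

class SchedulerService(rpyc.Service):
    def exposed_add_job(self, func, *args, **kwargs):
        # func can be a textual reference such as 'mymodule:myfunction'
        return scheduler.add_job(func, *args, **kwargs)

if __name__ == '__main__':
    ThreadedServer(SchedulerService, port=12345).start()
A client script would then connect with conn = rpyc.connect('localhost', 12345) and call conn.root.add_job(...) to add jobs to the shared jobstore.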