I am using MWAA with Airflow 1.10 and the tasks do not start, even though the previous ones are successful. I don't see any problem in the logs or anything.
There's no reason why your tasks wouldn't execute in the order you've told them via bitshift, unless you have changed the trigger rules away from the default all_success (to all_failed, for example). Especially if there's nothing in the logs, it implies that your encrypt_to_stage task has been set up not to execute when the previous tasks are successful.
A less likely possibility is that you've hit this known issue, but I'd expect your results to be more random than what you've shared (unless there are other DAGs that are running in parallel that complete at the same time as these first two tasks).
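For reference, this is the kind of trigger-rule override that would produce that behaviour; the operator used here is just a placeholder, encrypt_to_stage is the task mentioned above, and the import paths are the Airflow 1.10 ones.

    from airflow.operators.dummy_operator import DummyOperator
    from airflow.utils.trigger_rule import TriggerRule

    encrypt_to_stage = DummyOperator(
        task_id="encrypt_to_stage",
        dag=dag,
        # The default is TriggerRule.ALL_SUCCESS; this would make the task run
        # only when every upstream task has failed, i.e. never after successes.
        trigger_rule=TriggerRule.ALL_FAILED,
    )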
I've seen mention of SIGKILL occurring for others, but I think my use case is slightly different.
I'm using the managed Airflow service through GCP Cloud Composer, running Airflow 2. I have 3 worker nodes, all set to the default instance creation settings.
The environment runs DAGs fairly smoothly for the most part (API calls, moving files from on-prem); however, it seems to be having a terribly hard time executing a couple of slightly larger jobs.
One of these jobs uses a Samba connector to incrementally backfill missing data and store it on GCS. The other is a Salesforce API connector.
These jobs run locally with absolutely no issue, so I'm wondering why I'm encountering these problems. There should be plenty of memory to run these tasks as a cluster, although scaling up my cluster for just 2 jobs doesn't seem particularly efficient.
I have tried both DAG and task timeouts. I've tried increasing the connection timeout on the Samba client.
So could someone please share some insight into how I can get Airflow to execute these tasks without killing the session, even if it does take longer?
Happy to add more detail if required but I don't have the available data in front of me currently to share.
I believe this is about keepalives on the connections: long-running calls can cause a long period of inactivity (because the other side is busy preparing the data), and on managed instances it is quite often the case that inactive connections get killed by firewalls. This happens, for example, with long Postgres queries, and the solution there is to configure keepalives.
I think you should do the same for your Samba connection (but you will need to figure out how to do it in Composer): https://www.linuxtopia.org/online_books/network_administration_guides/using_samba_book/ch08_06_04.html
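For the Postgres case mentioned above, this is roughly what configuring keepalives looks like on the client side (a sketch with placeholder connection details; the parameters are the standard libpq keepalive options passed through psycopg2). The linked page covers the equivalent socket options for Samba.

    import psycopg2

    conn = psycopg2.connect(
        host="db.example.com",
        dbname="mydb",
        user="me",
        password="secret",
        keepalives=1,            # enable TCP keepalives on the connection
        keepalives_idle=30,      # seconds of inactivity before the first probe
        keepalives_interval=10,  # seconds between probes
        keepalives_count=5,      # failed probes before the connection is dropped
    )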
Frustratingly, increasing resources meant the jobs could run. I don't know why the original resources weren't enough, as they really should have been. But optimisation for fully managed solutions isn't overly straightforward, other than adding cost.
I didn't quite get the problem that arises from sharing a job store across multiple schedulers in APScheduler.
The official documentation mentions
Job stores must never be shared between schedulers
but doesn't discuss the problems related to that. Can someone please explain them?
Also, if I deploy a Django application containing APScheduler in production, will multiple job stores be created, one for each worker process?
There are multiple reasons for this. In APScheduler 3.x, schedulers do not have any means to signal each other about changes happening in the job stores. When a scheduler starts, it queries the job store for jobs due for execution, processes them, and then asks how long it should sleep until the next due job. If another scheduler adds a job that would be executed before that wake-up time, the first scheduler would happily sleep past it, because there is no mechanism through which it could receive a notification about the new (or updated) job.
Additionally, schedulers do not have the ability to enforce the maximum number of running instances of a job since they don't communicate with other schedulers. This can lead to conflicts when the same job is run on more than one scheduler process at the same time.
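A minimal sketch (APScheduler 3.x) of the situation described above: two scheduler processes pointing at the same SQLAlchemy job store. The job store URL and the job function are placeholders.

    from datetime import datetime, timedelta

    from apscheduler.schedulers.background import BackgroundScheduler

    def send_report():
        print("report sent")

    # Process A: starts up, sees the jobs currently in the store, and computes
    # how long it can sleep until the next one is due.
    scheduler_a = BackgroundScheduler()
    scheduler_a.add_jobstore("sqlalchemy", url="sqlite:///jobs.sqlite")
    scheduler_a.start()

    # Process B (imagine this running in a separate worker process) shares the
    # same job store and adds a job due in 10 seconds. Scheduler A gets no
    # notification, so if it is already sleeping past that point it misses the
    # job; and if both schedulers do pick it up, max_instances cannot be
    # enforced across the two processes.
    scheduler_b = BackgroundScheduler()
    scheduler_b.add_jobstore("sqlalchemy", url="sqlite:///jobs.sqlite")
    scheduler_b.start()
    scheduler_b.add_job(send_report, "date",
                        run_date=datetime.now() + timedelta(seconds=10))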
These shortcomings are addressed in the upcoming 4.x series and the ability to share job stores could be considered one of its most significant new features.
Is there a way for me to configure Celery to just drop the tasks in case of a non-graceful shutdown of a worker? It's more critical for me that tasks are not repeated than that they are always delivered.
As mentioned in the docs:
If a task isn’t acknowledged within the Visibility Timeout the task will be redelivered to another worker and executed.
This causes problems with ETA/countdown/retry tasks where the time to execute exceeds the visibility timeout; in fact if that happens it will be executed again, and again in a loop.
So you have to increase the visibility timeout to match the time of the longest ETA you’re planning to use.
My use case is that I am using a visibility_timeout of 1 day, but in some cases even that is not enough: I want to schedule tasks even further into the future. A "power failure" or any other event causing a non-graceful shutdown is very rare, and I'm fine with tasks being dropped in, say, 0.01% of cases. Moreover, a task executed 1 day later than it was supposed to be is as bad as the task not being run at all.
One obvious, hacky way is to set visibility_timeout to 100 years. Is there a better way?
There's an acks_late configuration, but the default value is false (so make sure you didn't enable it):
The acks_late setting would be used when you need the task to be executed again if the worker (for some reason) crashes mid-execution. It's important to note that the worker isn't known to crash, and if it does it's usually an unrecoverable error that requires human intervention (bug in the worker, or task code).
(quote from here)
The definition of task_acks_late (it seems the name has changed in the latest version, or there is some mismatch) can be found here.
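For reference, a minimal sketch of where these settings live in a Celery 4+/5 app (the broker URL is a placeholder). With task_acks_late left at its default of False, a task is acknowledged just before it executes, so a hard worker crash drops it rather than redelivering it; the visibility timeout then mainly matters for messages that are still waiting to execute (for example ETA/countdown tasks) on Redis or SQS brokers.

    from celery import Celery

    app = Celery("tasks", broker="redis://localhost:6379/0")
    app.conf.update(
        task_acks_late=False,  # the default; the old-style name was CELERY_ACKS_LATE
        broker_transport_options={"visibility_timeout": 86400},  # 1 day, as in the question
    )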
I have created a workflow (containing a few tasks) that executes hourly. The workflow should be triggered only if another instance of the workflow is not running at the same time. If one is running, the workflow execution should be skipped for that hour.
I checked "depends_on_past" but couldn't get it to work.
Set max_active_runs on your DAG to 1, and also set catchup to False.
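A minimal sketch of those two settings on the DAG object; the dag_id, schedule and start_date are placeholders, and the import works on both Airflow 1.10 and 2.

    from datetime import datetime

    from airflow import DAG

    dag = DAG(
        dag_id="hourly_workflow",
        schedule_interval="@hourly",
        start_date=datetime(2023, 1, 1),
        max_active_runs=1,  # never more than one active DAG run at a time
        catchup=False,      # don't backfill runs missed while the DAG was paused
    )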
From the official Airflow documentation for trigger rules:
The depends_on_past (boolean), when set to True, keeps a task from getting triggered if the previous schedule for the task hasn’t succeeded.
It will work if you use it in the definition of the task. You can pair it with wait_for_downstream=True as well to guarantee that the new run's instance will not begin until the previous run's instance of the task has completed execution.
    task_depends = DummyOperator(
        task_id="task_depend",
        dag=dag,
        depends_on_past=True,
    )
However, another way to work around this, assuming that you only need the latest run to do the work, is to use the Latest Run Only concept:
Standard workflow behavior involves running a series of tasks for a particular date/time range. Some workflows, however, perform tasks that are independent of run time but need to be run on a schedule, much like a standard cron job. In these cases, backfills or running jobs missed during a pause just wastes CPU cycles.
For situations like this, you can use the LatestOnlyOperator to skip tasks that are not being run during the most recent scheduled run for a DAG. The LatestOnlyOperator skips all immediate downstream tasks, and itself, if the time right now is not between its execution_time and the next scheduled execution_time.
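A short sketch of that approach; the import path shown is the Airflow 2 one (in 1.10 it lives in airflow.operators.latest_only_operator), and task_depends is the task from the example above.

    from airflow.operators.latest_only import LatestOnlyOperator

    latest_only = LatestOnlyOperator(task_id="latest_only", dag=dag)
    latest_only >> task_depends  # task_depend is skipped on any non-latest run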
I have a task which I execute once a minute using Celery beat. It works fine. Sometimes, though, the task takes a few seconds more than a minute to run, which causes two instances of the task to run at once. This leads to some race conditions that mess things up.
I can (and probably should) fix my task to work properly but I wanted to know if celery has any builtin ways to ensure this. My cursory Google searches and RTFMs yielded no results.
You could add a lock, using something like memcached or just your db.
If you are using a cron schedule or a time interval to run periodic tasks, you will still have this problem. You can always use a locking mechanism backed by a DB, a cache, or even the filesystem, or alternatively schedule the next task from the previous one, though that's maybe not the best approach.
This question can probably help you:
django celery: how to set task to run at specific interval programmatically
You can try adding a class field to the object that holds the function you're running, and use that field as a "someone else is already working on this or not" control.
The lock is a good way to go with either beat or a cron.
But, be aware that beat jobs run at worker start time, not at beat run time.
This was causing me to get a race condition even with a lock. Let's say the worker is off and beat throws 10 jobs into the queue. When Celery starts up with 4 processes, all 4 of them grab a task, and in my case 1 or 2 would get and set the lock at the same time.
Solution one is to use a cron with a lock, as a cron will execute at that time, not at worker start time.
Solution two is to use a slightly more advanced locking mechanism that handles race conditions. For Redis, look into SETNX, or the newer Redlock.
This blog post is really good, and includes a decorator pattern that uses redis-py's locking mechanism: http://loose-bits.com/2010/10/distributed-task-locking-in-celery.html.
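Along the same lines, a minimal sketch (not the blog post's code) of a SETNX-style lock around a Celery task, assuming a local Redis instance; the key name, timeout and the work function are placeholders.

    import redis
    from celery import shared_task

    redis_client = redis.Redis(host="localhost", port=6379, db=0)

    @shared_task
    def my_minutely_task():
        # SET key value NX EX <seconds>: only one caller acquires the lock, and
        # it expires on its own if the worker dies while holding it.
        if not redis_client.set("lock:my_minutely_task", "1", nx=True, ex=55):
            return  # another instance is still running; skip this run
        try:
            do_the_actual_work()  # placeholder for the real job
        finally:
            redis_client.delete("lock:my_minutely_task")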