We need to execute 3000 DAG runs within a specific hour of the day.
Almost all of the work inside the DAG is waiting for other microservices (for example, AWS SageMaker) to do their part.
However, each DAG run consumes about 250MB of memory just from the Python imports we need (sagemaker, tensorflow, etc.).
We noticed that the 250MB is consumed even before the first task is executed (we checked this by printing process.memory_full_info()).
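For reference, a minimal sketch of the kind of check we ran, assuming psutil is available in the task environment:

    import os
    import psutil

    # Inspect the memory footprint of the current task process.
    # uss ("unique set size") is memory that would be freed if this process
    # exited; rss also counts pages shared with other processes.
    proc = psutil.Process(os.getpid())
    info = proc.memory_full_info()
    print(f"rss={info.rss / 1024**2:.1f} MB, uss={info.uss / 1024**2:.1f} MB")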
We tried to run 8 concurrent runs on a Celery cluster that has 2 workers (4GB RAM each) with worker_concurrency=8, and it failed:
the worker machine crashed with an OOM error.
Drilling down into this, I understand that Airflow's Celery worker runs in prefork mode, meaning it forks a copy of its main worker process for each sub-worker it starts. Each such subprocess consumes more or less the memory our DAG consumes (around 300MB), and there is no memory sharing between the subprocesses.
So I started searching for a solution, since this one obviously cannot scale unless we provision something like 500 workers (each can take 6 DAG runs concurrently, based on our tests).
I read about the different worker pool classes that can be set, such as gevent or eventlet, and how they let sub-workers share memory instead of each one duplicating it. I thought I had found the solution I needed.
But when I tried to set this up, I didn't see that behavior in practice: I would only see 2 concurrent tasks being executed (with gevent configured).
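One way to verify what the workers are actually running, as a sketch (assuming the Celery app Airflow uses can be imported, commonly airflow.executors.celery_executor.app; the exact keys in the stats payload can vary by Celery version), is to ask each worker for its pool implementation and concurrency:

    from airflow.executors.celery_executor import app  # assumption: Airflow's Celery app

    # Ask every running worker which pool it is actually using, to confirm
    # whether the gevent configuration really took effect.
    stats = app.control.inspect(timeout=5).stats() or {}
    for worker, info in stats.items():
        pool = info.get("pool", {})
        print(worker, pool.get("implementation"), pool.get("max-concurrency"))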
In addition, our Airflow cloud provider explained to me that this is not how Airflow works, and that in its base task runner class it still creates a subprocess for each task.
This is the reference he sent me: https://github.com/apache/airflow/blob/919bb8c1cbb36679c02ca3f8890c300e1527c08b/airflow/task/task_runner/base_task_runner.py#L112-L142
My questions are:
Is it possible to scale the way we need, given that our DAG consumes this much memory to begin with?
Can working with the Celery executor actually benefit memory-hungry DAG runs?
Is the cloud provider right in saying "this is not how Airflow works with Celery"?
I hear of people running thousands of tasks concurrently, yet even with minimal DAG imports I get 130MB of memory usage; how can this scale to thousands?
Thanks.
Related
I have a couple of Celery workers deployed in Kubernetes. I want to write a custom exporter for Prometheus, so I need to check all the workers' availability.
I have some huge tasks in one queue, which take 200 seconds each (for example). The workers attached to this queue run with the eventlet pool and a concurrency of 1000, and are deployed in a workload with 2 pods.
Because of the huge tasks, light tasks sometimes get stuck in these workers and are not processed until the huge tasks are done (I have another queue for light tasks, but I have to keep some light tasks in this queue).
How can I check all the workers' performance and whether they are up?
I came across bootsteps in Celery, but I do not know whether they help me or not, because I want a task that runs on every worker (and every queue), and I want it to run in between the huge tasks, not separately.
For more detail: I want to save this data in Redis and read it in my exporter.
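A minimal sketch of one way to gather this, assuming the Celery app object is importable and Redis is reachable (the import path, host, and key name below are placeholders):

    import json
    import time

    import redis
    from myproject.celery_app import app  # placeholder: your Celery app

    # Ping every worker; workers that don't reply within the timeout are
    # treated as down. The reply is a dict keyed by worker name.
    replies = app.control.inspect(timeout=5).ping() or {}

    r = redis.Redis(host="redis-host", port=6379)  # placeholder connection
    payload = {"timestamp": time.time(), "alive_workers": sorted(replies)}
    # The Prometheus exporter can later read this key and expose it as metrics.
    r.set("celery:worker_liveness", json.dumps(payload))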
I want to run 1000 tasks in parallel. These are short-running batch jobs that use the same task definition (hence the same container), with only the arguments differing (basically, the argument passed is a value from 0 through 999).
I used Airflow to call the ECSOperator in a loop, just as explained here:
https://headspring.com/2020/06/17/airflow-parallel-tasks/.
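Roughly, the DAG looks like the sketch below (the cluster, task definition, container name, and subnet are placeholders, and the ECSOperator import path depends on the Airflow/provider version):

    from airflow import DAG
    from airflow.providers.amazon.aws.operators.ecs import ECSOperator
    from airflow.utils.dates import days_ago

    with DAG("fargate_fanout", start_date=days_ago(1), schedule_interval=None) as dag:
        for i in range(1000):
            ECSOperator(
                task_id=f"batch_job_{i}",
                cluster="my-cluster",                # placeholder
                task_definition="my-batch-taskdef",  # placeholder
                launch_type="FARGATE",
                overrides={
                    "containerOverrides": [
                        # pass the value 0..999 as the container argument
                        {"name": "my-container", "command": [str(i)]}
                    ]
                },
                network_configuration={
                    "awsvpcConfiguration": {"subnets": ["subnet-placeholder"]}
                },
            )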
When I look at the 'Tasks' tab for my ECS cluster in AWS, I see the tasks queued up with a mix of PROVISIONING, PENDING and RUNNING.
The RUNNING jobs are just a handful; most of the tasks are in the PENDING state and eventually move to RUNNING.
Questions:
Why are most jobs in the PENDING state? What are they waiting for (is there a limit on RUNNING jobs)? How can I check what a task is doing during this PENDING state?
Why are the RUNNING jobs just a handful? How can I make most, if not all, tasks go to the RUNNING state simultaneously? Is there some limit on how many jobs can run simultaneously when using Fargate?
The Services tab is empty; I have not configured any Services. Isn't this meant only for long-running jobs/daemons, or can batch jobs like mine take advantage of it too (and reach the goal of getting all 1000 tasks running at the same time)?
I have not set up anything in the 'Capacity Providers' tab. Will that help in getting more tasks to run in parallel?
I am not clear on the concept of autoscaling here; isn't Fargate supposed to provision 1000 CPUs if need be, so that all those tasks can run in parallel? Is there a default limit, and if so, how do I control it?
So much to unpack.
1-2: there is a TPS (tasks per second) provisioning throughput to consider. We (AWS) are in the process of documenting these limits better (which we don't do today), but for 1000 tasks expect it to take "a few minutes" to have ALL of them in the RUNNING state. If you see them taking "hours" to reach RUNNING, that's not normal. Also note that each account/region has a default concurrent task limit of 1000 (which is not to be confused with the throughput at which you can scale up to 1000 concurrently running tasks).
3: No. As you said, that's just a control loop so that you can say "I always want n tasks (or daemons) running" and ECS will keep that true. You are essentially using an external control loop (Airflow) that manages the tasks, so this won't have any influence on the throughput.
4: No (or at least I don't think so). You may check whether Airflow supports launching tasks using capacity providers (CPs) instead of the traditional "launch type" mode.
5: the autoscaler (in the context of Fargate) is pretty much an ECS Service construct (see point #3). There you basically say "I want to run between n and m tasks and scale in/out based on these metrics", and ECS/autoscaling will make the task count fluctuate accordingly. As I said, you are doing all of this externally, launching tasks individually. If Airflow says "launch 1000 tasks", there is no autoscaling... just a rush to go from 0 to 1000 (see #1 and #2).
Airflow version = 1.10.10
Hosted on Kubernetes, using the Kubernetes executor.
DAG setup
DAG - generated using dynamic DAG generation.
Task - a PythonOperator that pulls some data, runs an inference, and stores the predictions.
Where does it hang? - While running the inference with TensorFlow.
More details
One of our running tasks, as mentioned above, was hanging for 4 hours. No amount of restarting helped it recover from that point. We found that the pod had 30+ subprocesses and was using 40GB of memory.
We weren't convinced, because when running on a local machine the model doesn't consume more than 400MB. There is no way it can suddenly jump to 40GB of memory.
Another suspicion was that it's spinning up so many processes because we dynamically generate around 19 DAGs. I changed the generator to produce only 1, and the processes didn't vanish: the worker pod still had 35+ subprocesses using the same amount of memory.
Here comes the interesting part: I wanted to be really sure it's not the dynamic DAG, so I created an independent DAG that prints 1..100000, pausing for 5 seconds between prints. The memory usage was still the same, though the number of processes was not.
At this point, I am not sure which direction to take to debug the issue further.
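One way to get more signal from inside the pod, as a sketch (assuming psutil can be installed in the worker image and run from inside the task or via kubectl exec), would be to dump every process and its memory, to see which of the 30+ subprocesses actually hold the 40GB:

    import psutil

    # List every process visible in the pod with its parent PID and RSS memory.
    for proc in psutil.process_iter(["pid", "ppid", "name", "cmdline"]):
        try:
            rss_mb = proc.memory_info().rss / 1024**2
            cmd = " ".join(proc.info["cmdline"] or [proc.info["name"]])
            print(f"{proc.info['pid']:>7} {proc.info['ppid']:>7} {rss_mb:10.1f} MB  {cmd}")
        except (psutil.NoSuchProcess, psutil.AccessDenied):
            continue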
Questions
Why is the task hanging?
Why are there so many subprocesses when using dynamic DAGs?
How can I debug this issue further?
Have you faced this before, and can you help?
Sensors in Airflow are a type of operator that keeps running until a certain criterion is met, but they consume a full worker slot while doing so. I'm curious whether people have found more efficient ways of implementing this reliably.
A few ideas on my mind:
using pools to restrict the number of worker slots allotted to sensors
skipping all downstream tasks, then clearing and resuming via an external trigger
pausing the DAG run and resuming it via an external trigger
Other relevant links:
How to implement polling in Airflow?
How to wait for an asynchronous event in a task of a DAG in a workflow implemented using Airflow?
Airflow unpause dag programmatically?
A newer version of Airflow, namely 1.10.2, provides a new option for sensors, which I think addresses your concern:
mode (str) – How the sensor operates. Options are: { poke | reschedule }, default is poke. When set to poke, the sensor takes up a worker slot for its whole execution time and sleeps between pokes. Use this mode if the expected runtime of the sensor is short or if a short poke interval is required. When set to reschedule, the sensor task frees the worker slot when the criteria is not yet met and it is rescheduled at a later time. Use this mode if the time before the criteria is met is expected to be quite long. The poke interval should be more than one minute to prevent too much load on the scheduler.
Here is the link to the docs.
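For illustration, a sketch of a sensor using reschedule mode (the sensor class, file path, and intervals are placeholders, and the import path varies by Airflow version; attach it to a DAG as usual):

    from airflow.contrib.sensors.file_sensor import FileSensor  # 1.10.x import path

    wait_for_file = FileSensor(
        task_id="wait_for_input_file",
        filepath="/data/input/ready.flag",  # placeholder
        mode="reschedule",      # free the worker slot between pokes
        poke_interval=300,      # re-check every 5 minutes (>= 1 minute, as the docs advise)
        timeout=6 * 60 * 60,    # give up after 6 hours
    )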
I think you need to step back and question why it's a problem that a sensor consumes a full worker slot.
Airflow is a scheduler, not a resource allocator. Using worker concurrency, pools and queues, you can limit resource usage, but only very crudely. In the end, Airflow naively assumes a sensor will use the same resources on worker nodes as a BashOperator that spawns a multi-process genome sequencing utility. But sensors are cheap and sleep 99.9% of the time, so that is a bad assumption.
So, if you want to solve the problem of sensors consuming all your worker slots, just bump your worker concurrency. You should be able to have hundreds of sensors running concurrently on a single worker.
If you then get problems with very uneven workload distribution on your cluster nodes and nodes with dangerously high system load, you can limit the number of expensive jobs using either:
pools that expensive jobs must consume (the job will wait until a pool slot is available before it starts). This creates a cluster-wide limit.
special workers on each node that only take the expensive jobs (using airflow worker --queues my_expensive_queue) and have a low concurrency setting. This creates a per-node limit.
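For example, both limits are applied per task via the operator arguments (the pool and queue names below are placeholders; the pool itself is created beforehand in the Airflow UI or CLI):

    from airflow.operators.bash_operator import BashOperator  # 1.10.x import path

    # Expensive job: capped cluster-wide by the pool, and routed to the
    # dedicated low-concurrency workers via the queue.
    heavy_job = BashOperator(
        task_id="genome_sequencing",
        bash_command="echo heavy work",      # placeholder command
        pool="expensive_jobs",               # cluster-wide cap on concurrent slots
        queue="my_expensive_queue",          # picked up only by workers on this queue
    )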
If you have more complex requirements than that, then consider shipping all non-trivial compute jobs to a dedicated resource allocator, e.g. Apache Mesos, where you can specify the exact CPU, memory and other requirements to make sure your cluster load is distributed more efficiently on each node than Airflow will ever be able to do.
Cross-DAG dependencies are feasible per this doc.
The criterion can be checked in a separate DAG as a separate task, so that when it is met for a given date, the child task is allowed to run.
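A sketch of what this typically looks like with ExternalTaskSensor (the DAG and task IDs are placeholders, and the import path depends on the Airflow version):

    from airflow.sensors.external_task_sensor import ExternalTaskSensor

    # In the child DAG: wait, in reschedule mode, until the parent DAG's
    # criteria-checking task has succeeded for the same execution date.
    wait_for_criteria = ExternalTaskSensor(
        task_id="wait_for_parent_criteria",
        external_dag_id="parent_dag",        # placeholder
        external_task_id="check_criteria",   # placeholder
        mode="reschedule",
        poke_interval=600,
    )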
I have an app running on Heroku and I'm using celery together with a worker dyno to process background work.
I'm running tasks that use quite a lot of memory. These tasks get started at roughly the same time, but I want only one or two tasks to run at a time; the others must wait in the queue. How can I achieve that?
If they run at the same time, I run out of memory and the system gets restarted. I know why it's using a lot of memory and I'm not looking to decrease that.
Quite simply: limit your concurrency (number of celery worker processes) to the number of tasks that can safely run in parallel on this server.
Note that if you have tasks with wildly different resource needs (i.e. one task that eats a lot of RAM and takes minutes to complete, and a couple that are fast and don't require many resources at all), you might be better off using two distinct worker nodes to serve them (one for the heavy tasks and the other for the light ones) so the heavy tasks don't block the light ones. You can use queues to route tasks to different Celery nodes.
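A minimal sketch of that routing, assuming a Redis broker and placeholder app/task/queue names:

    # celery_app.py
    from celery import Celery

    app = Celery("myapp", broker="redis://localhost:6379/0")  # placeholder broker URL

    # Send the memory-hungry task to its own queue; everything else stays
    # on the default queue.
    app.conf.task_routes = {
        "tasks.heavy_task": {"queue": "heavy"},
    }

The worker for the heavy queue would then be started with a very low concurrency, e.g. celery -A celery_app worker -Q heavy --concurrency=1, while a second worker serves the default queue with a higher concurrency.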