Airflow version = 1.10.10
Hosted on Kubernetes, uses the Kubernetes executor.
DAG setup
DAG - generated dynamically (a single generator script produces multiple DAGs)
Task - a PythonOperator that pulls some data, runs an inference, and stores the predictions
Where does it hang? - when running the inference with TensorFlow
More details
One of our running tasks, as mentioned above, hung for 4 hours, and no amount of restarting helped it recover from that point. We found that the pod had 30+ subprocesses and was using about 40 GB of memory.
That didn't add up: when the model runs on a local machine it doesn't consume more than 400 MB, so there is no way it should suddenly jump to 40 GB.
Another suspicion was that it spins up so many processes because we dynamically generate around 19 DAGs. I changed the generator to produce only 1 DAG, and the processes didn't vanish; the worker pods still had 35+ subprocesses using the same amount of memory.
Here comes the interesting part. I wanted to be really sure it wasn't the dynamic DAG, so I created an independent DAG that prints the numbers 1 to 100000, pausing 5 seconds between each. The memory usage was still the same, but not the number of processes.
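A stripped-down version of that test DAG looked roughly like this (a sketch; the dag_id and schedule are illustrative):

from datetime import datetime
import time

from airflow import DAG
from airflow.operators.python_operator import PythonOperator

def count_slowly():
    # Print 1..100000, sleeping 5 seconds between numbers.
    for i in range(1, 100001):
        print(i)
        time.sleep(5)

dag = DAG(
    dag_id="debug_counter",
    start_date=datetime(2020, 1, 1),
    schedule_interval=None,
    catchup=False,
)

count_task = PythonOperator(task_id="count", python_callable=count_slowly, dag=dag)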
At this point, I am not sure which direction to take to debug the issue further.
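For reference, one standard-library trick (not something from the original setup) for seeing where a hung task is actually stuck is to register faulthandler inside the task's callable and then send the worker process a signal:

import faulthandler
import signal

def run_inference():  # stands in for the real PythonOperator callable
    # Dump every thread's Python stack when the process receives SIGUSR1,
    # e.g. by running `kill -USR1 <pid>` from a shell inside the worker pod.
    faulthandler.register(signal.SIGUSR1)
    ...  # pull data, load the model, run the inference as before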
Questions
Why is the task hanging?
Why are there so many subprocesses when using dynamic DAGs?
How can I debug this issue further?
Have you faced this before, and can you help?
Related
I've seen mention of SIGKILL occurring for others, but I think my use case is slightly different.
I'm using the managed Airflow service on GCP (Cloud Composer), running Airflow 2. I have 3 worker nodes, all set to the default instance creation settings.
The environment runs DAGs fairly smoothly for the most part (API calls, moving files from on-prem); however, it seems to have a terribly hard time executing a couple of slightly larger jobs.
One of these jobs uses a Samba connector to incrementally backfill missing data and store it on GCS. The other is a Salesforce API connector.
These jobs run locally with absolutely no issue, so I'm wondering why I'm running into problems here. There should be plenty of memory to run these tasks on the cluster, although scaling up the cluster for just 2 jobs doesn't seem particularly efficient.
I have tried both DAG and task timeouts. I've also tried increasing the connection timeout on the Samba client.
So could someone please share some insight into how I can get Airflow to execute these tasks without killing the session, even if they do take longer?
Happy to add more detail if required, but I don't have the data in front of me right now to share.
I believe this is about keepalives on the connections. If there are long-running calls that cause a long period of inactivity (because the other side is busy preparing the data), it is quite common on managed instances for the idle connection to be killed by a firewall. This happens, for example, with long Postgres queries, and the solution there is to configure keepalives.
I think you should do the same for your Samba connection (but you need to figure out how to do it in Composer): https://www.linuxtopia.org/online_books/network_administration_guides/using_samba_book/ch08_06_04.html
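For context, this is what enabling TCP keepalives looks like on a raw Python socket (the TCP_KEEP* options shown are Linux-specific, and the host and timings are made up); getting the Samba client inside Composer to do the equivalent for its own sockets is the part you would still need to work out:

import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)    # turn keepalives on
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 60)  # seconds idle before the first probe
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 30) # seconds between probes
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 5)    # failed probes before the connection drops
sock.connect(("fileserver.example.internal", 445))            # hypothetical Samba host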
Frustratingly, increasing resources meant the jobs could run. I don't know why the original resources weren't enough, as they really should have been. But optimisation of fully managed solutions isn't overly straightforward beyond adding cost.
I didn't quite get the problem that arises from sharing a job store across multiple schedulers in APScheduler.
The official documentation mentions
Job stores must never be shared between schedulers
but it doesn't discuss the problems behind that rule. Can someone please explain them?
Also, if I deploy a Django application containing APScheduler in production, will a separate job store end up being created for each worker process?
There are multiple reasons for this. In APScheduler 3.x, schedulers do not have any means to signal each other about changes happening in the job stores. When a scheduler starts, it queries the job store for jobs due for execution, processes them, and then asks how long it should sleep until the next due job. If another scheduler then adds a job that should run before that wake-up time, the first scheduler will happily sleep past it, because there is no mechanism through which it could be notified about the new (or updated) job.
Additionally, schedulers do not have the ability to enforce the maximum number of running instances of a job since they don't communicate with other schedulers. This can lead to conflicts when the same job is run on more than one scheduler process at the same time.
These shortcomings are addressed in the upcoming 4.x series and the ability to share job stores could be considered one of its most significant new features.
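In practical terms, the safe 3.x pattern is a single scheduler process owning the persistent job store, roughly like this (a sketch; the store URL and job are illustrative):

from apscheduler.schedulers.background import BackgroundScheduler
from apscheduler.jobstores.sqlalchemy import SQLAlchemyJobStore

def send_report():
    print("running the scheduled job")

# Exactly one process in the deployment should create and start this scheduler.
# With Django behind a multi-process server, that usually means a dedicated
# management command or service rather than the web worker processes themselves.
scheduler = BackgroundScheduler(
    jobstores={"default": SQLAlchemyJobStore(url="sqlite:///jobs.sqlite")}
)
scheduler.add_job(send_report, "interval", minutes=15, id="send_report", replace_existing=True)
scheduler.start()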
We need to execute 3000 DAG runs within a specific hour of the day.
Almost all the work inside the DAG is waiting for other microservices (for example AWS SageMaker) to do their part.
But our DAG consumes about 250 MB because of all the Python imports we need (sagemaker, tensorflow, etc.).
We noticed that the 250 MB is consumed even before the first task is executed (we checked this by printing process.memory_full_info()).
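(The check itself amounts to something like this, using psutil on the task's own process:)

import os
import psutil

process = psutil.Process(os.getpid())
# rss is the resident memory of this worker process; uss is the part of it
# that is not shared with any other process.
print(process.memory_full_info())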
We tried to run 8 concurrent DAG runs on a Celery cluster that has 2 workers (4 GB RAM each) with worker_concurrency=8, and it failed:
the worker machine crashed with an OOM error.
Drilling down into this, I understand that Airflow's Celery workers use the prefork model, meaning the main worker process is duplicated for each sub-worker it starts. Each subprocess consumes more or less what our DAG consumes (around 300 MB), and there is no memory sharing between the subprocesses.
So I started searching for a solution, since this obviously cannot scale unless we deploy something like 500 workers (each can take 6 DAG runs concurrently, based on our tests).
I read about the different worker pool classes that can be configured, like gevent or eventlet, and how they let you avoid duplicating memory for every sub-worker. I thought I had found the solution I needed.
But then I tried to set this up and didn't see that behavior in practice: I would only ever see 2 concurrent tasks being executed (with gevent configured).
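For reference, at the plain-Celery level the pool is selected roughly like this (a sketch; the broker URL is made up, and whether a managed Airflow deployment exposes these settings at all is a separate question):

from celery import Celery

app = Celery("workers", broker="redis://broker:6379/0")
app.conf.worker_pool = "gevent"      # default is "prefork"; "eventlet" is the other green option
app.conf.worker_concurrency = 100    # greenlets rather than OS processes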
In addition, our Airflow cloud provider explained to me that this is not how Airflow works, and that its base task runner still creates a subprocess for each task.
This is the reference he sent me: https://github.com/apache/airflow/blob/919bb8c1cbb36679c02ca3f8890c300e1527c08b/airflow/task/task_runner/base_task_runner.py#L112-L142
My questions are:
Is it possible to scale the way we need, given that our DAG consumes a lot of memory to begin with?
Does working with the Celery executor actually benefit memory-hungry DAG runs?
Is the cloud provider right in saying "this is not how Airflow works with Celery"?
I hear of people running thousands of tasks concurrently, yet even with a minimal DAG import I get 130 MB of memory usage. How can this scale to thousands?
Thanks.
I have a small cluster upon which I deploy a dask graph using:
from dask.distributed import Client
...
client = Client(f'{scheduler_ip}:{scheduler_port}', set_as_default=False)
client.get(workflow, final_node)
During the workflow I have a bunch of tasks that run in parallel, of course. Sometimes, however, there's an error in a module that one worker is running. As soon as that module fails, it gets returned to the scheduler, and the scheduler then stops the other workers running in parallel (even if the others have no dependency on the failed one). It stops them midstream.
Is there any way to allow the others to complete and then fail, instead of shutting them down immediately?
The Client.get function is all-or-nothing. You should probably look at the futures interface instead: there you launch many computations that happen to depend on each other, and the ones that can finish will finish.
See https://docs.dask.org/en/latest/futures.html
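A minimal sketch of the futures style, reusing the scheduler address from the question (process_part and parts are placeholders for the real work):

from dask.distributed import Client, as_completed

def process_part(part):  # placeholder for one unit of the real workflow
    return part * 2

client = Client(f"{scheduler_ip}:{scheduler_port}", set_as_default=False)
parts = range(10)        # placeholder inputs

futures = [client.submit(process_part, part) for part in parts]

for future in as_completed(futures):
    if future.status == "error":
        # One failure does not cancel the unrelated futures; they keep running.
        print("failed:", future.exception())
    else:
        print("done:", future.result())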
I have a spider that downloads pages and stores data in a database. I have also created a Flask application with an admin panel (using the Flask-Admin extension) that displays the database.
Now I want to add a function to my Flask app to control the spider's state: switch it on/off.
I think this is possible with threads or multiprocessing. Celery is not a good option because the program as a whole must use as little memory as possible.
Which method should I choose to implement this functionality?
Discounting Celery based on memory usage would probably be a mistake, as Celery has low overhead in both time and space. In fact, using Celery+Flask does not use much more memory than using Flask alone.
In addition, Celery comes with several choices you can make that have an impact
on the amount of memory used. For example, there are 5 different pool implementations that all have different strengths and trade-offs; the pool choices are:
multiprocessing
By default Celery uses multiprocessing, which means that it will spawn child processes
to offload work to. This is the most memory expensive option - simply because
every child process will duplicate the amount of base memory needed.
But Celery also comes with an autoscale feature that will kill off worker
processes when there's little work to do, and spawn new processes when there's more work:
$ celeryd --autoscale=0,10
where 0 is the minimum number of processes, and 10 is the maximum. Here celeryd will
start off with no child processes, and grow based on load up to a maximum of 10 processes. When load decreases, so will the number of worker processes.
eventlet/gevent
When using the eventlet/gevent pools only a single process will be used, and thus it will
use a lot less memory, but with the downside that tasks calling blocking code will
block other tasks from executing. If your tasks are mostly I/O bound you should be ok,
and you can also combine different pools and send problem tasks to a multiprocessing pool instead.
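With a current Celery app, the "send problem tasks to a different pool" idea looks roughly like this (queue and task names are illustrative): route the blocking tasks to their own queue and serve that queue with a prefork worker, while a gevent or eventlet worker serves the I/O-bound queue.

from celery import Celery

app = Celery("spider", broker="redis://localhost:6379/0")

# Blocking/CPU-heavy tasks go to the "cpu" queue; everything else defaults to "io".
app.conf.task_default_queue = "io"
app.conf.task_routes = {"tasks.crunch_numbers": {"queue": "cpu"}}

# Then start one worker per queue, each with its own pool, e.g.:
#   celery -A spider worker -Q io  -P gevent  -c 200
#   celery -A spider worker -Q cpu -P prefork -c 4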
threads
Celery also comes with a pool using threads.
The development version that will become version 2.6 includes a lot of optimizations,
and there is no longer any need for the Flask-Celery extension module. If you are not going
into production in the next few days, then I would encourage you to try the development version,
which must be installed like this:
$ pip install https://github.com/ask/kombu/zipball/master
$ pip install https://github.com/ask/celery/zipball/master
The new API is now also Flask inspired, so you should read the new getting started guide:
http://ask.github.com/celery/getting-started/first-steps-with-celery.html
With all this said, most optimization work has been focused on execution speed so far,
and there are probably many more memory optimizations that can be made. It has not been a request so far, but in the unlikely event that Celery does not match your memory constraints, you can open an issue on our bug tracker and I'm sure it will get attention, or you can even help us make it happen.
You could supervise the spider process using multiprocessing or subprocess, then just hand the process handle around the session.
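A minimal sketch of that idea (spider.crawl is a hypothetical entry point for the spider):

from multiprocessing import Process
from flask import Flask

from spider import crawl   # hypothetical: the spider's main function

app = Flask(__name__)
spider_process = None

@app.route("/spider/start")
def start_spider():
    global spider_process
    if spider_process is None or not spider_process.is_alive():
        spider_process = Process(target=crawl)
        spider_process.start()
    return "spider running"

@app.route("/spider/stop")
def stop_spider():
    global spider_process
    if spider_process is not None and spider_process.is_alive():
        spider_process.terminate()   # hard stop; a graceful shutdown flag would be kinder
        spider_process.join()
    return "spider stopped"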