Fault tolerance in Dask dependency graphs

Fault tolerance in Dask dependency graphs - python

I have a small cluster upon which I deploy a dask graph using:
from dask.distributed import Client
...
client = Client(f'{scheduler_ip}:{scheduler_port}', set_as_default=False)
client.get(workflow, final_node)
During the workflow I have a bunch of tasks that run in parallel, of course. Sometimes, however, there's an error in a module that one worker is running. As soon as that module fails it gets returned to the scheduler and then the scheduler stops the other works running in parallel (even if the others have no dependency on this one). It stops them midstream.
Is there anyway to allow the others to complete, then fail, instead of shutting them down immediately?

The Client.get function is all-or-nothing. You should probably look at the futures interface. Here you're launching many computations which happen to depend on each other. The ones that can finish will finish.
See https://docs.dask.org/en/latest/futures.html

Related

Airflow spinning up multiple subprocess for a single task and hanging

Airflow version = 1.10.10
Hosted on Kubernetes, Uses Kubernetes executor.
DAG setup
DAG - Is generated using the dynamic dag
Task - Is a PythonOperator that pulls some data, runs an inference, stores the predictions.
Where does it hang? - When running the inference using tensorflow
More details
One of our running tasks, as mentioned above, was hanging for 4 hours. No amount of restarting can help it to recover from that point. We found out that the pod had almost 30+ subprocess and 40GB of memory used.
We weren't convinced because when running on a local machine, the model doesn't consume more than 400MB. There is no way it can suddenly bump up to 40GB in memory.
Another suspicion was maybe it's spinning up so many processes because we are dynamically generating around 19 DAGS. I changed the generator to generate only 1, and the processes didn't vanish. The worker pods still had 35+ subprocesses with the same memory.
Here comes the interesting part, I wanted to be really sure that it's not the dynamic DAG. Hence I created an independent DAG that prints out 1..100000 while pausing for 5 seconds each. The memory usage was still the same but not the number of processes.
At this point, I am not sure which direction to take to debug the issue further.
Questions
Why is the task hanging?
Why are there so many sub-processes when using dynamic dag?
How can I debug this issue further?
Have you faced this before, and can you help?

How do I time out a job submitted to Dask?

I am using Dask to run a pool of tasks, retrieving results in the order they complete by the as_completed method, and potentially submitting new tasks to the pool each time one returns:
# Initial set of jobs
futures = [client.submit(job.run_simulation) for job in jobs]
pool = as_completed(futures, with_results=True)
while True:
# Wait for a job to finish
f, result = next(pool)
# Exit condition
if result == 'STOP':
break
# Do processing and maybe submit more jobs
more_jobs = process_result(f, result)
more_futures = [client.submit(job.run_simulation) for job in more_jobs]
pool.update(more_futures)
Here's my problem: The function job.run_simulation that I am submitting can sometimes hang for a long time, and I want to time out this function - kill the task and move on if the run time exceeds a certain time limit.
Ideally, I'd like to do something like client.submit(job.run_simulation, timeout=10), and have next(pool) return None if the task ran longer than the timeout.
Is there any way that Dask can help me time out jobs like this?
What I've tried so far
My first instinct was to handle the timeout independently of Dask within the job.run_simulation function itself. I've seen two types of suggestions (e.g. here) for generic Python timeouts.
1) Use two threads, one for the function itself and one for a timer. My impression is this doesn't actually work because you can't kill threads. Even if the timer runs out, both threads have to finish before the task is completed.
2) Use two separate processes (with the multiprocessing module), one for the function and one for the timer. This would work, but since I'm already in a daemon subprocess spawned by Dask, I'm not allowed to create new subprocesses.
A third possibility is to move the code block to a separate script that I run with subprocess.run and use the subprocess.run built in timeout. I could do this, but it feels like a worst-case fallback scenario because it would take a lot of cumbersome passing of data to and from the subprocess.
So it feels like I have to accomplish the timeout at the level of Dask. My one idea here is to create a timer as a subprocess at the same time as I submit the task to Dask. Then if the timer runs out, use Client.cancel() to stop the task. The problem with this plan is that Dask might wait for workers to free up before starting the task, and I don't want the timer running before the task is actually running.

Your assessment of the problem seems correct to me and the solutions you went through are the same that I would consider. Some notes:
Client.cancel is unable to stop a function from running if it has already started. These functions are running in a thread pool and so you run into the "can't stop threads" limitation. Dask workers are just Python processes and have the same abilities and limitations.
You say that you can't use processes from within a daemon process. One solution to this would be to change how you're using processes in one of the following ways:
If you're using dask.distributed on a single machine then just don't use processes
client = Client(processes=False)
Don't use Dask's default nanny processes, then your dask worker will be a normal process capable of using multiprocessing
Set dask's multiprocessing-context config to "spawn" rather than fork or forkserver
The clean way to solve this problem though is to solve it inside of your function job.run_simulation. Ideally you would be able to push this timeout logic down to that code and have it raise cleanly.

Persistent Long Running Tasks in Celery

I'm working on a Python based system, to enqueue long running tasks to workers.
The tasks originate from an outside service that generate a "token", but once they're created based on that token, they should run continuously, and stopped only when explicitly removed by code.
The task starts a WebSocket and loops on it. If the socket is closed, it reopens it. Basically, the task shouldn't reach conclusion.
My goals in architecting this solutions are:
When gracefully restarting a worker (for example to load new code), the task should be re-added to the queue, and picked up by some worker.
Same thing should happen when ungraceful shutdown happens.
2 workers shouldn't work on the same token.
Other processes may create more tasks that should be directed to the same worker that's handling a specific token. This will be resolved by sending those tasks to a queue named after the token, which the worker should start listening to after starting the token's task. I am listing this requirement as an explanation to why a task engine is even required here.
Independent servers, fast code reload, etc. - Minimal downtime per task.
All our server side is Python, and looks like Celery is the best platform for it.
Are we using the right technology here? Any other architectural choices we should consider?
Thanks for your help!

According to the docs
When shutdown is initiated the worker will finish all currently executing tasks before it actually terminates, so if these tasks are important you should wait for it to finish before doing anything drastic (like sending the KILL signal).
If the worker won’t shutdown after considerate time, for example because of tasks stuck in an infinite-loop, you can use the KILL signal to force terminate the worker, but be aware that currently executing tasks will be lost (unless the tasks have the acks_late option set).
You may get something like what you want by using retry or acks_late
Overall I reckon you'll need to implement some extra application-side job control, plus, maybe, a lock service.
But, yes, overall you can do this with celery. Whether there are better technologies... that's out of the scope of this site.

Run specific django manage.py commands at intervals

I need to run a specific manage.py commands on an EC2 instance every X minutes. For example: python manage.py some_command.
I have looked up django-chronograph. Following the instructions, I've added chronograph to my settings.py but on runserver it keeps telling me No module named chronograph.
Is there something I'm missing to get this running? And after running how do I get manage.py commands to run using chronograph?
Edit: It's installed in the EC2 instance's virtualenv.

I would suggest you to configure cron to run your command at specific times/intervals.

First, install it by running pip install django-chronograph.

I would say handle this through cross, but if you don't want to use cross then:
Make sure you installed the module in the virtualenv (With easy_install, pip, or any other way that Amazon EC2 allows). After that you might want to look up the threading module documentation:
Python 2 threading module documentation
Python 3 threading module documentation
The purpose of using threading will be to have the following structure:
A "control" thread, which will use the chronograph module and do the time measurements, and putting the new work to do in an "input queue" on each scheduled time, for the worker threads (which will be active already) to process, or just trigger each worker thread (make it active) at the time you want to trigger each execution. In the first case you'll be taking advantage of parallel threads to do a big chunk of work and minimize io wait times, but since the work is in a queue, the workers will process one at a time. Meaning if you schedule two things too close together and the previous element is still being processed, the new item will have to wait (Depending on your programming logic and amount of worker threads some workers might start processing the new item, but is a bit more complex logic).
In the second case your control thread will actually trigger the start of a new thread (or group of threads) each time you want to trigger a scheduled action. If there's big data to process you might need to spawn a new queue for each task to process and create a group of worker threads for it for each task, but if the data is not that big then you can just get away with having the worker process just one data package and be done once execution is done and you get a result. Either way this method will allow you to schedule tasks without limitation on how close they can be, since new independent worker threads will be created for them every time.
Finally, you might want to create an "output queue" and output thread, to store and process (or output, or anything else you want to do with it...) the results of each worker threads.
The control thread will be basically trying to imitate cron in its logic, triggering actions at certain times depending on how it was configured.
There's also a multiprocessing module in python which will work with processes instead and take advantage of true multiprocessing hardware, but I don't think you'll really need it in this case, unless you see performance issues caused by cpu performance.
If you need any clarification, help, examples, just let me know.

Parallel processing within a queue (using Pool within Celery)

I'm using Celery to queue jobs from a CGI application I made. The way I've set it up, Celery makes each job run one- or two-at-a-time by setting CELERYD_CONCURRENCY = 1 or = 2 (so they don't crowd the processor or thrash from memory consumption). The queue works great, thanks to advice I got on StackOverflow.
Each of these jobs takes a fair amount of time (~30 minutes serial), but has an embarrassing parallelizability. For this reason, I was using Pool.map to split it and do the work in parallel. It worked great from the command line, and I got runtimes around 5 minutes using a new many-cored chip.
Unfortunately, there is some limitation that does not allow daemonic process to have subprocesses, and when I run the fancy parallelized code within the CGI queue, I get this error:
AssertionError: daemonic processes are not allowed to have children
I noticed other people have had similar questions, but I can't find an answer that wouldn't require abandoning Pool.map altogether, and making more complicated thread code.
What is the appropriate design choice here? I can easily run my serial jobs using my Celery queue. I can also run my much faster parallelized jobs without a queue. How should I approach this, and is it possible to get what I want (both the queue and the per-job parallelization)?
A couple of ideas I've had (some are quite hacky):
The job sent to the Celery queue simply calls the command line program. That program can use Pool as it pleases, and then saves the result figures & data to a file (just as it does now). Downside: I won't be able to check on the status of the job or see if it terminated successfully. Also, system calls from CGI may cause security issues.
Obviously, if the queue is very full of jobs, I can make use of the CPU resources (by setting CELERYD_CONCURRENCY = 6 or so); this will allow many people to be "at the front of the queue" at once.Downside: Each job will spend a lot of time at the front of the queue; if the queue isn't full, there will be no speedup. Also, many partially finished jobs will be stored in memory at the same time, using much more RAM.
Use Celery's #task to parallelize within sub-jobs. Then, instead of setting CELERYD_CONCURRENCY = 1, I would set it to 6 (or however many sub jobs I'd like to allow in memory at a time). Downside: First of all, I'm not sure whether this will successfully avoid the "task-within-task" problem. But also, the notion of queue position may be lost, and many partially finished jobs may end up in memory at once.
Perhaps there is a way to call Pool.map and specify that the threads are non-daemonic? Or perhaps there is something more lightweight I can use instead of Pool.map? This is similar to an approach taken on another open StackOverflow question. Also, I should note that the parallelization I exploit via Pool.map is similar to linear algebra, and there is no inter-process communication (each just runs independently and returns its result without talking to the others).
Throw away Celery and use multiprocessing.Queue. Then maybe there'd be some way to use the same "thread depth" for every thread I use (i.e. maybe all of the threads could use the same Pool, avoiding nesting)?
Thanks a lot in advance.

What you need is a workflow management system (WFMS) that manages
task concurrency
task dependency
task nesting
among other things.
From a very high level view, a WFMS sits on top of a task pool like celery, and submits the tasks which are ready to execute to the pool. It is also responsible for opening up a nest and submitting the tasks in the nest accordingly.
I've developed a system to do just that. It's called pomsets. Try it out, and feel free to send me any questions.

I using a multiprocessed deamons based on Twisted with forking and Gearman jobs query normally.
Try to look at Gearman.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.