How do I time out a job submitted to Dask? - python

I am using Dask to run a pool of tasks, retrieving results in the order they complete by the as_completed method, and potentially submitting new tasks to the pool each time one returns:
# Initial set of jobs
futures = [client.submit(job.run_simulation) for job in jobs]
pool = as_completed(futures, with_results=True)
while True:
# Wait for a job to finish
f, result = next(pool)
# Exit condition
if result == 'STOP':
break
# Do processing and maybe submit more jobs
more_jobs = process_result(f, result)
more_futures = [client.submit(job.run_simulation) for job in more_jobs]
pool.update(more_futures)
Here's my problem: The function job.run_simulation that I am submitting can sometimes hang for a long time, and I want to time out this function - kill the task and move on if the run time exceeds a certain time limit.
Ideally, I'd like to do something like client.submit(job.run_simulation, timeout=10), and have next(pool) return None if the task ran longer than the timeout.
Is there any way that Dask can help me time out jobs like this?
What I've tried so far
My first instinct was to handle the timeout independently of Dask within the job.run_simulation function itself. I've seen two types of suggestions (e.g. here) for generic Python timeouts.
1) Use two threads, one for the function itself and one for a timer. My impression is this doesn't actually work because you can't kill threads. Even if the timer runs out, both threads have to finish before the task is completed.
2) Use two separate processes (with the multiprocessing module), one for the function and one for the timer. This would work, but since I'm already in a daemon subprocess spawned by Dask, I'm not allowed to create new subprocesses.
A third possibility is to move the code block to a separate script that I run with subprocess.run and use the subprocess.run built in timeout. I could do this, but it feels like a worst-case fallback scenario because it would take a lot of cumbersome passing of data to and from the subprocess.
So it feels like I have to accomplish the timeout at the level of Dask. My one idea here is to create a timer as a subprocess at the same time as I submit the task to Dask. Then if the timer runs out, use Client.cancel() to stop the task. The problem with this plan is that Dask might wait for workers to free up before starting the task, and I don't want the timer running before the task is actually running.

Your assessment of the problem seems correct to me and the solutions you went through are the same that I would consider. Some notes:
Client.cancel is unable to stop a function from running if it has already started. These functions are running in a thread pool and so you run into the "can't stop threads" limitation. Dask workers are just Python processes and have the same abilities and limitations.
You say that you can't use processes from within a daemon process. One solution to this would be to change how you're using processes in one of the following ways:
If you're using dask.distributed on a single machine then just don't use processes
client = Client(processes=False)
Don't use Dask's default nanny processes, then your dask worker will be a normal process capable of using multiprocessing
Set dask's multiprocessing-context config to "spawn" rather than fork or forkserver
The clean way to solve this problem though is to solve it inside of your function job.run_simulation. Ideally you would be able to push this timeout logic down to that code and have it raise cleanly.

Related

Running shell commands in parallel using dask distributed

I have a folder with a lot of .sh scripts. How can I use an already set up dask distributed cluster to run them in parallel?
Currently, I am doing the following:
import dask, distributed, os
# list with shell commands that I want to run
commands = ['./script1.sh', './script2.sh', './script3.sh']
# delayed function used to execute a command on a worker
run_func = dask.delayed(os.system)
# connect to cluster
c = distributed.Client('my_server:8786')
# submit job
futures = c.compute( [run_func(c) for c in commands])
# keep connection alive, do not exit python
import time
while True:
time.sleep(1)
This works, however for this scenario it would be ideal if the client could disconnect without causing the scheduler to cancel the job. I am looking for a way to compute my tasks that does not require an active client connection. How could this be done?
Have you seen http://distributed.readthedocs.io/en/latest/api.html#distributed.client.fire_and_forget ? That would be a way to ensure that some task runs on the cluster after the client has gone.
Note also that you have functions like wait() or even gather() so you don't need sleep-forever loops.
In general, though, subprocess.Popen will launch a child process and not wait for it to finish, so you don't even need anything complex from dask, since it doesn't appear you are interested in any output from the call.

Is this multi-threaded function asynchronous

I'm afraid I'm still a bit confused (despite checking other threads) whether:
all asynchronous code is multi-threaded
all multi-threaded functions are asynchronous
My initial guess is no to both and that proper asynchronous code should be able to run in one thread - however it can be improved by adding threads for example like so:
So I constructed this toy example:
from threading import *
from queue import Queue
import time
def do_something_with_io_lag(in_work):
out = in_work
# Imagine we do some work that involves sending
# something over the internet and processing the output
# once it arrives
time.sleep(0.5) # simulate IO lag
print("Hello, bee number: ",
str(current_thread().name).replace("Thread-",""))
class WorkerBee(Thread):
def __init__(self, q):
Thread.__init__(self)
self.q = q
def run(self):
while True:
# Get some work from the queue
work_todo = self.q.get()
# This function will simiulate I/O lag
do_something_with_io_lag(work_todo)
# Remove task from the queue
self.q.task_done()
if __name__ == '__main__':
def time_me(nmbr):
number_of_worker_bees = nmbr
worktodo = ['some input for work'] * 50
# Create a queue
q = Queue()
# Fill with work
[q.put(onework) for onework in worktodo]
# Launch processes
for _ in range(number_of_worker_bees):
t = WorkerBee(q)
t.start()
# Block until queue is empty
q.join()
# Run this code in serial mode (just one worker)
%time time_me(nmbr=1)
# Wall time: 25 s
# Basically 50 requests * 0.5 seconds IO lag
# For me everything gets processed by bee number: 59
# Run this code using multi-tasking (launch 50 workers)
%time time_me(nmbr=50)
# Wall time: 507 ms
# Basically the 0.5 second IO lag + 0.07 seconds it took to launch them
# Now everything gets processed by different bees
Is it asynchronous?
To me this code does not seem asynchronous because it is Figure 3 in my example diagram. The I/O call blocks the thread (although we don't feel it because they are blocked in parallel).
However, if this is the case I am confused why requests-futures is considered asynchronous since it is a wrapper around ThreadPoolExecutor:
with concurrent.futures.ThreadPoolExecutor(max_workers=20) as executor:
future_to_url = {executor.submit(load_url, url, 10): url for url in get_urls()}
for future in concurrent.futures.as_completed(future_to_url):
url = future_to_url[future]
try:
data = future.result()
Can this function on just one thread?
Especially when compared to asyncio, which means it can run single-threaded
There are only two ways to have a program on a single processor do
“more than one thing at a time.” Multi-threaded programming is the
simplest and most popular way to do it, but there is another very
different technique, that lets you have nearly all the advantages of
multi-threading, without actually using multiple threads. It’s really
only practical if your program is largely I/O bound. If your program
is processor bound, then pre-emptive scheduled threads are probably
what you really need. Network servers are rarely processor bound,
however.
First of all, one note: concurrent.futures.Future is not the same as asyncio.Future. Basically it's just an abstraction - an object, that allows you to refer to job result (or exception, which is also a result) in your program after you assigned a job, but before it is completed. It's similar to assigning common function's result to some variable.
Multithreading: Regarding your example, when using multiple threads you can say that your code is "asynchronous" as several operations are performed in different threads at the same time without waiting for each other to complete, and you can see it in the timing results. And you're right, your function due to sleep is blocking, it blocks the worker thread for the specified amount of time, but when you use several threads those threads are blocked in parallel. So if you would have one job with sleep and the other one without and run multiple threads, the one without sleep would perform calculations while the other would sleep. When you use single thread, the jobs are performed in in a serial manner one after the other, so when one job sleeps the other jobs wait for it, actually they just don't exist until it's their turn. All this is pretty much proven by your time tests. The thing happened with print has to do with "thread safety", i.e. print uses standard output, which is a single shared resource. So when your multiple threads tried to print at the same time the switching happened inside and you got your strange output. (This also show "asynchronicity" of your multithreaded example.) To prevent such errors there are locking mechanisms, e.g. locks, semaphores, etc.
Asyncio: To better understand the purpose note the "IO" part, it's not 'async computation', but 'async input/output'. When talking about asyncio you usually don't think about threads at first. Asyncio is about event loop and generators (coroutines). The event loop is the arbiter, that governs the execution of coroutines (and their callbacks), that were registered to the loop. Coroutines are implemented as generators, i.e. functions that allow to perform some actions iteratively, saving state at each iteration and 'returning', and on the next call continuing with the saved state. So basically the event loop is while True: loop, that calls all coroutines/generators, assigned to it, one after another, and they provide result or no-result on each such call - this provides possibility for "asynchronicity". (A simplification, as there's scheduling mechanisms, that optimize this behavior.) The event loop in this situation can run in single thread and if coroutines are non-blocking it will give you true "asynchronicity", but if they are blocking then it's basically a linear execution.
You can achieve the same thing with explicit multithreading, but threads are costly - they require memory to be assigned, switching them takes time, etc. On the other hand asyncio API allows you to abstract from actual implementation and just consider your jobs to be performed asynchronously. It's implementation may be different, it includes calling the OS API and the OS decides what to do, e.g. DMA, additional threads, some specific microcontroller use, etc. The thing is it works well for IO due to lower level mechanisms, hardware stuff. On the other hand, performing computation will require explicit breaking of computation algorithm into pieces to use as asyncio coroutine, so a separate thread might be a better decision, as you can launch the whole computation as one there. (I'm not talking about algorithms that are special to parallel computing). But asyncio event loop might be explicitly set to use separate threads for coroutines, so this will be asyncio with multithreading.
Regarding your example, if you'll implement your function with sleep as asyncio coroutine, shedule and run 50 of them single threaded, you'll get time similar to the first time test, i.e. around 25s, as it is blocking. If you will change it to something like yield from [asyncio.sleep][3](0.5) (which is a coroutine itself), shedule and run 50 of them single threaded, it will be called asynchronously. So while one coroutine will sleep the other will be started, and so on. The jobs will complete in time similar to your second multithreaded test, i.e. close to 0.5s. If you will add print here you'll get good output as it will be used by single thread in serial manner, but the output might be in different order then the order of coroutine assignment to the loop, as coroutines could be run in different order. If you will use multiple threads, then the result will obviously be close to the last one anyway.
Simplification: The difference in multythreading and asyncio is in blocking/non-blocking, so basicly blocking multithreading will somewhat come close to non-blocking asyncio, but there're a lot of differences.
Multithreading for computations (i.e. CPU bound code)
Asyncio for input/output (i.e. I/O bound code)
Regarding your original statement:
all asynchronous code is multi-threaded
all multi-threaded functions are asynchronous
I hope that I was able to show, that:
asynchronous code might be both single threaded and multi-threaded
all multi-threaded functions could be called "asynchronous"
I think the main confusion comes from the meaning of asynchronous. From the Free Online Dictionary of Computing, "A process [...] whose execution can proceed independently" is asynchronous. Now, apply that to what your bees do:
Retrieve an item from the queue. Only one at a time can do that, while the order in which they get an item is undefined. I wouldn't call that asynchronous.
Sleep. Each bee does so independently of all others, i.e. the sleep duration runs on all, otherwise the time wouldn't go down with multiple bees. I'd call that asynchronous.
Call print(). While the calls are independent, at some point the data is funneled into the same output target, and at that point a sequence is enforced. I wouldn't call that asynchronous. Note however that the two arguments to print() and also the trailing newline are handled independently, which is why they can be interleaved.
Lastly, the call to q.join(). Here of course the calling thread is blocked until the queue is empty, so some kind of synchronization is enforced and wanted. I don't see why this "seems to break" for you.

Managing Python Multiprocess processes with different memory ussage

I use a simple RabbitMQ queue to distribute tasks to worker processes. Each worker process uses a pool of multiprocessing instances to work on multiple task at the same time to use the memory and the cpu as much as possible.
The problem is, that some of the task take much more RAM than the others, so that the worker process would crash if it starts more than one instance. But while the worker is working on the RAM intense task, I'd like it to work on other less RAM intense tasks to use the rest of the CPUs.
One idea would be to use multiple queues or topics but I am wondering what the recommended approach is. Can I catch out of memory errors before they crash the process?
What would be the right approach to solve this?
[updated update]
There whole system will consist of multiple multi core machines, but on each multi core machine there is only one worker program running, that creates as much multiprocessing instances as cores. The different machines should be independent of each other except that they get their tasks from the same queue.
I think trying to catch and recover from OOM errors will be very difficult, if not impossible. You would need a thread or process to be running that constantly monitors memory usage, and when it detects it's too high, does...what exactly? Kills a process that's processing a task? tries to pause it (if that's possible; it may not be depending what yours tasks are doing). Even then, pausing it isn't going to release any memory. You'd have to release the memory and restart the task when its safe, which means you'd have to requeue it, decide when its safe, etc.
Instead of trying to detect and recover from the problem, I would recommend trying to avoid it altogether. Create two queues, and two pools. One queue/pool for high-memory tasks, and another queue/pool for low-memory tasks. The high-memory pool would only have a single process in it, so it would be limited to running one task concurrently, which saves your memory. The low-memory queue would have multiprocessing.cpu_count() - 1 processes, allowing you to keep your CPUs saturated across the two pools.
One potential issue with this approach is that if you exhaust the high-memory queue while still having low-memory tasks pending, you'll be wasting one of your CPU. You could handle this consuming from the high-memory queue in a non-blocking way (or with a timeout), so that if the high-memory queue is empty when you're ready to consume a task, you can grab a low-memory task instead. Then when you're done processing it, check the high-memory queue again.
Something like this:
import multiprocessing
# hi_q and lo_q are placeholders for whatever library you're using to consume from RabbitMQ
def high_mem_consume():
while True:
task = hi_q.consume(timeout=2)
if not task:
lo_q.consume(timeout=2)
if task:
process_task(task)
def low_mem_consume():
while True:
task = lo_q.consume() # Blocks forever
process_task(task)
if __name__ == "__main__":
hi_pool = multiprocessing.Pool(1)
lo_pool = multiprocessing.Pool(multiprocessing.cpu_count() - 1)
hi_pool.apply_async(high_mem_consume)
lo_pool.apply_async(lo_mem_consume)

Non-blocking, non-concurrent tasks in Python

I am working on an implementation of a very small library in Python that has to be non-blocking.
On some production code, at some point, a call to this library will be done and it needs to do its own work, in its most simple form it would be a callable that needs to pass some information to a service.
This "passing information to a service" is a non-intensive task, probably sending some data to an HTTP service or something similar. It also doesn't need to be concurrent or to share information, however it does need to terminate at some point, possibly with a timeout.
I have used the threading module before and it seems the most appropriate thing to use, but the application where this library will be used is so big that I am worried of hitting the threading limit.
On local testing I was able to hit that limit at around ~2500 threads spawned.
There is a good possibility (given the size of the application) that I can hit that limit easily. It also makes me weary of using a Queue given the memory implications of placing tasks at a high rate in it.
I have also looked at gevent but I couldn't see an example of being able to spawn something that would do some work and terminate without joining. The examples I went through where calling .join() on a spawned Greenlet or on an array of greenlets.
I don't need to know the result of the work being done! It just needs to fire off and try to talk to the HTTP service and die with a sensible timeout if it didn't.
Have I misinterpreted the guides/tutorials for gevent ? Is there any other possibility to spawn a callable in fully non-blocking fashion that can't hit a ~2500 limit?
This is a simple example in Threading that does work as I would expect:
from threading import Thread
class Synchronizer(Thread):
def __init__(self, number):
self.number = number
Thread.__init__(self)
def run(self):
# Simulating some work
import time
time.sleep(5)
print self.number
for i in range(4000): # totally doesn't get past 2,500
sync = Synchronizer(i)
sync.setDaemon(True)
sync.start()
print "spawned a thread, number %s" % i
And this is what I've tried with gevent, where it obviously blocks at the end to
see what the workers did:
def task(pid):
"""
Some non-deterministic task
"""
gevent.sleep(1)
print('Task', pid, 'done')
for i in range(100):
gevent.spawn(task, i)
EDIT:
My problem stemmed out from my lack of familiarity with gevent. While the Thread code was indeed spawning threads, it also prevented the script from terminating while it did some work.
gevent doesn't really do that in the code above, unless you add a .join(). All I had to do to see the gevent code do some work with the spawned greenlets was to make it a long running process. This definitely fixes my problem as the code that needs to spawn the greenlets is done within a framework that is a long running process in itself.
Nothing requires you to call join in gevent, if you're expecting your main thread to last longer than any of your workers.
The only reason for the join call is to make sure the main thread lasts at least as long as all of the workers (so that the program doesn't terminate early).
Why not spawn a subprocess with a connected pipe or similar and then, instead of a callable, just drop your data on the pipe and let the subprocess handle it completely out of band.
As explained in Understanding Asynchronous/Multiprocessing in Python, asyncoro framework supports asynchronous, concurrent processes. You can run tens or hundreds of thousands of concurrent processes; for reference, running 100,000 simple processes takes about 200MB. If you want to, you can mix threads in rest of the system and coroutines with asyncoro (provided threads and coroutines don't share variables, but use coroutine interface functions to send messages etc.).

Parallel processing within a queue (using Pool within Celery)

I'm using Celery to queue jobs from a CGI application I made. The way I've set it up, Celery makes each job run one- or two-at-a-time by setting CELERYD_CONCURRENCY = 1 or = 2 (so they don't crowd the processor or thrash from memory consumption). The queue works great, thanks to advice I got on StackOverflow.
Each of these jobs takes a fair amount of time (~30 minutes serial), but has an embarrassing parallelizability. For this reason, I was using Pool.map to split it and do the work in parallel. It worked great from the command line, and I got runtimes around 5 minutes using a new many-cored chip.
Unfortunately, there is some limitation that does not allow daemonic process to have subprocesses, and when I run the fancy parallelized code within the CGI queue, I get this error:
AssertionError: daemonic processes are not allowed to have children
I noticed other people have had similar questions, but I can't find an answer that wouldn't require abandoning Pool.map altogether, and making more complicated thread code.
What is the appropriate design choice here? I can easily run my serial jobs using my Celery queue. I can also run my much faster parallelized jobs without a queue. How should I approach this, and is it possible to get what I want (both the queue and the per-job parallelization)?
A couple of ideas I've had (some are quite hacky):
The job sent to the Celery queue simply calls the command line program. That program can use Pool as it pleases, and then saves the result figures & data to a file (just as it does now). Downside: I won't be able to check on the status of the job or see if it terminated successfully. Also, system calls from CGI may cause security issues.
Obviously, if the queue is very full of jobs, I can make use of the CPU resources (by setting CELERYD_CONCURRENCY = 6 or so); this will allow many people to be "at the front of the queue" at once.Downside: Each job will spend a lot of time at the front of the queue; if the queue isn't full, there will be no speedup. Also, many partially finished jobs will be stored in memory at the same time, using much more RAM.
Use Celery's #task to parallelize within sub-jobs. Then, instead of setting CELERYD_CONCURRENCY = 1, I would set it to 6 (or however many sub jobs I'd like to allow in memory at a time). Downside: First of all, I'm not sure whether this will successfully avoid the "task-within-task" problem. But also, the notion of queue position may be lost, and many partially finished jobs may end up in memory at once.
Perhaps there is a way to call Pool.map and specify that the threads are non-daemonic? Or perhaps there is something more lightweight I can use instead of Pool.map? This is similar to an approach taken on another open StackOverflow question. Also, I should note that the parallelization I exploit via Pool.map is similar to linear algebra, and there is no inter-process communication (each just runs independently and returns its result without talking to the others).
Throw away Celery and use multiprocessing.Queue. Then maybe there'd be some way to use the same "thread depth" for every thread I use (i.e. maybe all of the threads could use the same Pool, avoiding nesting)?
Thanks a lot in advance.
What you need is a workflow management system (WFMS) that manages
task concurrency
task dependency
task nesting
among other things.
From a very high level view, a WFMS sits on top of a task pool like celery, and submits the tasks which are ready to execute to the pool. It is also responsible for opening up a nest and submitting the tasks in the nest accordingly.
I've developed a system to do just that. It's called pomsets. Try it out, and feel free to send me any questions.
I using a multiprocessed deamons based on Twisted with forking and Gearman jobs query normally.
Try to look at Gearman.

Categories