Python Multiprocessing: Crash in subprocess? - python

What happens when a python script opens subprocesses and one process crashes?
https://stackoverflow.com/a/18216437/311901
Will the main process crash?
Will the other subprocesses crash?
Is there a signal or other event that's propagated?

When using multiprocessing.Pool, if one of the subprocesses in the pool crashes, you will not be notified at all, and a new process will immediately be started to take its place:
>>> import multiprocessing
>>> p = multiprocessing.Pool()
>>> p._processes
4
>>> p._pool
[<Process(PoolWorker-1, started daemon)>, <Process(PoolWorker-2, started daemon)>, <Process(PoolWorker-3, started daemon)>, <Process(PoolWorker-4, started daemon)>]
>>> [proc.pid for proc in p._pool]
[30760, 30761, 30762, 30763]
Then in another window:
dan@dantop:~$ kill 30763
Back to the pool:
>>> [proc.pid for proc in p._pool]
[30760, 30761, 30762, 30767] # New pid for the last process
You can continue using the pool as if nothing happened. However, any work item that the killed child process was running at the time it died will not be completed or restarted. If you were running a blocking map or apply call that was relying on that work item to complete, it will likely hang indefinitely. There is a bug filed for this, but the issue was only fixed in concurrent.futures.ProcessPoolExecutor, rather than in multiprocessing.Pool. Starting with Python 3.3, ProcessPoolExecutor will raise a BrokenProcessPool exception if a child process is killed, and disallow any further use of the pool. Sadly, multiprocessing didn't get this enhancement. For now, if you want to guard against a pool call blocking forever due to a sub-process crashing, you have to use ugly workarounds.
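To illustrate the ProcessPoolExecutor behaviour mentioned above, here is a minimal sketch. It assumes a POSIX system (the "fork" start method and SIGKILL are not available on Windows), and the names crash and crash_is_reported are made up for this example:

```python
import multiprocessing
import os
import signal
from concurrent.futures import ProcessPoolExecutor
from concurrent.futures.process import BrokenProcessPool


def crash():
    """Simulate a hard crash: the worker kills itself without raising anything."""
    os.kill(os.getpid(), signal.SIGKILL)


def crash_is_reported():
    """Return True if the parent gets BrokenProcessPool instead of hanging."""
    # Assumption: POSIX only, because of the explicit "fork" context and SIGKILL.
    ctx = multiprocessing.get_context("fork")
    with ProcessPoolExecutor(max_workers=2, mp_context=ctx) as executor:
        future = executor.submit(crash)
        try:
            # The timeout is only a safety net; the exception should arrive first.
            future.result(timeout=30)
        except BrokenProcessPool:
            return True
    return False


if __name__ == "__main__":
    print(crash_is_reported())
```

With multiprocessing.Pool the equivalent blocking call would be expected to wait forever, which is why a timeout on result.get() is one of the usual workarounds.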
Note: The above only applies to a process in a pool actually crashing, meaning the process completely dies. If a sub-process raises an exception, that will be propagated up the parent process when you try to retrieve the result of the work item:
>>> def f(): raise Exception("Oh no")
...
>>> pool = multiprocessing.Pool()
>>> result = pool.apply_async(f)
>>> result.get()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python2.7/multiprocessing/pool.py", line 528, in get
raise self._value
Exception: Oh no
When using a multiprocessing.Process directly, the process object will show that the process has exited with a non-zero exit code if it crashes:
>>> def f(): time.sleep(30)
...
>>> p = multiprocessing.Process(target=f)
>>> p.start()
>>> p.join() # Kill the process while this is blocking, and join immediately ends
>>> p.exitcode
-15
The behavior is similar if an exception is raised:
from multiprocessing import Process
def f(x):
    raise Exception("Oh no")
if __name__ == '__main__':
    p = Process(target=f)
    p.start()
    p.join()
    print(p.exitcode)
    print("done")
Output:
Process Process-1:
Traceback (most recent call last):
File "/usr/lib/python3.2/multiprocessing/process.py", line 267, in _bootstrap
self.run()
File "/usr/lib/python3.2/multiprocessing/process.py", line 116, in run
self._target(*self._args, **self._kwargs)
TypeError: f() takes exactly 1 argument (0 given)
1
done
As you can see, the traceback from the child is printed (here a TypeError, because f was started without its argument x), but it doesn't affect execution of the main process, which is able to show that the exitcode of the child was 1.
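If you want to act on the exit code rather than just print it, a small helper along these lines may be useful. This is only a sketch; describe_exitcode is a hypothetical name, and it relies on the documented convention that a negative exitcode -N means the child was terminated by signal N:

```python
import signal


def describe_exitcode(exitcode):
    """Translate a multiprocessing.Process.exitcode value into a short message."""
    if exitcode is None:
        return "still running"
    if exitcode == 0:
        return "exited normally"
    if exitcode < 0:
        # A negative exit code -N means the child was terminated by signal N.
        try:
            name = signal.Signals(-exitcode).name
        except ValueError:
            name = str(-exitcode)
        return "killed by signal " + name
    return "exited with error code %d" % exitcode


if __name__ == "__main__":
    for code in (None, 0, 1, -15):
        print(code, "->", describe_exitcode(code))
```

You would call it as describe_exitcode(p.exitcode) after p.join() returns.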

Related

Python Multiprocessing weird behavior when NOT USING time.sleep()

This is the exact code from Python.org. If you comment out the time.sleep(), it crashes with a long exception traceback. I would like to know why.
And, I do understand why Python.org included it in their example code. But artificially creating "working time" via time.sleep() shouldn't break the code when it's removed. It seems to me that the time.sleep() is affording some sort of spin up time. But as I said, I'd like to know from people who might actually know the answer.
A user comment asked me to fill in more details on the environment this was happening in. It was on macOS Big Sur 11.4, using a clean install of Python 3.9.5 from Python.org (no Homebrew, etc.), run from within PyCharm inside a venv. I hope that helps add to understanding the situation.
import time
import random
from multiprocessing import Process, Queue, current_process, freeze_support
#
# Function run by worker processes
#
def worker(input, output):
    for func, args in iter(input.get, 'STOP'):
        result = calculate(func, args)
        output.put(result)
#
# Function used to calculate result
#
def calculate(func, args):
    result = func(*args)
    return '%s says that %s%s = %s' % \
        (current_process().name, func.__name__, args, result)
#
# Functions referenced by tasks
#
def mul(a, b):
    #time.sleep(0.5*random.random()) # <--- time.sleep() commented out
    return a * b
def plus(a, b):
    #time.sleep(0.5*random.random()) # <--- time.sleep() commented out
    return a + b
#
#
#
def test():
    NUMBER_OF_PROCESSES = 4
    TASKS1 = [(mul, (i, 7)) for i in range(20)]
    TASKS2 = [(plus, (i, 8)) for i in range(10)]
    # Create queues
    task_queue = Queue()
    done_queue = Queue()
    # Submit tasks
    for task in TASKS1:
        task_queue.put(task)
    # Start worker processes
    for i in range(NUMBER_OF_PROCESSES):
        Process(target=worker, args=(task_queue, done_queue)).start()
    # Get and print results
    print('Unordered results:')
    for i in range(len(TASKS1)):
        print('\t', done_queue.get())
    # Add more tasks using `put()`
    for task in TASKS2:
        task_queue.put(task)
    # Get and print some more results
    for i in range(len(TASKS2)):
        print('\t', done_queue.get())
    # Tell child processes to stop
    for i in range(NUMBER_OF_PROCESSES):
        task_queue.put('STOP')
if __name__ == '__main__':
    freeze_support()
    test()
This is the traceback if it helps anyone:
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/multiprocessing/spawn.py", line 116, in spawn_main
exitcode = _main(fd, parent_sentinel)
File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/multiprocessing/spawn.py", line 126, in _main
self = reduction.pickle.load(from_parent)
File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/multiprocessing/synchronize.py", line 110, in __setstate__
Traceback (most recent call last):
File "<string>", line 1, in <module>
self._semlock = _multiprocessing.SemLock._rebuild(*state)
FileNotFoundError: [Errno 2] No such file or directory
File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/multiprocessing/spawn.py", line 116, in spawn_main
exitcode = _main(fd, parent_sentinel)
File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/multiprocessing/spawn.py", line 126, in _main
self = reduction.pickle.load(from_parent)
File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/multiprocessing/synchronize.py", line 110, in __setstate__
self._semlock = _multiprocessing.SemLock._rebuild(*state)
FileNotFoundError: [Errno 2] No such file or directory
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/multiprocessing/spawn.py", line 116, in spawn_main
exitcode = _main(fd, parent_sentinel)
File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/multiprocessing/spawn.py", line 126, in _main
self = reduction.pickle.load(from_parent)
File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/multiprocessing/synchronize.py", line 110, in __setstate__
self._semlock = _multiprocessing.SemLock._rebuild(*state)
FileNotFoundError: [Errno 2] No such file or directory
Here's a technical breakdown.
This is a race condition where the main process finishes, and exits before some of the children have a chance to fully start up. As long as a child fully starts, there are mechanisms in-place to ensure they shut down smoothly, but there's an unsafe in-between time. Race conditions can be very system dependent, as it is up to the OS and the hardware to schedule the different threads, as well as how fast they chew through their work.
Here's what's going on when a process is started... Early on in the creation of a child process, it registers itself in the main process so that it will be either joined or terminated when the main process exits depending on if it's daemonic (multiprocessing.util._exit_function). This exit function was registered with the atexit module on import of multiprocessing.
Also during creation of the child process, a pair of Pipes are opened which will be used to pass the Process object to the child interpreter (which includes what function you want to execute and its arguments). This requires 2 file handles to be shared with the child, and these file handles are also registered to be closed using atexit.
The problem arises when the main process exits before the child has a chance to read all the necessary data from the pipe (un-pickling the Process object) during the startup phase. If the main process first closes the pipe, then waits for the child to join, then we have a problem. The child will continue spinning up the new python instance until it gets to the point when it needs to read in the Process object containing your function and arguments it should run. It will try to read from a pipe which has already been closed, which is an error.
If all the children get a chance to fully start-up you won't see this ever, because that pipe is only used for startup. Putting in a delay which will in some way guarantee that all the children have some time to fully start up is what solves this problem. Manually calling join will provide this delay by waiting for the children before any of the atexit handlers are called. Additionally, any amount of processing delay means that q.get in the main thread will have to wait a while which also gives the children time to start up before closing. I was never able to reproduce the problem you encountered, but presumably you saw the output from all the TASKS (" Process-1 says that mul(19, 7) = 133 "). Only one or two of the child processes ended up doing all the work, allowing the main process to get all the results, and finish up before the other children finished startup.
EDIT:
The error is unambiguous as to what's happening, but I still can't figure how it happens... As far as I can tell, the file handles should be closed when calling _run_finalizers() in _exit_function after joining or terminating all active_children rather than before via _run_finalizers(0)
EDIT2:
_run_finalizers will seemingly actually never call Popen.finalizer to close the pipes, because exitpriority is None. I'm very confused as to what's going on here, and I think I need to sleep on it...
Apparently @user2357112supportsMonica was on the right track. It totally solves the problem if you join the processes before exiting the program. Also, @Aaron's answer has the deep knowledge as to why this fixes the issue!
I added the following bits of code as was suggested and it totally fixed the need to have time.sleep() in there.
First I gathered all the processes when they were started:
processes: list[Process] = []
# Start worker processes
for i in range(NUMBER_OF_PROCESSES):
    p = Process(target=worker, args=(task_queue, done_queue))
    p.start()
    processes.append(p)
Then at the end of the program I joined them as follows:
# Join the processes
for p in processes:
    p.join()
Totally solved the issues. Thanks for the advice.

Why do main- and child-process not exit after exception in main?

#coding=utf8
from multiprocessing import Process
from time import sleep
import os
def foo():
    print("foo")
    for i in range(11000):
        sleep(1)
        print("{}: {}".format(os.getpid(), i))
if __name__ == "__main__":
    p = Process(target=foo)
    #p.daemon = True
    p.start()
    print("parent: {}".format(os.getpid()))
    sleep(20)
    raise Exception("invalid")
An exception is raised in the main-process, but child- and main-process keep running. Why?
When the MainProcess is shutting down, non-daemonic child processes are simply joined. This happens in _exit_function(), which is registered with atexit.register(_exit_function). You can look it up in multiprocessing/util.py if you are curious.
You can also insert multiprocessing.log_to_stderr(logging.DEBUG) before you start your process to see the log messages:
parent: 9670
foo
[INFO/Process-1] child process calling self.run()
9675: 0
9675: 1
...
9675: 18
Traceback (most recent call last):
File "/home/...", line 26, in <module>
raise Exception("invalid")
Exception: invalid
[INFO/MainProcess] process shutting down
[DEBUG/MainProcess] running all "atexit" finalizers with priority >= 0
[INFO/MainProcess] calling join() for process Process-1
9675: 19
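If you would rather have the child stop when the parent fails, one option besides p.daemon = True is to terminate it explicitly in a finally block. A minimal sketch, assuming it is acceptable to stop the child with terminate(); parent_work is a hypothetical name and foo only stands in for the real child function:

```python
import multiprocessing
import time


def foo():
    """Hypothetical long-running child, standing in for foo() in the question."""
    while True:
        time.sleep(1)


def parent_work(child):
    """Run the parent's work and always stop the child afterwards."""
    try:
        raise RuntimeError("invalid")  # stands in for the failing parent code
    finally:
        # Without this, the atexit handler would join() the child and
        # block for as long as the child keeps running.
        child.terminate()
        child.join()


if __name__ == "__main__":
    p = multiprocessing.Process(target=foo)
    p.start()
    try:
        parent_work(p)
    except RuntimeError as exc:
        print("parent failed:", exc, "| child exit code:", p.exitcode)
```

The exception still propagates out of parent_work, but by then the child has already been stopped, so the interpreter has nothing left to join at exit.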

Do not print stack-trace using Pool python

I use a Pool to run several commands simultaneously. I would like to avoid printing the stack trace when the user interrupts the script.
Here is my script structure:
def worker(some_element):
    try:
        cmd_res = Popen(SOME_COMMAND, stdout=PIPE, stderr=PIPE).communicate()
    except (KeyboardInterrupt, SystemExit):
        pass
    except Exception, e:
        print str(e)
        return
    #deal with cmd_res...
pool = Pool()
try:
    pool.map(worker, some_list, chunksize = 1)
except KeyboardInterrupt:
    pool.terminate()
    print 'bye!'
By calling pool.terminate() when KeyboardInterrupt is raised, I expected the stack trace not to be printed, but it doesn't work; sometimes I get something like:
^CProcess PoolWorker-6:
Traceback (most recent call last):
File "/usr/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
self.run()
File "/usr/lib/python2.7/multiprocessing/process.py", line 114, in run
self._target(*self._args, **self._kwargs)
File "/usr/lib/python2.7/multiprocessing/pool.py", line 102, in worker
task = get()
File "/usr/lib/python2.7/multiprocessing/queues.py", line 374, in get
racquire()
KeyboardInterrupt
Process PoolWorker-1:
Process PoolWorker-2:
Traceback (most recent call last):
File "/usr/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
Traceback (most recent call last):
...
bye!
Do you know how I could hide this?
Thanks.
In your case you don't even need pool processes or threads. And then it gets easier to silence KeyboardInterrupts with try/except.
Pool processes are useful when your Python code does CPU-consuming calculations that can profit from parallelization.
Threads are useful when your Python code does complex blocking I/O that can run in parallel. You just want to execute multiple programs in parallel and wait for the results. When you use Pool you create processes that do nothing other than starting other processes and waiting for them to terminate.
The simplest solution is to create all of the processes in parallel and then to call .communicate() on each of them:
try:
    processes = []
    # Start all processes at once
    for element in some_list:
        processes.append(Popen(SOME_COMMAND, stdout=PIPE, stderr=PIPE))
    # Fetch their results sequentially
    for process in processes:
        cmd_res = process.communicate()
        # Process your result here
except KeyboardInterrupt:
    for process in processes:
        try:
            process.terminate()
        except OSError:
            pass
This works when the output on STDOUT and STDERR isn't too big. Otherwise, a process other than the one communicate() is currently waiting on may fill its pipe buffer (typically somewhere between 4 KiB and 64 KiB, depending on the OS); the OS then suspends that process until communicate() is called on it. In that case you need a more sophisticated solution:
Asynchronous I/O
Since Python 3.4 you can use the asyncio module for single-threaded concurrency (the async/await form shown here needs Python 3.7 or later):
import asyncio
from asyncio.subprocess import PIPE
async def worker(some_element):
    process = await asyncio.create_subprocess_exec(*SOME_COMMAND, stdout=PIPE)
    try:
        cmd_res = await process.communicate()
    except KeyboardInterrupt:
        process.terminate()
        return
    try:
        pass # Process your result here
    except KeyboardInterrupt:
        return
async def main():
    # Start all workers and run until everything is complete
    await asyncio.gather(*(worker(element) for element in some_list))
asyncio.run(main())
You should be able to limit the number of concurrent processes using e.g. asyncio.Semaphore if you need to.
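A minimal sketch of that idea, assuming Python 3.7 or later. The names run_command and run_all are made up for this example, and the demo commands simply start the current interpreter to print a number:

```python
import asyncio
import sys
from asyncio.subprocess import PIPE


async def run_command(semaphore, command):
    """Run one command, waiting for a free slot first."""
    async with semaphore:
        process = await asyncio.create_subprocess_exec(*command, stdout=PIPE)
        stdout, _ = await process.communicate()
        return stdout


async def run_all(commands, limit=2):
    """Run every command with at most `limit` child processes alive at once."""
    semaphore = asyncio.Semaphore(limit)
    return await asyncio.gather(*(run_command(semaphore, c) for c in commands))


if __name__ == "__main__":
    # Hypothetical commands: each child just prints its own index.
    demo = [[sys.executable, "-c", "print(%d)" % i] for i in range(4)]
    print(asyncio.run(run_all(demo)))
```

asyncio.gather returns the results in the order the commands were given, regardless of which child finishes first.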
When you instantiate Pool, it creates cpu_count() (on my machine, 8) Python processes that wait for work to pass to your worker(). Note that they don't run it yet; they are waiting for a task. While they are not executing your code, your try/except is not in effect, so they don't handle KeyboardInterrupt either. You can see what they are doing if you specify Pool(processes=2) and send the interrupt. You can play with the number of processes to work around it, but I don't think you can handle it in all cases.
Personally, I don't recommend using multiprocessing.Pool for the task of launching other processes. It's overkill to launch several Python processes for that. A much more efficient way is to use threads (see threading.Thread, Queue.Queue). But in this case you need to implement the thread pool yourself, which is not so hard.
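If you do want to keep the Pool, a commonly used pattern is to make the workers ignore SIGINT through the Pool initializer, so that only the parent sees Ctrl+C. This is a sketch in Python 3, not something tested against the script in the question; init_worker and square are hypothetical names:

```python
import signal
from multiprocessing import Pool


def init_worker():
    """Runs once in every pool worker: ignore Ctrl+C there."""
    signal.signal(signal.SIGINT, signal.SIG_IGN)


def square(x):
    """Hypothetical task standing in for the real worker function."""
    return x * x


if __name__ == "__main__":
    pool = Pool(processes=2, initializer=init_worker)
    try:
        print(pool.map(square, range(5)))
    except KeyboardInterrupt:
        # Only the parent sees the interrupt, so only this message is printed.
        print("bye!")
    finally:
        pool.terminate()
        pool.join()
```

Because idle workers no longer raise KeyboardInterrupt while waiting on the task queue, their tracebacks should not appear; the parent remains responsible for terminating the pool.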
Your child process will receive both the KeyboardInterrupt exception and the exception from the terminate().
Because the child process receives the KeyboardInterrupt, a simple join() in the parent -- rather than the terminate() -- should suffice.
As y0prst suggested, I used threading.Thread instead of Pool.
Here is a working example, which rasterizes a set of vectors with ImageMagick (I know I can use mogrify for this, it's just an example).
#!/usr/bin/python
from os.path import abspath
from os import listdir
from threading import Thread
from subprocess import Popen, PIPE
RASTERISE_CALL = "magick %s %s"
INPUT_DIR = './tests_in/'
def get_vectors(dir):
    '''Return a list of svg files inside the `dir` directory'''
    return [abspath(dir+f).replace(' ', '\\ ') for f in listdir(dir) if f.endswith('.svg')]
class ImageMagickError(Exception):
    '''Custom error for failed ImageMagick calls'''
    def __init__(self, value): self.value = value
    def __str__(self): return repr(self.value)
class Rasterise(Thread):
    '''Rasterizes a given vector.'''
    def __init__(self, svg):
        self.stdout = None
        self.stderr = None
        Thread.__init__(self)
        self.svg = svg
    def run(self):
        p = Popen((RASTERISE_CALL % (self.svg, self.svg + '.png')).split(), shell=False, stdout=PIPE, stderr=PIPE)
        self.stdout, self.stderr = p.communicate()
        # Compare by value; `is not ''` tests identity and is unreliable
        if self.stderr != '':
            raise ImageMagickError, 'can not rasterize ' + self.svg + ': ' + self.stderr
threads = []
def join_threads():
    '''Joins all the threads.'''
    for t in threads:
        try:
            t.join()
        except(KeyboardInterrupt, SystemExit):
            pass
#Rasterizes all the vectors in INPUT_DIR.
for f in get_vectors(INPUT_DIR):
    t = Rasterise(f)
    try:
        print 'rasterize ' + f
        t.start()
    except (KeyboardInterrupt, SystemExit):
        join_threads()
    except ImageMagickError:
        print 'Oops, IM can not rasterize ' + f + '.'
        continue
    threads.append(t)
# wait for all threads to end
join_threads()
print ('Finished!')
Please tell me if you think there is a more Pythonic way to do this, or if it can be optimised; I will edit my answer.

How to clear a multiprocessing.Queue?

I just want to know how to clear a multiprocessing.Queue like a queue.Queue in Python:
>>> import queue
>>> queue.Queue().clear()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: 'Queue' object has no attribute 'clear'
>>> queue.Queue().queue.clear()
>>> import multiprocessing
>>> multiprocessing.Queue().clear()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: 'Queue' object has no attribute 'clear'
>>> multiprocessing.Queue().queue.clear()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: 'Queue' object has no attribute 'queue'
So, I took a look at the Queue class, and you may want to try this code:
while not some_queue.empty():
    some_queue.get() # as docs say: Remove and return an item from the queue.
Ask for forgiveness rather than permission; just try to empty the queue until you get the Empty exception, then ignore that exception:
from queue import Empty  # the module is named Queue in Python 2
def clear(q):
    try:
        while True:
            q.get_nowait()
    except Empty:
        pass
Better yet: is a built-in class missing the method you want? Subclass the built-in class, and add the method you think should be there!
from queue import Queue, Empty  # the module is named Queue in Python 2
class ClearableQueue(Queue):
    def clear(self):
        try:
            while True:
                self.get_nowait()
        except Empty:
            pass
Your ClearableQueue class inherits all the goodness (and behavior) of the built-in Queue class, and has the method you now want.
Simply use q = ClearableQueue() in all places where you used q = Queue(), and call q.clear() when you'd like.
There is no direct way of clearing a multiprocessing.Queue.
I believe the closest you have is close(), but that simply states that no more data will be pushed to that queue, and will close it when all data has been flushed to the pipe.
pipe(7) Linux manual page specifies that a pipe has a limited capacity (65,536 bytes by default) and that writing to a full pipe blocks until enough data has been read from the pipe to allow the write to complete:
I/O on pipes and FIFOs
[…]
If a process attempts to read from an empty pipe, then read(2) will block until data is available. If a process attempts to write to a full pipe (see below), then write(2) blocks until sufficient data has been read from the pipe to allow the write to complete. Nonblocking I/O is possible by using the fcntl(2) F_SETFL operation to enable the O_NONBLOCK open file status flag.
[…]
Pipe capacity
A pipe has a limited capacity. If the pipe is full, then a write(2) will block or fail, depending on whether the O_NONBLOCK flag is set (see below). Different implementations have different limits for the pipe capacity. Applications should not rely on a particular capacity: an application should be designed so that a reading process consumes data as soon as it is available, so that a writing process does not remain blocked.
In Linux versions before 2.6.11, the capacity of a pipe was the same as the system page size (e.g., 4096 bytes on i386). Since Linux 2.6.11, the pipe capacity is 16 pages (i.e., 65,536 bytes in a system with a page size of 4096 bytes). Since Linux 2.6.35, the default pipe capacity is 16 pages, but the capacity can be queried and set using the fcntl(2) F_GETPIPE_SZ and F_SETPIPE_SZ operations. See fcntl(2) for more information.
That is why the multiprocessing Python library documentation recommends to make a consumer process empty each Queue object with Queue.get calls before its feeder threads are joined in producer processes (implicitly with garbage collection or explicitly with Queue.join_thread calls):
Joining processes that use queues
Bear in mind that a process that has put items in a queue will wait before terminating until all the buffered items are fed by the “feeder” thread to the underlying pipe. (The child process can call the Queue.cancel_join_thread method of the queue to avoid this behaviour.)
This means that whenever you use a queue you need to make sure that all items which have been put on the queue will eventually be removed before the process is joined. Otherwise you cannot be sure that processes which have put items on the queue will terminate. Remember also that non-daemonic processes will be joined automatically.
An example which will deadlock is the following:
from multiprocessing import Process, Queue
def f(q):
    q.put('X' * 1000000)
if __name__ == '__main__':
    queue = Queue()
    p = Process(target=f, args=(queue,))
    p.start()
    p.join() # this deadlocks
    obj = queue.get()
A fix here would be to swap the last two lines (or simply remove the p.join() line).
In some applications, a consumer process may not know how many items have been added to a queue by producer processes. In this situation, a reliable way to empty the queue is to make each producer process add a sentinel item when it is done and make the consumer process remove items (regular and sentinel items) until it has removed as many sentinel items as there are producer processes:
import multiprocessing
def f(q, e):
    while True:
        q.put('X' * 1000000) # block the feeder thread (size > pipe capacity)
        if e.is_set():
            break
    q.put(None) # add a sentinel item
if __name__ == '__main__':
    start_count = 5
    stop_count = 0
    q = multiprocessing.Queue()
    e = multiprocessing.Event()
    for _ in range(start_count):
        multiprocessing.Process(target=f, args=(q, e)).start()
    e.set() # stop producer processes
    while stop_count < start_count:
        if q.get() is None: # empty the queue
            stop_count += 1 # count the sentinel items removed
This solution uses blocking Queue.get calls to empty the queue. This guarantees that all items have been added to the queue and removed.
@DanH’s solution uses non-blocking Queue.get_nowait calls to empty the queue. The problem with that solution is that producer processes can still add items to the queue after the consumer process has emptied the queue, which will create a deadlock (the consumer process will wait for the producer processes to terminate, each producer process will wait for its feeder thread to terminate, the feeder thread of each producer process will wait for the consumer process to remove the items added to the queue):
import multiprocessing.queues
def f(q):
    q.put('X' * 1000000) # block the feeder thread (size > pipe capacity)
if __name__ == '__main__':
    q = multiprocessing.Queue()
    p = multiprocessing.Process(target=f, args=(q,))
    p.start()
    try:
        while True:
            q.get_nowait()
    except multiprocessing.queues.Empty:
        pass # reached before the producer process adds the item to the queue
    p.join() # deadlock
Or newly created producer processes can fail to deserialise the Process object of the consumer process if the synchronisation resources of the queue that comes with it as an attribute are garbage collected before, raising a FileNotFoundError:
import multiprocessing.queues
def f(q):
    q.put('X' * 1000000)
if __name__ == '__main__':
    q = multiprocessing.Queue()
    multiprocessing.Process(target=f, args=(q,)).start()
    try:
        while True:
            q.get_nowait()
    except multiprocessing.queues.Empty:
        pass # reached before the producer process deserialises the Process
Standard error:
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/local/Cellar/python@3.9/3.9.12/Frameworks/Python.framework/Versions/3.9/lib/python3.9/multiprocessing/spawn.py", line 116, in spawn_main
    exitcode = _main(fd, parent_sentinel)
  File "/usr/local/Cellar/python@3.9/3.9.12/Frameworks/Python.framework/Versions/3.9/lib/python3.9/multiprocessing/spawn.py", line 126, in _main
    self = reduction.pickle.load(from_parent)
  File "/usr/local/Cellar/python@3.9/3.9.12/Frameworks/Python.framework/Versions/3.9/lib/python3.9/multiprocessing/synchronize.py", line 110, in __setstate__
    self._semlock = _multiprocessing.SemLock._rebuild(*state)
FileNotFoundError: [Errno 2] No such file or directory
I am a newbie, so don't be angry with me, but
Why not redefine the .Queue() variable?
import multiprocessing as mp
q = mp.Queue()
chunk = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
for i in chunk:
    q.put(i)
print(q.empty())
q = mp.Queue()
print(q.empty())
My output:
>>False
>>True
I'm just self-educating right now, so if I'm wrong, feel free to point it out

Python: multiprocessing.map: If one process raises an exception, why aren't other processes' finally blocks called?

My understanding is that finally clauses must *always* be executed if the try has been entered.
import random
from multiprocessing import Pool
from time import sleep
def Process(x):
    try:
        print x
        sleep(random.random())
        raise Exception('Exception: ' + x)
    finally:
        print 'Finally: ' + x
Pool(3).map(Process, ['1','2','3'])
Expected output is that for each of x which is printed on its own by line 8, there must be an occurrence of 'Finally x'.
Example output:
$ python bug.py
1
2
3
Finally: 2
Traceback (most recent call last):
File "bug.py", line 14, in <module>
Pool(3).map(Process, ['1','2','3'])
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/multiprocessing/pool.py", line 225, in map
return self.map_async(func, iterable, chunksize).get()
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/multiprocessing/pool.py", line 522, in get
raise self._value
Exception: Exception: 2
It seems that an exception terminating one process terminates the parent and sibling processes, even though there is further work required to be done in other processes.
Why am I wrong? Why is this correct? If this is correct, how should one safely clean up resources in multiprocess Python?
Short answer: SIGTERM trumps finally.
Long answer: Turn on logging with mp.log_to_stderr():
import random
import multiprocessing as mp
import time
import logging
logger=mp.log_to_stderr(logging.DEBUG)
def Process(x):
    try:
        logger.info(x)
        time.sleep(random.random())
        raise Exception('Exception: ' + x)
    finally:
        logger.info('Finally: ' + x)
result=mp.Pool(3).map(Process, ['1','2','3'])
The logging output includes:
[DEBUG/MainProcess] terminating workers
Which corresponds to this code in multiprocessing.pool._terminate_pool:
if pool and hasattr(pool[0], 'terminate'):
    debug('terminating workers')
    for p in pool:
        p.terminate()
Each p in pool is a multiprocessing.Process, and calling terminate (at least on non-Windows machines) sends SIGTERM:
from multiprocessing/forking.py:
class Popen(object):
    def terminate(self):
        ...
        try:
            os.kill(self.pid, signal.SIGTERM)
        except OSError, e:
            if self.wait(timeout=0.1) is None:
                raise
So it comes down to what happens when a Python process in a try suite is sent a SIGTERM.
Consider the following example (test.py):
import time
def worker():
    try:
        time.sleep(100)
    finally:
        print('enter finally')
        time.sleep(2)
        print('exit finally')
worker()
If you run it, then send it a SIGTERM, then the process ends immediately, without entering the finally suite, as evidenced by no output, and no delay.
In one terminal:
% test.py
In second terminal:
% pkill -TERM -f "test.py"
Result in first terminal:
Terminated
Compare that with what happens when the process is sent a SIGINT (C-c):
In second terminal:
% pkill -INT -f "test.py"
Result in first terminal:
enter finally
exit finally
Traceback (most recent call last):
File "/home/unutbu/pybin/test.py", line 14, in <module>
worker()
File "/home/unutbu/pybin/test.py", line 8, in worker
time.sleep(100)
KeyboardInterrupt
Conclusion: SIGTERM trumps finally.
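That is the default behaviour. If you control the worker code, one possible workaround is to install a SIGTERM handler that raises SystemExit, so the signal turns into an ordinary Python exception and the finally suite gets a chance to run. A minimal sketch, assuming Python 3.8+ (for signal.raise_signal) and that it runs in the main thread; the names sigterm_to_exit and worker are made up for this example, and the signal is raised from inside the process only to keep the demo self-contained:

```python
import signal


def sigterm_to_exit(signum, frame):
    """Turn SIGTERM into a normal Python exception so finally blocks run."""
    raise SystemExit(128 + signum)


def worker(log):
    """Record which steps ran; `log` is a plain list used for the demo."""
    previous = signal.signal(signal.SIGTERM, sigterm_to_exit)
    try:
        log.append("working")
        signal.raise_signal(signal.SIGTERM)  # stands in for an external `kill`
        log.append("not reached")
    finally:
        log.append("cleanup")
        signal.signal(signal.SIGTERM, previous)  # restore the old handler


if __name__ == "__main__":
    steps = []
    try:
        worker(steps)
    except SystemExit as exc:
        print(steps, "exit code:", exc.code)
```

In a Pool you would have to install such a handler inside each worker (for example from the function passed as initializer), and cleanup still only runs if the worker is executing Python code that can be interrupted.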
The answer from unutbu definitely explains why you get the behavior you observe. However, it should be emphasized that SIGTERM is sent only because of how multiprocessing.pool._terminate_pool is implemented. If you can avoid using Pool, then you can get the behavior you desire. Here is a borrowed example:
from multiprocessing import Process
from time import sleep
import random
def f(x):
    try:
        sleep(random.random()*10)
        raise Exception
    except:
        print "Caught exception in process:", x
        # Make this last longer than the except clause in main.
        sleep(3)
    finally:
        print "Cleaning up process:", x
if __name__ == '__main__':
    processes = []
    for i in range(4):
        p = Process(target=f, args=(i,))
        p.start()
        processes.append(p)
    try:
        for process in processes:
            process.join()
    except:
        print "Caught exception in main."
    finally:
        print "Cleaning up main."
After sending a SIGINT, example output is:
Caught exception in process: 0
^C
Cleaning up process: 0
Caught exception in main.
Cleaning up main.
Caught exception in process: 1
Caught exception in process: 2
Caught exception in process: 3
Cleaning up process: 1
Cleaning up process: 2
Cleaning up process: 3
Note that the finally clause is run for all processes. If you need shared memory, consider using Queue, Pipe, Manager, or some external store like redis or sqlite3.
finally re-raises the original exception unless you return from it. The exception is then raised by Pool.map and kills your entire application. The subprocesses are terminated and you see no other exceptions.
You can add a return to swallow the exception:
def Process(x):
    try:
        print x
        sleep(random.random())
        raise Exception('Exception: ' + x)
    finally:
        print 'Finally: ' + x
        return
Then you should have None in your map result when an exception occurred.
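A return inside finally hides every error, though. An alternative sketch (Python 3) is to catch the exception in a small wrapper and hand it back as a value, so Pool.map never raises and the other workers are left to finish. The names safe_call and process are made up for this example, and process only fails for '2' to keep the output predictable:

```python
import random
from functools import partial
from multiprocessing import Pool
from time import sleep


def safe_call(func, arg):
    """Run func(arg) and report a failure as a value instead of raising."""
    try:
        return ("ok", func(arg))
    except Exception as exc:
        return ("error", str(exc))


def process(x):
    """Hypothetical worker modelled on the one in the question."""
    try:
        sleep(random.random() / 10)
        if x == '2':
            raise Exception('Exception: ' + x)
        return x
    finally:
        print('Finally: ' + x)


if __name__ == '__main__':
    with Pool(3) as pool:
        print(pool.map(partial(safe_call, process), ['1', '2', '3']))
```

The parent can then inspect the ("ok", ...) and ("error", ...) pairs after the map has completed and decide what to do about the failures.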
