I'm using Python's multiprocessing functionality to map a function across some elements. Something along the lines of this:
import multiprocessing

def computeStuff(arguments, globalData, concurrent=True):
    pool = multiprocessing.Pool(initializer=initWorker, initargs=(globalData,))
    results = pool.map(workerFunction, list(enumerate(arguments)))
    return results

def initWorker(globalData):
    workerFunction.globalData = globalData

def workerFunction((index, argument)):
    ...  # computation here
I generally run tests in IPython, on both CPython and PyPy. I have noticed that the spawned processes often don't get killed, so they start accumulating, each using a gig of RAM. This happens when hitting ctrl-k during a computation, which sends multiprocessing into a big frenzy of confusion. But even when letting the computation finish, those processes won't die in PyPy.
According to the documentation, when the pool gets garbage collected, it should call terminate() and kill all the processes. What's happening here? Do I have to explicitly call close()? If yes, is there some sort of context manager that properly manages closing the resources (i.e. processes)?
This is on Mac OS X Yosemite.
PyPy's garbage collection is lazy, so failing to call close means the Pool is cleaned up "sometime", but that might not mean "anytime soon".
Once the Pool is properly closed, the workers exit when they run out of tasks. An easy way to ensure the Pool is closed in pre-3.3 Python is:
import multiprocessing
from contextlib import closing

def computeStuff(arguments, globalData, concurrent=True):
    with closing(multiprocessing.Pool(initializer=initWorker, initargs=(globalData,))) as pool:
        return pool.map(workerFunction, enumerate(arguments))
Note: I also removed the explicit conversion to list (pointless, since map will iterate the enumerate iterator for you), and returned the results directly (no need to assign to a name only to return on the next line).
If you want to ensure immediate termination in the exception case (on pre-3.3 Python), you'd use a try/finally block, or write a simple context manager (which could be reused for other places where you use a Pool):
import multiprocessing
from contextlib import contextmanager

@contextmanager
def terminating(obj):
    try:
        yield obj
    finally:
        obj.terminate()

def computeStuff(arguments, globalData, concurrent=True):
    with terminating(multiprocessing.Pool(initializer=initWorker, initargs=(globalData,))) as pool:
        return pool.map(workerFunction, enumerate(arguments))
The terminating approach is superior in that it guarantees the processes exit immediately. In theory, if you're using threads elsewhere in your main program, the Pool workers might be forked with non-daemon threads, which would keep the processes alive even when the worker task thread exited; terminating hides this by killing the processes forcibly.
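For completeness, a sketch of the try/finally form mentioned above, equivalent to the terminating context manager:

def computeStuff(arguments, globalData, concurrent=True):
    pool = multiprocessing.Pool(initializer=initWorker, initargs=(globalData,))
    try:
        return pool.map(workerFunction, enumerate(arguments))
    finally:
        pool.terminate()  # kill workers immediately, even if map raised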
If your interpreter is Python 3.3 or higher, the terminating approach is built into Pool, so no special wrapper is needed for the with statement: with multiprocessing.Pool(initializer=initWorker, initargs=(globalData,)) as pool: works directly.
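For illustration, on 3.3+ that reads:

def computeStuff(arguments, globalData, concurrent=True):
    # On 3.3+, Pool is its own context manager; __exit__ calls terminate()
    with multiprocessing.Pool(initializer=initWorker, initargs=(globalData,)) as pool:
        return pool.map(workerFunction, enumerate(arguments))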
I am using torch.multiprocessing.Pool to speed up inference for my NN, like this:
import functools
import torch.multiprocessing as mp
from tqdm import tqdm
mp = torch.multiprocessing.get_context('forkserver')

def parallel_predict(predict_func, sequences, args):
    predicted_cluster_ids = []
    pool = mp.Pool(args.num_workers, maxtasksperchild=1)
    out = pool.imap(
        func=functools.partial(predict_func, args=args),
        iterable=sequences,
        chunksize=1)
    for item in tqdm(out, total=len(sequences), ncols=85):
        predicted_cluster_ids.append(item)
    pool.close()
    pool.terminate()
    pool.join()
    return predicted_cluster_ids
Note 1) I am using imap because I want to be able to show a progress bar with tqdm.
Note 2) I tried with both forkserver and spawn but no luck. I cannot use other methods because of how they interact (poorly) with CUDA.
Note 3) I am using maxtasksperchild=1 and chunksize=1 so for each sequence in sequences it spawns a new process.
Note 4) Adding or removing pool.terminate() and pool.join() makes no difference.
Note 5) predict_func is a method of a class I created. I could also pass the whole model to parallel_predict but it does not change anything.
Everything works fine except the fact that after a while I run out of memory on the CPU (while on the GPU everything works as expected). Using htop to monitor memory usage, I notice that for every process I spawn with the pool I get a zombie that uses 0.4% of the memory. They don't get cleared, so they keep using space. Still, parallel_predict does return the correct result and the computation goes on. My script is structured in a way that it does validation multiple times, so the next time parallel_predict is called the zombies add up.
This is what I get in htop (screenshot omitted): one zombie entry for each worker the pool spawned.
Usually these zombies get cleared after Ctrl-C, but in some rare cases I need killall.
Is there some way I can force the Pool to close them?
UPDATE:
I tried to kill the zombie processes using this:
def kill(pool):
    import multiprocessing
    import os
    import signal
    # stop repopulating new children
    pool._state = multiprocessing.pool.TERMINATE
    pool._worker_handler._state = multiprocessing.pool.TERMINATE
    for p in pool._pool:
        os.kill(p.pid, signal.SIGKILL)
    # .is_alive() will reap the dead processes
    while any(p.is_alive() for p in pool._pool):
        pass
    pool.terminate()
But it does not work. It gets stuck at pool.terminate()
UPDATE2:
I tried to use the initializer arg of Pool to catch signals, like this:
import os
import signal

def process_initializer():
    def handler(_signal, frame):
        print('exiting')
        exit(0)
    signal.signal(signal.SIGTERM, handler)

def parallel_predict(predict_func, sequences, args):
    predicted_cluster_ids = []
    with mp.Pool(args.num_workers, initializer=process_initializer, maxtasksperchild=1) as pool:
        out = pool.imap(
            func=functools.partial(predict_func, args=args),
            iterable=sequences,
            chunksize=1)
        for item in tqdm(out, total=len(sequences), ncols=85):
            predicted_cluster_ids.append(item)
        for p in pool._pool:
            os.kill(p.pid, signal.SIGTERM)
        pool.close()
        pool.terminate()
        pool.join()
    return predicted_cluster_ids
but again it does not free memory.
Ok, I have more insights to share with you. Indeed this is not a bug; it is actually the "supposed" behavior of the multiprocessing module in Python (which torch.multiprocessing wraps). What happens is that, although the Pool terminates all the processes, the memory is not released (given back to the OS). This is also stated in the documentation, though in a very confusing way.
In the documentation it says that
Worker processes within a Pool typically live for the complete duration of the Pool’s work queue
but also:
A frequent pattern found in other systems (such as Apache, mod_wsgi, etc) to free resources held by workers is to allow a worker within a pool to complete only a set amount of work before being exiting, being cleaned up and a new process spawned to replace the old one. The maxtasksperchild argument to the Pool exposes this ability to the end user
but the "clean up" does NOT happen.
To make things worse, I found this post in which they recommend using maxtasksperchild=1. This makes the leak worse, because that way the number of zombies grows with the number of data points to be predicted, and since pool.close() does not free their memory, they add up.
This is very bad if you are using multiprocessing for example in validation. For every validation step I was reinitializing the pool but the memory didn't get freed from the previous iteration.
The SOLUTION here is to move pool = mp.Pool(args.num_workers) outside the training loop, so the pool does not get closed and reopened, and therefore always reuses the same processes. NOTE: again, remember to remove maxtasksperchild=1 and chunksize=1.
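A minimal sketch of that restructuring (the training loop, train_one_epoch, and the pool-passing signature are placeholders I'm adding for illustration, not code from my script):

import torch.multiprocessing as mp
mp = torch.multiprocessing.get_context('forkserver')

if __name__ == '__main__':
    # Create the pool ONCE, outside the training loop, so every
    # validation run reuses the same worker processes
    pool = mp.Pool(args.num_workers)  # no maxtasksperchild=1, no chunksize=1
    for epoch in range(num_epochs):  # placeholder training loop
        train_one_epoch(model)  # hypothetical training step
        parallel_predict(pool, predict_func, sequences, args)  # pool passed in
    pool.close()
    pool.join()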
I think this should be included in the best practices page.
BTW, in my opinion this behavior of the multiprocessing library should be considered a bug and should be fixed on the Python side (not the PyTorch side).
I used the multiprocessing lib to create multiple processes to process a list of files (20+ files).
When I run the .py file, I set the pool size to 4, but cmd showed over 10 processes, and most of them have been running for a long time. The files are large and take a long time to process, so I'm not sure whether the processes are hanging or still executing.
So my questions are:
If they're executing, how do I set the number of processes to exactly 4?
If they're hanging, does that mean child processes don't shut down after finishing? Can I make them shut down automatically once finished?
import sys
from multiprocessing import Pool

poolNum = int(sys.argv[1])
pool = Pool(poolNum)
pool.map(processFunc, fileList)
It won't, not until the Pool is close-ed or terminate-ed (IIRC Pools at least at present have a reference cycle involved, so even when the last live reference to the Pool goes away, the Pool is not deterministically collected, even on CPython, which uses reference counting and normally has deterministic behavior).
Since you're using map, your work is definitely done when map returns, so the simplest solution is just to use a with statement for guaranteed termination:
import sys
from multiprocessing import Pool

def main():
    poolNum = int(sys.argv[1])
    with Pool(poolNum) as pool:  # Pool created
        pool.map(processFunc, fileList)
    # terminate has been called, all workers will be killed

# Adding main guard so this code is valid on Windows and anywhere else which
# doesn't use forking for whatever reason
if __name__ == '__main__':
    main()
As I commented, I used a main function with the standard guard against being invoked on import, as Windows (and, on 3.8+, macOS, plus any OS where the script opts into the 'spawn' start method) simulates forking by reimporting the main module (but not naming it __main__); without the guard, you can end up with child processes automatically creating new processes of their own, which is problematic.
Side-note: If you are dispatching a bunch of tasks but not waiting on them immediately (so you don't want to terminate the pool anywhere near when you create it, but want to ensure the workers are cleaned up promptly), you can still use context management to help out. Just use contextlib.closing to close the pool once all the tasks are dispatched; you must dispatch all the tasks before the end of the with block, but you can retrieve the results later, and when all results are computed, the child processes will close. For example:
import sys
from contextlib import closing
from multiprocessing import Pool

def main():
    poolNum = int(sys.argv[1])
    with closing(Pool(poolNum)) as pool:  # Pool created
        results = pool.imap_unordered(processFunc, fileList)
        # close is called when the with block ends, so no new work can be
        # submitted, and when all outstanding tasks complete, the workers
        # will exit immediately/cleanly
    for res in results:
        # Can still retrieve results even after the pool is closed
        print(res)  # placeholder: do something with each result

# Adding main guard so this code is valid on Windows and anywhere else which
# doesn't use forking for whatever reason
if __name__ == '__main__':
    main()
I found this simple code at https://code.google.com/p/pyloadtools/wiki/CodeTutorialMultiThreading
import _thread

def hello(num):
    print('hello from thread %s\n' % num)

_thread.start_new_thread(hello, (0,))
_thread.start_new_thread(hello, (1,))
_thread.start_new_thread(hello, (2,))
But when I run this, it works in IDLE but not in Eclipse, which uses PyDev. Any idea how to fix it?
Note: I think the main program terminates before the threads run; they don't get enough time to run, I guess. How do I fix it? Maybe I should use join?
Quoting the Caveats section of the _thread documentation:
When the main thread exits, it is system defined whether the other threads survive. On most systems, they are killed without executing try ... finally clauses or executing object destructors.
When the main thread exits, it does not do any of its usual cleanup (except that try ... finally clauses are honored), and the standard I/O files are not flushed.
There are two possibilities here.
The main thread starts three threads but exits before they finish executing. So the standard I/O files are not flushed, since they are buffered by default.
Or, the main thread dies, and as per the first bullet point quoted, all the child threads are killed in action.
Either way, you need to make sure the main thread doesn't die before the children complete.
But when you run from IDLE, the main thread still exists, so the I/O buffers are flushed when the threads actually complete. That is why it works in IDLE but not in Eclipse.
To make sure that the main thread exits only after all the threads complete, you can make it wait for the child threads using one of the following:
1. Semaphore
You can use Semaphore, like this
import _thread
import threading

def hello(num):
    print('hello from thread %s' % num)
    # Release the semaphore when the thread is actually done
    sem.release()

def create_thread(value):
    # Acquire the semaphore when the thread is created
    sem.acquire()
    _thread.start_new_thread(hello, (value,))

# Counting semaphore. A maximum of three threads can acquire it;
# the next acquire call has to wait until somebody releases.
sem = threading.Semaphore(3)

for i in range(3):
    create_thread(i)

# We acquire the semaphore three times again, because each thread
# releases it when it completes. Only once we have acquired it thrice
# can we be sure that all the threads have completed.
for i in range(3):
    sem.acquire()
2. Lock Objects
Alternatively, you can use _thread lock objects, like this:
import _thread

locks = []

def hello(num, lockobject):
    print('hello from thread %s' % num)
    # Release the lock as we are done here
    lockobject.release()

def create_thread(value):
    # Create a lock and acquire it
    a_lock = _thread.allocate_lock()
    a_lock.acquire()
    # Store it in the global locks list
    locks.append(a_lock)
    # Pass it to the newly created thread, which can release the lock once done
    _thread.start_new_thread(hello, (value, a_lock))

for i in range(3):
    create_thread(i)

# Acquire all the locks; this only succeeds once every thread has
# released its lock, i.e. once all the threads are done
all(lock.acquire() for lock in locks)
Now you will see that the program always prints the 'hello from' messages.
Note: As the documentation says, _thread is a low-level threading API. So it is better to use a higher-level module like threading, where you can simply wait for all the threads to exit with the join method.
From https://docs.python.org/3/library/_thread.html#module-_thread
The threading module provides an easier to use and higher-level threading API built on top of this module.
The module is optional.
So please use threading, not the optional _thread module.
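For comparison, a minimal sketch of the same program written with threading and join:

import threading

def hello(num):
    print('hello from thread %s' % num)

threads = [threading.Thread(target=hello, args=(i,)) for i in range(3)]
for t in threads:
    t.start()
# join blocks until each thread finishes, so the main thread
# cannot exit before the children are done
for t in threads:
    t.join()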
As almost everyone is aware when they first look at threading in Python, there is the GIL that makes life miserable for people who actually want to do processing in parallel, or at least give it a chance.
I am currently looking at implementing something like the Reactor pattern. Effectively I want to listen for incoming socket connections on one thread-like construct and, when someone tries to connect, accept that connection and pass it along to another thread-like construct for processing.
I'm not (yet) sure what kind of load I might be facing. I know there is currently a 2 MB cap set on incoming messages. Theoretically we could get thousands per second (though I don't know whether we've seen anything like that in practice). The amount of time spent processing a message isn't terribly important, though obviously quicker would be better.
I was looking into the Reactor pattern, and developed a small example using the multiprocessing library that (at least in testing) seems to work just fine. However, now/soon we'll have the asyncio library available, which would handle the event loop for me.
Is there anything that could bite me by combining asyncio and multiprocessing?
You should be able to safely combine asyncio and multiprocessing without too much trouble, though you shouldn't be using multiprocessing directly. The cardinal sin of asyncio (and any other event-loop based asynchronous framework) is blocking the event loop. If you try to use multiprocessing directly, any time you block to wait for a child process, you're going to block the event loop. Obviously, this is bad.
The simplest way to avoid this is to use BaseEventLoop.run_in_executor to execute a function in a concurrent.futures.ProcessPoolExecutor. ProcessPoolExecutor is a process pool implemented using multiprocessing.Process, but asyncio has built-in support for executing a function in it without blocking the event loop. Here's a simple example:
import time
import asyncio
from concurrent.futures import ProcessPoolExecutor

def blocking_func(x):
    time.sleep(x)  # Pretend this is expensive calculations
    return x * 5

@asyncio.coroutine
def main():
    # pool = multiprocessing.Pool()
    # out = pool.apply(blocking_func, args=(10,))  # This blocks the event loop.
    executor = ProcessPoolExecutor()
    out = yield from loop.run_in_executor(executor, blocking_func, 10)  # This does not
    print(out)

if __name__ == "__main__":
    loop = asyncio.get_event_loop()
    loop.run_until_complete(main())
For the majority of cases, this function alone is good enough. If you find yourself needing other constructs from multiprocessing, like Queue, Event, Manager, etc., there is a third-party library called aioprocessing (full disclosure: I wrote it) that provides asyncio-compatible versions of all the multiprocessing data structures. Here's an example demoing that:
import time
import asyncio
import aioprocessing
import multiprocessing

def func(queue, event, lock, items):
    with lock:
        event.set()
        for item in items:
            time.sleep(3)
            queue.put(item + 5)
    queue.close()

@asyncio.coroutine
def example(queue, event, lock):
    l = [1, 2, 3, 4, 5]
    p = aioprocessing.AioProcess(target=func, args=(queue, event, lock, l))
    p.start()
    while True:
        result = yield from queue.coro_get()
        if result is None:
            break
        print("Got result {}".format(result))
    yield from p.coro_join()

@asyncio.coroutine
def example2(queue, event, lock):
    yield from event.coro_wait()
    with (yield from lock):
        yield from queue.coro_put(78)
        yield from queue.coro_put(None)  # Shut down the worker

if __name__ == "__main__":
    loop = asyncio.get_event_loop()
    queue = aioprocessing.AioQueue()
    lock = aioprocessing.AioLock()
    event = aioprocessing.AioEvent()
    tasks = [
        asyncio.async(example(queue, event, lock)),
        asyncio.async(example2(queue, event, lock)),
    ]
    loop.run_until_complete(asyncio.wait(tasks))
    loop.close()
Yes, there are quite a few bits that may (or may not) bite you.
When you run something like asyncio it expects to run on one thread or process. This does not (by itself) work with parallel processing. You somehow have to distribute the work while leaving the IO operations (specifically those on sockets) in a single thread/process.
While your idea to hand off individual connections to a different handler process is nice, it is hard to implement. The first obstacle is that you need a way to pull the connection out of asyncio without closing it. The next obstacle is that you cannot simply send a file descriptor to a different process unless you use platform-specific (probably Linux) code from a C-extension.
Note that the multiprocessing module is known to create a number of threads for communication. Most of the time when you use communication structures (such as Queues), a thread is spawned. Unfortunately those threads are not completely invisible. For instance they can fail to tear down cleanly (when you intend to terminate your program), but depending on their number the resource usage may be noticeable on its own.
If you really intend to handle individual connections in individual processes, I suggest examining different approaches. For instance, you can put a socket into listen mode and then simultaneously accept connections from multiple worker processes in parallel. Once a worker is finished processing a request, it can go accept the next connection, so you still use fewer resources than forking a process for each connection. SpamAssassin and Apache (mpm prefork) use this worker model, for instance. It might end up easier and more robust depending on your use case. Specifically, you can make your workers die after serving a configured number of requests and be respawned by a master process, thereby eliminating much of the negative effect of memory leaks.
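A minimal sketch of that pre-fork accept pattern, assuming the fork start method (Linux); the port, worker count, and request handling are placeholders for illustration:

import socket
from multiprocessing import Process

def worker(listener):
    # Each worker blocks in accept() on the shared listening socket;
    # the kernel hands every incoming connection to exactly one worker
    while True:
        conn, addr = listener.accept()
        with conn:
            conn.sendall(b'hello\n')  # placeholder request handling

if __name__ == '__main__':
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    listener.bind(('0.0.0.0', 8000))  # placeholder address
    listener.listen()
    # With fork, the worker processes inherit the listening socket
    workers = [Process(target=worker, args=(listener,)) for _ in range(4)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()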
Based on @dano's answer above, I wrote this function to replace places where I used to use a multiprocessing Pool + map.
import asyncio
from concurrent.futures import ProcessPoolExecutor
from typing import Callable

def asyncio_friendly_multiproc_map(fn: Callable, l: list):
    """
    This is designed to replace the use of this pattern:
        with multiprocessing.Pool(5) as p:
            results = p.map(analyze_day, list_of_days)
    by letting the caller drop in a replacement:
        asyncio_friendly_multiproc_map(analyze_day, list_of_days)
    """
    tasks = []
    with ProcessPoolExecutor(5) as executor:
        for e in l:
            tasks.append(asyncio.get_event_loop().run_in_executor(executor, fn, e))
        res = asyncio.get_event_loop().run_until_complete(asyncio.gather(*tasks))
    return res
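A usage sketch, with analyze_day and list_of_days being the placeholder names from the docstring: results = asyncio_friendly_multiproc_map(analyze_day, list_of_days). Note that the helper blocks in run_until_complete, so it is meant to be called from synchronous code that owns the event loop, not from inside a running coroutine.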
See PEP 3156, in particular the section on Thread interaction:
http://www.python.org/dev/peps/pep-3156/#thread-interaction
This documents clearly the new asyncio methods you might use, including run_in_executor(). Note that the Executor is defined in concurrent.futures, I suggest you also have a look there.
I've encountered some unexpected behaviour of the python multiprocessing Pool class.
Here are my questions:
1) When does Pool create its context, which is later used for serialization? The example below runs fine as long as the Pool object is created after the Container definition. If you swap the Pool initializations, a serialization error occurs. In my production code I would like to initialize the Pool well before defining the container class. Is it possible to refresh the Pool "context", or to achieve this in another way?
2) Does Pool have its own load balancing mechanism and if so how does it work?
If I run a similar example on my i7 machine with the pool of 8 processes I get the following results:
- For a light evaluation function Pool favours using only one process for computation. It creates 8 processes as requested but for most of the time only one is used (I printed the pid from inside and also see this in htop).
- For a heavy evaluation function the behaviour is as expected. It uses all 8 processes equally.
3) When using Pool I always see 4 more processes than I requested (i.e. for Pool(processes=2) I see 6 new processes). What is their role?
I use Linux with Python 2.7.2
from multiprocessing import Pool
from datetime import datetime

POWER = 10

def eval_power(container):
    for power in xrange(2, POWER):
        container.val **= power
    return container

# processes = Pool(processes=2)

class Container(object):
    def __init__(self, value):
        self.val = value

processes = Pool(processes=2)

if __name__ == "__main__":
    cont = [Container(foo) for foo in xrange(20)]
    then = datetime.now()
    processes.map(eval_power, cont)
    now = datetime.now()
    print "Eval time:", now - then
EDIT - TO BAKURIU
1) I was afraid that that's the case.
2) I don't understand what the Linux scheduler has to do with Python assigning computations to processes. My situation can be illustrated by the example below:
from multiprocessing import Pool
from os import getpid
from collections import Counter

def light_func(ind):
    return getpid()

def heavy_func(ind):
    for foo in xrange(1000000):
        ind += foo
    return getpid()

if __name__ == "__main__":
    list_ = range(100)
    pool = Pool(4)
    l_func = pool.map(light_func, list_)
    h_func = pool.map(heavy_func, list_)
    print "light func:", Counter(l_func)
    print "heavy func:", Counter(h_func)
On my i5 machine (4 threads) I get the following results:
light func: Counter({2967: 100})
heavy func: Counter({2969: 28, 2967: 28, 2968: 23, 2970: 21})
It seems that the situation is as I've described it. However, I still don't understand why Python does it this way. My guess would be that it tries to minimise communication costs, but the mechanism it uses for load balancing remains unknown. The documentation isn't very helpful either; the multiprocessing module is rather poorly documented.
3) If I run the above code I get 4 extra processes as described before. The screenshot comes from htop: http://i.stack.imgur.com/PldmM.png
The Pool object creates the subprocesses during the call to __init__, hence you must define Container before creating the Pool. By the way, I wouldn't include all the code in a single file but would use a module to implement the Container and other utilities, and write a small file that launches the main program.
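A sketch of the working ordering, reusing the names from the question:

from multiprocessing import Pool

POWER = 10

def eval_power(container):
    for power in xrange(2, POWER):
        container.val **= power
    return container

class Container(object):
    def __init__(self, value):
        self.val = value

# The Pool forks its workers here, so Container must already be defined
processes = Pool(processes=2)

if __name__ == "__main__":
    cont = [Container(foo) for foo in xrange(20)]
    processes.map(eval_power, cont)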
The Pool does exactly what is described in the documentation. In particular it has no control over the scheduling of the processes, hence what you see is what Linux's scheduler thinks is right. For small computations, the tasks take so little time that the scheduler doesn't bother parallelizing them (this probably gives better performance due to core affinity etc.).
Could you show this with an example and what you see in the task manager? I think they may be the processes that handle the queue inside the Pool, but I'm not sure. On my machine I can see only the main process plus the two subprocesses.
Update on point 2:
The Pool object simply puts the tasks into a queue, and the child processes pull the arguments from this queue. If a process takes almost no time to execute a task, then the Linux scheduler lets that process keep running (hence consuming more items from the queue). If the execution takes a long time, the scheduler switches processes, so the other children get executed as well.
In your case a single process consumes all the items, because the computation takes so little time that it has finished all the items before the other child processes are ready.
As I said, Pool doesn't do anything about balancing the work of the subprocesses. It's simply a queue and a bunch of workers; the pool puts items in the queue and the processes pull the items and compute the results. AFAIK the only thing it does to control the queue is putting a certain number of tasks in a single queue item (see the documentation on chunking), but there is no guarantee about which process will grab which task. Everything else is left to the OS.
On my machine the results are less extreme. Two processes get about twice the number of calls as the other two for the light computation, while for the heavy one all get more or less the same number of items processed. Probably on different OSes and/or hardware we would obtain different results.
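As an aside, the per-item chunking mentioned above is exposed through map's chunksize parameter; a sketch of forcing one task per queue item (not from the original experiment, and trivial tasks may still end up skewed toward one worker):

# chunksize=1 puts a single task in each queue item, so idle workers
# can grab work one task at a time instead of in larger batches
l_func = pool.map(light_func, list_, chunksize=1)
print "light func:", Counter(l_func)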