Python's multiprocessing.Queuefails intermittently, and I don't know why. Is this a bug in Python or my script?
Minimal failing script
import multiprocessing
import time
import logging
import multiprocessing.util
multiprocessing.util.log_to_stderr(level=logging.DEBUG)
queue = multiprocessing.Queue(maxsize=10)
def worker(queue):
queue.put('abcdefghijklmnop')
# "Indicate that no more data will be put on this queue by the
# current process." --Documentation
# time.sleep(0.01)
queue.close()
proc = multiprocessing.Process(target=worker, args=(queue,))
proc.start()
# "Indicate that no more data will be put on this queue by the current
# process." --Documentation
# time.sleep(0.01)
queue.close()
proc.join()
I am testing this in CPython 3.6.6 in Debian. It also fails with docker python:3.7.0-alpine.
docker run --rm -v "${PWD}/test.py:/test.py" \
python:3-alpine python3 /test.py
The above script sometimes fails with a BrokenPipeError.
Traceback (most recent call last):
File "/usr/lib/python3.6/multiprocessing/queues.py", line 240, in _feed
send_bytes(obj)
File "/usr/lib/python3.6/multiprocessing/connection.py", line 200, in send_bytes
self._send_bytes(m[offset:offset + size])
File "/usr/lib/python3.6/multiprocessing/connection.py", line 404, in _send_bytes
self._send(header + buf)
File "/usr/lib/python3.6/multiprocessing/connection.py", line 368, in _send
n = write(self._handle, buf)
BrokenPipeError: [Errno 32] Broken pipe
Test harness
Because this is intermittent, I wrote a shell script to call it many times and count the failures.
#!/bin/sh
total=10
successes=0
for i in `seq ${total}`
do
if ! docker run --rm -v "${PWD}/test.py:/test.py" python:3-alpine \
python3 test.py 2>&1 \
| grep --silent BrokenPipeError
then
successes=$(expr ${successes} + 1)
fi
done
python3 -c "print(${successes} / ${total})"
This usually shows some fraction, maybe 0.2 indicating intermittent failures.
Timing adjustments
If I insert time.sleep(0.01) before either queue.close(), it works consistently. I noticed in the source code that writing happens in its own thread. I think if the writing thread is still trying to write data and all of the other threads close the queue, then it causes the error.
Debug logs
By uncommenting the first few lines, I can trace the execution for failures and successes.
Failure:
[DEBUG/MainProcess] created semlock with handle 140480257941504
[DEBUG/MainProcess] created semlock with handle 140480257937408
[DEBUG/MainProcess] created semlock with handle 140480257933312
[DEBUG/MainProcess] Queue._after_fork()
[DEBUG/Process-1] Queue._after_fork()
[INFO/Process-1] child process calling self.run()
[DEBUG/Process-1] Queue._start_thread()
[DEBUG/Process-1] doing self._thread.start()
[DEBUG/Process-1] starting thread to feed data to pipe
[DEBUG/Process-1] ... done self._thread.start()
[DEBUG/Process-1] telling queue thread to quit
[INFO/Process-1] process shutting down
[DEBUG/Process-1] running all "atexit" finalizers with priority >= 0
[DEBUG/Process-1] running the remaining "atexit" finalizers
[DEBUG/Process-1] joining queue thread
Traceback (most recent call last):
File "/usr/lib/python3.7/multiprocessing/queues.py", line 242, in _feed
send_bytes(obj)
File "/usr/lib/python3.7/multiprocessing/connection.py", line 200, in send_bytes
self._send_bytes(m[offset:offset + size])
File "/usr/lib/python3.7/multiprocessing/connection.py", line 404, in _send_bytes
self._send(header + buf)
File "/usr/lib/python3.7/multiprocessing/connection.py", line 368, in _send
n = write(self._handle, buf)
BrokenPipeError: [Errno 32] Broken pipe
[DEBUG/Process-1] feeder thread got sentinel -- exiting
[DEBUG/Process-1] ... queue thread joined
[INFO/Process-1] process exiting with exitcode 0
[INFO/MainProcess] process shutting down
[DEBUG/MainProcess] running all "atexit" finalizers with priority >= 0
[DEBUG/MainProcess] running the remaining "atexit" finalizers
"Success" (really silent failure, only able to replicate with Python 3.6):
[DEBUG/MainProcess] created semlock with handle 139710276231168
[DEBUG/MainProcess] created semlock with handle 139710276227072
[DEBUG/MainProcess] created semlock with handle 139710276222976
[DEBUG/MainProcess] Queue._after_fork()
[DEBUG/Process-1] Queue._after_fork()
[INFO/Process-1] child process calling self.run()
[DEBUG/Process-1] Queue._start_thread()
[DEBUG/Process-1] doing self._thread.start()
[DEBUG/Process-1] starting thread to feed data to pipe
[DEBUG/Process-1] ... done self._thread.start()
[DEBUG/Process-1] telling queue thread to quit
[INFO/Process-1] process shutting down
[INFO/Process-1] error in queue thread: [Errno 32] Broken pipe
[DEBUG/Process-1] running all "atexit" finalizers with priority >= 0
[DEBUG/Process-1] running the remaining "atexit" finalizers
[DEBUG/Process-1] joining queue thread
[DEBUG/Process-1] ... queue thread joined
[INFO/Process-1] process exiting with exitcode 0
[INFO/MainProcess] process shutting down
[DEBUG/MainProcess] running all "atexit" finalizers with priority >= 0
[DEBUG/MainProcess] running the remaining "atexit" finalizers
True success (using either time.sleep(0.01)):
[DEBUG/MainProcess] created semlock with handle 140283921616896
[DEBUG/MainProcess] created semlock with handle 140283921612800
[DEBUG/MainProcess] created semlock with handle 140283921608704
[DEBUG/MainProcess] Queue._after_fork()
[DEBUG/Process-1] Queue._after_fork()
[INFO/Process-1] child process calling self.run()
[DEBUG/Process-1] Queue._start_thread()
[DEBUG/Process-1] doing self._thread.start()
[DEBUG/Process-1] starting thread to feed data to pipe
[DEBUG/Process-1] ... done self._thread.start()
[DEBUG/Process-1] telling queue thread to quit
[INFO/Process-1] process shutting down
[DEBUG/Process-1] feeder thread got sentinel -- exiting
[DEBUG/Process-1] running all "atexit" finalizers with priority >= 0
[DEBUG/Process-1] running the remaining "atexit" finalizers
[DEBUG/Process-1] joining queue thread
[DEBUG/Process-1] ... queue thread joined
[INFO/Process-1] process exiting with exitcode 0
[INFO/MainProcess] process shutting down
[DEBUG/MainProcess] running all "atexit" finalizers with priority >= 0
[DEBUG/MainProcess] running the remaining "atexit" finalizers
The difference seems to be that in the truly successful case, the feeder receives the sentinel object before the atexit handlers.
the primary issue with your code is that nobody is consuming what your worker process has put in the queue. python queues expect that the data in queues is consumed ("flushed to pipe") prior to the process that put data on it is killed.
in this light, your example doesn't make much sense, but if you want to get it to work:
the key is the queue.cancel_join_thread() -- https://docs.python.org/3/library/multiprocessing.html
Warning As mentioned above, if a child process has put items on a
queue (and it has not used JoinableQueue.cancel_join_thread), then
that process will not terminate until all buffered items have been
flushed to the pipe. This means that if you try joining that process
you may get a deadlock unless you are sure that all items which have
been put on the queue have been consumed. Similarly, if the child
process is non-daemonic then the parent process may hang on exit when
it tries to join all its non-daemonic children.
Note that a queue created using a manager does not have this issue
^ relevant bit. the issue is that stuff is being put on the queue from the child process but NOT consumed by anyone. In this case cancel_join_queue must be called on the CHILD process prior to asking it to join. This code sample will get rid of the error.
import multiprocessing
import time
import logging
import multiprocessing.util
multiprocessing.util.log_to_stderr(level=logging.DEBUG)
queue = multiprocessing.Queue(maxsize=10)
def worker(queue):
queue.put('abcdefghijklmnop')
# "Indicate that no more data will be put on this queue by the
# current process." --Documentation
# time.sleep(0.01)
queue.close()
queue.cancel_join_thread() # ideally, this would not be here but would rather be a response to a signal (or other IPC message) sent from the main process
proc = multiprocessing.Process(target=worker, args=(queue,))
proc.start()
# "Indicate that no more data will be put on this queue by the current
# process." --Documentation
# time.sleep(0.01)
queue.close()
proc.join()
I didn't bother with IPC for this because there's no consumer at all but I hope the idea is clear.
Related
Sorry, if headline is strange. Let me explain.
Let's say there is handler.py:
import funcs
import requests
def initialize_calculate(data):
check_data(data)
funcs.calculate(data) # takes a lot of time like 30 minutes
print('Calculation launched')
requests.get('hostname', params={'func':'calculate', 'status':'launched'})
and here is funcs.py:
import requests
def calculate(data):
result = make_calculations(data)
requests.get('hostname',params={'func':'calculate', 'status':'finished', 'result':result})
So what I want is that handler can initialize another function no matter where, but doesn't wait until it ends, because I want to notify client-side that process is started, and when it's done this process itself will send result when it's finished.
How can I launch independent process with function calculate from initialize_calculate?
I want to know If it's possible without non-native libraries or frameworks.
If you don't wan't to use a 3rd-party lib like daemonocle implementing a "well-behaved" Unix-Daemon, you could
use subprocess.Popen() to create an independent process. Another option would be to modify multiprocessing.Process to prevent auto-joining of the child when the parent exits.
subprocess.Popen()
With subprocess.Popen() you start the new process with specifying commands and arguments like manually from terminal. This means you need to make funcs.py or another file a top-level script which parses string-arguments from stdin and then calls funcs.calculate() with these arguments.
I boiled your example down to the essence so we don't have to read too much code.
funcs.py
#!/usr/bin/env python3
# UNIX: enable executable from terminal with: chmod +x filename
import os
import sys
import time
import psutil # 3rd party for demo
def print_msg(msg):
print(f"[{time.ctime()}, pid: {os.getpid()}] --- {msg}")
def calculate(data, *args):
print_msg(f"parent pid: {psutil.Process().parent().pid}, start calculate()")
for _ in range(int(500e6)):
pass
print_msg(f"parent pid: {psutil.Process().parent().pid}, end calculate()")
if __name__ == '__main__':
if len(sys.argv) > 1:
calculate(*sys.argv[1:])
subp_main.py
#!/usr/bin/env python3
# UNIX: enable executable from terminal with: chmod +x filename
if __name__ == '__main__':
import time
import logging
import subprocess
import multiprocessing as mp
import funcs
mp.log_to_stderr(logging.DEBUG)
filename = funcs.__file__
data = ("data", 42)
# in case filename is an executable you don't need "python" before `filename`:
subprocess.Popen(args=["python", filename, *[str(arg) for arg in data]])
time.sleep(1) # keep parent alive a bit longer for demo
funcs.print_msg(f"exiting")
And important for testing, run from terminal, e.g. not PyCharm-Run, because it won't show what the child prints. In the last line below you see the child process' parent-id changed to 1 because the child got adopted by systemd (Ubuntu) after the parent exited.
$> ./subp_main.py
[Fri Oct 23 20:14:44 2020, pid: 28650] --- parent pid: 28649, start calculate()
[Fri Oct 23 20:14:45 2020, pid: 28649] --- exiting
[INFO/MainProcess] process shutting down
[DEBUG/MainProcess] running all "atexit" finalizers with priority >= 0
[DEBUG/MainProcess] running the remaining "atexit" finalizers
$> [Fri Oct 23 20:14:54 2020, pid: 28650] --- parent pid: 1, end calculate()
class OrphanProcess(multiprocessing.Process)
If you search for something more convenient, well you can't use the high-level multiprocessing.Process as is, because it doesn't let the parent process exit before the child, as you asked for. Regular child-processes are either joined (awaited) or terminated (if you set the daemon-flag for Process) when the parent shuts down. This still happens within Python. Note that the daemon-flag doesn't make a process a Unix-Daemon. The naming is a somewhat frequent source of confusion.
I subclassed multiprocessing.Process to switch the auto-joining off and spend some time with the source and observing if zombies might become an issue. Because the modification turns off automatic joining in the parent, I recommend using "forkserver" as start-method for new processes on Unix (always a good idea if the parent is already multi-threaded) to prevent zombie-children from sticking around as long the parent is still running. When the parent process terminates, its child-zombies get eventually reaped by systemd/init. Running multiprocessing.log_to_stderr() shows everything shutting down cleanly, so nothing seems broken so far.
Consider this approach experimental, but it's probably a lot safer than using raw os.fork() to re-invent part of the extensive multiprocessing machinery, just to add this one feature. For error-handling in the child, write a try-except block and log to file.
orphan.py
import multiprocessing.util
import multiprocessing.process as mpp
import multiprocessing as mp
__all__ = ['OrphanProcess']
class OrphanProcess(mp.Process):
"""Process which won't be joined by parent on parent shutdown."""
def start(self):
super().start()
mpp._children.discard(self)
def __del__(self):
# Finalizer won't `.join()` the child because we discarded it,
# so here last chance to reap a possible zombie from within Python.
# Otherwise systemd/init will reap eventually.
self.join(0)
orph_main.py
#!/usr/bin/env python3
# UNIX: enable executable from terminal with: chmod +x filename
if __name__ == '__main__':
import time
import logging
import multiprocessing as mp
from orphan import OrphanProcess
from funcs import print_msg, calculate
mp.set_start_method("forkserver")
mp.log_to_stderr(logging.DEBUG)
p = OrphanProcess(target=calculate, args=("data", 42))
p.start()
time.sleep(1)
print_msg(f"exiting")
Again test from terminal to get the child print to stdout. When the shell appears to be hanging after everything was printed over the second prompt, hit enter to get a new prompt. The parent-id stays the same here because the parent, from the OS-point of view, is the forkserver-process, not the initial main-process for orph_main.py.
$> ./orph_main.py
[INFO/MainProcess] created temp directory /tmp/pymp-bd75vnol
[INFO/OrphanProcess-1] child process calling self.run()
[Fri Oct 23 21:18:29 2020, pid: 30998] --- parent pid: 30997, start calculate()
[Fri Oct 23 21:18:30 2020, pid: 30995] --- exiting
[INFO/MainProcess] process shutting down
[DEBUG/MainProcess] running all "atexit" finalizers with priority >= 0
[DEBUG/MainProcess] running the remaining "atexit" finalizers
$> [Fri Oct 23 21:18:38 2020, pid: 30998] --- parent pid: 30997, end calculate()
[INFO/OrphanProcess-1] process shutting down
[DEBUG/OrphanProcess-1] running all "atexit" finalizers with priority >= 0
[DEBUG/OrphanProcess-1] running the remaining "atexit" finalizers
[INFO/OrphanProcess-1] process exiting with exitcode 0
you may use Process class from multiprocessing module to do that.
Here is an example:
from multiprocessing import Process
import requests
def calculate(data):
result = make_calculations(data)
requests.get('hostname',params={'func':'calculate', 'status':'finished', 'result':result})
def initialize_calculate(data):
check_data(data)
p = Process(target=calculate, args=(data,))
p.start()
print('Calculation launched')
requests.get('hostname', params={'func':'calculate', 'status':'launched'})
#coding=utf8
from multiprocessing import Process
from time import sleep
import os
def foo():
print("foo")
for i in range(11000):
sleep(1)
print("{}: {}".format(os.getpid(), i))
if __name__ == "__main__":
p = Process(target=foo)
#p.daemon = True
p.start()
print("parent: {}".format(os.getpid()))
sleep(20)
raise Exception("invalid")
An exception is raised in the main-process, but child- and main-process keep running. Why?
When the MainProcess is shutting down, non-daemonic child processes are simply joined. This happens in _exit_function() which is registered with atexit.register(_exit_function). You can look it up in multiprocessing.util.py if you are curious.
You can also insert multiprocessing.log_to_stderr(logging.DEBUG) before you start your process to see the log messages:
parent: 9670
foo
[INFO/Process-1] child process calling self.run()
9675: 0
9675: 1
...
9675: 18
Traceback (most recent call last):
File "/home/...", line 26, in <module>
raise Exception("invalid")
Exception: invalid
[INFO/MainProcess] process shutting down
[DEBUG/MainProcess] running all "atexit" finalizers with priority >= 0
[INFO/MainProcess] calling join() for process Process-1
9675: 19
I'm following the documentation for logging with multiprocessing, but I am seeing two logs created by the worker in each subprocess. Am I making a dumb mistake somewhere?
Environment:
Python 3.6.1 |Anaconda custom (64-bit)| (default, May 11 2017, 13:25:24) [MSC v.1900 64 bit (AMD64)] on win32
Code (edited to fix scope issue recommended by #georgexsh):
import logging
import multiprocessing
logger = multiprocessing.log_to_stderr(logging.INFO)
def test(i):
logger.info(f'worker processing {i}')
if __name__ == '__main__':
with multiprocessing.Pool() as pool:
metrics = pool.map(test, range(20))
Logging output:
[INFO/SpawnPoolWorker-2] child process calling self.run()
[INFO/SpawnPoolWorker-2] child process calling self.run()
[INFO/SpawnPoolWorker-2] worker processing 0
[INFO/SpawnPoolWorker-3] child process calling self.run()
[INFO/SpawnPoolWorker-3] child process calling self.run()
[INFO/SpawnPoolWorker-2] worker processing 0
[INFO/SpawnPoolWorker-1] child process calling self.run()
[INFO/SpawnPoolWorker-3] worker processing 1
[INFO/SpawnPoolWorker-2] worker processing 2
[INFO/SpawnPoolWorker-1] child process calling self.run()
[INFO/SpawnPoolWorker-6] child process calling self.run()
[INFO/SpawnPoolWorker-3] worker processing 1
[INFO/SpawnPoolWorker-4] child process calling self.run()
[INFO/SpawnPoolWorker-2] worker processing 2
[INFO/SpawnPoolWorker-5] child process calling self.run()
[INFO/SpawnPoolWorker-7] child process calling self.run()
[INFO/SpawnPoolWorker-1] worker processing 3
[INFO/SpawnPoolWorker-6] child process calling self.run()
[INFO/SpawnPoolWorker-3] worker processing 4
[INFO/SpawnPoolWorker-4] child process calling self.run()
[INFO/SpawnPoolWorker-2] worker processing 5
[INFO/SpawnPoolWorker-5] child process calling self.run()
[INFO/SpawnPoolWorker-7] child process calling self.run()
[INFO/SpawnPoolWorker-1] worker processing 3
[INFO/SpawnPoolWorker-6] worker processing 6
[INFO/SpawnPoolWorker-3] worker processing 4
...
[INFO/SpawnPoolWorker-5] worker processing 16
[INFO/SpawnPoolWorker-2] worker processing 12
[INFO/SpawnPoolWorker-7] worker processing 17
[INFO/SpawnPoolWorker-1] worker processing 18
[INFO/SpawnPoolWorker-6] worker processing 13
[INFO/SpawnPoolWorker-3] worker processing 19
[INFO/SpawnPoolWorker-8] worker processing 14
[INFO/SpawnPoolWorker-4] worker processing 15
[INFO/SpawnPoolWorker-5] worker processing 16
[INFO/SpawnPoolWorker-7] worker processing 17
[INFO/SpawnPoolWorker-1] worker processing 18
[INFO/SpawnPoolWorker-3] worker processing 19
[INFO/SpawnPoolWorker-2] process shutting down
[INFO/SpawnPoolWorker-6] process shutting down
[INFO/MainProcess] process shutting down
move logger = multiprocessing.log_to_stderr() to global scope, not inside worker function. to make sure it only called once. because each time log_to_stderr gets called, it will add a new handler to the logger:
def test(i):
logger.info('worker processing %s', i)
if __name__ == '__main__':
logger = multiprocessing.log_to_stderr(logging.INFO)
note that under windows, as there is no fork(), the whole module get executed again when the child process is created to reconstruct context, you could initialize logger with Pool's initializer, it runs only once pre-child process:
logger = None
def test(i):
logger.info('worker processing %s', i)
def initializer(level):
global logger
logger = multiprocessing.log_to_stderr(level)
if __name__ == '__main__':
pool = multiprocessing.Pool(4, initializer=initializer, initargs=(logging.INFO,))
metrics = pool.map(test, range(20))
I'm learning multiprocessing in Python while I found this odd behaviour between daemon and non-daemon process with respect to the main process.
My Code:
import multiprocessing
import time
def worker(name,num):
print name, 'Starting'
time.sleep(num)
print name, 'Exiting'
if __name__ == '__main__':
print 'main starting'
p1=multiprocessing.Process(target=worker, args=("p1",7,))
p2=multiprocessing.Process(target=worker, args=("p2",5,))
p2.daemon=True
p1.start()
p2.start()
time.sleep(4)
print 'main exiting'
The output I'm getting is:
main starting
p1 Starting
p2 Starting
main exiting
p1 Exiting
Expected Output:
main starting
p1 Starting
p2 Starting
main exiting
p2 Exiting
p1 Exiting
After few searches, I found this answer and inserted the following line to my code.
logger = multiprocessing.log_to_stderr(logging.INFO)
And the output I got is,
main starting
[INFO/p1] child process calling self.run()
p1 Starting
[INFO/p2] child process calling self.run()
p2 Starting
main exiting
[INFO/MainProcess] process shutting down
[INFO/MainProcess] calling terminate() for daemon p2
[INFO/MainProcess] calling join() for process p2
[INFO/MainProcess] calling join() for process p1
p1 Exiting
[INFO/p1] process shutting down
[INFO/p1] process exiting with exitcode 0
As per my understanding,
When a main process exits, it terminates all of its child daemon processes. (Which can be seen here)
A main process can not exit before all of its child non-daemon processes exit.
But here, why the main process is trying to shut down before p1 exits?
p1 starts and sleeps for 7 seconds.
p2 starts and sleeps for 5 seconds.
main process sleeps for 4 seconds.
main process wakes up after 4 seconds and waits for p1 to exit.
p2 wakes up after 5 seconds and exits.
p1 wakes up after 7 seconds and exits.
main process exits.
Wouldn't be the above a normal time-line for the above program?
Can anyone please explain what's happening here and why?
EDIT
After adding the line p1.join() at the end of the code, I'm getting the following output:
main starting
[INFO/Process-1] child process calling self.run()
p1 Starting
[INFO/Process-2] child process calling self.run()
p2 Starting
main exiting
p2 Exiting
[INFO/Process-2] process shutting down
[INFO/Process-2] process exiting with exitcode 0
p1 Exiting
[INFO/Process-1] process shutting down
[INFO/Process-1] process exiting with exitcode 0
[INFO/MainProcess] process shutting down
When you see
[INFO/MainProcess] calling join() for process p1
it means the main process has not exited yet -- it's in the process of shutting down, but of course will not be done shutting down until that join returns... which will happen only once the joined process is done.
So the timeline is indeed as you expect it to be -- but since that join to p1 is the very last think the main process ever does, you don't see that in the output or logging from it. (main has taken all the termination-triggered action, but as a process it's still alive until then).
To verify, use (on a Unixy system) ps from another terminal while running this (perhaps with slightly longer delays to help you check): you will never see a single Python process running out of this complex -- there will be two (main and p1) until the end.
Python Multiprocess| Daemon & Join
Question end up adding more wait towards the behavior of daemon flag and join method, hence here a quick explanation with a simple script.
The daemon process is terminated automatically before the main program exits, to avoid leaving orphaned processes running but is not terminated with the main program. So any outstanding processes left by the daemoned function won't run!
However if the join() method is called on the daemoned function then the main programm will wait the left over processes.
INPUT
import multiprocessing
import time
import logging
def daemon():
p = multiprocessing.current_process()
print('Starting:', p.name, p.pid, flush=True)
print('---' * 15)
time.sleep(2)
print('===' * 15 + ' < ' + f'Exiting={p.name, p.pid}' + ' > ' + '===' * 15, flush=True)
def non_daemon():
p = multiprocessing.current_process()
print('Starting:', p.name, p.pid, flush=True)
print('---'*15)
print('===' * 15 + ' < ' + f'Exiting={p.name, p.pid}' + ' > ' + '===' * 15, flush=True)
if __name__ == '__main__':
print('==='*15 + ' < ' + 'MAIN PROCESS' + ' > ' + '==='*15)
logger = multiprocessing.log_to_stderr(logging.INFO)
# DAEMON
daemon_proc = multiprocessing.Process(name='MyDaemon', target=daemon)
daemon_proc.daemon = True
# NO-DAEMON
normal_proc = multiprocessing.Process(name='non-daemon', target=non_daemon)
normal_proc.daemon = False
daemon_proc.start()
time.sleep(2)
normal_proc.start()
# daemon_proc.join()
OUTPUT NO JOIN METHOD##
============================================= < MAIN PROCESS > =============================================
Starting: MyDaemon 8448
---------------------------------------------
[INFO/MyDaemon] child process calling self.run()
[INFO/MainProcess] process shutting down
[INFO/MainProcess] calling terminate() for daemon MyDaemon
[INFO/MainProcess] calling join() for process MyDaemon
[INFO/MainProcess] calling join() for process non-daemon
Starting: non-daemon 6688
---------------------------------------------
============================================= < Exiting=('non-daemon', 6688) > =============================================
[INFO/non-daemon] child process calling self.run()
[INFO/non-daemon] process shutting down
[INFO/non-daemon] process exiting with exitcode 0
Process finished with exit code 0
The function daemon() is 'spleeping' 2 second, hence as Python reaches the bottom of the script the program shut down but not daemon().
Note:
[INFO/MainProcess] calling terminate() for daemon MyDaemon
Now if the last line of the script daemon_proc.join() is not commented.
OUTPUT + JOIN METHOD
============================================= < MAIN PROCESS > =============================================
Starting: MyDaemon 13588
---------------------------------------------
[INFO/MyDaemon] child process calling self.run()
============================================= < Exiting=('MyDaemon', 13588) > =============================================
[INFO/MyDaemon] process shutting down
[INFO/MyDaemon] process exiting with exitcode 0
[INFO/MainProcess] process shutting down
[INFO/MainProcess] calling join() for process non-daemon
Starting: non-daemon 13608
---------------------------------------------
============================================= < Exiting=('non-daemon', 13608) > =============================================
[INFO/non-daemon] child process calling self.run()
[INFO/non-daemon] process shutting down
[INFO/non-daemon] process exiting with exitcode 0
Process finished with exit code 0
I have a python script: zombie.py
from multiprocessing import Process
from time import sleep
import atexit
def foo():
while True:
sleep(10)
#atexit.register
def stop_foo():
p.terminate()
p.join()
if __name__ == '__main__':
p = Process(target=foo)
p.start()
while True:
sleep(10)
When I run this with python zombie.py & and kill the parent process with kill -2, the stop() is correctly called and both processes terminate.
Now, suppose I have a bash script zombie.sh:
#!/bin/sh
python zombie.py &
echo "done"
And I run ./zombie.sh from the command line.
Now, stop() never gets called when the parent gets killed. If I run kill -2 on the parent process, nothing happens. kill -15 or kill -9 both just kill the parent process, but not the child:
[foo#bar ~]$ ./zombie.sh
done
[foo#bar ~]$ ps -ef | grep zombie | grep -v grep
foo 27220 1 0 17:57 pts/3 00:00:00 python zombie.py
foo 27221 27220 0 17:57 pts/3 00:00:00 python zombie.py
[foo#bar ~]$ kill -2 27220
[foo#bar ~]$ ps -ef | grep zombie | grep -v grep
foo 27220 1 0 17:57 pts/3 00:00:00 python zombie.py
foo 27221 27220 0 17:57 pts/3 00:00:00 python zombie.py
[foo#bar ~]$ kill 27220
[foo#bar ~]$ ps -ef | grep zombie | grep -v grep
foo 27221 1 0 17:57 pts/3 00:00:00 python zombie.py
What is going on here? How can I make sure the child process dies with the parent?
Neither the atexit nor the p.daemon = True will truely ensure that the child process will die with the father. Receiving a SIGTERM will not trigger the atexit routines.
To make sure the child gets killed upon its father's death you will have to install a signal handler in the father. This way you can react on most signals (SIGQUIT, SIGINT, SIGHUP, SIGTERM, ...) but not on SIGKILL; there simply is no way to react on that signal from within the process which receives it.
Install a signal handler for all useful signals and in that handler kill the child process.
Update: This solution doesn't work for processes killed by a signal.
Your child process is not a zombie. It is alive.
If you want the child process to be killed when its parent exits normally then set p.daemon = True before p.start(). From the docs:
When a process exits, it attempts to terminate all of its daemonic child processes.
Looking at the source code, it is clear that multiprocessing uses atexit callback to kill its daemonic children i.e., it won't work if the parent is killed by a signal. For example:
#!/usr/bin/env python
import logging
import os
import signal
import sys
from multiprocessing import Process, log_to_stderr
from threading import Timer
from time import sleep
def foo():
while True:
sleep(1)
if __name__ == '__main__':
log_to_stderr().setLevel(logging.DEBUG)
p = Process(target=foo)
p.daemon = True
p.start()
# either kill itself or exit normally in 5 seconds
if '--kill' in sys.argv:
Timer(5, os.kill, [os.getpid(), signal.SIGTERM]).start()
else: # exit normally
sleep(5)
Output
$ python kill-orphan.py
[INFO/Process-1] child process calling self.run()
[INFO/MainProcess] process shutting down
[DEBUG/MainProcess] running all "atexit" finalizers with priority >= 0
[INFO/MainProcess] calling terminate() for daemon Process-1
[INFO/MainProcess] calling join() for process Process-1
[DEBUG/MainProcess] running the remaining "atexit" finalizers
Notice "calling terminate() for daemon" line.
Output (with --kill)
$ python kill-orphan.py --kill
[INFO/Process-1] child process calling self.run()
The log shows that if the parent is killed by a signal then "atexit" callback is not called (and ps shows that the child is alive in this case). See also Multiprocess Daemon Not Terminating on Parent Exit.