Issue with Python's subprocess.Popen (creating a zombie and getting stuck) - python

An issue I have with Python's (3.4) subprocess.Popen:
Very rarely (once in several thousand calls), a Popen call seems to create another forked process in addition to the intended one, and that extra process hangs (possibly waiting?), leaving the intended process as a zombie.
Here's the call sequence:
with subprocess.Popen(['prog', 'arg1', 'arg2'], stdin=subprocess.PIPE, stdout=subprocess.PIPE, stderr=subprocess.PIPE) as p:
    std_out, std_err = p.communicate()
    p.wait()
Note: the above call sequence is itself run from a forked process (a form of process pooling; see the process list below).
The issue happens with multiple programs (7z, for example), so I assume the problem is with the caller and not the callee.
prog is zombified, so I assume the p.wait() statement is never reached or not executed properly.
The resulting process list (ps -ef output):
my_user 18219 18212 9 16:16 pts/1 00:18:11 python3 script.py # original process
my_user 1045 18219 0 16:18 pts/1 00:00:14 python3 script.py # Intentionally forked from original (poor man's process pool) - Seems to be stuck or waiting
my_user 2834 1045 0 16:18 pts/1 00:00:00 [prog] <defunct> # Program run by subprocess.popen - Zombie
my_user 2841 1045 0 16:18 pts/1 00:00:00 python3 script.py # !!!! Should not be here, also stuck or waiting, never finishes
Edit (added code sample as requested):
The code in question:
import logging
import os
import subprocess

pid = os.fork()
if pid == 0:
    # child
    file_name = 'test.zip'
    out_dir = '/tmp'
    while True:
        with subprocess.Popen(['7z', 'x', '-y', '-p', '-o' + out_dir, file_name], stdin=subprocess.PIPE, stdout=subprocess.PIPE, stderr=subprocess.PIPE) as p:
            try:
                std_out, std_err = p.communicate(timeout=600)
            except subprocess.TimeoutExpired:
                p.kill()
                std_out, std_err = p.communicate()
                logging.critical('7z failed, a timeout has occurred during waiting')
            except:
                p.kill()
                p.wait()
                raise
            return_code = p.poll()
            # do something
else:
    # parent
    wpid, status = os.waitpid(pid, 0)
    exit_code = status >> 8  # for a normal exit this is equivalent to os.WEXITSTATUS(status)

I believe this is an effect of mixing forking and threading, which is a bad thing to do in Linux. Here are a couple references:
Is it safe to fork from within a thread?
https://rachelbythebay.com/w/2011/06/07/forked/
I believe your process is multithreaded once you import the logging module. (In my case, I was sometimes seeing my program hang while waiting on a logging futex, and sometimes hang while waiting inside subprocess with the subprocess having become a zombie.) That module uses OS locks to ensure that it can be called in a thread-safe manner. Once you fork, the state of that lock is inherited by the child process. So the child (which is single-threaded but inherited the memory of the parent) can't acquire the logging lock, because the lock happened to be held at the moment the fork occurred.
(I'm not super confident in my explanation. My problem went away when I switched from using multiprocessing's default fork behavior to using spawn behavior. In the latter, a child does not inherit its parent's memory, and subprocess and logging no longer caused hangs for me.)
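To illustrate that workaround concretely, here is a minimal sketch (my own illustration, not the original code) of moving the work out of a raw os.fork() and into a multiprocessing pool with the "spawn" start method; the extract() helper and its arguments are made up for the example:
import multiprocessing
import subprocess

def extract(file_name, out_dir):
    with subprocess.Popen(['7z', 'x', '-y', '-p', '-o' + out_dir, file_name],
                          stdin=subprocess.PIPE, stdout=subprocess.PIPE,
                          stderr=subprocess.PIPE) as p:
        std_out, std_err = p.communicate(timeout=600)
    return p.returncode

if __name__ == '__main__':
    # "spawn" starts each worker as a brand-new interpreter, so no
    # mutexes (logging or otherwise) are inherited in a locked state.
    multiprocessing.set_start_method('spawn')
    with multiprocessing.Pool(processes=4) as pool:
        print(pool.starmap(extract, [('test.zip', '/tmp')]))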

subprocess indeed forks before running the command. This is mentioned in PEP 324 (search the PEP for “fork”).
The reason is that the command is run using exec, which replaces the forked process with the executed one.
As you can see, the forked process shares the same pid as the program that gets executed, so it actually is the same process; it is just no longer the Python interpreter that is being run once exec has happened.
So, as long as that child process does not return, the calling Python process can't either.
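For intuition, here is a rough sketch of that fork-then-exec sequence using the os module directly (a simplification; the real subprocess implementation does far more careful file-descriptor and error handling):
import os

pid = os.fork()
if pid == 0:
    # Child: until execvp() succeeds, ps still shows this as the Python script.
    os.execvp('ls', ['ls', '-la'])
else:
    # Parent: blocks until the child (now running ls) exits.
    _, status = os.waitpid(pid, 0)
    print('exit status:', os.WEXITSTATUS(status))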

Related

How to ensure the subprocess is killed on timeout when using `run`?

I am using the following code to launch a subprocess:
# Run the program
subprocess_result = subprocess.run(
    cmd,
    stdout=subprocess.PIPE,
    stderr=subprocess.PIPE,
    check=False,
    timeout=timeout,
    cwd=directory,
    env=env,
    preexec_fn=set_memory_limits,
)
The launched subprocess is also a Python program, with a shebang.
This subprocess may last for longer than the specified timeout.
The subprocess does heavy computations, writes its results to a file, and does not contain any signal handler.
According to the documentation https://docs.python.org/3/library/subprocess.html#subprocess.run, subprocess.run kills a child that times out:
The timeout argument is passed to Popen.communicate(). If the timeout
expires, the child process will be killed and waited for. The
TimeoutExpired exception will be re-raised after the child process has
terminated.
When my subprocess times out, I always receive the subprocess.TimeoutExpired exception, but from time to time the subprocess is not killed, hence still consuming resources on my machine.
So my question is, am I doing something wrong here? If yes, what? And if no, why do I have this issue and how can I solve it?
Note: I am using Python 3.10 on Ubuntu 22.04
The most likely culprit for the behaviour you see is that the subprocess you are spawning is probably using multiprocessing and spawning its own child processes. Killing the parent process does not automatically kill the whole set of descendants. The grandchildren are inherited by the init process (i.e. the process with PID 1) and will continue to run.
You can verify this from the source code of subprocess.run:
    with Popen(*popenargs, **kwargs) as process:
        try:
            stdout, stderr = process.communicate(input, timeout=timeout)
        except TimeoutExpired as exc:
            process.kill()
            if _mswindows:
                # Windows accumulates the output in a single blocking
                # read() call run on child threads, with the timeout
                # being done in a join() on those threads. communicate()
                # _after_ kill() is required to collect that and add it
                # to the exception.
                exc.stdout, exc.stderr = process.communicate()
            else:
                # POSIX _communicate already populated the output so
                # far into the TimeoutExpired exception.
                process.wait()
            raise
        except:  # Including KeyboardInterrupt, communicate handled that.
            process.kill()
            # We don't call process.wait() as .__exit__ does that for us.
            raise
Here you can see that the timeout is set on the communicate call; if it fires, the subprocess is .kill()ed. The kill method sends a SIGKILL, which immediately kills the subprocess without any cleanup. It's a signal that cannot be caught by the subprocess, so it's not possible that the child is somehow ignoring it.
The TimeoutExpired exception is then re-raised, so if your parent process sees this exception, the subprocess is already dead.
This however says nothing of grandchildren processes. Those will continue to run as children of PID 1.
I don't see any way in which you can customize how subprocess.run handles subprocess termination. For example, if it used SIGTERM instead of SIGKILL you could modify your child process or write a wrapper process that will catch the signal and properly kill all its descendants. But SIGKILL doesn't give you this luxury.
So I believe that for your use case you cannot use the subprocess.run facade but you should use Popen directly. You can look at the subprocess.run implementation and take just the things that you need, maybe dropping support for platforms you don't use.
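If you do go that route, one common pattern (my suggestion, not something subprocess.run does for you) is to start the child in its own session, so that it and all of its descendants share a process group, and then signal the whole group on timeout. A rough POSIX-only sketch, with run_tree() being an illustrative name:
import os
import signal
import subprocess

def run_tree(cmd, timeout):
    # start_new_session=True puts the child (and its future children) in a
    # new process group whose ID equals the child's PID.
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE,
                            start_new_session=True)
    try:
        stdout, stderr = proc.communicate(timeout=timeout)
    except subprocess.TimeoutExpired:
        # Signal the whole group so that grandchildren die too.
        os.killpg(proc.pid, signal.SIGKILL)
        proc.wait()
        raise
    return proc.returncode, stdout, stderr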
Note: There are extremely rare situations in which a subprocess won't die immediately on SIGKILL. I believe the only situation in which this happens is when the subprocess is performing a very long system call or other kernel operation, which might not be interrupted immediately. If that operation deadlocks, it might prevent the process from ever terminating. However, I don't think that this is your case, since you did not mention that the process is stuck doing nothing; from what you said, the process simply seems to continue running.

subprocess.Popen() getting stuck

My question
I encountered a hang-up issue with the combination of threading, multiprocessing, and subprocess. I simplified my situation as below.
import subprocess
import threading
import multiprocessing

class dummy_proc(multiprocessing.Process):
    def run(self):
        print('run')
        while True:
            pass

class popen_thread(threading.Thread):
    def run(self):
        proc = subprocess.Popen('ls -la'.split(), shell=False, stdin=subprocess.PIPE, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
        stdout_byte, stderr_byte = proc.communicate()
        rc = proc.returncode
        print(rc)

if __name__ == '__main__':
    print('start')
    t = popen_thread()
    t.start()
    p = dummy_proc()
    p.start()
    t.join()
    p.terminate()
In this script, a thread and a process are created. The thread just issues the system command ls -la. The process just loops infinitely. Once the thread has obtained the return code of the system command, the main script terminates the process and exits immediately.
When I run this script again and again, it sometimes hangs up. I googled this situation and found some articles which seem to be related.
Is it safe to fork from within a thread?
Issue with Python's subprocess.Popen (creating a zombie and getting stuck)
So, I guess the hang-up issue is explained something like below:
The process is generated between Popen() and communicate().
It inherits some "blocking" status of the thread, and it is never released.
It prevents the thread from acquiring the result of communicate().
But I'm not 100% confident, so it would be great if someone helped me explain what happens here.
My environment
I used the following environment.
$ uname -a
Linux dell-vostro5490 5.10.96-1-MANJARO #1 SMP PREEMPT Tue Feb 1 16:57:46 UTC 2022 x86_64 GNU/Linux
$ python3 --version
Python 3.9.2
I also tried the following environment and got the same result.
$ uname -a
Linux raspberrypi 5.10.17+ #2 Tue Jul 6 21:58:58 PDT 2021 armv6l GNU/Linux
$ python3 --version
Python 3.7.3
What I tried
Use spawn instead of fork for multiprocessing.
Use thread instead of process for dummy_proc.
In both cases, the issue disappeared. So, I guess this issue is related to the behavior of fork...
This is a bit too long for a comment and so ...
I am having a problem understanding your statement that the problem disappears when you "Use thread instead of process for dummy_proc."
The hanging problem as I understand it is "that fork() only copies the calling thread, and any mutexes held in child threads will be forever locked in the forked child." In other words, the hanging problem arises when a fork is done when there exists one or more threads other than the main thread (i.e, the one associated with the main process).
If you execute a subprocess.Popen call from a newly created subprocess or a newly created thread, either way there will, by definition, be a new thread in existence prior to the fork done to implement the Popen call, and I would think the potential for hanging exists.
import subprocess
import threading
import multiprocessing
import os

class popen_process(multiprocessing.Process):
    def run(self):
        print(f'popen_process, PID = {os.getpid()}, TID={threading.current_thread().ident}')
        proc = subprocess.Popen('ls -la'.split(), shell=False, stdin=subprocess.PIPE, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
        stdout_byte, stderr_byte = proc.communicate()
        rc = proc.returncode

if __name__ == '__main__':
    print(f'main process, PID = {os.getpid()}, TID={threading.current_thread().ident}')
    multiprocessing.set_start_method('spawn')
    p = popen_process()
    p.start()
    p.join()
Prints:
main process, PID = 14, TID=140301923051328
popen_process, PID = 16, TID=140246240732992
Note the new thread with TID=140246240732992
It seems to me that you need to use the spawn start method as long as you are doing the Popen call from another thread or process if you want to be sure of not hanging. For what it's worth, on my Windows Subsystem for Linux I could not get it to hang with fork using your code after quite a few tries, so I am just going by what the linked answer warns against.
In any event, in your example code there seems to be a potential race condition. Let's assume that even though your popen_process is a new thread, its properties are such that it does not give rise to the hanging problem (no mutexes are being held). Then the problem would arise from the creation of the dummy_proc process/thread. The question becomes whether your call to t.start() completes the starting of the new process that ultimately runs the ls -la command before or after the creation of the dummy_proc process/thread. That timing determines whether the new dummy_proc thread (there will be one regardless of whether dummy_proc inherits from Process or Thread, as we have seen) exists prior to the creation of the ls -la process. This race condition might explain why you were sometimes hanging. I would, however, have no explanation for why you never hang when you make dummy_proc inherit from threading.Thread.

Have subprocess.Popen only wait on its child process to return, but not any grandchildren

I have a python script that does this:
p = subprocess.Popen(pythonscript.py, stdin=PIPE, stdout=PIPE, stderr=PIPE, shell=False)
theStdin=request.input.encode('utf-8')
(outputhere,errorshere) = p.communicate(input=theStdin)
It works as expected: it waits for the subprocess to finish via p.communicate(). However, within pythonscript.py I want to "fire and forget" a "grandchild" process. I'm currently doing this by overriding the join method:
class EverLastingProcess(Process):
    def join(self, *args, **kwargs):
        pass  # Overrides join so that it doesn't block. Otherwise the parent waits.
    def __del__(self):
        pass
And starting it like this:
p = EverLastingProcess(target=nameOfMyFunction, args=(arg1, etc,), daemon=False)
p.start()
This also works fine when I just run pythonscript.py in a bash terminal or bash script: control returns and a response comes back while the child process started by EverLastingProcess keeps going. However, when I run pythonscript.py with Popen as shown above, it looks from the timings that Popen is waiting on the grandchild to finish.
How can I make it so that the Popen only waits on the child process, and not any grandchild processes?
The solution below (overriding the join method, with the shell=True addition) stopped working when we upgraded our Python recently.
There are many references on the internet about the pieces and parts of this, but it took me some doing to come up with a useful solution to the entire problem.
The following solution has been tested in Python 3.9.5 and 3.9.7.
Problem Synopsis
The names of the scripts match those in the code example below.
A top-level program (grandparent.py):
Uses subprocess.run or subprocess.Popen to call a program (parent.py)
Checks return value from parent.py for sanity.
Collects stdout and stderr from the main process 'parent.py'.
Does not want to wait around for the grandchild to complete.
The called program (parent.py)
Might do some stuff first.
Spawns a very long process (the grandchild - "longProcess" in the code below).
Might do a little more work.
Returns its results and exits while the grandchild (longProcess) continues doing what it does.
Solution Synopsis
The important part isn't so much what happens with subprocess. Instead, the method for creating the grandchild/longProcess is the critical part. It is necessary to ensure that the grandchild is truly emancipated from parent.py.
Subprocess only needs to be used in a way that captures output.
The longProcess (grandchild) needs the following to happen:
It should be started using multiprocessing.
It needs multiprocessing's 'daemon' set to False.
It should also be invoked using the double-fork procedure.
In the double-fork, extra work needs to be done to ensure that the process is truly separate from parent.py. Specifically:
Move the execution away from the environment of parent.py.
Use file handling to ensure that the grandchild no longer uses the file handles (stdin, stdout, stderr) inherited from parent.py.
Example Code
grandparent.py - calls parent.py using subprocess.run()
#!/usr/bin/env python3
import subprocess
p = subprocess.run(["/usr/bin/python3", "/path/to/parent.py"], capture_output=True)
## Comment the following if you don't need reassurance
print("The return code is: " + str(p.returncode))
print("The standard out is: ")
print(p.stdout)
print("The standard error is: ")
print(p.stderr)
parent.py - starts the longProcess/grandchild and exits, leaving the grandchild running. After 10 seconds, the grandchild will write timing info to /tmp/timelog.
#!/usr/bin/env python3
import time

def longProcess():
    time.sleep(10)
    fo = open("/tmp/timelog", "w")
    fo.write("I slept! The time now is: " + time.asctime(time.localtime()) + "\n")
    fo.close()

import os, sys

def spawnDaemon(func):
    # do the UNIX double-fork magic, see Stevens' "Advanced
    # Programming in the UNIX Environment" for details (ISBN 0201563177)
    try:
        pid = os.fork()
        if pid > 0:  # parent process
            return
    except OSError as e:
        print("fork #1 failed. See next.")
        print(e)
        sys.exit(1)

    # Decouple from the parent environment.
    os.chdir("/")
    os.setsid()
    os.umask(0)

    # do second fork
    try:
        pid = os.fork()
        if pid > 0:
            # exit from second parent
            sys.exit(0)
    except OSError as e:
        print("fork #2 failed. See next.")
        print(e)
        sys.exit(1)

    # Redirect standard file descriptors.
    # Here, they are reassigned to /dev/null, but they could go elsewhere.
    sys.stdout.flush()
    sys.stderr.flush()
    si = open('/dev/null', 'r')
    so = open('/dev/null', 'a+')
    se = open('/dev/null', 'a+')
    os.dup2(si.fileno(), sys.stdin.fileno())
    os.dup2(so.fileno(), sys.stdout.fileno())
    os.dup2(se.fileno(), sys.stderr.fileno())

    # Run your daemon
    func()
    # Ensure that the daemon exits when complete
    os._exit(os.EX_OK)

import multiprocessing

daemonicGrandchild = multiprocessing.Process(target=spawnDaemon, args=(longProcess,))
daemonicGrandchild.daemon = False
daemonicGrandchild.start()
print("have started the daemon")  # This will get captured as stdout by grandparent.py
References
The code above was mainly inspired by the following two resources.
This reference is succinct about the use of the double-fork but does not include the file handling we need in this situation.
This reference contains the needed file handling, but does many other things that we do not need.
Edit: the below stopped working after a Python upgrade, see the accepted answer from Lachele.
Working answer from a colleague, change to shell=True like this:
p = subprocess.Popen(pythonscript.py, stdin=PIPE, stdout=PIPE, stderr=PIPE, shell=True)
I've tested this, and the grandchild subprocesses stay alive after the child process returns, without waiting for them to finish.

How to kill subprocess after time.sleep()? [duplicate]

I am running some shell scripts with the subprocess module in Python. If a shell script runs too long, I would like to kill the subprocess. I thought it would be enough to pass timeout=30 to my run(..) statement.
Here is the code:
try:
    result = run(['utilities/shell_scripts/{0} {1} {2}'.format(
            self.language_conf[key][1], self.proc_dir, config.main_file)],
        shell=True,
        check=True,
        stdout=PIPE,
        stderr=PIPE,
        universal_newlines=True,
        timeout=30,
        bufsize=100)
except TimeoutExpired as timeout:
I have tested this call with some shell scripts that run for 120s. I expected the subprocess to be killed after 30s, but in fact the process finishes the 120s script and only then raises the TimeoutExpired exception. Now the question: how can I kill the subprocess on timeout?
The documentation explicitly states that the process should be killed:
from the docs for subprocess.run:
"The timeout argument is passed to Popen.communicate(). If the timeout expires, the child process will be killed and waited for. The TimeoutExpired exception will be re-raised after the child process has terminated."
But in your case you're using shell=True, and I've seen issues like that before, because the blocking process is a child of the shell process.
I don't think you need shell=True if you decompose your arguments properly and your scripts have the proper shebang. You could try this:
result = run(
    [os.path.join('utilities/shell_scripts', self.language_conf[key][1]), self.proc_dir, config.main_file],  # don't compose the argument line yourself
    shell=False,  # no shell wrapper
    check=True,
    stdout=PIPE,
    stderr=PIPE,
    universal_newlines=True,
    timeout=30,
    bufsize=100)
Note that I can reproduce this issue very easily on Windows (using Popen, but it's the same thing):
import subprocess,time
p=subprocess.Popen("notepad",shell=True)
time.sleep(1)
p.kill()
=> notepad stays open, probably because it manages to detach from the parent shell process.
import subprocess,time
p=subprocess.Popen("notepad",shell=False)
time.sleep(1)
p.kill()
=> notepad closes after 1 second
Funnily enough, if you remove time.sleep(), kill() works even with shell=True, probably because it successfully kills the shell that is launching notepad.
I'm not saying you have exactly the same issue, I'm just demonstrating that shell=True is evil for many reasons, and not being able to kill/timeout the process is one more reason.
However, if you need shell=True for a reason, you can use psutil to kill all the children in the end. In that case, it's better to use Popen so you get the process id directly:
import subprocess, time, psutil

parent = subprocess.Popen("notepad", shell=True)
for _ in range(30):  # 30 seconds
    if parent.poll() is not None:  # process just ended
        break
    time.sleep(1)
else:
    # the for loop ended without break: timeout
    parent = psutil.Process(parent.pid)
    for child in parent.children(recursive=True):  # or parent.children() for recursive=False
        child.kill()
    parent.kill()
(source: how to kill process and child processes from python?)
that example kills the notepad instance as well.

Kill a chain of sub processes on KeyboardInterrupt

I've run into a strange problem while writing a script to start my local JBoss instance.
My code looks something like this:
with open("/var/run/jboss/jboss.pid", "wb") as f:
process = subprocess.Popen(["/opt/jboss/bin/standalone.sh", "-b=0.0.0.0"])
f.write(str(process.pid))
try:
process.wait()
except KeyboardInterrupt:
process.kill()
It should be fairly simple to understand: write the PID to a file while it's running, and once I get a KeyboardInterrupt, kill the child process.
The problem is that JBoss keeps running in the background after I send the kill signal; it seems the signal doesn't propagate down to the Java process started by standalone.sh.
I like the idea of using Python to write system management scripts, but there are a lot of weird edge cases like this where if I would have written it in Bash, everything would have just worked™.
How can I kill the entire subprocess tree when I get a KeyboardInterrupt?
You can do this using the psutil library:
import psutil

#..
proc = psutil.Process(process.pid)
for child in proc.children(recursive=True):
    child.kill()
proc.kill()
As far as I know the subprocess module does not offer any API function to retrieve the children spawned by subprocesses, nor does the os module.
A better way of killing the processes would probably be the following:
proc = psutil.Process(process.pid)
procs = proc.children(recursive=True)
procs.append(proc)
for proc in procs:
    proc.terminate()
gone, alive = psutil.wait_procs(procs, timeout=1)
for p in alive:
    p.kill()
This gives the processes a chance to terminate correctly, and when the timeout ends, the remaining processes are killed.
Note that psutil also provides a Popen class that has the same interface as subprocess.Popen, plus all the extra functionality of psutil.Process. You may want to simply use that instead of subprocess.Popen. It is also safer, because psutil checks that PIDs don't get reused if a process terminates, while subprocess doesn't.
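For example, here is a minimal sketch of the JBoss snippet from the question rewritten around psutil.Popen (assuming psutil is installed); the constructor takes the same arguments as subprocess.Popen, and the object also exposes psutil.Process methods such as children():
import psutil

with open("/var/run/jboss/jboss.pid", "w") as f:
    process = psutil.Popen(["/opt/jboss/bin/standalone.sh", "-b=0.0.0.0"])
    f.write(str(process.pid))
try:
    process.wait()
except KeyboardInterrupt:
    # Terminate the whole tree, then kill anything that survives the grace period.
    procs = process.children(recursive=True)
    procs.append(process)
    for p in procs:
        p.terminate()
    gone, alive = psutil.wait_procs(procs, timeout=1)
    for p in alive:
        p.kill()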
