I am using the following code to launch a subprocess:
# Run the program
subprocess_result = subprocess.run(
    cmd,
    stdout=subprocess.PIPE,
    stderr=subprocess.PIPE,
    check=False,
    timeout=timeout,
    cwd=directory,
    env=env,
    preexec_fn=set_memory_limits,
)
The launched subprocess is also a Python program, with a shebang.
This subprocess may last for longer than the specified timeout.
The subprocess does heavy computations, writes its results to a file, and does not contain any signal handler.
According to the documentation https://docs.python.org/3/library/subprocess.html#subprocess.run, subprocess.run kills a child that times out:
The timeout argument is passed to Popen.communicate(). If the timeout
expires, the child process will be killed and waited for. The
TimeoutExpired exception will be re-raised after the child process has
terminated.
When my subprocess times out, I always receive the subprocess.TimeoutExpired exception, but from time to time the subprocess is not killed, hence still consuming resources on my machine.
So my question is: am I doing something wrong here? If yes, what? And if not, why do I have this issue and how can I solve it?
Note: I am using Python 3.10 on Ubuntu 22.04.
The most likely culprit for the behaviour you see is that the subprocess you are spawning is probably using multiprocessing and spawning its own child processes. Killing the parent process does not automatically kill the whole set of descendants. The grandchildren are inherited by the init process (i.e. the process with PID 1) and will continue to run.
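If you want to confirm this suspicion, here is a minimal, self-contained sketch (the shell command is just a stand-in for your real workload) that reproduces the orphaning:

import subprocess
import time

# A child that immediately spawns its own long-running grandchild via a shell.
child = subprocess.Popen(["bash", "-c", "sleep 600 & wait"])
time.sleep(1)

child.kill()   # SIGKILL the direct child, just as subprocess.run does on timeout
child.wait()

# Running "pgrep -f 'sleep 600'" in another terminal still shows the grandchild:
# it has been reparented to PID 1 and keeps running.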
You can verify what subprocess.run itself does from its source code:
with Popen(*popenargs, **kwargs) as process:
    try:
        stdout, stderr = process.communicate(input, timeout=timeout)
    except TimeoutExpired as exc:
        process.kill()
        if _mswindows:
            # Windows accumulates the output in a single blocking
            # read() call run on child threads, with the timeout
            # being done in a join() on those threads. communicate()
            # _after_ kill() is required to collect that and add it
            # to the exception.
            exc.stdout, exc.stderr = process.communicate()
        else:
            # POSIX _communicate already populated the output so
            # far into the TimeoutExpired exception.
            process.wait()
        raise
    except:  # Including KeyboardInterrupt, communicate handled that.
        process.kill()
        # We don't call process.wait() as .__exit__ does that for us.
        raise
Here you can see that the timeout is set on the communicate call (line 550 in CPython's subprocess.py); if it fires, the subprocess is .kill()ed (line 552). The kill method sends a SIGKILL, which immediately terminates the subprocess without any cleanup. It is a signal that cannot be caught, so it is not possible that the child is somehow ignoring it.
The TimeoutExpired exception is then re-raised at line 564, so if your parent process sees this exception, the subprocess is already dead.
This, however, says nothing about grandchild processes. Those will continue to run as children of PID 1.
I don't see any way to customize how subprocess.run handles subprocess termination. For example, if it used SIGTERM instead of SIGKILL, you could modify your child process or write a wrapper process that catches the signal and properly kills all of its descendants. But SIGKILL doesn't give you that luxury.
So I believe that for your use case you cannot use the subprocess.run facade; you should use Popen directly. You can look at the subprocess.run implementation and take just the parts you need, perhaps dropping support for platforms you don't use.
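A minimal sketch of what that could look like on POSIX, assuming it is acceptable to kill the entire process group (start_new_session=True makes the child the leader of a new group, so its descendants can be killed together; cmd, timeout, directory, env and set_memory_limits are the names from your snippet):

import os
import signal
import subprocess

with subprocess.Popen(
    cmd,
    stdout=subprocess.PIPE,
    stderr=subprocess.PIPE,
    cwd=directory,
    env=env,
    preexec_fn=set_memory_limits,
    start_new_session=True,  # child becomes leader of its own process group
) as process:
    try:
        stdout, stderr = process.communicate(timeout=timeout)
    except subprocess.TimeoutExpired:
        # Kill the whole group: the child *and* any grandchildren it spawned.
        os.killpg(os.getpgid(process.pid), signal.SIGKILL)
        process.wait()
        raise

This only works if none of the descendants call setsid()/setpgid() themselves and escape the group, but for a typical multiprocessing workload that is usually not the case.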
Note: there are extremely rare situations in which a subprocess won't die immediately on SIGKILL. I believe the only situation in which this happens is when the subprocess is performing a very long system call or other kernel operation, which might not be interrupted immediately; if that operation is deadlocked, it might prevent the process from terminating at all. However, I don't think this is your case: you did not mention that the process is stuck doing nothing, and from what you said it simply seems to continue running.
Related
I am running some shell scripts with the subprocess module in Python. If a shell script runs too long, I would like to kill the subprocess. I thought it would be enough to pass timeout=30 to my run(..) call.
Here is the code:
try:
    result = run(
        ['utilities/shell_scripts/{0} {1} {2}'.format(
            self.language_conf[key][1], self.proc_dir, config.main_file)],
        shell=True,
        check=True,
        stdout=PIPE,
        stderr=PIPE,
        universal_newlines=True,
        timeout=30,
        bufsize=100)
except TimeoutExpired as timeout:
I have tested this call with some shell scripts that run for 120s. I expected the subprocess to be killed after 30s, but in fact the process finishes the 120s script and then raises the TimeoutExpired exception. Now the question is: how can I kill the subprocess on timeout?
The documentation explicitly states that the process should be killed:
from the docs for subprocess.run:
"The timeout argument is passed to Popen.communicate(). If the timeout expires, the child process will be killed and waited for. The TimeoutExpired exception will be re-raised after the child process has terminated."
But in your case you're using shell=True, and I've seen issues like that before, because the blocking process is a child of the shell process.
I don't think you need shell=True if you decompose your arguments properly and your scripts have the proper shebang. You could try this:
result = run(
    [os.path.join('utilities/shell_scripts', self.language_conf[key][1]),
     self.proc_dir, config.main_file],  # don't compose the argument line yourself
    shell=False,  # no shell wrapper
    check=True,
    stdout=PIPE,
    stderr=PIPE,
    universal_newlines=True,
    timeout=30,
    bufsize=100)
Note that I can reproduce this issue very easily on Windows (using Popen, but it's the same thing):
import subprocess, time

p = subprocess.Popen("notepad", shell=True)
time.sleep(1)
p.kill()
=> notepad stays open, probably because it manages to detach from the parent shell process.
import subprocess, time

p = subprocess.Popen("notepad", shell=False)
time.sleep(1)
p.kill()
=> notepad closes after 1 second
Funnily enough, if you remove time.sleep(), kill() works even with shell=True, probably because it successfully kills the shell that is launching notepad.
I'm not saying you have exactly the same issue, I'm just demonstrating that shell=True is evil for many reasons, and not being able to kill/timeout the process is one more reason.
However, if you need shell=True for a reason, you can use psutil to kill all the children in the end. In that case, it's better to use Popen so you get the process id directly:
import subprocess, time, psutil

parent = subprocess.Popen("notepad", shell=True)
for _ in range(30):  # 30 seconds
    if parent.poll() is not None:  # process just ended
        break
    time.sleep(1)
else:
    # the for loop ended without break: timeout
    parent = psutil.Process(parent.pid)
    for child in parent.children(recursive=True):  # or parent.children() for recursive=False
        child.kill()
    parent.kill()
(source: how to kill process and child processes from python?)
That example kills the notepad instance as well.
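On Windows specifically (where the notepad demo runs), another option, if you would rather not depend on psutil, is to let taskkill walk the tree for you. A rough sketch, reusing the parent Popen object from above:

import subprocess

# /T terminates the whole tree rooted at the PID, /F forces termination.
subprocess.run(["taskkill", "/PID", str(parent.pid), "/T", "/F"])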
An issue I have with Python's (3.4) subprocess.Popen:
Very rarely (once in several thousand calls), Popen seems to create another forked process in addition to the intentional one, which then hangs (possibly waiting?), resulting in the intentional process becoming a zombie.
Here's the call sequence:
with subprocess.Popen(['prog', 'arg1', 'arg2'], stdin=subprocess.PIPE,
                      stdout=subprocess.PIPE, stderr=subprocess.PIPE) as p:
    std_out, std_err = p.communicate()
    p.wait()
Note: the above call sequence is itself run from a forked process (a form of process pooling; see the process list below).
The issue happens with multiple programs (7z, for example), so I assume the problem is with the caller and not the callee.
prog is zombified, so I assume the p.wait() statement is never reached or not executed properly.
The resulting process list (ps -ef output):
my_user 18219 18212 9 16:16 pts/1 00:18:11 python3 script.py # original process
my_user 1045 18219 0 16:18 pts/1 00:00:14 python3 script.py # Intentionally forked from original (poor man's process pool) - Seems to be stuck or waiting
my_user 2834 1045 0 16:18 pts/1 00:00:00 [prog] <defunct> # Program run by subprocess.popen - Zombie
my_user 2841 1045 0 16:18 pts/1 00:00:00 python3 script.py # !!!! Should not be here, also stuck or waiting, never finishes
Edited (added code sample as requested):
The code in question:
import logging
import os
import subprocess

pid = os.fork()
if pid == 0:
    # child
    file_name = 'test.zip'
    out_dir = '/tmp'
    while True:
        with subprocess.Popen(['7z', 'x', '-y', '-p', '-o' + out_dir, file_name],
                              stdin=subprocess.PIPE, stdout=subprocess.PIPE,
                              stderr=subprocess.PIPE) as p:
            try:
                std_out, std_err = p.communicate(timeout=600)
            except subprocess.TimeoutExpired:
                p.kill()
                std_out, std_err = p.communicate()
                logging.critical('7z failed, a timeout has occurred during waiting')
            except:
                p.kill()
                p.wait()
                raise
            return_code = p.poll()
            # do something
else:
    # parent
    wpid, status = os.waitpid(pid, 0)
    exit_code = status >> 8
I believe this is an effect of mixing forking and threading, which is a bad thing to do on Linux. Here are a couple of references:
Is it safe to fork from within a thread?
https://rachelbythebay.com/w/2011/06/07/forked/
I believe your process becomes multithreaded once you import the logging module. (In my case, I was sometimes seeing my program hang while waiting on a logging futex and sometimes hang while waiting inside subprocess, with the subprocess having become a zombie.) That module uses OS locks to ensure that it can be called in a thread-safe manner. Once you fork, that lock's state is inherited by the child process, so the child (which is single-threaded but inherited the memory of the parent) can't acquire the logging lock if the lock happened to be held when the fork occurred.
(I'm not super confident in my explanation. My problem went away when I switched from using multiprocessing's default fork behavior to using spawn behavior. In the latter, a child does not inherit its parent's memory, and subprocess and logging no longer caused hangs for me.)
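For reference, the switch described above is a one-liner if you are using multiprocessing. A sketch (not the original poster's code):

import multiprocessing

if __name__ == "__main__":
    # "spawn" starts a fresh interpreter instead of fork()ing, so children
    # do not inherit locks (such as logging's) in a held state.
    multiprocessing.set_start_method("spawn")
    # ... create Process/Pool objects as before ...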
subprocess indeed forks before running the command. This is mentioned in PEP 324 (Ctrl-F for "fork").
The reason is that the command is run using exec, which replaces the calling process with the executed one.
As you can see, it shares the same PID as the executed script, so it actually is the same process, but it is not the Python interpreter that is being run.
So, as long as the child process does not return, the caller python process can't.
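For illustration only (this is not the actual subprocess implementation), the fork-plus-exec pattern used on POSIX looks roughly like this:

import os

pid = os.fork()
if pid == 0:
    # In the child: exec replaces this Python process with the command,
    # keeping the same PID.
    os.execvp("ls", ["ls", "-l"])
else:
    # In the parent: wait for the child (whatever it exec'd into) to exit.
    os.waitpid(pid, 0)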
I am using Python 2.7.8 to coordinate and automate the running of several applications many times over in a Windows environment. During each run, I use subprocess.Popen to launch several child processes, passing subprocess.PIPE for stdin and stdout to each as follows:
proc = subprocess.Popen(cmd, stdin=subprocess.PIPE, stdout=subprocess.PIPE)
where cmd is the list of arguments.
The script waits for an external trigger to know when a given run is done, and then terminates each application it is currently running by writing a string to the stdin of each Popen object. The applications read this string and perform their own graceful shutdown (which is why I don't simply call kill() or terminate()).
# Try to shutdown process
timeout = 5
try:
    if proc.poll() is None:
        proc.stdin.write(cmd)
        # Wait to see if proc shuts down gracefully
        while timeout > 0:
            if proc.poll() is not None:
                break
            else:
                time.sleep(1)
                timeout -= 1
        else:
            # Kill it the old fashioned way
            proc.kill()
except Error:
    pass  # Process as necessary...
Once the applications are complete, I'm left with a Popen object. If I inspect the stdin or stdout members of that object, I get something like the following:
<open file '<fdopen>', mode 'wb' at 0x0277C758>
The script then loops to perform the next run, relaunching the necessary applications.
My question is, do I need to explicitly call close() for the stdin and stdout file descriptors each time, in order to avoid leaks, i.e. in the finally statement above? I am wondering this because it is possible for the loop to occur hundreds, or even thousands of times during a given script.
I've looked through the subprocess.py code, but the file handles for the pipes are created by an apparent Windows(-only) call in the _subprocess module, so I can't get any further detail.
The pipes might eventually be closed during garbage collection, but you should not rely on that; close them explicitly.
from threading import Timer

def kill_process(process):
    if process.poll() is None:  # don't send the signal unless it seems necessary
        try:
            process.kill()
        except OSError:  # ignore
            pass

# shutdown process in `timeout` seconds
t = Timer(timeout, kill_process, [proc])
t.start()
proc.communicate(cmd)
t.cancel()
The .communicate() method closes the pipes and waits for the child process to exit.
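If you prefer to keep your manual shutdown loop instead, explicitly closing the pipes could look like this (a sketch only; proc is your Popen object, and on Python 3 you could instead use Popen as a context manager, which closes the pipes for you):

try:
    # ... write the shutdown string and poll, as in your loop above ...
    pass
finally:
    proc.stdin.close()
    proc.stdout.close()
    proc.wait()  # reap the child so it does not linger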
I'm having a strange problem that I encountered while writing a script to start my local JBoss instance.
My code looks something like this:
with open("/var/run/jboss/jboss.pid", "wb") as f:
process = subprocess.Popen(["/opt/jboss/bin/standalone.sh", "-b=0.0.0.0"])
f.write(str(process.pid))
try:
process.wait()
except KeyboardInterrupt:
process.kill()
This should be fairly simple to understand: write the PID to a file while it's running, and once I get a KeyboardInterrupt, kill the child process.
The problem is that JBoss keeps running in the background after I send the kill signal, as it seems that the signal doesn't propagate down to the Java process started by standalone.sh.
I like the idea of using Python to write system management scripts, but there are a lot of weird edge cases like this where if I would have written it in Bash, everything would have just worked™.
How can I kill the entire subprocess tree when I get a KeyboardInterrupt?
You can do this using the psutil library:
import psutil

# ...
proc = psutil.Process(process.pid)
for child in proc.children(recursive=True):
    child.kill()
proc.kill()
As far as I know the subprocess module does not offer any API function to retrieve the children spawned by subprocesses, nor does the os module.
A better way of killing the processes would probably be the following:
proc = psutil.Process(process.pid)
procs = proc.children(recursive=True)
procs.append(proc)

for proc in procs:
    proc.terminate()

gone, alive = psutil.wait_procs(procs, timeout=1)
for p in alive:
    p.kill()
This gives the processes a chance to terminate correctly; when the timeout expires, any remaining processes are killed.
Note that psutil also provides a Popen class that has the same interface as subprocess.Popen plus all the extra functionality of psutil.Process. You may want to simply use that instead of subprocess.Popen. It is also safer, because psutil checks that PIDs don't get reused if a process terminates, while subprocess doesn't.
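A short sketch of that drop-in replacement (the standalone.sh command is taken from the question):

import psutil

process = psutil.Popen(["/opt/jboss/bin/standalone.sh", "-b=0.0.0.0"])
try:
    process.wait()
except KeyboardInterrupt:
    # psutil.Popen exposes children()/terminate()/kill() directly.
    for child in process.children(recursive=True):
        child.terminate()
    process.terminate()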
I have a script that repeatedly runs an Ant buildfile and scrapes the output into a parsable format. When I create the subprocess using Popen, there is a small time window where hitting Ctrl+C kills the script but not the subprocess running Ant, leaving a zombie that keeps printing output to the console and can only be killed using Task Manager. Once Ant has started printing output, hitting Ctrl+C always kills my script as well as Ant. Is there a way to make hitting Ctrl+C always kill the Ant subprocess without leaving a zombie behind?
Also of note: I have a handler for SIGINT that performs a few cleanup operations before calling exit(0). If I manually kill the subprocess in the handler using os.kill(p.pid, signal.SIGTERM) (not SIGINT), then I can successfully kill the subprocess in situations where it would normally zombify. However, when you hit Ctrl+C once Ant has started producing output, you get a stacktrace from subprocess where it is unable to kill the subprocess itself as I have already killed it.
EDIT: My code looked something like:
import os
import signal
from subprocess import Popen

p = Popen('ls')

def handle_sig_int(signum, stack_frame):
    # perform cleanup
    os.kill(p.pid, signal.SIGTERM)
    exit(0)

signal.signal(signal.SIGINT, handle_sig_int)
p.wait()
Which would produce the following stacktrace when triggered incorrectly:
File "****.py", line ***, in run_test
p.wait()
File "/usr/lib/python2.5/subprocess.py", line 1122, in wait
pid, sts = os.waitpid(self.pid, 0)
File "****.py", line ***, in handle_sig_int
os.kill(p.pid, signal.SIGTERM)
I fixed it by catching the OSError raised by p.wait and exiting:
try:
    p.wait()
except OSError:
    exit('The operation was interrupted by the user')
This seems to work in the vast majority of my test runs. I occasionally get a uname: write error: Broken pipe, though I don't know what causes it. It seems to happen if I time the Ctrl+C just right before the child process can start displaying output.
Call p.terminate() in your SIGINT handler:
if p.poll() is None:  # Child still around?
    p.terminate()     # kill it
[EDIT] Since you're stuck with Python 2.5, use os.kill(p.pid, signal.SIGTERM) instead of p.terminate(). The check should make sure you don't get an exception (or reduce the number of times you get one).
To make it even better, you can catch the exception and check the message. If it means "child process not found", then ignore the exception. Otherwise, rethrow it with raise (no arguments).
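A sketch of that check using errno (on Python 2.5 the except clause would be spelled "except OSError, e:"):

import errno
import os
import signal

try:
    os.kill(p.pid, signal.SIGTERM)
except OSError as e:
    if e.errno != errno.ESRCH:  # ESRCH means "no such process": child already gone
        raise                   # anything else is unexpected; re-raise it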