I'm launching a subprocess with Popen, and I expect it to finish and exit. However, not only does the process not exit, but sending SIGKILL to it still leaves it alive! Below is a script that demonstrates the problem:
from subprocess import Popen
import os
import time
import signal
command = ["python","--version"]
process = Popen(command)
pid = process.pid
time.sleep(5) #ample time to finish
print pid
print "Sending sigkill"
os.kill(pid,signal.SIGKILL)
try:
    # Kill with signal 0 just checks whether the process exists
    os.kill(pid, 0)
    print "Process still alive immediately after (not so bad...)!"
except Exception as e:
    print "Succeeded in terminating child quickly!"

time.sleep(20)  # Give it ample time to die

# Kill with signal 0 just checks whether the process exists
try:
    os.kill(pid, 0)
    print "Process still alive! That's bad!"
except Exception as e:
    print "Succeeded in terminating child!"
For me, this prints:
77881
Python 2.7.10
Sending sigkill
Process still alive immediately after (not so bad...)!
Process still alive! That's bad!
Not only can this script verify that the child is still alive after it should have finished, but I can use ps on the process id that's printed and see that it still exists. Oddly, ps lists the process name as (Python) (note the parentheses).
You need to either call process.wait(), or use signal.signal(signal.SIGCHLD, signal.SIG_IGN) once to indicate that you don't intend to wait for any children. The former is portable; the latter only works on Unix (but is POSIX-standard). If you do neither of these things, on Unix the process will hang around as a zombie, and Windows has a similar behavior if you keep the process handle open.
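For example, here is a minimal sketch of both options, reusing the python --version child from the question (option 2 assumes a Unix system):

from subprocess import Popen
import os
import signal
import time

# Option 1 (portable): reap the child explicitly once it has exited.
process = Popen(["python", "--version"])
process.wait()  # collects the exit status, so no zombie is left behind

# Option 2 (Unix only): tell the kernel you will never wait for children,
# so they are reaped automatically.
signal.signal(signal.SIGCHLD, signal.SIG_IGN)
process = Popen(["python", "--version"])
time.sleep(5)
try:
    os.kill(process.pid, 0)   # signal 0 only checks for existence
    print("child still exists")
except OSError:
    print("child is gone")    # expected: no zombie remains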
Related
I am using the following code to launch a subprocess:
# Run the program
subprocess_result = subprocess.run(
    cmd,
    stdout=subprocess.PIPE,
    stderr=subprocess.PIPE,
    check=False,
    timeout=timeout,
    cwd=directory,
    env=env,
    preexec_fn=set_memory_limits,
)
The launched subprocess is also a Python program, with a shebang.
This subprocess may last for longer than the specified timeout.
The subprocess does heavy computations, writes its results to a file, and does not contain any signal handler.
According to the documentation https://docs.python.org/3/library/subprocess.html#subprocess.run, subprocess.run kills a child that times out:
The timeout argument is passed to Popen.communicate(). If the timeout
expires, the child process will be killed and waited for. The
TimeoutExpired exception will be re-raised after the child process has
terminated.
When my subprocess times out, I always receive the subprocess.TimeoutExpired exception, but from time to time the subprocess is not killed and hence keeps consuming resources on my machine.
So my question is: am I doing something wrong here? If so, what? If not, why do I have this issue and how can I solve it?
Note: I am using Python 3.10 on Ubuntu 22.04.
The most likely culprit for the behaviour you see is that the subprocess you are spawning is probably using multiprocessing and spawning its own child processes. Killing the parent process does not automatically kill the whole set of descendants. The grandchildren are inherited by the init process (i.e. the process with PID 1) and will continue to run.
You can verify this from the source code of subprocess.run:
with Popen(*popenargs, **kwargs) as process:
    try:
        stdout, stderr = process.communicate(input, timeout=timeout)
    except TimeoutExpired as exc:
        process.kill()
        if _mswindows:
            # Windows accumulates the output in a single blocking
            # read() call run on child threads, with the timeout
            # being done in a join() on those threads.  communicate()
            # _after_ kill() is required to collect that and add it
            # to the exception.
            exc.stdout, exc.stderr = process.communicate()
        else:
            # POSIX _communicate already populated the output so
            # far into the TimeoutExpired exception.
            process.wait()
        raise
    except:  # Including KeyboardInterrupt, communicate handled that.
        process.kill()
        # We don't call process.wait() as .__exit__ does that for us.
        raise
Here you can see at line 550 that the timeout is set on the communicate call; if it fires, at line 552 the subprocess is .kill()ed. The kill method sends a SIGKILL, which immediately kills the subprocess without any cleanup. It's a signal that cannot be caught by the subprocess, so it's not possible that the child is somehow ignoring it.
The TimeoutExpired exception is then re-raised at line 564, so if your parent process sees this exception the subprocess is already dead.
This however says nothing about grandchildren processes. Those will continue to run as children of PID 1.
I don't see any way in which you can customize how subprocess.run handles subprocess termination. For example, if it used SIGTERM instead of SIGKILL you could modify your child process or write a wrapper process that will catch the signal and properly kill all its descendants. But SIGKILL doesn't give you this luxury.
So I believe that for your use case you cannot use the subprocess.run facade but you should use Popen directly. You can look at the subprocess.run implementation and take just the things that you need, maybe dropping support for platforms you don't use.
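For example, here is a rough sketch of what that could look like on Unix, reusing the names from your snippet (cmd, timeout, directory, env, set_memory_limits) and assuming it is acceptable to place the child in its own process group so the whole tree can be killed on timeout:

import os
import signal
import subprocess

def run_with_group_kill(cmd, timeout, directory, env, set_memory_limits):
    # start_new_session=True puts the child (and its descendants) in a new
    # process group/session, so the whole group can be signalled later.
    process = subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        cwd=directory,
        env=env,
        preexec_fn=set_memory_limits,
        start_new_session=True,
    )
    try:
        stdout, stderr = process.communicate(timeout=timeout)
        return process.returncode, stdout, stderr
    except subprocess.TimeoutExpired:
        # Kill the entire process group, not just the direct child.
        os.killpg(os.getpgid(process.pid), signal.SIGKILL)
        process.wait()
        raise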
Note: There are extremely rare situations in which the subprocesses won't die immediately on SIGKILL. I believe the only situation in which this happens is if the subprocess is performing a very long system call or other kernel operation, which might not be interrupted immediately. If the operation is in deadlock this might prevent the process from terminating forever. However I don't think that this is your case, since you did not mention that the process is stuck doing nothing, but from what you said the process simply seems to continue running.
On Windows boxes, I have a number of scenarios where a parent process will start a child process. For various reasons, the parent process may want to abort the child process but (and this is important) allow it to clean up, i.e. run a finally clause:
try:
    res = bookResource()
    doStuff(res)
finally:
    cleanupResource(res)
(These things may be embedded in context managers like closing, and generally revolve around hardware locking/database state.)
The problem is that I'm unable to find a way to signal the child on Windows (as I would in a Linux environment) so that it runs the cleanup before terminating. I think this requires making the child process raise an exception somehow (as Ctrl-C would).
Things I've tried:
os.kill
os.signal
subprocess.Popen with creationflags, then ctypes.windll.kernel32.GenerateConsoleCtrlEvent(1, p.pid) to send the Ctrl-Break event. This requires a signal trap and an inelegant loop to stop it aborting immediately.
ctypes.windll.kernel32.GenerateConsoleCtrlEvent(0, p.pid) - the Ctrl-C event - did nothing.
Has anyone got a surefire way of doing this, so that the child process can clean up?
I was able to get the GenerateConsoleCtrlEvent working like this:
import time
import win32api
import win32con
from multiprocessing import Process

def foo():
    try:
        while True:
            print("Child process still working...")
            time.sleep(1)
    except KeyboardInterrupt:
        print("Child process: caught ctrl-c")

if __name__ == "__main__":
    p = Process(target=foo)
    p.start()
    time.sleep(2)
    print("sending ctrl c...")
    try:
        win32api.GenerateConsoleCtrlEvent(win32con.CTRL_C_EVENT, 0)
        while p.is_alive():
            print("Child process is still alive.")
            time.sleep(1)
    except KeyboardInterrupt:
        print("Main process: caught ctrl-c")
Output
Child process still working...
Child process still working...
sending ctrl c...
Child process is still alive.
Child process: caught ctrl-c
Main process: caught ctrl-c
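A related approach, if the child is started with subprocess rather than multiprocessing, is to put it in its own process group and send Ctrl+Break to just that group, so the parent's console is not interrupted. A rough, untested sketch (child_script.py is a hypothetical child; for its finally clauses to run, the child would need to install its own SIGBREAK handler that raises an exception or exits cleanly):

import signal
import subprocess
import time

# Parent side: CREATE_NEW_PROCESS_GROUP means CTRL_BREAK_EVENT is delivered
# only to the child's group, not to the parent's console group.
p = subprocess.Popen(
    ["python", "child_script.py"],
    creationflags=subprocess.CREATE_NEW_PROCESS_GROUP,
)
time.sleep(2)
p.send_signal(signal.CTRL_BREAK_EVENT)  # child receives SIGBREAK
p.wait()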
I'm launching a program with subprocess on Python.
In some cases the program may freeze. This is out of my control. The only thing I can do from the command line it is launched from is Ctrl+Esc, which kills the program quickly.
Is there any way to emulate this with subprocess? I am using subprocess.Popen(cmd, shell=True) to launch the program.
Well, there are a couple of methods on the object returned by subprocess.Popen() which may be of use: Popen.terminate() and Popen.kill(), which send a SIGTERM and SIGKILL respectively.
For example...
import subprocess
import time
process = subprocess.Popen(cmd, shell=True)
time.sleep(5)
process.terminate()
...would terminate the process after five seconds.
Or you can use os.kill() to send other signals, like SIGINT to simulate CTRL-C, with...
import subprocess
import time
import os
import signal
process = subprocess.Popen(cmd, shell=True)
time.sleep(5)
os.kill(process.pid, signal.SIGINT)
p = subprocess.Popen("echo 'foo' && sleep 60 && echo 'bar'", shell=True)
p.kill()
Check out the docs on the subprocess module for more info: http://docs.python.org/2/library/subprocess.html
You can use two signals to kill a running subprocess call, i.e. signal.SIGTERM and signal.SIGKILL; for example:
import subprocess
import os
import signal
import time
..
process = subprocess.Popen(..)
..
# killing all processes in the group
os.killpg(process.pid, signal.SIGTERM)
time.sleep(2)
if process.poll() is None:  # Force kill if process is still alive
    time.sleep(3)
    os.killpg(process.pid, signal.SIGKILL)
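Note that os.killpg targets a process group, so this only works as intended if the child was made a group leader when it was started. Here is a sketch of that setup (assuming a Unix system and that cmd stands for your command):

import os
import signal
import subprocess
import time

# preexec_fn=os.setsid makes the child the leader of a new process group,
# so killpg reaches the child and everything it spawns.
process = subprocess.Popen(cmd, preexec_fn=os.setsid)

os.killpg(os.getpgid(process.pid), signal.SIGTERM)
time.sleep(2)
if process.poll() is None:  # force kill if it is still alive
    os.killpg(os.getpgid(process.pid), signal.SIGKILL)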
Your question is not entirely clear, but I assume you are launching a process which turns into a zombie and you want to be able to handle that at some point in your script. If that is the case, I propose the following:
p = subprocess.Popen([cmd_list], shell=False)
It is not really recommended to pass the command through the shell. I would suggest using shell=False; this way you run fewer risks (such as shell injection).
# Get the process id & try to terminate it gracefuly
pid = p.pid
p.terminate()
# Check if the process has really terminated & force kill if not.
try:
    os.kill(pid, 0)
    p.kill()
    print "Forced kill"
except OSError, e:
    print "Terminated gracefully"
The following command worked for me:
os.system("pkill -TERM -P %s" % process.pid)
Try wrapping your subprocess.Popen call in a try/except block. Depending on why your process is hanging, you may be able to exit cleanly. Here is a list of exceptions you can check for: Python 3 - Exceptions Handling
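For example, a minimal sketch of that idea (assuming cmd is the command you are launching):

import subprocess

process = subprocess.Popen(cmd, shell=True)
try:
    process.wait()
except KeyboardInterrupt:
    # Ctrl-C in the parent: make sure the child does not outlive the script
    process.kill()
    raise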
I'm using python-daemon, and having the problem that when I kill -9 a process, it leaves a pidfile behind (ok) and the next time I run my program it doesn't work unless I have already removed the pidfile by hand (not ok).
I catch all exceptions in order that context.close() is called before terminating -- when this happens (e.g. on a kill) the /var/run/mydaemon.pid* files are removed and a subsequent daemon run succeeds. However, when using SIGKILL (kill -9), I don't have the chance to call context.close(), and the /var/run files remain. In this instance, the next time I run my program it does not start successfully -- the original process returns, but the daemonized process blocks at context.open().
It seems like python-daemon ought to be noticing that there is a pidfile for a process that no longer exists, and clearing it out, but that isn't happening. Am I supposed to be doing this by hand?
Note: I'm not using with because this code runs on Python 2.4
from daemon import DaemonContext
from daemon.pidlockfile import PIDLockFile
context = DaemonContext(pidfile = PIDLockFile("/var/run/mydaemon.pid"))
context.open()
try:
    retry_main_loop()
except Exception, e:
    pass
context.close()
If you are running linux, and process level locks are acceptable, read on.
We try to acquire the lock. If that fails, check whether the lock is held by a running process. If not, break the lock and continue.
import os
from lockfile.pidlockfile import PIDLockFile
from lockfile import AlreadyLocked

pidfile = PIDLockFile("/var/run/mydaemon.pid", timeout=-1)
try:
    pidfile.acquire()
except AlreadyLocked:
    try:
        os.kill(pidfile.read_pid(), 0)
        print 'Process already running!'
        exit(1)
    except OSError:  # No process with locked PID
        pidfile.break_lock()
# pidfile can now be used to create DaemonContext
Edit: Looks like PIDLockFile is available only on lockfile >= 0.9
With the script provided here, the pid file remains on kill -9 as you say, but the script also cleans up properly on a restart.
I have a script that repeatedly runs an Ant buildfile and scrapes output into a parsable format. When I create the subprocess using Popen, there is a small time window where hitting Ctrl+C will kill the script, but will not kill the subprocess running Ant, leaving a zombie that is printing output to the console that can only be killed using Task Manager. Once Ant has started printing output, hitting Ctrl+C will always kill my script as well as Ant. Is there a way to make it so that hitting Ctrl+C will always kill the subprocess running Ant without leaving a zombie behind?
Also of note: I have a handler for SIGINT that performs a few cleanup operations before calling exit(0). If I manually kill the subprocess in the handler using os.kill(p.pid, signal.SIGTERM) (not SIGINT), then I can successfully kill the subprocess in situations where it would normally zombify. However, when you hit Ctrl+C once Ant has started producing output, you get a stacktrace from subprocess where it is unable to kill the subprocess itself as I have already killed it.
EDIT: My code looked something like:
import os
import signal
from subprocess import Popen

p = Popen('ls')

def handle_sig_int(signum, stack_frame):
    # perform cleanup
    os.kill(p.pid, signal.SIGTERM)
    exit(0)

signal.signal(signal.SIGINT, handle_sig_int)
p.wait()
Which would produce the following stacktrace when triggered incorrectly:
File "****.py", line ***, in run_test
p.wait()
File "/usr/lib/python2.5/subprocess.py", line 1122, in wait
pid, sts = os.waitpid(self.pid, 0)
File "****.py", line ***, in handle_sig_int
os.kill(p.pid, signal.SIGTERM)
I fixed it by catching the OSError raised by p.wait and exiting:
try:
    p.wait()
except OSError:
    exit('The operation was interrupted by the user')
This seems to work in the vast majority of my test runs. I occasionally get a uname: write error: Broken pipe, though I don't know what causes it. It seems to happen if I time the Ctrl+C just right before the child process can start displaying output.
Call p.terminate() in your signal handler:
if p.poll() is None:  # Child still around?
    p.terminate()     # kill it
[EDIT] Since you're stuck with Python 2.5, use os.kill(p.pid, signal.SIGTERM) instead of p.terminate(). The check should make sure you don't get an exception (or reduce the number of times you get one).
To make it even better, you can catch the exception and check the message. If it means "child process not found", then ignore the exception. Otherwise, rethrow it with raise (no arguments).
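For example, here is a sketch of that check using errno rather than the message text (ESRCH is the "no such process" error; Python 2.5 syntax to match your environment):

import errno
import os
import signal

def kill_child(p):
    try:
        os.kill(p.pid, signal.SIGTERM)
    except OSError, e:
        if e.errno != errno.ESRCH:  # anything other than "no such process"
            raise                   # rethrow it; otherwise ignore it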