The Problem
I have a program I'm running with Popen that spawns a bunch of subprocesses. Those subprocesses exit every few minutes and are replaced by new ones. The main process (the program I called Popen on) dies pretty quickly. I'm trying to figure out how to get the CPU usage of everything in the process group.
What I've Tried
I've tried wait4 and getrusage. I've also tried psutil, but I realized that if I want to check the resource usage I'd have to spawn a bunch of threads to concurrently check the resource usage of all the subprocesses, and it'd be messy and error-prone.
Here's a sample of code that doesn't work. It only gets the resource usage for the immediate child and none of the grandchildren.
import time
from os import setsid, wait4
from resource import getrusage, RUSAGE_CHILDREN
from subprocess import Popen, PIPE

old = time.time()
proc = Popen(["g09", "../sto.com"], text=True, stderr=PIPE, preexec_fn=setsid)
_, _, ru = wait4(proc.pid, 0)  # wait4 returns (pid, status, resource_usage)
new = time.time()
print(100 * (ru.ru_utime + ru.ru_stime) / (new - old))
ru = getrusage(RUSAGE_CHILDREN)
print(100 * (ru.ru_utime + ru.ru_stime) / (new - old))
I need to set a new sid (or pgid) because sometimes I want to kill the process before it's finished and if I don't set a new one then the whole python script goes down with it.
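For reference, killing the whole group then looks roughly like this (a sketch; proc is the Popen object from the snippet above, and SIGTERM is just one choice of signal):

import os
import signal

# The child called setsid, so it leads its own process group; signalling that
# group reaches it and all of its descendants without touching this script.
os.killpg(os.getpgid(proc.pid), signal.SIGTERM)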
PS
I could just use the time command (like Popen(["time", "g09",...])) but I was wondering if there was a way to do this only in python.
Context
I have a backup application implemented in Python which allows starting shell scripts before, after, etc. a backup is processed. This allows, for example, mounting file-system snapshots that are then backed up from. I have one use case in which I need to start sub-shells or additional processes in one of those BEFORE-hooks, and these need to stay alive the whole time the backup is processed. The hook script itself, though, finishes at some point, and the Python app really does need to wait for the hook script itself to finish, so that my setup is complete and the backup is possible at all.
Problem
The Python app never returns after starting the hook and doesn't actually continue processing the backup. I'm fairly sure that the hook script itself really finishes, because its PID vanishes at some point and trying to kill that PID results in error messages about a missing process.
Additionally, I can see the running sub-shells, and when I kill ALL of those, the Python app does continue processing the backup. Though, because the background processes are then missing, the results are not what I need.
Research
I've found one user with pretty much the same problem of Python wrongly waiting for sub-processes of the started process to finish. That user claims that adding the argument shell=True solved the problem. However, my app seems to provide that argument already and still seems to wait for child processes.
Code
The following is how the hook script gets executed:
execute.execute_command(
    [command],
    output_log_level=logging.ERROR
    if description == 'on-error'
    else logging.WARNING,
    shell=True,
)
The following is how the process gets started:
process = subprocess.Popen(
    command,
    stdin=input_file,
    stdout=None if do_not_capture else (output_file or subprocess.PIPE),
    stderr=None if do_not_capture else (subprocess.PIPE if output_file else subprocess.STDOUT),
    shell=shell,
    env=environment,
    cwd=working_directory,
)
if not run_to_completion:
    return process

log_outputs(
    (process,), (input_file, output_file), output_log_level, borg_local_path=borg_local_path
)
The following is an excerpt of reading output of processes:
buffer_last_lines = collections.defaultdict(list)
process_for_output_buffer = {
    output_buffer_for_process(process, exclude_stdouts): process
    for process in processes
    if process.stdout or process.stderr
}
output_buffers = list(process_for_output_buffer.keys())

# Log output for each process until they all exit.
while True:
    if output_buffers:
        (ready_buffers, _, _) = select.select(output_buffers, [], [])

        for ready_buffer in ready_buffers:
            ready_process = process_for_output_buffer.get(ready_buffer)

            # The "ready" process has exited, but it might be a pipe destination with other
            # processes (pipe sources) waiting to be read from. So as a measure to prevent
            # hangs, vent all processes when one exits.
            if ready_process and ready_process.poll() is not None:
                for other_process in processes:
                    if (
                        other_process.poll() is None
                        and other_process.stdout
                        and other_process.stdout not in output_buffers
                    ):
                        # Add the process's output to output_buffers to ensure it'll get read.
                        output_buffers.append(other_process.stdout)

            line = ready_buffer.readline().rstrip().decode()
            if not line or not ready_process:
                continue

            [...]

    still_running = False

    for process in processes:
        exit_code = process.poll() if output_buffers else process.wait()
        [...]

    if not still_running:
        break

# Consume any remaining output that we missed (if any).
for process in processes:
    output_buffer = output_buffer_for_process(process, exclude_stdouts)
    if not output_buffer:
        continue

    [...]
Educated guess
Looking at the above code, there are two possibilities from my point of view: either invoking the shell script somehow returns multiple process objects, including those of the children, and that would already be the problem. Or, if only one process object for the hook script itself is returned, then reading process output somehow unintentionally also reads from the child processes, which by design in my case don't produce any output or ever finish.
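If the second guess is right, a minimal repro would look roughly like this (hypothetical commands, not my real hook): the backgrounded child inherits the pipe's write end, so reading the captured output until EOF blocks long after the hook script itself has exited.

import subprocess

# Hypothetical hook: the shell command returns immediately, but the backgrounded
# sleep keeps the inherited stdout/stderr pipe open, so reading to EOF blocks.
process = subprocess.Popen(
    'sleep 600 & echo "hook done"',
    shell=True,
    stdout=subprocess.PIPE,
    stderr=subprocess.STDOUT,
)
print(process.communicate()[0])  # blocks until the backgrounded sleep exits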
Question
So, under which circumstances does Python wait for sub-processes of a started process?
Is that the case by default, only when not providing shell=True, even with providing that argument, depending on other conditions...?
Thanks!
On a Windows 7 platform I'm using Python 3.6 to start worker processes (written in C).
For starting the processes subprocess.Popen is used. The following shows the relevant code (one thread per process to be started).
redirstream = open(redirfilename, "w")
proc = subprocess.Popen(batchargs, shell=False, stdout=redirstream)
outs, errs = proc.communicate(timeout=60)
# wait for job to be finished
ret = proc.wait()
...
if ret == 0:  # changed !!
    redirstream.flush()
    redirstream.close()
    os.remove(redirfilename)
communicate is just used so that the executable can be terminated after 60 seconds in case it hangs. redirstream is used to write output from the executable (written in C) to a file, for general debugging purposes (not related to this issue). Of course, all processes are passed redirfiles with different filenames.
Up to ten such subprocesses are started in that way from independent python threads.
Although it works, I made a mysterious observation:
When an executable has finished without errors, I want to delete redirfilename, because it is not needed anymore.
Now let's say I have started processes A, B and C.
Processes A and B have finished and returned 0 as their result.
Process C, however, intentionally doesn't get data (just for testing, a serial connection has been disconnected) and waits for input from a named pipe (created from Python) using the Windows ReadFile function:
https://msdn.microsoft.com/en-us/library/windows/desktop/aa365467(v=vs.85).aspx
In that case, while "C" is still waiting for ReadFile to be finished, os.remove(redirfilename) for A and B sometimes throws exception "PermissionError", saying, that the file is still used by another process. But from task manager I can see, that the processes A and B are not existing anymore (as expected).
I tried to catch the PermissionError and repeat the delete command after some delay. Only after "C" has terminated (timeout after 60 seconds), the redirfile for A or B can be deleted.
Why is the redirstream still blocked and somehow in use, although the process behind it is not alive anymore, and why is it blocked by ReadFile() in a completely unrelated process that has nothing to do with that particular file? Is that an issue in Python or in my implementation?
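What I'm considering as a workaround, on the assumption that the redirect handles leak into sibling processes because several threads call Popen concurrently (Windows inherits whatever handles happen to be inheritable at CreateProcess time), is to serialize process creation with a lock; a rough sketch:

import threading
import subprocess

popen_lock = threading.Lock()  # shared by all starter threads

def start_worker(batchargs, redirfilename):
    redirstream = open(redirfilename, "w")
    # Only one CreateProcess at a time, so no temporarily-inheritable handle from
    # this thread is visible while another thread is spawning its own child.
    with popen_lock:
        proc = subprocess.Popen(batchargs, shell=False, stdout=redirstream)
    return proc, redirstream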
Any hints are highly appreciated...
I am using the multiprocessing module in Python to launch a few processes in parallel. These processes are independent of each other. They generate their own output and write out the results in different files. Each process calls an external tool using the subprocess.call method.
It was working fine until I discovered an issue in the external tool where, due to some error condition, it goes into a 'prompt' mode and waits for user input. In my Python script I use the join method to wait until all the processes finish their tasks, and this causes the whole thing to wait on the erroneous subprocess call. I could put a timeout on each of the processes, but I don't know in advance how long each one is going to run, so that option is ruled out.
How do I figure out if any child process is waiting for user input, and how do I send an 'exit' command to it? Any pointers or suggestions to relevant modules in Python will be really appreciated.
My code here:
import subprocess
import sys
import os
import multiprocessing

def write_script(fname, e):
    f = open(fname, 'w')
    f.write("Some useful command calling external tool")
    f.close()
    subprocess.call(['chmod', '+x', os.path.abspath(fname)])
    return os.path.abspath(fname)

def run_use(mname, script):
    print "ssh " + mname + " " + script
    subprocess.call(['ssh', mname, script])

if __name__ == '__main__':
    dict1 = {}
    dict1['mod1'] = ['pp1', 'ext2', 'les3', 'pw4']
    dict1['mod2'] = ['aaa', 'bbb', 'ccc', 'ddd']
    machines = ['machine1', 'machine2', 'machine3', 'machine4']
    log_file.write(str(dict1.keys()))  # NB: log_file is not defined in this snippet
    for key in dict1.keys():
        arr = []
        for mod in dict1[key]:
            d = {}
            arr.append(mod)
            if (mod == dict1[key][-1]) | (len(arr) % 4 == 0):
                for i in range(0, len(arr)):
                    e = arr.pop()
                    script = write_script(e + "_temp.sh", e)
                    d[i] = multiprocessing.Process(target=run_use, args=(machines[i], script,))
                    d[i].daemon = True
                for pp in d:
                    d[pp].start()
                for pp in d:
                    d[pp].join()
Since you're writing a shell script to run your subcommands, can you simply tell them to read input from /dev/null?
#!/bin/bash
# ...
my_other_command -a -b arg1 arg2 < /dev/null
# ...
This may stop them blocking on input and is a really simple solution. If this doesn't work for you, read on for some other options.
The subprocess.call() function is simply shorthand for constructing a subprocess.Popen instance and then calling its wait() method. So your spawned processes could instead create their own subprocess.Popen instances and poll them with the poll() method (in a loop with a suitable delay) instead of calling wait(). This leaves them free to remain in communication with the main process so that, for example, the main process can tell the child process to terminate the Popen instance with the terminate() or kill() methods and then exit itself.
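A rough sketch of that polling pattern, assuming a multiprocessing.Event (called stop_event here; not part of the original code) is shared with each worker so the main process can request termination:

import time
import subprocess

def run_use(mname, script, stop_event):
    proc = subprocess.Popen(['ssh', mname, script])
    # Poll instead of blocking in wait(), so the worker can react to the main process.
    while proc.poll() is None:
        if stop_event.is_set():
            proc.terminate()  # or proc.kill() if it ignores the signal
            break
        time.sleep(1)

The main process would create the event with multiprocessing.Event(), pass it in args=(...), and call stop_event.set() when it decides a worker has hung.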
So the question is how the child process can tell whether the subprocess is awaiting user input, and that's the trickier part. I would say the easiest approach is to monitor the output of the subprocess and search for the user-input prompt, assuming it always uses some string you can look for. Alternatively, if the subprocess is expected to generate output continually, then you could simply look for any output, and if a configured amount of time passes without any, declare the process dead and terminate it as detailed above.
Since you're reading the output, actually you don't need poll() or wait() - the process closing its output file descriptor is good enough to know that it's terminated in this case.
Here's an example of a modified run_use() method which watches the output of the subprocess:
def run_use(mname, script):
    print "ssh " + mname + " " + script
    proc = subprocess.Popen(['ssh', mname, script], stdout=subprocess.PIPE)
    for line in proc.stdout:
        if "UserPrompt>>>" in line:
            proc.terminate()
            break
In this example we assume that the process either gets hung up on UserPrompt>>> (replace with the appropriate string) or terminates naturally. If it were to get stuck in an infinite loop, for example, then your script would still not terminate - you can only really address that with an overall timeout, which you didn't seem keen on. Hopefully your subprocess won't misbehave in that way, however.
Finally, if you don't know in advance what prompt your process will give, then your job is rather harder. Effectively what you're asking to do is monitor an external process and know when it's blocked reading on a file descriptor, and I don't believe there's a particularly clean solution to this. You could consider running the process under strace or similar, but that's quite an awful hack and I really wouldn't recommend it. Things like strace are great for manual diagnostics, but they really shouldn't be part of a production setup.
I am working with a cluster system over linux (www.mosix.org) that allows me to run jobs and have the system run them on different computers. Jobs are run like so:
mosrun ls &
This will naturally create the process and run it in the background, returning the process id, like so:
[1] 29199
Later it will return. I am writing a Python infrastructure that runs jobs and controls them. For that I want to run jobs using the mosrun program as above and save the process ID of the spawned process (29199 in this case). This naturally cannot be done using os.system or commands.getoutput, as the printed job ID is not something the process itself prints to its output... Any clues?
Edit:
Since the Python script is only meant to launch the jobs initially, the jobs need to run longer than the Python shell. I guess that means the mosrun process cannot be the script's child process. Any suggestions?
Thanks
Use the subprocess module. Popen instances have a pid attribute.
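A minimal sketch (assuming mosrun is on the PATH):

import subprocess

proc = subprocess.Popen(['mosrun', 'ls'])
print(proc.pid)  # PID of the spawned mosrun process, analogous to the 29199 above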
Looks like you want to ensure the child process is daemonized -- PEP 3143, which I'm pointing to, documents that and links to a reference implementation (and to others too).
Once your process (still running Python code) is daemonized, be it by the means offered in PEP 3143 or others, you can os.execl (or other os.exec... function) your target code -- this runs said target code in exactly the same process which we just said is daemonized, and so it keeps being daemonized, as desired.
The last step cannot use subprocess because it needs to run in the same (daemonized) process, overlaying its executable code -- exactly what os.execl and friends are for.
The first step, before daemonization, might conceivably be done via subprocess, but that's somewhat inconvenient (you need to put the daemonize-then-os.exec code in a separate .py): most commonly you'd just want to os.fork and immediately daemonize the child process.
subprocess is quite convenient as a mostly-cross-platform way to run other processes, but it can't really replace Unix's good old "fork and exec" approach for advanced uses (such as daemonization, in this case) -- which is why it's a good thing that the Python standard library also lets you do the latter via those functions in module os!-)
Thanks all for the help. Here's what I did in the end, and it seems to work OK. The code uses python-daemon. Maybe something smarter should be done about transferring the process id from the child to the parent, but that's the easier part.
import os
import time
import warnings

import daemon

def run_in_background(command, tmp_dir="/tmp"):
    # Decide on a temp file beforehand
    warnings.filterwarnings("ignore", "tempnam is a potential security")
    tmp_filename = os.tempnam(tmp_dir)
    # Duplicate the process
    pid = os.fork()
    # If we're the child, daemonize and run
    if pid == 0:
        with daemon.DaemonContext():
            child_id = os.getpid()
            file(tmp_filename, 'w').write(str(child_id))
            sp = command.split(' ')
            os.execl(*([sp[0]] + sp))
    else:
        # If we're the parent, poll for the new file
        n_iter = 0
        while True:
            if os.path.exists(tmp_filename):
                child_id = int(file(tmp_filename, 'r').read().strip())
                break
            if n_iter == 100:
                raise Exception("Cannot read process id from temp file %s" % tmp_filename)
            n_iter += 1
            time.sleep(0.1)
    return child_id
Note: This question has been re-asked with a summary of all debugging attempts here.
I have a Python script that is running as a background process executing every 60 seconds. Part of that is a call to subprocess.Popen to get the output of ps.
ps = subprocess.Popen(['ps', 'aux'], stdout=subprocess.PIPE).communicate()[0]
After running for a few days, the call is erroring with:
File "/home/admin/sd-agent/checks.py", line 436, in getProcesses
File "/usr/lib/python2.4/subprocess.py", line 533, in __init__
File "/usr/lib/python2.4/subprocess.py", line 835, in _get_handles
OSError: [Errno 12] Cannot allocate memory
However the output of free on the server is:
$ free -m
             total       used       free     shared    buffers     cached
Mem:           894        345        549          0          0          0
-/+ buffers/cache:        345        549
Swap:            0          0          0
I have searched around for the problem and found this article which says:
Solution is to add more swap space to your server. When the kernel is forking to start the modeler or discovery process, it first ensures there's enough space available on the swap to store the new process if needed.
I note that there is no available swap from the free output above. Is this likely to be the problem and/or what other solutions might there be?
Update 13th Aug 09
The code above is called every 60 seconds as part of a series of monitoring functions. The process is daemonized and the check is scheduled using sched. The specific code for the above function is:
def getProcesses(self):
    self.checksLogger.debug('getProcesses: start')

    # Memory logging (case 27152)
    if self.agentConfig['debugMode'] and sys.platform == 'linux2':
        mem = subprocess.Popen(['free', '-m'], stdout=subprocess.PIPE).communicate()[0]
        self.checksLogger.debug('getProcesses: memory before Popen - ' + str(mem))

    # Get output from ps
    try:
        self.checksLogger.debug('getProcesses: attempting Popen')
        ps = subprocess.Popen(['ps', 'aux'], stdout=subprocess.PIPE).communicate()[0]
    except Exception, e:
        import traceback
        self.checksLogger.error('getProcesses: exception = ' + traceback.format_exc())
        return False

    self.checksLogger.debug('getProcesses: Popen success, parsing')

    # Memory logging (case 27152)
    if self.agentConfig['debugMode'] and sys.platform == 'linux2':
        mem = subprocess.Popen(['free', '-m'], stdout=subprocess.PIPE).communicate()[0]
        self.checksLogger.debug('getProcesses: memory after Popen - ' + str(mem))

    # Split out each process
    processLines = ps.split('\n')
    del processLines[0]  # Removes the headers
    processLines.pop()   # Removes a trailing empty line

    processes = []

    self.checksLogger.debug('getProcesses: Popen success, parsing, looping')

    for line in processLines:
        line = line.split(None, 10)
        processes.append(line)

    self.checksLogger.debug('getProcesses: completed, returning')

    return processes
This is part of a bigger class called checks which is initialised once when the daemon is started.
The entire checks class can be found at http://github.com/dmytton/sd-agent/blob/82f5ff9203e54d2adeee8cfed704d09e3f00e8eb/checks.py with the getProcesses function defined from line 442. This is called by doChecks() starting at line 520.
You've perhaps got a memory leak bounded by some resource limit (RLIMIT_DATA, RLIMIT_AS?) inherited by your Python script. Check your ulimit(1) settings before you run your script, and profile the script's memory usage, as others have suggested.
What do you do with the variable ps after the code snippet you show us? Do you keep a reference to it, never to be freed? Quoting the subprocess module docs:
Note: The data read is buffered in memory, so do not use this
method if the data size is large or unlimited.
... and ps aux can be verbose on a busy system...
Update
You can check rlimits from within your Python script using the resource module:
import resource
print resource.getrlimit(resource.RLIMIT_DATA) # => (soft_lim, hard_lim)
print resource.getrlimit(resource.RLIMIT_AS)
If these return "unlimited" -- (-1, -1) -- then my hypothesis is incorrect and you may move on!
See also resource.getrusage, esp. the ru_??rss fields, which can help you instrument memory consumption from within the Python script, without shelling out to an external program.
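For example (a sketch; on Linux, ru_maxrss is reported in kilobytes):

import resource

usage = resource.getrusage(resource.RUSAGE_SELF)
print usage.ru_maxrss  # peak resident set size of this process so far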
When you use Popen you need to pass close_fds=True if you want it to close extra file descriptors.
Creating a new pipe, which occurs in the _get_handles function from the backtrace, creates 2 file descriptors, but your current code never closes them, and you're eventually hitting your system's max fd limit.
Not sure why the error you're getting indicates an out of memory condition: it should be a file descriptor error as the return value of pipe() has an error code for this problem.
That swap space answer is bogus. Historically Unix systems wanted swap space available like that, but they don't work that way anymore (and Linux never worked that way). You're not even close to running out of memory, so that's not likely the actual problem - you're running out of some other limited resource.
Given where the error is occurring (_get_handles calls os.pipe() to create pipes to the child), the only real problem you could be running into is not enough free file descriptors. I would instead look for unclosed files (lsof -p on the PID of the process doing the popen). If your program really needs to keep a lot of files open at one time, then increase the user limit and/or the system limit for open file descriptors.
If you're running a background process, chances are that you've redirected your process's stdin/stdout/stderr.
In that case, append the option "close_fds=True" to your Popen call, which will prevent the child process from inheriting your redirected output. This may be the limit you're bumping into.
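For the ps call above, that would look something like this (close_fds is a standard Popen argument; on Python 2 it defaults to False):

ps = subprocess.Popen(
    ['ps', 'aux'],
    stdout=subprocess.PIPE,
    close_fds=True,  # don't pass the daemon's other descriptors to the child
).communicate()[0]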
You might want to actually wait for all of those PS processes to finish before adding swap space.
It's not at all clear what "running as a background process executing every 60 seconds" means.
But your call to subprocess.Popen is forking a new process each time.
Update.
I'd guess that you're somehow leaving all those processes running or hung in a zombie state. However, the communicate method should clean up the spawned subprocesses.
Have you watched your process over time?
lsof
ps -aux | grep -i pname
top
All should give interesting information. I am thinking that the process is tying up resources that should be freed. Is there a chance that it is tying up resource handles (memory blocks, streams, file handles, thread or process handles)? Perhaps stdin, stdout and stderr from the spawned "ps" processes, or memory from many small incremental allocations. I would be very interested in seeing what the above commands display for your process when it has just finished launching and running for the first time, and again after 24 hours of "sitting" there launching the sub-process regularly.
Since it dies after a few days, you could have it run for only a few loops, and then restart it once a day as a workaround. That would help you in the meantime.
Jacob
You need to
ps = subprocess.Popen(["sleep", "1000"])
os.waitpid(ps.pid, 0)
to free resources.
Note: this does not work on Windows.
I don't think that the circumstances given in the Zenoss article you linked to are the only cause of this message, so it's not clear yet that swap space is definitely the problem. I would advise logging some more information even around successful calls, so that you can see the state of free memory every time just before you do the ps call.
One more thing - if you specify shell=True in the Popen call, do you see different behaviour?
Update: If not memory, the next possible culprit is indeed file handles. I would advise running the failing command under strace to see exactly which system calls are failing.
Virtual Memory matters!!!
I encountered the same issue before I added swap to my OS. The formula for virtual memory is usually something like: SwapSize + 50% * PhysicalMemorySize (for example, with the 894 MB of RAM and 0 swap shown above, that would be only ~447 MB). I finally got this resolved by either adding more physical memory or adding a swap disk. close_fds didn't work in my case.