Continuously process large amounts of stdout and stderr from a child process

Continuously process large amounts of stdout and stderr from a child process - python

There are a lot of good answers on Stack Overflow about how to handle output with subprocesses, async IO, and avoiding deadlock with PIPE. Something is just not sinking in for me though; I need some guidance on how to accomplish the following.
I want to run a subprocess from my python program. The subprocess generates a ton of standard output, and a little bit of standard error if things go bad. The subprocess itself takes about 20 minutes to complete. For the output and error generated, I want to be able to both log it to the terminal, and write it to a log file.
Doing the latter was easy. I just opened two files and set then as stdout and stderr on the Popen object. However, also capturing the output as lines so that I may print them continuously to terminal has me vexed. I was thinking I could use the poll() method to continuously poll. With this though, I'd still need to use PIPE for stdout and stderr, and call read() on them which would block until EOF.
I think what I'm trying to accomplish is this:
start the subprocess
while process is still running
if there are any lines from stdout
print them and write them to the out log file
if there are any lines from stderr
print them and write them to the err log file
sleep for a little bit
Does that seem reasonable? If so, can someone explain how one would implement the 'if' parts here without blocking.
Thanks

Here is my select.select version:
Subprocess (foo.py):
import time
import sys
def foo():
for i in range(5):
print("foo %s" %i, file=sys.stdout, )#flush=True
sys.stdout.flush()
time.sleep(7)
foo()
Main:
import subprocess as sp
import select
proc= sp.Popen(["python", "foo.py"], stderr=sp.PIPE, stdout=sp.PIPE)
last_line = "content"
while last_line:
buff = select.select([proc.stdout], [], [], 60)[0][0]
if not buff:
print('timed out')
break
last_line = buff.readline()
print(last_line)

Related

subprocess get result before terminate

Get Result of a SubProcess in Real Time
I would like to get each result (sys.stdout) in real time before the subprocess terminates.
Suppose we have the following file.py.
import time,sys
sys.stdout.write('something')
while True:
sys.stdout.write('something else')
time.sleep(4)
Well, i made some tries with modules of subprocess, asyncio and threading, although all methods gives me the result when the process is finished. Ideally i would like to terminate the process myself and get each result (stdout, stderr) in real time and not when the process it completes.
import subprocess
proc = sp.Popen([sys.executable, "/Users/../../file.py"], stdout = subprocess.PIPE, stderr= subproces.STDOUT)
proc.communicate() #This one received the result after finish
I tried as well with readline proc.stdout.readline() in a different thread with threading module and with asyncio, but it waits as well until the process completes.
The only usefull that i found is the usage of psutil.Popen(*args, **kwargs) with this one i could terminate whenever i want the process and get some stats for that. But the main issue still remains to get in real time (asynchronously) each sys.stdout or print of file.py, at the moment of each printing.
*preferred solution for python3.6

As noted in the comments, the first and foremost thing is to ensure that your file.py program actually writes the data the way you think it does.
For example, the program you have shown will write nothing for about 40 minutes, because that's how long it takes for 14-byte prints issued at 4-second intervals to fill up the 8-kilobyte IO buffer. Even more confusingly, some programs will appear to write data if you test them on a TTY (i.e. just run them), but not when you start them as subprocesses. This is because on a TTY stdout is line-buffered, and on a pipe it is fully buffered. When the output is not flushed, there is simply no way for another program to detect the output because it is stuck inside the subprocess's buffer that it never bothered to share with anyone.
In other words, don't forget to flush:
while True:
# or just print('something else', flush=True)
sys.stdout.write('something else')
sys.stdout.flush()
time.sleep(4)
With that out of the way, let's examine how to read that output. Asyncio provides a nice stream-based interface to subprocesses that is quite capable of accessing arbitrary output as it arrives. For example:
import asyncio
async def main():
loop = asyncio.get_event_loop()
proc = await asyncio.create_subprocess_exec(
"python", "file.py",
stdout=asyncio.subprocess.PIPE, stderr=asyncio.subprocess.PIPE
)
# loop.create_task() rather than asyncio.create_task() because Python 3.6
loop.create_task(display_as_arrives(proc.stdout, 'stdout'))
loop.create_task(display_as_arrives(proc.stderr, 'stderr'))
await proc.wait()
async def display_as_arrives(stream, where):
while True:
# 1024 chosen arbitrarily - StreamReader.read will happily return
# shorter chunks - this allows reading in real-time.
output = await stream.read(1024)
if output == b'':
break
print('got', where, ':', output)
# run_until_complete() rather than asyncio.run() because Python 3.6
asyncio.get_event_loop().run_until_complete(main())

Gather subprocess output nonblocking in Python

Is there an easy way of gathering the output of a subprocess without actually waiting for it?
I can think of creating a subprocess.Popen() with capturing its stdout, then call p.communicate(), but that would block until the subprocess terminates.
I can think of using subprocess.check_output() or similar, but that also would block.
I need something which I can start, then do other stuff, then check the subprocess for being terminated, and in case it is, takes its output.
I can think of two rather complicated ways to achieve this:
Redirect the output into a file, then after termination I can read the output from that file.
Implement and start a handler thread(!) which constantly tries to read data from the stdout of the subprocess and adds it to a buffer.
The first one needs temporary files and disk I/O which I do not really like in my case. The second one means implementing quite a bit.
I guess there might be a simpler way I couldn't think of yet, or a ready-to-be-used solution in some library I didn't find yet.

What's wrong with calling check_output in a thread?
import threading,subprocess
output = ""
def f():
global output
output = subprocess.check_output("ls") # ["cmd","/c","dir"] for windows
t = threading.Thread(target=f)
t.start()
print('Started')
t.join()
print(output)
note that one could be tempted to use p = subprocess.Popen(cmd,stdout=subprocess.PIPE), wait for p.poll() to be != None and try to read p.stdout afterwards: that only works when the output is small, else you get a deadlock because stdout buffer is full and you have to read it from time to time.
Using p.stdout.readline() would work but would also block if the process doesn't print on a regular basis. If your application prints to the output all the time, then you can consider it as non-blocking and the solution is acceptable.

I think what you want is an unbuffered stdout stream.
With that you will be able to capture the output of your process without waiting for it to finish.
You can achieve that with the subprocess.Popen() function and the parameter stdout=subprocess.PIPE.
Try something like this
proc = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE)
line = proc.stdout.readline()
while line:
print line
line = proc.stdout.readline()

Python subprocess timing out?

I have a script that runs another command, waits for it to finish, logs the stdout and stderr and based the return code does other stuff. Here is the code:
p = subprocess.Popen(command, stdin=subprocess.PIPE, stderr=subprocess.PIPE, stdout=subprocess.PIPE)
o, e = p.communicate()
if p.returncode:
# report error
# do other stuff
The problem I'm having is that if command takes a long time to run none of the other actions get done. The possible errors won't get reported and the other stuff that needs to happen if no errors doesn't get done. It essentially doesn't go past p.communicate() if it takes too long. Some times this command can takes hours (or even longer) to run and some times it can take as little as 5 seconds.
Am I missing something or doing something wrong?

As per the documentation located here, it's safe to say that you're code is waiting for the subprocess to finish.
If you need to go do 'other things' while you wait you could create a loop like:
while p.poll():
# 'other things'
time.sleep(0.2)
Pick a sleep time that's reasonable for how often you want python to wake up and check the subprocess as well as doing its 'other things'.

The Popen.communicate waits for the process to finish, before anything is returned. Thus it is not ideal for any long running command; and even less so if the subprocess can hang waiting for input, say prompting for a password.
The stderr=subprocess.PIPE, stdout=subprocess.PIPE are needed only if you want to capture the output of the command into a variable. If you are OK with the output going to your terminal, then you can remove these both; and even use subprocess.call instead of Popen. Also, if you do not provide input to your subprocess, then do not use stdin=subprocess.PIPE at all, but direct that from the null device instead (in Python 3.3+ you can use stdin=subprocess.DEVNULL; in Python <3.3 use stdin=open(os.devnull, 'rb')
If you need the contents too, then instead of calling p.communicate(), you can read p.stdout and p.stderr yourself in chunks and output to the terminal, but it is a bit complicated, as it is easy to deadlock the program - the dummy approach would try to read from the subprocess' stdout while the subprocess would want to write to stderr. For this case there are 2 remedies:
you could use select.select to poll both stdout and stderr to see whichever becomes ready first and read from it then
or, if you do not care for stdout and stderr being combined into one,
you can use STDOUT to redirect the stderr stream into the stdout stream: stdout=subprocess.PIPE, stderr=subprocess.STDOUT; now all the output comes to p.stdout that you can read easily in loop and output the chunks, without worrying about deadlocks:
If the stdout, stderr are going to be huge, you can also spool them to a file right there in Popen; say,
stdout = open('stdout.txt', 'w+b')
stderr = open('stderr.txt', 'w+b')
p = subprocess.Popen(..., stdout=stdout, stderr=stderr)
while p.poll() is None:
# reading at the end of the file will return an empty string
err = stderr.read()
print(err)
out = stdout.read()
print(out)
# if we met the end of the file, then we can sleep a bit
# here to avoid spending excess CPU cycles just to poll;
# another option would be to use `select`
if not err and not out: # no input, sleep a bit
time.sleep(0.01)

Detecting the end of the stream on popen.stdout.readline

I have a python program which launches subprocesses using Popen and consumes their output nearly real-time as it is produced. The code of the relevant loop is:
def run(self, output_consumer):
self.prepare_to_run()
popen_args = self.get_popen_args()
logging.debug("Calling popen with arguments %s" % popen_args)
self.popen = subprocess.Popen(**popen_args)
while True:
outdata = self.popen.stdout.readline()
if not outdata and self.popen.returncode is not None:
# Terminate when we've read all the output and the returncode is set
break
output_consumer.process_output(outdata)
self.popen.poll() # updates returncode so we can exit the loop
output_consumer.finish(self.popen.returncode)
self.post_run()
def get_popen_args(self):
return {
'args': self.command,
'shell': False, # Just being explicit for security's sake
'bufsize': 0, # More likely to see what's being printed as it happens
# Not guarantted since the process itself might buffer its output
# run `python -u` to unbuffer output of a python processes
'cwd': self.get_cwd(),
'env': self.get_environment(),
'stdout': subprocess.PIPE,
'stderr': subprocess.STDOUT,
'close_fds': True, # Doesn't seem to matter
}
This works great on my production machines, but on my dev machine, the call to .readline() hangs when certain subprocesses complete. That is, it will successfully process all of the output, including the final output line saying "process complete", but then will again poll readline and never return. This method exits properly on the dev machine for most of the sub-processes I call, but consistently fails to exit for one complex bash script that itself calls many sub-processes.
It's worth noting that popen.returncode gets set to a non-None (usually 0) value many lines before the end of the output. So I can't just break out of the loop when that is set or else I lose everything that gets spat out at the end of the process and is still buffered waiting for reading. The problem is that when I'm flushing the buffer at that point, I can't tell when I'm at the end because the last call to readline() hangs. Calling read() also hangs. Calling read(1) gets me every last character out, but also hangs after the final line. popen.stdout.closed is always False. How can I tell when I'm at the end?
All systems are running python 2.7.3 on Ubuntu 12.04LTS. FWIW, stderr is being merged with stdout using stderr=subprocess.STDOUT.
Why the difference? Is it failing to close stdout for some reason? Could the sub-sub-processes do something to keep it open somehow? Could it be because I'm launching the process from a terminal on my dev box, but in production it's launched as a daemon through supervisord? Would that change the way the pipes are processed and if so how do I normalize them?

The main code loop looks right. It could be that the pipe isn't closing because another process is keeping it open. For example, if script launches a background process that writes to stdout then the pipe will no close. Are you sure no other child process still running?
An idea is to change modes when you see the .returncode has set. Once you know the main process is done, read all its output from buffer, but don't get stuck waiting. You can use select to read from the pipe with a timeout. Set a several seconds timeout and you can clear the buffer without getting stuck waiting child process.

Without knowing the contents of the "one complex bash script" which causes the problem, there's too many possibilities to determine the exact cause.
However, focusing on the fact that you claim it works if you run your Python script under supervisord, then it might be getting stuck if a sub-process is trying to read from stdin, or just behaves differently if stdin is a tty, which (I presume) supervisord will redirect from /dev/null.
This minimal example seems to cope better with cases where my example test.sh runs subprocesses which try to read from stdin...
import os
import subprocess
f = subprocess.Popen(args='./test.sh',
shell=False,
bufsize=0,
stdin=open(os.devnull, 'rb'),
stdout=subprocess.PIPE,
stderr=subprocess.STDOUT,
close_fds=True)
while 1:
s = f.stdout.readline()
if not s and f.returncode is not None:
break
print s.strip()
f.poll()
print "done %d" % f.returncode
Otherwise, you can always fall back to using a non-blocking read, and bail out when you get your final output line saying "process complete", although it's a bit of a hack.

If you use readline() or read(), it should not hang. No need to check returncode or poll(). If it is hanging when you know the process is finished, it is most probably a subprocess keeping your pipe open, as others said before.
There are two things you could do to debug this:
* Try to reproduce with a minimal script instead of the current complex one, or
* Run that complex script with strace -f -e clone,execve,exit_group and see what is that script starting, and if any process is surviving the main script (check when the main script calls exit_group, if strace is still waiting after that, you have a child still alive).

I find that calls to read (or readline) sometimes hang, despite previously calling poll. So I resorted to calling select to find out if there is readable data. However, select without a timeout can hang, too, if the process was closed. So I call select in a semi-busy loop with a tiny timeout for each iteration (see below).
I'm not sure if you can adapt this to readline, as readline might hang if the final \n is missing, or if the process doesn't close its stdout before you close its stdin and/or terminate it. You could wrap this in a generator, and everytime you encounter a \n in stdout_collected, yield the current line.
Also note that in my actual code, I'm using pseudoterminals (pty) to wrap the popen handles (to more closely fake user input) but it should work without.
# handle to read from
handle = self.popen.stdout
# how many seconds to wait without data
timeout = 1
begin = datetime.now()
stdout_collected = ""
while self.popen.poll() is None:
try:
fds = select.select([handle], [], [], 0.01)[0]
except select.error, exc:
print exc
break
if len(fds) == 0:
# select timed out, no new data
delta = (datetime.now() - begin).total_seconds()
if delta > timeout:
return stdout_collected
# try longer
continue
else:
# have data, timeout counter resets again
begin = datetime.now()
for fd in fds:
if fd == handle:
data = os.read(handle, 1024)
# can handle the bytes as they come in here
# self._handle_stdout(data)
stdout_collected += data
# process exited
# if using a pseudoterminal, close the handles here
self.popen.wait()

Why are you setting the sdterr to STDOUT?
The real benefit of making a communicate() call on a subproces is that you are able to retrieve a tuple containining the stdout response as well as the stderr meesage.
Those might be useful if the logic depends on their succsss or failure.
Also, it would save you from the pain of having to iterate through lines. Communicate() gives you everything and there would be no unresolved questions about whether or not the full message was received

I wrote a demo with bash subprocess that can be easy explored.
A closed pipe can be recognized by '' in the output from readline(), while the output from an empty line is '\n'.
from subprocess import Popen, PIPE, STDOUT
p = Popen(['bash'], stdout=PIPE, stderr=STDOUT)
out = []
while True:
outdata = p.stdout.readline()
if not outdata:
break
#output_consumer.process_output(outdata)
print "* " + repr(outdata)
out.append(outdata)
print "* closed", repr(out)
print "* returncode", p.wait()
Example of input/output: Closing the pipe distinctly before terminating the process. That is why wait() should be used instead of poll()
[prompt] $ python myscript.py
echo abc
* 'abc\n'
exec 1>&- # close stdout
exec 2>&- # close stderr
* closed ['abc\n']
exit
* returncode 0
[prompt] $
Your code did output a huge number of empty strings for this case.
Example: Fast terminated process without '\n' on the last line:
echo -n abc
exit
* 'abc'
* closed ['abc']
* returncode 0

Proper way of re-using and closing a subprocess object

I have the following code in a loop:
while true:
# Define shell_command
p1 = Popen(shell_command, shell=shell_type, stdout=PIPE, stderr=PIPE, preexec_fn=os.setsid)
result = p1.stdout.read();
# Define condition
if condition:
break;
where shell_command is something like ls (it just prints stuff).
I have read in different places that I can close/terminate/exit a Popen object in a variety of ways, e.g. :
p1.stdout.close()
p1.stdin.close()
p1.terminate
p1.kill
My question is:
What is the proper way of closing a subprocess object once we are done using it?
Considering the nature of my script, is there a way to open a subprocess object only once and reuse it with different shell commands? Would that be more efficient in any way than opening new subprocess objects each time?
Update
I am still a bit confused about the sequence of steps to follow depending on whether I use p1.communicate() or p1.stdout.read() to interact with my process.
From what I understood in the answers and the comments:
If I use p1.communicate() I don't have to worry about releasing resources, since communicate() would wait until the process is finished, grab the output and properly close the subprocess object
If I follow the p1.stdout.read() route (which I think fits my situation, since the shell command is just supposed to print stuff) I should call things in this order:
p1.wait()
p1.stdout.read()
p1.terminate()
Is that right?

What is the proper way of closing a subprocess object once we are done using it?
stdout.close() and stdin.close() will not terminate a process unless it exits itself on end of input or on write errors.
.terminate() and .kill() both do the job, with kill being a bit more "drastic" on POSIX systems, as SIGKILL is sent, which cannot be ignored by the application. Specific differences are explained in this blog post, for example. On Windows, there's no difference.
Also, remember to .wait() and to close the pipes after killing a process to avoid zombies and force the freeing of resources.
A special case that is often encountered are processes which read from STDIN and write their result to STDOUT, closing themselves when EOF is encountered. With these kinds of programs, it's often sensible to use subprocess.communicate:
>>> p = Popen(["sort"], stdin=PIPE, stdout=PIPE)
>>> p.communicate("4\n3\n1")
('1\n3\n4\n', None)
>>> p.returncode
0
This can also be used for programs which print something and exit right after:
>>> p = Popen(["ls", "/home/niklas/test"], stdin=PIPE, stdout=PIPE)
>>> p.communicate()
('file1\nfile2\n', None)
>>> p.returncode
0
Considering the nature of my script, is there a way to open a subprocess object only once and reuse it with different shell commands? Would that be more efficient in any way than opening new subprocess objects each time?
I don't think the subprocess module supports this and I don't see what resources could be shared here, so I don't think it would give you a significant advantage.

Considering the nature of my script, is there a way to open a subprocess object only once and reuse it with different shell commands?
Yes.
#!/usr/bin/env python
from __future__ import print_function
import uuid
import random
from subprocess import Popen, PIPE, STDOUT
MARKER = str(uuid.uuid4())
shell_command = 'echo a'
p = Popen('sh', stdin=PIPE, stdout=PIPE, stderr=STDOUT,
universal_newlines=True) # decode output as utf-8, newline is '\n'
while True:
# write next command
print(shell_command, file=p.stdin)
# insert MARKER into stdout to separate output from different shell_command
print("echo '%s'" % MARKER, file=p.stdin)
# read command output
for line in iter(p.stdout.readline, MARKER+'\n'):
if line.endswith(MARKER+'\n'):
print(line[:-len(MARKER)-1])
break # command output ended without a newline
print(line, end='')
# exit on condition
if random.random() < 0.1:
break
# cleanup
p.stdout.close()
if p.stderr:
p.stderr.close()
p.stdin.close()
p.wait()
Put while True inside try: ... finally: to perform the cleanup in case of exceptions. On Python 3.2+ you could use with Popen(...): instead.
Would that be more efficient in any way than opening new subprocess objects each time?
Does it matter in your case? Don't guess. Measure it.

The "correct" order is:
Create a thread to read stdout (and a second one to read stderr, unless you merged them into one).
Write commands to be executed by the child to stdin. If you're not reading stdout at the same time, writing to stdin can block.
Close stdin (this is the signal for the child that it can now terminate by itself whenever it is done)
When stdout returns EOF, the child has terminated. Note that you need to synchronize the stdout reader thread and your main thread.
call wait() to see if there was a problem and to clean up the child process
If you need to stop the child process for any reason (maybe the user wants to quit), then you can:
Close stdin if the child terminates when it reads EOF.
Kill the with terminate(). This is the correct solution for child processes which ignore stdin.
If the child doesn't respond, try kill()
In all three cases, you must call wait() to clean up the dead child process.

Depends on what you expect the process to do; you should always call p1.wait() in order to avoid zombies. Other steps depend on the behaviour of the subprocess; if it produces any output, you should consume the output (e.g. p1.read() ...but this would eat lots of memory) and only then call the p1.wait(); or you may wait for some timeout and call p1.terminate() to kill the process if you think it doesn't work as expected, and possible call p1.wait() to clean the zombie.
Alternatively, p1.communicate(...) would do the handling if io and waiting for you (not the killing).
Subprocess objects aren't supposed to be reused.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Continuously process large amounts of stdout and stderr from a child process - python

Related

subprocess get result before terminate

Gather subprocess output nonblocking in Python

Python subprocess timing out?

Detecting the end of the stream on popen.stdout.readline

Proper way of re-using and closing a subprocess object

Categories

Resources