Gather subprocess output nonblocking in Python

Gather subprocess output nonblocking in Python - python

Is there an easy way of gathering the output of a subprocess without actually waiting for it?
I can think of creating a subprocess.Popen() with capturing its stdout, then call p.communicate(), but that would block until the subprocess terminates.
I can think of using subprocess.check_output() or similar, but that also would block.
I need something which I can start, then do other stuff, then check the subprocess for being terminated, and in case it is, takes its output.
I can think of two rather complicated ways to achieve this:
Redirect the output into a file, then after termination I can read the output from that file.
Implement and start a handler thread(!) which constantly tries to read data from the stdout of the subprocess and adds it to a buffer.
The first one needs temporary files and disk I/O which I do not really like in my case. The second one means implementing quite a bit.
I guess there might be a simpler way I couldn't think of yet, or a ready-to-be-used solution in some library I didn't find yet.

What's wrong with calling check_output in a thread?
import threading,subprocess
output = ""
def f():
global output
output = subprocess.check_output("ls") # ["cmd","/c","dir"] for windows
t = threading.Thread(target=f)
t.start()
print('Started')
t.join()
print(output)
note that one could be tempted to use p = subprocess.Popen(cmd,stdout=subprocess.PIPE), wait for p.poll() to be != None and try to read p.stdout afterwards: that only works when the output is small, else you get a deadlock because stdout buffer is full and you have to read it from time to time.
Using p.stdout.readline() would work but would also block if the process doesn't print on a regular basis. If your application prints to the output all the time, then you can consider it as non-blocking and the solution is acceptable.

I think what you want is an unbuffered stdout stream.
With that you will be able to capture the output of your process without waiting for it to finish.
You can achieve that with the subprocess.Popen() function and the parameter stdout=subprocess.PIPE.
Try something like this
proc = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE)
line = proc.stdout.readline()
while line:
print line
line = proc.stdout.readline()

Related

Avoid Deadlock wtih Popen and stdout = PIPE in python

I am executing a shell script using Popen. I am also using stdout=PIPE to capture the output.The code is
pipe = Popen('acbd.sh', shell=True, stdout = PIPE)
while pipe.poll() is None:
time.sleep(0.5)
text = pipe.communicate()[0]
if pipe.returncode == 0:
print "File executed"
According to documentation using poll with stdout = PIPE can lead to deadlock. Also communicate() can be used to solve this problem. I have used communicate() here.
Will my code lead to deadlock with communicate too or am I using communicate usage wrong?
Also I have an alternate in subprocess.check_output but I would prefer to use Popen and record the output with same.

Yes, you can deadlock, because of these two lines:
while pipe.poll() is None:
time.sleep(0.5)
Take them out; there's no need for them here. communicate() will wait for the subprocess to close its FDs (as happens on exit) as it is; when you add a loop yourself, and don't read until after that loop completes, then your program can be stuck indefinitely trying to write contents which can't be written until communicate() causes the other side of the pipeline to start reading.
As background: The POSIX specification for the write() call does not make any guarantees about the amount of data that can be written to a FIFO before it will block, or that this amount of data will be consistent even within a given system -- thus, the safe thing is to assume that any write to a FIFO is always allowed to block unless there's a reader actively consuming that data.

Python Popen().stdout.read() hang

I'm trying to get output of another script, using Python's subprocess.Popen like follows
process = Popen(command, stdout=PIPE, shell=True)
exitcode = process.wait()
output = process.stdout.read() # hangs here
It hangs at the third line, only when I run it as a python script and I cannot reproduce this in the python shell.
The other script prints just a few words and I am assuming that it's not a buffer issue.
Does anyone has idea about what I am doing wrong here?

You probably want to use .communicate() rather than .wait() plus .read(). Note the warning about wait() on the subprocess documentation page:
Warning This will deadlock when using stdout=PIPE and/or stderr=PIPE and the child process generates enough output to a pipe such that it blocks waiting for the OS pipe buffer to accept more data. Use communicate() to avoid that.
http://docs.python.org/2/library/subprocess.html#subprocess.Popen.wait

read() waits for EOF before returning.
You can:
wait for the subprocess to die, then read() will return.
use readline() if your output is broken into lines (will still hang if no output lines).
use os.read(F,N) which returns at most N bytes from F, but will still block if the pipe is empty (unless O_NONBLOCK is set on the fd).

You can see how to deal with hanging reading of stdout/stderr in the next sources:
readingproc

Proper way of re-using and closing a subprocess object

I have the following code in a loop:
while true:
# Define shell_command
p1 = Popen(shell_command, shell=shell_type, stdout=PIPE, stderr=PIPE, preexec_fn=os.setsid)
result = p1.stdout.read();
# Define condition
if condition:
break;
where shell_command is something like ls (it just prints stuff).
I have read in different places that I can close/terminate/exit a Popen object in a variety of ways, e.g. :
p1.stdout.close()
p1.stdin.close()
p1.terminate
p1.kill
My question is:
What is the proper way of closing a subprocess object once we are done using it?
Considering the nature of my script, is there a way to open a subprocess object only once and reuse it with different shell commands? Would that be more efficient in any way than opening new subprocess objects each time?
Update
I am still a bit confused about the sequence of steps to follow depending on whether I use p1.communicate() or p1.stdout.read() to interact with my process.
From what I understood in the answers and the comments:
If I use p1.communicate() I don't have to worry about releasing resources, since communicate() would wait until the process is finished, grab the output and properly close the subprocess object
If I follow the p1.stdout.read() route (which I think fits my situation, since the shell command is just supposed to print stuff) I should call things in this order:
p1.wait()
p1.stdout.read()
p1.terminate()
Is that right?

What is the proper way of closing a subprocess object once we are done using it?
stdout.close() and stdin.close() will not terminate a process unless it exits itself on end of input or on write errors.
.terminate() and .kill() both do the job, with kill being a bit more "drastic" on POSIX systems, as SIGKILL is sent, which cannot be ignored by the application. Specific differences are explained in this blog post, for example. On Windows, there's no difference.
Also, remember to .wait() and to close the pipes after killing a process to avoid zombies and force the freeing of resources.
A special case that is often encountered are processes which read from STDIN and write their result to STDOUT, closing themselves when EOF is encountered. With these kinds of programs, it's often sensible to use subprocess.communicate:
>>> p = Popen(["sort"], stdin=PIPE, stdout=PIPE)
>>> p.communicate("4\n3\n1")
('1\n3\n4\n', None)
>>> p.returncode
0
This can also be used for programs which print something and exit right after:
>>> p = Popen(["ls", "/home/niklas/test"], stdin=PIPE, stdout=PIPE)
>>> p.communicate()
('file1\nfile2\n', None)
>>> p.returncode
0
Considering the nature of my script, is there a way to open a subprocess object only once and reuse it with different shell commands? Would that be more efficient in any way than opening new subprocess objects each time?
I don't think the subprocess module supports this and I don't see what resources could be shared here, so I don't think it would give you a significant advantage.

Considering the nature of my script, is there a way to open a subprocess object only once and reuse it with different shell commands?
Yes.
#!/usr/bin/env python
from __future__ import print_function
import uuid
import random
from subprocess import Popen, PIPE, STDOUT
MARKER = str(uuid.uuid4())
shell_command = 'echo a'
p = Popen('sh', stdin=PIPE, stdout=PIPE, stderr=STDOUT,
universal_newlines=True) # decode output as utf-8, newline is '\n'
while True:
# write next command
print(shell_command, file=p.stdin)
# insert MARKER into stdout to separate output from different shell_command
print("echo '%s'" % MARKER, file=p.stdin)
# read command output
for line in iter(p.stdout.readline, MARKER+'\n'):
if line.endswith(MARKER+'\n'):
print(line[:-len(MARKER)-1])
break # command output ended without a newline
print(line, end='')
# exit on condition
if random.random() < 0.1:
break
# cleanup
p.stdout.close()
if p.stderr:
p.stderr.close()
p.stdin.close()
p.wait()
Put while True inside try: ... finally: to perform the cleanup in case of exceptions. On Python 3.2+ you could use with Popen(...): instead.
Would that be more efficient in any way than opening new subprocess objects each time?
Does it matter in your case? Don't guess. Measure it.

The "correct" order is:
Create a thread to read stdout (and a second one to read stderr, unless you merged them into one).
Write commands to be executed by the child to stdin. If you're not reading stdout at the same time, writing to stdin can block.
Close stdin (this is the signal for the child that it can now terminate by itself whenever it is done)
When stdout returns EOF, the child has terminated. Note that you need to synchronize the stdout reader thread and your main thread.
call wait() to see if there was a problem and to clean up the child process
If you need to stop the child process for any reason (maybe the user wants to quit), then you can:
Close stdin if the child terminates when it reads EOF.
Kill the with terminate(). This is the correct solution for child processes which ignore stdin.
If the child doesn't respond, try kill()
In all three cases, you must call wait() to clean up the dead child process.

Depends on what you expect the process to do; you should always call p1.wait() in order to avoid zombies. Other steps depend on the behaviour of the subprocess; if it produces any output, you should consume the output (e.g. p1.read() ...but this would eat lots of memory) and only then call the p1.wait(); or you may wait for some timeout and call p1.terminate() to kill the process if you think it doesn't work as expected, and possible call p1.wait() to clean the zombie.
Alternatively, p1.communicate(...) would do the handling if io and waiting for you (not the killing).
Subprocess objects aren't supposed to be reused.

Best way to fork multiple shell commands/processes in Python?

Most of the examples I've seen with os.fork and the subprocess/multiprocessing modules show how to fork a new instance of the calling python script or a chunk of python code. What would be the best way to spawn a set of arbitrary shell command concurrently?
I suppose, I could just use subprocess.call or one of the Popen commands and pipe the output to a file, which I believe will return immediately, at least to the caller. I know this is not that hard to do, I'm just trying to figure out the simplest, most Pythonic way to do it.
Thanks in advance

All calls to subprocess.Popen return immediately to the caller. It's the calls to wait and communicate which block. So all you need to do is spin up a number of processes using subprocess.Popen (set stdin to /dev/null for safety), and then one by one call communicate until they're all complete.
Naturally I'm assuming you're just trying to start a bunch of unrelated (i.e. not piped together) commands.

I like to use PTYs instead of pipes. For a bunch of processes where I only want to capture error messages I did this.
RNULL = open('/dev/null', 'r')
WNULL = open('/dev/null', 'w')
logfile = open("myprocess.log", "a", 1)
REALSTDERR = sys.stderr
sys.stderr = logfile
This next part was in a loop spawning about 30 processes.
sys.stderr = REALSTDERR
master, slave = pty.openpty()
self.subp = Popen(self.parsed, shell=False, stdin=RNULL, stdout=WNULL, stderr=slave)
sys.stderr = logfile
After this I had a select loop which collected any error messages and sent them to the single log file. Using PTYs meant that I never had to worry about partial lines getting mixed up because the line discipline provides simple framing.

There is no best for all possible circumstances. The best depends on the problem at hand.
Here's how to spawn a process and save its output to a file combining stdout/stderr:
import subprocess
import sys
def spawn(cmd, output_file):
on_posix = 'posix' in sys.builtin_module_names
return subprocess.Popen(cmd, close_fds=on_posix, bufsize=-1,
stdin=open(os.devnull,'rb'),
stdout=output_file,
stderr=subprocess.STDOUT)
To spawn multiple processes that can run in parallel with your script and each other:
processes, files = [], []
try:
for i, cmd in enumerate(commands):
files.append(open('out%d' % i, 'wb'))
processes.append(spawn(cmd, files[-1]))
finally:
for p in processes:
p.wait()
for f in files:
f.close()
Note: cmd is a list everywhere.

I suppose, I could just us subprocess.call or one of the Popen
commands and pipe the output to a file, which I believe will return
immediately, at least to the caller.
That's not a good way to do it if you want to process the data.
In this case, better do
sp = subprocess.Popen(['ls', '-l'], stdout=subprocess.PIPE)
and then sp.communicate() or read directly from sp.stdout.read().
If the data shall be processed in the calling program at a later time, there are two ways to go:
You can retrieve the data ASAP, maybe via a separate thread, reading them and storing them somewhere where the consumer can get them.
You can have the producing subprocess have block and retrieve the data from it when you need them. The subprocess produces as many data as fit in the pipe buffer (usually 64 kiB) and then blocks on further writes. As soon as you need the data, you read() from the subprocess object's stdout (maybe stderr as well) and use them - or, again, you use sp.communicate() at that later time.
Way 1 would the way to go if producing the data needs much time, so that your wprogram would have to wait.
Way 2 would be to be preferred if the size of the data is quite huge and/or the data is produced so fast that buffering would make no sense.

See an older answer of mine including code snippets to do:
Uses processes not threads for blocking I/O because they can more reliably be p.terminated()
Implements a retriggerable timeout watchdog that restarts counting whenever some output happens
Implements a long-term timeout watchdog to limit overall runtime
Can feed in stdin (although I only need to feed in one-time short strings)
Can capture stdout/stderr in the usual Popen means (Only stdout is coded, and stderr redirected to stdout; but can easily be separated)
It's almost realtime because it only checks every 0.2 seconds for output. But you could decrease this or remove the waiting interval easily
Lots of debugging printouts still enabled to see whats happening when.
For spawning multiple concurrent commands, you would need to alter the class RunCmd to instantiate mutliple read output/write input queues and to spawn mutliple Popen subprocesses.

Using subprocess.Popen for Process with Large Output

I have some Python code that executes an external app which works fine when the app has a small amount of output, but hangs when there is a lot. My code looks like:
p = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
errcode = p.wait()
retval = p.stdout.read()
errmess = p.stderr.read()
if errcode:
log.error('cmd failed <%s>: %s' % (errcode,errmess))
There are comments in the docs that seem to indicate the potential issue. Under wait, there is:
Warning: This will deadlock if the child process generates enough output to a stdout or stderr pipe such that it blocks waiting for the OS pipe buffer to accept more data. Use communicate() to avoid that.
though under communicate, I see:
Note The data read is buffered in memory, so do not use this method if the data size is large or unlimited.
So it is unclear to me that I should use either of these if I have a large amount of data. They don't indicate what method I should use in that case.
I do need the return value from the exec and do parse and use both the stdout and stderr.
So what is an equivalent method in Python to exec an external app that is going to have large output?

You're doing blocking reads to two files; the first needs to complete before the second starts. If the application writes a lot to stderr, and nothing to stdout, then your process will sit waiting for data on stdout that isn't coming, while the program you're running sits there waiting for the stuff it wrote to stderr to be read (which it never will be--since you're waiting for stdout).
There are a few ways you can fix this.
The simplest is to not intercept stderr; leave stderr=None. Errors will be output to stderr directly. You can't intercept them and display them as part of your own message. For commandline tools, this is often OK. For other apps, it can be a problem.
Another simple approach is to redirect stderr to stdout, so you only have one incoming file: set stderr=STDOUT. This means you can't distinguish regular output from error output. This may or may not be acceptable, depending on how the application writes output.
The complete and complicated way of handling this is select (http://docs.python.org/library/select.html). This lets you read in a non-blocking way: you get data whenever data appears on either stdout or stderr. I'd only recommend this if it's really necessary. This probably doesn't work in Windows.

Reading stdout and stderr independently with very large output (ie, lots of megabytes) using select:
import subprocess, select
proc = subprocess.Popen(cmd, bufsize=8192, shell=False, \
stdout=subprocess.PIPE, stderr=subprocess.PIPE)
with open(outpath, "wb") as outf:
dataend = False
while (proc.returncode is None) or (not dataend):
proc.poll()
dataend = False
ready = select.select([proc.stdout, proc.stderr], [], [], 1.0)
if proc.stderr in ready[0]:
data = proc.stderr.read(1024)
if len(data) > 0:
handle_stderr_data(data)
if proc.stdout in ready[0]:
data = proc.stdout.read(1024)
if len(data) == 0: # Read of zero bytes means EOF
dataend = True
else:
outf.write(data)

A lot of output is subjective so it's a little difficult to make a recommendation. If the amount of output is really large then you likely don't want to grab it all with a single read() call anyway. You may want to try writing the output to a file and then pull the data in incrementally like such:
f=file('data.out','w')
p = subprocess.Popen(cmd, shell=True, stdout=f, stderr=subprocess.PIPE)
errcode = p.wait()
f.close()
if errcode:
errmess = p.stderr.read()
log.error('cmd failed <%s>: %s' % (errcode,errmess))
for line in file('data.out'):
#do something

Glenn Maynard is right in his comment about deadlocks. However, the best way of solving this problem is two create two threads, one for stdout and one for stderr, which read those respective streams until exhausted and do whatever you need with the output.
The suggestion of using temporary files may or may not work for you depending on the size of output etc. and whether you need to process the subprocess' output as it is generated.
As Heikki Toivonen has suggested, you should look at the communicate method. However, this buffers the stdout/stderr of the subprocess in memory and you get those returned from the communicate call - this is not ideal for some scenarios. But the source of the communicate method is worth looking at.
Another example is in a package I maintain, python-gnupg, where the gpg executable is spawned via subprocess to do the heavy lifting, and the Python wrapper spawns threads to read gpg's stdout and stderr and consume them as data is produced by gpg. You may be able to get some ideas by looking at the source there, as well. Data produced by gpg to both stdout and stderr can be quite large, in the general case.

I had the same problem. If you have to handle a large output, another good option could be to use a file for stdout and stderr, and pass those files per parameter.
Check the tempfile module in python: https://docs.python.org/2/library/tempfile.html.
Something like this might work
out = tempfile.NamedTemporaryFile(delete=False)
Then you would do:
Popen(... stdout=out,...)
Then you can read the file, and erase it later.

You could try communicate and see if that solves your problem. If not, I'd redirect the output to a temporary file.

Here is simple approach which captures both regular output plus error output, all within Python so limitations in stdout don't apply:
com_str = 'uname -a'
command = subprocess.Popen([com_str], stdout=subprocess.PIPE, shell=True)
(output, error) = command.communicate()
print output
Linux 3.11.0-20-generic SMP Fri May 2 21:32:55 UTC 2014
and
com_str = 'id'
command = subprocess.Popen([com_str], stdout=subprocess.PIPE, shell=True)
(output, error) = command.communicate()
print output
uid=1000(myname) gid=1000(mygrp) groups=1000(cell),0(root)

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Gather subprocess output nonblocking in Python - python

Related

Avoid Deadlock wtih Popen and stdout = PIPE in python

Python Popen().stdout.read() hang

Proper way of re-using and closing a subprocess object

Best way to fork multiple shell commands/processes in Python?

Using subprocess.Popen for Process with Large Output

Categories

Resources