Using subprocess.Popen for Process with Large Output


I have some Python code that executes an external app which works fine when the app has a small amount of output, but hangs when there is a lot. My code looks like:
p = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
errcode = p.wait()
retval = p.stdout.read()
errmess = p.stderr.read()
if errcode:
    log.error('cmd failed <%s>: %s' % (errcode,errmess))
There are comments in the docs that seem to indicate the potential issue. Under wait, there is:
Warning: This will deadlock if the child process generates enough output to a stdout or stderr pipe such that it blocks waiting for the OS pipe buffer to accept more data. Use communicate() to avoid that.
though under communicate, I see:
Note The data read is buffered in memory, so do not use this method if the data size is large or unlimited.
So it is unclear to me that I should use either of these if I have a large amount of data. They don't indicate what method I should use in that case.
I do need the return value from the exec and do parse and use both the stdout and stderr.
So what is an equivalent method in Python to exec an external app that is going to have large output?

You're doing blocking reads to two files; the first needs to complete before the second starts. If the application writes a lot to stderr, and nothing to stdout, then your process will sit waiting for data on stdout that isn't coming, while the program you're running sits there waiting for the stuff it wrote to stderr to be read (which it never will be--since you're waiting for stdout).
There are a few ways you can fix this.
The simplest is to not intercept stderr; leave stderr=None. Errors will be output to stderr directly. You can't intercept them and display them as part of your own message. For commandline tools, this is often OK. For other apps, it can be a problem.
Another simple approach is to redirect stderr to stdout, so you only have one incoming file: set stderr=STDOUT. This means you can't distinguish regular output from error output. This may or may not be acceptable, depending on how the application writes output.
The complete and complicated way of handling this is select (http://docs.python.org/library/select.html). This lets you read in a non-blocking way: you get data whenever data appears on either stdout or stderr. I'd only recommend this if it's really necessary. This probably doesn't work in Windows.
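As a rough sketch of the stderr=STDOUT option described above: with only one pipe to drain, a plain blocking read loop is enough. The helper name run_merged, the chunk size, and the test child (a Python one-liner started through sys.executable) are assumptions made for illustration, not part of the original answer.

```python
import subprocess
import sys

def run_merged(cmd, chunk_size=4096):
    """Run cmd with stderr folded into stdout; return (exit_code, output_bytes)."""
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
    chunks = []
    # Only one pipe exists, so reading it to EOF cannot starve another pipe.
    for chunk in iter(lambda: proc.stdout.read(chunk_size), b""):
        chunks.append(chunk)  # or process each chunk here instead of storing it
    proc.stdout.close()
    return proc.wait(), b"".join(chunks)

# Hypothetical child: 200 kB on stderr plus a short marker on stdout.
child = [sys.executable, "-c",
         "import sys; sys.stderr.write('x' * 200000); sys.stdout.write('ok')"]
code, output = run_merged(child)
```

The list is only there to make the result easy to inspect; for really large output you would handle each chunk as it arrives rather than accumulate it.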

Reading stdout and stderr independently with very large output (ie, lots of megabytes) using select:
import subprocess, select
proc = subprocess.Popen(cmd, bufsize=8192, shell=False, \
    stdout=subprocess.PIPE, stderr=subprocess.PIPE)
with open(outpath, "wb") as outf:
    dataend = False
    while (proc.returncode is None) or (not dataend):
        proc.poll()
        dataend = False
        ready = select.select([proc.stdout, proc.stderr], [], [], 1.0)
        if proc.stderr in ready[0]:
            data = proc.stderr.read(1024)
            if len(data) > 0:
                handle_stderr_data(data)
        if proc.stdout in ready[0]:
            data = proc.stdout.read(1024)
            if len(data) == 0: # Read of zero bytes means EOF
                dataend = True
            else:
                outf.write(data)

"A lot of output" is subjective, so it's a little difficult to make a recommendation. If the amount of output is really large then you likely don't want to grab it all with a single read() call anyway. You may want to try writing the output to a file and then pulling the data in incrementally, like this:
f = open('data.out', 'w')
p = subprocess.Popen(cmd, shell=True, stdout=f, stderr=subprocess.PIPE)
errcode = p.wait()
f.close()
if errcode:
    errmess = p.stderr.read()
    log.error('cmd failed <%s>: %s' % (errcode, errmess))
for line in open('data.out'):
    pass  # do something with line

Glenn Maynard is right in his comment about deadlocks. However, the best way of solving this problem is to create two threads, one for stdout and one for stderr, which read those respective streams until exhausted and do whatever you need with the output.
The suggestion of using temporary files may or may not work for you depending on the size of output etc. and whether you need to process the subprocess' output as it is generated.
As Heikki Toivonen has suggested, you should look at the communicate method. However, this buffers the stdout/stderr of the subprocess in memory and you get those returned from the communicate call - this is not ideal for some scenarios. But the source of the communicate method is worth looking at.
Another example is in a package I maintain, python-gnupg, where the gpg executable is spawned via subprocess to do the heavy lifting, and the Python wrapper spawns threads to read gpg's stdout and stderr and consume them as data is produced by gpg. You may be able to get some ideas by looking at the source there, as well. Data produced by gpg to both stdout and stderr can be quite large, in the general case.
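A minimal sketch of the two-thread approach, assuming the output can be handled as raw byte chunks. The names drain and run_with_threads, the 8192-byte chunk size, and the test child are all hypothetical; this is not the code used by communicate() or python-gnupg.

```python
import subprocess
import sys
import threading

def drain(stream, sink):
    """Read a pipe to EOF in fixed-size chunks, handing each chunk to sink."""
    for chunk in iter(lambda: stream.read(8192), b""):
        sink(chunk)
    stream.close()

def run_with_threads(cmd, on_stdout, on_stderr):
    """Start cmd and drain stdout and stderr concurrently; return the exit code."""
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    readers = [
        threading.Thread(target=drain, args=(proc.stdout, on_stdout)),
        threading.Thread(target=drain, args=(proc.stderr, on_stderr)),
    ]
    for t in readers:
        t.start()
    for t in readers:
        t.join()  # both pipes have reached EOF once the joins return
    return proc.wait()

# Hypothetical child that fills both pipes well past a typical pipe buffer.
out_parts, err_parts = [], []
child = [sys.executable, "-c",
         "import sys; sys.stdout.write('o' * 300000); sys.stderr.write('x' * 300000)"]
exit_code = run_with_threads(child, out_parts.append, err_parts.append)
```

Because each pipe has its own reader, the child never blocks on one stream while the parent is waiting on the other. The sink callbacks run in the reader threads, so anything they touch should be safe to use from there.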

I had the same problem. If you have to handle a large output, another good option could be to use files for stdout and stderr, and pass those files as parameters.
Check the tempfile module in python: https://docs.python.org/2/library/tempfile.html.
Something like this might work
out = tempfile.NamedTemporaryFile(delete=False)
Then you would do:
Popen(... stdout=out,...)
Then you can read the file, and erase it later.
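Putting those pieces together might look like the sketch below. It assumes tempfile.TemporaryFile is acceptable in place of NamedTemporaryFile(delete=False), so the files clean themselves up; the helper name run_to_tempfiles and the test child are made up for the example.

```python
import subprocess
import sys
import tempfile

def run_to_tempfiles(cmd):
    """Run cmd with stdout and stderr spooled to anonymous temporary files."""
    with tempfile.TemporaryFile() as out, tempfile.TemporaryFile() as err:
        exit_code = subprocess.call(cmd, stdout=out, stderr=err)
        # The child moved the file position to the end; rewind before reading.
        out.seek(0)
        err.seek(0)
        line_count = sum(1 for _ in out)  # iterate lazily, one line at a time
        return exit_code, line_count, err.read()

# Hypothetical child: many stdout lines and a short stderr message.
child = [sys.executable, "-c",
         "import sys\nfor i in range(50000): print(i)\nsys.stderr.write('warn')"]
code, lines, err_text = run_to_tempfiles(child)
```

Since nothing is piped, the child can never block on a full pipe, and the parent only holds one line of stdout in memory at a time.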

You could try communicate and see if that solves your problem. If not, I'd redirect the output to a temporary file.

Here is a simple approach that captures both regular output and error output within Python. Note that communicate() holds all of it in memory, and that stderr has to be piped as well, otherwise error is always None:
com_str = 'uname -a'
command = subprocess.Popen(com_str, stdout=subprocess.PIPE, stderr=subprocess.PIPE, shell=True)
(output, error) = command.communicate()
print output
Linux 3.11.0-20-generic SMP Fri May 2 21:32:55 UTC 2014
and
com_str = 'id'
command = subprocess.Popen(com_str, stdout=subprocess.PIPE, stderr=subprocess.PIPE, shell=True)
(output, error) = command.communicate()
print output
uid=1000(myname) gid=1000(mygrp) groups=1000(cell),0(root)

Related

Gather subprocess output nonblocking in Python

Is there an easy way of gathering the output of a subprocess without actually waiting for it?
I can think of creating a subprocess.Popen() with capturing its stdout, then call p.communicate(), but that would block until the subprocess terminates.
I can think of using subprocess.check_output() or similar, but that also would block.
I need something which I can start, then do other stuff, then check the subprocess for being terminated, and in case it is, takes its output.
I can think of two rather complicated ways to achieve this:
Redirect the output into a file, then after termination I can read the output from that file.
Implement and start a handler thread(!) which constantly tries to read data from the stdout of the subprocess and adds it to a buffer.
The first one needs temporary files and disk I/O which I do not really like in my case. The second one means implementing quite a bit.
I guess there might be a simpler way I couldn't think of yet, or a ready-to-be-used solution in some library I didn't find yet.

What's wrong with calling check_output in a thread?
import threading,subprocess
output = ""
def f():
    global output
    output = subprocess.check_output("ls") # ["cmd","/c","dir"] for windows
t = threading.Thread(target=f)
t.start()
print('Started')
t.join()
print(output)
Note that one could be tempted to use p = subprocess.Popen(cmd,stdout=subprocess.PIPE), wait until p.poll() is no longer None, and only then read p.stdout. That only works when the output is small; otherwise you get a deadlock, because the pipe buffer fills up and has to be drained while the process is still running.
Using p.stdout.readline() would work but would also block if the process doesn't print on a regular basis. If your application prints to the output all the time, then you can consider it as non-blocking and the solution is acceptable.
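One way to get the "start it, do other stuff, check later" behaviour without writing the thread handling by hand is a concurrent.futures executor. This is a sketch under the assumption that Python 3 is available; the child command and the ticks counter are placeholders.

```python
import subprocess
import sys
import time
from concurrent.futures import ThreadPoolExecutor

# Hypothetical child that takes a moment before printing its result.
child = [sys.executable, "-c", "import time; time.sleep(0.3); print('finished')"]

with ThreadPoolExecutor(max_workers=1) as pool:
    # check_output runs in a worker thread and drains the pipe there.
    future = pool.submit(subprocess.check_output, child)
    ticks = 0
    while not future.done():
        ticks += 1        # stand-in for the "other stuff"
        time.sleep(0.05)
    result = future.result()  # raises CalledProcessError on a non-zero exit
```

future.done() never blocks, so the main thread stays responsive, and the whole output is still buffered in memory as with check_output called directly.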

I think what you want is an unbuffered stdout stream.
With that you will be able to capture the output of your process without waiting for it to finish.
You can achieve that with the subprocess.Popen() function and the parameter stdout=subprocess.PIPE.
Try something like this
proc = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE)
line = proc.stdout.readline()
while line:
    print line
    line = proc.stdout.readline()

Python subprocess timing out?

I have a script that runs another command, waits for it to finish, logs the stdout and stderr, and based on the return code does other stuff. Here is the code:
p = subprocess.Popen(command, stdin=subprocess.PIPE, stderr=subprocess.PIPE, stdout=subprocess.PIPE)
o, e = p.communicate()
if p.returncode:
    pass  # report error
# do other stuff
The problem I'm having is that if command takes a long time to run, none of the other actions get done. The possible errors won't get reported, and the other stuff that needs to happen if there are no errors doesn't get done. It essentially doesn't go past p.communicate() if it takes too long. Sometimes this command can take hours (or even longer) to run and sometimes it can take as little as 5 seconds.
Am I missing something or doing something wrong?

As per the documentation, it's safe to say that your code is waiting for the subprocess to finish.
If you need to go do 'other things' while you wait you could create a loop like:
# poll() returns None for as long as the process is still running
while p.poll() is None:
    # 'other things'
    time.sleep(0.2)
Pick a sleep time that's reasonable for how often you want python to wake up and check the subprocess as well as doing its 'other things'.
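When stdout and stderr are pipes, a bare poll() loop leaves nobody draining them, so a chatty child can still stall. A hedged alternative, assuming Python 3.3 or later, is to call communicate() with a timeout and retry: the subprocess documentation states that catching TimeoutExpired and retrying does not lose output. The child command and the waits counter are placeholders.

```python
import subprocess
import sys

# Hypothetical child: a large burst of output, then a pause before exiting.
child = [sys.executable, "-c",
         "import time; print('x' * 100000); time.sleep(0.5)"]
proc = subprocess.Popen(child, stdout=subprocess.PIPE, stderr=subprocess.PIPE)

waits = 0
while True:
    try:
        # Drains both pipes for up to 0.1 s, then hands control back.
        out, err = proc.communicate(timeout=0.1)
        break
    except subprocess.TimeoutExpired:
        waits += 1  # do the 'other things' here, then try again
```

The timeout plays the role of the sleep interval in the loop above it: it bounds how long the script goes without doing its other work.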

The Popen.communicate waits for the process to finish, before anything is returned. Thus it is not ideal for any long running command; and even less so if the subprocess can hang waiting for input, say prompting for a password.
The stderr=subprocess.PIPE, stdout=subprocess.PIPE arguments are needed only if you want to capture the output of the command into a variable. If you are OK with the output going to your terminal, you can remove both, and even use subprocess.call instead of Popen. Also, if you do not provide input to your subprocess, do not use stdin=subprocess.PIPE at all; direct stdin from the null device instead (in Python 3.3+ you can use stdin=subprocess.DEVNULL; in Python <3.3 use stdin=open(os.devnull, 'rb')).
If you need the contents too, then instead of calling p.communicate(), you can read p.stdout and p.stderr yourself in chunks and output to the terminal, but it is a bit complicated, as it is easy to deadlock the program: the naive approach would try to read from the subprocess' stdout while the subprocess wants to write to stderr. For this case there are 2 remedies:
you could use select.select to poll both stdout and stderr, see whichever becomes ready first, and read from that one;
or, if you do not mind stdout and stderr being combined into one, you can use STDOUT to redirect the stderr stream into the stdout stream: stdout=subprocess.PIPE, stderr=subprocess.STDOUT. Now all the output comes to p.stdout, which you can read in a loop and output in chunks without worrying about deadlocks.
If the stdout, stderr are going to be huge, you can also spool them to a file right there in Popen; say,
stdout = open('stdout.txt', 'w+b')
stderr = open('stderr.txt', 'w+b')
p = subprocess.Popen(..., stdout=stdout, stderr=stderr)
while p.poll() is None:
    # reading at the end of the file will return an empty string
    err = stderr.read()
    print(err)
    out = stdout.read()
    print(out)
    # if we met the end of the file, then we can sleep a bit
    # here to avoid spending excess CPU cycles just to poll;
    # another option would be to use `select`
    if not err and not out: # no input, sleep a bit
        time.sleep(0.01)
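A variant of the same idea, sketched with a second, read-only handle on the file so that the reader's position is independent of the position the child writes at. The temporary path, the -u flag on the hypothetical Python child (so its lines reach the file promptly), and the variable names are assumptions for illustration.

```python
import os
import subprocess
import sys
import tempfile
import time

# Hypothetical child: prints five lines slowly; -u disables its own buffering.
child = [sys.executable, "-u", "-c",
         "import time\nfor i in range(5):\n    print(i)\n    time.sleep(0.05)"]

fd, path = tempfile.mkstemp()
os.close(fd)
seen = b""
with open(path, "wb") as sink, open(path, "rb") as tail:
    proc = subprocess.Popen(child, stdout=sink, stderr=subprocess.STDOUT)
    while True:
        # Check for exit *before* reading so the final read catches everything.
        finished = proc.poll() is not None
        chunk = tail.read()
        if chunk:
            seen += chunk      # new data appeared since the last read
        elif finished:
            break              # process exited and the file is fully read
        else:
            time.sleep(0.01)   # nothing new yet; avoid a busy loop
os.remove(path)
```

Here the output is accumulated in seen only to keep the example checkable; in practice each chunk would be processed and dropped.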

When to use subprocess.call() or subprocess.Popen(), running airodump

I have this little script that puts your wireless device into monitor mode. It does an airodump scan and then after terminating the scan dumps the output to file.txt or a variable, so then I can scrape the BSSID and whatever other info I may need.
I feel I haven't grasped the concept or difference between subprocess.call() and subprocess.Popen().
This is what I currently have:
def setup_device():
    try:
        output = open("file.txt", "w")
        put_device_down = subprocess.call(["ifconfig", "wlan0", "down"])
        put_device_mon = subprocess.call(["iwconfig", "wlan0", "mode", "monitor"])
        put_device_up = subprocess.call(["iwconfig", "wlano", "up"])
        start_device = subprocess.call(["airmon-ng", "start", "wlan0"])
        scanned_networks = subprocess.Popen(["airodump-ng", "wlan0"], stdout = output)
        time.sleep(10)
        scanned_networks.terminate()
    except Exception, e:
        print "Error:", e
I am still clueless about where and when and in which way to use subprocess.call() and subprocess.Popen()
The thing that I think is confusing me most is the stdout and stderr args. What is PIPE?
Another thing that I could possibly fix myself once I get a better grasp is this:
When running subprocess.Popen() and running airodump, the console window pops up showing the scan. Is there a way to hide this from the user to sort of clean things up?

You don't have to use Popen() if you don't want to. The other functions in the module, such as .call(), use Popen() internally and give you a simpler API to do what you want.
All console applications have 3 'file' streams: stdin for input, and stdout and stderr for output. The application decides what to write where; usually error and diagnostic information to stderr, the rest to stdout. If you want to capture the output for either of these outputs in your Python program, you specify the subprocess.PIPE argument so that the 'stream' is redirected into your program. Hence the name.
If you want to capture the output of the airodump-ng wlan0 command, it's easiest to use the subprocess.check_output() function; it takes care of the PIPE argument for you:
scanned_networks = subprocess.check_output(["airodump-ng", "wlan0"])
Now scanned_networks contains whatever airodump-ng wrote to its stdout stream.
If you need to have more control over the process, then you do need to use the Popen() class:
proc = subprocess.Popen(["airodump-ng", "wlan0"], stdout=subprocess.PIPE)
for line in proc.stdout:
    pass  # do something with line
proc.terminate()
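To make the difference between the three entry points concrete, here is a small side-by-side sketch. It uses throwaway Python one-liners started through sys.executable as stand-ins for airodump-ng, and assumes Python 3.3+ for subprocess.DEVNULL.

```python
import subprocess
import sys

hello = [sys.executable, "-c", "print('hello')"]

# call(): wait for the command and return only its exit status.
status = subprocess.call(hello, stdout=subprocess.DEVNULL)

# check_output(): wait, then return everything the command wrote to stdout.
captured = subprocess.check_output(hello)

# Popen(): start the command and keep a handle, e.g. to stop it later,
# which is what a scan that never exits on its own needs.
sleeper = subprocess.Popen(
    [sys.executable, "-c", "import time; time.sleep(60)"],
    stdout=subprocess.PIPE,
)
sleeper.terminate()
sleeper.wait()          # reap the process so returncode is set
sleeper.stdout.close()
```

call() and check_output() both block until the command exits, so only Popen() fits a command that has to be interrupted after a fixed time.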

Alternatives to Python Popen.communicate() memory limitations?

I have the following chunk of Python code (running v2.7) that results in MemoryError exceptions being thrown when I work with large (several GB) files:
myProcess = Popen(myCmd, shell=True, stdout=PIPE, stderr=PIPE)
myStdout, myStderr = myProcess.communicate()
sys.stdout.write(myStdout)
if myStderr:
    sys.stderr.write(myStderr)
In reading the documentation to Popen.communicate(), there appears to be some buffering going on:
Note The data read is buffered in memory, so do not use this method if the data size is large or unlimited.
Is there a way to disable this buffering, or force the cache to be cleared periodically while the process runs?
What alternative approach should I use in Python for running a command that streams gigabytes of data to stdout?
I should note that I need to handle output and error streams.

I think I found a solution:
myProcess = Popen(myCmd, shell=True, stdout=PIPE, stderr=PIPE)
for ln in myProcess.stdout:
    sys.stdout.write(ln)
for ln in myProcess.stderr:
    sys.stderr.write(ln)
This seems to get my memory usage down enough to get through the task.
Update
I have recently found a more flexible way of handling data streams in Python, using threads. It's interesting that Python is so poor at something that shell scripts can do so easily!

What I would probably do instead, if I needed to read the stdout for something that large, is send it to a file on creation of the process.
with open(my_large_output_path, 'w') as fo:
    with open(my_large_error_path, 'w') as fe:
        myProcess = Popen(myCmd, shell=True, stdout=fo, stderr=fe)
Edit: If you need to stream, you could try making a file-like object and passing it to stdout and stderr. (I haven't tried this, though.) You could then read (query) from the object as it's being written.
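If stdout has to be streamed while stderr only needs to be inspected afterwards, one possible middle ground is to pipe stdout and copy it in fixed-size chunks, while stderr goes to a temporary file so it can never fill a pipe. The helper name stream_stdout, the 64 kB chunk size, and the io.BytesIO stand-in for sys.stdout.buffer are assumptions for this sketch.

```python
import io
import shutil
import subprocess
import sys
import tempfile

def stream_stdout(cmd, dest, chunk_size=64 * 1024):
    """Copy cmd's stdout to dest chunk by chunk; return (exit_code, stderr_bytes)."""
    with tempfile.TemporaryFile() as err_file:
        proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=err_file)
        # Only chunk_size bytes of stdout are held in memory at any moment.
        shutil.copyfileobj(proc.stdout, dest, chunk_size)
        proc.stdout.close()
        exit_code = proc.wait()
        err_file.seek(0)
        return exit_code, err_file.read()

# Hypothetical child: 1 MB on stdout and a short message on stderr.
child = [sys.executable, "-c",
         "import sys; sys.stdout.write('o' * 1000000); sys.stderr.write('warning')"]
sink = io.BytesIO()  # stand-in for sys.stdout.buffer or an open file
code, err_text = stream_stdout(child, sink)
```

Memory use stays bounded by the chunk size regardless of how much the command prints, and there is only one pipe to drain, so no second thread is needed.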

For those whose application hangs after a certain amount of time when using Popen, please see my case below:
A rule of thumb: if you're not going to use the stderr and stdout streams, don't pass them as parameters to Popen, because the pipes will fill up and cause you a lot of problems.
If you need them for a certain amount of time and you need to keep the process running, then you can close those streams at any time.
try:
    p = Popen(COMMAND, stdout=PIPE, stderr=PIPE)
    # After using stdout and stderr
    p.stdout.close()
    p.stderr.close()
except Exception as e:
    pass

scrambled output from a child process run from subprocess

I'm using the following code to run another Python script. The problem I'm facing is that the output of that script is coming out in the wrong order.
While running it from the command line, I get the correct output i.e. :
some output here
Editing xml file and saving changes
Uploading xml file back..
While running the script using subprocess, I am getting some of the output in reverse order:
correct output till here
Uploading xml file back..
Editing xml file and saving changes
The script is executing without errors and making the right changes. So I think the culprit might be the code that is calling the child script, but I can't find the problem:
cmd = "child_script.py"
proc = subprocess.Popen(cmd.split(), stdout=subprocess.PIPE,stderr=subprocess.STDOUT)
(fout ,ferr) = ( proc.stdout, proc.stderr )
print "Going inside while - loop"
while True:
    line = proc.stdout.readline()
    print line
    fo.write(line)
    try :
        err = ferr.readline()
        fe.write(err)
    except Exception, e:
        pass
    if not line:
        pass
        break
[EDIT]: fo and fe are file handles to output and error logs. Also the script is being run on Windows. Sorry for missing these details.

There are a few problems with the part of the script you've quoted, I'm afraid:
As mentioned in detly's comment, what are fo and fe? Presumably those are objects to which you're writing the output of the child process? (Update: you indicate that these are both for writing output logs.)
There's an indentation error on line 3. (Update: I've fixed that in the original post.)
You're specifying stderr=subprocess.STDOUT, so: (a) ferr will always be None in your loop and (b) due to buffering, standard output and error may be mixed in an unpredictable way. However, it looks from your code as if you actually want to deal with standard output and standard error separately, so perhaps try stderr=subprocess.PIPE instead.
It would be a good idea to rewrite your loop as jsbueno suggests:
from subprocess import Popen, PIPE
proc = Popen(["child_script.py"], stdout=PIPE, stderr=PIPE)
fout, ferr = proc.stdout, proc.stderr
for line in fout:
    print(line.rstrip())
    fo.write(line)
for line in ferr:
    fe.write(line)
... or to reduce it even further, since it seems that the aim is essentially that you just want to write the standard output and standard error from the child process to fo and fe, just do:
proc = subprocess.Popen(["child_script.py"], stdout=fo, stderr=fe)
If you still see the output lines swapped in the file that fo is writing to, then we can only assume that there is some way in which this can happen in the child script. e.g. is the child script multi-threaded? Is one of the lines printed via a callback from another function?

Most of the times I've seen the order of output differ between runs, some output was sent to the C standard IO stream stdout, and some output was sent to stderr. The buffering characteristics of stdout and stderr vary depending on whether they are connected to a terminal, pipes, files, etc.:
NOTES
The stream stderr is unbuffered. The stream stdout is
line-buffered when it points to a terminal. Partial lines
will not appear until fflush(3) or exit(3) is called, or a
newline is printed. This can produce unexpected results,
especially with debugging output. The buffering mode of
the standard streams (or any other stream) can be changed
using the setbuf(3) or setvbuf(3) call. Note that in case
stdin is associated with a terminal, there may also be
input buffering in the terminal driver, entirely unrelated
to stdio buffering. (Indeed, normally terminal input is
line buffered in the kernel.) This kernel input handling
can be modified using calls like tcsetattr(3); see also
stty(1), and termios(3).
So perhaps you should configure both stdout and stderr to go to the same destination, so the same buffering will be applied to both streams.
Also, some programs open the terminal directly open("/dev/tty",...) (mostly so they can read passwords), so comparing terminal output with pipe output isn't always going to work.
Further, if your program is mixing direct write(2) calls with standard IO calls, the order of output can be different based on the different buffering choices.
I hope one of these is right :) let me know which, if any.
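Since the child in this question is itself a Python script, the buffering explanation can be checked with a small experiment. The sketch below assumes Python 3 and uses the interpreter's -u option, which turns off buffering of the child's stdout and stderr; the inline script and the helper name merged_lines are made up for the test.

```python
import subprocess
import sys

# Hypothetical child: one line to stdout, then one line to stderr.
script = "import sys; print('stdout line'); sys.stderr.write('stderr line\\n')"

def merged_lines(extra_args):
    """Run the child with stderr merged into stdout; return the lines received."""
    cmd = [sys.executable] + extra_args + ["-c", script]
    out = subprocess.check_output(cmd, stderr=subprocess.STDOUT)
    return out.decode().splitlines()

# Default buffering: the order seen on the pipe may differ from the write order.
buffered = merged_lines([])
# With -u each write reaches the pipe immediately, so the order is preserved.
unbuffered = merged_lines(["-u"])
```

If the two lists differ in order, buffering in the child is the likely cause of the swapped lines, and running the child with -u (or flushing explicitly inside it) would be worth trying.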
