Python subprocess and mysqldump

I know parts of this question have been asked before, but I have some related questions.
I'm trying to execute
mysqldump -u uname -ppassword --add-drop-database --databases databaseName | gzip > fileName
I'm potentially dumping a very large (200GB?) db. Is that in itself a dumb thing to do? I then want to send the zipped file over the network for storage, delete the local dump, and purge a couple of tables.
Anyway, I was using subprocess like this, because there doesn't seem to be a way to execute the entire original command without subprocess treating | as a table name:
from subprocess import Popen, PIPE
f = open(FILENAME, 'wb')
args = ['mysqldump', '-u', 'UNAME', '-pPASSWORD', '--add-drop-database', '--databases', 'DB']
p1 = Popen(args, stdout=PIPE)
p2 = Popen('gzip', stdin=p1.stdout, stdout=f)
p2.communicate()
but then I read that communicate caches the data in memory, which wouldn't work for me. Is this true?
What I ended up doing for now is:
import gzip
import subprocess

f = open(FILENAME, 'wb')
subprocess.call(args, stdout=f)
f.close()

f = open(FILENAME, 'rb')
zipFilename = FILENAME + '.gz'
f2 = gzip.open(zipFilename, 'wb')
f2.writelines(f)
f2.close()
f.close()
of course this takes a million years, and I hate it.
My Questions:
1. Can I use my first approach on a very large db?
2. Could I possibly pipe the output of mysqldump to a socket and fire it across the network and save it when it arrives, rather than sending a zipped file?
Thanks!

You don't need communicate(). It's only there as a convenience method if you want to read stdout/stderr to completion. But since you are chaining the commands, they are doing that for you. Just wait for them to complete.
from subprocess import Popen, PIPE
args = ['mysqldump', '-u', 'UNAME', '-pPASSWORD', '--add-drop-database', '--databases', 'DB']
with open(FILENAME, 'wb', 0) as f:
    p1 = Popen(args, stdout=PIPE)
    p2 = Popen('gzip', stdin=p1.stdout, stdout=f)
    p1.stdout.close() # force write error (/SIGPIPE) if p2 dies
    p2.wait()
    p1.wait()
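Not part of the original answer, but if you want the script to notice failures (bad credentials, a full disk, a missing gzip), you could also check the exit statuses afterwards, for example:
if p1.returncode != 0 or p2.returncode != 0:
    raise RuntimeError('dump pipeline failed: mysqldump=%s, gzip=%s'
                       % (p1.returncode, p2.returncode))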

You are quite close to where you want:
from subprocess import Popen, PIPE
f = open(FILENAME, 'wb')
args = ['mysqldump', '-u', 'UNAME', '-pPASSWORD', '--add-drop-database', '--databases', 'DB']
p1 = Popen(args, stdout=PIPE)
Up to here it is correct.
p2 = Popen('gzip', stdin=p1.stdout, stdout=PIPE)
This one takes p1's output and processes it. Afterwards we can (and should) immediately call p1.stdout.close().
Now we have a p2.stdout which can be read from and, without using a temporary file, send it via the network:
import socket

s = socket.create_connection(('remote_pc', port))
while True:
    r = p2.stdout.read(65536)
    if not r: break
    s.send(r)
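The receiving side (your question 2) would then be a small server that writes whatever arrives to disk. A minimal sketch, assuming the same port and a made-up output filename:
import socket

srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.bind(('', port))          # same port as used by the sender
srv.listen(1)
conn, addr = srv.accept()
out = open('dump.sql.gz', 'wb')
while True:
    chunk = conn.recv(65536)
    if not chunk: break
    out.write(chunk)
out.close()
conn.close()
srv.close()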

Yup, the data is buffered in memory:
"Note The data read is buffered in memory, so do not use this method
if the data size is large or unlimited." - subprocess docs
Unfortunately at the moment there is no way to asynchronously use Popen: PEP3145
Rather than doing this all in Python you can manually do:
os.system("mysqldump -u uname -ppassword --add-drop-database --databases databaseName | gzip > fileName")
with the appropriate string substitutions (using str.format, of course); otherwise you're putting an unnecessary amount of stress on your computer, especially when trying to communicate 200GB via a pipe ...
Can you elaborate on what you are trying to do? Right now it sounds like you're both dumping and zipping on the same computer.
Yes, you can stream a file across the network... I don't know if you want to stream the output of mysqldump directly though - you might want to look at your network capabilities before considering that.
bash:
#!/bin/bash
mysqldump -u uname -ppassword --add-drop-database --databases databaseName | gzip > fileName
#transfer fileName to other computer
^ you can also put this in a crontab and have it run at intervals :)

Your example code using two subprocess.Popen calls is correct (albeit slightly-improve-able), and this:
... I read that communicate caches the data in memory
is also correct—it reads into memory all of the standard-output and standard-error-output that the "communicating command" produces on a subprocess.PIPE—but not a problem here, because you have this:
p1 = Popen(args, stdout=PIPE)
p2 = Popen('gzip', stdin=p1.stdout, stdout=f)
p2.communicate()
You're calling communicate() on p2, whose stdout output is sent to f (an opened file), and whose stderr output—which is probably empty anyway (no errors occur)—is not being sent to a PIPE. Thus, p2.communicate() would at worst have to read and buffer-up a grand total of zero bytes of stdout plus zero bytes of stderr. It's actually a bit more clever, noticing that there is no PIPE, so it returns the tuple (None, None).
If you were to call p1.communicate(), that would be more of an issue (although in this case you'd then be fighting with p2, the gzip process, for the output from p1, which would be even worse). But you are not; p1's output flows to p2, and p2's output flows to a file.
Since none of p2's output is sent to a PIPE, there is no need to call p2.communicate() here: you can simply call p2.wait(). That makes it clearer that there's no data flowing back from p2 (which I would say is a minor improvement to the code, although if you decide you want to capture p2's stderr after all, you'd have to change that back).
Edit to add: as in glglgl's answer, it's important to close p1's pipe to p2 after creating p2, otherwise p2 will wait for your Python process to send data to p2, too.

Related

Python subprocess reading process terminates before writing process example, clarification needed

Code snippet from: http://docs.python.org/3/library/subprocess.html#replacing-shell-pipeline
output=`dmesg | grep hda`
# becomes
p1 = Popen(["dmesg"], stdout=PIPE)
p2 = Popen(["grep", "hda"], stdin=p1.stdout, stdout=PIPE)
p1.stdout.close() # Allow p1 to receive a SIGPIPE if p2 exits.
output = p2.communicate()[0]
Question: I do not quite understand why this line is needed: p1.stdout.close()?
What if, by doing this, p1's stdout is closed even before it has finished outputting data while p2 is still alive? Aren't we risking that by closing p1.stdout so soon? How does this work?
p1.stdout.close() closes Python's copy of the file descriptor. p2 already has that descriptor open (via stdin=p1.stdout), so closing Python's copy doesn't affect p2. However, that pipe end is now open only once, so when it closes (e.g. if p2 dies), p1 will see the pipe close and will get SIGPIPE.
If you didn't close p1.stdout in Python, and p2 died, p1 would get no signal because Python's descriptor would be holding the pipe open.
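A quick way to see the effect (not from the original answer; 'yes' and 'head' are just convenient stand-ins): with the close in place, the producer terminates as soon as the consumer goes away.
from subprocess import Popen, PIPE

p1 = Popen(['yes'], stdout=PIPE)                              # endless output
p2 = Popen(['head', '-n', '1'], stdin=p1.stdout, stdout=PIPE)
p1.stdout.close()   # only head holds the read end now
p2.communicate()    # head prints one line and exits
p1.wait()           # returns promptly; without the close() above, yes would
                    # block on a full pipe and p1.wait() would hang
# p1.returncode is negative (killed by SIGPIPE) or a write-error exit,
# depending on platform and Python version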
Pipes are external to processes (it's an operating system thing) and are accessed by processes using read and write handles. Many processes can have handles to the pipe and can read and write in all sorts of disastrous ways if not managed properly. Pipes close when all handles to the pipes are closed.
Although process execution works differently on Linux and Windows, here is basically what happens (I'm going to get killed on this!):
p1 = Popen(["dmesg"], stdout=PIPE)
Create pipe_1, give a write handle to dmesg as its stdout, and return a read handle in the parent as p1.stdout. You now have 1 pipe with 2 handles (pipe_1 write in dmesg, pipe_1 read in the parent).
p2 = Popen(["grep", "hda"], stdin=p1.stdout, stdout=PIPE)
Create pipe_2. Give grep a write handle to pipe_2 and a copy of the read handle to pipe_1. You now have 2 pipes and 5 handles (pipe_1 write in dmesg, pipe_1 read and pipe_2 write in grep, pipe_1 read and pipe_2 read in the parent).
p1.stdout.close() # Allow p1 to receive a SIGPIPE if p2 exits.
Notice that pipe_1 has two read handles. You want grep to have the read handle so that it reads dmesg data. You don't need the handle in the parent any more. Close it so that there is only 1 read handle on pipe_1. If grep dies, its pipe_1 read handle is closed, the operating system notices there are no remaining read handles for pipe_1 and gives dmesg the bad news.
output = p2.communicate()[0]
dmesg sends data to stdout (the pipe_1 write handle) which begins filling pipe_1. grep reads stdin (the pipe_1 read handle) which empties pipe_1. grep also writes stdout (the pipe_2 write handle) filling pipe_2. The parent process reads pipe_2... and you got yourself a pipeline!
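The same handle accounting can be seen with a bare os.pipe(), which is roughly what Popen builds on (a minimal sketch, not from the original answer):
import os

r, w = os.pipe()          # one pipe: one read handle, one write handle
os.write(w, b'hello')
os.close(w)               # close the only write handle
print os.read(r, 100)     # 'hello' - data already in the pipe can still be read
print os.read(r, 100)     # '' - EOF, because no write handles remain
os.close(r)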

Linking subprocesses in Python

Hi I had a question about linking input and output with sub-processes in python. I am trying to simplify the program by skipping the output of one step by passing it to another subprocess rather than output it to a file. Then open another process to run on that file.
E.g. First process uses SAMTOOLS to output a specific chromosome from a large bam file.
So...
bigfile.bam is read in and outputs chromosome22.bam
The next subprocess uses BEDTOOLS to convert that chromosome22.bam to chromosome22.bed
So...
chromosome22.bam is read in and outputs chromosome22.bed
What I want to do is pass the stdout of the first process into the second so there is no need for the intermediate file.
So far I have this...
for x in 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,'X','Y':
    subprocess.call("%s view -bh %s %s > %s/%s/%s.bam" % (samtools,bam,x,bampath,out,x), shell=True)
This makes the chromosome[1-22,X,Y].bam files. But can I avoid this and put another subprocess command in the same loop to convert them to bed files?
The command for bed conversion is:
bedpath/bedtools bamtobed -i [bamfile] > [bedfile]
Please have a look at the replacing shell pipeline example in the documentation.
output=$(dmesg | grep hda)
becomes:
p1 = Popen(["dmesg"], stdout=PIPE)
p2 = Popen(["grep", "hda"], stdin=p1.stdout, stdout=PIPE)
p1.stdout.close() # Allow p1 to receive a SIGPIPE if p2 exits.
output = p2.communicate()[0]
The explanation is:
The p1.stdout.close() call after starting the p2 is important in order for p1 to receive a SIGPIPE if p2 exits before p1.
No need to use python here. Much easier in shell. But essentially, it works the same as in python.
If bedtools can read from stdin, you can e.g. do
#!/bin/sh
for x in `seq 1 22` X Y; do
$samtools view -bh $bam $x | $bedtools bamtobed > $bampath/$out/$x.bam
done
Depending on how bedtools was designed, you might also need to use -i - to have it read from stdin.
If you stick with Python, I strongly recommend learning how to do this:
without doing it all in shell,
without producing shell commands that you need to escape properly to avoid errors.
subprocess is safer to use when you use the array-based syntax and no shell.
Make that two subprocess invocations, one for each command. See http://docs.python.org/library/subprocess.html#replacing-shell-pipeline for more details.
cmd1 = [samtools, "view", "-bh", bam, x]
cmd2 = [bedtools, "bamtobed"]
c1 = subprocess.Popen(cmd1, stdout=subprocess.PIPE)
c2 = subprocess.Popen(cmd2, stdin=c1.stdout, stdout=open(outputfilename, "w"))
c1.stdout.close()
c2.communicate()
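Dropped into the question's chromosome loop, the same two-Popen pipeline might look like this sketch (samtools, bedtools, bam, bampath and out are the question's variables; the .bed output path is an assumption):
for x in [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,'X','Y']:
    outputfilename = "%s/%s/%s.bed" % (bampath, out, x)
    cmd1 = [samtools, "view", "-bh", bam, str(x)]
    cmd2 = [bedtools, "bamtobed"]
    c1 = subprocess.Popen(cmd1, stdout=subprocess.PIPE)
    c2 = subprocess.Popen(cmd2, stdin=c1.stdout, stdout=open(outputfilename, "w"))
    c1.stdout.close()
    c2.communicate()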
Yes, you can use the pipe functionality. See if you can read from stdin for the bamtobed process ... if you can, try the following. This way you save on the disk IO time assuming the processing load is light.
SLIGHT modification:
proc1.stdout is now the stdin for the 2nd process.
proc1 = subprocess.Popen("%s view -bh %s %s" % (samtools, bam, x), shell=True, stdout=subprocess.PIPE)
proc2 = subprocess.Popen("bedpath/bedtools bamtobed > %s" % (outFileName,), shell=True, stdin=proc1.stdout)

how to concatenate multiple files for stdin of Popen

I'm porting a bash script to python 2.6, and want to replace some code:
cat $( ls -tr xyz_`date +%F`_*.log ) | filter args > bzip2
I guess I want something similar to the "Replacing shell pipe line" example at http://docs.python.org/release/2.6/library/subprocess.html, ala...
p1 = Popen(["filter", "args"], stdin=*?WHAT?*, stdout=PIPE)
p2 = Popen(["bzip2"], stdin=p1.stdout, stdout=PIPE)
output = p2.communicate()[0]
But, I'm not sure how best to provide p1's stdin value so it concatenates the input files. Seems I could add...
p0 = Popen(["cat", "file1", "file2"...], stdout=PIPE)
p1 = ... stdin=p0.stdout ...
...but that seems to be crossing beyond use of (slow, inefficient) pipes to call external programs with significant functionality. (Any decent shell performs the cat internally.)
So, I can imagine a custom class that satisfies the file object API requirements and can therefore be used for p1's stdin, concatenating arbitrary other file objects. (EDIT: existing answers explain why this isn't possible)
Does python 2.6 have a mechanism addressing this need/want, or might another Popen to cat be considered perfectly fine in python circles?
Thanks.
You can replace everything that you're doing with Python code, except for your external utility. That way your program will remain portable as long as your external util is portable. You can also consider turning the C++ program into a library and using Cython to interface with it. As Messa showed, date is replaced with time.strftime, globbing is done with glob.glob and cat can be replaced with reading all the files in the list and writing them to the input of your program. The call to bzip2 can be replaced with the bz2 module, but that will complicate your program because you'd have to read and write simultaneously. To do that, you need to either use p.communicate or a thread if the data is huge (select.select would be a better choice but it won't work on Windows).
import sys
import bz2
import glob
import time
import threading
import subprocess
output_filename = '../whatever.bz2'
input_filenames = glob.glob(time.strftime("xyz_%F_*.log"))
p = subprocess.Popen(['filter', 'args'], stdin=subprocess.PIPE, stdout=subprocess.PIPE)
output = open(output_filename, 'wb')
output_compressor = bz2.BZ2Compressor()
def data_reader():
    for filename in input_filenames:
        f = open(filename, 'rb')
        p.stdin.writelines(iter(lambda: f.read(8192), ''))
    p.stdin.close()

input_thread = threading.Thread(target=data_reader)
input_thread.start()

with output:
    for chunk in iter(lambda: p.stdout.read(8192), ''):
        output.write(output_compressor.compress(chunk))
    output.write(output_compressor.flush())
input_thread.join()
p.wait()
Addition: How to detect file input type
You can use either the file extension or the Python bindings for libmagic to detect how the file is compressed. Here's a code example that does both, and automatically chooses magic if it is available. You can take the part that suits your needs and adapt it. The open_autodecompress function should detect the MIME encoding and open the file with the appropriate decompressor if one is available.
import os
import gzip
import bz2
try:
    import magic
except ImportError:
    has_magic = False
else:
    has_magic = True

mime_openers = {
    'application/x-bzip2': bz2.BZ2File,
    'application/x-gzip': gzip.GzipFile,
}

ext_openers = {
    '.bz2': bz2.BZ2File,
    '.gz': gzip.GzipFile,
}

def open_autodecompress(filename, mode='r'):
    if has_magic:
        ms = magic.open(magic.MAGIC_MIME_TYPE)
        ms.load()
        mimetype = ms.file(filename)
        opener = mime_openers.get(mimetype, open)
    else:
        basepart, ext = os.path.splitext(filename)
        opener = ext_openers.get(ext, open)
    return opener(filename, mode)
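Hypothetical usage (the filename is made up for illustration):
f = open_autodecompress('xyz_2011-05-18_a.log.bz2')
try:
    for line in f:
        print line.rstrip()
finally:
    f.close()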
If you look inside the subprocess module implementation, you will see that std{in,out,err} are expected to be file objects supporting the fileno() method, so a simple concatenating file-like object with a Python interface (or even a StringIO object) is not suitable here.
If it were iterators, not file objects, you could use itertools.chain.
Of course, sacrificing memory consumption, you can do something like this:
import itertools, os
# ...
files = [f for f in os.listdir(".") if os.path.isfile(f)]
input = ''.join(itertools.chain.from_iterable(open(file) for file in files))
p2.communicate(input)
When using subprocess you have to consider the fact that internally Popen will use the file descriptor (handle) and call os.dup2() for stdin, stdout and stderr before passing them to the child process it creates.
So if you don't want to use system shell pipe with Popen:
p0 = Popen(["cat", "file1", "file2"...], stdout=PIPE)
p1 = Popen(["filter", "args"], stdin=p0.stdout, stdout=PIPE)
...
I think your other option is to write a cat function in Python that generates a file in a cat-like way and pass that file to p1's stdin; don't think about a class that implements the io API, because it will not work, as I said: internally the child process just gets the file descriptors.
With that said, I think your best option is to use the Unix pipe approach, as in the subprocess docs.
This should be easy. First, create a pipe using os.pipe, then Popen the filter with read end of the pipe as standard input. Then for each file in the directory with name matching the pattern, just pass its contents to the write end of the pipe. This should be exactly the same what the shell command cat ..._*.log | filter args does.
Update: Sorry, the pipe from os.pipe is not needed; I forgot that subprocess.Popen(..., stdin=subprocess.PIPE) actually creates one for you. Also, a pipe cannot be stuffed with too much data; more data can be written to a pipe only after the previous data is read.
So the solution (for example with wc -l) would be:
import glob
import subprocess
p = subprocess.Popen(["wc", "-l"], stdin=subprocess.PIPE)
processDate = "2011-05-18" # or time.strftime("%F")
for name in glob.glob("xyz_%s_*.log" % processDate):
    f = open(name, "rb")
    # copy all data from f to p.stdin
    while True:
        data = f.read(8192)
        if not data:
            break # reached end of file
        p.stdin.write(data)
    f.close()
p.stdin.close()
p.wait()
Usage example:
$ hexdump /dev/urandom | head -n 10000 >xyz_2011-05-18_a.log
$ hexdump /dev/urandom | head -n 10000 >xyz_2011-05-18_b.log
$ hexdump /dev/urandom | head -n 10000 >xyz_2011-05-18_c.log
$ ./example.py
30000

blocks - send input to python subprocess pipeline

I'm testing subprocess pipelines with Python. I'm aware that I can do what the programs below do in Python directly, but that's not the point. I just want to test the pipeline so I know how to use it.
My system is Linux Ubuntu 9.04 with default python 2.6.
I started with this documentation example.
from subprocess import Popen, PIPE
p1 = Popen(["grep", "-v", "not"], stdout=PIPE)
p2 = Popen(["cut", "-c", "1-10"], stdin=p1.stdout, stdout=PIPE)
output = p2.communicate()[0]
print output
That works, but since p1's stdin is not being redirected, I have to type stuff in the terminal to feed the pipe. When I type ^D closing stdin, I get the output I want.
However, I want to send data to the pipe using a python string variable. First I tried writing on stdin:
p1 = Popen(["grep", "-v", "not"], stdin=PIPE, stdout=PIPE)
p2 = Popen(["cut", "-c", "1-10"], stdin=p1.stdout, stdout=PIPE)
p1.stdin.write('test\n')
output = p2.communicate()[0] # blocks forever here
Didn't work. I tried using p2.stdout.read() instead on the last line, but it also blocks. I added p1.stdin.flush() and p1.stdin.close() but it didn't work either. Then I moved to communicate:
p1 = Popen(["grep", "-v", "not"], stdin=PIPE, stdout=PIPE)
p2 = Popen(["cut", "-c", "1-10"], stdin=p1.stdout, stdout=PIPE)
p1.communicate('test\n') # blocks forever here
output = p2.communicate()[0]
So that's still not it.
I noticed that running a single process (like p1 above, removing p2) works perfectly. And passing a file handle to p1 (stdin=open(...)) also works. So the problem is:
Is it possible to pass data to a pipeline of 2 or more subprocesses in python, without blocking? Why not?
I'm aware I could run a shell and run the pipeline in the shell, but that's not what I want.
UPDATE 1: Following Aaron Digulla's hint below I'm now trying to use threads to make it work.
First I've tried running p1.communicate on a thread.
p1 = Popen(["grep", "-v", "not"], stdin=PIPE, stdout=PIPE)
p2 = Popen(["cut", "-c", "1-10"], stdin=p1.stdout, stdout=PIPE)
t = threading.Thread(target=p1.communicate, args=('some data\n',))
t.start()
output = p2.communicate()[0] # blocks forever here
Okay, didn't work. Tried other combinations like changing it to .write() and also p2.read(). Nothing. Now let's try the opposite approach:
def get_output(subp):
    output = subp.communicate()[0] # blocks on thread
    print 'GOT:', output
p1 = Popen(["grep", "-v", "not"], stdin=PIPE, stdout=PIPE)
p2 = Popen(["cut", "-c", "1-10"], stdin=p1.stdout, stdout=PIPE)
t = threading.Thread(target=get_output, args=(p2,))
t.start()
p1.communicate('data\n') # blocks here.
t.join()
The code ends up blocking somewhere: either in the spawned thread, or in the main thread, or both. So it didn't work. If you know how to make it work, it would help if you could provide working code. I'm still trying here.
UPDATE 2
Paul Du Bois answered below with some information, so I did more tests.
I've read the entire subprocess.py module and understood how it works. So I tried applying exactly that to my code.
I'm on linux, but since I was testing with threads, my first approach was to replicate the exact windows threading code seen on subprocess.py's communicate() method, but for two processes instead of one. Here's the entire listing of what I tried:
import os
from subprocess import Popen, PIPE
import threading
def get_output(fobj, buffer):
    while True:
        chunk = fobj.read() # BLOCKS HERE
        if not chunk:
            break
        buffer.append(chunk)

p1 = Popen(["grep", "-v", "not"], stdin=PIPE, stdout=PIPE)
p2 = Popen(["cut", "-c", "1-10"], stdin=p1.stdout, stdout=PIPE)

b = [] # create a buffer
t = threading.Thread(target=get_output, args=(p2.stdout, b))
t.start() # start reading thread

for x in xrange(100000):
    p1.stdin.write('hello world\n') # write data
    p1.stdin.flush()
p1.stdin.close() # close input...
t.join()
Well. It didn't work. Even after p1.stdin.close() was called, p2.stdout.read() still blocks.
Then I tried the posix code on subprocess.py:
import os
from subprocess import Popen, PIPE
import select
p1 = Popen(["grep", "-v", "not"], stdin=PIPE, stdout=PIPE)
p2 = Popen(["cut", "-c", "1-10"], stdin=p1.stdout, stdout=PIPE)
numwrites = 100000
to_read = [p2.stdout]
to_write = [p1.stdin]
b = [] # create buffer
while to_read or to_write:
    read_now, write_now, xlist = select.select(to_read, to_write, [])
    if read_now:
        data = os.read(p2.stdout.fileno(), 1024)
        if not data:
            p2.stdout.close()
            to_read = []
        else:
            b.append(data)
    if write_now:
        if numwrites > 0:
            numwrites -= 1
            p1.stdin.write('hello world!\n'); p1.stdin.flush()
        else:
            p1.stdin.close()
            to_write = []
print b
Also blocks on select.select(). By spreading prints around, I found out this:
Reading is working. Code reads many times during execution.
Writing is also working. Data is written to p1.stdin.
At the end of numwrites, p1.stdin.close() is called.
When select() starts blocking, only to_read has something, p2.stdout. to_write is already empty.
os.read() call always returns something, so p2.stdout.close() is never called.
Conclusion from both tests: Closing the stdin of the first process on the pipeline (grep in the example) is not making it dump its buffered output to the next and die.
No way to make it work?
PS: I don't want to use a temporary file, I've already tested with files and I know it works. And I don't want to use windows.
I found out how to do it.
It is not about threads, and not about select().
When I run the first process (grep), it creates two low-level file descriptors, one for each pipe. Let's call those a and b.
When I run the second process, b gets passed to cut's stdin. But there is a brain-dead default on Popen - close_fds=False.
The effect of that is that cut also inherits a. So grep can't die even if I close a, because stdin is still open on cut's process (cut ignores it).
The following code now runs perfectly.
from subprocess import Popen, PIPE
p1 = Popen(["grep", "-v", "not"], stdin=PIPE, stdout=PIPE)
p2 = Popen(["cut", "-c", "1-10"], stdin=p1.stdout, stdout=PIPE, close_fds=True)
p1.stdin.write('Hello World\n')
p1.stdin.close()
result = p2.stdout.read()
assert result == "Hello Worl\n"
close_fds=True SHOULD BE THE DEFAULT on unix systems. On windows it closes all fds, so it prevents piping.
EDIT:
PS: For people with a similar problem reading this answer: As pooryorick said in a comment, that also could block if data written to p1.stdin is bigger than the buffers. In that case you should chunk the data into smaller pieces, and use select.select() to know when to read/write. The code in the question should give a hint on how to implement that.
EDIT2: Found another solution, with more help from pooryorick - instead of using close_fds=True and closing ALL fds, one could close the fds that belong to the first process when executing the second, and it will work. The closing must be done in the child, so the preexec_fn function from Popen comes in very handy to do just that. On executing p2 you can do:
p2 = Popen(cmd2, stdin=p1.stdout, stdout=PIPE, stderr=devnull, preexec_fn=p1.stdin.close)
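Spelled out with the grep/cut example from above, that variant might look roughly like this sketch (the stderr=devnull redirection from the one-liner is omitted; names are from the earlier example):
from subprocess import Popen, PIPE

p1 = Popen(["grep", "-v", "not"], stdin=PIPE, stdout=PIPE)
p2 = Popen(["cut", "-c", "1-10"], stdin=p1.stdout, stdout=PIPE,
           preexec_fn=p1.stdin.close)   # child closes its inherited copy of p1's stdin before exec
p1.stdin.write('Hello World\n')
p1.stdin.close()
result = p2.stdout.read()
assert result == "Hello Worl\n"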
Working with large files
Two principles need to be applied uniformly when working with large files in Python:
1. Since any IO routine can block, we must keep each stage of the pipeline in a different thread or process. We use threads in this example, but subprocesses would let you avoid the GIL.
2. We must use incremental reads and writes so that we don't wait for EOF before starting to make progress.
An alternative is to use nonblocking IO, though this is cumbersome in standard Python. See gevent for a lightweight threading library that implements the synchronous IO API using nonblocking primitives.
Example code
We'll construct a silly pipeline that is roughly
{cat /usr/share/dict/words} | grep -v not \
| {upcase, filtered tee to stderr} | cut -c 1-10 \
| {translate 'E' to '3'} | grep K | grep Z | {downcase}
where each stage in braces {} is implemented in Python while the others use standard external programs. TL;DR: See this gist.
We start with the expected imports.
#!/usr/bin/env python
from subprocess import Popen, PIPE
import sys, threading
Python stages of the pipeline
All but the last Python-implemented stage of the pipeline needs to go in a thread so that its IO does not block the others. These could instead run in Python subprocesses if you wanted them to actually run in parallel (avoid the GIL).
def writer(output):
    for line in open('/usr/share/dict/words'):
        output.write(line)
    output.close()

def filter(input, output):
    for line in input:
        if 'k' in line and 'z' in line: # Selective 'tee'
            sys.stderr.write('### ' + line)
        output.write(line.upper())
    output.close()

def leeter(input, output):
    for line in input:
        output.write(line.replace('E', '3'))
    output.close()
Each of these needs to be put in its own thread, which we'll do using this convenience function.
def spawn(func, **kwargs):
    t = threading.Thread(target=func, kwargs=kwargs)
    t.start()
    return t
Create the pipeline
Create the external stages using Popen and the Python stages using spawn. The argument bufsize=-1 says to use the system default buffering (usually 4 kiB). This is generally faster than the default (unbuffered) or line buffering, but you'll want line buffering if you want to visually monitor the output without lags.
grepv = Popen(['grep','-v','not'], stdin=PIPE, stdout=PIPE, bufsize=-1)
cut = Popen(['cut','-c','1-10'], stdin=PIPE, stdout=PIPE, bufsize=-1)
grepk = Popen(['grep', 'K'], stdin=PIPE, stdout=PIPE, bufsize=-1)
grepz = Popen(['grep', 'Z'], stdin=grepk.stdout, stdout=PIPE, bufsize=-1)
twriter = spawn(writer, output=grepv.stdin)
tfilter = spawn(filter, input=grepv.stdout, output=cut.stdin)
tleeter = spawn(leeter, input=cut.stdout, output=grepk.stdin)
Drive the pipeline
Assembled as above, all the buffers in the pipeline will fill up, but since nobody is reading from the end (grepz.stdout), they will all block. We could read the entire thing in one call to grepz.stdout.read(), but that would use a lot of memory for large files. Instead, we read incrementally.
for line in grepz.stdout:
    sys.stdout.write(line.lower())
The threads and processes clean up once they reach EOF. We can explicitly clean up using
for t in [twriter, tfilter, tleeter]: t.join()
for p in [grepv, cut, grepk, grepz]: p.wait()
Python-2.6 and earlier
Internally, subprocess.Popen calls fork, configures the pipe file descriptors, and calls exec. The child process from fork has copies of all file descriptors in the parent process, and both copies will need to be closed before the corresponding reader will get EOF. This can be fixed by manually closing the pipes (either by close_fds=True or a suitable preexec_fn argument to subprocess.Popen) or by setting the FD_CLOEXEC flag to have exec automatically close the file descriptor. This flag is set automatically in Python-2.7 and later, see issue12786. We can get the Python-2.7 behavior in earlier versions of Python by calling
p._set_cloexec_flag(p.stdin)
before passing p.stdin as an argument to a subsequent subprocess.Popen.
There are three main tricks to making pipes work as expected:
1. Make sure each end of the pipe is used in a different thread/process (some of the examples near the top suffer from this problem).
2. Explicitly close the unused end of the pipe in each process.
3. Deal with buffering by either disabling it (Python's -u option), using ptys, or simply filling up the buffer with something that won't affect the data (maybe '\n', but whatever fits).
The examples in the Python "pipeline" module (I'm the author) fit your scenario
exactly, and make the low-level steps fairly clear.
http://pypi.python.org/pypi/pipeline/
More recently, I used the subprocess module as part of a
producer-processor-consumer-controller pattern:
http://www.darkarchive.org/w/Pub/PythonInteract
This example deals with buffered stdin without resorting to using a pty, and
also illustrates which pipe ends should be closed where. I prefer processes to
threading, but the principle is the same. Additionally, it illustrates
synchronizing Queues which feed the producer and collect output from the consumer,
and how to shut them down cleanly (look out for the sentinels inserted into the
queues). This pattern allows new input to be generated based on recent output,
allowing for recursive discovery and processing.
Nosklo's offered solution will quickly break if too much data is written to the receiving end of the pipe:
from subprocess import Popen, PIPE
p1 = Popen(["grep", "-v", "not"], stdin=PIPE, stdout=PIPE)
p2 = Popen(["cut", "-c", "1-10"], stdin=p1.stdout, stdout=PIPE, close_fds=True)
p1.stdin.write('Hello World\n' * 20000)
p1.stdin.close()
result = p2.stdout.read()
assert result == "Hello Worl\n"
If this script doesn't hang on your machine, just increase "20000" to something that exceeds the size of your operating system's pipe buffers.
This is because the operating system is buffering the input to "grep", but once that buffer is full, the p1.stdin.write call will block until something reads from p2.stdout. In toy scenarios, you can get away with writing to/reading from a pipe in the same process, but in normal usage, it is necessary to write from one thread/process and read from a separate thread/process. This is true for subprocess.Popen, os.pipe, os.popen*, etc.
Another twist is that sometimes you want to keep feeding the pipe with items generated from earlier output of the same pipe. The solution is to make both the pipe feeder and the pipe reader asynchronous to the main program, and implement two queues: one between the main program and the pipe feeder and one between the main program and the pipe reader. PythonInteract is an example of that.
Subprocess is a nice convenience model, but because it hides the details of the os.popen and os.fork calls it does under the hood, it can sometimes be more difficult to deal with than the lower-level calls it utilizes. For this reason, subprocess is not a good way to learn about how inter-process pipes really work.
You must do this in several threads. Otherwise, you'll end up in a situation where you can't send data: child p1 won't read your input since p2 doesn't read p1's output because you don't read p2's output.
So you need a background thread that reads what p2 writes out. That will allow p2 to continue after writing some data to the pipe, so it can read the next line of input from p1 which again allows p1 to process the data which you send to it.
Alternatively, you can send the data to p1 with a background thread and read the output from p2 in the main thread. But either side must be a thread.
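A minimal sketch of that second arrangement (feeder in a background thread, reader in the main thread), combined with the close_fds fix from the accepted answer:
from subprocess import Popen, PIPE
import threading

p1 = Popen(["grep", "-v", "not"], stdin=PIPE, stdout=PIPE)
p2 = Popen(["cut", "-c", "1-10"], stdin=p1.stdout, stdout=PIPE, close_fds=True)
p1.stdout.close()   # parent no longer needs its copy of the read end

def feed():
    p1.stdin.write('hello world\n' * 100000)
    p1.stdin.close()

t = threading.Thread(target=feed)
t.start()
output = p2.stdout.read()   # main thread drains the end of the pipeline
t.join()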
Responding to nosklo's assertion (see other comments to this question) that it can't be done without close_fds=True:
close_fds=True is only necessary if you've left other file
descriptors open. When opening multiple child processes, it's always good to
keep track of open files that might get inherited, and to explicitly close any
that aren't needed:
from subprocess import Popen, PIPE
p1 = Popen(["grep", "-v", "not"], stdin=PIPE, stdout=PIPE)
p1.stdin.write('Hello World\n')
p1.stdin.close()
p2 = Popen(["cut", "-c", "1-10"], stdin=p1.stdout, stdout=PIPE)
result = p2.stdout.read()
assert result == "Hello Worl\n"
close_fds defaults to False because subprocess
prefers to trust the calling program to know what it's doing with open file
descriptors, and just provide the caller with an easy option to close them all
if that's what it wants to do.
But the real issue is that pipe buffers will bite you for all but toy examples.
As I have said in my other answers to this question, the rule of thumb is to
not have your reader and your writer open in the same process/thread. Anyone
who wants to use the subprocess module for two-way communication would be
well-served to study os.pipe and os.fork, first. They're actually not that
hard to use if you have a good example to look at.
I think you may be examining the wrong problem. Certainly, as Aaron says, if you try to be both a producer at the beginning of a pipeline and a consumer at the end of the pipeline, it is easy to get into a deadlock situation. This is the problem that communicate() solves.
communicate() isn't exactly correct for you since stdin and stdout are on different subprocess objects; but if you take a look at the implementation in subprocess.py you'll see that it does exactly what Aaron suggested.
Once you see that communicate both reads and writes, you'll see that in your second try communicate() competes with p2 for the output of p1:
p1 = Popen(["grep", "-v", "not"], stdin=PIPE, stdout=PIPE)
p2 = Popen(["cut", "-c", "1-10"], stdin=p1.stdout, stdout=PIPE)
# ...
p1.communicate('data\n') # reads from p1.stdout, as does p2
I am running on win32, which definitely has different i/o and buffering characteristics, but this works for me:
p1 = Popen(["grep", "-v", "not"], stdin=PIPE, stdout=PIPE)
p2 = Popen(["cut", "-c", "1-10"], stdin=p1.stdout, stdout=PIPE)
t = threading.Thread(target=get_output, args=(p2,))
t.start()
p1.stdin.write('hello world\n' * 100000)
p1.stdin.close()
t.join()
I tuned the input size to produce a deadlock when using a naive unthreaded p2.read()
You might also try buffering into a file, e.g.:
import os, tempfile

fd, _ = tempfile.mkstemp()
os.write(fd, 'hello world\r\n' * 100000)
os.lseek(fd, 0, os.SEEK_SET)
p1 = Popen(["grep", "-v", "not"], stdin=fd, stdout=PIPE)
p2 = Popen(["cut", "-c", "1-10"], stdin=p1.stdout, stdout=PIPE)
print p2.stdout.read()
That also works for me without deadlocks.
In one of the comments above, I challenged nosklo to either post some code to back up his assertions about select.select or to upvote my responses he had previously down-voted. He responded with the following code:
from subprocess import Popen, PIPE
import select
p1 = Popen(["grep", "-v", "not"], stdin=PIPE, stdout=PIPE)
p2 = Popen(["cut", "-c", "1-10"], stdin=p1.stdout, stdout=PIPE, close_fds=True)
data_to_write = 100000 * 'hello world\n'
to_read = [p2.stdout]
to_write = [p1.stdin]
b = [] # create buffer
written = 0
while to_read or to_write:
    read_now, write_now, xlist = select.select(to_read, to_write, [])
    if read_now:
        data = p2.stdout.read(1024)
        if not data:
            p2.stdout.close()
            to_read = []
        else:
            b.append(data)
    if write_now:
        if written < len(data_to_write):
            part = data_to_write[written:written+1024]
            written += len(part)
            p1.stdin.write(part); p1.stdin.flush()
        else:
            p1.stdin.close()
            to_write = []
print b
One problem with this script is that it second-guesses the size/nature of the
system pipe buffers. The script would experience fewer failures if it could remove
magic numbers like 1024.
The big problem is that this script code only works consistently with the right
combination of data input and external programs. grep and cut both work with
lines, and so their internal buffers behave a bit differently. If we use a
more generic command like "cat", and write smaller bits of data into the pipe,
the fatal race condition will pop up more often:
from subprocess import Popen, PIPE
import select
import time
p1 = Popen(["cat"], stdin=PIPE, stdout=PIPE)
p2 = Popen(["cat"], stdin=p1.stdout, stdout=PIPE, close_fds=True)
data_to_write = 'hello world\n'
to_read = [p2.stdout]
to_write = [p1.stdin]
b = [] # create buffer
written = 0
while to_read or to_write:
    time.sleep(1)
    read_now, write_now, xlist = select.select(to_read, to_write, [])
    if read_now:
        print 'I am reading now!'
        data = p2.stdout.read(1024)
        if not data:
            p1.stdout.close()
            to_read = []
        else:
            b.append(data)
    if write_now:
        print 'I am writing now!'
        if written < len(data_to_write):
            part = data_to_write[written:written+1024]
            written += len(part)
            p1.stdin.write(part); p1.stdin.flush()
        else:
            print 'closing file'
            p1.stdin.close()
            to_write = []
print b
In this case, two different results will manifest:
write, write, close file, read -> success
write, read -> hang
So again, I challenge nosklo to either post code showing the use of
select.select to handle arbitrary input and pipe buffering from a
single thread, or to upvote my responses.
Bottom line: don't try to manipulate both ends of a pipe from a single thread.
It's just not worth it. See
pipeline for a nice low-level
example of how to do this correctly.
What about using a SpooledTemporaryFile? This bypasses (but perhaps doesn't solve) the issue:
http://docs.python.org/library/tempfile.html#tempfile.SpooledTemporaryFile
You can write to it like a file, but it's actually a memory block.
Or am I totally misunderstanding...
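A rough sketch of that idea, with one caveat worth knowing: passing the object to Popen calls its fileno(), which per the tempfile docs rolls it over to a real on-disk file, so the "memory block" advantage is lost at that point.
import tempfile
from subprocess import Popen, PIPE

buf = tempfile.SpooledTemporaryFile(max_size=1024 * 1024)  # stays in memory until 1 MiB
buf.write('hello world\n' * 100000)
buf.seek(0)
p1 = Popen(["grep", "-v", "not"], stdin=buf, stdout=PIPE)  # fileno() forces rollover to disk
p2 = Popen(["cut", "-c", "1-10"], stdin=p1.stdout, stdout=PIPE)
p1.stdout.close()
print p2.communicate()[0][:22]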
Here's an example of using Popen together with os.fork to accomplish the same
thing. Instead of using close_fds it just closes the pipes at the
right places. Much simpler than trying to use select.select, and
takes full advantage of system pipe buffers.
from subprocess import Popen, PIPE
import os
import sys
p1 = Popen(["cat"], stdin=PIPE, stdout=PIPE)
pid = os.fork()
if pid: # parent
    p1.stdin.close()
    p2 = Popen(["cat"], stdin=p1.stdout, stdout=PIPE)
    data = p2.stdout.read()
    sys.stdout.write(data)
    p2.stdout.close()
else: # child
    data_to_write = 'hello world\n' * 100000
    p1.stdin.write(data_to_write)
    p1.stdin.close()
It's much simpler than you think!
import sys
from subprocess import Popen, PIPE
# Pipe the command here. It will read from stdin.
# So cat a file, to stdin, like (cat myfile | ./this.py),
# or type on terminal and hit control+d when done, etc
# No need to handle this yourself, that's why we have shells!
p = Popen("grep -v not | cut -c 1-10", shell=True, stdout=PIPE)
nextData = None
while True:
    nextData = p.stdout.read()
    if nextData in (b'', ''):
        break
    sys.stdout.write(nextData.decode('utf-8'))
p.wait()
This code is written for python 3.6, and works with python 2.7.
Use it like:
cat README.md | python ./example.py
or
python example.py < README.md
To pipe the contents of "README.md" to this program.
But.. at this point, why not just use "cat" directly, and pipe the output like you want? like:
cat filename | grep -v not | cut -c 1-10
typed into the console will do the job as well. I personally would only use the code option if I was further processing the output; otherwise a shell script would be easier to maintain.
You just use the shell to do the piping for you. In one, out the other. That's what shells are GREAT at doing: managing processes and managing single-width chains of input and output. Some would call it a shell's best non-interactive feature.

Using subprocess.Popen for Process with Large Output

I have some Python code that executes an external app which works fine when the app has a small amount of output, but hangs when there is a lot. My code looks like:
p = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
errcode = p.wait()
retval = p.stdout.read()
errmess = p.stderr.read()
if errcode:
    log.error('cmd failed <%s>: %s' % (errcode,errmess))
There are comments in the docs that seem to indicate the potential issue. Under wait, there is:
Warning: This will deadlock if the child process generates enough output to a stdout or stderr pipe such that it blocks waiting for the OS pipe buffer to accept more data. Use communicate() to avoid that.
though under communicate, I see:
Note The data read is buffered in memory, so do not use this method if the data size is large or unlimited.
So it is unclear to me whether I should use either of these if I have a large amount of data. They don't indicate what method I should use in that case.
I do need the return value from the exec, and I do parse and use both the stdout and stderr.
So what is an equivalent method in Python to exec an external app that is going to have large output?
You're doing blocking reads to two files; the first needs to complete before the second starts. If the application writes a lot to stderr, and nothing to stdout, then your process will sit waiting for data on stdout that isn't coming, while the program you're running sits there waiting for the stuff it wrote to stderr to be read (which it never will be--since you're waiting for stdout).
There are a few ways you can fix this.
The simplest is to not intercept stderr; leave stderr=None. Errors will be output to stderr directly. You can't intercept them and display them as part of your own message. For commandline tools, this is often OK. For other apps, it can be a problem.
Another simple approach is to redirect stderr to stdout, so you only have one incoming file: set stderr=STDOUT. This means you can't distinguish regular output from error output. This may or may not be acceptable, depending on how the application writes output.
The complete and complicated way of handling this is select (http://docs.python.org/library/select.html). This lets you read in a non-blocking way: you get data whenever data appears on either stdout or stderr. I'd only recommend this if it's really necessary. This probably doesn't work in Windows.
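A sketch of the second option (merging stderr into stdout so there is only one stream to read incrementally); cmd is the command from the question and the output filename is made up:
p = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE,
                     stderr=subprocess.STDOUT)
with open('combined.out', 'wb') as outf:
    for chunk in iter(lambda: p.stdout.read(8192), ''):
        outf.write(chunk)    # regular output and errors, interleaved
errcode = p.wait()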
Reading stdout and stderr independently with very large output (ie, lots of megabytes) using select:
import subprocess, select
proc = subprocess.Popen(cmd, bufsize=8192, shell=False,
                        stdout=subprocess.PIPE, stderr=subprocess.PIPE)
with open(outpath, "wb") as outf:
    dataend = False
    while (proc.returncode is None) or (not dataend):
        proc.poll()
        dataend = False
        ready = select.select([proc.stdout, proc.stderr], [], [], 1.0)
        if proc.stderr in ready[0]:
            data = proc.stderr.read(1024)
            if len(data) > 0:
                handle_stderr_data(data)
        if proc.stdout in ready[0]:
            data = proc.stdout.read(1024)
            if len(data) == 0: # Read of zero bytes means EOF
                dataend = True
            else:
                outf.write(data)
A lot of output is subjective, so it's a little difficult to make a recommendation. If the amount of output is really large then you likely don't want to grab it all with a single read() call anyway. You may want to try writing the output to a file and then pull the data in incrementally, like so:
f = file('data.out', 'w')
p = subprocess.Popen(cmd, shell=True, stdout=f, stderr=subprocess.PIPE)
errcode = p.wait()
f.close()
if errcode:
    errmess = p.stderr.read()
    log.error('cmd failed <%s>: %s' % (errcode, errmess))
for line in file('data.out'):
    pass # do something with each line
Glenn Maynard is right in his comment about deadlocks. However, the best way of solving this problem is to create two threads, one for stdout and one for stderr, which read those respective streams until exhausted and do whatever you need with the output.
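A minimal sketch of that two-thread approach, assuming cmd is the command string from the question:
from subprocess import Popen, PIPE
import threading

def drain(stream, chunks):
    # read until EOF so the child never blocks on a full pipe
    for chunk in iter(lambda: stream.read(8192), ''):
        chunks.append(chunk)
    stream.close()

p = Popen(cmd, shell=True, stdout=PIPE, stderr=PIPE)
out_chunks, err_chunks = [], []
t_out = threading.Thread(target=drain, args=(p.stdout, out_chunks))
t_err = threading.Thread(target=drain, args=(p.stderr, err_chunks))
t_out.start(); t_err.start()
errcode = p.wait()
t_out.join(); t_err.join()
retval, errmess = ''.join(out_chunks), ''.join(err_chunks)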
The suggestion of using temporary files may or may not work for you depending on the size of output etc. and whether you need to process the subprocess' output as it is generated.
As Heikki Toivonen has suggested, you should look at the communicate method. However, this buffers the stdout/stderr of the subprocess in memory and you get those returned from the communicate call - this is not ideal for some scenarios. But the source of the communicate method is worth looking at.
Another example is in a package I maintain, python-gnupg, where the gpg executable is spawned via subprocess to do the heavy lifting, and the Python wrapper spawns threads to read gpg's stdout and stderr and consume them as data is produced by gpg. You may be able to get some ideas by looking at the source there, as well. Data produced by gpg to both stdout and stderr can be quite large, in the general case.
I had the same problem. If you have to handle a large output, another good option could be to use files for stdout and stderr, and pass those files as parameters.
Check the tempfile module in python: https://docs.python.org/2/library/tempfile.html.
Something like this might work
out = tempfile.NamedTemporaryFile(delete=False)
Then you would do:
Popen(... stdout=out,...)
Then you can read the file, and erase it later.
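Filling in the rest of that approach might look roughly like this (the file handling details are assumptions, not from the original answer; log is the question's logger):
import os
import subprocess
import tempfile

out = tempfile.NamedTemporaryFile(delete=False)
err = tempfile.NamedTemporaryFile(delete=False)
p = subprocess.Popen(cmd, shell=True, stdout=out, stderr=err)
errcode = p.wait()
out.close(); err.close()
errmess = open(err.name).read()
if errcode:
    log.error('cmd failed <%s>: %s' % (errcode, errmess))
for line in open(out.name):
    pass  # process each line of the (possibly huge) output incrementally
os.unlink(out.name); os.unlink(err.name)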
You could try communicate and see if that solves your problem. If not, I'd redirect the output to a temporary file.
Here is a simple approach which captures both regular output plus error output, all within Python, so limitations in stdout don't apply:
com_str = 'uname -a'
command = subprocess.Popen([com_str], stdout=subprocess.PIPE, shell=True)
(output, error) = command.communicate()
print output
Linux 3.11.0-20-generic SMP Fri May 2 21:32:55 UTC 2014
and
com_str = 'id'
command = subprocess.Popen([com_str], stdout=subprocess.PIPE, shell=True)
(output, error) = command.communicate()
print output
uid=1000(myname) gid=1000(mygrp) groups=1000(cell),0(root)
