how to concatenate multiple files for stdin of Popen - python

I'm porting a bash script to python 2.6, and want to replace some code:
cat $( ls -tr xyz_`date +%F`_*.log ) | filter args > bzip2
I guess I want something similar to the "Replacing shell pipe line" example at http://docs.python.org/release/2.6/library/subprocess.html, ala...
p1 = Popen(["filter", "args"], stdin=*?WHAT?*, stdout=PIPE)
p2 = Popen(["bzip2"], stdin=p1.stdout, stdout=PIPE)
output = p2.communicate()[0]
But, I'm not sure how best to provide p1's stdin value so it concatenates the input files. Seems I could add...
p0 = Popen(["cat", "file1", "file2"...], stdout=PIPE)
p1 = ... stdin=p0.stdout ...
...but that seems to be crossing beyond use of (slow, inefficient) pipes to call external programs with significant functionality. (Any decent shell performs the cat internally.)
So, I can imagine a custom class that satisfies the file object API requirements and can therefore be used for p1's stdin, concatenating arbitrary other file objects. (EDIT: existing answers explain why this isn't possible)
Does python 2.6 have a mechanism addressing this need/want, or might another Popen to cat be considered perfectly fine in python circles?
Thanks.

You can replace everything that you're doing with Python code, except for your external utility. That way your program will remain portable as long as your external util is portable. You can also consider turning the C++ program into a library and using Cython to interface with it. As Messa showed, date is replaced with time.strftime, globbing is done with glob.glob and cat can be replaced with reading all the files in the list and writing them to the input of your program. The call to bzip2 can be replaced with the bz2 module, but that will complicate your program because you'd have to read and write simultaneously. To do that, you need to either use p.communicate or a thread if the data is huge (select.select would be a better choice but it won't work on Windows).
import sys
import bz2
import glob
import time
import threading
import subprocess

output_filename = '../whatever.bz2'
input_filenames = glob.glob(time.strftime("xyz_%F_*.log"))

p = subprocess.Popen(['filter', 'args'], stdin=subprocess.PIPE, stdout=subprocess.PIPE)
output = open(output_filename, 'wb')
output_compressor = bz2.BZ2Compressor()

def data_reader():
    for filename in input_filenames:
        f = open(filename, 'rb')
        p.stdin.writelines(iter(lambda: f.read(8192), ''))
    p.stdin.close()

input_thread = threading.Thread(target=data_reader)
input_thread.start()

with output:
    for chunk in iter(lambda: p.stdout.read(8192), ''):
        output.write(output_compressor.compress(chunk))
    output.write(output_compressor.flush())

input_thread.join()
p.wait()
Addition: How to detect file input type
You can use either the file extension or the Python bindings for libmagic to detect how the file is compressed. Here's a code example that does both, and automatically chooses magic if it is available. You can take whichever part suits your situation and adapt it. The open_autodecompress function detects the MIME type and opens the file with the appropriate decompressor if one is available.
import os
import gzip
import bz2

try:
    import magic
except ImportError:
    has_magic = False
else:
    has_magic = True

mime_openers = {
    'application/x-bzip2': bz2.BZ2File,
    'application/x-gzip': gzip.GzipFile,
}

ext_openers = {
    '.bz2': bz2.BZ2File,
    '.gz': gzip.GzipFile,
}

def open_autodecompress(filename, mode='r'):
    if has_magic:
        ms = magic.open(magic.MAGIC_MIME_TYPE)
        ms.load()
        mimetype = ms.file(filename)
        opener = mime_openers.get(mimetype, open)
    else:
        basepart, ext = os.path.splitext(filename)
        opener = ext_openers.get(ext, open)
    return opener(filename, mode)
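For example, the data_reader loop from the first snippet could use it like this; a sketch reusing the p and input_filenames variables from above:
for filename in input_filenames:
    f = open_autodecompress(filename)
    try:
        # stream the (possibly decompressed) contents into the filter's stdin
        p.stdin.writelines(iter(lambda: f.read(8192), ''))
    finally:
        f.close()
p.stdin.close()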

If you look inside the subprocess module implementation, you will see that stdin, stdout and stderr are expected to be file objects supporting the fileno() method, so a simple concatenating file-like object with a Python interface (or even a StringIO object) is not suitable here.
If they were iterators rather than file objects, you could use itertools.chain.
Of course, at the cost of memory consumption, you can do something like this:
import itertools, os
# ...
filenames = [f for f in os.listdir(".") if os.path.isfile(f)]
input_data = ''.join(itertools.chain.from_iterable(open(name) for name in filenames))
p2.communicate(input_data)

When using subprocess you have to consider the fact that internally Popen will use the file descriptor (handle) and call os.dup2() for stdin, stdout and stderr before passing them to the child process it creates.
So if you don't want to use the system shell pipe with Popen:
p0 = Popen(["cat", "file1", "file2"...], stdout=PIPE)
p1 = Popen(["filter", "args"], stdin=p0.stdout, stdout=PIPE)
...
I think your other option is to write a cat function in Python that concatenates the files into a real file in a cat-like way, and pass that file to p1's stdin. Don't think about a class that implements the file API, because, as I said, it will not work: internally the child process just gets the file descriptors.
With that said, I think your best option is to use the Unix pipe approach, as shown in the subprocess documentation.
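For reference, a minimal sketch of that full pipeline, modeled on the "Replacing shell pipeline" example in the subprocess docs (filter and its args are the placeholders from the question):
from subprocess import Popen, PIPE

p0 = Popen(["cat", "file1", "file2"], stdout=PIPE)
p1 = Popen(["filter", "args"], stdin=p0.stdout, stdout=PIPE)
p2 = Popen(["bzip2"], stdin=p1.stdout, stdout=PIPE)
p0.stdout.close()  # allow p0 to receive SIGPIPE if p1 exits early
p1.stdout.close()  # likewise for p1 if p2 exits early
output = p2.communicate()[0]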

This should be easy. First, create a pipe using os.pipe, then Popen the filter with the read end of the pipe as its standard input. Then, for each file in the directory whose name matches the pattern, just pass its contents to the write end of the pipe. This should be exactly what the shell command cat ..._*.log | filter args does.
Update: Sorry, the pipe from os.pipe is not needed; I forgot that subprocess.Popen(..., stdin=subprocess.PIPE) actually creates one for you. Also, a pipe cannot be stuffed with too much data; more data can be written to a pipe only after the previous data have been read.
So the solution (for example with wc -l) would be:
import glob
import subprocess

p = subprocess.Popen(["wc", "-l"], stdin=subprocess.PIPE)

processDate = "2011-05-18"  # or time.strftime("%F")
for name in glob.glob("xyz_%s_*.log" % processDate):
    f = open(name, "rb")
    # copy all data from f to p.stdin
    while True:
        data = f.read(8192)
        if not data:
            break  # reached end of file
        p.stdin.write(data)
    f.close()

p.stdin.close()
p.wait()
Usage example:
$ hexdump /dev/urandom | head -n 10000 >xyz_2011-05-18_a.log
$ hexdump /dev/urandom | head -n 10000 >xyz_2011-05-18_b.log
$ hexdump /dev/urandom | head -n 10000 >xyz_2011-05-18_c.log
$ ./example.py
30000

Related

Run a program from Python several times without initializing different shells

I want to run a compiled Fortran numerical model from Python. It is too complex to compile with F2PY without making several changes to the Fortran routines, which is why I am just calling its executable using the subprocess module.
The problem is that I have to call it a few thousand times, and I have the feeling that spawning so many shells is slowing the whole thing down.
My implementation (it is difficult to provide a reproducible example, sorry) looks like:
import os
import subprocess
foo_path = '/path/to/compiled/program/'
program_dir = os.path.join(foo_path, "FOO") #FOO is the Fortran executable
instruction = program_dir + " < nlst"  # it is necessary to provide FOO a text file (nlst)
                                       # with the configuration for the program
subprocess.call(instruction, shell=True, cwd=foo_path) #run the executable
Running it this way (inside a loop), it works well and FOO generates a text file output that I can read from Python. But I'd like to do the same while keeping the shell active and just providing it the "nlst" file path. Another nice option may be to start an empty shell and keep it waiting for the instruction string, which would look like "./FOO < nlst". But I am not sure how to do that; any ideas?
Thanks!
[Edited] Something like this should work, but .communicate ends the process and a second call does not work:
from subprocess import Popen, PIPE
foo_path = '/path/to/FOO/'
process = Popen(['bash'], stdin=PIPE, cwd=foo_path)
process.communicate(b'./FOO < nlst')
I found this solution using the pexpect module,
import pexpect
import os.path
foo_path = '/path/to/FOO/'
out_path = '/path/to/FOO/foo_out_file' #path to output file
child = pexpect.spawn('bash', cwd=foo_path)
child.sendline('./FOO < nlst')
while not os.path.exists(out_path):  # wait until out_path is created
    continue
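Note that a shell is not strictly required here: the only thing the shell contributes is the < nlst redirection, and that can be reproduced by handing the open file directly to the subprocess as its stdin. A minimal sketch, assuming FOO reads its configuration from stdin:
import os
import subprocess

foo_path = '/path/to/FOO/'
nlst = open(os.path.join(foo_path, 'nlst'), 'rb')
subprocess.call([os.path.join(foo_path, 'FOO')], stdin=nlst, cwd=foo_path)
nlst.close()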
To extend my comment, here is an example for threading with your code:
import os
import subprocess
from concurrent.futures import ThreadPoolExecutor

foo_path = '/path/to/compiled/program/'
program_dir = os.path.join(foo_path, "FOO")  # FOO is the Fortran executable
instruction = program_dir + " < nlst"  # it is necessary to provide FOO a text file (nlst)
                                       # with the configuration for the program

def your_function():
    subprocess.call(instruction, shell=True, cwd=foo_path)  # run the executable

# create executor object
executor = ThreadPoolExecutor(max_workers=4)  # uncertain of how many workers you might need/want

# specify how often you want to run the function
for i in range(10):
    # start your function as a thread
    executor.submit(your_function)
What I meant in my comment was something like the following Python script:
from subprocess import Popen, PIPE
foo_path = '/home/ronald/tmp'
process = Popen(['/home/ronald/tmp/shexecutor'], stdin=PIPE, cwd=foo_path)
process.stdin.write("ls\n")
process.stdin.write("echo hello\n")
process.stdin.write("quit\n")
process.stdin.close()  # flush the buffered writes so the shell actually receives the commands
And the shell script that executes the commands:
#!/bin/bash
while read cmdline; do
    if [ "$cmdline" == "quit" ]; then
        exit 0
    fi
    eval "$cmdline" >> x.output
done
Instead of doing an "eval", you can do virtually anything.
Note that this is just an outline of a real implementation.
You'd need to do some error handling. And if you are going to use this in a production environment, be sure to harden the code to the limit.

How to return a dictionary as a function's return value running as a subprocess to its parent process?

I have two scripts, parent.py and child.py. parent.py calls child.py as a subprocess. child.py has a function that collects a certain result in a dictionary, and I wish to return that dictionary back to the parent process. I have tried printing that dictionary from child.py onto its STDOUT so that the parent process can read it, but that is not helping me because the dictionary's contents are read as strings on separate lines by the parent.
Moreover, as suggested in comments, I tried serializing the dictionary using JSON while printing it on stdout and also reading it back in the parent using JSON. That works fine, but I also print a lot of other information from the child to its stdout, which is eventually also read by the parent and mixes things up.
Another suggestion was to write the result from the child to a file in the directory and make the parent read from that file. That would work too, but I would be running hundreds of instances of this code in Celery, so it would lead to overwrites of that same file by other instances of the child.
My question is: since we have a PIPE connecting the two processes, how can I just write my dictionary directly into the PIPE from child.py and have it read from parent.py?
# parent.py
import subprocess

proc = subprocess.Popen(['python3', 'child.py'],
                        stdin=subprocess.PIPE,
                        stdout=subprocess.PIPE)
proc.communicate()
result = proc.stdout
# child.py
def child_function():
    result = {}
    result[1] = "one"
    result[2] = "two"
    print(result)
    #return result

if __name__ == "__main__":
    child_function()
Have the parent create a FIFO (named pipe) for the child:
os.mkfifo('mypipe')  # os.mkfifo just creates the FIFO on disk; it returns None, not a file object
proc = subprocess.Popen(['python3', 'child.py', 'mypipe'],
                        stdin=subprocess.PIPE, stdout=subprocess.PIPE)
with open('mypipe') as pipe:  # opening the read end blocks until the child opens it for writing
    print(pipe.read())
Now the child can do this:
import sys

pipe_path = sys.argv[1]  # get from argv
with open(pipe_path, 'w') as pipe:
    pipe.write(str(result))
This keeps your communication separate from stdin/stdout/stderr.
A subprocess running Python is in no way different from a subprocess running something else. Python doesn't know or care that the other program is also a Python program; they have no access to each other's variables, memory, running state, or other internals. Simply imagine that the subprocess is a monolithic binary. The only ways you can communicate with it are to send and receive bytes (which can be strings, if you agree on a character encoding) and signals (so you can kill your subprocess, or raise some other signal which it can trap and handle -- like a timer; you get exactly one bit of information when the timer expires, and what you do with that bit is up to the receiver of the signal).
To "serialize" information means to encode it in a way which lets the recipient deserialize it. JSON is a good example; you can transfer a structure consisting of a (possibly nested structure of) dictionary or list as text, and the recipient will know how to map that stream of bytes into the same structure.
When both sender and receiver are running the same Python version, you could also use pickles; pickle is a native Python format which allows you to transfer a richer structure. But if your needs are modest, I'd simply go with JSON.
parent.py:
import subprocess
import json

# Prefer subprocess.run() over bare-bones Popen()
proc = subprocess.run(['python3', 'child.py'],
                      check=True, capture_output=True, text=True)
result = json.loads(proc.stdout)
child.py:
import json
import logging

def child_function():
    result = {}
    result[1] = "one"
    result[2] = "two"
    logging.info('Some unrelated output which should not go into the JSON')
    print(json.dumps(result))
    #return result

if __name__ == "__main__":
    logging.basicConfig(level=logging.WARNING)
    child_function()
To avoid mixing JSON with other output, print the other output to standard error instead of standard output (or figure out a way to embed it into the JSON after all). The logging module is a convenient way to do that, with the added bonus that you can turn it off easily, partially or entirely (the above example demonstrates logging which is turned off via logging.basicConfig because it only selects printing of messages of priority WARNING or higher, which excludes INFO). The parent will get these messages in proc.stderr.
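For completeness, here is what the pickle variant mentioned above might look like; a minimal sketch, assuming both sides run the same Python 3 and the child writes nothing else to stdout:
# child.py (pickle variant): write the pickled dict to the binary stdout
import sys, pickle
pickle.dump({1: "one", 2: "two"}, sys.stdout.buffer)

# parent.py (pickle variant): capture the raw bytes and unpickle them
import subprocess, pickle
proc = subprocess.run(['python3', 'child.py'], check=True, capture_output=True)
result = pickle.loads(proc.stdout)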
You can get the results via a file.
parent.py:
import tempfile
import os
import subprocess
import json
fd, temp_file_name = tempfile.mkstemp() # create temporary file
os.close(fd) # close the file
proc = subprocess.Popen(['python3', 'child.py', temp_file_name]) # pass file_name
proc.communicate()
with open(temp_file_name) as fp:
result = json.load(fp) # get dictionary from here
os.unlink(temp_file_name) # no longer need this file
child.py:
import sys
import json

def child_function(temp_file_name):
    result = {}
    result[1] = "one"
    result[2] = "two"
    with open(temp_file_name, 'w') as fp:
        json.dump(result, fp)

if __name__ == "__main__":
    child_function(sys.argv[1])  # pass the file name argument

Playing a MP3 from a Python process's memory via omxplayer, without writing to disk

The following code receives an MP3, writes it to disk and plays it using OMXPlayer. I want to eliminate the need to write the MP3 to disk before playing it.
song = response.content
file = open("temp.mp3", "wb")
file.write(song)
file.close()
response.close()
play_song_subprocess = subprocess.call(['omxplayer', '-o', 'local', '--vol', '-500', 'temp.mp3'])
How can I eliminate the file.write()?
I'm looking to do something like this:
song = response.content
play_song_subprocess = subprocess.call(['omxplayer', '-o', 'local', '--vol', '-500', song])
But this causes the following error:
embedded null byte
Backstory For Readers
Established in chat and comments:
cat temp.mp3 | omxplayer -o local --vol -500 /dev/stdin causes a segfault.
omxplayer -o local --vol -500 /dev/fd/3 3< <(cat temp.mp3) works correctly.
Thus, we can pass an MP3's data in... but not on stdin (which omxplayer uses for controls: pausing, early exiting, etc).
Approach 1: Using A Shell For File Descriptor Wrangling
This is equivalent to "Approach 3", but instead of using very new and modern Python functionality to do the FD wrangling in-process, it launches a copy of /bin/sh to do the work (and consequently will work with much older Python releases).
play_from_stdin_sh = '''
exec 3<&0                                     # copy stdin to FD 3
exec </dev/tty || exec </dev/null             # make stdin now be /dev/tty or /dev/null
exec omxplayer -o local --vol -500 /dev/fd/3  # play data from FD 3
'''
p = subprocess.Popen(['sh', '-c', play_from_stdin_sh], stdin=subprocess.PIPE)
p.communicate(song)  # passes data in "song" as stdin to the copy of sh
Because omxplayer expects to use stdin to get instructions from its user, we need to use a different file descriptor for passing in its contents. Thus, while we have the Python interpreter pass content on stdin, we then have a shell copy stdin to FD 3 and replace the original stdin with a handle on either /dev/tty or /dev/null before invoking omxplayer.
Approach 2: Using A Named Pipe
There's a little bit of a question as to whether this is cheating on the "no writing to disk" constraint. It doesn't write any of the MP3 data to disk, but it does create a filesystem object that both processes can open as a way to connect to each other, even though the data written to that object flows directly between the processes, without being written to disk.
import tempfile, os, os.path, shutil, subprocess

fifo_dir = None
try:
    fifo_dir = tempfile.mkdtemp('mp3-fifodir')
    fifo_name = os.path.join(fifo_dir, 'fifo.mp3')
    os.mkfifo(fifo_name)
    # now, we start omxplayer, and tell it to read from the FIFO
    # as long as it opens it in read mode, it should just hang until something opens
    # ...the write side of the FIFO, writes content to it, and then closes it.
    p = subprocess.Popen(['omxplayer', '-o', 'local', '--vol', '-500', fifo_name])
    # this doesn't actually write content to a file on disk! instead, it's written directly
    # ...to the omxplayer process's handle on the other side of the FIFO.
    fifo_fd = open(fifo_name, 'wb')  # binary mode, since song holds raw MP3 bytes
    fifo_fd.write(song)
    fifo_fd.close()
    p.wait()
finally:
    shutil.rmtree(fifo_dir)
Approach 3: Using A preexec_fn In Python
We can implement the file descriptor wrangling that Approach 1 used a shell for in native Python using the Popen object's preexec_fn argument. Consider:
import os, subprocess

def move_stdin():
    os.dup2(0, 3)                         # copy our stdin -- FD 0 -- to FD 3
    try:
        newstdin = open('/dev/tty', 'r')  # open /dev/tty...
        os.dup2(newstdin.fileno(), 0)     # ...and attach it to FD 0.
    except IOError:
        newstdin = open('/dev/null', 'r') # Couldn't do that? Open /dev/null...
        os.dup2(newstdin.fileno(), 0)     # ...and attach it to FD 0.

p = subprocess.Popen(['omxplayer', '-o', 'local', '--vol', '-500', '/dev/fd/3'],
                     stdin=subprocess.PIPE, preexec_fn=move_stdin, pass_fds=[0, 1, 2, 3])
p.communicate(song)

Calling multi-level commands/programs from python

I have a shell command fst-mor. It takes an argument in the form of a file, e.g. NOUN.A, which is a lex file or something. The final command: fst-mor NOUN.A
It then produces following output:
analyze>INPUT_A_STRING_HERE
OUTPUT_HERE
Now I want to call fst-mor from my Python script, feed it an input string, and get the output back in the script.
So far I have:
import os
print os.system("fst-mor NOUN.A")
You want to capture the output of another command. Use the subprocess module for this.
import subprocess
output = subprocess.check_output(['fst-mor', 'NOUN.A'])
If your command requires interactive input, you have two options:
Use a subprocess.Popen() object, set the stdin parameter to subprocess.PIPE, and write the input to the stdin pipe it makes available. For one input parameter, that's often enough. Study the documentation for the subprocess module for details, but the basic interaction is:
proc = subprocess.Popen(['fst-mor', 'NOUN.A'], stdin=subprocess.PIPE, stdout=subprocess.PIPE)
output, err = proc.communicate('INPUT_A_STRING_HERE')
Use the pexpect library to drive a process. This lets you create more complex interactions with a subprocess by looking for patterns in the output it generates:
import pexpect
py = pexpect.spawn('fst-mor NOUN.A')
py.expect('analyze>')
py.send('INPUT_A_STRING_HERE')
output = py.read()
py.close()
You could try:
from subprocess import Popen, PIPE
p = Popen(["fst-mor", "NOUN.A"], stdin=PIPE, stdout=PIPE)
output = p.communicate("INPUT_A_STRING_HERE")[0]
A sample that communicates with another process:
import subprocess

pipe = subprocess.Popen(['clisp'], stdin=subprocess.PIPE, stdout=subprocess.PIPE)
(response, err) = pipe.communicate("(+ 1 1)\n(* 2 2)")
# only print the last 6 lines to chop off the REPL intro text.
# Obviously you can do whatever manipulations you feel are necessary
# to correctly grab the output here
print '\n'.join(response.split('\n')[-6:])
Note that communicate will close the streams after it runs, so you have to know all your commands ahead of time for this method to work. It seems like the pipe.stdout doesn't flush until stdin is closed? I'd be curious if there is a way around that I'm missing.
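If you do need to interleave writes and reads instead of knowing every command up front, a rough sketch is to keep the pipes open yourself, bearing in mind the buffering and deadlock caveats above. Here cat is used as a stand-in child that simply echoes its input:
import subprocess

pipe = subprocess.Popen(['cat'], stdin=subprocess.PIPE, stdout=subprocess.PIPE)
pipe.stdin.write("(+ 1 1)\n")
pipe.stdin.flush()               # make sure the child actually sees the input
print pipe.stdout.readline()     # read one line back; may block if the child buffers its output
pipe.stdin.close()               # signal EOF so the child exits
pipe.wait()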
You should use the subprocess module.
In your example you might run:
subprocess.check_output(["fst-mor", "NOUN.A"])

Very large input and piping using subprocess.Popen

I have a pretty simple problem. I have a large file that goes through three steps: a decoding step using an external program, some processing in Python, and then recoding using another external program. I have been using subprocess.Popen() to try to do this in Python rather than forming Unix pipes. However, all the data are buffered in memory. Is there a Pythonic way of doing this task, or am I better off dropping back to a simple Python script that reads from stdin and writes to stdout, with Unix pipes on either side?
import os, sys, subprocess

def main(infile, reflist):
    print infile, reflist
    samtoolsin = subprocess.Popen(["samtools", "view", infile],
                                  stdout=subprocess.PIPE, bufsize=1)
    samtoolsout = subprocess.Popen(["samtools", "import", reflist, "-",
                                    infile + ".tmp"], stdin=subprocess.PIPE, bufsize=1)
    for line in samtoolsin.stdout.read():
        if(line.startswith("#")):
            samtoolsout.stdin.write(line)
        else:
            linesplit = line.split("\t")
            if(linesplit[10]=="*"):
                linesplit[9]="*"
            samtoolsout.stdin.write("\t".join(linesplit))
Popen has a bufsize parameter that will limit the size of the buffer in memory. If you don't want the files in memory at all, you can pass file objects as the stdin and stdout parameters. From the subprocess docs:
bufsize, if given, has the same meaning as the corresponding argument to the built-in open() function: 0 means unbuffered, 1 means line buffered, any other positive value means use a buffer of (approximately) that size. A negative bufsize means to use the system default, which usually means fully buffered. The default value for bufsize is 0 (unbuffered).
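For example, a minimal sketch of passing real file objects so the data never passes through your Python process at all (the command and file names here are placeholders):
import subprocess

infile = open('input.dat', 'rb')    # placeholder file names
outfile = open('output.dat', 'wb')
p = subprocess.Popen(['some_external_tool'], stdin=infile, stdout=outfile)  # hypothetical command
p.wait()
infile.close()
outfile.close()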
Try to make this small change, see if the efficiency is better.
for line in samtoolsin.stdout:
    if(line.startswith("#")):
        samtoolsout.stdin.write(line)
    else:
        linesplit = line.split("\t")
        if(linesplit[10]=="*"):
            linesplit[9]="*"
        samtoolsout.stdin.write("\t".join(linesplit))
However, all the data are buffered to memory ...
Are you using subprocess.Popen.communicate()? By design, this function will wait for the process to finish, all the while accumulating the data in a buffer, and then return it to you. As you've pointed out, this is problematic if dealing with very large files.
If you want to process the data while it is generated, you will need to write a loop using the poll() and .stdout.read() methods, then write that output to another socket/file/etc.
Do be sure to notice the warnings in the documentation against doing this, as it can easily result in a deadlock (the parent process waits for the child process to generate data, while the child is in turn waiting for the parent process to empty the pipe buffer).
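A rough sketch of such a loop, under the assumption that you only need to drain the child's stdout and write it elsewhere (command and file names are placeholders):
import subprocess

proc = subprocess.Popen(['some_external_tool'], stdout=subprocess.PIPE)  # hypothetical command
out = open('output.dat', 'wb')
while True:
    chunk = proc.stdout.read(8192)   # returns '' once the child closes its stdout
    if not chunk:
        break
    out.write(chunk)
out.close()
proc.wait()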
I was using the .read() method on the stdout stream. Instead, I simply needed to read directly from the stream in the for loop above. The corrected code does what I expected.
#!/usr/bin/env python
import os
import sys
import subprocess

def main(infile, reflist):
    print infile, reflist
    samtoolsin = subprocess.Popen(["samtools", "view", infile],
                                  stdout=subprocess.PIPE, bufsize=1)
    samtoolsout = subprocess.Popen(["samtools", "import", reflist, "-",
                                    infile + ".tmp"], stdin=subprocess.PIPE, bufsize=1)
    for line in samtoolsin.stdout:
        if(line.startswith("#")):
            samtoolsout.stdin.write(line)
        else:
            linesplit = line.split("\t")
            if(linesplit[10]=="*"):
                linesplit[9]="*"
            samtoolsout.stdin.write("\t".join(linesplit))
Trying to do some basic shell piping with very large input in python:
svnadmin load /var/repo < r0-100.dump
I found the simplest way to get this working even with large (2-5GB) files was:
subprocess.check_output('svnadmin load %s < %s' % (repo, fname), shell=True)
I like this method because it's simple and you can do standard shell redirection.
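If you'd rather avoid shell=True, an equivalent sketch is to open the dump file yourself and hand it to the subprocess as its stdin; the OS then streams it without buffering it in Python (repo and fname as defined above):
import subprocess

with open(fname, 'rb') as dump:
    subprocess.check_call(['svnadmin', 'load', repo], stdin=dump)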
I tried going the Popen route to run a redirect:
cmd = 'svnadmin load %s' % repo
p = Popen(cmd, stdin=PIPE, stdout=PIPE, shell=True)
with open(fname) as inline:
for line in inline:
p.communicate(input=line)
But that broke with large files. Using:
p.stdin.write()
Also broke with very large files.
