Python: how to get the final output of multiple system commands?

There are many posts here on SO, like this one: Store output of subprocess.Popen call in a string
There is a problem with complicated commands. For example, if I need to get the output from this:
ps -ef|grep something|wc -l
Subprocess won't do the job, because the argument for subprocess is [name of program, arguments], so it is not possible to use more sophisticated commands (more programs, pipes, etc.).
Is there a way to capture the output of a chain of multiple commands?

Just pass the shell=True option to subprocess
import subprocess
subprocess.check_output('ps -ef | grep something | wc -l', shell=True)

For a no-shell, clean version using the subprocess module, you can use the following example (from the documentation):
output = `dmesg | grep hda`
becomes
from subprocess import Popen, PIPE

p1 = Popen(["dmesg"], stdout=PIPE)
p2 = Popen(["grep", "hda"], stdin=p1.stdout, stdout=PIPE)
p1.stdout.close() # Allow p1 to receive a SIGPIPE if p2 exits.
output = p2.communicate()[0]
The Python program essentially does here what the shell does: it sends the output of each command to the next one in turn. An advantage of this approach is that the programmer has full control on the individual standard error outputs of the commands (they can be suppressed if needed, logged, etc.).
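For example, if you only wanted to silence grep's error output in the pipeline above, a small sketch (using subprocess.DEVNULL, available since Python 3.3) could look like this:
from subprocess import Popen, PIPE, DEVNULL

p1 = Popen(["dmesg"], stdout=PIPE)
p2 = Popen(["grep", "hda"], stdin=p1.stdout, stdout=PIPE, stderr=DEVNULL)  # suppress only grep's stderr
p1.stdout.close()  # Allow p1 to receive a SIGPIPE if p2 exits.
output = p2.communicate()[0]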
That said, I generally prefer to use instead the subprocess.check_output('ps -ef | grep something | wc -l', shell=True) shell-delegation approach suggested by nneonneo: it is general, very legible and convenient.

Well, another alternative would just be to implement part of the command in plain Python. For example,
import subprocess

count = 0
for line in subprocess.check_output(['ps', '-ef']).decode().split('\n'):
    if something in line:  # or re.search(something, line) to use a regex
        count += 1
print(count)

Related

Chaining subprocess.Popen to simulate pipes [duplicate]

How do I execute the following shell command using the Python subprocess module?
echo "input data" | awk -f script.awk | sort > outfile.txt
The input data will come from a string, so I don't actually need echo. I've got this far, can anyone explain how I get it to pipe through sort too?
p_awk = subprocess.Popen(["awk", "-f", "script.awk"],
                         stdin=subprocess.PIPE,
                         stdout=file("outfile.txt", "w"))
p_awk.communicate( "input data" )
UPDATE: Note that while the accepted answer below doesn't actually answer the question as asked, I believe S.Lott is right and it's better to avoid having to solve that problem in the first place!
You'd be a little happier with the following.
import subprocess
awk_sort = subprocess.Popen( "awk -f script.awk | sort > outfile.txt",
                             stdin=subprocess.PIPE, shell=True )
awk_sort.communicate( b"input data\n" )
Delegate part of the work to the shell. Let it connect two processes with a pipeline.
You'd be a lot happier rewriting 'script.awk' into Python, eliminating awk and the pipeline.
Edit. Some of the reasons for suggesting that awk isn't helping.
[There are too many reasons to respond via comments.]
Awk is adding a step of no significant value. There's nothing unique about awk's processing that Python doesn't handle.
The pipelining from awk to sort, for large sets of data, may improve elapsed processing time. For short sets of data, it has no significant benefit. A quick measurement of awk >file ; sort file versus awk | sort will reveal whether concurrency helps. With sort, it rarely helps because sort is not a once-through filter.
The simplicity of "Python to sort" processing (instead of "Python to awk to sort") prevents the exact kind of questions being asked here.
Python -- while wordier than awk -- is also explicit where awk has certain implicit rules that are opaque to newbies, and confusing to non-specialists.
Awk (like the shell script itself) adds Yet Another Programming language. If all of this can be done in one language (Python), eliminating the shell and the awk programming eliminates two programming languages, allowing someone to focus on the value-producing parts of the task.
Bottom line: awk can't add significant value. In this case, awk is a net cost; it added enough complexity that it was necessary to ask this question. Removing awk will be a net gain.
Sidebar: why building a pipeline (a | b) is so hard.
When the shell is confronted with a | b it has to do the following.
Fork a child process of the original shell. This will eventually become b.
Build an OS pipe (not a Python subprocess.PIPE): call os.pipe(), which returns two new file descriptors that are connected via a common buffer. At this point the process has stdin, stdout, stderr from its parent, plus a file that will be "a's stdout" and "b's stdin".
Fork a child. The child replaces its stdout with the new a's stdout. Exec the a process.
The b child replaces its stdin with the new b's stdin. Exec the b process.
The b child waits for a to complete.
The parent is waiting for b to complete.
I think that the above can be used recursively to spawn a | b | c, but you have to implicitly parenthesize long pipelines, treating them as if they're a | (b | c).
Since Python has os.pipe(), os.fork() and the os.exec*() family, and you can replace sys.stdin and sys.stdout, there's a way to do the above in pure Python. Indeed, you may be able to work out some shortcuts using os.pipe() and subprocess.Popen.
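As an illustration only, a bare-bones sketch of that fork/pipe/exec dance (POSIX only; ps and grep stand in here for a and b) might look like this:
import os

read_fd, write_fd = os.pipe()            # the shared buffer between the two children

if os.fork() == 0:                       # first child: becomes "a" (here: ps -ef)
    os.dup2(write_fd, 1)                 # its stdout is the pipe's write end
    os.close(read_fd)
    os.close(write_fd)
    os.execvp("ps", ["ps", "-ef"])

if os.fork() == 0:                       # second child: becomes "b" (here: grep python)
    os.dup2(read_fd, 0)                  # its stdin is the pipe's read end
    os.close(read_fd)
    os.close(write_fd)
    os.execvp("grep", ["grep", "python"])

os.close(read_fd)                        # the parent keeps neither end open
os.close(write_fd)
os.wait()                                # wait for both children
os.wait()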
However, it's easier to delegate that operation to the shell.
import subprocess
some_string = b'input_data'
sort_out = open('outfile.txt', 'wb', 0)
sort_in = subprocess.Popen('sort', stdin=subprocess.PIPE, stdout=sort_out).stdin
subprocess.Popen(['awk', '-f', 'script.awk'], stdout=sort_in,
                 stdin=subprocess.PIPE).communicate(some_string)
To emulate a shell pipeline:
from subprocess import check_call
check_call('echo "input data" | a | b > outfile.txt', shell=True)
without invoking the shell (see 17.1.4.2. Replacing shell pipeline):
#!/usr/bin/env python
from subprocess import Popen, PIPE
a = Popen(["a"], stdin=PIPE, stdout=PIPE)
with a.stdin:
    with a.stdout, open("outfile.txt", "wb") as outfile:
        b = Popen(["b"], stdin=a.stdout, stdout=outfile)
        a.stdin.write(b"input data")
statuses = [a.wait(), b.wait()]  # both a.stdin/stdout are closed already
plumbum provides some syntax sugar:
#!/usr/bin/env python
from plumbum.cmd import a, b # magic
(a << "input data" | b > "outfile.txt")()
The analog of:
#!/bin/sh
echo "input data" | awk -f script.awk | sort > outfile.txt
is:
#!/usr/bin/env python
from plumbum.cmd import awk, sort
(awk["-f", "script.awk"] << "input data" | sort > "outfile.txt")()
The accepted answer is sidestepping the actual question.
Here is a snippet that chains the output of multiple processes:
Note that it also prints the (somewhat) equivalent shell command so you can run it and make sure the output is correct.
#!/usr/bin/env python3
from subprocess import Popen, PIPE
# cmd1 : dd if=/dev/zero bs=1M count=100
# cmd2 : tee
# cmd3 : wc -c
cmd1 = ['dd', 'if=/dev/zero', 'bs=1M', 'count=100']
cmd2 = ['tee']
cmd3 = ['wc', '-c']
print(f"Shell style : {' '.join(cmd1)} | {' '.join(cmd2)} | {' '.join(cmd3)}")
p1 = Popen(cmd1, stdout=PIPE, stderr=PIPE) # stderr=PIPE optional, dd is chatty
p2 = Popen(cmd2, stdin=p1.stdout, stdout=PIPE)
p3 = Popen(cmd3, stdin=p2.stdout, stdout=PIPE)
print("Output from last process : " + (p3.communicate()[0]).decode())
# theoretically p1 and p2 may still be running; this ensures we are collecting their return codes
p1.wait()
p2.wait()
print("p1 return: ", p1.returncode)
print("p2 return: ", p2.returncode)
print("p3 return: ", p3.returncode)
http://www.python.org/doc/2.5.2/lib/node535.html covered this pretty well. Is there some part of this you didn't understand?
Your program would be pretty similar, but the second Popen would have stdout= to a file, and you wouldn't need the output of its .communicate().
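Roughly, and assuming the script.awk and outfile.txt names from the question, that would look something like:
from subprocess import Popen, PIPE

with open("outfile.txt", "w") as outfile:
    p_awk = Popen(["awk", "-f", "script.awk"], stdin=PIPE, stdout=PIPE)
    p_sort = Popen(["sort"], stdin=p_awk.stdout, stdout=outfile)
    p_awk.stdout.close()               # allow p_awk to get a SIGPIPE if sort exits first
    p_awk.stdin.write(b"input data\n")
    p_awk.stdin.close()
    p_awk.wait()
    p_sort.wait()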
Inspired by @Cristian's answer. I ran into the same issue, but with a different command. So I'm putting my tested example here, which I believe could be helpful:
import subprocess

grep_proc = subprocess.Popen(["grep", "rabbitmq"],
                             stdin=subprocess.PIPE,
                             stdout=subprocess.PIPE)
subprocess.Popen(["ps", "aux"], stdout=grep_proc.stdin)
out, err = grep_proc.communicate()
This is tested.
What has been done
Declared a lazy grep execution with stdin from a pipe. This command runs when the ps command executes and fills the pipe with its stdout.
Called the primary command ps with stdout directed to the pipe used by the grep command.
Called communicate() on grep to get its stdout from the pipe.
I like this approach because it is the natural pipe concept gently wrapped with subprocess interfaces.
The previous answers missed an important point. Replacing shell pipeline is basically correct, as pointed out by geocar. It is almost sufficient to run communicate on the last element of the pipe.
The remaining problem is passing the input data to the pipeline. With multiple subprocesses, a simple communicate(input_data) on the last element doesn't work - it hangs forever. You need to create a pipeline and a child manually, like this:
import os
import subprocess
input = """\
input data
more input
""" * 10
rd, wr = os.pipe()
if os.fork() != 0:  # parent
    os.close(wr)
else:               # child
    os.close(rd)
    os.write(wr, input.encode())  # os.write() needs bytes on Python 3
    os.close(wr)
    exit()
p_awk = subprocess.Popen(["awk", "{ print $2; }"],
                         stdin=rd,
                         stdout=subprocess.PIPE)
p_sort = subprocess.Popen(["sort"],
                          stdin=p_awk.stdout,
                          stdout=subprocess.PIPE)
p_awk.stdout.close()
out, err = p_sort.communicate()
print (out.rstrip())
Now the child provides the input through the pipe, and the parent calls communicate(), which works as expected. With this approach, you can create arbitrary long pipelines without resorting to "delegating part of the work to the shell". Unfortunately the subprocess documentation doesn't mention this.
There are ways to achieve the same effect without pipes:
from tempfile import TemporaryFile
tf = TemporaryFile()
tf.write(input.encode())  # TemporaryFile is opened in binary mode by default
tf.seek(0, 0)
Now use stdin=tf for p_awk. It's a matter of taste what you prefer.
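That is, the awk process from the example above would then be started as:
p_awk = subprocess.Popen(["awk", "{ print $2; }"],
                         stdin=tf,
                         stdout=subprocess.PIPE)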
The above is still not 100% equivalent to bash pipelines because the signal handling is different. You can see this if you add another pipe element that truncates the output of sort, e.g. head -n 10. With the code above, sort will print a "Broken pipe" error message to stderr. You won't see this message when you run the same pipeline in the shell. (That's the only difference though, the result in stdout is the same). The reason seems to be that python's Popen sets SIG_IGN for SIGPIPE, whereas the shell leaves it at SIG_DFL, and sort's signal handling is different in these two cases.
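If that difference matters, one possible workaround (POSIX only; a sketch, and largely unnecessary on Python 3, where restore_signals already resets SIGPIPE in the child) is to restore the default SIGPIPE handling via preexec_fn:
import signal
import subprocess

def restore_sigpipe():
    # undo Python's SIG_IGN for SIGPIPE so sort dies quietly, as it would in the shell
    signal.signal(signal.SIGPIPE, signal.SIG_DFL)

p_sort = subprocess.Popen(["sort"],
                          stdin=subprocess.PIPE,
                          stdout=subprocess.PIPE,
                          preexec_fn=restore_sigpipe)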
EDIT: pipes is available on Windows but, crucially, doesn't appear to actually work on Windows. See comments below.
The Python standard library now includes the pipes module for handling this:
https://docs.python.org/2/library/pipes.html, https://docs.python.org/3.4/library/pipes.html
I'm not sure how long this module has been around, but this approach appears to be vastly simpler than mucking about with subprocess.
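For reference, a minimal sketch of what using pipes looks like (input.txt is a hypothetical input file; note the module was deprecated in Python 3.11 and removed in 3.13, so this only applies to older versions):
import pipes

t = pipes.Template()
t.append('grep something', '--')  # '--' means the command reads stdin and writes stdout
t.append('wc -l', '--')
f = t.open('input.txt', 'r')      # run the pipeline over input.txt and read its output
print(f.read())
f.close()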
For me, the below approach is the cleanest and easiest to read
from subprocess import Popen, PIPE
def string_to_2_procs_to_file(input_s, first_cmd, second_cmd, output_filename):
    with open(output_filename, 'wb') as out_f:
        p2 = Popen(second_cmd, stdin=PIPE, stdout=out_f)
        p1 = Popen(first_cmd, stdout=p2.stdin, stdin=PIPE)
        p1.communicate(input=input_s.encode())  # encode so this also works on Python 3
        p1.wait()
        p2.stdin.close()
        p2.wait()
which can be called like so:
string_to_2_procs_to_file('input data', ['awk', '-f', 'script.awk'], ['sort'], 'output.txt')

How to use linux command in python with python functions (sys.argv)

In my Python script, I need to use 'awk', but I want to pass the file using sys.argv.
My current code is like this:
import sys
import os
cmd="awk '/regex/ {print}' sys.argv[1] | sed 's/old/new/g'"
x=os.popen(cmd).read()
Now the problem is that 'sys.argv' is a Python thing, but the cmd variable is a Linux command.
So my question is - is there any way to include sys.argv in my Linux command?
You really don't need Awk or sed for this. Python can do these things natively, elegantly, flexibly, robustly, and naturally.
import sys
import re
r = re.compile(r'regex')
s = re.compile(r'old')
with open(sys.argv[1]) as input:
    for line in input:
        if r.search(line):
            print(s.sub('new', line), end='')  # the line already carries its newline
If you really genuinely want to use subprocesses for something, simply use Python's general string interpolation functions where you need to insert the value of a Python variable into a string.
import subprocess
import sys
import shlex
result = subprocess.run(
    """awk '/regex/ {{print}}' {} |
       sed 's/old/new/g'""".format(shlex.quote(sys.argv[1])),
    # the braces around print are doubled so .format() leaves them alone
    stdout=subprocess.PIPE,
    shell=True, check=True)
print(result.stdout)
But really, don't do this. If you really can't avoid a subprocess, keep it as simple as possible (avoid shell=True and peel off all the parts which can be done in Python).
Just try it like this:
cmd="awk '/regex/ {print}' " + str(sys.argv[1]) + " | sed 's/old/new/g'"
x=os.popen(cmd).read()
Your best choice is to implement your logic as pure Python logic, as described in the first part of the answer by @tripleee. Your second-best choice is to keep the external tools, but eliminate the need for a shell in invoking them and connecting them together.
See the Python documentation section Replacing Shell Pipelines.
import sys
from subprocess import Popen, PIPE
p1 = Popen(['awk', '/regex/ {print}'], stdin=open(sys.argv[1]), stdout=PIPE)
p2 = Popen(['sed', 's/old/new/g'], stdin=p1.stdout, stdout=PIPE)
x = p2.communicate()[0]
Your third best choice is to keep the shell, but pass the data out-of-band from the code:
import subprocess

p = subprocess.run([
    """awk '/regex/ {print}' <"$1" | sed 's/old/new/'""",  # code to run
    '_',          # $0 in the context of that code
    sys.argv[1],  # $1 in the context of that code
], shell=True, check=True, stdout=subprocess.PIPE)
print(p.stdout)

How to execute '<(cat fileA fileB)' using python?

I am writing a Python program that uses other software. I was able to pass the command using subprocess.Popen. I am now facing a new problem: I need to concatenate multiple files into two
files and use them as the input for the external program. The command line looks like this:
extersoftware --fq --f <(cat fileA_1 fileB_1) <(cat fileA_2 fileB_2)
I cannot use shell=True because there are other arguments I need to pass via variables, such as --fq. (They are not limited to --fq; this is just an example.)
One possible solution is to generate an intermediate file.
This is what I have tried:
file_1 = ['cat', 'fileA_1', 'fileB_1']
p1 = Popen(file_1, stdout=PIPE)
p2 = Popen(['>', 'output_file'], stdin=p1.stdout, stdout=PIPE)
p1.stdout.close()
output = p2.communicate()
print output
I got error message: OSError: [Errno 2] No such file or directory Which part did I do wrong?
It would be better if there were no intermediate file. For this reason, I am looking at named pipes, but I do not quite understand them.
I have looked at multiple questions that have been answered here. To me they are all somehow different from my question.
Thanks in advance for all your help.
The way bash handles <(..) is to:
Create a pipe
Fork a command that writes to the write end
Substitute the <(..) for /dev/fd/N, where N is the file descriptor of the pipe's read end (try echo <(true)).
Run the command
The command will then open /dev/fd/N, and the OS will cause that to duplicate the inherited read end of the pipe.
We can do the same thing in Python:
import subprocess
import os
# Open a pipe and run a command that writes to the write end
input_fd, output_fd = os.pipe()
subprocess.Popen(["cat", "foo.txt", "bar.txt"], shell=False, stdout=output_fd)
os.close(output_fd)
# Run a command that uses /dev/fd/* to read from the read end
proc = subprocess.Popen(["wc", "/dev/fd/" + str(input_fd)],
                        shell=False, stdout=subprocess.PIPE)
# Read that command's output
print proc.communicate()[0]
For example:
$ cat foo.txt
Hello
$ cat bar.txt
World
$ wc <(cat foo.txt bar.txt)
2 2 12 /dev/fd/63
$ python test.py
2 2 12 /dev/fd/4
Process substitution returns the device filename that is being used. You will have to assign the pipe to a higher FD (e.g. 20) by passing a function to preexec_fn that uses os.dup2() to copy it, and then pass the FD device filename (e.g. /dev/fd/20) as one of the arguments of the call.
def assignfd(fd, handle):
    def assign():
        os.dup2(handle, fd)
    return assign

...

p2 = Popen(['cat', '/dev/fd/20'], preexec_fn=assignfd(20, p1.stdout.fileno()))

...
It's actually possible to have it both ways -- using a shell, while passing a list of arguments through unambiguously in a way that doesn't allow them to be shell-parsed.
Use bash explicitly rather than shell=True to ensure that you have support for <(), and use "$@" to refer to the additional argv array elements, like so:
subprocess.Popen(['bash', '-c',
    'extersoftware "$@" --f <(cat fileA_1 fileB_1) <(cat fileA_2 fileB_2)',
    "_",     # this is a dummy passed in as argv[0] of the interpreter
    "--fq",  # this is substituted into the shell by the "$@"
])
If you wanted to independently pass in all three arrays -- extra arguments, and the exact filenames to be passed to each cat instance:
BASH_SCRIPT=r'''
declare -a filelist1=( )
filelist1_len=$1; shift
while (( filelist1_len-- > 0 )); do
  filelist1+=( "$1" ); shift
done

filelist2_len=$1; shift
while (( filelist2_len-- > 0 )); do
  filelist2+=( "$1" ); shift
done

extersoftware "$@" --f <(cat "${filelist1[@]}") <(cat "${filelist2[@]}")
'''

subprocess.Popen(['bash', '-c', BASH_SCRIPT, '_'] +
                 [str(len(filelist1))] + filelist1 +
                 [str(len(filelist2))] + filelist2 +
                 ["--fq"])
You could put more interesting logic in the embedded shell script as well, were you so inclined.
In this specific case, we may use:
import subprocess
import os
if __name__ == '__main__':
    input_fd1, output_fd1 = os.pipe()
    subprocess.Popen(['cat', 'fileA_1', 'fileB_1'],
                     shell=False, stdout=output_fd1)
    os.close(output_fd1)

    input_fd2, output_fd2 = os.pipe()
    subprocess.Popen(['cat', 'fileA_2', 'fileB_2'],
                     shell=False, stdout=output_fd2)
    os.close(output_fd2)

    proc = subprocess.Popen(['extersoftware', '--fq', '--f',
                             '/dev/fd/' + str(input_fd1),
                             '/dev/fd/' + str(input_fd2)], shell=False)
Change log:
Reformatted the code so it should be easier to read now (and hopefully still syntactically correct). It's tested in Python 2.6.6 on Scientific Linux 6.5 and everything looks fine.
Removed unnecessary semicolons.

Running an external program from python: piping without waiting for all output

I have looked at Calling an external command in Python and tried every possible way with subprocess and os.popen but nothing seems to work.
If I try
import os
stream = os.popen("program.ex -f file.dat | grep fish | head -4")
I get lines and lines of
grep: broken pipe
If I switch the grep and head commands around, it never gets to the grep command because the output from program.ex is prohibitively long (which is why I run with head -4).
Of course the following fails because of the pipes:
import subprocess as sp
cmd = "program.ex -f file.dat | grep fish | head -4"
proc = sp.Popen(cmd.split(),stdout=sp.PIPE,stderr=sp.PIPE)
stdout, stderr = proc.communicate()
So I tried breaking it down
cmd1 = "program.ex -f file.ex"
cmd2 = "head -4"
cmd3 = "grep fish"
proc1 = sp.Popen(cmd1.split(),stdout=sp.PIPE,stderr=sp.PIPE)
proc2 = sp.Popen(cmd2.split(),stdout=sp.PIPE,stdin=proc1.stdout)
proc3 = sp.Popen(cmd3.split(),stdout=sp.PIPE,stdin=proc2.stdout)
stdout, stderr = proc1.communicate()
which does run, except it gets stuck on cmd1 because the output from program.ex is prohibitively long.
Finally I tried hiding it in an external shell script and fortran program, but the fortran program does a
call system("program.ex -f file.dat | grep fish | head -4")
and I guess this messes up python again.
Note: If I do this directly in the terminal, it doesn't need to get the whole output from program.ex and the command finishes instantly.
So, my question is:
How can I get the above command to run in python like it does in the terminal (ie, head and grep the output from program.ex without needing to wait for all the output from program.ex)?
Help is greatly appreciated!
Edit:
I also tried with shell=True:
import subprocess as sp
cmd = "program.ex -f file.dat | head -4 | grep fish"
proc = sp.Popen(cmd.split(),stdout=sp.PIPE,stderr=sp.PIPE,shell=True)
stdout, stderr = proc.communicate()
which does run, and while stderr has expected (un-needed) content, stdout is empty. If I replace the above cmd variable with the name of a fortran program which calls the system command instead, then it hangs on program.ex again, probably waiting for all the output to finish.
You can use bash to handle the pipes.
It can only run script files, and it won't run commands directly (bash -e echo gives /bin/echo: /bin/echo: cannot execute binary file):
bash -e <script to run>
If you put the commands in the script file, it will run them.
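As a rough sketch of that idea (using the program and file names from the question):
import subprocess

# write the pipeline into a small script file, then let bash run it
with open("pipeline.sh", "w") as f:
    f.write("program.ex -f file.dat | grep fish | head -4\n")

out = subprocess.check_output(["bash", "-e", "pipeline.sh"])
print(out.decode())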
This still gives me an error of sorts in the stderr from the first process, but maybe it's still good enough for your purposes? Using your multiple pipes example, but calling .communicate() on the output process:
import subprocess
cmd1 = ['yes', 'fishy'] # is this similar enough to your example program?
cmd2 = ['head', '-4']
cmd3 = ['grep', 'fish']
proc1 = subprocess.Popen(cmd1, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
proc2 = subprocess.Popen(cmd2, stdout=subprocess.PIPE, stderr=subprocess.PIPE,
                         stdin=proc1.stdout)
proc3 = subprocess.Popen(cmd3, stdout=subprocess.PIPE, stderr=subprocess.PIPE,
                         stdin=proc2.stdout)
result, err3 = proc3.communicate()
proc2.wait()
err2 = proc2.stderr.read()
proc1.stdout.close()
proc1.wait()
err1 = proc1.stderr.read() # 'yes: standard output: Broken pipe\nyes: write error\n'
This amazing but not-well-known library might be what you're looking for:
https://github.com/kennethreitz/envoy
Be sure to use the Github version, not the one that gets installed with pip. It's only a single file by the way.
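A rough sketch of what envoy usage looks like (based on its README; treat the exact attribute names as assumptions):
import envoy

r = envoy.run("program.ex -f file.dat | grep fish | head -4")
print(r.status_code)
print(r.std_out)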

Linking subprocesses in Python

Hi, I have a question about linking input and output with subprocesses in Python. I am trying to simplify the program by passing the output of one step directly to another subprocess rather than writing it to a file and then opening another process to run on that file.
E.g. First process uses SAMTOOLS to output a specific chromosome from a large bam file.
So...
bigfile.bam is read in and outputs chromosome22.bam
The next subprocess uses BEDTOOLS to convert that chromosome22.bam to chromosome22.bed
So...
chromosome22.bam is read in and outputs chromosome22.bed
What I want to do is pass the stdout of the first process into the second so there is no need for the intermediate file.
So far I have this...
for x in 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,'X','Y':
    subprocess.call("%s view -bh %s %s > %s/%s/%s.bam" % (samtools,bam,x,bampath,out,x), shell=True)
This makes the chromosome[1-22,X,Y].bam files. But can I avoid this and put another subprocess command in the same loop to convert them to bed files?
The command for bed conversion is:
bedpath/bedtools bamtobed -i [bamfile] > [bedfile]
Please have a look at the replacing shell pipeline example in the documentation.
output=$(dmesg | grep hda)
becomes:
p1 = Popen(["dmesg"], stdout=PIPE)
p2 = Popen(["grep", "hda"], stdin=p1.stdout, stdout=PIPE)
p1.stdout.close() # Allow p1 to receive a SIGPIPE if p2 exits.
output = p2.communicate()[0]
The explanation is:
The p1.stdout.close() call after starting the p2 is important in order for p1 to receive a SIGPIPE if p2 exits before p1.
No need to use python here. Much easier in shell. But essentially, it works the same as in python.
If bedtools can read from stdin, you can e.g. do
#!/bin/sh
for x in `seq 1 22` X Y; do
  $samtools view -bh $bam $x | $bedtools bamtobed > $bampath/$out/$x.bam
done
Depending on how bedtools was designed, you might also need to use -i - to have it read from stdin.
If you stick with Python, I strongly recommend learning how to do this
without doing it all in the shell, and
without producing shell commands that you need to escape properly to avoid errors.
subprocess is safer to use when you use the array-based syntax and no shell.
Make that two subprocess invocations, one for each command. See http://docs.python.org/library/subprocess.html#replacing-shell-pipeline for more details.
cmd1 = [samtools, "view", "-bh", bam, x]
cmd2 = [bedtools, "bamtobed"]
c1 = subprocess.Popen(cmd1, stdout=subprocess.PIPE)
c2 = subprocess.Popen(cmd2, stdin=c1.stdout, stdout=open(outputfilename, "w"))
c1.stdout.close()
c2.communicate()
Yes, you can use the pipe functionality. See if you can read from stdin for the bamtobed process ... if you can, try the following. This way you save on the disk IO time assuming the processing load is light.
SLIGHT modification:
proc1.stdout is now the stdin for the 2nd process.
proc1 = subprocess.Popen("%s view -bh %s %s" % (samtools, bam, x), shell=True,
                         stdout=subprocess.PIPE)
proc2 = subprocess.Popen("bedpath/bedtools bamtobed > %s" % (outFileName,), shell=True,
                         stdin=proc1.stdout)
