I am writing a Python program that uses other software. I was able to pass the command using subprocess.Popen. I am now facing a new problem: I need to concatenate multiple files into two
files and use them as the input for the external program. The command line looks like this:
extersoftware --fq --f <(cat fileA_1 fileB_1) <(cat fileA_2 fileB_2)
I cannot use shell=True because there are other options I need to pass in via variables, such as --fq. (They are not limited to --fq; that is just an example.)
One possible solution is to generate an intermediate file.
This is what I have tried:
file_1 = ['cat', 'fileA_1', 'fileB_1']
p1 = Popen(file_1, stdout=PIPE)
p2 = Popen(['>', 'output_file'], stdin=p1.stdout, stdout=PIPE)
p1.stdout.close()
output = p2.communicate()
print output
I got this error message: OSError: [Errno 2] No such file or directory. Which part did I get wrong?
It would be better if there were no intermediate file. For this reason, I am looking at named pipes, but I do not quite understand them.
I have looked at multiple questions that have already been answered here; to me they are all somehow different from my question.
Thanks in advance for all your help.
The way bash handles <(..) is to:
Create a pipe
Fork a command that writes to the write end
Substitute the <(..) with /dev/fd/N, where N is the file descriptor of the read end of the pipe (try echo <(true)).
Run the command
The command will then open /dev/fd/N, and the OS will cause that to duplicate the inherited read end of the pipe.
We can do the same thing in Python:
import subprocess
import os
# Open a pipe and run a command that writes to the write end
input_fd, output_fd = os.pipe()
subprocess.Popen(["cat", "foo.txt", "bar.txt"], shell=False, stdout=output_fd)
os.close(output_fd)
# Run a command that uses /dev/fd/* to read from the read end
proc = subprocess.Popen(["wc", "/dev/fd/" + str(input_fd)],
                        shell=False, stdout=subprocess.PIPE)
# Read that command's output
print proc.communicate()[0]
For example:
$ cat foo.txt
Hello
$ cat bar.txt
World
$ wc <(cat foo.txt bar.txt)
2 2 12 /dev/fd/63
$ python test.py
2 2 12 /dev/fd/4
Process substitution returns the device filename that is being used. You will have to assign the pipe to a higher FD (e.g. 20) by passing a function to preexec_fn that uses os.dup2() to copy it, and then pass the FD device filename (e.g. /dev/fd/20) as one of the arguments of the call.
def assignfd(fd, handle):
    def assign():
        os.dup2(handle, fd)
    return assign
...
p2 = Popen(['cat', '/dev/fd/20'], preexec_fn=assignfd(20, p1.stdout.fileno()))
...
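Putting those pieces together for the question's command, a fuller sketch might look like the following (untested against extersoftware; wc -l stands in for the real program, and close_fds=False is assumed so the inherited pipe descriptor survives into the child -- that is the default on Python 2 but must be passed explicitly on Python 3):
import os
from subprocess import Popen, PIPE

def assignfd(fd, handle):
    def assign():
        os.dup2(handle, fd)
    return assign

p1 = Popen(['cat', 'fileA_1', 'fileB_1'], stdout=PIPE)
p2 = Popen(['wc', '-l', '/dev/fd/20'],
           preexec_fn=assignfd(20, p1.stdout.fileno()),
           close_fds=False)
p1.stdout.close()   # only the child needs the read end now
p2.wait()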
It's actually possible to have it both ways -- using a shell, while passing a list of arguments through unambiguously in a way that doesn't allow them to be shell-parsed.
Use bash explicitly rather than shell=True to ensure that you have support for <(), and use "$@" to refer to the additional argv array elements, like so:
subprocess.Popen(['bash', '-c',
    'extersoftware "$@" --f <(cat fileA_1 fileB_1) <(cat fileA_2 fileB_2)',
    '_',     # this is a dummy passed in as argv[0] of the interpreter
    '--fq',  # this is substituted into the shell by the "$@"
])
If you wanted to independently pass in all three arrays -- extra arguments, and the exact filenames to be passed to each cat instance:
BASH_SCRIPT=r'''
declare -a filelist1=( )
filelist1_len=$1; shift
while (( filelist1_len-- > 0 )); do
  filelist1+=( "$1" ); shift
done

declare -a filelist2=( )
filelist2_len=$1; shift
while (( filelist2_len-- > 0 )); do
  filelist2+=( "$1" ); shift
done

extersoftware "$@" --f <(cat "${filelist1[@]}") <(cat "${filelist2[@]}")
'''
subprocess.Popen(['bash', '-c', BASH_SCRIPT, '_'] +
                 [str(len(filelist1))] + filelist1 +
                 [str(len(filelist2))] + filelist2 +
                 ['--fq'])
You could put more interesting logic in the embedded shell script as well, were you so inclined.
In this specific case, we may use:
import subprocess
import os
if __name__ == '__main__':
    input_fd1, output_fd1 = os.pipe()
    subprocess.Popen(['cat', 'fileA_1', 'fileB_1'],
                     shell=False, stdout=output_fd1)
    os.close(output_fd1)

    input_fd2, output_fd2 = os.pipe()
    subprocess.Popen(['cat', 'fileA_2', 'fileB_2'],
                     shell=False, stdout=output_fd2)
    os.close(output_fd2)

    proc = subprocess.Popen(['extersoftware', '--fq', '--f',
                             '/dev/fd/' + str(input_fd1),
                             '/dev/fd/' + str(input_fd2)], shell=False)
Change log:
Reformatted the code so it should be easier to read now (and hopefully still syntactically correct). It's tested in Python 2.6.6 on Scientific Linux 6.5 and everything looks fine.
Removed unnecessary semicolons.
I am facing difficulties calling a command line from my script. I run the script but I don't get any result. Through this command line in my script I want to run a tool which produces a folder that has the output files for each line. The inputpath is already defined. Can you please help me?
for line in inputFile:
cmd = 'python3 CRISPRcasIdentifier.py -f %s/%s.fasta -o %s/%s.csv -st dna -co %s/'%(inputpath,line.strip(),outputfolder,line.strip(),outputfolder)
os.system(cmd)
You really want to use the Python standard library module subprocess. Using functions from that module, you can construct your command line as a list of strings, and each element will be processed as one file name, option, or value. This bypasses the shell's escaping and eliminates the need to massage your script arguments before calling.
Besides, your code would not work as posted, because the body block of the for statement is not indented. Python would simply not accept this code (perhaps you pasted it into the question without the proper indentation).
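A rough sketch of the same loop using an argument list instead of a shell string might look like this (paths and naming follow your code; not tested against CRISPRcasIdentifier itself):
import subprocess

for line in inputFile:
    name = line.strip()
    cmd = ["python3", "CRISPRcasIdentifier.py",
           "-f", "%s/%s.fasta" % (inputpath, name),
           "-o", "%s/%s.csv" % (outputfolder, name),
           "-st", "dna",
           "-co", outputfolder + "/"]
    subprocess.run(cmd, check=True)  # raises if the tool exits with an error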
As mentioned before, executing a command via os.system(command) is not recommended. Please use subprocess instead (read about this module in the Python docs). See the code here:
for command in input_file:
    p = subprocess.Popen(command, stdout=subprocess.PIPE, stdin=subprocess.PIPE, stderr=subprocess.PIPE)
    # use this if you want to communicate with the child process
    # p = subprocess.Popen(command, stdout=subprocess.PIPE, stdin=subprocess.PIPE, stderr=subprocess.PIPE)
    p.communicate()
    # --- do the rest
I usually do it like this for a static command:
from subprocess import check_output
def sh(command):
    return check_output(command, shell=True, universal_newlines=True)
output = sh('echo hello world | sed s/h/H/')
BUT THIS IS NOT SAFE! It's vulnerable to shell injection, so you should do:
from subprocess import check_output
from shlex import split
def sh(command):
    return check_output(split(command), universal_newlines=True)
output = sh('echo hello world')
The difference is subtle but important. shell=True will create a new shell, so pipes, etc. will work. I use this when I have a big command line with pipes that is static, meaning it does not depend on user input. This is because this variant is vulnerable to shell injection: a user can input something; rm -rf / and it will run.
The second variant only accepts one command, it will not spawn a shell, instead it will run the command directly. So no pipes and such shell things will work, and is safer.
universal_newlines=True is for getting output as a string instead of bytes. Use it for text output; if you need binary output, just omit it. The default is False.
So here is the full example
from subprocess import check_output
from shlex import split
def sh(command):
    return check_output(split(command), universal_newlines=True)

for line in inputFile:
    cmd = 'python3 CRISPRcasIdentifier.py -f %s/%s.fasta -o %s/%s.csv -st dna -co %s/' % (inputpath, line.strip(), outputfolder, line.strip(), outputfolder)
    sh(cmd)
P.S.: I didn't test this.
How do I execute the following shell command using the Python subprocess module?
echo "input data" | awk -f script.awk | sort > outfile.txt
The input data will come from a string, so I don't actually need echo. I've got this far, can anyone explain how I get it to pipe through sort too?
p_awk = subprocess.Popen(["awk", "-f", "script.awk"],
                         stdin=subprocess.PIPE,
                         stdout=file("outfile.txt", "w"))
p_awk.communicate("input data")
UPDATE: Note that while the accepted answer below doesn't actually answer the question as asked, I believe S.Lott is right and it's better to avoid having to solve that problem in the first place!
You'd be a little happier with the following.
import subprocess
awk_sort = subprocess.Popen("awk -f script.awk | sort > outfile.txt",
                            stdin=subprocess.PIPE, shell=True)
awk_sort.communicate(b"input data\n")
Delegate part of the work to the shell. Let it connect two processes with a pipeline.
You'd be a lot happier rewriting 'script.awk' into Python, eliminating awk and the pipeline.
Edit. Some of the reasons for suggesting that awk isn't helping.
[There are too many reasons to respond via comments.]
Awk is adding a step of no significant value. There's nothing unique about awk's processing that Python doesn't handle.
The pipelining from awk to sort may, for large data sets, improve elapsed processing time. For short data sets, it has no significant benefit. A quick measurement of awk >file ; sort file versus awk | sort will reveal whether concurrency helps. With sort it rarely does, because sort is not a once-through filter.
The simplicity of "Python to sort" processing (instead of "Python to awk to sort") prevents the exact kind of questions being asked here.
Python -- while wordier than awk -- is also explicit where awk has certain implicit rules that are opaque to newbies, and confusing to non-specialists.
Awk (like the shell script itself) adds Yet Another Programming language. If all of this can be done in one language (Python), eliminating the shell and the awk programming eliminates two programming languages, allowing someone to focus on the value-producing parts of the task.
Bottom line: awk can't add significant value. In this case, awk is a net cost; it added enough complexity that it was necessary to ask this question. Removing awk will be a net gain.
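To make that concrete, here is a minimal sketch of the all-Python route, assuming script.awk does nothing more exotic than print the second whitespace-separated field (your real awk program will differ):
input_data = "input data\nmore input\n"

# pull out the second field of each line, like awk's { print $2 }
fields = [parts[1]
          for parts in (line.split() for line in input_data.splitlines())
          if len(parts) > 1]

with open("outfile.txt", "w") as out:
    out.write("\n".join(sorted(fields)) + "\n")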
Sidebar Why building a pipeline (a | b) is so hard.
When the shell is confronted with a | b it has to do the following.
Fork a child process of the original shell. This will eventually become b.
Build an OS pipe: not a Python subprocess.PIPE, but a call to os.pipe(), which returns two new file descriptors connected via a common buffer. At this point the process has stdin, stdout, and stderr from its parent, plus a file that will be "a's stdout" and "b's stdin".
Fork a child. The child replaces its stdout with the new a's stdout. Exec the a process.
The b child replaces its stdin with the new b's stdin. Exec the b process.
The b child waits for a to complete.
The parent is waiting for b to complete.
I think that the above can be used recursively to spawn a | b | c, but you have to implicitly parenthesize long pipelines, treating them as if they're a | (b | c).
Since Python has os.pipe(), os.exec() and os.fork(), and you can replace sys.stdin and sys.stdout, there's a way to do the above in pure Python. Indeed, you may be able to work out some shortcuts using os.pipe() and subprocess.Popen.
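For illustration, a bare-bones sketch of those steps using only os.pipe(), os.fork(), os.dup2() and os.execvp() (POSIX only, no error handling; echo and tr are just stand-in commands):
import os

def pipeline(cmd_a, cmd_b):
    """Run cmd_a | cmd_b with raw os primitives."""
    read_end, write_end = os.pipe()

    pid_a = os.fork()
    if pid_a == 0:                # child that becomes "a"
        os.dup2(write_end, 1)     # a's stdout -> write end of the pipe
        os.close(read_end)
        os.close(write_end)
        os.execvp(cmd_a[0], cmd_a)

    pid_b = os.fork()
    if pid_b == 0:                # child that becomes "b"
        os.dup2(read_end, 0)      # b's stdin <- read end of the pipe
        os.close(read_end)
        os.close(write_end)
        os.execvp(cmd_b[0], cmd_b)

    os.close(read_end)            # parent keeps neither end open
    os.close(write_end)
    os.waitpid(pid_a, 0)
    os.waitpid(pid_b, 0)

pipeline(["echo", "hello world"], ["tr", "a-z", "A-Z"])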
However, it's easier to delegate that operation to the shell.
import subprocess
some_string = b'input_data'
sort_out = open('outfile.txt', 'wb', 0)
sort_in = subprocess.Popen('sort', stdin=subprocess.PIPE, stdout=sort_out).stdin
subprocess.Popen(['awk', '-f', 'script.awk'], stdout=sort_in,
                 stdin=subprocess.PIPE).communicate(some_string)
To emulate a shell pipeline:
from subprocess import check_call
check_call('echo "input data" | a | b > outfile.txt', shell=True)
without invoking the shell (see 17.1.4.2. Replacing shell pipeline):
#!/usr/bin/env python
from subprocess import Popen, PIPE
a = Popen(["a"], stdin=PIPE, stdout=PIPE)
with a.stdin:
with a.stdout, open("outfile.txt", "wb") as outfile:
b = Popen(["b"], stdin=a.stdout, stdout=outfile)
a.stdin.write(b"input data")
statuses = [a.wait(), b.wait()] # both a.stdin/stdout are closed already
plumbum provides some syntax sugar:
#!/usr/bin/env python
from plumbum.cmd import a, b # magic
(a << "input data" | b > "outfile.txt")()
The analog of:
#!/bin/sh
echo "input data" | awk -f script.awk | sort > outfile.txt
is:
#!/usr/bin/env python
from plumbum.cmd import awk, sort
(awk["-f", "script.awk"] << "input data" | sort > "outfile.txt")()
The accepted answer sidesteps the actual question. Here is a snippet that chains the output of multiple processes.
Note that it also prints the (somewhat) equivalent shell command so you can run it and make sure the output is correct.
#!/usr/bin/env python3
from subprocess import Popen, PIPE
# cmd1 : dd if=/dev/zero bs=1M count=100
# cmd2 : tee
# cmd3 : wc -c
cmd1 = ['dd', 'if=/dev/zero', 'bs=1M', 'count=100']
cmd2 = ['tee']
cmd3 = ['wc', '-c']
print(f"Shell style : {' '.join(cmd1)} | {' '.join(cmd2)} | {' '.join(cmd3)}")
p1 = Popen(cmd1, stdout=PIPE, stderr=PIPE) # stderr=PIPE optional, dd is chatty
p2 = Popen(cmd2, stdin=p1.stdout, stdout=PIPE)
p3 = Popen(cmd3, stdin=p2.stdout, stdout=PIPE)
print("Output from last process : " + (p3.communicate()[0]).decode())
# theoretically p1 and p2 may still be running; this ensures we collect their return codes
p1.wait()
p2.wait()
print("p1 return: ", p1.returncode)
print("p2 return: ", p2.returncode)
print("p3 return: ", p3.returncode)
http://www.python.org/doc/2.5.2/lib/node535.html covered this pretty well. Is there some part of this you didn't understand?
Your program would be pretty similar, but the second Popen would have stdout= to a file, and you wouldn't need the output of its .communicate().
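In other words, something along these lines (a sketch following the docs' "Replacing shell pipeline" pattern, with the second Popen writing straight to the file):
from subprocess import Popen, PIPE

with open("outfile.txt", "wb") as outfile:
    p_awk = Popen(["awk", "-f", "script.awk"], stdin=PIPE, stdout=PIPE)
    p_sort = Popen(["sort"], stdin=p_awk.stdout, stdout=outfile)
    p_awk.stdout.close()                # so sort sees EOF when awk finishes
    p_awk.communicate(b"input data\n")  # feed the string, close stdin, wait
    p_sort.wait()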
Inspired by @Cristian's answer. I ran into just the same issue, but with a different command, so I'm putting my tested example here, which I believe could be helpful:
grep_proc = subprocess.Popen(["grep", "rabbitmq"],
                             stdin=subprocess.PIPE,
                             stdout=subprocess.PIPE)
subprocess.Popen(["ps", "aux"], stdout=grep_proc.stdin)
out, err = grep_proc.communicate()
This is tested.
What has been done
Declared a lazy grep execution with stdin coming from a pipe. This command does its work when the ps command executes and fills the pipe with its stdout.
Called the primary command ps with stdout directed to the pipe used by the grep command.
Called communicate() on grep to collect its stdout from the pipe.
I like this way because it is the natural pipe conception, gently wrapped with subprocess interfaces.
The previous answers missed an important point. Replacing shell pipeline is basically correct, as pointed out by geocar. It is almost sufficient to run communicate on the last element of the pipe.
The remaining problem is passing the input data to the pipeline. With multiple subprocesses, a simple communicate(input_data) on the last element doesn't work; it hangs forever. You need to create a pipeline and a child manually, like this:
import os
import subprocess
input = """\
input data
more input
""" * 10
rd, wr = os.pipe()
if os.fork() != 0:  # parent
    os.close(wr)
else:               # child
    os.close(rd)
    os.write(wr, input)
    os.close(wr)
    exit()

p_awk = subprocess.Popen(["awk", "{ print $2; }"],
                         stdin=rd,
                         stdout=subprocess.PIPE)
p_sort = subprocess.Popen(["sort"],
                          stdin=p_awk.stdout,
                          stdout=subprocess.PIPE)
p_awk.stdout.close()
out, err = p_sort.communicate()
print(out.rstrip())
Now the child provides the input through the pipe, and the parent calls communicate(), which works as expected. With this approach, you can create arbitrarily long pipelines without resorting to "delegating part of the work to the shell". Unfortunately the subprocess documentation doesn't mention this.
There are ways to achieve the same effect without pipes:
from tempfile import TemporaryFile
tf = TemporaryFile()
tf.write(input)
tf.seek(0, 0)
Now use stdin=tf for p_awk. It's a matter of taste what you prefer.
The above is still not 100% equivalent to bash pipelines because the signal handling is different. You can see this if you add another pipe element that truncates the output of sort, e.g. head -n 10. With the code above, sort will print a "Broken pipe" error message to stderr. You won't see this message when you run the same pipeline in the shell. (That's the only difference though, the result in stdout is the same). The reason seems to be that python's Popen sets SIG_IGN for SIGPIPE, whereas the shell leaves it at SIG_DFL, and sort's signal handling is different in these two cases.
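If you want the shell-like SIGPIPE behaviour, one option on POSIX is to reset the handler in the child via preexec_fn (a sketch; note that Python 3's Popen already restores SIGPIPE by default through restore_signals, so this mainly matters on Python 2):
import signal
import subprocess

def restore_sigpipe():
    # undo Python's SIG_IGN so the child dies quietly on a broken pipe
    signal.signal(signal.SIGPIPE, signal.SIG_DFL)

p_sort = subprocess.Popen(["sort"],
                          stdin=subprocess.PIPE,
                          stdout=subprocess.PIPE,
                          preexec_fn=restore_sigpipe)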
EDIT: pipes is available on Windows but, crucially, doesn't appear to actually work on Windows. See comments below.
The Python standard library now includes the pipes module for handling this:
https://docs.python.org/2/library/pipes.html, https://docs.python.org/3.4/library/pipes.html
I'm not sure how long this module has been around, but this approach appears to be vastly simpler than mucking about with subprocess.
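For reference, a small sketch of the same awk | sort > outfile.txt pipeline with pipes (the commands are still plain shell strings under the hood, and the module has since been deprecated and removed in recent Python 3 releases):
import pipes

t = pipes.Template()
t.append('awk -f script.awk', '--')  # '--': reads stdin, writes stdout
t.append('sort', '--')

f = t.open('outfile.txt', 'w')       # write end of the pipeline
f.write('input data\n')
f.close()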
For me, the below approach is the cleanest and easiest to read
from subprocess import Popen, PIPE
def string_to_2_procs_to_file(input_s, first_cmd, second_cmd, output_filename):
    with open(output_filename, 'wb') as out_f:
        p2 = Popen(second_cmd, stdin=PIPE, stdout=out_f)
        p1 = Popen(first_cmd, stdout=p2.stdin, stdin=PIPE)
        p1.communicate(input=bytes(input_s))
        p1.wait()
        p2.stdin.close()
        p2.wait()
which can be called like so:
string_to_2_procs_to_file('input data', ['awk', '-f', 'script.awk'], ['sort'], 'output.txt')
code-1 : Passing linux commands as a sequence of arguments
from subprocess import Popen, PIPE
run_cmd = Popen(['ls','-l','mkdir','hello'], stdout = PIPE, stderr = PIPE)
output,error = run_cmd.communicate()
print error,output, run_cmd.returncode
Output - 1:
ls: cannot access mkdir: No such file or directory
ls: cannot access hello: No such file or directory
2
In the above code I am trying to run multiple Linux commands by passing them as a sequence of arguments. If I modify the above code to the following, it works fine.
code-2 : Pass linux commands as a string
from subprocess import Popen, PIPE
run_cmd = Popen('mkdir hello; ls -l; echo Hello; rm -r hello', shell=True, stdout = PIPE, stderr = PIPE)
output,error = run_cmd.communicate()
print error,output, run_cmd.returncode
Output - 2 :
drwxrwxr-x. 2 sujatap sujatap 6 May 9 21:28 hello
-rw-rw-r--. 1 sujatap sujatap 53 May 8 20:51 test.py
Hello
0
As shell=True is not the suggested way, I want to run the Linux commands using the former approach. Thanks.
If something does not work, check its documentation: https://docs.python.org/2/library/subprocess.html#popen-constructor
args should be a sequence of program arguments or else a single string. By default, the program to execute is the first item in args if args is a sequence. If args is a string, the interpretation is platform-dependent and described below. See the shell and executable arguments for additional differences from the default behavior. Unless otherwise stated, it is recommended to pass args as a sequence.
So test your single program runs first (list of program and its arguments), then make a list of lists and run them in sequence with a loop:
myprogramsequence = [
    ["ls", "-l"],
    ["mkdir", "hello"]
]

for argumentlist in myprogramsequence:
    run_cmd = Popen( argumentlist, ...
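For example, a filled-in version of that loop, keeping the PIPE capture from your original code (Python 3 print syntax):
from subprocess import Popen, PIPE

myprogramsequence = [
    ["ls", "-l"],
    ["mkdir", "hello"]
]

for argumentlist in myprogramsequence:
    run_cmd = Popen(argumentlist, stdout=PIPE, stderr=PIPE)
    output, error = run_cmd.communicate()
    print(error, output, run_cmd.returncode)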
I am trying to mirror the following shell command using subprocess.Popen():
echo "SELECT employeeid FROM Users WHERE samaccountname=${1};" | bsqldb -S mdw2k8sqlp02.dow.com -D PhoneBookClient -U PortManUser -P plum45\\torts -q
It currently looks like:
stdout = subprocess.Popen(["echo", "\"SELECT", "employeeid", "FROM", "Users", "WHERE", "samaccountname=${1};\"", "|", "bsqldb", "arg1etc"], stdout=subprocess.PIPE)
for line in stdout.stdout.readlines():
    print line
It seems that this is wrong; it returns the following on standard out:
"SELECT employeeid FROM Users WHERE samaccountname=${1};" | bsqldb arg1etc
Does anyone know where my syntax for subprocess.Popen() has gone wrong?
The problem is that you're trying to run a shell command without the shell. What happens is that you're passing all of those strings, including "|" and everything after it, as arguments to the echo command.
Just add shell=True to your call to fix that.
However, you almost definitely want to pass the command line as a string, instead of trying to guess at the list that will be joined back up into the string to pass to the shell.
Or, even better, don't use the shell, and instead pipe within Python. The docs have a nice section about Replacing shell pipeline (and all kinds of other things) with subprocess code.
But in your case, the thing you're trying to pipe is just echo, which is quite silly, since you already have exactly what echo would return, and can just feed it as the input to the second program.
Also, I'm not sure what you expect that ${1} to get filled in with. Presumably you're porting a shell script that took some arguments on the command line; your Python script may have the same thing in sys.argv[1], but without knowing more about what you're doing, that's little more than a guess.
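A sketch of that last suggestion, with sys.argv[1] standing in for ${1} (no escaping or quoting of the value is shown; treat it as untrusted input in real code):
import sys
from subprocess import Popen, PIPE

query = "SELECT employeeid FROM Users WHERE samaccountname=%s;" % sys.argv[1]

p = Popen(["bsqldb", "-S", "mdw2k8sqlp02.dow.com", "-D", "PhoneBookClient",
           "-U", "PortManUser", "-P", "plum45\\torts", "-q"],
          stdin=PIPE, stdout=PIPE)
out, _ = p.communicate(query.encode())
print(out.decode())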
The analog of echo some string | command arg1 arg2 shell pipeline in Python is:
from subprocess import Popen, PIPE
p = Popen(["command", "arg1", "arg2"], stdin=PIPE)
p.communicate("some string")
In your case, you could write it as:
import shlex
import sys
from subprocess import Popen, PIPE
cmd = shlex.split("bsqldb -S mdw2k8sqlp02.dow.com -D PhoneBookClient "
                  "-U PortManUser -P plum45\\torts -q")
sql = """SELECT employeeid FROM Users
         WHERE samaccountname={name};""".format(name=sql_escape(sys.argv[1]))
p = Popen(cmd, stdin=PIPE)
p.communicate(input=sql)
sys.exit(p.returncode)
Hi, I have a question about linking input and output between subprocesses in Python. I am trying to simplify the program by passing the output of one step directly to another subprocess rather than writing it to a file and then opening another process to run on that file.
E.g. First process uses SAMTOOLS to output a specific chromosome from a large bam file.
So...
bigfile.bam is read in and outputs chromosome22.bam
The next subprocess uses BEDTOOLS to convert that chromosome22.bam to chromosome22.bed
So...
chromosome22.bam is read in and outputs chromosome22.bed
What I want to do is pass the stdout of the first process into the second so there is no need for the intermediate file.
So far I have this...
for x in 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,'X','Y':
subprocess.call("%s view -bh %s %s > %s/%s/%s.bam" % (samtools,bam,x,bampath,out,x), shell=True)
This makes the chromosome[1-22,X,Y].bam files. But can I avoid this and put another subprocess command in the same loop to convert them to bed files?
The command for bed conversion is:
bedpath/bedtools bamtobed -i [bamfile] > [bedfile]
Please have a look at the replacing shell pipeline example in the documentation.
output=$(dmesg | grep hda)
becomes:
p1 = Popen(["dmesg"], stdout=PIPE)
p2 = Popen(["grep", "hda"], stdin=p1.stdout, stdout=PIPE)
p1.stdout.close() # Allow p1 to receive a SIGPIPE if p2 exits.
output = p2.communicate()[0]
The explanation is:
The p1.stdout.close() call after starting the p2 is important in order for p1 to receive a SIGPIPE if p2 exits before p1.
No need to use python here. Much easier in shell. But essentially, it works the same as in python.
If bedtools can read from stdin, you can e.g. do
#!/bin/sh
for x in `seq 1 22` X Y; do
  $samtools view -bh $bam $x | $bedtools bamtobed > $bampath/$out/$x.bed
done
Depending on how bedtools was designed, you might also need to use -i - to have it read from stdin.
If you stick with Python, I strongly recommend learning how to do this:
without doing it all in the shell,
without producing shell commands that you need to escape properly to avoid errors.
subprocess is safer to use when you use the array-based syntax and no shell.
Make that two subprocess invocations, one for each command. See http://docs.python.org/library/subprocess.html#replacing-shell-pipeline for more details.
cmd1 = [samtools, "view", "-bh", bam, x]
cmd2 = [bedtools, "bamtobed"]
c1 = subprocess.Popen(cmd1, stdout=subprocess.PIPE)
c2 = subprocess.Popen(cmd2, stdin=c1.stdout, stdout=open(outputfilename, "w"))
c1.stdout.close()
c2.communicate()
Yes, you can use the pipe functionality. See if you can read from stdin for the bamtobed process ... if you can, try the following. This way you save on the disk IO time assuming the processing load is light.
SLIGHT modification:
proc1.stdout is now the stdin for the 2nd process.
proc1 = subprocess.Popen("%s view -bh %s %s" % (samtools, bam, x),
                         shell=True, stdout=subprocess.PIPE)
proc2 = subprocess.Popen("bedpath/bedtools bamtobed > %s" % (outFileName,),
                         shell=True, stdin=proc1.stdout)