Trouble calling EMBOSS program from Python

I am having trouble calling sixpack, a command-line EMBOSS program, from Python.
I am running Python 3.23 on Windows 7 with Biopython 1.59 and EMBOSS 6.4.0.4. Sixpack translates a DNA sequence in all six reading frames and creates two output files: a sequence file identifying ORFs and a file containing the protein sequences.
There are three required arguments, which I can call successfully from the command line: -sequence [input file], -outseq [output sequence file], and -outfile [protein sequence file]. I am using the subprocess module in place of os.system because I have read that it is more powerful and versatile.
The following is my Python code, which runs without error but does not produce the desired output files.
from Bio import SeqIO
import re
import os
import subprocess
infile = input('Full path to EXISTING .fasta file would you like to open: ')
outdir = input('NEW Directory to write outfiles to: ')
os.mkdir(outdir)
for record in SeqIO.parse(infile, "fasta"):
    print("Translating (6-Frame): " + record.id)
    ident=re.sub("\|", "-", record.id)
    print (infile)
    print ("Old record ID: " + record.id)
    print ("New record ID: " + ident)
    subprocess.call (['C:\memboss\sixpack.exe', '-sequence ' + infile, '-outseq ' + outdir + ident + '.sixpack', '-outfile ' + outdir + ident + '.format'])
    print ("Translation of: " + infile + "\nWritten to: " + outdir + ident)

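Found the answer: I was using the wrong syntax to call subprocess. Each flag and its value must be a separate element of the argument list, rather than one combined string such as '-sequence ' + infile. This is the correct syntax: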
subprocess.call (['C:\memboss\sixpack.exe', '-sequence', infile, '-outseq', outdir + ident + '.sixpack', '-outfile', outdir + ident + '.format'])
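The corrected call can be sketched as a small helper that builds the argument list. This is a minimal sketch, not the original poster's code: `build_sixpack_command` is a hypothetical helper name, and joining the output directory with `os.path.join` is an added assumption (the original concatenates `outdir + ident` directly, which only works if `outdir` already ends with a path separator).

```python
import os


def build_sixpack_command(exe, infile, outdir, ident):
    """Return the sixpack argument list, one list element per flag or value.

    Hypothetical helper for illustration; the flags are the three required
    arguments named in the question.
    """
    return [
        exe,
        "-sequence", infile,
        "-outseq", os.path.join(outdir, ident + ".sixpack"),
        "-outfile", os.path.join(outdir, ident + ".format"),
    ]


# Example values are placeholders, not paths from a real run.
cmd = build_sixpack_command(r"C:\memboss\sixpack.exe", "in.fasta", "out", "seq1")
print(cmd)
# To launch the program: subprocess.call(cmd)
```

Building the list separately also makes it easy to print the exact arguments before running them.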

Related

Create a .exe that takes input

I'm trying to convert a Python .py function (that takes two inputs) into a .exe file, which I then execute from Python.
The function looks like this and works perfectly:
def ncd(File1, FileN):
    ..do something..
    return
To convert this to a .exe I'm using PyInstaller from the command line, and that step also works fine.
But when I execute the .exe from Python, it doesn't receive the two input files that the function needs.
import subprocess
File1 = "Cam0001.dat"
FileN = "Cam1000.dat"
exe_ncd = "ncd.exe"
subprocess.run("\"" + exe_ncd + "\"" + " " + "\"" + File1 + "\"" + " " + "\"" + FileN + "\"", shell=True)
How can I do it? Thanks a lot :)
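One possible cause of this symptom is that the script bundled into the executable only defines `ncd` and never reads its command-line arguments. A minimal sketch, assuming the bundled script is the program's entry point and that `ncd` takes two file paths; the function body is a placeholder, not the original implementation, and `main` is a hypothetical name.

```python
def ncd(file1, file_n):
    # Placeholder body; the real computation from the question goes here.
    return "ncd({}, {})".format(file1, file_n)


def main(argv):
    # argv[0] is the program name; the two input files follow it.
    if len(argv) != 3:
        raise SystemExit("usage: ncd <File1> <FileN>")
    return ncd(argv[1], argv[2])


# In the bundled script, the last lines would be:
#     if __name__ == "__main__":
#         import sys
#         print(main(sys.argv))
# Here the argument vector is simulated so the sketch runs on its own.
result = main(["ncd.exe", "Cam0001.dat", "Cam1000.dat"])
print(result)
```

The caller can then pass the inputs as list elements, for example `subprocess.run([exe_ncd, File1, FileN])`, which avoids the manual quoting and `shell=True`.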

python subprocess.Popen is creating a duplicate file and writing to the duplicate instead of the original

This is my code:
outputFile = 'C:\\myfileslog\\log\\' + timestr + str(myrandam) + '_' + "out.res"
binaryExe = 'C:\\Users\\xxx\\Desktop\\test\\test2.9.exe'
input = binaryExe + ' -o ' + outputFile + [inputParameters]
subprocess.Popen(input ,bufsize=65,stdout=subprocess.PIPE, stderr=subprocess.PIPE, universal_newlines=True)
While running the above code, I am getting 2 files created, as below:
1) 20181004-124704_0.45529096783117506_out.res
2) 20181004-124704_0.45529096783117506_out.ÿÿÿÿCPI
I have already tried giving stdout=outputFile, which does not write anything to the file. Please help; I don't want the duplicate file to be created, as it is causing issues.
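On the `stdout=outputFile` attempt: `subprocess.Popen` expects `stdout` to be an open file object (or a file descriptor), not a path string. A minimal sketch, assuming the goal is to capture the child's standard output in a file. The Python interpreter stands in for the real executable, and the output path is hypothetical. This does not explain the second, oddly named file, which would depend on how the executable itself handles `-o`.

```python
import os
import subprocess
import sys
import tempfile

# Hypothetical output location for the demonstration.
out_path = os.path.join(tempfile.mkdtemp(), "out.res")

# Stand-in child process; replace with the real executable and its arguments.
cmd = [sys.executable, "-c", "print('hello from child')"]

with open(out_path, "w") as out_file:
    # Pass the open file object, not the path string.
    proc = subprocess.Popen(cmd, stdout=out_file, stderr=subprocess.PIPE,
                            universal_newlines=True)
    _, err = proc.communicate()

with open(out_path) as f:
    captured = f.read()
print(captured)
```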

Calling bash command inside Python returns error, but works in terminal

Here is the excerpt of my code related to this:
def grd_commands(directory):
    for filename in os.listdir(directory)[1:]:
        print filename
        new_filename = ''
        first_letter = ''
        second_letter = ''
        bash_command = 'gmt grdinfo ' + filename + ' -I-'
        print bash_command
        coordinates = Popen(bash_command, stdout=PIPE, shell=True)
        coordinates = coordinates.communicate()
        latlong = re.findall(r'^\D*?([-+]?\d+)\D*?[-+]?\d+\D*?([-+]?\d+)', coordinates)
        if '-' in latlong[1]:
            first_letter = 'S'
        else:
            first_letter = 'N'
        if '-' in latlong[0]:
            second_letter = 'W'
        else:
            second_letter = 'E'
        new_filename = first_letter + str(latlong[1]) + second_letter + str(latlong[0]) + '.grd'
        Popen('gmt grdconvert ' + str(filename) + ' ' + new_filename, shell=True)
filename is the name of the file that is being passed to the function. When I run my code, I receive this error:
/bin/sh: gmt: command not found
Traceback (most recent call last):
  File "/Users/student/Desktop/Code/grd_commands.py", line 38, in <module>
    main()
  File "/Users/student/Desktop/Code/grd_commands.py", line 10, in main
    grd_commands(directory)
  File "/Users/student/Desktop/Code/grd_commands.py", line 23, in grd_commands
    latlong = re.findall(r'^\D*?([-+]?\d+)\D*?[-+]?\d+\D*?([-+]?\d+)', coordinates).split('\n')
  File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/re.py", line 177, in findall
    return _compile(pattern, flags).findall(string)
TypeError: expected string or buffer
If I print the string bash_command and enter it into the terminal, it works as expected. Why doesn't it work when called from my Python script?
The entire command line is being treated as a single command name. You need to either use shell=True to have the shell parse it as a command line:
coordinates = Popen(bash_command, stdout=PIPE, shell=True)
or preferably store the command name and its arguments as separate elements of a list:
bash_command = ['gmt', 'grdinfo', filename, '-I-']
coordinates = Popen(bash_command, stdout=PIPE)
Popen takes a list of arguments. The documentation warns against using shell=True:
"Passing shell=True can be a security hazard if combined with untrusted input."
Try this:
from subprocess import Popen, PIPE
bash_command = 'gmt grdinfo ' + filename + ' -I-'
print(bash_command)
coordinates = Popen(bash_command.split(), stdout=PIPE)
print(coordinates.communicate()[0])
Ensure gmt is installed in a location specified by PATH in your /etc/environment file:
PATH=$PATH:/path/to/gmt
Alternatively, specify the path to gmt in bash_command:
bash_command = '/path/to/gmt grdinfo ' + filename + ' -I-'
You should be able to find the path with:
which gmt
As other people have suggested, an actual list is the best approach instead of a string. With a list and no shell=True, each element is passed to the program verbatim, so a filename containing spaces needs no escaping. Escaping spaces with a '\' is only needed when the command is built as a single string for the shell.
for filename in os.listdir(directory)[1:]:
    bash_command = ['gmt', 'grdinfo', filename, '-I-']
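The `TypeError` in the traceback has a separate cause from the missing `gmt`: `communicate()` returns a `(stdout, stderr)` tuple, while `re.findall` needs a string. A minimal sketch, assuming Python 3; `run_and_capture` is a hypothetical helper, and the Python interpreter stands in for `gmt` so that the example runs without GMT installed. `shutil.which` is used to check that the program can be found before launching it.

```python
import shutil
import subprocess
import sys


def run_and_capture(args):
    """Run a command given as a list and return its stdout as text."""
    if shutil.which(args[0]) is None:
        raise FileNotFoundError("not found on PATH: " + args[0])
    proc = subprocess.Popen(args, stdout=subprocess.PIPE)
    stdout_bytes, _ = proc.communicate()  # a (stdout, stderr) tuple
    return stdout_bytes.decode()


# Stand-in for: run_and_capture(['gmt', 'grdinfo', filename, '-I-'])
text = run_and_capture([sys.executable, "-c", "print('sample output')"])
print(text)
```

The returned string, rather than the tuple, is what would be passed to `re.findall`.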

Can't get output from Tesseract command run through os.system

I've created a function which loops over images and gets the orientation from the image with the tesseract library. The code looks like this:
def fix_incorrect_orientation(pathName):
    for filename in os.listdir(pathName):
        tesseractResult = str(os.system('tesseract ' + pathName + '/' + filename + ' - -psm 0'))
        print('tesseractResult: ' + tesseractResult)
        regexObj = re.search('([Orientation:]+[\s][0-9]{1})',tesseractResult)
        if regexObj:
            orientation = regexObj.groups(0)[0]
            print('orientation123: ' + str(orientation))
        else:
            print('Not getting in the Regex.')
The result from the variable tesseractResult is always 0 though. But in the terminal I will get the following result from the command:
Orientation: 3
Orientation in degrees: 90
Orientation confidence: 19.60
Script: 1
Script confidence: 21.33
I've tried catching the output from os.system in multiple ways, such as with Popen and subprocess, but without any success. It seems that I can't catch the output from the tesseract library.
So, how exactly should I do this?
Thanks,
Yenthe
Literally 10 minutes after asking the question I found a way. First import commands:
import commands
And then the following code will do the trick:
def fix_incorrect_orientation(pathName):
    for filename in os.listdir(pathName):
        tesseractResult = str(commands.getstatusoutput('tesseract ' + pathName + '/' + filename + ' - -psm 0'))
        print('tesseractResult: ' + tesseractResult)
        regexObj = re.search('([Orientation:]+[\s][0-9]{1})',tesseractResult)
        if regexObj:
            orientation = regexObj.groups(0)[0]
            print('orientation123: ' + str(orientation))
        else:
            print('Not getting in the Regex.')
This runs the command through the commands library, and getstatusoutput returns both the exit status and the captured output, whereas os.system returns only the exit status.
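The `commands` module exists only in Python 2; in Python 3 the same capture is done with `subprocess`. A minimal sketch, assuming the tool prints lines like the sample output in the question; `parse_orientation` is a hypothetical helper, and the `tesseract` invocation appears only in a comment, with its arguments copied from the question.

```python
import re


def parse_orientation(text):
    """Return the number from the 'Orientation: N' line, or None if absent."""
    match = re.search(r"^Orientation:\s*(\d+)", text, re.MULTILINE)
    return int(match.group(1)) if match else None


sample = (
    "Orientation: 3\n"
    "Orientation in degrees: 90\n"
    "Orientation confidence: 19.60\n"
)
print(parse_orientation(sample))

# With the real tool, the text could come from something like:
#   result = subprocess.run(['tesseract', image_path, '-', '-psm', '0'],
#                           stdout=subprocess.PIPE, stderr=subprocess.STDOUT,
#                           universal_newlines=True)
#   parse_orientation(result.stdout)
```

Anchoring the pattern on the exact label avoids matching the "Orientation in degrees" and "Orientation confidence" lines.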

Python - Readline skipping characters

I ran into a curious problem while parsing json objects in large text files, and the solution I found doesn't really make much sense. I was working with the following script. It copies bz2 files, unzips them, then parses each line as a json object.
import os, sys, json
# =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
# USER INPUT
# =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
args = sys.argv
extractDir = outputDir = ""
if (len(args) >= 2):
    extractDir = args[1]
else:
    extractDir = raw_input('Directory to extract from: ')
if (len(args) >= 3):
    outputDir = args[2]
else:
    outputDir = raw_input('Directory to output to: ')
# =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
# RETRIEVE FILE
# =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
tweetModel = [u'id', u'text', u'lang', u'created_at', u'retweeted', u'retweet_count', u'in_reply_to_user_id', u'coordinates', u'place', u'hashtags', u'in_reply_to_status_id']
filenames = next(os.walk(extractDir))[2]
for file in filenames:
    if file[-4:] != ".bz2":
        continue
    os.system("cp " + extractDir + '/' + file + ' ' + outputDir)
    os.system("bunzip2 " + outputDir + '/' + file)
    # =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
    # PARSE DATA
    # =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
    input = open (outputDir + '/' + file[:-4], 'r')
    output = open (outputDir + '/p_' + file[:-4], 'w+')
    for line in input.readlines():
        try:
            tweet = json.loads(line)
            for field in enumerate(tweetModel):
                if tweet.has_key(field[1]) and tweet[field[1]] != None:
                    if field[0] != 0:
                        output.write('\t')
                    fieldData = tweet[field[1]]
                    if not isinstance(fieldData, unicode):
                        fieldData = unicode(str(fieldData), "utf-8")
                    output.write(fieldData.encode('utf8'))
                else:
                    output.write('\t')
        except ValueError as e:
            print ("Parse Error: " + str(e))
            print line
            line = input.readline()
            quit()
            continue
        print "Success! " + str(len(line))
        input.flush()
        output.write('\n')
    # =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
    # REMOVE OLD FILE
    # =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
    os.system("rm " + outputDir + '/' + file[:-4])
While reading in certain lines in the for line in input.readlines(): loop, the lines would occasionally be truncated at inconsistent locations. Since the newline character was truncated as well, it would keep reading until it found the newline character at the end of the next json object. The result was an incomplete json object followed by a complete json object, all considered one line by the parser. I could not find the reason for this issue, but I did find that changing the loop to
filedata = input.read()
for line in filedata.splitlines():
worked. Does anyone know what is going on here?
After looking at the source code for file.readlines and string.splitlines, I think I see what's up. Note: this is the Python 2.7 source code, so if you're using another version this answer may or may not apply.
readlines uses the function Py_UniversalNewlineFread to test for a newline, while splitlines uses a constant, STRINGLIB_ISLINEBREAK, that just tests for \n or \r. I suspect Py_UniversalNewlineFread is picking up some character in the file stream as a line break when it isn't intended as one, possibly because of the encoding; I don't know. But when you dump all of that same data into a string, splitlines checks it only against \r and \n, finds no match, and moves on until the real line break is encountered, so you get your intended line.
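Independent of the cause, the copy, `bunzip2`, and `readlines()` steps can be avoided by reading the compressed file line by line. A minimal sketch, assuming Python 3 and one JSON object per line; `iter_json_lines` is a hypothetical helper, and this is not the original poster's method. Passing `newline="\n"` tells the text layer to treat only `\n` as a line terminator.

```python
import bz2
import json
import os
import tempfile


def iter_json_lines(path):
    """Yield one parsed JSON object per non-empty line of a .bz2 file."""
    with bz2.open(path, "rt", encoding="utf-8", newline="\n") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


# Build a small throwaway file so the sketch runs on its own.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.bz2")
with bz2.open(demo_path, "wt", encoding="utf-8") as handle:
    handle.write('{"id": 1, "text": "a"}\n{"id": 2, "text": "c"}\n')

tweets = list(iter_json_lines(demo_path))
print(len(tweets))
```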
