Receiving NameError - how to fix? - python

I am working on a project and am having issues with the following code that I have written in nano:
from Bio import SeqIO
import sys
import re

fasta_file = (sys.argv[1])
for myfile in SeqIO.parse(fasta_file, "fasta"):
    if len(myfile) > 250:
        gene_id = myfile.id
        list = re.match('H149xcV\_\w+\_\w+\_\w+', gene_id)
        print (">"+list.group(1))
This is the error I receive when I execute my script on the command line:
File "mpo.py", line 7, in <module>
gene_id = myfile.id
NameError: name 'myfile' is not defined
I have a fasta file in the following format:
>H149xcV_Fge342_r3_h2_d1 len=210 path=[0:0-206]
ACTATACATGAGGAGAACATAGAACAAAAATGGGACCATAGATATATAACAATAGAAGATATAGAGAACACAATAGACAACTTATTAGGAAAGAGGTGTGTCGTCATGGAGCTGATGTTCGAGGATACTTTGCATGGTCATTCTTGGATAATTTTGAGTGGGCTATGGGATACACCAAGAGGTTTGGCATTGTTTATGTTGATTATAAGAACGGGC
>H149xcV_ytR1oP_r3_h2_d1 len=306 path=[0:0-207]
ATTAGAGTCTGAGAGAGTCTTGATTTGTCGTCGTCGAGAAATATAGGAGATCTGATTAGAGGAGAGAGCGGCCTAGGCGATGCGCGATATAGCGCTATATAGGCCTAGAGGAGAGTCTCTCTCTTTTAGAAGAGATAATATATATATATATATGGCTCTCCGGCGGGGCCGCGCGAGAGCTCGATCGATCGATATTAGCTGTACGATGCTAGCTAGCTTATATTCGATCGATTATAGCTTAGATCTCTCTCTAAAGGTCGATATCGCTTATGCGCGCGTATATCG
I would like to reformat my file so that it gives me only the unique gene IDs, and outputs only those IDs whose sequences are longer than 250 bp.
I would like my desired output to look like this:
>H149xcV_Fge342_r3_h2
>H149xcV_ytR1oP_r3_h2
>H149xcV_DPN78333_r3_h2
>H149xcV_AgV472_r3_h2
>H149xcV_DNP733_r3_h2

As suggested in the comments following your question, the parameter to match should be a string. The one thing I'll add is that Python has an r"" raw-string literal for regular expressions, which saves you from doubling up backslashes. Your code becomes this:
from Bio import SeqIO
import sys
import re

fasta_file = (sys.argv[1])
for myfile in SeqIO.parse(fasta_file, "fasta"):
    if len(myfile) > 250:
        gene_id = myfile.id
        list = re.match(r"H149xcV_\w+_\w+_\w+", gene_id)
        print(">" + list.group(0))
The underscore _ is not a special regular expression character, so it doesn't need to be escaped.
The match() function takes a regex and the string you are searching (so I added gene_id). Lastly, you want to output group(0): group(0) means the whole match, while group(1) would be the first capturing paren (of which you have none), so stick with group(0).
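One more thing: your desired output drops the final _d1 field and keeps only unique IDs, and neither happens automatically with the regex above. Here is a minimal sketch of one way to do both, assuming (based on your sample data) that the part you want is everything before the last underscore-delimited field:
from Bio import SeqIO
import sys

fasta_file = sys.argv[1]
seen = set()
for record in SeqIO.parse(fasta_file, "fasta"):
    if len(record) > 250:
        # drop the trailing field (e.g. "_d1") and de-duplicate
        gene_id = record.id.rsplit("_", 1)[0]
        if gene_id not in seen:
            seen.add(gene_id)
            print(">" + gene_id)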

Use the infoseq utility from the EMBOSS package and pipe the output (a table with sequence ids and lengths) through a one-liner in a scripting language of your choice. Here, I am using Perl for simplicity:
cat input.fasta | \
infoseq -auto -only -name -length stdin | \
perl -lane 'my ($name, $length) = @F; if ( !$seen{$name}++ && $length > 250 ) { print ">$name"; }' > output.fasta
Install EMBOSS, for example, using conda:
conda create --name emboss emboss
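For comparison, here is a rough pure-Python equivalent of the same filter, as a sketch assuming Biopython is installed and the input file is named input.fasta:
from Bio import SeqIO

seen = set()
with open("output.fasta", "w") as out:
    for rec in SeqIO.parse("input.fasta", "fasta"):
        # keep the first occurrence of each id, and only if longer than 250 bp
        if len(rec) > 250 and rec.id not in seen:
            seen.add(rec.id)
            out.write(">%s\n" % rec.id)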

Related

Python: Make List of Matching Patterns for Subprocess Call to pcregrep multiline

TLDR: Is there a clean way to make a list of entries for subprocess.check_output('pcregrep', '-M', '-e', pattern, file)?
I'm using python's subprocess.check_output() to call pcregrep -M. Normally I would separate results by calling splitlines() but since I'm looking for a multiline pattern, that won't work. I'm having trouble finding a clean way to create a list of the matching patterns, where each entry of the list is an individual matching pattern.
Here's a simple example file I'm pcregrep'ing:
module test_module(
    input wire in0,
    input wire in1,
    input wire in2,
    input wire cond,
    input wire cond2,
    output wire out0,
    output wire out1
);

assign out0 = (in0 & in1 & in2);
assign out1 = cond1 ? in1 & in2 :
              cond2 ? in1 || in2 :
              in0;
Here's (some of) my python code
#!/usr/bin/env python
import subprocess, re

output_str = subprocess.check_output(['pcregrep', '-M', '-e', "^\s*assign\\s+\\bout0\\b[^;]+;",
                                      "/home/<username>/pcregrep_file.sv"]).split(';')

# Print out the matches
for idx, line in enumerate(output_str):
    print "output_str[%d] = %s" % (idx, line)

# Clear out the whitespace list entries
output_str = [line for line in output_str if re.match(r'\S+', line)]
Here is the output
output_str[0] =
assign out0 = in0 & in1 & in2
output_str[1] =
assign out1 = cond1 ? in1 & in2 :
cond2 ? in1 || in2 :
in0
output_str[2] =
It would be nice if I could do something like
output_list = subprocess.check_output('pcregrep', -M, -e, <pattern>, <file>).split(<multiline_delimiter>)
without creating garbage to clean up (whitespace list entries), or even to have a delimiter to split() on that is independent of the pattern.
Is there a clean way to create a list of the matching multiline patterns?
Per Casimir et Hippolyte's comment, and the very helpful post, How do I re.search or re.match on a whole file without reading it all into memory?, I read in the file using re instead of an external call to pcregrep and used re.findall(pattern, file, re.MULTILINE)
Full solution (which only slightly modifies the referenced post)
#!/usr/bin/env python
import re, mmap

filename = "/home/<username>/pcregrep_file.sv"
with open(filename, 'r+') as f:
    data = mmap.mmap(f.fileno(), 0)
    output_str = re.findall(r'^\s*assign\s+\bct_ela\b[^;]+;', data, re.MULTILINE)

for i, l in enumerate(output_str):
    print "output_str[%d] = '%s'" % (i, l)
which creates the desired list.
Don't do that. If you can't use the Python regular expression module for some reason, just use the Python bindings for pcre.

How to find parenthesis bound strings in python

I'm learning Python and wanted to automate one of my assignments in a cybersecurity class.
I'm trying to figure out how I would look for the contents of a file that are bound by a set of parenthesis. The contents of the (.txt) file look like:
cow.jpg : jphide[v5](asdfl;kj88876)
fish.jpg : jphide[v5](65498ghjk;0-)
snake.jpg : jphide[v5](poi098*/8!##)
test_practice_0707.jpg : jphide[v5](sJ*=tT#&Ve!2)
test_practice_0101.jpg : jphide[v5](nKFdFX+C!:V9)
test_practice_0808.jpg : jphide[v5](!~rFX3FXszx6)
test_practice_0202.jpg : jphide[v5](X&aC$|mg!wC2)
test_practice_0505.jpg : jphide[v5](pe8f%yC$V6Z3)
dog.jpg : negative`
And here is my code so far:
import sys, os, subprocess, glob, shutil

# Finding the .jpg files that will be copied.
sourcepath = os.getcwd() + '\\imgs\\'
destpath = 'stegdetect'
rawjpg = glob.glob(sourcepath + '*.jpg')

# Copying the said .jpg files into the destpath variable
for filename in rawjpg:
    shutil.copy(filename, destpath)

# Asks user for what password file they want to use.
passwords = raw_input("Enter your password file with the .txt extension:")
shutil.copy(passwords, 'stegdetect')

# Navigating to stegdetect. Feel like this could be abstracted.
os.chdir('stegdetect')

# Preparing the arguments then using subprocess to run
args = "stegbreak.exe -r rules.ini -f " + passwords + " -t p *.jpg"

# Uses open to open the output file, and then write the results to the file.
with open('cracks.txt', 'w') as f:  # opens cracks.txt and prepares to w
    subprocess.call(args, stdout=f)

# Processing whats in the new file.
f = open('cracks.txt')
If the text should just be bound by ( and ), you can use the following regex, which requires an opening ( and a closing ) and allows letters, digits, and spaces between them. You can add to the character class any other symbols you want to include.
[\(][a-z A-Z 0-9]*[\)]
[\(] - matches the opening bracket
[a-z A-Z 0-9]* - the text inside the brackets
[\)] - matches the closing bracket
So for the input sdfsdfdsf(sdfdsfsdf)sdfsdfsdf, the output will be (sdfdsfsdf).
Test this regex here: https://regex101.com/
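In Python that might look like the following quick sketch (using re.search rather than re.match, since the brackets can appear anywhere in the line):
import re

m = re.search(r'[\(][a-z A-Z 0-9]*[\)]', 'sdfsdfdsf(sdfdsfsdf)sdfsdfsdf')
if m:
    print(m.group(0))  # (sdfdsfsdf)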
I'm learning Python
If you are learning you should consider alternative implementations, not only regexps.
To iterate line by line over a text file, you just open the file and loop over the file handle:
with open('file.txt') as f:
    for line in f:
        do_something(line)
Each line is a string with the line contents, including the end-of-line char '\n'. To find the start index of a specific substring in a string you can use find:
>>> A = "hello (world)"
>>> A.find('(')
6
>>> A.find(')')
12
To get a substring from the string you can use the slice notation in the form:
>>> A[6:12]
'(world'
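Putting the two together, you can slice out just the text between the parentheses (a small sketch; the +1 skips the opening bracket itself):
>>> A[A.find('(') + 1 : A.find(')')]
'world'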
You should use regular expressions, which are implemented in the Python re module.
A simple regex like \(.*\) could match your "parenthesis string",
but it would be better with a group, \((.*)\), which lets you get only the content inside the parentheses.
import re

test_string = """cow.jpg : jphide[v5](asdfl;kj88876)
fish.jpg : jphide[v5](65498ghjk;0-)
snake.jpg : jphide[v5](poi098*/8!##)
test_practice_0707.jpg : jphide[v5](sJ*=tT#&Ve!2)
test_practice_0101.jpg : jphide[v5](nKFdFX+C!:V9)
test_practice_0808.jpg : jphide[v5](!~rFX3FXszx6)
test_practice_0202.jpg : jphide[v5](X&aC$|mg!wC2)
test_practice_0505.jpg : jphide[v5](pe8f%yC$V6Z3)
dog.jpg : negative`"""

REGEX = re.compile(r'\((.*)\)', re.MULTILINE)
print(REGEX.findall(test_string))
# ['asdfl;kj88876', '65498ghjk;0-', 'poi098*/8!##', 'sJ*=tT#&Ve!2', 'nKFdFX+C!:V9', '!~rFX3FXszx6', 'X&aC$|mg!wC2', 'pe8f%yC$V6Z3']

How to remove all-N sequence entries from fasta file(s)

I would like to remove entries of a fasta file where all nucleotides are N, but not entries which contain ACGT and N nucleotides.
From an example of the input file content:
#>seq_1
TGCTAGCTAGCTGATCGTGTCGATCG
CACCACANNNNNCACGTGTCG
#>seq2
NNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNN
#>seq3
catgcatcgacgatgctgacgatc
#>seq4
cacacaccNNNNttgtgca
#...
I'm hoping for the output file content to be:
#>seq_1
TGCTAGCTAGCTGATCGTGTCGATCG
CACCACANNNNNCACGTGTCG
#>seq3
catgcatcgacgatgctgacgatc
#>seq4
cacacaccNNNNttgtgca
#...
Any suggestions for doing this with awk, perl, python, or other tools?
Thank you!
FDS
With GNU awk
awk -v RS='#>seq[[:digit:]]+' '!/^[N\n]+$/{printf "%s",term""$0}; {term=RT}' file
#>seq_1
TGCTAGCTAGCTGATCGTGTCGATCG
CACCACANNNNNCACGTGTCG
#>seq3
catgcatcgacgatgctgacgatc
#>seq4
cacacaccNNNNttgtgca
With Python, using the BioPython module:
import Bio.SeqIO

INPUT = "bio_input.fas"
OUTPUT = "bio_output.fas"

def main():
    records = Bio.SeqIO.parse(INPUT, 'fasta')
    # keep records that contain at least one non-N character
    filtered = (rec for rec in records if any(ch != 'N' for ch in rec.seq))
    Bio.SeqIO.write(filtered, OUTPUT, 'fasta')

if __name__ == "__main__":
    main()
However, note that the FASTA spec says sequence ids are supposed to start with >, not #>!
Run against
>seq_1
TGCTAGCTAGCTGATCGTGTCGATCG
CACCACANNNNNCACGTGTCG
>seq2
NNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNN
>seq3
catgcatcgacgatgctgacgatc
>seq4
cacacaccNNNNttgtgca
this produces
>seq_1
TGCTAGCTAGCTGATCGTGTCGATCGCACCACANNNNNCACGTGTCG
>seq3
catgcatcgacgatgctgacgatc
>seq4
cacacaccNNNNttgtgca
(Note, the default linewrap length is 60 chars).
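If the wrapping matters, newer Biopython releases (1.71 and later, if I remember correctly) also accept the 'fasta-2line' format name, which writes each record on exactly two lines with no wrapping. The change to the script above would be one line:
Bio.SeqIO.write(filtered, OUTPUT, 'fasta-2line')  # no line wrapping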
So essentially blocks start with a #> marker, and you wish to remove blocks where no line contains anything other than N. One way in Python:
#! /usr/bin/env python
import fileinput, sys, re

block = []
nonN = re.compile('[^N\n]')
for line in fileinput.input():
    if line.startswith('#>'):
        if len(block) == 1 or any(map(nonN.search, block[1:])):
            sys.stdout.writelines(block)
        block = [line]
    else:
        block.append(line)
if len(block) == 1 or any(map(nonN.search, block[1:])):
    sys.stdout.writelines(block)
In Python, using a regex:
#!/usr/bin/env python
import re

ff = open('test', 'r')
data = ff.read()
ff.close()

m = re.compile(r'(#>seq\d+[N\n]+)$', re.M)
f = re.sub(m, '', data)

fo = open('out', 'w')
fo.write(f)
fo.close()
and you will get in your out file:
#>seq_1
TGCTAGCTAGCTGATCGTGTCGATCG
CACCACANNNNNCACGTGTCG
#>seq3
catgcatcgacgatgctgacgatc
#>seq4
cacacaccNNNNttgtgca
#...
hope this helps.
With the shell command egrep (grep + regex):
egrep -B 1 "^[^NNNN && ^#seq]" your.fa > convert.fa
# -B 1 means print the matching row and the previous row
# ^[^NNNN && ^#seq] is a regex pattern intended to exclude rows that
# begin with NNNN or #seq
so it only matches rows that begin with an ordinary A/T/G/C sequence, plus the previous row, which is the fasta header.

Python: How to extract floating point numbers from a text file with mixed content?

I have a tab-delimited text file with the following data:
ahi1
b/se
ahi
test -2.435953
1.218364
ahi2
b/se
ahi
test -2.001858
1.303935
I want to extract the two floating point numbers to a separate csv file with two columns, i.e.
-2.435953 1.218264
-2.001858 1.303935
Currently my hack attempt is:
import csv
from itertools import islice
results = csv.reader(open('test', 'r'), delimiter="\n")
list(islice(results,3))
print results.next()
print results.next()
list(islice(results,3))
print results.next()
print results.next()
Which is not ideal. I am a Noob to Python so I apologise in advance and thank you for your time.
Here is the code to do the job:
import re

# this is the same data just copy/pasted from your question
data = """ ahi1
b/se
ahi
test -2.435953
1.218364
ahi2
b/se
ahi
test -2.001858
1.303935"""

# What we're going to do is search through it line by line
# and parse out the numbers using regular expressions.
# The pattern first skips any run of characters that aren't
# digits or '-' ([^-\d]; the ^ means NOT), then captures 0 or 1
# dashes ('-') followed by one or more digits, a dot, and digits
# again ([\-]{0,1}\d+\.\d+), and then skips trailing non-digits.
pattern = re.compile(r"[^-\d]*([\-]{0,1}\d+\.\d+)[^-\d]*")

results = []
for line in data.split("\n"):
    match = pattern.match(line)
    if match:
        results.append(match.groups()[0])

pairs = []
i = 0
end = len(results)
while i < end - 1:
    pairs.append((results[i], results[i+1]))
    i += 2

for p in pairs:
    print "%s, %s" % (p[0], p[1])
The output:
>>>
-2.435953, 1.218364
-2.001858, 1.303935
Instead of printing out the numbers, you could save them in a list and zip them together afterwards.
I'm using the Python regular expression framework to parse the text. I can only recommend picking up regular expressions if you don't already know them. I find them very useful for parsing text and all sorts of machine-generated output files.
EDIT:
Oh and BTW, if you're worried about performance: I tested on my slow old 2 GHz IBM T60 laptop, and I can parse a megabyte in about 200 ms using the regex.
UPDATE:
I felt kind, so I did the last step for you :P
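As an aside, that pairing step can be written more compactly with slicing and zip; a one-line sketch, assuming results holds the matched number strings in order:
pairs = zip(results[::2], results[1::2])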
Maybe this can help:
zip(*[results]*5)
e.g.
import csv
from itertools import izip

results = csv.reader(open('test', 'r'), delimiter="\t")
for result1, result2 in (x[3:5] for x in izip(*[results]*5)):
    ...  # do something with the result
Tricky enough, but a more eloquent and sequential solution:
$ grep -v "ahi" myFileName | grep -v se | tr -d "test\" " | awk 'NR%2{printf $0", ";next;}1'
-2.435953, 1.218364
-2.001858, 1.303935
How it works: basically, remove specific text lines, then remove unwanted text within lines, then join every second line with formatting. I just added the comma for beautification purposes. Leave the comma out of awk's printf ", " if you don't need it.

Bash/Python : open url & print top 10 words

I need to extract the 10 most frequent words from a text using a pipe (and any additional python scripts as needed); output being a block of all-caps words separated by a space.
This pipe needs to extract text from any external file: I've managed to get it to work on .txt files, but I also need to be able to input a URL and have it do the same thing with that.
I have the following code:
alias words="tr a-z A-Z | tr -cs A-Z ' ' | tr ' ' '\012' | sort -n | uniq -c |
sort -r | head -n 10 | awk '{printf \"%s \", \$2}END{print \"\"}'" (on one line)
which, with cat hamlet.txt | words gives me:
TO THE AND A 'TIS THAT OR OF IS
To make it more complicated, I need to exclude any 'function' words: these are 'non-lexical' words like 'a', 'the', 'of', 'is', any pronouns (I, you, him), and any prepositions (there, at, from).
I need to be able to type htmlstrip http://www.google.com.au | words and have it print out like the above.
For the URL-opening:
The python script I'm trying to figure out (let's call it htmlstrip) strips any tags from the text, leaving only 'human readable' text. This should be able to open any given URL, but I can't figure out how to get this to work.
What I have so far:
import re
import urllib2

filename = raw_input('File name: ')
filehandle = open(filename)
html = filehandle.read()

f = urllib2.urlopen('http://') #???
print f.read()

text = [ ]
inTag = False
for ch in html:
    if ch == '<':
        inTag = True
    if not inTag:
        text.append(ch)
    if ch == '>':
        inTag = False
print ''.join(text)
I know this is both incomplete and probably incorrect - any guidance would really be appreciated.
You can use scrape.py and regular expressions like this:
#!/usr/bin/env python
from scrape import s
import sys, re

if len(sys.argv) < 2:
    print "Usage: words.py url"
    sys.exit(0)

s.go(sys.argv[1])                # fetch content
text = s.doc.text                # extract readable text
text = re.sub("\W+", " ", text)  # remove all non-word characters and repeated whitespace
print text
And then just:
./words.py http://whatever.com
Use re.sub for this:
import re
text = re.sub(r"<.+>", " ", html)
For special cases such as scripts, you can include a regex such as:
<script.*>.*</script>
UPDATE: Sorry, I just read the comment about pure Python without any additional modules. Yes, in this situation I think re will be the best way.
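Putting the two suggestions together, a sketch of the stripping step might look like this (the non-greedy .*? and the inline (?is) flags are my additions; the greedy patterns quoted above can swallow text between tags on the same line):
import re

html = open('page.html').read()  # or whatever holds your fetched HTML
html = re.sub(r"(?is)<script.*?>.*?</script>", " ", html)  # drop scripts and their contents
text = re.sub(r"<[^>]+>", " ", html)                       # drop remaining tags
print text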
Maybe it'll be easier and more correct to use pycURL rather than removing tags with re?
from StringIO import StringIO
import pycurl

url = 'http://www.google.com/'

storage = StringIO()
c = pycurl.Curl()
c.setopt(c.URL, url)
c.setopt(c.WRITEFUNCTION, storage.write)
c.perform()
c.close()

content = storage.getvalue()
print content
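None of the snippets above does the actual counting, so for completeness, here is a sketch of the top-10 step in Python, assuming text holds the stripped text from one of the answers above and STOPWORDS is a stand-in for whatever function-word list you settle on:
import re
from collections import Counter

STOPWORDS = set(['A', 'THE', 'OF', 'IS', 'TO', 'AND', 'OR', 'I', 'YOU', 'HIM', 'THERE', 'AT', 'FROM'])
words = re.findall(r"[A-Z']+", text.upper())
counts = Counter(w for w in words if w not in STOPWORDS)
print ' '.join(w for w, n in counts.most_common(10))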
