How to obtain overlapping patterns with Python finditer?

I'm using a Python script that searches for a pattern in a FASTA file. It works very well, but it does not return overlapping strings, and I'm interested in potential overlapping strings. Since I'm not a programmer (I'm just trying to learn Python), I was wondering if someone could modify the script to find overlapping strings. I think the regex module could do it, but I tried to install it on my computer (Windows) without success. I got this:
C:\Python33\regex-2014.02.19>python setup.py install
running install
running build
running build_py
running build_ext
building '_regex' extension
error: Unable to find vcvarsall.bat
For me it would be easier to work with a modified script. So here is my script:
import re
import sys

psq_re_f = re.compile('G{3,}.{1,7}?G{3,}.{1,7}?G{3,}.{1,7}?G{3,}')  # ((?<=G)[^G]|(?<!G).)
psq_re_r = re.compile('C{3,}.{1,7}?C{3,}.{1,7}?C{3,}.{1,7}?C{3,}')  # ((?<=C)[^C]|(?<!C).)

filename = input('Enter the name of the input fasta file: ')
ref_seq_fh = open(filename)
outputfileg = open("strelkaindels_quadg.txt", 'wt')
outputfilec = open("strelkaindels_quadc.txt", 'wt')
outputfileg.write('#\tID\tEntry Length\tStart\tEnd\tLength\tStrand\tSequence\n')
outputfilec.write('#\tID\tEntry Length\tStart\tEnd\tLength\tStrand\tSequence\n')

count = 0
ref_seq = []
line = ref_seq_fh.readline().strip()
chr = re.sub('^>', '', line)
chr1 = chr.split(":")
#line = ref_seq_fh.readline().strip()
while True:
    while line.startswith('>') is False:
        ref_seq.append(line)
        line = ref_seq_fh.readline().strip()
        if line == '':
            break
    ref_seq = ''.join(ref_seq)
    for m in re.finditer(psq_re_f, ref_seq):
        count = count + 1
        outputfileg.write('%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s' % (count, chr1[0], len(ref_seq), m.start(), m.end(), len(m.group(0)), '+', m.group(0)) + '\n')
    for m in re.finditer(psq_re_r, ref_seq):
        count = count + 1
        outputfilec.write('%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s' % (count, chr1[0], len(ref_seq), m.start(), m.end(), len(m.group(0)), '-', m.group(0)) + '\n')
    chr = re.sub('^>', '', line)
    chr1 = chr.split(":")
    ref_seq = []
    line = ref_seq_fh.readline().strip()
    if line == '':
        break

outputfileg.close()
outputfilec.close()
Finally, here is an example of a FASTA file (a text-based format for representing nucleotide sequences, widely used in biology):
>id_1
agatagatgatagatatagagagcgcgctagatcgatcgatcgagtcgatcgcgcggggggcccctctctctctatagggacatacga
>id_2
agacatcagatacagagatatttacataacaagagatacag
>id_3
cgctctagctcctcctctcgcgtagctagctctctctaacatgattagaattcagatcgatcgatcgatggttttttttctctct
and so on...
For example, let's imagine the following sequence:
GGGTGGGTGGGCGGGAGGG
The script will return only this string:
GGGTGGGTGGGCGGG
But I would like to get this one too:
GGGTGGGCGGGAGGG

You could try using a positive lookahead:
(?=(G{3,}.{1,7}?G{3,}.{1,7}?G{3,}.{1,7}?G{3,}))
regex101 demo
And in your code, you will have to change your groups to .group(1). Since the lookahead match itself is zero-width, m.end() will be the same as m.start(), so you might have to work around that a bit, maybe by using len():
for m in re.finditer(psq_re_f, ref_seq):
    count = count + 1
    outputfileg.write('%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s' % (count,
        chr1[0], len(ref_seq), m.start(),
        m.start() + len(m.group(1)), len(m.group(1)),
        '+', m.group(1)) + '\n')
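To see why this works, here is a minimal, self-contained sketch using the example sequence from the question. Because the lookahead consumes no characters, finditer resumes scanning one position later and reports the overlapping hit as well:

import re

seq = 'GGGTGGGTGGGCGGGAGGG'
# The quadruplex pattern is wrapped in a zero-width lookahead;
# the actual matched text is recovered from capture group 1.
psq_re_f = re.compile(r'(?=(G{3,}.{1,7}?G{3,}.{1,7}?G{3,}.{1,7}?G{3,}))')
for m in psq_re_f.finditer(seq):
    print(m.start(), m.group(1))
# 0 GGGTGGGTGGGCGGG
# 4 GGGTGGGCGGGAGGG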

Related

Python - How to make sure that a line being read from a file contains only a given string and nothing else

In order to make sure I start and stop reading a text file exactly where I want to, I am inserting 'start1'<->'end1', 'start2'<->'end2' as tags within the text file and providing those to my Python script. In my script I read it as:
start_end = ['start1', 'end1']
line_num = []
with open(file_path) as fp1:
    for num, line in enumerate(fp1, 1):
        for i in start_end:
            if i in line:
                line_num.append(num)
fp1.close()
print '\nLine number: ', line_num

fp2 = open(file_path)
for k, line2 in enumerate(fp2):
    for x in range(line_num[0], line_num[1] - 1):
        if k == x:
            header.append(line2)
fp2.close()
This works well until I reach start10 <-> end10 and beyond. E.g. when checking whether "start2" is in the line, it also matches lines containing "start21", and similarly for the end tag; so providing "start1, end1" as input also picks up "start10, end10". If I replace the line:
if i in line:
with
if i == line:
it throws an error.
How can I make sure that the script reads the line that contains ONLY "start1" and not "start10"?
import re

prog = re.compile('start1$')
if prog.match(line):
    print line
That should return None if there is no match and return a regex match object if the line matches the compiled regex. The '$' at the end of the regex says that's the end of the line, so 'start1' works but 'start10' doesn't.
Or another way:
def test(line):
    import re
    prog = re.compile('start1$')
    return prog.match(line) != None

>>> test('start1')
True
>>> test('start10')
False
Since your markers are always at the end of the line, change:
start_end = ['start1','end1']
to:
start_end = ['start1\n','end1\n']
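A quick illustrative check of why the trailing newline disambiguates the tags (assuming each tag sits alone on its line):

>>> line = 'start10\n'
>>> 'start1' in line      # bare substring: false positive
True
>>> 'start1\n' in line    # tag plus newline: no false positive
False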
You probably want to look into regular expressions. The Python re library has some good regex tools. It would let you define a string to compare your line to and it has the ability to check for start and end of lines.
If you can control the input file, consider adding an underscore (or any non-number character) to the end of each tag.
'start1_'<->'end1_'
'start10_'<->'end10_'
The regular expression solution presented in other answers is more elegant, but requires using regular expressions.
You can do this with find():
for num, line in enumerate(fp1, 1):
    for i in start_end:
        if i in line:
            # make sure the next char isn't '0'
            if line[line.find(i) + len(i)] != '0':
                line_num.append(num)
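For comparison, a regex-based version of the same loop might look like this (a sketch reusing file_path and start_end from the question; \b refuses to match between two word characters, so 'start1' no longer matches inside 'start10'):

import re

start_end = ['start1', 'end1']
line_num = []
with open(file_path) as fp1:
    for num, line in enumerate(fp1, 1):
        for i in start_end:
            # re.escape guards against metacharacters in the tag;
            # \b requires a non-word character (or end of line) after it.
            if re.search(re.escape(i) + r'\b', line):
                line_num.append(num)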

Parsing a text file for pattern and writing found pattern back to another file python 3.4

I am trying to open a text file, parse it for specific regex patterns, and, whenever I find a pattern, write the returned match to another text file.
Specifically, it is a list of IP addresses, and I want to parse specific ones out of it.
So the file may have
10.10.10.10
9.9.9.9
5.5.5.5
6.10.10.10
And say I want just the IPs that end in 10 (the regex I think I am good with). My example looks for the 10.180.43.XX or 10.180.41.XX IP hosts. But I will adjust as needed.
I've tried several methods and failed miserably at them all. It's days like this I know why I just never mastered any language. But I'm committed to Python, so here goes.
import re
textfile = open("SymantecServers.txt", 'r')
matches = re.findall('^10.180\.4[3,1].\d\d',str(textfile))
print(matches)
This gives me empty brackets. I had to wrap the textfile in the str function or it just puked. I don't know if this is right.
This just failed all over the place no matter how I fine-tuned it.
f = open("SymantecServers.txt","r")
o = open("JustIP.txt",'w', newline="\r\n")
for line in f:
pattern = re.compile("^10.180\.4[3,1].\d\d")
print(pattern)
#o.write(pattern)
#o.close()
f.close()
I did get one working, but it just returned the entire line (including the netmask and other text like the hostname, which are all on the same line in the text file; I just want the IP).
Any help on how to read a text file and, if it has an IP pattern, grab the full IP and write it into another text file, so I end up with a list of just the IPs I want? I am 3 hours into it and behind on work, so I am going to do the first file by hand...
I am just at a loss as to what I am missing. Sorry for being a newbie.
Here it is working:
>>> s = """10.10.10.10
... 9.9.9.9
... 5.5.5.5
... 10.180.43.99
... 6.10.10.10"""
>>> re.findall(r'10\.180\.4[31]\.\d\d', s)
['10.180.43.99']
You do not really need to add line boundaries, as you're matching a very specific IP address. If your file does not have weird things like '123.23.234.10.180.43.99.21354' that you don't want to match, it should be OK!
Your [3,1] syntax matches either 3, 1 or a comma, and you don't want to match against a comma ;-)
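A quick REPL check illustrates the character-class pitfall:

>>> import re
>>> re.findall(r'4[3,1]', '43 41 4, 42')  # the comma is a real alternative here
['43', '41', '4,']
>>> re.findall(r'4[31]', '43 41 4, 42')   # digits only
['43', '41']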
About your function:
r = re.compile(r'10\.180\.4[31]\.\d\d')
with open("SymantecServers.txt", "r") as f:
    with open("JustIP.txt", 'w', newline="\r\n") as o:
        for line in f:
            matches = r.findall(line)
            for match in matches:
                o.write(match + '\n')  # add a newline so the IPs don't run together
Though if I were you, I'd extract IPs using:
r = re.compile(r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}')
with open("SymantecServers.txt", "r") as f:
    with open("JustIP.txt", 'w', newline="\r\n") as o:
        for line in f:
            matches = r.findall(line)
            for match in matches:
                a, b, c, d = match.split('.')
                if int(a) < 255 and int(b) < 255 and int(c) in (43, 41) and int(d) < 100:
                    o.write(match + '\n')
Or another way to do it:
r = re.compile(r'(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})')
with open("SymantecServers.txt", "r") as f:
    with open("JustIP.txt", 'w', newline="\r\n") as o:
        for line in f:
            m = r.match(line)
            if m:
                a, b, c, d = m.groups()
                if int(a) < 255 and int(b) < 255 and int(c) in (43, 41) and int(d) < 100:
                    o.write(m.group(0) + '\n')  # m.group(0) is the whole matched IP
which uses the regex to split the IP address into groups.
What you're missing is that you're doing a re.compile() which creates a Regular Expression object in Python. You're forgetting to match.
You could try:
# This isn't the best way to match IP's, but if it fits for your use-case keep it for now.
pattern = re.compile("^10.180\.4[13].\d\d")
f = open("SymantecServers.txt",'r')
o = open("JustIP.txt",'w')
for line in f:
m = pattern.match(line)
if m is not None:
print "Match: %s" %(m.group(0))
o.write(m.group(0) + "\n")
f.close()
o.close()
This compiles the pattern object, attempts to match each line against it, and then prints out the current match. I avoid having to split my matches, but I have to pay attention to match groups, hence group(0).
You can also look at re.search(), but if you're running the search enough times with the same regular expression, it becomes more worthwhile to use compile().
Also note that I moved the f.close() to the outside of the for loop.

Merging Words into a Line

I am currently using Python v2.6 and trying to merge words into a line. My code is supposed to read data from a text file in which I have two rows of data, both of which are strings. It takes the second-row data each time, which are the words of sentences; these are separated by delimiter strings, like so:
Inside the .txt:
"delimiter_string"
"row_1_data" "row_2_data"
"row_1_data" "row_2_data"
"row_1_data" "row_2_data"
"row_1_data" "row_2_data"
"row_1_data" "row_2_data"
"delimiter_string"
"row_1_data" "row_2_data"
"row_1_data" "row_2_data"
...
Those "row_2_data" will add-up to a sentence later. Sorry for the long introduction btw.
Here is my code:
import sys
import re

newLine = ''
for line in sys.stdin:
    word = line.split(' ')[1]
    if word == '<S>+BSTag':
        continue
    elif word == '</S>+ESTag':
        print newLine
        newLine = ''
        continue
    else:
        w = re.sub('\[.*?]', '', word)
        if newLine == '':
            newLine += w
        else:
            newLine += ' ' + w
"BSTag" is the tag for "Sentence Begins" and "ESTag" is for "Sentence Ends": the so called "delimiters". "re.sub" is used for a special purpose and it works as far as I checked.
The problem is that, when I execute this Python script from the command line in Linux with the following command: $ cat file.txt | script.py | less, I cannot see any output, just a blank file.
For those who are not familiar with Linux, I guess the problem has nothing to do with terminal execution, so you can neglect that part. Simply put, the code does not work as intended and I cannot find a single mistake.
Any help will be appreciated, and thanks for reading the long post :)
OK, the problem is solved; it was actually a corpus error rather than a coding one. A very odd entry in the text file was causing problems, and removing it solved the issue. You can use either of these approaches, mine or the one presented by "snurre", if you want to do similar text processing.
Cheers.
def foo(lines):
    output = []
    for line in lines:
        words = line.split()
        if len(words) < 2:
            word = words[0]
        else:
            word = words[1]
        if word == '</S>+ESTag':
            yield ' '.join(output)
            output = []
        elif word != '<S>+BSTag':
            output.append(words[1])

for sentence in foo(sys.stdin):
    print sentence
Your regex is a little funky. From what I can tell, it's replacing anything between (and including) a pair of [ and ] with '', so it ends up printing empty strings.
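For illustration, here is what that substitution does when a whole token is bracketed:

>>> import re
>>> re.sub('\[.*?]', '', 'word[Tag1][Tag2]')
'word'
>>> re.sub('\[.*?]', '', '[OnlyATag]')  # nothing left to print
''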
I think the problem is that the script isn't being executed (unless you just excluded the shebang in the code you posted)
Try this
cat file.txt | python script.py | less

Hadoop MapReduce job on file containing HTML tags

I have a bunch of large HTML files and I want to run a Hadoop MapReduce job on them to find the most frequently used words. I wrote both my mapper and reducer in Python and used Hadoop streaming to run them.
Here is my mapper:
#!/usr/bin/env python

import sys
import re
import string

def remove_html_tags(in_text):
    '''
    Remove any HTML tags that are found.
    '''
    global flag
    in_text = in_text.lstrip()
    in_text = in_text.rstrip()
    in_text = in_text + "\n"
    if flag == True:
        in_text = "<" + in_text
        flag = False
    if re.search('^<', in_text) != None and re.search('(>\n+)$', in_text) == None:
        in_text = in_text + ">"
        flag = True
    p = re.compile(r'<[^<]*?>')
    in_text = p.sub('', in_text)
    return in_text

# input comes from STDIN (standard input)
global flag
flag = False
for line in sys.stdin:
    # remove leading and trailing whitespace, set to lowercase and remove HTML tags
    line = line.strip().lower()
    line = remove_html_tags(line)
    # split the line into words
    words = line.split()
    # increase counters
    for word in words:
        # write the results to STDOUT (standard output);
        # what we output here will be the input for the
        # Reduce step, i.e. the input for reducer.py
        #
        # tab-delimited; the trivial word count is 1
        if word == '': continue
        for c in string.punctuation:
            word = word.replace(c, '')
        print '%s\t%s' % (word, 1)
Here is my reducer:
#!/usr/bin/env python
from operator import itemgetter
import sys
# maps words to their counts
word2count = {}
# input comes from STDIN
for line in sys.stdin:
# remove leading and trailing whitespace
line = line.strip()
# parse the input we got from mapper.py
word, count = line.split('\t', 1)
# convert count (currently a string) to int
try:
count = int(count)
word2count[word] = word2count.get(word, 0) + count
except ValueError:
pass
sorted_word2count = sorted(word2count.iteritems(),
key=lambda(k,v):(v,k),reverse=True)
# write the results to STDOUT (standard output)
for word, count in sorted_word2count:
print '%s\t%s'% (word, count)
Whenever I pipe in a small sample string like 'hello world hello hello world ...', I get the proper output of a ranked list. However, when I try to use a small HTML file and use cat to pipe the HTML into my mapper, I get the following error (input2 contains some HTML code):
rohanbk#hadoop:~$ cat input2 | /home/rohanbk/mapper.py | sort | /home/rohanbk/reducer.py
Traceback (most recent call last):
File "/home/rohanbk/reducer.py", line 15, in <module>
word, count = line.split('\t', 1)
ValueError: need more than 1 value to unpack
Can anyone explain why I'm getting this? Also, what is a good way to debug a MapReduce job program?
You can reproduce the bug even with just:
echo "hello - world" | ./mapper.py | sort | ./reducer.py
The issue is here:
if word == '': continue
for c in string.punctuation:
    word = word.replace(c, '')
If word is a single punctuation mark, as would be the case for the above input (after it is split), then it is converted to an empty string. So, just move the check for an empty string to after the replacement.
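A minimal sketch of the corrected inner loop (only the order of the two steps changes; words and string come from the mapper above):

for word in words:
    # strip punctuation first...
    for c in string.punctuation:
        word = word.replace(c, '')
    # ...then skip anything that is now empty, e.g. a bare '-'
    if word == '':
        continue
    print '%s\t%s' % (word, 1)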

str.startswith() not working as I intended

I'm trying to test for a '\t' or a space character, and I can't understand why this bit of code won't work. What I am doing is reading in a file, counting the LOC for the file, and then recording the names of each function present within the file along with their individual lines of code. The bit of code below is where I attempt to count the LOC for the functions.
import re

...

else:
    loc += 1
    for line in infile:
        line_t = line.lstrip()
        if len(line_t) > 0 \
        and not line_t.startswith('#') \
        and not line_t.startswith('"""'):
            if not line.startswith('\s'):
                print ('line = ' + repr(line))
                loc += 1
                return (loc, name)
            else:
                loc += 1
        elif line_t.startswith('"""'):
            while True:
                if line_t.rstrip().endswith('"""'):
                    break
                line_t = infile.readline().rstrip()
    return (loc, name)
Output:
Enter the file name: test.txt
line = '\tloc = 0\n'
There were 19 lines of code in "test.txt"
Function names:
count_loc -- 2 lines of code
As you can see, my test print for the line shows a '\t', but the if statement explicitly says (or so I thought) that it should only execute with no whitespace characters present.
Here is my full test file I have been using:
def count_loc(infile):
    """ Receives a file and then returns the amount
        of actual lines of code by not counting commented
        or blank lines """
    loc = 0
    for line in infile:
        line = line.strip()
        if len(line) > 0 \
        and not line.startswith('//') \
        and not line.startswith('/*'):
            loc += 1
            func_loc, func_name = checkForFunction(line);
        elif line.startswith('/*'):
            while True:
                if line.endswith('*/'):
                    break
                line = infile.readline().rstrip()
    return loc

if __name__ == "__main__":
    print ("Hi")
Function LOC = 15
File LOC = 19
\s is only whitespace to the re package when doing pattern matching.
For startswith, an ordinary method of ordinary strings, \s is nothing special. Not a pattern, just characters.
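A short REPL session makes the difference concrete:

>>> '\thello'.startswith('\s')          # looks for the literal characters '\' and 's'
False
>>> '\thello'.startswith(('\t', ' '))   # startswith accepts a tuple of real prefixes
True
>>> import re
>>> bool(re.match(r'\s', '\thello'))    # \s means whitespace only inside a regex
True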
Your question has already been answered and this is slightly off-topic, but...
If you want to parse code, it is often easier and less error-prone to use a parser. If your code is Python code, Python comes with a couple of parsers (tokenize, ast, parser). For other languages, you can find a lot of parsers on the internet. ANTLR is a well-known one with Python bindings.
As an example, the following couple of lines of code print all lines of a Python module that are not comments and not doc-strings:
import tokenize

ignored_tokens = [tokenize.NEWLINE, tokenize.COMMENT, tokenize.N_TOKENS,
                  tokenize.STRING, tokenize.ENDMARKER, tokenize.INDENT,
                  tokenize.DEDENT, tokenize.NL]

with open('test.py', 'r') as f:
    g = tokenize.generate_tokens(f.readline)
    line_num = 0
    for a_token in g:
        if a_token[2][0] != line_num and a_token[0] not in ignored_tokens:
            line_num = a_token[2][0]
            print(a_token)
As a_token above is already parsed, you can easily check for a function definition, too. You can also keep track of where the function ends by looking at the current column start, a_token[2][1]. If you want to do more complex things, you should use ast.
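For example, a minimal ast-based sketch that lists every function defined in a file with its starting line (assuming a file named 'test.py', as above):

import ast

with open('test.py') as f:
    tree = ast.parse(f.read())

# ast.walk visits every node; FunctionDef nodes carry a name and a line number.
for node in ast.walk(tree):
    if isinstance(node, ast.FunctionDef):
        print('%s defined on line %d' % (node.name, node.lineno))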
Your string literals aren't what you think they are.
You can specify a space or TAB like so:
space = ' '
tab = '\t'
