Python regex problems with string matching - python

Sorry in advance for such a long post
EDIT--
Modified from Norman's Solution to print and return if we find an exact solution, otherwise print all approximate matches. It's currently still only getting 83/85 matches for a specific example of searching for etnse on the dictionary file provided below on the third pastebin link.
def doMatching(file, origPattern):
entireFile = file.read()
patterns = []
startIndices = []
begin = time.time()
# get all of the patterns associated with the given phrase
for pattern in generateFuzzyPatterns(origPattern):
patterns.append(pattern)
for m in re.finditer(pattern, entireFile):
startIndices.append((m.start(), m.end(), m.group()))
# if the first pattern(exact match) is valid, then just print the results and we're done
if len(startIndices) != 0 and startIndices[0][2] == origPattern:
print("\nThere is an exact match at: [{}:{}] for {}").format(*startIndices[0])
return
print('Used {} patterns:').format(len(patterns))
for i, p in enumerate(patterns, 1):
print('- [{}] {}').format(i, p)
# list for all non-overlapping starting indices
nonOverlapping = []
# hold the last matches ending position
lastEnd = 0
# find non-overlapping matches by comparing each matches starting index to the previous matches ending index
# if the starting index > previous items ending index they aren't overlapping
for start in sorted(startIndices):
print(start)
if start[0] >= lastEnd:
# startIndicex[start][0] gets the ending index from the current matches tuple
lastEnd = start[1]
nonOverlapping.append(start)
print()
print('Found {} matches:').format(len(startIndices))
# i is the key <starting index> assigned to the value of the indices (<ending index>, <string at those indices>
for start in sorted(startIndices):
# *startIndices[i] means to unpack the tuple associated to the key i's value to be used by format as 2 inputs
# for explanation, see: http://stackoverflow.com/questions/2921847/what-does-the-star-operator-mean-in-python
print('- [{}:{}] {}').format(*start)
print()
print('Found {} non-overlapping matches:').format(len(nonOverlapping))
for ov in nonOverlapping:
print('- [{}:{}] {}').format(*ov)
end = time.time()
print(end-begin)
def generateFuzzyPatterns(origPattern):
# Escape individual symbols.
origPattern = [re.escape(c) for c in origPattern]
# Find exact matches.
pattern = ''.join(origPattern)
yield pattern
# Find matches with changes. (replace)
for i in range(len(origPattern)):
t = origPattern[:]
# replace with a wildcard for each index
t[i] = '.'
pattern = ''.join(t)
yield pattern
# Find matches with deletions. (omitted)
for i in range(len(origPattern)):
t = origPattern[:]
# remove a char for each index
t[i] = ''
pattern = ''.join(t)
yield pattern
# Find matches with insertions.
for i in range(len(origPattern) + 1):
t = origPattern[:]
# insert a wildcard between adjacent chars for each index
t.insert(i, '.')
pattern = ''.join(t)
yield pattern
# Find two adjacent characters being swapped.
for i in range(len(origPattern) - 1):
t = origPattern[:]
if t[i] != t[i + 1]:
t[i], t[i + 1] = t[i + 1], t[i]
pattern = ''.join(t)
yield pattern
ORIGINAL:
http://pastebin.com/bAXeYZcD - the actual function
http://pastebin.com/YSfD00Ju - data to use, should be 8 matches for 'ware' but only gets 6
http://pastebin.com/S9u50ig0 - data to use, should get 85 matches for 'etnse' but only gets 77
I left all of the original code in the function because I'm not sure exactly what is causing the problem.
you can search for 'Board:isFull()' on anything to get the error stated below.
examples:
assume you named the second pastebin 'someFile.txt' in a folder named files in the same directory as the .py file.
file = open('./files/someFile.txt', 'r')
doMatching(file, "ware")
OR
file = open('./files/someFile.txt', 'r')
doMatching(file, "Board:isFull()")
OR
assume you named the third pastebin 'dictionary.txt' in a folder named files in the same directory as the .py file.
file = open('./files/dictionary.txt', 'r')
doMatching(file, "etnse")
--EDIT
The functions parameters work like so:
file is the location of a file.
origPattern is a phrase.
The function is basically supposed to be a fuzzy search. It's supposed to take the pattern and search through a file to find matches that are either exact, or with a 1 character deviation. i.e.: 1 missing character, 1 extra character, 1 replaced character, or 1 character swapped with an adjacent character.
For the most part it works, But i'm running into a few problems.
First, when I try to use something like 'Board:isFull()' for origPattern I get the following:
raise error, v # invalid expression
sre_constants.error: unbalanced parenthesis
the above is from the re library
I've tried using re.escape() but it doesn't change anything.
Second, when I try some other things like 'Fun()' it says it has a match at some index that doesn't even contain any of that; it's just a line of '*'
Third, When it does find matches it doesn't always find all of the matches. For example, there's one file I have that should find 85 matches, but it only comes up with like 77, and another with 8 but it only comes up with 6. However, they are just alphabetical so it's likely only a problem with how I do searching or something.
Any help is appreciated.
I also can't use fuzzyfinder

I found some issues in the code:
re.escape() seems to not work because its result is not assigned.
Do origPattern = re.escape(origPattern).
When pattern is correctly escaped, be mindful of not breaking the escaping when manipulating the pattern.
Example: re.escape('Fun()') yields the string Fun\(\). The two \( substrings in it must never be separated: never remove, replace, or swap a \ without the char it escapes.
Bad manipulations: Fun(\) (removal), Fu\n(\) (swap), Fun\.{0,2}\).
Good manipulations: Fun\) (removal), Fu\(n\) (swap), Fun.{0,2}\).
You find too few matches because you only try to find fuzzy matches if there are no exact matches. (See line if indices.__len__() != 0:.) You must always look for them.
The loops inserting '.{0,2}' produce one too many pattern, e.g. 'ware.{0,2}' for ware. Unless you intend that, this pattern will find wareXY which has two insertions.
The patterns with .{0,2} don't work as described; they allow one change and one insertion.
I'm not sure about the code involving difflib.Differ. I don't understand it, but I suspect there should be no break statements.
Even though you use a set to store indices, matches from different regexes may still overlap.
You don't use word boundaries (\b) in your regexes, though for natural language that would make sense.
Not a bug, but: Why do you call magic methods explicitly?
(E.g. indices.__len__() != 0 instead of len(indices) != 0.)
I rewrote your code a bit to address any issues I saw:
def doMatching(file, origPattern):
entireFile = file.read()
patterns = []
startIndices = {}
for pattern in generateFuzzyPatterns(origPattern):
patterns.append(pattern)
startIndices.update((m.start(), (m.end(), m.group())) for m in re.finditer(pattern, entireFile))
print('Used {} patterns:'.format(len(patterns)))
for i, p in enumerate(patterns, 1):
print('- [{}] {}'.format(i, p))
nonOverlapping = []
lastEnd = 0
for start in sorted(startIndices):
if start >= lastEnd:
lastEnd = startIndices[start][0]
nonOverlapping.append(start)
print()
print('Found {} matches:'.format(len(startIndices)))
for i in sorted(startIndices):
print('- [{}:{}] {}'.format(i, *startIndices[i]))
print()
print('Found {} non-overlapping matches:'.format(len(nonOverlapping)))
for i in nonOverlapping:
print('- [{}:{}] {}'.format(i, *startIndices[i]))
def generateFuzzyPatterns(origPattern):
# Escape individual symbols.
origPattern = [re.escape(c) for c in origPattern]
# Find exact matches.
pattern = ''.join(origPattern)
yield pattern
# Find matches with changes.
for i in range(len(origPattern)):
t = origPattern[:]
t[i] = '.'
pattern = ''.join(t)
yield pattern
# Find matches with deletions.
for i in range(len(origPattern)):
t = origPattern[:]
t[i] = ''
pattern = ''.join(t)
yield pattern
# Find matches with insertions.
for i in range(len(origPattern) + 1):
t = origPattern[:]
t.insert(i, '.')
pattern = ''.join(t)
yield pattern
# Find two adjacent characters being swapped.
for i in range(len(origPattern) - 1):
t = origPattern[:]
if t[i] != t[i + 1]:
t[i], t[i + 1] = t[i + 1], t[i]
pattern = ''.join(t)
yield pattern

Related

find pattern in a string without using regex

I'm trying to find a pattern in a string. Example:
trail = 'AABACCCACCACCACCACCACC" one can note the "ACC" repetition after a prefix of AAB; so the result should be AAB(ACC)
Without using regex 'import re' how can I do this. What I did so far:
def get_pattern(trail):
for j in range(0,len(trail)):
k = j+1
while k<len(trail) and trail[j]!=trail[k]:
k+=1
if k==len(trail)-1:
continue
window = ''
stop = trail[j]
m = j
while m<len(trail) and k<len(trail) and trail[m]==trail[k]:
window+=trail[m]
m+=1
k+=1
if trail[m]==stop and len(window)>1:
break
if len(window)>1:
prefix=''
if j>0:
prefix = trail[0:j]
return prefix+'('+window+')'
return False
This will do (almost) the trick because in a use case like this:
"AAAAAAAAAAAAAAAAAABDBDBDBDBDBDBDBDBDBDBDBDBDBDBDBD"
the result is AA but it should be: AAAAAAAAAAAAAAAAAA(BD)
The issue with your code is that once you find a repetition that is of length 2 or greater, you don't check forward to make sure it's maintained. In your second example, this causes it to grab onto the 'AA' without seeing the 'BD's that follow.
Since we know we're dealing with cases of prefix + window, it makes sense to instead look from the end rather than the beginning.
def get_pattern(string):
str_len = len(string)
splits = [[string[i-rep_length: i] for i in range(str_len, 0, -rep_length)] for rep_length in range(1, str_len//2)]
reps = [[window == split[0] for window in split].index(False) for split in splits]
prefix_lengths = [str_len - (i+1)*rep for i,rep in enumerate(reps)]
shortest_prefix_length = min(prefix_lengths)
indices = [i for i, pre_len in enumerate(prefix_lengths) if pre_len == shortest_prefix_length]
reps = list(map(reps.__getitem__, indices))
splits = list(map(splits.__getitem__, indices))
max_reps = max(reps)
window = splits[reps.index(max_reps)][0]
prefix = string[0:shortest_prefix_length]
return f'{prefix}({window})' if max_reps > 1 else None
splits uses list comprehension to create a list of lists where each sublist splits the string into rep_length sized pieces starting from the end.
For each sublist split, the first split[0] is our proposed pattern and we see how many times that it's repeated. This is easily done by finding the first instance of False when checking window == split[0] using the list.index() function. We also want to calculate the size of the prefix. We want the shortest prefix with the largest number of reps. This is because of nasty edge cases like jeifjeiAABBBBBBBBBBBBBBAABBBBBBBBBBBBBBAABBBBBBBBBBBBBBAABBBBBBBBBBBBBB where the window has B that repeats more than the window itself. Additionally, anything that repeats 4 times can also be seen as a double-sized window repeated twice.
If you want to deal with an additional suffix, we can do a hacky solution by just trimming from the end until get_pattern() returns a pattern and then just append what was trimmed:
def get_pattern_w_suffix(string):
for i in range(len(string), 0, -1):
pattern = get_pattern(string[0:i])
suffix = string[i:]
if pattern is not None:
return pattern + suffix
return None
However, this assumes that the suffix doesn't have a pattern itself.

How to find a pattern in an input string

I need to find a given pattern in a text file and print the matching patterns. The text file is a string of digits and the pattern can be any string of digits or placeholders represented by 'X'.
I figured the way to approach this problem would be by loading the sequence into a variable, then creating a list of testable subsequences, and then testing each subsequence. This is my first function in python so I'm confused as to how to create the list of test sequences easily and then test it.
def find(pattern): #finds a pattern in the given input file
with open('sequence.txt', 'r') as myfile:
string = myfile.read()
print('Test data is:', string)
testableStrings = []
#how to create a list of testable sequences?
for x in testableStrings:
if x == pattern:
print(x)
return
For example, searching for "X10X" in "11012102" should print "1101" and "2102".
Let pattern = "X10X", string = "11012102", n = len(pattern) - just for followed illustration:
Without using regular expressions, your algorithm may be as follows:
Construct a list of all subsequences of string with length of n:
In[2]: parts = [string[i:i+n] for i in range(len(string) - n + 1)]
In[3]: parts
Out[3]: ['1101', '1012', '0121', '1210', '2102']
Compare pattern with each element in parts:
for part in parts:
The comparison of pattern with part (both have now equal lengths) will be symbol with symbol in corresponding positions:
for ch1, ch2 in zip(pattern, part):
If ch1 is the X symbol or ch1 == ch2, the comparison of corresponding symbols will continue, else we will break it:
if ch1 == "X" or ch1 == ch2:
continue
else:
break
Finally, if all symbol with symbol comparisons were successful, i. e. all pairs of corresponding symbols were exhausted, the else branch of the for statement will be executed (yes, for statements may have an else branch for that case).
Now you may perform any actions with that matched part, e. g. print it or append it to some list:
else:
print(part)
So all in one place:
pattern = "X10X"
string = "11012102"
n = len(pattern)
parts = [string[i:i+n] for i in range(len(string) - n + 1)]
for part in parts:
for ch1, ch2 in zip(pattern, part):
if ch1 == "X" or ch1 == ch2:
continue
else:
break
else:
print(part)
The output:
1101
2102
You probably wanted to create the list of testable sequences from the individual rows of the input file. So instead of
with open('sequence.txt', 'r') as myfile:
string = myfile.read()
use
with open('sequence.txt') as myfile: # 'r' is default
testableStrings = [row.strip() for row in myfile]
The strip() method removes whitespace characters from the start and end of rows, including \n symbols at the end of lines.
Example of the sequence.txt file:
123456789
87654321
111122223333
The output of the print(testableStrings) command:
['123456789', '87654321', '111122223333']

Algorithm to find the most repetitive (not the most common) sequence in a string (aka tandem repeats)

I am looking for an algorithm (possibly implemented in Python) able to find the most REPETITIVE sequence in a string. Where for REPETITIVE, I mean any combination of chars that is repeated over and over without interruption (tandem repeat).
The algorithm I am looking for is not the same as the "find the most common word" one. In fact, the repetitive block doesn't need to be the most common word (substring) in the string.
For example:
s = 'asdfewfUBAUBAUBAUBAUBAasdkBAjnfBAenBAcs'
> f(s)
'UBAUBAUBAUBAUBA' #the "most common word" algo would return 'BA'
Unfortunately, I have no idea on how to tackle this. Any help is very welcome.
UPDATE
A little extra example to clarify that I want to be returned the sequence with the most number of repetition, whatever its basic building block is.
g = 'some noisy spacer'
s = g + 'AB'*5 + g + '_ABCDEF'*2 + g + 'AB'*3
> f(s)
'ABABABABAB' #the one with the most repetitions, not the max len
Examples from #rici:
s = 'aaabcabc'
> f(s)
'abcabc'
s = 'ababcababc'
> f(s)
'ababcababc' #'abab' would also be a solution here
# since it is repeated 2 times in a row as 'ababcababc'.
# The proper algorithm would return both solutions.
With combination of re.findall() (using specific regex patten) and max() functions:
import re
# extended sample string
s = 'asdfewfUBAUBAUBAUBAUBAasdkjnfencsADADADAD sometext'
def find_longest_rep(s):
result = max(re.findall(r'((\w+?)\2+)', s), key=lambda t: len(t[0]))
return result[0]
print(find_longest_rep(s))
The output:
UBAUBAUBAUBAUBA
The crucial pattern:
((\w+?)\2+):
(....) - the outermost captured group which is the 1st captured group
(\w+?) - any non-whitespace character sequence enclosed into the 2nd captured group; +? - quantifier, matches between one and unlimited times, as few times as possible, expanding as needed
\2+ - matches the same text as most recently matched by the 2nd capturing group
Here is the solution based on ((\w+?)\2+) regex but with additional improvements:
import re
from itertools import chain
def repetitive(sequence, rep_min_len=1):
"""Find the most repetitive sequence in a string.
:param str sequence: string for search
:param int rep_min_len: minimal length of repetitive substring
:return the most repetitive substring or None
"""
greedy, non_greedy = re.compile(r'((\w+)\2+)'), re.compile(r'((\w+?)\2+)')
all_rep_seach = lambda regex: \
(regex.search(sequence[shift:]) for shift in range(len(sequence)))
searched = list(
res.groups()
for res in chain(all_rep_seach(greedy), all_rep_seach(non_greedy))
if res)
if not sequence:
return None
cmp_key = lambda res: res[0].count(res[1]) if len(res[1]) >= rep_min_len else 0
return max(searched, key=cmp_key)[0]
You can test it like so:
def check(seq, expected, rep_min_len=1):
result = repetitive(seq, rep_min_len)
print('%s => %s' % (seq, result))
assert result == expected, expected
check('asdfewfUBAUBAUBAUBAUBAasdkBAjnfBAenBAcs', 'UBAUBAUBAUBAUBA')
check('some noisy spacerABABABABABsome noisy spacer_ABCDEF_ABCDEFsome noisy spacerABABAB', 'ABABABABAB')
check('aaabcabc', 'aaa')
check('aaabcabc', 'abcabc', rep_min_len=2)
check('ababcababc', 'ababcababc')
check('ababcababcababc', 'ababcababcababc')
Key features:
used greedy ((\w+)\2+) and non-greedy ((\w+)\2+?) regex;
search repetitive substring in all substrings with the shift from the beginning (e.g.'string' => ['string', 'tring', 'ring', 'ing', 'ng', 'g']);
selection is based on the number of repetitions not on the length of subsequence (e.g. for 'ABABABAB_ABCDEF_ABCDEF' result will be 'ABABABAB', not '_ABCDEF_ABCDEF');
the minimum length of a repeating sequence is matters (see 'aaabcabc' check).
What you are searching for is an algorithm to find the 'largest' primitive tandem repeat in a string. Here is a paper describing a linear time algorithm to find all tandem repeats in a string and by extension all primitive tandem repeats. Gusfield. Linear Time Algorithms for Finding and Representing all Tandem Repeats in a String
Here is a brute force algorithm that I wrote. Maybe it will be useful:
def find_most_repetitive_substring(string):
max_counter = 1
position, substring_length, times = 0, 0, 0
for i in range(len(string)):
for j in range(len(string) - i):
counter = 1
if j == 0:
continue
while True:
if string[i + counter * j: i + (counter + 1) * j] != string[i: i + j] or i + (counter + 1) * j > len(string):
if counter > max_counter:
max_counter = counter
position, substring_length, times = i, j, counter
break
else:
counter += 1
return string[position: position + substring_length * times]

Realizing if there is a pattern in a string (Does not need to start at index 0, could be any length)

Coding a program to detect a n-length pattern in a string, even without knowing where the pattern starts, could be easily done by creating a list of n-length substrings and check if starting at one point there are same items or the rest of the list. Without any piece of information other than the string to check through, is the only way to recognize the pattern is to brute-force through all lengths and check or is there a more efficient algorithm?
(I'm just a beginner in Python, so this may be easy to code... )
Current code that only suits checking for starting at index 0:
def search(s):
match=s[0]+s[1]
while (match != s) and (match[0] != match[-1]):
for matchLen in range(len(match),len(s)-1):
letter = s[matchLen]
if letter == match[-1]:
match += s[len(match)]
break
if match == s:
return None
else:
return match[:-1]
You can use re.findall(r'(.{2,})\1+', string). The parentheses creates a capture group that is later backreferenced by \1. The . matches any character (except for line breaks). The {2,} requires the pattern to be at least two characters long (otherwise strings like ss would be considered a pattern). Finally the + requires that pattern to repeat 1 or more times (in addition to the first time that it occurred inside the capture group). You can see it working in action.
Pattern is a far too vague term, but assuming you mean some string repeating itself, the regexp (?P<pat>.+)(?P=pat) will work.
Given a string what you could do is -
You start with length = 1, and take two pointer variables i and j which you shall use to traverse the string.
Set i = 0 and j = i+length
if str[i]==str[j]:
i++,j++ // till j not equal to length of string
else:
length = length + 1
//increase length by 1 and start the algorithm over from i = 0
Take the example abcdeabcde :
In this we see
Initially i = 0, j = 1 ,
but str[0]!=str[1] i.e. a!=b,
Then we get length = 2 i.e., i = 0,j = 2
but str[0]!=str[2] i.e. a!=c,
Continuing in the same fashion,
We see when length = 5 and i = 0 and j = 5,
str[0]==str[5]
and thus you can see that i and j increment till j is equal to string length.
And you have your answer that is the pattern length. It may not seem obvious but i would suggest you dry-run this algorithm over some of your test cases and let me know the results.
You can use re.findall() to find all matches:
import re
s = "somethingabcdeabcdeabcdeabcdeabcdeelseabcdeabcdeabcde"
li = re.findall(r'abcde',s)
print(li)
Output:
['abcde', 'abcde', 'abcde', 'abcde', 'abcde', 'abcde', 'abcde', 'abcde']

Match parts of string that are not consecutive stretches of certain character

I have a simple function that yields all stretches of at least gapSize consecutive N's from a string:
def get_gap_coordinates(sequence, gapSize=25):
gapPattern = "N{"+str(gapSize)+",}"
p = re.compile(gapPattern)
m = p.finditer(sequence)
for gap in m:
start,stop = gap.span()
yield(start,stop)
Now I'd like to have a function that does the exact opposite: Match all characters that are NOT consecutive stretches of at least gapSize N's. These stretches may occur at any position in the string (beginning, middle and end) with any given number.
I have looked into lookarounds and tried
(?!N{25,}).*
but this does not do what I need.
Any help is much appreciated!
edit:
For example: a sequence NNNNNNACTGACGTNNNACTGACNNNNN with should match ACTGACGTNNNACTGAC for gapSize=5 and ACTGACGT & ACTGAC for gapSize = 3.
So here is a pure regex solution which seems to be what you want, but I wonder if there is actually a better way to do it. I'll add alternatives as I come up with them. I used several online regex tools as well as playing around in the shell.
One of the tools has a nice graphic of the regex and facilty to generate SO answer code: the regex (with a gap of 10) is:
^.*?(?=N{10})|(?<=N{10})[^N].*?(?=N{10})|(?<=N{10})[^N].*?$
Usage:
s = 'NAANANNNNNNNNNNBBBBNNNCCNNNNNNNNNNDDDDN'
def foo(s, gapSize = 25):
'''yields non-gap items (re.match objects) in s or
if gaps are not present raises StopIteration immediately
'''
# beginning of string and followed by a 'gap' OR
# preceded a 'gap' and followed by a 'gap' OR
# preceded a 'gap' and followed by end of string
pattern = r'^.*?(?=N{{{}}})|(?<=N{{{}}})[^N].*?(?=N{{{}}})|(?<=N{{{}}})[^N].*?$'
pattern = pattern.format(gapSize, gapSize, gapSize, gapSize)
for match in re.finditer(pattern, s):
#yield match.span()
yield match
for match in foo(s, 10):
print match.span(), match.group()
'''
>>>
(0, 5) NAANA
(15, 24) BBBBNNNCC
(34, 39) DDDDN
>>>
'''
So if you think about it a bit you see that the beginning of a gap is the end of a non-gap and vis-versa. So with a simple regex: iterate over the gaps, add logic to the loop to keep track of the non-gap spans, and yield the spans. (my placeholder variable names could probably be improved)
s = 'NAANANNNNNNNNNNBBBBNNNCCNNNNNNNNNNDDDDN'
def bar(s, n):
'''Yields the span of non-gap items in s or
immediately raises StopIteration if gaps are not present.
'''
gap = r'N{{{},}}'.format(n)
# initialize the placeholders
previous_start = 0
end = len(s)
for match in re.finditer(gap, s):
start, end = match.span()
if start == 0:
previous_start = end
continue
end = start
yield previous_start, end
previous_start = match.end()
if end != len(s):
yield previous_start, len(s)
Usage
for start, end in bar(s, 4):
print (start, end), s[start:end]
'''
>>>
(0, 5) NAANA
(15, 24) BBBBNNNCC
(34, 39) DDDDN
>>>
'''
Negative lookahead seems to work ok. E.g. for gap-size 3, the regexp would be:
N{3,}?([^N](?:(?!N{3,}?).)*)
Try it here.
import re
def get_gap_coordinates(sequence, gapSize=25):
gapPattern = "N{%s,}?([^N](?:(?!N{%s,}?).)*)" % (gapSize, gapSize)
p = re.compile(gapPattern)
m = p.finditer(sequence)
for gap in m:
start,stop = gap.start(1), gap.end(1)
yield(start,stop)
for x in get_gap_coordinates('NNNNNNACTGACGTNNNACTGACNNNNN', 3):
print x
Warning: This might not match well at the beginning of the string, if the string does not start with an 'N' sequence. But you can always pad the string with gap-size times 'N' from the left.
I thought about regexes to directly match the wanted blocks, but nothing good came to mind. I think it's better to keep finding the gaps and simply use the gap coordinates to get the good block coordinates. I mean, they're basically the same, right? Gap stops are block starts and gap starts are block stops.
def get_block_coordinates(sequence, gapSize=25):
gapPattern = "N{"+str(gapSize)+",}"
p = re.compile(gapPattern)
m = p.finditer(sequence)
prevStop = 0
for gap in m:
start,stop = gap.span()
if start:
yield(prevStop,start)
prevStop = stop
if prevStop < len(sequence):
yield(prevStop,len(sequence))
I think you can do something like:
gapPattern = "(N{"+str(gapSize)+",})"
p = re.compile(gapPattern)
i = 0
for s in re.split(p, sequence):
if not re.match(p, s):
yield i
i += len(s)
And that'll generate a sequence of offsets to substrings that aren't gap_size "N" characters as per the re.split function.

Categories