Find all (including overlapping) substrings matching a regular expression in Python

I need to find positions of all substrings matching a regular expression within a string. For instance, if the string is abbba and the regexp is (b|bb)(?=a), the result should be [(2, 4), (3, 4)].
What I’ve come up with is
import re
def get_ranges(pattern: str, string: str) -> list[tuple[int, int]]:
    n = len(string)
    return [(start, end) for start in range(n + 1) for end in range(start, n + 1)
            if re.fullmatch(f'.{{{start}}}({pattern}).{{{n - end}}}', string)]
But this tends to perform extremely slowly, especially given that it doesn’t allow precompiled regexps to be used. Are there any more efficient ways to solve the problem?

First of all, I think you should use for end in range(start, n + 1): an end before start can never be a valid range, so there is no reason to restart end from 0 on every iteration. With just this edit, executing this code
for i in range(300000):
    get_ranges(r"(b|bb)(?=a)", "abbba")
goes from 9.67 to 6.01 seconds.

You may want to try a slightly different regex, (b{1,2})(?=a), which should be a little faster.
Secondly, you can compile the pattern once and reuse it, slicing the string instead of rebuilding the regex:
pattern = re.compile('(b{1,2})(?=a)')
def get_ranges(pattern: re.Pattern, string: str):
    result = []
    start, end = 0, len(string)
    match = pattern.search(string)
    while match is not None and start < end:
        # match offsets are relative to the current slice, so shift them by start
        result.append((match.start(0)+start, match.end(0)+start))
        # resume one character past the start of this match so overlaps are found
        start += match.start(0) + 1
        match = pattern.search(string[start:])
    return result
You can also yield instead of constructing result in its entirety before returning.
Timings for comparison (per 1 million execs):
original: 38.42 s
above: 2.14 s (17.95x speedup)
yield version: 0.2914 s (131.85x speedup)
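A variation on the slicing approach, offered as a sketch rather than a drop-in replacement: compiled patterns accept a start position in search(), so the string never has to be copied. The name iter_ranges is made up for this illustration. Like the loop above, it reports at most one match per start position (whichever the regex engine prefers), so it does not enumerate every possible (start, end) pair.

```python
import re
from typing import Iterator

def iter_ranges(pattern: re.Pattern, string: str) -> Iterator[tuple[int, int]]:
    """Yield the span of each match, allowing matches to overlap."""
    start = 0
    while start <= len(string):
        # search() with a pos argument scans from that index without slicing
        match = pattern.search(string, start)
        if match is None:
            break
        yield match.span()
        # step one character past this match's start to allow overlaps
        start = match.start() + 1

print(list(iter_ranges(re.compile(r'(b|bb)(?=a)'), 'abbba')))
# [(2, 4), (3, 4)]
```

Because it is a generator, the caller has to consume it (for example with list()) before any matching work actually happens.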

Related

Regex - Counting greatest number of short tandem repeats

I'm looping through a list of Short Tandem Repeats (STRs) and trying to find the greatest number of times each one occurs consecutively in a DNA sequence.
Sequence:
AAGGTAAGTTTAGAATATAAAAGGTGAGTTAAATAGAATAGGTTAAAATTAAAGGAGATCAGATCAGATCAGATCTATCTATCTATCTATCTATCAGAAAAGAGAGATCAGATCAGTTAAAGAGTAAGATATTGAATTAATGGAAAATATTGTTGGGGAAAGGAGGGATAGAGATCAGATC
STRs = ['AGATC', 'AATG', 'TATC']
for STR in STRs:
    max_repeats = len(re.findall(f'(?<={STR}){STR}', sequence))
    print(max_repeats)
The greatest number of consecutive repeats in the sequence is 4.
My current regex returns only 3 matches because of the positive lookbehind. I'm also not sure how to modify my code to capture the greatest number of repeats (4) without also including the other repeat groups. Is it possible to retrieve a nested list of the matches, like the one below, which I could iterate over to find the greatest number?
[[AGATC, AGATC, AGATC, AGATC], [AGATC, AGATC], [AGATC, AGATC]]
When there is nothing dynamic in the string that you're searching for, and especially when the string that you are searching in is large, I would try to find a solution that avoids regex, because basic string methods typically are significantly faster than regex.
The following is probably un-Pythonic in more ways than one (improvement suggestions welcome), but the underlying assumption is that str.split() massively outperforms regex, and the arithmetic to calculate string positions takes negligible time.
def find_tandem_repeats(sequence, search):
    """ searches through a DNA sequence and returns (position, repeats) tuples """
    if sequence == '' or search == '':
        return
    lengths = list(map(len, sequence.split(search)))
    pos = lengths[0]
    repeats = 0
    pending = False
    for l in lengths[1:]:
        if l == 0:
            pending = True
            repeats += 1
            continue
        repeats += 1
        yield (pos, repeats)
        pos += l + len(search) * repeats
        repeats = 0
        pending = False
    if pending:
        yield (pos, repeats)
usage:
data = "AAGGTAAGTTTAGAATATAAAAGGTGAGTTAAATAGAATAGGTTAAAATTAAAGGAGATCAGATCAGATCAGATCTATCTATCTATCTATCTATCAGAAAAGAGAGATCAGATCAGTTAAAGAGTAAGATATTGAATTAATGGAAAATATTGTTGGGGAAAGGAGGGATAGAGATCAGATC"
positions = list(find_tandem_repeats(data, 'AGATC'))
for position in positions:
    print("at index %s, repeats %s times" % position)
max_position = max(positions, key=lambda x: x[1])
print("maximum at index %s, repeats %s times" % max_position)
Output
at index 55, repeats 4 times
at index 104, repeats 2 times
at index 171, repeats 2 times
maximum at index 55, repeats 4 times
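To find every occurrence of AGATC except one at the very end of the sequence, you can use the following. Note that it counts occurrences anywhere in the sequence (seven of the eight here), not the length of a consecutive run:
>>> re.findall(r'AGATC\B', sequence)
['AGATC', 'AGATC', 'AGATC', 'AGATC', 'AGATC', 'AGATC', 'AGATC']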
From the python documentation on \B:
Matches the empty string, but only when it is not at the beginning or end of a word. This means that r'py\B' matches 'python', 'py3', 'py2', but not 'py', 'py.', or 'py!'. \B is just the opposite of \b, so word characters in Unicode patterns are Unicode alphanumerics or the underscore, although this can be changed by using the ASCII flag. Word boundaries are determined by the current locale if the LOCALE flag is used.
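A small illustration of the quoted behaviour, using a throwaway test string (the variable names are just for this example). Printing the start offsets makes it easier to see which occurrences each assertion accepts.

```python
import re

words = 'python py3 py py.'

# \B after 'py' succeeds only when another word character follows
inside = [m.start() for m in re.finditer(r'py\B', words)]
# \b is the opposite: it succeeds only at a word boundary
at_boundary = [m.start() for m in re.finditer(r'py\b', words)]

print(inside)       # [0, 7]   -> 'python', 'py3'
print(at_boundary)  # [11, 14] -> 'py', 'py.'
```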
Here's one way to approach the problem:
>>> STRs = ['AGATC', 'AATG', 'TATC']
>>> pattern = '|'.join(f'({tgt})+' for tgt in STRs)
>>> for mo in re.finditer(pattern, seq):
...     print(mo.group(0))
...
AGATCAGATCAGATCAGATC
TATCTATCTATCTATCTATC
AGATCAGATC
AATG
AGATCAGATC
Key ideas:
1) The core pattern uses + on a group to allow for consecutive repeats:
(AGATC)+
2) The patterns are combined with | to allow any STR to match:
>>> pattern
'(AGATC)+|(AATG)+|(TATC)+'
3) The re.finditer() call gives the match objects one at a time.
4) If needed the match object can give other information like start and stop points to compute length or a tuple to show which STR matched:
>>> mo.group(0)
'AGATCAGATC'
>>> mo.span()
(171, 181)
>>> mo.groups()
('AGATC', None, None)
How to count the maximum repetitions:
>>> tracker = dict.fromkeys(STRs, 0)
>>> for mo in re.finditer(pattern, seq):
...     start, end = mo.span()
...     STR = next(filter(None, mo.groups()))
...     reps = (end - start) // len(STR)
...     tracker[STR] = max(tracker[STR], reps)
...
>>> tracker
{'AGATC': 4, 'AATG': 1, 'TATC': 5}
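The same bookkeeping can be wrapped in a function. This is only a sketch: the name max_consecutive_repeats is invented here, re.escape is added on the assumption that a repeat unit might someday contain regex metacharacters, and mo.lastindex is used to identify which alternative matched.

```python
import re

def max_consecutive_repeats(sequence, strs):
    """Map each repeat unit in strs to its longest consecutive run in sequence."""
    # one capturing group per unit, so lastindex tells us which unit matched
    pattern = re.compile('|'.join(f'((?:{re.escape(s)})+)' for s in strs))
    best = dict.fromkeys(strs, 0)
    for mo in pattern.finditer(sequence):
        unit = strs[mo.lastindex - 1]
        best[unit] = max(best[unit], len(mo.group(0)) // len(unit))
    return best

print(max_consecutive_repeats('xxABABABxxCDxABx', ['AB', 'CD']))
# {'AB': 3, 'CD': 1}
```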
Here is a functional programming approach. In each recursion step, the first occurrence of a consecutive repeat is found and then removed from the search string. You keep track of the maximum number of repeats found so far and pass it to each recursive call until no consecutive repeats remain.
def find_repeats(pattern, seq, max_r=0):
    g = re.search(f'{pattern}({pattern})+', seq)
    if g:
        # integer division keeps the repeat count an int
        max_repeats = len(g.group()) // len(pattern)
        return find_repeats(pattern, seq.replace(g.group(), '', 1), max(max_r, max_repeats))
    else:
        return max_r
print(find_repeats('AGATC', sequence))

Algorithm to find the most repetitive (not the most common) sequence in a string (aka tandem repeats)

I am looking for an algorithm (possibly implemented in Python) that can find the most REPETITIVE sequence in a string. By REPETITIVE I mean any combination of characters that is repeated over and over without interruption (a tandem repeat).
The algorithm I am looking for is not the same as the "find the most common word" one. In fact, the repetitive block doesn't need to be the most common word (substring) in the string.
For example:
s = 'asdfewfUBAUBAUBAUBAUBAasdkBAjnfBAenBAcs'
> f(s)
'UBAUBAUBAUBAUBA' #the "most common word" algo would return 'BA'
Unfortunately, I have no idea on how to tackle this. Any help is very welcome.
UPDATE
A little extra example to clarify that I want the sequence with the greatest number of repetitions returned, whatever its basic building block is.
g = 'some noisy spacer'
s = g + 'AB'*5 + g + '_ABCDEF'*2 + g + 'AB'*3
> f(s)
'ABABABABAB' #the one with the most repetitions, not the max len
Examples from @rici:
s = 'aaabcabc'
> f(s)
'abcabc'
s = 'ababcababc'
> f(s)
'ababcababc' #'abab' would also be a solution here
# since it is repeated 2 times in a row as 'ababcababc'.
# The proper algorithm would return both solutions.
With a combination of re.findall() (using a specific regex pattern) and max():
import re
# extended sample string
s = 'asdfewfUBAUBAUBAUBAUBAasdkjnfencsADADADAD sometext'
def find_longest_rep(s):
    result = max(re.findall(r'((\w+?)\2+)', s), key=lambda t: len(t[0]))
    return result[0]
print(find_longest_rep(s))
The output:
UBAUBAUBAUBAUBA
The crucial pattern:
((\w+?)\2+):
(....) - the outermost capturing group, which is the 1st captured group
(\w+?) - a sequence of word characters (letters, digits, underscore) enclosed in the 2nd capturing group; +? is a lazy quantifier that matches one or more times, as few times as possible, expanding as needed
\2+ - matches the text captured by the 2nd group again, one or more times
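To see what the two groups capture, here is a small experiment on made-up inputs. It also compares the lazy unit (\w+?) with a greedy (\w+), since the choice changes which repeating unit is reported even when the overall match is the same.

```python
import re

lazy = re.compile(r'((\w+?)\2+)')
greedy = re.compile(r'((\w+)\2+)')

# findall returns one (whole_match, repeating_unit) tuple per match
mixed = lazy.findall('abcABABABxyz')
lazy_unit = lazy.findall('aaaa')
greedy_unit = greedy.findall('aaaa')

print(mixed)        # [('ABABAB', 'AB')]
print(lazy_unit)    # [('aaaa', 'a')]
print(greedy_unit)  # [('aaaa', 'aa')]
```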
Here is the solution based on ((\w+?)\2+) regex but with additional improvements:
import re
from itertools import chain
def repetitive(sequence, rep_min_len=1):
    """Find the most repetitive sequence in a string.
    :param str sequence: string for search
    :param int rep_min_len: minimal length of repetitive substring
    :return the most repetitive substring or None
    """
    greedy, non_greedy = re.compile(r'((\w+)\2+)'), re.compile(r'((\w+?)\2+)')
    # try each regex against every suffix of the sequence
    all_rep_search = lambda regex: \
        (regex.search(sequence[shift:]) for shift in range(len(sequence)))
    searched = list(
        res.groups()
        for res in chain(all_rep_search(greedy), all_rep_search(non_greedy))
        if res)
    # covers both an empty sequence and a sequence without any repeat
    if not searched:
        return None
    cmp_key = lambda res: res[0].count(res[1]) if len(res[1]) >= rep_min_len else 0
    return max(searched, key=cmp_key)[0]
You can test it like so:
def check(seq, expected, rep_min_len=1):
    result = repetitive(seq, rep_min_len)
    print('%s => %s' % (seq, result))
    assert result == expected, expected
check('asdfewfUBAUBAUBAUBAUBAasdkBAjnfBAenBAcs', 'UBAUBAUBAUBAUBA')
check('some noisy spacerABABABABABsome noisy spacer_ABCDEF_ABCDEFsome noisy spacerABABAB', 'ABABABABAB')
check('aaabcabc', 'aaa')
check('aaabcabc', 'abcabc', rep_min_len=2)
check('ababcababc', 'ababcababc')
check('ababcababcababc', 'ababcababcababc')
Key features:
uses both a greedy ((\w+)\2+) and a non-greedy ((\w+?)\2+) regex;
searches for a repetitive substring in every suffix of the string (e.g. 'string' => ['string', 'tring', 'ring', 'ing', 'ng', 'g']);
selects by the number of repetitions, not by the length of the substring (e.g. for 'ABABABAB_ABCDEF_ABCDEF' the result is 'ABABABAB', not '_ABCDEF_ABCDEF');
the minimum length of the repeating unit matters (see the 'aaabcabc' checks).
What you are searching for is an algorithm to find the 'largest' primitive tandem repeat in a string. Here is a paper describing a linear time algorithm to find all tandem repeats in a string and by extension all primitive tandem repeats. Gusfield. Linear Time Algorithms for Finding and Representing all Tandem Repeats in a String
Here is a brute force algorithm that I wrote. Maybe it will be useful:
def find_most_repetitive_substring(string):
    max_counter = 1
    position, substring_length, times = 0, 0, 0
    for i in range(len(string)):
        for j in range(len(string) - i):
            counter = 1
            if j == 0:
                continue
            while True:
                if string[i + counter * j: i + (counter + 1) * j] != string[i: i + j] or i + (counter + 1) * j > len(string):
                    if counter > max_counter:
                        max_counter = counter
                        position, substring_length, times = i, j, counter
                    break
                else:
                    counter += 1
    return string[position: position + substring_length * times]

How to split string everywhere a letter appears?

I have a string containing letters and numbers like this -
12345A6789B12345C
How can I get a list that looks like this
[12345A, 6789B, 12345C]
>>> my_string = '12345A6789B12345C'
>>> import re
>>> re.findall(r'\d*\w', my_string)
['12345A', '6789B', '12345C']
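One caveat worth checking on your own data: \w matches digits as well as letters, so a run of digits after the last letter still comes back as a piece. The sketch below contrasts that with a pattern that insists on an ASCII letter at the end; the [A-Za-z] class is an assumption about what counts as a "letter" here, and the input '12A345' is made up.

```python
import re

# \w also matches digits, so the trailing digits are still returned
loose = re.findall(r'\d*\w', '12A345')
# requiring an ASCII letter at the end drops that leftover
strict = re.findall(r'\d*[A-Za-z]', '12A345')
original = re.findall(r'\d*[A-Za-z]', '12345A6789B12345C')

print(loose)     # ['12A', '345']
print(strict)    # ['12A']
print(original)  # ['12345A', '6789B', '12345C']
```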
For the sake of completeness, non-regex solution:
data = "12345A6789B12345C"
result = [""]
for char in data:
    result[-1] += char
    if char.isalpha():
        result.append("")
if not result[-1]:
    result.pop()
print(result)
# ['12345A', '6789B', '12345C']
This should be faster for smaller strings, but if you're working with huge data, go with regex: once the pattern is compiled and warmed up, the searching and splitting happen on the 'fast' C side.
You could build this with a generator, too. The approach below keeps track of start and end indices of each slice, yielding a generator of strings. You'll have to cast it to list to use it as one, though (splitonalpha(some_string)[-1] will fail, since generators aren't indexable)
def splitonalpha(s):
    start = 0
    for end, ch in enumerate(s, start=1):
        if ch.isalpha():  # call the method; a bare ch.isalpha is always truthy
            yield s[start:end]
            start = end
list(splitonalpha("12345A6789B12345C"))
# ['12345A', '6789B', '12345C']

Realizing if there is a pattern in a string (Does not need to start at index 0, could be any length)

Coding a program to detect an n-length pattern in a string, even without knowing where the pattern starts, can be done easily by creating a list of n-length substrings and checking whether, from some point on, the rest of the list consists of the same items. Without any information other than the string itself, is brute-forcing through all lengths the only way to recognize the pattern, or is there a more efficient algorithm?
(I'm just a beginner in Python, so this may be easy to code... )
Current code that only suits checking for starting at index 0:
def search(s):
    match=s[0]+s[1]
    while (match != s) and (match[0] != match[-1]):
        for matchLen in range(len(match),len(s)-1):
            letter = s[matchLen]
            if letter == match[-1]:
                match += s[len(match)]
                break
    if match == s:
        return None
    else:
        return match[:-1]
You can use re.findall(r'(.{2,})\1+', string). The parentheses create a capture group that is later backreferenced by \1. The . matches any character (except for line breaks). The {2,} requires the pattern to be at least two characters long (otherwise strings like ss would be considered a pattern). Finally, the + requires the pattern to repeat one or more times (in addition to its first occurrence inside the capture group).
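A quick experiment with that pattern on made-up strings. Two things may be worth knowing before relying on it: findall returns only the captured unit, and a greedy unit can absorb two copies of a shorter one. The lazy variant (.{2,}?) shown for comparison is an addition for illustration, not part of the suggestion above.

```python
import re

greedy = re.compile(r'(.{2,})\1+')
lazy = re.compile(r'(.{2,}?)\1+')

# findall returns only the captured unit, not the whole repeated run
unit = greedy.findall('xxabcabcabcyy')
# a greedy unit can swallow two copies of a shorter one
greedy_unit = greedy.findall('abababab')
lazy_unit = lazy.findall('abababab')
# group(0) gives the full run and span() its position in the string
m = lazy.search('xxabcabcabcyy')

print(unit)                  # ['abc']
print(greedy_unit)           # ['abab']
print(lazy_unit)             # ['ab']
print(m.group(0), m.span())  # abcabcabc (2, 11)
```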
Pattern is a far too vague term, but assuming you mean some string repeating itself, the regexp (?P<pat>.+)(?P=pat) will work.
Given a string, what you could do is this:
Start with length = 1 and take two index variables, i and j, which you use to traverse the string.
Set i = 0 and j = i + length
if str[i] == str[j]:
    i++, j++   // repeat until j reaches the end of the string
else:
    length = length + 1
    // increase length by 1 and start the algorithm over from i = 0
Take the example abcdeabcde :
In this we see
Initially i = 0, j = 1 ,
but str[0]!=str[1] i.e. a!=b,
Then we get length = 2 i.e., i = 0,j = 2
but str[0]!=str[2] i.e. a!=c,
Continuing in the same fashion,
We see when length = 5 and i = 0 and j = 5,
str[0]==str[5]
and thus you can see that i and j increment till j is equal to string length.
That gives you the answer: the pattern length. It may not seem obvious, but I would suggest you dry-run this algorithm over some of your test cases and let me know the results.
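The walkthrough above can be sketched in Python roughly as follows. The function name smallest_period is made up for this example, and it assumes, as the description does, that the pattern starts at index 0; the last repetition is allowed to be cut short.

```python
def smallest_period(s: str) -> int:
    """Return the smallest length such that s[i] == s[i + length] for every valid i."""
    n = len(s)
    for length in range(1, n):
        # compare every character with the one `length` positions later
        if all(s[i] == s[i + length] for i in range(n - length)):
            return length
    # no shorter period: the string only "repeats" as a whole
    return n

print(smallest_period('abcdeabcde'))  # 5
print(smallest_period('aaaa'))        # 1
print(smallest_period('abc'))         # 3
```

In the worst case this compares on the order of n characters for each candidate length, so it is quadratic in the length of the string.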
You can use re.findall() to find all matches:
import re
s = "somethingabcdeabcdeabcdeabcdeabcdeelseabcdeabcdeabcde"
li = re.findall(r'abcde',s)
print(li)
Output:
['abcde', 'abcde', 'abcde', 'abcde', 'abcde', 'abcde', 'abcde', 'abcde']

Match parts of string that are not consecutive stretches of certain character

I have a simple function that yields all stretches of at least gapSize consecutive N's from a string:
def get_gap_coordinates(sequence, gapSize=25):
    gapPattern = "N{"+str(gapSize)+",}"
    p = re.compile(gapPattern)
    m = p.finditer(sequence)
    for gap in m:
        start,stop = gap.span()
        yield(start,stop)
Now I'd like to have a function that does the exact opposite: Match all characters that are NOT consecutive stretches of at least gapSize N's. These stretches may occur at any position in the string (beginning, middle and end) with any given number.
I have looked into lookarounds and tried
(?!N{25,}).*
but this does not do what I need.
Any help is much appreciated!
edit:
For example: the sequence NNNNNNACTGACGTNNNACTGACNNNNN should match ACTGACGTNNNACTGAC for gapSize=5 and ACTGACGT & ACTGAC for gapSize = 3.
So here is a pure regex solution which seems to be what you want, but I wonder if there is actually a better way to do it. I'll add alternatives as I come up with them. I used several online regex tools as well as playing around in the shell.
One of the tools has a nice graphic of the regex and a facility for generating SO answer code. The regex (with a gap of 10) is:
^.*?(?=N{10})|(?<=N{10})[^N].*?(?=N{10})|(?<=N{10})[^N].*?$
Usage:
import re
s = 'NAANANNNNNNNNNNBBBBNNNCCNNNNNNNNNNDDDDN'
def foo(s, gapSize = 25):
    '''yields non-gap items (re.Match objects) in s;
    yields nothing if no gaps are present
    '''
    # beginning of string and followed by a 'gap' OR
    # preceded by a 'gap' and followed by a 'gap' OR
    # preceded by a 'gap' and followed by end of string
    pattern = r'^.*?(?=N{{{}}})|(?<=N{{{}}})[^N].*?(?=N{{{}}})|(?<=N{{{}}})[^N].*?$'
    pattern = pattern.format(gapSize, gapSize, gapSize, gapSize)
    for match in re.finditer(pattern, s):
        #yield match.span()
        yield match
for match in foo(s, 10):
    print(match.span(), match.group())
'''
>>>
(0, 5) NAANA
(15, 24) BBBBNNNCC
(34, 39) DDDDN
>>>
'''
So if you think about it a bit, you see that the beginning of a gap is the end of a non-gap and vice versa. So, with a simple regex, iterate over the gaps, add logic to the loop to keep track of the non-gap spans, and yield the spans. (My placeholder variable names could probably be improved.)
s = 'NAANANNNNNNNNNNBBBBNNNCCNNNNNNNNNNDDDDN'
def bar(s, n):
    '''Yields the span of non-gap items in s;
    yields nothing if no gaps are present.
    '''
    gap = r'N{{{},}}'.format(n)
    # initialize the placeholders
    previous_start = 0
    end = len(s)
    for match in re.finditer(gap, s):
        start, end = match.span()
        if start == 0:
            previous_start = end
            continue
        end = start
        yield previous_start, end
        previous_start = match.end()
    # emit the tail unless there were no gaps or the string ends with a gap
    if end != len(s) and previous_start != len(s):
        yield previous_start, len(s)
Usage
for start, end in bar(s, 4):
    print((start, end), s[start:end])
'''
>>>
(0, 5) NAANA
(15, 24) BBBBNNNCC
(34, 39) DDDDN
>>>
'''
Negative lookahead seems to work ok. E.g. for gap-size 3, the regexp would be:
N{3,}?([^N](?:(?!N{3,}?).)*)
import re
def get_gap_coordinates(sequence, gapSize=25):
    gapPattern = "N{%s,}?([^N](?:(?!N{%s,}?).)*)" % (gapSize, gapSize)
    p = re.compile(gapPattern)
    m = p.finditer(sequence)
    for gap in m:
        # group 1 is the non-gap block that follows the matched gap
        start,stop = gap.start(1), gap.end(1)
        yield(start,stop)
for x in get_gap_coordinates('NNNNNNACTGACGTNNNACTGACNNNNN', 3):
    print(x)
Warning: This might not match well at the beginning of the string, if the string does not start with an 'N' sequence. But you can always pad the string with gap-size times 'N' from the left.
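The padding workaround from the warning could look like the sketch below. The function name block_coordinates_padded is made up for this example; it prepends gap_size N's so that the first block has a gap in front of it, then shifts the reported coordinates back. One limitation to keep in mind: if the sequence itself begins with a few N's (fewer than gap_size), they are absorbed into the artificial gap and dropped from the first block.

```python
import re

def block_coordinates_padded(sequence, gap_size=25):
    """Yield (start, stop) of non-gap blocks, including one at the very start."""
    pattern = re.compile("N{%d,}?([^N](?:(?!N{%d,}?).)*)" % (gap_size, gap_size))
    # artificial leading gap so a block at index 0 has something to follow
    padded = "N" * gap_size + sequence
    for m in pattern.finditer(padded):
        # shift back by the padding to get coordinates in the original string
        yield m.start(1) - gap_size, m.end(1) - gap_size

print(list(block_coordinates_padded('ACTGNNNACTG', 3)))
# [(0, 4), (7, 11)]
```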
I thought about regexes to directly match the wanted blocks, but nothing good came to mind. I think it's better to keep finding the gaps and simply use the gap coordinates to get the good block coordinates. I mean, they're basically the same, right? Gap stops are block starts and gap starts are block stops.
def get_block_coordinates(sequence, gapSize=25):
    gapPattern = "N{"+str(gapSize)+",}"
    p = re.compile(gapPattern)
    m = p.finditer(sequence)
    prevStop = 0
    for gap in m:
        start,stop = gap.span()
        if start:
            yield(prevStop,start)
        prevStop = stop
    if prevStop < len(sequence):
        yield(prevStop,len(sequence))
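I think you can do something like:
def get_block_offsets(sequence, gapSize=25):
    # the capturing group makes re.split keep the gaps in its output
    p = re.compile("(N{"+str(gapSize)+",})")
    i = 0
    for s in p.split(sequence):
        # skip the gaps themselves and the empty pieces left at the ends
        if s and not p.match(s):
            yield i
        i += len(s)
And that'll generate the start offsets of the substrings that aren't runs of at least gapSize "N" characters, based on the pieces returned by re.split.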
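If only the block contents are needed rather than their offsets, re.split without a capturing group already returns them. A minimal sketch, with non_gap_blocks as a made-up name, checked against the example from the question:

```python
import re

def non_gap_blocks(sequence, gap_size=25):
    """Return the substrings between runs of at least gap_size N's."""
    pieces = re.split('N{%d,}' % gap_size, sequence)
    # a gap at either end of the sequence leaves an empty piece behind
    return [piece for piece in pieces if piece]

seq = 'NNNNNNACTGACGTNNNACTGACNNNNN'
print(non_gap_blocks(seq, 5))  # ['ACTGACGTNNNACTGAC']
print(non_gap_blocks(seq, 3))  # ['ACTGACGT', 'ACTGAC']
```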
