Regex - Counting greatest number of short tandem repeats


I'm looping through a list of Short Tandem Repeats (STRs) and trying to find the greatest number of times each one occurs consecutively in a sequence of DNA.
Sequence:
AAGGTAAGTTTAGAATATAAAAGGTGAGTTAAATAGAATAGGTTAAAATTAAAGGAGATCAGATCAGATCAGATCTATCTATCTATCTATCTATCAGAAAAGAGAGATCAGATCAGTTAAAGAGTAAGATATTGAATTAATGGAAAATATTGTTGGGGAAAGGAGGGATAGAGATCAGATC
import re
STRs = ['AGATC', 'AATG', 'TATC']
for STR in STRs:
    max_repeats = len(re.findall(f'(?<={STR}){STR}', sequence))
    print(max_repeats)
The greatest number of consecutive repeats in the sequence is 4.
My current regex returns only 3 matches because of the positive lookbehind. I'm also not sure how to modify my code to capture the greatest number of repeats (4) without also including the other repeat groups. Is it possible to retrieve a nested list of the matches, like the one below, which I could iterate over to find the greatest number?
[[AGATC, AGATC, AGATC, AGATC], [AGATC, AGATC], [AGATC, AGATC]]

When there is nothing dynamic in the string that you're searching for, and especially when the string that you are searching in is large, I would try to find a solution that avoids regex, because basic string methods typically are significantly faster than regex.
The following is probably un-Pythonic in more ways than one (improvement suggestions welcome), but the underlying assumption is that str.split() massively outperforms regex, and the arithmetic to calculate string positions takes negligible time.
def find_tandem_repeats(sequence, search):
    """ searches through a DNA sequence and yields (position, repeats) tuples """
    if sequence == '' or search == '':
        return
    # lengths of the pieces between occurrences; a 0 means two adjacent occurrences
    lengths = list(map(len, sequence.split(search)))
    pos = lengths[0]
    repeats = 0
    pending = False
    for l in lengths[1:]:
        if l == 0:
            pending = True
            repeats += 1
            continue
        repeats += 1
        yield (pos, repeats)
        pos += l + len(search) * repeats
        repeats = 0
        pending = False
    if pending:
        yield (pos, repeats)
Usage:
data = "AAGGTAAGTTTAGAATATAAAAGGTGAGTTAAATAGAATAGGTTAAAATTAAAGGAGATCAGATCAGATCAGATCTATCTATCTATCTATCTATCAGAAAAGAGAGATCAGATCAGTTAAAGAGTAAGATATTGAATTAATGGAAAATATTGTTGGGGAAAGGAGGGATAGAGATCAGATC"
positions = list(find_tandem_repeats(data, 'AGATC'))
for position in positions:
    print("at index %s, repeats %s times" % position)
max_position = max(positions, key=lambda x: x[1])
print("maximum at index %s, repeats %s times" % max_position)
Output
at index 55, repeats 4 times
at index 104, repeats 2 times
at index 171, repeats 2 times
maximum at index 55, repeats 4 times
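If only the maximum is needed, the same regex-free idea can be sketched more directly with str.find and str.startswith. The helper name longest_run is hypothetical and not part of the answer above. It restarts the search one character after each hit, which is slower than necessary for DNA motifs but stays correct for motifs that can overlap themselves.

```python
def longest_run(sequence, motif):
    """Return the largest number of back-to-back copies of motif in sequence."""
    if not motif:
        return 0
    best = 0
    start = sequence.find(motif)
    while start != -1:
        # Count the copies that follow each other directly from this start.
        run = 0
        pos = start
        while sequence.startswith(motif, pos):
            run += 1
            pos += len(motif)
        best = max(best, run)
        # Resume one character later so overlapping starts are not missed.
        start = sequence.find(motif, start + 1)
    return best


print(longest_run('GG' + 'AGATC' * 3 + 'TT' + 'AGATC', 'AGATC'))
```

The call at the end uses a short made-up string rather than the full sequence from the question.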

To find all sequences of AGATC as long as it's not at the end of the sequence, you can use:
>>> re.findall(r'AGATC\B', sequence)
['AGATC', 'AGATC', 'AGATC', 'AGATC']
From the python documentation on \B:
Matches the empty string, but only when it is not at the beginning or end of a word. This means that r'py\B' matches 'python', 'py3', 'py2', but not 'py', 'py.', or 'py!'. \B is just the opposite of \b, so word characters in Unicode patterns are Unicode alphanumerics or the underscore, although this can be changed by using the ASCII flag. Word boundaries are determined by the current locale if the LOCALE flag is used.
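A small sketch of what \B does in practice, using the examples from the documentation plus a short made-up DNA string. Note that \B only checks that the next character is a word character; it does not check that the next character starts another copy of the motif.

```python
import re

# \B succeeds only where there is no word boundary.
inside_word = bool(re.search(r'py\B', 'python'))   # 'y' is followed by 't'
before_punct = bool(re.search(r'py\B', 'py!'))     # boundary between 'y' and '!'
at_end = bool(re.search(r'py\B', 'py'))            # end of string is a boundary

# Every copy followed by any letter matches, repeated or not.
hits = re.findall(r'AGATC\B', 'AGATCAGATCTTAGATCG')

print(inside_word, before_punct, at_end)
print(hits)
```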

Here's one way to approach the problem:
>>> import re
>>> STRs = ['AGATC', 'AATG', 'TATC']
>>> pattern = '|'.join(f'({tgt})+' for tgt in STRs)
>>> for mo in re.finditer(pattern, seq):
...     print(mo.group(0))
...
AGATCAGATCAGATCAGATC
TATCTATCTATCTATCTATC
AGATCAGATC
AATG
AGATCAGATC
Key ideas:
1) The core pattern used + for a group to allow for consecutive repeats:
(AGATC)+
2) The patterns are combined with | to allow any STR to match:
>>> pattern
'(AGATC)+|(AATG)+|(TATC)+'
3) The re.finditer() call gives the match objects one at a time.
4) If needed the match object can give other information like start and stop points to compute length or a tuple to show which STR matched:
>>> mo.group(0)
'AGATCAGATC'
>>> mo.span()
(171, 181)
>>> mo.groups()
('AGATC', None, None)
How to count the maximum repetitions:
>>> tracker = dict.fromkeys(STRs, 0)
>>> for mo in re.finditer(pattern, seq):
...     start, end = mo.span()
...     STR = next(filter(None, mo.groups()))
...     reps = (end - start) // len(STR)
...     tracker[STR] = max(tracker[STR], reps)
...
>>> tracker
{'AGATC': 4, 'AATG': 1, 'TATC': 5}
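The question also asks for a nested list of matches, one inner list per run. A minimal sketch of that, assuming a hypothetical helper repeat_runs and a short made-up sequence:

```python
import re


def repeat_runs(sequence, motif):
    """Return one list per run of back-to-back motif copies, in order."""
    # re.escape is unnecessary for plain DNA letters, but it keeps the
    # helper safe if the motif ever contains regex metacharacters.
    run_pattern = re.compile(f'(?:{re.escape(motif)})+')
    return [[motif] * (len(m.group(0)) // len(motif))
            for m in run_pattern.finditer(sequence)]


sample = 'GG' + 'AGATC' * 3 + 'TT' + 'AGATC' * 2 + 'C'
runs = repeat_runs(sample, 'AGATC')
longest = max((len(run) for run in runs), default=0)
print(runs)
print(longest)
```

The non-capturing group (?:...) is used because only the full match is needed here; the default=0 covers a sequence with no match at all.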

Here is a functional programming approach. In each recursion step, the first occurrence of a consecutive repeat is found and then removed from the search string. You keep track of the maximum number of repeats found so far and pass it to each recursive call until no more consecutive repeats are found.
import re
def find_repeats(pattern, seq, max_r=0):
    g = re.search(f'{pattern}({pattern})+', seq)
    if g:
        # integer division keeps the repeat count an int
        max_repeats = len(g.group()) // len(pattern)
        return find_repeats(pattern, seq.replace(g.group(), '', 1), max(max_r, max_repeats))
    else:
        return max_r
print(find_repeats('AGATC', sequence))

Related

Is there an easy way to get the number of repeating character in a word?

I'm trying to count how many times characters repeat in a word. The repetitions must be sequential.
For example, the method with input "loooooveee" should return 6 (4 times 'o', 2 times 'e').
I'm implementing this with string-level functions and can do it that way, but is there an easier way to do this, such as a regex or something similar?

Original question: order of repetition does not matter
You can subtract the number of unique letters from the total number of letters. Applying set to a string returns the collection of its unique letters.
x = "loooooveee"
res = len(x) - len(set(x)) # 6
Or you can use collections.Counter, subtract 1 from each value, then sum:
from collections import Counter
c = Counter("loooooveee")
res = sum(i-1 for i in c.values()) # 6
New question: repetitions must be sequential
You can use itertools.groupby to group sequential identical characters:
from itertools import groupby
g = groupby("aooooaooaoo")
res = sum(sum(1 for _ in j) - 1 for i, j in g) # 5
To avoid the nested sum calls, you can use itertools.islice:
from itertools import groupby, islice
g = groupby("aooooaooaoo")
res = sum(1 for _, j in g for _ in islice(j, 1, None)) # 5

You could use a regular expression if you want:
import re
rx = re.compile(r'(\w)\1+')
repeating = sum(x[1] - x[0] - 1
                for m in rx.finditer("loooooveee")
                for x in [m.span()])
print(repeating)
This correctly yields 6 and makes use of the .span() function.
The expression is
(\w)\1+
which captures a word character (one of a-zA-Z0-9_) and tries to repeat it as often as possible.
See a demo on regex101.com for the repeating pattern.
If you want to match any character (that is, not only word characters), change your expression to:
(.)\1+
See another demo on regex101.com.

Try this:
word = input('something:')
total = 0
chars = set(word)  # get the set of unique characters
for item in chars:  # iterate over the set and output the count for each item
    count = word.count(item)
    if count > 1:
        total += count
    print('{}|{}'.format(item, count))
print('Total:' + str(total))
EDIT:
Added the total count of repetitions.

Since it doesn't matter where the repetition is occurring or which characters are being repeated, you can make use of the set data structure provided in Python. It will discard the duplicate occurrences of any character or an object.
Therefore, the solution would look something like this:
def measure_normalized_emphasis(text):
    return len(text) - len(set(text))
This will give you the exact result.
Also make sure to check the edge cases, which is good practice anyway.
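One edge case worth checking is a character that repeats without being adjacent. The sketch below compares the set-based count with a sequential count built on itertools.groupby; both function names are hypothetical.

```python
from itertools import groupby


def repeats_anywhere(text):
    """Extra occurrences of each character, wherever they appear."""
    return len(text) - len(set(text))


def repeats_in_a_row(text):
    """Extra occurrences counted only inside runs of identical characters."""
    return sum(len(list(group)) - 1 for _, group in groupby(text))


for word in ('loooooveee', 'abab', ''):
    print(repr(word), repeats_anywhere(word), repeats_in_a_row(word))
```

The two agree on "loooooveee" but differ on "abab", where no character is repeated consecutively.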

I think your code is comparing the wrong things
You start by finding the last character:
char = text[-1]
Then you compare this to itself:
for i in range(1, len(text)):
    if text[-i] == char: #<-- surely this is text[-1] to begin with?
Why not just run through the characters:
def measure_normalized_emphasis(text):
    char = text[0]
    emphasis_size = 0
    for i in range(1, len(text)):
        if text[i] == char:
            emphasis_size += 1
        else:
            char = text[i]
    return emphasis_size
This seems to work.

Backward search implementation python

I am working on some string search tasks to learn more efficient ways of searching.
I am trying to count how many times a substring occurs in a given set of strings by using backward search.
For example, given the following strings:
original = 'panamabananas$'
s = 'smnpbnnaaaaa$a'
s1 = '$aaaaaabmnnnps' #sorted version of s
I am trying to find how many times the substring 'ban' occurs. To do so, I was thinking of iterating through both strings with the zip function. In the backward search, I should first look for the last character of ban (n) in s1 and see where it lines up with the next character, a, in s. It matches at indexes 9, 10 and 11, which are actually the third, fourth and fifth a in s. The next character to look for is b, but only for the matches that occurred before (that is, where n in s1 matched a in s). So we take those a's (third, fourth and fifth) from s and see whether the third, fourth or fifth a in s1 lines up with a b in s. That way we would have found an occurrence of 'ban'.
It seems complex to me to iterate and save quasi-occurrences, so what I was trying is something like this:
n = 0 #counter of occurrences
for i, j in zip(s1, s):
    if i == 'n' and j == 'a': # this should save the match
        if i[3:6] == 'a' and any(j[3:6] == 'b'):
            n += 1
I think nested if statements may be needed, but I am still a beginner. I am getting 0 occurrences when there is one occurrence of ban in the original.

You can run a loop with find to count the number of occurrences of the substring.
s = 'panamabananasbananasba'
ss = 'ban'
count = 0
idx = s.find(ss, 0)
while idx != -1:
    count += 1
    idx += len(ss)
    idx = s.find(ss, idx)
print(count)
If you really want a backward search, reverse the string and the substring and apply the same mechanism.
s = 'panamabananasbananasban'
s = s[::-1]
ss = 'ban'
ss = ss[::-1]
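The procedure described in the question looks like backward search over a Burrows-Wheeler transform, where s is the last column and s1 is the sorted first column. That reading is an assumption. Under it, a minimal sketch could look like the following; the names bwt and backward_search are hypothetical, and counting by slicing is used for clarity rather than speed.

```python
def bwt(text):
    """Last column of the sorted rotations; text must end with a unique '$'."""
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return ''.join(rotation[-1] for rotation in rotations)


def backward_search(last_column, pattern):
    """Count occurrences of pattern using only the last column."""
    first_column = sorted(last_column)
    # Row of the first appearance of each character in the first column.
    first_occurrence = {}
    for row, char in enumerate(first_column):
        first_occurrence.setdefault(char, row)

    top, bottom = 0, len(last_column)  # half-open row range [top, bottom)
    for char in reversed(pattern):
        if char not in first_occurrence:
            return 0
        # Map the range through the last-to-first correspondence.
        top = first_occurrence[char] + last_column[:top].count(char)
        bottom = first_occurrence[char] + last_column[:bottom].count(char)
        if top >= bottom:
            return 0
    return bottom - top


last = bwt('panamabananas$')
print(last)
print(backward_search(last, 'ban'))
print(backward_search(last, 'ana'))
```

The row range narrows once per pattern character, read from right to left, so no partial matches need to be stored between steps.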

Algorithm to find the most repetitive (not the most common) sequence in a string (aka tandem repeats)

I am looking for an algorithm (possibly implemented in Python) able to find the most REPETITIVE sequence in a string. Where for REPETITIVE, I mean any combination of chars that is repeated over and over without interruption (tandem repeat).
The algorithm I am looking for is not the same as the "find the most common word" one. In fact, the repetitive block doesn't need to be the most common word (substring) in the string.
For example:
s = 'asdfewfUBAUBAUBAUBAUBAasdkBAjnfBAenBAcs'
> f(s)
'UBAUBAUBAUBAUBA' #the "most common word" algo would return 'BA'
Unfortunately, I have no idea on how to tackle this. Any help is very welcome.
UPDATE
A little extra example to clarify that I want to be returned the sequence with the most number of repetition, whatever its basic building block is.
g = 'some noisy spacer'
s = g + 'AB'*5 + g + '_ABCDEF'*2 + g + 'AB'*3
> f(s)
'ABABABABAB' #the one with the most repetitions, not the max len
Examples from @rici:
s = 'aaabcabc'
> f(s)
'abcabc'
s = 'ababcababc'
> f(s)
'ababcababc' #'abab' would also be a solution here
# since it is repeated 2 times in a row as 'ababcababc'.
# The proper algorithm would return both solutions.

With a combination of re.findall() (using a specific regex pattern) and max():
import re
# extended sample string
s = 'asdfewfUBAUBAUBAUBAUBAasdkjnfencsADADADAD sometext'
def find_longest_rep(s):
    result = max(re.findall(r'((\w+?)\2+)', s), key=lambda t: len(t[0]))
    return result[0]
print(find_longest_rep(s))
The output:
UBAUBAUBAUBAUBA
The crucial pattern:
((\w+?)\2+):
(....) - the outermost captured group, which is the 1st captured group
(\w+?) - a sequence of word characters (letters, digits, underscore) enclosed in the 2nd captured group; +? is a lazy quantifier that matches between one and unlimited times, as few times as possible, expanding as needed
\2+ - matches the same text as most recently matched by the 2nd capturing group, one or more times
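The question asks for the run with the most repetitions rather than the longest run. With the same pattern, that only changes the key passed to max. This is a sketch under the assumption that a single left-to-right, non-overlapping scan is good enough; the function name most_repeated is hypothetical.

```python
import re


def most_repeated(s):
    """Return the run with the highest repetition count, or None if there is none."""
    matches = re.findall(r'((\w+?)\2+)', s)
    if not matches:
        return None
    # Each tuple is (whole run, repeated unit); compare by repetition count.
    whole, _unit = max(matches, key=lambda t: len(t[0]) // len(t[1]))
    return whole


sample = 'ABCDEFABCDEF-ABABAB-XYXYXYXY'
by_count = most_repeated(sample)
by_length = max(re.findall(r'((\w+?)\2+)', sample), key=lambda t: len(t[0]))[0]
print(by_count)
print(by_length)
```

Because findall consumes text as it goes, a run that starts inside an earlier match is not seen.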

Here is the solution based on ((\w+?)\2+) regex but with additional improvements:
import re
from itertools import chain
def repetitive(sequence, rep_min_len=1):
    """Find the most repetitive sequence in a string.
    :param str sequence: string for search
    :param int rep_min_len: minimal length of repetitive substring
    :return the most repetitive substring or None
    """
    greedy, non_greedy = re.compile(r'((\w+)\2+)'), re.compile(r'((\w+?)\2+)')
    all_rep_search = lambda regex: \
        (regex.search(sequence[shift:]) for shift in range(len(sequence)))
    searched = list(
        res.groups()
        for res in chain(all_rep_search(greedy), all_rep_search(non_greedy))
        if res)
    # covers both an empty string and a string without any repeat
    if not searched:
        return None
    cmp_key = lambda res: res[0].count(res[1]) if len(res[1]) >= rep_min_len else 0
    return max(searched, key=cmp_key)[0]
You can test it like so:
def check(seq, expected, rep_min_len=1):
    result = repetitive(seq, rep_min_len)
    print('%s => %s' % (seq, result))
    assert result == expected, expected
check('asdfewfUBAUBAUBAUBAUBAasdkBAjnfBAenBAcs', 'UBAUBAUBAUBAUBA')
check('some noisy spacerABABABABABsome noisy spacer_ABCDEF_ABCDEFsome noisy spacerABABAB', 'ABABABABAB')
check('aaabcabc', 'aaa')
check('aaabcabc', 'abcabc', rep_min_len=2)
check('ababcababc', 'ababcababc')
check('ababcababcababc', 'ababcababcababc')
Key features:
uses both the greedy ((\w+)\2+) and the non-greedy ((\w+?)\2+) regex;
searches for a repetitive substring in every suffix of the string, shifting the start one character at a time (e.g. 'string' => ['string', 'tring', 'ring', 'ing', 'ng', 'g']);
selection is based on the number of repetitions, not on the length of the subsequence (e.g. for 'ABABABAB_ABCDEF_ABCDEF' the result will be 'ABABABAB', not '_ABCDEF_ABCDEF');
the minimum length of the repeating sequence matters (see the 'aaabcabc' check).

What you are searching for is an algorithm to find the 'largest' primitive tandem repeat in a string. Here is a paper describing a linear time algorithm to find all tandem repeats in a string and by extension all primitive tandem repeats. Gusfield. Linear Time Algorithms for Finding and Representing all Tandem Repeats in a String

Here is a brute force algorithm that I wrote. Maybe it will be useful:
def find_most_repetitive_substring(string):
    max_counter = 1
    position, substring_length, times = 0, 0, 0
    for i in range(len(string)):
        for j in range(len(string) - i):
            counter = 1
            if j == 0:
                continue
            while True:
                if string[i + counter * j: i + (counter + 1) * j] != string[i: i + j] or i + (counter + 1) * j > len(string):
                    if counter > max_counter:
                        max_counter = counter
                        position, substring_length, times = i, j, counter
                    break
                else:
                    counter += 1
    return string[position: position + substring_length * times]

Realizing if there is a pattern in a string (Does not need to start at index 0, could be any length)

A program that detects an n-length pattern in a string, even without knowing where the pattern starts, could easily be written by creating a list of n-length substrings and checking whether, from some point on, the rest of the list consists of the same item. Without any information other than the string itself, is the only way to recognize the pattern to brute-force through all lengths and check, or is there a more efficient algorithm?
(I'm just a beginner in Python, so this may be easy to code...)
Current code, which only works when the pattern starts at index 0:
def search(s):
    match=s[0]+s[1]
    while (match != s) and (match[0] != match[-1]):
        for matchLen in range(len(match),len(s)-1):
            letter = s[matchLen]
            if letter == match[-1]:
                match += s[len(match)]
                break
    if match == s:
        return None
    else:
        return match[:-1]

You can use re.findall(r'(.{2,})\1+', string). The parentheses creates a capture group that is later backreferenced by \1. The . matches any character (except for line breaks). The {2,} requires the pattern to be at least two characters long (otherwise strings like ss would be considered a pattern). Finally the + requires that pattern to repeat 1 or more times (in addition to the first time that it occurred inside the capture group). You can see it working in action.
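A short sketch of that pattern on a made-up string. re.findall returns only the captured unit, so re.finditer is used as well to recover the full repeated stretch and its position.

```python
import re

text = 'xyabcabcabcz'
pattern = re.compile(r'(.{2,})\1+')

# findall returns the contents of the single capture group.
units = pattern.findall(text)

# finditer exposes the whole match, the repeated unit, and the span.
details = [(m.group(0), m.group(1), m.span()) for m in pattern.finditer(text)]

print(units)
print(details)
```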

Pattern is a far too vague term, but assuming you mean some string repeating itself, the regexp (?P<pat>.+)(?P=pat) will work.

Given a string, what you could do is this:
Start with length = 1 and take two pointer variables i and j, which you use to traverse the string.
Set i = 0 and j = i+length
if str[i]==str[j]:
    i++,j++ // until j equals the length of the string
else:
    length = length + 1
    //increase length by 1 and start the algorithm over from i = 0
Take the example abcdeabcde :
In this we see
Initially i = 0, j = 1 ,
but str[0]!=str[1] i.e. a!=b,
Then we get length = 2 i.e., i = 0,j = 2
but str[0]!=str[2] i.e. a!=c,
Continuing in the same fashion,
We see when length = 5 and i = 0 and j = 5,
str[0]==str[5]
and thus you can see that i and j increment till j is equal to string length.
And you have your answer, which is the pattern length. It may not seem obvious, but I would suggest you dry-run this algorithm on some of your test cases and let me know the results.
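The two-pointer idea above can be sketched in Python as follows. The function name smallest_period is hypothetical, and the sketch assumes, as the description does, that the pattern starts at index 0. It also accepts a final repetition that is cut short.

```python
def smallest_period(s):
    """Smallest length p such that s[i] == s[i + p] for every valid i."""
    for length in range(1, len(s)):
        # i and j = i + length advance together until j reaches the end.
        if all(s[i] == s[i + length] for i in range(len(s) - length)):
            return length
    # No shorter period works: the string does not repeat.
    return len(s)


for sample in ('abcdeabcde', 'aaaa', 'abcab', 'abc'):
    print(sample, smallest_period(sample))
```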

You can use re.findall() to find all matches:
import re
s = "somethingabcdeabcdeabcdeabcdeabcdeelseabcdeabcdeabcde"
li = re.findall(r'abcde',s)
print(li)
Output:
['abcde', 'abcde', 'abcde', 'abcde', 'abcde', 'abcde', 'abcde', 'abcde']

Match parts of string that are not consecutive stretches of certain character

I have a simple function that yields all stretches of at least gapSize consecutive N's from a string:
def get_gap_coordinates(sequence, gapSize=25):
    gapPattern = "N{"+str(gapSize)+",}"
    p = re.compile(gapPattern)
    m = p.finditer(sequence)
    for gap in m:
        start,stop = gap.span()
        yield(start,stop)
Now I'd like to have a function that does the exact opposite: Match all characters that are NOT consecutive stretches of at least gapSize N's. These stretches may occur at any position in the string (beginning, middle and end) with any given number.
I have looked into lookarounds and tried
(?!N{25,}).*
but this does not do what I need.
Any help is much appreciated!
edit:
For example, the sequence NNNNNNACTGACGTNNNACTGACNNNNN should match ACTGACGTNNNACTGAC for gapSize=5, and ACTGACGT & ACTGAC for gapSize=3.

So here is a pure regex solution which seems to be what you want, but I wonder if there is actually a better way to do it. I'll add alternatives as I come up with them. I used several online regex tools as well as playing around in the shell.
One of the tools has a nice graphic of the regex and a facility to generate SO answer code. The regex (with a gap of 10) is:
^.*?(?=N{10})|(?<=N{10})[^N].*?(?=N{10})|(?<=N{10})[^N].*?$
Usage:
import re
s = 'NAANANNNNNNNNNNBBBBNNNCCNNNNNNNNNNDDDDN'
def foo(s, gapSize = 25):
    '''yields non-gap items (re.match objects) in s or
    if gaps are not present raises StopIteration immediately
    '''
    # beginning of string and followed by a 'gap' OR
    # preceded by a 'gap' and followed by a 'gap' OR
    # preceded by a 'gap' and followed by end of string
    pattern = r'^.*?(?=N{{{}}})|(?<=N{{{}}})[^N].*?(?=N{{{}}})|(?<=N{{{}}})[^N].*?$'
    pattern = pattern.format(gapSize, gapSize, gapSize, gapSize)
    for match in re.finditer(pattern, s):
        #yield match.span()
        yield match
for match in foo(s, 10):
    print(match.span(), match.group())
'''
>>>
(0, 5) NAANA
(15, 24) BBBBNNNCC
(34, 39) DDDDN
>>>
'''
So if you think about it a bit, you see that the beginning of a gap is the end of a non-gap and vice versa. So with a simple regex you can iterate over the gaps, add logic to the loop to keep track of the non-gap spans, and yield the spans. (My placeholder variable names could probably be improved.)
s = 'NAANANNNNNNNNNNBBBBNNNCCNNNNNNNNNNDDDDN'
def bar(s, n):
    '''Yields the span of non-gap items in s or
    immediately raises StopIteration if gaps are not present.
    '''
    gap = r'N{{{},}}'.format(n)
    # initialize the placeholders
    previous_start = 0
    end = len(s)
    for match in re.finditer(gap, s):
        start, end = match.span()
        if start == 0:
            previous_start = end
            continue
        end = start
        yield previous_start, end
        previous_start = match.end()
    if end != len(s):
        yield previous_start, len(s)
Usage:
for start, end in bar(s, 4):
    print((start, end), s[start:end])
'''
>>>
(0, 5) NAANA
(15, 24) BBBBNNNCC
(34, 39) DDDDN
>>>
'''

Negative lookahead seems to work ok. E.g. for gap-size 3, the regexp would be:
N{3,}?([^N](?:(?!N{3,}?).)*)
Try it here.
import re
def get_gap_coordinates(sequence, gapSize=25):
    gapPattern = "N{%s,}?([^N](?:(?!N{%s,}?).)*)" % (gapSize, gapSize)
    p = re.compile(gapPattern)
    m = p.finditer(sequence)
    for gap in m:
        start,stop = gap.start(1), gap.end(1)
        yield(start,stop)
for x in get_gap_coordinates('NNNNNNACTGACGTNNNACTGACNNNNN', 3):
    print(x)
Warning: This might not match well at the beginning of the string, if the string does not start with an 'N' sequence. But you can always pad the string with gap-size times 'N' from the left.

I thought about regexes to directly match the wanted blocks, but nothing good came to mind. I think it's better to keep finding the gaps and simply use the gap coordinates to get the good block coordinates. I mean, they're basically the same, right? Gap stops are block starts and gap starts are block stops.
def get_block_coordinates(sequence, gapSize=25):
    gapPattern = "N{"+str(gapSize)+",}"
    p = re.compile(gapPattern)
    m = p.finditer(sequence)
    prevStop = 0
    for gap in m:
        start,stop = gap.span()
        if start:
            yield(prevStop,start)
        prevStop = stop
    if prevStop < len(sequence):
        yield(prevStop,len(sequence))

I think you can do something like:
gapPattern = "(N{"+str(gapSize)+",})"
p = re.compile(gapPattern)
i = 0
for s in re.split(p, sequence):
    if not re.match(p, s):
        yield i
    i += len(s)
And that'll generate a sequence of offsets to substrings that aren't gap_size "N" characters as per the re.split function.
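A self-contained sketch of the re.split idea, wrapped in a generator that yields (start, stop) pairs like the function in the question. The name non_gap_blocks is hypothetical. Empty pieces, which re.split produces when the string starts or ends with a gap, are skipped.

```python
import re


def non_gap_blocks(sequence, gap_size=25):
    """Yield (start, stop) for each stretch that is not a run of >= gap_size N's."""
    # The capturing group makes split() return the gaps as well, so the
    # running offset can be kept by summing the length of every piece.
    gap = re.compile('(N{%d,})' % gap_size)
    offset = 0
    for piece in gap.split(sequence):
        if piece and not gap.fullmatch(piece):
            yield offset, offset + len(piece)
        offset += len(piece)


seq = 'NNNNNNACTGACGTNNNACTGACNNNNN'
for size in (5, 3):
    print(size, [seq[a:b] for a, b in non_gap_blocks(seq, size)])
```

The sample string is the one from the question's edit, so the two gap sizes can be compared with the expected matches listed there.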
