Searching strings where substrings occur at specific positions with negative look-ahead - python

I am facing a problem when trying to create a regex that finds strings containing specific combinations of substrings.
For example, I am searching for the substring combination:
ab-ab-cd
1) "xxxabxxxxxxabxxxxcdxxx" -> should be a match
2) "xxxabxxxxabxxxxabxxxxcdxxxx" -> no match
3) "xxxabxxxxxxxxxxcdxxxx" -> no match
To make it even more complicated:
4) "xxxabxxxxxabxxxxcdxxxabxxx" -> should also be a match
My substring combinations could also be like this:
ab-cd
or
ab-ab-ab-cd
or
ab-cd-ab-cd
For all these (and more) examples I am looking for a systematic way to build the corresponding regexes, so that a string only matches when the substrings occur in the right order and with the correct frequency.
I got something like this for the "ab-ab-cd" substring search, but it fails in cases like 4) of my examples.
p = re.compile("(?:(?!ab).)*ab.*?ab(?!.*ab).*cd",re.IGNORECASE)
This one works in cases like 4), but it also matches strings like 2):
p = re.compile("(?:(?!ab).)*ab(?:(?!ab).)*ab((?!ab|cd)*).*cd", re.IGNORECASE)
Could you please point me to my mistake?
Thanks a lot!
EDIT:
Sorry to all that my question was not clear enough. I tried to break my problem down into a simpler case, which might not have been a good idea.
Here comes the detailed explanation of the problem:
I have a list of (protein) sequences and want to assign a specific type to each sequence on the basis of sequence patterns.
Therefore I created a dictionary with type-name as key and feature template (list of sequence features in a specific order) as value, e.g.:
type_a -> [A,A,B,C]
type_b -> [A,B,C]
type_c -> [A,B,A,B]
In another dict I have (simple) regex patterns for each feature, e.g.:
A -> [PHT]AG[QP]LI
B -> RS[TP]EV
C -> ...
D -> ...
Now, for each template (type_a, type_b, ...) I want to systematically build the concatenated regex pattern (i.e. for type_a, build a regex searching for A,A,B,C).
That would then result in another dict with the types as keys and the complete regexes as values.
Now I want to go through each sequence in my list of sequences and test all complete regex templates against each sequence. In the best case, only one complete regex (type) should match the sequence.
Taking the example from above, having the following regex-templates:
cd
ab-cd
ab-ab-cd
ab-ab-ab-cd
ab-cd-ab-cd
ab-ab-cd-ab
"xxxabxxxxxxabxxxxcdxxx"
->this sequence should match the regex for the template "ab-ab-cd" and not any of the others
With the following regex I could perfectly look for ab-ab-cd.
p = re.compile("(?:(?!ab).)*ab.*?ab(?!.*ab).*cd",re.IGNORECASE)
If my tests were correct it would only match sequence 1) from above and not 2) or 3).
However, if I would like to search for ab-ab-cd-ab, the negative look-ahead would not allow me to find the last ab. I found something like the following code to break the negative look-ahead after the second "ab" part. In my understanding, the negative look-ahead should stop at the "cd", so that the last "ab" could be matched again.
p = re.compile("(?:(?!ab).)*ab(?:(?!ab).)*ab((?!ab|cd)*).*cd", re.IGNORECASE)
It solves the problem with the last "ab" from ab-ab-cd-ab.
But somehow it now matches not only two occurrences of "ab" before the "cd" (sequence 1, ab-ab-cd) but also three (or more) occurrences of "ab" before the "cd" (sequence 2, ab-ab-ab-cd), which it should not.
I hope my problem is clearer now. Thanks a lot for all the answers, I will try the code tomorrow when I am back at work. Any further answers are highly appreciated, as are explanations of the regex code (I am pretty new to regex) and suggestions on which re functions (match, final...) to use.
Thanks

You could use re.findall and post-process the result. Effectively, you want to find all instances of ab or cd and check whether your pattern (['ab', 'ab', 'cd']) is at the start of the list. The following:
import re
test1 = "xxxabxxxxxxabxxxxcdxxx"
test2 = "xxxabxxxxabxxxxabxxxxcdxxxx"
test3 = "xxxabxxxxxxxxxxcdxxxx"
test4 = "xxxabxxxxxabxxxxcdxxxabxxx"
for x in (test1, test2, test3, test4):
    # collect every ab/cd in order of occurrence
    matches = re.findall(r'(ab|cd)', x)
    print(matches[:3] == ['ab', 'ab', 'cd'])
prints
True
False
False
True
As required.
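The same idea can be extended to the feature and template dictionaries from the question's EDIT. The following is only a sketch: the function names, the example dictionaries, and the requirement that the whole feature sequence equals the template (rather than just its first items) are assumptions, and the feature regexes are assumed to contain no capturing groups of their own.

```python
import re

def feature_order(sequence, features):
    """Return feature names in the order their regexes occur in `sequence`.

    `features` maps a name (a valid identifier) to a regex that is
    assumed to contain no capturing groups.
    """
    combined = re.compile(
        "|".join("(?P<%s>%s)" % (name, rx) for name, rx in features.items()),
        re.IGNORECASE,
    )
    # lastgroup is the name of the alternative that matched
    return [m.lastgroup for m in combined.finditer(sequence)]

def matching_types(sequence, features, templates):
    """Return the template names whose feature list equals the observed order."""
    order = feature_order(sequence, features)
    return [name for name, template in templates.items() if template == order]

# Hypothetical data mirroring the ab/cd examples from the question
features = {"A": "ab", "B": "cd"}
templates = {
    "ab-cd": ["A", "B"],
    "ab-ab-cd": ["A", "A", "B"],
    "ab-ab-cd-ab": ["A", "A", "B", "A"],
}
print(matching_types("xxxabxxxxxxabxxxxcdxxx", features, templates))
```

Because the comparison is on the full list, a sequence can match at most one template, which is what the EDIT asks for.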

Why do you need the negative look ahead?
Why not use something as simple as that:
.*ab.*ab.*cd
Or if you need it to find a match from the beginning of the line, you can use:
^.*ab.*ab.*cd
Edit:
After your comment I understood what you need. Try this one:
^(?:(?!ab).)*ab(?:(?!ab).)*ab(?:(?!ab).)*cd
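To build such patterns systematically, one option is to generate the "gap" between features from the list of all features, so that no feature can hide inside a gap. In the question's second attempt the gap ((?!ab|cd)*) has no "." to consume a character, so the following .* is free to skip over further ab occurrences. The sketch below is an assumption-laden illustration, not a definitive solution: build_template_regex, its parameters, and the exact flag are hypothetical names.

```python
import re

def build_template_regex(template, all_features, exact=True):
    """Build a regex for an ordered list of feature regexes.

    template:     feature regexes in the required order, e.g. ["ab", "ab", "cd"]
    all_features: every known feature regex; none may occur inside a gap
    exact:        if True, no further feature may follow the last one
    """
    blocked = "|".join("(?:%s)" % f for f in all_features)
    # one character at a time, as long as no feature starts here
    gap = "(?:(?!%s).)*" % blocked
    body = "".join(gap + "(?:%s)" % f for f in template)
    tail = gap + "$" if exact else ""
    return re.compile("^" + body + tail, re.IGNORECASE)

tests = [
    "xxxabxxxxxxabxxxxcdxxx",       # 1)
    "xxxabxxxxabxxxxabxxxxcdxxxx",  # 2)
    "xxxabxxxxxxxxxxcdxxxx",        # 3)
    "xxxabxxxxxabxxxxcdxxxabxxx",   # 4)
]
loose = build_template_regex(["ab", "ab", "cd"], ["ab", "cd"], exact=False)
strict = build_template_regex(["ab", "ab", "cd"], ["ab", "cd"], exact=True)
for s in tests:
    print(s, bool(loose.match(s)), bool(strict.match(s)))
```

With exact=False, string 4) matches the ab-ab-cd template as in the original examples; with exact=True it only matches a template built for ab-ab-cd-ab, which fits the "only one type per sequence" goal from the EDIT.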

Related

Why doesn't replace() change all occurrences?

I have the following code:
dna = "TGCGAGAAGGGGCGATCATGGAGATCTACTATCCTCTCGGGGTATGGTGGGGTTGAGA"
print(dna.count("GAGA"))
dna = dna.replace("GAGA", "AGAG")
print(dna.count("GAGA"))
Replace does not replace all occurrences. Could somebody help me understand why this happens?
It replaces all occurrences. That might lead to new occurrences (look at your replacement string!).
I'd say, logically, all is fine.
You could repeat this replacement while dna.count("GAGA") > 0, but that does not sound like what you should be doing. (I bet you really just want to do one round of replacement to simulate something specific happening. Not a genetics expert at all though.)
It did make all replacements (that's what .replace() does in Python unless specified otherwise), but some of these replacements inadvertently introduced new instances of GAGA. Take the beginning of your string:
TGCGAGAA
There's GAGA at indices 3-6. If you replace that with AGAG, you get
TGCAGAGA
So the last G from that AGAG, together with the subsequent A that was already there before, forms a new GAGA.
Replacements do not occur "until exhausted"; they occur where a substring is matched in your original string.
Consider the following from your string:
>>> a = "TGCGAGAA"
>>> a.replace("GAGA", "AGAG")
'TGCAGAGA'
>>>
The replacement does not happen again, since the original string did not match GAGA in that location.
If you want to do the replacement until no match is found, you can wrap it in a loop:
>>> while a.count("GAGA") > 0: # you probably don't want to use count here if the string is long because of performance considerations
...     a = a.replace("GAGA", "AGAG")
...
>>> a
'TGCAAGAG'
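As a variation on the loop above, a membership test avoids counting every occurrence on each pass. This is a minimal sketch; replace_until_stable and its max_rounds safety limit are hypothetical additions, not part of the original answer.

```python
def replace_until_stable(text, old, new, max_rounds=1000):
    """Repeat str.replace until `old` no longer occurs in `text`.

    max_rounds is an assumed guard against replacement pairs that
    would otherwise never terminate.
    """
    rounds = 0
    while old in text and rounds < max_rounds:
        text = text.replace(old, new)
        rounds += 1
    return text

print(replace_until_stable("TGCGAGAA", "GAGA", "AGAG"))
```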

Get the actual ending when testing with .endswith(tuple)

I found a nice question where one can search for multiple endings of a string using: endswith(tuple)
Check if string ends with one of the strings from a list
My question is: how can I return which value from the tuple actually matched? And if there are multiple matches, how can I choose the best one?
For example:
str= "ERTYHGFYUUHGFREDFYAAAAAAAAAA"
endings = ('AAAAA', 'AAAAAA', 'AAAAAAA', 'AAAAAAAA', 'AAAAAAAAA')
str.endswith(endings) ## this returns True if any value in the tuple matches, but how can I find out which one matches best?
In this case several values from the tuple match. How can I deal with this and return only the best (longest) match, which here should be 'AAAAAAAAA'? I want to remove it from the end of the string (which can be done with a regular expression or so).
I mean one could do this in a for loop, but maybe there is an easier pythonic way?
>>> s = "ERTYHGFYUUHGFREDFYAAAAAAAAAA"
>>> endings = ['AAAAA', 'AAAAAA', 'AAAAAAA', 'AAAAAAAA', 'AAAAAAAAA']
>>> max([i for i in endings if s.endswith(i)],key=len)
'AAAAAAAAA'
import re
str= "ERTYHGFYUUHGFREDFYAAAAAAAAAA"
endings = ['AAAAA', 'AAAAAA', 'AAAAAAA', 'AAAAAAAA', 'AAAAAAAAA']
print(max([i for i in endings if re.findall(i+r"$",str)],key=len))
How about:
len(str) - len(str.rstrip('A'))
str.endswith(tuple) is (currently) implemented as a simple loop over the tuple that re-runs the match for each element; any similarities between the endings are not taken into account.
In the example case, a regular expression should compile into an automaton that essentially runs in linear time:
regexp = '(' + '|'.join(
    re.escape(ending) for ending in sorted(endings, key=len, reverse=True)
) + ')$'
Edit 1: As pointed out correctly by Martijn Pieters, Python's re does not return the longest overall match, but for alternates only matches the first matching subexpression:
https://docs.python.org/2/library/re.html#module-re:
When one pattern completely matches, that branch is accepted. This means that once A matches, B will not be tested further, even if it would produce a longer overall match.
(emphasis mine)
Hence, unfortunately the need for sorting by length.
Note that this makes Python's re different from POSIX regular expressions, which match the longest overall match.
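Putting the regular-expression approach together, a small sketch might look like the following. The helper name longest_ending is hypothetical; it returns the matched ending so that it can be stripped afterwards.

```python
import re

def longest_ending(s, endings):
    """Return the longest element of `endings` that `s` ends with, or None."""
    # longest alternatives first, since re takes the first branch that matches
    ordered = sorted(endings, key=len, reverse=True)
    pattern = "(" + "|".join(re.escape(e) for e in ordered) + ")$"
    m = re.search(pattern, s)
    return m.group(1) if m else None

s = "ERTYHGFYUUHGFREDFY" + "A" * 10
endings = ["AAAAA", "AAAAAA", "AAAAAAA", "AAAAAAAA", "AAAAAAAAA"]
found = longest_ending(s, endings)
print(found)
if found is not None:
    print(s[:-len(found)])  # the string with that ending removed
```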

Finding the recurring pattern

Let's say I have a number with a recurring pattern, i.e. there exists a string of digits that repeat themselves in order to make the number in question. For example, such a number might be 1234123412341234, created by repeating the digits 1234.
What I would like to do, is find the pattern that repeats itself to create the number. Therefore, given 1234123412341234, I would like to compute 1234 (and maybe 4, to indicate that 1234 is repeated 4 times to create 1234123412341234)
I know that I could do this:
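def findPattern(num):
    num = str(num)
    # start at 1 so that the candidate pattern is never empty
    for i in range(1, len(num)):
        patt = num[:i]
        # the pattern length must divide the total length
        if len(num) % len(patt):
            continue
        if patt * (len(num) // len(patt)) == num:
            return patt, len(num) // len(patt)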
However, this seems a little too hacky. I figured I could use itertools.cycle to compare two cycles for equality, which doesn't really pan out:
In [25]: c1 = itertools.cycle(list(range(4)))
In [26]: c2 = itertools.cycle(list(range(4)))
In [27]: c1==c2
Out[27]: False
Is there a better way to compute this? (I'd be open to a regex, but I have no idea how to apply it there, which is why I didn't include it in my attempts)
EDIT:
I don't necessarily know that the number has a repeating pattern, so I have to return None if there isn't one.
Right now, I'm only concerned with detecting numbers/strings that are made up entirely of a repeating pattern. However, later on, I'll likely also be interested in finding patterns that start after a few characters:
magic_function(78961234123412341234)
would return 1234 as the pattern, 4 as the number of times it is repeated, and 4 as the first index in the input where the pattern first presents itself
(.+?)\1+
Try this. Grab the capture. See demo.
import re
p = re.compile(r'(.+?)\1+')
test_str = "1234123412341234"
re.findall(p, test_str)
Add anchors and flag Multiline if you want the regex to fail on 12341234123123, which should return None.
^(.+?)\1+$
See demo.
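In Python, the anchored pattern can be wrapped so that it returns the pattern and the repeat count, or None when the input is not made up entirely of repetitions. This is a sketch under the assumption that re.fullmatch (Python 3.4+) is available; the function names are hypothetical, and the second function addresses the "pattern starts after a few characters" case from the question by anchoring only at the end.

```python
import re

def find_pattern(num):
    """Return (pattern, repeats) if str(num) is one pattern repeated, else None."""
    s = str(num)
    m = re.fullmatch(r"(.+?)\1+", s)
    if m is None:
        return None
    unit = m.group(1)
    return unit, len(s) // len(unit)

def find_pattern_with_offset(num):
    """Return (pattern, repeats, start) for a repetition that runs to the end."""
    s = str(num)
    m = re.search(r"(.+?)\1+$", s)
    if m is None:
        return None
    unit = m.group(1)
    return unit, (len(s) - m.start()) // len(unit), m.start()

print(find_pattern(1234123412341234))
print(find_pattern(12341234123123))
print(find_pattern_with_offset(78961234123412341234))
```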
One way to find a recurring pattern and number of times repeated is to use this pattern:
(.+?)(?=\1+$|$)
with the g (global) option.
It returns the repeated pattern once per repetition, so the number of matches is the number of times the pattern is repeated.
Non-repeating input returns only one match.
Repeated patterns return two or more matches (the number of repetitions).
Demo
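In Python the "g option" corresponds to re.findall, which returns the captured group once per match. A minimal sketch, with count_repeats as a hypothetical helper name:

```python
import re

def count_repeats(s):
    """Return (pattern, number_of_matches) using the look-ahead pattern."""
    units = re.findall(r"(.+?)(?=\1+$|$)", s)
    return units[0], len(units)

print(count_repeats("1234123412341234"))
print(count_repeats("12345"))
```

A count of 1 means the whole string was taken as the pattern, i.e. no repetition was found.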

Sequence of vowels count

This is not a homework question, it is an exam preparation question.
I should define a function syllables(word) that counts the number of syllables in a word in the following way:
• a maximal sequence of vowels is a syllable;
• a final e in a word is not a syllable (or the vowel sequence it is a part of).
I do not have to deal with any special cases, such as a final e in a one-syllable word (e.g., 'be' or 'bee').
>>> syllables('honour')
2
>>> syllables('decode')
2
>>> syllables('oiseau')
2
Should I use regular expression here or just list comprehension ?
I find regular expressions natural for this question. (I think a non-regex answer would take more coding. I use two string methods, 'lower' and 'endswith' to make the answer more clear.)
import re
def syllables(word):
    word = word.lower()
    # a final e is not a syllable
    if word.endswith('e'):
        word = word[:-1]
    # each maximal run of vowels counts once
    count = len(re.findall('[aeiou]+', word))
    return count
for word in ('honour', 'decode', 'decodes', 'oiseau', 'pie'):
    print(word, syllables(word))
Which prints:
honour 2
decode 2
decodes 3
oiseau 2
pie 1
Note that 'decodes' has one more syllable than 'decode' (which is strange, but fits your definition).
Question. How does this help you? Isn't the point of the study question that you work through it yourself? You may get more benefit in the future by posting a failed attempt in your question, so you can learn exactly where you are lacking.
Use regexps - most languages will let you count the number of matches of a regexp in a string.
Then special-case the terminal-e by checking the right-most match group.
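One possible reading of this suggestion, sketched in Python: collect all vowel runs and drop the right-most one when the word ends in e. This assumes the interpretation that the whole final vowel sequence is discarded, so it can differ from the other answers for words such as 'pie'.

```python
import re

def syllables(word):
    """Count vowel runs, ignoring the final run if the word ends in 'e'."""
    word = word.lower()
    groups = re.findall(r"[aeiou]+", word)
    # special-case the terminal e: the last run contains it
    if groups and word.endswith("e"):
        groups.pop()
    return len(groups)

for w in ("honour", "decode", "oiseau"):
    print(w, syllables(w))
```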
I don't think regex is the right solution here.
It seems pretty straightforward to write this treating each string as a list.
Some pointers:
[abc] matches a, b or c.
A + after a regex token allows the token to match once or more
$ matches the end of the string.
(?<=x) matches the current position only if the previous character is an x.
(?!x) matches the current position only if the next character is not an x.
EDIT:
I just saw your comment that since this is not homework, actual code is requested.
Well, then:
[aeiou]+(?!(?<=e)$)
If you don't want to count final vowel sequences that end in e at all (like the u in tongue or the o in toe), then use
[aeiou]+(?=[^aeiou])|[aeiou]*[aiou]$
I'm sure you'll be able to figure out how it works if you read the explanation above.
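To see how the two patterns differ, they can be counted with re.findall. This is only a sketch: the helper name count_matches is hypothetical, and the words are lower-cased first because the patterns list lower-case vowels only.

```python
import re

# final e alone is skipped, but the rest of its vowel run still counts
SKIP_FINAL_E = re.compile(r"[aeiou]+(?!(?<=e)$)")
# a final vowel run ending in e is not counted at all
SKIP_FINAL_E_RUN = re.compile(r"[aeiou]+(?=[^aeiou])|[aeiou]*[aiou]$")

def count_matches(pattern, word):
    return len(pattern.findall(word.lower()))

for w in ("honour", "decode", "oiseau", "tongue"):
    print(w, count_matches(SKIP_FINAL_E, w), count_matches(SKIP_FINAL_E_RUN, w))
```

For 'tongue' the first pattern still counts the u, while the second counts only the o.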
Here's an answer without regular expressions. My real answer (also posted) uses regular expressions. Untested code:
def syllables(word):
    word = word.lower()
    if word.endswith('e'):
        word = word[:-1]
    vowels = 'aeiou'
    in_vowel_group = False
    vowel_groups = 0
    for letter in word:
        if letter in vowels:
            if not in_vowel_group:
                in_vowel_group = True
                vowel_groups += 1
        else:
            in_vowel_group = False
    return vowel_groups
Both ways work. You said yourself that it was for exam preparation. Use whichever is going to be on the exam. If they're both on the exam, use which you need more practice for. Just remember:
Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems. ~Jamie Zawinski
So in my opinion, don't use regex unless you need the practice.
Regular expressions would be way too complex, and a list comprehension probably wouldn't be robust enough. You will probably be able to solve this easily using a grammar lexer like PyParsing. Give it a shot!
Use a regex that matches a,e,i,o, or u, convert the string to a list, then iterate through the list... 1 for first true, 1 for next false, 2 for next true, 2 for next false, etc.
To handle the case where the last letter is 'e' following a consonant (as in ate), just check the last two letters of the word before you start. If they match that pattern truncate the final e and process as normal.
This pattern works for your definition:
(?!e$)([aeiouy]+)
Just count how many times it occurs.

Regex to match 'lol' to 'lolllll' and 'omg' to 'omggg', etc

Hey there, I love regular expressions, but I'm just not good at them at all.
I have a list of some 400 shortened words such as lol, omg, lmao...etc. Whenever someone types one of these shortened words, it is replaced with its English counterpart ([laughter], or something to that effect). Anyway, people are annoying and type these short-hand words with the last letter(s) repeated x number of times.
examples:
omg -> omgggg, lol -> lollll, haha -> hahahaha, lol -> lololol
I was wondering if anyone could hand me the regex (in Python, preferably) to deal with this?
Thanks all.
(It's a Twitter-related project for topic identification if anyone's curious. If someone tweets "Let's go shoot some hoops", how do you know the tweet is about basketball, etc)
FIRST APPROACH -
Well, using regular expression(s) you could do like so -
import re
re.sub('g+', 'g', 'omgggg')
re.sub('l+', 'l', 'lollll')
etc.
Let me point out that using regular expressions is a very fragile and basic approach to this problem. You could easily get strings from users that break the above regular expressions. In other words, this approach requires a lot of maintenance: you have to observe the patterns of mistakes the users make and then create case-specific regular expressions for them.
SECOND APPROACH -
Instead have you considered using difflib module? It's a module with helpers for computing deltas between objects. Of particular importance here for you is SequenceMatcher. To paraphrase from official documentation-
SequenceMatcher is a flexible class for comparing pairs of sequences of any type, so long as the sequence elements are hashable. SequenceMatcher tries to compute a "human-friendly diff" between two sequences. The fundamental notion is the longest contiguous & junk-free matching subsequence.
import difflib as dl
x = dl.SequenceMatcher(lambda x : x == ' ', "omg", "omgggg")
y = dl.SequenceMatcher(lambda x : x == ' ', "omgggg","omg")
avg = (x.ratio()+y.ratio())/2.0
if avg >= 0.6:
    print('Match!')
else:
    print('Sorry!')
According to the documentation, any ratio() over 0.6 is a close match. You might need to tweak the threshold for your data. If you need stricter matching, I found that any value over 0.8 serves well.
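For a whole list of shortened words, difflib.get_close_matches applies the same ratio-based cutoff and returns the best candidates. The sketch below uses a hypothetical helper name and a tiny made-up word list; it is meant as an illustration rather than a tuned solution.

```python
import difflib

def closest_shorthand(token, shorthands, cutoff=0.6):
    """Return the known shorthand most similar to `token`, or None."""
    matches = difflib.get_close_matches(token, shorthands, n=1, cutoff=cutoff)
    return matches[0] if matches else None

shorthands = ["lol", "omg", "lmao"]  # stand-in for the list of ~400 words
print(closest_shorthand("omgggg", shorthands))
print(closest_shorthand("lollll", shorthands))
```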
How about
\b(?=lol)\S*(\S+)(?<=\blol)\1*\b
(replace lol with omg, haha etc.)
This will match lol, lololol, lollll, lollollol etc. but fail lolo, lollllo, lolly and so on.
The rules:
Match the word lol completely.
Then allow any repetition of one or more characters at the end of the word (i. e. l, ol or lol)
So \b(?=zomg)\S*(\S+)(?<=\bzomg)\1*\b will match zomg, zomggg, zomgmgmg, zomgomgomg etc.
In Python, with comments:
result = re.sub(
    r"""(?ix)\b     # assert position at a word boundary
    (?=lol)         # assert that "lol" can be matched here
    \S*             # match any number of characters except whitespace
    (\S+)           # match at least one character (to be repeated later)
    (?<=\blol)      # until we have reached exactly the position after the 1st "lol"
    \1*             # then repeat the preceding character(s) any number of times
    \b              # and ensure that we end up at another word boundary""",
    "lol", subject)
This will also match the "unadorned" version (i. e. lol without any repetition). If you don't want this, use \1+ instead of \1*.
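Since the question mentions a list of some 400 shortened words, the pattern can be generated per word. The following is a sketch only: make_stretch_regex, expand_shorthand, and the "[surprise]" replacement are hypothetical, and the words are assumed to be plain literals (they are passed through re.escape).

```python
import re

def make_stretch_regex(word):
    """Build the pattern above for an arbitrary literal shorthand."""
    w = re.escape(word)
    return re.compile(r"\b(?=%s)\S*(\S+)(?<=\b%s)\1*\b" % (w, w), re.IGNORECASE)

def expand_shorthand(text, replacements):
    """Replace each (possibly stretched) shorthand with its meaning."""
    for word, meaning in replacements.items():
        # a function as replacement avoids backslash processing in `meaning`
        text = make_stretch_regex(word).sub(lambda m, r=meaning: r, text)
    return text

replacements = {"lol": "[laughter]", "omg": "[surprise]"}
print(expand_shorthand("lollll omgggg lololol lolly", replacements))
```

Compiling one pattern per word keeps each look-behind fixed-width, which Python's re module requires.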
