python re.findall() with substring in alternations - python

If I have a substring (or 'subpattern') of another string or pattern in a regex alternation, like so:
r'abcd|bc'
What is the expected behaviour of re.compile(r'abcd|bc').findall('abcd bcd bc ab')?
Trying it out, I get (as expected)
['abcd', 'bc', 'bc']
so I thought re.compile(r'bc|abcd').findall('abcd bcd bc ab') might yield ['bc', 'bc', 'bc'] but instead it again returns
['abcd', 'bc', 'bc']
Can someone explain this? I was under the impression that findall would greedily return matches, but apparently it backtracks and tries to match alternative patterns that would yield longer tokens.

No backtracking takes place at all. Your pattern matches two different types of strings; | means or. Each pattern is tried out at each position.
So when the expression finds abcd at the start of your input, that text matches your pattern just fine: it fits the abcd part of the (bc or abcd) pattern you gave it.
The ordering of the alternatives doesn't come into play here. The engine scans the input from left to right and tries the alternatives at each position; at position 0 only abcd can match, so abcd|bc and bc|abcd give the same result. Ordering only matters when more than one alternative can match at the same position. abcd is not disregarded just because bc might match later on in the string.
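A quick check of this, as a minimal sketch using only the standard re module. The ab|abcd patterns are illustrative additions, not from the question; they show the one situation where the order of alternatives does change the result, namely when both can match at the same position.

```python
import re

text = 'abcd bcd bc ab'

# 'bc' cannot match at position 0, so 'abcd' is found there
# no matter where it appears in the alternation.
first = re.findall(r'abcd|bc', text)
second = re.findall(r'bc|abcd', text)
print(first)   # ['abcd', 'bc', 'bc']
print(second)  # ['abcd', 'bc', 'bc']

# Both alternatives can match at position 0 here, so the one
# listed first wins.
short_first = re.findall(r'ab|abcd', 'abcd')
long_first = re.findall(r'abcd|ab', 'abcd')
print(short_first)  # ['ab']
print(long_first)   # ['abcd']
```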

Related

Regex for matching adjacent words where order doesn't matter

I'm using regex with python and trying to figure out the best way to match a pattern where the order of two words I'm searching for doesn't matter, but they must be adjacent. So for example, I'm searching for either the phrase "fat cat lasagna co" or "cat fat lasagna co", and I have to imagine there's a better way than just r"\b(fat cat|cat fat) lasagna co\b"
I read this question which addressed a similar problem but the words didn't have to be adjacent and couldn't figure out how to apply it to my problem.
There is no strictly better solution, but there's an alternative.
Now, if you have two normal words like "fat" and "cat", then (fat cat|cat fat) is undoubtedly the best solution. But what if you have 5 words? Or if you have more complex patterns than just fat and cat that you don't want to type twice?
Say instead of fat and cat you have 3 regex patterns A, B and C, and instead of the space between fat and cat you have the regex pattern S. In that case, you could use this recipe:
(?:(?:(?!\1)()|\1(?:S))(?:(?!\2)()(?:A)|(?!\3)()(?:B)|(?!\4)()(?:C))){3}
If you don't have an S, this can be simplified to
(?:(?!\1)()(?:A)|(?!\2)()(?:B)|(?!\3)()(?:C)){3}
(Note: (?:X) can be simplified to X if X doesn't contain an alternation |.)
Example
If we set A = fat, B = cat and S = space, we get:
(?:(?:(?!\1)()|\1 )(?:(?!\2)()fat|(?!\3)()cat)){2}
Try it online.
Explanation
In essence, we're using capture groups to "remember" which patterns have already matched. To do so, we use this little pattern here:
(?!\1)()some_pattern
What does this do? It's a regex that matches exactly once. Once it has matched, it won't ever match again. If you try to add a quantifier around that pattern like (?:(?!\1)()some_pattern)* it'll match either once or won't match at all.
The trick there is the usage of a backreference to a capture group before that group has even been defined. Because capture groups are initialized with a "failed to match" state, the negative lookahead (?!\1) will match successfully - but only the first time. Because right afterwards, the capture group () matches and captures the empty string. From this point forward, the negative lookahead (?!\1) will never match again.
With this as a building block, we can create a regex that matches fatcat and catfat while only containing the words fat and cat once:
(?:(?!\1)()fat|(?!\2)()cat){2}
Because of the negative lookaheads, each word can only match at most once. Adding a {2} quantifier at the end guarantees that each of the two words matches exactly once, or the entire match fails.
Now we just need to find a way to match a space between fat and cat. Well, that's just a slight variation of the same pattern:
(?:(?!\1)()|\1 )
This pattern will match the empty string on its first match, and on each subsequent match it'll match a space.
Put it all together, and voilà:
(?:(?:(?!\1)()|\1 )(?:(?!\2)()fat|(?!\3)()cat)){2}
Templates (for the lazy)
2 patterns A and B, with separator S:
(?:(?:(?!\1)()|\1(?:S))(?:(?!\2)()(?:A)|(?!\3)()(?:B))){2}
3 patterns A, B and C, with separator S:
(?:(?:(?!\1)()|\1(?:S))(?:(?!\2)()(?:A)|(?!\3)()(?:B)|(?!\4)()(?:C))){3}
4 patterns A, B, C and D, with separator S:
(?:(?:(?!\1)()|\1(?:S))(?:(?!\2)()(?:A)|(?!\3)()(?:B)|(?!\4)()(?:C)|(?!\5)()(?:D))){4}
2 patterns A and B, without S:
(?:(?!\1)()(?:A)|(?!\2)()(?:B)){2}
3 patterns A, B and C, without S:
(?:(?!\1)()(?:A)|(?!\2)()(?:B)|(?!\3)()(?:C)){3}
4 patterns A, B, C and D, without S:
(?:(?!\1)()(?:A)|(?!\2)()(?:B)|(?!\3)()(?:C)|(?!\4)()(?:D)){4}
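For comparison, the plain alternation from the question can also be generated instead of typed by hand. This is a different technique from the recipe above: a rough sketch that spells out every ordering with itertools.permutations, so the pattern grows factorially with the number of words. The helper name any_order is hypothetical, and the words are treated as literal text. It may also be worth knowing that, as far as I can tell, Python's built-in re module rejects a backreference to a group defined later in the pattern, so the templates above may need a different engine.

```python
import re
from itertools import permutations

def any_order(words, sep=" "):
    """Build a non-capturing alternation of every ordering of `words`.

    `words` are escaped and matched literally; `sep` is inserted as-is
    between them (hypothetical helper, for illustration only).
    """
    orders = (sep.join(re.escape(w) for w in p) for p in permutations(words))
    return "(?:" + "|".join(orders) + ")"

alternation = any_order(["fat", "cat"])
print(alternation)  # (?:fat cat|cat fat)

pattern = re.compile(r"\b" + alternation + r" lasagna co\b")
print(bool(pattern.search("the fat cat lasagna co")))  # True
print(bool(pattern.search("the cat fat lasagna co")))  # True
print(bool(pattern.search("the fat fat lasagna co")))  # False
```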

Ambiguous substring with mismatches

I'm trying to use regular expressions to find a substring in a string of DNA. This substring has ambiguous bases, like ATCGR, where R could be A or G. Also, the script must allow x mismatches. This is my code:
import regex
s = 'ACTGCTGAGTCGT'
regex.findall(r"T[AG]T"+'{e<=1}', s, overlapped=True)
So, with one mismatch I would expect 3 substrings AC**TGC**TGAGTCGT and ACTGC**TGA**GTCGT and ACTGCTGAGT**CGT**. The expected result should be like this:
['TGC', 'TGA', 'AGT', 'CGT']
But the output is
['TGC', 'TGA']
Even using re.findall, the code doesn't recognize the last substring.
On the other hand, if the code is set to allow 2 mismatches with {e<=2}, the output is
['TGC', 'TGA']
Is there another way to get all the substrings?
If I understand correctly, you are looking for all three-letter substrings that match the pattern T[GA]T while allowing at most one error. I think the only kind of error you want is a character substitution, since you never mentioned two-letter results.
To obtain the expected result, change {e<=1} to {s<=1} (or {s<2}) and apply it to the whole pattern by enclosing the pattern in a group (capturing or non-capturing, as you prefer). Otherwise the constraint applies only to the last letter:
regex.findall(r'(T[AG]T){s<=1}', s, overlapped=True)
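To make "at most one substitution" concrete without the third-party regex module, here is a plain-Python sketch of the same idea. It is not how regex implements fuzzy matching: it simply slides a window over the sequence and counts positions that fall outside the allowed bases. The names fuzzy_windows and motif are made up for this example.

```python
def fuzzy_windows(seq, classes, max_subs):
    """Return every overlapping window of `seq` that differs from the motif
    in at most `max_subs` positions (substitutions only).

    `classes` lists the allowed characters at each motif position,
    e.g. ['T', 'AG', 'T'] for T[AG]T.
    """
    width = len(classes)
    hits = []
    for start in range(len(seq) - width + 1):
        window = seq[start:start + width]
        # Count positions whose base is not in the allowed set.
        mismatches = sum(base not in allowed
                         for base, allowed in zip(window, classes))
        if mismatches <= max_subs:
            hits.append(window)
    return hits

s = 'ACTGCTGAGTCGT'
motif = ['T', 'AG', 'T']
result = fuzzy_windows(s, motif, 1)
print(result)  # ['TGC', 'TGA', 'AGT', 'CGT']
```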

Limiting regex length

I'm having an issue in python creating a regex to get each occurrence that matches a regex.
I have this code that I made that I need help with.
import re
strToSearch = "1A851B 1C331 1A3X1 1N111 1A3 and a whole lot of random other words."
print(re.findall(r'\d{1}[A-Z]{1}\d{3}', strToSearch.upper())) #1C331, 1N111
print(re.findall(r'\d{1}[A-Z]{1}\d{1}[X]\d{1}', strToSearch.upper())) #1A3X1
print(re.findall(r'\d{1}[A-Z]{1}\d{3}[A-Z]{1}', strToSearch.upper())) #1A851B
print(re.findall(r'\d{1}[A-Z]{1}\d{1}', strToSearch.upper())) #1A3
>['1A851', '1C331', '1N111']
>['1A3X1']
>['1A851B']
>['1A8', '1C3', '1A3', '1N1', '1A3']
As you can see, the first one returns "1A851", which I don't want. How do I keep it from showing up in the first regex? Note that it may appear in the string like " words words 1A851B?", so I need to keep the punctuation from being grabbed.
Also, how can I combine these into one regex? Essentially, my end goal is to run an if statement in python similar to the pseudo code below.
lstResults = []
strToSearch= " Alot of 1N1X1 people like to eat 3C191 cheese and I'm a 1A831B aka 1A8."
lstResults = re.findall('<REGEX HERE>', strToSearch)
for r in lstResults:
    print(r)
And the desired output would be
1N1X1
3C191
1A831B
1A8
With single regex pattern:
strToSearch= " Alot of 1N1X1 people like to eat 3C191 cheese and I'm a 1A831B aka 1A8."
lstResults = [i[0] for i in re.findall(r'(\d[A-Z]\d{1,3}(X\d|[A-Z])?)', strToSearch)]
print(lstResults)
The output:
['1N1X1', '3C191', '1A831B', '1A8']
You may use word boundaries:
\b\d{1}[A-Z]{1}\d{3}\b
See demo
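A small check of the word-boundary version against the string from the question, as a sketch with the standard re module. The redundant {1} quantifiers are dropped, which does not change what the pattern matches.

```python
import re

text = "1A851B 1C331 1A3X1 1N111 1A3 and a whole lot of random other words."

# Without \b the pattern also finds '1A851' inside '1A851B'.
unanchored = re.findall(r'\d[A-Z]\d{3}', text)
# With \b on both sides, the match must be a whole word.
anchored = re.findall(r'\b\d[A-Z]\d{3}\b', text)

print(unanchored)  # ['1A851', '1C331', '1N111']
print(anchored)    # ['1C331', '1N111']
```

Because punctuation such as ? is not a word character, \b also holds between the last letter or digit and a trailing question mark.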
For the combination, it is unclear by which criterion you consider a word a "random word", but you can use something like this:
[A-Z\d]*\d[A-Z\d]*[A-Z][A-Z\d]*
This matches a word that contains at least one digit followed, somewhere later, by at least one letter. See demo.
Or maybe you can use:
\b\d[A-Z\d]*[A-Z][A-Z\d]*
for a word that starts with a digit and contains at least one non-digit character. See demo.
Or if you want to combine exactly those regexes, use:
\b\d[A-Z]\d(X\d|\d{2}[A-Z]?)?\b
See the final demo.
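Plugged into the loop from the question, the combined pattern might look like the sketch below. One assumption on my part: the optional group is made non-capturing with (?:...), because re.findall returns only the group contents when a pattern contains a capturing group.

```python
import re

# Same pattern as above, with the optional part made non-capturing
# so that findall returns the whole match.
combined = re.compile(r'\b\d[A-Z]\d(?:X\d|\d{2}[A-Z]?)?\b')

strToSearch = " Alot of 1N1X1 people like to eat 3C191 cheese and I'm a 1A831B aka 1A8."
lstResults = combined.findall(strToSearch)
for r in lstResults:
    print(r)
# 1N1X1
# 3C191
# 1A831B
# 1A8
```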
If you want to find "words" where there are both digits and letters mixed, the easiest is to use the word-boundary operator, \b; but notice that you need to use r'' strings / escape the \ in the code (which you would need to do for the \d anyway in future Python versions). To match any sequence of alphanumeric characters separated by word boundary, you could use
r'\b[0-9A-Z]+\b'
However, this wouldn't yet guarantee that there is at least one number and at least one letter. For that we will use a positive zero-width lookahead assertion (?= ), which means that the whole regex matches only if the contained pattern matches at that point. We need 2 of them: one ensures that there is at least one digit, the other that there is at least one letter:
>>> p = r'\b(?=[0-9A-Z]*[0-9])(?=[0-9A-Z]*[A-Z])[0-9A-Z]+\b'
>>> re.findall(p, '1A A1 32 AA 1A123B')
['1A', 'A1', '1A123B']
This will now match everything, including 33333A or AAAAAAAAAA3A, as long as there is at least one digit and one letter. However, if the pattern will always start with a digit and always contain a letter, it becomes slightly easier, for example:
>>> p = r'\b\d+[A-Z][0-9A-Z]*\b'
>>> re.findall(p, '1A A1 32 AA 1A123B')
['1A', '1A123B']
i.e. A1 didn't match because it doesn't start with a digit.

Python regex module vs re module - pattern mismatch

Update: This issue was resolved by the developer in commit be893e9
If you encounter the same problem, update your regex module.
You need version 2017.04.23 or above.
As pointed out in this answer
I need this regular expression:
(?i)\b((\w{1,3})(-|\.{2,10})[\t ]?)+(\2\w{2,})
working with the regex module too...
import re # standard library
import regex # https://pypi.python.org/pypi/regex/
content = '"Erm....yes. T..T...Thank you for that."'
pattern = r"(?i)\b((\w{1,3})(-|\.{2,10})[\t ]?)+(\2\w{2,})"
substitute = r"\2-\4"
print(re.sub(pattern, substitute, content))
print(regex.sub(pattern, substitute, content))
Output:
"Erm....yes. T-Thank you for that."
"-yes. T..T...Thank you for that."
Q: How do I have to write this regex to make the regex module react to it the same way the re module does?
Using the re module is not an option as I require look-behinds with dynamic lengths.
For clarification: it would be nice if the regex would work with both modules but in the end I only need it for regex
It seems that this bug is related to backtracking. It occurs when a capture group is repeated, and the capture group matches but the pattern after the group doesn't.
An example:
>>> regex.sub(r'(?:(\d{1,3})x)+', r'\1', '123x5')
'5'
For reference, the expected output would be:
>>> re.sub(r'(?:(\d{1,3})x)+', r'\1', '123x5')
'1235'
In the first iteration, the capture group (\d{1,3}) consumes the first 3 digits, and x consumes the following "x" character. Then, because of the +, the match is attempted a 2nd time. This time, (\d{1,3}) matches "5", but the x fails to match. However, the capture group's value is now (re)set to the empty string instead of the expected 123.
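The expected behaviour can be inspected directly with the standard re module. This is a small sketch; the second input, with three attempted iterations, is my own addition. After a failed final iteration, the group should still hold the value from the last iteration that succeeded.

```python
import re

pattern = r'(?:(\d{1,3})x)+'

# One successful iteration ('123x'), then a failed attempt on '5'.
m1 = re.match(pattern, '123x5')
print(m1.group(0), m1.group(1))  # 123x 123

# Two successful iterations ('123x', '45x'), then a failed attempt on '6'.
m2 = re.match(pattern, '123x45x6')
print(m2.group(0), m2.group(1))  # 123x45x 45
```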
As a workaround, we can prevent the capture group from matching. In this case, changing it to (\d{2,3}) is enough to bypass the bug (because it no longer matches "5"):
>>> regex.sub(r'(?:(\d{2,3})x)+', r'\1', '123x5')
'1235'
As for the pattern in question, we can use a lookahead assertion; we change (\w{1,3}) to (?=\w{1,3}(?:-|\.\.))(\w{1,3}):
>>> pattern= r"(?i)\b((?=\w{1,3}(?:-|\.\.))(\w{1,3})(-|\.{2,10})[\t ]?)+(\2\w{2,})"
>>> regex.sub(pattern, substitute, content)
'"Erm....yes. T-Thank you for that."'
edit: the bug is now resolved in regex 2017.04.23
just tested in Python 3.6.1 and the original pattern works the same in re and regex
Original workaround - you can use a lazy operator +? (i.e. a different regex that will behave differently than original pattern in edge cases like T...Tha....Thank):
pattern = r"(?i)\b((\w{1,3})(-|\.{2,10})[\t ]?)+?(\2\w{2,})"
The bug in 2017.04.05 was due to backtracking, something like this:
The unsuccessful longer match leaves group \2 empty. Conceptually, this should trigger backtracking to the shorter match, where the nested group is not empty. However, regex seems to "optimize": instead of computing the shorter match from scratch, it reuses some cached values and forgets to undo the update of the nested match groups.
For example, greedy matching of ((\w{1,3})(\.{2,10})){1,3} first attempts 3 repetitions, then backtracks to fewer:
import re
import regex
content = '"Erm....yes. T..T...Thank you for that."'
base_pattern_template = r'((\w{1,3})(\.{2,10})){%s}'
test_cases = ['1,3', '3', '2', '1']
for tc in test_cases:
    pattern = base_pattern_template % tc
    expected = re.findall(pattern, content)
    actual = regex.findall(pattern, content)
    # TODO: convert to test case, e.g. in pytest
    # assert str(expected) == str(actual), '{}\nexpected: {}\nactual: {}'.format(tc, expected, actual)
    print('expected:', tc, expected)
    print('actual: ', tc, actual)
output:
expected: 1,3 [('Erm....', 'Erm', '....'), ('T...', 'T', '...')]
actual: 1,3 [('Erm....', '', '....'), ('T...', '', '...')]
expected: 3 []
actual: 3 []
expected: 2 [('T...', 'T', '...')]
actual: 2 [('T...', 'T', '...')]
expected: 1 [('Erm....', 'Erm', '....'), ('T..', 'T', '..'), ('T...', 'T', '...')]
actual: 1 [('Erm....', 'Erm', '....'), ('T..', 'T', '..'), ('T...', 'T', '...')]

Separating RegEx pattern matches that have the same potential starting characters

I would like to have a RegEx that matches several of the same character in a row, within a range of possible characters but does not return those pattern matches as one pattern. How can this be accomplished?
For clarification:
I want a pattern that starts with [a-c] and non-greedily returns any number of the same character, but not the other characters in the range. In the sequence 'aafaabbybcccc' it would find patterns for:
('aa', 'aa', 'bb', 'b', 'cccc')
but would exclude the following:
('f', 'aabb', 'y', 'bcccc')
I don't want to use multiple RegEx pattern searches because the order in which I find the patterns will determine the output of another function. This question is for the purposes of self study (python), not homework. (I'm also under 15 rep but will come back and upvote when I can.)
Good question. Use a regex like:
(?P<L>[a-c])(?P=L)+
This is flexible: you're not limited to a-c, you can replace it with a-z if you like. It first captures any character within a-c as the named group L, then checks whether that same character occurs again one or more times. Because the pattern contains a group, re.findall() would return only the captured character, so run re.finditer() with this regex and take group(0) of each match.
You can use the backreferences \1 - \9 to refer to the text previously matched by the 1st to 9th group.
/([a-c])(\1+)/
[a-c]: Matches one of the characters a, b or c.
\1+ : Matches one or more further occurrences of the previously matched character.
Perl:
perl -e '@m = "ccccbbb" =~ /([a-c])(\1+)/; print $m[0], $m[1]'
cccc
Python:
>>> import re
>>> [m.group(0) for m in re.finditer(r"([a-c])\1+", 'aafaabbybcccc')]
['aa', 'aa', 'bb', 'cccc']
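Note that the desired output in the question also contains the single 'b', which \1+ skips because it requires at least one repetition. A possible variant, offered as a sketch, swaps + for * so that a run of length one counts too:

```python
import re

text = 'aafaabbybcccc'

# \1* allows zero repetitions, so an isolated 'b' is also a match.
runs = [m.group(0) for m in re.finditer(r'([a-c])\1*', text)]
print(runs)  # ['aa', 'aa', 'bb', 'b', 'cccc']

# findall returns only the captured group, not the whole run.
firsts = re.findall(r'([a-c])\1*', text)
print(firsts)  # ['a', 'a', 'b', 'b', 'c']
```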
