Python regex module vs re module - pattern mismatch - python

Update: This issue was resolved by the developer in commit be893e9
If you encounter the same problem, update your regex module.
You need version 2017.04.23 or above.
As pointed out in this answer
I need this regular expression:
(?i)\b((\w{1,3})(-|\.{2,10})[\t ]?)+(\2\w{2,})
working with the regex module too...
import re # standard library
import regex # https://pypi.python.org/pypi/regex/
content = '"Erm....yes. T..T...Thank you for that."'
pattern = r"(?i)\b((\w{1,3})(-|\.{2,10})[\t ]?)+(\2\w{2,})"
substitute = r"\2-\4"
print(re.sub(pattern, substitute, content))
print(regex.sub(pattern, substitute, content))
Output:
"Erm....yes. T-Thank you for that."
"-yes. T..T...Thank you for that."
Q: How do I have to write this regex to make the regex module react to it the same way the re module does?
Using the re module is not an option as I require look-behinds with dynamic lengths.
For clarification: it would be nice if the regex would work with both modules but in the end I only need it for regex

It seems that this bug is related to backtracking. It occurs when a capture group is repeated, and the capture group matches but the pattern after the group doesn't.
An example:
>>> regex.sub(r'(?:(\d{1,3})x)+', r'\1', '123x5')
'5'
For reference, the expected output would be:
>>> re.sub(r'(?:(\d{1,3})x)+', r'\1', '123x5')
'1235'
In the first iteration, the capture group (\d{1,3}) consumes the first 3 digits, and x consumes the following "x" character. Then, because of the +, the match is attempted a 2nd time. This time, (\d{1,3}) matches "5", but the x fails to match. However, the capture group's value is now (re)set to the empty string instead of the expected 123.
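The expected semantics described above can be checked with the standard library alone. This is a minimal sketch that uses only re as the reference behaviour; it does not reproduce the regex bug itself.

```python
import re

# Standard-library behaviour, used here as the reference.
pattern = r'(?:(\d{1,3})x)+'
match = re.search(pattern, '123x5')

# The second iteration of the + loop fails (there is no "x" after "5"),
# so the engine falls back to the state after the first iteration.
whole = match.group(0)
captured = match.group(1)
print(whole)     # 123x
print(captured)  # 123

# The substitution therefore keeps the digits of the first iteration.
result = re.sub(pattern, r'\1', '123x5')
print(result)    # 1235
```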
As a workaround, we can prevent the capture group from matching. In this case, changing it to (\d{2,3}) is enough to bypass the bug (because it no longer matches "5"):
>>> regex.sub(r'(?:(\d{2,3})x)+', r'\1', '123x5')
'1235'
As for the pattern in question, we can use a lookahead assertion; we change (\w{1,3}) to (?=\w{1,3}(?:-|\.\.))(\w{1,3}):
>>> pattern= r"(?i)\b((?=\w{1,3}(?:-|\.\.))(\w{1,3})(-|\.{2,10})[\t ]?)+(\2\w{2,})"
>>> regex.sub(pattern, substitute, content)
'"Erm....yes. T-Thank you for that."'

edit: the bug is now resolved in regex 2017.04.23
just tested in Python 3.6.1 and the original pattern works the same in re and regex
Original workaround - you can use a lazy operator +? (i.e. a different regex that will behave differently than original pattern in edge cases like T...Tha....Thank):
pattern = r"(?i)\b((\w{1,3})(-|\.{2,10})[\t ]?)+?(\2\w{2,})"
The bug in 2017.04.05 was due to backtracking, something like this:
The unsuccessful longer match creates an empty \2 group. Conceptually, this should trigger backtracking to the shorter match, where the nested group is not empty. However, regex seems to "optimize": it does not compute the shorter match from scratch but reuses cached values, forgetting to undo the update of the nested match groups.
For example, greedy matching of ((\w{1,3})(\.{2,10})){1,3} will first attempt 3 repetitions, then backtrack to fewer:
import re
import regex
content = '"Erm....yes. T..T...Thank you for that."'
base_pattern_template = r'((\w{1,3})(\.{2,10})){%s}'
test_cases = ['1,3', '3', '2', '1']
for tc in test_cases:
    pattern = base_pattern_template % tc
    expected = re.findall(pattern, content)
    actual = regex.findall(pattern, content)
    # TODO: convert to test case, e.g. in pytest
    # assert str(expected) == str(actual), '{}\nexpected: {}\nactual: {}'.format(tc, expected, actual)
    print('expected:', tc, expected)
    print('actual: ', tc, actual)
output:
expected: 1,3 [('Erm....', 'Erm', '....'), ('T...', 'T', '...')]
actual: 1,3 [('Erm....', '', '....'), ('T...', '', '...')]
expected: 3 []
actual: 3 []
expected: 2 [('T...', 'T', '...')]
actual: 2 [('T...', 'T', '...')]
expected: 1 [('Erm....', 'Erm', '....'), ('T..', 'T', '..'), ('T...', 'T', '...')]
actual: 1 [('Erm....', 'Erm', '....'), ('T..', 'T', '..'), ('T...', 'T', '...')]

Related

Combine multiple regex expressions in Python

For clarity, I was looking for a way to compile multiple regexes at once.
For simplicity, let's say that every expression should be in the format (.*) something (.*).
There will be no more than 60 expressions to be tested.
As seen here, I finally wrote the following.
import re
re1 = r'(.*) is not (.*)'
re2 = r'(.*) is the same size as (.*)'
re3 = r'(.*) is a word, not (.*)'
re4 = r'(.*) is world know, not (.*)'
sentences = ["foo2 is a word, not bar2"]
for sentence in sentences:
    match = re.compile("(%s|%s|%s|%s)" % (re1, re2, re3, re4)).search(sentence)
    if match is not None:
        print(match.group(1))
        print(match.group(2))
        print(match.group(3))
Since the regexes are separated by a pipe, I thought matching would stop automatically once one rule had matched.
Executing the code, I get:
foo2 is a word, not bar2
None
None
But by swapping re3 and re1 in re.compile, i.e. match = re.compile("(%s|%s|%s|%s)" % (re3, re2, re1, re4)).search(sentence), I get:
foo2 is a word, not bar2
foo2
bar2
As far as I can understand, the first rule is executed but not the others.
Can someone please point me in the right direction on this case?
Kind regards,
There are various issues with your example:
You are using a capturing group, so it gets index 1, the index you'd expect to reference the first group of the inner regexes. Use a non-capturing group (?:%s|%s|%s|%s) instead.
Group indexes increase even inside |. So with (?:(a)|(b)|(c)) you'd get:
>>> re.match(r'(?:(a)|(b)|(c))', 'a').groups()
('a', None, None)
>>> re.match(r'(?:(a)|(b)|(c))', 'b').groups()
(None, 'b', None)
>>> re.match(r'(?:(a)|(b)|(c))', 'c').groups()
(None, None, 'c')
It seems like you'd expect to have only one group 1 that returns either a, b or c depending on the branch... no, indexes are assigned in order from left to right without taking into account the grammar of the regex.
With the built-in re module you'll have to live with the fact that numbering is not the same between different branches of the regex. The regex module can do what you want if you use named groups, since it allows the same name to be reused across branches:
>>> import regex
>>> regex.match(r'(?:(?P<x>a)|(?P<x>b)|(?P<x>c))', 'a').groups()
('a',)
>>> regex.match(r'(?:(?P<x>a)|(?P<x>b)|(?P<x>c))', 'b').groups()
('b',)
>>> regex.match(r'(?:(?P<x>a)|(?P<x>b)|(?P<x>c))', 'c').groups()
('c',)
(Trying to use that regex with re will give an error for duplicated groups).
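To make the numbering concrete with the built-in re module, here is a minimal sketch that applies a non-capturing wrapper to the four patterns from the question. The names alternatives, combined and captured are introduced only for this illustration.

```python
import re

# The four patterns from the question, joined inside a non-capturing group.
alternatives = [
    r'(.*) is not (.*)',
    r'(.*) is the same size as (.*)',
    r'(.*) is a word, not (.*)',
    r'(.*) is world know, not (.*)',
]
combined = re.compile('(?:%s)' % '|'.join(alternatives))

match = combined.search('foo2 is a word, not bar2')

# Groups are numbered left to right across all branches, so the third
# alternative fills groups 5 and 6 and every other group stays None.
all_groups = match.groups()
print(all_groups)

# Keep only the groups of the branch that actually matched.
captured = [g for g in all_groups if g is not None]
print(captured)  # ['foo2', 'bar2']
```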
Giacomo answered the question.
However, I also suggest: 1) put the "compile" before the loop, 2) gather the non-empty groups in a list, 3) think about using (.+) instead of (.*) in re1, re2, etc.
rex = re.compile("%s|%s|%s|%s" % (re1, re2, re3, re4))
for sentence in sentences:
    match = rex.search(sentence)
    if match:
        # keep only the groups that took part in the match
        groups = [g for g in match.groups() if g is not None]
        print(groups[0], groups[1])

Python seems to incorrectly identify case-sensitive string using regex

I'm checking for a case-sensitive string pattern using Python 2.7 and it seems to return an incorrect match. I've run the following tests:
>>> import re
>>> rex_str = "^((BOA_[0-9]{4}-[0-9]{1,3})(?:CO)?.(?i)pdf$)"
>>> not re.match(rex_str, 'BOA_1988-148.pdf')
False
>>> not re.match(rex_str, 'BOA_1988-148.PDF')
False
>>> not re.match(rex_str, 'BOA1988-148.pdf')
True
>>> not re.match(rex_str, 'boa_1988-148.pdf')
False
The first three tests are correct, but the final test, 'boa_1988-148.pdf' should return True because the pattern is supposed to treat the first 3 characters (BOA) as case-sensitive.
I checked the expression with an online tester (https://regex101.com/) and the pattern was correct, flagging the final as a no match because the 'boa' was lower case. Am I missing something or do you have to explicitly declare a group as case-sensitive using a case-sensitive mode like (?c)?
Flags do not apply to portions of a regex. You told the regex engine to match case insensitively:
(?i)
From the syntax documentation:
(?aiLmsux)
(One or more letters from the set 'a', 'i', 'L', 'm', 's', 'u', 'x'.) The group matches the empty string; the letters set the corresponding flags: re.A (ASCII-only matching), re.I (ignore case), re.L (locale dependent), re.M (multi-line), re.S (dot matches all), and re.X (verbose), for the entire regular expression. (The flags are described in Module Contents.) This is useful if you wish to include the flags as part of the regular expression, instead of passing a flag argument to the re.compile() function. Flags should be used first in the expression string.
Note the phrase "for the entire regular expression": the flag applies to the whole pattern, not just a substring. If you need to match just pdf or PDF, use that in your pattern directly:
r"^((BOA_[0-9]{4}-[0-9]{1,3})(?:CO)?.(?:pdf|PDF)$)"
This matches either .pdf or .PDF. If you need to match any mix of uppercase and lowercase, use:
r"^((BOA_[0-9]{4}-[0-9]{1,3})(?:CO)?.[pP][dD][fF]$)"
(?i) doesn’t only apply after itself or to the group that contains it. From the Python 2 re documentation:
(?iLmsux)
(One or more letters from the set 'i', 'L', 'm', 's', 'u', 'x'.) The group matches the empty string; the letters set the corresponding flags […] for the entire regular expression.
One option is to do it manually:
r"^(BOA_[0-9]{4}-[0-9]{1,3})(?:CO)?\.[Pp][Dd][Ff]\Z"
Another is to use a separate case-sensitive check:
rex_str = r"(?i)^(BOA_[0-9]{4}-[0-9]{1,3})(?:CO)?\.pdf\Z"
match = re.match(rex_str, s) if s.startswith("BOA_") else None
or separate case-insensitive one:
rex_str = r"^(BOA_[0-9]{4}-[0-9]{1,3})(?:CO)?\..{3}\Z"
match = re.match(rex_str, s) if s.lower().endswith(".pdf") else None
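As a side note for readers who are not on Python 2.7: from Python 3.6 onwards, re also accepts a scoped inline flag of the form (?i:...), which limits case-insensitivity to the enclosed part. The sketch below assumes Python 3.6 or later and will not work on the 2.7 interpreter from the question; the names rex, names and results are illustrative only.

```python
import re

# (?i:pdf) makes only the extension case-insensitive (Python 3.6+).
rex = re.compile(r"^(BOA_[0-9]{4}-[0-9]{1,3})(?:CO)?\.(?i:pdf)\Z")

names = [
    'BOA_1988-148.pdf',
    'BOA_1988-148.PDF',
    'BOA1988-148.pdf',
    'boa_1988-148.pdf',
]

# Map each file name to whether it matches the pattern.
results = {name: bool(rex.match(name)) for name in names}
for name, matched in results.items():
    print(name, matched)
```

With this pattern the lowercase prefix boa_ is rejected while both .pdf and .PDF are accepted.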

python regex finditer

I have a question about re. I tried to find the answer in the re documentation, but I think I am too much of a newbie for this.
I have a string like this:
string = "id=186 s_id=0 channel_name=[cspacer0]---BlaBla--- number=2"
I want to retrieve every value after '=', so I used
re.finditer("=[\w]*", string)
My result was as follows:
186
0
empty space <-- there should be a [cspacer0]--BlaBla--
2
How should my pattern look to get the channel_name as well?
The \w token only matches word characters. To allow other characters such as brackets and hyphens, I would use \S (any non-whitespace character) instead. Also, instead of finditer you can use findall for this task:
>>> import re
>>> s = 'id=186 s_id=0 channel_name=[cspacer0]---BlaBla--- number=2'
>>> re.findall(r'=(\S+)', s)
['186', '0', '[cspacer0]---BlaBla---', '2']
EDIT
The original string looks like this. I want to get everything after an =, skipping =ok and idx=0:
>>> s = 'error idx=0 msg=ok id=186 s_id=0 channel_name=[cspacer0]---BlaBla--- number=2'
>>> re.findall(r'(?<!idx)=(?!ok)(\S+)', s)
['186', '0', '[cspacer0]---BlaBla---', '2']
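The difference between \w and \S on the first sample string can be seen side by side in the following sketch. It keeps finditer, as in the question, and reads the captured value from group(1); the variable names are illustrative only.

```python
import re

s = 'id=186 s_id=0 channel_name=[cspacer0]---BlaBla--- number=2'

# \w* stops at "[" because it is not a word character, so the third
# value comes back as a bare "=".
word_only = re.findall(r'=\w*', s)
print(word_only)  # ['=186', '=0', '=', '=2']

# \S+ consumes everything up to the next whitespace; the parentheses
# capture the value without the leading "=".
values = [m.group(1) for m in re.finditer(r'=(\S+)', s)]
print(values)  # ['186', '0', '[cspacer0]---BlaBla---', '2']
```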

Best way to split a string for the last space

I'm wondering what the best way is to split a space-separated string at the last space that is not inside [, {, ( or ". For instance I could have:
a = 'a b c d e f "something else here"'
b = 'another parse option {(["gets confusing"])}'
For a it should parse into ['a', 'b', 'c', 'd', 'e', 'f'], ["something else here"]
and b should parse into ['another', 'parse', 'option'], ['{(["gets confusing"])}']
Right now I have this:
# Python 2 code (sys.maxint and the print statement)
import sys
def getMin(aList):
    min = sys.maxint
    for item in aList:
        if item < min and item != -1:
            min = item
    return min
myList = []
myList.append(b.find('['))
myList.append(b.find('{'))
myList.append(b.find('('))
myList.append(b.find('"'))
myMin = getMin(myList)
print b[:myMin], b[myMin:]
I'm sure there are better ways to do this and I'm open to all suggestions.
Matching vs. Splitting
There is an easy solution. The key is to understand that matching and splitting are two sides of the same coin. When you say "match all", that means "split on what I don't want to match", and vice-versa. Instead of splitting, we're going to match, and you'll end up with the same result.
The Reduced, Simple Version
Let's start with the simplest version of the regex so you don't get scared by something long:
{[^{}]*}|\S+
This matches all the items of your second string, the same as if we were splitting.
The left side of the | alternation matches complete sets of {braces}.
The right side of the | matches any run of characters that are not whitespace characters.
It's that simple!
The Full Regex
We also need to match "full quotes", (full parentheses) and [full brackets]. No problem: we just add them to the alternation. Just for clarity, I'm throwing them together in a non-capture group (?: so that the \S+ pops out on its own, but there is no need.
(?:{[^{}]*}|"[^"]*"|\([^()]*\)|\[[^][]*\])|\S+
Notes and Potential Improvements
We could replace the quoted string regex by one that accepts escaped quotes
We could replace the brace, brackets and parentheses expressions by recursive expressions to allow nested constructions, but you'd have to use Matthew Barnett's (awesome) regex module instead of re
The technique is related to a simple and beautiful trick to Match (or replace) a pattern except when...
Let me know if you have questions!
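For reference, the full regex from the answer above could be applied in Python roughly as follows. This is a sketch, not the answerer's own code: the braces and brackets are escaped for safety, which should not change what the pattern matches, and split_last is a hypothetical helper that separates the last token from the rest, as the question asks.

```python
import re

# Same alternation as the full regex above, with braces and brackets escaped.
TOKEN = re.compile(r'(?:\{[^{}]*\}|"[^"]*"|\([^()]*\)|\[[^\[\]]*\])|\S+')

def split_last(text):
    """Return (leading tokens, [last token]) by matching instead of splitting."""
    tokens = TOKEN.findall(text)
    return tokens[:-1], tokens[-1:]

a = 'a b c d e f "something else here"'
b = 'another parse option {(["gets confusing"])}'

result_a = split_last(a)
result_b = split_last(b)
print(result_a)
print(result_b)
```

Because the pattern contains no capturing groups, findall returns the full text of each match, so quoted or bracketed sections come back as single tokens.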
You can use regular expressions:
import re
def parse(text):
    m = re.search(r'(.*) ([[({"].*)', text)
    if not m:
        return None
    return m.group(1).split(), [m.group(2)]
The first part (.*) catches everything up to the section in quotes or parentheses, and the second part catches anything starting at a character in ([{".
If you need something more robust, the version below uses a more complicated regular expression, but it makes sure that each opening token is matched by its closing token, and it makes the last expression optional.
def parse(text):
    m = re.search(r'(.*?)(?: ("[^"]*"|\([^)]*\)|\[[^]]*\]|\{[^}]*\}))?$', text)
    if not m:
        return None
    return m.group(1).split(), [m.group(2)]
Perhaps this link will help:
Split a string by spaces -- preserving quoted substrings -- in Python
It explains how to preserve quoted substrings when splitting a string by spaces.

Separating RegEx pattern matches that have the same potential starting characters

I would like to have a RegEx that matches several of the same character in a row, within a range of possible characters but does not return those pattern matches as one pattern. How can this be accomplished?
For clarification:
I want a pattern that starts with [a-c] and non-greedily returns any number of the same character, but not the other characters in the range. In the sequence 'aafaabbybcccc' it would find patterns for:
('aa', 'aa', 'bb', 'b', 'cccc')
but would exclude the following:
('f', 'aabb', 'y', 'bcccc')
I don't want to use multiple RegEx pattern searches because the order in which I find the patterns will determine the output of another function. This question is for the purposes of self study (python), not homework. (I'm also under 15 rep but will come back and upvote when I can.)
Good question. Use a regex like:
(?P<L>[a-c])(?P=L)+
This is more robust - you're not limited to a-c, you can replace it with a-z if you like. It first defines any character within a-c as L, then checks whether that character occurs again one or more times. Note that re.findall() returns only the captured group when the pattern contains one, so to get the whole run use re.finditer() and take group(0) of each match.
You can use the backreferences \1 to \9 to refer to the previously matched 1st to 9th groups.
/([a-c])(\1+)/
[a-c]: Matches one of the characters a, b or c.
\1+ : Matches one or more subsequent occurrences of the previously matched character.
Perl:
perl -e '@m = "ccccbbb" =~ /([a-c])(\1+)/; print $m[0], $m[1]'
cccc
Python:
>>> import re
>>> [m.group(0) for m in re.finditer(r"([a-c])\1+", 'aafaabbybcccc')]
['aa', 'aa', 'bb', 'cccc']
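Both patterns above require at least one repetition, so the single 'b' that the question lists as an expected match is skipped. A small variation, sketched here as a suggestion rather than as part of either answer, replaces + with * so that a run of length one also matches:

```python
import re

text = 'aafaabbybcccc'

# \1* allows zero further repetitions, so a lone character in a-c
# counts as a run of length one.
runs = [m.group(0) for m in re.finditer(r'([a-c])\1*', text)]
print(runs)  # ['aa', 'aa', 'bb', 'b', 'cccc']
```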
