Python regex non-capturing not working within capturing groups - python

I was working on a regular expression in Python to extract groups. I am correctly extracting the 3 groups I want (symbol, num, atom). However, the 'symbol' group should not have the '[' or ']' as I am using 'non-capturing' notation (?:..) per Python's docs (https://docs.python.org/3/library/re.html).
Am I understanding non-capturing wrong, or is this a bug?
Thanks!
import re
result = re.match(r'(?P<symbol>(?:\[)(?P<num>[0-9]{0,3})(?P<atom>C)(?:\]))', '[12C]')
print(result.groups())
# ('[12C]', '12', 'C')
# expected: ('12C', '12', 'C')

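A non-capturing group only avoids creating a group of its own; whatever it matches is still part of any capturing group that encloses it, which is why 'symbol' includes the brackets. Move the checks for \[ and \] outside of the capture for P<symbol>. Once they are outside the capture, you no longer need the non-capturing group notation either, e.g.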
>>> import re
>>> result = re.match(r'\[(?P<symbol>(?P<num>[0-9]{0,3})(?P<atom>C))]', '[12C]')
>>> result.groups()
('12C', '12', 'C')
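A minimal side-by-side check of the two patterns from this thread, showing that text matched by (?:...) still counts toward an enclosing capturing group. The variable names (text, inside, outside) are illustrative only, and the closing bracket is escaped here purely for readability.

```python
import re

text = '[12C]'

# Brackets matched inside the named group: (?:...) adds no group of its own,
# but the text it matches still belongs to the enclosing 'symbol' group.
inside = re.match(
    r'(?P<symbol>(?:\[)(?P<num>[0-9]{0,3})(?P<atom>C)(?:\]))', text)

# Brackets matched outside the named group: consumed, but not captured.
outside = re.match(
    r'\[(?P<symbol>(?P<num>[0-9]{0,3})(?P<atom>C))\]', text)

print(inside.group('symbol'))   # [12C]
print(outside.group('symbol'))  # 12C
print(outside.groupdict())      # {'symbol': '12C', 'num': '12', 'atom': 'C'}
```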

Related

Regular expression to extract number with hyphen

The text is like "1-2years. 3years. 10years."
I want to get the result [(1,2),(3),(10)].
I use Python.
I first tried r"([0-9]?)[-]?([0-9])years". It works well except for the case of 10. I also tried r"([0-9]?)[-]?([0-9]|10)years" but the result is still [(1,2),(3),(1,0)].
Your attempt r"([0-9]?)[-]?([0-9])years" doesn't work for the case of 10 because you ask it to match one (or zero) digit per group.
You also don't need the hyphen in brackets.
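To see the single-digit limitation concretely, here is a small sketch that runs the original attempt and compares \d with \d+ on "10". The variable names are illustrative.

```python
import re

text = "1-2years. 3years. 10years."

# Original attempt: each group can hold at most one digit,
# so "10" is split across the two groups.
attempt = re.findall(r"([0-9]?)[-]?([0-9])years", text)
print(attempt)  # [('1', '2'), ('', '3'), ('1', '0')]

# Without a quantifier a digit class matches exactly one character;
# with + it matches a whole run of digits.
one_at_a_time = re.findall(r"\d", "10")
whole_number = re.findall(r"\d+", "10")
print(one_at_a_time)  # ['1', '0']
print(whole_number)   # ['10']
```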
This should work: Regex101
(\d+)(?:-(\d+))?years
Explanation:
(\d+): Capturing group for one or more digits
(?: ): Non-capturing group
- : hyphen
(\d+): Capturing group for one or more digits
(?: )?: Make the previous non-capturing group optional
In python:
import re
result = re.findall(r"(\d+)(?:-(\d+))?years", "1-2years. 3years. 10years.")
# Gives: [('1', '2'), ('3', ''), ('10', '')]
Each tuple in the list contains two elements: The number on the left side of the hyphen, and the number on the right side of the hyphen. Removing the blank elements is quite easy: you loop over each item in result, then you loop over each match in this item and only select it (and convert it to int) if it is not empty.
final_result = [tuple(int(match) for match in item if match) for item in result]
# gives: [(1, 2), (3,), (10,)]
This should work:
import re
st = '1-2years. 3years. 10years.'
result = [tuple(e for e in tup if e)
          for tup in re.findall(r'(?:(\d+)-(\d+)|(\d+))years', st)]
# [('1', '2'), ('3',), ('10',)]
The regex will look for either one number, or two separated by a hyphen, immediately prior to the word years. If we give this to re.findall(), it will give us the output [('1', '2', ''), ('', '', '3'), ('', '', '10')], so we also use a quick list comprehension to filter out the empty strings.
Alternately we could use r'(\d+)(?:-(\d+))?years' to basically the same effect, which is closer to what you've already tried.
You can use this pattern: (?:(\d+)-)?(\d+)years
See Regex Demo
Code:
import re
pattern = r"(?:(\d+)-)?(\d+)years"
text = "1-2years. 3years. 10years."
print([tuple(int(z) for z in x if z) for x in re.findall(pattern, text)])
Output:
[(1, 2), (3,), (10,)]
You only match a single digit as the character class [0-9] is not repeated.
Another option is to match the first digits with an optional part for - and digits.
Then you can split the matches on -
\b(\d+(?:-\d+)?)years\.
\b A word boundary
( Capture group 1 (which will be returned by re.findall)
\d+(?:-\d+)? Match 1+ digits and optionally match - and again 1+ digits
) Close group 1
years\. Match literally with the escaped .
See a regex demo and a Python demo.
Example
import re
pattern = r"\b(\d+(?:-\d+)?)years\."
s = "1-2years. 3years. 10years."
res = [tuple(v.split('-')) for v in re.findall(pattern, s)]
print(res)
Output
[('1', '2'), ('3',), ('10',)]
Or if a list of lists is also ok instead of tuples
res = [v.split('-') for v in re.findall(pattern, s)]
Output
[['1', '2'], ['3'], ['10']]

Multiple capturing groups within non-capturing group using Python regexes

I have the following code using multiple capturing groups within a non-capturing group:
>>> regex = r'(?:a ([ac]+)|b ([bd]+))'
>>> re.match(regex, 'a caca').groups()
('caca', None)
>>> re.match(regex, 'b bdbd').groups()
(None, 'bdbd')
How can I change the code so it outputs either ('caca') or ('bdbd')?
You are close.
To always get the capture as group 1, you can use a lookahead to do the matching and then a separate capturing group to do the capturing:
(?:a (?=[ac]+)|b (?=[bd]+))(.*)
Demo
Or in Python3:
>>> import re
>>> regex = r'(?:a (?=[ac]+)|b (?=[bd]+))(.*)'
>>> re.match(regex, 'a caca').groups()
('caca',)
>>> re.match(regex, 'b bdbd').groups()
('bdbd',)
Another option is to get the matches using a lookbehind without a capturing group:
(?<=a )[ac]+|(?<=b )[bd]+
Regex demo
For example
import re
pattern = r'(?<=a )[ac]+|(?<=b )[bd]+'
print (re.search(pattern, 'a caca').group())
print (re.search(pattern, 'b bdbd').group())
Output
caca
bdbd
You may use a branch reset group with PyPi regex module:
Alternatives inside a branch reset group share the same capturing groups. The syntax is (?|regex) where (?| opens the group and regex is any regular expression. If you don’t use any alternation or capturing groups inside the branch reset group, then its special function doesn’t come into play. It then acts as a non-capturing group.
The regex will look like
(?|a ([ac]+)|b ([bd]+))
See the regex demo. See the Python 3 demo:
import regex
rx = r'(?|a ([ac]+)|b ([bd]+))'
print (regex.search(rx, 'a caca').groups()) # => ('caca',)
print (regex.search(rx, 'b bdbd').groups()) # => ('bdbd',)
See the problem the other way around:
((?:a [ac]+)|(?:b [bd]+))
^^          ^^
||          |other exact match
||          OR
|not capturing for exact match
capture everything
An easier look: https://regex101.com/r/e3bK2B/1/
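If the original pattern has to stay as it is, one further option with the standard re module is to pick whichever group took part in the match. This is a sketch, not one of the approaches proposed in this thread; the helper name captured is hypothetical. It relies on Match.lastindex, which holds the index of the last capturing group that matched.

```python
import re

pattern = re.compile(r'(?:a ([ac]+)|b ([bd]+))')

def captured(text):
    """Return the text of the group belonging to the alternative that matched."""
    m = pattern.match(text)
    if m is None:
        return None
    # Only one of the two groups participates, so lastindex is 1 or 2.
    return m.group(m.lastindex)

print(captured('a caca'))  # caca
print(captured('b bdbd'))  # bdbd
```

Because the groups here are not nested, lastindex always points at the one group that participated in the match.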

Python regex module vs re module - pattern mismatch

Update: This issue was resolved by the developer in commit be893e9
If you encounter the same problem, update your regex module.
You need version 2017.04.23 or above.
As pointed out in this answer
I need this regular expression:
(?i)\b((\w{1,3})(-|\.{2,10})[\t ]?)+(\2\w{2,})
working with the regex module too...
import re # standard library
import regex # https://pypi.python.org/pypi/regex/
content = '"Erm....yes. T..T...Thank you for that."'
pattern = r"(?i)\b((\w{1,3})(-|\.{2,10})[\t ]?)+(\2\w{2,})"
substitute = r"\2-\4"
print(re.sub(pattern, substitute, content))
print(regex.sub(pattern, substitute, content))
Output:
"Erm....yes. T-Thank you for that."
"-yes. T..T...Thank you for that."
Q: How do I have to write this regex to make the regex module react to it the same way the re module does?
Using the re module is not an option as I require look-behinds with dynamic lengths.
For clarification: it would be nice if the regex would work with both modules but in the end I only need it for regex
It seems that this bug is related to backtracking. It occurs when a capture group is repeated, and the capture group matches but the pattern after the group doesn't.
An example:
>>> regex.sub(r'(?:(\d{1,3})x)+', r'\1', '123x5')
'5'
For reference, the expected output would be:
>>> re.sub(r'(?:(\d{1,3})x)+', r'\1', '123x5')
'1235'
In the first iteration, the capture group (\d{1,3}) consumes the first 3 digits, and x consumes the following "x" character. Then, because of the +, the match is attempted a 2nd time. This time, (\d{1,3}) matches "5", but the x fails to match. However, the capture group's value is now (re)set to the empty string instead of the expected 123.
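The expected behaviour described in the walk-through can be checked with the standard re module alone. This is a small sketch; what the regex module returns depends on the installed version, so it is left out here.

```python
import re

pattern = r'(?:(\d{1,3})x)+'

m = re.match(pattern, '123x5')

# The second iteration ("5" followed by no "x") fails and is backed out,
# so the overall match and the group keep the values from the first iteration.
print(m.group(0))  # 123x
print(m.group(1))  # 123

replaced = re.sub(pattern, r'\1', '123x5')
print(replaced)    # 1235
```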
As a workaround, we can prevent the capture group from matching. In this case, changing it to (\d{2,3}) is enough to bypass the bug (because it no longer matches "5"):
>>> regex.sub(r'(?:(\d{2,3})x)+', r'\1', '123x5')
'1235'
As for the pattern in question, we can use a lookahead assertion; we change (\w{1,3}) to (?=\w{1,3}(?:-|\.\.))(\w{1,3}):
>>> pattern= r"(?i)\b((?=\w{1,3}(?:-|\.\.))(\w{1,3})(-|\.{2,10})[\t ]?)+(\2\w{2,})"
>>> regex.sub(pattern, substitute, content)
'"Erm....yes. T-Thank you for that."'
edit: the bug is now resolved in regex 2017.04.23
just tested in Python 3.6.1 and the original pattern works the same in re and regex
Original workaround - you can use a lazy operator +? (i.e. a different regex that will behave differently than original pattern in edge cases like T...Tha....Thank):
pattern = r"(?i)\b((\w{1,3})(-|\.{2,10})[\t ]?)+?(\2\w{2,})"
The bug in 2017.04.05 was due to backtracking, something like this:
The unsuccessful longer match leaves an empty \2 group. Conceptually, that failure should trigger backtracking to the shorter match, in which the nested group is not empty. Instead, regex seems to "optimize": rather than computing the shorter match from scratch, it reuses some cached values and forgets to undo the update to the nested match groups.
For example, greedy matching of ((\w{1,3})(\.{2,10})){1,3} first attempts 3 repetitions, then backtracks to fewer:
import re
import regex
content = '"Erm....yes. T..T...Thank you for that."'
base_pattern_template = r'((\w{1,3})(\.{2,10})){%s}'
test_cases = ['1,3', '3', '2', '1']
for tc in test_cases:
    pattern = base_pattern_template % tc
    expected = re.findall(pattern, content)
    actual = regex.findall(pattern, content)
    # TODO: convert to test case, e.g. in pytest
    # assert str(expected) == str(actual), '{}\nexpected: {}\nactual: {}'.format(tc, expected, actual)
    print('expected:', tc, expected)
    print('actual: ', tc, actual)
output:
expected: 1,3 [('Erm....', 'Erm', '....'), ('T...', 'T', '...')]
actual: 1,3 [('Erm....', '', '....'), ('T...', '', '...')]
expected: 3 []
actual: 3 []
expected: 2 [('T...', 'T', '...')]
actual: 2 [('T...', 'T', '...')]
expected: 1 [('Erm....', 'Erm', '....'), ('T..', 'T', '..'), ('T...', 'T', '...')]
actual: 1 [('Erm....', 'Erm', '....'), ('T..', 'T', '..'), ('T...', 'T', '...')]

Python regex matching only returns one digit when using regex.findall, but matches correctly with regex.search [duplicate]

I have a file that includes a bunch of strings like "size=XXX;". I am trying Python's re module for the first time and am a bit mystified by the following behavior: if I use a pipe for 'or' in a regular expression, I only see that bit of the match returned. E.g.:
>>> myfile = open('testfile.txt', 'r').read()
>>> re.findall('size=50;', myfile)
['size=50;', 'size=50;', 'size=50;', 'size=50;']
>>> re.findall('size=51;', myfile)
['size=51;', 'size=51;', 'size=51;']
>>> re.findall('size=(50|51);', myfile)
['51', '51', '51', '50', '50', '50', '50']
>>> re.findall(r'size=(50|51);', myfile)
['51', '51', '51', '50', '50', '50', '50']
The "size=" part of the match is gone (Yet it is certainly used in the search, otherwise there would be more results). What am I doing wrong?
The problem you have is that if the regex that re.findall tries to match captures groups (i.e. the portions of the regex that are enclosed in parentheses), then it is the groups that are returned, rather than the matched string.
One way to solve this issue is to use non-capturing groups (prefixed with ?:).
>>> import re
>>> s = 'size=50;size=51;'
>>> re.findall('size=(?:50|51);', s)
['size=50;', 'size=51;']
If the regex that re.findall tries to match does not capture anything, it returns the whole of the matched string.
Although using character classes might be the simplest option in this particular case, non-capturing groups provide a more general solution.
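A short sketch of how the shape of the findall result follows the number of capturing groups in the pattern, using the sample string from this answer. The two-group pattern is added here only for illustration.

```python
import re

s = 'size=50;size=51;'

no_groups = re.findall(r'size=(?:50|51);', s)   # whole matches
one_group = re.findall(r'size=(50|51);', s)     # text of the single group
two_groups = re.findall(r'(size)=(50|51);', s)  # one tuple per match
char_class = re.findall(r'size=5[01];', s)      # no groups: whole matches

print(no_groups)   # ['size=50;', 'size=51;']
print(one_group)   # ['50', '51']
print(two_groups)  # [('size', '50'), ('size', '51')]
print(char_class)  # ['size=50;', 'size=51;']
```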
When a regular expression contains parentheses, they capture their contents to groups, changing the behaviour of findall() to only return those groups. Here's the relevant section from the docs:
(...)
Matches whatever regular expression is inside the parentheses,
and indicates the start and end of a group; the contents of a group
can be retrieved after a match has been performed, and can be matched
later in the string with the \number special sequence, described
below. To match the literals '(' or ')', use \( or \), or enclose them
inside a character class: [(] [)].
To avoid this behaviour, you can use a non-capturing group:
>>> re.findall(r'size=(?:50|51);',myfile)
['size=51;', 'size=51;', 'size=51;', 'size=50;', 'size=50;', 'size=50;', 'size=50;']
Again, from the docs:
(?:...)
A non-capturing version of regular parentheses. Matches whatever regular expression is inside the parentheses, but the substring matched by the group cannot be retrieved after performing a match or referenced later in the pattern.
In some cases, the non-capturing group is not appropriate, for example with regex which detects repeated words (example from python docs)
r'(\b\w+)\s+\1'
In this situation to get whole match one can use
[groups[0] for groups in re.findall(r'((\b\w+)\s+\2)', text)]
Note that \1 has changed to \2.
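A runnable version of the repeated-word case. The sample sentence is made up for illustration; the snippet above leaves text undefined.

```python
import re

text = 'Paris in the the spring'

# One capturing group: findall returns only the repeated word.
words = re.findall(r'(\b\w+)\s+\1', text)
print(words)  # ['the']

# An extra outer group captures the whole match; the word becomes group 2,
# so the backreference changes from \1 to \2.
whole = [groups[0] for groups in re.findall(r'((\b\w+)\s+\2)', text)]
print(whole)  # ['the the']
```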
As others mentioned, the "problem" with re.findall is that it returns a list of strings or a list of tuples of strings, depending on the capture groups used. If you don't want to change the capture groups you're using (that is, you'd rather not switch to character classes [] or non-capturing groups (?:)), you can use finditer instead of findall. This gives an iterator of Match objects instead of just strings, so you can fetch the full match even when using capture groups:
import re
s = 'size=50;size=51;'
for m in re.finditer('size=(50|51);', s):
    print(m.group())
Will give:
size=50;
size=51;
And if you need a list, similar to findall, you can use a list-comprehension:
>>> [m.group() for m in re.finditer('size=(50|51);', s)]
['size=50;', 'size=51;']
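Since each Match object carries both the whole match and its groups, one pass with finditer can collect everything at once. A minimal sketch; the rows name is illustrative.

```python
import re

s = 'size=50;size=51;'

# group(0) is the whole match, group(1) the captured number,
# span() the (start, end) position of the match in s.
rows = [(m.group(0), m.group(1), m.span())
        for m in re.finditer(r'size=(50|51);', s)]

for row in rows:
    print(row)
# ('size=50;', '50', (0, 8))
# ('size=51;', '51', (8, 16))
```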
'size=(50|51);' means you are looking for size=50 or size=51 but only capturing the 50 or 51 part (note the parentheses), therefore it does not return the size=.
If you want the size= returned, you can do:
re.findall('(size=50|size=51);',myfile)
I think what you want is using [] instead of (). [] indicates a set of characters while () indicates a group match. Try something like this:
re.findall('size=5[01];', myfile)

python regex finditer

I have a question about re. I tried to look for the answer in the re documentation, but I think I am too much of a newbie for this.
I have string like this
string = "id=186 s_id=0 channel_name=[cspacer0]---BlaBla--- number=2"
I want to retrieve everything after each '=', so I used
re.finditer(r"=[\w]*", string)
My result was as follows:
186
0
empty space <-- there should be a [cspacer0]--BlaBla--
2
How should my pattern look to get the channel_name as well?
The \w token only matches word characters, to allow metacharacters I would use \S (any non-white space character) instead. Also, instead of finditer you can use findall for this task:
>>> import re
>>> s = 'id=186 s_id=0 channel_name=[cspacer0]---BlaBla--- number=2'
>>> re.findall(r'=(\S+)', s)
['186', '0', '[cspacer0]---BlaBla---', '2']
EDIT
The original string looks like this. I want to get everything that follows an =, but skip =ok and idx=0:
>>> s = 'error idx=0 msg=ok id=186 s_id=0 channel_name=[cspacer0]---BlaBla--- number=2'
>>> re.findall(r'(?<!idx)=(?!ok)(\S+)', s)
['186', '0', '[cspacer0]---BlaBla---', '2']
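An alternative sketch, assuming the field names are also wanted: capture each key=value pair and filter in plain Python instead of with lookarounds. This is not what the answer above does. It drops the key idx and the exact value ok, whereas the lookaround pattern skips any key ending in idx and any value starting with ok.

```python
import re

s = ('error idx=0 msg=ok id=186 s_id=0 '
     'channel_name=[cspacer0]---BlaBla--- number=2')

# Two groups, so findall returns (key, value) tuples.
pairs = re.findall(r'(\w+)=(\S+)', s)

# Filter the unwanted entries in ordinary code rather than in the regex.
fields = {key: value for key, value in pairs
          if key != 'idx' and value != 'ok'}

print(fields)
# {'id': '186', 's_id': '0', 'channel_name': '[cspacer0]---BlaBla---', 'number': '2'}
```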
