As a learning exercise, I like to compare two regular expressions doing the same thing.
In this case, I want to extract the sequences of numbers from strings like this:
CC_nums=[
'2341-3421-5632-0981-009',
'521-9085-3948-2543-89-9'
]
And the correct result after capturing in a regex will be
['2341', '3421', '5632', '0981', '009']
['4521', '9085', '3948', '2543', '89', '9']
I understand that this works in python:
for number in CC_nums:
print re.findall('(\d+)',number)
But, to understand this more deeply, I tried the following:
for number in CC_nums:
print re.findall('\s*(?:(\d+)\D+)+(\d+)\s*', number)
..which returns:
[('0981', '009')]
[('89', '9')]
Two questions:
Firstly, why does the second one return a tuple instead of a list?
Secondly, why does the second one not match the other sets of digits, like 2341, 3241, etc.?
I know that findall will return non-overlapping capturing groups, so I tried to avoid this. The capturing groups are non-overlapping because of the (\d+), so I thought that this would not be an issue.
See Python re.findall behaves weird to see why the re.findall returns a tuple list. Basically, it returns a tuple because there are more than one capturing group inside your pattern.
The regex returns the last digits-digits substring because the + quantifier is applied to the (?:(\d+)\D+) group, and thus, each time this subpattern captures a substring, the previous one is replaced with the new one in the group buffer.
Related
In [29]: re.findall("([abc])+","abc")
Out[29]: ['c']
In [30]: re.findall("[abc]+","abc")
Out[30]: ['abc']
Confused by the grouped one. How does it make difference?
There are two things that need to be explained here: the behavior of quantified groups, and the design of the findall() method.
In your first example, [abc] matches the a, which is captured in group #1. Then it matches b and captures it in group #1, overwriting the a. Then again with the c, and that's what's left in group #1 at the end of the match.
But it does match the whole string. If you were using search() or finditer(), you would be able to look at the MatchObject and see that group(0) contains abc and group(1) contains c. But findall() returns strings, not MatchObjects. If there are no groups, it returns a list of the overall matches; if there are groups, the list contains all the captures, but not the overall match.
So both of your regexes are matching the whole string, but the first one is also capturing and discarding each character individually (which is kinda pointless). It's only the unexpected behavior of findall() that makes it look like you're getting different results.
In the first example you have a repeated captured group which only capture the last iteration. Here c.
([abc])+
Debuggex Demo
In the second example you are matching a single character in the list one and unlimited times.
[abc]+
Debuggex Demo
Here's the way I would think about it. ([abc])+ is attempting to repeat a captured group. When you use "+" after the capture group, it doesn't mean you are going to get two captured groups. What ends up happening, at least for Python's regex and most implementations, is that the "+" forces iteration until the capture group only contains the last match.
If you want to capture a repeated expression, you need to reverse the ordering of "(...)" and "+", e.g. instead of ([abc])+ use ([abc]+).
input "abc"
[abc]
match a single character => "a"
[abc]+
+ Between one and unlimited times, as many times as possible => "abc"
([abc])
Capturing group ([abc]) => "a"
([abc])+
+ A repeated capturing group will only capture the last iteration => "c"
Grouping just gives different preference.
([abc])+ => Find one from selection. Can match one or more. It finds one and all conditions are met as the + means 1 or more. This breaks up the regex into two stages.
While the ungrouped one is treated as a whole.
I couldn't understand why this regular expression,
re.findall(r"(do|re|mi)+","mimi rere midore"),
generates this result,
['mi', 're', 're'].
My expected result is ['mimi', 'rere', 'midore']...
However, when I use this regular expression,
re.findall(r"(?:do|re|mi)+","mimi rere midore"),
it generates the result as expected.
Can you tell me the different between two regular expressions?
Thank you.
The difference is in the capturing group. With a capturing froup, findall() returns only what was captured. Without a capturing group, the whole match is returned.
In your first example, the group only captures the two characters, repeated or not. In the second example, the whole match includes any repetitions.
The re.findall() documentation is quite clear on the difference:
Return all non-overlapping matches of pattern in string, as a list of strings. […] If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group.
If your (do|re|mi)+ pattern is part of a larger pattern and you want findall() to only return the full repeated set of characters, use a non-capturing group for the two-letter options with a capturing group around the whole:
r'Some example text: ((?:do|re|me)+)'
This question already has answers here:
re.findall behaves weird
(3 answers)
Closed 3 years ago.
I am trying to match two string variables, and would like to catch multiple matches. re.findall seems like the obvious choice for this task, but it doesn't appear to be working the way I would expect it to. The following is an example:
a = 'a(pp)?le'
b = 'ale, apple, apol'
match = re.findall(a,b)
match
['','pp']
However, when I apply the same variables to re.search, it recognizes the embedded regular expression within the string, and picks up the first match:
match = re.search(a,b)
match.group()
'ale'
Can anyone explain why re.findall is not working in this instance? I would expect the following:
match = re.findall(a,b)
match
['ale','apple']
Thanks!
You are using a capturing group, wheras you want a non-capturing group:
a = 'a(?:pp)?le'
As stated in the docs (...) in a regex will create a "capturing group" and the result of re.findall will be only what is inside the parens.
If you just want to group things (e.g. for the purpose of applying a ?) use (?:...)which creates a non-capturing group. The result of re.findall in this case will be the whole regex (or the largest capturing group).
The key part of the re.findall docs are:
If one or more groups are present in the pattern, return a list of groups
this explains the difference in results between re.findall and re.search.
Let me quote the Python docs about re.findall():
Return all non-overlapping matches of pattern in string, as a list of strings. The string is scanned left-to-right, and matches are returned in the order found. If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group. Empty matches are included in the result unless they touch the beginning of another match.
And this is what your expression a(pp)?le does. It matches the content in your group, i.e. pp. You can always disable this special behavior of a group by taking a non-capturing group (?:...).
This is in continuation of my earlier question where I wanted to compile many patterns as one regular expression and after the discussion I did something like this
REGEX_PATTERN = '|'.join(self.error_patterns.keys())
where self.error_patterns.keys() would be pattern like
: error:
: warning:
cc1plus:
undefine reference to
Failure:
and do
error_found = re.findall(REGEX_PATTERN,line)
Now when I run it against some file which might contain one or more than one patterns, how do I know what pattern exactly matched? I mean I can anyway see the line manually and find it out, but want to know if after doing re.findall I can find out the pattern like re.group() or something
Thank you
re.findall will return all portions of text that matched your expression.
If that is not sufficient to identify the pattern unambiguously, you can still do a second re.match/re.find against the individual subpatterns you have join()ed. At the time of applying your initial regular expression, the matcher is no longer aware that you have composed it of several subpatterns however, hence it cannot provide more detailed information which subpattern has matched.
Another, equally unwieldy option would be to enclose each pattern in a group (...). Then, re.findall will return an array of None values (for all the non-matching patterns), with the exception of the one group that matched the pattern.
MatchObject has a lastindex property that contains the index of the last capturing group that participated in the match. If you enclose each pattern in its own capturing group, like this:
(: error:)|(: warning:)
...lastindex will tell you which one matched (assuming you know the order in which the patterns appear in the regex). You'll probably want to use finditer() (which creates an iterator of MatchObjects) instead of findall() (which returns a list of strings). Also, make sure there are no other capturing groups in the regex, to throw your indexing out of sync.
While using regex to help solve a problem in the Python Challenge, I came across some behaviour that confused me.
from here:
(...) Matches whatever regular expression is inside the parentheses.
and
'+' Causes the resulting RE to match 1 or more repetitions of the preceding RE.
So this makes sense:
>>>import re
>>>re.findall(r"(\d+)", "1111112")
['1111112']
But this doesn't:
>>> re.findall(r"(\d)+", "1111112")
['2']
I realise that findall returns only groups when groups are present in the regex, but why is only the '2' returned? What happends to all the 1's in the match?
Because you only have one capturing group, but it's "run" repeatedly, the new matches are repeatedly entered into the "storage space" for that group. In other words, the 1s were lost when they were "overwritten" by subsequent 1s and eventually the 2.
You are repeating the group itself by appending '+' after ')', I do not know the implementation details but it matches 7 times, and returns only the last match.
In the first one, you are matching 7 digits, and making it a group.