In [29]: re.findall("([abc])+","abc")
Out[29]: ['c']
In [30]: re.findall("[abc]+","abc")
Out[30]: ['abc']
Confused by the grouped one. How does it make a difference?
There are two things that need to be explained here: the behavior of quantified groups, and the design of the findall() method.
In your first example, [abc] matches the a, which is captured in group #1. Then it matches b and captures it in group #1, overwriting the a. Then again with the c, and that's what's left in group #1 at the end of the match.
But it does match the whole string. If you were using search() or finditer(), you would be able to look at the MatchObject and see that group(0) contains abc and group(1) contains c. But findall() returns strings, not MatchObjects. If there are no groups, it returns a list of the overall matches; if there are groups, the list contains all the captures, but not the overall match.
So both of your regexes are matching the whole string, but the first one is also capturing and discarding each character individually (which is kinda pointless). It's only the unexpected behavior of findall() that makes it look like you're getting different results.
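To make that concrete, here is a small sketch comparing search() and findall() on the same input. The variable names are mine, chosen only for illustration; the calls are all from the standard re module.

```python
import re

# search() gives a match object, so both the overall match and the group are visible.
grouped = re.search(r"([abc])+", "abc")
whole_match = grouped.group(0)    # the overall match
last_capture = grouped.group(1)   # only the last iteration of group 1 survives

# findall() returns group captures when the pattern has a group,
# and overall matches when it has none.
with_group = re.findall(r"([abc])+", "abc")
without_group = re.findall(r"[abc]+", "abc")

# A non-capturing group keeps the grouping but makes findall() return whole matches.
non_capturing = re.findall(r"(?:[abc])+", "abc")

print(whole_match, last_capture)                 # abc c
print(with_group, without_group, non_capturing)  # ['c'] ['abc'] ['abc']
```

If you need the grouping but want the whole match back from findall(), the non-capturing form (?:...) is usually the simplest fix.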
In the first example you have a repeated capturing group, which only captures the last iteration. Here, that is c.
([abc])+
In the second example you are matching a single character from the list, between one and unlimited times.
[abc]+
Here's the way I would think about it. ([abc])+ is attempting to repeat a captured group. When you use "+" after the capture group, it doesn't mean you are going to get several captured groups. What ends up happening, at least for Python's regex and most implementations, is that the "+" repeats the group and each iteration overwrites the previous capture, so the group ends up containing only the last match.
If you want to capture a repeated expression, you need to reverse the ordering of "(...)" and "+", e.g. instead of ([abc])+ use ([abc]+).
input "abc"
[abc]
match a single character => "a"
[abc]+
+ Between one and unlimited times, as many times as possible => "abc"
([abc])
Capturing group ([abc]) => "a"
([abc])+
+ A repeated capturing group will only capture the last iteration => "c"
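The four cases above, plus the "group around the quantifier" variant, can be checked with re.search(). This is just a sketch; the variable names are made up for the example.

```python
import re

text = "abc"

single = re.search(r"[abc]", text).group(0)              # one character
repeated = re.search(r"[abc]+", text).group(0)           # as many as possible
captured = re.search(r"([abc])", text).group(1)          # group holds one character
repeated_group = re.search(r"([abc])+", text).group(1)   # group holds the last iteration
group_around = re.search(r"([abc]+)", text).group(1)     # group holds the whole run

print(single, repeated, captured, repeated_group, group_around)
# a abc a c abc
```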
Grouping changes what the quantifier applies to and what gets captured.
([abc])+ => Match one character from the set and capture it. The + then repeats that whole group one or more times, so the capture is overwritten on every repetition.
The ungrouped pattern is treated as a whole: nothing is captured, and the full match is returned.
I have the following string,
"ATAG:AAAABTAG:BBBBCTAG:CCCCCTAG:DDDDEEEECTAG.FFFFCTAG GGGGCTAGHHHH"
In the above string, using a regex, I want to find all occurrences of 'TAG' except the first 3.
I used this regex, '(TAG.*?){4}', but it only finds the 4th occurrence ('TAG:'), not the others ('TAG.','TAG ','TAGH').
If you want a capture group with all the remaining matches, you have to consume the first ones first:
(TAG.*?){3}(TAG.*?)*
This matches the first 3 occurrences with the first capture group and the rest with the second.
If you don't want the first matches to be in a capture group, you can flag it as non-capturing group:
(?:TAG.*?){3}(TAG.*?)*
Judging by your example, I think the regex inside the capture group is not correct yet. If this doesn't already give you the right idea of how to do this, please give us an example of the matches you want to see. I'll edit my answer then.
EDIT:
I get the feeling that you want to capture the 4th and following occurrences in their own capture groups while still ignoring the first 3 occurrences.
I can't properly explain why, but I think that's not possible for the following reasons:
Ignoring the first 3 occurrences in their own (non-)capturing group forces you to abandon the 'g' modifier for finding all occurrences (because that would just do 'ignore 3 TAGs, find 1' in a loop).
It is not possible to capture multiple groups with just one capture group. Trying to do that always captures the last occurrence. It is possible to capture not just the last but all occurrences together in a single capture group, but it seems like you want them in separate groups.
So, how to solve this?
I'd come up with a proper regex for one TAG and repeat that using the find all or g modifier. In python you then can simply take all findings skipping the first 3:
import re
text = "ATAG:AAAABTAG:BBBBCTAG:CCCCCTAG:DDDDEEEECTAG.FFFFCTAG GGGGCTAGHHHH"
pattern = r"(?:TAG((?:(?!TAG).)+))"
# Skip the first 3 findings and keep the rest
findings = re.findall(pattern, text)[3:]
If you want to ignore the first character after TAG, just add a . behind TAG:
pattern = r"(?:TAG.((?:(?!TAG).)+))"
Explanation of the regex:
- I use ?: to make some capturing groups non-capturing groups. I only want to deal with one capture group.
- To get rid of the non-greedy modifier and be a little more specific about what we actually want, I've introduced the negative lookahead after the TAG occurrence.
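As a rough sketch of the "match one TAG at a time, then skip the first 3" idea, here is a version using finditer(), which also keeps the match objects around in case positions are needed. The names matches, later, and payloads are only illustrative.

```python
import re

text = "ATAG:AAAABTAG:BBBBCTAG:CCCCCTAG:DDDDEEEECTAG.FFFFCTAG GGGGCTAGHHHH"

# One TAG followed by everything up to (but not including) the next TAG.
pattern = re.compile(r"TAG((?:(?!TAG).)+)")

matches = list(pattern.finditer(text))
later = matches[3:]                       # drop the first 3 occurrences
payloads = [m.group(1) for m in later]    # text following each remaining TAG

print(len(matches))   # 7
print(payloads)       # [':DDDDEEEEC', '.FFFFC', ' GGGGC', 'HHHH']
```

Slicing after matching keeps the regex simple; the "skip three" rule lives in ordinary Python instead of inside the pattern.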
As a learning exercise, I like to compare two regular expressions doing the same thing.
In this case, I want to extract the sequences of numbers from strings like this:
CC_nums=[
'2341-3421-5632-0981-009',
'521-9085-3948-2543-89-9'
]
And the correct result after capturing in a regex will be
['2341', '3421', '5632', '0981', '009']
['521', '9085', '3948', '2543', '89', '9']
I understand that this works in python:
for number in CC_nums:
    print(re.findall(r'(\d+)', number))
But, to understand this more deeply, I tried the following:
for number in CC_nums:
    print(re.findall(r'\s*(?:(\d+)\D+)+(\d+)\s*', number))
..which returns:
[('0981', '009')]
[('89', '9')]
Two questions:
Firstly, why does the second one return a tuple instead of a list?
Secondly, why does the second one not match the other sets of digits, like 2341, 3421, etc.?
I know that findall will return non-overlapping capturing groups, so I tried to avoid this. The capturing groups are non-overlapping because of the (\d+), so I thought that this would not be an issue.
See Python re.findall behaves weird to see why re.findall returns a list of tuples. Basically, it returns tuples because there is more than one capturing group inside your pattern.
The regex returns the last digits-digits substring because the + quantifier is applied to the (?:(\d+)\D+) group, and thus, each time this subpattern captures a substring, the previous one is replaced with the new one in the group buffer.
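Both effects can be seen side by side in a short sketch (variable names are mine): two capturing groups make findall() return tuples, and the repeated group keeps only its last iteration.

```python
import re

number = "2341-3421-5632-0981-009"

# Two capturing groups: findall() yields one tuple per overall match.
# The first group is repeated, so only its last iteration is kept.
two_groups = re.findall(r"\s*(?:(\d+)\D+)+(\d+)\s*", number)

# One capturing group, not repeated: findall() restarts after each match
# and yields one string per run of digits.
one_group = re.findall(r"(\d+)", number)

print(two_groups)   # [('0981', '009')]
print(one_group)    # ['2341', '3421', '5632', '0981', '009']
```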
Recently I have been playing around with regular expressions in Python and encountered a problem with r"(\w{3})+" and with its non-greedy equivalent r"(\w{3})+?".
Let's take a look at the following example:
S = "abcdefghi" # string used for all the cases below
1. Greedy search
m = re.search(r"(\w{3})+", S)
print(m.group()) # abcdefghi
print(m.groups()) # ('ghi',)
m.group is exactly as I expected - just whole match.
Regarding m.groups please confirm: ghi is printed because it has overwritten previous captured groups of def and abc, am I right? If yes, then can I capture all overwritten groups as well? Of course, for this particular string I could just write m = re.search(r"(\w{3})(\w{3})(\w{3})", S) but I am looking for a more general way to capture groups not knowing how many of them I can expect, thus metacharacter +.
2. Non-greedy search
m = re.search(r"(\w{3})+?", S)
print(m.group()) # abc
print(m.groups()) # ('abc',)
Now we are not greedy so only abc was found - exactly as I expected.
Regarding m.groups(), the engine stopped when it found abc so I understand that this is the only found group here.
3. Greedy findall
print(re.findall(r"(\w{3})+", S)) # ['ghi']
Now I am truly perplexed. I always thought that re.findall finds all substrings where the RE matches and returns them as a list. Here, we have only one match, abcdefghi (according to common sense and bullet 1), so I expected a list containing this one item. Why was only ghi returned?
4. Non-greedy findall
print(re.findall(r"(\w{3})+?", S)) # ['abc', 'def', 'ghi']
Here, in turn, I expected to have abc only, but maybe having bullet 3 explained will help me understand this as well. Maybe this is even the answer for my question from bullet 1 (about capturing overwritten groups), but I would really like to understand what is happening here.
You should think about the greedy/non-greedy behavior in the context of your regex (r"(\w{3})+") versus a regex where the repeating pattern was not at the end: (r"(\w{3})+\w")
It's important because the default behavior of regex matching is:
The entire regex must match
Starting as early in the target string as possible
Matching as much of the target string as possible (greedy)
If you have a "repeat" operator - either * or + - in your regex, then the default behavior is for that to match as much as it can, so long as the rest of the regex is satisfied.
When the repeat operator is at the end of the pattern, there is no rest of the regex, so the behavior becomes match as much as it can.
If you have a repeat operator with a non-greedy qualifier - *? or +? - in your regex, then the behavior is to match as little as it can, so long as the rest of the regex is satisfied.
When the repeat-nongreedy operator is at the end of the pattern, there is no rest of the regex, so the behavior becomes match as little as it can.
All that is in just one match. You are mixing re.findall() in as well, which will then repeat the match, if possible.
The first time you run re.findall, with r"(\w{3})+", you are using a greedy match at the end of the pattern. Thus, it will try to apply that last block as many times as possible in a single match. You have the case where, like the call to re.search, the single match consumes the entire string. As part of consuming the entire string, the \w{3} block gets repeated, and the group buffer is overwritten several times.
The second time you run re.findall, with r"(\w{3})+?" you are using a non-greedy match at the end of the pattern. Thus, it will try to apply that last block as few times as possible in a single match. Since the operator is +, that would be 1. Now you have a case where the match can stop without consuming the entire string. And now, the group buffer only gets filled one time, and not overwritten. Which means that findall can return that result (abc), then loop for a different result (def), then loop for a final result (ghi).
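A quick way to see this is to look at the match objects instead of findall()'s flattened output. The sketch below (names are illustrative) records the overall match, group 1, and the span for each match, and also tries the variant with a trailing \w so the quantifier is no longer at the end of the pattern.

```python
import re

S = "abcdefghi"

# Each entry: (overall match, group 1, span of the overall match)
greedy = [(m.group(0), m.group(1), m.span()) for m in re.finditer(r"(\w{3})+", S)]
lazy = [(m.group(0), m.group(1), m.span()) for m in re.finditer(r"(\w{3})+?", S)]

# With something after the quantifier, it has to leave room for the rest of the regex.
greedy_tail = re.search(r"(\w{3})+\w", S)
lazy_tail = re.search(r"(\w{3})+?\w", S)

print(greedy)  # [('abcdefghi', 'ghi', (0, 9))]
print(lazy)    # [('abc', 'abc', (0, 3)), ('def', 'def', (3, 6)), ('ghi', 'ghi', (6, 9))]
print(greedy_tail.group(0), greedy_tail.group(1))  # abcdefg def
print(lazy_tail.group(0), lazy_tail.group(1))      # abcd abc
```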
Regarding m.groups please confirm: ghi is printed because it has overwritten previous captured groups of def and abc, am I right?
Right. Only the last captured text is stored in the group memory buffer.
can I capture all overwritten groups as well?
Not with re, but with the PyPI regex module, you can. Its match object has a captures method. With re, you can just match them with re.findall(r'\w{3}', S). However, in that case you will match all chunks of 3 word characters anywhere in the string, not just the consecutive ones. With the regex module, you can get all the consecutive 3-character chunks from the beginning of the string with the help of the \G operator: regex.findall(r"\G\w{3}", "abcdefghi") (result: abc, def, ghi).
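If installing a third-party module is not an option, one possible workaround with plain re is a two-step approach: first match the consecutive run from the start of the string, then split that run into chunks. This is only a sketch, and consecutive_chunks is a hypothetical helper name, not something re provides.

```python
import re

def consecutive_chunks(text, size=3):
    """Return the consecutive size-character word chunks at the start of text."""
    # Step 1: match the whole run of repeated chunks, anchored at the start.
    run = re.match(r"(?:\w{%d})+" % size, text)
    if run is None:
        return []
    # Step 2: cut the matched run into its individual chunks.
    return re.findall(r"\w{%d}" % size, run.group(0))

sample = "abcdefgh-xyz"
anywhere = re.findall(r"\w{3}", sample)    # also picks up 'xyz' after the gap
from_start = consecutive_chunks(sample)    # stops at the end of the first run

print(anywhere)                          # ['abc', 'def', 'xyz']
print(from_start)                        # ['abc', 'def']
print(consecutive_chunks("abcdefghi"))   # ['abc', 'def', 'ghi']
```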
Why only ghi was returned with re.findall(r"(\w{3})+", S)?
Because there is only one match that is equal to the whole abcdefghi string, and Capture group 1 contains just the last three characters. re.findall only returns the captured values if capturing groups are defined in the pattern.
I've been trying to match a phrase between hyphens. I realise that I can easily just split on the hyphen and get out the phrases but my equivalent regex for this is not working as expected and I want to understand why:
([^-,]+(?:(?: - )|$))+
[^-,]+ is just my definition of a phrase
(?: - ) is just the non capturing space delimited hyphen
so (?:(?: - )|$) matches a hyphen or end of line
Finally, the whole thing surrounded in parentheses with a + quantifier matches more than one.
What I get if I perform regex.match("A - B - C").groups() is ('C',)
I've also tried the much simpler regex ([^,-]+)+ with similar results
I'm using re.match because I wanted to use pandas.Series.str.extract to apply this to a very long list.
To reiterate: I'm now using an easy split on a hyphen but why isn't this regex returning multiple groups?
Thanks
Regular expression capturing groups are “named” statically by their appearance in the expression. Each capturing group gets its own number, and matches are assigned to that group regardless of how often a single group captures something.
If a group captured something before and later does again, the later result overwrites what was captured before. There is no way to collect all of a group’s captured values using normal matching.
If you want to find multiple values, you will need to match only a single group and repeat matching on the remainder of the string. This is commonly done by re.findall or re.finditer:
>>> re.findall(r'\s*([^-,]+?)\s*', 'A - B - C')
['A', 'B', 'C']
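One caveat worth checking: the lazy group above is happy to stop after a single character, so it works for one-letter phrases but splits longer ones. The sketch below shows that, along with two alternatives that I am assuming fit the question's intent (phrases separated by a hyphen, surrounding whitespace ignored): anchoring the lazy group on the delimiter, or simply using re.split().

```python
import re

# The lazy pattern returns single characters once phrases are longer than one letter.
per_char = re.findall(r"\s*([^-,]+?)\s*", "AB - C")

text = "Alpha one - Beta - C"

# Require a hyphen or end of string after each phrase so the lazy group must grow.
anchored = re.findall(r"\s*([^-,]+?)\s*(?:-|$)", text)

# Splitting on the delimiter gives the same result here.
split_up = re.split(r"\s*-\s*", text)

print(per_char)   # ['A', 'B', 'C']
print(anchored)   # ['Alpha one', 'Beta', 'C']
print(split_up)   # ['Alpha one', 'Beta', 'C']
```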
While using regex to help solve a problem in the Python Challenge, I came across some behaviour that confused me.
From the re documentation:
(...) Matches whatever regular expression is inside the parentheses.
and
'+' Causes the resulting RE to match 1 or more repetitions of the preceding RE.
So this makes sense:
>>> import re
>>> re.findall(r"(\d+)", "1111112")
['1111112']
But this doesn't:
>>> re.findall(r"(\d)+", "1111112")
['2']
I realise that findall returns only groups when groups are present in the regex, but why is only the '2' returned? What happens to all the 1's in the match?
Because you only have one capturing group, but it's "run" repeatedly, the new matches are repeatedly entered into the "storage space" for that group. In other words, the 1s were lost when they were "overwritten" by subsequent 1s and eventually the 2.
You are repeating the group itself by appending '+' after ')'. I do not know the implementation details, but it matches 7 times and returns only the last match.
In the first one, you are matching 7 digits, and making it a group.
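The match object shows where the surviving capture came from. In this sketch (variable names are mine), span(1) points at the last digit, which is the final iteration of the group, while the overall match still covers all seven characters.

```python
import re

m = re.search(r"(\d)+", "1111112")

overall = m.group(0)       # the whole match
last_digit = m.group(1)    # only the final iteration of the group
last_span = m.span(1)      # position of that final iteration

# To keep every digit, match one digit at a time instead of repeating the group.
every_digit = re.findall(r"\d", "1111112")

print(overall, last_digit, last_span)   # 1111112 2 (6, 7)
print(every_digit)                      # ['1', '1', '1', '1', '1', '1', '2']
```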