How do capture groups work? (wrt python regular expressions)

How do capture groups work? (wrt python regular expressions) - python

While using regex to help solve a problem in the Python Challenge, I came across some behaviour that confused me.
from here:
(...) Matches whatever regular expression is inside the parentheses.
and
'+' Causes the resulting RE to match 1 or more repetitions of the preceding RE.
So this makes sense:
>>>import re
>>>re.findall(r"(\d+)", "1111112")
['1111112']
But this doesn't:
>>> re.findall(r"(\d)+", "1111112")
['2']
I realise that findall returns only groups when groups are present in the regex, but why is only the '2' returned? What happends to all the 1's in the match?

Because you only have one capturing group, but it's "run" repeatedly, the new matches are repeatedly entered into the "storage space" for that group. In other words, the 1s were lost when they were "overwritten" by subsequent 1s and eventually the 2.

You are repeating the group itself by appending '+' after ')', I do not know the implementation details but it matches 7 times, and returns only the last match.
In the first one, you are matching 7 digits, and making it a group.

Related

re.sub part of string: (?: ...) mystery [duplicate]

This question already has an answer here:
Reference - What does this regex mean?
(1 answer)
Closed 6 years ago.
I have a character string:
temp = '4424396.6\t1\tk__Bacteria\tp__Firmicutes\tc__Erysipelotrichi\to__Erysipelotrichales'
And I need to get rid of tabulations only in between taxonomy terms.
I tried
re.sub(r'(?:\D{1})\t', ',', temp)
It came quite close, but also replaced the letter before tabs:
'4424396.6\t1\tk__Bacteri,p__Firmicute,c__Erysipelotrich,o__Erysipelotrichales'
I am confused as re documentation for (?:...) goes:
...the substring matched by the group cannot be retrieved after
performing a match or referenced later in the pattern.
The last letter was within the parenthesis, so how could it be replaced?
PS
I used re.sub(r'(?<=\D{1})(\t)', ',', temp) and it works perfectly fine, but I can't understand what's wrong with the first regexp

The text matched by (?:...) does not form a capture group, as does (...), and therefore cannot be referred to later with a backreference such as \1. However, it's still part of the overall match, and is part of the text that re.sub() will replace.
The point of non-capturing groups is that they are slightly more efficient, and may be required in uses such as re.split() where the mere existence of capturing groups will affect the output.

According to the documentation, (?:...) specifies a non-capturing group. It explains:
Sometimes you’ll want to use a group to collect a part of a regular expression, but aren’t interested in retrieving the group’s contents.
What this means is that anything that matches the ... expression (in your case, the preceding letter) will not be captured as a group but will still be part of the match. The only thing special about this is that you won't be able to access the part of the input captured by this group using match.group:
Except for the fact that you can’t retrieve the contents of what the group matched, a non-capturing group behaves exactly the same as a capturing group
In contrast, (?<=...) is a positive lookbehind assertion; the regular expression will check to make sure any matches are preceded by text matching ..., but won't capture that part.

Capturing groups and greediness in Python

Recently I have been playing around with regex expressions in Python and encountered a problem with r"(\w{3})+" and with its non-greedy equivalent r"(\w{3})+?".
Please let's take a look at the following example:
S = "abcdefghi" # string used for all the cases below
1. Greedy search
m = re.search(r"(\w{3})+", S)
print m.group() # abcdefghi
print m.groups() # ('ghi',)
m.group is exactly as I expected - just whole match.
Regarding m.groups please confirm: ghi is printed because it has overwritten previous captured groups of def and abc, am I right? If yes, then can I capture all overwritten groups as well? Of course, for this particular string I could just write m = re.search(r"(\w{3})(\w{3})(\w{3})", S) but I am looking for a more general way to capture groups not knowing how many of them I can expect, thus metacharacter +.
2. Non-greedy search
m = re.search(r"(\w{3})+?", S)
print m.group() # abc
print m.groups() # ('abc',)
Now we are not greedy so only abc was found - exactly as I expected.
Regarding m.groups(), the engine stopped when it found abc so I understand that this is the only found group here.
3. Greedy findall
print re.findall(r"(\w{3})+", S) # ['ghi']
Now I am truly perplexed, I always thought that function re.findall finds all substrings where the RE matches and returns them as a list. Here, we have only one match abcdefghi (according to common sense and bullet 1), so I expected to have a list containing this one item. Why only ghi was returned?
4. Non-greedy findall
print re.findall(r"(\w{3})+?", S) # ['abc', 'def', 'ghi']
Here, in turn, I expected to have abc only, but maybe having bullet 3 explained will help me understand this as well. Maybe this is even the answer for my question from bullet 1 (about capturing overwritten groups), but I would really like to understand what is happening here.

You should think about the greedy/non-greedy behavior in the context of your regex (r"(\w{3})+") versus a regex where the repeating pattern was not at the end: (r"(\w{3})+\w")
It's important because the default behavior of regex matching is:
The entire regex must match
Starting as early in the target string as possible
Matching as much of the target string as possible (greedy)
If you have a "repeat" operator - either * or + - in your regex, then the default behavior is for that to match as much as it can, so long as the rest of the regex is satisfied.
When the repeat operator is at the end of the pattern, there is no rest of the regex, so the behavior becomes match as much as it can.
If you have a repeat operator with a non-greedy qualifier - *? or +? - in your regex, then the behavior is to match as little as it can, so long as the rest of the regex is satisfied.
When the repeat-nongreedy operator is at the end of the pattern, there is no rest of the regex, so the behavior becomes match as little as it can.
All that is in just one match. You are mixing re.findall() in as well, which will then repeat the match, if possible.
The first time you run re.findall, with r"(\w{3})+" you are using a greedy match at the end of the pattern. Thus, it will try to apply that last block as many times as possible in a single match. You have the case where, like the call to re.search, the single match consumes the entire string. As part of consuming the entire string, the w3 block gets repeated, and the group buffer is overwritten several times.
The second time you run re.findall, with r"(\w{3})+?" you are using a non-greedy match at the end of the pattern. Thus, it will try to apply that last block as few times as possible in a single match. Since the operator is +, that would be 1. Now you have a case where the match can stop without consuming the entire string. And now, the group buffer only gets filled one time, and not overwritten. Which means that findall can return that result (abc), then loop for a different result (def), then loop for a final result (ghi).

Regarding m.groups please confirm: ghi is printed because it has overwritten previous captured groups of def and abc, am I right?
Right. Only the last captured text is stored in the group memory buffer.
can I capture all overwritten groups as well?
Not with re, but with PyPi regex, you can. Its match object has a captures method. However, with re, you can just match them with re.findall(r'\w{3}', S). However, in this case, you will match all 3-word character chunks from the string, not just those consecutive ones. With the regex module, you can get all the 3-character consecutive chunks from the beginning of the string with the help of \G operator: regex.findall(r"\G\w{3}", "abcdefghi") (result: abc, def, ghi).
Why only ghi was returned with re.findall(r"(\w{3})+", S)?
Because there is only one match that is equal to the whole abcdefghi string, and Capture group 1 contains just the last three characters. re.findall only returns the captured values if capturing groups are defined in the pattern.

difference between two regular expressions: [abc]+ and ([abc])+

In [29]: re.findall("([abc])+","abc")
Out[29]: ['c']
In [30]: re.findall("[abc]+","abc")
Out[30]: ['abc']
Confused by the grouped one. How does it make difference?

There are two things that need to be explained here: the behavior of quantified groups, and the design of the findall() method.
In your first example, [abc] matches the a, which is captured in group #1. Then it matches b and captures it in group #1, overwriting the a. Then again with the c, and that's what's left in group #1 at the end of the match.
But it does match the whole string. If you were using search() or finditer(), you would be able to look at the MatchObject and see that group(0) contains abc and group(1) contains c. But findall() returns strings, not MatchObjects. If there are no groups, it returns a list of the overall matches; if there are groups, the list contains all the captures, but not the overall match.
So both of your regexes are matching the whole string, but the first one is also capturing and discarding each character individually (which is kinda pointless). It's only the unexpected behavior of findall() that makes it look like you're getting different results.

In the first example you have a repeated captured group which only capture the last iteration. Here c.
([abc])+
Debuggex Demo
In the second example you are matching a single character in the list one and unlimited times.
[abc]+
Debuggex Demo

Here's the way I would think about it. ([abc])+ is attempting to repeat a captured group. When you use "+" after the capture group, it doesn't mean you are going to get two captured groups. What ends up happening, at least for Python's regex and most implementations, is that the "+" forces iteration until the capture group only contains the last match.
If you want to capture a repeated expression, you need to reverse the ordering of "(...)" and "+", e.g. instead of ([abc])+ use ([abc]+).

input "abc"
[abc]
match a single character => "a"
[abc]+
+ Between one and unlimited times, as many times as possible => "abc"
([abc])
Capturing group ([abc]) => "a"
([abc])+
+ A repeated capturing group will only capture the last iteration => "c"

Grouping just gives different preference.
([abc])+ => Find one from selection. Can match one or more. It finds one and all conditions are met as the + means 1 or more. This breaks up the regex into two stages.
While the ungrouped one is treated as a whole.

How does regex {m,n}? work in Python?

From the Python documentation of the re module:
{m,n}?
Causes the resulting RE to match from m to n repetitions of the preceding RE, attempting to match as few repetitions as possible. This is the non-greedy version of the previous qualifier. For example, on the 6-character string 'aaaaaa', a{3,5} will match 5 'a' characters, while a{3,5}? will only match 3 characters.
I'm confused about how this works. How is this any different from {m}? I do not see how there could ever be a case where the pattern could match more than m repetitions. If there are m+1 repetitions in a row, then there are also m. What am I missing?

Whereas, it is true that a regex solely containing a{3,5}? and one with the pattern: a{3} will match the same thing (i.e. re.match(r'a{3,5}?', 'aaaaa').group(0) and re.match(r'a{3}', 'aaaaa').group(0)
will both return 'aaa'), the differences between the patterns becomes clear when you look at patterns containing these two elements. Say your pattern is a{3,5}?b, then aaab, aaaab, and aaaaab will be matched. If you just used a{3}b then only aaab would get matched. aaaab and aaaaab would not.
Look to Shashank's answer for examples that flush out this difference a little more, or test your own. I've found that this site is a good resource to use to test out python regular expressions.

I think the way to see the difference between the two is through the following examples:
>>> re.findall(r'ab{3,5}?', 'abbbbb')
['abbb']
>>> re.findall(r'ab{3}', 'abbbbb')
['abbb']
Those two runs give the same results as expected, but let's see some differences.
Difference 1: A range quantifier on a subpattern lets you match a large range of patterns containing that subpattern. This lets you find matches where there normally wouldn't be any if you used an exact quantifier:
>>> re.findall(r'ab{3,5}?c', 'abbbbbc')
['abbbbbc']
>>> re.findall(r'ab{3}c', 'abbbbbc')
[]
Difference 2: Greedy doesn't necessarily mean "match the shortest subpattern possible". It's actually a bit more like "match the shortest subpattern possible starting from the leftmost unmatched index that can possibly start off a match":
>>> re.findall(r'b{3,5}?c', 'bbbbbc')
['bbbbbc']
>>> re.findall(r'b{3}c', 'bbbbbc')
['bbbc']
The way I think of regex is as a construct that scans the string from left to right with two iterators that point to indices in the string. The first iterator marks the beginning of the next possible pattern. The second iterator goes through the suffix of the substring starting from the first iterator and tries to complete the pattern. The first iterator only advances when the construct determines that the regex pattern cannot possibly match a string starting from that index. Thus, defining a range for your quantifier will make it so that the first iterator will keep matching sub-patterns beyond the minimum value specified even if the quantifier is non-greedy.
A non-greedy regex will stop its second iterator as soon as the pattern can stop, but a greedy regex will "save" the position of a matched pattern and keep searching for a longer one. If a longer pattern is found, then it uses that one instead, if it's not found, then it uses the shorter one that it saved in memory earlier.
That's why you see the possibly surprising result with 'b{3,5}?c' and 'bbbbbc'. Although the regex is greedy, it will still never advance its first iterator until the pattern match fails, and that's why the substring with 5 'b' characters is matched by the non-greedy regex even though its not the shortest pattern matchable.

SwankSwashbucklers's answer describes the greedy version. The ? makes it non-greedy, which means it will try to match as few items as possible, which means that
`re.match('a{3,5}?b', 'aaaab').group(0)` # returns `'aaaab'`
but
`re.match('a{3,5}?', 'aaaa').group(0)` # returns `'aaa'`

let say we have a string to be searched is:
str ="aaaaa"
Now we have patter = a{3,5}
The string which it matches are :{aaa,aaaa,aaaaa}
But here we have string as "aaaaa" since we have only one option.
Now lets say we have pattern = a{3,5}?
in this case it matches only "aaa" not "aaaaa".
Thus it takes the minimum items as possible,being non greedy.
please try using online regular Expression at :https://pythex.org/
It will be great help and we check immediately what it matches and what it does not

Don’t understand lazy regex

Say we have a string 1abcd1efg1hjk1lmn1 and want to find stuff between 1-s. What we do is
re.findall('1.*?1','1abcd1efg1hjk1lmn1')
and get two results
['1abcd1', '1hjk1']
ok I get that. But if we do
re.findall('1.*?1hj','1abcd1efg1hjk1lmn1')
why does it grab TWO intervals between 1s instead of one? Why do we get ['1abcd1efg1hj'] instead of ['1efg1hj']? Isn’t this what laziness is supposed to do?

Regex always tries to match the input string from left to right. Consider your '1.*?1hj' regex. 1 in your regex matches the first one and the following .*? matches all the characters upto the 1hj sub-string non-greedily. So that you got ['1abcd1efg1hj'] instead of ['1efg1hj']
To get ['1efg1hj'] as output, you need use a negated class as 1[^1]*1hj
>>> s = "1abcd1efg1hjk1lmn1"
>>> re.findall(r'1.*?1hj', s)
['1abcd1efg1hj']
>>> re.findall(r'1[^1]*1hj', s)
['1efg1hj']

['1abcd1efg1hj']
You get this because this satisfies your regex. 1.*?1hj essentially means start from 1 then move lazily till you find 1 followed by hj. The 1 in between if followed by ef so that will not match but . will consume all. You don't get ['1efg1hj'] because that string has already been consumed by the first match.Use lookahead to see that both satisfy the conditions. See demo.
A lookahead does not consume string so you get both the match,
https://regex101.com/r/aQ3zJ3/5

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

How do capture groups work? (wrt python regular expressions) - python

Because you only have one capturing group, but it's "run" repeatedly, the new matches are repeatedly entered into the "storage space" for that group. In other words, the 1s were lost when they were "overwritten" by subsequent 1s and eventually the 2.

You are repeating the group itself by appending '+' after ')', I do not know the implementation details but it matches 7 times, and returns only the last match. In the first one, you are matching 7 digits, and making it a group.

Related

re.sub part of string: (?: ...) mystery [duplicate]

Capturing groups and greediness in Python

difference between two regular expressions: [abc]+ and ([abc])+

How does regex {m,n}? work in Python?

Don’t understand lazy regex

Categories

Resources