I'm spending some time trying to understand the Regex terminology in Python 3 and I can't figure out how (\D\d)+ works here.
I know that \D represents a nondigit character and that \d represents a digit character, and also that the plus sign + represents one or more repetitions of the preceding expression. But when I try the following code, I simply can't wrap my head around the result.
Input:
import re
text = "a1 b2 c3 d4e5f6"
regex = re.findall(r'(\D\d)+',text)
print(regex)
Output:
['a1', 'b2', 'c3', 'f6']
Since the regex includes a plus sign, shouldn't it also output d4e5f6, as that is a sequence of nondigit and digit characters?
You aren't directly repeating the \D\d subpattern with the +, you are repeating a capturing group (indicated by parentheses) that contains that subpattern. The final match is indeed of the text d4e5f6, but it does so as three instances of the capturing group, each one of which overwrites the last. And the behavior of Python's re.findall() in the presence of capturing groups is that it returns THEM (as a tuple, if there's more than one) instead of the overall match.
There is a newer third-party regex module (installable from PyPI, not part of the standard library) that is capable of returning multiple captures for a single capturing group, although I'm not exactly sure how that interacts with findall(). You could also write the expression as (?:\D\d)+. The (?: starts a non-capturing group, so findall() will give you the entire match as you expect.
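To see the difference concretely, here is a minimal sketch using the question's own input. The variable names (captured, whole, pairs) are illustrative only and do not come from the original code.

```python
import re

text = "a1 b2 c3 d4e5f6"

# With a capturing group, findall() returns the group's last capture per match.
captured = re.findall(r'(\D\d)+', text)

# With a non-capturing group, findall() returns each overall match.
whole = re.findall(r'(?:\D\d)+', text)

# finditer() yields match objects, so both views are available at once.
pairs = [(m.group(0), m.group(1)) for m in re.finditer(r'(\D\d)+', text)]

print(captured)  # ['a1', 'b2', 'c3', 'f6']
print(whole)     # ['a1', 'b2', 'c3', 'd4e5f6']
print(pairs[-1])  # ('d4e5f6', 'f6')
```

If both the overall match and the last capture are needed, finditer() is probably the more convenient choice, since it avoids findall()'s special handling of groups.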
My original question was closed for being a duplicate. I disagree with it being a duplicate as this is a different use case looking at regular expression syntax. I have tried to clarify my question below.
Is it possible to create a regular expression which matches two duplicate consecutive characters within a string (in this example lowercase letters) but does not match a section of the string if the same characters are either side. e.g. match 'aa' but not 'aaa' or 'aaaa'?
Additionally:
Although I am using Python 3.10 I am trying to work out if this is possible using 'standard' regular expression syntax without utilising additional functionality provided by external modules. For example using Python this would mean a solution which uses the 're' module from the standard library.
If there are 3 or more duplicate consecutive characters, the string should still match if there are two duplicate consecutive characters elsewhere in the string, e.g. match 'aa' even if 'bbb' exists elsewhere in the string.
The string should also match if the two duplicate consecutive characters appear at the beginning or end of the string.
My examples are 16 character strings if a specific length makes a difference.
Examples:
ffumlmqwfcsyqpss should match either 'ff' or 'ss'.
zztdcqzqddaazdjp should match either 'zz','dd', 'aa'.
urrvucyrzzzooxhx should match 'rr' or 'oo' even though 'zzz' exists in the string.
zettygjpcoedwyio should match 'tt'.
dtfkgggvqadhqbwb should not match 'ggg'.
rwgwbwzebsnjmtln should not match.
What I had originally tried
([a-z])\1 to capture the duplicate character but this also matches when there are additional duplicate characters such as 'aaa' or 'aaaa' etc.
([a-z])\1(?!\1) to negate the third duplicate character but this just moves the match to the end of the duplicate character string.
Negative lookarounds to compensate for a match at the beginning but I think I am causing some kind of loop which will never match.
>>> import re
>>> re.search(r'([a-z])\1(?!\1)', 'dtfkgggvqadhqbwb')
<re.Match object; span=(5, 7), match='gg'> # should not match as 'gg' ('[gg]g' or 'g[gg]')
Currently offered solutions don't match described criteria.
Wiktor Stribiżew's solution uses the additional (*SKIP) functionality of the external python regex module.
Tim Biegeleisen's solution does not match duplicate pairs if there are duplicate triples etc in the same string.
In the linked question, Cary Swoveland's solutions do not work for duplicate pairs at the beginning or end of a string or match even when there is no duplicate in the string.
In the linked question, the fourth bird's solution does not match duplicate pairs at the beginning or end of strings.
Summary
So far the only answer which works is Wiktor Stribiżew's but this uses the (*SKIP) function of the external 'regex' module. Is a solution not possible using 'standard' regular expression syntax?
In Python re, the main difficulty in building the right regex for this task is that a capturing group must be defined before a backreference to it can be used, whereas negative lookbehinds are usually placed before the captured pattern. Also, the regex101.com Python testing option does not always reflect the current state of the re library: it confuses users with a message like "This token can not be used in a lookbehind due to either making it non-fixed width or interfering with the pattern matching" when it sees a \1 in (?<!\1), while Python has allowed this since v3.5 for groups of fixed length.
The pattern you can use here is
(.)(?<!\1.)\1(?!\1)
Details
(.) - Capturing group 1: any single char (if re.DOTALL is used, even line break chars)
(?<!\1.) - a negative lookbehind that fails the match if there is the same char as captured in Group 1 and then any single char (we can use \1 instead of the . here, and it will work the same) immediately to the left of the current location
\1 - same char as in Group 1
(?!\1) - a negative lookahead that fails the match if there is the same char as in Group 1 immediately to the right of the current location.
See the Python test:
import re

# Each test string maps to the isolated pairs we expect to find in it.
tests = {'ffumlmqwfcsyqpss': ['ff', 'ss'],
         'zztdcqzqddaazdjp': ['zz', 'dd', 'aa'],
         'urrvucyrzzzooxhx': ['rr', 'oo'],
         'zettygjpcoedwyio': ['tt'],
         'dtfkgggvqadhqbwb': [],
         'rwgwbwzebsnjmtln': []
         }
for test, answer in tests.items():
    matches = [m.group() for m in re.finditer(r'(.)(?<!\1.)\1(?!\1)', test, re.DOTALL)]
    if matches:
        print(f"Matches found in '{test}': {matches}. Is the answer expected? {set(matches)==set(answer)}.")
    else:
        print(f"No match found in '{test}'. Is the answer expected? {set(matches)==set(answer)}.")
Output:
Matches found in 'ffumlmqwfcsyqpss': ['ff', 'ss']. Is the answer expected? True.
Matches found in 'zztdcqzqddaazdjp': ['zz', 'dd', 'aa']. Is the answer expected? True.
Matches found in 'urrvucyrzzzooxhx': ['rr', 'oo']. Is the answer expected? True.
Matches found in 'zettygjpcoedwyio': ['tt']. Is the answer expected? True.
No match found in 'dtfkgggvqadhqbwb'. Is the answer expected? True.
No match found in 'rwgwbwzebsnjmtln'. Is the answer expected? True.
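As a further check, the following sketch runs the pattern on a few short edge cases (pairs at the start or end of the string, and runs of three or four) and compares it with the variant that uses \1 instead of the dot inside the lookbehind. The helper name isolated_pairs and the sample strings are assumptions made for illustration; they are not part of the original answer.

```python
import re

# Pattern from the answer, plus the variant with \1 in place of the dot.
PAIR = re.compile(r'(.)(?<!\1.)\1(?!\1)')
PAIR_ALT = re.compile(r'(.)(?<!\1\1)\1(?!\1)')

def isolated_pairs(s, pattern=PAIR):
    """Return every run of exactly two identical characters found in s."""
    return [m.group() for m in pattern.finditer(s)]

samples = ['aa', 'aaa', 'aaaa', 'aabbb', 'bbbaa', 'xaay']
for s in samples:
    # Both patterns are expected to agree on every sample.
    print(s, isolated_pairs(s), isolated_pairs(s, PAIR_ALT))
```

At the very start of a string the lookbehind has fewer than two characters to inspect, so it cannot match and the negative assertion passes. That is why a pair at position 0 is still reported.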
You may use the following regex pattern:
^(?![a-z]*([a-z])\1{2,})[a-z]*([a-z])\2[a-z]*$
This pattern says to match:
^ start of the string
(?![a-z]*([a-z])\1{2,}) same letter does not occur 3 times or more
[a-z]* zero or more letters
([a-z]) capture a letter
\2 which is followed by the same letter
[a-z]* zero or more letters
$ end of the string
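A quick sketch of how this whole-string pattern behaves on the question's examples is shown here. Note that it validates the entire string rather than locating the pair, and its lookahead rejects any string that contains a run of three or more identical letters. The names WHOLE and results are illustrative only.

```python
import re

# Whole-string pattern from this answer.
WHOLE = re.compile(r'^(?![a-z]*([a-z])\1{2,})[a-z]*([a-z])\2[a-z]*$')

examples = [
    'ffumlmqwfcsyqpss',
    'zztdcqzqddaazdjp',
    'urrvucyrzzzooxhx',  # has 'rr' and 'oo', but also 'zzz'
    'zettygjpcoedwyio',
    'dtfkgggvqadhqbwb',
    'rwgwbwzebsnjmtln',
]

# True when the string has a doubled letter and no run of three or more.
results = {s: bool(WHOLE.match(s)) for s in examples}
for s, ok in results.items():
    print(s, ok)
```

The third example comes out False because of 'zzz', even though 'rr' and 'oo' are present, so this pattern suits only the stricter reading of the requirements.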
Why doesn't \0 work (i.e. to return the full match) in Python regexp substitutions, i.e. with sub() or match.expand(), while match.group(0) does, and also \1, \2, ... ?
This simple example (executed in Python 3.7) says it all:
import re
subject = '123'
regexp_pattern = r'\d(2)\d'
expand_template_full = r'\0'
expand_template_group = r'\1'
regexp_obj = re.compile(regexp_pattern)
match = regexp_obj.search(subject)
if match:
    print('Full match, by method: {}'.format(match.group(0)))
    print('Full match, by template: {}'.format(match.expand(expand_template_full)))
    print('Capture group 1, by method: {}'.format(match.group(1)))
    print('Capture group 1, by template: {}'.format(match.expand(expand_template_group)))
The output from this is:
Full match, by method: 123
Full match, by template:
Capture group 1, by method: 2
Capture group 1, by template: 2
Is there any other sequence I can use in the replacement/expansion template to get the full match? If not, for the love of god, why?
Is this a Python bug?
Huh, you're right, that is annoying!
Fortunately, Python's way ahead of you. The docs for sub say this:
In string-type repl arguments, in addition to the character escapes and backreferences described above, \g<name> will use the substring matched by the group named name, as defined by the (?P<name>...) syntax. \g<number> uses the corresponding group number.... The backreference \g<0> substitutes in the entire substring matched by the RE.
So your code example can be:
import re
subject = '123'
regexp_pattern = r'\d(2)\d'
expand_template_full = r'\g<0>'
regexp_obj = re.compile(regexp_pattern)
match = regexp_obj.search(subject)
if match:
    print('Full match, by template: {}'.format(match.expand(expand_template_full)))
You also asked the far more interesting question of "why?". The rationale in the docs explains that you can use this to replace with more than 10 capture groups, because it's not clear whether \10 should be substituted with the 10th group, or with the first capture group followed by a zero, but doesn't explain why \0 doesn't work. I've not been able to find a PEP explaining the rationale, but here's my guess:
We want the repl argument to re.sub to use the same capture group backreferencing syntax as in regex matching. When regex matching, the concept of \0 "backreferencing" to the entire matched string is nonsensical; the hypothetical regex r'A\0' would match an infinitely long string of A characters and nothing else. So we cannot allow \0 to exist as a backreference. If you can't match with a backreference that looks like that, you shouldn't be able to replace with it either.
I can't say I agree with this logic, since \g<> is already an arbitrary extension, but it's an argument that I can see someone making.
If you look in the docs, you will find the following:
The backreference \g<0> substitutes in the entire substring matched by the RE.
A bit deeper in the docs (dating back to 2003), you will find this tip:
There is a group 0, which is the entire matched pattern, but it can't be referenced with \0; instead, use \g<0>.
So, you need to follow this recommendation and use \g<0>:
expand_template_full = r'\g<0>'
Quoting from https://docs.python.org/3/library/re.html
\number
Matches the contents of the group of the same number. Groups are numbered starting from 1. For example, (.+) \1 matches 'the the' or '55 55', but not 'thethe' (note the space after the group). This special sequence can only be used to match one of the first 99 groups. If the first digit of number is 0, or number is 3 octal digits long, it will not be interpreted as a group match, but as the character with octal value number. Inside the '[' and ']' of a character class, all numeric escapes are treated as characters.
To summarize:
Use \1, \2 up to \99 provided no more digits are present after the numbered backreference
Use \g<0>, \g<1>, etc (not limited to 99) to robustly backreference a group
As far as I know, \g<0> is useful in the replacement section to refer to the entire matched portion, but it wouldn't make sense in the search section
If you use the third-party regex module, then (?0) is useful in the search section as well, for example to create recursively matching patterns
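The points above can be checked with a short sketch built on the question's own pattern. The variable names are illustrative; the only behavior relied on is what the quoted documentation describes (\g<0> for the whole match, \g<1> for an unambiguous group reference, and a leading 0 being read as an octal escape).

```python
import re

subject = '123'
pattern = re.compile(r'\d(2)\d')

# \g<0> inserts the entire match.
whole = pattern.sub(r'[\g<0>]', subject)

# \g<1>0 is group 1 followed by a literal zero, with no ambiguity.
group_then_zero = pattern.sub(r'\g<1>0', subject)

# \0 is parsed as an octal escape, i.e. the NUL character, not group 0.
octal = pattern.sub(r'\0', subject)

print(whole)            # [123]
print(group_then_zero)  # 20
print(repr(octal))      # '\x00'
```

This also explains the seemingly empty "Full match, by template:" line in the question: the template produced a NUL character, which most terminals do not display.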
I have a character string:
temp = '4424396.6\t1\tk__Bacteria\tp__Firmicutes\tc__Erysipelotrichi\to__Erysipelotrichales'
And I need to get rid of tabulations only in between taxonomy terms.
I tried
re.sub(r'(?:\D{1})\t', ',', temp)
It came quite close, but also replaced the letter before tabs:
'4424396.6\t1\tk__Bacteri,p__Firmicute,c__Erysipelotrich,o__Erysipelotrichales'
I am confused as re documentation for (?:...) goes:
...the substring matched by the group cannot be retrieved after
performing a match or referenced later in the pattern.
The last letter was within the parenthesis, so how could it be replaced?
PS
I used re.sub(r'(?<=\D{1})(\t)', ',', temp) and it works perfectly fine, but I can't understand what's wrong with the first regexp
The text matched by (?:...) does not form a capture group, as does (...), and therefore cannot be referred to later with a backreference such as \1. However, it's still part of the overall match, and is part of the text that re.sub() will replace.
The point of non-capturing groups is that they are slightly more efficient, and may be required in uses such as re.split() where the mere existence of capturing groups will affect the output.
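The re.split() remark can be illustrated with a small sketch. The sample line is a shortened piece of the question's string, and the variable names are made up for the example.

```python
import re

line = 'k__Bacteria\tp__Firmicutes\tc__Erysipelotrichi'

# Non-capturing group: the delimiters are dropped from the result.
plain = re.split(r'(?:\t)', line)

# Capturing group: the text matched by the group is kept in the result list.
with_delims = re.split(r'(\t)', line)

print(plain)        # ['k__Bacteria', 'p__Firmicutes', 'c__Erysipelotrichi']
print(with_delims)  # ['k__Bacteria', '\t', 'p__Firmicutes', '\t', 'c__Erysipelotrichi']
```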
According to the documentation, (?:...) specifies a non-capturing group. It explains:
Sometimes you’ll want to use a group to collect a part of a regular expression, but aren’t interested in retrieving the group’s contents.
What this means is that anything that matches the ... expression (in your case, the preceding letter) will not be captured as a group but will still be part of the match. The only thing special about this is that you won't be able to access the part of the input captured by this group using match.group:
Except for the fact that you can’t retrieve the contents of what the group matched, a non-capturing group behaves exactly the same as a capturing group
In contrast, (?<=...) is a positive lookbehind assertion; the regular expression will check to make sure any matches are preceded by text matching ..., but won't capture that part.
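A side-by-side sketch of the three approaches on the question's string may help. The second variant, which captures the letter and puts it back with \1, is not from the original posts; it is included as one possible alternative to the lookbehind.

```python
import re

temp = '4424396.6\t1\tk__Bacteria\tp__Firmicutes\tc__Erysipelotrichi\to__Erysipelotrichales'

# Non-capturing group: the letter is part of the match, so it is replaced too.
consumed = re.sub(r'(?:\D)\t', ',', temp)

# Capturing group: the letter is still matched, but \1 writes it back.
restored = re.sub(r'(\D)\t', r'\1,', temp)

# Lookbehind: the letter is only checked, never part of the match.
lookbehind = re.sub(r'(?<=\D)\t', ',', temp)

print(repr(consumed))
print(repr(restored))
print(restored == lookbehind)  # True
```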
Recently I have been playing around with regex expressions in Python and encountered a problem with r"(\w{3})+" and with its non-greedy equivalent r"(\w{3})+?".
Please let's take a look at the following example:
S = "abcdefghi" # string used for all the cases below
1. Greedy search
m = re.search(r"(\w{3})+", S)
print m.group() # abcdefghi
print m.groups() # ('ghi',)
m.group is exactly as I expected - just whole match.
Regarding m.groups please confirm: ghi is printed because it has overwritten previous captured groups of def and abc, am I right? If yes, then can I capture all overwritten groups as well? Of course, for this particular string I could just write m = re.search(r"(\w{3})(\w{3})(\w{3})", S) but I am looking for a more general way to capture groups not knowing how many of them I can expect, thus metacharacter +.
2. Non-greedy search
m = re.search(r"(\w{3})+?", S)
print m.group() # abc
print m.groups() # ('abc',)
Now we are not greedy so only abc was found - exactly as I expected.
Regarding m.groups(), the engine stopped when it found abc so I understand that this is the only found group here.
3. Greedy findall
print re.findall(r"(\w{3})+", S) # ['ghi']
Now I am truly perplexed. I always thought that the function re.findall finds all substrings where the RE matches and returns them as a list. Here, we have only one match, abcdefghi (according to common sense and bullet 1), so I expected to get a list containing this one item. Why was only ghi returned?
4. Non-greedy findall
print re.findall(r"(\w{3})+?", S) # ['abc', 'def', 'ghi']
Here, in turn, I expected to have abc only, but maybe having bullet 3 explained will help me understand this as well. Maybe this is even the answer for my question from bullet 1 (about capturing overwritten groups), but I would really like to understand what is happening here.
You should think about the greedy/non-greedy behavior in the context of your regex (r"(\w{3})+") versus a regex where the repeating pattern was not at the end: (r"(\w{3})+\w")
It's important because the default behavior of regex matching is:
The entire regex must match
Starting as early in the target string as possible
Matching as much of the target string as possible (greedy)
If you have a "repeat" operator - either * or + - in your regex, then the default behavior is for that to match as much as it can, so long as the rest of the regex is satisfied.
When the repeat operator is at the end of the pattern, there is no rest of the regex, so the behavior becomes match as much as it can.
If you have a repeat operator with a non-greedy qualifier - *? or +? - in your regex, then the behavior is to match as little as it can, so long as the rest of the regex is satisfied.
When the repeat-nongreedy operator is at the end of the pattern, there is no rest of the regex, so the behavior becomes match as little as it can.
All that is in just one match. You are mixing re.findall() in as well, which will then repeat the match, if possible.
The first time you run re.findall, with r"(\w{3})+", you are using a greedy match at the end of the pattern. Thus, it will try to apply that last block as many times as possible in a single match. You have the case where, like the call to re.search, the single match consumes the entire string. As part of consuming the entire string, the \w{3} block gets repeated, and the group buffer is overwritten several times.
The second time you run re.findall, with r"(\w{3})+?" you are using a non-greedy match at the end of the pattern. Thus, it will try to apply that last block as few times as possible in a single match. Since the operator is +, that would be 1. Now you have a case where the match can stop without consuming the entire string. And now, the group buffer only gets filled one time, and not overwritten. Which means that findall can return that result (abc), then loop for a different result (def), then loop for a final result (ghi).
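The "rest of the regex" effect can be seen with the second pattern mentioned at the start of this answer, where a \w follows the repeated group. This is only a sketch; the variable names are illustrative.

```python
import re

S = "abcdefghi"

# Greedy: repeat as often as possible while still leaving one \w for the tail.
greedy = re.search(r"(\w{3})+\w", S)

# Non-greedy: repeat as few times as possible, then match the trailing \w.
lazy = re.search(r"(\w{3})+?\w", S)

print(greedy.group())  # abcdefg
print(lazy.group())    # abcd
print(greedy.groups(), lazy.groups())
```

In the greedy case the engine first consumes all nine characters, finds nothing left for the trailing \w, and gives back one repetition so that the whole pattern can match.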
Regarding m.groups please confirm: ghi is printed because it has overwritten previous captured groups of def and abc, am I right?
Right. Only the last captured text is stored in the group memory buffer.
can I capture all overwritten groups as well?
Not with re, but you can with the PyPI regex module: its match object has a captures method. With re, you can simply match the chunks with re.findall(r'\w{3}', S). In that case, however, you will match all three-word-character chunks in the string, not just the consecutive ones. With the regex module, you can get all the consecutive 3-character chunks from the beginning of the string with the help of the \G anchor: regex.findall(r"\G\w{3}", "abcdefghi") (result: abc, def, ghi).
Why only ghi was returned with re.findall(r"(\w{3})+", S)?
Because there is only one match that is equal to the whole abcdefghi string, and Capture group 1 contains just the last three characters. re.findall only returns the captured values if capturing groups are defined in the pattern.
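If only the standard re module is available, one possible workaround for the "consecutive chunks" caveat is a two-step approach: first match the whole run, then split that run into chunks. This is a sketch under that assumption, not something taken from the answer; consecutive_chunks is a hypothetical helper name.

```python
import re

def consecutive_chunks(s):
    """Return the 3-character chunks of the leading run of word characters."""
    run = re.match(r"(?:\w{3})+", s)  # non-capturing, so group() is the whole run
    return re.findall(r"\w{3}", run.group()) if run else []

print(re.findall(r"(\w{3})+", "abcdefghi"))  # ['ghi']
print(consecutive_chunks("abcdefghi"))       # ['abc', 'def', 'ghi']

# A plain findall() also picks up the chunk after the gap; the helper does not.
print(re.findall(r"\w{3}", "abcdef--ghi"))   # ['abc', 'def', 'ghi']
print(consecutive_chunks("abcdef--ghi"))     # ['abc', 'def']
```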
I've been trying to match a phrase between hyphens. I realise that I can easily just split on the hyphen and get out the phrases but my equivalent regex for this is not working as expected and I want to understand why:
([^-,]+(?:(?: - )|$))+
[^-,]+ is just my definition of a phrase
(?: - ) is just the non capturing space delimited hyphen
so (?:(?: - )|$) is matching a hyphen or end of line
Finally, the whole thing surrounded in parentheses with a + quantifier matches more than one.
What I get if I perform regex.match("A - B - C").groups() is ('C',)
I've also tried the much simpler regex ([^,-]+)+ with similar results
I'm using re.match because I wanted to use pandas.Series.str.extract to apply this to a very long list.
To reiterate: I'm now using an easy split on a hyphen but why isn't this regex returning multiple groups?
Thanks
Regular expression capturing groups are “named” statically by their appearance in the expression. Each capturing group gets its own number, and matches are assigned to that group regardless of how often a single group captures something.
If a group captured something before and later does so again, the later result overwrites what was captured before. There is no way to collect all of a group's captured values using normal matching.
If you want to find multiple values, you will need to match only a single group and repeat matching on the remainder of the string. This is commonly done by re.findall or re.finditer:
>>> re.findall(r'\s*([^-,]+?)\s*', 'A - B - C')
['A', 'B', 'C']
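One caveat, shown as a sketch: because the lazy group above is followed only by optional whitespace, it stops after a single character, so phrases longer than one character get split up. Requiring a hyphen or the end of the string after the phrase is one possible fix. The pattern and the helper name phrases are assumptions for illustration, not part of the original answer.

```python
import re

def phrases(s):
    """Capture each phrase lazily, up to the next hyphen or the end of the string."""
    return re.findall(r'\s*([^-,]+?)\s*(?:-|$)', s)

# The repeated capturing group from the question keeps only its last capture.
print(re.match(r'([^-,]+(?:(?: - )|$))+', 'A - B - C').groups())  # ('C',)

# The pattern without a trailing delimiter splits longer phrases apart.
print(re.findall(r'\s*([^-,]+?)\s*', 'Ab - Cd'))  # ['A', 'b', 'C', 'd']

print(phrases('A - B - C'))             # ['A', 'B', 'C']
print(phrases('Alpha one - Beta - C'))  # ['Alpha one', 'Beta', 'C']
```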