How shall we write a RegEx that captures repeating a substring in a nonconsecutive position?
For example, in aaabcaaa, aaa repeats with bc in between.
\1 can only be used in replacement not in the match pattern, right? Can we write (.*)bc\1?
The Regex can be (.+)bc\1
>>> s = "aaabcaaa"
>>> re.search(r'(.+)bc\1',s).group(1)
'aaa'
To settle your doubt, let me quote from the Regex HOWTO:
Backreferences in a pattern allow you to specify that the contents of an earlier capturing group must also be found at the current location in the string. For example, \1 will succeed if the exact contents of group 1 can be found at the current position, and fails otherwise.
The official docs also include a program that solves your problem (slightly changed):
>>> p = re.compile(r'(\b\w+)bc\1')
>>> p.search(s).group(1)
'aaa'
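As a quick check of the HOWTO quote, here is a minimal sketch that runs the same backreference pattern on two inputs. The second string, aaabcxyz, is a made-up counterexample and not part of the question.

```python
import re

# (.+) captures a substring; \1 requires the same text again right after "bc".
pattern = re.compile(r'(.+)bc\1')

match = pattern.search("aaabcaaa")
repeated = match.group(1) if match else None

# Hypothetical counterexample: the text after "bc" differs from the text before it.
no_match = pattern.search("aaabcxyz")

print(repeated)   # aaa
print(no_match)   # None
```

So \1 is checked during matching, not only during replacement.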
Yes, you can use \1 in the match. I guess you haven't tried before asking?
My original question was closed for being a duplicate. I disagree with it being a duplicate as this is a different use case looking at regular expression syntax. I have tried to clarify my question below.
Is it possible to create a regular expression which matches two duplicate consecutive characters within a string (in this example lowercase letters) but does not match a section of the string if the same characters are either side. e.g. match 'aa' but not 'aaa' or 'aaaa'?
Additionally:
Although I am using Python 3.10 I am trying to work out if this is possible using 'standard' regular expression syntax without utilising additional functionality provided by external modules. For example using Python this would mean a solution which uses the 're' module from the standard library.
If there are 3 or more duplicate consecutive characters, the string should still match if there are two duplicate consecutive characters elsewhere in the string, e.g. match 'aa' even if 'bbb' exists elsewhere in the string.
The string should also match if the two duplicate consecutive characters appear at the beginning or end of the string.
My examples are 16 character strings if a specific length makes a difference.
Examples:
ffumlmqwfcsyqpss should match either 'ff' or 'ss'.
zztdcqzqddaazdjp should match either 'zz','dd', 'aa'.
urrvucyrzzzooxhx should match 'rr' or 'oo' even though 'zzz' exists in the string.
zettygjpcoedwyio should match 'tt'.
dtfkgggvqadhqbwb should not match 'ggg'.
rwgwbwzebsnjmtln should not match.
What I had originally tried
([a-z])\1 to capture the duplicate character but this also matches when there are additional duplicate characters such as 'aaa' or 'aaaa' etc.
([a-z])\1(?!\1) to negate the third duplicate character but this just moves the match to the end of the duplicate character string.
Negative lookarounds to compensate for a match at the beginning but I think I am causing some kind of loop which will never match.
>>> import re
>>> re.search(r'([a-z])\1(?!\1)', 'dtfkgggvqadhqbwb')
<re.Match object; span=(5, 7), match='gg'> # should not match as 'gg' ('[gg]g' or 'g[gg]')
Currently offered solutions don't match described criteria.
Wiktor Stribiżew's solution uses the additional (*SKIP) functionality of the external python regex module.
Tim Biegeleisen's solution does not match duplicate pairs if there are duplicate triples etc in the same string.
In the linked question, Cary Swoveland's solutions do not work for duplicate pairs at the beginning or end of a string or match even when there is no duplicate in the string.
In the linked question, the fourth bird's solution does not match duplicate pairs at the beginning or end of strings.
Summary
So far the only answer which works is Wiktor Stribiżew's but this uses the (*SKIP) function of the external 'regex' module. Is a solution not possible using 'standard' regular expression syntax?
In Python re, the main problem with creating the right regex for this task is that you need to define the capturing group before using a backreference to it, while negative lookbehinds are usually placed before the captured pattern. Also, the Python testing option at regex101.com does not always reflect the current state of the re library, and it confuses users with a message like "This token can not be used in a lookbehind due to either making it non-fixed width or interfering with the pattern matching" when it sees a \1 in (?<!\1), while Python has allowed this since v3.5 for groups of fixed length.
The pattern you can use here is
(.)(?<!\1.)\1(?!\1)
See the regex demo.
Details
(.) - Capturing group 1: any single char (if re.DOTALL is used, even line break chars)
(?<!\1.) - a negative lookbehind that fails the match if there is the same char as captured in Group 1 and then any single char (we can use \1 instead of the . here, and it will work the same) immediately to the left of the current location
\1 - same char as in Group 1
(?!\1) - a negative lookahead that fails the match if there is the same char as in Group 1 immediately to the right of the current location.
See the Python test:
import re
tests = {'ffumlmqwfcsyqpss': ['ff', 'ss'],
         'zztdcqzqddaazdjp': ['zz', 'dd', 'aa'],
         'urrvucyrzzzooxhx': ['rr', 'oo'],
         'zettygjpcoedwyio': ['tt'],
         'dtfkgggvqadhqbwb': [],
         'rwgwbwzebsnjmtln': []
         }
for test, answer in tests.items():
    matches = [m.group() for m in re.finditer(r'(.)(?<!\1.)\1(?!\1)', test, re.DOTALL)]
    if matches:
        print(f"Matches found in '{test}': {matches}. Is the answer expected? {set(matches)==set(answer)}.")
    else:
        print(f"No match found in '{test}'. Is the answer expected? {set(matches)==set(answer)}.")
Output:
Matches found in 'ffumlmqwfcsyqpss': ['ff', 'ss']. Is the answer expected? True.
Matches found in 'zztdcqzqddaazdjp': ['zz', 'dd', 'aa']. Is the answer expected? True.
Matches found in 'urrvucyrzzzooxhx': ['rr', 'oo']. Is the answer expected? True.
Matches found in 'zettygjpcoedwyio': ['tt']. Is the answer expected? True.
No match found in 'dtfkgggvqadhqbwb'. Is the answer expected? True.
No match found in 'rwgwbwzebsnjmtln'. Is the answer expected? True.
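To see what each lookaround in the Details list contributes, here is a small sketch that runs three variants of the pattern against the 'ggg' string from the question. The labels in the dictionary are only for illustration.

```python
import re

sample = 'dtfkgggvqadhqbwb'  # contains 'ggg' but no isolated pair

# Each variant adds one more constraint to the plain "same char twice" pattern.
patterns = {
    'pair only': r'(.)\1',
    'pair + lookahead': r'(.)\1(?!\1)',
    'pair + lookbehind + lookahead': r'(.)(?<!\1.)\1(?!\1)',
}

spans = {}
for label, pattern in patterns.items():
    m = re.search(pattern, sample)
    spans[label] = m.span() if m else None
    print(f"{label}: {spans[label]}")
```

The first variant matches the first two g's at span (4, 6), the lookahead alone only shifts the match to the last two g's at (5, 7), and the full pattern finds nothing.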
You may use the following regex pattern:
^(?![a-z]*([a-z])\1{2,})[a-z]*([a-z])\2[a-z]*$
This pattern says to match:
^ start of the string
(?![a-z]*([a-z])\1{2,}) assert that no letter occurs 3 or more times in a row anywhere in the string
[a-z]* zero or more letters
([a-z]) capture a letter
\2 which is followed by the same letter
[a-z]* zero or more letters
$ end of the string
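A minimal sketch of how this whole-string pattern behaves on three of the question's examples. The variable names are made up for illustration.

```python
import re

whole_string = re.compile(r'^(?![a-z]*([a-z])\1{2,})[a-z]*([a-z])\2[a-z]*$')

samples = ['zettygjpcoedwyio', 'urrvucyrzzzooxhx', 'rwgwbwzebsnjmtln']

# True when the string has a doubled letter and no letter tripled (or longer).
results = {s: bool(whole_string.search(s)) for s in samples}

for s, ok in results.items():
    print(s, ok)
```

Note that 'urrvucyrzzzooxhx' is rejected because the lookahead forbids 'zzz' anywhere in the string, even though 'rr' and 'oo' are present. That is the limitation the question's author points out.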
Why doesn't \0 work (i.e. to return the full match) in Python regexp substitutions, i.e. with sub() or match.expand(), while match.group(0) does, and also \1, \2, ... ?
This simple example (executed in Python 3.7) says it all:
import re
subject = '123'
regexp_pattern = r'\d(2)\d'
expand_template_full = r'\0'
expand_template_group = r'\1'
regexp_obj = re.compile(regexp_pattern)
match = regexp_obj.search(subject)
if match:
    print('Full match, by method: {}'.format(match.group(0)))
    print('Full match, by template: {}'.format(match.expand(expand_template_full)))
    print('Capture group 1, by method: {}'.format(match.group(1)))
    print('Capture group 1, by template: {}'.format(match.expand(expand_template_group)))
The output from this is:
Full match, by method: 123
Full match, by template:
Capture group 1, by method: 2
Capture group 1, by template: 2
Is there any other sequence I can use in the replacement/expansion template to get the full match? If not, for the love of god, why?
Is this a Python bug?
Huh, you're right, that is annoying!
Fortunately, Python's way ahead of you. The docs for sub say this:
In string-type repl arguments, in addition to the character escapes and backreferences described above, \g<name> will use the substring matched by the group named name, as defined by the (?P<name>...) syntax. \g<number> uses the corresponding group number.... The backreference \g<0> substitutes in the entire substring matched by the RE.
So your code example can be:
import re
subject = '123'
regexp_pattern = r'\d(2)\d'
expand_template_full = r'\g<0>'
regexp_obj = re.compile(regexp_pattern)
match = regexp_obj.search(subject)
if match:
    print('Full match, by template: {}'.format(match.expand(expand_template_full)))
You also asked the far more interesting question of "why?". The rationale in the docs explains that you can use this to replace with more than 10 capture groups, because it's not clear whether \10 should be substituted with the 10th group, or with the first capture group followed by a zero, but doesn't explain why \0 doesn't work. I've not been able to find a PEP explaining the rationale, but here's my guess:
We want the repl argument to re.sub to use the same capture group backreferencing syntax as in regex matching. When regex matching, the concept of \0 "backreferencing" to the entire matched string is nonsensical; the hypothetical regex r'A\0' would match an infinitely long string of A characters and nothing else. So we cannot allow \0 to exist as a backreference. If you can't match with a backreference that looks like that, you shouldn't be able to replace with it either.
I can't say I agree with this logic, \g<> is already an arbitrary extension, but it's an argument that I can see someone making.
If you look into the docs, you will find the following:
The backreference \g<0> substitutes in the entire substring matched by the RE.
A bit deeper in the docs (dating back to 2003), you will find the following tip:
There is a group 0, which is the entire matched pattern, but it can't be referenced with \0; instead, use \g<0>.
So you need to follow this recommendation and use \g<0>:
expand_template_full = r'\g<0>'
Quoting from https://docs.python.org/3/library/re.html
\number
Matches the contents of the group of the same number. Groups are numbered starting from 1. For example, (.+) \1 matches 'the the' or '55 55', but not 'thethe' (note the space after the group). This special sequence can only be used to match one of the first 99 groups. If the first digit of number is 0, or number is 3 octal digits long, it will not be interpreted as a group match, but as the character with octal value number. Inside the '[' and ']' of a character class, all numeric escapes are treated as characters.
To summarize:
Use \1, \2 up to \99 provided no more digits are present after the numbered backreference
Use \g<0>, \g<1>, etc (not limited to 99) to robustly backreference a group
As far as I know, \g<0> is useful in the replacement section to refer to the entire matched portion, but it wouldn't make sense in the search section
If you use the third-party regex module, then (?0) is useful in the search section as well, for example to create recursively matching patterns
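To make the summary concrete, here is a short sketch using only the standard re module. The input 'a1b22' and the variable names are made up for illustration; the '123' example comes from the question.

```python
import re

# \g<0> stands for the whole match inside a replacement template.
wrapped = re.sub(r'\d+', r'[\g<0>]', 'a1b22')

m = re.search(r'\d(2)\d', '123')
whole = m.expand(r'\g<0>')            # entire match
group_then_zero = m.expand(r'\g<1>0')  # group 1 followed by a literal 0
nul = m.expand(r'\0')                  # parsed as an octal escape, not group 0

print(wrapped)          # a[1]b[22]
print(whole)            # 123
print(group_then_zero)  # 20
print(repr(nul))        # '\x00'
```

The last line is why the question's "Full match, by template" output looks empty: the template produced a NUL character rather than the match.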
I'm developing a calculator program in Python, and need to remove leading zeros from numbers so that calculations work as expected. For example, if the user enters "02+03" into the calculator, the result should return 5. In order to remove these leading zeroes in-front of digits, I asked a question on here and got the following answer.
self.answer = eval(re.sub(r"((?<=^)|(?<=[^\.\d]))0+(\d+)", r"\1\2", self.equation.get()))
I fully understand how the positive lookbehind to the beginning of the string and lookbehind to the non digit, non period character works. What I'm confused about is where in this regex code can I find the replacement for the matched patterns?
I found this online when researching regex expressions.
result = re.sub(pattern, repl, string, count=0, flags=0)
Where is the "repl" in the regex code above? If possible, could somebody please help to explain what the r"\1\2" is used for in this regex also?
Thanks for your help! :)
The "repl" part of the regex is this component:
r"\1\2"
In the "find" part of the regex, group capturing is taking place (ordinarily indicated by "()" characters around content, although this can be overridden by specific arguments).
In Python regex, the syntax used to indicate a reference to a positional captured group (sometimes called a "backreference") is "\n" (where "n" is a digit referring to the position of the group in the "find" part of the regex).
So, this regex is returning a string in which the overall content is being replaced specifically by parts of the input string matched by numbered groups.
Note: I don't believe the "\1" part of the "repl" is actually required. I think:
r"\2"
...would work just as well.
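A quick sketch to check that claim. Group 1 here consists only of lookbehinds, so it always captures an empty string. The input "100+0.05" is a made-up example and not from the question.

```python
import re

pattern = r"((?<=^)|(?<=[^\.\d]))0+(\d+)"

# Group 1 is always empty, so both templates give the same result.
with_group_1 = re.sub(pattern, r"\1\2", "02+03")
without_group_1 = re.sub(pattern, r"\2", "02+03")

# Zeros inside a number or after a decimal point are left alone.
untouched = re.sub(pattern, r"\2", "100+0.05")

print(with_group_1)     # 2+3
print(without_group_1)  # 2+3
print(untouched)        # 100+0.05
```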
Further reading: https://www.regular-expressions.info/brackets.html
Firstly, repl is the replacement: the text that each match will be replaced with.
To understand \1\2 you need to know what capture grouping is.
Check this video out for basics of Group capturing.
Here, your regex splits every match it finds into numbered groups (1, 2, and so on) because of the parentheses () you have placed in the regex.
In Python, \1 and \2 refer to them in the replacement string (some other languages use $1 and $2).
In this case, the regex replaces each number that has leading zeros with the digits after those zeros (captured by group 2).
Note: \1 is not necessary here; the substitution works fine without it.
See example:
>>> import re
>>> s='awd232frr2cr23'
>>> re.sub(r'\d', ' ', s)
'awd   frr cr  '
Explanation:
Here '\d' matches any single digit, so each digit is removed and replaced with repl (in this case ' ').
Say we have a string 1abcd1efg1hjk1lmn1 and want to find stuff between 1-s. What we do is
re.findall('1.*?1','1abcd1efg1hjk1lmn1')
and get two results
['1abcd1', '1hjk1']
ok I get that. But if we do
re.findall('1.*?1hj','1abcd1efg1hjk1lmn1')
why does it grab TWO intervals between 1s instead of one? Why do we get ['1abcd1efg1hj'] instead of ['1efg1hj']? Isn’t this what laziness is supposed to do?
Regex always tries to match the input string from left to right. Consider your '1.*?1hj' regex. The 1 in your regex matches the first 1 in the string, and the following .*? non-greedily matches all the characters up to the 1hj substring. That is why you got ['1abcd1efg1hj'] instead of ['1efg1hj'].
To get ['1efg1hj'] as output, you need to use a negated character class, as in 1[^1]*1hj
>>> s = "1abcd1efg1hjk1lmn1"
>>> re.findall(r'1.*?1hj', s)
['1abcd1efg1hj']
>>> re.findall(r'1[^1]*1hj', s)
['1efg1hj']
['1abcd1efg1hj']
You get this because it satisfies your regex. 1.*?1hj essentially means: start from a 1, then move lazily until you find a 1 followed by hj. The 1 in between is followed by ef, so 1hj cannot match there, but . consumes it and the scan continues. You don't get ['1efg1hj'] because that part of the string has already been consumed by the first match. Use a lookahead to see that both satisfy the conditions. See demo.
A lookahead does not consume the string, so you get both matches:
https://regex101.com/r/aQ3zJ3/5
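One way to sketch the lookahead idea in Python is below. This is an assumption about what such a pattern could look like, not necessarily the pattern used in the linked demo.

```python
import re

s = "1abcd1efg1hjk1lmn1"

# The lookahead itself matches an empty string at each position, so the scan
# can restart one character later; the capturing group records what was seen.
overlapping = re.findall(r'(?=(1.*?1hj))', s)

print(overlapping)  # ['1abcd1efg1hj', '1efg1hj']
```

Because nothing is consumed, the attempt starting at the second 1 is still made, which is how '1efg1hj' shows up alongside the longer match.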
I'm taking a beginning Python course, and am having problems trying to do a regex substitution.
The question states: Write a substitution command that will change names like file1, file2, etc. to file01, file02, etc. but will not add a zero to names like file10 or file20.
Here's my solution:
re.sub(r'(\D+)(\d)$',r'\10\2','file1')
As you can see, the 0 is messing with my \1 reference. Can anyone help me with an easy solution? Thanks!
import re
print(re.sub(r'(\D+)(\d)$', r'\g<1>0\2', 'file1'))
Don't ask.. just do the \g<#> thing and it'll work fine in python. Other languages have the same issue:
http://resbook.wordpress.com/2011/01/04/regex-with-back-references-followed-by-number/
I don't know Python, but in your regex you want just one digit, not two.
For the match, you can do it like this:
.+[^\d]\d$
test1 will match
test10 will not match
Good luck
@sdanzig has the correct answer, but if you insist on asking, it is actually a documented feature:
http://docs.python.org/2/library/re.html
Read the last paragraph for re.sub().
In string-type repl arguments, in addition to the character escapes
and backreferences described above, \g<name> will use the substring
matched by the group named name, as defined by the (?P<name>...)
syntax. \g<number> uses the corresponding group number; \g<2> is
therefore equivalent to \2, but isn’t ambiguous in a replacement such
as \g<2>0. \20 would be interpreted as a reference to group 20, not a
reference to group 2 followed by the literal character '0'. The
backreference \g<0> substitutes in the entire substring matched by the
RE.
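Here is a short sketch of the difference described in that paragraph, applied to the file-renaming exercise. The variable names are made up for illustration.

```python
import re

pattern = r'(\D+)(\d)$'

# \g<1>0 is unambiguous: group 1, then a literal 0, then group 2.
padded = re.sub(pattern, r'\g<1>0\2', 'file1')
unchanged = re.sub(pattern, r'\g<1>0\2', 'file10')  # two digits: no match

# \10 is read as a reference to group 10, which this pattern does not have.
try:
    re.sub(pattern, r'\10\2', 'file1')
    ambiguous_error = None
except re.error as exc:
    ambiguous_error = str(exc)

print(padded)           # file01
print(unchanged)        # file10
print(ambiguous_error)
```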