Invalid group reference in python 2.7+ - python

I am trying to convert all WikiLink type of strings in my webpage(created in django) to html links.
I am using the following expression
import re
expr = r'\s+[A-Z][a-z]+[A-Z][a-z]+\s'
repl=r'\1'
mystr = 'this is a string to Test whether WikiLink will work ProPerly'
parser=re.compile(expr)
parser.sub(repl, mystr)
This returns me the following string with hex value replaced for the string.
"this is a string to Test whether<a href='/mywiki/\x01>\x01</a>'will work<a href='/mywiki/\x01>\x01</a>'"
Looking at the python help for re.sub, I tried changing \1 to \g<1> but that results in a invalid group reference error.
Please help me understand how to get this working

The problem here is that you don't have any captured groups in the expr.
Whatever part of the match you want to show up as \1, you need to put in parentheses. For example:
>>> expr = r'\s+([A-Z][a-z]+[A-Z][a-z]+)\s'
>>> parser=re.compile(expr)
>>> parser.sub(repl, mystr)
'this is a string to Test whetherWikiLinkwill work ProPerly'
The backreference \1 refers to the group 1 within the match, which is the part that matched the first parenthesized subexpression. Likewise, \2 is group 2, the part that matched the second parenthesized subexpression, and so on. If you use \1 when you have fewer than 1 group, some regexp engines will give you an error, others will use a literal '\1' character, a ctrl-A; Python does the latter, and the canonical representation of ctrl-A is '\x01', so that's why you see it that way.
Group 0 is the entire match. But that's not what you want in this case, because you don't want the spaces to be part of the substitution.
The only reason you need the g syntax is when a simple backreference is ambiguous. For example, if sub were 123\1456, there's no way to tell whether that means 123, followed by group 1, followed by 456, or 123 followed by group 1456, or…
Further reading on grouping and backreferences.

Related

Custom named regex backreference in Python re.sub [duplicate]

I have a string S = '02143' and a list A = ['a','b','c','d','e']. I want to replace all those digits in 'S' with their corresponding element in list A.
For example, replace 0 with A[0], 2 with A[2] and so on. Final output should be S = 'acbed'.
I tried:
S = re.sub(r'([0-9])', A[int(r'\g<1>')], S)
However this gives an error ValueError: invalid literal for int() with base 10: '\\g<1>'. I guess it is considering backreference '\g<1>' as a string. How can I solve this especially using re.sub and capture-groups, else alternatively?
The reason the re.sub(r'([0-9])',A[int(r'\g<1>')],S) does not work is that \g<1> (which is an unambiguous representation of the first backreference otherwise written as \1) backreference only works when used in the string replacement pattern. If you pass it to another method, it will "see" just \g<1> literal string, since the re module won't have any chance of evaluating it at that time. re engine only evaluates it during a match, but the A[int(r'\g<1>')] part is evaluated before the re engine attempts to find a match.
That is why it is made possible to use callback methods inside re.sub as the replacement argument: you may pass the matched group values to any external methods for advanced manipulation.
See the re documentation:
re.sub(pattern, repl, string, count=0, flags=0)
If repl is a function, it is called for every non-overlapping
occurrence of pattern. The function takes a single match object
argument, and returns the replacement string.
Use
import re
S = '02143'
A = ['a','b','c','d','e']
print(re.sub(r'[0-9]',lambda x: A[int(x.group())],S))
See the Python demo
Note you do not need to capture the whole pattern with parentheses, you can access the whole match with x.group().

Why doesn't \0 work in Python regexp substitutions, i.e. with sub() or expand(), while match.group(0) does, and also \1, \2, ...?

Why doesn't \0 work (i.e. to return the full match) in Python regexp substitutions, i.e. with sub() or match.expand(), while match.group(0) does, and also \1, \2, ... ?
This simple example (executed in Python 3.7) says it all:
import re
subject = '123'
regexp_pattern = r'\d(2)\d'
expand_template_full = r'\0'
expand_template_group = r'\1'
regexp_obj = re.compile(regexp_pattern)
match = regexp_obj.search(subject)
if match:
print('Full match, by method: {}'.format(match.group(0)))
print('Full match, by template: {}'.format(match.expand(expand_template_full)))
print('Capture group 1, by method: {}'.format(match.group(1)))
print('Capture group 1, by template: {}'.format(match.expand(expand_template_group)))
The output from this is:
Full match, by method: 123
Full match, by template:
Capture group 1, by method: 2
Capture group 1, by template: 2
Is there any other sequence I can use in the replacement/expansion template to get the full match? If not, for the love of god, why?
Is this a Python bug?
Huh, you're right, that is annoying!
Fortunately, Python's way ahead of you. The docs for sub say this:
In string-type repl arguments, in addition to the character escapes and backreferences described above, \g<name> will use the substring matched by the group named name, as defined by the (?P<name>...) syntax. \g<number> uses the corresponding group number.... The backreference \g<0> substitutes in the entire substring matched by the RE.
So your code example can be:
import re
subject = '123'
regexp_pattern = r'\d(2)\d'
expand_template_full = r'\g<0>'
regexp_obj = re.compile(regexp_pattern)
match = regexp_obj.search(subject)
if match:
print('Full match, by template: {}'.format(match.expand(expand_template_full)))
You also asked the far more interesting question of "why?". The rationale in the docs explains that you can use this to replace with more than 10 capture groups, because it's not clear whether \10 should be substituted with the 10th group, or with the first capture group followed by a zero, but doesn't explain why \0 doesn't work. I've not been able to find a PEP explaining the rationale, but here's my guess:
We want the repl argument to re.sub to use the same capture group backreferencing syntax as in regex matching. When regex matching, the concept of \0 "backreferencing" to the entire matched string is nonsensical; the hypothetical regex r'A\0' would match an infinitely long string of A characters and nothing else. So we cannot allow \0 to exist as a backreference. If you can't match with a backreference that looks like that, you shouldn't be able to replace with it either.
I can't say I agree with this logic, \g<> is already an arbitrary extension, but it's an argument that I can see someone making.
If you will look into docs, you will find next:
The backreference \g<0> substitutes in the entire substring matched by the RE.
A bit more deep in docs (back in 2003) you will find next tip:
There is a group 0, which is the entire matched pattern, but it can't be referenced with \0; instead, use \g<0>.
So, you need to follow this recommendations and use \g<0>:
expand_template_full = r'\g<0>'
Quoting from https://docs.python.org/3/library/re.html
\number
Matches the contents of the group of the same number. Groups are numbered starting from 1. For example, (.+) \1 matches 'the the' or '55 55', but not 'thethe' (note the space after the group). This special sequence can only be used to match one of the first 99 groups. If the first digit of number is 0, or number is 3 octal digits long, it will not be interpreted as a group match, but as the character with octal value number. Inside the '[' and ']' of a character class, all numeric escapes are treated as characters.
To summarize:
Use \1, \2 up to \99 provided no more digits are present after the numbered backreference
Use \g<0>, \g<1>, etc (not limited to 99) to robustly backreference a group
as far as I know, \g<0> is useful in replacement section to refer to entire matched portion but wouldn't make sense in search section
if you use the 3rd party regex module, then (?0) is useful in search section as well, for example to create recursively matching patterns

Need a specific explanation of part of a regex code

I'm developing a calculator program in Python, and need to remove leading zeros from numbers so that calculations work as expected. For example, if the user enters "02+03" into the calculator, the result should return 5. In order to remove these leading zeroes in-front of digits, I asked a question on here and got the following answer.
self.answer = eval(re.sub(r"((?<=^)|(?<=[^\.\d]))0+(\d+)", r"\1\2", self.equation.get()))
I fully understand how the positive lookbehind to the beginning of the string and lookbehind to the non digit, non period character works. What I'm confused about is where in this regex code can I find the replacement for the matched patterns?
I found this online when researching regex expressions.
result = re.sub(pattern, repl, string, count=0, flags=0)
Where is the "repl" in the regex code above? If possible, could somebody please help to explain what the r"\1\2" is used for in this regex also?
Thanks for your help! :)
The "repl" part of the regex is this component:
r"\1\2"
In the "find" part of the regex, group capturing is taking place (ordinarily indicated by "()" characters around content, although this can be overridden by specific arguments).
In python regex, the syntax used to indicate a reference to a positional captured group (sometimes called a "backreference") is "\n" (where "n" is a digit refering to the position of the group in the "find" part of the regex).
So, this regex is returning a string in which the overall content is being replaced specifically by parts of the input string matched by numbered groups.
Note: I don't believe the "\1" part of the "repl" is actually required. I think:
r"\2"
...would work just as well.
Further reading: https://www.regular-expressions.info/brackets.html
Firstly, repl includes what you are about to replace.
To understand \1\2 you need to know what capture grouping is.
Check this video out for basics of Group capturing.
Here , since your regex splits every match it finds into groups which are 1,2... so on. This is so because of the parenthesis () you have placed in the regex.
$1 , $2 or \1,\2 can be used to refer to them.
In this case: The regex is replacing all numbers after the leading 0 (which is caught by group 2) with itself.
Note: \1 is not necessary. works fine without it.
See example:
>>> import re
>>> s='awd232frr2cr23'
>>> re.sub('\d',' ',s)
'awd frr cr '
>>>
Explanation:
As it is, '\d' is for integer so removes them and replaces with repl (in this case ' ').

Why can't I match the last part of my regular expression in python?

I want to match a sentence with an optional end 'other (\\w+)'. For example, the regular expression should match both sentence as follows and extract the word 'things':
The apple and other things.
The apple is big.
I wrote a regular expression as below. However, I got a result (None,). If I remove the last ?. I will get the right answer. Why?
>>> re.search('\w+(?: other (\\w+))?', 'A and other things').groups()
(None,)
>>> re.search('\w+(?: other (\\w+))', 'A and other things').groups()
('things',)
If you use:
re.search(r'\w+(?: other (\w+))?', 'A and other things').group()
You will see what is happening. Since anything after \w+ is optional your search matches first word A.
As per official documentation:
.groups()
Return a tuple containing all the subgroups of the match, from 1 up to however many groups are in the pattern.
And your search call doesn't return any subgroup hence you get:
re.search(r'\w+(?: other (\w+))?', 'A and other things').groups()
(None,)
To solve your problem you can use this alternation based regex:
r'\w+(?: other (\w+)|$)'
Examples:
>>> re.search(r'\w+(?: other (\w+)|$)', 'A and other things').group()
'and'
>>> re.search(r'\w+(?: other (\w+)|$)', 'The apple is big').group()
'big'
The rule for regular expression searches is that they produce the leftmost longest match. Yes, it tries to give you longer matches if possible, but most importantly, when it finds the first successful match, it will stop looking further.
In the first regular expression, the leftmost point where \w+ matches is A. The optional portion doesn't match there, so it's done.
In the second regular expression, the parenthesized expression is mandatory, so A is not a match. Therefore, it continues looking. The \w+ matches and, then the second \\w+ matches things.
Note that for regular expressions in Python, especially those containing backslashes, it's a good idea to write them using r'raw strings'.

Is there a way to refer to the entire matched expression in re.sub without the use of a group?

Suppose I want to prepend all occurrences of a particular expression with a character such as \.
In sed, it would look like this.
echo '__^^^%%%__FooBar' | sed 's/[_^%]/\\&/g'
Note that the & character is used to represent the original matched expression.
I have looked through the regex docs and the regex howto, but I do not see an equivalent to the & character that can be used to substitute in the matched expression.
The only workaround I have found is to use the an extra set of () to group the expression and then refernece the group, as follows.
import re
line = "__^^^%%%__FooBar"
print re.sub("([_%^$])", r"\\\1", line)
Is there a clean way to reference the entire matched expression without the extra group creation?
From the docs:
The backreference \g<0> substitutes in the entire substring matched by the RE.
Example:
>>> print re.sub("[_%^$]", r"\\\g<0>", line)
\_\_\^\^\^\%\%\%\_\_FooBar
You could get the result also by using Positive lookahead .
>>> print re.sub("(?=[_%^$])", r"\\", line)
\_\_\^\^\^\%\%\%\_\_FooBar

Categories