Custom named regex backreference in Python re.sub [duplicate] - python

I have a string S = '02143' and a list A = ['a','b','c','d','e']. I want to replace all those digits in 'S' with their corresponding element in list A.
For example, replace 0 with A[0], 2 with A[2] and so on. Final output should be S = 'acbed'.
I tried:
S = re.sub(r'([0-9])', A[int(r'\g<1>')], S)
However, this gives an error: ValueError: invalid literal for int() with base 10: '\\g<1>'. I guess it is treating the backreference '\g<1>' as a plain string. How can I solve this, preferably using re.sub and capture groups, or otherwise in some alternative way?

The reason re.sub(r'([0-9])', A[int(r'\g<1>')], S) does not work is that the \g<1> backreference (an unambiguous way of writing the first backreference, otherwise written \1) only works inside the replacement string pattern. If you pass it to another function, that function sees just the literal string \g<1>, because the re module has no chance to evaluate it at that point. The re engine only expands it during a match, but the A[int(r'\g<1>')] expression is evaluated before the re engine even attempts to find a match.
That is why it is made possible to use callback methods inside re.sub as the replacement argument: you may pass the matched group values to any external methods for advanced manipulation.
See the re documentation:
re.sub(pattern, repl, string, count=0, flags=0)
If repl is a function, it is called for every non-overlapping
occurrence of pattern. The function takes a single match object
argument, and returns the replacement string.
Use
import re
S = '02143'
A = ['a','b','c','d','e']
print(re.sub(r'[0-9]',lambda x: A[int(x.group())],S))
Note that you do not need to capture the whole pattern with parentheses; you can access the whole match with x.group().
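As a minimal sketch of the callback approach, the version below keeps the capture group from the question and reads it with match.group(1). The function name lookup and the non-regex alternative are illustrative additions, not part of the original answer, and both assume every digit in S is a valid index into A.

```python
import re

S = '02143'
A = ['a', 'b', 'c', 'd', 'e']

def lookup(match):
    # match.group(1) is the digit captured by the parentheses in the pattern
    return A[int(match.group(1))]

# The callback is invoked once per match, after the engine has found it
via_regex = re.sub(r'([0-9])', lookup, S)

# Non-regex alternative; assumes S contains only digits that index into A
via_join = ''.join(A[int(ch)] for ch in S)

print(via_regex)  # acbed
print(via_join)   # acbed
```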

Related

Is there an equivalent of $` in Javascript's `replace()` for Python's re.sub()?

In JS, you can use
$` Inserts the portion of the string that precedes the matched substring.
$' Inserts the portion of the string that follows the matched substring.
To get the substring before and after the match.
Is there an equivalent of this in Python's re.sub()?
Instead of a replacement string, you can pass a function to re.sub. The function will receive a match object, and should return the replacement for the match.
Within the function, you can use match.start() and match.end() to get the start and end indices of the match in the original string, and match.string to get the original string passed to re.sub. Thus,
match.string[:match.start()]
gives the effect of $`, and
match.string[match.end():]
gives the effect of $'.
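A small sketch of how those two slices could be used inside a replacement function. The function name show_context and the bracketed output format are made up for illustration.

```python
import re

def show_context(match):
    before = match.string[:match.start()]  # plays the role of $` in JavaScript
    after = match.string[match.end():]     # plays the role of $' in JavaScript
    return f'[{before}|{after}]'

result = re.sub(r'-', show_context, 'ab-cd')
print(result)  # ab[ab|cd]cd
```

Note that match.string is always the full original input, so with several matches each call sees the text before and after its own match in the unmodified string.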

Is there a way in python to replace a string but leave a middle character intact?

Is there any way to replace all occurrences of a string in a file, while leaving an unknown character in the middle of the string intact?
For example, replacing the string 'ab{unknown}cde' with '(ab{unknown}cde)'
That's not a replacement so much as wrapping a matched substring in parentheses.
>>> re.sub('(ab.cde)', r'(\1)', '123abxcde456')
'123(abxcde)456'
The pattern is the regular expression ab.cde. The parentheses in the pattern indicate that the entire match is a capture group. The replacement text is a pair of parentheses containing whatever the (first) group matched.
Instead of replacement text, you can also specify a function that receives the result of the regular expression match. This lets you, if nothing else, avoid explicitly defining a capture group in the regular expression.
def surround(m):
    return f'({m.group()})'
new_str = re.sub('ab.cde', surround, '123abxcde456')
assert new_str == '123(abxcde)456'
You can use a regex that matches the known parts literally and uses a wildcard for the unknown character. In your example:
from re import sub
replaced = sub('(ab.cde)', r'(\1)', 'asdasdab5cdeasdasd')
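Another option, sketched here with a made-up sample string, is the \g<0> backreference, which stands for the whole match and so avoids both the capture group and the helper function. Every occurrence is wrapped, since re.sub replaces all non-overlapping matches by default.

```python
import re

text = '123abxcde456 and abZcde'

# \g<0> refers to the entire match, so the pattern needs no parentheses
wrapped = re.sub(r'ab.cde', r'(\g<0>)', text)

print(wrapped)  # 123(abxcde)456 and (abZcde)
```

Keep in mind that . does not match a newline unless re.DOTALL is passed, which may matter when the text comes from a file.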

re.fullmatch equivalent in pandas text handling [duplicate]

I'm trying to check if a string is a number, so the regex "\d+" seemed good. However, that regex also matches "78.46.92.168:8000" for some reason, which I do not want. A little bit of code:
class Foo():
    _rex = re.compile("\d+")
    def bar(self, string):
        m = self._rex.match(string)
        if m is not None:
            doStuff()
And doStuff() is called when the IP address is entered. I'm kind of confused: how does "." or ":" match "\d"?
\d+ matches any positive number of digits within your string, so it matches the first 78 and succeeds.
Use ^\d+$.
Or, even better: "78.46.92.168:8000".isdigit()
There are a couple of options in Python to match an entire input with a regex.
Python 2 and 3
In Python 2 and 3, you may use (with s being the input string)
re.match(r'\d+$', s) # re.match anchors the match at the start of the string, so $ is what remains to add
or, to avoid matching before the final \n in the string:
re.match(r'\d+\Z', s) # \Z will only match at the very end of the string
Or use the re.search method, which does not anchor the match at the start of the string and therefore needs the ^ / \A start-of-string anchor:
re.search(r'^\d+$', s)
re.search(r'\A\d+\Z', s)
Note that \A is an unambiguous string start anchor, its behavior cannot be redefined with any modifiers (re.M / re.MULTILINE can only redefine the ^ and $ behavior).
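A short sketch of how these anchors differ in practice. The sample strings are made up for illustration: one has a digits-only second line, the other a trailing newline.

```python
import re

text = 'abc\n123'

# Without re.M, ^ matches only at the very start, so the digits line is not found
plain = bool(re.search(r'^\d+$', text))
# With re.M, ^ and $ also match at line boundaries
multiline = bool(re.search(r'^\d+$', text, re.M))
# \A and \Z ignore re.M and still refer to the whole string
strict = bool(re.search(r'\A\d+\Z', text, re.M))

# $ tolerates one trailing newline, \Z does not
dollar_newline = bool(re.match(r'\d+$', '1234\n'))
z_newline = bool(re.match(r'\d+\Z', '1234\n'))

print(plain, multiline, strict)   # False True False
print(dollar_newline, z_newline)  # True False
```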
Python 3
All the cases described in the section above, plus one more useful method, re.fullmatch (added in Python 3.4; also present in the PyPI regex module):
If the whole string matches the regular expression pattern, return a corresponding match object. Return None if the string does not match the pattern; note that this is different from a zero-length match.
So, after you compile the regex, just use the appropriate method:
_rex = re.compile(r"\d+")
if _rex.fullmatch(s):
    doStuff()
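Applied to the class from the question, this could look roughly like the sketch below. It assumes Python 3.4 or later, and bar returns a boolean instead of calling doStuff() so that the behaviour is easy to check.

```python
import re

class Foo:
    _rex = re.compile(r'\d+')

    def bar(self, string):
        # fullmatch succeeds only when the entire string consists of digits
        return self._rex.fullmatch(string) is not None

foo = Foo()
print(foo.bar('1234'))               # True
print(foo.bar('78.46.92.168:8000'))  # False
print(foo.bar('1234\n'))             # False
```

Unlike a pattern ending in $, fullmatch also rejects a string with a trailing newline.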
re.match() always matches from the start of the string (unlike re.search()) but allows the match to end before the end of the string.
Therefore, you need an anchor: compiling the pattern as r"\d+$" and then calling _rex.match(string) would work.
To be more explicit, you could also compile r"^\d+$" (the ^ is redundant with match()) or drop match() altogether and call _rex.search(string) with the pattern r"^\d+$".
\Z matches the end of the string while $ matches the end of the string or just before the newline at the end of the string, and exhibits different behaviour in re.MULTILINE. See the syntax documentation for detailed information.
>>> s="1234\n"
>>> re.search("^\d+\Z",s)
>>> s="1234"
>>> re.search("^\d+\Z",s)
<_sre.SRE_Match object at 0xb762ed40>
Change it from \d+ to ^\d+$

Modifying a regular expression by another by adding something to it

I am trying to modify my regex using replace. What I ultimately want to do is add 01/ in front of my existing pattern. It is literally replacing one pattern with another.
Here is what I am doing with replace:
df['found_d'].str.replace(pattern2, '1/'+pattern2)
#must be str, not _sre.SRE_Pattern
I would like to use sub, but it takes 3 arguments and I am not sure how to use it at this point.
Here is an expected input:
df['found_d']= 01/07/91 or 01/07/1991
I need to add a missing date to my pattern.
No need for callables, re provides dedicated means to access the matched text during replacement.
In order to append a literal 01/ to a pattern match, use a \g<0> unambiguous backreference to the whole pattern in the replacement pattern rather than using the regex pattern:
df['found_d'] = df['found_d'].str.replace(pattern2, r'01/\g<0>')
Starting from version 0.20, pandas str.replace can accept a callable that will receive a match object. For example, if a column has a pattern of 2 uppercase letters followed by 2 decimal digits and you want to swap them with a colon in between, you could use:
df['col'] = df['col'].str.replace(r'([A-Z]{2})([0-9]{2})',
                                  lambda m: "{}:{}".format(m.group(2), m.group(1)))
This gives you the full power of the re module inside pandas, here turning 'AB12' into '12:AB'.
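Both replacement styles can be tried with the plain re module, without a DataFrame. In the sketch below, pattern2 is a hypothetical stand-in (a two-digit month followed by a two- to four-digit year), since the question does not show the real pattern, and the sample values are made up.

```python
import re

# Hypothetical stand-in for pattern2: month/year such as 07/91 or 07/1991
pattern2 = re.compile(r'\d{2}/\d{2,4}')

values = ['07/91', '07/1991']

# \g<0> is the whole match, so the literal 01/ is simply placed in front of it
with_day = [pattern2.sub(r'01/\g<0>', v) for v in values]
print(with_day)  # ['01/07/91', '01/07/1991']

# Callable replacement: swap the two groups and join them with a colon
swapped = re.sub(r'([A-Z]{2})([0-9]{2})',
                 lambda m: '{}:{}'.format(m.group(2), m.group(1)),
                 'AB12')
print(swapped)  # 12:AB
```

In pandas the same replacement string or callable would be passed to Series.str.replace. Recent pandas versions default to regex=False, so regex=True probably has to be passed explicitly there.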

Invalid group reference in python 2.7+

I am trying to convert all WikiLink-type strings in my webpage (created in Django) to HTML links.
I am using the following expression
import re
expr = r'\s+[A-Z][a-z]+[A-Z][a-z]+\s'
repl=r'\1'
mystr = 'this is a string to Test whether WikiLink will work ProPerly'
parser=re.compile(expr)
parser.sub(repl, mystr)
This returns me the following string with hex value replaced for the string.
"this is a string to Test whether<a href='/mywiki/\x01>\x01</a>'will work<a href='/mywiki/\x01>\x01</a>'"
Looking at the Python help for re.sub, I tried changing \1 to \g<1>, but that results in an invalid group reference error.
Please help me understand how to get this working
The problem here is that you don't have any captured groups in the expr.
Whatever part of the match you want to show up as \1, you need to put in parentheses. For example:
>>> expr = r'\s+([A-Z][a-z]+[A-Z][a-z]+)\s'
>>> parser=re.compile(expr)
>>> parser.sub(repl, mystr)
'this is a string to Test whetherWikiLinkwill work ProPerly'
The backreference \1 refers to group 1 within the match, which is the part that matched the first parenthesized subexpression. Likewise, \2 is group 2, the part that matched the second parenthesized subexpression, and so on. If you use \1 when the pattern has no group 1, some regexp engines give you an error and others insert a literal character. In Python, a raw r'\1' raises an invalid group reference error, whereas a non-raw '\1' is the string-literal escape for ctrl-A, whose canonical representation is '\x01'; that character is inserted literally, so that's why you see it that way.
Group 0 is the entire match. But that's not what you want in this case, because you don't want the spaces to be part of the substitution.
The only reason you need the g syntax is when a simple backreference is ambiguous. For example, if the replacement string were 123\1456, there's no way to tell whether that means 123, followed by group 1, followed by 456, or 123 followed by group 1456, or…
Further reading on grouping and backreferences.
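Putting the pieces together, a sketch of the corrected substitution might look like this. The anchor markup and the /mywiki/ prefix are assumptions inferred from the output shown in the question, and the spaces around the tag are re-added because the pattern consumes them.

```python
import re

mystr = 'this is a string to Test whether WikiLink will work ProPerly'

# The parentheses make the WikiLink word group 1; the surrounding whitespace stays outside it
expr = re.compile(r'\s+([A-Z][a-z]+[A-Z][a-z]+)\s')

# Assumed link format; a raw string keeps \1 as a backreference rather than chr(1)
linked = expr.sub(r" <a href='/mywiki/\1'>\1</a> ", mystr)
print(linked)

# \g<1> is needed when a digit follows the group reference
unambiguous = re.sub(r'(b)', r'\g<1>0', 'abc')
print(unambiguous)  # ab0c

# r'\10' is read as group 10, which does not exist in this pattern
try:
    re.sub(r'(b)', r'\10', 'abc')
    ambiguous_failed = False
except re.error:
    ambiguous_failed = True
print(ambiguous_failed)  # True
```

ProPerly is left alone here because the pattern requires a whitespace character after the word and the string ends right after it.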
