Python unescaping string in regex replacements

The output of the code below:
import re
rpl = 'This is a nicely escaped newline \\n'
my_string = 'I hope this apple is replaced with a nicely escaped string'
reg = re.compile('apple')
reg.sub(rpl, my_string)
..is:
'I hope this This is a nicely escaped newline \n is replaced with a nicely escaped string'
..so when printed:
I hope this This is a nicely escaped newline
is replaced with a nicely escaped string
So Python is unescaping the replacement string when it substitutes it for 'apple' in the other string? For now I've just done
reg.sub( rpl.replace('\\','\\\\'), my_string )
Is this safe? Is there a way to stop Python from doing that?

From help(re.sub) [emphasis mine]:
sub(pattern, repl, string, count=0, flags=0)
Return the string obtained by replacing the leftmost
non-overlapping occurrences of the pattern in string by the
replacement repl. repl can be either a string or a callable;
if a string, backslash escapes in it are processed. If it is
a callable, it's passed the match object and must return
a replacement string to be used.
One way to get around this is to pass a lambda:
>>> reg.sub(rpl, my_string )
'I hope this This is a nicely escaped newline \n is replaced with a nicely escaped string'
>>> reg.sub(lambda x: rpl, my_string )
'I hope this This is a nicely escaped newline \\n is replaced with a nicely escaped string'
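As a quick check that the two workarounds agree, here is a minimal sketch that reuses the question's strings. It compares the doubled-backslash replacement from the question with the callable shown above; the variable names processed, doubled and via_callable are made up for illustration.

```python
import re

rpl = 'This is a nicely escaped newline \\n'   # ends with backslash + "n" (two characters)
my_string = 'I hope this apple is replaced with a nicely escaped string'
reg = re.compile('apple')

# String repl: re processes the backslash escape, so backslash + "n" becomes a real newline.
processed = reg.sub(rpl, my_string)

# Workaround 1: double every backslash so re collapses each pair back into one backslash.
doubled = reg.sub(rpl.replace('\\', '\\\\'), my_string)

# Workaround 2: the return value of a callable is inserted as-is.
via_callable = reg.sub(lambda m: rpl, my_string)

print('\n' in processed)          # True
print(doubled == via_callable)    # True
print(rpl in doubled)             # True
```

Both workarounds keep the backslash and the "n" as two separate characters, so doubling the backslashes appears to be as safe as the callable for this case.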

The re module processes backslash escapes itself, in both search patterns and replacement strings, on top of the escape processing Python already applies to ordinary string literals. This is why the r prefix is generally used with regex patterns in Python: it reduces the amount of "backwhacking" needed to write usable patterns.
The r prefix appears before a string literal and makes every \ character verbatim (a backslash before the string delimiter keeps the quote from ending the literal, but stays in the string). So, r'\\' == '\\\\', and r'\n' == '\\n'.
Writing your example as
rpl = r'This is a nicely escaped newline \\n'
my_string = 'I hope this apple is replaced with a nicely escaped string'
reg = re.compile(r'apple')
reg.sub( rpl, my_string )
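works as expected: the raw literal keeps both backslashes, and re.sub collapses them into a single literal backslash followed by n.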
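The equalities quoted above can be checked directly. The sketch below also runs a shortened version of the example; the input string 'an apple a day' is made up for illustration.

```python
import re

# Raw and regular literals that denote the same strings.
assert r'\\' == '\\\\'      # two backslash characters
assert r'\n' == '\\n'       # backslash followed by "n"
assert len(r'\n') == 2
assert len('\n') == 1       # a single newline character

# The raw replacement holds backslash, backslash, "n"; re.sub turns that
# into one literal backslash followed by "n".
rpl = r'This is a nicely escaped newline \\n'
result = re.sub(r'apple', rpl, 'an apple a day')
print(result)   # an This is a nicely escaped newline \n a day
```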

Related

Python pattern matching with 'r' prefix [duplicate]

I don't understand the logic behind how the escape operator \ works in Python regex together with the r' prefix of raw strings.
Some help is appreciated.
code:
import re
text=' esto .es 10 . er - 12 .23 with [ and.Other ] here is more ; puntuation'
print('text0=',text)
text1 = re.sub(r'(\s+)([;:\.\-])', r'\2', text)
text2 = re.sub(r'\s+\.', '\.', text)
text3 = re.sub(r'\s+\.', r'\.', text)
print('text1=',text1)
print('text2=',text2)
print('text3=',text3)
The theory says:
backslash character ('\') to indicate special forms or to allow special characters to be used without invoking their special meaning.
And as far as the link at the end of this question explains, the r prefix marks a raw string, i.e. backslashes have no special meaning and the string is taken as written.
So in the above code I would expect text2 and text3 to be different, since the substitution text in text2 is '\.', i.e. a period, whereas (in principle) the substitution text in text3 is r'\.', a raw string, so it should appear as written: backslash and period. But they give the same result:
The result is:
text0= esto .es 10 . er - 12 .23 with [ and.Other ] here is more ; puntuation
text1= esto.es 10. er- 12.23 with [ and.Other ] here is more; puntuation
text2= esto\.es 10\. er - 12\.23 with [ and.Other ] here is more ; puntuation
text3= esto\.es 10\. er - 12\.23 with [ and.Other ] here is more ; puntuation
#text2=text3 but substitutions are not the same r'\.' vs '\.'
It looks to me that the r' does not work the same way in substitution part, nor the backslash. On the other hand my intuition tells me I am missing something here.
EDIT 1:
Following @Wiktor Stribiżew's comment.
He pointed out that (following his link):
import re
print(re.sub(r'(.)(.)(.)(.)(.)(.)', 'a\6b', '123456'))
print(re.sub(r'(.)(.)(.)(.)(.)(.)', r'a\6b', '123456'))
# in my example the substitutions were not the same and the result were equal
# here indeed r' changes the results
which gives:
ab
a6b
that puzzles me even more.
Note:
I read this Stack Overflow question about raw strings, which is very complete. Nevertheless, it does not cover substitutions.

First and foremost,
replacement patterns ≠ regular expression patterns
We use a regex pattern to search for matches, we use replacement patterns to replace matches found with regex.
NOTE: The only special character in a substitution pattern is a backslash, \. Only the backslash must be doubled.
Replacement pattern syntax in Python
The re.sub docs are confusing as they mention both string escape sequences that can be used in replacement patterns (like \n, \r) and regex escape sequences (\6) and those that can be used as both regex and string escape sequences (\&).
I am using the term regex escape sequence to denote an escape sequence consisting of a literal backslash + a character, that is, '\\X' or r'\X', and a string escape sequence to denote a sequence of \ and a char or some sequence that together form a valid string escape sequence. They are only recognized in regular string literals. In raw string literals, you can only escape " (and that is the reason why you can't end a raw string literal with \", but the backslash is still part of the string then).
So, in a replacement pattern, you may use backreferences:
re.sub(r'\D(\d)\D', r'\1', 'a1b') # => 1
re.sub(r'\D(\d)\D', '\\1', 'a1b') # => 1
re.sub(r'\D(\d)\D', '\g<1>', 'a1b') # => 1
re.sub(r'\D(\d)\D', r'\g<1>', 'a1b') # => 1
You may see that r'\1' and '\\1' are the same replacement pattern, \1. If you use '\1', it will get parsed as a string escape sequence, a character with octal value 001. If you forget to use the r prefix with the unambiguous backreference, the result is still correct because \g is not a valid string escape sequence, so the \ character remains in the string (recent Python versions do emit a warning for such invalid escape sequences, though). Read on in the docs I linked to:
Unlike Standard C, all unrecognized escape sequences are left in the string unchanged, i.e., the backslash is left in the result.
So, when you pass '\.' as a replacement string, you actually send \. two-char combination as the replacement string, and that is why you get \. in the result.
\ is a special character in Python replacement pattern
If you use re.sub(r'\s+\.', r'\\.', text), you will get the same result as in the text2 and text3 cases.
That happens because \\, two literal backslashes, denote a single backslash in the replacement pattern. If you have no Group 2 in your regex pattern, but pass r'\2' in the replacement to actually replace with \ and 2 char combination, you would get an error.
Thus, when you have dynamic, user-defined replacement patterns you need to double all backslashes in the replacement patterns that are meant to be passed as literal strings:
re.sub(some_regex, some_replacement.replace('\\', '\\\\'), input_string)
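A minimal sketch of that advice follows. The helper name sub_literal and the sample replacement are hypothetical, chosen only to show a user-supplied string that happens to contain \n and \1.

```python
import re

def sub_literal(pattern, replacement, text):
    """Replace matches with `replacement` taken literally (hypothetical helper)."""
    # Doubling each backslash stops re from reading \1, \n or \g<...> as escapes.
    return re.sub(pattern, replacement.replace('\\', '\\\\'), text)

user_replacement = r'C:\new\1'   # made-up input containing \n and \1

print(sub_literal(r'PATH', user_replacement, 'dir=PATH'))   # dir=C:\new\1

# Without the doubling, \1 refers to a group that the pattern does not define.
try:
    re.sub(r'PATH', user_replacement, 'dir=PATH')
except re.error as exc:
    print('re.error:', exc)
```

The second call shows the failure mode described above: a backreference to a missing group raises an error instead of inserting the text.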

A simple way to work around all these string escaping issues is to use a function/lambda as the repl argument, instead of a string. For example:
output = re.sub(
pattern=find_pattern,
repl=lambda _: replacement,
string=input,
)
The replacement string won't be parsed at all, just substituted in place of the match.
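A callable is also handy when the replacement has to mix captured text with a literal string. The sketch below is one way to do that, assuming a made-up suffix and input; only re.sub and the match object's group method come from the re module.

```python
import re

suffix = r'\n'   # literal backslash + "n" that must survive untouched (made-up value)

# The callable builds the replacement from the captured group plus the literal suffix;
# nothing in the returned string is parsed for escapes.
output = re.sub(r'(\d+)', lambda m: m.group(1) + suffix, 'id 42 end')
print(output)   # id 42\n end
```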

From the doc (my emphasis):
re.sub(pattern, repl, string, count=0, flags=0)
Return the string
obtained by replacing the leftmost non-overlapping occurrences of
pattern in string by the replacement repl. If the pattern isn’t found,
string is returned unchanged. repl can be a string or a function; if
it is a string, any backslash escapes in it are processed. That is, \n
is converted to a single newline character, \r is converted to a
carriage return, and so forth. Unknown escapes of ASCII letters are
reserved for future use and treated as errors. Other unknown escapes
such as \& are left alone. Backreferences, such as \6, are replaced
with the substring matched by group 6 in the pattern.
The repl argument is not just plain text. It can also be the name of a function or refer to a position in a group (e.g. \g<quote>, \g<1>, \1).
Also, from here:
Unlike Standard C, all unrecognized escape sequences are left in the
string unchanged, i.e., the backslash is left in the result.
Since . is not a special escape character, '\.' is the same as r'\.'.
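The difference between the two experiments in the question can be checked with a few assertions. This sketch spells the backslash-period string as '\\.' rather than '\.' because recent Python versions warn about unrecognized escape sequences in regular literals; the strings compared are the same.

```python
import re

# Backslash + period: no string escape is involved, so both spellings agree.
assert '\\.' == r'\.'
assert len(r'\.') == 2

# "\6" IS a recognized (octal) string escape, so here the r prefix matters.
assert '\6' == chr(6)       # one control character
assert r'\6' == '\\6'       # backslash + "6", which re reads as a backreference

pattern = r'(.)(.)(.)(.)(.)(.)'
print(repr(re.sub(pattern, 'a\6b', '123456')))    # 'a\x06b'
print(repr(re.sub(pattern, r'a\6b', '123456')))   # 'a6b'
```

The first result only looks like "ab" when printed because the control character between the two letters is invisible.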

Regex subbing in Python leads to ASCII characters appearing

I am trying to use regex to replace some issues in some text.
Strings look like this:
a = "Here is a shortString with various issuesWith spacing"
My regex looks like this right now:
new_string = re.sub("[a-z][A-Z]", "\1 \2", a).
This takes those places with missing spaces (there is always a capital letter after a lowercase letter), and adds a space.
Unfortunately, the output looks like this:
Here is a shor\x01 \x02tring with various issue\x01 \x02ith spacing
I want it to look like this:
b = "Here is a short String with various issues With spacing"
It seems that the regex is properly matching the correct instances of things I want to change, but there is something wrong with my substitution. I thought \1 \2 meant replace with the first part of the regex, add a space, and then add the second matched item. But for some reason I get something else?

>>> a = "Here is a shortString with various issuesWith spacing"
>>> re.sub("([a-z])([A-Z])", r"\1 \2", a)
'Here is a short String with various issues With spacing'
The capturing groups and the backslash escaping were missing.
You can go even further:
>>> a = "Here is a shortString with various issuesWith spacing"
>>> re.sub('([a-z])([A-Z])', r'\1 \2', a).lower().capitalize()
'Here is a short string with various issues with spacing'

You need to define capturing groups, and use raw string literals:
import re
a = "Here is a shortString with various issuesWith spacing"
new_string = re.sub(r"([a-z])([A-Z])", r"\1 \2", a)
print(new_string)
Note that without the r'' prefix Python interpreted the \1 and \2 as characters rather than as backreferences since the \ was parsed as part of an escape sequence. In raw string literals, \ is parsed as a literal backslash.
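Both fixes are needed together, which the following sketch tries to show. The variable names broken and fixed are made up; the strings come from the question.

```python
import re

a = "Here is a shortString with various issuesWith spacing"

# Original call: "\1" and "\2" are string escapes for chr(1) and chr(2).
broken = re.sub("[a-z][A-Z]", "\1 \2", a)
print(repr(broken))
# 'Here is a shor\x01 \x02tring with various issue\x01 \x02ith spacing'

# Raw string alone is not enough: the pattern still defines no groups.
try:
    re.sub("[a-z][A-Z]", r"\1 \2", a)
except re.error as exc:
    print('re.error:', exc)

# Capturing groups plus a raw replacement string give the intended result.
fixed = re.sub(r"([a-z])([A-Z])", r"\1 \2", a)
print(fixed)   # Here is a short String with various issues With spacing
```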

You can have a try like this:
>>> import re
>>> a = "Here is a shortString with various issuesWith spacing"
>>> re.sub(r"(?<=[a-z])(?=[A-Z])", " ", a)
'Here is a short String with various issues With spacing'

How can I find a match and update it with RegEx?

I have a string as
a = "hello i am stackoverflow.com user +-"
Now I want to escape the special characters in the string, except quotation marks and whitespace. So my expected output is:
a = "hello i am stackoverflow\.com user \+\-"
What I did so far is find all the special characters in a string except whitespace and double quote using
re.findall(r'[^\w" ]',a)
Now that I have found all the required special characters, I want to update the string. I even tried re.sub, but it replaces the special characters. Is there any way I can do it?

Use re.escape.
>>> a = "hello i am stackoverflow.com user +-"
>>> print(re.sub(r'\\(?=[\s"])', r'', re.escape(a)))
hello i am stackoverflow\.com user \+\-
re.escape(string)
Return string with all non-alphanumerics backslashed; this is useful if you want to match an arbitrary literal string that may have regular expression metacharacters in it.
r'\\(?=[\s"])' matches every backslash that appears just before whitespace or a double quote. Replacing the matched backslashes with an empty string gives the desired output.
OR
>>> a = 'hello i am stackoverflow.com user "+-'
>>> print(re.sub(r'((?![\s"])\W)', r'\\\1', a))
hello i am stackoverflow\.com user "\+\-
((?![\s"])\W) captures every non-word character that is not whitespace or a double quote. Replacing each match with a backslash followed by the contents of group 1 gives the desired output.

It seems like you could use backreferences with re.sub to achieve your desired output:
import re
a = "hello i am stackoverflow.com user +-"
print(re.sub(r'([^\w" ])', r'\\\1', a)) # hello i am stackoverflow\.com user \+\-
The replacement pattern r'\\\1' is just \\, which means a literal backslash, followed by \1, which refers to capture group 1: whatever was captured by the parentheses in the first argument.
In other words, it will escape everything except:
alphanumeric characters
underscore
double quotes
space
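The same idea can be wrapped in a small function. This is only a sketch: backslash_escape and its keep parameter are illustrative names, not part of the re module, and the set of characters to leave alone is an assumption taken from the question.

```python
import re

def backslash_escape(text, keep='" '):
    """Prefix each character that is neither a word character nor in `keep` with a backslash.

    Hypothetical helper; `keep` lists the extra characters to leave unescaped.
    """
    # re.escape makes the kept characters safe to place inside the character class.
    pattern = r'([^\w' + re.escape(keep) + r'])'
    # r'\\\1' is a literal backslash followed by the captured character.
    return re.sub(pattern, r'\\\1', text)

print(backslash_escape('hello i am stackoverflow.com user +-'))
# hello i am stackoverflow\.com user \+\-
```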

Matching \[\] in Python regexes

I am trying to replace all expressions of the form
\[something\]
in a string by
\[<img src='something'>\]
Since \ and [ ] are special characters, I need to escape them (so \\, \[ and \]), thus my code would be
import re
def repl(m):
    return "<img src='" + m.group(1) + "'>"
print(re.sub("\\\[(.*?)\\\]", repl, "frfrfr\nfrrffr<p>\[something\]</p>frff\nfrfrr", re.S))
However, this returns the original string. Could someone point out my mistake ?

Escape \ correctly, or use a raw string (r'...') as follows.
>>> print(re.sub(r"\\\[(.*?)\\\]", repl, "frfrfr\nfrrffr<p>\[something\]</p>frff\nfrfrr", flags=re.S))
frfrfr
frrffr<p><img src='something'></p>frff
frfrr
>>> print(re.sub("\\\\\\[(.*?)\\\\\\]", repl, "frfrfr\nfrrffr<p>\[something\]</p>frff\nfrfrr", flags=re.S))
frfrfr
frrffr<p><img src='something'></p>frff
frfrr
UPDATE
The fourth parameter of re.sub is count, not flags. To specify flags, use keyword arguments. Otherwise, re.S is recognized as count.
>>> print(re.sub(r"\\\[(.*?)\\\]", repl, "frfrfr\nfrrffr<p>\[something\nblah\]</p>frff\nfrfrr", re.S))
frfrfr
frrffr<p>\[something
blah\]</p>frff
frfrr
>>> print(re.sub(r"\\\[(.*?)\\\]", repl, "frfrfr\nfrrffr<p>\[something\nblah\]</p>frff\nfrfrr", flags=re.S))
frfrfr
frrffr<p><img src='something
blah'></p>frff
frfrr
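The count-versus-flags mix-up is easier to see once the flag is printed as a number. The sketch below uses a shortened, made-up input and passes flags only by keyword.

```python
import re

text = "<p>\\[something\nblah\\]</p>"   # backslash-bracket markers around a two-line body
pattern = r"\\\[(.*?)\\\]"

def repl(m):
    return "<img src='" + m.group(1) + "'>"

# re.S is an integer flag; passed as the fourth positional argument it is read as count.
print(int(re.S))   # 16

# Without DOTALL the dot cannot cross the newline, so nothing matches.
print(re.sub(pattern, repl, text) == text)   # True

# With flags=re.S the match spans both lines.
print(re.sub(pattern, repl, text, flags=re.S))
```

Passing flags by keyword avoids the ambiguity entirely, which is why the corrected calls above use flags=re.S.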
