Python re.sub: ignore backreferences in the replacement string

I want to replace a pattern with a string. The string is given in a variable. It might, of course, contain '\1', and it should not be interpreted as a backreference, but simply as \1.
How can I achieve that?

The previous answer using re.escape() would escape too much, and you would get undesirable backslashes in the replacement and the replaced string.
It seems like in Python only the backslash needs escaping in the replacement string, thus something like this could be sufficient:
replacement = replacement.replace("\\", "\\\\")
Example:
import re
x = r'hai! \1 <ops> $1 \' \x \\'
print "want to see: "
print x
print "getting: "
print re.sub(".(.).", x, "###")
print "over escaped: "
print re.sub(".(.).", re.escape(x), "###")
print "could work: "
print re.sub(".(.).", x.replace("\\", "\\\\"), "###")
Output:
want to see:
hai! \1 <ops> $1 \' \x \\
getting:
hai! # <ops> $1 \' \x \
over escaped:
hai\!\ \1\ \<ops\>\ \$1\ \\'\ \x\ \\
could work:
hai! \1 <ops> $1 \' \x \\
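On Python 3 the same idea can be checked as follows. This is only a sketch, not part of the original answer: the helper name sub_literal is made up for illustration. It relies on the documented behaviour that when repl is a callable, its return value is inserted without any backslash processing. Note that on recent Python 3 versions, passing x to re.sub unescaped raises re.error, because \x is an unknown escape in a replacement template.

```python
import re

def sub_literal(pattern, replacement, text):
    """Replace every match of pattern with replacement, taken verbatim."""
    # A callable repl bypasses template processing, so "\1" stays "\1".
    return re.sub(pattern, lambda match: replacement, text)

x = r"hai! \1 <ops> $1 \' \x \\"

# Option 1: hand re.sub a function instead of a template string.
via_function = sub_literal(r".(.).", x, "###")

# Option 2: double every backslash so the template yields them literally.
via_escaping = re.sub(r".(.).", x.replace("\\", "\\\\"), "###")

print(via_function)
print(via_escaping)
```

Both variants should print the replacement exactly as it was typed. The function form avoids having to remember which characters are special in a replacement template.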

Prompted by the comments, I thought about this for quite a while and tried it out. It helped me a lot to improve my understanding of escaping, so I rewrote my answer almost completely in the hope that it is useful for later readers.
NullUserException gave you just the short version; I will try to explain it a bit more. Thanks to the critical reviews of Qtax and Duncan, this answer is hopefully now correct and helpful.
The backslash has a special meaning: it is the escape character in strings. That means the backslash and the following character form an escape sequence that is translated to something else when something is done with the string. This "something is done" already includes the creation of the string. So if you want to use \ literally, you need to escape it, and the escape character for that is the backslash itself.
To start, here are some examples that show what happens. I additionally print the ASCII codes of the characters in the string to make it clearer.
s = "A\1\nB"
print s
print [x for x in s]
print [hex(ord(x)) for x in s]
prints
A
B
['A', '\x01', '\n', 'B']
['0x41', '0x1', '0xa', '0x42']
So while I typed \ and 1 in the code, s does not contain those two characters; it contains the ASCII character 0x01, which is "Start of Heading". The same goes for \n, which is translated to 0x0a, the linefeed character.
Since this behaviour is not always wanted, raw strings can be used, where the escape sequences are ignored.
s = r"A\1\nB"
print s
print [x for x in s]
print [hex(ord(x)) for x in s]
I just added the r before the string and the result is now
A\1\nB
['A', '\\', '1', '\\', 'n', 'B']
['0x41', '0x5c', '0x31', '0x5c', '0x6e', '0x42']
All characters are printed as I typed them.
That is the starting situation. Now for the next step.
Sometimes a string should be passed to a regex to be found literally, so every character that has a special meaning within a regex (e.g. +*$[.) needs to be escaped. The function re.escape does this job.
But for this question this is the wrong function, because the string should not be used within a regex, but as the replacement string for re.sub.
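For contrast, here is a small sketch of the situation re.escape is meant for: a literal search string used as the pattern. The sample strings are invented for illustration.

```python
import re

needle = "1+1=2 (really?)"
haystack = "proof that 1+1=2 (really?) holds"

# Unescaped, "+", "(", ")" and "?" are regex syntax, so the text is not found.
unescaped_match = re.search(needle, haystack)

# Escaped, every character of needle is matched literally.
escaped_match = re.search(re.escape(needle), haystack)

print(unescaped_match)
print(escaped_match.group(0))
```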
So new situation:
A raw string including escape sequences should be used as the replacement string for re.sub. re.sub also handles escape sequences, but with a small but important difference from the handling before: \n is still translated to 0x0a, the linefeed character, but the translation of \1 changes. It is replaced by the content of capturing group 1 of the regex in re.sub.
s = r"A\1\nB"
print re.sub(r"(Replace)" ,s , "1 Replace 2")
And the result is
1 AReplace
B 2
The \1 has been replaced with the content of the capturing group and \n with the LineFeed character.
The important point is that you have to understand this behaviour. In my opinion you now have two possibilities (and I am not going to judge which one is the correct one):
The creator is unsure about the string behaviour, and if he inputs \n then he wants a newline. In this case, escape only the backslashes that are followed by a digit:
OnlyDigits = re.sub(r"(Replace)" ,re.sub(r"(\\)(?=\d)", r"\\\\", s) , "1 Replace 2")
print OnlyDigits
print [x for x in OnlyDigits]
print [hex(ord(x)) for x in OnlyDigits]
Output:
1 A\1
B 2
['1', ' ', 'A', '\\', '1', '\n', 'B', ' ', '2']
['0x31', '0x20', '0x41', '0x5c', '0x31', '0xa', '0x42', '0x20', '0x32']
The creator knows exactly what he is doing, and if he had wanted a newline, he would have typed \0xa. In this case escape all backslashes:
All = re.sub(r"(Replace)" ,re.sub(r"(\\)", r"\\\\", s) , "1 Replace 2")
print All
print [x for x in All]
print [hex(ord(x)) for x in All]
Output:
1 A\1\nB 2
['1', ' ', 'A', '\\', '1', '\\', 'n', 'B', ' ', '2']
['0x31', '0x20', '0x41', '0x5c', '0x31', '0x5c', '0x6e', '0x42', '0x20', '0x32']
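The two options can be written for Python 3 roughly as below. This is a sketch under the same assumptions as the answer; the variable names are illustrative. One caveat worth labelling: escaping only backslashes before digits does not cover the \g<1> form of backreference, which re.sub also understands.

```python
import re

s = r"A\1\nB"
text = "1 Replace 2"

# Option 1: neutralise only numeric backreferences; \n still becomes a newline.
# (A \g<1> style backreference would NOT be neutralised by this.)
digits_escaped = re.sub(r"\\(?=\d)", r"\\\\", s)
only_digits = re.sub(r"(Replace)", digits_escaped, text)

# Option 2: neutralise every backslash; the replacement is inserted as typed.
everything = re.sub(r"(Replace)", s.replace("\\", "\\\\"), text)

print(repr(only_digits))
print(repr(everything))
```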

Related

Escaping regex unicode string in Python

I have a user defined string.
I want to use it in a regex with a small improvement: match any of three apostrophe characters instead of just one.
For example,
APOSTROPHES = re.escape('\'\u2019\u02bc')
word = re.escape("п'ять")
word = ''.join([s if s not in APOSTROPHES else '[%s]' % APOSTROPHES for s in word])
It works well for Latin, but for Unicode the list comprehension gives the following string:
"[\\'\\\\u2019\\\\u02bc]\xd0[\\'\\\\u2019\\\\u02bc]\xbf[\\'\\\\u2019\\\\u02bc][\\'\\\\u2019\\\\u02bc][\\'\\\\u2019\\\\u02bc]\xd1[\\'\\\\u2019\\\\u02bc]\x8f[\\'\\\\u2019\\\\u02bc]\xd1[\\'\\\\u2019\\\\u02bc]\x82[\\'\\\\u2019\\\\u02bc]\xd1[\\'\\\\u2019\\\\u02bc]\x8c"
It looks like it finds backslashes in both strings and then substitutes APOSTROPHES.
Also, print(list(w for w in APOSTROPHES)) gives ['\\', "'", '\\', '\\', 'u', '2', '0', '1', '9', '\\', '\\', 'u', '0', '2', 'b', 'c'].
How can I avoid it? I want to get "\п[\'\u2019\u02bc]\я\т\ь"

What I understand is that you want to create a regular expression which can match a given word with any apostrophe.
The RegEx which matches any apostrophe can be defined as a character class:
APOSTROPHES_REGEX = r'[\'\u2019\u02bc]'
For instance, you have this (Ukrainian?) word which contains a single quote:
word = "п'ять"
EDIT: If your word contains another kind of apostrophe, you can normalize it, like this:
word = re.sub(APOSTROPHES_REGEX , r"\'", word, flags=re.UNICODE)
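To create a RegEx, you escape this string (because in some contexts it can contain special characters such as punctuation). When escaped, the single quote "'" is replaced by an escaped single quote, like this: r"\'". (This holds for Python versions before 3.7; from 3.7 on, re.escape leaves the single quote unescaped, so the replacement below would have to target "'" itself.)
You can replace this r"\'" with your apostrophe RegEx: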
import re
word_regex = re.escape(word)
word_regex = word_regex.replace(r'\'', APOSTROPHES_REGEX)
The new RegEx can then be used to match the same word with any apostrophe:
assert re.match(word_regex, "п'ять") # '
assert re.match(word_regex, "п’ять") # \u2019
assert re.match(word_regex, "пʼять") # \u02bc
Note: don't forget to use the re.UNICODE flag; it helps with some RegEx character classes such as r"\w".
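A variant that does not depend on how re.escape treats the single quote is to build the pattern character by character. This is a sketch for Python 3; the function name apostrophe_insensitive and the constant names are made up for illustration.

```python
import re

# The three apostrophe characters as real characters (not escape text).
APOSTROPHES = "'\u2019\u02bc"
APOSTROPHES_CLASS = "[" + re.escape(APOSTROPHES) + "]"

def apostrophe_insensitive(word):
    """Compile a regex matching word with any of the apostrophe variants."""
    parts = [
        APOSTROPHES_CLASS if ch in APOSTROPHES else re.escape(ch)
        for ch in word
    ]
    return re.compile("".join(parts))

word_regex = apostrophe_insensitive("п'ять")
print(bool(word_regex.fullmatch("п\u2019ять")))
```

Escaping each character separately keeps the apostrophe check on the original word, so backslashes added by re.escape are never mistaken for part of the word.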

python regex preserve specified special characters only [duplicate]

I've been looking for a way to isolate special characters in a regex expression, but I only seem to find the exact opposite of what I'm looking for. So basically what I want is something along the lines of this:
import re
str = "I only want characters from the pattern below to appear in a list ()[]' including quotations"
pattern = """(){}[]"'-"""
result = re.findall(pattern, str)
What I expect from this is:
print(result)
#["(", ")", "[", "]", "'"]
Edit: thank you to whomever answered then deleted their comment with this regex that solved my problem:
pattern = r"""[(){}\[\]"'\-]"""

Why would you need regex for this when it can be done without regex?
>>> str = "I only want characters from the pattern below to appear in a list ()[]' including quotations"
>>> pattern = """(){}[]"'-"""
>>> [x for x in str if x in pattern]
['(', ')', '[', ']', "'"]

If it's for learning purposes (regex isn't really the best way here), then you can use:
import re
text = "I only want characters from the pattern below to appear in a list ()[]' including quotations"
output = re.findall('[' + re.escape("""(){}[]"'-""") + ']', text)
# ['(', ')', '[', ']', "'"]
Surrounding the characters in [ and ] makes it a regex character class and re.escape will escape any characters that have special regex meaning to avoid breaking the regex string (eg: ] terminating the characters early or - in a certain place causing it to act like a character range).

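Several of the characters in your set have special meaning in regular expressions; to match them literally, you need to backslash-escape them. Note that the following pattern matches the nine characters only as one consecutive sequence:
pattern = r"""\(\)\{\}\[]"'-"""
Alternatively, to match any single one of them, you could use a character class (with ] placed first and - placed last, so that neither needs escaping):
pattern = """[][(){}"'-]"""
Notice also the use of a "raw string" r'...' to avoid having Python interpret the backslashes.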
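To see the difference between an escaped sequence and a character class, a small check on the question's text can help. This is a sketch only; the variable names are illustrative and re.escape is used to build both patterns.

```python
import re

text = "I only want characters from the pattern below to appear in a list ()[]' including quotations"
wanted = """(){}[]"'-"""

# Escaped sequence: matches only the whole run (){}[]"'- in that exact order.
as_sequence = re.findall(re.escape(wanted), text)

# Character class: matches any single character from the set.
as_class = re.findall("[" + re.escape(wanted) + "]", text)

print(as_sequence)  # []
print(as_class)     # ['(', ')', '[', ']', "'"]
```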

Python regex '\s' vs '\\s'

I have the simple expressions \s and \\s. Both expressions match the spaces in "This is Sparta".
>>> re.findall('\\s',"This is Sparta")
[' ', ' ']
>>> re.findall('\s',"This is Sparta")
[' ', ' ']
I am confused here. \ is used to escape special characters and \s represents whitespace, but how do both work here?

Don't confuse Python-level string escaping with regex-level escaping. Since s is not an escapable character at the Python level, the interpreter understands a string like "\s" as the two characters "\" and "s". Replace "s" with "n" (for example), and it understands it as the newline character.
'\s' == '\\s'
True
'\n' == '\\n'
False

\ only escapes the following character if the escaped character is valid
>>> len('\s')
2
>>> len('\n')
1
compare with
>>> len('\\s')
2
>>> len('\\n')
2
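In practice, a raw string sidesteps the question of which escapes Python recognises. The sketch below uses only r"..." and doubled backslashes, because recent Python 3 versions emit a warning for unrecognised escapes such as '\s' in ordinary string literals.

```python
import re

raw_pattern = r"\s"       # backslash + s, exactly as typed
escaped_pattern = "\\s"   # the same two characters, escaped by hand

same_pattern = raw_pattern == escaped_pattern
spaces = re.findall(raw_pattern, "This is Sparta")

# For \n the two spellings differ: one character versus two.
newline_lengths = (len("\n"), len(r"\n"))

print(same_pattern, spaces, newline_lengths)
```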

Python Regex - Match a character without consuming it

I would like to convert the following string
"For "The" Win","Way "To" Go"
to
"For ""The"" Win","Way ""To"" Go"
The straightforward regex would be
str2 = re.sub(r'(?<!,|^)"(?=\w)|(?<=\w)"(?!,|$)', '""', str1,flags=re.MULTILINE)
i.e., Double the quotes that are
Followed by a letter but not preceded by a comma or the beginning of line
Preceded by a letter but not followed by a comma or the end of line
The problem is that I am using Python, and its regex engine does not allow alternatives of different widths in a lookbehind construct. I get the error
sre_constants.error: look-behind requires fixed-width pattern
What I am looking for is a regex that will replace the '"' around 'The' and 'To' with '""'.
I can use the following regex (An answer provided to another question)
\b\s*"(?!,|[ \t]*$)
but that consumes the space just before the 'The' and 'To' and I get the below
"For""The"" Win","Way""To"" Go"
Is there a workaround so that I can double the quotes around 'The' and 'To' without consuming the spaces just before them?

Instead of saying not preceded by comma or the line start, say preceded by a non-comma character:
r'(?<=[^,])"(?=\w)|(?<=\w)"(?!,|$)'

Looks to me like you don't need to bother with anchors.
If there is a character before the quote, you know it's not at the beginning of the string.
If that character is not a newline, you're not at the beginning of a line.
If the character is not a comma, you're not at the beginning of a field.
So you don't need to use anchors, just do a positive lookbehind/lookahead for a single character:
result = re.sub(r'(?<=[^",\r\n])"(?=[^,"\r\n])', '""', subject)
I threw in the " on the chance that there might be some quotes that are already escaped. But realistically, if that's the case you're probably screwed anyway. ;)
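A quick check of the anchor-free pattern on a two-line input, written for Python 3. The second line of the sample is invented for illustration.

```python
import re

subject = '"For "The" Win","Way "To" Go"\n"A "B" C"'

# Double a quote only when it has an ordinary character on both sides,
# i.e. not a comma, quote, or line break, and not the string boundary.
pattern = r'(?<=[^",\r\n])"(?=[^,"\r\n])'
result = re.sub(pattern, '""', subject)

print(result)
```

The quotes at the start and end of each line and next to the commas are left alone, while the inner quotes are doubled.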

re.sub(r'\b(\s*)"(?!,|[ \t]*$)', r'\1""', s)

Most direct workaround whenever you encounter this issue: explode the look-behind into two look-behinds.
str2 = re.sub(r'(?<!,)(?<!^)"(?=\w)|(?<=\w)"(?!,|$)', '""', str1,flags=re.MULTILINE)
(don't name your strings str)
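The sketch below, for Python 3, first probes whether the combined lookbehind compiles (on the CPython versions I am aware of it raises re.error, but treat that as an assumption) and then applies the split version.

```python
import re

str1 = '"For "The" Win","Way "To" Go"'

# Probe: a lookbehind whose alternatives differ in width (1 versus 0).
try:
    re.compile(r'(?<!,|^)"')
    combined_compiles = True
except re.error:
    combined_compiles = False

# Workaround: two separate fixed-width lookbehinds, both of which must hold.
pattern = r'(?<!,)(?<!^)"(?=\w)|(?<=\w)"(?!,|$)'
str2 = re.sub(pattern, '""', str1, flags=re.MULTILINE)

print(combined_compiles)
print(str2)
```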

str2 = re.sub('(?<=[^,])"(?=\w)'
'|'
'(?<=\w)"(?!,|$)',
'""', ss,
flags=re.MULTILINE)
I always wonder why people use raw strings for regex patterns when it isn't needed.
Note that I changed your str, which is the name of a builtin class, to ss.
For "fun":
str2 = re.sub('"'
'('
'(?<=[^,]")(?=\w)'
'|'
'(?<=\w")(?!,|$)'
')',
'""', ss,
flags=re.MULTILINE)
or also
str2 = re.sub('(?<=[^,]")(?=\w)'
'|'
'(?<=\w")(?!,|$)',
'"', ss,
flags=re.MULTILINE)

Can't get single \ in python

I'm trying to learn python, and I'm pretty new at it, and I can't figure this one part out.
Basically, what I'm doing now is something that takes the source code of a webpage, and takes out everything that isn't words.
Webpages have a lot of \n and \t, and I want something that will find \ and delete everything between it and the next ' '.
def removebackslash(source):
    while(source.find('\') != -1):
        startback = source.find('\')
        endback = source[startback:].find(' ') + startback + 1
        source = source[0:startback] + source[endback:]
    return source
is what I have. It doesn't work like this, because the \' doesn't close the string, but when I change \ to \\, it interprets the string as \\. I can't figure out anything that is interpreted as '\'.

\ is an escape character; it either gives characters a special meaning or takes said special meaning away. Right now, it's escaping the closing single quote and treating it as a literal single quote. You need to escape it with itself to insert a literal backslash:
def removebackslash(source):
    while(source.find('\\') != -1):
        startback = source.find('\\')
        endback = source[startback:].find(' ') + startback + 1
        source = source[0:startback] + source[endback:]
    return source
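The same removal can also be sketched with a regular expression. This is an alternative, not the answer's method, and the function name remove_backslash_runs is made up. It also handles a backslash that is not followed by any space, a case in which the loop above would never terminate because the string stops changing.

```python
import re

def remove_backslash_runs(source):
    """Delete each backslash, the non-space text after it, and one space."""
    # \\      a literal backslash
    # [^ ]*   everything up to the next space (or the end of the string)
    # " ?"    the space itself, if there is one
    return re.sub(r"\\[^ ]* ?", "", source)

cleaned = remove_backslash_runs(r"words \n\t more words\n")
print(cleaned)
```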

Try using replace:
str.replace(old, new[, count])
Return a copy of the string with all occurrences of substring old replaced by new. If the optional argument count is given, only the first count occurrences are replaced.
So in your case:
my_text = my_text.replace('\n', '')
my_text = my_text.replace('\t', '')
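Which replace call is needed depends on whether the text contains real newline and tab characters or the two-character sequences backslash-n and backslash-t. A short sketch with invented sample strings:

```python
# Real control characters: one character each.
real = "line one\n\tline two"
real_cleaned = real.replace("\n", "").replace("\t", "")

# Literal backslash sequences: two characters each, so search for "\\n".
literal = r"line one\n\tline two"
literal_untouched = literal.replace("\n", "").replace("\t", "")
literal_cleaned = literal.replace("\\n", "").replace("\\t", "")

print(real_cleaned)
print(literal_untouched)
print(literal_cleaned)
```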

As others have said, you need to use '\\'. The reason you think this isn't working is because when you get the results, they look like they begin with two backslashes. But they don't begin with two backslashes, it's just that Python shows two backslashes. If it didn't, you couldn't tell the difference between a newline (represented as \n) and a backslash followed by the letter n (represented as \\n).
There are two ways to convince yourself of what's really going on. One is to use print on the result, which causes it to expand the escapes:
>>> x = "here is a backslash \\ and here comes a newline \n this is on the next line"
>>> x
u'here is a backslash \\ and here comes a newline \n this is on the next line'
>>> print x
here is a backslash \ and here comes a newline
this is on the next line
>>> startback = x.find('\\')
>>> x[startback:]
u'\\ and here comes a newline \n this is on the next line'
>>> print x[startback:]
\ and here comes a newline
this is on the next line
Another way is to use len to verify the length of the string:
>>> x = "Backslash \\ !"
>>> startback = x.find('\\')
>>> x[startback:]
u'\\ !'
>>> print x[startback:]
\ !
>>> len(x[startback:])
3
Notice that len(x[startback:]) is 3. The string contains three characters: backslash, space, and exclamation point. You can see what's going on even more simply by just looking at a string that contains only a backslash:
>>> x = "\\"
>>> x
u'\\'
>>> print x
\
>>> len(x)
1
x only looks like it starts with two backslashes when you evaluate it at the interactive prompt (or otherwise use its __repr__ method). When you actually print it, you can see it's only one backslash, and when you look at its length, you can see it's only one character long.
So what this means is you need to escape the backslash in your find, and you need to recognize that the backslashes displayed in the output may also be doubled.
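The same check in Python 3 syntax, as a small sketch: repr shows the escaped form, while print and len show what the string really contains.

```python
x = "\\"  # a string holding a single backslash

shown_by_repr = repr(x)  # the doubled form seen at the interactive prompt
length = len(x)

print(shown_by_repr)  # '\\'
print(x)              # \
print(length)         # 1
```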

The SO auto-format shows your problem. Since \ is used to escape characters, it's escaping the end quotes. Try changing that line to (note the use of double quotes):
while(source.find("\\") != -1):
Read more about escape characters in the docs.

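I don't think anyone's mentioned this yet, but if you don't want to deal with having to escape characters, you can often use a raw string. Be aware that a raw string literal cannot end in a single backslash, so r'\' is a syntax error and a lone backslash still has to be written as '\\':
source.find('\\')
Adding the letter r before a string tells Python not to interpret backslash escapes and keeps the string as you typed it, which helps for longer strings such as r'\n' (a backslash followed by n).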
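A quick way to check how raw string literals treat a trailing backslash is to hand candidate literals to compile(). This is a sketch; the helper name is_valid_literal is hypothetical.

```python
def is_valid_literal(source_text):
    """Return True if source_text parses as a Python expression."""
    try:
        compile(source_text, "<literal>", "eval")
        return True
    except SyntaxError:
        return False

raw_single = is_valid_literal("r'\\'")    # the literal r'\'
escaped = is_valid_literal("'\\\\'")      # the literal '\\'
raw_double = is_valid_literal("r'\\\\'")  # the literal r'\\'

print(raw_single, escaped, raw_double)
```

The raw literal ending in one backslash is rejected because the backslash still keeps the closing quote from ending the string. The raw literal with two backslashes parses, but it holds two characters rather than one.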

