Confusion escaping single quotes in a single-quoted raw string literal - python

The following works as expected:
>>> print re.sub('(\w)"(\W)', r"\1''\2", 'The "raw string literal" is a special case of a "string literal".')
The "raw string literal'' is a special case of a "string literal''.
Since I wanted to use single quotes in the replacement expression (is that the correct terminology?), I quoted it using double quotes.
But then for my edification I tried using single quotes in the replacement expression and can't understand the results:
>>> print re.sub('(\w)"(\W)', r'\1\'\'\2', 'The "raw string literal" is a special case of a "string literal".')
The "raw string literal\'\' is a special case of a "string literal\'\'.
Shouldn't the two forms produce exactly the same output?
So, my questions are:
How do I escape a single quote in a single-quoted raw string?
How do I escape a double quote in a double-quoted raw string?
Why is it that in the first parameter to re.sub() I didn't have to use a raw string, but in the second parameter I have to? Both seem like string representations of regexes to this Python noob.
If it makes a difference, I am using Python 2.7.5 on Mac OS X (10.9, Mavericks).

No, they should not. A raw string literal does let you escape quotes, but the backslashes will be included:
>>> r"\'"
"\\'"
where Python echoes the resulting string as a string literal with the backslash escaped.
This is explicitly documented behaviour of the raw string literal syntax:
When an 'r' or 'R' prefix is present, a character following a backslash is included in the string without change, and all backslashes are left in the string. For example, the string literal r"\n" consists of two characters: a backslash and a lowercase 'n'. String quotes can be escaped with a backslash, but the backslash remains in the string; for example, r"\"" is a valid string literal consisting of two characters: a backslash and a double quote; r"\" is not a valid string literal (even a raw string cannot end in an odd number of backslashes).
If you didn't use a raw string literal for the second parameter, Python would interpret the \digit combination as octal byte values:
>>> '\0'
'\x00'
You can construct the same string without raw string literals by doubling the backslashes:
>>> '\\1\'\'\\2'
"\\1''\\2"

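For readers on Python 3, here is a small sketch comparing the replacement strings discussed above. The variable names are made up for illustration, and the pattern is written as a raw string here only to avoid escape warnings on newer interpreters.

```python
import re

text = 'The "raw string literal" is a special case of a "string literal".'
pattern = r'(\w)"(\W)'

double_quoted_raw = r"\1''\2"       # 6 characters: \ 1 ' ' \ 2
single_quoted_raw = r'\1\'\'\2'     # 8 characters: the escaping backslashes stay
doubled_backslashes = '\\1\'\'\\2'  # the same 6 characters without a raw string
octal_escapes = "\1''\2"            # 4 characters: \1 and \2 became chr(1) and chr(2)

print(len(double_quoted_raw), len(single_quoted_raw), len(octal_escapes))
print(double_quoted_raw == doubled_backslashes)

# Only the 6-character form gives re.sub() the group references \1 and \2
# followed by two plain single quotes.
result = re.sub(pattern, double_quoted_raw, text)
print(result)
```

The length difference between the two raw forms is the whole story: the backslashes that "escape" the quotes in the single-quoted raw string are still part of the string.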
To answer the questions of the OP:
How do I escape a single quote in a single-quoted raw string?
That is not possible, except if you have the special case where the single quote is preceded by a backslash (as Martijn pointed out).
How do I escape a double quote in a double-quoted raw string?
See above.
Why is it that in the first parameter to re.sub() I didn't have to use a raw string, but in the second parameter I have to? Both seem like string representations of regexes to this Python noob.
Completing Martijn's answer (which only covered the second parameter): because the first parameter is not a raw string, Python tries to interpret each backslash, together with the character that follows it, as an escape sequence. However, \w and \W are not recognized escape sequences, so the backslash is kept as a literal character (compare \t in the second example, which is recognized and becomes a tab):
>>> '(\w)"(\W)'
'(\\w)"(\\W)'
>>> '(\t)"(\W)'
'(\t)"(\\W)'
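The same comparison can be scripted. This is only a sketch: parse_literal is a hypothetical helper (not part of any library) that feeds literal source text to ast.literal_eval, and it silences warnings because newer Python versions warn about unrecognized escapes such as \w in non-raw literals.

```python
import ast
import warnings


def parse_literal(source):
    """Evaluate Python string-literal source text and return the resulting str."""
    with warnings.catch_warnings():
        # Newer interpreters warn about unknown escapes such as \w.
        warnings.simplefilter("ignore")
        return ast.literal_eval(source)


plain = parse_literal(r"""'(\w)"(\W)'""")      # non-raw literal from the question
raw = parse_literal(r"""r'(\w)"(\W)'""")       # the same text with an r prefix
with_tab = parse_literal(r"""'(\t)"(\W)'""")   # \t is a recognized escape

print(plain == raw)                # the unknown escapes survive unchanged
print(len(plain), len(with_tab))   # the tab version is one character shorter
```

So the non-raw pattern only works by accident, which is a good reason to write regex patterns as raw strings anyway.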

Related

How does Python interpret backslash in string? [duplicate]

I noticed the pattern, but how does the backslash work in a string, theoretically?
'##2_#]&*^%$\]'
output: '##2_#]&*^%$\\]'
'##2_#]&*^%$\\]'
output: '##2_#]&*^%$\\]'
'##2_#]&*^%$\\\]'
output: '##2_#]&*^%$\\\\]'
The backslash \ character is used to escape characters that otherwise have a special meaning, such as newline, backslash itself, or the quote character. String literals may optionally be prefixed with a letter 'r' or 'R'; such strings are called raw strings and use different rules for backslash escape sequences.
Unless an 'r' or 'R' prefix is present, escape sequences in strings are interpreted according to rules similar to those used by Standard C.
In strict compatibility with Standard C, up to three octal digits are accepted, but an unlimited number of hex digits is taken to be part of the hex escape (and then the lower 8 bits of the resulting hex number are used in 8-bit implementations).
Unlike Standard C, all unrecognized escape sequences are left in the string unchanged, i.e., the backslash is left in the string. (This behavior is useful when debugging: if an escape sequence is mistyped, the resulting output is more easily recognized as broken.)
When an 'r' or 'R' prefix is present, backslashes are still used to quote the following character, but all backslashes are left in the string. For example, the string literal r"\n" consists of two characters: a backslash and a lowercase 'n'. String quotes can be escaped with a backslash, but the backslash remains in the string; for example, r"\"" is a valid string literal consisting of two characters: a backslash and a double quote; r"\" is not a valid string literal (even a raw string cannot end in an odd number of backslashes). Specifically, a raw string cannot end in a single backslash (since the backslash would escape the following quote character). Note also that a single backslash followed by a newline is interpreted as those two characters as part of the string, not as a line continuation.
From your follow-up comment:
What puzzled me is in my example, it doesn't escape. Single backslash produces double backslashes. Double backslashes produce Double backslashes. Triple backslashes produce quadruple backslashes.....
To be clear: your first output is a string with one backslash in it. Python displays two backslashes in its representation of the string.
When you input the string with a single backslash, Python does not treat the sequence \] in the input as any special escape sequence, and therefore the \ is turned into an actual backslash in the actual string, and the ] into a closing square bracket. Quoting from the documentation linked by Klaus D.:
Unlike Standard C, all unrecognized escape sequences are left in the string unchanged, i.e., the backslash is left in the result. (This behavior is useful when debugging: if an escape sequence is mistyped, the resulting output is more easily recognized as broken.)
When you input the string with a double backslash, the sequence \\ is an escape sequence for a single backslash, and then the ] is just a ].
Either way, when Python displays the string back to you, it uses \\ for the single actual backslash, because it does not look ahead to determine that a single backslash would work - the backslash always gets escaped.
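A quick way to convince yourself of this is to compare what the string contains with what repr() shows. The sketch below uses only escaped backslashes in the source so that it runs without warnings; the variable names are arbitrary.

```python
one = '##2_#]&*^%$\\]'      # two backslashes in the source, one in the string
two = '##2_#]&*^%$\\\\]'    # four backslashes in the source, two in the string

print(one.count('\\'))        # backslashes actually stored in the string
print(repr(one))              # the representation doubles each of them
print(one)                    # print() shows the real contents
print(repr(two).count('\\'))  # two stored backslashes are displayed as four
```

The count comes from the string's contents, while repr() is just one way of writing those contents back out as a literal.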
To go into a little more detail: Python doesn't care about how you specified the string in the first place - it has a specific "normalized" form that depends only on what the string actually contains. We can see this by playing around with the different ways to quote a string:
>>> 'foo'
'foo'
>>> "foo"
'foo'
>>> r'foo'
'foo'
>>> """foo"""
'foo'
The normalized form will use double quotes if that avoids escape sequences for single quotes:
>>> '\'\'\''
"'''"
But it will switch back to single quotes if the string contains both types of quote:
>>> '\'"'
'\'"'
>>> "'\"'
'\'"'
(Exercise: how many characters are actually in this string, and what are they? How many backslashes does the string contain?)
It contains two characters - a single-quote and a double-quote - and no backslashes.
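If you want to check the exercise by machine, something like this should do it (a sketch for Python 3; s is just an illustrative name):

```python
s = '\'"'   # a single quote followed by a double quote

print(len(s))        # number of characters in the string
print('\\' in s)     # whether the string contains a backslash at all
print(repr(s))       # repr() has to escape one of the two quotes
print(repr("'''"))   # only single quotes inside, so repr() switches to double quotes
```

The backslash you see in repr(s) belongs to the representation, not to the string.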
For the first pattern
'##2_#]&*^%$\]'
the \ is not escaped, so one more \ is added in the output to escape it.
For the second pattern
'##2_#]&*^%$\\]'
the \ is already escaped in the pattern, so no new \ appears in the output.
For the third pattern
'##2_#]&*^%$\\\]'
the first \ escapes the second \, and the third \ is escaped by adding one more \ in the output. So four \.
Hope it helps.

Escape sequence char as a list string [duplicate]

When I write print('\') or print("\") or print("'\'"), Python doesn't print the backslash \ symbol. Instead it errors for the first two and prints '' for the third. What should I do to print a backslash?
This question is about producing a string that has a single backslash in it. This is particularly tricky because it cannot be done with raw strings. For the related question about why such a string is represented with two backslashes, see Why do backslashes appear twice?. For including literal backslashes in other strings, see using backslash in python (not to escape).
You need to escape your backslash by preceding it with, yes, another backslash:
print("\\")
And for versions prior to Python 3:
print "\\"
The \ character is called an escape character: it changes how the character following it is interpreted. For example, n by itself is simply a letter, but when you precede it with a backslash, it becomes \n, which is the newline character.
As you can probably guess, \ also needs to be escaped so it doesn't function like an escape character. You have to... escape the escape, essentially.
See the Python 3 documentation for string literals.
A hacky way of printing a backslash that doesn't involve escaping is to pass its character code to chr:
>>> print(chr(92))
\
Another option is a raw f-string with an empty replacement field:
print(fr"\{''}")
Or index into a raw string:
print(r"\ "[0])
For completeness: A backslash can also be escaped as a hex sequence: "\x5c"; or a short Unicode sequence: "\u005c"; or a long Unicode sequence: "\U0000005c". All of these will produce a string with a single backslash, which Python will happily report back to you in its canonical representation - '\\'.
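To see that all of these spellings really produce the same one-character string, here is a small Python 3 sketch that collects them in a dict (the labels are only for display):

```python
candidates = {
    "escaped": "\\",
    "chr": chr(92),
    "hex": "\x5c",
    "short unicode": "\u005c",
    "long unicode": "\U0000005c",
    "raw slice": r"\ "[0],
    "raw f-string": fr"\{''}",
}

for label, value in candidates.items():
    # Each value has length 1 and the same canonical representation.
    print(label, len(value), repr(value))
```

Whichever form you pick, print() will show a single backslash.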

Regular Expression tested but not working [duplicate]

From the python documentation on regex, regarding the '\' character:
The solution is to use Python’s raw string notation for regular
expression patterns; backslashes are not handled in any special way in
a string literal prefixed with 'r'. So r"\n" is a two-character string
containing '\' and 'n', while "\n" is a one-character string
containing a newline. Usually patterns will be expressed in Python
code using this raw string notation.
What is this raw string notation? If you use a raw string format, does that mean "*" is taken as a literal character rather than a zero-or-more indicator? That obviously can't be right, or else regex would completely lose its power. But then if it's a raw string, how does it recognize newline characters if "\n" is literally a backslash and an "n"?
I don't follow.
Edit for bounty:
I'm trying to understand how a raw string regex matches newlines, tabs, and character sets, e.g. \w for words or \d for digits or all whatnot, if raw string patterns don't recognize backslashes as anything more than ordinary characters. I could really use some good examples.
Zarkonnen's response does answer your question, but not directly. Let me try to be more direct, and see if I can grab the bounty from Zarkonnen.
You will perhaps find this easier to understand if you stop using the terms "raw string regex" and "raw string patterns". These terms conflate two separate concepts: the representations of a particular string in Python source code, and what regular expression that string represents.
In fact, it's helpful to think of these as two different programming languages, each with their own syntax. The Python language has source code that, among other things, builds strings with certain contents, and calls the regular expression system. The regular expression system has source code that resides in string objects, and matches strings. Both languages use backslash as an escape character.
First, understand that a string is a sequence of characters (i.e. bytes or Unicode code points; the distinction doesn't much matter here). There are many ways to represent a string in Python source code. A raw string is simply one of these representations. If two representations result in the same sequence of characters, they produce equivalent behaviour.
Imagine a 2-character string, consisting of the backslash character followed by the n character. If you know that the character value for backslash is 92, and for n is 110, then this expression generates our string:
s = chr(92)+chr(110)
print len(s), s
2 \n
The conventional Python string notation "\n" does not generate this string. Instead it generates a one-character string with a newline character. The Python docs 2.4.1. String literals say, "The backslash (\) character is used to escape characters that otherwise have a special meaning, such as newline, backslash itself, or the quote character."
s = "\n"
print len(s), s
1
 
(Note that the newline isn't visible in this example, but if you look carefully, you'll see a blank line after the "1".)
To get our two-character string, we have to use another backslash character to escape the special meaning of the original backslash character:
s = "\\n"
print len(s), s
2 \n
What if you want to represent strings that have many backslash characters in them? Python docs 2.4.1. String literals continue, "String literals may optionally be prefixed with a letter 'r' or 'R'; such strings are called raw strings and use different rules for interpreting backslash escape sequences." Here is our two-character string, using raw string representation:
s = r"\n"
print len(s), s
2 \n
So we have three different string representations, all giving the same string, or sequence of characters:
print chr(92)+chr(110) == "\\n" == r"\n"
True
Now, let's turn to regular expressions. The Python docs, 7.2. re — Regular expression operations says, "Regular expressions use the backslash character ('\') to indicate special forms or to allow special characters to be used without invoking their special meaning. This collides with Python’s usage of the same character for the same purpose in string literals..."
If you want a Python regular expression object which matches a newline character, then you need a 2-character string, consisting of the backslash character followed by the n character. The following lines of code all set prog to a regular expression object which recognises a newline character:
prog = re.compile(chr(92)+chr(110))
prog = re.compile("\\n")
prog = re.compile(r"\n")
So why is it that "Usually patterns will be expressed in Python code using this raw string notation."? Because regular expressions are frequently static strings, which are conveniently represented as string literals. And from the different string literal notations available, raw strings are a convenient choice, when the regular expression includes a backslash character.
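Here is roughly the same comparison for Python 3, as a sketch. It also includes the one-character string "\n", which the regular expression system accepts as well because a literal newline in a pattern matches itself. The labels and the sample text are made up.

```python
import re

patterns = [
    ("built from chr()", chr(92) + chr(110)),
    ("doubled backslash", "\\n"),
    ("raw string", r"\n"),
    ("real newline character", "\n"),
]

sample = "first\nsecond"
lengths = [len(pattern) for _, pattern in patterns]
matches = {label: bool(re.search(pattern, sample)) for label, pattern in patterns}

for (label, pattern), length in zip(patterns, lengths):
    # The first three are the same two-character string; the last has one character.
    print(label, length, matches[label])
```

The first three entries are equal strings, so re.compile() cannot tell them apart at all.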
Questions
Q: what about the expression re.compile(r"\s\tWord")? A: It's easier to understand by separating the string from the regular expression compilation, and understanding them separately.
s = r"\s\tWord"
prog = re.compile(s)
The string s contains eight characters: a backslash, an s, a backslash, a t, and then four characters Word.
Q: What happens to the tab and space characters? A: At the Python language level, string s doesn't have tab and space characters. It starts with four characters: backslash, s, backslash, t. The regular expression system, meanwhile, treats that string as source code in the regular expression language, where it means "match a string consisting of a whitespace character, a tab character, and the four characters Word".
Q: How do you match those if that's being treated as backslash-s and backslash-t? A: Maybe the question is clearer if the words 'you' and 'that' are made more specific: how does the regular expression system match the expressions backslash-s and backslash-t? As 'any whitespace character' and as 'tab character'.
Q: Or what if you have the 3-character string backslash-n-newline? A: In the Python language, the 3-character string backslash-n-newline can be represented as conventional string "\\n\n", or raw plus conventional string r"\n" "\n", or in other ways. The regular expression system, given that 3-character string as a pattern, matches any two consecutive newline characters.
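The answers above can be checked with a few lines of Python 3. This is only a sketch; the test strings are invented for the purpose.

```python
import re

s = r"\s\tWord"
prog = re.compile(s)

print(len(s))                          # eight characters, none of them a tab or space
print(bool(prog.match(" \tWord")))     # space, tab, Word: matches
print(bool(prog.match("\n\tWord")))    # a newline also counts as whitespace
print(bool(prog.match("  Word")))      # second character is a space, not a tab

# Adjacent literals are concatenated: backslash, n, then a real newline.
three_chars = r"\n" "\n"
print(len(three_chars), three_chars == "\\n\n")
print(bool(re.search(three_chars, "a\n\nb")))   # two consecutive newlines
print(bool(re.search(three_chars, "a\nb")))     # only one newline
```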
N.B. All examples and document references are to Python 2.7.
Update: Incorporated clarifications from answers of @Vladislav Zorov and @m.buettner, and from follow-up question of @Aerovistae.
Most of these questions have a lot of words in them and maybe it's hard to find the answer to your specific question.
If you use a regular string and you pass in a pattern like "\t" to the RegEx parser, Python will translate that literal into a buffer with the tab byte in it (0x09).
If you use a raw string and you pass in a pattern like r"\t" to the RegEx parser, Python does not do any interpretation, and it creates a buffer with two bytes in it: '\', and 't'. (0x5c, 0x74).
The RegEx parser knows what to do with the sequence '\t' -- it matches that against a tab. It also knows what to do with the 0x09 character -- that also matches a tab. For the most part, the results will be indistinguishable.
So the key to understanding what's happening is recognizing that there are two parsers being employed here. The first one is the Python parser, and it translates your string literal (or raw string literal) into a sequence of bytes. The second one is Python's regular expression parser, and it converts a sequence of bytes into a compiled regular expression.
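To make the two stages visible, the following Python 3 sketch prints the character codes that the Python parser produces for each literal and then hands both strings to the regular expression parser. The names plain and raw are illustrative.

```python
import re

plain = "\t"    # Python turns this into a single tab character
raw = r"\t"     # Python leaves this as backslash followed by t

print([hex(ord(c)) for c in plain])
print([hex(ord(c)) for c in raw])

# The regular expression parser accepts either form as "match a tab".
print(bool(re.fullmatch(plain, "\t")), bool(re.fullmatch(raw, "\t")))
```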
The issue with using a normal string to write regexes that contain a \ is that you end up having to write \\ for every \. So the string literals "stuff\\things" and r"stuff\things" produce the same string. This gets especially useful if you want to write a regular expression that matches against backslashes.
Using normal strings, a regexp that matches the string \ would be "\\\\"!
Why? Because we have to escape \ twice: once for the regular expression syntax, and once for the string syntax.
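As a sketch of the double escaping in Python 3 (the path value is just sample data):

```python
import re

path = "C:\\temp"   # seven characters: C : \ t e m p

normal = "\\\\"     # four backslashes in the source, two in the string
raw = r"\\"         # two backslashes in the source, two in the string

print(normal == raw, len(raw))
print(re.findall(raw, path))        # the pattern matches the one backslash in path
print(re.split(raw, path))          # splitting on it gives the two parts
```

Both literals hand the same two-character pattern to the regex parser, which then reads it as "one literal backslash".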
You can use triple quotes to include newlines, like this:
r'''stuff\
things'''
Note that usually, python would treat \-newline as a line continuation, but this is not the case in raw strings. Also note that backslashes still escape quotes in raw strings, but are left in themselves. So the raw string literal r"\"" produces the string \". This means you can't end a raw string literal with a backslash.
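A short Python 3 sketch of both points, with variable names chosen only for illustration:

```python
raw_multiline = r'''stuff\
things'''
plain_multiline = '''stuff\
things'''

print(repr(raw_multiline))     # backslash and newline are both kept
print(repr(plain_multiline))   # backslash-newline was swallowed as a continuation

escaped_quote = r"\""          # the backslash escapes the quote but stays in the string
print(len(escaped_quote), repr(escaped_quote))
```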
See the lexical analysis section of the Python documentation for more information.
You seem to be struggling with the idea that a RegEx isn't part of Python, but instead a different programming language with its own parser and compiler. Raw strings help you get the "source code" of a RegEx safely to the RegEx parser, which will then assign meaning to character sequences like \d, \w, \n, etc...
The issue exists because Python and RegExps use \ as escape character, which is, by the way, a coincidence - there are languages with other escape characters (like "`n" for a newline, but even there you have to use "\n" in RegExps). The advantage is that you don't need to differentiate between raw and non-raw strings in these languages, they won't both try to convert the text and butcher it, because they react to different escape sequences.
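A raw string does not affect special sequences in Python regex such as \w and \d, because those are not Python escape sequences in the first place. It only affects escape sequences such as \n. So most of the time it doesn't matter whether we write r in front or not. A notable exception is \b, which is a backspace in a normal string but a word boundary in a regex, so it needs the r prefix (or a doubled backslash).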
I think that is the answer most beginners are looking for.
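One case where the r prefix does matter is \b, because Python itself recognizes it as an escape (backspace) while the regex language uses it for a word boundary. A minimal sketch in Python 3, with an invented sample sentence:

```python
import re

sample = "cat concat cat"

with_raw = re.findall(r"\bcat\b", sample)     # \b reaches the regex parser: word boundary
without_raw = re.findall("\bcat\b", sample)   # Python already turned \b into a backspace

print(with_raw)
print(without_raw)
print(len(r"\b"), len("\b"))   # two characters versus one
```

So dropping the r is harmless for \w or \d but silently changes the meaning of \b.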
The relevant Python manual section ("String and Bytes literals") has a clear explanation of raw string literals:
Both string and bytes literals may optionally be prefixed with a
letter 'r' or 'R'; such strings are called raw strings and treat
backslashes as literal characters. As a result, in string literals,
'\U' and '\u' escapes in raw strings are not treated specially. Given
that Python 2.x’s raw unicode literals behave differently than Python
3.x’s the 'ur' syntax is not supported.
New in version 3.3: The 'rb' prefix of raw bytes literals has been
added as a synonym of 'br'.
New in version 3.3: Support for the unicode legacy literal (u'value')
was reintroduced to simplify the maintenance of dual Python 2.x and
3.x codebases. See PEP 414 for more information.
In triple-quoted strings, unescaped newlines and quotes are allowed
(and are retained), except that three unescaped quotes in a row
terminate the string. (A “quote” is the character used to open the
string, i.e. either ' or ".)
Unless an 'r' or 'R' prefix is present, escape sequences in strings
are interpreted according to rules similar to those used by Standard
C. The recognized escape sequences are:
Escape Sequence    Meaning                              Notes
\newline           Backslash and newline ignored
\\                 Backslash (\)
\'                 Single quote (')
\"                 Double quote (")
\a                 ASCII Bell (BEL)
\b                 ASCII Backspace (BS)
\f                 ASCII Formfeed (FF)
\n                 ASCII Linefeed (LF)
\r                 ASCII Carriage Return (CR)
\t                 ASCII Horizontal Tab (TAB)
\v                 ASCII Vertical Tab (VT)
\ooo               Character with octal value ooo       (1,3)
\xhh               Character with hex value hh          (2,3)
Escape sequences only recognized in string literals are:
Escape Sequence    Meaning                                         Notes
\N{name}           Character named name in the Unicode database    (4)
\uxxxx             Character with 16-bit hex value xxxx            (5)
\Uxxxxxxxx         Character with 32-bit hex value xxxxxxxx        (6)
Notes:
(1) As in Standard C, up to three octal digits are accepted.
(2) Unlike in Standard C, exactly two hex digits are required.
(3) In a bytes literal, hexadecimal and octal escapes denote the byte with the given value. In a string literal, these escapes denote a Unicode character with the given value.
(4) Changed in version 3.3: Support for name aliases [1] has been added.
(5) Individual code units which form parts of a surrogate pair can be encoded using this escape sequence. Exactly four hex digits are required.
(6) Any Unicode character can be encoded this way, but characters outside the Basic Multilingual Plane (BMP) will be encoded using a surrogate pair if Python is compiled to use 16-bit code units (the default). Exactly eight hex digits are required.
Unlike Standard C, all unrecognized escape sequences are left in the
string unchanged, i.e., the backslash is left in the string. (This
behavior is useful when debugging: if an escape sequence is mistyped,
the resulting output is more easily recognized as broken.) It is also
important to note that the escape sequences only recognized in string
literals fall into the category of unrecognized escapes for bytes
literals.
Even in a raw string, string quotes can be escaped with a backslash,
but the backslash remains in the string; for example, r"\"" is a valid
string literal consisting of two characters: a backslash and a double
quote; r"\" is not a valid string literal (even a raw string cannot
end in an odd number of backslashes). Specifically, a raw string
cannot end in a single backslash (since the backslash would escape the
following quote character). Note also that a single backslash followed
by a newline is interpreted as those two characters as part of the
string, not as a line continuation.
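The rule about trailing backslashes can be tested without crashing your script by compiling literal source text. This is a sketch for Python 3: is_valid_literal is a hypothetical helper written for this example, not a standard function.

```python
def is_valid_literal(source):
    """Return True if `source` compiles as a Python expression."""
    try:
        compile(source, "<example>", "eval")
        return True
    except SyntaxError:
        return False


escaped_quote = 'r"\\""'      # source text: r"\""
lone_backslash = 'r"\\"'      # source text: r"\"
two_backslashes = 'r"\\\\"'   # source text: r"\\"

print(is_valid_literal(escaped_quote))     # valid: backslash plus double quote
print(is_valid_literal(lone_backslash))    # invalid: the closing quote got escaped
print(is_valid_literal(two_backslashes))   # valid: an even number of backslashes

# One common workaround when a path must end in a backslash:
trailing = r"C:\temp" + "\\"
print(trailing)
```

Concatenating a normal "\\" onto the raw string is just one option; adjacent literals (r"C:\temp" "\\") work the same way.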
\n is an Escape Sequence in Python
\w is a Special Sequence in (Python) Regex
They look like they are in the same family but they are not. Raw string notation will affect Escape Sequences but not Regex Special Sequences.
For more about Escape Sequences
search for "\newline"
https://docs.python.org/3/reference/lexical_analysis.html
For more about Special Sequences:
search for "\number"
https://docs.python.org/3/library/re.html

What exactly is a "raw string regex" and how can you use it?

From the python documentation on regex, regarding the '\' character:
The solution is to use Python’s raw string notation for regular
expression patterns; backslashes are not handled in any special way in
a string literal prefixed with 'r'. So r"\n" is a two-character string
containing '\' and 'n', while "\n" is a one-character string
containing a newline. Usually patterns will be expressed in Python
code using this raw string notation.
What is this raw string notation? If you use a raw string format, does that mean "*" is taken as a a literal character rather than a zero-or-more indicator? That obviously can't be right, or else regex would completely lose its power. But then if it's a raw string, how does it recognize newline characters if "\n" is literally a backslash and an "n"?
I don't follow.
Edit for bounty:
I'm trying to understand how a raw string regex matches newlines, tabs, and character sets, e.g. \w for words or \d for digits or all whatnot, if raw string patterns don't recognize backslashes as anything more than ordinary characters. I could really use some good examples.
Zarkonnen's response does answer your question, but not directly. Let me try to be more direct, and see if I can grab the bounty from Zarkonnen.
You will perhaps find this easier to understand if you stop using the terms "raw string regex" and "raw string patterns". These terms conflate two separate concepts: the representations of a particular string in Python source code, and what regular expression that string represents.
In fact, it's helpful to think of these as two different programming languages, each with their own syntax. The Python language has source code that, among other things, builds strings with certain contents, and calls the regular expression system. The regular expression system has source code that resides in string objects, and matches strings. Both languages use backslash as an escape character.
First, understand that a string is a sequence of characters (i.e. bytes or Unicode code points; the distinction doesn't much matter here). There are many ways to represent a string in Python source code. A raw string is simply one of these representations. If two representations result in the same sequence of characters, they produce equivalent behaviour.
Imagine a 2-character string, consisting of the backslash character followed by the n character. If you know that the character value for backslash is 92, and for n is 110, then this expression generates our string:
s = chr(92)+chr(110)
print len(s), s
2 \n
The conventional Python string notation "\n" does not generate this string. Instead it generates a one-character string with a newline character. The Python docs 2.4.1. String literals say, "The backslash (\) character is used to escape characters that otherwise have a special meaning, such as newline, backslash itself, or the quote character."
s = "\n"
print len(s), s
1
 
(Note that the newline isn't visible in this example, but if you look carefully, you'll see a blank line after the "1".)
To get our two-character string, we have to use another backslash character to escape the special meaning of the original backslash character:
s = "\\n"
print len(s), s
2 \n
What if you want to represent strings that have many backslash characters in them? Python docs 2.4.1. String literals continue, "String literals may optionally be prefixed with a letter 'r' or 'R'; such strings are called raw strings and use different rules for interpreting backslash escape sequences." Here is our two-character string, using raw string representation:
s = r"\n"
print len(s), s
2 \n
So we have three different string representations, all giving the same string, or sequence of characters:
print chr(92)+chr(110) == "\\n" == r"\n"
True
Now, let's turn to regular expressions. The Python docs, 7.2. re — Regular expression operations says, "Regular expressions use the backslash character ('\') to indicate special forms or to allow special characters to be used without invoking their special meaning. This collides with Python’s usage of the same character for the same purpose in string literals..."
If you want a Python regular expression object which matches a newline character, then you need a 2-character string, consisting of the backslash character followed by the n character. The following lines of code all set prog to a regular expression object which recognises a newline character:
prog = re.compile(chr(92)+chr(110))
prog = re.compile("\\n")
prog = re.compile(r"\n")
So why is it that "Usually patterns will be expressed in Python code using this raw string notation."? Because regular expressions are frequently static strings, which are conveniently represented as string literals. And from the different string literal notations available, raw strings are a convenient choice, when the regular expression includes a backslash character.
Questions
Q: what about the expression re.compile(r"\s\tWord")? A: It's easier to understand by separating the string from the regular expression compilation, and understanding them separately.
s = r"\s\tWord"
prog = re.compile(s)
The string s contains eight characters: a backslash, an s, a backslash, a t, and then four characters Word.
Q: What happens to the tab and space characters? A: At the Python language level, string s doesn't have tab and space character. It starts with four characters: backslash, s, backslash, t . The regular expression system, meanwhile, treats that string as source code in the regular expression language, where it means "match a string consisting of a whitespace character, a tab character, and the four characters Word.
Q: How do you match those if that's being treated as backlash-s and backslash-t? A: Maybe the question is clearer if the words 'you' and 'that' are made more specific: how does the regular expression system match the expressions backlash-s and backslash-t? As 'any whitespace character' and as 'tab character'.
Q: Or what if you have the 3-character string backslash-n-newline? A: In the Python language, the 3-character string backslash-n-newline can be represented as conventional string "\\n\n", or raw plus conventional string r"\n" "\n", or in other ways. The regular expression system matches the 3-character string backslash-n-newline when it finds any two consecutive newline characters.
N.B. All examples and document references are to Python 2.7.
Update: Incorporated clarifications from answers of #Vladislav Zorov and #m.buettner, and from follow-up question of #Aerovistae.
Most of these questions have a lot of words in them and maybe it's hard to find the answer to your specific question.
If you use a regular string and you pass in a pattern like "\t" to the RegEx parser, Python will translate that literal into a buffer with the tab byte in it (0x09).
If you use a raw string and you pass in a pattern like r"\t" to the RegEx parser, Python does not do any interpretation, and it creates a buffer with two bytes in it: '\', and 't'. (0x5c, 0x74).
The RegEx parser knows what to do with the sequence '\t' -- it matches that against a tab. It also knows what to do with the 0x09 character -- that also matches a tab. For the most part, the results will be indistinguishable.
So the key to understanding what's happening is recognizing that there are two parsers being employed here. The first one is the Python parser, and it translates your string literal (or raw string literal) into a sequence of bytes. The second one is Python's regular expression parser, and it converts a sequence of bytes into a compiled regular expression.
The issue with using a normal string to write regexes that contain a \ is that you end up having to write \\ for every \. So the string literals "stuff\\things" and r"stuff\things" produce the same string. This gets especially useful if you want to write a regular expression that matches against backslashes.
Using normal strings, a regexp that matches the string \ would be "\\\\"!
Why? Because we have to escape \ twice: once for the regular expression syntax, and once for the string syntax.
You can use triple quotes to include newlines, like this:
r'''stuff\
things'''
Note that usually, python would treat \-newline as a line continuation, but this is not the case in raw strings. Also note that backslashes still escape quotes in raw strings, but are left in themselves. So the raw string literal r"\"" produces the string \". This means you can't end a raw string literal with a backslash.
See the lexical analysis section of the Python documentation for more information.
You seem to be struggling with the idea that a RegEx isn't part of Python, but instead a different programming language with its own parser and compiler. Raw strings help you get the "source code" of a RegEx safely to the RegEx parser, which will then assign meaning to character sequences like \d, \w, \n, etc...
The issue exists because Python and RegExps both use \ as the escape character, which is, by the way, a coincidence - there are languages with other escape characters (like "`n" for a newline, but even there you have to use "\n" in RegExps). The advantage is that you don't need to differentiate between raw and non-raw strings in these languages: the two parsers won't both try to convert the text and butcher it, because they react to different escape sequences.
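A raw string does not affect special sequences in Python regexes such as \w and \d. It only affects escape sequences such as \n. So most of the time it doesn't matter whether we write r in front or not. (One exception is \b: in a normal string it is the backspace escape, so the regex parser never sees the word-boundary sequence.)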
I think that is the answer most beginners are looking for.
The relevant Python manual section ("String and Bytes literals") has a clear explanation of raw string literals:
Both string and bytes literals may optionally be prefixed with a
letter 'r' or 'R'; such strings are called raw strings and treat
backslashes as literal characters. As a result, in string literals,
'\U' and '\u' escapes in raw strings are not treated specially. Given
that Python 2.x’s raw unicode literals behave differently than Python
3.x’s the 'ur' syntax is not supported.
New in version 3.3: The 'rb' prefix of raw bytes literals has been
added as a synonym of 'br'.
New in version 3.3: Support for the unicode legacy literal (u'value')
was reintroduced to simplify the maintenance of dual Python 2.x and
3.x codebases. See PEP 414 for more information.
In triple-quoted strings, unescaped newlines and quotes are allowed
(and are retained), except that three unescaped quotes in a row
terminate the string. (A “quote” is the character used to open the
string, i.e. either ' or ".)
Unless an 'r' or 'R' prefix is present, escape sequences in strings
are interpreted according to rules similar to those used by Standard
C. The recognized escape sequences are:
Escape Sequence   Meaning                           Notes
\newline          Backslash and newline ignored
\\                Backslash (\)
\'                Single quote (')
\"                Double quote (")
\a                ASCII Bell (BEL)
\b                ASCII Backspace (BS)
\f                ASCII Formfeed (FF)
\n                ASCII Linefeed (LF)
\r                ASCII Carriage Return (CR)
\t                ASCII Horizontal Tab (TAB)
\v                ASCII Vertical Tab (VT)
\ooo              Character with octal value ooo    (1,3)
\xhh              Character with hex value hh       (2,3)
Escape sequences only recognized in string literals are:
Escape Sequence   Meaning                                          Notes
\N{name}          Character named name in the Unicode database     (4)
\uxxxx            Character with 16-bit hex value xxxx             (5)
\Uxxxxxxxx        Character with 32-bit hex value xxxxxxxx         (6)
Notes:
1. As in Standard C, up to three octal digits are accepted.
2. Unlike in Standard C, exactly two hex digits are required.
3. In a bytes literal, hexadecimal and octal escapes denote the byte with the given value. In a string literal, these escapes denote a Unicode character with the given value.
4. Changed in version 3.3: Support for name aliases [1] has been added.
5. Individual code units which form parts of a surrogate pair can be encoded using this escape sequence. Exactly four hex digits are required.
6. Any Unicode character can be encoded this way, but characters outside the Basic Multilingual Plane (BMP) will be encoded using a surrogate pair if Python is compiled to use 16-bit code units (the default). Exactly eight hex digits are required.
Unlike Standard C, all unrecognized escape sequences are left in the
string unchanged, i.e., the backslash is left in the string. (This
behavior is useful when debugging: if an escape sequence is mistyped,
the resulting output is more easily recognized as broken.) It is also
important to note that the escape sequences only recognized in string
literals fall into the category of unrecognized escapes for bytes
literals.
Even in a raw string, string quotes can be escaped with a backslash,
but the backslash remains in the string; for example, r"\"" is a valid
string literal consisting of two characters: a backslash and a double
quote; r"\" is not a valid string literal (even a raw string cannot
end in an odd number of backslashes). Specifically, a raw string
cannot end in a single backslash (since the backslash would escape the
following quote character). Note also that a single backslash followed
by a newline is interpreted as those two characters as part of the
string, not as a line continuation.
\n is an Escape Sequence in Python
\w is a Special Sequence in (Python) Regex
They look like they are in the same family but they are not. Raw string notation will affect Escape Sequences but not Regex Special Sequences.
For more about Escape Sequences
search for "\newline"
https://docs.python.org/3/reference/lexical_analysis.html
For more about Special Sequences:
search for "\number"
https://docs.python.org/3/library/re.html
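The distinction between escape sequences and regex special sequences can be sketched like this (the sample strings are made up). The \b case is included because it happens to be both a Python escape sequence (backspace) and a regex special sequence (word boundary), so it is one place where the r prefix does change the result:

```python
import re

# \n is a Python escape sequence: the r prefix changes the string itself.
print(len("\n"), len(r"\n"))   # 1 2

# \d is a regex special sequence: the regex parser interprets it.
print(re.findall(r"\d+", "room 101, floor 7"))   # ['101', '7']

# \b is both. With r, the regex parser sees a word boundary.
print(re.findall(r"\bcat\b", "cat concat"))   # ['cat']

# Without r, Python has already turned \b into a backspace character,
# so the pattern looks for literal backspaces and finds nothing.
print(re.findall("\bcat\b", "cat concat"))    # []
```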

Why can't I end a raw string with a backslash? [duplicate]

I am confused here: raw strings appear to convert every \ to \\, but when a \ appears at the end, it raises an error.
>>> r'so\m\e \te\xt'
'so\\m\\e \\te\\xt'
>>> r'so\m\e \te\xt\'
SyntaxError: EOL while scanning string literal
Update:
This is now covered in Python FAQs as well: Why can’t raw strings (r-strings) end with a backslash?
You still need \ to escape ' or " in raw strings, since otherwise the Python interpreter doesn't know where the string stops. In your example, you're escaping the closing '.
Otherwise:
r'it wouldn\'t be possible to store this string'
r'since it'd produce a syntax error without the escape'
Look at the syntax highlighting to see what I mean.
Raw strings can't end in single backslashes because of how the parser works (there is no actual escaping going on, though). The workaround is to add the backslash as a non-raw string literal afterwards:
>>> print(r'foo\')
File "<stdin>", line 1
print(r'foo\')
^
SyntaxError: EOL while scanning string literal
>>> print(r'foo''\\')
foo\
Not pretty, but it works. You can add plus to make it clearer what is happening, but it's not necessary:
>>> print(r'foo' + '\\')
foo\
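For comparison, here is a short sketch confirming that the workarounds above all build the same string, and that simply doubling the backslash inside the raw literal does not (the variable names are only illustrative):

```python
# Adjacent literals are concatenated at compile time.
implicit = r'foo' '\\'

# Explicit concatenation with + gives the same result.
explicit = r'foo' + '\\'

# A plain (non-raw) literal with an escaped backslash also works.
plain = 'foo\\'

print(implicit == explicit == plain)   # True
print(implicit)                        # foo\

# Doubling the backslash in a raw literal is valid, but keeps both backslashes.
doubled = r'foo\\'
print(len(implicit), len(doubled))     # 4 5
```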
Python strings are processed in two steps:
First the tokenizer looks for the closing quote. It recognizes backslashes when it does this, but doesn't interpret them - it just looks for a sequence of string elements followed by the closing quote mark, where "string elements" are either (a character that's not a backslash, closing quote or a newline - except newlines are allowed in triple-quotes), or (a backslash, followed by any single character).
Then the contents of the string are interpreted (backslash escapes are processed) depending on what kind of string it is. The r flag before a string literal only affects this step.
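The first step can be modelled with a regular expression. This is only a hypothetical sketch of the rule described above, not the actual CPython tokenizer, and it covers only single-quoted, single-line literals; SINGLE_QUOTED and finds_closing_quote are names invented for this example:

```python
import re

# An opening quote, any number of "string elements", then a closing quote.
# A string element is either a character that is not a quote, backslash or
# newline, or a backslash followed by any single character.
SINGLE_QUOTED = re.compile(r"'(?:[^'\\\n]|\\.)*'")

def finds_closing_quote(source):
    """Return True if the whole of `source` scans as one quoted literal."""
    return SINGLE_QUOTED.fullmatch(source) is not None

# Source text 'foo\\' : the two backslashes pair up, so the last quote closes it.
print(finds_closing_quote("'foo\\\\'"))   # True

# Source text 'foo\' : the backslash pairs with the quote, so nothing closes it.
print(finds_closing_quote("'foo\\'"))     # False

# Source text 'it\'s' : the escaped quote is skipped, the final quote closes it.
print(finds_closing_quote("'it\\'s'"))    # True
```

Because this scan happens before the r prefix is taken into account, the outcome is the same for raw and non-raw literals.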
Quote from https://docs.python.org/3.4/reference/lexical_analysis.html#literals:
Even in a raw literal, quotes can be escaped with a backslash, but the
backslash remains in the result; for example, r"\"" is a valid string
literal consisting of two characters: a backslash and a double quote;
r"\" is not a valid string literal (even a raw string cannot end in an
odd number of backslashes). Specifically, a raw literal cannot end in
a single backslash (since the backslash would escape the following
quote character). Note also that a single backslash followed by a
newline is interpreted as those two characters as part of the literal,
not as a line continuation.
So in a raw string, backslashes are not treated specially, except when preceding " or '. Therefore, r'\' and r"\" are not valid string literals, because the closing quote is escaped and the literal is never terminated. In this case it makes no difference whether the r prefix is present: r'\' fails in the same way as '\', and r"\" fails in the same way as "\".
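One way to check this without typing broken literals into the interpreter is to hand the source text to the built-in compile() and see whether it raises SyntaxError. This is a minimal sketch; is_valid_literal is a helper name invented for this example:

```python
def is_valid_literal(source):
    """Return True if `source` compiles as a Python expression."""
    try:
        compile(source, "<example>", "eval")
    except SyntaxError:
        return False
    return True

# Each entry is the source text of a literal; the comment shows what it looks like.
candidates = [
    "r'\\'",     # r'\'
    "'\\'",      # '\'
    'r"\\"',     # r"\"
    '"\\"',      # "\"
    "r'\\\\'",   # r'\\'
]

for source in candidates:
    print(source, is_valid_literal(source))
```

Only the last candidate, which ends in an even number of backslashes, compiles; the other four fail whether or not the r prefix is present.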
