Using `r` with String Literals in Python [duplicate] - python

I'm slightly confused about what r actually does, and I haven't been able to make sense of other explanations of it. For example, what is the difference between s1 and s2?
s1 = r'this\\has\no\special\characters'
s2 = 'this\\has\no\special\characters'

The difference is that s1 has two backslashes between "this" and "has", while s2 has only one. Also, s2 picks up a newline at the \n, whereas s1 does not. The difference becomes very clear if you print the strings.
Basically, with r in front of a string literal, what you see is what you get [1]. Without r in front, Python translates various escape codes (\t, \n, \\, etc.) into different characters (tab, newline, \, etc.).
[1] There is one gotcha that I know of: r'\' is a SyntaxError, because a raw string literal cannot end in an odd number of backslashes.

In the first case the r makes it a raw string, so the backslashes are kept exactly as written (you still have a double backslash, and \n stays as a backslash followed by n). Compare with the second string, where \\ collapses to a single backslash and \n becomes a newline:
In [218]:
s1 = r'this\\has\no\special\characters'
print(s1)
s2 = 'this\\has\no\special\characters'
print(s2)
this\\has\no\special\characters
this\has
o\special\characters
Something to be careful of is using raw strings to build a path: if the path ends with a trailing backslash, the literal will not parse, because the final backslash escapes the closing quote:
In [220]:
path = r'c:\mytemp\'
File "<ipython-input-220-ca80e74afea4>", line 1
path = r'c:\mytemp\'
^
SyntaxError: EOL while scanning string literal
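A few common ways around the trailing-backslash problem can be sketched as follows. The variable names are only illustrative; all three spellings should produce the same string.

```python
# A raw string literal cannot end in an odd number of backslashes,
# so the trailing separator has to be written some other way.
path_a = 'c:\\mytemp\\'        # normal literal with every backslash escaped
path_b = r'c:\mytemp' + '\\'   # raw literal plus a normal one-character literal
path_c = r'c:\mytemp' '\\'     # adjacent literals are concatenated by the parser

print(path_a == path_b == path_c)  # True
print(path_a)                      # c:\mytemp\
```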

The first one (s1) is a raw string literal, the notation commonly used for regular expression patterns, and the second is an ordinary string literal. According to the Python docs:
Regular expressions use the backslash character ('\') to indicate special forms or to allow special characters to be used without invoking their special meaning. This collides with Python’s usage of the same character for the same purpose in string literals; for example, to match a literal backslash, one might have to write '\\\\' as the pattern string, because the regular expression must be \\, and each backslash must be expressed as \\ inside a regular Python string literal.
The solution is to use Python’s raw string notation for regular expression patterns; backslashes are not handled in any special way in a string literal prefixed with 'r'. So r"\n" is a two-character string containing '\' and 'n', while "\n" is a one-character string containing a newline. Usually patterns will be expressed in Python code using this raw string notation.
So you have:
>>> s1 = r'this\\has\no\special\characters'
>>> s1
'this\\\\has\\no\\special\\characters'
>>> s2 = 'this\\has\no\special\characters'
>>> s2
'this\\has\no\\special\\characters'

Related

Python regex doesnt match when string contains the special character '+' [duplicate]

import re
response = 'string contains+ as special character'
match = re.match(response, response)  # '+' is treated as a quantifier here
print(match)
The match is not successful because the string contains the special character '+'. With any other special character, the match is successful.
Even with a backslash in front of the special character, it doesn't match.
Neither of these matches:
response = r'string contains\+ as special character'
response = 'string contains\\+ as special character'
How can I match it when the string is a variable and contains this special character?
If you want to use an arbitrary string in a regex but treat it as plain text (so the special regex characters don't take effect), you can escape the whole string with re.escape.
>>> import re
>>> response = 'string contains+ as special character'
>>> re.match(re.escape(response), response)
<re.Match object; span=(0, 37), match='string contains+ as special character'>
In the general case, an arbitrary string used as a pattern does not match itself, although any string that contains no regex metacharacters does.
There are several characters which are regex metacharacters and which do not match themselves. A corner case is . which matches any character (except newline, by default), and so of course it also matches a literal ., but not exclusively. The quantifiers *, +, and ? as well as the generalized repetition operator {m,n} modify the preceding regular expression, round parentheses are reserved for grouping, | for alternation, square brackets define character classes, and finally of course the backslash \ is used to escape any of the preceding metacharacters, or itself.
Depending on what you want to accomplish, you can convert a string to a regex which matches exactly that literal string with re.escape(); but perhaps you simply need to have an incorrect assumption corrected.
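As a rough illustration of the point above, the sketch below embeds an arbitrary piece of text in a larger pattern. The names user_text and haystack are made up for this example.

```python
import re

user_text = "1+1=2 (approx.)"   # hypothetical input containing metacharacters
haystack = "He wrote: 1+1=2 (approx.) on the board"

# Used directly as a pattern, '+', '(', ')' and '.' are read as regex syntax,
# so the text does not match itself.
unescaped = re.search(user_text, haystack)

# re.escape() backslash-escapes the metacharacters so the text matches literally,
# and the result can be combined with ordinary regex syntax.
pattern = re.compile(r"wrote:\s+" + re.escape(user_text))
escaped = pattern.search(haystack)

print(unescaped)        # None
print(escaped.group())  # wrote: 1+1=2 (approx.)
```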

The backslash character in Regex for Python [duplicate]

In the Python documentation for Regex, the author mentions:
regular expressions use the backslash character ('\') to indicate
special forms or to allow special characters to be used without
invoking their special meaning. This conflicts with Python’s usage of
the same character for the same purpose in string literals.
He then goes on to give an example of matching \section in a regex:
to match a literal backslash, one has to write '\\\\' as the RE
string, because the regular expression must be \\, and each backslash
must be expressed as \\ inside a regular Python string literal. In REs
that feature backslashes repeatedly, this leads to lots of repeated
backslashes and makes the resulting strings difficult to understand.
He then says that the solution to this "backslash plague" is to begin a string with r to turn it into a raw string.
Later though, he gives this example of using Regex:
p = re.compile('\d+')
p.findall('12 drummers drumming, 11 pipers piping, 10 lords a-leaping')
which results in:
['12', '11', '10']
I am confused as to why we did not need to include an r in this case before '\d+'. I thought, based on the previous explanations of backslash, that we'd need to tell Python that the backslash in this string is not the backslash that it knows.
Python only recognizes some sequences starting with \ as escape sequences. \d is not a recognized escape sequence, so in this particular case the backslash is kept as is and there is no need to escape it.
(In Python 3.6) "\d" and "\\d" are equivalent:
>>> "\d" == "\\d"
True
>>> r"\d" == "\\d"
True
Here is a list of all the recognized escape sequences: https://docs.python.org/3/reference/lexical_analysis.html#string-and-bytes-literals
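Relying on unrecognized escapes is fragile: since Python 3.6 they trigger a DeprecationWarning, and since Python 3.12 a SyntaxWarning. The sketch below compiles a small piece of source text at run time and records whatever warning the running interpreter emits; the file name "<example>" and the variable names are arbitrary.

```python
import warnings

# Source text of a one-line module; the literal inside it contains backslash-d.
source = "pattern = '\\d+'"

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")  # make sure the warning is not filtered out
    code = compile(source, "<example>", "exec")

namespace = {}
exec(code, namespace)

# The category depends on the Python version in use.
print([w.category.__name__ for w in caught])
# The backslash is still kept, so the value equals the raw-string spelling.
print(namespace["pattern"] == r"\d+")
```

Writing r'\d+' avoids the warning and yields the same string.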

Using findall method in a tokenized text, and prefix 'r' [duplicate]

I understand that the 'r' prefix indicates a raw string. Why, then, is the 'r' prefix used in the following example, given that the string contains special regex characters that should not be taken literally?
The 'string' being searched is an nltk Text object; I suppose it has something to do with this? However, I don't understand how it affects the usage of findall.
moby.findall(r"<a> (<.*>) <man>")
In this particular case, r makes no difference, as this string does not contain any sequences which could be misinterpreted. However, it is a good habit to use r when writing regular expressions, to avoid misinterpretation of sequences like \n or \t; with r, they are treated literally, as two characters - backslash followed by a letter; without r, they evaluate to newline and tab, respectively.
The r preceding the string is a string prefix that marks a raw string literal.
For example, '\n' will be treated as a newline character, while r'\n' will be treated as the characters \ followed by n.
But for your regex:
moby.findall(r"<a> (<.*>) <man>")
it doesn't make a difference, but it is always a good idea to write regexes as raw strings so that you don't have to escape backslashes.

What does the "r" in pythons re.compile(r' pattern flags') mean?

I am reading through http://docs.python.org/2/library/re.html. According to this, the "r" in Python's re.compile(r' pattern flags') refers to the raw string notation:
The solution is to use Python’s raw string notation for regular
expression patterns; backslashes are not handled in any special way in
a string literal prefixed with 'r'. So r"\n" is a two-character string
containing '\' and 'n', while "\n" is a one-character string
containing a newline. Usually patterns will be expressed in Python
code using this raw string notation.
Would it be fair to say then that:
re.compile(r pattern) means that "pattern" is a regex while, re.compile(pattern) means that "pattern" is an exact match?
As @PauloBu stated, the r string prefix is not specifically related to regexes, but to strings generally in Python.
Normal strings use the backslash character as an escape character for special characters (like newlines):
>>> print('this is \n a test')
this is
a test
The r prefix tells the interpreter not to do this:
>>> print(r'this is \n a test')
this is \n a test
>>>
This is important in regular expressions, as you need the backslash to reach the re module intact. In particular, \b matches the empty string at the start and end of a word. re expects the two-character string \b; however, in a normal string literal '\b' is converted to the ASCII backspace character, so you need to either explicitly escape the backslash ('\\b') or tell Python it is a raw string (r'\b').
>>> import re
>>> re.findall('\b', 'test') # the backslash gets consumed by the python string interpreter
[]
>>> re.findall('\\b', 'test') # backslash is explicitly escaped and is passed through to re module
['', '']
>>> re.findall(r'\b', 'test') # often this syntax is easier
['', '']
No. As the documentation you pasted explains, the r prefix indicates that the string is a raw string.
Because Python string escaping and regex escaping both use the backslash \ character, the two collide; raw strings give you a way to tell Python that you want the string left unescaped.
Examine the following:
>>> "\n"
'\n'
>>> r"\n"
'\\n'
>>> print("\n")  # prints a newline, so the output is blank
>>> print(r"\n")
\n
Prefixing a literal with r merely tells Python that backslashes \ should be treated literally and not as escape characters.
This is helpful when, for example, you are searching on a word boundary. The regex for this is \b; to express it in a normal Python string, I'd need to use "\\b" as the pattern. Instead, I can use the raw string r"\b" as the pattern.
This becomes especially handy when trying to find a literal backslash with a regex. To match a backslash I need the regex \\. In a normal Python string each of those backslashes has to be escaped, so the pattern becomes "\\\\", or the much simpler r"\\".
As you can guess, in longer and more complex regexes the extra backslashes get confusing, so raw strings are generally considered the way to go.
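The literal-backslash case can be checked with a short sketch. The sample path is invented for illustration.

```python
import re

text = "C:\\temp\\file.txt"   # actual characters: C:\temp\file.txt

# Both literals produce the same two-character regex: backslash, backslash.
doubled = re.findall("\\\\", text)
raw = re.findall(r"\\", text)

print(doubled == raw)  # True
print(len(raw))        # 2
```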
No. Not everything in regex syntax needs to be preceded by \, so ., *, +, etc. still have special meaning in a pattern.
The r'' notation is often used as a convenience for regexes that need a lot of \, as it prevents the clutter of doubling up each \.
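To make this concrete, here is a small sketch showing that a raw string is still interpreted as a regex, so the dot keeps its special meaning unless it is escaped for the regex engine. The sample text is made up.

```python
import re

text = "abc a.c a-c"

# The r prefix does not switch off regex syntax: '.' still matches any character.
any_char = re.findall(r"a.c", text)

# Escaping the dot for the regex engine makes it match only a literal '.'.
literal_dot = re.findall(r"a\.c", text)

print(any_char)     # ['abc', 'a.c', 'a-c']
print(literal_dot)  # ['a.c']
```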

Python - Should I be using string prefix r when looking for a period (full stop or .) using regex?

I would like to know why I get the same result with or without the string prefix "r" when looking for a period (full stop) using Python regex.
After reading a number of sources (links below) multiple times and experimenting in code, only to get the same result (again, see below), I am still unsure of:
What is the difference between using the string prefix "r" and not using it when looking for a period using regex?
Which way is considered correct for finding a period in a string using Python regex: with the string prefix "r" or without it?
>>> import re
>>> re.compile("\.").sub("!", "blah.")
'blah!'
>>> re.compile(r"\.").sub("!", "blah.")
'blah!'
>>> re.compile(r"\.").search("blah.").group()
'.'
>>> re.compile("\.").search("blah.").group()
'.'
Sources I have looked at:
Python docs: string literals
http://docs.python.org/2/reference/lexical_analysis.html#string-literals
Regular expression to replace "escaped" characters with their originals
Python regex - r prefix
r prefix is for raw strings
http://forums.udacity.com/questions/7000217/r-prefix-is-for-raw-strings
The raw string notation is just that, a notation to specify a string value. The notation results in different string values when it comes to backslash escapes recognized by the normal string notation. Because regular expressions also attach meaning to the backslash character, raw string notation is quite handy as it avoids having to use excessive escaping.
Quoting from the Python Regular Expression HOWTO:
The solution is to use Python’s raw string notation for regular expressions; backslashes are not handled in any special way in a string literal prefixed with 'r', so r"\n" is a two-character string containing '\' and 'n', while "\n" is a one-character string containing a newline. Regular expressions will often be written in Python code using this raw string notation.
The \. combination has no special meaning in regular Python strings, so there is no difference at all between the results of '\.' and r'\.'; you can use either:
>>> len('\.')
2
>>> len(r'\.')
2
Raw strings only make a difference when the backslash + other characters do have special meaning in regular string notation:
>>> '\b'
'\x08'
>>> r'\b'
'\\b'
>>> len('\b')
1
>>> len(r'\b')
2
The \b combination has special meaning; in a regular string it is interpreted as the backspace character. But regular expressions see \b as a word boundary anchor, so you'd have to use \\b in your Python string every time you wanted to use this in a regular expression. Using r'\b' instead makes it much easier to read and write your expressions.
The regular expression functions are passed string values; the result of Python interpreting your string literal. The functions do not know if you used raw or normal string literal syntax.
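A quick way to convince yourself of this is to compare the string values directly. The sketch below uses a made-up pattern and sample text.

```python
import re

normal = '\\bcat\\b'   # normal literal with escaped backslashes
raw = r'\bcat\b'       # raw literal

# Both literals evaluate to the same string, so re cannot tell them apart.
print(normal == raw)                                          # True
print(re.compile(normal).pattern == re.compile(raw).pattern)  # True

# "concat" is skipped because there is no word boundary before its "cat".
matches = re.findall(raw, 'cat concat cat.')
print(matches)  # ['cat', 'cat']
```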
