The backslash character in Regex for Python [duplicate]

This question already has answers here:
Python regex - r prefix
(5 answers)
Closed 7 months ago.
In the Python documentation for Regex, the author mentions:
regular expressions use the backslash character ('\') to indicate
special forms or to allow special characters to be used without
invoking their special meaning. This conflicts with Python’s usage of
the same character for the same purpose in string literals.
He then goes on to give an example of matching \section in a regex:
to match a literal backslash, one has to write '\\\\' as the RE
string, because the regular expression must be \\, and each backslash
must be expressed as \\ inside a regular Python string literal. In REs
that feature backslashes repeatedly, this leads to lots of repeated
backslashes and makes the resulting strings difficult to understand.
He then says that the solution to this "backslash plague" is to begin a string with r to turn it into a raw string.
Later though, he gives this example of using Regex:
p = re.compile('\d+')
p.findall('12 drummers drumming, 11 pipers piping, 10 lords a-leaping')
which results in:
['12', '11', '10']
I am confused as to why we did not need to include an r in this case before '\d+'. I thought, based on the previous explanations of backslash, that we'd need to tell Python that the backslash in this string is not the backslash that it knows.

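Python only recognizes some sequences starting with \ as escape sequences. \d is not one of them, so in this particular case the backslash is kept even without escaping it. Relying on this is fragile, though: newer Python versions emit a warning for unrecognized escape sequences such as '\d' in a non-raw literal, so r'\d+' is the safer spelling.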
(In Python 3.6) "\d" and "\\d" are equivalent:
>>> "\d" == "\\d"
True
>>> r"\d" == "\\d"
True
Here is a list of all the recognized escape sequences: https://docs.python.org/3/reference/lexical_analysis.html#string-and-bytes-literals
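A quick way to see which backslash sequences Python rewrites is to compare string lengths. This is only a minimal sketch: the variable names are illustrative, and the non-raw literals use doubled backslashes so the snippet itself does not trigger the unrecognized-escape warning.

```python
import re

# "\\d" (escaped backslash) and r"\d" (raw) are the same two-character string.
escaped = "\\d"
raw = r"\d"
same_string = escaped == raw

# "\n" is a recognized escape, so the non-raw literal is a single character.
newline_len = len("\n")        # one newline character
raw_newline_len = len(r"\n")   # backslash followed by 'n'

# Either spelling hands the regex engine the same pattern, here \d+.
numbers = re.findall(raw + "+", "12 drummers drumming, 11 pipers piping, 10 lords a-leaping")

print(same_string, newline_len, raw_newline_len, numbers)
```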

Related

Raw notation doesn't give the desired outcome [duplicate]

This question already has answers here:
Escaping regex string
(4 answers)
Closed 2 years ago.
I'm trying to use Python's raw notation to find a pattern that includes special characters with no success.
When using the 'r' notation to ignore the special characters nothing is found - see the example below:
Problematic Code
import re
pattern = re.compile(r"testing+101#gmail.com")
sentence = '___dsdtesting+101#gmail.comaaa___'
result = re.search(pattern, sentence).group()
print(result)
The above code will not find the pattern and raises:
AttributeError: 'NoneType' object has no attribute 'group'
Working Code
When escaping the '+' with '\' it works as expected:
import re
pattern = re.compile("testing\+101#gmail.com")
sentence = '___dsdtesting+101#gmail.comaaa___'
result = re.search(pattern, sentence).group()
print(result)
The above code will return the desired outcome of "testing+101#gmail.com".
Am I using the raw notation wrong? What's going on?
TO CLARIFY: I am not interested in escaping with the '\', rather I want to use the raw notation.

There are two levels of special characters here — those that are special to Python’s string syntax, and those that are special in regular expressions. Using raw strings takes care of the first group, but not the second group.
The plus sign is special in regexes, so to match the string a+ you need the regex a\+. Because the backslash is special to Python strings, if you do not use raw strings you need to type this as 'a\\+'. Using raw strings lets you type r'a\+'.
(Because the sequence \+ does not mean anything special to Python, and Python leaves such sequences unchanged, you could actually get away with just 'a\+'.)
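As a sketch of the two options for the pattern in the question: either escape the regex metacharacters by hand inside a raw string, or let re.escape (from the standard re module) do it. The variable names are illustrative only.

```python
import re

sentence = "___dsdtesting+101#gmail.comaaa___"

# Unescaped: '+' means "one or more of the preceding g", so the literal '+' is never matched.
unescaped = re.search(r"testing+101#gmail.com", sentence)

# Hand-escaped inside a raw string: \+ and \. match the literal characters.
by_hand = re.search(r"testing\+101#gmail\.com", sentence)

# re.escape() backslash-escapes the regex metacharacters of a literal string.
auto = re.search(re.escape("testing+101#gmail.com"), sentence)

print(unescaped)          # None
print(by_hand.group())    # testing+101#gmail.com
print(auto.group())       # testing+101#gmail.com
```

The raw prefix only stops Python from interpreting backslashes; it never changes what + or . mean to the regex engine.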

Using findall method in a tokenized text, and prefix 'r' [duplicate]

This question already has answers here:
What does the "r" in pythons re.compile(r' pattern flags') mean?
(3 answers)
Closed 5 years ago.
I understand that the 'r' prefix indicates a raw string. Why, then, is the 'r' prefix used in the following example, given that the string contains special regex characters which should not be taken literally?
The 'string' being searched is an nltk Text object; I suppose it has something to do with this? However, I don't understand how it affects the usage of findall.
moby.findall(r"<a> (<.*>) <man>")

In this particular case, r makes no difference, as this string does not contain any sequences which could be misinterpreted. However, it is a good habit to use r when writing regular expressions, to avoid misinterpretation of sequences like \n or \t; with r, they are treated literally, as two characters - backslash followed by a letter; without r, they evaluate to newline and tab, respectively.

The r preceding the string is a string prefix; it marks the literal as a raw string.
For example, '\n' will be treated as a newline character, while r'\n' will be treated as the characters \ followed by n.
But for your regex:
moby.findall(r"<a> (<.*>) <man>")
it doesn't make a difference. Still, it is a good habit to write regex patterns as raw strings, so that you don't have to double the backslashes.
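To check this for the pattern in the question, one can compare the two literals directly. A minimal sketch, with illustrative variable names:

```python
# No backslashes in the literal: the r prefix changes nothing.
plain = "<a> (<.*>) <man>"
raw = r"<a> (<.*>) <man>"
print(plain == raw)              # True

# With a recognized escape such as \t, the two literals differ.
print(len("\t"), len(r"\t"))     # 1 2
```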

regex syntax in python, the r before the opening quote [duplicate]

This question already has answers here:
Python - Raw String Literals [duplicate]
(2 answers)
Closed 7 years ago.
This is a line of regex from a Python thing I'm writing:
m = re.match(r"{(.+)}", self.label)
As far as I can tell, it's working fine.
Anyways, my question is about the r character before the first double quote. I've never really questioned it. But why is it there? What is its purpose?

The r before a string literal tells Python not to do any \ escaping on the string. For instance:
>>> print('a\nb')
a
b
>>> print(r'a\nb')
a\nb
>>>
The reason r-prefixed strings are often used with regular expressions is because regular expressions often use a lot of \'s. For instance, to use a simple example, compare the regular expression '\\d+' versus r'\d+'. They're actually the same string, just represented in different ways. With the r syntax, you don't have to escape the \'s that are used in the regular expression syntax. Now imagine having a lot of \'s in your regular expression; it's much cleaner to use the r syntax.
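The saving is more visible with a pattern that has to match a literal backslash. The sketch below uses an assumed sample string containing \section; the names are illustrative.

```python
import re

text = r"see \section for details"   # contains one literal backslash

# Without the r prefix, every regex backslash must be doubled for Python first.
escaped_pattern = "\\\\section"
# With the r prefix, the literal reads exactly like the regex.
raw_pattern = r"\\section"

print(escaped_pattern == raw_pattern)           # True
print(re.search(raw_pattern, text).group())     # \section
```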

"String literals may optionally be prefixed with a letter 'r' or 'R'; such strings are called raw strings and use different rules for interpreting backslash escape sequences."
https://docs.python.org/2/reference/lexical_analysis.html#string-literals

Using `r` with String Literals in Python [duplicate]

This question already has answers here:
What exactly do "u" and "r" string prefixes do, and what are raw string literals?
(7 answers)
Closed 8 years ago.
I'm slightly confused over what r actually does and haven't been able to make sense of other explanations associated with it. For example, what is the difference between s1 and s2:
s1 = r'this\\has\no\special\characters'
Edit:
s2 = 'this\\has\no\special\characters'
Thanks

The difference is that s1 has 2 backslashes between "this" and "has" and s2 only has 1. Also, s2 picks up a newline at the \n whereas s1 does not. The difference becomes very clear if you print the strings.
Basically, with r in front of a string literal, what you see is what you get [1]. Without r in front, Python will translate various escape codes (\t, \n, \\, etc.) into different characters (tab, newline, \, etc.)
[1] There is 1 gotcha that I know of ... r'\' is a SyntaxError ...

In the first case the r makes it a raw string, so every backslash is kept exactly as typed (which is why s1 contains a double backslash). Compare with s2, where \\ collapses to a single backslash and \n becomes a newline:
In [218]:
s1 = r'this\\has\no\special\characters'
print(s1)
s2 = 'this\\has\no\special\characters'
print(s2)
this\\has\no\special\characters
this\has
o\special\characters
Something to be careful of when using raw strings to build a path: a raw string literal cannot end in a single backslash, so a path with a trailing backslash fails to parse:
In [220]:
path = r'c:\mytemp\'
File "<ipython-input-220-ca80e74afea4>", line 1
path = r'c:\mytemp\'
^
SyntaxError: EOL while scanning string literal
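Two common ways around this, sketched with illustrative variable names, are to double the backslashes in an ordinary literal or to append the trailing backslash separately:

```python
# A raw string literal cannot end in an odd number of backslashes,
# so the trailing one has to be written some other way.
path_escaped = 'c:\\mytemp\\'        # ordinary literal, each backslash doubled
path_concat = r'c:\mytemp' + '\\'    # raw literal plus an escaped trailing backslash

print(path_escaped == path_concat)   # True
print(path_escaped)                  # prints the path including its trailing backslash
```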

The first one (s1) is a raw string literal, the form normally used for regular expression patterns, and the second is an ordinary string literal. From the Python docs:
Regular expressions use the backslash character ('\') to indicate special forms or to allow special characters to be used without invoking their special meaning. This collides with Python’s usage of the same character for the same purpose in string literals; for example, to match a literal backslash, one might have to write '\\\\' as the pattern string, because the regular expression must be \\, and each backslash must be expressed as \\ inside a regular Python string literal.
The solution is to use Python’s raw string notation for regular expression patterns; backslashes are not handled in any special way in a string literal prefixed with 'r'. So r"\n" is a two-character string containing '\' and 'n', while "\n" is a one-character string containing a newline. Usually patterns will be expressed in Python code using this raw string notation.
so you have :
>>> s1 = r'this\\has\no\special\characters'
>>> s1
'this\\\\has\\no\\special\\characters'
>>> s2 = 'this\\has\no\special\characters'
>>> s2
'this\\has\no\\special\\characters'

Python - Should I be using string prefix r when looking for a period (full stop or .) using regex?

I would like to know the reason I get the same result when using string prefix "r" or not when looking for a period (full stop) using python regex.
After reading a number of sources (links below) multiple times and experimenting in code, only to get the same result (again, see below), I am still unsure of:
What is the difference when using string prefix "r" and not using string prefix "r", when looking for a period using regex?
Which way is considered the correct way of finding a period in a string using python regex with string prefix "r" or without string prefix "r"?
re.compile("\.").sub("!", "blah.")
'blah!'
re.compile(r"\.").sub("!", "blah.")
'blah!'
re.compile(r"\.").search("blah.").group()
'.'
re.compile("\.").search("blah.").group()
'.'
Sources I have looked at:
Python docs: string literals
http://docs.python.org/2/reference/lexical_analysis.html#string-literals
Regular expression to replace "escaped" characters with their originals
Python regex - r prefix
r prefix is for raw strings
http://forums.udacity.com/questions/7000217/r-prefix-is-for-raw-strings

The raw string notation is just that, a notation to specify a string value. The notation results in different string values when it comes to backslash escapes recognized by the normal string notation. Because regular expressions also attach meaning to the backslash character, raw string notation is quite handy as it avoids having to use excessive escaping.
Quoting from the Python Regular Expression HOWTO:
The solution is to use Python’s raw string notation for regular expressions; backslashes are not handled in any special way in a string literal prefixed with 'r', so r"\n" is a two-character string containing '\' and 'n', while "\n" is a one-character string containing a newline. Regular expressions will often be written in Python code using this raw string notation.
The \. combination has no special meaning in regular Python strings, so there is no difference at all between the results of '\.' and r'\.'; you can use either:
>>> len('\.')
2
>>> len(r'\.')
2
Raw strings only make a difference when the backslash + other characters do have special meaning in regular string notation:
>>> '\b'
'\x08'
>>> r'\b'
'\\b'
>>> len('\b')
1
>>> len(r'\b')
2
The \b combination has special meaning; in a regular string it is interpreted as the backspace character. But regular expressions see \b as a word boundary anchor, so you'd have to use \\b in your Python string every time you wanted to use this in a regular expression. Using r'\b' instead makes it much easier to read and write your expressions.
The regular expression functions are passed string values; the result of Python interpreting your string literal. The functions do not know if you used raw or normal string literal syntax.
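The effect on an actual match can be sketched as follows. The sample text and variable names are assumptions made for illustration.

```python
import re

text = "a cat concatenates"

# r"\bcat\b": the regex engine receives backslash-b, the word-boundary anchor.
word_matches = re.findall(r"\bcat\b", text)

# "\bcat\b": Python has already turned each \b into a backspace (\x08),
# so the pattern looks for backspace + "cat" + backspace.
backspace_matches = re.findall("\bcat\b", text)

# The compiled objects only ever see the resulting string value.
same_value = re.compile("\\.").pattern == re.compile(r"\.").pattern

print(word_matches, backspace_matches, same_value)
```

The second call finds nothing because the text contains no backspace characters, not because the regex engine rejects the pattern.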
