Using findall method in a tokenized text, and prefix 'r' [duplicate]

This question already has answers here:
What does the "r" in pythons re.compile(r' pattern flags') mean?
(3 answers)
Closed 5 years ago.
I understand that the 'r' prefix indicates a raw string. Why is the 'r' prefix used in the following example, given that the string contains special regex characters which should not be taken literally?
The 'string' being searched is an nltk Text object, and I suppose that has something to do with it, but I don't understand how it affects the usage of findall.
moby.findall(r"<a> (<.*>) <man>")

In this particular case, r makes no difference, as this string does not contain any sequences which could be misinterpreted. However, it is a good habit to use r when writing regular expressions, to avoid misinterpretation of sequences like \n or \t; with r, they are treated literally, as two characters - backslash followed by a letter; without r, they evaluate to newline and tab, respectively.

The r preceding the string is a string prefix; it marks the literal as a raw string.
For example, '\n' will be treated as a newline character, while r'\n' will be treated as the characters \ followed by n.
But for your regex:
moby.findall(r"<a> (<.*>) <man>")
the prefix makes no difference, but it is a good idea to always write regexes as raw strings so that you don't have to escape backslashes.
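As a minimal sketch of the point made in the answers above, the snippet below compares a normal and a raw string literal and checks that the pattern from the question is the same string either way. The variable names are illustrative only, and no nltk object is involved.

```python
# A newline escape versus its raw counterpart.
cooked = '\n'    # one character: a newline
raw = r'\n'      # two characters: a backslash followed by 'n'

print(len(cooked), len(raw))

# The pattern from the question contains no backslashes,
# so the raw and non-raw spellings are the same string.
same = r"<a> (<.*>) <man>" == "<a> (<.*>) <man>"
print(same)
```

The r prefix only changes how Python reads the literal in the source code; it has no effect on how findall later interprets the resulting string.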

Related

Isn't the 'r' letter making the regex pattern string literal? [duplicate]

This question already has answers here:
What exactly is a "raw string regex" and how can you use it?
(7 answers)
Closed 7 months ago.
I had thought the 'r' prefix in the pattern makes sure that everything in the pattern is interpreted as a string literal, so that I don't have to escape anything. But in the case below, I still have to use '\.' to match a literal dot. So what's the purpose of the 'r' at the beginning of the regex?
import re
pattern = r'\.'
text = "this is. test"
text = re.sub(pattern, ' ', text)

The r prefix stands for "raw." It means that backslash escape sequences inside a raw string are kept as literal characters instead of being converted. Consider:
print('Hello\b World') # Hello World
print(r'Hello\b World') # Hello\b World
In the first example, which does not use a raw string, \b is interpreted as a control character, the backspace (which doesn't get printed). In the second example, which uses a raw string, \b stays as the two characters backslash and b, which the regex engine reads as a word boundary.
Another example would be comparing '\1' to r'\1'. In the former, '\1' is a control character, while the latter is a backreference to the first capture group. Note that to refer to the first capture group without using a raw string we can double up the backslash, i.e. use '\\1'.
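The difference described above can be observed directly with the re module. The following is a small sketch; the sample text and variable names are made up for illustration.

```python
import re

text = "the cat sat on the the mat"

# Raw string: the regex engine receives backslash + 'b' and reads it as a word boundary.
with_raw = re.findall(r'\bthe\b', text)

# Non-raw string: Python turns \b into a backspace character before re ever sees it,
# so the pattern looks for backspace + 'the' + backspace and finds nothing.
without_raw = re.findall('\bthe\b', text)

# r'\1' reaches the regex engine as a backreference to group 1;
# here it collapses a repeated word.
deduped = re.sub(r'\b(\w+) \1\b', r'\1', text)

# Even in a raw string, '.' is still a regex metacharacter; r'\.' matches a literal dot.
no_dot = re.sub(r'\.', ' ', "this is. test")

print(with_raw, without_raw, deduped, no_dot)
```

The last line speaks to the original question: the r prefix only stops Python from processing backslashes. It does not switch off regex metacharacters, which still have to be escaped for the regex engine.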

Regex patterns with windows paths in python [duplicate]

This question already has answers here:
Why do backslashes appear twice?
(2 answers)
Closed 7 months ago.
I found a python package on GitHub that doesn't work. It attempts to replace a substring within a url with another string.
string = "filename.txt"
rewrite = "c:\\windows\\system32\\drivers\\hosts"
url = "https://www.example.com/path?parameter=filename.txt"
fullrewrite = re.sub(string, rewrite, url)
The string, rewrite, and url parameters are arbitrary and not hard-coded. I just put them there as an example (this is a path traversal testing library I'm trying to play around with).
When I run this code, I get a KeyError from re, which is expected according to the docs:
If you’re not using a raw string to express the pattern, remember that Python also uses the backslash as an escape sequence in string literals; if the escape sequence isn’t recognized by Python’s parser, the backslash and subsequent character are included in the resulting string. However, if Python would recognize the resulting sequence, the backslash should be repeated twice. This is complicated and hard to understand, so it’s highly recommended that you use raw strings for all but the simplest expressions.
I tried using repr to convert the string into a raw string:
raw = repr(rewrite)[1:-1] # [1:-1] removes extra quotes.
fullrewrite = re.sub(string, raw, url)
But this creates double backslashes in the resulting url: https://www.example.com/path?parameter=c:\\windows\\system32\\drivers\\hosts
My question is how am I supposed to have it replace the key word so that the resulting string is: https://www.example.com/path?parameter=c:\windows\system32\drivers\hosts?

This is my understanding; please correct me if I'm wrong.
You don't get double backslashes, but escaped backslashes. In both re and Python string literals, a single backslash is a special character, so it usually does not stand for a literal backslash. To get one literal backslash, one usually needs to escape it with another. Thus, a double backslash is the escaped representation of a single backslash.
If one passes 'c:\\' to print() or saves it to a text file, one gets c:\.
P.S. Since '\q' is not a special sequence in Python, '\q'=='\\q' returns True.
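For the replacement itself, one possible approach (a sketch, not necessarily what the package author intended) is to keep re from interpreting backslashes in the replacement at all. The three options below reuse the values from the question; the names via_function, via_escaping and via_str_replace are illustrative.

```python
import re

string = "filename.txt"
rewrite = "c:\\windows\\system32\\drivers\\hosts"   # single backslashes at runtime
url = "https://www.example.com/path?parameter=filename.txt"

# Option 1: a function replacement is inserted as-is; re does not process
# backslash escapes in the value it returns.
# re.escape also stops the '.' in the search string from matching any character.
via_function = re.sub(re.escape(string), lambda match: rewrite, url)

# Option 2: double each backslash so the replacement template yields single ones.
via_escaping = re.sub(re.escape(string), rewrite.replace("\\", "\\\\"), url)

# Option 3: if no regex features are needed, plain str.replace avoids the issue.
via_str_replace = url.replace(string, rewrite)

print(via_function)
```

Printing the result with print() shows single backslashes. Inspecting it with repr() or at the interactive prompt shows them doubled, which is only the escaped display form of the same string.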

Raw notation doesn't give the desired outcome [duplicate]

This question already has answers here:
Escaping regex string
(4 answers)
Closed 2 years ago.
I'm trying to use Python's raw notation to find a pattern that includes special characters with no success.
When using the 'r' notation to ignore the special characters nothing is found - see the example below:
Problematic Code
import re
pattern = re.compile(r"testing+101#gmail.com")
sentence = '___dsdtesting+101#gmail.comaaa___'
result = re.search(pattern, sentence).group()
print(result)
The above code does not find the pattern and raises:
AttributeError: 'NoneType' object has no attribute 'group'
Working Code
When escaping the '+' with '\' it works as expected:
import re
pattern = re.compile("testing\+101#gmail.com")
sentence = '___dsdtesting+101#gmail.comaaa___'
result = re.search(pattern, sentence).group()
print(result)
The above code will return the desired outcome of "testing+101#gmail.com".
Am I using the raw notation wrong? What's going on?
TO CLARIFY: I am not interested in escaping with the '\', rather I want to use the raw notation.

There are two levels of special characters here — those that are special to Python’s string syntax, and those that are special in regular expressions. Using raw strings takes care of the first group, but not the second group.
The plus sign is special in regexes, so to match the string a+ you need the regex a\+. Because the backslash is special to Python strings, if you do not use raw strings you need to type this as 'a\\+'. Using raw strings lets you type r'a\+'.
(Because the sequence \+ does not mean anything special to Python, and Python leaves such sequences unchanged, you could actually get away with just 'a\+'.)
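If the goal is to match a piece of text literally without hand-escaping each metacharacter, the standard library's re.escape can handle the second level described above. The sketch below reuses the strings from the question; the variable names are illustrative.

```python
import re

sentence = '___dsdtesting+101#gmail.comaaa___'
literal = "testing+101#gmail.com"

# Raw string alone: '+' still means "one or more of the preceding 'g'",
# so the pattern cannot match the literal '+' in the sentence.
unescaped = re.search(r"testing+101#gmail.com", sentence)

# re.escape backslash-escapes the regex metacharacters ('+' and '.' here),
# so the text is matched literally.
escaped = re.search(re.escape(literal), sentence)

print(unescaped)          # None
print(escaped.group())    # testing+101#gmail.com
```

Writing the escapes by hand in a raw string, as in r"testing\+101#gmail\.com", should behave the same way for this input.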

The backslash character in Regex for Python [duplicate]

This question already has answers here:
Python regex - r prefix
(5 answers)
Closed 7 months ago.
In the Python documentation for Regex, the author mentions:
regular expressions use the backslash character ('\') to indicate
special forms or to allow special characters to be used without
invoking their special meaning. This conflicts with Python’s usage of
the same character for the same purpose in string literals.
He then goes on to give an example of matching \section in a regex:
to match a literal backslash, one has to write '\\\\' as the RE
string, because the regular expression must be \\, and each backslash
must be expressed as \\ inside a regular Python string literal. In REs
that feature backslashes repeatedly, this leads to lots of repeated
backslashes and makes the resulting strings difficult to understand.
He then says that the solution to this "backslash plague" is to begin a string with r to turn it into a raw string.
Later though, he gives this example of using Regex:
p = re.compile('\d+')
p.findall('12 drummers drumming, 11 pipers piping, 10 lords a-leaping')
which results in:
['12', '11', '10']
I am confused as to why we did not need to include an r in this case before '\d+'. I thought, based on the previous explanations of backslash, that we'd need to tell Python that the backslash in this string is not the backslash that it knows.

Python only recognizes some sequences starting with \ as escape sequences. For example, \d is not a recognized escape sequence, so in this particular case there is no need to escape the backslash to keep it in the string.
(In Python 3.6) "\d" and "\\d" are equivalent:
>>> "\d" == "\\d"
True
>>> r"\d" == "\\d"
True
Here is a list of all the recognized escape sequences: https://docs.python.org/3/reference/lexical_analysis.html#string-and-bytes-literals
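A short sketch contrasting a recognized escape with an unrecognized one follows. It uses only the raw and double-backslash spellings, because recent Python versions emit a warning for unrecognized escapes such as '\d' in a normal string (a DeprecationWarning since 3.6, a SyntaxWarning since 3.12), which is one more reason to prefer the r prefix.

```python
import re

# '\\d' (escaped backslash) and r'\d' (raw) are the same two-character string.
print("\\d" == r"\d", len(r"\d"))

# '\t' is a recognized escape, so the raw and non-raw spellings differ.
print("\t" == r"\t", len("\t"), len(r"\t"))

# The raw spelling of the documentation example gives the same matches.
numbers = re.compile(r'\d+').findall(
    '12 drummers drumming, 11 pipers piping, 10 lords a-leaping')
print(numbers)
```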

regex syntax in python, the r before the opening quote [duplicate]

This question already has answers here:
Python - Raw String Literals [duplicate]
(2 answers)
Closed 7 years ago.
This is a line of regex from a Python thing I'm writing:
m = re.match(r"{(.+)}", self.label)
As far as I can tell, it's working fine.
Anyways, my question is about the r character before the first double quote. I've never really questioned it. But why is it there? What is its purpose?

The r before a string literal tells Python not to do any \ escaping on the string. For instance:
>>> print('a\nb')
a
b
>>> print(r'a\nb')
a\nb
>>>
The reason r-prefixed strings are often used with regular expressions is because regular expressions often use a lot of \'s. For instance, to use a simple example, compare the regular expression '\\d+' versus r'\d+'. They're actually the same string, just represented in different ways. With the r syntax, you don't have to escape the \'s that are used in the regular expression syntax. Now imagine having a lot of \'s in your regular expression; it's much cleaner to use the r syntax.
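To make the "lot of \'s" case concrete, here is a sketch that matches a literal backslash followed by a command name. The sample text and variable names are invented for illustration.

```python
import re

text = r"see \section{Intro} and \section{Methods}"

# Without a raw string, four backslashes are needed to match one literal backslash:
# Python halves them to two, and the regex engine reads those two as one.
cooked_pattern = '\\\\section\\{(\\w+)\\}'

# The raw spelling is the same string, with half the backslashes.
raw_pattern = r'\\section\{(\w+)\}'

print(cooked_pattern == raw_pattern)
titles = re.findall(raw_pattern, text)
print(titles)
```

Both patterns are identical once Python has parsed the literals; the raw form is simply easier to read and to get right.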

"String literals may optionally be prefixed with a letter 'r' or 'R'; such strings are called raw strings and use different rules for interpreting backslash escape sequences."
https://docs.python.org/2/reference/lexical_analysis.html#string-literals
