Isn't the 'r' letter making the regex pattern string literal? [duplicate] - python

This question already has answers here:
What exactly is a "raw string regex" and how can you use it?
(7 answers)
Closed 7 months ago.
I had thought the 'r' prefix makes everything in the pattern a string literal, so that I don't have to escape anything. But in the case below, I still have to use '\.' to match a literal dot. So what is the purpose of the 'r' at the beginning of the regex?
import re
pattern = r'\.'
text = "this is. test"
text = re.sub(pattern, ' ', text)

The r prefix stands for "raw." It means that escape sequences inside a raw string will appear as literal. Consider:
print('Hello\b World') # Hello World
print(r'Hello\b World') # Hello\b World
In the first, non-raw example, Python interprets \b as the backspace control character, which is not shown as visible text. In the second example, the raw string keeps \b as two literal characters, a backslash followed by b; when passed to the re module, that sequence means a word boundary.
Another example is comparing '\1' to r'\1'. The former is a single control character (the character with code 1), while the latter is a backslash followed by 1, which the re module reads as a reference to the first capture group. To refer to the first capture group without a raw string, double the backslash, i.e. use '\\1'.
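As a minimal sketch of the points above (the sample strings are made up for illustration), the following compares a raw pattern with and without a regex-level escape, and a raw with a non-raw '\1'. The raw prefix only changes how Python reads the literal; the dot still has to be escaped for the regex engine.

```python
import re

text = "this is. test"

# An unescaped dot matches any character except a newline, raw string or not.
all_replaced = re.sub(r'.', ' ', text)

# An escaped dot matches only a literal period.
dot_replaced = re.sub(r'\.', ' ', text)

# '\1' is the single character chr(1); r'\1' is a backslash followed by '1'.
control_char = '\1'
backreference = r'\1'

# In a replacement string, r'\2 \1' refers to the second and first capture groups.
swapped = re.sub(r'(\w+) (\w+)', r'\2 \1', 'hello world')

print(repr(all_replaced))
print(repr(dot_replaced))
print(len(control_char), len(backreference))
print(swapped)
```

Printing with repr() makes the difference visible, since the control character would otherwise not show up on screen.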

Related

Python regex doesnt match when string contains the special character '+' [duplicate]

This question already has answers here:
Escape special characters in a Python string
(7 answers)
Escaping regex string
(4 answers)
Closed 2 years ago.
import re
response = 'string contains+ as special character'
match = re.match(response, response)
print(match)
The match is not successful because the string contains the special character '+'. With any other special character, the match is successful.
Even if I put a backslash before the special character, it doesn't match.
Neither of these matches:
response = r'string contains\+ as special character'
response = 'string contains\\+ as special character'
How can I match it when the string is a variable and contains this special character?
If you want to use an arbitrary string in a regex but treat it as plain text (so the special regex characters don't take effect), you can escape the whole string with re.escape.
>>> import re
>>> response = 'string contains+ as special character'
>>> re.match(re.escape(response), response)
<re.Match object; span=(0, 37), match='string contains+ as special character'>
In the general case, an arbitrary string used as a regex does not match itself, though it does whenever the string contains no regex metacharacters.
There are several characters which are regex metacharacters and which do not match themselves. A corner case is . which matches any character (except newline, by default), and so of course it also matches a literal ., but not exclusively. The quantifiers *, +, and ? as well as the generalized repetition operator {m,n} modify the preceding regular expression, round parentheses are reserved for grouping, | for alternation, square brackets define character classes, and finally of course the backslash \ is used to escape any of the preceding metacharacters, or itself.
Depending on what you want to accomplish, you can convert a string to a regex which matches exactly that literal string with re.escape(); but perhaps you simply need to have an incorrect assumption corrected.
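To make the distinction concrete, here is a small sketch that tests whether a few strings match themselves when used as a regex, with and without re.escape(). The sample strings other than the one from the question are invented for illustration.

```python
import re

# Sample strings; all but the first are hypothetical examples.
samples = [
    'string contains+ as special character',
    'a.b',
    'price (USD)',
    'x|y',
    'plain text',
]

results = {}
for s in samples:
    # Does the string, used directly as a pattern, match all of itself?
    as_regex = re.fullmatch(s, s) is not None
    # Does the escaped string match all of itself?
    escaped = re.fullmatch(re.escape(s), s) is not None
    results[s] = (as_regex, escaped)
    print(repr(s), as_regex, escaped)

# 'a.b' matches itself, but not exclusively: the dot also accepts other characters.
dot_is_loose = re.fullmatch('a.b', 'aXb') is not None
dot_is_strict = re.fullmatch(re.escape('a.b'), 'aXb') is None
print(dot_is_loose, dot_is_strict)
```

The escaped form should match in every case, while the unescaped form fails for the strings where a metacharacter changes the meaning of the pattern.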

Raw notation doesn't give the desired outcome [duplicate]

This question already has answers here:
Escaping regex string
(4 answers)
Closed 2 years ago.
I'm trying to use Python's raw notation to find a pattern that includes special characters with no success.
When using the 'r' notation to ignore the special characters nothing is found - see the example below:
Problematic Code
import re
pattern = re.compile(r"testing+101#gmail.com")
sentence = '___dsdtesting+101#gmail.comaaa___'
result = re.search(pattern, sentence).group()
print(result)
The above code does not find the pattern and raises:
AttributeError: 'NoneType' object has no attribute 'group'
Working Code
When escaping the '+' with '\' it works as expected:
import re
pattern = re.compile("testing\+101#gmail.com")
sentence = '___dsdtesting+101#gmail.comaaa___'
result = re.search(pattern, sentence).group()
print(result)
The above code will return the desired outcome of "testing+101#gmail.com".
Am I using the raw notation wrong? What's going on?
TO CLARIFY: I am not interested in escaping with the '\', rather I want to use the raw notation.
There are two levels of special characters here — those that are special to Python’s string syntax, and those that are special in regular expressions. Using raw strings takes care of the first group, but not the second group.
The plus sign is special in regexes, so to match the string a+ you need the regex a\+. Because the backslash is special to Python strings, if you do not use raw strings you need to type this as 'a\\+'. Using raw strings lets you type r'a\+'.
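(Because the sequence \+ does not mean anything special to Python, and Python leaves such sequences unchanged, you could actually get away with just 'a\+', although recent Python versions emit a warning for unrecognized escape sequences like this one, so the raw string is the safer choice.)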
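The two levels can be checked directly. This sketch reuses the sentence from the question; the variable names are only illustrative.

```python
import re

sentence = '___dsdtesting+101#gmail.comaaa___'

# Python level: both literals produce the same pattern text (backslash, plus).
same_pattern = ('testing\\+101' == r'testing\+101')

# Regex level, unescaped: '+' quantifies the preceding 'g',
# so the literal '+' in the sentence is never matched.
unescaped = re.search(r'testing+101#gmail.com', sentence)

# Regex level, escaped: '\+' matches a literal plus sign.
# The dot is escaped as well so that it only matches a literal period.
escaped = re.search(r'testing\+101#gmail\.com', sentence)

print(same_pattern)
print(unescaped)
print(escaped.group())
```

The raw prefix saves one layer of backslashes, but the regex-level escape is still required.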

str.split(r'\n') doesn't split a string on a newline character in a raw string literal as expected [duplicate]

This question already has an answer here:
Different way to specify matching new line in Python regex
(1 answer)
Closed 4 years ago.
Suppose I have a string s = 'hi\nhello\nwhatsup' and I want to split it.
If I use s.split('\n'), I get the expected output:
['hi', 'hello', 'whatsup']
However, if I use re.split(r'\n', s), with a raw string literal, I also get the same output:
['hi', 'hello', 'whatsup']
Why does splitting on a raw string literal with re.split() work?
What is this black magic?
\n is both the Python string escape for a newline and the regex escape meaning "match a newline". With a raw string, re.split receives a backslash followed by n and interprets it as the regex escape. With a non-raw string, Python converts \n to an actual newline character before the regex engine sees it, and the engine matches that character literally. Either way, it finds the newline to split on.
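A short sketch of that behaviour, using the string from the question. It also shows that str.split, which does no regex interpretation, does not split on the raw form.

```python
import re

s = 'hi\nhello\nwhatsup'

non_raw = '\n'   # one character: an actual newline
raw = r'\n'      # two characters: backslash and 'n'

# The regex engine matches a newline in both cases.
split_non_raw = re.split(non_raw, s)
split_raw = re.split(raw, s)

# str.split looks for the literal two-character text, which s does not contain.
str_split_raw = s.split(raw)

print(len(non_raw), len(raw))
print(split_non_raw)
print(split_raw)
print(str_split_raw)
```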

Using findall method in a tokenized text, and prefix 'r' [duplicate]

This question already has answers here:
What does the "r" in pythons re.compile(r' pattern flags') mean?
(3 answers)
Closed 5 years ago.
I understand that the 'r' prefix indicates a raw string. So why is the 'r' prefix used in the following example, given that the string contains special regex characters that should not be taken literally?
The 'string' being searched is an nltk Text object; I suppose it has something to do with this? However, I don't understand how that affects the usage of findall.
moby.findall(r"<a> (<.*>) <man>")
In this particular case, r makes no difference, as this string does not contain any sequences which could be misinterpreted. However, it is a good habit to use r when writing regular expressions, to avoid misinterpretation of sequences like \n or \t; with r, they are treated literally, as two characters - backslash followed by a letter; without r, they evaluate to newline and tab, respectively.
The r preceding the string is a string prefix (it is sometimes called a sigil).
For example, '\n' will be treated as a newline character, while r'\n' will be treated as the characters \ followed by n.
But for your regex:
moby.findall(r"<a> (<.*>) <man>")
it doesn't make a difference, but it is always a good idea to write regexes as raw strings to avoid having to escape backslashes.
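For a case where the prefix does matter, consider a pattern that contains \b. This is a sketch with an invented sample sentence, using the standard re module rather than the nltk findall from the question.

```python
import re

text = 'a man and a woman'

# Raw string: the regex engine receives backslash + 'b', a word boundary.
raw_matches = re.findall(r'\bman\b', text)

# Non-raw string: Python first turns '\b' into a backspace character (chr(8)),
# so the pattern looks for backspace, 'man', backspace.
non_raw_matches = re.findall('\bman\b', text)

print(raw_matches)
print(non_raw_matches)
```

The non-raw version fails silently, which is why writing regexes as raw strings by habit is useful.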

using Python to search for keywords in pdf [duplicate]

This question already has answers here:
Searching text in a PDF using Python? [duplicate]
(11 answers)
Closed 8 years ago.
I'm searching for keywords in a pdf file so I'm trying to search for /AA or /Acroform like the following:
import re
l = "/Acroform "
s = "/Acroform is what I'm looking for"
if re.search(r"\b" + l.rstrip() + r"\b", s):
    print("yes")
Why don't I get "yes"? I want the "/" to be part of the keyword I'm looking for, if it exists.
Can anyone help me with this?
\b only matches in between a \w (word) and a \W (non-word) character, or vice versa, or when a \w character is at the edge of a string (start or end).
Your string starts with a / forward slash, a non-word character, so \W. \b will never match between the start of a string and /. Don't use \b here; use an explicit negative look-behind for a word character:
re.search(r'(?<!\w){}\b'.format(re.escape(l)), s)
The (?<!...) syntax defines a negative look-behind; like \b it matches a position in the string. Here it'll only match if the preceding character (if there is any) is not a word character.
I used string formatting instead of concatenation here, and used re.escape() to make sure that any regular expression meta characters in the string you are searching for are properly escaped.
Demo:
>>> import re
>>> l = "/Acroform "
>>> s = "/Acroform is what I'm looking for"
>>> if re.search(r'(?<!\w){}\b'.format(re.escape(l)), s):
...     print('Found')
...
Found
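The same idea can be wrapped in a small helper. This is a sketch, not part of the answer above: the function name contains_keyword is hypothetical, and it swaps the trailing \b for a negative look-ahead (?!\w) so that keywords ending in a non-word character are handled too.

```python
import re

def contains_keyword(keyword, text):
    """Return True if keyword occurs in text without word characters on either side."""
    # (?<!\w): no word character directly before the keyword.
    # (?!\w): no word character directly after it (a variation on the trailing \b).
    # re.escape() neutralises any regex metacharacters in the keyword.
    pattern = r'(?<!\w)' + re.escape(keyword.strip()) + r'(?!\w)'
    return re.search(pattern, text) is not None

print(contains_keyword("/Acroform ", "/Acroform is what I'm looking for"))
print(contains_keyword("/AA", "see /AA."))
print(contains_keyword("/AA", "see /AAB"))
print(contains_keyword("/AA", "x/AA"))
```

The last call is rejected because the keyword is preceded by a word character, which is exactly what the look-behind is meant to exclude.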
