Raw notation doesn't give the desired outcome [duplicate] - python

This question already has answers here:
Escaping regex string
(4 answers)
Closed 2 years ago.
I'm trying to use Python's raw notation to find a pattern that includes special characters with no success.
When using the 'r' notation to ignore the special characters nothing is found - see the example below:
Problematic Code
import re
pattern = re.compile(r"testing+101#gmail.com")
sentence = '___dsdtesting+101#gmail.comaaa___'
result = re.search(pattern, sentence).group()
print(result)
The above code does not find the pattern, so re.search returns None and calling .group() raises
AttributeError: 'NoneType' object has no attribute 'group'
Working Code
When escaping the '+' with '\' it works as expected:
import re
pattern = re.compile("testing\+101#gmail.com")
sentence = '___dsdtesting+101#gmail.comaaa___'
result = re.search(pattern, sentence).group()
print(result)
The above code will return the desired outcome of "testing+101#gmail.com".
Am I using the raw notation wrong? What's going on?
TO CLARIFY: I am not interested in escaping with the '\', rather I want to use the raw notation.

There are two levels of special characters here — those that are special to Python’s string syntax, and those that are special in regular expressions. Using raw strings takes care of the first group, but not the second group.
The plus sign is special in regexes, so to match the string a+ you need the regex a\+. Because the backslash is special to Python strings, if you do not use raw strings you need to type this as 'a\\+'. Using raw strings lets you type r'a\+'.
(Because the sequence \+ does not mean anything special to Python, and Python leaves such sequences unchanged, you could actually get away with just 'a\+', although recent Python versions emit a warning for unrecognized escape sequences like this one.)
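To make the two levels concrete, here is a minimal sketch using the sentence from the question. The variable names are illustrative, and re.escape is offered as one possible way to avoid typing the backslashes by hand, not as the only fix.

```python
import re

sentence = "___dsdtesting+101#gmail.comaaa___"

# A raw string only stops Python from processing backslashes. "+" is still
# a regex quantifier here, so the pattern expects "testin", one or more "g",
# and then "101" directly, which the sentence does not contain.
unescaped = re.search(r"testing+101#gmail.com", sentence)

# Escaping the regex metacharacters by hand inside a raw string.
manual = re.search(r"testing\+101#gmail\.com", sentence)

# Letting re.escape build the pattern from the literal text.
literal = "testing+101#gmail.com"
escaped = re.search(re.escape(literal), sentence)

print(unescaped)         # None
print(manual.group())    # testing+101#gmail.com
print(escaped.group())   # testing+101#gmail.com
```

re.escape is convenient when the text to match comes from a variable, because the pattern never has to be written out as a literal.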

Related

Regex patterns with windows paths in python [duplicate]

This question already has answers here:
Why do backslashes appear twice?
(2 answers)
Closed 7 months ago.
I found a python package on GitHub that doesn't work. It attempts to replace a substring within a url with another string.
string = "filename.txt"
rewrite = "c:\\windows\\system32\\drivers\\hosts"
url = "https://www.example.com/path?parameter=filename.txt"
fullrewrite = re.sub(string, rewrite, url)
The string, rewrite, and url parameters are arbitrary and not hard-coded. I just put them there as an example (this is a path traversal testing library I'm trying to play around with).
When I run this code, I get a KeyError from re, which is expected according to the docs:
If you’re not using a raw string to express the pattern, remember that Python also uses the backslash as an escape sequence in string literals; if the escape sequence isn’t recognized by Python’s parser, the backslash and subsequent character are included in the resulting string. However, if Python would recognize the resulting sequence, the backslash should be repeated twice. This is complicated and hard to understand, so it’s highly recommended that you use raw strings for all but the simplest expressions.
I tried using repr to convert the string into a raw string:
raw = repr(rewrite)[1:-1] # [1:-1] removes extra quotes.
fullrewrite = re.sub(string, raw, url)
But this creates double backslashes in the resulting url: https://www.example.com/path?parameter=c:\\windows\\system32\\drivers\\hosts
My question is how am I supposed to have it replace the key word so that the resulting string is: https://www.example.com/path?parameter=c:\windows\system32\drivers\hosts?
This is my understanding; please correct me if I'm wrong.
You don't get double backslashes, but escaped backslashes. In both re and Python string literals, a single backslash is a special character, so on its own it usually does not stand for a literal backslash. To get one literal backslash, you usually need to escape it with a second one. A double backslash is therefore how a single backslash is written in source code and shown by repr().
If you pass 'c:\\' to print() or write it to a text file, you get c:\.
P.S. Since '\q' is not a special sequence in Python, '\q'=='\\q' returns True.
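One way to get single backslashes into the result, sketched here with the values from the question: pass a function as the replacement, since re.sub does not process backslash escapes in what a function returns. Using re.escape on the search text and falling back to str.replace are suggestions of this sketch, not something the original package does.

```python
import re

string = "filename.txt"
rewrite = "c:\\windows\\system32\\drivers\\hosts"
url = "https://www.example.com/path?parameter=filename.txt"

# A callable replacement is inserted verbatim, so "\w", "\s" and "\d"
# in the path are not treated as replacement-template escapes.
# re.escape keeps the "." in the search text from matching any character.
fullrewrite = re.sub(re.escape(string), lambda match: rewrite, url)

# For a plain substring swap, str.replace avoids regex handling entirely.
same_result = url.replace(string, rewrite)

print(fullrewrite)
```

Both variables end up holding the URL with the path written using single backslashes.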

Regular expression error: unbalanced parenthesis at position n

I have been meaning to extract the month name from the following string with regex and despite the fact that my regex works on a platform like regex101, I can't seem to be able to extract the word "August".
import re
s = "word\anyword\2021\August\202108_filename.csv"
re.findall("\d+\\([[:alpha:]]+)\\\d+", s)
Which results in the following error:
error: unbalanced parenthesis at position 17
I also tried using re.compile, re.escape as per suggestions of the previous posts dealing with the same error but none of them seems to work.
Any help and also a little explanation on why this isn't working is greatly appreciated.
You can use
import re
s = r"word\anyword\2021\August\202108_filename.csv"
m = re.search(r"\d+\\([a-zA-Z]+)\\\d+", s)
if m:
    print(m.group(1))
See the Python demo.
There are three main problems here:
The input string should be the same as the one used at regex101.com, i.e. the Python code must contain literal backslashes, hence the use of raw string literals for both the input text and the regex.
The POSIX character classes are not supported by Python re, so [[:alpha:]]+ should be replaced with some equivalent pattern, say, [A-Za-z]+ or [^\W\d_]+.
Since it seems like you only expect a single match (there is only one August (month) name in the string), you do not need re.findall, you can use re.search. Only use re.findall when you need to extract multiple matches from a string.
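The first two points can be checked with a short sketch. The names path, match and month are illustrative; the two comparisons at the end show what Python does to "\a" and "\202" when the r prefix is missing.

```python
import re

# Raw literal: every backslash in the path stays a literal backslash.
path = r"word\anyword\2021\August\202108_filename.csv"

# Python's re has no POSIX classes; [^\W\d_] matches any letter.
match = re.search(r"\d+\\([^\W\d_]+)\\\d+", path)
month = match.group(1) if match else None
print(month)  # August

# Without the r prefix, Python converts some sequences before re sees them:
print("\a" == "\x07")    # True: "\a" is the BEL control character
print("\202" == "\x82")  # True: "\202" is read as an octal escape
```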
Also, see these posts:
Python regex - r prefix
What does the "r" in pythons re.compile(r' pattern flags') mean?
What exactly do "u" and "r" string flags do, and what are raw string literals?

Using findall method in a tokenized text, and prefix 'r' [duplicate]

This question already has answers here:
What does the "r" in pythons re.compile(r' pattern flags') mean?
(3 answers)
Closed 5 years ago.
I understand that the 'r' prefix indicates a raw string. Why is the 'r' prefix used in the following example, given that the string contains special regex characters that should not be taken literally?
The 'string' being searched is an nltk Text object, and I suppose that has something to do with it. However, I don't understand how it affects the usage of findall.
moby.findall(r"<a> (<.*>) <man>")
In this particular case, r makes no difference, as this string does not contain any sequences which could be misinterpreted. However, it is a good habit to use r when writing regular expressions, to avoid misinterpretation of sequences like \n or \t; with r, they are treated literally, as two characters - backslash followed by a letter; without r, they evaluate to newline and tab, respectively.
The r preceding the string is a string prefix; it marks the literal as a raw string.
For example, '\n' will be treated as a newline character, while r'\n' will be treated as the characters \ followed by n.
But for your regex:
moby.findall(r"<a> (<.*>) <man>")
it doesn't make a difference, but it is always a good idea to write regexes as raw strings so that you don't have to escape backslashes.
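A quick check of both statements, as a minimal sketch (the variable names are illustrative):

```python
newline = "\n"   # one character: a line feed
raw = r"\n"      # two characters: a backslash and the letter n
print(len(newline), len(raw))  # 1 2

# With no backslashes in the literal, the prefix changes nothing.
same = r"<a> (<.*>) <man>" == "<a> (<.*>) <man>"
print(same)  # True
```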

regex syntax in python, the r before the opening quote [duplicate]

This question already has answers here:
Python - Raw String Literals [duplicate]
(2 answers)
Closed 7 years ago.
This is a line of regex from a Python thing I'm writing:
m = re.match(r"{(.+)}", self.label)
As far as I can tell, it's working fine.
Anyway, my question is about the r character before the first double quote. I've never really questioned it. But why is it there? What is its purpose?
The r before a string literal tells Python not to do any \ escaping on the string. For instance:
>>> print('a\nb')
a
b
>>> print(r'a\nb')
a\nb
>>>
The reason r-prefixed strings are often used with regular expressions is because regular expressions often use a lot of \'s. For instance, to use a simple example, compare the regular expression '\\d+' versus r'\d+'. They're actually the same string, just represented in different ways. With the r syntax, you don't have to escape the \'s that are used in the regular expression syntax. Now imagine having a lot of \'s in your regular expression; it's much cleaner to use the r syntax.
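The claim that the two spellings are the same string is easy to verify. This is a small sketch; the sample text passed to re.findall is made up for illustration.

```python
import re

escaped = "\\d+"   # normal literal: the backslash is doubled
raw = r"\d+"       # raw literal: the backslash is written once

print(escaped == raw)                # True
print(re.findall(raw, "a1b22c333"))  # ['1', '22', '333']
```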
"String literals may optionally be prefixed with a letter 'r' or 'R'; such strings are called raw strings and use different rules for interpreting backslash escape sequences."
https://docs.python.org/2/reference/lexical_analysis.html#string-literals

Python - Should I be using string prefix r when looking for a period (full stop or .) using regex?

I would like to know the reason I get the same result when using string prefix "r" or not when looking for a period (full stop) using python regex.
After reading a number of sources (links below) multiple times and experimenting in code, only to get the same result (again, see below), I am still unsure of:
What is the difference when using string prefix "r" and not using string prefix "r", when looking for a period using regex?
Which way is considered the correct way of finding a period in a string using python regex with string prefix "r" or without string prefix "r"?
re.compile("\.").sub("!", "blah.")
'blah!'
re.compile(r"\.").sub("!", "blah.")
'blah!'
re.compile(r"\.").search("blah.").group()
'.'
re.compile("\.").search("blah.").group()
'.'
Sources I have looked at:
Python docs: string literals
http://docs.python.org/2/reference/lexical_analysis.html#string-literals
Regular expression to replace "escaped" characters with their originals
Python regex - r prefix
r prefix is for raw strings
http://forums.udacity.com/questions/7000217/r-prefix-is-for-raw-strings
The raw string notation is just that, a notation to specify a string value. The notation results in different string values when it comes to backslash escapes recognized by the normal string notation. Because regular expressions also attach meaning to the backslash character, raw string notation is quite handy as it avoids having to use excessive escaping.
Quoting from the Python Regular Expression HOWTO:
The solution is to use Python’s raw string notation for regular expressions; backslashes are not handled in any special way in a string literal prefixed with 'r', so r"\n" is a two-character string containing '\' and 'n', while "\n" is a one-character string containing a newline. Regular expressions will often be written in Python code using this raw string notation.
The \. combination has no special meaning in regular Python strings, so there is no difference at all between the result of '\.' and r'\.'; you can use either, although newer Python versions warn about unrecognized escapes such as '\.' in normal literals:
>>> len('\.')
2
>>> len(r'\.')
2
Raw strings only make a difference when the backslash + other characters do have special meaning in regular string notation:
>>> '\b'
'\x08'
>>> r'\b'
'\\b'
>>> len('\b')
1
>>> len(r'\b')
2
The \b combination has special meaning; in a regular string it is interpreted as the backspace character. But regular expressions see \b as a word boundary anchor, so you'd have to use \\b in your Python string every time you wanted to use this in a regular expression. Using r'\b' instead makes it much easier to read and write your expressions.
The regular expression functions are passed string values, that is, the result of Python interpreting your string literal. The functions do not know whether you used raw or normal string literal syntax.
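The effect on an actual search can be seen in a short sketch. The sample text and variable names are illustrative.

```python
import re

text = "cat concat cats"

# r"\bcat\b" reaches the regex engine as backslash-b, the word-boundary anchor.
with_raw = re.findall(r"\bcat\b", text)
print(with_raw)     # ['cat']

# "\bcat\b" is converted by Python first, so the pattern contains backspace
# characters (\x08) and finds nothing in ordinary text.
without_raw = re.findall("\bcat\b", text)
print(without_raw)  # []

# Doubling the backslashes yields the same string as the raw literal.
print("\\bcat\\b" == r"\bcat\b")  # True
```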
