Python Regex escape operator \ in substitutions & raw strings

Python Regex escape operator \ in substitutions & raw strings - python

I don't understand the logic in the functioning of the scape operator \ in python regex together with r' of raw strings.
Some help is appreciated.
code:
import re
text=' esto .es 10 . er - 12 .23 with [ and.Other ] here is more ; puntuation'
print('text0=',text)
text1 = re.sub(r'(\s+)([;:\.\-])', r'\2', text)
text2 = re.sub(r'\s+\.', '\.', text)
text3 = re.sub(r'\s+\.', r'\.', text)
print('text1=',text1)
print('text2=',text2)
print('text3=',text3)
The theory says:
backslash character ('\') to indicate special forms or to allow special characters to be used without invoking their special meaning.
And as far as the link provided at the end of this question explains, r' represents a raw string, i.e. there is no special meaning for symbols, it is as it stays.
so in the above regex I would expect text2 and text3 to be different, since the substitution text is '.' in text 2, i.e. a period, whereas (in principle) the substitution text in text 3 is r'.' which is a raw string, i.e. the string as it is should appear, backslash and period. But they result in the same:
The result is:
text0= esto .es 10 . er - 12 .23 with [ and.Other ] here is more ; puntuation
text1= esto.es 10. er- 12.23 with [ and.Other ] here is more; puntuation
text2= esto\.es 10\. er - 12\.23 with [ and.Other ] here is more ; puntuation
text3= esto\.es 10\. er - 12\.23 with [ and.Other ] here is more ; puntuation
#text2=text3 but substitutions are not the same r'\.' vs '\.'
It looks to me that the r' does not work the same way in substitution part, nor the backslash. On the other hand my intuition tells me I am missing something here.
EDIT 1:
Following #Wiktor Stribiżew comment.
He pointed out that (following his link):
import re
print(re.sub(r'(.)(.)(.)(.)(.)(.)', 'a\6b', '123456'))
print(re.sub(r'(.)(.)(.)(.)(.)(.)', r'a\6b', '123456'))
# in my example the substitutions were not the same and the result were equal
# here indeed r' changes the results
which gives:
ab
a6b
that puzzles me even more.
Note:
I read this stack overflow question about raw strings which is super complete. Nevertheless it does not speak about substitutions

First and foremost,
replacement patterns ≠ regular expression patterns
We use a regex pattern to search for matches, we use replacement patterns to replace matches found with regex.
NOTE: The only special character in a substitution pattern is a backslash, \. Only the backslash must be doubled.
Replacement pattern syntax in Python
The re.sub docs are confusing as they mention both string escape sequences that can be used in replacement patterns (like \n, \r) and regex escape sequences (\6) and those that can be used as both regex and string escape sequences (\&).
I am using the term regex escape sequence to denote an escape sequence consisting of a literal backslash + a character, that is, '\\X' or r'\X', and a string escape sequence to denote a sequence of \ and a char or some sequence that together form a valid string escape sequence. They are only recognized in regular string literals. In raw string literals, you can only escape " (and that is the reason why you can't end a raw string literal with \", but the backlash is still part of the string then).
So, in a replacement pattern, you may use backreferences:
re.sub(r'\D(\d)\D', r'\1', 'a1b') # => 1
re.sub(r'\D(\d)\D', '\\1', 'a1b') # => 1
re.sub(r'\D(\d)\D', '\g<1>', 'a1b') # => 1
re.sub(r'\D(\d)\D', r'\g<1>', 'a1b') # => 1
You may see that r'\1' and '\\1' is the same replacement pattern, \1. If you use '\1', it will get parse as a string escape sequence, a character with octal value 001. If you forget to use r prefix with the unambiguous backreference, there is no problem because \g is not a valid string escape sequence, and there, \ escape character remains in the string. Read on the docs I linked to:
Unlike Standard C, all unrecognized escape sequences are left in the string unchanged, i.e., the backslash is left in the result.
So, when you pass '\.' as a replacement string, you actually send \. two-char combination as the replacement string, and that is why you get \. in the result.
\ is a special character in Python replacement pattern
If you use re.sub(r'\s+\.', r'\\.', text), you will get the same result as in text2 and text3 cases, see this demo.
That happens because \\, two literal backslashes, denote a single backslash in the replacement pattern. If you have no Group 2 in your regex pattern, but pass r'\2' in the replacement to actually replace with \ and 2 char combination, you would get an error.
Thus, when you have dynamic, user-defined replacement patterns you need to double all backslashes in the replacement patterns that are meant to be passed as literal strings:
re.sub(some_regex, some_replacement.replace('\\', '\\\\'), input_string)

A simple way to work around all these string escaping issues is to use a function/lambda as the repl argument, instead of a string. For example:
output = re.sub(
pattern=find_pattern,
repl=lambda _: replacement,
string=input,
)
The replacement string won't be parsed at all, just substituted in place of the match.

From the doc (my emphasis):
re.sub(pattern, repl, string, count=0, flags=0)
Return the string
obtained by replacing the leftmost non-overlapping occurrences of
pattern in string by the replacement repl. If the pattern isn’t found,
string is returned unchanged. repl can be a string or a function; if
it is a string, any backslash escapes in it are processed. That is, \n
is converted to a single newline character, \r is converted to a
carriage return, and so forth. Unknown escapes of ASCII letters are
reserved for future use and treated as errors. Other unknown escapes
such as \& are left alone. Backreferences, such as \6, are replaced
with the substring matched by group 6 in the pattern.
The repl argument is not just plain text. It can also be the name of a function or refer to a position in a group (e.g. \g<quote>, \g<1>, \1).
Also, from here:
Unlike Standard C, all unrecognized escape sequences are left in the
string unchanged, i.e., the backslash is left in the result.
Since . is not a special escape character, '\.' is the same as r'\.\.

Related

How to search for a character string that is an escape sequence with re.search

I wrote the code to check if the escape sequence "\n" is included in the string. However, it behaved unexpectedly, so I would like to know the reason. Why did I get the result of case2?
Case 1
The code below worked. Since r"\n" (reg1) is a string consisting of two characters, '\' and 'n', I think it is correct to search for and match the target string "\n".
import re
reg1 = r"\n"
print (re.search (reg1, "\n"))
#output: <re.Match object; span = (0, 1), match ='\n'>
Case 2
The code below expected the output to be None, but it didn't. Since "\n" (reg2), which is the line feed of the escape sequence, was used as the pattern, and "\n" consisting of two characters, '\' and 'n', was used as the target string, it was considered that they did not match. However, it actually matched.
import re
reg2 = "\n"
print (re.search (reg2, "\n"))
#output: <re.Match object; span = (0, 1), match ='\n'>

You are correct when it comes to the contents of the strings used for the regexes, but not the targets. The statement:
"\n" consisting of two characters, '\' and 'n', was used as the target string,
is incorrect. The interpretation of a string is not context-sensitive; r"\n" is always 2 characters, and "\n" is always 1. This is covered in the Python Regular Expression HOWTO:
r"\n" is a two-character string containing '\' and 'n', while "\n" is a one-character string containing a newline.
This is more easily demonstrated with a non-control character, as a literal "\n" would be written:
Did you catch that? Let's use "þ" (thorn) instead.
Case 1:
re.search(r"\u00FE", "\u00FE")
r"\u00FE" is a string with 6 characters, which compiles to the regex /\u00FE/. This is interpreted as an escape sequence by the regex library itself that matches a thorn character.
"\u00FE" is interpreted by python, producing the string "þ".
/\u00FE/ matches "þ".
Case 2:
re.search("\u00FE", "\u00FE")
"\u00FE" is a string with 1 character, "þ", which compiles to the regex /⁠þ⁠/.
/þ/ matches "þ".
Result: both regexes match. The only difference is that the regex contains an escape sequence in case 1 and a character literal in case 2.
What you seem to have in mind is a raw string for the target:
re.search(r"\u00FE", r"\u00FE")
re.search("\u00FE", r"\u00FE")
Neither of these matches, as neither of the targets contains a thorn character.
If you wanted to match an escape sequence, the escape character must be escaped within the regex:
re.search(r"\\u00FE", r"\u00FE")
re.search("\\\\u00FE", r"\u00FE")
Either of those patterns will result in the regex /\\u00FE/, which matches a string containing the given escape sequence.

Python pattern matching with 'r' prefix [duplicate]

I don't understand the logic in the functioning of the scape operator \ in python regex together with r' of raw strings.
Some help is appreciated.
code:
import re
text=' esto .es 10 . er - 12 .23 with [ and.Other ] here is more ; puntuation'
print('text0=',text)
text1 = re.sub(r'(\s+)([;:\.\-])', r'\2', text)
text2 = re.sub(r'\s+\.', '\.', text)
text3 = re.sub(r'\s+\.', r'\.', text)
print('text1=',text1)
print('text2=',text2)
print('text3=',text3)
The theory says:
backslash character ('\') to indicate special forms or to allow special characters to be used without invoking their special meaning.
And as far as the link provided at the end of this question explains, r' represents a raw string, i.e. there is no special meaning for symbols, it is as it stays.
so in the above regex I would expect text2 and text3 to be different, since the substitution text is '.' in text 2, i.e. a period, whereas (in principle) the substitution text in text 3 is r'.' which is a raw string, i.e. the string as it is should appear, backslash and period. But they result in the same:
The result is:
text0= esto .es 10 . er - 12 .23 with [ and.Other ] here is more ; puntuation
text1= esto.es 10. er- 12.23 with [ and.Other ] here is more; puntuation
text2= esto\.es 10\. er - 12\.23 with [ and.Other ] here is more ; puntuation
text3= esto\.es 10\. er - 12\.23 with [ and.Other ] here is more ; puntuation
#text2=text3 but substitutions are not the same r'\.' vs '\.'
It looks to me that the r' does not work the same way in substitution part, nor the backslash. On the other hand my intuition tells me I am missing something here.
EDIT 1:
Following #Wiktor Stribiżew comment.
He pointed out that (following his link):
import re
print(re.sub(r'(.)(.)(.)(.)(.)(.)', 'a\6b', '123456'))
print(re.sub(r'(.)(.)(.)(.)(.)(.)', r'a\6b', '123456'))
# in my example the substitutions were not the same and the result were equal
# here indeed r' changes the results
which gives:
ab
a6b
that puzzles me even more.
Note:
I read this stack overflow question about raw strings which is super complete. Nevertheless it does not speak about substitutions

First and foremost,
replacement patterns ≠ regular expression patterns
We use a regex pattern to search for matches, we use replacement patterns to replace matches found with regex.
NOTE: The only special character in a substitution pattern is a backslash, \. Only the backslash must be doubled.
Replacement pattern syntax in Python
The re.sub docs are confusing as they mention both string escape sequences that can be used in replacement patterns (like \n, \r) and regex escape sequences (\6) and those that can be used as both regex and string escape sequences (\&).
I am using the term regex escape sequence to denote an escape sequence consisting of a literal backslash + a character, that is, '\\X' or r'\X', and a string escape sequence to denote a sequence of \ and a char or some sequence that together form a valid string escape sequence. They are only recognized in regular string literals. In raw string literals, you can only escape " (and that is the reason why you can't end a raw string literal with \", but the backlash is still part of the string then).
So, in a replacement pattern, you may use backreferences:
re.sub(r'\D(\d)\D', r'\1', 'a1b') # => 1
re.sub(r'\D(\d)\D', '\\1', 'a1b') # => 1
re.sub(r'\D(\d)\D', '\g<1>', 'a1b') # => 1
re.sub(r'\D(\d)\D', r'\g<1>', 'a1b') # => 1
You may see that r'\1' and '\\1' is the same replacement pattern, \1. If you use '\1', it will get parse as a string escape sequence, a character with octal value 001. If you forget to use r prefix with the unambiguous backreference, there is no problem because \g is not a valid string escape sequence, and there, \ escape character remains in the string. Read on the docs I linked to:
Unlike Standard C, all unrecognized escape sequences are left in the string unchanged, i.e., the backslash is left in the result.
So, when you pass '\.' as a replacement string, you actually send \. two-char combination as the replacement string, and that is why you get \. in the result.
\ is a special character in Python replacement pattern
If you use re.sub(r'\s+\.', r'\\.', text), you will get the same result as in text2 and text3 cases, see this demo.
That happens because \\, two literal backslashes, denote a single backslash in the replacement pattern. If you have no Group 2 in your regex pattern, but pass r'\2' in the replacement to actually replace with \ and 2 char combination, you would get an error.
Thus, when you have dynamic, user-defined replacement patterns you need to double all backslashes in the replacement patterns that are meant to be passed as literal strings:
re.sub(some_regex, some_replacement.replace('\\', '\\\\'), input_string)

A simple way to work around all these string escaping issues is to use a function/lambda as the repl argument, instead of a string. For example:
output = re.sub(
pattern=find_pattern,
repl=lambda _: replacement,
string=input,
)
The replacement string won't be parsed at all, just substituted in place of the match.

From the doc (my emphasis):
re.sub(pattern, repl, string, count=0, flags=0)
Return the string
obtained by replacing the leftmost non-overlapping occurrences of
pattern in string by the replacement repl. If the pattern isn’t found,
string is returned unchanged. repl can be a string or a function; if
it is a string, any backslash escapes in it are processed. That is, \n
is converted to a single newline character, \r is converted to a
carriage return, and so forth. Unknown escapes of ASCII letters are
reserved for future use and treated as errors. Other unknown escapes
such as \& are left alone. Backreferences, such as \6, are replaced
with the substring matched by group 6 in the pattern.
The repl argument is not just plain text. It can also be the name of a function or refer to a position in a group (e.g. \g<quote>, \g<1>, \1).
Also, from here:
Unlike Standard C, all unrecognized escape sequences are left in the
string unchanged, i.e., the backslash is left in the result.
Since . is not a special escape character, '\.' is the same as r'\.\.

Regex to check if it is exactly one single word

I am basically trying to match string pattern(wildcard match)
Please carefully look at this -
*(star) - means exactly one word .
This is not a regex pattern...it is a convention.
So,if there patterns like -
*.key - '.key.' is preceded by exactly one word(word containing no dots)
*.key.* - '.key.' is preceded and succeeded by exactly one word having no dots
key.* - '.key' preceeds exactly one word .
So,
"door.key" matches "*.key"
"brown.door.key" doesn't match "*.key".
"brown.key.door" matches "*.key.*"
but "brown.iron.key.door" doesn't match "*.key.*"
So, when I encounter a '*' in pattern, I have replace it with a regex so that it means it is exactly one word.(a-zA-z0-9_).Can anyone please help me do this in python?

To convert your pattern to a regexp, you first need to make sure each character is interpreted literally and not as a special character. We can do that by inserting a \ in front of any re special character. Those characters can be obtained through sre_parse.SPECIAL_CHARS.
Since you have a special meaning for *, we do not want to escape that one but instead replace it by \w+.
Code
import sre_parse
def convert_to_regexp(pattern):
special_characters = set(sre_parse.SPECIAL_CHARS)
special_characters.remove('*')
safe_pattern = ''.join(['\\' + c if c in special_characters else c for c in pattern ])
return safe_pattern.replace('*', '\\w+')
Example
import re
pattern = '*.key'
r_pattern = convert_to_regexp(pattern) # '\\w+\\.key'
re.match(r_pattern, 'door.key') # Match
re.match(r_pattern, 'brown.door.key') # None
And here is an example with escaped special characters
pattern = '*.(key)'
r_pattern = convert_to_regexp(pattern) # '\\w+\\.\\(key\\)'
re.match(r_pattern, 'door.(key)') # Match
re.match(r_pattern, 'brown.door.(key)') # None
Sidenote
If you intend looking for the output pattern with re.search or re.findall, you might want to wrap the re pattern between \b boundary characters.

The conversion rules you are looking for go like this:
* is a word, thus: \w+
. is a literal dot: \.
key is and stays a literal string
plus, your samples indicate you are going to match whole strings, which in turn means your pattern should match from the ^ beginning to the $ end of the string.
Therefore, *.key becomes ^\w+\.key$, *.key.* becomes ^\w+\.key\.\w+$, and so forth..
Online Demo: play with it!

^ means a string that starts with the given set of characters in a regular expression.
$ means a string that ends with the given set of characters in a regular expression.
\s means a whitespace character.
\S means a non-whitespace character.
+ means 1 or more characters matching given condition.
Now, you want to match just a single word meaning a string of characters that start and end with non-spaced string. So, the required regular expression is:
^\S+$

You could do it with a combination of "any characters that aren't period" and the start/end anchors.
*.key would be ^[^.]*\.key, and *.key.* would be ^[^.]*\.key\.[^.]*$
EDIT: As tripleee said, [^.]*, which matches "any number of characters that aren't periods," would allow whitespace characters (which of course aren't periods), so using \w+, "any number of 'word characters'" like the other answers is better.

How can I find a match and update it with RegEx?

I have a string as
a = "hello i am stackoverflow.com user +-"
Now I want to convert the escape characters in the string except the quotation marks and white space. So my expected output is :
a = "hello i am stackoverflow\.com user \+\-"
What I did so far is find all the special characters in a string except whitespace and double quote using
re.findall(r'[^\w" ]',a)
Now, once I found all the required special characters I want to update the string. I even tried re.sub but it replaces the special characters. Is there anyway I can do it?

Use re.escape.
>>> a = "hello i am stackoverflow.com user +-"
>>> print(re.sub(r'\\(?=[\s"])', r'', re.escape(a)))
hello i am stackoverflow\.com user \+\-
re.escape(string)
Return string with all non-alphanumerics backslashed; this is useful if you want to match an arbitrary literal string that may have regular expression metacharacters in it.
r'\\(?=[\s"])' matches all the backslashes which exists just before to space or double quotes. Replacing the matched backslashes with an empty string will give you the desired output.
OR
>>> a = 'hello i am stackoverflow.com user "+-'
>>> print(re.sub(r'((?![\s"])\W)', r'\\\1', a))
hello i am stackoverflow\.com user "\+\-
((?![\s"])\W) captures all the non-word characters but not of space or double quotes. Replacing the matched characters with backslash + chars inside group index 1 will give you the desired output.

It seems like you could use backreferences with re.sub to achieve what your desired output:
import re
a = "hello i am stackoverflow.com user +-"
print re.sub(r'([^\w" ])', r'\\\1', a) # hello i am stackoverflow\.com user \+\-
The replacement pattern r'\\\1' is just \\ which means a literal backslash, followed \1 which means capture group 1, the pattern captured in the parentheses in the first argument.
In other words, it will escape everything except:
alphanumeric characters
underscore
double quotes
space

Regex for match parentheses in Python

I have a list of fasta sequences, each of which look like this:
>>> sequence_list[0]
'gi|13195623|ref|NM_024197.1| Mus musculus NADH dehydrogenase (ubiquinone) 1 alp
ha subcomplex 10 (Ndufa10), mRNAGCCGGCGCAGACGGCGAAGTCATGGCCTTGAGGTTGCTGAGACTCGTC
CCGGCGTCGGCTCCCGCGCGCGGCCTCGCGGCCGGAGCCCAGCGCGTGGG (etc)
I'd like to be able to extract the gene names from each of the fasta entries in my list, but I'm having difficulty finding the right regular expression. I thought this one would work: "^/(.+/),$". Start with a parentheses, then any number of any character, then end with a parentheses followed by a comma. Unfortunately: this returns None:
test = re.search(r"^/(.+/),$", sequence_list[0])
print(test)
Can someone point out the error in this regex?

Without any capturing groups,
>>> import re
>>> str = """
... gi|13195623|ref|NM_024197.1| Mus musculus NADH dehydrogenase (ubiquinone) 1 alp
... ha subcomplex 10 (Ndufa10), mRNAGCCGGCGCAGACGGCGAAGTCATGGCCTTGAGGTTGCTGAGACTCGTC
... CCGGCGTCGGCTCCCGCGCGCGGCCTCGCGGCCGGAGCCCAGCGCGTGGG (etc)"""
>>> m = re.findall(r'(?<=\().*?(?=\),)', str)
>>> m
['Ndufa10']
It matches only the words which are inside the parenthesis only when the closing bracket is followed by a comma.
DEMO
Explanation:
(?<=\() In regex (?<=pattern) is called a lookbehind. It actually looks after a string which matches the pattern inside lookbehind . In our case the pattern inside the lookbehind is \( means a literal (.
.*?(?=\),) It matches any character zero or more times. ? after the * makes the match reluctant. So it does an shortest match. And the characters in which the regex engine is going to match must be followed by ),

you need to escape parenthesis:
>>> re.findall(r'\([^)]*\),', txt)
['(Ndufa10),']

Can someone point out the error in this regex? r"^/(.+/),$"
regex escape character is \ not / (do not confuse with python escape character which is also \, but is not needed when using raw strings)
=> r"^\(.+\),$"
^ and $ match start/end of the input string, not what you want to output
=> r"\(.+\),"
you need to match "any" characters up to 1st occurence of ), not to the last one, so you need lazy operator +?
=> r"\(.+?\),"
in case gene names could not contain ) character, you can use a faster regex that avoids backtracking
=> r"\([^)]+\),"

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.