Regexp to literally interpret \t as \t and not tab - python

I'm trying to match a sequence of text with backslashes in it, like a Windows path.
Now, when I match with a regexp in Python, it finds the match, but the module interprets every backslash followed by a valid escape character (e.g. t) as an escape sequence, which is not what I want.
How do I get it not to do that?
Thanks
/m
EDIT:
Well, I missed that the regexp that matches the text containing the backslash is a (.*). I've tried the raw notation (exemplified in the answers), but it does not help in my situation. Or I'm doing it wrong.
EDIT2: Did it wrong. Thanks guys/girls!

Use double backslashes with r like this
>>> re.match(r"\\t", r"\t")
<_sre.SRE_Match object at 0xb7ce5d78>
From python docs:
When one wants to match a literal
backslash, it must be escaped in the
regular expression. With raw string
notation, this means r"\\". Without
raw string notation, one must use
"\\\\", making the following lines of
code functionally identical:
>>> re.match(r"\\", r"\\")
<_sre.SRE_Match object at ...>
>>> re.match("\\\\", r"\\")
<_sre.SRE_Match object at ...>

Always use the r prefix when defining your regex. This tells Python to treat the string literal as raw, so it leaves backslashes alone instead of processing escape sequences such as \t.
regex = r'\t'
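Note that a raw string only switches off Python's own escape processing; the regex engine still reads backslash + t as a tab. A minimal sketch of the difference, assuming a made-up sample value `path` that is not from the question:

```python
import re

# Hypothetical sample text: a Windows-style path containing a literal backslash + "t".
path = r"C:\temp\notes.txt"

# r"\t" reaches the regex engine as backslash + t, which the engine reads as TAB.
tab_match = re.search(r"\t", path)

# r"\\t" reaches the engine as an escaped backslash followed by "t".
literal_match = re.search(r"\\t", path)

# re.escape builds a pattern that matches its argument character for character.
escaped_match = re.search(re.escape(r"\temp"), path)

print(tab_match)               # None: the path holds no tab character
print(literal_match.group())   # \t
print(escaped_match.group())   # \temp
```

So for text that should be matched exactly as typed, either double the backslash inside the raw pattern or let re.escape do it.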

Related

force re.search to include # and $

I am trying to get a substring between two markers using re in Python, for example:
import re
test_str = "#$ -N model_simulation 2022"
# these two lines work
# the output is: model_simulation
print(re.search("-N(.*)2022",test_str).group(1))
print(re.search(" -N(.*)2022",test_str).group(1))
# these two lines give the error: 'NoneType' object has no attribute 'group'
print(re.search("$ -N(.*)2022",test_str).group(1))
print(re.search("#$ -N(.*)2022",test_str).group(1))
I read the documentation of re here. It says that "#" is intentionally ignored so that the outputs look neater.
But in my case, I do need to include "#" and "$". I need them to identify the part of the string that I want, because the "-N" is not unique in my entire text string for real work.
Is there a way to force re to include those? Or is there a different way without using re?
Thanks.
You can escape both with \ (in a raw string, so that Python leaves the backslashes alone), for example,
print(re.search(r"\#\$ -N(.*)2022",test_str).group(1))
# output model_simulation
You can get rid of the special meaning by using the backslash prefix: \$. This way, you can match the dollar symbol in a given string.
# add backslash before # and $
# the output is: model_simulation
print(re.search(r"\$ -N(.*)2022",test_str).group(1))
print(re.search(r"\#\$ -N(.*)2022",test_str).group(1))
In regular expressions, $ signals the end of the string. So 'foo' would match foo anywhere in the string, but 'foo$' only matches foo if it appears at the end. To solve this, you need to escape it by prefixing it with a backslash. That way it will match a literal $ character
# is only the start of a comment in verbose mode using re.VERBOSE (which also ignores spaces), otherwise it just matches a literal #.
In general, it is also good practice to use raw string literals for regular expressions (r'foo'), which means Python will let backslashes alone so it doesn't conflict with regular expressions (that way you don't have to type \\\\ to match a single backslash \).
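The points about $, # and raw strings can be checked with a short sketch that reuses test_str from the question. The re.VERBOSE variant is only there to show when # and spaces do need escaping:

```python
import re

test_str = "#$ -N model_simulation 2022"

# Unescaped "$" is an end-of-string anchor, so this pattern can never match here.
anchored = re.search(r"$ -N(.*)2022", test_str)

# Escaped "\$" matches a literal dollar sign; "#" needs no escape in normal mode.
literal = re.search(r"#\$ -N(.*)2022", test_str)

# Under re.VERBOSE, whitespace is ignored and "#" starts a comment,
# so both have to be escaped to be matched literally.
verbose = re.search(r"\#\$\ -N(.*)2022", test_str, re.VERBOSE)

print(anchored)                  # None
print(literal.group(1).strip())  # model_simulation
print(verbose.group(1).strip())  # model_simulation
```

The .strip() calls drop the spaces that (.*) captures on either side of the name.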
Instead of re.search, it looks like you actually want re.fullmatch, which matches only if the whole string matches.
So I would write your code like this:
print(re.search(r"\$ -N(.*)2022", test_str).group(1)) # This one would not work with fullmatch, because it doesn't match at the start
print(re.fullmatch(r"#\$ -N(.*)2022", test_str).group(1))
In a comment you mentioned that the string you need to match changes all the time. In that case, re.escape may prove useful.
Example:
prefix = '#$ -N'
postfix = '2022'
print(re.fullmatch(re.escape(prefix) + '(.*)' + re.escape(postfix), test_str).group(1))
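If the markers change often, the same idea could be wrapped in a small helper. This is only a sketch: the function name `between`, the non-greedy (.*?) and the whitespace stripping are my own choices, not part of the answer above.

```python
import re

def between(text, prefix, postfix):
    """Return the text between two literal markers, or None if they are absent."""
    # re.escape neutralises characters such as "$" and "#" in the markers.
    pattern = re.escape(prefix) + r"(.*?)" + re.escape(postfix)
    match = re.search(pattern, text)
    return match.group(1).strip() if match else None

test_str = "#$ -N model_simulation 2022"
print(between(test_str, "#$ -N", "2022"))   # model_simulation
print(between(test_str, "#$ -X", "2022"))   # None
```

Returning None instead of calling .group() on a failed match avoids the 'NoneType' error from the question.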

Python match text successfully even when there are 1, 2 and 3 backslash at front of the same regex pattern [duplicate]

My current understanding of the python 3.4 regex library from the language reference does not seem to match up with my experiment results of the module.
My current understanding
The regular expression engine can be thought of as a separate entity with its own programming language that it understands (regex). It just happens to live inside python, among a variety of other languages. As such, python must pass (regex) pattern/code to this independent interpreter, if you will.
For clarity reasons, the following text will use the notion of logical length - which is supposed to represent how long the given string logically is. For example, the special character carriage return \r will have len=1 since it is a single character. However, the 2 distinct characters (backslash followed by an r) \r will have len=2.
Step 1) Let's say we want to match a carriage return \r len=1 in some text.
Step 2) We need to feed the pattern \r len=2 (2 distinct characters) to the regular expression engine.
Step 3) The regular expression engine receives \r len=2 and interprets the pattern as: match special character carriage return \r len=1.
Step 4) It goes ahead and does the magic.
The problem is that the backslash character \ itself is used by the python interpreter as something special - a character meant to escape other stuff (like quotes).
So when we are coding in python and need to express the idea that we need to send the pattern \r len=2 to the internal regular expression interpreter, we must type pattern = '\\r' or alternatively pattern = r'\r' to express \r len=2.
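The "logical length" idea can be checked directly with len(). A minimal sketch (the variable names are only for illustration):

```python
cr = '\r'          # one character: carriage return
escaped = '\\r'    # two characters: backslash, r
raw = r'\r'        # the same two characters, written as a raw string

print(len(cr), len(escaped), len(raw))   # 1 2 2
print(escaped == raw)                    # True
print(list(raw))                         # ['\\', 'r']
```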
And everything is well... until
I try a couple of experiments involving re.escape
Summary of questions
Point 1) Please confirm/modify my current understanding of the regex engine.
Point 2) Why are these supposedly non-textbook patterns matching?
Point 3) What on earth is going on with \\\r from re.escape, and the whole "we have the same string lengths, but we compared unequal, but we ALSO all worked the same in matching a carriage return in the previous re.search test".
You need to understand that each time you write a pattern, it is first interpreted as a Python string, and only then read and interpreted a second time by the regex engine.
Let's describe what happens:
>>> s='\r'
s contains the character CR.
>>> re.match('\r', s)
<_sre.SRE_Match object; span=(0, 1), match='\r'>
Here the string '\r' is a string that contains CR, so a literal CR is given to the regex engine.
>>> re.match('\\r', s)
<_sre.SRE_Match object; span=(0, 1), match='\r'>
The string is now a literal backslash and a literal r, the regex engine receives these two characters and since \r is a regex escape sequence that means a CR character too, you obtain a match too.
>>> re.match('\\\r', s)
<_sre.SRE_Match object; span=(0, 1), match='\r'>
The string contains a literal backslash and a literal CR. The regex engine receives \ and CR, and since \CR isn't a known regex escape sequence, the backslash just escapes the CR, which is matched literally, so you obtain a match.
Note that for the regex engine, a literal backslash is the escape sequence \\ (so in a pattern string r'\\' or '\\\\')
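The three cases can be put side by side, together with what re.escape produces for a CR. This is a sketch of the behaviour described above; the dictionary labels are made up for readability:

```python
import re

s = '\r'
patterns = {
    "literal CR": '\r',       # 1 character: CR
    "regex escape": '\\r',    # 2 characters: backslash, r
    "escaped CR": '\\\r',     # 2 characters: backslash, CR
}

for name, pattern in patterns.items():
    # Every pattern matches the single CR in s, although the strings differ.
    print(name, len(pattern), bool(re.match(pattern, s)))

# re.escape puts a backslash in front of the CR, giving the third form.
print(re.escape(s) == '\\\r')    # True
print('\\r' == '\\\r')           # False: same length, different second character
```

This is why two patterns of equal length can compare unequal and still both match a carriage return.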

How does re.search match raw strings?

re.search(r'c\.t', 'c.t abc') matches successfully to c.t. But the pattern being matched is c\.t, how is c.t matching to c\.t? What happened to the backslash?
Inside a regular expression, the dot character has a special meaning, which is that it can match any character at all other than a newline (unless the re.S/re.DOTALL flag is used). In this case, the backslash has the effect of escaping the dot from its special meaning and letting the regular expression engine interpret it as literally matching only a dot (and no other character). Consider if the backslash is not there:
>>> re.search(r'c.t', 'c.t abc')
<_sre.SRE_Match object at 0x7fe7378d8370>
The original string you provided as input still matches. But now the following will also match:
>>> re.search(r'c.t', 'I saw a cat')
<_sre.SRE_Match object at 0x7fe7378d83d8>
Because the a in cat qualifies as any non-newline character, which is what . will match when it is not escaped with a backslash. You can see that if we add the backslash back in, it no longer matches.
>>> print(re.search(r'c\.t', 'I saw a cat'))
None
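The three behaviours of the dot can be compared in one loop. A minimal sketch; the third sample string (with a newline between c and t) is added here only to show the effect of re.DOTALL:

```python
import re

samples = ["c.t abc", "I saw a cat", "c\nt"]
results = {}

for text in samples:
    results[text] = (
        bool(re.search(r"c.t", text)),              # "." matches any character except newline
        bool(re.search(r"c\.t", text)),             # "\." matches a literal dot only
        bool(re.search(r"c.t", text, re.DOTALL)),   # with DOTALL, "." matches newline too
    )
    print(repr(text), results[text])
```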
More on Python's implementation of regular expressions here:
Python 2.7.x: https://docs.python.org/2/library/re.html
Python 3.4.x: https://docs.python.org/3/library/re.html
Edited to reflect @cdarke's excellent point about newlines

What does the "r" in Python's re.compile(r' pattern flags') mean?

I am reading through http://docs.python.org/2/library/re.html. According to this, the "r" in Python's re.compile(r' pattern flags') refers to the raw string notation:
The solution is to use Python’s raw string notation for regular
expression patterns; backslashes are not handled in any special way in
a string literal prefixed with 'r'. So r"\n" is a two-character string
containing '\' and 'n', while "\n" is a one-character string
containing a newline. Usually patterns will be expressed in Python
code using this raw string notation.
Would it be fair to say then that:
re.compile(r pattern) means that "pattern" is a regex while, re.compile(pattern) means that "pattern" is an exact match?
As @PauloBu stated, the r string prefix is not specifically related to regexes, but to strings generally in Python.
Normal strings use the backslash character as an escape character for special characters (like newlines):
>>> print('this is \n a test')
this is
a test
The r prefix tells the interpreter not to do this:
>>> print(r'this is \n a test')
this is \n a test
>>>
This is important in regular expressions, as you need the backslash to make it to the re module intact. In particular, \b matches the empty string at the start or end of a word. re expects the two-character string \b, but in a normal string literal '\b' is converted to the ASCII backspace character, so you need to either explicitly escape the backslash ('\\b') or tell Python it is a raw string (r'\b').
>>> import re
>>> re.findall('\b', 'test') # the backslash gets consumed by the python string interpreter
[]
>>> re.findall('\\b', 'test') # backslash is explicitly escaped and is passed through to re module
['', '']
>>> re.findall(r'\b', 'test') # often this syntax is easier
['', '']
No. As the pasted documentation explains, the r prefix indicates that the string is a raw string.
Because of the collisions between Python escaping of characters and regex escaping, both of which use the back-slash \ character, raw strings provide a way to indicate to python that you want an unescaped string.
Examine the following:
>>> "\n"
'\n'
>>> r"\n"
'\\n'
>>> print "\n"  # output is only blank lines
>>> print r"\n"
\n
Prefixing with an r merely indicates to the string that backslashes \ should be treated literally and not as escape characters for python.
This is helpful when, for example, you are searching on a word boundary. The regex for this is \b; however, to capture this in a Python string, I'd need to use "\\b" as the pattern. Instead, I can use the raw string r"\b" to pattern match on.
This becomes especially handy when trying to find a literal backslash in regex. To match a backslash in regex I need to use the pattern \\, to escape this in python means I need to escape each slash and the pattern becomes "\\\\", or the much simpler r"\\".
As you can guess in longer and more complex regexes, the extra slashes can get confusing, so raw strings are generally considered the way to go.
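The equivalence of the two spellings of the backslash pattern can be verified directly. A minimal sketch, where `path` is a made-up sample value:

```python
import re

plain = "\\\\"   # normal literal: four typed backslashes become two characters
raw = r"\\"      # raw literal: two typed backslashes stay two characters

print(plain == raw, len(raw))    # True 2

# Hypothetical sample path, used only for illustration.
path = r"C:\Users\me"
print(re.findall(raw, path))     # ['\\', '\\']
print(re.split(raw, path))       # ['C:', 'Users', 'me']
```

Both literals hand the regex engine the same two characters, which it reads as one escaped backslash.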
No. Not everything in regex syntax needs to be preceded by \, so ., *, +, etc still have special meaning in a pattern
The r'' is often used as a convenience for regex that do need a lot of \ as it prevents the clutter of doubling up the \

python "re" package, strange phenomenon with "raw" string

I am seeing the following phenomenon, couldn't seem to figure it out, and didn't find anything with some search through archives:
if I type in:
>>> if re.search(r'\n',r'this\nis\nit'):
...     print 'found it!'
... else:
...     print "didn't find it"
...
I will get:
didn't find it
However, if I type in:
>>> if re.search(r'\\n',r'this\nis\nit'):
...     print 'found it!'
... else:
...     print "didn't find it"
...
Then I will get:
found it!
(The first one only has one backslash on the r'\n' whereas the second one has two backslashes in a row on the r'\\n' ... even this interpreter is removing one of them.)
I can guess what is going on, but I don't understand the official mechanism as to why this is happening: in the first case, I need to escape two things: both the regular expression and the special strings. "Raw" lets me escape the special strings, but not the regular expression.
But there will never be a regular expression in the second string, since it is the string being matched. So there is only a need to escape once.
However, something doesn't seem consistent to me: how am I supposed to ensure that the characters REALLY ARE taken literally in the first case? Can I type rr'' ? Or do I have to ensure that I escape things twice?
On a similar vein, how do I ensure that a variable is taken literally (or that it is NOT taken literally)? E.g., what if I had a variable tmp = 'this\nis\nmy\nhome', and I really wanted to find the literal combination of a slash and an 'n', instead of a newline?
Thanks! Mike
re.search(r'\n', r'this\nis\nit')
As you said, "there will never be a regular expression in the second string." So we need to look at these strings differently: the first string is a regex, the second just a string. Usually your second string will not be raw, so any backslashes are Python-escapes, not regex-escapes.
So the first string consists of a literal "\" and an "n". This is interpreted by the regex parser as a newline (docs: "Most of the standard escapes supported by Python string literals are also accepted by the regular expression parser"). So your regex will be searching for a newline character.
Your second string consists of the string "this" followed by a literal "\" and an "n". So this string does not contain an actual newline character. Your regex will not match.
As for your second regex:
re.search(r'\\n', r'this\nis\nit')
This version matches because your regex contains three characters: a literal "\", another literal "\" and an "n". The regex parser interprets the two slashes as a single "\" character, followed by an "n". So your regex will be searching for a "\" followed by an "n", which is found within the string. But that isn't very helpful, since it has nothing to do with newlines.
Most likely what you want is to drop the r from the second string, thus treating it as a normal Python string.
re.search(r'\n', 'this\nis\nit')
In this case, your regex (as before) is searching for a newline character. And, it finds it, because the second string contains the word "this" followed by a newline.
Escaping special sequences in string literals is one thing, escaping regular expression special characters is another. The raw string modifier only affects the former.
Technically, re.search accepts two strings and passes the first to the regex builder with re.compile. The compiled regex object is used to search patterns inside simple strings. The second string is never compiled and thus it is not subject to regex special character rules.
If the regex builder receives a \n after the string literal is processed, it converts this sequence to a newline character. You also have to escape the backslash if you need to match the literal sequence instead.
All rationale behind this is that regular expressions are not part of the language syntax. They are rather handled within the standard library inside the re module with common building blocks of the language.
The re.compile function uses special characters and escaping rules compatible with the most commonly used regex implementations. However, the Python interpreter is not aware of the whole regular expression concept and does not know whether a string literal will be compiled into a regex object or not. As a result, Python can't provide any kind of syntax simplification such as the ones you suggested.
Regexes have their own meaning for backslashes, as in character classes like \d. If you actually want to match a literal backslash character, you do need to double-escape it. It's not supposed to be parallel, since you're comparing a regex to a string.
Raw strings are just a convenience, and it would be way overkill to have double-raw strings.
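For the question about a variable, one possible approach is re.escape, which makes the regex engine take the variable's contents literally. A sketch under that assumption; `raw_tmp` and `needle` are names introduced here for illustration:

```python
import re

tmp = 'this\nis\nmy\nhome'        # contains three real newline characters
raw_tmp = r'this\nis\nmy\nhome'   # contains backslash + "n" pairs instead

needle = r'\n'                    # two characters: backslash, n

# Used directly as a pattern, the regex engine reads needle as "newline".
print(len(re.findall(needle, tmp)), len(re.findall(needle, raw_tmp)))    # 3 0

# re.escape makes the engine take needle literally: backslash followed by n.
literal = re.escape(needle)
print(len(re.findall(literal, tmp)), len(re.findall(literal, raw_tmp)))  # 0 3

# For a plain substring test no regex is needed at all.
print(needle in raw_tmp)   # True
```

Note that tmp as written in the question holds real newlines, so a search for a literal backslash + n finds nothing in it.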
