Python re.match doesn't match the same regexp

I'm facing a weird problem, and I hope nobody has asked this question before.
I need to match two regexps containing "(" and ")".
Here are the tests I ran to see why it's not working:
>>> import re
>>> re.match("a","a")
<_sre.SRE_Match object at 0xb7467218>
>>> re.match(re.escape("a"),re.escape("a"))
<_sre.SRE_Match object at 0xb7467410>
>>> re.escape("a(b)")
'a\\(b\\)'
>>> re.match(re.escape("a(b)"),re.escape("a(b)"))
=> No match
Can someone explain to me why the regexp doesn't match itself?
Thanks a lot

You've escaped special characters, so your regex will match the string "a(b)", not the string 'a\(b\)' which is the result of re.escape('a(b)').

The first argument is the pattern; the second is the actual string you are matching against. You shouldn't escape the string itself. Remember, re.escape escapes the special characters in a pattern so that it matches the original text literally.
>>> help(re.match)
Help on function match in module re:
match(pattern, string, flags=0)
Try to apply the pattern at the start of the string, returning
a match object, or None if no match was found.
>>> re.match(re.escape('a(b)'), 'a(b)')
<_sre.SRE_Match object at 0x10119ad30>
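A minimal sketch of the point above, assuming the goal is to check whether a string starts with some literal text. The helper name starts_with_literal is made up for illustration and is not part of the re module.

```python
import re

def starts_with_literal(literal, text):
    """Return True if text begins with literal, treating literal as plain text."""
    # Escape only the pattern side; the text being matched is left untouched.
    return re.match(re.escape(literal), text) is not None

print(starts_with_literal("a(b)", "a(b)"))             # escaped pattern vs. plain string
print(starts_with_literal("a(b)", re.escape("a(b)")))  # escaped pattern vs. escaped string
```

The first call prints True and the second prints False, because the escaped string contains backslashes that the pattern does not expect.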


How to match an underscore using Python's regex? [duplicate]

I'm having trouble matching the underscore character in Python using regular expressions. Just playing around in the shell, I get:
>>> import re
>>> re.match(r'a', 'abc')
<_sre.SRE_Match object at 0xb746a368>
>>> re.match(r'_', 'ab_c')
>>> re.match(r'[_]', 'ab_c')
>>> re.match(r'\_', 'ab_c')
I would have expected at least one of these to return a match object. Am I doing something wrong?
Use re.search instead of re.match if the pattern you are looking for is not at the start of the search string.
re.match(pattern, string, flags=0)
Try to apply the pattern at the start of the string, returning a match
object, or None if no match was found.
re.search(pattern, string, flags=0)
Scan through string looking for a match to the pattern, returning a
match object, or None if no match was found.
You don't need to escape _ or even use a raw string.
>>> re.search('_', 'ab_c')
Out[4]: <_sre.SRE_Match object; span=(2, 3), match='_'>
Try the following:
re.search(r'\_', 'ab_c')
Escaping the underscore is harmless but not required, since _ has no special meaning in a regular expression. The real problem is the use of match.
Mind that you can only use match for the beginning of strings, as is also clear from the documentation (https://docs.python.org/2/library/re.html):
If zero or more characters at the beginning of string match the regular expression pattern, return a corresponding MatchObject instance. Return None if the string does not match the pattern; note that this is different from a zero-length match.
You should use search in this case:
Scan through string looking for the first location where the regular expression pattern produces a match, and return a corresponding MatchObject instance. Return None if no position in the string matches the pattern; note that this is different from finding a zero-length match at some point in the string.
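To make the difference concrete, here is a small sketch using the string from the question. It also shows re.fullmatch (available since Python 3.4), which the answers above do not mention, for the case where the whole string has to match.

```python
import re

text = "ab_c"

at_start = re.match("_", text)      # only tries position 0, where the character is 'a'
anywhere = re.search("_", text)     # scans forward until the pattern matches
whole = re.fullmatch("ab_c", text)  # requires the entire string to match

print(at_start)
print(anywhere.span())
print(whole is not None)
```

This prints None, then (2, 3), then True.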

How can Python's regular expressions work with patterns that have escaped special characters?

Is there a way to get Python's regular expressions to work with patterns that have escaped special characters? As far as my limited understanding can tell, the following example should work, but the pattern fails to match.
import re
string = r'This a string with ^g\.$s' # A string to search
pattern = r'^g\.$s' # The pattern to use
string = re.escape(string) # Escape special characters
pattern = re.escape(pattern)
print(re.search(pattern, string)) # This prints "None"
Note:
Yes, this question has been asked elsewhere (like here). But as you can see, I'm already implementing the solution described in the answers and it's still not working.
Why on earth are you applying re.escape to the string?! You want to find the "special" characters in that! If you just apply it to the pattern, you'll get a match:
>>> import re
>>> string = r'This a string with ^g\.$s'
>>> pattern = r'^g\.$s'
>>> re.search(re.escape(pattern), re.escape(string)) # nope
>>> re.search(re.escape(pattern), string) # yep
<_sre.SRE_Match object at 0x025089F8>
For bonus points, notice that you just need to re.escape the pattern one more time than the string:
>>> re.search(re.escape(re.escape(pattern)), re.escape(string))
<_sre.SRE_Match object at 0x025D8DE8>
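As a rough sketch of the same idea, the helper below (find_literal is a made-up name, not a re function) escapes only the pattern and returns where the literal text occurs. For a plain literal search like this one, str.find gives the same index without a regex at all.

```python
import re

string = r'This a string with ^g\.$s'  # the text to search, left unescaped
literal = r'^g\.$s'                    # the text to find, taken literally

def find_literal(literal, text):
    """Return the index of the first occurrence of literal in text, or -1."""
    match = re.search(re.escape(literal), text)
    return match.start() if match else -1

print(find_literal(literal, string) == string.find(literal))  # same index as str.find
print(find_literal(literal, re.escape(string)))               # escaping the text breaks the search
```

The first line prints True; the second prints -1, which is the failure from the question.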

How does re.search match raw strings?

re.search(r'c\.t', 'c.t abc') successfully matches c.t. But the pattern is c\.t, so how does c.t match c\.t? What happened to the backslash?
Inside a regular expression, the dot character has a special meaning, which is that it can match any character at all other than a newline (unless the re.S/re.DOTALL flag is used). In this case, the backslash has the effect of escaping the dot from its special meaning and letting the regular expression engine interpret it as literally matching only a dot (and no other character). Consider if the backslash is not there:
>>> re.search(r'c.t', 'c.t abc')
<_sre.SRE_Match object at 0x7fe7378d8370>
The original string you provided as input still matches. But now the following will also match:
>>> re.search(r'c.t', 'I saw a cat')
<_sre.SRE_Match object at 0x7fe7378d83d8>
This is because the a in cat qualifies as a non-newline character, which is what . matches when it is not escaped with a backslash. You can see that if we add the backslash back in, it no longer matches.
>>> print(re.search(r'c\.t', 'I saw a cat'))
None
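The comparison can be put side by side in a short sketch. The function name dot_demo and the third sample string (with a newline between c and t) are my own additions for illustration.

```python
import re

def dot_demo(text):
    """Return (unescaped_matches, escaped_matches) for the two patterns."""
    unescaped = re.search(r'c.t', text) is not None   # . matches any character except newline
    escaped = re.search(r'c\.t', text) is not None    # \. matches only a literal dot
    return unescaped, escaped

for sample in ['c.t abc', 'I saw a cat', 'c\nt']:
    print(repr(sample), dot_demo(sample))

# With re.DOTALL the unescaped dot also matches a newline.
print(re.search(r'c.t', 'c\nt', re.DOTALL) is not None)
```

The loop prints (True, True), (True, False) and (False, False) for the three samples, and the last line prints True.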
More on Python's implementation of regular expressions here:
Python 2.7.x: https://docs.python.org/2/library/re.html
Python 3.4.x: https://docs.python.org/3/library/re.html
Edited to reflect @cdarke's excellent point about newlines.

url dispatcher detect all and nothing with regexp

I am trying to create a URL dispatcher which matches either a single word made of [a-zA-Z] characters or nothing at all (an empty path).
I tried something like this, but only the word case works; the empty case does not.
url(r'(?P<search_word>[a-zA-Z].*?)/?$', 'website.views.index_view', name='website_index'),
What am I missing?
I think you want something like this instead (note the lack of a dot after [a-zA-Z]):
r'^(?P<search_word>[a-zA-Z]*)/?$'
In your original regex, .*? will allow for any character(s) (even spaces, for example). Also, [a-zA-Z] will only match a single character unless it is followed by an * or a +.
Here is an example of my regex using the re module:
>>> import re
>>> re.match(r'^(?P<search_word>[a-zA-Z]*)/?$', 'testString/')
<_sre.SRE_Match object at 0x02BF4F20> # matches 'testString/'
>>> re.match(r'^(?P<search_word>[a-zA-Z]*)/?$', 'test-String/') # does not match 'test-String/' because of the hyphen
>>> re.match(r'^(?P<search_word>[a-zA-Z]*)/?$', '') # also matches empty string ''
<_sre.SRE_Match object at 0x02BF44A0>
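If you also want to see what the named group captures, something like the sketch below may help. URL_PATTERN and extract_search_word are hypothetical names used only for this example; they are not Django APIs, and the pattern is tested here with the plain re module rather than through the URL dispatcher.

```python
import re

# Same pattern as above, compiled once for reuse.
URL_PATTERN = re.compile(r'^(?P<search_word>[a-zA-Z]*)/?$')

def extract_search_word(path):
    """Return the captured word ('' for an empty path) or None if the path is rejected."""
    match = URL_PATTERN.match(path)
    return match.group('search_word') if match else None

for path in ['testString/', 'testString', '', 'test-String/']:
    print(repr(path), '->', repr(extract_search_word(path)))
```

The empty path yields an empty string rather than None, so a view receiving search_word has to treat '' as the "nothing" case.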

Regexp to literally interpret \t as \t and not tab

I'm trying to match a sequence of text with backslashes in it, like a Windows path.
Now, when I match with regexp in python, it gets the match, but the module interprets all backslashes followed by a valid escape char (i.e. t) as an escape sequence, which is not what I want.
How do I get it not to do that?
Thanks
/m
EDIT:
Well, I missed that the regexp that matches the text containing the backslash is a (.*). I've tried the raw notation (exemplified in the answers), but it does not help in my situation. Or I'm doing it wrong.
EDIT2: Did it wrong. Thanks guys/girls!
Use a doubled backslash in a raw string, like this:
>>> re.match(r"\\t", r"\t")
<_sre.SRE_Match object at 0xb7ce5d78>
From the Python docs:
When one wants to match a literal backslash, it must be escaped in the regular expression. With raw string notation, this means r"\\". Without raw string notation, one must use "\\\\", making the following lines of code functionally identical:
>>> re.match(r"\\", r"\\")
<_sre.SRE_Match object at ...>
>>> re.match("\\\\", r"\\")
<_sre.SRE_Match object at ...>
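Always use the r prefix when defining your regex. This tells Python to treat the string literal as raw, so Python itself does not process backslash escapes before the regex engine sees the pattern. Note that the regex engine still reads \t as a tab; to match a literal backslash followed by t, write r'\\t'.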
regex = r'\t'
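Here is a small sketch of how the different spellings behave, assuming the text to match is a Windows-style path. The path C:\temp\new is invented for the example.

```python
import re

windows_path = r'C:\temp\new'   # raw string: real backslashes, no tab or newline

# r'\t' as a regex means "a tab character", so it does not match backslash + t.
print(re.search(r'\t', windows_path))

# r'\\t' as a regex means "a literal backslash followed by t".
print(re.search(r'\\t', windows_path).group(0))

# re.escape builds a pattern that matches the whole path literally.
print(re.fullmatch(re.escape(windows_path), windows_path) is not None)
```

This prints None, then \t (a backslash and the letter t), then True. When the text to match comes from a variable rather than a literal, re.escape is usually the simplest way to get the escaping right.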
