Python regex equivalent for perl - python

What is the equivalent of the following Perl condition in Python?
if($line=~/DramBase/)
I tried the following, but it didn't match (the line at the bottom):
if(re.match( r'DramBase', line)):
I had to change it to
if(re.match( r'.*DramBase', line)):
to match this line
# -DF0.CCM0.DramBaseAddress1 0x00004001
Is there a flag to match it anywhere on the line without explicitly matching the leading characters?

You need to use re.search, not re.match. re.match only matches at the beginning of the string, while re.search matches anywhere, like in Perl.

See re — Regular expression operations for an explanation
re.match(pattern, string, flags=0)
If zero or more characters at the beginning of string match the regular expression pattern, return a corresponding match object. Return None if the string does not match the pattern; note that this is different from a zero-length match.
Note that even in MULTILINE mode, re.match() will only match at the beginning of the string and not at the beginning of each line.
If you want to locate a match anywhere in string, use search() instead (see also search() vs. match()).
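As a minimal sketch of the difference, using the line from the question (the variable name `line` is taken from the question's own snippet):

```python
import re

line = "# -DF0.CCM0.DramBaseAddress1 0x00004001"

# re.match only tries position 0, so it fails on this line.
at_start = re.match(r"DramBase", line)

# re.search scans the whole line, like Perl's $line =~ /DramBase/.
anywhere = re.search(r"DramBase", line)

print(at_start)        # None
print(bool(anywhere))  # True
```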

Related

re.fullmatch equivalent in pandas text handling [duplicate]

I'm trying to check if a string is a number, so the regex "\d+" seemed good. However that regex also fits "78.46.92.168:8000" for some reason, which I do not want, a little bit of code:
class Foo():
    _rex = re.compile(r"\d+")
    def bar(self, string):
        m = self._rex.match(string)
        if m is not None:
            doStuff()
And doStuff() is called when the IP address is entered. I'm kind of confused, how does "." or ":" match "\d"?
\d+ matches any positive number of digits within your string, so it matches the first 78 and succeeds.
Use ^\d+$.
Or, even better: "78.46.92.168:8000".isdigit()
There are a couple of options in Python to match an entire input with a regex.
Python 2 and 3
In Python 2 and 3, you may use
re.match(r'\d+$', s) # re.match anchors the match at the start of the string, so $ is what remains to add
or - to avoid matching before the final \n in the string:
re.match(r'\d+\Z', s) # \Z will only match at the very end of the string
Or the same as above with re.search, which does not anchor the match at the start of the string and therefore needs the ^ / \A start-of-string anchor:
re.search(r'^\d+$', s)
re.search(r'\A\d+\Z', s)
Note that \A is an unambiguous string start anchor, its behavior cannot be redefined with any modifiers (re.M / re.MULTILINE can only redefine the ^ and $ behavior).
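A small sketch of that last point, assuming a two-line sample string chosen only for illustration:

```python
import re

text = "abc\n123"

# With re.MULTILINE, ^ and $ also match at line boundaries,
# so the second line alone satisfies the pattern.
caret_dollar = re.search(r"^\d+$", text, re.MULTILINE)

# \A and \Z still mean the very start and very end of the string.
a_z = re.search(r"\A\d+\Z", text, re.MULTILINE)

print(caret_dollar.group())  # 123
print(a_z)                   # None
```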
Python 3
Everything described in the section above, plus one more useful method, re.fullmatch (added in Python 3.4; also present in the PyPI regex module):
If the whole string matches the regular expression pattern, return a corresponding match object. Return None if the string does not match the pattern; note that this is different from a zero-length match.
So, after you compile the regex, just use the appropriate method:
_rex = re.compile(r"\d+")
if _rex.fullmatch(s):
    doStuff()
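To see why the original code accepted the IP address, here is a short sketch comparing match and fullmatch on the string from the question (the `candidates` list and `results` dict are illustrative names, not part of the original code):

```python
import re

digits = re.compile(r"\d+")
candidates = ["12345", "78.46.92.168:8000"]

# For each string: (does match succeed?, does fullmatch succeed?)
results = {s: (bool(digits.match(s)), bool(digits.fullmatch(s)))
           for s in candidates}

for s, (m, fm) in results.items():
    print(repr(s), "match:", m, "fullmatch:", fm)
```

match succeeds on the IP address because the leading "78" is enough; fullmatch rejects it because the pattern must consume the entire string.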
re.match() always matches from the start of the string (unlike re.search()) but allows the match to end before the end of the string.
Therefore, you need an anchor: re.match(r"\d+$", string) would work.
To be more explicit, you could also use re.match(r"^\d+$", string) (where the ^ is redundant), or drop re.match() altogether and use re.search(r"^\d+$", string).
\Z matches the end of the string while $ matches the end of the string or just before the newline at the end of the string, and exhibits different behaviour in re.MULTILINE. See the syntax documentation for detailed information.
>>> s="1234\n"
>>> re.search("^\d+\Z",s)
>>> s="1234"
>>> re.search("^\d+\Z",s)
<_sre.SRE_Match object at 0xb762ed40>
Change it from \d+ to ^\d+$
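One caveat with ^\d+$, sketched here with the same "1234\n" input used in the session above: $ tolerates a single trailing newline, while \Z does not.

```python
import re

s = "1234\n"

# $ matches at the end of the string or just before a final newline.
dollar = re.search(r"^\d+$", s)

# \Z matches only at the very end of the string.
strict = re.search(r"^\d+\Z", s)

print(dollar.group())  # 1234
print(strict)          # None
```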

Python regex not finding match in first line of file [duplicate]

What is the difference between the search() and match() functions in the Python re module?
I've read the Python 2 documentation (Python 3 documentation), but I never seem to remember it. I keep having to look it up and re-learn it. I'm hoping that someone will answer it clearly with examples so that (perhaps) it will stick in my head. Or at least I'll have a better place to return with my question and it will take less time to re-learn it.
re.match is anchored at the beginning of the string. That has nothing to do with newlines, so it is not the same as using ^ in the pattern.
As the re.match documentation says:
If zero or more characters at the beginning of string match the regular expression pattern, return a corresponding MatchObject instance. Return None if the string does not match the pattern; note that this is different from a zero-length match.
Note: If you want to locate a match anywhere in string, use search() instead.
re.search searches the entire string, as the documentation says:
Scan through string looking for a location where the regular expression pattern produces a match, and return a corresponding MatchObject instance. Return None if no position in the string matches the pattern; note that this is different from finding a zero-length match at some point in the string.
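So if you need to match at the beginning of the string, use match; it is faster because it only tries one starting position. To match the entire string, add a trailing $ or \Z (or use fullmatch in Python 3.4+). Otherwise use search.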
The documentation has a specific section for match vs. search that also covers multiline strings:
Python offers two different primitive operations based on regular expressions: match checks for a match only at the beginning of the string, while search checks for a match anywhere in the string (this is what Perl does by default).
Note that match may differ from search even when using a regular expression beginning with '^': '^' matches only at the start of the string, or in MULTILINE mode also immediately following a newline. The “match” operation succeeds only if the pattern matches at the start of the string regardless of mode, or at the starting position given by the optional pos argument regardless of whether a newline precedes it.
Now, enough talk. Time to see some example code:
# example code:
import re

string_with_newlines = """something
someotherthing"""

print(re.match('some', string_with_newlines))  # matches
print(re.match('someother', string_with_newlines))  # won't match
print(re.match('^someother', string_with_newlines, re.MULTILINE))  # also won't match
print(re.search('someother', string_with_newlines))  # finds something
print(re.search('^someother', string_with_newlines, re.MULTILINE))  # also finds something

m = re.compile('thing$', re.MULTILINE)
print(m.match(string_with_newlines))  # no match
print(m.match(string_with_newlines, pos=4))  # matches
# The flag is already compiled into m; a second positional argument here would be pos, not flags.
print(m.search(string_with_newlines))  # also matches
search ⇒ find something anywhere in the string and return a match object.
match ⇒ find something at the beginning of the string and return a match object.
match is much faster than search, so instead of doing regex.search("word") you can do regex.match((.*?)word(.*?)) and gain tons of performance if you are working with millions of samples.
This comment from @ivan_bilan under the accepted answer above got me wondering whether such a hack actually speeds anything up, so let's find out how many tons of performance you will really gain.
I prepared the following test suite:
import random
import re
import string
import time
LENGTH = 10
LIST_SIZE = 1000000
def generate_word():
    word = [random.choice(string.ascii_lowercase) for _ in range(LENGTH)]
    word = ''.join(word)
    return word
wordlist = [generate_word() for _ in range(LIST_SIZE)]
start = time.time()
[re.search('python', word) for word in wordlist]
print('search:', time.time() - start)
start = time.time()
[re.match('(.*?)python(.*?)', word) for word in wordlist]
print('match:', time.time() - start)
I made 10 measurements (1M, 2M, ..., 10M words) which gave me the following plot:
As you can see, searching for the pattern 'python' is faster than matching the pattern '(.*?)python(.*?)'.
Python is smart. Avoid trying to be smarter.
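For a quicker check on your own machine, here is a scaled-down sketch of the same comparison using timeit and precompiled patterns. The word list is a small made-up sample rather than the random 1M-word lists above, and the absolute timings will vary by machine and Python version:

```python
import re
import timeit

# Hypothetical sample data, far smaller than the lists used above.
words = ["abcdefghij", "xxpythonxx", "pythonabcd", "zzzzpython"] * 250

search_re = re.compile(r"python")
match_re = re.compile(r"(.*?)python(.*?)")

def with_search():
    return [bool(search_re.search(w)) for w in words]

def with_match():
    return [bool(match_re.match(w)) for w in words]

# Both approaches must agree on which words contain the pattern.
assert with_search() == with_match()

print("search:", timeit.timeit(with_search, number=20))
print("match: ", timeit.timeit(with_match, number=20))
```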
re.search looks for the pattern throughout the string, whereas re.match does not search at all: it can only match the pattern at the start of the string.
The example below shows how re.match and re.search behave:
a = "123abc"
t = re.match("[a-z]+",a)
t = re.search("[a-z]+",a)
re.match will return None, but re.search will return a match object for abc.
The difference is, re.match() misleads anyone accustomed to Perl, grep, or sed regular expression matching, and re.search() does not. :-)
More soberly, as John D. Cook remarks, re.match() "behaves as if every pattern has ^ prepended." In other words, re.match('pattern', s) is roughly equivalent to re.search('^pattern', s). So it anchors a pattern's left side. But it doesn't anchor a pattern's right side: that still requires a terminating $.
Frankly given the above, I think re.match() should be deprecated. I would be interested to know reasons it should be retained.
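A short sketch of the left-anchored, right-open behaviour described above. Wrapping the pattern in \A(?:...) is my own way of spelling out the implicit anchor, not something from the quoted remark:

```python
import re

pattern = r"\d+"
text = "123abc"

# re.match anchors on the left only: it accepts the leading digits.
left_anchored = re.match(pattern, text)

# Spelling out the implicit anchor with re.search gives the same result.
explicit = re.search(r"\A(?:" + pattern + r")", text)

# The right side is not anchored unless you add $ yourself.
both_sides = re.match(pattern + r"$", text)

print(left_anchored.group(), explicit.group(), both_sides)  # 123 123 None
```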
Much shorter:
search scans through the whole string.
match scans only the beginning of the string.
The following example shows it:
>>> a = "123abc"
>>> print(re.match("[a-z]+", a))
None
>>> re.search("[a-z]+", a).group()
'abc'
re.match attempts to match a pattern at the beginning of the string. re.search attempts to match the pattern throughout the string until it finds a match.
Quick answer
re.search('test', ' test') # returns a truthy match object (because the search can start at any index)
re.match('test', ' test') # returns None (because the match must start at index 0)
re.match('test', 'test') # returns a truthy match object (match at index 0)

Regex string doesn't match

I'm having trouble matching a digit in a string with Python. While it should clearly match, it doesn't even match [0-9], [\d], or just 0 alone. Where is my oversight?
import re
file_without_extension = "/test/folder/something/file_0"
if re.match("[\d]+$", file_without_extension):
    print("file matched!")
Read the documentation: http://docs.python.org/2/library/re.html#re.match
If zero or more characters at the beginning of string
You want to use re.search (or re.findall)
re.match is "anchored" to the beginning of the string. Use re.search.
Use re.search instead of re.match.
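Applied to the snippet from the question, a minimal corrected version might look like this (the pattern is simplified to \d+$ and written as a raw string; the variable name `trailing_digits` is illustrative):

```python
import re

file_without_extension = "/test/folder/something/file_0"

# re.search scans for trailing digits anywhere; re.match would only try index 0.
trailing_digits = re.search(r"\d+$", file_without_extension)

if trailing_digits:
    print("file matched!", trailing_digits.group())
```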

re.match vs re.search

If I do this:
import re
m = re.compile("[0-9]{1,}Y")
res = m.search("AUD3M25Y_EOD2")
if res:
    return res.group(0)[:-1]
I will get 25 as an answer
However if I do
import re
m = re.compile(".*([0-9]{1,})Y.*")
res = m.match("AUD3M25Y_EOD2")
if res:
    return res.groups(0)
I will get only 5.
Why the difference?
Does it have anything to do with 'global' option? (much like s///g in vi)
In your match, the first .* is greedy, it is matching as much as it can, including numbers.
If you make it less greedy, it will work:
.*?([0-9]{1,})Y.*
(PS I think this greedy issue doesn't make it a fair comparison of re.search and re.match)
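The greedy and non-greedy variants can be compared side by side on the string from the question; this is only a sketch, with variable names chosen for illustration:

```python
import re

text = "AUD3M25Y_EOD2"

# Greedy .* grabs as much as possible, leaving only one digit for the group.
greedy = re.match(r".*([0-9]{1,})Y.*", text)

# Lazy .*? gives up characters as early as possible, so the group gets both digits.
lazy = re.match(r".*?([0-9]{1,})Y.*", text)

# The original search-based version, for comparison.
searched = re.search(r"[0-9]{1,}Y", text)

print(greedy.group(1))          # 5
print(lazy.group(1))            # 25
print(searched.group(0)[:-1])   # 25
```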
Please read the documentation first. As you should expect, it has the answers.
re.search:
Scan through string looking for a location where the regular expression pattern produces a match, and return a corresponding match object. Return None if no position in the string matches the pattern; note that this is different from finding a zero-length match at some point in the string.
re.match:
If zero or more characters at the beginning of string match the regular expression pattern, return a corresponding match object. Return None if the string does not match the pattern; note that this is different from a zero-length match.
Note: If you want to locate a match anywhere in string, use search() instead.
Also, on the same page, Matching vs. Searching:
Python offers two different primitive operations based on regular expressions: match checks for a match only at the beginning of the string, while search checks for a match anywhere in the string (this is what Perl does by default).

Python regex, matching pattern over multiple lines.. why isn't this working?

I know that for parsing I should ideally remove all spaces and line breaks, but I was just doing this as a quick fix for something I was trying, and I can't figure out why it's not working. I have wrapped different areas of text in my document with wrappers like "####1" and am trying to parse based on this, but it's just not working no matter what I try. I think I am using multiline correctly. Any advice is appreciated.
This returns no results at all:
string='
####1
ttteest
####1
ttttteeeestt
####2
ttest
####2'
import re
pattern = '.*?####(.*?)####'
returnmatch = re.compile(pattern, re.MULTILINE).findall(string)
return returnmatch
MULTILINE doesn't mean that . will match a newline; it means that ^ and $ also match at the start and end of each line:
re.M
re.MULTILINE
When specified, the pattern character '^' matches at the beginning of the string and at the beginning of each line (immediately following each newline); and the pattern character '$' matches at the end of the string and at the end of each line (immediately preceding each newline). By default, '^' matches only at the beginning of the string, and '$' only at the end of the string and immediately before the newline (if any) at the end of the string.
re.S or re.DOTALL makes . match newlines as well.
Source
http://docs.python.org/
Try re.findall(r"####(.*?)\s(.*?)\s####", string, re.DOTALL) (works with re.compile too, of course).
This regexp will return tuples containing the number of the section and the section content.
For your example, this will return [('1', 'ttteest'), ('2', ' \n\nttest')].
(BTW: your example won't run, for multiline strings, use ''' or """)
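A runnable sketch of both points, assuming the text is laid out exactly as below with no blank lines between sections (the original spacing is not recoverable from the question, so results on the real document may differ):

```python
import re

# Triple quotes make the multi-line literal valid; layout is an assumption.
text = """
####1
ttteest
####1
ttttteeeestt
####2
ttest
####2"""

# MULTILINE only changes ^ and $; . still stops at newlines, so nothing matches.
with_multiline = re.findall(r".*?####(.*?)####", text, re.MULTILINE)

# DOTALL lets . cross newlines, so each wrapped section is captured.
with_dotall = re.findall(r"####(.*?)####", text, re.DOTALL)

# The pattern suggested above splits each section into (number, content).
pairs = re.findall(r"####(.*?)\s(.*?)\s####", text, re.DOTALL)

print(with_multiline)
print(with_dotall)
print(pairs)
```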
