Python regex - understanding the difference between match and search

From what I understand:
match: given a string str and a pattern pat, match checks whether pat matches at the start of str.
search: given a string str and a pattern pat, search checks whether pat matches starting at any index of str.
If so, is there any point in using '^' at the start of a regex with match?
As far as I can tell there isn't, since match already checks from the start. I'm probably wrong; where is my mistake?

I believe there is no use. The following is copy/pasted from: http://docs.python.org/library/re.html#search-vs-match
Python offers two different primitive operations based on regular expressions: re.match() checks for a match only at the beginning of the string, while re.search() checks for a match anywhere in the string (this is what Perl does by default).
For example:
>>> re.match("c", "abcdef") # No match
>>> re.search("c", "abcdef") # Match
<_sre.SRE_Match object at ...>
Regular expressions beginning with '^' can be used with search() to restrict the match at the beginning of the string:
>>> re.match("c", "abcdef") # No match
>>> re.search("^c", "abcdef") # No match
>>> re.search("^a", "abcdef") # Match
<_sre.SRE_Match object at ...>
Note however that in MULTILINE mode match() only matches at the beginning of the string, whereas using search() with a regular expression beginning with '^' will match at the beginning of each line.
>>> re.match('X', 'A\nB\nX', re.MULTILINE) # No match
>>> re.search('^X', 'A\nB\nX', re.MULTILINE) # Match
<_sre.SRE_Match object at ...>

When calling the function re.match specifically, the ^ character has little meaning, because this function only attempts a match at the beginning of the string. However, it does have meaning for other functions in the re module, and when calling match on a compiled regular expression object.
For example:
import re
text = """\
Mares eat oats
and does eat oats
"""
print(re.findall(r'^(\w+)', text, re.MULTILINE))
This prints:
['Mares', 'and']
With re.findall() and re.MULTILINE enabled, it gives you the first word (with no leading whitespace) on each line of your text.
It can also be useful for something more complex, like lexical analysis with regular expressions, where you pass the compiled regular expression a starting position in the text (which you can choose to be the ending position of the previous match). See the documentation for the RegexObject.match method.
Simple lexer / scanner as an example:
import re
text = """\
Mares eat oats
and does eat oats
"""
pattern = r"""
(?P<firstword>^\w+)
|(?P<lastword>\w+$)
|(?P<word>\w+)
|(?P<whitespace>\s+)
|(?P<other>.)
"""
rx = re.compile(pattern, re.MULTILINE | re.VERBOSE)
def scan(text):
    # Match repeatedly, each time starting where the previous match ended.
    pos = 0
    m = rx.match(text, pos)
    while m:
        toktype = m.lastgroup
        tokvalue = m.group(toktype)
        pos = m.end()
        yield toktype, tokvalue
        m = rx.match(text, pos)
for tok in scan(text):
    print(tok)
which prints
('firstword', 'Mares')
('whitespace', ' ')
('word', 'eat')
('whitespace', ' ')
('lastword', 'oats')
('whitespace', '\n')
('firstword', 'and')
('whitespace', ' ')
('word', 'does')
('whitespace', ' ')
('word', 'eat')
('whitespace', ' ')
('lastword', 'oats')
('whitespace', '\n')
This distinguishes between three types of word: a word at the beginning of a line, a word at the end of a line, and any other word.
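One detail of the scanner worth seeing in isolation is why it passes pos to rx.match instead of slicing the text. The sketch below uses a made-up sample string and two hypothetical patterns (not the ones from the answer), and relies on the documented behaviour that '^' refers to the real start of the string rather than to pos:

```python
import re

# Hypothetical sample data, chosen only for illustration.
text = "Mares eat oats"
anchored = re.compile(r"^\w+")
unanchored = re.compile(r"\w+")

# '^' still refers to the real start of the string, so starting at
# index 6 ("eat") cannot match an anchored pattern.
at_pos = anchored.match(text, 6)

# Slicing creates a new string that starts at "eat", so '^' matches there.
on_slice = anchored.match(text[6:])

# Without '^', match() simply tries the pattern at pos.
plain = unanchored.match(text, 6)

print(at_pos)            # None
print(on_slice.group())  # eat
print(plain.group())     # eat
```

So match(text, pos) and match(text[pos:]) are not interchangeable once the pattern contains '^'.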

In normal mode, you don't need ^ if you are using match.
But in multiline mode (re.MULTILINE), it can be useful because ^ can match not only the beginning of the whole string, but also the beginning of every line.
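A minimal sketch of that distinction, reusing the 'A\nB\nX' string from the documentation excerpt; the variable names are only illustrative:

```python
import re

text = "A\nB\nX"  # 'X' is at index 4, right after a newline

# re.match() only ever tries position 0, with or without MULTILINE.
module_match = re.match(r"^X", text, re.MULTILINE)

# re.search() tries every position; with MULTILINE, '^' matches at line starts.
searched = re.search(r"^X", text, re.MULTILINE)

# A compiled pattern's match() accepts a start position; with MULTILINE,
# '^' matches there because index 4 directly follows a newline.
rx = re.compile(r"^X", re.MULTILINE)
positioned = rx.match(text, 4)

print(module_match)       # None
print(searched.span())    # (4, 5)
print(positioned.span())  # (4, 5)
```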

Related

Replace symbol before match using regex in Python

I have strings such as:
text1 = ('SOME STRING,99,1234 FIRST STREET,9998887777,ABC')
text2 = ('SOME OTHER STRING,56789 SECOND STREET,6665554444,DEF')
text3 = ('ANOTHER STRING,#88,4321 THIRD STREET,3332221111,GHI')
Desired output:
SOME STRING 99,1234 FIRST STREET,9998887777,ABC
SOME OTHER STRING,56789 SECOND STREET,6665554444,DEF
ANOTHER STRING #88,4321 THIRD STREET,3332221111,GHI
My idea: Use regex to find occurrences of 1-5 digits, possibly preceded by a symbol, that are between two commas and not followed by a space and letters, then replace by this match without the preceding comma.
Something like:
text.replace(r'(,\d{0,5},)','.........')

If you use the third-party regex module instead of re, then possibly:
import regex
text = "ANOTHER STRING,#88,4321 THIRD STREET,3332221111,GHI"
print(regex.sub(r'(?<!^.*,.*),(?=#?\d+,\d+)', ' ', text))
You might be able to use re if you are sure that no other substring matches the pattern in the lookahead.
import re
text = "ANOTHER STRING,#88,4321 THIRD STREET,3332221111,GHI"
print(re.sub(r',(?=#?\d+,\d+)', ' ', text))
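As a quick check, the re-based pattern above can be run against all three strings from the question. This is only a sketch: join_unit is a hypothetical helper name, and the pattern does not enforce the "1-5 digits" limit mentioned in the question.

```python
import re

# Pattern taken from the re-based answer above: a comma that is followed
# by an optional '#', digits, another comma, and more digits.
COMMA_BEFORE_UNIT = re.compile(r",(?=#?\d+,\d+)")

def join_unit(text):  # hypothetical helper name
    """Replace the comma in front of the short number with a space."""
    return COMMA_BEFORE_UNIT.sub(" ", text)

samples = [
    "SOME STRING,99,1234 FIRST STREET,9998887777,ABC",
    "SOME OTHER STRING,56789 SECOND STREET,6665554444,DEF",
    "ANOTHER STRING,#88,4321 THIRD STREET,3332221111,GHI",
]
results = [join_unit(s) for s in samples]
for line in results:
    print(line)
```

For these three inputs the results equal the desired output listed in the question; the second string is left unchanged because its first number is followed by a space rather than a comma.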

Easier to read alternative if SOME STRING, SOME OTHER STRING, and ANOTHER STRING never contain commas:
text1.replace(",", " ", 1)
which just replaces the first comma with a space. Note that this also changes text2, which the desired output leaves untouched.

Simple, yet effective:
import re
my_pattern = r"(,)(\W?\d{0,5},)"
p = re.compile(my_pattern)
p.sub(r" \2", text1) # 'SOME STRING 99,1234 FIRST STREET,9998887777,ABC'
p.sub(r" \2", text2) # 'SOME OTHER STRING,56789 SECOND STREET,6665554444,DEF'
p.sub(r" \2", text3) # 'ANOTHER STRING #88,4321 THIRD STREET,3332221111,GHI'
Secondary pattern with non-capturing group and verbose compilation:
my_pattern = r"""
(?:,) # Non-capturing group for a single comma.
(\W?\d{0,5},) # Capture zero or one non-word character, zero to five digits, and a comma
"""
# re.X (re.VERBOSE) lets the pattern span several lines and contain comments
p = re.compile(my_pattern, flags=re.X)
# This time we use \1 to refer to the first (and only) captured group
p.sub(r" \1", text1)
p.sub(r" \1", text2)
p.sub(r" \1", text3)

Regex that matches punctuation at the word boundary including underscore

I am looking for a Python regex for a variable phrase with the following properties:
(For the sake of example, let's assume the variable phrase here is taking the value and. But note that I need to do this in a way that the thing playing the role of and can be passed in as a variable which I'll call phrase.)
Should match: this_and, this.and, (and), [and], and^, ;And, etc.
Should not match: land, andy
This is what I tried so far (where phrase is playing the role of and):
pattern = r"\b" + re.escape(phrase.lower()) + r"\b"
This seems to work for all my requirements except that it does not match words with underscores, e.g. _hello, hello_, hello_world.
Edit: Ideally I would like to use the standard library re module rather than any external packages.

You may use
r'(?<![^\W_])and(?![^\W_])'
See the regex demo. Compile with the re.I flag to enable case insensitive matching.
Details
(?<![^\W_]) - the preceding char should not be a letter or digit char
and - some keyword
(?![^\W_]) - the next char cannot be a letter or digit
Python demo:
import re
strs = ['this_and', 'this.and', '(and)', '[and]', 'and^', ';And', 'land', 'andy']
phrase = "and"
rx = re.compile(r'(?<![^\W_]){}(?![^\W_])'.format(re.escape(phrase)), re.I)
for s in strs:
    print("{}: {}".format(s, bool(rx.search(s))))
Output:
this_and: True
this.and: True
(and): True
[and]: True
and^: True
;And: True
land: False
andy: False

Here is a regex that might solve it:
Regex
(?<=[\W_]+|^)and(?=[\W_]+|$)
Example
import regex  # third-party module, not the standard-library re
pattern = r'(?<=[\W_]+|^)and(?=[\W_]+|$)'
string = 'this_And'
test = regex.search(pattern, string.lower())
print(test.group(0))
# prints 'and'
# No match
string = 'Andy'
test = regex.search(pattern, string.lower())
print(test)
# prints None
strings = ["this_and", "this.and", "(and)", "[and]", "and^", ";And"]
matches = [regex.search(pattern, s.lower()) for s in strings]
print([m.group(0) for m in matches if m])
# prints ['and', 'and', 'and', 'and', 'and', 'and']
Explanation
[\W_]+ means that before (lookbehind, (?<=...)) or after (lookahead, (?=...)) the keyword we accept only non-word characters, plus the underscore _ (which is a word character, but is accepted here). |^ and |$ allow matches to lie at the edge of the string.
Edit
As mentioned in my comment, the regex module does not raise errors for variable-length lookbehinds (as opposed to re).
# This works fine
import regex
word = 'and'
pattern = r'(?<=[\W_]+|^){}(?=[\W_]+|$)'.format(word.lower())
string = 'this_And'
regex.search(pattern, string.lower())
However, if you insist on using re, then off the top of my head I'd suggest splitting the pattern in two, (?<=[\W_])and(?=[\W_]+|$)|^and(?=[\W_]+|$), so that cases where the string starts with and are matched as well.
# This also works fine
import re
word = 'and'
pattern = r'(?<=[\W_]){}(?=[\W_]+|$)|^{}(?=[\W_]+|$)'.format(word.lower(), word.lower())
string = 'this_And'
re.search(pattern, string.lower())
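Since the question asks for the phrase to be passed in as a variable and prefers the standard library, the re-only pattern can be wrapped as follows. This is a sketch, not a definitive implementation: phrase_regex is a hypothetical helper name, and re.escape plus re.IGNORECASE are added here so that phrases containing special characters and mixed case are handled.

```python
import re

def phrase_regex(phrase):  # hypothetical helper name
    """Build a case-insensitive pattern that treats '_' like punctuation."""
    escaped = re.escape(phrase)
    # re only allows fixed-width lookbehind, so the start-of-string case
    # is a separate alternative instead of living inside (?<=...).
    template = r"(?<=[\W_]){0}(?=[\W_]+|$)|^{0}(?=[\W_]+|$)"
    return re.compile(template.format(escaped), re.IGNORECASE)

strs = ['this_and', 'this.and', '(and)', '[and]', 'and^', ';And', 'land', 'andy']
rx = phrase_regex("and")
results = {s: bool(rx.search(s)) for s in strs}
for s in strs:
    print("{}: {}".format(s, results[s]))
```

The first six strings report True and the last two (land, andy) report False, which is the behaviour requested in the question.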

Preventing python to escape regexp pattern while inserting into list

I am trying to create a list of regexp patterns which I can use for pattern matching, like the one below:
REGEXES = [
    'port .\d+',
    'te\d+-\d+ \d+ [#]?\d+',
    'te\d+.-\d+'
]
Now when I check its output, it shows:
['port .\\d+', 'te\\d+-\\d+ \\d+ [#]?\\d+', 'te\\d+.-\\d+']
And using the code below,
msg = "Aborting Test: checkDutPort: Invalid dutBladeAndPort: te3932-213 0 #4, not found in global ::dutPortMap"
combined = "(" + ")|(".join(REGEXES) + ")"
re.match(combined, msg)
it is not able to match the pattern.
I checked, and even with raw strings Python escapes the "\".
How can I prevent this?

From the docs:
re.match(pattern, string, flags=0)
If zero or more characters at the beginning of string match the regular expression pattern, return a corresponding match object.
None of your patterns can be found at the beginning of msg, so it returns None.
If instead you use re.search, it will find the part of the string I assume you're looking for:
>>> re.search(combined, msg)
<_sre.SRE_Match object; span=(54, 69), match='te3932-213 0 #4'>
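The doubled backslashes in the question are only how a list displays its strings (it shows repr() of each element); the strings themselves hold a single backslash. A small sketch with the question's data, written with raw strings, and with illustrative variable names:

```python
import re

REGEXES = [
    r"port .\d+",
    r"te\d+-\d+ \d+ [#]?\d+",
    r"te\d+.-\d+",
]
msg = ("Aborting Test: checkDutPort: Invalid dutBladeAndPort: "
       "te3932-213 0 #4, not found in global ::dutPortMap")

# Displaying a list shows repr() of each string, which doubles backslashes.
shown = repr(REGEXES[0])
# The string itself contains exactly one backslash.
backslashes = REGEXES[0].count("\\")

combined = "(" + ")|(".join(REGEXES) + ")"
at_start = re.match(combined, msg)   # none of the patterns occurs at index 0
anywhere = re.search(combined, msg)  # search scans the whole string

print(shown)              # 'port .\\d+'
print(REGEXES[0])         # port .\d+
print(backslashes)        # 1
print(at_start)           # None
print(anywhere.group(0))  # te3932-213 0 #4
```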

find a word in a sentence using regular expression

So, I am trying to find a word (a complete word) in a sentence. Let's say the sentence is
Str1 = "1. how are you doing"
and that I am interested in finding if
Str2 = "1."
is in it. If I do,
re.search(r"%s\b" % Str2, Str1, re.IGNORECASE)
it should report that a match was found, shouldn't it? But re.search fails for this query. Why?

There are two things wrong here:
\b matches a position between a word and a non-word character, so between any letter, digit or underscore, and a character that doesn't match that set.
You are trying to match the boundary between a . and a space; both are non-word characters and the \b anchor would never match there.
You are handing re a 1., which means 'match a 1 followed by any character'. You'd need to escape the dot, for example with re.escape(), to match a literal '.'.
The following works better:
re.search(r"%s(?:\s|$)" % re.escape(Str2), Str1, re.IGNORECASE)
Now it'll match your input literally, and look for a following space or the end of the string. The (?:...) creates a non-capturing group (always a good idea unless you specifically need to capture sections of the match); inside the group there is a | pipe to give two alternatives: either match \s (whitespace) or match $ (the end of the string, since re.MULTILINE is not set). You can expand this as needed.
Demo:
>>> import re
>>> Str1 = "1. how are you doing"
>>> Str2 = "1."
>>> re.search(r"%s(?:\s|$)" % re.escape(Str2), Str1, re.IGNORECASE)
<_sre.SRE_Match object at 0x10457eed0>
>>> _.group(0)
'1. '
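The two problems described in this answer can be checked separately. The sketch below reuses the question's sentence; the variable names and the "1x" test string are only illustrative:

```python
import re

sentence = "1. how are you doing"
token = "1."

# Problem 1: \b needs a word character on exactly one side; '.' and ' '
# are both non-word characters, so there is no boundary after "1.".
with_boundary = re.search(re.escape(token) + r"\b", sentence)

# Problem 2: an unescaped '.' matches any character, so "1." also matches "1x".
unescaped = re.search(token, "1x")

# The fix from the answer: escape the token, then require whitespace or the end.
fixed = re.search(r"%s(?:\s|$)" % re.escape(token), sentence, re.IGNORECASE)

print(with_boundary)       # None
print(unescaped.group(0))  # 1x
print(repr(fixed.group(0)))  # '1. '
```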

Regular expression to replace with XML node

I'm using Python to write a regular expression for replacing parts of the string with a XML node.
The source string looks like:
Hello
REPLACE(str1) this is to replace
REPLACE(str2) this is to replace
And the result string should be like:
Hello
<replace name="str1"> this is to replace </replace>
<replace name="str2"> this is to replace </replace>
Can anyone help me?

What makes your problem a little bit tricky is that you want to match inside of a multiline string. You need to use the re.MULTILINE flag to make that work.
Then, you need to match some groups inside your source string, and use those groups in the final output. Here is code that works to solve your problem:
import re
s_pat = r"^\s*REPLACE\(([^)]+)\)(.*)$"
pat = re.compile(s_pat, re.MULTILINE)
s_input = """\
Hello
REPLACE(str1) this is to replace
REPLACE(str2) this is to replace"""
def mksub(m):
    # m.groups() is a 2-tuple: (name, rest of the line)
    return '<replace name="%s">%s</replace>' % m.groups()
s_output = re.sub(pat, mksub, s_input)
The only tricky part is the regular expression pattern. Let's look at it in detail.
^ matches the start of a string. With re.MULTILINE, this matches the start of a line within a multiline string; in other words, it matches right after a newline in the string.
\s* matches optional whitespace.
REPLACE matches the literal string "REPLACE".
\( matches the literal string "(".
( begins a "match group".
[^)] means "match any character except ')'".
+ means "match one or more of the preceding pattern".
) closes a "match group".
\) matches the literal string ")"
(.*) is another match group containing ".*".
$ matches the end of a string. With re.MULTILINE, this also matches the end of a line within a multiline string; in other words, it matches at the position just before a newline character, without consuming the newline.
. matches any character, and * means to match zero or more of the preceding pattern. Thus .* matches anything, up to the end of the line.
So, our pattern has two "match groups". When you run re.sub() it will make a "match object" which will be passed to mksub(). The match object has a method, .groups(), that returns the matched substrings as a tuple, and that gets substituted in to make the replacement text.
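To make the anchors concrete: ^ and $ are zero-width, so with re.MULTILINE they match positions next to a newline rather than the newline itself. A minimal sketch with a hypothetical two-line string:

```python
import re

sample = "ab\ncd"  # hypothetical two-line string; the newline is at index 2

# Positions where '^' and '$' match in MULTILINE mode.
line_starts = [m.start() for m in re.finditer(r"^", sample, re.MULTILINE)]
line_ends = [m.start() for m in re.finditer(r"$", sample, re.MULTILINE)]

print(line_starts)  # [0, 3]
print(line_ends)    # [2, 5]

# '$' consumes nothing: the newline is still present after substituting at '$'.
marked = re.sub(r"$", "|", sample, flags=re.MULTILINE)
print(repr(marked))  # 'ab|\ncd|'
```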
EDIT: You actually don't need to use a replacement function. You can put the special string \1 inside the replacement text, and it will be replaced by the contents of match group 1. (Match groups count from 1; the special match group 0 corresponds to the entire string matched by the pattern.) The only tricky part of the \1 string is that \ is special in strings. In a normal string, to get a \, you need to put two backslashes in a row, like so: "\\1". But you can use a Python "raw string" to conveniently write the replacement pattern. Doing so you get this:
import re
s_pat = r"^\s*REPLACE\(([^)]+)\)(.*)$"
pat = re.compile(s_pat, re.MULTILINE)
s_repl = r'<replace name="\1">\2</replace>'
s_input = """\
Hello
REPLACE(str1) this is to replace
REPLACE(str2) this is to replace"""
s_output = re.sub(pat, s_repl, s_input)

Here is an excellent tutorial on how to write regular expressions in Python.

Here is a solution using pyparsing. I know you specifically asked about a regex solution, but if your requirements change, you might find it easier to expand a pyparsing parser. Or a pyparsing prototype solution might give you a little more insight into the problem leading toward a regex or other final implementation.
src = """\
Hello
REPLACE(str1) this is to replace
REPLACE(str2) this is to replace
"""
from pyparsing import Suppress, Word, alphas, alphanums, restOfLine
LPAR,RPAR = map(Suppress,"()")
ident = Word(alphas, alphanums)
replExpr = "REPLACE" + LPAR + ident("name") + RPAR + restOfLine("body")
replExpr.setParseAction(
    lambda toks: '<replace name="%(name)s">%(body)s </replace>' % toks
)
print(replExpr.transformString(src))
In this case, you create the expression to be matched with pyparsing, define a parse action to do the text conversion, and then call transformString to scan through the input source to find all the matches, apply the parse action to each match, and return the resulting output. The parse action serves a similar function to mksub in @steveha's solution.
In addition to the parse action, pyparsing also supports naming individual elements of the expression - I used "name" and "body" to label the two parts of interest, which are represented in the re solution as groups 1 and 2. You can name groups in an re, the corresponding re would look like:
s_pat = r"^\s*REPLACE\((?P<name>[^)]+)\)(?P<body>.*)$"
Unfortunately, to access these groups by name, you have to invoke the group() method on the re match object, you can't directly do the named string interpolation as in my lambda parse action. But this is Python, right? We can wrap that callable with a class that will give us dict-like access to the groups by name:
class CallableDict(object):
    def __init__(self, fn):
        self.fn = fn
    def __getitem__(self, name):
        return self.fn(name)
def mksub(m):
    return '<replace name="%(name)s">%(body)s</replace>' % CallableDict(m.group)
s_output = re.sub(pat, mksub, s_input)
Using CallableDict, the string interpolation in mksub can now call m.group for each field, by making it look like we are retrieving the ['name'] and ['body'] elements of a dict.
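For comparison, the standard re module offers two built-in ways to use named groups in a replacement, which may be enough without a wrapper class: the \g<name> syntax in a replacement template, and the match object's groupdict() method. The sketch below reuses the named-group pattern and input from above; the names via_template and via_groupdict are only illustrative.

```python
import re

s_pat = r"^\s*REPLACE\((?P<name>[^)]+)\)(?P<body>.*)$"
pat = re.compile(s_pat, re.MULTILINE)
s_input = """\
Hello
REPLACE(str1) this is to replace
REPLACE(str2) this is to replace"""

# Option 1: refer to named groups in the replacement template with \g<name>.
via_template = pat.sub(r'<replace name="\g<name>">\g<body></replace>', s_input)

# Option 2: groupdict() returns a plain dict, which works with %-formatting.
def mksub(m):
    return '<replace name="%(name)s">%(body)s</replace>' % m.groupdict()

via_groupdict = pat.sub(mksub, s_input)

print(via_template)
print(via_template == via_groupdict)  # True
```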

Maybe like this ?
import re
mystr = """Hello
REPLACE(str1) this is to replace
REPLACE(str2) this is to replace"""
prog = re.compile(r'REPLACE\((.*?)\)\s(.*)')
for line in mystr.split("\n"):
    print(prog.sub(r'< replace name="\1" > \2', line))

Something like this should work:
import re
import sys
with open(sys.argv[1], 'r') as f:
    for line in f:
        g = re.match(r'REPLACE\((.*)\)(.*)', line)
        if g is None:
            # Lines read from the file keep their newline, so don't add another.
            print(line, end='')
        else:
            print('<replace name="%s">%s</replace>' % (g.group(1), g.group(2)))

import re
a = """Hello
REPLACE(str1) this is to replace
REPLACE(str2) this is to replace"""
regex = re.compile(r"^REPLACE\(([^)]+)\)\s+(.*)$", re.MULTILINE)
b = re.sub(regex, r'< replace name="\1" > \2 < /replace >', a)
print(b)
This does the whole replacement in a single re.sub call.
