Preventing python to escape regexp pattern while inserting into list

I am trying to create a list of regexp patterns that I can use for pattern matching, like the one below:
REGEXES = [
'port .\d+',
'te\d+-\d+ \d+ [#]?\d+',
'te\d+.-\d+'
]
Now when I check its output, it shows:
['port .\\d+', 'te\\d+-\\d+ \\d+ [#]?\\d+', 'te\\d+.-\\d+']
And using the code below:
msg = "Aborting Test: checkDutPort: Invalid dutBladeAndPort: te3932-213 0 #4, not found in global ::dutPortMap"
combined = "(" + ")|(".join(REGEXES) + ")"
re.match(combined, msg)
it is not able to match the pattern.
I checked, and Python also escapes the "\" when I use raw strings.
How can I prevent this?

From the docs:
re.match(pattern, string, flags=0)
If zero or more characters at the beginning of string match the regular expression pattern, return a corresponding match object.
None of your patterns can be found at the beginning of msg, so it returns None.
If instead you use re.search, it will find the part of the string I assume you're looking for:
>>> re.search(combined, msg)
<_sre.SRE_Match object; span=(54, 69), match='te3932-213 0 #4'>
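The doubled backslashes in the printed list are only how Python displays a single backslash in a string's repr; the patterns themselves are unchanged. A minimal sketch that checks this and contrasts re.match with re.search, assuming the patterns are written as raw strings (the plain literals in the question produce the same values, but newer Python versions warn about escapes such as '\d' in them):

```python
import re

# Raw strings keep every backslash exactly as typed.
REGEXES = [
    r'port .\d+',
    r'te\d+-\d+ \d+ [#]?\d+',
    r'te\d+.-\d+',
]

# Printing the list shows repr(), which doubles each backslash for display.
# The string itself still holds a single backslash followed by 'd'.
print(REGEXES[0])    # port .\d+
print(len(r'\d'))    # 2

msg = ("Aborting Test: checkDutPort: Invalid dutBladeAndPort: "
       "te3932-213 0 #4, not found in global ::dutPortMap")
combined = "(" + ")|(".join(REGEXES) + ")"

at_start = re.match(combined, msg)    # only tries offset 0
anywhere = re.search(combined, msg)   # scans the whole string
print(at_start)            # None
print(anywhere.group())    # te3932-213 0 #4
```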

Related

Parsing timestamps with Python regular expressions ':' character not found

I am teaching myself Python and trying to use a regular expression to extract a timestamp from an application log file (I normally use grep, cut and awk for this).
My log files contain many lines that start with a date followed by a time:
18.12.19 14:03:16 [ ..... # message error
18.12.19 14:03:16 [
:
I normally use a simple grep command, grep "14\:03\:16" mytext, and the unescaped expression "14:03:16" works there too.
After some research I came up with the regex below, where res is one of the lines above:
datap = re.compile(r'(\d{2}):(\d{2}):(\d{2})')
m = datap.match(res)
This does not find anything whereas
datap = re.compile(r'(\d{2}).(\d{2}).(\d{2})')
m = datap.match(res)
Captures the date.
Why is the character : not found? I have tried \: as well, and that does not work either. Thanks in advance.

re.match tries to match the regex from the beginning of the string.
From the docs:
If zero or more characters at the beginning of string match the
regular expression pattern, return a corresponding match object.
Return None if the string does not match the pattern; note that this
is different from a zero-length match.
When you did
datap = re.compile(r'(\d{2}).(\d{2}).(\d{2})')
m = datap.match(res)
the regex actually matched the date, not the time (because it is at the beginning of the string):
print(m)
# <re.Match object; span=(0, 8), match='18.12.19'>
If you use re.search then you will get the expected output:
import re
res = '18.12.19 14:03:16 [ ..... # message error'
datap = re.compile(r'(\d{2}):(\d{2}):(\d{2})')
m = datap.search(res)
print(m)
# <re.Match object; span=(9, 17), match='14:03:16'>
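As a follow-up sketch, the captured hour, minute and second can be read from the match object, and match() still works if the pattern also consumes the date at the start of the line. The sample line is the one from the question; the names timep, date_and_time and found are illustrative only.

```python
import re

res = '18.12.19 14:03:16 [ ..... # message error'

# search() scans the whole line, so the time is found at offset 9.
timep = re.compile(r'(\d{2}):(\d{2}):(\d{2})')
found = timep.search(res)
print(found.groups())    # ('14', '03', '16')

# match() only succeeds at offset 0, so the pattern has to cover the date too.
# The dots are escaped here; an unescaped '.' matches any character, which is
# why the '.' version of the pattern in the question matched the date.
date_and_time = re.compile(r'\d{2}\.\d{2}\.\d{2} (\d{2}):(\d{2}):(\d{2})')
anchored = date_and_time.match(res)
print(anchored.groups())    # ('14', '03', '16')

# The time-only pattern cannot match at offset 0.
print(timep.match(res))     # None
```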

Why use re.match(), when re.search() can do the same thing?

From the documentation, it's very clear that:
match() -> apply pattern match at the beginning of the string
search() -> search through the string and return first match
And search with '^' and without re.M flag would work the same as match.
Then why does python have match()? Isn't it redundant?
Are there any performance benefits to keeping match() in python?

The pos argument behaves differently in important ways:
>>> s = "a ab abc abcd"
>>> re.compile('a').match(s, pos=2)
<_sre.SRE_Match object; span=(2, 3), match='a'>
>>> re.compile('^a').search(s, pos=2)
None
match makes it possible to write a tokenizer, and ensure that characters are never skipped. search has no way of saying "start from the earliest allowable character".
Example use of match to break up a string with no gaps:
def tokenize(s, patt):
    at = 0
    while at < len(s):
        m = patt.match(s, pos=at)
        if not m:
            raise ValueError("Did not expect character at location {}".format(at))
        at = m.end()
        yield m
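A short usage sketch for the tokenizer above. The TOKEN pattern is a hypothetical example (whitespace, integers, identifiers and a few operators), not something from the original answer; it only serves to show that every character must belong to some token.

```python
import re

def tokenize(s, patt):
    at = 0
    while at < len(s):
        # match() must succeed exactly at `at`, so no character is skipped.
        m = patt.match(s, pos=at)
        if not m:
            raise ValueError("Did not expect character at location {}".format(at))
        at = m.end()
        yield m

# Hypothetical token pattern, chosen only for illustration.
TOKEN = re.compile(r'\s+|\d+|[A-Za-z_]\w*|[-+*/=()]')

tokens = [m.group() for m in tokenize("x = 42 + y", TOKEN)]
print(tokens)    # ['x', ' ', '=', ' ', '42', ' ', '+', ' ', 'y']

# '$' is not covered by TOKEN, so tokenizing stops with an error there.
try:
    list(tokenize("x = 4 $ 2", TOKEN))
    error = None
except ValueError as exc:
    error = str(exc)
print(error)     # Did not expect character at location 6
```

Note that a pattern able to match the empty string would make the loop spin forever, because at would never advance.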

"Why" questions are hard to answer. As a matter of fact, you could define the function re.match() like this:
def match(pattern, string, flags=0):
    return re.search(r"\A(?:" + pattern + ")", string, flags)
(because \A always matches at the start of the string, regardless of the re.M flag status).
So re.match is a useful shortcut but not strictly necessary. It's especially confusing for Java programmers who have Pattern.matches() which anchors the search to the start and end of the string (which is probably a more common use case than just anchoring to the start).
It's different for the match and search methods of regex objects, though, as Eric has pointed out.
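The anchoring behaviour described here can be checked with a few calls. This is only a sketch on a made-up two-line string; re.fullmatch (available since Python 3.4) is included as the closest counterpart to Java's Pattern.matches().

```python
import re

text = "first\nsecond"

# match() ignores re.M: it only ever tries offset 0.
m_match = re.match(r"second", text, re.M)
print(m_match)                      # None

# search() with '^' and re.M also matches right after a newline.
m_caret = re.search(r"^second", text, re.M)
print(m_caret.span())               # (6, 12)

# \A is unaffected by re.M, which is why it reproduces match().
m_anchor = re.search(r"\A(?:second)", text, re.M)
print(m_anchor)                     # None
print(re.search(r"\A(?:first)", text, re.M).group())    # first

# fullmatch() anchors at both ends of the string.
m_full = re.fullmatch(r"first", text)
print(m_full)                       # None
print(re.fullmatch(r"first\nsecond", text) is not None)  # True
```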

How to print regex match results in python 3?

I was in IDLE, and decided to use regex to sort out a string. But when I typed in what the online tutorial told me to, all it would do was print:
<_sre.SRE_Match object at 0x00000000031D7E68>
Full program:
import re
reg = re.compile("[a-z]+8?")
str = "ccc8"
print(reg.match(str))
result:
<_sre.SRE_Match object at 0x00000000031D7ED0>
Could anybody tell me how to actually print the result?

You need to call .group() on the object returned by match() to print the matched string; printing the match object itself only shows that a match happened. To print the text captured by a capturing group, pass the corresponding group index to .group().
>>> import re
>>> reg = re.compile("[a-z]+8?")
>>> str = "ccc8"
>>> print(reg.match(str).group())
ccc8
Regex with capturing group.
>>> reg = re.compile("([a-z]+)8?")
>>> print(reg.match(str).group(1))
ccc
re.match(pattern, string, flags=0)
If zero or more characters at the beginning of string match the regular expression pattern, return a corresponding MatchObject instance. Return None if the string does not match the pattern; note that this is different from a zero-length match.
Note that even in MULTILINE mode, re.match() will only match at the beginning of the string and not at the beginning of each line.

If you need to get the whole match value, you should use
m = re.match(r"[a-z]+8?", text)
if m: # Always check if a match occurred to avoid NoneType issues
    print(m.group()) # Print the match string
If you need to extract a part of the regex match, you need to use capturing groups in your regular expression. Enclose those patterns with a pair of unescaped parentheses.
To only print captured group results, use Match.groups:
Return a tuple containing all the subgroups of the match, from 1 up to however many groups are in the pattern. The default argument is used for groups that did not participate in the match; it defaults to None.
So, to get ccc and 8 and display only those, you may use
import re
reg = re.compile("([a-z]+)(8?)")
s = "ccc8"
m = reg.match(s)
if m:
    print(m.groups()) # => ('ccc', '8')
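The difference between group(), group(1) and groups(), and the None returned on a failed match, can be seen side by side in a small sketch. The helper name describe is hypothetical and exists only for this illustration.

```python
import re

reg = re.compile(r"([a-z]+)(8?)")

def describe(text):
    """Return the whole match, group 1 and all groups, or None if no match."""
    m = reg.match(text)
    if m is None:
        return None
    return m.group(0), m.group(1), m.groups()

print(describe("ccc8"))    # ('ccc8', 'ccc', ('ccc', '8'))
print(describe("ccc"))     # ('ccc', 'ccc', ('ccc', ''))
print(describe("8ccc"))    # None
```

In the second call the optional group still takes part in the match, so it yields an empty string rather than None.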

Regex replace with negative look ahead in Python

I am trying to delete the single quotes surrounding regular text. For example, given the list:
alist = ["'ABC'", '(-inf-0.5]', '(4800-20800]', "'\\'(4.5-inf)\\''", "'\\'(2.75-3.25]\\''"]
I would like to turn "'ABC'" into "ABC", but keep other quotes, that is:
alist = ["ABC", '(-inf-0.5]', '(4800-20800]', "'\\'(4.5-inf)\\''", "'\\'(2.75-3.25]\\''"]
I tried to use lookarounds as below:
fixRepeatedQuotes = lambda text: re.sub(r'(?<!\\\'?)\'(?!\\)', r'', text)
print [fixRepeatedQuotes(str) for str in alist]
but received error message:
sre_constants.error: look-behind requires fixed-width pattern.
Any other workaround? Thanks a lot in advance!

This should work:
result = re.sub("""(?s)(?:')([^'"]+)(?:')""", r"\1", subject)
Explanation:
"""
(?: # Match the regular expression below
' # Match the character “'” literally (but the ? makes it a non-capturing group)
)
( # Match the regular expression below and capture its match into backreference number 1
[^'"] # Match a single character NOT present in the list “'"” from this character class (aka any character matches except a single and double quote)
+ # Between one and unlimited times, as many times as possible, giving back as needed (greedy)
)
(?: # Match the regular expression below
' # Match the character “'” literally (but the ? makes it a non-capturing group)
)
"""

re.sub accepts a function as the replacement. Therefore,
re.sub(r"'([A-Za-z]+)'", lambda match: match.group(1), "'ABC'")
yields
"ABC"

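One possible workaround avoids lookbehind entirely. The sketch below assumes that only strings consisting of a single quoted run, with no backslash or quote inside, should be unwrapped; the helper name strip_plain_quotes is hypothetical, and Pattern.fullmatch requires Python 3.4 or later.

```python
import re

alist = ["'ABC'", '(-inf-0.5]', '(4800-20800]',
         "'\\'(4.5-inf)\\''", "'\\'(2.75-3.25]\\''"]

# A quote, then one or more characters that are neither a quote nor a
# backslash, then a closing quote. Escaped quotes never satisfy this.
PLAIN_QUOTED = re.compile(r"'([^'\\]+)'")

def strip_plain_quotes(text):
    """Drop the surrounding quotes only when the whole string is plain quoted text."""
    m = PLAIN_QUOTED.fullmatch(text)
    return m.group(1) if m else text

cleaned = [strip_plain_quotes(item) for item in alist]
print(cleaned[0])              # ABC
print(cleaned[1:] == alist[1:])    # True
```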
Regular expression to replace with XML node

I'm using Python to write a regular expression for replacing parts of a string with an XML node.
The source string looks like:
Hello
REPLACE(str1) this is to replace
REPLACE(str2) this is to replace
And the result string should be like:
Hello
<replace name="str1"> this is to replace </replace>
<replace name="str2"> this is to replace </replace>
Can anyone help me?

What makes your problem a little bit tricky is that you want to match inside of a multiline string. You need to use the re.MULTILINE flag to make that work.
Then, you need to match some groups inside your source string, and use those groups in the final output. Here is code that works to solve your problem:
import re
s_pat = r"^\s*REPLACE\(([^)]+)\)(.*)$"
pat = re.compile(s_pat, re.MULTILINE)
s_input = """\
Hello
REPLACE(str1) this is to replace
REPLACE(str2) this is to replace"""
def mksub(m):
    return '<replace name="%s">%s</replace>' % m.groups()
s_output = re.sub(pat, mksub, s_input)
The only tricky part is the regular expression pattern. Let's look at it in detail.
^ matches the start of a string. With re.MULTILINE, this matches the start of a line within a multiline string; in other words, it matches right after a newline in the string.
\s* matches optional whitespace.
REPLACE matches the literal string "REPLACE".
\( matches the literal string "(".
( begins a "match group".
[^)] matches any single character except ")".
+ means "match one or more of the preceding pattern".
) closes a "match group".
\) matches the literal string ")"
(.*) is another match group containing ".*".
$ matches the end of a string. With re.MULTILINE, this also matches at the end of each line within a multiline string; in other words, it matches just before a newline character, without consuming it.
. matches any character, and * means to match zero or more of the preceding pattern. Thus .* matches anything, up to the end of the line.
So, our pattern has two "match groups". When you run re.sub() it will make a "match object" which will be passed to mksub(). The match object has a method, .groups(), that returns the matched substrings as a tuple, and that gets substituted in to make the replacement text.
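To see the two match groups and the line anchors in action, here is a small sketch that applies the pattern above to the sample input (written as a raw string; the names matches and first_end are illustrative only):

```python
import re

s_input = "Hello\nREPLACE(str1) this is to replace\nREPLACE(str2) this is to replace"
pat = re.compile(r"^\s*REPLACE\(([^)]+)\)(.*)$", re.MULTILINE)

matches = list(pat.finditer(s_input))
print([m.groups() for m in matches])
# [('str1', ' this is to replace'), ('str2', ' this is to replace')]

# '$' is zero-width: each match stops just before the newline,
# so the line breaks are never part of the matched text.
print(any("\n" in m.group() for m in matches))    # False
first_end = matches[0].end()
print(repr(s_input[first_end]))                   # '\n'
```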
EDIT: You actually don't need to use a replacement function. You can put the special string \1 inside the replacement text, and it will be replaced by the contents of match group 1. (Match groups count from 1; the special match group 0 corresponds to the entire string matched by the pattern.) The only tricky part of the \1 string is that \ is special in strings. In a normal string, to get a \, you need to put two backslashes in a row, like so: "\\1". But you can use a Python "raw string" to conveniently write the replacement pattern. Doing so you get this:
import re
s_pat = r"^\s*REPLACE\(([^)]+)\)(.*)$"
pat = re.compile(s_pat, re.MULTILINE)
s_repl = r'<replace name="\1">\2</replace>'
s_input = """\
Hello
REPLACE(str1) this is to replace
REPLACE(str2) this is to replace"""
s_output = re.sub(pat, s_repl, s_input)

Here is an excellent tutorial on how to write regular expressions in Python.

Here is a solution using pyparsing. I know you specifically asked about a regex solution, but if your requirements change, you might find it easier to expand a pyparsing parser. Or a pyparsing prototype solution might give you a little more insight into the problem leading toward a regex or other final implementation.
src = """\
Hello
REPLACE(str1) this is to replace
REPLACE(str2) this is to replace
"""
from pyparsing import Suppress, Word, alphas, alphanums, restOfLine
LPAR,RPAR = map(Suppress,"()")
ident = Word(alphas, alphanums)
replExpr = "REPLACE" + LPAR + ident("name") + RPAR + restOfLine("body")
replExpr.setParseAction(
    lambda toks : '<replace name="%(name)s">%(body)s </replace>' % toks
)
print(replExpr.transformString(src))
In this case, you create the expression to be matched with pyparsing, define a parse action to do the text conversion, and then call transformString to scan through the input source to find all the matches, apply the parse action to each match, and return the resulting output. The parse action serves a similar function to mksub in @steveha's solution.
In addition to the parse action, pyparsing also supports naming individual elements of the expression. I used "name" and "body" to label the two parts of interest, which are represented in the re solution as groups 1 and 2. You can name groups in an re as well; the corresponding pattern would look like:
s_pat = r"^\s*REPLACE\((?P<name>[^)]+)\)(?P<body>.*)$"
Unfortunately, to access these groups by name, you have to invoke the group() method on the re match object, you can't directly do the named string interpolation as in my lambda parse action. But this is Python, right? We can wrap that callable with a class that will give us dict-like access to the groups by name:
class CallableDict(object):
    def __init__(self,fn):
        self.fn = fn
    def __getitem__(self,name):
        return self.fn(name)
def mksub(m):
    return '<replace name="%(name)s">%(body)s</replace>' % CallableDict(m.group)
s_output = re.sub(pat, mksub, s_input)
Using CallableDict, the string interpolation in mksub can now call m.group for each field, by making it look like we are retrieving the ['name'] and ['body'] elements of a dict.
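As an alternative sketch, the standard library can do the same without a wrapper class: Match.groupdict() returns the named groups as a real dict, and a replacement template can refer to named groups as \g<name>. The variable names via_function and via_template are illustrative only.

```python
import re

pat = re.compile(r"^\s*REPLACE\((?P<name>[^)]+)\)(?P<body>.*)$", re.MULTILINE)
s_input = "Hello\nREPLACE(str1) this is to replace\nREPLACE(str2) this is to replace"

def mksub(m):
    # groupdict() maps each named group to the text it matched.
    return '<replace name="%(name)s">%(body)s</replace>' % m.groupdict()

via_function = pat.sub(mksub, s_input)

# The same substitution written as a template with \g<name> references.
via_template = pat.sub(r'<replace name="\g<name>">\g<body></replace>', s_input)

print(via_function == via_template)    # True
print(via_function)
```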

Maybe like this ?
import re
mystr = """Hello
REPLACE(str1) this is to replace
REPLACE(str2) this is to replace"""
prog = re.compile(r'REPLACE\((.*?)\)\s(.*)')
for line in mystr.split("\n"):
    print(prog.sub(r'<replace name="\1"> \2 </replace>', line))

Something like this should work:
import re, sys
with open(sys.argv[1], 'r') as f:
    for line in f:
        line = line.rstrip('\n')  # avoid printing a doubled newline
        g = re.match(r'REPLACE\((.*)\)(.*)', line)
        if g is None:
            print(line)
        else:
            print('<replace name="%s">%s</replace>' % (g.group(1), g.group(2)))

import re
a="""Hello
REPLACE(str1) this is to replace
REPLACE(str2) this is to replace"""
regex = re.compile(r"^REPLACE\(([^)]+)\)\s+(.*)$", re.MULTILINE)
b = re.sub(regex, r'<replace name="\1"> \2 </replace>', a)
print(b)
This does the whole replacement in a single re.sub call.
