Matching \[\] in Python regexes - python

I am trying to replace all expressions of the form
\[something\]
in a string by
\[<img src='something'>\]
Since \, [ and ] are special characters, I need to escape them (so \\, \[ and \]), thus my code would be
def repl(m):
    return "<img src='"+m.group(1)+"'>"
print re.sub("\\\[(.*?)\\\]", repl, "frfrfr\nfrrffr<p>\[something\]</p>frff\nfrfrr", re.S)
However, this returns the original string. Could someone point out my mistake?

Escape \ correctly, or use an r'raw string', as follows:
>>> print re.sub(r"\\\[(.*?)\\\]", repl, "frfrfr\nfrrffr<p>\[something\]</p>frff\nfrfrr", flags=re.S)
frfrfr
frrffr<p><img src='something'></p>frff
frfrr
>>> print re.sub("\\\\\\[(.*?)\\\\\\]", repl, "frfrfr\nfrrffr<p>\[something\]</p>frff\nfrfrr", flags=re.S)
frfrfr
frrffr<p><img src='something'></p>frff
frfrr
UPDATE
The fourth parameter of re.sub is count, not flags. To specify flags, use keyword arguments. Otherwise, re.S is recognized as count.
>>> print re.sub(r"\\\[(.*?)\\\]", repl, "frfrfr\nfrrffr<p>\[something\nblah\]</p>frff\nfrfrr", re.S)
frfrfr
frrffr<p>\[something
blah\]</p>frff
frfrr
>>> print re.sub(r"\\\[(.*?)\\\]", repl, "frfrfr\nfrrffr<p>\[something\nblah\]</p>frff\nfrfrr", flags=re.S)
frfrfr
frrffr<p><img src='something
blah'></p>frff
frfrr
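The count/flags mix-up can be reproduced in a few lines. This is only a sketch, written for Python 3; the sample text is made up for illustration, and count=int(re.S) spells out what the positional call in the question effectively does.

```python
import re

def repl(m):
    # Wrap the captured text in an <img> tag.
    return "<img src='" + m.group(1) + "'>"

# Sample text (made up): a \[ ... \] pair that spans two lines.
text = "<p>\\[some\nthing\\]</p>"
pattern = r"\\\[(.*?)\\\]"

# What the positional call amounts to: re.S (16) is used as `count`,
# so DOTALL is never switched on and `.` cannot cross the newline.
wrong = re.sub(pattern, repl, text, count=int(re.S))

# Keyword form: re.S is really a flag, so `.` also matches the newline.
right = re.sub(pattern, repl, text, flags=re.S)

print(repr(wrong))
print(repr(right))
```

Passing flags by keyword (or compiling the pattern first with re.compile(pattern, re.S)) avoids the ambiguity.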

Related

Match literal string '\$'

I'm trying to match the literal string '\$'. I'm escaping both '\' and '$' with a backslash. Why isn't it working when I escape the backslash in the pattern? If I use a dot instead, it works.
import re
print re.match('\$','\$')
print re.match('\\\$','\$')
print re.match('.\$','\$')
Output:
None
None
<_sre.SRE_Match object at 0x7fb89cef7b90>
Can someone explain what's happening internally?
You should use the re.escape() function for this:
escape(string)
Return string with all non-alphanumerics backslashed; this is useful
if you want to match an arbitrary literal string that may have regular
expression metacharacters in it.
For example:
import re
val = re.escape('\$') # val = '\\\$'
print re.match(val,'\$')
It outputs:
<_sre.SRE_Match object; span=(0, 2), match='\\$'>
This is equivalent to what @TigerhawkT3 mentioned in his answer.
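To see what re.escape() actually produces, it can help to print the pattern and check its length. A small sketch for Python 3; the variable names are just for illustration, and the subject is written as "\\$" to avoid the invalid-escape warning that newer Python versions give for '\$'.

```python
import re

literal = "\\$"                 # two characters: a backslash and a dollar sign
pattern = re.escape(literal)    # four characters: \\\$

match = re.match(pattern, literal)

print(pattern)          # what the regex engine actually receives
print(len(pattern))     # number of characters in the pattern
print(match.group(0))   # the matched text
```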
Unfortunately, you need more backslashes. You need to escape them to indicate that they're literals in the string and get them into the expression, and then further escape them to indicate that they're literals instead of regex special characters. This is why raw strings are often used for regular expressions: the backslashes don't explode.
>>> import re
>>> print re.match('\$','\$')
None
>>> print re.match('\\\$','\$')
None
>>> print re.match('.\$','\$')
<_sre.SRE_Match object at 0x01E1F800>
>>> print re.match('\\\\\$','\$')
<_sre.SRE_Match object at 0x01E1F800>
>>> print re.match(r'\\\$','\$')
<_sre.SRE_Match object at 0x01E1F800>
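The same comparison can be written as a loop that also shows how many characters each pattern really contains. This is a sketch for Python 3, not part of the original answer; every backslash is written explicitly so that no literal relies on an unrecognized escape.

```python
import re

subject = "\\$"   # backslash + dollar sign, same value as '\$'

candidates = {
    "two chars: \\$": "\\$",            # regex \$   -> a literal dollar sign
    "four chars, escaped": "\\\\\\$",   # regex \\\$ -> backslash, then dollar
    "four chars, raw": r"\\\$",         # same pattern via a raw string
}

results = {}
for label, pat in candidates.items():
    results[label] = re.match(pat, subject) is not None
    print(label, len(pat), results[label])
```

Only the four-character patterns match, because the regex engine needs \\ for the backslash and \$ for the dollar sign.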
r'string' is a raw string literal.
Try writing your regex string with the r prefix.
Here are the same regexes with and without the raw prefix:
print( re.match(r'\\\$', '\$'))
<_sre.SRE_Match object; span=(0, 2), match='\\$'>
print( re.match('\\\$', '\$'))
None
This is Python 3.
In a (non-raw) string literal, backslash is special. It means the Python interpreter should handle the following character specially. For example "\n" is a string of length 1 containing the newline character. "\$" is not a recognized escape sequence, so Python leaves the backslash in place: it is a string of two characters, a backslash and a dollar sign. "\\$" is the same two-character string, written with an explicitly escaped backslash.
In regular expressions, the backslash also means the following character is to be handled specially, but in general the special meaning is different. In a regular expression, $ matches the end of a line, and \$ matches a dollar sign, \\ matches a single backslash, and \\$ matches a backslash at the end of a line.
So, when you do re.match('\$',s) the Python interpreter reads '\$' to construct the string object \$ (length 2), then passes that string object to re.match, where the regex engine reads it as "match a literal dollar sign". With re.match('\\$',s) Python makes the very same string object \$ (length 2) and passes it to re.match. With re.match('\\\$',s) the string object is \\$ (length 3), which the regex engine reads as "a backslash at the end of a line".
To see what's actually being passed to re.match, just print it. For example:
pat = '\\$'
print "pat :" + pat + ":"
m = re.match(pat, s)
People usually use raw string literals to avoid the double-meaning of backslashes.
pat = r'\$' # same 2-character string as above
Thanks for the above answers. I am adding this answer because we don't have a short summary in the above answers.
The backslash \ needs to be escaped at two levels: in the Python string literal and in the regex engine.
A Python string literal translates 2 \\ to 1 \, and the regex engine requires 2 \\ to match 1 \.
So to provide the regex engine with 2 \\ in order to match 1 \, we have to write 4 \\\\ in the Python string literal.
\\\\ --> Python(string translation) ---> \\ ---> Regex Engine(translation) ---> \
The version with . works because . matches any character except a newline, including the backslash.

python regex substituting expression using a variable

What I am trying to achieve is to substitute a string using python regex with a variable (contents of the variable). Since I need to retain some of the matched expression, I use the \1 and \3 group match args.
My regex/sub looks like this:
pattern = "\1" + id + "\3" \b
out = re.sub(r'(;11=)(\w+)(;)',r'%s' % pattern, line)
What appears to be happening is \1 and \3 do not get added to the output.
I have also tried this with the substitution expression:
r'\1%s\3'%orderid
But I got similar results.
Any suggestion on what might fix this?
You need to use raw strings or double the backslashes:
pattern = r"\1" + id + r"\3"
or
pattern = "\\1" + id + "\\3"
In a regular Python string literal, \number is interpreted as an octal character code instead:
>>> '\1'
'\x01'
while the backslash has no special meaning in a raw string literal:
>>> r'\1'
'\\1'
Raw string literals are just a notation, not a type. Both r'' and '' produce strings, and only differ in how they interpret backslashes in source code.
Note that since group 1 and group 3 match literal text, you don't need to use substitutions at all; simply use:
out = re.sub(r';11=\w+;', ';11=%s;' % id, line)
or use look-behind and lookahead and forgo having to repeat the literals:
out = re.sub(r'(?<=;11=)\w+(?=;)', id, line)
Demo:
>>> import re
>>> line = 'foobar;11=spam;hameggs'
>>> id = 'monty'
>>> re.sub(r';11=\w+;', ';11=%s;' % id, line)
'foobar;11=monty;hameggs'
>>> re.sub(r'(?<=;11=)\w+(?=;)', id, line)
'foobar;11=monty;hameggs'
This isn't going to work:
pattern = "\1" + id + "\3"
# ...
r'%s' % pattern
The r prefix only affects how the literal is interpreted. So, r'%s' means that the % and s will be interpreted raw, but that's the same way they'd be interpreted without the r. Meanwhile, the pattern has non-raw literals "\1" and "\3", so it's already a control-A and a control-C before you even get to the %.
What you want is:
pattern = r"\1" + id + r"\3"
# ...
'%s' % pattern
However, you really don't need the % formatting at all; just use pattern itself and you'll get the exact same thing.
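One more case worth guarding against: if the variable's value starts with a digit, r"\1" + value runs the group number and the value together. A hedged sketch for Python 3 of two ways around that; new_id and its value are hypothetical (the name id is avoided because it shadows a builtin).

```python
import re

line = "foobar;11=spam;hameggs"
new_id = "42"   # hypothetical value that starts with a digit

# \g<1> and \g<3> keep the group references separate from the digits
# that follow, which a plain \1 cannot do.
template = r"\g<1>" + new_id + r"\g<3>"
out_template = re.sub(r"(;11=)(\w+)(;)", template, line)

# A callable replacement is inserted as is, so nothing in new_id is
# ever interpreted as a group reference or escape.
out_callable = re.sub(
    r"(;11=)(\w+)(;)",
    lambda m: m.group(1) + new_id + m.group(3),
    line,
)

print(out_template)
print(out_callable)
```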

Python unescaping string in regex replacements

The output of the code below:
rpl = 'This is a nicely escaped newline \\n'
my_string = 'I hope this apple is replaced with a nicely escaped string'
reg = re.compile('apple')
reg.sub( rpl, my_string )
..is:
'I hope this This is a nicely escaped newline \n is replaced with a nicely escaped string'
..so when printed:
I hope this This is a nicely escaped newline
is replaced with a nicely escaped string
So python is unescaping the string when it replaces 'apple' in the other string? For now I've just done
reg.sub( rpl.replace('\\','\\\\'), my_string )
Is this safe? Is there a way to stop Python from doing that?
From help(re.sub) [emphasis mine]:
sub(pattern, repl, string, count=0, flags=0)
Return the string obtained by replacing the leftmost
non-overlapping occurrences of the pattern in string by the
replacement repl. repl can be either a string or a callable;
if a string, backslash escapes in it are processed. If it is
a callable, it's passed the match object and must return
a replacement string to be used.
One way to get around this is to pass a lambda:
>>> reg.sub(rpl, my_string )
'I hope this This is a nicely escaped newline \n is replaced with a nicely escaped string'
>>> reg.sub(lambda x: rpl, my_string )
'I hope this This is a nicely escaped newline \\n is replaced with a nicely escaped string'
All regex patterns used for Python's re module are unescaped, including both search and replacement patterns. This is why the r modifier is generally used with regex patterns in Python, as it reduces the amount of "backwhacking" necessary to write usable patterns.
The r modifier appears before a string constant and basically makes all \ characters (except those before string delimiters) verbatim. So, r'\\' == '\\\\', and r'\n' == '\\n'.
Writing your example as
rpl = r'This is a nicely escaped newline \\n'
my_string = 'I hope this apple is replaced with a nicely escaped string'
reg = re.compile(r'apple')
reg.sub( rpl, my_string )
works as expected.
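The three behaviours discussed here can be put side by side. This is a sketch for Python 3 with a shortened, made-up replacement string; as far as I know the backslash is the only character treated specially in a replacement string, so doubling it should be enough, but the callable form avoids the question entirely.

```python
import re

rpl = "newline marker \\n"   # ends with backslash + n, not a real newline
s = "an apple a day"

# String replacement: re.sub processes the \n escape into a real newline.
processed = re.sub("apple", rpl, s)

# Callable replacement: the returned string is inserted unchanged.
verbatim = re.sub("apple", lambda m: rpl, s)

# Doubling the backslashes first, as in the question.
doubled = re.sub("apple", rpl.replace("\\", "\\\\"), s)

print(repr(processed))
print(repr(verbatim))
print(repr(doubled))
```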

Using regex to get passage between two strings in Python

I want to parse all of the functions inside of a .txt file. It looks like this:
def
test
end
def
hello
end
def
world
end
So, I would get the following returned: [test, hello, world]
Here is what I have tried, but I do not get anything back:
r = re.findall('def(.*?)end', doc)
print r
You have to use the re.DOTALL flag which will allow . to match newlines too (since your doc is multi-line).
You could additionally use '^def' and '^end' in the regex if you only wanted the outer def/end blocks (ie ignore indented ones), in which case you would also need to use the re.MULTILINE flag, which allows '^' and '$' to match start/end of line (as opposed to start/end of string).
re.findall('^def(.*?)^end',doc,re.DOTALL|re.MULTILINE)
r = re.findall('def(.*?)end', doc, re.S)
You need to enable the re.MULTILINE flag so that ^ and $ match at the start and end of each line within a single regular expression.
Also, ^ and $ do NOT match the linefeeds (\n) themselves, so those have to be matched explicitly.
>>> re.findall(r"^def$\n(.*)\n^end$", doc, re.MULTILINE)
[' test', ' hello', ' world']
If you don't want to match the whitespace at the beginning of the blocks, add \W* (which skips any leading non-word characters, including whitespace):
>>> re.findall(r"^def$\n\W*(.*)\n^end$", doc, re.MULTILINE)
['test', 'hello', 'world']
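Combining the suggestions above, here is a sketch for Python 3 that uses both flags and strips the surrounding whitespace afterwards. The contents of doc are an assumption: the sample file with each body indented by one space.

```python
import re

# Assumed contents of the .txt file from the question.
doc = "def\n test\nend\ndef\n hello\nend\ndef\n world\nend\n"

# DOTALL lets . cross newlines; MULTILINE anchors ^ and $ at each line.
blocks = re.findall(r"^def$(.*?)^end$", doc, flags=re.DOTALL | re.MULTILINE)

# Each captured block still carries its newlines and indentation.
names = [block.strip() for block in blocks]

print(names)
```

Stripping in Python rather than in the pattern keeps the regex short and also copes with bodies that span several lines.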

Regular expression to replace with XML node

I'm using Python to write a regular expression for replacing parts of the string with a XML node.
The source string looks like:
Hello
REPLACE(str1) this is to replace
REPLACE(str2) this is to replace
And the result string should be like:
Hello
<replace name="str1"> this is to replace </replace>
<replace name="str2"> this is to replace </replace>
Can anyone help me?
What makes your problem a little bit tricky is that you want to match inside of a multiline string. You need to use the re.MULTILINE flag to make that work.
Then, you need to match some groups inside your source string, and use those groups in the final output. Here is code that works to solve your problem:
import re
s_pat = r"^\s*REPLACE\(([^)]+)\)(.*)$"
pat = re.compile(s_pat, re.MULTILINE)
s_input = """\
Hello
REPLACE(str1) this is to replace
REPLACE(str2) this is to replace"""
def mksub(m):
    return '<replace name="%s">%s</replace>' % m.groups()
s_output = re.sub(pat, mksub, s_input)
The only tricky part is the regular expression pattern. Let's look at it in detail.
^ matches the start of a string. With re.MULTILINE, this matches the start of a line within a multiline string; in other words, it matches right after a newline in the string.
\s* matches optional whitespace.
REPLACE matches the literal string "REPLACE".
\( matches the literal string "(".
( begins a "match group".
[^)] means "match any character but a ')'".
+ means "match one or more of the preceding pattern".
) closes a "match group".
\) matches the literal string ")"
(.*) is another match group containing ".*".
$ matches the end of a string. With re.MULTILINE, this also matches the end of a line within a multiline string; in other words, it matches just before a newline character in the string, without consuming it.
. matches any character, and * means to match zero or more of the preceding pattern. Thus .* matches anything, up to the end of the line.
So, our pattern has two "match groups". When you run re.sub() it will make a "match object" which will be passed to mksub(). The match object has a method, .groups(), that returns the matched substrings as a tuple, and that gets substituted in to make the replacement text.
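The breakdown above can also be kept next to the pattern itself by using re.VERBOSE, which ignores whitespace and allows comments inside the pattern. This is a sketch for Python 3, not the answer's original code; the sample input is shortened.

```python
import re

pat = re.compile(r"""
    ^\s*            # start of a line, then optional whitespace
    REPLACE         # the literal word REPLACE
    \( ([^)]+) \)   # group 1: the text between the parentheses
    (.*)            # group 2: the rest of the line
    $               # end of the line
""", re.MULTILINE | re.VERBOSE)

s_input = "Hello\nREPLACE(str1) this is to replace"
groups = pat.findall(s_input)

print(groups)
```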
EDIT: You actually don't need to use a replacement function. You can put the special string \1 inside the replacement text, and it will be replaced by the contents of match group 1. (Match groups count from 1; the special match group 0 corresponds to the entire string matched by the pattern.) The only tricky part of the \1 string is that \ is special in strings. In a normal string, to get a \, you need to put two backslashes in a row, like so: "\\1". But you can use a Python "raw string" to conveniently write the replacement pattern. Doing so you get this:
import re
s_pat = r"^\s*REPLACE\(([^)]+)\)(.*)$"
pat = re.compile(s_pat, re.MULTILINE)
s_repl = r'<replace name="\1">\2</replace>'
s_input = """\
Hello
REPLACE(str1) this is to replace
REPLACE(str2) this is to replace"""
s_output = re.sub(pat, s_repl, s_input)
Here is an excellent tutorial on how to write regular expressions in Python.
Here is a solution using pyparsing. I know you specifically asked about a regex solution, but if your requirements change, you might find it easier to expand a pyparsing parser. Or a pyparsing prototype solution might give you a little more insight into the problem leading toward a regex or other final implementation.
src = """\
Hello
REPLACE(str1) this is to replace
REPLACE(str2) this is to replace
"""
from pyparsing import Suppress, Word, alphas, alphanums, restOfLine
LPAR,RPAR = map(Suppress,"()")
ident = Word(alphas, alphanums)
replExpr = "REPLACE" + LPAR + ident("name") + RPAR + restOfLine("body")
replExpr.setParseAction(
    lambda toks : '<replace name="%(name)s">%(body)s </replace>' % toks
)
print replExpr.transformString(src)
In this case, you create the expression to be matched with pyparsing, define a parse action to do the text conversion, and then call transformString to scan through the input source to find all the matches, apply the parse action to each match, and return the resulting output. The parse action serves a similar function to mksub in @steveha's solution.
In addition to the parse action, pyparsing also supports naming individual elements of the expression - I used "name" and "body" to label the two parts of interest, which are represented in the re solution as groups 1 and 2. You can name groups in an re, the corresponding re would look like:
s_pat = r"^\s*REPLACE\((?P<name>[^)]+)\)(?P<body>.*)$"
Unfortunately, to access these groups by name, you have to invoke the group() method on the re match object, you can't directly do the named string interpolation as in my lambda parse action. But this is Python, right? We can wrap that callable with a class that will give us dict-like access to the groups by name:
class CallableDict(object):
    def __init__(self,fn):
        self.fn = fn
    def __getitem__(self,name):
        return self.fn(name)
def mksub(m):
    return '<replace name="%(name)s">%(body)s</replace>' % CallableDict(m.group)
s_output = re.sub(pat, mksub, s_input)
Using CallableDict, the string interpolation in mksub can now call m.group for each field, by making it look like we are retrieving the ['name'] and ['body'] elements of a dict.
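For what it's worth, the re module seems to cover this case without a wrapper class: a match object has a groupdict() method, and a replacement string can refer to named groups with \g<name>. A sketch for Python 3, reusing the named-group pattern from above; via_dict and via_template are names made up for this example.

```python
import re

pat = re.compile(
    r"^\s*REPLACE\((?P<name>[^)]+)\)(?P<body>.*)$", re.MULTILINE
)
s_input = (
    "Hello\n"
    "REPLACE(str1) this is to replace\n"
    "REPLACE(str2) this is to replace"
)

def mksub(m):
    # groupdict() returns the named groups as an ordinary dict.
    return '<replace name="%(name)s">%(body)s</replace>' % m.groupdict()

via_dict = pat.sub(mksub, s_input)

# The same substitution as a plain replacement string.
via_template = pat.sub(r'<replace name="\g<name>">\g<body></replace>', s_input)

print(via_dict)
```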
Maybe like this?
import re
mystr = """Hello
REPLACE(str1) this is to replace
REPLACE(str2) this is to replace"""
prog = re.compile(r'REPLACE\((.*?)\)\s(.*)')
for line in mystr.split("\n"):
    print prog.sub(r'< replace name="\1" > \2',line)
Something like this should work:
import re,sys
f = open( sys.argv[1], 'r' )
for i in f:
    g = re.match( r'REPLACE\((.*)\)(.*)', i )
    if g is None:
        print i
    else:
        print '<replace name=\"%s\">%s</replace>' % (g.group(1),g.group(2))
f.close()
import re
a="""Hello
REPLACE(str1) this is to replace
REPLACE(str2) this is to replace"""
regex = re.compile(r"^REPLACE\(([^)]+)\)\s+(.*)$", re.MULTILINE)
b=re.sub(regex, r'< replace name="\1" > \2 < /replace >', a)
print b
will do the replace in one line.
