Python replace multiple strings while supporting backreferences - python

There are some nice ways to handle simultaneous multi-string replacement in Python. However, I am having trouble creating an efficient function that can do that while also supporting backreferences.
What I would like is to use a dictionary of expression/replacement pairs, where the replacement may contain backreferences to something matched by the expression.
e.g. (note the \1)
repdict = {'&&':'and', '||':'or', '!([a-zA-Z_])':r'not \1'}
I put the SO answer mentioned at the outset into the function below, which works fine for expression / replacement pairs that don't contain backreferences:
import re
def replaceAll(repdict, text):
    repdict = dict((re.escape(k), v) for k, v in repdict.items())
    pattern = re.compile("|".join(repdict.keys()))
    return pattern.sub(lambda m: repdict[re.escape(m.group(0))], text)
However, it doesn't work for the key that contains a backreference:
>>> replaceAll(repdict, "!newData.exists() || newData.val().length == 1")
'!newData.exists() or newData.val().length == 1'
If I do it manually, it works fine, e.g.:
pattern = re.compile("!([a-zA-Z_])")
pattern.sub(r'not \1', '!newData.exists()')
Works as expected:
'not newData.exists()'
In the fancy function, the escaping seems to be messing up the key that uses the backref, so it never matches anything.
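The effect of the escaping can be checked directly. This is a minimal sketch using the same key as above (the variable names are only for illustration): after re.escape, the key becomes a pattern for the literal text !([a-zA-Z_]), so it no longer matches !newData.

```python
import re

key = r'!([a-zA-Z_])'         # regex key from the dictionary above
escaped_key = re.escape(key)  # every metacharacter is now matched literally

# The unescaped key matches and captures the first letter after '!'.
raw_match = re.search(key, '!newData.exists()')

# The escaped key finds nothing in the same text ...
escaped_match = re.search(escaped_key, '!newData.exists()')

# ... because it only matches the literal characters of the key itself.
literal_match = re.search(escaped_key, 'x = !([a-zA-Z_])')

print(raw_match.group(1))         # n
print(escaped_match)              # None
print(literal_match is not None)  # True
```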
I eventually came up with this. However, note that the problem of supporting backrefs in the input parameters is not solved; I'm just handling it manually in the replacer function:
def replaceAll(repPat, text):
    def replacer(obj):
        match = obj.group(0)
        # manually deal with exclamation mark match..
        if match[:1] == "!": return 'not ' + match[1:]
        # here we naively escape the matched pattern into
        # the format of our dictionary key
        else: return repPat[naive_escaper(match)]
    pattern = re.compile("|".join(repPat.keys()))
    return pattern.sub(replacer, text)
def naive_escaper(string):
    if '=' in string: return string.replace('=', r'\=')
    elif '|' in string: return string.replace('|', r'\|')
    else: return string
# manually escaping | and = works fine
repPat = {'!([a-zA-Z_])':'', '&&':'and', r'\|\|':'or', r'\=\=\=':'=='}
replaceAll(repPat, "(!this && !that) || !this && foo === bar")
Returns:
'(not this and not that) or not this and foo == bar'
So if anyone has an idea how to make a multi-string replacement function that supports backreferences and accepts the replacement terms as input, I'd appreciate your feedback very much.

Update: See Angus Hollands' answer for a better alternative.
I couldn't think of an easier way to do it than to stick with the original idea of combining all dict keys into one massive regex.
However, there are some difficulties. Let's assume a repldict like this:
repldict = {r'(a)': r'\1a', r'(b)': r'\1b'}
If we combine these to a single regex, we get (a)|(b) - so now (b) is no longer group 1, which means its backreference won't work correctly.
Another problem is that we can't tell which replacement to use. If the regex matches the text b, how can we find out that \1b is the appropriate replacement? It's not possible; we don't have enough information.
The solution to these problems is to enclose every dict key in a named group like so:
(?P<group1>(a))|(?P<group2>(b))
Now we can easily identify the key that matched, and recalculate the backreferences to make them relative to this group, so that \1b refers to "the first group after group2".
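To make the group numbering concrete, here is a small sketch of what the combined pattern looks like for the two-key example above. The names combined, matched_name and wrapper_number are only illustrative; Pattern.groupindex is the standard mapping from group names to group numbers.

```python
import re

keys = [r'(a)', r'(b)']

# Wrap each key in a named group "_<idx>" and join them into one pattern:
# (?P<_0>(a))|(?P<_1>(b))
combined = re.compile(
    '|'.join('(?P<_{}>{})'.format(i, k) for i, k in enumerate(keys)))

m = combined.search('b')

# Only the wrapper group of the key that matched has a value.
matched_name = next(name for name, value in m.groupdict().items()
                    if value is not None)

# Absolute number of that wrapper group inside the combined pattern.
wrapper_number = combined.groupindex[matched_name]

# The key's own group 1 is the group directly after its wrapper.
first_inner = m.group(wrapper_number + 1)

print(matched_name, wrapper_number, first_inner)  # _1 3 b
```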
Here's the implementation:
import re
def replaceAll(repldict, text):
    # split the dict into two lists because we need the order to be reliable
    keys, repls = zip(*repldict.items())
    # generate a regex pattern from the keys, putting each key in a named group
    # so that we can find out which one of them matched.
    # groups are named "_<idx>" where <idx> is the index of the corresponding
    # replacement text in the list above
    pattern = '|'.join('(?P<_{}>{})'.format(i, k) for i, k in enumerate(keys))
    def repl(match):
        # find out which key matched. We know that exactly one of the keys has
        # matched, so it's the only named group with a value other than None.
        group_name = next(name for name, value in match.groupdict().items()
                          if value is not None)
        group_index = int(group_name[1:])
        # now that we know which group matched, we can retrieve the
        # corresponding replacement text
        repl_text = repls[group_index]
        # absolute number of the named wrapper group in the combined pattern
        # (this differs from group_index as soon as earlier keys contain groups)
        group_number = match.re.groupindex[group_name]
        # now we'll manually search for backreferences in the
        # replacement text and substitute them
        def repl_backreference(m):
            reference_index = int(m.group(1))
            # return the corresponding group's value from the original match;
            # the key's own groups are numbered right after its wrapper group
            return match.group(group_number + reference_index)
        return re.sub(r'\\(\d+)', repl_backreference, repl_text)
    return re.sub(pattern, repl, text)
Tests:
repldict = {'&&':'and', r'\|\|':'or', r'!([a-zA-Z_])':r'not \1'}
print( replaceAll(repldict, "!newData.exists() || newData.val().length == 1") )
repldict = {'!([a-zA-Z_])':r'not \1', '&&':'and', r'\|\|':'or', r'\=\=\=':'=='}
print( replaceAll(repldict, "(!this && !that) || !this && foo === bar") )
# output: not newData.exists() or newData.val().length == 1
# (not this and not that) or not this and foo == bar
Caveats:
Only numerical backreferences are supported; no named references.
Silently accepts invalid backreferences like {r'(a)': r'\2'}. (These will sometimes throw an error, but not always.)

A similar solution to Rawing's, except that the expensive work is precomputed by rewriting the group indices in the backreferences ahead of time. It also uses unnamed groups.
Here we silently wrap each case in a capture group, and then update any replacements with backreferences to correctly identify the appropriate subgroup by absolute position. Note that when using a replacer function, backreferences are not expanded by default (you need to call match.expand).
import re
from collections import OrderedDict
from functools import partial
pattern_to_replacement = {'&&': 'and', '!([a-zA-Z_]+)': r'not \1'}
def build_replacer(cases):
    ordered_cases = OrderedDict(cases.items())
    replacements = {}
    leading_groups = 0
    for pattern, replacement in ordered_cases.items():
        leading_groups += 1
        # leading_groups is now the absolute position of the root group (back-references should be relative to this)
        group_index = leading_groups
        replacement = absolute_backreference(replacement, group_index)
        replacements[group_index] = replacement
        # This pattern contains N subgroups (determine by compiling pattern)
        subgroups = re.compile(pattern).groups
        leading_groups += subgroups
    catch_all = "|".join("({})".format(p) for p in ordered_cases)
    pattern = re.compile(catch_all)
    def replacer(match):
        replacement_pattern = replacements[match.lastindex]
        return match.expand(replacement_pattern)
    return partial(pattern.sub, replacer)
def absolute_backreference(text, n):
    # match numeric backreferences such as \1 or \12
    ref_pat = re.compile(r"\\(\d+)")
    def replacer(match):
        return "\\{}".format(int(match.group(1)) + n)
    return ref_pat.sub(replacer, text)
replacer = build_replacer(pattern_to_replacement)
print(replacer("!this.exists()"))  # not this.exists()
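The remark about match.expand can be illustrated in isolation. In this minimal sketch (variable names are illustrative), the string returned by a replacement function is inserted verbatim, so a backreference in it is only resolved when it is passed through Match.expand.

```python
import re

pattern = re.compile(r'!([a-zA-Z_]+)')

# The return value of a replacement function is inserted as-is,
# so the backreference ends up in the output as literal text.
unexpanded = pattern.sub(lambda m: r'not \1', '!this')

# Match.expand() applies the template to the match, just as a
# replacement string passed directly to sub() would be processed.
expanded = pattern.sub(lambda m: m.expand(r'not \1'), '!this')

print(unexpanded)  # not \1
print(expanded)    # not this
```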

Simple is better than complex; the code below is more readable. (The reason your code does not work as expected is that ([a-zA-Z_]) should not be passed through re.escape.)
import re
repdict = {
    r'\s*' + re.escape('&&') + r'\s*': ' and ',
    r'\s*' + re.escape('||') + r'\s*': ' or ',
    re.escape('!') + r'([a-zA-Z_])': r'not \1',
}
def replaceAll(repdict, text):
    for k, v in repdict.items():
        text = re.sub(k, v, text)
    return text

Related

Substitute specific matches using regex

I want to execute substitutions using regex, not for all matches but only for specific ones. However, re.sub substitutes for all matches. How can I do this?
Here is an example.
Say, I have a string with the following content:
FOO=foo1
BAR=bar1
FOO=foo2
BAR=bar2
BAR=bar3
What I want to do is this:
re.sub(r'^BAR', '#BAR', s, index=[1,2], flags=re.MULTILINE)
to get the below result.
FOO=foo1
BAR=bar1
FOO=foo2
#BAR=bar2
#BAR=bar3
You could pass a replacement function to re.sub that keeps track of the match count and checks whether the match at the current index should be substituted:
import re
s = '''FOO=foo1
BAR=bar1
FOO=foo2
BAR=bar2
BAR=bar3'''
i = 0
index = {1, 2}
def repl(x):
    global i
    if i in index:
        res = '#' + x.group(0)
    else:
        res = x.group(0)
    i += 1
    return res
print(re.sub(r'^BAR', repl, s, flags=re.MULTILINE))
Output:
FOO=foo1
BAR=bar1
FOO=foo2
#BAR=bar2
#BAR=bar3
You could
Split your string using s.splitlines()
Iterate over the individual lines in a for loop
Track how many matches you have found so far
Only perform substitutions on those matches in the numerical ranges you want (e.g. matches 1 and 2)
And then join them back into a single string (if need be).
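The steps above can be sketched as follows. The function name comment_selected and its parameters are hypothetical, chosen only for this illustration; match indices are counted from 0, as in the question.

```python
import re

def comment_selected(text, pattern, indices, prefix='#'):
    """Prefix only those matching lines whose match index is in `indices`."""
    wanted = set(indices)
    seen = 0                      # number of matching lines found so far
    out = []
    for line in text.splitlines():
        if re.match(pattern, line):
            if seen in wanted:
                line = prefix + line
            seen += 1
        out.append(line)
    return '\n'.join(out)

s = '''FOO=foo1
BAR=bar1
FOO=foo2
BAR=bar2
BAR=bar3'''

result = comment_selected(s, r'BAR', [1, 2])
print(result)
```

Unlike the version with a global counter, this keeps the counter local, so the function can be called repeatedly without resetting any state.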

Python - Efficiently replace characters within text file with ASCII characters [duplicate]

I can use this code below to create a new file with the substitution of a with aa using regular expressions.
import re
with open("notes.txt") as text:
    new_text = re.sub("a", "aa", text.read())
with open("notes2.txt", "w") as result:
    result.write(new_text)
Do I have to repeat this line, new_text = re.sub("a", "aa", text.read()), once for every letter I want to change, substituting a different letter each time?
That is, a --> aa, b --> bb and c --> cc.
Do I have to write that line for every letter I want to change, or is there an easier way, perhaps a "dictionary" of translations? Should I put those letters into an array? I'm not sure how to look them up if I do.
The answer proposed by @nhahtdh is valid, but I would argue it is less Pythonic than the canonical example, which uses code less opaque than his regex manipulations and takes advantage of Python's built-in data structures and anonymous functions.
A dictionary of translations makes sense in this context. In fact, that's how the Python Cookbook does it, as shown in this example (copied from ActiveState http://code.activestate.com/recipes/81330-single-pass-multiple-replace/ )
import re
def multiple_replace(dict, text):
    # Create a regular expression from the dictionary keys
    regex = re.compile("(%s)" % "|".join(map(re.escape, dict.keys())))
    # For each match, look-up corresponding value in dictionary
    return regex.sub(lambda mo: dict[mo.string[mo.start():mo.end()]], text)
if __name__ == "__main__":
    text = "Larry Wall is the creator of Perl"
    dict = {
        "Larry Wall" : "Guido van Rossum",
        "creator" : "Benevolent Dictator for Life",
        "Perl" : "Python",
    }
    print(multiple_replace(dict, text))
So in your case, you could make a dict trans = {"a": "aa", "b": "bb"} and then pass it into multiple_replace along with the text you want translated. Basically all that function is doing is creating one huge regex containing all of your regexes to translate, then when one is found, passing a lambda function to regex.sub to perform the translation dictionary lookup.
You could use this function while reading from your file, for example:
with open("notes.txt") as text:
    new_text = multiple_replace(trans, text.read())
with open("notes2.txt", "w") as result:
    result.write(new_text)
I've actually used this exact method in production, in a case where I needed to translate the months of the year from Czech into English for a web scraping task.
As @nhahtdh pointed out, one downside to this approach is that it is not prefix-free: if one dictionary key is a prefix of another, the alternation picks whichever of the two appears first in the pattern, so the longer key may never be matched.
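One common workaround for the prefix problem, which is not part of the Cookbook recipe, is to put longer keys first in the alternation. The sketch below assumes plain-string keys; the function name multiple_replace_longest_first is hypothetical.

```python
import re

def multiple_replace_longest_first(mapping, text):
    # Python's alternation tries alternatives left to right, so sorting the
    # keys by length (longest first) lets "ab" win over its prefix "a".
    keys = sorted(mapping, key=len, reverse=True)
    regex = re.compile("|".join(map(re.escape, keys)))
    return regex.sub(lambda mo: mapping[mo.group(0)], text)

trans = {"a": "1", "ab": "2"}
print(multiple_replace_longest_first(trans, "ab a"))  # 2 1
```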
You can use capturing group and backreference:
re.sub(r"([characters])", r"\1\1", text.read())
Put characters that you want to double up in between []. For the case of lower case a, b, c:
re.sub(r"([abc])", r"\1\1", text.read())
In the replacement string, you can refer to whatever matched by a capturing group () with \n notation where n is some positive integer (0 excluded). \1 refers to the first capturing group. There is another notation \g<n> where n can be any non-negative integer (0 allowed); \g<0> will refer to the whole text matched by the expression.
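A short sketch of the \g<n> notation described above (variable names are illustrative). It is mainly useful when the reference is followed by a digit, since \10 would be read as a reference to group 10, and for referring to the whole match with \g<0>.

```python
import re

# \g<1> behaves like \1 here.
doubled = re.sub(r"([abc])", r"\g<1>\g<1>", "abcd")

# \g<0> refers to the whole match, so no capturing group is needed.
bracketed = re.sub(r"[abc]", r"[\g<0>]", "abcd")

# \g<1>0 means "group 1 followed by a literal 0".
followed_by_zero = re.sub(r"([abc])", r"\g<1>0", "abcd")

print(doubled)           # aabbccd
print(bracketed)         # [a][b][c]d
print(followed_by_zero)  # a0b0c0d
```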
If you want to double up all characters except new line:
re.sub(r"(.)", r"\1\1", text.read())
If you want to double up all characters (new line included):
re.sub(r"(.)", r"\1\1", text.read(), 0, re.S)
You can use the pandas library and its replace function. Here is one example with five replacements:
import pandas as pd
df = pd.DataFrame({'text': ['Billy is going to visit Rome in November', 'I was born in 10/10/2010', 'I will be there at 20:00']})
to_replace=['Billy','Rome','January|February|March|April|May|June|July|August|September|October|November|December', r'\d{2}:\d{2}', r'\d{2}/\d{2}/\d{4}']
replace_with=['name','city','month','time', 'date']
print(df.text.replace(to_replace, replace_with, regex=True))
And the modified text is:
0 name is going to visit city in month
1 I was born in date
2 I will be there at time
None of the other solutions work if your patterns are themselves regexes.
For that, you need:
def multi_sub(pairs, s):
    def repl_func(m):
        # only one group will be present, use the corresponding match
        return next(
            repl
            for (patt, repl), group in zip(pairs, m.groups())
            if group is not None
        )
    pattern = '|'.join("({})".format(patt) for patt, _ in pairs)
    return re.sub(pattern, repl_func, s)
Which can be used as:
>>> multi_sub([
... ('a+b', 'Ab'),
... ('b', 'B'),
... ('a+', 'A.'),
... ], "aabbaa") # matches as (aab)(b)(aa)
'AbBA.'
Note that this solution does not allow you to put capturing groups in your regexes, or use them in replacements.
Using tips from how to make a 'stringy' class, we can make an object identical to a string but for an extra sub method:
import re
class Substitutable(str):
    def __new__(cls, *args, **kwargs):
        newobj = str.__new__(cls, *args, **kwargs)
        newobj.sub = lambda fro,to: Substitutable(re.sub(fro, to, newobj))
        return newobj
This allows you to use the builder pattern, which looks nicer but works only for a predetermined number of substitutions. If you use it in a loop, there is no longer any point in creating an extra class. E.g.
>>> h = Substitutable('horse')
>>> h
'horse'
>>> h.sub('h', 'f')
'forse'
>>> h.sub('h', 'f').sub('f','h')
'horse'
I found I had to modify Emmett J. Butler's code by changing the lambda function to use myDict.get(mo.group(1),mo.group(1)). The original code wasn't working for me; using myDict.get() also provides the benefit of a default value if a key is not found.
OIDNameContraction = {
'Fucntion':'Func',
'operated':'Operated',
'Asist':'Assist',
'Detection':'Det',
'Control':'Ctrl',
'Function':'Func'
}
replacementDictRegex = re.compile("(%s)" % "|".join(map(re.escape, OIDNameContraction.keys())))
oidDescriptionStr = replacementDictRegex.sub(lambda mo:OIDNameContraction.get(mo.group(1),mo.group(1)), oidDescriptionStr)
If you are dealing with files, here is some simple Python code for this problem.
import re
def multiple_replace(dictionary, text):
    # Create a regular expression from the dictionary keys
    regex = re.compile("(%s)" % "|".join(map(re.escape, dictionary.keys())))
    # For each match, look up the corresponding value in the dictionary
    String = lambda mo: dictionary[mo.string[mo.start():mo.end()]]
    return regex.sub(String , text)
if __name__ == "__main__":
    dictionary = {
        "Wiley Online Library" : "Wiley",
        "Chemical Society Reviews" : "Chem. Soc. Rev.",
    }
    with open ('LightBib.bib', 'r') as Bib_read:
        with open ('Abbreviated.bib', 'w') as Bib_write:
            read_lines = Bib_read.readlines()
            for rows in read_lines:
                #print(rows)
                text = rows
                new_text = multiple_replace(dictionary, text)
                #print(new_text)
                Bib_write.write(new_text)
Based on Eric's great answer, I came up with a more general solution that is capable of handling capturing groups and backreferences:
import re
from itertools import islice
def multiple_replace(s, repl_dict):
    groups_no = [re.compile(pattern).groups for pattern in repl_dict]
    def repl_func(m):
        all_groups = m.groups()
        # Use 'i' as the index within 'all_groups' and 'j' as the main
        # group index.
        i, j = 0, 0
        while i < len(all_groups) and all_groups[i] is None:
            # Skip the inner groups and move on to the next group.
            i += (groups_no[j] + 1)
            # Advance the main group index.
            j += 1
        # Extract the pattern and replacement at the j-th position.
        pattern, repl = next(islice(repl_dict.items(), j, j + 1))
        return re.sub(pattern, repl, all_groups[i])
    # Create the full pattern using the keys of 'repl_dict'.
    full_pattern = '|'.join(f'({pattern})' for pattern in repl_dict)
    return re.sub(full_pattern, repl_func, s)
Example. Calling the above with
s = 'This is a sample string. Which is getting replaced. 1234-5678.'
REPL_DICT = {
r'(.*?)is(.*?)ing(.*?)ch': r'\3-\2-\1',
r'replaced': 'REPLACED',
r'\d\d((\d)(\d)-(\d)(\d))\d\d': r'__\5\4__\3\2__',
r'get|ing': '!##'
}
gives:
>>> multiple_replace(s, REPL_DICT)
'. Whi- is a sample str-Th is !##t!## REPLACED. __65__43__.'
For a more efficient solution, one can create a simple wrapper to precompute groups_no and full_pattern, e.g.
import re
from itertools import islice
class ReplWrapper:
    def __init__(self, repl_dict):
        self.repl_dict = repl_dict
        self.groups_no = [re.compile(pattern).groups for pattern in repl_dict]
        self.full_pattern = '|'.join(f'({pattern})' for pattern in repl_dict)
    def get_pattern_repl(self, pos):
        return next(islice(self.repl_dict.items(), pos, pos + 1))
    def multiple_replace(self, s):
        def repl_func(m):
            all_groups = m.groups()
            # Use 'i' as the index within 'all_groups' and 'j' as the main
            # group index.
            i, j = 0, 0
            while i < len(all_groups) and all_groups[i] is None:
                # Skip the inner groups and move on to the next group.
                i += (self.groups_no[j] + 1)
                # Advance the main group index.
                j += 1
            return re.sub(*self.get_pattern_repl(j), all_groups[i])
        return re.sub(self.full_pattern, repl_func, s)
Use it as follows:
>>> ReplWrapper(REPL_DICT).multiple_replace(s)
'. Whi- is a sample str-Th is !##t!## REPLACED. __65__43__.'
I don't know why most of the solutions try to compose a single regex pattern instead of replacing multiple times. This answer is just for the sake of completeness.
That being said, the output of this approach is different than the output of the combined regex approach. Namely, repeated substitutions may evolve the text over time. However, the following function returns the same output as a call to unix sed would:
def multi_replace(rules, data: str) -> str:
    ret = data
    for pattern, repl in rules:
        ret = re.sub(pattern, repl, ret)
    return ret
usage:
RULES = [
(r'a', r'b'),
(r'b', r'c'),
(r'c', r'd'),
]
multi_replace(RULES, 'ab') # output: dd
With the same input and rules, the other solutions will output "bc". Depending on your use case you may or may not want to replace strings consecutively. In my case I wanted to rebuild the sed behavior. Also, note that the order of rules matters. If you reverse the rule order, this example would also return "bc".
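The difference between the two strategies can be reproduced side by side. This is a sketch under the assumption that the patterns are plain strings without capturing groups; sequential and single_pass are illustrative names, not functions from the answers above.

```python
import re

RULES = [(r'a', r'b'), (r'b', r'c'), (r'c', r'd')]

def sequential(rules, data):
    # Apply each rule to the output of the previous one (sed-like).
    for pattern, repl in rules:
        data = re.sub(pattern, repl, data)
    return data

def single_pass(rules, data):
    # Combine all patterns and replace each match exactly once.
    mapping = dict(rules)
    combined = re.compile('|'.join(re.escape(p) for p in mapping))
    return combined.sub(lambda m: mapping[m.group(0)], data)

print(sequential(RULES, 'ab'))                  # dd
print(sequential(list(reversed(RULES)), 'ab'))  # bc
print(single_pass(RULES, 'ab'))                 # bc
```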
This solution is faster than combining the patterns into a single regex (by a factor of 100). So, if your use-case allows it, you should prefer the repeated substitution method.
Of course, you can compile the regex patterns:
class Sed:
    def __init__(self, rules) -> None:
        self._rules = [(re.compile(pattern), sub) for pattern, sub in rules]
    def replace(self, data: str) -> str:
        ret = data
        for regx, repl in self._rules:
            ret = regx.sub(repl, ret)
        return ret

How to fill a regex string with parameters

I would like to fill regex variables with strings.
import re
hReg = re.compile("/robert/(?P<action>([a-zA-Z0-9]*))/$")
hMatch = hReg.match("/robert/delete/")
args = hMatch.groupdict()
The args variable is now a dict: {"action": "delete"}.
How can I reverse this process? With the args dict and the regex pattern, how can I obtain the string "/robert/delete/"?
Is it possible to have a function like this?
def reverse(pattern, dictArgs):
Thank you
This function should do it
def reverse(regex, dict):
    replacer_regex = re.compile(r'''
        \(\?P\<      # Match the opening
        (.+?)        # Match the group name into group 1
        \>\(.*?\)\)  # Match the rest
        '''
        , re.VERBOSE)
    return replacer_regex.sub(lambda m : dict[m.group(1)], regex)
You basically match the (?P<name>(...)) block and replace it with a value from the dict.
EDIT: regex is the regex string in my example. You can get it from a compiled pattern with
regex_compiled.pattern
EDIT2: verbose regex added
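As a usage sketch for the pattern from the question: the variant below is hypothetical (the name reverse_sketch and the anchor stripping are assumptions, not part of the answer). It fills each named group of the form (?P<name>(...)) and then drops a leading ^ or trailing $, which would otherwise remain in the result as literal characters.

```python
import re

def reverse_sketch(pattern, args):
    # Only handles named groups written as (?P<name>(...)), i.e. a named
    # group wrapping exactly one inner group, as in the question.
    named_group = re.compile(r'\(\?P<(\w+)>\(.*?\)\)')
    filled = named_group.sub(lambda m: args[m.group(1)], pattern)
    # Assumption: anchors only appear at the very start or end of the pattern.
    return filled.lstrip('^').rstrip('$')

hReg = re.compile("/robert/(?P<action>([a-zA-Z0-9]*))/$")
args = hReg.match("/robert/delete/").groupdict()

print(reverse_sketch(hReg.pattern, args))  # /robert/delete/
```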
Actually, I think it's doable for some narrow cases, but it is a pretty complex thing in the general case.
You'll need to write some sort of finite state machine that parses your regex string and splits it into parts, then takes the appropriate action for each part.
For regular symbols: simply put the symbols "as is" into the result string.
For named groups: put the values from dictArgs in their place.
For optional blocks: put in one of their possible values.
And so on.
One regular expression can often match a big (or even infinite) set of strings, so this "reverse" function wouldn't be very useful.
Building upon @Dimitri's answer, more sanitisation is possible.
retype = type(re.compile('hello, world'))
def reverse(ptn, dict):
    if isinstance(ptn, retype):
        ptn = ptn.pattern
    ptn = ptn.replace(r'\.','.')
    replacer_regex = re.compile(r'''
        \(\?P      # Match the opening
        \<(.+?)\>
        (.*?)
        \)         # Match the rest
        '''
        , re.VERBOSE)
    # return replacer_regex.findall(ptn)
    res = replacer_regex.sub( lambda m : dict[m.group(1)], ptn)
    return res

Python, how do I parse key=value list ignoring what is inside parentheses?

Suppose I have a string like this:
"key1=value1;key2=value2;key3=(key3.1=value3.1;key3.2=value3.2)"
I would like to get a dictionary corresponding to the above, where the value for key3 is the string
"(key3.1=value3.1;key3.2=value3.2)"
and eventually the corresponding sub-dictionary.
I know how to split the string at the semicolons, but how can I tell the parser to ignore the semicolon between parentheses?
This includes potentially nested parentheses.
Currently I am using an ad-hoc routine that looks for pairs of matching parentheses, "clears" its content, gets split positions and applies them to the original string, but this does not appear very elegant, there must be some prepackaged pythonic way to do this.
If anyone is interested, here is the code I am currently using:
def pparams(parameters, sep=';', defs='=', brc='()'):
    '''
    unpackages parameter string to struct
    for example, pippo(a=21;b=35;c=pluto(h=zzz;y=mmm);d=2d3f) becomes:
    a: '21'
    b: '35'
    c.fn: 'pluto'
    c.h='zzz'
    d: '2d3f'
    fn_: 'pippo'
    '''
    ob=strfind(parameters,brc[0])
    dp=strfind(parameters,defs)
    out={}
    if len(ob)>0:
        if ob[0]<dp[0]:
            #opening function
            out['fn_']=parameters[:ob[0]]
            parameters=parameters[(ob[0]+1):-1]
    if len(dp)>0:
        temp=smart_tokenize(parameters,sep,brc);
        for v in temp:
            defp=strfind(v,defs)
            pname=v[:defp[0]]
            pval=v[1+defp[0]:]
            if len(strfind(pval,brc[0]))>0:
                out[pname]=pparams(pval,sep,defs,brc);
            else:
                out[pname]=pval
    else:
        out['fn_']=parameters
    return out
def smart_tokenize( instr, sep=';', brc='()' ):
    '''
    tokenize string ignoring separators contained within brc
    '''
    tstr=instr;
    ob=strfind(instr,brc[0])
    while len(ob)>0:
        cb=findclsbrc(tstr,ob[0])
        # mask the bracketed span so separators inside it are not seen
        tstr=tstr[:ob[0]]+'?'*(cb-ob[0]+1)+tstr[cb+1:]
        # look for the next remaining opening bracket
        ob=strfind(tstr,brc[0])
    sepp=[-1]+strfind(tstr,sep)+[len(instr)+1]
    out=[]
    for i in range(1,len(sepp)):
        out.append(instr[(sepp[i-1]+1):(sepp[i])])
    return out
def findclsbrc(instr, brc_pos, brc='()'):
    '''
    given a string containing an opening bracket, finds the
    corresponding closing bracket
    '''
    tstr=instr[brc_pos:]
    o=strfind(tstr,brc[0])
    c=strfind(tstr,brc[1])
    p=o+c
    p.sort()
    s1=[1 if v in o else 0 for v in p]
    s2=[-1 if v in c else 0 for v in p]
    s=[s1v+s2v for s1v,s2v in zip(s1,s2)]
    s=[sum(s[:i+1]) for i in range(len(s))] #cumsum
    return p[s.index(0)]+brc_pos
def strfind(instr, substr):
    '''
    returns starting position of each occurrence of substr within instr
    '''
    i=0
    out=[]
    while i<=len(instr):
        try:
            p=instr[i:].index(substr)
            out.append(i+p)
            i+=p+1
        except:
            i=len(instr)+1
    return out
If you want to build a real parser, use one of the Python parsing libraries, like PLY or PyParsing. If you figure such a full-fledged library is overkill for the task at hand, go for some hack like the one you already have. I'm pretty sure there is no clean few-line solution without an external library.
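For comparison, the hand-rolled route can be kept fairly small by counting bracket depth while scanning. The following is only a sketch under stated assumptions: split_top_level and parse are hypothetical names, brackets are assumed to be balanced, and the function-name prefix from the question's code (as in pippo(...)) is not handled.

```python
def split_top_level(text, sep=';', brackets='()'):
    """Split `text` at `sep`, ignoring separators inside brackets."""
    parts, depth, start = [], 0, 0
    for i, ch in enumerate(text):
        if ch == brackets[0]:
            depth += 1
        elif ch == brackets[1]:
            depth -= 1
        elif ch == sep and depth == 0:
            parts.append(text[start:i])
            start = i + 1
    parts.append(text[start:])
    return parts

def parse(text):
    """Turn 'k=v;k2=(k3=v3)' into a (possibly nested) dictionary."""
    result = {}
    for item in split_top_level(text):
        key, _, value = item.partition('=')
        if value.startswith('(') and value.endswith(')'):
            value = parse(value[1:-1])   # recurse into the sub-dictionary
        result[key] = value
    return result

parsed = parse("key1=value1;key2=value2;key3=(key3.1=value3.1;key3.2=value3.2)")
print(parsed['key3']['key3.1'])  # value3.1
```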
Expanding on Sven Marnach's answer, here's an example of a pyparsing grammar that should work for you:
from pyparsing import (ZeroOrMore, Word, printables, Forward,
Group, Suppress, Dict)
collection = Forward()
simple_value = Word(printables, excludeChars='()=;')
key = simple_value
inner_collection = Suppress('(') + collection + Suppress(')')
value = simple_value ^ inner_collection
key_and_value = Group(key + Suppress('=') + value)
collection << Dict(key_and_value + ZeroOrMore(Suppress(';') + key_and_value))
coll = collection.parseString(
"key1=value1;key2=value2;key3=(key3.1=value3.1;key3.2=value3.2)")
print(coll['key1']) # value1
print(coll['key2']) # value2
print(coll['key3']['key3.1']) # value3.1
You could use a regex to capture the groups:
>>> import re
>>> s = "key1=value1;key2=value2;key3=(key3.1=value3.1;key3.2=value3.2)"
>>> r = re.compile(r'(\w+)=(\w+|\([^)]+\));?')
>>> dict(r.findall(s))
This regex says:
(\w+) # Find and capture a group with 1 or more word characters (letters, digits, underscores)
= # Followed by the literal character '='
(\w+ # Followed by a group with 1 or more word characters
|\([^)]+\) # or a group that starts with an open paren (parens escaped as '\(' and '\)'), followed by anything up until a closed paren, which terminates the alternate grouping
);? # optionally this grouping might be followed by a semicolon.
Gotta say, kind of a strange grammar. You should consider using a more standard format. If you need guidance choosing one maybe ask another question. Good luck!

replacing all regex matches in single line

I have a dynamic regexp, and I don't know in advance how many groups it has.
I would like to replace all matches with XML tags.
Example:
re.sub("(this).*(string)", r'<markup>\anygroup</markup>', "this is my string")
>> "<markup>this</markup> is my <markup>string</markup>"
Is that even possible in a single line?
For a constant regexp like in your example, do
re.sub("(this)(.*)(string)",
r'<markup>\1</markup>\2<markup>\3</markup>',
text)
Note that you need to enclose .* in parentheses as well if you don't want to lose it.
Now if you don't know what the regexp looks like, it's more difficult, but should be doable.
pattern = "(this)(.*)(string)"
re.sub(pattern,
lambda m: ''.join('<markup>%s</markup>' % s if n % 2 == 0
else s for n, s in enumerate(m.groups())),
text)
If the first thing matched by your pattern doesn't necessarily have to be marked up, use this instead, with the first group optionally matching some prefix text that should be left alone:
pattern = "()(this)(.*)(string)"
re.sub(pattern,
lambda m: ''.join('<markup>%s</markup>' % s if n % 2 == 1
else s for n, s in enumerate(m.groups())),
text)
You get the idea.
If your regexps are complicated and you're not sure you can make everything part of a group, where only every second group needs to be marked up, you might do something smarter with a more complicated function:
pattern = "(this).*(string)"
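def replacement(m):
    s = m.group()
    # m.span(i) is relative to the whole input string, while s is only the
    # matched text, so shift the positions by the start of the match
    offset = m.start()
    n_groups = len(m.groups())
    # assume groups do not overlap and are listed left-to-right
    for i in range(n_groups, 0, -1):
        lo, hi = m.span(i)
        lo, hi = lo - offset, hi - offset
        s = s[:lo] + '<markup>' + s[lo:hi] + '</markup>' + s[hi:]
    return s
re.sub(pattern, replacement, text)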
If you need to handle overlapping groups, you're on your own, but it should be doable.
re.sub() will replace everything it can. If you pass it a function for repl then you can do even more.
Yes, this can be done in a single line.
>>> re.sub(r"\b(this|string)\b", r"<markup>\1</markup>", "this is my string")
'<markup>this</markup> is my <markup>string</markup>'
\b ensures that only complete words are matched.
So if you have a list of words that you need to mark up, you could do the following:
>>> mywords = ["this", "string", "words"]
>>> myre = r"\b(" + "|".join(mywords) + r")\b"
>>> re.sub(myre, r"<markup>\1</markup>", "this is my string with many words!")
'<markup>this</markup> is my <markup>string</markup> with many <markup>words</markup>!'
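If the words in the list may contain regex metacharacters, they can be passed through re.escape before joining. A minimal sketch, with mark_up as a hypothetical helper name; note that \b only helps when each word starts and ends with a word character.

```python
import re

def mark_up(words, text, tag="markup"):
    # Escape each word so that characters like '.' are matched literally.
    alternatives = "|".join(re.escape(w) for w in words)
    pattern = re.compile(r"\b(" + alternatives + r")\b")
    return pattern.sub(r"<{0}>\1</{0}>".format(tag), text)

print(mark_up(["this", "string"], "this is my string"))
print(mark_up(["a.b"], "a.b axb"))  # only the literal a.b is tagged
```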
