I want to execute substitutions using regex, not for all matches but only for specific ones. However, re.sub substitutes for all matches. How can I do this?
Here is an example.
Say, I have a string with the following content:
FOO=foo1
BAR=bar1
FOO=foo2
BAR=bar2
BAR=bar3
What I want to do is this:
re.sub(r'^BAR', '#BAR', s, index=[1,2], flags=re.MULTILINE)
to get the below result.
FOO=foo1
BAR=bar1
FOO=foo2
#BAR=bar2
#BAR=bar3
You could pass a replacement function to re.sub that keeps track of the match count and checks whether the current index should be substituted:
import re

s = '''FOO=foo1
BAR=bar1
FOO=foo2
BAR=bar2
BAR=bar3'''

i = 0            # running count of matches seen so far
index = {1, 2}   # 0-based positions of the matches to substitute

def repl(x):
    global i
    if i in index:
        res = '#' + x.group(0)
    else:
        res = x.group(0)
    i += 1
    return res

print(re.sub(r'^BAR', repl, s, flags=re.MULTILINE))
Output:
FOO=foo1
BAR=bar1
FOO=foo2
#BAR=bar2
#BAR=bar3
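If the module-level counter is a concern, the same idea can be wrapped in a helper that keeps the count in a closure. This is only a minimal sketch: sub_at and its parameters are made-up names for illustration, not part of the re module.

```python
import re
from itertools import count

def sub_at(pattern, repl, text, indices, flags=0):
    """Replace only the matches whose 0-based position is in `indices`."""
    counter = count()  # yields 0, 1, 2, ... one number per match

    def choose(match):
        if next(counter) in indices:
            return match.expand(repl)  # apply the replacement template
        return match.group(0)          # leave this match untouched

    return re.sub(pattern, choose, text, flags=flags)

s = "FOO=foo1\nBAR=bar1\nFOO=foo2\nBAR=bar2\nBAR=bar3"
result = sub_at(r'^BAR', '#BAR', s, {1, 2}, flags=re.MULTILINE)
print(result)
```

Because the counter lives inside the call, the helper can be used repeatedly without resetting any global state.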
You could:
1. Split your string using s.splitlines()
2. Iterate over the individual lines in a for loop
3. Track how many matches you have found so far
4. Only perform substitutions on those matches in the numerical ranges you want (e.g. matches 1 and 2)
5. Join them back into a single string (if need be).
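The steps above could be sketched roughly as follows. The function name sub_selected_lines is hypothetical, and the sketch assumes at most one match per line, so it counts matching lines, not individual matches.

```python
import re

def sub_selected_lines(text, pattern, repl, wanted):
    """Substitute only on the matching lines whose 0-based index is in `wanted`."""
    out = []
    seen = 0  # number of matching lines encountered so far
    for line in text.splitlines():
        if re.search(pattern, line):
            if seen in wanted:
                line = re.sub(pattern, repl, line)
            seen += 1
        out.append(line)
    return "\n".join(out)

s = "FOO=foo1\nBAR=bar1\nFOO=foo2\nBAR=bar2\nBAR=bar3"
result = sub_selected_lines(s, r'^BAR', '#BAR', {1, 2})
print(result)
```

Since each line is searched on its own, ^ anchors to the start of the line without needing re.MULTILINE.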
I can use this code below to create a new file with the substitution of a with aa using regular expressions.
import re

with open("notes.txt") as text:
    new_text = re.sub("a", "aa", text.read())
    with open("notes2.txt", "w") as result:
        result.write(new_text)
I was wondering: do I have to repeat this line, new_text = re.sub("a", "aa", text.read()), once for each letter in order to change more than one letter in my text?
That is, a --> aa, b --> bb and c --> cc.
Do I have to write that line for every letter I want to change, or is there an easier way, perhaps a "dictionary" of translations? Should I put those letters into an array? I'm not sure how to call on them if I do.
The answer proposed by @nhahtdh is valid, but I would argue it is less Pythonic than the canonical example, which is less opaque than his regex manipulations and takes advantage of Python's built-in data structures and anonymous functions.
A dictionary of translations makes sense in this context. In fact, that's how the Python Cookbook does it, as shown in this example (copied from ActiveState http://code.activestate.com/recipes/81330-single-pass-multiple-replace/ ):
import re

def multiple_replace(dict, text):
    # Create a regular expression from the dictionary keys
    regex = re.compile("(%s)" % "|".join(map(re.escape, dict.keys())))

    # For each match, look-up corresponding value in dictionary
    return regex.sub(lambda mo: dict[mo.string[mo.start():mo.end()]], text)

if __name__ == "__main__":
    text = "Larry Wall is the creator of Perl"

    dict = {
        "Larry Wall" : "Guido van Rossum",
        "creator" : "Benevolent Dictator for Life",
        "Perl" : "Python",
    }

    print(multiple_replace(dict, text))
So in your case, you could make a dict replacements = {"a": "aa", "b": "bb"} and pass it to multiple_replace along with the text you want translated. The function builds one regex that is an alternation of all the (escaped) dictionary keys; whenever one of them matches, the lambda passed to regex.sub looks up the matched text in the dictionary and returns its replacement.
You could use this function while reading from your file, for example:
with open("notes.txt") as text:
    new_text = multiple_replace(replacements, text.read())
with open("notes2.txt", "w") as result:
    result.write(new_text)
I've actually used this exact method in production, in a case where I needed to translate the months of the year from Czech into English for a web scraping task.
As @nhahtdh pointed out, one downside to this approach is that it assumes the keys are prefix-free: if one dictionary key is a prefix of another, the alternation picks whichever of the two comes first in the pattern, so the longer key may never be matched.
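One possible way around the prefix problem is to put longer keys first in the alternation. This is a variation on the recipe, not part of it, and the function name below is made up for illustration.

```python
import re

def multiple_replace_longest_first(replacements, text):
    # Sort keys longest-first so that "ab" is tried before its prefix "a".
    keys = sorted(replacements, key=len, reverse=True)
    regex = re.compile("|".join(map(re.escape, keys)))
    # Look up each matched key in the dictionary.
    return regex.sub(lambda mo: replacements[mo.group(0)], text)

trans = {"a": "1", "ab": "2"}
result = multiple_replace_longest_first(trans, "ab a")
print(result)  # 2 1
```

Python's re module tries alternatives from left to right, which is why the ordering of the keys decides which one wins.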
You can use a capturing group and a backreference:
re.sub(r"([characters])", r"\1\1", text.read())
Put characters that you want to double up in between []. For the case of lower case a, b, c:
re.sub(r"([abc])", r"\1\1", text.read())
In the replacement string, you can refer to whatever was matched by a capturing group () with the \n notation, where n is a positive integer (0 excluded). \1 refers to the first capturing group. There is another notation, \g<n>, where n can be any non-negative integer (0 allowed); \g<0> refers to the whole text matched by the expression.
If you want to double up all characters except newline:
re.sub(r"(.)", r"\1\1", text.read())
If you want to double up all characters (newline included):
re.sub(r"(.)", r"\1\1", text.read(), flags=re.S)
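To make the variants above concrete, here is a small self-contained sketch that runs them on an in-memory string instead of a file; the sample text and variable names are chosen only for illustration.

```python
import re

text = "abc\nxyz"

# Double only a, b and c using a capturing group and \1.
doubled_abc = re.sub(r"([abc])", r"\1\1", text)

# Same result without a capturing group: \g<0> is the whole match.
doubled_whole = re.sub(r"[abc]", r"\g<0>\g<0>", text)

# '.' skips the newline by default ...
doubled_no_newline = re.sub(r"(.)", r"\1\1", text)

# ... and includes it when re.S (DOTALL) is set.
doubled_all = re.sub(r"(.)", r"\1\1", text, flags=re.S)

print(repr(doubled_abc))         # 'aabbcc\nxyz'
print(repr(doubled_no_newline))  # 'aabbcc\nxxyyzz'
print(repr(doubled_all))         # 'aabbcc\n\nxxyyzz'
```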
You can use the pandas library and its replace function. Here is an example with five replacements:
import pandas as pd
df = pd.DataFrame({'text': ['Billy is going to visit Rome in November', 'I was born in 10/10/2010', 'I will be there at 20:00']})
to_replace=['Billy','Rome','January|February|March|April|May|June|July|August|September|October|November|December', r'\d{2}:\d{2}', r'\d{2}/\d{2}/\d{4}']
replace_with=['name','city','month','time', 'date']
print(df.text.replace(to_replace, replace_with, regex=True))
And the modified text is:
0 name is going to visit city in month
1 I was born in date
2 I will be there at time
None of the other solutions work if your patterns are themselves regexes.
For that, you need:
import re

def multi_sub(pairs, s):
    def repl_func(m):
        # only one group will be present, use the corresponding match
        return next(
            repl
            for (patt, repl), group in zip(pairs, m.groups())
            if group is not None
        )
    pattern = '|'.join("({})".format(patt) for patt, _ in pairs)
    return re.sub(pattern, repl_func, s)
Which can be used as:
>>> multi_sub([
...     ('a+b', 'Ab'),
...     ('b', 'B'),
...     ('a+', 'A.'),
... ], "aabbaa")  # matches as (aab)(b)(aa)
'AbBA.'
Note that this solution does not allow you to put capturing groups in your regexes, or use them in replacements.
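Under the same restriction (no capturing groups inside the individual patterns), the lookup could arguably be shortened by using the match object's lastindex attribute, which holds the index of the last matched capturing group. A sketch with a made-up function name:

```python
import re

def multi_sub_lastindex(pairs, s):
    # One top-level group per pattern, in the same order as `pairs`.
    pattern = "|".join("({})".format(patt) for patt, _ in pairs)
    repls = [repl for _, repl in pairs]
    # Exactly one group participates in each match, so lastindex names it.
    return re.sub(pattern, lambda m: repls[m.lastindex - 1], s)

result = multi_sub_lastindex(
    [("a+b", "Ab"), ("b", "B"), ("a+", "A.")],
    "aabbaa",
)
print(result)  # AbBA.
```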
Using tips from how to make a 'stringy' class, we can make an object identical to a string but for an extra sub method:
import re
class Substitutable(str):
    def __new__(cls, *args, **kwargs):
        newobj = str.__new__(cls, *args, **kwargs)
        newobj.sub = lambda fro,to: Substitutable(re.sub(fro, to, newobj))
        return newobj
This allows you to chain substitutions in the style of the builder pattern, which looks nicer but only pays off for a pre-determined number of substitutions. If you apply the substitutions in a loop, there is no point in creating an extra class. E.g.
>>> h = Substitutable('horse')
>>> h
'horse'
>>> h.sub('h', 'f')
'forse'
>>> h.sub('h', 'f').sub('f','h')
'horse'
I found I had to modify Emmett J. Butler's code by changing the lambda function to use myDict.get(mo.group(1),mo.group(1)). The original code wasn't working for me; using myDict.get() also provides the benefit of a default value if a key is not found.
OIDNameContraction = {
    'Fucntion':'Func',
    'operated':'Operated',
    'Asist':'Assist',
    'Detection':'Det',
    'Control':'Ctrl',
    'Function':'Func'
}
replacementDictRegex = re.compile("(%s)" % "|".join(map(re.escape, OIDNameContraction.keys())))
oidDescriptionStr = replacementDictRegex.sub(lambda mo:OIDNameContraction.get(mo.group(1),mo.group(1)), oidDescriptionStr)
If you are dealing with files, here is a simple Python script for this problem.
import re

def multiple_replace(dictionary, text):
    # Create a regular expression from the dictionary keys
    regex = re.compile("(%s)" % "|".join(map(re.escape, dictionary.keys())))

    # For each match, look up the corresponding value in the dictionary
    String = lambda mo: dictionary[mo.string[mo.start():mo.end()]]
    return regex.sub(String , text)

if __name__ == "__main__":
    dictionary = {
        "Wiley Online Library" : "Wiley",
        "Chemical Society Reviews" : "Chem. Soc. Rev.",
    }

    with open ('LightBib.bib', 'r') as Bib_read:
        with open ('Abbreviated.bib', 'w') as Bib_write:
            read_lines = Bib_read.readlines()
            for rows in read_lines:
                #print(rows)
                text = rows
                new_text = multiple_replace(dictionary, text)
                #print(new_text)
                Bib_write.write(new_text)
Based on Eric's great answer, I came up with a more general solution that is capable of handling capturing groups and backreferences:
import re
from itertools import islice

def multiple_replace(s, repl_dict):
    groups_no = [re.compile(pattern).groups for pattern in repl_dict]

    def repl_func(m):
        all_groups = m.groups()

        # Use 'i' as the index within 'all_groups' and 'j' as the main
        # group index.
        i, j = 0, 0

        while i < len(all_groups) and all_groups[i] is None:
            # Skip the inner groups and move on to the next group.
            i += (groups_no[j] + 1)

            # Advance the main group index.
            j += 1

        # Extract the pattern and replacement at the j-th position.
        pattern, repl = next(islice(repl_dict.items(), j, j + 1))

        return re.sub(pattern, repl, all_groups[i])

    # Create the full pattern using the keys of 'repl_dict'.
    full_pattern = '|'.join(f'({pattern})' for pattern in repl_dict)

    return re.sub(full_pattern, repl_func, s)
Example. Calling the above with
s = 'This is a sample string. Which is getting replaced. 1234-5678.'
REPL_DICT = {
    r'(.*?)is(.*?)ing(.*?)ch': r'\3-\2-\1',
    r'replaced': 'REPLACED',
    r'\d\d((\d)(\d)-(\d)(\d))\d\d': r'__\5\4__\3\2__',
    r'get|ing': '!##'
}
gives:
>>> multiple_replace(s, REPL_DICT)
'. Whi- is a sample str-Th is !##t!## REPLACED. __65__43__.'
For a more efficient solution, one can create a simple wrapper to precompute groups_no and full_pattern, e.g.
import re
from itertools import islice

class ReplWrapper:
    def __init__(self, repl_dict):
        self.repl_dict = repl_dict
        self.groups_no = [re.compile(pattern).groups for pattern in repl_dict]
        self.full_pattern = '|'.join(f'({pattern})' for pattern in repl_dict)

    def get_pattern_repl(self, pos):
        return next(islice(self.repl_dict.items(), pos, pos + 1))

    def multiple_replace(self, s):
        def repl_func(m):
            all_groups = m.groups()

            # Use 'i' as the index within 'all_groups' and 'j' as the main
            # group index.
            i, j = 0, 0

            while i < len(all_groups) and all_groups[i] is None:
                # Skip the inner groups and move on to the next group.
                i += (self.groups_no[j] + 1)

                # Advance the main group index.
                j += 1

            return re.sub(*self.get_pattern_repl(j), all_groups[i])

        return re.sub(self.full_pattern, repl_func, s)
Use it as follows:
>>> ReplWrapper(REPL_DICT).multiple_replace(s)
'. Whi- is a sample str-Th is !##t!## REPLACED. __65__43__.'
I don't know why most of the solutions try to compose a single regex pattern instead of replacing multiple times. This answer is just for the sake of completeness.
That being said, the output of this approach differs from the output of the combined-regex approach: each substitution operates on the result of the previous one, so text produced by an earlier rule can be rewritten by a later rule. The following function returns the same output as a call to Unix sed would:
import re

def multi_replace(rules, data: str) -> str:
    ret = data
    for pattern, repl in rules:
        ret = re.sub(pattern, repl, ret)
    return ret
Usage:
RULES = [
    (r'a', r'b'),
    (r'b', r'c'),
    (r'c', r'd'),
]
multi_replace(RULES, 'ab') # output: dd
With the same input and rules, the other solutions will output "bc". Depending on your use case you may or may not want to replace strings consecutively. In my case I wanted to rebuild the sed behavior. Also, note that the order of rules matters. If you reverse the rule order, this example would also return "bc".
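The difference between the two strategies can be checked side by side. The following is a minimal sketch: sequential and single_pass are illustrative names, and the single-pass variant assumes the patterns are plain strings (it escapes them and looks the match up in a dict).

```python
import re

RULES = [("a", "b"), ("b", "c"), ("c", "d")]

def sequential(rules, data):
    # Each rule sees the output of the previous rule.
    for pattern, repl in rules:
        data = re.sub(pattern, repl, data)
    return data

def single_pass(rules, data):
    # Every position of the input is examined once against all keys.
    mapping = dict(rules)
    combined = "|".join(re.escape(key) for key in mapping)
    return re.sub(combined, lambda m: mapping[m.group(0)], data)

seq = sequential(RULES, "ab")             # a->b, then b->c, then c->d
one = single_pass(RULES, "ab")            # a->b and b->c, applied once each
rev = sequential(RULES[::-1], "ab")       # same rules in reverse order
print(seq, one, rev)  # dd bc bc
```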
In my case this solution was faster than combining the patterns into a single regex (by roughly a factor of 100); the actual gain will depend on the patterns and the input. So, if your use case allows it, you should prefer the repeated substitution method.
Of course, you can compile the regex patterns:
class Sed:
    def __init__(self, rules) -> None:
        self._rules = [(re.compile(pattern), sub) for pattern, sub in rules]

    def replace(self, data: str) -> str:
        ret = data
        for regx, repl in self._rules:
            ret = regx.sub(repl, ret)
        return ret
I would like to fill the named groups of a regex with strings.
import re
hReg = re.compile("/robert/(?P<action>([a-zA-Z0-9]*))/$")
hMatch = hReg.match("/robert/delete/")
args = hMatch.groupdict()
The args variable is now a dict: {"action":"delete"}.
How can I reverse this process? Given the args dict and the regex pattern, how can I obtain the string "/robert/delete/"?
Is it possible to have a function like this?
def reverse(pattern, dictArgs):
Thank you
This function should do it:
import re

def reverse(regex, dict):
    replacer_regex = re.compile(r'''
        \(\?P\<         # Match the opening
        (.+?)           # Match the group name into group 1
        \>\(.*?\)\)     # Match the rest
        '''
        , re.VERBOSE)

    return replacer_regex.sub(lambda m : dict[m.group(1)], regex)
You basically match the (?P<name>...) block and replace it with a value from the dict.
EDIT: regex is the regex string in my example. You can get it from a compiled pattern with
regex_compiled.pattern
EDIT2: verbose regex added
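Here is a usage sketch of this approach on the pattern from the question. It assumes every named group has the exact shape (?P<name>(...)). Stripping the ^ and $ anchors is an extra step added here, since the substitution alone would leave the trailing $ in the result, and fill_pattern is a made-up name.

```python
import re

# Matches a named group of the form (?P<name>(...)), as in the question.
NAMED_GROUP = re.compile(r'\(\?P<(.+?)>\(.*?\)\)')

def fill_pattern(regex, values):
    # Replace each named group with the corresponding value.
    filled = NAMED_GROUP.sub(lambda m: values[m.group(1)], regex)
    # Assumption: a leading ^ or trailing $ is an anchor, not literal text.
    return filled.lstrip('^').rstrip('$')

hReg = re.compile("/robert/(?P<action>([a-zA-Z0-9]*))/$")
result = fill_pattern(hReg.pattern, {"action": "delete"})
print(result)  # /robert/delete/
```

Matching the result against the original pattern again is a cheap way to check that the round trip worked.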
Actually, I think it's doable for some narrow cases, but pretty complex in the general case.
You'll need to write some sort of finite state machine that parses your regex string and splits it into its parts, then takes the appropriate action for each part.
For regular symbols: put them into the result string as is.
For named groups: put the values from dictArgs in their place.
For optional blocks: put in one of their possible values.
And so on.
One regular expression can often match a big (or even infinite) set of strings, so this "reverse" function wouldn't be very useful.
Building upon @Dimitri's answer, more sanitisation is possible.
import re

retype = type(re.compile('hello, world'))

def reverse(ptn, dict):
    # Accept either a compiled pattern or a pattern string
    if isinstance(ptn, retype):
        ptn = ptn.pattern
    ptn = ptn.replace(r'\.','.')
    replacer_regex = re.compile(r'''
        \(\?P           # Match the opening
        \<(.+?)\>
        (.*?)           # Group body; lazy, so it ends at the first closing parenthesis
        \)              # Match the rest
        '''
        , re.VERBOSE)
    # return replacer_regex.findall(ptn)
    res = replacer_regex.sub( lambda m : dict[m.group(1)], ptn)
    return res
Suppose I have a string like this:
"key1=value1;key2=value2;key3=(key3.1=value3.1;key3.2=value3.2)"
I would like to get a dictionary corresponding to the above, where the value for key3 is the string
"(key3.1=value3.1;key3.2=value3.2)"
and eventually the corresponding sub-dictionary.
I know how to split the string at the semicolons, but how can I tell the parser to ignore the semicolon between parentheses?
This includes potentially nested parentheses.
Currently I am using an ad-hoc routine that looks for pairs of matching parentheses, "clears" their contents, gets the split positions and applies them to the original string. This does not seem very elegant; there must be some prepackaged Pythonic way to do this.
If anyone is interested, here is the code I am currently using:
def pparams(parameters, sep=';', defs='=', brc='()'):
    '''
    unpackages parameter string to struct
    for example, pippo(a=21;b=35;c=pluto(h=zzz;y=mmm);d=2d3f) becomes:
    a: '21'
    b: '35'
    c.fn: 'pluto'
    c.h='zzz'
    d: '2d3f'
    fn_: 'pippo'
    '''
    ob=strfind(parameters,brc[0])
    dp=strfind(parameters,defs)
    out={}
    if len(ob)>0:
        if ob[0]<dp[0]:
            #opening function
            out['fn_']=parameters[:ob[0]]
            parameters=parameters[(ob[0]+1):-1]
    if len(dp)>0:
        temp=smart_tokenize(parameters,sep,brc)
        for v in temp:
            defp=strfind(v,defs)
            pname=v[:defp[0]]
            pval=v[1+defp[0]:]
            if len(strfind(pval,brc[0]))>0:
                out[pname]=pparams(pval,sep,defs,brc)
            else:
                out[pname]=pval
    else:
        out['fn_']=parameters
    return out
def smart_tokenize( instr, sep=';', brc='()' ):
    '''
    tokenize string ignoring separators contained within brc
    '''
    tstr=instr
    ob=strfind(instr,brc[0])
    while len(ob)>0:
        cb=findclsbrc(tstr,ob[0],brc)
        # mask the bracketed span so its separators are not seen
        tstr=tstr[:ob[0]]+'?'*(cb-ob[0]+1)+tstr[cb+1:]
        # look for the next *opening* bracket (brc[0]); searching for brc[1]
        # here fails as soon as there are two top-level bracketed groups
        ob=strfind(tstr,brc[0])
    sepp=[-1]+strfind(tstr,sep)+[len(instr)+1]
    out=[]
    for i in range(1,len(sepp)):
        out.append(instr[(sepp[i-1]+1):(sepp[i])])
    return out
def findclsbrc(instr, brc_pos, brc='()'):
    '''
    given a string containing an opening bracket, finds the
    corresponding closing bracket
    '''
    tstr=instr[brc_pos:]
    o=strfind(tstr,brc[0])
    c=strfind(tstr,brc[1])
    p=o+c
    p.sort()
    s1=[1 if v in o else 0 for v in p]
    s2=[-1 if v in c else 0 for v in p]
    s=[s1v+s2v for s1v,s2v in zip(s1,s2)]
    s=[sum(s[:i+1]) for i in range(len(s))] #cumsum
    return p[s.index(0)]+brc_pos
def strfind(instr, substr):
    '''
    returns starting position of each occurrence of substr within instr
    '''
    i=0
    out=[]
    while i<=len(instr):
        try:
            p=instr[i:].index(substr)
            out.append(i+p)
            i+=p+1
        except ValueError:
            # no further occurrence: leave the loop
            i=len(instr)+1
    return out
If you want to build a real parser, use one of the Python parsing libraries, like PLY or PyParsing. If you figure such a full-fledged library is overkill for the task at hand, go for some hack like the one you already have. I'm pretty sure there is no clean few-line solution without an external library.
Expanding on Sven Marnach's answer, here's an example of a pyparsing grammar that should work for you:
from pyparsing import (ZeroOrMore, Word, printables, Forward,
                       Group, Suppress, Dict)

collection = Forward()
simple_value = Word(printables, excludeChars='()=;')
key = simple_value
inner_collection = Suppress('(') + collection + Suppress(')')
value = simple_value ^ inner_collection
key_and_value = Group(key + Suppress('=') + value)
collection << Dict(key_and_value + ZeroOrMore(Suppress(';') + key_and_value))

coll = collection.parseString(
    "key1=value1;key2=value2;key3=(key3.1=value3.1;key3.2=value3.2)")

print(coll['key1'])            # value1
print(coll['key2'])            # value2
print(coll['key3']['key3.1'])  # value3.1
You could use a regex to capture the groups:
>>> import re
>>> s = "key1=value1;key2=value2;key3=(key3.1=value3.1;key3.2=value3.2)"
>>> r = re.compile(r'(\w+)=(\w+|\([^)]+\));?')
>>> dict(r.findall(s))
This regex says:
(\w+)           # Find and capture a group with 1 or more word characters (letters, digits, underscores)
=               # Followed by the literal character '='
(\w+            # Followed by a group with 1 or more word characters
|\([^)]+\)      # or a group that starts with an open paren (parens escaped as '\(' and '\)'), followed by anything up to a closing paren, which terminates the alternate grouping
);?             # optionally this grouping might be followed by a semicolon.
Gotta say, kind of a strange grammar. You should consider using a more standard format. If you need guidance choosing one maybe ask another question. Good luck!
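Note that [^)]+ stops at the first closing paren, so the regex above does not cover nested parentheses. For this particular grammar, splitting only on separators at nesting depth zero may be enough without a parsing library. This is a minimal sketch that assumes well-formed input with balanced parentheses; split_top_level and parse_params are hypothetical names.

```python
def split_top_level(text, sep=';', brackets='()'):
    """Split text on sep, ignoring separators nested inside brackets."""
    open_b, close_b = brackets
    parts, depth, start = [], 0, 0
    for pos, ch in enumerate(text):
        if ch == open_b:
            depth += 1
        elif ch == close_b:
            depth -= 1
        elif ch == sep and depth == 0:
            parts.append(text[start:pos])
            start = pos + 1
    parts.append(text[start:])
    return parts

def parse_params(text):
    """Parse 'k=v;k=(k=v;...)' into nested dicts."""
    result = {}
    for item in split_top_level(text):
        key, _, value = item.partition('=')
        # A parenthesised value is parsed recursively as a sub-dictionary.
        if value.startswith('(') and value.endswith(')'):
            value = parse_params(value[1:-1])
        result[key] = value
    return result

s = "key1=value1;key2=value2;key3=(key3.1=value3.1;key3.2=value3.2)"
parsed = parse_params(s)
print(parsed['key3']['key3.1'])  # value3.1
```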
I have a dynamic regexp, and I don't know in advance how many groups it has.
I would like to replace all matches with XML tags.
Example:
re.sub("(this).*(string)","this is my string",'<markup>\anygroup</markup>')
>> "<markup>this</markup> is my <markup>string</markup>"
Is that even possible in a single line?
For a constant regexp like in your example, do
re.sub("(this)(.*)(string)",
       r'<markup>\1</markup>\2<markup>\3</markup>',
       text)
Note that you need to enclose .* in parentheses as well if you don't want to lose it.
Now if you don't know what the regexp looks like, it's more difficult, but should be doable.
pattern = "(this)(.*)(string)"
re.sub(pattern,
       lambda m: ''.join('<markup>%s</markup>' % s if n % 2 == 0
                         else s for n, s in enumerate(m.groups())),
       text)
If the first thing matched by your pattern doesn't necessarily have to be marked up, use this instead, with the first group optionally matching some prefix text that should be left alone:
pattern = "()(this)(.*)(string)"
re.sub(pattern,
       lambda m: ''.join('<markup>%s</markup>' % s if n % 2 == 1
                         else s for n, s in enumerate(m.groups())),
       text)
You get the idea.
If your regexps are complicated and you're not sure you can make everything part of a group, where only every second group needs to be marked up, you might do something smarter with a more complicated function:
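pattern = "(this).*(string)"
def replacement(m):
    s = m.group()
    n_groups = len(m.groups())
    # m.span() is relative to the whole input string, so shift it to be
    # relative to the matched text in s
    offset = m.start()
    # assume groups do not overlap and are listed left-to-right
    for i in range(n_groups, 0, -1):
        lo, hi = m.span(i)
        lo, hi = lo - offset, hi - offset
        s = s[:lo] + '<markup>' + s[lo:hi] + '</markup>' + s[hi:]
    return s
re.sub(pattern, replacement, text)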
If you need to handle overlapping groups, you're on your own, but it should be doable.
re.sub() will replace everything it can. If you pass it a function for repl then you can do even more.
Yes, this can be done in a single line.
>>> re.sub(r"\b(this|string)\b", r"<markup>\1</markup>", "this is my string")
'<markup>this</markup> is my <markup>string</markup>'
\b ensures that only complete words are matched.
So if you have a list of words that you need to mark up, you could do the following:
>>> mywords = ["this", "string", "words"]
>>> myre = r"\b(" + "|".join(mywords) + r")\b"
>>> re.sub(myre, r"<markup>\1</markup>", "this is my string with many words!")
'<markup>this</markup> is my <markup>string</markup> with many <markup>words</markup>!'
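If the words can contain characters that are special in a regex (such as '.'), it is probably safer to escape them before joining. A small sketch, with mark_up as a made-up helper name; it still relies on \b, so it assumes each word starts and ends with a word character.

```python
import re

def mark_up(words, text, tag="markup"):
    # re.escape keeps characters such as '.' literal inside the alternation.
    alternation = "|".join(re.escape(word) for word in words)
    pattern = r"\b(" + alternation + r")\b"
    # Wrap every matched word in <tag>...</tag>.
    return re.sub(pattern, r"<%s>\1</%s>" % (tag, tag), text)

result = mark_up(["this", "3.5"], "this is 3.5 not 315")
print(result)
```

Without the escaping, the '.' in "3.5" would match any character, and "315" would be marked up as well.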