Python - Efficiently replace characters within text file with ASCII characters [duplicate]

I can use the code below to create a new file, substituting a with aa using regular expressions.
import re

with open("notes.txt") as text:
    new_text = re.sub("a", "aa", text.read())
with open("notes2.txt", "w") as result:
    result.write(new_text)
I was wondering: do I have to repeat this line, new_text = re.sub("a", "aa", text.read()), once for each letter I want to change, substituting a different pair of strings each time, in order to change more than one letter in my text?
That is, so a --> aa, b --> bb, and c --> cc.
Do I have to write that line for every letter I want to change, or is there an easier way? Perhaps I could create a "dictionary" of translations. Should I put those letters into an array? I'm not sure how to call on them if I do.

The answer proposed by @nhahtdh is valid, but I would argue it is less Pythonic than the canonical example, which uses code less opaque than his regex manipulations and takes advantage of Python's built-in data structures and anonymous function feature.
A dictionary of translations makes sense in this context. In fact, that's how the Python Cookbook does it, as shown in this example (copied from ActiveState: http://code.activestate.com/recipes/81330-single-pass-multiple-replace/):
import re

def multiple_replace(dict, text):
    # Create a regular expression from the dictionary keys
    regex = re.compile("(%s)" % "|".join(map(re.escape, dict.keys())))
    # For each match, look up the corresponding value in the dictionary
    return regex.sub(lambda mo: dict[mo.string[mo.start():mo.end()]], text)

if __name__ == "__main__":
    text = "Larry Wall is the creator of Perl"
    dict = {
        "Larry Wall": "Guido van Rossum",
        "creator": "Benevolent Dictator for Life",
        "Perl": "Python",
    }
    print(multiple_replace(dict, text))
So in your case, you could make a dict trans = {"a": "aa", "b": "bb", "c": "cc"} and then pass it into multiple_replace along with the text you want translated. Essentially, the function builds one big regex out of all the patterns you want translated, and then on each match passes a lambda to regex.sub to perform the dictionary lookup.
You could use this function while reading from your file, for example:
trans = {"a": "aa", "b": "bb", "c": "cc"}  # the translation dict from above

with open("notes.txt") as text:
    new_text = multiple_replace(trans, text.read())
with open("notes2.txt", "w") as result:
    result.write(new_text)
I've actually used this exact method in production, in a case where I needed to translate the months of the year from Czech into English for a web scraping task.
As @nhahtdh pointed out, one downside to this approach is that it requires the dictionary keys to be prefix-free: a key that is a prefix of another key can cause the method to break.
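If your keys can overlap like that, one common mitigation (my own sketch, not part of the Cookbook recipe) is to sort the keys longest-first before building the pattern, since regex alternation tries the alternatives from left to right:
import re

def multiple_replace_longest_first(replacements, text):
    # Longest keys first, so e.g. "Larry Wall" wins over a hypothetical "Larry" key.
    keys = sorted(replacements, key=len, reverse=True)
    regex = re.compile("|".join(map(re.escape, keys)))
    return regex.sub(lambda mo: replacements[mo.group(0)], text)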

You can use a capturing group and a backreference:
re.sub(r"([characters])", r"\1\1", text.read())
Put the characters that you want to double up between []. For the case of lowercase a, b, c:
re.sub(r"([abc])", r"\1\1", text.read())
In the replacement string, you can refer to whatever was matched by a capturing group () with the \n notation, where n is some positive integer (0 excluded). \1 refers to the first capturing group. There is another notation, \g<n>, where n can be any non-negative integer (0 allowed); \g<0> refers to the whole text matched by the expression.
If you want to double up all characters except new line:
re.sub(r"(.)", r"\1\1", text.read())
If you want to double up all characters (new line included):
re.sub(r"(.)", r"\1\1", text.read(), 0, re.S)

You can use the pandas library and its replace function. Here is an example with five replacements:
import pandas as pd

df = pd.DataFrame({'text': ['Billy is going to visit Rome in November', 'I was born in 10/10/2010', 'I will be there at 20:00']})
to_replace = ['Billy', 'Rome', 'January|February|March|April|May|June|July|August|September|October|November|December',
              r'\d{2}:\d{2}', r'\d{2}/\d{2}/\d{4}']
replace_with = ['name', 'city', 'month', 'time', 'date']
print(df.text.replace(to_replace, replace_with, regex=True))
And the modified text is:
0 name is going to visit city in month
1 I was born in date
2 I will be there at time

None of the other solutions work if your patterns are themselves regexes.
For that, you need:
import re

def multi_sub(pairs, s):
    def repl_func(m):
        # only one group will be present, use the corresponding match
        return next(
            repl
            for (patt, repl), group in zip(pairs, m.groups())
            if group is not None
        )
    pattern = '|'.join("({})".format(patt) for patt, _ in pairs)
    return re.sub(pattern, repl_func, s)
Which can be used as:
>>> multi_sub([
... ('a+b', 'Ab'),
... ('b', 'B'),
... ('a+', 'A.'),
... ], "aabbaa") # matches as (aab)(b)(aa)
'AbBA.'
Note that this solution does not allow you to put capturing groups in your regexes, or use them in replacements.
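For example, an inner capturing group shifts the group numbering so the pattern/replacement pairing falls apart (a quick illustration of the caveat, not from the original answer): 'ab' is still replaced correctly, but for the match on 'c' no paired group is set and next() raises StopIteration:
>>> multi_sub([('(a)b', 'X'), ('c', 'Y')], 'abc')
Traceback (most recent call last):
  ...
StopIteration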

Using tips from how to make a 'stringy' class, we can make an object identical to a string but for an extra sub method:
import re

class Substitutable(str):
    def __new__(cls, *args, **kwargs):
        newobj = str.__new__(cls, *args, **kwargs)
        newobj.sub = lambda fro, to: Substitutable(re.sub(fro, to, newobj))
        return newobj
This allows you to use the builder pattern, which looks nicer, but it works only for a pre-determined number of substitutions. If you use it in a loop, there is no point in creating an extra class anymore. E.g.
>>> h = Substitutable('horse')
>>> h
'horse'
>>> h.sub('h', 'f')
'forse'
>>> h.sub('h', 'f').sub('f','h')
'horse'

I found I had to modify Emmett J. Butler's code by changing the lambda function to use myDict.get(mo.group(1),mo.group(1)). The original code wasn't working for me; using myDict.get() also provides the benefit of a default value if a key is not found.
import re

OIDNameContraction = {
    'Fucntion': 'Func',
    'operated': 'Operated',
    'Asist': 'Assist',
    'Detection': 'Det',
    'Control': 'Ctrl',
    'Function': 'Func',
}

# oidDescriptionStr is assumed to hold the text being rewritten
replacementDictRegex = re.compile("(%s)" % "|".join(map(re.escape, OIDNameContraction.keys())))
oidDescriptionStr = replacementDictRegex.sub(lambda mo: OIDNameContraction.get(mo.group(1), mo.group(1)), oidDescriptionStr)

If you're dealing with files, here is some simple Python code for this problem.
import re

def multiple_replace(dictionary, text):
    # Create a regular expression from the dictionary keys
    regex = re.compile("(%s)" % "|".join(map(re.escape, dictionary.keys())))
    # For each match, look up the corresponding value in the dictionary
    String = lambda mo: dictionary[mo.string[mo.start():mo.end()]]
    return regex.sub(String, text)

if __name__ == "__main__":
    dictionary = {
        "Wiley Online Library": "Wiley",
        "Chemical Society Reviews": "Chem. Soc. Rev.",
    }
    with open('LightBib.bib', 'r') as Bib_read:
        with open('Abbreviated.bib', 'w') as Bib_write:
            for rows in Bib_read.readlines():
                new_text = multiple_replace(dictionary, rows)
                Bib_write.write(new_text)

Based on Eric's great answer, I came up with a more general solution that is capable of handling capturing groups and backreferences:
import re
from itertools import islice

def multiple_replace(s, repl_dict):
    groups_no = [re.compile(pattern).groups for pattern in repl_dict]

    def repl_func(m):
        all_groups = m.groups()
        # Use 'i' as the index within 'all_groups' and 'j' as the main
        # group index.
        i, j = 0, 0
        while i < len(all_groups) and all_groups[i] is None:
            # Skip the inner groups and move on to the next group.
            i += groups_no[j] + 1
            # Advance the main group index.
            j += 1
        # Extract the pattern and replacement at the j-th position.
        pattern, repl = next(islice(repl_dict.items(), j, j + 1))
        return re.sub(pattern, repl, all_groups[i])

    # Create the full pattern using the keys of 'repl_dict'.
    full_pattern = '|'.join(f'({pattern})' for pattern in repl_dict)
    return re.sub(full_pattern, repl_func, s)
Example. Calling the above with
s = 'This is a sample string. Which is getting replaced. 1234-5678.'
REPL_DICT = {
    r'(.*?)is(.*?)ing(.*?)ch': r'\3-\2-\1',
    r'replaced': 'REPLACED',
    r'\d\d((\d)(\d)-(\d)(\d))\d\d': r'__\5\4__\3\2__',
    r'get|ing': '!##'
}
gives:
>>> multiple_replace(s, REPL_DICT)
'. Whi- is a sample str-Th is !##t!## REPLACED. __65__43__.'
For a more efficient solution, one can create a simple wrapper to precompute groups_no and full_pattern, e.g.
import re
from itertools import islice

class ReplWrapper:
    def __init__(self, repl_dict):
        self.repl_dict = repl_dict
        self.groups_no = [re.compile(pattern).groups for pattern in repl_dict]
        self.full_pattern = '|'.join(f'({pattern})' for pattern in repl_dict)

    def get_pattern_repl(self, pos):
        return next(islice(self.repl_dict.items(), pos, pos + 1))

    def multiple_replace(self, s):
        def repl_func(m):
            all_groups = m.groups()
            # Use 'i' as the index within 'all_groups' and 'j' as the main
            # group index.
            i, j = 0, 0
            while i < len(all_groups) and all_groups[i] is None:
                # Skip the inner groups and move on to the next group.
                i += self.groups_no[j] + 1
                # Advance the main group index.
                j += 1
            return re.sub(*self.get_pattern_repl(j), all_groups[i])
        return re.sub(self.full_pattern, repl_func, s)
Use it as follows:
>>> ReplWrapper(REPL_DICT).multiple_replace(s)
'. Whi- is a sample str-Th is !##t!## REPLACED. __65__43__.'

I don't know why most of the solutions try to compose a single regex pattern instead of replacing multiple times. This answer is here just for the sake of completeness.
That being said, the output of this approach is different from the output of the combined-regex approach: repeated substitutions may evolve the text over time. However, the following function returns the same output as a call to Unix sed would:
import re

def multi_replace(rules, data: str) -> str:
    ret = data
    for pattern, repl in rules:
        ret = re.sub(pattern, repl, ret)
    return ret
usage:
RULES = [
    (r'a', r'b'),
    (r'b', r'c'),
    (r'c', r'd'),
]
multi_replace(RULES, 'ab')  # output: dd
With the same input and rules, the other solutions will output "bc". Depending on your use case you may or may not want to replace strings consecutively. In my case I wanted to rebuild the sed behavior. Also, note that the order of rules matters. If you reverse the rule order, this example would also return "bc".
This solution is faster than combining the patterns into a single regex (by a factor of 100). So, if your use-case allows it, you should prefer the repeated substitution method.
Of course, you can compile the regex patterns:
import re

class Sed:
    def __init__(self, rules) -> None:
        self._rules = [(re.compile(pattern), sub) for pattern, sub in rules]

    def replace(self, data: str) -> str:
        ret = data
        for regx, repl in self._rules:
            ret = regx.sub(repl, ret)
        return ret
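A usage sketch for the compiled variant, reusing the RULES list from above (it should behave exactly like multi_replace):
sed = Sed(RULES)
sed.replace('ab')  # output: dd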

Related

Replacing sub-string occurrences with elements of a given list

Suppose I have a string that has the same sub-string repeated multiple times and I want to replace each occurrence with a different element from a list.
For example, consider this scenario:
pattern = "_____" # repeated pattern
s = "a(_____), b(_____), c(_____)"
r = [0,1,2] # elements to insert
The goal is to obtain a string of the form:
s = "a(_001_), b(_002_), c(_003_)"
The number of occurrences is known, and the list r has the same length as the number of occurrences (3 in the previous example) and contains increasing integers starting from 0.
I've come up with this solution:
import re

pattern = "_____"
s = "a(_____), b(_____), c(_____)"
l = [m.start() for m in re.finditer(pattern, s)]
i = 0
for el in l:
    s = s[:el] + f"_{str(i).zfill(5 - 2)}_" + s[el + 5:]
    i += 1
print(s)
Output: a(_000_), b(_001_), c(_002_)
This solves my problem, but it seems to me a bit cumbersome, especially the for-loop. Is there a better, maybe more "pythonic" (meaning concise, possibly elegant) way to solve the task?
You can simply use the re.sub() method to replace each occurrence of the pattern with a different element from the list.
import re

pattern = re.compile("_____")
s = "a(_____), b(_____), c(_____)"
r = [0, 1, 2]
for val in r:
    s = re.sub(pattern, f"_{val:03d}_", s, count=1)
print(s)
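Output (each val is zero-padded to three digits by :03d):
a(_000_), b(_001_), c(_002_)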
You can also go with this approach without re, using the values in the r list together with their indexes:
r = [0,1,2]
s = ", ".join(f"{'abc'[i]}(_{val:03d}_)" for i, val in enumerate(r))
print(s)
a(_000_), b(_001_), c(_002_)
TL;DR
Use re.sub with a replacement callable and an iterator:
import re
p = re.compile("_____")
s = "a(_____), b(_____), c(_____)"
r = [0, 1, 2]
it = iter(r)
print(re.sub(p, lambda _: f"_{next(it):03d}_", s))
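Output: a(_000_), b(_001_), c(_002_)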
Long version
Generally speaking, it is a good idea to re.compile your pattern once ahead of time. If you are going to use that pattern repeatedly later, this makes the regex calls much more efficient. There is basically no downside to compiling the pattern, so I would just make it a habit.
As for avoiding the for-loop altogether, the re.sub function allows us to pass a callable as the repl argument, which takes a re.Match object as its only argument and returns a string. Wouldn't it be nice, if we could have such a replacement function that takes the next element from our replacements list every time it is called?
Well, since you have an iterable of replacement elements, we can leverage the iterator protocol to avoid explicit looping over the elements. All we need to do is give our replacement function access to an iterator over those elements, so that it can grab a new one via the next function every time it is called.
The string format specification that Jamiu used in his answer is great if you know exactly that the sub-string to be replaced will always be exactly five underscores (_____) and that your replacement numbers will always be < 999.
So in its simplest form, a function doing what you described could look like this:
import re
from collections.abc import Iterable

def multi_replace(
    pattern: re.Pattern[str],
    replacements: Iterable[int],
    string: str,
) -> str:
    iterator = iter(replacements)

    def repl(_match: re.Match[str]) -> str:
        return f"_{next(iterator):03d}_"

    return re.sub(pattern, repl, string)
Trying it out with your example data:
if __name__ == "__main__":
    p = re.compile("_____")
    s = "a(_____), b(_____), c(_____)"
    r = [0, 1, 2]
    print(multi_replace(p, r, s))
Output: a(_000_), b(_001_), c(_002_)
In this simple application, we aren't doing anything with the Match object in our replacement function.
If you want to make it a bit more flexible, there are a few avenues possible. Let's say the sub-strings to replace might (perhaps unexpectedly) be a different number of underscores. Let's further assume that the numbers might get bigger than 999.
First of all, the pattern would need to change a bit. And if we still want to center the replacement in an arbitrary number of underscores, we'll actually need to access the match object in our replacement function to check the number of underscores.
The format specifiers are still useful because they allow centering the inserted number with the ^ alignment code.
import re
from collections.abc import Iterable

def dynamic_replace(
    pattern: re.Pattern[str],
    replacements: Iterable[int],
    string: str,
) -> str:
    iterator = iter(replacements)

    def repl(match: re.Match[str]) -> str:
        replacement = f"{next(iterator):03d}"
        length = len(match.group())
        return f"{replacement:_^{length}}"

    return re.sub(pattern, repl, string)

if __name__ == "__main__":
    p = re.compile("(_+)")
    s = "a(_____), b(_____), c(_____), d(_______), e(___)"
    r = [0, 1, 2, 30, 4000]
    print(dynamic_replace(p, r, s))
Output: a(_000_), b(_001_), c(_002_), d(__030__), e(4000)
Here we are building the replacement string based on the length of the matched group (i.e. the number of underscores) to ensure the number is always centered.
I think you get the idea. As always, separation of concerns is a good idea. You can put the replacement logic in its own function and refer to that, whenever you need to adjust it.
I don't see regex as the best fit for this situation.
pattern = "_____" # repeated pattern
s = "a(_____), b(_____), c(_____)"
r = [0,1,2] # elements to insert
fstring = s.replace(pattern, "_{}_")
str_out = fstring.format(*r)
str_out_pad = fstring.format(*[str(entry).zfill(3) for entry in r])
print(str_out)
print(str_out_pad)
--
a(_0_), b(_1_), c(_2_)
a(_000_), b(_001_), c(_002_)

Python re.sub() optimization

I have a Python list in which each string is one of the following 4 possible formats (of course the names would differ):
Mr: Smith\n
Mr: Smith; John\n
Smith\n
Smith; John\n
I want these to be corrected to:
Mr,Smith,fname\n
Mr,Smith,John\n
title,Smith,fname\n
title,Smith,John\n
Easy enough to do with 4 re.sub():
with open ("path/to/file",'r') as fileset:
dataset = fileset.readlines()
for item in dataset:
dataset = [item.strip() for item in dataset] #removes some misc. white noise
item = re.sub((.*):\W(.*);\W,r'\g<1>'+','+r'\g<2>'+',',item)
item = re.sub((.*);\W(.*),'title,'+r'\g<1>'+','+r'\g<2>',item)
item = re.sub((.*):\W(.*),r'\g<1>'+','+r'\g<2>'+',fname',item)
item = re.sub((*.),'title,'+r'\g<1>'+',fname',item)
While this is fine for the dataset I'm using, I want to be more efficient.
Is there a single operation that can simplify this process?
Please pardon if I forgot a quote or some such; I'm not at my workstation now and I'm aware I've stripped the newline (\n).
Thank you,
Brief
Instead of running two loops, you can reduce it to just one line. Adapted from How to iterate over the file in Python (and using the code in my Code section):
f = open("path/to/file",'r')
while True:
x = f.readline()
if not x: break
print re.sub(r, repl, x)
See Python - How to use regexp on file, line by line, in Python for other alternatives.
Code
For viewing's sake, I've changed your file to an array.
^(?:([^:\r\n]+):\W*)?([^;\r\n]+)(?:;\W*(.+))?
Note: You don't need all that in python, I do in order to show it on regex101, so your regex would actually just be ^(?:([^:]+):\W*)?([^;]+)(?:;\W*(.+))?
Usage
import re

a = [
    "Mr: Smith",
    "Mr: Smith; John",
    "Smith",
    "Smith; John"
]
r = r"^(?:([^:]+):\W*)?([^;]+)(?:;\W*(.+))?"

def repl(m):
    return (m.group(1) or "title") + "," + m.group(2) + "," + (m.group(3) or "fname")

for s in a:
    print(re.sub(r, repl, s))
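Running this against the array above should print:
Mr,Smith,fname
Mr,Smith,John
title,Smith,fname
title,Smith,John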
Explanation
^ Assert position at the start of the line
(?:([^:]+):\W*)? Optionally match the following
([^:]+) Capture any character except : one or more times into capture group 1
: Match this literally
\W* Match any number of non-word characters (copied from OP's original code, I assume \s* can be used instead)
([^;]+) Group any character except ; one or more times into capture group 2
(?:;\W*(.+))? Optionally match the following
; Match this literally
\W* Match any number of non-word characters (copied from OP's original code, I assume \s* can be used instead)
(.+) Capture any character one or more times into capture group 3
Given the above explanation of the regex part. The re.sub(r, repl, s) works as follows:
repl is a callback to the repl function which returns:
group 1 if it captured anything, title otherwise
group 2 (it's supposedly always set - using OP's logic here again)
group 3 if it captured anything, fname otherwise
IMHO, regexes are just too complex here; you can use classic string functions to split your item string into chunks. For that, you can use partition (or rpartition).
First, split your item string into "records", like this:
item = "Mr: Smith\n Mr: Smith; John\n Smith\n Smith; John\n"
records = item.splitlines()
# -> ['Mr: Smith', ' Mr: Smith; John', ' Smith', ' Smith; John']
Then, you can create a short function to normalize each "record".
Here is an example:
def normalize_record(record):
    # type: (str) -> str
    name, _, fname = record.partition(';')
    title, _, name = name.rpartition(':')
    title = title.strip() or 'title'
    name = name.strip()
    fname = fname.strip() or 'fname'
    return "{0},{1},{2}".format(title, name, fname)
This function is easier to understand than a collection of regexes. And, in most cases, it is faster.
For a better integration, you can define another function to handle each item:
def normalize(row):
    records = row.splitlines()
    return "\n".join(normalize_record(record) for record in records) + "\n"
Demo:
item = "Mr: Smith\n Mr: Smith; John\n Smith\n Smith; John\n"
item = normalize(item)
You get:
'Mr,Smith,fname\nMr,Smith,John\ntitle,Smith,fname\ntitle,Smith,John\n'

Python replace multiple strings while supporting backreferences

There are some nice ways to handle simultaneous multi-string replacement in python. However, I am having trouble creating an efficient function that can do that while also supporting backreferences.
What I would like is to use a dictionary of expression / replacement terms, where the replacement terms may contain backreferences to something matched by the expression.
e.g. (note the \1)
repdict = {'&&':'and', '||':'or', '!([a-zA-Z_])':'not \1'}
I put the SO answer mentioned at the outset into the function below, which works fine for expression / replacement pairs that don't contain backreferences:
def replaceAll(repdict, text):
    repdict = dict((re.escape(k), v) for k, v in repdict.items())
    pattern = re.compile("|".join(repdict.keys()))
    return pattern.sub(lambda m: repdict[re.escape(m.group(0))], text)
However, it doesn't work for the key that does contain a backreference:
>>> replaceAll(repldict, "!newData.exists() || newData.val().length == 1")
'!newData.exists() or newData.val().length == 1'
If i do it manually, it works fine. e.g.:
pattern = re.compile("!([a-zA-Z_])")
pattern.sub(r'not \1', '!newData.exists()')
Works as expected:
'not newData.exists()'
In the fancy function, the escaping seems to be messing up the key that uses the backref, so it never matches anything.
I eventually came up with this. However, note that the problem of supporting backrefs in the input parameters is not solved; I'm just handling it manually in the replacer function:
def replaceAll(repPat, text):
    def replacer(obj):
        match = obj.group(0)
        # manually deal with exclamation mark match
        if match[:1] == "!":
            return 'not ' + match[1:]
        # here we naively escape the matched pattern into
        # the format of our dictionary key
        else:
            return repPat[naive_escaper(match)]
    pattern = re.compile("|".join(repPat.keys()))
    return pattern.sub(replacer, text)

def naive_escaper(string):
    if '=' in string:
        return string.replace('=', r'\=')
    elif '|' in string:
        return string.replace('|', r'\|')
    else:
        return string
# manually escaping \ and = works fine
repPat = {'!([a-zA-Z_])':'', '&&':'and', '\|\|':'or', '\=\=\=':'=='}
replaceAll(repPat, "(!this && !that) || !this && foo === bar")
Returns:
'(not this and not that) or not this'
So if anyone has an idea how to make a multi-string replacement function that supports backreferences and accepts the replacement terms as input, I'd appreciate your feedback very much.
Update: See Angus Hollands' answer for a better alternative.
I couldn't think of an easier way to do it than to stick with the original idea of combining all dict keys into one massive regex.
However, there are some difficulties. Let's assume a repldict like this:
repldict = {r'(a)': r'\1a', r'(b)': r'\1b'}
If we combine these to a single regex, we get (a)|(b) - so now (b) is no longer group 1, which means its backreference won't work correctly.
Another problem is that we can't tell which replacement to use. If the regex matches the text b, how can we find out that \1b is the appropriate replacement? It's not possible; we don't have enough information.
The solution to these problems is to enclose every dict key in a named group like so:
(?P<group1>(a))|(?P<group2>(b))
Now we can easily identify the key that matched, and recalculate the backreferences to make them relative to this group. so that \1b refers to "the first group after group2".
Here's the implementation:
import re

def replaceAll(repldict, text):
    # split the dict into two lists because we need the order to be reliable
    keys, repls = zip(*repldict.items())
    # generate a regex pattern from the keys, putting each key in a named group
    # so that we can find out which one of them matched.
    # groups are named "_<idx>" where <idx> is the index of the corresponding
    # replacement text in the list above
    pattern = '|'.join('(?P<_{}>{})'.format(i, k) for i, k in enumerate(keys))

    def repl(match):
        # find out which key matched. We know that exactly one of the keys has
        # matched, so it's the only named group with a value other than None.
        group_name = next(name for name, value in match.groupdict().items()
                          if value is not None)
        group_index = int(group_name[1:])
        # now that we know which group matched, we can retrieve the
        # corresponding replacement text
        repl_text = repls[group_index]

        # now we'll manually search for backreferences in the
        # replacement text and substitute them
        def repl_backreference(m):
            reference_index = int(m.group(1))
            # return the corresponding group's value from the original match
            # +1 because regex starts counting at 1
            return match.group(group_index + reference_index + 1)

        return re.sub(r'\\(\d+)', repl_backreference, repl_text)

    return re.sub(pattern, repl, text)
Tests:
repldict = {'&&':'and', r'\|\|':'or', r'!([a-zA-Z_])':r'not \1'}
print( replaceAll(repldict, "!newData.exists() || newData.val().length == 1") )
repldict = {'!([a-zA-Z_])':r'not \1', '&&':'and', r'\|\|':'or', r'\=\=\=':'=='}
print( replaceAll(repldict, "(!this && !that) || !this && foo === bar") )
# output: not newData.exists() or newData.val().length == 1
# (not this and not that) or not this and foo == bar
Caveats:
Only numerical backreferences are supported; no named references.
Silently accepts invalid backreferences like {r'(a)': r'\2'}. (These will sometimes throw an error, but not always.)
Similar solution to Rawing's, but precomputing the expensive stuff ahead of time by modifying the group indices in backreferences. Also, using unnamed groups.
Here we silently wrap each case in a capture group, and then update any replacements with backreferences to correctly identify the appropriate subgroup by absolute position. Note that when using a replacer function, backreferences do not work by default (you need to call match.expand).
import re
from collections import OrderedDict
from functools import partial

pattern_to_replacement = {'&&': 'and', '!([a-zA-Z_]+)': r'not \1'}

def build_replacer(cases):
    ordered_cases = OrderedDict(cases.items())
    replacements = {}
    leading_groups = 0
    for pattern, replacement in ordered_cases.items():
        leading_groups += 1
        # leading_groups is now the absolute position of the root group
        # (back-references should be relative to this)
        group_index = leading_groups
        replacement = absolute_backreference(replacement, group_index)
        replacements[group_index] = replacement
        # This pattern contains N subgroups (determined by compiling the pattern)
        subgroups = re.compile(pattern).groups
        leading_groups += subgroups
    catch_all = "|".join("({})".format(p) for p in ordered_cases)
    pattern = re.compile(catch_all)

    def replacer(match):
        replacement_pattern = replacements[match.lastindex]
        return match.expand(replacement_pattern)

    return partial(pattern.sub, replacer)

def absolute_backreference(text, n):
    ref_pat = re.compile(r"\\([0-99])")

    def replacer(match):
        return "\\{}".format(int(match.group(1)) + n)

    return ref_pat.sub(replacer, text)

replacer = build_replacer(pattern_to_replacement)
print(replacer("!this.exists()"))
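This should print not this.exists(): the second alternative matches !this, and its rewritten backreference (\1 shifted to \3) expands to this.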
Simple is better than complex; the code below is more readable. (The reason your code does not work as expected is that ([a-zA-Z_]) should not be passed through re.escape.)
import re

repdict = {
    r'\s*' + re.escape('&&') + r'\s*': ' and ',
    r'\s*' + re.escape('||') + r'\s*': ' or ',
    re.escape('!') + r'([a-zA-Z_])': r'not \1',
}

def replaceAll(repdict, text):
    for k, v in repdict.items():
        text = re.sub(k, v, text)
    return text
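Applied to one of the OP's earlier test strings, this should give:
>>> replaceAll(repdict, "(!this && !that) || !this")
'(not this and not that) or not this'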

How to compose string from regex pattern with named groups and datadict in python?

Short version:
I want to create a function which replaces all named groups in a regular expression with corresponding data from a datadict.
For example:
Input: expr=r"/(?P<something>\w+)/whatever/(?P<something2>\w+)" data={"something":123, "something2": "thing"}
Output: "/123/whatever/thing"
But I have no idea how to do it.
Some additional info:
I have code which iterates through a list of tuples containing a name and a pattern, and tries re.search. In case re.search matches the given string, it returns the name from the current tuple and groupdict() (which is a dict with the data from re.search).
Here is the code
class UrlResolver():
    def __init__(self):
        self.urls = {}

    def parse(self, app, url):
        for pattern in self.urls[app]:
            data = re.search(pattern[1], url)
            if data:
                return {"name": pattern[0], "data": data.groupdict()}
Now I would like to create a function:
def compose(self, app, name, data):
    for pattern in self.url[app]:
        if pattern[0] == name:
            pass  # return a string composed from the regex pattern and data from the data dict
The above function should replace all named groups with corresponding data from the datadict.
Have a look at the re.sub() function. This function can be called with a replacement function as the second parameter. See http://docs.python.org/2/library/re.html
That function you'd have to define yourself. It would have to take a match object as its parameter. In it you should look at the match object, extract the match groups and replace them with the values from the dictionary.
You can find the parts of the original string that do not need replacing by looping through the groups and calling start, end = match.span(group) on them.
EDIT
I misread your original question. I see now that you do not wish to replace the matches from the regular expressions, but the regular expressions themselves. In this case the difficult part will be to create a regular expression that matches a named regular expression. My solution still holds, but can be somewhat simpler.
To do proper penance I created the following example.
import re

def repl(data, match):
    key = match.group(1)
    return data[key]

expression = r"/(?P<something>\w+)/whatever/(?P<something2>\w+)"
reNamedGroup = re.compile(r'\(\?P<(.*?)>\\w\+\)')
data = {'something': 'completely',
        'something2': 'different'}
print(reNamedGroup.sub(lambda m: repl(data, m), expression))
This will print
/completely/whatever/different
Using a method demonstrated by F.J here, you could perform the substitution this way:
import re

data = {"something": 123, "something2": "thing"}
expr = r"/(?P<something>\w+)/whatever/(?P<something2>\w+)"

def matchsub(match, data):
    result = list(match.string)
    pat = match.re
    for key, index in pat.groupindex.items():
        result[match.start(index):match.end(index)] = str(data[key])
    return ''.join(result)

result = matchsub(re.search(expr, "hi/ABC/whatever/DEF/there"), data)
print(result)
yields
hi/123/whatever/thing/there

Python, how do I parse key=value list ignoring what is inside parentheses?

Suppose I have a string like this:
"key1=value1;key2=value2;key3=(key3.1=value3.1;key3.2=value3.2)"
I would like to get a dictionary corresponding to the above, where the value for key3 is the string
"(key3.1=value3.1;key3.2=value3.2)"
and eventually the corresponding sub-dictionary.
I know how to split the string at the semicolons, but how can I tell the parser to ignore the semicolon between parentheses?
This includes potentially nested parentheses.
Currently I am using an ad-hoc routine that looks for pairs of matching parentheses, "clears" their content, gets split positions, and applies them to the original string, but this does not appear very elegant; there must be some prepackaged pythonic way to do this.
If anyone is interested, here is the code I am currently using:
def pparams(parameters, sep=';', defs='=', brc='()'):
    '''
    unpackages parameter string to struct
    for example, pippo(a=21;b=35;c=pluto(h=zzz;y=mmm);d=2d3f) becomes:
    a: '21'
    b: '35'
    c.fn: 'pluto'
    c.h: 'zzz'
    d: '2d3f'
    fn_: 'pippo'
    '''
    ob = strfind(parameters, brc[0])
    dp = strfind(parameters, defs)
    out = {}
    if len(ob) > 0:
        if ob[0] < dp[0]:
            # opening function
            out['fn_'] = parameters[:ob[0]]
            parameters = parameters[(ob[0] + 1):-1]
    if len(dp) > 0:
        temp = smart_tokenize(parameters, sep, brc)
        for v in temp:
            defp = strfind(v, defs)
            pname = v[:defp[0]]
            pval = v[1 + defp[0]:]
            if len(strfind(pval, brc[0])) > 0:
                out[pname] = pparams(pval, sep, defs, brc)
            else:
                out[pname] = pval
    else:
        out['fn_'] = parameters
    return out

def smart_tokenize(instr, sep=';', brc='()'):
    '''
    tokenize string ignoring separators contained within brc
    '''
    tstr = instr
    ob = strfind(instr, brc[0])
    while len(ob) > 0:
        cb = findclsbrc(tstr, ob[0])
        tstr = tstr[:ob[0]] + '?' * (cb - ob[0] + 1) + tstr[cb + 1:]
        # rescan for any remaining opening brackets
        ob = strfind(tstr, brc[0])
    sepp = [-1] + strfind(tstr, sep) + [len(instr) + 1]
    out = []
    for i in range(1, len(sepp)):
        out.append(instr[(sepp[i - 1] + 1):sepp[i]])
    return out

def findclsbrc(instr, brc_pos, brc='()'):
    '''
    given a string containing an opening bracket, finds the
    corresponding closing bracket
    '''
    tstr = instr[brc_pos:]
    o = strfind(tstr, brc[0])
    c = strfind(tstr, brc[1])
    p = o + c
    p.sort()
    s1 = [1 if v in o else 0 for v in p]
    s2 = [-1 if v in c else 0 for v in p]
    s = [s1v + s2v for s1v, s2v in zip(s1, s2)]
    s = [sum(s[:i + 1]) for i in range(len(s))]  # cumsum
    return p[s.index(0)] + brc_pos

def strfind(instr, substr):
    '''
    returns starting position of each occurrence of substr within instr
    '''
    i = 0
    out = []
    while i <= len(instr):
        try:
            p = instr[i:].index(substr)
            out.append(i + p)
            i += p + 1
        except ValueError:
            i = len(instr) + 1
    return out
If you want to build a real parser, use one of the Python parsing libraries, like PLY or PyParsing. If you figure such a full-fledged library is overkill for the task at hand, go for some hack like the one you already have. I'm pretty sure there is no clean few-line solution without an external library.
Expanding on Sven Marnach's answer, here's an example of a pyparsing grammar that should work for you:
from pyparsing import (ZeroOrMore, Word, printables, Forward,
                       Group, Suppress, Dict)

collection = Forward()
simple_value = Word(printables, excludeChars='()=;')
key = simple_value
inner_collection = Suppress('(') + collection + Suppress(')')
value = simple_value ^ inner_collection
key_and_value = Group(key + Suppress('=') + value)
collection << Dict(key_and_value + ZeroOrMore(Suppress(';') + key_and_value))

coll = collection.parseString(
    "key1=value1;key2=value2;key3=(key3.1=value3.1;key3.2=value3.2)")

print(coll['key1'])            # value1
print(coll['key2'])            # value2
print(coll['key3']['key3.1'])  # value3.1
You could use a regex to capture the groups:
>>> import re
>>> s = "key1=value1;key2=value2;key3=(key3.1=value3.1;key3.2=value3.2)"
>>> r = re.compile(r'(\w+)=(\w+|\([^)]+\));?')
>>> dict(r.findall(s))
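Since findall() returns (key, value) tuples here, dict() should give:
{'key1': 'value1', 'key2': 'value2', 'key3': '(key3.1=value3.1;key3.2=value3.2)'}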
This regex says:
(\w+)      # Find and capture a group with 1 or more word characters (letters, digits, underscores)
=          # Followed by the literal character '='
(\w+       # Followed by a group with 1 or more word characters
|\([^)]+\) # or a group that starts with an open paren (escaped as '\(' and '\)'), followed by anything up until a closing paren, which terminates the alternate grouping
);?        # Optionally this grouping might be followed by a semicolon.
Gotta say, kind of a strange grammar. You should consider using a more standard format. If you need guidance choosing one maybe ask another question. Good luck!
