parsing a complex logical expression in pyparsing in a binary tree fashion

parsing a complex logical expression in pyparsing in a binary tree fashion - python

I am trying to parse complex logical expression like the one below;
x > 7 AND x < 8 OR x = 4
and get the parsed string as a binary tree. For the above expression the expected parsed expression should look like
[['x', '>', 7], 'AND', [['x', '<', 8], 'OR', ['x', '=', 4]]]
'OR' logical operator has higher precedence than 'AND' operator. Parenthesis can override the default precedence. To be more general, the parsed expression should look like;
<left_expr> <logical_operator> <right_expr>
Another example would be
input_string = x > 7 AND x < 8 AND x = 4
parsed_expr = [[['x', '>', 7], 'AND', ['x', ',', 8]], 'AND', ['x', '=', 4]]
So far i came up with this simple solution which sadly cannot generate parsed expression in binary tree fashion. operatorPrecedence doesn't seem to have help me here where there is same logical operator consecutively as in previous example.
import pyparsing as pp
complex_expr = pp.Forward()
operator = pp.Regex(">=|<=|!=|>|<|=").setName("operator")
logical = (pp.Keyword("AND") | pp.Keyword("OR")).setName("logical")
vars = pp.Word(pp.alphas, pp.alphanums + "_") | pp.Regex(r"[+-]?\d+(:?\.\d*)?(:?[eE][+-]?\d+)?")
condition = (vars + operator + vars)
clause = pp.Group(condition ^ (pp.Suppress("(") + complex_expr + pp.Suppress(")") ))
expr = pp.operatorPrecedence(clause,[
("OR", 2, pp.opAssoc.LEFT, ),
("AND", 2, pp.opAssoc.LEFT, ),])
complex_expr << expr
print complex_expr.parseString("x > 7 AND x < 8 AND x = 4")
Any suggestions or guidance is well appreciated.
BNF for the expression (without parenthesis) could be
<expr> -> <expr> | <expr> <logical> <expr>
<expr> -> <opnd> <relational> <opnd>
<opnd> -> <variable> | <numeric>
<relational> -> <'>'> | <'='> | <'>='> | <'<='> | <'!='>

NOTE: the operatorPrecedence method of pyparsing is deprecated in favor of
the method name infixNotation.
Try changing:
expr = pp.operatorPrecedence(clause,[
("OR", 2, pp.opAssoc.LEFT, ),
("AND", 2, pp.opAssoc.LEFT, ),])
to:
expr = pp.operatorPrecedence(condition,[
("OR", 2, pp.opAssoc.LEFT, ),
("AND", 2, pp.opAssoc.LEFT, ),])
The first argument to operatorPrecedence is the primitive operand to be used with the operators - there is no need to include your complexExpr in parentheses - operatorPrecedence will do that for you. Since your operand is actually another deeper comparison, you might consider changing:
condition = (expr + operator + expr)
to:
condition = pp.Group(expr + operator + expr)
so that the output of operatorPrecedence is easier to process. With these changes, parsing x > 7 AND x < 8 OR x = 4 gives:
[[['x', '>', '7'], 'AND', [['x', '<', '8'], 'OR', ['x', '=', '4']]]]
which recognizes OR's higher precedence and groups it first. (Are you sure you want this order of AND and OR precedence? I think the traditional ordering is the reverse, as shown in this wikipedia entry.)
I think you are also asking why pyparsing and operatorPrecedence does not return the results in nested binary pairs, that is, you expect parsing "A and B and C" would return:
[['A', 'and', 'B'] 'and', 'C']
but what you get is:
['A', 'and', 'B', 'and', 'C']
That is because operatorPrecedence parses repeated operations at the same precedence level using repetition, not recursion. See this question which is very similar to yours, and whose answer includes a parse action to convert your repetitive parse tree to the more traditional binary parse tree. You can also find a sample boolean expression parser implemented using operatorPrecedence on the pyparsing wiki page.
EDIT:
To clarify, this is what I recommend you reduce your parser to:
import pyparsing as pp
operator = pp.Regex(">=|<=|!=|>|<|=").setName("operator")
number = pp.Regex(r"[+-]?\d+(:?\.\d*)?(:?[eE][+-]?\d+)?")
identifier = pp.Word(pp.alphas, pp.alphanums + "_")
comparison_term = identifier | number
condition = pp.Group(comparison_term + operator + comparison_term)
expr = pp.operatorPrecedence(condition,[
("AND", 2, pp.opAssoc.LEFT, ),
("OR", 2, pp.opAssoc.LEFT, ),
])
print expr.parseString("x > 7 AND x < 8 OR x = 4")
If support for NOT might also be something you want to add, then this would look like:
expr = pp.operatorPrecedence(condition,[
("NOT", 1, pp.opAssoc.RIGHT, ),
("AND", 2, pp.opAssoc.LEFT, ),
("OR", 2, pp.opAssoc.LEFT, ),
])
At some point, you may want to expand the definition of comparison_term with a more complete arithmetic expression, defined with its own operatorPrecedence definition. I would suggest doing it this way, rather than creating one monster opPrec definition, as you have already alluded to some of the performance downsides to opPrec. If you still get performance issues, look into ParserElement.enablePackrat.

Let me suggest this parsing approach, coming directly from Peter Norvig's class in design of computer programs at udacity (and tweaked for your needs).
from functools import update_wrapper
from string import split
import re
def grammar(description, whitespace=r'\s*'):
"""Convert a description to a grammar. Each line is a rule for a
non-terminal symbol; it looks like this:
Symbol => A1 A2 ... | B1 B2 ... | C1 C2 ...
where the right-hand side is one or more alternatives, separated by
the '|' sign. Each alternative is a sequence of atoms, separated by
spaces. An atom is either a symbol on some left-hand side, or it is
a regular expression that will be passed to re.match to match a token.
Notation for *, +, or ? not allowed in a rule alternative (but ok
within a token). Use '\' to continue long lines. You must include spaces
or tabs around '=>' and '|'. That's within the grammar description itself.
The grammar that gets defined allows whitespace between tokens by default;
specify '' as the second argument to grammar() to disallow this (or supply
any regular expression to describe allowable whitespace between tokens)."""
G = {' ': whitespace}
description = description.replace('\t', ' ') # no tabs!
for line in split(description, '\n'):
lhs, rhs = split(line, ' => ', 1)
alternatives = split(rhs, ' | ')
G[lhs] = tuple(map(split, alternatives))
return G
def decorator(d):
def _d(fn):
return update_wrapper(d(fn), fn)
update_wrapper(_d, d)
return _d
#decorator
def memo(f):
cache = {}
def _f(*args):
try:
return cache[args]
except KeyError:
cache[args] = result = f(*args)
return result
except TypeError:
# some element of args can't be a dict key
return f(args)
return _f
def parse(start_symbol, text, grammar):
"""Example call: parse('Exp', '3*x + b', G).
Returns a (tree, remainder) pair. If remainder is '', it parsed the whole
string. Failure iff remainder is None. This is a deterministic PEG parser,
so rule order (left-to-right) matters. Do 'E => T op E | T', putting the
longest parse first; don't do 'E => T | T op E'
Also, no left recursion allowed: don't do 'E => E op T'"""
tokenizer = grammar[' '] + '(%s)'
def parse_sequence(sequence, text):
result = []
for atom in sequence:
tree, text = parse_atom(atom, text)
if text is None: return Fail
result.append(tree)
return result, text
#memo
def parse_atom(atom, text):
if atom in grammar: # Non-Terminal: tuple of alternatives
for alternative in grammar[atom]:
tree, rem = parse_sequence(alternative, text)
if rem is not None: return [atom]+tree, rem
return Fail
else: # Terminal: match characters against start of text
m = re.match(tokenizer % atom, text)
return Fail if (not m) else (m.group(1), text[m.end():])
# Body of parse:
return parse_atom(start_symbol, text)
Fail = (None, None)
MyLang = grammar("""expression => block logicalop expression | block
block => variable operator number
variable => [a-z]+
operator => <=|>=|>|<|=
number => [-+]?[0-9]+
logicalop => AND|OR""", whitespace='\s*')
def parse_it(text):
return parse('expression', text, MyLang)
print parse_it("x > 7 AND x < 8 AND x = 4")
Outputs:
(['expression', ['block', ['variable', 'x'], ['operator', '>'], ['number', '7']], ['logicalop', 'AND'], ['expression', ['block', ['variable', 'x'], ['operator', '<'], ['number', '8']], ['logicalop', 'AND'], ['expression', ['block', ['variable', 'x'], ['operator', '='], ['number', '4']]]]], '')

Related

Python - how to assign 1 output to multiple inputs, to then use that output in an equation [duplicate]

I would like to use the .replace function to replace multiple strings.
I currently have
string.replace("condition1", "")
but would like to have something like
string.replace("condition1", "").replace("condition2", "text")
although that does not feel like good syntax
what is the proper way to do this? kind of like how in grep/regex you can do \1 and \2 to replace fields to certain search strings

Here is a short example that should do the trick with regular expressions:
import re
rep = {"condition1": "", "condition2": "text"} # define desired replacements here
# use these three lines to do the replacement
rep = dict((re.escape(k), v) for k, v in rep.iteritems())
#Python 3 renamed dict.iteritems to dict.items so use rep.items() for latest versions
pattern = re.compile("|".join(rep.keys()))
text = pattern.sub(lambda m: rep[re.escape(m.group(0))], text)
For example:
>>> pattern.sub(lambda m: rep[re.escape(m.group(0))], "(condition1) and --condition2--")
'() and --text--'

You could just make a nice little looping function.
def replace_all(text, dic):
for i, j in dic.iteritems():
text = text.replace(i, j)
return text
where text is the complete string and dic is a dictionary — each definition is a string that will replace a match to the term.
Note: in Python 3, iteritems() has been replaced with items()
Careful: Python dictionaries don't have a reliable order for iteration. This solution only solves your problem if:
order of replacements is irrelevant
it's ok for a replacement to change the results of previous replacements
Update: The above statement related to ordering of insertion does not apply to Python versions greater than or equal to 3.6, as standard dicts were changed to use insertion ordering for iteration.
For instance:
d = { "cat": "dog", "dog": "pig"}
my_sentence = "This is my cat and this is my dog."
replace_all(my_sentence, d)
print(my_sentence)
Possible output #1:
"This is my pig and this is my pig."
Possible output #2
"This is my dog and this is my pig."
One possible fix is to use an OrderedDict.
from collections import OrderedDict
def replace_all(text, dic):
for i, j in dic.items():
text = text.replace(i, j)
return text
od = OrderedDict([("cat", "dog"), ("dog", "pig")])
my_sentence = "This is my cat and this is my dog."
replace_all(my_sentence, od)
print(my_sentence)
Output:
"This is my pig and this is my pig."
Careful #2: Inefficient if your text string is too big or there are many pairs in the dictionary.

Why not one solution like this?
s = "The quick brown fox jumps over the lazy dog"
for r in (("brown", "red"), ("lazy", "quick")):
s = s.replace(*r)
#output will be: The quick red fox jumps over the quick dog

Here is a variant of the first solution using reduce, in case you like being functional. :)
repls = {'hello' : 'goodbye', 'world' : 'earth'}
s = 'hello, world'
reduce(lambda a, kv: a.replace(*kv), repls.iteritems(), s)
martineau's even better version:
repls = ('hello', 'goodbye'), ('world', 'earth')
s = 'hello, world'
reduce(lambda a, kv: a.replace(*kv), repls, s)

This is just a more concise recap of F.J and MiniQuark great answers and last but decisive improvement by bgusach. All you need to achieve multiple simultaneous string replacements is the following function:
def multiple_replace(string, rep_dict):
pattern = re.compile("|".join([re.escape(k) for k in sorted(rep_dict,key=len,reverse=True)]), flags=re.DOTALL)
return pattern.sub(lambda x: rep_dict[x.group(0)], string)
Usage:
>>>multiple_replace("Do you like cafe? No, I prefer tea.", {'cafe':'tea', 'tea':'cafe', 'like':'prefer'})
'Do you prefer tea? No, I prefer cafe.'
If you wish, you can make your own dedicated replacement functions starting from this simpler one.

Starting Python 3.8, and the introduction of assignment expressions (PEP 572) (:= operator), we can apply the replacements within a list comprehension:
# text = "The quick brown fox jumps over the lazy dog"
# replacements = [("brown", "red"), ("lazy", "quick")]
[text := text.replace(a, b) for a, b in replacements]
# text = 'The quick red fox jumps over the quick dog'

I built this upon F.J.s excellent answer:
import re
def multiple_replacer(*key_values):
replace_dict = dict(key_values)
replacement_function = lambda match: replace_dict[match.group(0)]
pattern = re.compile("|".join([re.escape(k) for k, v in key_values]), re.M)
return lambda string: pattern.sub(replacement_function, string)
def multiple_replace(string, *key_values):
return multiple_replacer(*key_values)(string)
One shot usage:
>>> replacements = (u"café", u"tea"), (u"tea", u"café"), (u"like", u"love")
>>> print multiple_replace(u"Do you like café? No, I prefer tea.", *replacements)
Do you love tea? No, I prefer café.
Note that since replacement is done in just one pass, "café" changes to "tea", but it does not change back to "café".
If you need to do the same replacement many times, you can create a replacement function easily:
>>> my_escaper = multiple_replacer(('"','\\"'), ('\t', '\\t'))
>>> many_many_strings = (u'This text will be escaped by "my_escaper"',
u'Does this work?\tYes it does',
u'And can we span\nmultiple lines?\t"Yes\twe\tcan!"')
>>> for line in many_many_strings:
... print my_escaper(line)
...
This text will be escaped by \"my_escaper\"
Does this work?\tYes it does
And can we span
multiple lines?\t\"Yes\twe\tcan!\"
Improvements:
turned code into a function
added multiline support
fixed a bug in escaping
easy to create a function for a specific multiple replacement
Enjoy! :-)

I would like to propose the usage of string templates. Just place the string to be replaced in a dictionary and all is set! Example from docs.python.org
>>> from string import Template
>>> s = Template('$who likes $what')
>>> s.substitute(who='tim', what='kung pao')
'tim likes kung pao'
>>> d = dict(who='tim')
>>> Template('Give $who $100').substitute(d)
Traceback (most recent call last):
[...]
ValueError: Invalid placeholder in string: line 1, col 10
>>> Template('$who likes $what').substitute(d)
Traceback (most recent call last):
[...]
KeyError: 'what'
>>> Template('$who likes $what').safe_substitute(d)
'tim likes $what'

In my case, I needed a simple replacing of unique keys with names, so I thought this up:
a = 'This is a test string.'
b = {'i': 'I', 's': 'S'}
for x,y in b.items():
a = a.replace(x, y)
>>> a
'ThIS IS a teSt StrIng.'

Here my $0.02. It is based on Andrew Clark's answer, just a little bit clearer, and it also covers the case when a string to replace is a substring of another string to replace (longer string wins)
def multireplace(string, replacements):
"""
Given a string and a replacement map, it returns the replaced string.
:param str string: string to execute replacements on
:param dict replacements: replacement dictionary {value to find: value to replace}
:rtype: str
"""
# Place longer ones first to keep shorter substrings from matching
# where the longer ones should take place
# For instance given the replacements {'ab': 'AB', 'abc': 'ABC'} against
# the string 'hey abc', it should produce 'hey ABC' and not 'hey ABc'
substrs = sorted(replacements, key=len, reverse=True)
# Create a big OR regex that matches any of the substrings to replace
regexp = re.compile('|'.join(map(re.escape, substrs)))
# For each match, look up the new string in the replacements
return regexp.sub(lambda match: replacements[match.group(0)], string)
It is in this this gist, feel free to modify it if you have any proposal.

I needed a solution where the strings to be replaced can be a regular expressions,
for example to help in normalizing a long text by replacing multiple whitespace characters with a single one. Building on a chain of answers from others, including MiniQuark and mmj, this is what I came up with:
def multiple_replace(string, reps, re_flags = 0):
""" Transforms string, replacing keys from re_str_dict with values.
reps: dictionary, or list of key-value pairs (to enforce ordering;
earlier items have higher priority).
Keys are used as regular expressions.
re_flags: interpretation of regular expressions, such as re.DOTALL
"""
if isinstance(reps, dict):
reps = reps.items()
pattern = re.compile("|".join("(?P<_%d>%s)" % (i, re_str[0])
for i, re_str in enumerate(reps)),
re_flags)
return pattern.sub(lambda x: reps[int(x.lastgroup[1:])][1], string)
It works for the examples given in other answers, for example:
>>> multiple_replace("(condition1) and --condition2--",
... {"condition1": "", "condition2": "text"})
'() and --text--'
>>> multiple_replace('hello, world', {'hello' : 'goodbye', 'world' : 'earth'})
'goodbye, earth'
>>> multiple_replace("Do you like cafe? No, I prefer tea.",
... {'cafe': 'tea', 'tea': 'cafe', 'like': 'prefer'})
'Do you prefer tea? No, I prefer cafe.'
The main thing for me is that you can use regular expressions as well, for example to replace whole words only, or to normalize white space:
>>> s = "I don't want to change this name:\n Philip II of Spain"
>>> re_str_dict = {r'\bI\b': 'You', r'[\n\t ]+': ' '}
>>> multiple_replace(s, re_str_dict)
"You don't want to change this name: Philip II of Spain"
If you want to use the dictionary keys as normal strings,
you can escape those before calling multiple_replace using e.g. this function:
def escape_keys(d):
""" transform dictionary d by applying re.escape to the keys """
return dict((re.escape(k), v) for k, v in d.items())
>>> multiple_replace(s, escape_keys(re_str_dict))
"I don't want to change this name:\n Philip II of Spain"
The following function can help in finding erroneous regular expressions among your dictionary keys (since the error message from multiple_replace isn't very telling):
def check_re_list(re_list):
""" Checks if each regular expression in list is well-formed. """
for i, e in enumerate(re_list):
try:
re.compile(e)
except (TypeError, re.error):
print("Invalid regular expression string "
"at position {}: '{}'".format(i, e))
>>> check_re_list(re_str_dict.keys())
Note that it does not chain the replacements, instead performs them simultaneously. This makes it more efficient without constraining what it can do. To mimic the effect of chaining, you may just need to add more string-replacement pairs and ensure the expected ordering of the pairs:
>>> multiple_replace("button", {"but": "mut", "mutton": "lamb"})
'mutton'
>>> multiple_replace("button", [("button", "lamb"),
... ("but", "mut"), ("mutton", "lamb")])
'lamb'

Note: Test your case, see comments.
Here's a sample which is more efficient on long strings with many small replacements.
source = "Here is foo, it does moo!"
replacements = {
'is': 'was', # replace 'is' with 'was'
'does': 'did',
'!': '?'
}
def replace(source, replacements):
finder = re.compile("|".join(re.escape(k) for k in replacements.keys())) # matches every string we want replaced
result = []
pos = 0
while True:
match = finder.search(source, pos)
if match:
# cut off the part up until match
result.append(source[pos : match.start()])
# cut off the matched part and replace it in place
result.append(replacements[source[match.start() : match.end()]])
pos = match.end()
else:
# the rest after the last match
result.append(source[pos:])
break
return "".join(result)
print replace(source, replacements)
The point is in avoiding many concatenations of long strings. We chop the source string to fragments, replacing some of the fragments as we form the list, and then join the whole thing back into a string.

I was doing a similar exercise in one of my school homework. This was my solution
dictionary = {1: ['hate', 'love'],
2: ['salad', 'burger'],
3: ['vegetables', 'pizza']}
def normalize(text):
for i in dictionary:
text = text.replace(dictionary[i][0], dictionary[i][1])
return text
See result yourself on test string
string_to_change = 'I hate salad and vegetables'
print(normalize(string_to_change))

You can use the pandas library and the replace function which supports both exact matches as well as regex replacements. For example:
df = pd.DataFrame({'text': ['Billy is going to visit Rome in November', 'I was born in 10/10/2010', 'I will be there at 20:00']})
to_replace=['Billy','Rome','January|February|March|April|May|June|July|August|September|October|November|December', '\d{2}:\d{2}', '\d{2}/\d{2}/\d{4}']
replace_with=['name','city','month','time', 'date']
print(df.text.replace(to_replace, replace_with, regex=True))
And the modified text is:
0 name is going to visit city in month
1 I was born in date
2 I will be there at time
You can find an example here. Notice that the replacements on the text are done with the order they appear in the lists

I was struggling with this problem as well. With many substitutions regular expressions struggle, and are about four times slower than looping string.replace (in my experiment conditions).
You should absolutely try using the Flashtext library (blog post here, Github here). In my case it was a bit over two orders of magnitude faster, from 1.8 s to 0.015 s (regular expressions took 7.7 s) for each document.
It is easy to find use examples in the links above, but this is a working example:
from flashtext import KeywordProcessor
self.processor = KeywordProcessor(case_sensitive=False)
for k, v in self.my_dict.items():
self.processor.add_keyword(k, v)
new_string = self.processor.replace_keywords(string)
Note that Flashtext makes substitutions in a single pass (to avoid a --> b and b --> c translating 'a' into 'c'). Flashtext also looks for whole words (so 'is' will not match 'this'). It works fine if your target is several words (replacing 'This is' by 'Hello').

I face similar problem today, where I had to do use .replace() method multiple times but it didn't feel good to me. So I did something like this:
REPLACEMENTS = {'<': '<', '>': '>', '&': '&'}
event_title = ''.join([REPLACEMENTS.get(c,c) for c in event['summary']])

I feel this question needs a single-line recursive lambda function answer for completeness, just because. So there:
>>> mrep = lambda s, d: s if not d else mrep(s.replace(*d.popitem()), d)
Usage:
>>> mrep('abcabc', {'a': '1', 'c': '2'})
'1b21b2'
Notes:
This consumes the input dictionary.
Python dicts preserve key order as of 3.6; corresponding caveats in other answers are not relevant anymore. For backward compatibility one could resort to a tuple-based version:
>>> mrep = lambda s, d: s if not d else mrep(s.replace(*d.pop()), d)
>>> mrep('abcabc', [('a', '1'), ('c', '2')])
Note: As with all recursive functions in python, too large recursion depth (i.e. too large replacement dictionaries) will result in an error. See e.g. here.

You should really not do it this way, but I just find it way too cool:
>>> replacements = {'cond1':'text1', 'cond2':'text2'}
>>> cmd = 'answer = s'
>>> for k,v in replacements.iteritems():
>>> cmd += ".replace(%s, %s)" %(k,v)
>>> exec(cmd)
Now, answer is the result of all the replacements in turn
again, this is very hacky and is not something that you should be using regularly. But it's just nice to know that you can do something like this if you ever need to.

For replace only one character, use the translate and str.maketrans is my favorite method.
tl;dr > result_string = your_string.translate(str.maketrans(dict_mapping))
demo
my_string = 'This is a test string.'
dict_mapping = {'i': 's', 's': 'S'}
result_good = my_string.translate(str.maketrans(dict_mapping))
result_bad = my_string
for x, y in dict_mapping.items():
result_bad = result_bad.replace(x, y)
print(result_good) # ThsS sS a teSt Strsng.
print(result_bad) # ThSS SS a teSt StrSng.

I don't know about speed but this is my workaday quick fix:
reduce(lambda a, b: a.replace(*b)
, [('o','W'), ('t','X')] #iterable of pairs: (oldval, newval)
, 'tomato' #The string from which to replace values
)
... but I like the #1 regex answer above. Note - if one new value is a substring of another one then the operation is not commutative.

Here is a version with support for basic regex replacement. The main restriction is that expressions must not contain subgroups, and there may be some edge cases:
Code based on #bgusach and others
import re
class StringReplacer:
def __init__(self, replacements, ignore_case=False):
patterns = sorted(replacements, key=len, reverse=True)
self.replacements = [replacements[k] for k in patterns]
re_mode = re.IGNORECASE if ignore_case else 0
self.pattern = re.compile('|'.join(("({})".format(p) for p in patterns)), re_mode)
def tr(matcher):
index = next((index for index,value in enumerate(matcher.groups()) if value), None)
return self.replacements[index]
self.tr = tr
def __call__(self, string):
return self.pattern.sub(self.tr, string)
Tests
table = {
"aaa" : "[This is three a]",
"b+" : "[This is one or more b]",
r"<\w+>" : "[This is a tag]"
}
replacer = StringReplacer(table, True)
sample1 = "whatever bb, aaa, <star> BBB <end>"
print(replacer(sample1))
# output:
# whatever [This is one or more b], [This is three a], [This is a tag] [This is one or more b] [This is a tag]
The trick is to identify the matched group by its position. It is not super efficient (O(n)), but it works.
index = next((index for index,value in enumerate(matcher.groups()) if value), None)
Replacement is done in one pass.

Starting from the precious answer of Andrew i developed a script that loads the dictionary from a file and elaborates all the files on the opened folder to do the replacements. The script loads the mappings from an external file in which you can set the separator. I'm a beginner but i found this script very useful when doing multiple substitutions in multiple files. It loaded a dictionary with more than 1000 entries in seconds. It is not elegant but it worked for me
import glob
import re
mapfile = input("Enter map file name with extension eg. codifica.txt: ")
sep = input("Enter map file column separator eg. |: ")
mask = input("Enter search mask with extension eg. 2010*txt for all files to be processed: ")
suff = input("Enter suffix with extension eg. _NEW.txt for newly generated files: ")
rep = {} # creation of empy dictionary
with open(mapfile) as temprep: # loading of definitions in the dictionary using input file, separator is prompted
for line in temprep:
(key, val) = line.strip('\n').split(sep)
rep[key] = val
for filename in glob.iglob(mask): # recursion on all the files with the mask prompted
with open (filename, "r") as textfile: # load each file in the variable text
text = textfile.read()
# start replacement
#rep = dict((re.escape(k), v) for k, v in rep.items()) commented to enable the use in the mapping of re reserved characters
pattern = re.compile("|".join(rep.keys()))
text = pattern.sub(lambda m: rep[m.group(0)], text)
#write of te output files with the prompted suffice
target = open(filename[:-4]+"_NEW.txt", "w")
target.write(text)
target.close()

this is my solution to the problem. I used it in a chatbot to replace the different words at once.
def mass_replace(text, dct):
new_string = ""
old_string = text
while len(old_string) > 0:
s = ""
sk = ""
for k in dct.keys():
if old_string.startswith(k):
s = dct[k]
sk = k
if s:
new_string+=s
old_string = old_string[len(sk):]
else:
new_string+=old_string[0]
old_string = old_string[1:]
return new_string
print mass_replace("The dog hunts the cat", {"dog":"cat", "cat":"dog"})
this will become The cat hunts the dog

Another example :
Input list
error_list = ['[br]', '[ex]', 'Something']
words = ['how', 'much[ex]', 'is[br]', 'the', 'fish[br]', 'noSomething', 'really']
The desired output would be
words = ['how', 'much', 'is', 'the', 'fish', 'no', 'really']
Code :
[n[0][0] if len(n[0]) else n[1] for n in [[[w.replace(e,"") for e in error_list if e in w],w] for w in words]]

My approach would be to first tokenize the string, then decide for each token whether to include it or not.
Potentially, might be more performant, if we can assume O(1) lookup for a hashmap/set:
remove_words = {"we", "this"}
target_sent = "we should modify this string"
target_sent_words = target_sent.split()
filtered_sent = " ".join(list(filter(lambda word: word not in remove_words, target_sent_words)))
filtered_sent is now 'should modify string'

Or just for a fast hack:
for line in to_read:
read_buffer = line
stripped_buffer1 = read_buffer.replace("term1", " ")
stripped_buffer2 = stripped_buffer1.replace("term2", " ")
write_to_file = to_write.write(stripped_buffer2)

Here is another way of doing it with a dictionary:
listA="The cat jumped over the house".split()
modify = {word:word for number,word in enumerate(listA)}
modify["cat"],modify["jumped"]="dog","walked"
print " ".join(modify[x] for x in listA)

sentence='its some sentence with a something text'
def replaceAll(f,Array1,Array2):
if len(Array1)==len(Array2):
for x in range(len(Array1)):
return f.replace(Array1[x],Array2[x])
newSentence=replaceAll(sentence,['a','sentence','something'],['another','sentence','something something'])
print(newSentence)

pyparsing -- extending fourFn.py example with function

The fourFn.py example from pyparsing has a dictionary where all the supported functions are listed. I'm trying to add a definition to call the statistics mean function:
fn = {
"sin": math.sin,
"cos": math.cos,
"tan": math.tan,
"exp": math.exp,
"abs": abs,
# added function, mean imported from statistics
"mean": mean,
"trunc": lambda a: int(a),
"round": round,
"sgn": lambda a: -1 if a < -epsilon else 1 if a > epsilon else 0,
}
The mean function takes an iterable object as its argument, but apparently the grammar needs to be extended to recognize a list literal. When I try to test it:
test("mean([2, 5, 6])", mean([2, 5, 6]))
I get the following error:
mean([2, 5, 6]) failed eval: Expected {{{{{{{{W:(ABCD..., ABCD...)
Suppress:("(") -} Group:(Group:(Forward: None) [, Group:(Forward:
None)]...)} Suppress:(")")} | "PI"} | "E"} | Re:('[+-]?\\d+(?:\\.\\d*)?(?:
[eE][+-]?\\d+)?')} | W:(ABCD..., ABCD...)} | Group:({{Suppress:("(") :
...} Suppress:(")")})}, found '[' (at char 5), (line:1, col:6) []
I've tried adding a grammar definition for a list:
nlist = "[" + delimitedList(Group(expr)) + "]"
And modified the fn_call defintion:
fn_call = (ident + lpar - Group(expr_list | nlist) + rpar).setParseAction(
lambda t: t.insert(0, (t.pop(0), len(t[0])))
)
This only returns this error:
mean([2, 5, 6]) failed eval: pop from empty list ['2', '5', '6', ('mean', 5)]
So, I guess I don't have the nlist definition quite right. Any suggestions on how I might get this to work?
Thanks

Replace values in string without replacing new values [duplicate]

I would like to use the .replace function to replace multiple strings.
I currently have
string.replace("condition1", "")
but would like to have something like
string.replace("condition1", "").replace("condition2", "text")
although that does not feel like good syntax
what is the proper way to do this? kind of like how in grep/regex you can do \1 and \2 to replace fields to certain search strings

Here is a short example that should do the trick with regular expressions:
import re
rep = {"condition1": "", "condition2": "text"} # define desired replacements here
# use these three lines to do the replacement
rep = dict((re.escape(k), v) for k, v in rep.iteritems())
#Python 3 renamed dict.iteritems to dict.items so use rep.items() for latest versions
pattern = re.compile("|".join(rep.keys()))
text = pattern.sub(lambda m: rep[re.escape(m.group(0))], text)
For example:
>>> pattern.sub(lambda m: rep[re.escape(m.group(0))], "(condition1) and --condition2--")
'() and --text--'

You could just make a nice little looping function.
def replace_all(text, dic):
for i, j in dic.iteritems():
text = text.replace(i, j)
return text
where text is the complete string and dic is a dictionary — each definition is a string that will replace a match to the term.
Note: in Python 3, iteritems() has been replaced with items()
Careful: Python dictionaries don't have a reliable order for iteration. This solution only solves your problem if:
order of replacements is irrelevant
it's ok for a replacement to change the results of previous replacements
Update: The above statement related to ordering of insertion does not apply to Python versions greater than or equal to 3.6, as standard dicts were changed to use insertion ordering for iteration.
For instance:
d = { "cat": "dog", "dog": "pig"}
my_sentence = "This is my cat and this is my dog."
replace_all(my_sentence, d)
print(my_sentence)
Possible output #1:
"This is my pig and this is my pig."
Possible output #2
"This is my dog and this is my pig."
One possible fix is to use an OrderedDict.
from collections import OrderedDict
def replace_all(text, dic):
for i, j in dic.items():
text = text.replace(i, j)
return text
od = OrderedDict([("cat", "dog"), ("dog", "pig")])
my_sentence = "This is my cat and this is my dog."
replace_all(my_sentence, od)
print(my_sentence)
Output:
"This is my pig and this is my pig."
Careful #2: Inefficient if your text string is too big or there are many pairs in the dictionary.

Why not one solution like this?
s = "The quick brown fox jumps over the lazy dog"
for r in (("brown", "red"), ("lazy", "quick")):
s = s.replace(*r)
#output will be: The quick red fox jumps over the quick dog

Here is a variant of the first solution using reduce, in case you like being functional. :)
repls = {'hello' : 'goodbye', 'world' : 'earth'}
s = 'hello, world'
reduce(lambda a, kv: a.replace(*kv), repls.iteritems(), s)
martineau's even better version:
repls = ('hello', 'goodbye'), ('world', 'earth')
s = 'hello, world'
reduce(lambda a, kv: a.replace(*kv), repls, s)

This is just a more concise recap of F.J and MiniQuark great answers and last but decisive improvement by bgusach. All you need to achieve multiple simultaneous string replacements is the following function:
def multiple_replace(string, rep_dict):
pattern = re.compile("|".join([re.escape(k) for k in sorted(rep_dict,key=len,reverse=True)]), flags=re.DOTALL)
return pattern.sub(lambda x: rep_dict[x.group(0)], string)
Usage:
>>>multiple_replace("Do you like cafe? No, I prefer tea.", {'cafe':'tea', 'tea':'cafe', 'like':'prefer'})
'Do you prefer tea? No, I prefer cafe.'
If you wish, you can make your own dedicated replacement functions starting from this simpler one.

Starting Python 3.8, and the introduction of assignment expressions (PEP 572) (:= operator), we can apply the replacements within a list comprehension:
# text = "The quick brown fox jumps over the lazy dog"
# replacements = [("brown", "red"), ("lazy", "quick")]
[text := text.replace(a, b) for a, b in replacements]
# text = 'The quick red fox jumps over the quick dog'

I built this upon F.J.s excellent answer:
import re
def multiple_replacer(*key_values):
replace_dict = dict(key_values)
replacement_function = lambda match: replace_dict[match.group(0)]
pattern = re.compile("|".join([re.escape(k) for k, v in key_values]), re.M)
return lambda string: pattern.sub(replacement_function, string)
def multiple_replace(string, *key_values):
return multiple_replacer(*key_values)(string)
One shot usage:
>>> replacements = (u"café", u"tea"), (u"tea", u"café"), (u"like", u"love")
>>> print multiple_replace(u"Do you like café? No, I prefer tea.", *replacements)
Do you love tea? No, I prefer café.
Note that since replacement is done in just one pass, "café" changes to "tea", but it does not change back to "café".
If you need to do the same replacement many times, you can create a replacement function easily:
>>> my_escaper = multiple_replacer(('"','\\"'), ('\t', '\\t'))
>>> many_many_strings = (u'This text will be escaped by "my_escaper"',
u'Does this work?\tYes it does',
u'And can we span\nmultiple lines?\t"Yes\twe\tcan!"')
>>> for line in many_many_strings:
... print my_escaper(line)
...
This text will be escaped by \"my_escaper\"
Does this work?\tYes it does
And can we span
multiple lines?\t\"Yes\twe\tcan!\"
Improvements:
turned code into a function
added multiline support
fixed a bug in escaping
easy to create a function for a specific multiple replacement
Enjoy! :-)

I would like to propose the usage of string templates. Just place the string to be replaced in a dictionary and all is set! Example from docs.python.org
>>> from string import Template
>>> s = Template('$who likes $what')
>>> s.substitute(who='tim', what='kung pao')
'tim likes kung pao'
>>> d = dict(who='tim')
>>> Template('Give $who $100').substitute(d)
Traceback (most recent call last):
[...]
ValueError: Invalid placeholder in string: line 1, col 10
>>> Template('$who likes $what').substitute(d)
Traceback (most recent call last):
[...]
KeyError: 'what'
>>> Template('$who likes $what').safe_substitute(d)
'tim likes $what'

In my case, I needed a simple replacing of unique keys with names, so I thought this up:
a = 'This is a test string.'
b = {'i': 'I', 's': 'S'}
for x,y in b.items():
a = a.replace(x, y)
>>> a
'ThIS IS a teSt StrIng.'

Here my $0.02. It is based on Andrew Clark's answer, just a little bit clearer, and it also covers the case when a string to replace is a substring of another string to replace (longer string wins)
def multireplace(string, replacements):
"""
Given a string and a replacement map, it returns the replaced string.
:param str string: string to execute replacements on
:param dict replacements: replacement dictionary {value to find: value to replace}
:rtype: str
"""
# Place longer ones first to keep shorter substrings from matching
# where the longer ones should take place
# For instance given the replacements {'ab': 'AB', 'abc': 'ABC'} against
# the string 'hey abc', it should produce 'hey ABC' and not 'hey ABc'
substrs = sorted(replacements, key=len, reverse=True)
# Create a big OR regex that matches any of the substrings to replace
regexp = re.compile('|'.join(map(re.escape, substrs)))
# For each match, look up the new string in the replacements
return regexp.sub(lambda match: replacements[match.group(0)], string)
It is in this this gist, feel free to modify it if you have any proposal.

I needed a solution where the strings to be replaced can be a regular expressions,
for example to help in normalizing a long text by replacing multiple whitespace characters with a single one. Building on a chain of answers from others, including MiniQuark and mmj, this is what I came up with:
def multiple_replace(string, reps, re_flags = 0):
""" Transforms string, replacing keys from re_str_dict with values.
reps: dictionary, or list of key-value pairs (to enforce ordering;
earlier items have higher priority).
Keys are used as regular expressions.
re_flags: interpretation of regular expressions, such as re.DOTALL
"""
if isinstance(reps, dict):
reps = reps.items()
pattern = re.compile("|".join("(?P<_%d>%s)" % (i, re_str[0])
for i, re_str in enumerate(reps)),
re_flags)
return pattern.sub(lambda x: reps[int(x.lastgroup[1:])][1], string)
It works for the examples given in other answers, for example:
>>> multiple_replace("(condition1) and --condition2--",
... {"condition1": "", "condition2": "text"})
'() and --text--'
>>> multiple_replace('hello, world', {'hello' : 'goodbye', 'world' : 'earth'})
'goodbye, earth'
>>> multiple_replace("Do you like cafe? No, I prefer tea.",
... {'cafe': 'tea', 'tea': 'cafe', 'like': 'prefer'})
'Do you prefer tea? No, I prefer cafe.'
The main thing for me is that you can use regular expressions as well, for example to replace whole words only, or to normalize white space:
>>> s = "I don't want to change this name:\n Philip II of Spain"
>>> re_str_dict = {r'\bI\b': 'You', r'[\n\t ]+': ' '}
>>> multiple_replace(s, re_str_dict)
"You don't want to change this name: Philip II of Spain"
If you want to use the dictionary keys as normal strings,
you can escape those before calling multiple_replace using e.g. this function:
def escape_keys(d):
""" transform dictionary d by applying re.escape to the keys """
return dict((re.escape(k), v) for k, v in d.items())
>>> multiple_replace(s, escape_keys(re_str_dict))
"I don't want to change this name:\n Philip II of Spain"
The following function can help in finding erroneous regular expressions among your dictionary keys (since the error message from multiple_replace isn't very telling):
def check_re_list(re_list):
""" Checks if each regular expression in list is well-formed. """
for i, e in enumerate(re_list):
try:
re.compile(e)
except (TypeError, re.error):
print("Invalid regular expression string "
"at position {}: '{}'".format(i, e))
>>> check_re_list(re_str_dict.keys())
Note that it does not chain the replacements, instead performs them simultaneously. This makes it more efficient without constraining what it can do. To mimic the effect of chaining, you may just need to add more string-replacement pairs and ensure the expected ordering of the pairs:
>>> multiple_replace("button", {"but": "mut", "mutton": "lamb"})
'mutton'
>>> multiple_replace("button", [("button", "lamb"),
... ("but", "mut"), ("mutton", "lamb")])
'lamb'

Note: Test your case, see comments.
Here's a sample which is more efficient on long strings with many small replacements.
source = "Here is foo, it does moo!"
replacements = {
'is': 'was', # replace 'is' with 'was'
'does': 'did',
'!': '?'
}
def replace(source, replacements):
finder = re.compile("|".join(re.escape(k) for k in replacements.keys())) # matches every string we want replaced
result = []
pos = 0
while True:
match = finder.search(source, pos)
if match:
# cut off the part up until match
result.append(source[pos : match.start()])
# cut off the matched part and replace it in place
result.append(replacements[source[match.start() : match.end()]])
pos = match.end()
else:
# the rest after the last match
result.append(source[pos:])
break
return "".join(result)
print replace(source, replacements)
The point is in avoiding many concatenations of long strings. We chop the source string to fragments, replacing some of the fragments as we form the list, and then join the whole thing back into a string.

I was doing a similar exercise in one of my school homework. This was my solution
dictionary = {1: ['hate', 'love'],
2: ['salad', 'burger'],
3: ['vegetables', 'pizza']}
def normalize(text):
for i in dictionary:
text = text.replace(dictionary[i][0], dictionary[i][1])
return text
See result yourself on test string
string_to_change = 'I hate salad and vegetables'
print(normalize(string_to_change))

You can use the pandas library and the replace function which supports both exact matches as well as regex replacements. For example:
df = pd.DataFrame({'text': ['Billy is going to visit Rome in November', 'I was born in 10/10/2010', 'I will be there at 20:00']})
to_replace=['Billy','Rome','January|February|March|April|May|June|July|August|September|October|November|December', '\d{2}:\d{2}', '\d{2}/\d{2}/\d{4}']
replace_with=['name','city','month','time', 'date']
print(df.text.replace(to_replace, replace_with, regex=True))
And the modified text is:
0 name is going to visit city in month
1 I was born in date
2 I will be there at time
You can find an example here. Notice that the replacements on the text are done with the order they appear in the lists

I was struggling with this problem as well. With many substitutions regular expressions struggle, and are about four times slower than looping string.replace (in my experiment conditions).
You should absolutely try using the Flashtext library (blog post here, Github here). In my case it was a bit over two orders of magnitude faster, from 1.8 s to 0.015 s (regular expressions took 7.7 s) for each document.
It is easy to find use examples in the links above, but this is a working example:
from flashtext import KeywordProcessor
self.processor = KeywordProcessor(case_sensitive=False)
for k, v in self.my_dict.items():
self.processor.add_keyword(k, v)
new_string = self.processor.replace_keywords(string)
Note that Flashtext makes substitutions in a single pass (to avoid a --> b and b --> c translating 'a' into 'c'). Flashtext also looks for whole words (so 'is' will not match 'this'). It works fine if your target is several words (replacing 'This is' by 'Hello').

I face similar problem today, where I had to do use .replace() method multiple times but it didn't feel good to me. So I did something like this:
REPLACEMENTS = {'<': '<', '>': '>', '&': '&'}
event_title = ''.join([REPLACEMENTS.get(c,c) for c in event['summary']])

I feel this question needs a single-line recursive lambda function answer for completeness, just because. So there:
>>> mrep = lambda s, d: s if not d else mrep(s.replace(*d.popitem()), d)
Usage:
>>> mrep('abcabc', {'a': '1', 'c': '2'})
'1b21b2'
Notes:
This consumes the input dictionary.
Python dicts preserve key order as of 3.6; corresponding caveats in other answers are not relevant anymore. For backward compatibility one could resort to a tuple-based version:
>>> mrep = lambda s, d: s if not d else mrep(s.replace(*d.pop()), d)
>>> mrep('abcabc', [('a', '1'), ('c', '2')])
Note: As with all recursive functions in python, too large recursion depth (i.e. too large replacement dictionaries) will result in an error. See e.g. here.

You should really not do it this way, but I just find it way too cool:
>>> replacements = {'cond1':'text1', 'cond2':'text2'}
>>> cmd = 'answer = s'
>>> for k,v in replacements.iteritems():
>>> cmd += ".replace(%s, %s)" %(k,v)
>>> exec(cmd)
Now, answer is the result of all the replacements in turn
again, this is very hacky and is not something that you should be using regularly. But it's just nice to know that you can do something like this if you ever need to.

For replace only one character, use the translate and str.maketrans is my favorite method.
tl;dr > result_string = your_string.translate(str.maketrans(dict_mapping))
demo
my_string = 'This is a test string.'
dict_mapping = {'i': 's', 's': 'S'}
result_good = my_string.translate(str.maketrans(dict_mapping))
result_bad = my_string
for x, y in dict_mapping.items():
result_bad = result_bad.replace(x, y)
print(result_good) # ThsS sS a teSt Strsng.
print(result_bad) # ThSS SS a teSt StrSng.

I don't know about speed but this is my workaday quick fix:
reduce(lambda a, b: a.replace(*b)
, [('o','W'), ('t','X')] #iterable of pairs: (oldval, newval)
, 'tomato' #The string from which to replace values
)
... but I like the #1 regex answer above. Note - if one new value is a substring of another one then the operation is not commutative.

Here is a version with support for basic regex replacement. The main restriction is that expressions must not contain subgroups, and there may be some edge cases:
Code based on #bgusach and others
import re
class StringReplacer:
def __init__(self, replacements, ignore_case=False):
patterns = sorted(replacements, key=len, reverse=True)
self.replacements = [replacements[k] for k in patterns]
re_mode = re.IGNORECASE if ignore_case else 0
self.pattern = re.compile('|'.join(("({})".format(p) for p in patterns)), re_mode)
def tr(matcher):
index = next((index for index,value in enumerate(matcher.groups()) if value), None)
return self.replacements[index]
self.tr = tr
def __call__(self, string):
return self.pattern.sub(self.tr, string)
Tests
table = {
"aaa" : "[This is three a]",
"b+" : "[This is one or more b]",
r"<\w+>" : "[This is a tag]"
}
replacer = StringReplacer(table, True)
sample1 = "whatever bb, aaa, <star> BBB <end>"
print(replacer(sample1))
# output:
# whatever [This is one or more b], [This is three a], [This is a tag] [This is one or more b] [This is a tag]
The trick is to identify the matched group by its position. It is not super efficient (O(n)), but it works.
index = next((index for index,value in enumerate(matcher.groups()) if value), None)
Replacement is done in one pass.

Starting from the precious answer of Andrew i developed a script that loads the dictionary from a file and elaborates all the files on the opened folder to do the replacements. The script loads the mappings from an external file in which you can set the separator. I'm a beginner but i found this script very useful when doing multiple substitutions in multiple files. It loaded a dictionary with more than 1000 entries in seconds. It is not elegant but it worked for me
import glob
import re
mapfile = input("Enter map file name with extension eg. codifica.txt: ")
sep = input("Enter map file column separator eg. |: ")
mask = input("Enter search mask with extension eg. 2010*txt for all files to be processed: ")
suff = input("Enter suffix with extension eg. _NEW.txt for newly generated files: ")
rep = {} # creation of empy dictionary
with open(mapfile) as temprep: # loading of definitions in the dictionary using input file, separator is prompted
for line in temprep:
(key, val) = line.strip('\n').split(sep)
rep[key] = val
for filename in glob.iglob(mask): # recursion on all the files with the mask prompted
with open (filename, "r") as textfile: # load each file in the variable text
text = textfile.read()
# start replacement
#rep = dict((re.escape(k), v) for k, v in rep.items()) commented to enable the use in the mapping of re reserved characters
pattern = re.compile("|".join(rep.keys()))
text = pattern.sub(lambda m: rep[m.group(0)], text)
#write of te output files with the prompted suffice
target = open(filename[:-4]+"_NEW.txt", "w")
target.write(text)
target.close()

this is my solution to the problem. I used it in a chatbot to replace the different words at once.
def mass_replace(text, dct):
new_string = ""
old_string = text
while len(old_string) > 0:
s = ""
sk = ""
for k in dct.keys():
if old_string.startswith(k):
s = dct[k]
sk = k
if s:
new_string+=s
old_string = old_string[len(sk):]
else:
new_string+=old_string[0]
old_string = old_string[1:]
return new_string
print mass_replace("The dog hunts the cat", {"dog":"cat", "cat":"dog"})
this will become The cat hunts the dog

Another example :
Input list
error_list = ['[br]', '[ex]', 'Something']
words = ['how', 'much[ex]', 'is[br]', 'the', 'fish[br]', 'noSomething', 'really']
The desired output would be
words = ['how', 'much', 'is', 'the', 'fish', 'no', 'really']
Code :
[n[0][0] if len(n[0]) else n[1] for n in [[[w.replace(e,"") for e in error_list if e in w],w] for w in words]]

My approach would be to first tokenize the string, then decide for each token whether to include it or not.
Potentially, might be more performant, if we can assume O(1) lookup for a hashmap/set:
remove_words = {"we", "this"}
target_sent = "we should modify this string"
target_sent_words = target_sent.split()
filtered_sent = " ".join(list(filter(lambda word: word not in remove_words, target_sent_words)))
filtered_sent is now 'should modify string'

Or just for a fast hack:
for line in to_read:
read_buffer = line
stripped_buffer1 = read_buffer.replace("term1", " ")
stripped_buffer2 = stripped_buffer1.replace("term2", " ")
write_to_file = to_write.write(stripped_buffer2)

Here is another way of doing it with a dictionary:
listA="The cat jumped over the house".split()
modify = {word:word for number,word in enumerate(listA)}
modify["cat"],modify["jumped"]="dog","walked"
print " ".join(modify[x] for x in listA)

sentence='its some sentence with a something text'
def replaceAll(f,Array1,Array2):
if len(Array1)==len(Array2):
for x in range(len(Array1)):
return f.replace(Array1[x],Array2[x])
newSentence=replaceAll(sentence,['a','sentence','something'],['another','sentence','something something'])
print(newSentence)

Unpack a string into an expanded string

I am given a string in the following format: "a{1;4:6}" and "a{1;2}b{2:4}" where the ; represents two different numbers, and a : represents a sequence of numbers. There can be any number of combinations of semicolons and colons within the brace.
I want to expand it such that these are the results of expanding the two examples above:
"a{1;4:6}" = "a1a4a5a6"
"a{1;2}b{2:4}" = "a1b2b3b4a2b2b3b4"
I've never had to deal with something like this before, since I am usually given strings in some sort of ready-made format which is easily parsable. In this case I have to parse the string manually.
My attempt is to split the string manually, over and over again, until you hit a case where there is either a colon or a semicolon, then start building the string from there. This is horribly inefficient, and I would appreciate any thoughts on this approach. Here is essentially what the code looks like (I omitted a lot of it, just to get the point across more quickly):
>>> s = "a{1;4:6}"
>>> splitted = s.split("}")
>>> splitted
['a{1;4:6', '']
>>> splitted2 = [s.split("{") for s in splitted]
>>> splitted2
[['a', '1;4:6'], ['']]
>>> splitted3 = [s.split(";") for s in splitted2[0]]
>>> splitted3
[['a'], ['1', '4:6']]
# ... etc, then build up the strings manually once the ranges are figured out.
The thinking behind splitting at the close brace at first is that it is guaranteed that a new identifier, with an associated range comes up after it. Where am I going wrong? My approach works for simple strings such as the first example, but it doesn't for the second example. Furthermore it is inefficient. I would be thankful for any input on this problem.

I tried pyparsing for that and IMHO it produced a pretty readable code (took pack_tokens from the previous answer).
from pyparsing import nums, Literal, Word, oneOf, Optional, OneOrMore, Group, delimitedList
from string import ascii_lowercase as letters
# transform a '123' to 123
number = Word(nums).setParseAction(lambda s, l, t: int(t[0]))
# parses 234:543 ranges
range_ = number + Literal(':').suppress() + number
# transforms the range x:y to a list [x, x+1, ..., y]
range_.setParseAction(lambda s, l, t: list(range(t[0], t[1]+1)))
# parse the comma delimited list of ranges or individual numbers
range_list = delimitedList(range_|number,",")
# and pack them in a tuple
range_list.setParseAction(lambda s, l, t: tuple(t))
# parses 'a{2,3,4:5}' group
group = Word(letters, max=1) + Literal('{').suppress() + range_list + Literal('}').suppress()
# transform the group parsed as ['a', [2, 4, 5]] to ['a2', 'a4' ...]
group.setParseAction(lambda s, l, t: tuple("%s%d" % (t[0],num) for num in t[1]))
# the full expression is just those group one after another
expression = OneOrMore(group)
def pack_tokens(s, l, tokens):
current, *rest = tokens
if not rest:
return ''.join(current) # base case
return ''.join(token + pack_tokens(s, l, rest) for token in current)
expression.setParseAction(pack_tokens)
parsed = expression.parseString('a{1,2,3}')[0]
print(parsed)
parsed = expression.parseString('a{1,3:7}b{1:5}')[0]
print(parsed)

import re
def expand(compressed):
# 'b{2:4}' -> 'b{2;3;4}' i.e. reduce the problem to just one syntax
normalized = re.sub(r'(\d+):(\d+)', lambda m: ';'.join(map(str, range(int(m.group(1)), int(m.group(2)) + 1))), compressed)
# 'a{1;2}b{2;3;4}' -> ['a{1;2}', 'b{2;3;4}']
elements = re.findall(r'[a-z]\{[\d;]+\}', normalized)
tokens = []
# ['a{1;2}', 'b{2;3;4}'] -> [['a1', 'a2'], ['b2', 'b3', 'b4']]
for element in elements:
match = re.match(r'([a-z])\{([\d;]+)\}', element)
alphanumerics = [] # match result already guaranteed by re.findall()
for number in match.group(2).split(';'):
alphanumerics.append(match.group(1) + number)
tokens.append(alphanumerics)
# [['a1', 'a2'], ['b2', 'b3', 'b4']] -> 'a1b2b3b4a2b2b3b4'
def pack_tokens(tokens):
current, *rest = tokens
if not rest:
return ''.join(current) # base case
return ''.join(token + pack_tokens(rest) for token in current)
return pack_tokens(tokens)
strings = ['a{1;4:6}', 'a{1;2}b{2:4}', 'a{1;2}b{2:4}c{3;6}']
for string in strings:
print(string, '->', expand(string))
OUTPUT
a{1;4:6} -> a1a4a5a6
a{1;2}b{2:4} -> a1b2b3b4a2b2b3b4
a{1;2}b{2:4}c{3;6} -> a1b2c3c6b3c3c6b4c3c6a2b2c3c6b3c3c6b4c3c6

Just to demonstrate a technique for doing this using eval (as #ialcuaz asked in the comments). Again I wouldn't recommend doing it this way, the other answers are more appropriate. This technique can be useful when the structure is more complex (i.e. recursive with brackets and so on) when you don't want a full blown parser.
import re
import functools
class Group(object):
def __init__(self, prefix, items):
self.groups = [[prefix + str(x) for x in items]]
def __add__(self, other):
self.groups.extend(other.groups)
return self
def __repr__(self):
return self.pack_tokens(self.groups)
# adapted for Python 2.7 from #cdlane's code
def pack_tokens(self, tokens):
current = tokens[:1][0]
rest = tokens[1:]
if not rest:
return ''.join(current)
return ''.join(token + self.pack_tokens(rest) for token in current)
def createGroup(str, *items):
return Group(str, items)
def expand(compressed):
# Replace a{...}b{...} with a{...} + b{...} as we will overload the '+' operator to help during the evaluation
expr = re.sub(r'(\}\w+\{)', lambda m: '} + ' + m.group(1)[1:-1] + '{', compressed)
# Expand : range to explicit list of items (from #cdlane's answer)
expr = re.sub(r'(\d+):(\d+)', lambda m: ';'.join(map(str, range(int(m.group(1)), int(m.group(2)) + 1))), expr)
# Convert a{x;y;..} to a(x,y, ...) so that it evaluates as a function
expr = expr.replace('{', '(').replace('}', ')').replace(";", ",")
# Extract the group prefixes ('a', 'b', ...)
groupPrefixes = re.findall(ur'(\w+)\([\d,]+\)', expr)
# Build a namespace mapping functions 'a', 'b', ... to createGroup() capturing the groupName prefix in the closure
ns = {prefix: functools.partial(createGroup, prefix) for prefix in groupPrefixes}
# Evaluate the expression using the namespace
return eval(expr, ns)
tests = ['a{1;4:6}', 'a{1;2}b{2:4}', 'a{1;2}b{2:4}c{3;6}']
for test in tests:
print(test, '->', expand(test))
Produces:
('a{1;4:6}', '->', a1a4a5a6)
('a{1;2}b{2:4}', '->', a1b2b3b4a2b2b3b4)
('a{1;2}b{2:4}c{3;6}', '->', a1b2c3c6b3c3c6b4c3c6a2b2c3c6b3c3c6b4c3c6)

Parsing expression into list

I am given an expression using parantheses and +'s, such as (((a+b)+c)+(d+e)).
I need to find the parse tree of this, and then print the list form of this parse tree like:
[ [ [a, b], c ], [d, e] ]
I was thinking I'd use something like ast, then ast2list. However, due to my not fully understanding these, I am repeatedly getting syntax errors. This is what I have:
import ast
import parser
a = ast.parse("(((a+b)+c)+(d+e))", mode='eval')
b = parser.ast2list(a)
print(b)
Could anyone guide me in the right direction? Thanks.

Colleen's comment can be realized with something like:
str = "(((a+b)+c)+(d+e))"
replacements = [
('(','['),
(')',']'),
('+',','),
# If a,b,c,d,e are defined variables, you don't need the following 5 lines
('a',"'a'"),
('b',"'b'"),
('c',"'c'"),
('d',"'d'"),
('e',"'e'"),
]
for (f,s) in replacements:
str = str.replace(f,s)
obj = eval(str)
print(str) # [[['a','b'],'c'],['d','e']]
print(obj) # [[['a', 'b'], 'c'], ['d', 'e']]
# You can access the parsed elements as you would any iterable:
print(obj[0]) # [['a', 'b'], 'c']
print(obj[1]) # ['d', 'e']
print(obj[1][0]) # d

If you really want to do a parser, start by not writing any code, but by understanding how your grammar should work. Backus-Naur Format or BNF is the typical notation used to define your grammar. Infix notation is a common software engineering parsing topic, and the basic BNF structure for infix notation goes like:
letter ::= 'a'..'z'
operand ::= letter+
term ::= operand | '(' expr ')'
expr ::= term ( '+' term )*
The key is that term contains either your alphabetic operand or an entire subexpression wrapped in ()'s. That subexpression is just the same as the overall expression, so this recursive definition takes care of all the parenthesis nesting. The expression then is a term followed by zero or more terms, added on using your binary '+' operator. (You could expand term to handle subtraction and multiplication/division as well, but I'm not going to complicate this answer more than necessary.)
Pyparsing is a package that makes it easy to translate a BNF to a working parser using Python objects (Ply, spark, and yapps are other parsers, which follow the more traditional lex/yacc model of parser creation). Here is that BNF implemented directly using pyparsing:
from pyparsing import Suppress, Word, alphas, Forward, Group, ZeroOrMore
LPAR, RPAR, PLUS = map(Suppress, "()+")
operand = Word(alphas)
# forward declare our overall expression, necessary when defining a recursive grammar
expr = Forward()
# each term is either an alpha operand, or an expr in ()'s
term = operand | Group(LPAR + expr + RPAR)
# define expr as a term, with optional '+ term's
expr << term + ZeroOrMore(PLUS + term)
# try it out
s = "(((a+b)+c)+(d+e))"
print expr.parseString(s)
giving:
[[[['a', 'b'], 'c'], ['d', 'e']]]
Infix notation with recognition of precedence of operations is a pretty common parser, or part of a larger parser, so pyparsing includes a helper builtin call operatorPrecedence to take care of all the nesting/grouping/recursion, etc. Here is that same parser written using operatorPrecedence:
from pyparsing import operatorPrecedence, opAssoc, Word, alphas, Suppress
# define an infix notation with precedence of operations
# you only define one operation '+', so this is a simple case
operand = Word(alphas)
expr = operatorPrecedence(operand,
[
('+', 2, opAssoc.LEFT),
])
print expr.parseString(s)
giving the same results as before.
More detailed examples can be found online at the pyparsing wiki - the explicit implementation at fourFn.py and the operatorPrecedence implementation at simpleArith.py.

Look at the docs for the ast module here where the NodeVisitor class is described.
import ast
import sys
class MyNodeVisitor(ast.NodeVisitor):
op_dict = {
ast.Add : '+',
ast.Sub : '-',
ast.Mult : '*',
}
type_dict = {
ast.BinOp: lambda s, n: s.handleBinOp(n),
ast.Name: lambda s, n: getattr(n, 'id'),
ast.Num: lambda s, n: getattr(n, 'n'),
}
def __init__(self, *args, **kwargs):
ast.NodeVisitor.__init__(self, *args, **kwargs)
self.ast = []
def handleBinOp(self, node):
return (self.op_dict[type(node.op)], self.handleNode(node.left),
self.handleNode(node.right))
def handleNode(self, node):
value = self.type_dict.get(type(node), None)
return value(self, node)
def visit_BinOp(self, node):
op = self.handleBinOp(node)
self.ast.append(op)
def visit_Name(self, node):
self.ast.append(node.id)
def visit_Num(self, node):
self.ast.append(node.n)
def currentTree(self):
return reversed(self.ast)
a = ast.parse(sys.argv[1])
visitor = MyNodeVisitor()
visitor.visit(a)
print list(visitor.currentTree())
Looks like this:
$ ./ast_tree.py "5 + (1 + 2) * 3"
[('+', 5, ('*', ('+', 1, 2), 3))]
Enjoy.

This is a simple enough problem that you could just write a solution from scratch. This assumes that all variable names are one-character long, or that the expression has been correctly converted into a list of tokens. I threw in checks to make sure all parenthesis are matched; obviously you should swap out CustomError for whatever exception you want to throw or other action you want to take.
def expr_to_list(ex):
tree = []
stack = [tree]
for c in ex:
if c == '(':
new_node = []
stack[-1].append(new_node)
stack.append(new_node)
elif c == '+' or c == ' ':
continue
elif c == ')':
if stack[-1] == tree:
raise CustomError('Unmatched Parenthesis')
stack.pop()
else:
stack[-1].append(c)
if stack[-1] != tree:
raise CustomError('Unmatched Parenthesis')
return tree
Tested:
>>> expr_to_list('a + (b + c + (x + (y + z) + (d + e)))')
['a', ['b', 'c', ['x', ['y', 'z'], ['d', 'e']]]]
And for multi-character variable names, using a regex for tokenization:
>>> tokens = re.findall('\(|\)|\+|[\w]+',
'(apple + orange + (banana + grapefruit))')
>>> tokens
['(', 'apple', '+', 'orange', '+', '(', 'banana', '+', 'grapefruit', ')', ')']
>>> expr_to_list(tokens)
[['apple', 'orange', ['banana', 'grapefruit']]]

I would make a translater too. Doing it via ast was bit cumbersome to implement for this purpose.
[tw-172-25-24-198 ~]$ cat a1.py
import re
def multiple_replace(text, adict):
rx = re.compile('|'.join(map(re.escape, adict)))
def one_xlat(match):
return adict[match.group(0)]
return rx.sub(one_xlat, text)
# Closure based approach
def make_xlat(*args, **kwds):
adict = dict(*args, **kwds)
rx = re.compile('|'.join(map(re.escape, adict)))
def one_xlat(match):
return adict[match.group(0)]
def xlat(text):
return rx.sub(one_xlat, text)
return xlat
if __name__ == "__main__":
text = "((a+b)+c+(d+(e+f)))"
adict = {
"+":",",
"(":"[",
")":"]",
}
translate = make_xlat(adict)
print translate(text)
Should give
[[a,b],c,[d,[e,f]]]
Note - I have been having this snippet in my collections. It is from Python Cookbook. It does multiple replacements on the string, with the replacement key and values in the dictionary in a single pass.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

parsing a complex logical expression in pyparsing in a binary tree fashion - python

Related

Python - how to assign 1 output to multiple inputs, to then use that output in an equation [duplicate]

pyparsing -- extending fourFn.py example with function

Replace values in string without replacing new values [duplicate]

Unpack a string into an expanded string

Parsing expression into list

Categories

Resources