Related
This is a bit of a contrived example, but I have been exploring the docs for CombineValues and wish to understand what I'm seeing.
If I combine values and perform some arithmetic operations on them (the goal is to calculate percentages of keys present in a bounded stream), then I need to use the AverageFn (as defined in Example 8 in the docs and provided in the source code example snippets).
However, this (based on Example 5) does not work:
with beam.Pipeline() as pipeline:
    counts = (
        pipeline
        | 'create' >> beam.Create(['xxx'])
        | 'key it' >> beam.Map(lambda elem: (elem, 1))
        | 'combine' >> beam.CombinePerKey(lambda values: sum(values) / 2)
        | 'print' >> beam.Map(print)
    )
as it produces
('xxx', 0.25)
I ultimately wanted to compute the count via
totals = pipeline | 'Count elements' >> beam.combiners.Count.Globally()
and then use the singleton approach they suggest (where I provide beam.pvalue.AsSingleton(totals) to beam.CombineValues).
My question is, why does CombineValues appear to execute twice (probably going to be some facepalming)?
The reason the combiner is being called twice is the MapReduce phases. Since the function you are using (halving the mean) is not associative, you'd need an "advanced combiner" as in the Example 8 you mention.
What is happening in your current code is: from ('xxx', 1) it calculates the half sum, giving ('xxx', 0.5), and then, when merging the partial values, it halves again, producing ('xxx', 0.25).
In this answer I explain a similar concept.
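You can reproduce the double halving outside Beam; this plain-Python sketch (my own illustration, not Beam API) mimics a non-associative combiner being applied once to the raw values and again when partial results are merged:

```python
# Illustration only: a non-associative "combiner" applied in two phases.
values = [1]

partial = sum(values) / 2     # combine phase: 1 -> 0.5
merged = sum([partial]) / 2   # merge phase halves again: 0.5 -> 0.25

print(merged)  # 0.25 -- matches the ('xxx', 0.25) you observed
```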
For your particular case, as mentioned, you need "advanced combiners":
with beam.Pipeline() as pipeline:
    def combiner(elements):
        print(elements)
        return sum(elements) / 2

    class HalfMean(beam.CombineFn):
        def create_accumulator(self):
            # Tuple of (sum, count)
            return (0, 0)

        def add_input(self, accumulator, input):
            # Add the current element to the sum, add one to the count
            new_tuple = (accumulator[0] + input, accumulator[1] + 1)
            return new_tuple

        def merge_accumulators(self, accumulators):
            # Join all accumulators
            partial_sums = [x[0] for x in accumulators]
            partial_counts = [x[1] for x in accumulators]
            merged_tuple = (sum(partial_sums), sum(partial_counts))
            return merged_tuple

        def extract_output(self, sum_count_tuple):
            # Now we can output half of the mean
            mean = sum_count_tuple[0] / sum_count_tuple[1]
            return mean / 2

    counts = (
        pipeline
        | 'create' >> beam.Create(['xxx'])
        | 'key it' >> beam.Map(lambda elem: (elem, 1))
        # | 'combine' >> beam.CombinePerKey(combiner)
        | 'advanced combine' >> beam.CombinePerKey(HalfMean())
        | 'print' >> beam.Map(print)
    )
I'm leaving your old combiner in, with a print, so you can see what's happening.
Anyhow, that is still not a CombineValues but a CombinePerKey. CombineValues takes a key-value pair in which the value is an iterable, and applies the combiner to it. In the following case, the elements it takes are ('a', [1, 2, 3]) and ('b', [10]). Here is the example:
kvs = [('a', 1),
       ('a', 2),
       ('a', 3),
       ('b', 10),
      ]

combine_values = (
    pipeline
    | 'create_kvs' >> beam.Create(kvs)
    | 'gbk' >> beam.GroupByKey()
    | 'combine values' >> beam.CombineValues(HalfMean())
    | 'print cv' >> beam.Map(print)
)
I have data like this, and I want to join the strings with commas and sum the int values.
data = [{"string": "x", "int": 1},
        {"string": "y", "int": 2},
        {"string": "z", "int": 3}]
I'm expecting output something like this.
Output:
{ "string":"x,y,z","int":"6"}
I tried using the reduce function:
func = lambda x, y: dict((m, n + y[m]) for m, n in x.items() )
print reduce(func, data)
and I am getting something like this:
{"string": "xyz", "int": "6"}
How do I get the strings comma-separated?
func = lambda x, y: dict((m, n + y[m]) for m, n in x.items() )
You need a custom function to replace n + y[m] (let's say custom_join(a, b)), which,
if the arguments are integers, returns their algebraic sum
if the arguments are strings, joins them with ',' and returns the final string
Let's implement it.
def custom_join(a, b):
    arr = [a, b]
    return sum(arr) if is_int_array(arr) else ','.join(arr)
We have no is_int_array/1 yet. Let's do it now.
def is_int_array(arr):
    return all(map(is_int, arr))
There's no is_int/1 either. Let's do it:
def is_int(e):
    return isinstance(e, int)
Do the same for strings:
def is_str(e):
    return isinstance(e, str)

def is_str_array(arr):
    return all(map(is_str, arr))
Summing it all up: https://repl.it/LPRR
OK, this is insane, but when you try to implement a functional-only approach, you need to be ready for such situations :-)
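Putting the pieces together (roughly what the repl.it link contains; shown here as a Python 3 sketch, hence functools.reduce and the print call):

```python
from functools import reduce

def is_int(e):
    return isinstance(e, int)

def is_int_array(arr):
    return all(map(is_int, arr))

def custom_join(a, b):
    # sum for ints, comma-join for strings
    arr = [a, b]
    return sum(arr) if is_int_array(arr) else ','.join(arr)

data = [{"string": "x", "int": 1},
        {"string": "y", "int": 2},
        {"string": "z", "int": 3}]

func = lambda x, y: {m: custom_join(n, y[m]) for m, n in x.items()}
print(reduce(func, data))  # {'string': 'x,y,z', 'int': 6}
```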
You can use str.join() and sum() with some generator expressions like this:
res = {"string": ','.join(d['string'] for d in data), "int": sum(d['int'] for d in data)}
Output:
>>> res
{'int': 6, 'string': 'x,y,z'}
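If the keys aren't known in advance, the same idea generalizes to one pass per key; a sketch (merge_records is my own name, and it assumes every dict shares the same keys):

```python
def merge_records(records):
    out = {}
    for key in records[0]:
        vals = [r[key] for r in records]
        # sum numeric columns, comma-join string columns
        out[key] = sum(vals) if all(isinstance(v, int) for v in vals) else ','.join(vals)
    return out

data = [{"string": "x", "int": 1},
        {"string": "y", "int": 2},
        {"string": "z", "int": 3}]
print(merge_records(data))  # {'string': 'x,y,z', 'int': 6}
```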
I'm writing a program whose input is a set of sets (or "collection") in Python syntax. The output of the program should be the same collection in proper mathematical notation. To do so, I've written a recursive function:
collection = set([
    frozenset(["a,b,c"]),
    frozenset(),
    frozenset(["a"]),
    frozenset(["b"]),
    frozenset(["a,b"])
])
def format_set(given_set):
    # for each element of the set
    for i in given_set:
        # if the element is itself a set, begin recursion
        if type(i) == frozenset:
            format_set(i)
        else:
            print "{", i, "},",
calling format_set(collection) gives the output
{ a,b }, { a,b,c }, { b }, { a },
which is missing a pair of braces and has an extra comma at the end. The correct output would be
{{ a,b }, { a,b,c }, { b }, { a }, {}}.
Thus, I would need to add "{" before the first recursion and "}" after the last, as well as not add the comma after the last element. Is there a way to detect the final recursion?
I could always solve the extra-braces problem by defining:
def shortcut(x):
    print "{", format_set(x), "}"
However, I feel like that's somewhat inelegant, and still leaves the comma problem.
It will be more straightforward if you check the type first and then do the iteration:
def format_set(given):
    if isinstance(given, (set, frozenset)):
        return '{' + ', '.join(format_set(i) for i in given) + '}'
    else:
        return str(given)
Output:
{{a,b}, {a,b,c}, {}, {b}, {a}}
Also, note that in your example input all sets are actually empty or have 1 element. If you change the input like this...
collection = set([
    frozenset(["a", "b", "c"]),
    frozenset(),
    frozenset(["a"]),
    frozenset(["b"]),
    frozenset(["a", "b"])
])
...you'll get this output:
{{a, c, b}, {}, {b}, {a}, {a, b}}
I am given an expression using parentheses and +'s, such as (((a+b)+c)+(d+e)).
I need to find the parse tree of this, and then print the list form of this parse tree like:
[ [ [a, b], c ], [d, e] ]
I was thinking I'd use something like ast, then ast2list. However, because I don't fully understand these, I keep getting syntax errors. This is what I have:
import ast
import parser
a = ast.parse("(((a+b)+c)+(d+e))", mode='eval')
b = parser.ast2list(a)
print(b)
Could anyone guide me in the right direction? Thanks.
Colleen's comment can be realized with something like:
str = "(((a+b)+c)+(d+e))"
replacements = [
    ('(', '['),
    (')', ']'),
    ('+', ','),
    # If a,b,c,d,e are defined variables, you don't need the following 5 lines
    ('a', "'a'"),
    ('b', "'b'"),
    ('c', "'c'"),
    ('d', "'d'"),
    ('e', "'e'"),
]
for (f, s) in replacements:
    str = str.replace(f, s)

obj = eval(str)
print(str)        # [[['a','b'],'c'],['d','e']]
print(obj)        # [[['a', 'b'], 'c'], ['d', 'e']]

# You can access the parsed elements as you would any iterable:
print(obj[0])     # [['a', 'b'], 'c']
print(obj[1])     # ['d', 'e']
print(obj[1][0])  # 'd'
If you really want to write a parser, start by not writing any code, but by understanding how your grammar should work. Backus-Naur Form, or BNF, is the typical notation used to define a grammar. Infix notation is a common parsing topic in software engineering, and the basic BNF structure for infix notation goes like:
letter ::= 'a'..'z'
operand ::= letter+
term ::= operand | '(' expr ')'
expr ::= term ( '+' term )*
The key is that term contains either your alphabetic operand or an entire subexpression wrapped in ()'s. That subexpression is just the same as the overall expression, so this recursive definition takes care of all the parenthesis nesting. The expression then is a term followed by zero or more terms, added on using your binary '+' operator. (You could expand term to handle subtraction and multiplication/division as well, but I'm not going to complicate this answer more than necessary.)
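Before reaching for a library, the BNF above can be made concrete with a small hand-rolled recursive-descent parser (a sketch; the function names and nested-list output shape are mine, not part of any library):

```python
def parse_term(s, i):
    """term ::= operand | '(' expr ')'"""
    if s[i] == '(':
        tree, i = parse_expr(s, i + 1)
        assert s[i] == ')', "expected ')'"
        return tree, i + 1
    j = i
    while j < len(s) and s[j].isalpha():  # consume an alphabetic operand
        j += 1
    return s[i:j], j

def parse_expr(s, i=0):
    """expr ::= term ('+' term)* ; returns (tree, next_index)"""
    tree, i = parse_term(s, i)
    parts = [tree]
    while i < len(s) and s[i] == '+':
        nxt, i = parse_term(s, i + 1)
        parts.append(nxt)
    return (parts if len(parts) > 1 else parts[0]), i

print(parse_expr("(((a+b)+c)+(d+e))")[0])  # [[['a', 'b'], 'c'], ['d', 'e']]
```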
Pyparsing is a package that makes it easy to translate a BNF into a working parser using Python objects (Ply, spark, and yapps are other parsers, which follow the more traditional lex/yacc model of parser creation). Here is that BNF implemented directly with pyparsing:
from pyparsing import Suppress, Word, alphas, Forward, Group, ZeroOrMore
LPAR, RPAR, PLUS = map(Suppress, "()+")
operand = Word(alphas)
# forward declare our overall expression, necessary when defining a recursive grammar
expr = Forward()
# each term is either an alpha operand, or an expr in ()'s
term = operand | Group(LPAR + expr + RPAR)
# define expr as a term, with optional '+ term's
expr << term + ZeroOrMore(PLUS + term)
# try it out
s = "(((a+b)+c)+(d+e))"
print expr.parseString(s)
giving:
[[[['a', 'b'], 'c'], ['d', 'e']]]
Infix notation with recognition of precedence of operations is a pretty common parser, or part of a larger parser, so pyparsing includes a built-in helper, operatorPrecedence, to take care of all the nesting/grouping/recursion, etc. Here is the same parser written using operatorPrecedence:
from pyparsing import operatorPrecedence, opAssoc, Word, alphas, Suppress
# define an infix notation with precedence of operations
# you only define one operation '+', so this is a simple case
operand = Word(alphas)
expr = operatorPrecedence(operand,
    [
        ('+', 2, opAssoc.LEFT),
    ])
print expr.parseString(s)
giving the same results as before.
More detailed examples can be found online at the pyparsing wiki - the explicit implementation at fourFn.py and the operatorPrecedence implementation at simpleArith.py.
Look at the docs for the ast module here where the NodeVisitor class is described.
import ast
import sys

class MyNodeVisitor(ast.NodeVisitor):
    op_dict = {
        ast.Add: '+',
        ast.Sub: '-',
        ast.Mult: '*',
    }
    type_dict = {
        ast.BinOp: lambda s, n: s.handleBinOp(n),
        ast.Name: lambda s, n: getattr(n, 'id'),
        ast.Num: lambda s, n: getattr(n, 'n'),
    }

    def __init__(self, *args, **kwargs):
        ast.NodeVisitor.__init__(self, *args, **kwargs)
        self.ast = []

    def handleBinOp(self, node):
        return (self.op_dict[type(node.op)],
                self.handleNode(node.left),
                self.handleNode(node.right))

    def handleNode(self, node):
        value = self.type_dict.get(type(node), None)
        return value(self, node)

    def visit_BinOp(self, node):
        op = self.handleBinOp(node)
        self.ast.append(op)

    def visit_Name(self, node):
        self.ast.append(node.id)

    def visit_Num(self, node):
        self.ast.append(node.n)

    def currentTree(self):
        return reversed(self.ast)

a = ast.parse(sys.argv[1])
visitor = MyNodeVisitor()
visitor.visit(a)
print list(visitor.currentTree())
Looks like this:
$ ./ast_tree.py "5 + (1 + 2) * 3"
[('+', 5, ('*', ('+', 1, 2), 3))]
Enjoy.
This is a simple enough problem that you could just write a solution from scratch. This assumes that all variable names are one character long, or that the expression has been correctly converted into a list of tokens. I threw in checks to make sure all parentheses are matched; obviously you should swap out CustomError for whatever exception you want to throw or other action you want to take.
def expr_to_list(ex):
    tree = []
    stack = [tree]
    for c in ex:
        if c == '(':
            new_node = []
            stack[-1].append(new_node)
            stack.append(new_node)
        elif c == '+' or c == ' ':
            continue
        elif c == ')':
            if stack[-1] == tree:
                raise CustomError('Unmatched Parenthesis')
            stack.pop()
        else:
            stack[-1].append(c)
    if stack[-1] != tree:
        raise CustomError('Unmatched Parenthesis')
    return tree
Tested:
>>> expr_to_list('a + (b + c + (x + (y + z) + (d + e)))')
['a', ['b', 'c', ['x', ['y', 'z'], ['d', 'e']]]]
And for multi-character variable names, using a regex for tokenization:
>>> tokens = re.findall(r'\(|\)|\+|\w+',
...                     '(apple + orange + (banana + grapefruit))')
>>> tokens
['(', 'apple', '+', 'orange', '+', '(', 'banana', '+', 'grapefruit', ')', ')']
>>> expr_to_list(tokens)
[['apple', 'orange', ['banana', 'grapefruit']]]
I would make a translator too. Doing it via ast was a bit cumbersome to implement for this purpose.
[tw-172-25-24-198 ~]$ cat a1.py
import re

def multiple_replace(text, adict):
    rx = re.compile('|'.join(map(re.escape, adict)))
    def one_xlat(match):
        return adict[match.group(0)]
    return rx.sub(one_xlat, text)

# Closure-based approach
def make_xlat(*args, **kwds):
    adict = dict(*args, **kwds)
    rx = re.compile('|'.join(map(re.escape, adict)))
    def one_xlat(match):
        return adict[match.group(0)]
    def xlat(text):
        return rx.sub(one_xlat, text)
    return xlat

if __name__ == "__main__":
    text = "((a+b)+c+(d+(e+f)))"
    adict = {
        "+": ",",
        "(": "[",
        ")": "]",
    }
    translate = make_xlat(adict)
    print translate(text)
Should give
[[a,b],c,[d,[e,f]]]
Note - I have had this snippet in my collection for a while; it is from the Python Cookbook. It does multiple replacements on a string, with the replacement keys and values taken from a dictionary, in a single pass.
I am trying to parse a complex logical expression like the one below:
x > 7 AND x < 8 OR x = 4
and get the parsed string as a binary tree. For the above expression the expected parsed expression should look like
[['x', '>', 7], 'AND', [['x', '<', 8], 'OR', ['x', '=', 4]]]
'OR' logical operator has higher precedence than 'AND' operator. Parenthesis can override the default precedence. To be more general, the parsed expression should look like;
<left_expr> <logical_operator> <right_expr>
Another example would be
input_string = x > 7 AND x < 8 AND x = 4
parsed_expr = [[['x', '>', 7], 'AND', ['x', '<', 8]], 'AND', ['x', '=', 4]]
So far I came up with this simple solution, which sadly cannot generate the parsed expression in binary-tree fashion. operatorPrecedence doesn't seem to help here when the same logical operator appears consecutively, as in the previous example.
import pyparsing as pp

complex_expr = pp.Forward()
operator = pp.Regex(">=|<=|!=|>|<|=").setName("operator")
logical = (pp.Keyword("AND") | pp.Keyword("OR")).setName("logical")
vars = pp.Word(pp.alphas, pp.alphanums + "_") | pp.Regex(r"[+-]?\d+(:?\.\d*)?(:?[eE][+-]?\d+)?")
condition = (vars + operator + vars)
clause = pp.Group(condition ^ (pp.Suppress("(") + complex_expr + pp.Suppress(")")))

expr = pp.operatorPrecedence(clause, [
    ("OR", 2, pp.opAssoc.LEFT),
    ("AND", 2, pp.opAssoc.LEFT),
])
complex_expr << expr

print complex_expr.parseString("x > 7 AND x < 8 AND x = 4")
Any suggestions or guidance is well appreciated.
BNF for the expression (without parentheses) could be:
<expr> -> <expr> <logical> <expr> | <expr>
<expr> -> <opnd> <relational> <opnd>
<opnd> -> <variable> | <numeric>
<relational> -> '>' | '<' | '=' | '>=' | '<=' | '!='
NOTE: the operatorPrecedence method of pyparsing is deprecated in favor of
the method name infixNotation.
Try changing:
expr = pp.operatorPrecedence(clause, [
    ("OR", 2, pp.opAssoc.LEFT),
    ("AND", 2, pp.opAssoc.LEFT),
])
to:
expr = pp.operatorPrecedence(condition, [
    ("OR", 2, pp.opAssoc.LEFT),
    ("AND", 2, pp.opAssoc.LEFT),
])
The first argument to operatorPrecedence is the primitive operand to be used with the operators - there is no need to include your complex_expr in parentheses - operatorPrecedence will do that for you. Since your operand is actually another deeper comparison, you might consider changing:
condition = (vars + operator + vars)
to:
condition = pp.Group(vars + operator + vars)
so that the output of operatorPrecedence is easier to process. With these changes, parsing x > 7 AND x < 8 OR x = 4 gives:
[[['x', '>', '7'], 'AND', [['x', '<', '8'], 'OR', ['x', '=', '4']]]]
which recognizes OR's higher precedence and groups it first. (Are you sure you want this order of AND and OR precedence? I think the traditional ordering is the reverse, as shown in this wikipedia entry.)
I think you are also asking why pyparsing and operatorPrecedence does not return the results in nested binary pairs, that is, you expect parsing "A and B and C" would return:
[['A', 'and', 'B'], 'and', 'C']
but what you get is:
['A', 'and', 'B', 'and', 'C']
That is because operatorPrecedence parses repeated operations at the same precedence level using repetition, not recursion. See this question which is very similar to yours, and whose answer includes a parse action to convert your repetitive parse tree to the more traditional binary parse tree. You can also find a sample boolean expression parser implemented using operatorPrecedence on the pyparsing wiki page.
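The repetition-to-binary conversion that linked answer describes can be sketched as a standalone function (my own helper, not a pyparsing API) that you would attach to the expression as a parse action:

```python
def to_binary(tokens):
    """Fold a flat ['A', 'and', 'B', 'and', 'C'] into left-associated
    binary pairs: [['A', 'and', 'B'], 'and', 'C']."""
    t = list(tokens)
    while len(t) > 3:
        t = [t[:3]] + t[3:]  # group the leftmost operand/operator/operand triple
    return t

print(to_binary(['A', 'and', 'B', 'and', 'C']))
# [['A', 'and', 'B'], 'and', 'C']
```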
EDIT:
To clarify, this is what I recommend you reduce your parser to:
import pyparsing as pp

operator = pp.Regex(">=|<=|!=|>|<|=").setName("operator")
number = pp.Regex(r"[+-]?\d+(:?\.\d*)?(:?[eE][+-]?\d+)?")
identifier = pp.Word(pp.alphas, pp.alphanums + "_")
comparison_term = identifier | number
condition = pp.Group(comparison_term + operator + comparison_term)

expr = pp.operatorPrecedence(condition, [
    ("AND", 2, pp.opAssoc.LEFT),
    ("OR", 2, pp.opAssoc.LEFT),
])

print expr.parseString("x > 7 AND x < 8 OR x = 4")
If support for NOT might also be something you want to add, then this would look like:
expr = pp.operatorPrecedence(condition, [
    ("NOT", 1, pp.opAssoc.RIGHT),
    ("AND", 2, pp.opAssoc.LEFT),
    ("OR", 2, pp.opAssoc.LEFT),
])
At some point, you may want to expand the definition of comparison_term with a more complete arithmetic expression, defined with its own operatorPrecedence definition. I would suggest doing it this way, rather than creating one monster opPrec definition, as you have already alluded to some of the performance downsides to opPrec. If you still get performance issues, look into ParserElement.enablePackrat.
Let me suggest this parsing approach, coming directly from Peter Norvig's class on the design of computer programs at Udacity (and tweaked for your needs).
from functools import update_wrapper
from string import split
import re

def grammar(description, whitespace=r'\s*'):
    """Convert a description to a grammar. Each line is a rule for a
    non-terminal symbol; it looks like this:
        Symbol => A1 A2 ... | B1 B2 ... | C1 C2 ...
    where the right-hand side is one or more alternatives, separated by
    the '|' sign. Each alternative is a sequence of atoms, separated by
    spaces. An atom is either a symbol on some left-hand side, or it is
    a regular expression that will be passed to re.match to match a token.
    Notation for *, +, or ? not allowed in a rule alternative (but ok
    within a token). Use '\' to continue long lines. You must include spaces
    or tabs around '=>' and '|'. That's within the grammar description itself.
    The grammar that gets defined allows whitespace between tokens by default;
    specify '' as the second argument to grammar() to disallow this (or supply
    any regular expression to describe allowable whitespace between tokens)."""
    G = {' ': whitespace}
    description = description.replace('\t', ' ')  # no tabs!
    for line in split(description, '\n'):
        lhs, rhs = split(line, ' => ', 1)
        alternatives = split(rhs, ' | ')
        G[lhs] = tuple(map(split, alternatives))
    return G

def decorator(d):
    def _d(fn):
        return update_wrapper(d(fn), fn)
    update_wrapper(_d, d)
    return _d
@decorator
def memo(f):
    cache = {}
    def _f(*args):
        try:
            return cache[args]
        except KeyError:
            cache[args] = result = f(*args)
            return result
        except TypeError:
            # some element of args can't be a dict key
            return f(args)
    return _f
def parse(start_symbol, text, grammar):
    """Example call: parse('Exp', '3*x + b', G).
    Returns a (tree, remainder) pair. If remainder is '', it parsed the whole
    string. Failure iff remainder is None. This is a deterministic PEG parser,
    so rule order (left-to-right) matters. Do 'E => T op E | T', putting the
    longest parse first; don't do 'E => T | T op E'.
    Also, no left recursion allowed: don't do 'E => E op T'."""

    tokenizer = grammar[' '] + '(%s)'

    def parse_sequence(sequence, text):
        result = []
        for atom in sequence:
            tree, text = parse_atom(atom, text)
            if text is None: return Fail
            result.append(tree)
        return result, text

    @memo
    def parse_atom(atom, text):
        if atom in grammar:  # Non-Terminal: tuple of alternatives
            for alternative in grammar[atom]:
                tree, rem = parse_sequence(alternative, text)
                if rem is not None: return [atom]+tree, rem
            return Fail
        else:  # Terminal: match characters against start of text
            m = re.match(tokenizer % atom, text)
            return Fail if (not m) else (m.group(1), text[m.end():])

    # Body of parse:
    return parse_atom(start_symbol, text)

Fail = (None, None)
MyLang = grammar("""expression => block logicalop expression | block
block => variable operator number
variable => [a-z]+
operator => <=|>=|>|<|=
number => [-+]?[0-9]+
logicalop => AND|OR""", whitespace=r'\s*')

def parse_it(text):
    return parse('expression', text, MyLang)

print parse_it("x > 7 AND x < 8 AND x = 4")
Outputs:
(['expression', ['block', ['variable', 'x'], ['operator', '>'], ['number', '7']], ['logicalop', 'AND'], ['expression', ['block', ['variable', 'x'], ['operator', '<'], ['number', '8']], ['logicalop', 'AND'], ['expression', ['block', ['variable', 'x'], ['operator', '='], ['number', '4']]]]], '')
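If you prefer the plain nested-list shape from the question over the labeled tree above, a small post-processing step works; simplify below is my own helper (shown here against a hard-coded copy of the tree printed above), and it collapses each ['nonterminal', child, ...] node by dropping the label and unwrapping single-child nodes:

```python
def simplify(tree):
    """Drop non-terminal labels; unwrap single-child nodes."""
    if not isinstance(tree, list):
        return tree
    children = [simplify(c) for c in tree[1:]]
    return children[0] if len(children) == 1 else children

labeled = ['expression',
           ['block', ['variable', 'x'], ['operator', '>'], ['number', '7']],
           ['logicalop', 'AND'],
           ['expression',
            ['block', ['variable', 'x'], ['operator', '<'], ['number', '8']],
            ['logicalop', 'AND'],
            ['expression',
             ['block', ['variable', 'x'], ['operator', '='], ['number', '4']]]]]

print(simplify(labeled))
# [['x', '>', '7'], 'AND', [['x', '<', '8'], 'AND', ['x', '=', '4']]]
```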