I want to code a BinaryTree parser. I don't know how to solve this problem. I've tried using regular expressions recursively but I can't find good resources. My goal is:
BinaryTree.from_string("('a') 'b' ('c')") --> BinaryTree("a", "b", "c")
BinaryTree.from_string("") --> None
BinaryTree.from_string("() ()") --> BinaryTree(None, None, None)
BinaryTree.from_string("((1) 2 (3)) 4 (5)") --> BinaryTree(BinaryTree(1, 2, 3), 4, 5)
Here is some source code:
class BinaryTree:
    def __init__(self, left=None, name=None, right=None):
        self.left = left
        self.name = name
        self.right = right

    def __str__(self):
        return f"({self.left}) {self.name} ({self.right})"

    def __repr__(self):
        return f"BinaryTree({repr(self.left)}, {repr(self.name)}, {repr(self.right)})"

    def __len__(self):
        if self.name is not None:
            output = 1
        else:
            output = 0
        if self.left is not None:
            output += len(self.left)
        if self.right is not None:
            output += len(self.right)
        return output

    @staticmethod
    def from_string(string):
        # "(x) y (z)" --> BinaryTree("x", "y", "z")
        # "((a) b (c)) y (z)" --> BinaryTree(BinaryTree("a", "b", "c"), "y", "z")
        # "" --> None
        # "() ()" --> BinaryTree(None, None, None)
        pass
First, I believe you need to drop the idea of regular expressions and concentrate on simply matching the parentheses. You have a very simple expression grammar here. Rather than reproducing such a well-traveled exercise, I'll simply direct you to research how to parse a binary tree expression with parentheses.
The basic expression is
left root right
where each of left and right is either
a sub-tree (first char is a left parenthesis)
a leaf-node label (first char is something else)
null (white space)
Note that you have some ambiguities. For instance, given a b, is the resulting tree (a, b, None), (None, a, b), or an error?
In any case, if you focus on simple string processing, you should be able to do this without external packages:
Find the first left parenthesis and its matching right.
In the string after that, look again for a left-right pair.
If there's anything before that first left-paren, then it must be a leaf node for the left and the node label for the root.
Either way, there must be a root node in the middle (unless this is a degenerate tree).
Recur on each of the paren matches you made.
Can you take it from there?
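As a nudge, here is a minimal sketch of the first step only (depth-counting to find a matching parenthesis, assuming well-formed input); the recursion is left to you:

def find_matching(s, start):
    """Given s[start] == '(', return the index of its matching ')'."""
    depth = 0
    for i in range(start, len(s)):
        if s[i] == '(':
            depth += 1
        elif s[i] == ')':
            depth -= 1
            if depth == 0:
                return i
    raise ValueError("unbalanced parentheses")

print(find_matching("((1) 2 (3)) 4 (5)", 0))  # -> 10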
You can use regular expressions, but only to tokenize the input, i.e. the pattern should match all parentheses as individual matches, and match quoted or unquoted literals. Some care has to be taken to support backslash escaping inside quoted substrings.
For converting quoted strings and numbers to the corresponding typed Python values, you can use ast.literal_eval. Of course, if the input format had been a valid Python expression (using comma separators, etc.), you could have left the parsing completely to ast.literal_eval. But as this is not the case, you'll have to tokenize the input and iterate over the tokens.
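For instance, here is what ast.literal_eval does with individual tokens (illustrative):

import ast

print(ast.literal_eval("'a'"))  # a   (a str)
print(ast.literal_eval("42"))   # 42  (an int)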
So import these:
import re
import ast
And then:
@staticmethod
def from_string(string):
    # Tokenize: parentheses, quoted strings (with backslash escapes),
    # bare literals, and finally the end of input ($).
    tokens = re.finditer(r"[()]|'(?:\\.|[^'])*'|[^\s()]+|$", string)

    def format(token):
        return f"'{token}'" if token else "end of input"

    def take(end, expect=None, forbid=None):
        token = next(tokens).group(0)
        if expect is not None and token != expect:
            raise ValueError("Expected {}, got {}".format(format(expect), format(token)))
        if end != "" and token == "" or token == forbid:
            raise ValueError("Unexpected {}".format(format(token)))
        if token not in ("(", ")", ""):
            token = ast.literal_eval(token)
        return token

    def recur(end=")"):
        token = take(end)
        if token == end:  # it is an empty leaf
            return None
        if token != "(":  # it is a leaf
            take(end, end)
            return token
        # It is a (left)-name-(right) sequence:
        left = recur()
        name = None
        token = take(end, None, end)
        if token != "(":
            name = token
            take(end, "(")
        right = recur()
        take(end, end)
        return BinaryTree(left, name, right)

    return recur("")
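A quick check against the examples from the question (assuming the class above):

print(repr(BinaryTree.from_string("((1) 2 (3)) 4 (5)")))
# BinaryTree(BinaryTree(1, 2, 3), 4, 5)
print(BinaryTree.from_string(""))
# None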
line = "add(multiply(add(2,3),add(4,5)),1)"
def readLine(line):
countLeftBracket=0
string = ""
for char in line:
if char !=")":
string += char
else:
string +=char
break
for i in string:
if i=="(":
countLeftBracket+=1
if countLeftBracket>1:
cutString(string)
else:
return execute(string)
def cutString(string):
countLeftBracket=0
for char in string:
if char!="(":
string.replace(char,'')
elif char=="(":
string.replace(char,'')
break
for char in string:
if char=="(":
countLeftBracket+=1
if countLeftBracket>1:
cutString(string)
elif countLeftBracket==1:
return execute(string)
def add(num1,num2):
return print(num1+num2)
def multiply(num1,num2):
return print(num1*num2)
readLines(line)
I need to execute the whole line string. I tried to cut each function inside the brackets one by one and replace it with its result, but I am kind of lost and not sure how to continue. My code gets this error:
File "main.py", line 26, in cutString
if char!="(":
RuntimeError: maximum recursion depth exceeded in comparison
Can you give me an idea of where to go from here, or which method to use?
Here is a solution that uses pyparsing, and as such will be much easier to expand:
from pyparsing import *
First, a convenience function (use the second tag function and print the parse tree to see why):
def tag(name):
    """This version converts ["expr", 4] => 4
    Comment in the version below to see the original parse tree.
    """
    def tagfn(tokens):
        tklist = tokens.asList()
        if name == 'expr' and len(tklist) == 1:
            # LL1 artifact removal
            return tklist
        return tuple([name] + tklist)
    return tagfn

# def tag(name):
#     return lambda tokens: tuple([name] + tokens.asList())
Our lexer needs to recognize left and right parentheses, integers, and names. This is how you define them with pyparsing:
LPAR = Suppress("(")
RPAR = Suppress(")")
integer = Word(nums).setParseAction(lambda s,l,t: [int(t[0])])
name = Word(alphas)
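A quick sanity check of these definitions (illustrative; parseString is pyparsing's standard entry point):

print(integer.parseString("42"))   # [42]
print(name.parseString("add"))     # ['add']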
Our parser has function calls, which take zero or more expressions as parameters. A function call is also an expression, so to deal with the circularity we have to forward-declare expr and fncall:
expr = Forward()
fncall = Forward()
expr << (integer | fncall).setParseAction(tag('expr'))
fnparams = delimitedList(expr)
fncall << (name + Group(LPAR + Optional(fnparams, default=[]) + RPAR)).setParseAction(tag('fncall'))
Now we can parse our string (we can add spaces and more or less than two parameters to functions as well):
line = "add(multiply(add(2,3),add(4,5)),1)"
res = fncall.parseString(line)
To see what is returned you can print it; this is called the parse tree (or, since our tag function has simplified it, an abstract syntax tree):
import pprint
pprint.pprint(list(res))
which outputs:
[('fncall',
  'add',
  [('fncall',
    'multiply',
    [('fncall', 'add', [2, 3]), ('fncall', 'add', [4, 5])]),
   1])]
With the commented-out tag function it would be the following, which is just more work to deal with for no added benefit:
[('fncall',
  'add',
  [('expr',
    ('fncall',
     'multiply',
     [('expr', ('fncall', 'add', [('expr', 2), ('expr', 3)])),
      ('expr', ('fncall', 'add', [('expr', 4), ('expr', 5)]))])),
   ('expr', 1)])]
Now define the functions that are available to our program:
from functools import reduce  # reduce is no longer a builtin in Python 3

FUNCTIONS = {
    'add': lambda *args: sum(args, 0),
    'multiply': lambda *args: reduce(lambda a, b: a*b, args, 1),
}
# print(FUNCTIONS['multiply'](1, 2, 3, 4))  # test that it works ;-)
Our parser is now very simple to write:
def parse(ast):
    if not ast:  # will not happen in our program, but it's good practice to exit early on no input
        return
    if isinstance(ast, tuple) and ast[0] == 'fncall':
        # ast is here ('fncall', <name-of-function>, [list-of-arguments])
        fn_name = ast[1]         # get the function name
        fn_args = parse(ast[2])  # parse each parameter (see elif below)
        return FUNCTIONS[fn_name](*fn_args)  # find and apply the function to its arguments
    elif isinstance(ast, list):
        # this is called when we hit a parameter list
        return [parse(item) for item in ast]
    elif isinstance(ast, int):
        return ast
Now call the parser on the result of the lexing phase:
>>> print(parse(res[0]))  # the outermost item is an expression
46
Sounds like this could be solved with regex.
So this is an example of a single reduction:
import re, operator

def apply(match):
    func_name = match.group(1)  # what's outside the parentheses
    func_args = [int(x) for x in match.group(2).split(',')]
    func = {"add": operator.add, "multiply": operator.mul}
    return str(func[func_name](*func_args))

def single_step(line):
    return re.sub(r"([a-z]+)\(([^()]+)\)", apply, line)
For example:
line = "add(multiply(add(2,3),add(4,5)),1)"
print(single_step(line))
Would output:
add(multiply(5,9),1)
All that is left to do is to loop until the expression is a single number:

while not line.isdigit():
    line = single_step(line)

print(line)
Will show
46
You can use a generator function to build a very simple parser:
import re, operator

line, f = "add(multiply(add(2,3),add(4,5)),1)", {'add': operator.add, 'multiply': operator.mul}

def parse(d):
    n = next(d, None)
    if n is not None and n != ')':
        if n == '(':
            yield iter(parse(d))
        else:
            yield n
        yield from parse(d)

parsed = parse(iter(re.findall(r'\(|\)|\w+', line)))

def _eval(d):
    _r = []
    n = next(d, None)
    while n is not None:
        if n.isdigit():
            _r.append(int(n))
        else:
            _r.append(f[n](*_eval(next(d))))
        n = next(d, None)
    return _r

print(_eval(parsed)[0])
Output:
46
I am trying to make a simple programme that can help make army lists for a popular tabletop wargame, more as an exercise for my own experience than anything else, as there are plenty of pre-made software packages that do this; still, the idea behind it seems fairly straightforward.
The programme reads the data for all the units available in an army from a spreadsheet and creates various classes for each unit. The main bit I am looking at now is the options/upgrades.
In the file I want a straightforward syntax for the option field of each unit, i.e. the options string itemA, itemB/itemC-3, 2*itemD, itemE/itemF/itemG, itemH/itemI+itemJ would mean:
1. you may take itemA (X pts per model)
2. for every 3 models, you may exchange itemB with
   a) itemC (net X pts per model)
3. each model may take 2 of itemD (X pts per model)
4. each model may take one of either
   a) itemE (X pts per model)
   b) itemF (X pts per model)
   c) itemG (X pts per model)
5. each model may take either
   a) itemH (X points per model)
   b) itemI and itemJ (X points per model)
At the moment I am processing the string using lots of splits and if statements, which makes it very hard to keep track of everything and to assign things correctly once the user inputs their choice.
for index, option in enumerate(self.options):
    output = "{}.".format(index + 1)
    if '-' in option:
        sub_option, no_models = option.split('-')
        no_models = int(no_models)
        print(sub_option)
        print(no_models)
        output += "For every {} models ".format(no_models)
        if '/' in sub_option:
            temp_str, temp_options, points_list = exchange_option(sub_option)
        else:
            temp_str, temp_options, points_list = standard_option(sub_option)
        index_points.append(points_list)
        temp_options.append(no_models)
        index_options.append(temp_options)
    else:
        if '/' in option:
            temp_str, temp_options, points_list = exchange_option(option)
        else:
            temp_str, temp_options, points_list = standard_option(option)
        index_points.append(points_list)
        index_options.append(temp_options)
    output += temp_str
The *_option() functions are additional helper functions I have defined above; they have a similar structure, with further if statements inside them.
The main question I am asking is: is there an easier way to process a code-like string such as this? While it works to produce the output in the example above, it seems awfully cumbersome to then deal with the user input.
What I am aiming to do is first output the string as given in my example at the top of the question, and then, taking the user's input index of the chosen option, modify the associated unit class to have the correct wargear and points value.
I thought about trying to make some kind of options class, but again, labelling and defining each option so that they can interact with one another properly seems equally complex, and I feel there must be something more pythonic, or just generally better coding practice, for processing encoded strings such as this.
So, here's a full-blown parser to do that! Now, this only outputs the list as in the previous version of your question, but it shouldn't be too hard to add more features as you want. Also, please note that at the moment the lexer does not error out when a string contains invalid tokens, but that's just a proof of concept, so it should be fine.
Part I: the lexer
This tokenises the input string - looks through it from left to right and attempts to classify non-overlapping substrings as instances of tokens. It's to be used before parsing. When given a string, Lexer.tokenize yields a stream of Tokens.
# FILE: lex.py

import re
import enum

class Token:
    def __init__(self, type, value: str, lineno: int, pos: int):
        self.type, self.value, self.lineno, self.pos = type, value, lineno, pos

    def __str__(self):
        v = f'({self.value!r})' if self.value else ''
        return f'{self.type.name}{v} at {self.lineno}:{self.pos}'

    __repr__ = __str__

class Lexer:
    def __init__(self, token_types: enum.Enum, tokens_regexes: dict):
        self.token_types = token_types
        regex = '|'.join(map('(?P<{}>{})'.format, *zip(*((tok.name, regex) for tok, regex in tokens_regexes.items()))))
        self.regex = re.compile(regex)

    def tokenize(self, string, skip=['space']):
        # TODO: detect invalid input
        lineno, pos = 0, 0
        skip = set(map(self.token_types.__getitem__, skip))
        for matchobj in self.regex.finditer(string):
            type_name = matchobj.lastgroup
            value = matchobj.groupdict()[type_name]
            Type = self.token_types[type_name]
            if Type == self.token_types.newline:
                lineno += 1  # bump the local counters (self.lineno would be an AttributeError)
                pos = 0
                continue
            pos = matchobj.end()
            if Type not in skip:
                yield Token(Type, value, lineno, pos)
        yield Token(self.token_types.EOF, '', lineno, pos)
Part II: the parser (with syntax-driven evaluation):
This parses the given stream of tokens provided by lex.Lexer.tokenize and translates individual symbols to English according to the following grammar:
Opt_list -> Option Opt_list_
Opt_list_ -> comma Option Opt_list_ | empty
Option -> Choice | Mult
Choice -> Compound More_choices Exchange
Compound -> item Add_item
Add_item -> plus item Add_item | empty
More_choices -> slash Compound More_choices | empty
Exchange -> minus num | empty
Mult -> num star Compound
The uppercase symbols are nonterminals, the lowercase ones are terminals. There's also a special symbol EOF that's not present here.
Also, take a look at the vital statistics of this grammar. This grammar is LL(1), so we can use an LL(1) recursive descent predictive parser, as shown below.
If you modify the grammar, you should modify the parser accordingly! The methods that do the actual parsing are called parse_<something>, and to change the output of the parser (the Parser.parse function, actually) you should change the return values of these parse_<something> functions.
# FILE: parse.py

import lex

class Parser:
    def __init__(self, lexer):
        self.string, self.tokens = None, None
        self.lexer = lexer
        self.t = self.lexer.token_types
        self.__lookahead = None

    @property
    def lookahead(self):
        if not self.__lookahead:
            try:
                self.__lookahead = next(self.tokens)
            except StopIteration:
                self.__lookahead = lex.Token(self.t.EOF, '', 0, -1)
        return self.__lookahead

    def next(self):
        if self.__lookahead and self.__lookahead.type == self.t.EOF:
            return self.__lookahead
        self.__lookahead = None
        return self.lookahead

    def match(self, token_type):
        if self.lookahead.type == token_type:
            return self.next()
        raise SyntaxError(f'Expected {token_type}, got {self.lookahead.type}', ('<string>', self.lookahead.lineno, self.lookahead.pos, self.string))

    # THE PARSING STARTS HERE
    def parse(self, string):
        # setup
        self.string = string
        self.tokens = self.lexer.tokenize(string)
        self.__lookahead = None
        self.next()
        # do parsing
        ret = [''] + self.parse_opt_list()
        return ' '.join(ret)

    def parse_opt_list(self) -> list:
        ret = self.parse_option(1)
        ret.extend(self.parse_opt_list_(1))
        return ret

    def parse_opt_list_(self, curr_opt_number) -> list:
        if self.lookahead.type in {self.t.EOF}:
            return []
        self.match(self.t.comma)
        ret = self.parse_option(curr_opt_number + 1)
        ret.extend(self.parse_opt_list_(curr_opt_number + 1))
        return ret

    def parse_option(self, opt_number) -> list:
        ret = [f'{opt_number}.']
        if self.lookahead.type == self.t.item:
            ret.extend(self.parse_choice())
        elif self.lookahead.type == self.t.num:
            ret.extend(self.parse_mult())
        else:
            raise SyntaxError(f'Expected {self.t.item} or {self.t.num}, got {self.lookahead.type}', ('<string>', self.lookahead.lineno, self.lookahead.pos, self.string))
        ret[-1] += '\n'
        return ret

    def parse_choice(self) -> list:
        c = self.parse_compound()
        m = self.parse_more_choices()
        e = self.parse_exchange()
        if not m:
            if not e:
                ret = f'You may take {" ".join(c)}'
            else:
                ret = f'for every {e} models you may take item {" ".join(c)}'
        elif m:
            c.extend(m)
            if not e:
                ret = f'each model may take one of: {", ".join(c)}'
            else:
                ret = f'for every {e} models you may exchange the following items with each other: {", ".join(c)}'
        else:
            ret = 'Semantic error!'  # unreachable: m is either empty or not
        return [ret]

    def parse_compound(self) -> list:
        ret = [self.lookahead.value]
        self.match(self.t.item)
        _ret = self.parse_add_item()
        return [' '.join(ret + _ret)]

    def parse_add_item(self) -> list:
        if self.lookahead.type in {self.t.comma, self.t.minus, self.t.slash, self.t.EOF}:
            return []
        ret = ['with']
        self.match(self.t.plus)
        ret.append(self.lookahead.value)
        self.match(self.t.item)
        return ret + self.parse_add_item()

    def parse_more_choices(self) -> list:
        if self.lookahead.type in {self.t.comma, self.t.minus, self.t.EOF}:
            return []
        self.match(self.t.slash)
        ret = self.parse_compound()
        return ret + self.parse_more_choices()

    def parse_exchange(self) -> str:
        if self.lookahead.type in {self.t.comma, self.t.EOF}:
            return ''
        self.match(self.t.minus)
        ret = self.lookahead.value
        self.match(self.t.num)
        return ret

    def parse_mult(self) -> list:
        ret = [f'each model may take {self.lookahead.value} of:']
        self.match(self.t.num)
        self.match(self.t.star)
        return ret + self.parse_compound()
Part III: usage
Here's how to use all of that code:
# FILE: evaluate.py

import enum
from lex import Lexer
from parse import Parser

# these are all the types of tokens present in our grammar
token_types = enum.Enum('Types', 'item num plus minus star slash comma space newline empty EOF')
t = token_types

# these are the regexes that the lexer uses to recognise the tokens
terminals_regexes = {
    t.item: r'[a-zA-Z_]\w*',
    t.num: '0|[1-9][0-9]*',
    t.plus: r'\+',
    t.minus: '-',
    t.star: r'\*',
    t.slash: '/',
    t.comma: ',',
    t.space: r'[ \t]',
    t.newline: r'\n'
}

lexer = Lexer(token_types, terminals_regexes)
parser = Parser(lexer)

string = 'itemA, itemB/itemC-3, 2*itemD, itemE/itemF/itemG, itemH/itemI+itemJ'
print(f'STRING FROM THE QUESTION: {string!r}\nRESULT:')
print(parser.parse(string), '\n\n')

string = input('Enter a command: ')
while string and string.lower() not in {'q', 'quit', 'e', 'exit'}:
    try:
        print(parser.parse(string))
    except SyntaxError as e:
        print(f'    Syntax error: {e}\n    {e.text}\n' + ' ' * (4 + e.offset - 1) + '^\n')
    string = input('Enter a command: ')
Example session:
# python3 evaluate.py
STRING FROM THE QUESTION: 'itemA, itemB/itemC-3, 2*itemD, itemE/itemF/itemG, itemH/itemI+itemJ'
RESULT:
1. You may take itemA
2. for every 3 models you may exchange the following items with each other: itemB, itemC
3. each model may take 2 of: itemD
4. each model may take one of: itemE, itemF, itemG
5. each model may take one of: itemH, itemI with itemJ
Enter a command: itemA/b/c/stuff
1. each model may take one of: itemA, b, c, stuff
Enter a command: 4 * anything
1. each model may take 4 of: anything
Enter a command: 5 * anything + more
1. each model may take 5 of: anything with more
Enter a command: a + b + c+ d
1. You may take a with b with c with d
Enter a command: a+b/c
1. each model may take one of: a with b, c
Enter a command: itemA/itemB-2
1. for every 2 models you may exchange the following items with each other: itemA, itemB
Enter a command: itemA+itemB/itemC - 5
1. for every 5 models you may exchange the following items with each other: itemA with itemB, itemC
Enter a command: q
Before asking, I searched through some old questions and got the idea of putting return in front of the inner recursive call to get the expected result.
Some of them were:
How to stop python recursion
Python recursion and return statements. But when I do the same thing with my problem, it gets worse.
I have a Binary Search Tree and want to get the TreeNode instance given a node's key, so it looks like an easy traversal requirement. I have already written similar functions easily, like the one below, in which I did NOT put return in front of the recursive calls:
preorder_List = []

def preorder(treeNode):
    if treeNode:
        preorder_List.append(treeNode.getKey())
        preorder(treeNode.has_left_child())
        preorder(treeNode.has_right_child())
    return preorder_List
so for my new requirement, I compose it like below first:
def getNode(treeNode, key):
    if treeNode:
        if treeNode.key == key:
            print("got it=", treeNode.key)
            return treeNode
        else:
            getNode(treeNode.left_child(), key)
            getNode(treeNode.right_child(), key)
Then the issue occurs: it finds the key/node but keeps running, and finally reports a None error. So I put return in front of both the left and right branch calls, like below:
def getNode(treeNode, key):
    if treeNode:
        if treeNode.key == key:
            print("got it=", treeNode.key)
            return treeNode
        else:
            return getNode(treeNode.left_child(), key)
            return getNode(treeNode.right_child(), key)  # never reached
But this makes things worse: now it can return None before the key is even found, because the first return means the right branch is never searched.
Then I tried removing one of the returns, on either the right or the left branch. That worked (update: it worked while my test case contained only 3 nodes; with more nodes it didn't. Or to put it another way: if the expected node is in the right subtree, putting return in front of the right branch call works; for the left one, it didn't). What's the better solution?
You need to be able to return the results of your recursive calls, but you don't always need to do so unconditionally. Sometimes you'll not get the result you need from the first recursion, so you need to recurse on the other one before returning anything.
The best way to deal with this is usually to assign the results of the recursion to a variable, which you can then test. So if getNode either returns a node (if it found the key), or None (if it didn't), you can do something like this:
result = getNode(treeNode.left_child(), key)
if result is not None:
    return result
return getNode(treeNode.right_child(), key)
In this specific case, since None is falsey, you can use the or operator to do the "short-circuiting" for you:
return getNode(treeNode.left_child(),key) or getNode(treeNode.right_child(),key)
The second recursive call will only be made if the first one returned a falsey value (such as None).
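Putting the pieces together, a minimal sketch (assuming, as in the question, that left_child()/right_child() return None when a child is missing):

def getNode(treeNode, key):
    if treeNode is None:
        return None
    if treeNode.key == key:
        return treeNode
    # search the left subtree first; fall back to the right one
    return (getNode(treeNode.left_child(), key)
            or getNode(treeNode.right_child(), key))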
Note that for some recursive algorithms, you may need to recurse multiple times unconditionally, then combine the results together before returning them. For instance, a function to add up the (numeric) key values in a tree might look something like this:
def sum_keys(node):
    if node is None:  # base case
        return 0
    left_sum = sum_keys(node.left_child())    # first recursion
    right_sum = sum_keys(node.right_child())  # second recursion
    return left_sum + right_sum + node.key    # add recursive results to our key and return
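A quick sanity check with a minimal stand-in node class (the Node helper below is hypothetical, just enough to exercise sum_keys):

class Node:
    def __init__(self, key, left=None, right=None):
        self.key, self._left, self._right = key, left, right

    def left_child(self):
        return self._left

    def right_child(self):
        return self._right

root = Node(4, Node(2), Node(9))
print(sum_keys(root))  # 2 + 9 + 4 = 15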
Without knowing more about your objects:
Three base cases:
current node is None --> return None
current node matches the key --> return it
current node does not match, is the end of the branch --> return None
If it's not a base case, recurse. Short-circuit the recursion with or: return the left branch's result if it is a match, or else the right branch's result (which might also be None).
def getNode(treeNode, key):
    if treeNode is None:
        return None
    elif treeNode.key == key:
        print("got it=", treeNode.key)
        return treeNode
    elif not (treeNode.has_left_child() or treeNode.has_right_child()):
        return None
    # left_branch = getNode(treeNode.left_child(), key)
    # right_branch = getNode(treeNode.right_child(), key)
    # return left_branch or right_branch
    return getNode(treeNode.left_child(), key) or getNode(treeNode.right_child(), key)
Instead of return, use yield:
class Tree:
    def __init__(self, **kwargs):
        self.__dict__ = {i: kwargs.get(i) for i in ['left', 'key', 'right']}

t = Tree(key=10, right=Tree(key=20, left=Tree(key=18)), left=Tree(key=5))

def find_val(tree, target):
    if tree.key == target:
        yield target
        print('found')
    else:
        if getattr(tree, 'left', None) is not None:
            yield from find_val(tree.left, target)
        if getattr(tree, 'right', None) is not None:
            yield from find_val(tree.right, target)

print(list(find_val(t, 18)))
Output:
found
[18]
However, you could also implement the get_node function as a method on your binary tree class by implementing a __contains__ method:
class Tree:
    def __init__(self, **kwargs):
        self.__dict__ = {i: kwargs.get(i) for i in ['left', 'key', 'right']}

    def __contains__(self, _val):
        if self.key == _val:
            return True
        _l, _r = self.left, self.right
        # recurse into a child only if it exists: [[], child][bool(child)]
        # picks the empty list (always False) when the child is None
        return _val in [[], _l][bool(_l)] or _val in [[], _r][bool(_r)]

t = Tree(key=10, right=Tree(key=20, left=Tree(key=18)), left=Tree(key=5))
print({i: i in t for i in [10, 14, 18]})
Output:
{10: True, 14: False, 18: True}
Python AST nodes have lineno and col_offset attributes, which indicate the beginning of the respective code range. Is there an easy way to also get the end of the code range? A 3rd-party library?
EDIT: Latest code (tested in Python 3.5-3.7) is here: https://bitbucket.org/plas/thonny/src/master/thonny/ast_utils.py
As I didn't find an easy way, here's a hard (and probably not optimal) way. Might crash and/or work incorrectly if there are more lineno/col_offset bugs in Python parser than those mentioned (and worked around) in the code. Tested in Python 3.3:
# imports assumed by the code below (not shown in the original excerpt)
import ast
import _ast
import io
import keyword
import token
import tokenize

def mark_code_ranges(node, source):
    """
    Node is an AST, source is corresponding source as string.
    Function adds recursively attributes end_lineno and end_col_offset to each node
    which has attributes lineno and col_offset.
    """
    NON_VALUE_KEYWORDS = set(keyword.kwlist) - {'False', 'True', 'None'}

    def _get_ordered_child_nodes(node):
        if isinstance(node, ast.Dict):
            children = []
            for i in range(len(node.keys)):
                children.append(node.keys[i])
                children.append(node.values[i])
            return children
        elif isinstance(node, ast.Call):
            children = [node.func] + node.args
            for kw in node.keywords:
                children.append(kw.value)
            if node.starargs != None:
                children.append(node.starargs)
            if node.kwargs != None:
                children.append(node.kwargs)
            children.sort(key=lambda x: (x.lineno, x.col_offset))
            return children
        else:
            return ast.iter_child_nodes(node)

    def _fix_triple_quote_positions(root, all_tokens):
        """
        http://bugs.python.org/issue18370
        """
        string_tokens = list(filter(lambda tok: tok.type == token.STRING, all_tokens))

        def _fix_str_nodes(node):
            if isinstance(node, ast.Str):
                tok = string_tokens.pop(0)
                node.lineno, node.col_offset = tok.start
            for child in _get_ordered_child_nodes(node):
                _fix_str_nodes(child)

        _fix_str_nodes(root)

        # fix their erroneous Expr parents
        for node in ast.walk(root):
            if ((isinstance(node, ast.Expr) or isinstance(node, ast.Attribute))
                    and isinstance(node.value, ast.Str)):
                node.lineno, node.col_offset = node.value.lineno, node.value.col_offset

    def _fix_binop_positions(node):
        """
        http://bugs.python.org/issue18374
        """
        for child in ast.iter_child_nodes(node):
            _fix_binop_positions(child)
        if isinstance(node, ast.BinOp):
            node.lineno = node.left.lineno
            node.col_offset = node.left.col_offset

    def _extract_tokens(tokens, lineno, col_offset, end_lineno, end_col_offset):
        return list(filter((lambda tok: tok.start[0] >= lineno
                            and (tok.start[1] >= col_offset or tok.start[0] > lineno)
                            and tok.end[0] <= end_lineno
                            and (tok.end[1] <= end_col_offset or tok.end[0] < end_lineno)
                            and tok.string != ''),
                           tokens))

    def _mark_code_ranges_rec(node, tokens, prelim_end_lineno, prelim_end_col_offset):
        """
        Returns the earliest starting position found in given tree,
        this is convenient for internal handling of the siblings
        """
        # set end markers to this node
        if "lineno" in node._attributes and "col_offset" in node._attributes:
            tokens = _extract_tokens(tokens, node.lineno, node.col_offset, prelim_end_lineno, prelim_end_col_offset)
            #tokens =
            _set_real_end(node, tokens, prelim_end_lineno, prelim_end_col_offset)

        # mark its children, starting from last one
        # NB! need to sort children because eg. in dict literal all keys come first and then all values
        children = list(_get_ordered_child_nodes(node))
        for child in reversed(children):
            (prelim_end_lineno, prelim_end_col_offset) = \
                _mark_code_ranges_rec(child, tokens, prelim_end_lineno, prelim_end_col_offset)

        if "lineno" in node._attributes and "col_offset" in node._attributes:
            # new "front" is beginning of this node
            prelim_end_lineno = node.lineno
            prelim_end_col_offset = node.col_offset

        return (prelim_end_lineno, prelim_end_col_offset)

    def _strip_trailing_junk_from_expressions(tokens):
        while (tokens[-1].type not in (token.RBRACE, token.RPAR, token.RSQB,
                                       token.NAME, token.NUMBER, token.STRING,
                                       token.ELLIPSIS)
               and tokens[-1].string not in ")}]"
               or tokens[-1].string in NON_VALUE_KEYWORDS):
            del tokens[-1]

    def _strip_trailing_extra_closers(tokens, remove_naked_comma):
        level = 0
        for i in range(len(tokens)):
            if tokens[i].string in "({[":
                level += 1
            elif tokens[i].string in ")}]":
                level -= 1
            if level == 0 and tokens[i].string == "," and remove_naked_comma:
                tokens[:] = tokens[0:i]
                return
            if level < 0:
                tokens[:] = tokens[0:i]
                return

    def _set_real_end(node, tokens, prelim_end_lineno, prelim_end_col_offset):
        # prelim_end_lineno and prelim_end_col_offset are the start of
        # next positioned node or end of source, ie. the suffix of given
        # range may contain keywords, commas and other stuff not belonging to current node
        # Function returns the list of tokens which cover all its children
        if isinstance(node, _ast.stmt):
            # remove empty trailing lines
            while (tokens[-1].type in (tokenize.NL, tokenize.COMMENT, token.NEWLINE, token.INDENT)
                   or tokens[-1].string in (":", "else", "elif", "finally", "except")):
                del tokens[-1]
        else:
            _strip_trailing_extra_closers(tokens, not isinstance(node, ast.Tuple))
            _strip_trailing_junk_from_expressions(tokens)

        # set the end markers of this node
        node.end_lineno = tokens[-1].end[0]
        node.end_col_offset = tokens[-1].end[1]

        # Try to peel off more tokens to give better estimate for children
        # Empty parens would confuse the children of no argument Call
        if ((isinstance(node, ast.Call))
                and not (node.args or node.keywords or node.starargs or node.kwargs)):
            assert tokens[-1].string == ')'
            del tokens[-1]
            _strip_trailing_junk_from_expressions(tokens)
        # attribute name would confuse the "value" of Attribute
        elif isinstance(node, ast.Attribute):
            if tokens[-1].type == token.NAME:
                del tokens[-1]
                _strip_trailing_junk_from_expressions(tokens)
            else:
                raise AssertionError("Expected token.NAME, got " + str(tokens[-1]))
                #import sys
                #print("Expected token.NAME, got " + str(tokens[-1]), file=sys.stderr)

        return tokens

    all_tokens = list(tokenize.tokenize(io.BytesIO(source.encode('utf-8')).readline))
    _fix_triple_quote_positions(node, all_tokens)
    _fix_binop_positions(node)
    source_lines = source.split("\n")
    prelim_end_lineno = len(source_lines)
    prelim_end_col_offset = len(source_lines[len(source_lines) - 1])
    _mark_code_ranges_rec(node, all_tokens, prelim_end_lineno, prelim_end_col_offset)
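For reference, a hedged usage sketch (the function above targets Python 3.3-era ASTs, e.g. it still checks Call.starargs, which later versions removed, so run it on a matching interpreter):

import ast

source = "x = 1\ny = x + 2\n"
tree = ast.parse(source)
mark_code_ranges(tree, source)
for n in ast.walk(tree):
    if hasattr(n, "end_lineno"):
        print(type(n).__name__, (n.lineno, n.col_offset), (n.end_lineno, n.end_col_offset))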
We had a similar need, and I created the asttokens library for this purpose. It maintains the source in both text and tokenized form, and marks AST nodes with token information, from which text is also readily available.
It works with Python 2 and 3 (tested with 2.7 and 3.5). For example:
import ast, asttokens

st = '''
def greet(a):
  say("hello") if a else say("bye")
'''

atok = asttokens.ASTTokens(st, parse=True)
for node in ast.walk(atok.tree):
    if hasattr(node, 'lineno'):
        print(atok.get_text_range(node), node.__class__.__name__, atok.get_text(node))
Prints
(1, 50) FunctionDef def greet(a):
say("hello") if a else say("bye")
(17, 50) Expr say("hello") if a else say("bye")
(11, 12) Name a
(17, 50) IfExp say("hello") if a else say("bye")
(33, 34) Name a
(17, 29) Call say("hello")
(40, 50) Call say("bye")
(17, 20) Name say
(21, 28) Str "hello"
(40, 43) Name say
(44, 49) Str "bye"
ast.get_source_segment was added in python 3.8:
import ast

code = """
if 1 == 1 and 2 == 2 and 3 == 3:
    test = 1
"""
node = ast.parse(code)
ast.get_source_segment(code, node.body[0])
Produces: 'if 1 == 1 and 2 == 2 and 3 == 3:\n    test = 1'
Thanks to Blane for his answer in https://stackoverflow.com/a/62624882/3800552
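Relatedly, since Python 3.8 the nodes themselves carry end positions, so for many uses no helper call is needed at all:

import ast

node = ast.parse("x = [1, 2, 3]").body[0]
print(node.lineno, node.col_offset, node.end_lineno, node.end_col_offset)
# 1 0 1 13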
Hi, I know it's very late, but I think this is what you are looking for.
I am doing the parsing only for function definitions in the module.
We can get the first and last line of an AST node by this method. That way, the source lines of a function definition can be obtained from the source file by reading only the lines we need.
This is a very simple example:
st = ('def foo():\n    print("hello")\n\ndef bla():\n    a = 1\n    b = 2\n'
      '    c = a + b\n    print(c)')

import ast
tree = ast.parse(st)
for function in tree.body:
    if isinstance(function, ast.FunctionDef):
        # Just in case there are loops in the definition
        lastBody = function.body[-1]
        while isinstance(lastBody, (ast.For, ast.While, ast.If)):
            lastBody = lastBody.body[-1]
        lastLine = lastBody.lineno
        print("Name of the function is ", function.name)
        print("firstLine of the function is ", function.lineno)
        print("LastLine of the function is ", lastLine)
        print("the source lines are ")
        if isinstance(st, str):
            st = st.split("\n")
        for i, line in enumerate(st, 1):
            if i in range(function.lineno, lastLine + 1):
                print(line)
I've got some dynamically-generated boolean logic expressions, like:
(A or B) and (C or D)
A or (A and B)
A
empty - evaluates to True
The placeholders get replaced with booleans. Should I,
Convert this information to a Python expression like True or (True or False) and eval it?
Create a binary tree where a node is either a bool or Conjunction/Disjunction object and recursively evaluate it?
Convert it into nested S-expressions and use a Lisp parser?
Something else?
Suggestions welcome.
Here's a small (74 lines including whitespace) module I built in about an hour and a half (plus almost an hour of refactoring):
str_to_token = {'True': True,
                'False': False,
                'and': lambda left, right: left and right,
                'or': lambda left, right: left or right,
                '(': '(',
                ')': ')'}

empty_res = True

def create_token_lst(s, str_to_token=str_to_token):
    """create token list:
    'True or False' -> [True, lambda..., False]"""
    s = s.replace('(', ' ( ')
    s = s.replace(')', ' ) ')
    return [str_to_token[it] for it in s.split()]

def find(lst, what, start=0):
    return [i for i, it in enumerate(lst) if it == what and i >= start]

def parens(token_lst):
    """returns:
    (bool)parens_exist, left_paren_pos, right_paren_pos
    """
    left_lst = find(token_lst, '(')
    if not left_lst:
        return False, -1, -1
    left = left_lst[-1]
    # can not occur earlier, hence there are args and op.
    right = find(token_lst, ')', left + 4)[0]
    return True, left, right

def bool_eval(token_lst):
    """token_lst has length 3 and format: [left_arg, operator, right_arg]
    operator(left_arg, right_arg) is returned"""
    return token_lst[1](token_lst[0], token_lst[2])

def formatted_bool_eval(token_lst, empty_res=empty_res):
    """eval a formatted (i.e. of the form 'ToFa(ToF)') string"""
    if not token_lst:
        return empty_res
    if len(token_lst) == 1:
        return token_lst[0]
    has_parens, l_paren, r_paren = parens(token_lst)
    if not has_parens:
        return bool_eval(token_lst)
    token_lst[l_paren:r_paren + 1] = [bool_eval(token_lst[l_paren + 1:r_paren])]
    return formatted_bool_eval(token_lst, empty_res)

def nested_bool_eval(s):
    """The actual 'eval' routine:
    if 's' is empty, 'True' is returned,
    otherwise 's' is evaluated according to parentheses nesting.
    The format assumed:
        [1] 'LEFT OPERATOR RIGHT',
    where LEFT and RIGHT are either:
        True or False or '(' [1] ')' (subexpression in parentheses)
    """
    return formatted_bool_eval(create_token_lst(s))
The simple tests give:
>>> print(nested_bool_eval(''))
True
>>> print(nested_bool_eval('False'))
False
>>> print(nested_bool_eval('True or False'))
True
>>> print(nested_bool_eval('True and False'))
False
>>> print(nested_bool_eval('(True or False) and (True or False)'))
True
>>> print(nested_bool_eval('(True or False) and (True and False)'))
False
>>> print(nested_bool_eval('(True or False) or (True and False)'))
True
>>> print(nested_bool_eval('(True and False) or (True and False)'))
False
>>> print(nested_bool_eval('(True and False) or (True and (True or False))'))
True
[Partially off-topic, possibly]
Note that you can easily configure the tokens (both operands and operators) you use with the poor man's dependency-injection means provided (str_to_token=str_to_token and friends) to have multiple different evaluators at the same time (just resetting the "injected-by-default" globals will leave you with a single behavior).
For example:
def fuzzy_bool_eval(s):
    """as normal, but:
    - an argument 'Maybe' may be :)) present
    - algebra is:
      [one of 'True', 'False', 'Maybe'] [one of 'or', 'and'] 'Maybe' -> 'Maybe'
    """
    Maybe = 'Maybe'  # just an object with nice __str__

    def or_op(left, right):
        return (Maybe if Maybe in [left, right] else (left or right))

    def and_op(left, right):
        args = [left, right]
        if Maybe in args:
            if True in args:
                return Maybe  # Maybe and True -> Maybe
            else:
                return False  # Maybe and False -> False
        return left and right

    str_to_token = {'True': True,
                    'False': False,
                    'Maybe': Maybe,
                    'and': and_op,
                    'or': or_op,
                    '(': '(',
                    ')': ')'}

    token_lst = create_token_lst(s, str_to_token=str_to_token)
    return formatted_bool_eval(token_lst)
gives:
>>> print(fuzzy_bool_eval(''))
True
>>> print(fuzzy_bool_eval('Maybe'))
Maybe
>>> print(fuzzy_bool_eval('True or False'))
True
>>> print(fuzzy_bool_eval('True or Maybe'))
Maybe
>>> print(fuzzy_bool_eval('False or (False and Maybe)'))
False
It shouldn't be difficult at all to write an evaluator that can handle this, for example using pyparsing. You only have a few operations to handle (and, or, and grouping?), so you should be able to parse and evaluate it yourself.
You shouldn't need to explicitly form the binary tree to evaluate the expression.
If you set up dicts with the locals and globals you care about then you should be able to safely pass them along with the expression into eval().
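A minimal sketch of that approach (note that eval is only as trustworthy as the expressions you feed it):

expr = "(A or B) and (C or D)"
values = {"A": True, "B": False, "C": False, "D": True}
# empty globals with no builtins; the placeholders come from the locals dict
print(eval(expr, {"__builtins__": {}}, values))  # True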
Sounds like a piece of cake using SymPy's logic module. They even have an example of that in the docs: http://docs.sympy.org/0.7.1/modules/logic.html
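For example, a small sketch with SymPy (the symbol names here are just examples; | is Or and & is And in its logic module):

from sympy import sympify

expr = sympify("(A | B) & (C | D)")
print(expr.subs({"A": True, "B": False, "C": False, "D": True}))  # True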
I am writing this because I had to solve a similar problem today and I was here when I was looking for clues (a boolean parser with arbitrary string tokens that get converted to boolean values later).
After considering different options (implementing a solution myself or using some package), I settled on Lark, https://github.com/lark-parser/lark
It's easy to use and pretty fast if you use LALR(1).
Here is an example that could match your syntax:

from lark import Lark, Tree, Transformer

base_parser = Lark("""
    expr: and_expr
        | or_expr
    and_expr: token
        | "(" expr ")"
        | and_expr " " and " " and_expr
    or_expr: token
        | "(" expr ")"
        | or_expr " " or " " or_expr
    token: LETTER
    and: "and"
    or: "or"
    LETTER: /[A-Z]+/
""", start="expr")
class Cleaner(Transformer):
    def expr(self, children):
        num_children = len(children)
        if num_children == 1:
            return children[0]
        else:
            raise RuntimeError()

    def and_expr(self, children):
        num_children = len(children)
        if num_children == 1:
            return children[0]
        elif num_children == 3:
            first, middle, last = children
            return Tree(data="and_expr", children=[first, last])
        else:
            raise RuntimeError()

    def or_expr(self, children):
        num_children = len(children)
        if num_children == 1:
            return children[0]
        elif num_children == 3:
            first, middle, last = children
            return Tree(data="or_expr", children=[first, last])
        else:
            raise RuntimeError()

def get_syntax_tree(expression):
    return Cleaner().transform(base_parser.parse(expression))

print(get_syntax_tree("A and (B or C)").pretty())
Note: the regex I chose doesn't match the empty string on purpose (Lark for some reason doesn't allow it).
You can do that with the Lark grammar library: https://github.com/lark-parser/lark
from lark import Lark, Transformer, v_args, Token, Tree
from operator import or_, and_, not_

calc_grammar = f"""
    ?start: disjunction
    ?disjunction: conjunction
        | disjunction "or" conjunction -> {or_.__name__}
    ?conjunction: atom
        | conjunction "and" atom -> {and_.__name__}
    ?atom: BOOLEAN_LITTERAL -> bool_lit
        | "not" atom -> {not_.__name__}
        | "(" disjunction ")"

    BOOLEAN_LITTERAL: TRUE | FALSE
    TRUE: "True"
    FALSE: "False"
    %import common.WS_INLINE
    %ignore WS_INLINE
"""

@v_args(inline=True)
class CalculateBoolTree(Transformer):
    or_ = or_
    not_ = not_
    and_ = and_
    allowed_value = {"True": True, "False": False}

    def bool_lit(self, val: Token) -> bool:
        return self.allowed_value[val]

calc_parser = Lark(calc_grammar, parser="lalr", transformer=CalculateBoolTree())
calc = calc_parser.parse

def eval_bool_expression(bool_expression: str) -> bool:
    return calc(bool_expression)

print(eval_bool_expression("(True or False) and (False and True)"))  # False
print(eval_bool_expression("not (False and True)"))                  # True
print(eval_bool_expression("not True or False and True and True"))   # False