network device command parsing with pyparsing - python

I'm developing a network device command parser using pyparsing.
I analysed the command format and defined it as below:
cli ::= string + (next)*
next ::= string|range|group|simple_recursive|selective_recursive|infinite_recursive|keywords
keywords ::= "WORD"
| "LINE"
| "A.B.C.D"
| "A.B.C.D/M"
| "X:X::X:X"
| "X:X::X:X/M"
| "HH:MM:SS"
| "AA:NN"
| "XX:XX:XX:XX:XX:XX"
| "MULTILINE"
inner_recur ::= next + (next)* + ("|")* | ("|" + next + (next)*)*
string ::= alphanums + "_" + "-"
range ::= "<" + nums + "-" nums + ">"
group ::= "(" + inner_recur + ")"
simple_recursive ::= "." + range
selective_recursive ::= "{" + inner_recur + "}"
infinite_recursive ::= "[" + inner_recur + "]"
and implemented it as follows:
# string ::= alphanums + "_" + "-"
string_ = Word(alphanums + "_" + "-").setResultsName("string")
#print(string_.parseString("option82"))
# range ::= "<" + nums + "-" nums + ">"
range_ = Combine(Literal("<") + Word(nums) + Literal("-") + Word(nums) + Literal(">")).setResultsName("range")
#print(range_.parseString("<24-1004>"))
# simple_recursive ::= "." + range
simple_recursive_ = Combine(Literal(".") + range_).setResultsName("simple_recursive")
#print(simple_recursive_.parseString(".<1-60045>"))
# keywords ::= "WORD" | "LINE" | "A.B.C.D" | "A.B.C.D/M" | "X:X::X:X" | "X:X::X:X/M" | "HH:MM:SS" | "AA:NN" | "XX:XX:XX:XX:XX:XX" | "MULTILINE"
keywords_ = Keyword("X:X::X:X/M").setResultsName("X:X::X:X/M") | Keyword("A.B.C.D/M").setResultsName("A.B.C.D/M") | Keyword("A.B.C.D").setResultsName("A.B.C.D") | Keyword("X:X::X:X").setResultsName("X:X::X:X") | Keyword("HH:MM:SS").setResultsName("HH:MM:SS") | Keyword("AA:NN").setResultsName("AA:NN") | Keyword("XX:XX:XX:XX:XX:XX").setResultsName("XX:XX:XX:XX:XX:XX") | Keyword("MULTILINE").setResultsName("MULTILINE") | Keyword("WORD").setResultsName("WORD") | Keyword("LINE").setResultsName("LINE")
#print(keywords_.parseString("A.B.C.D").asXML())
#next_ = Forward()
inner_recur = Forward()
# group ::= "(" + inner_recur + ")"
group_ = Combine(Literal("(") + inner_recur + Literal(")"))
# selective_recursive ::= "{" + inner_recur + "}"
selective_recursive_ = Combine(Literal("{") + inner_recur + Literal("}"))
# infinite_recursive ::= "[" + inner_recur + "]"
infinite_recursive_ = Combine(Literal("[") + inner_recur + Literal("]"))
# next ::= string|range|group|simple_recursive|selective_recursive|infinite_recursive|keywords
next_ = keywords_ | string_ | simple_recursive_ | range_ | group_ | selective_recursive_ | infinite_recursive_
# inner_recur ::= next + (next)* + ("|")* | ("|" + next + (next)*)*
inner_recur << next_ + ZeroOrMore(next_) + ZeroOrMore(Literal("|") | ZeroOrMore(Literal("|") + next_ + OneOrMore(next_)))
# cli ::= string + (next)*
cli_ = string_ + ZeroOrMore(next_)
To test my parser, I tried the following inputs:
>>> test = cli_.parseString("bgp as .<1-200>")
>>> print(test)
>>> ['bgp', 'as', ['.<1-200>']]
test = cli_.parseString("bgp as <1-200> <1-255> <1-255> WORD A.B.C.D A.B.C.D/M (A|(B|C))")
print(test)
>>>
test = cli_.parseString("test (A|<1-200>|(B|{a|b|c} aaa)")
test = cli_.parseString("test (A|<1-200>|(B|{a|b|c|})|)")
When parsing the second input, an infinite recursion error is raised. I don't understand why this happens, and I don't have a solution.
I expect the result:
['bgp', 'as', ['<1-200>'], ['<1-255>'], ['<1-255>'], 'WORD',
'A.B.C.D', 'A.B.C.D/M', ['A', ['B', 'C']]]
What is the problem in my format or code, and what should be modified?

While you have made a good first step in defining your grammar in conceptual BNF terms before writing code, I'm struggling a bit with making sense of your grammar. The culprit to me seems to be this part:
inner_recur ::= next + (next)* + ("|")* | ("|" + next + (next)*)*
From your posted examples, this looks like you are trying to define some sort of infix notation, using '|' as an operator.
From your tests, it also looks like you need to support multiple inner_recur terms within any grouping ()'s, []'s, or {}'s.
Also, please read the docs (https://pyparsing-docs.readthedocs.io/en/latest/pyparsing.html) to get a clearer picture of the difference between setResultsName and setName. I'm pretty sure in your parser throughout, you are using setResultsName but really want setName. Similarly with using Combine when you really want Group.
Lastly, I rewrote your test code using runTests, and saw that you had mismatched ()'s on the third test.
Here is your parser with these changes:
# string ::= alphanums + "_" + "-"
string_ = Word(alphanums + "_" + "-").setResultsName("string")
#print(string_.parseString("option82"))
# range ::= "<" + nums + "-" nums + ">"
range_ = Group(Literal("<") + Word(nums) + Literal("-") + Word(nums) + Literal(">")).setResultsName("range")
#print(range_.parseString("<24-1004>"))
# simple_recursive ::= "." + range
simple_recursive_ = Group(Literal(".") + range_).setResultsName("simple_recursive")
#print(simple_recursive_.parseString(".<1-60045>"))
# keywords ::= "WORD" | "LINE" | "A.B.C.D" | "A.B.C.D/M" | "X:X::X:X" | "X:X::X:X/M" | "HH:MM:SS" | "AA:NN" | "XX:XX:XX:XX:XX:XX" | "MULTILINE"
keywords_ = Keyword("X:X::X:X/M").setResultsName("X:X::X:X/M") | Keyword("A.B.C.D/M").setResultsName("A.B.C.D/M") | Keyword("A.B.C.D").setResultsName("A.B.C.D") | Keyword("X:X::X:X").setResultsName("X:X::X:X") | Keyword("HH:MM:SS").setResultsName("HH:MM:SS") | Keyword("AA:NN").setResultsName("AA:NN") | Keyword("XX:XX:XX:XX:XX:XX").setResultsName("XX:XX:XX:XX:XX:XX") | Keyword("MULTILINE").setResultsName("MULTILINE") | Keyword("WORD").setResultsName("WORD") | Keyword("LINE").setResultsName("LINE")
#print(keywords_.parseString("A.B.C.D").asXML())
#next_ = Forward()
inner_recur = Forward()
# group ::= "(" + inner_recur + ")"
group_ = Group(Literal("(") + OneOrMore(inner_recur) + Literal(")"))
# selective_recursive ::= "{" + inner_recur + "}"
selective_recursive_ = Group(Literal("{") + OneOrMore(inner_recur) + Literal("}"))
# infinite_recursive ::= "[" + inner_recur + "]"
infinite_recursive_ = Group(Literal("[") + OneOrMore(inner_recur) + Literal("]"))
# next ::= string|range|group|simple_recursive|selective_recursive|infinite_recursive|keywords
next_ = keywords_ | string_ | simple_recursive_ | range_ | group_ | selective_recursive_ | infinite_recursive_
#~ next_.setName("next_").setDebug()
# inner_recur ::= next + (next)* + ("|")* | ("|" + next + (next)*)*
#~ inner_recur <<= OneOrMore(next_) + ZeroOrMore(Literal("|")) | ZeroOrMore(Literal("|") + OneOrMore(next_))
inner_recur <<= Group(infixNotation(next_,
[
(None, 2, opAssoc.LEFT),
('|', 2, opAssoc.LEFT),
]) + Optional('|'))
# cli ::= string + (next)*
cli_ = string_ + ZeroOrMore(next_)
tests = """\
bgp as .<1-200>
bgp as <1-200> <1-255> <1-255> WORD A.B.C.D A.B.C.D/M (A|(B|C))
test (A|<1-200>|(B|{a|b|c} aaa))
test (A|<1-200>|(B|{a|b|c|})|)
"""
cli_.runTests(tests)
Which gives:
bgp as .<1-200>
['bgp', 'as', ['.', ['<', '1', '-', '200', '>']]]
- simple_recursive: ['.', ['<', '1', '-', '200', '>']]
- range: ['<', '1', '-', '200', '>']
- string: 'as'
bgp as <1-200> <1-255> <1-255> WORD A.B.C.D A.B.C.D/M (A|(B|C))
['bgp', 'as', ['<', '1', '-', '200', '>'], ['<', '1', '-', '255', '>'], ['<', '1', '-', '255', '>'], 'WORD', 'A.B.C.D', 'A.B.C.D/M', ['(', [['A', '|', ['(', [['B', '|', 'C']], ')']]], ')']]
- A.B.C.D: 'A.B.C.D'
- A.B.C.D/M: 'A.B.C.D/M'
- WORD: 'WORD'
- range: ['<', '1', '-', '255', '>']
- string: 'as'
test (A|<1-200>|(B|{a|b|c} aaa))
['test', ['(', [['A', '|', ['<', '1', '-', '200', '>'], '|', ['(', [['B', '|', [['{', [['a', '|', 'b', '|', 'c']], '}'], 'aaa']]], ')']]], ')']]
- string: 'test'
test (A|<1-200>|(B|{a|b|c|})|)
['test', ['(', [['A', '|', ['<', '1', '-', '200', '>'], '|', ['(', [['B', '|', ['{', [['a', '|', 'b', '|', 'c'], '|'], '}']]], ')']], '|'], ')']]
- string: 'test'
This may be off the mark in some places, but I hope it gives you some ideas to move forward with your project.
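To make the Combine/Group and setResultsName/setName distinctions mentioned above more concrete, here is a minimal sketch. The helper `range_parts` and the names `"low"` and `"integer"` are made up for illustration and are not part of the original parser.

```python
from pyparsing import Combine, Group, Literal, ParseException, Word, nums


def range_parts():
    # Hypothetical helper: builds the "<n-m>" pattern from scratch each time.
    return Literal("<") + Word(nums) + Literal("-") + Word(nums) + Literal(">")


combined = Combine(range_parts())   # joins the matched pieces into one string
grouped = Group(range_parts())      # keeps the pieces, nested in a sub-list

print(combined.parseString("<1-200>").asList())
print(grouped.parseString("<1-200>").asList())

# setResultsName labels the matched tokens so they can be read back by name.
named = Word(nums).setResultsName("low")
result = named.parseString("42")
print(result["low"])

# setName only changes how the expression is described, e.g. in error messages.
labelled = Word(nums).setName("integer")
try:
    labelled.parseString("abc")
except ParseException as err:
    error_text = str(err)
    print(error_text)
```

Combine is the better fit when the pieces should come back as a single token such as `<1-200>`, while Group is the one to use when the nesting of the input should be visible in the results.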

Split a Boolean expression into all possibilities with PyParsing

I am creating a program which is filtering an excel document.
In some Cells, there are combinations of codes separated by Boolean expressions.
I need to split these into every possibility, and I am looking at achieving this via PyParsing.
For example, if I have the following:
single = "347SJ"
single_or = "456NG | 347SJ"
and_or = "347SJ & (347SJ | 383DF)"
and_multi_or = "373FU & (383VF | 321AC | 383ZX | 842UQ)"
I want to end up with:
single = "347SJ"
single_or1 = "456NG"
single_or2 = "347SJ"
and_or1 = "347SJ & 347SJ"
and_or2 = "347SJ & 383DF"
and_multi_or1 = "373FU & 383VF"
and_multi_or2 = "373FU & 321AC"
and_multi_or3 = "373FU & 383ZX"
and_multi_or4 = "373FU & 842UQ"
I feel like it's very simple but I can't find anything similar to this online, can anyone help?
Here is a start, a parser that will process those inputs as & and | operations:
import pyparsing as pp
operand = pp.Word(pp.alphanums)
bool_expr = pp.infix_notation(operand, [
('&', 2, pp.opAssoc.LEFT),
('|', 2, pp.opAssoc.LEFT),
])
tests = [single, single_or, and_or, and_multi_or]
bool_expr.run_tests(tests)
prints
347SJ
['347SJ']
456NG | 347SJ
[['456NG', '|', '347SJ']]
[0]:
['456NG', '|', '347SJ']
347SJ & (347SJ | 383DF)
[['347SJ', '&', ['347SJ', '|', '383DF']]]
[0]:
['347SJ', '&', ['347SJ', '|', '383DF']]
[2]:
['347SJ', '|', '383DF']
373FU & (383VF | 321AC | 383ZX | 842UQ)
[['373FU', '&', ['383VF', '|', '321AC', '|', '383ZX', '|', '842UQ']]]
[0]:
['373FU', '&', ['383VF', '|', '321AC', '|', '383ZX', '|', '842UQ']]
[2]:
['383VF', '|', '321AC', '|', '383ZX', '|', '842UQ']
So that was the simple part. The next step is to traverse that structure and generate the various paths through the logical operators. This part gets kind of hairy.
PART 2 -- STOP HERE IF YOU WANT TO SOLVE THE REST YOURSELF!!!
(Much code taken from the invRegex example, which inverts a regex by generating sample strings that match it.)
Rather than extract the structure and then walk it, we can use pyparsing parse actions to add active nodes at each level that will then give us generators to generate each bit of the expression.
class GroupEmitter:
    def __init__(self, exprs):
        # drop the operators, take the operands
        self.exprs = exprs[0][::2]
    def makeGenerator(self):
        def groupGen():
            def recurseList(elist):
                if len(elist) == 1:
                    if isinstance(elist[0], pp.ParseResults):
                        yield from recurseList(elist[0])
                    else:
                        yield from elist[0].makeGenerator()()
                else:
                    for s in elist[0].makeGenerator()():
                        for s2 in recurseList(elist[1:]):
                            # this could certainly be improved
                            if isinstance(s, str):
                                if isinstance(s2, str):
                                    yield [s, s2]
                                else:
                                    yield [s, *s2]
                            else:
                                if isinstance(s2, str):
                                    yield [*s, s2]
                                else:
                                    yield [*s, *s2]
            if self.exprs:
                yield from recurseList(self.exprs)
        return groupGen
class AlternativeEmitter:
    def __init__(self, exprs):
        # drop the operators, take the operands
        self.exprs = exprs[0][::2]
    def makeGenerator(self):
        def altGen():
            for e in self.exprs:
                yield from e.makeGenerator()()
        return altGen
class LiteralEmitter:
    def __init__(self, lit):
        self.lit = lit[0]
    def __str__(self):
        return "Lit:" + self.lit
    def __repr__(self):
        return "Lit:" + self.lit
    def makeGenerator(self):
        def litGen():
            yield [self.lit]
        return litGen
operand.add_parse_action(LiteralEmitter)
bool_expr = pp.infix_notation(operand, [
    ('&', 2, pp.opAssoc.LEFT, GroupEmitter),
    ('|', 2, pp.opAssoc.LEFT, AlternativeEmitter),
])
def invert(expr):
    return bool_expr.parse_string(expr)[0].makeGenerator()()
for t in tests:
    print(t)
    print(list(invert(t)))
    print("\n".join(' & '.join(tt) for tt in invert(t)))
    print()
prints
347SJ
[['347SJ']]
347SJ
456NG | 347SJ
[['456NG'], ['347SJ']]
456NG
347SJ
347SJ & (347SJ | 383DF)
[['347SJ', '347SJ'], ['347SJ', '383DF']]
347SJ & 347SJ
347SJ & 383DF
373FU & (383VF | 321AC | 383ZX | 842UQ)
[['373FU', '383VF'], ['373FU', '321AC'], ['373FU', '383ZX'], ['373FU', '842UQ']]
373FU & 383VF
373FU & 321AC
373FU & 383ZX
373FU & 842UQ
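As a possible alternative to the emitter classes, the nested list produced by the first, action-free parser can be expanded with itertools.product. This is only a sketch: the function name `expand` is invented here, and it assumes each nesting level uses a single operator, which is how infix_notation groups the results shown earlier.

```python
from itertools import product


def expand(node):
    """Return every AND-combination described by a nested [operand, op, operand, ...] list."""
    if isinstance(node, str):
        return [[node]]
    operands, operators = node[::2], set(node[1::2])
    branches = [expand(operand) for operand in operands]
    if operators == {"|"}:
        # alternatives: each branch contributes its own combinations unchanged
        return [combo for branch in branches for combo in branch]
    # '&' (or a single wrapped operand): pick one combination from every branch
    return [sum(choice, []) for choice in product(*branches)]


parsed = ["373FU", "&", ["383VF", "|", "321AC", "|", "383ZX", "|", "842UQ"]]
for combo in expand(parsed):
    print(" & ".join(combo))
```

With the plain parser from the first part of the answer, the input to `expand` would be something like `bool_expr.parse_string(text).as_list()[0]`.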

RecursionError: maximum recursion depth exceeded while using lark in python

I've written the decaf grammar specified in cs143 course.
Here is my code.
import sys
from lark import Lark, Transformer, v_args
decaf_grammar = r"""
start : PROGRAM
PROGRAM : DECL+
DECL : VARIABLEDECL | FUNCTIONDECL | CLASSDECL | INTERFACEDECL
VARIABLEDECL : VARIABLE ";"
VARIABLE : TYPE "ident"
TYPE : "int" | "double" | "bool" | "string" | "ident" | TYPE "[]"
FUNCTIONDECL : ( TYPE "ident" "(" FORMALS ")" STMTBLOCK ) | ( "void" "ident" "(" FORMALS ")" STMTBLOCK )
FORMALS : VARIABLE ("," VARIABLE)*
CLASSDECL : "class" "ident" ["extends" "ident"] ["implements" "ident" ("," "ident")*] "{" FIELD* "}"
FIELD : VARIABLEDECL | FUNCTIONDECL
INTERFACEDECL : "interface" "ident" "{" PROTOTYPE* "}"
PROTOTYPE : (TYPE "ident" "(" FORMALS ")" ";") | ("void" "ident" "(" FORMALS ")" ";")
STMTBLOCK : "{" VARIABLEDECL* STMT* "}"
STMT : ( EXPR? ";") | IFSTMT | WHILESTMT | FORSTMT | BREAKSTMT | RETURNSTMT | RETURNSTMT | PRINTSTMT | STMTBLOCK
IFSTMT : "if" "(" EXPR ")" STMT ["else" STMT]
WHILESTMT : "while" "(" EXPR ")" STMT
FORSTMT : "for" "(" EXPR? ";" EXPR ";" EXPR? ")" STMT
RETURNSTMT : "return" EXPR? ";"
BREAKSTMT : "break" ";"
PRINTSTMT : "print" "(" EXPR ("," EXPR)* ")" ";"
EXPR : (LVALUE "=" EXPR) | CONSTANT | LVALUE | "this" | CALL | "(" EXPR ")" | (EXPR "+" EXPR) | (EXPR "-" EXPR) | (EXPR "*" EXPR) | (EXPR "/" EXPR) | (EXPR "%" EXPR) | ("-" EXPR) | (EXPR "<" EXPR) | (EXPR "<=" EXPR) | (EXPR ">" EXPR) | (EXPR ">=" EXPR) | (EXPR "==" EXPR) | (EXPR "!=" EXPR) | (EXPR "&&" EXPR) | (EXPR "||" EXPR) | ("!" EXPR) | ("ReadInteger" "(" ")") | ("ReadLine" "(" ")") | ("new" "ident") | ("NewArray" "(" EXPR "," TYPE ")")
LVALUE : "ident" | (EXPR "." "ident") | (EXPR "[" EXPR "]")
CALL : ("ident" "(" ACTUALS ")") | (EXPR "." "ident" "(" ACTUALS ")")
ACTUALS : EXPR ("," EXPR)* | ""
CONSTANT : "intConstant" | "doubleConstant" | "boolConstant" | "stringConstant" | "null"
"""
class TreeToJson(Transformer):
    #v_args(inline=True)
    def string(self, s):
        return s[1:-1].replace('\\"', '"')
json_parser = Lark(decaf_grammar, parser='lalr', lexer='standard', transformer=TreeToJson())
parse = json_parser.parse
def test():
    test_json = '''
{
}
'''
    j = parse(test_json)
    print(j)
    import json
    assert j == json.loads(test_json)
if __name__ == '__main__':
    test()
    #with open(sys.argv[1]) as f:
        #print(parse(f.read()))
It throws
RecursionError: maximum recursion depth exceeded.
I'm using lark for the first time
The problem is that you are not distinguishing between Lark's rules and terminals. Terminals (the only names that should be written in capitals) should match strings, not the structure of your grammar.
The key property of terminals is that, unlike rules, they are not "recursive". Because of that, Lark struggles to build your grammar and ends up in infinite recursion and a stack overflow.
Try using sys.setrecursionlimit(xxxx), where xxxx is the maximum recursion depth you want.
To learn more, visit docs.python.org/3.

How can I split a string of a mathematical expressions in python?

I made a program which converts infix to postfix in Python. The problem is when I enter the arguments.
If I enter something like this (as a string):
( ( 73 + ( ( 34 - 72 ) / ( 33 - 3 ) ) ) + ( 56 + ( 95 - 28 ) ) )
it is split with .split() and the program works correctly.
But I want the user to be able to enter something like this:
((73 + ( (34- 72 ) / ( 33 -3) )) + (56 +(95 - 28) ) )
As you can see, I want blank spaces to be insignificant, while the program still splits the string into parentheses, integers (not digits) and operators.
I tried to solve it with a for loop, but I don't know how to capture a whole number (73, 34, 72) instead of one digit at a time (7, 3, 3, 4, 7, 2).
To sum up, what I want is to split a string like ((81 * 6) /42+ (3-1)) into:
[(, (, 81, *, 6, ), /, 42, +, (, 3, -, 1, ), )]
Tree with ast
You could use ast to get a tree of the expression :
import ast
source = '((81 * 6) /42+ (3-1))'
node = ast.parse(source)
def show_children(node, level=0):
    # ast.Num was removed in recent Python versions; ast.Constant replaces it
    if isinstance(node, ast.Constant):
        print(' ' * level + str(node.value))
    else:
        print(' ' * level + str(node))
    for child in ast.iter_child_nodes(node):
        show_children(child, level+1)
show_children(node)
It outputs :
<_ast.Module object at 0x7f56abbc5490>
 <_ast.Expr object at 0x7f56abbc5350>
  <_ast.BinOp object at 0x7f56abbc5450>
   <_ast.BinOp object at 0x7f56abbc5390>
    <_ast.BinOp object at 0x7f56abb57cd0>
     81
     <_ast.Mult object at 0x7f56abbd0dd0>
     6
    <_ast.Div object at 0x7f56abbd0e50>
    42
   <_ast.Add object at 0x7f56abbd0cd0>
   <_ast.BinOp object at 0x7f56abb57dd0>
    3
    <_ast.Sub object at 0x7f56abbd0d50>
    1
As @user2357112 wrote in the comments: ast.parse interprets Python syntax, not mathematical expressions. (1+2)(3+4) would be parsed as a function call, and list comprehensions would be accepted even though they probably shouldn't be considered valid mathematical expressions.
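That caveat is easy to check. The short sketch below parses the two inputs mentioned in the comment and prints which node type ast produces for each; the variable names are only illustrative.

```python
import ast

# "(1+2)(3+4)" reads like a product on paper, but Python sees a call.
call_tree = ast.parse("(1+2)(3+4)", mode="eval")
print(type(call_tree.body).__name__)

# A list comprehension is valid Python, so ast.parse accepts it as well.
comp_tree = ast.parse("[x * 2 for x in values]", mode="eval")
print(type(comp_tree.body).__name__)
```

If only arithmetic should be accepted, the tree would need to be walked afterwards and any unexpected node type rejected.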
List with a regex
If you want a flat structure, a regex could work :
import re
number_or_symbol = re.compile(r'(\d+|[^ 0-9])')
print(re.findall(number_or_symbol, source))
# ['(', '(', '81', '*', '6', ')', '/', '42', '+', '(', '3', '-', '1', ')', ')']
It looks for either:
one or more digits
or any single character which isn't a digit or a space
Once you have a list of elements, you could check if the syntax is correct, for example with a stack to check if parentheses are matching, or if every element is a known one.
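A minimal sketch of such a check, assuming the only known symbols are the four arithmetic operators and parentheses (the names `KNOWN_SYMBOLS` and `check_tokens` are made up for this example):

```python
import re

KNOWN_SYMBOLS = {"(", ")", "+", "-", "*", "/"}


def check_tokens(tokens):
    """Return True if every token is known and the parentheses match."""
    open_parens = []  # used as a stack of currently open "("
    for token in tokens:
        if token == "(":
            open_parens.append(token)
        elif token == ")":
            if not open_parens:
                return False  # closing parenthesis without an opening one
            open_parens.pop()
        elif not (token.isdigit() or token in KNOWN_SYMBOLS):
            return False  # unknown element
    return not open_parens  # anything left open is an error


tokens = re.findall(r"(\d+|[^ 0-9])", "((81 * 6) /42+ (3-1))")
print(check_tokens(tokens))
```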
You need to implement a very simple tokenizer for your input. You have the following types of tokens:
(
)
+
-
*
/
\d+
You can find them in your input string separated by all sorts of white space.
So a first step is to process the string from start to finish, and extract these tokens, and then do your parsing on the tokens, rather than on the string itself.
A nifty way to do this is to use the following regular expression: '\s*([()+*/-]|\d+)'. You can then:
import re
the_input='(3+(2*5))'
tokens = []
tokenizer = re.compile(r'\s*([()+*/-]|\d+)')
current_pos = 0
while current_pos < len(the_input):
    match = tokenizer.match(the_input, current_pos)
    if match is None:
        # no token starts here, so the input is malformed
        raise SyntaxError('Syntax error')
    tokens.append(match.group(1))
    current_pos = match.end()
print(tokens)
This will print ['(', '3', '+', '(', '2', '*', '5', ')', ')']
You could also use re.findall or re.finditer, but then you'd be skipping non-matches, which are syntax errors in this case.
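To show what "skipping non-matches" means in practice, here is a small sketch. The helper `tokenize_strict` is hypothetical; it is one way to keep re.findall while still rejecting stray characters.

```python
import re

token_re = re.compile(r"[()+*/-]|\d+")

# findall silently drops the "$", so the bad input looks valid.
print(token_re.findall("3 $ 4"))


def tokenize_strict(text):
    # Remove every valid token and all whitespace; anything left is an error.
    leftover = "".join(token_re.sub("", text).split())
    if leftover:
        raise SyntaxError("unexpected input: " + leftover)
    return token_re.findall(text)


print(tokenize_strict("(3+(2*5))"))
```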
If you don't want to use the re module, you can try this:
s="((81 * 6) /42+ (3-1))"
r=[""]
for i in s.replace(" ",""):
    if i.isdigit() and r[-1].isdigit():
        r[-1]=r[-1]+i
    else:
        r.append(i)
print(r[1:])
Output:
['(', '(', '81', '*', '6', ')', '/', '42', '+', '(', '3', '-', '1', ')', ')']
It would actually be pretty trivial to hand-roll a simple expression tokenizer. And I'd think you'd learn more that way as well.
So for the sake of education and learning, here is a trivial expression tokenizer implementation which can be extended. It works based upon the "maximal munch" rule. This means it acts "greedy", trying to consume as many characters as it can to construct each token.
Without further ado, here is the tokenizer:
class ExpressionTokenizer:
    def __init__(self, expression, operators):
        self.buffer = expression
        self.pos = 0
        self.operators = operators
    def _next_token(self):
        atom = self._get_atom()
        while atom and atom.isspace():
            self._skip_whitespace()
            atom = self._get_atom()
        if atom is None:
            return None
        elif atom.isdigit():
            return self._tokenize_number()
        elif atom in self.operators:
            return self._tokenize_operator()
        else:
            raise SyntaxError()
    def _skip_whitespace(self):
        while self._get_atom():
            if self._get_atom().isspace():
                self.pos += 1
            else:
                break
    def _tokenize_number(self):
        endpos = self.pos + 1
        while self._get_atom(endpos) and self._get_atom(endpos).isdigit():
            endpos += 1
        number = self.buffer[self.pos:endpos]
        self.pos = endpos
        return number
    def _tokenize_operator(self):
        operator = self.buffer[self.pos]
        self.pos += 1
        return operator
    def _get_atom(self, pos=None):
        pos = pos or self.pos
        try:
            return self.buffer[pos]
        except IndexError:
            return None
    def tokenize(self):
        while True:
            token = self._next_token()
            if token is None:
                break
            else:
                yield token
Here is a demo of the usage:
tokenizer = ExpressionTokenizer('((81 * 6) /42+ (3-1))', {'+', '-', '*', '/', '(', ')'})
for token in tokenizer.tokenize():
    print(token)
Which produces the output:
(
(
81
*
6
)
/
42
+
(
3
-
1
)
)
Quick regex answer:
re.findall(r"\d+|[()+\-*\/]", str_in)
Demonstration:
>>> import re
>>> str_in = "((81 * 6) /42+ (3-1))"
>>> re.findall(r"\d+|[()+\-*\/]", str_in)
['(', '(', '81', '*', '6', ')', '/', '42', '+', '(', '3', '-', '1',
')', ')']
For the nested parentheses part, you can use a stack to keep track of the level.
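One way that stack idea could look, as a sketch (the function name `nest` is invented here): every "(" pushes a new list, every ")" pops it and attaches it to its parent.

```python
import re


def nest(tokens):
    """Turn a flat token list into nested lists, one per parenthesised group."""
    stack = [[]]  # stack[-1] is the group currently being filled
    for token in tokens:
        if token == "(":
            stack.append([])
        elif token == ")":
            if len(stack) == 1:
                raise ValueError("unmatched ')'")
            group = stack.pop()
            stack[-1].append(group)
        else:
            stack[-1].append(token)
    if len(stack) != 1:
        raise ValueError("unmatched '('")
    return stack[0]


flat = re.findall(r"\d+|[()+\-*\/]", "((81 * 6) /42+ (3-1))")
print(nest(flat))
```

The length of the stack at any point is the current nesting level, so the same loop can also be used just to track depth.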
This does not provide quite the result you want but might be of interest to others who view this question. It makes use of the pyparsing library.
# Stolen from http://pyparsing.wikispaces.com/file/view/simpleArith.py/30268305/simpleArith.py
# Copyright 2006, by Paul McGuire
# ... and slightly altered
from pyparsing import *
integer = Word(nums).setParseAction(lambda t:int(t[0]))
variable = Word(alphas,exact=1)
operand = integer | variable
expop = Literal('^')
signop = oneOf('+ -')
multop = oneOf('* /')
plusop = oneOf('+ -')
factop = Literal('!')
expr = operatorPrecedence( operand,
[("!", 1, opAssoc.LEFT),
("^", 2, opAssoc.RIGHT),
(signop, 1, opAssoc.RIGHT),
(multop, 2, opAssoc.LEFT),
(plusop, 2, opAssoc.LEFT),]
)
print (expr.parseString('((81 * 6) /42+ (3-1))'))
Output:
[[[[81, '*', 6], '/', 42], '+', [3, '-', 1]]]
Using grako:
start = expr $;
expr = calc | value;
calc = value operator value;
value = integer | "(" @:expr ")" ;
operator = "+" | "-" | "*" | "/";
integer = /\d+/;
grako transpiles to python.
For this example, the return value looks like this:
['73', '+', ['34', '-', '72', '/', ['33', '-', '3']], '+', ['56', '+', ['95', '-', '28']]]
Normally you'd use the generated semantics class as a template for further processing.
To provide a more verbose regex approach that you could easily extend:
import re
solution = []
pattern = re.compile(r'([\d\.]+)')
s = '((73 + ( (34- 72 ) / ( 33 -3) )) + (56 +(95 - 28) ) )'
for token in re.split(pattern, s):
    token = token.strip()
    if re.match(pattern, token):
        # numbers are converted to float, so 73 becomes 73.0
        solution.append(float(token))
        continue
    for character in re.sub(' ', '', token):
        solution.append(character)
Which will give you the result:
solution = ['(', '(', 73.0, '+', '(', '(', 34.0, '-', 72.0, ')', '/', '(', 33.0, '-', 3.0, ')', ')', ')', '+', '(', 56.0, '+', '(', 95.0, '-', 28.0, ')', ')', ')']
Similar to @McGrady's answer, you can do this with a basic queue implementation.
As a very basic implementation, here's what your Queue class can look like:
class Queue:
    EMPTY_QUEUE_ERR_MSG = "Cannot do this operation on an empty queue."
    def __init__(self):
        self._items = []
    def __len__(self) -> int:
        return len(self._items)
    def is_empty(self) -> bool:
        return len(self) == 0
    def enqueue(self, item):
        self._items.append(item)
    def dequeue(self):
        try:
            return self._items.pop(0)
        except IndexError:
            raise RuntimeError(Queue.EMPTY_QUEUE_ERR_MSG)
    def peek(self):
        try:
            return self._items[0]
        except IndexError:
            raise RuntimeError(Queue.EMPTY_QUEUE_ERR_MSG)
Using this simple class, you can implement your parse function as:
def tokenize_with_queue(exp: str) -> list:
    queue = Queue()
    cum_digit = ""
    for c in exp.replace(" ", ""):
        if c in ["(", ")", "+", "-", "/", "*"]:
            if cum_digit != "":
                queue.enqueue(cum_digit)
                cum_digit = ""
            queue.enqueue(c)
        elif c.isdigit():
            cum_digit += c
        else:
            raise ValueError
    if cum_digit != "": #one last sweep in case there are any digits waiting
        queue.enqueue(cum_digit)
    return [queue.dequeue() for i in range(len(queue))]
Testing it like below:
exp = "((73 + ( (34- 72 ) / ( 33 -3) )) + (56 +(95 - 28) ) )"
print(tokenize_with_queue(exp))
would give you the token list as:
['(', '(', '73', '+', '(', '(', '34', '-', '72', ')', '/', '(', '33', '-', '3', ')', ')', ')', '+', '(', '56', '+', '(', '95', '-', '28', ')', ')', ')']

PyParsing: parseaction called multiple

I am a beginner with pyparsing but have experience with other parsing environments.
On my first small demo project I encountered a strange behavior of parsing actions: Parse action of base token (ident_simple) is called twice for each token of ident_simple.
import io, sys
from pyparsing import *
def pa_ident_simple(s, l, t):
    print('ident_simple: ' + str(t))
def pa_ident_combined(s, l, t):
    print('ident_combined: ' + str(t))
def make_grammar():
    number = Word(nums)
    ident_simple = Word( alphas, alphanums + "_" )
    ident_simple.setParseAction(pa_ident_simple)
    ident_combined = Combine(ident_simple + Literal('.') + ident_simple)
    ident_combined.setParseAction(pa_ident_combined)
    integer = number
    elems = ( ident_combined | ident_simple | integer)
    grammar = OneOrMore(elems) + StringEnd()
    return grammar
if __name__ == "__main__":
    inp_str = "UUU FFF.XXX GGG"
    grammar = make_grammar()
    print (inp_str, "--->", grammar.parseString( inp_str ))
For 'ident_combined' token it looks good: Parseaction is called once for each sub token 'ident_simple' and once for combined token.
I believe that the combined token is the problem: Parseaction of 'ident_simple' is called only once if 'ident_combined' is removed.
Can anybody give me a hint how to combine tokens correctly?
Thanks for any help
Update: When playing around I took the class "Or" instead of "MatchFirst".
elems = ( ident_combined ^ ident_simple ^ integer)
This showed a better behavior (in my opinion).
Output of original grammar (using "MatchFirst"):
ident_simple: ['UUU']
ident_simple: ['UUU']
ident_simple: ['FFF']
ident_simple: ['XXX']
ident_combined: ['FFF.XXX']
ident_simple: ['GGG']
ident_simple: ['GGG']
UUU FFF.XXX GGG ---> ['UUU', 'FFF.XXX', 'GGG']
Output of modified grammar (using "Or"):
ident_simple: ['UUU']
ident_simple: ['FFF']
ident_simple: ['XXX']
ident_combined: ['FFF.XXX']
ident_simple: ['GGG']
UUU FFF.XXX GGG ---> ['UUU', 'FFF.XXX', 'GGG']
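The doubled calls in the MatchFirst output come from backtracking: ident_combined is tried first, its leading ident_simple matches (and its parse action runs), the '.' then fails, and the next alternative matches ident_simple again. One possible way around this, offered only as a sketch and not as the canonical pyparsing answer, is to fold both forms into a single expression with an optional dotted suffix. The names `ident` and `calls` are invented for this example.

```python
from pyparsing import (Combine, Literal, OneOrMore, Optional, StringEnd,
                       Word, alphanums, alphas, nums)

calls = []  # records every invocation of the ident_simple parse action


def pa_ident_simple(s, l, t):
    calls.append(t[0])


ident_simple = Word(alphas, alphanums + "_")
ident_simple.setParseAction(pa_ident_simple)

# One expression covers both "FFF" and "FFF.XXX", so the leading identifier
# never has to be parsed a second time by another alternative.
ident = Combine(ident_simple + Optional(Literal(".") + ident_simple))
grammar = OneOrMore(ident | Word(nums)) + StringEnd()

parsed = grammar.parseString("UUU FFF.XXX GGG").asList()
print(parsed)
print(calls)
```

Another common guideline is to keep parse actions free of side effects such as printing, so that a repeated call during backtracking does no harm.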

Using regex to parse kindle "My Clippings.txt" file

I am currently trying to use python to parse the notes file for my kindle so that I can keep them more organized than the chronologically ordered list that the kindle automatically saves notes in. Unfortunately, I'm having trouble using regex to parse the file. Here's my code so far:
import re
def parse_file(in_file):
    read_file = open(in_file, 'r')
    file_lines = read_file.readlines()
    read_file.close()
    raw_note = "".join(file_lines)
    # Regex parts
    title_regex = "(.+)"
    title_author_regex = r"(.+) \((.+)\)"
    loc_norange_regex = "(.+) (Location|on Page) ([0-9]+)"
    loc_range_regex = "(.+) (Location|on Page) ([0-9]+)-([0-9]+)"
    date_regex = "([a-zA-Z]+), ([a-zA-Z]+) ([0-9]+), ([0-9]+)" # Date
    time_regex = "([0-9]+):([0-9]+) (AM|PM)" # Time
    content_regex = "(.*)"
    footer_regex = "=+"
    nl_re = "\r*\n"
    # No author
    regex_noauthor_str =\
        title_regex + nl_re +\
        "- Your " + loc_range_regex + " | Added on " +\
        date_regex + ", " + time_regex + nl_re +\
        content_regex + nl_re +\
        footer_regex
    regex_noauthor = re.compile(regex_noauthor_str)
    print(regex_noauthor.findall(raw_note))
parse_file("testnotes")
Here is the contents of "testnotes":
Title
- Your Highlight Location 3360-3362 | Added on Wednesday, March 21, 2012, 12:16 AM
Note content goes here
==========
What I want:
[('Title', 'Highlight', 'Location', '3360', '3362', 'Wednesday', 'March', '21', '2012', '12', '16', 'AM',
But when I run the program, I get:
[('Title', 'Highlight', 'Location', '3360', '3362', '', '', '', '', '', '', '', '')]
I'm fairly new to regex, but I feel like this should be fairly straightforward.
When you say " | Added on ", you need to escape the |.
Replace that string with " \| Added on "
You need to escape the | in "- Your " + loc_range_regex + " | Added on " +\
to: "- Your " + loc_range_regex + " \| Added on " +\
| is the OR operator in a regex.
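The effect is easy to see on a shortened pattern. The sketch below uses the sample line from the question; the two small patterns are made up for illustration and are not the full regex.

```python
import re

line = "- Your Highlight Location 3360-3362 | Added on Wednesday, March 21, 2012, 12:16 AM"

# Unescaped: the pattern means "3362 " OR " Added", two separate alternatives.
unescaped = re.findall(r"3362 | Added", line)
print(unescaped)

# Escaped: the pipe is matched literally, so the whole span is one match.
escaped = re.findall(r"3362 \| Added", line)
print(escaped)

# re.escape can build the literal part when it should not be written by hand.
literal = re.escape(" | Added on ")
print(re.search(literal, line) is not None)
```

In the original pattern the unescaped | splits the whole expression into two alternatives, which is why only the groups from the first half were filled in.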
Should anyone need an update to this, the following works with Paperwhite & Voyage Kindles in 2017 : https://gist.github.com/laffan/7b945d256028d2ffaacd4d99be40ca34
