I am trying to recursively parse an expression. I followed a few tutorials, and it seems like Forward() is the class that I need. However, something seemingly simple is causing me trouble.
Here is the code I wrote
from pyparsing import *
exp = Forward()
integer = Word(nums)
exp << (integer | (exp + '+' + exp))
input = "1+1"
print(exp.parseString(input))
I want it to return ['1','+','1'] but it only returns ['1']
Help is much appreciated.
There are several issues you have here. In ascending order of importance:
parseString will not raise an exception if there is extra text after the parsed content. Use exp.parseString(input, parseAll=True)
'|' is MatchFirst, not MatchLongest. Since your integer is first, it will be matched first. Then the parser fails on the '+' all by itself. If you want longest-match behavior, use the '^' operator.
The Biggie: once you convert to '^' (or reorder the expressions to put exp + exp first, ahead of integer), you will find yourself blowing up the maximum recursion depth. That is because this parser has a left-recursive definition of exp. That is, to parse an exp, it has to parse an exp, for which it has to parse an exp, etc. In general, many published BNFs use recursion to describe this kind of repetitive structure, but pyparsing does not do the necessary lookahead/backtrack for that to work. Try exp <<= integer + ZeroOrMore('+' + integer) | '(' + exp + ')' - this expression is not left-recursive, since you have to get past an opening parenthesis before parsing a nested exp.
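The first two points can be seen directly with the grammar from the question. This is only a minimal sketch, assuming pyparsing is installed; the names partial and strict_failed are introduced here for illustration and are not part of the original code.

```python
from pyparsing import Forward, ParseException, Word, nums

exp = Forward()
integer = Word(nums)
# Same grammar as in the question: '|' tries alternatives left to right,
# so integer wins and the recursive branch is never attempted.
exp <<= integer | (exp + '+' + exp)

# Without parseAll, parsing stops silently after the first match.
partial = exp.parseString("1+1").asList()
print(partial)  # ['1']

# With parseAll=True, the unparsed "+1" is reported as an error.
try:
    exp.parseString("1+1", parseAll=True)
    strict_failed = False
except ParseException as err:
    strict_failed = True
    print("parse error:", err)
```

Using parseAll=True while developing a grammar tends to surface this kind of silent partial match early.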
EDIT:
Sorry, I was a little too quick with my earlier suggestion; here is the proper way to do your recursive expression parsing:
from pyparsing import *
exp = Forward()
LPAR, RPAR = map(Suppress, "()")
integer = Word(nums)
term = integer | Group(LPAR + exp + RPAR)
exp << term + ZeroOrMore('+' + term)
input = "(1+1) + 1"
print(exp.parseString(input))
prints
[['1', '+', '1'], '+', '1']
If you trace through the code, you'll see the recursion: exp is defined using term, and term is defined using exp. The fourFn.py example is closest to this style; since writing that, I've added the infixNotation method to pyparsing, which would allow you to write:
exp = infixNotation(integer, [
('+', 2, opAssoc.LEFT),
])
infixNotation takes care of the recursive definitions internally, implicitly defines the '(' + exp + ')' expression, and makes it easy to implement a system of operators with precedence of operations.
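For completeness, the infixNotation fragment can be turned into a small runnable script. This is a sketch that assumes pyparsing is installed and reuses the integer expression from the earlier code; the name result is illustrative.

```python
from pyparsing import Word, nums, infixNotation, opAssoc

integer = Word(nums)

# One precedence level: binary, left-associative '+'.
exp = infixNotation(integer, [
    ('+', 2, opAssoc.LEFT),
])

# Parentheses are handled by infixNotation itself.
result = exp.parseString("(1+1) + 1", parseAll=True).asList()
print(result)
```

Note that infixNotation wraps each matched operator level in its own group, so the result has one more level of nesting than the hand-written grammar produces.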
I recently ran into this problem. Left recursion is now supported in pyparsing 3.0.0b2 and later, but the feature must be explicitly enabled, and the operands of the | operator must be reordered, like this:
from pyparsing import ParserElement, Forward, Word, nums
ParserElement.enable_left_recursion()
exp = Forward()
integer = Word(nums)
exp << ((exp + '+' + exp) | integer)
input = "1+1"
print(exp.parseString(input))
This will output the following.
['1', '+', '1']
The code I've written is in Python, but the problem isn't Python-specific.
Effectively, I'd like to check whether a one-line string that is meant to represent a mathematical expression is valid. I have heard of the Shunting Yard algorithm and intend to attempt my own version of it once I get past my initial problem: potential whitespace between operators.
The approach I am taking is to go through character by character and tokenize them based on whether they are valid mathematical symbols or not. Parentheses get special treatment.
Overall, the approach I came up with does work, but I feel there could be some improvements or optimizations.
Step 1: Get the string into a state that can be tokenized
The symbols I want to use are the basic arithmetic operations: +-*/ as well as % for modular division, ^&| for bitwise, and later << and >> for shifting.
I also want to allow functions and user-defined variables in the string so text is also valid.
Examples of valid input with operations, functions, and vars:
1+sqrt(2)*abc
(sin(3)-cos(4))/def
The first one would tokenize into: 1, +, sqrt(, 2, ), *, abc
The second one would tokenize into: (, sin(, 3, ), -, cos(, 4, ), ), /, def
Whitespace complicates things
As stated, the current issue I'm facing is how to deal with whitespace. Using the above examples, the following would also be valid and would evaluate to the previous example after stripping whitespace between symbols:
1 + sqrt(2) * abc
( sin(3) - cos(4) ) / def
However, something like this would not be valid, because of the space between the function name and the parenthesis:
1 + sqrt (2) * abc
Then there is the matter of invalid input such as:
1 + sq rt(2)
1 + + sqrt(2)
The first would evaluate to a function called sq rt() and would be flagged later as invalid
The second would create + + and that too would be flagged later as invalid
Current Process
val = val.strip()
mathTokens = ('+', '-', '*', '/', '%', '^', '&', '|', '~', '<' , '>')
token = ''
tokens = []
mathMode = bool(val[0] in mathTokens)
for j in range(0, len(val)):
    char = val[j]
    #Left paren
    if(char == '('):
        if(mathMode):
            token = token.strip()
            tokens.append(token)
            tokens.append(char)
        else:
            token += char
            tokens.append(token)
        token = ''
        continue
    #Right paren
    if(char == ')'):
        token = token.strip()
        tokens.append(token)
        tokens.append(char)
        token = ''
        continue
    #All others
    if(char == ' ' or (mathMode and char in mathTokens) or (not mathMode and char not in mathTokens)):
        token += char
    else:
        token = token.strip()
        #If a series of spaces only was found, don't append
        if(token):
            tokens.append(token)
        token = char
        mathMode = not mathMode
#Append the last token unless it's spaces only (e.g. after right parentheses)
token = token.strip()
if(token):
    tokens.append(token)
print(tokens)
Sample Input and Result
Here's some sample input (complete with spaces) which is in val in the code above:
1 + sqrt(2) * abc
This gets stored into the tokens list as:
[
'1',
'+',
'sqrt(',
'2',
')',
'*',
'abc'
]
Afterwards
Ultimately, after parsing and figuring out if it's a valid expression then the end result will be fully calculated. The calculation is going to be the second part of this. For the calculation, the Shunting Yard algorithm is what I'm thinking of going with. In general, from what I gather, I can take my tokens and evaluate accordingly.
Conclusion
As I said at the beginning, this approach does work but constantly doing strip() and looking up the symbols feels like it could be slow in situations where there potentially are lots of values to evaluate. A regex approach sounds like it'd complicate things way too much and I'm not sure if that'd even offer better performance.
Are there any algorithms suited to this particular situation, especially for selectively discarding whitespace?
I'm also wondering if anything should be changed for potential negative numbers as well.
That's my situation. Insight appreciated, thanks.
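If you only aim at checking validity, one option is to use Python's ast module (Abstract Syntax Trees) to check whether a string such as a mathematical expression parses. Note that this only confirms the string is syntactically valid Python, not that it is a meaningful arithmetic expression: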
import ast
def checkEquation(equation):
    try:
        ast.parse(equation)
        return 'valid'
    except SyntaxError:
        return 'invalid'
print( checkEquation('-1 - sqrt(2) * abc') ) #output: valid
print( checkEquation('-1 - sqrt(2 * abc') ) #output: invalid
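On the whitespace question itself, a regex-based tokenizer may be simpler than it sounds, because the pattern can skip leading whitespace before every token. The following is only a sketch using the standard re module; TOKEN_RE and tokenize are names made up for this example, and the token classes are an assumption based on the operators and examples listed in the question.

```python
import re

# Hypothetical token pattern. Alternatives are tried in order, so a
# function name followed directly by '(' wins over a plain name, and
# the two-character shift operators are listed before single characters.
TOKEN_RE = re.compile(r"""
    \s*(?:
        (?P<func>[A-Za-z_]\w*\()
      | (?P<name>[A-Za-z_]\w*)
      | (?P<num>\d+(?:\.\d+)?)
      | (?P<op><<|>>|[-+*/%^&|~])
      | (?P<paren>[()])
    )""", re.VERBOSE)

def tokenize(text):
    """Split text into tokens, ignoring whitespace between them."""
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        match = TOKEN_RE.match(text, pos)
        if match is None:
            raise ValueError("unexpected character at position %d" % pos)
        # Exactly one named group matched; lastgroup is its name.
        tokens.append(match.group(match.lastgroup))
        pos = match.end()
    return tokens

print(tokenize("1 + sqrt(2) * abc"))
print(tokenize("1 + sqrt (2) * abc"))
```

With this pattern, "sqrt(" becomes a single function token, while "sqrt (" yields a plain name followed by a bare parenthesis, so a later validation step can reject it. Unary minus is not treated specially here; a leading '-' simply comes out as an operator token.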
I'm developing a parser for a config format that also uses functions, and I'm trying to figure out whether it's possible to use pyparsing for this task.
My input is
%upper(%basename(filename)fandsomemore)f%lower(test)f _ %lower(test2)f
and the final result should be
FILENAMEANDSOMEMOREtest _ test2
In order to do this I first have to get the function names and arguments. My current code works only for the upper function and the nested basename, and only if the "andsomemore" part is missing. The code is below:
from pyparsing import *
identifier = Word(alphas)
integer = Word(nums)
functor = identifier
expression = Forward()
lparen = Literal("%").suppress() + functor + Literal("(").suppress()
rparen = Literal(")f").suppress()
argnrec = identifier | integer
arg = Group(expression) | argnrec
args = arg + ZeroOrMore("," + argnrec)
expression << Group(lparen + args + rparen)
print(expression.parseString("%upper(%basename(filename)f)f%lower(test)f%lower(test2)f"))
print(expression.parseString("%upper(%basename(filename)fandsomemore)f%lower(test)f _ %lower(test2)f"))
This works great for the first print and outputs, as expected
[['upper', [['basename', 'filename']]]]
For the second print I have an error:
pyparsing.ParseException: Expected ")f" (at char 27), (line:1, col:28)
Is there any way to get this to work with pyparsing? If not, any alternative approach would be appreciated. Also, keep in mind that this must handle more complex arguments, like Windows paths (the current code only works for letters and digits).
Later update:
The purpose for this is to be able to insert the functions anywhere in a string. Another possible usage would be:
%upper(this)f is a %lower(TEST)f and m%upper(%lower(ORE)f)f
which should ultimately produce
THIS is a test and mORE
Just having a play here, I'm not entirely sure what you want but perhaps this will help?
from pyparsing import *
identifier = Word(alphas+nums+'_')
integer = Word(nums)
functor = identifier
expression = Forward()
lparen = Literal("%").suppress() + functor + Literal("(").suppress()
rparen = Literal(")f").suppress()
term = (lparen + expression + rparen) | identifier
#args = term + ZeroOrMore("," + argnrec)
expression << OneOrMore(term)
print(expression.parseString("%upper(%basename(filename)f)f%lower(test)f%lower(test2)f"))
print(expression.parseString("%upper(%basename(filename)fandsomemore)f%lower(test)f _ %lower(test2)f"))
C:\temp>python test.py
['upper', 'basename', 'filename', 'lower', 'test', 'lower', 'test2']
['upper', 'basename', 'filename', 'andsomemore', 'lower', 'test', '_', 'lower', 'test2']
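The token lists above are flat, so the nesting is lost. If the goal is the expanded string rather than a parse tree, one alternative is a small hand-written recursive expander. This is a sketch in plain Python, not a pyparsing solution; FUNCS and expand are hypothetical names, and using ntpath.basename for the basename function is an assumption made so that Windows-style paths are handled.

```python
import ntpath

# Hypothetical function table for the %name(...)f syntax.
FUNCS = {
    "upper": str.upper,
    "lower": str.lower,
    "basename": ntpath.basename,
}

def expand(text, pos=0, nested=False):
    """Expand %name(...)f calls from pos; return (result, next_pos)."""
    out = []
    while pos < len(text):
        # Inside a call, ")f" closes the current argument.
        if nested and text.startswith(")f", pos):
            return "".join(out), pos + 2
        if text[pos] == "%":
            end = text.find("(", pos)
            name = text[pos + 1:end]
            if end != -1 and name in FUNCS:
                # Recurse so that nested calls are expanded first.
                arg, pos = expand(text, end + 1, nested=True)
                out.append(FUNCS[name](arg))
                continue
        # Anything else is copied through unchanged.
        out.append(text[pos])
        pos += 1
    if nested:
        raise ValueError("missing closing ')f'")
    return "".join(out), pos

line = "%upper(%basename(filename)fandsomemore)f%lower(test)f _ %lower(test2)f"
print(expand(line)[0])
```

One limitation of this sketch is inherent to the format: an argument that itself contains the two characters ")f" would be read as the end of the call.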
I have a boolean expression string, that I would like to take apart:
condition = "a and (b or (c and d))"
More precisely, I want to be able to access the string contents between each pair of matching parentheses.
I want the following outcome:
"(b or (c and d))"
"(c and d)"
I've tried the following with regular expressions (not really working)
x = re.match(r".*(\(.*\))", condition)
print(x.group(1))
Question:
What is the nicest way to take a boolean expression string apart?
This is the sort of thing you can't do with a simple regex. You need to actually parse the text. pyparsing is apparently excellent for doing that.
Like everyone said, you need a parser.
If you don't want to install one, you can start from this simple top-down parser (take the last code sample here)
Remove everything not related to your needs (+, -, *, /, is, lambda, if, else, ...). Just keep parentheses, and, or.
You will get a binary tree structure generated from your expression.
The tokenizer uses the built-in tokenize module (import tokenize), which is a lexical scanner for Python source code but works just fine for simple cases like yours.
If your requirements are fairly simple, you don't really need a parser.
Matching parentheses can easily be achieved using a stack.
You could do something like the following:
condition = "a and (b or (c and d))"
stack = []
for c in condition:
    if c != ')':
        stack.append(c)
    else:
        d = c
        contents = []
        while d != '(':
            contents.insert(0, d)
            d = stack.pop()
        contents.insert(0, d)
        s = ''.join(contents)
        print(s)
        stack.append(s)
produces:
(c and d)
(b or (c and d))
Build a parser:
Condition ::= Term Condition'
Condition' ::= epsilon | OR Term Condition'
Term ::= Factor Term'
Term' ::= epsilon | AND Factor Term'
Factor ::= [ NOT ] Primary
Primary ::= Literal | '(' Condition ')'
Literal ::= Id
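The grammar above maps fairly directly onto a recursive-descent parser, with one function per rule. The following is a sketch in plain Python; the function names and the tuple-based tree shape are choices made for this example, and the primed rules (Condition', Term') are written as loops rather than as separate recursive functions.

```python
import re

def tokenize(text):
    # Parentheses and words (identifiers and the keywords and/or/not).
    return re.findall(r"\(|\)|\w+", text)

def parse_condition(tokens, pos=0):
    # Condition ::= Term { OR Term }
    node, pos = parse_term(tokens, pos)
    while pos < len(tokens) and tokens[pos] == "or":
        right, pos = parse_term(tokens, pos + 1)
        node = ("or", node, right)
    return node, pos

def parse_term(tokens, pos):
    # Term ::= Factor { AND Factor }
    node, pos = parse_factor(tokens, pos)
    while pos < len(tokens) and tokens[pos] == "and":
        right, pos = parse_factor(tokens, pos + 1)
        node = ("and", node, right)
    return node, pos

def parse_factor(tokens, pos):
    # Factor ::= [ NOT ] Primary
    if pos < len(tokens) and tokens[pos] == "not":
        node, pos = parse_primary(tokens, pos + 1)
        return ("not", node), pos
    return parse_primary(tokens, pos)

def parse_primary(tokens, pos):
    # Primary ::= Literal | '(' Condition ')'
    if pos >= len(tokens):
        raise SyntaxError("unexpected end of input")
    tok = tokens[pos]
    if tok == "(":
        node, pos = parse_condition(tokens, pos + 1)
        if pos >= len(tokens) or tokens[pos] != ")":
            raise SyntaxError("expected ')'")
        return node, pos + 1
    if tok in (")", "and", "or", "not"):
        raise SyntaxError("unexpected token %r" % tok)
    return tok, pos + 1

def parse(text):
    tokens = tokenize(text)
    tree, pos = parse_condition(tokens)
    if pos != len(tokens):
        raise SyntaxError("unexpected token %r" % tokens[pos])
    return tree

print(parse("a and (b or (c and d))"))
# ('and', 'a', ('or', 'b', ('and', 'c', 'd')))
```

Each parenthesized sub-expression ends up as its own subtree, which gives access to the nested conditions without any string slicing.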
Intention:
To parse arithmetic expressions with support for logarithms and exponentials. All of the following expressions are valid:
x + y
exp(x) + y
exp(x+y)
exp(log(x)+exp(z))+exp(y)
x+log(exp(y))
x + 2
Source Code:
import pyparsing as pp
arith_expr = pp.Forward()
op = pp.oneOf("^ / * % + -")
exp_funcs = pp.Regex(r"(log|exp)(2|10|e)?")
operand = pp.Word(pp.alphas, pp.alphanums + "_") | pp.Regex(r"[+-]?\d+(?:\.\d*)?(?:[eE][+-]?\d+)?")
func_atom = operand ^ (pp.Optional(exp_funcs) + "(" + arith_expr + ")")
comp_expr = pp.Combine(func_atom + pp.ZeroOrMore(op + func_atom))
arith_expr << comp_expr
print(arith_expr.parseString("exp(datasize+ 2) +3"))
Observation
The grammar is able to parse every such arithmetic expression, but fails when whitespace appears around an operand or operator. It is unable to parse the following expressions:
exp(x+ 2)
exp( x + 2 )
x + 2
I have tried debugging the grammar with setDebug( ) and found the following error on every such failure;
Expected ")"
I reckon pyparsing is whitespace-insensitive, judging by its documentation and my day-to-day use. I have tried debugging with every change I could think of, but none of them solved the issue. I would appreciate your suggestions. :)
The problem is with your use of Combine, which squashes all the tokens together into a single token. Whitespace between tokens is ignored in pyparsing, but whitespace within a token is not.
To fix this, get rid of Combine and then pass the result to ''.join to get it back into one string.
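Applied to the grammar from the question, that change might look like the sketch below. It assumes pyparsing is installed; normalize is a hypothetical helper name, and the number regex uses (?:...) non-capturing groups where the question has (:?...).

```python
import pyparsing as pp

arith_expr = pp.Forward()
op = pp.oneOf("^ / * % + -")
exp_funcs = pp.Regex(r"(log|exp)(2|10|e)?")
number = pp.Regex(r"[+-]?\d+(?:\.\d*)?(?:[eE][+-]?\d+)?")
operand = pp.Word(pp.alphas, pp.alphanums + "_") | number
func_atom = operand ^ (pp.Optional(exp_funcs) + "(" + arith_expr + ")")
# No Combine here, so whitespace between tokens is skipped as usual.
arith_expr <<= func_atom + pp.ZeroOrMore(op + func_atom)

def normalize(text):
    """Parse text and glue the tokens back into one string."""
    return "".join(arith_expr.parseString(text, parseAll=True))

print(normalize("exp( x + 2 )"))         # exp(x+2)
print(normalize("exp(datasize+ 2) +3"))  # exp(datasize+2)+3
```

Joining after parsing gives the same single-string result that Combine produced, but without requiring the tokens to be adjacent in the input.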
Pythonistas:
Suppose you want to parse the following string using Pyparsing:
'ABC_123_SPEED_X 123'
where ABC_123 is an identifier, SPEED_X is a parameter, and 123 is a value. I thought of the following BNF using Pyparsing:
Identifier = Word( alphanums + '_' )
Parameter = Keyword('SPEED_X') | Keyword('SPEED_Y') | Keyword('SPEED_Z')
Value = # assume I already have an expression valid for any value
Entry = Identifier + Literal('_') + Parameter + Value
tokens = Entry.parseString('ABC_123_SPEED_X 123')
#Error: pyparsing.ParseException: Expected "_" (at char 16), (line:1, col:17)
If I remove the underscore from the middle (and adjust the Entry definition accordingly) it parses correctly.
How can I make this parser a bit lazier, so that it waits until it matches the Keyword (as opposed to slurping the entire string as an Identifier and then waiting for the _, which does not exist)?
Thank you.
[Note: This is a complete rewrite of my question; I had not realized what the real problem was]
I based my answer off of this one, since what you're trying to do is get a non-greedy match. It seems like this is difficult to make happen in pyparsing, but not impossible with some cleverness and compromise. The following seems to work:
from pyparsing import *
Parameter = Literal('SPEED_X') | Literal('SPEED_Y') | Literal('SPEED_Z')
UndParam = Suppress('_') + Parameter
Identifier = SkipTo(UndParam)
Value = Word(nums)
Entry = Identifier + UndParam + Value
When we run this from the interactive interpreter, we can see the following:
>>> Entry.parseString('ABC_123_SPEED_X 123')
(['ABC_123', 'SPEED_X', '123'], {})
Note that this is a compromise; because I use SkipTo, the Identifier can be full of evil, disgusting characters, not just beautiful alphanums with the occasional underscore.
EDIT: Thanks to Paul McGuire, we can concoct a truly elegant solution by setting Identifier to the following:
Identifier = Combine(Word(alphanums) +
ZeroOrMore('_' + ~Parameter + Word(alphanums)))
Let's inspect how this works. First, ignore the outer Combine; we'll get to this later. Starting with Word(alphanums) we know we'll get the 'ABC' part of the reference string, 'ABC_123_SPEED_X 123'. It's important to note that we didn't allow the "word" to contain underscores in this case. We build that separately into the logic.
Next, we need to capture the '_123' part without also sucking in '_SPEED_X'. Let's also skip over ZeroOrMore at this point and return to it later. We start with the underscore as a Literal, but we can shortcut with just '_', which will get us the leading underscore, but not all of '_123'. Instinctively, we would place another Word(alphanums) to capture the rest, but that's exactly what will get us in trouble by consuming all of the remaining '_123_SPEED_X'. Instead, we say, "So long as what follows the underscore is not the Parameter, parse that as part of my Identifier." We state that in pyparsing terms as '_' + ~Parameter + Word(alphanums). Since we assume we can have an arbitrary number of underscore + WordButNotParameter repeats, we wrap that expression in a ZeroOrMore construct. (If you always expect at least one underscore + WordButNotParameter following the initial Word, you can use OneOrMore.)
Finally, we need to wrap the initial Word and the special underscore + Word repeats together so that it's understood they are contiguous, not separated by whitespace, so we wrap the whole expression up in a Combine construct. This way 'ABC _123_SPEED_X' will raise a parse error, but 'ABC_123_SPEED_X' will parse correctly.
Note also that I had to change Keyword to Literal because the ways of the former are far too subtle and quick to anger. I do not trust Keywords, nor could I get matching with them.
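Putting the pieces of this answer together gives a short runnable script. This is a sketch assuming pyparsing is installed; the names good and rejected are introduced only for the example.

```python
from pyparsing import (Combine, Literal, ParseException, Suppress,
                       Word, ZeroOrMore, alphanums, nums)

Parameter = Literal('SPEED_X') | Literal('SPEED_Y') | Literal('SPEED_Z')
# '_' + word repeats, stopping as soon as a Parameter follows the '_'.
Identifier = Combine(Word(alphanums) +
                     ZeroOrMore('_' + ~Parameter + Word(alphanums)))
Value = Word(nums)
Entry = Identifier + Suppress('_') + Parameter + Value

good = Entry.parseString('ABC_123_SPEED_X 123').asList()
print(good)  # ['ABC_123', 'SPEED_X', '123']

# Whitespace inside the identifier is rejected because of Combine.
try:
    Entry.parseString('ABC _123_SPEED_X 123')
    rejected = False
except ParseException:
    rejected = True
print(rejected)  # True
```

Unlike the SkipTo version, this keeps the identifier restricted to alphanumerics and underscores.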
If you are sure that the identifier never ends with an underscore, you can enforce it in the definition:
from pyparsing import *
my_string = 'ABC_123_SPEED_X 123'
Identifier = Combine(Word(alphanums) + Literal('_') + Word(alphanums))
Parameter = Literal('SPEED_X') | Literal('SPEED_Y') | Literal('SPEED_Z')
Value = Word(nums)
Entry = Identifier + Literal('_').suppress() + Parameter + Value
tokens = Entry.parseString(my_string)
print(tokens) # prints: ['ABC_123', 'SPEED_X', '123']
If that is not the case, but the identifier length is fixed, you can define Identifier like this:
Identifier = Word( alphanums + '_' , exact=7)
You can also parse the identifier and parameter as one token, and split them in a parse action:
from pyparsing import *
import re
def split_ident_and_param(tokens):
    mo = re.match(r"^(.*?_.*?)_(.*?_.*?)$", tokens[0])
    return [mo.group(1), mo.group(2)]
ident_and_param = Word(alphanums + "_").setParseAction(split_ident_and_param)
value = Word(nums)
entry = ident_and_param + value
print(entry.parseString("APC_123_SPEED_X 123"))
The example above assumes that the identifiers and parameters always have the format XXX_YYY (containing one single underscore).
If this is not the case, you need to adjust the split_ident_and_param() method.
This answers a question that you have probably also asked yourself: "What's a real-world application for reduce?"
>>> from functools import reduce  # reduce is not a builtin in Python 3
>>> keys = ['CAT', 'DOG', 'HORSE', 'DEER', 'RHINOCEROS']
>>> p = reduce(lambda x, y: x | y, [Keyword(x) for x in keys])
>>> p
{{{{"CAT" | "DOG"} | "HORSE"} | "DEER"} | "RHINOCEROS"}
Edit:
This was a pretty good answer to the original question. I'll have to work on the new one.
Further edit:
I'm pretty sure you can't do what you're trying to do. The parser that pyparsing creates doesn't do lookahead. So if you tell it to match Word(alphanums + '_'), it's going to keep matching characters until it finds one that's not a letter, number, or underscore.
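The greedy behavior described here is easy to observe. A minimal sketch, assuming pyparsing is installed; greedy, consumed, entry and failed are names made up for this example.

```python
from pyparsing import Literal, ParseException, Word, alphanums

# Word keeps consuming as long as characters are in its allowed set.
greedy = Word(alphanums + '_')
consumed = greedy.parseString('ABC_123_SPEED_X 123').asList()
print(consumed)  # ['ABC_123_SPEED_X']

# Nothing is left for the '_' + parameter part, so the sequence fails
# and the Word is not retried with a shorter match.
entry = greedy + Literal('_') + Literal('SPEED_X')
try:
    entry.parseString('ABC_123_SPEED_X 123')
    failed = False
except ParseException:
    failed = True
print(failed)  # True
```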