I'm trying to write a simple int expression parser using tatsu, a PEG-based Python parser generator. Here is my code:
import tatsu
grammar = r'''
start = expression $ ;
expression = add | sub | term ;
add = expression '+' term ;
sub = expression '-' term ;
term = mul | div | number ;
mul = term '*' number ;
div = term '/' number ;
number = [ '-' ] /\d+/ ;
'''
parser = tatsu.compile(grammar)
print(parser.parse('2-1'))
The output of this program is ['-', '1'] instead of the expected ['2', '-', '1'].
I get the correct output if I either:
Remove support for unary minus, i.e. change the last rule to number = /\d+/ ;
Remove the term, mul and div rules, and support only addition and subtraction
Replace the second rule with expression = add | sub | mul | div | number ;
The last option actually works without leaving any feature out, but I don't understand why it works. What is going on?
EDIT: If I just flip the add/sub/mul/div rules to get rid of left recursion, it also works. But then evaluating the expressions becomes a problem, since the parse tree is flipped. (3-2-1 becomes 3-(2-1))
There are left recursion cases that TatSu doesn't handle, and work on fixing that is currently on hold.
You can use left/right join/gather operators to control the associativity of parsed expressions in a non-left-recursive grammar.
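To illustrate the general idea independently of TatSu (this is a hand-written sketch, not TatSu's API), a left-recursive rule such as expression = expression '-' term can be rewritten as term { ('+' | '-') term } and the tree folded to the left inside the loop, so 3-2-1 still groups as (3-2)-1. The function names and the tuple-based tree shape are assumptions made for this example.

```python
import re

# One token is either a run of digits or a single operator character.
TOKEN = re.compile(r"\s*(\d+|[-+*/])")


def tokenize(text):
    text = text.rstrip()
    tokens, pos = [], 0
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            raise SyntaxError(f"unexpected character at position {pos}")
        tokens.append(m.group(1))
        pos = m.end()
    return tokens


def parse_number(tokens):
    # number = [ '-' ] /\d+/
    sign = ""
    if tokens and tokens[0] == "-":
        sign = tokens.pop(0)
    if not tokens or not tokens[0].isdigit():
        raise SyntaxError("expected a number")
    return sign + tokens.pop(0)


def parse_term(tokens):
    # term = number { ('*' | '/') number }, folded to the left
    node = parse_number(tokens)
    while tokens and tokens[0] in ("*", "/"):
        op = tokens.pop(0)
        node = (node, op, parse_number(tokens))
    return node


def parse_expression(tokens):
    # expression = term { ('+' | '-') term }, folded to the left
    node = parse_term(tokens)
    while tokens and tokens[0] in ("+", "-"):
        op = tokens.pop(0)
        node = (node, op, parse_term(tokens))
    return node


def parse(text):
    tokens = tokenize(text)
    tree = parse_expression(tokens)
    if tokens:
        raise SyntaxError(f"unexpected token {tokens[0]!r}")
    return tree


print(parse("2-1"))    # ('2', '-', '1')
print(parse("3-2-1"))  # (('3', '-', '2'), '-', '1')
```

Because each loop iteration wraps the tree built so far as the left operand, the result is left-associative even though the grammar itself contains no left recursion.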
Related
Currently, I have a Boolean expression which supports & (logical AND), | (logical OR), (, ) (parentheses) operators along with status codes like s, f, d, n, t and job names.
The status codes represent the status of a job. (Eg: s = success, f = failure, etc...) and the job name is enclosed within parentheses with an optional argument which is a number within quotes.
Example i/p:
( s(job_A, "11:00") & f(job_B) ) | ( s(job_C) & t(job_D) )
My requirement is for such a given string in Python, I need to replace the existing job names with new job names containing a prefix and everything else should remain the same:
Example o/p:
( s(prefix_job_A, "11:00") & f(prefix_job_B) ) | ( s(prefix_job_C) & t(prefix_job_D) )
This logical expression can be arbitrarily nested like any Boolean expression and being a non-regular language we can't use regexes.
Please note: The job names are NOT known before-hand so we can't statically store the names in a dictionary and perform a replacement.
The current approach I have thought of is to generate an expression tree and perform replacements in the OPERAND nodes of that tree, however I am not sure how to proceed with this. Is there any library in python which can help me to define the grammar to build this tree? How do I specify the grammar?
Can someone help me with the approach?
Edit: The job names don't have any particular form. The minimum length of a job name is 6 and the job names are alphanumeric with underscores.
Given that we can assume job names are alphanumeric plus _ and at least 6 characters long, we should be able to do this with just a regex, since nothing else in the given strings looks like that.
import re  # the standard-library module is enough for this pattern
exp = '( s(job__A, "11:00") & f(job__B) ) | ( s(job__C) & t(job__D) )'
name_regex = r"([a-zA-Z\d_]{6,})" # at least 6 alphanumeric + _ characters (raw string so \d is not treated as a string escape)
prefix = "prefix_"
new_exp = re.sub(name_regex, f"{prefix}\\1", exp)
print(new_exp)
I am trying to recursively parse an expression. I followed a few tutorials, and it seems like Forward() is the class that I need. However, something seemingly simple is causing me trouble.
Here is the code I wrote
from pyparsing import *
exp = Forward()
integer = Word(nums)
exp << (integer | (exp + '+' + exp))
input = "1+1"
print(exp.parseString(input))
I want it to return ['1','+','1'] but it only returns ['1']
Help is much appreciated.
There are several issues you have here. In ascending order of importance:
parseString will not raise an exception if there is extra text after the parsed content. Use exp.parseString(input, parseAll=True)
'|' is MatchFirst, not MatchLongest. Since your integer is first, it will be matched first. Then the parser fails on the '+' all by itself. If you want match longest, use '^' operator.
The Biggie: once you convert to '^' (or reorder the expressions to put exp + '+' + exp first, ahead of integer), you will find yourself hitting the maximum recursion depth. That is because this parser has a left-recursive definition of exp: to parse an exp, it has to parse an exp, for which it has to parse an exp, etc. In general, many published BNFs use recursion to describe this kind of repetitive structure, but pyparsing does not do the necessary lookahead/backtrack for that to work. Try exp <<= integer + ZeroOrMore('+' + integer) | '(' + exp + ')' - this expression is not left-recursive, since you have to get past an opening parenthesis before parsing a nested exp.
EDIT:
Sorry, I was a little too quick on my earlier suggestion, here is the proper way to do your recursive expression parsing:
from pyparsing import *
exp = Forward()
LPAR, RPAR = map(Suppress, "()")
integer = Word(nums)
term = integer | Group(LPAR + exp + RPAR)
exp << term + ZeroOrMore('+' + term)
input = "(1+1) + 1"
print(exp.parseString(input))
prints
[['1', '+', '1'], '+', '1']
If you trace through the code, you'll see the recursion: exp is defined using term, and term is defined using exp. The fourFn.py example is closest to this style; since writing that, I've added the infixNotation method to pyparsing, which would allow you to write:
exp = infixNotation(integer, [
    ('+', 2, opAssoc.LEFT),
])
infixNotation takes care of the recursive definitions internally, implicitly defines the '(' + exp + ')' expression, and makes it easy to implement a system of operators with precedence of operations.
I recently ran into this problem. Left recursion is now supported in pyparsing 3.0.0b2 and later, but the feature must be explicitly enabled, and the operands of the | operator must be reordered so that the recursive alternative comes first, like this:
from pyparsing import ParserElement, Forward, Word, nums
ParserElement.enable_left_recursion()
exp = Forward()
integer = Word(nums)
exp << ((exp + '+' + exp) | integer)
input = "1+1"
print(exp.parseString(input))
This will output the following.
['1', '+', '1']
I am writing a Pyparsing grammar to convert Creole markup to HTML. I'm stuck because there's a bit of conflict trying to parse these two constructs:
Image link: {{image.jpg|title}}
Ignore formatting: {{{text}}}
The way I'm parsing the image link is as follows (note that this converts perfectly fine):
def parse_image(s, l, t):
    try:
        link, title = t[0].split("|")
    except ValueError:
        raise ParseFatalException(s,l,"invalid image link reference: " + t[0])
    return '<img src="{0}" alt="{1}" />'.format(link, title)
image = QuotedString("{{", endQuoteChar="}}")
image.setParseAction(parse_image)
Next, I wrote a rule so that when {{{text}}} is encountered, simply return what's between the opening and closing braces without formatting it:
n = QuotedString("{{{", endQuoteChar="}}}")
n.setParseAction(lambda x: x[0])
However, when I try to run the following test case:
text = italic | bold | hr | newline | image | n
print text.transformString("{{{ //ignore formatting// }}}")
I get the following stack trace:
Traceback (most recent call last):
File "C:\Users\User\py\kreyol\parser.py", line 36, in <module>
print text.transformString("{{{ //ignore formatting// }}}")
File "C:\Python27\lib\site-packages\pyparsing.py", line 1210, in transformString
raise exc
pyparsing.ParseFatalException: invalid image link reference: { //ignore formatting// (at char 0), (line:1, col:1)
From what I understand, the parser encounters the {{ first and tries to parse the text as an image instead of text without formatting. How can I solve this ambiguity?
The issue is with this expression:
text = italic | bold | hr | newline | image | n
Pyparsing works strictly left-to-right, with no lookahead. Using '|' operators, you construct a pyparsing MatchFirst expression, which will match the first match of all the alternatives, even if a later match is better.
You can change the evaluation to use "longest match" by using the '^' operator instead:
text = italic ^ bold ^ hr ^ newline ^ image ^ n
This would have a performance penalty in that every expression is tested, even though there is no possibility of a better match.
An easier solution is to just reorder the expressions in your list of alternatives: test for n before image:
text = italic | bold | hr | newline | n | image
Now when evaluating alternatives, it will look for the leading {{{ of n before the leading {{ of image.
This often crops up when people define numeric terms, and accidentally define something like:
integer = Word(nums)
realnumber = Combine(Word(nums) + '.' + Word(nums))
number = integer | realnumber
In this case, number will never match a realnumber, since the leading whole number part will be parsed as an integer. The fix, as in your case, is to either use '^' operator, or just reorder:
number = realnumber | integer
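The same first-match behavior can be seen without pyparsing. Alternation in the standard-library re module is also ordered, so it serves as a convenient stand-in here (an analogy only, not pyparsing code; the variable names are made up for the example):

```python
import re

text = "3.14"

# Integer alternative first: the whole-number part matches, the overall
# match succeeds, and the longer alternative is never tried.
integer_first = re.match(r"\d+|\d+\.\d+", text).group()

# Real-number alternative first: the longer form gets its chance, and
# plain integers still match through the second alternative.
real_first = re.match(r"\d+\.\d+|\d+", text).group()

print(integer_first)  # 3
print(real_first)     # 3.14
```

Putting the more specific alternative first is usually cheaper than asking for the longest match, since the parser can stop at the first alternative that succeeds.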
I have a string like this:
a = '{CGPoint={CGPoint=d{CGPoint=dd}}}{CGSize=dd}dd{CSize=aa}'
Currently I am using this re statement to get desired result:
filter(None, re.split("\\{(.*?)\\}", a))
But this gives me:
['CGPoint={CGPoint=d{CGPoint=dd', '}}', 'CGSize=dd', 'dd', 'CSize=aa']
which is incorrect for my current situation, I need a list like this:
['CGPoint={CGPoint=d{CGPoint=dd}}', 'CGSize=dd', 'dd', 'CSize=aa']
As @m.buettner points out in the comments, Python's implementation of regular expressions can't match pairs of symbols nested to an arbitrary degree. (Other languages can, notably current versions of Perl.) The Pythonic thing to do when you have text that regexes can't parse is to use a recursive-descent parser.
There's no need to reinvent the wheel by writing your own, however; there are a number of easy-to-use parsing libraries out there. I recommend pyparsing which lets you define a grammar directly in your code and easily attach actions to matched tokens. Your code would look something like this:
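from pyparsing import Combine, Forward, Literal, Word, ZeroOrMore, printables
lbrace = Literal('{')
rbrace = Literal('}')
# Exclude the braces themselves; otherwise Word(printables) would consume them.
contents = Word(printables, excludeChars='{}')
expr = Forward()
# A brace group holds any mix of plain text and nested brace groups.
expr <<= Combine(lbrace + ZeroOrMore(contents | expr) + rbrace)
for line in lines:  # lines: your iterable of input strings
    # parses the brace group at the start of the line
    results = expr.parseString(line)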
There's an alternative regex module for Python I really like that supports recursive patterns:
https://pypi.python.org/pypi/regex
pip install regex
Then you can use a recursive pattern in your regex as demonstrated in this script:
import regex
from pprint import pprint
thestr = '{CGPoint={CGPoint=d{CGPoint=dd}}}{CGSize=dd}dd{CSize=aa}'
theregex = r'''
(
{
(?<match>
[^{}]*
(?:
(?1)
[^{}]*
)+
|
[^{}]+
)
}
|
(?<match>
[^{}]+
)
)
'''
matches = regex.findall(theregex, thestr, regex.X)
print('all matches:\n')
pprint(matches)
print('\ndesired matches:\n')
print([match[1] for match in matches])
This outputs:
all matches:
[('{CGPoint={CGPoint=d{CGPoint=dd}}}', 'CGPoint={CGPoint=d{CGPoint=dd}}'),
('{CGSize=dd}', 'CGSize=dd'),
('dd', 'dd'),
('{CSize=aa}', 'CSize=aa')]
desired matches:
['CGPoint={CGPoint=d{CGPoint=dd}}', 'CGSize=dd', 'dd', 'CSize=aa']
pyparsing has a nestedExpr function for matching nested expressions:
import pyparsing as pp
ident = pp.Word(pp.alphanums)
expr = pp.nestedExpr("{", "}") | ident
thestr = '{CGPoint={CGPoint=d{CGPoint=dd}}}{CGSize=dd}dd{CSize=aa}'
for result in expr.searchString(thestr):
    print(result)
yields
[['CGPoint=', ['CGPoint=d', ['CGPoint=dd']]]]
[['CGSize=dd']]
['dd']
[['CSize=aa']]
Here is some pseudocode. It keeps a stack of strings and pops one whenever a closing brace is encountered. There is some extra logic to handle the fact that the outermost braces are not included in the results.
String source = "{CGPoint={CGPoint=d{CGPoint=dd}}}{CGSize=dd}dd{CSize=aa}";
Array results;
Stack stack;
foreach (match in source.match("[{}]|[^{}]+")) {
    switch (match) {
        case '{':
            if (stack.size == 0) stack.push(new String()); // top level: start a new empty string
            else stack.push('{'); // child, so include matched brace.
            break;
        case '}':
            if (stack.size == 1) results.add(stack.pop()); // top-level group closed: move it to the results
            else { top = stack.pop(); stack.last += top + '}'; } // pop from stack and concatenate to previous
            break;
        default:
            if (stack.size == 0) results.add(match); // loose text, add to results
            else stack.last += match; // append to latest member.
    }
}
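The same stack idea can be sketched in Python using only the standard library. The function name split_top_level and the error handling for unbalanced braces are assumptions added for this example; they are not part of the pseudocode above.

```python
import re


def split_top_level(source):
    """Split source into top-level brace groups (outer braces removed) and loose text."""
    results = []
    stack = []  # one partially built string per currently open brace
    for token in re.findall(r"[{}]|[^{}]+", source):
        if token == "{":
            # A top-level group starts empty; a nested group keeps its brace.
            stack.append("" if not stack else "{")
        elif token == "}":
            if not stack:
                raise ValueError("unbalanced closing brace")
            inner = stack.pop()
            if stack:
                # Nested group finished: fold it back into its parent.
                stack[-1] += inner + "}"
            else:
                # Top-level group finished: emit it without the outer braces.
                results.append(inner)
        elif stack:
            stack[-1] += token  # text inside a group
        else:
            results.append(token)  # loose text between groups
    if stack:
        raise ValueError("unbalanced opening brace")
    return results


a = '{CGPoint={CGPoint=d{CGPoint=dd}}}{CGSize=dd}dd{CSize=aa}'
print(split_top_level(a))
# ['CGPoint={CGPoint=d{CGPoint=dd}}', 'CGSize=dd', 'dd', 'CSize=aa']
```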
I have a boolean expression string, that I would like to take apart:
condition = "a and (b or (c and d))"
Or let's say:
I want to be able to access the string contents between each pair of matching parentheses.
I want following outcome:
"(b or (c and d))"
"(c and d)"
I've tried the following with regular expressions (not really working)
x = re.match(".*(\(.*\))", condition)
print x.group(1)
Question:
What is the nicest way to take a boolean expression string apart?
This is the sort of thing you can't do with a simple regex. You need to actually parse the text. pyparsing is apparently excellent for doing that.
Like everyone said, you need a parser.
If you don't want to install one, you can start from this simple top-down parser (take the last code sample here)
Remove everything not related to your need (+, -, *, /, is, lambda, if, else, ...). Just keep parentheses, and, or.
You will get a binary tree structure generated from your expression.
The tokenizer uses the built-in tokenize module (import tokenize), which is a lexical scanner for Python source code but works just fine for simple cases like yours.
If your requirements are fairly simple, you don't really need a parser.
Matching parentheses can easily be achieved using a stack.
You could do something like the following:
condition = "a and (b or (c and d))"
stack = []
for c in condition:
    if c != ')':
        stack.append(c)
    else:
        d = c
        contents = []
        while d != '(':
            contents.insert(0, d)
            d = stack.pop()
        contents.insert(0, d)
        s = ''.join(contents)
        print(s)
        stack.append(s)
produces:
(c and d)
(b or (c and d))
Build a parser:
Condition ::= Term Condition'
Condition' ::= epsilon | OR Term Condition'
Term ::= Factor Term'
Term' ::= epsilon | AND Factor Term'
Factor ::= [ NOT ] Primary
Primary ::= Literal | '(' Condition ')'
Literal ::= Id
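A minimal recursive-descent sketch of this grammar in plain Python might look like the following. It assumes the keywords are the lowercase words and, or and not, that identifiers are runs of word characters, and that the tree is returned as nested tuples; the class and function names are made up for the example. The tail rules Condition' and Term' are written as loops, which accept the same language.

```python
import re

# A token is a parenthesis or a run of word characters (identifier or keyword).
TOKEN = re.compile(r"\s*(\(|\)|\w+)")


def tokenize(text):
    text = text.rstrip()
    tokens, pos = [], 0
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            raise SyntaxError(f"unexpected character at position {pos}")
        tokens.append(m.group(1))
        pos = m.end()
    return tokens


class Parser:
    def __init__(self, text):
        self.tokens = tokenize(text)
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def take(self):
        tok = self.peek()
        self.pos += 1
        return tok

    def condition(self):
        # Condition ::= Term Condition'   (Condition' handled by the loop)
        node = self.term()
        while self.peek() == "or":
            self.take()
            node = ("or", node, self.term())
        return node

    def term(self):
        # Term ::= Factor Term'   (Term' handled by the loop)
        node = self.factor()
        while self.peek() == "and":
            self.take()
            node = ("and", node, self.factor())
        return node

    def factor(self):
        # Factor ::= [ NOT ] Primary
        if self.peek() == "not":
            self.take()
            return ("not", self.primary())
        return self.primary()

    def primary(self):
        # Primary ::= Literal | '(' Condition ')'
        tok = self.take()
        if tok == "(":
            node = self.condition()
            if self.take() != ")":
                raise SyntaxError("expected ')'")
            return node
        if tok is None or tok in ("and", "or", "not", ")"):
            raise SyntaxError(f"unexpected token {tok!r}")
        return tok


def parse(text):
    parser = Parser(text)
    tree = parser.condition()
    if parser.peek() is not None:
        raise SyntaxError(f"unexpected token {parser.peek()!r}")
    return tree


print(parse("a and (b or (c and d))"))
# ('and', 'a', ('or', 'b', ('and', 'c', 'd')))
```

Each parenthesized sub-expression ends up as its own subtree, so the nested conditions can be reached by walking the tuples.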