Python nicest way to take boolean expression string apart - python

I have a boolean expression string, that I would like to take apart:
condition = "a and (b or (c and d))"
Or let's say:
I want to be able to access the string contents between two parenthesis.
I want following outcome:
"(b or (c and d))"
"(c and d)"
I've tried the following with regular expressions (not really working)
x = re.match(".*(\(.*\))", condition)
print x.group(1)
Question:
What is the nicest way to take a boolean expression string apart?

This is the sort of thing you can't do with a simple regex. You need to actually parse the text. pyparsing is apparently excellent for doing that.

Like everyone said, you need a parser.
If you don't want to install one, you can start from this simple top-down parser (take the last code sample here)
Remove everything not related to your need (+, -, *, /, is, lambda, if, else, ...). Just keep parenthesis, and, or.
You will get a binary tree structure generated from your expression.
The tokenizer use the build-in tokenize (import tokenize), which is a lexical scanner for Python source code but works just fine for simple cases like yours.

If your requirements are fairly simple, you don't really need a parser.
Matching parentheses can easily be achieved using a stack.
You could do something like the following:
condition = "a and (b or (c and d))"
stack = []
for c in condition:
if c != ')':
stack.append(c)
else:
d = c
contents = []
while d != '(':
contents.insert(0, d)
d = stack.pop()
contents.insert(0, d)
s = ''.join(contents)
print(s)
stack.append(s)
produces:
(c and d)
(b or (c and d))

Build a parser:
Condition ::= Term Condition'
Condition' ::= epsilon | OR Term Condition'
Term ::= Factor Term'
Term' ::= epsilon | AND Factor Term'
Factor ::= [ NOT ] Primary
Primary ::= Literal | '(' Condition ')'
Literal ::= Id

Related

Parsing a custom configuration format in Python

I'm writing a profile manager for Stellaris game and I've hit a wall with their format in which they keep the info about mods and settings.
Mod file:
name="! (Ship Designer UI Fix) !"
path="mod/ship_designer_ui_fix"
tags={
"Fixes"
}
remote_file_id="879973318"
supported_version="1.6"
Settings:
language="l_english"
graphics={
size={
x=1920
y=1200
}
min_gui={
x=1920
y=1200
}
gui_scale=1.000000
gui_safe_ratio=1.000000
refreshRate=59
fullScreen=no
borderless=no
display_index=0
shadowSize=2048
multi_sampling=8
maxanisotropy=16
gamma=50.000000
vsync=yes
}
last_mods={
"mod/ship_designer_ui_fix.mod"
"mod/ugc_720237457.mod"
"mod/ugc_775944333.mod"
}
I've thought pyparsing will be of help there (and it probably will be) but it has been a long time since I've actually did something like this and this I'm clueless atm.
I've got to extract the simple key=value but I'm struggling to actually move from there to be able to extract the arrays, not to mention the multilevel arrays.
lbrack = Literal("{").suppress()
rbrack = Literal("}").suppress()
equals = Literal("=").suppress()
nonequals = "".join([c for c in printables if c != "="]) + " \t"
keydef = ~lbrack + Word(nonequals) + equals + restOfLine
conf = Dict( ZeroOrMore( Group(keydef) ) )
tokens = conf.parseString(data)
I haven't got very far as you can see. Can anyone point me towards next step? I'm not asking a finished and working solution for the whole thing - it would move me forward a lot but where's the fun in that :)
Well, it is awfully tempting to just dive in and write this parser, but you want some of that fun for yourself, that's great.
Before writing any code, write a BNF. That way you'll write a decent and robust parser, instead of just "everything that's not an equals sign must be an identifier".
There are a lot of "something = something" bits here, look at the kinds of things on the right- and left-hand sides of the '='. The left-hand sides all look like pretty well-mannered identifiers: alphas, underscores. I could envision numeric digits too, as long as they aren't the leading character. So let's say the left-hand sides will be identifiers:
identifier_leading = 'A'..'Z' 'a'..'z' '_'
identifier_body = identifier_leading '0'..'9'
identifier ::= identifier_leading + identifier_body*
The right-hand sides are a mix of things:
integers
floats
'yes' or 'no' booleans
quoted strings
something in braces
The "something in braces" are either a list of quoted strings, or a list of 'identifer = value' pairs. I'll skip the awful details of defining floats and integers and quoted strings, let's just assume we have those defined:
boolean_value ::= 'yes' | 'no'
value ::= float | integer | boolean_value | quoted_string | string_list_in_braces | key_value_list_in_braces
string_list_in_braces ::= '{' quoted_string * '}'
key_value ::= identifier '=' value
key_value_list_in_braces ::= '{' key_value* '}'
You will have to use a pyparsing Forward to declare value before it is fully defined, since it is used in key_value, but key_value is used in key_value_list_in_braces, which is used to define value - a recursive grammar. You are already familiar with the Dict(OneOrMore(Group(named_item))) pattern, and this should be good to give you a structure of fields that are accessible by name. For identifier, a Word would work, or you could just use the pre-defined pyparsing_common.identifier which was introduced as part of the pyparsing_common namespace class last year.
The translation from BNF to pyparsing should be pretty much 1-to-1 from here. For that matter, from the BNF, you could use PLY, ANTLR, or another parsing lib too. The BNF is really worth taking the 1/2 hour or 1/2 day to get sorted out.

Pyparsing using Forward to parse recursive expressions

I am trying to recursively parse an expression. I followed a few tutorials, and it seems like Forward() is the class that I need. However, something seemingly simple is causing me trouble.
Here is the code I wrote
from pyparsing import *
exp = Forward()
integer = Word(nums)
exp << (integer | (exp + '+' + exp))
input = "1+1"
print exp.parseString(input)
I want it to return ['1','+','1'] but it only returns ['1']
Help is much appreciated.
There are several issues you have here. In ascending order of importance:
parseString will not raise an exception if there is extra text after the parsed content. Use exp.parseString(input, parseAll=True)
'|' is MatchFirst, not MatchLongest. Since your integer is first, it will be matched first. Then the parser fails on the '+' all by itself. If you want match longest, use '^' operator.
The Biggie: once you convert to '^' (or reorder the expressions to put exp + exp first, ahead of integer), you will find yourself blowing up the maximum recursion depth. That is because this parser has left-recursive definition of exp. That is, to parse an exp, it has to parse an exp, for which it has to parse an exp, etc. In general, many published BNFs use recursion to describe this kind of repetitive structure, but pyparsing does not do the necessary lookahead/backtrack for that to work. Try exp <<= integer + ZeroOrMore('+' + integer) | '(' + exp + ')' - this expression is not left-recursive, since you have to get past an opening parenthesis before parsing a nested exp.
EDIT:
Sorry, I was a little too quick on my earlier suggestion, here is the proper way to do your recursive expression parsing:
from pyparsing import *
exp = Forward()
LPAR, RPAR = map(Suppress, "()")
integer = Word(nums)
term = integer | Group(LPAR + exp + RPAR)
exp << term + ZeroOrMore('+' + term)
input = "(1+1) + 1"
print(exp.parseString(input))
prints
[['1', '+', '1'], '+', '1']
If you trace through the code, you'll see the recursion: exp is defined using term, and term is defined using exp. The fourFn.py example is closest to this style; since writing that, I've added the infixNotation method to pyparsing, which would allow you to write:
exp = infixNotation(integer, [
('+', 2, opAssoc.LEFT),
])
infixNotation takes care of the recursive definitions internally, implicitly defines the '(' + exp + ')' expression, and makes it easy to implement a system of operators with precedence of operations.
I recently came to this problem. Now the left recursion is supported on PyParsing 3.0.0b2 or later. But the feature should be explicitly enabled, and the order of operands of the | operator should be adjusted like this.
from pyparsing import ParserElement, Forward, Word, nums
ParserElement.enable_left_recursion()
exp = Forward()
integer = Word(nums)
exp << ((exp + '+' + exp) | integer)
input = "1+1"
print(exp.parseString(input))
This will output the following.
['1', '+', '1']

How to make the first part of two syntactical equivalent parts optional when analysing with pyparsing

With the parsing library pyparsing I want to analyse constructs like this:
123 456
-^- -^-
[A] B
where both parts A and B only contain numbers and part A is optional. Here some examples how a parser for this would break strings down in their parts:
123 456 ==> A="123", B="456"
456 ==> A="", B="456"
123 ==> A="", B="123"
1 123 ==> A="1", B="123"
The native approach to write a parser looks like this:
a = pp.Optional(pp.Word(pp.nums)).setName("PART_A")
b = pp.Word(pp.nums).setName("PART_B")
expr = a('A') + b('B')
This parser works for "123 456" returning as expected {'A': '123', 'B': '456'}. However it fails on "456" with:
ParseException:
Expected PART_B (at char 3), (line:1, col:4)
"456>!<"
This is understandable because the optional part A already consumes the text which should match part B even though A was optional... My idea was to set a stopOn= option, but it needs to stop on an expression of the same type as the expression it wants to match...
Update: My 2nd idea was to re-write the Optional construct into a Or construct:
a = pp.Word(pp.nums).setName("PART_A")('A')
b = pp.Word(pp.nums).setName("PART_B")('B')
just_b = b
a_and_b = a + b
expr = pp.Or(just_b, a_and_b)
However, this now fails for texts of the form "123 456" - despite the fact that a_and_b is a alternative in the Or class...
Any suggestion what to do?
You are misconstructing the Or, it should be:
expr = pp.Or([just_b, a_and_b])
The way you are constructing it, the Or is being built with just just_b, with a_and_b being passed as the boolean argument savelist.
Please consider using the operator overloads to construct And, Or, MatchFirst, and Each expressions.
integer = pp.Word(pp.nums)
a = integer("A")
b = integer("B")
expr = a + b | b
The explicit style looks just so, well, Java-ish.
To answer the question in your title, you pretty much have already solved this: be sure to try matching the full a_and_b expression, either by placing it first in a MatchFirst (as my sample code does), or by using an Or expression (using the '^' operator, or by constructing an Or using a list of the just_b and a_and_b expressions).

Python tuple assignment and checking in conditional statements [duplicate]

This question already has answers here:
When are parentheses required around a tuple?
(3 answers)
Closed 8 years ago.
So I stumbled into a particular behaviour of tuples in python that I was wondering if there is a particular reason for it happening.
While we are perfectly capable of assigning a tuple to a variable without
explicitely enclosing it in parentheses:
>>> foo_bar_tuple = "foo","bar"
>>>
we are not able to print or check in a conditional if statement the variable containing
the tuple in the previous fashion (without explicitely typing the parentheses):
>>> print foo_bar_tuple == "foo","bar"
False bar
>>> if foo_bar_tuple == "foo","bar": pass
SyntaxError: invalid syntax
>>>
>>> print foo_bar_tuple == ("foo","bar")
True
>>>
>>> if foo_bar_tuple == ("foo","bar"): pass
>>>
Does anyone why?
Thanks in advance and although I didn't find any similar topic please inform me if you think it is a possible dublicate.
Cheers,
Alex
It's because the expressions separated by commas are evaluated before the whole comma-separated tuple (which is an "expression list" in the terminology of the Python grammar). So when you do foo_bar_tuple=="foo", "bar", that is interpreted as (foo_bar_tuple=="foo"), "bar". This behavior is described in the documentation.
You can see this if you just write such an expression by itself:
>>> 1, 2 == 1, 2 # interpreted as "1, (2==1), 2"
(1, False, 2)
The SyntaxError for the unparenthesized tuple is because an unparenthesized tuple is not an "atom" in the Python grammar, which means it's not valid as the sole content of an if condition. (You can verify this for yourself by tracing around the grammar.)
Considering an example of if 1 == 1,2: which should cause SyntaxError, following the full grammar:
if 1 == 1,2:
Using the if_stmt: 'if' test ':' suite ('elif' test ':' suite)* ['else' ':' suite], we get to shift the if keyword and start parsing 1 == 1,2:
For the test rule, only first production matches:
test: or_test ['if' or_test 'else' test] | lambdef
Then we get:
or_test: and_test ('or' and_test)*
And step down into and_test:
and_test: not_test ('and' not_test)*
Here we just step into not_test at the moment:
not_test: 'not' not_test | comparison
Notice, our input is 1 == 1,2:, thus the first production doesn't match and we check the other one: (1)
comparison: expr (comp_op expr)*
Continuing on stepping down (we take the only the first non-terminal as the zero-or-more star requires a terminal we don't have at all in our input):
expr: xor_expr ('|' xor_expr)*
xor_expr: and_expr ('^' and_expr)*
and_expr: shift_expr ('&' shift_expr)*
shift_expr: arith_expr (('<<'|'>>') arith_expr)*
arith_expr: term (('+'|'-') term)*
term: factor (('*'|'/'|'%'|'//') factor)*
factor: ('+'|'-'|'~') factor | power
Now we use the power production:
power: atom trailer* ['**' factor]
atom: ('(' [yield_expr|testlist_comp] ')' |
'[' [testlist_comp] ']' |
'{' [dictorsetmaker] '}' |
NAME | NUMBER | STRING+ | '...' | 'None' | 'True' | 'False')
And shift NUMBER (1 in our input) and reduce. Now we are back at (1) with input ==1,2: to parse. == matches comp_op:
comp_op: '<'|'>'|'=='|'>='|'<='|'<>'|'!='|'in'|'not' 'in'|'is'|'is' 'not'
So we shift it and reduce, leaving us with input 1,2: (current parsing output is NUMBER comp_op, we need to match expr now). We repeat the process for the left-hand side, going straight to the atom nonterminal and selecting the NUMBER production. Shift and reduce.
Since , does not match any comp_op we reduce the test non-terminal and receive 'if' NUMBER comp_op NUMBER. We need to match else, elif or : now, but we have , so we fail with SyntaxError.
I think the operator precedence table summarizes this nicely:
You'll see that comparisons come before expressions, which are actually dead last.
in, not in, is, is not, Comparisons, including membership tests
<, <=, >, >=, <>, !=, == and identity tests
...
(expressions...), [expressions...], Binding or tuple display, list display,
{key: value...}, `expressions...` dictionary display, string conversion

Regular expression curvy brackets in Python

I have a string like this:
a = '{CGPoint={CGPoint=d{CGPoint=dd}}}{CGSize=dd}dd{CSize=aa}'
Currently I am using this re statement to get desired result:
filter(None, re.split("\\{(.*?)\\}", a))
But this gives me:
['CGPoint={CGPoint=d{CGPoint=dd', '}}', 'CGSize=dd', 'dd', 'CSize=aa']
which is incorrect for my current situation, I need a list like this:
['CGPoint={CGPoint=d{CGPoint=dd}}', 'CGSize=dd', 'dd', 'CSize=aa']
As #m.buettner points out in the comments, Python's implementation of regular expressions can't match pairs of symbols nested to an arbitrary degree. (Other languages can, notably current versions of Perl.) The Pythonic thing to do when you have text that regexs can't parse is to use a recursive-descent parser.
There's no need to reinvent the wheel by writing your own, however; there are a number of easy-to-use parsing libraries out there. I recommend pyparsing which lets you define a grammar directly in your code and easily attach actions to matched tokens. Your code would look something like this:
import pyparsing
lbrace = Literal('{')
rbrace = Literal('}')
contents = Word(printables)
expr = Forward()
expr << Combine(Suppress(lbrace) + contents + Suppress(rbrace) + expr)
for line in lines:
results = expr.parseString(line)
There's an alternative regex module for Python I really like that supports recursive patterns:
https://pypi.python.org/pypi/regex
pip install regex
Then you can use a recursive pattern in your regex as demonstrated in this script:
import regex
from pprint import pprint
thestr = '{CGPoint={CGPoint=d{CGPoint=dd}}}{CGSize=dd}dd{CSize=aa}'
theregex = r'''
(
{
(?<match>
[^{}]*
(?:
(?1)
[^{}]*
)+
|
[^{}]+
)
}
|
(?<match>
[^{}]+
)
)
'''
matches = regex.findall(theregex, thestr, regex.X)
print 'all matches:\n'
pprint(matches)
print '\ndesired matches:\n'
print [match[1] for match in matches]
This outputs:
all matches:
[('{CGPoint={CGPoint=d{CGPoint=dd}}}', 'CGPoint={CGPoint=d{CGPoint=dd}}'),
('{CGSize=dd}', 'CGSize=dd'),
('dd', 'dd'),
('{CSize=aa}', 'CSize=aa')]
desired matches:
['CGPoint={CGPoint=d{CGPoint=dd}}', 'CGSize=dd', 'dd', 'CSize=aa']
pyparsing has a nestedExpr function for matching nested expressions:
import pyparsing as pp
ident = pp.Word(pp.alphanums)
expr = pp.nestedExpr("{", "}") | ident
thestr = '{CGPoint={CGPoint=d{CGPoint=dd}}}{CGSize=dd}dd{CSize=aa}'
for result in expr.searchString(thestr):
print(result)
yields
[['CGPoint=', ['CGPoint=d', ['CGPoint=dd']]]]
[['CGSize=dd']]
['dd']
[['CSize=aa']]
Here is some pseudo code. It creates a stack of strings and pops them when a close brace is encountered. Some extra logic to handle the fact that the first braces encountered are not included in the array.
String source = "{CGPoint={CGPoint=d{CGPoint=dd}}}{CGSize=dd}dd{CSize=aa}";
Array results;
Stack stack;
foreach (match in source.match("[{}]|[^{}]+")) {
switch (match) {
case '{':
if (stack.size == 0) stack.push(new String()); // add new empty string
else stack.push('{'); // child, so include matched brace.
case '}':
if (stack.size == 1) results.add(stack.pop()) // clear stack add to array
else stack.last += stack.pop() + '}"; // pop from stack and concatenate to previous
default:
if (stack.size == 0) results.add(match); // loose text, add to results
else stack.last += match; // append to latest member.
}
}

Categories