building a pyparsing.Dict from a string of multiple tokens - part II - python

I've made some progress thanks to feedback from this forum (thanks, forum!).
The pyparsing.Dict object dict is getting populated, but it silently stops when it hits decimal numbers.
Given:
import pyparsing as pp
lines = '''\
(rate multiple)
(region "mountainous")
(elev 21439)
(alteleva +21439)
(altelevb -21439)
(coorda 23899.747)
(coordb +23899.747)
(coordc -23899.747)
(coordd 853.324e21)
(coorde +853.324e21)
(coordf -853.324e21)
(coordg 987.88e+09)
(coordh +987.88e+09)
(coordi -987.88e+09)
(coordj 122.45e-04)
(coordk +122.45e-04)
(coordl -122.45e-04)
'''
leftParen = pp.Literal('(')
rightParen = pp.Literal(')')
colon = pp.Literal(':')
decimalpoint = pp.Literal('.')
doublequote = pp.Literal('"')
plusorminus = pp.Literal('+') | pp.Literal('-')
exp = pp.CaselessLiteral('E')
v_string = pp.Word(pp.alphanums)
v_quoted_string = pp.Combine( doublequote + v_string + doublequote)
v_number = pp.Regex(r'[+-]?(?P<float1>\d+)(?P<float2>\.\d+)?(?P<float3>[Ee][+-]?\d+)?')
keyy = v_string
valu = v_string | v_quoted_string | v_number
item = pp.Group( pp.Literal('(').suppress() + keyy + valu + pp.Literal(')').suppress() )
items = pp.ZeroOrMore( item)
dict = pp.Dict( items)
print "dict yields: ", dict.parseString( lines).dump()
yields
- alteleva: '+21439',
- altelevb: '-21439',
- elev: '21439',
- rate: 'multiple',
- region: '"mountainous"'
Changing the order of the tokens around proves that the script silently fails when it hits the first decimal number, which implies there's something subtly wrong with the pp.Regex statement, but I sure can't spot it.
TIA,
code_warrior

Your problem actually lies in this expression:
valu = v_string | v_quoted_string | v_number
Because v_string is defined as the very broadly-matching expression:
v_string = pp.Word(pp.alphanums)
and because it is the first expression in valu, it will mask v_numbers that start with a digit. This is because the '|' operator produces pp.MatchFirst objects, so the first expression to match (reading left to right) determines which alternative is used. You can switch to the '^' operator, which produces pp.Or objects - the Or class evaluates all the alternatives and goes with the longest match. However, note that using Or carries a performance penalty, since many more expressions are tested for a match even when there is no chance of confusion. In your case, you can just reorder the expressions to put the least specific matching expression last:
valu = v_quoted_string | v_number | v_string
Now values are parsed by first attempting the quoted-string expression, then the number expression, and only if neither of those specific types matches, the very general v_string.
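For illustration, here is a small standalone sketch (not part of the original parser, and using a simplified version of your v_number regex) showing the difference between the two operators and the effect of reordering:

import pyparsing as pp

v_string = pp.Word(pp.alphanums)
v_number = pp.Regex(r'[+-]?\d+(\.\d+)?([Ee][+-]?\d+)?')

# MatchFirst ('|'): v_string wins and stops at the '.', leaving '.747' unparsed
print((v_string | v_number).parseString("23899.747"))   # ['23899']

# putting the least specific expression last lets the number match in full
print((v_number | v_string).parseString("23899.747"))   # ['23899.747']

# Or ('^') also works, at the cost of evaluating every alternative
print((v_string ^ v_number).parseString("23899.747"))   # ['23899.747']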
A few other comments:
I personally prefer to parse quoted strings and only get the content within the quotes (it's a string, I know that already!). In older versions of pyparsing there used to be some confusion when dumping out parsed results, because parsed strings were displayed without any enclosing quotes. Now that I use repr() to show the parsed values, strings show up in quotes when calling dump(), but the value itself does not include the quotes. When the value is used elsewhere in the program, such as saving to a database or CSV, I don't need the quotes, I just want the string content. The QuotedString class takes care of this for me by default. Alternatively, use pp.quotedString().addParseAction(pp.removeQuotes).
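A quick sketch of both approaches (my own example value, not from the original post):

import pyparsing as pp

qs1 = pp.QuotedString('"')                               # strips the quotes by default
qs2 = pp.quotedString().addParseAction(pp.removeQuotes)  # same effect via a parse action

print(qs1.parseString('"mountainous"'))   # ['mountainous']
print(qs2.parseString('"mountainous"'))   # ['mountainous']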
A recent pyparsing release introduced the pyparsing_common namespace class, containing a number of helpful pre-defined expressions. There are several for parsing different numeric types (integer, signed integer, real, etc.), and a couple of blanket expressions: number will parse any numeric type, and produce values of the respective type (real will give a float, integer will give an int, etc.); fnumber will parse various numerics, but return them all as floats. I've replaced your v_number expression with just pp.pyparsing_common.number(), which also permits me to remove several other partial expressions that were defined just for building up the v_number expression, like decimalpoint, plusorminus and exp. You can see more about the expressions in pyparsing_common at the online docs: https://pythonhosted.org/pyparsing/
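For example, assuming a pyparsing version that ships pyparsing_common, the single number expression handles all of the numeric forms in your input and converts each to the matching Python type:

import pyparsing as pp

num = pp.pyparsing_common.number()
for s in ["21439", "-21439", "23899.747", "-853.324e21", "122.45e-04"]:
    print(repr(num.parseString(s)[0]))
# 21439 and -21439 come back as ints, the rest as floats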
Pyparsing's default behavior when processing literal strings in an expression like "(" + pp.Word(pp.alphas) + valu + ")" is to automatically convert the literal "(" and ")" terms to pp.Literal objects. This prevents accidentally losing parsed data, but in the case of punctuation, you end up with many cluttering and unhelpful extra strings in the parsed results. In your parser, you can replace pyparsing's default by calling pp.ParserElement.inlineLiteralsUsing and passing the pp.Suppress class:
pp.ParserElement.inlineLiteralsUsing(pp.Suppress)
Now you can write an expression like:
item = pp.Group('(' + keyy + valu + ')')
and the grouping parentheses will be suppressed from the parsed results.
Making these changes, your parser now simplifies to:
import pyparsing as pp
# override pyparsing default to suppress literal strings in expressions
pp.ParserElement.inlineLiteralsUsing(pp.Suppress)
v_string = pp.Word(pp.alphanums)
v_quoted_string = pp.QuotedString('"')
v_number = pp.pyparsing_common.number()
keyy = v_string
# define valu using least specific expressions last
valu = v_quoted_string | v_number | v_string
item = pp.Group('(' + keyy + valu + ')')
items = pp.ZeroOrMore( item)
dict_expr = pp.Dict( items)
print ("dict yields: ", dict_expr.parseString( lines).dump())
And for your test input, gives:
dict yields: [['rate', 'multiple'], ['region', 'mountainous'], ['elev', 21439],
['alteleva', 21439], ['altelevb', -21439], ['coorda', 23899.747], ['coordb',
23899.747], ['coordc', -23899.747], ['coordd', 8.53324e+23], ['coorde',
8.53324e+23], ['coordf', -8.53324e+23], ['coordg', 987880000000.0], ['coordh',
987880000000.0], ['coordi', -987880000000.0], ['coordj', 0.012245], ['coordk',
0.012245], ['coordl', -0.012245]]
- alteleva: 21439
- altelevb: -21439
- coorda: 23899.747
- coordb: 23899.747
- coordc: -23899.747
- coordd: 8.53324e+23
- coorde: 8.53324e+23
- coordf: -8.53324e+23
- coordg: 987880000000.0
- coordh: 987880000000.0
- coordi: -987880000000.0
- coordj: 0.012245
- coordk: 0.012245
- coordl: -0.012245
- elev: 21439
- rate: 'multiple'
- region: 'mountainous'

Related

Pyparsing using Forward to parse recursive expressions

I am trying to recursively parse an expression. I followed a few tutorials, and it seems like Forward() is the class that I need. However, something seemingly simple is causing me trouble.
Here is the code I wrote
from pyparsing import *
exp = Forward()
integer = Word(nums)
exp << (integer | (exp + '+' + exp))
input = "1+1"
print exp.parseString(input)
I want it to return ['1','+','1'] but it only returns ['1']
Help is much appreciated.
There are several issues you have here. In ascending order of importance:
parseString will not raise an exception if there is extra text after the parsed content. Use exp.parseString(input, parseAll=True)
'|' is MatchFirst, not MatchLongest. Since your integer is first, it will be matched first, and then the parser fails on the '+' all by itself. If you want match-longest behavior, use the '^' operator.
The Biggie: once you convert to '^' (or reorder the expressions to put exp + exp first, ahead of integer), you will find yourself blowing up the maximum recursion depth. That is because this parser has left-recursive definition of exp. That is, to parse an exp, it has to parse an exp, for which it has to parse an exp, etc. In general, many published BNFs use recursion to describe this kind of repetitive structure, but pyparsing does not do the necessary lookahead/backtrack for that to work. Try exp <<= integer + ZeroOrMore('+' + integer) | '(' + exp + ')' - this expression is not left-recursive, since you have to get past an opening parenthesis before parsing a nested exp.
EDIT:
Sorry, I was a little too quick on my earlier suggestion, here is the proper way to do your recursive expression parsing:
from pyparsing import *
exp = Forward()
LPAR, RPAR = map(Suppress, "()")
integer = Word(nums)
term = integer | Group(LPAR + exp + RPAR)
exp << term + ZeroOrMore('+' + term)
input = "(1+1) + 1"
print(exp.parseString(input))
prints
[['1', '+', '1'], '+', '1']
If you trace through the code, you'll see the recursion: exp is defined using term, and term is defined using exp. The fourFn.py example is closest to this style; since writing that, I've added the infixNotation method to pyparsing, which would allow you to write:
exp = infixNotation(integer, [
    ('+', 2, opAssoc.LEFT),
])
infixNotation takes care of the recursive definitions internally, implicitly defines the '(' + exp + ')' expression, and makes it easy to implement a system of operators with precedence of operations.
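Here is a minimal runnable sketch of the infixNotation version (my own assembly, assuming the same integer operand as above):

from pyparsing import Word, nums, infixNotation, opAssoc

integer = Word(nums)
exp = infixNotation(integer, [
    ('+', 2, opAssoc.LEFT),
])

# each binary operation is grouped, e.g. [[['1', '+', '1'], '+', '1']]
print(exp.parseString("(1+1) + 1"))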
I recently came across this problem. Left recursion is now supported in PyParsing 3.0.0b2 and later, but the feature must be explicitly enabled, and the order of the operands of the | operator must be adjusted, like this:
from pyparsing import ParserElement, Forward, Word, nums
ParserElement.enable_left_recursion()
exp = Forward()
integer = Word(nums)
exp << ((exp + '+' + exp) | integer)
input = "1+1"
print(exp.parseString(input))
This will output the following.
['1', '+', '1']

Python Pyparsing: Capture comma-separated list inside parentheses ignoring inner parentheses

I have a question on how to properly parse the string like the following,
"(test.function, arr(3,12), "combine,into one")"
into the following list,
['test.function', 'arr(3,12)', '"combine,into one"']
Note: the 'list' items from the original string are not necessarily split by a comma and a space, it can also be two items split directly by a comma one after another, e.g. test.function,arr(3,12).
Basically, I want to:
Parse the input string which is contained in parentheses, but not the inner parentheses. (Hence, nestedExpr() can't be used as-is)
The items inside are separated by commas, but the items themselves may contain commas.
Moreover, I can only use scanString() and not parseString().
I've done some search in SO and found this and this, but I can't translate them to fit into my problem.
Thanks!
This should address your nesting and quoting issues:
sample = """(test.function, arr(3,12),"combine,into one")"""
from pyparsing import (Suppress, removeQuotes, quotedString, originalTextFor,
                       OneOrMore, Word, printables, nestedExpr, delimitedList)
# punctuation and basic elements
LPAR,RPAR = map(Suppress, "()")
quotedString.addParseAction(removeQuotes)
# what are the possible values inside the ()'s?
# - quoted string - anything is allowed inside quotes, match these first
# - any printable, not containing ',', '(', or ')', with optional nested ()'s
# (use originalTextFor helper to extract the original text from the input
# string)
value = (quotedString
         | originalTextFor(OneOrMore(Word(printables, excludeChars="(),")
                                     | nestedExpr())))
# define an overall expression, with surrounding ()'s
expr = LPAR + delimitedList(value) + RPAR
# test against the sample
print(expr.parseString(sample).asList())
prints:
['test.function', 'arr(3,12)', 'combine,into one']
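Since the question mentions being limited to scanString(), note that the same expr works there too. Continuing from the snippet above (the surrounding text is my own made-up sample):

# scanString() yields (tokens, start, end) for every match found in a larger text
text = 'foo (test.function, arr(3,12),"combine,into one") bar (x, y(1,2))'
for tokens, start, end in expr.scanString(text):
    print(tokens.asList())
# ['test.function', 'arr(3,12)', 'combine,into one']
# ['x', 'y(1,2)']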

Pyparsing: Grammar cannot parse whitespace while parsing arithmetic expression with exponential and logarithm functions

Intention:
To parse arithmetic expressions with support for logarithms and exponentials. Any of the following expressions are valid:
x + y
exp(x) + y
exp(x+y)
exp(log(x)+exp(z))+exp(y)
x+log(exp(y))
x + 2
Source Code:
import pyparsing as pp
arith_expr = pp.Forward()
op = pp.oneOf("^ / * % + -")
exp_funcs = pp.Regex(r"(log|exp)(2|10|e)?")
operand = pp.Word(pp.alphas, pp.alphanums + "_") | pp.Regex(r"[+-]?\d+(:?\.\d*)?(:?[eE][+-]?\d+)?")
func_atom = operand ^ (pp.Optional(exp_funcs) + "(" + arith_expr + ")")
comp_expr = pp.Combine(func_atom + pp.ZeroOrMore(op + func_atom))
arith_expr << comp_expr
print arith_expr.parseString("exp(datasize+ 2) +3")
Observation
The grammar is able to parse all such arithmetic expressions, but it sadly fails when whitespace appears around an operand or an operator. The grammar is unable to parse the following expressions:
exp(x+ 2)
exp( x + 2 )
x + 2
I have tried debugging the grammar with setDebug() and found the following error on every such failure:
Expected ")"
I reckon pyparsing is whitespace-insensitive, after going through its documentation and from my day-to-day use. I have tried debugging with every possible change I could think of, but none of them solved the issue. I appreciate your valuable suggestions. :)
The problem is with your use of Combine, which squashes all the tokens together into a single token. Whitespace between tokens is ignored in pyparsing, but whitespace within a token is not.
To fix this, get rid of Combine and then pass the result to ''.join to get it back into one string.
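A minimal sketch of that fix applied to the grammar from the question (same names, Combine removed, result re-joined afterwards):

import pyparsing as pp

arith_expr = pp.Forward()
op = pp.oneOf("^ / * % + -")
exp_funcs = pp.Regex(r"(log|exp)(2|10|e)?")
operand = pp.Word(pp.alphas, pp.alphanums + "_") | pp.Regex(r"[+-]?\d+(:?\.\d*)?(:?[eE][+-]?\d+)?")
func_atom = operand ^ (pp.Optional(exp_funcs) + "(" + arith_expr + ")")
# no Combine here, so whitespace between tokens is skipped as usual
comp_expr = func_atom + pp.ZeroOrMore(op + func_atom)
arith_expr << comp_expr

result = arith_expr.parseString("exp(datasize+ 2) +3")
print(''.join(result))   # exp(datasize+2)+3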

PyParsing lookaheads and greedy expressions

I'm writing a parser for a query language using PyParsing, and I've gotten stuck on (what I believe to be) an issue with lookaheads. One clause type in the query is intended to split strings into 3 parts (fieldname, operator, value) such that fieldname is one word, operator is one or more words, and value is a word, a quoted string, or a parenthesized list of these.
My data look like:
author is william
author is 'william shakespeare'
author is not shakespeare
author is in (william,'the bard',shakespeare)
And my current parser for this clause is written as:
fieldname = Word(alphas)
operator = OneOrMore(Word(alphas))
single_value = Word(alphas) ^ QuotedString(quoteChar="'")
list_value = Literal("(") + Group(delimitedList(single_value)) + Literal(")")
value = single_value ^ list_value
clause = fieldname + originalTextFor(operator) + value
Obviously this fails due to the fact that the operator element is greedy and will gobble up the value if it can. From reading other similar questions and the docs, I've gathered that I need to manage that lookahead with a NotAny or FollowedBy, but I haven't been able to figure out how to make that work.
This is a good place to Be The Parser. Or more accurately, Make The Parser Think Like You Do. Ask yourself, "In 'author is shakespeare', how do I know that 'shakespeare' is not part of the operator?" You know that 'shakespeare' is the value because it is at the end of the query, there is nothing more after it. So operator words aren't just words of alphas, they are words of alphas that are not followed by the end of the string. Now build that lookahead logic into your definition of operator:
operator = OneOrMore(Word(alphas) + ~FollowedBy(StringEnd()))
And I think this will start parsing better for you.
Some other tips:
I only use the '^' operator if there is some possible ambiguity, like parsing a string of numbers that could be integers or hex. If I used Word(nums) | Word(hexnums), then I might misprocess "123ABC" as just the leading "123". By changing '|' to '^', all of the alternatives are tested and the longest match is chosen. In my example of parsing decimal or hex integers, I could have gotten the same result by reversing the alternatives and testing for Word(hexnums) first. In your query language, there is no way to confuse a quoted string with a non-quoted single-word value (one leads with ' or ", the other doesn't), so there is no reason to use '^'; '|' will suffice. The same goes for value = single_value ^ list_value.
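Here is a small standalone demo of that integer/hex example (my own illustration):

from pyparsing import Word, nums, hexnums

# MatchFirst stops at the first alternative that matches at all
print((Word(nums) | Word(hexnums)).parseString("123ABC"))    # ['123']

# Or tries every alternative and keeps the longest match
print((Word(nums) ^ Word(hexnums)).parseString("123ABC"))    # ['123ABC']

# reordering the MatchFirst gives the same result more cheaply
print((Word(hexnums) | Word(nums)).parseString("123ABC"))    # ['123ABC']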
Adding results names to the key components of your query string will make it easier to work with later:
clause = fieldname("fieldname") + originalTextFor(operator)("operator") + value("value")
Now you can access the parsed values by name instead of by parse position (which will get tricky and error-prone once you start getting more complicated with optional fields and such):
queryParts = clause.parseString('author is william')
print queryParts.fieldname
print queryParts.operator
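Putting the lookahead fix and the results names together, here is a minimal runnable sketch (my own assembly of the pieces above; I've suppressed the list parentheses, which the original kept as Literals):

from pyparsing import (Word, alphas, OneOrMore, QuotedString, Suppress, Group,
                       delimitedList, originalTextFor, FollowedBy, StringEnd)

fieldname = Word(alphas)
# an operator word is any word of alphas that is NOT the last word of the query
operator = OneOrMore(Word(alphas) + ~FollowedBy(StringEnd()))
single_value = Word(alphas) | QuotedString(quoteChar="'")
list_value = Suppress("(") + Group(delimitedList(single_value)) + Suppress(")")
value = single_value | list_value
clause = fieldname("fieldname") + originalTextFor(operator)("operator") + value("value")

for query in [
    "author is william",
    "author is 'william shakespeare'",
    "author is not shakespeare",
    "author is in (william,'the bard',shakespeare)",
]:
    print(clause.parseString(query).dump())
    print()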

Keyword Matching in Pyparsing: non-greedy slurping of tokens

Pythonistas:
Suppose you want to parse the following string using Pyparsing:
'ABC_123_SPEED_X 123'
where ABC_123 is an identifier, SPEED_X is a parameter, and 123 is a value. I thought of the following BNF using Pyparsing:
Identifier = Word( alphanums + '_' )
Parameter = Keyword('SPEED_X') | Keyword('SPEED_Y') | Keyword('SPEED_Z')
Value = # assume I already have an expression valid for any value
Entry = Identifier + Literal('_') + Parameter + Value
tokens = Entry.parseString('ABC_123_SPEED_X 123')
#Error: pyparsing.ParseException: Expected "_" (at char 16), (line:1, col:17)
If I remove the underscore from the middle (and adjust the Entry definition accordingly) it parses correctly.
How can I make this parser a bit lazier, so that it waits until it matches the Keyword (as opposed to slurping the entire string as an Identifier and waiting for the _, which does not exist)?
Thank you.
[Note: This is a complete rewrite of my question; I had not realized what the real problem was]
I based my answer off of this one, since what you're trying to do is get a non-greedy match. It seems like this is difficult to make happen in pyparsing, but not impossible with some cleverness and compromise. The following seems to work:
from pyparsing import *
Parameter = Literal('SPEED_X') | Literal('SPEED_Y') | Literal('SPEED_Z')
UndParam = Suppress('_') + Parameter
Identifier = SkipTo(UndParam)
Value = Word(nums)
Entry = Identifier + UndParam + Value
When we run this from the interactive interpreter, we can see the following:
>>> Entry.parseString('ABC_123_SPEED_X 123')
(['ABC_123', 'SPEED_X', '123'], {})
Note that this is a compromise; because I use SkipTo, the Identifier can be full of evil, disgusting characters, not just beautiful alphanums with the occasional underscore.
EDIT: Thanks to Paul McGuire, we can concoct a truly elegant solution by setting Identifier to the following:
Identifier = Combine(Word(alphanums) +
                     ZeroOrMore('_' + ~Parameter + Word(alphanums)))
Let's inspect how this works. First, ignore the outer Combine; we'll get to this later. Starting with Word(alphanums) we know we'll get the 'ABC' part of the reference string, 'ABC_123_SPEED_X 123'. It's important to note that we didn't allow the "word" to contain underscores in this case. We build that separately into the logic.
Next, we need to capture the '_123' part without also sucking in '_SPEED_X'. Let's also skip over ZeroOrMore at this point and return to it later. We start with the underscore as a Literal, but we can shortcut with just '_', which will get us the leading underscore, but not all of '_123'. Instinctively, we would place another Word(alphanums) to capture the rest, but that's exactly what will get us in trouble by consuming all of the remaining '_123_SPEED_X'. Instead, we say, "So long as what follows the underscore is not the Parameter, parse that as part of my Identifier." We state that in pyparsing terms as '_' + ~Parameter + Word(alphanums). Since we assume we can have an arbitrary number of underscore + WordButNotParameter repeats, we wrap that expression in a ZeroOrMore construct. (If you always expect at least one underscore + WordButNotParameter following the initial Word, you can use OneOrMore.)
Finally, we need to wrap the initial Word and the special underscore + Word repeats together so that it's understood they are contiguous, not separated by whitespace, so we wrap the whole expression up in a Combine construct. This way 'ABC _123_SPEED_X' will raise a parse error, but 'ABC_123_SPEED_X' will parse correctly.
Note also that I had to change Keyword to Literal because the ways of the former are far too subtle and quick to anger. I do not trust Keywords, nor could I get matching with them.
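Putting the edited Identifier together with the earlier pieces, here is a runnable sketch of the full parser as I read this answer:

from pyparsing import Combine, Word, ZeroOrMore, Literal, Suppress, alphanums, nums

Parameter = Literal('SPEED_X') | Literal('SPEED_Y') | Literal('SPEED_Z')
UndParam = Suppress('_') + Parameter
Identifier = Combine(Word(alphanums) +
                     ZeroOrMore('_' + ~Parameter + Word(alphanums)))
Value = Word(nums)
Entry = Identifier + UndParam + Value

print(Entry.parseString('ABC_123_SPEED_X 123').asList())
# ['ABC_123', 'SPEED_X', '123']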
If you are sure that the identifier never ends with an underscore, you can enforce it in the definition:
from pyparsing import *
my_string = 'ABC_123_SPEED_X 123'
Identifier = Combine(Word(alphanums) + Literal('_') + Word(alphanums))
Parameter = Literal('SPEED_X') | Literal('SPEED_Y') | Literal('SPEED_Z')
Value = Word(nums)
Entry = Identifier + Literal('_').suppress() + Parameter + Value
tokens = Entry.parseString(my_string)
print tokens # prints: ['ABC_123', 'SPEED_X', '123']
If it's not the case but if the identifier length is fixed you can define Identifier like this:
Identifier = Word( alphanums + '_' , exact=7)
You can also parse the identifier and parameter as one token, and split them in a parse action:
from pyparsing import *
import re
def split_ident_and_param(tokens):
    mo = re.match(r"^(.*?_.*?)_(.*?_.*?)$", tokens[0])
    return [mo.group(1), mo.group(2)]
ident_and_param = Word(alphanums + "_").setParseAction(split_ident_and_param)
value = Word(nums)
entry = ident_and_param + value
print entry.parseString("APC_123_SPEED_X 123")
The example above assumes that the identifiers and parameters always have the format XXX_YYY (containing one single underscore).
If this is not the case, you need to adjust the split_ident_and_param() method.
This answers a question that you probably have also asked yourself ("What's a real-world application for reduce?"):
>>> keys = ['CAT', 'DOG', 'HORSE', 'DEER', 'RHINOCEROS']
>>> p = reduce(lambda x, y: x | y, [Keyword(x) for x in keys])
>>> p
{{{{"CAT" | "DOG"} | "HORSE"} | "DEER"} | "RHINOCEROS"}
Edit:
This was a pretty good answer to the original question. I'll have to work on the new one.
Further edit:
I'm pretty sure you can't do what you're trying to do. The parser that pyparsing creates doesn't do lookahead. So if you tell it to match Word(alphanums + '_'), it's going to keep matching characters until it finds one that's not a letter, number, or underscore.
