Keyword Matching in Pyparsing: non-greedy slurping of tokens


Pythonistas:
Suppose you want to parse the following string using Pyparsing:
'ABC_123_SPEED_X 123'
where ABC_123 is an identifier, SPEED_X is a parameter, and 123 is a value. I thought of the following BNF using Pyparsing:
Identifier = Word( alphanums + '_' )
Parameter = Keyword('SPEED_X') | Keyword('SPEED_Y') | Keyword('SPEED_Z')
Value = # assume I already have an expression valid for any value
Entry = Identifier + Literal('_') + Parameter + Value
tokens = Entry.parseString('ABC_123_SPEED_X 123')
#Error: pyparsing.ParseException: Expected "_" (at char 16), (line:1, col:17)
If I remove the underscore from the middle (and adjust the Entry definition accordingly) it parses correctly.
How can I make this parser a bit lazier, so that it waits until it matches the Keyword (as opposed to slurping the entire string as an Identifier and then waiting for the _, which no longer exists)?
Thank you.
[Note: This is a complete rewrite of my question; I had not realized what the real problem was]

I based my answer off of this one, since what you're trying to do is get a non-greedy match. It seems like this is difficult to make happen in pyparsing, but not impossible with some cleverness and compromise. The following seems to work:
from pyparsing import *
Parameter = Literal('SPEED_X') | Literal('SPEED_Y') | Literal('SPEED_Z')
UndParam = Suppress('_') + Parameter
Identifier = SkipTo(UndParam)
Value = Word(nums)
Entry = Identifier + UndParam + Value
When we run this from the interactive interpreter, we can see the following:
>>> Entry.parseString('ABC_123_SPEED_X 123')
(['ABC_123', 'SPEED_X', '123'], {})
Note that this is a compromise; because I use SkipTo, the Identifier can be full of evil, disgusting characters, not just beautiful alphanums with the occasional underscore.
EDIT: Thanks to Paul McGuire, we can concoct a truly elegant solution by setting Identifier to the following:
Identifier = Combine(Word(alphanums) +
ZeroOrMore('_' + ~Parameter + Word(alphanums)))
Let's inspect how this works. First, ignore the outer Combine; we'll get to this later. Starting with Word(alphanums) we know we'll get the 'ABC' part of the reference string, 'ABC_123_SPEED_X 123'. It's important to note that we didn't allow the "word" to contain underscores in this case. We build that separately into the logic.
Next, we need to capture the '_123' part without also sucking in '_SPEED_X'. Let's also skip over ZeroOrMore at this point and return to it later. We start with the underscore as a Literal, but we can shortcut with just '_', which will get us the leading underscore, but not all of '_123'. Instinctively, we would place another Word(alphanums) to capture the rest, but that's exactly what will get us in trouble by consuming all of the remaining '_123_SPEED_X'. Instead, we say, "So long as what follows the underscore is not the Parameter, parse that as part of my Identifier." We state that in pyparsing terms as '_' + ~Parameter + Word(alphanums). Since we assume we can have an arbitrary number of underscore + WordButNotParameter repeats, we wrap that expression in a ZeroOrMore construct. (If you always expect at least one underscore + WordButNotParameter following the initial word, you can use OneOrMore.)
Finally, we need to wrap the initial Word and the special underscore + Word repeats together so that it's understood they are contiguous, not separated by whitespace, so we wrap the whole expression up in a Combine construct. This way 'ABC _123_SPEED_X' will raise a parse error, but 'ABC_123_SPEED_X' will parse correctly.
Note also that I had to change Keyword to Literal because the ways of the former are far too subtle and quick to anger. I do not trust Keywords, nor could I get matching with them.
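For comparison, the same "underscore, then anything that is not the Parameter" idea can be sketched with a negative lookahead in the standard-library re module instead of pyparsing. This is only an illustration, assuming the parameters are exactly SPEED_X, SPEED_Y and SPEED_Z and that values are plain digits; ENTRY and parse_entry are hypothetical names, not part of the answer above.

```python
import re

# Assumption: the only parameters are SPEED_X, SPEED_Y and SPEED_Z.
PARAMETER = r"SPEED_[XYZ]"

# identifier: an alphanumeric word, then any number of "_word" repeats
# where the text after the underscore is NOT a parameter (negative lookahead).
ENTRY = re.compile(
    r"(?P<identifier>[A-Za-z0-9]+(?:_(?!" + PARAMETER + r")[A-Za-z0-9]+)*)"
    r"_(?P<parameter>" + PARAMETER + r")"
    r"\s+(?P<value>[0-9]+)$"
)


def parse_entry(text):
    """Split 'IDENT_PARAM VALUE' into [identifier, parameter, value]."""
    match = ENTRY.match(text)
    if match is None:
        raise ValueError("not a valid entry: %r" % (text,))
    return [match.group("identifier"),
            match.group("parameter"),
            match.group("value")]


print(parse_entry("ABC_123_SPEED_X 123"))  # ['ABC_123', 'SPEED_X', '123']
```

As with the Combine version, 'ABC _123_SPEED_X 123' is rejected because the pattern allows no whitespace inside the identifier.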

If you are sure that the identifier never ends with an underscore, you can enforce it in the definition:
from pyparsing import *
my_string = 'ABC_123_SPEED_X 123'
Identifier = Combine(Word(alphanums) + Literal('_') + Word(alphanums))
Parameter = Literal('SPEED_X') | Literal('SPEED_Y') | Literal('SPEED_Z')
Value = Word(nums)
Entry = Identifier + Literal('_').suppress() + Parameter + Value
tokens = Entry.parseString(my_string)
print(tokens) # prints: ['ABC_123', 'SPEED_X', '123']
If that is not the case but the identifier length is fixed, you can define Identifier like this:
Identifier = Word( alphanums + '_' , exact=7)

You can also parse the identifier and parameter as one token, and split them in a parse action:
from pyparsing import *
import re
def split_ident_and_param(tokens):
    # split the combined token into identifier and parameter
    mo = re.match(r"^(.*?_.*?)_(.*?_.*?)$", tokens[0])
    return [mo.group(1), mo.group(2)]
ident_and_param = Word(alphanums + "_").setParseAction(split_ident_and_param)
value = Word(nums)
entry = ident_and_param + value
print(entry.parseString("APC_123_SPEED_X 123"))
The example above assumes that the identifiers and parameters always have the format XXX_YYY (containing one single underscore).
If this is not the case, you need to adjust the split_ident_and_param() method.
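One possible adjustment, as a rough sketch: if only the parameter is guaranteed to have the form XXX_Y (exactly one underscore), the split can be done from the right, which leaves the identifier free to contain any number of underscores. The function below takes a plain string rather than a pyparsing token list, and its single-underscore assumption is mine, not the answer's.

```python
def split_ident_and_param(token):
    """Split 'IDENT_PARAM' where PARAM has the form XXX_Y (one underscore)."""
    # Split from the right: the last two pieces form the parameter,
    # everything before them is the identifier, underscores included.
    parts = token.rsplit("_", 2)
    if len(parts) != 3:
        raise ValueError("expected at least two underscores in %r" % (token,))
    identifier, param_head, param_tail = parts
    return [identifier, param_head + "_" + param_tail]


print(split_ident_and_param("APC_123_SPEED_X"))  # ['APC_123', 'SPEED_X']
print(split_ident_and_param("A_B_C_SPEED_Y"))    # ['A_B_C', 'SPEED_Y']
```

Used as a parse action, it would need a small wrapper such as setParseAction(lambda tokens: split_ident_and_param(tokens[0])).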

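This answers a question that you probably have also asked yourself: "What's a real-world application for reduce?":
>>> from functools import reduce  # a builtin on Python 2, must be imported on Python 3
>>> from pyparsing import Keyword
>>> keys = ['CAT', 'DOG', 'HORSE', 'DEER', 'RHINOCEROS']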
>>> p = reduce(lambda x, y: x | y, [Keyword(x) for x in keys])
>>> p
{{{{"CAT" | "DOG"} | "HORSE"} | "DEER"} | "RHINOCEROS"}
Edit:
This was a pretty good answer to the original question. I'll have to work on the new one.
Further edit:
I'm pretty sure you can't do what you're trying to do with Word alone. A pyparsing Word doesn't look ahead to see what the rest of the grammar expects. So if you tell it to match Word(alphanums + '_'), it's going to keep matching characters until it finds one that's not a letter, number, or underscore.

Related

Parsing a custom configuration format in Python

I'm writing a profile manager for Stellaris game and I've hit a wall with their format in which they keep the info about mods and settings.
Mod file:
name="! (Ship Designer UI Fix) !"
path="mod/ship_designer_ui_fix"
tags={
"Fixes"
}
remote_file_id="879973318"
supported_version="1.6"
Settings:
language="l_english"
graphics={
size={
x=1920
y=1200
}
min_gui={
x=1920
y=1200
}
gui_scale=1.000000
gui_safe_ratio=1.000000
refreshRate=59
fullScreen=no
borderless=no
display_index=0
shadowSize=2048
multi_sampling=8
maxanisotropy=16
gamma=50.000000
vsync=yes
}
last_mods={
"mod/ship_designer_ui_fix.mod"
"mod/ugc_720237457.mod"
"mod/ugc_775944333.mod"
}
I thought pyparsing would be of help here (and it probably will be), but it has been a long time since I've done something like this and I'm clueless at the moment.
I've managed to extract the simple key=value pairs, but I'm struggling to move on from there to extract the arrays, not to mention the multilevel arrays.
lbrack = Literal("{").suppress()
rbrack = Literal("}").suppress()
equals = Literal("=").suppress()
nonequals = "".join([c for c in printables if c != "="]) + " \t"
keydef = ~lbrack + Word(nonequals) + equals + restOfLine
conf = Dict( ZeroOrMore( Group(keydef) ) )
tokens = conf.parseString(data)
I haven't got very far, as you can see. Can anyone point me towards the next step? I'm not asking for a finished and working solution for the whole thing; it would move me forward a lot, but where's the fun in that :)

Well, it is awfully tempting to just dive in and write this parser, but you want some of that fun for yourself, that's great.
Before writing any code, write a BNF. That way you'll write a decent and robust parser, instead of just "everything that's not an equals sign must be an identifier".
There are a lot of "something = something" bits here, look at the kinds of things on the right- and left-hand sides of the '='. The left-hand sides all look like pretty well-mannered identifiers: alphas, underscores. I could envision numeric digits too, as long as they aren't the leading character. So let's say the left-hand sides will be identifiers:
identifier_leading ::= 'A'..'Z' | 'a'..'z' | '_'
identifier_body ::= identifier_leading | '0'..'9'
identifier ::= identifier_leading identifier_body*
The right-hand sides are a mix of things:
integers
floats
'yes' or 'no' booleans
quoted strings
something in braces
The "something in braces" are either a list of quoted strings, or a list of 'identifier = value' pairs. I'll skip the awful details of defining floats and integers and quoted strings, let's just assume we have those defined:
boolean_value ::= 'yes' | 'no'
value ::= float | integer | boolean_value | quoted_string | string_list_in_braces | key_value_list_in_braces
string_list_in_braces ::= '{' quoted_string * '}'
key_value ::= identifier '=' value
key_value_list_in_braces ::= '{' key_value* '}'
You will have to use a pyparsing Forward to declare value before it is fully defined, since it is used in key_value, but key_value is used in key_value_list_in_braces, which is used to define value - a recursive grammar. You are already familiar with the Dict(OneOrMore(Group(named_item))) pattern, and this should be good to give you a structure of fields that are accessible by name. For identifier, a Word would work, or you could just use the pre-defined pyparsing_common.identifier which was introduced as part of the pyparsing_common namespace class last year.
The translation from BNF to pyparsing should be pretty much 1-to-1 from here. For that matter, from the BNF, you could use PLY, ANTLR, or another parsing lib too. The BNF is really worth taking the 1/2 hour or 1/2 day to get sorted out.
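To show how directly the BNF maps onto code, here is a rough sketch of it as a hand-written recursive-descent parser using only the standard library. It is deliberately not the pyparsing version (that part is left as the fun bit), and every name in it (TOKEN, tokenize, parse_value, parse_braces, parse_key_values, parse_config) is made up for illustration. It assumes quoted strings contain no escaped quotes and that repeated keys may simply overwrite each other.

```python
import re

# One token per match: quoted string, number, identifier, or one of = { }.
TOKEN = re.compile(r'''
    \s*(?:
        (?P<string>"[^"]*")
      | (?P<number>-?[0-9]+(?:\.[0-9]+)?)
      | (?P<ident>[A-Za-z_][A-Za-z0-9_]*)
      | (?P<punct>[={}])
    )''', re.VERBOSE)


def tokenize(text):
    text = text.rstrip()
    tokens, pos = [], 0
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            raise ValueError("unexpected input at offset %d" % pos)
        tokens.append((match.lastgroup, match.group(match.lastgroup)))
        pos = match.end()
    tokens.append(("end", ""))  # sentinel so look-ahead never runs off the list
    return tokens


def parse_value(tokens, i):
    # value ::= float | integer | boolean_value | quoted_string | braces
    kind, text = tokens[i]
    if kind == "string":
        return text[1:-1], i + 1
    if kind == "number":
        return (float(text) if "." in text else int(text)), i + 1
    if kind == "ident" and text in ("yes", "no"):
        return text == "yes", i + 1
    if kind == "punct" and text == "{":
        return parse_braces(tokens, i + 1)
    raise ValueError("unexpected token %r" % (text,))


def parse_braces(tokens, i):
    # string_list_in_braces | key_value_list_in_braces; tokens[i] follows '{'
    if tokens[i][0] == "ident":
        body, i = parse_key_values(tokens, i)
    else:
        body = []
        while tokens[i][0] == "string":
            body.append(tokens[i][1][1:-1])
            i += 1
    if tokens[i] != ("punct", "}"):
        raise ValueError("expected '}' but found %r" % (tokens[i][1],))
    return body, i + 1


def parse_key_values(tokens, i):
    # key_value* where key_value ::= identifier '=' value
    result = {}
    while tokens[i][0] == "ident":
        key = tokens[i][1]
        if tokens[i + 1] != ("punct", "="):
            raise ValueError("expected '=' after %r" % (key,))
        value, i = parse_value(tokens, i + 2)
        result[key] = value
    return result, i


def parse_config(text):
    tokens = tokenize(text)
    result, i = parse_key_values(tokens, 0)
    if tokens[i][0] != "end":
        raise ValueError("unexpected token %r" % (tokens[i][1],))
    return result


sample = '''
name="! (Ship Designer UI Fix) !"
tags={
    "Fixes"
}
graphics={
    size={
        x=1920
        y=1200
    }
    gui_scale=1.000000
    vsync=yes
}
'''
config = parse_config(sample)
print(config["graphics"]["size"]["x"])  # 1920
```

The mutual recursion between parse_value and parse_braces plays the role that Forward plays in pyparsing: value is used before the braces rule that depends on it is complete.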

Python, how do I parse key=value list ignoring what is inside parentheses?

Suppose I have a string like this:
"key1=value1;key2=value2;key3=(key3.1=value3.1;key3.2=value3.2)"
I would like to get a dictionary corresponding to the above, where the value for key3 is the string
"(key3.1=value3.1;key3.2=value3.2)"
and eventually the corresponding sub-dictionary.
I know how to split the string at the semicolons, but how can I tell the parser to ignore the semicolon between parentheses?
This includes potentially nested parentheses.
Currently I am using an ad-hoc routine that looks for pairs of matching parentheses, "clears" its content, gets split positions and applies them to the original string, but this does not appear very elegant, there must be some prepackaged pythonic way to do this.
If anyone is interested, here is the code I am currently using:
def pparams(parameters, sep=';', defs='=', brc='()'):
    '''
    unpackages parameter string to struct
    for example, pippo(a=21;b=35;c=pluto(h=zzz;y=mmm);d=2d3f) becomes:
    a: '21'
    b: '35'
    c.fn: 'pluto'
    c.h='zzz'
    d: '2d3f'
    fn_: 'pippo'
    '''
    ob=strfind(parameters,brc[0])
    dp=strfind(parameters,defs)
    out={}
    if len(ob)>0:
        if ob[0]<dp[0]:
            #opening function
            out['fn_']=parameters[:ob[0]]
            parameters=parameters[(ob[0]+1):-1]
    if len(dp)>0:
        temp=smart_tokenize(parameters,sep,brc);
        for v in temp:
            defp=strfind(v,defs)
            pname=v[:defp[0]]
            pval=v[1+defp[0]:]
            if len(strfind(pval,brc[0]))>0:
                out[pname]=pparams(pval,sep,defs,brc);
            else:
                out[pname]=pval
    else:
        out['fn_']=parameters
    return out
def smart_tokenize( instr, sep=';', brc='()' ):
    '''
    tokenize string ignoring separators contained within brc
    '''
    tstr=instr;
    ob=strfind(instr,brc[0])
    while len(ob)>0:
        cb=findclsbrc(tstr,ob[0])
        tstr=tstr[:ob[0]]+'?'*(cb-ob[0]+1)+tstr[cb+1:]
        ob=strfind(tstr,brc[1])
    sepp=[-1]+strfind(tstr,sep)+[len(instr)+1]
    out=[]
    for i in range(1,len(sepp)):
        out.append(instr[(sepp[i-1]+1):(sepp[i])])
    return out
def findclsbrc(instr, brc_pos, brc='()'):
    '''
    given a string containing an opening bracket, finds the
    corresponding closing bracket
    '''
    tstr=instr[brc_pos:]
    o=strfind(tstr,brc[0])
    c=strfind(tstr,brc[1])
    p=o+c
    p.sort()
    s1=[1 if v in o else 0 for v in p]
    s2=[-1 if v in c else 0 for v in p]
    s=[s1v+s2v for s1v,s2v in zip(s1,s2)]
    s=[sum(s[:i+1]) for i in range(len(s))] #cumsum
    return p[s.index(0)]+brc_pos
def strfind(instr, substr):
    '''
    returns starting position of each occurrence of substr within instr
    '''
    i=0
    out=[]
    while i<=len(instr):
        try:
            p=instr[i:].index(substr)
            out.append(i+p)
            i+=p+1
        except:
            i=len(instr)+1
    return out

If you want to build a real parser, use one of the Python parsing libraries, like PLY or PyParsing. If you figure such a full-fledged library is overkill for the task at hand, go for some hack like the one you already have. I'm pretty sure there is no clean few-line solution without an external library.

Expanding on Sven Marnach's answer, here's an example of a pyparsing grammar that should work for you:
from pyparsing import (ZeroOrMore, Word, printables, Forward,
Group, Suppress, Dict)
collection = Forward()
simple_value = Word(printables, excludeChars='()=;')
key = simple_value
inner_collection = Suppress('(') + collection + Suppress(')')
value = simple_value ^ inner_collection
key_and_value = Group(key + Suppress('=') + value)
collection << Dict(key_and_value + ZeroOrMore(Suppress(';') + key_and_value))
coll = collection.parseString(
"key1=value1;key2=value2;key3=(key3.1=value3.1;key3.2=value3.2)")
print(coll['key1']) # value1
print(coll['key2']) # value2
print(coll['key3']['key3.1']) # value3.1

You could use a regex to capture the groups:
>>> import re
>>> s = "key1=value1;key2=value2;key3=(key3.1=value3.1;key3.2=value3.2)"
>>> r = re.compile(r'(\w+)=(\w+|\([^)]+\));?')
>>> dict(r.findall(s))
This regex says:
(\w+) # Find and capture a group of 1 or more word characters (letters, digits, underscores)
= # followed by the literal character '='
(\w+ # followed by a group of 1 or more word characters
|\([^)]+\) # or a group that starts with an open paren (parens are escaped as '\(' and '\)'), followed by anything up to a closing paren, which ends the alternative
);? # optionally, this group may be followed by a semicolon.
Gotta say, kind of a strange grammar. You should consider using a more standard format. If you need guidance choosing one maybe ask another question. Good luck!
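Note that the regex above stops at the first closing parenthesis, so it does not cover the nested case the question mentions. A minimal sketch of a depth-counting alternative, using only the standard library: split_top_level and parse_params are hypothetical names, and the sketch assumes that a value is either plain text or one fully parenthesized group.

```python
def split_top_level(text, sep=";", brackets="()"):
    """Split text at sep, ignoring separators nested inside brackets."""
    parts, depth, start = [], 0, 0
    for i, ch in enumerate(text):
        if ch == brackets[0]:
            depth += 1
        elif ch == brackets[1]:
            depth -= 1
            if depth < 0:
                raise ValueError("unbalanced closing bracket at %d" % i)
        elif ch == sep and depth == 0:
            parts.append(text[start:i])
            start = i + 1
    if depth != 0:
        raise ValueError("unbalanced opening bracket")
    parts.append(text[start:])
    return parts


def parse_params(text):
    """Turn 'k1=v1;k2=(k3=v3)' into a (possibly nested) dictionary."""
    result = {}
    for item in split_top_level(text):
        key, _, value = item.partition("=")
        # Assumption: a parenthesized value is itself a key=value list.
        if value.startswith("(") and value.endswith(")"):
            value = parse_params(value[1:-1])
        result[key] = value
    return result


s = "key1=value1;key2=value2;key3=(key3.1=value3.1;key3.2=value3.2)"
print(parse_params(s)["key3"]["key3.1"])  # value3.1
```

If only the raw string for key3 is wanted, split_top_level on its own already gives the three top-level items.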

PyParsing lookaheads and greedy expressions

I'm writing a parser for a query language using PyParsing, and I've gotten stuck on (what I believe to be) an issue with lookaheads. One clause type in the query is intended to split strings into 3 parts (fieldname,operator, value) such that fieldname is one word, operator is one or more words, and value is a word, a quoted string, or a parenthesized list of these.
My data look like
author is william
author is 'william shakespeare'
author is not shakespeare
author is in (william,'the bard',shakespeare)
And my current parser for this clause is written as:
fieldname = Word(alphas)
operator = OneOrMore(Word(alphas))
single_value = Word(alphas) ^ QuotedString(quoteChar="'")
list_value = Literal("(") + Group(delimitedList(single_value)) + Literal(")")
value = single_value ^ list_value
clause = fieldname + originalTextFor(operator) + value
Obviously this fails due to the fact that the operator element is greedy and will gobble up the value if it can. From reading other similar questions and the docs, I've gathered that I need to manage that lookahead with a NotAny or FollowedBy, but I haven't been able to figure out how to make that work.

This is a good place to Be The Parser. Or more accurately, Make The Parser Think Like You Do. Ask yourself, "In 'author is shakespeare', how do I know that 'shakespeare' is not part of the operator?" You know that 'shakespeare' is the value because it is at the end of the query, there is nothing more after it. So operator words aren't just words of alphas, they are words of alphas that are not followed by the end of the string. Now build that lookahead logic into your definition of operator:
operator = OneOrMore(Word(alphas) + ~FollowedBy(StringEnd()))
And I think this will start parsing better for you.
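If it helps to see the lookahead idea in isolation, here is a rough standard-library sketch of the same rule ("an operator word is a word that is not followed by the end of the string") written as a regular expression. It is an analogue of the grammar, not a replacement for it; CLAUSE and parse_clause are hypothetical names, and the value is returned as raw text rather than as a parsed list.

```python
import re

WORD = r"[A-Za-z]+"
# An operator word is a word that is NOT followed by the end of the string.
# The \b stops the engine from backtracking into the middle of a word.
OPERATOR_WORD = WORD + r"\b(?!\s*$)"

CLAUSE = re.compile(
    r"^\s*(?P<fieldname>" + WORD + r")\s+"
    r"(?P<operator>" + OPERATOR_WORD + r"(?:\s+" + OPERATOR_WORD + r")*)\s*"
    r"(?P<value>" + WORD + r"|'[^']*'|\(.*\))\s*$"
)


def parse_clause(text):
    """Return (fieldname, operator, value) for one query clause."""
    match = CLAUSE.match(text)
    if match is None:
        raise ValueError("not a valid clause: %r" % (text,))
    return (match.group("fieldname"),
            match.group("operator"),
            match.group("value"))


print(parse_clause("author is william"))
# ('author', 'is', 'william')
print(parse_clause("author is not shakespeare"))
# ('author', 'is not', 'shakespeare')
```

The quoted and parenthesized forms work for a different reason: the quote or the parenthesis is simply not a word character, so the operator stops there by itself.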
Some other tips:
I only use the '^' operator if there will be some possible ambiguity, like if I was going to parse a string with numbers that could be integers or hex. If I used Word(nums) | Word(hexnums), then I might misprocess "123ABC" as just the leading "123". By changing '|' to '^', all of the alternatives will be tested, and the longest match chosen. In my example of parsing decimal or hex integers, I could have gotten the same result by reversing the alternatives and testing for Word(hexnums) first. In your query language, there is no way to confuse a quoted string with a non-quoted single word value (one leads with ' or ", the other doesn't), so there is no reason to use '^'; '|' will suffice. The same goes for value = single_value ^ list_value.
Adding results names to the key components of your query string will make it easier to work with later:
clause = fieldname("fieldname") + originalTextFor(operator)("operator") + value("value")
Now you can access the parsed values by name instead of by parse position (which will get tricky and error-prone once you start getting more complicated with optional fields and such):
queryParts = clause.parseString('author is william')
print(queryParts.fieldname)
print(queryParts.operator)
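The first-match versus longest-match distinction from the tip about '^' can also be seen without pyparsing, because alternation in Python's re module is first-match as well, much like '|'. The sketch below is only an analogy; longest_match is a hypothetical helper that imitates what '^' does by trying every alternative.

```python
import re

# First-match alternation: the decimal branch wins and only the
# leading "123" is consumed.
print(re.match(r"[0-9]+|[0-9A-Fa-f]+", "123ABC").group(0))  # 123

# Trying the more general alternative first fixes it.
print(re.match(r"[0-9A-Fa-f]+|[0-9]+", "123ABC").group(0))  # 123ABC


def longest_match(patterns, text):
    """Imitate '^': try every alternative and keep the longest match."""
    best = None
    for pattern in patterns:
        m = re.match(pattern, text)
        if m and (best is None or len(m.group(0)) > len(best)):
            best = m.group(0)
    return best


print(longest_match([r"[0-9]+", r"[0-9A-Fa-f]+"], "123ABC"))  # 123ABC
```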

python regex for repeating string

I am wanting to verify and then parse this string (in quotes):
string = "start: c12354, c3456, 34526; other stuff that I don't care about"
//Note that some codes begin with 'c'
I would like to verify that the string starts with 'start:' and ends with ';'
Afterward, I would like to have a regex parse out the strings. I tried the following python re code:
regx = r"start: (c?[0-9]+,?)+;"
reg = re.compile(regx)
matched = reg.search(string)
print(' matched.groups()', matched.groups())
I have tried different variations but I can either get the first or the last code but not a list of all three.
Or should I abandon using a regex?
EDIT: updated to reflect part of the problem space I neglected and fixed string difference.
Thanks for all the suggestions - in such a short time.

In Python, this isn’t possible with a single regular expression: each capture of a group overrides the last capture of that same group (in .NET, this would actually be possible since the engine distinguishes between captures and groups).
Your easiest solution is to first extract the part between start: and ; and then use a regular expression that returns all matches, not just a single match, with re.findall('c?[0-9]+', text).
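A minimal sketch of that two-step approach follows. The stricter list pattern (comma-separated codes with optional spaces) is my assumption about the format, and extract_codes is a hypothetical name.

```python
import re

# Step 1: verify the overall shape and capture the text between
# "start:" and ";". Assumes codes are separated by commas.
LIST_PATTERN = re.compile(r"start:\s*(c?[0-9]+(?:\s*,\s*c?[0-9]+)*)\s*;")
# Step 2: pull out every code from the captured text.
CODE_PATTERN = re.compile(r"c?[0-9]+")


def extract_codes(text):
    """Return the list of codes, or None if the text does not validate."""
    match = LIST_PATTERN.match(text)
    if match is None:
        return None
    return CODE_PATTERN.findall(match.group(1))


line = "start: c12354, c3456, 34526; other stuff that I don't care about"
print(extract_codes(line))  # ['c12354', 'c3456', '34526']
```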

You could use the standard string tools, which are pretty much always more readable.
s = "start: c12354, c3456, 34526;"
s.startswith("start:") # True if s starts with this string
s.endswith(";") # True if s ends with this string
s[6:-1].split(', ') # will give you a list of tokens separated by the string ", "

This can be done (pretty elegantly) with a tool like Pyparsing:
from pyparsing import Group, Literal, OneOrMore, Optional, ParseException, Word
import string
code = Group(Optional(Literal("c"), default='') + Word(string.digits) + Optional(Literal(","), default=''))
parser = Literal("start:") + OneOrMore(code) + Literal(";")
# Read lines from file:
with open('lines.txt', 'r') as f:
    for line in f:
        try:
            result = parser.parseString(line)
            # each group is [prefix, digits, comma]; keep the digits
            codes = [c[1] for c in result[1:-1]]
            # Do something with teh codez...
        except ParseException:
            # Oh noes: string doesn't match!
            continue
Cleaner than a regular expression, returns a list of codes (no need to string.split), and ignores any extra characters in the line, just like your example.

import re
sstr = re.compile(r'start:([^;]*);')
slst = re.compile(r'(?:c?)(\d+)')
mystr = "start: c12354, c3456, 34526; other stuff that I don't care about"
match = re.match(sstr, mystr)
if match:
    res = re.findall(slst, match.group(0))
results in
['12354', '3456', '34526']

Using pyparsing to parse a word escape-split over multiple lines

I'm trying to parse words which can be broken up over multiple lines with a backslash-newline combination ("\\\n") using pyparsing. Here's what I have done:
from pyparsing import *
continued_ending = Literal('\\') + lineEnd
word = Word(alphas)
split_word = word + Suppress(continued_ending)
multi_line_word = Forward()
multi_line_word << (word | (split_word + multi_line_word))
print(multi_line_word.parseString(
'''super\\
cali\\
fragi\\
listic'''))
The output I get is ['super'], while the expected output is ['super', 'cali', 'fragi', 'listic']. Better still would be all of them joined as one word (which I think I can just do with multi_line_word.setParseAction(lambda t: ''.join(t))).
I tried looking at this code in pyparsing helper, but it gives me an error, maximum recursion depth exceeded.
EDIT 2009-11-15: I realized later that pyparsing gets a little generous with regards to white space, and that leads to some poor assumptions that what I thought I was parsing for was a lot looser. That is to say, we want to see no white space between any of the portions of the word, the escape, and the EOL character.
I realized that the little example string above is insufficient as a test case, so I wrote the following unit tests. Code that passes these tests should be able to match what I intuitively think of as an escape-split word, and only an escape-split word. They will not match a basic word that is not escape-split. We can (and I believe should) use a different grammatical construct for that. Keeping the two separate keeps it all tidy.
import unittest
import pyparsing
# Assumes you named your module 'multiline.py'
import multiline
class MultiLineTests(unittest.TestCase):
    def test_continued_ending(self):
        case = '\\\n'
        expected = ['\\', '\n']
        result = multiline.continued_ending.parseString(case).asList()
        self.assertEqual(result, expected)
    def test_continued_ending_space_between_parse_error(self):
        case = '\\ \n'
        self.assertRaises(
            pyparsing.ParseException,
            multiline.continued_ending.parseString,
            case
        )
    def test_split_word(self):
        cases = ('shiny\\', 'shiny\\\n', ' shiny\\')
        expected = ['shiny']
        for case in cases:
            result = multiline.split_word.parseString(case).asList()
            self.assertEqual(result, expected)
    def test_split_word_no_escape_parse_error(self):
        case = 'shiny'
        self.assertRaises(
            pyparsing.ParseException,
            multiline.split_word.parseString,
            case
        )
    def test_split_word_space_parse_error(self):
        cases = ('shiny \\', 'shiny\r\\', 'shiny\t\\', 'shiny\\ ')
        for case in cases:
            self.assertRaises(
                pyparsing.ParseException,
                multiline.split_word.parseString,
                case
            )
    def test_multi_line_word(self):
        cases = (
            'shiny\\',
            'shi\\\nny',
            'sh\\\ni\\\nny\\\n',
            ' shi\\\nny\\',
            'shi\\\nny ',
            'shi\\\nny captain'
        )
        expected = ['shiny']
        for case in cases:
            result = multiline.multi_line_word.parseString(case).asList()
            self.assertEqual(result, expected)
    def test_multi_line_word_spaces_parse_error(self):
        cases = (
            'shi \\\nny',
            'shi\\ \nny',
            'sh\\\n iny',
            'shi\\\n\tny',
        )
        for case in cases:
            self.assertRaises(
                pyparsing.ParseException,
                multiline.multi_line_word.parseString,
                case
            )
if __name__ == '__main__':
    unittest.main()

After poking around for a bit more, I came upon this help thread where there was this notable bit
I often see inefficient grammars when someone implements a pyparsing grammar directly from a BNF definition. BNF does not have a concept of "one or more" or "zero or more" or "optional"...
With that, I got the idea to change these two lines
multi_line_word = Forward()
multi_line_word << (word | (split_word + multi_line_word))
To
multi_line_word = ZeroOrMore(split_word) + word
This got it to output what I was looking for: ['super', 'cali', 'fragi', 'listic'].
Next, I added a parse action that would join these tokens together:
multi_line_word.setParseAction(lambda t: ''.join(t))
This gives a final output of ['supercalifragilistic'].
The take home message I learned is that one doesn't simply walk into Mordor.
Just kidding.
The take home message is that one can't simply implement a one-to-one translation of BNF with pyparsing. Some tricks with using the iterative types should be called into use.
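The same "iterate, don't recurse" shape can be seen in a plain regular expression, where the * repetition plays the part of ZeroOrMore: one word, followed by zero or more (backslash-newline + word) repeats. This is only a standard-library sketch of the looser, first version of the problem; SPLIT_WORD and join_split_word are hypothetical names, and it makes no attempt to satisfy the stricter unit tests from the question.

```python
import re

# word ( backslash-newline word )*  -- repetition instead of recursion
SPLIT_WORD = re.compile(r"[A-Za-z]+(?:\\\n[A-Za-z]+)*")


def join_split_word(text):
    """Return the escape-split word at the start of text, joined up."""
    match = SPLIT_WORD.match(text)
    if match is None:
        raise ValueError("no word at start of %r" % (text,))
    # Drop the backslash-newline pairs to glue the pieces together.
    return match.group(0).replace("\\\n", "")


print(join_split_word("super\\\ncali\\\nfragi\\\nlistic"))
# supercalifragilistic
```

Unlike the grammar, this simply stops at the first character that does not fit instead of raising an error, so 'shi \\\nny' yields 'shi'.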
EDIT 2009-11-25: To compensate for the more strenuous test cases, I modified the code to the following:
no_space = NotAny(White(' \t\r'))
# make sure that the EOL immediately follows the escape backslash
continued_ending = Literal('\\') + no_space + lineEnd
word = Word(alphas)
# make sure that the escape backslash immediately follows the word
split_word = word + NotAny(White()) + Suppress(continued_ending)
multi_line_word = OneOrMore(split_word + NotAny(White())) + Optional(word)
multi_line_word.setParseAction(lambda t: ''.join(t))
This has the benefit of making sure that no space comes between any of the elements (with the exception of newlines after the escaping backslashes).

You are pretty close with your code. Any of these mods would work:
# '|' means MatchFirst, so you had a left-recursive expression
# reversing the order of the alternatives makes this work
multi_line_word << ((split_word + multi_line_word) | word)
# '^' means Or/MatchLongest, but beware using this inside a Forward
multi_line_word << (word ^ (split_word + multi_line_word))
# an unusual use of delimitedList, but it works
multi_line_word = delimitedList(word, continued_ending)
# in place of your parse action, you can wrap in a Combine
multi_line_word = Combine(delimitedList(word, continued_ending))
As you found in your pyparsing googling, BNF->pyparsing translations should be done with a special view to using pyparsing features in place of BNF, um, shortcomings. I was actually in the middle of composing a longer answer, going into more of the BNF translation issues, but you have already found this material (on the wiki, I assume).
