I'm writing a profile manager for the game Stellaris, and I've hit a wall with the format it uses to store info about mods and settings.
Mod file:
name="! (Ship Designer UI Fix) !"
path="mod/ship_designer_ui_fix"
tags={
"Fixes"
}
remote_file_id="879973318"
supported_version="1.6"
Settings:
language="l_english"
graphics={
size={
x=1920
y=1200
}
min_gui={
x=1920
y=1200
}
gui_scale=1.000000
gui_safe_ratio=1.000000
refreshRate=59
fullScreen=no
borderless=no
display_index=0
shadowSize=2048
multi_sampling=8
maxanisotropy=16
gamma=50.000000
vsync=yes
}
last_mods={
"mod/ship_designer_ui_fix.mod"
"mod/ugc_720237457.mod"
"mod/ugc_775944333.mod"
}
I thought pyparsing would be of help here (and it probably will be), but it has been a long time since I actually did something like this, and I'm clueless at the moment.
I've managed to extract the simple key=value pairs, but I'm struggling to move from there to extracting the arrays, not to mention the multilevel arrays.
from pyparsing import (Literal, Word, printables, restOfLine,
                       Dict, Group, ZeroOrMore)

lbrack = Literal("{").suppress()
rbrack = Literal("}").suppress()
equals = Literal("=").suppress()
# a key is any run of printables except '=' (spaces and tabs allowed)
nonequals = "".join([c for c in printables if c != "="]) + " \t"
keydef = ~lbrack + Word(nonequals) + equals + restOfLine
conf = Dict(ZeroOrMore(Group(keydef)))
tokens = conf.parseString(data)  # 'data' holds the file contents shown above
I haven't got very far as you can see. Can anyone point me towards next step? I'm not asking a finished and working solution for the whole thing - it would move me forward a lot but where's the fun in that :)
Well, it is awfully tempting to just dive in and write this parser, but you want some of that fun for yourself, and that's great.
Before writing any code, write a BNF. That way you'll write a decent and robust parser, instead of just "everything that's not an equals sign must be an identifier".
There are a lot of "something = something" bits here; look at the kinds of things on the right- and left-hand sides of the '='. The left-hand sides all look like pretty well-mannered identifiers: alphas and underscores. I could envision numeric digits too, as long as they aren't the leading character. So let's say the left-hand sides will be identifiers:
identifier_leading ::= 'A'..'Z' | 'a'..'z' | '_'
identifier_body    ::= identifier_leading | '0'..'9'
identifier         ::= identifier_leading identifier_body*
The right-hand sides are a mix of things:
integers
floats
'yes' or 'no' booleans
quoted strings
something in braces
The "something in braces" are either a list of quoted strings, or a list of 'identifer = value' pairs. I'll skip the awful details of defining floats and integers and quoted strings, let's just assume we have those defined:
boolean_value ::= 'yes' | 'no'
value ::= float | integer | boolean_value | quoted_string | string_list_in_braces | key_value_list_in_braces
string_list_in_braces ::= '{' quoted_string * '}'
key_value ::= identifier '=' value
key_value_list_in_braces ::= '{' key_value* '}'
You will have to use a pyparsing Forward to declare value before it is fully defined, since it is used in key_value, but key_value is used in key_value_list_in_braces, which is used to define value - a recursive grammar. You are already familiar with the Dict(OneOrMore(Group(named_item))) pattern, and this should be good to give you a structure of fields that are accessible by name. For identifier, a Word would work, or you could just use the pre-defined pyparsing_common.identifier which was introduced as part of the pyparsing_common namespace class last year.
The translation from BNF to pyparsing should be pretty much 1-to-1 from here. For that matter, from the BNF, you could use PLY, ANTLR, or another parsing lib too. The BNF is really worth taking the 1/2 hour or 1/2 day to get sorted out.
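For illustration (without spoiling all the fun), here is a rough, untested pyparsing sketch of that BNF; settings_data stands in for the settings text above, and the result-access lines at the end assume pyparsing's usual nested-Dict behavior:

from pyparsing import (Keyword, QuotedString, Suppress, Forward, Group,
                       Dict, ZeroOrMore, pyparsing_common)

LBRACE, RBRACE, EQ = map(Suppress, "{}=")

identifier = pyparsing_common.identifier
boolean_value = Keyword("yes") | Keyword("no")
quoted_string = QuotedString('"')
number = pyparsing_common.number  # matches both ints and floats

value = Forward()  # declared early: 'value' is used recursively below
string_list_in_braces = Group(LBRACE + ZeroOrMore(quoted_string) + RBRACE)
key_value = Group(identifier + EQ + value)
key_value_list_in_braces = Group(LBRACE + Dict(ZeroOrMore(key_value)) + RBRACE)
value <<= (boolean_value | number | quoted_string
           | string_list_in_braces | key_value_list_in_braces)

conf = Dict(ZeroOrMore(key_value))

settings = conf.parseString(settings_data)
print(settings.language)         # -> l_english
print(settings.graphics.size.x)  # -> 1920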
Related
In a project of mine, I'm passing strings to a Formatter subclass which formats them using the format specifier mini-language. In my case it is customized (using the features of the Formatter class) by adding additional bang converters: !u converts the resulting string to uppercase, !c to titlecase, !q doubles any curly brace (because reasons), and some others.
For example, using a = "toFu", "{a!c}" becomes "Tofu".
How could I make my system use f-string syntax, so I can have "{a+a!c}" be turned into "Tofutofu" ?
NB: I'm not asking for a way of making f"{a+a!c}" (note the presence of an f) resolve itself as "Tofutofu", which is what "hook into the builtin Python f-string format machinery" covers; I'm asking whether there is a way for a function or any form of Python code to turn "{a+a!c}" (note the absence of an f) into "Tofutofu".
Not sure I still fully understand what you need, but from the details given in the question and some comments, here is a function that parses strings with the format you specified and gives the desired results:
import re

def formatter(s):
    def replacement(match):
        # split "expr!conv" into the expression and the conversion letter
        expr, frmt = match[1].split('!')
        if frmt == 'c':
            return eval(expr).title()
        # (only !c is handled in this sketch; other converters would go here)
    return re.sub(r"{([^{]+)}", replacement, s)
a = "toFu"
print(formatter("blah {a!c}"))
print(formatter("{a+a!c}blah"))
Outputs:
blah Tofu
Tofutofublah
This uses the function variant of the repl argument of re.sub. This function (replacement) can be further extended to support all the other !x converters.
Main disadvantages:
Using eval is evil.
This doesn't take into account regular format specifiers, e.g. :0.3
Maybe someone can take it from here and improve it.
Evolved from @Tomerikoo's life-saving answer, here's the code:
import re
def formatter(s):
def replacement(match):
pre, bangs, suf = match.group(1, 2, 3)
# pre : the part before the first bang
# bangs : the bang (if any) and the characters going with it
# suf : the colon (if any) and the characters going with it
if not bangs:
return eval("f\"{" + pre + suf + "}\"")
conversion = set(bangs[1:]) # the first character is always a bang
sra = conversion - set("tiqulc")
conversion = conversion - sra
if sra:
sra = "!" + "".join(sra)
value = eval("f\"{" + pre + (sra or "") + suf + "}\"")
if "q" in conversion:
value = value.replace("{", "{{")
if "u" in conversion:
value = value.upper()
if "l" in conversion:
value = value.lower()
if "c" in conversion and value:
value = value.capitalize()
return value
return re.sub(r"{([^!:\n]+)((?:![^!:\n]+)?)((?::[^!:\n]+)?)}", replacement, s)
The massive regex results in the three groups I commented about at the top.
Caveat: it still uses eval (there's no acceptable way around it anyway), it doesn't allow for multiline replacement fields, and putting spaces between the ! and the : may cause issues and/or discrepancies.
But these are acceptable for the use I have.
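For a quick check (assuming the same module-level a = "toFu" as before; expected outputs in the comments):

a = "toFu"
print(formatter("{a!c}"))      # Tofu
print(formatter("{a+a!u}"))    # TOFUTOFU
print(formatter("{a!r:>10}"))  # standard !r conversion and :>10 spec pass through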
Please check the specification: only the conversion characters 's', 'r', and 'a' are allowed:
https://peps.python.org/pep-0498/
I'm trying to create a grammar to parse some Excel-like formulas I have devised, where a special character in the beginning of a string signifies a different source. For example, $ can signify a string, so "$This is text" would be treated as a string input in the program and & can signify a function, so &foo() can be treated as a call to the internal function foo.
The problem I'm facing is how to construct the grammar properly. For example, this is a simplified version as an MWE:
import lark

grammar = r'''start: instruction
?instruction: simple
| func
STARTSYMBOL: "!"|"#"|"$"|"&"|"~"
SINGLESTR: (LETTER+|DIGIT+|"_"|" ")*
simple: STARTSYMBOL [SINGLESTR] (WORDSEP SINGLESTR)*
ARGSEP: ",," // argument separator
WORDSEP: "," // word separator
CONDSEP: ";;" // condition separator
STAR: "*"
func: STARTSYMBOL SINGLESTR "(" [simple|func] (ARGSEP simple|func)* ")"
%import common.LETTER
%import common.WORD
%import common.DIGIT
%ignore ARGSEP
%ignore WORDSEP
'''
parser = lark.Lark(grammar, parser='earley')
So, with this grammar, things like: $This is a string, &foo(), &foo(#arg1), &foo($arg1,,#arg2) and &foo(!w1,w2,w3,,!w4,w5,w6) are all parsed as expected. But if I'd like to add more flexibility to my simple terminal, then I need to start fiddling around with the SINGLESTR token definition which is not convenient.
What I have tried
The part that I cannot get past is that if I want to have a string including parentheses (which are literals of func), then I cannot handle them in my current situation.
If I add the parentheses in SINGLESTR, then I get Expected STARTSYMBOL, because it's getting mixed up with the func definition and it thinks that a function argument should be passed, which makes sense.
If I redefine the grammar to reserve the ampersand symbol for functions only and add the parentheses in SINGLESTR, then I can parse a string with parentheses, but every function I'm trying to parse gives Expected LPAR.
My intent is that anything starting with a $ would be parsed as a SINGLESTR token and then I could parse things like &foo($first arg (has) parentheses,,$second arg).
My solution, for now, is that I'm using 'escape' words like LEFTPAR and RIGHTPAR in my strings and I've written helper functions to change those into parentheses when I process the tree. So, $This is a LEFTPARtestRIGHTPAR produces the correct tree and when I process it, then this gets translated to This is a (test).
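A trivial version of such a helper (the name is illustrative) looks like:

def unescape_parens(text):
    # turn the escape words back into literal parentheses
    return text.replace('LEFTPAR', '(').replace('RIGHTPAR', ')')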
To formulate a general question: Can I define my grammar in such a way that some characters that are special to the grammar are treated as normal characters in some situations and as special in any other case?
EDIT 1
Based on a comment from jbndlr I revised my grammar to create individual modes based on the start symbol:
grammar = r'''start: instruction
?instruction: simple
| func
SINGLESTR: (LETTER+|DIGIT+|"_"|" ") (LETTER+|DIGIT+|"_"|" "|"("|")")*
FUNCNAME: (LETTER+) (LETTER+|DIGIT+|"_")* // no parentheses allowed in the func name
DB: "!" SINGLESTR (WORDSEP SINGLESTR)*
TEXT: "$" SINGLESTR
MD: "#" SINGLESTR
simple: TEXT|DB|MD
ARGSEP: ",," // argument separator
WORDSEP: "," // word separator
CONDSEP: ";;" // condition separator
STAR: "*"
func: "&" FUNCNAME "(" [simple|func] (ARGSEP simple|func)* ")"
%import common.LETTER
%import common.WORD
%import common.DIGIT
%ignore ARGSEP
%ignore WORDSEP
'''
This falls (somewhat) under my second test case. I can parse all the simple types of strings (TEXT, MD or DB tokens that can contain parentheses) and functions that are empty; for example, &foo() or &foo(&bar()) parse correctly. The moment I put an argument within a function (no matter which type), I get an UnexpectedEOF Error: Expected ampersand, RPAR or ARGSEP. As a proof of concept, if I remove the parentheses from the definition of SINGLESTR in the new grammar above, then everything works as it should, but I'm back to square one.
import lark
grammar = r'''start: instruction
?instruction: simple
| func
MIDTEXTRPAR: /\)+(?!(\)|,,|$))/
SINGLESTR: (LETTER+|DIGIT+|"_"|" ") (LETTER+|DIGIT+|"_"|" "|"("|MIDTEXTRPAR)*
FUNCNAME: (LETTER+) (LETTER+|DIGIT+|"_")* // no parentheses allowed in the func name
DB: "!" SINGLESTR (WORDSEP SINGLESTR)*
TEXT: "$" SINGLESTR
MD: "#" SINGLESTR
simple: TEXT|DB|MD
ARGSEP: ",," // argument separator
WORDSEP: "," // word separator
CONDSEP: ";;" // condition separator
STAR: "*"
func: "&" FUNCNAME "(" [simple|func] (ARGSEP simple|func)* ")"
%import common.LETTER
%import common.WORD
%import common.DIGIT
%ignore ARGSEP
%ignore WORDSEP
'''
parser = lark.Lark(grammar, parser='earley')
parser.parse("&foo($first arg (has) parentheses,,$second arg)")
Output:
Tree(start, [Tree(func, [Token(FUNCNAME, 'foo'), Tree(simple, [Token(TEXT, '$first arg (has) parentheses')]), Token(ARGSEP, ',,'), Tree(simple, [Token(TEXT, '$second arg')])])])
I hope it's what you were looking for.
Those have been a crazy few days. I tried lark and failed. I also tried parsimonious and pyparsing. All of these different parsers had the same problem: the 'argument' token consumed the right parenthesis that was part of the function, eventually failing because the function's parentheses weren't closed.
The trick was to figure out how to define a right parenthesis that's "not special". See the regular expression for MIDTEXTRPAR in the code above. I defined it as a right parenthesis that is not followed by argument separation or by end of string. I did that by using the regular expression extension (?!...), which matches only if it's not followed by ... but doesn't consume characters. Luckily it even allows matching end of string inside this special regular expression extension.
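To see that negative lookahead in isolation (plain re, outside the grammar):

import re

# a ')' that is NOT followed by another ')', by ',,' or by end of string
midtextrpar = re.compile(r'\)+(?!(\)|,,|$))')

print(bool(midtextrpar.search("first arg (has) parentheses")))  # True: mid-text ')'
print(bool(midtextrpar.search("&foo($arg)")))  # False: this ')' closes the call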
EDIT:
The above-mentioned method only works if you don't have an argument ending with a ), because then the MIDTEXTRPAR regular expression won't catch that ) and will think it's the end of the function even though there are more arguments to process. Also, there may be ambiguities such as ...asdf),,...: it may be the end of a function declaration inside an argument, or a 'text-like' ) inside an argument after which the function declaration goes on.
This problem is related to the fact that what you describe in your question is not a context-free grammar (https://en.wikipedia.org/wiki/Context-free_grammar) for which parsers such as lark exist. Instead it is a context-sensitive grammar (https://en.wikipedia.org/wiki/Context-sensitive_grammar).
The reason it is a context-sensitive grammar is that you need the parser to 'remember' that it is nested inside a function, and how many levels of nesting there are, and to have this memory available inside the grammar's syntax in some way.
EDIT2:
Also take a look at the following parser, which is context-sensitive and seems to solve the problem, but has exponential time complexity in the number of nested functions, as it tries all possible function boundaries until it finds one that works. I believe it has to have exponential complexity, since the grammar is not context-free.
_funcPrefix = '&'
_debug = False
class ParseException(Exception):
pass
def GetRecursive(c):
if isinstance(c,ParserBase):
return c.GetRecursive()
else:
return c
class ParserBase:
def __str__(self):
return type(self).__name__ + ": [" + ','.join(str(x) for x in self.contents) +"]"
def GetRecursive(self):
return (type(self).__name__,[GetRecursive(c) for c in self.contents])
class Simple(ParserBase):
def __init__(self,s):
self.contents = [s]
class MD(Simple):
pass
class DB(ParserBase):
def __init__(self,s):
self.contents = s.split(',')
class Func(ParserBase):
def __init__(self,s):
if s[-1] != ')':
raise ParseException("Can't find right parenthesis: '%s'" % s)
lparInd = s.find('(')
if lparInd < 0:
raise ParseException("Can't find left parenthesis: '%s'" % s)
self.contents = [s[:lparInd]]
argsStr = s[(lparInd+1):-1]
args = list(argsStr.split(',,'))
i = 0
while i<len(args):
a = args[i]
if a[0] != _funcPrefix:
self.contents.append(Parse(a))
i += 1
else:
j = i+1
while j<=len(args):
nestedFunc = ',,'.join(args[i:j])
if _debug:
print(nestedFunc)
try:
self.contents.append(Parse(nestedFunc))
break
except ParseException as PE:
if _debug:
print(PE)
j += 1
if j>len(args):
raise ParseException("Can't parse nested function: '%s'" % (',,'.join(args[i:])))
i = j
def Parse(arg):
if arg[0] not in _starterSymbols:
raise ParseException("Bad prefix: " + arg[0])
return _starterSymbols[arg[0]](arg[1:])
_starterSymbols = {_funcPrefix:Func,'$':Simple,'!':DB,'#':MD}
P = Parse("&foo($first arg (has)) parentheses,,&f($asdf,,&nested2($23423))),,&second(!arg,wer))")
print(P)
import pprint
pprint.pprint(P.GetRecursive())
The problem is that the arguments of a function are enclosed in parentheses, where one of the arguments may itself contain parentheses.
One possible solution is to use a backslash \ before ( or ) when it is part of a string:
SINGLESTR: (LETTER+|DIGIT+|"_"|" ") (LETTER+|DIGIT+|"_"|" "|"\("|"\)")*
A similar solution is used by C to include double quotes (") as part of a string constant, where the string constant itself is enclosed in double quotes.
example_string1='&f(!g\()'
example_string2='&f(#g)'
print(parser.parse(example_string1).pretty())
print(parser.parse(example_string2).pretty())
Output is
start
  func
    f
    simple    !g\(

start
  func
    f
    simple    #g
I'm writing a parser for a query language using PyParsing, and I've gotten stuck on (what I believe to be) an issue with lookaheads. One clause type in the query is intended to split strings into 3 parts (fieldname, operator, value), such that fieldname is one word, operator is one or more words, and value is a word, a quoted string, or a parenthesized list of these.
My data look like
author is william
author is 'william shakespeare'
author is not shakespeare
author is in (william,'the bard',shakespeare)
And my current parser for this clause is written as:
fieldname = Word(alphas)
operator = OneOrMore(Word(alphas))
single_value = Word(alphas) ^ QuotedString(quoteChar="'")
list_value = Literal("(") + Group(delimitedList(single_value)) + Literal(")")
value = single_value ^ list_value
clause = fieldname + originalTextFor(operator) + value
Obviously this fails due to the fact that the operator element is greedy and will gobble up the value if it can. From reading other similar questions and the docs, I've gathered that I need to manage that lookahead with a NotAny or FollowedBy, but I haven't been able to figure out how to make that work.
This is a good place to Be The Parser. Or more accurately, Make The Parser Think Like You Do. Ask yourself, "In 'author is shakespeare', how do I know that 'shakespeare' is not part of the operator?" You know that 'shakespeare' is the value because it is at the end of the query, there is nothing more after it. So operator words aren't just words of alphas, they are words of alphas that are not followed by the end of the string. Now build that lookahead logic into your definition of operator:
operator = OneOrMore(Word(alphas) + ~FollowedBy(StringEnd()))
And I think this will start parsing better for you.
Some other tips:
I only use the '^' operator if there is some possible ambiguity, such as parsing a string with numbers that could be integers or hex. If I used Word(nums) | Word(hexnums), then I might misprocess "123ABC" as just the leading "123". By changing '|' to '^', all of the alternatives are tested and the longest match is chosen. In my example of parsing decimal or hex integers, I could have gotten the same result by reversing the alternatives and testing for Word(hexnums) first. In your query language, there is no way to confuse a quoted string with a non-quoted single-word value (one leads with ' or ", the other doesn't), so there is no reason to use '^'; '|' will suffice. The same goes for value = single_value ^ list_value.
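A quick illustration of that decimal/hex example:

from pyparsing import Word, nums, hexnums

integer = Word(nums)
hex_integer = Word(hexnums)

print((integer | hex_integer).parseString("123ABC"))  # ['123']    - first match wins
print((integer ^ hex_integer).parseString("123ABC"))  # ['123ABC'] - longest match wins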
Adding results names to the key components of your query string will make it easier to work with later:
clause = fieldname("fieldname") + originalTextFor(operator)("operator") + value("value")
Now you can access the parsed values by name instead of by parse position (which will get tricky and error-prone once you start getting more complicated with optional fields and such):
queryParts = clause.parseString('author is william')
print(queryParts.fieldname)
print(queryParts.operator)
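Putting the pieces together, a runnable sketch of the whole clause (using '|' per the tip above, and suppressing the list parentheses for a cleaner result):

from pyparsing import (Word, alphas, OneOrMore, QuotedString, Suppress,
                       Group, delimitedList, originalTextFor, FollowedBy,
                       StringEnd)

fieldname = Word(alphas)
operator = OneOrMore(Word(alphas) + ~FollowedBy(StringEnd()))
single_value = Word(alphas) | QuotedString(quoteChar="'")
list_value = Suppress("(") + Group(delimitedList(single_value)) + Suppress(")")
value = single_value | list_value
clause = (fieldname("fieldname")
          + originalTextFor(operator)("operator")
          + value("value"))

for query in ["author is william",
              "author is 'william shakespeare'",
              "author is not shakespeare",
              "author is in (william,'the bard',shakespeare)"]:
    result = clause.parseString(query)
    print(result.fieldname, "/", result.operator, "/", result.value)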
Pythonistas:
Suppose you want to parse the following string using Pyparsing:
'ABC_123_SPEED_X 123'
where ABC_123 is an identifier, SPEED_X is a parameter, and 123 is a value. I thought of the following BNF using Pyparsing:
Identifier = Word( alphanums + '_' )
Parameter = Keyword('SPEED_X') | Keyword('SPEED_Y') | Keyword('SPEED_Z')
Value = # assume I already have an expression valid for any value
Entry = Identifier + Literal('_') + Parameter + Value
tokens = Entry.parseString('ABC_123_SPEED_X 123')
#Error: pyparsing.ParseException: Expected "_" (at char 16), (line:1, col:17)
If I remove the underscore from the middle (and adjust the Entry definition accordingly) it parses correctly.
How can I make this parser a bit lazier, so that it waits until it matches the Keyword (as opposed to slurping the entire string as an Identifier and waiting for the '_', which does not exist)?
Thank you.
[Note: This is a complete rewrite of my question; I had not realized what the real problem was]
I based my answer off of this one, since what you're trying to do is get a non-greedy match. It seems like this is difficult to make happen in pyparsing, but not impossible with some cleverness and compromise. The following seems to work:
from pyparsing import *
Parameter = Literal('SPEED_X') | Literal('SPEED_Y') | Literal('SPEED_Z')
UndParam = Suppress('_') + Parameter
Identifier = SkipTo(UndParam)
Value = Word(nums)
Entry = Identifier + UndParam + Value
When we run this from the interactive interpreter, we can see the following:
>>> Entry.parseString('ABC_123_SPEED_X 123')
(['ABC_123', 'SPEED_X', '123'], {})
Note that this is a compromise; because I use SkipTo, the Identifier can be full of evil, disgusting characters, not just beautiful alphanums with the occasional underscore.
EDIT: Thanks to Paul McGuire, we can concoct a truly elegant solution by setting Identifier to the following:
Identifier = Combine(Word(alphanums) +
ZeroOrMore('_' + ~Parameter + Word(alphanums)))
Let's inspect how this works. First, ignore the outer Combine; we'll get to this later. Starting with Word(alphanums), we know we'll get the 'ABC' part of the reference string, 'ABC_123_SPEED_X 123'. It's important to note that we didn't allow the "word" to contain underscores in this case; we build that separately into the logic.
Next, we need to capture the '_123' part without also sucking in '_SPEED_X'. Let's also skip over ZeroOrMore at this point and return to it later. We start with the underscore as a Literal, but we can shortcut with just '_', which will get us the leading underscore, but not all of '_123'. Instinctively, we would place another Word(alphanums) to capture the rest, but that's exactly what will get us in trouble by consuming all of the remaining '_123_SPEED_X'. Instead, we say, "So long as what follows the underscore is not the Parameter, parse that as part of my Identifier." We state that in pyparsing terms as '_' + ~Parameter + Word(alphanums). Since we assume we can have an arbitrary number of underscore + WordButNotParameter repeats, we wrap that expression in a ZeroOrMore construct. (If you always expect at least one underscore + WordButNotParameter following the initial word, you can use OneOrMore.)
Finally, we need to wrap the initial Word and the special underscore + Word repeats together so that it's understood they are contiguous, not separated by whitespace, so we wrap the whole expression up in a Combine construct. This way 'ABC _123_SPEED_X' will raise a parse error, but 'ABC_123_SPEED_X' will parse correctly.
Note also that I had to change Keyword to Literal because the ways of the former are far too subtle and quick to anger. I do not trust Keywords, nor could I get matching with them.
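Assembled into a runnable whole, the elegant version would look something like this (my arrangement of the pieces above):

from pyparsing import (Word, alphanums, nums, Literal, Combine,
                       ZeroOrMore, Suppress)

Parameter = Literal('SPEED_X') | Literal('SPEED_Y') | Literal('SPEED_Z')
Identifier = Combine(Word(alphanums) +
                     ZeroOrMore('_' + ~Parameter + Word(alphanums)))
Value = Word(nums)
Entry = Identifier + Suppress('_') + Parameter + Value

print(Entry.parseString('ABC_123_SPEED_X 123'))
# -> ['ABC_123', 'SPEED_X', '123']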
If you are sure that the identifier never ends with an underscore, you can enforce it in the definition:
from pyparsing import *
my_string = 'ABC_123_SPEED_X 123'
Identifier = Combine(Word(alphanums) + Literal('_') + Word(alphanums))
Parameter = Literal('SPEED_X') | Literal('SPEED_Y') | Literal('SPEED_Z')
Value = Word(nums)
Entry = Identifier + Literal('_').suppress() + Parameter + Value
tokens = Entry.parseString(my_string)
print(tokens)  # prints: ['ABC_123', 'SPEED_X', '123']
If it's not the case but if the identifier length is fixed you can define Identifier like this:
Identifier = Word( alphanums + '_' , exact=7)
You can also parse the identifier and parameter as one token, and split them in a parse action:
from pyparsing import *
import re
def split_ident_and_param(tokens):
mo = re.match(r"^(.*?_.*?)_(.*?_.*?)$", tokens[0])
return [mo.group(1), mo.group(2)]
ident_and_param = Word(alphanums + "_").setParseAction(split_ident_and_param)
value = Word(nums)
entry = ident_and_param + value
print(entry.parseString("APC_123_SPEED_X 123"))
The example above assumes that the identifiers and parameters always have the format XXX_YYY (containing one single underscore).
If this is not the case, you need to adjust the split_ident_and_param() method.
This answers a question that you probably have also asked yourself: "What's a real-world application for reduce?":
>>> from functools import reduce  # needed on Python 3
>>> keys = ['CAT', 'DOG', 'HORSE', 'DEER', 'RHINOCEROS']
>>> p = reduce(lambda x, y: x | y, [Keyword(x) for x in keys])
>>> p
{{{{"CAT" | "DOG"} | "HORSE"} | "DEER"} | "RHINOCEROS"}
Edit:
This was a pretty good answer to the original question. I'll have to work on the new one.
Further edit:
I'm pretty sure you can't do what you're trying to do. The parser that pyparsing creates doesn't do lookahead. So if you tell it to match Word(alphanums + '_'), it's going to keep matching characters until it finds one that's not a letter, number, or underscore.
I'm trying to parse words which can be broken up over multiple lines with a backslash-newline combination ("\\n") using pyparsing. Here's what I have done:
from pyparsing import *
continued_ending = Literal('\\') + lineEnd
word = Word(alphas)
split_word = word + Suppress(continued_ending)
multi_line_word = Forward()
multi_line_word << (word | (split_word + multi_line_word))
print(multi_line_word.parseString(
    '''super\\
cali\\
fragi\\
listic'''))
The output I get is ['super'], while the expected output is ['super', 'cali', 'fragi', 'listic']. Better still would be all of them joined as one word (which I think I can just do with multi_line_word.setParseAction(lambda t: ''.join(t))).
I tried looking at this code in pyparsing helper, but it gives me an error: maximum recursion depth exceeded.
EDIT 2009-11-15: I realized later that pyparsing is a little generous with regard to whitespace, and that led to some poor assumptions: what I thought I was parsing for was a lot looser than intended. That is to say, we want to see no whitespace between any of the portions of the word, the escape, and the EOL character.
I realized that the little example string above is insufficient as a test case, so I wrote the following unit tests. Code that passes these tests should be able to match what I intuitively think of as an escape-split word—and only an escape-split word. It will not match a basic word that is not escape-split. We can—and I believe should—use a different grammatical construct for that, which keeps it all tidy by having the two separate.
import unittest
import pyparsing
# Assumes you named your module 'multiline.py'
import multiline
class MultiLineTests(unittest.TestCase):
def test_continued_ending(self):
case = '\\\n'
expected = ['\\', '\n']
result = multiline.continued_ending.parseString(case).asList()
self.assertEqual(result, expected)
def test_continued_ending_space_between_parse_error(self):
case = '\\ \n'
self.assertRaises(
pyparsing.ParseException,
multiline.continued_ending.parseString,
case
)
def test_split_word(self):
cases = ('shiny\\', 'shiny\\\n', ' shiny\\')
expected = ['shiny']
for case in cases:
result = multiline.split_word.parseString(case).asList()
self.assertEqual(result, expected)
def test_split_word_no_escape_parse_error(self):
case = 'shiny'
self.assertRaises(
pyparsing.ParseException,
multiline.split_word.parseString,
case
)
def test_split_word_space_parse_error(self):
cases = ('shiny \\', 'shiny\r\\', 'shiny\t\\', 'shiny\\ ')
for case in cases:
self.assertRaises(
pyparsing.ParseException,
multiline.split_word.parseString,
case
)
def test_multi_line_word(self):
cases = (
'shiny\\',
'shi\\\nny',
'sh\\\ni\\\nny\\\n',
' shi\\\nny\\',
'shi\\\nny ',
'shi\\\nny captain'
)
expected = ['shiny']
for case in cases:
result = multiline.multi_line_word.parseString(case).asList()
self.assertEqual(result, expected)
def test_multi_line_word_spaces_parse_error(self):
cases = (
'shi \\\nny',
'shi\\ \nny',
'sh\\\n iny',
'shi\\\n\tny',
)
for case in cases:
self.assertRaises(
pyparsing.ParseException,
multiline.multi_line_word.parseString,
case
)
if __name__ == '__main__':
unittest.main()
After poking around for a bit more, I came upon this help thread where there was this notable bit:
I often see inefficient grammars when someone implements a pyparsing grammar directly from a BNF definition. BNF does not have a concept of "one or more" or "zero or more" or "optional"...
With that, I got the idea to change these two lines
multi_line_word = Forward()
multi_line_word << (word | (split_word + multi_line_word))
To
multi_line_word = ZeroOrMore(split_word) + word
This got it to output what I was looking for: ['super', 'cali', 'fragi', 'listic'].
Next, I added a parse action that would join these tokens together:
multi_line_word.setParseAction(lambda t: ''.join(t))
This gives a final output of ['supercalifragilistic'].
The take home message I learned is that one doesn't simply walk into Mordor.
Just kidding.
The take home message is that one can't simply implement a one-to-one translation of BNF with pyparsing. Some tricks with using the iterative types should be called into use.
EDIT 2009-11-25: To compensate for the more strenuous test cases, I modified the code to the following:
no_space = NotAny(White(' \t\r'))
# make sure that the EOL immediately follows the escape backslash
continued_ending = Literal('\\') + no_space + lineEnd
word = Word(alphas)
# make sure that the escape backslash immediately follows the word
split_word = word + NotAny(White()) + Suppress(continued_ending)
multi_line_word = OneOrMore(split_word + NotAny(White())) + Optional(word)
multi_line_word.setParseAction(lambda t: ''.join(t))
This has the benefit of making sure that no space comes between any of the elements (with the exception of newlines after the escaping backslashes).
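A quick check against the original example string:

print(multi_line_word.parseString('super\\\ncali\\\nfragi\\\nlistic'))
# -> ['supercalifragilistic']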
You are pretty close with your code. Any of these mods would work:
# '|' means MatchFirst, so you had a left-recursive expression
# reversing the order of the alternatives makes this work
multi_line_word << ((split_word + multi_line_word) | word)
# '^' means Or/MatchLongest, but beware using this inside a Forward
multi_line_word << (word ^ (split_word + multi_line_word))
# an unusual use of delimitedList, but it works
multi_line_word = delimitedList(word, continued_ending)
# in place of your parse action, you can wrap in a Combine
multi_line_word = Combine(delimitedList(word, continued_ending))
As you found in your pyparsing googling, BNF->pyparsing translations should be done with a special view to using pyparsing features in place of BNF, um, shortcomings. I was actually in the middle of composing a longer answer, going into more of the BNF translation issues, but you have already found this material (on the wiki, I assume).