I have config files structured in a simplified C-like syntax, e.g.:
Main { /* some comments */
    VariableName1 = VariableValue1;
    VariableName2 = VariableValue2;
    SubSection {
        VariableName1 = VariableValue1; // inline comment
        VariableName2 = VariableValue2;
    }
    VariableName3 = "StringValue4";
}
Sections may be recursively nested.
How can I parse that file into a dict in a clean and "pythonish" way?
[EDIT]
OK, I've found the pyparsing module :) but maybe someone can tell me how to do this without it.
[EDIT2]
Out of curiosity, I'd like to know, for future reference, how to write this (I think) simple task by hand.
Use a parser like SimpleParse; just feed it the EBNF definition.
You do have the format documented in some sort of BNF, don't you? If not, you can tell the next genius inventing another config format instead of using JSON, XML or YAML that he is not authorized to reinvent the wheel unless he can specify the syntax using EBNF.
It may take some time to write a grammar if you are not familiar with EBNF, but it pays off. It will make your code well documented, rock solid and easier to maintain.
See the Python wiki on Language Parsing for other options.
If you try to pull some stunt using str.split or regular expressions, every other developer who has to maintain this piece of code will curse you.
UPDATE:
It just occurred to me that if you replace each SectionName with SectionName :, replace ; with , and enclose the main section in a pair of curly braces, this format will likely be valid JSON (a rough sketch of this transformation follows the grammar below).
"Name" = JSON Grammar
"Author" = Arsène von Wyss
"Version" = 1.0
"About" = 'Grammar for JSON data, following http://www.json.org/'
! and compliant with http://www.ietf.org/rfc/rfc4627
"Start Symbol" = <Json>
"Case Sensitive" = True
"Character Mapping" = 'Unicode'
! ------------------------------------------------- Sets
{Unescaped} = {All Valid} - {&1 .. &19} - ["\]
{Hex} = {Digit} + [ABCDEFabcdef]
{Digit9} = {Digit} - [0]
! ------------------------------------------------- Terminals
Number = '-'?('0'|{Digit9}{Digit}*)('.'{Digit}+)?([Ee][+-]?{Digit}+)?
String = '"'({Unescaped}|'\'(["\/bfnrt]|'u'{Hex}{Hex}{Hex}{Hex}))*'"'
! ------------------------------------------------- Rules
<Json> ::= <Object>
| <Array>
<Object> ::= '{' '}'
| '{' <Members> '}'
<Members> ::= <Pair>
| <Pair> ',' <Members>
<Pair> ::= String ':' <Value>
<Array> ::= '[' ']'
| '[' <Elements> ']'
<Elements> ::= <Value>
| <Value> ',' <Elements>
<Value> ::= String
| Number
| <Object>
| <Array>
| true
| false
| null
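To illustrate that observation, here is a rough, hypothetical sketch of the textual transformation in Python. The regexes are mine, assume simple well-formed input like the sample above, and are not a substitute for a real parser:

import json
import re

def config_to_json(text):
    # Strip /* ... */ and // comments first.
    text = re.sub(r'/\*.*?\*/', '', text, flags=re.DOTALL)
    text = re.sub(r'//[^\n]*', '', text)
    # SectionName { -> "SectionName": {
    text = re.sub(r'(\w+)\s*{', r'"\1": {', text)
    # Name = -> "Name":
    text = re.sub(r'(\w+)\s*=\s*', r'"\1": ', text)
    # value; -> "value",  (quote bare values, keep quoted ones)
    text = re.sub(r'([^\s{}:,"]+)\s*;', r'"\1",', text)
    text = re.sub(r'("[^"]*")\s*;', r'\1,', text)
    # Drop trailing commas and insert commas between sibling entries.
    text = re.sub(r',\s*}', '}', text)
    text = re.sub(r'}\s*(?=")', '}, ', text)
    return '{' + text + '}'

sample = '''Main { /* comment */
    A = B;
    Sub { X = "y"; }
    C = D;
}'''
print(json.loads(config_to_json(sample)))
# -> {'Main': {'A': 'B', 'Sub': {'X': 'y'}, 'C': 'D'}}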
You need to parse it recursively, using this Backus-Naur form, starting from PARSE:
PARSE: '{' VARASSIGN VARASSIGN [PARSE [VARASSIGN]] '}'
VARASSIGN : VARIABLENAME '=' '"' STRING '"'
VARIABLENAME: STRING
STRING: [[:alpha:]][[:alnum:]]*
Because your structure is simple, you can use a predictive LL(1) parser.
1) Write a tokenizer, i.e. a function that will scan the character stream and turn it into a list of Identifier, OpeningBrace, ClosingBrace, EqualSign and SemiColon tokens; comments and spaces are discarded. This can be done with regular expressions (see the sketch after these steps).
2) Write a simple parser. Skip the first Identifier and OpeningBrace.
The parser expects an Identifier followed by one of EqualSign or OpeningBrace, or a ClosingBrace.
2.1) If EqualSign, must be followed by Identifier and SemiColon.
2.2) If OpeningBrace, invoke the parser recursively.
2.3) If ClosingBrace, return from the recursive call.
In the processing of 2.1, enter the desired data into the dict, the way you like. You could prefix identifiers with the names of the enclosing blocks, e.g.
{"Main.SubSection.VariableName1": VariableValue1}
Here is prototype code for the parser, to be called after the tokenizer. It scans a string where a letter stands for an identifier, separators are ={}; and the last token must be a $.
def Parse(String, Prefix="", Nest=0):
    global Cursor
    if Nest == 0:
        Cursor = 0
    # Scan the input string
    while String[Cursor + 0].isalpha():
        # Identifier, starts an Assignment or a Block (Id |)
        if String[Cursor + 1] == "=":
            # Assignment, lookup (Id= | Id;)
            if String[Cursor + 2].isalpha():
                if String[Cursor + 3] == ";":
                    # Accept the assignment (Id=Id; |)
                    print Nest * " " + Prefix + String[Cursor] + "=" + String[Cursor + 2] + ";"
                    Cursor += 4
        elif String[Cursor + 1] == "{":
            # Block, lookup (Id{ | )
            print Nest * " " + String[Cursor] + "{"
            Cursor += 2
            # Recurse
            Parse(String, Prefix + String[Cursor - 2] + "::", Nest + 4)
        else:
            # Unexpected token
            break
    if String[Cursor + 0] == "}":
        # Block complete, (Id{...} |)
        print (Nest - 4) * " " + "}"
        Cursor += 1
        return
    if Nest == 0 and String[Cursor + 0] == "$":
        # Done
        return
    print "Syntax error at", String[Cursor:], ":("

Parse("C{D=E;X{Y=Z;}F=G;}H=I;A=B;$")
When executed, it outputs:
C{
    C::D=E;
    X{
        C::X::Y=Z;
    }
    C::F=G;
}
H=I;
A=B;
which shows that it did detect the nesting. Replace the print statements with whatever processing you like.
You could use pyparsing to write a parser for this format.
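For example, a minimal sketch along these lines; the grammar names here are mine, and the comment and value handling is only roughly matched to the sample format above:

import pyparsing as pp

LBRACE, RBRACE, EQ, SEMI = map(pp.Suppress, "{}=;")
name = pp.Word(pp.alphas + "_", pp.alphanums + "_")
value = pp.quotedString | pp.Word(pp.alphanums + "_.")

section = pp.Forward()
assignment = pp.Group(name + EQ + value + SEMI)
section <<= pp.Group(name + LBRACE + pp.ZeroOrMore(assignment | section) + RBRACE)
section.ignore(pp.cppStyleComment)  # handles both /* ... */ and // comments

result = section.parseString('''
Main { /* some comments */
    VariableName1 = VariableValue1;
    SubSection {
        VariableName2 = VariableValue2; // inline comment
    }
    VariableName3 = "StringValue4";
}
''')
print(result.asList())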
I'm trying to create a grammar to parse some Excel-like formulas I have devised, where a special character at the beginning of a string signifies a different source. For example, $ can signify a string, so "$This is text" would be treated as a string input in the program, and & can signify a function, so &foo() would be treated as a call to the internal function foo.
The problem I'm facing is how to construct the grammar properly. This is a simplified version as an MWE:
grammar = r'''start: instruction
?instruction: simple
| func
STARTSYMBOL: "!"|"#"|"$"|"&"|"~"
SINGLESTR: (LETTER+|DIGIT+|"_"|" ")*
simple: STARTSYMBOL [SINGLESTR] (WORDSEP SINGLESTR)*
ARGSEP: ",," // argument separator
WORDSEP: "," // word separator
CONDSEP: ";;" // condition separator
STAR: "*"
func: STARTSYMBOL SINGLESTR "(" [simple|func] (ARGSEP simple|func)* ")"
%import common.LETTER
%import common.WORD
%import common.DIGIT
%ignore ARGSEP
%ignore WORDSEP
'''
parser = lark.Lark(grammar, parser='earley')
So, with this grammar, things like $This is a string, &foo(), &foo(#arg1), &foo($arg1,,#arg2) and &foo(!w1,w2,w3,,!w4,w5,w6) are all parsed as expected. But if I'd like to add more flexibility to my simple terminal, then I need to start fiddling around with the SINGLESTR token definition, which is not convenient.
What have I tried
The part that I cannot get past is that if I want to have a string including parentheses (which are literals of func), then I cannot handle them in my current situation.
If I add the parentheses in SINGLESTR, then I get Expected STARTSYMBOL, because it's getting mixed up with the func definition and it thinks that a function argument should be passed, which makes sense.
If I redefine the grammar to reserve the ampersand symbol for functions only and add the parentheses in SINGLESTR, then I can parse a string with parentheses, but every function I'm trying to parse gives Expected LPAR.
My intent is that anything starting with a $ would be parsed as a SINGLESTR token and then I could parse things like &foo($first arg (has) parentheses,,$second arg).
My solution, for now, is that I'm using 'escape' words like LEFTPAR and RIGHTPAR in my strings and I've written helper functions to change those into parentheses when I process the tree. So, $This is a LEFTPARtestRIGHTPAR produces the correct tree and when I process it, then this gets translated to This is a (test).
To formulate a general question: Can I define my grammar in such a way that some characters that are special to the grammar are treated as normal characters in some situations and as special in any other case?
EDIT 1
Based on a comment from jbndlr, I revised my grammar to create individual modes based on the start symbol:
grammar = r'''start: instruction
?instruction: simple
| func
SINGLESTR: (LETTER+|DIGIT+|"_"|" ") (LETTER+|DIGIT+|"_"|" "|"("|")")*
FUNCNAME: (LETTER+) (LETTER+|DIGIT+|"_")* // no parentheses allowed in the func name
DB: "!" SINGLESTR (WORDSEP SINGLESTR)*
TEXT: "$" SINGLESTR
MD: "#" SINGLESTR
simple: TEXT|DB|MD
ARGSEP: ",," // argument separator
WORDSEP: "," // word separator
CONDSEP: ";;" // condition separator
STAR: "*"
func: "&" FUNCNAME "(" [simple|func] (ARGSEP simple|func)* ")"
%import common.LETTER
%import common.WORD
%import common.DIGIT
%ignore ARGSEP
%ignore WORDSEP
'''
This falls (somewhat) under my second test case. I can parse all the simple types of strings (TEXT, MD or DB tokens that can contain parentheses) and functions that are empty; for example, &foo() or &foo(&bar()) parse correctly. The moment I put an argument within a function (no matter which type), I get an UnexpectedEOF Error: Expected ampersand, RPAR or ARGSEP. As a proof of concept, if I remove the parentheses from the definition of SINGLESTR in the new grammar above, then everything works as it should, but I'm back to square one.
import lark
grammar = r'''start: instruction
?instruction: simple
| func
MIDTEXTRPAR: /\)+(?!(\)|,,|$))/
SINGLESTR: (LETTER+|DIGIT+|"_"|" ") (LETTER+|DIGIT+|"_"|" "|"("|MIDTEXTRPAR)*
FUNCNAME: (LETTER+) (LETTER+|DIGIT+|"_")* // no parentheses allowed in the func name
DB: "!" SINGLESTR (WORDSEP SINGLESTR)*
TEXT: "$" SINGLESTR
MD: "#" SINGLESTR
simple: TEXT|DB|MD
ARGSEP: ",," // argument separator
WORDSEP: "," // word separator
CONDSEP: ";;" // condition separator
STAR: "*"
func: "&" FUNCNAME "(" [simple|func] (ARGSEP simple|func)* ")"
%import common.LETTER
%import common.WORD
%import common.DIGIT
%ignore ARGSEP
%ignore WORDSEP
'''
parser = lark.Lark(grammar, parser='earley')
parser.parse("&foo($first arg (has) parentheses,,$second arg)")
Output:
Tree(start, [Tree(func, [Token(FUNCNAME, 'foo'), Tree(simple, [Token(TEXT, '$first arg (has) parentheses')]), Token(ARGSEP, ',,'), Tree(simple, [Token(TEXT, '$second arg')])])])
I hope it's what you were looking for.
Those have been a crazy few days. I tried lark and failed. I also tried parsimonious and pyparsing. All of these different parsers had the same problem: the 'argument' token consumed the right parenthesis that was part of the function, eventually failing because the function's parentheses weren't closed.
The trick was to figure out how to define a right parenthesis that's "not special". See the regular expression for MIDTEXTRPAR in the code above. I defined it as a right parenthesis that is not followed by argument separation or by the end of the string. I did that using the regular-expression extension (?!...), which matches only if it's not followed by ... but doesn't consume any characters. Luckily, it even allows matching the end of the string inside this extension.
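A tiny standalone demo of that lookahead (my example strings, not from the question):

import re

# A ')' counts as ordinary text only when it is not followed by another ')',
# an argument separator ',,', or the end of the string.
MIDTEXTRPAR = re.compile(r'\)+(?!(\)|,,|$))')
print([m.group(0) for m in MIDTEXTRPAR.finditer('$first arg (has) parentheses')])  # [')'] - text-like
print([m.group(0) for m in MIDTEXTRPAR.finditer('&foo($arg),,$next')])             # []    - structural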
EDIT:
The above-mentioned method only works if you don't have an argument ending with a ), because then the MIDTEXTRPAR regular expression won't catch that ) and will think it's the end of the function even though there are more arguments to process. Also, there may be ambiguities such as ...asdf),,...: it may be the end of a function declaration inside an argument, or a 'text-like' ) inside an argument while the function declaration goes on.
This problem is related to the fact that what you describe in your question is not a context-free grammar (https://en.wikipedia.org/wiki/Context-free_grammar), for which parsers such as lark exist. Instead, it is a context-sensitive grammar (https://en.wikipedia.org/wiki/Context-sensitive_grammar).
The reason it is context-sensitive is that you need the parser to 'remember' that it is nested inside a function, and how many levels of nesting there are, and to have this memory available inside the grammar's syntax in some way.
EDIT2:
Also take a look at the following parser, which is context-sensitive and seems to solve the problem, but has exponential time complexity in the number of nested functions, as it tries all possible function boundaries until it finds a split that works. I believe it has to have exponential complexity since the language is not context-free.
_funcPrefix = '&'
_debug = False

class ParseException(Exception):
    pass

def GetRecursive(c):
    if isinstance(c, ParserBase):
        return c.GetRecursive()
    else:
        return c

class ParserBase:
    def __str__(self):
        return type(self).__name__ + ": [" + ','.join(str(x) for x in self.contents) + "]"
    def GetRecursive(self):
        return (type(self).__name__, [GetRecursive(c) for c in self.contents])

class Simple(ParserBase):
    def __init__(self, s):
        self.contents = [s]

class MD(Simple):
    pass

class DB(ParserBase):
    def __init__(self, s):
        self.contents = s.split(',')

class Func(ParserBase):
    def __init__(self, s):
        if s[-1] != ')':
            raise ParseException("Can't find right parenthesis: '%s'" % s)
        lparInd = s.find('(')
        if lparInd < 0:
            raise ParseException("Can't find left parenthesis: '%s'" % s)
        self.contents = [s[:lparInd]]
        argsStr = s[(lparInd + 1):-1]
        args = list(argsStr.split(',,'))
        i = 0
        while i < len(args):
            a = args[i]
            if a[0] != _funcPrefix:
                self.contents.append(Parse(a))
                i += 1
            else:
                j = i + 1
                while j <= len(args):
                    nestedFunc = ',,'.join(args[i:j])
                    if _debug:
                        print(nestedFunc)
                    try:
                        self.contents.append(Parse(nestedFunc))
                        break
                    except ParseException as PE:
                        if _debug:
                            print(PE)
                        j += 1
                if j > len(args):
                    raise ParseException("Can't parse nested function: '%s'" % (',,'.join(args[i:])))
                i = j

def Parse(arg):
    if arg[0] not in _starterSymbols:
        raise ParseException("Bad prefix: " + arg[0])
    return _starterSymbols[arg[0]](arg[1:])

_starterSymbols = {_funcPrefix: Func, '$': Simple, '!': DB, '#': MD}

P = Parse("&foo($first arg (has)) parentheses,,&f($asdf,,&nested2($23423))),,&second(!arg,wer))")
print(P)

import pprint
pprint.pprint(P.GetRecursive())
The problem is that the arguments of a function are enclosed in parentheses, where one of the arguments may itself contain parentheses.
One possible solution is to use a backslash \ before ( or ) when it is part of a string:
SINGLESTR: (LETTER+|DIGIT+|"_"|" ") (LETTER+|DIGIT+|"_"|" "|"\("|"\)")*
A similar solution is used by C to include a double quote (") as part of a string constant, where the string constant itself is enclosed in double quotes.
example_string1='&f(!g\()'
example_string2='&f(#g)'
print(parser.parse(example_string1).pretty())
print(parser.parse(example_string2).pretty())
Output is:
start
  func
    f
    simple  !g\(
start
  func
    f
    simple  #g
I am using LALR(1) parsing from the lark-parser library. I have written a grammar to parse an ORM-like language. An example of my language is pasted below:
Table1
    .join(table=Table2, left_on=[column1], right_on=[column_2])
    .group_by(col=[column1], agg=[sum])
    .join(table=Table3, left_on=[column1], right_on=[column_3])
    .some_column
My grammar is:
start: [CNAME (object)*]
object: "." (CNAME|operation)
operation: [(join|group) (object)*]
join: "join" "(" [(join_args ",")* join_args] ")"
join_args: "table" "=" CNAME
| "left_on" "=" list
| "right_on" "=" list
group: "group_by" "(" [(group_args ",")* group_args] ")"
group_args: "col" "=" list
| "agg" "=" list
list: "[" [CNAME ("," CNAME)*] "]"
%import common.CNAME //# Variable name declaration
%import common.WS //# White space declaration
%ignore WS
When I parse the language, it gets parsed correctly, but I get a shift-reduce conflict warning. I believe this is due to a collision at object: "." (CNAME|operation), but I may be wrong. Is there another way to write this grammar?
I think you should replace
operation: [(join|group) (object)*]
with just
operation: join | group
You've already allowed repetition of object in
start: [CNAME (object)*]
So also allowing object* at the end of operation is ambiguous, leading to the conflict.
Personally, I would have gone for something like:
start : [ CNAME ("." qualifier)* ]
qualifier: CNAME | join | group
Because I don't see the point of object. But that's just a minor style difference.
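For illustration, here is a runnable sketch wiring that suggestion into the rest of the question's grammar; this assembly is mine and has not been tested against the full language:

import lark

GRAMMAR = '''
start: [CNAME ("." qualifier)*]
qualifier: CNAME | join | group
join: "join" "(" [(join_args ",")* join_args] ")"
join_args: "table" "=" CNAME
         | "left_on" "=" list
         | "right_on" "=" list
group: "group_by" "(" [(group_args ",")* group_args] ")"
group_args: "col" "=" list
          | "agg" "=" list
list: "[" [CNAME ("," CNAME)*] "]"
%import common.CNAME
%import common.WS
%ignore WS
'''

# With object/operation collapsed into a single qualifier rule, the
# repetition lives in one place, which should avoid the ambiguity.
parser = lark.Lark(GRAMMAR, parser='lalr')
tree = parser.parse('Table1.join(table=Table2, left_on=[column1], right_on=[column_2]).some_column')
print(tree.pretty())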
Consider this simple test of the Python Lark parser:
from lark import Lark

GRAMMAR = '''
start: container*
container: string ":" "{" (container | attribute | attribute_value)* "}"
attribute: attribute_name "=" (attribute_value | container)
attribute_value: string ":" _value ("," _value)*
_value: number | string
attribute_name: /[A-Za-z_][A-Za-z_#0-9]*/
string: /[A-Za-z_#0-9]+/
number: /[0-9]+/
%import common.WS
%ignore WS
'''
data = '''outer : {
    inner : {
    }
}'''
parser = Lark(GRAMMAR, parser='lalr')
parser.parse(data)
This works with parser='earley' but it fails with parser='lalr'. I don't understand why. The error message is:
UnexpectedCharacters: No terminal defined for '{' at line 2 col 12
inner : {
This is just an MWE. My actual grammar suffers from the same problem.
The reason this fails with LALR is that it has a lookahead of 1 (unlike Earley, which has unlimited lookahead), and it gets confused between attribute_name and string. Once it matches one or the other (in this case, attribute_name), it's impossible for it to backtrack and match a different rule.
If you use a lower priority for the attribute_name terminal, it will work. For example:
attribute_name: ATTR
ATTR.0: /[A-Za-z_][A-Za-z_#0-9]*/
But the recommended practice is to use the same terminal for both, if possible, so that the parser can do the thinking for you, instead of the lexer. You can add extra validation, if that's required, after the parsing is done.
Both approaches (changing priority or merging the terminals) will solve your problem.
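For illustration, a sketch of how the merged-terminal variant might look; this is my rendering, using the broader of the two original regexes for the shared NAME terminal, and dropping the separate number rule (NAME's pattern subsumes digit-only input, so numeric values can be re-typed in post-parse validation, as suggested above):

from lark import Lark

GRAMMAR = '''
start: container*
container: string ":" "{" (container | attribute | attribute_value)* "}"
attribute: attribute_name "=" (attribute_value | container)
attribute_value: string ":" _value ("," _value)*
_value: string
attribute_name: NAME
string: NAME
NAME: /[A-Za-z_#0-9]+/
%import common.WS
%ignore WS
'''

# Both rules now share one terminal, so the lexer never has to choose;
# the parser decides from context (':' vs '=') which rule applies.
parser = Lark(GRAMMAR, parser='lalr')
print(parser.parse('outer : { inner : { } }').pretty())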
I ran into a problem when working with ANTLR4 and parsing with its Python library.
The grammar:
grammar SimpleCode;
program : 'class' ' ' 'Program' ' ' '{' field_decl* method_decl* '}' ;
field_decl : DATA_TYPE variable (',' variable)* ';' ;
method_decl: (DATA_TYPE | 'void') identifier '(' method_params? ')' block ;
variable : identifier | identifier '[' int_literal ']' ;
method_params : DATA_TYPE identifier (',' DATA_TYPE identifier)* ;
block : '{' var_decl* statement* '}' ;
var_decl : DATA_TYPE identifier (',' identifier)* ';';
statement : location assign_op expr ';' | method_call ';' | 'if' '(' (expr) ')' block ('else' block)? | 'for' identifier '=' (expr) ',' (expr) block | 'return' (expr)? ';' | 'break' ';' | 'continue' ';' | block ;
assign_op : '=' | '+=' | '-=' ;
method_call : method_name '(' method_call_params? ')' | 'callout' (string_literal (',' callout_arg (',' callout_arg)*)?) ;
method_call_params : DATA_TYPE identifier (',' DATA_TYPE identifier)* ;
method_name : identifier ;
location : identifier | identifier '[' expr ']' ;
expr : location | method_call | literal | expr bin_op expr | '-' expr | '!' expr | '(' expr ')' ;
callout_arg : expr | string_literal ;
bin_op : arith_op | rel_op | eq_op | cond_op ;
arith_op : '+' | '-' | '*' | '/' | '%' ;
rel_op : '<' | '>' | '<=' | '>=' ;
eq_op : '==' | '!=' ;
cond_op : '&&' | '||' ;
literal : int_literal | char_literal | bool_literal ;
identifier : ALPHA alpha_num* ;
alpha_num : ALPHA | DIGIT ;
int_literal : decimal_literal | hex_literal ;
decimal_literal : DIGIT DIGIT* ;
hex_literal : '0x' HEX_DIGIT HEX_DIGIT* ;
bool_literal : 'true' | 'false' ;
CHAR: . ;
char_literal : '\'' CHAR '\'' ;
string_literal : '"' CHAR* '"' ;
DATA_TYPE : INT | BOOL ;
INT : 'int' ;
BOOL : 'boolean' ;
ALPHA : [a-zA-Z] ;
DIGIT : [0-9] ;
HEX_DIGIT : [0-9a-fA-F] ;
White : [ \t]+ -> skip ;
Newline : ( '\r' '\n'? | '\n' ) -> skip ;
LineComment : '//' ~[\r\n]* -> skip ;
My Python code to parse it:
from antlr4 import *
from SimpleCodeLexer import SimpleCodeLexer
from SimpleCodeListener import SimpleCodeListener
from SimpleCodeParser import SimpleCodeParser
import sys

class SimpleCodePrintListener(SimpleCodeListener):
    def enterProgram(self, ctx):
        print(ctx.getText())
        print(ctx.toStringTree())
        # for child in ctx.getChildren():
        #     print(child.getText(), child.getSymbol())

def main():
    input_stream = FileStream('in.in')
    lexer = SimpleCodeLexer(input_stream)
    stream = CommonTokenStream(lexer)
    parser = SimpleCodeParser(stream)
    tree = parser.program()
    printer = SimpleCodePrintListener()
    walker = ParseTreeWalker()
    walker.walk(printer, tree)

if __name__ == '__main__':
    print('Starting parse....')
    main()
And the in.in file:
class Program {
    int main() {
        int v;
        v = 1;
        v = 'c';
        v = true;
        return 0;
    }
}
I got this error after running the Python code:
line 2:7 no viable alternative at input 'int '
The result of the first print is:
class Program {int main() {int v;v = 1;v = 'c';v = true;return 0; }}
([] class Program { int m a i n ( ) { int v ; v = 1 ; v = ' c ' ; v = true ; return 0 ; } })
I'm a newbie to ANTLR4, so are there any special cases to handle with lexers and tokens? After hours of searching on the internet, my best guess is that the main problem is that DATA_TYPE is used in many different places in the grammar.
When debugging issues like this, it often helps to print the token stream that's generated for the given input. You can do that by running grun with the option -tokens or by iterating over stream in your main function.
If you do that, you'll see that main is tokenized as a sequence of four CHAR tokens, whereas your identifier rule expects ALPHA tokens, not CHAR. So that's the immediate problem, but it's not the only problem in your code:
The first thing I noticed when I tried your code is that I got errors on the line breaks. The reason that this happens for me and not for you is (presumably) that you're using Windows line breaks (\r\n) and I'm not. Your lexer recognizes \r\n as a line break and skips it, but just \n is recognized as a CHAR.
Further, your handling of spaces is very confusing. Single spaces are their own tokens. They have to appear in certain places and can't appear anywhere else. However, multiple consecutive spaces are skipped. So something like int  main (with two spaces) would be an error because it would not detect a space between int and main. On the other hand, indenting a line with a single space would be an error because then the indentation would not be skipped.
Your identifiers are also wonky. Identifiers can contain spaces (as long as it's more than one), line breaks (as long as they're \r\n or you fix it, so that \n is skipped, too) or comments. So the following would be a single valid identifier (assuming you change the lexer, so that the letters are recognized as ALPHA instead of CHAR):
hel lo //comment
wor
ld
On the other hand maintarget would not be a valid identifier because it contains the keyword int.
Similarly, skipped tokens can also be used inside your integer literals and string literals. For string literals that means that "a  b" (with two spaces) is a valid string (which is fine) that only contains the characters a and b (which is not fine) because the double space gets skipped. On the other hand, " " (a single space) would be an invalid string because the space is recognized as a ' ' token, not a CHAR token. Also, if you fix your identifiers by making letters be recognized as ALPHA, they will no longer be valid inside strings. And "la//la" would be seen as an unclosed string literal because //la" would be seen as a comment.
All of these issues are related to how the lexer works, so let's go through that:
When turning a character stream into a token stream, the lexer processes the input according to the "maximal munch" rule: it goes through all of the lexer rules and checks which ones match at the beginning of the current input. Of those that match, it picks the one that produces the longest match. In case of ties it prefers the one that's defined first in the grammar. If you use string literals directly in parser rules, they are treated like lexer rules that are defined before any others.
So the fact that you have a CHAR: .; rule that comes before ALPHA, DIGIT and HEX_DIGIT means that these rules will never be matched. All of these rules match a single character, so when more than one of them matches, CHAR will be preferred because it comes first in the grammar. If you move CHAR to the end, letters will now be matched by ALPHA, decimal digits by DIGIT and everything else by CHAR. This still leaves HEX_DIGIT useless (and if you move it to the front, that would render ALPHA and DIGIT useless) and it also means that CHAR no longer does what you want because you want digits and letters to be seen as CHARs - but only inside strings.
The real problem here is that none of these things should be tokens. They should either be fragments or just be inlined directly into the lexer rules that use them. Instead your tokens should be anything inside of which you don't want to allow/ignore spaces or comments. So string literals, int literals and identifiers should all be tokens. The only instance where you have multiple lexer rules that could match the same input should be identifiers and keywords (where keywords take precedence over identifiers because you specify them as string literals in the grammar, but longer identifiers could still contain keywords as a substring because of the maximal munch rule).
You also should remove all uses of ' ' from your grammar and instead always skip spaces.
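For illustration, a sketch of what the lexer part might look like after that restructuring; this is my own hedged sketch, not tested against the full grammar:

// ALPHA, DIGIT and HEX_DIGIT become fragments: reusable pieces of other
// lexer rules that never produce tokens of their own.
fragment ALPHA : [a-zA-Z] ;
fragment DIGIT : [0-9] ;
fragment HEX_DIGIT : [0-9a-fA-F] ;
// Whole identifiers and literals become single tokens, so skipped spaces
// and comments can never appear inside them.
IDENTIFIER : ALPHA (ALPHA | DIGIT)* ;
DECIMAL_LITERAL : DIGIT+ ;
HEX_LITERAL : '0x' HEX_DIGIT+ ;
CHAR_LITERAL : '\'' ~['\r\n] '\'' ;
STRING_LITERAL : '"' ~["\r\n]* '"' ;
// Spaces are always skipped; keywords like 'int' used as string literals
// in parser rules still take precedence over IDENTIFIER.
White : [ \t\r\n]+ -> skip ;
LineComment : '//' ~[\r\n]* -> skip ;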
Using pyparsing, how can I match a keyword immediately before or after a special character (like "{" or "}")? The code below shows that my keyword "msg" is not matched unless it is preceded by whitespace (or at start):
import pyparsing as pp

openBrace = pp.Suppress(pp.Keyword("{"))
closeBrace = pp.Suppress(pp.Keyword("}"))
messageKw = pp.Keyword("msg")
messageExpr = pp.Forward()
messageExpr << messageKw + openBrace +\
    pp.ZeroOrMore(messageExpr) + closeBrace

try:
    result = messageExpr.parseString("msg { msg { } }")
    print result.dump(), "\n"
    result = messageExpr.parseString("msg {msg { } }")
    print result.dump()
except pp.ParseException as pe:
    print pe, "\n", "Text: ", pe.line
I'm sure there's a way to do this, but I have been unable to find it.
Thanks in advance
openBrace = pp.Suppress(pp.Keyword("{"))
closeBrace = pp.Suppress(pp.Keyword("}"))
should be:
openBrace = pp.Suppress(pp.Literal("{"))
closeBrace = pp.Suppress(pp.Literal("}"))
or even just:
openBrace = pp.Suppress("{")
closeBrace = pp.Suppress("}")
(Most pyparsing classes will auto-promote a string argument "arg" to Literal("arg").)
When I have parsers with many punctuation marks, rather than have a big ugly chunk of statements like this, I'll collapse them down to something like:
OBRACE, CBRACE, OPAR, CPAR, SEMI, COMMA = map(pp.Suppress, "{}();,")
The problem you are seeing is that Keyword looks at the immediately-surrounding characters, to make sure that the current string is not being accidentally matched when it is really embedded in a larger identifier-like string. In Keyword('{'), this will only work if there is no adjoining character that could be confused as part of a larger word. Since '{' itself is not really a typical keyword character, using Keyword('{') is not a good use of that class.
Only use Keyword for strings that could be misinterpreted as identifiers. For matching characters that are not in the set of typical keyword characters (by "keyword characters" I mean alphanumerics + '_'), use Literal.
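As a quick check of that advice, here is a minimal sketch of the question's parser with the braces switched to plain strings; the previously failing input now parses:

import pyparsing as pp

openBrace = pp.Suppress("{")   # plain strings auto-promote to Literal
closeBrace = pp.Suppress("}")
messageKw = pp.Keyword("msg")

messageExpr = pp.Forward()
messageExpr <<= messageKw + openBrace + pp.ZeroOrMore(messageExpr) + closeBrace

# "msg {msg { } }" now parses even without a space after the first '{'.
print(messageExpr.parseString("msg {msg { } }"))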