Lark parser grammar works with Earley but not LALR

Lark parser grammar works with Earley but not LALR - python

Consider this simple test of the Python Lark parser:
GRAMMAR = '''
start: container*
container: string ":" "{" (container | attribute | attribute_value)* "}"
attribute: attribute_name "=" (attribute_value | container)
attribute_value: string ":" _value ("," _value)*
_value: number | string
attribute_name: /[A-Za-z_][A-Za-z_#0-9]*/
string: /[A-Za-z_#0-9]+/
number: /[0-9]+/
%import common.WS
%ignore WS
'''
data = '''outer : {
inner : {
}
}'''
parser = Lark(GRAMMAR, parser='lalr')
parser.parse(data)
This works with parser='earley' but it fails with parser='lalr'. I don't understand why. The error message is:
UnexpectedCharacters: No terminal defined for '{' at line 2 col 12
inner : {
This is just an MWE. My actual grammar suffers from the same problem.

The reason this fails with LALR, is because it has a lookahead of 1 (unlike Earley, which has unlimited lookahead), and it gets confused between attribute_name and string. Once it matches one of the other (in this case, attribute_name), it's impossible for it to backtrack and match a different rule.
If you use a lower priority for the attribute_name terminal, it will work. For example:
attribute_name: ATTR
ATTR.0: /[A-Za-z_][A-Za-z_#0-9]*/
But the recommended practice is to use the same terminal for both, if possible, so that the parser can do the thinking for you, instead of the lexer. You can add extra validation, if that's required, after the parsing is done.
Both approaches (changing priority or merging the terminals) will solve your problem.

Related

define statement only works if followed by an assignment in Lark grammar

I am creating a parser with Lark. The parser works fine for most of the tests I ran, but failed with the define keyword. It only works if it is followed by an assignement. define a = 10 works just fine, but define b is not treated as a define statement.
Here is the Lark parser :
import lark
# ...
parser = lark.Lark("""
?start: statements
?statements: ((expr (";" | NEWLINE) | NEWLINE ) )* expr?
?expr: identifier | number | functioncall | define | assignment | function
?functioncall: identifier "(" arguments? ")"
?arguments: expr ("," expr)*
?define: "define" identifier ("=" expr)?
?assignment: identifier "=" expr
?function: "function" "(" parameters? ")" "->" identifier block
?parameters: identifier ("," identifier)*
?block: "{" statements "}"
?identifier: NAME -> identifier
?number: NUMBER -> number
%import common.NEWLINE
%import common.CNAME -> NAME
%import common.NUMBER
%import common.WS_INLINE
%ignore WS_INLINE
COMMENT: "/*" /(.|\n)+/x "*/" | "//" /.+/ NEWLINE?
%ignore COMMENT
""")
My tests :
tree = parser.parse("define a = 10")
assert(tree.data == "define") # OK
tree = parser.parse("define b")
assert(tree.data == "define") # NOT OK - tree.data is "identifier"
Specifically, parser.parse("define b") and parser.parse("b") give the exact same result. I would expect parser.parse("define b") to give a tree beginning with the define rule, but instead I have an identifier.

Sometime Lark parser doesn't clearly identifies a rule, for instance, define b and b gives Tree(identifier, [Token(NAME, 'b')]). To be able to distinguish the two, you need to force Lark to add a name to the rule, this can be done by adding -> name_of_rule at the end of a line in the parser definition. So for instance, the definition of the ?define rule should become:
?define: "define" identifier ( "=" expr )? -> define

How to setup a grammar that can handle ambiguity

I'm trying to create a grammar to parse some Excel-like formulas I have devised, where a special character in the beginning of a string signifies a different source. For example, $ can signify a string, so "$This is text" would be treated as a string input in the program and & can signify a function, so &foo() can be treated as a call to the internal function foo.
The problem I'm facing is how to construct the grammar properly. For example, This is a simplified version as a MWE:
grammar = r'''start: instruction
?instruction: simple
| func
STARTSYMBOL: "!"|"#"|"$"|"&"|"~"
SINGLESTR: (LETTER+|DIGIT+|"_"|" ")*
simple: STARTSYMBOL [SINGLESTR] (WORDSEP SINGLESTR)*
ARGSEP: ",," // argument separator
WORDSEP: "," // word separator
CONDSEP: ";;" // condition separator
STAR: "*"
func: STARTSYMBOL SINGLESTR "(" [simple|func] (ARGSEP simple|func)* ")"
%import common.LETTER
%import common.WORD
%import common.DIGIT
%ignore ARGSEP
%ignore WORDSEP
'''
parser = lark.Lark(grammar, parser='earley')
So, with this grammar, things like: $This is a string, &foo(), &foo(#arg1), &foo($arg1,,#arg2) and &foo(!w1,w2,w3,,!w4,w5,w6) are all parsed as expected. But if I'd like to add more flexibility to my simple terminal, then I need to start fiddling around with the SINGLESTR token definition which is not convenient.
What have I tried
The part that I cannot get past is that if I want to have a string including parentheses (which are literals of func), then I cannot handle them in my current situation.
If I add the parentheses in SINGLESTR, then I get Expected STARTSYMBOL, because it's getting mixed up with the func definition and it thinks that a function argument should be passed, which makes sense.
If I redefine the grammar to reserve the ampersand symbol for functions only and add the parentheses in SINGLESTR, then I can parse a string with parentheses, but every function I'm trying to parse gives Expected LPAR.
My intent is that anything starting with a $ would be parsed as a SINGLESTR token and then I could parse things like &foo($first arg (has) parentheses,,$second arg).
My solution, for now, is that I'm using 'escape' words like LEFTPAR and RIGHTPAR in my strings and I've written helper functions to change those into parentheses when I process the tree. So, $This is a LEFTPARtestRIGHTPAR produces the correct tree and when I process it, then this gets translated to This is a (test).
To formulate a general question: Can I define my grammar in such a way that some characters that are special to the grammar are treated as normal characters in some situations and as special in any other case?
EDIT 1
Based on a comment from jbndlr I revised my grammar to create individual modes based on the start symbol:
grammar = r'''start: instruction
?instruction: simple
| func
SINGLESTR: (LETTER+|DIGIT+|"_"|" ") (LETTER+|DIGIT+|"_"|" "|"("|")")*
FUNCNAME: (LETTER+) (LETTER+|DIGIT+|"_")* // no parentheses allowed in the func name
DB: "!" SINGLESTR (WORDSEP SINGLESTR)*
TEXT: "$" SINGLESTR
MD: "#" SINGLESTR
simple: TEXT|DB|MD
ARGSEP: ",," // argument separator
WORDSEP: "," // word separator
CONDSEP: ";;" // condition separator
STAR: "*"
func: "&" FUNCNAME "(" [simple|func] (ARGSEP simple|func)* ")"
%import common.LETTER
%import common.WORD
%import common.DIGIT
%ignore ARGSEP
%ignore WORDSEP
'''
This falls (somewhat) under my second test case. I can parse all the simple types of strings (TEXT, MD or DB tokens that can contain parentheses) and functions that are empty; for example, &foo() or &foo(&bar()) parse correctly. The moment I put an argument within a function (no matter which type), I get an UnexpectedEOF Error: Expected ampersand, RPAR or ARGSEP. As a proof of concept, if I remove the parentheses from the definition of SINGLESTR in the new grammar above, then everything works as it should, but I'm back to square one.

import lark
grammar = r'''start: instruction
?instruction: simple
| func
MIDTEXTRPAR: /\)+(?!(\)|,,|$))/
SINGLESTR: (LETTER+|DIGIT+|"_"|" ") (LETTER+|DIGIT+|"_"|" "|"("|MIDTEXTRPAR)*
FUNCNAME: (LETTER+) (LETTER+|DIGIT+|"_")* // no parentheses allowed in the func name
DB: "!" SINGLESTR (WORDSEP SINGLESTR)*
TEXT: "$" SINGLESTR
MD: "#" SINGLESTR
simple: TEXT|DB|MD
ARGSEP: ",," // argument separator
WORDSEP: "," // word separator
CONDSEP: ";;" // condition separator
STAR: "*"
func: "&" FUNCNAME "(" [simple|func] (ARGSEP simple|func)* ")"
%import common.LETTER
%import common.WORD
%import common.DIGIT
%ignore ARGSEP
%ignore WORDSEP
'''
parser = lark.Lark(grammar, parser='earley')
parser.parse("&foo($first arg (has) parentheses,,$second arg)")
Output:
Tree(start, [Tree(func, [Token(FUNCNAME, 'foo'), Tree(simple, [Token(TEXT, '$first arg (has) parentheses')]), Token(ARGSEP, ',,'), Tree(simple, [Token(TEXT, '$second arg')])])])
I hope it's what you were looking for.
Those have been crazy few days. I tried lark and failed. I also tried persimonious and pyparsing. All of these different parsers all had the same problem with the 'argument' token consuming the right parenthesis that was part of the function, eventually failing because the function's parentheses weren't closed.
The trick was to figure out how do you define a right parenthesis that's "not special". See the regular expression for MIDTEXTRPAR in the code above. I defined it as a right parenthesis that is not followed by argument separation or by end of string. I did that by using the regular expression extension (?!...) which matches only if it's not followed by ... but doesn't consume characters. Luckily it even allows matching end of string inside this special regular expression extension.
EDIT:
The above mentioned method only works if you don't have an argument ending with a ), because then the MIDTEXTRPAR regular expression won't catch that ) and will think that's the end of the function even though there are more arguments to process. Also, there may be ambiguities such as ...asdf),,..., it may be an end of a function declaration inside an argument, or a 'text-like' ) inside an argument and the function declaration goes on.
This problem is related to the fact that what you describe in your question is not a context-free grammar (https://en.wikipedia.org/wiki/Context-free_grammar) for which parsers such as lark exist. Instead it is a context-sensitive grammar (https://en.wikipedia.org/wiki/Context-sensitive_grammar).
The reason for it being a context sensitive grammar is because you need the parser to 'remember' that it is nested inside a function, and how many levels of nesting there are, and have this memory available inside the grammar's syntax in some way.
EDIT2:
Also take a look at the following parser that is context-sensitive, and seems to solve the problem, but has an exponential time complexity in the number of nested functions, as it tries to parse all possible function barriers until it finds one that works. I believe it has to have an exponential complexity has since it's not context-free.
_funcPrefix = '&'
_debug = False
class ParseException(Exception):
pass
def GetRecursive(c):
if isinstance(c,ParserBase):
return c.GetRecursive()
else:
return c
class ParserBase:
def __str__(self):
return type(self).__name__ + ": [" + ','.join(str(x) for x in self.contents) +"]"
def GetRecursive(self):
return (type(self).__name__,[GetRecursive(c) for c in self.contents])
class Simple(ParserBase):
def __init__(self,s):
self.contents = [s]
class MD(Simple):
pass
class DB(ParserBase):
def __init__(self,s):
self.contents = s.split(',')
class Func(ParserBase):
def __init__(self,s):
if s[-1] != ')':
raise ParseException("Can't find right parenthesis: '%s'" % s)
lparInd = s.find('(')
if lparInd < 0:
raise ParseException("Can't find left parenthesis: '%s'" % s)
self.contents = [s[:lparInd]]
argsStr = s[(lparInd+1):-1]
args = list(argsStr.split(',,'))
i = 0
while i<len(args):
a = args[i]
if a[0] != _funcPrefix:
self.contents.append(Parse(a))
i += 1
else:
j = i+1
while j<=len(args):
nestedFunc = ',,'.join(args[i:j])
if _debug:
print(nestedFunc)
try:
self.contents.append(Parse(nestedFunc))
break
except ParseException as PE:
if _debug:
print(PE)
j += 1
if j>len(args):
raise ParseException("Can't parse nested function: '%s'" % (',,'.join(args[i:])))
i = j
def Parse(arg):
if arg[0] not in _starterSymbols:
raise ParseException("Bad prefix: " + arg[0])
return _starterSymbols[arg[0]](arg[1:])
_starterSymbols = {_funcPrefix:Func,'$':Simple,'!':DB,'#':MD}
P = Parse("&foo($first arg (has)) parentheses,,&f($asdf,,&nested2($23423))),,&second(!arg,wer))")
print(P)
import pprint
pprint.pprint(P.GetRecursive())

Problem is arguments of function are enclosed in parenthesis where one of the arguments may contain parenthesis.
One of the possible solution is use backspace \ before ( or ) when it is a part of String
SINGLESTR: (LETTER+|DIGIT+|"_"|" ") (LETTER+|DIGIT+|"_"|" "|"\("|"\)")*
Similar solution used by C, to include double quotes(") as a part of string constant where string constant is enclosed in double quotes.
example_string1='&f(!g\()'
example_string2='&f(#g)'
print(parser.parse(example_string1).pretty())
print(parser.parse(example_string2).pretty())
Output is
start
func
f
simple !g\(
start
func
f
simple #g

Getting shift-reduce conflict while parsing grammar

I am using LALR(1) parsing from lark-parser library. I have written a grammar to parse an ORM like language. An example of my language is pasted below:
Table1
.join(table=Table2, left_on=[column1], right_on=[column_2])
.group_by(col=[column1], agg=[sum])
.join(table=Table3, left_on=[column1], right_on=[column_3])
.some_column
My grammar is:
start: [CNAME (object)*]
object: "." (CNAME|operation)
operation: [(join|group) (object)*]
join: "join" "(" [(join_args ",")* join_args] ")"
join_args: "table" "=" CNAME
| "left_on" "=" list
| "right_on" "=" list
group: "group_by" "(" [(group_args ",")* group_args] ")"
group_args: "col" "=" list
| "agg" "=" list
list: "[" [CNAME ("," CNAME)*] "]"
%import common.CNAME //# Variable name declaration
%import common.WS //# White space declaration
%ignore WS
When I parse the language, it gets parsed correctly but I get shift-reduce conflict warning. I believe that this is due to collision at object: "." (CNAME|operation), but I may be wrong. Is there any other way to write this grammar ?

I think you should replace
operation: [(join|group) (object)*]
With just
operation: join | group
You've already allowed repetition of object in
start: [CNAME (object)*]
So also allowing object* at the end of operation is ambiguous, leading to the conflict.
Personally, I would have gone for something like:
start : [ CNAME ("." qualifier)* ]
qualifier: CNAME | join | group
Because I don't see the point of object. But that's just a minor style difference.

Parsing a custom configuration format in Python

I'm writing a profile manager for Stellaris game and I've hit a wall with their format in which they keep the info about mods and settings.
Mod file:
name="! (Ship Designer UI Fix) !"
path="mod/ship_designer_ui_fix"
tags={
"Fixes"
}
remote_file_id="879973318"
supported_version="1.6"
Settings:
language="l_english"
graphics={
size={
x=1920
y=1200
}
min_gui={
x=1920
y=1200
}
gui_scale=1.000000
gui_safe_ratio=1.000000
refreshRate=59
fullScreen=no
borderless=no
display_index=0
shadowSize=2048
multi_sampling=8
maxanisotropy=16
gamma=50.000000
vsync=yes
}
last_mods={
"mod/ship_designer_ui_fix.mod"
"mod/ugc_720237457.mod"
"mod/ugc_775944333.mod"
}
I've thought pyparsing will be of help there (and it probably will be) but it has been a long time since I've actually did something like this and this I'm clueless atm.
I've got to extract the simple key=value but I'm struggling to actually move from there to be able to extract the arrays, not to mention the multilevel arrays.
lbrack = Literal("{").suppress()
rbrack = Literal("}").suppress()
equals = Literal("=").suppress()
nonequals = "".join([c for c in printables if c != "="]) + " \t"
keydef = ~lbrack + Word(nonequals) + equals + restOfLine
conf = Dict( ZeroOrMore( Group(keydef) ) )
tokens = conf.parseString(data)
I haven't got very far as you can see. Can anyone point me towards next step? I'm not asking a finished and working solution for the whole thing - it would move me forward a lot but where's the fun in that :)

Well, it is awfully tempting to just dive in and write this parser, but you want some of that fun for yourself, that's great.
Before writing any code, write a BNF. That way you'll write a decent and robust parser, instead of just "everything that's not an equals sign must be an identifier".
There are a lot of "something = something" bits here, look at the kinds of things on the right- and left-hand sides of the '='. The left-hand sides all look like pretty well-mannered identifiers: alphas, underscores. I could envision numeric digits too, as long as they aren't the leading character. So let's say the left-hand sides will be identifiers:
identifier_leading = 'A'..'Z' 'a'..'z' '_'
identifier_body = identifier_leading '0'..'9'
identifier ::= identifier_leading + identifier_body*
The right-hand sides are a mix of things:
integers
floats
'yes' or 'no' booleans
quoted strings
something in braces
The "something in braces" are either a list of quoted strings, or a list of 'identifer = value' pairs. I'll skip the awful details of defining floats and integers and quoted strings, let's just assume we have those defined:
boolean_value ::= 'yes' | 'no'
value ::= float | integer | boolean_value | quoted_string | string_list_in_braces | key_value_list_in_braces
string_list_in_braces ::= '{' quoted_string * '}'
key_value ::= identifier '=' value
key_value_list_in_braces ::= '{' key_value* '}'
You will have to use a pyparsing Forward to declare value before it is fully defined, since it is used in key_value, but key_value is used in key_value_list_in_braces, which is used to define value - a recursive grammar. You are already familiar with the Dict(OneOrMore(Group(named_item))) pattern, and this should be good to give you a structure of fields that are accessible by name. For identifier, a Word would work, or you could just use the pre-defined pyparsing_common.identifier which was introduced as part of the pyparsing_common namespace class last year.
The translation from BNF to pyparsing should be pretty much 1-to-1 from here. For that matter, from the BNF, you could use PLY, ANTLR, or another parsing lib too. The BNF is really worth taking the 1/2 hour or 1/2 day to get sorted out.

Parsing C like structured text into dict

I have config files structured like simplified C syntax, eg:
Main { /* some comments */
VariableName1 = VariableValue1;
VariableName2 = VariableValue2;
SubSection {
VariableName1 = VariableValue1; // inline comment
VariableName2 = VariableValue2;
}
VariableName3 = "StriingValue4";
}
Sections may be recursivesly nested.
How can I parse that file into dict in a clean and "pythonish" way?
[EDIT]
OK, I've found pyparsing module :) but maybe someone can tell how to do this without it.
[EDIT2]
Because of curiosity, I want to know for future how to write that I think simple task by hand.

Use a parser like SimpleParse, just feed it the EBNF definition.
Do you have the format documented in some sort of BNF, don't you? If not, you can tell the next genius inventing another config format instead of using json, xml, yaml or xml that he is not authorized to reinvent the wheel unless he can specify the syntax using EBNF.
It may take some time to write a grammar if you are not familiarized with EBNF, but it pays. It will make your code well documented, rock solid and easier to maintain.
See the python wiki about Language Parsing for another options.
If you try to pull some stunt using str.split or regular expressions, every other developer to do maintenance of this piece of code will curse you.
UPDATE:
It just occurred to me that if you replace the SectionName with SectionName :, ; with , and enclose the main section with a pair of curly braces, this format will likely to be valid json.
"Name" = JSON Grammar
"Author" = Arsène von Wyss
"Version" = 1.0
"About" = 'Grammar for JSON data, following http://www.json.org/'
! and compliant with http://www.ietf.org/rfc/rfc4627
"Start Symbol" = <Json>
"Case Sensitive" = True
"Character Mapping" = 'Unicode'
! ------------------------------------------------- Sets
{Unescaped} = {All Valid} - {&1 .. &19} - ["\]
{Hex} = {Digit} + [ABCDEFabcdef]
{Digit9} = {Digit} - [0]
! ------------------------------------------------- Terminals
Number = '-'?('0'|{Digit9}{Digit}*)('.'{Digit}+)?([Ee][+-]?{Digit}+)?
String = '"'({Unescaped}|'\'(["\/bfnrt]|'u'{Hex}{Hex}{Hex}{Hex}))*'"'
! ------------------------------------------------- Rules
<Json> ::= <Object>
| <Array>
<Object> ::= '{' '}'
| '{' <Members> '}'
<Members> ::= <Pair>
| <Pair> ',' <Members>
<Pair> ::= String ':' <Value>
<Array> ::= '[' ']'
| '[' <Elements> ']'
<Elements> ::= <Value>
| <Value> ',' <Elements>
<Value> ::= String
| Number
| <Object>
| <Array>
| true
| false
| null

You need to parse it recursively, using this Backus Naur form, staring from PARSE:
PARSE: '{' VARASSIGN VARASSIGN [PARSE [VARASSIGN]] '}'
VARASSIGN : VARIABLENAME '=' '"' STRING '"'
VARIABLENAME: STRING
STRING: [[:alpha:]][[:alnum:]]*
Because your structure is easy, you can use a predicative parser LL(1).

1) Write a tokenizer, i.e. a function that will parse the character stream and turn it to a list of Identifiers, OpeningBrace, ClosingBrace, EqualSign and SemiColon; comments and spaces are discarded. Could be done using Regexpr's.
2) Write a simple parser. Skip the first Identifier and OpeningBrace.
The parser expects an Identifier followed by one of EqualSign or OpeningBrace, or a ClosingBrace.
2.1) If EqualSign, must be followed by Identifier and SemiColon.
2.2) If OpeningBrace, invoke the parser recursively.
2.3) If ClosingBrace, return from the recursive call.
In the processing of 2.1, enter the desired data into the dict, the way you like. You could prefix identifiers with the names of the enclosing blocks, e.g.
{"Main.SubSection.VariableName1": VariableValue1}
Here is prototype code for the parser, to be called after the tokenizer. It scans a string where a letter stands for an identifier, separators are ={}; and the last token must be a $.
def Parse(String, Prefix= "", Nest= 0):
global Cursor
if Nest == 0:
Cursor= 0
# Scan the input string
while String[Cursor + 0].isalpha():
# Identifier, starts an Assignment or a Block (Id |)
if String[Cursor + 1] == "=":
# Assignment, lookup (Id= | Id;)
if String[Cursor + 2].isalpha():
if String[Cursor + 3] == ";":
# Accept the assignment (Id=Id; |)
print Nest * " " + Prefix + String[Cursor] + "=" + String[Cursor + 2] + ";"
Cursor+= 4
elif String[Cursor + 1] == "{":
# Block, lookup (Id{ | )
print Nest * " " + String[Cursor] + "{"
Cursor+= 2
# Recurse
Parse(String, Prefix + String[Cursor - 2] + "::", Nest + 4)
else:
# Unexpected token
break
if String[Cursor + 0] == "}":
# Block complete, (Id{...} |)
print (Nest - 4) * " " + "}"
Cursor+= 1
return
if Nest == 0 and String[Cursor + 0] == "$":
# Done
return
print "Syntax error at", String[Cursor:], ":("
Parse("C{D=E;X{Y=Z;}F=G;}H=I;A=B;$")
When executed, it outputs:
C{
C::D=E;
X{
C::X::Y=Z;
}
C::F=G;
}
H=I;
A=B;
proving that it did detect the nesting. Replace the print statements by whatever processing you like.

You could use pyparsing to write a parser for this format.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.