I ran into a problem when working with ANTLR4 and its Python runtime library.
The grammar:
grammar SimpleCode;
program : 'class' ' ' 'Program' ' ' '{' field_decl* method_decl* '}' ;
field_decl : DATA_TYPE variable (',' variable)* ';' ;
method_decl: (DATA_TYPE | 'void') identifier '(' method_params? ')' block ;
variable : identifier | identifier '[' int_literal ']' ;
method_params : DATA_TYPE identifier (',' DATA_TYPE identifier)* ;
block : '{' var_decl* statement* '}' ;
var_decl : DATA_TYPE identifier (',' identifier)* ';';
statement : location assign_op expr ';' | method_call ';' | 'if' '(' (expr) ')' block ('else' block)? | 'for' identifier '=' (expr) ',' (expr) block | 'return' (expr)? ';' | 'break' ';' | 'continue' ';' | block ;
assign_op : '=' | '+=' | '-=' ;
method_call : method_name '(' method_call_params? ')' | 'callout' (string_literal (',' callout_arg (',' callout_arg)*)?) ;
method_call_params : DATA_TYPE identifier (',' DATA_TYPE identifier)* ;
method_name : identifier ;
location : identifier | identifier '[' expr ']' ;
expr : location | method_call | literal | expr bin_op expr | '-' expr | '!' expr | '(' expr ')' ;
callout_arg : expr | string_literal ;
bin_op : arith_op | rel_op | eq_op | cond_op ;
arith_op : '+' | '-' | '*' | '/' | '%' ;
rel_op : '<' | '>' | '<=' | '>=' ;
eq_op : '==' | '!=' ;
cond_op : '&&' | '||' ;
literal : int_literal | char_literal | bool_literal ;
identifier : ALPHA alpha_num* ;
alpha_num : ALPHA | DIGIT ;
int_literal : decimal_literal | hex_literal ;
decimal_literal : DIGIT DIGIT* ;
hex_literal : '0x' HEX_DIGIT HEX_DIGIT* ;
bool_literal : 'true' | 'false' ;
CHAR: . ;
char_literal : '\'' CHAR '\'' ;
string_literal : '"' CHAR* '"' ;
DATA_TYPE : INT | BOOL ;
INT : 'int' ;
BOOL : 'boolean' ;
ALPHA : [a-zA-Z] ;
DIGIT : [0-9] ;
HEX_DIGIT : [0-9a-fA-F] ;
White : [ \t]+ -> skip ;
Newline : ( '\r' '\n'? | '\n' ) -> skip ;
LineComment : '//' ~[\r\n]* -> skip ;
My python code to parse:
from antlr4 import *
from SimpleCodeLexer import SimpleCodeLexer
from SimpleCodeListener import SimpleCodeListener
from SimpleCodeParser import SimpleCodeParser
import sys

class SimpleCodePrintListener(SimpleCodeListener):
    def enterProgram(self, ctx):
        print(ctx.getText())
        print(ctx.toStringTree())
        # for child in ctx.getChildren():
        #     print(child.getText(), child.getSymbol())

def main():
    input_stream = FileStream('in.in')
    lexer = SimpleCodeLexer(input_stream)
    stream = CommonTokenStream(lexer)
    parser = SimpleCodeParser(stream)
    tree = parser.program()
    printer = SimpleCodePrintListener()
    walker = ParseTreeWalker()
    walker.walk(printer, tree)

if __name__ == '__main__':
    print('Starting parse....')
    main()
And the in.in file:
class Program {
    int main() {
        int v;
        v = 1;
        v = 'c';
        v = true;
        return 0;
 }
}
I got this error after running the Python code:
line 2:7 no viable alternative at input 'int '
The result of the first print is:
class Program {int main() {int v;v = 1;v = 'c';v = true;return 0; }}
([] class Program { int m a i n ( ) { int v ; v = 1 ; v = ' c ' ; v = true ; return 0 ; } })
I'm a newbie to ANTLR4, so is there some special case I need to handle with lexers and tokens? After hours of searching on the internet, it seems the main problem is that DATA_TYPE is used in many different places in the grammar.
When debugging issues like this, it often helps to print the token stream that's generated for the given input. You can do that by running grun with the option -tokens or by iterating over stream in your main function.
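For example, a minimal sketch with the ANTLR4 Python runtime, reusing the generated lexer from above (fill() forces the whole input to be tokenized before the loop):
from antlr4 import FileStream, CommonTokenStream
from SimpleCodeLexer import SimpleCodeLexer

input_stream = FileStream('in.in')
lexer = SimpleCodeLexer(input_stream)
stream = CommonTokenStream(lexer)
stream.fill()  # force the lexer to consume the entire input
for token in stream.tokens:
    print(token)  # shows each token's text, type and position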
If you do that, you'll see that main is tokenized as a sequence of four CHAR tokens, whereas your identifier rule expects ALPHA tokens, not CHAR. So that's the immediate problem, but it's not the only problem in your code:
The first thing I noticed when I tried your code is that I got errors on the line breaks. The reason that this happens for me and not for you is (presumably) that you're using Windows line breaks (\r\n) and I'm not. Your lexer recognizes \r\n as a line break and skips it, but just \n is recognized as a CHAR.
Further, your handling of spaces is very confusing. Single spaces are their own tokens. They have to appear in certain places and can't appear anywhere else. However, multiple consecutive spaces are skipped. So int main (with its single space) is an error because the space becomes a ' ' token that the parser does not allow between int and main - that is exactly the "no viable alternative at input 'int '" error above. Conversely, class  Program (with two spaces) would be an error because the two spaces would be skipped even though the grammar requires a ' ' token there. And indenting a line with a single space would be an error because then the indentation would not be skipped.
Your identifiers are also wonky. Identifiers can contain spaces (as long as there's more than one in a row), line breaks (as long as they're \r\n, or you fix the lexer so that a lone \n is skipped too) or comments. So the following would be a single valid identifier (assuming you change the lexer so that letters are recognized as ALPHA instead of CHAR):
hel lo //comment
wor
ld
On the other hand maintarget would not be a valid identifier because it contains the keyword int.
Similarly, skipped tokens can also appear inside your integer literals and string literals. For string literals that means that "a  b" (with two spaces) is a valid string (which is fine) that only contains the characters a and b (which is not fine) because the double space gets skipped. On the other hand, " " would be an invalid string because the single space is recognized as a ' ' token, not a CHAR token. Also, if you fix your identifiers by making letters be recognized as ALPHA, letters will no longer be valid inside strings. And "la//la" would be seen as an unclosed string literal because //la" would be seen as a comment.
All of these issues are related to how the lexer works, so let's go through that:
When turning a character stream into a token stream, the lexer processes the input according to the "maximal munch" rule: it goes through all of the lexer rules and checks which ones match at the beginning of the current input. Of those that match, it picks the one that produces the longest match. In case of ties it prefers the one that's defined first in the grammar. If you use string literals directly in parser rules, they are treated like lexer rules that are defined before any others.
So the fact that you have a CHAR: .; rule that comes before ALPHA, DIGIT and HEX_DIGIT means that these rules will never be matched. All of these rules match a single character, so when more than one of them matches, CHAR will be preferred because it comes first in the grammar. If you move CHAR to the end, letters will now be matched by ALPHA, decimal digits by DIGIT and everything else by CHAR. This still leaves HEX_DIGIT useless (and if you move it to the front, that would render ALPHA and DIGIT useless) and it also means that CHAR no longer does what you want because you want digits and letters to be seen as CHARs - but only inside strings.
The real problem here is that none of these things should be tokens. They should either be fragments or just be inlined directly into the lexer rules that use them. Instead your tokens should be anything inside of which you don't want to allow/ignore spaces or comments. So string literals, int literals and identifiers should all be tokens. The only instance where you have multiple lexer rules that could match the same input should be identifiers and keywords (where keywords take precedence over identifiers because you specify them as string literals in the grammar, but longer identifiers could still contain keywords as a substring because of the maximal munch rule).
You also should remove all uses of ' ' from your grammar and instead always skip spaces.
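A sketch of what the lexer section could look like after those changes (illustrative only: the token names are assumptions, escape sequences inside strings are not handled, and the parser rules would need to be updated to use these tokens):
IDENTIFIER : ALPHA (ALPHA | DIGIT)* ;
INT_LITERAL : DIGIT+ | '0x' HEX_DIGIT+ ;
CHAR_LITERAL : '\'' ~['\r\n] '\'' ;
STRING_LITERAL : '"' ~["\r\n]* '"' ;
fragment ALPHA : [a-zA-Z] ;
fragment DIGIT : [0-9] ;
fragment HEX_DIGIT : [0-9a-fA-F] ;
White : [ \t\r\n]+ -> skip ;
LineComment : '//' ~[\r\n]* -> skip ;
Keywords such as 'int', 'boolean' or 'class' written as literals in the parser rules still take precedence over IDENTIFIER for exact matches, while longer identifiers that merely contain a keyword are matched as a single IDENTIFIER thanks to maximal munch.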
Related
This is my grammar file. I am trying to parse comments at the start of any content; my comments are Python-style single-line (#) and multiline (""") comments.
I am getting the error below when parsing the following test cases.
Test case scenarios for single-line comments:
# This is single line comment.\nme = "Test Name"\nt = test(me)\nreturn t FAIL
# This is single line comment. PASS
me = "Test Name"\n# This is single line comment.\nt = test(me)\nreturn t PASS
Test case scenarios for multi-line comments:
"""This is comment.\nThis is comment.\n"""\nme = "Test Name"\nt = test(me)\nreturn t FAIL
"""This is comment.\nThis is comment.\n""" PASS
me = "Test Name"\n"""This is comment.\nThis is comment.\n"""\nt = test(me)\nreturn t PASS
antlr4.error.Errors.ParseCancellationException: Line 2: 0 extraneous input 'me' expecting {, NEWLINE}
grammar ngc_grammar;
parse
: (multi_statement | NEWLINE*) EOF
;
multi_statement
: (statement NEWLINE*)+
;
statement
: return_statement
| conditional_statement
| function_invocation
| assignment
| for_loop
;
conditional_statement
: if_statement (NEWLINE+ elif_statement)* (NEWLINE+ else_statement)?
;
if_statement
: IF SPACE logical_expression ':' NEWLINE SPACE statement
;
elif_statement
: ELIF SPACE logical_expression ':' NEWLINE SPACE statement
;
else_statement
: ELSE ':' NEWLINE SPACE statement
;
return_statement
: RETURN SPACE (expr | logical_expression)
;
logical_expression
: '(' SPACE? logical_expression SPACE? ')' #BracketedLogicalExpr
| NOT SPACE logical_expression #LogicalNotExpr
| left=logical_expression SPACE operator=OR SPACE right=logical_expression #LogicalExpr
| left=logical_expression SPACE operator=AND SPACE right=logical_expression #LogicalExpr
| boolean_expression #LogicalBooleanExpr
;
boolean_expression
: left=expr SPACE? operator=(GT | GTE | LT | LTE | EQ | NE | IN | NOT_IN | IS | IS_NOT) SPACE? right=expr #ComparisonExpr
| expr #BooleanFunctionExpr
;
expr
: term (SPACE? arith_expr)*
;
arith_expr
: operator=(PLUS|MINUS) SPACE? term
;
term
: factor (SPACE? factor_expr)*
;
factor_expr
: operator=(TIMES|DIV) SPACE? factor
;
factor
: operator=(PLUS|MINUS) SPACE? factor | value
;
value
: atom_expr | function_invocation
;
atom_expr
: atom trailer* #AtomExprAtom
| '(' SPACE? expr SPACE? ')' #AtomExprBracket
;
trailer
: '(' parameters ')' # TrailerFunction
| '[' string_atom ']' # TrailerIndex
| '.' identifier # TrailerProp
;
function_invocation
: identifier '(' parameters ')' # FunctionInvocation
| atom trailer+ # FunctionAccessor
;
assignment
: identifier SPACE? '=' SPACE? expr #VariableAssignment
| variable_accessor '.' identifier SPACE? '=' SPACE? expr #PropAssignment
;
parameters
: (value SPACE? (',' SPACE? value)*)?
;
array
: '[' SPACE? (value SPACE? (',' SPACE? value)*)? SPACE? ']'
;
for_loop
: FOR SPACE var=identifier SPACE* (',' SPACE* index=identifier)? SPACE IN SPACE atom ':' NEWLINE SPACE assignment #forLoop
;
atom
: none_atom
| boolean_atom
| float_atom
| integer_atom
| string_atom
| array
| variable_accessor
;
variable_accessor
: identifier
;
none_atom
: NONE
;
boolean_atom
: TRUE
| FALSE
;
integer_atom
: INTEGER
;
float_atom
: FLOAT
;
string_atom
: STRING_LITERAL
;
identifier
: NAME
;
FOR: 'for';
AND: 'and';
ELIF: 'elif';
ELSE: 'else';
FALSE: 'False';
FLOAT: ( '0' | [1-9] [0-9]* ) '.' [0-9]+;
IF: 'if';
IN: 'in';
IS: 'is';
IS_NOT: 'is not';
NOT_IN: 'not in';
INTEGER: ( '0' | [1-9] [0-9]* );
NEWLINE: ( '\r'? '\n' | '\r' | '\f' );
NONE: 'None';
NOT: 'not';
OR: 'or';
RETURN: 'return';
SPACE: [ \t]+;
STRING_LITERAL
: '"'.*?'"'
| '\''.*?'\''
;
TRUE: 'True';
GT: '>';
GTE: '>=';
LT: '<';
LTE: '<=';
EQ: '==';
NE: '!=';
PLUS: '+';
MINUS: '-';
TIMES: '*';
DIV: '/';
NAME: ID_START ID_CONTINUE*;
SINGLELINECOMMENT: '#' ~[\r\n]* -> skip;
MULTILINECOMMENT: ('"""') .*? (MULTILINECOMMENT | '"""') -> skip;
fragment ID_START
: '_'
| [A-Z]
| [a-z]
;
fragment ID_CONTINUE
: ID_START
| [0-9]
;
Consider this simple test of the Python Lark parser:
from lark import Lark

GRAMMAR = '''
start: container*
container: string ":" "{" (container | attribute | attribute_value)* "}"
attribute: attribute_name "=" (attribute_value | container)
attribute_value: string ":" _value ("," _value)*
_value: number | string
attribute_name: /[A-Za-z_][A-Za-z_#0-9]*/
string: /[A-Za-z_#0-9]+/
number: /[0-9]+/
%import common.WS
%ignore WS
'''
data = '''outer : {
inner : {
}
}'''
parser = Lark(GRAMMAR, parser='lalr')
parser.parse(data)
This works with parser='earley' but it fails with parser='lalr'. I don't understand why. The error message is:
UnexpectedCharacters: No terminal defined for '{' at line 2 col 12
inner : {
This is just an MWE. My actual grammar suffers from the same problem.
The reason this fails with LALR is that it has a lookahead of 1 (unlike Earley, which has unlimited lookahead), and it gets confused between attribute_name and string. Once it matches one or the other (in this case, attribute_name), it's impossible for it to backtrack and match a different rule.
If you use a lower priority for the attribute_name terminal, it will work. For example:
attribute_name: ATTR
ATTR.0: /[A-Za-z_][A-Za-z_#0-9]*/
But the recommended practice is to use the same terminal for both, if possible, so that the parser can do the thinking for you, instead of the lexer. You can add extra validation, if that's required, after the parsing is done.
Both approaches (changing priority or merging the terminals) will solve your problem.
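A sketch of the merged-terminal version (the NAME terminal is an illustrative name; it uses the broader of the two original character sets, so any stricter check on attribute names would move to a post-parse validation step):
from lark import Lark

GRAMMAR = '''
start: container*
container: string ":" "{" (container | attribute | attribute_value)* "}"
attribute: attribute_name "=" (attribute_value | container)
attribute_value: string ":" _value ("," _value)*
_value: number | string
attribute_name: NAME
string: NAME
NAME: /[A-Za-z_#0-9]+/
number: /[0-9]+/
%import common.WS
%ignore WS
'''

parser = Lark(GRAMMAR, parser='lalr')
parser.parse('outer : {\n    inner : {\n    }\n}')
With a single NAME terminal the lexer no longer has to choose between two overlapping patterns; the parser decides between attribute_name and string from the token that follows (= versus :), which fits within LALR(1)'s one-token lookahead.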
While checking the Python grammar in the official documentation, here is what it reads:
atom_expr: ['await'] atom trailer*
atom: ('(' [yield_expr|testlist_comp] ')' |
'[' [testlist_comp] ']' |
'{' [dictorsetmaker] '}' |
NAME | NUMBER | STRING+ | '...' | 'None' | 'True' | 'False')
testlist_comp: (test|star_expr) ( comp_for | (',' (test|star_expr))* [','] )
trailer: '(' [arglist] ')' | '[' subscriptlist ']' | '.' NAME
Then, 10.bit_length() is valid syntax according to that definition but not according to the Python interpreter. In contrast, n=10;n.bit_length() is valid syntax according to both the specification and the interpreter.
Where should I find the real definition of atom and atom_expr?
Thanks to juanpa's comment and the answers in the related question, it appears that the problem comes from 10. - the definition of NUMBER includes the dot, so that 10.bit_length() is of the form NUMBER NAME trailer and not NUMBER '.' NAME trailer.
In order to obtain an atom_expr, one must separate the dot: both 10 .bit_length() and (10).bit_length() give the correct answer.
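A quick REPL session illustrates this (the exact SyntaxError message varies between Python versions):
>>> 10.bit_length()   # "10." is consumed as a float-literal prefix
SyntaxError: invalid syntax
>>> 10 .bit_length()  # the space ends the NUMBER token before the dot
4
>>> (10).bit_length()
4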
I'm trying to write a simple int expression parser using tatsu, a PEG-based Python parser generator. Here is my code:
import tatsu
grammar = r'''
start = expression $ ;
expression = add | sub | term ;
add = expression '+' term ;
sub = expression '-' term ;
term = mul | div | number ;
mul = term '*' number ;
div = term '/' number ;
number = [ '-' ] /\d+/ ;
'''
parser = tatsu.compile(grammar)
print(parser.parse('2-1'))
The output of this program is ['-', '1'] instead of the expected ['2', '-', '1'].
I get the correct output if I either:
Remove support for unary minus, i.e. change the last rule to number = /\d+/ ;
Remove the term, mul and div rules, and support only addition and subtraction
Replace the second rule with expression = add | sub | mul | div | number ;
The last option actually works without leaving any feature out, but I don't understand why it works. What is going on?
EDIT: If I just flip the add/sub/mul/div rules to get rid of left recursion, it also works. But then evaluating the expressions becomes a problem, since the parse tree is flipped. (3-2-1 becomes 3-(2-1))
There are left recursion cases that TatSu doesn't handle, and work on fixing that is currently on hold.
You can use left/right join/gather operators to control the associativity of parsed expressions in a non-left-recursive grammar.
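For example, here is a sketch along the lines of TatSu's documented left-join operator (untested against any specific TatSu version; the <{ }+ form parses one or more terms separated by the operator expression but yields a left-associative tree, so 3-2-1 groups as (3-2)-1):
import tatsu

grammar = r'''
start = expression $ ;
expression = ('+' | '-') <{ term }+ ;
term = ('*' | '/') <{ factor }+ ;
factor = [ '-' ] /\d+/ ;
'''

parser = tatsu.compile(grammar)
print(parser.parse('3-2-1'))  # should group as (3-2)-1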
I have config files structured like simplified C syntax, e.g.:
Main { /* some comments */
    VariableName1 = VariableValue1;
    VariableName2 = VariableValue2;
    SubSection {
        VariableName1 = VariableValue1; // inline comment
        VariableName2 = VariableValue2;
    }
    VariableName3 = "StringValue4";
}
Sections may be recursively nested.
How can I parse that file into a dict in a clean and "pythonic" way?
[EDIT]
OK, I've found the pyparsing module :) but maybe someone can tell me how to do this without it.
[EDIT2]
Out of curiosity, I'd like to know for future reference how to write this (I think) simple task by hand.
Use a parser like SimpleParse, just feed it the EBNF definition.
You have the format documented in some sort of BNF, don't you? If not, you can tell the next genius inventing another config format instead of using JSON, XML or YAML that he is not authorized to reinvent the wheel unless he can specify the syntax using EBNF.
It may take some time to write a grammar if you are not familiar with EBNF, but it pays off. It will make your code well documented, rock solid and easier to maintain.
See the Python wiki about Language Parsing for other options.
If you try to pull some stunt using str.split or regular expressions, every other developer who has to maintain this piece of code will curse you.
UPDATE:
It just occurred to me that if you replace each SectionName with SectionName :, replace each ; with , and enclose the main section in a pair of curly braces, this format is likely to be valid JSON.
"Name" = JSON Grammar
"Author" = Arsène von Wyss
"Version" = 1.0
"About" = 'Grammar for JSON data, following http://www.json.org/'
! and compliant with http://www.ietf.org/rfc/rfc4627
"Start Symbol" = <Json>
"Case Sensitive" = True
"Character Mapping" = 'Unicode'
! ------------------------------------------------- Sets
{Unescaped} = {All Valid} - {&1 .. &19} - ["\]
{Hex} = {Digit} + [ABCDEFabcdef]
{Digit9} = {Digit} - [0]
! ------------------------------------------------- Terminals
Number = '-'?('0'|{Digit9}{Digit}*)('.'{Digit}+)?([Ee][+-]?{Digit}+)?
String = '"'({Unescaped}|'\'(["\/bfnrt]|'u'{Hex}{Hex}{Hex}{Hex}))*'"'
! ------------------------------------------------- Rules
<Json> ::= <Object>
| <Array>
<Object> ::= '{' '}'
| '{' <Members> '}'
<Members> ::= <Pair>
| <Pair> ',' <Members>
<Pair> ::= String ':' <Value>
<Array> ::= '[' ']'
| '[' <Elements> ']'
<Elements> ::= <Value>
| <Value> ',' <Elements>
<Value> ::= String
| Number
| <Object>
| <Array>
| true
| false
| null
You need to parse it recursively, using this Backus-Naur form, starting from PARSE:
PARSE: '{' VARASSIGN VARASSIGN [PARSE [VARASSIGN]] '}'
VARASSIGN : VARIABLENAME '=' '"' STRING '"'
VARIABLENAME: STRING
STRING: [[:alpha:]][[:alnum:]]*
Because your structure is easy, you can use a predictive LL(1) parser.
1) Write a tokenizer, i.e. a function that reads the character stream and turns it into a list of Identifiers, OpeningBrace, ClosingBrace, EqualSign and SemiColon tokens; comments and spaces are discarded. This could be done using regexes (see the sketch after this list).
2) Write a simple parser. Skip the first Identifier and OpeningBrace.
The parser expects an Identifier followed by one of EqualSign or OpeningBrace, or a ClosingBrace.
2.1) If EqualSign, must be followed by Identifier and SemiColon.
2.2) If OpeningBrace, invoke the parser recursively.
2.3) If ClosingBrace, return from the recursive call.
In the processing of 2.1, enter the desired data into the dict, the way you like. You could prefix identifiers with the names of the enclosing blocks, e.g.
{"Main.SubSection.VariableName1": VariableValue1}
Here is prototype code for the parser, to be called after the tokenizer. It scans a string in which a letter stands for an identifier, the separators are =, {, } and ;, and the last token must be a $.
def Parse(String, Prefix="", Nest=0):
    global Cursor
    if Nest == 0:
        Cursor = 0
    # Scan the input string
    while String[Cursor + 0].isalpha():
        # Identifier, starts an Assignment or a Block (Id |)
        if String[Cursor + 1] == "=":
            # Assignment, lookup (Id= | Id;)
            if String[Cursor + 2].isalpha():
                if String[Cursor + 3] == ";":
                    # Accept the assignment (Id=Id; |)
                    print(Nest * " " + Prefix + String[Cursor] + "=" + String[Cursor + 2] + ";")
                    Cursor += 4
        elif String[Cursor + 1] == "{":
            # Block, lookup (Id{ | )
            print(Nest * " " + String[Cursor] + "{")
            Cursor += 2
            # Recurse
            Parse(String, Prefix + String[Cursor - 2] + "::", Nest + 4)
        else:
            # Unexpected token
            break
    if String[Cursor + 0] == "}":
        # Block complete, (Id{...} |)
        print((Nest - 4) * " " + "}")
        Cursor += 1
        return
    if Nest == 0 and String[Cursor + 0] == "$":
        # Done
        return
    print("Syntax error at", String[Cursor:], ":(")

Parse("C{D=E;X{Y=Z;}F=G;}H=I;A=B;$")
When executed, it outputs:
C{
    C::D=E;
    X{
        C::X::Y=Z;
    }
    C::F=G;
}
H=I;
A=B;
proving that it did detect the nesting. Replace the print calls with whatever processing you like.
You could use pyparsing to write a parser for this format.
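A minimal pyparsing sketch of how that could look (hedged: it assumes values are bare words or double-quoted strings, as in the example at the top):
from pyparsing import (Word, alphas, alphanums, QuotedString, Suppress,
                       Group, Forward, ZeroOrMore, cppStyleComment)

ident = Word(alphas, alphanums + '_')
value = QuotedString('"') | Word(alphanums + '_')
assignment = Group(ident + Suppress('=') + value + Suppress(';'))

section = Forward()
section <<= Group(ident + Suppress('{')
                  + ZeroOrMore(assignment | section) + Suppress('}'))

config = ZeroOrMore(section)
config.ignore(cppStyleComment)  # strips /* ... */ and // ... comments

sample = '''
Main { /* some comments */
    VariableName1 = VariableValue1;
    SubSection {
        VariableName2 = VariableValue2; // inline comment
    }
}
'''
print(config.parseString(sample))
The nested Group calls give you a nested list structure, which is straightforward to walk and convert into a dict keyed the way the earlier answer suggests.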