This is my grammar file. I am trying to parse comments at the start of any content; the comments are Python style: # for a single-line comment and """ for a multi-line comment.
I am getting the error below when parsing the following test cases.
Test case scenarios for single-line comments:
# This is single line comment.\nme = "Test Name"\nt = test(me)\nreturn t FAIL
# This is single line comment. PASS
me = "Test Name"\n# This is single line comment.\nt = test(me)\nreturn t PASS
Test case scenarios for multi-line comments:
"""This is comment.\nThis is comment.\n"""\nme = "Test Name"\nt = test(me)\nreturn t FAIL
"""This is comment.\nThis is comment.\n""" PASS
me = "Test Name"\n"""This is comment.\nThis is comment.\n"""\nt = test(me)\nreturn t PASS
antlr4.error.Errors.ParseCancellationException: Line 2: 0 extraneous input 'me' expecting {<EOF>, NEWLINE}
grammar ngc_grammar;
parse
: (multi_statement | NEWLINE*) EOF
;
multi_statement
: (statement NEWLINE*)+
;
statement
: return_statement
| conditional_statement
| function_invocation
| assignment
| for_loop
;
conditional_statement
: if_statement (NEWLINE+ elif_statement)* (NEWLINE+ else_statement)?
;
if_statement
: IF SPACE logical_expression ':' NEWLINE SPACE statement
;
elif_statement
: ELIF SPACE logical_expression ':' NEWLINE SPACE statement
;
else_statement
: ELSE ':' NEWLINE SPACE statement
;
return_statement
: RETURN SPACE (expr | logical_expression)
;
logical_expression
: '(' SPACE? logical_expression SPACE? ')' #BracketedLogicalExpr
| NOT SPACE logical_expression #LogicalNotExpr
| left=logical_expression SPACE operator=OR SPACE right=logical_expression #LogicalExpr
| left=logical_expression SPACE operator=AND SPACE right=logical_expression #LogicalExpr
| boolean_expression #LogicalBooleanExpr
;
boolean_expression
: left=expr SPACE? operator=(GT | GTE | LT | LTE | EQ | NE | IN | NOT_IN | IS | IS_NOT) SPACE? right=expr #ComparisonExpr
| expr #BooleanFunctionExpr
;
expr
: term (SPACE? arith_expr)*
;
arith_expr
: operator=(PLUS|MINUS) SPACE? term
;
term
: factor (SPACE? factor_expr)*
;
factor_expr
: operator=(TIMES|DIV) SPACE? factor
;
factor
: operator=(PLUS|MINUS) SPACE? factor | value
;
value
: atom_expr | function_invocation
;
atom_expr
: atom trailer* #AtomExprAtom
| '(' SPACE? expr SPACE? ')' #AtomExprBracket
;
trailer
: '(' parameters ')' # TrailerFunction
| '[' string_atom ']' # TrailerIndex
| '.' identifier # TrailerProp
;
function_invocation
: identifier '(' parameters ')' # FunctionInvocation
| atom trailer+ # FunctionAccessor
;
assignment
: identifier SPACE? '=' SPACE? expr #VariableAssignment
| variable_accessor '.' identifier SPACE? '=' SPACE? expr #PropAssignment
;
parameters
: (value SPACE? (',' SPACE? value)*)?
;
array
: '[' SPACE? (value SPACE? (',' SPACE? value)*)? SPACE? ']'
;
for_loop
: FOR SPACE var=identifier SPACE* (',' SPACE* index=identifier)? SPACE IN SPACE atom ':' NEWLINE SPACE assignment #forLoop
;
atom
: none_atom
| boolean_atom
| float_atom
| integer_atom
| string_atom
| array
| variable_accessor
;
variable_accessor
: identifier
;
none_atom
: NONE
;
boolean_atom
: TRUE
| FALSE
;
integer_atom
: INTEGER
;
float_atom
: FLOAT
;
string_atom
: STRING_LITERAL
;
identifier
: NAME
;
FOR: 'for';
AND: 'and';
ELIF: 'elif';
ELSE: 'else';
FALSE: 'False';
FLOAT: ( '0' | [1-9] [0-9]* ) '.' [0-9]+;
IF: 'if';
IN: 'in';
IS: 'is';
IS_NOT: 'is not';
NOT_IN: 'not in';
INTEGER: ( '0' | [1-9] [0-9]* );
NEWLINE: ( '\r'? '\n' | '\r' | '\f' );
NONE: 'None';
NOT: 'not';
OR: 'or';
RETURN: 'return';
SPACE: [ \t]+;
STRING_LITERAL
: '"'.*?'"'
| '\''.*?'\''
;
TRUE: 'True';
GT: '>';
GTE: '>=';
LT: '<';
LTE: '<=';
EQ: '==';
NE: '!=';
PLUS: '+';
MINUS: '-';
TIMES: '*';
DIV: '/';
NAME: ID_START ID_CONTINUE*;
SINGLELINECOMMENT: '#' ~[\r\n]* -> skip;
MULTILINECOMMENT: ('"""') .*? (MULTILINECOMMENT | '"""') -> skip;
fragment ID_START
: '_'
| [A-Z]
| [a-z]
;
fragment ID_CONTINUE
: ID_START
| [0-9]
;
Related
I am creating a parser with Lark. The parser works fine for most of the tests I ran, but fails with the define keyword. It only works if it is followed by an assignment: define a = 10 works just fine, but define b is not treated as a define statement.
Here is the Lark parser :
import lark
# ...
parser = lark.Lark("""
?start: statements
?statements: ((expr (";" | NEWLINE) | NEWLINE ) )* expr?
?expr: identifier | number | functioncall | define | assignment | function
?functioncall: identifier "(" arguments? ")"
?arguments: expr ("," expr)*
?define: "define" identifier ("=" expr)?
?assignment: identifier "=" expr
?function: "function" "(" parameters? ")" "->" identifier block
?parameters: identifier ("," identifier)*
?block: "{" statements "}"
?identifier: NAME -> identifier
?number: NUMBER -> number
%import common.NEWLINE
%import common.CNAME -> NAME
%import common.NUMBER
%import common.WS_INLINE
%ignore WS_INLINE
COMMENT: "/*" /(.|\n)+/x "*/" | "//" /.+/ NEWLINE?
%ignore COMMENT
""")
My tests :
tree = parser.parse("define a = 10")
assert(tree.data == "define") # OK
tree = parser.parse("define b")
assert(tree.data == "define") # NOT OK - tree.data is "identifier"
Specifically, parser.parse("define b") and parser.parse("b") give the exact same result. I would expect parser.parse("define b") to give a tree beginning with the define rule, but instead I have an identifier.
Sometimes the Lark parser doesn't clearly identify a rule; for instance, define b and b both give Tree(identifier, [Token(NAME, 'b')]). To be able to distinguish the two, you need to force Lark to add a name to the rule, which can be done by adding -> name_of_rule at the end of a line in the parser definition. So, for instance, the definition of the ?define rule should become:
?define: "define" identifier ( "=" expr )? -> define
I ran into this problem when dealing with ANTLR4, parsing via its Python library.
The grammar:
grammar SimpleCode;
program : 'class' ' ' 'Program' ' ' '{' field_decl* method_decl* '}' ;
field_decl : DATA_TYPE variable (',' variable)* ';' ;
method_decl: (DATA_TYPE | 'void') identifier '(' method_params? ')' block ;
variable : identifier | identifier '[' int_literal ']' ;
method_params : DATA_TYPE identifier (',' DATA_TYPE identifier)* ;
block : '{' var_decl* statement* '}' ;
var_decl : DATA_TYPE identifier (',' identifier)* ';';
statement : location assign_op expr ';' | method_call ';' | 'if' '(' (expr) ')' block ('else' block)? | 'for' identifier '=' (expr) ',' (expr) block | 'return' (expr)? ';' | 'break' ';' | 'continue' ';' | block ;
assign_op : '=' | '+=' | '-=' ;
method_call : method_name '(' method_call_params? ')' | 'callout' (string_literal (',' callout_arg (',' callout_arg)*)?) ;
method_call_params : DATA_TYPE identifier (',' DATA_TYPE identifier)* ;
method_name : identifier ;
location : identifier | identifier '[' expr ']' ;
expr : location | method_call | literal | expr bin_op expr | '-' expr | '!' expr | '(' expr ')' ;
callout_arg : expr | string_literal ;
bin_op : arith_op | rel_op | eq_op | cond_op ;
arith_op : '+' | '-' | '*' | '/' | '%' ;
rel_op : '<' | '>' | '<=' | '>=' ;
eq_op : '==' | '!=' ;
cond_op : '&&' | '||' ;
literal : int_literal | char_literal | bool_literal ;
identifier : ALPHA alpha_num* ;
alpha_num : ALPHA | DIGIT ;
int_literal : decimal_literal | hex_literal ;
decimal_literal : DIGIT DIGIT* ;
hex_literal : '0x' HEX_DIGIT HEX_DIGIT* ;
bool_literal : 'true' | 'false' ;
CHAR: . ;
char_literal : '\'' CHAR '\'' ;
string_literal : '"' CHAR* '"' ;
DATA_TYPE : INT | BOOL ;
INT : 'int' ;
BOOL : 'boolean' ;
ALPHA : [a-zA-Z] ;
DIGIT : [0-9] ;
HEX_DIGIT : [0-9a-fA-F] ;
White : [ \t]+ -> skip ;
Newline : ( '\r' '\n'? | '\n' ) -> skip ;
LineComment : '//' ~[\r\n]* -> skip ;
My python code to parse:
from antlr4 import *
from SimpleCodeLexer import SimpleCodeLexer
from SimpleCodeListener import SimpleCodeListener
from SimpleCodeParser import SimpleCodeParser
import sys
class SimpleCodePrintListener(SimpleCodeListener):
    def enterProgram(self, ctx):
        print(ctx.getText())
        print(ctx.toStringTree())
        # for child in ctx.getChildren():
        #     print(child.getText(), child.getSymbol())

def main():
    input_stream = FileStream('in.in')
    lexer = SimpleCodeLexer(input_stream)
    stream = CommonTokenStream(lexer)
    parser = SimpleCodeParser(stream)
    tree = parser.program()
    printer = SimpleCodePrintListener()
    walker = ParseTreeWalker()
    walker.walk(printer, tree)

if __name__ == '__main__':
    print('Starting parse....')
    main()
And the in.in file:
class Program {
    int main() {
        int v;
        v = 1;
        v = 'c';
        v = true;
        return 0;
    }
}
I got this error after run the python code:
line 2:7 no viable alternative at input 'int '
The result of the first print is:
class Program {int main() {int v;v = 1;v = 'c';v = true;return 0; }}
([] class Program { int m a i n ( ) { int v ; v = 1 ; v = ' c ' ; v = true ; return 0 ; } })
I'm a newbie to ANTLR4, so are there any special cases to handle with lexers and tokens? After hours of searching the internet, the main problem seems to be that DATA_TYPE is used in many different places in the grammar.
When debugging issues like this, it often helps to print the token stream that's generated for the given input. You can do that by running grun with the option -tokens or by iterating over stream in your main function.
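With the Python runtime, that iteration could look like this (a small sketch reusing the objects from the main function above):

lexer = SimpleCodeLexer(FileStream('in.in'))
stream = CommonTokenStream(lexer)
stream.fill()  # run the lexer to EOF so stream.tokens is populated
for token in stream.tokens:
    print(token)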
If you do that, you'll see that main is tokenized as a sequence of four CHAR tokens, whereas your identifier rule expects ALPHA tokens, not CHAR. So that's the immediate problem, but it's not the only problem in your code:
The first thing I noticed when I tried your code is that I got errors on the line breaks. The reason that this happens for me and not for you is (presumably) that you're using Windows line breaks (\r\n) and I'm not. Your lexer recognizes \r\n as a line break and skips it, but just \n is recognized as a CHAR.
Further, your handling of spaces is very confusing. Single spaces are their own tokens; they have to appear in certain places and can't appear anywhere else. However, multiple consecutive spaces are skipped. So something like int main (with a single space) is an error because the grammar does not expect a space token between int and main (this is exactly the no viable alternative at input 'int ' error above). On the other hand, indenting a line with a single space would also be an error because the single space would not be skipped.
Your identifiers are also wonky. Identifiers can contain spaces (as long as it's more than one), line breaks (as long as they're \r\n or you fix it, so that \n is skipped, too) or comments. So the following would be a single valid identifier (assuming you change the lexer, so that the letters are recognized as ALPHA instead of CHAR):
hel lo //comment
wor
ld
On the other hand maintarget would not be a valid identifier because it contains the keyword int.
Similarly, skipped tokens can also appear inside your integer literals and string literals. For string literals that means that "a  b" is a valid string (which is fine) that only contains the characters a and b (which is not fine), because the double space gets skipped. On the other hand, " " would be an invalid string because the single space is recognized as a ' ' token, not a CHAR token. Also, if you fix your identifiers by making letters be recognized as ALPHA, letters will no longer be valid inside strings. And "la//la" would be seen as an unclosed string literal, because //la" would be seen as a comment.
All of these issues are related to how the lexer works, so let's go through that:
When turning a character stream into a token stream, the lexer processes the input according to the "maximal munch" rule: it goes through all of the lexer rules and checks which ones match at the beginning of the current input. Of those that match, it picks the one that produces the longest match. In case of ties, it prefers the one that's defined first in the grammar. If you use string literals directly in parser rules, they are treated like lexer rules that are defined before any others.
So the fact that you have a CHAR: .; rule that comes before ALPHA, DIGIT and HEX_DIGIT means that these rules will never be matched. All of these rules match a single character, so when more than one of them matches, CHAR will be preferred because it comes first in the grammar. If you move CHAR to the end, letters will now be matched by ALPHA, decimal digits by DIGIT and everything else by CHAR. This still leaves HEX_DIGIT useless (and if you move it to the front, that would render ALPHA and DIGIT useless) and it also means that CHAR no longer does what you want because you want digits and letters to be seen as CHARs - but only inside strings.
The real problem here is that none of these things should be tokens. They should either be fragments or just be inlined directly into the lexer rules that use them. Instead your tokens should be anything inside of which you don't want to allow/ignore spaces or comments. So string literals, int literals and identifiers should all be tokens. The only instance where you have multiple lexer rules that could match the same input should be identifiers and keywords (where keywords take precedence over identifiers because you specify them as string literals in the grammar, but longer identifiers could still contain keywords as a substring because of the maximal munch rule).
You also should remove all uses of ' ' from your grammar and instead always skip spaces.
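Put together, the lexer could be reduced to something like this (a sketch of the advice above, not a tested drop-in grammar):

DATA_TYPE : 'int' | 'boolean' ;
IDENTIFIER : ALPHA ALPHA_NUM* ;
INT_LITERAL : DIGIT+ | '0x' HEX_DIGIT+ ;
CHAR_LITERAL : '\'' . '\'' ;
STRING_LITERAL : '"' ~["\r\n]* '"' ;
WS : [ \t\r\n]+ -> skip ;
LINE_COMMENT : '//' ~[\r\n]* -> skip ;
fragment ALPHA : [a-zA-Z] ;
fragment ALPHA_NUM : ALPHA | DIGIT ;
fragment DIGIT : [0-9] ;
fragment HEX_DIGIT : [0-9a-fA-F] ;

Keywords such as 'class', 'if' or 'void' can stay as string literals in the parser rules; as noted above, ANTLR treats them as implicit tokens defined before all others, so they take precedence over IDENTIFIER on ties while longer identifiers still win by maximal munch.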
While checking the Python grammar in the official documentation, here is what it says:
atom_expr: ['await'] atom trailer*
atom: ('(' [yield_expr|testlist_comp] ')' |
'[' [testlist_comp] ']' |
'{' [dictorsetmaker] '}' |
NAME | NUMBER | STRING+ | '...' | 'None' | 'True' | 'False')
testlist_comp: (test|star_expr) ( comp_for | (',' (test|star_expr))* [','] )
trailer: '(' [arglist] ')' | '[' subscriptlist ']' | '.' NAME
Then 10.bit_length() is valid syntax according to that definition, but not according to the Python interpreter. In contrast, n=10;n.bit_length() is valid syntax for both the specification and the interpreter.
Where should I find the real definition of atom and atom_expr?
Thanks to juanpa's comment and the answers in the related question, it appears that the problem comes from 10.: the definition of NUMBER includes the dot, so that 10.bit_length() is of the form NUMBER NAME trailer and not NUMBER '.' NAME trailer.
In order to obtain an atom_expr, one must separate the dot: both 10 .bit_length() and (10).bit_length() give the correct answer.
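You can watch CPython's own tokenizer make exactly that split (a quick check with the standard tokenize module):

import io, tokenize

for tok in tokenize.generate_tokens(io.StringIO("10.bit_length()").readline):
    print(tokenize.tok_name[tok.type], repr(tok.string))
# prints NUMBER '10.', NAME 'bit_length', OP '(', OP ')', ...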
I want to print a table in Python with neatly aligned columns.
Clearly, I want to use the .format() method, but I have long floats that look like this: 1464.1000000000001. I need the floats to be rounded so that they look like this: 1464.10 (always two decimal places, even if both are zeros, so I can't use the round() function).
I can round the floats using "{0:.2f}".format("1464.1000000000001"), but then they do not print into nice tables.
I can put them into nice tables by doing "{0:>15}".format("1464.1000000000001"), but then they are not rounded.
Is there a way to do both? Something like "{0:>15,.2f}".format("1464.1000000000001")?
You were almost there, just remove the comma (and pass in a float number, not a string):
"{0:>15.2f}".format(1464.1000000000001)
See the Format Specification Mini-Language section:
format_spec ::= [[fill]align][sign][#][0][width][,][.precision][type]
fill ::= <any character>
align ::= "<" | ">" | "=" | "^"
sign ::= "+" | "-" | " "
width ::= integer
precision ::= integer
type ::= "b" | "c" | "d" | "e" | "E" | "f" | "F" | "g" | "G" | "n" | "o" | "s" | "x" | "X" | "%"
Breaking the above format down then:
fill: <empty>
align: > # right
sign: <not specified>
width: 15
precision: 2
type: f
Demo:
>>> "{0:>15.2f}".format(1464.1000000000001)
' 1464.10'
Note that for numbers, the default alignment is to the right, so the > could be omitted.
"{0:15.2f}".format(1464.1000000000001)
I always find this site useful for this stuff:
https://pyformat.info/
I have config files structured like a simplified C syntax, e.g.:
Main { /* some comments */
    VariableName1 = VariableValue1;
    VariableName2 = VariableValue2;
    SubSection {
        VariableName1 = VariableValue1; // inline comment
        VariableName2 = VariableValue2;
    }
    VariableName3 = "StringValue4";
}
Sections may be recursively nested.
How can I parse that file into a dict in a clean and "pythonish" way?
[EDIT]
OK, I've found the pyparsing module :) but maybe someone can tell me how to do this without it.
[EDIT2]
Out of curiosity, I want to know for the future how to write this (I think) simple task by hand.
Use a parser like SimpleParse, just feed it the EBNF definition.
You have the format documented in some sort of BNF, don't you? If not, you can tell the next genius inventing another config format instead of using JSON, XML or YAML that he is not authorized to reinvent the wheel unless he can specify the syntax using EBNF.
It may take some time to write a grammar if you are not familiar with EBNF, but it pays off. It will make your code well documented, rock solid and easier to maintain.
See the Python wiki about Language Parsing for other options.
If you try to pull some stunt using str.split or regular expressions, every other developer who has to maintain that piece of code will curse you.
UPDATE:
It just occurred to me that if you replace each SectionName with "SectionName":, turn each ; into a comma, and enclose the main section in a pair of curly braces, this format is likely to be valid JSON.
"Name" = JSON Grammar
"Author" = Arsène von Wyss
"Version" = 1.0
"About" = 'Grammar for JSON data, following http://www.json.org/'
! and compliant with http://www.ietf.org/rfc/rfc4627
"Start Symbol" = <Json>
"Case Sensitive" = True
"Character Mapping" = 'Unicode'
! ------------------------------------------------- Sets
{Unescaped} = {All Valid} - {&1 .. &19} - ["\]
{Hex} = {Digit} + [ABCDEFabcdef]
{Digit9} = {Digit} - [0]
! ------------------------------------------------- Terminals
Number = '-'?('0'|{Digit9}{Digit}*)('.'{Digit}+)?([Ee][+-]?{Digit}+)?
String = '"'({Unescaped}|'\'(["\/bfnrt]|'u'{Hex}{Hex}{Hex}{Hex}))*'"'
! ------------------------------------------------- Rules
<Json> ::= <Object>
| <Array>
<Object> ::= '{' '}'
| '{' <Members> '}'
<Members> ::= <Pair>
| <Pair> ',' <Members>
<Pair> ::= String ':' <Value>
<Array> ::= '[' ']'
| '[' <Elements> ']'
<Elements> ::= <Value>
| <Value> ',' <Elements>
<Value> ::= String
| Number
| <Object>
| <Array>
| true
| false
| null
You need to parse it recursively, using this Backus-Naur form, starting from PARSE:
PARSE: '{' VARASSIGN VARASSIGN [PARSE [VARASSIGN]] '}'
VARASSIGN : VARIABLENAME '=' '"' STRING '"'
VARIABLENAME: STRING
STRING: [[:alpha:]][[:alnum:]]*
Because your structure is simple, you can use a predictive LL(1) parser.
1) Write a tokenizer, i.e. a function that will read the character stream and turn it into a list of Identifier, OpeningBrace, ClosingBrace, EqualSign and SemiColon tokens; comments and spaces are discarded. This could be done using regexps (see the sketch after this list).
2) Write a simple parser. Skip the first Identifier and OpeningBrace.
The parser expects an Identifier followed by one of EqualSign or OpeningBrace, or a ClosingBrace.
2.1) If EqualSign, must be followed by Identifier and SemiColon.
2.2) If OpeningBrace, invoke the parser recursively.
2.3) If ClosingBrace, return from the recursive call.
In the processing of 2.1, enter the desired data into the dict, the way you like. You could prefix identifiers with the names of the enclosing blocks, e.g.
{"Main.SubSection.VariableName1": VariableValue1}
Here is prototype code for the parser, to be called after the tokenizer. It scans a string where a letter stands for an identifier, separators are ={}; and the last token must be a $.
def Parse(String, Prefix="", Nest=0):
    global Cursor
    if Nest == 0:
        Cursor = 0
    # Scan the input string
    while String[Cursor + 0].isalpha():
        # Identifier, starts an Assignment or a Block (Id |)
        if String[Cursor + 1] == "=":
            # Assignment, lookup (Id= | Id;)
            if String[Cursor + 2].isalpha():
                if String[Cursor + 3] == ";":
                    # Accept the assignment (Id=Id; |)
                    print(Nest * " " + Prefix + String[Cursor] + "=" + String[Cursor + 2] + ";")
                    Cursor += 4
        elif String[Cursor + 1] == "{":
            # Block, lookup (Id{ | )
            print(Nest * " " + String[Cursor] + "{")
            Cursor += 2
            # Recurse
            Parse(String, Prefix + String[Cursor - 2] + "::", Nest + 4)
        else:
            # Unexpected token
            break
    if String[Cursor + 0] == "}":
        # Block complete, (Id{...} |)
        print((Nest - 4) * " " + "}")
        Cursor += 1
        return
    if Nest == 0 and String[Cursor + 0] == "$":
        # Done
        return
    print("Syntax error at", String[Cursor:], ":(")

Parse("C{D=E;X{Y=Z;}F=G;}H=I;A=B;$")
When executed, it outputs:
C{
    C::D=E;
    X{
        C::X::Y=Z;
    }
    C::F=G;
}
H=I;
A=B;
proving that it did detect the nesting. Replace the print calls with whatever processing you like.
You could use pyparsing to write a parser for this format.
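For example (a rough sketch of my own, leaning on pyparsing's Group/Dict/Forward helpers; not a tested drop-in):

import pyparsing as pp

ident = pp.Word(pp.alphas + '_', pp.alphanums + '_')
value = pp.QuotedString('"') | ident
assignment = pp.Group(ident + pp.Suppress('=') + value + pp.Suppress(';'))
section = pp.Forward()
body = pp.Dict(pp.ZeroOrMore(assignment | section))
section <<= pp.Group(ident + pp.Suppress('{') + body + pp.Suppress('}'))
config = pp.Dict(pp.OneOrMore(section))
config.ignore(pp.cStyleComment)    # /* ... */
config.ignore(pp.dblSlashComment)  # // ...

sample = '''
Main { /* some comments */
    VariableName1 = VariableValue1;
    SubSection {
        VariableName2 = VariableValue2; // inline comment
    }
    VariableName3 = "StringValue4";
}
'''
print(config.parseString(sample).asDict())  # nested dicts keyed by section/variable names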