While checking the Python grammar in the official documentation, here is what it reads:
atom_expr: ['await'] atom trailer*
atom: ('(' [yield_expr|testlist_comp] ')' |
'[' [testlist_comp] ']' |
'{' [dictorsetmaker] '}' |
NAME | NUMBER | STRING+ | '...' | 'None' | 'True' | 'False')
testlist_comp: (test|star_expr) ( comp_for | (',' (test|star_expr))* [','] )
trailer: '(' [arglist] ')' | '[' subscriptlist ']' | '.' NAME
Then 10.bit_length() is valid syntax according to that definition, but not according to the Python interpreter. By contrast, n=10; n.bit_length() is valid syntax for both the specification and the interpreter.
Where should I find the real definition of atom and atom_expr?
Thanks to juanpa's comment and the answers in the related question, it appears that the problem comes from 10.: the definition of NUMBER includes the dot, so that 10.bit_length() is of kind NUMBER NAME trailer and not NUMBER '.' NAME trailer.
In order to obtain an atom_expr, one must separate the dot: both 10 .bit_length() and (10).bit_length() give the correct answer.
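A quick interpreter check confirms both spellings (a minimal sketch; 10 is 0b1010, so its bit_length is 4):
>>> 10 .bit_length()    # the space ends the NUMBER token before the dot
4
>>> (10).bit_length()   # the parentheses close off the integer literal
4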
Why can we directly access a method belonging to a string literal:
keyfunc="STR".__eq__
or a float constant:
keyfunc=1.0.__eq__
or even:
keyfunc=1..__eq__ # 1. is the same float as 1.0
but the same code for an integer throws a syntax error?
keyfunc=1.__eq__ # WRONG!
The last line should be written as:
keyfunc=(1).__eq__
Why and when are the parens required?
Reason for this to happen:
It's because 1. gets treated as a float:
>>> 1.
1.0
>>>
And obviously:
>>> 1.0__eq__
SyntaxError: invalid decimal literal
>>>
Would give an error.
It's because Python tokenizes from left to right: the dot is greedily absorbed into the 1 to make a float before the parser ever sees an attribute access.
Workaround for this other than parentheses:
So the way to fix it would be to add a space between 1 and the dot, like:
>>> 1 .__eq__
<method-wrapper '__eq__' of int object at 0x00007FFDF14B2730>
>>>
Reasoning for these workarounds to work:
The reason this works is because:
>>> 1 .
SyntaxError: invalid syntax
>>>
gives an error rather than evaluating to a float: with the space, 1 . is not read as a float literal, so the 1 stays an integer and the dot becomes a separate attribute-access token.
It's the same case for (1).
As you can see:
>>> (1).
SyntaxError: invalid syntax
>>>
Documentation references:
As shown on this page of the documentation:
integer ::= decimalinteger | octinteger | hexinteger | bininteger
decimalinteger ::= nonzerodigit digit* | "0"+
nonzerodigit ::= "1"..."9"
digit ::= "0"..."9"
octinteger ::= "0" ("o" | "O") octdigit+
hexinteger ::= "0" ("x" | "X") hexdigit+
bininteger ::= "0" ("b" | "B") bindigit+
octdigit ::= "0"..."7"
hexdigit ::= digit | "a"..."f" | "A"..."F"
bindigit ::= "0" | "1"
The above are the integer literal definitions in Python.
It also has to do with how the Python tokenizer works. The main confusion is that the tokenizer doesn't know which side the period belongs to, i.e. keyfunc=[1.]__eq__ or keyfunc=1[.__eq__]. Without the parentheses of (1) or (as mentioned) the added space, it cannot tell whether the literal has ended or not.
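You can watch the tokenizer make this decision with the standard-library tokenize module. A minimal sketch (the exact outcome for the first spelling varies by Python version, so the error is caught broadly):
import io
import tokenize

for src in ("1.__eq__", "1 .__eq__"):
    print(repr(src))
    try:
        for tok in tokenize.generate_tokens(io.StringIO(src).readline):
            print("   ", tokenize.tok_name[tok.type], repr(tok.string))
    except (tokenize.TokenError, SyntaxError) as exc:
        print("   ", exc)
Depending on the version, 1.__eq__ either tokenizes as NUMBER '1.' followed by NAME '__eq__' (which the parser then rejects) or fails outright with an invalid-decimal-literal error, while 1 .__eq__ always comes out as NUMBER '1', OP '.', NAME '__eq__'.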
I am using LALR(1) parsing from the lark-parser library. I have written a grammar to parse an ORM-like language. An example of my language is pasted below:
Table1
.join(table=Table2, left_on=[column1], right_on=[column_2])
.group_by(col=[column1], agg=[sum])
.join(table=Table3, left_on=[column1], right_on=[column_3])
.some_column
My grammar is:
start: [CNAME (object)*]
object: "." (CNAME|operation)
operation: [(join|group) (object)*]
join: "join" "(" [(join_args ",")* join_args] ")"
join_args: "table" "=" CNAME
| "left_on" "=" list
| "right_on" "=" list
group: "group_by" "(" [(group_args ",")* group_args] ")"
group_args: "col" "=" list
| "agg" "=" list
list: "[" [CNAME ("," CNAME)*] "]"
%import common.CNAME //# Variable name declaration
%import common.WS //# White space declaration
%ignore WS
When I parse the language, it gets parsed correctly, but I get a shift-reduce conflict warning. I believe this is due to a collision at object: "." (CNAME|operation), but I may be wrong. Is there any other way to write this grammar?
I think you should replace
operation: [(join|group) (object)*]
With just
operation: join | group
You've already allowed repetition of object in
start: [CNAME (object)*]
So also allowing object* at the end of operation is ambiguous, leading to the conflict.
Personally, I would have gone for something like:
start : [ CNAME ("." qualifier)* ]
qualifier: CNAME | join | group
Because I don't see the point of object. But that's just a minor style difference.
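For reference, a runnable sketch of that variant, assuming the lark package is installed (the rules are the question's, with qualifier in place of object):
from lark import Lark

grammar = r'''
start: [CNAME ("." qualifier)*]
qualifier: CNAME | join | group
join: "join" "(" [(join_args ",")* join_args] ")"
join_args: "table" "=" CNAME
         | "left_on" "=" list
         | "right_on" "=" list
group: "group_by" "(" [(group_args ",")* group_args] ")"
group_args: "col" "=" list
          | "agg" "=" list
list: "[" [CNAME ("," CNAME)*] "]"

%import common.CNAME
%import common.WS
%ignore WS
'''

parser = Lark(grammar, parser="lalr")  # LALR(1), as in the question
tree = parser.parse('''
Table1
    .join(table=Table2, left_on=[column1], right_on=[column_2])
    .group_by(col=[column1], agg=[sum])
    .some_column
''')
print(tree.pretty())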
I ran into this problem when dealing with ANTLR4, parsing via its Python runtime.
The grammar:
grammar SimpleCode;
program : 'class' ' ' 'Program' ' ' '{' field_decl* method_decl* '}' ;
field_decl : DATA_TYPE variable (',' variable)* ';' ;
method_decl: (DATA_TYPE | 'void') identifier '(' method_params? ')' block ;
variable : identifier | identifier '[' int_literal ']' ;
method_params : DATA_TYPE identifier (',' DATA_TYPE identifier)* ;
block : '{' var_decl* statement* '}' ;
var_decl : DATA_TYPE identifier (',' identifier)* ';';
statement : location assign_op expr ';' | method_call ';' | 'if' '(' (expr) ')' block ('else' block)? | 'for' identifier '=' (expr) ',' (expr) block | 'return' (expr)? ';' | 'break' ';' | 'continue' ';' | block ;
assign_op : '=' | '+=' | '-=' ;
method_call : method_name '(' method_call_params? ')' | 'callout' (string_literal (',' callout_arg (',' callout_arg)*)?) ;
method_call_params : DATA_TYPE identifier (',' DATA_TYPE identifier)* ;
method_name : identifier ;
location : identifier | identifier '[' expr ']' ;
expr : location | method_call | literal | expr bin_op expr | '-' expr | '!' expr | '(' expr ')' ;
callout_arg : expr | string_literal ;
bin_op : arith_op | rel_op | eq_op | cond_op ;
arith_op : '+' | '-' | '*' | '/' | '%' ;
rel_op : '<' | '>' | '<=' | '>=' ;
eq_op : '==' | '!=' ;
cond_op : '&&' | '||' ;
literal : int_literal | char_literal | bool_literal ;
identifier : ALPHA alpha_num* ;
alpha_num : ALPHA | DIGIT ;
int_literal : decimal_literal | hex_literal ;
decimal_literal : DIGIT DIGIT* ;
hex_literal : '0x' HEX_DIGIT HEX_DIGIT* ;
bool_literal : 'true' | 'false' ;
CHAR: . ;
char_literal : '\'' CHAR '\'' ;
string_literal : '"' CHAR* '"' ;
DATA_TYPE : INT | BOOL ;
INT : 'int' ;
BOOL : 'boolean' ;
ALPHA : [a-zA-Z] ;
DIGIT : [0-9] ;
HEX_DIGIT : [0-9a-fA-F] ;
White : [ \t]+ -> skip ;
Newline : ( '\r' '\n'? | '\n' ) -> skip ;
LineComment : '//' ~[\r\n]* -> skip ;
My Python code to parse:
from antlr4 import *
from SimpleCodeLexer import SimpleCodeLexer
from SimpleCodeListener import SimpleCodeListener
from SimpleCodeParser import SimpleCodeParser
import sys

class SimpleCodePrintListener(SimpleCodeListener):
    def enterProgram(self, ctx):
        print(ctx.getText())
        print(ctx.toStringTree())
        # for child in ctx.getChildren():
        #     print(child.getText(), child.getSymbol())

def main():
    input_stream = FileStream('in.in')
    lexer = SimpleCodeLexer(input_stream)
    stream = CommonTokenStream(lexer)
    parser = SimpleCodeParser(stream)
    tree = parser.program()
    printer = SimpleCodePrintListener()
    walker = ParseTreeWalker()
    walker.walk(printer, tree)

if __name__ == '__main__':
    print('Starting parse....')
    main()
And the in.in file:
class Program {
    int main() {
        int v;
        v = 1;
        v = 'c';
        v = true;
        return 0;
    }
}
I got this error after running the Python code:
line 2:7 no viable alternative at input 'int '
The result of the first print is:
class Program {int main() {int v;v = 1;v = 'c';v = true;return 0; }}
([] class Program { int m a i n ( ) { int v ; v = 1 ; v = ' c ' ; v = true ; return 0 ; } })
I'm a newbie to ANTLR4, so are there any special cases to handle with lexers and tokens? After hours of searching on the internet, my best guess is that the main problem is that DATA_TYPE is used in many different places in the grammar.
When debugging issues like this, it often helps to print the token stream that's generated for the given input. You can do that by running grun with the option -tokens or by iterating over stream in your main function.
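For example, with the Python runtime something along these lines dumps the buffered tokens before parsing; a sketch (symbolicNames is part of the generated lexer, and fill() forces the lexer to consume the whole input):
stream = CommonTokenStream(lexer)
stream.fill()  # tokenize everything up front
for tok in stream.tokens:
    # literal tokens show up as '<INVALID>' in symbolicNames
    name = SimpleCodeLexer.symbolicNames[tok.type] if 0 <= tok.type < len(SimpleCodeLexer.symbolicNames) else str(tok.type)
    print(name, repr(tok.text))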
If you do that, you'll see that main is tokenized as a sequence of four CHAR tokens, whereas your identifier rule expects ALPHA tokens, not CHAR. So that's the immediate problem, but it's not the only problem in your code:
The first thing I noticed when I tried your code is that I got errors on the line breaks. The reason that this happens for me and not for you is (presumably) that you're using Windows line breaks (\r\n) and I'm not. Your lexer recognizes \r\n as a line break and skips it, but just \n is recognized as a CHAR.
Further, your handling of spaces is very confusing. A single space is its own token: it has to appear in certain places and can't appear anywhere else. However, multiple consecutive spaces are skipped. So something like int main would be an error, because the single space between int and main becomes a ' ' token that method_decl does not allow. On the other hand, indenting a line with a single space would also be an error, because then the indentation would not be skipped.
Your identifiers are also wonky. Identifiers can contain spaces (as long as it's more than one), line breaks (as long as they're \r\n or you fix it, so that \n is skipped, too) or comments. So the following would be a single valid identifier (assuming you change the lexer, so that the letters are recognized as ALPHA instead of CHAR):
hel lo //comment
wor
ld
On the other hand maintarget would not be a valid identifier because it contains the keyword int.
Similarly, skipped tokens can also appear inside your integer literals and string literals. For string literals that means that "a  b" (with two spaces) is a valid string (which is fine) that only contains the characters a and b (which is not fine), because the double space gets skipped. On the other hand, " " would be an invalid string, because a single space is recognized as a ' ' token, not a CHAR token. Also, if you fix your identifiers by making letters be recognized as ALPHA, letters will no longer be valid inside strings. And "la//la" would be seen as an unclosed string literal, because //la" would be seen as a comment.
All of these issues are related to how the lexer works, so let's go through that:
When turning a character stream into a token stream, the lexer processes the input according to the "maximal munch" rule: it goes through all of the lexer rules and checks which ones match at the beginning of the current input. Of those that match, it picks the one that produces the longest match. In case of ties, it prefers the one that's defined first in the grammar. If you use string literals directly in parser rules, they are treated like lexer rules that are defined before any others.
So the fact that you have a CHAR: .; rule that comes before ALPHA, DIGIT and HEX_DIGIT means that these rules will never be matched. All of these rules match a single character, so when more than one of them matches, CHAR will be preferred because it comes first in the grammar. If you move CHAR to the end, letters will now be matched by ALPHA, decimal digits by DIGIT and everything else by CHAR. This still leaves HEX_DIGIT useless (and if you move it to the front, that would render ALPHA and DIGIT useless) and it also means that CHAR no longer does what you want because you want digits and letters to be seen as CHARs - but only inside strings.
The real problem here is that none of these things should be tokens. They should either be fragments or just be inlined directly into the lexer rules that use them. Instead your tokens should be anything inside of which you don't want to allow/ignore spaces or comments. So string literals, int literals and identifiers should all be tokens. The only instance where you have multiple lexer rules that could match the same input should be identifiers and keywords (where keywords take precedence over identifiers because you specify them as string literals in the grammar, but longer identifiers could still contain keywords as a substring because of the maximal munch rule).
You also should remove all uses of ' ' from your grammar and instead always skip spaces.
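Concretely, one possible reshaping of the lexer along those lines might look like this (a sketch, not a drop-in fix; the parser rules identifier, int_literal, char_literal and string_literal would then reference these tokens instead of assembling them from single characters):
IDENTIFIER : [a-zA-Z] [a-zA-Z0-9]* ;
INT_LITERAL : [0-9]+ | '0' [xX] [0-9a-fA-F]+ ;
CHAR_LITERAL : '\'' ~['\r\n] '\'' ;
STRING_LITERAL : '"' ~["\r\n]* '"' ;

White : [ \t]+ -> skip ;
Newline : ('\r' '\n'? | '\n') -> skip ;
LineComment : '//' ~[\r\n]* -> skip ;
Keywords like 'int', 'boolean' or 'void' stay as string literals in the parser rules; since implicit tokens are defined before all others, they automatically win over IDENTIFIER for exact matches, while longer names like maintarget are still matched as a single IDENTIFIER thanks to maximal munch.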
I want to print a neatly aligned table in Python.
Clearly, I want to use the .format() method, but I have long floats that look like this: 1464.1000000000001. I need the floats to be rounded so that they look like this: 1464.10 (always two decimal places, even if both are zeros, so I can't use the round() function).
I can round the floats using "{0:.2f}".format("1464.1000000000001"), but then they do not print into nice tables.
I can put them into nice tables by doing "{0:>15}".format("1464.1000000000001"), but then they are not rounded.
Is there a way to do both? Something like "{0:>15,.2f}".format("1464.1000000000001")?
You were almost there, just remove the comma (and pass in a float number, not a string):
"{0:>15.2f}".format(1464.1000000000001)
See the Format Specification Mini-Language section:
format_spec ::= [[fill]align][sign][#][0][width][,][.precision][type]
fill ::= <any character>
align ::= "<" | ">" | "=" | "^"
sign ::= "+" | "-" | " "
width ::= integer
precision ::= integer
type ::= "b" | "c" | "d" | "e" | "E" | "f" | "F" | "g" | "G" | "n" | "o" | "s" | "x" | "X" | "%"
Breaking the above format down then:
fill: <empty>
align: > # right
sign: <not specified>
width: 15
precision: 2
type: f
Demo:
>>> "{0:>15.2f}".format(1464.1000000000001)
' 1464.10'
Note that for numbers, the default alignment is to the right, so the > could be omitted.
"{0:15.2f}".format(1464.1000000000001)
I always find this site useful for this stuff:
https://pyformat.info/
While shortening my code, I was cutting a few variable assignments down onto one line.
##For example- going from-
Var1 =15
Var2 = 26
Var3 = 922
##To-
Var1, Var2, Var3 = 15, 26, 922
However, when I tried doing the same thing to this code-
User_Input += Master_Key[Input_Ref]
Key += Master_Key[Key_ref]
Key2 += Master_Key[Key_2_Ref]
##Which looks like-
User_Input, Key, Key2 += Master_Key[Input_Ref], Master_Key[Key_Ref], Master_Key[Key_2_Ref]
This throws the error
SyntaxError: illegal expression for augmented assignment
I have read the relevant Python documentation, but I still can't find a way to shorten this particular bit of code.
No, you cannot. You cannot use augmented assignment together with multiple targets.
You can see this in the Augmented assignment statements section you linked to:
augmented_assignment_stmt ::= augtarget augop (expression_list | yield_expression)
augtarget ::= identifier | attributeref | subscription | slicing
The augtarget rule only allows for one target. Compare this with the Assignment statements rules:
assignment_stmt ::= (target_list "=")+ (expression_list | yield_expression)
target_list ::= target ("," target)* [","]
target ::= identifier
| "(" target_list ")"
| "[" target_list "]"
| attributeref
| subscription
| slicing
where you have a target_list rule to assign to.
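If a single line is really wanted, a regular (non-augmented) assignment that spells out each addition is legal, because it goes through target_list; a sketch using the names from the question:
User_Input, Key, Key2 = (User_Input + Master_Key[Input_Ref],
                         Key + Master_Key[Key_Ref],
                         Key2 + Master_Key[Key_2_Ref])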
That said, I'd not try to shorten this at all; trying to squeeze augmented assignments onto one line does not improve readability or comprehension of what is happening.