I'm developing a translator that turns a simple script written on a PC into bytecode, which is then executed on a microcontroller.
I've written the translator in C++ using lex and re2c, but I'm considering switching to pyparsing.
In order to translate a statement of my script into a few bytecode operations, I need to get the Abstract Syntax Tree of that statement.
E.g. this script:
X = 1 - 2;
Should be translated to binary equivalent of this:
register1 <- 1
register2 <- 2
register3 <- register1 - register2
x <- register3
I've got this Python code:

from pyparsing import Word, Combine, Forward, Literal, nums, alphas, alphanums

number = Word(nums)
integer = Combine(number)
ident = Word(alphas, alphanums)
assign = Literal('=')
lpar, rpar = Literal('('), Literal(')')
addop = Literal('+') | Literal('-')
multop = Literal('*') | Literal('/')

expr = Forward()
atom = (integer |
        (lpar + expr.suppress() + rpar))
expr << (atom + (addop | multop) + atom)
statement = ident + assign + expr

L = statement.parseString(line)
Is there an example of visiting the leaves of the AST in L? Or something similar to that...
Thanks in advance.
Your current parser will just give you a flat list of parsed tokens, since that is pyparsing's default. The idea is that, regardless of how you build up your parser (in smaller pieces put together, or in one giant statement), the tokens you get from parsing are structured (or not structured) the same way. To get something akin to an AST, you need to define where you want structure using pyparsing's Group class (and I recommend using results names as well). So, for example, if you change statement to:
statement = Group(ident("lhs") + '=' + Group(expr)("rhs"))
Then your output will be much more predictable - every parsed statement will have 3 top-level elements - the target identifier (addressable as result.lhs), the '=' operator, and the source expression (addressable as result.rhs). The source expression may have further structure to it, but overall there will always be these 3 at the top-most level in every statement.
To ensure the parenthetical groups in your RHS expression are retained when evaluating your expr, again, use a Group:
atom = (integer | Group(lpar + expr + rpar))
You can navigate the hierarchical structure of the parsed results as if you were walking a list of nested lists.
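For instance, once statement and the parenthetical atoms are Grouped as above, the result of parsing "X = 1 - 2;" behaves like the nested list ['X', '=', ['1', '-', '2']]. A small recursive walk over that shape can already emit the register code from the question (plain lists are used here as a stand-in for pyparsing's ParseResults, so the sketch runs without pyparsing installed):

```python
from itertools import count

def emit_expr(node, code, regs):
    """Emit loads/ops for one expression node; return the register holding its value."""
    if isinstance(node, list):                  # [operand, op, operand]
        left = emit_expr(node[0], code, regs)
        right = emit_expr(node[2], code, regs)
        dest = f"register{next(regs)}"
        code.append(f"{dest} <- {left} {node[1]} {right}")
        return dest
    dest = f"register{next(regs)}"              # literal or identifier: plain load
    code.append(f"{dest} <- {node}")
    return dest

# shape of a Grouped parse of "X = 1 - 2;": [lhs, '=', [operand, op, operand]]
stmt = ['X', '=', ['1', '-', '2']]
code, regs = [], count(1)
result = emit_expr(stmt[2], code, regs)
code.append(f"{stmt[0].lower()} <- {result}")
print("\n".join(code))
```

Since the walk recurses whenever it sees a nested list, a more deeply parenthesized expression such as "X = (1 - 2) - 3;" would be handled by the same code unchanged.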
But I would also encourage you to look at the SimpleBool example on the pyparsing wiki. In this example, the various parsed expressions get rendered into instances of classes which then can be processed using a visitor or just an iterator, and each class can then implement its own special logic for emitting your bytecode. Imagine that you had written a typical parser to generate an AST, then walked the AST to create CodeGenerator objects which subclass into AssignmentCodeGenerator or IfCodeGenerator or PrintCodeGenerator classes, and then walked this structure to create your bytecode. Instead, you can define assignment, if-then-else, or print statement expressions in pyparsing, have pyparsing create the classes directly, and then walk the classes to create the bytecode. In the end, your code is neatly organized into different statement types, and each type encapsulates the type of bytecode that it should output.
Related
I'm studying parsing with Python. I have user-defined instructions, so I have to specify their precedence. I found an example; here is the link.
I don't understand what they do here:
precedence = (
('left','PLUS','MINUS'),
('left','TIMES','DIVIDE'),
('right','UMINUS'),
)
How does Python prioritize them?
And also these:
def p_statement_assign(t):
    'statement : NAME EQUALS expression'
    names[t[1]] = t[3]

def p_statement_expr(t):
    'statement : expression'
    print(t[1])
What does it mean to write 'statement : expression' in quotation marks? How does Python understand and make sense of it?
I'm adding my instructions too. I will use them for drawing something in my program:
F n -> go on n step
R n -> turn right n degree
L n -> Repeat the parentheses n times
COLOR f -> f: line color
PEN n -> line thickness
These instructions are read by ply. Any Python function/class/module can have a string written at the beginning of it, called a docstring, and you can use the __doc__ attribute to retrieve it. Ply cleverly uses these docstrings as annotations to define the parsing rules. The rule statement : NAME EQUALS expression can be read as: if there is a token stream that matches the sequence NAME, then an EQUALS sign, and finally an expression, it will be reduced to a statement.
The same goes for the precedence variable, which ply also reads to define precedence and associativity: each entry is an associativity ('left' or 'right') followed by token names, and entries lower in the tuple bind more tightly. So TIMES and DIVIDE have higher precedence than PLUS and MINUS, and UMINUS (applied to a rule via %prec, for unary minus) has the highest precedence of all.
I recommend you read the ply documentation before using it as you need to know the basics about tokenizing and parsing before you can use a compiler construction tool like ply.
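As a minimal illustration of the docstring mechanism ply relies on (no ply installation needed to see it):

```python
# Any string literal at the top of a function body is stored as its docstring.
# ply introspects every p_* rule function and reads exactly this attribute
# to learn the grammar rule the function implements.
def p_statement_assign(t):
    'statement : NAME EQUALS expression'
    pass

print(p_statement_assign.__doc__)  # statement : NAME EQUALS expression
```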
I can create a multi-line string using this syntax:
string = str("Some chars "
             "Some more chars")
This will produce the following string:
"Some chars Some more chars"
Is Python joining these two separate strings or is the editor/compiler treating them as a single string?
P.s: I just want to understand the internals. I know there are other ways to declare or create multi-line strings.
Read the reference manual, it's in there.
Specifically:
Multiple adjacent string or bytes literals (delimited by whitespace), possibly using different quoting conventions, are allowed, and their meaning is the same as their concatenation. Thus, "hello" 'world' is equivalent to "helloworld". This feature can be used to reduce the number of backslashes needed, to split long strings conveniently across long lines, or even to add comments to parts of strings.
(emphasis mine)
This is why:
string = str("Some chars "
             "Some more chars")
is exactly the same as: str("Some chars Some more chars").
This action is performed wherever a string literal might appear: list initializations, function calls (as is the case with str above), etc.
The only caveat is when a string literal is not contained between one of the grouping delimiters (), {} or [] but, instead, spreads between two separate physical lines. In that case we can alternatively use the backslash character to join these lines and get the same result:
string = "Some chars " \
"Some more chars"
Of course, concatenation of strings on the same physical line does not require the backslash. (string = "Hello " "World" is just fine)
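A quick self-check of the rule quoted above:

```python
# Adjacent string literals are merged when the source is compiled,
# before any code runs.
a = "Hello " "World"
assert a == "Hello World"

# The same holds across physical lines inside parentheses:
b = ("Some chars "
     "Some more chars")
print(b)  # Some chars Some more chars
```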
Is Python joining these two separate strings or is the editor/compiler treating them as a single string?
Python is. Now, when exactly Python does this is where things get interesting.
From what I could gather (take this with a pinch of salt, I'm not a parsing expert), this happens when Python transforms the parse tree (built by its LL(1) parser) for a given expression into its corresponding AST (Abstract Syntax Tree).
You can get a view of the parsed tree via the parser module:
import parser
expr = """
str("Hello "
"World")
"""
pexpr = parser.expr(expr)
parser.st2list(pexpr)
This dumps a pretty big and confusing list that represents the concrete syntax tree parsed from the expression in expr:
-- rest snipped for brevity --
[322,
 [323,
  [3, '"Hello "'],
  [3, '"World"']]]]]]]]]]]]]]]]]],
-- rest snipped for brevity --
The numbers correspond to either symbols or tokens in the parse tree and the mappings from symbol to grammar rule and token to constant are in Lib/symbol.py and Lib/token.py respectively.
As you can see in the snipped version I added, you have two different entries corresponding to the two different str literals in the expression parsed.
Next, we can view the output of the AST tree produced by the previous expression via the ast module provided in the Standard Library:
import ast

p = ast.parse(expr)
ast.dump(p)
# this prints out the following:
"Module(body=[Expr(value=Call(func=Name(id='str', ctx=Load()), args=[Str(s='Hello World')], keywords=[]))])"
The output is more user-friendly in this case; you can see that the args for the function call contain the single concatenated string 'Hello World'.
In addition, I also stumbled upon a cool module that generates a visualization of the tree for ast nodes. Using it, the output of the expression expr is visualized like this:
Image cropped to show only the relevant part for the expression.
As you can see, in the terminal leaf node we have a single str object, the joined string for "Hello " and "World", i.e. "Hello World".
If you are feeling brave enough, dig into the source: the code for transforming expressions into a parse tree is located in Parser/pgen.c, while the code transforming the parse tree into an Abstract Syntax Tree is in Python/ast.c.
This information is for Python 3.5 and I'm pretty sure that unless you're using some really old version (< 2.5) the functionality and locations should be similar.
Additionally, if you are interested in the whole compilation process Python follows, a good gentle intro is provided by one of the core contributors, Brett Cannon, in the video From Source to Code: How CPython's Compiler Works.
I'm trying to learn how to deobfuscate code that is unnecessarily complicated. For example, I would like to be able to rewrite this line of code:
return ('d' + chr(101) + chr(97) + chr(200 - 100)) # returns 'dead'
to:
return 'dead'
So basically, I need to evaluate all literals within the py file, including complicated expressions that evaluate to simple integers. How do I go about writing this reader / is there something that exists that can do this? Thanks!
What you want is a program transformation system (PTS).
This is a tool for parsing source code to an AST, transforming the tree, and then regenerating valid source code from the tree. See my SO answer on rewriting Python text for some background.
With a PTS like (my company's) DMS Software Reengineering Toolkit, you can write rules to do constant folding, which essentially means doing the arithmetic at compile time.
For the example you show, the following rules accomplish the transformation:
rule fold_subtract_naturals(n:NATURAL, m:NATURAL): sum->sum =
  " \n - \m " -> " \subtract_naturals\(\n\,\m\) ";

rule convert_chr_to_string(c:NATURAL): term->term =
  " chr(\c) " -> make_string_from_natural(c);

rule convert_character_literal_to_string(c:CHARACTER): term->term =
  " \c " -> make_string_from_character(c);

rule fold_concatenate_strings(s1:STRING, s2:STRING): sum->sum =
  " \s1 + \s2 " -> " \concatenate_strings\(\s1\,\s2\) ";

ruleset fold_strings = {
  fold_subtract_naturals,
  convert_chr_to_string,
  convert_character_literal_to_string,
  fold_concatenate_strings };
Each of the individual rules matches corresponding syntax/trees. They are written in such a way that they only apply to literal constants.
fold_subtract_naturals finds pairs of NATURAL constants joined by a subtraction operation and replaces them with the difference, using a built-in function that subtracts the two values and produces a literal value node containing the result.
convert_chr_to_string converts chr(c) to the corresponding string literal.
convert_character_literal_to_string converts 'C' to the corresponding string "C".
fold_concatenate_strings combines two literal strings separated by an add operator. It works analogously to the way that fold_add_naturals works.
subtract_naturals and concatenate_strings are built into DMS. The helpers make_string_from_natural and make_string_from_character need to be custom-coded in DMS's metaprogramming language, PARLANSE, but these routines are pretty simple (maybe 10 lines each).
The ruleset packages up the set of rules so they can all be applied.
Not shown is the basic code to open a file, call the parser, invoke the ruleset transformer (which applies rules until no rule applies). The last step is to call the prettyprinter to reprint the modified AST.
Many other PTS offer similar facilities.
I'm learning how parsers work by creating a simple recursive descent parser. However I'm having a problem defining my grammar to be LL(1). I want to be able to parse the following two statements:
a = 1
a + 1
To do this I've created the following grammar rules:
statement: assignment | expression
assignment: NAME EQUALS expression
expression: term [(PLUS|MINUS) term]
term: NAME | NUMBER
However, this leads to ambiguity with an LL(1) parser: when a NAME token is encountered in the statement rule, the parser cannot tell whether it starts an assignment or an expression with only one token of look-ahead.
Python's grammar is LL(1) so I know this is possible to do but I can't figure out how to do it. I've looked at Python's grammar rules found here (https://docs.python.org/3/reference/grammar.html) but I'm still not sure how they implement this.
Any help would be greatly appreciated :)
Just treat = as an operator with very low precedence. However (unless you want a language like C where = really is an operator with very low precedence), you need to exclude it from internal (e.g. parenthetic) expressions.
If you had only multiplication and addition, you could use:
expression: factor ['+' factor]
factor: term ['*' term]
term: ID | NUMBER | '(' expression ')'
That is the standard pattern for operator precedence: '*' has higher precedence than '+' because the arguments to '+' can include '*' expressions but not vice versa. So we could just add assignment:
statement: expression ['=' expression]
Unfortunately, that would allow, for example:
(a + 1) = b
which is undesirable. So it needs to be eliminated, but it is possible to eliminate it when the production is accepted (by a check of the form of the first expression), rather than in the grammar itself. As I understand it, that's what the Python parser does; see the long comment about test and keywords.
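A minimal recursive-descent sketch of that scheme (hypothetical tokenizer, tuples as the AST): '=' sits at the lowest-precedence level, and the bad-LHS case is rejected after the production is matched rather than in the grammar:

```python
import re

def tokenize(src):
    # hypothetical toy lexer: names, numbers, and single-char operators
    return re.findall(r"[A-Za-z_]\w*|\d+|[-+*=()]", src)

class Parser:
    def __init__(self, tokens):
        self.toks, self.i = tokens, 0
    def peek(self):
        return self.toks[self.i] if self.i < len(self.toks) else None
    def next(self):
        t = self.peek(); self.i += 1; return t
    def term(self):                       # term: NAME | NUMBER | '(' expression ')'
        t = self.next()
        if t == '(':
            e = self.expression()
            assert self.next() == ')'
            return e
        return t
    def factor(self):                     # factor: term ('*' term)*
        node = self.term()
        while self.peek() == '*':
            self.next()
            node = ('*', node, self.term())
        return node
    def expression(self):                 # expression: factor (('+'|'-') factor)*
        node = self.factor()
        while self.peek() in ('+', '-'):
            op = self.next()
            node = (op, node, self.factor())
        return node
    def statement(self):                  # statement: expression ('=' expression)?
        lhs = self.expression()
        if self.peek() == '=':
            self.next()
            # post-hoc check: the assignment target must be a bare name
            if not (isinstance(lhs, str) and lhs.isidentifier()):
                raise SyntaxError("cannot assign to expression")
            return ('=', lhs, self.expression())
        return lhs

print(Parser(tokenize("a = 1")).statement())   # ('=', 'a', '1')
print(Parser(tokenize("a + 1")).statement())   # ('+', 'a', '1')
```

With the post-hoc check in place, Parser(tokenize("(a + 1) = b")).statement() raises SyntaxError, while both a = 1 and a + 1 parse with a single token of look-ahead.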
If you used an LR(1) parser instead, you wouldn't have this problem.