I would like to be able to generate a parse tree for python source code. This code does not have to be compilable, e.g.
if x == 5:
should be turned into some sort of tree representation. I can use the Python compiler package to create a tree, but this only works for code that is compilable, e.g.
if x == 5: print True
The paper you linked to says that they used the ast module in the Python standard library. It also says they used a dummy body for the body of the if statement. Use a statement that will be easy to recognize as a dummy body, like pass or a function call like dummy().
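For instance, a minimal sketch with the standard ast module, completing the fragment with pass as the dummy body (any easily recognized dummy statement would do):

import ast

# Complete the fragment with a recognizable dummy body, then parse it.
fragment = "if x == 5:"
tree = ast.parse(fragment + " pass")
print(ast.dump(tree))

The resulting Module contains an If node whose body is a single Pass node, which is easy to recognize and strip out later.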
Our DMS Software Reengineering Toolkit with its Python front end can do this.
DMS provides infrastructure for parsing code, parameterized by a language definition (e.g, a Python grammar, etc.) and automatically building ASTs, as well as the ability to inspect/navigate/change those ASTs, and prettyprint the resulting modified trees.
Its AST parsing machinery can handle a variety of special cases:
Parsing files or strings ("streams") as a (Python) full program.
Syntax errors in a stream are reported, and if repairable by single token insertion or deletion, so repaired.
Parsing a stream according to an arbitrary language nonterminal.
Parsing a pattern, corresponding to a named grammar nonterminal with named placeholders for the missing subtrees. A pattern match result can be used to match against concrete ASTs to decide match or not, and if match, to provide bindings for the pattern variables.
Parsing a valid arbitrary substring. This returns a tree with possible missing left or right children, which define the left and right ends of the substring.
For instance, OP could write the following pattern to handle his example:
pattern if_x_is_5(s: statement):statement
= " if x==5: \s ";
DMS will read that pattern and build the corresponding pattern tree.
The paper that OP references really wants operators and keywords to remain as explicit artifacts in the AST. One way to interpret that is that they really want a concrete syntax tree. DMS actually produces "AST"s which are concrete syntax trees with the constant terminals removed; this has the effect of being very close to what a perfect AST should be, but one can easily determine for any leaf node where constant terminals should be inserted (or one can configure DMS to simply produce the uncompressed CSTs).
Personally, I don't see how the goal of the paper of OP's interest can really succeed in providing useful pseudo-code (in spite of its claims). Understanding an algorithm requires understanding the corresponding data structures and the abstract and concrete algorithms being applied to those data structures. The paper focuses only on raw language syntax; there is no hint of understanding the more abstract ideas.
I am working through some old code for a tkinter GUI and at the top of their code they create the following definition:
def _(text):
    return text
The code then proceeds to use the _ function around almost all of the strings being passed to the tkinter widgets. For example:
editmenu.add_command(label=_("Paste"), command=parent.onPaste)
Is there a reason for using the function here as opposed to just passing the string?
I've removed the _ function from around a few of the strings and haven't run into any issues. Just curious if this has a real purpose.
This is a stub for a pattern typically used for internationalization. The pattern is explicitly documented at https://docs.python.org/3/library/gettext.html#deferred-translations
_("Foo") is intended to have _ look up the string Foo in the current user's configured language.
Putting the stub in now -- before any translation tables have actually been built -- makes it easy to search for strings that need to be translated, and avoids folks needing to go through the software figuring out which strings are intended for system-internal vs user-visible uses later. (It also helps guide programmers to avoid making common mistakes like using the same string constants for both user-visible display and save file formats or wire communications).
Understanding this is important, because it guides how _ should be used. Don't use _("Save {}".format(filename)); instead, use _("Save {}").format(filename), so the folks who are eventually writing translation tables can adjust the format strings, and so those strings don't need to have user-provided content (which obviously can't be known ahead of time) inside them.
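A minimal sketch of the pattern, assuming a hypothetical "myapp" translation domain and locale directory (both names are illustrative):

import gettext

# Early on, _ is just a stub that returns its argument unchanged.
def _(text):
    return text

# Later, once translation catalogs exist, the stub is replaced with a real lookup:
# translation = gettext.translation("myapp", localedir="locale", languages=["de"], fallback=True)
# _ = translation.gettext

label = _("Paste")                            # marked for translation; the stub returns "Paste" as-is
message = _("Save {}").format("report.txt")   # translate first, then substitute user content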
import printf, printf;
void foo(int x, int y) {
    return 0;
}
int a = food(1, -2.0, 5);
I need to write a solution to identify all semantic errors that are in the above code. I have a supplied compiler for it.
What is the best way of going about this?
ANTLR is about syntax (not particularly about semantics). You can have semantic predicates, but the intention there is to guide the interpretation of the syntax if there would otherwise be ambiguities (there can be other reasons).
It is possible to handle some semantic processing in actions, but this can quickly lead to an ANTLR grammar where it’s hard to find the grammar in all the noise of the actions. I view this as something of an anti-pattern (particularly in hand crafting grammars).
Code your syntax in the grammar, use the resulting code to produce a ParseTree. You can then use either Listeners or Visitors to write your own code to do any semantic validation. This separation of concerns will also help keep the code base easier to manage.
It’s a bit much to go into listeners and visitors in depth in a SO answer, but there is plenty written about using them.
So in your example, ANTLR could produce a perfectly valid parse tree for the foo function. In a listener for the function context, you could detect that the function declares its return type as void and then check the body of the function (in the parse tree) to see whether it contains any `return` statements that return a value. Those statements would be syntactically "correct" but semantically invalid, so you'd identify that as an error.
In short, ANTLR is great at giving you a data structure that accurately represents the only way to interpret the input stream. And it provides utility functionality with Listeners and Visitors to make it pretty simple to analyze that parse tree in search of semantic issues (or even use a visitor to produce an interpreter to execute the code, if you’d like).
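A rough sketch of that listener approach using ANTLR's Python target; CLexer, CParser, CListener and the rule/accessor names below are stand-ins for whatever your own grammar generates, so treat them as hypothetical:

from antlr4 import FileStream, CommonTokenStream, ParseTreeWalker
from CLexer import CLexer          # generated from your grammar (hypothetical names)
from CParser import CParser
from CListener import CListener

class SemanticChecker(CListener):
    def enterFunctionDefinition(self, ctx):
        # Hypothetical accessors: adapt to your grammar's rule and label names.
        if ctx.returnType().getText() == "void" and self.returns_a_value(ctx.body()):
            print(f"line {ctx.start.line}: void function returns a value")

    def returns_a_value(self, body_ctx):
        # Crude check over the body subtree; a real checker would inspect return-statement nodes.
        return any("return" in child.getText() for child in body_ctx.getChildren())

stream = FileStream("example.src")
parser = CParser(CommonTokenStream(CLexer(stream)))
tree = parser.compilationUnit()    # entry rule name depends on your grammar
ParseTreeWalker().walk(SemanticChecker(), tree)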
THE PROBLEM:
I'm looking for a python library that might already implement a text parser that I have written in another language.
I have lines of text that represent either configuration commands in devices or dynamic output from commands run on devices. For simplicity, let's assume each line is parsed independently.
The bottom line is that a line contains fixed keywords and values/variables/parameters. Some keywords are optional and some are mandatory and in specific order. The number and type of variables/values associated with / following a given keyword can vary from one keyword to another.
SOLUTION IN OTHER LANGUAGE:
I wrote generic code in c++ that would parse any line and convert the input into structured data. The input to the code is
1. the line to be parsed and
2. a model/structure that describes which keywords to look for, whether they are optional or not, in what order they might appear, and what type of values/variables to expect for each (and also how many).
In c++ the interface allows the user among other things to supply a set of user-defined callback functions (one for each keyword) to be invoked by the parsing engine to supply the results (the parsed parameters associated with the given keyword). The implementation of the callback is user-defined but the callback signature is pre-defined.
WHAT ABOUT PYTHON?
I'm hoping for a simple library in python (or a completely different direction if this is something done differently/better in python) that provides an interface to specify the grammar/syntax/model of a given line (the details of all keywords, their order, what number and type of parameters each requires) and then does the parsing of input lines based on that syntax.
I'm not sure how well argparse fits what I need, but this is not about parsing command-line input, though it is similar.
AN EXAMPLE:
Here is an example line from the IP networking world but the problem is more generic:
access-list SOMENAME-IN extended permit tcp host 117.21.212.54 host 174.163.16.23 range 5160 7000
In the above line, the keywords and their corresponding parameters are:
key: extended, no parameters
key: permit, no parameters
key: tcp, no parameters
key: host, par1: 117.21.212.54
key: host, par1: 174.163.16.23
key: range, par1: 5160, par2: 7000
This is a form of firewall access control list (ACL). In this case the parser would be used to fill a structure that indicates
- the name of the ACL (SOMENAME-IN in the above example)
- the type of ACL (extended in the above example but there are other valid keywords)
- the protocol (tcp in the above example)
- the src host/IP (117.21.212.54 in the example)
- the src port (optional and not present in the above example)
- the dst host/IP (174.163.16.23 in the example)
- the dst port (a range of ports from 5160 to 7000 in the above example)
One can rather easily write a dedicated parser that assumes the specific syntax of the above example and checks for it (this might also be more efficient and clearer, since it targets a specific syntax). What I want, though, is to write general parsing code, where all the keywords and the expected syntax are provided as data / a model to the parsing engine, which uses it to parse the lines and is also capable of pointing out errors in a parsed line.
I'm obviously not looking for a full solution, because that would be a lot to ask, but I hope for thoughts specifically in the context of using Python and reusing any features or libraries Python may have for this kind of parsing.
Thanks,
Al.
If I understand your needs correctly (and it is possible that I don't, because it is hard to know what limits you place on the possible grammars), then you should be able to solve this problem fairly simply with a tokeniser and a map of keyword parsers for each command.
If your needs are very simple, you might be able to tokenise using the split string method, but it is possible that you would prefer a tokeniser which at least handles quoted strings and maybe some operator symbols. The standard Python library provides the shlex module for this purpose.
There is no standard library module which does the parsing. There are a large variety of third-party parsing frameworks, but I think that they are likely to be overkill for your needs (although I might be wrong, and you might want to check them out even if you don't need anything that sophisticated). While I don't usually advocate hand-rolling a parser, this particular application is both simple enough to make that possible and sufficiently different from the possibilities of a context-free grammar to make direct coding useful.
(What makes a context-free grammar impractical is the desire to allow different command options to be provided in arbitrary order without allowing repetition of non-repeatable options. But rereading this answer, I realize that it is just an assumption on my part that you need that feature.)
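If it helps, here is a minimal sketch of that approach for the ACL example above, using shlex for tokenising and a small keyword table mapping each keyword to the number of parameters it consumes (the table contents are illustrative, not a complete ACL grammar):

import shlex

# Keyword -> number of parameters it consumes; per-keyword callbacks could be
# added alongside the arity if you need custom handling, as in the C++ design.
KEYWORDS = {
    "extended": 0,
    "permit": 0,
    "deny": 0,
    "tcp": 0,
    "udp": 0,
    "host": 1,
    "range": 2,
}

def parse_acl_line(line):
    tokens = shlex.split(line)
    if len(tokens) < 2 or tokens[0] != "access-list":
        raise ValueError("not an access-list line")
    parsed = {"name": tokens[1], "keywords": []}
    i = 2
    while i < len(tokens):
        key = tokens[i]
        if key not in KEYWORDS:
            raise ValueError(f"unknown keyword {key!r} at token {i}")
        argc = KEYWORDS[key]
        params = tokens[i + 1:i + 1 + argc]
        if len(params) != argc:
            raise ValueError(f"{key!r} expects {argc} parameter(s), got {len(params)}")
        parsed["keywords"].append((key, params))
        i += 1 + argc
    return parsed

line = ("access-list SOMENAME-IN extended permit tcp "
        "host 117.21.212.54 host 174.163.16.23 range 5160 7000")
print(parse_acl_line(line))

Enforcing ordering and non-repetition of options would then be a matter of adding checks over the parsed list rather than changing the grammar.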
I do know that one shouldn't use eval, for all the obvious reasons (performance, maintainability, etc.). My question is more about the other side: is there a legitimate use for it, where one should use it rather than implement the code in another way?
Since it is implemented in several languages and can lead to bad programming style, I assume there is a reason why it's still available.
First, here is MathWorks' list of alternatives to eval.
You could also be clever and use eval() in a compiled application to build your M-code interpreter, but the MATLAB Compiler doesn't allow that for obvious reasons.
One place where I have found a reasonable use of eval is in obtaining small predicates of code that consumers of my software need to be able to supply as part of a parameter file.
For example, there might be an item called "Data" that has a location for reading and writing the data, but also requires some predicate applied to it upon load. In a Yaml file, this might look like:
Data:
  Name: CustomerID
  ReadLoc: some_server.some_table
  WriteLoc: write_server.write_table
  Predicate: "lambda x: x[:4]"
Upon loading and parsing the objects from Yaml, I can use eval to turn the predicate string into a callable lambda function. In this case, it implies that CustomerID is a long string and only the first 4 characters are needed in this particular instance.
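A minimal sketch of that flow, assuming PyYAML (yaml.safe_load) is available and inlining the Data block from above as a string:

import yaml

doc = """
Data:
  Name: CustomerID
  ReadLoc: some_server.some_table
  WriteLoc: write_server.write_table
  Predicate: "lambda x: x[:4]"
"""

data = yaml.safe_load(doc)["Data"]
predicate = eval(data["Predicate"])   # acceptable only because the file's authors are trusted
print(predicate("CUST12345"))         # -> CUST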
YAML offers some clunky ways to magically invoke object constructors (e.g. using something like !Data in my file above, and then defining a Data class in the code that appropriately uses YAML hooks into the constructor). In fact, one of the biggest criticisms I have of the YAML magic object construction is that it effectively turns your whole parameter file into one giant eval statement. And this is very problematic if you need to validate things and if you need flexibility in the way multiple parts of the code absorb multiple parts of the parameter file. It also doesn't lend itself easily to templating with Mako, whereas my approach above makes that easy.
I think this simpler design, which can be easily parsed with any standard YAML tools, is better, and using eval lets me allow the user to pass in whatever arbitrary callable they want.
A couple of notes on why this works in my case:
The users of the code are not Python programmers. They don't have the ability to write their own functions and then just pass a module location, function name, and argument signature (although, putting all that in a parameter file is another way to solve this that wouldn't rely on eval if the consumers can be trusted to write code.)
The users are responsible for their bad lambda functions. I can do some validation that eval works on the passed predicate, and maybe even create some tests on the fly or have a nice failure mode, but at the end of the day I am allowed to tell them that it's their job to supply valid predicates and to ensure the data can be manipulated with simple predicates. If this constraint wasn't in place, I'd have to shuck this for a different system.
The users of these parameter files compose a small group mostly willing to conform to conventions. If that weren't true, it would be risky that folks would hijack the predicate field to do many inappropriate things, and this would be hard to guard against. On big projects, it would not be a great idea.
I don't know if my points apply very generally, but I would say that using eval to add flexibility to a parameter file is good if you can guarantee your users are a small group of convention-upholders (a rare feat, I know).
In MATLAB the eval function is useful when functions make use of the name of the input argument via the inputname function. For example, to overload the built-in display function (which is sensitive to the name of the input argument) the eval function is required. To call the built-in display from an overloaded display you would do:
function display(X)
    eval([inputname(1), ' = X;']);
    eval(['builtin(''display'', ', inputname(1), ');']);
end
In MATLAB there is also evalc. From the documentation:
T = evalc(S) is the same as EVAL(S) except that anything that would normally be written to the command window, except for error messages, is captured and returned in the character array T (lines in T are separated by '\n' characters).
If you still consider this eval, then it is very powerful when dealing with closed source code that displays useful information in the command window and you need to capture and parse that output.
I am creating a domain-specific language using ANTLR 3. So far, I have translated the parsed objects directly from inside the parser grammar. Going through the examples of ASTs and tree walkers, I came to know that they are normally used to divide the grammar into a hierarchical tree and translate objects from the nodes. Currently I am doing the same sort of thing using the parser grammar, where I translate objects from each sub-rule. I would be more than happy to know the advantages of using an AST and tree walker over just using the parser grammar. Thank you in advance.
One advantage of using tree parsers is that you can organize them into multiple passes. For some translation work I did I was able to use seven passes and separate logical steps into their own pass. One pass did expression analysis, one did control flow analysis, others used that analysis to eliminate dead code or to simplify the translation for special cases.
I personally like using tree grammars for the same reason I like using parsers for text grammars. It allows me to use rules to organize the parsing context. It's easy to do things like structure rules to recognize a top-level expression versus a subexpression if you need to distinguish between them for recognition purposes. All of the attribute and context management you use in regular parsers can apply to tree parsers.