generic line syntax parser in python

THE PROBLEM:
I'm looking for a python library that might already implement a text parser that I have written in another language.
I have lines of text that represent either configuration commands in devices or dynamic output from commands run on devices. For simplicity, let's assume each line is parsed independently.
The bottom line is that a line contains fixed keywords and values/variables/parameters. Some keywords are optional, some are mandatory, and they must appear in a specific order. The number and type of values/variables associated with (following) a given keyword can vary from one keyword to another.
SOLUTION IN OTHER LANGUAGE:
I wrote generic code in C++ that parses any line and converts the input into structured data. The inputs to the code are
1. the line to be parsed and
2. a model/structure that describes what keywords to look for, whether or not they are optional, in what order they may appear, and what type of values/variables to expect for each (and also how many).
In C++ the interface allows the user, among other things, to supply a set of user-defined callback functions (one for each keyword) to be invoked by the parsing engine to deliver the results (the parsed parameters associated with the given keyword). The implementation of each callback is user-defined, but the callback signature is pre-defined.
WHAT ABOUT PYTHON?
I'm hoping for a simple library in Python (or a completely different direction, if this is something done differently/better in Python) that provides an interface to specify the grammar/syntax/model of a given line (the details of all keywords, their order, and the number and type of parameters each requires) and then parses input lines based on that syntax.
I'm not sure how well argparse fits what I need, but this is not about parsing command-line input, though it is similar.
AN EXAMPLE:
Here is an example line from the IP networking world but the problem is more generic:
access-list SOMENAME-IN extended permit tcp host 117.21.212.54 host 174.163.16.23 range 5160 7000
In the above line, the keywords and their corresponding parameters are:
key: extended, no parameters
key: permit, no parameters
key: tcp, no parameters
key: host, par1: 117.21.212.54
key: host, par1: 174.163.16.23
key: range, par1: 5160, par2: 7000
This is a form of firewall access control list (ACL). In this case the parser would be used to fill a structure that indicates
- the name of the ACL (SOMENAME-IN in the above example)
- the type of ACL (extended in the above example but there are other valid keywords)
- the protocol (tcp in the above example)
- the src host/IP (117.21.212.54 in the example)
- the src port (optional and not present in the above example)
- the dst host/IP (174.163.16.23 in the example)
- the dst port (a range of ports from 5160 to 7000 in the above example)
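For concreteness, the filled structure for the example line might look something like this in Python (the field names are only illustrative, not part of any existing library):

parsed_acl = {
    "name": "SOMENAME-IN",
    "type": "extended",
    "action": "permit",
    "protocol": "tcp",
    "src_host": "117.21.212.54",
    "src_port": None,                     # optional, absent in this example
    "dst_host": "174.163.16.23",
    "dst_port": ("range", 5160, 7000),
}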
One can rather easily write a dedicated parser that assumes the specific syntax of the above example and checks for it (this might also be more efficient and clearer, since it targets one specific syntax), but what I want is to write general parsing code, where all the keywords and the expected syntax are provided as data/model to the parsing engine, which uses it to parse the lines and is also capable of pointing out errors in a parsed line.
I'm obviously not looking for a full solution, since that would be a lot to ask, but I'm hoping for thoughts specifically in the context of Python and of reusing any features or libraries Python may have for such parsing.
Thanks,
Al.

If I understand your needs correctly (and it is possible that I don't, because it is hard to know what limits you place on the possible grammars), then you should be able to solve this problem fairly simply with a tokeniser and a map of keyword parsers for each command.
If your needs are very simple, you might be able to tokenise using the split string method, but it is possible that you would prefer a tokeniser which at least handles quoted strings and maybe some operator symbols. The standard Python library provides the shlex module for this purpose.
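For example, on the ACL line above shlex.split behaves just like str.split, but it also keeps quoted values together, which str.split cannot do:

import shlex

line = ("access-list SOMENAME-IN extended permit tcp "
        "host 117.21.212.54 host 174.163.16.23 range 5160 7000")
print(shlex.split(line))
# ['access-list', 'SOMENAME-IN', 'extended', 'permit', 'tcp', 'host',
#  '117.21.212.54', 'host', '174.163.16.23', 'range', '5160', '7000']

# Unlike line.split(), shlex keeps a quoted value as one token:
print(shlex.split('description "outside interface ACL"'))
# ['description', 'outside interface ACL']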
There is no standard library module which does the parsing. There are a large variety of third-party parsing frameworks, but I think that they are likely to be overkill for your needs (although I might be wrong, and you might want to check them out even if you don't need anything that sophisticated). While I don't usually advocate hand-rolling a parser, this particular application is both simple enough to make that possible and sufficiently different from the possibilities of a context-free grammar to make direct coding useful.
(What makes a context-free grammar impractical is the desire to allow different command options to be provided in arbitrary order without allowing repetition of non-repeatable options. But rereading this answer, I realize that it is just an assumption on my part that you need that feature.)
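To make the tokeniser-plus-keyword-map idea concrete, here is a minimal sketch. Everything in it (the model format, the field names, the callbacks) is an assumption of mine rather than an existing library: the grammar is supplied as data, each keyword declares how many parameters it takes and whether it is mandatory, and a per-keyword callback receives the parsed parameters, much like the C++ interface you describe.

import shlex

# Hypothetical model format: keyword -> (parameter count, mandatory?, callback).
# A real model would also need to express ordering constraints, repeatable
# keywords, and alternatives such as permit|deny; this sketch only checks
# arity and the presence of mandatory keywords.

def on_host(result, params):
    # The first "host" is the source, the second the destination
    # (example-specific logic kept inside the callback).
    key = "src_host" if "src_host" not in result else "dst_host"
    result[key] = params[0]

def on_range(result, params):
    result["dst_port"] = ("range", int(params[0]), int(params[1]))

ACL_MODEL = {
    "extended": (0, True,  lambda r, p: r.update(type="extended")),
    "permit":   (0, False, lambda r, p: r.update(action="permit")),
    "deny":     (0, False, lambda r, p: r.update(action="deny")),
    "tcp":      (0, False, lambda r, p: r.update(protocol="tcp")),
    "host":     (1, True,  on_host),
    "range":    (2, False, on_range),
}

def parse_line(line, model, positional=("command", "name")):
    tokens = shlex.split(line)
    result = {}

    # Fixed leading fields, e.g. "access-list" and the ACL name.
    for field in positional:
        if not tokens:
            raise ValueError("missing positional field: %s" % field)
        result[field] = tokens.pop(0)

    seen = set()
    while tokens:
        keyword = tokens.pop(0)
        if keyword not in model:
            raise ValueError("unknown keyword: %r" % keyword)
        nparams, _mandatory, callback = model[keyword]
        if len(tokens) < nparams:
            raise ValueError("keyword %r expects %d parameter(s)" % (keyword, nparams))
        params = [tokens.pop(0) for _ in range(nparams)]
        callback(result, params)
        seen.add(keyword)

    missing = [k for k, (_, mandatory, _) in model.items()
               if mandatory and k not in seen]
    if missing:
        raise ValueError("missing mandatory keyword(s): %s" % ", ".join(missing))
    return result

line = ("access-list SOMENAME-IN extended permit tcp "
        "host 117.21.212.54 host 174.163.16.23 range 5160 7000")
print(parse_line(line, ACL_MODEL))

This prints a dict with the name, type, action, protocol, hosts and port range, and error reporting falls out of the same model (unknown keyword, wrong arity, missing mandatory keyword).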

Related

How do I get the local variables defined in a Python function?

I'm working on a project to do some static analysis of Python code. We're hoping to encode certain conventions that go beyond questions of style or detecting code duplication. I'm not sure this question is specific enough, but I'm going to post it anyway.
A few of the ideas that I have involve being able to build a certain understanding of how the various parts of source code work so we can impose these checks. For example, in part of our application that's exposing a REST API, I'd like to validate something like the fact that if a route is defined as a GET, then arguments to the API are passed as URL arguments rather than in the request body.
I'm able to get something like that to work by pulling all the routes, which are pretty nicely structured, and there are guarantees of consistency given the route has to be created as a route object. But once I know that, say, a given route is a GET, figuring out how the handler function uses arguments requires some degree of interpretation of the function source code.
Naïvely, something like inspect.getsourcelines will allow me to get the source code, but on further examination that's not the best solution because I immediately have to build interpreter-like features, such as figuring out whether a line is a comment, and then do something like use regular expressions to hunt down places where state is moved from the request context to a local variable.
Looking at tools like PyLint, they seem mostly focused on high-level "universals" of static analysis, and (at least on superficial inspection) don't have obvious ways of extracting this sort of understanding at a lower level.
Is there a more systematic way to get this representation of the source code, either with something in the standard library or with another tool? Or is the only way to do this writing a mini-interpreter that serves my purposes?
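One standard-library direction (my assumption, not something stated in the question) is to parse the handler with the ast module rather than inspecting raw source lines: comments disappear and you get a tree you can walk instead of hunting with regular expressions. A rough sketch that lists attribute accesses such as request.args inside a function (the request/args names are purely illustrative):

import ast
import inspect
import textwrap

def attribute_accesses(func, base_name):
    # Collect "base_name.attr" expressions appearing anywhere in func's body.
    source = textwrap.dedent(inspect.getsource(func))
    tree = ast.parse(source)
    found = []
    for node in ast.walk(tree):
        if (isinstance(node, ast.Attribute)
                and isinstance(node.value, ast.Name)
                and node.value.id == base_name):
            found.append("%s.%s" % (base_name, node.attr))
    return found

# Purely illustrative handler; in a real check, func would be the GET route's
# handler and "request" whatever name the framework injects.
def handler(request):
    user = request.args.get("user")
    return user

print(attribute_accesses(handler, "request"))   # ['request.args']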

Obtain parse tree for python code

I would like to be able to generate a parse tree for python source code. This code does not have to be compilable, e.g.
if x == 5:
should be turned into some sort of tree representation. I can use the Python compiler package to create a tree, but this only works for code that is compilable, e.g.
if x == 5: print True
The paper you linked to says that they used the ast module in the Python standard library. It also says they used a dummy body for the body of the if statement. Use a statement that will be easy to recognize as a dummy body, like pass or a function call like dummy().
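For example, a small illustration of the dummy-body idea with the standard ast module:

import ast

# "if x == 5:" is not parseable on its own, so append a recognizable
# dummy body before parsing it.
fragment = "if x == 5:"
tree = ast.parse(fragment + " pass")
print(ast.dump(tree.body[0].test))
# e.g. Compare(left=Name(id='x', ...), ops=[Eq()], comparators=[Constant(value=5)])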
Our DMS Software Reengineering Toolkit with its Python front end can do this.
DMS provides infrastructure for parsing code, parameterized by a language definition (e.g., a Python grammar) and automatically building ASTs, as well as the ability to inspect/navigate/change those ASTs, and prettyprint the resulting modified trees.
Its AST parsing machinery can handle a variety of special cases:
Parsing files or strings ("streams") as a (Python) full program.
Syntax errors in a stream are reported, and if repairable by single token insertion or deletion, so repaired.
Parsing a stream according to an arbitrary language nonterminal.
Parsing a pattern, corresponding to a named grammar nonterminal with named placeholders for the missing subtrees. A pattern match result can be used to match against concrete ASTs to decide match or not, and if match, to provide bindings for the pattern variables.
Parsing a valid arbitrary substring. This returns a tree with possible missing left or right children, which define the left and right ends of the substring.
For instance, OP could write the following pattern to handle his example:
pattern if_x_is_5(s: statement):statement
    = " if x==5: \s ";
DMS will read that pattern and build the corresponding pattern tree.
The paper that OP references really wants operators and keywords to remain as explicit artifacts in the AST. One way to interpret that is that they really want a concrete syntax tree. DMS actually produces "AST"s which are concrete syntax trees with the constant terminals removed; this has the effect of being very close to what a perfect AST should be, but one can easily determine for any leaf node where constant terminals should be inserted (or one can configure DMS to simply produce the uncompressed CSTs).
Personally, I don't see how the goal of the paper of OP's interest can really succeed in providing useful pseudo-code (in spite of its claims). Understanding an algorithm requires understanding of the corresponding data structures and of the abstract and concrete algorithms being applied to those data structures. The paper focuses only on raw language syntax; there is no hint of understanding the more abstract ideas.

Use of eval in Python, MATLAB, etc [duplicate]

This question already has answers here:
Why is using 'eval' a bad practice?
I do know that one shouldn't use eval, for all the obvious reasons (performance, maintainability, etc.). My question is more to the side: is there a legitimate use for it, where one should use it rather than implement the code in another way?
Since it is implemented in several languages and can lead to bad programming style, I assume there is a reason why it's still available.
First, here is MathWorks' list of alternatives to eval.
You could also be clever and use eval() in a compiled application to build your mCode interpreter, but the Matlab compiler doesn't allow that for obvious reasons.
One place where I have found a reasonable use of eval is in obtaining small predicates of code that consumers of my software need to be able to supply as part of a parameter file.
For example, there might be an item called "Data" that has a location for reading and writing the data, but also requires some predicate applied to it upon load. In a Yaml file, this might look like:
Data:
  Name: CustomerID
  ReadLoc: some_server.some_table
  WriteLoc: write_server.write_table
  Predicate: "lambda x: x[:4]"
Upon loading and parsing the objects from Yaml, I can use eval to turn the predicate string into a callable lambda function. In this case, it implies that CustomerID is a long string and only the first 4 characters are needed in this particular instance.
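A minimal sketch of that flow, assuming PyYAML (the structure above is my own example, not a fixed format):

import yaml  # PyYAML

config = yaml.safe_load("""
Data:
  Name: CustomerID
  ReadLoc: some_server.some_table
  WriteLoc: write_server.write_table
  Predicate: "lambda x: x[:4]"
""")

# Turn the predicate string into a callable; the string is trusted input
# supplied by a small group of known users (see the caveats below).
predicate = eval(config["Data"]["Predicate"])
print(predicate("CUST-000123"))   # CUST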
Yaml offers some clunky ways to magically invoke object constructors (e.g. using something like !Data in my code above, and then having defined a class for Data in the code that appropriately uses Yaml hooks into the constructor). In fact, one of the biggest criticisms I have of the Yaml magic object construction is that it is effectively like making your whole parameter file into one giant eval statement. And this is very problematic if you need to validate things and if you need flexibility in the way multiple parts of the code absorb multiple parts of the parameter file. It also doesn't lend itself easily to templating with Mako, whereas my approach above makes that easy.
I think this simpler design, which can be easily parsed with any YAML tools, is better, and using eval lets me allow the user to pass in whatever arbitrary callable they want.
A couple of notes on why this works in my case:
The users of the code are not Python programmers. They don't have the ability to write their own functions and then just pass a module location, function name, and argument signature (although, putting all that in a parameter file is another way to solve this that wouldn't rely on eval if the consumers can be trusted to write code.)
The users are responsible for their bad lambda functions. I can do some validation that eval works on the passed predicate, and maybe even create some tests on the fly or have a nice failure mode, but at the end of the day I am allowed to tell them that it's their job to supply valid predicates and to ensure the data can be manipulated with simple predicates. If this constraint wasn't in place, I'd have to shuck this for a different system.
The users of these parameter files compose a small group mostly willing to conform to conventions. If that weren't true, it would be risky that folks would hijack the predicate field to do many inappropriate things -- and this would be hard to guard against. On big projects, it would not be a great idea.
I don't know if my points apply very generally, but I would say that using eval to add flexibility to a parameter file is good if you can guarantee your users are a small group of convention-upholders (a rare feat, I know).
In MATLAB the eval function is useful when functions make use of the name of the input argument via the inputname function. For example, to overload the builtin display function (which is sensitive to the name of the input argument) the eval function is required; to call the built-in display from an overloaded display you would do
function display(X)
    eval([inputname(1), ' = X;']);
    eval(['builtin(''display'', ', inputname(1), ');']);
end
In MATLAB there is also evalc. From the documentation:
T = evalc(S) is the same as EVAL(S) except that anything that would normally be written to the command window, except for error messages, is captured and returned in the character array T (lines in T are separated by '\n' characters).
If you still consider this eval, then it is very powerful when dealing with closed source code that displays useful information in the command window and you need to capture and parse that output.

Pythonic way to ID a mystery file, then call a filetype-specific parser for it? Class creation q's

(note) I would appreciate help generalizing the title. I am sure that this is a class of problems in OO land, and probably has a reasonable pattern, I just don't know a better way to describe it.
I'm considering the following -- Our server script will be called by an outside program, and have a bunch of text dumped at it, (usually XML).
There are multiple possible types of data we could be getting, and multiple versions of the data representation we could be getting, e.g. "Report Type A, version 1.2" vs. "Report Type A, version 2.0"
We will generally want to do the same thing with all the data -- namely, determine what sort and version it is, then parse it with a custom parser, then call a synchronize-to-database function on it.
We will definitely be adding types and versions as time goes on.
So, what's a good design pattern here? I can come up with two, both of which seem like they may have some problems.
Option 1
Write a monolithic ID script which determines the type, and then imports and calls the properly named class functions.
Benefits
Probably pretty easy to debug.
Only one file that does the parsing.
Downsides
Seems hack-ish.
It would be nice to not have to create knowledge of data formats in two places: once for ID, once for actual merging.
Option 2
Write an "ID" function for each class; returns Yes / No / Maybe when given identifying text.
the ID script now imports a bunch of classes, instantiates them on the text and asks if the text and class type match.
Upsides:
Cleaner in that everything lives in one module?
Downsides:
Slower? Depends on logic of running through the classes.
Put abstractly, should Python instantiate a bunch of classes and consume an ID function, or should it instantiate one (or many) ID classes, each with a paired item class, or something else entirely?
You could use the Strategy pattern, which would allow you to separate the logic for the different formats that need to be parsed into concrete strategies. Your code would typically inspect a portion of the file in the context and then decide on a concrete strategy.
As far as defining the grammar for your files goes, I would find a fast way to identify the file without implementing the full definition, perhaps a header or other unique feature at the beginning of the document. Then, once you know how to handle the file, you can pick the best concrete strategy for that file to handle the parsing and the writes to the database.
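A minimal sketch of that idea in Python (the class names, the claims/parse methods, and the header check are all illustrative, not an existing API): each concrete strategy knows how to recognise its own format from a sample of the text, and the dispatcher picks the first one that claims it.

class ReportTypeA_v1:
    # Concrete strategy for "Report Type A, version 1.x" (illustrative).
    def claims(self, header):
        return "ReportTypeA" in header and 'version="1.' in header

    def parse(self, text):
        ...  # format-specific parsing goes here
        return {"type": "A", "version": "1.x"}

class ReportTypeA_v2:
    # Concrete strategy for "Report Type A, version 2.0" (illustrative).
    def claims(self, header):
        return "ReportTypeA" in header and 'version="2.0"' in header

    def parse(self, text):
        ...
        return {"type": "A", "version": "2.0"}

PARSERS = [ReportTypeA_v2(), ReportTypeA_v1()]

def handle_report(text):
    # Only look at the start of the document to decide; the chosen strategy
    # then parses the whole body before the synchronize-to-database step.
    header = text[:200]
    for parser in PARSERS:
        if parser.claims(header):
            record = parser.parse(text)
            ...  # synchronize-to-database step
            return record
    raise ValueError("unrecognised report format")

Adding a new type or version then means adding one class and registering it, without touching the dispatcher.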

How to document and test interfaces required of formal parameters in Python 2?

To ask my very specific question I find I need quite a long introduction to motivate and explain it -- I promise there's a proper question at the end!
While reading part of a large Python codebase, sometimes one comes across code where the interface required of an argument is not obvious from "nearby" code in the same module or package. As an example:
def make_factory(schema):
    entity = schema.get_entity()
    ...
There might be many "schemas" and "factories" that the code deals with, and "def get_entity()" might be quite common too (or perhaps the function doesn't call any methods on schema, but just passes it to another function). So a quick grep isn't always helpful to find out more about what "schema" is (and the same goes for the return type). Though "duck typing" is a nice feature of Python, sometimes the uncertainty in a reader's mind about the interface of arguments passed in as the "schema" gets in the way of quickly understanding the code (and the same goes for uncertainty about typical concrete classes that implement the interface). Looking at the automated tests can help, but explicit documentation can be better because it's quicker to read. Any such documentation is best when it can itself be tested so that it doesn't get out of date.
Doctests are one possible approach to solving this problem, but that's not what this question is about.
Python 3 has a "parameter annotations" feature (part of the function annotations feature, defined in PEP 3107). The uses to which that feature might be put aren't defined by the language, but it can be used for this purpose. That might look like this:
def make_factory(schema: "xml_schema"):
    ...
Here, "xml_schema" identifies a Python interface that the argument passed to this function should support. Elsewhere there would be code that defines that interface in terms of attributes, methods & their argument signatures, etc. and code that allows introspection to verify whether particular objects provide an interface (perhaps implemented using something like zope.interface / zope.schema). Note that this doesn't necessarily mean that the interface gets checked every time an argument is passed, nor that static analysis is done. Rather, the motivation of defining the interface is to provide ways to write automated tests that verify that this documentation isn't out of date (they might be fairly generic tests so that you don't have to write a new test for each function that uses the parameters, or you might turn on run-time interface checking but only when you run your unit tests). You can go further and annotate the interface of the return value, which I won't illustrate.
So, the question:
I want to do exactly that, but using Python 2 instead of Python 3. Python 2 doesn't have the function annotations feature. What's the "closest thing" in Python 2? Clearly there is more than one way to do it, but I suspect there is one (relatively) obvious way to do it.
For extra points: name a library that implements the one obvious way.
Take a look at plac, which uses annotations to define a command-line interface for a script. On Python 2.x it uses the plac.annotations() decorator.
The closest thing is, I believe, an annotation library called PyAnno.
From the project webpage:
"The Pyanno annotations have two functions:
Provide a structured way to document Python code
Perform limited run-time checking"
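And if neither library fits, the closest hand-rolled equivalent in Python 2 is simply a decorator that records the annotations itself, e.g. (a sketch of the idea, not plac's or PyAnno's actual API):

def annotations(**ann):
    # Attach a Python-3-style __annotations__ dict to a function (works in
    # Python 2 because it is just an ordinary function attribute).
    def decorate(func):
        func.__annotations__ = ann
        return func
    return decorate

@annotations(schema="xml_schema")
def make_factory(schema):
    return schema.get_entity()

print(make_factory.__annotations__)   # {'schema': 'xml_schema'}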
