Identifying semantic errors - python

import printf, printf;
void foo(int x, int y) {
    return 0;
}
int a = food(1, -2.0, 5);
I need to write a solution that identifies all of the semantic errors in the above code. I have a supplied compiler for it.
What is the best way to go about this?

ANTLR is about syntax (not particularly about semantics). You can have semantic predicates, but the intention there is to guide the interpretation of the syntax if there would otherwise be ambiguities (there can be other reasons).
It is possible to handle some semantic processing in actions, but this can quickly lead to an ANTLR grammar where it's hard to find the grammar amid all the noise of the actions. I view this as something of an anti-pattern (particularly when hand-crafting grammars).
Code your syntax in the grammar and use the generated parser to produce a ParseTree. You can then use either Listeners or Visitors to write your own code to do any semantic validation. This separation of concerns will also help keep the code base easier to manage.
It’s a bit much to go into listeners and visitors in depth in a SO answer, but there is plenty written about using them.
So in your example, ANTLR could produce a perfectly valid parse tree for the foo function. In a listener for the function context, you could detect that the function says its return type is void and then check the body of the function (in the parse tree) to see if it contains any `return` statements. These would be syntactically "correct" but semantically invalid, so you'd identify that as an error.
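As a rough sketch of that listener approach (assuming the Python target and a hypothetical grammar called MyLang with functionDecl, typeSpec, returnStat and expression rules; the generated MyLangLexer/MyLangParser/MyLangListener names depend entirely on your grammar):

from antlr4 import CommonTokenStream, InputStream, ParseTreeWalker
from MyLangLexer import MyLangLexer        # generated by ANTLR (hypothetical names)
from MyLangParser import MyLangParser
from MyLangListener import MyLangListener

class SemanticChecker(MyLangListener):
    def __init__(self):
        self.errors = []
        self.return_type = None

    def enterFunctionDecl(self, ctx):
        # Remember the declared return type while we walk the body.
        self.return_type = ctx.typeSpec().getText()

    def enterReturnStat(self, ctx):
        # Syntactically valid, semantically wrong: a void function returning a value.
        if self.return_type == "void" and ctx.expression() is not None:
            self.errors.append("line %d: void function returns a value" % ctx.start.line)

def semantic_errors(source):
    parser = MyLangParser(CommonTokenStream(MyLangLexer(InputStream(source))))
    tree = parser.program()                 # 'program' is the assumed start rule
    checker = SemanticChecker()
    ParseTreeWalker.DEFAULT.walk(checker, tree)
    return checker.errors

The same walk can collect the other errors in your example (the undeclared food vs. foo, the argument-count and argument-type mismatches) by recording declarations in a symbol table as you enter each declaration context.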
In short, ANTLR is great at giving you a data structure that accurately represents the only way to interpret the input stream. And it provides utility functionality with Listeners and Visitors to make it pretty simple to analyze that parse tree in search of semantic issues (or even use a visitor to produce an interpreter to execute the code, if you’d like).

Related

How do I get the local variables defined in a Python function?

I'm working on a project to do some static analysis of Python code. We're hoping to encode certain conventions that go beyond questions of style or detecting code duplication. I'm not sure this question is specific enough, but I'm going to post it anyway.
A few of the ideas that I have involve being able to build a certain understanding of how the various parts of source code work so we can impose these checks. For example, in part of our application that's exposing a REST API, I'd like to validate something like the fact that if a route is defined as a GET, then arguments to the API are passed as URL arguments rather than in the request body.
I'm able to get something like that to work by pulling all the routes, which are pretty nicely structured, and there are guarantees of consistency given the route has to be created as a route object. But once I know that, say, a given route is a GET, figuring out how the handler function uses arguments requires some degree of interpretation of the function source code.
Naïvely, something like inspect.getsourcelines will allow me to get the source code, but on further examination that's not the best solution because I immediately have to build interpreter-like features, such as figuring out whether a line is a comment, and then do something like use regular expressions to hunt down places where state is moved from the request context to a local variable.
Looking at tools like PyLint, they seem mostly focused on high-level "universals" of static analysis, and (at least on superficial inspection) don't have obvious ways of extracting this sort of understanding at a lower level.
Is there a more systematic way to get this representation of the source code, either with something in the standard library or with another tool? Or is the only way to do this writing a mini-interpreter that serves my purposes?
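One standard-library route is the ast module: parse the handler's source once and walk the resulting tree instead of regex-scanning lines. A minimal sketch (the Flask-style request.args / request.json attributes are only an illustration of the kind of convention being checked, not part of any real project here):

import ast
import inspect

def handler(request):
    user = request.args.get("user")   # reads a URL argument
    payload = request.json            # reads the request body

# Parse the source into a tree and look for attribute access on 'request'.
tree = ast.parse(inspect.getsource(handler))
for node in ast.walk(tree):
    if isinstance(node, ast.Attribute) and \
            isinstance(node.value, ast.Name) and node.value.id == "request":
        print("handler touches request.%s at line %d" % (node.attr, node.lineno))

From there it's bookkeeping: if the route was registered as a GET, flag any request.json (or similar body access) that the walk finds.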

Safe implementation strategy for embedded user-defined expressions

I am designing/prototyping a Domain Specific Language... in Python, for now, at least. The design is straightforward, but it requires support for specifying an arbitrary function (whose domain is a map from labels to integers, and whose range is an integer). In many cases the function will merely select a label in the domain to yield a result, but I want to allow the specification of any function that could be easily (and efficiently) implemented in a general-purpose programming language.
A caveat is that I want the function to be 'safe'... by this I mean:
A 'pure' function: deterministic with no side effects. (i.e. no external state; no interaction with files, I/O, devices - etc.)
Terminating - either successfully, or after specific (small-scale) allocated computational resources have expired.
I am keen that this function should be implemented efficiently - I expect definitions to be provided infrequently - and evaluated very frequently. I would also like the functions to be defined using a familiar syntax.
I've considered supporting the implementation of functions in Python... I'm aware that I could impose restrictions using the eval() function, and I've found the ast module - suggesting an approach involving parsing to an AST, then interpreting (or verifying, prior to evaluation) the tree. I've also read about pyparsing and considered implementing a bespoke, interpreted language.
I can't help thinking that trying to block undesirable behaviour from eval() is tackling the problem "backwards" (blocking undesirable functionality ex post), whereas implementing a bespoke language would involve re-inventing the wheel.
Does Python already have a safe, efficient, embeddable, expression interpreter?
PyPy has a sandbox.
If you're running this in the web browser (the usual place for untrusted code concerns) consider running it client-side with something like Brython. No-one cares if the user hacks his own machine.
If you do implement a bespoke interpreter, you don't have to re-implement all of the wheel. It's thought to be relatively safe to use compile() on untrusted code, but beware of large constants eating time and memory. Run the compiler in a separate process you can kill. Then you just need to write a Python bytecode interpreter that lacks access to anything important.
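If you do go the AST route, a minimal sketch of the "verify the AST, then evaluate" approach mentioned in the question (Python 3.8+, where literals parse as ast.Constant; the whitelist is an assumption you would tailor to your expression language):

import ast

ALLOWED = (ast.Expression, ast.BinOp, ast.UnaryOp, ast.Constant, ast.Name,
           ast.Load, ast.Add, ast.Sub, ast.Mult, ast.Div, ast.USub)

def safe_eval(expr, labels):
    tree = ast.parse(expr, mode="eval")
    for node in ast.walk(tree):
        if not isinstance(node, ALLOWED):
            raise ValueError("disallowed syntax: %s" % type(node).__name__)
        if isinstance(node, ast.Name) and node.id not in labels:
            raise ValueError("unknown label: %s" % node.id)
    # No builtins; the only names visible are the caller-supplied labels.
    return eval(compile(tree, "<expr>", "eval"), {"__builtins__": {}}, labels)

print(safe_eval("a * 2 + b", {"a": 3, "b": 4}))   # 10

This buys purity and a familiar syntax cheaply; the termination bound still has to come from outside, e.g. running the evaluation in a separate process with a timeout, as suggested above for compile().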

Enforcing Python function parameter types from the docstring

Both the epydoc and Sphinx documentation generators permit the coder to annotate the expected types of any/all function parameters.
My question is: is there a way (or module) that enforces these types (at run-time) when they are documented in the docstring? This wouldn't be strong typing (compile-time checking), but might be called firm typing (run-time checking) - maybe raising a "ValueError", or even better still, raising a "SemanticError".
Ideally there would already be something (like a module) similar to the "import antigravity" module as per xkcd, and this "firm_type_check" module would already exist somewhere handy for download.
FYI: the docstring fields for epydoc and Sphinx are as follows:
epydoc:
Functions and Methods parameters:
@param p: ... # A description of the parameter p for a function or method.
@type p: ... # The expected type for the parameter p.
@return: ... # The return value for a function or method.
@rtype: ... # The type of the return value for a function or method.
@keyword p: ... # A description of the keyword parameter p.
@raise e: ... # A description of the circumstances under which a function or method raises exception e.
Sphinx: Inside Python object description directives, reST field lists with these fields are recognized and formatted nicely:
param, parameter, arg, argument, key, keyword: Description of a parameter.
type: Type of a parameter.
raises, raise, except, exception: That (and when) a specific exception is raised.
var, ivar, cvar: Description of a variable.
returns, return: Description of the return value.
rtype: Return type.
The closest I could find was mypy, mentioned by Guido on mail.python.org and created by Jukka Lehtosalo (see the Mypy examples). CMIIW: mypy cannot be imported as a py3 module.
Similar stackoverflow questions that do not use the docstring per se:
Pythonic Way To Check for A Parameter Type
What's the canonical way to check for type in python?
To my knowledge, nothing of the sort exists, for a few important reasons:
First, docstrings are documentation, just like comments. And just like comments, people will expect them to have no effect on the way your program works. Making your program's behavior depend on its documentation is a major antipattern, and a horrible idea all around.
Second, docstrings aren't guaranteed to be preserved. If you run python with -OO, for example, all docstrings are removed. What then?
Finally, Python 3 introduced optional function annotations (PEP 3107: http://legacy.python.org/dev/peps/pep-3107/), which would serve that purpose much better. Python currently does nothing with them (they're effectively documentation), but if I were to write such a module, I'd use those, not docstrings.
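For illustration only, a minimal sketch of what such a module might look like if it keyed off Python 3 annotations rather than docstrings (the decorator name and messages are made up, not an existing package):

import functools
import inspect

def enforce_annotations(func):
    sig = inspect.signature(func)

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        bound = sig.bind(*args, **kwargs)
        for name, value in bound.arguments.items():
            expected = sig.parameters[name].annotation
            # Only check parameters that actually carry a class annotation.
            if expected is not inspect.Parameter.empty and not isinstance(value, expected):
                raise TypeError("%s should be %s, got %r"
                                % (name, expected.__name__, value))
        return func(*args, **kwargs)
    return wrapper

@enforce_annotations
def repeat(text: str, times: int) -> str:
    return text * times

repeat("hi", 3)     # fine
repeat("hi", "3")   # raises TypeError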
My honest opinion is this: if you're going to go through the (considerable) trouble of writing a (necessarily half-baked) static type system for Python, all the time it would take you would be better spent learning another programming language that supports static typing in a less insane way:
Clojure (http://clojure.org/) is incredibly dynamic and powerful (due to its nature as a Lisp) and supports optional static typing through core.typed (https://github.com/clojure/core.typed). It is geared towards concurrency and networking (it has STM and persistent data structures <3 ), has a great community, and is one of the most elegantly-designed languages I've seen. That said, it runs on the JVM, which is both a good and a bad thing.
Golang (http://golang.org/) feels sort-of Pythonic (or at least, it's attracting a lot of refugees from Python), is statically typed and compiles to native code.
Rust (http://www.rust-lang.org/) is lower-level than that, but it has one of the best type systems I've seen (type inference, pattern matching, traits, generics, zero-sized types...) and enforces memory and resource safety at compile time. It is being developed by Mozilla as a language to write their next browser (Servo) in, so performance and safety are its main goals. You can think of it as a modern take on C++. It compiles to native code, but hasn't hit 1.0 yet and as such, the language itself is still subject to change. Which is why I wouldn't recommend writing production code in it yet.

Statically Typed Metaprogramming?

I've been thinking about what I would miss in porting some Python code to a statically typed language such as F# or Scala; the libraries can be substituted, the conciseness is comparable, but I have lots of python code which is as follows:
@specialclass
class Thing(object):
    @specialFunc
    def method1(arg1, arg2):
        ...
    @specialFunc
    def method2(arg3, arg4, arg5):
        ...
Where the decorators do a huge amount: replacing the methods with callable objects with state, augmenting the class with additional data and properties, etc.. Although Python allows dynamic monkey-patch metaprogramming anywhere, anytime, by anyone, I find that essentially all my metaprogramming is done in a separate "phase" of the program. i.e.:
load/compile .py files
transform using decorators
// maybe transform a few more times using decorators
execute code // no more transformations!
These phases are basically completely distinct; I do not run any application level code in the decorators, nor do I perform any ninja replace-class-with-other-class or replace-function-with-other-function in the main application code. Although the "dynamic"ness of the language says I can do so anywhere I want, I never go around replacing functions or redefining classes in the main application code because it gets crazy very quickly.
I am, essentially, performing a single re-compile on the code before I start running it.
The only similar metaprogramming I know of in statically typed languages is reflection: i.e. getting functions/classes from strings, invoking methods using argument arrays, etc. However, this basically converts the statically typed language into a dynamically typed language, losing all type safety (correct me if I'm wrong?). Ideally, I think, I would have something like the following:
load/parse application files
load/compile transformer
transform application files using transformer
compile
execute code
Essentially, you would be augmenting the compilation process with arbitrary code, compiled using the normal compiler, that will perform transformations on the main application code. The point is that it essentially emulates the "load, transform(s), execute" workflow while strictly maintaining type safety.
If the application code is borked, the compiler will complain; if the transformer code is borked, the compiler will complain; and if the transformer code compiles but doesn't do the right thing, either it will crash or the compilation step after it will complain that the final types don't add up. In any case, you will never get the runtime type errors that are possible when using reflection to do dynamic dispatch: it would all be statically checked at every step.
So my question is, is this possible? Has it already been done in some language or framework which I do not know about? Is it theoretically impossible? I'm not very familiar with compiler or formal language theory; I know it would make the compilation step Turing complete with no guarantee of termination, but it seems to me that this is what I would need to match the sort of convenient code transformation I get in a dynamic language while maintaining static type checking.
EDIT: One example use case would be a completely generic caching decorator. In Python it would be:
import functools

cacheDict = {}

def cache(func):
    @functools.wraps(func)
    def wrapped(*args, **kwargs):
        # kwargs is a dict and isn't hashable, so build the key from a sorted tuple
        cachekey = (args, tuple(sorted(kwargs.items())))
        if cachekey not in cacheDict:
            cacheDict[cachekey] = func(*args, **kwargs)
        return cacheDict[cachekey]
    return wrapped

@cache
def expensivepurefunction(arg1, arg2):
    # do stuff
    return result
While higher order functions can do some of this or objects-with-functions-inside can do some of this, AFAIK they cannot be generalized to work with any function taking an arbitrary set of parameters and returning an arbitrary type while maintaining type safety. I could do stuff like:
public Thingy wrap(Object O){ //this probably won't compile, but you get the idea
    return (params Object[] args) => {
        //check cache
        return InvokeWithReflection(O, args)
    }
}
But all the casting completely kills type safety.
EDIT: This is a simple example, where the function signature does not change. Ideally what I am looking for could modify the function signature, changing the input parameters or output type (à la function composition) while still maintaining type checking.
Very interesting question.
Some points regarding metaprogramming in Scala:
In Scala 2.10 there will be developments in Scala reflection.
There is work on source-to-source transformation (macros), which is what you are looking for: scalamacros.org
Java has introspection (through the reflection API) but does not allow self-modification. However, you can use tools to support this (such as Javassist). In theory you could use these tools in Scala to achieve more than introspection.
From what I could understand of your development process, you separate your domain code from your decorators (or a cross-cutting concern, if you will), which lets you achieve modularity and code simplicity. This could be a good use case for aspect-oriented programming, which allows just that. For Java there is a library (AspectJ), but I'm dubious it will work with Scala.
So my question is, is this possible?
There are many ways to achieve the same effect in statically-typed programming languages.
You have essentially described the process of doing some term rewriting on a program before executing it. This functionality is perhaps best known in the form of the Lisp macro but some statically typed languages also have macro systems, most notably OCaml's camlp4 macro system which can be used to extend the language.
More generally, you are describing one form of language extensibility. There are many alternatives and different languages provide different techniques. See my blog post Extensibility in Functional Programming for more information. Note that many of these languages are research projects so the motivation is to add novel features and not necessarily good features, so they rarely retrofit good features that were invented elsewhere.
The ML (meta language) family of languages including Standard ML, OCaml and F# were specifically designed for metaprogramming. Consequently, they tend to have awesome support for lexing, parsing, rewriting, interpreting and compiling. However, F# is the most far removed member of this family and lacks the mature tools that languages like OCaml benefit from (e.g. camlp4, ocamllex, dypgen, menhir etc.). F# does have a partial implementation of fslex, fsyacc and a Haskell-inspired parser combinator library called FParsec.
You may well find that the problem you are facing (which you have not described) is better solved using more traditional forms of metaprogramming, most notably a DSL or EDSL.
Without knowing why you're doing this, it's difficult to know whether this kind of approach is the right one in Scala or F#. But ignoring that for now, it's certainly possible to achieve in Scala, at least, although not at the language level.
A compiler plugin gives you access to the tree and allows you to perform all kinds of manipulation of that tree, all fully typechecked.
There are some issues with generating synthetic methods in Scala compiler plugins - it's difficult for me to know whether that will be a problem for you.
It is possible to work around this by creating a compiler plugin that generates source code which is then compiled in a separate pass. This is how ScalaMock works, for instance.
You might be interested in source-to-source program transformation systems (PTS).
Such tools parse the source code, producing an AST, and then allow one to define arbitrary analyses and/or transformations on the code, finally regenerating source code from the modified AST.
Some tools provide parsing, tree building and AST navigation by a procedural interface, such as ANTLR. Many of the more modern dynamic languages (Python, Scala, etc.) have had some self-hosting parser libraries built, and even Java (compiler plug-ins) and C# (open compiler) are catching on to this idea.
But mostly these tools only provide procedural access to the AST. A system with surface-syntax rewriting allows you to express "if you see this change it to that" using patterns with the syntax of the language(s) being manipulated. These include Stratego/XT and TXL.
It is our experience that manipulating complex languages requires complex compiler support and reasoning; this is the canonical lesson from 70 years of people building compilers. All of the above tools suffer from not having access to symbol tables and various kinds of flow analysis; after all, how one part of the program operates, depends on action taken in remote parts, so information flow is fundamental. [As noted in comments on another answer, you can implement symbol tables/flow analysis with those tools; my point is they give you no special support for doing so, and these are difficult tasks, even worse on modern languages with complex type systems and control flows].
Our DMS Software Reengineering Toolkit is a PTS that provides all of the above facilities (Life After Parsing), at some cost in configuring it to your particular language or DSL, which we try to ameliorate by providing these off-the-shelf for mainstream languages. [DMS provides explicit infrastructure for building/managing symbol tables, control and data flow; this has been used to implement these mechanisms for Java 1.8 and full C++14].
DMS has also been used to define meta-AOP, tools that enable one to build AOP systems for arbitrary languages and apply AOP like operations.
In any case, to the extent that you simply modify the AST, directly or indirectly, you have no guarantee of "type safety". You can only get that by writing transformation rules that don't break it. For that, you'd need a theorem prover to check that each modification (or composition of such) didn't break type safety, and that's pretty much beyond the state of the art. However, you can be careful how you write your rules, and get pretty useful systems.
You can see an example of the specification of a DSL and its manipulation with surface-syntax source-to-source rewriting rules that preserve semantics, in this example that defines and manipulates algebra and calculus using DMS. I note this example is kept simple to make it understandable; in particular, it does not exhibit any of the flow analysis machinery DMS offers.
Ideally what I am looking for could modify the function signature, changing the input parameters or output type (a.l.a. function composition) while still maintaining type checking.
I have the same need for making R APIs available in the type-safe world. This way we would bring the wealth of scientific code from R into the (type-)safe world of Scala.
Rationale
Make it possible to document the business-domain aspects of the APIs through Specs2 (see https://etorreborre.github.io/specs2/guide/SPECS2-3.0/org.specs2.guide.UserGuide.html; the guide is generated from Scala code). Think Domain Driven Design applied backwards.
Take a language-oriented approach to the challenges faced by SparkR, which tries to combine Spark with R.
See https://spark-summit.org/east-2015/functionality-and-performance-improvement-of-sparkr-and-its-application/ for attempts to improve how it is currently done in SparkR. See also https://github.com/onetapbeyond/renjin-spark-executor for a simplistic way to integrate.
In terms of a solution, we could use Renjin (a Java-based R interpreter) as the runtime engine, but use StrategoXT/Metaborg to parse R and generate strongly typed Scala APIs (like you describe).
StrategoXT (http://www.metaborg.org/en/latest/) is the most powerful DSL development platform I know. It allows combining/embedding languages using a parsing technology that supports composing languages (longer story).

How to document and test interfaces required of formal parameters in Python 2?

To ask my very specific question I find I need quite a long introduction to motivate and explain it -- I promise there's a proper question at the end!
While reading part of a large Python codebase, sometimes one comes across code where the interface required of an argument is not obvious from "nearby" code in the same module or package. As an example:
def make_factory(schema):
    entity = schema.get_entity()
    ...
There might be many "schemas" and "factories" that the code deals with, and "def get_entity()" might be quite common too (or perhaps the function doesn't call any methods on schema, but just passes it to another function). So a quick grep isn't always helpful to find out more about what "schema" is (and the same goes for the return type). Though "duck typing" is a nice feature of Python, sometimes the uncertainty in a reader's mind about the interface of arguments passed in as the "schema" gets in the way of quickly understanding the code (and the same goes for uncertainty about typical concrete classes that implement the interface). Looking at the automated tests can help, but explicit documentation can be better because it's quicker to read. Any such documentation is best when it can itself be tested so that it doesn't get out of date.
Doctests are one possible approach to solving this problem, but that's not what this question is about.
Python 3 has a "parameter annotations" feature (part of the function annotations feature, defined in PEP 3107). The uses to which that feature might be put aren't defined by the language, but it can be used for this purpose. That might look like this:
def make_factory(schema: "xml_schema"):
    ...
Here, "xml_schema" identifies a Python interface that the argument passed to this function should support. Elsewhere there would be code that defines that interface in terms of attributes, methods & their argument signatures, etc. and code that allows introspection to verify whether particular objects provide an interface (perhaps implemented using something like zope.interface / zope.schema). Note that this doesn't necessarily mean that the interface gets checked every time an argument is passed, nor that static analysis is done. Rather, the motivation of defining the interface is to provide ways to write automated tests that verify that this documentation isn't out of date (they might be fairly generic tests so that you don't have to write a new test for each function that uses the parameters, or you might turn on run-time interface checking but only when you run your unit tests). You can go further and annotate the interface of the return value, which I won't illustrate.
So, the question:
I want to do exactly that, but using Python 2 instead of Python 3. Python 2 doesn't have the function annotations feature. What's the "closest thing" in Python 2? Clearly there is more than one way to do it, but I suspect there is one (relatively) obvious way to do it.
For extra points: name a library that implements the one obvious way.
Take a look at plac, which uses annotations to define a command-line interface for a script. On Python 2.x it uses the plac.annotations() decorator.
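The general shape of that decorator approach, as a made-up sketch rather than plac's actual API: stash the interface names on the function object, where tests can introspect them, since Python 2 functions have no annotations of their own:

def annotations(**notes):
    def decorator(func):
        func.__annotations__ = notes   # just a plain function attribute in Python 2
        return func
    return decorator

@annotations(schema="xml_schema")
def make_factory(schema):
    entity = schema.get_entity()
    return entity

print(make_factory.__annotations__)    # {'schema': 'xml_schema'}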
The closest thing is, I believe, an annotation library called PyAnno.
From the project webpage:
"The Pyanno annotations have two functions:
Provide a structured way to document Python code
Perform limited run-time checking"
