Safe implementation strategy for embedded user-defined expressions

Safe implementation strategy for embedded user-defined expressions - python

I am designing/prototyping a Domain Specific Language... in Python, for now, at least. The design is straightforward - but requiring support to specify an arbitrary function (the domain of which is a map from labels to integers - the range is an integer.) In many cases, the function will merely select a label in the domain to yield a result... but I want to allow the specification of any function that could be easily (and efficiently) implemented in a general purpose programming language.
A caveat is that I want the function to be 'safe'... by this I mean:
A 'pure' function: deterministic with no side effects. (i.e. no external state; no interaction with files, I/O, devices - etc.)
Terminating - either successfully, or after specific (small-scale) allocated computational resources have expired.
I am keen that this function should be implemented efficiently - I expect definitions to be provided infrequently - and evaluated very frequently. I would also like the functions to be defined using a familiar syntax.
I've considered supporting the implementation of functions in python... I'm aware that I could impose restrictions using the eval() function, and I've found the AST module - suggesting an approach involving parsing to an AST, then interpreting (or verifying, prior to evaluation) the AST tree. I've also read about pyparse and consdered implementing a bespoke, interpreted, language.
I can't help think that trying to block undesirable behaviour from eval() is to be tackling the problem "backwards" (trying to block undesirable functionality ex-post) whereas implementing a bespoke language would involve re-inventing the wheel.
Does Python already have a safe, efficient, embeddable, expression interpreter?

PyPy has a sandbox.
If you're running this in the web browser (the usual place for untrusted code concerns) consider running it client-side with something like Brython. No-one cares if the user hacks his own machine.
If you do implement a bespoke interpreter, you don't have to re-implement all of the wheel. It's thought to be relatively safe to use compile() on untrusted code, but beware of large constants eating time and memory. Run the compiler in a separate process you can kill. Then you just need to write a Python bytecode interpreter that lacks access to anything important.

Related

Interpret Python bytecode in C# (with fine control)

For a project idea of mine, I have the following need, which is quite precise:
I would like to be able to execute Python code (pre-compiled before hand if necessary) on a per-bytecode-instruction basis. I also need to access what's inside the Python VM (frame stack, data stacks, etc.). Ideally, I would also like to remove a lot of Python built-in features and reimplement a few of them my own way (such as file writing).
All of this must be coded in C# (I'm using Unity).
I'm okay with loosing a few of Python's actual features, especially concerning complicated stuff with imports, etc. However, I would like most of it to stay intact.
I looked a little bit into IronPython's code but it remains very obscure to me and it seems quite enormous too. I began translating Byterun (a Python bytecode interpreter written in Python) but I face a lot of difficulties as Byterun leverages a lot of Python's features to... interpret Python.
Today, I don't ask for a pre-made solution (except if you have one in mind?), but rather for some advice, places to look at, etc. Do you have any ideas about the things I should research first?

I've tried to do my own implementation of the Python VM in the distant past and learned a lot but never came even close to a fully working implementation. I used the C implementation as a starting point, specifically everything in https://github.com/python/cpython/tree/main/Objects and
https://github.com/python/cpython/blob/main/Python/ceval.c (look for switch(opcode))
Here are some pointers:
Come to grips with the Python object model. Implement an abstract PyObject class with the necessary methods for instancing, attribute access, indexing and slicing, calling, comparisons, aritmetic operations and representation. Provide concrete implemetations for None, booleans, ints, floats, strings, tuples, lists and dictionaries.
Implement the core of your VM: a Frame object that loops over the opcodes and dispatches, using a giant switch statment (following the C implementation here), to the corresponding methods of the PyObject. The frame should maintains a stack of PyObjects for the operants of the opcodes. Depending on the opcode, arguments are popped from and pushed on this stack. A dict can be used to store and retrieve local variables. Use the Frame object to create a PyObject for function objects.
Get familiar with the idea of a namespace and the way Python builds on the concept of namespaces. Implement a module, a class and an instance object, using the dict to map (attribute)names to objects.
Finally, add as many builtin functions as you think you need to get a usefull implementation.
I think it is easy to underestimate the amount of work you're getting yourself into, but ... have fun!

How to apply Closed-Open and Inversion of Control principles in Python?

Building out a new application now and struggling a lot with the implementation part of "Closed-Open" and "Inversion of Control" principles I following after reading Clean Architecture book by Uncle Bob.
How can I implement them in Python?
Usually, these two principles coming hand in hand and depicted in the UML as an Interface reversing control from module/package A to B.
I'm confused because:
Python does not possess Interfaces as Java and C++ do. Yes, there are ABC and #abstractmethod, but it is not a Pythonic style and redundant from my point of view if you are not developing a framework
Passing a class to the method of another one (I understood that it is a way to implement open-closed principle) is a little bit dangerous in Python, since it does not have a compiler which is catching issues may (and will) happen if one of two loosely coupled objects change
After neglecting interfaces and passing a top-level class to lower-level ones... I still need to import everything somewhere at the top module. And by that, the whole thing is violated.
So, as you can see I'm super confused and having a hard time programming according to my design. I came up with. Can you help me, please?

You just pass an object that implements the methods you need it to implement.
True, there is no "Interface" to define what those methods have to be, but that's just the way it is in python.
You pass around arguments all the time that have to be lists, maps, tuples, or whatever, and none of these are type-checked. You can write code that calls whatever you want on these things and python will not notice any kind of problem until that code is actually executed.
It's exactly the same when you need those arguments to implement whatever IoC interface you're using. Make sure you detail the requirements in comments.
Yes, this is all pretty dangerous. That's why we prefer statically typed languages for large systems that have complex interfaces.

LinkedList on python and c++ [duplicate]

Why does Python seem slower, on average, than C/C++? I learned Python as my first programming language, but I've only just started with C and already I feel I can see a clear difference.

Python is a higher level language than C, which means it abstracts the details of the computer from you - memory management, pointers, etc, and allows you to write programs in a way which is closer to how humans think.
It is true that C code usually runs 10 to 100 times faster than Python code if you measure only the execution time. However if you also include the development time Python often beats C. For many projects the development time is far more critical than the run time performance. Longer development time converts directly into extra costs, fewer features and slower time to market.
Internally the reason that Python code executes more slowly is because code is interpreted at runtime instead of being compiled to native code at compile time.
Other interpreted languages such as Java bytecode and .NET bytecode run faster than Python because the standard distributions include a JIT compiler that compiles bytecode to native code at runtime. The reason why CPython doesn't have a JIT compiler already is because the dynamic nature of Python makes it difficult to write one. There is work in progress to write a faster Python runtime so you should expect the performance gap to be reduced in the future, but it will probably be a while before the standard Python distribution includes a powerful JIT compiler.

CPython is particularly slow because it has no Just in Time optimizer (since it's the reference implementation and chooses simplicity over performance in certain cases). Unladen Swallow is a project to add an LLVM-backed JIT into CPython, and achieves massive speedups. It's possible that Jython and IronPython are much faster than CPython as well as they are backed by heavily optimized virtual machines (JVM and .NET CLR).
One thing that will arguably leave Python slower however, is that it's dynamically typed, and there is tons of lookup for each attribute access.
For instance calling f on an object A will cause possible lookups in __dict__, calls to __getattr__, etc, then finally call __call__ on the callable object f.
With respect to dynamic typing, there are many optimizations that can be done if you know what type of data you are dealing with. For example in Java or C, if you have a straight array of integers you want to sum, the final assembly code can be as simple as fetching the value at the index i, adding it to the accumulator, and then incrementing i.
In Python, this is very hard to make code this optimal. Say you have a list subclass object containing ints. Before even adding any, Python must call list.__getitem__(i), then add that to the "accumulator" by calling accumulator.__add__(n), then repeat. Tons of alternative lookups can happen here because another thread may have altered for example the __getitem__ method, the dict of the list instance, or the dict of the class, between calls to add or getitem. Even finding the accumulator and list (and any variable you're using) in the local namespace causes a dict lookup. This same overhead applies when using any user defined object, although for some built-in types, it's somewhat mitigated.
It's also worth noting, that the primitive types such as bigint (int in Python 3, long in Python 2.x), list, set, dict, etc, etc, are what people use a lot in Python. There are tons of built in operations on these objects that are already optimized enough. For example, for the example above, you'd just call sum(list) instead of using an accumulator and index. Sticking to these, and a bit of number crunching with int/float/complex, you will generally not have speed issues, and if you do, there is probably a small time critical unit (a SHA2 digest function, for example) that you can simply move out to C (or Java code, in Jython). The fact is, that when you code C or C++, you are going to waste lots of time doing things that you can do in a few seconds/lines of Python code. I'd say the tradeoff is always worth it except for cases where you are doing something like embedded or real time programming and can't afford it.

Compilation vs interpretation isn't important here: Python is compiled, and it's a tiny part of the runtime cost for any non-trivial program.
The primary costs are: the lack of an integer type which corresponds to native integers (making all integer operations vastly more expensive), the lack of static typing (which makes resolution of methods more difficult, and means that the types of values must be checked at runtime), and the lack of unboxed values (which reduce memory usage, and can avoid a level of indirection).
Not that any of these things aren't possible or can't be made more efficient in Python, but the choice has been made to favor programmer convenience and flexibility, and language cleanness over runtime speed. Some of these costs may be overcome by clever JIT compilation, but the benefits Python provides will always come at some cost.

The difference between python and C is the usual difference between an interpreted (bytecode) and compiled (to native) language. Personally, I don't really see python as slow, it manages just fine. If you try to use it outside of its realm, of course, it will be slower. But for that, you can write C extensions for python, which puts time-critical algorithms in native code, making it way faster.

Python is typically implemented as a scripting language. That means it goes through an interpreter which means it translates code on the fly to the machine language rather than having the executable all in machine language from the beginning. As a result, it has to pay the cost of translating code in addition to executing it. This is true even of CPython even though it compiles to bytecode which is closer to the machine language and therefore can be translated faster. With Python also comes some very useful runtime features like dynamic typing, but such things typically cannot be implemented even on the most efficient implementations without heavy runtime costs.
If you are doing very processor-intensive work like writing shaders, it's not uncommon for Python to be somewhere around 200 times slower than C++. If you use CPython, that time can be cut in half but it's still nowhere near as fast. With all those runtmie goodies comes a price. There are plenty of benchmarks to show this and here's a particularly good one. As admitted on the front page, the benchmarks are flawed. They are all submitted by users trying their best to write efficient code in the language of their choice, but it gives you a good general idea.
I recommend you try mixing the two together if you are concerned about efficiency: then you can get the best of both worlds. I'm primarily a C++ programmer but I think a lot of people tend to code too much of the mundane, high-level code in C++ when it's just a nuisance to do so (compile times as just one example). Mixing a scripting language with an efficient language like C/C++ which is closer to the metal is really the way to go to balance programmer efficiency (productivity) with processing efficiency.

Comparing C/C++ to Python is not a fair comparison. Like comparing a F1 race car with a utility truck.
What is surprising is how fast Python is in comparison to its peers of other dynamic languages. While the methodology is often considered flawed, look at The Computer Language Benchmark Game to see relative language speed on similar algorithms.
The comparison to Perl, Ruby, and C# are more 'fair'

Aside from the answers already posted, one thing is Python's ability to change things during runtime, which you can't do in other languages such as C. You can add member functions to classes as you go.
Also, Pythons' dynamic nature makes it impossible to say what type of parameters will be passed to a function, which in turn makes optimizing a whole lot harder.
RPython seems to be a way of getting around the optimization problem.
Still, it'll probably won't be near the performance of C for number-crunching and the like.

C and C++ compile to native code- that is, they run directly on the CPU. Python is an interpreted language, which means that the Python code you write must go through many, many stages of abstraction before it can become executable machine code.

Python is a high-level programming language. Here is how a python script runs:
The python source code is first compiled into Byte Code. Yes, you heard me right! Though Python is an interpreted language, it first gets compiled into byte code. This byte code is then interpreted and executed by the Python Virtual Machine(PVM).
This compilation and execution are what make Python slower than other low-level languages such as C/C++. In languages such as C/C++, the source code is compiled into binary code which can be directly executed by the CPU thus making their execution efficient than that of Python.

This answer applies to python3. Most people do not know that a JIT-like compile occurs whenever you use the import statement. CPython will search for the imported source file (.py), take notice of the modification date, then look for compiled-to-bytecode file (.pyc) in a subfolder named "_ _ pycache _ _" (dunder pycache dunder). If everything matches then your program will use that bytecode file until something changes (you change the source file or upgrade Python)
But this never happens with the main program which is usually started from a BASH shell, interactively or via. Here is an example:
#!/usr/bin/python3
# title : /var/www/cgi-bin/name2.py
# author: Neil Rieck
# edit : 2019-10-19
# ==================
import name3 # name3.py will be cache-checked and/or compiled
import name4 # name4.py will be cache-checked and/or compiled
import name5 # name5.py will be cache-checked and/or compiled
#
def main():
#
# code that uses the imported libraries goes here
#
if __name__ == "__main__":
main()
#
Once executed, the compiled output code will be discarded. However, your main python program will be compiled if you start up via an import statement like so:
#!/usr/bin/python3
# title : /var/www/cgi-bin/name1
# author: Neil Rieck
# edit : 2019-10-19
# ==================
import name2 # name2.py will be cache-checked and/or compiled
#name2.main() #
And now for the caveats:
if you were testing code interactively in the Apache area, your compiled file might be saved with privs that Apache can't read (or write on a recompile)
some claim that the subfolder "_ _ pycache _ _" (dunder pycache dunder) needs to be available in the Apache config
will SELinux allow CPython to write to subfolder (this was a problem in CentOS-7.5 but I believe a patch has been made available)
One last point. You can access the compiler yourself, generate the pyc files, then change the protection bits as a workaround to any of the caveats I've listed. Here are two examples:
method #1
=========
python3
import py_compile
py_compile("name1.py")
exit()
method #2
=========
python3 -m py_compile name1.py

python is interpreted language is not complied and its not get combined with CPU hardware
but I have a solutions for increase python as a faster programing language
1.Use python3 for run and code python command like Ubuntu or any Linux distro use python3 main.py and update regularly your python so you python3 framework modules and libraries i will suggest use pip 3.
2.Use [Numba][1] python framework with JIT compiler this framework use for data visualization but you can use for any program this framework use GPU acceleration of your program.
3.Use [Profiler optimizing][1] so this use for see with function or syntax for bit longer or faster also have use full to change syntax as a faster for python its very god and work full so this give a with function or syntax using much more time execution of code.
4.Use multi threading so making multiprocessing of program for python so use CPU cores and threads so this make your code much more faster.
5.Using C,C#,C++ increasing python much more faster i think its called parallel programing use like a [cpython][1] .
6.Debug your code for test your code to make not bug in your code so then you will get little bit your code faster also have one more thing Application logging is for debugging code.
and them some low things that makes your code faster:
1.Know the basic data structures for using good syntax use make best code.
2.make a best code have Reduce memory footprinting.
3.Use builtin functions and libraries.
4.Move calculations outside the loop.
5.keep your code base small.
so using this thing then get your code much more faster yes so using this python not a slow programing language

Statically Typed Metaprogramming?

I've been thinking about what I would miss in porting some Python code to a statically typed language such as F# or Scala; the libraries can be substituted, the conciseness is comparable, but I have lots of python code which is as follows:
#specialclass
class Thing(object):
#specialFunc
def method1(arg1, arg2):
...
#specialFunc
def method2(arg3, arg4, arg5):
...
Where the decorators do a huge amount: replacing the methods with callable objects with state, augmenting the class with additional data and properties, etc.. Although Python allows dynamic monkey-patch metaprogramming anywhere, anytime, by anyone, I find that essentially all my metaprogramming is done in a separate "phase" of the program. i.e.:
load/compile .py files
transform using decorators
// maybe transform a few more times using decorators
execute code // no more transformations!
These phases are basically completely distinct; I do not run any application level code in the decorators, nor do I perform any ninja replace-class-with-other-class or replace-function-with-other-function in the main application code. Although the "dynamic"ness of the language says I can do so anywhere I want, I never go around replacing functions or redefining classes in the main application code because it gets crazy very quickly.
I am, essentially, performing a single re-compile on the code before i start running it.
The only similar metapogramming i know of in statically typed languages is reflection: i.e. getting functions/classes from strings, invoking methods using argument arrays, etc. However, this basically converts the statically typed language into a dynamically typed language, losing all type safety (correct me if i'm wrong?). Ideally, I think, I would have something like the following:
load/parse application files
load/compile transformer
transform application files using transformer
compile
execute code
Essentially, you would be augmenting the compilation process with arbitrary code, compiled using the normal compiler, that will perform transformations on the main application code. The point is that it essentially emulates the "load, transform(s), execute" workflow while strictly maintaining type safety.
If the application code are borked the compiler will complain, if the transformer code is borked the compiler will complain, if the transformer code compiles but doesn't do the right thing, either it will crash or the compilation step after will complain that the final types don't add up. In any case, you will never get the runtime type-errors possible by using reflection to do dynamic dispatch: it would all be statically checked at every step.
So my question is, is this possible? Has it already been done in some language or framework which I do not know about? Is it theoretically impossible? I'm not very familiar with compiler or formal language theory, I know it would make the compilation step turing complete and with no guarantee of termination, but it seems to me that this is what I would need to match the sort of convenient code-transformation i get in a dynamic language while maintaining static type checking.
EDIT: One example use case would be a completely generic caching decorator. In python it would be:
cacheDict = {}
def cache(func):
#functools.wraps(func)
def wrapped(*args, **kwargs):
cachekey = hash((args, kwargs))
if cachekey not in cacheDict.keys():
cacheDict[cachekey] = func(*args, **kwargs)
return cacheDict[cachekey]
return wrapped
#cache
def expensivepurefunction(arg1, arg2):
# do stuff
return result
While higher order functions can do some of this or objects-with-functions-inside can do some of this, AFAIK they cannot be generalized to work with any function taking an arbitrary set of parameters and returning an arbitrary type while maintaining type safety. I could do stuff like:
public Thingy wrap(Object O){ //this probably won't compile, but you get the idea
return (params Object[] args) => {
//check cache
return InvokeWithReflection(O, args)
}
}
But all the casting completely kills type safety.
EDIT: This is a simple example, where the function signature does not change. Ideally what I am looking for could modify the function signature, changing the input parameters or output type (a.l.a. function composition) while still maintaining type checking.

Very interesting question.
Some points regarding metaprogramming in Scala:
In scala 2.10 there will be developments in scala reflection
There is work in source to source transformation (macros) which is something you are looking for: scalamacros.org
Java has introspection (through the reflection api) but does not allow self modification. However you can use tools to support this (such as javassist). In theory you could use these tools in Scala to achieve more than introspection.
From what I could understand of your development process, you separate your domain code from your decorators (or a cross cutting concern if you will) which allow to achieve modularity and code simplicity. This can be a good use for aspect oriented programming, which allows to just that. For Java theres is a library (aspectJ), however I'm dubious it will run with Scala.

So my question is, is this possible?
There are many ways to achieve the same effect in statically-typed programming languages.
You have essentially described the process of doing some term rewriting on a program before executing it. This functionality is perhaps best known in the form of the Lisp macro but some statically typed languages also have macro systems, most notably OCaml's camlp4 macro system which can be used to extend the language.
More generally, you are describing one form of language extensibility. There are many alternatives and different languages provide different techniques. See my blog post Extensibility in Functional Programming for more information. Note that many of these languages are research projects so the motivation is to add novel features and not necessarily good features, so they rarely retrofit good features that were invented elsewhere.
The ML (meta language) family of languages including Standard ML, OCaml and F# were specifically designed for metaprogramming. Consequently, they tend to have awesome support for lexing, parsing, rewriting, interpreting and compiling. However, F# is the most far removed member of this family and lacks the mature tools that languages like OCaml benefit from (e.g. camlp4, ocamllex, dypgen, menhir etc.). F# does have a partial implementation of fslex, fsyacc and a Haskell-inspired parser combinator library called FParsec.
You may well find that the problem you are facing (which you have not described) is better solved using more traditional forms of metaprogramming, most notably a DSL or EDSL.

Without knowing why you're doing this, it's difficult to know whether this kind of approach is the right one in Scala or F#. But ignoring that for now, it's certainly possible to achieve in Scala, at least, although not at the language level.
A compiler plugin gives you access to the tree and allows you to perform all kinds of manipulation of that tree, all fully typechecked.
There are some issues with generating synthetic methods in Scala compiler plugins - it's difficult for me to know whether that will be a problem for you.
It is possible to work around this by creating a compiler plugin that generates source code which is then compiled in a separate pass. This is how ScalaMock works, for instance.

You might be interested in source-to-source program transformation systems (PTS).
Such tools parse the source code, producing an AST, and then allow one to define arbitrary analyses and/or transformations on the code, finally regenerating source code from the modified AST.
Some tools provide parsing, tree building and AST navigation by a procedural interface, such as ANTLR. Many of the more modern dynamic languages (Python, Scala, etc.) have had some self-hosting parser libraries built, and even Java (compiler plug-ins) and C# (open compiler) are catching on to this idea.
But mostly these tools only provide procedural access to the AST. A system with surface-syntax rewriting allows you to express "if you see this change it to that" using patterns with the syntax of the language(s) being manipulated. These include Stratego/XT and TXL.
It is our experience that manipulating complex languages requires complex compiler support and reasoning; this is the canonical lesson from 70 years of people building compilers. All of the above tools suffer from not having access to symbol tables and various kinds of flow analysis; after all, how one part of the program operates, depends on action taken in remote parts, so information flow is fundamental. [As noted in comments on another answer, you can implement symbol tables/flow analysis with those tools; my point is they give you no special support for doing so, and these are difficult tasks, even worse on modern languages with complex type systems and control flows].
Our DMS Software Reengineering Toolkit is a PTS that provides all of the above facilities (Life After Parsing), at some cost in configuring it to your particular language or DSL, which we try to ameliorate by providing these off-the-shelf for mainstream languages. [DMS provides explicit infrastructure for building/managing symbol tables, control and data flow; this has been used to implement these mechanisms for Java 1.8 and full C++14].
DMS has also been used to define meta-AOP, tools that enable one to build AOP systems for arbitrary languages and apply AOP like operations.
In any case, to the extent that you simply modify the AST, directly or indirectly, you have no guarantee of "type safety". You can only get that by writing transformation rules that don't break it. For that, you'd need a theorem prover to check that each modification (or composition of such) didn't break type safety, and that's pretty much beyond the state of the art. However, you can be careful how you write your rules, and get pretty useful systems.
You can see an example of specification of a DSL and manipulation with surface-syntax source-to-source rewriting rules, that preserves semantics, in this example that defines and manipulates algebra and calculus using DMS. I note this example is simple to make it understandable; in particular, its does not exhibit any of the flow analysis machinery DMS offers.

Ideally what I am looking for could modify the function signature, changing the input parameters or output type (a.l.a. function composition) while still maintaining type checking.
I have same need for making R APIs available in the type safe world. This way we would bring the wealth of scientific code from R into the (type) safe world of Scala.
Rationale
Make possible documenting the business domain aspects of the APIs through Specs2 (see https://etorreborre.github.io/specs2/guide/SPECS2-3.0/org.specs2.guide.UserGuide.html; is generated from Scala code). Think Domain Driven Design applied backwards.
Take a language oriented approach to the challenges faced by SparkR which tries to combine Spark with R.
See https://spark-summit.org/east-2015/functionality-and-performance-improvement-of-sparkr-and-its-application/ for attempts to improve how it is currently done in SparkR. See also https://github.com/onetapbeyond/renjin-spark-executor for a simplistic way to integrate.
In terms of solutioning this we could use Renjin (Java based interpreter) as runtime engine but use StrategoXT Metaborg to parse R and generate strongly typed Scala APIs (like you describe).
StrategoTX (http://www.metaborg.org/en/latest/) is the most powerful DSL development platform I know. Allows combining/embedding languages using a parsing technology that allows composing languages (longer story).

How to document and test interfaces required of formal parameters in Python 2?

To ask my very specific question I find I need quite a long introduction to motivate and explain it -- I promise there's a proper question at the end!
While reading part of a large Python codebase, sometimes one comes across code where the interface required of an argument is not obvious from "nearby" code in the same module or package. As an example:
def make_factory(schema):
entity = schema.get_entity()
...
There might be many "schemas" and "factories" that the code deals with, and "def get_entity()" might be quite common too (or perhaps the function doesn't call any methods on schema, but just passes it to another function). So a quick grep isn't always helpful to find out more about what "schema" is (and the same goes for the return type). Though "duck typing" is a nice feature of Python, sometimes the uncertainty in a reader's mind about the interface of arguments passed in as the "schema" gets in the way of quickly understanding the code (and the same goes for uncertainty about typical concrete classes that implement the interface). Looking at the automated tests can help, but explicit documentation can be better because it's quicker to read. Any such documentation is best when it can itself be tested so that it doesn't get out of date.
Doctests are one possible approach to solving this problem, but that's not what this question is about.
Python 3 has a "parameter annotations" feature (part of the function annotations feature, defined in PEP 3107). The uses to which that feature might be put aren't defined by the language, but it can be used for this purpose. That might look like this:
def make_factory(schema: "xml_schema"):
...
Here, "xml_schema" identifies a Python interface that the argument passed to this function should support. Elsewhere there would be code that defines that interface in terms of attributes, methods & their argument signatures, etc. and code that allows introspection to verify whether particular objects provide an interface (perhaps implemented using something like zope.interface / zope.schema). Note that this doesn't necessarily mean that the interface gets checked every time an argument is passed, nor that static analysis is done. Rather, the motivation of defining the interface is to provide ways to write automated tests that verify that this documentation isn't out of date (they might be fairly generic tests so that you don't have to write a new test for each function that uses the parameters, or you might turn on run-time interface checking but only when you run your unit tests). You can go further and annotate the interface of the return value, which I won't illustrate.
So, the question:
I want to do exactly that, but using Python 2 instead of Python 3. Python 2 doesn't have the function annotations feature. What's the "closest thing" in Python 2? Clearly there is more than one way to do it, but I suspect there is one (relatively) obvious way to do it.
For extra points: name a library that implements the one obvious way.

Take a look at plac that uses annotations to define a command-line interface for a script. On Python 2.x it uses plac.annotations() decorator.

The closest thing is, I believe, an annotation library called PyAnno.
From the project webpage:
"The Pyanno annotations have two functions:
Provide a structured way to document Python code
Perform limited run-time checking "

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.