Parsing a DTD to reveal hierarchy of elements - python

My goal is to parse several relatively complex DTDs to reveal the hierarchy of elements. The only distinction between DTDs is the version, but each version made no attempt to remain backwards compatible--that would be too easy! As such, I intend to visualize the structure of elements defined by each DTD so that I can design a database model suitable for uniformly storing the data.
Because most solutions I've investigated in Python will only validate against external DTDs, I've decided to start my efforts from the beginning. Python's xml.parsers.expat only parses XML files and implements very basic DTD callbacks, so I've decided to check out the original version, which was written in C and claims to fully comport with the XML 1.0 specifications. However, I have the following questions about this approach:
Will expat (in C) parse external entity references in a DTD file and follow those references, parse their elements, and add those elements to the hierarchy?
Can expat generalize and handle SGML, or will it fail after encountering an invalid DTD yet valid SGML file?
My requirements may lead to the conclusion that expat is inappropriate. If that's the case, I'm considering writing a lexer/parser for XML 1.0 DTDs. Are there any other options I should consider?
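For reference, here is roughly what those basic DTD callbacks in xml.parsers.expat look like (an untested sketch using a throwaway internal subset; whether the comment handler actually fires for comments inside the DTD subset is exactly the kind of thing I would still need to verify):

import xml.parsers.expat

# A throwaway document with an internal DTD subset, just to exercise the handlers.
doc = """<!DOCTYPE abstract [
<!--A concise summary of the disclosure.-->
<!ELEMENT abstract (doc-page+ | (abst-problem , abst-solution) | p+)>
]>
<abstract/>"""

def element_decl(name, model):
    # 'model' is expat's tuple-encoded representation of the content model.
    print "element:", name, model

def comment(data):
    # NOTE: I still need to verify this fires for comments inside the DTD subset.
    print "comment:", data

parser = xml.parsers.expat.ParserCreate()
parser.ElementDeclHandler = element_decl
parser.CommentHandler = comment
parser.Parse(doc, True)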
The following illustrates more succinctly my intent:
Input DTD Excerpt
<!--A concise summary of the disclosure.-->
<!ELEMENT abstract (doc-page+ | (abst-problem , abst-solution) | p+)>
Object Created from DTD Excerpt (pseudocode)
class abstract:
    member doc_page_array[]
    member abst_problem
    member abst_solution
    member paragraph_array[]
    member description = "A concise summary of the disclosure."
One challenging aspect is to attribute to the <!ELEMENT> tag the comment appearing above it. Hence, a homegrown parser might be necessary if I cannot use expat to accomplish this.
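For instance, pairing each comment with the declaration that immediately follows it can at least be prototyped with a regular expression (a rough sketch; dtd_text is assumed to hold the raw DTD source as a string):

import re

# Pair each <!-- comment --> with the <!ELEMENT ...> declaration right after it.
pattern = re.compile(
    r'<!--\s*(?P<comment>.*?)\s*-->\s*'
    r'<!ELEMENT\s+(?P<name>[\w.-]+)\s+(?P<model>[^>]*)>',
    re.DOTALL)

descriptions = dict(
    (m.group('name'), m.group('comment')) for m in pattern.finditer(dtd_text))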
Another problem is that some parsers have trouble processing DTDs that use Unicode characters greater than #xFFFF, so that might be another factor in favor of creating my own.
If it turns out that the lexer/parser route is better suited for my task, does anyone happen to know of a good way to convert these EBNF expressions to something capable of being parsed? I suppose the "best" approach might be to use regular expressions.
Anyway, these are just the thoughts I've had regarding my issue. Any answers to the above questions or suggestions on alternative approaches would be appreciated.

There are several existing tools which may fit your needs, including DTDParse, OpenSP, Matra, and DTD Parser. There are also articles on creating a custom parser.

Related

python: how to define types for conversion to xml

I am implementing a parser in python 2.7 for an xml-based file format. The format defines classes with fields of specific types.
What I want to do is define objects like this:
class DataType1(SomeCoolParent):
    a = Integer()
    b = Float()
    c = String()
    # and so on...
The resulting classes need to be able to be constructed from an XML document, and render back into XML (so that I can save and load files).
This looks a whole lot like the ORM libraries in SQLAlchemy or Django... except that there is no database, just XML. It's not clear to me how I would implement what I'm after with those libraries. I would basically want to rip out their backend database stuff and have my own conversion to XML and back again.
My current plan: set up a metaclass and a parent class (SomeCoolParent), define all the types (Integer, Float, String), and generally build a custom almost-ORM myself. I know how to do this, but it seems that someone else must have done/thought of this before, and maybe come up with a general-purpose solution...
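A rough sketch of the kind of thing I have in mind (no metaclass yet, and every name here is made up rather than taken from a real library):

class Field(object):
    """Marker for a typed field on a parent class."""

class Integer(Field): pass
class Float(Field): pass
class String(Field): pass

class SomeCoolParent(object):
    def __init__(self, **values):
        # Copy the class-level Field declarations into per-instance storage.
        self._fields = dict(
            (name, values.get(name))
            for name, obj in vars(type(self)).items()
            if isinstance(obj, Field))

    def to_xml(self):
        tag = type(self).__name__
        inner = ''.join('<%s>%s</%s>' % (n, v, n)
                        for n, v in sorted(self._fields.items()))
        return '<%s>%s</%s>' % (tag, inner, tag)

class DataType1(SomeCoolParent):
    a = Integer()
    b = Float()
    c = String()

print DataType1(a=1, b=2.5, c='spam').to_xml()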
Is there a "better" way to implement a hierarchy of fixed type structures with python objects, which allows or at least facilitates conversion to XML?
Caveats/Notes
As mentioned, I'm aware of SQLAlchemy and Django, but I'm not clear on how/whether I can do what I want with their ORM libraries. Answers that show how to do this with either of them definitely count.
I'm definitely open to other methods, those two are just the ORM libraries I am aware of.
There are several XML Schema specifications out there. My knowledge is somewhat dated, but XML Schema (W3C), a.k.a. XSD, and RELAX NG are the most popular, with RELAX NG having a lower learning curve. With that in mind, a good ORM (or would that be an OXM, 'Object-XML Mapper'?) using these data description formats ought to be out there. Some candidates are:
pyxb
generateDS
Stronger Google-fu may yield more options.

Python API Compatibility Checker

In my current work environment, we produce a large number of Python packages for internal use (10s if not 100s). Each package has some dependencies, usually on a mixture of internal and external packages, and some of these dependencies are shared.
As we approach dependency hell, updating dependencies becomes a time consuming process. While we care about the functional changes a new version might introduce, of equal (if not more) importance are the API changes that break the code.
Although running unit/integration tests against newer versions of a dependency helps us to catch some issues, our coverage is not close enough to 100% to make this a robust strategy. Release notes and a change log help identify major changes at a high-level, but these rarely exist for internally developed tools or go into enough detail to understand the implications the new version has on the (public) API.
I am looking at other ways to automate this process.
I would like to be able to automatically compare two versions of a Python package and report the API differences between them. In particular, this would include backwards-incompatible changes such as removing functions/methods/classes/modules, adding positional arguments to a function/method/class, and changing the number of items a function/method returns. As a developer, based on the report this generates, I should have a greater understanding of the code-level implications this version change will introduce, and so of the time required to integrate it.
Elsewhere, we use the C++ abi-compliance-checker and are looking at the Java api-compliance-checker to help with this process. Is there a similar tool available for Python? I have found plenty of lint/analysis/refactor tools but nothing that provides this level of functionality. I understand that Python's dynamic typing will make a comprehensive report impossible.
If such a tool does not exist, are there any libraries that could help with implementing a solution? For example, my current approach would be to use an ast.NodeVisitor to traverse the package and build a tree where each node represents a module/class/method/function, and then compare this tree to that of another version of the same package.
Edit: since posting the question I have found pysdiff, which covers some of my requirements, but I'm still interested to see alternatives.
Edit: I also found Upstream-Tracker, which is a good example of the sort of information I'd like to end up with.
What about using the AST module to parse the files?
import ast

with open("test.py") as f:
    python_src = f.read()

node = ast.parse(python_src)  # Note: doesn't compile the src
print ast.dump(node)
There's the walk() function in the ast module (described at http://docs.python.org/2/library/ast.html).
The astdump package might work (available on PyPI).
There's also this out-of-date pretty printer: http://code.activestate.com/recipes/533146-ast-pretty-printer/
The documentation tool Sphinx also extracts the information you are looking for. Perhaps give that a look.
So walk the AST and build a tree with the information you want in it. Once you have a tree you can pickle it and diff it later, or convert the tree to a text representation in a text file you can diff with difflib or some external diff program.
The ast module gives you parse(); the built-in compile() can also accept the resulting node. The only thing is I'm not entirely sure how much information is available to you after parsing alone (as you don't want to compile()).
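As a starting point, an ast.NodeVisitor along these lines (an untested sketch, Python 2 style to match the above) could collect function names and argument lists for later comparison:

import ast

class ApiCollector(ast.NodeVisitor):
    """Collect (function name, argument names) pairs from a module's source."""
    def __init__(self):
        self.signatures = []

    def visit_FunctionDef(self, node):
        # In the Python 2 AST, positional args are Name nodes with an .id attribute.
        args = [a.id for a in node.args.args]
        self.signatures.append((node.name, args))
        self.generic_visit(node)  # descend into classes for methods, nested defs

with open("test.py") as f:
    tree = ast.parse(f.read())

collector = ApiCollector()
collector.visit(tree)
print collector.signatures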
Perhaps you can start by using the inspect module
import inspect
import types

def genFunctions(module):
    moduleDict = module.__dict__
    for name in dir(module):
        if name.startswith('_'):
            continue
        element = moduleDict[name]
        if isinstance(element, types.FunctionType):
            argSpec = inspect.getargspec(element)
            argList = argSpec.args
            print "{}.{}({})".format(module.__name__, name, ", ".join(argList))
That will give you a list of "public" (not starting with underscore) functions with their argument lists. You can add more stuff to print the kwargs, classes, etc.
Once you run that on all the packages/modules you care about, in both old and new versions, you'll have two lists like this:
myPackage.myModule.myFunction1(foo, bar)
myPackage.myModule.myFunction2(baz)
Then you can either just sort and diff them, or write some smarter tooling in Python to actually compare all the names, e.g. to permit additional optional arguments but reject new mandatory arguments.
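For example, a crude comparison of two such signature mappings might look like this (a sketch; old and new are assumed to be dicts mapping "module.function" to its list of argument names):

def compareApis(old, new):
    for name, oldArgs in sorted(old.items()):
        if name not in new:
            print "REMOVED: {}".format(name)
        elif new[name][:len(oldArgs)] != oldArgs:
            # Existing positional arguments were removed, renamed or reordered.
            print "CHANGED: {} {} -> {}".format(name, oldArgs, new[name])
    for name in sorted(set(new) - set(old)):
        print "ADDED: {}".format(name)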
Check out zope.interface (you can get it from PyPI). Then you can incorporate checks that modules support interfaces into your unit tests. It may take a while to retrofit, however, and it's not a silver bullet.
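A tiny sketch of that idea (the interface and class names here are invented for illustration):

from zope.interface import Interface, implementer
from zope.interface.verify import verifyClass

class IParser(Interface):
    def parse(text):
        """Parse text and return a tree."""

@implementer(IParser)
class MyParser(object):
    def parse(self, text):
        return text.split()

# Raises an exception if MyParser no longer satisfies the IParser contract.
verifyClass(IParser, MyParser)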

What is the advantage of using AST & Tree walker for translating a DSL instead of directly parsing & translating objects from parser grammar?

I am creating a Domain Specific Language using Antlr3. So far, I have directly translated the parsed objects from inside the parser grammar. Going through the examples of ASTs and tree walkers, I came to know that they are normally used to organize the parse into a hierarchical tree and translate objects from its nodes. Currently I am doing the same sort of thing in the parser grammar, where I translate objects from each sub-rule. I would be more than happy to know the advantages of using an AST and tree walker over just using a parser grammar. Thanks in advance.
One advantage of using tree parsers is that you can organize them into multiple passes. For some translation work I did I was able to use seven passes and separate logical steps into their own pass. One pass did expression analysis, one did control flow analysis, others used that analysis to eliminate dead code or to simplify the translation for special cases.
I personally like using tree grammars for the same reason I like using parsers for text grammars. It allows me to use rules to organize the parsing context. It's easy to do things like structure rules to recognize a top-level expression versus a subexpression if you need to distinguish between them for recognition purposes. All of the attribute and context management you use in regular parsers can apply to tree parsers.

Patterns for Python object transformation (encoding, decoding, deserialization, serialization)

I've been working with the parsing module construct and have found myself really enamored with the declarative nature of its data structures. For those of you unfamiliar with it, you can write Python code that essentially looks like what you're trying to parse, by nesting objects as you instantiate them.
Example:
ethernet = Struct("ethernet_header",
    Bytes("destination", 6),
    Bytes("source", 6),
    Enum(UBInt16("type"),
        IPv4 = 0x0800,
        ARP = 0x0806,
        RARP = 0x8035,
        X25 = 0x0805,
        IPX = 0x8137,
        IPv6 = 0x86DD,
    ),
)
While construct doesn't really support storing values inside this structure (you can either build an abstract Container into a bytestream or parse a bytestream into an abstract Container), I want to extend the framework so that parsers can also store values as they go, so that they can be accessed in dot notation, e.g. ethernet.type.
But in doing so, I think the best possible solution here would be a generic method of writing encoding/decoding mechanisms, so that you could register an encode/decode mechanism and be able to produce a variety of outputs from the abstract data structure (the parser itself), as well as from the output of the parser.
To give you an example, by default when you run an ethernet packet through the parser, you end up with something like a dict:
Container(name='ethernet_header',
    destination='\x01\x02\x03\x04\x05\x06',
    source='\x01\x02\x03\x04\x05\x06',
    type=IPX
)
I don't want to have to parse things twice - I would ideally like to have the parser generate the 'target' object/strings/bytes in a configurable manner.
The root of the idea is that you could have various 'plugins' registered for consuming or processing structures, so that you could programmatically generate XML or Graphviz diagrams as well as being able to convert from bytes to Python dicts. The crux of the task is: walk a tree of nodes and, based on the encoder/decoder, transform and return the transformed objects.
So the question is essentially - what patterns are best suited for this purpose?
codec style:
I've looked at the codecs module, which is fairly elegant in that you create encoding mechanisms, register that your class can encode things and you can specify the specific encoding you want on the fly.
'blah blah'.encode('utf8')
serdes (serializer, deserializer):
There are a couple of examples of existing serdes modules for Python; off the top of my head, JSON springs to mind. The problem with it is that it's very specific and doesn't support arbitrary formats easily. You can either encode or decode JSON and that's basically it. There's a variety of serdes constructed like this; some use the loads/dumps methods, some don't. It's a crapshoot.
objects = json.loads('{"a": 1, "b": 2}')
visitor pattern(?):
I'm not terribly familiar with the visitor pattern, but it does seem to have some mechanisms that might be applicable. The idea being (if I understand it correctly) that you'd set up a visitor for nodes, and it'd walk the tree and apply some transformation (and return the new objects?). I'm hazy here; I've put a rough sketch of what I imagine after this list.
other?:
Are there other mechanisms that might be more pythonic or already be written? I considered perhaps using ElementTree and subclassing Elements - but I wanted to consult stackoverflow before doing something daft.
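To make the visitor idea above concrete, here is the kind of thing I imagine (a hand-rolled sketch, not taken from any library; it assumes each node exposes name, value and children attributes):

class XmlVisitor(object):
    """Walk a parsed tree and render it as a crude XML string."""
    def visit(self, node):
        # Dispatch on the node's class name, falling back to a generic handler.
        method = getattr(self, 'visit_' + type(node).__name__, self.generic_visit)
        return method(node)

    def visit_Container(self, node):
        inner = ''.join(self.visit(child) for child in node.children)
        return '<%s>%s</%s>' % (node.name, inner, node.name)

    def generic_visit(self, node):
        return '<%s>%s</%s>' % (node.name, node.value, node.name)

A Graphviz or dict 'plugin' would then just be another visitor class run against the same tree.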

Pythonic way to ID a mystery file, then call a filetype-specific parser for it? Class creation q's

(note) I would appreciate help generalizing the title. I am sure that this is a class of problems in OO land that probably has a reasonable pattern; I just don't know a better way to describe it.
I'm considering the following -- our server script will be called by an outside program and have a bunch of text dumped at it (usually XML).
There are multiple possible types of data we could be getting, and multiple versions of the data representation we could be getting, e.g. "Report Type A, version 1.2" vs. "Report Type A, version 2.0"
We will generally want to do the same thing with all the data -- namely, determine what sort and version it is, then parse it with a custom parser, then call a synchronize-to-database function on it.
We will definitely be adding types and versions as time goes on.
So, what's a good design pattern here? I can come up with two, but both seem like they may have some problems.
Option 1
Write a monolithic ID script which determines the type, and then imports and calls the properly named class functions.
Benefits
Probably pretty easy to debug.
Only one file that does the parsing.
Downsides
Seems hack-ish.
It would be nice to not have to duplicate knowledge of the data formats in two places: once for ID, once for the actual merging.
Option 2
Write an "ID" function for each class; returns Yes / No / Maybe when given identifying text.
the ID script now imports a bunch of classes, instantiates them on the text and asks if the text and class type match.
Upsides:
Cleaner in that everything lives in one module?
Downsides:
Slower? Depends on logic of running through the classes.
Put abstractly: should Python instantiate a bunch of classes and consume an ID function, or should Python instantiate one (or many) ID classes which each have a paired item class, or do it some other way?
You could use the Strategy pattern which would allow you to separate the logic for the different formats which need to be parsed into concrete strategies. Your code would typically parse a portion of the file in the interface and then decide on a concrete strategy.
As far as defining the grammar for your files I would find a fast way to identify the file without implementing the full definition, perhaps a header or other unique feature at the beginning of the document. Then once you know how to handle the file you can pick the best concrete strategy for that file handling the parsing and writes to the database.
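A minimal sketch of that idea (every name here is invented for illustration; the header check, parsing and database code are stand-ins, and incoming_text is whatever the server receives):

class ReportStrategy(object):
    registry = []

    @classmethod
    def register(cls, strategy):
        cls.registry.append(strategy)
        return strategy

    @classmethod
    def for_text(cls, text):
        # Cheap identification pass: the first strategy whose header check matches wins.
        for strategy in cls.registry:
            if strategy.matches(text):
                return strategy()
        raise ValueError("no parser registered for this document")

@ReportStrategy.register
class ReportA_v12(ReportStrategy):
    @staticmethod
    def matches(text):
        return '<report type="A" version="1.2"' in text

    def parse(self, text):
        raise NotImplementedError("format-specific parsing goes here")

    def sync_to_database(self, parsed):
        raise NotImplementedError("shared or format-specific persistence goes here")

strategy = ReportStrategy.for_text(incoming_text)
strategy.sync_to_database(strategy.parse(incoming_text))

Adding a new report type or version is then just one more registered class, and the identification logic never has to be written in a second place.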
