Validating XML in Python without non-Python dependencies

I'm writing a small Python app for distribution. I need to include simple XML validation (it's a debugging tool), but I want to avoid any dependencies on compiled C libraries such as lxml or pyxml as those will make the resulting app much harder to distribute. I can't find anything that seems to fit the bill - for DTDs, Relax NG or XML Schema. Any suggestions?

Do you mean something like MiniXsv? I have never used it, but its website says:
"minixsv is a lightweight XML schema validator package written in pure Python (at least Python 2.4 is required)."
So it should work for you.
I believe that ElementTree could also be used for that goal, but I am not 100% sure.
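For reference, the standard library's xml.etree.ElementTree only checks that a document is well-formed; it does not validate against a DTD, Relax NG, or XML Schema. A minimal pure-Python sketch of that well-formedness check follows (the helper name is_well_formed is made up for illustration, not part of any library):

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return (True, None) if xml_text parses, else (False, error message)."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as exc:
        # The exception message describes the first problem the parser hit.
        return False, str(exc)
    return True, None


ok, error = is_well_formed("<root><item>1</item></root>")
bad, bad_error = is_well_formed("<root><item>1</root>")
```

This may be enough for a debugging tool that only needs to catch broken markup, but it will not catch a document that is well-formed yet violates its schema.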

Why don't you try invoking an online XML validator and parsing the results?
I couldn't find any free REST or SOAP based services, but it would be easy enough to use a normal HTML form based one such as this one or this one. You just need to construct the correct request and parse the results (httplib may be of help here if you don't want to use a third-party library such as mechanize to ease the pain).

I haven't looked at it in a while so it might have changed, but Fourthought's 4Suite (community edition) could still be pure Python.
Edit: just browsed through the docs and it's mostly Python which probably isn't enough for you.

The beautifulsoup module is pure Python (and a single .py file), but I don't know if it will fulfill your validation needs; I've only used it to quickly extract some fields from large XML files.
http://www.crummy.com/software/BeautifulSoup/

Related

xml processing options in python

I am currently writing a non-web program using mostly python and have gotten to the point where I need to create a structure for save and settings files.
I decided to use xml over creating my own format but have come into a bit of a barrier when trying to figure out what to use to actually process the xml. I want to know if I can get some of the pros and cons about packages that are available for python since there are a lot and I'm not entirely sure which ones seem best from just looking at the documentation.
I basically want to know if there is a package or library that will let me write and read data from an xml file with relative ease by just knowing what the tag name I'm looking for is.
P.S. My app is mostly geared to be used on Linux if it makes any difference.
If your data is only for the use of your Python programs, pickle might be an easier solution.
Don't use xml -- use yaml or json -- some would argue xml has outlived its useful life...
That being said, I like ElementTree:
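import xml.etree.ElementTree as ET  # cElementTree was removed in Python 3.9
fname = 'foo.xml'
tree = ET.parse(fname)  # parse() returns an ElementTree, not the root element
root = tree.getroot()
# ... work with root here ...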
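As a rough sketch of what reading and writing settings by tag name could look like with ElementTree (the tag names settings, volume, theme, and language are invented for this example, and a string stands in for the save file):

```python
import xml.etree.ElementTree as ET

# Build a small settings document (tag names here are invented).
root = ET.Element("settings")
ET.SubElement(root, "volume").text = "7"
ET.SubElement(root, "theme").text = "dark"

# Serialize to a string; ET.ElementTree(root).write(path) would write a file.
xml_text = ET.tostring(root, encoding="unicode")

# Read it back and look values up by tag name.
loaded = ET.fromstring(xml_text)
volume = int(loaded.findtext("volume"))
theme = loaded.findtext("theme")
# findtext() returns the default when the tag is absent.
language = loaded.findtext("language", default="en")
```

All values come back as strings, so anything numeric has to be converted by hand, as with volume here.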

Generating Python code from XML tree

What is the best way in Python to read XML from a file, build a tree, do some rewriting in the tree and generate Python code? (I'm slightly confused as in Python there seem to be multiple options such as expat, ElementTree, XSLT, ...).
You should check out lxml. It is easy to use and does what you want in a few steps. If you want to stick to the stdlib, then check out ElementTree (Python 2.5+). The type of XML processing you want depends on your needs; for high-performance XML parsing, read this.
EDIT: My answer is for XML parsing with Python and does not answer your last question: "generate Python code", cause that makes no sense :)
I use Beautiful Soup. It's relatively easy to use and powerful.
Check it out here

Which XML library for what purposes?

A search for "python" and "xml" returns a variety of libraries for combining the two.
This list is probably faulty:
xml.dom
xml.etree
xml.sax
xml.parsers.expat
PyXML
beautifulsoup?
HTMLParser
htmllib
sgmllib
Be nice if someone can offer a quick summary of when to use which, and why.
The DOM/SAX divide is a basic one. It applies not just to python since DOM and SAX are cross-language.
DOM: read the whole document into memory and manipulate it.
Good for:
complex relationships across tags in the markup
small intricate XML documents
Cautions:
Easy to use excessive memory
SAX: parse the document while you read it.
Good for:
Long documents or open ended streams
places where memory is a constraint
Cautions:
You'll need to code a stateful parser, which can be tricky
beautifulsoup:
Great for HTML or not-quite-well-formed markup. Easy to use and fast. Good for screen scraping, etc. It can work with markup where the XML-based ones would just throw an error saying the markup is incorrect.
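The DOM/SAX difference described above can be sketched with two standard library modules, xml.dom.minidom and xml.sax. The document and the BookHandler class are made up for illustration:

```python
import xml.sax
from xml.dom import minidom

DOC = "<library><book>Dune</book><book>Emma</book></library>"

# DOM: the whole document is loaded into a tree that can be navigated freely.
dom = minidom.parseString(DOC)
dom_titles = [node.firstChild.data for node in dom.getElementsByTagName("book")]


class BookHandler(xml.sax.ContentHandler):
    """SAX: events arrive in document order, so the handler tracks its own state."""

    def __init__(self):
        super().__init__()
        self.in_book = False
        self.titles = []

    def startElement(self, name, attrs):
        if name == "book":
            self.in_book = True
            self.titles.append("")

    def characters(self, content):
        # Text may arrive in several chunks, so accumulate it.
        if self.in_book:
            self.titles[-1] += content

    def endElement(self, name):
        if name == "book":
            self.in_book = False


handler = BookHandler()
xml.sax.parseString(DOC.encode("utf-8"), handler)
sax_titles = handler.titles
```

Both approaches collect the same titles; the SAX version needs the in_book flag because it never holds the whole document at once, which is the "stateful parser" caution in practice.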
Most of the rest I haven't used, but I don't think there's hard and fast rules about when to use which. Just your standard considerations: who is going to maintain the code, which APIs do you find most easy to use, how well do they work, etc.
In general, for basic needs, it's nice to use the standard library modules since they are "standard" and thus available and well known. However, if you need to dig deep into something, almost always there are newer nonstandard modules with superior functionality outside of the standard library.
I find xml.etree essentially sufficient for everything, except that I reach for BeautifulSoup if I ever need to parse broken XML (not a common problem, unlike broken HTML, which is everywhere and which BeautifulSoup also helps with). xml.etree has reasonable support for reading entire XML docs into memory, navigating them, creating them, and incrementally parsing large docs. lxml supports the same interface and is generally faster, which is useful for pushing performance when you can afford to install third-party Python extensions (e.g. on App Engine you can't, but xml.etree is still there, so you can run exactly the same code). lxml also has more features, and offers BeautifulSoup too.
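A small sketch of the incremental parsing mentioned above, using xml.etree.ElementTree.iterparse. The in-memory document and the entry tag are assumptions standing in for a genuinely large file:

```python
import io
import xml.etree.ElementTree as ET

# Stand-in for a large file; a real program would pass a filename or file object.
source = io.BytesIO(b"<log><entry>a</entry><entry>b</entry><entry>c</entry></log>")

count = 0
for event, elem in ET.iterparse(source, events=("end",)):
    if elem.tag == "entry":
        count += 1
        # Drop the element's text and children once it has been processed.
        elem.clear()
```

Clearing each element after use keeps memory roughly flat while the parser works through the document.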
The other libs you mention mimic APIs designed for very different languages, and in general I see no reason to contort Python into those gyrations. If you have very specific needs such as support for xslt, various kinds of validations, etc, it may be worth looking around for other libraries yet, but I haven't had such needs in a long time so I'm not current wrt the offerings for them.
For many problems you can get by with the xml package. It has the major advantage of being part of the standard library. This means that it is pre-installed on almost every system and that the interface will be stable. It is not the best, or the fastest, but it is there.
For everything else there is lxml. Specifically, lxml is best for parsing broken HTML, xHTML, or suspect feeds. It uses libxml2 and libxslt to handle XPath, XSLT, and EXSLT. The tutorial is clear and the interface is simple and straightforward. The rest of the libraries mentioned exist because lxml was not available in its current form.
This is my opinion.
I don't do much with XML, but when I've needed to, lxml has been a joy to work with and is apparently quite fast. The element tree API is very nice in an object oriented setting.

What is a good (fastest, least broken, etc) way to implement JSON in Python?

There seem to be a handful of JSON libraries out there for Python even though Python has a built-in library. One even claims to be built according to the http://www.json.org spec (which caused me to think 'hmmm, is Python's built-in library not built fully to spec?'), so I find myself here to ask what others have found when trying out different libraries. Is there any difference?
I will be using it for a Django-based web AJAX API (I know there are Django apps for this, but I want to get at the root of this before I just grab an app).
The built in library is fine most of the time although occasionally you can get issues to do with character encoding.
There is cjson if you have performance issues to deal with.
Personally, I just use simplejson - for no particular reason.
Python < 2.6 did not include a json module. The presence of multiple JSON implementations says nothing about the quality of the built-in module and everything about the history of having no built-in json.
I suggest that your assumption (multiple implementations means low quality in the library) is false.
The built-in json module works perfectly well. If you have to use an earlier Python, use simplejson, a third-party module with exactly the same interface. These have the serialization interface you expect from Python and are widely used.
(simple)json by default has some very minor extensions of the JSON standard. You can read about these in the documentation for json and disable them if you want to for some reason.
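One such extension is that the encoder emits NaN and Infinity, which are not part of the JSON spec. A short sketch with the built-in json module, showing a plain round trip and how allow_nan=False turns that extension off (the sample data is made up):

```python
import json

data = {"name": "example", "values": [1, 2.5, None]}
text = json.dumps(data, sort_keys=True)
round_trip = json.loads(text)

# NaN is not part of the JSON spec, but the module emits it by default.
lenient = json.dumps(float("nan"))

# allow_nan=False makes the encoder reject such values instead.
try:
    json.dumps(float("nan"), allow_nan=False)
    strict_rejected = False
except ValueError:
    strict_rejected = True
```

Passing allow_nan=False is worth considering when the consumer is a strict parser, such as a browser's JSON.parse, which rejects NaN.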

Does any one know of an RTF report generator in Django?

I know about the PDF report generator, but I need to use RTF instead.
There is PyRTF but it hasn't been updated in a while.
If that doesn't work and you are willing to do some hacking then I can also point you to the GRAMPS Project that has an RTF report generator (look in gramps/docgen/RTFDoc.py). This code is very specific to their genealogy reporting needs, but it is clean and decently documented so could make a good starting point.
(can't comment yet so I post it as an answer)
There's a fork of PyRTF called pyrtf-ng, it's maintained and has Unicode support.
Documentation doesn't exist from what I've seen, but the API looks nice, and you can figure out a lot from the tests in the tests dir.
Windward Reports has a very nice RTF generator, and you create the template in Word, so it is very easy to use. (Disclaimer - I'm the CTO at Windward.) And yes, the Java engine is callable from Python.
