XML processing options in Python

I am currently writing a non-web program, mostly in Python, and have reached the point where I need to create a structure for save and settings files.
I decided to use XML rather than create my own format, but I've hit a barrier trying to figure out what to use to actually process the XML. I'd like to hear some pros and cons of the packages available for Python, since there are a lot of them and I can't tell which is best just from looking at the documentation.
Basically, I want to know whether there is a package or library that will let me read and write data in an XML file with relative ease, knowing only the name of the tag I'm looking for.
P.S. My app is mostly geared to be used on Linux if it makes any difference.

If your data is only for the use of your Python programs, pickle might be an easier solution.
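A minimal sketch of the pickle approach, assuming the settings live in a plain dict. The file name and the keys are made up for illustration. Keep in mind that pickle files are not human-readable and should only be loaded from sources you trust.

```python
import os
import pickle
import tempfile

# Hypothetical settings; any picklable Python object works.
settings = {"volume": 7, "fullscreen": True, "recent_files": ["a.sav", "b.sav"]}

# A throwaway location for the example; a real app would pick its own path.
path = os.path.join(tempfile.mkdtemp(), "settings.pkl")

# Save: pickle needs a file opened in binary mode.
with open(path, "wb") as f:
    pickle.dump(settings, f)

# Load: only unpickle files you created yourself.
with open(path, "rb") as f:
    restored = pickle.load(f)
```

The types survive the round trip (the int stays an int, the list stays a list), which is the main convenience over a text format.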

Don't use XML -- use YAML or JSON -- some would argue XML has outlived its useful life...
That being said, I like ElementTree:
import xml.etree.ElementTree as ET  # cElementTree was removed in Python 3.9
fname = 'foo.xml'
tree = ET.parse(fname)  # parse() returns an ElementTree, not the root element
root = tree.getroot()
... yada yada...
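For the "read and write by tag name" part of the question, here is a rough sketch using the standard-library ElementTree. The helper names save_settings and load_setting, the flat one-tag-per-setting layout, and the sample keys are all assumptions made for illustration, not anything the library prescribes.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

def save_settings(path, settings):
    """Write a flat dict as <settings><key>value</key>...</settings>."""
    root = ET.Element("settings")
    for name, value in settings.items():
        ET.SubElement(root, name).text = str(value)
    ET.ElementTree(root).write(path, encoding="utf-8", xml_declaration=True)

def load_setting(path, name, default=None):
    """Return the text of the first <name> child of the root, or default."""
    root = ET.parse(path).getroot()
    return root.findtext(name, default)

# Hypothetical usage with a throwaway file.
path = os.path.join(tempfile.mkdtemp(), "settings.xml")
save_settings(path, {"username": "alice", "volume": 7})
username = load_setting(path, "username")
volume = int(load_setting(path, "volume"))  # values come back as strings
```

Everything is stored as text, so the caller has to convert numbers and booleans back on the way in. That is one of the trade-offs against JSON or pickle.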

Related

What Python accessible tools can you use to generate XSD from an XML document?

I'm looking for a tool that will play nicely with Python.
Except for my Python requirement, my question is the same as this one:
"I am looking for a tool which will take an XML instance document and output a corresponding XSD schema."

According to the PyCharm docs, PyCharm has a facility for this. This is not exactly accessible by a program as an API. You are probably better off using XML Schema Learner as a separate program since it is a command line program (subprocess friendly!).

Are you looking for something like pyxsd? (primarily used for validation against a schema) Or maybe PyXB? (can generate classes based on xml) Otherwise, I don't think there's a tool [yet] that will generate the schema from within Python. Can you do it on demand using something like xsd.exe? Does it have to be programmatic/repeatable?

Currently, there is no module that will run within your Python program and do this conversion. But I see creating an XSD schema from XML as a tooling problem. It's the kind of functionality that I'll use once, to get a schema started, but after that I'll be maintaining the schema myself. From a single XML file, the XSD generator can only create a starting point for a real schema; it cannot infer all the functionality and options offered by XSD.
Basically, I don't see the need to have this conversion run as a module inside of my code, generating new XSDs every time the XML changes. After all, it's the schema that defines the XML not the other way around.
As end-user pointed out you could use xsd.exe but you might also want to look at other tools such as trang (a bit old) for Java and stylusstudio (XML tool).

Generating Python code from XML tree

What is the best way in Python to read XML from a file, build a tree, do some rewriting in the tree and generate Python code? (I'm slightly confused as in Python there seem to be multiple options such as expat, ElementTree, XSLT, ...).

You should check out lxml. It is easy to use and does what you want in a few steps. If you want to stick to the stdlib, check out ElementTree (Python 2.5+). The type of XML processing you want depends on your needs; for high-performance XML parsing, read this.
EDIT: My answer is for XML parsing with Python and does not answer your last question: "generate Python code", cause that makes no sense :)
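The read, build-a-tree, and rewrite steps might look like this with the standard-library ElementTree. This is only a sketch: the input document, the tag names, and the particular rewrite are invented for illustration, and lxml offers a largely compatible API for the same calls.

```python
import xml.etree.ElementTree as ET

# Hypothetical input; for a file, use ET.parse("some_file.xml").getroot().
source = "<config><item name='a'>1</item><item name='b'>2</item></config>"
root = ET.fromstring(source)

# Rewrite the tree in place: rename the tag, change the text, add an attribute.
# list() takes a snapshot so renaming tags cannot disturb the iteration.
for item in list(root.iter("item")):
    item.tag = "entry"
    item.text = str(int(item.text) * 10)
    item.set("checked", "yes")

# Serialize the rewritten tree back to a string.
result = ET.tostring(root, encoding="unicode")
```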

I use Beautiful Soup. It's relatively easy to use and powerful.
Check it out here

Parsing XML - right scripting languages / packages for the job?

I know that any language is capable of parsing XML; I'm really just looking for advantages or drawbacks that you may have come across in your own experiences. Perl would be my standard go to here, but I'm open to suggestions.
Thanks!
UPDATE: I ended up going with XML::Simple which did a nice job, but I have one piece of advice if you plan to use it--research the forcearray option first. I had to rewrite a bunch of statements after learning that it is usually best practice to set forcearray. This page had the clearest explanation that I could find. Frankly, I'm surprised this isn't the default behavior.

If you are using Perl then I would recommend XML::Simple:
As more and more Web sites begin using XML for their content, it's increasingly important for Web developers to know how to parse XML data and convert it into different formats. That's where the Perl module called XML::Simple comes in. It takes away the drudgery of parsing XML data, making the process easier than you ever thought possible.

XML::Twig is very nice, especially because it’s not as awfully verbose as some of the other options.

For pure XML parsing, I wouldn't use Java, C#, C++, C, etc. They tend to overcomplicate things, as in you want a banana and get the gorilla with it as well.
Higher-level and interpreted languages such as Perl, PHP, Python, Groovy are more suitable. Perl is included in virtually every Linux distro, as is PHP for the most part.
I've used Groovy recently for exactly this and found it very easy. Mind you, though, that a C parser will be orders of magnitude faster than Groovy, for instance.

It's all going to be in the libraries.
Python has great libraries for XML. My preference is lxml. It uses libxml2/libxslt so it's fast, but the Python bindings make it really easy to use. Perl may very well have equally awesome OO libraries.

I saw that people recommend XML::Simple if you decide on Perl.
While XML::Simple is, indeed, very simple to use and great, it loads the whole document into memory, as a DOM parser does. As such, it is, sadly, completely unsuitable for processing large XML files, as your process would run out of memory (it's a common problem for any DOM parser, not limited to XML::Simple or Perl).
So, for large files, you must pick a SAX parser in whichever language you choose. There are many XML SAX parsers in Perl, or you can use another stream parser like XML::Twig, which is even better than a standard SAX parser. (Can't speak for other languages.)

Not exactly a scripting language, but you could also consider Scala. You can start from here.

Scala's XML support is rather good, especially as XML can just be typed directly into Scala programs.
Microsoft also did some cool integrated stuff with their LINQ for XML
But I really like Elementtree and just that package alone is a good reason to use Python instead of Perl ;)
Here's an example:
import xml.etree.ElementTree as ET  # in the standard library since Python 2.5
# build a tree structure
root = ET.Element("html")
head = ET.SubElement(root, "head")
title = ET.SubElement(head, "title")
title.text = "Page Title"
body = ET.SubElement(root, "body")
body.set("bgcolor", "#ffffff")
body.text = "Hello, World!"
# wrap it in an ElementTree instance, and save as XML
tree = ET.ElementTree(root)
tree.write("page.xhtml")

It's not a scripting language, but Scala is great for working with XML natively. Also, see this book (draft) by Burak.

Python has some pretty good support for XML. From the standard library DOM packages to much more 'pythonic' libraries that parse XML directly into more usable object structures.
There isn't really a 'right' language though... there are good XML packages for most languages nowadays.

If you're going to use Ruby to do it then you're going to want to take a look at Nokogiri or Hpricot. Both have their strengths and weaknesses. The language and package selection really comes down to what you want to do with the data after you've parsed it.

Reading Data out of XML files is dead easy with C# and LINQ to XML!
Somehow, although I really love python, I found it hard to parse XML with the standard libraries.

I would say it depends like everything else. VB.NET 2008 uses XML literals, has IntelliSense for LINQ to XML, and a few power toys that help turn XML into XSD. So personally, if you are working in a .NET environment I think this is the best choice.

Validating XML in Python without non-python dependencies

I'm writing a small Python app for distribution. I need to include simple XML validation (it's a debugging tool), but I want to avoid any dependencies on compiled C libraries such as lxml or pyxml as those will make the resulting app much harder to distribute. I can't find anything that seems to fit the bill - for DTDs, Relax NG or XML Schema. Any suggestions?

Do you mean something like MiniXsv? I have never used it, but from the website, we can read that
minixsv is a lightweight XML schema validator package written in pure Python (at least Python 2.4 is required).
so, it should work for you.
I believe that ElementTree could also be used for that goal, but I am not 100% sure.
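To be precise about ElementTree: the standard-library module does not validate against a DTD, Relax NG, or XML Schema. It can only tell you whether a document is well-formed. A minimal pure-Python sketch of that weaker check follows; is_well_formed is a made-up helper name, not part of the library.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """Return (True, None) if xml_text parses, else (False, error message)."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        # The message includes the line and column of the first problem.
        return False, str(err)
    return True, None

ok, error = is_well_formed("<a><b/></a>")
bad, bad_error = is_well_formed("<a><b></a>")  # <b> is never closed
```

For a debugging tool this may already catch the most common mistakes, but it says nothing about whether the document matches a schema.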

Why don't you try invoking an online XML validator and parsing the results?
I couldn't find any free REST or SOAP based services, but it would be easy enough to use a normal HTML form based one such as this one or this one. You just need to construct the correct request and parse the results (httplib, renamed http.client in Python 3, may be of help here if you don't want to use a third-party library such as mechanize to ease the pain).

I haven't looked at it in a while so it might have changed, but Fourthought's 4Suite (community edition) could still be pure Python.
Edit: just browsed through the docs and it's mostly Python which probably isn't enough for you.

The beautifulsoup module is pure Python (and a single .py file), but I don't know if it will fulfill your validation needs, I've only used it for quickly extracting some fields on large XMLs.
http://www.crummy.com/software/BeautifulSoup/

Import XML into SQL database

I'm working with a 20 gig XML file that I would like to import into a SQL database (preferably MySQL, since that is what I am familiar with). This seems like it would be a common task, but after Googling around a bit I haven't been able to figure out how to do it. What is the best way to do this?
I know this ability is built into MySQL 6.0, but that is not an option right now because it is an alpha development release.
Also, if I have to do any scripting I would prefer to use Python because that's what I am most familiar with.
Thanks.

You can use the getiterator() function (spelled iter() since Python 2.7; getiterator() itself was removed in Python 3.9) to iterate over the XML file without parsing the whole thing at once. You can do this with ElementTree, which is included in the standard library, or with lxml.
for record in root.iter('record'):  # root.getiterator('record') on old versions
    add_element_to_database(record)  # Depends on your database interface.
                                     # I recommend SQLAlchemy.

Take a look at the iterparse() function from ElementTree or cElementTree (I guess cElementTree would be best if you can use it)
This piece describes more or less what you need to do: http://effbot.org/zone/element-iterparse.htm#incremental-parsing
This will probably be the most efficient way to do it in Python. Make sure not to forget to call .clear() on the appropriate elements; you really don't want to build an in-memory tree of a 20 GB XML file. The .getiterator() method described in another answer is slightly simpler, but it requires the whole tree first. I assume that the poster actually had iterparse() in mind as well.
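A rough sketch of the iterparse() plus clear() pattern, assuming the file is a flat list of record elements under one root. The tag names, the sample data, and the iter_records helper are all invented for illustration; the database insert is left out because it depends on your interface.

```python
import io
import xml.etree.ElementTree as ET

def iter_records(source, tag="record"):
    """Yield one dict per <tag> element without keeping the whole tree."""
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event == "end" and elem.tag == tag:
            # The element is complete here, so its children can be read.
            yield {child.tag: child.text for child in elem}
            root.clear()  # drop processed records so memory use stays flat

# Hypothetical stand-in for the 20 GB file; a filename works here too.
sample = io.BytesIO(
    b"<data><record><id>1</id><name>a</name></record>"
    b"<record><id>2</id><name>b</name></record></data>"
)
rows = list(iter_records(sample))
```

In the real import you would not build the rows list. You would feed each yielded dict (or a batch of them) to your database layer as it arrives.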

I've done this several times with Python, but never with such a big XML file. ElementTree is an excellent XML library for Python that would be of assistance. If it was possible, I would divide the XML up into smaller files to make it easier to load into memory and parse.

It may be a common task, but maybe 20GB isn't as common with MySQL as it is with SQL Server.
I've done this using SQL Server Integration Services and a bit of custom code. Whether you need either of those depends on what you need to do with 20GB of XML in a database. Is it going to be a single column of a single row of a table? One row per child element?
SQL Server has an XML datatype if you simply want to store the XML as XML. This type allows you to do queries using XQuery, allows you to create XML indexes over the XML, and allows the XML column to be "strongly-typed" by referring it to a set of XML schemas, which you store in the database.

The MySQL documentation does not seem to indicate that XML import is restricted to version 6. It apparently works with 5, too.
