Crunching xml with python - python

I need to remove white spaces between xml tags, e.g. if the original xml looks like:
<node1>
<node2>
<node3>foo</node3>
</node2>
</node1>
I'd like the end-result to be crunched down to single line:
<node1><node2><node3>foo</node3></node2></node1>
Please note that I will not have control over the xml structure, so the solution should be generic enough to be able to handle any valid xml. Also the xml might contain CDATA blocks, which I'd need to exclude from this crunching and leave them as-is.
I have couple of ideas so far: (1) parse the xml as text and look for start and end of tags < and > (2) another approach is to load the xml document and go node-by-node and print out a new document by concatenating the tags.
I think either method would work, but I'd rather not reinvent the wheel here, so may be there is a python library that already does something like this? If not, then any issues/pitfalls to be aware of when rolling out my own cruncher? Any recommendations?
EDIT
Thank you all for answers/suggestions, both Triptych's and Van Gale's solutions work for me and do exactly what I want. Wish I could accept both answers.

This is pretty easily handled with lxml (note: this particular feature isn't in ElementTree):
from lxml import etree
parser = etree.XMLParser(remove_blank_text=True)
foo = """<node1>
<node2>
<node3>foo </node3>
</node2>
</node1>"""
bar = etree.XML(foo, parser)
print etree.tostring(bar,pretty_print=False,with_tail=True)
Results in:
<node1><node2><node3>foo </node3></node2></node1>
Edit: The answer by Triptych reminded me about the CDATA requirements, so the line creating the parser object should actually look like this:
parser = etree.XMLParser(remove_blank_text=True, strip_cdata=False)

I'd use XSLT:
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="xml" encoding="UTF-8" omit-xml-declaration="yes"/>
<xsl:strip-space elements="*"/>
<xsl:template match="*">
<xsl:copy>
<xsl:copy-of select="#*" />
<xsl:apply-templates />
</xsl:copy>
</xsl:template>
</xsl:stylesheet>
That should do the trick.
In python you could use lxml (direct link to sample on homepage) to transform it.
For some tests, use xsltproc, sample:
xsltproc test.xsl test.xml
where test.xsl is the file above and test.xml your XML file.

Pretty straightforward with BeautifulSoup.
This solution assumes it is ok to strip whitespace from the tail ends of character data.
Example: <foo> bar </foo> becomes <foo>bar</foo>
It will correctly ignore comments and CDATA.
import BeautifulSoup
s = """
<node1>
<node2>
<node3>foo</node3>
</node2>
<node3>
<!-- I'm a comment! Leave me be! -->
</node3>
<node4>
<![CDATA[
I'm CDATA! Changing me would be bad!
]]>
</node4>
</node1>
"""
soup = BeautifulSoup.BeautifulStoneSoup(s)
for t in soup.findAll(text=True):
if type(t) is BeautifulSoup.NavigableString: # Ignores comments and CDATA
t.replaceWith(t.strip())
print soup

Not a solution really but since you asked for recommendations: I'd advise against doing your own parsing (unless you want to learn how to write a complex parser) because, as you say, not all spaces should be removed. There are not only CDATA blocks but also elements with the "xml:space=preserve" attribute, which correspond to things like <pre> in XHTML (where the enclosed whitespaces actually have meaning), and writing a parser that is able to recognize those elements and leave the whitespace alone would be possible but unpleasant.
I would go with the parsing method, i.e. load the document and go node-by-node printing them out. That way you can easily identify which nodes you can strip the spaces out of and which you can't. There are some modules in the Python standard library, none of which I have ever used ;-) that could be useful to you... try xml.dom, or I'm not sure if you could do this with xml.parsers.expat.

Related

Python: Read and write namespaced XML using ElementTree

This XML file is named example.xml:
<?xml version="1.0"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
<modelVersion>14.0.0</modelVersion>
<groupId>.com.foobar.flubber</groupId>
<artifactId>uberportalconf</artifactId>
<version>13-SNAPSHOT</version>
<packaging>pom</packaging>
<name>Environment for UberPortalConf</name>
<description>This is the description</description>
<properties>
<birduberportal.version>11</birduberportal.version>
<promotiondevice.version>9</promotiondevice.version>
<foobarportal.version>6</foobarportal.version>
<eventuberdevice.version>2</eventuberdevice.version>
</properties>
<!-- A lot more here, but as it is irrelevant for the problem I have removed it -->
</project>
If I load the example.xml file above using ElementTree and print the root node:
>>> from xml.etree import ElementTree
>>> tree = ElementTree.parse('example.xml')
>>> print tree.getroot()
<Element '{http://maven.apache.org/POM/4.0.0}project' at 0x26ee0f0>
I see that Element also contains the namespace http://maven.apache.org/POM/4.0.0.
How do I:
Get the foobarportal.version text, increase it by one and write the XML file back while keeping the namespace the document had when loaded and also not change the overall XML layout.
Get it to load using any namespace, not just http://maven.apache.org/POM/4.0.0. I still don´t want to strip the namespace, as I want the XML to stay the same except for changing foobarportal.version as in 1 above.
The current way is not aware of XML but fulfills 1 and 2 above:
Grep for <foobarportal.version>(.*)</foobarportal.version>
Take the contents of the match group and i increase it by one
Write it back.
It would be nice to have an XML aware solution, as it would be more robust. The XML namespace handling of ElementTree is making it more complicated.
If your question is simply: "how do I search by a namespaced element name", then the answer is that lxml understands {namespace} syntax, so you can do:
tree.getroot().find('{http://maven.apache.org/POM/4.0.0}project')

How to split up a file by keyword?

I have a large XML file that looks like
<data> skdfnlsniisimsoinfsdfoisdfinsdofinodnfonf <emrosem> 23324097234097g </emrosem>
<peto> oifmisnie </peto>
</data>
<data> sfnseosfnosefoisneofinseionfoaisenfoisen <emrosem> 3249087203470w </emrosem>
<peto> sdfn </peto>
</data>
I want to separate this into a list that looks like
[<data> skdfnlsniisimsoinfsdfoisdfinsdofinodnfonf <emrosem> 23324097234097g </emrosem>
<peto> oifmisnie </peto></data>, <data> sfnseosfnosefoisneofinseionfoaisenfoisen
<emrosem> 3249087203470w </emrosem> <peto> sdfn </peto> </data>]
In other words, I want to split it based on the word "data".
I'm using python 2.7, thanks for the help.
The included XML Parser is one way to parse XML. It might be a bit kludgey to get data off of it and into a list with the tags intact but it should be doable.
Please don't use regular expressions for this. If you need to parse XML, use an XML parser. XML just has too many subtleties to handle it with simple string manipulation routines. For a nice explanation as to why, see the first answer to this question.

Preserve XML Tags - Python DOM Implementation

I have just finished skiming through the python DOM API and can't seem to find what I am looking for.
I basically want to preserve the XML tags when traversing through the DOM tree. The idea is to print the tag name and corresponding attributes which I later want to convert into an xml file.
<book name="bookname" source="/home/phiri/Book/book.xml"
xmlns:xi="http://www.w3.org/2001/XInclude">
<chapter>
<page>page1</page>
<page>page2</page>
</chapter>
<chapter>
<page>page1</page>
<page>page2</page>
<page>Page3</page>
</chapter>
</book>
Using the XML contents above for instance, what I want is for the result of the book.xml file to have.
<book name="bookname" source="/home/phiri/Book/book.xml"
xmlns:xi="http://www.w3.org/2001/XInclude">
<chapter></chapter>
<chapter></chapter>
</book>
Is there an alternative xml package I could use to preserve results I get when extracting contents using python?
A simple way to get the output you posted from the input is to override the XSLT identity transform. It looks like you want to eliminate all text nodes and all elements that have more than two ancestors, so you'd just add empty templates for those:
<xsl:template match="text()"/>
<xsl:template match="*[count(ancestor::*) > 2]"/>
Generally the best way to use XSLT in Python is with the libxml2 module. Unless you need a pure Python solution, in which case you're stuck not using XSLT, because nobody's built a pure Python XSLT processor yet.

ElementTree namespace incovenience

I can't control quality of XML that I get. In some cases it is:
<COLLADA xmlns="http://www.collada.org/2005/11/COLLADASchema" version="1.4.1">
...
</COLLADA>
in others I get:
<COLLADA>...</COLLADA>
and I guess I should also handle
<collada:COLLADA xmlns:collada="http://www.collada.org/2005/11/COLLADASchema">
...
</collada:COLLADA>
It's the same schema all over, and I only need one parser to process it. How can I handle all these cases? I need XPath and other lxml goodies to get through this. How do I make it consistent during etree.parse time? I don't want to check on namespaces every time I need to use XPath.
My usual recommendation is to preprocess it first, to normalize the namespaces. This has two benefits: the normalization code is highly reusable, because it doesn't depend on how the data is being processed subsequently; and the logic to process the data is considerably simplified.
If the documents only use this one namespace, or none, and do not use qualified names in the content of text or attribute nodes, then the transformation to achieve this normalization is very simple:
<xsl:template match="*">
<xsl:element name="local-name()" namespace="http://www.collada.org/2005/11/COLLADASchema">
<xsl:copy-of select="#*"/>
<xsl:apply-templates/>
</xsl:element>
</xsl:template>

How to add xml header to dom object

I'm using Python's xml.dom.minidom but I think the question is valid for any DOM parser.
My original file has a line like this at the beginning:
<?xml version="1.0" encoding="utf-8" standalone="yes"?>
This doesn't seem to be part of the dom, so when I do something like dom.toxml() the resulting string have not line at the beginning.
How can I add it?
example outpupt:
<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<Root xmlns:aid="http://xxxxxxxxxxxxxxxxxx">
<Section>BANDSAW BLADES</Section>
</Root>
hope to be clear.
This doesn't seem to be part of the dom
The XML Declaration doesn't get a node of its own, no, but the properties declared in it are visible on the Document object:
>>> doc= minidom.parseString('<?xml version="1.0" encoding="utf-8" standalone="yes"?><a/>')
>>> doc.encoding
'utf-8'
>>> doc.standalone
True
Serialising the document should include the standalone="yes" part of the declaration, but toxml() doesn't. You could consider this a bug, perhaps, but really the toxml() method doesn't make any promises to serialise the XML declaration in an appropriate way. (eg you don't get an encoding unless you specifically ask for it either.)
You could take charge of writing the document yourself:
xml= []
xml.append('<?xml version="1.0" encoding="utf-8" standalone="yes"?>')
for child in doc.childNodes:
xml.append(child.toxml())
but do you really need the XML Declaration here? You are using the default version and encoding, and since you have no DOCTYPE there can be no externally-defined entities, so the document is already standalone by nature. As per the XML standard: “if there are no external markup declarations, the standalone document declaration has no meaning”. It seems to me you could safely omit it completely.

Categories