How to check if the two XML files are equivalent with Python? - python

How to check if two XML files are equivalent?
For example, the two XML files below should count as the same even though the order of their elements differs. I need to check whether two XML files contain the same textual information, disregarding order.
<a>
<b>hello</b>
<c><d>world</d></c>
</a>
<a>
<c><d>world</d></c>
<b>hello</b>
</a>
Are there tools for this out there?

It all depends on your definition of "equivalent".
Assuming you really only care about the text nodes (for example: the d tags in your example do not even matter, you only care about the content word), you can just make a set of the text nodes of each document, and compare the sets. Using lxml, this could look like:
from lxml import etree
tree1 = etree.parse('example1.xml')
tree2 = etree.parse('example2.xml')
print(set(tree1.getroot().itertext()) == set(tree2.getroot().itertext()))
You might even want to ignore whitespace nodes, doing something like:
set(i for i in tree.getroot().itertext() if i.strip())
Note that using sets means you will NOT take into account how many times a piece of text occurs in the document (this might be what you want, or it might not). If the order is not important but the number of occurrences is, you could use a dictionary instead of a set and keep track of the number of occurrences (e.g. with collections.defaultdict(), or with collections.Counter, available since Python 2.7).
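The counting idea could be sketched as follows. This is only a minimal sketch, assuming that whitespace-only text nodes should be ignored and surrounding whitespace stripped; the helper name text_counts is made up for illustration, and it uses the standard library's xml.etree.ElementTree instead of lxml so that it runs without extra packages.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def text_counts(xml_text):
    """Count each non-blank text fragment in the document (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    return Counter(t.strip() for t in root.itertext() if t.strip())


doc1 = "<a><b>hello</b><c><d>world</d></c></a>"
doc2 = "<a><c><d>world</d></c><b>hello</b></a>"
doc3 = "<a><b>hello</b><b>hello</b><c><d>world</d></c></a>"

# Same fragments, different order: the counters are equal.
same_order_free = text_counts(doc1) == text_counts(doc2)
# "hello" occurs twice in doc3: the counters differ, but plain sets would not.
same_with_duplicate = text_counts(doc1) == text_counts(doc3)
same_as_sets = set(text_counts(doc1)) == set(text_counts(doc3))
```

Comparing Counter objects instead of sets is what makes the duplicated "hello" in doc3 visible.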
But if it is only the order of the direct child elements of the root element (in your case, children of the a element) that may be ignored, and everything inside those elements really counts, you would need another approach. You could for example do xml canonicalization on each child element to get a normalized version of each child (again, I don't know if this is normalized enough for your needs).
from lxml import etree
tree1 = etree.parse('example1.xml')
tree2 = etree.parse('example2.xml')
set1 = set(etree.tostring(i, method='c14n') for i in tree1.getroot())
set2 = set(etree.tostring(i, method='c14n') for i in tree2.getroot())
print(set1 == set2)
Note: to keep the example simple, I've used the development version of lxml. Older versions have no method='c14n' for etree.tostring(), only a c14n() method on the ElementTree that writes to a file-like object. To get it working there, you would have to copy each element into a tree of its own and use a StringIO() object as a dummy file.
Also, this way of doing it is probably not recommended with very large files.
But again: a BIG WARNING: you really have to know what you need as "equivalent", and create your own solution based on that knowledge!
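If lxml is not available, a similar comparison of the root's direct children might be sketched with the standard library. This sketch assumes Python 3.8 or later, where xml.etree.ElementTree.canonicalize() exists, and it is a stand-in for the lxml method='c14n' call above, not the same implementation; the helper name canonical_children is hypothetical.

```python
import copy
import xml.etree.ElementTree as ET


def canonical_children(xml_text):
    """Return the sorted canonical form of each direct child of the root."""
    root = ET.fromstring(xml_text)
    forms = []
    for child in root:
        clone = copy.copy(child)  # shallow copy, so the parsed tree is untouched
        clone.tail = None         # ignore whitespace that follows the child
        serialized = ET.tostring(clone, encoding="unicode")
        forms.append(ET.canonicalize(serialized))
    return sorted(forms)


doc1 = "<a>\n<b>hello</b>\n<c><d>world</d></c>\n</a>"
doc2 = "<a>\n<c><d>world</d></c>\n<b>hello</b>\n</a>"
doc3 = "<a>\n<b>hello</b>\n<c><d>WORLD</d></c>\n</a>"

children_match = canonical_children(doc1) == canonical_children(doc2)
children_differ = canonical_children(doc1) != canonical_children(doc3)
```

Sorted lists are compared here rather than sets, so a child that appears twice in one file and once in the other is detected.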

Ordering is important in XML, so the two files you provided are different. Normally you could normalize the XML and then simply compare the files as text, but if you want order-insensitive comparison, you will probably have to implement it yourself using one of the bazillion XML parsers out there (I would recommend lxml, by the way).
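A hand-rolled order-insensitive comparison could look roughly like this. It is a sketch under stated assumptions, not a general solution: text is compared after stripping surrounding whitespace, tail text is ignored entirely, and child order is ignored at every level. The names order_free_key and equivalent are invented for this example, and it uses the standard library parser rather than lxml.

```python
import xml.etree.ElementTree as ET


def order_free_key(elem):
    """Build a nested tuple that is the same for any ordering of the children."""
    children = sorted(order_free_key(child) for child in elem)
    text = (elem.text or "").strip()
    attributes = tuple(sorted(elem.attrib.items()))
    return (elem.tag, attributes, text, tuple(children))


def equivalent(xml_a, xml_b):
    """True if both documents have the same elements, ignoring child order."""
    return order_free_key(ET.fromstring(xml_a)) == order_free_key(ET.fromstring(xml_b))


doc1 = "<a>\n<b>hello</b>\n<c><d>world</d></c>\n</a>"
doc2 = "<a>\n<c><d>world</d></c>\n<b>hello</b>\n</a>"
doc3 = "<a>\n<c><d>hello</d></c>\n<b>world</b>\n</a>"

reordered_equal = equivalent(doc1, doc2)
swapped_text_equal = equivalent(doc1, doc3)
```

Unlike the set-of-text-nodes approach, this keeps track of which element each piece of text belongs to, so doc3 (same words, different elements) is not considered equivalent.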

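My solution is below. It walks both trees in parallel and compares the tag, text, tail, attributes, and children of every element. Note that children are compared in document order.
Some of the code is adapted from: Testing Equivalence of xml.etree.ElementTree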
import xml.etree.ElementTree as ET
def elements_equal(e1, e2):
    if e1.tag != e2.tag:
        return False
    # Differing text only counts when both sides actually have text
    # (a missing text, None, is treated as matching anything).
    if e1.text != e2.text:
        if e1.text is not None and e2.text is not None:
            return False
    if e1.tail != e2.tail:
        if e1.tail is not None and e2.tail is not None:
            return False
    if e1.attrib != e2.attrib:
        return False
    if len(e1) != len(e2):
        return False
    return all(elements_equal(c1, c2) for c1, c2 in zip(e1, e2))
def is_two_xml_equal(f1, f2):
    tree1 = ET.parse(f1)
    root1 = tree1.getroot()
    tree2 = ET.parse(f2)
    root2 = tree2.getroot()
    return elements_equal(root1, root2)
f1 = '2.xml'
f2 = '1.xml'
print(is_two_xml_equal(f1, f2))

Related

Biopython: return chain but with the new chain ID already

I have a script which extracts selected chains from a structure into a new file, and I run it for 400+ structures. Because the chain IDs of my selected chains can differ between structures, I parse .yaml files where I store the corresponding chain IDs. This script works fine, but the next step is to rename the chains so that they are the same in each file. I used code adapted from another post. Basically it worked as well; however, the problem is that e.g. my new chain ID for chain1 is the same as the original chain ID of chain2, and this error occurs: Cannot change id from U to T. The id T is already used for a sibling of this entity. This happened in many cases, and fixing it manually would be too complicated.
My idea is that this could be solved by renaming the chain IDs at the moment I extract them. Is that possible with Biopython? I couldn't find anything similar to my problem.
Simplified code for one structure (in the original one is one more loop for iterating over 400+ structures and its .yaml files):
with open(yaml_file, "r") as file:
    proteins = yaml.load(file, Loader=yaml.FullLoader)
chain1= proteins["1_chain"].split(",")[0] #just for illustration that I have to parse the original chainIDs
chain2= proteins["2_chain"].split(",")[0]
structure = parser.get_structure("xxx", "xxx.cif" )[0]
for model in structure:
    for chain in model:
        class ChainSelect(Select):
            def accept_chain(self, chain):
                if chain.get_id() == '{}'.format(chain1):
                    return True # I thought that somewhere in this part could be added command renaming the chain to "A"
                if chain.get_id() == '{}'.format(chain2):
                    return True #here I'd rename it "B"
                else:
                    return False
        io = MMCIFIO()
        io.set_structure(structure)
        io.save("new.cif" , ChainSelect())
Is it possible to somehow expand "return" command in a way that it would return the chain with desired chainID (e.g. A)? Note that the original chain ID can differ in the structures (thus I have to use .format(chainX))
I don't have any other idea how I'd get rid of the error that my desired chainID is already in sibling entity.
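One way around the collision, sketched here without Biopython so that the idea is visible on its own, is to rename in two passes: first move every selected chain to a temporary ID that cannot clash, then assign the final IDs. The dictionary below is only a stand-in for a model's chains, and the function name rename_two_pass and the tmp_ prefix are hypothetical. Whether Biopython and the output format accept such temporary IDs is an assumption that would need checking.

```python
def rename_two_pass(chains, mapping):
    """Rename keys of `chains` per `mapping` (old ID -> new ID) without clashes.

    `chains` is a plain dict standing in for a model's chains.
    """
    renamed = dict(chains)
    # Pass 1: move every chain that will be renamed to a unique temporary ID.
    for index, old_id in enumerate(mapping):
        renamed[f"tmp_{index}"] = renamed.pop(old_id)
    # Pass 2: move from the temporary IDs to the final IDs.
    for index, new_id in enumerate(mapping.values()):
        if new_id in renamed:
            raise ValueError(f"ID {new_id!r} belongs to a chain that is not being renamed")
        renamed[new_id] = renamed.pop(f"tmp_{index}")
    return renamed


# "U" should become "T" although "T" is still in use by another chain.
chains = {"U": "chain one data", "T": "chain two data"}
result = rename_two_pass(chains, {"U": "T", "T": "B"})
```

Because no final ID is assigned until every old ID has been vacated, the order in which the chains are processed no longer matters.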

Import XML namespace in python

I'm a total noob in coding; I study IT and have a school project in which I must convert a .txt file into an XML file. I have managed to create a tree and subelements, but I must put an XML namespace in the code, because the resulting XML file has to be opened in a program that shows the information as a table, among other things. Without the schema from the XML namespace it won't open anything. Can someone help me with how to put an .xsd in my code?
This is the schema:
http://www.pufbih.ba/images/stories/epp_docs/PaketniUvozObrazaca_V1_0.xsd
Example of the XML file I must create:
http://www.pufbih.ba/images/stories/epp_docs/4200575050089_1022.xml
In its first row is the schema that I must include: "urn:PaketniUvozObrazaca_V1_0.xsd"
This is the code I have created so far:
import xml.etree.ElementTree as xml
def GenerateXML(GIP1022):
    root=xml.Element("PaketniUvozObrazaca")
    p1=xml.Element("PodaciOPoslodavcu")
    root.append(p1)
    jib=xml.SubElement(p1,"JIBPoslodavca")
    jib.text="4254160150005"
    pos=xml.SubElement(p1,"NazivPoslodavca")
    pos.text="MOJATVRTKA d.o.o. ORAŠJE"
    zah=xml.SubElement(p1,"BrojZahtjeva")
    zah.text="8"
    datz=xml.SubElement(p1,"DatumPodnosenja")
    datz.text="2021-01-01"
    tree=xml.ElementTree(root)
    with open(GIP1022,"wb") as files:
        tree.write(files)
if __name__=="__main__":
    GenerateXML("primjer.xml")
The official documentation is not super explicit as to how one works with namespaces in ElementTree, but the core of it is that ElementTree takes a very fundamental(ist) approach: instead of manipulating namespace prefixes / aliases, elementtree uses Clark's Notation.
So e.g.
<bar xmlns="foo">
or
<x:bar xmlns:x="foo">
(the element bar in the foo namespace) would be written
{foo}bar
>>> from xml.etree.ElementTree import Element, SubElement, QName, tostring, register_namespace
>>> tostring(Element('{foo}bar'), encoding='unicode')
'<ns0:bar xmlns:ns0="foo" />'
alternatively (and sometimes more conveniently for authoring and manipulating) you can use QName objects which can either take a Clark's notation tag name, or separately take a namespace and a tag name:
>>> tostring(Element(QName('foo', 'bar')), encoding='unicode')
'<ns0:bar xmlns:ns0="foo" />'
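So while ElementTree doesn't have a namespace object per se, you can create namespaced objects like this, probably via a helper partially applying QName (here called ns and built with functools.partial):
>>> from functools import partial
>>> ns = partial(QName, 'urn:PaketniUvozObrazaca_V1_0.xsd')
>>> root = Element(ns("PaketniUvozObrazaca"))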
>>> SubElement(root, ns("PodaciOPoslodavcu"))
<Element <QName '{urn:PaketniUvozObrazaca_V1_0.xsd}PodaciOPoslodavcu'> at 0x7f502481bdb0>
>>> tostring(root, encoding='unicode')
'<ns0:PaketniUvozObrazaca xmlns:ns0="urn:PaketniUvozObrazaca_V1_0.xsd"><ns0:PodaciOPoslodavcu /></ns0:PaketniUvozObrazaca>'
Now there are a few important considerations here:
First, as you can see the prefix when serialising is arbitrary, this is in keeping with ElementTree's fundamentalist approach to XML (the prefix should not matter), but it has since grown a "register_namespace" global function which allows registering specific prefixes:
>>> register_namespace('xxx', 'urn:PaketniUvozObrazaca_V1_0.xsd')
>>> tostring(root, encoding='unicode')
'<xxx:PaketniUvozObrazaca xmlns:xxx="urn:PaketniUvozObrazaca_V1_0.xsd"><xxx:PodaciOPoslodavcu /></xxx:PaketniUvozObrazaca>'
You can also pass a single default_namespace to (some) serialization functions to specify the, well, default namespace (for tostring() this parameter requires Python 3.8 or later):
>>> tostring(root, encoding='unicode', default_namespace='urn:PaketniUvozObrazaca_V1_0.xsd')
'<PaketniUvozObrazaca xmlns="urn:PaketniUvozObrazaca_V1_0.xsd"><PodaciOPoslodavcu /></PaketniUvozObrazaca>'
A second, possibly larger, issue is that ElementTree does not support validation.
The Python standard library does not provide support for any validating parser or tree builder, whether for DTD, RELAX NG, XML Schema, or anything else. Not by default, and not optionally.
lxml is probably the main alternative supporting validation (of multiple types of schema), its core API follows ElementTree but extends it in multiple ways and directions (including much more precise namespace prefix support, and prefix round-tripping). But even then the validation is (AFAIK) mostly explicit, at least when generating / serializing documents.
What you want is to add a default namespace declaration (xmlns="urn:PaketniUvozObrazaca_V1_0.xsd") to the root element. I have edited the code in the question to show you how this can be done.
import xml.etree.ElementTree as ET
def GenerateXML(GIP1022):
    # Create the PaketniUvozObrazaca root element in the urn:PaketniUvozObrazaca_V1_0.xsd namespace
    root = ET.Element("{urn:PaketniUvozObrazaca_V1_0.xsd}PaketniUvozObrazaca")
    # Add subelements
    p1 = ET.Element("PodaciOPoslodavcu")
    root.append(p1)
    jib = ET.SubElement(p1,"JIBPoslodavca")
    jib.text = "4254160150005"
    pos = ET.SubElement(p1,"NazivPoslodavca")
    pos.text = "MOJATVRTKA d.o.o. ORAŠJE"
    zah = ET.SubElement(p1,"BrojZahtjeva")
    zah.text = "8"
    datz = ET.SubElement(p1,"DatumPodnosenja")
    datz.text = "2021-01-01"
    # Make urn:PaketniUvozObrazaca_V1_0.xsd the default namespace (no prefix)
    ET.register_namespace("", "urn:PaketniUvozObrazaca_V1_0.xsd")
    # Prettify output (requires Python 3.9 or later)
    ET.indent(root)
    tree = ET.ElementTree(root)
    with open(GIP1022,"wb") as files:
        tree.write(files)
if __name__=="__main__":
    GenerateXML("primjer.xml")
Contents of primjer.xml:
<PaketniUvozObrazaca xmlns="urn:PaketniUvozObrazaca_V1_0.xsd">
  <PodaciOPoslodavcu>
    <JIBPoslodavca>4254160150005</JIBPoslodavca>
    <NazivPoslodavca>MOJATVRTKA d.o.o. ORAŠJE</NazivPoslodavca>
    <BrojZahtjeva>8</BrojZahtjeva>
    <DatumPodnosenja>2021-01-01</DatumPodnosenja>
  </PodaciOPoslodavcu>
</PaketniUvozObrazaca>
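Note that only the root element is explicitly bound to a namespace in the code. The subelements are added without a namespace, and in the in-memory tree they stay that way. Because they are serialized without a prefix inside a root that declares a default namespace, the end result is an XML document (primjer.xml) in which all elements belong to the same default namespace once the file is parsed again.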
The above is not the only way to create an element in a namespace. For example, instead of the {namespace-uri}name notation, the QName class can be used. See https://stackoverflow.com/a/58678592/407651.
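A variant that keeps the in-memory tree consistent puts every element in the namespace explicitly and lets the serializer emit it as the default namespace. This is a sketch, assuming Python 3.8 or later (where tostring() accepts default_namespace); the helper ns is a hypothetical convenience function, and only a few of the elements from the question are included.

```python
import xml.etree.ElementTree as ET

NS = "urn:PaketniUvozObrazaca_V1_0.xsd"


def ns(tag):
    """Return the tag in Clark notation, e.g. {urn:...}JIBPoslodavca."""
    return f"{{{NS}}}{tag}"


root = ET.Element(ns("PaketniUvozObrazaca"))
p1 = ET.SubElement(root, ns("PodaciOPoslodavcu"))
jib = ET.SubElement(p1, ns("JIBPoslodavca"))
jib.text = "4254160150005"

# Every tag is namespaced, so default_namespace can be used without a
# ValueError about non-qualified names.
xml_text = ET.tostring(root, encoding="unicode", default_namespace=NS)

# Parsing the result again yields namespaced tags for all elements.
reparsed = ET.fromstring(xml_text)
found = reparsed.find(f"{ns('PodaciOPoslodavcu')}/{ns('JIBPoslodavca')}")
```

With this approach the tree in memory and the serialized document agree about which namespace each element belongs to.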
The tree.write() method takes a default_namespace argument.
What happens if you change that line to the following?
tree.write(files, default_namespace="urn:PaketniUvozObrazaca_V1_0.xsd")

Original order of processed xml with ElementTree can't be kept [duplicate]

I've written a fairly simple filter in Python using ElementTree to munge the contents of some XML files. And it works, more or less.
But it reorders the attributes of various tags, and I'd like it to not do that.
Does anyone know a switch I can throw to make it keep them in specified order?
Context for this
I'm working with and on a particle physics tool that has a complex, but oddly limited configuration system based on xml files. Among the many things setup that way are the paths to various static data files. These paths are hardcoded into the existing xml and there are no facilities for setting or varying them based on environment variables, and in our local installation they are necessarily in a different place.
This isn't a disaster, because the combined source- and build-control tool we're using allows us to shadow certain files with local copies. But even though the data files are static, the XML isn't, so I've written a script for fixing the paths. With the attribute rearrangement, however, diffs between the local and master versions are harder to read than necessary.
This is my first time taking ElementTree for a spin (and only my fifth or sixth python project) so maybe I'm just doing it wrong.
Abstracted for simplicity the code looks like this:
tree = elementtree.ElementTree.parse(inputfile)
i = tree.getiterator()
for e in i:
    e.text = filter(e.text)
tree.write(outputfile)
Reasonable or dumb?
Related links:
How can I get the order of an element attribute list using Python xml.sax?
Preserve order of attributes when modifying with minidom
With help from @bobince's answer and these two (setting attribute order, overriding module methods), I managed to get this monkey patched. It's dirty, and I'd suggest using another module that handles this scenario better, but for when that isn't a possibility:
# =======================================================================
# Monkey patch ElementTree
import xml.etree.ElementTree as ET
def _serialize_xml(write, elem, encoding, qnames, namespaces):
    tag = elem.tag
    text = elem.text
    if tag is ET.Comment:
        write("<!--%s-->" % ET._encode(text, encoding))
    elif tag is ET.ProcessingInstruction:
        write("<?%s?>" % ET._encode(text, encoding))
    else:
        tag = qnames[tag]
        if tag is None:
            if text:
                write(ET._escape_cdata(text, encoding))
            for e in elem:
                _serialize_xml(write, e, encoding, qnames, None)
        else:
            write("<" + tag)
            items = elem.items()
            if items or namespaces:
                if namespaces:
                    for v, k in sorted(namespaces.items(),
                                       key=lambda x: x[1]):  # sort on prefix
                        if k:
                            k = ":" + k
                        write(" xmlns%s=\"%s\"" % (
                            k.encode(encoding),
                            ET._escape_attrib(v, encoding)
                            ))
                #for k, v in sorted(items):  # lexical order
                for k, v in items: # Monkey patch
                    if isinstance(k, ET.QName):
                        k = k.text
                    if isinstance(v, ET.QName):
                        v = qnames[v.text]
                    else:
                        v = ET._escape_attrib(v, encoding)
                    write(" %s=\"%s\"" % (qnames[k], v))
            if text or len(elem):
                write(">")
                if text:
                    write(ET._escape_cdata(text, encoding))
                for e in elem:
                    _serialize_xml(write, e, encoding, qnames, None)
                write("</" + tag + ">")
            else:
                write(" />")
    if elem.tail:
        write(ET._escape_cdata(elem.tail, encoding))
ET._serialize_xml = _serialize_xml
from collections import OrderedDict
class OrderedXMLTreeBuilder(ET.XMLTreeBuilder):
    def _start_list(self, tag, attrib_in):
        fixname = self._fixname
        tag = fixname(tag)
        attrib = OrderedDict()
        if attrib_in:
            for i in range(0, len(attrib_in), 2):
                attrib[fixname(attrib_in[i])] = self._fixtext(attrib_in[i+1])
        return self._target.start(tag, attrib)
# =======================================================================
Then in your code:
tree = ET.parse(pathToFile, OrderedXMLTreeBuilder())
Nope. ElementTree uses a dictionary to store attribute values, so it's inherently unordered.
Even DOM doesn't guarantee you attribute ordering, and DOM exposes a lot more detail of the XML infoset than ElementTree does. (There are some DOMs that do offer it as a feature, but it's not standard.)
Can it be fixed? Maybe. Here's a stab at it that replaces the dictionary when parsing with an ordered one (collections.OrderedDict()).
from xml.etree import ElementTree
from collections import OrderedDict
import StringIO
class OrderedXMLTreeBuilder(ElementTree.XMLTreeBuilder):
    def _start_list(self, tag, attrib_in):
        fixname = self._fixname
        tag = fixname(tag)
        attrib = OrderedDict()
        if attrib_in:
            for i in range(0, len(attrib_in), 2):
                attrib[fixname(attrib_in[i])] = self._fixtext(attrib_in[i+1])
        return self._target.start(tag, attrib)
>>> xmlf = StringIO.StringIO('<a b="c" d="e" f="g" j="k" h="i"/>')
>>> tree = ElementTree.ElementTree()
>>> root = tree.parse(xmlf, OrderedXMLTreeBuilder())
>>> root.attrib
OrderedDict([('b', 'c'), ('d', 'e'), ('f', 'g'), ('j', 'k'), ('h', 'i')])
Looks potentially promising.
>>> s = StringIO.StringIO()
>>> tree.write(s)
>>> s.getvalue()
'<a b="c" d="e" f="g" h="i" j="k" />'
Bah, the serialiser outputs them in canonical order.
This looks like the line to blame, in ElementTree._write:
items.sort() # lexical order
Subclassing or monkey-patching that is going to be annoying as it's right in the middle of a big method.
Unless you did something nasty like subclass OrderedDict and hack items to return a special subclass of list that ignores calls to sort(). Nah, probably that's even worse and I should go to bed before I come up with anything more horrible than that.
The best option is to use the lxml library: http://lxml.de/
Installing lxml and just switching the library did the magic for me.
#import xml.etree.ElementTree as ET
from lxml import etree as ET
Yes, with lxml
>>> from lxml import etree
>>> root = etree.Element("root", interesting="totally")
>>> etree.tostring(root)
b'<root interesting="totally"/>'
>>> print(root.get("hello"))
None
>>> root.set("hello", "Huhu")
>>> print(root.get("hello"))
Huhu
>>> etree.tostring(root)
b'<root interesting="totally" hello="Huhu"/>'
The example above is slightly adapted from the lxml documentation.
Also note that lxml has, by design, good API compatibility with the standard xml.etree.ElementTree.
This has been "fixed" in python 3.8. I can't find any notes about it anywhere, but it works now.
D:\tmp\etree_order>type etree_order.py
import xml.etree.ElementTree as ET
a = ET.Element('a', {"aaa": "1", "ccc": "3", "bbb": "2"})
print(ET.tostring(a))
D:\tmp\etree_order>C:\Python37-64\python.exe etree_order.py
b'<a aaa="1" bbb="2" ccc="3" />'
D:\tmp\etree_order>c:\Python38-64\python.exe etree_order.py
b'<a aaa="1" ccc="3" bbb="2" />'
Wrong question. It should be: "Where do I find a diff gadget that works sensibly with XML files?"
Answer: Google is your friend. The first result for a search on "xml diff" is a good start, and there are a few more possibilities.
From section 3.1 of the XML recommendation:
Note that the order of attribute specifications in a start-tag or empty-element tag is not significant.
Any system that relies on the order of attributes in an XML element is going to break.
This is a partial solution, for the case where XML is being emitted and a predictable order is desired. It does not solve round-trip parsing and writing. Both Python 2.7 and 3.x use sorted() to force an attribute ordering. So this code, used together with an OrderedDict to hold the attributes, makes the order in the XML output match the order used to create the Elements.
from collections import OrderedDict
from xml.etree import ElementTree as ET
# Make sorted() a no-op for the ElementTree module
ET.sorted = lambda x: x
try:
    # Python 3 uses the C implementation by default; prevent that
    ET.Element = ET._Element_Py
    # Similarly, override SubElement if desired
    def SubElement(parent, tag, attrib=OrderedDict(), **extra):
        attrib = attrib.copy()
        attrib.update(extra)
        element = parent.makeelement(tag, attrib)
        parent.append(element)
        return element
    ET.SubElement = SubElement
except AttributeError:
    pass  # nothing else for Python 2, where ElementTree is pure Python
# Make an element with a particular "meaningful" ordering
t = ET.ElementTree(ET.Element('component',
                              OrderedDict([('grp','foo'),('name','bar'),
                                           ('class','exec'),('arch','x86')])))
# Add a child element
ET.SubElement(t.getroot(),'depend',
              OrderedDict([('grp','foo'),('name','util1'),('class','lib')]))
x = ET.tostring(t.getroot())
print(x)
# Order maintained...
# <component grp="foo" name="bar" class="exec" arch="x86"><depend grp="foo" name="util1" class="lib" /></component>
# Parse again, won't be ordered because Elements are created
# without ordered dict
print(ET.tostring(ET.fromstring(x)))
# <component arch="x86" name="bar" grp="foo" class="exec"><depend name="util1" grp="foo" class="lib" /></component>
The problem with parsing XML into an element tree is that the code internally creates plain dicts which are passed in to Element(), at which point the order is lost. No equivalent simple patch is possible.
I had the same problem. I first looked for a Python script to canonicalize the XML but didn't find one, then started thinking about writing my own. In the end, xmllint solved it.
I used the accepted answer above, with both statements:
ET._serialize_xml = _serialize_xml
ET._serialize['xml'] = _serialize_xml
While this fixed the ordering in every node, attribute ordering on new nodes inserted from copies of existing nodes failed to preserve without a deepcopy. Watch out for reusing nodes to create others...
In my case I had an element with several attributes, so I wanted to reuse them:
to_add = ET.fromstring(ET.tostring(contract))
to_add.attrib['symbol'] = add
to_add.attrib['uniqueId'] = add
contracts.insert(j + 1, to_add)
The fromstring(tostring) will reorder the attributes in memory. It may not result in the alpha sorted dict of attributes, but it also may not have the expected ordering.
import copy
to_add = copy.deepcopy(contract)
to_add.attrib['symbol'] = add
to_add.attrib['uniqueId'] = add
contracts.insert(j + 1, to_add)
Now the ordering persists.
I would recommend using LXML (as others have as well). If you need to preserve the order of attributes to adhere to the c14n v1 or v2 standards (https://www.w3.org/TR/xml-c14n2/) (i.e. increasing lexicographic order), lxml supports this very nicely by passing an output method (see heading C14N of https://lxml.de/api.html)
For example:
from lxml import etree as ET
element = ET.Element('Test', B='beta', Z='omega', A='alpha')
val = ET.tostring(element, method="c14n")
print(val)
Running the script under Python 3.8 or later preserves the order of the attributes in the XML files.
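A quick way to check this on your own interpreter might look like the following. It is a small sketch that assumes Python 3.8 or later and uses only the standard library; the attribute names are taken from the earlier example in this thread, and the added attribute z is invented for illustration.

```python
import xml.etree.ElementTree as ET

source = '<a b="c" d="e" f="g" j="k" h="i" />'
root = ET.fromstring(source)

root.set("z", "new")      # a new attribute is appended after the existing ones
root.set("d", "changed")  # updating a value keeps the attribute's position

attribute_order = list(root.attrib)
output = ET.tostring(root, encoding="unicode")
```

On older interpreters the serializer sorts the attributes, so the same script is also a simple test for which behavior you have.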

Creating a document tree before or after adding the subelements

I am using lxml and Python for writing XML files. I was wondering what is the accepted practice: creating a document tree first and then adding the sub elements OR adding the sub elements and creating the tree later? I know this hardly makes any difference as to the output, but I was interested in knowing what is the accepted norm in this from a coding-style point of view.
Sample code:
page = etree.Element('root')
#first create the tree
doc = etree.ElementTree(page)
#add the subelements
headElt = etree.SubElement(page, 'head')
Or this:
page = etree.Element('root')
headElt = etree.SubElement(page, 'head')
#create the tree in the end
doc = etree.ElementTree(page)
Since tree construction is typically a recursive action, I would say that the tree root could get created last, once the subtree is done. However, I don't see any reason why that should be any better than creating the tree first. I honestly don't think there's an accepted norm for this, and rather than trying to find one I would advise you to write your code in such a way that it makes sense for you and anyone else that might need to read and understand it later.

Python xml.dom.minidom removeChild whitespace problem

I'm trying to read an XML file into Python, pull out certain elements, and then write the results back to an XML file (so basically it's the original XML file without several elements). When I use .removeChild(source) it removes the individual elements I want to remove but leaves whitespace in their place, which makes the file very unreadable. I know I can still parse the file with all of the whitespace, but there are times when I need to manually alter the values of certain elements' attributes, and the whitespace makes this difficult (and annoying). I can certainly remove the whitespace by hand, but with dozens of these XML files that's not really feasible.
Is there a way to do .removeChild and have it remove the white space as well?
Here's what my code looks like:
from xml.dom.minidom import parse
dom = parse(filename)
main = dom.childNodes[0]
sources = main.getElementsByTagName("source")
for source in sources:
    name = source.getAttribute("name")
    spatialModel = source.getElementsByTagName("spatialModel")
    val1 = float(spatialModel[0].getElementsByTagName("parameter")[0].getAttribute("value"))
    val2 = float(spatialModel[0].getElementsByTagName("parameter")[1].getAttribute("value"))
    if angsep(val1, val2, X, Y) >= ROI:
        main.removeChild(source)
    else:
        print(name, val1, val2, angsep(val1, val2, X, Y))
f = open(outfile, "w")
f.write("<?xml version=\"1.0\" ?>\n")
f.write(dom.saveXML(main))
f.close()
Thanks much for the help.
If you have PyXML installed you can use xml.dom.ext.PrettyPrint()
I couldn't figure out how to do this using xml.dom.minidom, so I just wrote a quick function that reads the output file, removes all blank lines, and writes the result to a new file:
import re
empty = re.compile(r'^\s*$')  # matches blank or whitespace-only lines
with open(xmlfile) as src, open('src_model.xml', 'w') as out:
    for line in src:
        if empty.match(line):
            continue
        out.write(line)
This works well enough for me :)
For people who find this via search:
This funny snippet (Python 2 only, since it relies on cmp)
skey = lambda x: getattr(x, "tagName", None)
mainnode.childNodes = sorted(
    [n for n in mainnode.childNodes if n.nodeType != n.TEXT_NODE],
    cmp=lambda x, y: cmp(skey(y), skey(x)))
removes all text nodes (and also reverse-sorts the remaining nodes by tag name).
That is, you can (recursively) do tr.childNodes = [recurseclean(n) for n in tr.childNodes if n.nodeType != n.TEXT_NODE] to remove all text nodes.
Or, if you need to keep text nodes that carry data, you might want something like … if n.nodeType != n.TEXT_NODE or not re.match(r'^\s*$', n.data) (I didn't try that one myself). Or something more complex to keep text inside specific tags.
After that tree.toprettyxml(…) will return well-formatted XML text.
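The recursive clean-up described above might be sketched like this. It is a minimal sketch using only xml.dom.minidom; the function name strip_whitespace_nodes and the sample document are made up for illustration, and it assumes that whitespace-only text nodes carry no meaning in your files.

```python
from xml.dom import minidom


def strip_whitespace_nodes(node):
    """Recursively remove text nodes that contain only whitespace."""
    for child in list(node.childNodes):  # copy, because we modify while iterating
        if child.nodeType == child.TEXT_NODE and not child.data.strip():
            node.removeChild(child)
        else:
            strip_whitespace_nodes(child)


data = "<lib>\n  <source name='a'/>\n  <source name='b'/>\n</lib>"
dom = minidom.parseString(data)
root = dom.documentElement

# Remove one element, as in the question, then clean up the leftovers.
root.removeChild(root.getElementsByTagName("source")[0])
strip_whitespace_nodes(root)

pretty = root.toprettyxml(indent="  ")
```

Stripping the whitespace nodes before calling toprettyxml() matters because toprettyxml() adds its own indentation and would otherwise keep the old whitespace in addition to it.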
I know that this question is quite dated, but since it took me a while to figure out different approaches to the problem, here are my solutions.
The best way I found is indeed to use lxml:
from lxml import etree
root = etree.fromstring(data)
# for tag in root.iter('tag') doesn't cope with namespaces...
for tag in root.xpath('//*[local-name() = "tag"]'):
    tag.getparent().remove(tag)
data = etree.tostring(root, encoding = 'utf-8', pretty_print = True)
With minidom it's a bit more convoluted, because every node is typically followed by a whitespace text node:
import xml.dom.minidom
dom = xml.dom.minidom.parseString(data)
for tag in dom.getElementsByTagName('tag'):
    if tag.nextSibling \
            and tag.nextSibling.nodeType == tag.TEXT_NODE \
            and tag.nextSibling.data.isspace():
        tag.parentNode.removeChild(tag.nextSibling)
    tag.parentNode.removeChild(tag)
data = dom.documentElement.toxml(encoding = 'utf-8')
