Python: Read and write namespaced XML using ElementTree - python

This XML file is named example.xml:
<?xml version="1.0"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
<modelVersion>14.0.0</modelVersion>
<groupId>.com.foobar.flubber</groupId>
<artifactId>uberportalconf</artifactId>
<version>13-SNAPSHOT</version>
<packaging>pom</packaging>
<name>Environment for UberPortalConf</name>
<description>This is the description</description>
<properties>
<birduberportal.version>11</birduberportal.version>
<promotiondevice.version>9</promotiondevice.version>
<foobarportal.version>6</foobarportal.version>
<eventuberdevice.version>2</eventuberdevice.version>
</properties>
<!-- A lot more here, but as it is irrelevant for the problem I have removed it -->
</project>
If I load the example.xml file above using ElementTree and print the root node:
>>> from xml.etree import ElementTree
>>> tree = ElementTree.parse('example.xml')
>>> print(tree.getroot())
<Element '{http://maven.apache.org/POM/4.0.0}project' at 0x26ee0f0>
I see that Element also contains the namespace http://maven.apache.org/POM/4.0.0.
How do I:
1. Get the foobarportal.version text, increase it by one and write the XML file back, keeping the namespace the document had when loaded and not changing the overall XML layout.
2. Get it to load using any namespace, not just http://maven.apache.org/POM/4.0.0. I still don't want to strip the namespace, as I want the XML to stay the same except for changing foobarportal.version as in 1 above.
The current way is not aware of XML but fulfills 1 and 2 above:
1. Grep for <foobarportal.version>(.*)</foobarportal.version>
2. Take the contents of the match group and increase it by one
3. Write it back.
It would be nice to have an XML aware solution, as it would be more robust. The XML namespace handling of ElementTree is making it more complicated.

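If your question is simply "how do I search by a namespaced element name", then the answer is that lxml, like the standard library's ElementTree, understands the {namespace}tag syntax. Note that find() searches the children of the element it is called on, so on the root you look up a child such as properties rather than project itself:
tree.getroot().find('{http://maven.apache.org/POM/4.0.0}properties')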
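The whole round trip asked about in the question can be sketched with the standard library alone. This is a minimal sketch, not a definitive solution: the inline SOURCE string is a shortened stand-in for example.xml, and the namespace URI is read from the root tag so that the code is not tied to the Maven namespace. Registering that URI with an empty prefix is what keeps ElementTree from inventing ns0: prefixes on output. Be aware that the default parser drops comments and that the serialised attribute and declaration quoting may differ from the input, so the layout is preserved only approximately.

```python
import xml.etree.ElementTree as ET

# Shortened stand-in for example.xml (assumption: the real file is larger).
SOURCE = """<project xmlns="http://maven.apache.org/POM/4.0.0">
  <properties>
    <foobarportal.version>6</foobarportal.version>
  </properties>
</project>"""

root = ET.fromstring(SOURCE)

# Read the namespace URI from the root tag, which looks like "{uri}project".
uri = root.tag[1:root.tag.index("}")]

# An empty prefix makes the serialiser emit xmlns="..." instead of ns0:.
ET.register_namespace("", uri)

# Build the qualified names and walk down to the element to change.
properties = root.find("{%s}properties" % uri)
version = properties.find("{%s}foobarportal.version" % uri)
version.text = str(int(version.text) + 1)

result = ET.tostring(root, encoding="unicode")
print(result)
```

When working with a file rather than a string, ElementTree.parse() followed by tree.write(path, encoding="utf-8", xml_declaration=True) would take the place of fromstring() and tostring().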

Related

How do I search for a Tag in xml file using ElementTree where i have prefixes (python)

I just started learning Python and have to write a program that parses xml files.
I have multiple entries as seen below and I need, as a starting point, to return all the different d:Name entries in a list.
Unfortunately, I can't manage to use findall with prefixes.
import xml.etree.ElementTree as ET
tree = ET.parse('test.xml')
lst = tree.findall('.//{d}Name')
I read that if d is a prefix, I need to use the URI instead of the prefix. But I don't understand which is the URI in my case, or how to make a successful search when I have the following file.
I have an XML that looks like this (simplified):
<feed xml:base="http://projectserver/ps/_api/">
<entry>
<id>
http://projectserver/ps/_api/ProjectServer/EnterpriseResources('some id...')
</id>
<content type="application/xml">
<m:properties>
<d:Name>
WHAT I NEED
</d:Name>
</m:properties>
</content>
</entry>
<entry>
...
This bypassed my problem so thank you!
If you are using Python 3.8 or later, this post may help: link – Jim Rhodes
So I ran the following, which printed every tag in the document. In that output I found the {URI}Name tag, which I then used to do the search properly.
for elem in tree.iter():
    print(elem.tag)
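Once the URI is known, there are two common ways to use it, sketched below. The simplified file in the question omits its xmlns declarations, so the URIs http://example.com/m and http://example.com/d are hypothetical placeholders; substitute whatever the tag listing above prints for your file. The first search passes a prefix-to-URI mapping to findall(); the second uses the {*} wildcard, which ElementTree supports from Python 3.8 onward.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URIs: the real ones are declared on the <feed> element
# of the actual file and are not shown in the question.
SOURCE = """<feed xmlns:m="http://example.com/m" xmlns:d="http://example.com/d">
  <entry>
    <content type="application/xml">
      <m:properties>
        <d:Name>
          First
        </d:Name>
      </m:properties>
    </content>
  </entry>
  <entry>
    <content type="application/xml">
      <m:properties>
        <d:Name>
          Second
        </d:Name>
      </m:properties>
    </content>
  </entry>
</feed>"""

root = ET.fromstring(SOURCE)

# Option 1: map a prefix of your choosing to the URI and use it in the path.
ns = {"d": "http://example.com/d"}
names = [el.text.strip() for el in root.findall(".//d:Name", ns)]

# Option 2 (Python 3.8+): match Name in any namespace with the {*} wildcard.
names_any_ns = [el.text.strip() for el in root.findall(".//{*}Name")]

print(names)
```

The prefix used in the mapping does not have to match the prefix used in the document; only the URI has to match.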

Python add new element by xml ElementTree

XML file
<?xml version="1.0" encoding="utf-8"?>
<Info xmlns="BuildTest">
<RequestDate>5/4/2020 12:27:46 AM</RequestDate>
</Info>
I want to add a new element inside the Info tag.
Here is what I did.
import xml.etree.ElementTree as ET
tree = ET.parse('example.xml')
root = tree.getroot()
ele = ET.Element('element1')
ele.text = 'ele1'
root.append(ele)
tree.write("output.xhtml")
Output
<ns0:Info xmlns:ns0="BuildTest">
<ns0:RequestDate>5/4/2020 12:27:46 AM</ns0:RequestDate>
<element1>ele1</element1></ns0:Info>
Three questions:
The <?xml version="1.0" encoding="utf-8"?> is missing.
The namespace is wrong.
The whitespace of the new element is gone.
I saw many questions related to this topic, most of them are suggesting other packages.
Is there any way to handle this properly with ElementTree?
Processing instructions are not considered XML elements. A web search for "are processing instructions part of an XML" returns, as its first result:
Processing instructions are markup, but they're not elements.
Since the package you are using is literally called ElementTree, you can reasonably expect its objects to be trees of elements. If I remember correctly, DOM-compliant XML packages can support non-element markup in XML.
For the namespace issue, the answer is on Stack Overflow, at Remove ns0 from XML: you just have to register the namespace you specified in the top element of your document. The URI must match exactly, including case. The following worked for me:
ET.register_namespace("", "BuildTest")
As for the whitespace - the new element does not have any whitespace. You can assign to the tail member to add a linefeed after an element.
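Putting the three points together, a minimal sketch might look like the following. It assumes the original file is indented by two spaces (the indentation is not visible in the question), creates the new element inside the BuildTest namespace so that it matches its siblings, and asks write() for the XML declaration explicitly through its xml_declaration argument. The declaration that ElementTree emits uses single quotes, which is equivalent XML but not byte-identical to the input.

```python
import io
import xml.etree.ElementTree as ET

# Assumed layout: two-space indentation, which the question does not show.
SOURCE = b"""<?xml version="1.0" encoding="utf-8"?>
<Info xmlns="BuildTest">
  <RequestDate>5/4/2020 12:27:46 AM</RequestDate>
</Info>"""

NS = "BuildTest"
ET.register_namespace("", NS)  # serialise this namespace without a prefix

root = ET.fromstring(SOURCE)

# Qualify the new tag so it lives in the same namespace as its siblings.
ele = ET.SubElement(root, "{%s}element1" % NS)
ele.text = "ele1"

# Whitespace is stored in .tail: hand the closing newline to the new element
# and give the previous sibling a newline plus indentation instead.
previous = root[-2]
ele.tail = previous.tail
previous.tail = "\n  "

buffer = io.BytesIO()
ET.ElementTree(root).write(buffer, encoding="utf-8", xml_declaration=True)
result = buffer.getvalue().decode("utf-8")
print(result)
```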

Unable to remove element/node using ElementTree

I have an issue with ElementTree that I can't quite figure out. I've read all their documentation as well as all the information I could find on this forum. I have a couple elements/nodes that I am trying to remove using ElementTree. I don't get any errors with the following code, but when I look at the output file I wrote the changes to, the elements/nodes that I expected to be removed are still there. I have a document that looks like this:
<data>
<config>
<script filename="test1.txt"></script>
<documentation filename="test2.txt"></script>
</config>
</data>
My code looks as follows:
import os
import xml.etree.ElementTree as ElementTree
xmlTree = ElementTree.parse(os.path.join(sourcePath, "test.xml"))
xmlRoot = xmlTree.getroot()
for doc in xmlRoot.findall('documentation'):
    xmlRoot.remove(doc)
xmlTree.write(os.path.join(sourcePath, "testTWO.xml"))
The result is I get the following document:
<data>
<config>
<script filename="test1.txt" />
<documentation filename="test2.txt" />
</config>
</data>
What I need is something more like this. I am not stuck using ElementTree. If there is a better solution with lxml or some other library, I am all ears. I know ElementTree can be a little bit of a pain at times.
<data>
<config>
</config>
</data>
xmlRoot.findall('documentation') in your code didn't find anything, because <documentation> isn't a direct child of the root element <data>. It is actually a direct child of <config>:
"Element.findall() finds only elements with a tag which are direct children of the current element". [19.7.1.3. Finding interesting elements]
This is one possible way to remove all children of <config> using findall(), given the sample XML you posted (and assuming that the actual XML has the <documentation> element closed with a proper closing tag instead of </script>):
......
config = xmlRoot.find('config')
# find all children of config
for doc in config.findall('*'):
    config.remove(doc)
    # print just to make sure the removed element is the intended one
    print(ElementTree.tostring(doc))
......
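If only the <documentation> elements should go, wherever they sit in the tree, one option is to build a child-to-parent map first, since Element.remove() has to be called on the direct parent. The following is a minimal sketch using an inline copy of the sample document with the closing tag corrected; the parent_of name is just an illustrative helper.

```python
import xml.etree.ElementTree as ET

# Inline copy of the sample document, with <documentation> closed properly.
root = ET.fromstring(
    "<data><config>"
    '<script filename="test1.txt"></script>'
    '<documentation filename="test2.txt"></documentation>'
    "</config></data>"
)

# ElementTree elements do not know their parent, so record it up front.
parent_of = {child: parent for parent in root.iter() for child in parent}

# Materialise the matches in a list before mutating the tree.
for doc in list(root.iter("documentation")):
    parent_of[doc].remove(doc)

remaining = [child.tag for child in root.find("config")]
print(remaining)
```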

finding text into namespaced xml elements with lxml.etree

I try to use lxml.etree to parse an XML file and find text into elements of the XML.
XML files can be as such:
<?xml version="1.0" encoding="UTF-8"?>
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/
http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd">
<responseDate>2002-06-01T19:20:30Z</responseDate>
<request verb="ListRecords" from="1998-01-15"
set="physics:hep"
metadataPrefix="oai_rfc1807">
http://an.oa.org/OAI-script</request>
<ListRecords>
<record>
<header>
<identifier>oai:arXiv.org:hep-th/9901001</identifier>
<datestamp>1999-12-25</datestamp>
<setSpec>physics:hep</setSpec>
<setSpec>math</setSpec>
</header>
<metadata>
<rfc1807 xmlns=
"http://info.internet.isi.edu:80/in-notes/rfc/files/rfc1807.txt"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation=
"http://info.internet.isi.edu:80/in-notes/rfc/files/rfc1807.txt
http://www.openarchives.org/OAI/1.1/rfc1807.xsd">
<bib-version>v2</bib-version>
<id>hep-th/9901001</id>
<entry>January 1, 1999</entry>
<title>Investigations of Radioactivity</title>
<author>Ernest Rutherford</author>
<date>March 30, 1999</date>
</rfc1807>
</metadata>
<about>
<oai_dc:dc
xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/
http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
<dc:publisher>Los Alamos arXiv</dc:publisher>
<dc:rights>Metadata may be used without restrictions as long as
the oai identifier remains attached to it.</dc:rights>
</oai_dc:dc>
</about>
</record>
<record>
<header status="deleted">
<identifier>oai:arXiv.org:hep-th/9901007</identifier>
<datestamp>1999-12-21</datestamp>
</header>
</record>
</ListRecords>
</OAI-PMH>
For the following part we assume doc = etree.parse("/tmp/test.xml") where test.xml contains the XML pasted above.
First I try to find all the <record> elements using doc.findall(".//record") but it returns an empty list.
Secondly, for a given word I'd like to check if it is in the <dc:publisher>.
To achieve this I first tried the same approach as earlier, doc.findall(".//publisher"), but I have the same issue. I'm pretty sure all of this is linked with namespaces, but I don't know how to handle them.
I've read the libxml tutorial, and tried the example for findall method on a basic xml file (without any namespace) and it worked out.
As Chris has already mentioned, you can also use lxml and xpath. As xpath doesn't allow you to write the namespaced names in full like {http://www.openarchives.org/OAI/2.0/}record (so-called "James Clark notation" *), you will have to use prefixes, and provide the xpath engine with a prefix-to-namespace-uri mapping.
Example with lxml (assuming you already have the desired tree object):
nsmap = {'oa': 'http://www.openarchives.org/OAI/2.0/',
         'dc': 'http://purl.org/dc/elements/1.1/'}
tree.xpath('//oa:record[descendant::dc:publisher[contains(., "Alamos")]]',
           namespaces=nsmap)
This will select all {http://www.openarchives.org/OAI/2.0/}record elements that have a descendant element {http://purl.org/dc/elements/1.1/}publisher containing the word "Alamos".
[*] this comes from an article where James Clark explains XML Namespaces, everyone not familiar with namespaces should read this! (even if it was written a long time ago)
Disclaimer: I am using the standard library xml.etree.ElementTree module, not the lxml library (although this is a subset of lxml as far as I know). I'm sure there is an answer which is much simpler than mine which uses lxml and XPATH, but I don't know it.
Namespace issue
You were right to say that the problem is likely the namespaces. There is no record element in your XML file, but there are two {http://www.openarchives.org/OAI/2.0/}record tags in the file. As the following demonstrates:
>>> import xml.etree.ElementTree as etree
>>> xml_string = ...Your XML to parse...
>>> e = etree.fromstring(xml_string)
# Let's see what the root element is
>>> e
<Element {http://www.openarchives.org/OAI/2.0/}OAI-PMH at 7f39ebf54f80>
# Let's see what children there are of the root element
>>> for child in e:
...     print(child)
...
<Element {http://www.openarchives.org/OAI/2.0/}responseDate at 7f39ebf54fc8>
<Element {http://www.openarchives.org/OAI/2.0/}request at 7f39ebf58050>
<Element {http://www.openarchives.org/OAI/2.0/}ListRecords at 7f39ebf58098>
# Finally, let's get the children of the `ListRecords` element
>>> for child in e[-1]:
...     print(child)
...
<Element {http://www.openarchives.org/OAI/2.0/}record at 7f39ebf580e0>
<Element {http://www.openarchives.org/OAI/2.0/}record at 7f39ebf58908>
So, for example
>>> e.find('ListRecords')
returns None, whereas
>>> e.find('{http://www.openarchives.org/OAI/2.0/}ListRecords')
<Element {http://www.openarchives.org/OAI/2.0/}ListRecords at 7f39ebf58098>
returns the ListRecords element.
Note that I am using the find method since the standard library ElementTree does not have an xpath method.
Possible solution
One way to solve this is to get the namespace prefix and prepend it to the tag you are trying to find. You can use
>>> e.tag[:e.tag.index('}')+1]
'{http://www.openarchives.org/OAI/2.0/}'
on the root element, e, to find the namespace, although I'm sure there is a better way of doing this.
Now we can define functions to extract the tags we want, with an optional namespace prefix:
def findallNS(element, tag, namespace=None):
    # namespace is the '{uri}' string to prepend, or None for plain tags
    if namespace is not None:
        return element.findall(namespace + tag)
    else:
        return element.findall(tag)

def findNS(element, tag, namespace=None):
    if namespace is not None:
        return element.find(namespace + tag)
    else:
        return element.find(tag)
So now we can write:
>>> namespace = e.tag[:e.tag.index('}')+1]
>>> list_records = findNS(e, 'ListRecords', namespace)
>>> findallNS(list_records, 'record', namespace)
[<Element {http://www.openarchives.org/OAI/2.0/}record at 7f39ebf580e0>,
<Element {http://www.openarchives.org/OAI/2.0/}record at 7f39ebf58908>]
Alternative solution
Another solution may be to write a function that searches for all tags ending with the tag you are interested in, for example:
def find_child_tags(element, tag):
    return [child for child in element if child.tag.endswith(tag)]
Here you don't need to deal with the namespace at all.
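One caveat with a plain endswith() test is that it also matches longer names that merely end in the same letters (looking for record would also pick up subrecord). A slightly stricter variant, sketched here under the assumption that comparing the local part of the tag is what you want, strips the {uri} prefix before comparing. The local_name helper and the urn:example URIs are illustrative only.

```python
import xml.etree.ElementTree as ET

def local_name(tag):
    # '{uri}record' -> 'record'; tags without a namespace are returned as-is.
    return tag.rsplit("}", 1)[-1]

def find_child_tags(element, name):
    # Compare the local name exactly; skip nodes whose tag is not a string
    # (comments and processing instructions use a function as their tag).
    return [
        child for child in element
        if isinstance(child.tag, str) and local_name(child.tag) == name
    ]

# Illustrative document mixing two made-up namespaces.
root = ET.fromstring(
    '<root xmlns="urn:example:a" xmlns:b="urn:example:b">'
    "<record/><b:record/><subrecord/>"
    "</root>"
)

matches = find_child_tags(root, "record")
loose = [child for child in root if child.tag.endswith("record")]
print(len(matches), len(loose))
```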
@Chris's answer is very good and it will work with lxml too. Here is another way using lxml (it works the same way with xpath instead of find):
In [37]: xml.find('.//n:record', namespaces={'n': 'http://www.openarchives.org/OAI/2.0/'})
Out[37]: <Element {http://www.openarchives.org/OAI/2.0/}record at 0x2a451e0>
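The same prefix mapping also works with the standard library's xml.etree.ElementTree, whose find() and findall() accept a namespaces dictionary as their second argument. Since the standard library has no xpath() method and no contains() function, the text check is done in Python in this sketch. The inline SOURCE is a cut-down version of the document in the question, and the oa prefix is an arbitrary choice.

```python
import xml.etree.ElementTree as ET

# Cut-down version of the OAI-PMH document from the question.
SOURCE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <about>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:publisher>Los Alamos arXiv</dc:publisher>
        </oai_dc:dc>
      </about>
    </record>
    <record>
      <header status="deleted"/>
    </record>
  </ListRecords>
</OAI-PMH>"""

# The prefixes are local to this mapping; only the URIs must match the file.
NS = {
    "oa": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

root = ET.fromstring(SOURCE)
records = root.findall(".//oa:record", NS)

# Keep the records that have a publisher whose text mentions the word.
matching = [
    record for record in records
    if any("Alamos" in (pub.text or "")
           for pub in record.findall(".//dc:publisher", NS))
]
print(len(records), len(matching))
```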

How to add xml header to dom object

I'm using Python's xml.dom.minidom but I think the question is valid for any DOM parser.
My original file has a line like this at the beginning:
<?xml version="1.0" encoding="utf-8" standalone="yes"?>
This doesn't seem to be part of the dom, so when I do something like dom.toxml() the resulting string does not have that line at the beginning.
How can I add it?
Example output:
<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<Root xmlns:aid="http://xxxxxxxxxxxxxxxxxx">
<Section>BANDSAW BLADES</Section>
</Root>
I hope this is clear.
"This doesn't seem to be part of the dom"
The XML Declaration doesn't get a node of its own, no, but the properties declared in it are visible on the Document object:
>>> from xml.dom import minidom
>>> doc = minidom.parseString('<?xml version="1.0" encoding="utf-8" standalone="yes"?><a/>')
>>> doc.encoding
'utf-8'
>>> doc.standalone
True
Serialising the document should include the standalone="yes" part of the declaration, but toxml() doesn't. You could consider this a bug, perhaps, but really the toxml() method doesn't make any promises to serialise the XML declaration in an appropriate way. (eg you don't get an encoding unless you specifically ask for it either.)
You could take charge of writing the document yourself:
xml = []
xml.append('<?xml version="1.0" encoding="utf-8" standalone="yes"?>')
for child in doc.childNodes:
    xml.append(child.toxml())
# join the declaration and the serialised nodes into one string
result = '\n'.join(xml)
but do you really need the XML Declaration here? You are using the default version and encoding, and since you have no DOCTYPE there can be no externally-defined entities, so the document is already standalone by nature. As per the XML standard: “if there are no external markup declarations, the standalone document declaration has no meaning”. It seems to me you could safely omit it completely.
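If you do decide to keep the declaration, the manual approach above can be wrapped in a small function. This is a sketch only: serialise_with_declaration is a hypothetical helper name, the version and encoding are hard-coded to the values from the question, and the standalone flag is read back from the parsed Document. On Python 3.9 and later, toxml() itself accepts a standalone argument, which may make a helper like this unnecessary.

```python
from xml.dom import minidom

SOURCE = (
    '<?xml version="1.0" encoding="utf-8" standalone="yes"?>'
    "<Root><Section>BANDSAW BLADES</Section></Root>"
)

def serialise_with_declaration(doc):
    # Hypothetical helper: rebuild the declaration by hand, then serialise
    # each top-level node, because toxml() on the Document writes its own
    # declaration without the standalone attribute.
    declaration = '<?xml version="1.0" encoding="utf-8"'
    if doc.standalone is not None:
        declaration += ' standalone="%s"' % ("yes" if doc.standalone else "no")
    declaration += "?>"
    body = "".join(child.toxml() for child in doc.childNodes)
    return declaration + "\n" + body

doc = minidom.parseString(SOURCE)
result = serialise_with_declaration(doc)
print(result)
```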
