Python XML question - python

I have an XML document as a str. Now, in the XSD <foo> is unbounded, and while most of the time there is only 1, there COULD be more. I'm trying to use ElementTree, but am running into an issue:
>>> from xml.etree.ElementTree import fromstring
>>>
>>> xml_str = """<?xml version="1.0"?>
... <foo>
... <bar>
... <baz>Spam</baz>
... <qux>Eggs</qux>
... </bar>
... </foo>"""
>>> # Try to get the document
>>> el = fromstring(xml_str)
>>> el.findall('foo')
[]
>>> el.findall('bar')
[<Element 'bar' at 0x1004acb90>]
Clearly, I need to loop through the <foo>s, but because <foo> is at the root, I can't. Obviously, I could create an element called <root> and put el inside of it, but is there a more correct way of doing this?

Each XML document is supposed to have exactly one root element. You will need to adjust your XML if you want to support multiple foo elements.

Alas, wrapping the element in an ElementTree with tree = ElementTree(el) and trying tree.findall('//foo') doesn't seem to work either (it seems you can only search "beneath" an element, and even if the search is done from the full tree, it searches "beneath" the root). As ElementTree doesn't claim to really implement xpath, it's difficult to say whether this is intended or a bug.
Solution: without using lxml with full xpath support (el.xpath('//foo') for example), the easiest solution would be to use the Element.iter() method.
for foo in el.iter(tag='foo'):
    print(foo)
or if you want the results in a list:
list(el.iter(tag='foo'))
Note that you can't use complex paths in this way, just find all elements with a certain tagname, starting from (and including) the element.
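As a minimal Python 3 sketch of the iter() approach (the sample document is the one from the question, minus the XML declaration), this collects every <foo>, including the root itself, and then reads the children beneath each one:

```python
from xml.etree.ElementTree import fromstring

xml_str = """<foo>
  <bar>
    <baz>Spam</baz>
    <qux>Eggs</qux>
  </bar>
</foo>"""

el = fromstring(xml_str)

# iter() yields the starting element itself when its tag matches,
# so a root-level <foo> is found as well as any nested ones.
foos = list(el.iter('foo'))

# Collect the text of every <baz> beneath each <foo>.
baz_texts = [baz.text for foo in foos for baz in foo.iter('baz')]
```

Because the loop runs over whatever iter() returns, the same code handles one <foo> or many.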

Related

XPath text() does not get the text of a link node

from lxml import etree
import requests
htmlparser = etree.HTMLParser()
f = requests.get('https://rss.orf.at/news.xml')
# without the ufeff this would fail because it tells me: "ValueError: Unicode strings with encoding declaration are not supported. Please use bytes input or XML fragments without declaration."
tree = etree.fromstring('\ufeff'+f.text, htmlparser)
print(tree.xpath('//item/title/text()'))  # <- this does produce a list of titles
print(tree.xpath('//item/link/text()'))  # <- this does NOT produce a list of links, why?!
Okay this is a bit of mystery to me, and maybe I'm just overlooking the simplest thing, but the XPath '//item/link/text()' does only produce an empty list while '//item/title/text()' works exactly like expected. Does the <link> node hold any special purpose? I can select all of them with '//item/link' I just can't get the text() selector to work on them.
You're using etree.HTMLParser to parse an XML document. I suspect this was an attempt to deal with XML namespacing, but I think it's probably the wrong solution. It's possible treating the XML document as HTML is ultimately the source of your problem.
If we use the XML parser instead, everything pretty much works as expected.
First, if we look at the root element, we see that it sets a default namespace:
<rdf:RDF
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
xmlns:orfon="http://rss.orf.at/1.0/"
xmlns="http://purl.org/rss/1.0/"
>
That means when we see an item element in the document, it's actually an "item in the http://purl.org/rss/1.0/ namespace" element. We need to provide that namespace information in our xpath queries by passing in a namespaces dictionary and use a namespace prefix on the element names, like this:
>>> tree.xpath('//rss:item', namespaces={'rss': 'http://purl.org/rss/1.0/'})
[<Element {http://purl.org/rss/1.0/}item at 0x7f0497000e80>, ...]
Your first xpath expression (looking at //item/title/text()) becomes:
>>> tree.xpath('//rss:item/rss:title/text()', namespaces={'rss': 'http://purl.org/rss/1.0/'})
['Amnesty dokumentiert Kriegsverbrechen', ..., 'Moskauer Börse startet abgeschirmten Handel']
And your second xpath expression (looking at //item/link/text()) becomes:
>>> tree.xpath('//rss:item/rss:link/text()', namespaces={'rss': 'http://purl.org/rss/1.0/'})
['https://orf.at/stories/3255477/', ..., 'https://orf.at/stories/3255384/']
This makes the code look like:
from lxml import etree
import requests
f = requests.get('https://rss.orf.at/news.xml')
tree = etree.fromstring(f.content)
print(tree.xpath('//rss:item/rss:title/text()', namespaces={'rss': 'http://purl.org/rss/1.0/'}))
print(tree.xpath('//rss:item/rss:link/text()', namespaces={'rss': 'http://purl.org/rss/1.0/'}))
Note that by using f.content (which is a byte string) instead of f.text (a unicode string), we avoid the whole unicode parsing error.
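To try the namespace handling without a network request, here is a small offline sketch. The feed text is a made-up, cut-down RSS 1.0 document (not the real ORF feed), and it swaps lxml for the standard library's ElementTree, whose findall() accepts a similar prefix mapping:

```python
import xml.etree.ElementTree as ET

# Hypothetical, shortened feed for illustration only.
feed = b"""<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns="http://purl.org/rss/1.0/">
  <item>
    <title>First story</title>
    <link>https://example.org/stories/1/</link>
  </item>
  <item>
    <title>Second story</title>
    <link>https://example.org/stories/2/</link>
  </item>
</rdf:RDF>"""

# The prefix 'rss' is arbitrary; only the URI has to match the document.
ns = {'rss': 'http://purl.org/rss/1.0/'}
root = ET.fromstring(feed)  # bytes input, so the encoding declaration is fine

titles = [el.text for el in root.findall('.//rss:item/rss:title', ns)]
links = [el.text for el in root.findall('.//rss:item/rss:link', ns)]
```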

XML parser that contains debug information

I'm looking for an XML parser for python that includes some debug information in each node, for instance the line number and column where the node began. Ideally, it would be a parser that is compatible with xml.etree.ElementTree.XMLParser, i.e., one that I can pass to xml.etree.ElementTree.parse.
I know these parsers don't actually produce the elements, so I'm not sure how this would really work, but it seems like such a useful thing, I'll be surprised if no-body has one. Syntax errors in the XML are one thing, but semantic errors in the resulting structure can be difficult to debug if you can't point to a specific location in the source file/string.
Point to an element by xpath (lxml - getpath)
lxml can report the XPath of an element within a document.
Given a test document:
>>> from lxml import etree
>>> xmlstr = """<root><rec id="01"><subrec>a</subrec><subrec>b</subrec></rec>
... <rec id="02"><para>graph</para></rec>
... </root>"""
...
>>> doc = etree.fromstring(xmlstr)
>>> doc
<Element root at 0x7f61040fd5f0>
We pick an element <para>graph</para>:
>>> para = doc.xpath("//para")[0]
>>> para
<Element para at 0x7f61040fd488>
An XPath is only meaningful relative to a clear context; in this case that is the root of the XML document:
>>> root = doc.getroottree()
>>> root
<lxml.etree._ElementTree at 0x7f610410f758>
And now we can ask, what xpath leads from the root to the element of our interest:
>>> root.getpath(para)
'/root/rec[2]/para'
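If line and column numbers are what you need, lxml elements also expose a sourceline attribute. As a standard-library-only sketch (it uses xml.sax rather than an ElementTree-compatible parser, and the PositionHandler class name is made up for this example), a SAX handler can record the position the parser reports for each start tag:

```python
import xml.sax


class PositionHandler(xml.sax.ContentHandler):
    """Hypothetical handler that records (tag, line, column) per start tag."""

    def __init__(self):
        super().__init__()
        self.locator = None
        self.positions = []

    def setDocumentLocator(self, locator):
        # The parser hands over a locator that tracks the current position.
        self.locator = locator

    def startElement(self, name, attrs):
        self.positions.append(
            (name, self.locator.getLineNumber(), self.locator.getColumnNumber())
        )


xml_bytes = b"""<root><rec id="01"><subrec>a</subrec><subrec>b</subrec></rec>
<rec id="02"><para>graph</para></rec>
</root>"""

handler = PositionHandler()
xml.sax.parseString(xml_bytes, handler)

for name, line, column in handler.positions:
    print(name, line, column)
```

The recorded positions could then be attached to whatever structure you build from the events, so that a semantic error can be reported with its source location.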

finding text into namespaced xml elements with lxml.etree

I am trying to use lxml.etree to parse an XML file and find text inside elements of the XML.
The XML files can look like this:
<?xml version="1.0" encoding="UTF-8"?>
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/
http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd">
<responseDate>2002-06-01T19:20:30Z</responseDate>
<request verb="ListRecords" from="1998-01-15"
set="physics:hep"
metadataPrefix="oai_rfc1807">
http://an.oa.org/OAI-script</request>
<ListRecords>
<record>
<header>
<identifier>oai:arXiv.org:hep-th/9901001</identifier>
<datestamp>1999-12-25</datestamp>
<setSpec>physics:hep</setSpec>
<setSpec>math</setSpec>
</header>
<metadata>
<rfc1807 xmlns=
"http://info.internet.isi.edu:80/in-notes/rfc/files/rfc1807.txt"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation=
"http://info.internet.isi.edu:80/in-notes/rfc/files/rfc1807.txt
http://www.openarchives.org/OAI/1.1/rfc1807.xsd">
<bib-version>v2</bib-version>
<id>hep-th/9901001</id>
<entry>January 1, 1999</entry>
<title>Investigations of Radioactivity</title>
<author>Ernest Rutherford</author>
<date>March 30, 1999</date>
</rfc1807>
</metadata>
<about>
<oai_dc:dc
xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/
http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
<dc:publisher>Los Alamos arXiv</dc:publisher>
<dc:rights>Metadata may be used without restrictions as long as
the oai identifier remains attached to it.</dc:rights>
</oai_dc:dc>
</about>
</record>
<record>
<header status="deleted">
<identifier>oai:arXiv.org:hep-th/9901007</identifier>
<datestamp>1999-12-21</datestamp>
</header>
</record>
</ListRecords>
</OAI-PMH>
For the following part we assume doc = etree.parse("/tmp/test.xml") where test.xml contains the XML pasted above.
First I try to find all the <record> elements using doc.findall(".//record") but it returns an empty list.
Secondly, for a given word I'd like to check if it is in the <dc:publisher>.
To achieve this I first tried the same as before: doc.findall(".//publisher"), but I have the same issue. I'm pretty sure all of this is linked with namespaces, but I don't know how to handle them.
I've read the libxml tutorial, and tried the example for findall method on a basic xml file (without any namespace) and it worked out.
As Chris has already mentioned, you can also use lxml and xpath. As xpath doesn't allow you to write the namespaced names in full like {http://www.openarchives.org/OAI/2.0/}record (so-called "James Clark notation" *), you will have to use prefixes, and provide the xpath engine with a prefix-to-namespace-uri mapping.
Example with lxml (assuming you already have the desired tree object):
nsmap = {'oa':'http://www.openarchives.org/OAI/2.0/',
'dc':'http://purl.org/dc/elements/1.1/'}
tree.xpath('//oa:record[descendant::dc:publisher[contains(., "Alamos")]]',
namespaces=nsmap)
This will select all {http://www.openarchives.org/OAI/2.0/}record elements that have a descendant element {http://purl.org/dc/elements/1.1/}publisher containing the word "Alamos".
[*] this comes from an article where James Clark explains XML Namespaces, everyone not familiar with namespaces should read this! (even if it was written a long time ago)
Disclaimer: I am using the standard library xml.etree.ElementTree module, not the lxml library (although this is a subset of lxml as far as I know). I'm sure there is an answer which is much simpler than mine which uses lxml and XPATH, but I don't know it.
Namespace issue
You were right to say that the problem is likely the namespaces. There is no record element in your XML file, but there are two {http://www.openarchives.org/OAI/2.0/}record tags in the file. As the following demonstrates:
>>> import xml.etree.ElementTree as etree
>>> xml_string = ...Your XML to parse...
>>> e = etree.fromstring(xml_string)
# Let's see what the root element is
>>> e
<Element {http://www.openarchives.org/OAI/2.0/}OAI-PMH at 7f39ebf54f80>
# Let's see what children there are of the root element
>>> for child in e:
...     print(child)
...
<Element {http://www.openarchives.org/OAI/2.0/}responseDate at 7f39ebf54fc8>
<Element {http://www.openarchives.org/OAI/2.0/}request at 7f39ebf58050>
<Element {http://www.openarchives.org/OAI/2.0/}ListRecords at 7f39ebf58098>
# Finally, let's get the children of the `ListRecords` element
>>> for child in e[-1]:
...     print(child)
...
<Element {http://www.openarchives.org/OAI/2.0/}record at 7f39ebf580e0>
<Element {http://www.openarchives.org/OAI/2.0/}record at 7f39ebf58908>
So, for example
>>> e.find('ListRecords')
returns None, whereas
>>> e.find('{http://www.openarchives.org/OAI/2.0/}ListRecords')
<Element {http://www.openarchives.org/OAI/2.0/}ListRecords at 7f39ebf58098>
returns the ListRecords element.
Note that I am using the find method since the standard library ElementTree does not have an xpath method.
Possible solution
One way to solve this is to get the namespace prefix and prepend it to the tag you are trying to find. You can use
>>> e.tag[:e.tag.index('}')+1]
'{http://www.openarchives.org/OAI/2.0/}'
on the root element, e, to find the namespace, although I'm sure there is a better way of doing this.
Now we can define functions to extract the tags we want with an optional namespace prefix:
def findallNS(element, tag, namespace=None):
    if namespace is not None:
        return element.findall(namespace + tag)
    else:
        return element.findall(tag)

def findNS(element, tag, namespace=None):
    if namespace is not None:
        return element.find(namespace + tag)
    else:
        return element.find(tag)
So now we can write:
>>> namespace = e.tag[:e.tag.index('}')+1]
>>> list_records = findNS(e, 'ListRecords', namespace)
>>> findallNS(list_records, 'record', namespace)
[<Element {http://www.openarchives.org/OAI/2.0/}record at 7f39ebf580e0>,
<Element {http://www.openarchives.org/OAI/2.0/}record at 7f39ebf58908>]
Alternative solution
Another solution may be to write a function to search for all child tags which end with the tag you are interested in, for example:
def find_child_tags(element, tag):
    return [child for child in element if child.tag.endswith(tag)]
Here you don't need to deal with the namespace at all.
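With the standard library you can also pass a prefix-to-URI mapping to find() and findall() instead of concatenating the namespace by hand. The sketch below uses a trimmed-down version of the document from the question; the prefixes oa and dc are arbitrary names chosen for this example:

```python
import xml.etree.ElementTree as ET

# Shortened version of the OAI-PMH document from the question.
xml_string = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <about>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:publisher>Los Alamos arXiv</dc:publisher>
        </oai_dc:dc>
      </about>
    </record>
    <record/>
  </ListRecords>
</OAI-PMH>"""

ns = {'oa': 'http://www.openarchives.org/OAI/2.0/',
      'dc': 'http://purl.org/dc/elements/1.1/'}
root = ET.fromstring(xml_string)

# All records, wherever they sit below the root.
records = root.findall('.//oa:record', ns)

# Keep only records with a dc:publisher whose text contains the word.
matching = [rec for rec in records
            if any('Alamos' in (pub.text or '')
                   for pub in rec.findall('.//dc:publisher', ns))]
```

Unlike the endswith() approach, this cannot accidentally match a tag from a different namespace that happens to share the same local name.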
@Chris's answer is very good and it will work with lxml too. Here is another way using lxml (it works the same way with xpath instead of find):
In [37]: xml.find('.//n:record', namespaces={'n': 'http://www.openarchives.org/OAI/2.0/'})
Out[37]: <Element {http://www.openarchives.org/OAI/2.0/}record at 0x2a451e0>

Python: ElementTree, get the namespace string of an Element

This XML file is named example.xml:
<?xml version="1.0"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
<modelVersion>14.0.0</modelVersion>
<groupId>.com.foobar.flubber</groupId>
<artifactId>uberportalconf</artifactId>
<version>13-SNAPSHOT</version>
<packaging>pom</packaging>
<name>Environment for UberPortalConf</name>
<description>This is the description</description>
<properties>
<birduberportal.version>11</birduberportal.version>
<promotiondevice.version>9</promotiondevice.version>
<foobarportal.version>6</foobarportal.version>
<eventuberdevice.version>2</eventuberdevice.version>
</properties>
<!-- A lot more here, but as it is irrelevant for the problem I have removed it -->
</project>
If I load example.xml and parse it with ElementTree I can see its namespace is http://maven.apache.org/POM/4.0.0.
>>> from xml.etree import ElementTree
>>> tree = ElementTree.parse('example.xml')
>>> print(tree.getroot())
<Element '{http://maven.apache.org/POM/4.0.0}project' at 0x26ee0f0>
I have not found a method to call to get just the namespace from an Element without resorting to parsing the str(an_element) of an Element. It seems like there has got to be a better way.
This is a perfect task for a regular expression.
import re
def namespace(element):
    m = re.match(r'\{.*\}', element.tag)
    return m.group(0) if m else ''
The namespace should be in Element.tag right before the "actual" tag:
>>> root = tree.getroot()
>>> root.tag
'{http://maven.apache.org/POM/4.0.0}project'
To know more about namespaces, take a look at ElementTree: Working with Namespaces and Qualified Names.
I am not sure if this is possible with xml.etree, but here is how you could do it with lxml.etree:
>>> from lxml import etree
>>> tree = etree.parse('example.xml')
>>> tree.xpath('namespace-uri(.)')
'http://maven.apache.org/POM/4.0.0'
Without using regular expressions:
>>> root
<Element '{http://www.google.com/schemas/sitemap/0.84}urlset' at 0x2f7cc10>
>>> root.tag.split('}')[0].strip('{')
'http://www.google.com/schemas/sitemap/0.84'
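The string handling above can be wrapped in a small helper. This is only a sketch; split_qname is a hypothetical name, not an ElementTree function:

```python
def split_qname(tag):
    """Split an ElementTree tag of the form '{uri}local' into (uri, local).

    Tags without a namespace come back with an empty URI.
    """
    if tag.startswith('{'):
        uri, _, local = tag[1:].partition('}')
        return uri, local
    return '', tag


uri, local = split_qname('{http://maven.apache.org/POM/4.0.0}project')
print(uri)    # http://maven.apache.org/POM/4.0.0
print(local)  # project
```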
The lxml.etree library's elements have a dictionary called nsmap, which shows all the namespaces that are in use in the current tag scope.
>>> item = next(tree.getroot().iter())
>>> item.nsmap
{'md': 'urn:oasis:names:tc:SAML:2.0:metadata'}
I think it will be easier to take a look at the attributes:
>>> root.attrib
{'{http://www.w3.org/2001/XMLSchema-instance}schemaLocation':
'http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd'}
The short answer is:
next(uri for uri, prefix in ElementTree._namespace_map.items() if prefix == '')
but only if you have been calling
ElementTree.register_namespace(prefix,uri)
in response to every event=="start-ns" received while iterating through the result of
ET.iterparse(...)
and you registered for "start-ns"
To answer the question "what is the default namespace?", it is necessary to clarify two points:
(1) XML specifications say that the default namespace is not necessarily global throughout the tree, rather the default namespace can be re-declared at any element under root, and inherits downwards until meeting another default namespace re-declaration.
(2) The ElementTree module can (de facto) handle XML-like documents which have no root default namespace, -if- they have no namespace use anywhere in the document. (* there may be less strict conditions, e.g., that is "if" and not necessarily "iff").
It's probably also worth considering "what do you want it for?" Consider that XML files can be semantically equivalent, but syntactically very different. E.g., the following three files are semantically equivalent, but A.xml has one default namespace declaration, B.xml has three, and C.xml has none.
A.xml:
<a xmlns="http://A" xmlns:nsB0="http://B0" xmlns:nsB1="http://B1">
<nsB0:b/>
<nsB1:b/>
</a>
B.xml:
<a xmlns="http://A">
<b xmlns="http://B0"/>
<b xmlns="http://B1"/>
</a>
C.xml:
<{http://A}a>
<{http://B0}b/>
<{http://B1}b/>
</a>
The file C.xml is the canonical expanded syntactical representation presented to the ElementTree search functions.
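To see this equivalence in practice, the following sketch parses A.xml and B.xml (given inline as strings) and compares the expanded tag names ElementTree produces:

```python
import xml.etree.ElementTree as ET

a_xml = """<a xmlns="http://A" xmlns:nsB0="http://B0" xmlns:nsB1="http://B1">
  <nsB0:b/>
  <nsB1:b/>
</a>"""

b_xml = """<a xmlns="http://A">
  <b xmlns="http://B0"/>
  <b xmlns="http://B1"/>
</a>"""

# After parsing, prefixes and default-namespace declarations are gone;
# only the expanded '{uri}local' names remain.
a_tags = [el.tag for el in ET.fromstring(a_xml).iter()]
b_tags = [el.tag for el in ET.fromstring(b_xml).iter()]

print(a_tags)
print(a_tags == b_tags)
```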
If you are certain a priori that there will be no namespace collisions, you can modify the element tags while parsing as discussed here: Python ElementTree module: How to ignore the namespace of XML files to locate matching element when using the method "find", "findall"
Combining some of the answers above, I think the shortest code is
theroot = tree.getroot()
theroot.attrib[list(theroot.keys())[0]]
Here is my solution with ElementTree on Python 3.9+:
import xml.etree.ElementTree as ET

def get_element_namespaces(filename, element):
    # Namespace declarations seen since the last start tag
    namespace = []
    for key, value in ET.iterparse(filename, events=['start', 'start-ns']):
        print(key, value)
        if key == 'start-ns':
            namespace.append(value)
        else:
            if ET.tostring(element) == ET.tostring(value):
                return namespace
            namespace = []
    return namespace
This would return a list of (prefix, URL) tuples like this:
[('android', 'http://schemas.android.com/apk/res/android'), ('tools', 'http://schemas.android.com/tools')]
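If you only need every namespace declaration in a document, regardless of which element carries it, a shorter sketch is to listen for start-ns events alone. The manifest below is a made-up example:

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical document used only for illustration.
xml_bytes = (
    b'<manifest xmlns:android="http://schemas.android.com/apk/res/android" '
    b'xmlns:tools="http://schemas.android.com/tools">'
    b'<application/>'
    b'</manifest>'
)

# Each start-ns event carries a (prefix, uri) tuple.
declared = [ns for event, ns in
            ET.iterparse(io.BytesIO(xml_bytes), events=['start-ns'])]

prefix_to_uri = dict(declared)
print(prefix_to_uri)
```

The resulting dictionary can be passed straight to find() or findall() as the namespaces argument.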

Adding attributes to existing elements, removing elements, etc with lxml

I parse in the XML using
from lxml import etree
tree = etree.parse('test.xml', etree.XMLParser())
Now I want to work on the parsed XML. I'm having trouble removing elements with namespaces or just elements in general such as
<rdf:description><dc:title>Example</dc:title></rdf:description>
and I want to remove that entire element as well as everything within the tags. I also want to add attributes to existing elements as well. The methods I need are in the Element class but I have no idea how to use that with the ElementTree object here. Any pointers would definitely be appreciated, thanks
You can get to the root element via this call: root=tree.getroot()
Using that root element, you can use findall() and remove elements that match your criteria:
deleteThese = root.findall("title")
for element in deleteThese: root.remove(element)
Finally, you can see what your new tree looks like with this: etree.tostring(root, pretty_print=True)
Here is some info about how find/findall work:
http://infohost.nmt.edu/tcc/help/pubs/pylxml/class-ElementTree.html#ElementTree-find
To add an attribute to an element, try something like this:
root.attrib['myNewAttribute']='hello world'
The remove method should do what you want:
>>> from lxml import etree
>>> from io import StringIO
>>> s = '<Root><Description><Title>foo</Title></Description></Root>'
>>> tree = etree.parse(StringIO(s))
>>> print(etree.tostring(tree.getroot(), encoding='unicode'))
<Root><Description><Title>foo</Title></Description></Root>
>>> title = tree.find('//Title')
>>> title.getparent().remove(title)
>>> etree.tostring(tree.getroot(), encoding='unicode')
'<Root><Description/></Root>'
>>> print(etree.tostring(tree.getroot(), encoding='unicode'))
<Root><Description/></Root>
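For the namespaced case from the question, here is a sketch using the standard library, which has no getparent(); it builds a child-to-parent map instead. The sample document and the parent_of name are invented for this example:

```python
import xml.etree.ElementTree as ET

RDF = 'http://www.w3.org/1999/02/22-rdf-syntax-ns#'
DC = 'http://purl.org/dc/elements/1.1/'

# Hypothetical document containing one element to remove and one to keep.
xml_str = (
    '<root xmlns:rdf="%s" xmlns:dc="%s">'
    '<rdf:Description><dc:title>Example</dc:title></rdf:Description>'
    '<keep/>'
    '</root>' % (RDF, DC)
)

root = ET.fromstring(xml_str)
ns = {'rdf': RDF, 'dc': DC}

# Map every child to its parent so an element can be removed from it.
parent_of = {child: parent for parent in root.iter() for child in parent}

# Removing an element also removes everything inside it.
for desc in root.findall('.//rdf:Description', ns):
    parent_of[desc].remove(desc)

# Add an attribute to an existing element.
root.set('myNewAttribute', 'hello world')

remaining = [child.tag for child in root]
```

With lxml the parent map is unnecessary, since each element offers getparent() as shown in the transcript above.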
