XML parser that contains debug information - python

I'm looking for an XML parser for Python that includes some debug information in each node, for instance the line number and column where the node began. Ideally, it would be a parser compatible with xml.etree.ElementTree.XMLParser, i.e., one that I can pass to xml.etree.ElementTree.parse.
I know these parsers don't actually produce the elements, so I'm not sure how this would really work, but it seems like such a useful thing that I'll be surprised if nobody has one. Syntax errors in the XML are one thing, but semantic errors in the resulting structure can be difficult to debug if you can't point to a specific location in the source file/string.

Point to an element by xpath (lxml - getpath)
lxml can give you the XPath that locates an element within a document.
Given a test document:
>>> from lxml import etree
>>> xmlstr = """<root><rec id="01"><subrec>a</subrec><subrec>b</subrec></rec>
... <rec id="02"><para>graph</para></rec>
... </root>"""
>>> doc = etree.fromstring(xmlstr)
>>> doc
<Element root at 0x7f61040fd5f0>
We pick an element <para>graph</para>:
>>> para = doc.xpath("//para")[0]
>>> para
<Element para at 0x7f61040fd488>
An XPath only has meaning relative to a clear context; in this case the context is the root of the XML document:
>>> root = doc.getroottree()
>>> root
<lxml.etree._ElementTree at 0x7f610410f758>
Now we can ask which XPath leads from the root to the element of interest:
>>> root.getpath(para)
'/root/rec[2]/para'
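Related to the debug information asked for in the question: lxml elements also expose a sourceline attribute holding the line number the parser recorded for the element (column information is not available), so it can be combined with getpath() to point at a concrete location. A minimal sketch, reusing doc and root from the session above and treating the rec with id="02" as a hypothetical semantic error:
# report a problematic element both as an XPath and as a source line
for rec in doc.xpath("//rec"):
    if rec.get("id") == "02":  # stand-in for a real semantic check
        print("bad <rec> at %s (line %s)" % (root.getpath(rec), rec.sourceline))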


XPath text() does not get the text of a link node

from lxml import etree
import requests
htmlparser = etree.HTMLParser()
f = requests.get('https://rss.orf.at/news.xml')
# without the '\ufeff' this would fail because it tells me: "ValueError: Unicode strings with encoding declaration are not supported. Please use bytes input or XML fragments without declaration."
tree = etree.fromstring('\ufeff'+f.text, htmlparser)
print(tree.xpath('//item/title/text()'))  # <- this does produce a list of titles
print(tree.xpath('//item/link/text()'))   # <- this does NOT produce a list of links. Why?!
Okay, this is a bit of a mystery to me, and maybe I'm just overlooking the simplest thing, but the XPath '//item/link/text()' only produces an empty list, while '//item/title/text()' works exactly as expected. Does the <link> node hold any special purpose? I can select all of them with '//item/link', I just can't get the text() selector to work on them.
You're using etree.HTMLParser to parse an XML document. I suspect this was an attempt to deal with XML namespacing, but I think it's probably the wrong solution. It's possible that treating the XML document as HTML is ultimately the source of your problem.
If we use the XML parser instead, everything pretty much works as expected.
First, if we look at the root element, we see that it sets a default namespace:
<rdf:RDF
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
xmlns:orfon="http://rss.orf.at/1.0/"
xmlns="http://purl.org/rss/1.0/"
>
That means when we see an item element in the document, it's actually an "item in the http://purl.org/rss/1.0/ namespace" element. We need to provide that namespace information in our XPath queries by passing in a namespaces dictionary and using a namespace prefix on the element names, like this:
>>> tree.xpath('//rss:item', namespaces={'rss': 'http://purl.org/rss/1.0/'})
[<Element {http://purl.org/rss/1.0/}item at 0x7f0497000e80>, ...]
Your first XPath expression (//item/title/text()) becomes:
>>> tree.xpath('//rss:item/rss:title/text()', namespaces={'rss': 'http://purl.org/rss/1.0/'})
['Amnesty dokumentiert Kriegsverbrechen', ..., 'Moskauer Börse startet abgeschirmten Handel']
And your second XPath expression (//item/link/text()) becomes:
>>> tree.xpath('//rss:item/rss:link/text()', namespaces={'rss': 'http://purl.org/rss/1.0/'})
['https://orf.at/stories/3255477/', ..., 'https://orf.at/stories/3255384/']
This makes the code look like:
from lxml import etree
import requests
f = requests.get('https://rss.orf.at/news.xml')
tree = etree.fromstring(f.content)
print(tree.xpath('//rss:item/rss:title/text()', namespaces={'rss': 'http://purl.org/rss/1.0/'}))
print(tree.xpath('//rss:item/rss:link/text()', namespaces={'rss': 'http://purl.org/rss/1.0/'}))
Note that by using f.content (which is a byte string) instead of f.text (a unicode string), we avoid the whole unicode parsing error.

Python minidom: How to access an element

I'm working on parsing an XML file in Python. The XML has a structure like this:
<layer1>
<layer2>
<element>
<info1></info1>
</element>
<element>
<info1></info1>
</element>
<element>
<info1></info1>
</element>
</layer2>
</layer1>
Without layer2, I have no problem accessing the data in info1; there I can address info1 with root.firstChild.childNodes[0].childNodes[0].data. But with layer2, I'm really in trouble.
So my thought was that I could do it similarly, like this: root.firstChild.firstChild.childNodes[0].childNodes[0].data
########## Solution
So this is how I solved my problem:
from xml.etree import ElementTree as ET

tree = ET.parse("test.xml")
root = tree.getroot()
for elem in root.findall('./layer2/'):
    for node in elem.findall('element/'):
        x = node.find('info1').text
        if x != "abc":
            elem.remove(node)
Don't use the minidom API if you can help it. Use the ElementTree API instead; the xml.dom.minidom documentation explicitly states that:
Users who are not already proficient with the DOM should consider using the xml.etree.ElementTree module for their XML processing instead.
Here is a short sample that uses the ElementTree API to access your elements:
from xml.etree import ElementTree as ET
tree = ET.parse('inputfile.xml')
for info in tree.findall('.//element/info1'):
    print(info.text)
This uses an XPath expression to list all info1 elements that are contained inside an element element, regardless of their position in the overall XML document.
If all you need is the first info1 element, use .find():
print(tree.find('.//info1').text)
With the DOM API, .firstChild could easily be a Text node instead of an Element node; you always need to loop over the .childNodes sequence to find the first Element match:
def findFirstElement(node):
    for child in node.childNodes:
        if child.nodeType == node.ELEMENT_NODE:
            return child
but for your case, perhaps using .getElementsByTagName() suffices:
root.getElementsByTagName('info1')[0].firstChild.data
Does this work? (I'm not amazing at Python, just a quick thought.)
name[0].firstChild.nodeValue
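Presumably this refers to minidom's getElementsByTagName(); a minimal sketch of that approach, assuming the document from the question is stored in test.xml:
from xml.dom import minidom

dom = minidom.parse("test.xml")
# getElementsByTagName() searches the whole document, so the extra layer2
# level does not get in the way
name = dom.getElementsByTagName("info1")
first = name[0]
# the stripped-down sample has empty <info1> elements, hence the guard
print(first.firstChild.nodeValue if first.firstChild is not None else None)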

finding text into namespaced xml elements with lxml.etree

I'm trying to use lxml.etree to parse an XML file and find text inside elements of the XML.
The XML files can look like this:
<?xml version="1.0" encoding="UTF-8"?>
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/
http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd">
<responseDate>2002-06-01T19:20:30Z</responseDate>
<request verb="ListRecords" from="1998-01-15"
set="physics:hep"
metadataPrefix="oai_rfc1807">
http://an.oa.org/OAI-script</request>
<ListRecords>
<record>
<header>
<identifier>oai:arXiv.org:hep-th/9901001</identifier>
<datestamp>1999-12-25</datestamp>
<setSpec>physics:hep</setSpec>
<setSpec>math</setSpec>
</header>
<metadata>
<rfc1807 xmlns=
"http://info.internet.isi.edu:80/in-notes/rfc/files/rfc1807.txt"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation=
"http://info.internet.isi.edu:80/in-notes/rfc/files/rfc1807.txt
http://www.openarchives.org/OAI/1.1/rfc1807.xsd">
<bib-version>v2</bib-version>
<id>hep-th/9901001</id>
<entry>January 1, 1999</entry>
<title>Investigations of Radioactivity</title>
<author>Ernest Rutherford</author>
<date>March 30, 1999</date>
</rfc1807>
</metadata>
<about>
<oai_dc:dc
xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/
http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
<dc:publisher>Los Alamos arXiv</dc:publisher>
<dc:rights>Metadata may be used without restrictions as long as
the oai identifier remains attached to it.</dc:rights>
</oai_dc:dc>
</about>
</record>
<record>
<header status="deleted">
<identifier>oai:arXiv.org:hep-th/9901007</identifier>
<datestamp>1999-12-21</datestamp>
</header>
</record>
</ListRecords>
</OAI-PMH>
For the following we assume doc = etree.parse("/tmp/test.xml"), where test.xml contains the XML pasted above.
First I try to find all the <record> elements using doc.findall(".//record"), but it returns an empty list.
Secondly, for a given word I'd like to check whether it is in the <dc:publisher> element.
To achieve this I first try the same approach as before, doc.findall(".//publisher"), but I have the same issue... I'm pretty sure all of this is linked to namespaces, but I don't know how to handle them.
I've read the libxml tutorial and tried the example for the findall method on a basic XML file (without any namespaces), and it worked.
As Chris has already mentioned, you can also use lxml and XPath. As XPath doesn't allow you to write the namespaced names in full like {http://www.openarchives.org/OAI/2.0/}record (so-called "James Clark notation" *), you will have to use prefixes and provide the XPath engine with a prefix-to-namespace-URI mapping.
Example with lxml (assuming you already have the desired tree object):
nsmap = {'oa': 'http://www.openarchives.org/OAI/2.0/',
         'dc': 'http://purl.org/dc/elements/1.1/'}
tree.xpath('//oa:record[descendant::dc:publisher[contains(., "Alamos")]]',
           namespaces=nsmap)
This will select all {http://www.openarchives.org/OAI/2.0/}record elements that have a descendant element {http://purl.org/dc/elements/1.1/}publisher containing the word "Alamos".
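As a follow-up for the question's second goal (checking whether a given word occurs in <dc:publisher>), a small sketch reusing the same tree and nsmap; the word variable is only an illustration:
word = "Alamos"  # hypothetical search term
for publisher in tree.xpath('//dc:publisher/text()', namespaces=nsmap):
    if word in publisher:
        print("found %r in publisher: %s" % (word, publisher))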
[*] This comes from an article where James Clark explains XML Namespaces; everyone not familiar with namespaces should read it! (Even though it was written a long time ago.)
Disclaimer: I am using the standard library xml.etree.ElementTree module, not the lxml library (although this is a subset of lxml as far as I know). I'm sure there is an answer much simpler than mine that uses lxml and XPath, but I don't know it.
Namespace issue
You were right to say that the problem is likely the namespaces. There is no record element in your XML file, but there are two {http://www.openarchives.org/OAI/2.0/}record elements, as the following demonstrates:
>>> import xml.etree.ElementTree as etree
>>> xml_string = ...Your XML to parse...
>>> e = etree.fromstring(xml_string)
# Let's see what the root element is
>>> e
<Element {http://www.openarchives.org/OAI/2.0/}OAI-PMH at 7f39ebf54f80>
# Let's see what children there are of the root element
>>> for child in e:
...     print(child)
...
<Element {http://www.openarchives.org/OAI/2.0/}responseDate at 7f39ebf54fc8>
<Element {http://www.openarchives.org/OAI/2.0/}request at 7f39ebf58050>
<Element {http://www.openarchives.org/OAI/2.0/}ListRecords at 7f39ebf58098>
# Finally, let's get the children of the `ListRecords` element
>>> for child in e[-1]:
...     print(child)
...
<Element {http://www.openarchives.org/OAI/2.0/}record at 7f39ebf580e0>
<Element {http://www.openarchives.org/OAI/2.0/}record at 7f39ebf58908>
So, for example
>>> e.find('ListRecords')
returns None, whereas
>>> e.find('{http://www.openarchives.org/OAI/2.0/}ListRecords')
<Element {http://www.openarchives.org/OAI/2.0/}ListRecords at 7f39ebf58098>
returns the ListRecords element.
Note that I am using the find method since the standard library ElementTree does not have an xpath method.
Possible solution
One way to solve this is to get the namespace prefix and prepend it to the tag you are trying to find. You can use
>>> e.tag[:e.tag.index('}')+1]
'{http://www.openarchives.org/OAI/2.0/}'
on the root element, e, to find the namespace, although I'm sure there is a better way of doing this.
Now we can define functions to extract the tags we want, with an optional namespace prefix:
def findallNS(element, tag, namespace=None):
    if namespace is not None:
        return element.findall(namespace + tag)
    else:
        return element.findall(tag)

def findNS(element, tag, namespace=None):
    if namespace is not None:
        return element.find(namespace + tag)
    else:
        return element.find(tag)
So now we can write:
>>> namespace = '{http://www.openarchives.org/OAI/2.0/}'
>>> list_records = findNS(e, 'ListRecords', namespace)
>>> findallNS(list_records, 'record', namespace)
[<Element {http://www.openarchives.org/OAI/2.0/}record at 7f39ebf580e0>,
 <Element {http://www.openarchives.org/OAI/2.0/}record at 7f39ebf58908>]
Alternative solution
Another solution may be to write a function that searches for all tags ending with the tag you are interested in, for example:
def find_child_tags(element, tag):
    return [child for child in element if child.tag.endswith(tag)]
Here you don't need to deal with the namespace at all.
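For instance, a quick usage sketch, assuming e is the parsed root element from the session above:
# direct children only: first drill down to ListRecords, then to its records
list_records = find_child_tags(e, 'ListRecords')[0]
records = find_child_tags(list_records, 'record')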
Chris's answer is very good and it will work with lxml too. Here is another way using lxml (it works the same way with xpath instead of find):
In [37]: xml.find('.//n:record', namespaces={'n': 'http://www.openarchives.org/OAI/2.0/'})
Out[37]: <Element {http://www.openarchives.org/OAI/2.0/}record at 0x2a451e0>

Python XML question

I have an XML document as a str. Now, in the XSD, <foo> is unbounded, and while most of the time there is only one, there COULD be more. I'm trying to use ElementTree, but I'm running into an issue:
>>> from xml.etree.ElementTree import fromstring
>>>
>>> xml_str = """<?xml version="1.0"?>
... <foo>
... <bar>
... <baz>Spam</baz>
... <qux>Eggs</qux>
... </bar>
... </foo>"""
>>> # Try to get the document
>>> el = fromstring(xml_str)
>>> el.findall('foo')
[]
>>> el.findall('bar')
[<Element 'bar' at 0x1004acb90>]
Clearly, I need to loop through the <foo>s, but because <foo> is at the root, I can't. Obviously, I could create an element called <root> and put el inside of it, but is there a more correct way of doing this?
Each XML document is supposed to have exactly one root element. You will need to adjust your XML if you want to support multiple foo elements.
Alas, wrapping the element in an ElementTree with tree = ElementTree(el) and trying tree.findall('//foo') doesn't seem to work either (it seems you can only search "beneath" an element, and even if the search is done from the full tree, it searches "beneath" the root). As ElementTree doesn't claim to really implement XPath, it's difficult to say whether this is intended or a bug.
Solution: without using lxml with full XPath support (el.xpath('//foo'), for example), the easiest solution would be to use the Element.iter() method:
for foo in el.iter(tag='foo'):
    print(foo)
or if you want the results in a list:
list(el.iter(tag='foo'))
Note that you can't use complex paths this way; you can only find all elements with a certain tag name, starting from (and including) the element itself.
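Applied to the document from the question (with xml_str being the string defined there), a minimal sketch could look like this:
from xml.etree.ElementTree import fromstring

el = fromstring(xml_str)
# iter() starts from (and includes) el itself, so the root <foo> is covered,
# and any additional <foo> elements would be picked up as well
for foo in el.iter(tag='foo'):
    for bar in foo.findall('bar'):
        print(bar.find('baz').text, bar.find('qux').text)  # -> Spam Eggs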

Adding attributes to existing elements, removing elements, etc with lxml

I parse in the XML using
from lxml import etree
tree = etree.parse('test.xml', etree.XMLParser())
Now I want to work on the parsed XML. I'm having trouble removing elements with namespaces, or just elements in general, such as
<rdf:description><dc:title>Example</dc:title></rdf:description>
and I want to remove that entire element as well as everything within the tags. I also want to add attributes to existing elements. The methods I need are in the Element class, but I have no idea how to use them with the ElementTree object here. Any pointers would definitely be appreciated, thanks.
You can get to the root element via this call: root=tree.getroot()
Using that root element, you can use findall() and remove elements that match your criteria:
deleteThese = root.findall("title")
for element in deleteThese:
    root.remove(element)
Finally, you can see what your new tree looks like with this: etree.tostring(root, pretty_print=True)
Here is some info about how find/findall work:
http://infohost.nmt.edu/tcc/help/pubs/pylxml/class-ElementTree.html#ElementTree-find
To add an attribute to an element, try something like this:
root.attrib['myNewAttribute'] = 'hello world'
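Since the elements in the question are namespaced (rdf:, dc:), findall will usually need a namespace mapping as well. A sketch, assuming the usual RDF and Dublin Core namespace URIs; the actual URIs must match the declarations in your document:
nsmap = {'rdf': 'http://www.w3.org/1999/02/22-rdf-syntax-ns#',
         'dc': 'http://purl.org/dc/elements/1.1/'}

# find every rdf:description element anywhere in the tree and detach it,
# together with everything inside it
for desc in root.findall('.//rdf:description', namespaces=nsmap):
    desc.getparent().remove(desc)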
The remove method should do what you want:
>>> from lxml import etree
>>> from io import StringIO
>>> s = '<Root><Description><Title>foo</Title></Description></Root>'
>>> tree = etree.parse(StringIO(s))
>>> print(etree.tostring(tree.getroot()).decode())
<Root><Description><Title>foo</Title></Description></Root>
>>> title = tree.find('.//Title')
>>> title.getparent().remove(title)
>>> etree.tostring(tree.getroot())
b'<Root><Description/></Root>'
>>> print(etree.tostring(tree.getroot()).decode())
<Root><Description/></Root>
