Parse XML where an element prefix is defined in the same element

I have an XML file with an element which looks like this:
<wrapping_element>
<prefix:tag xmlns:prefix="url">value</prefix:tag>
</wrapping_element>
I want to get this element, so I am using lxml as follows:
wrapping_element.find('prefix:tag', wrapping_element.nsmap)
but I get the following error: SyntaxError: prefix 'prefix' not found in prefix map. This happens because the prefix is declared on the element itself, not on wrapping_element, so it is missing from wrapping_element.nsmap.
Is there a way to get the element anyway?

As mentioned in the comments, you could use local-name() to circumvent the namespace, but it's easy enough to handle the namespace directly in the xpath() call:
from lxml import etree
tree = etree.parse("input.xml")
wrapping_element = tree.xpath("/wrapping_element")[0]
tag = wrapping_element.xpath("x:tag", namespaces={"x": "url"})[0]
print(etree.tostring(tag, encoding="unicode"))
This will print...
<prefix:tag xmlns:prefix="url">value</prefix:tag>
Notice I used the prefix x. The prefix can match the prefix in the XML file, but it doesn't have to; only the namespace URIs need to match exactly.
See here for more details: http://lxml.de/xpathxslt.html#namespaces-and-prefixes
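The same idea also works with find(), the call used in the question. The following is a minimal sketch, assuming the stdlib xml.etree.ElementTree module (lxml.etree exposes the same find() API); the prefix x is an arbitrary choice, and only the namespace URI "url" comes from the question:

```python
import xml.etree.ElementTree as ET

xml_text = """<wrapping_element>
  <prefix:tag xmlns:prefix="url">value</prefix:tag>
</wrapping_element>"""

wrapping_element = ET.fromstring(xml_text)

# Option 1: supply our own prefix-to-URI mapping; the prefix "x" is arbitrary.
by_prefix = wrapping_element.find("x:tag", {"x": "url"})

# Option 2: spell out the namespace URI in braces, so no mapping is needed.
by_uri = wrapping_element.find("{url}tag")

print(by_prefix.text, by_uri.text)
```

Either way, the lookup no longer depends on where in the document the prefix happens to be declared.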

Related

XPath text() does not get the text of a link node

from lxml import etree
import requests
htmlparser = etree.HTMLParser()
f = requests.get('https://rss.orf.at/news.xml')
# Without the '\ufeff' prefix this fails with: "ValueError: Unicode strings with encoding declaration are not supported. Please use bytes input or XML fragments without declaration."
tree = etree.fromstring('\ufeff'+f.text, htmlparser)
print(tree.xpath('//item/title/text()'))  # <- this does produce a list of titles
print(tree.xpath('//item/link/text()'))  # <- this does NOT produce a list of links. Why?
This is a bit of a mystery to me, and maybe I'm overlooking the simplest thing, but the XPath '//item/link/text()' produces only an empty list, while '//item/title/text()' works exactly as expected. Does the <link> node serve any special purpose? I can select all of them with '//item/link'; I just can't get the text() selector to work on them.

You're using etree.HTMLParser to parse an XML document. I suspect this was an attempt to sidestep XML namespaces, but it's probably the wrong solution. Treating the XML document as HTML is most likely the source of your problem: in HTML, <link> is a void element that cannot have content, so an HTML parser will not keep the URL as the element's text.
If we use the XML parser instead, everything pretty much works as expected.
First, if we look at the root element, we see that it sets a default namespace:
<rdf:RDF
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
xmlns:orfon="http://rss.orf.at/1.0/"
xmlns="http://purl.org/rss/1.0/"
>
That means when we see an item element in the document, it's actually an "item in the http://purl.org/rss/1.0/ namespace" element. We need to provide that namespace information in our xpath queries by passing in a namespaces dictionary and using a namespace prefix on the element names, like this:
>>> tree.xpath('//rss:item', namespaces={'rss': 'http://purl.org/rss/1.0/'})
[<Element {http://purl.org/rss/1.0/}item at 0x7f0497000e80>, ...]
Your first xpath expression (//item/title/text()) becomes:
>>> tree.xpath('//rss:item/rss:title/text()', namespaces={'rss': 'http://purl.org/rss/1.0/'})
['Amnesty dokumentiert Kriegsverbrechen', ..., 'Moskauer Börse startet abgeschirmten Handel']
And your second xpath expression (//item/link/text()) becomes:
>>> tree.xpath('//rss:item/rss:link/text()', namespaces={'rss': 'http://purl.org/rss/1.0/'})
['https://orf.at/stories/3255477/', ..., 'https://orf.at/stories/3255384/']
This makes the code look like:
from lxml import etree
import requests
f = requests.get('https://rss.orf.at/news.xml')
tree = etree.fromstring(f.content)
print(tree.xpath('//rss:item/rss:title/text()', namespaces={'rss': 'http://purl.org/rss/1.0/'}))
print(tree.xpath('//rss:item/rss:link/text()', namespaces={'rss': 'http://purl.org/rss/1.0/'}))
Note that by using f.content (which is a byte string) instead of f.text (a unicode string), we avoid the whole unicode parsing error.
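To try the namespace mapping without a network request, here is a small sketch on an inline document. The feed content (titles and example.org URLs) is made up for illustration, and it uses the stdlib xml.etree.ElementTree with findall() rather than lxml's xpath(); only the namespace URIs are taken from the feed's root element shown above:

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down feed; titles and links are invented placeholders.
feed = b"""<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns="http://purl.org/rss/1.0/">
  <item>
    <title>First story</title>
    <link>https://example.org/stories/1/</link>
  </item>
  <item>
    <title>Second story</title>
    <link>https://example.org/stories/2/</link>
  </item>
</rdf:RDF>"""

ns = {"rss": "http://purl.org/rss/1.0/"}
root = ET.fromstring(feed)  # bytes input, so the encoding declaration is fine

titles = [e.text for e in root.findall(".//rss:item/rss:title", ns)]
links = [e.text for e in root.findall(".//rss:item/rss:link", ns)]

# An unqualified name refers to "item in no namespace" and matches nothing.
unqualified = root.findall(".//item")

print(titles, links, unqualified)
```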

Access XML element by name

I'm using xml.etree.ElementTree to parse my XML data. I'm trying to get the text value of <Name>.
This is my code:
for Content in Zone[0]:
    print(Content.find('Name').text)
It is returning None.
However, I am able to access the element using
for Content in Zone[0]:
    print(Content[12].text)
I think I might have found the problem: when I print the tags out, I see {http://schemas.datacontract.org/2004/07/}Name instead of Name. What is the extra data in front of the tag name?

Your XML likely has a default namespace (a namespace declared with no prefix). Note that descendant elements without a prefix implicitly inherit the default namespace. You can handle a default namespace the same way you would handle a prefixed one: map a prefix to the namespace URI, and use that prefix with the element name to reference the element in that namespace:
namespaces = {'d': 'http://schemas.datacontract.org/2004/07/'}
for Content in Zone[0]:
    print(Content.find('d:Name', namespaces).text)
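A self-contained sketch of the same behaviour follows. The document structure and the names Lobby and Garage are hypothetical; only the namespace URI comes from the question:

```python
import xml.etree.ElementTree as ET

# Hypothetical document: only the namespace URI comes from the question.
doc = """<Zone xmlns="http://schemas.datacontract.org/2004/07/">
  <Content><Name>Lobby</Name></Content>
  <Content><Name>Garage</Name></Content>
</Zone>"""

zone = ET.fromstring(doc)
namespaces = {"d": "http://schemas.datacontract.org/2004/07/"}

first = zone[0]
print(first[0].tag)        # the tag carries the namespace URI in braces
print(first.find("Name"))  # None: an unqualified name does not match

# With the prefix mapped to the URI, the lookup succeeds.
names = [content.find("d:Name", namespaces).text for content in zone]
print(names)
```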

Parse xml node children list by tag with any prefix in python

I would like to get a list of items, regardless of their prefixes.
My goal is to create a function (please let me know if something like this already exists) that takes one argument (tagname) and returns a list of elements.
For example, given the argument 'item', both <media:item> and <abc:item> should be part of the result.
It would be nice to use lxml, but any Python DOM-based parser is fine.
Unfortunately, I can't assume that the XML declares its xmlns prefixes, which is why I need to match any prefix.

lxml is a good option, primarily because it fully supports XPath 1.0 via the xpath() method, besides many other useful utilities. In XPath, you can ignore an element's namespace by using local-name(), as mentioned in the comments.
lxml can also deal with undefined prefixes if you set the parser parameter recover=True, but here comes the catch: local-name() still returns the prefixed name (prefix:tagname) for elements with an undefined prefix. There is a hacky way to match such elements: find elements whose local name contains :tagname or, to be more precise, ends with :tagname.
The following is a working demo. It combines two expressions with the logical operator or: one for elements with an undefined prefix, and one for elements with no prefix or a properly defined prefix:
from lxml import etree
xml = """<root foo="bar">
<media:item>a</media:item>
<abc:item>b</abc:item>
<foo:item>c</foo:item>
<item>d</item>
</root>"""
parser = etree.XMLParser(recover=True)
tree = etree.fromstring(xml, parser=parser)
tagname = "item"
# expression to match elements with an undefined prefix
predicate1 = "contains(local-name(),':{0}')".format(tagname)
# expression to match elements with a properly defined prefix or with no prefix
predicate2 = "local-name()='{0}'".format(tagname)
elements = tree.xpath("//*[{0} or {1}]".format(predicate1, predicate2))
for e in elements:
    print(etree.tostring(e))
Output:
<media:item>a</media:item>
<abc:item>b</abc:item>
<foo:item>c</foo:item>
<item>d</item>
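An alternative that avoids XPath string matching is to compare local names in Python. The sketch below is one possible shape for the function the question asks for; local_name and elements_by_local_name are hypothetical helper names, the urn:example:* namespace URIs are invented, and the sample uses the stdlib parser, which requires the prefixes to be declared. The helper also strips a literal "prefix:" so that it could be applied to a tree produced by a recovering parser as described above:

```python
import xml.etree.ElementTree as ET

def local_name(tag):
    """Strip a '{uri}' namespace part or a literal 'prefix:' from a tag name."""
    return tag.rsplit("}", 1)[-1].rsplit(":", 1)[-1]

def elements_by_local_name(root, tagname):
    """Return every element at or below root whose local name equals tagname."""
    return [
        el for el in root.iter()
        # Skip nodes such as comments, whose tag is not a string in lxml.
        if isinstance(el.tag, str) and local_name(el.tag) == tagname
    ]

xml = """<root xmlns:media="urn:example:media" xmlns:abc="urn:example:abc">
  <media:item>a</media:item>
  <abc:item>b</abc:item>
  <item>d</item>
  <subitem>e</subitem>
</root>"""

root = ET.fromstring(xml)
texts = [el.text for el in elements_by_local_name(root, "item")]
print(texts)
```

Because the comparison is an exact match on the local name, an element such as subitem is not picked up.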

How do I use ":" in XML element names using lxml?

How do I generate and parse XML like the following using lxml?
<s:Envelope xmlns:s="a" xmlns:a="http_//www.w3.org/2005/08/addressing">
....
</s:Envelope>
I currently swap : with _ in the element names when I parse and generate XML, but it seems stupid.

It's not clear exactly what you're asking, but maybe this will help:
An element such as <s:Envelope> uses an XML namespace prefix. The prefix indicates that the Envelope element in this document belongs to the namespace whose URI is a.
lxml represents a namespaced name with the namespace URI in braces, for example: {a}Envelope. Your example document is somewhat confusing, because it also defines an a: namespace prefix, so:
a:Element is equivalent to {http://www.w3.org/2005/08/addressing}Element, and
s:Element is equivalent to {a}Element.
Many lxml functions let you provide a namespace prefix mapping. For example, to find the Envelope element in your document using XPath, you could do this:
import lxml.etree as etree
doc = etree.parse('mydocument.xml')
envelope = doc.xpath('//s:Envelope',
namespaces={'s': 'a'})
Note that this is exactly equivalent to:
envelope = doc.xpath('//x:Envelope',
namespaces={'x': 'a'})
That is, the namespace prefix doesn't have to match what is used in the source XML document; only the namespace URI matters.
You can read more about LXML and namespaces here.
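For the generating half of the question, element names can be written in the same braces notation. This is a minimal sketch using the stdlib xml.etree.ElementTree, where register_namespace() sets the preferred prefix for serialization; in lxml, the usual equivalent is to pass an nsmap dictionary to etree.Element(). The Body child element is a hypothetical example:

```python
import xml.etree.ElementTree as ET

# Ask the serializer to use the prefix "s" for namespace URI "a".
# Note that this registration is global to the ElementTree module.
ET.register_namespace("s", "a")

envelope = ET.Element("{a}Envelope")
body = ET.SubElement(envelope, "{a}Body")  # "Body" is a hypothetical child
body.text = "hello"

xml_text = ET.tostring(envelope, encoding="unicode")
print(xml_text)

# Parsing the result back yields the braces form again, with no ":" to swap.
parsed = ET.fromstring(xml_text)
print(parsed.tag)
```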

Is there a switch to ignore undefined namespace prefixes in LXML?

I'm parsing a non-compliant XML file (Sphinx's xmlpipe2 format) and would like LXML parser to ignore the fact that there are unresolved namespace prefixes.
An example of the Sphinx XML:
<sphinx:schema>
<sphinx:field name="subject"/>
<sphinx:field name="content"/>
<sphinx:attr name="published" type="timestamp"/>
<sphinx:attr name="author_id" type="int" bits="16" default="1"/>
</sphinx:schema>
I'm aware of passing a parser keyword option to try and recover broken XML, e.g.
parser = etree.XMLParser(recover=True)
tree = etree.parse('sphinxTest.xml', parser)
but the above does not ignore the prefix, it removes it.
I could create a target which adds in the removed prefix e.g.
parser = etree.XMLParser(target = AddPrefix())
where AddPrefix() is a class which adds in the prefix to every attribute tag.
Is there a simpler way to do this?
Eventually I want to write Sphinx's xmlpipe2 format programmatically and cleanly.

Add xmlns:sphinx="bogus" to the root element.
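A small sketch of the effect, using the stdlib parser and a shortened version of the snippet from the question; the string replacement is only a quick way to add the declaration for this demo, and "bogus" is the placeholder URI suggested above:

```python
import xml.etree.ElementTree as ET

undeclared = """<sphinx:schema>
  <sphinx:field name="subject"/>
  <sphinx:field name="content"/>
</sphinx:schema>"""

# Add the namespace declaration to the root element only.
declared = undeclared.replace(
    "<sphinx:schema>", '<sphinx:schema xmlns:sphinx="bogus">', 1
)

# Without the declaration, a strict parser rejects the unbound prefix.
try:
    ET.fromstring(undeclared)
    undeclared_ok = True
except ET.ParseError:
    undeclared_ok = False

# With it, the document parses and the prefix maps to the "bogus" namespace.
schema = ET.fromstring(declared)
field_names = [f.get("name") for f in schema.findall("{bogus}field")]
print(undeclared_ok, schema.tag, field_names)
```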
