Using XPath, how are attributes that contain a colon character processed?

Given the following XML (fragment):
<node id="b071f9fa-14b0-4217-8e97-eb41da73f598" type="Group" ext:score="90">
<node id="b071f9fa-14b0-4217-8e97-eb41da73f599" type="Person" ext:score="100">
<node id="b071f9fa-14b0-4217-8e97-eb41da73f600" type="Business" ext:score="80">
I want to retrieve the id of nodes that have an ext:score of 100.
The current code:
match = dom.xpath('//node[@ext:score="100"]/@id')[0]
Returns an exception:
lxml.etree.XPathEvalError: Undefined namespace prefix
I have read (both here and in XPath docs) that ext would first need to be defined as a valid namespace, as the DOM cannot be parsed as an attribute if it contains special characters. However, I have been unable to find a good example of how to do this. There is no definition of ext in the excerpts I am processing and I'm not sure how to create a namespace prefix.
Any thoughts?

The colon character in an XML attribute (or element) name such as ext:score separates the namespace prefix, ext, from the local name, score. Namespace prefixes themselves are significant only by virtue of their association with a namespace value.
For this XML,
<metadata xmlns:ext="http://musicbrainz.org/ns/mmd-2.0#">
<node id="b071f9fa-14b0-4217-8e97-eb41da73f598" type="Group" ext:score="90"/>
<node id="b071f9fa-14b0-4217-8e97-eb41da73f599" type="Person" ext:score="100"/>
<node id="b071f9fa-14b0-4217-8e97-eb41da73f600" type="Business" ext:score="80"/>
</metadata>
This XPath,
//node[@ext:score="100"]/@id
will select the id attributes of all node elements with an ext:score attribute value of 100, provided you have a way to bind a namespace prefix (ext) to a namespace value (http://musicbrainz.org/ns/mmd-2.0#) in the language or tool from which XPath is being called.
To bind a namespace prefix to a namespace value in Python (see How does XPath deal with XML namespaces? for Python and other language examples):
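from io import StringIO
from lxml import etree
f = StringIO('your XML here')  # replace with the XML text shown above
doc = etree.parse(f)
# Bind the ext prefix to the same URI that the XML declares for it
r = doc.xpath('//node[@ext:score="100"]/@id',
              namespaces={'ext': 'http://musicbrainz.org/ns/mmd-2.0#'})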
Note that if your XML uses ext without declaring it, it is not namespace-well-formed.
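If the excerpts really do arrive without a declaration for ext, one possible workaround is to wrap them in an element that declares the prefix before parsing. The following is only a sketch, assuming lxml is installed; the wrapper element name and the URI urn:example:ext are placeholders invented for illustration, and the node elements are written as self-closing so that the fragment parses.

```python
from lxml import etree

# Placeholder namespace URI (an assumption); use the real one if the data source documents it.
EXT_NS = "urn:example:ext"

fragment = (
    '<node id="b071f9fa-14b0-4217-8e97-eb41da73f598" type="Group" ext:score="90"/>'
    '<node id="b071f9fa-14b0-4217-8e97-eb41da73f599" type="Person" ext:score="100"/>'
    '<node id="b071f9fa-14b0-4217-8e97-eb41da73f600" type="Business" ext:score="80"/>'
)

# Wrap the fragment in a hypothetical root element that declares the ext prefix.
wrapped = '<wrapper xmlns:ext="%s">%s</wrapper>' % (EXT_NS, fragment)
root = etree.fromstring(wrapped)

# The prefix used inside the XPath is bound through the namespaces argument.
ids = root.xpath('//node[@ext:score="100"]/@id', namespaces={"ext": EXT_NS})
print(ids)
```

The prefix in the namespaces dictionary only has to match the prefix used in the XPath expression; the URI is what has to match the declaration in the document.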

Related

Finding namespace URIs for lxml

I'm using lxml to parse XML product feeds with the following code:
namespace = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
data = [loc.text for loc in tree.xpath("//sm:urlset/sm:url/sm:loc",namespaces=namespace)]
This works with the majority of feeds that I am using as input, but occasionally I find a feed with additional namespaces such as the one below:
<?xml version="1.0" encoding="UTF-8"?>
<urlset
xmlns="https://www.sitemaps.org/schemas/sitemap/0.9"
xmlns:xsi="https://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="https://www.sitemaps.org/schemas/sitemap/0.9
https://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd">
<url>
<loc>https://www.example.com/</loc>
<priority>1.00</priority>
</url>
From what I've read I would need to add the additional namespace here (xmlns:xsi I guess) to the namespace dictionary to get my xpath to work with multiple namespaces.
However, this is not a long-term solution for me, as I might come across other, differing namespaces in the future. Is there a way for me to search for, detect, or even delete the namespace? The element tree will always be the same, so my XPath wouldn't change.
Thanks

You shouldn't need to map the xsi prefix; that's only there for the xsi:schemaLocation attribute.
The difference between your current mapping and the input file is that the default namespace of the XML uses "https" rather than "http".
One way to handle both namespace URIs (or really any other namespace URI that urlset might have) is to first get the namespace URI of the root element and then use that in your dict mapping...
from lxml import etree
tree = etree.parse("input.xml")
root_ns_uri = tree.xpath("namespace-uri()")
namespace = {"sm": root_ns_uri}
data = [loc.text for loc in tree.xpath("//sm:urlset/sm:url/sm:loc", namespaces=namespace)]
print(data)
prints...
['https://www.example.com/']
If urlset isn't always the root element, you may want to do something like this instead...
root_ns_uri = tree.xpath("namespace-uri(//*[local-name()='urlset'])")
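The namespace URI can also be read from the root element's tag rather than by evaluating an XPath function. Here is a small sketch using lxml's etree.QName, with an inline sample document standing in for input.xml (the sample is an assumption based on the feed shown in the question):

```python
from lxml import etree

xml = b"""<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="https://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://www.example.com/</loc>
    <priority>1.00</priority>
  </url>
</urlset>"""

root = etree.fromstring(xml)

# QName splits the root tag "{uri}urlset" into its namespace URI and local name.
root_ns_uri = etree.QName(root).namespace

# Bind an arbitrary prefix to whatever URI the root element actually uses.
namespace = {"sm": root_ns_uri}
data = [loc.text for loc in root.xpath("//sm:urlset/sm:url/sm:loc", namespaces=namespace)]
print(data)
```

If the root element has no namespace at all, QName(...).namespace is None, so a feed like that would need a separate branch that queries without a prefix.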

Access XML element by name

I'm using xml.etree.ElementTree to parse my XML data. I'm trying to get the text value of <Name>.
This is my code.
for Content in Zone[0]:
print(Content.find('Name').text)
Content.find('Name') returns None, so accessing .text fails.
However, I am able to access the Element using
for Content in Zone[0]:
print(Content[12].text)
I think I might have found the problem: when I print the tags out, they don't display as Name but as {http://schemas.datacontract.org/2004/07/}Name. What is the extra data in front of the tag name?

Your XML likely has a default namespace, that is, a namespace declared without a prefix. Descendant elements without a prefix inherit the default namespace implicitly. You can handle a default namespace the same way you would handle a prefixed one: map a prefix to the namespace URI, and use that prefix along with the element name to reference the element in that namespace:
namespaces = {'d': 'http://schemas.datacontract.org/2004/07/'}
for Content in Zone[0]:
print(Content.find('d:Name', namespaces).text)
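To see the effect end to end, here is a self-contained sketch with xml.etree.ElementTree. The Zone/Content document is a made-up stand-in for the real data (only the namespace URI comes from the question), so the element layout is an assumption:

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; only the namespace URI is taken from the question.
xml = """<Zone xmlns="http://schemas.datacontract.org/2004/07/">
  <Content><Name>Lobby</Name></Content>
  <Content><Name>Garage</Name></Content>
</Zone>"""

root = ET.fromstring(xml)

# Every element inherits the default namespace, so tags carry a {uri} part.
first_tag = root[0][0].tag

# An unqualified lookup finds nothing, because no element is named plain "Name".
unqualified = root[0].find("Name")

# Mapping any prefix to the URI makes the lookup succeed.
namespaces = {"d": "http://schemas.datacontract.org/2004/07/"}
names = [content.find("d:Name", namespaces).text for content in root]
print(first_tag, unqualified, names)
```

The prefix d is arbitrary; it only exists in the mapping passed to find and does not need to appear in the document.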

Adding Namespaces to a DOM Element python

I want to produce this xml file with python and minidom:
<?xml version="1.0" encoding="utf-8"?>
<package name="Operation" xmlns="http://www.modelIL.eu/types-2.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation = "http://www.modelIL.eu/types-2.0 modelIL-package-2.0.xsd">
</package>
I have written this:
import xml.dom.minidom as dom
document = dom.Document()
root_xml = document.createElement("package")
root_xml.setAttribute("name", "Operation")
root_xml.setAttributeNS("", "xmlns", "http://www.modelIL.eu/types-2.0")
root_xml.setAttributeNS("xmls", "xsi", "http://www.w3.org/2001/XMLSchema-instance")
root_xml.setAttribute("xsi:schemaLocation", "http://www.modelIL.eu/types-2.0 modelIL-package-2.0.xsd")
root = document.appendChild(root_xml)
print(document.toprettyxml(indent=" "))
But the output I get is this one:
<?xml version="1.0" ?>
<package name="Operation" xmlns="http://www.modelIL.eu/types-2.0" xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation = "http://www.modelIL.eu/types-2.0 modelIL-package-2.0.xsd">
</package>
Why do I have only xsi and not xmlns:xsi? Did I forget something?

Full disclosure: I do not use minidom for XML, I use lxml and on top of that I do not use XML that often, so I hope my answer will be useful.
One might expect that by setting an attribute with a particular namespace, there would not be any need to explicitly state a prefix to appear before the local name in the final, written XML document - after all, it should be possible to detect that a namespace has been employed and that a prefix is required in the full attribute name so that the attribute is recognised as being associated with that namespace. Unfortunately, we do not seem to have that luxury and must explicitly specify a prefix as part of the qualified name when setting such an attribute
Python and XML: An Introduction (skip to the Attributes part)
This should solve your problem:
root_xml.setAttributeNS("xmls", "xmlns:xsi", "http://www.w3.org/2001/XMLSchema-instance")
As you know, the setAttributeNS method takes three arguments: namespaceURI, qualifiedName, value. The attribute is then added if the element has no attribute with the same namespaceURI and localname; we get the localname by splitting qualifiedName with the function _nssplit. Otherwise the method tries to update the value of the attribute.
However the name of the attribute is a combination of prefix (the part of the qualifiedName before the colon punctuation) and localname "%s:%s" % (prefix, localName). If no prefix is present the name of the attribute is the same as the qualifiedName argument.
If you do not care about the namespaceURI of your attributes, you could achieve the same result using only the setAttribute method, as you did with the first and last attributes. In that case the method will look for an attribute with the same attribute name. If it finds one, it will try to overwrite its value.
I do have one question: why do you bind root = document.appendChild(root_xml)? Is it to avoid the return value in your REPL? That I understand.
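Putting the pieces together, here is a sketch of the whole document built with minidom. It swaps the "xmls" placeholder for the namespace URIs reserved for xmlns declarations and for XML Schema instance attributes; minidom does not check these URIs, so that substitution is a matter of tidiness rather than something the serialized output depends on.

```python
import xml.dom.minidom as dom

# Namespace URIs: the first two are reserved by W3C specifications,
# the third is the default namespace from the question.
XMLNS_NS = "http://www.w3.org/2000/xmlns/"
XSI_NS = "http://www.w3.org/2001/XMLSchema-instance"
MODEL_NS = "http://www.modelIL.eu/types-2.0"

document = dom.Document()
root_xml = document.createElement("package")
root_xml.setAttribute("name", "Operation")
root_xml.setAttribute("xmlns", MODEL_NS)

# The qualified name must carry the prefix explicitly, otherwise it is lost on output.
root_xml.setAttributeNS(XMLNS_NS, "xmlns:xsi", XSI_NS)
root_xml.setAttributeNS(XSI_NS, "xsi:schemaLocation", MODEL_NS + " modelIL-package-2.0.xsd")
document.appendChild(root_xml)

# Passing encoding makes toprettyxml return bytes and adds it to the XML declaration.
xml_text = document.toprettyxml(indent="  ", encoding="utf-8").decode("utf-8")
print(xml_text)
```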

Retrieve content of element with unknown namespace using python

I am attempting to parse a maven project definition using python to extract a version.
The project definition looks like:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0
http://maven.apache.org/maven-v4_0_0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>...</groupId>
<artifactId>...</artifactId>
<version>1.6.0-SNAPSHOT</version>
...
</project>
I can extract the version using:
root = ET.fromstring(xml)
version = root.find('./p:version', { 'p': 'http://maven.apache.org/POM/4.0.0' })
print(version.text)
prints: 1.6.0-SNAPSHOT
However, the namespace used may change, and I don't want to depend on this. Is there a way to extract the namespace to use in my subsequent xpath expression?
I tried the following, to see if xmlns was itself exposed, but no luck:
root = ET.fromstring(xml)
for k in root.attrib:
print('%s => %s' % (k, root.attrib[k]))
prints: {http://www.w3.org/2001/XMLSchema-instance}schemaLocation => http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd

However, the namespace used may change, and I don't want to depend on this.
Are you saying that the namespace uri might change, or that the prefix might? If it's just the prefix, then that's not an issue, because what matters is that the prefixes in your XPath match the prefixes you supply to the XPath evaluator. And in either case, auto-detecting the namespaces is probably a bad call. Suppose someone decides to start generating that XML like this:
<proj:project xmlns:proj="http://maven.apache.org/POM/4.0.0"
xmlns:other="http://maven.apache.org/POM/5.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0
http://maven.apache.org/maven-v4_0_0.xsd">
which still represents the XML in the same namespace as your example, but you would have no idea that proj is the namespace prefix you're looking for.
I think it's unlikely that Apache would suddenly change the namespace for one of their official XML formats, but if you are genuinely worried about it, there should always be the option of using local-name() to namespace-agnostically find a node you're looking for:
version = root.find('./*[local-name() = "version"]')
Also, I'm not familiar with the elementTree library, but you could try this to try to get information about the XML document's namespaces, just to see if you can:
namespaces = root.findall('//namespace::*')
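Note that find() and findall() in xml.etree.ElementTree implement only a limited XPath subset, which does not include functions such as local-name() or the namespace axis; those expressions need a full XPath engine such as lxml's xpath(). A sketch that stays within the standard library follows. It assumes Python 3.8 or later for the {*} wildcard, and the trimmed-down POM is a stand-in for the real file:

```python
import xml.etree.ElementTree as ET

# Trimmed-down stand-in for the POM shown in the question.
xml = """<project xmlns="http://maven.apache.org/POM/4.0.0">
  <modelVersion>4.0.0</modelVersion>
  <version>1.6.0-SNAPSHOT</version>
</project>"""

root = ET.fromstring(xml)

# Option 1 (Python 3.8+): "{*}" matches the local name in any namespace.
version = root.find("{*}version")

# Option 2: recover the namespace URI from the root tag, which looks like "{uri}project".
ns_uri = root.tag[1:].partition("}")[0] if root.tag.startswith("{") else ""
version_again = root.find("p:version", {"p": ns_uri})

print(version.text, ns_uri)
```

Option 2 assumes the root element is in a namespace; a document without one would need an unprefixed lookup instead.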

Unfortunately, ElementTree namespace support is rather patchy.
You'll need to use an internal method from the xml.etree.ElementTree module to get a namespace map out:
_, namespaces = ET._namespaces(root, 'utf8')
namespaces is now a dict with URIs as keys, and prefixes as values.
You could switch to lxml instead. That library implements the same ElementTree API, but has augmented that API considerably.
For example, each node includes a .nsmap attribute which maps prefixes to URIs, including the default namespace under the key None.
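Because _namespaces is a private helper whose signature is not guaranteed to stay the same across Python versions, a public alternative may be safer. The sketch below collects prefix-to-URI pairs with ET.iterparse and its "start-ns" events; the sample document is a shortened version of the POM from the question:

```python
import io
import xml.etree.ElementTree as ET

# Shortened stand-in for the POM from the question.
xml = b"""<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <version>1.6.0-SNAPSHOT</version>
</project>"""

# Each "start-ns" event carries a (prefix, uri) pair;
# the default namespace is reported with the empty string as its prefix.
namespaces = {}
for event, (prefix, uri) in ET.iterparse(io.BytesIO(xml), events=["start-ns"]):
    namespaces[prefix] = uri

print(namespaces)
```

Here the dict maps prefixes to URIs, the reverse of what _namespaces returns.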

How do I use ":" in XML element names using lxml?

How do I generate and parse XML like the following using lxml?
<s:Envelope xmlns:s="a" xmlns:a="http_//www.w3.org/2005/08/addressing">
....
</s:Envelope>
I currently swap : with _ in the element names when I parse and generate XML, but it seems stupid.

It's not clear exactly what you're asking, but maybe this will help:
An element such as <s:Envelope> is using an XML namespace prefix. This indicates that the Envelope element in this document is defined in the a namespace.
lxml represents XML namespaces using the namespace URI in braces, for example: {a}Envelope. Your example document is somewhat confusing, because you also defined the a: namespace prefix, so:
a:Element is equivalent to {http://www.w3.org/2005/08/addressing}Element, and
s:Element is equivalent to {a}Element.
Many of the lxml functions let you provide a namespace prefix mapping. For example, to find the Envelope element in your document using XPath, you could do this:
import lxml.etree as etree
doc = etree.parse('mydocument.xml')
envelope = doc.xpath('//s:Envelope',
namespaces={'s': 'a'})
Note that this is exactly equivalent to:
envelope = doc.xpath('//x:Envelope',
namespaces={'x': 'a'})
That is, the namespace prefix doesn't have to match what is used in the source XML document; only the namespace URI matters.
You can read more about LXML and namespaces here.
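For the generating side of the question, a possible sketch with lxml: element tags are written as {uri}localname, and the nsmap argument chooses which prefixes appear in the output. The SOAP 1.2 envelope URI below is an assumption standing in for the "a" placeholder in the question, and the Action child is made up for illustration.

```python
from lxml import etree

# Assumed URI for the s prefix; the question's document binds s to the placeholder "a".
S_NS = "http://www.w3.org/2003/05/soap-envelope"
A_NS = "http://www.w3.org/2005/08/addressing"

# Tags use {uri}localname; nsmap decides which prefixes are used when serializing.
envelope = etree.Element("{%s}Envelope" % S_NS, nsmap={"s": S_NS, "a": A_NS})

# Hypothetical child element in the addressing namespace.
action = etree.SubElement(envelope, "{%s}Action" % A_NS)
action.text = "example"

xml_bytes = etree.tostring(envelope)
print(xml_bytes.decode())
```

No colon ever appears in the tag names passed to lxml; the s: and a: prefixes are produced only at serialization time.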
