Adding Namespaces to a DOM Element python

I want to produce this XML file with Python and minidom:
<?xml version="1.0" encoding="utf-8"?>
<package name="Operation" xmlns="http://www.modelIL.eu/types-2.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation = "http://www.modelIL.eu/types-2.0 modelIL-package-2.0.xsd">
</package>
I wrote this:
import xml.dom.minidom as dom
document = dom.Document()
root_xml = document.createElement("package")
root_xml.setAttribute("name", "Operation")
root_xml.setAttributeNS("", "xmlns", "http://www.modelIL.eu/types-2.0")
root_xml.setAttributeNS("xmls", "xsi", "http://www.w3.org/2001/XMLSchema-instance")
root_xml.setAttribute("xsi:schemaLocation", "http://www.modelIL.eu/types-2.0 modelIL-package-2.0.xsd")
root = document.appendChild(root_xml)
print(document.toprettyxml(indent=" "))
But the output I get is this one:
<?xml version="1.0" ?>
<package name="Operation" xmlns="http://www.modelIL.eu/types-2.0" xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation = "http://www.modelIL.eu/types-2.0 modelIL-package-2.0.xsd">
</package>
Why do I have only xsi and not xmlns:xsi? Did I forget something?

Full disclosure: I do not use minidom for XML, I use lxml and on top of that I do not use XML that often, so I hope my answer will be useful.
One might expect that by setting an attribute with a particular namespace, there would not be any need to explicitly state a prefix to appear before the local name in the final, written XML document - after all, it should be possible to detect that a namespace has been employed and that a prefix is required in the full attribute name so that the attribute is recognised as being associated with that namespace. Unfortunately, we do not seem to have that luxury and must explicitly specify a prefix as part of the qualified name when setting such an attribute.
Python and XML: An Introduction (skip to the Attributes part)
This should solve your problem:
root_xml.setAttributeNS("xmls", "xmlns:xsi", "http://www.w3.org/2001/XMLSchema-instance")
As you know, the setAttributeNS method takes three arguments: namespaceURI, qualifiedName, value. The attribute is added if the element has no attribute with the same namespaceURI and localName; the localName is obtained by splitting qualifiedName with the function _nssplit. Otherwise the method updates the value of the existing attribute.
However, the name of the attribute is a combination of the prefix (the part of the qualifiedName before the colon) and the localName: "%s:%s" % (prefix, localName). If no prefix is present, the name of the attribute is the same as the qualifiedName argument. That is why passing "xsi" as the qualifiedName produced an attribute named xsi rather than xmlns:xsi.
If you do not care about the namespaceURI of your attributes, you could achieve the same result using only the setAttribute method, as you did with the first and last attributes. In that case the method looks for an attribute with the same attribute name. If it finds one, it overwrites its value.
I do have one question: why do you bind root = document.appendChild(root_xml)? Is it to avoid the return value in your REPL? That I understand.
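A minimal sketch of the whole document, assuming Python 3 and only the standard library. Unlike the one-line fix above, it passes the reserved xmlns namespace URI (exposed as xml.dom.XMLNS_NAMESPACE) rather than the placeholder string "xmls"; minidom does not validate either value, so the difference only matters for later getAttributeNS lookups. The constant names MODEL_NS and XSI_NS are made up for this example.

```python
import xml.dom.minidom as dom
from xml.dom import XMLNS_NAMESPACE  # "http://www.w3.org/2000/xmlns/"

# Hypothetical constant names; the URIs come from the question.
MODEL_NS = "http://www.modelIL.eu/types-2.0"
XSI_NS = "http://www.w3.org/2001/XMLSchema-instance"

document = dom.Document()
root_xml = document.createElement("package")
root_xml.setAttribute("name", "Operation")
# Default namespace declaration: no prefix, so the attribute name is just "xmlns".
root_xml.setAttribute("xmlns", MODEL_NS)
# The qualified name must carry the prefix explicitly ("xmlns:xsi", not "xsi").
root_xml.setAttributeNS(XMLNS_NAMESPACE, "xmlns:xsi", XSI_NS)
root_xml.setAttributeNS(XSI_NS, "xsi:schemaLocation",
                        MODEL_NS + " modelIL-package-2.0.xsd")
document.appendChild(root_xml)

# Passing an encoding makes toprettyxml return bytes and emit encoding="utf-8".
xml_text = document.toprettyxml(indent=" ", encoding="utf-8").decode("utf-8")
print(xml_text)
```

Attribute order in the output may differ between Python versions, so it is safer not to rely on it.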

Related

Finding namespace URIs for lxml

I'm using lxml to parse XML product feeds with the following code:
namespace = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
data = [loc.text for loc in tree.xpath("//sm:urlset/sm:url/sm:loc",namespaces=namespace)]
This works with the majority of the feeds I use as input, but occasionally I find a feed with additional namespaces, such as the one below:
<?xml version="1.0" encoding="UTF-8"?>
<urlset
xmlns="https://www.sitemaps.org/schemas/sitemap/0.9"
xmlns:xsi="https://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="https://www.sitemaps.org/schemas/sitemap/0.9
https://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd">
<url>
<loc>https://www.example.com/</loc>
<priority>1.00</priority>
</url>
From what I've read, I would need to add the additional namespace (xmlns:xsi, I guess) to the namespace dictionary to get my XPath to work with multiple namespaces.
However, this is not a long-term solution for me, as I might come across other namespaces in the future. Is there a way to detect or even delete the namespace? The element tree will always be the same, so my XPath wouldn't change.
Thanks

You shouldn't need to map the xsi prefix; that's only there for the xsi:schemaLocation attribute.
The difference between your current mapping and the input file is that there is an "s" in "https" in the default namespace of the XML.
One way to handle both namespace URIs (or really any other namespace URI that urlset might have) is to first get the namespace URI of the root element and then use that in your dict mapping:
from lxml import etree
tree = etree.parse("input.xml")
root_ns_uri = tree.xpath("namespace-uri()")
namespace = {"sm": root_ns_uri}
data = [loc.text for loc in tree.xpath("//sm:urlset/sm:url/sm:loc", namespaces=namespace)]
print(data)
prints...
['https://www.example.com/']
If urlset isn't always the root element, you may want to do something like this instead...
root_ns_uri = tree.xpath("namespace-uri(//*[local-name()='urlset'])")
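The same idea can be sketched without lxml. This is an assumption-laden variant, not the method from the answer: it uses the standard library's xml.etree.ElementTree, which has no namespace-uri() function, so the URI is read from the root element's tag, which ElementTree stores as {uri}localname. The helper name root_namespace and the inline sample feed are made up for the example.

```python
import xml.etree.ElementTree as ET

# Sample feed modelled on the one in the question (note the "https" URI).
SITEMAP = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="https://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://www.example.com/</loc>
    <priority>1.00</priority>
  </url>
</urlset>"""


def root_namespace(element):
    """Return the namespace URI of an element, or "" if it has none."""
    tag = element.tag
    if tag.startswith("{"):
        return tag[1:].split("}", 1)[0]
    return ""


root = ET.fromstring(SITEMAP)
# Assumes the root element is in a namespace, as in both feeds discussed above.
namespace = {"sm": root_namespace(root)}
data = [loc.text for loc in root.findall("sm:url/sm:loc", namespace)]
print(data)
```

Because the mapping is built from whatever URI the feed declares, the same path works for both the http and the https variant.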

Using XPath, how are attributes that contain a colon character processed?

Given the following XML (fragment):
<node id="b071f9fa-14b0-4217-8e97-eb41da73f598" type="Group" ext:score="90">
<node id="b071f9fa-14b0-4217-8e97-eb41da73f599" type="Person" ext:score="100">
<node id="b071f9fa-14b0-4217-8e97-eb41da73f600" type="Business" ext:score="80">
I want to retrieve the id of nodes that have an ext:score of 100.
The current code:
match = dom.xpath('//node[@ext:score="100"]/@id')[0]
Returns an exception:
lxml.etree.XPathEvalError: Undefined namespace prefix
I have read (both here and in XPath docs) that ext would first need to be defined as a valid namespace, as the DOM cannot be parsed as an attribute if it contains special characters. However, I have been unable to find a good example of how to do this. There is no definition of ext in the excerpts I am processing and I'm not sure how to create a namespace prefix.
Any thoughts?

The colon character in an XML attribute (or element) name such as ext:score separates the namespace prefix, ext, from the local name, score. Namespace prefixes themselves are significant only by virtue of their association with a namespace value.
For this XML,
<metadata xmlns:ext="http://musicbrainz.org/ns/mmd-2.0#">
<node id="b071f9fa-14b0-4217-8e97-eb41da73f598" type="Group" ext:score="90"/>
<node id="b071f9fa-14b0-4217-8e97-eb41da73f599" type="Person" ext:score="100"/>
<node id="b071f9fa-14b0-4217-8e97-eb41da73f600" type="Business" ext:score="80"/>
</metadata>
This XPath,
//node[@ext:score="100"]/@id
will select the id attributes of all node elements with an ext:score attribute value of 100, provided you have a way to bind a namespace prefix (ext) to a namespace value (http://musicbrainz.org/ns/mmd-2.0#) in the language or tool from which XPath is being called.
To bind a namespace prefix to a namespace value in Python (see How does XPath deal with XML namespaces? for Python and other language examples):
from io import StringIO
from lxml import etree
f = StringIO('your XML here')
doc = etree.parse(f)
# Bind the ext prefix to the namespace URI declared in the XML above.
r = doc.xpath('//node[@ext:score="100"]/@id',
              namespaces={'ext': 'http://musicbrainz.org/ns/mmd-2.0#'})
Note that if your XML uses ext without declaring it, it is not namespace-well-formed.
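To see what the prefix binding amounts to, here is a small sketch using only the standard library (an assumption of this example; the answer itself uses lxml). ElementTree stores a prefixed attribute under the key {namespace-uri}localname, so the attribute can be read without any prefix at all. The variable names are hypothetical, and the namespace URI is the one declared on the metadata element above.

```python
import xml.etree.ElementTree as ET

EXT_NS = "http://musicbrainz.org/ns/mmd-2.0#"  # URI bound to "ext" in the sample

XML = """<metadata xmlns:ext="http://musicbrainz.org/ns/mmd-2.0#">
  <node id="b071f9fa-14b0-4217-8e97-eb41da73f598" type="Group" ext:score="90"/>
  <node id="b071f9fa-14b0-4217-8e97-eb41da73f599" type="Person" ext:score="100"/>
  <node id="b071f9fa-14b0-4217-8e97-eb41da73f600" type="Business" ext:score="80"/>
</metadata>"""

root = ET.fromstring(XML)

# "ext:score" is stored as "{uri}score"; the prefix itself is not kept.
score_key = "{%s}score" % EXT_NS
matching_ids = [
    node.get("id")
    for node in root.iter("node")
    if node.get(score_key) == "100"
]
print(matching_ids)
```

If the ext prefix were not declared in the document, ET.fromstring would reject it with a parse error, which is the well-formedness problem mentioned above.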

Access XML element by name

I'm using xml.etree.ElementTree to parse my XML data. I'm trying to get the text value of <Name>.
This is my code:
for Content in Zone[0]:
    print(Content.find('Name').text)
The find call is returning None.
However, I am able to access the element using
for Content in Zone[0]:
    print(Content[12].text)
I think I might have found the problem: when I print the tags out, I don't see Name but {http://schemas.datacontract.org/2004/07/}Name. What is the extra data in front of the tag name?

Your XML likely has a default namespace (a namespace declared with no prefix). Note that descendant elements without a prefix inherit the default namespace implicitly. You can handle a default namespace the same way you would handle a prefixed one: map a prefix to the namespace URI, and use that prefix along with the element name to reference an element in the namespace:
namespaces = {'d': 'http://schemas.datacontract.org/2004/07/'}
for Content in Zone[0]:
    print(Content.find('d:Name', namespaces).text)
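A self-contained sketch of the same approach, assuming a made-up document: the Zones/Zone/Content structure and the names "Lobby" and "Garage" are invented for illustration, and only the namespace URI comes from the question.

```python
import xml.etree.ElementTree as ET

# Hypothetical document; only the namespace URI is taken from the question.
XML = """<Zones xmlns="http://schemas.datacontract.org/2004/07/">
  <Zone>
    <Content><Name>Lobby</Name></Content>
    <Content><Name>Garage</Name></Content>
  </Zone>
</Zones>"""

root = ET.fromstring(XML)

# Every element inherits the default namespace, so tags look like {uri}Name.
print(root[0].tag)

# The prefix "d" is arbitrary; only the URI has to match the document.
namespaces = {"d": "http://schemas.datacontract.org/2004/07/"}
names = [
    content.find("d:Name", namespaces).text
    for content in root.findall("d:Zone/d:Content", namespaces)
]
print(names)
```

The prefix used in the search path does not need to appear anywhere in the XML; it is only a local alias for the URI.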

python etree with xpath and namespaces with prefix

I can't find information on how to parse my XML when it uses a namespace.
I have this XML:
<par:Request xmlns:par="http://somewhere.net/actual">
<par:actual>blabla</par:actual>
<par:documentType>string</par:documentType>
</par:Request>
And I tried to parse it:
from lxml import etree as ET  # .xpath() is provided by lxml, not the standard library
dom = ET.parse(u'C:\\filepath\\1.xml')
rootxml = dom.getroot()
for subtag in rootxml.xpath(u'//par:actual'):
    # do something
    print(subtag)
That raised an exception, because the XPath evaluator doesn't know the namespace prefix.
Is there a good way to solve this, given that the script will not know in advance which file it is going to parse or which tag it is going to search for?
Searching the web and Stack Overflow, I found that I can add a namespace mapping:
namespace = {u'par': u"http://somewhere.net/actual"}
for subtag in rootxml.xpath(u'//par:actual', namespaces=namespace):
    # do something
    print(subtag)
That works. Perfect. But I don't know which XML I will parse, and the tag to search for (such as //par:actual) is also unknown to my script. So I need a way to extract the namespaces from the XML itself.
I found a lot of ways, how to extract namespace URI, such as:
print(rootxml.tag)
print(rootxml.xpath('namespace-uri(.)'))
print(rootxml.xpath('namespace-uri(/*)'))
But how do I extract the prefix to build the dictionary that the xpath call wants from me? I don't want to run a monster regular expression over the XML body to extract the prefix; I believe there must be a supported way to do that.
Ideally there would be a method that extracts the namespaces from the XML as a dictionary (in the form ETree wants) without manual manipulation.

You cannot rely on the namespace declarations on the root element: there is no guarantee that the declarations will even be there, or that the document will have the same prefix for the same namespace throughout.
Assuming you have some way of passing in the tag you want to search for (because you say your script does not know it), you should also provide a way to pass in a namespace mapping. Alternatively, use James Clark notation, like {http://somewhere.net/actual}actual (ETXPath supports this syntax, whereas "normal" xpath does not; you can also use other methods like .findall() if you don't need full XPath).
If you don't care about the prefix at all, you could also use the local-name() function in XPath, e.g. //*[local-name()="actual"] (but you won't "really" be sure it's the right "actual").
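As a rough illustration of the Clark-notation route, here is a sketch using the standard library's xml.etree.ElementTree (an assumption of this example; the question uses lxml). Its find() and findall() accept {uri}localname directly, so no prefix mapping is needed. The XML is the sample from the question.

```python
import xml.etree.ElementTree as ET

XML = """<par:Request xmlns:par="http://somewhere.net/actual">
  <par:actual>blabla</par:actual>
  <par:documentType>string</par:documentType>
</par:Request>"""

root = ET.fromstring(XML)

# Clark notation: the namespace URI in braces, followed by the local name.
actual = root.find("{http://somewhere.net/actual}actual")
print(actual.text)

# The same notation works for descendant searches.
doc_types = [el.text for el in root.findall(".//{http://somewhere.net/actual}documentType")]
print(doc_types)
```

The caller still has to know the namespace URI, but the prefix chosen by whoever wrote the document no longer matters.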

Oh, I found it.
After we do that:
dom = ET.parse(u'C:\\filepath\\1.xml')
rootxml = dom.getroot()
The rootxml object has an nsmap dictionary, which contains all the namespaces that I want.
So, the simplest solution I've found:
dom = ET.parse(u'C:\\filepath\\1.xml')
rootxml = dom.getroot()
nss = rootxml.nsmap
for subtag in rootxml.xpath(u'//par:actual', namespaces=nss):
    # do something
    print(subtag)
That works.
UPD: this works as long as the user understands what 'par' means in the XML they are working with, for example by comparing the expected namespace with the declared one before doing anything else.
Still, I much prefer the XPath variant that understands {...}actual; that was what I was trying to achieve.
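The nsmap attribute is specific to lxml. For the standard library, a comparable prefix-to-URI dictionary can be sketched by listening for "start-ns" events in xml.etree.ElementTree.iterparse. The helper name collect_namespaces is made up for this example, and keeping only the first URI seen for each prefix is a simplifying assumption, since a document may rebind a prefix further down.

```python
import io
import xml.etree.ElementTree as ET

XML = b"""<par:Request xmlns:par="http://somewhere.net/actual">
  <par:actual>blabla</par:actual>
  <par:documentType>string</par:documentType>
</par:Request>"""


def collect_namespaces(source):
    """Return {prefix: uri} for every namespace declaration in the document."""
    namespaces = {}
    for _event, (prefix, uri) in ET.iterparse(source, events=("start-ns",)):
        # Keep the first binding of each prefix (simplifying assumption).
        namespaces.setdefault(prefix, uri)
    return namespaces


nss = collect_namespaces(io.BytesIO(XML))
print(nss)

root = ET.fromstring(XML)
found = [subtag.text for subtag in root.findall(".//par:actual", nss)]
print(found)
```

As with nsmap, this only helps when the caller already knows which prefix the document uses for the namespace of interest.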

I found this question while having the same issue with Python 3.8.2.
The solution I found is to put the namespace in the XPath query (between the {}):
ApplicationArea = BOD_IN_tree.find('.//ApplicationArea', ns)
if ApplicationArea is None:
    ApplicationArea = BOD_IN_tree.find('.//{http://www.defaultNamespace.com/2}ApplicationArea', ns)
I search for the element without the namespace, then search again with it if nothing is found. I have no control over the inbound documents; some have namespaces, some do not.
I hope this helps!

Retrieve content of element with unknown namespace using python

I am attempting to parse a Maven project definition using Python to extract a version.
The project definition looks like:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0
http://maven.apache.org/maven-v4_0_0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>...</groupId>
<artifactId>...</artifactId>
<version>1.6.0-SNAPSHOT</version>
...
</project>
I can extract the version using:
root = ET.fromstring(xml)
version = root.find('./p:version', { 'p': 'http://maven.apache.org/POM/4.0.0' })
print(version.text)
prints: 1.6.0-SNAPSHOT
However, the namespace used may change, and I don't want to depend on this. Is there a way to extract the namespace to use in my subsequent xpath expression?
I tried the following, to see if xmlns was itself exposed, but no luck:
root = ET.fromstring(xml)
for k in root.attrib:
    print('%s => %s' % (k, root.attrib[k]))
prints: {http://www.w3.org/2001/XMLSchema-instance}schemaLocation => http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd

"However, the namespace used may change, and I don't want to depend on this."
Are you saying that the namespace URI might change, or that the prefix might? If it's just the prefix, then that's not an issue, because what matters is that the prefixes in your XPath match the prefixes you supply to the XPath evaluator. And in either case, auto-detecting the namespaces is probably a bad call. Suppose someone decides to start generating that XML like this:
<proj:project xmlns:proj="http://maven.apache.org/POM/4.0.0"
xmlns:other="http://maven.apache.org/POM/5.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0
http://maven.apache.org/maven-v4_0_0.xsd">
which still represents the XML in exactly the same namespace as your example, but you have no way of knowing that proj is the namespace prefix you're looking for.
I think it's unlikely that Apache would suddenly change the namespace for one of their official XML formats, but if you are genuinely worried about it, there should always be the option of using local-name() to namespace-agnostically find a node you're looking for:
version = root.find('./*[local-name() = "version"]')
Also, I'm not familiar with the ElementTree library, but you could try this to get information about the XML document's namespaces, just to see if you can:
namespaces = root.findall('//namespace::*')
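For what it is worth, the standard library's ElementTree implements only a subset of XPath and, as far as I know, does not evaluate functions such as local-name(). A namespace-agnostic lookup can instead be sketched with the {*} wildcard, which assumes Python 3.8 or later. The POM below is a trimmed copy of the one in the question, with the elided elements left out.

```python
import xml.etree.ElementTree as ET

# Trimmed version of the POM from the question.
POM = b"""<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0
                             http://maven.apache.org/maven-v4_0_0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <version>1.6.0-SNAPSHOT</version>
</project>"""

root = ET.fromstring(POM)

# "{*}version" matches a child named "version" in any namespace, or in none.
# Requires Python 3.8+.
version = root.find("{*}version")
print(version.text)
```

As the answer points out, matching on the local name alone gives no guarantee that the element found belongs to the namespace you had in mind.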

Unfortunately, ElementTree namespace support is rather patchy.
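You'll need to use an internal function from the xml.etree.ElementTree module to get a namespace map out. Because it is private, its signature is not stable: the call below is the Python 2 form, while in Python 3 the function takes the element and an optional default namespace, with no encoding argument:
_, namespaces = ET._namespaces(root, 'utf8')
namespaces is now a dict with URIs as keys, and prefixes as values.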
You could switch to lxml instead. That library implements the same ElementTree API, but has augmented that API considerably.
For example, each node includes a .nsmap attribute which maps prefixes to URIs, including the default namespace under the key None.
