Python: ignoring namespaces in xml.etree.ElementTree?

Python: ignoring namespaces in xml.etree.ElementTree? - python

How can I tell ElementTree to ignore namespaces in an XML file?
For example, I would prefer to query modelVersion (as in statement 1) rather than {http://maven.apache.org/POM/4.0.0}modelVersion (as in statement 2).
pom="""
<project xmlns="http://maven.apache.org/POM/4.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0
http://maven.apache.org/maven-v4_0_0.xsd">
<modelVersion>4.0.0</modelVersion>
</project>
"""
from xml.etree import ElementTree
ElementTree.register_namespace("","http://maven.apache.org/POM/4.0.0")
root = ElementTree.fromstring(pom)
print 1,root.findall('modelVersion')
print 2,root.findall('{http://maven.apache.org/POM/4.0.0}modelVersion')
1 []
2 [<Element '{http://maven.apache.org/POM/4.0.0}modelVersion' at 0x1006bff10>]

There appears to be no straight-forward pathway, thus I'd simply wrap the find calls, e.g.
from xml.etree import ElementTree as ET
POM = """
<project xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns="http://maven.apache.org/POM/4.0.0">
<modelVersion>4.0.0</modelVersion>
</project>
"""
NSPS = {'foo' : "http://maven.apache.org/POM/4.0.0"}
# sic!
def findall(node, tag):
return node.findall('foo:' + tag, NSPS)
root = ET.fromstring(POM)
print(map(ET.tostring, findall(root, 'modelVersion')))
output:
['<ns0:modelVersion xmlns:ns0="http://maven.apache.org/POM/4.0.0">4.0.0</ns0:modelVersion>\n']

Here's what I'm presently doing, which makes me incredibly confident that there's a better way.
$ cat pom.xml |
tr '\n' ' ' |
sed 's/<project [^>]*>/<project>/' |
myprogram |
sed 's/<project>/<project xmlns="http:\/\/maven.apache.org\/POM\/4.0.0" xmlns:xsi="http:\/\/www.w3.org\/2001\/XMLSchema-instance" xsi:schemaLocation="http:\/\/maven.apache.org\/POM\/4.0.0 http:\/\/maven.apache.org\/maven-v4_0_0.xsd">/'

Rather than ignore, another approach would be to remove the namespaces in the tree, so there's no need to 'ignore' because they aren't there - see nonagon's answer to this question (and my extension of that to include namespaces on attributes): Python ElementTree module: How to ignore the namespace of XML files to locate matching element when using the method "find", "findall"

Here's the equivalent solution without using the shell. Basic idea:
translate <project junk...> to <project>
perform "clean" processing without worrying about the namespace
translate <project> back to <project junk...>
with the new code:
pom="""
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
<modelVersion>4.0.0</modelVersion>
</project>
"""
short_project="""<project>"""
long_project="""<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">"""
import re,sys
from xml.etree import ElementTree
# eliminate namespace specs
pom=re.compile('<project [^>]*>').sub(short_project,pom)
root = ElementTree.fromstring(pom)
ElementTree.dump(root)
print 1,root.findall('modelVersion')
print 2,root.findall('{http://maven.apache.org/POM/4.0.0}modelVersion')
mv=root.findall('modelVersion')
# restore the namespace specs
pom=ElementTree.tostring(root)
pom=re.compile(short_project).sub(long_project,pom)

Related

Unexpected results when parsing XML via lxml

The output of my xml parsing is not es expected.
The xml file
<?xml version="1.0"?>
<stationaer xsi:schemaLocation="http:/foo.bar" xmlns="http://foo.bar" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<einrichtung>
<name>Name</name>
</einrichtung>
<einrichtung>
<name>Name</name>
</einrichtung>
</stationaer>
I would expect to get something like root.tag == 'stationaer' and child.tag = 'einrichtung'.
See the outpout at the end.
This is the MWE
#!/usr/bin/env python3
import pathlib
import lxml
from lxml import etree
import pandas
xml_src = '''<?xml version="1.0"?>
<stationaer xsi:schemaLocation="http:/foo.bar" xmlns="http://foo.bar" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<einrichtung>
<name>Name</name>
</einrichtung>
<einrichtung>
<name>Name</name>
</einrichtung>
</stationaer>
'''
# tree = etree.parse(file_path)
# root = tree.getroot()
root = etree.fromstring(xml_src)
print(repr(root.tag))
print(repr(root.text))
child = root.getchildren()[0]
print(repr(child.tag))
print(repr(child.text))
The output for root is
'{http://foo.bar}stationaer'
'\n '
and for child
'{http://foo.bar}einrichtung'
'\n '
I don't understand what's going on here and why that URL is in the output.

This is actually not unexpected. The elements in the XML document are bound to the http://foo.bar default namespace. The namespace is declared by xmlns="http://foo.bar" on the root element and the declaration is inherited by all descendants.
The special notation with the namespace URI enclosed in curly braces ({http://foo.bar}stationaer) is never used in XML documents, but it is used by lxml and ElementTree when printing element (tag) names. It can also be used when searching or creating elements that belong to a namespace.
More information:
https://www.w3.org/TR/xml-names/
https://lxml.de/tutorial.html#namespaces
https://docs.python.org/3/library/xml.etree.elementtree.html#parsing-xml-with-namespaces

Issue with python script while parsing pom file in project

I'm having issue extracting version number using python script. Its returning none while running the script. Can someone help me on this ?
Python Script:
import xml.etree.ElementTree as ET
tree = ET.parse('pom.xml')
root = tree.getroot()
releaseVersion = root.find("version")
print(releaseVersion)
pom.xml
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns="http://maven.apache.org/POM/4.0.0"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 https://maven.apache.org/xsd/maven-4.0.0.xsd">
<artifactId>watcher</artifactId>
<version>0.0.1-SNAPSHOT</version>
<groupId>com.test</groupId>
<name>file</name>
<packaging>jar</packaging>
<parent>
<artifactId>spring-boot-starter-parent</artifactId>
<groupId>org.springframework.boot</groupId>
<relativePath/>
<version>2.6.1</version>
</parent>
</project>

You're not taking into account that all of your elements are in a default namespace defined by xmlns="http://maven.apache.org/POM/4.0.0" in your <project> element.
So you have to create your query with this namespace.
import xml.etree.ElementTree as ET
tree = ET.parse('pom.xml')
root = tree.getroot()
NS = { 'maven' : 'http://maven.apache.org/POM/4.0.0' }
releaseVersion = root.find("maven:version",NS)
print(releaseVersion.text)
Here NS = { ... } defines the namespace (in the following referred to by its prefix maven) used in the following XPath expression.

Your pom.xml has a namespace xmlns="http://maven.apache.org/POM/4.0.0" in the project tag.
If you must search with fullname, you need to follow {namespace}tag
>> root.find("{http://maven.apache.org/POM/4.0.0}version")
<Element '{http://maven.apache.org/POM/4.0.0}version' at 0x0000014635EC0A40>
But if you don't bother you can search with {*}tag
>> root.find("{*}version")
<Element '{http://maven.apache.org/POM/4.0.0}version' at 0x0000014635EC0A40>

Parsing XML with namespace

With this XML
<?xml version="1.0" encoding="UTF-8"?>
<Envelope>
<subject>Reference rates</subject>
<Sender>
<name>European Central Bank</name>
</Sender>
<Cube>
<Cube time='2013-12-20'>
<Cube currency='USD' rate='1.3655'/>
<Cube currency='JPY' rate='142.66'/>
</Cube>
</Cube>
</Envelope>
I can get the inner Cube tags like this
from xml.etree.ElementTree import ElementTree
t = ElementTree()
t.parse('eurofxref-daily.xml')
day = t.find('Cube/Cube')
print 'Day:', day.attrib['time']
for currency in day:
print currency.items()
Day: 2013-12-20
[('currency', 'USD'), ('rate', '1.3655')]
[('currency', 'JPY'), ('rate', '142.66')]
The problem is that the above XML is a cleaned version of the original file which has defined namespaces
<?xml version="1.0" encoding="UTF-8"?>
<gesmes:Envelope xmlns:gesmes="http://www.gesmes.org/xml/2002-08-01" xmlns="http://www.ecb.int/vocabulary/2002-08-01/eurofxref">
<gesmes:subject>Reference rates</gesmes:subject>
<gesmes:Sender>
<gesmes:name>European Central Bank</gesmes:name>
</gesmes:Sender>
<Cube>
<Cube time='2013-12-20'>
<Cube currency='USD' rate='1.3655'/>
<Cube currency='JPY' rate='142.66'/>
</Cube>
</Cube>
</gesmes:Envelope>
When I try to get the first Cube tag I get a None
t = ElementTree()
t.parse('eurofxref-daily.xml')
print t.find('Cube')
None
The root tag includes the namespace
root = t.getroot()
print 'root.tag:', root.tag
root.tag: {http://www.gesmes.org/xml/2002-08-01}Envelope
Its children also
for e in root.getchildren():
print 'e.tag:', e.tag
e.tag: {http://www.gesmes.org/xml/2002-08-01}subject
e.tag: {http://www.gesmes.org/xml/2002-08-01}Sender
e.tag: {http://www.ecb.int/vocabulary/2002-08-01/eurofxref}Cube
I can get the Cube tags if I include the namespace in the tag
day = t.find('{http://www.ecb.int/vocabulary/2002-08-01/eurofxref}Cube/{http://www.ecb.int/vocabulary/2002-08-01/eurofxref}Cube')
print 'Day: ', day.attrib['time']
Day: 2013-12-20
But that is really ugly. Apart from cleaning the file before processing or doing string manipulation is there an elegant way to handle it?

There's a more elegant way than including the whole namespace URI in the text of the query. For a python version that does not support the namespaces argument on ElementTree.find, lxml provides the missing functionality and is "mostly compatible" with xml.etree:
from lxml.etree import ElementTree
t = ElementTree()
t.parse('eurofxref-daily.xml')
namespaces = { "exr": "http://www.ecb.int/vocabulary/2002-08-01/eurofxref" }
day = t.find('exr:Cube', namespaces)
print day
Using the namespaces object, you can set it once and for all and then just use prefixes in your queries.
Here is the output:
$ python test.py
<Element '{http://www.ecb.int/vocabulary/2002-08-01/eurofxref}Cube' at 0x7fe0f95e3290>
If you find prefixes inelegant, then you have to work on a file without namespaces. Or there may be other tools out there that will "cheat" and match on local-name() even if namespaces are in effect but I don't use them.
In python 2.7 or python 3.3, or higher, you could use the same code as above but use xml.etree instead of lxml because they've added support for namespaces to these versions.

How to resolve external entities with xml.etree like lxml.etree

I have a script that parses XML using lxml.etree:
from lxml import etree
parser = etree.XMLParser(load_dtd=True, resolve_entities=True)
tree = etree.parse('main.xml', parser=parser)
I need load_dtd=True and resolve_entities=True be have &emptyEntry; from globals.xml resolved:
<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE map SYSTEM "globals.xml" [
<!ENTITY dirData "${DATADIR}">
]>
<map
xmlns:map="http://my.dummy.org/map"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsschemaLocation="http://my.dummy.org/map main.xsd"
>
&emptyEntry; <!-- from globals.xml -->
<entry><key>KEY</key><value>VALUE</value></entry>
<entry><key>KEY</key><value>VALUE</value></entry>
</map>
with globals.xml
<?xml version="1.0" encoding="UTF-8"?>
<!ENTITY emptyEntry "<entry></entry>">
Now I would like to move from non-standard lxml to standard xml.etree. But this fails with my file because the load_dtd=True and resolve_entities=True is not supported by xml.etree.
Is there an xml.etree-way to have these entities resolved?

My trick is to use the external program xmllint
proc = subprocess.Popen(['xmllint','--noent',fname],stdout=subprocess.PIPE)
output = proc.communicate()[0]
tree = ElementTree.parse(StringIO.StringIO(output))

lxml is a right tool for the job.
But, if you want to use stdlib, then be prepared for difficulties and take a look at XMLParser's UseForeignDTD method. Here's a good (but hacky) example: Python ElementTree support for parsing unknown XML entities?

Saving the XML document, breaks my XSI declaration

I have a question:
I am parsing an XML that has a namespace with a python xml parser ( beautifulsoup ), and when I save that xml the parser replaces: "xsi:" in the namespace with {http://www.w3.org/2001/XMLSchema-instance} how can I prevent him from doing that ?
Example:
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
Becomes:
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" {http://www.w3.org/2001/XMLSchema-instance}schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
Can anyone help me out with this ?
Regards,
Bojan

I've filed a bug for you. I've also committed a fix which will be in the next release of Beautiful Soup.

This is how I temporarily solved it.
soupOut = str(soup)
ns = re.search("<project [^>]* xmlns:xsi=\"(?P<ns>[^\"]*)\"[^>]*>",soupOut)
if ns:
soupOut = soupOut.replace("{%s}"%ns.group('ns'), 'xsi:')
file.write(soupOut)

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Python: ignoring namespaces in xml.etree.ElementTree? - python

Related

Unexpected results when parsing XML via lxml

Issue with python script while parsing pom file in project

Parsing XML with namespace

How to resolve external entities with xml.etree like lxml.etree

Saving the XML document, breaks my XSI declaration

Categories

Resources