How do I use ":" in XML element names using lxml? - python

How do I generate and parse XML like the following using lxml?
<s:Envelope xmlns:s="a" xmlns:a="http://www.w3.org/2005/08/addressing">
....
</s:Envelope>
I currently swap : with _ in the element names when I parse and generate XML, but it seems stupid.

It's not clear exactly what you're asking, but maybe this will help:
An element such as <s:Envelope> uses an XML namespace prefix. The prefix indicates that the Envelope element in this document belongs to the namespace a.
lxml represents a namespaced name by putting the namespace URI in braces in front of the local name, for example: {a}Envelope. Your example document is somewhat confusing because it also defines the a: namespace prefix, so:
a:Element is equivalent to {http://www.w3.org/2005/08/addressing}Element, and
s:Element is equivalent to {a}Element.
Many of the lxml functions let you provide a namespace prefix mapping. For example, to find the Envelope element in your document using XPath, you could do this:
import lxml.etree as etree
doc = etree.parse('mydocument.xml')
envelope = doc.xpath('//s:Envelope',
                     namespaces={'s': 'a'})
Note that this is exactly equivalent to:
envelope = doc.xpath('//x:Envelope',
                     namespaces={'x': 'a'})
That is, the namespace prefix doesn't have to match what is used in the source XML document; only the namespace URI itself matters.
You can read more about LXML and namespaces here.
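The question also asks about generating such a document. As a minimal sketch (not part of the original answer), the same {namespace}tag notation can be used when creating elements. The example uses the standard library's xml.etree.ElementTree so that it runs without extra packages; lxml.etree accepts the same notation, and its Element factory additionally takes an nsmap argument for choosing prefixes. The Action child element and its text are made up for illustration.

```python
import xml.etree.ElementTree as ET

S_NS = "a"  # namespace bound to the "s" prefix in the question
A_NS = "http://www.w3.org/2005/08/addressing"

# Ask the serializer to use these prefixes instead of ns0, ns1, ...
ET.register_namespace("s", S_NS)
ET.register_namespace("a", A_NS)

# Element names are written as {namespace}localname, never with a colon.
envelope = ET.Element(f"{{{S_NS}}}Envelope")
action = ET.SubElement(envelope, f"{{{A_NS}}}Action")  # hypothetical child
action.text = "example"

xml_text = ET.tostring(envelope, encoding="unicode")
print(xml_text)

# Parsing the result gives back the same qualified names.
parsed = ET.fromstring(xml_text)
print(parsed.tag)
```

So there should be no need to swap ":" with "_": the colon only appears in the serialized text, while the code works with namespace URIs.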

Related

Finding namespace URIs for lxml

I'm using lxml to parse XML product feeds with the following code:
namespace = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
data = [loc.text for loc in tree.xpath("//sm:urlset/sm:url/sm:loc",namespaces=namespace)]
This works with the majority of feeds that I am using as input, but occasionally I find a feed with additional namespaces, such as the one below:
<?xml version="1.0" encoding="UTF-8"?>
<urlset
xmlns="https://www.sitemaps.org/schemas/sitemap/0.9"
xmlns:xsi="https://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="https://www.sitemaps.org/schemas/sitemap/0.9
https://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd">
<url>
<loc>https://www.example.com/</loc>
<priority>1.00</priority>
</url>
From what I've read I would need to add the additional namespace here (xmlns:xsi I guess) to the namespace dictionary to get my xpath to work with multiple namespaces.
However, this is not a long-term solution for me, as I might come across other differing namespaces in the future. Is there a way for me to search for, detect, or even delete the namespace? The element tree will always be the same, so my XPath wouldn't change.
Thanks
You shouldn't need to map the xsi prefix; that's only there for the xsi:schemaLocation attribute.
The difference between your current mapping and the input file is that there is an "s" in "https" in the default namespace of the XML.
One way to handle both namespace URIs (or really any other namespace URI that urlset might have) is to first get the namespace URI of the root element and then use that in your dict mapping...
from lxml import etree
tree = etree.parse("input.xml")
root_ns_uri = tree.xpath("namespace-uri()")
namespace = {"sm": root_ns_uri}
data = [loc.text for loc in tree.xpath("//sm:urlset/sm:url/sm:loc", namespaces=namespace)]
print(data)
prints...
['https://www.example.com/']
If urlset isn't always the root element, you may want to do something like this instead...
root_ns_uri = tree.xpath("namespace-uri(//*[local-name()='urlset'])")
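The same idea can be sketched without XPath, since the root element's tag already carries its namespace in {uri}name form. This version uses the standard library's ElementTree (lxml.etree exposes the same .tag format). root_namespace and extract_locs are hypothetical helper names, and the two inline feeds are made-up samples; the sketch assumes urlset is the root and is in some namespace.

```python
import xml.etree.ElementTree as ET

def root_namespace(root):
    """Return the namespace URI of an element, or "" if it has none."""
    if root.tag.startswith("{"):
        return root.tag[1:].split("}", 1)[0]
    return ""

def extract_locs(xml_text):
    """Collect <loc> texts, whatever namespace URI the feed declares."""
    root = ET.fromstring(xml_text)
    ns = {"sm": root_namespace(root)}  # assumes the root has a namespace
    # root is the urlset element, so search relative to it
    return [loc.text for loc in root.findall("sm:url/sm:loc", ns)]

http_feed = (
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    "<url><loc>https://www.example.com/</loc></url></urlset>"
)
https_feed = http_feed.replace("http://www.sitemaps", "https://www.sitemaps")

print(extract_locs(http_feed))
print(extract_locs(https_feed))
```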

Registering namespaces with lxml before parsing

I am using lxml to parse XML from an external service that has namespaces, but doesn't register them with xmlns. I am trying to register it by hand with register_namespace, but that doesn't seem to work.
from lxml import etree
xml = """
<Foo xsi:type="xsd:string">bar</Foo>
"""
etree.register_namespace('xsi', 'http://www.w3.org/2001/XMLSchema-instance')
el = etree.fromstring(xml) # lxml.etree.XMLSyntaxError: Namespace prefix xsi for type on Foo is not defined
What am I missing? Oddly enough, looking at the lxml source code to try and understand what I might be doing wrong, it seems as if the xsi namespace should already be there as one of the default namespaces.
When an XML document is parsed and then saved again, lxml does not change any prefixes (and register_namespace has no effect).
If your XML document does not declare its namespace prefixes, it is not namespace-well-formed. Using register_namespace before parsing cannot fix this.
register_namespace defines the prefixes to be used when serializing a newly created XML document.
Example 1 (without register_namespace):
from lxml import etree
el = etree.Element('{http://example.com}Foo')
print(etree.tostring(el).decode())
Output:
<ns0:Foo xmlns:ns0="http://example.com"/>
Example 2 (with register_namespace):
from lxml import etree
etree.register_namespace("abc", "http://example.com")
el = etree.Element('{http://example.com}Foo')
print(etree.tostring(el).decode())
Output:
<abc:Foo xmlns:abc="http://example.com"/>
Example 3 (without register_namespace, but with a "well-known" namespace associated with a conventional prefix):
from lxml import etree
el = etree.Element('{http://www.w3.org/2001/XMLSchema-instance}Foo')
print(etree.tostring(el).decode())
Output:
<xsi:Foo xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"/>
Namespace-well-formed XML that uses custom namespaces must also include the namespace declaration itself. Adding an xmlns in the first element is enough:
from lxml import etree
xml = """
<Foo xmlns:xsi='http://www.w3.org/2001/XMLSchema-instance' xsi:type='xsd:string'>bar</Foo>
"""
el = etree.fromstring(xml)
print(el)
So, technically, if your XML uses xsi but it does not contain the namespace declaration, it's not (namespace) well-formed XML.
See also How to restrict the value of an XML element using xsi:type in XSD?
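To make the difference concrete, here is a small sketch using the standard library's ElementTree, which behaves the same way in this respect (it raises ParseError where lxml raises XMLSyntaxError). It shows that registering the prefix does not rescue the undeclared document, while the declared one parses and exposes the attribute under its full namespace URI.

```python
import xml.etree.ElementTree as ET

XSI = "http://www.w3.org/2001/XMLSchema-instance"

undeclared = '<Foo xsi:type="xsd:string">bar</Foo>'
declared = f'<Foo xmlns:xsi="{XSI}" xsi:type="xsd:string">bar</Foo>'

# Registering the prefix only affects serialization, not parsing.
ET.register_namespace("xsi", XSI)
try:
    ET.fromstring(undeclared)
    parsed_undeclared = True
except ET.ParseError:
    parsed_undeclared = False

# With the xmlns:xsi declaration present, parsing succeeds and the
# attribute is stored under its {namespace}name key.
el = ET.fromstring(declared)
xsi_type = el.get(f"{{{XSI}}}type")
print(parsed_undeclared, xsi_type)
```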

Parse XML where an element prefix is defined in the same element

I have an XML file with an element which looks like this:
<wrapping_element>
<prefix:tag xmlns:prefix="url">value</prefix:tag>
</wrapping_element>
I want to get this element, so I am using lxml as follows:
wrapping_element.find('prefix:tag', wrapping_element.nsmap)
but I get the following error: SyntaxError: prefix 'prefix' not found in prefix map because prefix is not defined before reaching this element in the XML.
Is there a way to get the element anyway?
As mentioned in the comments, you could use local-name() to circumvent the namespace, but it's easy enough to just handle the namespace directly in the xpath() call...
from lxml import etree
tree = etree.parse("input.xml")
wrapping_element = tree.xpath("/wrapping_element")[0]
tag = wrapping_element.xpath("x:tag", namespaces={"x": "url"})[0]
print(etree.tostring(tag, encoding="unicode"))
This will print...
<prefix:tag xmlns:prefix="url">value</prefix:tag>
Notice I used the prefix x. The prefix can match the prefix in the XML file, but it doesn't have to; only the namespace URIs need to match exactly.
See here for more details: http://lxml.de/xpathxslt.html#namespaces-and-prefixes
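The same approach should also work with find(), as long as the mapping is supplied explicitly instead of taken from wrapping_element.nsmap. A minimal sketch with the standard library's ElementTree (lxml supports the same two forms); the prefix x is an arbitrary choice:

```python
import xml.etree.ElementTree as ET

xml_text = """<wrapping_element>
  <prefix:tag xmlns:prefix="url">value</prefix:tag>
</wrapping_element>"""
wrapping_element = ET.fromstring(xml_text)

# Option 1: our own prefix mapped to the namespace URI from the document.
by_prefix = wrapping_element.find("x:tag", {"x": "url"})

# Option 2: no prefix at all, using {namespace}localname notation.
by_uri = wrapping_element.find("{url}tag")

print(by_prefix.text, by_uri is by_prefix)
```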

Parse xml node children list by tag with any prefix in python

I would like to get a list of items, independently of their prefixes.
My goal is to create a method (please let me know if something like this already exists) that takes one argument (tagname) and returns a list of elements.
For example, with the argument 'item', both <media:item> and <abc:item> should be part of the result of this function.
It would be nice to use lxml, but it can be any Python DOM-based parser.
Unfortunately, I can't assume that the XML has xmlns declarations; that's why I need to parse for any prefix.
lxml is a good option, primarily because it fully supports XPath 1.0 via the xpath() method, in addition to many other useful utilities. In XPath, you can ignore an element's namespace by using local-name(), as mentioned in the comments.
lxml can also deal with undefined prefixes when the parser is created with recover=True, but here comes the catch: local-name() still returns the prefixed name (prefix:tagname) for an element with an undefined prefix. There is a hacky way to match such elements: look for elements whose local name contains :tagname (or, to be more precise, ends with :tagname rather than merely contains it).
The following is a working demo. It combines two expressions with the logical operator or: one for elements with an undefined prefix, and the other for elements with no prefix or with a properly declared prefix:
from lxml import etree
xml = """<root foo="bar">
<media:item>a</media:item>
<abc:item>b</abc:item>
<foo:item>c</foo:item>
<item>d</item>
</root>"""
parser = etree.XMLParser(recover=True)
tree = etree.fromstring(xml, parser=parser)
tagname = "item"
#expression to match element undefined prefix
predicate1 = "contains(local-name(),':{0}')".format(tagname)
#expression to match element with properly defined prefix or with no prefix
predicate2 = "local-name()='{0}'".format(tagname)
elements = tree.xpath("//*[{0} or {1}]".format(predicate1, predicate2))
for e in elements:
    print(etree.tostring(e))
output :
<media:item>a</media:item>
<abc:item>b</abc:item>
<foo:item>c</foo:item>
<item>d</item>
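For documents that do declare their prefixes, the same "any prefix" lookup can be sketched in plain Python by comparing local names. This uses the standard library's ElementTree, which (unlike lxml with recover=True) rejects undeclared prefixes, so it only covers the well-formed case. local_name and find_by_local_name are hypothetical helper names, and the urn:example:... namespace URIs are made up.

```python
import xml.etree.ElementTree as ET

def local_name(tag):
    """Strip a leading {namespace} part from an element tag, if present."""
    return tag.rsplit("}", 1)[-1]

def find_by_local_name(root, tagname):
    """Return all elements named tagname, in any namespace or none."""
    return [el for el in root.iter() if local_name(el.tag) == tagname]

xml_text = """<root xmlns:media="urn:example:media" xmlns:abc="urn:example:abc">
  <media:item>a</media:item>
  <abc:item>b</abc:item>
  <item>c</item>
  <other>d</other>
</root>"""

root = ET.fromstring(xml_text)
texts = [el.text for el in find_by_local_name(root, "item")]
print(texts)
```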

Parsing XML with namespace in Python via 'ElementTree'

I have the following XML which I want to parse using Python's ElementTree:
<rdf:RDF xml:base="http://dbpedia.org/ontology/"
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:owl="http://www.w3.org/2002/07/owl#"
xmlns:xsd="http://www.w3.org/2001/XMLSchema#"
xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
xmlns="http://dbpedia.org/ontology/">
<owl:Class rdf:about="http://dbpedia.org/ontology/BasketballLeague">
<rdfs:label xml:lang="en">basketball league</rdfs:label>
<rdfs:comment xml:lang="en">
a group of sports teams that compete against each other
in Basketball
</rdfs:comment>
</owl:Class>
</rdf:RDF>
I want to find all owl:Class tags and then extract the value of all rdfs:label instances inside them. I am using the following code:
import xml.etree.ElementTree as ET
tree = ET.parse("filename")
root = tree.getroot()
root.findall('owl:Class')
Because of the namespace, I am getting the following error.
SyntaxError: prefix 'owl' not found in prefix map
I tried reading the document at http://effbot.org/zone/element-namespaces.htm but I am still not able to get this working since the above XML has multiple nested namespaces.
Kindly let me know how to change the code to find all the owl:Class tags.
You need to give the .find(), findall() and iterfind() methods an explicit namespace dictionary:
namespaces = {'owl': 'http://www.w3.org/2002/07/owl#'} # add more as needed
root.findall('owl:Class', namespaces)
Prefixes are only looked up in the namespaces parameter you pass in. This means you can use any namespace prefix you like; the API splits off the owl: part, looks up the corresponding namespace URL in the namespaces dictionary, then changes the search to look for the XPath expression {http://www.w3.org/2002/07/owl#}Class instead. You can use the same syntax yourself too of course:
root.findall('{http://www.w3.org/2002/07/owl#}Class')
Also see the Parsing XML with Namespaces section of the ElementTree documentation.
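Putting this together for the document in the question, a minimal sketch that collects the rdfs:label text inside every owl:Class could look like this. The XML is a shortened copy of the question's sample, embedded as a string so that the example is self-contained.

```python
import xml.etree.ElementTree as ET

xml_text = """<rdf:RDF xml:base="http://dbpedia.org/ontology/"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:owl="http://www.w3.org/2002/07/owl#"
    xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
    xmlns="http://dbpedia.org/ontology/">
  <owl:Class rdf:about="http://dbpedia.org/ontology/BasketballLeague">
    <rdfs:label xml:lang="en">basketball league</rdfs:label>
  </owl:Class>
</rdf:RDF>"""

# One entry per prefix used in the search expressions below.
namespaces = {
    "owl": "http://www.w3.org/2002/07/owl#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
}

root = ET.fromstring(xml_text)
labels = [
    label.text
    for owl_class in root.findall("owl:Class", namespaces)
    for label in owl_class.findall("rdfs:label", namespaces)
]
print(labels)
```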
If you can switch to the lxml library things are better; that library supports the same ElementTree API, but collects namespaces for you in .nsmap attribute on elements and generally has superior namespaces support.
Here's how to do this with lxml without having to hard-code the namespaces or scan the text for them (as Martijn Pieters mentions):
from lxml import etree
tree = etree.parse("filename")
root = tree.getroot()
root.findall('owl:Class', root.nsmap)
UPDATE:
5 years later I'm still running into variations of this issue. lxml helps as I showed above, but not in every case. The commenters may have a valid point regarding this technique when it comes to merging documents, but I think most people are having difficulty simply searching documents.
Here's another case and how I handled it:
<?xml version="1.0" ?><Tag1 xmlns="http://www.mynamespace.com/prefix">
<Tag2>content</Tag2></Tag1>
xmlns without a prefix means that unprefixed tags get this default namespace. This means when you search for Tag2, you need to include the namespace to find it. However, lxml creates an nsmap entry with None as the key, and I couldn't find a way to search for it. So, I created a new namespace dictionary like this
namespaces = {}
# response uses a default namespace, and tags don't mention it
# create a new ns map using an identifier of our choice
for k, v in root.nsmap.items():
    if not k:
        namespaces['myprefix'] = v
e = root.find('myprefix:Tag2', namespaces)
Note: This is an answer useful for Python's ElementTree standard library without using hardcoded namespaces.
To extract the namespace prefixes and URIs from XML data, you can use the ElementTree.iterparse function, parsing only namespace start events (start-ns):
>>> from io import StringIO
>>> from xml.etree import ElementTree
>>> my_schema = u'''<rdf:RDF xml:base="http://dbpedia.org/ontology/"
... xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
... xmlns:owl="http://www.w3.org/2002/07/owl#"
... xmlns:xsd="http://www.w3.org/2001/XMLSchema#"
... xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
... xmlns="http://dbpedia.org/ontology/">
...
... <owl:Class rdf:about="http://dbpedia.org/ontology/BasketballLeague">
... <rdfs:label xml:lang="en">basketball league</rdfs:label>
... <rdfs:comment xml:lang="en">
... a group of sports teams that compete against each other
... in Basketball
... </rdfs:comment>
... </owl:Class>
...
... </rdf:RDF>'''
>>> my_namespaces = dict([
... node for _, node in ElementTree.iterparse(
... StringIO(my_schema), events=['start-ns']
... )
... ])
>>> from pprint import pprint
>>> pprint(my_namespaces)
{'': 'http://dbpedia.org/ontology/',
'owl': 'http://www.w3.org/2002/07/owl#',
'rdf': 'http://www.w3.org/1999/02/22-rdf-syntax-ns#',
'rdfs': 'http://www.w3.org/2000/01/rdf-schema#',
'xsd': 'http://www.w3.org/2001/XMLSchema#'}
Then the dictionary can be passed as argument to the search functions:
root.findall('owl:Class', my_namespaces)
I've been using similar code to this and have found it's always worth reading the documentation... as usual!
findall() will only find elements which are direct children of the current tag. So, not really ALL.
It might be worth your while trying to get your code working with the following, especially if you're dealing with big and complex XML files, so that sub-sub-elements (etc.) are also included.
If you know yourself where elements are in your xml, then I suppose it'll be fine! Just thought this was worth remembering.
root.iter()
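A small sketch of the difference, using a made-up three-level document: findall() with a bare tag name only sees direct children, while a ".//" path or iter() also reaches deeper descendants.

```python
import xml.etree.ElementTree as ET

# <c> appears once as a direct child of <a> and once nested inside <b>.
root = ET.fromstring("<a><b><c/></b><c/></a>")

direct = root.findall("c")         # direct children only
descendants = root.findall(".//c") # all descendants, at any depth
via_iter = list(root.iter("c"))    # same elements, walking the whole subtree

print(len(direct), len(descendants), len(via_iter))
```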
ref: https://docs.python.org/3/library/xml.etree.elementtree.html#finding-interesting-elements
"Element.findall() finds only elements with a tag which are direct children of the current element. Element.find() finds the first child with a particular tag, and Element.text accesses the element’s text content. Element.get() accesses the element’s attributes:"
To get the namespace in its namespace format, e.g. {myNameSpace}, you can do the following:
import re
root = tree.getroot()
ns = re.match(r'{.*}', root.tag).group(0)
This way, you can use it later on in your code to find nodes, e.g. using an f-string (Python 3.6+).
link = root.find(f"{ns}link")
This is basically Davide Brunato's answer; however, I found that his answer had serious problems with the default namespace being the empty string, at least on my Python 3.6 installation. The function I distilled from his code, and that worked for me, is the following:
from io import StringIO
from xml.etree import ElementTree
def get_namespaces(xml_string):
    namespaces = dict([
        node for _, node in ElementTree.iterparse(
            StringIO(xml_string), events=['start-ns']
        )
    ])
    namespaces["ns0"] = namespaces[""]
    return namespaces
where ns0 is just a placeholder for the empty namespace and you can replace it by any random string you like.
If I then do:
my_namespaces = get_namespaces(my_schema)
root.findall('ns0:SomeTagWithDefaultNamespace', my_namespaces)
It also produces the correct answer for tags using the default namespace as well.
My solution is based on @Martijn Pieters' comment:
register_namespace only influences serialisation, not search.
So the trick here is to use different dictionaries for serialization and for searching.
namespaces = {
'': 'http://www.example.com/default-schema',
'spec': 'http://www.example.com/specialized-schema',
}
Now, register all namespaces so that they are used when writing:
for name, value in namespaces.items():
    ET.register_namespace(name, value)
For searching (find(), findall(), iterfind()) we need a non-empty prefix. Pass these functions a modified dictionary (here I modify the original dictionary, but this must be done only after the namespaces are registered).
namespaces['default'] = namespaces['']
Now, the functions from the find() family can be used with the default prefix:
print(root.find('default:myelem', namespaces))
but
tree.write(destination)
does not use any prefixes for elements in the default namespace.
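A self-contained sketch of this two-dictionary idea, under the assumption that the document has a single default namespace (the URI is the example one from above, and the myelem content is made up):

```python
import xml.etree.ElementTree as ET

DEFAULT_NS = "http://www.example.com/default-schema"
xml_text = f'<root xmlns="{DEFAULT_NS}"><myelem>hello</myelem></root>'
root = ET.fromstring(xml_text)

# Serialization side: an empty prefix makes this the default namespace
# when the tree is written out.
ET.register_namespace("", DEFAULT_NS)

# Search side: find() and friends get a separate mapping with a
# non-empty prefix of our choice.
search_ns = {"default": DEFAULT_NS}
found = root.find("default:myelem", search_ns)

serialized = ET.tostring(root, encoding="unicode")
print(found.text)
print(serialized)
```

Keeping the search mapping separate from the registered prefixes avoids having to modify one dictionary at exactly the right moment.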
