Access XML element by name - python

Im using xml.etree.ElementTree to parse my XML data. I'm trying to get the text value of <Name>
This is my code.
for Content in Zone[0]:
print(Content.find('Name').text)
It is returning as NoneObject
However, I am able to access the Element using
for Content in Zone[0]:
print(Content[12].text)
I think I might have found the problem as when I print the tags out, it doesn't display Name and instead it displays {http://schemas.datacontract.org/2004/07/}Name. What is the extra data infront of the tag name?

Your XML is likely has default namespace -namespace declared with no prefix-. Notice that descendant elements without prefix inherits default namespace implicitly. You can handle default namespace the way you would handle prefixed namespaces; just map a prefix to the namespace URI, and use that prefix along with element name to reference element in namespace :
namespaces = {'d': 'http://schemas.datacontract.org/2004/07/'}
for Content in Zone[0]:
print(Content.find('d:Name', namespaces).text)

Related

Parse XML where an element prefix is defined in the same element

I have an XML file with an element which looks like this:
<wrapping_element>
<prefix:tag xmlns:prefix="url">value</prefix:tag>
</wrapping_element>
I want to get this element, so I am using lxml as follows:
wrapping_element.find('prefix:tag', wrapping_element.nsmap)
but I get the following error: SyntaxError: prefix 'prefix' not found in prefix map because prefix is not defined before reaching this element in the XML.
Is there a way to get the element anyway?
Like mentioned in the comments, you could use local-name() to circumvent the namespace, but it's easy enough to just handle the namespace directly in the xpath() call...
from lxml import etree
tree = etree.parse("input.xml")
wrapping_element = tree.xpath("/wrapping_element")[0]
tag = wrapping_element.xpath("x:tag", namespaces={"x": "url"})[0]
print(etree.tostring(tag, encoding="unicode"))
This will print...
<prefix:tag xmlns:prefix="url">value</prefix:tag>
Notice I used the prefix x. The prefix can match the prefix in the XML file, but it doesn't have to; only the namespace URIs need to match exactly.
See here for more details: http://lxml.de/xpathxslt.html#namespaces-and-prefixes

Adding Namespaces to a DOM Element python

I want to produce this xml file with python and minidom:
<xml vesion="1.0" encoding="utf-8?>
<package name="Operation" xmlns="http://www.modelIL.eu/types-2.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation = "http://www.modelIL.eu/types-2.0 modelIL-package-2.0.xsd">
</package>
I have wrote this:
import xml.dom.minidom as dom
document = dom.Document()
root_xml = document.createElement("package")
root_xml.setAttribute("name", "Operation")
root_xml.setAttributeNS("", "xmlns", "http://www.modelIL.eu/types-2.0")
root_xml.setAttributeNS("xmls", "xsi", "http://www.w3.org/2001/XMLSchema-instance")
root_xml.setAttribute("xsi:schemaLocation", "http://www.modelIL.eu/types-2.0 modelIL-package-2.0.xsd")
root = document.appendChild(root_xml)
print(document.toprettyxml(indent(" "))
But the output I get is this one:
<xml vesion="1.0" ?>
<package name="Operation" xmlns="http://www.modelIL.eu/types-2.0" xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation = "http://www.modelIL.eu/types-2.0 modelIL-package-2.0.xsd">
</package>
Why do I have only xsi and not xmlns:xsi? Did I forget something?
Full disclosure: I do not use minidom for XML, I use lxml and on top of that I do not use XML that often, so I hope my answer will be useful.
One might expect that by setting an attribute with a particular namespace, there would not be any need to explicitly state a prefix to appear before the local name in the final, written XML document - after all, it should be possible to detect that a namespace has been employed and that a prefix is required in the full attribute name so that the attribute is recognised as being associated with that namespace. Unfortunately, we do not seem to have that luxury and must explicitly specify a prefix as part of the qualified name when setting such an attribute
Python and XML: An Introduction (skip to the Atributes part)
This should solve your problem:
root_xml.setAttributeNS("xmls", "xmlns:xsi", "http://www.w3.org/2001/XMLSchema-instance")
As you know the setAttributeNS method takes three arguments: namespaceURI, qualifiedName, value. The attribute is than added if the element has no attribute with the same namespaceURI and localname - we get the localname by doing a split on qualifiedName using the function _nssplit. Otherwise the method tries to update the value of the attribute.
However the name of the attribute is a combination of prefix (the part of the qualifiedName before the colon punctuation) and localname "%s:%s" % (prefix, localName). If no prefix is present the name of the attribute is the same as the qualifiedName argument.
If you do not care for the namespaceURI of your attributes you could achieve the same result using only the setAttribute method like you did with the first and last attribute. In that case the method will look for an attribute with the same attribute name. If it finds one, it will try to overwrite it's value.
I do have one question: why do you bind root = document.appendChild(root_xml)? Is it to avoid the return value in your REPL? That I understand.

Parse xml node children list by tag with any prefix in python

I would like to got an list of items, independently of their prefixes.
My goal is to create method (please notice me if something like this exist), who has one argument(tagname) and returns list of elements.
For example in case of argument 'item' <media:item>, <abc:item> should be part of result of this function.
It would be nice to use lxml but it can be any python DOM-based parser.
Unfortunatuly i can't assume, that xml has xmlns, that's why i need to parse for any prefix.
lxml is a good option primarily because it has full support for XPath version 1.0 via the xpath() method besides many other useful utilities. And in XPath, you can ignore element namespace by using local-name() as mentioned in the comment.
lxml also able to deal with undefined prefix by setting parameter recover=True, but now comes the catch; local-name() still return prefixed 'tagname' for element having undefined prefix. There is a hacky way to match this kind of element, by finding element which local name contains :tagname -or to be more precise, find element which local name ends with :tagname instead of contains-.
The following is a working example for demo. The demo uses two expressions combined with logical operator or; one for dealing with element having undefined prefix, and the other for element without prefix or with properly defined prefix :
from lxml import etree
xml = """<root foo="bar">
<media:item>a</media:item>
<abc:item>b</abc:item>
<foo:item>c</foo:item>
<item>d</item>
</root>"""
parser = etree.XMLParser(recover=True)
tree = etree.fromstring(xml, parser=parser)
tagname = "item"
#expression to match element undefined prefix
predicate1 = "contains(local-name(),':{0}')".format(tagname)
#expression to match element with properly defined prefix or with no prefix
predicate2 = "local-name()='{0}'".format(tagname)
elements = tree.xpath("//*[{0} or {1}]".format(predicate1, predicate2))
for e in elements:
print(etree.tostring(e))
output :
<media:item>a</media:item>
<abc:item>b</abc:item>
<foo:item>c</foo:item>
<item>d</item>

python etree with xpath and namespaces with prefix

I can't find info, how to parse my XML with namespace:
I have this xml:
<par:Request xmlns:par="http://somewhere.net/actual">
<par:actual>blabla</par:actual>
<par:documentType>string</par:documentType>
</par:Request>
And tried to parse it:
dom = ET.parse(u'C:\\filepath\\1.xml')
rootxml = dom.getroot()
for subtag in rootxml.xpath(u'//par:actual'):
#do something
print(subtag)
And got exception, because it doesn't know about namespace prefix.
Is there best way to solve that problem, counting that script will not know about file it going to parse and tag is going to search for?
Searching web and stackoverflow I found, that if I will add there:
namespace = {u'par': u"http://somewhere.net/actual"}
for subtag in rootxml.xpath(u'//par:actual', namespaces=namespace):
#do something
print(subtag)
That works. Perfect. But I don't know which XML I will parse, and searching tag (such as //par:actual) is also unknown to my script. So, I need to find way to extract namespace from XML somehow.
I found a lot of ways, how to extract namespace URI, such as:
print(rootxml.tag)
print(rootxml.xpath('namespace-uri(.)'))
print(rootxml.xpath('namespace-uri(/*)'))
But how should I extract prefix to create dictionary which ElementTree wants from me? I don't want to use regular expression monster over xml body to extract prefix, I believe there have to exist supported way for that, isn't it?
And maybe there have to exist some methods for me to extract by ETree namespace from XML as dictionary (as ETree wants!) without hands manipulation?
You cannot rely on the namespace declarations on the root element: there is no guarantee that the declarations will even be there, or that the document will have the same prefix for the same namespace throughout.
Assuming you are going to have some way of passing the tag you want to search (because you say it is not known by your script), you should also provide a way to pass a namespace mapping as well. Or use the James Clark notation, like {http://somewhere.net/actual}actual (the ETXPath has support for this syntax, whereas "normal" xpath does not, but you can also use other methods like .findall() if you don't need full xpath)
If you don't care for the prefix at all, you could also use the local-name() function in xpath, eg. //*[local-name()="actual"] (but you won't be "really" sure it's the right "actual")
Oh, I found it.
After we do that:
dom = ET.parse(u'C:\\filepath\\1.xml')
rootxml = dom.getroot()
Object rootxml contains dictionary nsmap, which contains all namespaces that I want.
So, simplest solution I've found:
dom = ET.parse(u'C:\\filepath\\1.xml')
rootxml = dom.getroot()
nss = rootxml.nsmap
for subtag in rootxml.xpath(u'//par:actual', namespaces=nss):
#do something
print(subtag)
That works.
UPD: that works if user understand what means 'par' in XML he works with. For example, comparing supposed namespace with existing namespace before any other operations.
Still, I like much variant with XPath that understands {...}actual, that was what I tried to achieve.
With Python 3.8.2 I found this question with the same issue.
This is the solution I found, put the namespace in the XPath query. (Between the {})
ApplicationArea = BOD_IN_tree.find('.//ApplicationArea', ns)
if(ApplicationArea is None):
ApplicationArea = BOD_IN_tree.find('.//{http://www.defaultNamespace.com/2}ApplicationArea', ns)
I search for the element without the namespace, then search again if it's not found. I have no control over the inbound documents, some have namespaces, some do not.
I hope this helps!

Parse xbrl file in python

I am working on a xml parser.
The goal is to parse a number of different xml files where prefixes and tags remain consistent but namespaces change.
I am hence trying either:
to parse the xml just by <prefix:tags> without resolving (replacing) the prefix with the namespace. The prefixes remain unchanged from document to document.
to load automatically the namespaces so that the identifier (<prefix:tag>) could be replaced with the proper namespace.
just parse the xml by tag
I have tried with xml.etree.ElementTree.
I also had a look at lxml
I did not find any configuration option of the XMLParser in lxml that could help me out although here I could read an answer where the author suggests that lxml should be able to collect namespaces for me automatically.
Interestingly, parsed_file = etree.XML(file) fails with the error:
lxml.etree.XMLSyntaxError: Start tag expected, '<' not found, line 1, column 1
One example of the files I would like to parse is here
Do not care about ns prefixes, care about complete namespaces
Sometime, people do care about those short prefixes and forgetting, the are of secondary importance. They are only short reference to fully qualified namespace. E.g.
xmlns:trw="http://www.trw.com/20131231"
in xml means, from now on, the "trw:" stands for fully qualified namespace "http://www.trw.com/20131231". Note, that this prefix can be redefined to any other namespace in any following element and may get completely different meaning.
On the other hand, when you care about real meaning, what means here fully qualified namespace, you may think of "trw:row" as "{http://www.trw.com/20131231}row". This translated meaning will be reliable and will not change with prefix changes.
Parsing referred xml
The link to http://edgar.sec.gov/Archives/edgar/data/1267097/000104746914000925/trw-20131231.xml leads to an xml, which validates by xmlstarlet and which lxml is able to parse.
The error message you show is referring to very first character of the stream, so chances are you either met BOM byte in your file, or you are trying to read xml, which is gzipped and shall be decompressed first.
lxml and namespaces
lxml works with namespaces well. It allows you to use XPath expressions, which use namespaces. With controlling namspace prefix on output it is a bit more complex, as it is dependent on xmlns attributes, which are part of serialized document. If you want to modify the prefixes, you must somehow organize these xmlns attributes, often by moving all of the to the root element. At the same time, lxml keeps track of fully qualified namespace of each element, so at the moment of serialization, it will respect this full name as well as currently valid prefix for this namespace.
Handling these xmlna attributes is a bit of more code, refer to lxml documentation.
items = tree.xpath("*[local-name(.) = 'a_tag_goes_here']")
did the job. On top of that I had to browse the generated list items manually to define my other desired filtering functions.

Categories