How can I parse an XML document into a Python object?

How can I parse an XML document into a Python object? - python

I'm trying to consume an XML API. I'd like to have some Python objects that represent the XML data. I have several XSD and some example API responses from the documentation.
http://www.isan.org/schema/v1.11/common/common.xsd
http://www.isan.org/schema/v1.21/common/serial.xsd
http://www.isan.org/schema/v1.11/common/version.xsd
http://www.isan.org/ISAN/isan.xsd
http://www.isan.org/schema/v1.11/common/title.xsd
http://www.isan.org/schema/v1.11/common/externalid.xsd
http://www.isan.org/schema/v1.11/common/participant.xsd
http://www.isan.org/schema/v1.11/common/language.xsd
http://www.isan.org/schema/v1.11/common/country.xsd
Here's one example XML response:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<serial:serialHeaderType xmlns:isan="http://www.isan.org/ISAN/isan"
xmlns:title="http://www.isan.org/schema/v1.11/common/title"
xmlns:serial="http://www.isan.org/schema/v1.21/common/serial"
xmlns:externalid="http://www.isan.org/schema/v1.11/common/externalid"
xmlns:common="http://www.isan.org/schema/v1.11/common/common"
xmlns:participant="http://www.isan.org/schema/v1.11/common/participant"
xmlns:language="http://www.isan.org/schema/v1.11/common/language"
xmlns:country="http://www.isan.org/schema/v1.11/common/country">
<common:status>
<common:DataType>SERIAL_HEADER_TYPE</common:DataType>
<common:ISAN root="0000-0002-3B9F"/>
<common:WorkStatus>ACTIVE</common:WorkStatus>
</common:status>
<serial:SerialHeaderId root="0000-0002-3B9F"/>
<serial:MainTitles>
<title:TitleDetail>
<title:Title>Braquo</title:Title>
<title:Language>
<language:LanguageLabel>French</language:LanguageLabel>
<language:LanguageCode>
<language:CodingSystem>ISO639_2</language:CodingSystem>
<language:ISO639_2Code>FRE</language:ISO639_2Code>
</language:LanguageCode>
</title:Language>
<title:TitleKind>ORIGINAL</title:TitleKind>
</title:TitleDetail>
</serial:MainTitles>
<serial:TotalEpisodes>11</serial:TotalEpisodes>
<serial:TotalSeasons>0</serial:TotalSeasons>
<serial:MinDuration>
<common:TimeUnit>MIN</common:TimeUnit>
<common:TimeValue>45</common:TimeValue>
</serial:MinDuration>
<serial:MaxDuration>
<common:TimeUnit>MIN</common:TimeUnit>
<common:TimeValue>144</common:TimeValue>
</serial:MaxDuration>
<serial:MinYear>2009</serial:MinYear>
<serial:MaxYear>2009</serial:MaxYear>
<serial:MainParticipantList>
<participant:Participant>
<participant:FirstName>Frédéric</participant:FirstName>
<participant:LastName>Schoendoerffer</participant:LastName>
<participant:RoleCode>DIR</participant:RoleCode>
</participant:Participant>
<participant:Participant>
<participant:FirstName>Karole</participant:FirstName>
<participant:LastName>Rocher</participant:LastName>
<participant:RoleCode>ACT</participant:RoleCode>
</participant:Participant>
</serial:MainParticipantList>
<serial:CompanyList>
<common:Company>
<common:CompanyKind>PRO</common:CompanyKind>
<common:CompanyName>R.T.B.F.</common:CompanyName>
</common:Company>
<common:Company>
<common:CompanyKind>PRO</common:CompanyKind>
<common:CompanyName>Capa Drama</common:CompanyName>
</common:Company>
<common:Company>
<common:CompanyKind>PRO</common:CompanyKind>
<common:CompanyName>Marathon</common:CompanyName>
</common:Company>
</serial:CompanyList>
</serial:serialHeaderType>
I tried simply ignoring the XSD and using lxml.objectify on the XML I'd get from the API. I had a problem with namespaces. Having to refer to every child node with its explicit namespace was a real pain and doesn't make for readable code.
from lxml import objectify
obj = objectify.fromstring(response)
print obj.MainTitles.TitleDetail
# This will fail to find the element because you need to specify the namespace
print obj.MainTitles['{http://www.isan.org/schema/v1.11/common/title}TitleDetail']
# Or something like that, I couldn't get it to work, and I'd much rather use attributes and not specify the namespace
So then I tried generateDS to create some Python class definitions for me. I've lost the error messages that this attempt gave me but I couldn't get it to work. It would generate a module for each XSD that I gave it but it wouldn't parse the example XML.
I'm now trying pyxb and this seems much nicer so far. It's generating nicer definitions than generateDS (splitting them into multiple, reusable modules) but it won't parse the XML:
from models import serial
obj = serial.CreateFromDocument(response)
Traceback (most recent call last):
...
File "/vagrant/isan/isan.py", line 58, in lookup
return serial.CreateFromDocument(resp.content)
File "/vagrant/isan/models/serial.py", line 69, in CreateFromDocument
instance = handler.rootObject()
File "/home/vagrant/venv/lib/python2.7/site-packages/pyxb/binding/saxer.py", line 285, in rootObject
raise pyxb.UnrecognizedDOMRootNodeError(self.__rootObject)
UnrecognizedDOMRootNodeError: <pyxb.utils.saxdom.Element object at 0x2b53664dc850>
The unrecognised node is the <serial:serialHeaderType> node from the example. Looking at the pyxb source it seems that this error comes about "if the top-level element got processed as a DOM instance" but I don't know what this means or how to prevent it.
I've run out of steam for trying to explore this, I don't know what to do next.

I have had a lot of luck parsing XML into Python using Beautiful Soup. It is extremely straightforward, and they provide pretty strong documentation. Check it out here:
http://www.crummy.com/software/BeautifulSoup/
http://www.crummy.com/software/BeautifulSoup/bs4/doc/

UnrecognizedDOMRootNodeError indicates that PyXB could not locate the element in a namespace for which it has bindings registered. In your case it fails on the first element, which is {http://www.isan.org/schema/v1.21/common/serial}serialHeaderType.
The schema for that namespace defines a complexType named SerialHeaderType but does not define an element with the name serialHeaderType. In fact it defines no top-level elements. So PyXB can't recognize it, and the XML does not validate.
Either there's an additional schema for the namespace that you'll need to locate which provides elements, or the message you're sending really doesn't validate. That may be because somebody's expecting a implicit mapping from a complex type to an element with that type, or because it's a fragment that would normally be found within some other element where that QName is a member element name.
UPDATE: You can hand-craft an element in that namespace by adding the
following to the generated bindings in serial.py:
serialHeaderType = pyxb.binding.basis.element(pyxb.namespace.ExpandedName(Namespace, 'serialHeaderType'), SerialHeaderType)
Namespace.addCategoryObject('elementBinding', serialHeaderType.name().localName(), serialHeaderType)
If you do that, you won't get the UnrecognizedDOMRootNodeError but you
will get an IncompleteElementContentError at:
<common:status>
<common:DataType>SERIAL_HEADER_TYPE</common:DataType>
<common:ISAN root="0000-0002-3B9F"/>
<common:WorkStatus>ACTIVE</common:WorkStatus>
</common:status>
which provides the following details:
The containing element {http://www.isan.org/schema/v1.11/common/common}status is defined at common.xsd[243:3].
The containing element type {http://www.isan.org/schema/v1.11/common/common}StatusType is defined at common.xsd[289:1]
The {http://www.isan.org/schema/v1.11/common/common}StatusType automaton is not in an accepting state.
Any accepted content has been stored in instance
The following element and wildcard content would be accepted:
An element {http://www.isan.org/schema/v1.11/common/common}ActiveISAN per common.xsd[316:3]
An element {http://www.isan.org/schema/v1.11/common/common}MatchingISANs per common.xsd[317:3]
An element {http://www.isan.org/schema/v1.11/common/common}Description per common.xsd[318:3]
No content remains unconsumed
Reviewing the schema confirms that, at a minimum, a {http://www.isan.org/schema/v1.11/common/common}Description element is missing but required.
So it seems these documents are not meant to be validated, and PyXB is
probably the wrong technology to use.

Related

Import XML namespace in python

I'm a total noob in coding, I study IT, and have a school project in which I must convert a .txt file in a XML file. I have managed to create a tree, and subelements, but a must put some XML namespace in the code. Because the XML file in the end must been opened in a program that gives you a table of the informations, and something more. But without the scheme from the XML namespace it won't open anything. Can someone help me in how to put a .xsd in my code?
This is the scheme:
http://www.pufbih.ba/images/stories/epp_docs/PaketniUvozObrazaca_V1_0.xsd
Example of XML file a must create:
http://www.pufbih.ba/images/stories/epp_docs/4200575050089_1022.xml
And in the first row a have the scheme that I must input: "urn:PaketniUvozObrazaca_V1_0.xsd"
This is the code a created so far:
import xml.etree.ElementTree as xml
def GenerateXML(GIP1022):
root=xml.Element("PaketniUvozObrazaca")
p1=xml.Element("PodaciOPoslodavcu")
root.append(p1)
jib=xml.SubElement(p1,"JIBPoslodavca")
jib.text="4254160150005"
pos=xml.SubElement(p1,"NazivPoslodavca")
pos.text="MOJATVRTKA d.o.o. ORAŠJE"
zah=xml.SubElement(p1,"BrojZahtjeva")
zah.text="8"
datz=xml.SubElement(p1,"DatumPodnosenja")
datz.text="2021-01-01"
tree=xml.ElementTree(root)
with open(GIP1022,"wb") as files:
tree.write(files)
if __name__=="__main__":
GenerateXML("primjer.xml")

The official documentation is not super explicit as to how one works with namespaces in ElementTree, but the core of it is that ElementTree takes a very fundamental(ist) approach: instead of manipulating namespace prefixes / aliases, elementtree uses Clark's Notation.
So e.g.
<bar xmlns="foo">
or
<x:bar xmlns:x="foo">
(the element bar in the foo namespace) would be written
{foo}bar
>>> tostring(Element('{foo}bar'), encoding='unicode')
'<ns0:bar xmlns:ns0="foo" />'
alternatively (and sometimes more conveniently for authoring and manipulating) you can use QName objects which can either take a Clark's notation tag name, or separately take a namespace and a tag name:
>>> tostring(Element(QName('foo', 'bar')), encoding='unicode')
'<ns0:bar xmlns:ns0="foo" />'
So while ElementTree doesn't have a namespace object per-se you can create namespaced object like this, probably via a helper partially applying QName:
>>> root = Element(ns("PaketniUvozObrazaca"))
>>> SubElement(root, ns("PodaciOPoslodavcu"))
<Element <QName '{urn:PaketniUvozObrazaca_V1_0.xsd}PodaciOPoslodavcu'> at 0x7f502481bdb0>
>>> tostring(root, encoding='unicode')
'<ns0:PaketniUvozObrazaca xmlns:ns0="urn:PaketniUvozObrazaca_V1_0.xsd"><ns0:PodaciOPoslodavcu /></ns0:PaketniUvozObrazaca>'
Now there are a few important considerations here:
First, as you can see the prefix when serialising is arbitrary, this is in keeping with ElementTree's fundamentalist approach to XML (the prefix should not matter), but it has since grown a "register_namespace" global function which allows registering specific prefixes:
>>> register_namespace('xxx', 'urn:PaketniUvozObrazaca_V1_0.xsd')
>>> tostring(root, encoding='unicode')
'<xxx:PaketniUvozObrazaca xmlns:xxx="urn:PaketniUvozObrazaca_V1_0.xsd"><xxx:PodaciOPoslodavcu /></xxx:PaketniUvozObrazaca>'
you can also pass a single default_namespace to (some) serialization function to specify the, well, default namespace:
>>> tostring(root, encoding='unicode', default_namespace='urn:PaketniUvozObrazaca_V1_0.xsd')
'<PaketniUvozObrazaca xmlns="urn:PaketniUvozObrazaca_V1_0.xsd"><PodaciOPoslodavcu /></PaketniUvozObrazaca>'
A second, possibly larger, issue is that ElementTree does not support validation.
The Python standard library does not provide support for any validating parser or tree builder, whether DTD, rng, xml schema, anything. Not by default, and not optionally.
lxml is probably the main alternative supporting validation (of multiple types of schema), its core API follows ElementTree but extends it in multiple ways and directions (including much more precise namespace prefix support, and prefix round-tripping). But even then the validation is (AFAIK) mostly explicit, at least when generating / serializing documents.

What you want is to add a default namespace declaration (xmlns="urn:PaketniUvozObrazaca_V1_0.xsd") to the root element. I have edited the code in the question to show you how this can be done.
import xml.etree.ElementTree as ET
def GenerateXML(GIP1022):
# Create the PaketniUvozObrazaca root element in the urn:PaketniUvozObrazaca_V1_0.xsd namespace
root = ET.Element("{urn:PaketniUvozObrazaca_V1_0.xsd}PaketniUvozObrazaca")
# Add subelements
p1 = ET.Element("PodaciOPoslodavcu")
root.append(p1)
jib = ET.SubElement(p1,"JIBPoslodavca")
jib.text = "4254160150005"
pos = ET.SubElement(p1,"NazivPoslodavca")
pos.text = "MOJATVRTKA d.o.o. ORAŠJE"
zah = ET.SubElement(p1,"BrojZahtjeva")
zah.text = "8"
datz = ET.SubElement(p1,"DatumPodnosenja")
datz.text = "2021-01-01"
# Make urn:PaketniUvozObrazaca_V1_0.xsd the default namespace (no prefix)
ET.register_namespace("", "urn:PaketniUvozObrazaca_V1_0.xsd")
# Prettify output (requires Python 3.9)
ET.indent(root)
tree = ET.ElementTree(root)
with open(GIP1022,"wb") as files:
tree.write(files)
if __name__=="__main__":
GenerateXML("primjer.xml")
Contents of primjer.xml:
<PaketniUvozObrazaca xmlns="urn:PaketniUvozObrazaca_V1_0.xsd">
<PodaciOPoslodavcu>
<JIBPoslodavca>4254160150005</JIBPoslodavca>
<NazivPoslodavca>MOJATVRTKA d.o.o. ORAŠJE</NazivPoslodavca>
<BrojZahtjeva>8</BrojZahtjeva>
<DatumPodnosenja>2021-01-01</DatumPodnosenja>
</PodaciOPoslodavcu>
</PaketniUvozObrazaca>
Note that only the root element is explicitly bound to a namespace in the code. The subelements do not need to be in a namespace when they are added. The end result is an XML document (primjer.xml) where all elements belong to the same default namespace.
The above is not the only way to create an element in a namespace. For example, instead of the {namespace-uri}name notation, the QName class can be used. See https://stackoverflow.com/a/58678592/407651.

The tree.write() method takes a default_namespace argument.
What happens if you change that line to the following?
tree.write(files, default_namespace="urn:PaketniUvozObrazaca_V1_0.xsd")

Python LXML create xml with specific namespace and structure

I am trying to create an XML export from a python application and need to structure the file in a specific way for the external recipient of the file.
The root node needs to be namespaced, but the child nodes should not.
The root node of should look like this:
<ns0:SalesInvoice_Custom_Xml xmlns:ns0="http://EDI-export/Invoice">...</ns0:SalesInvoice_Custom_Xml>
I have tried to generate the same node using the lxml library on Python 2.7, but it does not behave as expected.
Here is the code that should generate the root node:
def create_edi(self, document):
_logger.info("INFO: Started creating EDI invoice with invoice number %s", document.number)
rootNs = etree.QName("ns0", "SalesInvoice_Custom_Xml")
doc = etree.Element(rootNs, nsmap={
'ns0': "http://EDI-export/Invoice"
})
This gives the following output
<ns1:SalesInvoice_Custom_Xml xmlns:ns0="http://EDI-export/Invoice" xmlns:ns1="ns0">...</ns1:SalesInvoice_Custom_Xml>
What should I change in my code to get lxml to generate the correct root node

You need to use
rootNs = etree.QName(ns0, "SalesInvoice_Custom_Xml")
with
ns0 = "http://EDI-export/Invoice"
The whole data structure itself is agnostic of any namespace mapping you might apply later, i. e. the tags know the true namespaces (e. g. http://EDI-export/Invoice) not their mapping (e. g. ns0).
Later, when you finally serialize this into a string, a namespace mapping is needed. Then (and only then) a namespace mapping will be used.
Also, after parsing you can ask the etree object what namespace mapping had been found during parsing. But that is not part of the structure, it is just additional information about how the structure had been encoded as string. Consider that the following two XMLs are logically equal:
<x:tag xmlns:x="namespace"></x:tag>
and
<y:tag xmlns:y="namespace"></y:tag>
After parsing, their structures will be equal, their namespace mappings will not.

How to access attribute value in xml containing namespace using ElementTree in python

XML file:
<?xml version="1.0" encoding="iso-8859-1"?>
<rdf:RDF xmlns:cim="http://iec.ch/TC57/2008/CIM-schema-cim13#" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
<cim:Terminal rdf:ID="A_T1">
<cim:Terminal.ConductingEquipment rdf:resource="#A_EF2"/>
<cim:Terminal.ConnectivityNode rdf:resource="#A_CN1"/>
</cim:Terminal>
</rdf:RDF>
I want to get the Terminal.ConnnectivityNode element's attribute value and Terminal element's attribute value also as output from the above xml. I have tried in below way!
Python code:
from elementtree import ElementTree as etree
tree= etree.parse(r'N:\myinternwork\files xml of bus systems\cimxmleg.xml')
cim= "{http://iec.ch/TC57/2008/CIM-schema-cim13#}"
rdf= "{http://www.w3.org/1999/02/22-rdf-syntax-ns#}"
Appending the below line to the code
print tree.find('{0}Terminal'.format(cim)).attrib
output1: : Is as expected
{'{http://www.w3.org/1999/02/22-rdf-syntax-ns#}ID': 'A_T1'}
If we Append with this below line to above code
print tree.find('{0}Terminal'.format(cim)).attrib['rdf:ID']
output2: key error in rdf:ID
If we append with this below line to above code
print tree.find('{0}Terminal/{0}Terminal.ConductivityEquipment'.format(cim))
output3 None
How to get output2 as A_T1 & Output3 as #A_CN1?
What is the significance of {0} in the above code, I have found that it must be used through net didn't get the significance of it?

First off, the {0} you're wondering about is part of the syntax for Python's built-in string formatting facility. The Python documentation has a fairly comprehensive guide to the syntax. In your case, it simply gets substituted by cim, which results in the string {http://iec.ch/TC57/2008/CIM-schema-cim13#}Terminal.
The problem here is that ElementTree is a bit silly about namespaces. Instead of being able to simply supply the namespace prefix (like cim: or rdf:), you have to supply it in XPath form. This means that rdf:id becomes {http://www.w3.org/1999/02/22-rdf-syntax-ns#}ID, which is very clunky.
ElementTree does support a way to use the namespace prefix for finding tags, but not for attributes. This means you'll have to expand rdf: to {http://www.w3.org/1999/02/22-rdf-syntax-ns#} yourself.
In your case, it could look as following (note also that ID is case-sensitive):
tree.find('{0}Terminal'.format(cim)).attrib['{0}ID'.format(rdf)]
Those substitutions expand to:
tree.find('{http://iec.ch/TC57/2008/CIM-schema-cim13#}Terminal').attrib['{http://www.w3.org/1999/02/22-rdf-syntax-ns#}ID']
With those hoops jumped through, it works (note that the ID is A_T1 and not #A_T1, however). Of course, this is all really annoying to have to deal with, so you could also switch to lxml and have it mostly handled for you.
Your third case doesn't work simply because 1) it's named Terminal.ConductingEquipment and not Terminal.ConductivityEquipment, and 2) if you really want A_CN1 and not A_EF2, that's the ConnectivityNode and not the ConductingEquipment. You can get A_CN1 with tree.find('{0}Terminal/{0}Terminal.ConnectivityNode'.format(cim)).attrib['{0}resource'.format(rdf)].

python xml.etree - how to search on more than one attribute

I have an XML file with this line:
<op type="create" file="C:/Users/mureadr/Desktop/A/HMI_FORGF/bld/armle-v7/release/SimpleNetwork/Makefile" found="0"/>
I want to use xml.etree to search on more than one attribute:
result = tree.search('.//op[#type="create" #file="c:/Users/mureadr/Desktop/A/HMI_FORGF/bld/armle-v7/release/HmiLogging/Makefile"]')
But I get an error
raise SyntaxError("invalid predicate")
I tried this (added and), still got same error
'.//op[#type="create" and #file="c:/Users/mureadr/Desktop/A/HMI_FORGF/bld/armle-v7/release/HmiLogging/Makefile"]'
Tried adding &&, still got same error
'.//op[#type="create" && #file="c:/Users/mureadr/Desktop/A/HMI_FORGF/bld/armle-v7/release/HmiLogging/Makefile"]'
Finally, tried &, still got same error
'.//op[#type="create" & #file="c:/Users/mureadr/Desktop/A/HMI_FORGF/bld/armle-v7/release/HmiLogging/Makefile"]'
I'm guessing that this is a limitation of xml.etree.
Probably I shouldn't use it in the future, but I'm almost done with my project.
For N attributes, how do I use etree.xml to be able to search on all N attributes?

You can use multiple square brackets in succession
'.//op[#type="create"][#file="/some/path"]'
UPDATE: I see that you are using python's xml.etree module. I am not sure if the above answer is valid for that module (It has extremely limited support for XPath). I'd suggest using the go-to library for all XML tasks -- LXML. If you'd use lxml, it would be simply doc.xpath(".//op[..][..]")

XPath Namespace issues

When connecting to an XMPP server I get one of these two responses:
<stream:features xmlns:stream="http://etherx.jabber.org/streams">
<mechanisms xmlns="urn:ietf:params:xml:ns:xmpp-sasl">
<mechanism>PLAIN</mechanism>
<mechanism>DIGEST MD5</mechanism>
</mechanisms>
<auth xmlns="http://jabber.org/features/iq-auth" />
<register xmlns="http://jabber.org/features/iq-register" />
</stream:features>
OR
<stream:features>
<mechanisms xmlns="urn:ietf:params:xml:ns:xmpp-sasl">
<mechanism>DIGEST-MD5</mechanism>
<mechanism>PLAIN</mechanism>
<mechanism>ANONYMOUS</mechanism>
<mechanism>CRAM-MD5</mechanism>
</mechanisms>
<compression xmlns="http://jabber.org/features/compress">
<method>zlib</method>
</compression>
<auth xmlns="http://jabber.org/features/iq-auth" />
<register xmlns="http://jabber.org/features/iq-register" />
</stream:features>
When trying to parse the second one with my code, I get this error:
namespace error : Namespace prefix stream on features is not defined
<stream:features><mechanisms xmlns="urn:ietf:params:xml:ns:xmpp-sasl"><mechanism
^
Here is my code:
mechanisms = []
xmlParsed = libxml2.parseDoc(xmlResponse)
xpathContext = xmlParsed.xpathNewContext()
xpathContext.xpathRegisterNs('urn','http://etherx.jabber.org/streams')
xpathContext.xpathRegisterNs('sasl', 'urn:ietf:params:xml:ns:xmpp-sasl')
nodes = xpathContext.xpathEval("//urn:stream/features/sasl:mechanisms/sasl:mechanism/text()|//urn:features/sasl:mechanisms/sasl:mechanism/text()")
for node in nodes:
mechanisms.append(str(node))
What am I doing wrong and how can I right it? Please don't say, use the XMPP libraries or such, I'm not trying to write an entire XMPP client. I just want enough code to register as a user first.

Please don't write your own XMPP library from scratch. There are already many available from a list on xmpp.org. In particular, for Python, try SleekXMPP.
For example, using parseDoc isn't going to work; you'll need to parse XML incrementally. The missing prefix definition for "stream" in "stream:features" is a symptom of this sort of problem.

I think the error is reported for the <stream:features> tag saying that the prefix stream is not defined.
<stream:features> indicates that the features tag is under a namespace represented by prefix stream and in your xml fragment there is no such namespace declared.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

How can I parse an XML document into a Python object? - python

I have had a lot of luck parsing XML into Python using Beautiful Soup. It is extremely straightforward, and they provide pretty strong documentation. Check it out here: http://www.crummy.com/software/BeautifulSoup/ http://www.crummy.com/software/BeautifulSoup/bs4/doc/

Related

Import XML namespace in python

Python LXML create xml with specific namespace and structure

How to access attribute value in xml containing namespace using ElementTree in python

python xml.etree - how to search on more than one attribute

XPath Namespace issues

Categories

Resources