Stop ElementTree removing namespace from elements [duplicate] - python

Some of my elements in the xml file I am parsing have their own xmlns attribute, but whenever I parse and write the file back, the xmlns are removed and instead I get a ns3: prefix and a new namespace is added at the top.
The head of the XML file I'm reading:
<oval_definitions xmlns="http://oval.mitre.org/XMLSchema/oval-definitions-5" xmlns:oval="http://oval.mitre.org/XMLSchema/oval-common-5" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://oval.mitre.org/XMLSchema/oval-common-5 http://oval.mitre.org/language/download/schema/version5.8/ovaldefinition/complete/oval-common-schema.xsd http://oval.mitre.org/XMLSchema/oval-definitions-5 http://oval.mitre.org/language/download/schema/version5.8/ovaldefinition/complete/oval-definitions-schema.xsd http://oval.mitre.org/XMLSchema/oval-definitions-5#windows http://oval.mitre.org/language/download/schema/version5.8/ovaldefinition/complete/windows-definitions-schema.xsd">
The head of the output I get:
<oval_definitions xmlns="http://oval.mitre.org/XMLSchema/oval-definitions-5" xmlns:ns3="http://oval.mitre.org/XMLSchema/oval-definitions-5#windows" xmlns:oval="http://oval.mitre.org/XMLSchema/oval-common-5" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://oval.mitre.org/XMLSchema/oval-common-5 http://oval.mitre.org/language/download/schema/version5.8/ovaldefinition/complete/oval-common-schema.xsd http://oval.mitre.org/XMLSchema/oval-definitions-5 http://oval.mitre.org/language/download/schema/version5.8/ovaldefinition/complete/oval-definitions-schema.xsd http://oval.mitre.org/XMLSchema/oval-definitions-5#windows http://oval.mitre.org/language/download/schema/version5.8/ovaldefinition/complete/windows-definitions-schema.xsd">
My namespace declarations:
ET.register_namespace('', "http://oval.mitre.org/XMLSchema/oval-definitions-5")
ET.register_namespace('oval', "http://oval.mitre.org/XMLSchema/oval-common-5")
ET.register_namespace('xsi', "http://www.w3.org/2001/XMLSchema-instance")
ET.register_namespace('xsi:schemaLocation', "http://oval.mitre.org/XMLSchema/oval-common-5 http://oval.mitre.org/language/download/schema/version5.8/ovaldefinition/complete/oval-common-schema.xsd http://oval.mitre.org/XMLSchema/oval-definitions-5 http://oval.mitre.org/language/download/schema/version5.8/ovaldefinition/complete/oval-definitions-schema.xsd http://oval.mitre.org/XMLSchema/oval-definitions-5#windows http://oval.mitre.org/language/download/schema/version5.8/ovaldefinition/complete/windows-definitions-schema.xsd")
What I want:
<registry_state xmlns="http://oval.mitre.org/XMLSchema/oval-definitions-5#windows" id="oval:mil.disa.fso.windows:ste:397100" version="2" comment="Reg_Dword type and value equals 0">
What I'm getting now:
<ns3:registry_state comment="Reg_Dword type and value equals 0" id="oval:mil.disa.fso.windows:ste:397100" version="2">
How can I get the xmlns= attribute back into my elements and out of the head of the document?

oval: and ns3: are namespace prefixes, not namespaces.
Namespace prefixes themselves are insignificant; they derive meaning only from the namespace URI (e.g. http://oval.mitre.org/XMLSchema/oval-definitions-5#windows) to which they are bound. No compliant XML processor will care about the specific namespace prefix (only the namespace URI to which it is bound), and neither should you or the software you write.
Similarly, whether a namespace is applied through a default namespace declaration or through an explicit namespace prefix is a difference that makes no difference (assuming equivalence is preserved with respect to the inheritance of default namespaces by descendant elements).
See also: Why does xml package modify my xml file in Python3?
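To make the point above concrete, here is a minimal sketch using a trimmed-down stand-in for the OVAL file (the shortened document and the variable names are assumptions for illustration, not the asker's real data). It round-trips the document through ElementTree and compares the expanded element names before and after; the prefixes in the serialized text may change, while the names a namespace-aware consumer sees should not.

```python
import xml.etree.ElementTree as ET

WIN_NS = "http://oval.mitre.org/XMLSchema/oval-definitions-5#windows"

# Trimmed-down stand-in for the OVAL file from the question (assumption).
source = (
    '<oval_definitions xmlns="http://oval.mitre.org/XMLSchema/oval-definitions-5">'
    '<registry_state xmlns="%s" id="oval:mil.disa.fso.windows:ste:397100" version="2"/>'
    '</oval_definitions>' % WIN_NS
)

original = ET.fromstring(source)

# ElementTree chooses its own prefixes when it serializes the tree again.
rewritten_text = ET.tostring(original, encoding="unicode")
rewritten = ET.fromstring(rewritten_text)

# Expanded "{uri}local" names before and after the round trip.
tags_before = [el.tag for el in original.iter()]
tags_after = [el.tag for el in rewritten.iter()]

print(rewritten_text)
print(tags_before == tags_after)
```

If the receiving tool really does depend on a particular prefix layout, that is a limitation of the tool rather than of the rewritten XML.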

Related

Import XML namespace in python

I'm a total noob at coding. I study IT and have a school project in which I must convert a .txt file into an XML file. I have managed to create a tree and subelements, but I must put an XML namespace in the code, because the resulting XML file must be opened in a program that shows a table of the information, among other things. Without the schema from the XML namespace it won't open anything. Can someone help me with how to put an .xsd in my code?
This is the schema:
http://www.pufbih.ba/images/stories/epp_docs/PaketniUvozObrazaca_V1_0.xsd
Example of the XML file I must create:
http://www.pufbih.ba/images/stories/epp_docs/4200575050089_1022.xml
And in the first row it has the schema that I must include: "urn:PaketniUvozObrazaca_V1_0.xsd"
This is the code I have created so far:
import xml.etree.ElementTree as xml
def GenerateXML(GIP1022):
    root=xml.Element("PaketniUvozObrazaca")
    p1=xml.Element("PodaciOPoslodavcu")
    root.append(p1)
    jib=xml.SubElement(p1,"JIBPoslodavca")
    jib.text="4254160150005"
    pos=xml.SubElement(p1,"NazivPoslodavca")
    pos.text="MOJATVRTKA d.o.o. ORAŠJE"
    zah=xml.SubElement(p1,"BrojZahtjeva")
    zah.text="8"
    datz=xml.SubElement(p1,"DatumPodnosenja")
    datz.text="2021-01-01"
    tree=xml.ElementTree(root)
    with open(GIP1022,"wb") as files:
        tree.write(files)
if __name__=="__main__":
    GenerateXML("primjer.xml")
The official documentation is not super explicit as to how one works with namespaces in ElementTree, but the core of it is that ElementTree takes a very fundamental(ist) approach: instead of manipulating namespace prefixes / aliases, ElementTree uses Clark's notation.
So e.g.
<bar xmlns="foo">
or
<x:bar xmlns:x="foo">
(the element bar in the foo namespace) would be written
{foo}bar
>>> tostring(Element('{foo}bar'), encoding='unicode')
'<ns0:bar xmlns:ns0="foo" />'
alternatively (and sometimes more conveniently for authoring and manipulating) you can use QName objects which can either take a Clark's notation tag name, or separately take a namespace and a tag name:
>>> tostring(Element(QName('foo', 'bar')), encoding='unicode')
'<ns0:bar xmlns:ns0="foo" />'
So while ElementTree doesn't have a namespace object per se, you can create namespaced elements like this, probably via a helper partially applying QName:
>>> from functools import partial
>>> ns = partial(QName, 'urn:PaketniUvozObrazaca_V1_0.xsd')
>>> root = Element(ns("PaketniUvozObrazaca"))
>>> SubElement(root, ns("PodaciOPoslodavcu"))
<Element <QName '{urn:PaketniUvozObrazaca_V1_0.xsd}PodaciOPoslodavcu'> at 0x7f502481bdb0>
>>> tostring(root, encoding='unicode')
'<ns0:PaketniUvozObrazaca xmlns:ns0="urn:PaketniUvozObrazaca_V1_0.xsd"><ns0:PodaciOPoslodavcu /></ns0:PaketniUvozObrazaca>'
Now there are a few important considerations here:
First, as you can see the prefix when serialising is arbitrary, this is in keeping with ElementTree's fundamentalist approach to XML (the prefix should not matter), but it has since grown a "register_namespace" global function which allows registering specific prefixes:
>>> register_namespace('xxx', 'urn:PaketniUvozObrazaca_V1_0.xsd')
>>> tostring(root, encoding='unicode')
'<xxx:PaketniUvozObrazaca xmlns:xxx="urn:PaketniUvozObrazaca_V1_0.xsd"><xxx:PodaciOPoslodavcu /></xxx:PaketniUvozObrazaca>'
You can also pass a single default_namespace to (some) serialization functions to specify the, well, default namespace:
>>> tostring(root, encoding='unicode', default_namespace='urn:PaketniUvozObrazaca_V1_0.xsd')
'<PaketniUvozObrazaca xmlns="urn:PaketniUvozObrazaca_V1_0.xsd"><PodaciOPoslodavcu /></PaketniUvozObrazaca>'
A second, possibly larger, issue is that ElementTree does not support validation.
The Python standard library does not provide support for any validating parser or tree builder, whether DTD, RELAX NG, XML Schema, anything. Not by default, and not optionally.
lxml is probably the main alternative supporting validation (of multiple types of schema), its core API follows ElementTree but extends it in multiple ways and directions (including much more precise namespace prefix support, and prefix round-tripping). But even then the validation is (AFAIK) mostly explicit, at least when generating / serializing documents.
What you want is to add a default namespace declaration (xmlns="urn:PaketniUvozObrazaca_V1_0.xsd") to the root element. I have edited the code in the question to show you how this can be done.
import xml.etree.ElementTree as ET
def GenerateXML(GIP1022):
    # Create the PaketniUvozObrazaca root element in the urn:PaketniUvozObrazaca_V1_0.xsd namespace
    root = ET.Element("{urn:PaketniUvozObrazaca_V1_0.xsd}PaketniUvozObrazaca")
    # Add subelements
    p1 = ET.Element("PodaciOPoslodavcu")
    root.append(p1)
    jib = ET.SubElement(p1,"JIBPoslodavca")
    jib.text = "4254160150005"
    pos = ET.SubElement(p1,"NazivPoslodavca")
    pos.text = "MOJATVRTKA d.o.o. ORAŠJE"
    zah = ET.SubElement(p1,"BrojZahtjeva")
    zah.text = "8"
    datz = ET.SubElement(p1,"DatumPodnosenja")
    datz.text = "2021-01-01"
    # Make urn:PaketniUvozObrazaca_V1_0.xsd the default namespace (no prefix)
    ET.register_namespace("", "urn:PaketniUvozObrazaca_V1_0.xsd")
    # Prettify output (requires Python 3.9)
    ET.indent(root)
    tree = ET.ElementTree(root)
    with open(GIP1022,"wb") as files:
        tree.write(files)
if __name__=="__main__":
    GenerateXML("primjer.xml")
Contents of primjer.xml:
<PaketniUvozObrazaca xmlns="urn:PaketniUvozObrazaca_V1_0.xsd">
<PodaciOPoslodavcu>
<JIBPoslodavca>4254160150005</JIBPoslodavca>
<NazivPoslodavca>MOJATVRTKA d.o.o. ORAŠJE</NazivPoslodavca>
<BrojZahtjeva>8</BrojZahtjeva>
<DatumPodnosenja>2021-01-01</DatumPodnosenja>
</PodaciOPoslodavcu>
</PaketniUvozObrazaca>
Note that only the root element is explicitly bound to a namespace in the code. The subelements do not need to be in a namespace when they are added. The end result is an XML document (primjer.xml) where all elements belong to the same default namespace.
The above is not the only way to create an element in a namespace. For example, instead of the {namespace-uri}name notation, the QName class can be used. See https://stackoverflow.com/a/58678592/407651.
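The note about subelements can be checked with a small sketch (variable names here are illustrative assumptions). In memory, the child created without a namespace has a plain tag; it only ends up in the default namespace once the document is serialized and parsed again, because the xmlns declaration on the root is inherited by its descendants.

```python
import xml.etree.ElementTree as ET

NS = "urn:PaketniUvozObrazaca_V1_0.xsd"

# Serialize NS without a prefix, as in the answer above.
ET.register_namespace("", NS)

root = ET.Element("{%s}PaketniUvozObrazaca" % NS)
child = ET.SubElement(root, "PodaciOPoslodavcu")  # no namespace given

# Tag of the child as it exists in the in-memory tree.
in_memory_tag = child.tag

# Tag of the same child after writing the XML out and reading it back.
xml_text = ET.tostring(root, encoding="unicode")
reparsed_tag = ET.fromstring(xml_text)[0].tag

print(in_memory_tag)
print(reparsed_tag)
```

This matters mainly if the same tree is searched again before it is written out: at that stage the child would have to be looked up by its unqualified name.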
The tree.write() method takes a default_namespace argument.
What happens if you change that line to the following?
tree.write(files, default_namespace="urn:PaketniUvozObrazaca_V1_0.xsd")
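As a hedged sketch of what to expect from default_namespace: it appears to work only when every element in the tree is namespace-qualified, and ElementTree raises a ValueError if an unqualified name is present. The helper q() below is a hypothetical convenience, not part of ElementTree, and the example uses tostring() (which accepts default_namespace from Python 3.8) instead of tree.write() so that no file is needed.

```python
import xml.etree.ElementTree as ET

NS = "urn:PaketniUvozObrazaca_V1_0.xsd"

def q(local_name):
    """Hypothetical helper: build the '{uri}local' name for NS."""
    return "{%s}%s" % (NS, local_name)

# Every element is created in the namespace.
root = ET.Element(q("PaketniUvozObrazaca"))
p1 = ET.SubElement(root, q("PodaciOPoslodavcu"))
ET.SubElement(p1, q("BrojZahtjeva")).text = "8"

qualified = ET.tostring(root, encoding="unicode", default_namespace=NS)
print(qualified)

# Adding one unqualified element makes the same call fail.
ET.SubElement(p1, "DatumPodnosenja")
try:
    ET.tostring(root, encoding="unicode", default_namespace=NS)
    mixed_error = None
except ValueError as exc:
    mixed_error = str(exc)
print(mixed_error)
```

So with the code from the question, where no element is in a namespace, adding default_namespace alone would be expected to raise rather than produce the desired output.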

How can I register multiple default namespaces when modifying an xml file using python?

I have looked through many namespace documents on here and am only able to slightly relate to a few. In my document, I have 3 defaults and only one colon-style xmlns, for example:
xmlns="someurlNo1"
xmlns:spatial="someurlNo2"
xmlns="someurlNo3"
xmlns="someurlNo4"
From what I have read, it seems that I have 3 defaults (please correct me if I am interpreting this wrong). When I modify my base XML and then write my new XML, the only way I can keep the first two prefixes, ns0 and ns1, from showing up is to comment out the last two registrations. Everything that belongs to the last two defaults is then labeled with "ns2" and "ns3", even if I register them all like this:
ET.register_namespace('',"someurlNo0") #ns0
ET.register_namespace('spatial',"someurlNo1") #ns1
#ET.register_namespace('',"someurlNo2") #ns2
#ET.register_namespace('',"someurlNo3") #ns3
Does anyone know how to register the last two default namespaces correctly? When I leave the last two not commented out, ns0 and ns2 appear where they should, and while all the ns3s disappear, the default is no longer equal to "someurlNo3".
This answer: https://stackoverflow.com/a/43530940 has so far been the most helpful explanation to me in illustrating that there may be multiple defaults that travel down (which I believe is true for my document), but I am still unsure how to properly register them. Any ideas would be much appreciated!
Here is what the top part of my xml looks like that includes all 4 namespaces. I'd rather spare you from seeing all 3k lines but if needed I can share more:
<?xml version='1.0' encoding='UTF-8' standalone='no'?>
<sbml xmlns="http://www.sbml.org/sbml/level3/version1/core" level="3" spatial:required="true" version="1" xmlns:spatial="http://www.sbml.org/sbml/level3/version1/spatial/version1">
<notes>
<body xmlns="http://www.w3.org/1999/xhtml">
<p>Exported by VCell 7.3</p>
</body>
</notes>
<model areaUnits="um2" extentUnits="molecules" id="_zero_6_29_21_Phase1_cellularConcAgain_Spatial" lengthUnits="um" name="06_29_21_Phase1_cellularConcAgain_Spatial" substanceUnits="molecules" timeUnits="s" volumeUnits="um3">
<spatial:geometry xmlns:spatial="http://www.sbml.org/sbml/level3/version1/spatial/version1" id="vcell" spatial:coordinateSystem="cartesian" spatial:id="vcell">
<spatial:listOfCoordinateComponents>
<spatial:coordinateComponent id="x" spatial:id="x" spatial:type="cartesianX" spatial:unit="um">
<spatial:boundaryMin id="Xmin" spatial:id="Xmin" spatial:value="0.0"/>
<spatial:boundaryMax id="Xmax" spatial:id="Xmax" spatial:value="1.6"/>
</spatial:coordinateComponent>
<spatial:coordinateComponent id="y" spatial:id="y" spatial:type="cartesianY" spatial:unit="um">
<spatial:boundaryMin id="Ymin" spatial:id="Ymin" spatial:value="0.0"/>
<spatial:boundaryMax id="Ymax" spatial:id="Ymax" spatial:value="3.5"/>
</spatial:coordinateComponent>
</spatial:listOfCoordinateComponents>
<spatial:listOfDomains>
<spatial:domain id="chr0" spatial:domainType="domainType_chr" spatial:id="chr0">
<spatial:listOfInteriorPoints>
<spatial:interiorPoint spatial:coord1="0.0" spatial:coord2="0.0" spatial:coord3="5.0"/>
</spatial:listOfInteriorPoints>
</spatial:domain>
</spatial:listOfDomains>
<spatial:listOfDomainTypes>
<spatial:domainType id="domainType_chr" spatial:id="domainType_chr" spatial:spatialDimensions="3"/>
</spatial:listOfDomainTypes>
<spatial:listOfGeometryDefinitions>
<spatial:analyticGeometry id="Analytic_Geometry1640227629" spatial:id="Analytic_Geometry1640227629" spatial:isActive="true">
<spatial:listOfAnalyticVolumes>
<spatial:analyticVolume spatial:domainType="domainType_chr" spatial:functionType="layered" spatial:id="chr" spatial:ordinal="0">
<math xmlns="http://www.w3.org/1998/Math/MathML">
<apply>
<neq/>
<cn> 0 </cn>
<cn> 1 </cn>
</apply>
</math>
You're misunderstanding namespaces. That a namespace is "default" is not a property of that namespace, it's a property of the XML in that location.
Don't get distracted. Give all namespace URIs that you're going to use a prefix, done.
import xml.etree.ElementTree as ET
ns = {
    'a': 'someurlNo0',
    'b': 'someurlNo1',
    'c': 'someurlNo2',
    'd': 'someurlNo3',
    'e': 'someurlNo3', # same URI as above, perfectly legal
}
tree = ET.parse('path/to.xml')
tree.findall("./a:node/b:node/c:node/d:node/e:node", namespaces=ns)
It does not even need to be the same prefix as in your XML. In fact, avoid that. Give namespace URIs prefixes that make reading your code easy. All that matters in the end is the namespace URI, the prefix is ephemeral.
As long as the prefixes in your code resolve to the actual namespace URI of the targeted nodes, you're good. It does not matter if the namespaces are default at the specific location in the XML.
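Applied to a cut-down version of the SBML file from the question, the approach might look like the sketch below. The prefixes core and html are arbitrary names chosen for this example; only the URIs come from the document.

```python
import xml.etree.ElementTree as ET

# Cut-down copy of the SBML header from the question.
doc = """<sbml xmlns="http://www.sbml.org/sbml/level3/version1/core" level="3" spatial:required="true" version="1" xmlns:spatial="http://www.sbml.org/sbml/level3/version1/spatial/version1">
  <notes>
    <body xmlns="http://www.w3.org/1999/xhtml">
      <p>Exported by VCell 7.3</p>
    </body>
  </notes>
</sbml>"""

# Prefixes used only inside this script; they need not match the file.
ns = {
    "core": "http://www.sbml.org/sbml/level3/version1/core",
    "html": "http://www.w3.org/1999/xhtml",
    "spatial": "http://www.sbml.org/sbml/level3/version1/spatial/version1",
}

root = ET.fromstring(doc)

# Both <notes> and <p> sit under a "default" namespace in the file,
# but the lookup only cares about the namespace URIs.
p = root.find("core:notes/html:body/html:p", namespaces=ns)
note_text = p.text

# Attribute names are not resolved through the namespaces mapping,
# so the expanded "{uri}name" form is used here.
required = root.get("{%s}required" % ns["spatial"])

print(note_text)
print(required)
```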

Python LXML create xml with specific namespace and structure

I am trying to create an XML export from a python application and need to structure the file in a specific way for the external recipient of the file.
The root node needs to be namespaced, but the child nodes should not.
The root node should look like this:
<ns0:SalesInvoice_Custom_Xml xmlns:ns0="http://EDI-export/Invoice">...</ns0:SalesInvoice_Custom_Xml>
I have tried to generate the same node using the lxml library on Python 2.7, but it does not behave as expected.
Here is the code that should generate the root node:
def create_edi(self, document):
    _logger.info("INFO: Started creating EDI invoice with invoice number %s", document.number)
    rootNs = etree.QName("ns0", "SalesInvoice_Custom_Xml")
    doc = etree.Element(rootNs, nsmap={
        'ns0': "http://EDI-export/Invoice"
    })
This gives the following output
<ns1:SalesInvoice_Custom_Xml xmlns:ns0="http://EDI-export/Invoice" xmlns:ns1="ns0">...</ns1:SalesInvoice_Custom_Xml>
What should I change in my code to get lxml to generate the correct root node?
You need to use
rootNs = etree.QName(ns0, "SalesInvoice_Custom_Xml")
with
ns0 = "http://EDI-export/Invoice"
The data structure itself is agnostic of any namespace mapping you might apply later, i.e. the tags know their true namespaces (e.g. http://EDI-export/Invoice), not the prefixes mapped to them (e.g. ns0).
Later, when you finally serialize this into a string, a namespace mapping is needed. Then (and only then) a namespace mapping will be used.
Also, after parsing you can ask the etree object what namespace mapping had been found during parsing. But that is not part of the structure, it is just additional information about how the structure had been encoded as string. Consider that the following two XMLs are logically equal:
<x:tag xmlns:x="namespace"></x:tag>
and
<y:tag xmlns:y="namespace"></y:tag>
After parsing, their structures will be equal, their namespace mappings will not.

How to access attribute value in xml containing namespace using ElementTree in python

XML file:
<?xml version="1.0" encoding="iso-8859-1"?>
<rdf:RDF xmlns:cim="http://iec.ch/TC57/2008/CIM-schema-cim13#" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
<cim:Terminal rdf:ID="A_T1">
<cim:Terminal.ConductingEquipment rdf:resource="#A_EF2"/>
<cim:Terminal.ConnectivityNode rdf:resource="#A_CN1"/>
</cim:Terminal>
</rdf:RDF>
I want to get the attribute value of the Terminal.ConnectivityNode element, and also the attribute value of the Terminal element, as output from the above XML. I have tried the following:
Python code:
from elementtree import ElementTree as etree
tree= etree.parse(r'N:\myinternwork\files xml of bus systems\cimxmleg.xml')
cim= "{http://iec.ch/TC57/2008/CIM-schema-cim13#}"
rdf= "{http://www.w3.org/1999/02/22-rdf-syntax-ns#}"
Appending the line below to the code:
print tree.find('{0}Terminal'.format(cim)).attrib
Output 1 is as expected:
{'{http://www.w3.org/1999/02/22-rdf-syntax-ns#}ID': 'A_T1'}
If we instead append this line to the code above:
print tree.find('{0}Terminal'.format(cim)).attrib['rdf:ID']
Output 2: KeyError for 'rdf:ID'
If we append this line to the code above:
print tree.find('{0}Terminal/{0}Terminal.ConductivityEquipment'.format(cim))
Output 3: None
How can I get A_T1 as output 2 and #A_CN1 as output 3?
What is the significance of {0} in the above code? I found online that it must be used, but I did not understand why.
First off, the {0} you're wondering about is part of the syntax for Python's built-in string formatting facility. The Python documentation has a fairly comprehensive guide to the syntax. In your case, it simply gets substituted by cim, which results in the string {http://iec.ch/TC57/2008/CIM-schema-cim13#}Terminal.
The problem here is that ElementTree is a bit silly about namespaces. Instead of being able to simply supply the namespace prefix (like cim: or rdf:), you have to supply the name in expanded form, with the namespace URI in braces (Clark notation). This means that rdf:ID becomes {http://www.w3.org/1999/02/22-rdf-syntax-ns#}ID, which is very clunky.
ElementTree does support a way to use the namespace prefix for finding tags, but not for attributes. This means you'll have to expand rdf: to {http://www.w3.org/1999/02/22-rdf-syntax-ns#} yourself.
In your case, it could look as follows (note also that ID is case-sensitive):
tree.find('{0}Terminal'.format(cim)).attrib['{0}ID'.format(rdf)]
Those substitutions expand to:
tree.find('{http://iec.ch/TC57/2008/CIM-schema-cim13#}Terminal').attrib['{http://www.w3.org/1999/02/22-rdf-syntax-ns#}ID']
With those hoops jumped through, it works (note that the ID is A_T1 and not #A_T1, however). Of course, this is all really annoying to have to deal with, so you could also switch to lxml and have it mostly handled for you.
Your third case doesn't work simply because 1) the element is named Terminal.ConductingEquipment and not Terminal.ConductivityEquipment, and 2) if you really want #A_CN1 and not #A_EF2, that's the ConnectivityNode and not the ConductingEquipment. You can get #A_CN1 with tree.find('{0}Terminal/{0}Terminal.ConnectivityNode'.format(cim)).attrib['{0}resource'.format(rdf)].
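Putting both lookups together, a self-contained Python 3 sketch could look like this. It uses the standard library's xml.etree.ElementTree and an inline copy of the XML from the question instead of the file path; the variable names are illustrative.

```python
import xml.etree.ElementTree as ET

# Namespace URIs wrapped in braces, ready to be prepended to local names.
CIM = "{http://iec.ch/TC57/2008/CIM-schema-cim13#}"
RDF = "{http://www.w3.org/1999/02/22-rdf-syntax-ns#}"

# Inline copy of the document from the question (XML declaration omitted).
document = """<rdf:RDF xmlns:cim="http://iec.ch/TC57/2008/CIM-schema-cim13#" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
<cim:Terminal rdf:ID="A_T1">
<cim:Terminal.ConductingEquipment rdf:resource="#A_EF2"/>
<cim:Terminal.ConnectivityNode rdf:resource="#A_CN1"/>
</cim:Terminal>
</rdf:RDF>"""

root = ET.fromstring(document)

# Element lookup by expanded name.
terminal = root.find(CIM + "Terminal")

# Attribute keys use the same expanded form, not the rdf: prefix.
terminal_id = terminal.attrib[RDF + "ID"]

node = terminal.find(CIM + "Terminal.ConnectivityNode")
node_resource = node.attrib[RDF + "resource"]

print(terminal_id)
print(node_resource)
```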

How can I parse an XML document into a Python object?

I'm trying to consume an XML API. I'd like to have some Python objects that represent the XML data. I have several XSD and some example API responses from the documentation.
http://www.isan.org/schema/v1.11/common/common.xsd
http://www.isan.org/schema/v1.21/common/serial.xsd
http://www.isan.org/schema/v1.11/common/version.xsd
http://www.isan.org/ISAN/isan.xsd
http://www.isan.org/schema/v1.11/common/title.xsd
http://www.isan.org/schema/v1.11/common/externalid.xsd
http://www.isan.org/schema/v1.11/common/participant.xsd
http://www.isan.org/schema/v1.11/common/language.xsd
http://www.isan.org/schema/v1.11/common/country.xsd
Here's one example XML response:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<serial:serialHeaderType xmlns:isan="http://www.isan.org/ISAN/isan"
xmlns:title="http://www.isan.org/schema/v1.11/common/title"
xmlns:serial="http://www.isan.org/schema/v1.21/common/serial"
xmlns:externalid="http://www.isan.org/schema/v1.11/common/externalid"
xmlns:common="http://www.isan.org/schema/v1.11/common/common"
xmlns:participant="http://www.isan.org/schema/v1.11/common/participant"
xmlns:language="http://www.isan.org/schema/v1.11/common/language"
xmlns:country="http://www.isan.org/schema/v1.11/common/country">
<common:status>
<common:DataType>SERIAL_HEADER_TYPE</common:DataType>
<common:ISAN root="0000-0002-3B9F"/>
<common:WorkStatus>ACTIVE</common:WorkStatus>
</common:status>
<serial:SerialHeaderId root="0000-0002-3B9F"/>
<serial:MainTitles>
<title:TitleDetail>
<title:Title>Braquo</title:Title>
<title:Language>
<language:LanguageLabel>French</language:LanguageLabel>
<language:LanguageCode>
<language:CodingSystem>ISO639_2</language:CodingSystem>
<language:ISO639_2Code>FRE</language:ISO639_2Code>
</language:LanguageCode>
</title:Language>
<title:TitleKind>ORIGINAL</title:TitleKind>
</title:TitleDetail>
</serial:MainTitles>
<serial:TotalEpisodes>11</serial:TotalEpisodes>
<serial:TotalSeasons>0</serial:TotalSeasons>
<serial:MinDuration>
<common:TimeUnit>MIN</common:TimeUnit>
<common:TimeValue>45</common:TimeValue>
</serial:MinDuration>
<serial:MaxDuration>
<common:TimeUnit>MIN</common:TimeUnit>
<common:TimeValue>144</common:TimeValue>
</serial:MaxDuration>
<serial:MinYear>2009</serial:MinYear>
<serial:MaxYear>2009</serial:MaxYear>
<serial:MainParticipantList>
<participant:Participant>
<participant:FirstName>Frédéric</participant:FirstName>
<participant:LastName>Schoendoerffer</participant:LastName>
<participant:RoleCode>DIR</participant:RoleCode>
</participant:Participant>
<participant:Participant>
<participant:FirstName>Karole</participant:FirstName>
<participant:LastName>Rocher</participant:LastName>
<participant:RoleCode>ACT</participant:RoleCode>
</participant:Participant>
</serial:MainParticipantList>
<serial:CompanyList>
<common:Company>
<common:CompanyKind>PRO</common:CompanyKind>
<common:CompanyName>R.T.B.F.</common:CompanyName>
</common:Company>
<common:Company>
<common:CompanyKind>PRO</common:CompanyKind>
<common:CompanyName>Capa Drama</common:CompanyName>
</common:Company>
<common:Company>
<common:CompanyKind>PRO</common:CompanyKind>
<common:CompanyName>Marathon</common:CompanyName>
</common:Company>
</serial:CompanyList>
</serial:serialHeaderType>
I tried simply ignoring the XSD and using lxml.objectify on the XML I'd get from the API. I had a problem with namespaces. Having to refer to every child node with its explicit namespace was a real pain and doesn't make for readable code.
from lxml import objectify
obj = objectify.fromstring(response)
print obj.MainTitles.TitleDetail
# This will fail to find the element because you need to specify the namespace
print obj.MainTitles['{http://www.isan.org/schema/v1.11/common/title}TitleDetail']
# Or something like that, I couldn't get it to work, and I'd much rather use attributes and not specify the namespace
So then I tried generateDS to create some Python class definitions for me. I've lost the error messages that this attempt gave me but I couldn't get it to work. It would generate a module for each XSD that I gave it but it wouldn't parse the example XML.
I'm now trying pyxb and this seems much nicer so far. It's generating nicer definitions than generateDS (splitting them into multiple, reusable modules) but it won't parse the XML:
from models import serial
obj = serial.CreateFromDocument(response)
Traceback (most recent call last):
  ...
  File "/vagrant/isan/isan.py", line 58, in lookup
    return serial.CreateFromDocument(resp.content)
  File "/vagrant/isan/models/serial.py", line 69, in CreateFromDocument
    instance = handler.rootObject()
  File "/home/vagrant/venv/lib/python2.7/site-packages/pyxb/binding/saxer.py", line 285, in rootObject
    raise pyxb.UnrecognizedDOMRootNodeError(self.__rootObject)
UnrecognizedDOMRootNodeError: <pyxb.utils.saxdom.Element object at 0x2b53664dc850>
The unrecognised node is the <serial:serialHeaderType> node from the example. Looking at the pyxb source it seems that this error comes about "if the top-level element got processed as a DOM instance" but I don't know what this means or how to prevent it.
I've run out of steam for trying to explore this, I don't know what to do next.
I have had a lot of luck parsing XML into Python using Beautiful Soup. It is extremely straightforward, and they provide pretty strong documentation. Check it out here:
http://www.crummy.com/software/BeautifulSoup/
http://www.crummy.com/software/BeautifulSoup/bs4/doc/
UnrecognizedDOMRootNodeError indicates that PyXB could not locate the element in a namespace for which it has bindings registered. In your case it fails on the first element, which is {http://www.isan.org/schema/v1.21/common/serial}serialHeaderType.
The schema for that namespace defines a complexType named SerialHeaderType but does not define an element with the name serialHeaderType. In fact it defines no top-level elements. So PyXB can't recognize it, and the XML does not validate.
Either there's an additional schema for the namespace that you'll need to locate which provides elements, or the message you're sending really doesn't validate. That may be because somebody's expecting an implicit mapping from a complex type to an element with that type, or because it's a fragment that would normally be found within some other element where that QName is a member element name.
UPDATE: You can hand-craft an element in that namespace by adding the following to the generated bindings in serial.py:
serialHeaderType = pyxb.binding.basis.element(pyxb.namespace.ExpandedName(Namespace, 'serialHeaderType'), SerialHeaderType)
Namespace.addCategoryObject('elementBinding', serialHeaderType.name().localName(), serialHeaderType)
If you do that, you won't get the UnrecognizedDOMRootNodeError but you will get an IncompleteElementContentError at:
<common:status>
<common:DataType>SERIAL_HEADER_TYPE</common:DataType>
<common:ISAN root="0000-0002-3B9F"/>
<common:WorkStatus>ACTIVE</common:WorkStatus>
</common:status>
which provides the following details:
The containing element {http://www.isan.org/schema/v1.11/common/common}status is defined at common.xsd[243:3].
The containing element type {http://www.isan.org/schema/v1.11/common/common}StatusType is defined at common.xsd[289:1]
The {http://www.isan.org/schema/v1.11/common/common}StatusType automaton is not in an accepting state.
Any accepted content has been stored in instance
The following element and wildcard content would be accepted:
An element {http://www.isan.org/schema/v1.11/common/common}ActiveISAN per common.xsd[316:3]
An element {http://www.isan.org/schema/v1.11/common/common}MatchingISANs per common.xsd[317:3]
An element {http://www.isan.org/schema/v1.11/common/common}Description per common.xsd[318:3]
No content remains unconsumed
Reviewing the schema confirms that, at a minimum, a {http://www.isan.org/schema/v1.11/common/common}Description element is missing but required.
So it seems these documents are not meant to be validated, and PyXB is probably the wrong technology to use.
