XPath Namespace issues - python

When connecting to an XMPP server I get one of these two responses:
<stream:features xmlns:stream="http://etherx.jabber.org/streams">
<mechanisms xmlns="urn:ietf:params:xml:ns:xmpp-sasl">
<mechanism>PLAIN</mechanism>
<mechanism>DIGEST MD5</mechanism>
</mechanisms>
<auth xmlns="http://jabber.org/features/iq-auth" />
<register xmlns="http://jabber.org/features/iq-register" />
</stream:features>
OR
<stream:features>
<mechanisms xmlns="urn:ietf:params:xml:ns:xmpp-sasl">
<mechanism>DIGEST-MD5</mechanism>
<mechanism>PLAIN</mechanism>
<mechanism>ANONYMOUS</mechanism>
<mechanism>CRAM-MD5</mechanism>
</mechanisms>
<compression xmlns="http://jabber.org/features/compress">
<method>zlib</method>
</compression>
<auth xmlns="http://jabber.org/features/iq-auth" />
<register xmlns="http://jabber.org/features/iq-register" />
</stream:features>
When trying to parse the second one with my code, I get this error:
namespace error : Namespace prefix stream on features is not defined
<stream:features><mechanisms xmlns="urn:ietf:params:xml:ns:xmpp-sasl"><mechanism
^
Here is my code:
mechanisms = []
xmlParsed = libxml2.parseDoc(xmlResponse)
xpathContext = xmlParsed.xpathNewContext()
xpathContext.xpathRegisterNs('urn','http://etherx.jabber.org/streams')
xpathContext.xpathRegisterNs('sasl', 'urn:ietf:params:xml:ns:xmpp-sasl')
nodes = xpathContext.xpathEval("//urn:stream/features/sasl:mechanisms/sasl:mechanism/text()|//urn:features/sasl:mechanisms/sasl:mechanism/text()")
for node in nodes:
    mechanisms.append(str(node))
What am I doing wrong and how can I right it? Please don't say, use the XMPP libraries or such, I'm not trying to write an entire XMPP client. I just want enough code to register as a user first.

Please don't write your own XMPP library from scratch. There are already many available from a list on xmpp.org. In particular, for Python, try SleekXMPP.
For example, using parseDoc isn't going to work; you'll need to parse XML incrementally. The missing prefix definition for "stream" in "stream:features" is a symptom of this sort of problem.
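To illustrate what incremental parsing means here, the following is a minimal sketch using the standard library's xml.etree.ElementTree.XMLPullParser rather than libxml2. The chunks list is a made-up stand-in for data arriving from the socket. It assumes the server opened the stream with a <stream:stream> element that declares the stream prefix, which is why <stream:features> resolves without its own declaration.

```python
import xml.etree.ElementTree as ET

SASL_MECHANISM = "{urn:ietf:params:xml:ns:xmpp-sasl}mechanism"

# Hypothetical chunks, as they might arrive from the network. The opening
# <stream:stream> tag is never closed while the session is alive.
chunks = [
    "<stream:stream xmlns='jabber:client' "
    "xmlns:stream='http://etherx.jabber.org/streams'>",
    "<stream:features><mechanisms xmlns='urn:ietf:params:xml:ns:xmpp-sasl'>",
    "<mechanism>DIGEST-MD5</mechanism><mechanism>PLAIN</mechanism>",
    "</mechanisms></stream:features>",
]

parser = ET.XMLPullParser(events=("end",))
mechanisms = []
for chunk in chunks:
    parser.feed(chunk)  # feed whatever has arrived so far
    for event, elem in parser.read_events():
        # Elements are reported as soon as their closing tag has been seen.
        if elem.tag == SASL_MECHANISM:
            mechanisms.append(elem.text)

print(mechanisms)
```

Because the parser has already seen the namespace declaration on the stream root, the later fragment parses cleanly even though it carries no declaration of its own.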

I think the error is reported for the <stream:features> tag saying that the prefix stream is not defined.
<stream:features> indicates that the features tag is under a namespace represented by prefix stream and in your xml fragment there is no such namespace declared.
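If you only need to parse an isolated fragment like the second response, one possible workaround (a sketch, not the only way) is to wrap it in a dummy root element that declares the missing prefix. The wrapper element and the sasl_mechanisms helper are illustrative names, and the example uses the standard library's ElementTree instead of libxml2.

```python
import xml.etree.ElementTree as ET

STREAM_NS = "http://etherx.jabber.org/streams"
SASL_NS = "urn:ietf:params:xml:ns:xmpp-sasl"


def sasl_mechanisms(fragment):
    """Return the SASL mechanism names found in a <stream:features> fragment."""
    # Declare the "stream" prefix on a throwaway root so the fragment is
    # namespace-well-formed even when it lacks its own declaration.
    wrapped = '<wrapper xmlns:stream="%s">%s</wrapper>' % (STREAM_NS, fragment)
    root = ET.fromstring(wrapped)
    ns = {"stream": STREAM_NS, "sasl": SASL_NS}
    path = "stream:features/sasl:mechanisms/sasl:mechanism"
    return [m.text for m in root.findall(path, ns)]


fragment = (
    '<stream:features>'
    '<mechanisms xmlns="urn:ietf:params:xml:ns:xmpp-sasl">'
    '<mechanism>DIGEST-MD5</mechanism>'
    '<mechanism>PLAIN</mechanism>'
    '</mechanisms>'
    '</stream:features>'
)
print(sasl_mechanisms(fragment))
```

A fragment that already declares xmlns:stream on <stream:features> still parses, since redeclaring a prefix on a child element is legal.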

Related

XMLSigner - sign multiple references

I'm developing a digital signature system and I have a question about the XMLSigner library. I could not find my answer by looking at documentation, so I'm gonna ask here.
I have an XML file that I need to sign, and it needs more than one reference, but there's a problem: the referenced elements don't have an ID that a reference could point to. The XML that I need to sign is something like:
<AppHdr>
<Fr>
<FIId>
<FinInstnId>
<Othr>
<Id>00038166</Id>
</Othr>
</FinInstnId>
</FIId>
</Fr>
(more content...)
</AppHdr>
<Document>
(more content...)
</Document>
I've extracted both AppHdr and Document, and made the signature using each one, and my idea was to put them together later in another xml file, already canonicalized and encrypted, using:
from signxml import XMLSigner, methods

signed_app_hdr = XMLSigner(
    method=methods.enveloped,
    signature_algorithm='rsa-sha256',
    digest_algorithm='sha256',
    c14n_algorithm="http://www.w3.org/2001/10/xml-exc-c14n#",
).sign(et_app_hdr, key=rsa_key)
signed_document_info = XMLSigner(
    signature_algorithm='rsa-sha256',
    digest_algorithm='sha256',
    c14n_algorithm="http://www.w3.org/2001/10/xml-exc-c14n#",
).sign(et_document, key=rsa_key)
The digital signature output (standardized by the institution that I'm sending the message to) requires
<Reference URI="">
to reference the <AppHdr>
And
<Reference>
to reference the <Document>
There's also a <KeyInfo>, that is referenced by an id (no issues in this case). I'm just using sign(et_key_info, key=rsa_key, reference_uri='key-info-id') to reference it.
So, my question is: how do I reference the AppHdr and the Document through reference_uri? Is it possible? When I just leave reference_uri = None (the default), it creates <Reference URI="">, which would be no problem for the AppHdr. But what about the Document? What should I do? Could I create artificial IDs for them and remove them later? I don't know whether that would have cryptographic implications.
Thanks in advance!

How can I register multiple default namespaces when modifying an xml file using python?

I have looked through many namespace documents on here and am only able to slightly relate to a few. In my document, I have 3 defaults and only one colon-style xmlns, for example:
xmlns="someurlNo1"
xmlns:spatial="someurlNo2"
xmlns="someurlNo3"
xmlns="someurlNo4"
From what I have read, it seems that I have 3 defaults (please correct me if I am interpreting this wrong). However, when I modify my base xml and then write my new xml, the only way I can keep the first two, ns0 and ns1, from showing up is by commenting out the last two registrations. Everything that belongs to the last two defaults is then labeled with "ns2" and "ns3", even if I register them all like this:
ET.register_namespace('',"someurlNo0") #ns0
ET.register_namespace('spatial',"someurlNo1") #ns1
#ET.register_namespace('',"someurlNo2") #ns2
#ET.register_namespace('',"someurlNo3") #ns3
Does anyone know how to register the last two default namespaces correctly? When I leave the last two not commented out, ns0 and ns2 appear where they should, and while all the ns3s disappear, the default is no longer equal to "someurlNo3".
This answer: https://stackoverflow.com/a/43530940 has so far been the most helpful explanation to me in illustrating that there may be multiple defaults that travel down (which I believe is true for my document), but I am still unsure how to properly register them. Any ideas would be much appreciated!
Here is what the top part of my xml looks like that includes all 4 namespaces. I'd rather spare you from seeing all 3k lines but if needed I can share more:
<?xml version='1.0' encoding='UTF-8' standalone='no'?>
<sbml xmlns="http://www.sbml.org/sbml/level3/version1/core" level="3" spatial:required="true" version="1" xmlns:spatial="http://www.sbml.org/sbml/level3/version1/spatial/version1">
<notes>
<body xmlns="http://www.w3.org/1999/xhtml">
<p>Exported by VCell 7.3</p>
</body>
</notes>
<model areaUnits="um2" extentUnits="molecules" id="_zero_6_29_21_Phase1_cellularConcAgain_Spatial" lengthUnits="um" name="06_29_21_Phase1_cellularConcAgain_Spatial" substanceUnits="molecules" timeUnits="s" volumeUnits="um3">
<spatial:geometry xmlns:spatial="http://www.sbml.org/sbml/level3/version1/spatial/version1" id="vcell" spatial:coordinateSystem="cartesian" spatial:id="vcell">
<spatial:listOfCoordinateComponents>
<spatial:coordinateComponent id="x" spatial:id="x" spatial:type="cartesianX" spatial:unit="um">
<spatial:boundaryMin id="Xmin" spatial:id="Xmin" spatial:value="0.0"/>
<spatial:boundaryMax id="Xmax" spatial:id="Xmax" spatial:value="1.6"/>
</spatial:coordinateComponent>
<spatial:coordinateComponent id="y" spatial:id="y" spatial:type="cartesianY" spatial:unit="um">
<spatial:boundaryMin id="Ymin" spatial:id="Ymin" spatial:value="0.0"/>
<spatial:boundaryMax id="Ymax" spatial:id="Ymax" spatial:value="3.5"/>
</spatial:coordinateComponent>
</spatial:listOfCoordinateComponents>
<spatial:listOfDomains>
<spatial:domain id="chr0" spatial:domainType="domainType_chr" spatial:id="chr0">
<spatial:listOfInteriorPoints>
<spatial:interiorPoint spatial:coord1="0.0" spatial:coord2="0.0" spatial:coord3="5.0"/>
</spatial:listOfInteriorPoints>
</spatial:domain>
</spatial:listOfDomains>
<spatial:listOfDomainTypes>
<spatial:domainType id="domainType_chr" spatial:id="domainType_chr" spatial:spatialDimensions="3"/>
</spatial:listOfDomainTypes>
<spatial:listOfGeometryDefinitions>
<spatial:analyticGeometry id="Analytic_Geometry1640227629" spatial:id="Analytic_Geometry1640227629" spatial:isActive="true">
<spatial:listOfAnalyticVolumes>
<spatial:analyticVolume spatial:domainType="domainType_chr" spatial:functionType="layered" spatial:id="chr" spatial:ordinal="0">
<math xmlns="http://www.w3.org/1998/Math/MathML">
<apply>
<neq/>
<cn> 0 </cn>
<cn> 1 </cn>
</apply>
</math>
You're misunderstanding namespaces. That a namespace is "default" is not a property of that namespace, it's a property of the XML in that location.
Don't get distracted. Give all namespace URIs that you're going to use a prefix, done.
import xml.etree.ElementTree as ET

ns = {
    'a': 'someurlNo0',
    'b': 'someurlNo1',
    'c': 'someurlNo2',
    'd': 'someurlNo3',
    'e': 'someurlNo3',  # same URI as above, perfectly legal
}
tree = ET.parse('path/to.xml')
tree.findall("./a:node/b:node/c:node/d:node/e:node", namespaces=ns)
It does not even need to be the same prefix as in your XML. In fact, avoid that. Give namespace URIs prefixes that make reading your code easy. All that matters in the end is the namespace URI, the prefix is ephemeral.
As long as the prefixes in your code resolve to the actual namespace URI of the targeted nodes, you're good. It does not matter if the namespaces are default at the specific location in the XML.
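As a concrete sketch of that point, here is a trimmed-down version of the SBML header from the question, searched with prefixes that appear nowhere in the XML itself. The prefix names core and html are arbitrary choices made for this example.

```python
import xml.etree.ElementTree as ET

# Trimmed-down document: two different "default" namespaces, no prefixes.
doc = """<sbml xmlns="http://www.sbml.org/sbml/level3/version1/core">
  <notes>
    <body xmlns="http://www.w3.org/1999/xhtml">
      <p>Exported by VCell 7.3</p>
    </body>
  </notes>
</sbml>"""

# The prefixes only exist in this dictionary; only the URIs must match.
ns = {
    "core": "http://www.sbml.org/sbml/level3/version1/core",
    "html": "http://www.w3.org/1999/xhtml",
}

root = ET.fromstring(doc)
# <notes> inherits the core default, <body> and <p> the XHTML default.
p = root.find("core:notes/html:body/html:p", ns)
print(p.text)
```

Both namespaces are "default" at some point in the document, yet the lookup only needs each element's namespace URI.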

How to access attribute value in xml containing namespace using ElementTree in python

XML file:
<?xml version="1.0" encoding="iso-8859-1"?>
<rdf:RDF xmlns:cim="http://iec.ch/TC57/2008/CIM-schema-cim13#" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
<cim:Terminal rdf:ID="A_T1">
<cim:Terminal.ConductingEquipment rdf:resource="#A_EF2"/>
<cim:Terminal.ConnectivityNode rdf:resource="#A_CN1"/>
</cim:Terminal>
</rdf:RDF>
I want to get the Terminal.ConnectivityNode element's attribute value, and also the Terminal element's attribute value, as output from the above XML. This is what I have tried:
Python code:
from elementtree import ElementTree as etree
tree= etree.parse(r'N:\myinternwork\files xml of bus systems\cimxmleg.xml')
cim= "{http://iec.ch/TC57/2008/CIM-schema-cim13#}"
rdf= "{http://www.w3.org/1999/02/22-rdf-syntax-ns#}"
Appending the below line to the code
print tree.find('{0}Terminal'.format(cim)).attrib
output1 (as expected):
{'{http://www.w3.org/1999/02/22-rdf-syntax-ns#}ID': 'A_T1'}
If we Append with this below line to above code
print tree.find('{0}Terminal'.format(cim)).attrib['rdf:ID']
output2: key error in rdf:ID
If we append with this below line to above code
print tree.find('{0}Terminal/{0}Terminal.ConductivityEquipment'.format(cim))
output3 None
How to get output2 as A_T1 & Output3 as #A_CN1?
What is the significance of {0} in the above code? I found online that it must be used, but I didn't understand why.
First off, the {0} you're wondering about is part of the syntax for Python's built-in string formatting facility. The Python documentation has a fairly comprehensive guide to the syntax. In your case, it simply gets substituted by cim, which results in the string {http://iec.ch/TC57/2008/CIM-schema-cim13#}Terminal.
The problem here is that ElementTree is a bit silly about namespaces. Instead of being able to simply supply the namespace prefix (like cim: or rdf:), you have to supply the full namespace URI in curly braces (the {uri}name form, often called Clark notation). This means that rdf:ID becomes {http://www.w3.org/1999/02/22-rdf-syntax-ns#}ID, which is very clunky.
ElementTree does support a way to use the namespace prefix for finding tags, but not for attributes. This means you'll have to expand rdf: to {http://www.w3.org/1999/02/22-rdf-syntax-ns#} yourself.
In your case, it could look as following (note also that ID is case-sensitive):
tree.find('{0}Terminal'.format(cim)).attrib['{0}ID'.format(rdf)]
Those substitutions expand to:
tree.find('{http://iec.ch/TC57/2008/CIM-schema-cim13#}Terminal').attrib['{http://www.w3.org/1999/02/22-rdf-syntax-ns#}ID']
With those hoops jumped through, it works (note that the ID is A_T1 and not #A_T1, however). Of course, this is all really annoying to have to deal with, so you could also switch to lxml and have it mostly handled for you.
Your third case doesn't work simply because 1) it's named Terminal.ConductingEquipment and not Terminal.ConductivityEquipment, and 2) if you really want #A_CN1 and not #A_EF2, that's the ConnectivityNode and not the ConductingEquipment. You can get #A_CN1 with tree.find('{0}Terminal/{0}Terminal.ConnectivityNode'.format(cim)).attrib['{0}resource'.format(rdf)].
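Putting the pieces together, here is a minimal, self-contained sketch for Python 3 using the standard library's xml.etree.ElementTree (an assumption, since the question imports the older standalone elementtree package). It uses a prefix map for the tag lookups and the expanded {uri}name form for the attributes. The XML declaration is omitted so the document can be embedded as a plain string.

```python
import xml.etree.ElementTree as ET

XML = """<rdf:RDF xmlns:cim="http://iec.ch/TC57/2008/CIM-schema-cim13#"
         xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <cim:Terminal rdf:ID="A_T1">
    <cim:Terminal.ConductingEquipment rdf:resource="#A_EF2"/>
    <cim:Terminal.ConnectivityNode rdf:resource="#A_CN1"/>
  </cim:Terminal>
</rdf:RDF>"""

NS = {
    "cim": "http://iec.ch/TC57/2008/CIM-schema-cim13#",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
}

root = ET.fromstring(XML)

# Tags can be looked up through the prefix map ...
terminal = root.find("cim:Terminal", NS)
node = terminal.find("cim:Terminal.ConnectivityNode", NS)

# ... but attribute keys must be spelled out as {uri}name.
terminal_id = terminal.get("{%s}ID" % NS["rdf"])
node_resource = node.get("{%s}resource" % NS["rdf"])

print(terminal_id)
print(node_resource)
```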

How can I parse an XML document into a Python object?

I'm trying to consume an XML API. I'd like to have some Python objects that represent the XML data. I have several XSD and some example API responses from the documentation.
http://www.isan.org/schema/v1.11/common/common.xsd
http://www.isan.org/schema/v1.21/common/serial.xsd
http://www.isan.org/schema/v1.11/common/version.xsd
http://www.isan.org/ISAN/isan.xsd
http://www.isan.org/schema/v1.11/common/title.xsd
http://www.isan.org/schema/v1.11/common/externalid.xsd
http://www.isan.org/schema/v1.11/common/participant.xsd
http://www.isan.org/schema/v1.11/common/language.xsd
http://www.isan.org/schema/v1.11/common/country.xsd
Here's one example XML response:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<serial:serialHeaderType xmlns:isan="http://www.isan.org/ISAN/isan"
xmlns:title="http://www.isan.org/schema/v1.11/common/title"
xmlns:serial="http://www.isan.org/schema/v1.21/common/serial"
xmlns:externalid="http://www.isan.org/schema/v1.11/common/externalid"
xmlns:common="http://www.isan.org/schema/v1.11/common/common"
xmlns:participant="http://www.isan.org/schema/v1.11/common/participant"
xmlns:language="http://www.isan.org/schema/v1.11/common/language"
xmlns:country="http://www.isan.org/schema/v1.11/common/country">
<common:status>
<common:DataType>SERIAL_HEADER_TYPE</common:DataType>
<common:ISAN root="0000-0002-3B9F"/>
<common:WorkStatus>ACTIVE</common:WorkStatus>
</common:status>
<serial:SerialHeaderId root="0000-0002-3B9F"/>
<serial:MainTitles>
<title:TitleDetail>
<title:Title>Braquo</title:Title>
<title:Language>
<language:LanguageLabel>French</language:LanguageLabel>
<language:LanguageCode>
<language:CodingSystem>ISO639_2</language:CodingSystem>
<language:ISO639_2Code>FRE</language:ISO639_2Code>
</language:LanguageCode>
</title:Language>
<title:TitleKind>ORIGINAL</title:TitleKind>
</title:TitleDetail>
</serial:MainTitles>
<serial:TotalEpisodes>11</serial:TotalEpisodes>
<serial:TotalSeasons>0</serial:TotalSeasons>
<serial:MinDuration>
<common:TimeUnit>MIN</common:TimeUnit>
<common:TimeValue>45</common:TimeValue>
</serial:MinDuration>
<serial:MaxDuration>
<common:TimeUnit>MIN</common:TimeUnit>
<common:TimeValue>144</common:TimeValue>
</serial:MaxDuration>
<serial:MinYear>2009</serial:MinYear>
<serial:MaxYear>2009</serial:MaxYear>
<serial:MainParticipantList>
<participant:Participant>
<participant:FirstName>Frédéric</participant:FirstName>
<participant:LastName>Schoendoerffer</participant:LastName>
<participant:RoleCode>DIR</participant:RoleCode>
</participant:Participant>
<participant:Participant>
<participant:FirstName>Karole</participant:FirstName>
<participant:LastName>Rocher</participant:LastName>
<participant:RoleCode>ACT</participant:RoleCode>
</participant:Participant>
</serial:MainParticipantList>
<serial:CompanyList>
<common:Company>
<common:CompanyKind>PRO</common:CompanyKind>
<common:CompanyName>R.T.B.F.</common:CompanyName>
</common:Company>
<common:Company>
<common:CompanyKind>PRO</common:CompanyKind>
<common:CompanyName>Capa Drama</common:CompanyName>
</common:Company>
<common:Company>
<common:CompanyKind>PRO</common:CompanyKind>
<common:CompanyName>Marathon</common:CompanyName>
</common:Company>
</serial:CompanyList>
</serial:serialHeaderType>
I tried simply ignoring the XSD and using lxml.objectify on the XML I'd get from the API. I had a problem with namespaces. Having to refer to every child node with its explicit namespace was a real pain and doesn't make for readable code.
from lxml import objectify
obj = objectify.fromstring(response)
print obj.MainTitles.TitleDetail
# This will fail to find the element because you need to specify the namespace
print obj.MainTitles['{http://www.isan.org/schema/v1.11/common/title}TitleDetail']
# Or something like that, I couldn't get it to work, and I'd much rather use attributes and not specify the namespace
So then I tried generateDS to create some Python class definitions for me. I've lost the error messages that this attempt gave me but I couldn't get it to work. It would generate a module for each XSD that I gave it but it wouldn't parse the example XML.
I'm now trying pyxb and this seems much nicer so far. It's generating nicer definitions than generateDS (splitting them into multiple, reusable modules) but it won't parse the XML:
from models import serial
obj = serial.CreateFromDocument(response)
Traceback (most recent call last):
...
File "/vagrant/isan/isan.py", line 58, in lookup
return serial.CreateFromDocument(resp.content)
File "/vagrant/isan/models/serial.py", line 69, in CreateFromDocument
instance = handler.rootObject()
File "/home/vagrant/venv/lib/python2.7/site-packages/pyxb/binding/saxer.py", line 285, in rootObject
raise pyxb.UnrecognizedDOMRootNodeError(self.__rootObject)
UnrecognizedDOMRootNodeError: <pyxb.utils.saxdom.Element object at 0x2b53664dc850>
The unrecognised node is the <serial:serialHeaderType> node from the example. Looking at the pyxb source it seems that this error comes about "if the top-level element got processed as a DOM instance" but I don't know what this means or how to prevent it.
I've run out of steam for trying to explore this, I don't know what to do next.
I have had a lot of luck parsing XML into Python using Beautiful Soup. It is extremely straightforward, and they provide pretty strong documentation. Check it out here:
http://www.crummy.com/software/BeautifulSoup/
http://www.crummy.com/software/BeautifulSoup/bs4/doc/
UnrecognizedDOMRootNodeError indicates that PyXB could not locate the element in a namespace for which it has bindings registered. In your case it fails on the first element, which is {http://www.isan.org/schema/v1.21/common/serial}serialHeaderType.
The schema for that namespace defines a complexType named SerialHeaderType but does not define an element with the name serialHeaderType. In fact it defines no top-level elements. So PyXB can't recognize it, and the XML does not validate.
Either there's an additional schema for the namespace that you'll need to locate which provides elements, or the message you're sending really doesn't validate. That may be because somebody's expecting an implicit mapping from a complex type to an element with that type, or because it's a fragment that would normally be found within some other element where that QName is a member element name.
UPDATE: You can hand-craft an element in that namespace by adding the following to the generated bindings in serial.py:
serialHeaderType = pyxb.binding.basis.element(pyxb.namespace.ExpandedName(Namespace, 'serialHeaderType'), SerialHeaderType)
Namespace.addCategoryObject('elementBinding', serialHeaderType.name().localName(), serialHeaderType)
If you do that, you won't get the UnrecognizedDOMRootNodeError but you will get an IncompleteElementContentError at:
<common:status>
<common:DataType>SERIAL_HEADER_TYPE</common:DataType>
<common:ISAN root="0000-0002-3B9F"/>
<common:WorkStatus>ACTIVE</common:WorkStatus>
</common:status>
which provides the following details:
The containing element {http://www.isan.org/schema/v1.11/common/common}status is defined at common.xsd[243:3].
The containing element type {http://www.isan.org/schema/v1.11/common/common}StatusType is defined at common.xsd[289:1]
The {http://www.isan.org/schema/v1.11/common/common}StatusType automaton is not in an accepting state.
Any accepted content has been stored in instance
The following element and wildcard content would be accepted:
An element {http://www.isan.org/schema/v1.11/common/common}ActiveISAN per common.xsd[316:3]
An element {http://www.isan.org/schema/v1.11/common/common}MatchingISANs per common.xsd[317:3]
An element {http://www.isan.org/schema/v1.11/common/common}Description per common.xsd[318:3]
No content remains unconsumed
Reviewing the schema confirms that, at a minimum, a {http://www.isan.org/schema/v1.11/common/common}Description element is missing but required.
So it seems these documents are not meant to be validated, and PyXB is probably the wrong technology to use.
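If validation is off the table, one schema-free alternative (a different technique from PyXB, sketched here with only the standard library) is to walk the parsed tree and build plain attribute-style objects while dropping the namespaces. The helper names local and to_object are invented for this example. It assumes element names are unique enough once namespaces are stripped, and it discards the text of any element that also has attributes or children.

```python
import xml.etree.ElementTree as ET
from types import SimpleNamespace


def local(tag):
    """Strip the {namespace} part from a tag or attribute name."""
    return tag.rsplit("}", 1)[-1]


def to_object(elem):
    """Convert an element into a string (leaf) or a SimpleNamespace."""
    children = list(elem)
    if not children and not elem.attrib:
        return (elem.text or "").strip()
    obj = SimpleNamespace(**{local(k): v for k, v in elem.attrib.items()})
    for child in children:
        name, value = local(child.tag), to_object(child)
        if hasattr(obj, name):
            # Repeated child names are collected into a list.
            existing = getattr(obj, name)
            if not isinstance(existing, list):
                existing = [existing]
                setattr(obj, name, existing)
            existing.append(value)
        else:
            setattr(obj, name, value)
    return obj


# Shortened version of the example response from the question.
response = """<serial:serialHeaderType
    xmlns:serial="http://www.isan.org/schema/v1.21/common/serial"
    xmlns:title="http://www.isan.org/schema/v1.11/common/title"
    xmlns:common="http://www.isan.org/schema/v1.11/common/common">
  <serial:SerialHeaderId root="0000-0002-3B9F"/>
  <serial:MainTitles>
    <title:TitleDetail><title:Title>Braquo</title:Title></title:TitleDetail>
  </serial:MainTitles>
  <serial:CompanyList>
    <common:Company><common:CompanyName>R.T.B.F.</common:CompanyName></common:Company>
    <common:Company><common:CompanyName>Capa Drama</common:CompanyName></common:Company>
  </serial:CompanyList>
</serial:serialHeaderType>"""

obj = to_object(ET.fromstring(response))
print(obj.MainTitles.TitleDetail.Title)
print([c.CompanyName for c in obj.CompanyList.Company])
```

The trade-off is that nothing is checked against the XSD, so a single child and a repeated child end up with different types (object versus list).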

Forwarded Email parsing in Python/Any other language?

I have some mails in txt format, that have been forwarded multiple times.
I want to extract the content/the main body of the mail. This should be at the last position in the hierarchy..right? (Someone point this out if I'm wrong).
The email module doesn't give me a way to extract the content. if I make a message object, the object doesn't have a field for the content of the body.
Any idea on how to do it? Is there a module for this, or any approach you can think of other than the most naive one, of course, of starting from the end of the text file and scanning backwards until you find the header?
If there is an easy or straightforward way/module with any other language ( I doubt), please let me know that as well!
Any help is much appreciated!
The email module doesn't give me a way to extract the content. if I make a message object, the object doesn't have a field for the content of the body.
Of course it does. Have a look at the Python documentation and examples. In particular, look at the walk and get_payload methods.
Try get_payload on the parsed Message object. If there is only one message, the return type will be string, otherwise it will be a list of Message objects.
Something like this:
messages = parsed_message.get_payload()
# Descend into the last part until the payload is a plain string
while not isinstance(messages, str):
    messages = messages[-1].get_payload()
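A fuller sketch of the same idea follows, assuming the forwards are stored as nested MIME parts (message/rfc822 attachments). Mails forwarded inline as quoted text have no such structure, and this will simply return the whole body for them. The raw message and the innermost_body helper are made up for illustration.

```python
import email

# Hypothetical forwarded mail: the original message travels as a
# message/rfc822 part at the end of a multipart container.
raw = (
    "From: forwarder@example.com\n"
    "Subject: Fwd: hello\n"
    "MIME-Version: 1.0\n"
    'Content-Type: multipart/mixed; boundary="BOUNDARY"\n'
    "\n"
    "--BOUNDARY\n"
    "Content-Type: text/plain\n"
    "\n"
    "see below\n"
    "--BOUNDARY\n"
    "Content-Type: message/rfc822\n"
    "\n"
    "From: author@example.com\n"
    "Subject: hello\n"
    "Content-Type: text/plain\n"
    "\n"
    "original text\n"
    "--BOUNDARY--\n"
)


def innermost_body(msg):
    """Follow the last part at every level and return the final payload."""
    # is_multipart() is true for multipart/* and message/rfc822 parts,
    # i.e. whenever get_payload() returns a list of Message objects.
    while msg.is_multipart():
        msg = msg.get_payload()[-1]
    return msg.get_payload()


msg = email.message_from_string(raw)
body = innermost_body(msg)
print(body.strip())
```

For encoded bodies (base64 or quoted-printable), get_payload(decode=True) returns the decoded bytes instead of the raw string.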
