Saving the XML document, breaks my XSI declaration - python

I am parsing an XML document that uses namespaces with a Python XML parser (Beautiful Soup). When I save the document, the parser replaces the "xsi:" prefix on the attribute with {http://www.w3.org/2001/XMLSchema-instance}. How can I prevent that?
Example:
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
Becomes:
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" {http://www.w3.org/2001/XMLSchema-instance}schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
Can anyone help me out with this ?
Regards,
Bojan

I've filed a bug for you. I've also committed a fix which will be in the next release of Beautiful Soup.

This is how I temporarily solved it.
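import re

soupOut = str(soup)
# Find the URI that the document binds to the xsi prefix
ns = re.search("<project [^>]* xmlns:xsi=\"(?P<ns>[^\"]*)\"[^>]*>", soupOut)
if ns:
    # Turn "{uri}schemaLocation" back into "xsi:schemaLocation"
    soupOut = soupOut.replace("{%s}" % ns.group('ns'), 'xsi:')
file.write(soupOut)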
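The same workaround can be sketched as a small function that works on any serialized string, which makes it easier to check in isolation. The function name restore_xsi_prefix and the sample input are hypothetical, and the sketch assumes the document declares xmlns:xsi with double quotes.

```python
import re

def restore_xsi_prefix(xml_text):
    """Replace '{<xsi namespace URI>}' with 'xsi:' in serialized XML text."""
    # Look up whatever URI the document itself binds to the xsi prefix.
    match = re.search(r'xmlns:xsi="([^"]*)"', xml_text)
    if match is None:
        return xml_text  # no xsi declaration, nothing to repair
    return xml_text.replace("{%s}" % match.group(1), "xsi:")

# Hypothetical sample of the broken output described in the question.
broken = (
    '<project xmlns="http://maven.apache.org/POM/4.0.0" '
    'xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" '
    '{http://www.w3.org/2001/XMLSchema-instance}schemaLocation="x"/>'
)
fixed = restore_xsi_prefix(broken)
```

This is still plain text substitution, so it is only a stopgap until a Beautiful Soup release with the fix is available.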

Related

Issue with python script while parsing pom file in project

I'm having an issue extracting the version number with a Python script. It returns None when I run it. Can someone help me with this?
Python Script:
import xml.etree.ElementTree as ET
tree = ET.parse('pom.xml')
root = tree.getroot()
releaseVersion = root.find("version")
print(releaseVersion)
pom.xml
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns="http://maven.apache.org/POM/4.0.0"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 https://maven.apache.org/xsd/maven-4.0.0.xsd">
<artifactId>watcher</artifactId>
<version>0.0.1-SNAPSHOT</version>
<groupId>com.test</groupId>
<name>file</name>
<packaging>jar</packaging>
<parent>
<artifactId>spring-boot-starter-parent</artifactId>
<groupId>org.springframework.boot</groupId>
<relativePath/>
<version>2.6.1</version>
</parent>
</project>
You're not taking into account that all of your elements are in a default namespace defined by xmlns="http://maven.apache.org/POM/4.0.0" in your <project> element.
So you have to create your query with this namespace.
import xml.etree.ElementTree as ET
tree = ET.parse('pom.xml')
root = tree.getroot()
NS = { 'maven' : 'http://maven.apache.org/POM/4.0.0' }
releaseVersion = root.find("maven:version",NS)
print(releaseVersion.text)
Here NS = { ... } maps the prefix maven to the namespace URI, so the path expression passed to find can refer to the namespace by that prefix.
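A minimal sketch of the same idea on an inline, shortened copy of the POM, assuming you also want the parent's version. The prefix maven is arbitrary, and the variable names are only for illustration.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the pom.xml from the question.
POM = """<project xmlns="http://maven.apache.org/POM/4.0.0">
  <artifactId>watcher</artifactId>
  <version>0.0.1-SNAPSHOT</version>
  <parent>
    <artifactId>spring-boot-starter-parent</artifactId>
    <version>2.6.1</version>
  </parent>
</project>"""

NS = {"maven": "http://maven.apache.org/POM/4.0.0"}
root = ET.fromstring(POM)

# Direct child of <project>: the project's own version.
project_version = root.findtext("maven:version", namespaces=NS)
# Every step of a path needs the prefix, not just the last one.
parent_version = root.findtext("maven:parent/maven:version", namespaces=NS)
# An unprefixed name is looked up in "no namespace" and finds nothing.
missing = root.find("version")
```

Using findtext avoids calling .text on None when an element is absent.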
Your pom.xml has a namespace xmlns="http://maven.apache.org/POM/4.0.0" in the project tag.
If you want to search by the fully qualified name, write the tag as {namespace}tag:
>>> root.find("{http://maven.apache.org/POM/4.0.0}version")
<Element '{http://maven.apache.org/POM/4.0.0}version' at 0x0000014635EC0A40>
But if you don't care about the namespace, you can search with {*}tag (supported since Python 3.8):
>>> root.find("{*}version")
<Element '{http://maven.apache.org/POM/4.0.0}version' at 0x0000014635EC0A40>
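As a rough illustration of both forms, the sketch below uses the wildcard to read the version and a small helper to show tag names without their namespace. The helper local_name is hypothetical and not part of ElementTree; the wildcard line assumes Python 3.8 or later.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    '<project xmlns="http://maven.apache.org/POM/4.0.0">'
    "<version>0.0.1-SNAPSHOT</version><packaging>jar</packaging>"
    "</project>"
)

# {*} matches the tag in any namespace (or none); needs Python 3.8+.
version = root.find("{*}version").text

def local_name(tag):
    """Drop a leading '{uri}' from an ElementTree tag, if present."""
    return tag.rpartition("}")[2]

# Local names of the direct children of <project>.
child_names = [local_name(child.tag) for child in root]
```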

Parsing soap/XML response in Python

I am trying to parse the XML below with Python. I don't know what kind of XML this is, as I have never worked with it before; I got it from a Microsoft API response.
My question is how to parse it and get the value of BinarySecurityToken in my Python code.
I referred to this question: Parse XML SOAP response with Python
But it looks like that one also relies on an xmlns value to get the text. In my XML, however, I can't see any nearby xmlns value through which I could get the value.
Please let me know how to get the value of a specific field from the XML below using Python.
<?xml version="1.0" encoding="utf-8" ?>
<S:Envelope xmlns:S="http://www.w3.org/2003/05/soap-envelope" xmlns:wsse="http://docs.oasis-open.org/wss/2004/01/oasis-200401-wss-wssecurity-secext-1.0.xsd" xmlns:wsu="http://docs.oasis-open.org/wss/2004/01/oasis-200401-wss-wssecurity-utility-1.0.xsd" xmlns:wsa="http://www.w3.org/2005/08/addressing">
<S:Header>
<wsa:Action xmlns:S="http://www.w3.org/2003/05/soap-envelope" xmlns:wsa="http://www.w3.org/2005/08/addressing" xmlns:wsu="http://docs.oasis-open.org/wss/2004/01/oasis-200401-wss-wssecurity-utility-1.0.xsd" wsu:Id="Action" S:mustUnderstand="1">http://schemas.xmlsoap.org/ws/2005/02/trust/RSTR/Issue</wsa:Action>
<wsa:To xmlns:S="http://www.w3.org/2003/05/soap-envelope" xmlns:wsa="http://www.w3.org/2005/08/addressing" xmlns:wsu="http://docs.oasis-open.org/wss/2004/01/oasis-200401-wss-wssecurity-utility-1.0.xsd" wsu:Id="To" S:mustUnderstand="1">http://schemas.xmlsoap.org/ws/2004/08/addressing/role/anonymous</wsa:To>
<wsse:Security S:mustUnderstand="1">
<wsu:Timestamp xmlns:wsu="http://docs.oasis-open.org/wss/2004/01/oasis-200401-wss-wssecurity-utility-1.0.xsd" wsu:Id="TS">
<wsu:Created>2017-06-12T10:23:01Z</wsu:Created>
<wsu:Expires>2017-06-12T10:28:01Z</wsu:Expires>
</wsu:Timestamp>
</wsse:Security>
</S:Header>
<S:Body>
<wst:RequestSecurityTokenResponse xmlns:S="http://www.w3.org/2003/05/soap-envelope" xmlns:wst="http://schemas.xmlsoap.org/ws/2005/02/trust" xmlns:wsse="http://docs.oasis-open.org/wss/2004/01/oasis-200401-wss-wssecurity-secext-1.0.xsd" xmlns:wsu="http://docs.oasis-open.org/wss/2004/01/oasis-200401-wss-wssecurity-utility-1.0.xsd" xmlns:saml="urn:oasis:names:tc:SAML:1.0:assertion" xmlns:wsp="http://schemas.xmlsoap.org/ws/2004/09/policy" xmlns:psf="http://schemas.microsoft.com/Passport/SoapServices/SOAPFault">
<wst:TokenType>urn:passport:compact</wst:TokenType>
<wsp:AppliesTo xmlns:wsa="http://www.w3.org/2005/08/addressing">
<wsa:EndpointReference>
<wsa:Address>https://something.something.something.com</wsa:Address>
</wsa:EndpointReference>
</wsp:AppliesTo>
<wst:Lifetime>
<wsu:Created>2017-06-12T10:23:01Z</wsu:Created>
<wsu:Expires>2017-06-13T10:23:01Z</wsu:Expires>
</wst:Lifetime>
<wst:RequestedSecurityToken>
<wsse:BinarySecurityToken Id="Compact0">my token</wsse:BinarySecurityToken>
</wst:RequestedSecurityToken>
<wst:RequestedAttachedReference>
<wsse:SecurityTokenReference>
<wsse:Reference URI="wwwww=">
</wsse:Reference>
</wsse:SecurityTokenReference>
</wst:RequestedAttachedReference>
<wst:RequestedUnattachedReference>
<wsse:SecurityTokenReference>
<wsse:Reference URI="swsw=">
</wsse:Reference>
</wsse:SecurityTokenReference>
</wst:RequestedUnattachedReference>
</wst:RequestSecurityTokenResponse>
</S:Body>
</S:Envelope>
This declaration is part of the start tag of the root element:
xmlns:wsse="http://docs.oasis-open.org/wss/2004/01/oasis-200401-wss-wssecurity-secext-1.0.xsd"
It means that elements with the wsse prefix (such as BinarySecurityToken) are in the http://docs.oasis-open.org/wss/2004/01/oasis-200401-wss-wssecurity-secext-1.0.xsd namespace.
The solution is basically the same as in the answer to the linked question. It's just another namespace:
import xml.etree.ElementTree as ET
tree = ET.parse('soap.xml')
print(tree.find('.//{http://docs.oasis-open.org/wss/2004/01/oasis-200401-wss-wssecurity-secext-1.0.xsd}BinarySecurityToken').text)
Here is another way of doing it:
import xml.etree.ElementTree as ET
ns = {"wsse": "http://docs.oasis-open.org/wss/2004/01/oasis-200401-wss-wssecurity-secext-1.0.xsd"}
tree = ET.parse('soap.xml')
print(tree.find('.//wsse:BinarySecurityToken', ns).text)
The output in both cases is my token.
See https://docs.python.org/2.7/library/xml.etree.elementtree.html#parsing-xml-with-namespaces.
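If you prefer an explicit path over the './/' search, a sketch along these lines should work. It uses a trimmed, inline copy of the response from the question, and the prefixes in NS only need to map to the right URIs; they do not have to match the prefixes used in the document.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the SOAP response from the question.
SOAP = """<S:Envelope xmlns:S="http://www.w3.org/2003/05/soap-envelope"
    xmlns:wsse="http://docs.oasis-open.org/wss/2004/01/oasis-200401-wss-wssecurity-secext-1.0.xsd"
    xmlns:wst="http://schemas.xmlsoap.org/ws/2005/02/trust">
  <S:Body>
    <wst:RequestSecurityTokenResponse>
      <wst:RequestedSecurityToken>
        <wsse:BinarySecurityToken Id="Compact0">my token</wsse:BinarySecurityToken>
      </wst:RequestedSecurityToken>
    </wst:RequestSecurityTokenResponse>
  </S:Body>
</S:Envelope>"""

NS = {
    "S": "http://www.w3.org/2003/05/soap-envelope",
    "wst": "http://schemas.xmlsoap.org/ws/2005/02/trust",
    "wsse": "http://docs.oasis-open.org/wss/2004/01/oasis-200401-wss-wssecurity-secext-1.0.xsd",
}
root = ET.fromstring(SOAP)
# Walk down from the envelope one namespaced step at a time.
token = root.find(
    "S:Body/wst:RequestSecurityTokenResponse"
    "/wst:RequestedSecurityToken/wsse:BinarySecurityToken",
    NS,
)
token_text = token.text
token_id = token.get("Id")  # unprefixed attributes are in no namespace
```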
Creating a namespace dict helped me. Thank you @mzjn for linking that article.
In my SOAP response, I found that I was having to use the full path to the element to extract the text.
For example, I am working with FEDEX API, and one element that I needed to find was TrackDetails. My initial .find() looked like .find('{http://fedex.com/ws/track/v16}TrackDetails')
I was able to simplify this to the following:
ns = {'TrackDetails': 'http://fedex.com/ws/track/v16'}
tree.find('TrackDetails:TrackDetails',ns)
You see TrackDetails twice because I named the key TrackDetails in the dict, but you could name this anything you want. Just helped me to remember what I was working on in my project, but the TrackDetails after the : is the actual element in the SOAP response that I need.
Hope this helps someone!

Python: ignoring namespaces in xml.etree.ElementTree?

How can I tell ElementTree to ignore namespaces in an XML file?
For example, I would prefer to query modelVersion (as in statement 1) rather than {http://maven.apache.org/POM/4.0.0}modelVersion (as in statement 2).
pom="""
<project xmlns="http://maven.apache.org/POM/4.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0
http://maven.apache.org/maven-v4_0_0.xsd">
<modelVersion>4.0.0</modelVersion>
</project>
"""
from xml.etree import ElementTree
ElementTree.register_namespace("","http://maven.apache.org/POM/4.0.0")
root = ElementTree.fromstring(pom)
print(1, root.findall('modelVersion'))
print(2, root.findall('{http://maven.apache.org/POM/4.0.0}modelVersion'))
1 []
2 [<Element '{http://maven.apache.org/POM/4.0.0}modelVersion' at 0x1006bff10>]
There appears to be no straight-forward pathway, thus I'd simply wrap the find calls, e.g.
from xml.etree import ElementTree as ET
POM = """
<project xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns="http://maven.apache.org/POM/4.0.0">
<modelVersion>4.0.0</modelVersion>
</project>
"""
NSPS = {'foo' : "http://maven.apache.org/POM/4.0.0"}
# sic!
def findall(node, tag):
    # Prepend the placeholder prefix so callers can pass bare tag names
    return node.findall('foo:' + tag, NSPS)
root = ET.fromstring(POM)
print([ET.tostring(e, encoding='unicode') for e in findall(root, 'modelVersion')])
output:
['<ns0:modelVersion xmlns:ns0="http://maven.apache.org/POM/4.0.0">4.0.0</ns0:modelVersion>\n']
Here's what I'm presently doing, which makes me incredibly confident that there's a better way.
$ cat pom.xml |
tr '\n' ' ' |
sed 's/<project [^>]*>/<project>/' |
myprogram |
sed 's/<project>/<project xmlns="http:\/\/maven.apache.org\/POM\/4.0.0" xmlns:xsi="http:\/\/www.w3.org\/2001\/XMLSchema-instance" xsi:schemaLocation="http:\/\/maven.apache.org\/POM\/4.0.0 http:\/\/maven.apache.org\/maven-v4_0_0.xsd">/'
Rather than ignore, another approach would be to remove the namespaces in the tree, so there's no need to 'ignore' because they aren't there - see nonagon's answer to this question (and my extension of that to include namespaces on attributes): Python ElementTree module: How to ignore the namespace of XML files to locate matching element when using the method "find", "findall"
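A minimal sketch of that removal approach, written from scratch rather than copied from the linked answer. The function name strip_namespaces is hypothetical. The rewrite is lossy: names from different namespaces can collide, and serializing the tree afterwards yields a document without namespaces.

```python
import xml.etree.ElementTree as ET

def strip_namespaces(root):
    """Rewrite '{uri}name' tags and attribute names to plain 'name', in place."""
    for elem in root.iter():
        if isinstance(elem.tag, str) and "}" in elem.tag:
            elem.tag = elem.tag.split("}", 1)[1]
        # Iterate over a copy of the keys because the dict is modified.
        for name in list(elem.attrib):
            if "}" in name:
                elem.attrib[name.split("}", 1)[1]] = elem.attrib.pop(name)
    return root

POM = """<project xmlns="http://maven.apache.org/POM/4.0.0"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
  <modelVersion>4.0.0</modelVersion>
</project>"""

root = strip_namespaces(ET.fromstring(POM))
model_version = root.find("modelVersion").text
```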
Here's the equivalent solution without using the shell. Basic idea:
translate <project junk...> to <project>
perform "clean" processing without worrying about the namespace
translate <project> back to <project junk...>
with the new code:
pom="""
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
<modelVersion>4.0.0</modelVersion>
</project>
"""
short_project="""<project>"""
long_project="""<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">"""
import re
from xml.etree import ElementTree
# eliminate namespace specs
pom = re.sub('<project [^>]*>', short_project, pom)
root = ElementTree.fromstring(pom)
ElementTree.dump(root)
print(1, root.findall('modelVersion'))
print(2, root.findall('{http://maven.apache.org/POM/4.0.0}modelVersion'))
mv = root.findall('modelVersion')
# restore the namespace specs
pom = ElementTree.tostring(root, encoding='unicode')
pom = pom.replace(short_project, long_project, 1)
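If the reason for stripping and restoring the start tag is to keep the serialized output free of ns0: prefixes, one alternative is to keep the namespaces and register the prefixes before serializing. This is a sketch under that assumption, not the poster's method; note that register_namespace changes process-wide state, and the exact attribute order of the output is not guaranteed to match the input.

```python
import xml.etree.ElementTree as ET

POM_NS = "http://maven.apache.org/POM/4.0.0"
POM = """<project xmlns="http://maven.apache.org/POM/4.0.0"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
  <modelVersion>4.0.0</modelVersion>
</project>"""

# Process-wide setting: serialize POM_NS as the default namespace.
ET.register_namespace("", POM_NS)
# Registered explicitly so the sketch does not rely on built-in prefixes.
ET.register_namespace("xsi", "http://www.w3.org/2001/XMLSchema-instance")

root = ET.fromstring(POM)
# Queries still use the qualified name; only serialization is affected.
root.find("{%s}modelVersion" % POM_NS).text = "4.0.1"
out = ET.tostring(root, encoding="unicode")
```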

Parsing XML with namespace

With this XML
<?xml version="1.0" encoding="UTF-8"?>
<Envelope>
<subject>Reference rates</subject>
<Sender>
<name>European Central Bank</name>
</Sender>
<Cube>
<Cube time='2013-12-20'>
<Cube currency='USD' rate='1.3655'/>
<Cube currency='JPY' rate='142.66'/>
</Cube>
</Cube>
</Envelope>
I can get the inner Cube tags like this
from xml.etree.ElementTree import ElementTree
t = ElementTree()
t.parse('eurofxref-daily.xml')
day = t.find('Cube/Cube')
print('Day:', day.attrib['time'])
for currency in day:
    print(currency.items())
Day: 2013-12-20
[('currency', 'USD'), ('rate', '1.3655')]
[('currency', 'JPY'), ('rate', '142.66')]
The problem is that the above XML is a cleaned version of the original file which has defined namespaces
<?xml version="1.0" encoding="UTF-8"?>
<gesmes:Envelope xmlns:gesmes="http://www.gesmes.org/xml/2002-08-01" xmlns="http://www.ecb.int/vocabulary/2002-08-01/eurofxref">
<gesmes:subject>Reference rates</gesmes:subject>
<gesmes:Sender>
<gesmes:name>European Central Bank</gesmes:name>
</gesmes:Sender>
<Cube>
<Cube time='2013-12-20'>
<Cube currency='USD' rate='1.3655'/>
<Cube currency='JPY' rate='142.66'/>
</Cube>
</Cube>
</gesmes:Envelope>
When I try to get the first Cube tag I get a None
t = ElementTree()
t.parse('eurofxref-daily.xml')
print(t.find('Cube'))
None
The root tag includes the namespace
root = t.getroot()
print('root.tag:', root.tag)
root.tag: {http://www.gesmes.org/xml/2002-08-01}Envelope
Its children also
for e in root:
    print('e.tag:', e.tag)
e.tag: {http://www.gesmes.org/xml/2002-08-01}subject
e.tag: {http://www.gesmes.org/xml/2002-08-01}Sender
e.tag: {http://www.ecb.int/vocabulary/2002-08-01/eurofxref}Cube
I can get the Cube tags if I include the namespace in the tag
day = t.find('{http://www.ecb.int/vocabulary/2002-08-01/eurofxref}Cube/{http://www.ecb.int/vocabulary/2002-08-01/eurofxref}Cube')
print('Day:', day.attrib['time'])
Day: 2013-12-20
But that is really ugly. Apart from cleaning the file before processing or doing string manipulation is there an elegant way to handle it?
There's a more elegant way than including the whole namespace URI in the text of the query. For a python version that does not support the namespaces argument on ElementTree.find, lxml provides the missing functionality and is "mostly compatible" with xml.etree:
from lxml.etree import ElementTree
t = ElementTree()
t.parse('eurofxref-daily.xml')
namespaces = { "exr": "http://www.ecb.int/vocabulary/2002-08-01/eurofxref" }
day = t.find('exr:Cube', namespaces)
print(day)
Using the namespaces object, you can set it once and for all and then just use prefixes in your queries.
Here is the output:
$ python test.py
<Element '{http://www.ecb.int/vocabulary/2002-08-01/eurofxref}Cube' at 0x7fe0f95e3290>
If you find prefixes inelegant, then you have to work on a file without namespaces. Or there may be other tools out there that will "cheat" and match on local-name() even if namespaces are in effect but I don't use them.
In python 2.7 or python 3.3, or higher, you could use the same code as above but use xml.etree instead of lxml because they've added support for namespaces to these versions.
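For reference, a sketch of the whole task with the standard library only, using an inline, shortened copy of the ECB file from the question. The prefix exr is arbitrary, and the rates dict is just one convenient way to collect the values.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the namespaced file from the question.
XML = """<gesmes:Envelope xmlns:gesmes="http://www.gesmes.org/xml/2002-08-01"
    xmlns="http://www.ecb.int/vocabulary/2002-08-01/eurofxref">
  <gesmes:subject>Reference rates</gesmes:subject>
  <Cube>
    <Cube time='2013-12-20'>
      <Cube currency='USD' rate='1.3655'/>
      <Cube currency='JPY' rate='142.66'/>
    </Cube>
  </Cube>
</gesmes:Envelope>"""

NS = {
    "gesmes": "http://www.gesmes.org/xml/2002-08-01",
    "exr": "http://www.ecb.int/vocabulary/2002-08-01/eurofxref",
}
root = ET.fromstring(XML)
# The unprefixed Cube elements live in the default namespace (exr here).
day = root.find("exr:Cube/exr:Cube", NS)
rates = {c.get("currency"): c.get("rate") for c in day.findall("exr:Cube", NS)}
subject = root.findtext("gesmes:subject", namespaces=NS)
```

The attributes (time, currency, rate) carry no prefix, so they are read without any namespace.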

How to resolve external entities with xml.etree like lxml.etree

I have a script that parses XML using lxml.etree:
from lxml import etree
parser = etree.XMLParser(load_dtd=True, resolve_entities=True)
tree = etree.parse('main.xml', parser=parser)
I need load_dtd=True and resolve_entities=True to have &emptyEntry; from globals.xml resolved:
<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE map SYSTEM "globals.xml" [
<!ENTITY dirData "${DATADIR}">
]>
<map
xmlns:map="http://my.dummy.org/map"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://my.dummy.org/map main.xsd"
>
&emptyEntry; <!-- from globals.xml -->
<entry><key>KEY</key><value>VALUE</value></entry>
<entry><key>KEY</key><value>VALUE</value></entry>
</map>
with globals.xml
<?xml version="1.0" encoding="UTF-8"?>
<!ENTITY emptyEntry "<entry></entry>">
Now I would like to move from the non-standard lxml to the standard xml.etree. But this fails with my file because load_dtd=True and resolve_entities=True are not supported by xml.etree.
Is there an xml.etree-way to have these entities resolved?
My trick is to use the external program xmllint
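import io
import subprocess
from xml.etree import ElementTree
# xmllint --noent substitutes the entities before ElementTree sees the document
proc = subprocess.Popen(['xmllint', '--noent', fname], stdout=subprocess.PIPE)
output = proc.communicate()[0]
tree = ElementTree.parse(io.BytesIO(output))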
lxml is the right tool for the job.
But, if you want to use stdlib, then be prepared for difficulties and take a look at XMLParser's UseForeignDTD method. Here's a good (but hacky) example: Python ElementTree support for parsing unknown XML entities?
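One thing xml.etree does handle is an entity declared in the internal DTD subset, because the underlying expat parser expands those itself. So if you can edit or preprocess the file, moving the declaration from globals.xml into the DOCTYPE of the main document is a possible workaround. The sketch below assumes that change has been made by hand; it does not load globals.xml, and since entity expansion can be abused, it should only be used on trusted input.

```python
import xml.etree.ElementTree as ET

# Assumption: the emptyEntry declaration has been copied from globals.xml
# into the internal subset, so no external DTD needs to be loaded.
DOC = """<?xml version="1.0"?>
<!DOCTYPE map [
<!ENTITY emptyEntry "<entry></entry>">
]>
<map>
&emptyEntry;
<entry><key>KEY</key><value>VALUE</value></entry>
</map>"""

root = ET.fromstring(DOC)
# The expanded entity shows up as a real <entry> element.
entries = root.findall("entry")
```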
