BeautifulSoup4 Removes Namespace Definitions from Schema in WSDL - python

I am working on a custom library that consumes WSDLs. One thing I need to be able to do is pull out the namespace definitions in the schemas so I can create a map of them. What I'm running into is that BeautifulSoup (using lxml) is removing the namespace definitions from the Schema elements.
Here is one of my actual Schemas:
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns="http://servicecenter.peregrine.com/PWS" xmlns:cmn="http://servicecenter.peregrine.com/PWS/Common" attributeFormDefault="unqualified" elementFormDefault="qualified" targetNamespace="http://servicecenter.peregrine.com/PWS" version="2016-01-18 Rev 0">
And here is what bs4's rendering of it looks like:
<xsd:schema attributeFormDefault="unqualified" elementFormDefault="qualified" targetNamespace="http://servicecenter.peregrine.com/PWS" version="2016-01-18 Rev 0">
All of my xmlns attributes are gone. Obviously, this seems intentional, but I can't figure out how I can retrieve these attributes. They are not in .attrs and nothing I can find in documentation or online or using dir() has thus far yielded anything useful.
EDIT:
I reduced my WSDL to just the following:
<definitions xmlns="http://schemas.xmlsoap.org/wsdl/" xmlns:cmn="http://servicecenter.peregrine.com/PWS/Common" xmlns:http="http://schemas.xmlsoap.org/wsdl/http/" xmlns:mime="http://schemas.xmlsoap.org/wsdl/mime/" xmlns:ns="http://servicecenter.peregrine.com/PWS" xmlns:soap="http://schemas.xmlsoap.org/wsdl/soap/" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" targetNamespace="http://servicecenter.peregrine.com/PWS" xsi:schemaLocation="http://schemas.xmlsoap.org/wsdl/ http://schemas.xmlsoap.org/wsdl/">
<types>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns="http://servicecenter.peregrine.com/PWS" xmlns:cmn="http://servicecenter.peregrine.com/PWS/Common" attributeFormDefault="unqualified" elementFormDefault="qualified" targetNamespace="http://servicecenter.peregrine.com/PWS" version="2016-01-18 Rev 0">
</xs:schema>
</types>
</definitions>
And passed this to BeautifulSoup:
from bs4 import BeautifulSoup
wsdl = "..." #Replace this with wsdl from above. I didn't want to duplicate data
soup = BeautifulSoup(wsdl,'xml')
print(soup.prettify())
And now it is gone:
<?xml version="1.0" encoding="utf-8"?>
<definitions targetNamespace="http://servicecenter.peregrine.com/PWS" xmlns="http://schemas.xmlsoap.org/wsdl/" xmlns:cmn="http://servicecenter.peregrine.com/PWS/Common" xmlns:http="http://schemas.xmlsoap.org/wsdl/http/" xmlns:mime="http://schemas.xmlsoap.org/wsdl/mime/" xmlns:ns="http://servicecenter.peregrine.com/PWS" xmlns:soap="http://schemas.xmlsoap.org/wsdl/soap/" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://schemas.xmlsoap.org/wsdl/ http://schemas.xmlsoap.org/wsdl/">
<types>
<xsd:schema attributeFormDefault="unqualified" elementFormDefault="qualified" targetNamespace="http://servicecenter.peregrine.com/PWS" version="2016-01-18 Rev 0">
</xsd:schema>
</types>
</definitions>
I can see that it is apparently removing namespace declarations that are redundant (they are already defined under different names in the definitions tag) but it changes the name of these namespaces. Is there any way to prevent it from being so smart? ;) I realize in terms of functionality of the web service request, the name change does not matter, but I would like to stick as close to the actual content of the WSDL as possible.
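One possible way to read the declarations exactly as they are written, sketched here with the standard library's xml.etree.ElementTree.iterparse instead of BeautifulSoup: its "start-ns" events report each prefix/URI pair just before the "start" event of the element that declares it. The function name namespace_declarations and the trimmed WSDL literal are assumptions made for illustration, not something from the original post.

```python
import io
import xml.etree.ElementTree as ET

# Trimmed copy of the reduced WSDL from the question (fewer xmlns attributes).
WSDL = b"""<definitions xmlns="http://schemas.xmlsoap.org/wsdl/" xmlns:xsd="http://www.w3.org/2001/XMLSchema">
<types>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns="http://servicecenter.peregrine.com/PWS" xmlns:cmn="http://servicecenter.peregrine.com/PWS/Common" targetNamespace="http://servicecenter.peregrine.com/PWS">
</xs:schema>
</types>
</definitions>"""


def namespace_declarations(xml_bytes):
    """Return a list of (element tag, {prefix: uri}) for every element
    that carries its own xmlns declarations, in document order."""
    found = []
    pending = {}
    events = ET.iterparse(io.BytesIO(xml_bytes), events=("start-ns", "start"))
    for event, payload in events:
        if event == "start-ns":
            prefix, uri = payload  # prefix is '' for the default namespace
            pending[prefix] = uri
        elif pending:
            # This "start" event belongs to the element that made the declarations.
            found.append((payload.tag, pending))
            pending = {}
    return found


declarations = namespace_declarations(WSDL)
for tag, mapping in declarations:
    print(tag, mapping)
```

This keeps the prefixes the schema author chose (xs, cmn and the default namespace) separate from those declared on the definitions element. If you stay with lxml, each element's nsmap attribute may be worth a look as well, although it also includes inherited declarations.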

Related

How to remove all occurrences of element in XML file?

I'd like to edit a KML file and remove all occurrences of ExtendedData elements, wherever they are located in the file.
Here's the input XML file:
<?xml version="1.0" encoding="UTF-8"?>
<kml xmlns="http://earth.google.com/kml/2.2">
<Document>
<Style id="placemark-red">
<IconStyle>
<Icon>
<href>http://maps.me/placemarks/placemark-red.png</href>
</Icon>
</IconStyle>
</Style>
<name>My track</name>
<ExtendedData xmlns:mwm="https://maps.me">
<mwm:name>
<mwm:lang code="default">Blah</mwm:lang>
</mwm:name>
<mwm:lastModified>2020-04-05T14:17:18Z</mwm:lastModified>
</ExtendedData>
<Placemark>
<name></name>
…
<ExtendedData xmlns:mwm="https://maps.me">
<mwm:localId>0</mwm:localId>
<mwm:visibility>1</mwm:visibility>
</ExtendedData>
</Placemark>
</Document>
</kml>
And here's the code that 1) only removes the outermost occurrence, and 2) requires adding the namespace to find it:
from lxml import etree
from pykml import parser
from pykml.factory import KML_ElementMaker as KML
with open("input.xml") as f:
    doc = parser.parse(f)
root = doc.getroot()
ns = "{http://earth.google.com/kml/2.2}"
for pm in root.Document.getchildren():
    # No way to get rid of namespace, for easier search?
    if pm.tag == f"{ns}ExtendedData":
        root.Document.remove(pm)
# How to remove innermost occurrence of ExtendedData?
print(etree.tostring(doc, pretty_print=True))
Is there a way to remove all occurrences in one go, or should I parse the whole tree?
Thank you.
Edit: The BeautifulSoup solution below requires adding an option "BeautifulSoup(my_xml,features="lxml")" to avoid the warning "No parser was explicitly specified".
Here's a solution using BeautifulSoup:
from bs4 import BeautifulSoup
soup = BeautifulSoup(my_xml)  # this is your xml
# An HTML parser lowercases tag names, hence "extendeddata"
while True:
    elem = soup.find("extendeddata")
    if not elem:
        break
    elem.decompose()
Here's the output for your data:
<?xml version="1.0" encoding="UTF-8"?>
<html>
<body>
<kml xmlns="http://earth.google.com/kml/2.2">
<document>
<style id="placemark-red">
<IconStyle>
<Icon>
<href>http://maps.me/placemarks/placemark-red.png</href>
</Icon>
</IconStyle>
</style>
<name>
My track
</name>
<placemark>
<name>
</name>
</placemark>
</document>
</kml>
</body>
</html>
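If you know the XML structure, try the following (the KML default namespace has to be part of the search path, and ElementTree does not accept paths starting with "/"):
from xml.etree import ElementTree
ns = {"kml": "http://earth.google.com/kml/2.2"}
xml_root = ElementTree.parse(filename_path).getroot()
doc_elem = xml_root.find("kml:Document", ns)
elem = doc_elem.find("kml:ExtendedData", ns)
doc_elem.remove(elem)
or
p_elem = xml_root.find("kml:Document/kml:Placemark", ns)
c_elem = p_elem.find("kml:ExtendedData", ns)
p_elem.remove(c_elem)
Play with these ideas :)
If you don't know the XML structure, I think you need to parse the whole tree.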
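When the structure is not known, walking the whole tree could look roughly like this. It is a minimal sketch using only the standard library; the helper name remove_all and the shortened KML sample are assumptions made for illustration. ElementTree elements do not know their parent, so the loop visits every element as a potential parent and checks its direct children.

```python
import xml.etree.ElementTree as ET

KML_NS = "http://earth.google.com/kml/2.2"

# Shortened version of the KML from the question.
SAMPLE = """<kml xmlns="http://earth.google.com/kml/2.2">
<Document>
<name>My track</name>
<ExtendedData xmlns:mwm="https://maps.me">
<mwm:lastModified>2020-04-05T14:17:18Z</mwm:lastModified>
</ExtendedData>
<Placemark>
<name></name>
<ExtendedData xmlns:mwm="https://maps.me">
<mwm:localId>0</mwm:localId>
</ExtendedData>
</Placemark>
</Document>
</kml>"""


def remove_all(root, local_name, namespace):
    """Detach every element named local_name in namespace, at any depth.
    Returns the number of elements removed."""
    tag = f"{{{namespace}}}{local_name}"
    removed = 0
    # Snapshot both levels so that removal does not disturb the iteration.
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag == tag:
                parent.remove(child)
                removed += 1
    return removed


root = ET.fromstring(SAMPLE)
removed_count = remove_all(root, "ExtendedData", KML_NS)
remaining = root.findall(f".//{{{KML_NS}}}ExtendedData")
print(removed_count, len(remaining))
```

The namespace still has to be spelled out once, but only in one place, and both the outer and the nested ExtendedData elements are handled by the same loop.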
Simply run an identity transform plus one empty template, using XSLT 1.0, which Python's lxml can run. No for/while loops or if logic are needed. To handle the default namespace, define a prefix such as doc:
XSLT (save a .xsl file, a special .xml file)
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:doc="http://earth.google.com/kml/2.2">
<xsl:output method="xml" omit-xml-declaration="yes" indent="yes"/>
<xsl:strip-space elements="*"/>
<!-- IDENTITY TRANSFORM -->
<xsl:template match="node()|@*">
<xsl:copy>
<xsl:apply-templates select="node()|@*"/>
</xsl:copy>
</xsl:template>
<!-- REMOVE ALL OCCURRENCES OF NODE -->
<xsl:template match="doc:ExtendedData"/>
</xsl:stylesheet>
Python
import lxml.etree as et
# LOAD XML AND XSL SOURCES
xml = et.parse('Input.xml')
xsl = et.parse('XSLT_Script.xsl')
# TRANSFORM INPUT
transform = et.XSLT(xsl)
result = transform(xml)
# PRINT TO SCREEN
print(result)
# SAVE TO FILE
with open('Output.kml', 'wb') as f:
    f.write(result)

Remove unwanted tags from XML file

I am working on an XML file that contains SOAP tags. I want to remove those SOAP tags as part of an XML cleanup process.
How can I achieve this in either Python or Scala? It should not use a shell script.
Sample Input :
<?xml version="1.0" encoding="UTF-8"?>
<soap:Envelope xmlns:soap="http://sample.com/">
<soap:Body>
<com:RESPONSE xmlns:com="http://sample.com/">
<Student>
<StudentID>100234</StudentID>
<Gender>Male</Gender>
<Surname>Robert</Surname>
<Firstname>Mathews</Firstname>
</Student>
</com:RESPONSE>
</soap:Body>
</soap:Envelope>
Expected Output :
<?xml version="1.0" encoding="UTF-8"?>
<com:RESPONSE xmlns:com="http://sample.com/">
<Student>
<StudentID>100234</StudentID>
<Gender>Male</Gender>
<Surname>Robert</Surname>
<Firstname>Mathews</Firstname>
</Student>
</com:RESPONSE>
This could help you!
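from copy import deepcopy
from lxml import etree
doc = etree.parse('test.xml')
# 'soap' is a namespace prefix, not a tag name, so '//soap' matches nothing.
# Instead, take the first child of soap:Body and use it as the new root.
ns = {'soap': 'http://sample.com/'}
payload = deepcopy(doc.xpath('/soap:Envelope/soap:Body/*', namespaces=ns)[0])
etree.cleanup_namespaces(payload)  # drop declarations the payload does not use
print(etree.tostring(payload, xml_declaration=True, encoding='UTF-8').decode())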

how to add xml child node with namespace in python?

I'm really stuck on this. I have a file with an XML layout like this:
<rss xmlns:irc="SomeName" version="2.0">
<channel>
<item>
<irc:title>A title</irc:title>
<irc:poster>A poster</irc:poster>
<irc:url>An url</irc:url>
</item>
</channel>
</rss>
I need to add another 'item' to the channel node. That part is easy, but I can't find a way to add the item's children with the namespace.
I'm trying with lxml, but the documentation is not so clear for a newbie.
Any help will be appreciated.
I found a way to do it with lxml:
import lxml.etree as et
# xml is the already-parsed tree (the result of et.parse on the file)
root = xml.getroot()
channel = root.find('channel')
item = et.Element('item')
title = et.SubElement(item, '{SomeName}title')
title.text = 'My new title'
poster = et.SubElement(item, '{SomeName}poster')
poster.text = 'My poster'
url = et.SubElement(item, '{SomeName}url')
url.text = 'http://My.url.com'
channel.append(item)
But I'm still interested in a better solution.
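A slightly more compact variant of the same idea, sketched with the standard library's xml.etree.ElementTree: the namespace URI is written once and a small helper builds the children from a dict. The helper name add_item is an assumption made for illustration. Note that ET.register_namespace is a process-wide setting; it is used here so that the output keeps the irc prefix.

```python
import xml.etree.ElementTree as ET

IRC_NS = "SomeName"  # namespace URI from the question's sample
# Global, process-wide setting: serialize IRC_NS with the 'irc' prefix.
ET.register_namespace("irc", IRC_NS)

SAMPLE = """<rss xmlns:irc="SomeName" version="2.0">
<channel>
<item>
<irc:title>A title</irc:title>
<irc:poster>A poster</irc:poster>
<irc:url>An url</irc:url>
</item>
</channel>
</rss>"""


def add_item(channel, fields):
    """Append an <item> whose children live in IRC_NS.
    fields maps local tag names to their text."""
    item = ET.SubElement(channel, "item")
    for name, text in fields.items():
        child = ET.SubElement(item, f"{{{IRC_NS}}}{name}")
        child.text = text
    return item


root = ET.fromstring(SAMPLE)
channel = root.find("channel")
add_item(channel, {
    "title": "My new title",
    "poster": "My poster",
    "url": "http://My.url.com",
})
output = ET.tostring(root, encoding="unicode")
print(output)
```

The "{uri}localname" form is the same notation used in the lxml snippet above, so the helper should carry over to lxml with little change.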
Alternatively, you can use XSLT, the declarative programming language, that transforms, styles, re-formats, and re-structures XML files in any way, shape, or form. Python's lxml module maintains an XSLT processor.
Simply, register the needed namespace in the XSLT's declaration line and use it in any new node. This might appear to be overkill for your current need but there could be a situation where a more complex transformation is needed with appropriate namespaces. Below adds a new title to the previous poster and URL.
XSLT (to be saved as .xsl)
<?xml version="1.0" ?>
<xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0"
xmlns:irc="SomeName">
<xsl:strip-space elements="*" />
<xsl:output method="xml" indent="yes"/>
<xsl:template match="rss">
<rss>
<channel>
<xsl:for-each select="//item">
<item>
<irc:title>My new title</irc:title>
<xsl:copy-of select="irc:poster"/>
<xsl:copy-of select="irc:url"/>
</item>
</xsl:for-each>
</channel>
</rss>
</xsl:template>
</xsl:transform>
Python
import os
import lxml.etree as ET
# SET CURRENT DIRECTORY
cd = os.path.dirname(os.path.abspath(__file__))
# LOAD IN XML AND XSL FILES
dom = ET.parse(os.path.join(cd, 'Original.xml'))
xslt = ET.parse(os.path.join(cd, 'XSLT_Script.xsl'))
# TRANSFORM
transform = ET.XSLT(xslt)
newdom = transform(dom)
# OUTPUT FINAL XML
tree_out = ET.tostring(newdom, encoding='UTF-8', pretty_print=True, xml_declaration=True)
xmlfile = open(os.path.join(cd, 'output.xml'),'wb')
xmlfile.write(tree_out)
xmlfile.close()
Output
<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:irc="SomeName">
<channel>
<item>
<irc:title>My new title</irc:title>
<irc:poster>A poster</irc:poster>
<irc:url>An url</irc:url>
</item>
</channel>
</rss>

Parsing XML with namespaces into a dataframe

I have the following simplified XML:
<?xml version="1.0" encoding="UTF-8"?>
<soap:Envelope xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:soap="http://www.w3.org/2003/05/soap-envelope">
<soap:Body>
<ReadResponse xmlns="ABCDEFG.com">
<ReadResult>
<Value>
<Alias>x1</Alias>
<Timestamp>2013-11-11T00:00:00</Timestamp>
<Val>113</Val>
<Duration>5000</Duration>
<Quality>128</Quality>
</Value>
<Value>
<Alias>x1</Alias>
<Timestamp>2014-11-11T00:02:00</Timestamp>
<Val>110</Val>
<Duration>5000</Duration>
<Quality>128</Quality>
</Value>
<Value>
<Alias>x2</Alias>
<Timestamp>2013-11-11T00:00:00</Timestamp>
<Val>101</Val>
<Duration>5000</Duration>
<Quality>128</Quality>
</Value>
<Value>
<Alias>x2</Alias>
<Timestamp>2014-11-11T00:02:00</Timestamp>
<Val>122</Val>
<Duration>5000</Duration>
<Quality>128</Quality>
</Value>
</ReadResult>
</ReadResponse>
</soap:Body>
</soap:Envelope>
and would like to parse it into a dataframe with the following structure (keeping some of the tags and discarding the rest):
Timestamp x1 x2
2013-11-11T00:00:00 113 101
2014-11-11T00:02:00 110 122
The problem is that, since the XML file includes namespaces, I don't know how to proceed. I have gone through several tutorials (e.g., https://docs.python.org/2/library/pyexpat.html) and questions (e.g., How to open this XML file to create dataframe in Python? and Parsing XML with namespace in Python via 'ElementTree'), but none of them have helped. I would appreciate it if anyone could help me sort this out.
Here is an example on how to parse an xml using lxml and xpaths:
from lxml import etree
namespaces = {'abc': "ABCDEFG.com"}
xmltree = etree.fromstring(xml_string)
items = xmltree.xpath('//abc:Alias/text()', namespaces=namespaces)
print(items)
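To get from there to the Timestamp/x1/x2 layout asked for, one option is to group the values by timestamp first. The following is only a sketch under stated assumptions: it uses the standard library's xml.etree.ElementTree rather than lxml, the function name pivot_values is made up for illustration, and the sample is a shortened copy of the XML from the question.

```python
import xml.etree.ElementTree as ET

NS = {"abc": "ABCDEFG.com"}  # prefix chosen here; the URI comes from the XML

# Shortened version of the response from the question.
SAMPLE = """<soap:Envelope xmlns:soap="http://www.w3.org/2003/05/soap-envelope">
<soap:Body>
<ReadResponse xmlns="ABCDEFG.com">
<ReadResult>
<Value><Alias>x1</Alias><Timestamp>2013-11-11T00:00:00</Timestamp><Val>113</Val></Value>
<Value><Alias>x1</Alias><Timestamp>2014-11-11T00:02:00</Timestamp><Val>110</Val></Value>
<Value><Alias>x2</Alias><Timestamp>2013-11-11T00:00:00</Timestamp><Val>101</Val></Value>
<Value><Alias>x2</Alias><Timestamp>2014-11-11T00:02:00</Timestamp><Val>122</Val></Value>
</ReadResult>
</ReadResponse>
</soap:Body>
</soap:Envelope>"""


def pivot_values(xml_text):
    """Return {timestamp: {alias: value}} built from every abc:Value element."""
    root = ET.fromstring(xml_text)
    table = {}
    for value in root.iterfind(".//abc:Value", NS):
        alias = value.findtext("abc:Alias", namespaces=NS)
        stamp = value.findtext("abc:Timestamp", namespaces=NS)
        number = int(value.findtext("abc:Val", namespaces=NS))
        table.setdefault(stamp, {})[alias] = number
    return table


table = pivot_values(SAMPLE)
for stamp in sorted(table):
    print(stamp, table[stamp])
```

If pandas is installed, passing the result to pandas.DataFrame.from_dict(table, orient="index") should give one row per timestamp with x1 and x2 as columns.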

python remove element containing namespace

I am trying to remove an element in an xml which contains a namespace.
Here is my code:
templateXml = """<?xml version="1.0" encoding="UTF-8"?>
<Metadata xmlns="http://www.amazon.com/UnboxMetadata/v1">
<Movie>
<CountryOfOrigin>US</CountryOfOrigin>
<TitleInfo>
<Title locale="en-GB">The Title</Title>
<Actor>
<ActorName locale="en-GB">XXX</ActorName>
<Character locale="en-GB">XXX</Character>
</Actor>
</TitleInfo>
</Movie>
</Metadata>"""
from lxml import etree
tree = etree.fromstring(templateXml)
namespaces = {'ns':'http://www.amazon.com/UnboxMetadata/v1'}
for checkActor in tree.xpath('//ns:Actor', namespaces=namespaces):
    etree.strip_elements(tree, 'ns:Actor')
In my actual XML I have lots of tags, so I am trying to search for the Actor tags which contain XXX and completely remove each such tag and its contents. But it's not working.
Use remove() method:
for checkActor in tree.xpath('//ns:Actor', namespaces=namespaces):
    checkActor.getparent().remove(checkActor)
print(etree.tostring(tree, pretty_print=True, xml_declaration=True).decode())
prints:
<?xml version='1.0' encoding='ASCII'?>
<Metadata xmlns="http://www.amazon.com/UnboxMetadata/v1">
<Movie>
<CountryOfOrigin>US</CountryOfOrigin>
<TitleInfo>
<Title locale="en-GB">The Title</Title>
</TitleInfo>
</Movie>
</Metadata>
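The loop above removes every Actor. To remove only the ones whose ActorName is XXX, as the question asks, a filter along these lines might work. This is a sketch using the standard library's xml.etree.ElementTree; the function name remove_actors_named and the second actor in the sample are invented for illustration. Tags are written in "{uri}name" form; as far as I know this is also the form lxml's strip_elements expects, which would explain why 'ns:Actor' matched nothing.

```python
import xml.etree.ElementTree as ET

NS = "http://www.amazon.com/UnboxMetadata/v1"

# The second Actor ("Jane Doe") is invented here so there is something to keep.
SAMPLE = """<Metadata xmlns="http://www.amazon.com/UnboxMetadata/v1">
<Movie>
<TitleInfo>
<Title locale="en-GB">The Title</Title>
<Actor>
<ActorName locale="en-GB">XXX</ActorName>
<Character locale="en-GB">XXX</Character>
</Actor>
<Actor>
<ActorName locale="en-GB">Jane Doe</ActorName>
<Character locale="en-GB">Lead</Character>
</Actor>
</TitleInfo>
</Movie>
</Metadata>"""


def remove_actors_named(root, name):
    """Remove each Actor whose ActorName text equals name; return the count."""
    actor_tag = f"{{{NS}}}Actor"
    name_tag = f"{{{NS}}}ActorName"
    removed = 0
    # ElementTree elements have no parent pointer, so look at each
    # element's direct children and remove matches from their parent.
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag == actor_tag and child.findtext(name_tag) == name:
                parent.remove(child)
                removed += 1
    return removed


root = ET.fromstring(SAMPLE)
removed_count = remove_actors_named(root, "XXX")
names = [e.text for e in root.iter(f"{{{NS}}}ActorName")]
print(removed_count, names)
```

With lxml the same check could be placed inside the XPath loop above, keeping getparent().remove() for the removal itself.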
