Parsing XML with namespaces into a dataframe

I have the following simplified XML:
<?xml version="1.0" encoding="UTF-8"?>
<soap:Envelope xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:soap="http://www.w3.org/2003/05/soap-envelope">
<soap:Body>
<ReadResponse xmlns="ABCDEFG.com">
<ReadResult>
<Value>
<Alias>x1</Alias>
<Timestamp>2013-11-11T00:00:00</Timestamp>
<Val>113</Val>
<Duration>5000</Duration>
<Quality>128</Quality>
</Value>
<Value>
<Alias>x1</Alias>
<Timestamp>2014-11-11T00:02:00</Timestamp>
<Val>110</Val>
<Duration>5000</Duration>
<Quality>128</Quality>
</Value>
<Value>
<Alias>x2</Alias>
<Timestamp>2013-11-11T00:00:00</Timestamp>
<Val>101</Val>
<Duration>5000</Duration>
<Quality>128</Quality>
</Value>
<Value>
<Alias>x2</Alias>
<Timestamp>2014-11-11T00:02:00</Timestamp>
<Val>122</Val>
<Duration>5000</Duration>
<Quality>128</Quality>
</Value>
</ReadResult>
</ReadResponse>
</soap:Body>
</soap:Envelope>
and would like to parse it into a dataframe with the following structure (keeping some of the tags and discarding the rest):
Timestamp x1 x2
2013-11-11T00:00:00 113 101
2014-11-11T00:02:00 110 122
The problem is that the XML file includes namespaces, and I don't know how to proceed. I have gone through several tutorials (e.g., https://docs.python.org/2/library/pyexpat.html) and questions (e.g., How to open this XML file to create dataframe in Python? and Parsing XML with namespace in Python via 'ElementTree'), but none of them worked for me. I would appreciate any help sorting this out.

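Here is an example of how to parse the XML using lxml and XPath. The default namespace ABCDEFG.com has no prefix in the document, so bind it to a prefix of your choice (here abc) and use that prefix in the expression:
from lxml import etree
# Map an arbitrary prefix to the default namespace declared on <ReadResponse>.
namespaces = {'abc': "ABCDEFG.com"}
# xml_string should be bytes: lxml rejects str input that carries an encoding declaration.
xmltree = etree.fromstring(xml_string)
items = xmltree.xpath('//abc:Alias/text()', namespaces=namespaces)
print(items)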
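The lookup above only returns the Alias values. A minimal sketch of the full pivot the question asks for follows. It uses the standard library's xml.etree.ElementTree rather than lxml (a swapped-in parser with the same namespace-mapping idea). The helper name pivot_values and the prefix abc are illustrative, and the sample XML is a trimmed copy of the one in the question.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the question's XML (Quality omitted for brevity).
XML = b"""<?xml version="1.0" encoding="UTF-8"?>
<soap:Envelope xmlns:soap="http://www.w3.org/2003/05/soap-envelope">
<soap:Body>
<ReadResponse xmlns="ABCDEFG.com">
<ReadResult>
<Value><Alias>x1</Alias><Timestamp>2013-11-11T00:00:00</Timestamp><Val>113</Val><Duration>5000</Duration></Value>
<Value><Alias>x1</Alias><Timestamp>2014-11-11T00:02:00</Timestamp><Val>110</Val><Duration>5000</Duration></Value>
<Value><Alias>x2</Alias><Timestamp>2013-11-11T00:00:00</Timestamp><Val>101</Val><Duration>5000</Duration></Value>
<Value><Alias>x2</Alias><Timestamp>2014-11-11T00:02:00</Timestamp><Val>122</Val><Duration>5000</Duration></Value>
</ReadResult>
</ReadResponse>
</soap:Body>
</soap:Envelope>"""

# Arbitrary prefix bound to the document's default namespace.
NS = {"abc": "ABCDEFG.com"}


def pivot_values(xml_bytes):
    """Return one dict per Timestamp, with one key per Alias holding Val."""
    root = ET.fromstring(xml_bytes)
    table = {}
    for value in root.iterfind(".//abc:Value", NS):
        timestamp = value.findtext("abc:Timestamp", namespaces=NS)
        alias = value.findtext("abc:Alias", namespaces=NS)
        val = value.findtext("abc:Val", namespaces=NS)
        # Other children (Duration, Quality) are simply never read.
        table.setdefault(timestamp, {"Timestamp": timestamp})[alias] = int(val)
    # ISO 8601 timestamps sort correctly as plain strings.
    return [table[ts] for ts in sorted(table)]


rows = pivot_values(XML)
for row in rows:
    print(row)
```

If pandas is installed, passing the list to pandas.DataFrame(rows) should give a frame with the columns Timestamp, x1 and x2.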

Related

Remove unwanted tags from XML file

I am working on an XML file that contains SOAP tags. I want to remove those SOAP tags as part of an XML cleanup process.
How can I achieve this in either Python or Scala? A shell script is not an option.
Sample Input :
<?xml version="1.0" encoding="UTF-8"?>
<soap:Envelope xmlns:soap="http://sample.com/">
<soap:Body>
<com:RESPONSE xmlns:com="http://sample.com/">
<Student>
<StudentID>100234</StudentID>
<Gender>Male</Gender>
<Surname>Robert</Surname>
<Firstname>Mathews</Firstname>
</Student>
</com:RESPONSE>
</soap:Body>
</soap:Envelope>
Expected Output :
<?xml version="1.0" encoding="UTF-8"?>
<com:RESPONSE xmlns:com="http://sample.com/">
<Student>
<StudentID>100234</StudentID>
<Gender>Male</Gender>
<Surname>Robert</Surname>
<Firstname>Mathews</Firstname>
</Student>
</com:RESPONSE>

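This could help you! Note that the XPath '//soap' matches nothing here, because soap is a namespace prefix rather than an element name, and removing soap:Envelope or soap:Body would delete the payload along with them. Select the element inside soap:Body and serialize it on its own instead (the serialized element may still carry namespace declarations that were in scope on its ancestors):
from lxml import etree
doc = etree.parse('test.xml')
# Both prefixes in the sample map to the same URI, http://sample.com/.
ns = {'soap': 'http://sample.com/'}
# The first child element of soap:Body is the payload to keep.
payload = doc.xpath('/soap:Envelope/soap:Body/*', namespaces=ns)[0]
print(etree.tostring(payload, xml_declaration=True, encoding='UTF-8').decode())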
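If lxml is not available, the same unwrapping can be sketched with the standard library. This is only an illustration under stated assumptions: the helper name unwrap_soap_body is hypothetical, the sample is the one from the question, and ET.register_namespace is used so that the serializer writes the com prefix instead of a generated one such as ns0.

```python
import xml.etree.ElementTree as ET

SOAP = """<soap:Envelope xmlns:soap="http://sample.com/">
<soap:Body>
<com:RESPONSE xmlns:com="http://sample.com/">
<Student>
<StudentID>100234</StudentID>
<Gender>Male</Gender>
<Surname>Robert</Surname>
<Firstname>Mathews</Firstname>
</Student>
</com:RESPONSE>
</soap:Body>
</soap:Envelope>"""

# In the sample, the soap and com prefixes share this URI.
NS_URI = "http://sample.com/"

# ElementTree drops the original prefixes while parsing; registering one
# controls which prefix is written back out for this URI.
ET.register_namespace("com", NS_URI)


def unwrap_soap_body(xml_text):
    """Return the first child of the SOAP Body as a standalone XML string."""
    root = ET.fromstring(xml_text)
    body = root.find("{%s}Body" % NS_URI)
    payload = body[0]
    # Drop the whitespace that followed the payload inside Body.
    payload.tail = None
    return ET.tostring(payload, encoding="unicode")


result = unwrap_soap_body(SOAP)
print(result)
```

With encoding="unicode" the result is a str without an XML declaration, so the declaration line has to be written separately if it is needed.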

Insert XML document into existing XML with Python

Given these XML documents:
Document 1
<root>
<element1>
</element1>
</root>
Document 2
<request>
<dummyValue>5</dummyValue>
</request>
Using Python's ElementTree, I'd like to insert the second document into the first document so that the result looks as follows.
Resulting document
<root>
<element1>
<request>
<dummyValue>5</dummyValue>
</request>
</element1>
</root>
ET.SubElement(element1, request) gives me a serialization error.
Is there a Pythonic way of doing this?

SubElement() constructs an Element and then attaches it to the tree. Since you already have request as an Element, you don't need to construct a new one.
Try element1.append(request), like so:
import xml.etree.ElementTree as ET
doc1 = ET.XML('''
<root>
<element1>
</element1>
</root>
''')
request = ET.XML('''
<request>
<dummyValue>5</dummyValue>
</request>
''')
for element1 in doc1.findall('element1'):
    element1.append(request)
ET.dump(doc1)
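One caveat with the loop above: if the document contains more than one element1, the same request object is attached to each parent, so a later change to it shows up everywhere. A minimal sketch of one way around that, assuming independent copies are wanted, is to append a copy.deepcopy of the element. The two-element sample document is made up for illustration.

```python
import copy
import xml.etree.ElementTree as ET

# Hypothetical variant of Document 1 with two insertion points.
doc1 = ET.XML("<root><element1/><element1/></root>")
request = ET.XML("<request><dummyValue>5</dummyValue></request>")

for element1 in doc1.findall("element1"):
    # Each parent receives its own copy of the request subtree.
    element1.append(copy.deepcopy(request))

result = ET.tostring(doc1, encoding="unicode")
print(result)
```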

BeautifulSoup4 Removes Namespace Definitions from Schema in WSDL

I am working on a custom library that consumes WSDLs. One thing I need to be able to do is pull out the namespace definitions in the schemas so I can create a map of them. What I'm running into is that BeautifulSoup (using lxml) is removing the namespace definitions from the Schema elements.
Here is one of my actual Schemas:
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns="http://servicecenter.peregrine.com/PWS" xmlns:cmn="http://servicecenter.peregrine.com/PWS/Common" attributeFormDefault="unqualified" elementFormDefault="qualified" targetNamespace="http://servicecenter.peregrine.com/PWS" version="2016-01-18 Rev 0">
And here is what bs4's rendering of it looks like:
<xsd:schema attributeFormDefault="unqualified" elementFormDefault="qualified" targetNamespace="http://servicecenter.peregrine.com/PWS" version="2016-01-18 Rev 0">
All of my xmlns attributes are gone. Obviously, this seems intentional, but I can't figure out how I can retrieve these attributes. They are not in .attrs and nothing I can find in documentation or online or using dir() has thus far yielded anything useful.
EDIT:
I reduced my WSDL to just the following:
<definitions xmlns="http://schemas.xmlsoap.org/wsdl/" xmlns:cmn="http://servicecenter.peregrine.com/PWS/Common" xmlns:http="http://schemas.xmlsoap.org/wsdl/http/" xmlns:mime="http://schemas.xmlsoap.org/wsdl/mime/" xmlns:ns="http://servicecenter.peregrine.com/PWS" xmlns:soap="http://schemas.xmlsoap.org/wsdl/soap/" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" targetNamespace="http://servicecenter.peregrine.com/PWS" xsi:schemaLocation="http://schemas.xmlsoap.org/wsdl/ http://schemas.xmlsoap.org/wsdl/">
<types>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns="http://servicecenter.peregrine.com/PWS" xmlns:cmn="http://servicecenter.peregrine.com/PWS/Common" attributeFormDefault="unqualified" elementFormDefault="qualified" targetNamespace="http://servicecenter.peregrine.com/PWS" version="2016-01-18 Rev 0">
</xs:schema>
</types>
</definitions>
And passed this to BeautifulSoup:
from bs4 import BeautifulSoup
wsdl = "..." #Replace this with wsdl from above. I didn't want to duplicate data
soup = BeautifulSoup(wsdl,'xml')
print(soup.prettify())
And now it is gone:
<?xml version="1.0" encoding="utf-8"?>
<definitions targetNamespace="http://servicecenter.peregrine.com/PWS" xmlns="http://schemas.xmlsoap.org/wsdl/" xmlns:cmn="http://servicecenter.peregrine.com/PWS/Common" xmlns:http="http://schemas.xmlsoap.org/wsdl/http/" xmlns:mime="http://schemas.xmlsoap.org/wsdl/mime/" xmlns:ns="http://servicecenter.peregrine.com/PWS" xmlns:soap="http://schemas.xmlsoap.org/wsdl/soap/" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://schemas.xmlsoap.org/wsdl/ http://schemas.xmlsoap.org/wsdl/">
<types>
<xsd:schema attributeFormDefault="unqualified" elementFormDefault="qualified" targetNamespace="http://servicecenter.peregrine.com/PWS" version="2016-01-18 Rev 0">
</xsd:schema>
</types>
</definitions>
I can see that it is apparently removing namespace declarations that are redundant (they are already defined under different names in the definitions tag) but it changes the name of these namespaces. Is there any way to prevent it from being so smart? ;) I realize in terms of functionality of the web service request, the name change does not matter, but I would like to stick as close to the actual content of the WSDL as possible.
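For the stated goal of building a map of the namespace definitions, one option that does not go through BeautifulSoup is the standard library's xml.etree.ElementTree.iterparse with the start-ns event, which reports each declaration as a (prefix, uri) pair before the start event of the element that carries it. The sketch below is an illustration, not a fix for the bs4 behaviour. The function name declarations_by_element is hypothetical, and the WSDL is a shortened version of the reduced one above.

```python
import io
import xml.etree.ElementTree as ET

# Shortened version of the reduced WSDL from the question.
WSDL = b"""<definitions xmlns="http://schemas.xmlsoap.org/wsdl/" xmlns:xsd="http://www.w3.org/2001/XMLSchema">
<types>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns="http://servicecenter.peregrine.com/PWS" xmlns:cmn="http://servicecenter.peregrine.com/PWS/Common" targetNamespace="http://servicecenter.peregrine.com/PWS">
</xs:schema>
</types>
</definitions>"""


def declarations_by_element(xml_bytes):
    """Return (tag, {prefix: uri}) pairs, one per element, in document order."""
    pending = []
    declared = []
    events = ET.iterparse(io.BytesIO(xml_bytes), events=["start-ns", "start"])
    for event, payload in events:
        if event == "start-ns":
            # payload is (prefix, uri); the default namespace has prefix "".
            pending.append(payload)
        else:
            # payload is the element whose start tag carried the
            # declarations collected since the previous start event.
            declared.append((payload.tag, dict(pending)))
            pending = []
    return declared


declared = declarations_by_element(WSDL)
for tag, namespaces in declared:
    print(tag, namespaces)
```

Because the declarations are captured as the parser sees them, redundant ones (such as xs and xsd pointing at the same URI) are kept under their original prefixes.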

how to add xml child node with namespace in python?

I'm really stuck on this. I have a file with an XML layout like this:
<rss xmlns:irc="SomeName" version="2.0">
<channel>
<item>
<irc:title>A title</irc:title>
<irc:poster>A poster</irc:poster>
<irc:url>An url</irc:url>
</item>
</channel>
</rss>
I need to add another 'item' to the channel node. That part is easy, but I can't find a way to add the item's children with the namespace.
I'm trying with lxml, but the documentation is not very clear for a newbie.
Any help will be appreciated.
I found a way to do it with lxml:
from lxml import etree as et
# xml is the tree already parsed with et.parse(...)
root = xml.getroot()
channel = root.find('channel')
item = et.Element('item')
# '{SomeName}title' is Clark notation: the namespace URI in braces, then the local name.
title = et.SubElement(item, '{SomeName}title')
title.text = 'My new title'
poster = et.SubElement(item, '{SomeName}poster')
poster.text = 'My poster'
url = et.SubElement(item, '{SomeName}url')
url.text = 'http://My.url.com'
channel.append(item)
but I'm still interested in a better solution.
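The same approach can be wrapped in a small helper. The sketch below uses the standard library's xml.etree.ElementTree instead of lxml (the Clark notation for tags is the same in both). The helper name add_item is hypothetical, and ET.register_namespace is called so that the output keeps the irc prefix rather than a generated one.

```python
import xml.etree.ElementTree as ET

RSS = """<rss xmlns:irc="SomeName" version="2.0">
<channel>
<item>
<irc:title>A title</irc:title>
<irc:poster>A poster</irc:poster>
<irc:url>An url</irc:url>
</item>
</channel>
</rss>"""

IRC = "SomeName"

# Keep the irc prefix when the tree is serialized again.
ET.register_namespace("irc", IRC)


def add_item(root, fields):
    """Append an <item> to <channel> with one namespaced child per field."""
    channel = root.find("channel")
    item = ET.SubElement(channel, "item")
    for name, text in fields.items():
        # Clark notation: {namespace-uri}local-name
        child = ET.SubElement(item, "{%s}%s" % (IRC, name))
        child.text = text
    return item


root = ET.fromstring(RSS)
add_item(root, {
    "title": "My new title",
    "poster": "My poster",
    "url": "http://My.url.com",
})
result = ET.tostring(root, encoding="unicode")
print(result)
```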

Alternatively, you can use XSLT, a declarative language designed for transforming and restructuring XML documents. Python's lxml module includes an XSLT 1.0 processor.
Declare the needed namespace on the stylesheet's root element and use its prefix in any new node. This might appear to be overkill for your current need, but it pays off when a more complex transformation with the appropriate namespaces is needed. The stylesheet below writes a new title for each item and copies the existing poster and URL.
XSLT (to be saved as .xsl)
<?xml version="1.0" ?>
<xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0"
xmlns:irc="SomeName">
<xsl:strip-space elements="*" />
<xsl:output method="xml" indent="yes"/>
<xsl:template match="rss">
<rss>
<channel>
<xsl:for-each select="//item">
<item>
<irc:title>My new title</irc:title>
<xsl:copy-of select="irc:poster"/>
<xsl:copy-of select="irc:url"/>
</item>
</xsl:for-each>
</channel>
</rss>
</xsl:template>
</xsl:transform>
Python
import os
import lxml.etree as ET
# SET CURRENT DIRECTORY
cd = os.path.dirname(os.path.abspath(__file__))
# LOAD IN XML AND XSL FILES
dom = ET.parse(os.path.join(cd, 'Original.xml'))
xslt = ET.parse(os.path.join(cd, 'XSLT_Script.xsl'))
# TRANSFORM
transform = ET.XSLT(xslt)
newdom = transform(dom)
# OUTPUT FINAL XML
tree_out = ET.tostring(newdom, encoding='UTF-8', pretty_print=True, xml_declaration=True)
with open(os.path.join(cd, 'output.xml'), 'wb') as xmlfile:
    xmlfile.write(tree_out)
Output
<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:irc="SomeName">
<channel>
<item>
<irc:title>My new title</irc:title>
<irc:poster>A poster</irc:poster>
<irc:url>An url</irc:url>
</item>
</channel>
</rss>

Parse XML SOAP response with Python

I want to parse this SOAP response and extract the text inside <LoginResult>:
<?xml version="1.0" encoding="utf-8"?>
<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema">
<soap:Body>
<LoginResponse xmlns="http://tempuri.org/wsSalesQuotation/Service1">
<LoginResult>45eeadF43423KKmP33</LoginResult>
</LoginResponse>
</soap:Body>
</soap:Envelope>
How can I do it using Python's XML libraries?

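import xml.etree.ElementTree as ET
tree = ET.parse('soap.xml')
# The default namespace on <LoginResponse> also applies to <LoginResult>, so the tag is qualified with its URI.
print(tree.find('.//{http://tempuri.org/wsSalesQuotation/Service1}LoginResult').text)
# prints 45eeadF43423KKmP33
Instead of printing the value, assign it to a variable and use it as needed.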
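A slightly more defensive variant, sketched under a few assumptions: the response is already in memory as bytes rather than in soap.xml, the prefix svc and the helper name login_result are illustrative, and a missing element should raise an error instead of failing on .text. It passes a prefix-to-URI mapping to find() rather than spelling the URI out in the path.

```python
import xml.etree.ElementTree as ET

SOAP_RESPONSE = b"""<?xml version="1.0" encoding="utf-8"?>
<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
<soap:Body>
<LoginResponse xmlns="http://tempuri.org/wsSalesQuotation/Service1">
<LoginResult>45eeadF43423KKmP33</LoginResult>
</LoginResponse>
</soap:Body>
</soap:Envelope>"""

# Prefixes used in the path below; svc is an arbitrary name for the
# service's default namespace.
NS = {
    "soap": "http://schemas.xmlsoap.org/soap/envelope/",
    "svc": "http://tempuri.org/wsSalesQuotation/Service1",
}


def login_result(xml_bytes):
    """Return the text of LoginResult, or raise if the element is absent."""
    root = ET.fromstring(xml_bytes)
    node = root.find("soap:Body/svc:LoginResponse/svc:LoginResult", NS)
    if node is None:
        raise ValueError("LoginResult element not found")
    return node.text


token = login_result(SOAP_RESPONSE)
print(token)
```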
