Parsing xml in python to get all child elements - python

I have parsed an XML file to get all its elements. I am getting the following output
[<Element '{urn:mitel:params:xml:ns:yang:vld}vld-list' at 0x0000000003059188>, <Element '{urn:mitel:params:xml:ns:yang:vld}vl-id' at 0x00000000030689F8>, <Element '{urn:mitel:params:xml:ns:yang:vld}descriptor-version' at 0x0000000003068A48>]
I need to select the value between } and ' only for each element of the list.
This is my code so far:
import xml.etree.ElementTree as ET
tree = ET.parse('UMR_VLD01_OAM_V6-Provider_eth0.xml')
root = tree.getroot()
# all items
print('\nAll item data:')
for elem in root:
    all_descendants = list(elem.iter())
    print(all_descendants)
How can I achieve this?

The text in {} is the namespace part of the qualified name (QName) of the XML element. AFAIK there is no method in ElementTree to return only the local name. So, you have to either
extract the local part of the name with string handling, as already proposed in a comment to your question,
use lxml.etree instead of xml.etree.ElementTree and apply xpath('local-name()') on each element,
or provide an XML source without namespace. You can strip the namespace with XSLT.
So, given this XML input:
<?xml version="1.0" encoding="UTF-8"?>
<foo xmlns="urn:mitel:params:xml:ns:yang:vld">
<bar>
<baz x="1"/>
<yet>
<more>
<nested/>
</more>
</yet>
</bar>
<bar/>
</foo>
You can print a list of the local names only with this variation of your program:
import xml.etree.ElementTree as ET
tree = ET.parse('UMR_VLD01_OAM_V6-Provider_eth0.xml')
root = tree.getroot()
# all items
print('\nAll item data:')
for elem in root:
    all_descendants = [e.tag.split('}', 1)[1] for e in elem.iter()]
    print(all_descendants)
Output:
['bar', 'baz', 'yet', 'more', 'nested']
['bar']
The version with lxml.etree and xpath('local-name()') looks like this:
import lxml.etree as ET
tree = ET.parse('UMR_VLD01_OAM_V6-Provider_eth0.xml')
root = tree.getroot()
# all items
print('\nAll item data:')
for elem in root:
    all_descendants = [e.xpath('local-name()') for e in elem.iter()]
    print(all_descendants)
The output is the same as with the string handling version.
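One caveat with the string-handling variant: e.tag.split('}', 1)[1] raises an IndexError for an element that has no namespace. A minimal sketch of a more forgiving helper, assuming you stay with xml.etree.ElementTree; the name local_name and the small sample document are made up for illustration:

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: two namespaced elements and one without a namespace.
SAMPLE = """<v:foo xmlns:v="urn:mitel:params:xml:ns:yang:vld">
  <v:bar/>
  <plain/>
</v:foo>"""


def local_name(tag):
    """Return the part of an ElementTree tag after the '{namespace}' prefix."""
    # rsplit leaves a tag without '}' unchanged instead of raising IndexError.
    return tag.rsplit('}', 1)[-1]


root = ET.fromstring(SAMPLE)
names = [local_name(e.tag) for e in root.iter()]
print(names)
```

The helper can replace the split expression in the list comprehension of the program above.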
For stripping the namespace completely from your input, you can apply this XSLT:
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" version="1.0" encoding="UTF-8" indent="yes"/>
  <xsl:template match="*">
    <xsl:element name="{local-name()}">
      <xsl:copy-of select="@*"/>
      <xsl:apply-templates/>
    </xsl:element>
  </xsl:template>
</xsl:stylesheet>
Then your original program outputs:
[<Element 'bar' at 0x04583B40>, <Element 'baz' at 0x04583B70>, <Element 'yet' at 0x04583BD0>, <Element 'more' at 0x04583C30>, <Element 'nested' at 0x04583C90>]
[<Element 'bar' at 0x04583CC0>]
Now the elements themselves do not bear a namespace. So, you don't have to strip it anymore.
You can apply the XSLT with xsltproc; then you don't need to change your program. Alternatively, you can apply the XSLT in Python, but this also requires you to use lxml.etree. So, the last variation of your program looks like this:
import lxml.etree as ET
tree = ET.parse('UMR_VLD01_OAM_V6-Provider_eth0.xml')
xslt = ET.parse('stripns.xslt')
transform = ET.XSLT(xslt)
tree = transform(tree)
root = tree.getroot()
# all items
print('\nAll item data:')
for elem in root:
    all_descendants = list(elem.iter())
    print(all_descendants)

Related

cleanup_namespaces does not remove namespaces from XML

Here is my xml string
xml = '''
<exta>
<signature>This </signature>
<begin_date>2019-07-12T09:41:48.187</begin_date>
<ver>4</ver>
<maiden_bc>1549</maiden_bc>
<exta_id>12345</exta_id>
<nps_max_price xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<exta_id>72723</exta_id>
<extended_datetime>2018-11-20T11:01:29.040</extended_datetime>
<event_ind>E</event_ind>
<maiden>12345</maiden>
<patient_id>123</patient_id>
<boss_id>123LHF</boss_id>
<template_name/>
<end_date>2019-01-01T00:00:00</end_date>
<UYI_AMN xsi:nil="true"/>
<dedt_bef_ATS xsi:nil="true"/>
<form>W</form>
</nps_max_price>
</exta>
'''
I was using cleanup_namespaces to remove namespace from the xml string
from lxml import etree
root = etree.fromstring(xml)
for elem in root.iter():
    elem.tag = etree.QName(elem).localname
etree.cleanup_namespaces(root)
print(etree.tostring(root).decode())
This gives me :
<exta>
<signature>This </signature>
<begin_date>2019-07-12T09:41:48.187</begin_date>
<ver>4</ver>
<maiden_bc>1549</maiden_bc>
<exta_id>12345</exta_id>
<nps_max_price xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<exta_id>72723</exta_id>
<extended_datetime>2018-11-20T11:01:29.040</extended_datetime>
<event_ind>E</event_ind>
<maiden>12345</maiden>
<patient_id>123</patient_id>
<boss_id>123LHF</boss_id>
<template_name/>
<end_date>2019-01-01T00:00:00</end_date>
<UYI_AMN xsi:nil="true"/>
<dedt_bef_ATS xsi:nil="true"/>
<form>W</form>
</nps_max_price>
</exta>
However, the expected output was XML without the namespace-related parts (xmlns:xsi, xsi:nil, xsd, etc.). How can I do this?
Expected Output:
<exta>
<signature>This </signature>
<begin_date>2019-07-12T09:41:48.187</begin_date>
<ver>4</ver>
<maiden_bc>1549</maiden_bc>
<exta_id>12345</exta_id>
<nps_max_price>
<exta_id>72723</exta_id>
<extended_datetime>2018-11-20T11:01:29.040</extended_datetime>
<event_ind>E</event_ind>
<maiden>12345</maiden>
<patient_id>123</patient_id>
<boss_id>123LHF</boss_id>
<template_name/>
<end_date>2019-01-01T00:00:00</end_date>
<UYI_AMN/>
<dedt_bef_ATS/>
<form>W</form>
</nps_max_price>
</exta>
The code in the question removes namespaces from elements. But in your XML string, none of the elements are bound to a namespace. That is why nothing changes.
However, there are two namespaced attributes (xsi:nil). If you simply want to delete those attributes (or any namespaced attribute), here is how you can do it:
for elem in root.iter():
    # Copy the attribute names first so that deleting does not disturb the loop.
    for attr in list(elem.attrib):
        if etree.QName(attr).namespace:
            del elem.attrib[attr]
etree.cleanup_namespaces(root)
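If lxml is not available, roughly the same result can be had with the standard library. This is only a sketch: it relies on ElementTree storing namespaced attribute names as '{uri}name' and on its serializer writing xmlns declarations only for namespaces that are still in use. The shortened document and the helper name drop_namespaced_attributes are made up for illustration:

```python
import xml.etree.ElementTree as ET

# Shortened, hypothetical version of the XML from the question.
XML = """<exta>
  <nps_max_price xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
    <template_name/>
    <UYI_AMN xsi:nil="true"/>
    <form>W</form>
  </nps_max_price>
</exta>"""


def drop_namespaced_attributes(root):
    """Delete every attribute whose name carries a '{namespace}' prefix."""
    for elem in root.iter():
        # Iterate over a copy of the keys because the dict is modified.
        for name in list(elem.attrib):
            if name.startswith('{'):
                del elem.attrib[name]


root = ET.fromstring(XML)
drop_namespaced_attributes(root)
cleaned = ET.tostring(root, encoding="unicode")
print(cleaned)
```

Note that ElementTree writes empty elements as <UYI_AMN /> (with a space), so the output is equivalent to the expected one but not byte-for-byte identical.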

Python Elementtree filtering by XPath

Imagine I have an XML document like this:
<root>
<elements>
<element> foo </element>
<element is="false"> foo </element>
<element is="false"> bli </element>
<element is="false"> bla </element>
</elements>
</root>
How can I do this:
import xml.etree.ElementTree as ET
root = ET.fromstring(XmlFromAbove)
res_a = root.findall(".//element[@is='false']")  # <- This gives me all elements with the specific attribute
res_b = root.findall(".//element[not@is='false']")  # <- This would be nice to give me all elements without that specific attribute (`<element> foo </element>` in this case)
Now, I know that res_b will not work, but I guess this is a common issue. Does anybody have an idea what the workaround is?
To point it out a little bit more (copied from the comments)
I could find the element containing "foo" for sure but what I want to know is if there is a way to find any element that is NOT containing the attribute is="false".
See below:
import xml.etree.ElementTree as ET
xml = '''<root>
<elements>
<element> foo </element>
<element is="false"> foo </element>
<element is="false"> bli </element>
<element is="false"> bla </element>
<element please="false"> no_is </element>
<element is="true"> with_true_is </element>
</elements>
</root>'''
root = ET.fromstring(xml)
no_is_lst = [e for e in root.findall('.//element') if 'is' not in e.attrib]
for e in no_is_lst:
    print(e.text)
Output:
foo
no_is
You can use lxml, whose XPath support includes not():
from lxml import etree
root = etree.fromstring(data)  # data holds the XML string from the question
res = root.xpath(".//element[not(@is)]")
print(res[0].text)  # foo
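The question asks for elements that do not carry is="false", which is slightly different from elements that have no is attribute at all. A sketch of both filters with the standard library, assuming a plain list comprehension is acceptable; the variable names and the shortened document are illustrative:

```python
import xml.etree.ElementTree as ET

XML = """<root>
  <elements>
    <element> foo </element>
    <element is="false"> bli </element>
    <element please="false"> no_is </element>
    <element is="true"> with_true_is </element>
  </elements>
</root>"""

root = ET.fromstring(XML)
elements = root.findall(".//element")

# Elements that have no "is" attribute at all.
without_is = [e.text.strip() for e in elements if e.get("is") is None]

# Elements whose "is" attribute is missing or has any value other than "false".
not_false = [e.text.strip() for e in elements if e.get("is") != "false"]

print(without_is)
print(not_false)
```

Element.get() returns None for a missing attribute, so the second filter keeps both the attribute-free elements and the one with is="true".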

Python add Tags to XML using lxml

I have the following Input XML:
<?xml version="1.0" encoding="utf-8"?>
<Scenario xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="Scenario.xsd">
<TestCase>test_startup_0029</TestCase>
<ShortDescription>Restart of the EVC with missing ODO5 board.</ShortDescription>
<Events>
<Event Num="1">Switch on the EVC</Event>
</Events>
<HW-configuration>
<ELBE5A>true</ELBE5A>
<ELBE5K>false</ELBE5K>
</HW-configuration>
<SystemFailure>true</SystemFailure>
</Scenario>
My program adds three tags to the XML, but they are formatted incorrectly.
The output XML looks like the following:
<Scenario xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="Scenario.xsd">
<TestCase>test_startup_0029</TestCase>
<ShortDescription>Restart of the EVC with missing ODO5 board.</ShortDescription>
<Events>
<Event Num="1">Switch on the EVC</Event>
</Events>
<HW-configuration>
<ELBE5A>true</ELBE5A>
<ELBE5K>false</ELBE5K>
</HW-configuration>
<SystemFailure>true</SystemFailure>
<Duration>12</Duration><EVC-SW-Version>08.02.0001.0027</EVC-SW-Version><STAC-Release>08.02.0001.0027</STAC-Release></Scenario>
That's my source code:
import os
from lxml import etree
class XmlManager:
    @staticmethod
    def write_xml(xml_path, duration, evc_sw_version):
        xml_path = os.path.abspath(xml_path)
        if os.path.isfile(xml_path) and xml_path.endswith(".xml"):
            # parse XML into etree
            root = etree.parse(xml_path).getroot()
            # add tags
            duration_tag = etree.SubElement(root, "Duration")
            duration_tag.text = duration
            sw_version_tag = etree.SubElement(root, "EVC-SW-Version")
            sw_version_tag.text = evc_sw_version
            stac_release = evc_sw_version
            stac_release_tag = etree.SubElement(root, "STAC-Release")
            stac_release_tag.text = stac_release
            # write changes to the XML-file
            tree = etree.ElementTree(root)
            tree.write(xml_path, pretty_print=False)
        else:
            XmlManager.logger.log("Invalid path to XML-file")
def main():
    xml = r".\Test_Input_Data_Base\blnmerf1_md1czjyc_REL_V_08.01.0001.000x\Test_startup_0029\Test_startup_0029.xml"
    XmlManager.write_xml(xml, "12", "08.02.0001.0027")
My question is how to add the new tags to the XML in the right format. I guess the changed XML can still be parsed as it is, but it is not nicely formatted. Any ideas? Thanks in advance.
To ensure nice pretty-printed output, you need to do two things:
Parse the input file using an XMLParser object with remove_blank_text=True.
Write the output using pretty_print=True
Example:
from lxml import etree
parser = etree.XMLParser(remove_blank_text=True)
tree = etree.parse("Test_startup_0029.xml", parser)
root = tree.getroot()
duration_tag = etree.SubElement(root, "Duration")
duration_tag.text = "12"
sw_version_tag = etree.SubElement(root, "EVC-SW-Version")
sw_version_tag.text = "08.02.0001.0027"
stac_release_tag = etree.SubElement(root, "STAC-Release")
stac_release_tag.text = "08.02.0001.0027"
tree.write("output.xml", pretty_print=True)
Contents of output.xml:
<Scenario xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="Scenario.xsd">
<TestCase>test_startup_0029</TestCase>
<ShortDescription>Restart of the EVC with missing ODO5 board.</ShortDescription>
<Events>
<Event Num="1">Switch on the EVC</Event>
</Events>
<HW-configuration>
<ELBE5A>true</ELBE5A>
<ELBE5K>false</ELBE5K>
</HW-configuration>
<SystemFailure>true</SystemFailure>
<Duration>12</Duration>
<EVC-SW-Version>08.02.0001.0027</EVC-SW-Version>
<STAC-Release>08.02.0001.0027</STAC-Release>
</Scenario>
See also http://lxml.de/FAQ.html#why-doesn-t-the-pretty-print-option-reformat-my-xml-output.
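For completeness, the standard library can produce similar output without lxml. A sketch, assuming Python 3.9 or later, where xml.etree.ElementTree.indent() is available; the shortened document is illustrative and not the full Scenario file:

```python
import xml.etree.ElementTree as ET

# Shortened, hypothetical stand-in for the Scenario document.
root = ET.fromstring("<Scenario><TestCase>test_startup_0029</TestCase></Scenario>")

# Append a new tag, as in the question.
duration_tag = ET.SubElement(root, "Duration")
duration_tag.text = "12"

# Rewrite the whitespace between elements in place (Python 3.9+).
ET.indent(root, space="  ")

pretty = ET.tostring(root, encoding="unicode")
print(pretty)
```

indent() only touches whitespace-only text between elements, so text such as the test case name is left as it is.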

lxml, get xml between elements

given this sample xml:
<xml>
<pb facs="id1" />
<aa></aa>
<aa></aa>
<lot-of-xml></lot-of-xml>
<pb facs="id2" />
<bb></bb>
<bb></bb>
<lot-of-xml></lot-of-xml>
</xml>
I need to parse it and get all the content between the pb elements, saving each part into a separate external file.
expected result:
$ cat id1
<aa></aa>
<aa></aa>
<lot-of-xml></lot-of-xml>
$ cat id2
<bb></bb>
<bb></bb>
<lot-of-xml></lot-of-xml>
What is the correct XPath axis to use?
from lxml import etree
xml = etree.parse("sample.xml")
for pb in xml.xpath('//pb'):
    filename = pb.xpath('@facs')[0]
    f = open(filename, 'w')
    content = **{{ HOW TO GET THE CONTENT HERE? }}**
    f.write(content)
    f.close()
Is there any XPath expression to get all descendants and stop when a new pb is reached?
Do you want to extract the elements between two pb tags? If so, note that they are not inside pb: because pb is self-closing, they are its siblings on the same level. Only if the pb tag were closed after the test element would test become a child of pb.
In other words if your xml is like this:
<xml>
<pb facs="id1">
<test></test>
</pb>
<test></test>
<pb facs="id2" />
<test></test>
<test></test>
</xml>
Then you can use
import xml.etree.ElementTree as ET
tree = ET.parse('test.xml')
root = tree.getroot()
for child in root:
    for subchild in child:
        print(subchild)
to print each subchild (test) that has pb as its parent.
If that is not the case (you just want to extract the attributes of the pb tag), then you can use either of the two methods shown below.
With Python's built-in ElementTree:
import xml.etree.ElementTree as ET
tree = ET.parse('sample.xml')
root = tree.getroot()
for child in root:
    if child.get('facs'):
        print(child.get('facs'))
With the lxml library you can parse it like this:
from lxml import etree
tree = etree.parse('test.xml')
root = tree.getroot()
for child in root:
    if child.get('facs'):
        print(child.get('facs'))
OK, I tested this code:
lists = []
for node in tree.findall('*'):
    if node.tag == 'pb':
        lists.append([])
    else:
        lists[-1].append(node)
Output:
>>> lists
[[<Element test at 2967fa8>, <Element test at 2a89030>, <Element lot-of-xml at 2a89080>], [<Element test at 2a89170>, <Element test at 2a891c0>, <Element lot-of-xml at 2a89210>]]
Input file (just in case):
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<xml>
<pb facs="id1" />
<test></test>
<test></test>
<lot-of-xml></lot-of-xml>
<pb facs="id2" />
<test></test>
<test></test>
<lot-of-xml></lot-of-xml>
</xml>
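To get from those grouped lists to the file contents the question asks for, each group can be serialized and keyed by the facs value of the pb that opens it. A sketch with the standard library, assuming pb and the content elements are all direct children of the root; split_on_pb is a made-up helper name, and the result is kept in a dict here instead of being written to files:

```python
import xml.etree.ElementTree as ET

SAMPLE = """<xml>
<pb facs="id1" />
<aa></aa>
<aa></aa>
<lot-of-xml></lot-of-xml>
<pb facs="id2" />
<bb></bb>
<bb></bb>
<lot-of-xml></lot-of-xml>
</xml>"""


def split_on_pb(root):
    """Map each pb's facs value to the serialized siblings that follow it."""
    groups = {}
    current = None
    for node in root:
        if node.tag == "pb":
            # A new pb starts a new group.
            current = groups.setdefault(node.get("facs"), [])
        elif current is not None:
            # tostring() includes the trailing whitespace (tail); strip it.
            current.append(ET.tostring(node, encoding="unicode").strip())
    return {name: "\n".join(parts) for name, parts in groups.items()}


chunks = split_on_pb(ET.fromstring(SAMPLE))
for name, content in chunks.items():
    print(name, content, sep="\n")
```

Writing each entry to disk is then a matter of opening a file named after the key and writing the value. ElementTree serializes empty elements as <aa />, so the text differs slightly from the input notation.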

Parsing wsdl (retrieve namespaces from the definitions)using an Element Tree

I am trying to parse a WSDL file using ElementTree. As part of this, I'd like to retrieve all the namespaces from a given WSDL definitions element.
For instance, in the snippet below, I am trying to retrieve all the namespaces in the definitions tag:
<?xml version="1.0"?>
<definitions name="DateService" targetNamespace="http://dev-b.handel-dev.local:8080/DateService.wsdl" xmlns:tns="http://dev-b.handel-dev.local:8080/DateService.wsdl"
xmlns="http://schemas.xmlsoap.org/wsdl/" xmlns:soap="http://schemas.xmlsoap.org/wsdl/soap/" xmlns:myType="DateType_NS" xmlns:xsd="http://www.w3.org/2001/XMLSchema"
xmlns:wsdl="http://schemas.xmlsoap.org/wsdl/">
My code looks like this
import xml.etree.ElementTree as ET
xml_file='<path_to_my_wsdl>'
tree = ET.parse(xml_file)
rootElement = tree.getroot()
print (rootElement.tag) #{http://schemas.xmlsoap.org/wsdl/}definitions
print(rootElement.attrib) #targetNamespace="http://dev-b..../DateService.wsdl"
As I understand it, in ElementTree the namespace URI is combined with the local name of the element. How can I retrieve all the namespace entries from the definitions element?
Appreciate your help on this
P.S: I am new (very!) to python
>>> import xml.etree.ElementTree as etree
>>> from StringIO import StringIO
>>>
>>> s = """<?xml version="1.0"?>
... <definitions
... name="DateService"
... targetNamespace="http://dev-b.handel-dev.local:8080/DateService.wsdl"
... xmlns:tns="http://dev-b.handel-dev.local:8080/DateService.wsdl"
... xmlns="http://schemas.xmlsoap.org/wsdl/"
... xmlns:soap="http://schemas.xmlsoap.org/wsdl/soap/"
... xmlns:myType="DateType_NS"
... xmlns:xsd="http://www.w3.org/2001/XMLSchema"
... xmlns:wsdl="http://schemas.xmlsoap.org/wsdl/">
... </definitions>"""
>>> file_ = StringIO(s)
>>> namespaces = []
>>> for event, elem in etree.iterparse(file_, events=('start-ns',)):
... print elem
...
(u'tns', 'http://dev-b.handel-dev.local:8080/DateService.wsdl')
('', 'http://schemas.xmlsoap.org/wsdl/')
(u'soap', 'http://schemas.xmlsoap.org/wsdl/soap/')
(u'myType', 'DateType_NS')
(u'xsd', 'http://www.w3.org/2001/XMLSchema')
(u'wsdl', 'http://schemas.xmlsoap.org/wsdl/')
Inspired by the ElementTree documentation
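The session above is Python 2 (print statement, StringIO module). A sketch of the same idea for Python 3, assuming the WSDL is available as bytes; collect_namespaces is a made-up helper name:

```python
import io
import xml.etree.ElementTree as ET

WSDL = b"""<?xml version="1.0"?>
<definitions
    name="DateService"
    targetNamespace="http://dev-b.handel-dev.local:8080/DateService.wsdl"
    xmlns:tns="http://dev-b.handel-dev.local:8080/DateService.wsdl"
    xmlns="http://schemas.xmlsoap.org/wsdl/"
    xmlns:soap="http://schemas.xmlsoap.org/wsdl/soap/"
    xmlns:myType="DateType_NS"
    xmlns:xsd="http://www.w3.org/2001/XMLSchema"
    xmlns:wsdl="http://schemas.xmlsoap.org/wsdl/">
</definitions>"""


def collect_namespaces(source):
    """Return (prefix, uri) tuples for every namespace declaration in source."""
    # With events=("start-ns",), iterparse yields a (prefix, uri) tuple
    # in place of an element; the default namespace has the prefix ''.
    return [ns for _event, ns in ET.iterparse(source, events=("start-ns",))]


namespaces = collect_namespaces(io.BytesIO(WSDL))
for prefix, uri in namespaces:
    print(repr(prefix), uri)
```

For a file on disk, a path can be passed to collect_namespaces() in place of the BytesIO object, since iterparse accepts either a filename or a file object.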
You can use lxml.
from lxml import etree
tree = etree.parse(file)
root = tree.getroot()
namespaces = root.nsmap
see https://stackoverflow.com/a/26807636/5375693
