Python ElementTree filtering by XPath

Imagine I have XML like this:
<root>
<elements>
<element> foo </element>
<element is="false"> foo </element>
<element is="false"> bli </element>
<element is="false"> bla </element>
</elements>
</root>
How can I do this:
import xml.etree.ElementTree as ET
root = ET.fromstring(XmlFromAbove)
res_a = root.findall(".//element[@is='false']")  # <- This gives me all elements with the specific attribute
res_b = root.findall(".//element[not@is='false']")  # <- It would be nice if this gave me all elements without that specific attribute (`<element> foo </element>` in this case)
Now, I know that res_b will not work, but I guess this is a common issue. Does anybody know a workaround?
To point it out a little bit more (copied from the comments)
I could find the element containing "foo" for sure but what I want to know is if there is a way to find any element that is NOT containing the attribute is="false".

see below
import xml.etree.ElementTree as ET
xml = '''<root>
<elements>
<element> foo </element>
<element is="false"> foo </element>
<element is="false"> bli </element>
<element is="false"> bla </element>
<element please="false"> no_is </element>
<element is="true"> with_true_is </element>
</elements>
</root>'''
root = ET.fromstring(xml)
no_is_lst = [e for e in root.findall('.//element') if 'is' not in e.attrib]
for e in no_is_lst:
    print(e.text)
output
foo
no_is
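The same idea can be wrapped in a small helper. This is only a sketch: `elements_without` is a hypothetical name, not part of ElementTree, and the optional `value` argument is an assumption about what "not containing is="false"" means, namely that elements with `is="true"` should also be returned.

```python
import xml.etree.ElementTree as ET

XML = """<root>
<elements>
<element> foo </element>
<element is="false"> bli </element>
<element please="false"> no_is </element>
<element is="true"> with_true_is </element>
</elements>
</root>"""

def elements_without(root, attr, value=None):
    """Return <element> nodes that lack `attr`.

    If `value` is given, return nodes whose `attr` is missing
    or set to something other than `value`.
    """
    found = root.findall('.//element')
    if value is None:
        return [e for e in found if attr not in e.attrib]
    # e.get() returns None for a missing attribute, so missing counts as "different"
    return [e for e in found if e.get(attr) != value]

root = ET.fromstring(XML)
no_is = [e.text.strip() for e in elements_without(root, 'is')]
not_false = [e.text.strip() for e in elements_without(root, 'is', 'false')]
print(no_is)      # ['foo', 'no_is']
print(not_false)  # ['foo', 'no_is', 'with_true_is']
```

The filtering happens in Python rather than in the path expression because ElementTree's XPath subset has no `not()` function.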

You can use lxml
from lxml import etree
root = etree.fromstring(data)  # data is the XML string from the question
res = root.xpath(".//element[not(@is)]")
print(res[0].text)  # foo

Related

cleanup_namespaces does not remove namespaces from XML

Here is my xml string
xml = '''
<exta>
<signature>This </signature>
<begin_date>2019-07-12T09:41:48.187</begin_date>
<ver>4</ver>
<maiden_bc>1549</maiden_bc>
<exta_id>12345</exta_id>
<nps_max_price xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<exta_id>72723</exta_id>
<extended_datetime>2018-11-20T11:01:29.040</extended_datetime>
<event_ind>E</event_ind>
<maiden>12345</maiden>
<patient_id>123</patient_id>
<boss_id>123LHF</boss_id>
<template_name/>
<end_date>2019-01-01T00:00:00</end_date>
<UYI_AMN xsi:nil="true"/>
<dedt_bef_ATS xsi:nil="true"/>
<form>W</form>
</nps_max_price>
</exta>
'''
I was using cleanup_namespaces to remove namespace from the xml string
from lxml import etree
root = etree.fromstring(xml)
for elem in root.iter():
    elem.tag = etree.QName(elem).localname
etree.cleanup_namespaces(root)
print(etree.tostring(root).decode())
This gives me :
<exta>
<signature>This </signature>
<begin_date>2019-07-12T09:41:48.187</begin_date>
<ver>4</ver>
<maiden_bc>1549</maiden_bc>
<exta_id>12345</exta_id>
<nps_max_price xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<exta_id>72723</exta_id>
<extended_datetime>2018-11-20T11:01:29.040</extended_datetime>
<event_ind>E</event_ind>
<maiden>12345</maiden>
<patient_id>123</patient_id>
<boss_id>123LHF</boss_id>
<template_name/>
<end_date>2019-01-01T00:00:00</end_date>
<UYI_AMN xsi:nil="true"/>
<dedt_bef_ATS xsi:nil="true"/>
<form>W</form>
</nps_max_price>
</exta>
However, I expected the output XML not to contain any namespace information (xmlns:xsi, xsi:nil, xsd, etc.). How can I do this?
Expected Output:
<exta>
<signature>This </signature>
<begin_date>2019-07-12T09:41:48.187</begin_date>
<ver>4</ver>
<maiden_bc>1549</maiden_bc>
<exta_id>12345</exta_id>
<nps_max_price>
<exta_id>72723</exta_id>
<extended_datetime>2018-11-20T11:01:29.040</extended_datetime>
<event_ind>E</event_ind>
<maiden>12345</maiden>
<patient_id>123</patient_id>
<boss_id>123LHF</boss_id>
<template_name/>
<end_date>2019-01-01T00:00:00</end_date>
<UYI_AMN/>
<dedt_bef_ATS/>
<form>W</form>
</nps_max_price>
</exta>
The code in the question removes namespaces from elements. But in your XML string, none of the elements are bound to a namespace. That is why nothing changes.
However, there are two namespaced attributes (xsi:nil). If you simply want to delete those attributes (or any namespaced attribute), here is how you can do it:
for elem in root.iter():
    # Iterate over a copy of the keys, since attributes are deleted inside the loop
    for attr in list(elem.attrib):
        if etree.QName(attr).namespace:
            del elem.attrib[attr]
etree.cleanup_namespaces(root)
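For comparison, roughly the same thing can be done with the standard library alone. This is a sketch, not the answer's method: it assumes `xml.etree.ElementTree`, where a namespaced attribute key has the form `{uri}name`, and `drop_namespaced_attributes` is a hypothetical helper name. The sample XML is a shortened version of the one in the question.

```python
import xml.etree.ElementTree as ET

XML = """<exta>
<ver>4</ver>
<nps_max_price xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<UYI_AMN xsi:nil="true"/>
<form>W</form>
</nps_max_price>
</exta>"""

def drop_namespaced_attributes(root):
    """Delete every attribute whose key is namespace-qualified ('{uri}name')."""
    for elem in root.iter():
        # Copy the keys first, since the dict is modified inside the loop
        for key in list(elem.attrib):
            if key.startswith('{'):
                del elem.attrib[key]

root = ET.fromstring(XML)
drop_namespaced_attributes(root)
cleaned = ET.tostring(root, encoding='unicode')
print(cleaned)
```

ElementTree only writes `xmlns` declarations for namespaces that are still used by a tag or attribute when serializing, so no separate cleanup call is needed here.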

Parsing xml in python to get all child elements

I have parsed an XML file to get all its elements. I am getting the following output
[<Element '{urn:mitel:params:xml:ns:yang:vld}vld-list' at 0x0000000003059188>, <Element '{urn:mitel:params:xml:ns:yang:vld}vl-id' at 0x00000000030689F8>, <Element '{urn:mitel:params:xml:ns:yang:vld}descriptor-version' at 0x0000000003068A48>]
I need to select the value between } and ' only for each element of the list.
This is my Code till now :
import xml.etree.ElementTree as ET
tree = ET.parse('UMR_VLD01_OAM_V6-Provider_eth0.xml')
root = tree.getroot()
# all items
print('\nAll item data:')
for elem in root:
    all_descendants = list(elem.iter())
    print(all_descendants)
How can I achieve this?
The text in {} is the namespace part of the qualified name (QName) of the XML element. AFAIK there is no method in ElementTree to return only the local name. So, you have to either
extract the local part of the name with string handling, as already proposed in a comment to your question,
use lxml.etree instead of xml.etree.ElementTree and apply xpath('local-name()') on each element,
or provide an XML source without namespace. You can strip the namespace with XSLT.
So, given this XML input:
<?xml version="1.0" encoding="UTF-8"?>
<foo xmlns="urn:mitel:params:xml:ns:yang:vld">
<bar>
<baz x="1"/>
<yet>
<more>
<nested/>
</more>
</yet>
</bar>
<bar/>
</foo>
You can print a list of the local names only with this variation of your program:
import xml.etree.ElementTree as ET
tree = ET.parse('UMR_VLD01_OAM_V6-Provider_eth0.xml')
root = tree.getroot()
# all items
print('\nAll item data:')
for elem in root:
    all_descendants = [e.tag.split('}', 1)[1] for e in elem.iter()]
    print(all_descendants)
Output:
['bar', 'baz', 'yet', 'more', 'nested']
['bar']
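One caveat with `split('}', 1)[1]`: it raises an IndexError for a tag that has no namespace part. A slightly more tolerant variant, sketched here with a hypothetical helper named `local_name`, uses `str.rpartition`, which returns the whole string as its last item when no `}` is present.

```python
import xml.etree.ElementTree as ET

def local_name(tag):
    """Return the part of an ElementTree tag after the '{namespace}' prefix.

    Tags without a namespace are returned unchanged.
    """
    return tag.rpartition('}')[2]

XML = """<foo xmlns="urn:mitel:params:xml:ns:yang:vld">
<bar><baz x="1"/></bar>
<bar/>
</foo>"""

root = ET.fromstring(XML)
names = [local_name(e.tag) for e in root.iter()]
print(names)                # ['foo', 'bar', 'baz', 'bar']
print(local_name('plain'))  # plain
```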
The version with lxml.etree and xpath('local-name()') looks like this:
import lxml.etree as ET
tree = ET.parse('UMR_VLD01_OAM_V6-Provider_eth0.xml')
root = tree.getroot()
# all items
print('\nAll item data:')
for elem in root:
    all_descendants = [e.xpath('local-name()') for e in elem.iter()]
    print(all_descendants)
The output is the same as with the string handling version.
For stripping the namespace completely from your input, you can apply this XSLT:
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" >
<xsl:output method="xml" version="1.0" encoding="UTF-8" indent="yes"/>
<xsl:template match="*">
<xsl:element name="{local-name()}">
<xsl:copy-of select="@*"/>
<xsl:apply-templates/>
</xsl:element>
</xsl:template>
</xsl:stylesheet>
Then your original program outputs:
[<Element 'bar' at 0x04583B40>, <Element 'baz' at 0x04583B70>, <Element 'yet' at 0x04583BD0>, <Element 'more' at 0x04583C30>, <Element 'nested' at 0x04583C90>]
[<Element 'bar' at 0x04583CC0>]
Now the elements themselves do not bear a namespace. So, you don't have to strip it anymore.
You can apply the XSLT with xsltproc; then you don't need to change your program. Alternatively, you can apply the XSLT in Python, but this also requires you to use lxml.etree. So, the last variation of your program looks like this:
import lxml.etree as ET
tree = ET.parse('UMR_VLD01_OAM_V6-Provider_eth0.xml')
xslt = ET.parse('stripns.xslt')
transform = ET.XSLT(xslt)
tree = transform(tree)
root = tree.getroot()
# all items
print('\nAll item data:')
for elem in root:
    all_descendants = list(elem.iter())
    print(all_descendants)
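If neither xsltproc nor lxml is available, a similar result can probably be reached by rewriting the tags in place with the standard library. This is a sketch under assumptions rather than an equivalent of the XSLT: `strip_namespaces` is a hypothetical helper, it leaves namespaced attributes untouched, and elements from different namespaces that share a local name become indistinguishable afterwards.

```python
import xml.etree.ElementTree as ET

def strip_namespaces(root):
    """Replace every '{namespace}local' tag in the tree with 'local', in place."""
    for elem in root.iter():
        # Comments and processing instructions have non-string tags; skip them
        if isinstance(elem.tag, str):
            elem.tag = elem.tag.rpartition('}')[2]
    return root

XML = """<foo xmlns="urn:mitel:params:xml:ns:yang:vld">
<bar>
<baz x="1"/>
<yet><more><nested/></more></yet>
</bar>
<bar/>
</foo>"""

root = strip_namespaces(ET.fromstring(XML))
tags = [[e.tag for e in child.iter()] for child in root]
for row in tags:
    print(row)
# ['bar', 'baz', 'yet', 'more', 'nested']
# ['bar']
```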

Removing empty xml nodes

I have an XML file that I'm trying to remove empty nodes from with Python. When I test whether the value is, say, 'shark', it works. But when I check for it being None, the empty node isn't removed.
for records in recordList:
    for fieldGroup in records:
        for field in fieldGroup:
            if field.text is None:
                fieldGroup.remove(field)
XPath is your friend here.
from lxml import etree
doc = etree.XML("""<root><a>1</a><b><c></c></b><d></d></root>""")
def remove_empty_elements(doc):
    for element in doc.xpath('//*[not(node())]'):
        element.getparent().remove(element)
Then:
>>> print(etree.tostring(doc, pretty_print=True).decode())
<root>
<a>1</a>
<b>
<c/>
</b>
<d/>
</root>
>>> remove_empty_elements(doc)
>>> print(etree.tostring(doc, pretty_print=True).decode())
<root>
<a>1</a>
<b/>
</root>
>>> remove_empty_elements(doc)
>>> print(etree.tostring(doc, pretty_print=True).decode())
<root>
<a>1</a>
</root>
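The loop in the question has a second problem besides the None check: it removes children from a list while iterating over that same list, which can skip elements. Below is a standard-library sketch that iterates over copies and repeats until nothing is left to remove. It is an assumption-laden variant, not the lxml answer above: `prune_once` and this `remove_empty_elements` are illustrative names, and, unlike `not(node())`, it treats whitespace-only text as empty.

```python
import xml.etree.ElementTree as ET

def prune_once(root):
    """Remove childless elements with no non-whitespace text; return how many."""
    removed = 0
    # list(...) snapshots the tree so removals don't disturb the iteration
    for parent in list(root.iter()):
        for child in list(parent):
            if len(child) == 0 and not (child.text or '').strip():
                parent.remove(child)
                removed += 1
    return removed

def remove_empty_elements(root):
    """Prune repeatedly, since removing a child can leave its parent empty."""
    while prune_once(root):
        pass

doc = ET.fromstring('<root><a>1</a><b><c></c></b><d></d></root>')
remove_empty_elements(doc)
result = ET.tostring(doc, encoding='unicode')
print(result)  # <root><a>1</a></root>
```

The root element itself is never removed, because it is only ever visited as a parent.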

lxml, get xml between elements

given this sample xml:
<xml>
<pb facs="id1" />
<aa></aa>
<aa></aa>
<lot-of-xml></lot-of-xml>
<pb facs="id2" />
<bb></bb>
<bb></bb>
<lot-of-xml></lot-of-xml>
</xml>
I need to parse it and get all the content between the pb elements, saving each chunk into a separate external file.
expected result:
$ cat id1
<aa></aa>
<aa></aa>
<lot-of-xml></lot-of-xml>
$ cat id2
<bb></bb>
<bb></bb>
<lot-of-xml></lot-of-xml>
What is the correct XPath axis to use?
from lxml import etree
xml = etree.parse("sample.xml")
for pb in xml.xpath('//pb'):
    filename = pb.xpath('@facs')[0]
    f = open(filename, 'w')
    content = **{{ HOW TO GET THE CONTENT HERE? }}**
    f.write(content)
    f.close()
Is there any XPath expression to get all descendants and stop when a new pb is reached?
Do you want to extract the tags between two pb elements? Strictly speaking, nothing is inside a pb here: because each pb is self-closing, the elements that follow it are its siblings at the same level, not its children. Only if the pb tag is closed after the test tag does test become a child of pb.
In other words if your xml is like this:
<xml>
<pb facs="id1">
<test></test>
</pb>
<test></test>
<pb facs="id2" />
<test></test>
<test></test>
</xml>
Then you can use
import xml.etree.ElementTree as ET
tree = ET.parse('test.xml')
root = tree.getroot()
for child in root:
    for subchild in child:
        print(subchild)
This prints each subchild (test) that has pb as its parent.
If that is not the case and you just want to extract the attributes of the pb tags, you can use either of the two methods shown below.
With Python's built-in ElementTree:
import xml.etree.ElementTree as ET
tree = ET.parse('sample.xml')
root = tree.getroot()
for child in root:
    if child.get('facs'):
        print(child.get('facs'))
With the lxml library you can parse it like this:
from lxml import etree
tree = etree.parse('test.xml')
root = tree.getroot()
for child in root:
    if child.get('facs'):
        print(child.get('facs'))
OK, I tested this code:
lists = []
for node in tree.findall('*'):
    if node.tag == 'pb':
        lists.append([])
    else:
        lists[-1].append(node)
Output:
>>> lists
[[<Element test at 2967fa8>, <Element test at 2a89030>, <Element lot-of-xml at 2a89080>], [<Element test at 2a89170>, <Element test at 2a891c0>, <Element lot-of-xml at 2a89210>]]
Input file (just in case):
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<xml>
<pb facs="id1" />
<test></test>
<test></test>
<lot-of-xml></lot-of-xml>
<pb facs="id2" />
<test></test>
<test></test>
<lot-of-xml></lot-of-xml>
</xml>
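Building on that grouping, here is a sketch that also serializes each group, keyed by the facs attribute of the preceding pb. It uses the standard library instead of lxml; `split_by_pb` is a hypothetical helper, and it assumes the pb elements are direct children of the root, as in the sample from the question.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<xml>
<pb facs="id1" />
<aa></aa>
<aa></aa>
<lot-of-xml></lot-of-xml>
<pb facs="id2" />
<bb></bb>
<bb></bb>
<lot-of-xml></lot-of-xml>
</xml>"""

def split_by_pb(root):
    """Map each pb's facs value to the serialized siblings that follow it."""
    chunks = {}
    current = None  # siblings before the first pb are ignored
    for node in root:
        if node.tag == 'pb':
            current = chunks.setdefault(node.get('facs'), [])
        elif current is not None:
            text = ET.tostring(node, encoding='unicode',
                               short_empty_elements=False)
            current.append(text.strip())  # drop the trailing newline (tail)
    return {name: '\n'.join(parts) for name, parts in chunks.items()}

chunks = split_by_pb(ET.fromstring(SAMPLE))
for name, content in chunks.items():
    print(name)
    print(content)
```

Writing each value to a file named after its key is then an ordinary loop over `chunks.items()` with `open(name, 'w')`.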

etree.strip_tags returning 'None' when trying to strip tag

Script:
print(entryDetails)
for i in range(len(entryDetails)):
    print(etree.tostring(entryDetails[i]).decode())
    print(etree.strip_tags(entryDetails[i], 'entry-details'))
Output:
[<Element entry-details at 0x234e0a8>, <Element entry-details at 0x234e878>]
<entry-details>2014-02-05 11:57:01</entry-details>
None
<entry-details>2014-02-05 12:11:05</entry-details>
None
How is etree.strip_tags failing to strip the entry-details tag? Is the dash in the tag name affecting it?
strip_tags() does not return anything. It strips off the tags in-place.
The documentation says: "Note that this will not delete the element (or ElementTree root element) that you passed even if it matches. It will only treat its descendants.".
Demo code:
from lxml import etree
XML = """
<root>
<entry-details>ABC</entry-details>
</root>"""
root = etree.fromstring(XML)
ed = root.xpath("//entry-details")[0]
print(ed)
print()
etree.strip_tags(ed, "entry-details")  # Has no effect
print(etree.tostring(root).decode())
print()
etree.strip_tags(root, "entry-details")
print(etree.tostring(root).decode())
Output:
<Element entry-details at 0x2123b98>
<root>
<entry-details>ABC</entry-details>
</root>
<root>
ABC
</root>
