Reading a spreadsheet like .xml with ElementTree - python

I am reading an XML file using ElementTree, but there is one Cell whose data I cannot read.
I adapted my file into a reproducible example, shown below:
from xml.etree import ElementTree
import io
xmlf = """<?xml version="1.0"?>
<?mso-application progid="Excel.Sheet"?>
<Workbook ss:ResourcesPackageName="" ss:ResourcesPackageVersion="" xmlns="urn:schemas-microsoft-com:office:spreadsheet"
xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet"
xmlns:html="http://www.w3.org/TR/REC-html40">
<Worksheet ss:Name="DigitalOutput" ss:IsDeviceType="true">
<Row ss:AutoFitHeight="0">
<Cell><Data ss:Type="String">A</Data><NamedCell ss:Name="_FilterDatabase"/></Cell>
<Cell><Data ss:Type="String">B</Data><NamedCell ss:Name="_FilterDatabase"/></Cell>
<Cell><Data ss:Type="String">C</Data><NamedCell ss:Name="_FilterDatabase"/></Cell>
<Cell ss:Index="7"><ss:Data ss:Type="String"
xmlns="http://www.w3.org/TR/REC-html40"><Font html:Color="#000000">CAN'T READ </Font><Font>THIS</Font></ss:Data><NamedCell
ss:Name="_FilterDatabase"/></Cell>
<Cell ss:Index="10"><Data ss:Type="String">D</Data><NamedCell
ss:Name="_FilterDatabase"/></Cell>
</Row>
</Worksheet>
</Workbook>"""
ss = "urn:schemas-microsoft-com:office:spreadsheet"
worksheet_label = '{%s}Worksheet' % ss
row_label = '{%s}Row' % ss
cell_label = '{%s}Cell' % ss
data_label = '{%s}Data' % ss
tree = ElementTree.parse(io.StringIO(xmlf))
root = tree.getroot()
for ws in root.findall(worksheet_label):
    for table in ws.findall(row_label):
        for c in table.findall(cell_label):
            data = c.find(data_label)
            print(data.text)
The output is:
A
B
C
None
D
So the fourth cell was not read. Can you help me fix this?

Question: Reading a spreadsheet like .xml with ElementTree
Documentation: The lxml.etree Tutorial - Namespaces
Define the namespaces used
ns = {'ss':"urn:schemas-microsoft-com:office:spreadsheet",
'html':"http://www.w3.org/TR/REC-html40"
}
Use the namespaces with find(...) and findall(...)
tree = ElementTree.parse(io.StringIO(xmlf))
root = tree.getroot()
for ws in root.findall('ss:Worksheet', ns):
    for table in ws.findall('ss:Row', ns):
        for c in table.findall('ss:Cell', ns):
            data = c.find('ss:Data', ns)
            if data.text is None:
                text = []
                data = data.findall('html:Font', ns)
                for element in data:
                    text.append(element.text)
                data_text = ''.join(text)
                print(data_text)
            else:
                print(data.text)
Output:
A
B
C
CAN'T READ THIS
D
Tested with Python: 3.5

The text content of the fourth cell belongs to the two Font subelements, which are bound to another namespace. Demo:
for e in root.iter():
    text = e.text.strip() if e.text else None
    if text:
        print(e, text)
Output:
<Element {urn:schemas-microsoft-com:office:spreadsheet}Data at 0x7f8013d01dc8> A
<Element {urn:schemas-microsoft-com:office:spreadsheet}Data at 0x7f8013d01dc8> B
<Element {urn:schemas-microsoft-com:office:spreadsheet}Data at 0x7f8013d01dc8> C
<Element {http://www.w3.org/TR/REC-html40}Font at 0x7f8013d01e08> CAN'T READ
<Element {http://www.w3.org/TR/REC-html40}Font at 0x7f8013d01e48> THIS
<Element {urn:schemas-microsoft-com:office:spreadsheet}Data at 0x7f8013d01e48> D
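Because the text sits in child elements, one way to read every cell uniformly is Element.itertext(), which yields the text of an element and of all its descendants in document order. The following is a minimal sketch on a shortened version of the workbook above; the helper name cell_texts and the trimmed XML are assumptions made for illustration.

```python
from xml.etree import ElementTree

# Shortened, hypothetical version of the workbook from the question.
XML = """<Workbook xmlns="urn:schemas-microsoft-com:office:spreadsheet"
 xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet"
 xmlns:html="http://www.w3.org/TR/REC-html40">
<Worksheet ss:Name="DigitalOutput">
<Row>
<Cell><Data ss:Type="String">A</Data></Cell>
<Cell ss:Index="7"><ss:Data ss:Type="String"
 xmlns="http://www.w3.org/TR/REC-html40"><Font html:Color="#000000">CAN'T READ </Font><Font>THIS</Font></ss:Data></Cell>
</Row>
</Worksheet>
</Workbook>"""

NS = {"ss": "urn:schemas-microsoft-com:office:spreadsheet"}


def cell_texts(root):
    """Return the full text of every ss:Data element, nested markup included."""
    texts = []
    for data in root.iterfind(".//ss:Cell/ss:Data", NS):
        # itertext() walks the element and its descendants, so the text
        # held by the html:Font children is collected as well.
        texts.append("".join(data.itertext()))
    return texts


root = ElementTree.fromstring(XML)
print(cell_texts(root))
```

With this approach no special case is needed for cells whose Data element has no direct text.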

Related

Parsing xml in python to get all child elements

I have parsed an XML file to get all its elements. I am getting the following output
[<Element '{urn:mitel:params:xml:ns:yang:vld}vld-list' at 0x0000000003059188>, <Element '{urn:mitel:params:xml:ns:yang:vld}vl-id' at 0x00000000030689F8>, <Element '{urn:mitel:params:xml:ns:yang:vld}descriptor-version' at 0x0000000003068A48>]
I need to select the value between } and ' only for each element of the list.
This is my code so far:
import xml.etree.ElementTree as ET
tree = ET.parse('UMR_VLD01_OAM_V6-Provider_eth0.xml')
root = tree.getroot()
# all items
print('\nAll item data:')
for elem in root:
    all_descendants = list(elem.iter())
    print(all_descendants)
How can I achieve this?
The text in {} is the namespace part of the qualified name (QName) of the XML element. AFAIK there is no method in ElementTree to return only the local name. So, you have to either
extract the local part of the name with string handling, as already proposed in a comment to your question,
use lxml.etree instead of xml.etree.ElementTree and apply xpath('local-name()') on each element,
or provide an XML source without namespace. You can strip the namespace with XSLT.
So, given this XML input:
<?xml version="1.0" encoding="UTF-8"?>
<foo xmlns="urn:mitel:params:xml:ns:yang:vld">
<bar>
<baz x="1"/>
<yet>
<more>
<nested/>
</more>
</yet>
</bar>
<bar/>
</foo>
You can print a list of the local names only with this variation of your program:
import xml.etree.ElementTree as ET
tree = ET.parse('UMR_VLD01_OAM_V6-Provider_eth0.xml')
root = tree.getroot()
# all items
print('\nAll item data:')
for elem in root:
    all_descendants = [e.tag.split('}', 1)[1] for e in elem.iter()]
    print(all_descendants)
Output:
['bar', 'baz', 'yet', 'more', 'nested']
['bar']
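Note that e.tag.split('}', 1)[1] raises an IndexError for an element that has no namespace. A slightly more defensive variant, sketched here with a hypothetical helper local_name and an inline sample document instead of the original file, uses str.rpartition:

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Drop a leading '{namespace}' part, if present, from a tag name."""
    # rpartition returns the whole string as the last item when '}' is absent.
    return tag.rpartition('}')[2]


# Hypothetical sample document in the namespace from the question.
SAMPLE = """<foo xmlns="urn:mitel:params:xml:ns:yang:vld">
<bar><baz x="1"/></bar>
<bar/>
</foo>"""

root = ET.fromstring(SAMPLE)
names = [[local_name(e.tag) for e in elem.iter()] for elem in root]
print(names)
```

The helper returns tags without a namespace unchanged, so it can be applied to mixed documents.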
The version with lxml.etree and xpath('local-name()') looks like this:
import lxml.etree as ET
tree = ET.parse('UMR_VLD01_OAM_V6-Provider_eth0.xml')
root = tree.getroot()
# all items
print('\nAll item data:')
for elem in root:
    all_descendants = [e.xpath('local-name()') for e in elem.iter()]
    print(all_descendants)
The output is the same as with the string handling version.
For stripping the namespace completely from your input, you can apply this XSLT:
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" >
<xsl:output method="xml" version="1.0" encoding="UTF-8" indent="yes"/>
<xsl:template match="*">
<xsl:element name="{local-name()}">
<xsl:copy-of select="@*"/>
<xsl:apply-templates/>
</xsl:element>
</xsl:template>
</xsl:stylesheet>
Then your original program outputs:
[<Element 'bar' at 0x04583B40>, <Element 'baz' at 0x04583B70>, <Element 'yet' at 0x04583BD0>, <Element 'more' at 0x04583C30>, <Element 'nested' at 0x04583C90>]
[<Element 'bar' at 0x04583CC0>]
Now the elements themselves do not bear a namespace. So, you don't have to strip it anymore.
You can apply the XSLT with xsltproc, in which case you don't need to change your program. Alternatively, you can apply the XSLT in Python, but this also requires you to use lxml.etree. So, the last variation of your program looks like this:
import lxml.etree as ET
tree = ET.parse('UMR_VLD01_OAM_V6-Provider_eth0.xml')
xslt = ET.parse('stripns.xslt')
transform = ET.XSLT(xslt)
tree = transform(tree)
root = tree.getroot()
# all items
print('\nAll item data:')
for elem in root:
    all_descendants = list(elem.iter())
    print(all_descendants)
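If neither lxml nor xsltproc is available, a similar effect can be approximated with the standard library by rewriting the tag of every element after parsing. This is a different technique from the XSLT above and not an equivalent of it: it only renames element tags in memory and leaves attribute names alone. The function name strip_namespaces and the inline sample document are assumptions for illustration.

```python
import xml.etree.ElementTree as ET


def strip_namespaces(root):
    """Remove the '{namespace}' prefix from every element tag, in place."""
    for elem in root.iter():
        # Comments and processing instructions have non-string tags; skip them.
        if isinstance(elem.tag, str) and '}' in elem.tag:
            elem.tag = elem.tag.split('}', 1)[1]
    return root


# Hypothetical sample mirroring the XML input shown earlier.
SAMPLE = """<foo xmlns="urn:mitel:params:xml:ns:yang:vld">
<bar><baz x="1"/><yet><more><nested/></more></yet></bar>
<bar/>
</foo>"""

root = strip_namespaces(ET.fromstring(SAMPLE))
for elem in root:
    print([e.tag for e in elem.iter()])
```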

Python add Tags to XML using lxml

I have the following Input XML:
<?xml version="1.0" encoding="utf-8"?>
<Scenario xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="Scenario.xsd">
<TestCase>test_startup_0029</TestCase>
<ShortDescription>Restart of the EVC with missing ODO5 board.</ShortDescription>
<Events>
<Event Num="1">Switch on the EVC</Event>
</Events>
<HW-configuration>
<ELBE5A>true</ELBE5A>
<ELBE5K>false</ELBE5K>
</HW-configuration>
<SystemFailure>true</SystemFailure>
</Scenario>
My program adds three tags to the XML, but they are badly formatted.
The Output XML looks like the following:
<Scenario xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="Scenario.xsd">
<TestCase>test_startup_0029</TestCase>
<ShortDescription>Restart of the EVC with missing ODO5 board.</ShortDescription>
<Events>
<Event Num="1">Switch on the EVC</Event>
</Events>
<HW-configuration>
<ELBE5A>true</ELBE5A>
<ELBE5K>false</ELBE5K>
</HW-configuration>
<SystemFailure>true</SystemFailure>
<Duration>12</Duration><EVC-SW-Version>08.02.0001.0027</EVC-SW-Version><STAC-Release>08.02.0001.0027</STAC-Release></Scenario>
This is my source code:
import os
from lxml import etree
class XmlManager:
    @staticmethod
    def write_xml(xml_path, duration, evc_sw_version):
        xml_path = os.path.abspath(xml_path)
        if os.path.isfile(xml_path) and xml_path.endswith(".xml"):
            # parse XML into etree
            root = etree.parse(xml_path).getroot()
            # add tags
            duration_tag = etree.SubElement(root, "Duration")
            duration_tag.text = duration
            sw_version_tag = etree.SubElement(root, "EVC-SW-Version")
            sw_version_tag.text = evc_sw_version
            stac_release = evc_sw_version
            stac_release_tag = etree.SubElement(root, "STAC-Release")
            stac_release_tag.text = stac_release
            # write changes to the XML-file
            tree = etree.ElementTree(root)
            tree.write(xml_path, pretty_print=False)
        else:
            XmlManager.logger.log("Invalid path to XML-file")
def main():
    xml = r".\Test_Input_Data_Base\blnmerf1_md1czjyc_REL_V_08.01.0001.000x\Test_startup_0029\Test_startup_0029.xml"
    XmlManager.write_xml(xml, "12", "08.02.0001.0027")
My question is how to add the new tags to the XML in the right format. I guess the changed XML can still be parsed this way, but it is not nicely formatted. Any ideas? Thanks in advance.
To ensure nice pretty-printed output, you need to do two things:
Parse the input file using an XMLParser object with remove_blank_text=True.
Write the output using pretty_print=True
Example:
from lxml import etree
parser = etree.XMLParser(remove_blank_text=True)
tree = etree.parse("Test_startup_0029.xml", parser)
root = tree.getroot()
duration_tag = etree.SubElement(root, "Duration")
duration_tag.text = "12"
sw_version_tag = etree.SubElement(root, "EVC-SW-Version")
sw_version_tag.text = "08.02.0001.0027"
stac_release_tag = etree.SubElement(root, "STAC-Release")
stac_release_tag.text = "08.02.0001.0027"
tree.write("output.xml", pretty_print=True)
Contents of output.xml:
<Scenario xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="Scenario.xsd">
<TestCase>test_startup_0029</TestCase>
<ShortDescription>Restart of the EVC with missing ODO5 board.</ShortDescription>
<Events>
<Event Num="1">Switch on the EVC</Event>
</Events>
<HW-configuration>
<ELBE5A>true</ELBE5A>
<ELBE5K>false</ELBE5K>
</HW-configuration>
<SystemFailure>true</SystemFailure>
<Duration>12</Duration>
<EVC-SW-Version>08.02.0001.0027</EVC-SW-Version>
<STAC-Release>08.02.0001.0027</STAC-Release>
</Scenario>
See also http://lxml.de/FAQ.html#why-doesn-t-the-pretty-print-option-reformat-my-xml-output.
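For readers limited to the standard library, xml.etree.ElementTree.indent() (available from Python 3.9) offers a comparable way to re-indent a tree before writing it. The sketch below uses an abbreviated, hypothetical stand-in for the scenario file and is not part of the lxml-based answer above.

```python
import xml.etree.ElementTree as ET

# Abbreviated stand-in for the scenario file from the question.
root = ET.fromstring(
    "<Scenario><TestCase>test_startup_0029</TestCase>"
    "<SystemFailure>true</SystemFailure></Scenario>"
)

# Add one of the new tags, as in the question.
duration_tag = ET.SubElement(root, "Duration")
duration_tag.text = "12"

# Recompute the whitespace between elements (requires Python 3.9+).
ET.indent(root, space="  ")
pretty = ET.tostring(root, encoding="unicode")
print(pretty)
```

Because indent() rewrites the whitespace of the whole tree, existing and newly added elements end up formatted consistently.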

prettify adding extra lines in xml

I'm using a prettify function to make my XML file readable. I am adding some new info to an existing XML file, but when I save it to a file I get extra blank lines between the lines. Is there a way of removing these lines? Below is the code I'm using:
import xml.etree.ElementTree as xml
import xml.dom.minidom as minidom
from lxml import etree
def prettify(elem):
    rough_string = xml.tostring(elem, 'utf-8')
    reparsed = minidom.parseString(rough_string)
    return reparsed.toprettyxml(indent="\t")
cid = "[123,123,123,123,123]"
doc = xml.parse('test.xml')
root = doc.getroot()
root.getchildren().index(root.find('card'))
e = xml.Element('card')
e.set('id', cid)
n = xml.SubElement(e, "name")
n.text = "FOLDER"
r = xml.SubElement(e, "red")
r.text = "FILE.AVI"
g = xml.SubElement(e, "green")
g.text = "FILE.AVI"
b = xml.SubElement(e, "blue")
b.text = "FILE.AVI"
root.insert(0, e)
doc2 = prettify(root)
with open("testnew.xml", "w") as f:
    f.write(doc2)
Below is what I get in the file:
<data>
<card id="[123,123,123,123,123]">
<name>FOLDER</name>
<red>FILE.AVI</red>
<green>FILE.AVI</green>
<blue>FILE.AVI</blue>
</card>
<card id="[000,000,000,000,000]">
<name>Colours</name>
<red>/media/usb/cow.avi</red>
<green>/media/usb/pig.avi</green>
<blue>/media/usb/cat.avi</blue>
</card>
</data>
The input file "test.xml" looks like this:
<data>
<card id="[000,000,000,000,000]">
<name>Colours</name>
<red>/media/usb/cow.avi</red>
<green>/media/usb/pig.avi</green>
<blue>/media/usb/cat.avi</blue>
</card>
</data>
The newly added content is printed fine. The extra lines come from the whitespace already present around the existing elements, so stripping that existing "prettification" solves the issue.
Add
for elem in root.iter('*'):
    if elem == e:
        print("Added XML node does not need to be stripped")
        continue
    if elem.text is not None:
        elem.text = elem.text.strip()
    if elem.tail is not None:
        elem.tail = elem.tail.strip()
before calling
doc2 = prettify(root)
Related answer: Python how to strip white-spaces from xml text nodes
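Put together, the idea can be sketched as a self-contained script. This is a variation on the loop above rather than the exact same code: it clears only text and tail values that consist entirely of whitespace, so real text content is left untouched. The helper name drop_blank_nodes and the inline sample data are assumptions for illustration.

```python
import xml.dom.minidom as minidom
import xml.etree.ElementTree as ET


def drop_blank_nodes(root):
    """Clear text/tail values that consist only of whitespace."""
    for elem in root.iter():
        if elem.text is not None and not elem.text.strip():
            elem.text = None
        if elem.tail is not None and not elem.tail.strip():
            elem.tail = None


def prettify(root):
    """Serialize root with tab indentation and no extra blank lines."""
    drop_blank_nodes(root)
    rough_string = ET.tostring(root, "utf-8")
    return minidom.parseString(rough_string).toprettyxml(indent="\t")


# Hypothetical input that already contains indentation whitespace.
root = ET.fromstring(
    '<data>\n  <card id="[000]">\n    <name>Colours</name>\n  </card>\n</data>'
)
new_card = ET.Element("card", {"id": "[123]"})
ET.SubElement(new_card, "name").text = "FOLDER"
root.insert(0, new_card)

pretty = prettify(root)
print(pretty)
```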

LXML add an element into root

I'm trying to take two elements from one file (file1.xml) and write them onto the end of another file (file2.xml). I am able to get them to print out, but I am stuck trying to write them onto file2.xml. Help!
filename = "file1.xml"
appendtoxml = "file2.xml"
output_file = appendtoxml.replace('.xml', '') + "_editedbyed.xml"
parser = etree.XMLParser(remove_blank_text=True)
tree = etree.parse(filename, parser)
etree.tostring(tree)
root = tree.getroot()
a = root.findall(".//Device")
b = root.findall(".//Speaker")
for r in a:
    print(etree.tostring(r))
for e in b:
    print(etree.tostring(e))
NewSub = etree.SubElement (root, "Audio(just writes audio..")
print(NewSub)
I want the results of a, b to be added onto the end of outputfile.xml in the root.
Parse both the input file and the file you wish to append to.
Use root.append(elt) to append Element, elt, to root.
Then use tree.write to write the new tree to a file (e.g. appendtoxml):
Note: The links above point to documentation for xml.etree from the standard
library. Since lxml's API tries to be compatible with the standard library's
xml.etree, the standard library documentation applies to lxml as well (at
least for these methods). See http://lxml.de/api.html for information on where
the APIs differ.
import lxml.etree as ET
filename = "file1.xml"
appendtoxml = "file2.xml"
output_file = appendtoxml.replace('.xml', '') + "_editedbyed.xml"
parser = ET.XMLParser(remove_blank_text=True)
tree = ET.parse(filename, parser)
root = tree.getroot()
out_tree = ET.parse(appendtoxml, parser)
out_root = out_tree.getroot()
for path in [".//Device", ".//Speaker"]:
    for elt in root.findall(path):
        out_root.append(elt)
out_tree.write(output_file, pretty_print=True)
If file1.xml contains
<?xml version="1.0"?>
<root>
<Speaker>boozhoo</Speaker>
<Device>waaboo</Device>
<Speaker>anin</Speaker>
<Device>gigiwishimowin</Device>
</root>
and file2.xml contains
<?xml version="1.0"?>
<root>
<Speaker>jubal</Speaker>
<Device>crane</Device>
</root>
then file2_editedbyed.xml will contain
<root>
<Speaker>jubal</Speaker>
<Device>crane</Device>
<Device>waaboo</Device>
<Device>gigiwishimowin</Device>
<Speaker>boozhoo</Speaker>
<Speaker>anin</Speaker>
</root>
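One detail worth knowing: in lxml an element can have only one parent, so out_root.append(elt) moves the element out of the tree parsed from file1.xml. If the source tree is still needed afterwards, appending a copy is a common approach. The sketch below uses the standard library's xml.etree.ElementTree with inline strings in place of the two files (an assumption made for self-containment); copy.deepcopy can be applied to lxml elements in the same way.

```python
import copy
import xml.etree.ElementTree as ET

# Inline stand-ins for file1.xml and file2.xml.
src_root = ET.fromstring(
    "<root><Speaker>boozhoo</Speaker><Device>waaboo</Device></root>"
)
out_root = ET.fromstring("<root><Speaker>jubal</Speaker></root>")

for path in (".//Device", ".//Speaker"):
    for elt in src_root.findall(path):
        # Append an independent copy so the source tree stays intact.
        out_root.append(copy.deepcopy(elt))

result = [(child.tag, child.text) for child in out_root]
print(result)
```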

Entity references and lxml

Here's the code I have:
from cStringIO import StringIO
from lxml import etree
xml = StringIO('''<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE root [
<!ENTITY test "This is a test">
]>
<root>
<sub>&test;</sub>
</root>''')
d1 = etree.parse(xml)
print '%r' % d1.find('/sub').text
parser = etree.XMLParser(resolve_entities=False)
d2 = etree.parse(xml, parser=parser)
print '%r' % d2.find('/sub').text
Here's the output:
'This is a test'
None
How do I get lxml to give me '&test;', i.e., the raw entity reference?
The "unresolved" Entity is left as child node of the element node sub
>>> print d2.find('/sub')[0]
&test;
>>> d2.find('/sub').getchildren()
[&test;]
