I'm fairly new to XML and Python. Below is a cut-down version of a large XML file that I'm trying to load into Python and eventually write to a SQL Server database.
<?xml version="1.0" encoding="utf-8"?>
<MyOrgRefData:OrgRefData xmlns:MyOrgRefData="http://refdata.org/org/v2-0-0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://refdata.org/org/v2-0-0/MyOrgRefData.xsd">
<Manifest>
<Version value="2-0-0" />
<PublicationType value="Full" />
<PublicationSource value="TEST123" />
<PublicationDate value="2022-05-23" />
<PublicationSeqNum value="1659" />
<FileCreationDateTime value="2022-05-23T22:14:47" />
<RecordCount value="287654" />
<ContentDescription value="FullFile_20220523" />
<PrimaryRoleScope>
<PrimaryRole id="123" displayName="Free beer for me" />
<PrimaryRole id="456" displayName="Free air for you" />
</PrimaryRoleScope>
</Manifest>
<CodeSystems>
<CodeSystem name="OrganisationRecordClass" oid="1.2.3.4.5">
<concept id="RC2" code="2" displayName="World1" />
<concept id="RC1" code="1" displayName="World2" />
</CodeSystem>
<CodeSystem name="OrganisationRole" oid="5.4.7.8">
<concept id="B1ng0" code="179" displayName="BoomBastic" />
<concept id="R2D2a" code="180" displayName="Fantastic" />
</CodeSystem>
</CodeSystems>
</MyOrgRefData:OrgRefData>
I've tried lxml, pandas.read_xml and xml.etree, and I can't work out how to get what I want.
Ideally I'd like to pull Manifest into a dataframe ready to send to SQL (DataFrame.to_sql()). I would do the same with CodeSystems, but separately. (There are other sections, but I cut them to shorten the example.)
For example, using pandas to read the file, I only get a column containing the values. I would like either to have the tag (Version, PublicationType, PublicationSource etc.) in a column beside the value, or to have the tags as column headers with the values pivoted across a single row.
import pandas as pd

dataFolder = '/Some/directory/'  # trailing slash so the file name concatenates correctly
df_bulk = pd.read_xml(
    dataFolder + 'Data_Full_20220523.xml',
    xpath='//Manifest/*',
    attrs_only=True,
)
df_bulk.head()
This is the output I get:
inx  value
0    2-0-0
1    Full
2    TEST123
3    2022-05-23
4    1659
5    2022-05-23T22:14:47
6    287654
7    FullFile_20220523
Ideally I would like:
inx                   value
Version               2-0-0
PublicationType       Full
PublicationSource     TEST123
PublicationDate       2022-05-23
PublicationSeqNum     1659
FileCreationDateTime  2022-05-23T22:14:47
RecordCount           287654
ContentDescription    FullFile_20220523
The eagle-eyed among you will notice I've left out PrimaryRoleScope. Ideally I would like to treat it separately, in its own dataframe, but I am unsure how to exclude it when pulling in the rest of the Manifest section.
Many thanks if you've read this far, and even more thanks for any help.
One possibility is using the stylesheet parameter to transform the XML data internally with XSLT before processing it.
So your code could look like this:
import pandas as pd

dataFolder = '/Some/directory/'  # trailing slash so the file name concatenates correctly
df_bulk = pd.read_xml(
    dataFolder + 'Data_Full_20220523.xml',
    stylesheet='transform.xslt',
    xpath='/Root/Item',
    attrs_only=True,
)
print(df_bulk.head(10))
The stylesheet (transform.xslt) passed to read_xml could look like this (lxml is required):
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" omit-xml-declaration="yes"/>

  <xsl:template match="/">
    <Root><xsl:apply-templates /></Root>
  </xsl:template>

  <xsl:template match="//Manifest/*[not(self::PrimaryRoleScope)]">
    <Item name="{name()}" value="{@value}" />
  </xsl:template>
</xsl:stylesheet>
In this example a new XML like the following is created. It is intermediate XML and not shown, but the xpath= parameter above has to be set accordingly.
<Root>
<Item name="Version" value="2-0-0"/>
<Item name="PublicationType" value="Full"/>
<Item name="PublicationSource" value="TEST123"/>
<Item name="PublicationDate" value="2022-05-23"/>
<Item name="PublicationSeqNum" value="1659"/>
<Item name="FileCreationDateTime" value="2022-05-23T22:14:47"/>
<Item name="RecordCount" value="287654"/>
<Item name="ContentDescription" value="FullFile_20220523"/>
</Root>
And the final output is
name value
0 Version 2-0-0
1 PublicationType Full
2 PublicationSource TEST123
3 PublicationDate 2022-05-23
4 PublicationSeqNum 1659
5 FileCreationDateTime 2022-05-23T22:14:47
6 RecordCount 287654
7 ContentDescription FullFile_20220523
The above approach uses only attributes, but you could also create an element structure with the XSLT if you prefer that. In this case change one template to
<xsl:template match="//Manifest/*[not(self::PrimaryRoleScope)]">
  <Item>
    <name><xsl:value-of select="name()" /></name>
    <value><xsl:value-of select="@value" /></value>
  </Item>
</xsl:template>
and your python code to
dataFolder = '/Some/directory/'  # trailing slash so the file name concatenates correctly
df_bulk = pd.read_xml(
    dataFolder + 'Data_Full_20220523.xml',
    stylesheet='transform.xslt',
    xpath='/Root/Item',
)
print(df_bulk.head(10))
The output is the same.
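If XSLT feels like too much machinery, the same name/value rows can also be built with the standard library alone. The following is only a minimal sketch, assuming a trimmed copy of the sample file; the names SAMPLE, manifest_rows and role_rows are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample document from the question (assumption: same layout).
SAMPLE = """\
<MyOrgRefData:OrgRefData xmlns:MyOrgRefData="http://refdata.org/org/v2-0-0">
  <Manifest>
    <Version value="2-0-0" />
    <PublicationType value="Full" />
    <RecordCount value="287654" />
    <PrimaryRoleScope>
      <PrimaryRole id="123" displayName="Free beer for me" />
      <PrimaryRole id="456" displayName="Free air for you" />
    </PrimaryRoleScope>
  </Manifest>
</MyOrgRefData:OrgRefData>
"""

root = ET.fromstring(SAMPLE)
# Manifest carries no namespace prefix, so a plain path finds it.
manifest = root.find("Manifest")

# One row per simple child of Manifest, skipping the nested PrimaryRoleScope.
manifest_rows = [
    {"name": child.tag, "value": child.get("value")}
    for child in manifest
    if child.tag != "PrimaryRoleScope"
]

# PrimaryRoleScope is collected separately, one row per PrimaryRole.
role_rows = [
    dict(role.attrib)
    for role in manifest.iterfind("PrimaryRoleScope/PrimaryRole")
]

print(manifest_rows)
print(role_rows)
```

Either list can then be handed to pd.DataFrame(...), and pd.DataFrame(manifest_rows).set_index("name").T should give the pivoted single-row layout if that is preferred.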
We have a requirement to get the data from a SOAP XML Response.
Below is the associated XML file
<?xml version="1.0" encoding="utf-8"?>
<soap:Envelope xmlns:soap="http://www.w3.org/2003/05/soap-envelope" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema">
<soap:Body>
<GetResultResponse xmlns="http://www.relatics.com/">
<GetResultResult>
<Report ReportName="RFC" GeneratedOn="2022-12-22" EnvironmentID="XXXX" EnvironmentName="Systematic Assurance – an XXX Solution" EnvironmentURL="https://XXXX.relaticsonline.com/" WorkspaceID="XXXXX" WorkspaceName="P - ADL Program Management - XXX" TargetDevice="Pc" ReportField="" xmlns="">
<Change_module>
<applied_individual_change_request Change_Request="TestKZIreport" RFC_GUID="XXXXX">
<code RFC_Code="VtW-0101" />
<progress RFC_Progress="agreed" />
<applied_individual_project_organisation Organisation="XXXX" />
<applied__individual_discipline Discipline="Highways" />
<specification Specification="Context of Documents">
<code Specification_Code="1.1.1a" />
</specification>
<applied_individual_workpackage Workpackage="Enabling work">
<code Workpackage_Code="WP-01" />
</applied_individual_workpackage>
<physical_object Physical_Object="Train Station">
<code Physical_Object_Code="TFO-0001" />
</physical_object>
<person approver="XXX" />
<applied_individual_change_consequence_qualification Consequence_Value="10 days">
<applied_conceptual_change_consequence_aspect Consequence_Aspect="Schedule" />
</applied_individual_change_consequence_qualification>
<document Document_Name="WI 300 Design.pdf">
<code Document_Code="DOC-0002" />
</document>
<answer_status BR_Status="no" />
<applied_individual_business_rule Business_Rule="Change Review compliance">
<code BR_Code="BR-006" />
</applied_individual_business_rule>
<applied_individual_change_consequence_qualification Consequence_Value="XXX">
<applied_conceptual_change_consequence_aspect Consequence_Aspect="Finance" />
</applied_individual_change_consequence_qualification>
</applied_individual_change_request>
</Change_module>
</Report>
</GetResultResult>
</GetResultResponse>
</soap:Body>
</soap:Envelope>
I need all the attribute values under Change_module. I tried some suggestions from Stack Overflow, but they didn't work.
I have never worked with XML documents before; here is the sample code I tried, adapted from Stack Overflow.
import xml.etree.ElementTree as ET
import pandas as pd
import numpy as np
tree = ET.parse("Relatics_XML.xml")
root = tree.getroot()
print(root.tag)
print(root.attrib)
namespaces = {"soap": "http://www.w3.org/2003/05/soap-envelope/",
              "xsi": "http://www.w3.org/2001/XMLSchema-instance",
              "xsd": "http://www.w3.org/2001/XMLSchema/",
              'a': 'http://www.relatics.com/',}
names = tree.findall('./soap:Body''/a:GetResultResponse''/a:GetResultResult', namespaces)
print(names)
for name in names:
    print(name.text)
I tried different methods such as find and findall, and passed different values to them, but everything that gets printed is empty.
I'm not sure how to get the values out of the tags.
Using xml.etree.ElementTree makes this easier; see its documentation for details.
It can read tag attributes as well as the inner text of an element.
import xml.etree.ElementTree as ET
xml = """\
<?xml version="1.0" encoding="utf-8"?>
<soap:Envelope xmlns:soap="http://www.w3.org/2003/05/soap-envelope"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:xsd="http://www.w3.org/2001/XMLSchema">
<soap:Body>
<GetResultResponse xmlns="http://www.relatics.com/">
<GetResultResult>
<Report ReportName="RFC" GeneratedOn="2022-12-22" EnvironmentID="XXXX" EnvironmentName="Systematic Assurance – an XXX Solution" EnvironmentURL="https://XXXX.relaticsonline.com/" WorkspaceID="XXXXX" WorkspaceName="P - ADL Program Management - XXX" TargetDevice="Pc" ReportField=""
xmlns="">
<Change_module>
<applied_individual_change_request Change_Request="TestKZIreport" RFC_GUID="XXXXX">
<code RFC_Code="VtW-0101" />
<progress RFC_Progress="agreed" />
<applied_individual_project_organisation Organisation="XXXX" />
<applied__individual_discipline Discipline="Highways" />
<specification Specification="Context of Documents">
<code Specification_Code="1.1.1a" />
</specification>
<applied_individual_workpackage Workpackage="Enabling work">
<code Workpackage_Code="WP-01" />
</applied_individual_workpackage>
<physical_object Physical_Object="Train Station">
<code Physical_Object_Code="TFO-0001" />
</physical_object>
<person approver="XXX" />
<applied_individual_change_consequence_qualification Consequence_Value="10 days">
<applied_conceptual_change_consequence_aspect Consequence_Aspect="Schedule" />
</applied_individual_change_consequence_qualification>
<document Document_Name="WI 300 Design.pdf">
<code Document_Code="DOC-0002" />
</document>
<answer_status BR_Status="no" />
<applied_individual_business_rule Business_Rule="Change Review compliance">
<code BR_Code="BR-006" />
</applied_individual_business_rule>
<applied_individual_change_consequence_qualification Consequence_Value="XXX">
<applied_conceptual_change_consequence_aspect Consequence_Aspect="Finance" />
</applied_individual_change_consequence_qualification>
</applied_individual_change_request>
</Change_module>
</Report>
</GetResultResult>
</GetResultResponse>
</soap:Body>
</soap:Envelope>
"""
root = ET.fromstring(xml)
print("RFC_Code: " + str(root.find(".//code[@RFC_Code]").attrib))
print("RFC_Progress: " + str(root.find(".//progress[@RFC_Progress]").attrib))
print("specification: " + str(root.find(".//specification[@Specification]").attrib))
print("Specification_Code: " + str(root.find(".//code[@Specification_Code]").attrib))
print("Workpackage_Code: " + str(root.find(".//code[@Workpackage_Code]").attrib))
print("Document_Code: " + str(root.find(".//code[@Document_Code]").attrib))
Result
$ python get-data.py
RFC_Code: {'RFC_Code': 'VtW-0101'}
RFC_Progress: {'RFC_Progress': 'agreed'}
specification: {'Specification': 'Context of Documents'}
Specification_Code: {'Specification_Code': '1.1.1a'}
Workpackage_Code: {'Workpackage_Code': 'WP-01'}
Document_Code: {'Document_Code': 'DOC-0002'}
If the XML is in a file, parse it like this:
with open('data.xml', 'r') as xml_file:
    root = ET.parse(xml_file).getroot()
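As for why the attempt in the question finds nothing: the namespace URIs in its dictionary end in a slash ("…/soap-envelope/"), while the document declares them without one, and namespace URIs are compared character for character. The Report element also resets the default namespace with xmlns="", so everything below it is addressed without a prefix. A minimal sketch on a trimmed copy of the response (the prefix "r" and the variable names are arbitrary choices for illustration):

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the SOAP response from the question.
SAMPLE = """\
<soap:Envelope xmlns:soap="http://www.w3.org/2003/05/soap-envelope">
  <soap:Body>
    <GetResultResponse xmlns="http://www.relatics.com/">
      <GetResultResult>
        <Report ReportName="RFC" xmlns="">
          <Change_module>
            <applied_individual_change_request Change_Request="TestKZIreport" RFC_GUID="XXXXX">
              <code RFC_Code="VtW-0101" />
              <progress RFC_Progress="agreed" />
              <specification Specification="Context of Documents">
                <code Specification_Code="1.1.1a" />
              </specification>
            </applied_individual_change_request>
          </Change_module>
        </Report>
      </GetResultResult>
    </GetResultResponse>
  </soap:Body>
</soap:Envelope>
"""

# URIs copied exactly from the document: no trailing slash on the SOAP one.
ns = {
    "soap": "http://www.w3.org/2003/05/soap-envelope",
    "r": "http://www.relatics.com/",
}

root = ET.fromstring(SAMPLE)
result = root.find("soap:Body/r:GetResultResponse/r:GetResultResult", ns)

# Report declares xmlns="", so its subtree is in no namespace at all.
change_module = result.find("Report/Change_module")

# One (tag, attribute, value) tuple per attribute below Change_module.
rows = [
    (elem.tag, name, value)
    for elem in change_module.iter()
    for name, value in elem.attrib.items()
]
for row in rows:
    print(row)
```

Tuples are used rather than a single dictionary because some attribute names (Consequence_Value, for instance) occur more than once in the full response and would overwrite each other.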
I want to convert an XML document to another standard structure that was given to me.
I believe I can achieve that through the following steps:
1. Get all the tags and their attributes, so I know what to modify, remove or add.
2. Change tag names (e.g. informaltable to table, or sect1 to section).
3. Establish standard attributes for the different tags and hold them in a dictionary (e.g. all section, title and table tags must have these attributes: section: {"xmlns:xsi", "id", "type", "xsi:noNamespaceSchemaLocation"}, title: {"id"}, table: {"frame", "id"}).
4. Give a random alphanumeric id to every tag that has the id attribute; it must never repeat (e.g. id=id-824fc56b-431b-4ad3-e933-f0fc222e50d3).
5. Modify, add and remove attribute values for certain tags (e.g. frame=all becomes frame=any; delete the rowsep attribute on the colspec tag).
6. Remove specific tags (e.g. remove the anchor tags and, of course, all of their attributes). I hope this doesn't affect the whole hierarchy.
I have this xml example
<section xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" id="id-c3ee53e4-e2ef-441b-8f3b-7320c4e32ef8" type="policy" xsi:noNamespaceSchemaLocation="urn:fontoxml:cpa.xsd:5.0">
<title id="id-f0497441-5ecb-47ee-b7c0-263832a9e402">
<anchor id="_Toc493170182"/>
<anchor id="__RefHeading___Toc3574_3674829928"/>
<anchor id="_Toc72503731"/>
<anchor id="_Toc69390724"/>
<anchor id="_Toc493496869"/>
Abbreviations of Terms
</title>
<table frame="all" id="id-6837f232-02e3-4e7a-ce8d-cb2df48256ac">
<tgroup cols="2" id="id-437c0d54-7257-4d34-a73d-351d533f0460">
<colspec colname="column-0" colnum="1" colsep="1" rowsep="1" colwidth="0.2*" id="id-c87e1040-c2d7-4b15-fb0c-86557d201235" />
<colspec colname="column-1" colnum="2" colsep="1" rowsep="1" colwidth="0.8*" id="id-5bebcf85-440b-416e-b2f9-72e47d5bb4f7" />
<thead id="id-ff67f8a7-5baf-4a42-ac31-09c0f99cceed">
<row id="id-542df999-7736-4cc2-e725-1b7b106e08d6">
<entry rowsep="1" colsep="1" colname="column-0" id="id-54a7d605-21ff-44db-c1f6-03111db180c7">
<para id="id-f43f7fb1-cd40-4b4a-88f2-02e55e786a5e">
<emphasis style="bold">Abbreviation
</emphasis>
</para>
</entry>
<entry rowsep="1" colsep="1" colname="column-1" id="id-aecec4c6-f85b-490e-9b72-99c6764b49cf">
<para id="id-4d89100a-4e4c-419a-d081-f776bcf9083e">
<emphasis style="bold">Definition
</emphasis>
</para>
</entry>
</row>
</thead>
<tbody id="id-824fc56b-431b-4ad3-e933-f0fc222e50d3">
<row id="id-620a8ff6-0189-41c7-e9af-dc9498ce703e">
<entry rowsep="1" colsep="1" colname="column-0" id="id-fb941cc0-287d-4760-a5a0-87419fa66d68">
<para id="id-127a8a37-9705-496b-87ee-303bcfd52a25">A/C</para>
</entry>
<entry rowsep="1" colsep="1" colname="column-1" id="id-317ad682-6e02-43c3-b724-5d50683c8f79">
<para id="id-c7c2fac5-f286-4802-b8d6-2e54fa2cad3c">AirCraft</para>
</entry>
</row>
</tbody>
</tgroup>
</table>
</section>
And this is the code that I have so far
from lxml import etree
import numpy as np

#Parsing the xml file and creating lists
tree = etree.parse("InitialFile")
root = tree.getroot()
Lista = []
tags = []

#Get the unique tags values
for element in root.iter():
    Lista.append(element.tag)
tags = np.unique(Lista)

#Show the unique tag[attributes] pairs
for tag in tags:
    print(tag, root.xpath(f'//{tag}')[0].attrib.keys())

#Changes the tag name to the required tag's name
for p in tree.findall(".//sect1"):
    p.tag = "section"
for p in tree.findall(".//informaltable"):
    p.tag = "table"

#Modify the tag's attributes to its desired form
for cy in root.xpath('//section'):
    cy.attrib['xmlns:xsi'] = 'http://www.w3.org/2001/XMLSchema-instance'  # it doesn't accept : as part of the attribute's name and I don't know why
    cy.attrib['id'] = random()  # this doesn't work yet
    cy.attrib['type'] = 'policy'
    cy.attrib['xsi:noNamespaceSchemaLocation'] = 'urn:fontoxml:cpa.xsd:1.0'  # it doesn't accept : as part of the attribute's name and I don't know why

#Modify the attributes values
for t in root.xpath('//title'):
    t.attrib['id'] = random()
for p in root.xpath('//section'):
    p.attrib['id'] = random()
    p.attrib['type'] = 'policy'
for p in root.xpath('//table'):
    p.attrib['id'] = random()
for ct in root.xpath('//colspec'):
    ct.attrib.pop("rowsep", None)

#Print the new xml to make sure it worked:
print(etree.tostring(root).decode())
tree.write("Final file.xml")
If you have any other ideas please feel free to share.
I agree that this is a task for XSLT (which can be used by lxml), here is an example stylesheet that tries to implement some of your requirements in a modular way by delegating each change to a template of its own:
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    version="1.0">

  <xsl:output method="xml"/>

  <xsl:template match="@* | node()">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()"/>
    </xsl:copy>
  </xsl:template>

  <xsl:template match="sect1">
    <section>
      <xsl:apply-templates select="@* | node()"/>
    </section>
  </xsl:template>

  <xsl:template match="informaltable">
    <table>
      <xsl:apply-templates select="@* | node()"/>
    </table>
  </xsl:template>

  <xsl:template match="@id">
    <xsl:attribute name="{name()}">
      <xsl:value-of select="generate-id()"/>
    </xsl:attribute>
  </xsl:template>

  <xsl:template match="@xsi:noNamespaceSchemaLocation">
    <xsl:attribute name="{name()}" namespace="{namespace-uri()}">urn:fontoxml:cpa.xsd:1.0</xsl:attribute>
  </xsl:template>

  <xsl:template match="colspec/@rowsep"/>

</xsl:stylesheet>
https://xsltfiddle.liberty-development.net/bET2rXs
I hope with that as a starting point and any XSLT tutorial or introduction you can work it out.
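If you would rather stay in Python, here is a rough sketch of the same kind of changes using only the standard library, under a few assumptions: the sample input is invented, uuid4 is used as a practically unique id source (generate-id() in XSLT is unique within a document but not random), and remove_keeping_tail is a hypothetical helper. It also shows why 'xsi:...' is rejected as an attribute name: namespaced attributes are written in the {uri}localname form, and the same applies in lxml.

```python
import uuid
import xml.etree.ElementTree as ET

XSI = "http://www.w3.org/2001/XMLSchema-instance"
ET.register_namespace("xsi", XSI)  # keep the xsi prefix when serializing

# Invented input that still uses the old tag names.
SAMPLE = """\
<sect1 id="old">
  <title id="t"><anchor id="_Toc1"/><anchor id="_Toc2"/>Abbreviations of Terms</title>
  <informaltable frame="all" id="x">
    <colspec colname="column-0" rowsep="1" id="c"/>
  </informaltable>
</sect1>
"""

RENAMES = {"sect1": "section", "informaltable": "table"}


def remove_keeping_tail(parent, child):
    """Remove child from parent but keep the text that followed it."""
    index = list(parent).index(child)
    tail = child.tail or ""
    if index == 0:
        parent.text = (parent.text or "") + tail
    else:
        previous = parent[index - 1]
        previous.tail = (previous.tail or "") + tail
    parent.remove(child)


root = ET.fromstring(SAMPLE)

# Drop every anchor element; the surrounding text survives.
for parent in list(root.iter()):
    for child in list(parent):
        if child.tag == "anchor":
            remove_keeping_tail(parent, child)

for elem in root.iter():
    elem.tag = RENAMES.get(elem.tag, elem.tag)
    if "id" in elem.attrib:
        elem.set("id", "id-" + str(uuid.uuid4()))
    if elem.tag == "table" and elem.get("frame") == "all":
        elem.set("frame", "any")
    if elem.tag == "colspec":
        elem.attrib.pop("rowsep", None)
    if elem.tag == "section":
        elem.set("type", "policy")
        # Namespaced attribute: {uri}local form instead of the "xsi:" prefix.
        elem.set("{%s}noNamespaceSchemaLocation" % XSI, "urn:fontoxml:cpa.xsd:1.0")

print(ET.tostring(root, encoding="unicode"))
```

The xmlns:xsi declaration itself is not set as an attribute; the serializer emits it once a namespaced attribute is present.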
I am new to Python and I am looking for advice on the best approach to the following task:
I have an XML file that looks like this:
<component xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.spiritconsortium.org/XMLSchema/SPIRIT/1685-2009 http://www.spiritconsortium.org/XMLSchema/SPIRIT/1685-2009/index.xsd">
<memoryMaps>
<memoryMap>
<name>name</name>
<description>description</description>
<peripheral>
<name>periph</name>
<description>description</description>
<baseAddress>0x0</baseAddress>
<range>0x8</range>
<width>32</width>
<register>
<name>reg1</name>
<displayName>reg1</displayName>
<addressOffset>0x0</addressOffset>
<size>32</size>
<access>read-write</access>
<reset>
<value>0x00000002</value>
<mask>0xFFFFFFFF</mask>
</reset>
<field>
</field>
</register>
</peripheral>
</memoryMap>
</memoryMaps>
</component>
I want to replace the "reset" node with two separate nodes, "resetValue" and "resetMask", which keep the data from "value" and "mask" respectively, as follows:
........
<access>read-write</access>
<resetValue>0x00000002</resetValue>
<resetMask>0xFFFFFFFF</resetMask>
<field>
.............
I managed to parse my XML file successfully, but I don't know how to start on this first modification.
Thank you for any guidance.
The code below creates two sub-elements under 'register' and removes the unneeded 'reset' element:
import xml.etree.ElementTree as ET
xml = '''<component xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.spiritconsortium.org/XMLSchema/SPIRIT/1685-2009 http://www.spiritconsortium.org/XMLSchema/SPIRIT/1685-2009/index.xsd">
<memoryMaps>
<memoryMap>
<name>name</name>
<description>description</description>
<peripheral>
<name>periph</name>
<description>description</description>
<baseAddress>0x0</baseAddress>
<range>0x8</range>
<width>32</width>
<register>
<name>reg1</name>
<displayName>reg1</displayName>
<addressOffset>0x0</addressOffset>
<size>32</size>
<access>read-write</access>
<reset>
<value>0x00000002</value>
<mask>0xFFFFFFFF</mask>
</reset>
<field>
</field>
</register>
</peripheral>
</memoryMap>
</memoryMaps>
</component>'''
root = ET.fromstring(xml)
register = root.find('.//register')
value = register.find('.//reset/value').text
mask = register.find('.//reset/mask').text
v = ET.SubElement(register, 'resetValue')
v.text = value
m = ET.SubElement(register, 'resetMask')
m.text = mask
register.remove(register.find('reset'))
ET.dump(root)
output
<?xml version="1.0" encoding="UTF-8"?>
<component xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.spiritconsortium.org/XMLSchema/SPIRIT/1685-2009 http://www.spiritconsortium.org/XMLSchema/SPIRIT/1685-2009/index.xsd">
<memoryMaps>
<memoryMap>
<name>name</name>
<description>description</description>
<peripheral>
<name>periph</name>
<description>description</description>
<baseAddress>0x0</baseAddress>
<range>0x8</range>
<width>32</width>
<register>
<name>reg1</name>
<displayName>reg1</displayName>
<addressOffset>0x0</addressOffset>
<size>32</size>
<access>read-write</access>
<field />
<resetValue>0x00000002</resetValue>
<resetMask>0xFFFFFFFF</resetMask>
</register>
</peripheral>
</memoryMap>
</memoryMaps>
</component>
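Note that SubElement appends, so in the output above the new elements land after field, whereas the question shows them where reset used to be. If the position matters, one option is to insert at the index of the old element. A small sketch on a trimmed register (the variable names are only illustrative):

```python
import xml.etree.ElementTree as ET

# Trimmed register element from the question.
SAMPLE = """\
<register>
  <access>read-write</access>
  <reset>
    <value>0x00000002</value>
    <mask>0xFFFFFFFF</mask>
  </reset>
  <field />
</register>
"""

root = ET.fromstring(SAMPLE)

# list(...) so the tree can be modified safely while looping.
for register in list(root.iter("register")):
    reset = register.find("reset")
    if reset is None:
        continue
    position = list(register).index(reset)

    reset_value = ET.Element("resetValue")
    reset_value.text = reset.findtext("value")
    reset_mask = ET.Element("resetMask")
    reset_mask.text = reset.findtext("mask")
    # Reuse the whitespace that followed <reset> so the layout stays similar.
    reset_value.tail = reset_mask.tail = reset.tail

    # Swap the old element for the two new ones at the same position.
    register.remove(reset)
    register.insert(position, reset_value)
    register.insert(position + 1, reset_mask)

print([child.tag for child in root])
# ['access', 'resetValue', 'resetMask', 'field']
```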
Considering this XML example:
<data>
<items>
<item name="item1">item1pre <bold>ok!</bold> item1post</item>
<item name="item2">item2</item>
</items>
</data>
I am looking for a way to get the following result:
"item1pre **ok! ** item1post"
I thought of getting all the content of item1 as a string, "item1pre <bold>ok!</bold> item1post", and then replacing "<bold>" and "</bold>" with "**", but I don't know how to get that string.
xml="""
<data>
<items>
<item name="item1">item1pre<bold>ok!</bold>item1post</item>
<item name="item2">item2</item>
</items>
</data>
"""
import xml.etree.ElementTree as ET  # module included with Python


def cleaned_strings_from_xml(xml_str, tag='item'):
    """
    Find all elements named tag in an XML string.

    :param xml_str: valid XML structure as a string
    :param tag: tag to search for inside the XML
    :returns: list with the text of every 'tag' element
    """
    strings = []
    root = ET.fromstring(xml_str)  # parse the argument, not the global variable
    for item in root.iter(tag):
        item_str = ET.tostring(item).decode('utf-8')
        item_str = item_str.replace('<bold>', ' **').replace('</bold>', ' **')
        strings.append(ET.fromstring(item_str).text)
    return strings


print(cleaned_strings_from_xml(xml))
You could offload all xml processing to libxml by using an xslt transformation. Libxml is written in C which should be quicker:
from lxml import etree

transform = etree.XSLT(etree.XML('''
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text" />
  <xsl:strip-space elements="*"/>

  <xsl:template match="/">
    <xsl:apply-templates />
  </xsl:template>

  <xsl:template match="data/items/item[@name = 'item1']">
    <xsl:text>"</xsl:text>
    <xsl:value-of select="text()"/>
    <xsl:text>**</xsl:text>
    <xsl:value-of select="bold/."/>
    <xsl:text>**</xsl:text>
    <xsl:value-of select="bold/following-sibling::text()[1]"/>
    <xsl:text>"</xsl:text>
  </xsl:template>

  <xsl:template match="data/items/item[@name != 'item1']" />
</xsl:stylesheet>
'''))

with open("source.xml") as f:
    print(transform(etree.parse(f)))
In a nutshell: Match the item element with name attribute 'item1' then use relative xpath expressions to extract the strings.
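A third option that avoids both string replacement and XSLT is to walk the element's text, children and tails directly. This is just a sketch with the standard library; render is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Sample document from the question.
SAMPLE = """\
<data>
  <items>
    <item name="item1">item1pre <bold>ok!</bold> item1post</item>
    <item name="item2">item2</item>
  </items>
</data>
"""


def render(elem):
    """Return the text of elem with every <bold> descendant wrapped in **."""
    parts = [elem.text or ""]
    for child in elem:
        inner = render(child)
        parts.append(("**" + inner + "**") if child.tag == "bold" else inner)
        # The tail is the text between this child and the next sibling.
        parts.append(child.tail or "")
    return "".join(parts)


root = ET.fromstring(SAMPLE)
texts = {item.get("name"): render(item) for item in root.iter("item")}
print(texts["item1"])
# item1pre **ok!** item1post
```

Because it recurses, it also copes with several bold elements in one item or with bold nested inside other inline tags.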
I started playing with Python and ElementTree very recently to achieve something quite specific. I think I am very nearly there, but there is one thing I can't quite work out. I am querying an XML file, pulling back the relevant data and putting it into a CSV file. It all works, but elem.attrib["text"] actually returns multiple lines; when I put it into a variable and export that to the CSV, only the first line is exported. Below is the code I am using:
import os
import csv
import xml.etree.cElementTree as ET

path = "/share/new"
c = csv.writer(open("/share/redacted.csv", "wb"))
c.writerow(["S","R","T","R2","R3"])
for filename in os.listdir(path):
    if filename.endswith('.xml'):
        fullname = os.path.join(path, filename)
        tree = ET.ElementTree(file=(fullname))
        for elem in tree.iterfind('PropertyList/Property[@name="Sender"]'):
            c1 = elem.attrib["value"]
        for elem in tree.iterfind('PropertyList/Property[@name="Recipient"]'):
            c2 = elem.attrib["value"]
        for elem in tree.iterfind('PropertyList/Property[@name="Date"]'):
            c3 = elem.attrib["value"]
        for elem in tree.iterfind('ChildContext/ResponseList/Response/TextualAnalysis/ExpressionList/Expression/Match'):
            c4 = elem.attrib["textView"]
        for elem in tree.iterfind('ChildContext/ResponseList/Response/TextualAnalysis/ExpressionList/Expression/Match/Matched'):
            c5 = elem.attrib["text"]
            print elem.attrib["text"]
            print c5
        c.writerow([(c1),(c2),(c3),(c4),(c5)])
The most important part is right near the bottom. The output of print elem.attrib["text"] is:
Apples
Bananas
The output of 'print c5' is the same (to be clear, that is Apples and Bananas on separate lines).
But writing c5 to the CSV only outputs the first line, so only Apples appears in the CSV.
I hope this makes sense. What I need is to output both Apples and Bananas to the CSV (preferably in the same cell). The code is being developed in Python 2.7, but ideally it needs to work in 2.6 (I realise iterfind is not in 2.6; I already have two versions of the code).
I would post the full XML but it is a bit of a beast. As suggested in the comments, here is a cleaned-up version.
<?xml version="1.0" encoding="UTF-8" standalone="no" ?>
<Context>
<PropertyList duplicates="true">
<Property name="Sender" type="string" value="S:demo1@no-one.local"/>
<Property name="Recipient" type="string" value="RPFD:no-one.local"/>
<Property name="Date" type="string" value="Tue, 4 Aug 2015 13:24:16 +0100"/>
</PropertyList>
<ChildContext>
<ResponseList>
<Response>
<Description>
<Arg />
<Arg />
</Description>
<TextualAnalysis version="2.0">
<ExpressionList>
<Expression specified=".CLEAN.(Apples)" total="1" >
<Match textView="Body" truncated="false">
<Surrounding text="..."/>
<Surrounding text="How do you like them "/>
<Matched cleaned="true" text="Apples " type="expression"/>
<Surrounding text="???????? "/>
<Surrounding text="..."/>
</Match>
</Expression>
</ExpressionList>
</TextualAnalysis>
</Response>
</ResponseList>
</ChildContext>
<ChildContext>
<ResponseList>
<Response>
<Description>
<Arg />
<Arg />
</Description>
<TextualAnalysis version="2.0">
<ExpressionList>
<Expression specified=".CLEAN.(Bananas)" total="1" >
<Match textView="Attach" truncated="false">
<Surrounding text="..."/>
<Surrounding text="Also I don't like... "/>
<Matched cleaned="true" text="Bananas " type="expression"/>
<Surrounding text="!!!!!!! "/>
<Surrounding text="..."/>
</Match>
</Expression>
</ExpressionList>
</TextualAnalysis>
</Response>
</ResponseList>
</ChildContext>
</Context>
The following will join together all the text elements, and put them on separate lines in the same cell inside your CSV. You can change the '\n' separator to ' ' or ',' to put them on the same line. However, you might still have issues with some of your other stuff -- you don't have nested loops there and I don't really understand what you are trying to accomplish, so maybe you have more than one of each of those other things too. Anyway:
c5 = []
for elem in tree.iterfind('ChildContext/ResponseList/Response/TextualAnalysis/ExpressionList/Expression/Match/Matched'):
    c5.append(elem.attrib["text"])
c.writerow([c1, c2, c3, c4, '\n'.join(c5)])
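To check that a joined value really ends up in a single cell, the row can be written to an in-memory buffer and read back. The sketch below assumes Python 3 (the question targets 2.6/2.7, where the print statements and file modes differ) and a heavily shortened, made-up version of the XML.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Heavily shortened stand-in for the XML in the question (placeholder values).
SAMPLE = """\
<Context>
  <PropertyList>
    <Property name="Sender" value="sender-placeholder"/>
  </PropertyList>
  <ChildContext><Match textView="Body"><Matched text="Apples "/></Match></ChildContext>
  <ChildContext><Match textView="Attach"><Matched text="Bananas "/></Match></ChildContext>
</Context>
"""

root = ET.fromstring(SAMPLE)
sender = root.find('PropertyList/Property[@name="Sender"]').get("value")

# Collect every match instead of overwriting one variable per iteration.
# strip() drops the trailing space present in the source attribute.
matched = [m.get("text").strip() for m in root.iter("Matched")]

# newline="" is what the csv module expects from the underlying stream.
buffer = io.StringIO(newline="")
writer = csv.writer(buffer)
writer.writerow(["S", "R3"])
writer.writerow([sender, "\n".join(matched)])

# Read the CSV back to confirm both values share one cell.
buffer.seek(0)
rows = list(csv.reader(buffer))
print(rows[1])
```

The csv writer quotes the cell because it contains a line break, so a spreadsheet or csv.reader treats it as one field with two lines.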