Optimizing XML parse into CSV using Python

I have about 10,000 XML files with a similar structure that I want to convert into a single CSV file.
Each XML file looks like this:
<?xml version='1.0' encoding='UTF-8'?>
<S:Envelope xmlns:S="http://schemas.xmlsoap.org/soap/envelope/">
<S:Body>
<ns7:GetStopMonitoringServiceResponse xmlns:ns3="http://www.siri.org.uk/siri" xmlns:ns4="http://www.ifopt.org.uk/acsb" xmlns:ns5="http://www.ifopt.org.uk/ifopt" xmlns:ns6="http://datex2.eu/schema/1_0/1_0" xmlns:ns7="http://new.webservice.namespace">
<Answer>
<ns3:ResponseTimestamp>2019-03-31T09:00:52.912+03:00</ns3:ResponseTimestamp>
<ns3:ProducerRef>ISR Siri Server (141.10)</ns3:ProducerRef>
<ns3:ResponseMessageIdentifier>276480603</ns3:ResponseMessageIdentifier>
<ns3:RequestMessageRef>0100700:1351669188:4684</ns3:RequestMessageRef>
<ns3:Status>true</ns3:Status>
<ns3:StopMonitoringDelivery version="IL2.71">
<ns3:ResponseTimestamp>2019-03-31T09:00:52.912+03:00</ns3:ResponseTimestamp>
<ns3:Status>true</ns3:Status>
<ns3:MonitoredStopVisit>
<ns3:RecordedAtTime>2019-03-31T09:00:52.000+03:00</ns3:RecordedAtTime>
<ns3:ItemIdentifier>-881202701</ns3:ItemIdentifier>
<ns3:MonitoringRef>20902</ns3:MonitoringRef>
<ns3:MonitoredVehicleJourney>
<ns3:LineRef>23925</ns3:LineRef>
<ns3:DirectionRef>2</ns3:DirectionRef>
<ns3:FramedVehicleJourneyRef>
<ns3:DataFrameRef>2019-03-31</ns3:DataFrameRef>
<ns3:DatedVehicleJourneyRef>36962685</ns3:DatedVehicleJourneyRef>
</ns3:FramedVehicleJourneyRef>
<ns3:PublishedLineName>15</ns3:PublishedLineName>
<ns3:OperatorRef>15</ns3:OperatorRef>
<ns3:DestinationRef>26020</ns3:DestinationRef>
<ns3:OriginAimedDepartureTime>2019-03-31T08:35:00.000+03:00</ns3:OriginAimedDepartureTime>
<ns3:VehicleLocation>
<ns3:Longitude>34.78000259399414</ns3:Longitude>
<ns3:Latitude>32.042293548583984</ns3:Latitude>
</ns3:VehicleLocation>
<ns3:VehicleRef>37629301</ns3:VehicleRef>
<ns3:MonitoredCall>
<ns3:StopPointRef>20902</ns3:StopPointRef>
<ns3:ExpectedArrivalTime>2019-03-31T09:03:00.000+03:00</ns3:ExpectedArrivalTime>
</ns3:MonitoredCall>
</ns3:MonitoredVehicleJourney>
</ns3:MonitoredStopVisit>
</ns3:StopMonitoringDelivery>
</Answer>
</ns7:GetStopMonitoringServiceResponse>
</S:Body>
</S:Envelope>
The example above shows a single nested MonitoredStopVisit tag, but each XML file has about 4,000 of them.
Full XML as an example can be found here.
I want to convert all the 10K files to one CSV where each record corresponds to a MonitoredStopVisit tag, so the CSV should look like this:
Currently this is my architecture:
Split the 10K files into 8 chunks (one per CPU core).
Each sub-process iterates through its XML files and objectifies each one.
The object is then iterated, and for each element I use conditions to include or exclude data in an array.
When the tag is /ns3:MonitoredStopVisit, the array is appended to a pandas DataFrame as a Series.
When all sub-processes are done, the DataFrames are merged and saved as CSV.
This is the xml to df code:
def xml_to_df(xml_file):
    from lxml import objectify
    xml_content = xml_file.read()
    obj = objectify.fromstring(xml_content)
    df_cols=[
        'RecordedAtTime',
        'MonitoringRef',
        'LineRef',
        'DirectionRef',
        'PublishedLineName',
        'OperatorRef',
        'DestinationRef',
        'OriginAimedDepartureTime',
        'Longitude',
        'Latitude',
        'VehicleRef',
        'StopPointRef',
        'ExpectedArrivalTime',
        'AimedArrivalTime'
    ]
    tempdf = pd.DataFrame(columns=df_cols)
    arr_of_vals = [""] * 14
    for i in obj.getiterator():
        if "MonitoredStopVisit" in i.tag or "Status" in i.tag and "false" in str(i):
            if arr_of_vals[0] != "" and (arr_of_vals[8] and arr_of_vals[9]):
                s = pd.Series(arr_of_vals, index=df_cols)
                if tempdf[(tempdf==s).all(axis=1)].empty:
                    tempdf = tempdf.append(s, ignore_index=True)
            arr_of_vals = [""] * 14
        elif "RecordedAtTime" in i.tag:
            arr_of_vals[0] = str(i)
        elif "MonitoringRef" in i.tag:
            arr_of_vals[1] = str(i)
        elif "LineRef" in i.tag:
            arr_of_vals[2] = str(i)
        elif "DestinationRef" in i.tag:
            arr_of_vals[6] = str(i)
        elif "OriginAimedDepartureTime" in i.tag:
            arr_of_vals[7] = str(i)
        elif "Longitude" in i.tag:
            if str(i) == "345353":
                print("Lon: " + str(i))
            arr_of_vals[8] = str(i)
        elif "Latitude" in i.tag:
            arr_of_vals[9] = str(i)
        elif "VehicleRef" in i.tag:
            arr_of_vals[10] = str(i)
        elif "ExpectedArrivalTime" in i.tag:
            arr_of_vals[12] = str(i)
    if arr_of_vals[0] != "" and (arr_of_vals[8] and arr_of_vals[9]):
        s = pd.Series(arr_of_vals, index=df_cols)
        if tempdf[(tempdf == s).all(axis=1)].empty:
            tempdf = tempdf.append(s, ignore_index=True)
    return tempdf
The problem is that for 10K files this takes about 10 hours with 8 sub-processes.
When checking CPU and memory usage, I can see that they are not fully utilized.
Any idea how this can be improved? My next step is threading, but maybe there are other applicable approaches.
Just as a note, the order of records isn't important - I can sort it later.

Here is my solution with pandas:
Computation time for each 5 MB file is about 0.4 s.
import os
import re
import xml.etree.ElementTree as ET
import pandas as pd
# tags to extract; this is also the CSV column order
tag_list = ['RecordedAtTime', 'MonitoringRef', 'LineRef', 'DirectionRef', 'PublishedLineName', 'OperatorRef',
            'DestinationRef', 'OriginAimedDepartureTime', 'Longitude', 'Latitude', 'VehicleRef', 'StopPointRef',
            'ExpectedArrivalTime', 'AimedArrivalTime']
def collect_data(xml_file):
    # create xml object
    root = ET.parse(xml_file).getroot()
    # collect raw data
    out_data = []
    for element in root.iter():
        # get tag name without the namespace prefix
        tag = re.sub('{.*?}', '', element.tag)
        # RecordedAtTime is the first field of each record, so add a break marker before it
        if tag == 'RecordedAtTime':
            out_data.append('break')
        if tag in tag_list:
            out_data.append((tag, element.text))
    # get break indexes, plus the end of the list so that the last record is kept
    break_index = [i for i, x in enumerate(out_data) if x == 'break']
    break_index.append(len(out_data))
    # build one row per record
    final_data = []
    for start, end in zip(break_index, break_index[1:]):
        # skip the break marker and convert the (tag, text) pairs into a dictionary
        data_dictionary = dict(out_data[start + 1:end])
        # tags missing from a record become empty strings
        final_data.append(tuple(data_dictionary.get(tag, '') for tag in tag_list))
    return final_data
# collect data from each file
save_data = []
for file_name in os.listdir(os.getcwd()):
    if file_name.endswith('.xml'):
        save_data.append(collect_data(file_name))
# merge list of lists
flat_list = [item for sublist in save_data for item in sublist]
# load data into data frame
data = pd.DataFrame(flat_list, columns=tag_list)
# save data to file
data.to_csv('data.csv', index=False)
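A possible variation on the same idea, as a rough sketch rather than a drop-in replacement: instead of inserting break markers, group the fields by their enclosing MonitoredStopVisit element with iterparse(), and clear each element once its row has been built. The namespace URI is taken from the sample in the question; the iter_visits name, the shortened COLUMNS list, and the in-memory sample document are assumptions made for illustration.

```python
import csv
import io
import xml.etree.ElementTree as ET

SIRI = "http://www.siri.org.uk/siri"  # namespace URI from the question's sample
COLUMNS = ["RecordedAtTime", "MonitoringRef", "LineRef", "Longitude", "Latitude"]  # subset for brevity


def iter_visits(source):
    """Yield one dict per MonitoredStopVisit, clearing each element after use."""
    visit_tag = f"{{{SIRI}}}MonitoredStopVisit"
    for _, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != visit_tag:
            continue
        row = dict.fromkeys(COLUMNS, "")
        for child in elem.iter():
            name = child.tag.rsplit("}", 1)[-1]  # drop the "{namespace}" prefix
            if name in row:
                row[name] = child.text or ""
        yield row
        elem.clear()  # release the children of the finished record


# Small in-memory document standing in for one of the real files.
sample = f"""<Answer xmlns:ns3="{SIRI}">
  <ns3:StopMonitoringDelivery>
    <ns3:MonitoredStopVisit>
      <ns3:RecordedAtTime>2019-03-31T09:00:52.000+03:00</ns3:RecordedAtTime>
      <ns3:MonitoringRef>20902</ns3:MonitoringRef>
      <ns3:MonitoredVehicleJourney>
        <ns3:LineRef>23925</ns3:LineRef>
        <ns3:VehicleLocation>
          <ns3:Longitude>34.78</ns3:Longitude>
          <ns3:Latitude>32.04</ns3:Latitude>
        </ns3:VehicleLocation>
      </ns3:MonitoredVehicleJourney>
    </ns3:MonitoredStopVisit>
    <ns3:MonitoredStopVisit>
      <ns3:RecordedAtTime>2019-03-31T09:01:00.000+03:00</ns3:RecordedAtTime>
      <ns3:MonitoringRef>20903</ns3:MonitoringRef>
    </ns3:MonitoredStopVisit>
  </ns3:StopMonitoringDelivery>
</Answer>"""

rows = list(iter_visits(io.BytesIO(sample.encode("utf-8"))))

out = io.StringIO()  # stands in for the real CSV file
writer = csv.DictWriter(out, fieldnames=COLUMNS)
writer.writeheader()
writer.writerows(rows)
csv_text = out.getvalue()
```

Grouping by the enclosing element means a record with missing fields cannot borrow values from its neighbour, and the last record in a file needs no special handling.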

So it seems the issue is the use of the Pandas dataframe and series.
Using the code above, processing one xml file with ~4000 records took 4-120 seconds. The time increased as the program kept working.
Using python lists or numpy matrices (more convenient for working into a csv) decreased the running time significantly - each xml file processing now takes 0.1-0.5 seconds tops.
I used the following code to append the new processed records each time
records = np.append(records, new_records, axis=0)
This is equivalent to:
tempdf = tempdf.append(s, ignore_index=True)
but significantly faster.
Hope this helps anyone who might encounter similar issues!
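A further step in the same direction, sketched under assumptions: np.append still returns a new copy of the array on every call, so collecting rows in a plain Python list and writing them once with the csv module avoids the repeated copying altogether. The parsed_files data and the shortened header below are made up for illustration, and the StringIO buffer stands in for the real output file.

```python
import csv
import io

header = ["RecordedAtTime", "Longitude", "Latitude"]  # shortened column list for the example

# Hypothetical per-file results: each parsed file yields a list of row tuples.
parsed_files = [
    [("2019-03-31T09:00:52", "34.78", "32.04")],
    [("2019-03-31T09:01:10", "34.80", "32.05"),
     ("2019-03-31T09:01:30", "34.81", "32.06")],
]

records = []
for file_rows in parsed_files:
    records.extend(file_rows)  # appends in place, no copy of the existing rows

buffer = io.StringIO()  # stands in for open("data.csv", "w", newline="")
writer = csv.writer(buffer)
writer.writerow(header)
writer.writerows(records)
csv_text = buffer.getvalue()
```

If a DataFrame is still needed afterwards, it can be built once from the finished list instead of being grown row by row.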

Consider XSLT, a special-purpose language for transforming XML files into other XML or into text formats such as CSV. The only third-party library needed is lxml, which can run XSLT 1.0 scripts, so you can leave out heavier analytical tools such as Pandas and NumPy.
In fact, because XSLT is a separate, industry-standard language, it is portable and can be run from any language with an XSLT library (e.g., Java, PHP, Perl, C#, VB) or by standalone 1.0, 2.0, or 3.0 processors (e.g., Xalan, Saxon), all of which Python can call as a command-line subprocess.
XSLT (save below as a .xsl file, a special .xml file)
<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:S="http://schemas.xmlsoap.org/soap/envelope/"
xmlns:ns3="http://www.siri.org.uk/siri"
xmlns:ns4="http://www.ifopt.org.uk/acsb"
xmlns:ns5="http://www.ifopt.org.uk/ifopt"
xmlns:ns6="http://datex2.eu/schema/1_0/1_0"
xmlns:ns7="http://new.webservice.namespace">
<xsl:output method="text" indent="yes" omit-xml-declaration="yes"/>
<xsl:strip-space elements="*"/>
<xsl:template match ="/S:Envelope/S:Body/ns7:GetStopMonitoringServiceResponse/Answer">
<xsl:apply-templates select="ns3:StopMonitoringDelivery"/>
</xsl:template>
<xsl:template match="ns3:StopMonitoringDelivery">
<!-- HEADERS -->
<!-- <xsl:text>RecordedAtTime,MonitoringRef,LineRef,DirectionRef,PublishedLineName,OperatorRef,DestinationRef,OriginAimedDepartureTime,Longitude,Latitude,VehicleRef,StopPointRef,ExpectedArrivalTime,AimedArrivalTime
</xsl:text> -->
<xsl:apply-templates select="ns3:MonitoredStopVisit"/>
</xsl:template>
<xsl:template match="ns3:MonitoredStopVisit">
<xsl:variable name="delim">,</xsl:variable>
<xsl:variable name="quote">"</xsl:variable>
<!-- DATA ROWS: one line per MonitoredStopVisit, ended by a line break -->
<xsl:value-of select="concat($quote, ns3:RecordedAtTime, $quote, $delim,
$quote, ns3:MonitoringRef, $quote, $delim,
$quote, ns3:MonitoredVehicleJourney/ns3:LineRef, $quote, $delim,
$quote, ns3:MonitoredVehicleJourney/ns3:DirectionRef, $quote, $delim,
$quote, ns3:MonitoredVehicleJourney/ns3:PublishedLineName, $quote, $delim,
$quote, ns3:MonitoredVehicleJourney/ns3:OperatorRef, $quote, $delim,
$quote, ns3:MonitoredVehicleJourney/ns3:DestinationRef, $quote, $delim,
$quote, ns3:MonitoredVehicleJourney/ns3:OriginAimedDepartureTime, $quote, $delim,
$quote, ns3:MonitoredVehicleJourney/ns3:VehicleLocation/ns3:Longitude, $quote, $delim,
$quote, ns3:MonitoredVehicleJourney/ns3:VehicleLocation/ns3:Latitude, $quote, $delim,
$quote, ns3:MonitoredVehicleJourney/ns3:VehicleRef, $quote, $delim,
$quote, ns3:MonitoredVehicleJourney/ns3:MonitoredCall/ns3:StopPointRef, $quote, $delim,
$quote, ns3:MonitoredVehicleJourney/ns3:MonitoredCall/ns3:ExpectedArrivalTime, $quote, $delim,
$quote, ns3:MonitoredVehicleJourney/ns3:MonitoredCall/ns3:AimedArrivalTime, $quote, '&#xa;'
)"/>
</xsl:template>
</xsl:stylesheet>
Python (no appending lists, arrays, or dataframes)
import glob # TO RETRIEVE ALL XML FILES
import lxml.etree as et # TO PARSE XML AND RUN XSLT
xml_path = "/path/to/xml/files"
# PARSE XSLT AND BUILD THE TRANSFORMER ONCE
xsl = et.parse('XSLTScript.xsl')
transform = et.XSLT(xsl)
# BUILD CSV
with open("MonitoredStopVisits.csv", 'w') as f:
    # HEADER
    f.write('RecordedAtTime,MonitoringRef,LineRef,DirectionRef,PublishedLineName,'
            'OperatorRef,DestinationRef,OriginAimedDepartureTime,Longitude,Latitude,'
            'VehicleRef,StopPointRef,ExpectedArrivalTime,AimedArrivalTime\n')
    # DATA ROWS (the loop variable must not reuse the name of the output file handle)
    for xml_file in glob.glob(xml_path + "/**/*.xml", recursive=True):
        # LOAD XML
        xml = et.parse(xml_file)
        # TRANSFORM XML TO STRING RESULT TREE
        result = str(transform(xml))
        # WRITE TO CSV
        f.write(result)

Related

Transform Nested XML

I am trying to parse a nested XML file into a pandas DataFrame so that I can generate a CSV in which each column is an element name and each value is that element's text, but I am having trouble extracting the information. Below is what I have tried, followed by an example of the nested XML.
The XML can be quite large, with hundreds of different records. This is what I tried:
##Import modules
import xml.etree.ElementTree as ET
import pandas as pd
from lxml import etree
tree = ET.parse("File.xml")
root = tree.getroot()
for subelement in root:
    for subsub in subelement:
        print(subsub.tag,",", subsub.text, subsub.attrib, subsub.items())
for subelement in root:
    for subsub in subelement:
        for subsubsub in subsub:
            print(subsubsub.tag,",", subsubsub.text, subsubsub.attrib)
<?xml version="1.0" encoding="utf-16"?>
<test1 xmlns="test.xsd">
<test2 ID="123123123" test3="123123">
<test3>Separate</test3>
<test4>AA</test4>
<Comments>BB</Comments>
<test5>
<test6 ID="123123">
<test3>today</test3>
<test7>123 street</test7>
</test6>
</test5>
<test8>
<test10 ID="434234">
<test3>type of work</test3>
<test9>test work</test9>
</test10>
</test8>
<test11>
<test12 ID="234234234">
<test3>Social</test3>
<test14>test</test14>
</test12>
<test12 ID="123123">
<test3>Something Here</test3>
<test13>Some date</test13>
<test14>123123124433</test14>
</test12>
</test11>
<test15>
<test16 ID="6456456456">
<test3>Something Something</test3>
<test14>746745636</test14>
</test16>
</test15>
</test2>
<test2 ID="353453245" test3="list of something">
<test3>Somewhere</test3>
<test4>Someone</test4>
<Comments>Some comment</Comments>
<test5>
<test6 ID="567456756">
<test3>Not today</test3>
<test7>5634643643</test7>
<test17>Some Info</test17>
<test19>Somewhere</test19>
<test18>63243333</test18>
</test6>
</test5>
<test11>
<test12 ID="456436346">
<test3>Pattern</test3>
<test14>436346346</test14>
</test12>
<test12 ID="4364356">
<test3> ID</test3>
<test14>5674567457</test14>
</test12>
<test12 ID="123123123443">
<test3>Other ID</test3>
<test13>54234532452345</test13>
<test14>231423532452345</test14>
</test12>
</test11>
<test15>
<test16 ID="34252345">
<test3>None test</test3>
<test14>456436436346</test14>
</test16>
</test15>
</test2>
</test1>
Update: So would the full code look something like this?
###TEST USING EXAMPLE HOTLIST
with open("file.csv", "w", newline='') as fout:
    header = ['test3','test4','test7','test9','test13','test14','test17','test18','test19','Comments']
    csvout = csv.DictWriter(fout, fieldnames=header)
    csvout.writeheader()
    row = {}
    for _, elem in ET.iterparse('file.xml'):
        # strip the namespace from the element tag name; e.g. {Test.xsd}test14 > test14
        tag = re.sub("^{.*?}", "", elem.tag)
        if tag == 'test2':
            if len(row) != 0:
                print(row)
                csvout.writerow(row)
                row = {}
        if len(elem) == 0:
            text = elem.text
            old = row.get(tag)
            if old is None:
                # first occurrence of the tag
                row[tag] = text
            elif isinstance(old, str):
                # second occurrence of the tag
                row[tag] = [old, text]
            else:
                # already a list
                old.append(text)
For nested XML you can use the iterparse() function to iterate over all elements in the XML. You then need logic that handles each element according to its tag and adds it to a dictionary, which is exported as a row.
for _, elem in ET.iterparse('file.xml'):
    if len(elem) == 0:
        print(f'{elem.tag} {elem.attrib} text={elem.text}')
    else:
        print(f'{elem.tag} {elem.attrib}')
To create a row in a CSV file from the element text, you can do something like the code below. By default iterparse() reports each element when its closing tag is reached, so by the time "test2" is seen all of its child values have been collected. If "test2" marks a record, that is the point at which to write the row and clear the dictionary for the next record.
If you want to output all or some attributes, you need to add a few lines of code for that. If an attribute has the same name as an element, or several elements share an attribute name (e.g. ID), you need to handle that in your code.
import xml.etree.ElementTree as ET
import re
import csv
with open("out.csv", "w", newline='') as fout:
    header = ['test3','test4','test7','test9','test13','test14','test17','test18','test19','Comments']
    csvout = csv.DictWriter(fout, fieldnames=header)
    csvout.writeheader()
    row = {}
    for _, elem in ET.iterparse('test.xml'):
        # strip the namespace from the element tag name; e.g. {Test.xsd}test14 > test14
        tag = re.sub("^{.*?}", "", elem.tag)
        if tag == 'test2':
            if len(row) != 0:
                print(row)
                csvout.writerow(row)
                row = {}
        if len(elem) == 0:
            row[tag] = elem.text
Output:
{'test3': 'Something Something', 'test4': 'AA', 'Comments': 'BB', 'test7': '123 street', 'test9': 'test work', 'test14': '746745636', 'test13': 'Some date'}
{'test3': 'None test', 'test4': 'Someone', 'Comments': 'Some comment', 'test7': '5634643643', 'test17': 'Some Info', 'test19': 'Somewhere', 'test18': '63243333', 'test14': '456436436346', 'test13': '54234532452345'}
CSV Output:
test3,test4,test7,test9,test13,test14,test17,test18,test19,Comments
Something Something,AA,123 street,test work,Some date,746745636,,,,BB
None test,Someone,5634643643,,54234532452345,456436436346,Some Info,63243333,Somewhere,Some comment
Update:
If you want to handle duplicate tags and collect their values in a list, try something like this:
if len(elem) == 0:
    text = elem.text
    old = row.get(tag)
    if old is None:
        # first occurrence
        row[tag] = text
    elif isinstance(old, str):
        # second occurrence > create list
        row[tag] = [old, text]
    else:
        old.append(text)
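With that change a row can hold lists, and csv.DictWriter would write such a value as its Python repr (for example "['Social', 'Something Here']"). One way to keep the cells as plain text, sketched here with a hypothetical flatten_row helper and an arbitrary "|" separator, is to join list values just before writing:

```python
import csv
import io


def flatten_row(row, sep="|"):
    """Join list values into one string so every CSV cell holds plain text."""
    flat = {}
    for key, value in row.items():
        if isinstance(value, list):
            # elem.text can be None for empty elements, so map None to ""
            flat[key] = sep.join("" if v is None else v for v in value)
        else:
            flat[key] = value
    return flat


header = ["test3", "test14"]
row = {"test3": ["Social", "Something Here"], "test14": "123123124433"}

buffer = io.StringIO()  # stands in for the real output file
csvout = csv.DictWriter(buffer, fieldnames=header)
csvout.writeheader()
csvout.writerow(flatten_row(row))
csv_text = buffer.getvalue()
```

In the loop above this would mean calling csvout.writerow(flatten_row(row)) instead of csvout.writerow(row).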

XML Parsing Python ElementTree - Nested for loops

I'm using Jupyter Notebook and ElementTree (Python 3) to create a dataframe and save as csv from an XML file. Here is the XML format (in Estonian):
<asutused hetk="2020-04-14T03:53:33" ver="2">
<asutus>
<registrikood>10000515</registrikood>
<nimi>Osaühing B.Braun Medical</nimi>
<aadress />
<tegevusload>
<tegevusluba>
<tegevusloa_number>L04647</tegevusloa_number>
<alates>2019-12-10</alates>
<kuni />
<loaliik_kood>1</loaliik_kood>
<loaliik_nimi>Eriarstiabi</loaliik_nimi>
<haiglaliik_kood />
<haiglaliik_nimi />
<tegevuskohad>
<tegevuskoht>
<aadress>Harju maakond, Tallinn, Mustamäe linnaosa, J. Sütiste tee 17/1</aadress>
<teenused>
<teenus>
<kood>T0038</kood>
<nimi>ambulatoorsed üldkirurgiateenused</nimi>
</teenus>
<teenus>
<kood>T0236</kood>
<nimi>õe vastuvõtuteenus</nimi>
</teenus>
</teenused>
</tegevuskoht>
<tegevuskoht>
<aadress>Harju maakond, Tallinn, Mustamäe linnaosa, J. Sütiste tee 17/1</aadress>
<teenused>
<teenus>
<kood>T0038</kood>
<nimi>ambulatoorsed üldkirurgiateenused</nimi>
</teenus>
<teenus>
<kood>T0236</kood>
<nimi>õe vastuvõtuteenus</nimi>
</teenus>
</teenused>
</tegevuskoht>
</tegevuskohad>
</tegevusluba>
<tegevusluba>
<tegevusloa_number>L04651</tegevusloa_number>
<alates>2019-12-11</alates>
<kuni />
<loaliik_kood>2</loaliik_kood>
<loaliik_nimi>Õendusabi</loaliik_nimi>
<haiglaliik_kood />
<haiglaliik_nimi />
<tegevuskohad>
<tegevuskoht>
<aadress>Harju maakond, Tallinn, Mustamäe linnaosa, J. Sütiste tee 17/1</aadress>
<teenused>
<teenus>
<kood>T0038</kood>
<nimi>ambulatoorsed üldkirurgiateenused</nimi>
</teenus>
<teenus>
<kood>T0236</kood>
<nimi>õe vastuvõtuteenus</nimi>
</teenus>
</teenused>
</tegevuskoht>
<tegevuskoht>
<aadress>Harju maakond, Tallinn, Mustamäe linnaosa, J. Sütiste tee 17/1</aadress>
<teenused>
<teenus>
<kood>T0038</kood>
<nimi>ambulatoorsed üldkirurgiateenused</nimi>
</teenus>
<teenus>
<kood>T0236</kood>
<nimi>õe vastuvõtuteenus</nimi>
</teenus>
</teenused>
</tegevuskoht>
</tegevuskohad>
</tegevusluba>
</tegevusload>
<tootajad>
<tootaja>
<kood>D03091</kood>
<eesnimi>Evo</eesnimi>
<perenimi>Kaha</perenimi>
<kutse_kood>11</kutse_kood>
<kutse_nimi>Arst</kutse_nimi>
<erialad>
<eriala>
<kood>E420</kood>
<nimi>üldkirurgia</nimi>
</eriala>
</erialad>
</tootaja>
<tootaja>
<kood>N01146</kood>
<eesnimi>Karmen</eesnimi>
<perenimi>Mežulis</perenimi>
<kutse_kood>15</kutse_kood>
<kutse_nimi>Õde</kutse_nimi>
</tootaja>
<tootaja>
<kood>N01153</kood>
<eesnimi>Nele</eesnimi>
<perenimi>Terras</perenimi>
<kutse_kood>15</kutse_kood>
<kutse_nimi>Õde</kutse_nimi>
</tootaja>
<tootaja>
<kood>N02767</kood>
<eesnimi>Helena</eesnimi>
<perenimi>Tern</perenimi>
<kutse_kood>15</kutse_kood>
<kutse_nimi>Õde</kutse_nimi>
</tootaja>
<tootaja>
<kood>N12882</kood>
<eesnimi>Hanna</eesnimi>
<perenimi>Leemet</perenimi>
<kutse_kood>15</kutse_kood>
<kutse_nimi>Õde</kutse_nimi>
</tootaja>
</tootajad>
</asutus>
</asutused>
Each "asutus" is a hospital and I need some of the information inside. Here is my code:
tree = ET.parse("od_asutused.xml")
root = tree.getroot()
# open a file for writing
data = open('EE.csv', 'w')
# create the csv writer object
csvwriter = csv.writer(data, delimiter=';')
head = []
count = 0
for member in root.findall('asutus'):
    hospital = []
    if count == 0:
        ident = member.find('registrikood').tag
        head.append(id)
        name = member.find('nimi').tag
        head.append(name)
        address = member.find('aadress').tag
        head.append(address)
        facility_type = member.find('./tegevusload/tegevusluba/haiglaliik_nimi').tag
        head.append(facility_type)
        site_address = member.find('./tegevusload/tegevusluba/tegevuskohad/tegevuskoht/aadress').tag
        head.append(site_address)
        for elem in member.findall('tegevusload'):
            list_specs = elem.find('./tegevusluba/tegevuskohad/tegevuskoht/teenused/teenus/nimi').tag
            head.append(list_specs)
        csvwriter.writerow(head)
        count = count + 1
    ident = member.find('registrikood').text
    hospital.append(ident)
    name = member.find('nimi').text
    hospital.append(name)
    address = member.find('aadress').text
    hospital.append(address)
    facility_type = member.find('./tegevusload/tegevusluba/haiglaliik_nimi').text
    hospital.append(facility_type)
    site_address = member.find('./tegevusload/tegevusluba/tegevuskohad/tegevuskoht/aadress').text
    hospital.append(site_address)
    for spec in elem.findall('tegevusload'):
        list_specs = spec.find('./tegevusluba/tegevuskohad/tegevuskoht/teenused/teenus/nimi').text
        hospital.append(list_specs)
    csvwriter.writerow(hospital)
data.close()
#Upload csv for geocoding
df = pd.read_csv(r'EE.csv', na_filter= False, delimiter=';')
#Rename columns
df.rename(columns = {'<built-in function id>':'id',
'nimi':'name',
'aadress':'address',
'haiglaliik_nimi':'facility_type',
'haiglaliik_kood':'facility_type_c',
'aadress.1':'site_address',
'nimi.1':'list_specs'},
inplace = True)
#Add columns
df['country'] = 'Estonia'
df['cc'] = 'EE'
df.head(10)
And the result of the df.head(10):
Result of dataframe
The "list_specs" is blank no matter what I do. How can I populate this field with a list of each 'nimi' for each site address? Thank you.
I found the following points to change in your code:
On my computer, at least, csv.writer doubles the newline characters. The remedy I found is to open the output file with additional parameters:
data = open('EE.csv', 'w', newline='\n', encoding='utf-8')
There is no point in writing the header with Estonian column names and then renaming the columns. Note also that head.append(id) appends the built-in function id rather than your variable ident, which is why the first column is named '<built-in function id>'. This is not so important, though, because I replaced that whole section by writing the target column names directly (see below).
Because the CSV file is later read by read_csv, it should contain a fixed number of columns. So it is bad practice to use a loop to write one element.
Your loop for spec in elem.findall('tegevusload') was wrong, because elem is not set in the current iteration. It is left over from the header section, where it refers to a tegevusload element; that element has no tegevusload children, so the loop body never runs and list_specs stays blank. You should start from member instead (I solved this detail another way).
There is no point in creating a variable only to use it once. More concise and readable code is e.g. hospital.append(member.findtext('nimi')).
To avoid long XPath expressions with a repeated initial part, I set a temporary variable "in the middle" of the path, e.g. tgvLb = member.find('tegevusload/tegevusluba'), and then use a relative XPath starting from that node.
Your rename instruction contains one unneeded column, namely facility_type_c. You read only 6 columns, not 7.
So change the middle part of your code to:
data = open('EE.csv', 'w', newline='\n', encoding='utf-8')
csvwriter = csv.writer(data, delimiter=';')
head = ['id', 'name', 'address', 'facility_type', 'site_address', 'list_specs']
csvwriter.writerow(head)
for member in root.findall('asutus'):
    hospital = []
    hospital.append(member.findtext('registrikood'))
    hospital.append(member.findtext('nimi'))
    hospital.append(member.findtext('aadress'))
    tgvLb = member.find('tegevusload/tegevusluba')
    hospital.append(tgvLb.findtext('haiglaliik_nimi'))
    tgvKoht = tgvLb.find('tegevuskohad/tegevuskoht')
    hospital.append(tgvKoht.findtext('aadress'))
    hospital.append(tgvKoht.findtext('teenused/teenus/nimi'))
    csvwriter.writerow(hospital)
data.close()
df = pd.read_csv(r'EE.csv', na_filter= False, delimiter=';')
and drop df.rename from your code.
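The code above takes only the first site and the first service name of each hospital. If the goal is a list of every 'nimi' for each site address, as the question asks, one possible approach is to write one row per tegevuskoht and join its service names with findall(). This is only a sketch: the trimmed sample document, the rows list, and the ', ' separator are assumptions for illustration (the comma is safe here because the CSV uses ';' as its delimiter).

```python
import xml.etree.ElementTree as ET

# Trimmed-down sample in the same shape as the question's XML.
xml_text = """<asutused>
  <asutus>
    <registrikood>10000515</registrikood>
    <tegevusload>
      <tegevusluba>
        <tegevuskohad>
          <tegevuskoht>
            <aadress>J. Sütiste tee 17/1</aadress>
            <teenused>
              <teenus><kood>T0038</kood><nimi>ambulatoorsed üldkirurgiateenused</nimi></teenus>
              <teenus><kood>T0236</kood><nimi>õe vastuvõtuteenus</nimi></teenus>
            </teenused>
          </tegevuskoht>
        </tegevuskohad>
      </tegevusluba>
    </tegevusload>
  </asutus>
</asutused>"""

root = ET.fromstring(xml_text)

rows = []
for member in root.findall('asutus'):
    ident = member.findtext('registrikood')
    # One row per site (tegevuskoht), across all licences of the hospital.
    for site in member.findall('tegevusload/tegevusluba/tegevuskohad/tegevuskoht'):
        specs = [n.text for n in site.findall('teenused/teenus/nimi') if n.text]
        rows.append([ident, site.findtext('aadress'), ', '.join(specs)])
```

Each entry of rows can then be passed to csvwriter.writerow() in place of the hospital list.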

Building XML from excel data with Python

I am trying to build an xml file from an excel spreadsheet using python but am having trouble building the structure. The xml schema is unique to a software so the opening few tags and ending few would be easier to be written to the xml file just as variables, shown below. They are constant so are pulled from the "
I believe the script needs to loop through another sheet, the ".XML Framework" sheet, to build the .xml structure, as these are the values that will ultimately change. The structure of this sheet is provided below.
here is the .xml structure, from which the python is outputting well up to the unique values, and the changing values are shown in bold. This shows just one row of the data from the workbook. When the workbook has a second row, the .xml structure repeats again where it starts with .
The data structure in the excel sheet ".XML Framework" is:
col 1 = **equals**
col 2 = **74**
col 3 = **Data**
col 4 = col 3
col 5 = **Name 07**
col 6 = col 5
col 7 = **wstring**
col 8 = /**SM15-HVAC-SUPP-TM-37250-ST**
THIS IS THE DESIRED XML STRUCTURE
<?xml version="1.0" encoding="UTF-8" ?>
<exchange xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="http://download.autodesk.com/us/navisworks/schemas/nw-exchange-12.0.xsd" units="m" filename="" filepath="">
<selectionsets>
<selectionset name="Dev_1">
<findspec mode="all" disjoint="0">
<conditions>
<condition test="**equals**" flags="**74**">
<category>
<name internal="**Data**">**Data**</name>
</category>
<property>
<name internal="**Name 07**">**Name 07**</name>
</property>
<value>
<data type="**wstring**">/**SM15-HVAC-SUPP-TM-37250-ST**</data>
</value>
</condition>
</conditions>
<locator>/</locator>
</findspec>
</selectionset>
</selectionsets>
</exchange>
Here is my attempt from the python:
path = (r"C:\\Users\\ciara\\desktop\\")
book = os.path.join(path + "Search_Set.xlsm")
wb = openpyxl.load_workbook(book)
sh = wb.get_sheet_by_name('.XML Framework')
df1 = pd.read_excel(book, "<CLEAN>", header=None)
#opening 5 lines of .xml search
print(df1)
cV1 = df1.iloc[0,0] #xml header
print (cV1)
cV2 = df1.iloc[1,0] #<exchange>
print (cV2)
cV3 = df1.iloc[2,0] #<selectionsets>
print (cV3)
cV4 = df1.iloc[3,0] #<selection set name>
print (cV4)
cV5 = df1.iloc[4,0] #<findspec mode>
print (cV5)
cV6 = df1.iloc[5,0] #<findspec mode>
print (cV6)
E = lxml.builder.ElementMaker()
root = ET.Element(cV1)
doc0 = ET.SubElement(root, cV2)
doc1 = ET.SubElement(doc0, cV3)
doc2 = ET.SubElement(doc1, cV4)
doc3 = ET.SubElement(doc2, cV5)
doc4 = ET.SubElement(doc3, cV6)
the_doc = root(
doc0(
doc1(
doc2(
doc3(
FIELD1('condition test=', name='blah'),
FIELD2('some value2', name='asdfasd'),
)
)
)
)
)
print (lxml.etree.tostring(the_doc, pretty_print=True))
tree = ET.ElementTree(root)
tree.write("filename.xml")
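One way the per-row part of the desired structure could be built is sketched below with the standard library's ElementTree. Everything here is illustrative: build_condition is a hypothetical helper, the rows list stands in for values read from the ".XML Framework" sheet, and the xsi schema attributes on the exchange element are left out for brevity.

```python
import xml.etree.ElementTree as ET


def build_condition(parent, test, flags, category, prop, data_type, value):
    """Append one <condition> element, filled from a single spreadsheet row."""
    condition = ET.SubElement(parent, "condition", test=test, flags=str(flags))
    cat_name = ET.SubElement(ET.SubElement(condition, "category"), "name", internal=category)
    cat_name.text = category
    prop_name = ET.SubElement(ET.SubElement(condition, "property"), "name", internal=prop)
    prop_name.text = prop
    data = ET.SubElement(ET.SubElement(condition, "value"), "data", type=data_type)
    data.text = value
    return condition


# Hypothetical rows standing in for the spreadsheet values
# (columns 4 and 6 repeat columns 3 and 5, so they are not listed).
rows = [("equals", 74, "Data", "Name 07", "wstring", "/SM15-HVAC-SUPP-TM-37250-ST")]

exchange = ET.Element("exchange", units="m", filename="", filepath="")
selectionsets = ET.SubElement(exchange, "selectionsets")
selectionset = ET.SubElement(selectionsets, "selectionset", name="Dev_1")
findspec = ET.SubElement(selectionset, "findspec", mode="all", disjoint="0")
conditions = ET.SubElement(findspec, "conditions")
for row in rows:
    build_condition(conditions, *row)  # one <condition> per spreadsheet row
ET.SubElement(findspec, "locator").text = "/"

xml_text = ET.tostring(exchange, encoding="unicode")
```

Building the constant outer elements in code, rather than reading them as strings from a sheet, keeps the tree well-formed; ET.Element expects a tag name, not a full line of markup such as the XML header.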

python lxml - loop/iterate through excel rows and save each row as one xml

The problem is that the second XML file also contains the data from the first Excel row, and the third XML file contains the data from the first and second rows.
I have been working on this for hours and can't figure it out.
from lxml import etree
import openpyxl
# Create root element with namespace information
xmlns = "http://xml.datev.de/bedi/tps/ledger/v040"
xsi = "http://www.w3.org/2001/XMLSchema-instance"
schemaLocation = "http://xml.datev.de/bedi/tps/ledger/v040 Belegverwaltung_online_ledger_import_v040.xsd"
version = "4.0"
generator_info = "DATEV Musterdaten"
generating_system = "DATEV manuell"
xmlRoot = etree.Element(
"{" + xmlns + "}LedgerImport",
version=version,
attrib={"{" + xsi + "}schemaLocation": schemaLocation},
generator_info=generator_info,
generating_system=generating_system,
nsmap={'xsi': xsi, None: xmlns}
)
####open excel file speadsheet
wb = openpyxl.load_workbook('import_spendesk_datev.xlsx')
sheet = wb['Import']
# build the xml tree
for i in range(2,6):
    consolidate = etree.SubElement(xmlRoot, 'consolidate', attrib={
        'consolidatedAmount': str(sheet.cell(row=i,column=16).value),
        'consolidatedDate': str(sheet.cell(row=i,column=2).value),
        'consolidatedInvoiceId': str(sheet.cell(row=i,column=13).value),
        'consolidatedCurrencyCode': str(sheet.cell(row=i,column=12).value) })
    accountsPayableLedger = etree.SubElement(consolidate, 'accountsPayableLedger')
    account = etree.SubElement(accountsPayableLedger, 'bookingText')
    account.text = sheet.cell(row=i,column=21).value
    invoice = etree.SubElement(accountsPayableLedger, 'invoiceId')
    invoice.text = sheet.cell(row=i,column=13).value
    date = etree.SubElement(accountsPayableLedger, 'date')
    date.text = sheet.cell(row=i,column=2).value
    amount = etree.SubElement(accountsPayableLedger, 'amount')
    amount.text = sheet.cell(row=i,column=16).value
    account_no = etree.SubElement(accountsPayableLedger, 'accountNo')
    account_no.text = sheet.cell(row=i,column=19).value
    cost1 = etree.SubElement(accountsPayableLedger, 'costCategoryId')
    cost1.text = sheet.cell(row=i,column=15).value
    currency_code = etree.SubElement(accountsPayableLedger, 'currencyCode')
    currency_code.text = sheet.cell(row=i,column=12).value
    party_id = etree.SubElement(accountsPayableLedger, 'partyId')
    party_id.text = sheet.cell(row=i,column=20).value
    bpaccount = etree.SubElement(accountsPayableLedger, 'bpAccountNo')
    bpaccount.text = sheet.cell(row=i,column=20).value
    doc = etree.ElementTree(xmlRoot)
    doc.write( str(sheet.cell(row=i,column=13).value)+".xml", xml_declaration=True, encoding='utf-8', pretty_print=True)
The desired result is one .xml file like this for every single Excel row:
<?xml version='1.0' encoding='UTF-8'?>
<LedgerImport xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://xml.datev.de/bedi/tps/ledger/v040" generating_system="DATEV manuell" generator_info="DATEV Musterdaten" version="4.0" xsi:schemaLocation="http://xml.datev.de/bedi/tps/ledger/v040 Belegverwaltung_online_ledger_import_v040.xsd">
<consolidate consolidatedAmount="1337.01">
<accountsPayableLedger>
<bookingText>amazon</bookingText>
<invoiceId>1</invoiceId>
</accountsPayableLedger>
</consolidate>
</LedgerImport>
The same xmlRoot object is reused several times. You need to create a new root element for each iteration in the for loop.
The code that creates the root element can be put in a function. Here is a simplified example:
from lxml import etree

def makeroot():
    return etree.Element("LedgerImport")

for i in range(2, 6):
    xmlRoot = makeroot()
    consolidate = etree.SubElement(xmlRoot, 'consolidate',
                                   attrib={'consolidatedAmount': str(i)})
    doc = etree.ElementTree(xmlRoot)
    doc.write(str(i) + ".xml", xml_declaration=True, encoding='utf-8', pretty_print=True)
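To see why the shared root is the problem, here is a minimal sketch. It assumes the standard-library xml.etree.ElementTree as a stand-in for lxml (the Element and SubElement calls used here behave the same way for this purpose), and the two function names are hypothetical. It counts the children of the root after each iteration, once with a reused root and once with a fresh root per row:

```python
import xml.etree.ElementTree as ET

def child_counts_reused_root(rows):
    """Root is created once, so every iteration appends to the same element."""
    root = ET.Element("LedgerImport")
    counts = []
    for row in rows:
        ET.SubElement(root, "consolidate", attrib={"consolidatedAmount": str(row)})
        counts.append(len(root))  # number of <consolidate> children so far
    return counts

def child_counts_fresh_root(rows):
    """Root is created inside the loop, so each document starts empty."""
    counts = []
    for row in rows:
        root = ET.Element("LedgerImport")
        ET.SubElement(root, "consolidate", attrib={"consolidatedAmount": str(row)})
        counts.append(len(root))
    return counts

print(child_counts_reused_root([2, 3, 4]))
print(child_counts_fresh_root([2, 3, 4]))
```

With the reused root, each written file would also contain the rows from all earlier iterations; with a fresh root, every file holds exactly one row.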
After @mzjn pointed out your basic mistake, here is a thing I made for fun: you can create nested XML with a declarative mapping, instead of laboriously calling etree.SubElement yourself.
Here is how. Assume this as the basic situation:
from lxml import etree
import openpyxl

ns = {
    None: 'http://xml.datev.de/bedi/tps/ledger/v040',
    'xsi': 'http://www.w3.org/2001/XMLSchema-instance',
}

mapping = {
    '_tag': '{' + ns[None] + '}LedgerImport',
    'attrib': {
        'version': '4.0',
        '{' + ns['xsi'] + '}schemaLocation': 'http://xml.datev.de/bedi/tps/ledger/v040 Belegverwaltung_online_ledger_import_v040.xsd',
        'generator_info': 'DATEV Musterdaten',
        'generating_system': 'DATEV manuell',
    },
    'nsmap': ns,
    '_children': [{
        '_tag': 'consolidate',
        'attrib': {
            'consolidatedAmount': lambda: sheet.cell(i, 16).value,
            'consolidatedDate': lambda: sheet.cell(i, 2).value,
            'consolidatedInvoiceId': lambda: sheet.cell(i, 13).value,
            'consolidatedCurrencyCode': lambda: sheet.cell(i, 12).value,
        },
        '_children': [{
            '_tag': 'accountsPayableLedger',
            '_children': [
                {'_tag': 'bookingText', '_text': lambda: sheet.cell(i, 21).value},
                {'_tag': 'invoiceId', '_text': lambda: sheet.cell(i, 13).value},
                {'_tag': 'date', '_text': lambda: sheet.cell(i, 2).value},
                {'_tag': 'amount', '_text': lambda: sheet.cell(i, 16).value},
                {'_tag': 'accountNo', '_text': lambda: sheet.cell(i, 19).value},
                {'_tag': 'costCategoryId', '_text': lambda: sheet.cell(i, 15).value},
                {'_tag': 'currencyCode', '_text': lambda: sheet.cell(i, 12).value},
                {'_tag': 'partyId', '_text': lambda: sheet.cell(i, 20).value},
                {'_tag': 'bpAccountNo', '_text': lambda: sheet.cell(i, 20).value},
            ]
        }]
    }],
}
The nested dict resembles your final XML document. Its keys also resemble the parameters that etree.Element() and etree.SubElement() take, with the addition of _text and _children.
Now we can define a single recursive helper function that takes this input tree and transforms it into a nested XML tree of the same configuration. As a bonus we can execute the lambda functions, which allows us to dynamically calculate attribute values and text:
def build_tree(template, parent=None):
    # prepare a dict for calling etree.Element()/etree.SubElement()
    params = {k: v for k, v in template.items() if k not in ['_children', '_text']}
    # calculate any dynamic attribute values into a new dict, so the lambdas
    # in the template are not overwritten and get re-evaluated on the next call
    if 'attrib' in params:
        params['attrib'] = {
            name: str(value() if callable(value) else value)
            for name, value in params['attrib'].items()
        }
    if parent is None:
        node = etree.Element(**params)
    else:
        params['_parent'] = parent
        node = etree.SubElement(**params)
    # calculate (if necessary) and set the node text
    if '_text' in template:
        if callable(template['_text']):
            node.text = str(template['_text']())
        else:
            node.text = str(template['_text']) if template['_text'] else template['_text']
    # recurse into children, if any
    for child in template.get('_children', []):
        build_tree(child, node)
    return node
We can call this in a loop:
wb = openpyxl.load_workbook('import_spendesk_datev.xlsx')
sheet = wb['Import']
for i in range(2, 6):
    root = build_tree(mapping)
    doc = etree.ElementTree(root)
    name = "%s.xml" % sheet.cell(i, 13).value
    doc.write(name, xml_declaration=True, encoding='utf-8', pretty_print=True)
This should generate a couple of nicely nested XML documents, and it should be a lot easier to manage if your XML structure changes or gets more complicated.
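The reason one shared mapping can serve every row is that a lambda looks up its variables when it is called, not when it is defined. A minimal sketch of that behaviour, with a plain dict standing in for the spreadsheet (render_rows, current and the sample values are hypothetical names and data used only for illustration):

```python
def render_rows(rows):
    """Evaluate one shared template once per row, like the loop over i above."""
    current = {"i": None}  # hypothetical stand-in for the loop variable i
    template = {"_tag": "bookingText", "_text": lambda: rows[current["i"]]}
    rendered = []
    for i in sorted(rows):
        current["i"] = i
        # the lambda reads current["i"] only now, at call time
        rendered.append(str(template["_text"]()))
    return rendered

print(render_rows({2: "amazon", 3: "spendesk"}))
```

This is also why the template itself must never be overwritten with computed values: the callables have to stay in place to be evaluated again for the next row.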
Alternatively, consider XSLT, the special-purpose declarative language designed to transform XML files, which lxml supports. Specifically, pass parameters from Python to the stylesheet to transform a template XML (not unlike passing parameters to a prepared SQL statement):
XML template (includes all top-level namespaces)
<?xml version='1.0' encoding='UTF-8'?>
<LedgerImport xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns="http://xml.datev.de/bedi/tps/ledger/v040"
generating_system="DATEV manuell"
generator_info="DATEV Musterdaten" version="4.0"
xsi:schemaLocation="http://xml.datev.de/bedi/tps/ledger/v040 Belegverwaltung_online_ledger_import_v040.xsd">
<consolidate consolidatedAmount="???">
<accountsPayableLedger>
<bookingText>???</bookingText>
<invoiceId>???</invoiceId>
<date>???</date>
<amount>???</amount>
<accountNo>???</accountNo>
<costCategoryId>???</costCategoryId>
<currencyCode>???</currencyCode>
<partyId>???</partyId>
<bpAccountNo>???</bpAccountNo>
</accountsPayableLedger>
</consolidate>
</LedgerImport>
XSLT (save as .xsl file, a little longer due to default namespace in XML)
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:doc="http://xml.datev.de/bedi/tps/ledger/v040">
<xsl:output indent="yes"/>
<xsl:strip-space elements="*"/>
<!-- INITIALIZE PARAMETERS -->
<xsl:param name="prm_consolidate" />
<xsl:param name="prm_bookingText" />
<xsl:param name="prm_invoiceId" />
<xsl:param name="prm_date" />
<xsl:param name="prm_amount" />
<xsl:param name="prm_accountNo" />
<xsl:param name="prm_costCategoryId" />
<xsl:param name="prm_currencyCode" />
<xsl:param name="prm_partyId" />
<xsl:param name="prm_bpAccountNo" />
<!-- IDENTITY TRANSFORM -->
<xsl:template match="@*|node()">
<xsl:copy>
<xsl:apply-templates select="@*|node()"/>
</xsl:copy>
</xsl:template>
<!-- REWRITE consolidatedAmount ATTRIBUTE -->
<xsl:template match="doc:consolidate/@consolidatedAmount">
<xsl:attribute name="consolidatedAmount"><xsl:value-of select="$prm_consolidate"/></xsl:attribute>
</xsl:template>
<!-- REWRITE LEDGER CHILD ELEMENTS -->
<xsl:template match="doc:accountsPayableLedger">
<xsl:copy>
<xsl:element name="bookingText" namespace="http://xml.datev.de/bedi/tps/ledger/v040"><xsl:value-of select="$prm_bookingText"/></xsl:element>
<xsl:element name="invoiceId" namespace="http://xml.datev.de/bedi/tps/ledger/v040"><xsl:value-of select="$prm_invoiceId"/></xsl:element>
<xsl:element name="date" namespace="http://xml.datev.de/bedi/tps/ledger/v040"><xsl:value-of select="$prm_date"/></xsl:element>
<xsl:element name="amount" namespace="http://xml.datev.de/bedi/tps/ledger/v040"><xsl:value-of select="$prm_amount"/></xsl:element>
<xsl:element name="accountNo" namespace="http://xml.datev.de/bedi/tps/ledger/v040"><xsl:value-of select="$prm_accountNo"/></xsl:element>
<xsl:element name="costCategoryId" namespace="http://xml.datev.de/bedi/tps/ledger/v040"><xsl:value-of select="$prm_costCategoryId"/></xsl:element>
<xsl:element name="currencyCode" namespace="http://xml.datev.de/bedi/tps/ledger/v040"><xsl:value-of select="$prm_currencyCode"/></xsl:element>
<xsl:element name="partyId" namespace="http://xml.datev.de/bedi/tps/ledger/v040"><xsl:value-of select="$prm_partyId"/></xsl:element>
<xsl:element name="bpAccountNo" namespace="http://xml.datev.de/bedi/tps/ledger/v040"><xsl:value-of select="$prm_bpAccountNo"/></xsl:element>
</xsl:copy>
</xsl:template>
</xsl:stylesheet>
Python (no DOM element building)
import lxml.etree as et
import openpyxl

# LOAD XML AND XSL
xml = et.parse('/path/to/Template.xml')
xsl = et.parse('/path/to/XSLTScript.xsl')
# COMPILE THE STYLESHEET ONCE
transform = et.XSLT(xsl)

# OPEN EXCEL SPREADSHEET
wb = openpyxl.load_workbook('import_spendesk_datev.xlsx')
sheet = wb['Import']

# LOOP THROUGH ROWS
for i in range(2, 6):
    # strparam() expects a string, so convert each cell value first
    consolidate = et.XSLT.strparam(str(sheet.cell(row=i, column=16).value))
    account = et.XSLT.strparam(str(sheet.cell(row=i, column=21).value))
    invoice = et.XSLT.strparam(str(sheet.cell(row=i, column=13).value))
    date = et.XSLT.strparam(str(sheet.cell(row=i, column=2).value))
    amount = et.XSLT.strparam(str(sheet.cell(row=i, column=16).value))
    account_no = et.XSLT.strparam(str(sheet.cell(row=i, column=19).value))
    cost1 = et.XSLT.strparam(str(sheet.cell(row=i, column=15).value))
    currency_code = et.XSLT.strparam(str(sheet.cell(row=i, column=12).value))
    party_id = et.XSLT.strparam(str(sheet.cell(row=i, column=20).value))
    bpaccount = et.XSLT.strparam(str(sheet.cell(row=i, column=20).value))
    # PASS PARAMETERS TO XSLT (keyword names must match the xsl:param names)
    result = transform(xml, prm_consolidate=consolidate,
                       prm_bookingText=account,
                       prm_invoiceId=invoice,
                       prm_date=date,
                       prm_amount=amount,
                       prm_accountNo=account_no,
                       prm_costCategoryId=cost1,
                       prm_currencyCode=currency_code,
                       prm_partyId=party_id,
                       prm_bpAccountNo=bpaccount)
    # SAVE XML TO FILE
    with open('/path/to/Output_Row{}.xml'.format(i), 'wb') as f:
        f.write(result)
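If XSLT is more machinery than you want, the same "fill in a template" idea can be sketched without a stylesheet. This is a different technique from the XSLT answer above, not a rewrite of it: it assumes the standard-library xml.etree.ElementTree, a shortened copy of the template, and a hypothetical helper fill_template that takes the row values as a plain dict instead of reading them from openpyxl.

```python
import xml.etree.ElementTree as ET

NS = "http://xml.datev.de/bedi/tps/ledger/v040"

# Shortened template for illustration; the real one has more child elements.
TEMPLATE = """<LedgerImport xmlns="http://xml.datev.de/bedi/tps/ledger/v040" version="4.0">
  <consolidate consolidatedAmount="???">
    <accountsPayableLedger>
      <bookingText>???</bookingText>
      <invoiceId>???</invoiceId>
    </accountsPayableLedger>
  </consolidate>
</LedgerImport>"""

def fill_template(values, consolidated_amount):
    """Parse a fresh copy of the template and replace the ??? placeholders."""
    root = ET.fromstring(TEMPLATE)  # new tree per call, nothing is shared
    consolidate = root.find("{%s}consolidate" % NS)
    consolidate.set("consolidatedAmount", str(consolidated_amount))
    ledger = consolidate.find("{%s}accountsPayableLedger" % NS)
    for child in ledger:
        local_name = child.tag.split("}", 1)[1]  # strip the {namespace} prefix
        if local_name in values:
            child.text = str(values[local_name])
    return root

root = fill_template({"bookingText": "amazon", "invoiceId": 1}, 1337.01)
print(ET.tostring(root, encoding="unicode"))
```

Calling ET.register_namespace("", NS) before serializing should keep the default namespace unprefixed in the output; without it, ElementTree writes generated prefixes such as ns0.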

parsing repeating child elements python

I am trying to parse an XML document that contains repeating child elements using Python. When I attempt to parse the data, it creates an empty file. If I comment out the repeating child elements code (see bolded section in python script below), the document generates correctly. Can someone help?
XML:
<?xml version="1.0" encoding="ISO-8859-1" standalone="no"?>
<FRPerformance xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<FRPerformanceShareClassCurrency>
<FundCode>00190</FundCode>
<CurrencyID>USD</CurrencyID>
<FundShareClassCode>A</FundShareClassCode>
<ReportPeriodFrequency>Quarterly</ReportPeriodFrequency>
<ReportPeriodEndDate>06/30/2012</ReportPeriodEndDate>
<Net>
<Annualized>
<Year1>-4.909000000</Year1>
<Year3>10.140000000</Year3>
<Year5>-22.250000000</Year5>
<Year10>-7.570000000</Year10>
<Year15>-4.730000000</Year15>
<Year20>-0.900000000</Year20>
<SI>1.900000000</SI>
</Annualized>
</Net>
<Gross>
<Annualized>
<Month3>1.279000000</Month3>
<YTD>7.294000000</YTD>
<Year1>-0.167000000</Year1>
<Year3>11.940000000</Year3>
<Year5>-21.490000000</Year5>
<Year10>-7.120000000</Year10>
<Year15>-4.420000000</Year15>
<Year20>-0.660000000</Year20>
<SI>2.110000000</SI>
</Annualized>
<Cumulative>
<Month1Back>2.288000000</Month1Back>
<Month2Back>-1.587000000</Month2Back>
<Month3Back>0.610000000</Month3Back>
<CurrentYear>7.294000000</CurrentYear>
<Year1Back>-2.409000000</Year1Back>
<Year2Back>13.804000000</Year2Back>
<Year3Back>20.287000000</Year3Back>
<Year4Back>-78.528000000</Year4Back>
<Year5Back>-0.101000000</Year5Back>
<Year6Back>9.193000000</Year6Back>
<Year7Back>2.659000000</Year7Back>
<Year8Back>9.208000000</Year8Back>
<Year9Back>25.916000000</Year9Back>
<Year10Back>-3.612000000</Year10Back>
</Cumulative>
<HistoricReturns>
<HistoricReturns_Item>
<Date>Fri, 28 Feb 1997 00:00:00 -0600</Date>
<Return>32058.090000000</Return>
</HistoricReturns_Item>
<HistoricReturns_Item>
<Date>Fri, 28 Feb 2003 00:00:00 -0600</Date>
<Return>36415.110000000</Return>
</HistoricReturns_Item>
<HistoricReturns_Item>
<Date>Fri, 29 Feb 2008 00:00:00 -0600</Date>
<Return>49529.290000000</Return>
</HistoricReturns_Item>
<HistoricReturns_Item>
<Date>Fri, 30 Apr 1993 00:00:00 -0600</Date>
<Return>21621.500000000</Return>
</HistoricReturns_Item>
</HistoricReturns>
Python script
import sys
import xml.etree.ElementTree as ET

## Create command line arguments for XML file and tagName
xmlFile = sys.argv[1]
tagName = sys.argv[2]
tree = ET.parse(xmlFile)
root = tree.getroot()
## Setup the file for output
saveout = sys.stdout
output_file = open('parsedXML.csv', 'w')
sys.stdout = output_file
## Parse XML
for node in root.findall(tagName):
    fundCode = node.find('FundCode').text
    curr = node.find('CurrencyID').text
    shareClass = node.find('FundShareClassCode').text
    for node2 in node.findall('./Net/Annualized'):
        year1 = node2.findtext('Year1')
        year3 = node2.findtext('Year3')
        year5 = node2.findtext('Year5')
        year10 = node2.findtext('Year10')
        year15 = node2.findtext('Year15')
        year20 = node2.findtext('Year20')
        SI = node2.findtext('SI')
    for node3 in node.findall('./Gross'):
        for node4 in node3.findall('./Annualized'):
            month3 = node4.findtext('Month3')
            ytd = node4.findtext('YTD')
            year1g = node4.findtext('Year1')
            year3g = node4.findtext('Year3')
            year5g = node4.findtext('Year5')
            year10g = node4.findtext('Year10')
            year15g = node4.findtext('Year15')
            year20g = node4.findtext('Year20')
            SIg = node4.findtext('SI')
        for node5 in node3.findall('./Cumulative'):
            month1b = node5.findtext('Month1Back')
            month2b = node5.findtext('Month2Back')
            month3b = node5.findtext('Month3Back')
            curYear = node5.findtext('CurrentYear')
            year1b = node5.findtext('Year1Back')
            year2b = node5.findtext('Year2Back')
            year3b = node5.findtext('Year3Back')
            year4b = node5.findtext('Year4Back')
            year5b = node5.findtext('Year5Back')
            year6b = node5.findtext('Year6Back')
            year7b = node5.findtext('Year7Back')
            year8b = node5.findtext('Year8Back')
            year9b = node5.findtext('Year9Back')
            year10b = node5.findtext('Year10Back')
    **for node6 in node.findall('./HistoricReturns'):
        for node7 in node6.findall('./HistoricReturns_Item'):
            hDate = node7.findall('Date')
            hReturn = node7.findall('Return')**
            print(fundCode, curr, shareClass,year1, year3, year5, year10, year15, year15, year20, SI,month3, ytd, year1g, year3g, year5g, year10g, year15g, year20g, SIg, month1b, month2b, month3b, curYear, year1b, year2b, year3b, year4b, year5b, year6b, year7b, year8b,year9b,year10b, hDate, hReturn)
The sample XML and the python code don't match up in terms of structure. Either
you're missing a closing </Gross> tag from the XML (which should be before the <HistoricReturns> section starts) - in which case the code is correct or
the code should be for node6 in node3.findall('./HistoricReturns'): i.e. node3 instead of node
N.B. The XML sample isn't complete (it isn't well-formed XML) because it's missing closing tags for Gross, FRPerformanceShareClassCurrency and FRPerformance so this makes it impossible to answer the question definitively. Hope this helps though.
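To make the second possibility concrete, here is a small sketch. It assumes that HistoricReturns really is nested inside Gross (the sample is too incomplete to be sure), uses a cut-down version of the XML, and writes to an in-memory buffer through the csv module instead of redirecting sys.stdout. It also uses findtext, which returns the element text, whereas findall returns a list of elements.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Cut-down sample; assumes HistoricReturns is a child of Gross.
SAMPLE = """\
<FRPerformance>
  <FRPerformanceShareClassCurrency>
    <FundCode>00190</FundCode>
    <Gross>
      <HistoricReturns>
        <HistoricReturns_Item>
          <Date>Fri, 28 Feb 1997 00:00:00 -0600</Date>
          <Return>32058.090000000</Return>
        </HistoricReturns_Item>
        <HistoricReturns_Item>
          <Date>Fri, 28 Feb 2003 00:00:00 -0600</Date>
          <Return>36415.110000000</Return>
        </HistoricReturns_Item>
      </HistoricReturns>
    </Gross>
  </FRPerformanceShareClassCurrency>
</FRPerformance>
"""

root = ET.fromstring(SAMPLE)
node = root.find('FRPerformanceShareClassCurrency')

# Searching directly under the fund node finds nothing ...
direct = node.findall('./HistoricReturns')
# ... while searching below Gross finds the section.
nested = node.findall('./Gross/HistoricReturns')

# One CSV row per repeating HistoricReturns_Item
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
fund_code = node.findtext('FundCode')
for item in node.iter('HistoricReturns_Item'):
    writer.writerow([fund_code, item.findtext('Date'), item.findtext('Return')])
csv_text = buffer.getvalue()
print(len(direct), len(nested))
print(csv_text)
```

Because the dates contain commas, csv.writer quotes them, which a plain print with comma-separated values would not do.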
