Building XML from Excel data with Python

I am trying to build an XML file from an Excel spreadsheet using Python, but am having trouble building the structure. The XML schema is unique to a piece of software, so the opening few tags and the closing few are easiest to write to the XML file as plain variables, shown below. They are constant, so they are pulled from the "
I believe the script needs to loop through another sheet, the ".XML Framework" sheet, to build the XML structure, as these are the values that will ultimately change. The structure of this sheet is provided below.
Here is the XML structure. The Python outputs it correctly up to the unique values; the changing values are shown in bold. This shows just one row of the data from the workbook. When the workbook has a second row, the XML structure repeats again where it starts with .
The data structure in the excel sheet ".XML Framework" is:
col 1 = **equals**
col 2 = **74**
col 3 = **Data**"
col 4 = col 3
col 5 = **Name 07**
col 6 = col 5
col 7 = **wstring**
col 8 = /**SM15-HVAC-SUPP-TM-37250-ST**
THIS IS THE DESIRED XML STRUCTURE
<?xml version="1.0" encoding="UTF-8" ?>
<exchange xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="http://download.autodesk.com/us/navisworks/schemas/nw-exchange-12.0.xsd" units="m" filename="" filepath="">
<selectionsets>
<selectionset name="Dev_1">
<findspec mode="all" disjoint="0">
<conditions>
<condition test="**equals**" flags="**74**">
<category>
<name internal="**Data**">**Data**</name>
</category>
<property>
<name internal="**Name 07**">**Name 07**</name>
</property>
<value>
<data type="**wstring**">/**SM15-HVAC-SUPP-TM-37250-ST**</data>
</value>
</condition>
</conditions>
<locator>/</locator>
</findspec>
</selectionset>
</selectionsets>
</exchange>
Here is my attempt in Python:
path = (r"C:\\Users\\ciara\\desktop\\")
book = os.path.join(path + "Search_Set.xlsm")
wb = openpyxl.load_workbook(book)
sh = wb.get_sheet_by_name('.XML Framework')
df1 = pd.read_excel(book, "<CLEAN>", header=None)
#opening 5 lines of .xml search
print(df1)
cV1 = df1.iloc[0,0] #xml header
print (cV1)
cV2 = df1.iloc[1,0] #<exchange>
print (cV2)
cV3 = df1.iloc[2,0] #<selectionsets>
print (cV3)
cV4 = df1.iloc[3,0] #<selection set name>
print (cV4)
cV5 = df1.iloc[4,0] #<findspec mode>
print (cV5)
cV6 = df1.iloc[5,0] #<findspec mode>
print (cV6)
E = lxml.builder.ElementMaker()
root = ET.Element(cV1)
doc0 = ET.SubElement(root, cV2)
doc1 = ET.SubElement(doc0, cV3)
doc2 = ET.SubElement(doc1, cV4)
doc3 = ET.SubElement(doc2, cV5)
doc4 = ET.SubElement(doc3, cV6)
the_doc = root(
doc0(
doc1(
doc2(
doc3(
FIELD1('condition test=', name='blah'),
FIELD2('some value2', name='asdfasd'),
)
)
)
)
)
print (lxml.etree.tostring(the_doc, pretty_print=True))
tree = ET.ElementTree(root)
tree.write("filename.xml")

Related

Encoding Lithuanian characters in XML using Python

I have this code:
def convert_df_to_xml(df, fd, ld):
    # create the root element, named Invoices
    root = ET.Element("Invoices")
    root.set("from", str(fd))
    root.set("till", str(ld))
    for i in range(len(df['partner_id'])):
        # add a sub-element
        invoices = ET.SubElement(root, "Invoice")
        invoices.set('clientid', df['company_registry'][i])
        invoices.set('imones_pavadinimas', df['partner_id'][i])
        # add a sub-sub-element
        quebec = ET.SubElement(invoices, "Product")
        # load the row info from the dataframe
        sectin_1 = ET.SubElement(quebec, "Name")
        sectin_1.text = str(df["Name"][i])
        sectin_2 = ET.SubElement(quebec, 'Quantity')
        sectin_2.text = str(df["time_dif"][i])
        sectin_3 = ET.SubElement(quebec, 'Price')
        sectin_3.text = str(df["price_unit"][i])
    xmlstr = minidom.parseString(ET.tostring(root)).toprettyxml(indent=" ", encoding="UTF-8").decode("UTF-8")
    with open("bandomasis_itp_xml_failas_V_1.1.xml", "w") as f:
        f.write(xmlstr)
I'm creating an XML file from a Python DataFrame. The problem is that in the XML file I get "?" marks instead of the "ė" character.
In the dataframe I have strings with the characters "ė, ą, š, ų" and I need them to appear in the XML file.
My dataframe:
df1 = pd.DataFrame({'partner_id': ['MED GRUPĖ, UAB'], 'Name': ['Pirmas'], 'company_registry': ['3432543'],
                    'time_dif': ['2'], 'price_unit': ['23']})
What is the problem with the encoding here?
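A likely culprit (an assumption on my part; the thread itself gives no answer): the final open() call has no encoding argument, so Python falls back to the platform default codec (often cp1252 on Windows), which cannot represent "ė" and friends. A minimal variant, reusing the names from the snippet above, that keeps the whole pipeline in UTF-8:

# Sketch: write with an explicit UTF-8 encoding so Lithuanian characters
# survive the round trip; without it, open() uses the locale's default codec.
xmlstr = minidom.parseString(ET.tostring(root)).toprettyxml(indent=" ")
with open("bandomasis_itp_xml_failas_V_1.1.xml", "w", encoding="utf-8") as f:
    f.write(xmlstr)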

Create a dataframe of certain XML elements' text with Python and pandas

I am trying to create a dataframe out of the XML shown below
<Structure>
<Field>
<Field_Name>GAMEREF</Field_Name>
<Field_Type>Numeric</Field_Type>
<Field_Len>4</Field_Len>
<Field_Dec>0</Field_Dec>
</Field>
...
<Field>
<Field_Name>WINLOSS</Field_Name>
<Field_Type>Character</Field_Type>
<Field_Len>1</Field_Len>
<Field_Dec>0</Field_Dec>
</Field>
</Structure>
<Records>
<Record>
<GAMEREF>1217</GAMEREF>
<YEAR>2021</YEAR>
(MORE ELEMENTS I DO NOT CARE ABOUT)
<GAMENO>1</GAMENO>
<WINLOSS>W</WINLOSS>
</Record>
...
<Record>
<GAMEREF>1220</GAMEREF>
<YEAR>2021</YEAR>
(MORE ELEMENTS I DO NOT CARE ABOUT)
<GAMENO>4</GAMENO>
<WINLOSS>L</WINLOSS>
</Record>
</Records>
The Structure section of the XML is irrelevant to the dataframe I am trying to create.
I only want to use the GAMEREF, YEAR, GAMENO, and WINLOSS elements, as the Record elements contain more fields than these.
I have tried the code shown below, but when I run it I get the error "AttributeError: 'NoneType' object has no attribute 'text'".
Code is below.
import pandas as pd
import xml.etree.ElementTree as et

xtree = et.parse("gamedata.xml")
xroot = xtree.getroot()
df_cols = ["GAME REF", "YEAR", "GAME NO", "WIN LOSS"]
rows = []
for child in xroot.iter():
    s_gameref = child.find('GAMEREF').text,
    s_year = child.find('YEAR').text,
    s_game_no = child.find('GAMENO').text,
    s_winloss = child.find('WINLOSS').text
    rows.append({"GAME REF": s_gameref, "YEAR": s_year,
                 "GAME NO": s_game_no, "WIN LOSS": s_winloss})
df = pd.DataFrame(rows, columns=df_cols)
The code is based on other examples I have seen on Stack Overflow and other sites, but nothing has worked yet.
The ideal dataframe output is below:
GAME REF  YEAR  GAME NO  WIN LOSS
    1217  2021        1         W
    1218  2021        2         W
    1219  2021        3         L
    1220  2021        4         L
Thanks
I think the below is what you are looking for: just loop over the "interesting" sub-elements of each Record. The logic is in the line that starts with data = [...; both loops (over records and over fields) live in that one comprehension.
import pandas as pd
import xml.etree.ElementTree as ET
xml = '''<r><Structure>
<Field>
<Field_Name>GAMEREF</Field_Name>
<Field_Type>Numeric</Field_Type>
<Field_Len>4</Field_Len>
<Field_Dec>0</Field_Dec>
</Field>
...
<Field>
<Field_Name>WINLOSS</Field_Name>
<Field_Type>Character</Field_Type>
<Field_Len>1</Field_Len>
<Field_Dec>0</Field_Dec>
</Field>
</Structure>
<Records>
<Record>
<GAMEREF>1217</GAMEREF>
<YEAR>2021</YEAR>
<GAMENO>1</GAMENO>
<WINLOSS>W</WINLOSS>
</Record>
<Record>
<GAMEREF>1220</GAMEREF>
<YEAR>2021</YEAR>
<GAMENO>4</GAMENO>
<WINLOSS>L</WINLOSS>
</Record>
</Records></r>'''
fields = {'GAMEREF':'GAME REF', 'YEAR':'YEAR', 'GAMENO':'GAME NO','WINLOSS':'WIN LOSS'}
root = ET.fromstring(xml)
data = [{display_name: rec.find(element_name).text for element_name,display_name in fields.items()} for rec in root.findall('.//Record')]
df = pd.DataFrame(data)
print(df)
output
GAME REF YEAR GAME NO WIN LOSS
0 1217 2021 1 W
1 1220 2021 4 L
import pandas as pd
import xml.etree.ElementTree as et

xtree = et.parse("gamedata.xml")
xroot = xtree.getroot()
df_cols = ["GAME REF", "YEAR", "GAME NO", "WIN LOSS"]
rows = []
for record in xroot:
    s_gameref = record.find('GAMEREF').text
    s_year = record.find('YEAR').text
    s_game_no = record.find('GAMENO').text
    s_winloss = record.find('WINLOSS').text
    rows.append({"GAME REF": s_gameref, "YEAR": s_year,
                 "GAME NO": s_game_no, "WIN LOSS": s_winloss})
df = pd.DataFrame(rows, columns=df_cols)
Remove .iter(): root.iter() walks every element in the document (including the Structure/Field elements, which have no GAMEREF child), so find() returns None and .text raises the AttributeError. Iterating over the Record elements directly avoids that.
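As an extra guard (an illustration, not part of the original answers), findtext() with a default returns an empty string instead of raising when a child element is missing, so stray children can never trigger the AttributeError:

# Sketch: findtext() returns the default instead of raising when a child
# element is absent, so non-Record elements are skipped cleanly.
rows = []
for record in xroot.iter('Record'):
    rows.append({"GAME REF": record.findtext('GAMEREF', default=''),
                 "YEAR": record.findtext('YEAR', default=''),
                 "GAME NO": record.findtext('GAMENO', default=''),
                 "WIN LOSS": record.findtext('WINLOSS', default='')})
df = pd.DataFrame(rows, columns=df_cols)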

Optimizing XML parsing into CSV using Python

I have about 10,000 XML files with a similar structure that I wish to convert into a single CSV file.
Each XML file looks like this:
<?xml version='1.0' encoding='UTF-8'?>
<S:Envelope xmlns:S="http://schemas.xmlsoap.org/soap/envelope/">
  <S:Body>
    <ns7:GetStopMonitoringServiceResponse xmlns:ns3="http://www.siri.org.uk/siri" xmlns:ns4="http://www.ifopt.org.uk/acsb" xmlns:ns5="http://www.ifopt.org.uk/ifopt" xmlns:ns6="http://datex2.eu/schema/1_0/1_0" xmlns:ns7="http://new.webservice.namespace">
      <Answer>
        <ns3:ResponseTimestamp>2019-03-31T09:00:52.912+03:00</ns3:ResponseTimestamp>
        <ns3:ProducerRef>ISR Siri Server (141.10)</ns3:ProducerRef>
        <ns3:ResponseMessageIdentifier>276480603</ns3:ResponseMessageIdentifier>
        <ns3:RequestMessageRef>0100700:1351669188:4684</ns3:RequestMessageRef>
        <ns3:Status>true</ns3:Status>
        <ns3:StopMonitoringDelivery version="IL2.71">
          <ns3:ResponseTimestamp>2019-03-31T09:00:52.912+03:00</ns3:ResponseTimestamp>
          <ns3:Status>true</ns3:Status>
          <ns3:MonitoredStopVisit>
            <ns3:RecordedAtTime>2019-03-31T09:00:52.000+03:00</ns3:RecordedAtTime>
            <ns3:ItemIdentifier>-881202701</ns3:ItemIdentifier>
            <ns3:MonitoringRef>20902</ns3:MonitoringRef>
            <ns3:MonitoredVehicleJourney>
              <ns3:LineRef>23925</ns3:LineRef>
              <ns3:DirectionRef>2</ns3:DirectionRef>
              <ns3:FramedVehicleJourneyRef>
                <ns3:DataFrameRef>2019-03-31</ns3:DataFrameRef>
                <ns3:DatedVehicleJourneyRef>36962685</ns3:DatedVehicleJourneyRef>
              </ns3:FramedVehicleJourneyRef>
              <ns3:PublishedLineName>15</ns3:PublishedLineName>
              <ns3:OperatorRef>15</ns3:OperatorRef>
              <ns3:DestinationRef>26020</ns3:DestinationRef>
              <ns3:OriginAimedDepartureTime>2019-03-31T08:35:00.000+03:00</ns3:OriginAimedDepartureTime>
              <ns3:VehicleLocation>
                <ns3:Longitude>34.78000259399414</ns3:Longitude>
                <ns3:Latitude>32.042293548583984</ns3:Latitude>
              </ns3:VehicleLocation>
              <ns3:VehicleRef>37629301</ns3:VehicleRef>
              <ns3:MonitoredCall>
                <ns3:StopPointRef>20902</ns3:StopPointRef>
                <ns3:ExpectedArrivalTime>2019-03-31T09:03:00.000+03:00</ns3:ExpectedArrivalTime>
              </ns3:MonitoredCall>
            </ns3:MonitoredVehicleJourney>
          </ns3:MonitoredStopVisit>
        </ns3:StopMonitoringDelivery>
      </Answer>
    </ns7:GetStopMonitoringServiceResponse>
  </S:Body>
</S:Envelope>
The example above shows one MonitoredStopVisit nested tag, but every XML file has about 4,000 of them.
A full XML file as an example can be found here.
I want to convert all 10K files into one CSV where each record corresponds to a MonitoredStopVisit tag, so the CSV should look like this:
Currently this is my architecture (a skeleton of this fan-out is sketched after the list):
1. Split the 10K files into 8 chunks (one per PC core).
2. Each sub-process iterates through its XML files and objectifies the XML.
3. The object is then iterated, and for each element, conditions decide whether to include or exclude its data in an array.
4. When the tag is /ns3:MonitoredStopVisit, the array is appended to a pandas dataframe as a series.
5. When all sub-processes are done, the dataframes are merged and saved as CSV.
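A skeleton of that architecture (a sketch, not the asker's driver code: xml_to_df is the function defined just below, and the directory path and pool size are placeholders). multiprocessing.Pool handles the chunking itself, and the per-file frames are concatenated once at the end:

# Sketch: let a process pool do the chunking instead of splitting the
# file list by hand; concatenate the per-file frames once at the end.
import glob
from multiprocessing import Pool

import pandas as pd

def parse_one(path):
    with open(path, "rb") as fh:
        return xml_to_df(fh)   # xml_to_df as defined below

if __name__ == "__main__":
    files = glob.glob("xml_dir/*.xml")       # placeholder path
    with Pool(processes=8) as pool:
        frames = pool.map(parse_one, files)
    pd.concat(frames, ignore_index=True).to_csv("out.csv", index=False)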
This is the xml to df code:
import pandas as pd

def xml_to_df(xml_file):
    from lxml import objectify

    xml_content = xml_file.read()
    obj = objectify.fromstring(xml_content)
    df_cols = [
        'RecordedAtTime',
        'MonitoringRef',
        'LineRef',
        'DirectionRef',
        'PublishedLineName',
        'OperatorRef',
        'DestinationRef',
        'OriginAimedDepartureTime',
        'Longitude',
        'Latitude',
        'VehicleRef',
        'StopPointRef',
        'ExpectedArrivalTime',
        'AimedArrivalTime'
    ]
    tempdf = pd.DataFrame(columns=df_cols)
    arr_of_vals = [""] * 14
    for i in obj.getiterator():
        if "MonitoredStopVisit" in i.tag or "Status" in i.tag and "false" in str(i):
            if arr_of_vals[0] != "" and (arr_of_vals[8] and arr_of_vals[9]):
                s = pd.Series(arr_of_vals, index=df_cols)
                if tempdf[(tempdf == s).all(axis=1)].empty:
                    tempdf = tempdf.append(s, ignore_index=True)
            arr_of_vals = [""] * 14
        elif "RecordedAtTime" in i.tag:
            arr_of_vals[0] = str(i)
        elif "MonitoringRef" in i.tag:
            arr_of_vals[1] = str(i)
        elif "LineRef" in i.tag:
            arr_of_vals[2] = str(i)
        elif "DestinationRef" in i.tag:
            arr_of_vals[6] = str(i)
        elif "OriginAimedDepartureTime" in i.tag:
            arr_of_vals[7] = str(i)
        elif "Longitude" in i.tag:
            if str(i) == "345353":
                print("Lon: " + str(i))
            arr_of_vals[8] = str(i)
        elif "Latitude" in i.tag:
            arr_of_vals[9] = str(i)
        elif "VehicleRef" in i.tag:
            arr_of_vals[10] = str(i)
        elif "ExpectedArrivalTime" in i.tag:
            arr_of_vals[12] = str(i)
    # flush the last visit collected by the loop
    if arr_of_vals[0] != "" and (arr_of_vals[8] and arr_of_vals[9]):
        s = pd.Series(arr_of_vals, index=df_cols)
        if tempdf[(tempdf == s).all(axis=1)].empty:
            tempdf = tempdf.append(s, ignore_index=True)
    return tempdf
The problem is that for 10K files this takes about 10 hours with 8 sub-processes.
When checking CPU/memory usage, I can see they are not fully utilized.
Any idea how this can be improved? My next step is threading, but maybe there are other applicable ways.
Just as a note, the order of the records isn't important - I can sort them later.
Here is my solution with pandas; computation time for each 5 MB file is about 0.4 s:
import xml.etree.ElementTree as ET
import re
import pandas as pd
import os

def collect_data(xml_file):
    # create xml object
    root = ET.parse(xml_file).getroot()

    # collect raw data
    out_data = []
    for element in root.iter():
        # get tag name without its namespace prefix
        tag = re.sub('{.*?}', '', element.tag)
        # add a break marker at the start of each visit
        if tag == 'RecordedAtTime':
            out_data.append('break')
        if tag in tag_list:
            out_data.append((tag, element.text))

    # get break indexes; the appended end index keeps the final visit
    break_index = [i for i, x in enumerate(out_data) if x == 'break']
    break_index.append(len(out_data))

    # break the list into parts, one per MonitoredStopVisit
    list_data = []
    for i in range(len(break_index) - 1):
        list_data.append(out_data[break_index[i]:break_index[i + 1]])

    # build one output row per visit; tags missing from a visit become ''
    final_data = []
    for item in list_data:
        # delete the break marker and convert the list into a dictionary
        del item[item.index('break')]
        data_dictionary = dict(item)
        final_data.append(tuple(data_dictionary.get(tag, '') for tag in tag_list))
    return final_data

# setup tags list for checking
tag_list = ['RecordedAtTime', 'MonitoringRef', 'LineRef', 'DirectionRef', 'PublishedLineName', 'OperatorRef',
            'DestinationRef', 'OriginAimedDepartureTime', 'Longitude', 'Latitude', 'VehicleRef', 'StopPointRef',
            'ExpectedArrivalTime', 'AimedArrivalTime']

# collect data from each file
save_data = []
for file_name in os.listdir(os.getcwd()):
    if file_name.endswith('.xml'):
        save_data.append(collect_data(file_name))

# merge the list of lists
flat_list = []
for sublist in save_data:
    for item in sublist:
        flat_list.append(item)

# load data into a data frame and save it
data = pd.DataFrame(flat_list, columns=tag_list)
data.to_csv('data.csv', index=False)
So it seems the issue is the use of the pandas DataFrame and Series.
Using the code above, processing one XML file with ~4,000 records took 4-120 seconds, and the time increased as the program kept working.
Using Python lists or NumPy matrices (more convenient for writing out to CSV) decreased the running time significantly - each XML file now takes at most 0.1-0.5 seconds to process.
I used the following code to append the newly processed records each time:
records = np.append(records, new_records, axis=0)
This is equivalent to:
tempdf = tempdf.append(s, ignore_index=True)
but significantly faster.
Hope this helps anyone who might encounter similar issues!
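To make that change concrete, a minimal sketch (not the poster's exact code; parsed_rows and the trimmed df_cols are stand-ins for the values and columns from the question): accumulate plain tuples in a list and build the DataFrame once, so nothing is copied inside the hot loop.

# Sketch: O(1) amortized list appends inside the loop, one DataFrame
# construction at the end. DataFrame.append copies the whole frame on
# every call, which is why the original slowed down as it ran.
import pandas as pd

df_cols = ['RecordedAtTime', 'MonitoringRef']     # trimmed for the sketch
parsed_rows = [('2019-03-31T09:00:52', '20902')]  # stand-in parsed values

rows = []
for values in parsed_rows:
    rows.append(tuple(values))
df = pd.DataFrame(rows, columns=df_cols)
df.to_csv('data.csv', index=False)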
Actually, consider XSLT, the special-purpose language designed to transform XML files into other XML, or even into text files such as CSV. The only third-party library needed is Python's lxml, which can run XSLT 1.0 scripts, leaving out the heavier, extensive analytical tools such as pandas and NumPy.
In fact, because XSLT is a separate, industry-standard language, it is portable and can be run by any language with an XSLT library (e.g., Java, PHP, Perl, C#, VB) or by standalone 1.0, 2.0, or 3.0 processors (e.g., Xalan, Saxon), all of which Python can call as a command-line subprocess.
XSLT (save below as a .xsl file, a special .xml file)
<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:S="http://schemas.xmlsoap.org/soap/envelope/"
xmlns:ns3="http://www.siri.org.uk/siri"
xmlns:ns4="http://www.ifopt.org.uk/acsb"
xmlns:ns5="http://www.ifopt.org.uk/ifopt"
xmlns:ns6="http://datex2.eu/schema/1_0/1_0"
xmlns:ns7="http://new.webservice.namespace">
<xsl:output method="text" indent="yes" omit-xml-declaration="yes"/>
<xsl:strip-space elements="*"/>
<xsl:template match ="/S:Envelope/S:Body/ns7:GetStopMonitoringServiceResponse/Answer">
<xsl:apply-templates select="ns3:StopMonitoringDelivery"/>
</xsl:template>
<xsl:template match="ns3:StopMonitoringDelivery">
<!-- HEADERS -->
<!-- <xsl:text>RecordedAtTime,MonitoringRef,LineRef,DirectionRef,PublishedLineName,OperatorRef,DestinationRef,OriginAimedDepartureTime,Longitude,Latitude,VehicleRef,StopPointRef,ExpectedArrivalTime,AimedArrivalTime
</xsl:text> -->
<xsl:apply-templates select="ns3:MonitoredStopVisit"/>
<xsl:text>
</xsl:text>
</xsl:template>
<xsl:template match="ns3:MonitoredStopVisit">
<xsl:variable name="delim">,</xsl:variable>
<xsl:variable name="quote">"</xsl:variable>
<!-- DATA ROWS -->
<xsl:value-of select="concat($quote, ns3:RecordedAtTime, $quote, $delim,
$quote, ns3:MonitoringRef, $quote, $delim,
$quote, ns3:MonitoredVehicleJourney/ns3:LineRef, $quote, $delim,
$quote, ns3:MonitoredVehicleJourney/ns3:DirectionRef, $quote, $delim,
$quote, ns3:MonitoredVehicleJourney/ns3:PublishedLineName, $quote, $delim,
$quote, ns3:MonitoredVehicleJourney/ns3:OperatorRef, $quote, $delim,
$quote, ns3:MonitoredVehicleJourney/ns3:DestinationRef, $quote, $delim,
$quote, ns3:MonitoredVehicleJourney/ns3:OriginAimedDepartureTime, $quote, $delim,
$quote, ns3:MonitoredVehicleJourney/ns3:VehicleLocation/ns3:Longitude, $quote, $delim,
$quote, ns3:MonitoredVehicleJourney/ns3:VehicleLocation/ns3:Latitude, $quote, $delim,
$quote, ns3:MonitoredVehicleJourney/ns3:VehicleRef, $quote, $delim,
$quote, ns3:MonitoredVehicleJourney/ns3:MonitoredCall/ns3:StopPointRef, $quote, $delim,
$quote, ns3:MonitoredVehicleJourney/ns3:MonitoredCall/ns3:ExpectedArrivalTime, $quote, $delim,
$quote, ns3:MonitoredVehicleJourney/ns3:MonitoredCall/ns3:AimedArrivalTime, $quote, $delim
)"/>
</xsl:template>
</xsl:stylesheet>
Python (no appending lists, arrays, or dataframes)
import glob              # TO RETRIEVE ALL XML FILES
import lxml.etree as et  # TO PARSE XML AND RUN XSLT

xml_path = "/path/to/xml/files"

# PARSE XSLT AND BUILD THE TRANSFORMER ONCE, REUSED FOR EVERY FILE
xsl = et.parse('XSLTScript.xsl')
transform = et.XSLT(xsl)

# BUILD CSV
with open("MonitoredStopVisits.csv", 'w') as csv_file:
    # HEADER
    csv_file.write('RecordedAtTime,MonitoringRef,LineRef,DirectionRef,PublishedLineName,'
                   'OperatorRef,DestinationRef,OriginAimedDepartureTime,Longitude,Latitude,'
                   'VehicleRef,StopPointRef,ExpectedArrivalTime,AimedArrivalTime\n')
    # DATA ROWS (the loop variable must not shadow the open CSV handle)
    for xml_file in glob.glob(xml_path + "/**/*.xml", recursive=True):
        # LOAD XML AND TRANSFORM IT TO A STRING RESULT TREE
        xml = et.parse(xml_file)
        result = str(transform(xml))
        # WRITE TO CSV
        csv_file.write(result)

XML to CSV in Python: extract a series of subnodes for every node

My goal is to convert an .XML file into a .CSV file.
This part of the code is already functional.
However, I also want to extract the sub-sub-nodes of one of the "father" nodes.
Maybe an example would be more self-explanatory. Here is the structure of my XML:
<nedisCatalogue>
  <headerInfo>
    <feedVersion>1-0</feedVersion>
    <dateCreated>2018-01-22T23:37:01+0100</dateCreated>
    <supplier>Nedis_BENED</supplier>
    <locale>nl_BE</locale>
  </headerInfo>
  <productList>
    <product>
      <nedisPartnr><![CDATA[VS-150/63BA]]></nedisPartnr>
      <nedisArtlid>17005</nedisArtlid>
      <vendorPartnr><![CDATA[TONFREQ-ELKOS / BIPOL 150, 5390]]></vendorPartnr>
      <brand><![CDATA[Visaton]]></brand>
      <EAN>4007540053905</EAN>
      <intrastatCode>8532220000</intrastatCode>
      <UNSPSC>52161514</UNSPSC>
      <headerText><![CDATA[Crossover Foil capacitor]]></headerText>
      <internetText><![CDATA[Bipolaire elco met een ruwe folie en een zeer goede prijs/kwaliteits-verhouding voor de bouw van cross-overs. 63 Vdc, 10% tolerantie.]]></internetText>
      <generalText><![CDATA[Dimensions 16 x 35 mm
      ]]></generalText>
      <images>
        <image type="2" order="15">767736.JPG</image>
      </images>
      <attachments>
      </attachments>
      <categories>
        <tree name="Internet_Tree_ISHP">
          <entry depth="001" id="1067858"><![CDATA[Audio]]></entry>
          <entry depth="002" id="1067945"><![CDATA[Speakers]]></entry>
          <entry depth="003" id="1068470"><![CDATA[Accessoires]]></entry>
        </tree>
      </categories>
      <properties>
        <property id="360" multiplierID="" unitID="" valueID=""><![CDATA[...]]></property>
      </properties>
      <status>
        <code status="NORMAL"></code>
      </status>
      <packaging quantity="1" weight="8"></packaging>
      <introductionDate>2015-10-26</introductionDate>
      <serialnumberKeeping>N</serialnumberKeeping>
      <priceLevels>
        <normalPricing from="2017-02-13" to="2018-01-23">
          <price level="1" moq="1" currency="EUR">2.48</price>
        </normalPricing>
        <specialOfferPricing></specialOfferPricing>
        <goingPriceInclVAT currency="EUR" quantity="1">3.99</goingPriceInclVAT>
      </priceLevels>
      <tax>
      </tax>
      <stock>
        <inStockLocal>25</inStockLocal>
        <inStockCentral>25</inStockCentral>
        <ATP>
          <nextExpectedStockDateLocal></nextExpectedStockDateLocal>
          <nextExpectedStockDateCentral></nextExpectedStockDateCentral>
        </ATP>
      </stock>
    </product>
    ....
  </productList>
</nedisCatalogue>
And here is the code that I have now:
import xml.etree.ElementTree as ET
import csv

tree = ET.parse("/Users/BE07861/Documents/nedis_catalog_2018-01-23_nl_BE_32191_v1-0_xml")
root = tree.getroot()
f = open('/Users/BE07861/Documents/test2.csv', 'w')
csvwriter = csv.writer(f, delimiter='ç')
count = 0
head = ['Nedis Part Number', 'Nedis Article ID', 'Vendor Part Number', 'Brand', 'EAN', 'Header text',
        'Internet Text', 'General Text', 'categories']
prdlist = root[1]
prdct = prdlist[5]
cat = prdct[12]
tree1 = cat[0]
csvwriter.writerow(head)
for time in prdlist.findall('product'):
    row = []
    nedis_number = time.find('nedisPartnr').text
    row.append(nedis_number)
    nedis_art_id = time.find('nedisArtlid').text
    row.append(nedis_art_id)
    vendor_part_nbr = time.find('vendorPartnr').text
    row.append(vendor_part_nbr)
    Brand = time.find('brand').text
    row.append(Brand)
    ean = time.find('EAN').text
    row.append(ean)
    header_text = time.find('headerText').text
    row.append(header_text)
    internet_text = time.find('internetText').text
    row.append(internet_text)
    general_text = time.find('generalText').text
    row.append(general_text)
    categ = time.find('categories').find('tree').find('entry').text
    row.append(categ)
    csvwriter.writerow(row)
f.close()
If you run the code, you'll see that I only retrieve the first "entry" of categories/tree, which is normal. However, I don't know how to write a loop that, for every "categories" node, creates new columns such as Category1, Category2 and Category3 holding the "entry" values.
My result should look like this:
Nedis Part Number   Nedis Article ID   Vendor Part Number
VS-150/63BA         17005              TONFREQ-ELKOS / BIPOL 150, 5390
Brand     EAN           Header text                Internet Text
Visaton   4,00754E+12   Crossover Foil capacitor   Bipolaire elco …
General Text            Category1   Category2   Category3
Dimensions 16 x 35 mm   Audio       Speakers    Accessoires
I've really tried my best but didn't manage to find the solution.
Any help would be very much appreciated!!! :)
Thanks a lot,
Allan
I think this is what you're looking for:
for child in time.find('categories').find('tree'):
    categ = child.text
    row.append(categ)
Here's a solution that loops through the XML once to figure out how many header columns to add, adds the headers, and then loops through each product's category list.
Updated to iterate through images in addition to categories. This is the biggest difference:
for child in time.find('categories').find('tree'):
    categ = child.text
    row.append(categ)
    curcat += 1
while curcat < maxcat:
    row.append('')
    curcat += 1
It figures out the maximum number of categories on a single record and then adds that many columns. If a particular record has fewer categories, this code sticks blank values in as placeholders so the column headers always line up with the data.
For instance:
Cat1   Cat2   Cat3      Img1   Img2      Img3
A      B      C         1      2         3
D      E      <blank>   4      <blank>   <blank>
Here's the full solution:
import xml.etree.ElementTree as ET
import csv

tree = ET.parse("c:\\python\\xml.xml")
root = tree.getroot()
f = open('c:\\python\\xml.csv', 'w')
csvwriter = csv.writer(f, delimiter=',')
count = 0
head = ['Nedis Part Number', 'Nedis Article ID', 'Vendor Part Number', 'Brand', 'EAN', 'Header text',
        'Internet Text', 'General Text']
prdlist = root[1]

# first pass: find the maximum number of categories on any product
maxcat = 0
for time in prdlist.findall('product'):
    cur = 0
    for child in time.find('categories').find('tree'):
        cur += 1
    if cur > maxcat:
        maxcat = cur
for cnt in range(0, maxcat):
    head.append('Category ' + str(cnt + 1))

# second pass: same for images
maximg = 0
for time in prdlist.findall('product'):
    cur = 0
    for child in time.find('images'):
        cur += 1
    if cur > maximg:
        maximg = cur
for cnt in range(0, maximg):
    head.append('Image ' + str(cnt + 1))

csvwriter.writerow(head)

for time in prdlist.findall('product'):
    row = []
    nedis_number = time.find('nedisPartnr').text
    row.append(nedis_number)
    nedis_art_id = time.find('nedisArtlid').text
    row.append(nedis_art_id)
    vendor_part_nbr = time.find('vendorPartnr').text
    row.append(vendor_part_nbr)
    Brand = time.find('brand').text
    row.append(Brand)
    ean = time.find('EAN').text
    row.append(ean)
    header_text = time.find('headerText').text
    row.append(header_text)
    internet_text = time.find('internetText').text
    row.append(internet_text)
    general_text = time.find('generalText').text
    row.append(general_text)
    curcat = 0
    for child in time.find('categories').find('tree'):
        categ = child.text
        row.append(categ)
        curcat += 1
    while curcat < maxcat:
        row.append('')
        curcat += 1
    curimg = 0
    for img in time.find('images'):
        image = img.text
        row.append(image)
        curimg += 1
    while curimg < maximg:
        row.append('')
        curimg += 1
    csvwriter.writerow(row)
f.close()
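One small aside on the open() calls in both scripts (not from the original thread): the csv module's documentation recommends opening the output file with newline='' so csv.writer does not produce blank rows on Windows. A minimal variant:

# Sketch: csv.writer expects the file opened with newline=''
with open('c:\\python\\xml.csv', 'w', newline='') as f:
    csvwriter = csv.writer(f, delimiter=',')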

Parsing repeating child elements in Python

I am trying to parse an XML document that contains repeating child elements using Python. When I attempt to parse the data, it creates an empty file. If I comment out the code for the repeating child elements (the HistoricReturns section near the end of the Python script below), the document generates correctly. Can someone help?
XML:
<?xml version="1.0" encoding="ISO-8859-1" standalone="no"?>
<FRPerformance xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<FRPerformanceShareClassCurrency>
<FundCode>00190</FundCode>
<CurrencyID>USD</CurrencyID>
<FundShareClassCode>A</FundShareClassCode>
<ReportPeriodFrequency>Quarterly</ReportPeriodFrequency>
<ReportPeriodEndDate>06/30/2012</ReportPeriodEndDate>
<Net>
<Annualized>
<Year1>-4.909000000</Year1>
<Year3>10.140000000</Year3>
<Year5>-22.250000000</Year5>
<Year10>-7.570000000</Year10>
<Year15>-4.730000000</Year15>
<Year20>-0.900000000</Year20>
<SI>1.900000000</SI>
</Annualized>
</Net>
<Gross>
<Annualized>
<Month3>1.279000000</Month3>
<YTD>7.294000000</YTD>
<Year1>-0.167000000</Year1>
<Year3>11.940000000</Year3>
<Year5>-21.490000000</Year5>
<Year10>-7.120000000</Year10>
<Year15>-4.420000000</Year15>
<Year20>-0.660000000</Year20>
<SI>2.110000000</SI>
</Annualized>
<Cumulative>
<Month1Back>2.288000000</Month1Back>
<Month2Back>-1.587000000</Month2Back>
<Month3Back>0.610000000</Month3Back>
<CurrentYear>7.294000000</CurrentYear>
<Year1Back>-2.409000000</Year1Back>
<Year2Back>13.804000000</Year2Back>
<Year3Back>20.287000000</Year3Back>
<Year4Back>-78.528000000</Year4Back>
<Year5Back>-0.101000000</Year5Back>
<Year6Back>9.193000000</Year6Back>
<Year7Back>2.659000000</Year7Back>
<Year8Back>9.208000000</Year8Back>
<Year9Back>25.916000000</Year9Back>
<Year10Back>-3.612000000</Year10Back>
</Cumulative>
<HistoricReturns>
<HistoricReturns_Item>
<Date>Fri, 28 Feb 1997 00:00:00 -0600</Date>
<Return>32058.090000000</Return>
</HistoricReturns_Item>
<HistoricReturns_Item>
<Date>Fri, 28 Feb 2003 00:00:00 -0600</Date>
<Return>36415.110000000</Return>
</HistoricReturns_Item>
<HistoricReturns_Item>
<Date>Fri, 29 Feb 2008 00:00:00 -0600</Date>
<Return>49529.290000000</Return>
</HistoricReturns_Item>
<HistoricReturns_Item>
<Date>Fri, 30 Apr 1993 00:00:00 -0600</Date>
<Return>21621.500000000</Return>
</HistoricReturns_Item>
</<HistoricReturns>
Python script
## Create command line arguments for XML file and tagName
import sys
import xml.etree.ElementTree as ET

xmlFile = sys.argv[1]
tagName = sys.argv[2]
tree = ET.parse(xmlFile)
root = tree.getroot()

## Setup the file for output
saveout = sys.stdout
output_file = open('parsedXML.csv', 'w')
sys.stdout = output_file

## Parse XML
for node in root.findall(tagName):
    fundCode = node.find('FundCode').text
    curr = node.find('CurrencyID').text
    shareClass = node.find('FundShareClassCode').text
    for node2 in node.findall('./Net/Annualized'):
        year1 = node2.findtext('Year1')
        year3 = node2.findtext('Year3')
        year5 = node2.findtext('Year5')
        year10 = node2.findtext('Year10')
        year15 = node2.findtext('Year15')
        year20 = node2.findtext('Year20')
        SI = node2.findtext('SI')
    for node3 in node.findall('./Gross'):
        for node4 in node3.findall('./Annualized'):
            month3 = node4.findtext('Month3')
            ytd = node4.findtext('YTD')
            year1g = node4.findtext('Year1')
            year3g = node4.findtext('Year3')
            year5g = node4.findtext('Year5')
            year10g = node4.findtext('Year10')
            year15g = node4.findtext('Year15')
            year20g = node4.findtext('Year2')
            SIg = node4.findtext('SI')
        for node5 in node3.findall('./Cumulative'):
            month1b = node5.findtext('Month1Back')
            month2b = node5.findtext('Month2Back')
            month3b = node5.findtext('Month3Back')
            curYear = node5.findtext('CurrentYear')
            year1b = node5.findtext('Year1Back')
            year2b = node5.findtext('Year2Back')
            year3b = node5.findtext('Year3Back')
            year4b = node5.findtext('Year4Back')
            year5b = node5.findtext('Year5Back')
            year6b = node5.findtext('Year6Back')
            year7b = node5.findtext('Year7Back')
            year8b = node5.findtext('Year8Back')
            year9b = node5.findtext('Year9Back')
            year10b = node5.findtext('Year10Back')
    # repeating child elements - the section the question asks about
    for node6 in node.findall('./HistoricReturns'):
        for node7 in node6.findall('./HistoricReturns_Item'):
            hDate = node7.findall('Date')
            hReturn = node7.findall('Return')
    print(fundCode, curr, shareClass, year1, year3, year5, year10, year15, year15, year20, SI,
          month3, ytd, year1g, year3g, year5g, year10g, year15g, year20g, SIg,
          month1b, month2b, month3b, curYear, year1b, year2b, year3b, year4b, year5b,
          year6b, year7b, year8b, year9b, year10b, hDate, hReturn)
The sample XML and the Python code don't match up in terms of structure. Either:
- you're missing a closing </Gross> tag in the XML (it should appear before the <HistoricReturns> section starts), in which case the code is correct, or
- the code should be for node6 in node3.findall('./HistoricReturns'), i.e. node3 instead of node.
N.B. The XML sample isn't complete (it isn't well-formed XML) because it's missing closing tags for Gross, FRPerformanceShareClassCurrency and FRPerformance, so this makes it impossible to answer the question definitively. Hope this helps though.
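For illustration, a sketch of the answer's second option (not code from the thread; swapping findall for findtext, so the variables hold text rather than element lists, is an extra assumed improvement):

# Sketch: search from node3 (<Gross>) rather than node, matching the XML
# sample where <HistoricReturns> sits inside <Gross>.
for node6 in node3.findall('./HistoricReturns'):
    for node7 in node6.findall('./HistoricReturns_Item'):
        hDate = node7.findtext('Date')
        hReturn = node7.findtext('Return')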
