Convert XML to CSV using Python

I am learning my way around Python and need a little help. I have an XML response from a SOAP API that I am failing to convert to CSV. Getting the data with the requests library was easy; my struggle is the conversion to CSV, where I end up with headers but no values.
My XML data:
<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:xsd="http://www.w3.org/2001/XMLSchema">
<soap:Body>
<Level2 xmlns="https://xxxxxxxxxx/xxxxxxx">
<Level3>
<ResponseStatus>Success</ResponseStatus>
<ErrorMessage/>
<Message>20 alert(s) generated for this period</Message>
<ProcessingTimeSecs>0.88217689999999993</ProcessingTimeSecs>
<Something1>1</Something1>
<Something2/>
<Something3/>
<Something4/>
<VIP>
<MainVIP>
<Date>20210616</Date>
<RegisteredDate>20210216</RegisteredDate>
<Type>YMBA</Type>
<TypeDescription>TYPE OF ENQUIRY</TypeDescription>
<BusinessName>COMPANY NAME</BusinessName>
<ITNumber>987654321</ITNumber>
<RegistrationNumber>123456789</RegistrationNumber>
<SubscriberNumber>55889977</SubscriberNumber>
<SubscriberReference/>
<TicketNumber>1122336655</TicketNumber>
<SubscriberName>COMPANY NAME 2 </SubscriberName>
<CompletedDate>20210615</CompletedDate>
</MainVIP>
</VIP>
<Something5/>
<Something6/>
<Something7/>
<Something8/>
<Something9/>
<PrincipalSomething10/>
<PrincipalSomething11/>
<PrincipalSomething12/>
<PrincipalSomething13/>
<Something14/>
<Something15/>
<Something16/>
<Something17/>
<Something18/>
<PrincipalSomething19/>
<PrincipalSomething20/>
</Level3>
</Level2>
</soap:Body>
</soap:Envelope>
My python code looks like this :
import xml.etree.ElementTree as ET
import pandas as pd
cols = ['Date', 'RegisteredDate', 'Type',
        'TypeDescription']
rows = []
# parse xml file
xmlparse = ET.parse('xmldata.xml')
root = xmlparse.getroot()
for i in root:
    Date = i.get('Date').text
    RegisteredDate = i.get('RegisteredDate').text
    Type = i.get('Type').text
    TypeDescription = i.get('TypeDescription').text
    rows.append({'Date': Date,
                 'RegisteredDate': RegisteredDate,
                 'Type': Type,
                 'TypeDescription': TypeDescription})
df = pd.DataFrame(rows, columns=cols)
print(df)
df.to_csv('csvdata.csv')
In my approach, I was following the idea from here https://www.geeksforgeeks.org/convert-xml-to-csv-in-python/

You probably don't need to go through ElementTree; you can feed the XML directly to pandas (read_xml is available from pandas 1.3). If I understand you correctly, this should do it:
df = pd.read_xml(path_to_file, xpath="//*[local-name()='MainVIP']")
df = df.iloc[:, :4]
df
Output from your xml above:
Date RegisteredDate Type TypeDescription
0 20210616 20210216 YMBA TYPE OF ENQUIRY

Without any external library, the code below generates a CSV file.
The idea is to collect the required elements from MainVIP and store them in a list of dicts, then loop over the list and write the data to a file.
import xml.etree.ElementTree as ET
xml = ''' <soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:xsd="http://www.w3.org/2001/XMLSchema">
<soap:Body>
<Level2 xmlns="https://xxxxxxxxxx/xxxxxxx">
<Level3>
<ResponseStatus>Success</ResponseStatus>
<ErrorMessage/>
<Message>20 alert(s) generated for this period</Message>
<ProcessingTimeSecs>0.88217689999999993</ProcessingTimeSecs>
<Something1>1</Something1>
<Something2/>
<Something3/>
<Something4/>
<VIP>
<MainVIP>
<Date>20210616</Date>
<RegisteredDate>20210216</RegisteredDate>
<Type>YMBA</Type>
<TypeDescription>TYPE OF ENQUIRY</TypeDescription>
<BusinessName>COMPANY NAME</BusinessName>
<ITNumber>987654321</ITNumber>
<RegistrationNumber>123456789</RegistrationNumber>
<SubscriberNumber>55889977</SubscriberNumber>
<SubscriberReference/>
<TicketNumber>1122336655</TicketNumber>
<SubscriberName>COMPANY NAME 2 </SubscriberName>
<CompletedDate>20210615</CompletedDate>
</MainVIP>
</VIP>
<Something5/>
<Something6/>
<Something7/>
<Something8/>
<Something9/>
<PrincipalSomething10/>
<PrincipalSomething11/>
<PrincipalSomething12/>
<PrincipalSomething13/>
<Something14/>
<Something15/>
<Something16/>
<Something17/>
<Something18/>
<PrincipalSomething19/>
<PrincipalSomething20/>
</Level3>
</Level2>
</soap:Body>
</soap:Envelope>'''
cols = ['Date', 'RegisteredDate', 'Type',
        'TypeDescription']
rows = []
NS = '{https://xxxxxxxxxx/xxxxxxx}'
root = ET.fromstring(xml)
for vip in root.findall(f'.//{NS}MainVIP'):
    rows.append({c: vip.find(NS+c).text for c in cols})
with open('out.csv','w') as f:
    f.write(','.join(cols) + '\n')
    for row in rows:
        f.write(','.join(row[c] for c in cols) + '\n')
out.csv
Date,RegisteredDate,Type,TypeDescription
20210616,20210216,YMBA,TYPE OF ENQUIRY
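A possible variation on the answer above, in case a value ever contains a comma or an element is empty: let the standard csv module do the quoting instead of joining strings by hand. This is only a sketch. SAMPLE is a trimmed copy of the question's XML, with a comma added to TypeDescription purely to show the quoting, and the helper names main_vip_rows and to_csv_text are made up for the example.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Trimmed stand-in for the SOAP response in the question.
# The comma in TypeDescription was added only to demonstrate quoting.
SAMPLE = """<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
  <soap:Body>
    <Level2 xmlns="https://xxxxxxxxxx/xxxxxxx">
      <Level3>
        <VIP>
          <MainVIP>
            <Date>20210616</Date>
            <Type>YMBA</Type>
            <TypeDescription>TYPE, OF ENQUIRY</TypeDescription>
            <SubscriberReference/>
          </MainVIP>
        </VIP>
      </Level3>
    </Level2>
  </soap:Body>
</soap:Envelope>"""

# Prefix map for the default namespace declared on Level2.
NS = {"d": "https://xxxxxxxxxx/xxxxxxx"}


def main_vip_rows(xml_text, cols):
    """Yield one dict per MainVIP element; missing or empty fields become ''."""
    root = ET.fromstring(xml_text)
    for vip in root.iterfind(".//d:MainVIP", NS):
        yield {c: vip.findtext(f"d:{c}", default="", namespaces=NS) for c in cols}


def to_csv_text(xml_text, cols):
    """Return the CSV as a string; csv.DictWriter handles quoting."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=cols, lineterminator="\n")
    writer.writeheader()
    writer.writerows(main_vip_rows(xml_text, cols))
    return buf.getvalue()


print(to_csv_text(SAMPLE, ["Date", "Type", "TypeDescription", "SubscriberReference"]))
```

To write a real file instead of a string, the same DictWriter can be pointed at open('out.csv', 'w', newline='').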

Related

How to iterate over an XML file and sum a specific field

I want to iterate over an xml file and get the sum of the field "PremieTot" (marked in the xml below)
<?xml version="1.0" encoding="iso-8859-1" ?>
<Bericht Version="1.0" xmlns="http://www.test.nl/test/2022/01">
<Bericht>
<IdBer>1111</IdBer>
<IdLcr>2323</IdLcr>
<NmLcr>Test Company</NmLcr>
</Bericht>
<AdministratieveEenheid>
<LhNr>3434</LhNr>
<NmIP>Test Company</NmIP>
<TvkCd>MND</TvkCd>
<TijdvakAangifte>
<DatAanvTv>2022-01-01</DatAanvTv>
<DatEindTv>2022-01-31</DatEindTv>
<VolledigeAangifte>
<CollectieveAangifte>
<TotaalRegelingen>
<RelNrAansl>3434</RelNrAansl>
</TotaalRegelingen>
<TotaalRegelingen>
<RelNrAansl>3434</RelNrAansl>
</TotaalRegelingen>
</CollectieveAangifte>
<InkomstenverhoudingInitieel>
<NumIV>1</NumIV>
<DatAanv>2020-01-01</DatAanv>
<PersNr>2364</PersNr>
<RegelingGegevens>
<PremieTot>0.52</PremieTot> //I want to sum this field
</RegelingGegevens>
</InkomstenverhoudingInitieel>
<InkomstenverhoudingInitieel>
<NumIV>1</NumIV>
<DatAanv>2020-07-01</DatAanv>
<PersNr>2365</PersNr>
<RegelingGegevens>
<PremieTot>0.66</PremieTot> //I want to sum this field
<AantVerlUPens>29.12</AantVerlUPens>
</RegelingGegevens>
</InkomstenverhoudingInitieel>
</VolledigeAangifte>
</TijdvakAangifte>
</AdministratieveEenheid>
</Bericht>
I am trying to parse the XML file into a dict with xmldict, but for some reason I can't get the value of "PremieTot":
info_dict = xml_dict["PensioenAangifte"]["AdministratieveEenheid"]["TijdvakAangifte"]["VolledigeAangifte"]
premieTotal = [xml_data["RegelingGegevens]["PremieTot"] for xml_data in info_dict]
Quite easy with ElementTree:
from xml.etree import ElementTree as ET
et = ET.fromstring(xml)
result = sum(
    float(el.text)
    for el in et.findall('.//{*}PremieTot')
)
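Two small caveats on the snippet above: the {*} namespace wildcard needs Python 3.8 or newer, and summing floats can leave rounding noise in amounts. A sketch of an alternative, assuming the namespace URI from the question's sample, that uses an explicit prefix map and Decimal. SAMPLE is a cut-down copy of the question's file and sum_premie_tot is a name invented for the example.

```python
from decimal import Decimal
import xml.etree.ElementTree as ET

# Cut-down copy of the question's XML, keeping only the relevant elements.
SAMPLE = """<Bericht Version="1.0" xmlns="http://www.test.nl/test/2022/01">
  <AdministratieveEenheid>
    <InkomstenverhoudingInitieel>
      <RegelingGegevens><PremieTot>0.52</PremieTot></RegelingGegevens>
    </InkomstenverhoudingInitieel>
    <InkomstenverhoudingInitieel>
      <RegelingGegevens><PremieTot>0.66</PremieTot></RegelingGegevens>
    </InkomstenverhoudingInitieel>
  </AdministratieveEenheid>
</Bericht>"""

# Map an arbitrary prefix to the document's default namespace.
NS = {"b": "http://www.test.nl/test/2022/01"}


def sum_premie_tot(xml_text):
    """Add up every PremieTot value at any depth, using exact decimals."""
    root = ET.fromstring(xml_text)
    amounts = (Decimal(el.text.strip()) for el in root.iterfind(".//b:PremieTot", NS))
    return sum(amounts, Decimal("0"))


print(sum_premie_tot(SAMPLE))
```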

Python: lxml is not reading element text all the time

I want to load an XML file with the structure below into a pandas DataFrame.
The size of the XML file can be between 1 GB and 6 GB.
The sample below has just 5 records, but my actual file will have around 100,000 records, as indicated by the RECORDS attribute (RECORDS="108881").
Every element in this file has a value.
No element is empty anywhere in the file.
<?xml version="1.0" encoding="UTF-8"?>
<ACADEMICS>
<STUDENTS ASOF_DATE="11/21/2019" CREATE_DATE="11/22/2019" RECORDS="108881">
<STUDENT>
<NAME>JOHN</NAME>
<REGNUM>1000</REGNUM>
<COUNTRY>USA</COUNTRY>
<ID>JH1</ID>
<SHORT_STD_DESC>JOHN IS A GOOD STUDENT</SHORT_STD_DESC>
</STUDENT>
<STUDENT>
<NAME>ADAM</NAME>
<REGNUM>1001</REGNUM>
<COUNTRY>FRANCE</COUNTRY>
<ID>AD2</ID>
<SHORT_STD_DESC>ADAM IS A GOOD STUDENT</SHORT_STD_DESC>
</STUDENT>
<STUDENT>
<NAME>PETER</NAME>
<REGNUM>1003</REGNUM>
<COUNTRY>BELGIUM</COUNTRY>
<ID>PE5</ID>
<SHORT_STD_DESC>PETER IS A GOOD STUDENT</SHORT_STD_DESC>
</STUDENT>
<STUDENT>
<NAME>ERIC</NAME>
<REGNUM>1006</REGNUM>
<COUNTRY>AUSTRALIA</COUNTRY>
<ID>ER7</ID>
<SHORT_STD_DESC>ERIC IS A GOOD STUDENT</SHORT_STD_DESC>
</STUDENT>
<STUDENT>
<NAME>NICHOLAS</NAME>
<REGNUM>1009</REGNUM>
<COUNTRY>GREECE</COUNTRY>
<ID>NI8</ID>
<SHORT_STD_DESC>NICHOLAS IS A GOOD STUDENT</SHORT_STD_DESC>
</STUDENT>
</STUDENTS>
I am trying to read these XML files with lxml using the functions below.
As you can see, I am only interested in reading specific tags from the XML file: ["ACADEMICS","STUDENDS","ID","SHORT_STD_DESC"]
def recursive_dict(self,element):
    return element.tag, \
        dict(map(self.recursive_dict, element)) or element.text
def ConvertFilePivot(self, inputfile):
    context = etree.iterparse(inputfile, events=('start','end'), tag=["ACADEMICS","STUDENDS","ID","SHORT_STD_DESC"])
    lstValues = []
    asOfDate = ""
    for event, elem in context:
        if elem.tag == "ACADEMICS" :
            asOfDate = elem[0].attrib['ASOF_DATE']
        else:
            for event, elem in context:
                doc = self.recursive_dict(elem)
                lstValues.append(doc)
    dfvalues = pd.DataFrame(lstValues,columns=["ColName","ColValue"])
    columns = dfvalues['ColName'].unique()
    data = {}
    for column in columns:
        data[column] = list(dfvalues[dfvalues['ColName'] == column]['ColValue'])
    dfdata = pd.DataFrame(data)
    return dfdata
The problem is that when I load this XML into a DataFrame as shown in the function above, for some records I get 'None' as the text of the ID and SHORT_STD_DESC elements.
The actual XML file does have those values,
so I am not sure why they are not reflected in my DataFrame.
Any input would be a great help.
This may be more a comment than an answer, but I can't fit it in an actual comment...
Try changing
        else:
            for event, elem in context:
                doc = self.recursive_dict(elem)
to just:
        else:
            doc = self.recursive_dict(elem)
and see if it works.
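One plausible cause of the None values, beyond the nested loop: with events=('start','end'), an element's text is not guaranteed to have been read yet when its 'start' event fires; only the attributes are. A sketch of reading text on 'end' events only, using the standard library's xml.etree.ElementTree.iterparse rather than lxml (so without lxml's tag= filter). SAMPLE is a shortened copy of the question's file, and read_students is a hypothetical helper name.

```python
import io
import xml.etree.ElementTree as ET

# Shortened copy of the question's sample, two records only.
SAMPLE = b"""<ACADEMICS>
  <STUDENTS ASOF_DATE="11/21/2019" CREATE_DATE="11/22/2019" RECORDS="2">
    <STUDENT><NAME>JOHN</NAME><ID>JH1</ID><SHORT_STD_DESC>JOHN IS A GOOD STUDENT</SHORT_STD_DESC></STUDENT>
    <STUDENT><NAME>ADAM</NAME><ID>AD2</ID><SHORT_STD_DESC>ADAM IS A GOOD STUDENT</SHORT_STD_DESC></STUDENT>
  </STUDENTS>
</ACADEMICS>"""


def read_students(source, wanted=("ID", "SHORT_STD_DESC")):
    """Return (ASOF_DATE, rows), reading element text only on 'end' events."""
    as_of_date = None
    rows = []
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            # Attributes are already available here; text and children are not
            # guaranteed to be, so nothing else is read on 'start'.
            if elem.tag == "STUDENTS":
                as_of_date = elem.get("ASOF_DATE")
        elif elem.tag == "STUDENT":
            # On 'end' the whole STUDENT subtree has been parsed.
            rows.append({tag: elem.findtext(tag) for tag in wanted})
            elem.clear()  # drop the children to limit memory use on large files
    return as_of_date, rows


as_of, rows = read_students(io.BytesIO(SAMPLE))
print(as_of, rows)
```

The resulting list of dicts can be passed straight to pd.DataFrame(rows). For multi-gigabyte input the emptied STUDENT elements still hang off their parent, so further cleanup may be needed.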

Converting XML to JSON: I want a dynamic solution using a Python dictionary

I am asking for help with converting XML to JSON. I am using a dictionary.
My code looks like this:
import json
import xmltodict
with open('xmlskuska.xml') as fd:
    doc = xmltodict.parse(fd.read())
doc['Invoice']['ID'] = doc['Invoice'].pop('cbc:ID')
doc['Invoice']['IssueDate'] = doc['Invoice'].pop('cbc:IssueDate')
doc['Invoice']['OrderReference'] = doc['Invoice'].pop('cac:OrderReference')
doc['Invoice']['OrderReference']['ID'] = doc['Invoice']['OrderReference'].pop('cbc:ID')
doc['Invoice']['InvoiceLine'] = doc['Invoice'].pop('cac:InvoiceLine')
doc['Invoice']['InvoiceLine']['Price'] = doc['Invoice']['InvoiceLine'].pop('cac:Price')
doc['Invoice']['InvoiceLine']['Price']['PriceAmount'] = doc['Invoice']['InvoiceLine']['Price'].pop('cbc:PriceAmount')
doc['Invoice']['InvoiceLine']['Price']['BaseQuantity'] = doc['Invoice']['InvoiceLine']['Price'].pop('cbc:BaseQuantity')
app_json = json.dumps(doc)
print(app_json)
with open('skuska.json', 'w') as json_file:
    json.dump(doc, json_file)
This is my XML file, xmlskuska.xml:
<?xml version="1.0" encoding="UTF-8"?>
<Invoice xmlns="urn:oasis:names:specification:ubl:schema:xsd:Invoice-2"
xmlns:cac="urn:oasis:names:specification:ubl:schema:xsd:CommonAggregateComponents-2"
xmlns:cbc="urn:oasis:names:specification:ubl:schema:xsd:CommonBasicComponents-2">
<cbc:ID>TOSL108</cbc:ID>
<cbc:IssueDate>2009-12-15</cbc:IssueDate>
<cac:OrderReference>
<cbc:ID>123</cbc:ID>
</cac:OrderReference>
<cac:InvoiceLine>
<cac:Price>
<cbc:PriceAmount currencyID="EUR">0.75</cbc:PriceAmount>
<cbc:BaseQuantity unitCode="C62">1</cbc:BaseQuantity>
</cac:Price>
</cac:InvoiceLine>
</Invoice>
The problem is that the output has a structure I don't want; I would like a clean JSON structure.
I am getting this:
{"Invoice": {"#xmlns": "urn:oasis:names:specification:ubl:schema:xsd:Invoice-2", "#xmlns:cac": "urn:oasis:names:specification:ubl:schema:xsd:CommonAggregateComponents-2", "#xmlns:cbc": "urn:oasis:names:specification:ubl:schema:xsd:CommonBasicComponents-2", "ID": "TOSL108", "IssueDate": "2009-12-15", "OrderReference": {"ID": "123"}, "InvoiceLine": {"Price": {"PriceAmount": {"#currencyID": "EUR", "#text": "0.75"}, "BaseQuantity": {"#unitCode": "C62", "#text": "1"}}}}}
The second thing I want to ask is how to make this conversion dynamic. I don't want to write a line that renames each dictionary key by hand, for example "doc['Invoice']['ID'] = doc['Invoice'].pop('cbc:ID')".
I want to convert many more XML files, not just the one I shared.
Thanks in advance for all your help!
See below (no external library is used)
import xml.etree.ElementTree as ET
import pprint
xml = '''<?xml version="1.0" encoding="UTF-8"?>
<Invoice xmlns="urn:oasis:names:specification:ubl:schema:xsd:Invoice-2"
xmlns:cac="urn:oasis:names:specification:ubl:schema:xsd:CommonAggregateComponents-2"
xmlns:cbc="urn:oasis:names:specification:ubl:schema:xsd:CommonBasicComponents-2">
<cbc:ID>TOSL108</cbc:ID>
<cbc:IssueDate>2009-12-15</cbc:IssueDate>
<cac:OrderReference>
<cbc:ID>123</cbc:ID>
</cac:OrderReference>
<cac:InvoiceLine>
<cac:Price>
<cbc:PriceAmount currencyID="EUR">0.75</cbc:PriceAmount>
<cbc:BaseQuantity unitCode="C62">1</cbc:BaseQuantity>
</cac:Price>
</cac:InvoiceLine>
</Invoice>'''
root = ET.fromstring(xml)
data = {}
def xml2dict(root, data):
    for e in root:
        # strip the '{namespace}' prefix, if any (rfind returns -1 when absent)
        _tag = e.tag[e.tag.rfind('}') + 1:]
        # e.text is None for elements without text, so guard before strip()
        if (e.text or '').strip():
            data[_tag] = e.text
        else:
            data[_tag] = {}
            xml2dict(e, data[_tag])
xml2dict(root, data)
pprint.pprint(data)
output:
{'ID': 'TOSL108',
'InvoiceLine': {'Price': {'BaseQuantity': '1', 'PriceAmount': '0.75'}},
'IssueDate': '2009-12-15',
'OrderReference': {'ID': '123'}}
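The approach above keeps only the last element when a tag repeats under the same parent (for example several InvoiceLine entries), and it stops at a dict rather than JSON. A sketch of one way to extend it, under the same assumption that attributes can be dropped. SAMPLE is a shortened invoice with a second InvoiceLine added for illustration, and local_name and element_to_data are names invented for the example.

```python
import json
import xml.etree.ElementTree as ET

# Shortened invoice; the second InvoiceLine is added only for illustration.
SAMPLE = """<Invoice xmlns="urn:oasis:names:specification:ubl:schema:xsd:Invoice-2"
    xmlns:cac="urn:oasis:names:specification:ubl:schema:xsd:CommonAggregateComponents-2"
    xmlns:cbc="urn:oasis:names:specification:ubl:schema:xsd:CommonBasicComponents-2">
  <cbc:ID>TOSL108</cbc:ID>
  <cac:InvoiceLine><cac:Price><cbc:PriceAmount currencyID="EUR">0.75</cbc:PriceAmount></cac:Price></cac:InvoiceLine>
  <cac:InvoiceLine><cac:Price><cbc:PriceAmount currencyID="EUR">1.50</cbc:PriceAmount></cac:Price></cac:InvoiceLine>
</Invoice>"""


def local_name(tag):
    """Drop the '{namespace}' part of an ElementTree tag, if present."""
    return tag.rsplit("}", 1)[-1]


def element_to_data(elem):
    """Leaf elements become their text; others become a dict of children.

    Attributes are ignored. A tag that repeats under one parent is
    collected into a list instead of being overwritten.
    """
    children = list(elem)
    if not children:
        return (elem.text or "").strip()
    data = {}
    for child in children:
        key = local_name(child.tag)
        value = element_to_data(child)
        if key in data:
            if not isinstance(data[key], list):
                data[key] = [data[key]]
            data[key].append(value)
        else:
            data[key] = value
    return data


root = ET.fromstring(SAMPLE)
doc = {local_name(root.tag): element_to_data(root)}
print(json.dumps(doc, indent=2))
```

Because nothing in it refers to Invoice-specific tag names, the same function can be applied to other XML files unchanged.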

gpxpy: Get extension value from gpx file

I’m trying to get the ID from a waypoint in my gpx-file. The ID is placed in the extension tag of my file. I’m using gpxpy to get other values like the latitude and longitude from the file, but I didn’t find a way to get the ID.
Here you can see my code:
import gpxpy
node_id = []
gpx_file = open("test.gpx", mode='rt', encoding='utf-8')
gpx = gpxpy.parse(gpx_file)
for waypoint in gpx.waypoints:
    node_id.append(waypoint.extensions.id)
And a part of my test.gpx-file:
<wpt lat="53.865650" lon="10.684415">
<extensions>
<ogr:id>17</ogr:id>
<ogr:longitude>10.684415</ogr:longitude>
<ogr:latitude>53.865650</ogr:latitude>
</extensions>
</wpt>
Is there a way to get the id of the waypoint with gpxpy?
waypoint.extensions is just a list, so you can't get an item by name; you have to iterate through it. The "name" of each extension is stored in the "tag" property of the element, and the value in the "text" property. As I don't have your XML schema to test the ogr:id extension, I tried with the following GPX file:
<?xml version="1.0" encoding="UTF-8" ?>
<gpx xmlns="http://www.topografix.com/GPX/1/1" version="1.1" creator="OSMTracker for Android™ - https://github.com/labexp/osmtracker-android"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.topografix.com/GPX/1/1 http://www.topografix.com/GPX/1/1/gpx.xsd ">
<wpt lat="10.31345465" lon="10.21237815">
<extensions>
<id>17</id>
</extensions>
<ele>110.0</ele>
<time>2018-09-29T09:31:58Z</time>
<name><![CDATA[train station]]></name>
<sat>0</sat>
</wpt>
</gpx>
I wrote a short function to get the ID. It has not been tested against edge cases (for example, when the extensions don't exist).
import gpxpy
def getId(waypoint):
    for extension in waypoint.extensions:
        if extension.tag == 'id':
            return extension.text
node_id = []
gpx_file = open("test2.gpx", mode='rt', encoding='utf-8')
gpx = gpxpy.parse(gpx_file)
for waypoint in gpx.waypoints:
    print(getId(waypoint))
The function takes a GPX waypoint as its argument and loops through the extensions list. If the list contains an element with the tag (name) "id", it returns that element's text (value).
Best regards
Thimo
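For the namespaced ogr:id from the question, the comparison extension.tag == 'id' may not match, assuming gpxpy hands back ElementTree-style elements whose tag carries a '{namespace}' prefix. A sketch of matching on the local name instead. To stay runnable without gpxpy or a GPX file, it builds stand-in elements by hand; the namespace URI is a placeholder, not the real ogr one, and extension_value is a made-up helper name.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Drop the '{namespace}' part of an ElementTree tag, if present."""
    return tag.rsplit("}", 1)[-1]


def extension_value(extensions, name):
    """Return the text of the first extension whose local tag name matches.

    `extensions` is any iterable of ElementTree-style elements, such as
    waypoint.extensions. Returns None when nothing matches.
    """
    for ext in extensions or []:
        if local_name(ext.tag) == name:
            return ext.text
    return None


# Stand-ins for waypoint.extensions; the namespace URI is a placeholder.
ext_id = ET.Element("{http://example.com/ogr}id")
ext_id.text = "17"
ext_lon = ET.Element("{http://example.com/ogr}longitude")
ext_lon.text = "10.684415"

print(extension_value([ext_lon, ext_id], "id"))
```

With gpxpy this would be called as extension_value(waypoint.extensions, "id") inside the loop over gpx.waypoints.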

xml to csv conversion in python

Here I need to parse the XML and get the values. I need to get attribute values such as the person id "01", which I couldn't get with this code. I also need to fetch the grandchild node values, here the "siblings" element and its name tags, but I can't hard-code "siblings" to fetch them. On top of that, I need to handle multiple attributes and join them to form a unique key, which will become a column in the final table.
import xml.dom
import xml.dom.minidom
doc = xml.dom.minidom.parseString('''
<root>
<person id="01">
<name> abc</name>
<age>32</age>
<address>addr123</address>
<siblings>
<name></name>
</siblings>
</person>
<person id="02">
<name> def</name>
<age>22</age>
<address>addr456</address>
<siblings>
<name></name>
<name></name>
</siblings>
</person>
</root>
''')
innerlist=[]
outerlist=[]
def innerHtml(root):
    text = ''
    nodes = [ root ]
    while not nodes==[]:
        node = nodes.pop()
        if node.nodeType==xml.dom.Node.TEXT_NODE:
            text += node.wholeText
        else:
            nodes.extend(node.childNodes)
    return text
for statusNode in doc.getElementsByTagName('person'):
    for childNode in statusNode.childNodes:
        if childNode.nodeType==xml.dom.Node.ELEMENT_NODE:
            if innerHtml(childNode).strip() != '':
                innerlist.append(childNode.nodeName+" "+innerHtml(childNode).strip())
    outerlist.append(innerlist)
    innerlist=[]
#print(outerlist)
attrlist = []
nodes = doc.getElementsByTagName('person')
for node in nodes:
    if 'id' in node.attributes:
        #print(node.attributes['id'].value)
        attrlist.append(node.attributes['id'].value)
#print(attrlist)
dictionary = dict(zip(attrlist, outerlist))
print(dictionary)
Comment: I have stored it in a dictionary. {'01': ['name abc', 'age 32', 'address addr123'], '02': ['name def', 'age 22', 'address addr456']}.
You can't write such a dict to CSV!
ValueError: dict contains fields not in fieldnames: '01'
Do you really want to convert to CSV?
Read about CSV File Reading and Writing.
Comment: Here I need to get the siblings tag as another inner list as well.
CSV doesn't support such an inner list.
Edit your question and show the expected CSV output!
Question: xml to csv conversion
Solution with xml.etree.ElementTree.
Note: I don't understand how you want to handle the grandchild node values,
so they are written as a list of dicts in one column.
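import csv
import xml.etree.ElementTree as ET
# doc must be the XML as a string here, not the minidom Document from the question
root = ET.fromstring(doc)
fieldnames = None
# newline='' lets the csv module control line endings
with open('doc.csv', 'w', newline='') as fh:
    for p in root.findall('person'):
        person = {'_id':p.attrib['id']}
        for element in p:
            if len(element) >= 1:
                # element has children: collect them as a list of dicts
                person[element.tag] = []
                for sub_e in element:
                    person[element.tag].append({sub_e.tag:sub_e.text})
            else:
                person[element.tag] = element.text
        if not fieldnames:
            # the first person decides the columns and triggers the header
            fieldnames = sorted(person)
            w = csv.DictWriter(fh, fieldnames=fieldnames)
            w.writeheader()
        w.writerow(person)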
Output:
_id,address,age,name,siblings
01,addr123,32, abc,[{'name': 'sib1'}]
02,addr456,, def,"[{'name': 'sib2'}, {'name': 'sib3'}]"
Tested with Python: 3.4.2
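The question also asks for a unique key built from several attributes, which the answer above does not cover. A sketch of one way to do that, with the grandchild values flattened into a single delimited column instead of a list of dicts. The second attribute dept, the sibling names, the "_" and ";" separators, and the helper names person_rows and to_csv_text are all assumptions made for the example.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Sample modelled on the question; 'dept' and the sibling names are invented.
SAMPLE = """<root>
  <person id="01" dept="A">
    <name>abc</name><age>32</age>
    <siblings><name>sib1</name></siblings>
  </person>
  <person id="02" dept="B">
    <name>def</name><age>22</age>
    <siblings><name>sib2</name><name>sib3</name></siblings>
  </person>
</root>"""


def person_rows(xml_text, key_attrs=("id", "dept")):
    """Yield one flat dict per person, with a key joined from its attributes."""
    root = ET.fromstring(xml_text)
    for p in root.findall("person"):
        row = {"key": "_".join(p.get(a, "") for a in key_attrs)}
        for child in p:
            if len(child):
                # Any element with children: join the grandchild texts with ';'.
                row[child.tag] = ";".join((g.text or "").strip() for g in child)
            else:
                row[child.tag] = (child.text or "").strip()
        yield row


def to_csv_text(xml_text):
    """Return the CSV as a string, using the union of all row keys as columns."""
    rows = list(person_rows(xml_text))
    fieldnames = []
    for row in rows:
        for k in row:
            if k not in fieldnames:
                fieldnames.append(k)
    buf = io.StringIO()
    w = csv.DictWriter(buf, fieldnames=fieldnames, restval="", lineterminator="\n")
    w.writeheader()
    w.writerows(rows)
    return buf.getvalue()


print(to_csv_text(SAMPLE))
```

Nothing here names "siblings" explicitly: any child element that itself has children is flattened the same way.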
