Parsing XML document into a pandas DataFrame - python

I have an XML file that looks like this:
<?xml version="1.0" encoding="utf-8"?>
<comments>
<row Id="1" PostId="2" Score="0" Text="(...)" CreationDate="2011-08-30T21:15:28.063" UserId="16" />
<row Id="2" PostId="17" Score="1" Text="(...)" CreationDate="2011-08-30T21:24:56.573" UserId="27" />
<row Id="3" PostId="26" Score="0" Text="(...)" UserId="9" />
</comments>
What I'm trying to do is to extract ID, Text and CreationDate colums into pandas DF and I've tried following:
import xml.etree.cElementTree as et
import pandas as pd
path = '/.../...'
dfcols = ['ID', 'Text', 'CreationDate']
df_xml = pd.DataFrame(columns=dfcols)
root = et.parse(path)
rows = root.findall('.//row')
for row in rows:
ID = row.find('Id')
text = row.find('Text')
date = row.find('CreationDate')
print(ID, text, date)
df_xml = df_xml.append(pd.Series([ID, text, date], index=dfcols), ignore_index=True)
print(df_xml)
But the output is:
None None None
How do I fix this?

As advised in this solution by gold member Python/pandas/numpy guru, #unutbu:
Never call DataFrame.append or pd.concat inside a for-loop. It leads to quadratic copying.
Therefore, consider parsing your XML data into a separate list then pass list into the DataFrame constructor in one call outside of any loop. In fact, you can pass nested lists with list comprehension directly into the constructor:
path = 'AttributesXMLPandas.xml'
dfcols = ['ID', 'Text', 'CreationDate']
root = et.parse(path)
rows = root.findall('.//row')
# NESTED LIST
xml_data = [[row.get('Id'), row.get('Text'), row.get('CreationDate')]
for row in rows]
df_xml = pd.DataFrame(xml_data, columns=dfcols)
print(df_xml)
# ID Text CreationDate
# 0 1 (...) 2011-08-30T21:15:28.063
# 1 2 (...) 2011-08-30T21:24:56.573
# 2 3 (...) None

Just a minor change in your code
ID = row.get('Id')
text = row.get('Text')
date = row.get('CreationDate')

Based on #Parfait solution, I wrote my version that gets the columns as a parameter and returns the Pandas DataFrame.
test.xml:
<?xml version="1.0" encoding="utf-8"?>
<comments>
<row Id="1" PostId="2" Score="0" Text="(.1.)" CreationDate="2011-08-30T21:15:28.063" UserId="16" />
<row Id="2" PostId="17" Score="1" Text="(.2.)" CreationDate="2011-08-30T21:24:56.573" UserId="27" />
<row Id="3" PostId="26" Score="0" Text="(.3.)" UserId="9" />
</comments>
xml_to_pandas.py:
'''Xml to Pandas DataFrame Convertor.'''
import xml.etree.cElementTree as et
import pandas as pd
def xml_to_pandas(root, columns, row_name):
'''get xml.etree root, the columns and return Pandas DataFrame'''
df = None
try:
rows = root.findall('.//{}'.format(row_name))
xml_data = [[row.get(c) for c in columns] for row in rows] # NESTED LIST
df = pd.DataFrame(xml_data, columns=columns)
except Exception as e:
print('[xml_to_pandas] Exception: {}.'.format(e))
return df
path = 'test.xml'
row_name = 'row'
columns = ['ID', 'Text', 'CreationDate']
root = et.parse(path)
df = xml_to_pandas(root, columns, row_name)
print(df)
output:

Since pandas 1.3.0, there's a built-in pandas function pd.read_xml that reads XML documents into a pandas DataFrame.
path = """<?xml version="1.0" encoding="utf-8"?>
<comments>
<row Id="1" PostId="2" Score="0" Text="(...)" CreationDate="2011-08-30T21:15:28.063" UserId="16" />
<row Id="2" PostId="17" Score="1" Text="(...)" CreationDate="2011-08-30T21:24:56.573" UserId="27" />
<row Id="3" PostId="26" Score="0" Text="(...)" UserId="9" />
</comments>"""
# or a path to an XML doc
path = 'test.xml'
pd.read_xml(path)
The XML doc in the OP becomes the following by simply calling read_xml:

Related

Create a DataFrame from a XML File

im new to XML and i want to know how to create a dataframe in python from this XML file.
<EXTENDEDPROPERTIES>
<DEBTCONFIGURATION>
<row Key="guid" Value="2018438038"/>
<row Key="status" Value="0"/>
<row Key="forma_pago" Value="DEBITO A CUENTA"/>
<row Key="monto" Value="23699.1"/>
<row Key="monto_abono" Value="360.55"/>
<row Key="entidad" Value="BANCO CAPRICHOSO S.A."/>
<row Key="tipo" Value="PREST. AUTO"/>
<row Key="balance" Value="19617.5"/>
<row Key="KIND_ID" Value="PRINCIPAL"/>
<row Key="TYPE_ID" Value="CEDULA_IDENTIDAD"/>
<row Key="CUSTOMER_ID" Value="777-555-888"/>
<row Key="MEMBER_TYPE" Value="DEUDOR"/>
</DEBTCONFIGURATION>
I have the following code, it creates the DataFrame but when i tried to append the value of the row, i dont know why it keeps coming "None".
I dont know if i have to change de calling argument i.e Attrib.get.
Also i tried changing the attrib.get to find("value").text but it give me the error that it dosnt have the a text attribute.
import pandas as pd
import xml.etree.ElementTree as ET
xtree = ET.parse("davi_apc.xml")
xroot = xtree.getroot()
df_cols = ["guid", "status", "forma_pago", "monto", "monto_abono", "entidad", "tipo", "balance","KIND_ID", "TYPE_ID", "CUSTOMER_ID", "MEMBER_TYPE"]
rows = []
for node in xroot:
s_guid = node.attrib.get("guid")
s_status = node.attrib.get("status")
s_formapago = node.attrib.get("forma_pago")
s_monto = node.attrib.get("monto")
s_monto_abono = node.attrib.get("monto_abono")
s_entidad = node.attrib.get("entidad")
s_tipo = node.attrib.get("tipo")
s_balance = node.attrib.get("balance")
s_kind_id = node.attrib.get("KIND_ID")
s_type_id = node.attrib.get("TYPE_ID")
s_customer_id = node.attrib.get("CUSTOMER_ID")
s_mebder_type = node.attrib.get("MEMBER_TYPE")
rows.append({
"guid" : s_guid,
"status" : s_status,
"forma_pago" : s_formapago,
"monto" : s_monto,
"monto_abono" : s_monto_abono,
"entidad" : s_entidad,
"tipo" : s_tipo,
"balance" : s_balance,
"KIND_ID" : s_kind_id,
"TYPE_ID" : s_type_id,
"CUSTOMER_ID" : s_customer_id,
"MEMBER_TYPE" : s_mebder_type
})
out_df = pd.DataFrame(rows, columns = df_cols)
this is the printout of print(rows)
[{'guid': None, 'status': None, 'forma_pago': None, 'monto': None, 'monto_abono': None, 'entidad': None, 'tipo': None, 'balance': None, 'KIND_ID': None, 'TYPE_ID': None, 'CUSTOMER_ID': None, 'MEMBER_TYPE': None}]
and this is the printout of the dataframe
guid status forma_pago monto monto_abono entidad tipo balance KIND_ID
0 None None None None None None None None None
TYPE_ID CUSTOMER_ID MEMBER_TYPE
0 None None None
Here is a working solution:
1/ remove top line from xml file, I am unsure if the first tag is xml compliant ?
<DEBTCONFIGURATION>
<row Key="guid" Value="2018438038"/>
<row Key="status" Value="0"/>
<row Key="forma_pago" Value="DEBITO A CUENTA"/>
<row Key="monto" Value="23699.1"/>
<row Key="monto_abono" Value="360.55"/>
<row Key="entidad" Value="BANCO CAPRICHOSO S.A."/>
<row Key="tipo" Value="PREST. AUTO"/>
<row Key="balance" Value="19617.5"/>
<row Key="KIND_ID" Value="PRINCIPAL"/>
<row Key="TYPE_ID" Value="CEDULA_IDENTIDAD"/>
<row Key="CUSTOMER_ID" Value="777-555-888"/>
<row Key="MEMBER_TYPE" Value="DEUDOR"/>
</DEBTCONFIGURATION>
2/ code:
import pandas as pd
import xml.etree.ElementTree as ET
xtree = ET.parse("davi_apc.xml")
xroot = xtree.getroot()
rows = [{}]
for node in xroot:
print(node.attrib)
rows[0].update({node.attrib['Key']:node.attrib['Value']})
out_df = pd.DataFrame(rows)
3/ output for out_df:
out_df.head(10)
guid status ... CUSTOMER_ID MEMBER_TYPE
0 2018438038 0 ... 777-555-888 DEUDOR

How to remove new lines in dataframe while writing xml into it?

I'm trying to write xml response from an api, iterating over tag TrackingEvent and saving into a dataframe:
The xml response looks like this :
<?xml version="1.0" encoding="UTF-8"?>
<TrackingResult>
<Events>
<TrackingEvent>
<DateTimeStamp>202010</DateTimeStamp>
<Event>Delivered</Event>
<ExtraInfo>02921</ExtraInfo>
</TrackingEvent>
<TrackingEvent>
<DateTimeStamp>202010</DateTimeStamp>
<Event>Delivery today</Event>
<ExtraInfo>31916</ExtraInfo>
</TrackingEvent>
</Events>
<Signatures />
<Errors />
</TrackingResult>
I am writing this into a dataframe using this code :
response_status = requests.get(url, headers = headers)
print(type(response_status)) #<class 'requests.models.Response'>
print(type(response_status.content)) #<class 'bytes'>
tree = ET.fromstring(response_status.content)
for child in tree.iter('TrackingEvent'):
for elem in child.iter():
data = {str(elem.tag):[elem.text]}
if str(elem.text)=='None' :continue
response_df = pd.DataFrame(data)
consolidated_list.append(response_df)
consolidated_df = pd.concat(consolidated_list,ignore_index=True)
print(consolidated_df)
This is the output dataframe i'm getting :
DateTimeStamp Event ExtraInfo
202010
Delivered
02921
202010
Delivery today
31916
I want to remove these empty spaces to put one child iteration into a single dataframe row
expected output:
DateTimeStamp Event ExtraInfo
202010 Delivered 02921
202010 Delivery today 31916
I would recommend to build a dict in a loop and then create dataframe on basis of that dict. Here is an example:
xml = '''<?xml version="1.0" ?>
<TrackingResult>
<Events>
<TrackingEvent>
<DateTimeStamp>202010</DateTimeStamp>
<Event>Delivered</Event>
<ExtraInfo>02921</ExtraInfo>
</TrackingEvent>
<TrackingEvent>
<DateTimeStamp>202010</DateTimeStamp>
<Event>Delivery today</Event>
<ExtraInfo>31916</ExtraInfo>
</TrackingEvent>
</Events>
<Signatures />
<Errors />
</TrackingResult>'''
from lxml import etree as ET
import pandas as pd
from collections import defaultdict
d = defaultdict(list)
tree = ET.fromstring(xml)
for child in tree.iter('TrackingEvent'):
for elem in child.iter():
if (elem.text is not None and str(elem.text).strip() != ''):
d[elem.tag].append(elem.text.strip())
else:
if len(list(elem)) == 0:
d[elem.tag].append(None)
df = pd.DataFrame(d)
print(df)
Output:
DateTimeStamp Event ExtraInfo
0 202010 Delivered 02921
1 202010 Delivery today 31916

XML to table form in Excel

There is this option when opening an xml file using Excel. You get prompted with the option as seen in the picture Here
It basically open that xml file in a table work and based on the analysis that I have done. It seems to do a pretty good job.
This is how it looks after I opened an xml file using excel as a tabel form Here
My Question: I want to convert an Xml into a table from like that feature in Excel does it. Is that possible?
The reason I want this result, is that working with tables inside excel is really easy using libraries like pandas. However, I don’t want to go an open every xml file with excel, show the table and then save it again. It is not very time efficient
This is my XML file
<?xml version="1.0" encoding="utf-8"?>
<ProjectData>
<FINAL>
<START id="ID0001" service_code="0x5196">
<Docs Docs_type="START">
<Rational>225196</Rational>
<Qualify>6251960000A0DE</Qualify>
</Docs>
<Description num="1213f2312">The parameter</Description>
<SetFile dg="" dg_id="">
<SetData value="32" />
</SetFile>
</START>
<START id="DG0003" service_code="0x517B">
<Docs Docs_type="START">
<Rational>23423</Rational>
<Qualify>342342</Qualify>
</Docs>
<Description num="3423423f3423">The third</Description>
<SetFile dg="" dg_id="">
<FileX dg="" axis_pts="2" name="" num="" dg_id="" />
<FileY unit="" axis_pts="20" name="TOOLS" text_id="23423" unit_id="" />
<SetData x="E1" value="21259" />
<SetData x="E2" value="0" />
</SetFile>
</START>
<START id="ID0048" service_code="0x5198">
<RawData rawdata_type="OPDATA">
<Request>225198</Request>
<Response>343243324234234</Response>
</RawData>
<Meaning text_id="434234234">The forth</Meaning>
<ValueDataset unit="m" unit_id="FEDS">
<FileX dg="kg" discrete="false" axis_pts="19" name="weight" text_id="SDF3" unit_id="SDGFDS" />
<SetData xin="sdf" xax="233" value="323" />
<SetData xin="123" xax="213" value="232" />
<SetData xin="2321" xax="232" value="23" />
</ValueDataset>
</START>
</FINAL>
</ProjectData>
So let's say I have the following input.xml file:
<main>
<item name="item1" image="a"></item>
<item name="item2" image="b"></item>
<item name="item3" image="c"></item>
<item name="item4" image="d"></item>
</main>
You can use the following code:
import xml.etree.ElementTree as ET
import pandas as pd
tree = ET.parse('input.xml')
tags = [ e.attrib for e in tree.getroot() ]
df = pd.DataFrame(tags)
# df:
# name image
# 0 item1 a
# 1 item2 b
# 2 item3 c
# 3 item4 d
And this should be independent of the number of attributes in a given file.
To write to a simple CSV file from pandas, you can use the to_csv command. See documentation. If it is necessary to be an excel sheet, you can use to_excel, see here.
# Write to csv without the row names
df.to_csv('file_name.csv', index = False)
# Write to xlsx sheet without the row names
df.to_excel('file_name.xlsx', index=False)
UPDATE:
For your XML file, and based on your clarification in the comments, I suggest the following, where all elements in the first level in the tree will be rows, and every attribute or node text will be column:
def has_children(e):
''' Check if element, e, has children'''
return(len(list(e)) > 0)
def has_attrib(e):
''' Check if element, e, has attributes'''
return(len(e.attrib)>0)
def get_uniqe_key(mydict, key):
''' Generate unique key if already exists in mydict'''
if key in mydict:
while key in mydict:
key = key + '*'
return(key)
tree = ET.parse('input2.xml')
root = tree.getroot()
# Get first level:
lvl_one = list(root)
myList = [];
for e in lvl_one:
mydict = {}
# Iterate over each node in level one element
for node in e.iter():
if (not has_children(node)) & (node.text != None):
uniqe_key = get_uniqe_key(mydict, node.tag)
mydict[uniqe_key] = node.text
if has_attrib(node):
for key in node.attrib:
uniqe_key = get_uniqe_key(mydict, key)
mydict[uniqe_key] = node.attrib[key]
myList.append(mydict)
print(pd.DataFrame(myList))
Notice in this code, I check if the column name exists for each key, and if it exists, I create a new column name by suffixing with '*'.

PySpark counting rows that contain string

I have multiple xml files that look something like this:
<?xml version="1.0" encoding="UTF-8"?>
<parent>
<row AcceptedAnswerId="15" AnswerCount="5" Body="<p>How should
I elicit prior distributions from experts when fitting a Bayesian
model?</p>
" CommentCount="1" CreationDate="2010-07-
19T19:12:12.510" FavoriteCount="17" Id="1" LastActivityDate="2010-09-
15T21:08:26.077" OwnerUserId="8" PostTypeId="1" Score="26"
Tags="<bayesian><prior><elicitation>"
Title="Eliciting priors from experts" ViewCount="1457" />
I would like to be able to use PySpark to count the lines that DO NOT contain the string: <row
My current thought:
def startWithRow(line):
if line.strip().startswith("<row"):
return True
else:
return False
sc.textFile(localpath("folder_containing_xmg.gz_files")) \
.filter(lambda x: not startWithRow(x)) \
.count()
I have tried validating this, but am getting results from even a simple count lines that don't make sense (I downloaded the xml file and did a wc on it which did not match the word count from PySpark.)
Does anything about my approach above stand out as wrong/weird?
I will just use lxml library combined with Spark to count the line with row or filter something out.
from lxml import etree
def find_number_of_rows(path):
try:
tree = etree.fromstring(path)
except:
tree = etree.parse(path)
return len(tree.findall('row'))
rdd = spark.sparkContext.parallelize(paths) # paths is a list to all your paths
rdd.map(lambda x: find_number_of_rows(x)).collect()
For example, if you have list or XML string (just toy example), you can do the following:
text = [
"""
<parent>
<row ViewCount="1457" />
<row ViewCount="1457" />
</parent>
""",
"""
<parent>
<row ViewCount="1457" />
<row ViewCount="1457" />
<row ViewCount="1457" />
</parent>
"""
]
rdd = spark.sparkContext.parallelize(text)
rdd.map(lambda x: find_number_of_rows(x)).collect()
In your case, your function have to take in path to file instead. Then, you can count or filter those rows. I don't have a full file to test on. Let me know if you need extra help!
def badRowParser(x):
try:
line = ET.fromstring(x.strip().encode('utf-8'))
return True
except:
return False
posts = sc.textFile(localpath('folder_containing_xml.gz_files'))
rejected = posts.filter(lambda l: "<row" in l.encode('utf-
8')).map(lambda x: not badRowParser(x))
ans = rejected.collect()
from collections import Counter
Counter(ans)

Python ElementTree xml output to csv

I have the following XML file ('registerreads_EE.xml'):
<?xml version="1.0" encoding="us-ascii" standalone="yes"?>
<ReadingDocument xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<ReadingStatusRefTable>
<ReadingStatusRef Ref="1">
<UnencodedStatus SourceValidation="SIMPLE">
<StatusCodes>
<Signal>XX</Signal>
</StatusCodes>
</UnencodedStatus>
</ReadingStatusRef>
</ReadingStatusRefTable>
<Header>
<IEE_System Id="XXXXXXXXXXXXXXX" />
<Creation_Datetime Datetime="2015-10-22T09:05:32Z" />
<Timezone Id="UTC" />
<Path FilePath="X:\XXXXXXXXXXXX.xml" />
<Export_Template Id="XXXXX" />
<CorrelationID Id="" />
</Header>
<ImportExportParameters ResubmitFile="false" CreateGroup="true">
<DataFormat TimestampType="XXXXXX" Type="XXXX" />
</ImportExportParameters>
<Channels>
<Channel StartDate="2015-10-21T00:00:00-05:00" EndDate="2015-10-22T00:00:00-05:00">
<ChannelID ServicePointChannelID="73825603:301" />
<Readings>
<Reading Value="3577.0" ReadingTime="2015-10-21T00:00:00-05:00" StatusRef="1" />
<Reading Value="3601.3" ReadingTime="2015-10-22T00:00:00-05:00" StatusRef="1" />
</Readings>
<ExportRequest RequestID="152" EntityType="ServicePoint" EntityID="73825603" RequestSource="Scheduled" />
</Channel>
<Channel StartDate="2015-10-21T00:00:00-05:00" EndDate="2015-10-22T00:00:00-05:00">
<ChannelID ServicePointChannelID="73825604:301" />
<Readings>
<Reading Value="3462.5" ReadingTime="2015-10-21T00:00:00-05:00" StatusRef="1" />
<Reading Value="3501.5" ReadingTime="2015-10-22T00:00:00-05:00" StatusRef="1" />
</Readings>
<ExportRequest RequestID="152" EntityType="ServicePoint" EntityID="73825604" RequestSource="Scheduled" />
</Channel>
</Channels>
</ReadingDocument>
I want to parse the XML of the channel data into a csv file.
He is what I have written in Python 2.7.10:
import xml.etree.ElementTree as ET
tree = ET.parse('registerreads_EE.xml')
root = tree.getroot()[3]
for channel in tree.iter('Channel'):
for exportrequest in channel.iter('ExportRequest'):
entityid = exportrequest.attrib.get('EntityID')
for meterread in channel.iter('Reading'):
read = meterread.attrib.get('Value')
date = meterread.attrib.get('ReadingTime')
print read[:-2],",",date[:10],",",entityid
tree.write(open('registerreads_EE.csv','w'))
Here is the screen output when the above is run:
3577 , 2015-10-21 , 73825603
3601 , 2015-10-22 , 73825603
3462 , 2015-10-21 , 73825604
3501 , 2015-10-22 , 73825604
The 'registerreads.csv' output file is like the original XML file, minus the first line.
I would like the printed output above outputted to a csv file with headers of read, date, entityid.
I am having difficulty with this. This is my first python program. Any help is appreciated.
Use the csv module not lxml module to write rows to csv file. But still use lxml to parse and extract content from xml file:
import xml.etree.ElementTree as ET
import csv
tree = ET.parse('registerreads_EE.xml')
root = tree.getroot()[3]
with open('registerreads_EE.csv', 'w', newline='') as r:
writer = csv.writer(r)
writer.writerow(['read', 'date', 'entityid']) # WRITING HEADERS
for channel in tree.iter('Channel'):
for exportrequest in channel.iter('ExportRequest'):
entityid = exportrequest.attrib.get('EntityID')
for meterread in channel.iter('Reading'):
read = meterread.attrib.get('Value')
date = meterread.attrib.get('ReadingTime')
# WRITE EACH ROW ITERATIVELY
writer.writerow([read[:-2],date[:10],entityid])

Categories