How to export a 2D table to a CSV file using PyCharm (Python)

I have an XML file, 'product.xml'. Here is a sample of its contents:
<?xml version="1.0"?>
<Rowset>
<ROW>
<Product_ID>32</Product_ID>
<Company_ID>2</Company_ID>
<User_ID>90</User_ID>
<Product_Type>1</Product_Type>
<Application_ID>BBC#:1010</Application_ID>
</ROW>
<ROW>
<Product_ID>22</Product_ID>
<Company_ID>4</Company_ID>
<User_ID>190</User_ID>
<Product_Type>2</Product_Type>
<Application_ID>NBA#:1111</Application_ID>
</ROW>
<ROW>
<Product_ID>63</Product_ID>
<Company_ID>4</Company_ID>
<User_ID>99</User_ID>
<Product_Type>1</Product_Type>
<Application_ID>BBC#:1212</Application_ID>
</ROW>
<ROW>
<Product_ID>22</Product_ID>
<Company_ID>2</Company_ID>
<User_ID>65</User_ID>
<Product_Type>2</Product_Type>
<Application_ID>NBA#:2210</Application_ID>
</ROW>
</Rowset>
This is my code:
import xml.etree.ElementTree as ET  # cElementTree was removed in Python 3.9
tree = ET.parse('product.xml')
root = tree.getroot()
for rows in root:
    for attr in rows:
        if attr.tag == 'User_ID':
            print('User_ID: ' + attr.text)
        if attr.tag == 'Application_ID':
            print('Application_ID: ' + attr.text)
Output for this is:
User_ID: 90
Application_ID: BBC#:1010
User_ID: 190
Application_ID: NBA#:1111
User_ID: 99
Application_ID: BBC#:1212
How can I build a 2D table as a pandas DataFrame, with 'Application_ID' and 'User_ID' as the column headers and their values as the rows, like this:
Application_ID User_ID
BBC#:1010 90
NBA#:1111 190
BBC#:1212 99
I would also like to export the resulting table to a CSV file. Thank you.

Something like the following:
import xml.etree.ElementTree as ET
import pandas as pd
xml = '''<?xml version="1.0"?>
<Rowset>
<ROW>
<Product_ID>32</Product_ID>
<Company_ID>2</Company_ID>
<User_ID>90</User_ID>
<Product_Type>1</Product_Type>
<Application_ID>BBC#:1010</Application_ID>
</ROW>
<ROW>
<Product_ID>22</Product_ID>
<Company_ID>4</Company_ID>
<User_ID>190</User_ID>
<Product_Type>2</Product_Type>
<Application_ID>NBA#:1111</Application_ID>
</ROW>
<ROW>
<Product_ID>63</Product_ID>
<Company_ID>4</Company_ID>
<User_ID>99</User_ID>
<Product_Type>1</Product_Type>
<Application_ID>BBC#:1212</Application_ID>
</ROW>
<ROW>
<Product_ID>22</Product_ID>
<Company_ID>2</Company_ID>
<User_ID>65</User_ID>
<Product_Type>2</Product_Type>
<Application_ID>NBA#:2210</Application_ID>
</ROW>
</Rowset>
'''
FIELDS = ['Application_ID', 'User_ID']
data = []
root = ET.fromstring(xml)
# Collect one list of field values per <ROW>
for row in root.findall('.//ROW'):
    data.append([row.find(f).text for f in FIELDS])
df = pd.DataFrame(data, columns=FIELDS)
print(df)
Output:
Application_ID User_ID
0 BBC#:1010 90
1 NBA#:1111 190
2 BBC#:1212 99
3 NBA#:2210 65
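The answer above stops at printing the DataFrame, while the question also asks for a CSV export. A minimal sketch of that last step, assuming the same two columns: the rows are copied from the output above, and `applications.csv` is a hypothetical file name. Passing `index=False` leaves out the 0..3 row labels so the file matches the requested table.

```python
import io
import pandas as pd

# Rows copied from the output above; in practice df comes from the parsing loop.
df = pd.DataFrame(
    [["BBC#:1010", "90"], ["NBA#:1111", "190"], ["BBC#:1212", "99"], ["NBA#:2210", "65"]],
    columns=["Application_ID", "User_ID"],
)

# Write to an in-memory buffer here; pass a path such as "applications.csv"
# (a hypothetical file name) to write to disk instead.
buffer = io.StringIO()
df.to_csv(buffer, index=False)  # index=False omits the row-label column
csv_text = buffer.getvalue()
print(csv_text)
```

The first line of the result is the header `Application_ID,User_ID`, followed by one line per row.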

Try:
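import numpy as np
import pandas as pd
def parse_row(row):
    # Default to NaN so a row that lacks a tag still gets a value
    ret = {'User_ID': np.nan, 'Application_ID': np.nan}
    for attr in row:
        if attr.tag in ret:
            ret[attr.tag] = attr.text
    return ret
# root is the parsed <Rowset> element from the question's code
out = pd.DataFrame([parse_row(r) for r in root])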
Output:
User_ID Application_ID
0 90 BBC#:1010
1 190 NBA#:1111
2 99 BBC#:1212
3 65 NBA#:2210

pandas can read XML directly into a DataFrame with pd.read_xml (available since pandas 1.3.0).
import pandas as pd
# Read every child element of each row into its own column
df = pd.read_xml('product.xml')
# Drop (remove) the unwanted columns
df.drop(['Product_ID', 'Company_ID', 'Product_Type'], axis=1, inplace=True)
# Export to CSV
df.to_csv('outputfile.csv')

Related

Parse XML to CSV with python

I need to convert some XML to CSV. I am struggling to read the product-id attribute of each 'record' element while iterating. The code below pulls out the allocation text. How do I get each record's product-id?
import xml.etree.ElementTree as ET
mytree = ET.parse('Salesforce_01_30_2023.xml')
myroot = mytree.getroot()
print(myroot)
for x in myroot.findall('record'):
    product = myroot.attrib
    inventory = x.find('allocation').text
    print(product, inventory)
XML:
<?xml version="1.0" encoding="UTF-8"?>
<records>
<record product-id="99124">
<allocation>15</allocation>
<allocation-timestamp>2023-01-30T15:03:39.598Z</allocation-timestamp>
<perpetual>false</perpetual>
<preorder-backorder-handling>none</preorder-backorder-handling>
<ats>15</ats>
</record>
<record product-id="011443">
<allocation>0</allocation>
<allocation-timestamp>2023-01-30T15:03:39.598Z</allocation-timestamp>
<perpetual>false</perpetual>
<preorder-backorder-handling>none</preorder-backorder-handling>
<ats>0</ats>
</record>
</records>
To get the product-id value, use .attrib["product-id"] on each record element (the question's code reads myroot.attrib, which holds the attributes of the root element instead):
import xml.etree.ElementTree as ET
mytree = ET.parse('Salesforce_01_30_2023.xml')
myroot = mytree.getroot()
for product in myroot.findall('record'):
    inventory = product.find('allocation').text
    print(product.attrib['product-id'], inventory)
Prints:
99124 15
011443 0
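The snippet above only prints the values. A minimal sketch of writing them to CSV with the standard-library `csv` module, assuming only the product-id and allocation fields are wanted; the XML is a trimmed copy of the question's sample, and `inventory.csv` is a hypothetical file name.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Trimmed copy of the XML from the question (only the fields used here).
xml_text = """<records>
  <record product-id="99124"><allocation>15</allocation></record>
  <record product-id="011443"><allocation>0</allocation></record>
</records>"""

root = ET.fromstring(xml_text)

# An in-memory buffer stands in for a real file; with a file, use
# open("inventory.csv", "w", newline="") (hypothetical file name).
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["product-id", "allocation"])
for record in root.findall("record"):
    # Attributes are read with .get(); child element text with .findtext()
    writer.writerow([record.get("product-id"), record.findtext("allocation")])

csv_text = buffer.getvalue()
print(csv_text)
```

Because every value stays a string, the leading zero in 011443 is preserved.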
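Option 1: Use pandas.read_xml() and DataFrame.to_csv(). Note that read_xml infers numeric types, so the product-id 011443 loses its leading zero in the output below: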
import pandas as pd
df = pd.read_xml("prod_id.xml", xpath=".//record")
df.to_csv('prod.csv')
print(df.to_string())
Output:
product-id allocation allocation-timestamp perpetual preorder-backorder-handling ats
0 99124 15 2023-01-30T15:03:39.598Z False none 15
1 11443 0 2023-01-30T15:03:39.598Z False none 0
CSV:
,product-id,allocation,allocation-timestamp,perpetual,preorder-backorder-handling,ats
0,99124,15,2023-01-30T15:03:39.598Z,False,none,15
1,11443,0,2023-01-30T15:03:39.598Z,False,none,0
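In the output above the product-id 011443 has become 11443, because the column was inferred as an integer. A hedged sketch of one way to keep the IDs as text, assuming pandas 1.5 or later (where `read_xml` accepts a `dtype` argument); the XML is a trimmed copy of the question's sample, and `parser="etree"` is chosen here only so that lxml is not required.

```python
import io
import pandas as pd

# Trimmed copy of the XML from the question.
xml_text = """<records>
  <record product-id="99124"><allocation>15</allocation></record>
  <record product-id="011443"><allocation>0</allocation></record>
</records>"""

df = pd.read_xml(
    io.StringIO(xml_text),
    xpath=".//record",
    parser="etree",              # standard-library parser, so lxml is not needed
    dtype={"product-id": str},   # keep the IDs as text (assumes pandas >= 1.5)
)
ids = df["product-id"].tolist()
csv_text = df.to_csv(index=False)  # returns the CSV as a string when no path is given
print(ids)
print(csv_text)
```

The other columns are still type-inferred, so allocation remains numeric.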
Option 2: if you prefer xml.etree.ElementTree, XML attribute values can be read with .get():
import xml.etree.ElementTree as ET
tree = ET.parse('prod_id.xml')
root = tree.getroot()
for elem in root.iter():
    # print(elem.tag, elem.attrib, elem.text)
    if elem.tag == "record":
        print("Product-id:", elem.get('product-id'))
Output:
Product-id: 99124
Product-id: 011443

How to iterate over an xml and save it into a dataframe in Python?

I have an XML document and I'm trying to iterate over it and save just the tracking events into a DataFrame.
This is the input XML:
<?xml version="1.0" encoding="UTF-8"?>
<trackingresponse>
<trackingdetails>
<trackingdetail>
<trackingnumber>1550161004</trackingnumber>
<trackingevents>
<trackingevent>
<date>2020-10-21T11:04:00+01:00</date>
<code>17</code>
</trackingevent>
<trackingevent>
<date>2020-10-21T08:41:00+01:00</date>
<code>18</code>
</trackingevent>
</trackingdetail>
</trackingdetails>
</trackingresponse>
I tried this code, but it produces an empty DataFrame:
response = requests.post(endpoint_url, data=t, headers=headers).text
# response is correct
response_tree = ET.fromstring(response)
data = []
for el in response_tree.iter('./*'):
    for i in el.iter('*'):
        data.append(dict(i.items()))
df = pd.DataFrame(data)
print(df)
I also tried writing the text values into a temporary DataFrame, but that doesn't work either:
response_df = pd.read_csv('/home/test.csv')
response_df['date'] = response_tree.find('.//date').text
response_df['code'] = response_tree.find('.//code').text
I also tried this, but it gives me every element as a new row:
for child in tree.iter('trackingevent'):
    for elem in child.iter():
        data = {str(elem.tag): [elem.text]}
        if str(elem.text) == 'None': continue
        response_df = pd.DataFrame(data)
        consolidated_list.append(response_df)
I'm just trying to get the tracking events inside the XML into a DataFrame.
Expected DataFrame:
date code
2020-10-21T11:04:00+01:00 17
2020-10-21T08:41:00+01:00 18
The code below does the job:
import xml.etree.ElementTree as ET
import pandas as pd
xml = '''<?xml version="1.0" encoding="UTF-8"?>
<trackingresponse>
<trackingdetails>
<trackingdetail>
<trackingnumber>1550161004</trackingnumber>
<trackingevents>
<trackingevent>
<date>2020-10-21T11:04:00+01:00</date>
<code>17</code>
</trackingevent>
<trackingevent>
<date>2020-10-21T08:41:00+01:00</date>
<code>18</code>
</trackingevent>
</trackingevents>
</trackingdetail>
</trackingdetails>
</trackingresponse>'''
root = ET.fromstring(xml)
data = [{'date': e.find('date').text, 'code': e.find('code').text} for e in root.findall('.//trackingevent')]
df = pd.DataFrame(data)
print(df)
Output:
date code
0 2020-10-21T11:04:00+01:00 17
1 2020-10-21T08:41:00+01:00 18
You can use this example to parse the XML with etree (note: you're missing </trackingevents> in your XML snippet, probably a typo):
import pandas as pd
import xml.etree.ElementTree as et
tree = et.ElementTree(file='<your file.xml>')
data = []
for ev in tree.findall('.//trackingevent'):
    date = ev.find('date').text
    code = ev.find('code').text
    data.append({
        'date': date,
        'code': code
    })
df = pd.DataFrame(data)
print(df)
Prints:
date code
0 2020-10-21T11:04:00+01:00 17
1 2020-10-21T08:41:00+01:00 18
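Both answers call find(...).text, which raises AttributeError if an event lacks one of the children. A small sketch of a more tolerant variant using findtext(); the XML here is a hypothetical variation of the response in which the second event has no code element.

```python
import xml.etree.ElementTree as ET

# Hypothetical variation of the response: the second event has no <code>.
xml_text = """<trackingresponse>
  <trackingevents>
    <trackingevent><date>2020-10-21T11:04:00+01:00</date><code>17</code></trackingevent>
    <trackingevent><date>2020-10-21T08:41:00+01:00</date></trackingevent>
  </trackingevents>
</trackingresponse>"""

root = ET.fromstring(xml_text)
FIELDS = ["date", "code"]

# findtext() returns None when the child is absent, where find(...).text
# would fail because find() itself returned None.
data = [
    {field: event.findtext(field) for field in FIELDS}
    for event in root.findall(".//trackingevent")
]
print(data)
```

The resulting list of dicts can be passed to pd.DataFrame(data) exactly as in the answers above; the missing value then shows up as an empty cell rather than an exception.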

How to remove new lines in dataframe while writing xml into it?

I'm trying to write an XML response from an API into a DataFrame, iterating over the TrackingEvent tag.
The XML response looks like this:
<?xml version="1.0" encoding="UTF-8"?>
<TrackingResult>
<Events>
<TrackingEvent>
<DateTimeStamp>202010</DateTimeStamp>
<Event>Delivered</Event>
<ExtraInfo>02921</ExtraInfo>
</TrackingEvent>
<TrackingEvent>
<DateTimeStamp>202010</DateTimeStamp>
<Event>Delivery today</Event>
<ExtraInfo>31916</ExtraInfo>
</TrackingEvent>
</Events>
<Signatures />
<Errors />
</TrackingResult>
I am writing this into a DataFrame using this code:
response_status = requests.get(url, headers=headers)
print(type(response_status))  # <class 'requests.models.Response'>
print(type(response_status.content))  # <class 'bytes'>
tree = ET.fromstring(response_status.content)
for child in tree.iter('TrackingEvent'):
    for elem in child.iter():
        data = {str(elem.tag): [elem.text]}
        if str(elem.text) == 'None': continue
        response_df = pd.DataFrame(data)
        consolidated_list.append(response_df)
consolidated_df = pd.concat(consolidated_list, ignore_index=True)
print(consolidated_df)
This is the output DataFrame I'm getting:
DateTimeStamp Event ExtraInfo
202010
Delivered
02921
202010
Delivery today
31916
I want to remove these gaps so that each TrackingEvent ends up in a single DataFrame row.
Expected output:
DateTimeStamp Event ExtraInfo
202010 Delivered 02921
202010 Delivery today 31916
I would recommend building a dict in a loop and then creating the DataFrame from that dict. Here is an example:
xml = '''<?xml version="1.0" ?>
<TrackingResult>
<Events>
<TrackingEvent>
<DateTimeStamp>202010</DateTimeStamp>
<Event>Delivered</Event>
<ExtraInfo>02921</ExtraInfo>
</TrackingEvent>
<TrackingEvent>
<DateTimeStamp>202010</DateTimeStamp>
<Event>Delivery today</Event>
<ExtraInfo>31916</ExtraInfo>
</TrackingEvent>
</Events>
<Signatures />
<Errors />
</TrackingResult>'''
from lxml import etree as ET
import pandas as pd
from collections import defaultdict
d = defaultdict(list)
tree = ET.fromstring(xml)
for child in tree.iter('TrackingEvent'):
    for elem in child.iter():
        if elem.text is not None and str(elem.text).strip() != '':
            d[elem.tag].append(elem.text.strip())
        else:
            # Record None only for empty leaf elements, not for containers
            if len(list(elem)) == 0:
                d[elem.tag].append(None)
df = pd.DataFrame(d)
print(df)
Output:
DateTimeStamp Event ExtraInfo
0 202010 Delivered 02921
1 202010 Delivery today 31916
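The column-wise dict above relies on every TrackingEvent having the same children; if one tag is missing, the lists end up with different lengths and the DataFrame constructor raises a ValueError. A sketch of a row-wise alternative using only the standard library, with a hypothetical variation of the response in which the second event has no ExtraInfo:

```python
import xml.etree.ElementTree as ET

# Hypothetical variation of the response: the second event lacks <ExtraInfo>.
xml_text = """<TrackingResult>
  <Events>
    <TrackingEvent>
      <DateTimeStamp>202010</DateTimeStamp>
      <Event>Delivered</Event>
      <ExtraInfo>02921</ExtraInfo>
    </TrackingEvent>
    <TrackingEvent>
      <DateTimeStamp>202010</DateTimeStamp>
      <Event>Delivery today</Event>
    </TrackingEvent>
  </Events>
</TrackingResult>"""

root = ET.fromstring(xml_text)

rows = []
for event in root.iter("TrackingEvent"):
    # One dict per event: keys are child tag names, values are the stripped
    # text, or None when the element is empty.
    row = {child.tag: (child.text or "").strip() or None for child in event}
    rows.append(row)

print(rows)
```

Passing rows to pd.DataFrame(rows) gives one row per event, and keys missing from a dict become NaN in that row.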

Parsing XML document into a pandas DataFrame

I have an XML file that looks like this:
<?xml version="1.0" encoding="utf-8"?>
<comments>
<row Id="1" PostId="2" Score="0" Text="(...)" CreationDate="2011-08-30T21:15:28.063" UserId="16" />
<row Id="2" PostId="17" Score="1" Text="(...)" CreationDate="2011-08-30T21:24:56.573" UserId="27" />
<row Id="3" PostId="26" Score="0" Text="(...)" UserId="9" />
</comments>
What I'm trying to do is extract the Id, Text and CreationDate columns into a pandas DataFrame. I've tried the following:
import xml.etree.ElementTree as et  # cElementTree was removed in Python 3.9
import pandas as pd
path = '/.../...'
dfcols = ['ID', 'Text', 'CreationDate']
df_xml = pd.DataFrame(columns=dfcols)
root = et.parse(path)
rows = root.findall('.//row')
for row in rows:
    ID = row.find('Id')
    text = row.find('Text')
    date = row.find('CreationDate')
    print(ID, text, date)
    df_xml = df_xml.append(pd.Series([ID, text, date], index=dfcols), ignore_index=True)
print(df_xml)
But the output is:
None None None
How do I fix this?
As advised by @unutbu, a Python/pandas/NumPy expert, in a related answer:
Never call DataFrame.append or pd.concat inside a for-loop. It leads to quadratic copying.
(DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so the loop in the question no longer runs on current versions.)
Therefore, consider parsing your XML data into a separate list, then passing that list to the DataFrame constructor in one call outside of any loop. In fact, you can pass nested lists built with a list comprehension directly into the constructor:
path = 'AttributesXMLPandas.xml'
dfcols = ['ID', 'Text', 'CreationDate']
root = et.parse(path)
rows = root.findall('.//row')
# NESTED LIST
xml_data = [[row.get('Id'), row.get('Text'), row.get('CreationDate')]
            for row in rows]
df_xml = pd.DataFrame(xml_data, columns=dfcols)
print(df_xml)
# ID Text CreationDate
# 0 1 (...) 2011-08-30T21:15:28.063
# 1 2 (...) 2011-08-30T21:24:56.573
# 2 3 (...) None
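Just a minor change in your code: Id, Text and CreationDate are attributes of each row element, not child elements, so use get() instead of find():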
ID = row.get('Id')
text = row.get('Text')
date = row.get('CreationDate')
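To make the difference concrete, here is a small standard-library sketch using the third row from the question, which has no CreationDate attribute. The "unknown" default is an arbitrary placeholder chosen for illustration.

```python
import xml.etree.ElementTree as ET

row = ET.fromstring('<row Id="3" PostId="26" Score="0" Text="(...)" UserId="9" />')

# find() searches child elements; <row> has none, so the result is None.
found_child = row.find("Id")
# get() reads an attribute value, always as a string.
row_id = row.get("Id")
# A missing attribute yields None, or the supplied default.
missing = row.get("CreationDate")
with_default = row.get("CreationDate", "unknown")

print(found_child, row_id, missing, with_default)
```

This is why the question's loop printed None None None: each find() call looked for a child element that does not exist.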
Based on @Parfait's solution, I wrote a version that takes the columns as a parameter and returns a pandas DataFrame.
test.xml:
<?xml version="1.0" encoding="utf-8"?>
<comments>
<row Id="1" PostId="2" Score="0" Text="(.1.)" CreationDate="2011-08-30T21:15:28.063" UserId="16" />
<row Id="2" PostId="17" Score="1" Text="(.2.)" CreationDate="2011-08-30T21:24:56.573" UserId="27" />
<row Id="3" PostId="26" Score="0" Text="(.3.)" UserId="9" />
</comments>
xml_to_pandas.py:
'''XML to pandas DataFrame converter.'''
import xml.etree.ElementTree as et  # cElementTree was removed in Python 3.9
import pandas as pd
def xml_to_pandas(root, columns, row_name):
    '''Take a parsed xml.etree tree, the attribute names to extract and the row tag; return a pandas DataFrame.'''
    df = None
    try:
        rows = root.findall('.//{}'.format(row_name))
        xml_data = [[row.get(c) for c in columns] for row in rows]  # NESTED LIST
        df = pd.DataFrame(xml_data, columns=columns)
    except Exception as e:
        print('[xml_to_pandas] Exception: {}.'.format(e))
    return df
path = 'test.xml'
row_name = 'row'
# Attribute names are case-sensitive: the XML uses 'Id', not 'ID'
columns = ['Id', 'Text', 'CreationDate']
root = et.parse(path)
df = xml_to_pandas(root, columns, row_name)
print(df)
Since pandas 1.3.0, there's a built-in pandas function pd.read_xml that reads XML documents into a pandas DataFrame.
path = """<?xml version="1.0" encoding="utf-8"?>
<comments>
<row Id="1" PostId="2" Score="0" Text="(...)" CreationDate="2011-08-30T21:15:28.063" UserId="16" />
<row Id="2" PostId="17" Score="1" Text="(...)" CreationDate="2011-08-30T21:24:56.573" UserId="27" />
<row Id="3" PostId="26" Score="0" Text="(...)" UserId="9" />
</comments>"""
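# Recent pandas versions expect a literal XML string like the one above to be wrapped in io.StringIO.
# Alternatively, pass a path to an XML document:
path = 'test.xml'
df = pd.read_xml(path)
Calling read_xml on the XML document from the question yields a DataFrame with one column per attribute (Id, PostId, Score, Text, CreationDate, UserId).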

How to insert text from file into new XML tags

I have the following code, which is meant to parse an XML file, read external text files (if found), insert their contents into newly introduced tags, and save a new XML file with the result.
The code looks like this:
try:
    import xml.etree.cElementTree as ET
except ImportError:
    import xml.etree.ElementTree as ET
import os
# define our data file
data_file = 'test2_of_2016-09-19.xml'
tree = ET.ElementTree(file=data_file)
root = tree.getroot()
for element in root:
    if element.find('File_directory') is not None:
        directory = element.find('File_directory').text
    if element.find('Introduction') is not None:
        introduction = element.find('Introduction').text
    if element.find('Directions') is not None:
        directions = element.find('Directions').text
for element in root:
    if element.find('File_directory') is not None:
        if element.find('Introduction') is not None:
            intro_tree = directory+introduction
            with open(intro_tree, 'r') as f:
                intro_text = f.read()
            f.closed
            intro_body = ET.SubElement(element,'Introduction_Body')
            intro_body.text = intro_text
        if element.find('Directions') is not None:
            directions_tree = directory+directions
            with open(directions_tree, 'r') as f:
                directions_text = f.read()
            f.closed
            directions_body = ET.SubElement(element,'Directions_Body')
            directions_body.text = directions_text
tree.write('new_' + data_file)
The problem is that the last found values of File_directory, Introduction, and Directions seem to be applied to every entry, which is not what I want: each entry should get the contents of its own files.
The source XML file looks like this:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Root xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<Row>
<Entry_No>1</Entry_No>
<Waterfall_Name>Bridalveil Fall</Waterfall_Name>
<File_directory>./waterfall_writeups/1_Bridalveil_Fall/</File_directory>
<Introduction>introduction-bridalveil-fall.html</Introduction>
<Directions>directions-bridalveil-fall.html</Directions>
</Row>
<Row>
<Entry_No>52</Entry_No>
<Waterfall_Name>Switzer Falls</Waterfall_Name>
<File_directory>./waterfall_writeups/52_Switzer_Falls/</File_directory>
<Introduction>introduction-switzer-falls.html</Introduction>
<Directions>directions-switzer-falls.html</Directions>
</Row>
</Root>
The desired output XML should look like this:
<Root xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<Row>
<Entry_No>1</Entry_No>
<Waterfall_Name>Bridalveil Fall</Waterfall_Name>
<File_directory>./waterfall_writeups/1_Bridalveil_Fall/</File_directory>
<Introduction>introduction-bridalveil-fall.html</Introduction>
<Directions>directions-bridalveil-fall.html</Directions>
<Introduction_Body>Text from ./waterfall_writeups/1_Bridalveil_Fall/introduction-bridalveil-fall.html</Introduction_Body>
<Directions_Body>Text from ./waterfall_writeups/1_Bridalveil_Fall/directions-bridalveil-fall.html</Directions_Body>
</Row>
<Row>
<Entry_No>52</Entry_No>
<Waterfall_Name>Switzer Falls</Waterfall_Name>
<File_directory>./waterfall_writeups/52_Switzer_Falls/</File_directory>
<Introduction>introduction-switzer-falls.html</Introduction>
<Directions>directions-switzer-falls.html</Directions>
<Introduction_Body>Text from ./waterfall_writeups/52_Switzer_Falls/introduction-switzer-falls.html</Introduction_Body>
<Directions_Body>Text from ./waterfall_writeups/52_Switzer_Falls/directions-switzer-falls.html</Directions_Body>
</Row>
</Root>
But what I end up getting is:
<Root xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<Row>
<Entry_No>1</Entry_No>
<Waterfall_Name>Bridalveil Fall</Waterfall_Name>
<File_directory>./waterfall_writeups/1_Bridalveil_Fall/</File_directory>
<Introduction>introduction-bridalveil-fall.html</Introduction>
<Directions>directions-bridalveil-fall.html</Directions>
<Introduction_Body>Text from ./waterfall_writeups/52_Switzer_Falls/introduction-switzer-falls.html</Introduction_Body>
<Directions_Body>Text from ./waterfall_writeups/52_Switzer_Falls/directions-switzer-falls.html</Directions_Body>
</Row>
<Row>
<Entry_No>52</Entry_No>
<Waterfall_Name>Switzer Falls</Waterfall_Name>
<File_directory>./waterfall_writeups/52_Switzer_Falls/</File_directory>
<Introduction>introduction-switzer-falls.html</Introduction>
<Directions>directions-switzer-falls.html</Directions>
<Introduction_Body>Text from ./waterfall_writeups/52_Switzer_Falls/introduction-switzer-falls.html</Introduction_Body>
<Directions_Body>Text from ./waterfall_writeups/52_Switzer_Falls/directions-switzer-falls.html</Directions_Body>
</Row>
</Root>
As an aside, is there any way to introduce the body tags' content without it all being printed on one line (for readability)?
The first for loop iterates over the Row elements of your document, assigning new values to your directory, introduction, and directions variables respectively, with each iteration, ending up with the values from the last occurring Row element.
What I would do is create a dictionary to map tag names to text contents, and then use that mapping to add the new sub-elements on the fly. Example (without reading the referenced files):
for row in root:
    elements = {}
    for node in row:
        elements[node.tag] = node.text
    directory = elements['File_directory']
    intro_tree = directory + elements['Introduction']
    intro_body = ET.SubElement(row, 'Introduction_Body')
    intro_body.text = 'Text from %s' % intro_tree
    directions_tree = directory + elements['Directions']
    directions_body = ET.SubElement(row, 'Directions_Body')
    directions_body.text = 'Text from %s' % directions_tree
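The aside about readability is not addressed above. One possible approach, assuming Python 3.9 or later, is xml.etree.ElementTree.indent, which inserts line breaks and indentation between elements before the tree is written. The sketch below uses a cut-down, hypothetical version of the document with a single row and one added body element.

```python
import xml.etree.ElementTree as ET

# Cut-down stand-in for the real document (a single row, two fields).
root = ET.fromstring(
    "<Root><Row><Entry_No>1</Entry_No>"
    "<Waterfall_Name>Bridalveil Fall</Waterfall_Name></Row></Root>"
)
row = root.find("Row")
body = ET.SubElement(row, "Introduction_Body")
body.text = "Text from introduction-bridalveil-fall.html"

# ET.indent (Python 3.9+) rewrites the whitespace between elements in place,
# so each element is serialized on its own line.
ET.indent(root, space="  ")
pretty = ET.tostring(root, encoding="unicode")
print(pretty)
```

Only the whitespace between elements changes; the text inside each body element is written exactly as it was read from the file. With a full tree, calling ET.indent(tree) before tree.write(...) should have the same effect.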
