Parse XML to CSV with python


I need to convert some XML to CSV. I am struggling to read the product-id attribute of each record element while iterating. The code below pulls out the allocation text. How do I get each record's product-id?
import xml.etree.ElementTree as ET
mytree = ET.parse('Salesforce_01_30_2023.xml')
myroot = mytree.getroot()
print(myroot)
for x in myroot.findall('record'):
    product = myroot.attrib
    inventory = x.find('allocation').text
    print(product, inventory)
XML:
<?xml version="1.0" encoding="UTF-8"?>
<records>
    <record product-id="99124">
        <allocation>15</allocation>
        <allocation-timestamp>2023-01-30T15:03:39.598Z</allocation-timestamp>
        <perpetual>false</perpetual>
        <preorder-backorder-handling>none</preorder-backorder-handling>
        <ats>15</ats>
    </record>
    <record product-id="011443">
        <allocation>0</allocation>
        <allocation-timestamp>2023-01-30T15:03:39.598Z</allocation-timestamp>
        <perpetual>false</perpetual>
        <preorder-backorder-handling>none</preorder-backorder-handling>
        <ats>0</ats>
    </record>
</records>

To get the product-id value, use .attrib["product-id"] on each record element:
import xml.etree.ElementTree as ET
mytree = ET.parse('Salesforce_01_30_2023.xml')
myroot = mytree.getroot()
for product in myroot.findall('record'):
    inventory = product.find('allocation').text
    print(product.attrib['product-id'], inventory)
Prints:
99124 15
011443 0

Option 1: You can use pandas read_xml() and DataFrame.to_csv():
import pandas as pd
df = pd.read_xml("prod_id.xml", xpath=".//record")
df.to_csv('prod.csv')
print(df.to_string())
Output:
   product-id  allocation      allocation-timestamp  perpetual  preorder-backorder-handling  ats
0       99124          15  2023-01-30T15:03:39.598Z      False                         none   15
1       11443           0  2023-01-30T15:03:39.598Z      False                         none    0
CSV:
,product-id,allocation,allocation-timestamp,perpetual,preorder-backorder-handling,ats
0,99124,15,2023-01-30T15:03:39.598Z,False,none,15
1,11443,0,2023-01-30T15:03:39.598Z,False,none,0
Option 2: if you prefer xml.etree.ElementTree, attribute values can be read with .get():
import xml.etree.ElementTree as ET
tree = ET.parse('prod_id.xml')
root = tree.getroot()
for elem in root.iter():
    # print(elem.tag, elem.attrib, elem.text)
    if elem.tag == "record":
        print("Product-id:",elem.get('product-id'))
Output:
Product-id: 99124
Product-id: 011443
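
Note that in the pandas output above the second product-id appears as 11443: the column was read as an integer, so the leading zero of 011443 is gone. If the ids must survive unchanged, one possible approach is to write the CSV with the standard library only. The following is a minimal sketch, not a definitive implementation; the function name records_to_csv and the shortened sample data are assumptions made for illustration.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Shortened sample based on the question; with a real file, ET.parse(path).getroot() would be used instead.
XML_TEXT = """<records>
  <record product-id="99124">
    <allocation>15</allocation>
    <ats>15</ats>
  </record>
  <record product-id="011443">
    <allocation>0</allocation>
    <ats>0</ats>
  </record>
</records>"""


def records_to_csv(xml_text):
    """Return CSV text with one row per <record> (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    rows = []
    for record in root.findall("record"):
        # The attribute is read as a string, so "011443" keeps its leading zero.
        row = {"product-id": record.get("product-id")}
        # Every child element (allocation, ats, ...) becomes a column.
        for child in record:
            row[child.tag] = child.text
        rows.append(row)

    # Collect column names in first-seen order so records with extra children still fit.
    fieldnames = []
    for row in rows:
        for key in row:
            if key not in fieldnames:
                fieldnames.append(key)

    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = records_to_csv(XML_TEXT)
print(csv_text)
```

To write a file instead of a string, the same DictWriter can be pointed at open('prod.csv', 'w', newline='').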

Related

How to access UBL 2.1 xml tag using python

I need to access the tags in a UBL 2.1 document and modify them depending on user input in Python.
So I used the ElementTree library to access and modify the tags.
Here is a sample of the XML:
<ns0:Invoice xmlns:ns0="urn:oasis:names:specification:ubl:schema:xsd:Invoice-2" xmlns:ns1="urn:oasis:names:specification:ubl:schema:xsd:CommonBasicComponents-2" xmlns:ns2="urn:oasis:names:specification:ubl:schema:xsd:CommonAggregateComponents-2">
<ns1:ProfileID>reporting:1.0</ns1:ProfileID>
<ns1:ID>0</ns1:ID>
<ns1:UUID>dbdf65eb-5d66-47e6-bb0c-a84bbf7baa30</ns1:UUID>
<ns1:IssueDate>2022-11-05</ns1:IssueDate>
The issue:
I want to modify the tags, but the loop body is never entered and nothing is changed.
I tried both of these approaches:
mytree = ET.parse('test.xml')
myroot = mytree.getroot()
for x in myroot.find({xmlns:ns1=urn:oasis:names:specification:ubl:schema:xsd:CommonBasicComponents-2}IssueDate}"):
    x.text = '1999'
mytree.write('test.xml')
mytree = ET.parse('test.xml')
myroot = mytree.getroot()
for x in myroot.iter('./Invoice/AllowanceCharge/ChargeIndicator'):
    x.text = str('true')
mytree.write('test.xml')
Neither of them modified the tag.
So the question is: how can I reach a specific tag and modify it?

If you correct the namespace and the brackets in your for loop, it works for valid XML such as the following (the root tag must be closed!):
Input:
<?xml version="1.0" encoding="utf-8"?>
<ns0:Invoice xmlns:ns0="urn:oasis:names:specification:ubl:schema:xsd:Invoice-2" xmlns:ns1="urn:oasis:names:specification:ubl:schema:xsd:CommonBasicComponents-2" xmlns:ns2="urn:oasis:names:specification:ubl:schema:xsd:CommonAggregateComponents-2">
<ns1:ProfileID>reporting:1.0</ns1:ProfileID>
<ns1:ID>0</ns1:ID>
<ns1:UUID>dbdf65eb-5d66-47e6-bb0c-a84bbf7baa30</ns1:UUID>
<ns1:IssueDate>2022-11-05</ns1:IssueDate>
</ns0:Invoice>
Your repaired code:
import xml.etree.ElementTree as ET
tree = ET.parse('test.xml')
root = tree.getroot()
for elem in root.findall("{urn:oasis:names:specification:ubl:schema:xsd:CommonBasicComponents-2}IssueDate"):
    elem.text = '1999'
tree.write('test_changed.xml', encoding='utf-8', xml_declaration=True)
ET.dump(root)
Output:
<ns0:Invoice xmlns:ns0="urn:oasis:names:specification:ubl:schema:xsd:Invoice-2" xmlns:ns1="urn:oasis:names:specification:ubl:schema:xsd:CommonBasicComponents-2">
<ns1:ProfileID>reporting:1.0</ns1:ProfileID>
<ns1:ID>0</ns1:ID>
<ns1:UUID>dbdf65eb-5d66-47e6-bb0c-a84bbf7baa30</ns1:UUID>
<ns1:IssueDate>1999</ns1:IssueDate>
</ns0:Invoice>
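
Writing the full {namespace-uri}tag form for every lookup gets long. As an alternative sketch, ElementTree's find(), findall() and findtext() accept a prefix-to-URI mapping. The prefix cbc and the shortened sample below are assumptions chosen for illustration; only the URI has to match the document.

```python
import xml.etree.ElementTree as ET

# Shortened version of the invoice from the question.
XML_TEXT = """<ns0:Invoice xmlns:ns0="urn:oasis:names:specification:ubl:schema:xsd:Invoice-2" xmlns:ns1="urn:oasis:names:specification:ubl:schema:xsd:CommonBasicComponents-2">
  <ns1:ID>0</ns1:ID>
  <ns1:IssueDate>2022-11-05</ns1:IssueDate>
</ns0:Invoice>"""

# "cbc" is an arbitrary local alias; it does not need to equal the ns1 prefix used in the file.
NAMESPACES = {
    "cbc": "urn:oasis:names:specification:ubl:schema:xsd:CommonBasicComponents-2",
}

root = ET.fromstring(XML_TEXT)

# findall() resolves the "cbc:" prefix through the mapping passed as second argument.
for elem in root.findall("cbc:IssueDate", NAMESPACES):
    elem.text = "1999"

issue_date = root.findtext("cbc:IssueDate", namespaces=NAMESPACES)
print(issue_date)
```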

How to parse and get elements of an XML using a Python data frame

This is my XML string. I receive it as a message, so it is not a file:
<?xml version="1.0" encoding="UTF-8"?>
<OperationStatus xmlns:ns2="summaries">
<EventId>123456</EventId>
<notificationId>123456</notificationId>
<userDetails>
<clientId>client_1</clientId>
<userId>user_1</userId>
<groupIds>
<groupId>123456</groupId>
<groupId>123457</groupId>
</groupIds>
</userDetails>
</OperationStatus>
I want to get the output in the format below:
message,code,Id
I have mentioned only three elements, but there can be many more.
This is what I tried, but I am not getting the expected output.
I have only just started learning Python, so please excuse any silly mistakes.
from __future__ import print_function
import pandas as pd
def lambda_handler():
    import xml.etree.ElementTree as et
    xtree = et.parse('''<?xml version="1.0" encoding="UTF-8"?>
<OperationStatus xmlns:ns2="summaries">
<EventId>123456</EventId>
<notificationId>123456</notificationId>
<userDetails>
<clientId>client_1</clientId>
<userId>user_1</userId>
<groupIds>
<groupId>123456</groupId>
<groupId>123457</groupId>
</groupIds>
</userDetails>
</OperationStatus>''')
    xroot = xtree.getroot()
    df_cols = ["message", "code", "Id"]
    rows = []
    for node in xroot:
        s_name = node.attrib.get("message")
        s_mail = node.find("code").text if node is not None else None
        s_grade = node.find("Id").text if node is not None else None
lambda_handler()

You can try using XPath; it makes retrieving the wanted data easier:
import xml.etree.ElementTree as et
import pandas as pd
xtree = et.fromstring("""<?xml version="1.0" encoding="UTF-8"?>
<name xmlns:ns2="summaries">
<message>5jb10x5rf7sp1fov5msgoof7r</message>
<code>COMPLETED</code>
<Id>dfkjlhgd98568y</Id>
</name>""")
keys = ["message", "code", "Id"]
data = {k: [xtree.find(".//"+k).text] for k in keys}
print(pd.DataFrame(data))
# Outputs:
# message code Id
# 0 5jb10x5rf7sp1fov5msgoof7r COMPLETED dfkjlhgd98568y

Is this the output you desire?
# !pip install xmltodict
import xmltodict
xml = """
<name xmlns:ns2="summaries">
<message>5jb10x5rf7sp1fov5msgoof7r</message>
<code>COMPLETED</code>
<Id>dfkjlhgd98568y</Id>
</name>
"""
d = xmltodict.parse(xml)
print(d['name']['message'])
print(d['name']['code'])
print(d['name']['Id'])
Output
5jb10x5rf7sp1fov5msgoof7r
COMPLETED
dfkjlhgd98568y
More info on xmltodict at https://github.com/martinblech/xmltodict

Given your string:
your_string='''\
<?xml version="1.0" encoding="UTF-8"?>
<name xmlns:ns2="summaries">
<message>5jb10x5rf7sp1fov5msgoof7r</message>
<code>COMPLETED</code>
<Id>dfkjlhgd98568y</Id>
</name>'''
Since this is a string, use .fromstring() rather than .parse(). It returns the root element directly (i.e., there is no need to call .getroot()):
root = et.fromstring(your_string)
>>> root
<Element 'name' at 0x1050f51d0>
Once you have the element tree with name as the root, you can either iterate over the sub-elements:
df_cols = ["message", "code", "Id"]
for node in root:
    if node.tag in df_cols:
        print({node.tag:node.text})
Prints:
{'message': '5jb10x5rf7sp1fov5msgoof7r'}
{'code': 'COMPLETED'}
{'Id': 'dfkjlhgd98568y'}
Or you can use an XPath query to find each element of interest:
for k in df_cols:
    print({k:root.find(f'./{k}').text})
# same output
Since a DataFrame can be constructed from a dict of the form {key: [list_of_elements], ...}, you can build that dict from the queries above:
df=pd.DataFrame({k:[root.find(f'./{k}').text] for k in df_cols})
If an element occurs more than once, use findall:
df=pd.DataFrame({k:[x.text for x in root.findall(f'./{k}')] for k in df_cols})
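
The message shown in the question is nested (userDetails, groupIds) and repeats groupId, so a flat lookup per key does not cover it. A possible way to flatten it is sketched below, producing one record per groupId. The choice of columns is an assumption made for illustration; the resulting list of dicts could be passed to pd.DataFrame(rows).

```python
import xml.etree.ElementTree as ET

# The message from the question. The string must start with the XML declaration, without leading whitespace.
MESSAGE = """<?xml version="1.0" encoding="UTF-8"?>
<OperationStatus xmlns:ns2="summaries">
  <EventId>123456</EventId>
  <notificationId>123456</notificationId>
  <userDetails>
    <clientId>client_1</clientId>
    <userId>user_1</userId>
    <groupIds>
      <groupId>123456</groupId>
      <groupId>123457</groupId>
    </groupIds>
  </userDetails>
</OperationStatus>"""

root = ET.fromstring(MESSAGE)

# One record per <groupId>; the shared values are repeated on every record.
rows = [
    {
        "EventId": root.findtext("EventId"),
        "clientId": root.findtext("userDetails/clientId"),
        "userId": root.findtext("userDetails/userId"),
        "groupId": group.text,
    }
    for group in root.findall("userDetails/groupIds/groupId")
]

for row in rows:
    print(row)
```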

How to iterate over an xml and save it into a dataframe in Python?

I have an XML response and I'm trying to iterate over it and save just the tracking events into a dataframe.
This is the input XML:
<?xml version="1.0" encoding="UTF-8"?>
<trackingresponse>
<trackingdetails>
<trackingdetail>
<trackingnumber>1550161004</trackingnumber>
<trackingevents>
<trackingevent>
<date>2020-10-21T11:04:00+01:00</date>
<code>17</code>
</trackingevent>
<trackingevent>
<date>2020-10-21T08:41:00+01:00</date>
<code>18</code>
</trackingevent>
</trackingdetail>
</trackingdetails>
</trackingresponse>
I tried this code, but it produces an empty dataframe:
response =requests.post(endpoint_url, data=t, headers = headers).text
# response is correct
response_tree = ET.fromstring(response)
data = []
for el in response_tree.iter('./*'):
    for i in el.iter('*'):
        data.append(dict(i.items()))
df = pd.DataFrame(data)
print(df)
I also tried writing the text values into a temporary dataframe, but this doesn't work either:
response_df = pd.read_csv('/home/test.csv')
response_df['date']= response_tree.find('.//date').text
response_df['code']= response_tree.find('.//code').text
I also tried this, but it gives me every element as a new row:
for child in tree.iter('trackingevent'):
    for elem in child.iter():
        data = {str(elem.tag):[elem.text]}
        if str(elem.text)=='None':continue
        response_df = pd.DataFrame(data)
        consolidated_list.append(response_df)
I'm just trying to get the tracking events inside the XML into a dataframe.
Expected dataframe:
date                       code
2020-10-21T11:04:00+01:00  17
2020-10-21T08:41:00+01:00  18

The code below does the job:
import xml.etree.ElementTree as ET
import pandas as pd
xml = '''<?xml version="1.0" encoding="UTF-8"?>
<trackingresponse>
<trackingdetails>
<trackingdetail>
<trackingnumber>1550161004</trackingnumber>
<trackingevents>
<trackingevent>
<date>2020-10-21T11:04:00+01:00</date>
<code>17</code>
</trackingevent>
<trackingevent>
<date>2020-10-21T08:41:00+01:00</date>
<code>18</code>
</trackingevent>
</trackingevents>
</trackingdetail>
</trackingdetails>
</trackingresponse>'''
root = ET.fromstring(xml)
data = [{'date': e.find('date').text, 'code': e.find('code').text} for e in root.findall('.//trackingevent')]
df = pd.DataFrame(data)
print(df)
Output:
                        date code
0  2020-10-21T11:04:00+01:00   17
1  2020-10-21T08:41:00+01:00   18

You can use this example to parse the XML with etree (note: you're missing </trackingevents> in your XML snippet, probably a typo):
import pandas as pd
import xml.etree.ElementTree as et
tree = et.ElementTree(file='<your file.xml>')
data = []
for ev in tree.findall('.//trackingevent'):
    date = ev.find('date').text
    code = ev.find('code').text
    data.append({
        'date': date,
        'code': code
    })
df = pd.DataFrame(data)
print(df)
Prints:
                        date code
0  2020-10-21T11:04:00+01:00   17
1  2020-10-21T08:41:00+01:00   18
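
Both answers call .find(...).text, which raises AttributeError as soon as an event lacks one of the children. A slightly more defensive sketch uses findtext(), which returns None for a missing child, and also carries the parent's tracking number onto each row. In the sample below the second event deliberately has no code element; that change is an assumption made to show the behaviour and is not part of the original data.

```python
import xml.etree.ElementTree as ET

# Sample based on the question; <code> was removed from the second event on purpose.
XML_TEXT = """<trackingresponse>
  <trackingdetails>
    <trackingdetail>
      <trackingnumber>1550161004</trackingnumber>
      <trackingevents>
        <trackingevent>
          <date>2020-10-21T11:04:00+01:00</date>
          <code>17</code>
        </trackingevent>
        <trackingevent>
          <date>2020-10-21T08:41:00+01:00</date>
        </trackingevent>
      </trackingevents>
    </trackingdetail>
  </trackingdetails>
</trackingresponse>"""

root = ET.fromstring(XML_TEXT)
rows = []
# ElementTree elements have no parent pointer, so walk from each detail down to its events.
for detail in root.iter("trackingdetail"):
    number = detail.findtext("trackingnumber")
    for event in detail.iter("trackingevent"):
        rows.append({
            "trackingnumber": number,
            "date": event.findtext("date"),
            "code": event.findtext("code"),  # None when <code> is absent
        })

for row in rows:
    print(row)
```

The list of dicts can be handed to pd.DataFrame(rows) in the same way as in the answers above.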

How to remove new lines in dataframe while writing xml into it?

I'm trying to save an XML response from an API into a dataframe by iterating over the TrackingEvent tags.
The XML response looks like this:
<?xml version="1.0" encoding="UTF-8"?>
<TrackingResult>
<Events>
<TrackingEvent>
<DateTimeStamp>202010</DateTimeStamp>
<Event>Delivered</Event>
<ExtraInfo>02921</ExtraInfo>
</TrackingEvent>
<TrackingEvent>
<DateTimeStamp>202010</DateTimeStamp>
<Event>Delivery today</Event>
<ExtraInfo>31916</ExtraInfo>
</TrackingEvent>
</Events>
<Signatures />
<Errors />
</TrackingResult>
I am writing this into a dataframe using this code:
response_status = requests.get(url, headers = headers)
print(type(response_status)) #<class 'requests.models.Response'>
print(type(response_status.content)) #<class 'bytes'>
tree = ET.fromstring(response_status.content)
for child in tree.iter('TrackingEvent'):
    for elem in child.iter():
        data = {str(elem.tag):[elem.text]}
        if str(elem.text)=='None' :continue
        response_df = pd.DataFrame(data)
        consolidated_list.append(response_df)
consolidated_df = pd.concat(consolidated_list,ignore_index=True)
print(consolidated_df)
This is the output dataframe I'm getting:
DateTimeStamp Event ExtraInfo
202010
Delivered
02921
202010
Delivery today
31916
I want to remove these empty cells so that each TrackingEvent ends up in a single dataframe row.
Expected output:
DateTimeStamp  Event           ExtraInfo
202010         Delivered       02921
202010         Delivery today  31916

I would recommend building a dict in a loop and then creating the dataframe from that dict. Here is an example:
xml = '''<?xml version="1.0" ?>
<TrackingResult>
<Events>
<TrackingEvent>
<DateTimeStamp>202010</DateTimeStamp>
<Event>Delivered</Event>
<ExtraInfo>02921</ExtraInfo>
</TrackingEvent>
<TrackingEvent>
<DateTimeStamp>202010</DateTimeStamp>
<Event>Delivery today</Event>
<ExtraInfo>31916</ExtraInfo>
</TrackingEvent>
</Events>
<Signatures />
<Errors />
</TrackingResult>'''
from lxml import etree as ET
import pandas as pd
from collections import defaultdict
d = defaultdict(list)
tree = ET.fromstring(xml)
for child in tree.iter('TrackingEvent'):
    for elem in child.iter():
        if (elem.text is not None and str(elem.text).strip() != ''):
            d[elem.tag].append(elem.text.strip())
        else:
            if len(list(elem)) == 0:
                d[elem.tag].append(None)
df = pd.DataFrame(d)
print(df)
Output:
  DateTimeStamp           Event ExtraInfo
0        202010       Delivered     02921
1        202010  Delivery today     31916
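
The staircase output in the question comes from building a one-column DataFrame for every single element: pd.concat then stacks those frames as separate rows and leaves the other columns empty. An alternative to the column-wise defaultdict is to build one dict per TrackingEvent, so each event is a row from the start. The following is a minimal sketch using the standard library parser; the variable names are assumptions.

```python
import xml.etree.ElementTree as ET

XML_TEXT = """<TrackingResult>
  <Events>
    <TrackingEvent>
      <DateTimeStamp>202010</DateTimeStamp>
      <Event>Delivered</Event>
      <ExtraInfo>02921</ExtraInfo>
    </TrackingEvent>
    <TrackingEvent>
      <DateTimeStamp>202010</DateTimeStamp>
      <Event>Delivery today</Event>
      <ExtraInfo>31916</ExtraInfo>
    </TrackingEvent>
  </Events>
  <Signatures />
  <Errors />
</TrackingResult>"""

root = ET.fromstring(XML_TEXT)

# Iterating an element yields its direct children, so each event maps to exactly one dict.
rows = [
    {field.tag: (field.text or "").strip() for field in event}
    for event in root.iter("TrackingEvent")
]

for row in rows:
    print(row)
```

Passing rows to pd.DataFrame(rows) then gives one dataframe row per event.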

python get xml element by path

I am trying to walk through a large XML file and collect some data. As the data can be located by its path, I used XPath, but I get no result.
Could someone suggest what I am doing wrong?
Example of the xml:
<?xml version="1.0" encoding="UTF-8"?>
<rootnode>
<subnode1>
</subnode1>
<subnode2>
</subnode2>
<subnode3>
<listnode>
<item id="1"><name>test name1</name></item>
<item id="2"><name>test name2</name></item>
<item id="3"><name>test name3</name></item>
</listnode>
</subnode3>
</rootnode>
The code:
import lxml.etree as ET
tree = ET.parse('temp/temp.xml')
subtree = tree.xpath('./rootnode/subnode3/listnode')
for next_item in subtree:
    Id = next_item.attrib.get('id')
    name = next_item.find('name').text
    print('{:>20} - {:>20}'.format(name,Id))

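You are pretty close. Two things need to change. When xpath() is called on the parsed tree, a relative path such as ./rootnode/... is evaluated from the root element, which already is rootnode, so it matches nothing; use an absolute path starting with /rootnode instead. Also, the id attribute and the name child belong to the item elements, not to listnode, so you have to iterate over the items.
Example: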
import lxml.etree as ET
tree = ET.parse('temp/temp.xml')
subtree = tree.xpath('/rootnode/subnode3/listnode')
for next_item in subtree:
    for item in next_item.findall('item'):
        Id = item.attrib.get('id')
        name = item.find('name').text
        print('{:>20} - {:>20}'.format(name,Id))
OR
subtree = tree.xpath('/rootnode/subnode3/listnode/item')
for item in subtree:
    Id = item.attrib.get('id')
    name = item.find('name').text
    print('{:>20} - {:>20}'.format(name,Id))
Output:
          test name1 -                    1
          test name2 -                    2
          test name3 -                    3
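
If lxml is not available, the same lookup can be sketched with the standard library. xml.etree.ElementTree supports only a limited subset of XPath, and paths passed to findall() on the root element are relative to that element, so rootnode itself is left out of the path. The variable names below are assumptions.

```python
import xml.etree.ElementTree as ET

XML_TEXT = """<rootnode>
  <subnode1></subnode1>
  <subnode2></subnode2>
  <subnode3>
    <listnode>
      <item id="1"><name>test name1</name></item>
      <item id="2"><name>test name2</name></item>
      <item id="3"><name>test name3</name></item>
    </listnode>
  </subnode3>
</rootnode>"""

root = ET.fromstring(XML_TEXT)

# The path starts below the root element: subnode3 is a direct child of rootnode.
pairs = [
    (item.get("id"), item.findtext("name"))
    for item in root.findall("subnode3/listnode/item")
]

for item_id, name in pairs:
    print(f"{name:>20} - {item_id:>20}")
```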
