I need to convert some XML to CSV. I am struggling to read the attribute of each 'record' element while iterating. The code below can pull out the allocation text. How do I get the record's product-id?
import xml.etree.ElementTree as ET
mytree = ET.parse('Salesforce_01_30_2023.xml')
myroot = mytree.getroot()
print(myroot)
for x in myroot.findall('record'):
    product = myroot.attrib
    inventory = x.find('allocation').text
    print(product, inventory)
XML
<?xml version="1.0" encoding="UTF-8"?>
<records>
<record product-id="99124">
<allocation>15</allocation>
<allocation-timestamp>2023-01-30T15:03:39.598Z</allocation-timestamp>
<perpetual>false</perpetual>
<preorder-backorder-handling>none</preorder-backorder-handling>
<ats>15</ats>
</record>
<record product-id="011443">
<allocation>0</allocation>
<allocation-timestamp>2023-01-30T15:03:39.598Z</allocation-timestamp>
<perpetual>false</perpetual>
<preorder-backorder-handling>none</preorder-backorder-handling>
<ats>0</ats>
</record>
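To get the product-id value, read it from the record element itself with .attrib['product-id'] (your code reads myroot.attrib, which holds the attributes of the root element, not of the record):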
import xml.etree.ElementTree as ET
mytree = ET.parse('Salesforce_01_30_2023.xml')
myroot = mytree.getroot()
for product in myroot.findall('record'):
    inventory = product.find('allocation').text
    print(product.attrib['product-id'], inventory)
Prints:
99124 15
011443 0
Option 1: You can use pandas read_xml() together with DataFrame.to_csv():
import pandas as pd
df = pd.read_xml("prod_id.xml", xpath=".//record")
df.to_csv('prod.csv')
print(df.to_string())
Output:
   product-id  allocation      allocation-timestamp  perpetual preorder-backorder-handling  ats
0       99124          15  2023-01-30T15:03:39.598Z      False                        none   15
1       11443           0  2023-01-30T15:03:39.598Z      False                        none    0
CSV:
,product-id,allocation,allocation-timestamp,perpetual,preorder-backorder-handling,ats
0,99124,15,2023-01-30T15:03:39.598Z,False,none,15
1,11443,0,2023-01-30T15:03:39.598Z,False,none,0
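Note that the pandas output above shows 11443 for the record whose product-id is 011443: the leading zero disappears because the column is inferred as numeric. If the IDs must survive unchanged, one option is to write the CSV with the standard-library csv module, which keeps every value as text. A minimal sketch, assuming the record layout from the question (the inline XML string, the reduced set of columns, and the `records_to_csv` helper name are illustrative and not part of the original post):

```python
import csv
import io
import xml.etree.ElementTree as ET

# Sample data shortened from the question; in practice it would come from ET.parse(<file>).
XML = """<records>
  <record product-id="99124">
    <allocation>15</allocation>
    <ats>15</ats>
  </record>
  <record product-id="011443">
    <allocation>0</allocation>
    <ats>0</ats>
  </record>
</records>"""


def records_to_csv(xml_text):
    """Return CSV text with one row per <record> (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["product-id", "allocation", "ats"])
    for record in root.findall("record"):
        writer.writerow([
            record.get("product-id"),       # attribute on <record>, kept as text
            record.findtext("allocation"),  # text of the child element
            record.findtext("ats"),
        ])
    return out.getvalue()


csv_text = records_to_csv(XML)
print(csv_text)
```

Because nothing is converted to a number, the second data row keeps the ID as 011443.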
Option 2, if you prefer xml.etree.ElementTree:
XML attribute values can be read with .get():
import xml.etree.ElementTree as ET
tree = ET.parse('prod_id.xml')
root = tree.getroot()
for elem in root.iter():
    # print(elem.tag, elem.attrib, elem.text)
    if elem.tag == "record":
        print("Product-id:",elem.get('product-id'))
Output:
Product-id: 99124
Product-id: 011443
I need to access the tags in a UBL 2.1 document and modify them depending on user input in Python.
So I used the ElementTree library to access the tags and modify them.
Here is a sample of the XML:
<ns0:Invoice xmlns:ns0="urn:oasis:names:specification:ubl:schema:xsd:Invoice-2" xmlns:ns1="urn:oasis:names:specification:ubl:schema:xsd:CommonBasicComponents-2" xmlns:ns2="urn:oasis:names:specification:ubl:schema:xsd:CommonAggregateComponents-2">
<ns1:ProfileID>reporting:1.0</ns1:ProfileID>
<ns1:ID>0</ns1:ID>
<ns1:UUID>dbdf65eb-5d66-47e6-bb0c-a84bbf7baa30</ns1:UUID>
<ns1:IssueDate>2022-11-05</ns1:IssueDate>
The issue:
I want to access the tags, but they are not modified and the loop body is never entered.
I tried both of these approaches:
mytree = ET.parse('test.xml')
myroot = mytree.getroot()
for x in myroot.find({xmlns:ns1=urn:oasis:names:specification:ubl:schema:xsd:CommonBasicComponents-2}IssueDate}"):
    x.text = '1999'
mytree.write('test.xml')
mytree = ET.parse('test.xml')
myroot = mytree.getroot()
for x in myroot.iter('./Invoice/AllowanceCharge/ChargeIndicator'):
    x.text = str('true')
mytree.write('test.xml')
Neither of them worked or modified the tag.
So the question is: how can I reach a specific tag and modify it?
If you correct the namespace and the brackets in your for loop, it works for valid XML like the following (the root tag must be closed!):
Input:
<?xml version="1.0" encoding="utf-8"?>
<ns0:Invoice xmlns:ns0="urn:oasis:names:specification:ubl:schema:xsd:Invoice-2" xmlns:ns1="urn:oasis:names:specification:ubl:schema:xsd:CommonBasicComponents-2" xmlns:ns2="urn:oasis:names:specification:ubl:schema:xsd:CommonAggregateComponents-2">
<ns1:ProfileID>reporting:1.0</ns1:ProfileID>
<ns1:ID>0</ns1:ID>
<ns1:UUID>dbdf65eb-5d66-47e6-bb0c-a84bbf7baa30</ns1:UUID>
<ns1:IssueDate>2022-11-05</ns1:IssueDate>
</ns0:Invoice>
Your repaired code:
import xml.etree.ElementTree as ET
tree = ET.parse('test.xml')
root = tree.getroot()
for elem in root.findall("{urn:oasis:names:specification:ubl:schema:xsd:CommonBasicComponents-2}IssueDate"):
    elem.text = '1999'
tree.write('test_changed.xml', encoding='utf-8', xml_declaration=True)
ET.dump(root)
Output:
<ns0:Invoice xmlns:ns0="urn:oasis:names:specification:ubl:schema:xsd:Invoice-2" xmlns:ns1="urn:oasis:names:specification:ubl:schema:xsd:CommonBasicComponents-2">
<ns1:ProfileID>reporting:1.0</ns1:ProfileID>
<ns1:ID>0</ns1:ID>
<ns1:UUID>dbdf65eb-5d66-47e6-bb0c-a84bbf7baa30</ns1:UUID>
<ns1:IssueDate>1999</ns1:IssueDate>
</ns0:Invoice>
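Writing the full {uri}tag form gets long quickly. ElementTree's find() and findall() also accept a prefix-to-URI mapping as a second argument, which can make the lookup easier to read. A small sketch, assuming a shortened version of the invoice above; the prefix "cbc" and the names `NS` and `new_value` are arbitrary choices made for this example:

```python
import xml.etree.ElementTree as ET

XML = """<ns0:Invoice xmlns:ns0="urn:oasis:names:specification:ubl:schema:xsd:Invoice-2" xmlns:ns1="urn:oasis:names:specification:ubl:schema:xsd:CommonBasicComponents-2">
  <ns1:ID>0</ns1:ID>
  <ns1:IssueDate>2022-11-05</ns1:IssueDate>
</ns0:Invoice>"""

# The prefix "cbc" is a local alias; only the URI has to match the document.
NS = {"cbc": "urn:oasis:names:specification:ubl:schema:xsd:CommonBasicComponents-2"}

root = ET.fromstring(XML)
issue_date = root.find("cbc:IssueDate", NS)
if issue_date is not None:  # find() returns None when nothing matches
    issue_date.text = "1999"

new_value = root.find("cbc:IssueDate", NS).text
print(new_value)
```

The prefix used in the mapping does not need to be the one used in the file (ns1 here), since matching is done on the namespace URI.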
This is my XML string. I receive it as a message, so it is not a file:
<?xml version="1.0" encoding="UTF-8"?>
<OperationStatus xmlns:ns2="summaries">
<EventId>123456</EventId>
<notificationId>123456</notificationId>
<userDetails>
<clientId>client_1</clientId>
<userId>user_1</userId>
<groupIds>
<groupId>123456</groupId>
<groupId>123457</groupId>
</groupIds>
</userDetails>
</OperationStatus>
I want to get the output in the format below:
message,code,Id
I have mentioned only three elements, but there can be many more.
This is what I am trying, but I am not getting the expected output.
I have just started learning Python, so please excuse any silly mistakes.
from __future__ import print_function
import pandas as pd
def lambda_handler():
    import xml.etree.ElementTree as et
    xtree = et.parse('''<?xml version="1.0" encoding="UTF-8"?>
<OperationStatus xmlns:ns2="summaries">
<EventId>123456</EventId>
<notificationId>123456</notificationId>
<userDetails>
<clientId>client_1</clientId>
<userId>user_1</userId>
<groupIds>
<groupId>123456</groupId>
<groupId>123457</groupId>
</groupIds>
</userDetails>
</OperationStatus>''')
    xroot = xtree.getroot()
    df_cols = ["message", "code", "Id"]
    rows = []
    for node in xroot:
        s_name = node.attrib.get("message")
        s_mail = node.find("code").text if node is not None else None
        s_grade = node.find("Id").text if node is not None else None

lambda_handler()
You can try using XPath; it makes it easier to retrieve the data you want:
import xml.etree.ElementTree as et
import pandas as pd
xtree = et.fromstring("""<?xml version="1.0" encoding="UTF-8"?>
<name xmlns:ns2="summaries">
<message>5jb10x5rf7sp1fov5msgoof7r</message>
<code>COMPLETED</code>
<Id>dfkjlhgd98568y</Id>
</name>""")
keys = ["message", "code", "Id"]
data = {k: [xtree.find(".//"+k).text] for k in keys}
print(pd.DataFrame(data))
# Outputs:
# message code Id
# 0 5jb10x5rf7sp1fov5msgoof7r COMPLETED dfkjlhgd98568y
Is this the output you desire?
# !pip install xmltodict
import xmltodict
xml = """
<name xmlns:ns2="summaries">
<message>5jb10x5rf7sp1fov5msgoof7r</message>
<code>COMPLETED</code>
<Id>dfkjlhgd98568y</Id>
</name>
"""
d = xmltodict.parse(xml)
print(d['name']['message'])
print(d['name']['code'])
print(d['name']['Id'])
Output
5jb10x5rf7sp1fov5msgoof7r
COMPLETED
dfkjlhgd98568y
More info on xmltodict at https://github.com/martinblech/xmltodict
Given your string:
your_string='''\
<?xml version="1.0" encoding="UTF-8"?>
<name xmlns:ns2="summaries">
<message>5jb10x5rf7sp1fov5msgoof7r</message>
<code>COMPLETED</code>
<Id>dfkjlhgd98568y</Id>
</name>'''
Since this is a string, you would use .fromstring() rather than .parse(). It returns the root element directly (i.e., there is no need to call .getroot()):
root = et.fromstring(your_string)
>>> root
<Element 'name' at 0x1050f51d0>
Once you have the data structure with name as the root, you can either iterate over the sub elements:
df_cols = ["message", "code", "Id"]
for node in root:
    if node.tag in df_cols:
        print({node.tag:node.text})
Prints:
{'message': '5jb10x5rf7sp1fov5msgoof7r'}
{'code': 'COMPLETED'}
{'Id': 'dfkjlhgd98568y'}
Or you can use an xpath query to find each element of interest:
for k in df_cols:
    print({k:root.find(f'./{k}').text})
# same output
Since a DataFrame can be constructed from a dict of the form {key: [list_of_elements], ...}, you can build that kind of dict from what we have here:
df=pd.DataFrame({k:[root.find(f'./{k}').text] for k in df_cols})
If you have multiple elements, use findall:
df=pd.DataFrame({k:[x.text for x in root.findall(f'./{k}')] for k in df_cols})
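The same idea can be applied to the nested message from the question, where groupId occurs more than once. The following is only a sketch using the standard library: the choice of columns, the `row` dict, and the ";" separator for repeated values are assumptions made for illustration, not something the question specifies.

```python
import xml.etree.ElementTree as ET

MESSAGE = """<?xml version="1.0" encoding="UTF-8"?>
<OperationStatus xmlns:ns2="summaries">
  <EventId>123456</EventId>
  <notificationId>123456</notificationId>
  <userDetails>
    <clientId>client_1</clientId>
    <userId>user_1</userId>
    <groupIds>
      <groupId>123456</groupId>
      <groupId>123457</groupId>
    </groupIds>
  </userDetails>
</OperationStatus>"""

root = ET.fromstring(MESSAGE)

row = {
    "EventId": root.findtext("EventId"),
    "notificationId": root.findtext("notificationId"),
    # Nested elements are reached with a path relative to the root.
    "clientId": root.findtext("userDetails/clientId"),
    "userId": root.findtext("userDetails/userId"),
    # groupId repeats, so collect every value and join them into one cell.
    "groupIds": ";".join(g.text for g in root.findall(".//groupId")),
}
print(row)
```

A dict like this can be wrapped in a list and passed to pd.DataFrame to get a one-row frame.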
I have an XML document and I'm trying to iterate over it and save just the tracking events part into a dataframe.
This is the input XML:
<?xml version="1.0" encoding="UTF-8"?>
<trackingresponse>
<trackingdetails>
<trackingdetail>
<trackingnumber>1550161004</trackingnumber>
<trackingevents>
<trackingevent>
<date>2020-10-21T11:04:00+01:00</date>
<code>17</code>
</trackingevent>
<trackingevent>
<date>2020-10-21T08:41:00+01:00</date>
<code>18</code>
</trackingevent>
</trackingdetail>
</trackingdetails>
</trackingresponse>
I tried this code, but it produces an empty dataframe:
response =requests.post(endpoint_url, data=t, headers = headers).text
# response is correct
response_tree = ET.fromstring(response)
data = []
for el in response_tree.iter('./*'):
    for i in el.iter('*'):
        data.append(dict(i.items()))
df = pd.DataFrame(data)
print(df)
I also tried writing the text values into a temporary dataframe, but this doesn't work either:
response_df = pd.read_csv('/home/test.csv')
response_df['date']= response_tree.find('.//date').text
response_df['code']= response_tree.find('.//code').text
I also tried this, but it gives me every element as a new row:
for child in tree.iter('trackingevent'):
    for elem in child.iter():
        data = {str(elem.tag):[elem.text]}
        if str(elem.text)=='None':continue
        response_df = pd.DataFrame(data)
        consolidated_list.append(response_df)
I'm just trying to get the tracking events inside the XML into a dataframe.
expected dataframe:
date code
2020-10-21T11:04:00+01:00 17
2020-10-21T08:41:00+01:00 18
The code below does the job:
import xml.etree.ElementTree as ET
import pandas as pd
xml = '''<?xml version="1.0" encoding="UTF-8"?>
<trackingresponse>
<trackingdetails>
<trackingdetail>
<trackingnumber>1550161004</trackingnumber>
<trackingevents>
<trackingevent>
<date>2020-10-21T11:04:00+01:00</date>
<code>17</code>
</trackingevent>
<trackingevent>
<date>2020-10-21T08:41:00+01:00</date>
<code>18</code>
</trackingevent>
</trackingevents>
</trackingdetail>
</trackingdetails>
</trackingresponse>'''
root = ET.fromstring(xml)
data = [{'date': e.find('date').text, 'code': e.find('code').text} for e in root.findall('.//trackingevent')]
df = pd.DataFrame(data)
print(df)
output
date code
0 2020-10-21T11:04:00+01:00 17
1 2020-10-21T08:41:00+01:00 18
You can use this example to parse the XML with etree (note: you're missing </trackingevents> in your XML snippet, probably a typo):
import pandas as pd
import xml.etree.ElementTree as et
tree = et.ElementTree(file='<your file.xml>')
data = []
for ev in tree.findall('.//trackingevent'):
    date = ev.find('date').text
    code = ev.find('code').text
    data.append({
        'date': date,
        'code': code
    })
df = pd.DataFrame(data)
print(df)
Prints:
date code
0 2020-10-21T11:04:00+01:00 17
1 2020-10-21T08:41:00+01:00 18
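One thing to watch with ev.find('date').text is that find() returns None when a child is missing, so the .text access raises AttributeError. If some events in a real response might lack a field, findtext() may be a safer choice, because it returns None instead. A minimal sketch with a shortened XML sample in which the second event deliberately has no code element (that omission is an assumption made for the example):

```python
import xml.etree.ElementTree as ET

XML = """<trackingevents>
  <trackingevent>
    <date>2020-10-21T11:04:00+01:00</date>
    <code>17</code>
  </trackingevent>
  <trackingevent>
    <date>2020-10-21T08:41:00+01:00</date>
  </trackingevent>
</trackingevents>"""

root = ET.fromstring(XML)

# findtext() gives the child's text, or None when the child does not exist.
rows = [
    {"date": ev.findtext("date"), "code": ev.findtext("code")}
    for ev in root.findall(".//trackingevent")
]
print(rows)
```

Passing rows to pd.DataFrame then yields a missing value in that cell rather than an exception.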
I'm trying to take the XML response from an API, iterate over the TrackingEvent tags, and save them into a dataframe.
The XML response looks like this:
<?xml version="1.0" encoding="UTF-8"?>
<TrackingResult>
<Events>
<TrackingEvent>
<DateTimeStamp>202010</DateTimeStamp>
<Event>Delivered</Event>
<ExtraInfo>02921</ExtraInfo>
</TrackingEvent>
<TrackingEvent>
<DateTimeStamp>202010</DateTimeStamp>
<Event>Delivery today</Event>
<ExtraInfo>31916</ExtraInfo>
</TrackingEvent>
</Events>
<Signatures />
<Errors />
</TrackingResult>
I am writing this into a dataframe using this code :
response_status = requests.get(url, headers = headers)
print(type(response_status)) #<class 'requests.models.Response'>
print(type(response_status.content)) #<class 'bytes'>
tree = ET.fromstring(response_status.content)
for child in tree.iter('TrackingEvent'):
    for elem in child.iter():
        data = {str(elem.tag):[elem.text]}
        if str(elem.text)=='None' :continue
        response_df = pd.DataFrame(data)
        consolidated_list.append(response_df)
consolidated_df = pd.concat(consolidated_list,ignore_index=True)
print(consolidated_df)
This is the output dataframe I'm getting:
DateTimeStamp  Event           ExtraInfo
202010
               Delivered
                               02921
202010
               Delivery today
                               31916
I want to remove these empty cells so that each TrackingEvent ends up in a single dataframe row.
expected output:
DateTimeStamp Event ExtraInfo
202010 Delivered 02921
202010 Delivery today 31916
I would recommend building a dict in a loop and then creating the dataframe from that dict. Here is an example:
xml = '''<?xml version="1.0" ?>
<TrackingResult>
<Events>
<TrackingEvent>
<DateTimeStamp>202010</DateTimeStamp>
<Event>Delivered</Event>
<ExtraInfo>02921</ExtraInfo>
</TrackingEvent>
<TrackingEvent>
<DateTimeStamp>202010</DateTimeStamp>
<Event>Delivery today</Event>
<ExtraInfo>31916</ExtraInfo>
</TrackingEvent>
</Events>
<Signatures />
<Errors />
</TrackingResult>'''
from lxml import etree as ET
import pandas as pd
from collections import defaultdict
d = defaultdict(list)
tree = ET.fromstring(xml)
for child in tree.iter('TrackingEvent'):
    for elem in child.iter():
        if (elem.text is not None and str(elem.text).strip() != ''):
            d[elem.tag].append(elem.text.strip())
        else:
            if len(list(elem)) == 0:
                d[elem.tag].append(None)
df = pd.DataFrame(d)
print(df)
Output:
DateTimeStamp Event ExtraInfo
0 202010 Delivered 02921
1 202010 Delivery today 31916
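An alternative to one list per column is one dict per event, which keeps the values of a single TrackingEvent together by construction. The sketch below uses the standard-library ElementTree rather than lxml and assumes every child of TrackingEvent is a simple text element, as in the sample; the name `events` is chosen for this example.

```python
import xml.etree.ElementTree as ET

XML = """<TrackingResult>
  <Events>
    <TrackingEvent>
      <DateTimeStamp>202010</DateTimeStamp>
      <Event>Delivered</Event>
      <ExtraInfo>02921</ExtraInfo>
    </TrackingEvent>
    <TrackingEvent>
      <DateTimeStamp>202010</DateTimeStamp>
      <Event>Delivery today</Event>
      <ExtraInfo>31916</ExtraInfo>
    </TrackingEvent>
  </Events>
  <Signatures />
  <Errors />
</TrackingResult>"""

root = ET.fromstring(XML)

# One dict per <TrackingEvent>: iterating an element yields its direct children.
events = [
    {child.tag: (child.text or "").strip() for child in event}
    for event in root.iter("TrackingEvent")
]
print(events)
```

pd.DataFrame(events) would then produce one row per event, with the child tags as columns.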
I am trying to walk through a large XML file and collect some data. As the location of the data can be found by its path, I used XPath, but I get no result.
Could someone suggest what I am doing wrong?
Example of the xml:
<?xml version="1.0" encoding="UTF-8"?>
<rootnode>
<subnode1>
</subnode1>
<subnode2>
</subnode2>
<subnode3>
<listnode>
<item id="1"><name>test name1</name></item>
<item id="2"><name>test name2</name></item>
<item id="3"><name>test name3</name></item>
</listnode>
</subnode3>
</rootnode>
The code:
import lxml.etree as ET
tree = ET.parse('temp/temp.xml')
subtree = tree.xpath('./rootnode/subnode3/listnode')
for next_item in subtree:
    Id = next_item.attrib.get('id')
    name = next_item.find('name').text
    print('{:>20} - {:>20}'.format(name,Id))
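You are pretty close. A relative path such as './rootnode/...' is evaluated from the root element, which is already rootnode, so it matches nothing; and the id attribute lives on the item elements, not on listnode.
Ex: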
import lxml.etree as ET
tree = ET.parse('temp/temp.xml')
subtree = tree.xpath('/rootnode/subnode3/listnode')
for next_item in subtree:
    for item in next_item.findall('item'):
        Id = item.attrib.get('id')
        name = item.find('name').text
        print('{:>20} - {:>20}'.format(name,Id))
OR
subtree = tree.xpath('/rootnode/subnode3/listnode/item')
for item in subtree:
    Id = item.attrib.get('id')
    name = item.find('name').text
    print('{:>20} - {:>20}'.format(name,Id))
Output:
          test name1 -                    1
          test name2 -                    2
          test name3 -                    3
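For comparison, here is roughly the same lookup with the standard-library xml.etree.ElementTree instead of lxml. It supports only a subset of XPath and has no absolute paths on elements, so the path is written relative to the root element and therefore does not repeat rootnode. This is a sketch on the sample document; the name `pairs` is made up for the example.

```python
import xml.etree.ElementTree as ET

XML = """<rootnode>
  <subnode1>
  </subnode1>
  <subnode2>
  </subnode2>
  <subnode3>
    <listnode>
      <item id="1"><name>test name1</name></item>
      <item id="2"><name>test name2</name></item>
      <item id="3"><name>test name3</name></item>
    </listnode>
  </subnode3>
</rootnode>"""

root = ET.fromstring(XML)  # root is the <rootnode> element itself

# The path starts at the children of <rootnode>, so "rootnode" is not part of it.
pairs = [
    (item.get("id"), item.findtext("name"))
    for item in root.findall("./subnode3/listnode/item")
]
for item_id, name in pairs:
    print('{:>20} - {:>20}'.format(name, item_id))
```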