How to remove new lines in dataframe while writing xml into it?

How to remove new lines in dataframe while writing xml into it? - python

I'm trying to write xml response from an api, iterating over tag TrackingEvent and saving into a dataframe:
The xml response looks like this :
<?xml version="1.0" encoding="UTF-8"?>
<TrackingResult>
<Events>
<TrackingEvent>
<DateTimeStamp>202010</DateTimeStamp>
<Event>Delivered</Event>
<ExtraInfo>02921</ExtraInfo>
</TrackingEvent>
<TrackingEvent>
<DateTimeStamp>202010</DateTimeStamp>
<Event>Delivery today</Event>
<ExtraInfo>31916</ExtraInfo>
</TrackingEvent>
</Events>
<Signatures />
<Errors />
</TrackingResult>
I am writing this into a dataframe using this code :
response_status = requests.get(url, headers = headers)
print(type(response_status)) #<class 'requests.models.Response'>
print(type(response_status.content)) #<class 'bytes'>
tree = ET.fromstring(response_status.content)
for child in tree.iter('TrackingEvent'):
for elem in child.iter():
data = {str(elem.tag):[elem.text]}
if str(elem.text)=='None' :continue
response_df = pd.DataFrame(data)
consolidated_list.append(response_df)
consolidated_df = pd.concat(consolidated_list,ignore_index=True)
print(consolidated_df)
This is the output dataframe i'm getting :
DateTimeStamp Event ExtraInfo
202010
Delivered
02921
202010
Delivery today
31916
I want to remove these empty spaces to put one child iteration into a single dataframe row
expected output:
DateTimeStamp Event ExtraInfo
202010 Delivered 02921
202010 Delivery today 31916

I would recommend to build a dict in a loop and then create dataframe on basis of that dict. Here is an example:
xml = '''<?xml version="1.0" ?>
<TrackingResult>
<Events>
<TrackingEvent>
<DateTimeStamp>202010</DateTimeStamp>
<Event>Delivered</Event>
<ExtraInfo>02921</ExtraInfo>
</TrackingEvent>
<TrackingEvent>
<DateTimeStamp>202010</DateTimeStamp>
<Event>Delivery today</Event>
<ExtraInfo>31916</ExtraInfo>
</TrackingEvent>
</Events>
<Signatures />
<Errors />
</TrackingResult>'''
from lxml import etree as ET
import pandas as pd
from collections import defaultdict
d = defaultdict(list)
tree = ET.fromstring(xml)
for child in tree.iter('TrackingEvent'):
for elem in child.iter():
if (elem.text is not None and str(elem.text).strip() != ''):
d[elem.tag].append(elem.text.strip())
else:
if len(list(elem)) == 0:
d[elem.tag].append(None)
df = pd.DataFrame(d)
print(df)
Output:
DateTimeStamp Event ExtraInfo
0 202010 Delivered 02921
1 202010 Delivery today 31916

Related

Parse XML to CSV with python

I need to parse some XML to CSV. I am struggling getting the 'record' attribute to iterate. The code below can pull out the allocation text. How do I get the record product-id?
import xml.etree.ElementTree as ET
mytree = ET.parse('Salesforce_01_30_2023.xml')
myroot = mytree.getroot()
print(myroot)
for x in myroot.findall('record'):
product = myroot.attrib
inventory = x.find('allocation').text
print(product, inventory)
XML
<?xml version="1.0" encoding="UTF-8"?>
<records>
<record product-id="99124">
<allocation>15</allocation>
<allocation-timestamp>2023-01-30T15:03:39.598Z</allocation-timestamp>
<perpetual>false</perpetual>
<preorder-backorder-handling>none</preorder-backorder-handling>
<ats>15</ats>
</record>
<record product-id="011443">
<allocation>0</allocation>
<allocation-timestamp>2023-01-30T15:03:39.598Z</allocation-timestamp>
<perpetual>false</perpetual>
<preorder-backorder-handling>none</preorder-backorder-handling>
<ats>0</ats>
</record>

To get product-id number you can use .attrib["product-id"]:
import xml.etree.ElementTree as ET
mytree = ET.parse('Salesforce_01_30_2023.xml')
myroot = mytree.getroot()
for product in myroot.findall('record'):
inventory = product.find('allocation').text
print(product.attrib['product-id'], inventory)
Prints:
99124 15
011443 0

Option 1: You can use pandas DataFrame read_xml() and to_csv():
import pandas as pd
df = pd.read_xml("prod_id.xml", xpath=".//record")
df.to_csv('prod.csv')
print(df.to_string())
Output:
product-id allocation allocation-timestamp perpetual preorder-backorder-handling ats
0 99124 15 2023-01-30T15:03:39.598Z False none 15
1 11443 0 2023-01-30T15:03:39.598Z False none 0
CSV:
,product-id,allocation,allocation-timestamp,perpetual,preorder-backorder-handling,ats
0,99124,15,2023-01-30T15:03:39.598Z,False,none,15
1,11443,0,2023-01-30T15:03:39.598Z,False,none,0
Option 2, if you prefere the xml.etree.ElementTree.
xml attribute values can be searched by .get():
import xml.etree.ElementTree as ET
tree = ET.parse('prod_id.xml')
root = tree.getroot()
for elem in root.iter():
# print(elem.tag, elem.attrib, elem.text)
if elem.tag == "record":
print("Product-id:",elem.get('product-id'))
Output:
Product-id: 99124
Product-id: 011443

Ho to parse and get element of an xml using Python data frame

This is my XML string i am getting this as a message so it is not a file
<?xml version="1.0" encoding="UTF-8"?>
<OperationStatus xmlns:ns2="summaries">
<EventId>123456</EventId>
<notificationId>123456</notificationId>
<userDetails>
<clientId>client_1</clientId>
<userId>user_1</userId>
<groupIds>
<groupId>123456</groupId>
<groupId>123457</groupId>
</groupIds>
</userDetails>
</OperationStatus>
I want to get output in below format
message,code,Id
I have mentioned only three elements but i can have many more elements .
This is how i am trying but not getting the exact output
I started learning Python so excuse me for silly mistakes
from __future__ import print_function
import pandas as pd
def lambda_handler():
import xml.etree.ElementTree as et
xtree = et.parse('''<?xml version="1.0" encoding="UTF-8"?>
<OperationStatus xmlns:ns2="summaries">
<EventId>123456</EventId>
<notificationId>123456</notificationId>
<userDetails>
<clientId>client_1</clientId>
<userId>user_1</userId>
<groupIds>
<groupId>123456</groupId>
<groupId>123457</groupId>
</groupIds>
</userDetails>
</OperationStatus>''')
xroot = xtree.getroot()
df_cols = ["message", "code", "Id"]
rows = []
for node in xroot:
s_name = node.attrib.get("message")
s_mail = node.find("code").text if node is not None else None
s_grade = node.find("Id").text if node is not None else None
lambda_handler()

you can try using XPath, it will be easier to retrieve the wanted data
import xml.etree.ElementTree as et
import pandas as pd
xtree = et.fromstring("""<?xml version="1.0" encoding="UTF-8"?>
<name xmlns:ns2="summaries">
<message>5jb10x5rf7sp1fov5msgoof7r</message>
<code>COMPLETED</code>
<Id>dfkjlhgd98568y</Id>
</name>""")
keys = ["message", "code", "Id"]
data = {k: [xtree.find(".//"+k).text] for k in keys}
print(pd.DataFrame(data))
# Outputs:
# message code Id
# 0 5jb10x5rf7sp1fov5msgoof7r COMPLETED dfkjlhgd98568y

Is this the output you desire?
# !pip install xmltodict
import xmltodict
xml = """
<name xmlns:ns2="summaries">
<message>5jb10x5rf7sp1fov5msgoof7r</message>
<code>COMPLETED</code>
<Id>dfkjlhgd98568y</Id>
</name>
"""
d = xmltodict.parse(xml)
print(d['name']['message'])
print(d['name']['code'])
print(d['name']['Id'])
Output
5jb10x5rf7sp1fov5msgoof7r
COMPLETED
dfkjlhgd98568y
More info on xmltodict at https://github.com/martinblech/xmltodict

Given your string:
your_string='''\
<?xml version="1.0" encoding="UTF-8"?>
<name xmlns:ns2="summaries">
<message>5jb10x5rf7sp1fov5msgoof7r</message>
<code>COMPLETED</code>
<Id>dfkjlhgd98568y</Id>
</name>'''
Since this is a string, you would use .fromstring() rather than .parse(). That automatically finds the root node name for you (ie, no need to call .getroot()):
root = et.fromstring(your_string)
>>> root
<Element 'name' at 0x1050f51d0>
Once you have the data structure with name as the root, you can either iterate over the sub elements:
df_cols = ["message", "code", "Id"]
for node in root:
if node.tag in df_cols:
print({node.tag:node.text})
Prints:
{'message': '5jb10x5rf7sp1fov5msgoof7r'}
{'code': 'COMPLETED'}
{'Id': 'dfkjlhgd98568y'}
Or you can use an xpath query to find each element of interest:
for k in df_cols:
print({k:root.find(f'./{k}').text})
# same output
Now since a data frame can be constructed by {key:[list_of_elements],...} you can construct that type of dict from what we have built here:
df=pd.DataFrame({k:[root.find(f'./{k}').text] for k in df_cols})
If you have multiple elements, use findall:
df=pd.DataFrame({k:[x.text for x in root.findall(f'./{k}')] for k in df_cols})

How to iterate over an xml and save it into a dataframe in Python?

I have an XML and i'm trying to iterate over it and save it(just the tracking events part) into a dataframe.
this is the input XML:
<?xml version="1.0" encoding="UTF-8"?>
<trackingresponse>
<trackingdetails>
<trackingdetail>
<trackingnumber>1550161004</trackingnumber>
<trackingevents>
<trackingevent>
<date>2020-10-21T11:04:00+01:00</date>
<code>17</code>
</trackingevent>
<trackingevent>
<date>2020-10-21T08:41:00+01:00</date>
<code>18</code>
</trackingevent>
</trackingdetail>
</trackingdetails>
</trackingresponse>
i tried this code but it shows empty dataframe :
response =requests.post(endpoint_url, data=t, headers = headers).text
# response is correct
response_tree = ET.fromstring(response)
data = []
for el in response_tree.iter('./*'):
for i in el.iter('*'):
data.append(dict(i.items()))
df = pd.DataFrame(data)
print(df)
also i tried writing text values into a temp dataframe, but this wont either :
response_df = pd.read_csv('/home/test.csv')
response_df['date']= response_tree.find('.//date').text
response_df['code']= response_tree.find('.//code').text
i also tried this , but its giving me everything element as a new row :
for child in tree.iter('trackingevent'):
for elem in child.iter():
data = {str(elem.tag):[elem.text]}
if str(elem.text)=='None':continue
response_df = pd.DataFrame(data)
consolidated_list.append(response_df)
i'm just trying to get the tracking events inside the xml into a dataframe
expected dataframe:
date code
2020-10-21T11:04:00+01:00 17
2020-10-21T08:41:00+01:00 18

The below code does the job
import xml.etree.ElementTree as ET
import pandas as pd
xml = '''<?xml version="1.0" encoding="UTF-8"?>
<trackingresponse>
<trackingdetails>
<trackingdetail>
<trackingnumber>1550161004</trackingnumber>
<trackingevents>
<trackingevent>
<date>2020-10-21T11:04:00+01:00</date>
<code>17</code>
</trackingevent>
<trackingevent>
<date>2020-10-21T08:41:00+01:00</date>
<code>18</code>
</trackingevent>
</trackingevents>
</trackingdetail>
</trackingdetails>
</trackingresponse>'''
root = ET.fromstring(xml)
data = [{'date': e.find('date').text, 'code': e.find('code').text} for e in root.findall('.//trackingevent')]
df = pd.DataFrame(data)
print(df)
output
date code
0 2020-10-21T11:04:00+01:00 17
1 2020-10-21T08:41:00+01:00 18

You can use this example to parse the XML with etree (note: you're missing </trackingevents> in your XML snippet, probably a typo):
import pandas as pd
import xml.etree.ElementTree as et
tree = et.ElementTree(file='<your file.xml>')
data = []
for ev in tree.findall('.//trackingevent'):
date = ev.find('date').text
code = ev.find('code').text
data.append({
'date': date,
'code': code
})
df = pd.DataFrame(data)
print(df)
Prints:
date code
0 2020-10-21T11:04:00+01:00 17
1 2020-10-21T08:41:00+01:00 18

Parsing XML document into a pandas DataFrame

I have an XML file that looks like this:
<?xml version="1.0" encoding="utf-8"?>
<comments>
<row Id="1" PostId="2" Score="0" Text="(...)" CreationDate="2011-08-30T21:15:28.063" UserId="16" />
<row Id="2" PostId="17" Score="1" Text="(...)" CreationDate="2011-08-30T21:24:56.573" UserId="27" />
<row Id="3" PostId="26" Score="0" Text="(...)" UserId="9" />
</comments>
What I'm trying to do is to extract ID, Text and CreationDate colums into pandas DF and I've tried following:
import xml.etree.cElementTree as et
import pandas as pd
path = '/.../...'
dfcols = ['ID', 'Text', 'CreationDate']
df_xml = pd.DataFrame(columns=dfcols)
root = et.parse(path)
rows = root.findall('.//row')
for row in rows:
ID = row.find('Id')
text = row.find('Text')
date = row.find('CreationDate')
print(ID, text, date)
df_xml = df_xml.append(pd.Series([ID, text, date], index=dfcols), ignore_index=True)
print(df_xml)
But the output is:
None None None
How do I fix this?

As advised in this solution by gold member Python/pandas/numpy guru, #unutbu:
Never call DataFrame.append or pd.concat inside a for-loop. It leads to quadratic copying.
Therefore, consider parsing your XML data into a separate list then pass list into the DataFrame constructor in one call outside of any loop. In fact, you can pass nested lists with list comprehension directly into the constructor:
path = 'AttributesXMLPandas.xml'
dfcols = ['ID', 'Text', 'CreationDate']
root = et.parse(path)
rows = root.findall('.//row')
# NESTED LIST
xml_data = [[row.get('Id'), row.get('Text'), row.get('CreationDate')]
for row in rows]
df_xml = pd.DataFrame(xml_data, columns=dfcols)
print(df_xml)
# ID Text CreationDate
# 0 1 (...) 2011-08-30T21:15:28.063
# 1 2 (...) 2011-08-30T21:24:56.573
# 2 3 (...) None

Just a minor change in your code
ID = row.get('Id')
text = row.get('Text')
date = row.get('CreationDate')

Based on #Parfait solution, I wrote my version that gets the columns as a parameter and returns the Pandas DataFrame.
test.xml:
<?xml version="1.0" encoding="utf-8"?>
<comments>
<row Id="1" PostId="2" Score="0" Text="(.1.)" CreationDate="2011-08-30T21:15:28.063" UserId="16" />
<row Id="2" PostId="17" Score="1" Text="(.2.)" CreationDate="2011-08-30T21:24:56.573" UserId="27" />
<row Id="3" PostId="26" Score="0" Text="(.3.)" UserId="9" />
</comments>
xml_to_pandas.py:
'''Xml to Pandas DataFrame Convertor.'''
import xml.etree.cElementTree as et
import pandas as pd
def xml_to_pandas(root, columns, row_name):
'''get xml.etree root, the columns and return Pandas DataFrame'''
df = None
try:
rows = root.findall('.//{}'.format(row_name))
xml_data = [[row.get(c) for c in columns] for row in rows] # NESTED LIST
df = pd.DataFrame(xml_data, columns=columns)
except Exception as e:
print('[xml_to_pandas] Exception: {}.'.format(e))
return df
path = 'test.xml'
row_name = 'row'
columns = ['ID', 'Text', 'CreationDate']
root = et.parse(path)
df = xml_to_pandas(root, columns, row_name)
print(df)
output:

Since pandas 1.3.0, there's a built-in pandas function pd.read_xml that reads XML documents into a pandas DataFrame.
path = """<?xml version="1.0" encoding="utf-8"?>
<comments>
<row Id="1" PostId="2" Score="0" Text="(...)" CreationDate="2011-08-30T21:15:28.063" UserId="16" />
<row Id="2" PostId="17" Score="1" Text="(...)" CreationDate="2011-08-30T21:24:56.573" UserId="27" />
<row Id="3" PostId="26" Score="0" Text="(...)" UserId="9" />
</comments>"""
# or a path to an XML doc
path = 'test.xml'
pd.read_xml(path)
The XML doc in the OP becomes the following by simply calling read_xml:

parsing serial numbers tags from xml python

I have an xml file a shorter version is as follows:
<?xml version="1.0" encoding="UTF-8"?>
<DATA>
<_1>
<member_id>AFCE6DB97D4CD67D</member_id>
</_1>
<_2>
<member_id>AFCE6DB97D4CD67D</member_id>
</_2>
</DATA>
I am using the following code to parse
tree = ElementTree.parse(args['inputxml'])
root = tree.getroot()
for dat in root:
memberID = dat.find('member_id').text
I am able to parse the member id but not sure how to parse the serial number <_1>``<_2>etc. This number keeps extending with every new record in xml.

You could use xpath():
xml = """<?xml version="1.0" encoding="UTF-8"?>
<DATA>
<_1>
<member_id>AFCE6DB97D4CD67D</member_id>
</_1>
<_2>
<member_id>AFCE6DB97D4CD67D</member_id>
</_2>
</DATA>"""
root = etree.fromstring(xml)
members = root.xpath("//member_id")
for m in members:
print m.text, m.getparent().tag
This prints:
AFCE6DB97D4CD67D _1
AFCE6DB97D4CD67E _2

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

How to remove new lines in dataframe while writing xml into it? - python

Related

Parse XML to CSV with python

Ho to parse and get element of an xml using Python data frame

How to iterate over an xml and save it into a dataframe in Python?

Parsing XML document into a pandas DataFrame

parsing serial numbers tags from xml python

Categories

Resources