How to iterate over an XML file to extract some attributes? - python

I'm trying to extract the values from an XML file and save them as a dataframe. For each line element, I'd like to add the date from its parent chk element.
<?xml version="1.0" encoding="ISO-8859-1"?>
<sales>
<chk no="xxx" date="xxxx" time="xxx" total="xxxx" debtor="xxxx" name="xxx" cardnumber="xxxxxxx" mobil="" >
<line productId="xxxx" product="xxxx" productGroupId="xxx" productGroup="xxx" amount="x" price="xxx" />
<line productId="xxx" product="xxx" productGroupId="xxx" productGroup="xxx" amount="xx" price="xxxx" />
</chk>
<chk no="xxx" date="xxxx" time="xx" total="xxxx" debtor="xxxx" name="xxxx" cardnumber="xxxx" mobil="xxxxx" >
<line productId="xxxx" product="xxxxx" productGroupId="xxxx" productGroup="xxx" amount="xxxx" price="xxxx" />
<line productId="xxxxx" product="xxxxx" productGroupId="xxxx" productGroup="xxxx" amount="xxx" price="xxxxx" />
</chk>
</sales>
root = ET.fromstring(response.content)
sales = []
for date in root.iter('chk'):
    sales.append(date.attrib)
lines = []
for line in root.iter('line'):
    lines.append(line.attrib)
I am able to extract the chk and line elements separately. How can I add the date to each entry in the lines list?

import xml.etree.ElementTree as ET
xml = '''<?xml version="1.0" encoding="ISO-8859-1"?>
<sales>
<chk no="xxx" date="xxxx" time="xxx" total="xxxx" debtor="xxxx" name="xxx" cardnumber="xxxxxxx" mobil="" >
<line productId="xxxx" product="xxxx" productGroupId="xxx" productGroup="xxx" amount="x" price="xxx" />
<line productId="xxx" product="xxx" productGroupId="xxx" productGroup="xxx" amount="xx" price="xxxx" />
</chk>
<chk no="xxx" date="zzzz" time="xx" total="xxxx" debtor="xxxx" name="xxxx" cardnumber="xxxx" mobil="xxxxx" >
<line productId="xxxx" product="xxxxx" productGroupId="xxxx" productGroup="xxx" amount="xxxx" price="xxxx" />
<line productId="xxxxx" product="xxxxx" productGroupId="xxxx" productGroup="xxxx" amount="xxx" price="xxxxx" />
</chk>
</sales>'''
root = ET.fromstring(xml)
for chk in root.findall('.//chk'):
    for line in chk.findall('line'):
        line.attrib['date'] = chk.attrib['date']
ET.dump(root)

Iterate over the line elements inside the chk loop, using the chk element (date below) instead of root as the object to iterate over. Something like this:
root = ET.fromstring(resp)
for date in root.iter('chk'):
    for line in date.iter('line'):
        print(date.attrib,line.attrib)
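To get one record per line element with the parent's date attached, one option is to copy each line's attributes into a new dict rather than mutating the tree. This is a minimal sketch, assuming the structure shown in the question; the function name lines_with_date and the sample values are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Shortened sample with made-up values, same structure as the question's XML.
SAMPLE = """<sales>
  <chk no="1" date="2023-01-01">
    <line productId="10" amount="2" />
    <line productId="11" amount="1" />
  </chk>
  <chk no="2" date="2023-01-02">
    <line productId="12" amount="5" />
  </chk>
</sales>"""


def lines_with_date(root):
    """Return one dict per <line>, with the parent <chk> date added."""
    records = []
    for chk in root.iter('chk'):
        for line in chk.iter('line'):
            record = dict(line.attrib)  # copy, so the tree is left untouched
            record['date'] = chk.get('date')
            records.append(record)
    return records


records = lines_with_date(ET.fromstring(SAMPLE))
for record in records:
    print(record)
```

A list of dicts like this can be passed to pandas.DataFrame(records) to build the dataframe the question asks for.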

Related

How to Extract the Information from XML Soap Response?

We have a requirement to get the data from a SOAP XML Response.
Below is the associated XML file
<?xml version="1.0" encoding="utf-8"?>
<soap:Envelope xmlns:soap="http://www.w3.org/2003/05/soap-envelope" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema">
<soap:Body>
<GetResultResponse xmlns="http://www.relatics.com/">
<GetResultResult>
<Report ReportName="RFC" GeneratedOn="2022-12-22" EnvironmentID="XXXX" EnvironmentName="Systematic Assurance – an XXX Solution" EnvironmentURL="https://XXXX.relaticsonline.com/" WorkspaceID="XXXXX" WorkspaceName="P - ADL Program Management - XXX" TargetDevice="Pc" ReportField="" xmlns="">
<Change_module>
<applied_individual_change_request Change_Request="TestKZIreport" RFC_GUID="XXXXX">
<code RFC_Code="VtW-0101" />
<progress RFC_Progress="agreed" />
<applied_individual_project_organisation Organisation="XXXX" />
<applied__individual_discipline Discipline="Highways" />
<specification Specification="Context of Documents">
<code Specification_Code="1.1.1a" />
</specification>
<applied_individual_workpackage Workpackage="Enabling work">
<code Workpackage_Code="WP-01" />
</applied_individual_workpackage>
<physical_object Physical_Object="Train Station">
<code Physical_Object_Code="TFO-0001" />
</physical_object>
<person approver="XXX" />
<applied_individual_change_consequence_qualification Consequence_Value="10 days">
<applied_conceptual_change_consequence_aspect Consequence_Aspect="Schedule" />
</applied_individual_change_consequence_qualification>
<document Document_Name="WI 300 Design.pdf">
<code Document_Code="DOC-0002" />
</document>
<answer_status BR_Status="no" />
<applied_individual_business_rule Business_Rule="Change Review compliance">
<code BR_Code="BR-006" />
</applied_individual_business_rule>
<applied_individual_change_consequence_qualification Consequence_Value="XXX">
<applied_conceptual_change_consequence_aspect Consequence_Aspect="Finance" />
</applied_individual_change_consequence_qualification>
</applied_individual_change_request>
</Change_module>
</Report>
</GetResultResult>
</GetResultResponse>
</soap:Body>
</soap:Envelope>
I need all the attribute values under Change_module. I tried some suggestions from Stack Overflow, but they didn't work.
I have never worked with XML documents before, and here is the sample code I
tried from Stack Overflow.
import xml.etree.ElementTree as ET
import pandas as pd
import numpy as np
tree = ET.parse("Relatics_XML.xml")
root = tree.getroot()
print(root.tag)
print(root.attrib)
namespaces = {"soap": "http://www.w3.org/2003/05/soap-envelope/",
              "xsi": "http://www.w3.org/2001/XMLSchema-instance",
              "xsd": "http://www.w3.org/2001/XMLSchema/",
              'a': 'http://www.relatics.com/',}
names = tree.findall('./soap:Body''/a:GetResultResponse''/a:GetResultResult', namespaces)
print(names)
for name in names:
    print(name.text)
I tried different methods such as find and findall, and passed different values to them, but everything prints as empty.
I'm not sure how to get the values out of the tags.
Using xml.etree.ElementTree makes this easier.
See its documentation for details.
It can parse both tag attributes and inner text.
import xml.etree.ElementTree as ET
xml = """\
<?xml version="1.0" encoding="utf-8"?>
<soap:Envelope xmlns:soap="http://www.w3.org/2003/05/soap-envelope"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:xsd="http://www.w3.org/2001/XMLSchema">
<soap:Body>
<GetResultResponse xmlns="http://www.relatics.com/">
<GetResultResult>
<Report ReportName="RFC" GeneratedOn="2022-12-22" EnvironmentID="XXXX" EnvironmentName="Systematic Assurance – an XXX Solution" EnvironmentURL="https://XXXX.relaticsonline.com/" WorkspaceID="XXXXX" WorkspaceName="P - ADL Program Management - XXX" TargetDevice="Pc" ReportField=""
xmlns="">
<Change_module>
<applied_individual_change_request Change_Request="TestKZIreport" RFC_GUID="XXXXX">
<code RFC_Code="VtW-0101" />
<progress RFC_Progress="agreed" />
<applied_individual_project_organisation Organisation="XXXX" />
<applied__individual_discipline Discipline="Highways" />
<specification Specification="Context of Documents">
<code Specification_Code="1.1.1a" />
</specification>
<applied_individual_workpackage Workpackage="Enabling work">
<code Workpackage_Code="WP-01" />
</applied_individual_workpackage>
<physical_object Physical_Object="Train Station">
<code Physical_Object_Code="TFO-0001" />
</physical_object>
<person approver="XXX" />
<applied_individual_change_consequence_qualification Consequence_Value="10 days">
<applied_conceptual_change_consequence_aspect Consequence_Aspect="Schedule" />
</applied_individual_change_consequence_qualification>
<document Document_Name="WI 300 Design.pdf">
<code Document_Code="DOC-0002" />
</document>
<answer_status BR_Status="no" />
<applied_individual_business_rule Business_Rule="Change Review compliance">
<code BR_Code="BR-006" />
</applied_individual_business_rule>
<applied_individual_change_consequence_qualification Consequence_Value="XXX">
<applied_conceptual_change_consequence_aspect Consequence_Aspect="Finance" />
</applied_individual_change_consequence_qualification>
</applied_individual_change_request>
</Change_module>
</Report>
</GetResultResult>
</GetResultResponse>
</soap:Body>
</soap:Envelope>
"""
root = ET.fromstring(xml)
print("RFC_Code: " + str(root.find(".//code[@RFC_Code]").attrib))
print("RFC_Progress: " + str(root.find(".//progress[@RFC_Progress]").attrib))
print("specification: " + str(root.find(".//specification[@Specification]").attrib))
print("Specification_Code: " + str(root.find(".//code[@Specification_Code]").attrib))
print("Workpackage_Code: " + str(root.find(".//code[@Workpackage_Code]").attrib))
print("Document_Code: " + str(root.find(".//code[@Document_Code]").attrib))
Result
$ python get-data.py
RFC_Code: {'RFC_Code': 'VtW-0101'}
RFC_Progress: {'RFC_Progress': 'agreed'}
specification: {'Specification': 'Context of Documents'}
Specification_Code: {'Specification_Code': '1.1.1a'}
Workpackage_Code: {'Workpackage_Code': 'WP-01'}
Document_Code: {'Document_Code': 'DOC-0002'}
If you are reading the XML from a file, use this code instead:
with open('data.xml', 'r') as xml_file:
    root = ET.parse(xml_file).getroot()
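Since the question asks for every value under Change_module rather than a handful of named ones, a loop over the subtree may be more convenient. This is a rough sketch on a shortened copy of the response; the function name change_module_attributes is hypothetical. It relies on the Report element resetting the default namespace with xmlns="", so Change_module and its descendants can be matched without a namespace prefix.

```python
import xml.etree.ElementTree as ET

# Shortened sample; Report resets the default namespace with xmlns="",
# so Change_module and its descendants carry no namespace.
SAMPLE = """<soap:Envelope xmlns:soap="http://www.w3.org/2003/05/soap-envelope">
  <soap:Body>
    <GetResultResponse xmlns="http://www.relatics.com/">
      <GetResultResult>
        <Report ReportName="RFC" xmlns="">
          <Change_module>
            <applied_individual_change_request Change_Request="TestKZIreport">
              <code RFC_Code="VtW-0101" />
              <progress RFC_Progress="agreed" />
            </applied_individual_change_request>
          </Change_module>
        </Report>
      </GetResultResult>
    </GetResultResponse>
  </soap:Body>
</soap:Envelope>"""


def change_module_attributes(root):
    """Return (tag, attribute, value) triples for everything under Change_module."""
    module = root.find('.//Change_module')
    if module is None:
        return []
    return [(elem.tag, name, value)
            for elem in module.iter()          # document order, includes module itself
            for name, value in elem.attrib.items()]


triples = change_module_attributes(ET.fromstring(SAMPLE))
for tag, name, value in triples:
    print(f'{tag}: {name} = {value}')
```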

How to skip the <Br /> element when combining text in XML using Python?

I've been trying to combine all the text in the Content elements of an XML file using Python.
I succeeded in combining all the Content text, but I need to exclude any Content that comes right below a <Br /> element.
A <Br /> element represents a line break (Enter) in Adobe InDesign.
This XML was exported from Adobe InDesign.
Here is an example:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Root>
<Story>
<ParagraphStyleRange>
<CharacterStyleRange>
<Content>AAA</Content>
</CharacterStyleRange>
<CharacterStyleRange>
<Content>BBB</Content>
</CharacterStyleRange>
<CharacterStyleRange>
<Br />
<Content>CCC</Content>
<Br />
<Content>DDD</Content>
</CharacterStyleRange>
<CharacterStyleRange>
<Content>EEE</Content>
</CharacterStyleRange>
<CharacterStyleRange>
<Br />
<Content>FFF</Content>
<Br />
</CharacterStyleRange>
</ParagraphStyleRange>
</Story>
</Root>
and this is what I want:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Root>
<Story>
<ParagraphStyleRange>
<CharacterStyleRange>
<Content>AAA</Content>
</CharacterStyleRange>
<CharacterStyleRange>
<Content>AAABBB</Content>
</CharacterStyleRange>
<CharacterStyleRange>
<Br />
<Content>CCC</Content>
<Br />
<Content>DDD</Content>
</CharacterStyleRange>
<CharacterStyleRange>
<Content>DDDEEE</Content>
</CharacterStyleRange>
<CharacterStyleRange>
<Br />
<Content>FFF</Content>
<Br />
</CharacterStyleRange>
</ParagraphStyleRange>
</Story>
</Root>
As you can see, I don't want to prepend the previous Content text to the next one if there is a <Br /> element right above the Content being modified.
In detail, the first Content element's text is AAA and the next one is BBB.
In this case AAA should be attached in front of BBB.
BBB is not attached in front of CCC, because there is a <Br /> element right above the CCC Content.
Could you help me recognize the <Br /> element so that I can skip it?
This is the code I have so far, but it doesn't work well:
import xml.etree.ElementTree as ET
tree = ET.parse("C:\\Br_test.xml")
root = tree.getroot()
for ParagraphStyleRange in root.findall('.//Story/ParagraphStyleRange'):
    CharacterStyleRange_count = len(ParagraphStyleRange.findall('CharacterStyleRange'))
    #print(CharacterStyleRange_count)
    if int(CharacterStyleRange_count) >= 2 :
        try :
            Content_collect = ''
            for CharacterStyleRange in ParagraphStyleRange.findall('CharacterStyleRange'):
                Br_count = len(CharacterStyleRange.findall('Br'))
                print(Br_count)
                if int(Br_count) == 0 :
                    for Content in CharacterStyleRange.findall('Content'):
                        Content_collect += Content.text
                        Content.text = str(Content_collect)
                        print(Content_collect)
            #---- Code to delete Contents that are attached to next one---
            #for CharacterStyleRange in ParagraphStyleRange.findall('CharacterStyleRange')[:-1]:
            #    for Content in CharacterStyleRange.findall('Content'):
            #        Content_remove = CharacterStyleRange.remove(Content)
        except:
            pass
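One possible approach is to walk each paragraph in document order, carrying the text seen so far and clearing it whenever a Br element appears. The sketch below is only one reading of the desired output: it assumes the carried text accumulates across consecutive Content elements (the question shows only pairs), and the function name merge_content is made up.

```python
import xml.etree.ElementTree as ET

# Compact copy of the question's example document.
SAMPLE = """<Root><Story><ParagraphStyleRange>
<CharacterStyleRange><Content>AAA</Content></CharacterStyleRange>
<CharacterStyleRange><Content>BBB</Content></CharacterStyleRange>
<CharacterStyleRange><Br /><Content>CCC</Content><Br /><Content>DDD</Content></CharacterStyleRange>
<CharacterStyleRange><Content>EEE</Content></CharacterStyleRange>
<CharacterStyleRange><Br /><Content>FFF</Content><Br /></CharacterStyleRange>
</ParagraphStyleRange></Story></Root>"""


def merge_content(root):
    """Prepend carried-over text to each Content; a Br resets the carry."""
    for paragraph in root.iter('ParagraphStyleRange'):
        carried = ''
        # iter() visits descendants in document order, so Br and Content
        # elements are seen in the order they appear in the file.
        for elem in paragraph.iter():
            if elem.tag == 'Br':
                carried = ''
            elif elem.tag == 'Content':
                elem.text = carried + (elem.text or '')
                carried = elem.text


root = ET.fromstring(SAMPLE)
merge_content(root)
texts = [content.text for content in root.iter('Content')]
print(texts)
```

Tracking Br elements in document order, rather than counting them per CharacterStyleRange, is what lets DDD carry over into EEE even though its range contains Br elements.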

Python XML get immediate child elements only

I have an xml file as below:
<?xml version="1.0" encoding="utf-8"?>
<EDoc CID="1000101" Cname="somename" IName="iname" CSource="e1" Version="1.0">
<RIGLIST>
<RIG RIGID="100001" RIGName="RgName1">
<ListID>
<nodeA nodeAID="1000011" nodeAName="node1A" nodeAExtID="9000011" />
<nodeA nodeAID="1000012" nodeAName="node2A" nodeAExtID="9000012" />
<nodeA nodeAID="1000013" nodeAName="node3A" nodeAExtID="9000013" />
<nodeA nodeAID="1000014" nodeAName="node4A" nodeAExtID="9000014" />
<nodeA nodeAID="1000015" nodeAName="node5A" nodeAExtID="9000015" />
<nodeA nodeAID="1000016" nodeAName="node6A" nodeAExtID="9000016" />
<nodeA nodeAID="1000017" nodeAName="node7A" nodeAExtID="9000017" />
</ListID>
</RIG>
<RIG RIGID="100002" RIGName="RgName2">
<ListID>
<nodeA nodeAID="1000021" nodeAName="node1B" nodeAExtID="9000021" />
<nodeA nodeAID="1000022" nodeAName="node2B" nodeAExtID="9000022" />
<nodeA nodeAID="1000023" nodeAName="node3B" nodeAExtID="9000023" />
</ListID>
</RIG>
</RIGLIST>
</EDoc>
I need to search on the RIGName attribute and, if a match is found, print all the nodeAName values of the nodeA elements under that RIG.
Example:
Searching for RIGName = "RgName2" should print all the values as node1B, node2B, node3B
As of now I am only able to get the first part as below:
import xml.etree.ElementTree as eT
import re
xmlfilePath = "Path of xml file"
tree = eT.parse(xmlfilePath)
root = tree.getroot()
for elem in root.iter("RIGName"):
    # print(elem.tag, elem.attrib)
    if re.findall(searchtxt, elem.attrib['RIGName'], re.IGNORECASE):
        print(elem.attrib)
        count += 1
How can I get only the immediate child node values?
Switching from xml.etree to lxml would give you a way to do it in a single go because of a much better XPath query language support:
In [1]: from lxml import etree as ET
In [2]: tree = ET.parse('input.xml')
In [3]: root = tree.getroot()
In [4]: root.xpath('//RIG[@RIGName = "RgName2"]/ListID/nodeA/@nodeAName')
Out[4]: ['node1B', 'node2B', 'node3B']
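If installing lxml is not an option, the standard library's limited XPath subset can get close: it supports attribute-equality predicates, though it cannot return attribute values directly, so those are read with get(). A minimal sketch on a shortened copy of the document; the helper name node_names is hypothetical, and unlike the question's regex search it matches the name exactly and case-sensitively.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the question's document.
SAMPLE = """<EDoc><RIGLIST>
  <RIG RIGID="100001" RIGName="RgName1"><ListID>
    <nodeA nodeAID="1000011" nodeAName="node1A" />
  </ListID></RIG>
  <RIG RIGID="100002" RIGName="RgName2"><ListID>
    <nodeA nodeAID="1000021" nodeAName="node1B" />
    <nodeA nodeAID="1000022" nodeAName="node2B" />
    <nodeA nodeAID="1000023" nodeAName="node3B" />
  </ListID></RIG>
</RIGLIST></EDoc>"""


def node_names(root, rig_name):
    """Return nodeAName values for the RIG whose RIGName equals rig_name."""
    # Note: rig_name is interpolated as-is, so it must not contain a quote.
    path = f".//RIG[@RIGName='{rig_name}']/ListID/nodeA"
    return [node.get('nodeAName') for node in root.findall(path)]


names = node_names(ET.fromstring(SAMPLE), 'RgName2')
print(names)
```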

Parsing XML document into a pandas DataFrame

I have an XML file that looks like this:
<?xml version="1.0" encoding="utf-8"?>
<comments>
<row Id="1" PostId="2" Score="0" Text="(...)" CreationDate="2011-08-30T21:15:28.063" UserId="16" />
<row Id="2" PostId="17" Score="1" Text="(...)" CreationDate="2011-08-30T21:24:56.573" UserId="27" />
<row Id="3" PostId="26" Score="0" Text="(...)" UserId="9" />
</comments>
What I'm trying to do is extract the ID, Text and CreationDate columns into a pandas DataFrame, and I've tried the following:
import xml.etree.ElementTree as et
import pandas as pd
path = '/.../...'
dfcols = ['ID', 'Text', 'CreationDate']
df_xml = pd.DataFrame(columns=dfcols)
root = et.parse(path)
rows = root.findall('.//row')
for row in rows:
    ID = row.find('Id')
    text = row.find('Text')
    date = row.find('CreationDate')
    print(ID, text, date)
    df_xml = df_xml.append(pd.Series([ID, text, date], index=dfcols), ignore_index=True)
print(df_xml)
But the output is:
None None None
How do I fix this?
As advised in this solution by gold-badge Python/pandas/numpy guru @unutbu:
Never call DataFrame.append or pd.concat inside a for-loop. It leads to quadratic copying.
Therefore, consider parsing your XML data into a separate list, then passing that list to the DataFrame constructor in one call outside of any loop. In fact, you can build the nested list with a list comprehension and pass it directly to the constructor:
import xml.etree.ElementTree as et
import pandas as pd
path = 'AttributesXMLPandas.xml'
dfcols = ['ID', 'Text', 'CreationDate']
root = et.parse(path)
rows = root.findall('.//row')
# NESTED LIST
xml_data = [[row.get('Id'), row.get('Text'), row.get('CreationDate')]
            for row in rows]
df_xml = pd.DataFrame(xml_data, columns=dfcols)
print(df_xml)
#   ID   Text             CreationDate
# 0  1  (...)  2011-08-30T21:15:28.063
# 1  2  (...)  2011-08-30T21:24:56.573
# 2  3  (...)                     None
Just a minor change in your code
ID = row.get('Id')
text = row.get('Text')
date = row.get('CreationDate')
Based on @Parfait's solution, I wrote my own version that takes the columns as a parameter and returns the pandas DataFrame.
test.xml:
<?xml version="1.0" encoding="utf-8"?>
<comments>
<row Id="1" PostId="2" Score="0" Text="(.1.)" CreationDate="2011-08-30T21:15:28.063" UserId="16" />
<row Id="2" PostId="17" Score="1" Text="(.2.)" CreationDate="2011-08-30T21:24:56.573" UserId="27" />
<row Id="3" PostId="26" Score="0" Text="(.3.)" UserId="9" />
</comments>
xml_to_pandas.py:
'''Xml to Pandas DataFrame Convertor.'''
import xml.etree.ElementTree as et
import pandas as pd
def xml_to_pandas(root, columns, row_name):
    '''get xml.etree root, the columns and return Pandas DataFrame'''
    df = None
    try:
        rows = root.findall('.//{}'.format(row_name))
        xml_data = [[row.get(c) for c in columns] for row in rows] # NESTED LIST
        df = pd.DataFrame(xml_data, columns=columns)
    except Exception as e:
        print('[xml_to_pandas] Exception: {}.'.format(e))
    return df
path = 'test.xml'
row_name = 'row'
# Column names must match the XML attribute names exactly ('Id', not 'ID').
columns = ['Id', 'Text', 'CreationDate']
root = et.parse(path)
df = xml_to_pandas(root, columns, row_name)
print(df)
output:
Since pandas 1.3.0, there's a built-in pandas function pd.read_xml that reads XML documents into a pandas DataFrame.
path = """<?xml version="1.0" encoding="utf-8"?>
<comments>
<row Id="1" PostId="2" Score="0" Text="(...)" CreationDate="2011-08-30T21:15:28.063" UserId="16" />
<row Id="2" PostId="17" Score="1" Text="(...)" CreationDate="2011-08-30T21:24:56.573" UserId="27" />
<row Id="3" PostId="26" Score="0" Text="(...)" UserId="9" />
</comments>"""
# or a path to an XML doc
path = 'test.xml'
pd.read_xml(path)
The XML doc in the OP becomes the following by simply calling read_xml:

Python ElementTree xml output to csv

I have the following XML file ('registerreads_EE.xml'):
<?xml version="1.0" encoding="us-ascii" standalone="yes"?>
<ReadingDocument xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<ReadingStatusRefTable>
<ReadingStatusRef Ref="1">
<UnencodedStatus SourceValidation="SIMPLE">
<StatusCodes>
<Signal>XX</Signal>
</StatusCodes>
</UnencodedStatus>
</ReadingStatusRef>
</ReadingStatusRefTable>
<Header>
<IEE_System Id="XXXXXXXXXXXXXXX" />
<Creation_Datetime Datetime="2015-10-22T09:05:32Z" />
<Timezone Id="UTC" />
<Path FilePath="X:\XXXXXXXXXXXX.xml" />
<Export_Template Id="XXXXX" />
<CorrelationID Id="" />
</Header>
<ImportExportParameters ResubmitFile="false" CreateGroup="true">
<DataFormat TimestampType="XXXXXX" Type="XXXX" />
</ImportExportParameters>
<Channels>
<Channel StartDate="2015-10-21T00:00:00-05:00" EndDate="2015-10-22T00:00:00-05:00">
<ChannelID ServicePointChannelID="73825603:301" />
<Readings>
<Reading Value="3577.0" ReadingTime="2015-10-21T00:00:00-05:00" StatusRef="1" />
<Reading Value="3601.3" ReadingTime="2015-10-22T00:00:00-05:00" StatusRef="1" />
</Readings>
<ExportRequest RequestID="152" EntityType="ServicePoint" EntityID="73825603" RequestSource="Scheduled" />
</Channel>
<Channel StartDate="2015-10-21T00:00:00-05:00" EndDate="2015-10-22T00:00:00-05:00">
<ChannelID ServicePointChannelID="73825604:301" />
<Readings>
<Reading Value="3462.5" ReadingTime="2015-10-21T00:00:00-05:00" StatusRef="1" />
<Reading Value="3501.5" ReadingTime="2015-10-22T00:00:00-05:00" StatusRef="1" />
</Readings>
<ExportRequest RequestID="152" EntityType="ServicePoint" EntityID="73825604" RequestSource="Scheduled" />
</Channel>
</Channels>
</ReadingDocument>
I want to parse the XML of the channel data into a csv file.
Here is what I have written in Python 2.7.10:
import xml.etree.ElementTree as ET
tree = ET.parse('registerreads_EE.xml')
root = tree.getroot()[3]
for channel in tree.iter('Channel'):
    for exportrequest in channel.iter('ExportRequest'):
        entityid = exportrequest.attrib.get('EntityID')
        for meterread in channel.iter('Reading'):
            read = meterread.attrib.get('Value')
            date = meterread.attrib.get('ReadingTime')
            print read[:-2],",",date[:10],",",entityid
tree.write(open('registerreads_EE.csv','w'))
Here is the screen output when the above is run:
3577 , 2015-10-21 , 73825603
3601 , 2015-10-22 , 73825603
3462 , 2015-10-21 , 73825604
3501 , 2015-10-22 , 73825604
The 'registerreads_EE.csv' output file is like the original XML file, minus the first line.
I would like the printed output above outputted to a csv file with headers of read, date, entityid.
I am having difficulty with this. This is my first python program. Any help is appreciated.
Use the csv module, not ElementTree, to write the rows to the CSV file, but still use ElementTree to parse the XML file and extract its content (this version is for Python 3):
import xml.etree.ElementTree as ET
import csv
tree = ET.parse('registerreads_EE.xml')
root = tree.getroot()[3]
with open('registerreads_EE.csv', 'w', newline='') as r:
    writer = csv.writer(r)
    writer.writerow(['read', 'date', 'entityid']) # WRITING HEADERS
    for channel in tree.iter('Channel'):
        for exportrequest in channel.iter('ExportRequest'):
            entityid = exportrequest.attrib.get('EntityID')
            for meterread in channel.iter('Reading'):
                read = meterread.attrib.get('Value')
                date = meterread.attrib.get('ReadingTime')
                # WRITE EACH ROW ITERATIVELY
                writer.writerow([read[:-2],date[:10],entityid])
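The same logic can be wrapped in a function that writes to any file-like object, which makes it easy to try out without touching the disk. This is a sketch under the assumption that each Channel has at most one ExportRequest; the name readings_to_csv is made up, and the sample is a shortened copy of the question's file.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Shortened sample with one channel from the question's file.
SAMPLE = """<ReadingDocument><Channels>
  <Channel>
    <Readings>
      <Reading Value="3577.0" ReadingTime="2015-10-21T00:00:00-05:00" />
      <Reading Value="3601.3" ReadingTime="2015-10-22T00:00:00-05:00" />
    </Readings>
    <ExportRequest EntityID="73825603" />
  </Channel>
</Channels></ReadingDocument>"""


def readings_to_csv(root, out):
    """Write one CSV row per Reading, tagged with its channel's EntityID."""
    writer = csv.writer(out)
    writer.writerow(['read', 'date', 'entityid'])
    for channel in root.iter('Channel'):
        # Look the entity id up once per channel instead of nesting loops.
        request = channel.find('ExportRequest')
        entityid = request.get('EntityID') if request is not None else ''
        for reading in channel.iter('Reading'):
            # Same slicing as above: drop the last two characters of the
            # value and keep only the date part of the timestamp.
            writer.writerow([reading.get('Value')[:-2],
                             reading.get('ReadingTime')[:10],
                             entityid])


buffer = io.StringIO()
readings_to_csv(ET.fromstring(SAMPLE), buffer)
rows = buffer.getvalue().splitlines()
print(rows)
```

To write a real file, pass the handle from open('registerreads_EE.csv', 'w', newline='') in place of the StringIO buffer.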
