Element XML parsing not giving proper result

I have the sample XML file below and I am trying to generate the JSON shown under it, but I am not getting the expected result: only one document is added to the dictionary.
Sample Input XML:
<results status="passed">
<num-records>2</num-records>
<records>
<volume-info>
<flexible-volume-info>
<agg-name>aggr1_split</aggregate-name>
</flexible-volume-info>
<volume-name>volume1</volume-name>
<volume-size>
<actual-size>44</actual-volume-size>
<afs-avail>90</afs-avail>
</volume-size>
</volume-info>
<volume-info>
<flexible-volume-info>
<agg-name>aggr2_split</aggregate-name>
</flexible-volume-info>
<volume-name>volume2</volume-name>
<volume-size>
<actual-size>10</actual-volume-size>
<afs-avail>14</afs-avail>
</volume-size>
</volume-info>
</records>
</results>
Expected Output:
{
"agg-name": "aggr1_split",
"volume-name": "volume1",
"actual-size": "44"
},
{
"agg-name": "aggr2_split",
"volume-name": "volume2",
"actual-size": "10"
}
Sample code:
result = {}
for child in root.iter("records"):
    result['agg-name'] = child.find('volume-info/flexible-volume-info/agg-name').text
    result['volume-name'] = child.find('volume-info/volume-name').text
    result['actual-size'] = child.find('volume-info/volume-size/actual-size').text
print result

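Your loop iterates over records, and there is only one such element, so each find() call returns only the first matching volume-info and you end up with a single dictionary. Your expected output also cannot be one dictionary, because a dictionary cannot contain the same key more than once. You either need to choose different keys for each iteration of your loop or, better still, build a list of dictionaries, one per volume-info element: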
import xml.etree.ElementTree as ET
xml_data = """<results status="passed">
<num-records>2</num-records>
<records>
<volume-info>
<flexible-volume-info>
<agg-name>aggr1_split</agg-name>
</flexible-volume-info>
<volume-name>volume1</volume-name>
<volume-size>
<actual-size>44</actual-size>
<afs-avail>90</afs-avail>
</volume-size>
</volume-info>
<volume-info>
<flexible-volume-info>
<agg-name>aggr2_split</agg-name>
</flexible-volume-info>
<volume-name>volume2</volume-name>
<volume-size>
<actual-size>10</actual-size>
<afs-avail>14</afs-avail>
</volume-size>
</volume-info>
</records>
</results>"""
root = ET.fromstring(xml_data)
results = []
for child in root.iter("volume-info"):
    result = {}
    print(child)
    result['agg-name'] = child.find('flexible-volume-info/agg-name').text
    result['volume-name'] = child.find('volume-name').text
    result['actual-size'] = child.find('volume-size/actual-size').text
    results.append(result)
print(results)
This would give you:
[{'agg-name': 'aggr1_split', 'volume-name': 'volume1', 'actual-size': '44'}, {'agg-name': 'aggr2_split', 'volume-name': 'volume2', 'actual-size': '10'}]
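Your XML is also badly formed: the opening and closing tags do not always match (for example, <agg-name> is closed by </aggregate-name>), so it has to be corrected before it will parse.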
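If some volume-info records might lack one of the fields, find(...).text raises AttributeError on the missing element. Here is a minimal sketch of a more tolerant version, using findtext with a default. The helper name parse_volumes, the FIELDS mapping and the trimmed sample XML are made up for illustration:

```python
import xml.etree.ElementTree as ET

# Trimmed, well-formed sample; the second record deliberately lacks <volume-size>.
SAMPLE = """<results status="passed">
  <records>
    <volume-info>
      <flexible-volume-info><agg-name>aggr1_split</agg-name></flexible-volume-info>
      <volume-name>volume1</volume-name>
      <volume-size><actual-size>44</actual-size></volume-size>
    </volume-info>
    <volume-info>
      <flexible-volume-info><agg-name>aggr2_split</agg-name></flexible-volume-info>
      <volume-name>volume2</volume-name>
    </volume-info>
  </records>
</results>"""

# Output key -> path relative to each <volume-info> element.
FIELDS = {
    "agg-name": "flexible-volume-info/agg-name",
    "volume-name": "volume-name",
    "actual-size": "volume-size/actual-size",
}


def parse_volumes(xml_text):
    """Return one dict per <volume-info>; missing fields become None."""
    root = ET.fromstring(xml_text)
    return [
        {key: info.findtext(path, default=None) for key, path in FIELDS.items()}
        for info in root.iter("volume-info")
    ]


volumes = parse_volumes(SAMPLE)
print(volumes)
```

Whether None is the right placeholder for a missing field depends on what consumes the result; an empty string or skipping the key altogether are equally reasonable choices.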

Related

XML parsing in python issue using elementTree

I need to parse a soap response and convert to a text file. I am trying to parse the values as detailed below. I am using ElementTree in python
I have the below xml response which I need to parse
<soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/" xmlns:tmf854="tmf854.v1" xmlns:alu="alu.v1">
<soapenv:Header>
<tmf854:header>
<tmf854:activityName>query</tmf854:activityName>
<tmf854:msgName>queryResponse</tmf854:msgName>
<tmf854:msgType>RESPONSE</tmf854:msgType>
<tmf854:senderURI>https:/destinationhost:8443/tmf854/services</tmf854:senderURI>
<tmf854:destinationURI>https://localhost:8443</tmf854:destinationURI>
<tmf854:activityStatus>SUCCESS</tmf854:activityStatus>
<tmf854:correlationId>1</tmf854:correlationId>
<tmf854:communicationPattern>MultipleBatchResponse</tmf854:communicationPattern>
<tmf854:communicationStyle>RPC</tmf854:communicationStyle>
<tmf854:requestedBatchSize>1500</tmf854:requestedBatchSize>
<tmf854:batchSequenceNumber>1</tmf854:batchSequenceNumber>
<tmf854:batchSequenceEndOfReply>true</tmf854:batchSequenceEndOfReply>
<tmf854:iteratorReferenceURI>http://9195985371165397084</tmf854:iteratorReferenceURI>
<tmf854:timestamp>20220915222121.472+0530</tmf854:timestamp>
</tmf854:header>
</soapenv:Header>
<soapenv:Body>
<queryResponse xmlns="alu.v1">
<queryObjectData>
<queryObject>
<name>
<tmf854:mdNm>AMS</tmf854:mdNm>
<tmf854:meNm>CHEERLAVANCHA_281743</tmf854:meNm>
<tmf854:ptpNm>/type=NE/CHEERLAVANCHA_281743</tmf854:ptpNm>
</name>
<vendorExtensions>
<package>
<NameAndStringValue>
<tmf854:name>hubSubtendedStatus</tmf854:name>
<tmf854:value>NONE</tmf854:value>
</NameAndStringValue>
<NameAndStringValue>
<tmf854:name>productAndRelease</tmf854:name>
<tmf854:value>DF.6.1</tmf854:value>
</NameAndStringValue>
<NameAndStringValue>
<tmf854:name>adminUserName</tmf854:name>
<tmf854:value>isadmin</tmf854:value>
</NameAndStringValue>
<NameAndStringValue>
</package>
</vendorExtensions>
</queryObject>
</queryObjectData>
</queryResponse>
</soapenv:Body>
</soapenv:Envelope>
I need to use the below code snippet.
parser = ElementTree.parse("response.txt")
root = parser.getroot()
inventoryObjectData = root.find(".//{alu.v1}queryObjectData")
for inventoryObject in inventoryObjectData:
    for device in inventoryObject:
        if (device.tag.split("}")[1]) == "me":
            vendorExtensionsNames = []
            vendorExtensionsValues = []
            if device.find(".//{tmf854.v1}mdNm") is not None:
                mdnm = device.find(".//{tmf854.v1}mdNm").text
            if device.find(".//{tmf854.v1}meNm") is not None:
                menm = device.find(".//{tmf854.v1}meNm").text
            if device.find(".//{tmf854.v1}userLabel") is not None:
                userlabel = device.find(".//{tmf854.v1}userLabel").text
            if device.find(".//{tmf854.v1}resourceState") is not None:
                resourcestate = device.find(".//{tmf854.v1}resourceState").text
            if device.find(".//{tmf854.v1}location") is not None:
                location = device.find(".//{tmf854.v1}location").text
            if device.find(".//{tmf854.v1}manufacturer") is not None:
                manufacturer = device.find(".//{tmf854.v1}manufacturer").text
            if device.find(".//{tmf854.v1}productName") is not None:
                productname = device.find(".//{tmf854.v1}productName").text
            if device.find(".//{tmf854.v1}version") is not None:
                version = device.find(".//{tmf854.v1}version").text
            vendorExtensions = device.find("vendorExtensions")
            vendorExtensionsNamesElements = vendorExtensions.findall(".//{tmf854.v1}name")
            for i in vendorExtensionsNamesElements:
                vendorExtensionsNames.append(i.text.strip())
            vendorExtensionsValuesElements = vendorExtensions.findall(".//{tmf854.v1}value")
            for i in vendorExtensionsValuesElements:
                vendorExtensionsValues.append(str(i.text or "").strip())
            alu = ""
            for i in vendorExtensions:
                if i.attrib:
                    if alu == "":
                        alu = i.attrib.get("{alu.v1}name")
                    else:
                        alu = alu + "|" + i.attrib.get("{alu.v1}name")
The issue is that the code below is not able to find the vendorExtensions element. Please help.
vendorExtensions = device.find("vendorExtensions")
I have tried the following as well:
vendorExtensions = device.find(".//queryObject/vendorExtensions")

Your document declares a default namespace of alu.v1:
<queryResponse xmlns="alu.v1">
...
</queryResponse>
Any element inside queryResponse that has no explicit namespace prefix is in the alu.v1 namespace. You need to qualify your element name appropriately:
vendorExtensions = device.find("{alu.v1}vendorExtensions")
While the above is a real problem with your code that needs to be corrected (the Wikipedia entry on XML namespaces may be useful reading if you're unfamiliar with how namespaces work), there are also some logic problems with your code.
Let's drop the big list of conditionals from the code and see if it's actually doing what we think it's doing. If we run this:
from xml.etree import ElementTree
parser = ElementTree.parse("data.xml")
root = parser.getroot()
queryObjectData = root.find(".//{alu.v1}queryObjectData")
for queryObject in queryObjectData:
    for device in queryObject:
        print(device.tag)
Then using your sample data (once it has been corrected to be syntactically valid), we see as output:
{alu.v1}name
{alu.v1}vendorExtensions
Your search for the {alu.v1}vendorExtensions element will never succeed because the thing on which you're trying to search (the device variable) is the thing you're trying to find.
Additionally, the conditional in your loop...
if (device.tag.split("}")[1]) == "me":
...will never match (there is no element in the entire document for which tag.split("}")[1] == "me" is True).
I'm not entirely clear what you're trying to do, but here are some thoughts:
Given your example data, you probably don't want that for device in inventoryObject: loop.
We can drastically simplify your code by replacing that long block of conditionals with a list of the element names in which we are interested and then a for loop to extract them.
Rather than assigning a bunch of individual variables, we can build up a dictionary with the data from the queryObject.
That might look like:
from xml.etree import ElementTree
import json
attributeNames = [
    "mdNm",
    "meNm",
    "userLabel",
    "resourceState",
    "location",
    "manufacturer",
    "productName",
    "version",
]
parser = ElementTree.parse("data.xml")
root = parser.getroot()
queryObjectData = root.find(".//{alu.v1}queryObjectData")
for queryObject in queryObjectData:
    device = {}
    for name in attributeNames:
        if (value := queryObject.find(f".//{{tmf854.v1}}{name}")) is not None:
            device[name] = value.text
    vendorExtensions = queryObject.find("{alu.v1}vendorExtensions")
    extensionMap = {}
    for extension in vendorExtensions.findall(".//{alu.v1}NameAndStringValue"):
        extname = extension.find("{tmf854.v1}name").text
        extvalue = extension.find("{tmf854.v1}value").text
        extensionMap[extname] = extvalue
    device["vendorExtensions"] = extensionMap
    print(json.dumps(device, indent=2))
Given your example data, this outputs:
{
"mdNm": "AMS",
"meNm": "CHEERLAVANCHA_281743",
"vendorExtensions": {
"hubSubtendedStatus": "NONE",
"productAndRelease": "DF.6.1",
"adminUserName": "isadmin"
}
}
An alternate approach, in which we just transform each queryObject into a dictionary, might look like this:
from xml.etree import ElementTree
import json
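def localName(ele):
    # Strip the "{namespace}" part from a qualified tag name
    return ele.tag.split("}")[-1]
def etree_to_dict(t):
    if list(t):
        d = {}
        for child in t:
            if localName(child) == "NameAndStringValue":
                # Collapse a <name>/<value> pair into a single key/value entry
                d.update(dict([[(x.text or "").strip() for x in child]]))
            else:
                # Recurse into this child only (not into every child of t again)
                d[localName(child)] = etree_to_dict(child)
        return d
    else:
        # Empty elements have text None, so fall back to an empty string
        return (t.text or "").strip()
parser = ElementTree.parse("data.xml")
root = parser.getroot()
queryObjectData = root.find(".//{alu.v1}queryObjectData")
# Compare against None explicitly; the truth value of an Element is unreliable
for queryObject in (queryObjectData if queryObjectData is not None else []):
    d = etree_to_dict(queryObject)
    print(json.dumps(d, indent=2))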
This will output:
{
"name": {
"mdNm": "AMS",
"meNm": "CHEERLAVANCHA_281743",
"ptpNm": "/type=NE/CHEERLAVANCHA_281743"
},
"vendorExtensions": {
"package": {
"hubSubtendedStatus": "NONE",
"productAndRelease": "DF.6.1",
"adminUserName": "isadmin"
}
}
}
That may or may not be appropriate depending on the structure of your real data and exactly what you're trying to accomplish.
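The namespace-qualified lookups used in this answer can also be written with a prefix map instead of spelling out {alu.v1} and {tmf854.v1} in every path. Here is a minimal sketch, assuming a cut-down and corrected version of the sample response. The prefixes alu and tmf in the NS dictionary are arbitrary names chosen for this script; only the namespace URIs have to match the document:

```python
import xml.etree.ElementTree as ET

# Cut-down version of the SOAP response, corrected to be well-formed.
SAMPLE = """<soapenv:Envelope
    xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/"
    xmlns:tmf854="tmf854.v1">
  <soapenv:Body>
    <queryResponse xmlns="alu.v1">
      <queryObjectData>
        <queryObject>
          <name><tmf854:meNm>CHEERLAVANCHA_281743</tmf854:meNm></name>
          <vendorExtensions>
            <package>
              <NameAndStringValue>
                <tmf854:name>adminUserName</tmf854:name>
                <tmf854:value>isadmin</tmf854:value>
              </NameAndStringValue>
            </package>
          </vendorExtensions>
        </queryObject>
      </queryObjectData>
    </queryResponse>
  </soapenv:Body>
</soapenv:Envelope>"""

# Prefixes are local to this script; only the URIs must match the document.
NS = {"alu": "alu.v1", "tmf": "tmf854.v1"}

root = ET.fromstring(SAMPLE)
devices = []
for query_object in root.iterfind(".//alu:queryObject", NS):
    # Map each <name> text to its sibling <value> text.
    extensions = {
        pair.findtext("tmf:name", namespaces=NS): pair.findtext("tmf:value", namespaces=NS)
        for pair in query_object.iterfind(".//alu:NameAndStringValue", NS)
    }
    devices.append({
        "meNm": query_object.findtext(".//tmf:meNm", namespaces=NS),
        "vendorExtensions": extensions,
    })

print(devices)
```

The prefix map only changes how the paths are written. The elements found are the same as with the {uri}tag form.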

Create a dictionary from an XML using xpath

I would like to create a dictionary from an XML file using xpath. Here's an example of the XML:
<Contracts>
<Contract ID="1">
<UnwantedPatterns>
<Pattern>0</Pattern>
<Pattern>1</Pattern>
</UnwantedPatterns>
</Contract>
<Contract ID="2">
<UnwantedPatterns>
<Pattern>0</Pattern>
<Pattern>1</Pattern>
</UnwantedPatterns>
</Contract>
</Contracts>
What I would like is to have the contract ID as the key and the unwanted patterns as the value.
Here's my code:
UnwantedPatterns = []
key = []
DictUP = {}
for ID in root.xpath('//Contracts'):
    key = ID.xpath('./Contract/@ID')
    for patterns in root.xpath('.//Contract/UnwantedPatterns/Pattern'):
        DictUP[key] = UnwantedPatterns.append(patterns.text)
I get the error "unhashable type: 'list'". Thank you for your help. The output should look like this:
{1: 0,1
2: 0,1}

xpath returns a list, so instead of
key = ID.xpath('./Contract/@ID')
try
key = ID.xpath('./Contract/@ID')[0]
As for the output: list.append returns None, so DictUP[key] = UnwantedPatterns.append(patterns.text) stores None rather than the patterns, and because a dictionary holds only one value per key, each iteration overwrites the previous one.
Try looping over each Contract element, so that every contract gets its own key and its own list of patterns:
DictUP = {}
for contract in root.xpath('//Contract'):
    key = contract.xpath('./@ID')[0]
    # Collect only the patterns that belong to this contract
    DictUP[key] = [pattern.text for pattern in contract.xpath('./UnwantedPatterns/Pattern')]
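A self-contained sketch of grouping the patterns per contract, using only the standard library's xml.etree.ElementTree rather than lxml so that it runs without extra packages. The well-formed sample, including the Contracts root element, is an assumption about what the real file looks like, and the name patterns_by_contract is made up for illustration:

```python
import xml.etree.ElementTree as ET

# Assumed well-formed version of the sample; the <Contracts> root is a guess.
SAMPLE = """<Contracts>
  <Contract ID="1">
    <UnwantedPatterns><Pattern>0</Pattern><Pattern>1</Pattern></UnwantedPatterns>
  </Contract>
  <Contract ID="2">
    <UnwantedPatterns><Pattern>0</Pattern><Pattern>1</Pattern></UnwantedPatterns>
  </Contract>
</Contracts>"""

root = ET.fromstring(SAMPLE)

# One entry per <Contract>: the ID attribute as key, its pattern texts as value.
patterns_by_contract = {
    contract.get("ID"): [p.text for p in contract.findall("UnwantedPatterns/Pattern")]
    for contract in root.findall("Contract")
}
print(patterns_by_contract)
```

Attribute values and element text are strings, so the keys and patterns come back as '1' and '0' rather than integers. Wrap them in int() if numeric values are needed.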

Listing path and data from a xml file to store in a dataframe

Here is a xml file :
<SOAP-ENV:Envelope xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/">
<SOAP-ENV:Header />
<SOAP-ENV:Body>
<ADD_LandIndex_001>
<CNTROLAREA>
<BSR>
<status>ADD</status>
<NOUN>LandIndex</NOUN>
<REVISION>001</REVISION>
</BSR>
</CNTROLAREA>
<DATAAREA>
<LandIndex>
<reportId>AMI100031</reportId>
<requestKey>R3278458</requestKey>
<SubmittedBy>EN4871</SubmittedBy>
<submittedOn>2015/01/06 4:20:11 PM</submittedOn>
<LandIndex>
<agreementdetail>
<agreementid>001 4860</agreementid>
<agreementtype>NATURAL GAS</agreementtype>
<currentstatus>
<status>ACTIVE</status>
<statuseffectivedate>1965/02/18</statuseffectivedate>
<termdate>1965/02/18</termdate>
</currentstatus>
<designatedrepresentative></designatedrepresentative>
</agreementdetail>
</LandIndex>
</LandIndex>
</DATAAREA>
</ADD_LandIndex_001>
</SOAP-ENV:Body>
</SOAP-ENV:Envelope>
I want to save two things in a dataframe: 1) the path and 2) the text of the element at that path. To build this dataframe, I am thinking of using a dictionary to store both. So first I would like to get a dictionary like this, where each value is associated with its corresponding path:
{'/Envelope/Body/ADD_LandIndex_001/CNTROLAREA/BSR/status': 'ADD', '/Envelope/Body/ADD_LandIndex_001/CNTROLAREA/BSR/NOUN': 'LandIndex', ...}
That way I just have to call df = pd.DataFrame() to create a dataframe that I can export to an Excel sheet. I already have the part that lists the paths, but I cannot get the text from those paths. I do not understand how the lxml library works. I tried .text() and text_content() but I get an error.
Here is my code :
from lxml import etree
import xml.etree.ElementTree as et
from bs4 import BeautifulSoup
import pandas as pd
filename = 'file_try.xml'
with open(filename, 'r') as f:
    soap = f.read()
root = etree.XML(soap.encode())
tree = etree.ElementTree(root)
mylist_path = []
mylist_data = []
mydico = {}
mylist = []
for target in root.xpath('//text()'):
    if len(target.strip())>0:
        path = tree.getpath(target.getparent()).replace('SOAP-ENV:','')
        mydico[path] = target.text()
        mylist_path.append(path)
        mylist_data.append(target.text())
mylist.append(mydico)
df=pd.DataFrame(mylist)
df.to_excel("data_xml.xlsx")
print(mylist_path)
print(mylist_data)
Thank you for the help !

Here is an example of traversing an XML tree. A recursive function is needed for this; fortunately, lxml provides all the functionality required.
from lxml import etree as et
from collections import defaultdict
import pandas as pd
d = defaultdict(list)
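# xml is assumed to hold the document from the question as a string
root = et.fromstring(xml)
tree = et.ElementTree(root)
def traverse(el, d):
    if len(list(el)) > 0:
        # Element has children: descend into each of them
        for child in el:
            traverse(child, d)
    else:
        # Leaf element: record its text under its path
        if el.text is not None:
            d[tree.getelementpath(el)].append(el.text)
traverse(root, d)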
df = pd.DataFrame(d)
df.head()
Output (the contents of d):
{
'{http://schemas.xmlsoap.org/soap/envelope/}Body/ADD_LandIndex_001/CNTROLAREA/BSR/status': ['ADD'],
'{http://schemas.xmlsoap.org/soap/envelope/}Body/ADD_LandIndex_001/CNTROLAREA/BSR/NOUN': ['LandIndex'],
'{http://schemas.xmlsoap.org/soap/envelope/}Body/ADD_LandIndex_001/CNTROLAREA/BSR/REVISION': ['001'],
'{http://schemas.xmlsoap.org/soap/envelope/}Body/ADD_LandIndex_001/DATAAREA/LandIndex/reportId': ['AMI100031'],
'{http://schemas.xmlsoap.org/soap/envelope/}Body/ADD_LandIndex_001/DATAAREA/LandIndex/requestKey': ['R3278458'],
'{http://schemas.xmlsoap.org/soap/envelope/}Body/ADD_LandIndex_001/DATAAREA/LandIndex/SubmittedBy': ['EN4871'],
'{http://schemas.xmlsoap.org/soap/envelope/}Body/ADD_LandIndex_001/DATAAREA/LandIndex/submittedOn': ['2015/01/06 4:20:11 PM'],
'{http://schemas.xmlsoap.org/soap/envelope/}Body/ADD_LandIndex_001/DATAAREA/LandIndex/LandIndex/agreementdetail/agreementid': ['001 4860'],
'{http://schemas.xmlsoap.org/soap/envelope/}Body/ADD_LandIndex_001/DATAAREA/LandIndex/LandIndex/agreementdetail/agreementtype': ['NATURAL GAS'],
'{http://schemas.xmlsoap.org/soap/envelope/}Body/ADD_LandIndex_001/DATAAREA/LandIndex/LandIndex/agreementdetail/currentstatus/status': ['ACTIVE'],
'{http://schemas.xmlsoap.org/soap/envelope/}Body/ADD_LandIndex_001/DATAAREA/LandIndex/LandIndex/agreementdetail/currentstatus/statuseffectivedate': ['1965/02/18'],
'{http://schemas.xmlsoap.org/soap/envelope/}Body/ADD_LandIndex_001/DATAAREA/LandIndex/LandIndex/agreementdetail/currentstatus/termdate': ['1965/02/18']
}
Please note that the dictionary d contains lists as values. That's because elements can be repeated in XML, and otherwise the last value would override the previous ones. If that's not the case for your particular XML, use a regular dict instead of a defaultdict (d = {}) and use assignment instead of appending: d[tree.getelementpath(el)] = el.text.
The same approach when reading from a file:
d = defaultdict(list)
with open('output.xml', 'rb') as file:
    root = et.parse(file).getroot()
tree = et.ElementTree(root)
def traverse(el, d):
    if len(list(el)) > 0:
        for child in el:
            traverse(child, d)
    else:
        if el.text is not None:
            d[tree.getelementpath(el)].append(el.text)
traverse(root, d)
df = pd.DataFrame(d)
print(d)
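To see why list values matter when elements repeat, here is a small sketch that uses only the standard library and builds each path by hand instead of calling lxml's getelementpath. The sample document and the helper name collect_leaf_text are made up for illustration:

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Made-up sample in which <status> repeats under the same parent.
SAMPLE = """<root>
  <BSR>
    <status>ADD</status>
    <status>ACTIVE</status>
    <NOUN>LandIndex</NOUN>
  </BSR>
</root>"""


def collect_leaf_text(element, path, out):
    """Record the text of every leaf element under its slash-separated path."""
    children = list(element)
    if not children:
        # Leaf: keep non-empty text only.
        if element.text is not None and element.text.strip():
            out[path].append(element.text.strip())
        return
    for child in children:
        collect_leaf_text(child, f"{path}/{child.tag}", out)


root = ET.fromstring(SAMPLE)
values_by_path = defaultdict(list)
collect_leaf_text(root, f"/{root.tag}", values_by_path)
print(dict(values_by_path))
```

This hand-built path carries no positional index, so repeated siblings share one key and their texts accumulate in the same list. Keep in mind that pd.DataFrame expects all columns to have the same length, so a dictionary whose lists differ in length (as here) raises a ValueError and needs reshaping first, for example into one row per path and value.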

Is there a way to write less repetitive code? Writing an XML file in python using ElementTree

I am currently learning to code in Python, but working with XML files is giving me some trouble. I tried to write an XML file using some data that I filtered from a JSON file.
The XML-file I want to write should look like this:
<?xml version='1.0' encoding='UTF-8'?>
<collection>
<work>
<title>Title</title>
<dimensions>
<width>Width (cm)</width>
<height>Height (cm)</height>
</dimensions>
<acquisition>
<number>AccessionNumber</number>
<year>year of DateAcquired</year>
</acquisition>
</work>
[...]
</collection>
It can be written into one line in the XML, since it doesn't need to be pretty.
My Python code currently looks like this:
import xml.etree.ElementTree as ET
root = ET.Element('collection')
tree = ET.ElementTree(root)
for artwork in artworks_filtered_list:
    work = ET.SubElement(root, 'work')
    title = ET.SubElement(work, 'title')
    title.text = artwork['Title']
    dimensions = ET.SubElement(work, 'dimensions')
    if 'Width (cm)' in artwork:
        width = ET.SubElement(dimensions, 'width')
        width.text = str(artwork['Width (cm)'])
    height = ET.SubElement(dimensions, 'height')
    height.text = str(artwork['Height (cm)'])
    acquisition = ET.SubElement(work, 'acquisition')
    number = ET.SubElement(acquisition, 'number')
    number.text = str(artwork['AccessionNumber'])
    year = ET.SubElement(acquisition, 'year')
    year.text = str(artwork['DateAcquired'][:4])
tree.write('example.xml', encoding='UTF-8', xml_declaration=True)
Since width is missing in some artwork data, I needed to check if it exists for each entry. Otherwise I get an error message.
artworks_filtered_list is a list of dictionaries that contains entries for different artworks and looks like this:
artworks_filtered_list = [
{
"Title": "Interval",
"Artist": ["David Hartt"],
"ConstituentID": [47183],
"ArtistBio": ["Canadian, born 1967"],
"Nationality": ["Canadian"],
"BeginDate": [1967],
"EndDate": [0],
"Gender": ["Male"],
"Date": "2016",
"Medium": "Aluminum and tempered glass",
"Dimensions": 'Wall: 102 × 218 × 4" (259.1 × 553.7 × 10.2 cm)',
"CreditLine": "Fund for the Twenty First Century",
"AccessionNumber": "1772.2015.5",
"Classification": "Installation",
"Department": "Media and Performance Art",
"DateAcquired": "2015-12-11",
"Cataloged": "Y",
"ObjectID": 205745,
"URL": "http://www.moma.org/collection/works/205745",
"ThumbnailURL": None,
"Depth (cm)": 10.16002032,
"Height (cm)": 259.080518161,
"Width (cm)": 553.7211074422,
},
...,
]
This is my code right now. It works and creates the XML file as intended, but I feel like there might be more code than needed. Is there a way to get the same result with less repetitive, prettier code? (It should still use ElementTree.)

I would probably use some sort of mapping to reduce the amount of redundant code. Something like this:
#!/usr/bin/python
import datetime
import xml.etree.ElementTree as ET
artworks_filtered_list = [...as in your example...]
# This by itself reduces the amount of code by about 50% :)
def add_text_element(root, tag, text):
    new = ET.SubElement(root, tag)
    new.text = text
# According to your question, acquisition.year should be just the
# year from DateAcquired, so we need a method to extract and return
# the year.
def extract_year(val):
    dt = datetime.datetime.strptime(val, '%Y-%m-%d')
    return dt.year
# This is called once for every work of art in artworks_filtered_list
def append_work(root, work):
    # create the <work> container element
    new = ET.SubElement(root, 'work')
    # create empty dimension and acquisition elements. This adds them
    # to the new <work> element.
    dimensions = ET.SubElement(new, 'dimensions')
    acquisition = ET.SubElement(new, 'acquisition')
    # Add our title
    add_text_element(new, 'title', work['Title'])
    # Now build a map that will link keys from a work of art in
    # artworks_filtered_list to XML elements. Each item in this list is
    # a 4-tuple, where the items are:
    #
    # 1. The parent element to which we will be adding a new element
    # 2. The name of the new element
    # 3. The dictionary key from which we will get the text value
    # 4. A function to transform the value, if necessary (or None)
    #
    attrmap = [
        (dimensions, 'width', 'Width (cm)', None),
        (dimensions, 'height', 'Height (cm)', None),
        (acquisition, 'number', 'AccessionNumber', None),
        (acquisition, 'year', 'DateAcquired', extract_year),
    ]
    # And now use the above map to transform the artwork dictionary
    # into XML elements.
    for parent, tag, key, xform in attrmap:
        if key in work:
            add_text_element(parent, tag,
                             str(xform(work[key]) if xform else work[key]))
def main():
    root = ET.Element('collection')
    for work in artworks_filtered_list:
        append_work(root, work)
    print(ET.tostring(root).decode('utf-8'))
if __name__ == '__main__':
    main()
Given your sample input, the above code produces:
<collection>
<work>
<dimensions>
<width>553.7211074422</width>
<height>259.080518161</height>
</dimensions>
<acquisition>
<number>1772.2015.5</number>
<year>2015</year>
</acquisition>
<title>Interval</title>
</work>
</collection>
...although it doesn't actually pretty-print it. If you were to use lxml.etree instead of xml.etree, tostring would be able to pretty-print XML for you.
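If staying with the standard library is a requirement, xml.etree.ElementTree also offers ET.indent from Python 3.9 onwards, which adds indentation whitespace to a tree in place. A minimal sketch on a tiny hand-built tree (the element values are placeholders, not the real artwork data):

```python
import xml.etree.ElementTree as ET

# Tiny stand-in for the <collection> tree built above.
root = ET.Element("collection")
work = ET.SubElement(root, "work")
ET.SubElement(work, "title").text = "Interval"

# ET.indent (Python 3.9+) modifies the tree in place by inserting whitespace.
ET.indent(root, space="  ")
pretty = ET.tostring(root, encoding="unicode")
print(pretty)
```

Because the whitespace is written into the tree itself, call ET.indent only once, just before serialising.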

How to associate values of tags with the label of the tag using ElementTree in a Pythonic way

I have some xml files I am trying to process.
Here is a derived sample from one of the files
fileAsString = """
<?xml version="1.0" encoding="utf-8"?>
<eventDocument>
<schemaVersion>X2</schemaVersion>
<eventTable>
<eventTransaction>
<eventTitle>
<value>Some Event</value>
</eventTitle>
<eventDate>
<value>2003-12-31</value>
</eventDate>
<eventCoding>
<eventType>47</eventType>
<eventCode>A</eventCode>
<footnoteId id="F1"/>
<footnoteId id="F2"/>
</eventCoding>
<eventCycled>
<value></value>
</eventCycled>
<eventAmounts>
<eventVoltage>
<value>40000</value>
</eventVoltage>
</eventAmounts>
</eventTransaction>
</eventTable>
</eventDocument>"""
Note that there can be many eventTables in each document, and events can have more details than just the ones I have isolated.
My goal is to create a dictionary in the following form
{'eventTitle':'Some Event', 'eventDate':'2003-12-31','eventType':'47',\
'eventCode':'A', 'eventCoding_FTNT_1':'F1','eventCoding_FTNT_2':'F2',\
'eventCycled': , 'eventVoltage':'40000'}
I am actually reading these in from files, but assuming I have a string, my code to get the text for the elements directly below the eventTransaction element, where the text is inside a value tag, is as follows:
import xml.etree.cElementTree as ET
myXML = ET.fromstring(fileAsString)
eventTransactions = [ e for e in myXML.iter() if e.tag == 'eventTransaction']
testTransaction = eventTransactions[0]
my_dict = {}
for child_of in testTransaction:
    grand_children_tags = [e.tag for e in child_of]
    if grand_children_tags == ['value']:
        my_dict[child_of.tag] = [e.text for e in child_of][0]
>>> my_dict
{'eventTitle': 'Some Event', 'eventCycled': None, 'eventDate': '2003-12-31'}
This seems wrong because I am not really taking advantage of XML; instead I am using brute force, but I have not been able to find an example.
Is there a clearer and more pythonic way to create the output I am looking for?

Use XPath to pull out the elements you're interested in.
The following code creates a list of lists of dicts (i.e. tables/transactions/info):
import xml.etree.ElementTree as ET
tables = []
myXML = ET.fromstring(fileAsString)
for table in myXML.findall('./eventTable'):
    transactions = []
    tables.append(transactions)
    for transaction in table.findall('./eventTransaction'):
        info = {}
        # Search within this transaction only, not the whole table
        for element in transaction.findall('.//*[value]'):
            info[element.tag] = element.find('./value').text or ''
        coding = transaction.find('./eventCoding')
        if coding is not None:
            for tag in 'eventType', 'eventCode':
                element = coding.find('./%s' % tag)
                if element is not None:
                    info[tag] = element.text or ''
            for index, element in enumerate(coding.findall('./footnoteId')):
                info['eventCoding_FTNT_%d' % index] = element.get('id', '')
        if info:
            transactions.append(info)
Output:
[[{'eventCode': 'A',
'eventCoding_FTNT_0': 'F1',
'eventCoding_FTNT_1': 'F2',
'eventCycled': '',
'eventDate': '2003-12-31',
'eventTitle': 'Some Event',
'eventType': '47',
'eventVoltage': '40000'}]]
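The footnote keys above are numbered from 0, whereas the question asks for eventCoding_FTNT_1 and eventCoding_FTNT_2. Here is a small sketch of the one-based variant, shown on an isolated eventCoding fragment. The footnotes name is only for illustration:

```python
import xml.etree.ElementTree as ET

# Just the <eventCoding> fragment from the sample document.
SAMPLE = """<eventCoding>
  <eventType>47</eventType>
  <eventCode>A</eventCode>
  <footnoteId id="F1"/>
  <footnoteId id="F2"/>
</eventCoding>"""

coding = ET.fromstring(SAMPLE)

# start=1 makes the generated keys begin at eventCoding_FTNT_1.
footnotes = {
    f"eventCoding_FTNT_{index}": element.get("id", "")
    for index, element in enumerate(coding.findall("footnoteId"), start=1)
}
print(footnotes)
```

In the full loop, the same effect comes from passing start=1 to the existing enumerate call.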
