How to get path of all elements in lxml with attribute

I have the following code:
tree = etree.ElementTree(new_xml)
for e in new_xml.iter():
    print(tree.getpath(e), e.text)
This will give me something like the following:
/Item/Purchases
/Item/Purchases/Purchase[1]
/Item/Purchases/Purchase[1]/URL http://tvgo.xfinity.com/watch/x/6091165185315991112/movies
/Item/Purchases/Purchase[1]/Rating R
/Item/Purchases/Purchase[2]
/Item/Purchases/Purchase[2]/URL http://tvgo.xfinity.com/watch/x/6091165185315991112/movies
/Item/Purchases/Purchase[2]/Rating R
However, I need each path to identify the element by its attribute rather than by its position in the list. Here is what the xml looks like:
<Item>
    <Purchases>
        <Purchase Country="US">
            <URL>http://tvgo.xfinity.com/watch/x/6091165US</URL>
            <Rating>R</Rating>
        </Purchase>
        <Purchase Country="CA">
            <URL>http://tvgo.xfinity.com/watch/x/6091165CA</URL>
            <Rating>R</Rating>
        </Purchase>
    </Purchases>
</Item>
How would I get the following path instead?
/Item/Purchases
/Item/Purchases/Purchase[#Country="US"]
/Item/Purchases/Purchase[#Country="US"]/URL http://tvgo.xfinity.com/watch/x/6091165185315991112/movies
/Item/Purchases/Purchase[#Country="US"]/Rating R
/Item/Purchases/Purchase[#Country="CA"]
/Item/Purchases/Purchase[#Country="CA"]/URL http://tvgo.xfinity.com/watch/x/6091165185315991112/movies
/Item/Purchases/Purchase[#Country="CA"]/Rating R

Not pretty, but it does the job.
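import re
replacements = {}
for e in tree.iter():
    path = tree.getpath(e)
    if re.search(r'/Purchase\[\d+\]$', path):
        # swap the positional index for an attribute predicate
        new_predicate = '[#Country="' + e.attrib['Country'] + '"]'
        new_path = re.sub(r'\[\d+\]$', new_predicate, path)
        replacements[path] = new_path
    # rewrite any Purchase[n] prefix already seen in this element's path
    for key, replacement in replacements.items():
        path = path.replace(key, replacement)
    # e.text is None for elements without text
    print(path, (e.text or '').strip())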
prints this for me:
/Item
/Item/Purchases
/Item/Purchases/Purchase[#Country="US"]
/Item/Purchases/Purchase[#Country="US"]/URL http://tvgo.xfinity.com/watch/x/6091165US
/Item/Purchases/Purchase[#Country="US"]/Rating R
/Item/Purchases/Purchase[#Country="CA"]
/Item/Purchases/Purchase[#Country="CA"]/URL http://tvgo.xfinity.com/watch/x/6091165CA
/Item/Purchases/Purchase[#Country="CA"]/Rating R
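An alternative that avoids the regex replacement is to build each path while walking the tree. This is only a sketch: it uses the standard library's xml.etree.ElementTree instead of lxml's getpath(), and the function name attribute_paths and its key_attr parameter are made up for illustration. The # marker follows the notation used in this thread; a real XPath attribute predicate would be written with @.

```python
import xml.etree.ElementTree as ET

XML = """<Item>
  <Purchases>
    <Purchase Country="US">
      <URL>http://tvgo.xfinity.com/watch/x/6091165US</URL>
      <Rating>R</Rating>
    </Purchase>
    <Purchase Country="CA">
      <URL>http://tvgo.xfinity.com/watch/x/6091165CA</URL>
      <Rating>R</Rating>
    </Purchase>
  </Purchases>
</Item>"""


def attribute_paths(elem, key_attr="Country", parent_path=""):
    """Yield (path, text) pairs for elem and all of its descendants.

    Elements that carry key_attr get an attribute predicate in their
    path step; all other elements are named by tag only.
    """
    step = elem.tag
    if key_attr in elem.attrib:
        step += '[#%s="%s"]' % (key_attr, elem.attrib[key_attr])
    path = parent_path + "/" + step
    # elem.text is None for empty elements, so fall back to ""
    yield path, (elem.text or "").strip()
    for child in elem:
        yield from attribute_paths(child, key_attr, path)


paths = list(attribute_paths(ET.fromstring(XML)))
for path, text in paths:
    print(path, text)
```

Because the predicate is added when the step is created, descendants inherit it through parent_path and no later string replacement is needed. Siblings without the attribute are not disambiguated by this sketch.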

Related

Transform Nested XML

I am trying to parse a nested XML file into a pandas DataFrame so that I can generate a CSV in which each column is an element name and each value is that element's text, but I am having trouble extracting the information. Below is what I have tried, followed by an example of the nested XML.
The below XML can be quite large with hundreds of different records. This is what I tried:
## Import modules
import xml.etree.ElementTree as ET
import pandas as pd
from lxml import etree
tree = ET.parse("File.xml")
root = tree.getroot()
for subelement in root:
    for subsub in subelement:
        print(subsub.tag, ",", subsub.text, subsub.attrib, subsub.items())
for subelement in root:
    for subsub in subelement:
        for subsubsub in subsub:
            print(subsubsub.tag, ",", subsubsub.text, subsubsub.attrib)
<?xml version="1.0" encoding="utf-16"?>
<test1 xmlns="test.xsd">
<test2 ID="123123123" test3="123123">
<test3>Separate</test3>
<test4>AA</test4>
<Comments>BB</Comments>
<test5>
<test6 ID="123123">
<test3>today</test3>
<test7>123 street</test7>
</test6>
</test5>
<test8>
<test10 ID="434234">
<test3>type of work</test3>
<test9>test work</test9>
</test10>
</test8>
<test11>
<test12 ID="234234234">
<test3>Social</test3>
<test14>test</test14>
</test12>
<test12 ID="123123">
<test3>Something Here</test3>
<test13>Some date</test13>
<test14>123123124433</test14>
</test12>
</test11>
<test15>
<test16 ID="6456456456">
<test3>Something Something</test3>
<test14>746745636</test14>
</test16>
</test15>
</test2>
<test2 ID="353453245" test3="list of something">
<test3>Somewhere</test3>
<test4>Someone</test4>
<Comments>Some comment</Comments>
<test5>
<test6 ID="567456756">
<test3>Not today</test3>
<test7>5634643643</test7>
<test17>Some Info</test17>
<test19>Somewhere</test19>
<test18>63243333</test18>
</test6>
</test5>
<test11>
<test12 ID="456436346">
<test3>Pattern</test3>
<test14>436346346</test14>
</test12>
<test12 ID="4364356">
<test3> ID</test3>
<test14>5674567457</test14>
</test12>
<test12 ID="123123123443">
<test3>Other ID</test3>
<test13>54234532452345</test13>
<test14>231423532452345</test14>
</test12>
</test11>
<test15>
<test16 ID="34252345">
<test3>None test</test3>
<test14>456436436346</test14>
</test16>
</test15>
</test2>
</test1>
Update: So would the full code look something like this?
### TEST USING EXAMPLE HOTLIST
with open("file.csv", "w", newline='') as fout:
    header = ['test3','test4','test7','test9','test13','test14','test17','test18','test19','Comments']
    csvout = csv.DictWriter(fout, fieldnames=header)
    csvout.writeheader()
    row = {}
    for _, elem in ET.iterparse('file.xml'):
        # strip the namespace from the element tag name; e.g. {Test.xsd}test14 > test14
        tag = re.sub("^{.*?}", "", elem.tag)
        if tag == 'test2':
            if len(row) != 0:
                print(row)
                csvout.writerow(row)
                row = {}
        if len(elem) == 0:
            text = elem.text
            old = row.get(tag)
            if old is None:
                # first occurrence of the tag
                row[tag] = text
            elif isinstance(old, str):
                # second occurrence of the tag
                row[tag] = [old, text]
            else:
                # already a list
                old.append(text)

For nested XML you can use the iterparse() function to iterate over all elements in the XML. You then need logic that handles each element according to its tag and adds it to a dictionary, which is later exported as a row.
for _, elem in ET.iterparse('file.xml'):
    if len(elem) == 0:
        print(f'{elem.tag} {elem.attrib} text={elem.text}')
    else:
        print(f'{elem.tag} {elem.attrib}')
To create a row in a CSV file from the element text, you can do something like this. By default iterparse() reports an element only when its end tag is reached, so if, for example, each "test2" element is one record, reaching "test2" can be used to write the collected record as a new row and clear the dictionary for the next record.
If you want to output all or some attributes, you need to add a few lines of code for that. If an attribute has the same name as an element, or several elements share an attribute name (e.g. ID), your code needs to handle that as well.
import xml.etree.ElementTree as ET
import re
import csv
with open("out.csv", "w", newline='') as fout:
    header = ['test3','test4','test7','test9','test13','test14','test17','test18','test19','Comments']
    csvout = csv.DictWriter(fout, fieldnames=header)
    csvout.writeheader()
    row = {}
    for _, elem in ET.iterparse('test.xml'):
        # strip the namespace from the element tag name; e.g. {Test.xsd}test14 > test14
        tag = re.sub("^{.*?}", "", elem.tag)
        if tag == 'test2':
            if len(row) != 0:
                print(row)
                csvout.writerow(row)
                row = {}
        if len(elem) == 0:
            # leaf element: keep its text (a repeated tag overwrites the earlier value)
            row[tag] = elem.text
Output:
{'test3': 'Something Something', 'test4': 'AA', 'Comments': 'BB', 'test7': '123 street', 'test9': 'test work', 'test14': '746745636', 'test13': 'Some date'}
{'test3': 'None test', 'test4': 'Someone', 'Comments': 'Some comment', 'test7': '5634643643', 'test17': 'Some Info', 'test19': 'Somewhere', 'test18': '63243333', 'test14': '456436436346', 'test13': '54234532452345'}
CSV Output:
test3,test4,test7,test9,test13,test14,test17,test18,test19,Comments
Something Something,AA,123 street,test work,Some date,746745636,,,,BB
None test,Someone,5634643643,,54234532452345,456436436346,Some Info,63243333,Somewhere,Some comment
Update:
If you want to handle duplicate tags and create a list of values, try something like this:
        if len(elem) == 0:
            text = elem.text
            old = row.get(tag)
            if old is None:
                # first occurrence
                row[tag] = text
            elif isinstance(old, str):
                # second occurrence > create list
                row[tag] = [old, text]
            else:
                old.append(text)
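Once a row can hold lists, csv.DictWriter will write such a cell as the list's string form (e.g. "['a', 'b']"), which is rarely what you want in a CSV. The following is a minimal sketch of one way to deal with that: join the values with a separator before writing. The helper name flatten_row, the "; " separator and the sample row are assumptions made for illustration.

```python
import csv
import io


def flatten_row(row, sep="; "):
    """Return a copy of row in which list values are joined into one string."""
    return {
        key: sep.join(v or "" for v in value) if isinstance(value, list) else value
        for key, value in row.items()
    }


# Hypothetical record, shaped like the dictionaries built above
row = {
    "test3": ["Separate", "today"],
    "test4": "AA",
    "test14": ["test", "123123124433"],
}

# Write to an in-memory buffer so the sketch runs without touching disk
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["test3", "test4", "test14"])
writer.writeheader()
writer.writerow(flatten_row(row))
csv_text = buffer.getvalue()
print(csv_text)
```

In the loop above, the call would become csvout.writerow(flatten_row(row)). Pick a separator that cannot occur in your data, or keep one row per repeated element instead if the values must stay separate.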

Python re.findall organize list

I have a text file with entries like this:
<soap:Envelope xmlns:soap="http://www.w3.org/2003/05/soap-envelope" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema">
<soap:Body>
<Applications_GetResponse xmlns="http://www.country.com">
<Applications>
<CS_Application>
<Name>Spain</Name>
<Key>2345364564</Key>
<Status>NORMAL</Status>
<Modules>
<CS_Module>
<Name>zaragoza</Name>
<Key>8743249725</Key>
<DevelopmentEffort>0</DevelopmentEffort>
<LogicalDBConnections/>
</CS_Module>
<CS_Module>
<Name>malaga</Name>
<Key>8743249725</Key>
<DevelopmentEffort>0</DevelopmentEffort>
<LogicalDBConnections/>
</CS_Module>
</Modules>
<CreatedBy>7</CreatedBy>
</CS_Application>
<CS_Application>
<Name>UK</Name>
<Key>2345364564</Key>
<Status>NORMAL</Status>
<Modules>
<CS_Module>
<Name>london</Name>
<Key>8743249725</Key>
<DevelopmentEffort>0</DevelopmentEffort>
<LogicalDBConnections/>
</CS_Module>
<CS_Module>
<Name>liverpool</Name>
<Key>8743249725</Key>
<DevelopmentEffort>0</DevelopmentEffort>
<LogicalDBConnections/>
</CS_Module>
</Modules>
<CreatedBy>7</CreatedBy>
</CS_Application>
</Applications>
</Applications_GetResponse>
</soap:Body>
</soap:Envelope>
I would like to analyze it and obtain the name of the country in the sequence of the cities.
I tried a few things with Python's re.findall, but I couldn't get the result I want:
print("HERE APPLICATIONS")
applications = re.findall('<CS_Application><Name>(.*?)</Name>', response_apply.text)
print(applications)
print("HERE MODULES")
modules = re.findall('<CS_Module><Name>(.*?)</Name>', response_apply.text)
print(modules)
This returns:
host-10$ sudo python3 capture.py
HERE APPLICATIONS
['Spain', 'UK']
HERE MODULES
['zaragoza', 'malaga', 'london', 'liverpool']
I would like the result to look like this:
HERE
The Country: Spain - Cities: zaragoza,malaga
The Country: UK - Cities: london,liverpool

Regex is not a good tool for parsing XML; it is better to use an XML parser.
If you still want a regex solution, I hope the code below helps.
import re
s = """\n<soap:Envelope xmlns:soap="http://www.w3.org/2003/05/soap-envelope" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema">\n <soap:Body>\n <Applications_GetResponse xmlns="http://www.country.com">\n <Applications>\n <CS_Application>\n <Name>Spain</Name>\n <Key>2345364564</Key>\n <Status>NORMAL</Status>\n <Modules>\n <CS_Module>\n <Name>zaragoza</Name>\n <Key>8743249725</Key>\n <DevelopmentEffort>0</DevelopmentEffort>\n <LogicalDBConnections/>\n </CS_Module>\n <CS_Module>\n <Name>malaga</Name>\n <Key>8743249725</Key>\n <DevelopmentEffort>0</DevelopmentEffort>\n <LogicalDBConnections/>\n </CS_Module>\n </Modules>\n <CreatedBy>7</CreatedBy>\n </CS_Application>\n <CS_Application>\n <Name>UK</Name>\n <Key>2345364564</Key>\n <Status>NORMAL</Status>\n <Modules>\n <CS_Module>\n <Name>london</Name>\n <Key>8743249725</Key>\n <DevelopmentEffort>0</DevelopmentEffort>\n <LogicalDBConnections/>\n </CS_Module>\n <CS_Module>\n <Name>liverpool</Name>\n <Key>8743249725</Key>\n <DevelopmentEffort>0</DevelopmentEffort>\n <LogicalDBConnections/>\n </CS_Module>\n </Modules>\n <CreatedBy>7</CreatedBy>\n </CS_Application>\n </Applications>\n </Applications_GetResponse>\n </soap:Body>\n</soap:Envelope>\n"""
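# capture everything inside each <CS_Application> block (non-greedy, across lines)
pattern1 = re.compile(r'<CS_Application>([\s\S]*?)</CS_Application>')
# capture the text of every <Name> element (non-greedy)
pattern2 = re.compile(r'<Name>(.*?)</Name>')
for m in pattern1.finditer(s):
    ss = m.group(1)
    res = []
    for mm in pattern2.finditer(ss):
        res.append(mm.group(1))
    # the first <Name> is the country, the remaining ones are its cities
    print("The Country: " + res[0] + " - Cities: " + ",".join(res[1:]))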
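For comparison, here is a sketch of the XML parser route using the standard library's xml.etree.ElementTree. The sample is a trimmed copy of the response from the question, and the function name countries_and_cities and the prefix "c" are chosen only for this example. Note that the elements live in the default namespace http://www.country.com, so they have to be addressed with that namespace.

```python
import xml.etree.ElementTree as ET

# Trimmed version of the SOAP response from the question
SAMPLE = """<soap:Envelope xmlns:soap="http://www.w3.org/2003/05/soap-envelope">
  <soap:Body>
    <Applications_GetResponse xmlns="http://www.country.com">
      <Applications>
        <CS_Application>
          <Name>Spain</Name>
          <Modules>
            <CS_Module><Name>zaragoza</Name></CS_Module>
            <CS_Module><Name>malaga</Name></CS_Module>
          </Modules>
        </CS_Application>
        <CS_Application>
          <Name>UK</Name>
          <Modules>
            <CS_Module><Name>london</Name></CS_Module>
            <CS_Module><Name>liverpool</Name></CS_Module>
          </Modules>
        </CS_Application>
      </Applications>
    </Applications_GetResponse>
  </soap:Body>
</soap:Envelope>"""

# Prefix "c" is arbitrary; only the namespace URI has to match the document
NS = {"c": "http://www.country.com"}


def countries_and_cities(xml_text):
    """Return a list of (country, [cities]) tuples from the response text."""
    root = ET.fromstring(xml_text)
    result = []
    for app in root.iter("{http://www.country.com}CS_Application"):
        country = app.findtext("c:Name", namespaces=NS)
        cities = [
            module.findtext("c:Name", namespaces=NS)
            for module in app.findall("c:Modules/c:CS_Module", NS)
        ]
        result.append((country, cities))
    return result


pairs = countries_and_cities(SAMPLE)
for country, cities in pairs:
    print("The Country: " + country + " - Cities: " + ",".join(cities))
```

Unlike the regex version, this does not depend on whitespace between tags, and the country name is taken from the application's own Name child rather than from "whichever Name comes first".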

Converting an xml doc into a specific dot-expanded json structure

I have the following XML document:
<Item ID="288917">
<Main>
<Platform>iTunes</Platform>
<PlatformID>353736518</PlatformID>
</Main>
<Genres>
<Genre FacebookID="6003161475030">Comedy</Genre>
<Genre FacebookID="6003172932634">TV-Show</Genre>
</Genres>
<Products>
<Product Country="CA">
<URL>https://itunes.apple.com/ca/tv-season/id353187108?i=353736518</URL>
<Offers>
<Offer Type="HDBUY">
<Price>3.49</Price>
<Currency>CAD</Currency>
</Offer>
<Offer Type="SDBUY">
<Price>2.49</Price>
<Currency>CAD</Currency>
</Offer>
</Offers>
</Product>
<Product Country="FR">
<URL>https://itunes.apple.com/fr/tv-season/id353187108?i=353736518</URL>
<Rating>Tout public</Rating>
<Offers>
<Offer Type="HDBUY">
<Price>2.49</Price>
<Currency>EUR</Currency>
</Offer>
<Offer Type="SDBUY">
<Price>1.99</Price>
<Currency>EUR</Currency>
</Offer>
</Offers>
</Product>
</Products>
</Item>
Currently, to get it into json format I'm doing the following:
parser = etree.XMLParser(recover=True)
node = etree.fromstring(s, parser=parser)
data = xmltodict.parse(etree.tostring(node))
Of course the xmltodict is doing the heavy lifting. However, it gives me a format that is not ideal for what I'm trying to accomplish. Here is what I'd like the end data to look like:
{
"Item[#ID]": 288917, # if no preceding element, use the root node tag
"Main.Platform": "iTunes",
"Main.PlatformID": "353736518",
"Genres.Genre": ["Comedy", "TV-Show"], # list of elements if repeated
"Genres.Genre[#FacebookID]": ["6003161475030", "6003172932634"],
"Products.Product[#Country]": ["CA", "FR"],
"Products.Product.URL": ["https://itunes.apple.com/ca/tv-season/id353187108?i=353736518", "https://itunes.apple.com/fr/tv-season/id353187108?i=353736518"],
"Products.Product.Offers.Offer[#Type]": ["HDBUY", "SDBUY", "HDBUY", "SDBUY"],
"Products.Product.Offers.Offer.Price": ["3.49", "2.49", "2.49", "1.99"],
"Products.Product.Offers.Offer.Currency": "EUR"
}

This is a bit verbose, but it wasn't too hard to format this as a flat dict. Here is an example:
node = etree.fromstring(file_data.encode('utf-8'), parser=parser)
data = OrderedDict()
nodes = [(node, ''),] # format is (node, prefix)
while nodes:
    for sub, prefix in nodes:
        # remove the prefix tag unless its for the first attribute
        tag_prefix = '.'.join(prefix.split('.')[1:]) if ('.' in prefix) else ''
        atr_prefix = sub.tag if (sub == node) else tag_prefix
        # tag
        if sub.text.strip():
            _prefix = tag_prefix + '.' + sub.tag
            _value = sub.text.strip()
            if data.get(_prefix): # convert it to a list if multiple values
                if not isinstance(data[_prefix], list): data[_prefix] = [data[_prefix],]
                data[_prefix].append(_value)
            else:
                data[_prefix] = _value
        # atr
        for k, v in sub.attrib.items():
            _prefix = atr_prefix + '[#%s]' % k
            _value = v
            if data.get(_prefix): # convert it to a list if multiple values
                if not isinstance(data[_prefix], list): data[_prefix] = [data[_prefix],]
                data[_prefix].append(_value)
            else:
                data[_prefix] = _value
        nodes.remove((sub, prefix))
        for s in sub.getchildren():
            _prefix = (prefix + '.' + sub.tag).strip('.')
            nodes.append((s, _prefix))
    if not nodes: break

You can use recursion here. One way is to build up the paths progressively as you recurse through the XML document and return a result dictionary at the end, which can then be serialized to JSON.
The below demo uses the standard library xml.etree.ElementTree for parsing XML documents.
Demo:
from xml.etree.ElementTree import ElementTree
from pprint import pprint
# Setup XML tree for parsing
tree = ElementTree()
tree.parse("sample.xml")
root = tree.getroot()
def collect_xml_paths(root, path=None, result=None):
    """Collect XML paths into a dictionary"""
    path = path or []
    is_top_level = result is None
    # First collect root items (a fresh dict per top-level call, so that
    # repeated calls do not accumulate values in a shared default argument)
    if is_top_level:
        result = {}
        for root_id, root_value in root.attrib.items():
            result.setdefault(root.tag + "[#%s]" % root_id, []).append(root_value)
    # Go through each child from root
    for child in root:
        # Extract text (child.text is None for empty elements)
        text = (child.text or "").strip()
        # Update path
        new_path = path + [child.tag]
        # Create dot separated key
        key = ".".join(new_path)
        # Add each attribute to result
        for k, v in child.attrib.items():
            result.setdefault(key + "[#%s]" % k, []).append(v)
        # Add text if it exists
        if text:
            result.setdefault(key, []).append(text)
        # Recurse into the child, sharing the same result dictionary
        collect_xml_paths(child, new_path, result)
    if not is_top_level:
        return result
    # Separate single values from list values
    return {k: v[0] if len(v) == 1 else v for k, v in result.items()}
pprint(collect_xml_paths(root))
Output:
{'Genres.Genre': ['Comedy', 'TV-Show'],
'Genres.Genre[#FacebookID]': ['6003161475030', '6003172932634'],
'Item[#ID]': '288917',
'Main.Platform': 'iTunes',
'Main.PlatformID': '353736518',
'Products.Product.Offers.Offer.Currency': ['CAD', 'CAD', 'EUR', 'EUR'],
'Products.Product.Offers.Offer.Price': ['3.49', '2.49', '2.49', '1.99'],
'Products.Product.Offers.Offer[#Type]': ['HDBUY', 'SDBUY', 'HDBUY', 'SDBUY'],
'Products.Product.Rating': 'Tout public',
'Products.Product.URL': ['https://itunes.apple.com/ca/tv-season/id353187108?i=353736518',
'https://itunes.apple.com/fr/tv-season/id353187108?i=353736518'],
'Products.Product[#Country]': ['CA', 'FR']}
If you want to serialize this dictionary to JSON, you can use json.dumps():
from json import dumps
print(dumps(collect_xml_paths(root)))
# {"Item[#ID]": "288917", "Main.Platform": "iTunes", "Main.PlatformID": "353736518", "Genres.Genre[#FacebookID]": ["6003161475030", "6003172932634"], "Genres.Genre": ["Comedy", "TV-Show"], "Products.Product[#Country]": ["CA", "FR"], "Products.Product.URL": ["https://itunes.apple.com/ca/tv-season/id353187108?i=353736518", "https://itunes.apple.com/fr/tv-season/id353187108?i=353736518"], "Products.Product.Offers.Offer[#Type]": ["HDBUY", "SDBUY", "HDBUY", "SDBUY"], "Products.Product.Offers.Offer.Price": ["3.49", "2.49", "2.49", "1.99"], "Products.Product.Offers.Offer.Currency": ["CAD", "CAD", "EUR", "EUR"], "Products.Product.Rating": "Tout public"}

getting value from xml in a dict

I have an XML file that I want to parse. The file has 3 unique moid values:
3
2
1
Each of these has 1 unique value for metricX. I want to extract these values into a Python dict, something like this.
Desired output:
{ 3 : {"metricX":100}, 2 : {"metricX":11}, 1 : {"metricX":44}}
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="MeasDataCollection.xsl"?>
<!DOCTYPE mdc SYSTEM "MeasDataCollection.dtd">
<mdc xmlns:HTML="http://www.w3.org/TR/REC-xml">
<md>
<neid>
<neun>936001_STURGEON_BAY_MEYER</neun>
</neid>
<mi>
<mi>
<mts>20170924161500Z</mts>
<gp>900</gp>
<mt>metricX</mt>
<mv>
<moid>3</moid>
<r>100</r>
</mv>
<mv>
<moid>2</moid>
<r>11</r>
</mv>
<mv>
<moid>1</moid>
<r>44</r>
</mv>
</mi>
</mi>
</md>
</mdc>
So far I have tried using Element Tree.
import os
import xml.etree.ElementTree as ET
fullpath = os.getcwd()
os.chdir(r"C:\Users\sss\Documents\Zabbix_work\xml_parsing")
tree = ET.ElementTree(file='smaple.xml')
for elem in tree.iter():
    print(elem.tag, elem.text)
Output so far is -
mdc
md
neid
neun 936001_STURGEON_BAY_MEYER
mi
mi
mts 20170924161500Z
gp 900
mt metricX
mv
moid 3
r 100
mv
moid 2
r 11
mv
moid 1
r 44
I am not sure how to organize this further into a dict.

This should do the trick:
import xml.etree.ElementTree as ET
import os
file_path = os.path.expanduser('~/Desktop/input123.xml') # filepath here
tree = ET.ElementTree(file=file_path)
my_dict = {}
for node in tree.getroot().find('md').find('mi').find('mi').findall('mv'):
    my_dict[int(node.find('moid').text)] = { 'metricX': int(node.find('r').text) }
print(my_dict)
...output:
{3: {'metricX': 100}, 2: {'metricX': 11}, 1: {'metricX': 44}}
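The snippet above hard-codes the key 'metricX'. As a sketch of a slightly more general variant, the metric name can be read from the mt element that sits next to the mv entries. The sample below is a trimmed copy of the file from the question (without the DOCTYPE and stylesheet lines), and the function name metrics_by_moid is an assumption for illustration.

```python
import xml.etree.ElementTree as ET

# Trimmed version of the measurement file from the question
SAMPLE = """<mdc>
  <md>
    <mi>
      <mi>
        <mt>metricX</mt>
        <mv><moid>3</moid><r>100</r></mv>
        <mv><moid>2</moid><r>11</r></mv>
        <mv><moid>1</moid><r>44</r></mv>
      </mi>
    </mi>
  </md>
</mdc>"""


def metrics_by_moid(root):
    """Map each moid to {metric name: value}, taking the name from <mt>."""
    result = {}
    for mi in root.iter("mi"):
        metric = mi.findtext("mt")
        if metric is None:
            # an <mi> without a direct <mt> child (the outer wrapper here)
            continue
        for mv in mi.findall("mv"):
            moid = int(mv.findtext("moid"))
            result.setdefault(moid, {})[metric] = int(mv.findtext("r"))
    return result


my_dict = metrics_by_moid(ET.fromstring(SAMPLE))
print(my_dict)
```

This assumes every mv has both a moid and an r child holding integers, as in the sample; a real file with missing or non-numeric values would need extra checks.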

parsing repeating child elements python

I am trying to parse an XML document that contains repeating child elements using Python. When I attempt to parse the data, it creates an empty file. If I comment out the repeating child elements code (see bolded section in python script below), the document generates correctly. Can someone help?
XML:
<?xml version="1.0" encoding="ISO-8859-1" standalone="no"?>
<FRPerformance xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<FRPerformanceShareClassCurrency>
<FundCode>00190</FundCode>
<CurrencyID>USD</CurrencyID>
<FundShareClassCode>A</FundShareClassCode>
<ReportPeriodFrequency>Quarterly</ReportPeriodFrequency>
<ReportPeriodEndDate>06/30/2012</ReportPeriodEndDate>
<Net>
<Annualized>
<Year1>-4.909000000</Year1>
<Year3>10.140000000</Year3>
<Year5>-22.250000000</Year5>
<Year10>-7.570000000</Year10>
<Year15>-4.730000000</Year15>
<Year20>-0.900000000</Year20>
<SI>1.900000000</SI>
</Annualized>
</Net>
<Gross>
<Annualized>
<Month3>1.279000000</Month3>
<YTD>7.294000000</YTD>
<Year1>-0.167000000</Year1>
<Year3>11.940000000</Year3>
<Year5>-21.490000000</Year5>
<Year10>-7.120000000</Year10>
<Year15>-4.420000000</Year15>
<Year20>-0.660000000</Year20>
<SI>2.110000000</SI>
</Annualized>
<Cumulative>
<Month1Back>2.288000000</Month1Back>
<Month2Back>-1.587000000</Month2Back>
<Month3Back>0.610000000</Month3Back>
<CurrentYear>7.294000000</CurrentYear>
<Year1Back>-2.409000000</Year1Back>
<Year2Back>13.804000000</Year2Back>
<Year3Back>20.287000000</Year3Back>
<Year4Back>-78.528000000</Year4Back>
<Year5Back>-0.101000000</Year5Back>
<Year6Back>9.193000000</Year6Back>
<Year7Back>2.659000000</Year7Back>
<Year8Back>9.208000000</Year8Back>
<Year9Back>25.916000000</Year9Back>
<Year10Back>-3.612000000</Year10Back>
</Cumulative>
<HistoricReturns>
<HistoricReturns_Item>
<Date>Fri, 28 Feb 1997 00:00:00 -0600</Date>
<Return>32058.090000000</Return>
</HistoricReturns_Item>
<HistoricReturns_Item>
<Date>Fri, 28 Feb 2003 00:00:00 -0600</Date>
<Return>36415.110000000</Return>
</HistoricReturns_Item>
<HistoricReturns_Item>
<Date>Fri, 29 Feb 2008 00:00:00 -0600</Date>
<Return>49529.290000000</Return>
</HistoricReturns_Item>
<HistoricReturns_Item>
<Date>Fri, 30 Apr 1993 00:00:00 -0600</Date>
<Return>21621.500000000</Return>
</HistoricReturns_Item>
</HistoricReturns>
Python script
## Create command line arguments for XML file and tagName
xmlFile = sys.argv[1]
tagName = sys.argv[2]
tree = ET.parse(xmlFile)
root = tree.getroot()
## Setup the file for output
saveout = sys.stdout
output_file = open('parsedXML.csv', 'w')
sys.stdout = output_file
## Parse XML
for node in root.findall(tagName):
    fundCode = node.find('FundCode').text
    curr = node.find('CurrencyID').text
    shareClass = node.find('FundShareClassCode').text
    for node2 in node.findall('./Net/Annualized'):
        year1 = node2.findtext('Year1')
        year3 = node2.findtext('Year3')
        year5 = node2.findtext('Year5')
        year10 = node2.findtext('Year10')
        year15 = node2.findtext('Year15')
        year20 = node2.findtext('Year20')
        SI = node2.findtext('SI')
    for node3 in node.findall('./Gross'):
        for node4 in node3.findall('./Annualized'):
            month3 = node4.findtext('Month3')
            ytd = node4.findtext('YTD')
            year1g = node4.findtext('Year1')
            year3g = node4.findtext('Year3')
            year5g = node4.findtext('Year5')
            year10g = node4.findtext('Year10')
            year15g = node4.findtext('Year15')
            year20g = node4.findtext('Year2')
            SIg = node4.findtext('SI')
        for node5 in node3.findall('./Cumulative'):
            month1b = node5.findtext('Month1Back')
            month2b = node5.findtext('Month2Back')
            month3b = node5.findtext('Month3Back')
            curYear = node5.findtext('CurrentYear')
            year1b = node5.findtext('Year1Back')
            year2b = node5.findtext('Year2Back')
            year3b = node5.findtext('Year3Back')
            year4b = node5.findtext('Year4Back')
            year5b = node5.findtext('Year5Back')
            year6b = node5.findtext('Year6Back')
            year7b = node5.findtext('Year7Back')
            year8b = node5.findtext('Year8Back')
            year9b = node5.findtext('Year9Back')
            year10b = node5.findtext('Year10Back')
    **for node6 in node.findall('./HistoricReturns'):
        for node7 in node6.findall('./HistoricReturns_Item'):
            hDate = node7.findall('Date')
            hReturn = node7.findall('Return')**
    print(fundCode, curr, shareClass,year1, year3, year5, year10, year15, year15, year20, SI,month3, ytd, year1g, year3g, year5g, year10g, year15g, year20g, SIg, month1b, month2b, month3b, curYear, year1b, year2b, year3b, year4b, year5b, year6b, year7b, year8b,year9b,year10b, hDate, hReturn)

The sample XML and the Python code don't match up in terms of structure. Either:
you're missing a closing </Gross> tag in the XML (which should come before the <HistoricReturns> section starts), in which case the code is correct, or
the code should be for node6 in node3.findall('./HistoricReturns'): i.e. node3 instead of node.
N.B. The XML sample isn't complete (it isn't well-formed XML) because it's missing closing tags for Gross, FRPerformanceShareClassCurrency and FRPerformance, which makes it impossible to answer the question definitively. Hope this helps though.
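To illustrate the second possibility, here is a minimal sketch that assumes HistoricReturns really is nested inside Gross. The sample is a reduced, well-formed stand-in for the file in the question (the missing closing tags are filled in purely for the example), and rows is a name introduced here. It collects one row per HistoricReturns_Item and uses findtext() to get the text, since findall() returns a list of elements rather than their values.

```python
import xml.etree.ElementTree as ET

# Reduced, well-formed stand-in for the document in the question,
# ASSUMING <HistoricReturns> is a child of <Gross>
SAMPLE = """<FRPerformance>
  <FRPerformanceShareClassCurrency>
    <FundCode>00190</FundCode>
    <Gross>
      <HistoricReturns>
        <HistoricReturns_Item>
          <Date>Fri, 28 Feb 1997 00:00:00 -0600</Date>
          <Return>32058.090000000</Return>
        </HistoricReturns_Item>
        <HistoricReturns_Item>
          <Date>Fri, 28 Feb 2003 00:00:00 -0600</Date>
          <Return>36415.110000000</Return>
        </HistoricReturns_Item>
      </HistoricReturns>
    </Gross>
  </FRPerformanceShareClassCurrency>
</FRPerformance>"""

root = ET.fromstring(SAMPLE)
rows = []
for node in root.findall("FRPerformanceShareClassCurrency"):
    fund_code = node.findtext("FundCode")
    # the path goes through Gross, matching the assumed nesting
    for item in node.findall("./Gross/HistoricReturns/HistoricReturns_Item"):
        rows.append((fund_code, item.findtext("Date"), item.findtext("Return")))

for row in rows:
    print(*row, sep=",")
```

If HistoricReturns turns out to be a sibling of Gross instead, only the path in the inner findall() changes to "./HistoricReturns/HistoricReturns_Item".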
