find all ancestors of xml node python using lxml - python

I am trying to find all ancestors of a node.
My XML:
xmldata="""
<OrganizationTreeInfo xmlns:i="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://schemas.datacontract.org/2004/07/YSM.PMS.Web.Service.DataTransfer.Models">
<Name>Parent</Name>
<OrganizationId>4345</OrganizationId>
<Children>
<OrganizationTreeInfo>
<Name>A</Name>
<OrganizationId>123</OrganizationId>
<Children>
<OrganizationTreeInfo>
<Name>B</Name>
<OrganizationId>54</OrganizationId>
<Children/>
</OrganizationTreeInfo>
</Children>
</OrganizationTreeInfo>
<OrganizationTreeInfo>
<Name>C</Name>
<OrganizationId>34</OrganizationId>
<Children>
<OrganizationTreeInfo>
<Name>D</Name>
<OrganizationId>32323</OrganizationId>
<Children>
<OrganizationTreeInfo>
<Name>E</Name>
<OrganizationId>3234</OrganizationId>
<Children/>
</OrganizationTreeInfo>
</Children>
</OrganizationTreeInfo>
</Children>
</OrganizationTreeInfo>
</Children>
</OrganizationTreeInfo>
"""
For example, if I input the OrganizationId value 3234, the output should look like:
{'parent':4345,'C':34,'D':32323,'E':3234 }
Here is my attempt:
import lxml.etree as ET

root = ET.fromstring(xmldata)
for target in root.xpath('.//OrganizationTreeInfo/OrganizationId[text()="3234"]'):
    d = {
        dept.find('Name').text: int(dept.find('OrganizationId').text)
        for dept in target.xpath('ancestor-or-self::OrganizationTreeInfo')
    }
    print(d)
But it does not produce any output, and I am unable to find out what is wrong with it.

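You are not getting any output because of the default namespace declared on the root element:
xmlns="http://schemas.datacontract.org/2004/07/YSM.PMS.Web.Service.DataTransfer.Models"
Every element in the document belongs to this namespace, so the unqualified names in your XPath match nothing. The following code passes the namespace to each query:
Code: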
import pprint
import lxml.etree as ET

root = ET.fromstring(xmldata)
result = {}
count = 1
# Bind the document's default namespace to a prefix for use in XPath
namespaces1 = {'xmlns': 'http://schemas.datacontract.org/2004/07/YSM.PMS.Web.Service.DataTransfer.Models'}
for target in root.xpath('.//xmlns:OrganizationTreeInfo/xmlns:OrganizationId[text()="3234"]',
                         namespaces=namespaces1):
    result[count] = {}
    for dept in target.xpath('ancestor-or-self::xmlns:OrganizationTreeInfo', namespaces=namespaces1):
        name = dept.find('xmlns:Name', namespaces=namespaces1).text
        result[count][name] = int(dept.find('xmlns:OrganizationId', namespaces=namespaces1).text)
    count += 1
pprint.pprint(result)
Output:
:~/workspace/vtestproject/study$ python test1.py
{1: {'C': 34, 'D': 32323, 'E': 3234, 'Parent': 4345}}
Alternatively, replace the xmlns= string with a temporary attribute name before parsing, so that the elements are no longer in a namespace.
Code:
import pprint
import lxml.etree as ET

# Turn the default namespace declaration into an ordinary attribute
new_xmldata = xmldata.replace("xmlns=", "xmlnamespace=")
root = ET.fromstring(new_xmldata)
result = {}
count = 1
for target in root.xpath('.//OrganizationTreeInfo/OrganizationId[text()="3234"]'):
    result[count] = {}
    for dept in target.xpath('ancestor-or-self::OrganizationTreeInfo'):
        result[count][dept.find('Name').text] = int(dept.find('OrganizationId').text)
    count += 1
pprint.pprint(result)
Output:
:~/workspace/vtestproject/study$ python test1.py
{1: {'C': 34, 'D': 32323, 'E': 3234, 'Parent': 4345}}
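If lxml is not available, the same ancestor chain can be built with the standard library. The following is only a sketch under stated assumptions: xml.etree.ElementTree has no ancestor axis, so it builds a child-to-parent map and walks upwards. The helper names (qualified, ancestors_by_org_id) and the shortened SAMPLE document are hypothetical and not part of the original answer.

```python
import xml.etree.ElementTree as ET

# Namespace URI copied from the root element in the question
NS = "http://schemas.datacontract.org/2004/07/YSM.PMS.Web.Service.DataTransfer.Models"

# Shortened, hypothetical sample with the same structure as the question's data
SAMPLE = """
<OrganizationTreeInfo xmlns="{ns}">
  <Name>Parent</Name><OrganizationId>4345</OrganizationId>
  <Children>
    <OrganizationTreeInfo>
      <Name>C</Name><OrganizationId>34</OrganizationId>
      <Children>
        <OrganizationTreeInfo>
          <Name>E</Name><OrganizationId>3234</OrganizationId>
          <Children/>
        </OrganizationTreeInfo>
      </Children>
    </OrganizationTreeInfo>
  </Children>
</OrganizationTreeInfo>
""".format(ns=NS)


def qualified(tag):
    """Return the tag in Clark notation: {namespace}tag."""
    return "{%s}%s" % (NS, tag)


def ancestors_by_org_id(xml_text, org_id):
    """Map Name -> OrganizationId for the matching node and its ancestors."""
    root = ET.fromstring(xml_text)
    # ElementTree elements do not know their parent, so record it explicitly
    parent_of = {child: parent for parent in root.iter() for child in parent}
    for node in root.iter(qualified("OrganizationTreeInfo")):
        if node.findtext(qualified("OrganizationId")) != str(org_id):
            continue
        chain = []
        while node is not None:
            # Skip the intermediate <Children> wrappers
            if node.tag == qualified("OrganizationTreeInfo"):
                chain.append((node.findtext(qualified("Name")),
                              int(node.findtext(qualified("OrganizationId")))))
            node = parent_of.get(node)
        # The chain was collected bottom-up; reverse it to start at the root
        return dict(reversed(chain))
    return {}


print(ancestors_by_org_id(SAMPLE, 3234))
```

Unlike the lxml versions above, this returns a single flat dictionary ordered from the root down to the matching node, and an empty dictionary when the id is not found.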

Related

How to extract specific values from an XML file using Python xml.etree.ElementTree, iterating until an id is found inside a nested child node?

I need to iterate over the tag ObjectHeader and when the tag ObjectType/Id is equal to 1424 I need to extract all the values inside the following tags ObjectVariant/ObjectValue/Characteristic/Name and ObjectVariant/ObjectValue/PropertyValue/Value and put them in a dictionary. The expected output will be like this:
{"Var1": 10.4,
"Var2": 15.6}
Here is a snippet from the XML that I'm working with which has 30k lines (Hint: Id 1424 only appears once in the whole XML file).
<ObjectContext>
<ObjectHeader>
<ObjectType>
<Id>1278</Id>
<Name>ID_NAME</Name>
</ObjectType>
<ObjectVariant>
<ObjectValue>
<Characteristic>
<Name>Var1</Name>
<Description>Something about the name</Description>
</Characteristic>
<PropertyValue>
<Value>10.6</Value>
<Description>Something about the value</Description>
</PropertyValue>
</ObjectValue>
</ObjectVariant>
</ObjectHeader>
<ObjectHeader>
<ObjectType>
<Id>1424</Id>
<Name>ID_NAME</Name>
</ObjectType>
<ObjectVariant>
<ObjectValue>
<Characteristic>
<Name>Var1</Name>
<Description>Something about the name</Description>
</Characteristic>
<PropertyValue>
<Value>10.4</Value>
<Description>Something about the value</Description>
</PropertyValue>
</ObjectValue>
<ObjectValue>
<Characteristic>
<Name>Var2</Name>
<CharacteristicType>Something about the name</CharacteristicType>
</Characteristic>
<PropertyValue>
<Value>15.6</Value>
<Description>Something about the value</Description>
</PropertyValue>
</ObjectValue>
</ObjectVariant>
</ObjectHeader>
</ObjectContext>
Here is one possibility: write everything to a pandas DataFrame and then filter for the interesting values:
import pandas as pd
import xml.etree.ElementTree as ET

tree = ET.parse("xml_to_dict.xml")
root = tree.getroot()
columns = ["id", "name", "value"]
row_list = []
for objHead in root.findall('.//ObjectHeader'):
    for elem in objHead.iter():
        if elem.tag == 'Id':
            id = elem.text
        if elem.tag == 'Name':
            name = elem.text
        if elem.tag == 'Value':
            value = elem.text
            # A Value closes one (id, name, value) triple
            row = id, name, value
            row_list.append(row)
df = pd.DataFrame(row_list, columns=columns)
dff = df.query('id == "1424"')
print("Dictionary:", dict(list(zip(dff['name'], dff['value']))))
Output:
Dictionary: {'Var1': '10.4', 'Var2': '15.6'}
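Since the id appears only once, the lookup can also be done without pandas. The sketch below is an assumption-laden alternative, not the answerer's method: it uses the limited XPath support of xml.etree.ElementTree to find the matching ObjectHeader and converts the values to float to match the expected output in the question. The function name values_for_object_type and the trimmed SAMPLE are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Trimmed, hypothetical sample with the same structure as the question's XML
SAMPLE = """
<ObjectContext>
  <ObjectHeader>
    <ObjectType><Id>1278</Id><Name>ID_NAME</Name></ObjectType>
    <ObjectVariant>
      <ObjectValue>
        <Characteristic><Name>Var1</Name></Characteristic>
        <PropertyValue><Value>10.6</Value></PropertyValue>
      </ObjectValue>
    </ObjectVariant>
  </ObjectHeader>
  <ObjectHeader>
    <ObjectType><Id>1424</Id><Name>ID_NAME</Name></ObjectType>
    <ObjectVariant>
      <ObjectValue>
        <Characteristic><Name>Var1</Name></Characteristic>
        <PropertyValue><Value>10.4</Value></PropertyValue>
      </ObjectValue>
      <ObjectValue>
        <Characteristic><Name>Var2</Name></Characteristic>
        <PropertyValue><Value>15.6</Value></PropertyValue>
      </ObjectValue>
    </ObjectVariant>
  </ObjectHeader>
</ObjectContext>
"""


def values_for_object_type(root, type_id):
    """Return {Characteristic/Name: PropertyValue/Value} for the first
    ObjectHeader whose ObjectType/Id equals type_id."""
    for header in root.iter("ObjectHeader"):
        if header.findtext("ObjectType/Id") == str(type_id):
            return {
                value.findtext("Characteristic/Name"):
                    float(value.findtext("PropertyValue/Value"))
                for value in header.iterfind("ObjectVariant/ObjectValue")
            }
    return {}  # no header with that id


root = ET.fromstring(SAMPLE)
print(values_for_object_type(root, 1424))
```

Returning as soon as the header is found avoids scanning the rest of a 30k-line file once the single match has been processed.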

Transform Nested XML

I am trying to parse a nested XML file into a pandas DataFrame so that I can generate a CSV in which each column is an element name and each value is that element's text, but I am having trouble extracting the information. Below is an example of the nested XML and what I have tried.
The below XML can be quite large with hundreds of different records. This is what I tried:
## Import modules
import xml.etree.ElementTree as ET
import pandas as pd
from lxml import etree

tree = ET.parse("File.xml")
root = tree.getroot()
for subelement in root:
    for subsub in subelement:
        print(subsub.tag, ",", subsub.text, subsub.attrib, subsub.items())
for subelement in root:
    for subsub in subelement:
        for subsubsub in subsub:
            print(subsubsub.tag, ",", subsubsub.text, subsubsub.attrib)
<?xml version="1.0" encoding="utf-16"?>
<test1 xmlns="test.xsd">
<test2 ID="123123123" test3="123123">
<test3>Separate</test3>
<test4>AA</test4>
<Comments>BB</Comments>
<test5>
<test6 ID="123123">
<test3>today</test3>
<test7>123 street</test7>
</test6>
</test5>
<test8>
<test10 ID="434234">
<test3>type of work</test3>
<test9>test work</test9>
</test10>
</test8>
<test11>
<test12 ID="234234234">
<test3>Social</test3>
<test14>test</test14>
</test12>
<test12 ID="123123">
<test3>Something Here</test3>
<test13>Some date</test13>
<test14>123123124433</test14>
</test12>
</test11>
<test15>
<test16 ID="6456456456">
<test3>Something Something</test3>
<test14>746745636</test14>
</test16>
</test15>
</test2>
<test2 ID="353453245" test3="list of something">
<test3>Somewhere</test3>
<test4>Someone</test4>
<Comments>Some comment</Comments>
<test5>
<test6 ID="567456756">
<test3>Not today</test3>
<test7>5634643643</test7>
<test17>Some Info</test17>
<test19>Somewhere</test19>
<test18>63243333</test18>
</test6>
</test5>
<test11>
<test12 ID="456436346">
<test3>Pattern</test3>
<test14>436346346</test14>
</test12>
<test12 ID="4364356">
<test3> ID</test3>
<test14>5674567457</test14>
</test12>
<test12 ID="123123123443">
<test3>Other ID</test3>
<test13>54234532452345</test13>
<test14>231423532452345</test14>
</test12>
</test11>
<test15>
<test16 ID="34252345">
<test3>None test</test3>
<test14>456436436346</test14>
</test16>
</test15>
</test2>
</test1>
Update: So would the full code look something like this?
### TEST USING EXAMPLE HOTLIST
import csv
import re
import xml.etree.ElementTree as ET

with open("file.csv", "w", newline='') as fout:
    header = ['test3','test4','test7','test9','test13','test14','test17','test18','test19','Comments']
    csvout = csv.DictWriter(fout, fieldnames=header)
    csvout.writeheader()
    row = {}
    for _, elem in ET.iterparse('file.xml'):
        # strip the namespace from the element tag name; e.g. {Test.xsd}test14 > test14
        tag = re.sub("^{.*?}", "", elem.tag)
        if tag == 'test2':
            if len(row) != 0:
                print(row)
                csvout.writerow(row)
            row = {}
        if len(elem) == 0:
            text = elem.text
            old = row.get(tag)
            if old is None:
                # first occurrence of the tag
                row[tag] = text
            elif isinstance(old, str):
                # second occurrence of the tag
                row[tag] = [old, text]
            else:
                # already a list
                old.append(text)
For nested XML you can use the iterparse() function to iterate over all elements in the document. You then need logic that handles each element according to its tag and adds it to a dictionary, which is exported as one row.
import xml.etree.ElementTree as ET

for _, elem in ET.iterparse('file.xml'):
    if len(elem) == 0:
        # leaf element: show its text as well
        print(f'{elem.tag} {elem.attrib} text={elem.text}')
    else:
        print(f'{elem.tag} {elem.attrib}')
To create a row in a CSV file from the element text, you can do something like this. If, for example, the "test2" element marks a record, it can be used to write the collected record as a new row and clear the dictionary for the next record.
If you want to output all or some of the attributes, you need to add a few lines of code for that. If an attribute has the same name as an element, or several elements share the same attribute (e.g. ID), you need to handle that in your code.
import xml.etree.ElementTree as ET
import re
import csv

with open("out.csv", "w", newline='') as fout:
    header = ['test3','test4','test7','test9','test13','test14','test17','test18','test19','Comments']
    csvout = csv.DictWriter(fout, fieldnames=header)
    csvout.writeheader()
    row = {}
    for _, elem in ET.iterparse('test.xml'):
        # strip the namespace from the element tag name; e.g. {Test.xsd}test14 > test14
        tag = re.sub("^{.*?}", "", elem.tag)
        if tag == 'test2':
            if len(row) != 0:
                print(row)
                csvout.writerow(row)
            row = {}
        if len(elem) == 0:
            row[tag] = elem.text
Output:
{'test3': 'Something Something', 'test4': 'AA', 'Comments': 'BB', 'test7': '123 street', 'test9': 'test work', 'test14': '746745636', 'test13': 'Some date'}
{'test3': 'None test', 'test4': 'Someone', 'Comments': 'Some comment', 'test7': '5634643643', 'test17': 'Some Info', 'test19': 'Somewhere', 'test18': '63243333', 'test14': '456436436346', 'test13': '54234532452345'}
CSV Output:
test3,test4,test7,test9,test13,test14,test17,test18,test19,Comments
Something Something,AA,123 street,test work,Some date,746745636,,,,BB
None test,Someone,5634643643,,54234532452345,456436436346,Some Info,63243333,Somewhere,Some comment
Update:
If you want to handle duplicate tags and collect their values in a list, try something like this:
if len(elem) == 0:
    text = elem.text
    old = row.get(tag)
    if old is None:
        # first occurrence
        row[tag] = text
    elif isinstance(old, str):
        # second occurrence > create list
        row[tag] = [old, text]
    else:
        old.append(text)
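One caveat with list values: csv.DictWriter converts each non-string value with str(), so a list would land in the CSV as its Python representation (for example ['a', 'b']). A possible way around this, sketched below under assumptions, is to collect every leaf value in a list and join the values with a separator before writing. The function names, the "|" separator, and the in-memory SAMPLE are hypothetical and only mirror the structure of the question's XML.

```python
import csv
import io
import re
import xml.etree.ElementTree as ET

# Small, hypothetical sample in the style of the question's XML
SAMPLE = b"""<test1 xmlns="test.xsd">
  <test2 ID="1">
    <test3>Separate</test3>
    <test11>
      <test12 ID="2"><test3>Social</test3></test12>
      <test12 ID="3"><test3>Something Here</test3></test12>
    </test11>
  </test2>
  <test2 ID="4">
    <test3>Somewhere</test3>
  </test2>
</test1>"""


def collect_rows(xml_bytes, record_tag="test2", sep="|"):
    """Return one dict per record; repeated leaf tags are joined with sep."""
    rows, row = [], {}
    # iterparse reports "end" events by default, so a record element
    # is seen only after all of its descendants
    for _, elem in ET.iterparse(io.BytesIO(xml_bytes)):
        tag = re.sub(r"^\{.*?\}", "", elem.tag)  # drop the {namespace} prefix
        if tag == record_tag:
            if row:
                rows.append({k: sep.join(v) for k, v in row.items()})
            row = {}
        elif len(elem) == 0:
            # leaf element: always store values in a list
            row.setdefault(tag, []).append((elem.text or "").strip())
    return rows


def rows_to_csv(rows, header):
    """Write the rows to an in-memory CSV and return it as text."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=header)
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


rows = collect_rows(SAMPLE)
print(rows_to_csv(rows, ["test3"]))
```

Storing a list for every tag from the start removes the need for the isinstance checks, at the cost of always joining at the end.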

Converting an xml doc into a specific dot-expanded json structure

I have the following XML document:
<Item ID="288917">
<Main>
<Platform>iTunes</Platform>
<PlatformID>353736518</PlatformID>
</Main>
<Genres>
<Genre FacebookID="6003161475030">Comedy</Genre>
<Genre FacebookID="6003172932634">TV-Show</Genre>
</Genres>
<Products>
<Product Country="CA">
<URL>https://itunes.apple.com/ca/tv-season/id353187108?i=353736518</URL>
<Offers>
<Offer Type="HDBUY">
<Price>3.49</Price>
<Currency>CAD</Currency>
</Offer>
<Offer Type="SDBUY">
<Price>2.49</Price>
<Currency>CAD</Currency>
</Offer>
</Offers>
</Product>
<Product Country="FR">
<URL>https://itunes.apple.com/fr/tv-season/id353187108?i=353736518</URL>
<Rating>Tout public</Rating>
<Offers>
<Offer Type="HDBUY">
<Price>2.49</Price>
<Currency>EUR</Currency>
</Offer>
<Offer Type="SDBUY">
<Price>1.99</Price>
<Currency>EUR</Currency>
</Offer>
</Offers>
</Product>
</Products>
</Item>
Currently, to get it into json format I'm doing the following:
parser = etree.XMLParser(recover=True)
node = etree.fromstring(s, parser=parser)
data = xmltodict.parse(etree.tostring(node))
Of course the xmltodict is doing the heavy lifting. However, it gives me a format that is not ideal for what I'm trying to accomplish. Here is what I'd like the end data to look like:
{
"Item[#ID]": 288917, # if no preceding element, use the root node tag
"Main.Platform": "iTunes",
"Main.PlatformID": "353736518",
"Genres.Genre": ["Comedy", "TV-Show"] # list of elements if repeated
"Genres.Genre[#FacebookID]": ["6003161475030", "6003161475030"],
"Products.Product[#Country]": ["CA", "FR"],
"Products.Product.URL": ["https://itunes.apple.com/ca/tv-season/id353187108?i=353736518", "https://itunes.apple.com/fr/tv-season/id353187108?i=353736518"],
"Products.Product.Offers.Offer[#Type]": ["HDBUY", "SDBUY", "HDBUY", "SDBUY"],
"Products.Product.Offers.Offer.Price": ["3.49", "2.49", "2.49", "1.99"],
"Products.Product.Offers.Offer.Currency": "EUR"
}
This is a bit verbose, but it wasn't too hard to format this as a flat dict. Here is an example:
node = etree.fromstring(file_data.encode('utf-8'), parser=parser)
data = OrderedDict()
nodes = [(node, ''),] # format is (node, prefix)
while nodes:
    for sub, prefix in nodes:
        # remove the prefix tag unless its for the first attribute
        tag_prefix = '.'.join(prefix.split('.')[1:]) if ('.' in prefix) else ''
        atr_prefix = sub.tag if (sub == node) else tag_prefix
        # tag
        if sub.text.strip():
            _prefix = tag_prefix + '.' + sub.tag
            _value = sub.text.strip()
            if data.get(_prefix): # convert it to a list if multiple values
                if not isinstance(data[_prefix], list): data[_prefix] = [data[_prefix],]
                data[_prefix].append(_value)
            else:
                data[_prefix] = _value
        # atr
        for k, v in sub.attrib.items():
            _prefix = atr_prefix + '[#%s]' % k
            _value = v
            if data.get(_prefix): # convert it to a list if multiple values
                if not isinstance(data[_prefix], list): data[_prefix] = [data[_prefix],]
                data[_prefix].append(_value)
            else:
                data[_prefix] = _value
        nodes.remove((sub, prefix))
        for s in sub.getchildren():
            _prefix = (prefix + '.' + sub.tag).strip('.')
            nodes.append((s, _prefix))
    if not nodes: break
You can use recursion here. One way is to build up the paths progressively as you recurse through the XML document, and return a result dictionary at the end, which can be serialized to JSON.
The below demo uses the standard library xml.etree.ElementTree for parsing XML documents.
Demo:
from xml.etree.ElementTree import ElementTree
from pprint import pprint

# Setup XML tree for parsing
tree = ElementTree()
tree.parse("sample.xml")
root = tree.getroot()

def collect_xml_paths(root, path=None, result=None):
    """Collect XML paths into a dictionary"""
    # Use None defaults: mutable defaults would keep state between calls
    if path is None:
        path = []
    # First collect root items
    if result is None:
        result = {}
        root_id, root_value = tuple(root.attrib.items())[0]
        root_key = root.tag + "[#%s]" % root_id
        result[root_key] = [root_value]
    # Go through each child from root
    for child in root:
        # Extract text (child.text is None for empty elements)
        text = (child.text or "").strip()
        # Update path
        new_path = path[:]
        new_path.append(child.tag)
        # Create dot separated key
        key = ".".join(new_path)
        # Get child attributes
        attributes = child.attrib
        # Ensure we have attributes
        if attributes:
            # Add each attribute to result
            for k, v in attributes.items():
                attrib_key = key + "[#%s]" % k
                result.setdefault(attrib_key, []).append(v)
        # Add text if it exists
        if text:
            result.setdefault(key, []).append(text)
        # Recurse into the child, sharing the same result dictionary
        collect_xml_paths(child, new_path, result)
    # Separate single values from list values
    return {k: v[0] if len(v) == 1 else v for k, v in result.items()}

pprint(collect_xml_paths(root))
Output:
{'Genres.Genre': ['Comedy', 'TV-Show'],
'Genres.Genre[#FacebookID]': ['6003161475030', '6003172932634'],
'Item[#ID]': '288917',
'Main.Platform': 'iTunes',
'Main.PlatformID': '353736518',
'Products.Product.Offers.Offer.Currency': ['CAD', 'CAD', 'EUR', 'EUR'],
'Products.Product.Offers.Offer.Price': ['3.49', '2.49', '2.49', '1.99'],
'Products.Product.Offers.Offer[#Type]': ['HDBUY', 'SDBUY', 'HDBUY', 'SDBUY'],
'Products.Product.Rating': 'Tout public',
'Products.Product.URL': ['https://itunes.apple.com/ca/tv-season/id353187108?i=353736518',
'https://itunes.apple.com/fr/tv-season/id353187108?i=353736518'],
'Products.Product[#Country]': ['CA', 'FR']}
If you want to serialize this dictionary to JSON, you can use json.dumps():
from json import dumps
print(dumps(collect_xml_paths(root)))
# {"Item[#ID]": "288917", "Main.Platform": "iTunes", "Main.PlatformID": "353736518", "Genres.Genre[#FacebookID]": ["6003161475030", "6003172932634"], "Genres.Genre": ["Comedy", "TV-Show"], "Products.Product[#Country]": ["CA", "FR"], "Products.Product.URL": ["https://itunes.apple.com/ca/tv-season/id353187108?i=353736518", "https://itunes.apple.com/fr/tv-season/id353187108?i=353736518"], "Products.Product.Offers.Offer[#Type]": ["HDBUY", "SDBUY", "HDBUY", "SDBUY"], "Products.Product.Offers.Offer.Price": ["3.49", "2.49", "2.49", "1.99"], "Products.Product.Offers.Offer.Currency": ["CAD", "CAD", "EUR", "EUR"], "Products.Product.Rating": "Tout public"}

Convert XML into dictionary

I need to convert an XML file into a dictionary (later on it will be converted into JSON).
A sample of the XML looks like:
<?xml version="1.0" encoding="UTF-8"?>
<osm version="0.6" generator="Overpass API 0.7.55.3 9da5e7ae">
<note>The data included in this document is from www.openstreetmap.org. The data is made available under ODbL.</note>
<meta osm_base="2018-06-17T15:31:02Z"/>
...
<node id="2188497873" lat="52.5053306" lon="13.4360114">
<tag k="alt_name" v="Spreebalkon"/>
<tag k="name" v="Brommybalkon"/>
<tag k="tourism" v="viewpoint"/>
<tag k="wheelchair" v="yes"/>
</node>
...
</osm>
With the simple code I have already filtered all the values that I needed for my dictionary:
Code
import xml.etree.ElementTree as ET
input_file = r"D:\berlin\trial_xml\berlin_viewpoint_locations.xml"
tree = ET.parse(input_file)
root = tree.getroot()
lst1 = tree.findall("./node")
for item1 in lst1:
    print('id:',item1.get('id'))
    print('lat:',item1.get('lat'))
    print('lon:',item1.get('lon'))
    for item1_tags_and_nd in item1.iter('tag'):
        print(item1_tags_and_nd.get('k') + ":", item1_tags_and_nd.get('v'))
Result
id: 2188497873
lat: 52.5053306
lon: 13.4360114
alt_name: Spreebalkon
name: Brommybalkon
tourism: viewpoint
wheelchair: yes
Can you please help me append these values to a dictionary properly and efficiently?
I want it to look like:
{'id': '2188497873', 'lat': 52.5053306, 'lon': 13.4360114, 'alt_name': 'Spreebalkon', 'name': 'Brommybalkon', 'tourism': 'viewpoint', 'wheelchair': 'yes'}
I have tried with
dictionary = {}
dictionary['id'] = []
dictionary['lat'] = []
dictionary['lon'] = []
lst1 = tree.findall("./node")
for item1 in lst1:
    dictionary['id'].append(item1.get('id'))
    dictionary['lat'].append(item1.get('lat'))
    dictionary['lon'].append(item1.get('lon'))
    for item1_tags_and_nd in item1.iter('tag'):
        dictionary[item1_tags_and_nd.get('k')] = item1_tags_and_nd.get('v')
but it does not work so far.
I suggest you construct a list of dicts, instead of a dict of lists like:
result_list = []
for item in tree.findall("./node"):
    dictionary = {}
    dictionary['id'] = item.get('id')
    dictionary['lat'] = item.get('lat')
    dictionary['lon'] = item.get('lon')
    result_list.append(dictionary)
Or as a couple of comprehensions like:
result_list = [{k: item.get(k) for k in ('id', 'lat', 'lon')}
               for item in tree.findall("./node")]
And for the nested case:
result_list = [{k: (item.get(k) if k != 'tags' else
                    {i.get('k'): i.get('v') for i in item.iter('tag')})
                for k in ('id', 'lat', 'lon', 'tags')}
               for item in tree.findall("./node")]
Results:
{
    'id': '2188497873',
    'lat': '52.5053306',
    'lon': '13.4360114',
    'tags': {
        'alt_name': 'Spreebalkon',
        'name': 'Brommybalkon',
        'tourism': 'viewpoint',
        'wheelchair': 'yes'
    }
}
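The question asks for a flat dictionary with lat and lon as numbers, whereas the nested variant keeps the tags in a sub-dictionary and all values as strings. A minimal sketch of the flat form follows, assuming that no tag key collides with 'id', 'lat' or 'lon' (a colliding tag would overwrite the attribute). The helper name node_to_flat_dict and the inline SAMPLE are illustrative only.

```python
import xml.etree.ElementTree as ET

# Hypothetical inline sample based on the node shown in the question
SAMPLE = """<osm version="0.6">
  <node id="2188497873" lat="52.5053306" lon="13.4360114">
    <tag k="alt_name" v="Spreebalkon"/>
    <tag k="name" v="Brommybalkon"/>
    <tag k="tourism" v="viewpoint"/>
    <tag k="wheelchair" v="yes"/>
  </node>
</osm>"""


def node_to_flat_dict(node):
    """Merge a node's id/lat/lon attributes and its k/v tags into one dict."""
    flat = {
        "id": node.get("id"),
        "lat": float(node.get("lat")),  # coordinates as numbers
        "lon": float(node.get("lon")),
    }
    # Each <tag k="..." v="..."/> child becomes one key/value pair
    flat.update({tag.get("k"): tag.get("v") for tag in node.iter("tag")})
    return flat


root = ET.fromstring(SAMPLE)
result_list = [node_to_flat_dict(node) for node in root.findall("./node")]
print(result_list)
```

The result is still a list with one dictionary per node, so it can be passed to json.dumps() directly.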

getting value from xml in a dict

I have an XML file that I want to parse. In the file I have 3 unique moid tags -
<moid>3</moid>
<moid>2</moid>
<moid>1</moid>
Each of these has one unique value for metricX. I want to extract these values in the form of a dict in Python.
Something like
Desired Output
{ 3 : {"metricX":100}, 2 : {"metricX":11}, 1 : {"metricX":44}}
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="MeasDataCollection.xsl"?>
<!DOCTYPE mdc SYSTEM "MeasDataCollection.dtd">
<mdc xmlns:HTML="http://www.w3.org/TR/REC-xml">
<md>
<neid>
<neun>936001_STURGEON_BAY_MEYER</neun>
</neid>
<mi>
<mi>
<mts>20170924161500Z</mts>
<gp>900</gp>
<mt>metricX</mt>
<mv>
<moid>3</moid>
<r>100</r>
</mv>
<mv>
<moid>2</moid>
<r>11</r>
</mv>
<mv>
<moid>1</moid>
<r>44</r>
</mv>
</mi>
</mi>
</md>
</mdc>
So far I have tried using Element Tree.
import os
import xml.etree.ElementTree as ET
fullpath = os.getcwd()
os.chdir(r"C:\Users\sss\Documents\Zabbix_work\xml_parsing")
tree = ET.ElementTree(file='smaple.xml')
for elem in tree.iter():
    print (elem.tag, elem.text)
Output so far is -
mdc
md
neid
neun 936001_STURGEON_BAY_MEYER
mi
mi
mts 20170924161500Z
gp 900
mt metricX
mv
moid 3
r 100
mv
moid 2
r 11
mv
moid 1
r 44
Not so sure now how to organize it further in form of a dict.
This should do the trick:
import xml.etree.ElementTree as ET
import os
file_path = os.path.expanduser('~/Desktop/input123.xml') # filepath here
tree = ET.ElementTree(file=file_path)
my_dict = {}
for node in tree.getroot().find('md').find('mi').find('mi').findall('mv'):
    my_dict[int(node.find('moid').text)] = { 'metricX': int(node.find('r').text) }
print(my_dict)
...output:
{3: {'metricX': 100}, 2: {'metricX': 11}, 1: {'metricX': 44}}
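The answer above hard-codes both the element path and the metric name. As a hedged variation, the metric name can be read from the mt element instead, which would also cover files that contain more than one mi block with a different metric. This is a sketch under the assumption that every mi block holding mv entries has exactly one mt child; the function name metrics_by_moid and the trimmed SAMPLE (without the DOCTYPE and stylesheet lines) are hypothetical.

```python
import xml.etree.ElementTree as ET

# Trimmed, hypothetical sample following the structure in the question
SAMPLE = """<mdc>
  <md>
    <neid><neun>936001_STURGEON_BAY_MEYER</neun></neid>
    <mi>
      <mi>
        <mts>20170924161500Z</mts>
        <gp>900</gp>
        <mt>metricX</mt>
        <mv><moid>3</moid><r>100</r></mv>
        <mv><moid>2</moid><r>11</r></mv>
        <mv><moid>1</moid><r>44</r></mv>
      </mi>
    </mi>
  </md>
</mdc>"""


def metrics_by_moid(root):
    """Return {moid: {metric_name: value}} for every <mi> that names a metric."""
    result = {}
    for mi in root.iter("mi"):
        metric = mi.findtext("mt")  # direct child only
        if metric is None:
            continue  # wrapper <mi> without its own <mt>
        for mv in mi.findall("mv"):
            moid = int(mv.findtext("moid"))
            result.setdefault(moid, {})[metric] = int(mv.findtext("r"))
    return result


print(metrics_by_moid(ET.fromstring(SAMPLE)))
```

Using root.iter("mi") avoids the chained find() calls, which raise AttributeError as soon as one level of the path is missing.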
