Finding values in XML document using Python


I have the following code that tries to get values from an XML document:
from xml.dom import minidom
xml = """<SoccerFeed TimeStamp="20130328T152947+0000">
<SoccerDocument uID="f131897" Type="Result">
<Competition uID="c87" />
<MatchData>
<MatchInfo TimeStamp="20070812T144737+0100" Weather="Windy" Period="FullTime" MatchType="Regular" />
<MatchOfficial uID="o11068"/>
<Stat Type="match_time">91</Stat>
<TeamData TeamRef="t810" Side="Home" Score="4" />
<TeamData TeamRef="t2012" Side="Away" Score="1" />
</MatchData>
<Team uID="t810" />
<Team uID="t2012" />
<Venue uID="v2158" />
</SoccerDocument>
</SoccerFeed>"""
xmldoc = minidom.parseString(xml)
soccerfeed = xmldoc.getElementsByTagName("SoccerFeed")[0]
soccerdocument = soccerfeed.getElementsByTagName("SoccerDocument")[0]
#Match Data
MatchData = soccerdocument.getElementsByTagName("MatchData")[0]
MatchInfo = MatchData.getElementsByTagName("MatchInfo")[0]
Goal = MatchData.getElementsByTagNameNS("Side", "Score")
The Goal is being set to [], but I would like to get the score value, which is 4.

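You are searching for the wrong XML node. Look at the following line:
Goal = MatchData.getElementsByTagNameNS("Side", "Score")
getElementsByTagNameNS takes a namespace URI and a local tag name, but Side and Score are attributes of the TeamData element, so nothing matches and you get an empty list. You are probably looking for the following:
Goal = MatchData.getElementsByTagName("TeamData")[0].getAttribute("Score")
NOTE: Document.getElementsByTagName, Document.getElementsByTagNameNS, Element.getElementsByTagName, Element.getElementsByTagNameNS return a list of nodes, not a scalar value. getAttribute returns a string, so Goal is "4" rather than the integer 4.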
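Indexing with [0] relies on the home team being listed first. As a minimal sketch, assuming the Side attribute is what identifies each team, the scores can be collected per side instead. The XML below is a trimmed-down, well-formed version of the feed from the question, and the scores dictionary is just an illustrative name:

```python
from xml.dom import minidom

# Trimmed-down, well-formed version of the feed from the question.
xml = """<SoccerFeed>
  <SoccerDocument uID="f131897" Type="Result">
    <MatchData>
      <TeamData TeamRef="t810" Side="Home" Score="4" />
      <TeamData TeamRef="t2012" Side="Away" Score="1" />
    </MatchData>
  </SoccerDocument>
</SoccerFeed>"""

doc = minidom.parseString(xml)

# Map each side to its score; getAttribute returns strings, so convert.
scores = {
    team.getAttribute("Side"): int(team.getAttribute("Score"))
    for team in doc.getElementsByTagName("TeamData")
}

print(scores["Home"], scores["Away"])
```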

Related

XML to table form in Excel

Excel offers an option when opening an XML file: you are prompted to open it as an XML table (shown in the linked picture).
It opens the XML file as a table and, based on the checks I have done, it seems to do a pretty good job.
The second linked picture shows how an XML file looks after I opened it in Excel as a table.
My question: I want to convert an XML file into a table the way that Excel feature does. Is that possible?
The reason I want this result is that working with tables is really easy using libraries like pandas. However, I don't want to open every XML file in Excel, display the table, and then save it again; that is not very time-efficient.
This is my XML file:
<?xml version="1.0" encoding="utf-8"?>
<ProjectData>
<FINAL>
<START id="ID0001" service_code="0x5196">
<Docs Docs_type="START">
<Rational>225196</Rational>
<Qualify>6251960000A0DE</Qualify>
</Docs>
<Description num="1213f2312">The parameter</Description>
<SetFile dg="" dg_id="">
<SetData value="32" />
</SetFile>
</START>
<START id="DG0003" service_code="0x517B">
<Docs Docs_type="START">
<Rational>23423</Rational>
<Qualify>342342</Qualify>
</Docs>
<Description num="3423423f3423">The third</Description>
<SetFile dg="" dg_id="">
<FileX dg="" axis_pts="2" name="" num="" dg_id="" />
<FileY unit="" axis_pts="20" name="TOOLS" text_id="23423" unit_id="" />
<SetData x="E1" value="21259" />
<SetData x="E2" value="0" />
</SetFile>
</START>
<START id="ID0048" service_code="0x5198">
<RawData rawdata_type="OPDATA">
<Request>225198</Request>
<Response>343243324234234</Response>
</RawData>
<Meaning text_id="434234234">The forth</Meaning>
<ValueDataset unit="m" unit_id="FEDS">
<FileX dg="kg" discrete="false" axis_pts="19" name="weight" text_id="SDF3" unit_id="SDGFDS" />
<SetData xin="sdf" xax="233" value="323" />
<SetData xin="123" xax="213" value="232" />
<SetData xin="2321" xax="232" value="23" />
</ValueDataset>
</START>
</FINAL>
</ProjectData>

So let's say I have the following input.xml file:
<main>
<item name="item1" image="a"></item>
<item name="item2" image="b"></item>
<item name="item3" image="c"></item>
<item name="item4" image="d"></item>
</main>
You can use the following code:
import xml.etree.ElementTree as ET
import pandas as pd
tree = ET.parse('input.xml')
tags = [ e.attrib for e in tree.getroot() ]
df = pd.DataFrame(tags)
# df:
#     name image
# 0  item1     a
# 1  item2     b
# 2  item3     c
# 3  item4     d
This works regardless of how many attributes the elements in a given file have.
To write a simple CSV file from pandas, use the to_csv method. If the result has to be an Excel sheet, use to_excel instead.
# Write to csv without the row names
df.to_csv('file_name.csv', index = False)
# Write to xlsx sheet without the row names
df.to_excel('file_name.xlsx', index=False)
UPDATE:
For your XML file, and based on your clarification in the comments, I suggest the following, where every element in the first level of the tree becomes a row, and every attribute or node text becomes a column:
import xml.etree.ElementTree as ET
import pandas as pd

def has_children(e):
    '''Check whether element e has children.'''
    return len(e) > 0

def has_attrib(e):
    '''Check whether element e has attributes.'''
    return len(e.attrib) > 0

def get_unique_key(mydict, key):
    '''Append '*' to key until it is not already present in mydict.'''
    while key in mydict:
        key = key + '*'
    return key

tree = ET.parse('input2.xml')
root = tree.getroot()
# Get the first level:
lvl_one = list(root)
my_list = []
for e in lvl_one:
    mydict = {}
    # Iterate over every node inside the level-one element
    for node in e.iter():
        if not has_children(node) and node.text is not None:
            unique_key = get_unique_key(mydict, node.tag)
            mydict[unique_key] = node.text
        if has_attrib(node):
            for key in node.attrib:
                unique_key = get_unique_key(mydict, key)
                mydict[unique_key] = node.attrib[key]
    my_list.append(mydict)
print(pd.DataFrame(my_list))
Notice in this code, I check if the column name exists for each key, and if it exists, I create a new column name by suffixing with '*'.
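Note that in the XML from the question the root's only child is FINAL, so iterating over the first level yields a single row. If one row per START element is wanted instead (an assumption about the intended layout), a sketch using only the standard library could look like this. The flatten helper is a hypothetical name, the XML is abbreviated from the question, and unlike the code above it skips whitespace-only text:

```python
import xml.etree.ElementTree as ET

# Abbreviated version of the XML from the question.
xml_text = """<ProjectData>
  <FINAL>
    <START id="ID0001" service_code="0x5196">
      <Docs Docs_type="START">
        <Rational>225196</Rational>
        <Qualify>6251960000A0DE</Qualify>
      </Docs>
      <Description num="1213f2312">The parameter</Description>
      <SetFile dg="" dg_id="">
        <SetData value="32" />
      </SetFile>
    </START>
    <START id="DG0003" service_code="0x517B">
      <Docs Docs_type="START">
        <Rational>23423</Rational>
        <Qualify>342342</Qualify>
      </Docs>
      <Description num="3423423f3423">The third</Description>
      <SetFile dg="" dg_id="">
        <SetData x="E1" value="21259" />
        <SetData x="E2" value="0" />
      </SetFile>
    </START>
  </FINAL>
</ProjectData>"""


def flatten(element):
    """Collect leaf text and all attributes below element into one dict."""
    row = {}

    def put(key, value):
        # Same convention as above: suffix '*' until the key is unused.
        while key in row:
            key += '*'
        row[key] = value

    for node in element.iter():
        text = (node.text or '').strip()
        if len(node) == 0 and text:
            put(node.tag, text)
        for name, value in node.attrib.items():
            put(name, value)
    return row


root = ET.fromstring(xml_text)
# One dict (row) per START element, wherever it sits in the tree.
rows = [flatten(start) for start in root.iter('START')]
for row in rows:
    print(row)
```

The resulting list of dicts can be passed to pd.DataFrame(rows) in the same way as the list built above.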

Python XML get immediate child elements only

I have an xml file as below:
<?xml version="1.0" encoding="utf-8"?>
<EDoc CID="1000101" Cname="somename" IName="iname" CSource="e1" Version="1.0">
<RIGLIST>
<RIG RIGID="100001" RIGName="RgName1">
<ListID>
<nodeA nodeAID="1000011" nodeAName="node1A" nodeAExtID="9000011" />
<nodeA nodeAID="1000012" nodeAName="node2A" nodeAExtID="9000012" />
<nodeA nodeAID="1000013" nodeAName="node3A" nodeAExtID="9000013" />
<nodeA nodeAID="1000014" nodeAName="node4A" nodeAExtID="9000014" />
<nodeA nodeAID="1000015" nodeAName="node5A" nodeAExtID="9000015" />
<nodeA nodeAID="1000016" nodeAName="node6A" nodeAExtID="9000016" />
<nodeA nodeAID="1000017" nodeAName="node7A" nodeAExtID="9000017" />
</ListID>
</RIG>
<RIG RIGID="100002" RIGName="RgName2">
<ListID>
<nodeA nodeAID="1000021" nodeAName="node1B" nodeAExtID="9000021" />
<nodeA nodeAID="1000022" nodeAName="node2B" nodeAExtID="9000022" />
<nodeA nodeAID="1000023" nodeAName="node3B" nodeAExtID="9000023" />
</ListID>
</RIG>
</RIGLIST>
</EDoc>
I need to search for a RIG element by its RIGName attribute and, if a match is found, print the nodeAName values of all its nodeA elements.
Example:
Searching for RIGName = "RgName2" should print all the values as node1B, node2B, node3B
As of now I am only able to get the first part as below:
import xml.etree.ElementTree as eT
import re

xmlfilePath = "Path of xml file"
searchtxt = "RgName2"
count = 0
tree = eT.parse(xmlfilePath)
root = tree.getroot()
# RIGName is an attribute of the RIG element, so iterate over RIG elements
for elem in root.iter("RIG"):
    # print(elem.tag, elem.attrib)
    if re.findall(searchtxt, elem.attrib['RIGName'], re.IGNORECASE):
        print(elem.attrib)
        count += 1
How can I get only the immediate child node values?

Switching from xml.etree to lxml would give you a way to do it in a single go because of a much better XPath query language support:
In [1]: from lxml import etree as ET
In [2]: tree = ET.parse('input.xml')
In [3]: root = tree.getroot()
In [4]: root.xpath('//RIG[@RIGName = "RgName2"]/ListID/nodeA/@nodeAName')
Out[4]: ['node1B', 'node2B', 'node3B']
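If installing lxml is not an option, the standard library's limited XPath subset is still enough for an exact match on the attribute. This is a sketch on an abbreviated copy of the XML from the question; it does not reproduce the case-insensitive regex search, and it reads the attribute in Python because ElementTree paths cannot end in an attribute selector:

```python
import xml.etree.ElementTree as ET

# Abbreviated version of the XML from the question.
xml_text = """<EDoc CID="1000101">
  <RIGLIST>
    <RIG RIGID="100001" RIGName="RgName1">
      <ListID>
        <nodeA nodeAID="1000011" nodeAName="node1A" />
        <nodeA nodeAID="1000012" nodeAName="node2A" />
      </ListID>
    </RIG>
    <RIG RIGID="100002" RIGName="RgName2">
      <ListID>
        <nodeA nodeAID="1000021" nodeAName="node1B" />
        <nodeA nodeAID="1000022" nodeAName="node2B" />
        <nodeA nodeAID="1000023" nodeAName="node3B" />
      </ListID>
    </RIG>
  </RIGLIST>
</EDoc>"""

root = ET.fromstring(xml_text)

# Select the nodeA elements under the matching RIG, then read the
# attribute from each element.
names = [
    node.get("nodeAName")
    for node in root.findall(".//RIG[@RIGName='RgName2']/ListID/nodeA")
]
print(names)
```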

python ElementTree remove issue

I have an XML file as follows:
<plugin-config>
<properties>
<property name="AZSRVC_CONNECTION" value="diamond_plugins#AZSRVC" />
<property name="DIAMOND_HOST" value="10.0.230.1" />
<property name="DIAMOND_PORT" value="3333" />
</properties>
<pack-list>
<vsme-pack id="monthly_50MB">
<campaign-list>
<campaign id="2759" type="SOB" />
<campaign id="2723" type="SUBSCRIBE" />
</campaign-list>
</vsme-pack>
<vsme-pack id="monthly_500MB">
<campaign-list>
<campaign id="3879" type="SOB" />
<campaign id="3885" type="SOB" />
<campaign id="2724" type="SUBSCRIBE" />
<campaign id="1111" type="COB" />
</campaign-list>
</vsme-pack>
</pack-list>
</plugin-config>
I am trying to run this Python script to remove a 'campaign' with a specific id:
import xml.etree.ElementTree as ET

pack_id = 'monthly_50MB'  # the pack that contains campaign 2759
tree = ET.parse('pack-assign-config.xml')
root = tree.getroot()
pack_list = root.find('pack-list')
camp_list = pack_list.find(".//vsme-pack[@id='{pack_id}']".format(pack_id=pack_id)).find('campaign-list').findall('campaign')
for camp in camp_list:
    if camp.get('id') == '2759':
        camp_list.remove(camp)
tree.write('out.xml')
The script runs, but out.xml is identical to the input file, so the element is not removed.

Issue:
findall() returns a plain Python list of the matching elements. Calling camp_list.remove(camp) only removes the element from that list; the tree itself is unchanged. To delete an element from the document, call remove() on its parent element, which here is campaign-list. The problem starts on this line, where camp_list ends up being a list rather than an element:
camp_list = pack_list.find(".//vsme-pack[@id='{pack_id}']".format(pack_id=pack_id)).find('campaign-list').findall('campaign')
Fixed code example
Here is working code that removes the nodes from the XML:
import xml.etree.ElementTree as ET

tree = ET.parse('pack-assign-config.xml')
root = tree.getroot()
# Visit every campaign-list, i.e. the parent of the campaign elements
for campaign_list in root.findall('./pack-list/vsme-pack/campaign-list'):
    # findall() returns a list, so removing while looping over it is safe
    for camp in campaign_list.findall('campaign'):
        if camp.get('id') in ('2759', '3879'):
            # remove() must be called on the parent element
            campaign_list.remove(camp)
tree.write('out.xml')
hope this helps
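To remove a campaign from one specific pack, as the question's pack_id lookup intended, the parent can be located with a single path and the child removed through it. This is a self-contained sketch on an abbreviated copy of the XML; remove_campaign is a hypothetical helper name, not part of ElementTree:

```python
import xml.etree.ElementTree as ET

# Abbreviated version of the XML from the question.
xml_text = """<plugin-config>
  <pack-list>
    <vsme-pack id="monthly_50MB">
      <campaign-list>
        <campaign id="2759" type="SOB" />
        <campaign id="2723" type="SUBSCRIBE" />
      </campaign-list>
    </vsme-pack>
  </pack-list>
</plugin-config>"""


def remove_campaign(root, pack_id, campaign_id):
    """Remove one campaign from one pack; return True if it was found."""
    campaign_list = root.find(
        "./pack-list/vsme-pack[@id='{}']/campaign-list".format(pack_id)
    )
    if campaign_list is None:
        return False
    for camp in campaign_list.findall("campaign"):
        if camp.get("id") == campaign_id:
            campaign_list.remove(camp)  # remove via the parent element
            return True
    return False


root = ET.fromstring(xml_text)
removed = remove_campaign(root, "monthly_50MB", "2759")
remaining = [c.get("id") for c in root.iter("campaign")]
print(removed, remaining)
```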

PySpark counting rows that contain string

I have multiple xml files that look something like this:
<?xml version="1.0" encoding="UTF-8"?>
<parent>
<row AcceptedAnswerId="15" AnswerCount="5"
Body="&lt;p&gt;How should I elicit prior distributions from experts when fitting a Bayesian model?&lt;/p&gt;"
CommentCount="1" CreationDate="2010-07-19T19:12:12.510" FavoriteCount="17" Id="1"
LastActivityDate="2010-09-15T21:08:26.077" OwnerUserId="8" PostTypeId="1" Score="26"
Tags="&lt;bayesian&gt;&lt;prior&gt;&lt;elicitation&gt;"
Title="Eliciting priors from experts" ViewCount="1457" />
I would like to be able to use PySpark to count the lines that DO NOT contain the string: <row
My current thought:
def startWithRow(line):
    return line.strip().startswith("<row")

sc.textFile(localpath("folder_containing_xml.gz_files")) \
    .filter(lambda x: not startWithRow(x)) \
    .count()
I have tried validating this, but even a simple line count gives results that don't make sense (I downloaded the XML file and ran wc on it, and the result did not match the count from PySpark).
Does anything about my approach above stand out as wrong/weird?

I would use the lxml library combined with Spark to count the row elements or filter something out.
from lxml import etree

def find_number_of_rows(path):
    try:
        # Treat the argument as an XML string first
        tree = etree.fromstring(path)
    except etree.XMLSyntaxError:
        # Not XML, so treat it as a path to an XML file
        tree = etree.parse(path)
    return len(tree.findall('row'))

rdd = spark.sparkContext.parallelize(paths)  # paths is a list of all your file paths
rdd.map(lambda x: find_number_of_rows(x)).collect()
For example, if you have a list of XML strings (just a toy example), you can do the following:
text = [
"""
<parent>
<row ViewCount="1457" />
<row ViewCount="1457" />
</parent>
""",
"""
<parent>
<row ViewCount="1457" />
<row ViewCount="1457" />
<row ViewCount="1457" />
</parent>
"""
]
rdd = spark.sparkContext.parallelize(text)
rdd.map(lambda x: find_number_of_rows(x)).collect()
In your case, the function has to take a path to a file instead. Then you can count or filter those rows. I don't have a full file to test on. Let me know if you need extra help!

import xml.etree.ElementTree as ET
from collections import Counter

def badRowParser(x):
    '''Return True if the line parses as a standalone XML element.'''
    try:
        ET.fromstring(x.strip().encode('utf-8'))
        return True
    except ET.ParseError:
        return False

posts = sc.textFile(localpath('folder_containing_xml.gz_files'))
# Keep only lines that mention <row, then flag the ones that fail to parse
rejected = posts.filter(lambda l: "<row" in l).map(lambda x: not badRowParser(x))
ans = rejected.collect()
Counter(ans)
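One way to sanity-check either Spark job is to count the same thing locally on a single file. The sketch below uses only the standard library and applies the same predicate as startWithRow from the question. It writes a tiny gzip file to a temporary directory first; the file name and contents are made up for the demonstration, and count_non_row_lines is a hypothetical helper name:

```python
import gzip
import os
import tempfile


def count_non_row_lines(path):
    """Count lines in a gzip-compressed text file that do not start with <row."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return sum(1 for line in handle if not line.strip().startswith("<row"))


# Made-up sample file, only for the demonstration.
sample = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    "<parent>\n"
    '  <row Id="1" Score="26" />\n'
    '  <row Id="2" Score="3" />\n'
    "</parent>\n"
)

with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "posts.xml.gz")
    with gzip.open(sample_path, "wt", encoding="utf-8") as handle:
        handle.write(sample)
    non_row = count_non_row_lines(sample_path)

print(non_row)
```

When comparing against wc, it may be worth checking that wc was run on the decompressed data; counting newlines in the compressed bytes gives numbers that are not comparable.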

Null control with XML SAX getValues()

I'm trying to parse the following xml file:
<?xml version="1.0" encoding="UTF-8"?>
<gexf xmlns="http://www.gexf.net/1.2draft" version="1.2">
<graph mode="static" defaultedgetype="directed">
<nodes>
<node id="0" label="Hello" />
<node id="1" label="Word" />
<node id="2" />
</nodes>
<edges>
<edge id="0" source="0" target="1" />
<edge id="1" source="1" target="2" weight="2.0" />
</edges>
</graph>
</gexf>
As can be seen some edges have weights, some do not.
My code is like the following:
elif name == "edge":
    u = attrs.getValue("source")
    v = attrs.getValue("target")
    w = attrs.getValue("weight")
    if w is not None:
        self.edgeweight = w
Here I expect w to be None for the first edge and 2.0 for the second edge in the XML file. Instead, all I get is an error. What's the proper way to handle this?

The get() method did the trick. getValue() raises a KeyError when the attribute is missing, whereas get() returns None:
w = attrs.get("weight")
if w is not None:
    self.weighted = True
    self.edgeweight = float(w)

Try the following:
if "weight" in attrs:
    w = attrs.getValue("weight")
    self.edgeweight = w
The attributes object passed to startElement supports the in operator, so you can test for the attribute before reading it.
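Putting the get() approach into a complete handler, a minimal sketch might look like the following. EdgeHandler and its edges list are illustrative names rather than the asker's actual class, and the GEXF document is shortened from the question:

```python
import xml.sax

# Shortened version of the GEXF file from the question.
GEXF = b"""<?xml version="1.0" encoding="UTF-8"?>
<gexf xmlns="http://www.gexf.net/1.2draft" version="1.2">
  <graph mode="static" defaultedgetype="directed">
    <edges>
      <edge id="0" source="0" target="1" />
      <edge id="1" source="1" target="2" weight="2.0" />
    </edges>
  </graph>
</gexf>"""


class EdgeHandler(xml.sax.ContentHandler):
    """Collect (source, target, weight) tuples; weight is None if absent."""

    def __init__(self):
        super().__init__()
        self.edges = []

    def startElement(self, name, attrs):
        if name == "edge":
            w = attrs.get("weight")  # None when the attribute is missing
            weight = float(w) if w is not None else None
            self.edges.append((attrs.getValue("source"),
                               attrs.getValue("target"),
                               weight))


handler = EdgeHandler()
xml.sax.parseString(GEXF, handler)
print(handler.edges)
```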
