Null control with XML SAX getValues() - python

I'm trying to parse the following xml file:
<?xml version="1.0" encoding="UTF-8"?>
<gexf xmlns="http://www.gexf.net/1.2draft" version="1.2">
<graph mode="static" defaultedgetype="directed">
<nodes>
<node id="0" label="Hello" />
<node id="1" label="Word" />
<node id="2" />
</nodes>
<edges>
<edge id="0" source="0" target="1" />
<edge id="1" source="1" target="2" weight="2.0" />
</edges>
</graph>
</gexf>
As can be seen some edges have weights, some do not.
My code is like the following:
elif name == "edge":
u = attrs.getValue("source")
v = attrs.getValue("target")
w = attrs.getValue("weight")
if w is not None:
self.edgeweight = w
Here I expect w to be None on the first line and 2.0 on the second line of the XML file. Instead all I get is an error. What's the proper way to control this?

get() method did the trick.
w = attrs.get("weight")
if w is not None:
self.weighted = True
self.edgeweight = float(w)

Try the following.
if attrs.hasKey("weight"):
w = attrs.getValue("weight")
self.edgeweight = w
I used this as reference. It doesn't specify if you can use "weight" in attrs, but you can try and see if it works.

Related

How do you properly fetch from this nested XML?

I have the following XML:
<?xml version="1.0" encoding="UTF-8"?>
<data>
<columns>
<Leftover index="5">Leftover</Leftover>
<NODE5 index="6"></NODE5>
<NODE6 index="7"></NODE6>
<NODE8 index="9"></NODE8>
<Nomenk__Nr_ index="2">Nomenk.
Nr.</Nomenk__Nr_>
<Year index="8">2020</Year>
<Name index="1">Name</Name>
<Value_code index="3">Value code</Value_code>
</columns>
<records>
<record index="1">
<Leftover>Leftover</Leftover>
<NODE5>Test1</NODE5>
<NODE6>Test2</NODE6>
<NODE8>Test3</NODE8>
<Nomenk__Nr_></Nomenk__Nr_>
<Name></Name>
<Value_code></Value_code>
</record>
... (it repeats itself with different values and the index value increments)
My code is:
import lxml
import lxml.etree as et
xml = open('C:\outputfile.xml', 'rb')
xml_content = xml.read()
tree = et.fromstring(xml_content)
for bad in tree.xpath("//records[#index=\'*\']/NODE5"):
bad.getparent().remove(bad) # here I grab the parent of the element to call the remove directly on it
result = (et.tostring(tree, pretty_print=True, xml_declaration=True))
f = open( 'outputxml.xml', 'w' )
f.write( str(result) )
f.close()
What I need to do is to remove the NODE5, NODE6, NODE8. I tried using a wildcard and then specifying one of the nodes (see line 6) but that seems to not have worked... I'm also getting a syntax error right after the loop on the first character but the code executes.
My problem is also that the encoding by lxml is set to ASCII afterwards when the file is "exported".
UPDATE
I am getting this error on line 8:
return = ...
^
SyntaxError: invalid syntax
I took some code from https://stackoverflow.com/a/7981894/1987598
What I need to do is to remove the NODE5, NODE6, NODE8.
below
import xml.etree.ElementTree as ET
xml = '''<?xml version="1.0" encoding="UTF-8"?>
<data>
<columns>
<Leftover index="5">Leftover</Leftover>
<NODE5 index="6" />
<NODE6 index="7" />
<NODE8 index="9" />
<Nomenk__Nr_ index="2">Nomenk.
Nr.</Nomenk__Nr_>
<Year index="8">2020</Year>
<Name index="1">Name</Name>
<Value_code index="3">Value code</Value_code>
</columns>
<records>
<record index="1">
<Leftover>Leftover</Leftover>
<NODE5>Test1</NODE5>
<NODE6>Test2</NODE6>
<NODE8>Test3</NODE8>
<Nomenk__Nr_ />
<Name />
<Value_code />
</record>
<record index="21">
<Leftover>Leftover</Leftover>
<NODE5>Test11</NODE5>
<NODE6>Test21</NODE6>
<NODE8>Test39</NODE8>
<Nomenk__Nr_ />
<Name />
<Value_code />
</record>
</records>
</data>'''
root = ET.fromstring(xml)
col = root.find('./columns')
for x in ['5','6','8']:
nodes_to_remove = col.findall('./NODE{}'.format(x))
for node in nodes_to_remove:
col.remove(node)
records = root.find('./records')
records_lst = records.findall('./record'.format(x))
for r in records_lst:
for x in ['5','6','8']:
nodes_to_remove = r.findall('./NODE{}'.format(x))
for node in nodes_to_remove:
r.remove(node)
ET.dump(root)
output
<data>
<columns>
<Leftover index="5">Leftover</Leftover>
<Nomenk__Nr_ index="2">Nomenk.
Nr.</Nomenk__Nr_>
<Year index="8">2020</Year>
<Name index="1">Name</Name>
<Value_code index="3">Value code</Value_code>
</columns>
<records>
<record index="1">
<Leftover>Leftover</Leftover>
<Nomenk__Nr_ />
<Name />
<Value_code />
</record>
<record index="2">
<Leftover>Leftover</Leftover>
<Nomenk__Nr_ />
<Name />
<Value_code />
</record>
</records>
</data>

Python XML get immediate child elements only

I have an xml file as below:
<?xml version="1.0" encoding="utf-8"?>
<EDoc CID="1000101" Cname="somename" IName="iname" CSource="e1" Version="1.0">
<RIGLIST>
<RIG RIGID="100001" RIGName="RgName1">
<ListID>
<nodeA nodeAID="1000011" nodeAName="node1A" nodeAExtID="9000011" />
<nodeA nodeAID="1000012" nodeAName="node2A" nodeAExtID="9000012" />
<nodeA nodeAID="1000013" nodeAName="node3A" nodeAExtID="9000013" />
<nodeA nodeAID="1000014" nodeAName="node4A" nodeAExtID="9000014" />
<nodeA nodeAID="1000015" nodeAName="node5A" nodeAExtID="9000015" />
<nodeA nodeAID="1000016" nodeAName="node6A" nodeAExtID="9000016" />
<nodeA nodeAID="1000017" nodeAName="node7A" nodeAExtID="9000017" />
</ListID>
</RIG>
<RIG RIGID="100002" RIGName="RgName2">
<ListID>
<nodeA nodeAID="1000021" nodeAName="node1B" nodeAExtID="9000021" />
<nodeA nodeAID="1000022" nodeAName="node2B" nodeAExtID="9000022" />
<nodeA nodeAID="1000023" nodeAName="node3B" nodeAExtID="9000023" />
</ListID>
</RIG>
</RIGLIST>
</EDoc>
I need to search for the Node value RIGName and if match is found print out all the values of nodeAName
Example:
Searching for RIGName = "RgName2" should print all the values as node1B, node2B, node3B
As of now I am only able to get the first part as below:
import xml.etree.ElementTree as eT
import re
xmlfilePath = "Path of xml file"
tree = eT.parse(xmlfilePath)
root = tree.getroot()
for elem in root.iter("RIGName"):
# print(elem.tag, elem.attrib)
if re.findall(searchtxt, elem.attrib['RIGName'], re.IGNORECASE):
print(elem.attrib)
count += 1
How can I get only the immediate child node values?
Switching from xml.etree to lxml would give you a way to do it in a single go because of a much better XPath query language support:
In [1]: from lxml import etree as ET
In [2]: tree = ET.parse('input.xml')
In [3]: root = tree.getroot()
In [4]: root.xpath('//RIG[#RIGName = "RgName2"]/ListID/nodeA/#nodeAName')
Out[4]: ['node1B', 'node2B', 'node3B']

Python xpath with xml.etree.ElementTree: multiple conditions

I am trying to count from an XML file all the XML nodes of the form:
....
<node id="0">
<data key="d0">Attribute</data>
....
</node>
....
For example a file like this:
<graphml xmlns="http://graphml.graphdrawing.org/xmlns" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://graphml.graphdrawing.org/xmlns http://graphml.graphdrawing.org/xmlns/1.0/graphml.xsd">
<graph edgedefault="directed">
<node id="0">
<data key="d0">Attribute</data>
<data key="d1">Foo</data>
</node>
What I have tried is:
x = graphml_root.findall(".//"+nsfy("node")+"/["+nsfy("data")+"='Attribute']")
Butt his only says that the text of the XML has to be "Attribute", I want to make sure that "Attribute" is the text of the node with key="d0", so I tried this:
x = graphml_root.findall(".//"+nsfy("node")+"/"+nsfy("data")+"[#key='d0']"+"[""'Attribute']")
But it returns an empty list, so I am missing something.
NOTE:
I had to write a little lambda to avoid copying the xmlnamespace all teh time:
nsfy = lambda x : '{http://graphml.graphdrawing.org/xmlns}'+x #to be able to read namespace tags
Thanks.
Try doing something like:
nodes = []
containers = graphml_root.findall('.//node/data[#key="d0"]')
for container in containers:
if container.text == "Attribute":
nodes.append(container)
count = len(nodes)
from lxml import etree
f= '''
<node id="0">
<data key="d0" t="32">Attribute</data>
<data key="d1">Foo</data>
</node>'''
root = etree.XML(f)
data = root.xpath('.//*[#key="d0" and text()="Attribute"]')
print(data)
lxml provide the xpath method.and it's done.
UPDATE
read the DOC of xml.etree,it don't supported this syntax.the xpath supported by xml.etree
So,only you can do is find .//*[#key="d0"]then test it's text equal to "Attribute".

python : keep a word and drop other words between 2 tags

I'm new to Python.
I want my code to open a file in gxl format (like xml format), and read lines and find < string> tag. After that, keep the first word and delete the other words until it meets < /string> tag. How can i do this? Here is an example:
This:
< string>dummyMainClass void dummyMainMethod(java.lang.String[])< /string>
becomes this:
< string>dummyMainClass< /string>
p.s: in my file, all the first words in between string tags are not the same.
here is an example of one of my files:
https://gist.github.com/anonymous/61c1afd751214a0473fd62ee74a3b1d6
The following example will allow you to read from the GXL file like a XML file:
import xml.etree.ElementTree as ET
tree = ET.parse("original_file.gxl")
for node in tree.findall("/graph/node/attr/string"):
# Retrieves the value of <string>, split on spaces and keep first element
node.text = node.text.split(" ")[0]
# Write the modified contents to a new file
tree.write("modified_file.gxl")
This is based from the information in this answer, and considering the following GXL file structure from the provided sample:
<?xml version="1.0" encoding="iso-8859-1"?>
<gxl>
<graph id="ExtendedCallGraph" edgeids="true" edgemode="directed">
<node id="N_0">
<attr name="name">
<string>dummyMainClass void dummyMainMethod(java.lang.String[])</string>
</attr>
</node>
<edge from="N_0" to="N_1" isdirected="true" id="N_0--N_1">
</edge>
<edge from="N_0" to="N_2" isdirected="true" id="N_0--N_2">
</edge>
<node id="N_442">
<attr name="name">
<string>java.util.AbstractList void init()</string>
</attr>
</node>
<edge from="N_442" to="N_89" isdirected="true" id="N_442--N_89">
</edge>
<edge from="N_442" to="N_443" isdirected="true" id="N_442--N_443">
</edge>
<node id="N_443">
<attr name="name">
<string>java.util.AbstractCollection void init()</string>
</attr>
</node>
<edge from="N_443" to="N_88" isdirected="true" id="N_443--N_88">
</edge>
<edge from="N_443" to="N_89" isdirected="true" id="N_443--N_89">
</edge>
</graph>
</gxl>
Edit: Modified code to write-back to the file as specified in comments
Edit2: Added a sample portion of your example and modified the path

Finding values in XML document using Python

I have following code that tries to get values from XML document:
from xml.dom import minidom
xml = """<SoccerFeed TimeStamp="20130328T152947+0000">
<SoccerDocument uID="f131897" Type="Result" />
<Competition uID="c87">
<MatchData>
<MatchInfo TimeStamp="20070812T144737+0100" Weather="Windy"Period="FullTime" MatchType="Regular" />
<MatchOfficial uID="o11068"/>
<Stat Type="match_time">91</Stat>
<TeamData TeamRef="t810" Side="Home" Score="4" />
<TeamData TeamRef="t2012" Side="Away" Score="1" />
</MatchData>
<Team uID="t810" />
<Team uID="t2012" />
<Venue uID="v2158" />
</SoccerDocument>
</SoccerFeed>"""
xmldoc = minidom.parseString(xml)
soccerfeed = xmldoc.getElementsByTagName("SoccerFeed")[0]
soccerdocument = soccerfeed.getElementsByTagName("SoccerDocument")[0]
#Match Data
MatchData = soccerdocument.getElementsByTagName("MatchData")[0]
MatchInfo = MatchData.getElementsByTagName("MatchInfo")[0]
Goal = MatchData.getElementsByTagNameNS("Side", "Score")
The Goal is being set to [], but I would like to get the score value, which is 4.
It looks like you are searching for the wrong XML node. Check following line:
Goal = MatchData.getElementsByTagNameNS("Side", "Score")
You probably are looking for following:
Goal = MatchData.getElementsByTagName("TeamData")[0].getAttribute("Score")
NOTE: Document.getElementsByTagName, Document.getElementsByTagNameNS, Element.getElementsByTagName, Element.getElementsByTagNameNS return a list of nodes, not just a scalar value.

Categories