Parsing XML with Python and etree

I want to extract all the way elements that contain a tag with the key 'highway' and a specific value from the following example Open Street Map XML file:
<?xml version="1.0" encoding="UTF-8"?>
<osm version="0.6" generator="CGImap 0.0.2">
<bounds minlat="54.0889580" minlon="12.2487570" maxlat="54.0913900" maxlon="12.2524800"/>
<node id="298884272" lat="54.0901447" lon="12.2516513" user="SvenHRO" uid="46882" visible="true" version="1" changeset="676636" timestamp="2008-09-21T21:37:45Z"/>
<way id="26659127" user="Masch" uid="55988" visible="true" version="5" changeset="4142606" timestamp="2010-03-16T11:47:08Z">
<nd ref="292403538"/>
<nd ref="298884289"/>
<nd ref="261728686"/>
<tag k="highway" v="unclassified"/>
<tag k="name" v="Pastower Straße"/>
</way>
<relation id="56688" user="kmvar" uid="56190" visible="true" version="28" changeset="6947637" timestamp="2011-01-12T14:23:49Z">
<member type="node" ref="294942404" role=""/>
...
<member type="node" ref="364933006" role=""/>
<member type="way" ref="4579143" role=""/>
...
<member type="node" ref="249673494" role=""/>
<tag k="name" v="Küstenbus Linie 123"/>
<tag k="network" v="VVW"/>
<tag k="operator" v="Regionalverkehr Küste"/>
<tag k="ref" v="123"/>
<tag k="route" v="bus"/>
<tag k="type" v="route"/>
</relation>
</osm>
To do this, I wrote the following Python code that uses the ElementTree library. It parses the XML document and uses the findall function (with XPath syntax):
import xml.etree.ElementTree as ET
supported_highways = ('motorway', 'trunk', 'primary', 'secondary', 'tertiary', 'unclassified', 'residential', 'highway_link', 'trunk_link', 'primary_link', 'secondary_link', 'tertiary_link')
class OSMParser:
    def __init__(self, inputData):
        self.root = ET.fromstring(inputData)
    def getRoads(self):
        ways = dict()
        for road in self.root.findall('./way/'):
            highway_tags = road.findall("./tag[@k='highway']")
            if not highway_tags:
                continue
            if all(highway.attrib['v'] not in supported_highways for highway in highway_tags):
                continue
However, when I run the code, it does not find the tag child of the way element (the second findall returns an empty list). Any idea what's wrong? Thank you.

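Try the path .//way (or ./way, without the trailing slash) instead of ./way/. With xml.etree.ElementTree, a path used on an element must be relative, and a trailing slash makes findall return the children of way rather than the way elements themselves.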
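A minimal sketch of why the second findall comes back empty, assuming the standard-library xml.etree.ElementTree: a path that ends in `/` is treated as if it ended in `/*`, so `./way/` yields the children of each way (the nd and tag elements) rather than the way elements. The sample below is a trimmed version of the OSM file from the question; the second way (id="1", footway), the shortened SUPPORTED_HIGHWAYS set, and the helper name find_supported_ways are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Trimmed OSM sample; the second <way> is a hypothetical addition.
OSM_SAMPLE = """<osm version="0.6">
  <way id="26659127">
    <nd ref="292403538"/>
    <tag k="highway" v="unclassified"/>
    <tag k="name" v="Pastower Strasse"/>
  </way>
  <way id="1">
    <tag k="highway" v="footway"/>
  </way>
</osm>"""

# Shortened list of supported values, for illustration only.
SUPPORTED_HIGHWAYS = {"motorway", "primary", "unclassified", "residential"}


def find_supported_ways(root):
    """Return the way elements that carry a highway tag with a supported value."""
    matches = []
    for way in root.findall("./way"):  # no trailing slash
        highway_tags = way.findall("./tag[@k='highway']")
        if any(tag.get("v") in SUPPORTED_HIGHWAYS for tag in highway_tags):
            matches.append(way)
    return matches


root = ET.fromstring(OSM_SAMPLE)

# With the trailing slash, the children of each <way> come back instead.
children = [elem.tag for elem in root.findall("./way/")]
way_ids = [way.get("id") for way in find_supported_ways(root)]

print(children)
print(way_ids)
```

Here children contains only nd and tag elements, which is why a further search for tag children under each of them finds nothing.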

It's working.
>>> root.findall("./way/tag[@k='highway']")
[<Element 'tag' at 0xb74568ac>]
I think that in your input the way element is not a direct child of the root element.
Alternatively, use lxml.etree:
>>> import lxml.etree as ET1
>>> root = ET1.fromstring(content)
>>> root.xpath("//way/tag[@k='highway']")
[<Element tag at 0xb745642c>]

Related

Parsing XML with namespace into dictionary

I'm having a hard time following the xml.etree.ElementTree documentation with regard to parsing an XML document with a namespace and nested tags.
To begin, the xml tree I am trying to parse looks like:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<ROOT-MAIN xmlns="http://fakeurl.com/page">
<Alarm> <--- I don't care about these types of objects
<Node>
<location>Texas></location>
<name>John</name>
</Node>
</Alarm>
<Alarm> <--- I care about these types of objects
<CreateTime>01/01/2011</CreateTime>
<Story>
<Node>
<Name>Ethan</Name>
<Address category="residential">
<address>1421 Morning SE</address>
</Address>
</Node>
</Story>
<Build>
<Action category="build_value_1">Build was successful</Action>
</Build>
<OtherData type="string" meaning="favoriteTVShow">Purple</OtherData>
<OtherData type="string" meaning="favoriteColor">Seinfeld</OtherData>
</Alarm>
</ROOT-MAIN>
I am trying to build an array of dictionaries that have a similar structure to the second <Alarm> object. When parsing this XML file, I do the following:
import xml.etree.ElementTree as ET
tree = ET.parse('data/'+filename)
root = tree.getroot()
namespace = '{http://fakeurl.com/page}'
for alarm in tree.findall(namespace+'Alarm'):
    for elem in alarm.iter():
        try:
            creation_time = elem.find(namespace+'CreateTime')
            for story in elem.findall(namespace+'Story'):
                for node in story.findall(namespace+'Node'):
                    for Address in node.findall(namespace+'Address'):
                        address = Address.find(namespace+'address').text
            for build in elem.findall(namespace+'Build'):
                category = build.find(namespace+'Action').attrib
                action = build.find(namespace+'Action').text
            for otherdata in elem.findall(namespace+'OtherData'):
                # not sure how to get the 'meaning' attribute value as well as the text value for these <OtherData> tags
                pass
        except:
            pass
Right now I'm just trying to get values for:
<address>
<Action> (attribute value and text value)
<OtherData> (attribute value and text value)
I'm sort of able to do this with for-loops nested within for-loops, but I was hoping for a cleaner XPath solution, which I haven't figured out how to write with a namespace.
Any suggestions would be much appreciated.

Here is one way. It collects a subset of the elements you mentioned; add more code to collect the rest.
import xml.etree.ElementTree as ET
import re
xmlstring = '''<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<root xmlns="http://fakeurl.com/page">
<Alarm>
<Node>
<location>Texas></location>
<name>John</name>
</Node>
</Alarm>
<Alarm>
<CreateTime>01/01/2011</CreateTime>
<Story>
<Node>
<Name>Ethan</Name>
<Address category="residential">
<address>1421 Morning SE</address>
</Address>
</Node>
</Story>
<Build>
<Action category="build_value_1">Build was successful</Action>
</Build>
<OtherData type="string" meaning="favoriteTVShow">Purple</OtherData>
<OtherData type="string" meaning="favoriteColor">Seinfeld</OtherData>
</Alarm>
</root>'''
xmlstring = re.sub(' xmlns="[^"]+"', '', xmlstring, count=1)
root = ET.fromstring(xmlstring)
alarms = root.findall('Alarm')
alarms_list = []
for alarm in alarms:
    create_time = alarm.find('CreateTime')
    if create_time is not None:
        entry = {'create_time': create_time.text}
        alarms_list.append(entry)
        actions = alarm.findall('Build/Action')
        if actions:
            entry['builds'] = []
            for action in actions:
                entry['builds'].append({'category': action.attrib['category'], 'status': action.text})
print(alarms_list)
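As an alternative to stripping the xmlns attribute with a regular expression, here is a hedged sketch that keeps the namespace and passes a prefix map to find and findall. It also collects the address and the OtherData values, which the question asked about. The prefix "p", the function name parse_alarms, and the dictionary keys are arbitrary choices for illustration, and the sample XML is a cleaned-up, well-formed version of the one in the question.

```python
import xml.etree.ElementTree as ET

# Well-formed version of the sample document from the question.
XML = """<ROOT-MAIN xmlns="http://fakeurl.com/page">
  <Alarm>
    <Node><location>Texas</location><name>John</name></Node>
  </Alarm>
  <Alarm>
    <CreateTime>01/01/2011</CreateTime>
    <Story>
      <Node>
        <Name>Ethan</Name>
        <Address category="residential">
          <address>1421 Morning SE</address>
        </Address>
      </Node>
    </Story>
    <Build>
      <Action category="build_value_1">Build was successful</Action>
    </Build>
    <OtherData type="string" meaning="favoriteTVShow">Purple</OtherData>
    <OtherData type="string" meaning="favoriteColor">Seinfeld</OtherData>
  </Alarm>
</ROOT-MAIN>"""

# "p" is an arbitrary prefix; only the URI has to match the document.
NS = {"p": "http://fakeurl.com/page"}


def parse_alarms(xml_text):
    """Build one dictionary per Alarm that has a CreateTime child."""
    root = ET.fromstring(xml_text)
    alarms = []
    for alarm in root.findall("p:Alarm", NS):
        create_time = alarm.find("p:CreateTime", NS)
        if create_time is None:
            continue  # skip the alarms we do not care about
        entry = {
            "create_time": create_time.text,
            "addresses": [
                a.text
                for a in alarm.findall("p:Story/p:Node/p:Address/p:address", NS)
            ],
            "builds": [
                {"category": a.get("category"), "status": a.text}
                for a in alarm.findall("p:Build/p:Action", NS)
            ],
            # Map each 'meaning' attribute to the element's text.
            "other_data": {
                o.get("meaning"): o.text
                for o in alarm.findall("p:OtherData", NS)
            },
        }
        alarms.append(entry)
    return alarms


alarms = parse_alarms(XML)
print(alarms)
```

Attribute values come from get() and text values from .text, so the same pattern should extend to any other child element of Alarm.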

lxml error : lxml.etree.XPathEvalError: Invalid expression with descendant

I'm new to both Python and lxml, and I can't understand this error. Below is my XML text.
<node id="n25::n1">
<data key="d5" xml:space="preserve"><![CDATA[ronin_sanity]]></data>
<data key="d6">
<ShapeNode>
<Geometry height="86.25" width="182.0" x="3164.9136178770227" y="1045.403736953325"/>
<Fill color="#C0C0C0" transparent="false"/>
<BorderStyle color="#000000" raised="false" type="line" width="1.0"/>
<NodeLabel alignment="center" autoSizePolicy="content" fontFamily="Dialog" fontSize="12" fontStyle="plain" hasBackgroundColor="false" hasLineColor="false" height="18.701171875" horizontalTextPosition="center" iconTextGap="4" modelName="internal" modelPosition="c" textColor="#000000" verticalTextPosition="bottom" visible="true" width="83.376953125" x="49.3115234375" xml:space="preserve" y="33.7744140625">Messages App</NodeLabel>
<Shape type="ellipse"/>
</ShapeNode>
</data>
</node>
This is my XPath query. I want to find the Fill element whose color attribute is "#C0C0C0".
etree.xpath(/node/descendant::Fill[@color='#C0C0C0'])

You can simply use a proper XPath expression to find the element, as shown below:
In [1]: import lxml.etree as ET
In [2]: cat myxml.xml
<node id="n25::n1">
<data key="d5" xml:space="preserve"><![CDATA[ronin_sanity]]></data>
<data key="d6">
<ShapeNode>
<Geometry height="86.25" width="182.0" x="3164.9136178770227" y="1045.403736953325"/>
<Fill color="#C0C0C0" transparent="false"/>
<BorderStyle color="#000000" raised="false" type="line" width="1.0"/>
<NodeLabel alignment="center" autoSizePolicy="content" fontFamily="Dialog" fontSize="12" fontStyle="plain" hasBackgroundColor="false" hasLineColor="false" height="18.701171875" horizontalTextPosition="center" iconTextGap="4" modelName="internal" modelPosition="c" textColor="#000000" verticalTextPosition="bottom" visible="true" width="83.376953125" x="49.3115234375" xml:space="preserve" y="33.7744140625">Messages App</NodeLabel>
<Shape type="ellipse"/>
</ShapeNode>
</data>
</node>
In [3]: tree = ET.parse('myxml.xml')
In [4]: root = tree.getroot()
In [5]: elem = root.xpath('//Fill[@color="#C0C0C0"]')
In [6]: elem
Out[6]: [<Element Fill at 0x7efe04280098>]
If no element matches, you get an empty list as output:
In [7]: elem = root.xpath('//Fill[@color="#C0C0C0ABC"]')
In [8]: elem
Out[8]: []
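The same lookup can also be sketched with the standard-library xml.etree.ElementTree, assuming lxml is not available. ElementTree has no xpath method and does not support the descendant:: axis, but findall accepts the .// shorthand with an attribute predicate. The XML below is a shortened copy of the sample above.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the sample node from the question.
GRAPH_NODE = """<node id="n25::n1">
  <data key="d6">
    <ShapeNode>
      <Fill color="#C0C0C0" transparent="false"/>
      <BorderStyle color="#000000" raised="false" type="line" width="1.0"/>
    </ShapeNode>
  </data>
</node>"""

root = ET.fromstring(GRAPH_NODE)

# .// searches all descendants; [@color='...'] filters on the attribute value.
fills = root.findall(".//Fill[@color='#C0C0C0']")
missing = root.findall(".//Fill[@color='#C0C0C0ABC']")

print(len(fills), len(missing))
```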

Python xpath with xml.etree.ElementTree: multiple conditions

I am trying to count from an XML file all the XML nodes of the form:
....
<node id="0">
<data key="d0">Attribute</data>
....
</node>
....
For example a file like this:
<graphml xmlns="http://graphml.graphdrawing.org/xmlns" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://graphml.graphdrawing.org/xmlns http://graphml.graphdrawing.org/xmlns/1.0/graphml.xsd">
<graph edgedefault="directed">
<node id="0">
<data key="d0">Attribute</data>
<data key="d1">Foo</data>
</node>
What I have tried is:
x = graphml_root.findall(".//"+nsfy("node")+"/["+nsfy("data")+"='Attribute']")
But this only says that the text of a data child has to be "Attribute". I want to make sure that "Attribute" is the text of the data element with key="d0", so I tried this:
x = graphml_root.findall(".//"+nsfy("node")+"/"+nsfy("data")+"[@key='d0']"+"[""'Attribute']")
But it returns an empty list, so I am missing something.
NOTE:
I had to write a little lambda to avoid copying the XML namespace all the time:
nsfy = lambda x : '{http://graphml.graphdrawing.org/xmlns}'+x #to be able to read namespace tags
Thanks.

Try doing something like:
nodes = []
containers = graphml_root.findall('.//node/data[@key="d0"]')
for container in containers:
    if container.text == "Attribute":
        nodes.append(container)
count = len(nodes)

from lxml import etree
f= '''
<node id="0">
<data key="d0" t="32">Attribute</data>
<data key="d1">Foo</data>
</node>'''
root = etree.XML(f)
data = root.xpath('.//*[@key="d0" and text()="Attribute"]')
print(data)
lxml provides the xpath method, which handles this query directly.
UPDATE
According to the xml.etree documentation, it does not support this syntax; see the list of XPath features supported by xml.etree.
So with xml.etree, all you can do is find .//*[@key="d0"] and then test whether its text equals "Attribute".
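A minimal sketch of that find-then-filter approach with the standard library, counting node elements rather than data elements and handling the GraphML namespace through a prefix map. The prefix "g", the function name count_attribute_nodes, and the second node in the sample are assumptions added for illustration.

```python
import xml.etree.ElementTree as ET

# Sample based on the question; the second <node> is a hypothetical
# counter-example whose "Attribute" text sits under a different key.
GRAPHML = """<graphml xmlns="http://graphml.graphdrawing.org/xmlns">
  <graph edgedefault="directed">
    <node id="0">
      <data key="d0">Attribute</data>
      <data key="d1">Foo</data>
    </node>
    <node id="1">
      <data key="d0">Other</data>
      <data key="d1">Attribute</data>
    </node>
  </graph>
</graphml>"""

# "g" is an arbitrary prefix bound to the GraphML namespace URI.
NS = {"g": "http://graphml.graphdrawing.org/xmlns"}


def count_attribute_nodes(root, key="d0", text="Attribute"):
    """Count node elements that have a data child with the given key and text."""
    count = 0
    for node in root.findall(".//g:node", NS):
        candidates = node.findall(f"g:data[@key='{key}']", NS)
        # The text comparison happens in Python, not in the path expression.
        if any(data.text == text for data in candidates):
            count += 1
    return count


root = ET.fromstring(GRAPHML)
count = count_attribute_nodes(root)
print(count)
```

Only the first node is counted, because the second one carries "Attribute" under key="d1".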

python : keep a word and drop other words between 2 tags

I'm new to Python.
I want my code to open a file in GXL format (which is like XML), read its lines, and find each <string> tag. Then it should keep the first word and delete the other words up to the </string> tag. How can I do this? Here is an example:
This:
<string>dummyMainClass void dummyMainMethod(java.lang.String[])</string>
becomes this:
<string>dummyMainClass</string>
P.S. In my file, the first words between the string tags are not all the same.
Here is an example of one of my files:
https://gist.github.com/anonymous/61c1afd751214a0473fd62ee74a3b1d6

The following example reads the GXL file as an XML file:
import xml.etree.ElementTree as ET
tree = ET.parse("original_file.gxl")
for node in tree.findall("./graph/node/attr/string"):
    # Take the text of <string>, split it on spaces, and keep the first word
    node.text = node.text.split(" ")[0]
# Write the modified contents to a new file
tree.write("modified_file.gxl")
This is based on the information in this answer, and assumes the following GXL file structure from the provided sample:
<?xml version="1.0" encoding="iso-8859-1"?>
<gxl>
<graph id="ExtendedCallGraph" edgeids="true" edgemode="directed">
<node id="N_0">
<attr name="name">
<string>dummyMainClass void dummyMainMethod(java.lang.String[])</string>
</attr>
</node>
<edge from="N_0" to="N_1" isdirected="true" id="N_0--N_1">
</edge>
<edge from="N_0" to="N_2" isdirected="true" id="N_0--N_2">
</edge>
<node id="N_442">
<attr name="name">
<string>java.util.AbstractList void init()</string>
</attr>
</node>
<edge from="N_442" to="N_89" isdirected="true" id="N_442--N_89">
</edge>
<edge from="N_442" to="N_443" isdirected="true" id="N_442--N_443">
</edge>
<node id="N_443">
<attr name="name">
<string>java.util.AbstractCollection void init()</string>
</attr>
</node>
<edge from="N_443" to="N_88" isdirected="true" id="N_443--N_88">
</edge>
<edge from="N_443" to="N_89" isdirected="true" id="N_443--N_89">
</edge>
</graph>
</gxl>
Edit: Modified code to write-back to the file as specified in comments
Edit2: Added a sample portion of your example and modified the path
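A slightly more defensive variant, sketched on an in-memory string instead of a file: it uses split() with no argument so tabs or repeated spaces are handled, and it tolerates a string element with no text. The function name keep_first_word and the empty second node are hypothetical additions for illustration.

```python
import xml.etree.ElementTree as ET

# Shortened GXL sample; node N_1 with an empty <string> is hypothetical.
GXL = """<gxl>
  <graph id="ExtendedCallGraph">
    <node id="N_0">
      <attr name="name">
        <string>dummyMainClass void dummyMainMethod(java.lang.String[])</string>
      </attr>
    </node>
    <node id="N_1">
      <attr name="name">
        <string></string>
      </attr>
    </node>
  </graph>
</gxl>"""


def keep_first_word(root):
    """Reduce the text of every <string> element to its first word."""
    for string_elem in root.iter("string"):
        # .text is None for an empty element, so fall back to "".
        words = (string_elem.text or "").split()
        string_elem.text = words[0] if words else ""
    return root


root = keep_first_word(ET.fromstring(GXL))
texts = [elem.text for elem in root.iter("string")]
print(texts)
```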

etree xml parsing and deletion

How can I delete all the entries for server1, including its tags? I tried to use the etree remove function, but it's not helping.
<hosts>
<host instances="" name="*" roles="alpha">
<tags/>
</host>
<host instances="" name="server1" id="alpha,beta">
<tags>
<tag app-id="1" instance="1" name="alpha"/>
<tag app-id="2" instance="2" name="beta"/>
</tags>
</host>
<host instances="" name="server2" id="beta,gama">
<tags>
<tag app-id="1" instance="1" name="beta"/>
<tag app-id="2" instance="2" name="gama"/>
</tags>
</host>
</hosts>
def main1(file=outfile):
    tree = et.parse(file)
    root = tree.getroot()
    thingy = root.find('hosts')
    for thing in thingy:
        if "server1" in thing.get('name'):
            root.remove(thing)
            #thingy.remove(thing)
    print(thingy)

You need the parent element in order to remove one of its children from the HTML/XML tree.
Use the getparent() method to get the parent, and then its remove() method to remove the child element.
Demo:
>>> import lxml.etree as PARSER
>>> root = PARSER.fromstring(data)
>>> root.xpath("//hosts/host[@name='server1']")
[<Element host at 0xb6d2ce6c>]
>>> a = root.xpath("//hosts/host[@name='server1']")
>>> for i in a:
...     pp = i.getparent()
...     pp.remove(i)
...
>>> PARSER.tostring(root, method="xml")
A. find returns None for the following code, because hosts is the root element itself, not one of its children:
>>> thingy = root.find('hosts')
>>> thingy
This should be thingy = root.findall('host'), which returns the list of host elements to loop over.
B. Use the xpath method to get the target element.
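For the standard-library xml.etree.ElementTree, which has no getparent(), a minimal sketch is to start from the known parent (here the root hosts element) and remove the matching children from it. The function name remove_host is made up for illustration, and the XML is a shortened copy of the sample in the question.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the hosts file from the question.
HOSTS = """<hosts>
  <host instances="" name="*" roles="alpha"><tags/></host>
  <host instances="" name="server1" id="alpha,beta">
    <tags><tag app-id="1" instance="1" name="alpha"/></tags>
  </host>
  <host instances="" name="server2" id="beta,gama">
    <tags><tag app-id="1" instance="1" name="beta"/></tags>
  </host>
</hosts>"""


def remove_host(root, name):
    """Remove every direct <host> child of root whose name attribute matches."""
    # findall returns a list, so removing children while looping is safe.
    for host in root.findall("host"):
        if host.get("name") == name:
            root.remove(host)  # also drops the nested <tags> subtree
    return root


root = remove_host(ET.fromstring(HOSTS), "server1")
remaining = [host.get("name") for host in root.findall("host")]
print(remaining)
```

Because remove() is called on the element that directly contains the host, the whole subtree, including its tags, goes with it.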
