I'm working on parsing an XML-Sheet in Python. The XML has a structure like this:
<layer1>
<layer2>
<element>
<info1></info1>
</element>
<element>
<info1></info1>
</element>
<element>
<info1></info1>
</element>
</layer2>
</layer1>
Without layer2, I have no problem accessing the data in info1. There I can address info1 with: root.firstChild.childNodes[0].childNodes[0].data. But with layer2, I'm really in trouble.
So my thought was that I could do something similar, like this: root.firstChild.firstChild.childNodes[0].childNodes[0].data
########## Solution
So this is how I solved my problem:
from xml.etree import ElementTree as ET  # cElementTree was removed in Python 3.9
tree = ET.parse("test.xml")
root = tree.getroot()
for elem in root.findall('layer2'):
    # findall() returns a list, so removing nodes inside the loop is safe
    for node in elem.findall('element'):
        x = node.find('info1').text
        if x != "abc":
            elem.remove(node)
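For anyone who wants to try the removal loop without a file on disk, here is a minimal sketch. The inline SAMPLE string and its info1 values are made up for illustration; they stand in for test.xml.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data standing in for test.xml.
SAMPLE = """<layer1>
  <layer2>
    <element><info1>abc</info1></element>
    <element><info1>xyz</info1></element>
    <element><info1>abc</info1></element>
  </layer2>
</layer1>"""

root = ET.fromstring(SAMPLE)

for layer in root.findall("layer2"):
    # findall() returns a list, so removing children while looping is safe.
    for node in layer.findall("element"):
        if node.findtext("info1") != "abc":
            layer.remove(node)

# Collect what is left to confirm that only the "abc" elements survived.
remaining = [e.findtext("info1") for e in root.iter("element")]
print(remaining)
```

Note that remove() has to be called on the direct parent (layer2 here), not on the root.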
Don't use the minidom API if you can help it. Use the ElementTree API instead; the xml.dom.minidom documentation explicitly states that:
Users who are not already proficient with the DOM should consider using the xml.etree.ElementTree module for their XML processing instead.
Here is a short sample that uses the ElementTree API to access your elements:
from xml.etree import ElementTree as ET
tree = ET.parse('inputfile.xml')
for info in tree.findall('.//element/info1'):
    print(info.text)
This uses an XPath expression to list all info1 elements that are contained inside an element element, regardless of their position in the overall XML document.
If all you need is the first info1 element, use .find():
print(tree.find('.//info1').text)
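As a self-contained sketch of the same idea, applied to the layered structure from the question: the text values "first", "second" and "third" are invented here, since the original info1 elements are empty.

```python
import xml.etree.ElementTree as ET

# Same nesting as in the question; the text values are hypothetical.
SAMPLE = """<layer1>
  <layer2>
    <element><info1>first</info1></element>
    <element><info1>second</info1></element>
    <element><info1>third</info1></element>
  </layer2>
</layer1>"""

root = ET.fromstring(SAMPLE)

# './/' searches the whole subtree, so the extra layer2 level does not matter.
all_texts = [info.text for info in root.findall(".//element/info1")]

# find() returns only the first match (or None if nothing matches).
first_text = root.find(".//info1").text

# Spelling out the full path relative to the root element works as well.
explicit = [info.text for info in root.findall("layer2/element/info1")]

print(all_texts)
print(first_text)
```

Because the path is relative to the root element (layer1), the explicit form starts at layer2.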
With the DOM API, .firstChild could easily be a Text node instead of an Element node; you always need to loop over the .childNodes sequence to find the first Element match:
def findFirstElement(node):
    for child in node.childNodes:
        if child.nodeType == node.ELEMENT_NODE:
            return child
but for your case, perhaps using .getElementsByTagName() suffices:
root.getElementsByTagName('info1')[0].firstChild.data
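To make the Text-node pitfall concrete, here is a small minidom sketch. It assumes the file is indented the way the question shows it, and the info1 value "first" is made up. The helper first_element_child is just an illustrative name, not part of minidom.

```python
from xml.dom.minidom import parseString

# Indented XML: the whitespace between tags becomes Text nodes in the DOM.
SAMPLE = """<layer1>
  <layer2>
    <element>
      <info1>first</info1>
    </element>
  </layer2>
</layer1>"""

doc = parseString(SAMPLE)
root = doc.documentElement

# The first child of <layer1> is the newline and indentation, not <layer2>.
first = root.firstChild
first_is_text = first.nodeType == first.TEXT_NODE

def first_element_child(node):
    """Return the first child that is an Element, skipping Text nodes."""
    for child in node.childNodes:
        if child.nodeType == child.ELEMENT_NODE:
            return child
    return None

layer2 = first_element_child(root)
element = first_element_child(layer2)
info1 = first_element_child(element)
value = info1.firstChild.data

# getElementsByTagName() returns a list of nodes, so index or loop over it.
values = [n.firstChild.data
          for n in doc.getElementsByTagName("info1")
          if n.firstChild is not None]
print(first_is_text, value, values)
```

This is probably why chaining .firstChild.firstChild fails once the extra layer is added: each step can land on whitespace instead of an element.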
Does this work? (I'm not great at Python, this is just a quick thought.)
name[0].firstChild.nodeValue
In my case, I have to find a few elements in the XML file and update their values using the text attribute. For that, I have to search for the XML elements A, B and C. My project uses xml.etree and Python. Currently I am using:
self.get_root.findall('H/A/T')
self.get_root.findall('H/B/T')
self.get_root.findall('H/C/T')
The sample XML file:
<H><A><T>text-i-have-to-update</H></A></T>
<H><B><T>text-i-have-to-update</H></B></T>
<H><C><T>text-i-have-to-update</H></C></T>
As we can notice, only the middle element in the path is different. Is there a way to optimize the code using something like self.get_root.findall(H|(A,B,C)|T)? Any guidance in the right direction will do! Thanks!
I went through the similar question: XPath to select multiple tags but it didn't work for my case
Update: maybe regular expression inside the findall()?
The XML in your question is malformed (the closing tags are in the wrong order); assuming it's properly formed (like below), try this:
import xml.etree.ElementTree as ET
data = """<root>
<H><A><T>text-i-have-to-update</T></A></H>
<H><B><T>text-i-have-to-update</T></B></H>
<H><C><T>text-i-have-to-update</T></C></H>
</root>"""
doc = ET.fromstring(data)
for item in doc.findall('.//H//T'):
    item.text = "modified text"
print(ET.tostring(doc).decode())
Output:
<root>
<H><A><T>modified text</T></A></H>
<H><B><T>modified text</T></B></H>
<H><C><T>modified text</T></C></H>
</root>
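Note that './/H//T' updates every T below any H. ElementTree's XPath subset has no union operator and no regular expressions, so if only the A, B and C branches should be touched, one possible approach is a wildcard plus a tag filter in Python. This is only a sketch; the extra D branch and the WANTED set are added here to show the filtering.

```python
import xml.etree.ElementTree as ET

# The <D> branch is hypothetical; it shows an element that must stay untouched.
DATA = """<root>
<H><A><T>old</T></A></H>
<H><B><T>old</T></B></H>
<H><C><T>old</T></C></H>
<H><D><T>old</T></D></H>
</root>"""

root = ET.fromstring(DATA)
WANTED = {"A", "B", "C"}

# 'H/*' matches every direct child of every H; filter on the tag name in Python.
for middle in root.findall("H/*"):
    if middle.tag in WANTED:
        for t in middle.findall("T"):
            t.text = "modified text"

texts = {m.tag: m.findtext("T") for m in root.findall("H/*")}
print(texts)
```

If full XPath (including the | union operator) is needed, lxml is the usual alternative, at the cost of a third-party dependency.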
I am trying to find a way to use Python to parse data from several .xml files that contain part numbers and descriptions for a system my team is working on. Here's what the files look like:
Note: Actual data sanitized for confidentiality reasons.
<DOCUMENT>
<config>
<lruname>NFS</lruname>
<swpn>123-A-456-7890</swpn>
<swname>00 NFS ABC DEFGHI XYZ JKL</swname>
<swver>Appid: abc-defghi-xyz PN: 123-A-456-7890</swver>
</config>
</DOCUMENT>
I'd like to pull the swpn and swname values from several of these files into .csv format. My initial thought was to parse these values into a dictionary using the built-in xml.etree library, but for some reason it's not finding the elements:
import xml.etree.ElementTree as ET
data = '''
<DOCUMENT>
<config>
<lruname>NFS</lruname>
<swpn>123-A-456-7890</swpn>
<swname>00 NFS ABC DEFGHI XYZ JKL</swname>
<swver>Appid: abc-defghi-xyz PN: 123-A-456-7890</swver>
</config>
</DOCUMENT>
'''
tree = ET.fromstring(data)
PartNo = tree.find('swpn')
Desc = tree.find('swname')
print(PartNo)
The above code returns 'None' for some reason, but I would expect it to return the xml element I'm calling.
I think you're missing the config level in your XML hierarchy, you could do:
part_number = tree.find('config').find('swpn').text
part_desc = tree.find('config').find('swname').text
Alternately you can loop through all the elements if you don't want to have to know the structure and use conditionals to find the elements you care about with tree.iter.
for e in tree.iter():
    if e.tag == 'swpn':
        part_number = e.text
    if e.tag == 'swname':
        part_desc = e.text
The find() method in both ElementTree and lxml's etree searches only the direct children of the element when given a bare tag name.
You can still use it by specifying the entire branch:
tree.find('config').find('swpn')
tree.find('config/swpn')
If you always want to look for swpn, but disregard the structure (e.g. you don't know if it's going to be a child of config), you might find it easier to use the xpath() method in lxml's etree (it does not exist in the standard library's ElementTree):
from lxml import etree
tree = etree.fromstring(data)
tree.xpath('//swpn')
In this case, the // basically means that you are looking for matching elements anywhere in the tree, no matter where they are.
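For what it's worth, the standard library's ElementTree also accepts a limited XPath subset in find(), findall() and findtext(), including the './/' prefix, so lxml is not strictly required for this case. A minimal sketch using a trimmed copy of the question's data:

```python
import xml.etree.ElementTree as ET

DATA = """<DOCUMENT>
<config>
<lruname>NFS</lruname>
<swpn>123-A-456-7890</swpn>
<swname>00 NFS ABC DEFGHI XYZ JKL</swname>
</config>
</DOCUMENT>"""

root = ET.fromstring(DATA)

# A bare tag name only looks at direct children of <DOCUMENT>, hence None.
direct = root.find("swpn")

# Spelling out the path works when the structure is known.
by_path = root.findtext("config/swpn")

# './/' searches all descendants when the structure is not known.
anywhere = root.findtext(".//swpn")

print(direct, by_path, anywhere)
```

findtext() returns the element's text directly, which saves a separate .text lookup.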
If the xml files are small, and you don't care about performance, you can use minidom which IMHO is more convenient compared to lxml. In this case, your code could be something like this:
from xml.dom.minidom import parseString
xml = parseString(data)
PartNo = xml.getElementsByTagName('swpn')[0]
Desc = xml.getElementsByTagName('swname')[0]
print(PartNo.firstChild.nodeValue)
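Since the stated goal is a .csv file built from several documents, here is a rough sketch of that last step using the standard csv module. The second document, the column names and the helper extract_row are all invented for illustration; in practice the strings would come from reading the real files, and the StringIO buffer would be replaced by an open file.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Two in-memory documents standing in for the real files; the second is made up.
DOCS = [
    "<DOCUMENT><config><swpn>123-A-456-7890</swpn>"
    "<swname>00 NFS ABC DEFGHI XYZ JKL</swname></config></DOCUMENT>",
    "<DOCUMENT><config><swpn>HYPOTHETICAL-PN-2</swpn>"
    "<swname>Second hypothetical part</swname></config></DOCUMENT>",
]

def extract_row(xml_text):
    """Return [part number, description] for one document (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    # default="" keeps the row shape stable if an element is missing.
    return [root.findtext(".//swpn", default=""),
            root.findtext(".//swname", default="")]

buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["part_number", "description"])  # assumed column names
for doc in DOCS:
    writer.writerow(extract_row(doc))

csv_text = buffer.getvalue()
print(csv_text)
```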
I want to parse xml like this:
<?xml version="1.0" ?>
<matches>
<round_1>
<match_1>
<home_team>team_5</home_team>
<away_team>team_13</away_team>
<home_goals_time>None</home_goals_time>
<away_goals_time>24;37</away_goals_time>
<home_age_average>27.4</home_age_average>
<away_age_average>28.3</away_age_average>
<score>0:2</score>
<ball_possession>46:54</ball_possession>
<shots>8:19</shots>
<shots_on_target>2:6</shots_on_target>
<shots_off_target>5:10</shots_off_target>
<blocked_shots>1:3</blocked_shots>
<corner_kicks>3:4</corner_kicks>
<fouls>10:12</fouls>
<offsides>0:0</offsides>
</match_1>
</round_1>
</matches>
I use the standard library's xml package, but I can't get values from the inner tags. This is my example code:
import xml.etree.ElementTree as et
TEAMS_STREAM = "data/stats1.xml"
tree = et.parse(TEAMS_STREAM)
root = tree.getroot()
for elem in root.iter('home_goals_time'):
    print(elem.attrib)
It should work, but it doesn't. I tried to find an issue in the XML structure but couldn't find one. I always get an empty dict. Can you tell me what's wrong?
You are calling .attrib on the element, but there are no attributes for those elements. If you want to print the inner text of the element, use .text instead of .attrib
for elem in root.iter('home_goals_time'):
    print(elem.text)
The reason you're having issues is that you need to parse through the xml level by level. Using findall, I was able to get the value inside <home_goals_time>.
for i in root.findall('.//home_goals_time'):
    print(i.text)
None
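To show the difference between .attrib and .text side by side, and one way the values could be turned into numbers afterwards, here is a sketch on a shortened copy of the match data. It assumes that the literal string "None" means no goals and that goal minutes are separated by semicolons, which is only a guess from the sample; goal_minutes is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the match data from the question.
SAMPLE = """<matches>
  <round_1>
    <match_1>
      <home_team>team_5</home_team>
      <away_team>team_13</away_team>
      <home_goals_time>None</home_goals_time>
      <away_goals_time>24;37</away_goals_time>
      <score>0:2</score>
    </match_1>
  </round_1>
</matches>"""

root = ET.fromstring(SAMPLE)

elem = next(root.iter("home_goals_time"))
attributes = elem.attrib   # no attributes on the tag, so an empty dict
content = elem.text        # the text between the tags

def goal_minutes(text):
    """Split '24;37' into [24, 37]; treat the string 'None' as no goals (assumption)."""
    if text is None or text == "None":
        return []
    return [int(minute) for minute in text.split(";")]

home = goal_minutes(root.findtext(".//home_goals_time"))
away = goal_minutes(root.findtext(".//away_goals_time"))
home_goals, away_goals = (int(n) for n in root.findtext(".//score").split(":"))
print(attributes, content, home, away, home_goals, away_goals)
```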
I'm having difficulty parsing an XML tree using xml.etree.ElementTree in Python. Basically, I'm making a request to an API that gives an XML response, and trying to extract the values of several elements in the tree.
This is what I've done so far with no success:
root = etree.fromstring(resp_arr[0])
walkscore = root.find('./walkscore')
Here is my XML tree:
<result>
<status>1</status>
<walkscore>95</walkscore>
<description>walker's paradise</description>
<updated>2009-12-25 03:40:16.006257</updated>
<logo_url>https://cdn.walk.sc/images/api-logo.png</logo_url>
<more_info_icon>https://cdn.walk.sc/images/api-more-info.gif</more_info_icon>
<ws_link>http://www.walkscore.com/score/1119-8th-Avenue-Seattle-WA-98101/lat=47.6085/lng=-122.3295/?utm_source=myrealtysite.com&utm_medium=ws_api&utm_campaign=ws_api</ws_link>
<help_link>https://www.redfin.com/how-walk-score-works</help_link>
<snapped_lat>47.6085</snapped_lat>
<snapped_lon>-122.3295</snapped_lon>
</result>
Essentially, I'm trying to pull the walkscores from the XML document but my code isn't returning a value. Does anyone with experience using ElementTree have any advice to help me extract the values I'm after?
Sam
Your XML appears to be malformed: the ws_link URL contains unescaped ampersands. But if I replace instances of & with &amp;, then it's parseable:
>>> from xml.etree import ElementTree as ET
>>> tree = ET.fromstring(xml)
>>> tree.find('./walkscore').text
'95'
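If the API response cannot be fixed at the source, one possible workaround is to escape the bare ampersands before parsing. The sketch below uses a regular expression that leaves existing entities such as &amp;apos; alone. The RAW string is a shortened, made-up version of the response, and escape_bare_ampersands is a hypothetical helper, so treat this as a starting point rather than a robust sanitizer.

```python
import re
import xml.etree.ElementTree as ET

# Shortened, hypothetical response with an unescaped '&' inside the URL.
RAW = (
    "<result><walkscore>95</walkscore>"
    "<ws_link>http://www.walkscore.com/score/"
    "?utm_source=myrealtysite.com&utm_medium=ws_api</ws_link>"
    "<description>walker&apos;s paradise</description></result>"
)

# Match '&' unless it already starts a named or numeric entity reference.
BARE_AMP = re.compile(r"&(?!(?:amp|lt|gt|quot|apos|#\d+|#x[0-9A-Fa-f]+);)")

def escape_bare_ampersands(text):
    """Replace unescaped ampersands with &amp; (hypothetical helper)."""
    return BARE_AMP.sub("&amp;", text)

# The raw text is rejected by the parser because of the bare '&'.
try:
    ET.fromstring(RAW)
    parsed_raw = True
except ET.ParseError:
    parsed_raw = False

root = ET.fromstring(escape_bare_ampersands(RAW))
walkscore = root.find("./walkscore").text
link = root.findtext("ws_link")
description = root.findtext("description")
print(parsed_raw, walkscore, link, description)
```

In the original code it is also worth checking whether find() returned None before reading .text, since a parse failure and a missing element look very different.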
I'm using ElementTree and I can get tags and attributes but not that actual content between elements.
from this XML:
<tag_name attrib="1">I WANT THIS INFO HERE</tag_name>
here's my python code:
import urllib.request
import xml.etree.ElementTree as ET
XML = urllib.request.urlopen("http://URL/file.xml")
Tree = ET.parse(XML)
for node in Tree.iter():
    print(node.tag, node.attrib)
This prints most of the XML file, and I understand what 'tag' and 'attrib' are, but how do I get the 'Content'? I tried looking through ElementTree's docs, but I think this might be too basic of a question.
The .text attribute should give you the required text value.
for node in Tree.iter():
    print(node.tag, node.attrib, node.text)
Did you try XPath?
There are a lot of libraries that extract content from tags with a very easy yet powerful syntax.
Here is an example:
from scrapy.selector import XmlXPathSelector  # available in older Scrapy releases
xs = XmlXPathSelector(text="<tags>your xml</tags>")
print(xs.select("//tag_name[@attrib='1']/text()").extract())
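The same kind of attribute filter can also be written with the standard library alone, since ElementTree's XPath subset supports [@attrib='value'] predicates. It does not support /text(), so the text is read from .text instead. A minimal sketch, with a second tag_name element added for contrast:

```python
import xml.etree.ElementTree as ET

# The second <tag_name> is hypothetical; it shows that the predicate filters.
SAMPLE = """<tags>
  <tag_name attrib="1">I WANT THIS INFO HERE</tag_name>
  <tag_name attrib="2">other content</tag_name>
</tags>"""

root = ET.fromstring(SAMPLE)

# Select only the elements whose 'attrib' attribute equals '1'.
matches = [node.text for node in root.findall(".//tag_name[@attrib='1']")]

# tag, attrib and text are the three basic pieces of every element.
rows = [(node.tag, node.attrib, node.text) for node in root.iter("tag_name")]

print(matches)
print(rows)
```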