Grab Content from XML using Python? almost there - python

I'm using ElementTree and I can get tags and attributes, but not the actual content between elements.
From this XML:
<tag_name attrib="1">I WANT THIS INFO HERE</tag_name>
Here's my Python code:
import urllib2
import xml.etree.ElementTree as ET
XML = urllib2.urlopen("http://URL/file.xml")
Tree = ET.parse(XML)
for node in Tree.getiterator():
    print node.tag, node.attrib
This prints most of the XML file, and I understand what 'tag' and 'attrib' are, but how do I get the 'Content'? I tried looking through ElementTree's docs, but I think this might be too basic of a question.

The .text attribute (it is an attribute, not a method) should give you the required text value.
for node in Tree.getiterator():
    print node.tag, node.attrib, node.text
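As a small self-contained illustration of how tag, attrib, and text differ, here is a sketch that parses the snippet from the question from a string instead of a URL. The wrapping <root> element and the variable names are assumptions made for the example. It uses iter(), because getiterator() was removed in Python 3.9.

```python
import xml.etree.ElementTree as ET

# Assumed sample document: the question's element wrapped in a hypothetical <root>.
xml_text = '<root><tag_name attrib="1">I WANT THIS INFO HERE</tag_name></root>'
root = ET.fromstring(xml_text)

# Collect (tag, attributes, text) for every element in document order.
rows = [(node.tag, node.attrib, node.text) for node in root.iter()]

for tag, attrib, text in rows:
    print(tag, attrib, text)
```

Note that text is None for an element that has no character data directly before its first child, which is the case for <root> here.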

Did you try XPath?
There are a lot of libraries that extract content from tags with a very easy yet powerful syntax.
Here is an example:
from scrapy.selector import XmlXPathSelector  # selector class from older Scrapy releases
xs = XmlXPathSelector(text="<tags>your xml</tags>")
print xs.select("//tag_name[@attrib='1']/text()").extract()
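If an extra dependency is not wanted, the standard library's ElementTree understands a limited XPath subset that includes attribute predicates. The following is only a sketch under that assumption. The sample document and the names root and matches are made up for the example, and ElementTree has no text() step, so the text is read from each matched element instead.

```python
import xml.etree.ElementTree as ET

# Assumed sample data with two elements that differ only in their attribute.
xml_text = (
    "<tags>"
    '<tag_name attrib="1">I WANT THIS INFO HERE</tag_name>'
    '<tag_name attrib="2">other</tag_name>'
    "</tags>"
)
root = ET.fromstring(xml_text)

# [@attrib='1'] keeps only elements whose attribute has that value.
matches = [el.text for el in root.findall(".//tag_name[@attrib='1']")]
print(matches)
```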

Related

can we search multiple pattern using etree findall() in xml?

In my case, I have to find a few elements in the XML file and update their values using the text attribute. For that, I have to search for XML elements A, B and C. My project uses xml.etree and Python. Currently I am using:
self.get_root.findall('H/A/T')
self.get_root.findall('H/B/T')
self.get_root.findall('H/C/T')
The sample XML file:
<H><A><T>text-i-have-to-update</H></A></T>
<H><B><T>text-i-have-to-update</H></B></T>
<H><C><T>text-i-have-to-update</H></C></T>
As we can notice, only the middle element in the path is different. Is there a way to optimize the code using something like self.get_root.findall(H|(A,B,C)|T)? Any guidance in the right direction will do! Thanks!
I went through a similar question, "XPath to select multiple tags", but it didn't work for my case.
Update: maybe a regular expression inside findall()?
The XML in your question is malformed; assuming it's properly formatted (like below), try this:
import xml.etree.ElementTree as ET
data = """<root>
<H><A><T>text-i-have-to-update</T></A></H>
<H><B><T>text-i-have-to-update</T></B></H>
<H><C><T>text-i-have-to-update</T></C></H>
</root>"""
doc = ET.fromstring(data)
for item in doc.findall('.//H//T'):
    item.text = "modified text"
print(ET.tostring(doc).decode())
Output:
<root>
<H><A><T>modified text</T></A></H>
<H><B><T>modified text</T></B></H>
<H><C><T>modified text</T></C></H>
</root>
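The './/H//T' path matches every T below any H, whatever the middle element is. If only A, B and C should be touched, one possible approach (a sketch, not the only way) is to select the middle elements with a wildcard and filter on their tag in Python, because ElementTree's XPath subset has no "or" operator. The extra <D> element and the names wanted and texts are assumptions added to show the filter working.

```python
import xml.etree.ElementTree as ET

# Assumed sample data: <D> is added only to show that it is left alone.
data = """<root>
<H><A><T>old</T></A></H>
<H><B><T>old</T></B></H>
<H><D><T>keep</T></D></H>
</root>"""

doc = ET.fromstring(data)
wanted = {"A", "B", "C"}  # middle tags whose <T> should be updated

# './H/*' selects every direct child of every <H>; the tag filter is done in Python.
for middle in doc.findall("./H/*"):
    if middle.tag in wanted:
        for t in middle.findall("T"):
            t.text = "modified text"

texts = [t.text for t in doc.iter("T")]
print(texts)
```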

Extract complete XML block using python

Is it possible to extract complete blocks of XML text from an XML file using Python? I am using ElementTree with Python to extract tags and values from XML, in order to compare 2 XML files.
But is it possible to extract the whole text of an XML block?
For example:
<stats>
<player>
<name>Luca Toni</name>
<matches>47</matches>
<goals>16</goals>
<WC>yes</WC>
</player>
<player>
<name>Alberto Gilardino</name>
<matches>57</matches>
<goals>19</goals>
<WC>yes</WC>
</player>
<player>
<name>Mario Balotelli</name>
<matches>36</matches>
<goals>14</goals>
<WC>yes</WC>
</player>
</stats>
Is it possible to extract one particular complete block (<player>), as given below, from the above XML using Python (ElementTree)?
<player>
<name>Luca Toni</name>
<matches>47</matches>
<goals>16</goals>
<WC>yes</WC>
</player>
Once you've parsed your document with etree, you can do several things:
import xml.etree.ElementTree as ET
doc = ET.parse('test.xml')
root = doc.getroot()
print(root.find("player"))  # get first player
print(root.find(".//player"))  # get first player if it's not a direct child
print(root.findall("player"))  # get all players (direct children)
print(list(root))  # get direct children (getchildren() was removed in Python 3.9)
Getting an element as a string is just:
text = ET.tostring(root.find("player"))
print(text)
EDIT: Note that to compare elements, this is not necessarily the best method.
See here for another option.
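To pull out one particular block rather than the first one, the find and tostring calls can be combined with a check on the player's name. This is a minimal sketch with a shortened copy of the sample data; the helper name player_block is hypothetical and not part of ElementTree.

```python
import xml.etree.ElementTree as ET

# Shortened, assumed version of the <stats> document from the question.
xml_text = """<stats>
<player><name>Luca Toni</name><goals>16</goals></player>
<player><name>Alberto Gilardino</name><goals>19</goals></player>
</stats>"""

root = ET.fromstring(xml_text)


def player_block(root, name):
    """Return the serialized <player> element whose <name> equals name, or None."""
    for player in root.findall("player"):
        if player.findtext("name") == name:
            # encoding="unicode" makes tostring return str instead of bytes.
            return ET.tostring(player, encoding="unicode")
    return None


block = player_block(root, "Luca Toni")
print(block)
```

The returned string also contains the whitespace that follows the closing tag in the source (the element's tail), so strip() may be useful before comparing blocks.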
I found that lxml was the best option to extract the complete text between two XML tags.
from lxml import etree
node1 = etree.parse("azzurri.xml")
e1 = node1.xpath(".//player")
for ele1 in e1:
    pl = ele1.xpath(".//name")
    for pl1 in pl:
        if pl1.text == "Luca Toni":
            # encoding='unicode' makes tostring return text rather than bytes
            rl1 = ele1.text + ''.join(etree.tostring(child, encoding='unicode') for child in ele1).strip()
            print(rl1)
Output:
<name>Luca Toni</name>
<matches>47</matches>
<goals>16</goals>
<WC>yes</WC>

How to get values from this XML?

I want to parse xml like this:
<?xml version="1.0" ?>
<matches>
<round_1>
<match_1>
<home_team>team_5</home_team>
<away_team>team_13</away_team>
<home_goals_time>None</home_goals_time>
<away_goals_time>24;37</away_goals_time>
<home_age_average>27.4</home_age_average>
<away_age_average>28.3</away_age_average>
<score>0:2</score>
<ball_possession>46:54</ball_possession>
<shots>8:19</shots>
<shots_on_target>2:6</shots_on_target>
<shots_off_target>5:10</shots_off_target>
<blocked_shots>1:3</blocked_shots>
<corner_kicks>3:4</corner_kicks>
<fouls>10:12</fouls>
<offsides>0:0</offsides>
</match_1>
</round_1>
</matches>
I use the standard library xml package, but I can't get values from the inner tags. Here is my example code:
import xml.etree.ElementTree as et
TEAMS_STREAM = "data/stats1.xml"
tree = et.parse(TEAMS_STREAM)
root = tree.getroot()
for elem in root.iter('home_goals_time'):
    print(elem.attrib)
It should work, but it doesn't. I tried to find an issue in the XML structure but couldn't find one. I always get an empty dict. Can you tell me what's wrong?
You are calling .attrib on the element, but there are no attributes for those elements. If you want to print the inner text of the element, use .text instead of .attrib:
for elem in root.iter('home_goals_time'):
    print(elem.text)
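To make the difference concrete, here is a sketch that reads a shortened copy of the match data into a dict keyed by tag name and then splits two of the values. The shortened XML, the dict layout, and the names stats, away_goal_minutes, home_goals and away_goals are assumptions for illustration only.

```python
import xml.etree.ElementTree as ET

# Shortened, assumed copy of the match data from the question.
xml_text = """<matches><round_1><match_1>
<home_team>team_5</home_team>
<away_team>team_13</away_team>
<home_goals_time>None</home_goals_time>
<away_goals_time>24;37</away_goals_time>
<score>0:2</score>
</match_1></round_1></matches>"""

root = ET.fromstring(xml_text)
match = root.find("./round_1/match_1")

# The values live in .text; .attrib is empty because no tag carries attributes.
stats = {child.tag: child.text for child in match}

# Values such as "24;37" and "0:2" are plain strings and need splitting by hand.
away_goal_minutes = [int(m) for m in stats["away_goals_time"].split(";")]
home_goals, away_goals = (int(n) for n in stats["score"].split(":"))

print(stats["home_team"], stats["away_team"], home_goals, away_goals, away_goal_minutes)
```

Note that <home_goals_time> contains the literal string "None", not Python's None, so it has to be checked for before any conversion.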
You can also search with findall and a './/' path, which looks at every level below the root. Using it, I was able to get the value inside <home_goals_time>.
for i in root.findall('.//home_goals_time'):
    print(i.text)
Output:
None

Python minidom: How to access an element

I'm working on parsing an XML-Sheet in Python. The XML has a structure like this:
<layer1>
<layer2>
<element>
<info1></info1>
</element>
<element>
<info1></info1>
</element>
<element>
<info1></info1>
</element>
</layer2>
</layer1>
Without layer2, I have no problem accessing the data in info1. But with layer2, I'm really in trouble. There I can address info1 with: root.firstChild.childNodes[0].childNodes[0].data
So my thought was that I could do it similarly, like this: root.firstChild.firstChild.childNodes[0].childNodes[0].data
########## Solution
So this is how I solved my problem:
from xml.etree import ElementTree as ET  # cElementTree was removed in Python 3.9
tree = ET.parse("test.xml")
root = tree.getroot()
for elem in root.findall('./layer2'):
    # findall returns a list, so removing nodes inside the loop is safe
    for node in elem.findall('element'):
        x = node.find('info1').text
        if x != "abc":
            elem.remove(node)
Don't use the minidom API if you can help it. Use the ElementTree API instead; the xml.dom.minidom documentation explicitly states that:
Users who are not already proficient with the DOM should consider using the xml.etree.ElementTree module for their XML processing instead.
Here is a short sample that uses the ElementTree API to access your elements:
from xml.etree import ElementTree as ET
tree = ET.parse('inputfile.xml')
for info in tree.findall('.//element/info1'):
    print(info.text)
This uses an XPath expression to list all info1 elements that are contained inside an element element, regardless of their position in the overall XML document.
If all you need is the first info1 element, use .find():
print(tree.find('.//info1').text)
With the DOM API, .firstChild could easily be a Text node instead of an Element node; you always need to loop over the .childNodes sequence to find the first Element match:
def findFirstElement(node):
    for child in node.childNodes:
        if child.nodeType == node.ELEMENT_NODE:
            return child
But for your case, perhaps using .getElementsByTagName() suffices. It returns a list of nodes, so index into it and read the text node's data (this assumes info1 is not empty):
root.getElementsByTagName('info1')[0].firstChild.data
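The point about .firstChild being a Text node can be checked directly. This sketch assumes an indented version of the sample document with made-up values "abc" and "def" in info1; the helper element_children is hypothetical and not part of minidom.

```python
from xml.dom import minidom

# Assumed sample document: indented, with non-empty <info1> values.
xml_text = """<layer1>
  <layer2>
    <element><info1>abc</info1></element>
    <element><info1>def</info1></element>
  </layer2>
</layer1>"""

doc = minidom.parseString(xml_text)
root = doc.documentElement


def element_children(node):
    """Return only the Element children, skipping whitespace Text nodes."""
    return [c for c in node.childNodes if c.nodeType == c.ELEMENT_NODE]


# The first child of <layer1> is the whitespace before <layer2>, not <layer2> itself.
first_child_is_text = root.firstChild.nodeType == root.TEXT_NODE

layer2 = element_children(root)[0]
values = [info.firstChild.data for info in doc.getElementsByTagName("info1")]

print(first_child_is_text, layer2.tagName, values)
```

This is why chaining .firstChild through an indented document tends to fail once another layer of nesting is added.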
Does this work? (I'm not amazing at Python, just a quick thought.)
name[0].firstChild.nodeValue

How to get all strings from all nested tags of a xml tag with python's lxml.etree library?

I have an xml file in which it is possible that the following occurs:
...
<a><b>This is</b> some text about <c>some</c> issue I have, parsing xml</a>
...
Edit: Let's assume the tags could be nested more than one level deep, meaning
<a><b><c>...</c>...</b>...</a>
I came up with this using the python lxml.etree library.
context = etree.iterparse(PATH_TO_XML, dtd_validation=True, events=("end",))
for event, element in context:
    tag = element.tag
    if tag == "a":
        print(element.text)  # is empty :/
        mystring = element.xpath("string()")
        ...
But somehow it goes wrong.
What I want is the whole string
"This is some text about some issue I have, parsing xml"
But I only get an empty string. Any suggestions? Thanks!
This question has been asked many times.
You can use the text_content() method of lxml.html elements.
import lxml.html
t = lxml.html.fromstring("...")
t.text_content()
REF: Filter out HTML tags and resolve entities in python
Or use the lxml.etree.strip_tags() function.
REF: In lxml, how do I remove a tag but retain all contents?
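Another option, shown here as a sketch rather than as what the answers above proposed, is itertext(), which walks the element and all of its descendants and yields each piece of text in document order. It is available on elements in both lxml.etree and the standard library's ElementTree; the example uses the standard library so that it runs without extra packages, and the variable names are assumptions.

```python
import xml.etree.ElementTree as ET

# The sample element from the question.
xml_text = "<a><b>This is</b> some text about <c>some</c> issue I have, parsing xml</a>"
a = ET.fromstring(xml_text)

# a.text is None because the first thing inside <a> is the <b> child, not text.
# The pieces live in b.text, b.tail, c.text and c.tail; itertext() yields them in order.
full_text = "".join(a.itertext())
print(full_text)
```

This also explains the empty result in the question: element.text only covers the text before the first child element.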
