I want to parse xml like this:
<?xml version="1.0" ?>
<matches>
<round_1>
<match_1>
<home_team>team_5</home_team>
<away_team>team_13</away_team>
<home_goals_time>None</home_goals_time>
<away_goals_time>24;37</away_goals_time>
<home_age_average>27.4</home_age_average>
<away_age_average>28.3</away_age_average>
<score>0:2</score>
<ball_possession>46:54</ball_possession>
<shots>8:19</shots>
<shots_on_target>2:6</shots_on_target>
<shots_off_target>5:10</shots_off_target>
<blocked_shots>1:3</blocked_shots>
<corner_kicks>3:4</corner_kicks>
<fouls>10:12</fouls>
<offsides>0:0</offsides>
</match_1>
</round_1>
</matches>
I'm using the standard library's xml package, but I can't get the values from the inner tags. Here is my example code:
import xml.etree.ElementTree as et
TEAMS_STREAM = "data/stats1.xml"
tree = et.parse(TEAMS_STREAM)
root = tree.getroot()
for elem in root.iter('home_goals_time'):
    print(elem.attrib)
It should work, but it doesn't. I tried to find a problem in the XML structure but couldn't find one. I always get an empty dict. Can you tell me what's wrong?
You are calling .attrib on the element, but those elements have no attributes. If you want to print the inner text of the element, use .text instead of .attrib:
for elem in root.iter('home_goals_time'):
    print(elem.text)
The elements you want are nested several levels deep, so the search has to descend through the tree. Using findall with a './/' path, I was able to get the value inside <home_goals_time>:
for i in root.findall('.//home_goals_time'):
    print(i.text)
None
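To make the difference between .attrib and .text concrete, here is a minimal sketch using a trimmed copy of the XML from the question, inlined as a string instead of being read from data/stats1.xml. The goal_minutes helper is hypothetical, and it assumes that goal times are separated by ';' and that the literal text "None" means no goals.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the question's XML, inlined so the sketch is self-contained.
xml_data = """<matches>
  <round_1>
    <match_1>
      <home_team>team_5</home_team>
      <away_team>team_13</away_team>
      <home_goals_time>None</home_goals_time>
      <away_goals_time>24;37</away_goals_time>
    </match_1>
  </round_1>
</matches>"""

root = ET.fromstring(xml_data)

elem = next(root.iter("away_goals_time"))
attributes = elem.attrib  # empty dict: the tag has no attributes
text = elem.text          # the string between the opening and closing tags


def goal_minutes(raw):
    """Hypothetical helper: turn '24;37' into [24, 37] and 'None' into []."""
    if raw is None or raw == "None":
        return []
    return [int(part) for part in raw.split(";")]


away_minutes = goal_minutes(text)
home_minutes = goal_minutes(root.find(".//home_goals_time").text)
print(attributes, text, away_minutes, home_minutes)
```

Note that the element text is always a string (or None for an empty element), so the "None" in the file arrives as the four-character string, not as Python's None.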
In my case, I have to find a few elements in an XML file and update their values through the text attribute. To do that, I have to search for the XML elements A, B and C. My project uses xml.etree and Python. Currently I am using:
self.get_root.findall("H/A/T")
self.get_root.findall("H/B/T")
self.get_root.findall("H/C/T")
The sample XML file:
<H><A><T>text-i-have-to-update</H></A></T>
<H><B><T>text-i-have-to-update</H></B></T>
<H><C><T>text-i-have-to-update</H></C></T>
As we can notice, only the middle element in the path is different. Is there a way to optimize the code using something like self.get_root.findall(H|(A,B,C)|T)? Any guidance in the right direction will do! Thanks!
I went through the similar question: XPath to select multiple tags but it didn't work for my case
Update: maybe regular expression inside the findall()?
The XML in your question is malformed; assuming it's properly formatted (like below), try this:
import xml.etree.ElementTree as ET
data = """<root>
<H><A><T>text-i-have-to-update</T></A></H>
<H><B><T>text-i-have-to-update</T></B></H>
<H><C><T>text-i-have-to-update</T></C></H>
</root>"""
doc = ET.fromstring(data)
for item in doc.findall('.//H//T'):
    item.text = "modified text"
print(ET.tostring(doc).decode())
Output:
<root>
<H><A><T>modified text</T></A></H>
<H><B><T>modified text</T></B></H>
<H><C><T>modified text</T></C></H>
</root>
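The './/H//T' path updates every T under every H, whatever the middle tag is. If only A, B and C should be touched, one possible sketch is to select the middle elements with a wildcard and filter on the tag name in Python, since ElementTree's path syntax has no "or" operator. The extra <D> entry is a hypothetical addition to show that other middle tags are left alone.

```python
import xml.etree.ElementTree as ET

data = """<root>
<H><A><T>text-i-have-to-update</T></A></H>
<H><B><T>text-i-have-to-update</T></B></H>
<H><C><T>text-i-have-to-update</T></C></H>
<H><D><T>leave-me-alone</T></D></H>
</root>"""

doc = ET.fromstring(data)
wanted = {"A", "B", "C"}  # middle tags whose <T> child should be updated

# "H/*" selects every direct child of each <H>;
# the tag test below does the "A or B or C" part.
for middle in doc.findall("H/*"):
    if middle.tag in wanted:
        for t in middle.findall("T"):
            t.text = "modified text"

texts = [t.text for t in doc.iter("T")]
print(texts)
```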
I want to find the last, deepest XML tag dynamically. I found some other questions, but they all offer a fixed way to find it. I always want to append new elements to the last tag as I go.
from xml.etree.ElementTree import Element
root = Element('soap:Envelope', {"xmlns:soap": "http://www.w3.org/2003/05/soap-envelope", "xmlns:aut": "Automidia"})
sub_elementos = [Element("soap:Body"),
                 Element("information", {"token": "ABC"}),
                 Element("data"),
                 Element("value")]
for elemento in sub_elementos:
    list(root.iter())[-1].append(elemento)  # This is the way I've found
I saw in the xml ElementTree documentation that there is a findall() method that supports XPath to navigate through XML easily. I want to know how I can use it to find the last element with the last() function, instead of list(root.iter())[-1] as in my code above. In my opinion, that expression reduces code readability. Any ideas on how I could achieve this?
This is my final output:
<soap:Envelope xmlns:soap="http://www.w3.org/2003/05/soap-envelope" xmlns:aut="Automidia">
<soap:Body>
<information token="ABC">
<data>
<value/>
</data>
</information>
</soap:Body>
</soap:Envelope>
Something like this:
import xml.etree.ElementTree as ET
tree_elements = {'body':{}, 'info':{'token':'ABC'}, 'data':{}, 'value':{}}
tree = ET.Element('root')
root = tree
for ele, ele_attrs in tree_elements.items():
    root = ET.SubElement(root, ele)
    root.attrib = ele_attrs
ET.dump(tree)
Output:
<root><body><info token="ABC"><data><value /></data></info></body></root>
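The same idea can be applied to the tag names from the question. The sketch below keeps a reference to the most recently created element instead of searching for it on every step. The nest helper is hypothetical, and the prefixed names such as 'soap:Body' are used as plain tag strings exactly as in the question, not as properly registered namespaces.

```python
import xml.etree.ElementTree as ET


def nest(parent, specs):
    """Append each (tag, attributes) pair inside the previous one.

    Returns the deepest element that was created.
    """
    current = parent
    for tag, attrs in specs:
        current = ET.SubElement(current, tag, attrs)
    return current


root = ET.Element("soap:Envelope", {
    "xmlns:soap": "http://www.w3.org/2003/05/soap-envelope",
    "xmlns:aut": "Automidia",
})
deepest = nest(root, [
    ("soap:Body", {}),
    ("information", {"token": "ABC"}),
    ("data", {}),
    ("value", {}),
])

print(deepest.tag)
print(ET.tostring(root, encoding="unicode"))
```

Because nest returns the deepest element, later additions can continue from it without walking the tree again.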
Given the xml
xmlstr = '''
<myxml>
<Description id="10">
<child info="myurl"/>
</Description>
</myxml>'''
I'd like to get the id of Description only where child has an attribute of info.
import xml.etree.ElementTree as ET
root = ET.fromstring(xmlstr)
a = root.find(".//Description/[child/@info]")
print(a.attrib)
and changing the find to .//Description/[child[@info]]
both return an error of:
SyntaxError: invalid predicate
I know that etree only supports a subset of xpath, but this doesn't seem particularly weird - should this work? If so, what have I done wrong?!
Changing the find to .//Description/[child] does work, and returns
{'id': '10'}
as expected
You've definitely hit ElementTree's limited XPath support. Looking at the source directly (the 3.7 source code), we can see that, while parsing the ElementPath expression, only these predicate forms are considered:
[@attribute]
[@attribute='value']
[tag]
[.='value'] or [tag='value']
[index] or [last()] or [last()-index]
Which means that both of your rather simple expressions are not supported.
If you really want or need to stick with the built-in ElementTree library, one way to solve this is to find all Description tags via .findall() and keep the ones that have a child element with an info attribute.
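The filtering approach described above can be sketched as follows. It combines two steps that ElementTree does support: findall for the Description elements, then a per-element find with the [@attribute] predicate. The second Description (id 11) is a hypothetical addition to show a non-matching case.

```python
import xml.etree.ElementTree as ET

xmlstr = """
<myxml>
    <Description id="10">
        <child info="myurl"/>
    </Description>
    <Description id="11">
        <child/>
    </Description>
</myxml>
"""

root = ET.fromstring(xmlstr)

# Keep only the Description elements with a direct <child> that has an info attribute.
matches = [
    desc for desc in root.findall(".//Description")
    if desc.find("child[@info]") is not None
]
ids = [desc.get("id") for desc in matches]
print(ids)
```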
You can also read the attributes by key, which is a slightly more structured way to gather the data:
import xml.etree.ElementTree as ET
root = ET.fromstring(xmlstr)
wht = root.find(".//Description")
wht.keys()  # --> ['id']
wht.get('id')  # --> '10'
I am trying to parse XML and am having a hard time. I don't understand why the result keeps printing [<Element 'Results' at 0x105fc6110>]
I am trying to extract Social from my example with the following code:
import xml.etree.ElementTree as ET
root = ET.parse("test.xml")
results = root.findall("Results")
print(results)  # [<Element 'Results' at 0x105fc6110>]
# WHAT IS THIS??
for result in results:
    print(result.find("Social"))  # None
the XML looks like this:
<?xml version="1.0"?>
<List1>
<NextOffset>AAA</NextOffset>
<Results>
<R>
<D>internet.com</D>
<META>
<Social>
<v>http://twitter.com/internet</v>
<v>http://facebook.com/internet</v>
</Social>
<Telephones>
<v>+1-555-555-6767</v>
</Telephones>
</META>
</R>
</Results>
</List1>
findall returns a list of xml.etree.ElementTree.Element objects. In your case there is only one Results node, so you could use find to get the first (and only) match.
Once you have it, call find with the .// syntax, which searches anywhere in the subtree, not only directly under Results.
Once you have found Social, call findall on the v tag and print the text:
import xml.etree.ElementTree as ET
root = ET.parse("test.xml")
result = root.find("Results")
social = result.find(".//Social")
for r in social.findall("v"):
    print(r.text)
results in:
http://twitter.com/internet
http://facebook.com/internet
Note that I did not perform any validity checks on the XML file. You should check whether the find method returns None and handle the error accordingly.
I'm not an expert on the XML format myself; everything I know about parsing it comes from following this lxml tutorial.
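The None check mentioned above could look like the sketch below. The XML is a trimmed copy of the question's file, inlined as a string, and the social_links helper is hypothetical. Returning an empty list is only one possible way to handle a missing element; raising an exception would be equally reasonable.

```python
import xml.etree.ElementTree as ET

xml_data = """<List1>
  <NextOffset>AAA</NextOffset>
  <Results>
    <R>
      <D>internet.com</D>
      <META>
        <Social>
          <v>http://twitter.com/internet</v>
          <v>http://facebook.com/internet</v>
        </Social>
      </META>
    </R>
  </Results>
</List1>"""


def social_links(root):
    """Return the text of every <v> under the first <Social>, or [] if anything is missing."""
    results = root.find("Results")
    if results is None:
        return []
    social = results.find(".//Social")
    if social is None:
        return []
    return [v.text for v in social.findall("v")]


links = social_links(ET.fromstring(xml_data))
missing = social_links(ET.fromstring("<List1><Results/></List1>"))
print(links, missing)
```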
results = root.findall("Results") is a list of xml.etree.ElementTree.Element objects.
type(results)
# list
type(results[0])
# xml.etree.ElementTree.Element
With a bare tag name, find and findall only search direct children. The iter method iterates over matching descendants at any level.
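A small sketch of that difference, using a reduced version of the question's XML inlined as a string:

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    "<List1><Results><R><META><Social>"
    "<v>http://twitter.com/internet</v>"
    "</Social></META></R></Results></List1>"
)
results = root.find("Results")

direct = results.find("Social")          # bare tag name: direct children only
anywhere = results.find(".//Social")     # './/' searches all descendants
via_iter = list(results.iter("Social"))  # iter() also walks the whole subtree

print(direct is None, anywhere is not None, len(via_iter))
```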
Option 1
If <Results> could potentially have more than one <Social> element, you could use this:
for result in results:
    for soc in result.iter("Social"):
        for link in soc.iter("v"):
            print(link.text)
That's worst case scenario. If you know there'll be one <Social> per <Results> then it simplifies to:
for soc in root.iter("Social"):
    for link in soc.iter("v"):
        print(link.text)
Both print:
http://twitter.com/internet
http://facebook.com/internet
Option 2
Or use nested list comprehensions and do it with one line of code. Because Python...
socialLinks = [[v.text for v in soc] for soc in root.iter("Social")]
# socialLinks == [['http://twitter.com/internet', 'http://facebook.com/internet']]
socialLinks is a list of lists. The outer list has one entry per <Social> element (only one in this example). Each inner list contains the text of the v elements within that particular <Social> element.
I'm using ElementTree and I can get tags and attributes, but not the actual content between the tags.
from this XML:
<tag_name attrib="1">I WANT THIS INFO HERE</tag_name>
here's my python code:
import urllib.request
import xml.etree.ElementTree as ET
XML = urllib.request.urlopen("http://URL/file.xml")
Tree = ET.parse(XML)
for node in Tree.iter():
    print(node.tag, node.attrib)
This prints most of the XML file, and I understand what 'tag' and 'attrib' are, but how do I get the 'Content'? I tried looking through ElementTree's docs, but I think this might be too basic of a question.
The .text attribute should give you the required text value.
for node in Tree.iter():
    print(node.tag, node.attrib, node.text)
Did you try XPath?
There are a lot of libraries that extract content from tags with a very easy yet powerful syntax.
Here's an example:
import XmlXPathSelector
xs = XmlXPathSelector(text="<tags>your xml</tags>")
print(xs.select("//tag_name[@attrib='1']/text()").extract())
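For comparison, the standard library's ElementTree also accepts the [@attrib='value'] predicate in its limited XPath subset, so a similar lookup is possible without a third-party package. This is a minimal sketch; the wrapping <root> element and the second tag_name entry are hypothetical additions around the tag from the question.

```python
import xml.etree.ElementTree as ET

# Hypothetical wrapper document around the tag from the question.
xml_data = """<root>
  <tag_name attrib="1">I WANT THIS INFO HERE</tag_name>
  <tag_name attrib="2">not this one</tag_name>
</root>"""

root = ET.fromstring(xml_data)

# ElementTree has no text() step, so select the elements and read .text in Python.
matches = [node.text for node in root.findall(".//tag_name[@attrib='1']")]
print(matches)
```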