How to lookup element in AWS response with lxml (namespace issue?) - python

I'm trying to use lxml to read a response from the AWS REST API but not having any luck. I can easily parse the response and print it, but none of the find or xpath functions find anything. For example, take this document fragment:
<DistributionConfig xmlns="http://cloudfront.amazonaws.com/doc/2013-11-11/">
<CallerReference>e6d6909d-f1ed-47f1-83d9-290acf10f324</CallerReference>
<Aliases>
<Quantity>1</Quantity>
<Items>
And this code:
from lxml import etree
root = etree.XML( ... )
node = root.find( 'Quantity' )
node is always None. I've tried a variety of XPaths like //Quantity and .//Quantity, and also the xpath function, but can't find anything.
How do I use this library on this type of document?

The document declares a default namespace, so every element name is qualified by it. You need to supply that namespace in the lookup as well:
>>> root.find('.//aws:Quantity', namespaces={'aws': 'http://cloudfront.amazonaws.com/doc/2013-11-11/'})
<Element {http://cloudfront.amazonaws.com/doc/2013-11-11/}Quantity at 0xb6c16aa4>
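A minimal sketch of the same lookup, written against the standard library's xml.etree.ElementTree so it runs without extra packages (its find() accepts the same namespaces mapping as lxml's). The fragment in the question is truncated, so the closed-off document below is a hypothetical completion used only to make the example parse.

```python
import xml.etree.ElementTree as ET

NS_URI = "http://cloudfront.amazonaws.com/doc/2013-11-11/"

# Hypothetical completion of the truncated fragment from the question.
xml_doc = f"""
<DistributionConfig xmlns="{NS_URI}">
  <CallerReference>e6d6909d-f1ed-47f1-83d9-290acf10f324</CallerReference>
  <Aliases>
    <Quantity>1</Quantity>
    <Items/>
  </Aliases>
</DistributionConfig>
"""

root = ET.fromstring(xml_doc)

# Without a namespace the lookup fails, even when searching recursively.
plain = root.find(".//Quantity")

# Option 1: map a prefix of your own choosing to the namespace URI.
prefixed = root.find(".//aws:Quantity", namespaces={"aws": NS_URI})

# Option 2: spell the qualified name out in Clark notation, {uri}tag.
clark = root.find(".//{%s}Quantity" % NS_URI)

print(plain, prefixed.text, clark.text)
```

The prefix ("aws" here) is arbitrary; only the URI has to match the one declared in the document.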

Related

Parsing XML in python: selecting an attribute given that a child node has a specific attribute

Given the xml
xmlstr = '''
<myxml>
<Description id="10">
<child info="myurl"/>
</Description>
</myxml>'''
I'd like to get the id of Description only where child has an attribute of info.
import xml.etree.ElementTree as ET
root = ET.fromstring(xmlstr)
a = root.find(".//Description/[child/@info]")
print(a.attrib)
and changing the find to .//Description/[child[@info]]
both return an error of:
SyntaxError: invalid predicate
I know that etree only supports a subset of xpath, but this doesn't seem particularly weird - should this work? If so, what have I done wrong?!
Changing the find to .//Description/[child] does work, and returns
{'id': '10'}
as expected
You've hit the limits of ElementTree's XPath support. Looking at the source directly (Python 3.7), the Element Path parser only accepts these predicate forms:
[@attribute]
[@attribute='value']
[tag]
[.='value'] or [tag='value']
[index] or [last()] or [last()-index]
Neither child/@info nor child[@info] matches any of these forms, so both of your expressions are rejected.
If you want or need to stick with the built-in ElementTree library, one way to solve this is to find all Description elements with .findall() and keep only those that have a child element with an info attribute.
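A minimal sketch of that findall-and-filter approach. The second Description element (id "11") is not in the question; it is added here as a hypothetical non-matching case so the filter has something to reject.

```python
import xml.etree.ElementTree as ET

# The question's document plus one hypothetical Description without an info attribute.
xmlstr = """
<myxml>
  <Description id="10">
    <child info="myurl"/>
  </Description>
  <Description id="11">
    <child/>
  </Description>
</myxml>
"""

root = ET.fromstring(xmlstr)

# [@attribute] is a supported predicate, so apply it one level at a time:
# first collect every Description, then test each one for child[@info].
matches = [
    desc
    for desc in root.findall(".//Description")
    if desc.find("child[@info]") is not None
]

ids = [desc.get("id") for desc in matches]
print(ids)
```

Splitting the query this way keeps each path expression inside the subset that ElementTree supports.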
You can also read the attributes through keys() and get(), which is a slightly more structured way to gather the data:
import xml.etree.ElementTree as ET
root = ET.fromstring(xmlstr)
wht = root.find(".//Description")
wht.keys()  # --> ['id']
wht.get('id')  # --> '10'

ElementTree XML API not matching subelement

I am attempting to use the USPS API to return the status of package tracking. I have a method that returns an ElementTree.Element object built from the XML string returned from the USPS API.
This is the returned XML string.
<?xml version="1.0" encoding="UTF-8"?>
<TrackResponse>
<TrackInfo ID="EJ958088694US">
<TrackSummary>The Postal Service could not locate the tracking information for your
request. Please verify your tracking number and try again later.</TrackSummary>
</TrackInfo>
</TrackResponse>
I format that into an Element object
response = xml.etree.ElementTree.fromstring(xml_str)
Now I can see in the xml string that the tag 'TrackSummary' exists and I would expect to be able to access that using ElementTree's find method.
As extra proof I can iterate over the response object and prove that the 'TrackSummary' tag exists.
for item in response.iter():
    print(item, item.text)
returns:
<Element 'TrackResponse' at 0x00000000041B4B38> None
<Element 'TrackInfo' at 0x00000000041B4AE8> None
<Element 'TrackSummary' at 0x00000000041B4B88> The Postal Service could not locate the tracking information for your request. Please verify your tracking number and try again later.
So here is the problem.
print(response.find('TrackSummary'))
returns
None
Am I missing something here? Seems like I should be able to find that child element without a problem?
import xml.etree.ElementTree as ET  # cElementTree was removed in Python 3.9; ElementTree uses the C accelerator automatically
response = ET.fromstring(xml_str)
From the XPath syntax table in the ElementTree documentation:
* selects all child elements. For example, */egg selects all grandchildren named egg.
elements = response.findall('*/TrackSummary')  # returns a list
print(elements[0].text)  # prints the first match; iterate the list for the rest
>>> The Postal Service could not locate the tracking information for your request. Please verify your tracking number and try again later.
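The .find() method with a bare tag name only searches the element's direct children, not recursively. TrackSummary is a grandchild of TrackResponse, so it is not found. To search at any depth with the standard-library ElementTree, prefix the path with .// (in XPath, the double slash // matches descendants at any level):
# returns the first TrackSummary element at any depth
response.find('.//TrackSummary')
If you parse the document with lxml instead, elements also have an .xpath() method that supports full XPath:
# returns a list of elements with tag TrackSummary
response.xpath('//TrackSummary')
# returns a list of the text contained in each TrackSummary tag
response.xpath('//TrackSummary/text()')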
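A short sketch comparing the path forms discussed in these answers, using only the standard library. The XML is the response from the question with the XML declaration left out for brevity.

```python
import xml.etree.ElementTree as ET

xml_str = """<TrackResponse>
<TrackInfo ID="EJ958088694US">
<TrackSummary>The Postal Service could not locate the tracking information for your
request. Please verify your tracking number and try again later.</TrackSummary>
</TrackInfo>
</TrackResponse>"""

response = ET.fromstring(xml_str)

# A bare tag name only matches direct children of TrackResponse.
direct = response.find("TrackSummary")

# Spelling out the parent, or using * for "any child", reaches the grandchild.
via_parent = response.find("TrackInfo/TrackSummary")
grandchild = response.find("*/TrackSummary")

# .// searches the whole subtree, whatever the depth.
anywhere = response.find(".//TrackSummary")

print(direct, via_parent.tag, grandchild.tag, anywhere.tag)
```

Use the explicit path when the structure is known and .// when the depth may vary between responses.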

python how to get tag from xml in python

I have an XML file
and I check that it exists like this:
s = os.path.isfile(xmlFile)
I am loading it like this:
from lxml import etree
self.doc=etree.parse(xmlFile)
My question:
How do I get tags from that doc? Let's say I have a tag called "domain" and a tag called "player" that exists at "root/team/hello/player".
The lxml documentation says that parse() method returns an ElementTree object in lxml and then you can call getroot() on that to get the root Element. Isn't that the missing piece you were looking for? Will something like this work?
self.doc=etree.parse(xmlFile)
root = self.doc.getroot() # Element object root
Once you have the root Element, you can look up existing children with find(), using a path relative to the root. (etree.SubElement() creates a new child element rather than finding an existing one, so it is not what you want here.)
domain = root.find(".//domain")
player = root.find("team/hello/player")
Check out this link for details: http://lxml.de/tutorial.html#the-parse-function
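A minimal sketch of the parse, getroot, find sequence. It uses the standard library's xml.etree.ElementTree, which exposes the same three calls as lxml.etree, and io.StringIO stands in for the xmlFile path. The document content is hypothetical, built only from the tag names mentioned in the question.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical document; io.StringIO stands in for the file on disk.
xml_file = io.StringIO(
    "<root><domain>example</domain>"
    "<team><hello><player>alice</player></hello></team></root>"
)

doc = ET.parse(xml_file)   # an ElementTree object
root = doc.getroot()       # the root Element, here <root>

# Paths are relative to the element find() is called on,
# so "root/" itself is not part of the path.
domain = root.find("domain")
player = root.find("team/hello/player")

# iter() walks every element in document order.
tags = [el.tag for el in root.iter()]
print(domain.text, player.text, tags)
```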

Grab Content from XML using Python? almost there

I'm using ElementTree and I can get tags and attributes but not that actual content between elements.
from this XML:
<tag_name attrib="1">I WANT THIS INFO HERE</tag_name>
here's my python code:
import urllib.request
import xml.etree.ElementTree as ET
XML = urllib.request.urlopen("http://URL/file.xml")
Tree = ET.parse(XML)
for node in Tree.iter():
    print(node.tag, node.attrib)
This prints most of the XML file, and I understand what 'tag' and 'attrib' are, but how do I get the 'Content'? I tried looking through ElementTree's docs, but I think this might be too basic of a question.
The .text attribute gives you the text content of an element.
for node in Tree.iter():
    print(node.tag, node.attrib, node.text)
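A small self-contained sketch of reading .text, parsing from a string instead of a URL. The doc and empty elements are hypothetical wrappers around the tag from the question.

```python
import xml.etree.ElementTree as ET

# Hypothetical wrapper around the element shown in the question.
xml_doc = '<doc><tag_name attrib="1">I WANT THIS INFO HERE</tag_name><empty/></doc>'

root = ET.fromstring(xml_doc)

contents = {}
for node in root.iter():
    # .text holds the text before the element's first child;
    # it is None when there is no such text.
    contents[node.tag] = node.text

print(contents)
```

Because .text can be None, check for that before calling string methods such as strip() on it.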
Did you try XPath?
Several libraries can extract content from tags with an easy yet powerful syntax.
Here is an example using Scrapy's Selector (XmlXPathSelector is its older name in early Scrapy releases):
from scrapy.selector import Selector
xs = Selector(text="<tags>your xml</tags>", type="xml")
print(xs.xpath("//tag_name[@attrib='1']/text()").getall())

XPath with lxml failing

I am trying to use XPath to query an HTML document parsed with lxml. The document is a straight HTML-only download of the Wikipedia page about Plastic. I parse it with lxml, disabling entity substitution to avoid an error with '&reg':
from lxml import etree
root = etree.parse("plastic.html",etree.XMLParser(resolve_entities=False))
Then, I retrieve the namespace url
htmltag = next(root.iter())
nsurl = list(htmltag.nsmap.values())[0]
Now, I would like to use xpath queries on either 'root' or 'htmltag', but I am unable to do so. I have tried different ways, but the following seems to me the most correct form, which yields errors anyway.
root.xpath('//ns:body',namespace={'ns',nsurl})
And this is what I get
XPathResultError: Unknown return type: dict
I am running the commands in an IPython console, but I don't think that might be the problem. What am I doing wrong?
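This is a simple misspelling: the keyword argument is namespaces, not namespace. Any other keyword is treated as an XPath variable, which is where the error comes from. The value must also be a dict mapping prefix to URI ({'ns': nsurl}), not a set ({'ns', nsurl}):
root.xpath('//ns:body', namespaces={'ns': nsurl})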
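A sketch of the corrected call shape using the standard library, where find() takes the same namespaces dict. The stdlib has no nsmap, so the namespace URI is read from the root tag instead. The XHTML snippet is a hypothetical stand-in for the downloaded page.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the downloaded page.
xhtml = (
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    "<head><title>Plastic</title></head><body><p>text</p></body></html>"
)

root = ET.fromstring(xhtml)

# A namespaced tag looks like "{http://www.w3.org/1999/xhtml}html";
# take the part between the braces.
nsurl = root.tag[1:].partition("}")[0] if root.tag.startswith("{") else ""

# The keyword is "namespaces" and the value is a dict of prefix -> URI.
body = root.find(".//ns:body", namespaces={"ns": nsurl})
print(nsurl, body.tag)
```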
