ElementTree XML API not matching subelement - python

I am attempting to use the USPS API to return the status of package tracking. I have a method that returns an ElementTree.Element object built from the XML string returned from the USPS API.
This is the returned XML string.
<?xml version="1.0" encoding="UTF-8"?>
<TrackResponse>
<TrackInfo ID="EJ958088694US">
<TrackSummary>The Postal Service could not locate the tracking information for your
request. Please verify your tracking number and try again later.</TrackSummary>
</TrackInfo>
</TrackResponse>
I format that into an Element object
response = xml.etree.ElementTree.fromstring(xml_str)
Now I can see in the xml string that the tag 'TrackSummary' exists and I would expect to be able to access that using ElementTree's find method.
As extra proof I can iterate over the response object and prove that the 'TrackSummary' tag exists.
for item in response.iter():
    print(item, item.text)
returns:
<Element 'TrackResponse' at 0x00000000041B4B38> None
<Element 'TrackInfo' at 0x00000000041B4AE8> None
<Element 'TrackSummary' at 0x00000000041B4B88> The Postal Service could not locate the tracking information for your request. Please verify your tracking number and try again later.
So here is the problem.
print(response.find('TrackSummary'))
returns
None
Am I missing something here? Seems like I should be able to find that child element without a problem?

import xml.etree.ElementTree as ET  # cElementTree was removed in Python 3.9; the C accelerator is now used automatically
response = ET.fromstring(xml_str)
XPath syntax
* selects all child elements. For example, */egg selects all grandchildren named egg.
elements = response.findall('*/TrackSummary')  # you will get a list
print(elements[0].text)  # quick check; otherwise iterate over the list
>>> The Postal Service could not locate the tracking information for your request. Please verify your tracking number and try again later.

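The .find() method only searches the element's direct children, not the whole tree. To search recursively, you need to use an XPath expression. In XPath, the double slash // matches descendants at any depth. Note that the .xpath() method is provided by lxml, not by xml.etree.ElementTree; with ElementTree, use response.find('.//TrackSummary') instead. With lxml, try this:
# returns a list of elements with tag TrackSummary
response.xpath('//TrackSummary')
# returns a list of the child nodes (here, the text) of each TrackSummary tag
response.xpath('//TrackSummary/node()')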
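For the standard-library ElementTree used in the question, here is a minimal sketch of the three lookups. It assumes the USPS response is held in a string (xml_str, as in the question); the XML declaration is left out only to keep the sample short.

```python
import xml.etree.ElementTree as ET

xml_str = """<TrackResponse>
<TrackInfo ID="EJ958088694US">
<TrackSummary>The Postal Service could not locate the tracking information for your
request. Please verify your tracking number and try again later.</TrackSummary>
</TrackInfo>
</TrackResponse>"""

response = ET.fromstring(xml_str)

# A bare tag name only matches direct children of <TrackResponse>.
direct = response.find("TrackSummary")

# Spelling out the intermediate element reaches the grandchild.
nested = response.find("TrackInfo/TrackSummary")

# ".//" searches all descendants, whatever their depth.
anywhere = response.find(".//TrackSummary")

print(direct)
print(anywhere.text)
```

The first lookup returns None because TrackSummary is a grandchild of the root; the other two return the same element.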

Related

Applying root.xpath() with regex returns a lxml.etree._ElementUnicodeResult

I'm generating a model to find out where a piece of text is located in an HTML file.
So, I have a database with plenty of data from different newspaper's articles with data like title, publish date, authors and news text. What I'm trying to do is by analyzing this data, generate a model that can find by itself the XPath to the HTML tags with this content.
The problem is when I use a regex within the xpath method as shown here:
from lxml import html
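with open('somecode.html', 'r') as f:
    root = html.fromstring(f.read())
# the re: prefix has to be bound to the EXSLT regular-expressions namespace
ns = {'re': 'http://exslt.org/regular-expressions'}
list_of_xpaths = root.xpath('//*/@*[re:match(.,"2019-04-15")]', namespaces=ns)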
This is an example of searching for the publish date in the code. It returns an lxml.etree._ElementUnicodeResult instead of an lxml.etree._Element.
Unfortunately, this type of result doesn't let me get the XPath of its location the way an lxml.etree._Element does when I call root.getroottree().getpath(list_of_xpaths[0]).
Is there a way to get the XPath for this type of element? How?
Is there a way to make lxml with a regex return an lxml.etree._Element instead?
The problem is that you get an attribute value, represented as an instance of the _ElementUnicodeResult class.
If we introspect what the _ElementUnicodeResult class provides, we can see that it lets you get to the element that has this attribute via the .getparent() method:
attribute = list_of_xpaths[0]
element = attribute.getparent()
print(root.getroottree().getpath(element))
This would get us a path to the element, but as we need an attribute name as well, we could do:
print(attribute.attrname)
Then, to get the complete xpath pointing at the element attribute, we may use:
path_to_element = root.getroottree().getpath(element)
attribute_name = attribute.attrname
complete_path = path_to_element + "/@" + attribute_name
print(complete_path)
FYI, _ElementUnicodeResult also indicates whether it is actually an attribute via the .is_attribute property (this class represents text nodes and tails as well).
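Putting those pieces together, here is a small self-contained sketch. It needs lxml installed, and the HTML snippet and the attribute_xpath helper name are made up for illustration. It uses re:test, the boolean EXSLT function, rather than re:match, and binds the re: prefix explicitly.

```python
from lxml import html

# Hypothetical page fragment; the real input would come from somecode.html.
page = ('<html><body><div>'
        '<time datetime="2019-04-15">April 15</time>'
        '</div></body></html>')
root = html.fromstring(page)

# The re: prefix must be bound to the EXSLT regular-expressions namespace.
ns = {"re": "http://exslt.org/regular-expressions"}
hits = root.xpath('//*/@*[re:test(., "2019-04-15")]', namespaces=ns)


def attribute_xpath(result):
    """Build an XPath for an attribute result, or return None otherwise."""
    if not getattr(result, "is_attribute", False):
        return None
    element = result.getparent()
    element_path = element.getroottree().getpath(element)
    return element_path + "/@" + result.attrname


path = attribute_xpath(hits[0])
print(path)
```

The is_attribute check matters because the same result class is also returned for text nodes and tails, which have no attribute name.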

Basic Python Parsing XML with xml.etree - Issue

I am trying to parse XML and am having a hard time. I don't understand why the results keep printing [<Element 'Results' at 0x105fc6110>]
I am trying to extract Social from my example with the following code:
import xml.etree.ElementTree as ET
root = ET.parse("test.xml")
results = root.findall("Results")
print(results)  # [<Element 'Results' at 0x105fc6110>]
# WHAT IS THIS??
for result in results:
    print(result.find("Social"))  # None
the XML looks like this:
<?xml version="1.0"?>
<List1>
<NextOffset>AAA</NextOffset>
<Results>
<R>
<D>internet.com</D>
<META>
<Social>
<v>http://twitter.com/internet</v>
<v>http://facebook.com/internet</v>
</Social>
<Telephones>
<v>+1-555-555-6767</v>
</Telephones>
</META>
</R>
</Results>
</List1>
findall returns a list of xml.etree.ElementTree.Element objects. In your case, you only have one Results node, so you could use find to get the first (and only) match.
Once you have it, call find with the .// syntax, which searches anywhere in the subtree, not only among the elements directly under Results.
Once you have found Social, just call findall on the v tag and print the text:
import xml.etree.ElementTree as ET
root = ET.parse("test.xml")
result = root.find("Results")
social = result.find(".//Social")
for r in social.findall("v"):
    print(r.text)
results in:
http://twitter.com/internet
http://facebook.com/internet
Note that I did not perform any validity check on the XML file. You should check whether the find method returns None and handle the error accordingly.
Note also that, even though I'm not very confident with the XML format myself, I learned everything I know about parsing it by following this lxml tutorial.
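As a rough sketch of that None check, assuming the XML from the question is held in a string instead of test.xml (xml_text and the social_links helper are names chosen here for illustration):

```python
import xml.etree.ElementTree as ET

xml_text = """<List1>
<NextOffset>AAA</NextOffset>
<Results><R><D>internet.com</D><META>
<Social>
<v>http://twitter.com/internet</v>
<v>http://facebook.com/internet</v>
</Social>
<Telephones><v>+1-555-555-6767</v></Telephones>
</META></R></Results>
</List1>"""


def social_links(root):
    """Return the text of each <v> under the first <Social>, or [] if absent."""
    results = root.find("Results")
    if results is None:
        return []
    social = results.find(".//Social")
    if social is None:
        return []
    return [v.text for v in social.findall("v")]


links = social_links(ET.fromstring(xml_text))
missing = social_links(ET.fromstring("<List1/>"))
print(links)
print(missing)
```

Returning an empty list is only one way to handle a missing element; raising an exception may suit your program better.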
results = root.findall("Results") is a list of xml.etree.ElementTree.Element objects.
type(results)
# list
type(results[0])
# xml.etree.ElementTree.Element
With a bare tag name, find and findall only look at direct children. The iter method iterates over matching descendants at any level.
Option 1
If <Results> could potentially have more than one <Social> element, you could use this:
for result in results:
    for soc in result.iter("Social"):
        for link in soc.iter("v"):
            print(link.text)
That's worst case scenario. If you know there'll be one <Social> per <Results> then it simplifies to:
for soc in root.iter("Social"):
    for link in soc.iter("v"):
        print(link.text)
Both print:
http://twitter.com/internet
http://facebook.com/internet
Option 2
Or use nested list comprehensions and do it with one line of code. Because Python...
socialLinks = [[v.text for v in soc] for soc in root.iter("Social")]
# socialLinks == [['http://twitter.com/internet', 'http://facebook.com/internet']]
socialLinks is a list of lists. The outer list has one entry per <Social> element (only one in this example). Each inner list contains the text from the v elements within that particular <Social> element.
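To see findall and iter side by side, here is a minimal sketch. It assumes the question's XML is held in a string (xml_text is a name chosen for illustration) rather than read from test.xml.

```python
import xml.etree.ElementTree as ET

xml_text = """<List1>
<NextOffset>AAA</NextOffset>
<Results><R><D>internet.com</D><META>
<Social>
<v>http://twitter.com/internet</v>
<v>http://facebook.com/internet</v>
</Social>
<Telephones><v>+1-555-555-6767</v></Telephones>
</META></R></Results>
</List1>"""

root = ET.fromstring(xml_text)

# A bare tag name only looks at direct children of <List1>.
direct_children = root.findall("Social")

# iter() walks the whole subtree in document order.
any_depth = list(root.iter("Social"))

# Flat list of link texts across every <Social> element found.
links = [v.text for soc in any_depth for v in soc.iter("v")]

print(len(direct_children), len(any_depth))
print(links)
```

Unlike the nested comprehension in Option 2, this one flattens the links into a single list, which is handy when you do not care which <Social> block a link came from.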

How to lookup element in AWS response with lxml (namespace issue?)

I'm trying to use lxml to read a response from the AWS REST API but not having any luck. I can easily parse the response and print it, but none of the find or xpath functions find anything. For example, take this document fragment:
<DistributionConfig xmlns="http://cloudfront.amazonaws.com/doc/2013-11-11/">
<CallerReference>e6d6909d-f1ed-47f1-83d9-290acf10f324</CallerReference>
<Aliases>
<Quantity>1</Quantity>
<Items>
And this code:
from lxml import etree
root = etree.XML( ... )
node = root.find( 'Quantity' )
node is always None. I've tried a variety of XPaths like //Quantity, .//Quantity, and also the xpath function, but can't find anything.
How do I use this library on this type of document?
Seems you will need to supply the namespace of the element as well:
>>> root.find('.//aws:Quantity', namespaces={'aws': 'http://cloudfront.amazonaws.com/doc/2013-11-11/'})
<Element {http://cloudfront.amazonaws.com/doc/2013-11-11/}Quantity at 0xb6c16aa4>
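The same idea also works with the standard-library ElementTree. The sketch below assumes a shortened, closed-off version of the fragment from the question (the Items element is left out), so treat the document itself as illustrative.

```python
import xml.etree.ElementTree as ET

AWS_NS = "http://cloudfront.amazonaws.com/doc/2013-11-11/"

xml_text = """<DistributionConfig xmlns="http://cloudfront.amazonaws.com/doc/2013-11-11/">
  <CallerReference>e6d6909d-f1ed-47f1-83d9-290acf10f324</CallerReference>
  <Aliases>
    <Quantity>1</Quantity>
  </Aliases>
</DistributionConfig>"""

root = ET.fromstring(xml_text)

# The default xmlns applies to every element, so a bare name never matches.
unqualified = root.find(".//Quantity")

# Option A: map a prefix of your choice to the namespace URI.
prefixed = root.find(".//aws:Quantity", {"aws": AWS_NS})

# Option B: write the qualified name in {uri}tag form.
qualified = root.find(".//{" + AWS_NS + "}Quantity")

print(unqualified)
print(prefixed.tag, prefixed.text)
```

The prefix (aws here) is arbitrary; only the namespace URI has to match the one declared in the document.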

Python and ElementTree: write() isn't working properly

First question. If I screwed up somehow let me know.
Ok, what I need to do is the following. I'm trying to use Python to get some data from an API. The API sends it to me in XML. I'm trying to use ElementTree to parse it.
Now every time I request information from the API, it's different. I want to construct a list of all the data I get. I could use Python's lists, but since I want to save it to a file at the end I figured - why not use ElementTree for that too.
Start with an Element, let's call it ListE. Call the API, parse the XML, get the root Element from the ElementTree. Add the root Element as a subelement of ListE. Call the API again, and do it all over. At the end ListE should be an Element whose subelements are the results of each API call. At the end of everything, just wrap ListE in an ElementTree in order to use the ElementTree write() function. Below is the code.
import xml.etree.ElementTree as ET
from urllib.request import urlopen

url = "http://api.intrade.com/jsp/XML/MarketData/ContractBookXML.jsp?id=769355"
try:
    returnurl = urlopen(url)
except IOError:
    exit()
tree = ET.parse(returnurl)
root = tree.getroot()
print("root tag and attrib: ", root.tag, root.attrib)
historyE = ET.Element('historical data')
historyE.append(root)
historyE.append(root)
historyET = ET.ElementTree(historyE)
historyET.write('output.xml', "UTF-8")
The program doesn't return any error. The problem is when I ask the browser to open it, it claims a syntax error. Opening the file with notepad here's what I find:
<?xml version='1.0' encoding='UTF-8'?>
<historical data><ContractBookInfo lastUpdateTime="0">
<contractInfo conID="769355" expiryPrice="100.0" expiryTime="1357334563000" state="S" vol="712" />
</ContractBookInfo><ContractBookInfo lastUpdateTime="0">
<contractInfo conID="769355" expiryPrice="100.0" expiryTime="1357334563000" state="S" vol="712" />
</ContractBookInfo></historical data>
I think the reason for the syntax error is that there isn't a space or a return between 'historical data' and 'ContractBookInfo lastUpdateTime="0"'. Suggestions?
The problem is here:
historyE = ET.Element('historical data')
You shouldn't use a space. As summarized on Wikipedia:
The element tags are case-sensitive; the beginning and end tags must
match exactly. Tag names cannot contain any of the characters
!"#$%&'()*+,/;<=>?@[\]^`{|}~, nor a space character, and cannot start
with -, ., or a numeric digit.
See this section of the XML spec for the details ("Almost all characters are permitted in names, except those which either are or reasonably could be used as delimiters.")
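ElementTree does not validate tag names when you create an Element, which is why the program ran without an error. A quick way to catch this is to serialize and re-parse. The sketch below assumes a cut-down copy of the API response and uses historical_data as a replacement tag name (any valid name would do).

```python
import copy
import xml.etree.ElementTree as ET

# Cut-down stand-in for one API response.
root = ET.fromstring(
    '<ContractBookInfo lastUpdateTime="0">'
    '<contractInfo conID="769355" state="S" vol="712" />'
    '</ContractBookInfo>'
)

# Valid tag name: underscore instead of a space.
history = ET.Element("historical_data")
history.append(root)
history.append(copy.deepcopy(root))  # independent copy for the second entry

serialized = ET.tostring(history, encoding="UTF-8")
reparsed = ET.fromstring(serialized)  # succeeds: the output is well-formed

# The original tag name serializes without complaint but cannot be parsed back.
bad = ET.Element("historical data")
try:
    ET.fromstring(ET.tostring(bad))
    bad_is_well_formed = True
except ET.ParseError:
    bad_is_well_formed = False

print(reparsed.tag, len(reparsed), bad_is_well_formed)
```

Appending a deep copy rather than the same Element twice is a design choice here: it keeps the two entries independent if you modify one later.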

ElementTree can't seem to run findall() on findall() results

I have XML shaped like the following:
<?xml version="1.0" encoding="UTF-8"?>
<feed xmlns="http://www.w3.org/2005/Atom" xmlns:openSearch="http://a9.com/-/spec/opensearch/1.1/" xmlns:docs="http://schemas.google.com/docs/2007" xmlns:batch="http://schemas.google.com/gdata/batch"
<entry gd:etag="&quot;HxYZGQVeHyt7ImBr&quot;">
<title>Some document title I wish to find</title>
I have many entry elements, each which contains a title element. I wish to find which entry contains a title element with particular element text.
I can iterate over each item perfectly with the following code:
entry = './/{http://www.w3.org/2005/Atom}entry'
document_nodes = document_feed_xml.findall(entry)
for document_node in document_nodes:
    logging.warn('entry item found!')
    logging.warn(pretty_print(document_node))
    logging.warn('-'*80)
This works, returning:
WARNING:root:--------------------------------------------------------------------------------
WARNING:root:entry item found!
<ns0:entry ns1:etag="&quot;HxdWRh4MGit7ImBr&quot;" xmlns:ns0="http://www.w3.org/2005/Atom" xmlns:ns1="http://schemas.google.com/g/2005">
<ns0:title>
Some document title
</ns0:title>
</ns0:entry>
So now I'd like to look for a 'title' element in this branch of the tree. If I look for:
title = './/{http://www.w3.org/2005/Atom}title'
title_nodes = document_node.findall(title)
for title_node in title_nodes:
    logging.warn('yaaay')
    logging.warn(title_node.text)
if not title_nodes:
    raise ValueError('Could not find any title elements in this entry')
Edit: I originally had 'document_node[0].findall' from some debugging. With that removed, the code above works. This was the cause of the error - thanks to the gent below for spotting this!
This raises the error for no title nodes.
These results seem odd, as:
- I can clearly see that element, with that namespace, in the document
- I can even run findall() for title directly, using that namespace, and see the results
I've wondered about the possibility of findall() returning objects that are of a different class from its input; however, running 'type' on either object merely returns 'instance' as the type. Quality programming there, ElementTree.
Although LXML has better documentation, better xpath support, and better code, for technical reasons, I cannot use LXML, so I am forced to use ElementTree.
The problem is that document_node[0] in your code already references the title element, and looking through its children returns nothing.
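A small sketch of why that happens, using a stripped-down feed with a single entry (the sample XML is an assumption; only the Atom namespace is taken from the question):

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"

feed = ET.fromstring(
    '<feed xmlns="http://www.w3.org/2005/Atom">'
    '<entry><title>Some document title I wish to find</title></entry>'
    '</feed>'
)

entry = feed.find(".//" + ATOM + "entry")
title_path = ".//" + ATOM + "title"

# Searching from the entry finds its title.
from_entry = entry.findall(title_path)

# Indexing an element returns its first child, which here is the title itself.
first_child = entry[0]

# ".//" only looks below the starting element, so the title does not find itself.
from_child = first_child.findall(title_path)

print(first_child.tag)
print(len(from_entry), len(from_child))
```

So document_node[0].findall(...) starts the search one level too deep, and the result list is empty.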
