How to identify a specific XML element without ID attributes - python

I'm working with XML documents that have many <text> elements. But none of them have IDs or any other attribute--just the element tag. Is there any way I can use python to tell one of these elements from another (other than by the contents)? For example do elements have some inherent index number based on their position in the document or something like that?

If you are using lxml's ElementTree API and want details about a particular element:
>>> element
<Element e at 0x7f71068abf38>
You can find the element's index within its parent and the element's full path:
>>> element.getparent().index(element)
0
>>> element.getroottree().getpath(element)
'/root/e[1]'
That is all lxml offers directly. For anything more sophisticated (such as a "global index" of the element within the whole document) you will need to write custom code.
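As a rough idea of what such custom code could look like, the sketch below numbers every element with a given tag in document order. It uses the standard library's xml.etree.ElementTree so that it runs without extra packages; lxml elements offer the same iter() method. The sample document and the helper name document_positions are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: several <text> elements with no attributes at all.
SAMPLE = "<root><text>a</text><section><text>b</text></section><text>c</text></root>"


def document_positions(root, tag):
    """Map each element with the given tag to its 0-based position
    among all such elements, counted in document order."""
    return {elem: index for index, elem in enumerate(root.iter(tag))}


root = ET.fromstring(SAMPLE)
positions = document_positions(root, "text")

for elem, index in positions.items():
    print(index, elem.text)
```

The position is only stable as long as the document itself does not change, so it identifies an element within one parsed tree rather than across edits.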

Related

xml.etree.ElementTree not finding all Elements in XML

I have the following XML file that I'm trying to iterate through using xml.etree:
<safetypadapiresponse><url></url><refcode /><status>SUCCESS</status><message><pcrs>
<pcr>
<eCase01m>1234</eCase01m>
<eProcedures03>12 Lead ECG Obtained</eProcedures03>
<eMedications03>Oxygen</eMedications03>
</pcr>
</pcrs></message></safetypadapiresponse>
I'm unable to find any of the child elements after 'message' with the following:
import xml.etree.ElementTree as ET
tree = ET.parse(xmlFile)
root = tree.getroot()
for member in root.findall('pcr'):
    print(member)
The following child elements are listed when the following is run:
for member in root:
    print(member)
Element 'url'
Element 'refcode'
Element 'status'
Element 'message'
I'm trying to retrieve all the information under the pcr element (i.e. eCase01m, eProcedures03, eMedications03).
You can use findall() in two ways. Unhelpfully this is mentioned in two different parts of the docs:
Element.findall() finds only elements with a tag which are direct
children of the current element.
...
Finds all matching subelements, by tag name or path. Returns a list
containing all matching elements in document order.
What this means is if you look for a tag, you are only searching the direct children of the current element.
You can use a path expression instead (ElementTree supports a subset of XPath) to look for the parts you are interested in, which will recurse through the document looking for matches. Either of the following should do:
root.findall('./message/pcrs/pcr') # Find them relative to this node
root.findall('.//pcr') # Find them anywhere below the current node
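To see the difference side by side, here is a small self-contained sketch that runs the three lookups against the XML from the question and then collects the fields under pcr. The variable names are only illustrative.

```python
import xml.etree.ElementTree as ET

XML = """<safetypadapiresponse><url></url><refcode /><status>SUCCESS</status><message><pcrs>
<pcr>
<eCase01m>1234</eCase01m>
<eProcedures03>12 Lead ECG Obtained</eProcedures03>
<eMedications03>Oxygen</eMedications03>
</pcr>
</pcrs></message></safetypadapiresponse>"""

root = ET.fromstring(XML)

direct = root.findall("pcr")                  # bare tag: direct children only
by_path = root.findall("./message/pcrs/pcr")  # explicit path from the root
anywhere = root.findall(".//pcr")             # any depth below the root

print(len(direct), len(by_path), len(anywhere))

# Collect tag/text pairs for the children of every <pcr> found.
fields = {child.tag: child.text for pcr in anywhere for child in pcr}
print(fields)
```

The bare tag finds nothing because pcr sits three levels below the root, while both path forms find the single pcr element.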
For the sake of completeness: if you parse the file with lxml instead of xml.etree (which has no xpath() method), you can also use full XPath:
for i in tree.xpath('*//pcr/*'):
    print(i.tag)
Output:
eCase01m
eProcedures03
eMedications03

access elements and attribs DIRECTLY using lxml etree

Given the following xml structure:
<root>
<a>
<from name="abc">
<b>xxx</b>
<c>yyy</c>
</from>
<to name="def">
<b>blah blah</b>
<c>another blah blah</c>
</to>
</a>
</root>
How can I access directly the value of "from.b" of each "a" without loading first "from" (with find()) of each "a"?
As you can see there are exactly the same elements under "from" and "to". So the method findall() would not work as I have to differentiate where the value of "b" is coming from.
I would like to get the method of direct access because if I have to load each child element (there is a lot) my code would be quite verbose. And in addition in my case performance counts and I have a lot of XML docs to parse! So I have to find the fastest method to go through the document (and store the data into a DB)
Within each "a" element there is exactly 1 "from" element and within each "from" element there is exactly 1 "b" element.
I have no problem to do this with lxml objectify, but I want to use etree because first I have to parse the XML document with etree because I have to validate first the xml schema against an XSD doc and I do not want to reparse the whole document again.
find (and findall) lets you specify a path to elements as well, for example you can do:
root = ET.fromstring(input_xml)
for a in root.findall('a'):
    print(a, a.find('from/b').text)
This assumes you always have exactly one from element and one b element.
Otherwise, if the code needs to be more robust, I might be tempted to use findall and do the checks in Python code.
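A minimal sketch of that more defensive variant, assuming that a missing or duplicated from/b should be treated as an error. The helper name from_b_values and the choice of ValueError are illustrative, not part of the original answer.

```python
import xml.etree.ElementTree as ET

INPUT_XML = """<root>
<a>
<from name="abc">
<b>xxx</b>
<c>yyy</c>
</from>
<to name="def">
<b>blah blah</b>
<c>another blah blah</c>
</to>
</a>
</root>"""


def from_b_values(root):
    """Return the text of from/b for every <a>, checking the cardinality."""
    values = []
    for a in root.findall("a"):
        matches = a.findall("from/b")  # the path keeps to/b out of the result
        if len(matches) != 1:
            raise ValueError(f"expected exactly one from/b, found {len(matches)}")
        values.append(matches[0].text)
    return values


root = ET.fromstring(INPUT_XML)
print(from_b_values(root))
```

Because the path names from explicitly, the b under to is never considered, which addresses the concern about telling the two apart.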

Unable to find element using the following Xpath

I am trying to find the input element with the name statusid_103408 and the text() Draft.
Here is the XPath I am using; I'm not sure where I am going wrong:
//input[@name='statusid_103408' and contains(text(), 'Draft')]
The reason this xpath does not work is because the text of "Draft" is not actually a property of the input element. It is contained in the li element that is the parent. Therefore, your search is returning no results.
I suggest using only the name in your XPath search (if it is unique). If you definitely need the text in your search, you can match the li element's text first and then find your input, like so:
//li[text()='Draft']/input[@name='statusid_103408']
Use the value attribute instead. It will work because the value is unique, and the text is not inside the input tag.
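The page's real markup is not shown in the question, so the sketch below uses a made-up fragment to illustrate the point about where the text lives. It relies on the standard library's xml.etree.ElementTree, which supports only a subset of XPath: the predicate [.='Draft'] (Python 3.7+) stands in for [text()='Draft'], which a full XPath engine such as a browser or lxml would accept.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: the label text follows the input inside the same li.
HTML = (
    "<ul>"
    "<li><input name='statusid_103407' value='1'/>Open</li>"
    "<li><input name='statusid_103408' value='2'/>Draft</li>"
    "</ul>"
)
root = ET.fromstring(HTML)

# The input element has no text content of its own, so this finds nothing.
on_input = root.findall(".//input[.='Draft']")

# Matching on the name alone works if the name is unique.
by_name = root.findall(".//input[@name='statusid_103408']")

# Matching the li by its text first, then stepping down to the input.
by_text = root.findall(".//li[.='Draft']/input[@name='statusid_103408']")

print(len(on_input), by_name[0].get("value"), by_text[0].get("value"))
```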

ElementTree XML API not matching subelement

I am attempting to use the USPS API to return the status of package tracking. I have a method that returns an ElementTree.Element object built from the XML string returned from the USPS API.
This is the returned XML string.
<?xml version="1.0" encoding="UTF-8"?>
<TrackResponse>
<TrackInfo ID="EJ958088694US">
<TrackSummary>The Postal Service could not locate the tracking information for your
request. Please verify your tracking number and try again later.</TrackSummary>
</TrackInfo>
</TrackResponse>
I format that into an Element object
response = xml.etree.ElementTree.fromstring(xml_str)
Now I can see in the xml string that the tag 'TrackSummary' exists and I would expect to be able to access that using ElementTree's find method.
As extra proof I can iterate over the response object and prove that the 'TrackSummary' tag exists.
for item in response.iter():
print(item, item.text)
returns:
<Element 'TrackResponse' at 0x00000000041B4B38> None
<Element 'TrackInfo' at 0x00000000041B4AE8> None
<Element 'TrackSummary' at 0x00000000041B4B88> The Postal Service could not locate the tracking information for your request. Please verify your tracking number and try again later.
So here is the problem.
print(response.find('TrackSummary'))
returns
None
Am I missing something here? Seems like I should be able to find that child element without a problem?
import xml.etree.ElementTree as ET  # the C accelerator (formerly cElementTree) is used automatically on Python 3.3+
response = ET.fromstring(xml_str)
From the XPath syntax section of the docs:
* selects all child elements. For example, */egg selects all grandchildren named egg.
elements = response.findall('*/TrackSummary')  # findall() returns a list
print(elements[0].text)  # print the first match, or iterate over the list
>>> The Postal Service could not locate the tracking information for your request. Please verify your tracking number and try again later.
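The .find() method with a bare tag name only searches the direct children, not recursively. To search recursively with xml.etree, start the path with .// (for example response.find('.//TrackSummary')). If you parsed the document with lxml instead, you can use a full XPath query through its xpath() method. In XPath, the double slash // is a recursive search. With lxml, try this: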
# returns a list of elements with tag TrackSummary
response.xpath('//TrackSummary')
# returns a list of the text contained in each TrackSummary tag
response.xpath('//TrackSummary/node()')
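For readers staying with the standard library, here is a small sketch of the same lookups using xml.etree.ElementTree only. The response text is shortened and the XML declaration is left out for brevity.

```python
import xml.etree.ElementTree as ET

XML_STR = """<TrackResponse>
<TrackInfo ID="EJ958088694US">
<TrackSummary>The Postal Service could not locate the tracking information.</TrackSummary>
</TrackInfo>
</TrackResponse>"""

response = ET.fromstring(XML_STR)

# A bare tag only looks at direct children of TrackResponse, so this is None.
direct = response.find("TrackSummary")

# './/' searches at any depth below the current element.
summary = response.find(".//TrackSummary")

# An explicit path works too; findtext() returns the text directly.
text = response.findtext("TrackInfo/TrackSummary")

print(direct)
print(summary.text)
print(text)
```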

python docx how to read text along with inline images?

I have a simple docx file like this (just an inline PNG file inserted into the text):
I've tried:
>>> x=docx.Document('12.docx')
>>> for p in x.paragraphs:
...     print(p.text)
headend
>>> list(x.inline_shapes)
[]
When I unzip the 12.docx file, I find the image at word/media/image1.png. So is there a way to get output like this:
>>> for p in x.paragraphs:
...     print(p.text_with_image_info)
head<word/media/image1.png>end
You should be able to get a list of inline shapes like this:
>>> [s for s in x.inline_shapes]
[<InlineShape object at 0x...>]
If none show up then you'd probably need to examine the XML to find out why it's not finding anything at the XPath location '//w:p/w:r/w:drawing/wp:inline'. That might yield an interesting finding if you're seeing an empty list there.
Regarding the bit about getting the text with image in document order, you'll need to go down to the lxml layer.
You can get the paragraph lxml element w:p using Paragraph._element. From there you can inspect the XML with the .xml property:
>>> p = paragraph._p
>>> p.xml
'<w:p> etc ...'
You'll need to iterate through the children of the w:p element. I expect you'll find primarily w:r (run) elements. Text is held below those in w:t elements, and a w:drawing element is a peer of w:t if I'm not mistaken.
You can construct python-docx objects like InlineShape with the right child element to get access to a more convenient API once you've located the right bit.
So it's a bit of work but doable if you're up to working with lxml-level calls.
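To give a feel for that traversal, here is a rough sketch that walks a hand-written, heavily simplified w:p fragment with the standard library's ElementTree. A real w:drawing is nested much more deeply (the sketch just searches for any a:blip below it), and the helper names qn and text_with_image_refs are made up. The same loop should in principle carry over to the lxml element behind a python-docx paragraph, but that is an assumption and not tested here.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
A = "http://schemas.openxmlformats.org/drawingml/2006/main"
R = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"


def qn(namespace, local_name):
    """Build a '{namespace}tag' name as used by ElementTree and lxml."""
    return "{" + namespace + "}" + local_name


# Simplified stand-in for one paragraph: text run, image run, text run.
PARAGRAPH_XML = f"""<w:p xmlns:w="{W}" xmlns:a="{A}" xmlns:r="{R}">
  <w:r><w:t>head</w:t></w:r>
  <w:r><w:drawing><a:blip r:embed="rId4"/></w:drawing></w:r>
  <w:r><w:t>end</w:t></w:r>
</w:p>"""


def text_with_image_refs(p):
    """Join run text and image relationship ids in document order."""
    parts = []
    for run in p.findall(qn(W, "r")):
        for child in run:
            if child.tag == qn(W, "t"):
                parts.append(child.text or "")
            elif child.tag == qn(W, "drawing"):
                for blip in child.iter(qn(A, "blip")):
                    parts.append("<" + blip.get(qn(R, "embed")) + ">")
    return "".join(parts)


paragraph = ET.fromstring(PARAGRAPH_XML)
result = text_with_image_refs(paragraph)
print(result)
```

The marker printed for the image is the relationship id, not the file path. Turning it into word/media/image1.png would require looking the id up in the document's relationships, which this sketch does not cover.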
