access elements and attribs DIRECTLY using lxml etree

Given the following xml structure:
<root>
<a>
<from name="abc">
<b>xxx</b>
<c>yyy</c>
</from>
<to name="def">
<b>blah blah</b>
<c>another blah blah</c>
</to>
</a>
</root>
How can I directly access the value of "from.b" for each "a" without first loading "from" (with find()) for each "a"?
As you can see, exactly the same elements appear under "from" and "to", so findall() alone would not work: I have to know which parent each "b" value comes from.
I want direct access because loading every child element separately (there are a lot of them) would make my code quite verbose. Performance also matters in my case, as I have a lot of XML documents to parse, so I need the fastest way to go through each document (and store the data in a DB).
Within each "a" element there is exactly one "from" element, and within each "from" element there is exactly one "b" element.
I have no problem doing this with lxml objectify, but I want to use etree: I already have to parse the XML document with etree in order to validate it against an XSD schema, and I do not want to parse the whole document a second time.

find() (and findall()) also accept a path to nested elements, so you can do, for example:
from lxml import etree as ET  # xml.etree.ElementTree works the same way here
root = ET.fromstring(input_xml)
for a in root.findall('a'):
    print(a, a.find('from/b').text)
This assumes you always have exactly one from and one b element.
Otherwise, if the code needs to be more robust, I would be tempted to use findall() and do the checks in Python code.
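A minimal sketch of a more defensive variant, assuming the document looks like the one in the question. It uses the standard library's xml.etree.ElementTree so that it runs without extra packages; lxml.etree accepts the same find()/findall()/findtext() paths. The helper name read_pairs and the (from_b, to_b) tuple layout are illustrative assumptions, not part of the original answer.

```python
import xml.etree.ElementTree as ET

INPUT_XML = """<root>
  <a>
    <from name="abc"><b>xxx</b><c>yyy</c></from>
    <to name="def"><b>blah blah</b><c>another blah blah</c></to>
  </a>
  <a>
    <to name="ghi"><b>only a to element</b></to>
  </a>
</root>"""


def read_pairs(root):
    """Return one (from_b, to_b) tuple per <a>; a missing value becomes None."""
    pairs = []
    for a in root.findall('a'):
        # findtext() follows the path and returns None when nothing matches,
        # so a missing <from> or <b> does not raise AttributeError.
        pairs.append((a.findtext('from/b'), a.findtext('to/b')))
    return pairs


root = ET.fromstring(INPUT_XML)
pairs = read_pairs(root)
print(pairs)
```

Because the path names the parent explicitly ('from/b' versus 'to/b'), the two identically named b elements stay distinguishable without a separate find() call for each parent.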

Related

Trouble retrieving text from XML with ElementTree with tags

Right now I have some code which uses Biopython and NCBI's "Entrez" API to get XML strings from Pubmed Central. I'm trying to parse the XML with ElementTree to just have the text from the page. Although I have BeautifulSoup code that does exactly this when I scrape the lxml data from the site itself, I'm switching to the NCBI API since scrapers are apparently a no-no. But now with the XML from the NCBI API, I'm finding ElementTree extremely unintuitive and could really use some help getting it to work. Of course I've looked at other posts, but most of these deal with namespaces and in my case, I just want to use the XML tags to grab information. Even the ElementTree documentation doesn't go into this (from what I can tell). Can anyone help me figure out the syntax to grab information within certain tags rather than within certain namespaces?
Here's an example. Note: I use Python 3.4.
A small snippet of the XML:
<sec sec-type="materials|methods" id="s5">
<title>Materials and Methods</title>
<sec id="s5a">
<title>Overgo design</title>
<p>In order to screen the saltwater crocodile genomic BAC library described below, four overgo pairs (forward and reverse) were designed (<xref ref-type="table" rid="pone-0114631-t002">Table 2</xref>) using saltwater crocodile sequences of MHC class I and II from previous studies <xref rid="pone.0114631-Jaratlerdsiri1" ref-type="bibr">[40]</xref>, <xref rid="pone.0114631-Jaratlerdsiri3" ref-type="bibr">[42]</xref>. The overgos were designed using OligoSpawn software, with a GC content of 50–60% and 36 bp in length (8-bp overlapping) <xref rid="pone.0114631-Zheng1" ref-type="bibr">[77]</xref>. The specificity of the overgos was checked against vertebrate sequences using the basic local alignment search tool (BLAST; <ext-link ext-link-type="uri" xlink:href="http://www.ncbi.nlm.nih.gov/">http://www.ncbi.nlm.nih.gov/</ext-link>).</p>
<table-wrap id="pone-0114631-t002" orientation="portrait" position="float">
<object-id pub-id-type="doi">10.1371/journal.pone.0114631.t002</object-id>
<label>Table 2</label>
<caption>
<title>Four pairs of forward and reverse overgos used for BAC library screening of MHC-associated BACs.</title>
</caption>
<alternatives>
<graphic id="pone-0114631-t002-2" xlink:href="pone.0114631.t002"/>
<table frame="hsides" rules="groups">
<colgroup span="1">
<col align="left" span="1"/>
<col align="center" span="1"/>
</colgroup>
For my project, I want all of the text in the "p" tags (not just for this snippet of the XML, but for the entire XML string).
Now, I already know that I can turn the whole XML string into an ElementTree object:
>>> import xml.etree.ElementTree as ET
>>> tree = ET.ElementTree(ET.fromstring(xml_string))
>>> root = ET.fromstring(xml_string)
Now if I try to get the text using the tag like this:
>>> text = root.find('p')
>>> print("".join(text.itertext()))
or
>>> text = root.get('p').text
I can't extract the text that I want. From what I've read, this is because I'm using the tag "p" as an argument rather than a namespace.
While I feel like it should be quite simple for me to get all the text in "p" tags within an XML file, I'm currently unable to do it. Please let me know what I'm missing and how I can fix this. Thanks!
--- EDIT ---
So now I know that I should be using this code to get everything in the 'p' tags:
>>> text = root.find('.//p')
>>> print("".join(text.itertext()))
Despite the fact that I'm using itertext(), it only returns content from the first "p" tag and does not look at any other content. Does itertext() only iterate within a tag? The documentation seems to suggest it iterates across all tags as well, so I'm not sure why it's only returning one line instead of all of the text under all of the "p" tags.
---- FINAL EDIT ----
I figured out that itertext() only works within one element and find() only returns the first match. In order to get the entire text that I want, I must use findall():
>>> all_text = root.findall('.//p')
>>> for texts in all_text:
...     print("".join(texts.itertext()))

root.get() is the wrong method, as it retrieves an attribute of the root tag, not a subtag.
root.find() is correct, as it finds the first matching subtag (alternatively, one can use root.findall() for all matching subtags).
If you want to find not only direct subtags but also indirect subtags (as in your example), the expression passed to root.find()/root.findall() has to be written in the supported subset of XPath (see https://docs.python.org/2/library/xml.etree.elementtree.html#xpath-support). In your case it is './/p':
text = root.find('.//p')
print("".join(text.itertext()))
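To make the difference between get(), find() and findall() concrete, here is a small sketch on a made-up, shortened document (XML_STRING is an assumption; it is not the PubMed Central response from the question):

```python
import xml.etree.ElementTree as ET

XML_STRING = """<sec>
  <title>Overgo design</title>
  <p>Four overgo pairs were designed (<xref>Table 2</xref>) from earlier work.</p>
  <sec>
    <p>The overgos were checked with <b>BLAST</b>.</p>
  </sec>
</sec>"""

root = ET.fromstring(XML_STRING)

# get() reads an attribute of the element itself; <sec> has no "p" attribute.
attribute_lookup = root.get('p')

# './/p' matches <p> elements at any depth; itertext() then walks one element,
# yielding its own text plus the text and tails of its descendants.
paragraphs = ["".join(p.itertext()) for p in root.findall('.//p')]

for text in paragraphs:
    print(text)
```

Joining itertext() per element keeps the inline markup (such as the xref here) from splitting a paragraph into fragments.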

Python: Parsing XML autoadd all key/value pairs

I have searched for a long time and tried a lot, but I can't get my head around this seemingly easy scenario. I should say that I'm a Python newbie but a very good bash coder ;o) I have written some Python code, but there is probably a lot I still need to learn, so please don't be too harsh with me ;o) I'm willing to learn; I have read the Python docs and many examples and tried a lot on my own, but I'm now at the point where I'm poking around in the dark.
I parse content provided as XML. It is about 20-50 MB in size.
My XML Example:
<MAIN>
<NOSUBEL>abcd</NOSUBEL>
<NOSUBEL2>adasdasa</NOSUBEL2>
<MULTISUB>
<WHATEVER>
<ANOTHERSUBEL>
<ANOTHERONE>
(how many levels can not be said / can change)
</ANOTHERONE>
</ANOTHERSUBEL>
</WHATEVER>
</MULTISUB>..
<SUBEL2>
<FOO>abcdefg</FOO>
</SUBEL2>
<NOSUBEL3>abc</NOSUBEL3>
...
and so on
</MAIN>
This is the main part of the parsing code (if you need more details, please ask):
from lxml import etree
resp = my.request(some call args)
xml = etree.XML(resp)
for element in xml.findall(".//MAIN"):
    # this works fine but is not generic enough:
    my_dict = OrderedDict()
    for only1sub in element.iter(tag="SUBEL2"):
        for i in only1sub:
            my_dict[i.tag] = i.text
This works fine with one subelement, but it means I need to know which elements in the tree have subelements and which do not. That could change in the future, or new ones could be added.
Another problem is MULTISUB. With the above code I'm only able to parse down to the first tag.
The goal
What I WANT to achieve is, ideally:
A) One function / code snippet that is able to parse the whole XML content and, if an element has subelements (e.g. checked with "if len(x)" or whatever), descends to the next level until it reaches a level without subelements. Then go on to B).
B) For each XML tag found that has NO subelements, I want to update the dictionary with the tag name and the tag text.
C) I want to do that for all available elements. The tag and the direct child tag names (e.g. "NOSUBEL2" or "MULTISUB") will not change (often), so it is OK to use them as a starting point for parsing.
What I have tried so far is chaining several loops (for, while, for again, and so on), but nothing was fully successful. I also dipped into Python generators because I thought I could do something with the next() function, but that led nowhere either. Then again, I may not have the knowledge to use them correctly, so I'm happy for every answer.
In the end, I believe what I need is quite simple: I only want key/value pairs of the tag name and the tag content. That can't be so hard, can it? Any help is greatly appreciated.
Can you help me reach the goal?
(Already a thanks for reading until here!)

What you are looking for is recursion: a technique where a procedure calls itself to solve a sub-problem of the original problem. In this case, for a given element, either run the procedure on each of its subelements (if it has any) or update your dictionary with the element's tag name and text.
I assume that in the end you want a dictionary (OrderedDict) containing a "flat representation" of the tag names and text values of all the leaves (nodes without subelements) of the element tree, which in your case, printed out, would look like this:
OrderedDict([('NOSUBEL', 'abcd'), ('NOSUBEL2', 'adasdasa'), ('ANOTHERONE', '(how many levels can not be said / can change)'), ('FOO', 'abcdefg'), ('NOSUBEL3', 'abc')])
Generally, you define a function that either calls itself with part of your data (in this case the subelements, if there are any) or does something (in this case updates a dictionary instance).
Since I don't know the details behind the my.request call, I've replaced it with parsing from a string containing valid XML, based on the one you provided. Just replace the construction of the tree object.
resp = """<MAIN>
<NOSUBEL>abcd</NOSUBEL>
<NOSUBEL2>adasdasa</NOSUBEL2>
<MULTISUB>
<WHATEVER>
<ANOTHERSUBEL>
<ANOTHERONE>(how many levels can not be said / can change)</ANOTHERONE>
</ANOTHERSUBEL>
</WHATEVER>
</MULTISUB>
<SUBEL2>
<FOO>abcdefg</FOO>
</SUBEL2>
<NOSUBEL3>abc</NOSUBEL3>
</MAIN>"""
from collections import OrderedDict
from lxml import etree
def update_dict(element, my_dict):
    # lxml defines the "length" of an element as the number of its children.
    if len(element):  # If "length" is other than 0.
        for subelement in element:
            # That's where the recursion happens. We're calling the same
            # function for a subelement of the element.
            update_dict(subelement, my_dict)
    else:  # Otherwise, the element is a leaf.
        my_dict[element.tag] = element.text
if __name__ == "__main__":
    # Change/amend it with your my.request call.
    tree = etree.XML(resp)  # That's a <MAIN> element, too.
    my_dict = OrderedDict()
    # That's the first invocation of the procedure. We're passing the entire
    # tree and the dictionary instance.
    update_dict(tree, my_dict)
    print(my_dict)  # Just to see that the dictionary was filled with values.
As you can see, I didn't use any tag name in the code (except for the XML source, of course).
I've also added missing import from collections.
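For comparison, the same flat dictionary can probably be built without explicit recursion, because iter() already walks every descendant in document order. This is only a sketch: the function name flatten_leaves and the shortened RESP string are assumptions, and it uses the standard library's xml.etree.ElementTree (lxml.etree offers the same iter() and len() behaviour for elements).

```python
import xml.etree.ElementTree as ET
from collections import OrderedDict

RESP = """<MAIN>
  <NOSUBEL>abcd</NOSUBEL>
  <MULTISUB>
    <WHATEVER>
      <ANOTHERONE>deeply nested</ANOTHERONE>
    </WHATEVER>
  </MULTISUB>
  <SUBEL2>
    <FOO>abcdefg</FOO>
  </SUBEL2>
</MAIN>"""


def flatten_leaves(root):
    """Collect tag -> text for every element that has no children."""
    leaves = OrderedDict()
    # iter() visits the element and all of its descendants in document order.
    for element in root.iter():
        if len(element) == 0:
            # A repeated leaf tag overwrites the earlier value.
            leaves[element.tag] = element.text
    return leaves


leaves = flatten_leaves(ET.fromstring(RESP))
print(leaves)
```

Both this and the recursive version key the dictionary by tag name only, so if the same leaf tag can occur more than once, only the last value survives.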

XML parsing using ElementTree in python

I am trying to read an XML file with Python [ver - 2.6.7] using ElementTree.
There are some tags of the format:
<tag, [attributes]>
....Data....
</tag>
The data in my case is usually some binary data that I read using the text attribute.
However, there are some cases where the data can reference another tag in the file:
<tag, [attributes]>
....Data....
<ref target='idname'/>
</tag>
Which attribute or method of ElementTree can be used to parse them?

Try XPath expressions.
This will tell you whether the tag is present and, if present, returns the node.

I think I would use something like this:
for iteration in root.iter('tag'):
    # Compare with None: an element without children (like <ref/>) is falsy.
    if iteration.find('ref') is not None:
        ...
So basically I would handle those cases separately.
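A small sketch of how such references might be resolved. The id attribute on tag, the by_id lookup table and the sample document are all assumptions for illustration; the question does not say how 'idname' maps to an element.

```python
import xml.etree.ElementTree as ET

DOC = """<root>
  <tag id="first">plain data</tag>
  <tag id="second">data before ref<ref target="first"/></tag>
</root>"""

root = ET.fromstring(DOC)
# Index elements by their (assumed) id attribute so a reference can be resolved.
by_id = {el.get('id'): el for el in root.iter('tag')}

resolved = []
for el in root.iter('tag'):
    ref = el.find('ref')
    # Compare with None: an empty <ref/> has no children and is falsy.
    if ref is not None:
        target = by_id.get(ref.get('target'))
        resolved.append((el.get('id'), el.text, target.text))
    else:
        resolved.append((el.get('id'), el.text, None))

print(resolved)
```

Note that el.text only holds the text before the first child element; anything after the ref element would be in ref.tail.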

XML parsing in Python using Python 2 or 3

I'm just trying to write a simple program that lets me parse some of the following XML.
So far, following the examples I've found, I am not getting the results I'm looking for.
I encounter many of these XML files and I generally want the info from a handful of tags.
What's the best way, using ElementTree, to search for <Id> and grab whatever info is in that tag? I was trying things like:
>>> for Reel in root.findall('Reel'):
...     id = Reel.findtext('Id')
...     print(id)
Is there a way to just look for every instance of <Id> and grab the urn: etc. that comes after it? Some code that traverses everything and looks for <what I want>, and so on.
This is a very truncated version of what I usually deal with.
This didn't get what I wanted at all. Is there an easy way just to match <what I want> in any XML file and get the contents of that tag, or do I need to know the structure of the XML well enough to know its relation to root/child etc.?
<Reel>
<Id>urn:uuid:632437bc-73f9-49ca-b687-fdb3f98f430c</Id>
<AssetList>
<MainPicture>
<Id>urn:uuid:46afe8a3-50be-4986-b9c8-34f4ba69572f</Id>
<EditRate>24 1</EditRate>
<IntrinsicDuration>340</IntrinsicDuration>
<EntryPoint>0</EntryPoint>
<Duration>340</Duration>
<FrameRate>24 1</FrameRate>
<ScreenAspectRatio>2048 858</ScreenAspectRatio>
</MainPicture>
<MainSound>
<Id>urn:uuid:1fce0915-f8c7-48a7-b023-36e204a66ed1</Id>
<EditRate>24 1</EditRate>
<IntrinsicDuration>340</IntrinsicDuration>
<EntryPoint>0</EntryPoint>
<Duration>340</Duration>
</MainSound>
</AssetList>
</Reel>
@Mata, that worked perfectly, but when I tried to use it for different values in another XML file I fell flat on my face. For instance, what about this section of a file? I couldn't post the whole thing, unfortunately. What if I want to grab what comes after KeyId?
<?xml version="1.0" encoding="UTF-8" standalone="no" ?><DCinemaSecurityMessage xmlns="http://www.digicine.com/PROTO-ASDCP-KDM-20040311#" xmlns:dsig="http://www.w3.org/2000/09/xmldsig#" xmlns:enc="http://www.w3.org/2001/04/xmlenc#">
<!-- Generated by Wailua Version 0.3.20 -->
<AuthenticatedPublic Id="ID_AuthenticatedPublic">
<MessageId>urn:uuid:7bc63f4c-c617-4d00-9e51-0c8cd6a4f59e</MessageId>
<MessageType>http://www.digicine.com/PROTO-ASDCP-KDM-20040311#</MessageType>
<AnnotationText>SPIDERMAN-3_FTR_S_EN-XX_US-13_51_4K_PH_20070423_DELUXE ~ KDM for Quvis-10010.pem</AnnotationText>
<IssueDate>2007-04-29T04:13:43-00:00</IssueDate>
<Signer>
<dsig:X509IssuerName>dnQualifier=BzC0n/VV/uVrl2PL3uggPJ9va7Q=,CN=.deluxe-admin-c,OU=.mxf-j2c.ca.cinecert.com,O=.ca.cinecert.com</dsig:X509IssuerName>
<dsig:X509SerialNumber>10039</dsig:X509SerialNumber>
</Signer>
<RequiredExtensions>
<Recipient>
<X509IssuerSerial>
<dsig:X509IssuerName>dnQualifier=RUxyQle0qS7qPbcNRFBEgVjw0Og=,CN=SM.QuVIS.com.001,OU=QuVIS Digital Cinema,O=QuVIS.com</dsig:X509IssuerName>
<dsig:X509SerialNumber>363</dsig:X509SerialNumber>
</X509IssuerSerial>
<X509SubjectName>CN=SM MD LE FM.QuVIS_CinemaPlayer-3d_10010,OU=QuVIS,O=QuVIS.com,dnQualifier=3oBfjTfx1me0p1ms7XOX\+eqUUtE=</X509SubjectName>
</Recipient>
<CompositionPlaylistId>urn:uuid:336263da-e4f1-324e-8e0c-ebea00ff79f4</CompositionPlaylistId>
<ContentTitleText>SPIDERMAN-3_FTR_S_EN-XX_US-13_51_4K_PH_20070423_DELUXE</ContentTitleText>
<ContentKeysNotValidBefore>2007-04-30T05:00:00-00:00</ContentKeysNotValidBefore>
<ContentKeysNotValidAfter>2007-04-30T10:00:00-00:00</ContentKeysNotValidAfter>
<KeyIdList>
<KeyId>urn:uuid:9851b0f6-4790-0d4c-a69d-ea8abdedd03d</KeyId>
<KeyId>urn:uuid:8317e8f3-1597-494d-9ed8-08a751ff8615</KeyId>
<KeyId>urn:uuid:5d9b228d-7120-344c-aefc-840cdd32bbfc</KeyId>
<KeyId>urn:uuid:1e32ccb2-ab0b-9d43-b879-1c12840c178b</KeyId>
<KeyId>urn:uuid:44d04416-676a-2e4f-8995-165de8cab78d</KeyId>
<KeyId>urn:uuid:906da0c1-b0cb-4541-b8a9-86476583cdc4</KeyId>
<KeyId>urn:uuid:0fe2d73a-ebe3-9844-b3de-4517c63c4b90</KeyId>
<KeyId>urn:uuid:862fa79a-18c7-9245-a172-486541bef0c0</KeyId>
<KeyId>urn:uuid:aa2f1a88-7a55-894d-bc19-42afca589766</KeyId>
<KeyId>urn:uuid:59d6eeff-cd56-6245-9f13-951554466626</KeyId>
<KeyId>urn:uuid:14a13b1a-76ba-764c-97d0-9900f58af53e</KeyId>
<KeyId>urn:uuid:ccdbe0ae-1c3f-224c-b450-947f43bbd640</KeyId>
<KeyId>urn:uuid:dcd37f10-b042-8e44-bef0-89bda2174842</KeyId>
<KeyId>urn:uuid:9dd7103e-7e5a-a840-a15f-f7d7fe699203</KeyId>
</KeyIdList>
</RequiredExtensions>
<NonCriticalExtensions/>
</AuthenticatedPublic>
<AuthenticatedPrivate Id="ID_AuthenticatedPrivate"><enc:EncryptedKey xmlns:enc="http://www.w3.org/2001/04/xmlenc#">
<enc:EncryptionMethod Algorithm="http://www.w3.org/2001/04/xmlenc#rsa-oaep-mgf1p">
<ds:DigestMethod xmlns:ds="http://www.w3.org/2000/09/xmldsig#" Algorithm="http://www.w3.org/2000/09/xmldsig#sha1"/>
</enc:EncryptionMethod>

The expression Reel.findtext('Id') only matches direct children of Reel. If you want to find all Id tags at any depth below it, you can just use:
ids = [id.text for id in Reel.findall(".//Id")]
This gives you a list of the text of all Id tags that are descendants of Reel.
edit:
Your updated example uses namespaces, in this case KeyId is in the default namespace (http://www.digicine.com/PROTO-ASDCP-KDM-20040311#), so to search for it you need to include it in your search:
from xml.etree import ElementTree
doc = ElementTree.parse('test.xml')
nsmap = {'ns': 'http://www.digicine.com/PROTO-ASDCP-KDM-20040311#'}
ids = [id.text for id in doc.findall(".//ns:KeyId", namespaces=nsmap)]
print(ids)
...
The XPath subset that ElementTree supports is rather limited. If you want more complete support, you should use lxml instead; its XPath support is far more complete.
For example, using xpath to search for all KeyId tags (ignoring namespaces) and returning their text content directly:
from lxml import etree
doc = etree.parse('test.xml')
ids = doc.xpath(".//*[local-name()='KeyId']/text()")
print(ids)
...
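To see why the plain './/KeyId' search finds nothing, here is a sketch on a cut-down document (DOC is an assumption that keeps only the default namespace and two of the KeyId values from the question). It uses the standard library only.

```python
import xml.etree.ElementTree as ET

DOC = """<DCinemaSecurityMessage xmlns="http://www.digicine.com/PROTO-ASDCP-KDM-20040311#">
  <KeyIdList>
    <KeyId>urn:uuid:9851b0f6-4790-0d4c-a69d-ea8abdedd03d</KeyId>
    <KeyId>urn:uuid:8317e8f3-1597-494d-9ed8-08a751ff8615</KeyId>
  </KeyIdList>
</DCinemaSecurityMessage>"""

NS_URI = 'http://www.digicine.com/PROTO-ASDCP-KDM-20040311#'
root = ET.fromstring(DOC)

# The parser stores the namespace inside the tag, in "{uri}local" form.
first_tag = root.find('.//{%s}KeyId' % NS_URI).tag

# An unqualified name therefore matches nothing ...
unqualified = root.findall('.//KeyId')

# ... while a prefix mapped through the namespaces argument does.
ids = [el.text for el in root.findall('.//ns:KeyId', namespaces={'ns': NS_URI})]
print(ids)
```

The prefix 'ns' is arbitrary; only the URI it maps to has to match the one declared in the document.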

It sounds like XPath might be right up your alley - it will let you query your XML document for exactly what you're looking for, as long as you know the structure.

Here's what I needed to do. This works for finding whatever I need.
>>> for node in tree.iter():  # getiterator() was removed from xml.etree in Python 3.9
...     if 'KeyId' in node.tag:
...         mylist = node.text  # node.tag would only give the tag name
...         print(mylist)
...

Quickest/Best way to traverse XML with lxml in Python

I have an XML file that looks like this:
xml = '''<?xml version="1.0"?>
<root>
<item>text</item>
<item2>more text</item2>
<targetroot>
<targetcontainer>
<target>text i want to get</target>
</targetcontainer>
<targetcontainer>
<target>text i want to get</target>
</targetcontainer>
</targetroot>
...more items
</root>
'''
With lxml I'm trying to access the text in the <target> element. I've found a solution, but I'm sure there is a better, more efficient way to do this. My solution:
target = etree.XML(xml)
for x in target.getiterator('root'):
    item1 = x.findtext('item')
    for target in x.iterchildren('targetroot'):
        for t in target.iterchildren('targetcontainer'):
            targetText = t.findtext('target')
Although this works, since it gives me access to all the elements in root as well as the target element, I'm having a hard time believing this is the most efficient solution.
So my question is this: is there a more efficient way to access the texts of the <target> elements while staying in the loop over root, because I also need access to the other elements?

You can use XPath:
for x in target.xpath('/root/targetroot/targetcontainer/target'):
    print(x.text)
We ask for all elements that match a path. In this case, the path is /root/targetroot/targetcontainer/target, which means all the <target> elements that are inside a <targetcontainer> element, inside a <targetroot> element, inside a <root> element. Also, the <root> element must be the document root, because the path starts with /, which denotes the beginning of the document.
Also, your XML document had two problems. First, the <?xml version="1.0"?> declaration should be the very first thing in the document, and in this example it was preceded by a newline and some spaces. Second, it is not a tag and should not be closed, so the </xml> at the end of your string should be removed. I already edited your question anyway.
EDIT: this solution can be improved further. You do not need to pass the whole path; you can just ask for all <target> elements in the document. This is done by preceding the tag name with two slashes. Since you want all the <target> texts, regardless of where they are, this can be a better solution. So the loop above can be written simply as:
for x in target.xpath('//target'):
    print(x.text)
I tried it at first but it did not work. The problem, however, was the syntax errors in the XML, not the XPath; I then tried the other, longer path and forgot to retry this one. Sorry! Anyway, I hope I shed some light on XPath nonetheless :)
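Since the question also asks about staying on the root element to reach the other items, here is a sketch that does both from one element. It uses the standard library's xml.etree.ElementTree, which has no xpath() method, so it relies on iterfind() and findall() with relative paths instead; with lxml, root.xpath() accepts the same two relative expressions. The sample XML and the variable names are assumptions.

```python
import xml.etree.ElementTree as ET

XML = """<root>
  <item>text</item>
  <item2>more text</item2>
  <targetroot>
    <targetcontainer><target>first target</target></targetcontainer>
    <targetcontainer><target>second target</target></targetcontainer>
  </targetroot>
</root>"""

root = ET.fromstring(XML)

# Direct children of root stay reachable through the same element.
item1 = root.findtext('item')

# A relative path replaces the nested iterchildren() loops.
by_path = [t.text for t in root.iterfind('targetroot/targetcontainer/target')]

# './/target' finds <target> at any depth below root.
anywhere = [t.text for t in root.findall('.//target')]

print(item1, by_path, anywhere)
```

The explicit path is the stricter of the two: it only matches target elements in exactly that position, while './/target' would also pick up any target nested elsewhere.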
