Reading text from XML nodes using Python's libxml2

Reading text from XML nodes using Python's libxml2 - python

I am a first time XPath user and need to be able to get the text values of these different elements.. for instance time, title, etc.. I am using the libxml2 module in Python and so far have not had much luck getting just the values of the text I need. The code below here only returns the element tags.. i need the values.. any help would be GREATLY appreciated!
I'm using this code:
doc = libxml2.parseDoc(xmlOutput)
result = doc.xpathEval('//*')
With the following document:
<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE SCAN_LIST_OUTPUT SYSTEM "https://qualysapi.qualys.com/api/2.0/fo/sca/scan_list_output.dtd">
<SCAN_LIST_OUTPUT>
<RESPONSE>
<DATETIME>2012-01-22T01:21:53Z</DATETIME>
<SCAN_LIST>
<SCAN>
<REF>scan/2343423</REF>
<TYPE>Scheduled</TYPE>
<TITLE><![CDATA[customer 1 5/20/2012]]></TITLE>
<USER_LOGIN>user1</USER_LOGIN>
<LAUNCH_DATETIME>2012-02-21T04:11:05Z</LAUNCH_DATETIME>
<STATUS>
<STATE>Finished</STATE>
</STATUS>
<TARGET><![CDATA[13.3.3.2, 13.8.8.10, 13.10.12.60, 13.10.12.11...]]></TARGET>
</SCAN>
</SCAN_LIST>
</RESPONSE>
</SCAN_LIST_OUTPUT>

You can call getContent() on each returned xmlNode object to retrieve the associated text. Note that this is recursive -- to non-recursively access text content in libxml2, you'll want to retrieve the associated text node under the element, and call .getContent() on that.
That said, this would be easier if you used lxml.etree (a higher-level Python API, still backing into the C libxml2 library) instead of the Python libxml2; in that case, it's simply element.text to access the associated content as a string.

Have a look at Mark Pilgrim's Dive Into Python 3, Chapter 12. XML
The chapter starts with short course to XML (general talk but with the Atom Syndication Feed example), then it continues with the standard xml.etree.ElementTree and continues with third party lxml that implements more with the same interface (full XPATH 1.0, based on libxml2).

Related

Trouble retrieving text from XML with ElementTree with tags

Right now I have some code which uses Biopython and NCBI's "Entrez" API to get XML strings from Pubmed Central. I'm trying to parse the XML with ElementTree to just have the text from the page. Although I have BeautifulSoup code that does exactly this when I scrape the lxml data from the site itself, I'm switching to the NCBI API since scrapers are apparently a no-no. But now with the XML from the NCBI API, I'm finding ElementTree extremely unintuitive and could really use some help getting it to work. Of course I've looked at other posts, but most of these deal with namespaces and in my case, I just want to use the XML tags to grab information. Even the ElementTree documentation doesn't go into this (from what I can tell). Can anyone help me figure out the syntax to grab information within certain tags rather than within certain namespaces?
Here's an example. Note: I use Python 3.4
Small snippit of the XML:
<sec sec-type="materials|methods" id="s5">
<title>Materials and Methods</title>
<sec id="s5a">
<title>Overgo design</title>
<p>In order to screen the saltwater crocodile genomic BAC library described below, four overgo pairs (forward and reverse) were designed (<xref ref-type="table" rid="pone-0114631-t002">Table 2</xref>) using saltwater crocodile sequences of MHC class I and II from previous studies <xref rid="pone.0114631-Jaratlerdsiri1" ref-type="bibr">[40]</xref>, <xref rid="pone.0114631-Jaratlerdsiri3" ref-type="bibr">[42]</xref>. The overgos were designed using OligoSpawn software, with a GC content of 50–60% and 36 bp in length (8-bp overlapping) <xref rid="pone.0114631-Zheng1" ref-type="bibr">[77]</xref>. The specificity of the overgos was checked against vertebrate sequences using the basic local alignment search tool (BLAST; <ext-link ext-link-type="uri" xlink:href="http://www.ncbi.nlm.nih.gov/">http://www.ncbi.nlm.nih.gov/</ext-link>).</p>
<table-wrap id="pone-0114631-t002" orientation="portrait" position="float">
<object-id pub-id-type="doi">10.1371/journal.pone.0114631.t002</object-id>
<label>Table 2</label>
<caption>
<title>Four pairs of forward and reverse overgos used for BAC library screening of MHC-associated BACs.</title>
</caption>
<alternatives>
<graphic id="pone-0114631-t002-2" xlink:href="pone.0114631.t002"/>
<table frame="hsides" rules="groups">
<colgroup span="1">
<col align="left" span="1"/>
<col align="center" span="1"/>
</colgroup>
For my project, I want all of the text in the "p" tag (not just for this snippit of the XML, but for the entire XML string).
Now, I already know that I can make the whole XML string into an ElementTree Object
>>> import xml.etree.ElementTree as ET
>>> tree = ET.ElementTree(ET.fromstring(xml_string))
>>> root = ET.fromstring(xml_string)
Now if I try to get the text using the tag like this:
>>> text = root.find('p')
>>> print("".join(text.itertext()))
or
>>> text = root.get('p').text
I can't extract the text that I want. From what I've read, this is because I'm using the tag "p" as an argument rather than a namespace.
While I feel like it should be quite simple for me to get all the text in "p" tags within an XML file, I'm currently unable to do it. Please let me know what I'm missing and how I can fix this. Thanks!
--- EDIT ---
So now I know that I should be using this code to get everything in the 'p' tags:
>>> text = root.find('.//p')
>>> print("".join(text.itertext()))
Despite the fact that I'm using itertext(), it's only returning content from the first "p" tag and not looking at any other content. Does itertext() only iterate within a tag? Documentation seems to suggest it iterates across all tags as well, so I'm not sure why its only returning one line instead of all of the text under all of the "p" tags.
---- FINAL EDIT --
I figured out that itertext() only works within one tag and find() only returns the first item. In order to get the enitre text that I want I must use findall()
>>> all_text = root.findall('.//p')
>>> for texts in all_text:
print("".join(texts.itertext()))

root.get() is the wrong method, as it will retrieve an attribute of the root tag not a subtag.
root.find() is correct as it will find the first matching subtag (alternatively one can use root.findall() for all matching subtags).
If you want to find not only direct subtags but also indirect subtags (as in your example), the expression within root.find/root.findall has be to a subset of XPath (see https://docs.python.org/2/library/xml.etree.elementtree.html#xpath-support). In your case it is './/p':
text = root.find('.//p')
print("".join(text.itertext()))

Parsing XML data within tags using lxml in python

My question is regarding how to get information stored in a tag which allows for no closing tag. Here's the relevant xml:
<?xml version="1.0" encoding="UTF-8"?>
<uws:job>
<uws:results>
<uws:result id="2014-03-03T15:42:31:1337" xlink:href="http://www.cosmosim.org/query/index/stream/table/2014-03-03T15%3A42%3A31%3A1337/format/csv" xlink:type="simple"/>
</uws:results>
</uws:job>
I'm looking to extract the xlink:href url here. As you can see the uws:result tag requires no closing tag. Additionally, having the 'uws:' makes it a bit tricky to handle them when working in python. Here's what I've tried so far:
from lxml import etree
root = etree.fromstring(xmlresponse.content)
url = root.find('{*}results').text
Where xmlresponse.content is the xml data to be parsed. What this returns is
'\n '
which indicates that it's only finding the newline character, since what I'm really after is contained within a tag inside the results tag. Any ideas would be greatly appreciated.

You found the right node; you extracted the data incorrectly. Instead of
url = root.find('{*}results').text
you really want
url = root.find('{*}results').get('attribname', 'value_to_return_if_not_present')
or
url = root.find('{*}results').attrib['attribname']
(which will throw an exception if not present).
Because of the namespace on the attribute itself, you will probably need to use the {ns}attrib syntax to look it up too.
You can dump out the attrib dictionary and just copy the attribute name out too.
text is actually the space between elements, and is not normally used but is supported both for spacing (like etreeindent) and some special cases.

XML parsing in Python using Python 2 or 3

I'm just trying to write a simple program to allow me to parse some of the following XML.
So far in following examples I am not getting the results I'm looking for.
I encounter many of these XML files and I generally want the info after a handful of tags.
What's the best way using elementtree to be able to do a search for <Id> and grab what ever info is in that tag. I was trying things like
for Reel in root.findall('Reel'):
... id = Reel.findtext('Id')
... print id
Is there a way just to look for every instance of <Id> and grab the urn: etc that comes after it? Some code that traverses everything and looks for <what I want> and so on.
This is a very truncated version of what I usually deal with.
This didn't get what I wanted at all. Is there an easy just to match <what I want> in any XML file and get the contents of that tag, or do i need to know the structure of the XML well enough to know its relation to Root/child etc?
<Reel>
<Id>urn:uuid:632437bc-73f9-49ca-b687-fdb3f98f430c</Id>
<AssetList>
<MainPicture>
<Id>urn:uuid:46afe8a3-50be-4986-b9c8-34f4ba69572f</Id>
<EditRate>24 1</EditRate>
<IntrinsicDuration>340</IntrinsicDuration>
<EntryPoint>0</EntryPoint>
<Duration>340</Duration>
<FrameRate>24 1</FrameRate>
<ScreenAspectRatio>2048 858</ScreenAspectRatio>
</MainPicture>
<MainSound>
<Id>urn:uuid:1fce0915-f8c7-48a7-b023-36e204a66ed1</Id>
<EditRate>24 1</EditRate>
<IntrinsicDuration>340</IntrinsicDuration>
<EntryPoint>0</EntryPoint>
<Duration>340</Duration>
</MainSound>
</AssetList>
</Reel>
#Mata that worked perfectly, but when I tried to use that for different values on another XML file I fell flat on my face. For instance, what about this section of a file.I couldn't post the whole thing unfortunately. What if I want to grab what comes after KeyId?
<?xml version="1.0" encoding="UTF-8" standalone="no" ?><DCinemaSecurityMessage xmlns="http://www.digicine.com/PROTO-ASDCP-KDM-20040311#" xmlns:dsig="http://www.w3.org/2000/09/xmldsig#" xmlns:enc="http://www.w3.org/2001/04/xmlenc#">
<!-- Generated by Wailua Version 0.3.20 -->
<AuthenticatedPublic Id="ID_AuthenticatedPublic">
<MessageId>urn:uuid:7bc63f4c-c617-4d00-9e51-0c8cd6a4f59e</MessageId>
<MessageType>http://www.digicine.com/PROTO-ASDCP-KDM-20040311#</MessageType>
<AnnotationText>SPIDERMAN-3_FTR_S_EN-XX_US-13_51_4K_PH_20070423_DELUXE ~ KDM for Quvis-10010.pem</AnnotationText>
<IssueDate>2007-04-29T04:13:43-00:00</IssueDate>
<Signer>
<dsig:X509IssuerName>dnQualifier=BzC0n/VV/uVrl2PL3uggPJ9va7Q=,CN=.deluxe-admin-c,OU=.mxf-j2c.ca.cinecert.com,O=.ca.cinecert.com</dsig:X509IssuerName>
<dsig:X509SerialNumber>10039</dsig:X509SerialNumber>
</Signer>
<RequiredExtensions>
<Recipient>
<X509IssuerSerial>
<dsig:X509IssuerName>dnQualifier=RUxyQle0qS7qPbcNRFBEgVjw0Og=,CN=SM.QuVIS.com.001,OU=QuVIS Digital Cinema,O=QuVIS.com</dsig:X509IssuerName>
<dsig:X509SerialNumber>363</dsig:X509SerialNumber>
</X509IssuerSerial>
<X509SubjectName>CN=SM MD LE FM.QuVIS_CinemaPlayer-3d_10010,OU=QuVIS,O=QuVIS.com,dnQualifier=3oBfjTfx1me0p1ms7XOX\+eqUUtE=</X509SubjectName>
</Recipient>
<CompositionPlaylistId>urn:uuid:336263da-e4f1-324e-8e0c-ebea00ff79f4</CompositionPlaylistId>
<ContentTitleText>SPIDERMAN-3_FTR_S_EN-XX_US-13_51_4K_PH_20070423_DELUXE</ContentTitleText>
<ContentKeysNotValidBefore>2007-04-30T05:00:00-00:00</ContentKeysNotValidBefore>
<ContentKeysNotValidAfter>2007-04-30T10:00:00-00:00</ContentKeysNotValidAfter>
<KeyIdList>
<KeyId>urn:uuid:9851b0f6-4790-0d4c-a69d-ea8abdedd03d</KeyId>
<KeyId>urn:uuid:8317e8f3-1597-494d-9ed8-08a751ff8615</KeyId>
<KeyId>urn:uuid:5d9b228d-7120-344c-aefc-840cdd32bbfc</KeyId>
<KeyId>urn:uuid:1e32ccb2-ab0b-9d43-b879-1c12840c178b</KeyId>
<KeyId>urn:uuid:44d04416-676a-2e4f-8995-165de8cab78d</KeyId>
<KeyId>urn:uuid:906da0c1-b0cb-4541-b8a9-86476583cdc4</KeyId>
<KeyId>urn:uuid:0fe2d73a-ebe3-9844-b3de-4517c63c4b90</KeyId>
<KeyId>urn:uuid:862fa79a-18c7-9245-a172-486541bef0c0</KeyId>
<KeyId>urn:uuid:aa2f1a88-7a55-894d-bc19-42afca589766</KeyId>
<KeyId>urn:uuid:59d6eeff-cd56-6245-9f13-951554466626</KeyId>
<KeyId>urn:uuid:14a13b1a-76ba-764c-97d0-9900f58af53e</KeyId>
<KeyId>urn:uuid:ccdbe0ae-1c3f-224c-b450-947f43bbd640</KeyId>
<KeyId>urn:uuid:dcd37f10-b042-8e44-bef0-89bda2174842</KeyId>
<KeyId>urn:uuid:9dd7103e-7e5a-a840-a15f-f7d7fe699203</KeyId>
</KeyIdList>
</RequiredExtensions>
<NonCriticalExtensions/>
</AuthenticatedPublic>
<AuthenticatedPrivate Id="ID_AuthenticatedPrivate"><enc:EncryptedKey xmlns:enc="http://www.w3.org/2001/04/xmlenc#">
<enc:EncryptionMethod Algorithm="http://www.w3.org/2001/04/xmlenc#rsa-oaep-mgf1p">
<ds:DigestMethod xmlns:ds="http://www.w3.org/2000/09/xmldsig#" Algorithm="http://www.w3.org/2000/09/xmldsig#sha1"/>
</enc:EncryptionMethod>

The expression Reel.findtext('Id') only matches direct children of Reel. If you want to find all Id tags in your xml document, you can just use:
ids = [id.text for id in Reel.findall(".//Id")]
This would give you a list of all text nodes of all Id tags which are children of Reel.
edit:
Your updated example uses namespaces, in this case KeyId is in the default namespace (http://www.digicine.com/PROTO-ASDCP-KDM-20040311#), so to search for it you need to include it in your search:
from xml.etree import ElementTree
doc = ElementTree.parse('test.xml')
nsmap = {'ns': 'http://www.digicine.com/PROTO-ASDCP-KDM-20040311#'}
ids = [id.text for id in doc.findall(".//ns:KeyId", namespaces=nsmap)]
print(ids)
...
The xpath subset ElementTree supports is rather limited. If you want a more complete support, you should use lxml instead, it's xpath support is way more complete.
For example, using xpath to search for all KeyId tags (ignoring namespaces) and returning their text content directly:
from lxml import etree
doc = etree.parse('test.xml')
ids = doc.xpath(".//*[local-name()='KeyId']/text()")
print(ids)
...

It sounds like XPath might be right up your alley - it will let you query your XML document for exactly what you're looking for, as long as you know the structure.

Here's what I needed to do. This works for finding whatever I need.
for node in tree.getiterator():
... if 'KeyId' in node.tag:
... mylist = node.tag
... print(mylist)
...

Targeting specific sub-elements when parsing XML with Python

I'm working on building a simple parser to handle a regular data feed at work. This post, XML to csv(-like) format , has been very helpful. I'm using a for loop like in the solution, to loop through all of the elements/subelements I need to target but I'm still a bit stuck.
For instance, my xml file is structured like so:
<root>
<product>
<identifier>12</identifier>
<identifier>ab</identifier>
<contributor>Alex</contributor>
<contributor>Steve</contributor>
</product>
<root>
I want to target only the second identifier, and only the first contributor. Any suggestions on how might I do that?
Cheers!

The other answer you pointed to has an example of how to turn all instances of a tag into a list. You could just loop through those and discard the ones you're not interested in.
However, there's a way to do this directly with XPath: the mini-language supports item indexes in brackets:
import xml.etree.ElementTree as etree
document = etree.parse(open("your.xml"))
secondIdentifier = document.find(".//product/identifier[2]")
firstContributor = document.find(".//product/contributor[1]")
print secondIdentifier, firstContributor
prints
'ab', 'Alex'
Note that in XPath, the first index is 1, not 0.
ElementTree's find and findall only support a subset of XPath, described here. Full XPath, described in brief on W3Schools and more fully in the W3C's normative document is available from lxml, a third-party package, but one that is widely available. With lxml, the example would look like this:
import lxml.etree as etree
document = etree.parse(open("your.xml"))
secondIdentifier = document.xpath(".//product/identifier[2]")[0]
firstContributor = document.xpath(".//product/contributor[1]")[0]
print secondIdentifier, firstContributor

Comments in XML at beginning of document

my PYTHON xml parser fails if there´s a comment at the beginnging of an xml file like::
<?xml version="1.0" encoding="utf-8"?>
<!-- Script version: "1"-->
<!-- Date: "07052010"-->
<component name="abc">
<pp>
....
</pp>
</component>
is it illegal to place a comment like this?
EDIT:
well it´s not throwing an error but the DOM module will fail and not recognize the child nodes:
import xml.dom.minidom as dom
sub_tree = dom.parse('xyz.xml')
for component in sub_tree.firstChild.childNodes:
print(component)
I cannot acces the child nodes; sub_tree.firstChild.childNodes returns an empty list,but if I remove those 2 comments I can loop through the list and read the childnodes as usual!
EDIT:
Guys, this simple example is working and enough to figure it out. start your python shell and execute this small code above. Once it will output nothing and after deleting the comments it will show up the node!

If you do this:
import xml.dom.minidom as dom
sub_tree = dom.parse('xyz.xml')
print sub_tree.children
You will see what is your problem:
>>> print sub_tree.childNodes
[<DOM Comment node " Script ve...">, <DOM Comment node " Date: "07...">, <DOM Element: component at 0x7fecf88c>]
firstChild will obviously pick up the first child, which is a comment and doesn't have any children of its own.
You could iterate over the children and skip all comment nodes.
Or you could ditch the DOM model and use ElementTree, which is so much nicer to work with. :)

It is legal; from XML 1.0 Reference:
2.5 Comments
[Definition: Comments may appear
anywhere in a document outside other
markup; in addition, they may appear
within the document type declaration
at places allowed by the grammar. They
are not part of the document's
character data; an XML processor MAY,
but need not, make it possible for an
application to retrieve the text of
comments. For compatibility, the
string " -- " (double-hyphen) MUST NOT
occur within comments.] Parameter
entity references MUST NOT be
recognized within comments.

To get better answers, show us (a) a small complete Python script and (b) a small complete XML document that together demonstrate the unexpected behaviour.
Have you considered using ElementTree?

That should be legal as long as the XML declaration is on the first line.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Reading text from XML nodes using Python's libxml2 - python

Related

Trouble retrieving text from XML with ElementTree with tags

Parsing XML data within tags using lxml in python

XML parsing in Python using Python 2 or 3

Targeting specific sub-elements when parsing XML with Python

Comments in XML at beginning of document

Categories

Resources