Newbie - I am trying to use lxml to find "error" in any element (sample XML file below, but it should work regardless of how nested the tags are):
<test>
<test1>
error
</test1>
<test2>
<test3>
error
</test3>
</test2>
</test>
So far it seems that lxml is only capable of searching for tags and not the data within the tags - is this correct?
Are you asking if there's a built-in function to search the text in an element? It's pretty simple to write your own search routine using lxml's etree parser. For example:
test.xml
<test>
<test1>
error
</test1>
<test2>
<test3>
error
</test3>
</test2>
</test>
And from the command line:
>>> import lxml.etree as etree
>>> for event, element in etree.iterparse("test.xml"):
...     # Print the tag of a matching element
...     if element.text and element.text.strip() == "error":
...         print(element.tag)
...
test1
test3
EDIT: If you end up going this route and don't need to muck about with XML namespaces, I recommend checking out xml.etree.cElementTree instead of lxml.etree. It's included in the Python standard library and is on par with or slightly faster than lxml.etree.
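The same search can also be done on an already-parsed tree with the standard library's ElementTree, walking every element with iter(). A minimal sketch; the inline XML string stands in for test.xml:

```python
import xml.etree.ElementTree as ET

xml = """<test>
  <test1>
    error
  </test1>
  <test2>
    <test3>
      error
    </test3>
  </test2>
</test>"""

root = ET.fromstring(xml)
# Collect the tag of every element whose trimmed text is exactly "error"
matches = [el.tag for el in root.iter() if el.text and el.text.strip() == "error"]
print(matches)  # ['test1', 'test3']
```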
Related
I'm having difficulty parsing an XML tree using xml.etree.ElementTree in Python. Basically, I'm making a request to an API that gives an XML response, and trying to extract the values of several elements in the tree.
This is what I've done so far with no success:
root = etree.fromstring(resp_arr[0])
walkscore = root.find('./walkscore')
Here is my XML tree:
<result>
<status>1</status>
<walkscore>95</walkscore>
<description>walker's paradise</description>
<updated>2009-12-25 03:40:16.006257</updated>
<logo_url>https://cdn.walk.sc/images/api-logo.png</logo_url>
<more_info_icon>https://cdn.walk.sc/images/api-more-info.gif</more_info_icon>
<ws_link>http://www.walkscore.com/score/1119-8th-Avenue-Seattle-WA-98101/lat=47.6085/lng=-122.3295/?utm_source=myrealtysite.com&utm_medium=ws_api&utm_campaign=ws_api</ws_link>
<help_link>https://www.redfin.com/how-walk-score-works</help_link>
<snapped_lat>47.6085</snapped_lat>
<snapped_lon>-122.3295</snapped_lon>
</result>
Essentially, I'm trying to pull the walkscores from the XML document but my code isn't returning a value. Does anyone with experience using ElementTree have any advice to help me extract the values I'm after?
Sam
Your XML appears to be malformed. But if I replace the bare instances of & with &amp;, then it's parseable:
>>> from xml.etree import ElementTree as ET
>>> tree = ET.fromstring(xml)
>>> tree.find('./walkscore').text
'95'
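One way to repair the response before parsing is to escape any bare & that is not already part of a character or entity reference. A sketch; the regex and the inline sample document are illustrative, not part of the original question:

```python
import re
from xml.etree import ElementTree as ET

raw = ('<result><walkscore>95</walkscore>'
       '<ws_link>http://example.com/?a=1&b=2</ws_link></result>')

# Escape bare ampersands that do not start an entity or character reference
fixed = re.sub(r'&(?!amp;|lt;|gt;|quot;|apos;|#)', '&amp;', raw)

tree = ET.fromstring(fixed)
print(tree.find('./walkscore').text)  # '95'
```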
Example XML doc:
<main>
<this test="500">
<that test="200"/>
</this>
</main>
Result: 700
All the existing code snippets I've found on this site and others rely on a tag. For example, you could only get "500" if you reference both "this" and "test". I'm looking to search only for "test" throughout the entire doc/string.
Some modules that I've tried (without success) are lxml, xml.dom, ElementTree, xmltodict, and BeautifulSoup.
I'd suggest favoring lxml for parsing XML in Python. lxml has full XPath 1.0 support, and XPath is the language/technology designed specifically for querying XML.
Once you have the right tool for the job, you can do something like this, for example:
import lxml.etree as et
xml = """<main>
<this test="500">
<that test="200"/>
</this>
</main>"""
doc = et.fromstring(xml)
result = doc.xpath("sum(//@test)")
print(result)
Output:
700.0
A brief explanation of the XPath used:
//@test : finds every test attribute anywhere in the XML document.
sum() : returns the sum of the selected values.
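If lxml is not available, the same sum can be computed with the standard library by walking the tree manually. A sketch over the same sample document:

```python
import xml.etree.ElementTree as ET

xml = """<main>
  <this test="500">
    <that test="200"/>
  </this>
</main>"""

root = ET.fromstring(xml)
# Add up every "test" attribute found anywhere in the document
total = sum(float(el.get("test")) for el in root.iter() if el.get("test") is not None)
print(total)  # 700.0
```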
I am trying to parse an XML file using BeautifulSoup. Consider a sample input XML file as follows:
<DOC>
<DOCNO>1</DOCNO>
....
</DOC>
<DOC>
<DOCNO>2</DOCNO>
....
</DOC>
...
This file consists of 130 <DOC> tags. However, when I tried to parse it using BeautifulSoup's findAll function, it retrieves a random number of tags (usually between 15 and 25) but never 130. The code I used was as follows:
from bs4 import BeautifulSoup
z = open("filename").read()
soup = BeautifulSoup(z, "lxml")
print(len(soup.findAll('doc')))
#more code involving manipulation of results
Can anybody tell me what I am doing wrong? Thanks in advance!
You are telling BeautifulSoup to use the HTML parser provided by lxml. If you have an XML document, you should stick to the XML parser option:
soup = BeautifulSoup(z, 'xml')
otherwise the parser will attempt to 'repair' the XML to fit HTML rules. XML parsing in BeautifulSoup is also handled by the lxml library.
Note that XML is case sensitive so you'll need to search for the DOC element now.
For XML documents it may be that the ElementTree API offered by lxml is more productive; it supports XPath queries for example, while BeautifulSoup does not.
However, from your sample it looks like there is no single top-level element; it is as if your document consists of a whole series of XML documents instead. This makes your input invalid, and a parser may stick to parsing only the first element as the top-level document.
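If you control the input, one workaround is to wrap the sibling <DOC> documents in a synthetic root element before handing them to a strict parser. A sketch with two stand-in fragments:

```python
from xml.etree import ElementTree as ET

fragments = '<DOC><DOCNO>1</DOCNO></DOC><DOC><DOCNO>2</DOCNO></DOC>'

# A synthetic root turns the sequence of documents into one well-formed document
root = ET.fromstring('<root>%s</root>' % fragments)
print(len(root.findall('DOC')))  # 2
```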
I am trying to use lxml.etree to parse an XML file and find text within elements of the XML.
The XML files can look like this:
<?xml version="1.0" encoding="UTF-8"?>
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/
http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd">
<responseDate>2002-06-01T19:20:30Z</responseDate>
<request verb="ListRecords" from="1998-01-15"
set="physics:hep"
metadataPrefix="oai_rfc1807">
http://an.oa.org/OAI-script</request>
<ListRecords>
<record>
<header>
<identifier>oai:arXiv.org:hep-th/9901001</identifier>
<datestamp>1999-12-25</datestamp>
<setSpec>physics:hep</setSpec>
<setSpec>math</setSpec>
</header>
<metadata>
<rfc1807 xmlns=
"http://info.internet.isi.edu:80/in-notes/rfc/files/rfc1807.txt"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation=
"http://info.internet.isi.edu:80/in-notes/rfc/files/rfc1807.txt
http://www.openarchives.org/OAI/1.1/rfc1807.xsd">
<bib-version>v2</bib-version>
<id>hep-th/9901001</id>
<entry>January 1, 1999</entry>
<title>Investigations of Radioactivity</title>
<author>Ernest Rutherford</author>
<date>March 30, 1999</date>
</rfc1807>
</metadata>
<about>
<oai_dc:dc
xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/
http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
<dc:publisher>Los Alamos arXiv</dc:publisher>
<dc:rights>Metadata may be used without restrictions as long as
the oai identifier remains attached to it.</dc:rights>
</oai_dc:dc>
</about>
</record>
<record>
<header status="deleted">
<identifier>oai:arXiv.org:hep-th/9901007</identifier>
<datestamp>1999-12-21</datestamp>
</header>
</record>
</ListRecords>
</OAI-PMH>
For the following part we assume doc = etree.parse("/tmp/test.xml") where test.xml contains the XML pasted above.
First I try to find all the <record> elements using doc.findall(".//record") but it returns an empty list.
Secondly, for a given word I'd like to check if it is in the <dc:publisher>.
To achieve this I first try to do the same as earlier: doc.findall(".//publisher") but I have the same issue... I'm pretty sure all of this is linked to namespaces but I don't know how to handle them.
I've read the lxml tutorial, and tried the example for the findall method on a basic XML file (without any namespace) and it worked.
As Chris has already mentioned, you can also use lxml and xpath. As xpath doesn't allow you to write the namespaced names in full like {http://www.openarchives.org/OAI/2.0/}record (so-called "James Clark notation" *), you will have to use prefixes, and provide the xpath engine with a prefix-to-namespace-uri mapping.
Example with lxml (assuming you already have the desired tree object):
nsmap = {'oa':'http://www.openarchives.org/OAI/2.0/',
'dc':'http://purl.org/dc/elements/1.1/'}
tree.xpath('//oa:record[descendant::dc:publisher[contains(., "Alamos")]]',
namespaces=nsmap)
This will select all {http://www.openarchives.org/OAI/2.0/}record elements that have a descendant element {http://purl.org/dc/elements/1.1/}publisher containing the word "Alamos".
[*] this comes from an article where James Clark explains XML Namespaces, everyone not familiar with namespaces should read this! (even if it was written a long time ago)
Disclaimer: I am using the standard library xml.etree.ElementTree module, not the lxml library (although lxml implements a superset of this API as far as I know). I'm sure there is a much simpler answer than mine which uses lxml and XPath, but I don't know it.
Namespace issue
You were right to say that the problem is likely the namespaces. There is no record element in your XML file, but there are two {http://www.openarchives.org/OAI/2.0/}record elements in the file, as the following demonstrates:
>>> import xml.etree.ElementTree as etree
>>> xml_string = ...Your XML to parse...
>>> e = etree.fromstring(xml_string)
# Let's see what the root element is
>>> e
<Element {http://www.openarchives.org/OAI/2.0/}OAI-PMH at 7f39ebf54f80>
# Let's see what children there are of the root element
>>> for child in e:
...     print(child)
...
<Element {http://www.openarchives.org/OAI/2.0/}responseDate at 7f39ebf54fc8>
<Element {http://www.openarchives.org/OAI/2.0/}request at 7f39ebf58050>
<Element {http://www.openarchives.org/OAI/2.0/}ListRecords at 7f39ebf58098>
# Finally, let's get the children of the `ListRecords` element
>>> for child in e[-1]:
...     print(child)
...
<Element {http://www.openarchives.org/OAI/2.0/}record at 7f39ebf580e0>
<Element {http://www.openarchives.org/OAI/2.0/}record at 7f39ebf58908>
So, for example
>>> e.find('ListRecords')
returns None, whereas
>>> e.find('{http://www.openarchives.org/OAI/2.0/}ListRecords')
<Element {http://www.openarchives.org/OAI/2.0/}ListRecords at 7f39ebf58098>
returns the ListRecords element.
Note that I am using the find method since the standard library ElementTree does not have an xpath method.
Possible solution
One way to solve this is to get the namespace prefix and prepend it to the tag you are trying to find. You can use
>>> e.tag[:e.tag.index('}')+1]
'{http://www.openarchives.org/OAI/2.0/}'
on the root element, e, to find the namespace, although I'm sure there is a better way of doing this.
Now we can define functions to extract the tags we want, with an optional namespace prefix:
def findallNS(element, tag, namespace=None):
    if namespace is not None:
        return element.findall(namespace + tag)
    else:
        return element.findall(tag)

def findNS(element, tag, namespace=None):
    if namespace is not None:
        return element.find(namespace + tag)
    else:
        return element.find(tag)
So now we can write:
>>> list_records = findNS(e, 'ListRecords', namespace)
>>> findallNS(list_records, 'record', namespace)
[<Element {http://www.openarchives.org/OAI/2.0/}record at 7f39ebf580e0>,
<Element {http://www.openarchives.org/OAI/2.0/}record at 7f39ebf58908>]
Alternative solution
Another solution may be to write a function that searches for all tags which end with the tag you are interested in, for example:
def find_child_tags(element, tag):
    return [child for child in element if child.tag.endswith(tag)]
Here you don't need to deal with the namespace at all.
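As a usage example, with a toy namespaced document (the namespace URI here is made up for illustration):

```python
import xml.etree.ElementTree as ET

def find_child_tags(element, tag):
    # Match children by the local part of their tag, ignoring any {namespace} prefix
    return [child for child in element if child.tag.endswith(tag)]

xml = '<r xmlns="http://example.com/ns"><record/><record/><other/></r>'
root = ET.fromstring(xml)
print(len(find_child_tags(root, 'record')))  # 2
```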
Chris's answer is very good and it will work with lxml too. Here is another way using lxml (it works the same way with xpath instead of find):
In [37]: xml.find('.//n:record', namespaces={'n': 'http://www.openarchives.org/OAI/2.0/'})
Out[37]: <Element {http://www.openarchives.org/OAI/2.0/}record at 0x2a451e0>
This XML file is named example.xml:
<?xml version="1.0"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
<modelVersion>14.0.0</modelVersion>
<groupId>.com.foobar.flubber</groupId>
<artifactId>uberportalconf</artifactId>
<version>13-SNAPSHOT</version>
<packaging>pom</packaging>
<name>Environment for UberPortalConf</name>
<description>This is the description</description>
<properties>
<birduberportal.version>11</birduberportal.version>
<promotiondevice.version>9</promotiondevice.version>
<foobarportal.version>6</foobarportal.version>
<eventuberdevice.version>2</eventuberdevice.version>
</properties>
<!-- A lot more here, but as it is irrelevant for the problem I have removed it -->
</project>
If I load the example.xml file above using ElementTree and print the root node:
>>> from xml.etree import ElementTree
>>> tree = ElementTree.parse('example.xml')
>>> print(tree.getroot())
<Element '{http://maven.apache.org/POM/4.0.0}project' at 0x26ee0f0>
I see that Element also contains the namespace http://maven.apache.org/POM/4.0.0.
How do I:
1. Get the foobarportal.version text, increase it by one and write the XML file back, keeping the namespace the document had when loaded and without changing the overall XML layout.
2. Get it to load using any namespace, not just http://maven.apache.org/POM/4.0.0. I still don't want to strip the namespace, as I want the XML to stay the same except for changing foobarportal.version as in 1 above.
The current way is not XML aware but fulfills 1 and 2 above:
1. Grep for <foobarportal.version>(.*)</foobarportal.version>
2. Take the contents of the match group and increase it by one.
3. Write it back.
It would be nice to have an XML aware solution, as it would be more robust. The XML namespace handling of ElementTree is making it more complicated.
If your question is simply "how do I search by a namespaced element name", then the answer is that both ElementTree and lxml understand the {namespace}tag syntax. Note that the root element itself is already the {http://maven.apache.org/POM/4.0.0}project element, so you search from it for its children, for example:
tree.getroot().find('{http://maven.apache.org/POM/4.0.0}properties')
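For the incrementing part of the question, here is a minimal stdlib sketch (the inline XML is a trimmed stand-in for example.xml; register_namespace keeps the default namespace unprefixed when writing the tree back out):

```python
import xml.etree.ElementTree as ET

NS = 'http://maven.apache.org/POM/4.0.0'
ET.register_namespace('', NS)  # serialize the default namespace without a prefix

xml = ('<project xmlns="%s"><properties>'
       '<foobarportal.version>6</foobarportal.version>'
       '</properties></project>' % NS)

root = ET.fromstring(xml)
# Find the version element anywhere in the tree by its namespaced name
elem = root.find('.//{%s}foobarportal.version' % NS)
elem.text = str(int(elem.text) + 1)

print(elem.text)  # '7'
print(ET.tostring(root).decode())
```

Note that ElementTree does not preserve comments or the original whitespace layout on round-trip, so for a strict "don't change anything else" requirement, lxml (which keeps both) may be a better fit.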