How to get xmlns attributes using lxml objectify? - python

I have several XML documents I am dealing with. They have differing root elements. Here are some of them:
<rss xmlns:npr="http://www.npr.org/rss/" xmlns:nprml="http://api.npr.org/nprml" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:content="http://purl.org/rss/1.0/modules/content/" version="2.0">
<rss version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd">
<?xml-stylesheet type="text/xsl" media="screen" href="/~d/styles/rss2enclosuresfull.xsl"?><?xml-stylesheet type="text/css" media="screen" href="http://feeds.thisamericanlife.org/~d/styles/itemcontent.css"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:media="http://search.yahoo.com/mrss/" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0" version="2.0" xml:base="http://www.thisamericanlife.org">
I am using lxml in the following way on the first example from above.
>>> from lxml import objectify
>>> root = objectify.parse('file_for_first_example').getroot() # the file contains valid XML with the first root element above
>>> print root.tag
rss
>>> root.attrib.keys()
['version']
>>> for k in root.attrib.iterkeys():
...     print k
version
>>> print root.get("xmlns:npr")
None
I just want to be able to see what these namespace 'attributes' are so that I can, I believe, infer the format of the various feeds.
Thanks for the help in advance. Love and peace.

The namespace declarations are namespace nodes, not ordinary attributes. It looks like you want the .nsmap property: http://lxml.de/tutorial.html#namespaces
xhtml.nsmap
{None: 'http://www.w3.org/1999/xhtml'}
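Applied to the first feed in the question, that should give a mapping of prefixes to namespace URIs along these lines:
>>> root.nsmap
{'npr': 'http://www.npr.org/rss/',
 'nprml': 'http://api.npr.org/nprml',
 'itunes': 'http://www.itunes.com/dtds/podcast-1.0.dtd',
 'content': 'http://purl.org/rss/1.0/modules/content/'}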

Related

Round-tripping Python's ElementTree from/tostring drops namespaces

I've got a base XML string that I want to build off of, so the first thing I do is parse the XML string into an etree.
However, it looks like the other namespaces "d" and "m" are being ignored. I can successfully parse the string into an XML Element:
import xml.etree.ElementTree as ET
BASE = """<?xml version="1.0" encoding="utf-8" ?>
<feed
xml:base="https://www.nuget.org/api/v2/"
xmlns="http://www.w3.org/2005/Atom"
xmlns:d="http://schemas.microsoft.com/ado/2007/08/dataservices"
xmlns:m="http://schemas.microsoft.com/ado/2007/08/dataservices/metadata"
>
</feed>
"""
a = ET.fromstring(BASE)
# <Element '{http://www.w3.org/2005/Atom}feed' at 0x000002264B03F778>
But when we convert back to string, we drop the "d" and "m" namespaces:
ET.tostring(a)
# Formatted manually for StackOverflow
# b'<ns0:feed
# xmlns:ns0="http://www.w3.org/2005/Atom"
# xml:base="https://www.nuget.org/api/v2/">
# </ns0:feed>'
So what's going on here?
It appears that unused namespaces are dropped. If you change your BASE to something like this:
BASE = """<?xml version="1.0" encoding="utf-8" ?>
<feed
xml:base="https://www.nuget.org/api/v2/"
xmlns="http://www.w3.org/2005/Atom"
xmlns:d="http://schemas.microsoft.com/ado/2007/08/dataservices"
xmlns:m="http://schemas.microsoft.com/ado/2007/08/dataservices/metadata"
>
<m:properties>
<d:Id>NuGetTest</d:Id>
</m:properties>
</feed>
"""
You'll see the missing namespaces:
>>> a = ET.fromstring(BASE)
>>> ET.tostring(a)
b'<ns0:feed
xmlns:ns0="http://www.w3.org/2005/Atom"
xmlns:ns1="http://schemas.microsoft.com/ado/2007/08/dataservices/metadata"
xmlns:ns2="http://schemas.microsoft.com/ado/2007/08/dataservices"
xml:base="https://www.nuget.org/api/v2/">
<ns1:properties>
<ns2:Id>NuGetTest</ns2:Id>
</ns1:properties>
</ns0:feed>'
Note that the prefixes change: d becomes ns2 and m becomes ns1. I'm not sure exactly how Python assigns these, but it looks like it's based on which namespace is used first.
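If you also want ElementTree to keep the original prefixes when it serializes, one option (a sketch on top of the answer above, not something the question's code does) is to register them before calling tostring(); namespaces that are never used are still dropped:
import xml.etree.ElementTree as ET

# Tell ElementTree which prefix to use for each URI when serializing.
ET.register_namespace('', 'http://www.w3.org/2005/Atom')
ET.register_namespace('d', 'http://schemas.microsoft.com/ado/2007/08/dataservices')
ET.register_namespace('m', 'http://schemas.microsoft.com/ado/2007/08/dataservices/metadata')

a = ET.fromstring(BASE)
print(ET.tostring(a))  # used namespaces now come out as d: and m: instead of ns1:/ns2: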

XPath with LXML Element

I am trying to parse an XML document using lxml etree. The XML doc I am parsing looks like this:
<?xml version="1.0" encoding="UTF-8"?>
<metadata xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://www.openarchives.org/OAI/2.0/">
<codeBook version="2.5" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="ddi:codebook:2_5" xsi:schemaLocation="ddi:codebook:2_5 http://www.ddialliance.org/Specification/DDI-Codebook/2.5/XMLSchema/codebook.xsd">
<docDscr>
<citation>
<titlStmt>
<titl>Test Title</titl>
</titlStmt>
<prodStmt>
<prodDate/>
</prodStmt>
</citation>
</docDscr>
<stdyDscr>
<citation>
<titlStmt>
<titl>Test Title 2</titl>
<IDNo agency="UKDA">101</IDNo>
</titlStmt>
<rspStmt>
<AuthEnty>TestAuthEntry</AuthEnty>
</rspStmt>
<prodStmt>
<copyright>Yes</copyright>
</prodStmt>
<distStmt/>
<verStmt>
<version date="">1</version>
</verStmt>
</citation>
<stdyInfo>
<subject>
<keyword>2009</keyword>
<keyword>2010</keyword>
<topcClas>CLASS</topcClas>
<topcClas>ffdsf</topcClas>
</subject>
<abstract>This is an abstract piece of text.</abstract>
<sumDscr>
<timePrd event="single">2020</timePrd>
<nation>UK</nation>
<anlyUnit>Test</anlyUnit>
<universe>test</universe>
<universe>hello</universe>
<dataKind>fdsfdsf</dataKind>
</sumDscr>
</stdyInfo>
<method>
<dataColl>
<timeMeth>test timemeth</timeMeth>
<dataCollector>test data collector</dataCollector>
<sampProc>test sampprocess</sampProc>
<deviat>test deviat</deviat>
<collMode>test collMode</collMode>
<sources/>
</dataColl>
</method>
<dataAccs>
<setAvail>
<accsPlac>Test accsPlac</accsPlac>
</setAvail>
<useStmt>
<restrctn>NONE</restrctn>
</useStmt>
</dataAccs>
<othrStdyMat>
<relPubl>122</relPubl>
<relPubl>12332</relPubl>
</othrStdyMat>
</stdyDscr>
</codeBook>
</metadata>
I wrote the following code to try and process it:
from lxml import etree
import pdb
f = open('/vagrant/out2.xml', 'r')
xml_str = f.read()
xml_doc = etree.fromstring(xml_str)
f.close()
From what I understand from the lxml xpath docs, I should be able to get the text from a specific element as follows:
xml_doc.xpath('/metadata/codeBook/docDscr/citation/titlStmt/titl/text()')
However, when I run this it returns an empty array.
The only xpath I can get to return something is using a wildcard:
xml_doc.xpath('*')
Which returns [<Element {ddi:codebook:2_5}codeBook at 0x7f8da8a413f8>].
I've read through the docs and I'm not understanding what is going wrong with this. Any help is appreciated.
You need to take the default namespaces into account, so instead of
xml_doc.xpath('/metadata/codeBook/docDscr/citation/titlStmt/titl/text()')
use
xml_doc.xpath(
    '/oai:metadata/ddi:codeBook/ddi:docDscr/ddi:citation/ddi:titlStmt/ddi:titl/text()',
    namespaces={
        'oai': 'http://www.openarchives.org/OAI/2.0/',
        'ddi': 'ddi:codebook:2_5'
    }
)
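The oai and ddi prefixes here are just names invented for the query and bound in the namespaces dict; they don't need to appear in the document itself. Against the file above, this should return something like:
['Test Title']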

How to resolve external entities with xml.etree like lxml.etree

I have a script that parses XML using lxml.etree:
from lxml import etree
parser = etree.XMLParser(load_dtd=True, resolve_entities=True)
tree = etree.parse('main.xml', parser=parser)
I need load_dtd=True and resolve_entities=True to have &emptyEntry; from globals.xml resolved:
<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE map SYSTEM "globals.xml" [
<!ENTITY dirData "${DATADIR}">
]>
<map
xmlns:map="http://my.dummy.org/map"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://my.dummy.org/map main.xsd"
>
&emptyEntry; <!-- from globals.xml -->
<entry><key>KEY</key><value>VALUE</value></entry>
<entry><key>KEY</key><value>VALUE</value></entry>
</map>
where globals.xml contains:
<?xml version="1.0" encoding="UTF-8"?>
<!ENTITY emptyEntry "<entry></entry>">
Now I would like to move from the non-standard lxml to the standard xml.etree. But this fails with my file because load_dtd=True and resolve_entities=True are not supported by xml.etree.
Is there an xml.etree-way to have these entities resolved?
My trick is to use the external program xmllint:
import subprocess, StringIO
from xml.etree import ElementTree

proc = subprocess.Popen(['xmllint', '--noent', fname], stdout=subprocess.PIPE)  # fname: path to main.xml
output = proc.communicate()[0]
tree = ElementTree.parse(StringIO.StringIO(output))
lxml is the right tool for the job.
But, if you want to use stdlib, then be prepared for difficulties and take a look at XMLParser's UseForeignDTD method. Here's a good (but hacky) example: Python ElementTree support for parsing unknown XML entities?
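As a rough illustration of that approach (a sketch only; note that xml.etree inserts the replacement as plain text, not parsed markup, so it can make the parse succeed but cannot expand &emptyEntry; into a real <entry> element):
import xml.etree.ElementTree as ET

parser = ET.XMLParser()
# Entities the parser cannot resolve itself are looked up in this dict by name.
parser.entity['emptyEntry'] = ''  # empty replacement: parsing succeeds, but no element is inserted
tree = ET.parse('main.xml', parser=parser)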

Python: libxml2 xpath returns empty list

I want to parse XML content with Python's libxml2 using XPath; I followed this example and that tutorial. The XML file is:
<?xml version="1.0" encoding="UTF-8"?>
<feed xmlns="http://purl.org/atom/ns#" version="0.3">
<title>Gmail - Inbox for myemailaddress@gmail.com</title>
<tagline>New messages in your Gmail Inbox</tagline>
<fullcount>1</fullcount>
<link rel="alternate" href="http://mail.google.com/mail" type="text/html"/>
<modified>2011-05-04T18:56:19Z</modified>
</feed>
This XML is stored in a file called "atom", and I try the following:
>>> import libxml2
>>> myfile = open('/pathtomyfile/atom', 'r').read()
>>> xmldata = libxml2.parseDoc(myfile)
>>> xmldata.xpathEval('/fullcount')
[]
>>>
Now, as you can see, it returns an empty list. No matter what I give XPath, it returns an empty list. However, if I use the * wildcard, I get a list of all nodes:
>>> xmldata.xpathEval('//*')
[<xmlNode (feed) object at 0xb73862cc>, <xmlNode (title) object at 0xb738650c>, <xmlNode (tagline) object at 0xb73865ec>, <xmlNode (fullcount) object at 0xb738660c>, <xmlNode (link) object at 0xb738662c>, <xmlNode (modified) object at 0xb738664c>]
Now I don't understand, judging from the working examples above, why XPath doesn't find the "fullcount" node or any other: I'm using the same syntax, after all...
Any idea or suggestion? Thanks.
Your XPath is failing because you need to specify the purl namespace on the node:
import libxml2
tree = libxml2.parseDoc(data)  # data is the XML document as a string
xp = tree.xpathNewContext()
xp.xpathRegisterNs("purl", "http://purl.org/atom/ns#")
print xp.xpathEval('//purl:fullcount')
Result:
[<xmlNode (fullcount) object at 0x7fbbeba9ef80>]
(Also: check out lxml, it has a nicer, higher-level interface).
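(And if you do stay with libxml2, remember to free things yourself when you are done, e.g. xp.xpathFreeContext() and tree.freeDoc(); the Python bindings do not manage that memory for you.)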
Firstly:
/fullcount is an absolute path, so it's looking for <fullcount> as the document's root element, when the element is in fact inside the <feed> element.
Secondly:
You need to specify the namespace. This is how you would do it with lxml:
import lxml.etree as etree
tree = etree.parse('/pathtomyfile/atom')
fullcounts = tree.xpath('//ns:fullcount',
                        namespaces={'ns': "http://purl.org/atom/ns#"})
print etree.tostring(fullcounts[0])
Which would give you:
<fullcount xmlns="http://purl.org/atom/ns#">1</fullcount>
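If you only need the value, the element's text should give it to you directly:
print fullcounts[0].text
# 1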

Convert XML to python objects using lxml

I'm trying to use the lxml library to parse an XML file. What I want is to use XML as the data source but still keep the normal Django way of interacting with the resulting objects. From the docs, I can see that lxml.objectify is what I'm supposed to use, but I don't know how to proceed after: list = objectify.parse('myfile.xml')
Any help will be very much appreciated. Thanks.
A sample of the file (has about 100+ records) is this:
<store>
<book>
<publisher>Hodder &...</publisher>
<isbn>345123890</isbn>
<author>King</author>
<comments>
<comment rank='1'>Interesting</comment>
</comments>
<pages>200</pages>
</book>
<book>
<publisher>Penguin Books</publisher>
<isbn>9011238XX</isbn>
<author>Armstrong</author>
<comments />
<pages>150</pages>
</book>
</store>
From this, I want to do the following (something just as easy to write as Books.objects.all() and Books.object.get_object_or_404(isbn=selected) is most preferred):
Display a list of all books with their respective attributes
Enable viewing of further details of a book by selecting it from the list
Firstly, "list" isn't a very good variable because it "shadows" the built-in type "list."
Now, say you have this xml:
<root>
<node1 val="foo">derp</node1>
<node2 val="bar" />
</root>
Now, you could do this:
root = objectify.parse("myfile.xml")
print root.node1.get("val") # prints "foo"
print root.node1.text # prints "derp"
print root.node2.get("val") # prints "bar"
Another tip: when you have lots of nodes with the same name, you can loop over them.
>>> xml = """<root>
<node val="foo">derp</node>
<node val="bar" />
</root>"""
>>> root = objectify.fromstring(xml)
>>> for node in root.node:
...     print node.get("val")
foo
bar
Edit
You should be able to simply set your Django context to the books object and use that from your templates.
context = dict(books=root.book,
               # other stuff
               )
And then you'll be able to iterate through the books in the template, and access each book object's attributes.
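To get something closer to the Books.objects.all() / get_object_or_404(isbn=selected) feel from the question, here is a rough sketch (the helper function and the str() comparison are mine, not lxml or Django API):
from lxml import objectify

root = objectify.parse('myfile.xml').getroot()

# "all books": root.book iterates over every <book> element in the store
for book in root.book:
    print book.isbn, book.author, book.pages

# "get by isbn": return the matching book, or None if there is no match
def get_book_by_isbn(root, isbn):
    for book in root.book:
        if str(book.isbn) == str(isbn):  # compare as strings; ISBNs are not always numeric
            return book
    return None

selected_book = get_book_by_isbn(root, '9011238XX')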
