Convert XML to python objects using lxml - python

I'm trying to use the lxml library to parse an XML file. I want to use XML as the data source but still keep the normal Django way of interacting with the resulting objects. From the docs, I can see that lxml.objectify is what I'm supposed to use, but I don't know how to proceed after: list = objectify.parse('myfile.xml')
Any help will be very much appreciated. Thanks.
A sample of the file (which has 100+ records) is this:
<store>
<book>
<publisher>Hodder &...</publisher>
<isbn>345123890</isbn>
<author>King</author>
<comments>
<comment rank='1'>Interesting</comment>
</comments>
<pages>200</pages>
</book>
<book>
<publisher>Penguin Books</publisher>
<isbn>9011238XX</isbn>
<author>Armstrong</author>
<comments />
<pages>150</pages>
</book>
</store>
From this, I want to do the following (ideally with something as easy to write as Books.objects.all() and Books.object.get_object_or_404(isbn=selected)):
Display a list of all books with their respective attributes
Enable viewing of further details of a book by selecting it from the list

Firstly, "list" isn't a very good variable name, because it shadows the built-in type "list".
Now, say you have this xml:
<root>
<node1 val="foo">derp</node1>
<node2 val="bar" />
</root>
Now, you could do this:
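from lxml import objectify
# objectify.parse() returns an ElementTree, so call getroot() to reach <root>
root = objectify.parse("myfile.xml").getroot()
print(root.node1.get("val"))  # prints "foo"
print(root.node1.text)  # prints "derp"
print(root.node2.get("val"))  # prints "bar"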
Another tip: when you have lots of nodes with the same name, you can loop over them.
>>> from lxml import objectify
>>> xml = """<root>
... <node val="foo">derp</node>
... <node val="bar" />
... </root>"""
>>> root = objectify.fromstring(xml)
>>> for node in root.node:
...     print(node.get("val"))
...
foo
bar
Edit
You should be able to simply set your Django context to the books object, and use that from your templates.
context = dict(
    books=root.book,
    # other stuff
)
And then you'll be able to iterate through the books in the template, and access each book object's attributes.
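As a rough illustration of the list and detail views the question asks for, here is a minimal sketch. It swaps in the standard library's xml.etree.ElementTree instead of lxml.objectify so that it runs without extra packages. The names load_books and get_book are hypothetical helpers, not part of Django or lxml, and the sample is a trimmed copy of the question's data with the truncated publisher value left out.

```python
import xml.etree.ElementTree as ET

# Trimmed, well-formed version of the sample from the question.
SAMPLE = """<store>
  <book>
    <publisher>Penguin Books</publisher>
    <isbn>9011238XX</isbn>
    <author>Armstrong</author>
    <comments />
    <pages>150</pages>
  </book>
  <book>
    <isbn>345123890</isbn>
    <author>King</author>
    <comments><comment rank="1">Interesting</comment></comments>
    <pages>200</pages>
  </book>
</store>"""


def load_books(xml_text):
    """Return one plain dict per <book>, keyed by child tag name."""
    root = ET.fromstring(xml_text)
    books = []
    for book in root.findall("book"):
        record = {child.tag: (child.text or "").strip() for child in book}
        # Nested <comment> elements are collected separately as a list.
        record["comments"] = [c.text for c in book.findall("comments/comment")]
        books.append(record)
    return books


def get_book(books, isbn):
    """Hypothetical stand-in for get_object_or_404: returns None if missing."""
    return next((b for b in books if b.get("isbn") == isbn), None)


books = load_books(SAMPLE)
print([b["author"] for b in books])
print(get_book(books, "345123890")["pages"])
```

In a view, a None result from get_book could be turned into an Http404, and the books list could be passed to the template context in the same way as root.book above.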

Related

Parsing nested attributes

Good day dear developers.
I can't fully parse an xml file.
The structure looks like:
<foo>
<bar1 id="1">
<bar2>
<foobar id="2">name1</foobar>
<foobar id="3">name2</foobar>
</bar2>
</bar1>
</foo>
I am using the xml.etree library, with code like:
source.get('Id')
to get the first attribute.
To get a nested tag I use code like:
source.find('bar/foobar').text
The question is how to get the next nested attributes (id=2 and id=3).
It shows an error when I try to use a path with a slash:
source.get('bar/id')
Other attempts give me only the first attribute, which I already have; the nested attributes also have the same name, id.
Thank you in advance for your help.
Below is a working example
import xml.etree.ElementTree as ET
xml = '''<foo>
<bar1 id="1">
<bar2>
<foobar id="2">name1</foobar>
<foobar id="3">name2</foobar>
</bar2>
</bar1>
</foo>'''
root = ET.fromstring(xml)
ids = [f.attrib.get('id') for f in root.findall('.//foobar')]
print(ids)
Output:
['2', '3']
You need to specify a working XPath expression, like:
foobars = source.findall('bar1/bar2/foobar')
for elem in foobars:
    print(elem.get('id'))
Output:
2
3
It works now for one row, but what if we have several bar1 elements? Like this:
<foo>
<bar1 id="1">
<bar2>
<foobar id="2">name1</foobar>
<foobar id="3">name2</foobar>
</bar2>
</bar1>
<bar1 id="2">
<bar2>
<foobar id="2">name3</foobar>
<foobar id="3">name4</foobar>
</bar2>
</bar1>
</foo>
The loop (findall, then for) prints all four ids, but I need just the two that belong to each row.
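One way to keep the ids grouped per row is to loop over the bar1 elements first and run the second findall relative to each of them. This is only a sketch under the assumption that "row" means one bar1 element; the rows dictionary is an illustrative name.

```python
import xml.etree.ElementTree as ET

xml = """<foo>
<bar1 id="1">
<bar2>
<foobar id="2">name1</foobar>
<foobar id="3">name2</foobar>
</bar2>
</bar1>
<bar1 id="2">
<bar2>
<foobar id="2">name3</foobar>
<foobar id="3">name4</foobar>
</bar2>
</bar1>
</foo>"""

root = ET.fromstring(xml)

# Search relative to each <bar1>, so every row keeps only its own <foobar> ids.
rows = {}
for bar1 in root.findall("bar1"):
    rows[bar1.get("id")] = [(f.get("id"), f.text) for f in bar1.findall("bar2/foobar")]

for bar1_id, foobars in rows.items():
    print(bar1_id, foobars)
```

Because the inner path starts at the current bar1 rather than at the document root, each row yields two foobar entries instead of all four.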

Parsing Weather XML with Python

I'm a beginner, but with a lot of effort I'm trying to parse some weather data from an .xml file called "weather.xml", which looks like this:
<?xml version="1.0" encoding="UTF-8"?>
<Weather>
<locality name="Rome" alt="21">
<situation temperature="18°C" temperatureF="64,4°F" humidity="77%" pression="1016 mb" wind="5 SSW km/h" windKN="2,9 SSW kn">
<description>clear sky</description>
<lastUpdate>17:45</lastUpdate>
</situation>
<sun sunrise="6:57" sunset="18:36" />
</locality>
</Weather>
I parsed some data from this XML and this is how my Python code looks now:
#!/usr/bin/python
from xml.dom import minidom
xmldoc = minidom.parse('weather.xml')
entry_situation = xmldoc.getElementsByTagName('situation')
entry_locality = xmldoc.getElementsByTagName('locality')
print(entry_locality[0].attributes['name'].value)
print("Temperature: " + entry_situation[0].attributes['temperature'].value)
print("Humidity: " + entry_situation[0].attributes['humidity'].value)
print("Pression: " + entry_situation[0].attributes['pression'].value)
It's working fine, but if I try to read the "description" or "lastUpdate" node with the same method I get an error, so this approach must be wrong for those nodes, which I can see are different.
I'm also trying to write the output to a log file, with no success; the most I get is an empty file.
Thank you for your time reading this.
It is because "description" and "lastUpdate" are not attributes but child nodes of the "situation" node.
Try:
d = entry_situation[0].getElementsByTagName("description")[0]
print("Description: %s" % d.firstChild.nodeValue)
You should use the same method to access the "situation" node from its parent "locality".
By the way, you should take a look at the lxml module, especially the objectify API, as yegorich said. It is easier to use.
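To make the attribute versus child-node distinction concrete, here is a small sketch with minidom that walks from "locality" down to "situation" and reads both kinds of data. The sample is a trimmed copy of the question's file, and child_text is a hypothetical helper name.

```python
from xml.dom import minidom

# Trimmed version of weather.xml from the question.
SAMPLE = """<Weather>
  <locality name="Rome" alt="21">
    <situation humidity="77%" pression="1016 mb">
      <description>clear sky</description>
      <lastUpdate>17:45</lastUpdate>
    </situation>
    <sun sunrise="6:57" sunset="18:36" />
  </locality>
</Weather>"""

doc = minidom.parseString(SAMPLE)


def child_text(parent, tag):
    """Text of the first <tag> element found under parent."""
    node = parent.getElementsByTagName(tag)[0]
    return node.firstChild.nodeValue


# Navigate parent -> child instead of searching the whole document.
locality = doc.getElementsByTagName("locality")[0]
situation = locality.getElementsByTagName("situation")[0]

report = [
    "Locality: " + locality.getAttribute("name"),        # attribute
    "Humidity: " + situation.getAttribute("humidity"),   # attribute
    "Description: " + child_text(situation, "description"),  # child node
    "Last update: " + child_text(situation, "lastUpdate"),   # child node
]
print("\n".join(report))
```

As for the empty log file, the question does not show the writing code, so the cause is unknown. One thing worth checking is that the file is closed after writing, for example by writing the report lines inside a with open(...) block, since buffered output is only guaranteed to reach the file once it is flushed or closed.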

Python - parse xml with lxml trouble

I've found a lot of questions on this issue but nothing I saw fits mine. I'm new to lxml so need some help.
my users.xml file:
<?xml version="1.0" encoding="UTF-8"?>
<root>
<user>
<login>elena</login>
<password>elena</password>
<group>1</group>
</user>
<user>
<login>anele</login>
<password>anele</password>
<group>2</group>
</user>
</root>
The problematic function:
from lxml import etree
def analize_data(login):
    doc = etree.parse("/myapp/users.xml")
    for elem in doc.iter(tag='login'):
        if elem.text == login:
            parent = elem.getparent()
            group = etree.SubElement(parent, 'group')
            return group.text
What I need:
to find the user tag whose login matches the one passed to the function and get the text of that user's group subelement. But this function returns None when testing. What am I doing wrong and how do I fix it?
I'm new to all these things, so need help. Thanks in advance!
Try using:
group = next(parent.iterchildren(tag="group"))
etree.SubElement does something completely different:
This function creates an element instance, and appends it to an existing element.
Which is clearly not what you want.
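For comparison, here is a sketch of the lookup written against the standard library's xml.etree.ElementTree rather than lxml, so it runs without extra packages. ElementTree has no getparent(), so this version loops over the user elements and reads both children with findtext. The name group_for is hypothetical, and the XML is inlined from the question instead of being read from /myapp/users.xml.

```python
import xml.etree.ElementTree as ET

# Inlined copy of users.xml from the question.
USERS = """<root>
  <user><login>elena</login><password>elena</password><group>1</group></user>
  <user><login>anele</login><password>anele</password><group>2</group></user>
</root>"""


def group_for(root, login):
    """Return the <group> text of the <user> whose <login> matches, else None."""
    for user in root.findall("user"):
        if user.findtext("login") == login:
            # Read the existing child; nothing new is created or appended.
            return user.findtext("group")
    return None


root = ET.fromstring(USERS)
print(group_for(root, "anele"))
```

Unlike etree.SubElement, findtext only reads the tree, so the document is left unchanged.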

Element Tree doesn't load a Google Earth-exported KML

I have a problem related to a Google Earth exported KML, as it doesn't seem to work well with Element Tree. I don't have a clue where the problem might lie, so I will explain how I do everything.
Here is the relevant code:
kmlFile = open( filePath, 'r' ).read( -1 ) # read the whole file as text
kmlFile = kmlFile.replace( 'gx:', 'gx' ) # we need this as otherwise the Element Tree parser
# will give an error
kmlData = ET.fromstring( kmlFile )
document = kmlData.find( 'Document' )
With this code, ET (Element Tree object) creates an Element object accessible via variable kmlData. It points to the root element ('kml' tag). However, when I run a search for the sub-element 'Document', it returns None. Although the 'Document' tag is present in the KML file!
Are there any other discrepancies between KMLs and XMLs apart from the 'gx: smth' tags? I have searched through the KML files I am dealing with and found nothing suspicious. Here is a simplified structure of a KML file the program is supposed to deal with:
<?xml version="1.0" encoding="UTF-8"?>
<kml xmlns="http://earth.google.com/kml/2.2">
<Document>
<name>UK.kmz</name>
<Style id="sh_blu-blank">
<IconStyle>
<scale>1.3</scale>
<Icon>
<href>http://maps.google.com/mapfiles/kml/paddle/blu-blank.png</href>
</Icon>
<hotSpot x="32" y="1" xunits="pixels" yunits="pixels"/>
</IconStyle>
<ListStyle>
<ItemIcon>
<href>http://maps.google.com/mapfiles/kml/paddle/blu-blank-lv.png</href>
</ItemIcon>
</ListStyle>
</Style>
[other style tags...]
<Folder>
<name>UK</name>
<Placemark>
<name>1262 Crossness Pumping Station</name>
<LookAt>
<longitude>0.1329926667038817</longitude>
<latitude>51.50303535104574</latitude>
<altitude>0</altitude>
<range>4246.539753518848</range>
<tilt>0</tilt>
<heading>-4.295161152207489</heading>
<altitudeMode>relativeToGround</altitudeMode>
<gx:altitudeMode>relativeToSeaFloor</gx:altitudeMode>
</LookAt>
<styleUrl>#msn_blu-blank15000</styleUrl>
<Point>
<coordinates>0.1389579668507301,51.50888923518947,0</coordinates>
</Point>
</Placemark>
[other placemark tags...]
</Folder>
</Document>
</kml>
Do you have an idea why I can't access any sub-elements of 'kml'? By the way, Python version is 2.7.
The KML document is in the http://earth.google.com/kml/2.2 namespace, as indicated by
<kml xmlns="http://earth.google.com/kml/2.2">
This means that the name of the Document element is in fact {http://earth.google.com/kml/2.2}Document.
Instead of this:
document = kmlData.find('Document')
you need this:
document = kmlData.find('{http://earth.google.com/kml/2.2}Document')
However, there is a problem with the XML file. There is an element called gx:altitudeMode. The gx bit is a namespace prefix. Such a prefix needs to be declared, but the declaration is missing.
You have worked around the problem by simply replacing gx: with gx. But the proper way to do this would be to add the namespace declaration. Based on https://developers.google.com/kml/documentation/altitudemode, I take it that gx is associated with the http://www.google.com/kml/ext/2.2 namespace. So for the document to be well-formed, the root element start tag should read
<kml xmlns="http://earth.google.com/kml/2.2" xmlns:gx="http://www.google.com/kml/ext/2.2">
Now the document can be parsed:
In [1]: from xml.etree import ElementTree as ET
In [2]: kmlData = ET.parse("kml2.xml")
In [3]: document = kmlData.find('{http://earth.google.com/kml/2.2}Document')
In [4]: document
Out[4]: <Element '{http://earth.google.com/kml/2.2}Document' at 0x1895810>
In [5]:
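Writing the full {namespace}tag form for every lookup gets verbose. ElementTree's find, findall, iterfind and findtext also accept a prefix-to-URI mapping, which this sketch uses on a trimmed copy of the corrected KML. The prefixes in NS are local to the Python code and are an arbitrary choice; only the URIs have to match the document.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the KML from the question, with the gx prefix declared.
KML = """<kml xmlns="http://earth.google.com/kml/2.2" xmlns:gx="http://www.google.com/kml/ext/2.2">
  <Document>
    <name>UK.kmz</name>
    <Folder>
      <Placemark>
        <name>1262 Crossness Pumping Station</name>
        <LookAt>
          <altitudeMode>relativeToGround</altitudeMode>
          <gx:altitudeMode>relativeToSeaFloor</gx:altitudeMode>
        </LookAt>
      </Placemark>
    </Folder>
  </Document>
</kml>"""

# Prefix names here are chosen for this script only.
NS = {
    "kml": "http://earth.google.com/kml/2.2",
    "gx": "http://www.google.com/kml/ext/2.2",
}

root = ET.fromstring(KML)
document = root.find("kml:Document", NS)

# Names of all placemarks anywhere below <Document>.
names = [p.findtext("kml:name", namespaces=NS)
         for p in document.iterfind(".//kml:Placemark", NS)]

# The gx-prefixed element lives in a different namespace than its siblings.
gx_mode = document.findtext(".//gx:altitudeMode", namespaces=NS)

print(names, gx_mode)
```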

How to get xmlns attributes using lxml objectify?

I have several XML documents I am dealing with. They have differing root elements. Here are some of them.
<rss xmlns:npr="http://www.npr.org/rss/" xmlns:nprml="http://api.npr.org/nprml" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:content="http://purl.org/rss/1.0/modules/content/" version="2.0">
<rss version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd">
<?xml-stylesheet type="text/xsl" media="screen" href="/~d/styles/rss2enclosuresfull.xsl"?><?xml-stylesheet type="text/css" media="screen" href="http://feeds.thisamericanlife.org/~d/styles/itemcontent.css"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:media="http://search.yahoo.com/mrss/" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0" version="2.0" xml:base="http://www.thisamericanlife.org">
I am using lxml in the following way on the first example from above.
>>> from lxml import objectify
>>> root = objectify.parse('file_for_first_example').getroot() # contains valid xml with first above element
>>> print(root.tag)
rss
>>> root.attrib.keys()
['version']
>>> for k in root.attrib.iterkeys():
...     print(k)
...
version
>>> print(root.get("xmlns:npr"))
None
I just want to be able to detect what these 'attribute' values are so I can, I believe, infer the format of the various feeds.
Thanks for the help in advance. Love and peace.
The namespace declarations are namespace nodes, not regular attributes, which is why they do not show up in .attrib. It looks like you want the .nsmap property: http://lxml.de/tutorial.html#namespaces
xhtml.nsmap
{None: 'http://www.w3.org/1999/xhtml'}
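If lxml is not available, a similar prefix-to-URI mapping can be collected with the standard library by asking ElementTree's iterparse for "start-ns" events. This is a sketch of that alternative, not of lxml's .nsmap itself; declared_namespaces is a hypothetical helper name, and the feed is reduced to the first root element from the question plus an empty channel so that it is well-formed.

```python
import io
import xml.etree.ElementTree as ET

# First root element from the question, closed so the document is well-formed.
FEED = (
    b'<rss xmlns:npr="http://www.npr.org/rss/" '
    b'xmlns:nprml="http://api.npr.org/nprml" '
    b'xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" '
    b'xmlns:content="http://purl.org/rss/1.0/modules/content/" '
    b'version="2.0"><channel/></rss>'
)


def declared_namespaces(source):
    """Map each declared prefix to its URI (a later redeclaration overwrites)."""
    return {
        prefix: uri
        for _event, (prefix, uri) in ET.iterparse(source, events=("start-ns",))
    }


namespaces = declared_namespaces(io.BytesIO(FEED))
print(sorted(namespaces))
```

The set of declared prefixes (npr, nprml, itunes and so on) can then be compared across feeds to guess which format each one uses.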
