python: is there an XML parser implemented as a generator?

I'd like to parse a big XML file "on the fly", using a Python generator to do it. I've tried iterparse from xml.etree.cElementTree (which is really nice), but it's still not a generator.
Other suggestions?

xml.etree.cElementTree comes close to a generator with correct usage; by default you receive each element after its 'end' event, at which point you can process it. You should call element.clear() on the element if you don't need it after processing; that way you save the memory.

Here is a complete example of what I mean, where I parse Rhythmbox's (Music Player) library. I use (c)ElementTree's iterparse, and for each processed element I call element.clear() so that I save quite a lot of memory. (Btw, the code below is a successor to some SAX code that did the same thing; the cElementTree solution was a relief since 1) the code is concise and expresses what I need and nothing more, 2) it is 3x as fast, and 3) it uses less memory.)
import os
import xml.etree.cElementTree as ElementTree

NEEDED_KEYS = set(("title", "artist", "album", "track-number", "location", ))

def _lookup_string(string, strmap):
    """Look up #string in the string map,
    and return the copy in the map.
    If not found, update the map with the string.
    """
    string = string or ""
    try:
        return strmap[string]
    except KeyError:
        strmap[string] = string
        return string

def get_rhythmbox_songs(dbfile, typ="song", keys=NEEDED_KEYS):
    """Return a list of info dictionaries for all songs
    in a Rhythmbox library database file, with dictionary
    keys as given in #keys.
    """
    rhythmbox_dbfile = os.path.expanduser(dbfile)

    lSongs = []
    strmap = {}

    # Parse with iterparse; we get the elements when
    # they are finished, and can remove them directly after use.
    for event, entry in ElementTree.iterparse(rhythmbox_dbfile):
        if not (entry.tag == "entry" and entry.get("type") == typ):
            continue
        info = {}
        for child in entry.getchildren():
            if child.tag in keys:
                tag = _lookup_string(child.tag, strmap)
                text = _lookup_string(child.text, strmap)
                info[tag] = text
        lSongs.append(info)
        entry.clear()
    return lSongs
Now, I don't quite understand your expectations. Do you have the following expectation?
# take one
for event, entry in ElementTree.iterparse(rhythmbox_dbfile):
    # parse some entries, then exit loop

# take two
for event, entry in ElementTree.iterparse(rhythmbox_dbfile):
    # parse the rest of the entries
Each time you call iterparse you get a new iterator object, reading the file anew! If you want a persistent object with iterator semantics, you have to refer to the same object in both loops (untried code):
# setup
parseiter = iter(ElementTree.iterparse(rhythmbox_dbfile))

# take one
for event, entry in parseiter:
    # parse some entries, then exit loop

# take two
for event, entry in parseiter:
    # parse the rest of the entries
I think it can be confusing since different objects have different semantics. A file object always has internal state and advances in the file, however you iterate over it. An ElementTree iterparse object apparently does not. The crux is that when you use a for loop, the for always calls iter() on the thing you iterate over. Here is an experiment comparing ElementTree.iterparse with a file object:
>>> import xml.etree.cElementTree as ElementTree
>>> pth = "/home/ulrik/.local/share/rhythmbox/rhythmdb.xml"
>>> iterparse = ElementTree.iterparse(pth)
>>> iterparse
<iterparse object at 0x483a0890>
>>> iter(iterparse)
<generator object at 0x483a2f08>
>>> iter(iterparse)
<generator object at 0x483a6468>
>>> f = open(pth, "r")
>>> f
<open file '/home/ulrik/.local/share/rhythmbox/rhythmdb.xml', mode 'r' at 0x4809af98>
>>> iter(f)
<open file '/home/ulrik/.local/share/rhythmbox/rhythmdb.xml', mode 'r' at 0x4809af98>
>>> iter(f)
<open file '/home/ulrik/.local/share/rhythmbox/rhythmdb.xml', mode 'r' at 0x4809af98>
What you see is that each call to iter() on an iterparse object returns a new generator. The file object, however, has internal operating-system state that must be preserved, and it is its own iterator.
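To get literal generator semantics, you can also wrap iterparse in a generator function yourself. Here is a minimal sketch (untested; the function name and the entry/type filtering are only illustrative, following the Rhythmbox example above):

import os
import xml.etree.cElementTree as ElementTree

def iter_rhythmbox_entries(dbfile, typ="song"):
    """Yield each finished <entry type=...> element, clearing it afterwards."""
    path = os.path.expanduser(dbfile)
    for event, entry in ElementTree.iterparse(path):
        if entry.tag == "entry" and entry.get("type") == typ:
            yield entry        # the caller works on the element while we are suspended here
            entry.clear()      # free the element's children once the caller is done

Because this is a single generator object, you can break out of one for loop over it and continue in a later loop, which is exactly the persistent-iterator behaviour discussed above.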

"On the fly" parsing and document trees are not really compatible. SAX-style parsers are usually used for that (for example, Python's standard xml.sax). You basically have to define a class with handlers for various events like startElement, endElement, etc. and the parser will call the methods as it parses the XML file.

PullDom does what you want. It reads XML from a stream, like SAX, but then builds a DOM for a selected piece of it.
"PullDOM is a really simple API for working with DOM objects in a streaming (efficient!) manner rather than as a monolithic tree."

This is possible with elementtree and incremental parsing:
http://effbot.org/zone/element-iterparse.htm#incremental-parsing
import xml.etree.cElementTree as etree

for event, elem in etree.iterparse(source):
    ...
Easier to use than sax.

xmltodict has a callback way of reading row by row, but it is not very pythonic. I wanted something similar for reading stackoverflow posts one by one from their xml dump using a generator.
This is the structure of the xml file:
<?xml version="1.0" encoding="utf-8"?>
<posts>
<row Id="1" ... />
<row Id="2" ... />
</posts>
And here is the code I used. It combines pulldom for streaming and xmltodict for parsing the rows.
def xml_to_dict_gen(file_path, tag='row'):
    from xml.dom import pulldom
    import xmltodict

    doc = pulldom.parse(file_path)
    for event, node in doc:
        if event == pulldom.START_ELEMENT and node.tagName == tag:
            doc.expandNode(node)
            yield dict(xmltodict.parse(node.toxml()).get(tag))

for post in xml_to_dict_gen('Posts.xml'):
    print(post)

Related

Python: how to update the xml and save to a new xml file, with iterparse method reading and updating?

I'm able to print it out to the console, and it is the way I want it, but I can't seem to figure out how to save it. The XML from the sample doesn't change. I'm working with fairly big XML files, so I believe the iterparse function is crucial.
My code:
def xmlTagMethod(xmlfile, changetag):
    tree = ET.ElementTree(file=xmlfile)
    root = tree.getroot()
    for event, elem in ET.iterparse(xmlfile):
        if event == 'end':
            if elem.tag == changetag:
                elem.set('maxwidth', '20')
                print elem.attrib
    tree.write("outPutTagData.xml")
You are not making any changes to tree, so when you write it it will be the same as before.
You are only modifying the elem objects produced by the iterator returned by ET.iterparse(xmlfile), and that iterator parses the file a second time; it is a completely separate object from tree.
For a simpler approach you could try:
import xml.etree.ElementTree as ET

def xmlTagMethod(xmlfile, changetag):
    tree = ET.ElementTree(file=xmlfile)
    for e in tree.findall(changetag):
        e.set('maxwidth', '20')
    tree.write("outPutTagData.xml")
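If you prefer to stay with iterparse (for example because you also want to react to events as they arrive), here is an untested sketch: ask for 'start' events as well, capture the root element from the first event, modify matching elements as their 'end' events fire, and write that root out at the end. Note that nothing can be clear()ed here, because every element has to end up in the output file, so the whole tree still lives in memory.

import xml.etree.cElementTree as ET

def xmlTagMethodIterparse(xmlfile, changetag):
    # illustrative variant of xmlTagMethod built on iterparse
    context = iter(ET.iterparse(xmlfile, events=('start', 'end')))
    event, root = next(context)                  # first event is the start of the root element
    for event, elem in context:
        if event == 'end' and elem.tag == changetag:
            elem.set('maxwidth', '20')
    ET.ElementTree(root).write("outPutTagData.xml")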

Parsing Xml file using python Error: list index out of range for threshold

I have a very large XML file. I need to display the value stored in each and every tag using Python. I am trying to use the dom library. Can anyone please help?
XML File Link: https://code.google.com/p/warai/downloads/detail?name=haarcascade_frontalface_alt.xml
from xml.dom import minidom

doc = minidom.parse('haarcascade_frontalface_alt.xml')
size = doc.getElementsByTagName('size')[0]
print size.firstChild.data

stages = doc.getElementsByTagName('stages')[0]
stagen = stages.getElementsByTagName('_')
for stage in stagen:
    stage_threshold = stage.getElementsByTagName('stage_threshold')[0]
    parent = stage.getElementsByTagName('parent')[0]
    anext = stage.getElementsByTagName('next')[0]
    print stage_threshold.firstChild.data
    print parent.firstChild.data
    print anext.firstChild.data
    trees = stage.getElementsByTagName('trees')[0]
    a = trees.getElementsByTagName('_')
    for k in a:
        b = k.getElementsByTagName('_')[0]
        threshold = b.getElementsByTagName('threshold')[0]
        left_val = b.getElementsByTagName('left_val')[0]
        right_val = b.getElementsByTagName('right_val')[0]
        feature = b.getElementsByTagName('feature')[0]
        tilted = feature.getElementsByTagName('tilted')[0]
        rects = feature.getElementsByTagName('rects')[0]
        m = rects.getElementsByTagName('_')[0]
        n = rects.getElementsByTagName('_')[1]
        print m.firstChild.data
        print n.firstChild.data
        print tilted.firstChild.data
        print threshold.firstChild.data
        print left_val.firstChild.data
        print right_val.firstChild.data
Use the ElementTree interface, for example xml.etree.ElementTree (other implementations also exist).
Iterate through all elements recursively, using iterators to walk the XML tree:
from xml.etree import ElementTree as ET

root = ET.parse("xml.xml").getroot()

def print_value(node):
    if node.text and not node.text.isspace():
        print(node.text)
    for child in node:
        print_value(child)

print_value(root)
But if the file is really big and you cannot load the whole tree into memory, use .iterparse(). It returns elements one by one; the "start" and "end" events mean the parser has reached the beginning or the end of the current node. If no events are supplied to iterparse, only "end" is used.
import xml.etree.ElementTree as ET

it = ET.iterparse("xml.xml")
for event, node in it:
    if node.text and not node.text.isspace():
        print(node.text)
    node.clear()
Notes
You may also print .tail of each element.
If you have a really, really massive number of elements, you could consider remembering the parent of each node and using parent.remove(node) instead of node.clear().
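A rough sketch of that parent-tracking idea, using 'start' events to keep a stack of open elements (untested, assumes the same xml.xml file as above):

import xml.etree.ElementTree as ET

it = ET.iterparse("xml.xml", events=("start", "end"))
parents = []                              # stack of currently open elements
for event, node in it:
    if event == "start":
        parents.append(node)
    else:                                 # "end": the node is finished
        parents.pop()
        if node.text and not node.text.isspace():
            print(node.text)
        if parents:
            parents[-1].remove(node)      # detach the finished element from its parent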

change element value with lxml python

I need to be able to parse a large XML file but only look for <name> elements and replace their value. Hence, I am doing event-driven parsing as follows. I have the following code:
import os, re, sys
from lxml import etree

# parse the xml file
context = etree.iterparse(xmlFile, events=('end',), tag='name')
for event, elem in context:
    # this is an internal method that I call to perform regex
    newElementText = searchReplace(elem.text).replace(" ", "")
    # assign the elem.text to the replaced value
    elem.text = newElementText
    # write to the xml
    etree.tostring(elem, encoding='utf-8')
My problem is with writing the updated element value to the file. When I call etree.tostring() it does not update the file. Can someone please kindly point to the error of my ways. Thanks!
etree.tostring(elem) only returns a string representation of the element; it never touches the file, so it does nothing useful in your code. The element itself has no write method either: after the loop, write out the tree the elements belong to, for example with elem.getroottree().write(xmlFile, encoding='utf-8').
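Putting that together, a rough sketch (xmlFile and searchReplace come from the question, so stand-ins are defined here; the output filename is an assumption; nothing is cleared, so the whole parsed tree stays in memory and can be written back out through its root):

from lxml import etree

xmlFile = 'input.xml'                      # stands in for the question's variable

def searchReplace(text):
    # stand-in for the question's internal regex method
    return text or ""

context = etree.iterparse(xmlFile, events=('end',), tag='name')
for event, elem in context:
    elem.text = searchReplace(elem.text).replace(" ", "")

# lxml makes the parsed root available once iteration has finished
root = context.root
root.getroottree().write('output.xml', encoding='utf-8', xml_declaration=True)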

Loading different data types from XML into a dictionary in python

I'm using cElementTree to extract xml tags and values in a loop and then storing them into a dictionary.
XML file contains:
<root>
    <tag1>['item1', 'item2']</tag1>
    <tag2>a normal string</tag2>
</root>
Python code (roughly):
import xml.etree.cElementTree as xml

xmldata = {}
xmlfile = xml.parse("XMLFile.xml")
for xmltag in xmlfile.iter():
    xmldata[xmltag.tag] = xmltag.text
The problem I have encountered is that the XML file contains different data types, including strings and lists. Unfortunately Element.text gives every XML value back as a string (including the lists).
So when I load from the XML file I have:
{'tag1':"['item1', 'item2']", 'tag2':'a normal string'}
When I'd prefer to have:
{'tag1':['item1', 'item2'], 'tag2':'a normal string'}
Is there an easy way to do this, e.g. a command that saves to the dictionary in the original format?
Or do I need to set up if statements to determine the value type and save it separately using an alternative to Element.text?
You can use literal_eval to try to parse complex Python literals. Since your strings are unquoted, they will raise a SyntaxError in literal_eval, but that is simple to work around:
import xml.etree.cElementTree as xml
from ast import literal_eval

xmldata = {}
xmlfile = xml.parse("XMLFile.xml")
for xmltag in xmlfile.iter():
    try:
        xmldata[xmltag.tag] = literal_eval(xmltag.text)
    except SyntaxError:
        xmldata[xmltag.tag] = xmltag.text
Unlike Python's builtin "eval", ast.literal_eval does not allow the execution of expressions, and thus is safe, even if the XML data come from an untrusted source.
Here is a proposed solution: check for the existence of [, then parse the list. It's not failsafe (it won't work if the separator is not exactly , with a space) but I think that it'll be easy for you to improve it.
import xml.etree.cElementTree as xml

xmldata = {}
xmlfile = xml.parse("data.xml")
for xmltag in xmlfile.iter():
    # it's a list
    if "[" in xmltag.text:
        d = xmltag.text.lstrip("[").rstrip("]")
        l = [item.lstrip("'").rstrip("'") for item in d.split(", ")]
        xmldata[xmltag.tag] = l
    else:
        xmldata[xmltag.tag] = xmltag.text
print xmldata
Prints: {'root': '\n', 'tag1': ['item1', 'item2'], 'tag2': 'a normal string'}
I think that you are not using xml in all its mighty power!
Why don't you organize your .xml like:
<root>
    <tag1>
        <item>item1</item>
        <item>item2</item>
    </tag1>
    <tag2>a normal string</tag2>
</root>
This way your python code will be handling every <tag1> as a container of <item>, and I think that's better.
Note: You may also want to take a look here. (I agree with the "Favorite Way" of the author)
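With that layout, a short sketch of how the loading loop could build a list for container tags (tag and file names follow the examples above and are only illustrative):

import xml.etree.cElementTree as xml

xmldata = {}
xmlfile = xml.parse("XMLFile.xml")
for xmltag in xmlfile.getroot():
    items = xmltag.findall("item")
    if items:                                       # container tag -> list of item texts
        xmldata[xmltag.tag] = [i.text for i in items]
    else:                                           # plain tag -> its text
        xmldata[xmltag.tag] = xmltag.text

# xmldata is now {'tag1': ['item1', 'item2'], 'tag2': 'a normal string'}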

How to write an XML file without header in Python?

When using Python's stock XML tools such as xml.dom.minidom for XML writing, a file always starts off like
<?xml version="1.0"?>
[...]
While this is perfectly legal XML code, and it's even recommended to use the header, I'd like to get rid of it as one of the programs I'm working with has problems here.
I can't seem to find the appropriate option in xml.dom.minidom, so I wondered if there are other packages which do allow to neglect the header.
Cheers,
Nico
Unfortunately minidom does not give you the option to omit the XML Declaration.
But you can always serialise the document content yourself by calling toxml() on the document's root element instead of the document. Then you won't get an XML Declaration:
xml= document.documentElement.toxml('utf-8')
...but then you also wouldn't get anything else outside the root element, such as the DOCTYPE, or any comments or processing instructions. If you need them, serialise each child of the document object one by one:
xml= '\n'.join(node.toxml('utf-8') for node in document.childNodes)
I wondered if there are other packages which do allow to neglect the header.
DOM Level 3 LS defines an xml-declaration config parameter you can use to suppress it. The only Python implementation I know of is pxdom, which is thorough on standards support, but not at all fast.
If you want to use minidom and maintain 'prettiness', how about this as a quick/hacky fix:
xml_without_declaration.py:
import xml.dom.minidom as xml
doc = xml.Document()
declaration = doc.toxml()
a = doc.createElement("A")
doc.appendChild(a)
b = doc.createElement("B")
a.appendChild(b)
xml = doc.toprettyxml()[len(declaration):]
print xml
The header is printed by the Document. If you print a node directly, it won't print the header.
root = doc.childNodes[0]
root.toprettyxml(encoding="utf-8")
Just replace the first line with blank:
import xml.dom.minidom as MD
<XML String>.replace(MD.Document().toxml()+'\n', '')
If you're set on using minidom, just scan back in the file and remove the first line after writing all the XML you need.
You might be able to use a custom file-like object which removes the first tag, e.g:
class RemoveFirstLine:
    def __init__(self, f):
        self.f = f
        self.xmlTagFound = False

    def __getattr__(self, attr):
        return getattr(self.f, attr)

    def write(self, s):
        if not self.xmlTagFound:
            x = 0  # just to be safe
            for x, c in enumerate(s):
                if c == '>':
                    self.xmlTagFound = True
                    break
            self.f.write(s[x+1:])
        else:
            self.f.write(s)
...
f = RemoveFirstLine(open('path', 'wb'))
Node.writexml(f, encoding='UTF-8')
or something similar. This has the advantage that the file doesn't have to be completely rewritten afterwards, which matters if the XML files are fairly large.
Purists may not like to hear this, but I have found using an XML parser to generate XML to be overkill. Just generate it directly as strings. This also lets you generate files larger than you can keep in memory, which you can't do with DOM. Reading XML is another story.
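For instance, a toy sketch of streaming a document to disk one element at a time, with no tree and (by construction) no declaration; the tag names and helper are made up, and escaping is done with xml.sax.saxutils:

from xml.sax.saxutils import escape

def write_rows(path, rows):
    # stream an XML file to disk one <row> at a time: no header, no tree in memory
    with open(path, "w") as f:
        f.write("<posts>\n")
        for row in rows:
            f.write("  <row>%s</row>\n" % escape(row))
        f.write("</posts>\n")

write_rows("out.xml", ("row %d" % i for i in range(1000000)))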
Use a string replace:
from xml.dom import minidom

mydoc = minidom.parse('filename.xml')
with open(newfile, "w") as fs:
    fs.write(mydoc.toxml().replace('<?xml version="1.0" ?>', ''))
That's it ;)
