Why is lxml.etree.iterparse() eating up all my memory? - python

This eventually consumes all my available memory and then the process is killed. I've tried changing the tag from schedule to 'smaller' tags but that didn't make a difference.
What am I doing wrong / how can I process this large file with iterparse()?
import lxml.etree
for schedule in lxml.etree.iterparse('really-big-file.xml', tag='schedule'):
    print("why does this consume all my memory?")
I can easily cut it up and process it in smaller chunks but that's uglier than I'd like.

As iterparse iterates over the entire file a tree is built and no elements are freed. The advantage of doing this is that the elements remember who their parent is, and you can form XPaths that refer to ancestor elements. The disadvantage is that it can consume a lot of memory.
In order to free some memory as you parse, use Liza Daly's fast_iter:
def fast_iter(context, func, *args, **kwargs):
    """
    http://lxml.de/parsing.html#modifying-the-tree
    Based on Liza Daly's fast_iter
    http://www.ibm.com/developerworks/xml/library/x-hiperfparse/
    See also http://effbot.org/zone/element-iterparse.htm
    """
    for event, elem in context:
        func(elem, *args, **kwargs)
        # It's safe to call clear() here because no descendants will be
        # accessed
        elem.clear()
        # Also eliminate now-empty references from the root node to elem
        for ancestor in elem.xpath('ancestor-or-self::*'):
            while ancestor.getprevious() is not None:
                del ancestor.getparent()[0]
    del context
which you could then use like this:
def process_element(elem):
    print("why does this consume all my memory?")

context = lxml.etree.iterparse('really-big-file.xml', tag='schedule', events=('end',))
fast_iter(context, process_element)
I highly recommend the article on which the above fast_iter is based; it should be especially interesting to you if you are dealing with large XML files.
The fast_iter presented above is a slightly modified version of the one shown in the article. It is more aggressive about deleting previous ancestors and thus saves more memory. Here you'll find a script which demonstrates the difference.

Directly copied from http://effbot.org/zone/element-iterparse.htm
Note that iterparse still builds a tree, just like parse, but you can safely rearrange or remove parts of the tree while parsing. For example, to parse large files, you can get rid of elements as soon as you’ve processed them:
for event, elem in iterparse(source):
    if elem.tag == "record":
        ... process record elements ...
        elem.clear()
The above pattern has one drawback; it does not clear the root element, so you will end up with a single element with lots of empty child elements. If your files are huge, rather than just large, this might be a problem. To work around this, you need to get your hands on the root element. The easiest way to do this is to enable start events, and save a reference to the first element in a variable:
# get an iterable
context = iterparse(source, events=("start", "end"))

# turn it into an iterator
context = iter(context)

# get the root element
event, root = context.next()

for event, elem in context:
    if event == "end" and elem.tag == "record":
        ... process record elements ...
        root.clear()
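On Python 3, iterators no longer have a .next() method, so the root-clearing pattern quoted above might be written as in the sketch below. It uses the standard library's xml.etree.ElementTree rather than lxml, and the sample document, the record/name tags, and the collect_names helper are all made up for illustration.

```python
import io
import xml.etree.ElementTree as ET


def collect_names(source):
    """Stream <record> elements and keep the root element empty.

    Hypothetical helper: returns the <name> text of each record plus the
    number of children still attached to the root after parsing.
    """
    context = iter(ET.iterparse(source, events=("start", "end")))
    # The first event is always the "start" of the root element.
    event, root = next(context)
    names = []
    for event, elem in context:
        if event == "end" and elem.tag == "record":
            names.append(elem.findtext("name"))
            # Drop every processed child so the root does not accumulate them.
            root.clear()
    return names, len(root)


xml_bytes = (
    b"<log>"
    b"<record><name>a</name></record>"
    b"<record><name>b</name></record>"
    b"</log>"
)
names, remaining = collect_names(io.BytesIO(xml_bytes))
print(names, remaining)
```

Clearing the root instead of the record itself is what keeps the list of empty children from growing, at the cost of also discarding the root's own text and attributes.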

This worked really well for me:
def destroy_tree(tree):
    root = tree.getroot()

    node_tracker = {root: [0, None]}

    for node in root.iterdescendants():
        parent = node.getparent()
        node_tracker[node] = [node_tracker[parent][0] + 1, parent]

    node_tracker = sorted([(depth, parent, child) for child, (depth, parent)
                           in node_tracker.items()], key=lambda x: x[0], reverse=True)

    for _, parent, child in node_tracker:
        if parent is None:
            break
        parent.remove(child)

    del tree

Related

XML parsing using fast_iter clearing data before done processing

I'm using Liza Daly's fast_iter, which has this structure:
def fast_iter(context, args=[], kwargs={}):
    """
    Deletes elements as the tree is traversed to prevent the full tree from building and save memory
    Author: Liza Daly, IBM
    """
    for event, elem in context:
        if elem.tag == 'target':
            func(elem, *args, **kwargs)
            elem.clear()
        while elem.getprevious() is not None:
            del elem.getparent()[0]
    del context
    return save
However, I've noticed that when I create my context as
context = etree.iterparse(path, events=('end',))
The data within the elem gets deleted before my function can even process it. For clarity, I am using fully synchronous code.
If I set my context as
context = etree.iterparse(path, events=('end',), tag='target')
It works correctly; however, I know it's not doing the full memory conservation that fast_iter is intended to provide.
Is there any reason to even use this when compared to xml.dom.pulldom, a SAX parser which creates no tree? It seems like fast_iter attempts to replicate this while staying within lxml.
Does anyone have ideas on what I'm doing wrong? TIA
I think I can see where your approach might delete data you want to access before the code that accesses it is called. Let's assume you have elements such as
<target>
    <foo>test</foo>
    <bar>test</bar>
</target>
in your XML. Then each time an end element tag is found, your code
for event, elem in context:
    if elem.tag == 'target':
        func(elem, *args, **kwargs)
        elem.clear()
    while elem.getprevious() is not None:
        del elem.getparent()[0]
is run. That means it first encounters the foo end tag, then the bar end tag, at which point the while loop deletes the foo sibling element. When the target end tag is finally encountered, I assume your function looks for both the foo and the bar element data, but the foo element has already been deleted.
So your code has to take the structure (which you probably know) into account and must not run that while loop for children/descendants of your target element.
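The order of events described above can be checked with a short sketch. It uses the standard library's xml.etree.ElementTree instead of lxml (an assumption made so the example has no third-party dependency), and the wrapping root element is invented for illustration.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample document: one <target> with two children.
xml_bytes = b"<root><target><foo>test</foo><bar>test</bar></target></root>"

# With only "end" events requested, children are reported before their
# parent, so foo and bar are seen before target.
order = [
    elem.tag
    for _, elem in ET.iterparse(io.BytesIO(xml_bytes), events=("end",))
]
print(order)
```

Because the children finish first, any cleanup that runs on every end event touches foo and bar before the handler for target has had a chance to read them.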

Does replaceChild break childNodes iteration in Python minidom DOM implementation?

Does replaceChild() break for loop around childNodes in Python minidom?
Consider the following code with v being a minidom node:
for w in v.childNodes:
    if ...:
        frag = parseString(...)
        # minidom's signature is replaceChild(newChild, oldChild)
        v.replaceChild(frag.documentElement, w)
Will it work as expected enumerating all child nodes in turn? Or will replaceChild break the for loop?
https://www.w3.org/TR/DOM-Level-1/level-one-core.html says:
The content of the returned NodeList is "live" in the sense that, for
instance, changes to the children of the node object that it was
created from are immediately reflected in the nodes returned by the
NodeList accessors; it is not a static snapshot of the content of the
node.
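It follows that the loop won't be broken: replaceChild() swaps the new node into the position of the old one, so the number and order of entries in childNodes stay the same.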
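A quick way to check this claim is to replace a child in the middle of the loop and record which nodes were visited. The sketch below is only an illustration with made-up element names (root, a, b, c, x); it relies on minidom's replaceChild(newChild, oldChild) argument order.

```python
from xml.dom.minidom import parseString

doc = parseString("<root><a/><b/><c/></root>")
root = doc.documentElement

visited = []
for child in root.childNodes:
    visited.append(child.tagName)
    if child.tagName == "b":
        # Parse a replacement element and swap it in for the current child.
        frag = parseString("<x/>")
        root.replaceChild(frag.documentElement, child)

after = [node.tagName for node in root.childNodes]
print(visited, after)
```

Every original child is still visited once, and the replacement takes the old node's position. Removing or inserting children inside the loop is a different matter, since that changes the length of childNodes while it is being iterated.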

Remove root from k-d-Tree in Python

As someone who is new to Python, I don't understand how to remove an instance of a class from inside a recursive function.
Consider this code of a k-d Tree:
def remove(self, bin, targetAxis=0, parent=None):
    if not self:
        return None
    elif self.data.x == bin.x and self.data.y == bin.y:
        if self.rightNode:
            self.data = self.rightNode.findMin((targetAxis+1)% KdSearch.DIMENSION)
            self.rightNode = self.rightNode.remove(self.data, (targetAxis+1)% KdSearch.DIMENSION,self)
        elif self.leftNode:
            self.data = self.leftNode.findMin((targetAxis+1)% KdSearch.DIMENSION)
            self.rightNode = self.leftNode.remove(self.data, (targetAxis+1)% KdSearch.DIMENSION,self)
        else:
            if not parent is None:
                #get direction if child....
                if not parent.leftNode is None:
                    if parent.leftNode.data.x == bin.x and parent.leftNode.data.y == bin.y:
                        parent.leftNode=None
                if not parent.rightNode is None:
                    if parent.rightNode.data.x == bin.x and parent.rightNode.data.y == bin.y:
                        parent.rightNode=None
            else:
                print("Trying to delete self")
                del self.data
                del self.leftNode
                del self.rightNode
                del self.splittingAxis
    else:
        axis = self.splittingAxis % KdSearch.DIMENSION
        if axis==0:
            if bin.x <= self.data.x :
                if self.leftNode:
                    self.leftNode.remove(bin,(targetAxis+1)% KdSearch.DIMENSION,self)
            else:
                if self.rightNode:
                    self.rightNode.remove(bin,(targetAxis+1)% KdSearch.DIMENSION,self)
        else:
            if bin.y <= self.data.y:
                if self.leftNode:
                    self.leftNode.remove(bin,(targetAxis+1)% KdSearch.DIMENSION,self)
            else:
                if self.rightNode:
                    self.rightNode.remove(bin,(targetAxis+1)% KdSearch.DIMENSION,self)
The important part is this:
del self.data
del self.leftNode
del self.rightNode
del self.splittingAxis
How can I delete the current instance?
Neither del self, self = None, nor my approach above works.
What you're trying to do doesn't make sense in words, let alone in Python. What you want to do is remove the node from the tree. However, you don't have a tree object, you only have nodes. So how can you remove the node from the tree when there is no tree to remove it from?
Being generous, you could argue that you're implementing the tree without an explicit tree class by saying that a collection of nodes is a tree. But then you have the problem, what does an empty tree look like? Also, the client of the tree needs a reference to the tree (so it can add and remove nodes), but since you don't have a tree object, it can only have a reference to a node. Therefore, the client is the only one with the capability of emptying the tree, which it must do by deleting its reference to the node. It is not possible for an object in Python to remove arbitrary references to itself from other objects without knowledge of those objects, so your root node cannot generally delete itself from the "tree", which would mean deleting the reference to the node the client holds. To implement this would require a defined interface between the root node and the client, so when the client says "delete this node" the root node can reply and say "that's actually me, so delete me and you've got an empty tree". But this would be a pain.
Also, an implicit conceptual tree that is a collection of nodes goes against the Zen of Python:
Explicit is better than implicit.
So what I suggest is that you implement an explicit simple tree class that can be empty and that your client can hold a reference to. If you make it look a bit like a node, it can just be the parent of the root node and as far as the root node is concerned it (the root node) is a normal sub-node. Something like (caveat: not tested, and assuming the remove() function above is really a method on a node class):
class Tree:
    def __init__(self):
        self.leftNode = None
        # need a rightNode to look like a node, but otherwise unused.
        self.rightNode = None

    # This will probably be useful.
    @property
    def isEmpty(self):
        return self.leftNode is None

    def addNode(self, node):
        # An empty tree simply adopts the node as its root.
        if self.leftNode is None:
            self.leftNode = node
            return
        self.leftNode.add(node, parent=self)

    def removeNode(self, node):
        # the node will remove itself from us, the parent, if needed
        self.leftNode.remove(node, parent=self)
Then the client does things like:
tree = Tree()
tree.isEmpty
tree.addNode(node)
tree.removeNode(node)
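To make the idea of an explicit tree object concrete, here is a self-contained sketch. It is not the k-d tree from the question: it uses a plain one-dimensional binary search tree, and it swaps in a slightly different technique in which each node's remove() returns the new root of its subtree, so the wrapper can simply store None when the last node goes away. All class and method names are hypothetical.

```python
class Node:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None

    def insert(self, key):
        # Keys equal to this node's key go to the left subtree.
        if key <= self.key:
            if self.left is None:
                self.left = Node(key)
            else:
                self.left.insert(key)
        elif self.right is None:
            self.right = Node(key)
        else:
            self.right.insert(key)

    def find_min(self):
        node = self
        while node.left is not None:
            node = node.left
        return node.key

    def remove(self, key):
        """Remove key and return the root of the resulting subtree (or None)."""
        if key < self.key:
            if self.left is not None:
                self.left = self.left.remove(key)
            return self
        if key > self.key:
            if self.right is not None:
                self.right = self.right.remove(key)
            return self
        # This node holds the key: with at most one child, that child
        # (possibly None) takes this node's place.
        if self.left is None:
            return self.right
        if self.right is None:
            return self.left
        # Two children: take over the smallest key of the right subtree.
        self.key = self.right.find_min()
        self.right = self.right.remove(self.key)
        return self


class Tree:
    """Explicit tree object that the client holds; it may be empty."""

    def __init__(self):
        self.root = None

    @property
    def is_empty(self):
        return self.root is None

    def add(self, key):
        if self.root is None:
            self.root = Node(key)
        else:
            self.root.insert(key)

    def remove(self, key):
        if self.root is not None:
            # The node never deletes itself; the tree rebinds its reference.
            self.root = self.root.remove(key)


tree = Tree()
for key in (5, 3, 8):
    tree.add(key)
tree.remove(5)
root_after_first_removal = tree.root.key
tree.remove(3)
tree.remove(8)
print(root_after_first_removal, tree.is_empty)
```

The client only ever holds the Tree, so removing the last node is just a matter of the tree rebinding its own root attribute.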
Before looking at Python, consider the following C/C++ code:
struct A {
    virtual void suicide() {
        delete this;
    }
};
int main() {
    A* a = new A();
    a->suicide();
    return 0;
}
First, an object of type A is explicitly created. This boils down to allocating and initializing a small piece of memory (the only thing stored in the object is a pointer to the suicide function) and setting the variable a to point to that piece of memory.
Next, the suicide function is called, which internally asks the runtime to release the memory for the object by calling delete this. This is a totally valid operation, although it is not something you would normally do in real-life code, because after a->suicide() is called the pointer a becomes invalid: the memory it still points to is no longer there. For example, if you tried calling a->suicide() again afterwards, you would typically get a segmentation fault (because in order to call a->suicide you need to look up the pointer to the method suicide in the memory pointed to by a, and this memory is no longer valid).
But meaningful or not, you really can destroy a C/C++ object (i.e., release its memory) from any place, including the object's own method (assuming it was created on the heap, of course).
Now let us go back to Python. In Python, the situation is different. Although you create objects explicitly in Python, just like you do in C/C++, you have no way of forcefully releasing their memory. All the memory is managed by the garbage collector, which keeps track of which objects are currently referenced, which are not, and cleans the unreachable ones at the moments it decides appropriate.
Although the Python statement del self may seem syntactically similar to delete this in C/C++, it is really something completely different. It is not an order to the memory manager to clean the memory. Instead, it simply removes the key self from the "local variables" dictionary. The corresponding value (i.e. the memory self was referencing) still remains suspended somewhere on the heap.
Of course, if no one else points to that memory, chances are the garbage collector will release it soon (although not even this is guaranteed because it really depends on the GC algorithm used). But when you do a del self inside a method, someone is still pointing at the memory, because that someone just invoked the method.
Consider a "literal translation" of the C/C++ code above into Python:
class A(object):
    def suicide(self):
        del self

a = A()
a.suicide()
It is also totally valid Python code; however, del self here does nothing (except for preventing you from referring to self later in the same method, because you deleted the variable from the scope).
As long as there exists a variable a pointing to the created object from somewhere, its memory will not be released. Just as the memory would not be released here, for example:
a = A()
b = a
del a
For better understanding I suggest you also compare the meaning of del d[key] in Python with delete d[key] in C/C++.
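The behaviour described above can be observed with a weak reference, which reports whether an object is still alive without keeping it alive. This is only a sketch: the variable names are made up, and the explicit gc.collect() call is there so the final check does not depend on reference counting alone.

```python
import gc
import weakref


class A(object):
    def suicide(self):
        # Only unbinds the local name "self"; the object itself is untouched.
        del self


a = A()
probe = weakref.ref(a)  # does not keep the object alive

a.suicide()
alive_after_suicide = probe() is not None

b = a
del a  # the name b still refers to the object
alive_after_del_a = probe() is not None

del b  # last reference gone
gc.collect()
alive_after_del_b = probe() is not None

print(alive_after_suicide, alive_after_del_a, alive_after_del_b)
```

The object survives both del self and del a, and only goes away once the last name bound to it has been removed.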

How to check if an xml node has children in python with minidom?

I'm writing a recursive function to remove all attributes in an XML file, and I need to check whether a node has child nodes before calling the same function again.
What I've tried:
I tried to use node.childNodes.length, but didn't have much luck. Any other suggestions?
Thanks
My Code:
def removeAllAttributes(dom):
    for node in dom.childNodes:
        if node.attributes:
            for key in node.attributes.keys():
                node.removeAttribute(key)
        if node.childNodes.length > 1:
            node = removeAllAttributes(dom)
    return dom
Error code:
RuntimeError: maximum recursion depth exceeded
You are in an infinite recursion. Here is your problem line:
node = removeAllAttributes(dom)
I think you mean
node = removeAllAttributes(node)
You could try hasChildNodes() - although if inspecting the childNodes attribute directly isn't working you may have other problems.
On a guess, your processing is being thrown off because your element has no element children, but does have text children or something. You can check for that this way:
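import xml.dom.minidom

def removeAllAttributes(element):
    # Copy the keys first: removing attributes while iterating over the
    # live key view fails on Python 3.
    for attribute_name in list(element.attributes.keys()):
        element.removeAttribute(attribute_name)
    for child_node in element.childNodes:
        # Only element nodes carry attributes; text and comment nodes do not.
        if child_node.nodeType == xml.dom.Node.ELEMENT_NODE:
            removeAllAttributes(child_node)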
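As a usage illustration, the recursive approach can be tried on a small document that mixes text and element children. The sample XML and the strip_attributes name are invented for this sketch; the attribute names are copied into a list before removal, and the node-type constant is taken from xml.dom.Node.

```python
import xml.dom.minidom


def strip_attributes(element):
    """Recursively remove every attribute from element and its descendants."""
    # Snapshot the names so the attribute map is not modified while iterating.
    for name in list(element.attributes.keys()):
        element.removeAttribute(name)
    for child in element.childNodes:
        # Skip text nodes and other non-element children.
        if child.nodeType == xml.dom.Node.ELEMENT_NODE:
            strip_attributes(child)


doc = xml.dom.minidom.parseString(
    '<a x="1" y="2">text<b z="3"><c w="4"/></b></a>'
)
strip_attributes(doc.documentElement)
result = doc.documentElement.toxml()
print(result)
```

Start from doc.documentElement rather than the Document object itself, since the document node has no attribute map to iterate over.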

When using iterdescendants() on an etree, is it ok to modify the tree?

(Python 3.2)
I'm using etree to parse some XML. To do this, I'm recursively iterating through the document with iterdescendants(). So, something like:
for elem in doc.iterdescendants():
    if elem.tag == "tag":
        pass # Further processing
Sometimes, I process a parent tag that contains children that I want to prevent from being processed in a later recursion. Is it ok to destroy the children?
In my initial testing, I've tried:
for child in elem.getchildren(): child.clear()
For some reason, this prevents the element immediately after elem from being processed. It's as if that element gets removed as well.
I then tried this, which works (in that it removes the parent and its children, but doesn't result in any subsequent siblings of the parent from being skipped/affected as well):
elem.clear()
Can anyone shed some light on this? Thanks,
I have the following code in place of yours and it seems to work, deleting all the child elements. I use iterfind to find all descendants with the tag and delete them.
for element in doc.iterfind('.//%s'%tag):
    element.getparent().remove(element)
