Does replaceChild() break for loop around childNodes in Python minidom?
Consider the following code with v being a minidom node:
for w in v.childNodes:
    if ...:
        frag = parseString(...)
        v.replaceChild(frag.documentElement, w)

(Note the DOM signature is replaceChild(newChild, oldChild), so the new node goes first.)
Will it work as expected enumerating all child nodes in turn? Or will replaceChild break the for loop?
https://www.w3.org/TR/DOM-Level-1/level-one-core.html says:
The content of the returned NodeList is "live" in the sense that, for
instance, changes to the children of the node object that it was
created from are immediately reflected in the nodes returned by the
NodeList accessors; it is not a static snapshot of the content of the
node.
It follows that the loop won't be broken: replaceChild() swaps the new node in at the old node's index, so the length of childNodes does not change during iteration.
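As a hedged sketch (document and tag names invented for illustration): even though replacement keeps the list length constant, iterating over a snapshot taken with list() makes the loop independent of any mutation of the live childNodes list.

```python
from xml.dom.minidom import parseString

doc = parseString("<root><a/><b/><a/></root>")
root = doc.documentElement

# Iterate over a snapshot so mutating the live childNodes list
# during the loop cannot affect the iteration.
for w in list(root.childNodes):
    if w.nodeName == "a":
        frag = parseString("<new/>")
        # DOM signature: replaceChild(newChild, oldChild)
        root.replaceChild(frag.documentElement, w)

print(doc.documentElement.toxml())  # <root><new/><b/><new/></root>
```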
I'm using Liza Daly's fast_iter, which has the structure of:
def fast_iter(context, func, args=[], kwargs={}):
    """
    Deletes elements as the tree is traversed to prevent the full tree from building and save memory
    Author: Liza Daly, IBM
    """
    for event, elem in context:
        if elem.tag == 'target':
            func(elem, *args, **kwargs)
        elem.clear()
        while elem.getprevious() is not None:
            del elem.getparent()[0]
    del context
    return save
However, I've noticed that when I create my context as
context = etree.iterparse(path, events=('end',))
the data within elem gets deleted before my function can even process it. For clarity, I am using fully synchronous code.
If I set my context as
context = etree.iterparse(path, events=('end',), tag='target')
it works correctly, but I know it's not doing the full memory conservation that fast_iter is intended to provide.
Is there any reason to even use this compared to xml.dom.pulldom, which is built on SAX and never builds a full tree? It seems like fast_iter attempts to replicate that while staying within lxml.
Does anyone have ideas on what I'm doing wrong? TIA
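For comparison, a minimal pulldom sketch (the tag name and document are invented for illustration): expandNode() materializes just the current element as a minidom subtree, so the rest of the document is never held in memory at once.

```python
from xml.dom import pulldom

xml = ("<root>"
       "<target><foo>test</foo></target>"
       "<target><foo>more</foo></target>"
       "</root>")

texts = []
events = pulldom.parseString(xml)
for event, node in events:
    if event == pulldom.START_ELEMENT and node.tagName == 'target':
        # Build a minidom subtree for just this element
        events.expandNode(node)
        texts.append(node.getElementsByTagName('foo')[0].firstChild.data)

print(texts)  # ['test', 'more']
```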
I think I can see where your approach might delete data you want to access before the code that accesses it is called. Let's assume you have, e.g.,
<target>
    <foo>test</foo>
    <bar>test</bar>
</target>
elements in your XML. Then each time an end element tag is found, your code
for event, elem in context:
    if elem.tag == 'target':
        func(elem, *args, **kwargs)
    elem.clear()
    while elem.getprevious() is not None:
        del elem.getparent()[0]
is run. That means it first encounters the foo end tag, then the bar end tag, at which point the while loop deletes the preceding foo sibling (and elem.clear() has already wiped each child's text as its end tag was seen). Only then is the target end tag encountered, and if your function looks for both the foo and the bar element data, the foo element has already been deleted.
So your code has to take the document structure into account and skip that cleanup for children/descendants of your target element.
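One way to do that, sketched below with a toy document (the 'target' tag and the func body are stand-ins), is to move the cleanup inside the tag check so it only runs after a whole target element has been processed:

```python
import io
from lxml import etree

xml = (b"<root>"
       b"<target><foo>test</foo><bar>test</bar></target>"
       b"<target><foo>test</foo><bar>test</bar></target>"
       b"</root>")

seen = []

def func(elem):
    # Both children are still intact when the target's end tag fires
    seen.append([child.tag for child in elem])

context = etree.iterparse(io.BytesIO(xml), events=('end',))
for event, elem in context:
    if elem.tag == 'target':
        func(elem)
        # Clean up only once the whole <target> has been processed,
        # so <foo> is never deleted before func sees it.
        elem.clear()
        while elem.getprevious() is not None:
            del elem.getparent()[0]

print(seen)  # [['foo', 'bar'], ['foo', 'bar']]
```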
I have a nested dict which looks like this:
There are multiple nestings within the key children. I would like to capture the key branch whenever the key children is present. Because there are multiple children, I would like to do this for each child. Of course, each child can also have further children; this nesting can go up to 7 levels.
To achieve this, I could either write a boneheaded 7-for loop method or use recursion. So I gave recursion a shot and came up with the following code:
def GatherConcepts(header):
    if 'children' in header.keys():
        if len(header['children']) > 0:
            if 'branch' in header.keys():
                concepts.append(header['handle'])
                if 'children' in header.keys():
                    for j in range(0, len(header['children'])):
                        GatherConcepts(header['children'][j])
            else:
                for i in range(0, len(header['children'])):
                    GatherConcepts(header['children'][i])
The problem with this code is that it gives me only 2 levels (because I'm calling the function itself 2 times, thereby not using recursion properly), not 7.
How can I improve this to get all the levels?
Any pointers would be highly appreciated.
You have some unnecessary redundancies. If I understand you correctly, you need to add the handles to the list separately from the recursion, because you want to test branch in the parent.
def GatherConcepts(header):
    if 'children' in header and 'branch' in header:
        for child in header['children']:
            concepts.append(child['handle'])
            GatherConcepts(child)
You don't need to test the length of header['children'] -- if it's zero then the loop will just not do anything.
In order to get recursion correctly, you can use this simple template for it:
def recursive(variable):
    if something:
        # base step
        return somethingelse
    else:
        # recursive step
        return recursive(somethingelse)
In your case, you can try something like this:
def gather_concepts(header):
    # recursive step
    if 'branch' in header and 'handle' in header:
        concepts.append(header['handle'])
    if 'children' in header:
        for child in header['children']:
            gather_concepts(child)
    # base step
    else:
        return
(Note the recursive call must not be returned from inside the loop, or only the first child would ever be visited.)
You should tweak this code to your needs, though, because I haven't tested it myself.
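For reference, here is a runnable sketch along these lines, recursing into every child at any depth; the dict below is hypothetical, using the question's key names ('branch', 'handle', 'children'):

```python
concepts = []

def gather_concepts(header):
    # Record this node's handle when it carries a 'branch' key...
    if 'branch' in header and 'handle' in header:
        concepts.append(header['handle'])
    # ...then recurse into every child, to any depth.
    for child in header.get('children', []):
        gather_concepts(child)

# Hypothetical data following the question's key names
tree = {
    'branch': True, 'handle': 'root',
    'children': [
        {'branch': True, 'handle': 'a',
         'children': [{'handle': 'a1', 'children': []}]},
        {'handle': 'b', 'children': []},
    ],
}

gather_concepts(tree)
print(concepts)  # ['root', 'a']
```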
How to check if an xml node has children in python with minidom?
I'm writing a recursive function to remove all attributes in an XML file, and I need to check whether a node has child nodes before calling the same function again.
What I've tried:
I tried to use node.childNodes.length, but didn't have much luck. Any other suggestions?
Thanks
My Code:
def removeAllAttributes(dom):
    for node in dom.childNodes:
        if node.attributes:
            for key in node.attributes.keys():
                node.removeAttribute(key)
        if node.childNodes.length > 1:
            node = removeAllAttributes(dom)
    return dom
Error code:
RuntimeError: maximum recursion depth exceeded
You are in infinite recursion. Here is your problem line:
node = removeAllAttributes(dom)
I think you mean
node = removeAllAttributes(node)
You could try hasChildNodes() - although if inspecting the childNodes attribute directly isn't working you may have other problems.
On a guess, your processing is being thrown off because your element has no element children, but does have text children or something. You can check for that this way:
import xml.dom.minidom

def removeAllAttributes(element):
    # Copy the key list first: removing attributes while iterating
    # the live NamedNodeMap view would break the iteration.
    for attribute_name in list(element.attributes.keys()):
        element.removeAttribute(attribute_name)
    for child_node in element.childNodes:
        if child_node.nodeType == xml.dom.minidom.Node.ELEMENT_NODE:
            removeAllAttributes(child_node)
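A self-contained sketch of this recursive stripper, exercised on a small invented document (attribute names and structure are for illustration only):

```python
import xml.dom.minidom
from xml.dom.minidom import parseString

def remove_all_attributes(element):
    # Copy the key list first: removing attributes while iterating
    # the live NamedNodeMap view would break the iteration.
    for attribute_name in list(element.attributes.keys()):
        element.removeAttribute(attribute_name)
    # Recurse only into element children; text nodes have no attributes.
    for child_node in element.childNodes:
        if child_node.nodeType == xml.dom.minidom.Node.ELEMENT_NODE:
            remove_all_attributes(child_node)

doc = parseString('<root a="1"><child b="2">text</child></root>')
remove_all_attributes(doc.documentElement)
print(doc.documentElement.toxml())  # <root><child>text</child></root>
```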
This eventually consumes all my available memory and then the process is killed. I've tried changing the tag from schedule to 'smaller' tags but that didn't make a difference.
What am I doing wrong / how can I process this large file with iterparse()?
import lxml.etree
for schedule in lxml.etree.iterparse('really-big-file.xml', tag='schedule'):
    print "why does this consume all my memory?"
I can easily cut it up and process it in smaller chunks but that's uglier than I'd like.
As iterparse iterates over the entire file, a tree is built and no elements are freed. The advantage of doing this is that the elements remember who their parent is, and you can form XPaths that refer to ancestor elements. The disadvantage is that it can consume a lot of memory.
In order to free some memory as you parse, use Liza Daly's fast_iter:
def fast_iter(context, func, *args, **kwargs):
    """
    http://lxml.de/parsing.html#modifying-the-tree
    Based on Liza Daly's fast_iter
    http://www.ibm.com/developerworks/xml/library/x-hiperfparse/
    See also http://effbot.org/zone/element-iterparse.htm
    """
    for event, elem in context:
        func(elem, *args, **kwargs)
        # It's safe to call clear() here because no descendants will be
        # accessed
        elem.clear()
        # Also eliminate now-empty references from the root node to elem
        for ancestor in elem.xpath('ancestor-or-self::*'):
            while ancestor.getprevious() is not None:
                del ancestor.getparent()[0]
    del context
which you could then use like this:
def process_element(elem):
    print "why does this consume all my memory?"

context = lxml.etree.iterparse('really-big-file.xml', tag='schedule', events=('end',))
fast_iter(context, process_element)
I highly recommend the article on which the above fast_iter is based; it should be especially interesting to you if you are dealing with large XML files.
The fast_iter presented above is a slightly modified version of the one shown in the article. This one is more aggressive about deleting previous ancestors, thus saving more memory. Here you'll find a script which demonstrates the difference.
Directly copied from http://effbot.org/zone/element-iterparse.htm
Note that iterparse still builds a tree, just like parse, but you can safely rearrange or remove parts of the tree while parsing. For example, to parse large files, you can get rid of elements as soon as you’ve processed them:
for event, elem in iterparse(source):
    if elem.tag == "record":
        ... process record elements ...
        elem.clear()
The above pattern has one drawback; it does not clear the root element, so you will end up with a single element with lots of empty child elements. If your files are huge, rather than just large, this might be a problem. To work around this, you need to get your hands on the root element. The easiest way to do this is to enable start events, and save a reference to the first element in a variable:
# get an iterable
context = iterparse(source, events=("start", "end"))

# turn it into an iterator
context = iter(context)

# get the root element
event, root = context.next()

for event, elem in context:
    if event == "end" and elem.tag == "record":
        ... process record elements ...
        root.clear()
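The effbot pattern can be exercised end to end with the standard library; the tiny <record> document below stands in for a huge file, and in Python 3 the iterator is advanced with next(context) rather than context.next():

```python
import io
from xml.etree.ElementTree import iterparse

# Small stand-in for a huge file; the pattern is what matters.
source = io.StringIO("<log>" + "<record>x</record>" * 5 + "</log>")

context = iter(iterparse(source, events=("start", "end")))
event, root = next(context)  # the first start event yields the root

count = 0
for event, elem in context:
    if event == "end" and elem.tag == "record":
        count += 1
        root.clear()  # drop already-processed children from the root

print(count)  # 5
```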
This worked really well for me:
def destroy_tree(tree):
    root = tree.getroot()
    node_tracker = {root: [0, None]}

    for node in root.iterdescendants():
        parent = node.getparent()
        node_tracker[node] = [node_tracker[parent][0] + 1, parent]

    node_tracker = sorted([(depth, parent, child) for child, (depth, parent)
                           in node_tracker.items()], key=lambda x: x[0], reverse=True)

    for _, parent, child in node_tracker:
        if parent is None:
            break
        parent.remove(child)

    del tree
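As a self-contained illustration (with a toy document, since the memory savings themselves can't be shown in a snippet), the same bookkeeping removes children before their parents, deepest level first, reducing the tree to a bare root:

```python
from lxml import etree

def destroy_tree(tree):
    # Record each node's depth and parent in document order...
    root = tree.getroot()
    node_tracker = {root: [0, None]}
    for node in root.iterdescendants():
        parent = node.getparent()
        node_tracker[node] = [node_tracker[parent][0] + 1, parent]
    # ...then detach children before their parents, deepest first.
    order = sorted(((depth, parent, child)
                    for child, (depth, parent) in node_tracker.items()),
                   key=lambda x: x[0], reverse=True)
    for _, parent, child in order:
        if parent is None:
            break
        parent.remove(child)

tree = etree.ElementTree(etree.fromstring("<a><b><c/></b><d/></a>"))
destroy_tree(tree)
print(etree.tostring(tree.getroot()).decode())  # <a/>
```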
(Python 3.2)
I'm using etree to parse some XML. To do this, I'm recursively iterating through the document with iterdescendants(). So, something like:
for elem in doc.iterdescendants():
    if elem.tag == "tag":
        pass # Further processing
Sometimes, I process a parent tag that contains children that I want to prevent from being processed in a later recursion. Is it ok to destroy the children?
In my initial testing, I've tried:
for child in elem.getchildren(): child.clear()
For some reason, this results in the element immediately after elem being skipped. It's like that element gets removed as well.
I then tried this, which works (in that it removes the parent's content and its children, but doesn't result in any subsequent siblings of the parent being skipped or otherwise affected):
elem.clear()
Can anyone shed some light on this? Thanks,
I used the following code in place of yours and it seems to work, deleting all the matching elements and their children. I use iterfind to find all descendants with the tag and delete them.
for element in doc.iterfind('.//%s' % tag):
    element.getparent().remove(element)
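A runnable sketch of this approach on an invented document (the 'skip' tag is a stand-in); note the matches are snapshotted with list() before mutating the tree, so the lazy iterfind iterator isn't walking a structure that is changing under it:

```python
from lxml import etree

doc = etree.fromstring("<root><skip><a/><b/></skip><keep/></root>")
tag = "skip"

# Snapshot the matches before mutating the tree, then detach each
# matched element (and its whole subtree) from its parent.
for element in list(doc.iterfind('.//%s' % tag)):
    element.getparent().remove(element)

print(etree.tostring(doc).decode())  # <root><keep/></root>
```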