Update element values using xml.dom.minidom - python

I have an XML structure which looks similar to:
<Store>
<foo>
<book>
<isbn>123456</isbn>
</book>
<title>XYZ</title>
<checkout>no</checkout>
</foo>
<bar>
<book>
<isbn>7890</isbn>
</book>
<title>XYZ2</title>
<checkout>yes</checkout>
</bar>
</Store>
Using only xml.dom.minidom (a restriction I have to work within), I would like to:
1) Traverse the XML file
2) Search for and get a particular element, depending on its parent
Example: the checkout element for author1, the isbn for author2
3) Change that element's value
4) Write the new XML structure to a file
Can anyone help here?
Thank you!
UPDATE:
This is what I have done so far:
import xml.dom.minidom
checkout = "yes"
def getLoneChild(node, tagname):
    assert ((node is not None) and (tagname is not None))
    elem = node.getElementsByTagName(tagname)
    if ((elem is None) or (len(elem) != 1)):
        return None
    return elem
def getLoneLeaf(node, tagname):
    assert ((node is not None) and (tagname is not None))
    elem = node.getElementsByTagName(tagname)
    if ((elem is None) or (len(elem) != 1)):
        return None
    leaf = elem[0].firstChild
    if (leaf is None):
        return None
    return leaf.data
def setcheckout(node, tagname):
    assert ((node is not None) and (tagname is not None))
    child = getLoneChild(node, 'foo')
    Check = getLoneLeaf(child[0],'checkout')
    Check = tagname
    return Check
doc = xml.dom.minidom.parse('test.xml')
root = doc.getElementsByTagName('Store')[0]
output = setcheckout(root, checkout)
tmp_config = '/tmp/tmp_config.xml'
fw = open(tmp_config, 'w')
fw.write(doc.toxml())
fw.close()

I'm not entirely sure what you mean by "checkout". This script will find the element and alter the value of that element. Perhaps you can adapt it to your specific needs.
import xml.dom.minidom as DOM
# find the author as a child of the "Store"
def getAuthor(parent, author):
    # by looking at the element children (text and comment nodes are skipped)
    for child in [child for child in parent.childNodes
                  if child.nodeType == DOM.Node.ELEMENT_NODE]:
        if child.tagName == author:
            return child
    return None
def alterElement(parent, attribute, newValue):
    found = False
    # look through the child elements, skipping Text_Nodes
    # (in your example these hold the "values")
    for child in [child for child in parent.childNodes
                  if child.nodeType == DOM.Node.ELEMENT_NODE]:
        # if the child element tagName matches target element name
        if child.tagName == attribute:
            # alter the data, i.e. the Text_Node value,
            # which is the firstChild of the "isbn" element
            child.firstChild.data = newValue
            return True
        else:
            # otherwise look at all the children of this node.
            found = alterElement(child, attribute, newValue)
            if found:
                break
    # return found status
    return found
doc = DOM.parse("test.xml")
# This assumes that there is only one "Store" in the file
root = doc.getElementsByTagName("Store")[0]
# find the author
# this assumes that there are no duplicate author names in the file
author = getAuthor(root, "foo")
if not author:
    print("Author not found!")
else:
    # alter an element
    if not alterElement(author, "isbn", "987654321"):
        print("isbn not found")
    else:
        # output the xml
        tmp_config = '/tmp/tmp_config.xml'
        f = open(tmp_config, 'w')
        doc.writexml( f )
        f.close()
The general idea is that you match the name of the author against the tagNames of the children of the "Store" element, then recurse through the children of the author, looking for a match against a target element tagName. There are a lot of assumptions made in this solution, but it may get you started. It's painful to try and deal with hierarchical structures like XML without using recursion.
In retrospect there was an error in the "alterElement" function. I've fixed this (note the "found" variable).
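A minimal sketch of the parent-dependent update described above, assuming the sample Store document from the question is held in a string. The names SAMPLE and set_child_text are hypothetical, and the parent lookup is limited to direct children of Store so that the checkout under foo is not confused with the one under bar.

```python
import xml.dom.minidom

# Sample data copied from the question (assumed to be the whole document).
SAMPLE = (
    "<Store>"
    "<foo><book><isbn>123456</isbn></book><title>XYZ</title><checkout>no</checkout></foo>"
    "<bar><book><isbn>7890</isbn></book><title>XYZ2</title><checkout>yes</checkout></bar>"
    "</Store>"
)


def set_child_text(doc, parent_tag, target_tag, new_value):
    """Set the text of the first <target_tag> found under <parent_tag>.

    Returns True if an element was updated, False otherwise.
    """
    store = doc.documentElement
    for parent in store.childNodes:
        # Only element children of the root are candidate parents.
        if parent.nodeType != parent.ELEMENT_NODE or parent.tagName != parent_tag:
            continue
        for target in parent.getElementsByTagName(target_tag):
            first = target.firstChild
            if first is not None and first.nodeType == first.TEXT_NODE:
                first.data = new_value
            else:
                # No leading text node: append one instead of editing.
                target.appendChild(doc.createTextNode(new_value))
            return True
    return False


doc = xml.dom.minidom.parseString(SAMPLE)
changed = set_child_text(doc, "foo", "checkout", "yes")
result = doc.documentElement.toxml()
```

Writing result (or doc.toxml()) to a file then covers the last step of the question. The sketch only changes the first match, which seems to be what the question asks for.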

Related

Python XML Element Tree Parsing a big document, returns a subset

I have a big XML document of German text, but iterating from the root only returns a subset of the document:
root.iter('tu') only finds 82 elements.
import logging
import xml.etree.cElementTree as ET
class Extractor(object):
    def _get_iter(self, filename: str):
        with open(filename) as objects:
            context = ET.iterparse(objects, events=("start", "end"))
            index, (event, root) = next(enumerate(context))
            return root.iter('tu')
    def get_objects(self, filename: str, limit=-1):
        found = sum(1 for _ in self._get_iter(filename))
        logging.getLogger(__name__).info('found: {}'.format(found))
        # found is 82, actual number is millions
alignments = extractor.get_alignments('data/file.tmx', 100000)
update: Sample tmx file: https://pastebin.com/kUFMMjck
update: Resolved it by filtering the iterparse events on the tag name tu. I suppose this is buggy behaviour in root.iter().
root.iter('tagname') behaves unexpectedly here: it does not act as a streaming iterator over the file, and apparently only walks the part of the document that has already been parsed when it is called.
The solution is:
class Extractor(object):
    def get_objects(self, filename: str):
        # get an iterable
        context = ET.iterparse(filename, events=("start", "end"))
        # turn it into an iterator
        context = iter(context)
        for event, elem in context:
            if event == "end" and elem.tag == "tu":
                # do something with elem
                elem.clear()  # clears memory after doing something with the data
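The event-based approach above can be checked on a small in-memory document. This is only a sketch: count_elements and the generated sample are hypothetical stand-ins for the real TMX file, and only "end" events are requested because an element is complete once its end tag has been seen.

```python
import io
import xml.etree.ElementTree as ET


def count_elements(source, tag):
    """Count <tag> elements in a stream without keeping their content."""
    count = 0
    # Only "end" events are needed: an element is complete when it ends.
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == tag:
            count += 1
            # Drop the element's children and text once it has been handled.
            elem.clear()
    return count


# Hypothetical stand-in for a large TMX file: 1000 <tu> elements.
sample = b"<tmx><body>" + b"<tu><tuv><seg>Hallo</seg></tuv></tu>" * 1000 + b"</body></tmx>"
total = count_elements(io.BytesIO(sample), "tu")
```

Because every element is counted as its end event arrives, the count does not depend on how much of the file has been read at any one moment.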

How to simplify my code - extract all node values of same xml tag name

For example, here is the XML data:
<SOAP-ENV:Body>
<reportList>
<reportName>report 1</reportName>
</reportList>
<reportList>
<reportName>report 2</reportName>
</reportList>
<reportList>
<reportName>report 3</reportName>
</reportList>
</SOAP-ENV:Body>
Here is my code to extract the node values of all reportName elements, and it works.
import xml.dom.minidom
...
node = xml.dom.minidom.parseString(xml_file.text).documentElement
reportLists = node.getElementsByTagName('reportList')
reports = []
for reportList in reportLists:
    reportObj = reportList.getElementsByTagName('reportName')[0]
    reports.append(reportObj)
for report in reports:
    nodes = report.childNodes
    for node in nodes:
        if node.nodeType == node.TEXT_NODE:
            print (node.data)
result:
report 1
report 2
report 3
Although it works, I want to simplify the code. How can I achieve the same result using shorter code?
You can simplify both for loops using list comprehensions:
import xml.dom.minidom
node = xml.dom.minidom.parseString(xml_file.text).documentElement
reportLists = node.getElementsByTagName('reportList')
reports = [report.getElementsByTagName('reportName')[0] for report in reportLists]
node_data = [node.data for report in reports for node in report.childNodes if node.nodeType == node.TEXT_NODE]
node_data is now a list containing the information you were printing.

Search and remove element with elementTree in Python

I have an XML document in which I want to search for some elements and, if they match some criteria, delete them.
However, I cannot seem to access the parent of the element in order to delete it.
file = open('test.xml', "r")
elem = ElementTree.parse(file)
namespace = "{http://somens}"
props = elem.findall('.//{0}prop'.format(namespace))
for prop in props:
    type = prop.attrib.get('type', None)
    if type == 'json':
        value = json.loads(prop.attrib['value'])
        if value['name'] == 'Page1.Button1':
            #here I need to access the parent of prop
            # in order to delete the prop
Is there a way I can do this?
Thanks
You can remove child elements with the corresponding remove method. To remove an element you have to call its parent's remove method. Unfortunately Element does not provide a reference to its parent, so it is up to you to keep track of parent/child relations (which speaks against your use of elem.findall()).
A proposed solution could look like this:
root = elem.getroot()
for child in root:
    # Element exposes its name as .tag; namespace is the prefix from the question
    if child.tag != namespace + "prop":
        continue
    if True:  # TODO: do your check here!
        root.remove(child)
PS: don't use prop.attrib.get(), use prop.get(), as explained here.
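One way to keep track of parent/child relations yourself is to build a child-to-parent dictionary before removing anything. The following is a sketch under assumed sample data: the XML string, NS and parent_of are hypothetical names, and only the namespace and attribute layout are taken from the question.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical sample; the namespace and attribute layout follow the question.
XML = """<root xmlns="http://somens">
  <group>
    <prop type="json" value='{"name": "Page1.Button1"}'/>
    <prop type="json" value='{"name": "Page1.Button2"}'/>
  </group>
  <prop type="text" value="plain"/>
</root>"""

NS = "{http://somens}"
root = ET.fromstring(XML)

# Map every child to its parent in one pass over the tree.
parent_of = {child: parent for parent in root.iter() for child in parent}

# findall returns a list, so removing elements while looping over it is safe.
for prop in root.findall(".//{0}prop".format(NS)):
    if prop.get("type") != "json":
        continue
    if json.loads(prop.get("value"))["name"] == "Page1.Button1":
        parent_of[prop].remove(prop)

remaining = [p.get("value") for p in root.iter(NS + "prop")]
```

The map is built once, so it works at any nesting depth, but it has to be rebuilt if elements are moved or added afterwards.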
You could use xpath to select an Element's parent.
file = open('test.xml', "r")
elem = ElementTree.parse(file)
namespace = "{http://somens}"
props = elem.findall('.//{0}prop'.format(namespace))
for prop in props:
    type = prop.get('type', None)
    if type == 'json':
        value = json.loads(prop.attrib['value'])
        if value['name'] == 'Page1.Button1':
            # Get parent and remove this prop
            parent = prop.find("..")
            parent.remove(prop)
http://docs.python.org/2/library/xml.etree.elementtree.html#supported-xpath-syntax
Except if you try that it doesn't work: http://elmpowered.skawaii.net/?p=74
So instead you have to:
file = open('test.xml', "r")
elem = ElementTree.parse(file)
namespace = "{http://somens}"
search = './/{0}prop'.format(namespace)
# Use xpath to get all parents of props
prop_parents = elem.findall(search + '/..')
for parent in prop_parents:
    # Still have to find and iterate through child props
    for prop in parent.findall(search):
        type = prop.get('type', None)
        if type == 'json':
            value = json.loads(prop.attrib['value'])
            if value['name'] == 'Page1.Button1':
                parent.remove(prop)
It is two searches and a nested loop. The inner search is only on Elements known to contain props as first children, but that may not mean much depending on your schema.
I know this is an old thread but this kept popping up while I was trying to figure out a similar task. I did not like the accepted answer for two reasons:
1) It doesn't handle multiple nested levels of tags.
2) It will break if multiple XML tags at the same level are deleted one after another. Since each element sits at an index of Element._children, you shouldn't delete while iterating forward.
I think a better, more versatile solution is this:
import xml.etree.ElementTree as et
file = 'test.xml'
tree = et.parse(file)
root = tree.getroot()
def iterator(parents, nested=False):
    # iterate in reverse so that removals do not shift unvisited children
    for child in reversed(parents):
        if nested:
            if len(child) >= 1:
                # keep descending so deeper levels are handled too
                iterator(child, nested=True)
        if True:  # Add your entire condition here
            parents.remove(child)
iterator(root, nested=True)
For the OP, this should work - but I don't have the data you're working with to test if it's perfect.
import json
import xml.etree.ElementTree as et
file = 'test.xml'
tree = et.parse(file)
root = tree.getroot()
namespace = "{http://somens}"
def iterator(parents, nested=False):
    for child in reversed(parents):
        if nested:
            if len(child) >= 1:
                iterator(child, nested=True)
        # only namespaced prop elements of type json are candidates
        if child.tag == namespace + 'prop' and child.get('type') == 'json':
            value = json.loads(child.get('value'))
            if value['name'] == 'Page1.Button1':
                parents.remove(child)
iterator(root, nested=True)
A solution using the lxml module:
from lxml import etree
root = etree.fromstring(xml_str)
for e in root.findall('.//{http://some.name.space}node'):
    # lxml elements know their parent, unlike xml.etree elements
    parent = e.getparent()
    if parent is not None:
        parent.remove(e)
Using the fact that every child must have a parent, I'm going to simplify @kitsu.eb's example. If you use findall to get both the children and the parents, their indices will line up, provided each parent holds exactly one matching child (findall lists each parent only once).
file = open('test.xml', "r")
elem = ElementTree.parse(file)
namespace = "{http://somens}"
search = './/{0}prop'.format(namespace)
# Use xpath to get all parents of props
prop_parents = elem.findall(search + '/..')
props = elem.findall('.//{0}prop'.format(namespace))
for prop in props:
    type = prop.attrib.get('type', None)
    if type == 'json':
        value = json.loads(prop.attrib['value'])
        if value['name'] == 'Page1.Button1':
            #use the index of the current child to find
            #its parent and remove the child
            prop_parents[props.index(prop)].remove(prop)
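One caveat with pairing the two lists by index: ElementTree reports each parent only once for a '/..' path, so the lists can differ in length when a parent holds several matching children. A small check on hypothetical sample data (un-namespaced for brevity):

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the first parent holds two props, the second holds one.
root = ET.fromstring(
    "<root>"
    "<a><prop id='1'/><prop id='2'/></a>"
    "<b><prop id='3'/></b>"
    "</root>"
)

props = root.findall(".//prop")
parents = root.findall(".//prop/..")

# Each parent is reported once, so the two lists differ in length here.
counts = (len(props), len(parents))
parent_tags = [p.tag for p in parents]
```

In this sample the second prop would be paired with b rather than a, so the index approach is only reliable when every parent has a single matching child.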
I also used XPath for this issue, but in a different way:
root = elem.getroot()
elementName = "YourElement"
#this will find all the parents of the elements with elementName
for elementParent in root.findall(".//{}/..".format(elementName)):
    #this will find all the elements under the parent, and remove them
    for element in elementParent.findall("{}".format(elementName)):
        elementParent.remove(element)
I like to use an XPath expression for this kind of filtering. Unless I know otherwise, such an expression must be applied at the root level, which means I can't just get a parent and apply the same expression on that parent. However, it seems to me that there is a nice and flexible solution that should work with any supported XPath, as long as none of the sought nodes is the root. It goes something like this:
root = elem.getroot()
# Find all nodes matching the filter string (flt)
nodes = root.findall(flt)
while len(nodes):
    # As long as there are nodes, there should be parents
    # Get the first of all parents to the found nodes
    parent = root.findall(flt+'/..')[0]
    # Use this parent to remove the first node
    parent.remove(nodes[0])
    # Find all remaining nodes
    nodes = root.findall(flt)
I would only like to add a comment on the accepted answer, but my lack of reputation doesn't allow me to. I wanted to add that it is important to add .findall("*") to the iterator to avoid issues, as stated in the documentation:
Note that concurrent modification while iterating can lead to problems, just like when iterating and modifying Python lists or dicts. Therefore, the example first collects all matching elements with root.findall(), and only then iterates over the list of matches.
Therefore, in the accepted answer the iteration should be for child in root.findall("*"): instead of for child in root:. Not doing so made my code skip some elements from the list.
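The skipping described above can be reproduced with a short sketch. The helper remove_all_children and the four-element sample are hypothetical; the only difference between the two runs is whether the loop walks the live element or the list returned by findall("*").

```python
import xml.etree.ElementTree as ET


def remove_all_children(root, snapshot):
    """Try to remove every child of root; return how many were removed."""
    # snapshot=True iterates over a separate list, snapshot=False over the
    # live element that is being modified inside the loop.
    children = root.findall("*") if snapshot else root
    removed = 0
    for child in children:
        root.remove(child)
        removed += 1
    return removed


xml_text = "<root><a/><b/><c/><d/></root>"
live = remove_all_children(ET.fromstring(xml_text), snapshot=False)
safe = remove_all_children(ET.fromstring(xml_text), snapshot=True)
```

With the snapshot all four children are removed, while the live iteration removes fewer because each removal shifts the remaining children under the iterator.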

Node.TEXT_NODE has the value, but I need the Attribute

I have an xml file like so:
<host name='ip-10-196-55-2.ec2.internal'>
<hostvalue name='arch_string'>lx24-x86</hostvalue>
<hostvalue name='num_proc'>1</hostvalue>
<hostvalue name='load_avg'>0.01</hostvalue>
</host>
I can get the Node.data out of a Node.TEXT_NODE, but I also need the attribute name. For example, I want to know that load_avg = 0.01 without writing load_avg, num_proc, etc. one by one. I want them all.
My code looks like this, but I can't figure out what part of the Node has the attribute name for me.
for stat in h.getElementsByTagName("hostvalue"):
    for node3 in stat.childNodes:
        attr = "foo"
        val = "poo"
        if node3.nodeType == Node.ATTRINUTE_NODE:
            attr = node3.tagName
        if node3.nodeType == Node.TEXT_NODE:
            #attr = node3.tagName
            val = node3.data
From the above code, I'm able to get val, but not attr (compile error:
Here's a short example of what you could do:
from xml.dom import minidom
xmldoc = minidom.parse("so.xml")
values = {}
for stat in xmldoc.getElementsByTagName("hostvalue"):
    # the attribute lives on the element itself, not on its child nodes
    attr = stat.attributes["name"].value
    value = "\n".join([x.data for x in stat.childNodes])
    values[attr] = value
print(repr(values))
This outputs, given your XML file:
$ ./parse.py
{u'num_proc': u'1', u'arch_string': u'lx24-x86', u'load_avg': u'0.01'}
Be warned that this is not failsafe: for example, it breaks if you have nested elements inside <hostvalue>, because element nodes have no data attribute.
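A variant that tolerates nested elements could filter on the node type before reading data. This is a sketch on assumed input: the XML string (including the nested unit element) and the helper direct_text are hypothetical.

```python
from xml.dom import minidom

# Hypothetical input: the second hostvalue contains a nested element.
XML = (
    "<host name='ip-10-196-55-2.ec2.internal'>"
    "<hostvalue name='arch_string'>lx24-x86</hostvalue>"
    "<hostvalue name='load_avg'>0.01<unit>avg</unit></hostvalue>"
    "</host>"
)


def direct_text(element):
    """Join only the text nodes that are direct children of element."""
    return "".join(
        child.data for child in element.childNodes
        if child.nodeType == child.TEXT_NODE
    )


doc = minidom.parseString(XML)
values = {
    stat.getAttribute("name"): direct_text(stat)
    for stat in doc.getElementsByTagName("hostvalue")
}
```

Text inside nested elements is ignored here; whether that is the right behaviour depends on what the nested elements mean in your data.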

Processing RSS/RDF via xml.dom.minidom

I'm trying to process a Delicious RSS feed via Python. Here's a sample:
...
<item rdf:about="http://weblist.me/">
<title>WebList - The Place To Find The Best List On The Web</title>
<dc:date>2009-12-24T17:46:14Z</dc:date>
<link>http://weblist.me/</link>
...
</item>
<item rdf:about="http://thumboo.com/">
<title>Thumboo! Free Website Thumbnails and PHP Script to Generate Web Screenshots</title>
<dc:date>2006-10-24T18:11:32Z</dc:date>
<link>http://thumboo.com/</link>
...
The relevant code is:
def getText(nodelist):
    rc = ""
    for node in nodelist:
        if node.nodeType == node.TEXT_NODE:
            rc = rc + node.data
    return rc
dom = xml.dom.minidom.parse(file)
items = dom.getElementsByTagName("item")
for i in items:
    title = i.getElementsByTagName("title")
    print getText(title)
I would think this would print out each title, but instead I basically get blank output. I'm sure I'm doing something stupid, but I have no idea what.
You are passing the title nodes to getText, and their nodeType is not node.TEXT_NODE (they are element nodes). You have to loop over the children of each node instead in your getText method:
def getTextSingle(node):
    parts = [child.data for child in node.childNodes if child.nodeType == node.TEXT_NODE]
    return u"".join(parts)
def getText(nodelist):
    return u"".join(getTextSingle(node) for node in nodelist)
Even better, call node.normalize() before calling getTextSingle, which ensures that consecutive children of type node.TEXT_NODE are merged into a single node.TEXT_NODE.
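The effect of normalize() can be seen on a small hand-built example. This is a sketch with a hypothetical one-item document; the second text node is appended manually to create adjacent text nodes, which a parser may also produce on its own for long text.

```python
import xml.dom.minidom

doc = xml.dom.minidom.parseString("<item><title>Web</title></item>")
title = doc.getElementsByTagName("title")[0]

# Appending a second text node leaves two adjacent TEXT_NODE children.
title.appendChild(doc.createTextNode("List"))
before = len(title.childNodes)

# normalize() merges adjacent text nodes in place.
title.normalize()
after = len(title.childNodes)
text = title.firstChild.data
```

After normalizing, the title has a single text child holding the combined string, so reading firstChild.data is enough.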
