Processing RSS/RDF via xml.dom.minidom - python

I'm trying to process a Delicious RSS feed with Python. Here's a sample:
...
<item rdf:about="http://weblist.me/">
<title>WebList - The Place To Find The Best List On The Web</title>
<dc:date>2009-12-24T17:46:14Z</dc:date>
<link>http://weblist.me/</link>
...
</item>
<item rdf:about="http://thumboo.com/">
<title>Thumboo! Free Website Thumbnails and PHP Script to Generate Web Screenshots</title>
<dc:date>2006-10-24T18:11:32Z</dc:date>
<link>http://thumboo.com/</link>
...
The relevant code is:
def getText(nodelist):
    rc = ""
    for node in nodelist:
        if node.nodeType == node.TEXT_NODE:
            rc = rc + node.data
    return rc
dom = xml.dom.minidom.parse(file)
items = dom.getElementsByTagName("item")
for i in items:
    title = i.getElementsByTagName("title")
    print getText(title)
I would expect this to print each title, but instead I get essentially blank output. I'm sure I'm doing something stupid, but I have no idea what.

You are passing the title nodes to getText, whose nodeTypes are not node.TEXT_NODE. You have to loop over all the children of the node instead in your getText method:
def getTextSingle(node):
    parts = [child.data for child in node.childNodes if child.nodeType == node.TEXT_NODE]
    return u"".join(parts)
def getText(nodelist):
    return u"".join(getTextSingle(node) for node in nodelist)
Even better, call node.normalize() before calling getTextSingle which ensures that consecutive children of type node.TEXT_NODE are merged into a single node.TEXT_NODE.
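As a rough sketch of the approach above, here is a Python 3 version (the snippets in this thread are Python 2) run against a cut-down copy of the feed. The `SAMPLE` string, its namespace declarations, and the snake_case function names are assumptions added only so the fragment parses and runs on its own.

```python
import xml.dom.minidom

# Hypothetical, cut-down feed; the namespace declarations are assumed so it parses.
SAMPLE = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <item rdf:about="http://weblist.me/">
    <title>WebList - The Place To Find The Best List On The Web</title>
    <dc:date>2009-12-24T17:46:14Z</dc:date>
    <link>http://weblist.me/</link>
  </item>
  <item rdf:about="http://thumboo.com/">
    <title>Thumboo! Free Website Thumbnails</title>
    <link>http://thumboo.com/</link>
  </item>
</rdf:RDF>"""


def get_text_single(node):
    """Join the data of the text nodes that are direct children of node."""
    node.normalize()  # merge adjacent text nodes first
    return "".join(child.data for child in node.childNodes
                   if child.nodeType == child.TEXT_NODE)


def get_text(nodelist):
    """Concatenate the text of every element in nodelist."""
    return "".join(get_text_single(node) for node in nodelist)


dom = xml.dom.minidom.parseString(SAMPLE)
titles = [get_text(item.getElementsByTagName("title"))
          for item in dom.getElementsByTagName("item")]
for title in titles:
    print(title)
```

The key difference from the code in the question is that the text is read from the children of each title element rather than from the title elements themselves.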

Related

How to simplify my code - extract all node values of same xml tag name

For example, here is the XML data:
<SOAP-ENV:Body>
<reportList>
<reportName>report 1</reportName>
</reportList>
<reportList>
<reportName>report 2</reportName>
</reportList>
<reportList>
<reportName>report 3</reportName>
</reportList>
</SOAP-ENV:Body>
Here is my code to extract the node values of all reportName, and it works.
import xml.dom.minidom
...
node = xml.dom.minidom.parseString(xml_file.text).documentElement
reportLists = node.getElementsByTagName('reportList')
reports = []
for reportList in reportLists:
    reportObj = reportList.getElementsByTagName('reportName')[0]
    reports.append(reportObj)
for report in reports:
    nodes = report.childNodes
    for node in nodes:
        if node.nodeType == node.TEXT_NODE:
            print (node.data)
result:
report 1
report 2
report 3
Although it works, I want to simplify the code. How to achieve the same result using shorter code?
You can simplify both for loops using list comprehensions:
import xml.dom.minidom
node = xml.dom.minidom.parseString(xml_file.text).documentElement
reportLists = node.getElementsByTagName('reportList')
reports = [report.getElementsByTagName('reportName')[0] for report in reportLists]
node_data = [node.data for report in reports for node in report.childNodes if node.nodeType == node.TEXT_NODE]
node_data is now a list containing the information you were printing.
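A self-contained sketch of the same idea, for anyone who wants to try it without the original response object. The `xml_text` string and its SOAP-ENV namespace declaration are assumptions added so the sample parses by itself. Note that this variant collects every reportName in the document in one pass, rather than only the first one under each reportList, which gives the same result for this data.

```python
import xml.dom.minidom

# Assumed wrapper: the SOAP-ENV declaration is added so the sample parses on its own.
xml_text = """<SOAP-ENV:Body xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/">
  <reportList><reportName>report 1</reportName></reportList>
  <reportList><reportName>report 2</reportName></reportList>
  <reportList><reportName>report 3</reportName></reportList>
</SOAP-ENV:Body>"""

root = xml.dom.minidom.parseString(xml_text).documentElement

# One comprehension: every text node directly under any <reportName> element.
node_data = [
    child.data
    for report in root.getElementsByTagName("reportName")
    for child in report.childNodes
    if child.nodeType == child.TEXT_NODE
]
print(node_data)
```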

python ElementTree.Element missing text?

So, I'm parsing this xml file of moderate size (about 27K lines). Not far into it, I'm seeing unexpected behavior from ElementTree.Element where I get Element.text for one entry but not the next, yet it's there in the source XML as you can see:
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
<xs:enumeration value="24">
<xs:annotation>
<xs:documentation>UPC12 (item-specific) on cover 2</xs:documentation>
<xs:documentation>AKA item/price; ‘cover 2’ is defined as the inside front cover of a book</xs:documentation>
</xs:annotation>
</xs:enumeration>
<xs:enumeration value="25">
<xs:annotation>
<xs:documentation>UPC12+5 (item-specific) on cover 2</xs:documentation>
<xs:documentation>AKA item/price; ‘cover 2’ is defined as the inside front cover of a book</xs:documentation>
</xs:annotation>
</xs:enumeration>
When I encounter an enumeration tag I call this function:
import xml.etree.cElementTree as ElementTree
...
def _parse_list_item(xmlns: str, list_id: int, itemElement: ElementTree.Element) -> ListItem:
    if isinstance(itemElement, ElementTree.Element):
        if itemElement.attrib['value'] is not None:
            item_id = itemElement.attrib['value'] # string
            if list_id == 6 and (item_id == '25' or item_id=='24'):
                print(list_id, item_id) # <== debug break point here
            desc = None
            notes = ""
            for child in itemElement:
                if child.tag == (xmlns + 'annotation'):
                    for grandchild in child:
                        if grandchild.tag == (xmlns + 'documentation'):
                            if desc is None:
                                desc = grandchild.text
                            else:
                                if len(notes)>0:
                                    notes += " " # add a space
                                notes += grandchild.text or ""
            if item_id is not None and desc is not None:
                return Codex.ListItem({'itemId': item_id, 'listId': list_id, 'description': desc, 'notes': notes})
If I place a breakpoint at the print statement, when I get to the enumeration node for "24" I can look at the text for the grandchild nodes and they are as shown in the XML, i.e. "UPC12..." or "AKA item...", but when I get to the enumeration node for "25", and look at the grandchild text, it's None.
When I remove the xs: namespace by pre-filtering the XML file, the grandchild text comes through fine.
Is it possible I'm over some size limit or is there some syntax problem?
Sorry for less-than-pythonic code but I wanted to be able to examine all the intermediate values in pycharm. It's python 3.6.
Thanks for any insights you may have!
In the for loop, this condition is never met: if child.tag == (xmlns + 'annotation'):.
Why?
Try to output the child's tag. If we suppose your namespace (xmlns) is 'Steve' then:
print(child.tag) will output: {Steve}annotation, not Steveannotation.
So given this fact, if child.tag == (xmlns + 'annotation'): is always False.
You should change it to: if child.tag == ('{'+xmlns+'}annotation'):
With the same logic, you will find out you will also have to change this condition:
if grandchild.tag == (xmlns + 'documentation'):
to:
if grandchild.tag == ('{'+xmlns+'}documentation'):
So, ultimately, I solved my problem by running a pre-process on the XML file to remove the xs: namespace from all of the open/close XML tags, and then I was able to successfully process the file using the function as defined above. Not sure why namespaces are causing problems, but perhaps there is a bug in cElementTree for namespace prefixes in large XML files. To @mzjn: I expect that it would be difficult to construct a minimal example, as it does process hundreds of items correctly before it fails, so I would at least have to provide a fairly large XML file. Nevertheless, thanks for being a sounding board.
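For reference, a minimal Python 3 sketch of how ElementTree reports namespaced tags, and of matching them through a prefix map instead of string concatenation. It does not reproduce the failure described above. The shortened, closed `SCHEMA` string and the names `XS`, `ns` and `items` are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"

# Trimmed copy of the sample from the question, closed so that it parses on its own.
SCHEMA = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:enumeration value="24">
    <xs:annotation>
      <xs:documentation>UPC12 (item-specific) on cover 2</xs:documentation>
      <xs:documentation>AKA item/price</xs:documentation>
    </xs:annotation>
  </xs:enumeration>
  <xs:enumeration value="25">
    <xs:annotation>
      <xs:documentation>UPC12+5 (item-specific) on cover 2</xs:documentation>
      <xs:documentation>AKA item/price</xs:documentation>
    </xs:annotation>
  </xs:enumeration>
</xs:schema>"""

root = ET.fromstring(SCHEMA)
first = root[0]
# Tags are stored as "{namespace-uri}localname", not as "xs:localname".
print(first.tag)

# A prefix map lets find/findall resolve "xs:" without building tag strings by hand.
ns = {"xs": XS}
items = {}
for enum in root.findall("xs:enumeration", ns):
    docs = [d.text or "" for d in enum.findall("xs:annotation/xs:documentation", ns)]
    if docs:
        # First documentation entry is the description, the rest are notes.
        items[enum.get("value")] = (docs[0], " ".join(docs[1:]))

print(items)
```

With this style the comparison against `child.tag` disappears entirely, so a mismatch between a bare URI and the braced form cannot occur.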

How can I parse a XML file to a dictionary in Python?

I am trying to parse an XML file using the Python library minidom (I also tried the xml.etree.ElementTree API).
My XML (resource.xml)
<?xml version='1.0'?>
<quota_result xmlns="https://some_url">
</quota_rule>
<quota_rule name='max_mem_per_user/5'>
<users>user1</users>
<limit resource='mem' limit='1550' value='921'/>
</quota_rule>
<quota_rule name='max_mem_per_user/6'>
<users>user2</users>
<limit resource='mem' limit='2150' value='3'/>
</quota_rule>
</quota_result>
I would like to parse this file, store the information in a dictionary of the following form, and be able to access it:
dict={user1=[resource,limit,value],user2=[resource,limit,value]}
So far I have only been able to do things like:
docXML = minidom.parse("resource.xml")
for node in docXML.getElementsByTagName('limit'):
    print node.getAttribute('value')
You can use getElementsByTagName and getAttribute to build the result:
from xml.dom.minidom import parse
dict_users = dict()
docXML = parse('mydata.xml')
rules = docXML.getElementsByTagName("quota_rule")
for node in rules:
    # check the length of tag_user to see whether the <users> tag exists
    tag_user = node.getElementsByTagName("users")
    if len(tag_user) == 0:
        print("tag <users> does not exist")
    limit_node = node.getElementsByTagName("limit")[0]
    resource = limit_node.getAttribute("resource")
    limit = limit_node.getAttribute("limit")
    value = limit_node.getAttribute("value")
    if len(tag_user) == 0:
        # no user name available: store the rule under the key 'None'
        dict_users['None'] = [resource, limit, value]
    else:
        dict_users[tag_user[0].firstChild.data] = [resource, limit, value]
print(dict_users)
Output (produced after removing <users>user1</users> from the XML):
tag <users> does not exist
{'None': [u'mem', u'1550', u'921'], u'user2': [u'mem', u'2150', u'3']}
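A Python 3 sketch of the same idea, wrapped in a function and run against an inline copy of the data. `RESOURCE_XML` is an assumed well-formed version of resource.xml (the stray closing tag is dropped and the second users tag is closed), and `quota_by_user` is a hypothetical helper name. Unlike the answer above, it skips rules that have no user name instead of storing them under 'None'.

```python
import xml.dom.minidom

# Assumed well-formed version of resource.xml, inlined so the sketch is runnable.
RESOURCE_XML = """<?xml version='1.0'?>
<quota_result xmlns="https://some_url">
  <quota_rule name='max_mem_per_user/5'>
    <users>user1</users>
    <limit resource='mem' limit='1550' value='921'/>
  </quota_rule>
  <quota_rule name='max_mem_per_user/6'>
    <users>user2</users>
    <limit resource='mem' limit='2150' value='3'/>
  </quota_rule>
</quota_result>"""


def quota_by_user(xml_text):
    """Map each user name to [resource, limit, value] from its <limit> element."""
    doc = xml.dom.minidom.parseString(xml_text)
    result = {}
    for rule in doc.getElementsByTagName("quota_rule"):
        users = rule.getElementsByTagName("users")
        limits = rule.getElementsByTagName("limit")
        if not users or users[0].firstChild is None or not limits:
            continue  # skip rules without a user name or a limit
        name = users[0].firstChild.data.strip()
        limit = limits[0]
        result[name] = [limit.getAttribute(attr)
                        for attr in ("resource", "limit", "value")]
    return result


quotas = quota_by_user(RESOURCE_XML)
print(quotas)
```

The attribute values come back as strings, so convert them with int() if you need to compare limits numerically.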

How to search for XML elements in python?

The code shown below works perfectly, but the problem is that I need to set namespace prefixes like d: manually. Is it possible to search for elements while ignoring these namespaces, like dom.getElementsByTagName('Scopes')?
def parseSoapBody(soap_data):
    dom = parseString(soap_data)
    return {
        'scopes': dom.getElementsByTagName('d:Scopes')[0].firstChild.nodeValue,
        'address': dom.getElementsByTagName('d:XAddrs')[0].firstChild.nodeValue,
    }
Since your code uses parseString and getElementsByTagName, I'm assuming you are using minidom. In that case, try:
dom.getElementsByTagNameNS('*', 'Scopes')
It doesn't say so in the docs, but if you look in the source code for xml/dom/minidom.py, you'll see getElementsByTagNameNS calls _get_elements_by_tagName_ns_helper which is defined like this:
def _get_elements_by_tagName_ns_helper(parent, nsURI, localName, rc):
    for node in parent.childNodes:
        if node.nodeType == Node.ELEMENT_NODE:
            if ((localName == "*" or node.localName == localName) and
                (nsURI == "*" or node.namespaceURI == nsURI)):
                rc.append(node)
            _get_elements_by_tagName_ns_helper(node, nsURI, localName, rc)
    return rc
Notice that when nsURI equals *, only the localName needs to match.
For example,
import xml.dom.minidom as minidom
content = '''<root xmlns:f="foo"><f:test/><f:test/></root>'''
dom = minidom.parseString(content)
for n in dom.getElementsByTagNameNS('*', 'test'):
    print(n.toxml())
# <f:test/>
# <f:test/>
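Applied to the function from the question, the wildcard lookup might look like the sketch below. The element names Scopes and XAddrs come from the question; the `SOAP_DATA` payload, its namespace URIs and values, and the helper `first_text` are invented for illustration.

```python
from xml.dom.minidom import parseString

# Hypothetical payload: only the element names are taken from the question.
SOAP_DATA = """<e:Envelope xmlns:e="urn:example:envelope" xmlns:d="urn:example:discovery">
  <e:Body>
    <d:Scopes>scope-a scope-b</d:Scopes>
    <d:XAddrs>http://192.0.2.10/service</d:XAddrs>
  </e:Body>
</e:Envelope>"""


def first_text(dom, local_name):
    """Text of the first element with this local name, in any namespace."""
    matches = dom.getElementsByTagNameNS("*", local_name)
    if not matches or matches[0].firstChild is None:
        return None
    return matches[0].firstChild.nodeValue


def parse_soap_body(soap_data):
    dom = parseString(soap_data)
    return {
        "scopes": first_text(dom, "Scopes"),
        "address": first_text(dom, "XAddrs"),
    }


result = parse_soap_body(SOAP_DATA)
print(result)
```

Returning None for a missing element is a design choice of this sketch; the original function would raise an IndexError instead.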

Update element values using xml.dom.minidom

I have an XML structure which looks similar to:
<Store>
<foo>
<book>
<isbn>123456</isbn>
</book>
<title>XYZ</title>
<checkout>no</checkout>
</foo>
<bar>
<book>
<isbn>7890</isbn>
</book>
<title>XYZ2</title>
<checkout>yes</checkout>
</bar>
</Store>
Using xml.dom.minidom only (a restriction), I would like to:
1) Traverse the XML file
2) Find a particular element, depending on its parent
Example: the checkout element for author1, the isbn for author2
3) Change that element's value
4) Write the new XML structure to a file
Can anyone help here?
Thank you!
UPDATE:
This is what I have done so far:
import xml.dom.minidom
checkout = "yes"
def getLoneChild(node, tagname):
    assert ((node is not None) and (tagname is not None))
    elem = node.getElementsByTagName(tagname)
    if ((elem is None) or (len(elem) != 1)):
        return None
    return elem
def getLoneLeaf(node, tagname):
    assert ((node is not None) and (tagname is not None))
    elem = node.getElementsByTagName(tagname)
    if ((elem is None) or (len(elem) != 1)):
        return None
    leaf = elem[0].firstChild
    if (leaf is None):
        return None
    return leaf.data
def setcheckout(node, tagname):
    assert ((node is not None) and (tagname is not None))
    child = getLoneChild(node, 'foo')
    Check = getLoneLeaf(child[0],'checkout')
    Check = tagname
    return Check
doc = xml.dom.minidom.parse('test.xml')
root = doc.getElementsByTagName('Store')[0]
output = setcheckout(root, checkout)
tmp_config = '/tmp/tmp_config.xml'
fw = open(tmp_config, 'w')
fw.write(doc.toxml())
fw.close()
I'm not entirely sure what you mean by "checkout". This script will find the element and alter the value of that element. Perhaps you can adapt it to your specific needs.
import xml.dom.minidom as DOM
# find the author as a child of the "Store"
def getAuthor(parent, author):
    # by looking at the children
    for child in [child for child in parent.childNodes
                  if child.nodeType != DOM.Element.TEXT_NODE]:
        if child.tagName == author:
            return child
    return None
def alterElement(parent, attribute, newValue):
    found = False
    # look through the child elements, skipping text nodes
    # (in your example these hold the "values")
    for child in [child for child in parent.childNodes
                  if child.nodeType != DOM.Element.TEXT_NODE]:
        # if the child element tagName matches target element name
        if child.tagName == attribute:
            # alter the data, i.e. the text node value,
            # which is the firstChild of the "isbn" element
            child.firstChild.data = newValue
            return True
        else:
            # otherwise look at all the children of this node.
            found = alterElement(child, attribute, newValue)
            if found:
                break
    # return found status
    return found
doc = DOM.parse("test.xml")
# This assumes that there is only one "Store" in the file
root = doc.getElementsByTagName("Store")[0]
# find the author
# this assumes that there are no duplicate author names in the file
author = getAuthor(root, "foo")
if not author:
    print("Author not found!")
else:
    # alter an element
    if not alterElement(author, "isbn", "987654321"):
        print("isbn not found")
    else:
        # output the xml
        tmp_config = '/tmp/tmp_config.xml'
        with open(tmp_config, 'w') as f:
            doc.writexml(f)
The general idea is that you match the name of the author against the tagNames of the children of the "Store" element, then recurse through the children of the author, looking for a match against a target element tagName. There are a lot of assumptions made in this solution, but it may get you started. It's painful to try and deal with hierarchical structures like XML without using recursion.
In retrospect, there was an error in the "alterElement" function. I've fixed this (note the "found" variable).
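If recursion feels heavy for this case, a shorter sketch can lean on getElementsByTagName, which already searches all descendants. This is only an illustration under the same assumptions as above (one matching parent, one matching target). `STORE_XML` is an inline copy of the sample and `set_text` is a hypothetical helper name, not part of minidom.

```python
import xml.dom.minidom

# Inline copy of the sample document so the sketch runs without test.xml.
STORE_XML = """<Store>
  <foo>
    <book><isbn>123456</isbn></book>
    <title>XYZ</title>
    <checkout>no</checkout>
  </foo>
  <bar>
    <book><isbn>7890</isbn></book>
    <title>XYZ2</title>
    <checkout>yes</checkout>
  </bar>
</Store>"""


def set_text(doc, parent_tag, target_tag, new_value):
    """Set the text of the first <target_tag> under the first <parent_tag>.

    Returns True if an element was updated, False if either tag is missing.
    """
    parents = doc.getElementsByTagName(parent_tag)
    if not parents:
        return False
    targets = parents[0].getElementsByTagName(target_tag)
    if not targets:
        return False
    target = targets[0]
    first = target.firstChild
    if first is not None and first.nodeType == first.TEXT_NODE:
        first.data = new_value  # overwrite the existing text node
    else:
        # empty element (or one starting with a non-text child): add a text node
        target.insertBefore(doc.createTextNode(new_value), first)
    return True


doc = xml.dom.minidom.parseString(STORE_XML)
changed = set_text(doc, "foo", "checkout", "yes")
updated_isbn = set_text(doc, "bar", "isbn", "0000")
missing = set_text(doc, "baz", "isbn", "1")
output = doc.toxml()
print(output)
```

To save the result, write `output` to a file or call doc.writexml(f) on an open file object, as in the scripts above.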
