How to search for XML elements in python?


The code shown below works perfectly, but the problem is that I need to set namespace prefixes like d: manually. Is it possible to search for elements while ignoring these namespaces, like dom.getElementsByTagName('Scopes')?
from xml.dom.minidom import parseString

def parseSoapBody(soap_data):
    dom = parseString(soap_data)
    return {
        'scopes': dom.getElementsByTagName('d:Scopes')[0].firstChild.nodeValue,
        'address': dom.getElementsByTagName('d:XAddrs')[0].firstChild.nodeValue,
    }

Since your code uses parseString and getElementsByTagName, I'm assuming you are using minidom. In that case, try:
dom.getElementsByTagNameNS('*', 'Scopes')
It doesn't say so in the docs, but if you look in the source code for xml/dom/minidom.py, you'll see getElementsByTagNameNS calls _get_elements_by_tagName_ns_helper which is defined like this:
def _get_elements_by_tagName_ns_helper(parent, nsURI, localName, rc):
    for node in parent.childNodes:
        if node.nodeType == Node.ELEMENT_NODE:
            if ((localName == "*" or node.localName == localName) and
                (nsURI == "*" or node.namespaceURI == nsURI)):
                rc.append(node)
            _get_elements_by_tagName_ns_helper(node, nsURI, localName, rc)
    return rc
Notice that when nsURI equals *, only the localName needs to match.
For example,
import xml.dom.minidom as minidom
content = '''<root xmlns:f="foo"><f:test/><f:test/></root>'''
dom = minidom.parseString(content)
for n in dom.getElementsByTagNameNS('*', 'test'):
    print(n.toxml())
# <f:test/>
# <f:test/>
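Applied to the function from the question, the wildcard lookup could look like the sketch below. The helper name first_text, the sample envelope and its namespace URIs are made up for illustration; only getElementsByTagNameNS('*', ...) comes from the answer above.

```python
from xml.dom.minidom import parseString


def parse_soap_body(soap_data):
    dom = parseString(soap_data)

    def first_text(local_name):
        # '*' matches any namespace, so the prefix used in the document is irrelevant.
        nodes = dom.getElementsByTagNameNS('*', local_name)
        if not nodes or nodes[0].firstChild is None:
            return None
        return nodes[0].firstChild.nodeValue

    return {
        'scopes': first_text('Scopes'),
        'address': first_text('XAddrs'),
    }


# Hypothetical sample document; the namespace URIs are placeholders.
sample = (
    '<e:Envelope xmlns:e="urn:example:env" xmlns:d="urn:example:disco">'
    '<e:Body>'
    '<d:Scopes>scope-a scope-b</d:Scopes>'
    '<d:XAddrs>http://192.0.2.1/service</d:XAddrs>'
    '</e:Body>'
    '</e:Envelope>'
)
result = parse_soap_body(sample)
print(result)
```

Returning None for a missing element is a design choice of this sketch; the original function would raise an IndexError instead.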

Related

XML parsing in python issue using elementTree

I need to parse a SOAP response and convert it to a text file. I am trying to parse the values as detailed below, using ElementTree in Python.
This is the XML response I need to parse:
<soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/" xmlns:tmf854="tmf854.v1" xmlns:alu="alu.v1">
<soapenv:Header>
<tmf854:header>
<tmf854:activityName>query</tmf854:activityName>
<tmf854:msgName>queryResponse</tmf854:msgName>
<tmf854:msgType>RESPONSE</tmf854:msgType>
<tmf854:senderURI>https:/destinationhost:8443/tmf854/services</tmf854:senderURI>
<tmf854:destinationURI>https://localhost:8443</tmf854:destinationURI>
<tmf854:activityStatus>SUCCESS</tmf854:activityStatus>
<tmf854:correlationId>1</tmf854:correlationId>
<tmf854:communicationPattern>MultipleBatchResponse</tmf854:communicationPattern>
<tmf854:communicationStyle>RPC</tmf854:communicationStyle>
<tmf854:requestedBatchSize>1500</tmf854:requestedBatchSize>
<tmf854:batchSequenceNumber>1</tmf854:batchSequenceNumber>
<tmf854:batchSequenceEndOfReply>true</tmf854:batchSequenceEndOfReply>
<tmf854:iteratorReferenceURI>http://9195985371165397084</tmf854:iteratorReferenceURI>
<tmf854:timestamp>20220915222121.472+0530</tmf854:timestamp>
</tmf854:header>
</soapenv:Header>
<soapenv:Body>
<queryResponse xmlns="alu.v1">
<queryObjectData>
<queryObject>
<name>
<tmf854:mdNm>AMS</tmf854:mdNm>
<tmf854:meNm>CHEERLAVANCHA_281743</tmf854:meNm>
<tmf854:ptpNm>/type=NE/CHEERLAVANCHA_281743</tmf854:ptpNm>
</name>
<vendorExtensions>
<package>
<NameAndStringValue>
<tmf854:name>hubSubtendedStatus</tmf854:name>
<tmf854:value>NONE</tmf854:value>
</NameAndStringValue>
<NameAndStringValue>
<tmf854:name>productAndRelease</tmf854:name>
<tmf854:value>DF.6.1</tmf854:value>
</NameAndStringValue>
<NameAndStringValue>
<tmf854:name>adminUserName</tmf854:name>
<tmf854:value>isadmin</tmf854:value>
</NameAndStringValue>
<NameAndStringValue>
</package>
</vendorExtensions>
</queryObject>
</queryObjectData>
</queryResponse>
</soapenv:Body>
</soapenv:Envelope>
I need to use the below code snippet.
parser = ElementTree.parse("response.txt")
root = parser.getroot()
inventoryObjectData = root.find(".//{alu.v1}queryObjectData")
for inventoryObject in inventoryObjectData:
    for device in inventoryObject:
        if (device.tag.split("}")[1]) == "me":
            vendorExtensionsNames = []
            vendorExtensionsValues = []
            if device.find(".//{tmf854.v1}mdNm") is not None:
                mdnm = device.find(".//{tmf854.v1}mdNm").text
            if device.find(".//{tmf854.v1}meNm") is not None:
                menm = device.find(".//{tmf854.v1}meNm").text
            if device.find(".//{tmf854.v1}userLabel") is not None:
                userlabel = device.find(".//{tmf854.v1}userLabel").text
            if device.find(".//{tmf854.v1}resourceState") is not None:
                resourcestate = device.find(".//{tmf854.v1}resourceState").text
            if device.find(".//{tmf854.v1}location") is not None:
                location = device.find(".//{tmf854.v1}location").text
            if device.find(".//{tmf854.v1}manufacturer") is not None:
                manufacturer = device.find(".//{tmf854.v1}manufacturer").text
            if device.find(".//{tmf854.v1}productName") is not None:
                productname = device.find(".//{tmf854.v1}productName").text
            if device.find(".//{tmf854.v1}version") is not None:
                version = device.find(".//{tmf854.v1}version").text
            vendorExtensions = device.find("vendorExtensions")
            vendorExtensionsNamesElements = vendorExtensions.findall(".//{tmf854.v1}name")
            for i in vendorExtensionsNamesElements:
                vendorExtensionsNames.append(i.text.strip())
            vendorExtensionsValuesElements = vendorExtensions.findall(".//{tmf854.v1}value")
            for i in vendorExtensionsValuesElements:
                vendorExtensionsValues.append(str(i.text or "").strip())
            alu = ""
            for i in vendorExtensions:
                if i.attrib:
                    if alu == "":
                        alu = i.attrib.get("{alu.v1}name")
                    else:
                        alu = alu + "|" + i.attrib.get("{alu.v1}name")
The issue is that the code below is not able to find the 'vendorExtensions' element. Please help.
vendorExtensions = device.find("vendorExtensions")
I have also tried the following:
vendorExtensions = device.find(".//queryObject/vendorExtensions")

Your document declares a default namespace of alu.v1:
<queryResponse xmlns="alu.v1">
...
</queryResponse>
Any element inside it without an explicit namespace prefix is in the alu.v1 namespace. You need to qualify your element name appropriately:
vendorExtensions = device.find("{alu.v1}vendorExtensions")
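As a minimal sketch of the difference, the snippet below runs the unqualified and the qualified lookup against a cut-down document. The trimmed XML and the alu prefix in the mapping are assumptions chosen for illustration; passing a prefix-to-URI dictionary as the second argument of find() is standard ElementTree behaviour.

```python
from xml.etree import ElementTree

# Cut-down stand-in for the response in the question.
xml = (
    '<queryResponse xmlns="alu.v1" xmlns:tmf854="tmf854.v1">'
    '<queryObject><vendorExtensions><package/></vendorExtensions></queryObject>'
    '</queryResponse>'
)
root = ElementTree.fromstring(xml)

# Without a namespace the lookup finds nothing.
unqualified = root.find(".//vendorExtensions")

# Clark notation: the namespace URI goes in braces before the local name.
qualified = root.find(".//{alu.v1}vendorExtensions")

# Equivalent lookup with a prefix map; the prefix name is arbitrary.
ns = {"alu": "alu.v1"}
prefixed = root.find(".//alu:vendorExtensions", ns)

print(unqualified)
print(qualified.tag)
print(prefixed is qualified)
```

The prefix map tends to read better once several lookups share the same namespaces.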
While the above is a real problem with your code that needs to be corrected (the Wikipedia entry on XML namespaces may be useful reading if you're unfamiliar with how namespaces work), there are also some logic problems with your code.
Let's drop the big list of conditionals from the code and see if it's actually doing what we think it's doing. If we run this:
from xml.etree import ElementTree
parser = ElementTree.parse("data.xml")
root = parser.getroot()
queryObjectData = root.find(".//{alu.v1}queryObjectData")
for queryObject in queryObjectData:
    for device in queryObject:
        print(device.tag)
Then using your sample data (once it has been corrected to be syntactically valid), we see as output:
{alu.v1}name
{alu.v1}vendorExtensions
Your search for the {alu.v1}vendorExtensions element will never succeed because the thing on which you're trying to search (the device variable) is the thing you're trying to find.
Additionally, the conditional in your loop...
if (device.tag.split("}")[1]) == "me":
...will never match (there is no element in the entire document for which tag.split("}")[1] == "me" is True).
I'm not entirely clear what you're trying to do, but here are some thoughts:
Given your example data, you probably don't want that for device in inventoryObject: loop
We can drastically simplify your code by replacing that long block of conditionals with a list of attributes in which we are interested and then a for loop to extract them.
Rather than assigning a bunch of individual variables, we can build up a dictionary with the data from the queryObject
That might look like:
from xml.etree import ElementTree
import json
attributeNames = [
    "mdNm",
    "meNm",
    "userLabel",
    "resourceState",
    "location",
    "manufacturer",
    "productName",
    "version",
]
parser = ElementTree.parse("data.xml")
root = parser.getroot()
queryObjectData = root.find(".//{alu.v1}queryObjectData")
for queryObject in queryObjectData:
    device = {}
    for name in attributeNames:
        if (value := queryObject.find(f".//{{tmf854.v1}}{name}")) is not None:
            device[name] = value.text
    vendorExtensions = queryObject.find("{alu.v1}vendorExtensions")
    extensionMap = {}
    for extension in vendorExtensions.findall(".//{alu.v1}NameAndStringValue"):
        extname = extension.find("{tmf854.v1}name").text
        extvalue = extension.find("{tmf854.v1}value").text
        extensionMap[extname] = extvalue
    device["vendorExtensions"] = extensionMap
    print(json.dumps(device, indent=2))
Given your example data, this outputs:
{
"mdNm": "AMS",
"meNm": "CHEERLAVANCHA_281743",
"vendorExtensions": {
"hubSubtendedStatus": "NONE",
"productAndRelease": "DF.6.1",
"adminUserName": "isadmin"
}
}
An alternate approach, in which we just transform each queryObject into a dictionary, might look like this:
from xml.etree import ElementTree
import json
def localName(ele):
    return ele.tag.split("}")[1]
def etree_to_dict(t):
    if list(t):
        d = {}
        for child in t:
            if localName(child) == "NameAndStringValue":
                d.update(dict([[x.text.strip() for x in child]]))
            else:
                d.update({localName(child): etree_to_dict(child) for child in t})
        return d
    else:
        return t.text.strip()
parser = ElementTree.parse("data.xml")
root = parser.getroot()
queryObjectData = root.find(".//{alu.v1}queryObjectData") or []
for queryObject in queryObjectData:
    d = etree_to_dict(queryObject)
    print(json.dumps(d, indent=2))
This will output:
{
"name": {
"mdNm": "AMS",
"meNm": "CHEERLAVANCHA_281743",
"ptpNm": "/type=NE/CHEERLAVANCHA_281743"
},
"vendorExtensions": {
"package": {
"hubSubtendedStatus": "NONE",
"productAndRelease": "DF.6.1",
"adminUserName": "isadmin"
}
}
}
That may or may not be appropriate depending on the structure of your real data and exactly what you're trying to accomplish.

AttributeError when assigning value to function for XML data extraction

I'm coding a script to extract information from several XML files that share the same structure but omit sections when there is no information for a tag. The easiest way to handle this was try/except: instead of getting "AttributeError: 'NoneType' object has no attribute 'find'", I assign an empty string ('') to the variable in the except clause. Something like this:
try:
    string1=root.find('value1').find('value2').find('value3').text
except:
    string1=''
The issue is that I want to shrink my code by using a function:
def extract(string):
    tempstr=''
    try:
        tempstr=string.replace("\n", "")
    except:
        if tempstr is None:
            tempstr=""
    return string
And then I try to call it like this:
string1=extract(root.find('value1').find('value2').find('value3').text)
If value2 or value3 does not exist in the XML being processed, I get an AttributeError even though I don't use the variable in the function, which makes the function useless.
Is there a way to make a function work, maybe there is a way to make it run without checking if the value entered is invalid?
Solution:
I'm using a mix of both answers:
def extract(root, xpath):
    tempstr=''
    try:
        tempstr=root.findall(xpath)[0].text.replace("\n", "")
    except:
        tempstr='' # To avoid getting a NoneType object
    return tempstr

You can try something like that:
def extract(root, children_keys: list):
    target_object = root
    result_text = ''
    try:
        for child_key in children_keys:
            target_object = target_object.find(child_key)
        result_text = target_object.text
    except:
        pass
    return result_text
The for loop walks deeper into the XML structure one level at a time (children_keys is a list you define in advance that holds the nested element names, i.e. the path to your object).
If any error is raised inside that code, you get '' as the result.
Example XML (source):
<?xml version="1.0" encoding="UTF-8"?>
<note>
<to>Tove</to>
<from>Jani</from>
<heading>Reminder</heading>
<body>
<y>Don't forget me this weekend!</y>
</body>
</note>
Example:
import xml.etree.ElementTree as ET
tree = ET.parse('note.xml')
root = tree.getroot()
children_keys = ['body', 'y']
result_string = extract(root, children_keys)
print(result_string)
Output:
"Don't forget me this weekend!"

Use XPATH expression
import xml.etree.ElementTree as ET
xml1 = '''<r><v1><v2><v3>a string</v3></v2></v1></r>'''
root = ET.fromstring(xml1)
v3 = root.findall('./v1/v2/v3')
if v3:
    print(v3[0].text)
else:
    print('v3 not found')
xml2 = '''<r><v1><v3>a string</v3></v1></r>'''
root = ET.fromstring(xml2)
v3 = root.findall('./v1/v2/v3')
if v3:
    print(v3[0].text)
else:
    print('v3 not found')
output
a string
v3 not found
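If the goal is simply "the text, or an empty string when any part of the path is missing", ElementTree's findtext() with a default may be enough, with no try/except at all. The sketch below assumes the same made-up element names as the example above; the extract signature is one possible shape, not the one from the question.

```python
import xml.etree.ElementTree as ET


def extract(root, path):
    # findtext() returns the default when nothing matches the path, and an
    # empty string when the element exists but has no text.
    return root.findtext(path, default='').replace("\n", "")


full = ET.fromstring('<r><v1><v2><v3>a\nstring</v3></v2></v1></r>')
partial = ET.fromstring('<r><v1><v3>a string</v3></v1></r>')

print(repr(extract(full, './v1/v2/v3')))
print(repr(extract(partial, './v1/v2/v3')))
```

Passing the path instead of the already-evaluated result matters: in the original call the chained find() calls run before extract() is entered, so the function never gets a chance to catch the error.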

Python replace XML content with Etree

I'd like to parse and compare 2 XML files with the Python Etree parser as follows:
I have 2 XML files with loads of data. One is in English (the source file), the other one the corresponding French translation (the target file).
E.g.:
source file:
<AB>
<CD/>
<EF>
<GH>
<id>123</id>
<IJ>xyz</IJ>
<KL>DOG</KL>
<MN>dogs/dog</MN>
some more tags and info on same level
<metadata>
<entry>
<cl>Translation</cl>
<cl>English:dog/dogs</cl>
</entry>
<entry>
<string>blabla</string>
<string>blabla</string>
</entry>
some more strings and entries
</metadata>
</GH>
</EF>
<stuff/>
<morestuff/>
<otherstuff/>
<stuffstuff/>
<blubb/>
<bla/>
<blubbla>8</blubbla>
</AB>
The target file looks exactly the same, but has no text at some places:
<MN>chiens/chien</MN>
some more tags and info on same level
<metadata>
<entry>
<cl>Translation</cl>
<cl></cl>
</entry>
The French target file has an empty cross-language reference where I'd like to put in the information from the English source file whenever the 2 macros have the same ID.
I already wrote some code in which I replaced the string tag name with a unique tag name in order to identify the cross-language reference. Now I want to compare the 2 files and if 2 macros have the same ID, exchange the empty reference in the French file with the info from the English file. I was trying out the minidom parser before but got stuck and would like to try Etree now. I have hardly any knowledge about programming and find this very hard.
Here is the code I have so far:
macros = ElementTree.parse(english)
for tag in macros.getchildren('macro'):
    id_ = tag.find('id')
    data = tag.find('cl')
    id_dict[id_.text] = data.text
macros = ElementTree.parse(french)
for tag in macros.getchildren('macro'):
    id_ = tag.find('id')
    target = tag.find('cl')
    if target.text.strip() == '':
        target.text = id_dict[id_.text]
print (ElementTree.tostring(macros))
I am more than clueless and reading other posts on this confuses me even more. I'd appreciate it very much if someone could enlighten me :-)

There are probably more details to be clarified. Here is a sample with some debug prints that shows the idea. It assumes that both files have exactly the same structure, and that you want to go only one level below the root:
import xml.etree.ElementTree as etree
english_tree = etree.parse('en.xml')
french_tree = etree.parse('fr.xml')
# Get the root elements, as they support iteration
# through their children (direct descendants)
english_root = english_tree.getroot()
french_root = french_tree.getroot()
# Iterate through the direct descendants of the root
# elements in both trees in parallel.
for en, fr in zip(english_root, french_root):
    assert en.tag == fr.tag # check for the same structure
    if en.tag == 'id':
        assert en.text == fr.text # check for the same id
    elif en.tag == 'string':
        if fr.text is None:
            fr.text = en.text
            print(en.text) # displaying what was replaced
etree.dump(french_tree)
For more complex structures of the file, the loop through the direct children of the node can be replaced by iteration through all the elements of the tree. If the structures of the files are exactly the same, the following code will work:
import xml.etree.ElementTree as etree
english_tree = etree.parse('en.xml')
french_tree = etree.parse('fr.xml')
for en, fr in zip(english_tree.iter(), french_tree.iter()):
    assert en.tag == fr.tag # check if the structure is the same
    if en.tag == 'id':
        assert en.text == fr.text # identification must be the same
    elif en.tag == 'string':
        if fr.text is None:
            fr.text = en.text
            print(en.text) # display the inserted text
# Write the result to the output file.
# tostring() returns bytes, so decode before writing to a text file.
with open('fr2.xml', 'w') as fout:
    fout.write(etree.tostring(french_tree.getroot()).decode('utf-8'))
However, it works only in cases when both files have exactly the same structure. Let's follow the algorithm that would be used when the task is to be done manually. Firstly, we need to find the French translation that is empty. Then it should be replaced by the English translation from the GH element with the same identification. A subset of XPath expressions is used in the case when searching for the elements:
import xml.etree.ElementTree as etree
def find_translation(tree, id_):
    # Search for the GH element with the given identification, and return
    # its translation if found. Otherwise None is returned implicitly.
    for gh in tree.iter('GH'):
        id_elem = gh.find('./id')
        if id_ == id_elem.text:
            # The related GH element found.
            # Find metadata entry, extract the translation.
            # Warning! This is simplification for the fixed position
            # of the Translation entry.
            me = gh.find('./metadata/entry')
            assert len(me) == 2 # metadata/entry has two elements
            cl1 = me[0]
            assert cl1.text == 'Translation'
            cl2 = me[1]
            return cl2.text
# Body of the program. --------------------------------------------------
english_tree = etree.parse('en.xml')
french_tree = etree.parse('fr.xml')
for gh in french_tree.iter('GH'): # iterate through the GH elements only
    # Get the identification of the GH section
    id_elem = gh.find('./id')
    id_ = id_elem.text
    # Find and check the metadata entry, extract the French translation.
    # Warning! This is simplification for the fixed position of the Translation
    # entry.
    me = gh.find('./metadata/entry')
    assert len(me) == 2 # metadata/entry has two elements
    cl1 = me[0]
    assert cl1.text == 'Translation'
    cl2 = me[1]
    fr_translation = cl2.text
    # If the French translation is empty, put there the English translation
    # from the related element.
    if cl2.text is None:
        cl2.text = find_translation(english_tree, id_)
with open('fr2.xml', 'w') as fout:
    fout.write(etree.tostring(french_tree.getroot()).decode('utf-8'))
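The code above rescans the English tree for every French GH element. A variation, closer to the id_dict idea in the question, is to index the English file by id once and then fill the French file from that index. The sketch below uses small inline documents instead of en.xml and fr.xml; the helper name translation_slots and the second sample id are invented for illustration, and it assumes the translation is the second cl of an entry whose first cl reads "Translation".

```python
import xml.etree.ElementTree as ET

# Inline stand-ins for the English and French files (hypothetical content).
ENGLISH = """<AB><EF>
<GH><id>123</id><metadata><entry><cl>Translation</cl><cl>English:dog/dogs</cl></entry></metadata></GH>
<GH><id>456</id><metadata><entry><cl>Translation</cl><cl>English:cat/cats</cl></entry></metadata></GH>
</EF></AB>"""
FRENCH = """<AB><EF>
<GH><id>456</id><metadata><entry><cl>Translation</cl><cl></cl></entry></metadata></GH>
<GH><id>123</id><metadata><entry><cl>Translation</cl><cl></cl></entry></metadata></GH>
</EF></AB>"""


def translation_slots(root):
    """Map each GH id to the <cl> element that holds its translation."""
    slots = {}
    for gh in root.iter('GH'):
        id_ = gh.findtext('id')
        for entry in gh.findall('./metadata/entry'):
            cls = entry.findall('cl')
            if id_ is not None and len(cls) == 2 and cls[0].text == 'Translation':
                slots[id_] = cls[1]
                break
    return slots


english_root = ET.fromstring(ENGLISH)
french_root = ET.fromstring(FRENCH)

english = translation_slots(english_root)
for id_, target in translation_slots(french_root).items():
    # Fill only empty French slots that have an English counterpart.
    if not (target.text or '').strip() and id_ in english:
        target.text = english[id_].text

filled = {id_: cl.text for id_, cl in translation_slots(french_root).items()}
print(filled)
```

Because the lookup goes through the id, the GH elements do not have to appear in the same order in both files.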

Search and remove element with elementTree in Python

I have an XML document in which I want to search for some elements and if they match some criteria
I would like to delete them
However, I cannot seem to be able to access the parent of the element so that I can delete it
file = open('test.xml', "r")
elem = ElementTree.parse(file)
namespace = "{http://somens}"
props = elem.findall('.//{0}prop'.format(namespace))
for prop in props:
    type = prop.attrib.get('type', None)
    if type == 'json':
        value = json.loads(prop.attrib['value'])
        if value['name'] == 'Page1.Button1':
            # here I need to access the parent of prop
            # in order to delete the prop
Is there a way I can do this?
Thanks

You can remove child elements with the corresponding remove method. To remove an element you have to call its parent's remove method. Unfortunately, Element does not provide a reference to its parent, so it is up to you to keep track of parent/child relations (which speaks against your use of elem.findall()).
A proposed solution could look like this:
root = elem.getroot()
for child in root:
    if child.tag != "prop":  # Element exposes the name as .tag
        continue
    if True: # TODO: do your check here!
        root.remove(child)
PS: don't use prop.attrib.get(), use prop.get(), as explained here.

You could use xpath to select an Element's parent.
file = open('test.xml', "r")
elem = ElementTree.parse(file)
namespace = "{http://somens}"
props = elem.findall('.//{0}prop'.format(namespace))
for prop in props:
    type = prop.get('type', None)
    if type == 'json':
        value = json.loads(prop.attrib['value'])
        if value['name'] == 'Page1.Button1':
            # Get parent and remove this prop
            parent = prop.find("..")
            parent.remove(prop)
http://docs.python.org/2/library/xml.etree.elementtree.html#supported-xpath-syntax
Except if you try that it doesn't work: http://elmpowered.skawaii.net/?p=74
So instead you have to:
file = open('test.xml', "r")
elem = ElementTree.parse(file)
namespace = "{http://somens}"
search = './/{0}prop'.format(namespace)
# Use xpath to get all parents of props
prop_parents = elem.findall(search + '/..')
for parent in prop_parents:
    # Still have to find and iterate through child props
    for prop in parent.findall(search):
        type = prop.get('type', None)
        if type == 'json':
            value = json.loads(prop.attrib['value'])
            if value['name'] == 'Page1.Button1':
                parent.remove(prop)
It is two searches and a nested loop. The inner search is only on Elements known to contain props as first children, but that may not mean much depending on your schema.
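The reason the first attempt fails can be seen in a tiny experiment: in xml.etree.ElementTree a path is resolved relative to the element it is called on, and ".." cannot climb above that starting element. The three-element document below is made up for illustration.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("<a><b><c/></b></a>")
c = root.find(".//c")

# Asking the child for its parent: the search starts at <c>, so there is
# nothing above it to return.
from_child = c.find("..")

# Asking an ancestor for "the parent of c" works, because the parent lies
# inside the subtree being searched.
from_root = root.find(".//c/..")

print(from_child)
print(from_root.tag)
```

This is why the working version selects the parents from the tree (search + '/..') rather than from each prop.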

I know this is an old thread but this kept popping up while I was trying to figure out a similar task. I did not like the accepted answer for two reasons:
1) It doesn't handle multiple nested levels of tags.
2) It will break if multiple xml tags are deleted in the same level one-after-another. Since each element is an index of Element._children you shouldn't delete while forward iterating.
I think a better more versatile solution is this:
import xml.etree.ElementTree as et
file = 'test.xml'
tree = et.parse(file)
root = tree.getroot()
def iterator(parents, nested=False):
    for child in reversed(parents):
        if nested:
            if len(child) >= 1:
                iterator(child)
        if True: # Add your entire condition here
            parents.remove(child)
iterator(root, nested=True)
For the OP, this should work - but I don't have the data you're working with to test if it's perfect.
import json
import xml.etree.ElementTree as et
file = 'test.xml'
tree = et.parse(file)
namespace = "{http://somens}"
props = tree.findall('.//{0}prop'.format(namespace))
def iterator(parents, nested=False):
    for child in reversed(parents):
        if nested:
            if len(child) >= 1:
                iterator(child)
        if child.attrib.get('type') == 'json':
            value = json.loads(child.attrib['value'])
            if value['name'] == 'Page1.Button1':
                parents.remove(child)
iterator(props, nested=True)

A solution using lxml module
from lxml import etree
root = etree.fromstring(xml_str)
for e in root.findall('.//{http://some.name.space}node'):
    # lxml elements know their parent, unlike xml.etree.ElementTree
    parent = e.getparent()
    if parent is not None:
        parent.remove(e)

Using the fact that every child must have a parent, I'm going to simplify @kitsu.eb's example. If you use the findall command to get the children and the parents, their indices will be equivalent.
file = open('test.xml', "r")
elem = ElementTree.parse(file)
namespace = "{http://somens}"
search = './/{0}prop'.format(namespace)
# Use xpath to get all parents of props
prop_parents = elem.findall(search + '/..')
props = elem.findall('.//{0}prop'.format(namespace))
for prop in props:
    type = prop.attrib.get('type', None)
    if type == 'json':
        value = json.loads(prop.attrib['value'])
        if value['name'] == 'Page1.Button1':
            # use the index of the current child to find
            # its parent and remove the child
            prop_parents[props.index(prop)].remove(prop)

I also used XPath for this issue, but in a different way:
root = elem.getroot()
elementName = "YourElement"
# this will find all the parents of the elements with elementName
for elementParent in root.findall(".//{}/..".format(elementName)):
    # this will find all the elements under the parent, and remove them
    for element in elementParent.findall("{}".format(elementName)):
        elementParent.remove(element)

I like to use an XPath expression for this kind of filtering. Unless I know otherwise, such an expression must be applied at the root level, which means I can't just get a parent and apply the same expression on that parent. However, it seems to me that there is a nice and flexible solution that should work with any supported XPath, as long as none of the sought nodes is the root. It goes something like this:
root = elem.getroot()
# Find all nodes matching the filter string (flt)
nodes = root.findall(flt)
while len(nodes):
    # As long as there are nodes, there should be parents
    # Get the first of all parents to the found nodes
    parent = root.findall(flt+'/..')[0]
    # Use this parent to remove the first node
    parent.remove(nodes[0])
    # Find all remaining nodes
    nodes = root.findall(flt)

I would only like to add a comment on the accepted answer, but my lack of reputation doesn't allow me to. I wanted to add that it is important to add .findall("*") to the iterator to avoid issues, as stated in the documentation:
Note that concurrent modification while iterating can lead to problems, just like when iterating and modifying Python lists or dicts. Therefore, the example first collects all matching elements with root.findall(), and only then iterates over the list of matches.
Therefore, in the accepted answer the iteration should be for child in root.findall("*"): instead of for child in root:. Not doing so made my code skip some elements from the list.
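Another common way to get at the parent with the standard library is to build a child-to-parent dictionary once and look parents up in it. The sketch below applies that idea to the JSON-attribute check from the question; the sample document is invented, and the parent_map idiom is a general ElementTree technique rather than something taken from the answers above.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical document shaped like the one described in the question.
xml = '''<root xmlns="http://somens">
  <group>
    <prop type="json" value='{"name": "Page1.Button1"}'/>
    <prop type="json" value='{"name": "Page1.Button2"}'/>
  </group>
  <prop type="text" value="plain"/>
</root>'''

root = ET.fromstring(xml)
ns = "{http://somens}"

# Map every element to its parent in a single pass over the tree.
parent_map = {child: parent for parent in root.iter() for child in parent}

# findall() returns a list, so removing elements while looping over it is safe.
for prop in root.findall(f".//{ns}prop"):
    if prop.get("type") != "json":
        continue
    if json.loads(prop.get("value"))["name"] == "Page1.Button1":
        parent_map[prop].remove(prop)

remaining = [p.get("value") for p in root.iter(f"{ns}prop")]
print(remaining)
```

The map is a snapshot: if elements are moved or added afterwards, it has to be rebuilt.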

Processing RSS/RDF via xml.dom.minidom

I'm trying to process a delicious rss feed via python. Here's a sample:
...
<item rdf:about="http://weblist.me/">
<title>WebList - The Place To Find The Best List On The Web</title>
<dc:date>2009-12-24T17:46:14Z</dc:date>
<link>http://weblist.me/</link>
...
</item>
<item rdf:about="http://thumboo.com/">
<title>Thumboo! Free Website Thumbnails and PHP Script to Generate Web Screenshots</title>
<dc:date>2006-10-24T18:11:32Z</dc:date>
<link>http://thumboo.com/</link>
...
The relevant code is:
def getText(nodelist):
    rc = ""
    for node in nodelist:
        if node.nodeType == node.TEXT_NODE:
            rc = rc + node.data
    return rc
dom = xml.dom.minidom.parse(file)
items = dom.getElementsByTagName("item")
for i in items:
    title = i.getElementsByTagName("title")
    print(getText(title))
I would think this would print out each title, but instead I basically get blank output. I'm sure I'm doing something stupid, but I have no idea what.

You are passing the title nodes to getText, whose nodeTypes are not node.TEXT_NODE. You have to loop over all the children of the node instead in your getText method:
def getTextSingle(node):
    parts = [child.data for child in node.childNodes if child.nodeType == node.TEXT_NODE]
    return u"".join(parts)
def getText(nodelist):
    return u"".join(getTextSingle(node) for node in nodelist)
Even better, call node.normalize() before calling getTextSingle which ensures that consecutive children of type node.TEXT_NODE are merged into a single node.TEXT_NODE.
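Put together, a run over a small feed might look like the sketch below. The inline feed is a shortened stand-in for the Delicious sample (titles abbreviated, RSS 1.0 namespaces assumed), and get_text_single is the same helper as above under a different name.

```python
import xml.dom.minidom


def get_text_single(node):
    # Collect the text of the node's direct TEXT_NODE children.
    return "".join(
        child.data for child in node.childNodes
        if child.nodeType == child.TEXT_NODE
    )


# Shortened, hypothetical feed.
feed = '''<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
                   xmlns="http://purl.org/rss/1.0/">
<item rdf:about="http://weblist.me/"><title>WebList</title></item>
<item rdf:about="http://thumboo.com/"><title>Thumboo!</title></item>
</rdf:RDF>'''

dom = xml.dom.minidom.parseString(feed)
dom.normalize()  # merge adjacent text nodes before reading them

titles = [
    get_text_single(title)
    for item in dom.getElementsByTagName("item")
    for title in item.getElementsByTagName("title")
]
print(titles)
```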
