change element value with lxml python - python

I need to parse a large XML file, but only look for <name> elements and replace their values. Hence, I am doing event-driven parsing with the following code:
import os, re, sys
from lxml import etree
# parse the xml file
context = etree.iterparse(xmlFile, events=('end',), tag='name')
for event, elem in context:
    # this is an internal method that I call to perform regex
    newElementText = searchReplace(elem.text).replace(" ", "")
    # assign the elem.text to the replaced value
    elem.text = newElementText
    # write to the xml
    etree.tostring(elem, encoding='utf-8')
My problem is with writing the updated element value to the file. When I call etree.tostring(), it does not update the file. Can someone please point out the error of my ways? Thanks!

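etree.tostring(elem) only returns a serialized copy of the element; your code discards that return value, so nothing is ever written to the file. Note that write() is a method of ElementTree objects, not of individual elements, so elem.write(...) will not work. Once the loop has finished, write the whole tree out instead, for example with context.root.getroottree().write(outFile, encoding='utf-8'), where outFile is the path you want to save to.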
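As a rough illustration of the "modify during iterparse, then write the tree once" idea, here is a minimal sketch. It uses the standard library's xml.etree.ElementTree rather than lxml so that it runs without extra packages; search_replace is a hypothetical stand-in for the asker's searchReplace, and the sample document is made up.

```python
import io
import re
import xml.etree.ElementTree as ET


def search_replace(text):
    # Hypothetical stand-in for the asker's searchReplace(): mask digits with '#'.
    return re.sub(r"\d", "#", text or "")


def rewrite_names(source, target):
    """Rewrite the text of every <name> element and save the whole tree."""
    context = ET.iterparse(source, events=("end",))
    for _event, elem in context:
        if elem.tag == "name":
            elem.text = search_replace(elem.text).replace(" ", "")
    # The iterator exposes the root element once the source is fully read.
    ET.ElementTree(context.root).write(target, encoding="utf-8")


# Made-up sample input; a real call would pass file paths instead.
src = io.BytesIO(b"<people><name>Ann 1</name><age>3</age><name>Bob 22</name></people>")
out = io.BytesIO()
rewrite_names(src, out)
result = out.getvalue()
```

Because nothing is cleared during the loop, the full tree stays in memory until it is written; for very large files that trade-off may not be acceptable.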

Related

Python lxml error "namespace not defined."

I am being driven crazy by some oddly formed xml and would be grateful for some pointers:
The documents are defined like this:
<sphinx:document id="18059090929806848187">
<url>http://www.some-website.com</url>
<page_number>104</page_number>
<size>7865</size>
</sphinx:document>
Now, I need to read lots of these files (500m+, all gz compressed) and grab the text values from a few of the contained tags.
Sample code:
from lxml import objectify, etree
import gzip
with open('file_list','rb') as file_list:
    for file in file_list:
        in_xml = gzip.open(file.strip('\n'))
        xml2 = etree.iterparse(in_xml)
        for action, elem in xml2:
            if elem.tag == "page_number":
                print elem.text + str(file)
The first value of elem.text is returned, but only for the first file in the list, and it is quickly followed by the error:
lxml.etree.XMLSyntaxError: Namespace prefix sphinx on document is not defined, line 1, column 20
Please excuse my ignorance but xml really hurts my head and I have been struggling with this for a while. Is there a way that I can either define the namespace prefix or handle this in some other more intelligent manner?
Thanks
Your input file is not well-formed XML. I assume that it is a snippet from a larger XML document.
Your choices are:
1. Reconstruct the larger document. How you do this is specific to your application. You may have to consult with the people that created the file you are parsing.
2. Parse the file in spite of its errors. To do that, use the recover keyword of lxml.etree.iterparse:
xml2 = etree.iterparse(in_xml, recover=True)
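One way to read the first option ("reconstruct the larger document") is to wrap the snippet in a root element that declares the missing prefix. The sketch below does this with the standard library; the namespace URI and the docset wrapper element are placeholders invented for illustration, since the real ones are not shown in the question.

```python
import xml.etree.ElementTree as ET

# Assumption: placeholder URI; the real sphinx namespace is not given in the question.
SPHINX_NS = "http://example.com/sphinx"

snippet = """<sphinx:document id="18059090929806848187">
<url>http://www.some-website.com</url>
<page_number>104</page_number>
<size>7865</size>
</sphinx:document>"""

# Wrap the fragment in a hypothetical root that declares the prefix,
# which makes the whole string well-formed, namespace-aware XML.
wrapped = '<docset xmlns:sphinx="%s">%s</docset>' % (SPHINX_NS, snippet)
root = ET.fromstring(wrapped)

page_numbers = [el.text for el in root.iter("page_number")]
```

This only works if each file can be read as text before parsing, so it suits the per-file loop in the question better than a single streaming pass.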

Python ElementTree doesn't seem to recognize text nodes

I am trying to parse a simple XML document located at http://www.webservicex.net/airport.asmx/getAirportInformationByAirportCode?airportCode=jfk using the ElementTree module. The code (so far):
import urllib2
from xml.etree import ElementTree
from xml.etree.ElementTree import Element
from xml.etree.ElementTree import SubElement
url = "http://www.webservicex.net/airport.asmx/getAirportInformationByAirportCode?airportCode=jfk"
s = urllib2.urlopen(url)
print s
document = ElementTree.parse(s)
root = document.getroot()
print root
dataset = SubElement(root, 'NewDataSet')
print dataset
table = SubElement(dataset, 'Table')
print table
airportName = SubElement(table, 'CityOrAirportName')
print airportName.text
The final line yields "None", not the name of the airport in the XML. Can anyone assist? This should be relatively simple, but I am missing something.
Look at the documentation for that module. It says, among other things:
The SubElement() function also provides a convenient way to create new sub-elements for a given element
In particular note the word create. You are creating a new element, not reading the elements that are already there.
If you want to locate certain elements within the parsed XML, read the rest of the documentation on that page to understand how to use the library to do that.
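To make the difference concrete, here is a small sketch that contrasts find(), which looks up an existing element, with SubElement(), which creates a new empty one. The sample XML is an assumed structure based on the element names in the question, not the actual response of the web service.

```python
import xml.etree.ElementTree as ET

# Assumed document shape, built from the element names used in the question.
sample = (
    "<NewDataSet><Table>"
    "<CityOrAirportName>NEW YORK KENNEDY</CityOrAirportName>"
    "</Table></NewDataSet>"
)
root = ET.fromstring(sample)

# find() returns the element that is already in the tree.
found = root.find("Table/CityOrAirportName")

# SubElement() appends a brand-new, empty element, so its text is None.
created = ET.SubElement(root, "CityOrAirportName")
```

Here found.text holds the airport name, while created.text is None, which matches the behaviour described in the question.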

Parsing Xml file using python Error: list index out of range for threshold

I have a very large XML file. I need to display the value stored in each and every tag using python. I am trying to use dom library. Can anyone please help?
XML File Link: https://code.google.com/p/warai/downloads/detail?name=haarcascade_frontalface_alt.xml
from xml.dom import minidom
doc= minidom.parse('haarcascade_frontalface_alt.xml')
size=doc.getElementsByTagName('size')[0]
print size.firstChild.data
stages=doc.getElementsByTagName('stages')[0]
stagen=stages.getElementsByTagName('_')
for stage in stagen:
    stage_threshold=stage.getElementsByTagName('stage_threshold')[0]
    parent=stage.getElementsByTagName('parent')[0]
    anext=stage.getElementsByTagName('next')[0]
    print stage_threshold.firstChild.data
    print parent.firstChild.data
    print anext.firstChild.data
    trees=stage.getElementsByTagName('trees')[0]
    a=trees.getElementsByTagName('_')
    for k in a:
        b=k.getElementsByTagName('_')[0]
        threshold=b.getElementsByTagName('threshold')[0]
        left_val=b.getElementsByTagName('left_val')[0]
        right_val=b.getElementsByTagName('right_val')[0]
        feature=b.getElementsByTagName('feature')[0]
        tilted=feature.getElementsByTagName('tilted')[0]
        rects=feature.getElementsByTagName('rects')[0]
        m=rects.getElementsByTagName('_')[0]
        n=rects.getElementsByTagName('_')[1]
        print m.firstChild.data
        print n.firstChild.data
        print tilted.firstChild.data
        print threshold.firstChild.data
        print left_val.firstChild.data
        print right_val.firstChild.data
Use the ElementTree interface, for example xml.etree.ElementTree (other implementations also exist).
Iterate through all elements recursively using iterators to walk the XML tree:
from xml.etree import ElementTree as ET
root = ET.parse("xml.xml").getroot()
def print_value(node):
    if node.text and not node.text.isspace():
        print(node.text)
    for child in node:
        print_value(child)
print_value(root)
But if the file is really big and you cannot load the whole tree into memory, use iterparse(). It yields elements one at a time; the "start" and "end" events mean that the parser has reached the beginning or the end of the current node. If no events are passed to iterparse(), only "end" events are reported.
import xml.etree.ElementTree as ET
it = ET.iterparse("xml.xml")
for event, node in it:
    if node.text and not node.text.isspace():
        print(node.text)
    node.clear()
Notes
You may also print .tail of each element.
If you have a really massive number of elements, consider remembering the parent of each node and calling parent.remove(node) instead of node.clear().
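The parent-tracking idea from the last note could be sketched as follows. This is one possible approach, assuming a stack of open elements is an acceptable way to "remember" parents; the iter_texts name and the sample input are invented for illustration.

```python
import io
import xml.etree.ElementTree as ET


def iter_texts(source):
    """Yield non-blank element texts, detaching each element once it is finished."""
    stack = []  # elements that have been opened but not yet closed
    for event, node in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            stack.append(node)
            continue
        stack.pop()  # 'end' event: node is complete
        if node.text and not node.text.isspace():
            yield node.text
        if stack:
            # The top of the stack is the parent; drop the finished child from it.
            stack[-1].remove(node)


texts = list(iter_texts(io.BytesIO(b"<a><b>1</b><c><d>2</d></c></a>")))
```

Compared with node.clear(), removing the child also frees the empty element object itself, which matters only when the number of elements is very large.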

How to write an XML file without header in Python?

When using Python's stock XML tools such as xml.dom.minidom for XML writing, a file would always start off like
<?xml version="1.0"?>
[...]
While this is perfectly legal XML code, and it's even recommended to use the header, I'd like to get rid of it as one of the programs I'm working with has problems here.
I can't seem to find the appropriate option in xml.dom.minidom, so I wondered if there are other packages which do allow to neglect the header.
Cheers,
Nico
Unfortunately minidom does not give you the option to omit the XML Declaration.
But you can always serialise the document content yourself by calling toxml() on the document's root element instead of the document. Then you won't get an XML Declaration:
xml = document.documentElement.toxml('utf-8')
...but then you also wouldn't get anything else outside the root element, such as the DOCTYPE, or any comments or processing instructions. If you need them, serialise each child of the document object one by one:
xml = '\n'.join(node.toxml('utf-8') for node in document.childNodes)
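A quick way to see the difference between the two approaches is the sketch below, written for Python 3. It leaves out the encoding argument because in Python 3 toxml() with an encoding returns bytes, which cannot be joined with a str separator; the sample document is made up.

```python
from xml.dom import minidom

# Made-up document with a comment in front of the root element.
document = minidom.parseString("<!-- note --><root><child/></root>")

# Serialising the Document itself includes the XML declaration.
with_declaration = document.toxml()

# Serialising only the root element drops the declaration (and the comment).
root_only = document.documentElement.toxml()

# Serialising each top-level child keeps the comment but still has no declaration.
all_children = "\n".join(node.toxml() for node in document.childNodes)
```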
"I wondered if there are other packages which do allow to neglect the header."
DOM Level 3 LS defines an xml-declaration config parameter you can use to suppress it. The only Python implementation I know of is pxdom, which is thorough on standards support, but not at all fast.
If you want to use minidom and maintain 'prettiness', how about this as a quick/hacky fix:
xml_without_declaration.py:
import xml.dom.minidom as xml
doc = xml.Document()
declaration = doc.toxml()
a = doc.createElement("A")
doc.appendChild(a)
b = doc.createElement("B")
a.appendChild(b)
xml = doc.toprettyxml()[len(declaration):]
print xml
The header is printed by the Document. If you print the node directly, it won't print the header.
root = doc.childNodes[0]
root.toprettyxml(encoding="utf-8")
Just replace the first line with blank:
import xml.dom.minidom as MD
<XML String>.replace(MD.Document().toxml()+'\n', '')
If you're set on using minidom, just scan back in the file and remove the first line after writing all the XML you need.
You might be able to use a custom file-like object which removes the first tag, e.g:
class RemoveFirstLine:
    def __init__(self, f):
        self.f = f
        self.xmlTagFound = False
    def __getattr__(self, attr):
        # delegate everything except write() to the wrapped file object
        return getattr(self.f, attr)
    def write(self, s):
        if not self.xmlTagFound:
            x = 0 # just to be safe
            for x, c in enumerate(s):
                if c == '>':
                    self.xmlTagFound = True
                    break
            # write only what follows the first '>' (nothing if it was not found yet)
            self.f.write(s[x+1:])
        else:
            self.f.write(s)
...
f = RemoveFirstLine(open('path', 'wb'))
Node.writexml(f, encoding='UTF-8')
or something similar. This has the advantage the file doesn't have to be totally rewritten if the XML files are fairly large.
Purists may not like to hear this, but I have found using an XML parser to generate XML to be overkill. Just generate it directly as strings. This also lets you generate files larger than you can keep in memory, which you can't do with DOM. Reading XML is another story.
Use string replace:
from xml.dom import minidom
mydoc = minidom.parse('filename.xml')
with open(newfile, "w") as fs:
    # strip the whole declaration, including the leading '<'
    fs.write(mydoc.toxml().replace('<?xml version="1.0" ?>', ''))
That's it ;)

python: is there an XML parser implemented as a generator?

I'd like to parse a big XML file "on the fly". I'd like to use a python generator to perform this. I've tried "iterparse" of "xml.etree.cElementTree" (which is really nice) but still not a generator.
Other suggestions?
xml.etree.cElementTree comes close to a generator with correct usage; by default you receive each element after its 'end' event, at which point you can process it. You should use element.clear() on the element if you don't need it after processing; thereby you save the memory.
Here is a complete example of what I mean, in which I parse the library of Rhythmbox (a music player). I use (c)ElementTree's iterparse, and for each processed element I call element.clear(), which saves quite a lot of memory. (Btw, the code below is a successor to some SAX code that did the same thing; the cElementTree solution was a relief since 1) the code is concise and expresses what I need and nothing more, 2) it is 3x as fast, and 3) it uses less memory.)
import os
import xml.etree.cElementTree as ElementTree
NEEDED_KEYS= set(("title", "artist", "album", "track-number", "location", ))
def _lookup_string(string, strmap):
    """Look up #string in the string map,
    and return the copy in the map.
    If not found, update the map with the string.
    """
    string = string or ""
    try:
        return strmap[string]
    except KeyError:
        strmap[string] = string
        return string
def get_rhythmbox_songs(dbfile, typ="song", keys=NEEDED_KEYS):
    """Return a list of info dictionaries for all songs
    in a Rhythmbox library database file, with dictionary
    keys as given in #keys.
    """
    rhythmbox_dbfile = os.path.expanduser(dbfile)
    lSongs = []
    strmap = {}
    # Parse with iterparse; we get the elements when
    # they are finished, and can remove them directly after use.
    for event, entry in ElementTree.iterparse(rhythmbox_dbfile):
        if not (entry.tag == ("entry") and entry.get("type") == typ):
            continue
        info = {}
        for child in entry.getchildren():
            if child.tag in keys:
                tag = _lookup_string(child.tag, strmap)
                text = _lookup_string(child.text, strmap)
                info[tag] = text
        lSongs.append(info)
        entry.clear()
    return lSongs
Now, I am not sure what you expect. Do you expect something like the following to work?
# take one
for event, entry in ElementTree.iterparse(rhythmbox_dbfile):
    # parse some entries, then exit loop
# take two
for event, entry in ElementTree.iterparse(rhythmbox_dbfile):
    # parse the rest of entries
Each time you call iterparse you get a new iterator object, reading the file anew! If you want a persistent object with iterator semantics, you have to refer to the same object in both loops (untried code):
# setup
parseiter = iter(ElementTree.iterparse(rhythmbox_dbfile))
# take one
for event, entry in parseiter:
    # parse some entries, then exit loop
# take two
for event, entry in parseiter:
    # parse the rest of entries
I think it can be confusing since different objects have different semantics. A file object will always have an internal state and advance in the file, however you iterate on it. An ElementTree iterparse object apparently does not. The crux is that when you use a for loop, the for always calls iter() on the thing you iterate over. Here is an experiment comparing ElementTree.iterparse with a file object:
>>> import xml.etree.cElementTree as ElementTree
>>> pth = "/home/ulrik/.local/share/rhythmbox/rhythmdb.xml"
>>> iterparse = ElementTree.iterparse(pth)
>>> iterparse
<iterparse object at 0x483a0890>
>>> iter(iterparse)
<generator object at 0x483a2f08>
>>> iter(iterparse)
<generator object at 0x483a6468>
>>> f = open(pth, "r")
>>> f
<open file '/home/ulrik/.local/share/rhythmbox/rhythmdb.xml', mode 'r' at 0x4809af98>
>>> iter(f)
<open file '/home/ulrik/.local/share/rhythmbox/rhythmdb.xml', mode 'r' at 0x4809af98>
>>> iter(f)
<open file '/home/ulrik/.local/share/rhythmbox/rhythmdb.xml', mode 'r' at 0x4809af98>
What you see is that each call to iter() on an iterparse object returns a new generator. The file object, however, has internal operating-system state that must be preserved, and it is its own iterator.
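The "same iterator in both loops" idea can be tried out with a small, self-contained sketch. It assumes Python 3 and the standard xml.etree.ElementTree module (not the cElementTree session shown above), and the sample document is invented for illustration.

```python
import io
import itertools
import xml.etree.ElementTree as ET

# Made-up stand-in for the Rhythmbox database file.
source = io.BytesIO(b"<db><entry>a</entry><entry>b</entry><entry>c</entry></db>")

# setup: one iterator object shared by both passes
parseiter = iter(ET.iterparse(source))

# take one: consume only the first 'end' event, then stop
first = [elem.text for _event, elem in itertools.islice(parseiter, 1)]

# take two: continue where the first pass stopped
rest = [elem.text for _event, elem in parseiter if elem.tag == "entry"]
```

The second pass picks up at the second entry instead of starting over, because both passes pull from the same iterator object.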
"On the fly" parsing and document trees are not really compatible. SAX-style parsers are usually used for that (for example, Python's standard xml.sax). You basically have to define a class with handlers for various events like startElement, endElement, etc. and the parser will call the methods as it parses the XML file.
PullDom does what you want. It reads XML from a stream, like SAX, but then builds a DOM for a selected piece of it.
"PullDOM is a really simple API for working with DOM objects in a streaming (efficient!) manner rather than as a monolithic tree."
This is possible with elementtree and incremental parsing:
http://effbot.org/zone/element-iterparse.htm#incremental-parsing
import xml.etree.cElementTree as etree
for event, elem in etree.iterparse(source):
    ...
Easier to use than sax.
xmltodict has a callback way of reading row by row, but it is not very pythonic. I wanted something similar for reading Stack Overflow posts one by one from their XML dump using a generator.
This is the structure of the xml file:
<?xml version="1.0" encoding="utf-8"?>
<posts>
<row Id="1" ... />
<row Id="2" ... />
</posts>
And here is the code I used. It combines pulldom for streaming and xmltodict for parsing the rows.
def xml_to_dict_gen(file_path, tag='row'):
    from xml.dom import pulldom
    import xmltodict
    doc = pulldom.parse(file_path)
    for event, node in doc:
        if event == pulldom.START_ELEMENT and node.tagName == tag:
            doc.expandNode(node)
            yield dict(xmltodict.parse(node.toxml()).get(tag))
for post in xml_to_dict_gen('Posts.xml'):
    print(post)
