Subclassing ElementTree parser to retain comments

Subclassing ElementTree parser to retain comments - python

Trying to use the ElementTree to parse xml files; since by default the parser does not retain comments, used the following code from http://bugs.python.org/issue8277:
import xml.etree.ElementTree as etree
class CommentedTreeBuilder(etree.TreeBuilder):
"""A TreeBuilder subclass that retains comments."""
def comment(self, data):
self.start(etree.Comment, {})
self.data(data)
self.end(etree.Comment)
parser = etree.XMLParser(target = CommentedTreeBuilder())
The above is in documents.py. Tested with:
class TestDocument(unittest.TestCase):
def setUp(self):
filename = os.path.join(sys.path[0], "data", "facilities.xml")
self.doc = etree.parse(filename, parser = documents.parser)
def testClass(self):
print("Class is {0}.".format(self.doc.__class__.__name__))
#commented out tests.
if __name__ == '__main__':
unittest.main()
This barfs with:
Traceback (most recent call last):
File "/home/goncalo/documents/games/ja2/modding/mods/xml-overhaul/src/scripts/../tests/test_documents.py", line 24, in setUp
self.doc = etree.parse(filename, parser = documents.parser)
File "/usr/lib/python3.3/xml/etree/ElementTree.py", line 1242, in parse
tree.parse(source, parser)
File "/usr/lib/python3.3/xml/etree/ElementTree.py", line 1726, in parse
parser.feed(data)
IndexError: pop from empty stack
What am I doing wrong? By the way, the xml in the file is valid (as checked by an independent program) and in utf-8 encoding.
note(s):
using Python 3.3. In Kubuntu 13.04, just in case it is relevant. I make sure to use "python3" (and not just "python") to run the test scripts.
edit: here is the sample xml file used; it is very small (let's see if I can get the formatting right):
<?xml version="1.0" encoding="utf-8"?>
<!-- changes to facilities.xml by G. Rodrigues: ar overhaul.-->
<SECTORFACILITIES>
<!-- Drassen -->
<!-- Small airport -->
<FACILITY>
<SectorGrid>B13</SectorGrid>
<FacilityType>4</FacilityType>
<ubHidden>0</ubHidden>
</FACILITY>
</SECTORFACILITIES>

The example XML you added works for me in 2.7, but breaks on 3.3 with the stack trace you described.
The problem seems to be the very first comment - after the XML declaration, before the first element. It isn't part of the tree in 2.7 (doesn't raise an Exception though), and causes the exception in 3.3.
See Python issue #17901: In Python 3.4, which contains the mentioned fix, pop from empty stack doesn't occur, but ParseError: multiple elements on top level is raised instead.
Which makes sense: If you want to retain the comments in the tree, they need to be trated as nodes. And XML only allows one node at the top level of the document, so you can't have a comment before the first "real" element (if you force the parser to retain commments).
So unfortunately I think that's your only option: Remove those comments outside the root document node from your XML files - either in the original files, or by stripping them before parsing.

Related

how to write new xml file with same header

I have an xml file with this as the header
<?xml version="1.0" encoding="utf-8"?>
<?xml-stylesheet type='text/xsl' href='\\segotn12805\ppr\PPRData3\StyleSheet\PPRData3.xslt'?>
when I modify the file I use .write (for example)
mytree.write('output.xml')
but the output file does not contain the header info.
The first two lines of the output file look like this
<ns0:pprdata xmlns:ns0="http://ManHub.PPRData">
<ns0:Group name="Models">
any ideas on how I can add the header info to the output file?

The first line is the XML declaration. It is optional, and a parser will assume UTF-8 if not specified.
The second line is a processing instruction.
It would be helpful if you provided more code to show what you are doing, but I suspect that you are using ElementTree. The documentation has this note indicating that by default these are skipped:
Note Not all elements of the XML input will end up as elements of the parsed tree. Currently, this module skips over any XML comments, processing instructions, and document type declarations in the input. Nevertheless, trees built using this module’s API rather than parsing from XML text can have comments and processing instructions in them; they will be included when generating XML output. A document type declaration may be accessed by passing a custom TreeBuilder instance to the XMLParser constructor.
As suggested in this answer, you might want to try using lxml

lxml.etree.XMLSyntaxError: No declaration for attribute

I am trying to use dblp data set to convert the xml file to csv file. Right now, I am using iterparse() to parse the xml file.
Here is my code:
def iterpar():
f = open(dblp.xml', 'rb')
context = etree.iterparse(f, dtd_validation=True, events=("start", "end"))
context = iter(context)
event, root = next(context)
for event, ele in context:
print event
print ele
However, when I tried to print out something to see what it is, an error was reported:
Traceback (most recent call last):
File "C:\dblp\Data\XML2csv", line 34, in <module>
iterpar()
File "C:\dblp\Data\XML2csv", line 29, in iterpar
for event, ele in context:
File "iterparse.pxi", line 208, in lxml.etree.iterparse.__next__ (src\lxml\lxml.etree.c:131498)
lxml.etree.XMLSyntaxError: No declaration for attribute mdate of element article, line 4, column 19
I guess this might result from a fail dtd validation because all attributes are declared in the dtd file. I also tried to google if there are any explanations for my problem but didn't find a good one.
Can anybody explain it and tell me how to fix it? Thank you very much.
-----------update---------
I think I do need the dtd_validation. Otherwise it will report:
lxml.etree.XMLSyntaxError: Entity 'ouml' not defined, line 47, column 25
Entities like 'ouml', 'uuml' occurs in xml file, and is defined in the dtd file. Although setting the dtd_validation to be false prevents the No declaration error, but this one will occur.

Without seeing your XML or DTD, it's hard to say. It sounds like your XML is violating the DTD because it defines an 'mdate' attribute not listed in the DTD for a specific element. You definitely need the DTD because it defines at least one special character in your XML, so removing the DTD is out of the question.
Is it possible for you to add the 'mdate' attribute into the DTD so that the parser will accept your XML?
<!ATTLIST element-name attribute-name attribute-type #IMPLIED>

Your xml file needs to have something similar to this:
<!DOCTYPE dblp SYSTEM "dblp.dtd">
If you start fixing Entity 'ouml', you can break the previous dependency.

Parsing a large .bz2 file (40 GB) with lxml iterparse in python. Error that does not appear with uncompressed file

I am trying to parse OpenStreetMap's planet.osm, compressed in bz2 format. Because it is already 41G, I don't want to decompress the file completely.
So I figured out how to parse portions of the planet.osm file using bz2 and lxml, using the following code
from lxml import etree as et
from bz2 import BZ2File
path = "where/my/fileis.osm.bz2"
with BZ2File(path) as xml_file:
parser = et.iterparse(xml_file, events=('end',))
for events, elem in parser:
if elem.tag == "tag":
continue
if elem.tag == "node":
(do something)
## Do some cleaning
# Get rid of that element
elem.clear()
# Also eliminate now-empty references from the root node to node
while elem.getprevious() is not None:
del elem.getparent()[0]
which works perfectly with the Geofabrick extracts. However, when I try to parse the planet-latest.osm.bz2 with the same script I get the error:
xml.etree.XMLSyntaxError: Specification mandate value for attribute num_change, line 3684, column 60
Here are the things I tried:
Check the planet-latest.osm.bz2 md5sum
Check the planet-latest.osm where the script with bz2 stops. There is no apparent error, and the attribute is called "num_changes", not "num_change" as indicated in the error
Also I did something stupid, but the error puzzled me: I opened the planet-latest.osm.bz2 in mode 'rb' [c = BZ2File('file.osm.bz2', 'rb')] and then passed c.read() to iterparse(), which returned me an error saying (very long string) cannot be opened. Strange thing, (very long string) ends right where the "Specification mandate value" error refers to...
Then I tried to decompress first the planet.osm.gz2 usin a simple
bzcat planet.osm.gz2 > planet.osm
And ran the parser directly on planet.osm. And... it worked! I am very puzzled by this, and could not find any pointer to why this may happen and how to solve this. My guess would be there is something going on between the decompression and the parsing, but I am not sure. Please help me understand!

It turns out that the problem is with the compressed planet.osm file.
As indicated on the OSM Wiki, the planet file is compressed as a multistream file, and the bz2 python module cannot read multistream files. However, the bz2 documentation indicates an alternative module that can read such files, bz2file. I used it and it works perfectly!
So the code should read:
from lxml import etree as et
from bz2file import BZ2File
path = "where/my/fileis.osm.bz2"
with BZ2File(path) as xml_file:
parser = et.iterparse(xml_file, events=('end',))
for events, elem in parser:
if elem.tag == "tag":
continue
if elem.tag == "node":
(do something)
## Do some cleaning
# Get rid of that element
elem.clear()
# Also eliminate now-empty references from the root node to node
while elem.getprevious() is not None:
del elem.getparent()[0]
Also, doing some research on using the PBF format (as advised in the comments), I stumbled upon imposm.parser, a python module that implements a generic parser for OSM data (in pbf or xml format). You may want to have a look at this!

As an alternative you can use the output of bzcat command (which can handle multistream files too):
p = subprocess.Popen(["bzcat", "data.bz2"], stdout=subprocess.PIPE)
parser = et.iterparse(p.stdout, ...)
# at the end just check that p.returncode == 0 so there were no errors

element tree treats similar files differently

Here are two different files that my python (2.6) script encounters. One will parse, the other will not. I'm just curious as to why this happens.
This xml file will not parse and the script will fail:
<Landfire_Feedback_Point_xlsform id="fbfm40v10" instanceID="uuid:9e062da6-b97b-4d40-b354-6eadf18a98ab" submissionDate="2013-04-30T23:03:32.881Z" isComplete="true" markedAsCompleteDate="2013-04-30T23:03:32.881Z" xmlns="http://opendatakit.org/submissions">
<date_test>2013-04-17</date_test>
<plot_number>10</plot_number>
<select_multiple_names>BillyBob</select_multiple_names>
<geopoint_plot>43.2452830500 -118.2149402900 210.3000030518 3.0000000000</geopoint_plot><fbfm40_new>GS2</fbfm40_new>
<select_grazing>NONE</select_grazing>
<image_close>1366230030355.jpg</image_close>
<plot_note>No road present.</plot_note>
<n0:meta xmlns:n0="http://openrosa.org/xforms">
<n0:instanceID>uuid:9e062da6-b97b-4d40-b354-6eadf18a98ab</n0:instanceID>
</n0:meta>
</Landfire_Feedback_Point_xlsform>
This xml file will parse correctly and the script succeeds:
<Landfire_Feedback_Point_xlsform id="fbfm40v10">
<date_test>2013-05-14</date_test>
<plot_number>010</plot_number>
<select_multiple_names>BillyBob</select_multiple_names>
<geopoint_plot>43.26630563 -118.39881809 351.70001220703125 5.0</geopoint_plot>
<fbfm40_new>GR1</fbfm40_new>
<select_grazing>HIGH</select_grazing>
<image_close>fbfm40v10_PLOT_010_ID_6.jpg</image_close>
<plot_note>Heavy grazing</plot_note>
<meta><instanceID>uuid:90e7d603-86c0-46fc-808f-ea0baabdc082</instanceID></meta>
</Landfire_Feedback_Point_xlsform>
Here is a little python script that demonstrates that one will work, while the other will not. I'm just looking for an explanation as to why one is seen by ElementTree as an xml file while the other isn't. Specifically, the one that doesn't seem to parse fails with a "'NONE' type doesn't have a 'text' attribute" or something similar. But, it's because it doesn't seem to consider the file as xml or it can't see any elements beyond the opening line. Any explanation or direction with regard to this error would be appreciated. Thanks in advance.
Python script:
import os
from xml.etree import ElementTree
def replace_xml_attribute_in_file(original_file,element_name,attribute_value):
#THIS FUNCTION ONLY WORKS ON XML FILES WITH UNIQUE ELEMENT NAMES
# -DUPLICATE ELEMENT NAMES WILL ONLY GET THE FIRST ELEMENT WITH A GIVEN NAME
#split original filename and add tempfile name
tempfilename="temp.xml"
rootsplit = original_file.rsplit('\\') #split the root directory on the backslash
rootjoin = '\\'.join(rootsplit[:-1]) #rejoin the root diretory parts with a backslash -minus the last
temp_file = os.path.join(rootjoin,tempfilename)
et = ElementTree.parse(original_file)
author=et.find(element_name)
author.text = attribute_value
et.write(temp_file)
if os.path.exists(temp_file) and os.path.exists(original_file): #if both the original and the temp files exist
os.remove(original_file) #erase the original
os.rename(temp_file,original_file) #rename the new file
else:
print "Something went wrong."
replace_xml_attribute_in_file("testfile1.xml","image_close","whoopdeedoo.jpg");

Here is a little python script that demonstrates that one will work, while the other will not. I'm just looking for an explanation as to why one is seen by ElementTree as an xml file while the other isn't.
Your code doesn't demonstrate that at all. It demonstrates that they're both seen by ElementTree as valid XML files chock full of nodes. They both parse just fine, they both read past the first line, etc.
The only problem is that the first one doesn't have a node named 'image_close', so your code doesn't work.
You can see that pretty easily:
for node in et.getroot().getchildren():
print node.tag
You get 9 children of the root, with either version.
And the output to that should show you the problem. The node you want is actually named {http://opendatakit.org/submissions}image_close in the first example, rather than image_close as in the second.
And, as you can probably guess, this is because of the namespace=http://opendatakit.org/submissions in the root node. ElementTree uses the "James Clark notation" for mapping unknown-namespaced names to universal names.
Anyway, because none of the nodes are named image_close, the et.find(element_name) returns None, so your code stores author=None, then tries to assign to author.text, and gets an error.
As for how to fix this problem… well, you could learn how namespaces work by default in ElementTree, or you could upgrade to Python 2.7 or install a newer ElementTree for 2.6 that lets you customize things more easily. But if you want to do custom namespace handling and also stick with your old version… I'd start with this article (and its two predecessors) and this one.

Parsing an XML file in python for emailing purposes

I am writing code in python that can not only read a xml but also send the results of that parsing as an email. Now I am having trouble just trying to read the file I have in xml. I made a simple python script that I thought would at least read the file which I can then try to email within python but I am getting a Syntax Error in line 4.
root.tag 'log'
Anyways here is the code I written so far:
import xml.etree.cElementTree as etree
tree = etree.parse('C:/opidea.xml')
response = tree.getroot()
log = response.find('log').text
logentry = response.find('logentry').text
author = response.find('author').text
date = response.find('date').text
msg = [i.text for i in response.find('msg')]
Now the xml file has this type of formating
<log>
<logentry
revision="12345">
<author>glv</author>
<date>2012-08-09T13:16:24.488462Z</date>
<paths>
<path
action="M"
kind="file">/trunk/build.xml</path>
</paths>
<msg>BUG_NUMBER:N/A
FEATURE_AFFECTED:N/A
OVERVIEW:Example</msg>
</logentry>
</log>
I want to be able to send an email of this xml file. For now though I am just trying to get the python code to read the xml file.

response.find('log') won't find anything, because:
find(self, path, namespaces=None)
Finds the first matching subelement, by tag name or path.
In your case log is not a subelement, but rather the root element itself. You can get its text directly, though: response.text. But in your example the log element doesn't have any text in it, anyway.
EDIT: Sorry, that quote from the docs actually applies to lxml.etree documentation, rather than xml.etree.
I'm not sure about the reason, but all other calls to find also return None (you can find it out by printing response.find('date') and so on). With lxml ou can use xpath instead:
author = response.xpath('//author')[0].text
msg = [i.text for i in response.xpath('//msg')]
In any case, your use of find is not correct for msg, because find always returns a single element, not a list of them.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.