I am trying to convert the DBLP data set from an XML file to a CSV file. Right now, I am using iterparse() to parse the XML file.
Here is my code:
from lxml import etree

def iterpar():
    f = open('dblp.xml', 'rb')
    # dtd_validation=True validates every element against the DTD
    context = etree.iterparse(f, dtd_validation=True, events=("start", "end"))
    context = iter(context)
    event, root = next(context)
    for event, ele in context:
        print(event)
        print(ele)
However, when I tried to print out something to see what it is, an error was reported:
Traceback (most recent call last):
File "C:\dblp\Data\XML2csv", line 34, in <module>
iterpar()
File "C:\dblp\Data\XML2csv", line 29, in iterpar
for event, ele in context:
File "iterparse.pxi", line 208, in lxml.etree.iterparse.__next__ (src\lxml\lxml.etree.c:131498)
lxml.etree.XMLSyntaxError: No declaration for attribute mdate of element article, line 4, column 19
I guess this might result from a failed DTD validation, although all attributes are declared in the DTD file. I also searched for an explanation of my problem but didn't find a good one.
Can anybody explain it and tell me how to fix it? Thank you very much.
-----------update---------
I think I do need the dtd_validation. Otherwise it will report:
lxml.etree.XMLSyntaxError: Entity 'ouml' not defined, line 47, column 25
Entities like 'ouml' and 'uuml' occur in the XML file and are defined in the DTD file. Setting dtd_validation to False prevents the "No declaration" error, but then this one occurs instead.
Without seeing your XML or DTD, it's hard to say. It sounds like your XML is violating the DTD because it defines an 'mdate' attribute not listed in the DTD for a specific element. You definitely need the DTD because it defines at least one special character in your XML, so removing the DTD is out of the question.
Is it possible for you to add the 'mdate' attribute into the DTD so that the parser will accept your XML?
<!ATTLIST element-name attribute-name attribute-type #IMPLIED>
Your xml file needs to have something similar to this:
<!DOCTYPE dblp SYSTEM "dblp.dtd">
If you start fixing Entity 'ouml', you can break the previous dependency.
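When the DTD itself cannot be loaded, a rough alternative is to hand the parser the missing entity definitions directly. The sketch below uses the standard library's xml.etree.ElementTree rather than lxml. It assumes the document carries a DOCTYPE line like the one above and that its entities match the HTML named entities (such as ouml and uuml); the sample document and the name parse_with_entities are made up for illustration. With lxml, passing load_dtd=True instead of dtd_validation=True to iterparse() may also be worth trying, since that option loads the DTD for its entity definitions without validating against it.

```python
import xml.etree.ElementTree as ET
from html.entities import name2codepoint

# Hypothetical sample: a tiny dblp-like document that uses an entity
# which is only defined in the external DTD.
SAMPLE = (
    b'<!DOCTYPE dblp SYSTEM "dblp.dtd">\n'
    b'<dblp><article mdate="2002-01-03">'
    b'<author>J&ouml;rg</author></article></dblp>'
)


def parse_with_entities(data):
    """Parse XML bytes, resolving HTML-style named entities by hand."""
    parser = ET.XMLParser()
    # Assumption: the DTD's entities coincide with the HTML named entities.
    parser.entity.update((name, chr(cp)) for name, cp in name2codepoint.items())
    parser.feed(data)
    return parser.close()


root = parse_with_entities(SAMPLE)
print(root.find("article/author").text)
```

This only resolves entities; it performs no validation, so an undeclared attribute such as mdate is accepted without complaint.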
I am trying to use the following Script but there have been some errors that I've needed to fix. I was able to get it running but for most instances of the data it tries to process the following error arises:
C:\Users\Alexa\OneDrive\Skrivbord\Database Samples\LIDC-IDRI-0001\1.3.6.1.4.1.14519.5.2.1.6279.6001.175012972118199124641098335511\1.3.6.1.4.1.14519.5.2.1.6279.6001.141365756818074696859567662357\068.xml
Traceback (most recent call last):
File "C:\Users\Alexa\OneDrive\Documents\Universitet\Nuvarande\KEX\LIDC-IDRI-processing-master\lidc_data_to_nifti.py", line 370, in <module>
parse_xml_file(xml_file)
File "C:\Users\Alexa\OneDrive\Documents\Universitet\Nuvarande\KEX\LIDC-IDRI-processing-master\lidc_data_to_nifti.py", line 311, in parse_xml_file
root=xmlHelper.create_xml_tree(file)
File "C:\Users\Alexa\OneDrive\Documents\Universitet\Nuvarande\KEX\LIDC-IDRI-processing-master\lidcXmlHelper.py", line 23, in create_xml_tree
for at in el.attrib.keys(): # strip namespaces of attributes too
RuntimeError: dictionary keys changed during iteration
This corresponds to the following code:
import xml.etree.ElementTree as ET

def create_xml_tree(filepath):
    """
    Method to ignore the namespaces if ElementTree is used.
    Necessary because ElementTree, by default, extends
    tag names by the namespace, but the namespaces used in the
    LIDC-IDRI dataset are not consistent.
    Solution based on https://stackoverflow.com/questions/13412496/python-elementtree-module-how-to-ignore-the-namespace-of-xml-files-to-locate-ma
    instead of ET.fromstring(xml)
    """
    it = ET.iterparse(filepath)
    for _, el in it:
        if '}' in el.tag:
            el.tag = el.tag.split('}', 1)[1]  # strip all namespaces
        for at in el.attrib.keys():  # strip namespaces of attributes too
            if '}' in at:
                newat = at.split('}', 1)[1]
                el.attrib[newat] = el.attrib[at]
                del el.attrib[at]
    return it.root
I am not at all familiar with xml file reading in python and this problem has gotten me stuck for the last two days now. I tried reading up on threads both here and on other forums but it did not give me significant insight. From what I understand the problem arises because we're trying to manipulate the same object we are reading from, which is not allowed? I tried to fix this by making a copy of the file and then having that change depending on what is read in the original file but I did not get it working properly.
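The error comes from adding and deleting keys in el.attrib while the loop is still iterating over el.attrib.keys(). One common way around it, sketched here on the assumption that nothing else in the script relies on the old behavior, is to loop over a snapshot of the keys made with list(). The function name and the sample XML below are made up for illustration.

```python
import io
import xml.etree.ElementTree as ET


def create_xml_tree_fixed(source):
    """Parse XML and strip namespaces from tag and attribute names."""
    it = ET.iterparse(source)
    for _, el in it:
        if '}' in el.tag:
            el.tag = el.tag.split('}', 1)[1]
        # list(...) takes a snapshot, so the dict can be changed safely
        for at in list(el.attrib.keys()):
            if '}' in at:
                el.attrib[at.split('}', 1)[1]] = el.attrib.pop(at)
    return it.root


# Hypothetical sample with a namespaced tag and a namespaced attribute
sample = b'<n:root xmlns:n="urn:example"><n:item n:id="7" plain="x"/></n:root>'
root = create_xml_tree_fixed(io.BytesIO(sample))
print(root.tag, root[0].tag, root[0].attrib)
```

Copying the file is not needed: only the key view being iterated has to be decoupled from the dictionary that is being modified.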
I have an xml file with this as the header
<?xml version="1.0" encoding="utf-8"?>
<?xml-stylesheet type='text/xsl' href='\\segotn12805\ppr\PPRData3\StyleSheet\PPRData3.xslt'?>
When I modify the file, I write it out with .write, for example:
mytree.write('output.xml')
but the output file does not contain the header info.
The first two lines of the output file look like this
<ns0:pprdata xmlns:ns0="http://ManHub.PPRData">
<ns0:Group name="Models">
Any ideas on how I can add the header info to the output file?
The first line is the XML declaration. It is optional, and a parser will assume UTF-8 if not specified.
The second line is a processing instruction.
It would be helpful if you provided more code to show what you are doing, but I suspect that you are using ElementTree. The documentation has this note indicating that by default these are skipped:
Note Not all elements of the XML input will end up as elements of the parsed tree. Currently, this module skips over any XML comments, processing instructions, and document type declarations in the input. Nevertheless, trees built using this module’s API rather than parsing from XML text can have comments and processing instructions in them; they will be included when generating XML output. A document type declaration may be accessed by passing a custom TreeBuilder instance to the XMLParser constructor.
As suggested in this answer, you might want to try using lxml
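If switching libraries is not an option, one workaround is to write the two header lines yourself and then let ElementTree append the tree. This is a minimal sketch, assuming the header lines are known in advance; write_with_header and the stand-in tree are hypothetical names used only for illustration.

```python
import io
import xml.etree.ElementTree as ET

# Header lines copied from the original file (raw strings keep the backslashes)
HEADER = (
    '<?xml version="1.0" encoding="utf-8"?>\n'
    r"<?xml-stylesheet type='text/xsl' "
    r"href='\\segotn12805\ppr\PPRData3\StyleSheet\PPRData3.xslt'?>"
    '\n'
)


def write_with_header(tree, fileobj):
    """Write the header lines, then the tree without a second declaration."""
    fileobj.write(HEADER.encode('utf-8'))
    tree.write(fileobj, encoding='utf-8', xml_declaration=False)


# Stand-in for the modified tree from the question
mytree = ET.ElementTree(ET.fromstring('<pprdata><Group name="Models"/></pprdata>'))

buffer = io.BytesIO()  # with a real file: open('output.xml', 'wb')
write_with_header(mytree, buffer)
print(buffer.getvalue().decode('utf-8'))
```

The file object has to be opened in binary mode, because the header is written as bytes and tree.write() encodes its own output as UTF-8.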
I am making a GUI using appJar(python Library, uses Tkinter). I have an XML file.
I am parsing the XML file using ElementTree XML parsing library.
I want to see my XML file in a tree view.
So I parse the file using ElementTree, collect the tags I need to show in the tree view, build a new XML object from them, and pass that object to the appJar function .addTree().
But I get this error:
..lib\site-packages\appJar\appjar.py", line 8764, in addTree
xmlDoc = parseString(data).
...lib\xml\dom\expatbuilder.py", line 223, in parseString
parser.Parse(string, True)
TypeError: a bytes-like object is required, not 'ElementTree'
xml = et.Element(root)
print(xml)
for ele in valList:
    reg = et.SubElement(xml, ele.find('Name').text)
    bitroot = ele.findall('Bit')
    for bit in bitroot:
        et.SubElement(reg, bit.find('Name').text)
xmltree = et.ElementTree(xml)
app.startFrame('bottomleft', 1, 0, 2)
app.setBg('orange')
app.setSticky('news')
app.setStretch('none')
app.addTree('REGISTER', xmltree)
As far as I can understand, I get the error because the .addTree() API cannot read the format of the xmltree variable.
According to appJar documentation, you need to pass an XML string to .addTree(), not an ElementTree. According to ElementTree documentation, you can use xml.etree.ElementTree.tostring() to build an XML string from your Element:
xml_string = et.tostring(xml)
app.addTree('REGISTER', xml_string)
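A minimal sketch of that conversion without the GUI part, using made-up register and bit names in place of valList. It shows that tostring() returns bytes by default, which is the kind of object the traceback asks for.

```python
import xml.etree.ElementTree as et

# Hypothetical register layout standing in for valList from the question
registers = {'CTRL': ['ENABLE', 'RESET'], 'STATUS': ['READY']}

xml = et.Element('root')
for reg_name, bit_names in registers.items():
    reg = et.SubElement(xml, reg_name)
    for bit_name in bit_names:
        et.SubElement(reg, bit_name)

xml_string = et.tostring(xml)  # bytes by default
print(type(xml_string).__name__)
print(xml_string.decode('ascii'))
# In the GUI code this value would then go to app.addTree('REGISTER', xml_string)
```

Note that tostring() takes the root Element (xml), not the ElementTree wrapper (xmltree).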
I am trying to use ElementTree to parse XML files. Since the parser does not retain comments by default, I used the following code from http://bugs.python.org/issue8277:
import xml.etree.ElementTree as etree

class CommentedTreeBuilder(etree.TreeBuilder):
    """A TreeBuilder subclass that retains comments."""

    def comment(self, data):
        self.start(etree.Comment, {})
        self.data(data)
        self.end(etree.Comment)

parser = etree.XMLParser(target = CommentedTreeBuilder())
The above is in documents.py. Tested with:
import os
import sys
import unittest
import xml.etree.ElementTree as etree

import documents

class TestDocument(unittest.TestCase):
    def setUp(self):
        filename = os.path.join(sys.path[0], "data", "facilities.xml")
        self.doc = etree.parse(filename, parser = documents.parser)

    def testClass(self):
        print("Class is {0}.".format(self.doc.__class__.__name__))

    #commented out tests.

if __name__ == '__main__':
    unittest.main()
This barfs with:
Traceback (most recent call last):
File "/home/goncalo/documents/games/ja2/modding/mods/xml-overhaul/src/scripts/../tests/test_documents.py", line 24, in setUp
self.doc = etree.parse(filename, parser = documents.parser)
File "/usr/lib/python3.3/xml/etree/ElementTree.py", line 1242, in parse
tree.parse(source, parser)
File "/usr/lib/python3.3/xml/etree/ElementTree.py", line 1726, in parse
parser.feed(data)
IndexError: pop from empty stack
What am I doing wrong? By the way, the xml in the file is valid (as checked by an independent program) and in utf-8 encoding.
note(s):
using Python 3.3. In Kubuntu 13.04, just in case it is relevant. I make sure to use "python3" (and not just "python") to run the test scripts.
edit: here is the sample xml file used; it is very small (let's see if I can get the formatting right):
<?xml version="1.0" encoding="utf-8"?>
<!-- changes to facilities.xml by G. Rodrigues: ar overhaul.-->
<SECTORFACILITIES>
  <!-- Drassen -->
  <!-- Small airport -->
  <FACILITY>
    <SectorGrid>B13</SectorGrid>
    <FacilityType>4</FacilityType>
    <ubHidden>0</ubHidden>
  </FACILITY>
</SECTORFACILITIES>
The example XML you added works for me in 2.7, but breaks on 3.3 with the stack trace you described.
The problem seems to be the very first comment - after the XML declaration, before the first element. It isn't part of the tree in 2.7 (doesn't raise an Exception though), and causes the exception in 3.3.
See Python issue #17901: In Python 3.4, which contains the mentioned fix, pop from empty stack doesn't occur, but ParseError: multiple elements on top level is raised instead.
Which makes sense: if you want to retain the comments in the tree, they need to be treated as nodes. And XML only allows one node at the top level of the document, so you can't have a comment before the first "real" element (if you force the parser to retain comments).
So unfortunately I think that's your only option: Remove those comments outside the root document node from your XML files - either in the original files, or by stripping them before parsing.
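On newer Python versions there is also a built-in route: since Python 3.8, TreeBuilder accepts insert_comments=True. The sketch below assumes Python 3.8 or later and reuses the sample file from the question. Comments inside the root element are kept as nodes, while the one before the root element does not become part of the tree.

```python
import xml.etree.ElementTree as ET

SAMPLE = b"""<?xml version="1.0" encoding="utf-8"?>
<!-- changes to facilities.xml by G. Rodrigues: ar overhaul.-->
<SECTORFACILITIES>
  <!-- Drassen -->
  <!-- Small airport -->
  <FACILITY>
    <SectorGrid>B13</SectorGrid>
    <FacilityType>4</FacilityType>
    <ubHidden>0</ubHidden>
  </FACILITY>
</SECTORFACILITIES>
"""

# Requires Python 3.8+: TreeBuilder keeps comments when asked to
parser = ET.XMLParser(target=ET.TreeBuilder(insert_comments=True))
root = ET.fromstring(SAMPLE, parser=parser)

# Comment nodes use the ET.Comment factory as their tag
comments = [el.text for el in root.iter() if el.tag is ET.Comment]
print(root.tag, comments)
```

If the top-level comment has to survive a round trip, it still needs to be handled separately, for example by writing it back by hand before the root element.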
I am using the libxml2 Python bindings to parse my XML and validate it against an XSD file.
When I get a schema validation error, I am unable to find the line number of the error in the XML file.
Is there a way?
For better understanding, I have pasted my code below:
ctxt = libxml2.schemaNewParserCtxt("xsd1.xsd")
schema = ctxt.schemaParse()
del ctxt
validationCtxt = schema.schemaNewValidCtxt()
doc = libxml2.parseFile("myxml.xml")
iErrA1 = validationCtxt.schemaValidateDoc(doc)
#get the line number if there is error
Update:
I tried libxml2.lineNumbersDefault(1), which probably enables the library to collect line numbers.
Then I register a callback through validationCtxt.setValidityErrorHandler(ErrorHandler, ErrorHandler).
I get the msg part, but no line number is present.
More info: I have seen a line number member in xmlError, but it's unclear to me how I can get hold of this xmlError object when there is an error.
There is also a global function lastError which returns an xmlError object, but it always returns None even though there is an error.