I have a big XML file with several article nodes. I have included only the one that causes the problem. When I try to parse it in Python to filter some data, I get this error:
File "<string>", line unknown
ParseError: undefined entity &Ouml;: line 90, column 17
Sample of the XML file
<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE dblp SYSTEM "dblp.dtd">
<dblp>
<article mdate="2019-10-25" key="tr/gte/TR-0146-06-91-165" publtype="informal">
<author>Alejandro P. Buchmann</author>
<author>M. Tamer &Ouml;zsu</author>
<author>Dimitrios Georgakopoulos</author>
<title>Towards a Transaction Management System for DOM.</title>
<journal>GTE Laboratories Incorporated</journal>
<volume>TR-0146-06-91-165</volume>
<month>June</month>
<year>1991</year>
<url>db/journals/gtelab/index.html#TR-0146-06-91-165</url>
</article>
</dblp>
From my search in Google, I found that this kind of error appears if you have issues in the node names. However, the line with the error is the second author, in the text.
This is my Python code
import xml.etree.ElementTree as etree

with open('xaa.xml', 'r') as xml_file:
    xml_tree = etree.parse(xml_file)
The declaration of the Ouml entity is presumably in the DTD (dblp.dtd), but ElementTree does not support external DTDs. ElementTree only recognizes entities declared directly in the XML file (in the "internal subset"). This is a working example:
<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE dblp [
<!ENTITY Ouml 'Ö'>
]>
<dblp>
<article mdate="2019-10-25" key="tr/gte/TR-0146-06-91-165" publtype="informal">
<author>Alejandro P. Buchmann</author>
<author>M. Tamer &Ouml;zsu</author>
<author>Dimitrios Georgakopoulos</author>
<title>Towards a Transaction Management System for DOM.</title>
<journal>GTE Laboratories Incorporated</journal>
<volume>TR-0146-06-91-165</volume>
<month>June</month>
<year>1991</year>
<url>db/journals/gtelab/index.html#TR-0146-06-91-165</url>
</article>
</dblp>
To parse the XML file in the question without errors, you need a more powerful XML library that supports external DTDs. lxml is a good choice for that.
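If installing lxml is not an option, one possible workaround with the standard library is to tell ElementTree's parser what the entities stand for through the `entity` dictionary on `XMLParser`. This is only a sketch, not what the answer above recommends. It assumes the document keeps its DOCTYPE line and that the entity names in dblp.dtd match the HTML entity names in `html.entities`; the sample document is a trimmed-down stand-in for the file in the question.

```python
import xml.etree.ElementTree as ET
from html.entities import name2codepoint

# Trimmed-down stand-in for the file in the question. Passed as bytes so
# that the ISO-8859-1 encoding declaration is handled by the parser.
xml_bytes = b"""<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE dblp SYSTEM "dblp.dtd">
<dblp>
<article key="tr/gte/TR-0146-06-91-165">
<author>M. Tamer &Ouml;zsu</author>
</article>
</dblp>"""

parser = ET.XMLParser()
# Assumption: the DTD's entity names match the HTML ones (Ouml, auml, ...).
# The DTD file itself is never read.
for name, codepoint in name2codepoint.items():
    parser.entity[name] = chr(codepoint)

root = ET.fromstring(xml_bytes, parser=parser)
author = root.find("article/author").text
print(author)
```

For a file on disk, the same parser object can be passed as `ET.parse(path, parser=parser)`.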
I need Python code that returns 5.6.0, which is the DTD version given in the DOCTYPE declaration of the XML.
I have already tried to retrieve the doctype but I failed.
<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE article PUBLIC
"-//ES//DTD journal article DTD version 5.6.0//EN//XML"
"art560.dtd" [<!ENTITY gr1 SYSTEM "gr1" NDATA IMAGE>
There you go:
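xmlstring = '''<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE article PUBLIC
"-//ES//DTD journal article DTD version 5.6.0//EN//XML"
"art560.dtd" [<!ENTITY gr1 SYSTEM "gr1" NDATA IMAGE>'''
# "version" occurs twice (XML declaration, then the public identifier), so
# index 2 is the text after the second one; strip() drops the leading space.
version = xmlstring.split("version")[2].split("//")[0].strip()
print(version)
Stack Overflow's syntax highlighting renders it oddly, but it runs.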
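A regular expression is a slightly less position-dependent alternative to splitting on "version". The sketch below assumes the public identifier always contains the text "DTD version" followed by a dotted number, as in the sample from the question; other DOCTYPE layouts would need a different pattern.

```python
import re

xmlstring = '''<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE article PUBLIC
"-//ES//DTD journal article DTD version 5.6.0//EN//XML"
"art560.dtd" [<!ENTITY gr1 SYSTEM "gr1" NDATA IMAGE>'''

# Capture the dotted number that follows "DTD version" in the public identifier.
match = re.search(r'DTD version\s+(\d+(?:\.\d+)*)', xmlstring)
version = match.group(1) if match else None
print(version)
```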
The following code gives me the Python error 'failed to parse' for addon.xml:
(I've used an online checker and it says "error on line 33 at column 15: Opening and ending tag mismatch: description line 0 and extension", which is the very end of the /extension end tag at the end of the document.)
Any advice would be appreciated. This worked yesterday and I have no idea why it's not working at all now.
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<addon id="plugin.audio.criminalpodcast" name="Criminal Podcast" version="1.1.0" provider-name="leopheard">
<requires>
<import addon="xbmc.python" version="2.1.0"/>
<import addon="script.module.xbmcswift2" version="2.4.0"/>
<import addon="script.module.beautifulsoup4" version="4.3.1"/>
<import addon="script.module.requests" version="1.1.0"/>
<import addon="script.module.routing" version="0.2.0"/> </requires>
<provides>audio</provides> </extension>
<extension point="xbmc.addon.metadata">
<platform>all</platform>
<language></language>
<summary lang="en"></summary>
<description lang="en">description </description>
<license>The MIT License (MIT)</license>
<forum>https://forum.kodi.tv/showthread.php?tid=344790</forum>
<email>leopheard#gmail.com</email>
<source>https://github.com/leopheard/criminalpodcast</source>
<website>http://www.thisiscriminal.com</website>
<audio_guide></audio_guide>
<assets>
<icon>icon.png</icon>
<fanart>fanart.jpg</fanart>
<screenshot>resources/media/Criminal_SocialShare_2.png</screenshot>
<screenshot>resources/media/Criminal_SocialShare_3.png</screenshot>
<screenshot>resources/media/Radiotopia-logo.png</screenshot>
</assets>
Your "XML" file is not well-formed, so it cannot be parsed. Find out how it was created, correct the process so the problem does not occur again, and then regenerate the file.
Files that are vaguely XML-like but not well-formed are pretty well useless. Repair is sometimes possible if the errors are very systematic, but that doesn't appear to be the case here.
Most of the time a "failed to parse" error message is due to the XML file itself.
Check your XML file for correct formatting.
I once forgot the root tag and had the same error message.
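One way to check a file before handing it to another tool is to parse it with Python's standard library and report where the parser gives up. This is a minimal sketch; the helper name `check_well_formed` is made up for illustration, and it only tests well-formedness, not whether the add-on metadata is otherwise correct.

```python
import xml.etree.ElementTree as ET

def check_well_formed(xml_text):
    """Return (True, None) if xml_text parses, else (False, description)."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        line, column = err.position  # where the parser stopped
        return False, "line %d, column %d: %s" % (line, column, err)
    return True, None

# A closing tag that does not match the open element is reported as an error.
print(check_well_formed("<addon><requires></addon>"))
print(check_well_formed("<addon><requires/></addon>"))
```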
I am using python 2.7 with lxml:
from lxml import etree

parser = etree.XMLParser(load_dtd=True, no_network=False, resolve_entities=True)
xml = etree.parse(data, parser=parser)
The resulting xml will parse data with external entities. It will grab files if the external entity is:
file:///home/text.txt
However, if we want the external entity to be processed to list the files in a directory:
file:///home/
the external entity variable is blank.
This is a debian machine.
Example that gets a file:
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE urlset [
<!ENTITY list SYSTEM "file:///home/test.txt" >]>
<test>
<game>
<files>&list;</files>
</game>
</test>
Example that DOES NOT get a list of file names:
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE urlset [
<!ENTITY list SYSTEM "file:///home/" >]>
<test>
<game>
<files>&list;</files>
</game>
</test>
The following code:
import xml.etree.ElementTree as ET
xml = '''\
<?xml version="1.0" encoding="UTF-8"?>
<testCaseConfig>
<?LazyComment Blah de blah/?>
<testCase runLimit="420" name="d1/n1"/>
<testCase runLimit="420" name="d1/n2"/>
</testCaseConfig>'''
root = ET.fromstring(xml)
xml2 = xml.replace('LazyComment ', 'LazyComment:')
print(xml2)
try:
    root2 = ET.fromstring(xml2)
except ET.ParseError:
    print("\nERROR in xml2!!!\n")
xml3 = xml2.replace('testCaseConfig', 'testCaseConfig xmlns:Blah="http://www.w3.org/TR/html4/"', 1)
print(xml3)
try:
    root3 = ET.fromstring(xml3)
except ET.ParseError:
    print("\nERROR in xml3!!!\n")
    raise
Gives this output:
<?xml version="1.0" encoding="UTF-8"?>
<testCaseConfig>
<?LazyComment:Blah de blah/?>
<testCase runLimit="420" name="d1/n1"/>
<testCase runLimit="420" name="d1/n2"/>
</testCaseConfig>
ERROR in xml2!!!
<?xml version="1.0" encoding="UTF-8"?>
<testCaseConfig xmlns:Blah="http://www.w3.org/TR/html4/">
<?LazyComment:Blah de blah/?>
<testCase runLimit="420" name="d1/n1"/>
<testCase runLimit="420" name="d1/n2"/>
</testCaseConfig>
ERROR in xml3!!!
Traceback (most recent call last):
File "C:\Users\Paddy3118\Google Drive\Code\elementtree_error.py", line 30, in <module>
root3 = ET.fromstring(xml3)
File "C:\Anaconda3\envs\Py3.5\lib\xml\etree\ElementTree.py", line 1333, in XML
parser.feed(text)
xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 3, column 17
I searched and found this Q that pointed to other resources that I read.
It seems that the '?' makes it a processing instruction whose tag name can include colons. Without the '?' then a colon in a name indicates namespace and one of the answers stated that defining the namespace should make things work.
Combining '?' and ':' though causes issues with ElementTree.
I am given xml files of this type that are used by other tools that do parse it OK and want to process the files myself using Python. Any ideas?
Thanks.
According to the W3C Extensible Markup Language 1.0 Specifications under Common Syntactic Constructs:
The Namespaces in XML Recommendation [XML Names] assigns a meaning to
names containing colon characters. Therefore, authors should not use
the colon in XML names except for namespace purposes, but XML
processors must accept the colon as a name character.
And further in the W3C XPath 1.0 note on Processing Instruction nodes:
A processing instruction has an expanded-name: the local part is the
processing instruction's target; the namespace URI is null.
Altogether, <?LazyComment:Blah de blah/?> is an invalid processing instruction, because colons are used to reference namespace URIs, and for processing instructions that part is null or empty. The Namespaces in XML Recommendation says so directly: in a namespace-well-formed document, no processing instruction target contains a colon. Python's XML processor therefore complains that a document using such an instruction is not well-formed.
Also, reconsider the tools that generate such processing instructions, as they are not producing valid XML documents. Possibly, such tools treat XML files as plain text (similar to the way you were able to edit the string representation of the XML, whereas you would not have been able to append such an instruction using etree).
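If the files cannot be regenerated, one pragmatic workaround is to rewrite the offending processing-instruction targets as text before parsing. The sketch below is an assumption-laden illustration, not something the specifications or the answer above endorse: the helper name `parse_with_colon_pis` is made up, colons are replaced with underscores arbitrarily, and the regular expression works on raw text, so it would also touch any `<?` sequence inside comments or CDATA sections.

```python
import re
import xml.etree.ElementTree as ET

# Matches "<?" followed by the processing-instruction target.
PI_TARGET = re.compile(r'<\?([^\s?>]+)')

def parse_with_colon_pis(xml_text):
    """Replace colons in PI targets with underscores, then parse."""
    cleaned = PI_TARGET.sub(
        lambda m: '<?' + m.group(1).replace(':', '_'), xml_text)
    return ET.fromstring(cleaned)

xml2 = '''<?xml version="1.0" encoding="UTF-8"?>
<testCaseConfig>
<?LazyComment:Blah de blah/?>
<testCase runLimit="420" name="d1/n1"/>
<testCase runLimit="420" name="d1/n2"/>
</testCaseConfig>'''

root = parse_with_colon_pis(xml2)
names = [case.get('name') for case in root.iter('testCase')]
print(names)
```

The XML declaration is left alone because its target, `xml`, contains no colon.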
<?xml version="1.0" encoding="UTF-8"?>
<testCaseConfig xmlns:Blah="http://www.w3.org/TR/html4/">
<?LazyComment:Blah de blah/?>
<testCase runLimit="420" name="d1/n1"/>
<testCase runLimit="420" name="d1/n2"/>
</testCaseConfig xmlns:Blah="http://www.w3.org/TR/html4/">
This is invalid XML: you can't have attributes in the closing tag. The last line should be just </testCaseConfig>
Also, comments are written like this:
<!-- this is a comment -->
I am trying to use Python's xml.etree.ElementTree.parse() function to parse an XML file I created by exporting all of the content from a WordPress blog. However, when I try like so:
import xml.etree.ElementTree as xml
tree = xml.parse('/path/to/file.xml')
I get the following error:
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 1183, in parse
tree.parse(source, parser)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 656, in parse
parser.feed(data)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 1643, in feed
self._raiseerror(v)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 1507, in _raiseerror
raise err
ParseError: unbound prefix: line 189, column 1
Here's what's on line 189 of my XML file:
<atom:link rel="search" type="application/opensearchdescription+xml" href="http://blogname.wordpress.com/osd.xml" title="blog name" />
I've seen many questions about this error coming up with Android development, but I can't tell if and how that applies to my situation. Can anyone help with this?
Apologies to everyone for whom this was stupidly obvious, but it turns out I simply didn't have a namespace definition for "atom" in the document. I'm guessing that "unbound prefix" means that the prefix "atom" wasn't "bound" to a namespace definition?
Anyway, adding said definition has solved the problem. Although it makes me wonder why WordPress exports XML files without proper definitions for all of the namespaces they use...
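The effect can be reproduced on a small document. This sketch assumes the prefix should be bound to the standard Atom namespace, http://www.w3.org/2005/Atom; the `href` value is a placeholder, and the real export may use different attributes on the root element.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"

# "atom:" is used here but never declared, so the parser rejects the document.
unbound = ('<rss version="2.0">'
           '<atom:link rel="search" href="http://example.com/osd.xml"/>'
           '</rss>')

try:
    ET.fromstring(unbound)
    unbound_error = None
except ET.ParseError as err:
    unbound_error = str(err)
print(unbound_error)

# Declaring the prefix on the root element binds it to a namespace URI.
bound = unbound.replace('<rss version="2.0"',
                        '<rss version="2.0" xmlns:atom="%s"' % ATOM_NS, 1)
root = ET.fromstring(bound)
link = root.find("{%s}link" % ATOM_NS)
print(link.get("rel"))
```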
If you remove all the namespace prefixes, it works absolutely fine.
Change
<s:home>USA</s:home>
to
<home>USA</home>
Just in case it helps someone some day, I was also working with a WordPress XML export (WordPress eXtended RSS) file in Python and was getting the same error. In my case, WordPress had included most of the correct namespace definitions. However, the XML had iTunes podcast information as well, and the iTunes namespace declaration was not present.
I fixed it by adding xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" into the RSS declaration block. So this:
<!-- generator="WordPress/4.9.8" created="2018-08-06 03:12" -->
<rss version="2.0"
xmlns:excerpt="http://wordpress.org/export/1.2/excerpt/"
xmlns:content="http://purl.org/rss/1.0/modules/content/"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:wp="http://wordpress.org/export/1.2/"
>
became this:
<!-- generator="WordPress/4.9.8" created="2018-08-06 03:12" -->
<rss version="2.0"
xmlns:excerpt="http://wordpress.org/export/1.2/excerpt/"
xmlns:content="http://purl.org/rss/1.0/modules/content/"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:wp="http://wordpress.org/export/1.2/"
xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd"
>
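To find out which declarations are missing before editing the file by hand, a rough text scan can compare the prefixes used in element names with the prefixes declared via `xmlns:`. This is a heuristic sketch only: the helper name `undeclared_prefixes` is made up, prefixes on attributes are ignored, and tag-like text inside CDATA sections (common in WordPress post content) can produce false positives.

```python
import re

def undeclared_prefixes(xml_text):
    """Return prefixes used in element names but never declared with xmlns:."""
    used = set(re.findall(r'</?([A-Za-z_][\w.-]*):[A-Za-z_]', xml_text))
    declared = set(re.findall(r'xmlns:([A-Za-z_][\w.-]*)\s*=', xml_text))
    return sorted(used - declared)

sample = ('<rss version="2.0" xmlns:wp="http://wordpress.org/export/1.2/">'
          '<wp:author>a</wp:author>'
          '<itunes:author>b</itunes:author>'
          '</rss>')
missing = undeclared_prefixes(sample)
print(missing)
```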