lxml: XMLSyntaxError: Unsupported version '2.0' - python

lxml fails with an exception, when using XML version 2.0.
Test:
import unittest


class TestLXML(unittest.TestCase):
    def test_lxml(self):
        from lxml import etree
        etree.fromstring('<?xml version="2.0" encoding="UTF-8" standalone="no"?><test>test</test>')
Result:
Error
Traceback (most recent call last):
File "/home/viator/coding/esb/mdmesb/packages/smev/core/request/test.py", line 33, in test_lxml
etree.fromstring('<?xml version="2.0" encoding="UTF-8" standalone="no"?><test>test</test>')
File "lxml.etree.pyx", line 3032, in lxml.etree.fromstring (src/lxml/lxml.etree.c:68121)
File "parser.pxi", line 1786, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:102470)
File "parser.pxi", line 1674, in lxml.etree._parseDoc (src/lxml/lxml.etree.c:101299)
File "parser.pxi", line 1074, in lxml.etree._BaseParser._parseDoc (src/lxml/lxml.etree.c:96481)
File "parser.pxi", line 582, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:91290)
File "parser.pxi", line 683, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:92476)
File "parser.pxi", line 622, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:91772)
XMLSyntaxError: Unsupported version '2.0', line 1, column 19
Does lxml not support version 2.0? What can I do?

There is no formal specification for an "XML 2.0": the W3C has published only XML 1.0 and XML 1.1. The closest-sounding document is Canonical XML 2.0 (https://www.w3.org/TR/xml-c14n2/), an informative W3C working group document that describes a canonicalization method rather than a new version of XML, and which explicitly says "The XML Security Working Group has agreed not to progress this Canonical XML 2.0 specification further as a Recommendation". Further write-ups on "XML 2.0" on Wikipedia and Stack Overflow corroborate this.
So, since no formal specification exists, a production-quality, strictly checking library such as lxml cannot be expected to read it.
If your documents are XML 1.1 compatible, just replace the initial "2.0" in the document with "1.1", treating the XML as a string, prior to parsing it. If they are not, you will have to pick another library that works with the informative W3C spec (or craft your own).
Some googling suggests that no Python library supports anything called "XML 2.0". Another option is to document which features you need from XML 2.0, if any, and create an XML pre-processor to handle those.

Related

Entity 'ouml' error while using lxml to parse dblp data

I am trying to parse dblp data(xml format). So far my code is :
# -*- coding: utf-8 -*-
from lxml import etree  # import the lxml library

parser = etree.XMLParser(load_dtd=True)
tree = etree.parse("dblp.xml", parser)
root = tree.getroot()
I tried running the code and I get the following error:
tree = etree.parse("dblp.xml", parser) # Parse the xml with tree structure
File "src/lxml/etree.pyx", line 3426, in lxml.etree.parse
File "src/lxml/parser.pxi", line 1839, in lxml.etree._parseDocument
File "src/lxml/parser.pxi", line 1865, in lxml.etree._parseDocumentFromURL
File "src/lxml/parser.pxi", line 1769, in lxml.etree._parseDocFromFile
File "src/lxml/parser.pxi", line 1162, in lxml.etree._BaseParser._parseDocFromFile
File "src/lxml/parser.pxi", line 600, in lxml.etree._ParserContext._handleParseResultDoc
File "src/lxml/parser.pxi", line 710, in lxml.etree._handleParseResult
File "src/lxml/parser.pxi", line 639, in lxml.etree._raiseParseError
File "dblp.xml", line 70
lxml.etree.XMLSyntaxError: Entity 'ouml' not defined, line 70, column 27
How can I resolve this error?
Note: I have xml and dtd files in same location.
I recently encountered the same issue whilst parsing DBLP's XML database. In my case, I was missing the appropriate .dtd file for my dblp.xml (which provides the necessary information for parsing certain custom entities, including ouml). The top of your file should look something like this:
<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE dblp SYSTEM "dblp-2017-08-29.dtd">
The .dtd file specified on the second line should be located in the same directory as the dblp.xml file that you're attempting to parse. You can download the appropriate .dtd file for your XML file from here: http://dblp.org/xml/release/
$ ls
dblp-2017-08-29.dtd dblp-2018-11-01.xml
Also, given the size of dblp.xml, you may want to use lxml.etree.iterparse to stream the contents of the file instead. Below is some of the code that I used to obtain entries for certain types of publication within the database.
import lxml.etree

fn = 'dblp.xml'
# load_dtd=True lets the parser resolve DTD-defined entities such as &ouml;
for event, elem in lxml.etree.iterparse(fn, load_dtd=True):
    if elem.tag not in ['article', 'inproceedings', 'proceedings']:
        continue
    # findtext returns the child's text, or None if the child is missing
    title = elem.findtext('title')
    year = elem.findtext('year')
    authors = elem.findtext('author')  # first author only
    venue = elem.findtext('venue')
    ...
    elem.clear()  # drop the processed record's content to limit memory use
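To see why the DTD matters, here is a small self-contained sketch. It does not use the real DBLP DTD: the one-line internal DOCTYPE below is a stand-in I made up that declares only ouml, and first_author is a hypothetical helper. The same record fails without the declaration and parses once the entity is defined.

```python
from lxml import etree

# No DTD at all: the parser has no definition for &ouml;
UNDECLARED = b"<dblp><author>J&ouml;rg</author></dblp>"

# Stand-in for the DBLP DTD: an internal subset that declares only ouml
DECLARED = (
    b'<!DOCTYPE dblp [<!ENTITY ouml "&#246;">]>'
    b"<dblp><author>J&ouml;rg</author></dblp>"
)


def first_author(data):
    """Return the author text, or the parser's message if parsing fails."""
    try:
        return etree.fromstring(data).findtext("author")
    except etree.XMLSyntaxError as err:
        return "error: {}".format(err)


print(first_author(UNDECLARED))
print(first_author(DECLARED))
```

With the real file, the external .dtd plus load_dtd=True plays the role of the internal subset used here.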

IOError passing requests Response.content to lxml.etree.parse() [duplicate]

I have the following xml on a webpage -
<entry>
<id>1750</id>
<title>variablename</title>
<source>
com.tidalsoft.webclient.tes.dsp.db.datatypes.Variable
</source>
<tes:variable>
<tes:ownername>ownergroup</tes:ownername>
<tes:productiondate>2015-08-17T00:00:00-0400</tes:productiondate>
<tes:readonly>N</tes:readonly>
<tes:publish>N</tes:publish>
<tes:description>
Decription Here
</tes:description>
<tes:startcalendar>0</tes:startcalendar>
<tes:ownerid>666</tes:ownerid>
<tes:type>1</tes:type>
<tes:lastusermodifiedtime>2015-06-15T15:42:27-0400</tes:lastusermodifiedtime>
<tes:innervalue>\\share\location</tes:innervalue>
<tes:calc>N</tes:calc>
<tes:name>variablename</tes:name>
<tes:startdate>1899-12-30T00:00:00-0500</tes:startdate>
<tes:pub>Y</tes:pub>
<tes:lastvalue>\\share\location</tes:lastvalue>
<tes:id>1750</tes:id>
<tes:startdateasstring>18991230000000</tes:startdateasstring>
<tes:lastchangetime>2015-06-15T15:42:27-0400</tes:lastchangetime>
<tes:clientcachelastchangetime>2015-08-17T09:56:49-0400</tes:clientcachelastchangetime>
</tes:variable>
</entry>
I'm trying to parse this data. I fetch it with requests:
r = requests.get(url, auth=('username', 'password'))
but when I try to parse the content I get errors.
>>> xmlObject = etree.parse(r.content)
Traceback (most recent call last):
File "apiTest.py", line 46, in <module>
xmlObject = etree.parse(r.content)
File "lxml.etree.pyx", line 3310, in lxml.etree.parse (src\lxml\lxml.etree.c:72517)
File "parser.pxi", line 1791, in lxml.etree._parseDocument (src\lxml\lxml.etree.c:105979)
File "parser.pxi", line 1817, in lxml.etree._parseDocumentFromURL (src\lxml\lxml.etree.c:106278)
File "parser.pxi", line 1721, in lxml.etree._parseDocFromFile (src\lxml\lxml.etree.c:105277)
File "parser.pxi", line 1122, in lxml.etree._BaseParser._parseDocFromFile (src\lxml\lxml.etree.c:100227)
File "parser.pxi", line 580, in lxml.etree._ParserContext._handleParseResultDoc (src\lxml\lxml.etree.c:94350)
File "parser.pxi", line 690, in lxml.etree._handleParseResult (src\lxml\lxml.etree.c:95786)
File "parser.pxi", line 618, in lxml.etree._raiseParseError (src\lxml\lxml.etree.c:94818)
IOError: Error reading file ''
On the last line, what appears between the quotes is the XML shown at the beginning, as a string:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?><entry xmlns="http://purl.org/atom/ns#"><id>1750</id><title>....
The data is being provided as content-type: text/xml
etree.parse expects a filename, a file-like object, or a URL as its first argument (see help(etree.parse)). It does not expect an XML string. To parse an XML string use
xmlObject = etree.fromstring(r.content)
Note that etree.fromstring returns an lxml.etree._Element. In contrast, etree.parse returns an lxml.etree._ElementTree. Given the _Element, you can obtain the _ElementTree with the getroottree method:
xmlTree = xmlObject.getroottree()
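As a quick illustration of the difference, here is a sketch in which the bytes literal content stands in for r.content (trimmed to two fields, an assumption for the example). If you do want to use etree.parse, wrapping the bytes in io.BytesIO gives it the file-like object it expects.

```python
import io

from lxml import etree

# Stand-in for r.content: a trimmed version of the response body
content = (
    b'<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
    b'<entry xmlns="http://purl.org/atom/ns#">'
    b"<id>1750</id><title>variablename</title></entry>"
)

root = etree.fromstring(content)         # an _Element
tree = etree.parse(io.BytesIO(content))  # an _ElementTree

# The document declares a default namespace, so lookups must use it
ns = {"atom": "http://purl.org/atom/ns#"}
entry_id = root.findtext("atom:id", namespaces=ns)
print(entry_id)
```

Note the namespace mapping: without it, root.findtext("id") returns None for this document.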

python html parsing fails in document with javascript

I'm trying to use Python to parse HTML (although strictly speaking, the server claims it's XHTML) and every parser I have tried (ElementTree, minidom, and lxml) fails. When I go to look at where the problem is, it's inside a script tag:
<script type="text/javascript">
... // some javascript code
if (condition1 && condition2) { // croaks on this line
I see what the problem is: the ampersand should be quoted. But this is inside a JavaScript script tag, so it cannot be quoted, because that would break the code.
What's going on here? How is inline javascript able to break my parse, and what can I do about it?
Update: per request, here is the code used with lxml.
>>> from lxml import etree
>>> tree=etree.parse("http://192.168.1.185/site.html")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "lxml.etree.pyx", line 3299, in lxml.etree.parse (src/lxml/lxml.etree.c:72655)
File "parser.pxi", line 1791, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:106263)
File "parser.pxi", line 1817, in lxml.etree._parseDocumentFromURL (src/lxml/lxml.etree.c:106564)
File "parser.pxi", line 1721, in lxml.etree._parseDocFromFile (src/lxml/lxml.etree.c:105561)
File "parser.pxi", line 1122, in lxml.etree._BaseParser._parseDocFromFile (src/lxml/lxml.etree.c:100456)
File "parser.pxi", line 580, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:94543)
File "parser.pxi", line 690, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:96003)
File "parser.pxi", line 620, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:95050)
lxml.etree.XMLSyntaxError: xmlParseEntityRef: no name, line 77, column 22
The lxml manual starts Chapter 9 by stating "lxml provides a very simple and powerful API for parsing XML and HTML" so I would expect to not see that exception.
There are a lot of really crappy ways for HTML parsing to break. Bad HTML is ubiquitous, and both script sections and various templating languages throw monkey wrenches into the works.
But, you also seem to be using XML-oriented parsers for the job, which are stricter and thus much, much more likely to break if not presented with exactly-right, totally valid input. Which most HTML--including most XHTML--manifestly is not.
So, use a parser designed to overlook some of the HTML gotchas:
import lxml.html
d = lxml.html.parse(URL)
That should take you off to the races.
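Here is a small sketch of the difference. The PAGE string is made up to mimic the script from the question, and xml_parse_error is a hypothetical helper; the XML parser rejects the bare && while lxml.html reads the same markup without complaint.

```python
import lxml.html
from lxml import etree

# Made-up page that mimics the script block from the question
PAGE = (
    '<html><head><script type="text/javascript">'
    "if (condition1 && condition2) { run(); }"
    "</script></head><body><p>hello</p></body></html>"
)


def xml_parse_error(markup):
    """Return the XML parser's error message, or None if the markup parses."""
    try:
        etree.fromstring(markup)
    except etree.XMLSyntaxError as err:
        return str(err)
    return None


print(xml_parse_error(PAGE))  # the strict XML parser gives up on "&&"

doc = lxml.html.fromstring(PAGE)  # the HTML parser treats script content as raw text
paragraphs = [p.text_content() for p in doc.iter("p")]
print(paragraphs)
```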

Using Python and lxml to validate XML against an external DTD

I'm trying to validate an XML file against an external DTD referenced in the doctype tag. Specifically:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE en-export SYSTEM "http://xml.evernote.com/pub/evernote-export3.dtd">
...the rest of the document...
I'm using Python 3.3 and the lxml module. From reading http://lxml.de/validation.html#validation-at-parse-time, I've thrown this together:
enexFile = open(sys.argv[2], mode="rb") # sys.argv[2] is the path to an XML file in local storage.
enexParser = etree.XMLParser(dtd_validation=True)
enexTree = etree.parse(enexFile, enexParser)
From what I understand of validation.html, the lxml library should now take care of retrieving the DTD and performing validation. But instead, I get this:
$ ./mapwrangler.py validate notes.enex
Traceback (most recent call last):
File "./mapwrangler.py", line 27, in <module>
enexTree = etree.parse(enexFile, enexParser)
File "lxml.etree.pyx", line 3239, in lxml.etree.parse (src/lxml/lxml.etree.c:69955)
File "parser.pxi", line 1769, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:102257)
File "parser.pxi", line 1789, in lxml.etree._parseFilelikeDocument (src/lxml/lxml.etree.c:102516)
File "parser.pxi", line 1684, in lxml.etree._parseDocFromFilelike (src/lxml/lxml.etree.c:101442)
File "parser.pxi", line 1134, in lxml.etree._BaseParser._parseDocFromFilelike (src/lxml/lxml.etree.c:97069)
File "parser.pxi", line 582, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:91275)
File "parser.pxi", line 683, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:92461)
File "parser.pxi", line 622, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:91757)
lxml.etree.XMLSyntaxError: Validation failed: no DTD found !, line 3, column 43
This surprises me, because if I turn off validation, then the document parses in just fine and I can do print(enexTree.docinfo.doctype) to get
$ ./mapwrangler.py validate notes.enex
<!DOCTYPE en-export SYSTEM "http://xml.evernote.com/pub/evernote-export3.dtd">
So it looks to me like there shouldn't be any problem finding the DTD.
Thanks for your help.
You need to add no_network=False when constructing the parser object. This option is set to True by default.
From the documentation of parser options at http://lxml.de/parsing.html#parsers:
no_network - prevent network access when looking up external documents (on by default)
For a reason I still don't know, my problem was related to where the XML catalog was located on my local file system.
In my case, I use an XML editor that has a tight integration with a component content management system (CCMS, in this case SDL Trisoft 2011 R2). When the editor connects to the CCMS, DTDs, catalog files and a bunch of other files are synced. These files end up on the local file system in:
C:\Users\[username]\AppData\Local\Trisoft\InfoShare Client\[id]\Config\DocTypes\catalog.xml
I could not get that to work. Simply COPYING the whole catalog to another location fixed things, and this works:
import os
from lxml import etree

f = r"path/to/my/file.xml"
# set XML catalog file path
os.environ['XML_CATALOG_FILES'] = r'C:\DATA\Mydoctypes\catalog.xml'
# configure parser
parser = etree.XMLParser(dtd_validation=True, no_network=True)
# validate
try:
    valid = etree.parse(f, parser=parser)
    print("This file is valid against the DTD.")
except etree.XMLSyntaxError as error:
    print("This file is INVALID against the DTD!")
    print(error)
Obviously this is not ideal, but it works.
Could it be something to do with file permissions, or perhaps that good old "file path too long" problem in Windows? I have not tried whether a symbolic link would work.
I am using Windows 7, Python 2.7.11 and lxml 3.6.0.
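If neither network access nor a catalog is an option, another route is to load a local copy of the DTD yourself with etree.DTD and validate after parsing. A minimal sketch: the two-line DTD below is invented for the example and is not the real Evernote DTD, and is_valid is a hypothetical helper.

```python
import io

from lxml import etree

# Invented stand-in DTD; in practice, open your local copy of the real one
DTD_TEXT = "<!ELEMENT en-export (note*)>\n<!ELEMENT note (#PCDATA)>\n"
dtd = etree.DTD(io.StringIO(DTD_TEXT))


def is_valid(xml_bytes):
    """Parse without validation, then check the tree against the loaded DTD."""
    root = etree.fromstring(xml_bytes)
    return dtd.validate(root)


print(is_valid(b"<en-export><note>first</note></en-export>"))
print(is_valid(b"<en-export><bogus/></en-export>"))
```

When validation fails, dtd.error_log holds the reasons.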

Parsing XML exception

I'm new to python, and seriously need help! I have a number of errors I can't figure out. I'm using python 2.7 on a mac. Here is the list of errors:
Traceback (most recent call last):
File "minihiveosc.py", line 378, in <module>
swhive = SWMiniHiveOSC( options.host, options.hport, options.ip, options.port, options.minibees, options.serial, options.baudrate, options.config, [1,options.minibees], options.verbose, options.apimode )
File "minihiveosc.py", line 280, in __init__
self.hive.load_from_file( config )
File "/Users/Puffin/Documents/python/pydon/pydon/pydonhive.py", line 396, in load_from_file
hiveconf = cfgfile.read_file( filename )
File "/Users/Puffin/Documents/python/pydon/pydon/minibeexml.py", line 116, in read_file
tree = ET.parse( filename )
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 1183, in parse
tree.parse(source, parser)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 656, in parse
parser.feed(data)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 1643, in feed
self._raiseerror(v)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 1507, in _raiseerror
raise err
xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 164, column 8
Any chance someone can help me?
Thanks!
What you posted in your question is called a "Traceback", and it shows only one error:
xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 164, column 8
All the lines before it show how Python got there: in the file minihiveosc.py, on line 378, some code was executed (shown in the traceback), which then led to line 280 of the same file, where something else was called, and so on.
Every time Python calls a function the current state is pushed onto the stack to make room for the next context, and when an exception occurs Python can show you this stack to help you diagnose your problem.
In this case, you are feeding the XML parser a document that has an error in it; by the time the parser gets to line 164, column 8, it has found something it didn't expect. You'll need to inspect that document to see what the problem is; it will be around that area.
It is just because your XML file is not well-formed at line 164, column 8. When the parser tries to read that line it raises that error. Have a look at your document to see what it is.
This is one error with a stack trace.
Creating the SWMiniHiveOSC object caused the error when it executed the load_from_file(config) method. The file name comes from options.config. Your XML config file is not well-formed: there is an invalid token at line 164, column 8 in this file. The problem is with the XML file, not the Python code.
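To find the offending spot without counting lines by hand, you can read the position off the exception. A small sketch using only the standard library: the broken config string is made up for the example, and locate_parse_error is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Made-up config with an unquoted attribute value on line 2
BROKEN = "<config>\n  <bee id=1/>\n</config>"


def locate_parse_error(xml_text):
    """Return (line, column, source line) of the first parse error, or None."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        line, column = err.position  # same numbers as in the error message
        return line, column, xml_text.splitlines()[line - 1]
    return None


print(locate_parse_error(BROKEN))
```

For a file on disk, read its text first and pass it in; the reported line is 1-based, which is why the lookup subtracts one.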
