How to parse broken HTML with LXML


I'm trying to parse broken HTML with the lxml parser on Python 2.5 and 2.7.
Unlike what the lxml documentation shows (http://lxml.de/parsing.html#parsing-html), parsing broken HTML does not work:
from lxml import etree
import StringIO
broken_html = "<html><head><title>test<body><h1>page title</h3>"
parser = etree.HTMLParser()
tree = etree.parse(StringIO.StringIO(broken_html))
Result:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "lxml.etree.pyx", line 2954, in lxml.etree.parse (src/lxml/lxml.etree.c:56220)
File "parser.pxi", line 1550, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:82482)
File "parser.pxi", line 1578, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:82764)
File "parser.pxi", line 1457, in lxml.etree._parseDoc (src/lxml/lxml.etree.c:81562)
File "parser.pxi", line 965, in lxml.etree._BaseParser._parseDoc (src/lxml/lxml.etree.c:78232)
File "parser.pxi", line 569, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:74488)
File "parser.pxi", line 650, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:75379)
File "parser.pxi", line 590, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:74712)
lxml.etree.XMLSyntaxError: Opening and ending tag mismatch: h1 line 1 and h3, line 1, column 50

Don't just construct that parser, use it (as per the example you link to):
>>> tree = etree.parse(StringIO.StringIO(broken_html), parser=parser)
>>> tree
<lxml.etree._ElementTree object at 0x2fd8e60>
Or use lxml.html as a shortcut:
>>> from lxml import html
>>> broken_html = "<html><head><title>test<body><h1>page title</h3>"
>>> html.fromstring(broken_html)
<Element html at 0x2dde650>
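As a rough Python 3 sketch of the fix above (the original session is Python 2, so `io.StringIO` replacing the `StringIO` module is an assumption about your environment), the parser is handed to etree.parse explicitly and the repaired tree is inspected:

```python
from io import StringIO

from lxml import etree

broken_html = "<html><head><title>test<body><h1>page title</h3>"

# Construct the HTML parser AND pass it to parse(); without the second
# argument, etree.parse() falls back to the strict XML parser.
parser = etree.HTMLParser()
tree = etree.parse(StringIO(broken_html), parser)
root = tree.getroot()

# Look at what the parser recovered from the broken input.
title = root.findtext(".//title")
heading = root.findtext(".//h1")
print(etree.tostring(root, method="html", encoding="unicode"))
```

The variable names `title` and `heading` are only illustrative.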

lxml lets you load broken markup by creating a parser instance with recover=True:
etree.HTMLParser(recover=True)
Use the same option when you create the parser in your own code, and pass that parser to etree.parse.
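A minimal sketch of the recover option applied to broken XML rather than HTML. It assumes etree.XMLParser (which also accepts recover=True); the sample document and variable names are made up for illustration:

```python
from lxml import etree

# Hypothetical broken XML: the second <item> is never closed.
broken_xml = "<root><item>one</item><item>two</root>"

# The default (strict) XML parser rejects the document.
try:
    etree.fromstring(broken_xml)
    strict_failed = False
except etree.XMLSyntaxError:
    strict_failed = True

# A recovering parser keeps whatever it can make sense of.
parser = etree.XMLParser(recover=True)
root = etree.fromstring(broken_xml, parser)
items = [el.text for el in root.iter("item")]
print(strict_failed, root.tag, items)
```

What a recovering parser salvages depends on where the damage is, so check the resulting tree before relying on it.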

You might try using lxml.html instead:
>>> import lxml.html
>>> broken_html = "<html><head><title>test<body><h1>page title</h3>"
>>> root = lxml.html.fromstring(broken_html)
>>> lxml.html.tostring(root)
'<html><head><title>test</title></head><body><h1>page title</h1></body></html>'

Related

Entity 'ouml' error while using lxml to parse dblp data

I am trying to parse DBLP data (XML format). So far my code is:
#-*-coding:utf-8-*-
from lxml import etree # lxml import library
parser = etree.XMLParser (load_dtd =True)
Tree = etree.parse( "dblp.xml" ,parser)
Root = Tree.getroot()
I tried running the code and I get the following error:
Tree = etree.parse( "dblp.xml" ,parser) # Parse the xml with tree structure
File "src/lxml/etree.pyx", line 3426, in lxml.etree.parse
File "src/lxml/parser.pxi", line 1839, in lxml.etree._parseDocument
File "src/lxml/parser.pxi", line 1865, in lxml.etree._parseDocumentFromURL
File "src/lxml/parser.pxi", line 1769, in lxml.etree._parseDocFromFile
File "src/lxml/parser.pxi", line 1162, in lxml.etree._BaseParser._parseDocFromFile
File "src/lxml/parser.pxi", line 600, in lxml.etree._ParserContext._handleParseResultDoc
File "src/lxml/parser.pxi", line 710, in lxml.etree._handleParseResult
File "src/lxml/parser.pxi", line 639, in lxml.etree._raiseParseError
File "dblp.xml", line 70
lxml.etree.XMLSyntaxError: Entity 'ouml' not defined, line 70, column 27
How can I resolve this error?
Note: I have xml and dtd files in same location.

I recently encountered the same issue whilst parsing DBLP's XML database. In my case, I was missing the appropriate .dtd file for my dblp.xml (which provides the necessary information for parsing certain custom entities, including ouml). The top of your file should look something like this:
<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE dblp SYSTEM "dblp-2017-08-29.dtd">
The .dtd file specified on the second line should be located in the same directory as the dblp.xml file that you're attempting to parse. You can download the appropriate .dtd file for your XML file from here: http://dblp.org/xml/release/
$ ls
dblp-2017-08-29.dtd dblp-2018-11-01.xml
Also, given the size of dblp.xml, you may want to use lxml.etree.iterparse to stream the contents of the file instead. Below is some of the code that I used to obtain entries for certain types of publication within the database.
import lxml.etree

fn = 'dblp.xml'
for event, elem in lxml.etree.iterparse(fn, load_dtd=True):
    # Skip everything except the publication types of interest.
    if elem.tag not in ['article', 'inproceedings', 'proceedings']:
        continue
    # find() returns the first matching child element, or None if absent;
    # read its .text attribute to get the string value.
    title = elem.find('title')
    year = elem.find('year')
    authors = elem.find('author')
    venue = elem.find('venue')
    ...
    # Release the element's content to keep memory usage down.
    elem.clear()
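To see the DTD mechanism in isolation, here is a small self-contained sketch. The file names mini.dtd and mini.xml, the sample record, and the helper first_authors are all made up for illustration; the real DBLP DTD is far larger. It writes a one-line DTD that defines ouml next to a tiny XML file and streams it with iterparse:

```python
import os
import tempfile

from lxml import etree

# Hypothetical stand-ins for the DBLP DTD and data file.
MINI_DTD = '<!ENTITY ouml "&#246;">\n'
MINI_XML = (
    '<?xml version="1.0" encoding="ISO-8859-1"?>\n'
    '<!DOCTYPE dblp SYSTEM "mini.dtd">\n'
    '<dblp>\n'
    '<article><author>J&ouml;rg Example</author><title>A title</title></article>\n'
    '<www><title>Home page</title></www>\n'
    '</dblp>\n'
)


def first_authors(xml_path):
    """Stream the file and collect the first author of every article."""
    authors = []
    # load_dtd=True makes the parser read the DTD named in the DOCTYPE,
    # which is where the ouml entity is defined.
    for _event, elem in etree.iterparse(xml_path, load_dtd=True, tag='article'):
        author = elem.find('author')
        authors.append(author.text if author is not None else None)
        elem.clear()  # free the content of the processed element
    return authors


with tempfile.TemporaryDirectory() as workdir:
    # The DTD must sit in the same directory as the XML file.
    with open(os.path.join(workdir, 'mini.dtd'), 'w', encoding='ascii') as fh:
        fh.write(MINI_DTD)
    xml_path = os.path.join(workdir, 'mini.xml')
    with open(xml_path, 'w', encoding='ascii') as fh:
        fh.write(MINI_XML)
    authors = first_authors(xml_path)

print(authors)
```

Passing tag='article' lets iterparse do the filtering, which is an alternative to the explicit elem.tag check used above.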

IOError passing requests Response.content to lxml.etree.parse() [duplicate]

I have the following xml on a webpage -
<entry>
<id>1750</id>
<title>variablename</title>
<source>
com.tidalsoft.webclient.tes.dsp.db.datatypes.Variable
</source>
<tes:variable>
<tes:ownername>ownergroup</tes:ownername>
<tes:productiondate>2015-08-17T00:00:00-0400</tes:productiondate>
<tes:readonly>N</tes:readonly>
<tes:publish>N</tes:publish>
<tes:description>
Decription Here
</tes:description>
<tes:startcalendar>0</tes:startcalendar>
<tes:ownerid>666</tes:ownerid>
<tes:type>1</tes:type>
<tes:lastusermodifiedtime>2015-06-15T15:42:27-0400</tes:lastusermodifiedtime>
<tes:innervalue>\\share\location</tes:innervalue>
<tes:calc>N</tes:calc>
<tes:name>variablename</tes:name>
<tes:startdate>1899-12-30T00:00:00-0500</tes:startdate>
<tes:pub>Y</tes:pub>
<tes:lastvalue>\\share\location</tes:lastvalue>
<tes:id>1750</tes:id>
<tes:startdateasstring>18991230000000</tes:startdateasstring>
<tes:lastchangetime>2015-06-15T15:42:27-0400</tes:lastchangetime>
<tes:clientcachelastchangetime>2015-08-17T09:56:49-0400</tes:clientcachelastchangetime>
</tes:variable>
</entry>
I'm trying to parse this data. I fetch it with requests:
r = requests.get(url, auth=('username', 'password'))
But when I try to parse the content, I get an error:
>>> xmlObject = etree.parse(r.content)
Traceback (most recent call last):
File "apiTest.py", line 46, in <module>
xmlObject = etree.parse(r.content)
File "lxml.etree.pyx", line 3310, in lxml.etree.parse (src\lxml\lxml.etree.c:72517)
File "parser.pxi", line 1791, in lxml.etree._parseDocument (src\lxml\lxml.etree.c:105979)
File "parser.pxi", line 1817, in lxml.etree._parseDocumentFromURL (src\lxml\lxml.etree.c:106278)
File "parser.pxi", line 1721, in lxml.etree._parseDocFromFile (src\lxml\lxml.etree.c:105277)
File "parser.pxi", line 1122, in lxml.etree._BaseParser._parseDocFromFile (src\lxml\lxml.etree.c:100227)
File "parser.pxi", line 580, in lxml.etree._ParserContext._handleParseResultDoc (src\lxml\lxml.etree.c:94350)
File "parser.pxi", line 690, in lxml.etree._handleParseResult (src\lxml\lxml.etree.c:95786)
File "parser.pxi", line 618, in lxml.etree._raiseParseError (src\lxml\lxml.etree.c:94818)
IOError: Error reading file ''
In the last line, the text between the quotes is the XML shown at the beginning, as a string:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?><entry xmlns="http://purl.org/atom/ns#"><id>1750</id><title>....
The data is being provided as content-type: text/xml

etree.parse expects a filename, a file-like object, or a URL as its first argument (see help(etree.parse)). It does not expect an XML string. To parse an XML string use
xmlObject = etree.fromstring(r.content)
Note that etree.fromstring returns a lxml.etree._Element. In contrast, etree.parse returns a lxml.etree._ElementTree. Given the _Element, you can obtain the _ElementTree with the getroottree method:
xmlTree = xmlObject.getroottree()
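A short sketch of both routes, with a hard-coded byte string standing in for r.content (the payload is a shortened, made-up version of the entry in the question). Wrapping the bytes in io.BytesIO is one way to keep using etree.parse, since it then receives a file-like object:

```python
from io import BytesIO

from lxml import etree

# Stand-in for r.content: the raw bytes of the HTTP response body.
content = (
    b'<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
    b'<entry xmlns="http://purl.org/atom/ns#">'
    b'<id>1750</id><title>variablename</title></entry>'
)

# Route 1: parse the bytes directly; the result is an _Element.
root = etree.fromstring(content)

# Route 2: wrap the bytes in a file-like object; the result is an _ElementTree.
tree = etree.parse(BytesIO(content))

# The document uses a default namespace, so lookups need a prefix mapping.
ns = {'atom': 'http://purl.org/atom/ns#'}
entry_id = root.findtext('atom:id', namespaces=ns)
print(entry_id, tree.getroot().tag)
```

Passing bytes rather than r.text avoids problems with the encoding declaration in the XML prolog.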

Parser.pxi problems when pretty printing xml file

I'm getting errors when I try to pretty-print an XML file. I've looked everywhere and tried installing the latest version of lxml, but I still get this error.
My script is pretty simple. It looks like this:
import os
import lxml.etree as etree
from lxml.etree import parse
fname = r'C:\Test_folder\SlutR_20150218.xml'  # raw string so backslashes are never treated as escapes
x = etree.parse(fname)
print etree.tostring(x, pretty_print = True)
The error I'm getting is the following:
Traceback (most recent call last):
File "C:\Users\a.curcic\Desktop\Övriga_python_script\Pretty_print_example.py", line 5, in <module>
x = etree.parse(fname)
File "lxml.etree.pyx", line 3301, in lxml.etree.parse (src\lxml\lxml.etree.c:72453)
File "parser.pxi", line 1791, in lxml.etree._parseDocument (src\lxml\lxml.etree.c:105915)
File "parser.pxi", line 1817, in lxml.etree._parseDocumentFromURL (src\lxml\lxml.etree.c:106214)
XMLSyntaxError: Extra content at the end of the document, line 2, column 909

python html parsing fails in document with javascript

I'm trying to use Python to parse HTML (although strictly speaking, the server claims it's XHTML), and every parser I have tried (ElementTree, minidom, and lxml) fails. When I look at where the problem is, it's inside a script tag:
<script type="text/javascript">
... // some javascript code
if (condition1 && condition2) { // croaks on this line
I see what the problem is: the ampersand should be escaped. But it is inside a JavaScript script tag, so it cannot be escaped, because that would break the code.
What's going on here? How is inline javascript able to break my parse, and what can I do about it?
Update: per request, here is the code used with lxml.
>>> from lxml import etree
>>> tree=etree.parse("http://192.168.1.185/site.html")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "lxml.etree.pyx", line 3299, in lxml.etree.parse (src/lxml/lxml.etree.c:72655)
File "parser.pxi", line 1791, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:106263)
File "parser.pxi", line 1817, in lxml.etree._parseDocumentFromURL (src/lxml/lxml.etree.c:106564)
File "parser.pxi", line 1721, in lxml.etree._parseDocFromFile (src/lxml/lxml.etree.c:105561)
File "parser.pxi", line 1122, in lxml.etree._BaseParser._parseDocFromFile (src/lxml/lxml.etree.c:100456)
File "parser.pxi", line 580, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:94543)
File "parser.pxi", line 690, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:96003)
File "parser.pxi", line 620, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:95050)
lxml.etree.XMLSyntaxError: xmlParseEntityRef: no name, line 77, column 22
The lxml manual starts Chapter 9 by stating "lxml provides a very simple and powerful API for parsing XML and HTML" so I would expect to not see that exception.

There are a lot of really crappy ways for HTML parsing to break. Bad HTML is ubiquitous, and both script sections and various templating languages throw monkey wrenches into the works.
But, you also seem to be using XML-oriented parsers for the job, which are stricter and thus much, much more likely to break if not presented with exactly-right, totally valid input. Which most HTML--including most XHTML--manifestly is not.
So, use a parser designed to overlook some of the HTML gotchas:
import lxml.html
d = lxml.html.parse(URL)
That should take you off to the races.
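To illustrate the difference, here is a small sketch with a made-up page containing an unescaped && inside a script tag. The strict XML parser rejects it, while lxml.html reads it without complaint:

```python
import lxml.html
from lxml import etree

# Hypothetical page with a raw "&&" in inline JavaScript.
page = (
    '<html><head><script type="text/javascript">'
    'if (a && b) { run(); }'
    '</script></head><body><p>ok</p></body></html>'
)

# The XML parser treats "&" as the start of an entity reference and fails.
try:
    etree.fromstring(page)
    xml_ok = True
except etree.XMLSyntaxError:
    xml_ok = False

# The HTML parser knows that script content is not ordinary markup.
doc = lxml.html.fromstring(page)
script_text = doc.findtext('.//script')
paragraph = doc.findtext('.//p')
print(xml_ok, script_text, paragraph)
```

Here lxml.html.fromstring is used on a string so the example runs offline; lxml.html.parse plays the same role for a URL or file.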

how to extract reviews from iframeurl returned by amazon api in python?

I am trying to get the text content of the reviews of a given product on Amazon using its API, but I can't work it out.
Here is what I have:
result = api.item_lookup('B00062B6QY', ResponseGroup='Reviews',
TruncateReviewsAt=256, IncludeReviewsSummary=False)
iframeurl=result.xpath('//*[local-name()="IFrameURL"]/text()')[0].strip()
print iframeurl
reviews=requests.get(iframeurl)
reviews.raise_for_status()
#data = json.loads(reviews.text)
root = ET.fromstring(reviews.text)
print root
The output is:
http://www.amazon.com/reviews/iframe?akid=helloworld&alinkCode=xm2&asin=B00062B6QY&atag=welcomehome-20&exp=2014-01-28T19%3A06%3A20Z&summary=0&truncate=256&v=2&sig=HIDDEN%3D
Traceback (most recent call last):
File "amazon_api_new.py", line 36, in <module>
root = ET.fromstring(reviews.text)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 1300, in XML
parser.feed(text)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 1642, in feed
self._raiseerror(v)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 1506, in _raiseerror
raise err
xml.etree.ElementTree.ParseError: mismatched tag: line 867, column 2
PS: I have changed the iframeurl printed out just to clear the api key details

Instead of using ElementTree, try loading reviews.text into lxml like this:
>>> from io import StringIO
>>> from lxml import etree
>>> parser = etree.HTMLParser()
>>> tree = etree.parse(StringIO(reviews.text), parser)
>>> result = etree.tostring(tree.getroot(),
... pretty_print=True, method="html")
>>> print(result)
...
Of course, you can then use lxml's XPath support for further parsing.
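A sketch of that XPath step, using a made-up HTML snippet in place of reviews.text. The class name "review" is purely an assumption, since the real structure of the Amazon iframe page is not shown here:

```python
from io import StringIO

from lxml import etree

# Hypothetical stand-in for reviews.text; the <p> tags are never closed.
page_text = (
    '<html><body>'
    '<p class="review">Great product'
    '<p class="review">Too loud'
    '</body></html>'
)

parser = etree.HTMLParser()
tree = etree.parse(StringIO(page_text), parser)

# Pull the text of every element carrying the (assumed) "review" class.
review_texts = tree.xpath('//p[@class="review"]/text()')
print(review_texts)
```

Inspect the real page first to find the element and class names that actually wrap each review, then adjust the XPath expression accordingly.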
