I have some random HTML and I used BeautifulSoup to parse it, but in most cases (>70%) it chokes. I tried using BeautifulSoup 3.0.8 and 3.2.0 (there were some problems with 3.1.0 upwards), but the results are almost the same.
I can recall several HTML parser options available in Python off the top of my head:
BeautifulSoup
lxml
pyquery
I intend to test all of these, but I wanted to know which one, in your tests, comes out as the most forgiving and can even try to parse bad HTML.
They all are. I have yet to come across an HTML page found in the wild that lxml.html couldn't parse. If lxml barfs on the pages you're trying to parse, you can always preprocess them with a few regexps to keep lxml happy.
lxml itself is fairly strict, but lxml.html is a different parser and can deal with very broken HTML. For extremely broken HTML, lxml also ships with lxml.html.soupparser, which interfaces with the BeautifulSoup library.
Some approaches to parsing broken html using lxml.html are described here: http://lxml.de/elementsoup.html
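For illustration, here is a minimal sketch of both approaches; the markup is just a made-up fragment, and the soupparser variant assumes BeautifulSoup is installed:

from lxml import html
from lxml.html import soupparser

broken = '<p>some <b>mixed up & broken <p>markup'
tree = html.fromstring(broken)             # libxml2's forgiving HTML parser
soup_tree = soupparser.fromstring(broken)  # same element API, but BeautifulSoup does the parsing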
With pages that don't work with anything else (those that contain nested <form> elements come to mind), I've had success with MinimalSoup and ICantBelieveItsBeautifulSoup. Each can handle certain types of error that the other one can't, so often you'll need to try both.
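As a rough sketch of trying both (these are class names from the Python 2-era BeautifulSoup 3 package, so this assumes that old module is installed):

from BeautifulSoup import MinimalSoup, ICantBelieveItsBeautifulSoup

soup = MinimalSoup(html)                  # makes minimal assumptions about nesting
alt = ICantBelieveItsBeautifulSoup(html)  # different heuristics; compare the two trees and keep the better one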
I ended up using BeautifulSoup 4.0 with html5lib for parsing; it is much more forgiving. With some modifications to my code it's now working considerably well. Thanks all for the suggestions.
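For reference, selecting html5lib explicitly looks like this (html5lib has to be installed separately):

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html5lib")  # slow, but parses broken markup the way a browser would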
If BeautifulSoup doesn't fix your HTML problem, the next best solution would be regular expressions. lxml, elementtree and minidom are very strict in their parsing, and actually that is the right behaviour.
Other tips:
I feed the HTML to the lynx browser from the command line, take the resulting text version of the page/content, and parse it with regex (see the sketch after these tips).
Converting HTML to text or HTML to Markdown strips all the HTML tags and leaves you with plain text. That is easy to parse.
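A rough sketch of the lynx pipeline, assuming the lynx binary is on your PATH; the URL and the regex are just placeholders:

import re
import subprocess

# -dump prints the rendered text of the page; -nolist drops the trailing list of link URLs
text = subprocess.check_output(
    ["lynx", "-dump", "-nolist", "http://example.com"]).decode("utf-8", "replace")
prices = re.findall(r"Price:\s*(\S+)", text)  # placeholder pattern; adapt it to your content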
Related
I have data that looks like it is part of an HTML document. However, there are some bugs in it, such as
<td class= foo"bar">
on which all the parsers I tried (lxml, xml.etree) fail with an error.
Since I don't actually care about this specific part of the document I am looking for a more robust parser.
Something where I can allow errors in specific subtrees to be ignored (and maybe just not insert those nodes), or something that will only lazily parse the parts of the tree I am traversing, for example.
You are using XML parsers. XML is a strict language, while the HTML standard requires parsers to be tolerant of errors.
Use a compliant HTML parser like lxml.html or html5lib, or the wrapper library BeautifulSoup (which uses either of the previous with a cleaner API). html5lib is slower but closely mimics how a modern browser treats errors.
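As a quick sketch, lxml.html simply recovers from the malformed attribute in the question; exactly how the bad attribute is repaired can vary between parser versions, but the surrounding content survives:

import lxml.html

doc = lxml.html.fromstring('<table><tr><td class= foo"bar">cell</td></tr></table>')
print(doc.findtext('.//td'))  # the parser recovers and the cell text is still there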
Use lxml:
Create an HTML parser with the recover option set to True:
from io import StringIO
from lxml import etree
parser = etree.HTMLParser(recover=True)  # recover=True tells libxml2 to keep going past errors
tree = etree.parse(StringIO(broken_html), parser)
See the tutorial Parsing XML and HTML with lxml.
I've read many good things about BeautifulSoup, that's why I'm trying to use it currently to scrape a set of websites with badly formed HTML.
Unfortunately, there's one behaviour of BeautifulSoup that is currently pretty much a showstopper:
It seems that when BeautifulSoup encounters a closing tag (in my case </p>) that was never opened, it decides to end the document instead.
Also, the find method seems not to search the content after the (self-induced) </html> tag in this case. This means that when the block I'm interested in happens to be behind a spurious closing tag, I can't access its contents.
Is there a way I can configure BeautifulSoup to ignore unmatched closing tags rather than closing the document when they are encountered?
BeautifulSoup doesn't do any parsing itself; it uses the output of a dedicated parser (lxml, html.parser or html5lib).
Pick a different parser if the one you are using right now doesn't handle broken HTML the way you want. lxml is the faster parser and can handle broken HTML quite well; html5lib comes closest to how your browser would parse broken HTML, but is a lot slower.
Also see Installing a parser in the BeautifulSoup documentation, as well as the Differences between parsers section.
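As a rough illustration of how the parser choice changes what survives a stray closing tag (lxml and html5lib have to be installed for this to run, and the exact output depends on their versions):

from bs4 import BeautifulSoup

broken = "<p>intro</p></p><div>content after the stray tag</div>"
for name in ("html.parser", "lxml", "html5lib"):
    soup = BeautifulSoup(broken, name)
    print(name, soup.find("div"))  # some parsers keep the <div>, others may drop or move it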
I'm using BeautifulSoup 4 under Anaconda's distribution as bs4. Correct me if I'm wrong, but my understanding is that BeautifulSoup is a lib for transforming ill-formed HTML into well-formed HTML. However, when I pass HTML to its constructor, I lose more than half of its characters. Shouldn't it only be fixing the HTML, not cleaning it? This is not well described in the docs.
This is the code:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html)
where html is the HTML of Google's homepage.
Edit:
Could it come from the way I'm retrieving the HTML string via str(soup)?
First of all, make sure you actually see this "missing" content in the HTML that goes into BeautifulSoup for parsing. It could be that the problem is not in how BeautifulSoup parses the HTML, but in how you are retrieving the HTML data in the first place.
I suspect you are downloading the Google homepage via urllib2 or requests and comparing what you see inside str(soup) with what you see in a real browser. If that is the case, you cannot compare the two: neither urllib2 nor requests is a browser; they cannot execute javascript, manipulate the DOM after the page loads, or make asynchronous requests. What you get with urllib2 or requests is basically the initial HTML page "without the dynamic part".
If the problem is still in how BeautifulSoup parses the HTML...
As clearly stated in the docs, the behavior depends on which parser BeautifulSoup chooses to use under the hood:
There are also differences between HTML parsers. If you give Beautiful Soup a perfectly-formed HTML document, these differences won’t matter. One parser will be faster than another, but they’ll all give you a data structure that looks exactly like the original HTML document. But if the document is not perfectly-formed, different parsers will give different results.
See Installing a parser and Specifying the parser to use.
Since you don't specify a parser explicitly, the following rule is applied:
If you don’t specify anything, you’ll get the best HTML parser that’s installed. Beautiful Soup ranks lxml’s parser as being the best, then html5lib’s, then Python’s built-in parser.
See also Differences between parsers.
In other words, try to approach the problem using different parsers and see how the result would differ:
soup = BeautifulSoup(html, 'lxml')         # fast, lenient external parser
soup = BeautifulSoup(html, 'html5lib')     # slowest, but browser-like error handling
soup = BeautifulSoup(html, 'html.parser')  # Python's built-in parser
I'm currently using urllib2 and BeautifulSoup to open and parse HTML data. However, I've run into a problem with a site that uses javascript to load the images after the page has been rendered (I'm trying to find the image source for a certain image on the page).
I'm thinking Twill could be a solution, and am trying to open the page and use a regular expression with 'find' to return the HTML string I'm looking for. I'm having some trouble getting this to work, though, and can't seem to find any documentation or examples on how to use regular expressions with Twill.
Any help or advice on how to do this or solve this problem in general would be much appreciated.
I'd rather use CSS selectors or "real" regexps on the page source. Twill is, AFAIK, no longer being worked on. Have you tried BS or PyQuery with CSS selectors?
Twill does not work with javascript (see http://twill.idyll.org/browsing.html).
Use webdriver if you want to handle javascript.
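A minimal sketch with Selenium's webdriver (modern Selenium API; assumes Firefox plus geckodriver are installed, and the URL and selector are placeholders):

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()                               # needs geckodriver on your PATH
driver.get("http://example.com/js-heavy-page")             # placeholder URL
img = driver.find_element(By.CSS_SELECTOR, "img#target")   # hypothetical selector
print(img.get_attribute("src"))                            # src is populated after the javascript has run
driver.quit()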
I'm looking for a good HTML parser like HtmlAgilityPack (an open-source .NET project: http://www.codeplex.com/htmlagilitypack), but for use with Python.
Does anyone know of one?
Use Beautiful Soup like everyone does.
Others have recommended BeautifulSoup, but it's much better to use lxml. Despite its name, it is also for parsing and scraping HTML. It's much, much faster than BeautifulSoup, and it even handles "broken" HTML better than BeautifulSoup (their claim to fame). It has a compatibility API for BeautifulSoup too if you don't want to learn the lxml API.
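For a taste of the lxml API (the URL is a placeholder, and CSS selectors need the separate cssselect package):

import lxml.html

doc = lxml.html.parse("http://example.com").getroot()  # parse() accepts URLs, file paths and file objects
links = doc.xpath("//a/@href")                         # XPath works out of the box
headings = doc.cssselect("h1")                         # requires: pip install cssselect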
Ian Bicking agrees.
There's no reason to use BeautifulSoup anymore, unless you're on Google App Engine or something where anything not purely Python isn't allowed.
Beautiful Soup is probably what you are looking for. It is an HTML/XML parser that can deal with invalid pages and lets you, for example, iterate over specific tags.
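A minimal sketch of that kind of iteration; the tag name and attribute are just examples:

from bs4 import BeautifulSoup

soup = BeautifulSoup(page_html, "html.parser")  # page_html: the markup you fetched
for a in soup.find_all("a", href=True):         # iterate over every <a> that has an href
    print(a["href"])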