How do I parse HTML-like data with errors? - python

I have data that looks like it is part of an HTML document. However, there are some errors in it, like
<td class= foo"bar">
on which all the parsers I tried (lxml, xml.etree) fail with an error.
Since I don't actually care about this specific part of the document I am looking for a more robust parser.
Something where errors in specific subtrees can be ignored (perhaps by just not inserting those nodes), or something that only lazily parses the parts of the tree I am traversing, for example.

You are using XML parsers. XML is a strict language, while the HTML standard requires parsers to be tolerant of errors.
Use a compliant HTML parser such as lxml.html or html5lib, or the wrapper library BeautifulSoup (which drives either of those behind a cleaner API). html5lib is slower but closely mimics how a modern browser treats errors.

Use lxml:
Create an HTML parser with recover set to True:
from io import StringIO
from lxml import etree

parser = etree.HTMLParser(recover=True)  # recover silently from broken markup
tree = etree.parse(StringIO(broken_html), parser)
See the tutorial Parsing XML and HTML with lxml.
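For comparison, a minimal sketch of the BeautifulSoup route (assuming the beautifulsoup4 and html5lib packages are installed); html5lib repairs the tree the way a browser would, so the malformed attribute from the question no longer raises an error:
from bs4 import BeautifulSoup

broken_html = '<table><tr><td class= foo"bar">cell</td></tr></table>'
soup = BeautifulSoup(broken_html, "html5lib")  # tolerant, browser-like parsing
print(soup.find("td").get_text())  # the bad attribute doesn't break anything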

Related

BeautifulSoup: how to ignore spurious end tags

I've read many good things about BeautifulSoup, that's why I'm trying to use it currently to scrape a set of websites with badly formed HTML.
Unfortunately, there's one feature of BeautifulSoup that pretty much is a showstopper currently:
It seems that when BeautifulSoup encounters a closing tag (in my case </p>) that was never opened, it decides to end the document instead.
Also, the find method seems to not search the contents behind the (self-induced) </html> tag in this case. This means that when the block I'm interested in happens to be behind a spurious closing tag, I can't access the contents.
Is there a way I can configure BeautifulSoup to ignore unmatched closing tags rather than closing the document when they are encountered?
BeautifulSoup doesn't do any parsing itself; it uses the output of a dedicated parser (lxml, html.parser, or html5lib).
Pick a different parser if the one you are using right now doesn't handle broken HTML quite the way you want it to. lxml is the fastest parser and can handle broken HTML quite well; html5lib comes closest to how your browser would parse broken HTML, but is a lot slower.
Also see Installing a parser in the BeautifulSoup documentation, as well as the Differences between parsers section.
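To see how the choice plays out, here is a small sketch (assuming lxml and html5lib are installed alongside beautifulsoup4) that runs the same fragment with a spurious closing tag through all three parsers:
from bs4 import BeautifulSoup

markup = "<p>first</p></p><div id='target'>after the stray close tag</div>"
for parser in ("html.parser", "lxml", "html5lib"):
    soup = BeautifulSoup(markup, parser)  # second argument picks the parser
    print(parser, "->", soup.find("div", id="target"))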

Parse html source code into xml tree

I know there are many ways to do this using 3rd party libraries such as resources, pyparsing, selenium, etc., but I'm looking for a quick and dirty way to do it without any 3rd party modules.
Basically what I want to do is take the HTML code from the page source of a webpage and parse it into xml format (probably using xml.etree.ElementTree). I've tried this:
import urllib.request
import xml.etree.ElementTree as ET

data = urllib.request.urlopen(website)
tree = ET.fromstring(data.read())  # read() returns the response bytes
However when I do this I either get mismatched tags or an unknown-symbol error for UTF-8 encoding, even though the page source is definitely UTF-8. I was under the assumption that a functioning HTML page wouldn't have mismatched tags, so I'm thinking there's something I'm missing.
And the whole reason I don't want to use a 3rd party library is because I need to grab a small set of information and don't think it's enough to justify using another module.
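If it really has to be standard library only, a minimal sketch of the usual workaround: skip xml.etree (which requires well-formed XML) and use html.parser, which tolerates mismatched tags. Collecting link hrefs is just an illustrative choice here:
from html.parser import HTMLParser
import urllib.request

class LinkCollector(HTMLParser):
    """Collect href values, tolerating broken markup elsewhere."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(v for k, v in attrs if k == "href")

data = urllib.request.urlopen(website).read()  # 'website' as in the question
collector = LinkCollector()
collector.feed(data.decode("utf-8", errors="replace"))
print(collector.links)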

Extracting parts of HTML from website using python

I'm currently working on a project that involves a program to inspect a web page's HTML using Python. My program has to monitor a web page, and when a change is made to the HTML, it will complete a set of actions. My questions are: how do you extract just part of a web page, and how do you monitor a web page's HTML and report almost instantly when a change is made? Thanks.
In the past I wrote my own parsers. Nowadays HTML is HTML5: more constructs, more JavaScript, and a lot of sloppiness from developers and their editors, like
document.write('<SCR' + 'IPT
And some web frameworks, or developers' bad coding, change the Last-Modified HTTP header on every request, even when the text a human reads on the page hasn't changed.
I suggest BeautifulSoup for the parsing; on your own, you have to choose carefully what to watch in order to decide whether the web page has been modified.
Its intro:
BeautifulSoup is a Python package that parses broken HTML, just like
lxml supports it based on the parser of libxml2. BeautifulSoup uses a
different parsing approach. It is not a real HTML parser but uses
regular expressions to dive through tag soup. It is therefore more
forgiving in some cases and less good in others. It is not uncommon
that lxml/libxml2 parses and fixes broken HTML better, but
BeautifulSoup has superior support for encoding detection. It very
much depends on the input which parser works better.
Scrapy might be a good place to start. http://doc.scrapy.org/en/latest/intro/overview.html
Getting sections of websites is easy; it is just markup, and you can use Scrapy or BeautifulSoup.
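Putting those pieces together, a minimal polling sketch (the URL, CSS selector, and interval are all made-up placeholders): hash only the fragment you care about, so noise like a changing Last-Modified header doesn't produce false positives:
import hashlib
import time
import urllib.request
from bs4 import BeautifulSoup

URL = "https://example.com/page"   # hypothetical page to watch
SELECTOR = "div#content"           # hypothetical fragment of interest

def fragment_hash():
    html = urllib.request.urlopen(URL).read()
    fragment = BeautifulSoup(html, "html.parser").select_one(SELECTOR)
    return hashlib.sha256(str(fragment).encode()).hexdigest()

last = fragment_hash()
while True:
    time.sleep(30)                 # poll interval
    current = fragment_hash()
    if current != last:
        print("watched fragment changed")  # trigger your set of actions here
        last = current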

What’s the most forgiving HTML parser in Python?

I have some random HTML and I used BeautifulSoup to parse it, but in most cases (>70%) it chokes. I tried using BeautifulSoup 3.0.8 and 3.2.0 (there were some problems with 3.1.0 and up), but the results are almost the same.
I can recall several HTML parser options available in Python off the top of my head:
BeautifulSoup
lxml
pyquery
I intend to test all of these, but I wanted to know which one, in your tests, comes out as the most forgiving and can even try to parse bad HTML.
They all are. I have yet to come across any html page found in the wild that lxml.html couldn't parse. If lxml barfs on the pages you're trying to parse you can always preprocess them using some regexps to keep lxml happy.
lxml itself is fairly strict, but lxml.html is a different parser and can deal with very broken HTML. For extremely broken HTML, lxml also ships with lxml.html.soupparser, which interfaces with the BeautifulSoup library.
Some approaches to parsing broken html using lxml.html are described here: http://lxml.de/elementsoup.html
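A short sketch of that escalation path (soupparser requires BeautifulSoup to be installed):
from lxml import html
from lxml.html import soupparser

tag_soup = "<p>unclosed <b>bold <p>another paragraph"
root = html.fromstring(tag_soup)         # lxml.html copes with most breakage
root2 = soupparser.fromstring(tag_soup)  # hand the worst cases to BeautifulSoup
print(html.tostring(root))
print(html.tostring(root2))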
With pages that don't work with anything else (those that contain nested <form> elements come to mind) I've had success with MinimalSoup and ICantBelieveItsBeautifulSoup. Each can handle certain types of error that the other one can't so often you'll need to try both.
I ended up using BeautifulSoup 4.0 with html5lib for parsing, and it is much more forgiving. With some modifications to my code it's now working considerably well. Thanks all for the suggestions.
If BeautifulSoup doesn't fix your HTML problem, the next best solution would be regular expressions. lxml, ElementTree, and minidom are very strict in parsing, and actually they are right to be.
Other tips:
I feed the HTML to the lynx browser from the command line, take the text version of the page/content, and parse that with regex.
Converting HTML to text or HTML to Markdown strips all the tags, leaving you with plain text, which is easy to parse.

Is there a good html parser like HtmlAgilityPack (.NET) for Python?

I'm looking for a good html parser like HtmlAgilityPack (open-source .NET project: http://www.codeplex.com/htmlagilitypack), but for using with Python.
Anyone knows?
Use Beautiful Soup like everyone does.
Others have recommended BeautifulSoup, but it's much better to use lxml. Despite its name, it is also for parsing and scraping HTML. It's much, much faster than BeautifulSoup, and it even handles "broken" HTML better than BeautifulSoup (their claim to fame). It has a compatibility API for BeautifulSoup too if you don't want to learn the lxml API.
Ian Bicking agrees.
There's no reason to use BeautifulSoup anymore, unless you're on Google App Engine or something where anything not purely Python isn't allowed.
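For a taste of the lxml API, a minimal sketch on a deliberately broken fragment, using XPath for the extraction:
from lxml import html

broken = "<html><body><p>first<p>second <a href='/x'>link</body>"
doc = html.fromstring(broken)  # unclosed <p> tags are repaired, not rejected
print(doc.xpath("//a/@href"))                        # ['/x']
print([p.text_content() for p in doc.xpath("//p")])  # text of each paragraph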
Beautiful Soup is probably what you are searching for. It is an HTML/XML parser that can deal with invalid pages and lets you, for example, iterate over specific tags.
