In .NET, I found this great library, HtmlAgilityPack, that allows you to easily parse non-well-formed HTML using XPath. I've used it for a couple of years in my .NET sites, but I've had to settle for more painful libraries for my Python, Ruby and other projects. Is anyone aware of similar libraries for other languages?
I'm surprised there isn't a single mention of lxml. It's blazingly fast and will work in any environment that allows CPython libraries.
Here's how you can parse HTML and query it with XPath using lxml:
>>> from lxml import etree
>>> doc = '<foo><bar></bar></foo>'
>>> tree = etree.HTML(doc)
>>> r = tree.xpath('//foo/bar')
>>> len(r)
1
>>> r[0].tag
'bar'
>>> r = tree.xpath('.//bar')
>>> r[0].tag
'bar'
In Python, ElementTidy parses tag soup and produces an element tree, which allows querying using XPath:
>>> from elementtidy.TidyHTMLTreeBuilder import TidyHTMLTreeBuilder as TB
>>> tb = TB()
>>> tb.feed("<p>Hello world")
>>> e = tb.close()
>>> e.find(".//{http://www.w3.org/1999/xhtml}p")
<Element {http://www.w3.org/1999/xhtml}p at 264eb8>
The most stable results I've had have come from using lxml.html's soupparser. You'll need to install python-lxml and python-beautifulsoup; then you can do the following:
from lxml.html.soupparser import fromstring
tree = fromstring('<mal form="ed"><html/>here!')
matches = tree.xpath("//mal[@form='ed']")
BeautifulSoup is a good Python library for dealing with messy HTML in clean ways.
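A minimal sketch with the current bs4 package (the markup here is invented):

from bs4 import BeautifulSoup

# Deliberately messy markup: unclosed <b>, missing </p> and </body>
html = "<html><body><p class='msg'>Hello <b>world</p><p class='msg'>Bye"

soup = BeautifulSoup(html, "html.parser")
for p in soup.find_all("p", class_="msg"):
    print(p.get_text())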
It seems the question could be more precisely stated as "How to convert HTML to XML so that XPath expressions can be evaluated against it".
Here are two good tools:
TagSoup, an open-source program, is a Java- and SAX-based tool developed by John Cowan. This is
a SAX-compliant parser written in Java that, instead of parsing well-formed or valid XML, parses HTML as it is found in the wild: poor, nasty and brutish, though quite often far from short. TagSoup is designed for people who have to process this stuff using some semblance of a rational application design. By providing a SAX interface, it allows standard XML tools to be applied to even the worst HTML. TagSoup also includes a command-line processor that reads HTML files and can generate either clean HTML or well-formed XML that is a close approximation to XHTML.
Taggle is a commercial C++ port of TagSoup.
SgmlReader is a tool developed by Microsoft's Chris Lovett.
SgmlReader is an XmlReader API over any SGML document (including built-in support for HTML). A command-line utility is also provided which outputs the well-formed XML result.
Download the zip file including the standalone executable and the full source code: SgmlReader.zip
For Ruby, I highly recommend Hpricot that Jb Evain pointed out. If you're looking for a faster libxml-based competitor, Nokogiri (see http://tenderlovemaking.com/2008/10/30/nokogiri-is-released/) is pretty good too (it supports both XPath and CSS searches like Hpricot but is faster). There's a basic wiki and some benchmarks.
There is a free C library for XML called libxml2, which has some API bits for XPath that I have used with great success, and you can specify HTML as the document type being loaded. This has worked for me on some less-than-perfect HTML documents.
For the most part, XPath is most useful when the inbound HTML is properly coded and can be read 'like an XML document'. You may want to consider using a utility that is specific to this purpose for cleaning up HTML documents. Here is one example: http://tidy.sourceforge.net/
As far as these XPath tools go, you will likely find that most implementations are actually based on pre-existing C or C++ libraries such as libxml2.
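To illustrate that point, here is a minimal sketch of the same libxml2 recovery behaviour exposed through Python's lxml bindings (the markup is invented):

from lxml import etree

# libxml2's HTML parser recovers from unclosed tags instead of failing
parser = etree.HTMLParser(recover=True)
tree = etree.fromstring("<div><p>one<p>two", parser)

print(tree.xpath("//p/text()"))  # expect something like ['one', 'two']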
Given some random news article, I want to write a web crawler to find the largest body of text present, and extract it. The intention is to extract the physical news article on the page.
The original plan was to use BeautifulSoup's findAll(True) call (which means "extract all HTML tags") and to sort each tag by its .getText() value. (EDIT: don't use this for HTML work; use the lxml library instead, it's Python-based and much faster than BeautifulSoup.)
But this won't work for most pages, like the one I listed as an example, because the large body of text is split into many smaller tags, like paragraph dividers for example.
Does anyone have any experience with this? Any help with something like this would be amazing.
At the moment I'm using BeautifulSoup along with python, but willing to explore other possibilities.
EDIT: Came back to this question a few months later (wow, I sounded like an idiot ^), and solved this with a combination of libraries and my own code.
Here are some extremely helpful Python libraries for the task, in order of how much each helped me:
#1 goose library: fast, powerful, consistent (see the sketch after this list)
#2 readability library: content is passable; slower on average than goose but faster than boilerpipe
#3 python-boilerpipe: slower and harder to install; no fault of the boilerpipe library (originally in Java), but of the fact that this library is built on top of another Java library, which adds IO time, errors, etc.
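As a rough idea of the goose workflow, here is a hedged sketch (the URL is a placeholder, and the Goose()/extract() interface shown is the python-goose API as I remember it):

from goose import Goose  # python-goose

g = Goose()
# Placeholder URL; goose fetches the page and tries to isolate the article body
article = g.extract(url="http://example.com/some-news-story")

print(article.title)
print(article.cleaned_text[:300])  # first few hundred characters of the extracted text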
I'll release benchmarks perhaps if there is interest.
Indirectly related libraries, you should probably install them and read their docs:
NLTK text processing library: This is too good not to install. They provide text analysis tools along with html tools (like cleanup, etc).
lxml html/xml parser: Mentioned above. This beats BeautifulSoup in every aspect but usability. It's a bit harder to learn but the results are worth it. HTML parsing takes much less time, it's very noticeable.
python webscraper library: I think the value of this code isn't the lib itself, but using the lib as a reference manual to build your own crawlers/extractors. It's very nicely coded / documented!
A lot of the value and power of using Python, a rather slow language, comes from its open-source libraries. They are especially awesome when combined and used together, and everyone should take advantage of them to solve whatever problems they may have!
The goose library gets lots of solid maintenance; they just added Arabic support. It's great!
You might look at the python-readability package which does exactly this for you.
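A hedged sketch of how that typically looks with the readability-lxml packaging of it (package and class names assumed from memory, and the URL is a placeholder):

import requests
from readability import Document  # pip install readability-lxml

html = requests.get("http://example.com/article").text  # placeholder URL
doc = Document(html)

print(doc.title())
print(doc.summary())  # cleaned-up HTML of the main article body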
You're really not going about it the right way, I would say, as all the comments above would attest to.
That said, this does what you're looking for.
from bs4 import BeautifulSoup as BS
import requests
html = requests.get('http://www.cnn.com/2013/01/04/justice/ohio-rape-online-video/index.html?hpt=hp_c2').text
soup = BS(html)
print '\n\n'.join([k.text for k in soup.find(class_='cnn_strycntntlft').find_all('p')])
It pulls out only the text, first by finding the main container of all the <p> tags, then by selecting only the <p> tags themselves to get the text; ignoring the <script> and other irrelevant ones.
As was mentioned in the comments, this will only work for CNN--and possibly, only this page. You might need a different strategy for every new webpage.
I need to parse an XML file and put the results on an HTML form, but I am new to Python. Does Python 2.7 have something like LINQ to XML from C#, or can you suggest any good library for XML?
Check out Pynq – Python Language Integrated Query https://github.com/heynemann/pynq/wiki
I'm not sure, though, if Pynq will be sufficient for you, although it does implement expression trees in Python just like LINQ does for C#.
For an easy way to access XML in Python, you could check out BeautifulSoup (http://www.crummy.com/software/BeautifulSoup/documentation.html). Note: for XML processing, use BeautifulStoneSoup.
A simple example: "find the first three <p> tags":
soup.findAll('p', limit=3)
For a more comprehensive selection of XML libraries for Python, please see "PythonXml" in PythonInfo Wiki.
Look at lxml, more specifically the combination of:
the ElementTree interface: an easier xml model/api than DOM, much like the XDocument and XElement classes (attributes and text are even simpler because they are not separate "nodes", which may look strange at first if you're working with "mixed content" models a lot and you're used to DOM interfaces)
The E-factory of lxml.builder (like "functional construction" in LINQ to XML, but even better ;-))
python's built-in list comprehensions and generator expressions, which can get you very close to the LINQ query syntax (without looking like SQL however ;-))
... will get you a very similar experience. (Having played with LINQ to XML in .NET as well, I much prefer working with Python and lxml.)
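To make the comparison concrete, here is a small sketch (the document contents are invented) combining the E-factory with a list comprehension as the "query" step:

from lxml import etree
from lxml.builder import E

# Functional construction, similar in spirit to LINQ to XML's XElement trees
doc = E.order(
    E.customer("Alice", id="42"),
    E.items(
        E.item("Widget", price="9.99"),
        E.item("Gadget", price="19.99"),
    ),
)

# A list comprehension plays roughly the role a LINQ query would in C#
cheap = [item.text for item in doc.iter("item") if float(item.get("price")) < 10]

print(etree.tostring(doc, pretty_print=True).decode())
print(cheap)  # ['Widget']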
Another remark: lxml also has good support for HTML (even ill-formed HTML), including facilities for filling out HTML forms (I'm not sure if that is what you meant by "put results on an HTML form", however).
The most full-featured XML library in Python is lxml (http://lxml.de), but as far as I know it doesn't support a LINQ-style interface.
You can use ElementTree interface http://lxml.de/tutorial.html#the-elementtree-class
or XPath selectors http://lxml.de/xpathxslt.html#xpath
or CSS selectors http://lxml.de/cssselect.html
for data extraction from XML.
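For example, all three of those interfaces can be used on the same parsed document (the markup below is invented; the CSS path needs the cssselect package):

import lxml.html

doc = lxml.html.fromstring(
    "<div class='links'><a href='http://a.example'>A</a>"
    "<a href='http://b.example'>B</a></div>"
)

# ElementTree-style traversal
print([a.text for a in doc.iter("a")])
# XPath
print(doc.xpath("//div[@class='links']/a/@href"))
# CSS selectors (requires the cssselect package)
print([a.text for a in doc.cssselect("div.links a")])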
I am working on a project that will involve parsing HTML.
After searching around, I found two probable options: BeautifulSoup and lxml.html
Is there any reason to prefer one over the other? I used lxml for XML some time back, and I feel I will be more comfortable with it; however, BeautifulSoup seems to be much more common.
I know I should use the one that works for me, but I was looking for personal experiences with both.
The simple answer, imo, is that if you trust your source to be well-formed, go with the lxml solution. Otherwise, BeautifulSoup all the way.
Edit:
This answer is three years old now; it's worth noting, as Jonathan Vanasco does in the comments, that BeautifulSoup4 now supports using lxml as the internal parser, so you can use the advanced features and interface of BeautifulSoup without most of the performance hit, if you wish (although I still reach straight for lxml myself -- perhaps it's just force of habit :)).
In summary, lxml is positioned as a lightning-fast production-quality html and xml parser that, by the way, also includes a soupparser module to fall back on BeautifulSoup's functionality. BeautifulSoup is a one-person project, designed to save you time to quickly extract data out of poorly-formed html or xml.
The lxml documentation says that both parsers have advantages and disadvantages. For this reason, lxml provides a soupparser so you can switch back and forth. Quoting,
BeautifulSoup uses a different parsing approach. It is not a real HTML
parser but uses regular expressions to dive through tag soup. It is
therefore more forgiving in some cases and less good in others. It is
not uncommon that lxml/libxml2 parses and fixes broken HTML better,
but BeautifulSoup has superior support for encoding detection. It
very much depends on the input which parser works better.
In the end they are saying,
The downside of using this parser is that it is much slower than
the HTML parser of lxml. So if performance matters, you might want
to consider using soupparser only as a fallback for certain cases.
If I understand them correctly, it means that the soup parser is more robust --- it can deal with a "soup" of malformed tags by using regular expressions --- whereas lxml is more straightforward and just parses things and builds a tree as you would expect. I assume it also applies to BeautifulSoup itself, not just to the soupparser for lxml.
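A minimal sketch of that "soupparser only as a fallback" pattern (the function name is mine; lxml.html.soupparser requires BeautifulSoup to be installed, and in practice lxml's own parser recovers from most broken markup on its own):

from lxml import etree
import lxml.html
import lxml.html.soupparser

def parse_html(markup):
    """Use lxml's fast HTML parser first; fall back to the slower
    BeautifulSoup-based soupparser only if lxml gives up entirely."""
    try:
        return lxml.html.fromstring(markup)
    except etree.ParserError:
        return lxml.html.soupparser.fromstring(markup)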
They also show how to benefit from BeautifulSoup's encoding detection, while still parsing quickly with lxml:
>>> from BeautifulSoup import UnicodeDammit
>>> def decode_html(html_string):
... converted = UnicodeDammit(html_string, isHTML=True)
... if not converted.unicode:
... raise UnicodeDecodeError(
... "Failed to detect encoding, tried [%s]",
... ', '.join(converted.triedEncodings))
... # print converted.originalEncoding
... return converted.unicode
>>> import lxml.html
>>> root = lxml.html.fromstring(decode_html(tag_soup))
(Same source: http://lxml.de/elementsoup.html).
In the words of Beautiful Soup's creator,
That's it! Have fun! I wrote Beautiful Soup to save everybody time.
Once you get used to it, you should be able to wrangle data out of
poorly-designed websites in just a few minutes. Send me email if you
have any comments, run into problems, or want me to know about your
project that uses Beautiful Soup.
--Leonard
Quoted from the Beautiful Soup documentation.
I hope this is now clear. The soup is a brilliant one-person project designed to save you time to extract data out of poorly-designed websites. The goal is to save you time right now, to get the job done, not necessarily to save you time in the long term, and definitely not to optimize the performance of your software.
Also, from the lxml website,
lxml has been downloaded from the Python Package Index more than two
million times and is also available directly in many package
distributions, e.g. for Linux or MacOS-X.
And, from Why lxml?,
The C libraries libxml2 and libxslt have huge benefits:...
Standards-compliant... Full-featured... fast. fast! FAST! ... lxml
is a new Python binding for libxml2 and libxslt...
Use both? lxml for DOM manipulation, BeautifulSoup for parsing:
http://lxml.de/elementsoup.html
lxml's great. But parsing your input as html is useful only if the dom structure actually helps you find what you're looking for.
Can you use ordinary string functions or regexes? For a lot of html parsing tasks, treating your input as a string rather than an html document is, counterintuitively, way easier.
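For a narrowly scoped extraction, for instance, a regex can be all you need (a tiny sketch on an invented string; this is not robust against arbitrary HTML):

import re

html = '<a href="http://a.example">A</a> <a href="http://b.example">B</a>'

# Pull out every href attribute value without building a tree
hrefs = re.findall(r'href="([^"]+)"', html)
print(hrefs)  # ['http://a.example', 'http://b.example']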
A search for "python" and "xml" returns a variety of libraries for combining the two.
This list is probably faulty:
xml.dom
xml.etree
xml.sax
xml.parsers.expat
PyXML
beautifulsoup?
HTMLParser
htmllib
sgmllib
It would be nice if someone could offer a quick summary of when to use which, and why.
The DOM/SAX divide is a basic one. It applies not just to Python, since DOM and SAX are cross-language.
DOM: read the whole document into memory and manipulate it.
Good for:
complex relationships across tags in the markup
small intricate XML documents
Cautions:
Easy to use excessive memory
SAX: parse the document while you read it. Good for:
Long documents or open ended streams
places where memory is a constraint
Cautions:
You'll need to code a stateful parser, which can be tricky
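For instance, a minimal stateful SAX handler (the element names here are invented) looks something like this:

import xml.sax

class TitleCollector(xml.sax.ContentHandler):
    """Stateful handler: collects the text of <title> elements
    without loading the whole document into memory."""
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.titles = []

    def startElement(self, name, attrs):
        if name == "title":
            self.in_title = True
            self.titles.append("")

    def characters(self, content):
        if self.in_title:
            self.titles[-1] += content

    def endElement(self, name):
        if name == "title":
            self.in_title = False

handler = TitleCollector()
xml.sax.parseString(b"<feed><title>One</title><title>Two</title></feed>", handler)
print(handler.titles)  # ['One', 'Two']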
beautifulsoup:
Great for HTML or not-quite-well-formed markup. Easy to use and fast. Good for screen scraping, etc. It can work with markup where the XML-based ones would just throw an error saying the markup is incorrect.
Most of the rest I haven't used, but I don't think there's hard and fast rules about when to use which. Just your standard considerations: who is going to maintain the code, which APIs do you find most easy to use, how well do they work, etc.
In general, for basic needs, it's nice to use the standard library modules since they are "standard" and thus available and well known. However, if you need to dig deep into something, almost always there are newer nonstandard modules with superior functionality outside of the standard library.
I find xml.etree essentially sufficient for everything, except BeautifulSoup if I ever need to parse broken XML (not a common problem, unlike broken HTML, which BeautifulSoup also helps with and which is everywhere): it has reasonable support for reading entire XML docs into memory, navigating them, creating them, and incrementally parsing large docs. lxml supports the same interface and is generally faster -- useful to push performance when you can afford to install third-party Python extensions (e.g. on App Engine you can't -- but xml.etree is still there, so you can run exactly the same code). lxml also has more features, and offers BeautifulSoup too.
The other libs you mention mimic APIs designed for very different languages, and in general I see no reason to contort Python into those gyrations. If you have very specific needs such as support for XSLT, various kinds of validation, etc., it may be worth looking around for other libraries, but I haven't had such needs in a long time, so I'm not current on the offerings for them.
For many problems you can get by with the standard library's xml package. It has the major advantage of being part of the standard library, which means it is pre-installed on almost every system and the interface is stable. It is not the best, or the fastest, but it is there.
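For example, a quick sketch with the standard-library xml.etree (the data is invented):

import xml.etree.ElementTree as ET

doc = ET.fromstring(
    "<catalog>"
    "  <book id='1'><title>First</title></book>"
    "  <book id='2'><title>Second</title></book>"
    "</catalog>"
)

for book in doc.findall("book"):
    print(book.get("id"), book.findtext("title"))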
For everything else there is lxml. Specifically, lxml is best for parsing broken HTML, XHTML, or suspect feeds. It uses libxml2 and libxslt to handle XPath, XSLT, and EXSLT. The tutorial is clear and the interface is straightforward. The rest of the libraries mentioned exist because lxml was not available in its current form.
This is my opinion.
I don't do much with XML, but when I've needed to, lxml has been a joy to work with and is apparently quite fast. The element tree API is very nice in an object oriented setting.
I know that NLTK has it, but is there anything else?
The Python standard-library module html.parser should allow you to parse simple HTML content and eliminate tags. You only have to derive from HTMLParser, then override the handle_*() methods so that they output or discard content, depending on the surrounding element tags.
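A minimal sketch of that approach (the class and method bodies are mine, built only on the documented handle_data hook):

from html.parser import HTMLParser

class TagStripper(HTMLParser):
    """Collects only character data, silently discarding all tags."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

    def get_text(self):
        return "".join(self.chunks)

stripper = TagStripper()
stripper.feed("<p>Hello <b>world</b>!</p>")
print(stripper.get_text())  # Hello world!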
BeautifulSoup: http://www.crummy.com/software/BeautifulSoup/
From the home page:
Beautiful Soup is a Python HTML/XML parser designed for quick turnaround projects like screen-scraping. Three features make it powerful:
Beautiful Soup won't choke if you give it bad markup. It yields a parse tree that makes approximately as much sense as your original document. This is usually good enough to collect the data you need and run away.
Beautiful Soup provides a few simple methods and Pythonic idioms for navigating, searching, and modifying a parse tree: a toolkit for dissecting a document and extracting what you need. You don't have to create a custom parser for each application.
Beautiful Soup automatically converts incoming documents to Unicode and outgoing documents to UTF-8. You don't have to think about encodings, unless the document doesn't specify an encoding and Beautiful Soup can't autodetect one. Then you just have to specify the original encoding.
You may want to take a look at the Strip-o-Gram HTML Conversion Library: http://pypi.python.org/pypi/stripogram/1.5
example usage from readme.txt:
from stripogram import html2text, html2safehtml
mylumpofdodgyhtml # a lump of dodgy html ;-)
# Only allow <b>, <a>, <i>, <br>, and <p> tags
mylumpofcoolcleancollectedhtml = html2safehtml(mylumpofdodgyhtml,valid_tags=("b", "a", "i", "br", "p"))
# Don't process <img> tags, just strip them out. Use an indent of 4 spaces
# and a page that's 80 characters wide.
mylumpoftext = html2text(mylumpofcoolcleancollectedhtml,ignore_tags=("img",),indent_width=4,page_width=80)
If your licensing permits it, you could use html2text (the asciinator) (GPL).