Python method to extract content (excluding navigation) from an HTML page

Of course an HTML page can be parsed using any number of python parsers, but I'm surprised that there don't seem to be any public parsing scripts to extract meaningful content (excluding sidebars, navigation, etc.) from a given HTML doc.
I'm guessing it's something like collecting DIV and P elements and then checking them for a minimum amount of text content, but I'm sure a solid implementation would include plenty of things that I haven't thought of.

Try the Beautiful Soup library for Python. It has very simple methods to extract information from an html file.
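For example, a minimal sketch with BeautifulSoup that strips the obvious non-content tags before collecting text (the list of tags to drop is just a guess at typical boilerplate):
from bs4 import BeautifulSoup

def rough_text(html):
    soup = BeautifulSoup(html, 'html.parser')
    # drop elements that are almost never part of the main content
    for tag in soup(['script', 'style', 'nav', 'header', 'footer', 'aside']):
        tag.decompose()
    return soup.get_text(separator=' ', strip=True)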
Trying to generically extract data from webpages would require people to write their pages in a similar way... but there's an almost infinite number of ways to lay out a page that looks identical, let alone all the combinations you can use to convey the same information.
Was there a particular type of information you were trying to extract or some other end goal?
You could try extracting any content in 'div' and 'p' markers and compare the relative sizes of all the information in the page. The problem then is that people probably group information into collections of 'div's and 'p's (or at least they do if they're writing well-formed html!).
Maybe if you formed a tree of how the information is related (nodes would be the 'p' or 'div' or whatever, and each node would contain the associated text) you could do some sort of analysis to identify the smallest 'p' or 'div' that encompasses what appears to be the majority of the information..?
[EDIT] Maybe if you can get it into the tree structure I suggested, you could then use a similar points system to spam assassin. Define some rules that attempt to classify the information. Some examples:
+1 points for every 100 words
+1 points for every child element that has > 100 words
-1 points if the section name contains the word 'nav'
-2 points if the section name contains the word 'advert'
If you have lots of low-scoring rules which add up when you find more relevant-looking sections, I think that could evolve into a fairly powerful and robust technique.
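To make the idea concrete, here is a minimal scoring sketch over an lxml tree; the weights and thresholds mirror the example rules above but are otherwise arbitrary assumptions:
import lxml.html

def score_element(el):
    score = 0
    score += len(el.text_content().split()) // 100        # +1 point per 100 words
    for child in el:
        if len(child.text_content().split()) > 100:        # +1 per child with > 100 words
            score += 1
    name = ((el.get('class') or '') + ' ' + (el.get('id') or '')).lower()
    if 'nav' in name:
        score -= 1                                         # section name contains 'nav'
    if 'advert' in name:
        score -= 2                                         # section name contains 'advert'
    return score

def best_block(html):
    root = lxml.html.fromstring(html)
    candidates = root.xpath('//div | //p')
    return max(candidates, key=score_element) if candidates else None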
[EDIT2] Looking at readability, it seems to be doing pretty much exactly what I just suggested! Maybe it could be improved to try and understand tables better?

Have a look at templatemaker: http://www.holovaty.com/writing/templatemaker/
It's written by one of the founders of Django. Basically you feed it a few example html files and it will generate a "template" that you can then use to extract just the bits that are different (which is usually the meaningful content).
Here's an example from the google code page:
# Import the Template class.
>>> from templatemaker import Template
# Create a Template instance.
>>> t = Template()
# Learn a Sample String.
>>> t.learn('<b>this and that</b>')
# Output the template so far, using the "!" character to mark holes.
# We've only learned a single string, so the template has no holes.
>>> t.as_text('!')
'<b>this and that</b>'
# Learn another string. The True return value means the template gained
# at least one hole.
>>> t.learn('<b>alex and sue</b>')
True
# Sure enough, the template now has some holes.
>>> t.as_text('!')
'<b>! and !</b>'

You might use the boilerpipe Web application to fetch and extract content on the fly.
(This is not specific to Python, as you only need to issue an HTTP GET request to a page on Google App Engine.)
Cheers,
Christian

What is meaningful and what is not depends on the semantics of the page. If the semantics are crappy, your code won't "guess" what is meaningful. I use readability, which you linked in the comment, and I see that on many pages I try to read it does not provide any result, let alone a decent one.
If someone puts the content in a table, you're doomed. Try readability on a phpbb forum you'll see what I mean.
If you want to do it, go with a regexp on <p></p>, or parse the DOM.
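A minimal sketch of the regexp route (fragile by design, since regular expressions don't really parse HTML):
import re

html = '<div><p>Real content here.</p><p class="nav">Menu</p></div>'   # sample input
# naive: grabs everything between <p ...> and </p>; breaks on nested markup
paragraphs = re.findall(r'<p\b[^>]*>(.*?)</p>', html, re.IGNORECASE | re.DOTALL)
print(paragraphs)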

Goose is just the library for this task. To quote their README:
Goose will try to extract the following information:
Main text of an article
Main image of article
Any Youtube/Vimeo movies embedded in article
Meta Description
Meta tags
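A short usage sketch, assuming the python-goose port (attribute names may differ slightly between Goose versions):
from goose import Goose   # pip install goose-extractor

g = Goose()
article = g.extract(url='https://example.com/some-article')   # hypothetical URL
print(article.title)
print(article.meta_description)
print(article.cleaned_text)          # main text of the article
if article.top_image is not None:
    print(article.top_image.src)     # main image of the article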


Accessing Hovertext with html

I am trying to access hover text found on graph points at this site (bottom):
http://matchhistory.na.leagueoflegends.com/en/#match-details/TRLH1/1002200043?gameHash=b98e62c1bcc887e4&tab=overview
I have the full site html but I am unable to find the values displayed in the hover text. All that can be seen when inspecting a point are x and y values that are transformed versions of these values. The mapping can be determined with manual input taken from the hovertext but this defeats the purpose of looking at the html. Additionally, the mapping changes with each match history so it is not feasible to do this for a large number of games.
Is there any way around this?
thank you
Explanation
Nearly everything on this webpage is loaded via JSON through JavaScript. We don't even have to request the original page. You will, however, have to piece the page back together via the ids of items, masteries, etc., which won't be too hard because you can request masteries similarly to how we fetch items below.
So, I went through the network tab in inspect and I noticed that it loaded the following JSON formatted URL:
https://acs.leagueoflegends.com/v1/stats/game/TRLH1/1002200043?gameHash=b98e62c1bcc887e4
If you notice, there is a gameHash and the id (similar to that of the link you just sent me). This page contains everything you need to rebuild it, given that you fetch all reliant JSON files.
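A small sketch of pulling that JSON down with requests (assuming the endpoint is still publicly reachable):
import requests

match_url = ('https://acs.leagueoflegends.com/v1/stats/game/TRLH1/1002200043'
             '?gameHash=b98e62c1bcc887e4')
match = requests.get(match_url).json()   # the whole match as a Python dict
print(match.keys())                      # inspect the top-level structure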
Dealing with JSON
You can use json.loads in Python to load it, but a great tool I would recommend is:
https://jsonformatter.curiousconcept.com/
You copy and paste JSON in there and it will help you understand the data structure.
Fetching items
The webpage loads all this information via a JSON file:
https://ddragon.leagueoflegends.com/cdn/7.10.1/data/en_US/item.json
It contains all of the information and tool tips about each item in the game. You can access your desired item via: theirJson['data']['1001']. Each image on the page's file name is the id (or 1001) in this example.
For instance, for 'Boots of Speed':
import requests, json
# fetch the static item database, parse the JSON, and look an item up by its numeric id
itemJson = json.loads(requests.get('https://ddragon.leagueoflegends.com/cdn/7.10.1/data/en_US/item.json').text)
print(itemJson['data']['1001'])   # Boots of Speed
An alternative: Selenium
Selenium could be used for this. You should look it up. It's been ported to several programming languages, one being Python. It may work as you want it to here, but I sincerely think that the JSON method (described above), although a little more convoluted, will perform faster (since speed, based on your post, seems to be an important factor).

Determining if a found string is contained within a hyperlink using xpath or regex (or other)

On a page such as this, the individual jobs are outlined behind hyperlinks (but my program does not yet know this; all it knows is that it has arrived at a jobs listings page).
Given a search term e.g. 'senior project manager', I scrape the source of the page to determine if the page contains this string;
import urllib2

search_term = 'senior project manager'
url = 'http://british-business-bank.co.uk/what-the-british-business-bank-does/job-vacancies/'
source = urllib2.urlopen(url).read().lower()
found_a_match = search_term in source
In this case, with found_a_match being True, I then want to determine if the full job description is behind a hyperlink. Manual inspection of the source shows:
<p>Senior Project Manager – Northern Powerhouse Investment Fund</p>
I guess I could parse the source again, this time looking for a match for the search term preceded by an <a>, but I have a (perhaps unfounded) feeling that this may be a little brittle. What is a more robust approach?
NOTE: I know I can look into BeautifulSoup, lxml, scrapy et al to achieve this, but given that speed is of the essence and that there will be little if any more parsing to be done once I've made this hyperlink-or-not determination, I'm looking to keep things simple.
I recently had to build a solution that would ignore any matches within <a></a> tags. My approach was as follows:
In a pre-processing pass, search for and record the positions of all <a.*> and </a> strings -- I used an array where each entry contains a start and stop position, for <a> and </a> respectively.
Then, when you're searching for matches, determining whether a match is within hyperlink tags is a simple matter of running through the list of tags' start/stop positions and seeing whether the match's offset falls within any of those.
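For illustration, a rough sketch of that approach in Python (regex-based, so it assumes reasonably tidy markup; the function names are my own):
import re

def link_spans(source):
    # record (start, stop) offsets of every <a ...> ... </a> region
    spans = []
    for opening in re.finditer(r'<a\b[^>]*>', source, re.IGNORECASE):
        closing = re.search(r'</a>', source[opening.start():], re.IGNORECASE)
        if closing:
            spans.append((opening.start(), opening.start() + closing.end()))
    return spans

def match_inside_link(source, search_term):
    # a match counts as "inside a hyperlink" if its offset falls within any span
    spans = link_spans(source)
    for match in re.finditer(re.escape(search_term), source):
        if any(start <= match.start() < stop for start, stop in spans):
            return True
    return False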

Python lxml XPath with deep nesting with specific search

The xpath for text I wish to extract is reliably located deep in the tree at
...table/tbody/tr[4]/td[2]
Specifically, td[2] is structured like so
<td class="val">xyz</td>
I am trying to extract the text "xyz", but a broad search returns multiple results. For example the following path returns 10 elements.
xpath('//td[@class="val"]')
... while a specific search doesn't return any elements. I am unsure why the following returns nothing.
xpath('//tbody/tr/td[@class="val"]')
One solution involves..
table = root.xpath('//table[@class="123"]')
# going down the tree by position
xyz = table[0][3][1]
print xyz.text
However, I am pretty sure this is extremely brittle. I would appreciate it if someone could tell me how to construct an XPath search that would be both un-brittle and relatively cheap on resources.
You haven't mentioned it explicitly, but if your target table and td tag classes are reliable then you could do something like:
//table[@class="123"]/descendant::td[@class="val"]
And you half dodge the issue of tbody being there or not.
However, there's no substitute for actually seeing the material you are trying to parse for recommending XPATH queries...
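For what it's worth, a small sketch of running that query with lxml (the sample markup here is invented to mirror the question's class names):
from lxml import html

page = '<html><body><table class="123"><tr><td>bad</td><td class="val">xyz</td></tr></table></body></html>'
root = html.fromstring(page)
cells = root.xpath('//table[@class="123"]/descendant::td[@class="val"]')
print(cells[0].text if cells else None)   # xyz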
...table/tbody/tr[4]/td[2]
I guess you found this XPath via a tool like Firebug. One thing to note about tools like Firebug (or other inspect tools within browsers) is that they use the DOM tree generated by the browser itself and most (if not all) HTML parsers in browsers would try hard to make the passed HTML valid. This often requires adding various tags the standard dictates.
<tbody> is one of these tags. <tr> tags are only allowed as children of <thead>, <tbody> or <tfoot> tags. Unfortunately, in my experience, you will rarely see one of these tags inside a <table> in the actual source, but a browser will add these necessary tags while parsing, since the standard requires it to do so.
To cut this story short, there is probably no <tbody> tag in your actual source. That is why your XPath returns nothing.
As for generating XPath queries, this highly depends on the particular page/xml. In general, positional queries such as td[4] should be a last resort since they tend to break easily when something is added before them. You should inspect the markup carefully and try to come up with queries that use attributes like id or class, since they add specificity more reliably than positional ones. But in the end, it all boils down to the specifics of the page in question.
This seems to be working
from lxml import etree
doc = etree.HTML('<html><body><table><tbody><tr><td>bad</td><td class="val">xyz</td></tr></tbody></table></body></html>')
print doc.xpath('//tbody/tr/td[@class="val"]')[0].text
output:
xyz
So what is your problem?

Wikipedia philosophy game diagram in python and R

So I am relatively new to python, and in order to learn I have started writing a program that goes online to wikipedia, finds the first link in the overview section of a random article, follows that link and keeps going until it either enters a loop or finds the philosophy page (as detailed here), and then repeats this process for a new random article a specified number of times. I then want to collect the results in some form of useful data structure, so that I can pass the data to R using the Rpy library and draw some sort of network diagram (R is pretty good at drawing things like that), with each node in the diagram representing the pages visited and the arrows representing the paths taken from the starting article to the philosophy page.
So I have no problem getting python to return the fairly structured html from wiki, but there are some problems that I can't quite figure out. Up till now I have selected the first link using a cssselector from the lxml library. It selects for the first link (in an a tag) that is a direct descendant of a p tag, which is a direct descendant of a div tag with class="mw-content-ltr", like this:
import urllib
import urllib2
from lxml.html import parse

def find_first_link(url):
    user_agent = 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT)'
    values = {'name' : 'David Kavanagh',
              'location' : 'Belfast',
              'language' : 'Python' }
    headers = { 'User-Agent' : user_agent }
    encodes = urllib.urlencode(values)
    req = urllib2.Request(url, encodes, headers)
    page = urllib2.urlopen(req)
    root = parse(page).getroot()
    return root.cssselect("div.mw-content-ltr>p>a")[0].get('href')
This code resides in a function which I use to find the first link in the page. It works for the most part, but the problem is that if the first link is inside some other tag (say a b tag) rather than being a direct descendant of a p tag, then I miss it. As you can see from the wiki article above, links in italics or inside parentheses aren't eligible for the game, which means that I never get a link in italics (good) but frequently do get links that are inside parentheses (bad), and sometimes miss the first link on a page, like the first link on the Chair article, which is stool, but it is in bold, so I don't get it. I have tried removing the direct-descendant stipulation, but then I frequently get links that are "above" the overview section, usually in the side box, in a p tag, in a table, in the same div as the overview section.
So the first part of my question is:
How could I use cssselectors or some other function or library to select the first link in the overview section that is not inside parentheses or in italics. I thought about using regular expressions to look through the raw html but that seems like a very clunky solution and I thought that there might be something a bit nicer out there that I haven't thought of.
So currently I am storing the results in a list of lists. So I have a list called paths, in which there are lists that contain strings that contain the title of the wiki article.
The second part of the question is:
How can I traverse this list of lists to represent the multiple convergent paths? Is storing the results like this a good idea? Since the end diagram should look something like an upside-down tree, I thought about making some kind of tree class, but that seems like a lot of work for something that is conceptually fairly simple.
Any ideas or suggestions would be greatly appreciated.
Cheers,
Davy
I'll just answer the second question:
For a start, just keep a dict mapping one Wikipedia article title to the next. This will make it easy and fast to check whether you've hit an article you've already found before. Essentially this is just storing a directed graph's edges, indexed by their origins.
If you get to the point where a Python dict is not efficient enough (it does have a significant memory overhead, once you have millions of items memory can be an issue) you can find a more efficient graph data structure to suit your needs.
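As a rough illustration of the dict idea (first_link here is a placeholder for whatever function you use to fetch the next article's title):
def follow_until_philosophy(start_title, first_link):
    # first_link(title) is assumed to return the title of the first eligible link
    next_page = {}                  # maps each visited title to the one it links to
    current = start_title
    while current != 'Philosophy' and current not in next_page:
        next_page[current] = first_link(current)
        current = next_page[current]
    return next_page                # the edges of the directed graph for this walk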
EDIT
Ok, I'll answer the first question as well...
For the first part, I highly recommend using the MediaWiki API instead of getting the HTML version and parsing it. The API enables querying for certain types of links, for instance just inter-wiki links or just inter-language links. Additionally, there are Python client libraries for this API, which should make using it from Python code simple.
Don't parse a website's HTML if it provides a comprehensive and well-documented API!
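For example, a hedged sketch of asking the API for a page's article-namespace links with requests; note that prop=links returns them alphabetically rather than in page order, so it suits filtering and loop detection better than the strict "first link" rule:
import requests

params = {
    'action': 'query',
    'titles': 'Chair',
    'prop': 'links',
    'plnamespace': 0,        # main/article namespace only
    'pllimit': 'max',
    'format': 'json',
}
resp = requests.get('https://en.wikipedia.org/w/api.php', params=params).json()
for page in resp['query']['pages'].values():
    for link in page.get('links', []):
        print(link['title'])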
For the first part, it's not possible to find brackets with css selectors, because as far as the html is concerned brackets are just text.
If I were you, I'd use selectors to find all the relevant paragraph elements that are valid for the game. Then I'd look in the text of each paragraph element and remove anything that is not valid - for instance, anything between brackets and anything between italic tags. I'd then search this processed text for the link elements I need. This is slightly nicer than manually processing the whole html document.
I'm not sure I follow exactly what you are doing for the second part, but as for representing the results of this search as a tree: This is a bad idea as you are looking for cycles, which trees can't represent.
For a data structure, I would have lists of 'nodes', where a node represents a page and has a URL and a count of occurrences. I'd then use a brute force algorithm to compare lists of nodes - if two lists have a node that is the same, you could merge them, increasing the 'occurrences' count for each mirrored node.
I would not use the standard python 'list' as this can't loop back on itself. Maybe create your own linked list implementation to contain the nodes.

Web scraping - how to identify main content on a webpage

Given a news article webpage (from any major news source such as times or bloomberg), I want to identify the main article content on that page and throw out the other misc elements such as ads, menus, sidebars, user comments.
What's a generic way of doing this that will work on most major news sites?
What are some good tools or libraries for data mining? (preferably python based)
There are a number of ways to do it, but, none will always work. Here are the two easiest:
if it's a known finite set of websites: in your scraper convert each url from the normal url to the print url for a given site (cannot really be generalized across sites)
Use the arc90 readability algorithm (reference implementation is in javascript) http://code.google.com/p/arc90labs-readability/ . The short version of this algorithm is it looks for divs with p tags within them. It will not work for some websites but is generally pretty good.
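A hedged sketch of the readability route, assuming the readability-lxml Python port of the arc90 algorithm (pip install readability-lxml requests):
import requests
from readability import Document

html = requests.get('https://example.com/some-article').text   # hypothetical URL
doc = Document(html)
print(doc.title())     # detected article title
print(doc.summary())   # cleaned HTML of the main content block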
A while ago I wrote a simple Python script for just this task. It uses a heuristic to group text blocks together based on their depth in the DOM. The group with the most text is then assumed to be the main content. It's not perfect, but works generally well for news sites, where the article is generally the biggest grouping of text, even if broken up into multiple div/p tags.
You'd use the script like: python webarticle2text.py <url>
There's no way to do this that's guaranteed to work, but one strategy you might use is to try to find the element with the most visible text inside of it.
Diffbot offers a free (10,000 URLs) API to do that. I don't know if that approach is what you are looking for, but it might help someone: http://www.diffbot.com/
Check the following script. It is really amazing:
from newspaper import Article
URL = "https://www.ksat.com/money/philippines-stops-sending-workers-to-qatar"
article = Article(URL)
article.download()
print(article.html)
article.parse()
print(article.authors)
print(article.publish_date)
#print(article.text)
print(article.top_image)
print(article.movies)
article.nlp()
print(article.keywords)
print(article.summary)
More documentation can be found at http://newspaper.readthedocs.io/en/latest/ and https://github.com/codelucas/newspaper you should install it using:
pip3 install newspaper3k
For a solution in Java have a look at https://github.com/kohlschutter/boilerpipe :
The boilerpipe library provides algorithms to detect and remove the surplus "clutter" (boilerplate, templates) around the main textual content of a web page.
The library already provides specific strategies for common tasks (for example: news article extraction) and may also be easily extended for individual problem settings.
But there is also a python wrapper around this available here:
https://github.com/misja/python-boilerpipe
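A rough usage sketch of the python-boilerpipe wrapper; as far as I recall it drives the Java library through JPype, so a JVM is required, and the exact API may differ between versions:
from boilerpipe.extract import Extractor

extractor = Extractor(extractor='ArticleExtractor',
                      url='https://example.com/some-article')   # hypothetical URL
print(extractor.getText())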
It might be more useful to extract the RSS feeds (<link type="application/rss+xml" href="..."/>) on that page and parse the data in the feed to get the main content.
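A minimal sketch of that RSS route: find the feed link in the page head with BeautifulSoup, then parse the entries with feedparser (both package choices are my own, not prescribed by the answer):
import requests
import feedparser
from bs4 import BeautifulSoup

page_url = 'https://example.com/news'                        # hypothetical URL
soup = BeautifulSoup(requests.get(page_url).text, 'html.parser')
feed_link = soup.find('link', type='application/rss+xml')
if feed_link is not None:
    feed = feedparser.parse(feed_link['href'])
    for entry in feed.entries:
        print(entry.title)
        print(entry.get('summary', ''))                      # main content, if the feed provides it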
Another possibility of separating "real" content from noise is by measuring the HTML density of the parts of an HTML page.
You will need a bit of experimentation with the thresholds to extract the "real" content, and I guess you could improve the algorithm by applying heuristics to specify the exact bounds of the HTML segment after having identified the interesting content.
Update: Just found out the URL above does not work right now; here is an alternative link to a cached version of archive.org.
There is a recent (early 2020) comparison of various methods of extracting the article body, without ads, menus, sidebars, user comments, etc. - see https://github.com/scrapinghub/article-extraction-benchmark. A report, data and evaluation scripts are available. It compares many options mentioned in the answers here, as well as some options which were not mentioned:
python-readability
boilerpipe
newspaper3k
dragnet
html-text
Diffbot
Scrapinghub AutoExtract
In short, "smart" open source libraries are adequate if you need to remove e.g. sidebar and menu, but they don't handle removal of unnecessary content inside articles, and are quite noisy overall; sometimes they remove an article itself and return nothing. Commercial services use Computer Vision and Machine Learning, which allows them to provide a much more precise output.
For some use cases, simpler libraries like html-text are preferable both to commercial services and to "smart" open source libraries - they are fast, and ensure information is not missing (i.e. recall is high).
I would not recommend copy-pasting code snippets, as there are many edge cases even for a seemingly simple task of extracting text from HTML, and there are libraries available (like html-text or html2text) which should be handling these edge cases.
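A small sketch with the html-text library mentioned above (pip install html-text); extract_text is its main entry point:
import html_text

# prints the visible text with whitespace normalized
print(html_text.extract_text('<div><p>Hello   world!</p></div>'))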
To use a commercial tool, in general one needs to get an API key, and then use a client library. For example, for AutoExtract by Scrapinghub (disclaimer: I work there) you would need to install pip install scrapinghub-autoextract. There is a Python API available - see https://github.com/scrapinghub/scrapinghub-autoextract README for details, but an easy way to get extractions is to create a .txt file with URLs to extract, and then run
python -m autoextract urls.txt --page-type article --api-key <API_KEY> --output res.jl
I wouldn't try to scrape it from the web page - too many things could mess it up - but instead see which web sites publish RSS feeds. For example, the Guardian's RSS feed has most of the text from their leading articles:
http://feeds.guardian.co.uk/theguardian/rss
I don't know if The Times (The London Times, not NY) has one because it's behind a paywall. Good luck with that...
