Finding structures from web articles with Python - python

I'm looking for some Python tool, that can help me determine content structures from an article website such as http://www.bbc.co.uk/. I used boilerplate removal library - Boilerpipe to clean the web page from unwanted stuff (banners, links, pictures, etc).
Now when I have only relevant content, I want to automatically determine what string is title, author, date, date of article updating, what is the article itself. Problem is, I am not only going to use it for transparent article pages, that has most of those information in HTML tags such as <title>Title</title>. I'd like to be able to determine it from tags like <div>28.11.2011<p>John Cusack on Syria conflict</div>.
Is there any tool that can help me with that?

Isn't scrapy meant for that kind of stuff? http://scrapy.org/

You can get content from articles easy with the follow "tools":
scrapy (recommended, but have greater learning curve)
newspaper (gives you immediately title, author, text, images, videos, etc. )
goose-extractor (is like newspaper)

Related

Python Web Scraping with Beautiful Soup - Having Trouble

I'm using BeautifulSoup to try to pull either the top links or simply the top headlines from different topics on the CNN homepage. I seem to be missing something here and would appreciate some assistance. I have managed to come up with a few web scrapers before, but it's always through a lot of resistance and is quite the uphill battle.
What it looks like to me is that the links I need are ultimately stored somewhere like this:
<article class="cd cd--card cd--article cd--idx-1 cd--extra-small cd--has-siblings cd--media__image" data-vr-contentbox="/2015/10/02/travel/samantha-brown-travel-channel-feat/index.html" data-eq-pts="xsmall: 0, small: 300, medium: 460, large: 780, full16x9: 1100" data-eq-state="small">
I can grab that link after data-vr-contentbox and append it to the end of www.cnn.com and it brings me to the page I need. My problem is in actually grabbing that link. I've tried various forms to grab them. My current iteration is as follows:
r = requests.get("http://www.cnn.com/")
data = r.text
soup = BeautifulSoup(data)
for link in soup.findAll("article"):
test = link.get("data-vr-contentbox")
print(test)
My issue here is that it only seems to grab a small number of things that I actually need. I'm only seeing two articles from politics, none from travel, etc. I would appreciate some assistance in resolving this issue. I'm looking to grab all of the links under each topic. Right now I'm just looking at politics or travel as a base to get started.
Particularly, I want to be able to specify the topic (tech, travel, politics, etc.) and grab those headlines. Whether I could simply grab the links and use those to get the headline from the respective page, or simply grab the headlines from here... I seem unable to do either. It would be nice to be able to view everything in a single topic at once, but finding out how to narrow this down isn't proving very simple.
An example article is the "IOS 9's Wi-Fi Assist feature costly" which can be found within tags.
I want to be able to find ALL articles under, say, the Tech heading on the homepage and isolate those tags to grab the headline. The tags for this headline look like this:
<div class="strip-rec-link-title ob-tcolor">IOS 9's Wi-Fi Assist feature costly</div>
Yet I don't know how to do BOTH of these things. I can't even seem to grab the headline, despite it being within tags when I try this:
for link in soup.findAll("div"):
print("")
print(link)
I feel like I have a fundamental misunderstanding somewhere, although I've managed to do some scrapers before.
My guess is that the cnn.com website has a bunch of javascript which renders a lot of the content after beautifulsoup reads it. I opened cnn.com and looked at the source in safari and there were 197 instances of data-vr-contentbox. However when I ran it through beautifulsoup and dumped it out there were only 13 instances of data-vr-contentbox.
There are a bunch of posts out there about handling it. You can start with the method used in this question: Scraping Javascript driven web pages with PyQt4 - how to access pages that need authentication?

How to extract budget, gross, metascore from imdb using scrapy and beautifulsoup?

I am staring with the url below:
http://www.imdb.com/chart/top
The structure of the HTML file seems to be so confusing:
"
Metascore: "
I am trying to use a format like this:
movie['metascore'] = self.get_text(soup.find('h4', attrs={'&nbsp':'Metascore'}))
I'll take a stab at this since it sounds like you're new to scraping. What it sounds like you're actually trying to do is to get the budget, gross, and metascore from each of the individual 250 movie pages on IMDB. You're on the right track by mentioning Scrapy because you do have to crawl to those pages from the initial URL you provided. Scrapy has some excellent documentation, so if you want to use it, I highly recommend you start there first.
However, if all you need is to scrape those 250 pages, you're better off just using Beautiful Soup to do the whole job. Simply do a soup.findAll("td", {"class":"titleColumn"}), extract the links, then do a loop where you have Beautiful Soup open each of the those pages one at a time. If you're not sure how to do that, again, BS has excellent documentation.
From there, it's just a matter of scraping the relevant data you want during each iteration. For instance, the metascore of each film is inside the a <div> of the class star-box-details. Do a .find for that and then you'll have to do some regular expressions to extract the exact piece you want (regular-expressions.info has a great tutorial on regex and if you really get into regex, you'll probably end up sinking hours into RexEgg).
I'm not going to code the whole thing since you'll learn a lot through the trial and error that comes with attempting to solve things, but hopefully that puts you on the right track. However, do note that IMDB forbids scraping, but for small projects I'm sure no one will care. But if you want to get serious, the "Does IMDB provide an API?" post has some excellent resources for how to do it via various third-party APIs (and some even directly from IMDB). In your case, the best might be to simply download the data as text files directly from IMDB. Click on any of the FTP links. The files you'll probably want are business.list.gz and ratings.list.gz. As for the metascore on each movie page, that rating actually comes from Metacritic, so you'll want to go there to pull that data.
Good luck!

Comments scraping without using Api

I am using scrapy to scrape reviews about books from a site. Till now i have made a crawler and scraped comments of a single book by giving its url as starting url by myself and i even had to give tags of comments about that book by myself after finding it from page's source code. Ant it worked. But the problem is that till now the work i've done manually i want it to be done automatically. i.e. I want some way that crawler should be able to find book's page in the website and scrape its comments. I am extracting comments from goodreads and it doesn't provide a uniform method for url's or even tags are also different for different books. Plus i don't want to use Api. I want to do all work by myself. Any help would be appreciated.
It seems, that CrawlSpider can fit your needs.
You can start as follows:
Specify list of starting url(s) for the crawler start_urls = ['https://www.goodreads.com'].
To identify urls with books you can create the following Rule:
rules = (
Rule(SgmlLinkExtractor(allow=(r'book/show/.+', )), callback='parse_comments'),
)
HtmlAgilityPack helped me in parsing and reading Xpath for the reviews. It worked :)

How do you grab a headline from a blog/article like techmeme?

I'm a creating a type of news aggregator and I would like to create a program(Python) that correctly detects the headline and displays it. How would I go about doing this? Is this a machine learning problem?
I would appreciate any articles or books that would point me in the right direction.
My past attempts have included BeautifulSoup and Requests module. Any other open source models I should check out?
Thank you,
Fernando
The direct way to scrape a web page requires human learning - look at the page, decide what you think are headlines, find out how they are tagged, and then look for those tags using a parser like BeautifulSoup. For example, the level 1 headlines on Techmeme currently are labeled:
<DIV CLASS="ii">
and the level 2 headlines are:
<STRONG CLASS="L1">
After your program fetches the page and matches the tags you're interested in, see if they identify what you're looking for. If some headlines are missed, add additional tags to your search list. If you get false positives (hits on links that aren't headlines), weeding them out will require extra page-dependent logic. There is no magic to reverse engineering, just grunt work and testing and periodic revalidation to be sure the webmaster hasn't switched things up on you.
After playing around a bit I find that this works best:
Use BeautifuSoup and Requests module
r = requests.get('http://example.com')
soup = BeautifulSoup(r.text)
if soup.findAll('title'):
title = soup.find('title')
print title.renderContents()
What results is title text that should be cleaned up a bit using regular expressions.
Maybe it could be much easer with parsing their RSS\Atom feeds. Google easily delivers these links http://wiki.python.org/moin/RssLibraries and http://pypi.python.org/pypi/Atomisator/1.3
But those are pure XML, so you could use built-in urllib and XML(DOM or SAX) libraries

Web scraping - how to identify main content on a webpage

Given a news article webpage (from any major news source such as times or bloomberg), I want to identify the main article content on that page and throw out the other misc elements such as ads, menus, sidebars, user comments.
What's a generic way of doing this that will work on most major news sites?
What are some good tools or libraries for data mining? (preferably python based)
There are a number of ways to do it, but, none will always work. Here are the two easiest:
if it's a known finite set of websites: in your scraper convert each url from the normal url to the print url for a given site (cannot really be generalized across sites)
Use the arc90 readability algorithm (reference implementation is in javascript) http://code.google.com/p/arc90labs-readability/ . The short version of this algorithm is it looks for divs with p tags within them. It will not work for some websites but is generally pretty good.
A while ago I wrote a simple Python script for just this task. It uses a heuristic to group text blocks together based on their depth in the DOM. The group with the most text is then assumed to be the main content. It's not perfect, but works generally well for news sites, where the article is generally the biggest grouping of text, even if broken up into multiple div/p tags.
You'd use the script like: python webarticle2text.py <url>
There's no way to do this that's guaranteed to work, but one strategy you might use is to try to find the element with the most visible text inside of it.
Diffbot offers a free(10.000 urls) API to do that, don't know if that approach is what you are looking for, but it might help someone http://www.diffbot.com/
Check the following script. It is really amazing:
from newspaper import Article
URL = "https://www.ksat.com/money/philippines-stops-sending-workers-to-qatar"
article = Article(URL)
article.download()
print(article.html)
article.parse()
print(article.authors)
print(article.publish_date)
#print(article.text)
print(article.top_image)
print(article.movies)
article.nlp()
print(article.keywords)
print(article.summary)
More documentation can be found at http://newspaper.readthedocs.io/en/latest/ and https://github.com/codelucas/newspaper you should install it using:
pip3 install newspaper3k
For a solution in Java have a look at https://github.com/kohlschutter/boilerpipe :
The boilerpipe library provides algorithms to detect and remove the surplus "clutter" (boilerplate, templates) around the main textual content of a web page.
The library already provides specific strategies for common tasks (for example: news article extraction) and may also be easily extended for individual problem settings.
But there is also a python wrapper around this available here:
https://github.com/misja/python-boilerpipe
It might be more useful to extract the RSS feeds (<link type="application/rss+xml" href="..."/>) on that page and parse the data in the feed to get the main content.
Another possibility of separating "real" content from noise is by measuring HTML density of the parts of a HTML page.
You will need a bit of experimentation with the thresholds to extract the "real" content, and I guess you could improve the algorithm by applying heuristics to specify the exact bounds of the HTML segment after having identified the interesting content.
Update: Just found out the URL above does not work right now; here is an alternative link to a cached version of archive.org.
There is a recent (early 2020) comparison of various methods of extracting article body, without and ads, menus, sidebars, user comments, etc. - see https://github.com/scrapinghub/article-extraction-benchmark. A report, data and evaluation scripts are available. It compares many options mentioned in the answers here, as well as some options which were not mentioned:
python-readability
boilerpipe
newspaper3k
dragnet
html-text
Diffbot
Scrapinghub AutoExtract
In short, "smart" open source libraries are adequate if you need to remove e.g. sidebar and menu, but they don't handle removal of unnecessary content inside articles, and are quite noisy overall; sometimes they remove an article itself and return nothing. Commercial services use Computer Vision and Machine Learning, which allows them to provide a much more precise output.
For some use cases simpler libraries like html-text are preferrable, both to commercial services and to "smart" open source libraries - they are fast, and ensure information is not missing (i.e. recall is high).
I would not recommend copy-pasting code snippets, as there are many edge cases even for a seemingly simple task of extracting text from HTML, and there are libraries available (like html-text or html2text) which should be handling these edge cases.
To use a commercial tool, in general one needs to get an API key, and then use a client library. For example, for AutoExtract by Scrapinghub (disclaimer: I work there) you would need to install pip install scrapinghub-autoextract. There is a Python API available - see https://github.com/scrapinghub/scrapinghub-autoextract README for details, but an easy way to get extractions is to create a .txt file with URLs to extract, and then run
python -m autoextract urls.txt --page-type article --api-key <API_KEY> --output res.jl
I wouldn't try to scrape it from the web page - too many things could mess it up - but instead see which web sites publish RSS feeds. For example, the Guardian's RSS feed has most of the text from their leading articles:
http://feeds.guardian.co.uk/theguardian/rss
I don't know if The Times (The London Times, not NY) has one because it's behind a paywall. Good luck with that...

Categories