Extracting data from websites using Python - python

I'm pretty new to web development and I have an idea that I would like to explore, so I'd like some advice on which tools to use. I know Python and have been learning Django recently, so I would ideally like to incorporate them.
What I want to do involves some basic HTML parsing and, I think, regular expressions. Basically, I want to aggregate certain bits of useful information from several websites into one site. Suppose, for example, there are a dozen high schools whose graduation dates, times, and locations I'm interested in knowing. The information on each high school site is presented in roughly the same way, so I want to extract the text that follows "location" or "venue", "time", "date", etc. and have it posted automatically on my site. I would also like it updated if any of the info changes on any of the high school sites.
What would you use to accomplish this task? Also, if you know of any useful tutorials, resources, etc that you could point me to, that would be much appreciated!

For the extraction part I think your best bet would be Beautiful Soup, mostly because it's easy to use and will try to parse anything, even broken XML/HTML.
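As a rough illustration of the "word after the label" idea from the question, here is a minimal sketch. It deliberately uses only the standard library's html.parser and re instead of Beautiful Soup, so it runs without extra installs; with Beautiful Soup the text-collection step would be replaced by soup.get_text(). The sample markup, the label list, and the assumption that each field sits in its own text node as "Label: value" are all made up for the example.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the visible text of a page, one chunk per text node."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.chunks.append(text)


def extract_fields(html, labels=("location", "venue", "date", "time")):
    """Return {label: value} for text nodes shaped like 'Label: value'."""
    parser = TextExtractor()
    parser.feed(html)
    fields = {}
    for chunk in parser.chunks:
        match = re.match(r"(\w+)\s*:\s*(.+)", chunk)
        if match and match.group(1).lower() in labels:
            fields[match.group(1).lower()] = match.group(2).strip()
    return fields


# Hypothetical markup standing in for one school's page.
sample = """
<div class="event">
  <p>Date: June 12, 2021</p>
  <p>Time: 6:00 PM</p>
  <p>Venue: Main Gymnasium</p>
</div>
"""
print(extract_fields(sample))
```

Real pages often put the label and the value in separate tags, in which case the pattern would need to be adapted per site.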

Check out BeautifulSoup
Update:
If you want to fill forms you can use mechanize

Related

How to set a date range for scraping google search using Python?

I would like to know whether it is possible to scrape Google search results for a specific date range. I read about googlesearch and I am trying to use its search function. However, something does not seem to be working.
Using 'cdr:1,cd_min:01/01/2020,cd_max:01/01/2020' to search for all results about a query (for example, Kevin Spacey) does not return the expected URLs. I guess something is wrong with the function (as defined in the library). Has anyone tried to use it?
I am looking for results in Italian (only pages in Italian, from the google.it domain). Other ways to scrape these results would also be welcome.
Many thanks
This information may help you:
Then use an HTTP spy to inspect the details of the request. This is useful when Google changes its search format and the module has not yet been updated to match.
Good luck!
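For comparison with what the HTTP spy shows, here is a small sketch that builds a date-restricted search URL by hand. It assumes the 'cdr:...' string from the question belongs in the tbs query parameter, as it appears in browser URLs when a custom date range is selected; that parameter is not officially documented and may change. The function name and the hl/lr language settings are choices made for this example.

```python
from urllib.parse import urlencode


def google_search_url(query, date_min, date_max, domain="www.google.it", lang="it"):
    """Build a search URL restricted to a custom date range (MM/DD/YYYY)."""
    params = {
        "q": query,
        # Assumed parameter: custom date range, as seen in browser URLs.
        "tbs": f"cdr:1,cd_min:{date_min},cd_max:{date_max}",
        "hl": lang,            # interface language
        "lr": f"lang_{lang}",  # restrict results to pages in this language
    }
    return f"https://{domain}/search?{urlencode(params)}"


url = google_search_url("Kevin Spacey", "01/01/2020", "01/01/2020")
print(url)
```

If the URL the library sends differs from the one your browser sends for the same search, that difference is a good place to start debugging.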

Scrapy and possibilities available

I’m looking into web scraping/crawling possibilities and have been reading up on Scrapy. Does anyone know whether it’s possible to put instructions into the script so that, once it has visited the URL, it can choose pre-selected dates from a calendar on the website?
End result is for this to be used for price comparisons on sites such as Trivago. I’m hoping I can get the program to select certain criteria such as dates once on the website like a human would.
Thanks,
Alex
In theory, for a website like Trivago, you can use the URL to set the dates you want to query, but you will need to research user agents and proxies; otherwise your IP will get blacklisted really fast.
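A minimal sketch of that idea: the dates go into the query string and the request carries an explicit User-Agent header. The endpoint and the 'checkin'/'checkout' parameter names are hypothetical; the real ones have to be read off the site's own URLs. Nothing is sent over the network here.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_price_request(base_url, check_in, check_out, user_agent):
    """Return a Request whose URL carries the dates and whose headers
    identify the client with an explicit User-Agent string."""
    # 'checkin' and 'checkout' are hypothetical parameter names.
    query = urlencode({"checkin": check_in, "checkout": check_out})
    return Request(f"{base_url}?{query}", headers={"User-Agent": user_agent})


req = build_price_request(
    "https://hotels.example.com/search",  # placeholder, not a real endpoint
    "2021-07-01",
    "2021-07-05",
    "Mozilla/5.0 (compatible; price-compare-sketch)",
)
print(req.full_url)
print(req.get_header("User-agent"))
```

Passing req to urllib.request.urlopen would actually send it; proxies can be configured separately with urllib.request.ProxyHandler.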

Can I scrape all URL results using Python from a google search without getting blocked?

I realize that versions of this question have been asked and I spent several hours the other day trying a number of strategies.
What I would like to do is use Python to scrape all of the URLs from a Google search, so that I can use them in a separate script for text analysis of a large corpus (news sites mainly). This seems relatively straightforward, but none of my attempts have worked properly.
This is as close as I got:
from google import search

# Print each result URL as the generator yields it.
for url in search('site:cbc.ca "kinder morgan" and "trans mountain" and protest*', stop=100):
    print(url)
This returned about 300 URLs before I got kicked. An actual search using these parameters provides about 1000 results and I'd like all of them.
First: is this possible? Second: does anyone have any suggestions to do this? I basically just want a txt file of all the URLs that I can use in another script.
It seems that this package uses screen scraping to retrieve search results from google, so it doesn't play well with Google's Terms of Service which could be the reason why you've been blocked.
The relevant clause in Google's Terms of Service:
Don’t misuse our Services. For example, don’t interfere with our Services or try to access them using a method other than the interface and the instructions that we provide. You may use our Services only as permitted by law, including applicable export and re-export control laws and regulations. We may suspend or stop providing our Services to you if you do not comply with our terms or policies or if we are investigating suspected misconduct.
I haven't been able to find a definite number for the scraping limit, but their official limit also seems rather strict: the JSON Custom Search API documentation gives 100 search queries per day.
Nonetheless, there's no harm trying out other alternatives to see if they work better:
BeautifulSoup
Scrapy
ParseHub - this one is not in code, but is a useful piece of software with good documentation. Link to their tutorial on how to scrape a list of URLs.
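If you go through the JSON Custom Search API mentioned above instead of screen scraping, results come in pages. The sketch below only builds the request URLs, one per page; the API key and search engine ID are placeholders, and the function name is made up for the example. As far as I know the API serves at most 10 results per request and at most 100 per query, so it may not reach the roughly 1000 results a manual search shows.

```python
from urllib.parse import urlencode

API_ENDPOINT = "https://www.googleapis.com/customsearch/v1"


def custom_search_urls(query, api_key, engine_id, total=100, page_size=10):
    """Yield one request URL per page of results.

    'start' is the 1-based index of the first result on the page, so it
    advances in steps of page_size (1, 11, 21, ...).
    """
    for start in range(1, total + 1, page_size):
        params = {
            "key": api_key,   # placeholder credentials
            "cx": engine_id,  # placeholder search engine ID
            "q": query,
            "num": page_size,
            "start": start,
        }
        yield f"{API_ENDPOINT}?{urlencode(params)}"


pages = list(custom_search_urls('site:cbc.ca "kinder morgan"', "API_KEY", "ENGINE_ID"))
print(len(pages))
print(pages[0])
```

Each JSON response contains an "items" list whose entries carry a "link" field; writing those links one per line gives the txt file of URLs the question asks for.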

How to scrape tag information from questions on Stack Exchange

My problem is that I want to create a data base of all of the questions, answers, and most importantly, the tags, from a certain (somewhat small) Stack Exchange. The relationships among tags (e.g. tags more often used together have a strong relation) could reveal a lot about the structure of the community and popularity or interest in certain sub fields.
So, what is the easiest way to go through a list of questions (that are positively ranked) and extract the tag information using Python?
The easiest way to get the shared-tag count for all questions is to use the Stack Exchange API.
import requests

# Ask the API for the three tags most often used together with 'python'.
r = requests.get(
    'http://api.stackexchange.com/2.2/tags/python/related?pagesize=3&site=stackoverflow')
for item in r.json()['items']:
    print("{name} shares {count} tags with Python".format(**item))
If this doesn't satisfy your need, there are many other API queries available.
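For the tag relationships the question is after, one option is to fetch the questions themselves and count tag pairs locally. This is a minimal sketch that assumes each question is a dict with a "tags" list, which is the shape of the question items the API returns; the sample data below is made up and no request is made.

```python
from collections import Counter
from itertools import combinations


def tag_pair_counts(questions):
    """Count how often each pair of tags appears on the same question."""
    counts = Counter()
    for question in questions:
        # Sort so ('a', 'b') and ('b', 'a') land on the same key.
        tags = sorted(set(question["tags"]))
        counts.update(combinations(tags, 2))
    return counts


# Made-up sample in the shape of the API's question items.
sample = [
    {"tags": ["python", "django", "orm"]},
    {"tags": ["python", "django"]},
    {"tags": ["python", "numpy"]},
]
for pair, n in tag_pair_counts(sample).most_common(2):
    print(pair, n)
```

Pairs with high counts are the "strongly related" tags; the same Counter can feed a graph library if you want to visualize the structure.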
Visit the site to find the URL that shows the information you want, then look at the page source to see how it has been formatted.
To fetch the pages, use the urllib2 library (urllib.request in Python 3).
Parse the text using the BeautifulSoup library.
Place the data into a database.
The difficult thing is going to be structuring your database and developing queries that reveal what you want.
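One possible starting point for the database step, sketched with the standard library's sqlite3. The two-table layout, the column names, and the sample rows are assumptions made for illustration, not a recommended final schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # use a file path to persist the data
conn.executescript("""
    CREATE TABLE question (id INTEGER PRIMARY KEY, title TEXT, score INTEGER);
    CREATE TABLE question_tag (
        question_id INTEGER REFERENCES question(id),
        tag TEXT
    );
""")

# Made-up rows standing in for scraped questions.
scraped = [
    (1, "How do I parse HTML?", 5, ["python", "html"]),
    (2, "Regex or parser?", 3, ["python", "regex"]),
    (3, "Broken markup", -1, ["html"]),
]
for qid, title, score, tags in scraped:
    conn.execute("INSERT INTO question VALUES (?, ?, ?)", (qid, title, score))
    conn.executemany(
        "INSERT INTO question_tag VALUES (?, ?)", [(qid, t) for t in tags]
    )

# Tag frequency among positively scored questions only.
rows = conn.execute("""
    SELECT qt.tag, COUNT(*) AS n
    FROM question_tag AS qt
    JOIN question AS q ON q.id = qt.question_id
    WHERE q.score > 0
    GROUP BY qt.tag
    ORDER BY n DESC, qt.tag
""").fetchall()
print(rows)
```

Keeping tags in their own table, one row per question/tag pair, makes co-occurrence queries a self-join on question_id.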

Web scraping - how to identify main content on a webpage

Given a news article webpage (from any major news source such as times or bloomberg), I want to identify the main article content on that page and throw out the other misc elements such as ads, menus, sidebars, user comments.
What's a generic way of doing this that will work on most major news sites?
What are some good tools or libraries for data mining? (preferably python based)
There are a number of ways to do it, but none will always work. Here are the two easiest:
If it's a known, finite set of websites: in your scraper, convert each URL from the normal URL to the print URL for the given site (this cannot really be generalized across sites).
Use the arc90 readability algorithm (reference implementation is in javascript) http://code.google.com/p/arc90labs-readability/ . The short version of this algorithm is it looks for divs with p tags within them. It will not work for some websites but is generally pretty good.
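To make the "divs with p tags within them" idea concrete, here is a much-simplified sketch. It is not the arc90 algorithm itself, only the core intuition: credit the text of every p to its nearest enclosing div and pick the div with the most. It uses the standard library's html.parser; the class, function, and sample markup are made up for the example.

```python
from html.parser import HTMLParser


class ParagraphScorer(HTMLParser):
    """Credit the text of every <p> to the nearest enclosing <div>."""

    def __init__(self):
        super().__init__()
        self.div_stack = []  # indices of the currently open <div>s
        self.div_ids = []    # id attribute (or None) of each <div> seen
        self.scores = []     # characters of <p> text credited to each <div>
        self.in_p = False    # True while inside a <p>

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self.div_stack.append(len(self.scores))
            self.div_ids.append(dict(attrs).get("id"))
            self.scores.append(0)
        elif tag == "p":
            self.in_p = True

    def handle_endtag(self, tag):
        if tag == "div" and self.div_stack:
            self.div_stack.pop()
        elif tag == "p":
            self.in_p = False

    def handle_data(self, data):
        if self.in_p and self.div_stack:
            self.scores[self.div_stack[-1]] += len(data.strip())


def best_div_id(html):
    """Return the id of the <div> holding the most paragraph text."""
    scorer = ParagraphScorer()
    scorer.feed(html)
    if not scorer.scores:
        return None
    best = max(range(len(scorer.scores)), key=scorer.scores.__getitem__)
    return scorer.div_ids[best]


# Made-up page: navigation, a story, and an ad.
sample = """
<div id="nav"><a href="/">Home</a> <a href="/news">News</a></div>
<div id="story">
  <p>The council voted on Tuesday to approve the new budget.</p>
  <p>Opponents said the plan cuts too deeply into school funding.</p>
</div>
<div id="ads"><p>Buy now!</p></div>
"""
print(best_div_id(sample))
```

The real algorithm also weighs class names, link density, and comma counts, which is why it copes better with messy pages than this sketch would.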
A while ago I wrote a simple Python script for just this task. It uses a heuristic to group text blocks together based on their depth in the DOM. The group with the most text is then assumed to be the main content. It's not perfect, but works generally well for news sites, where the article is generally the biggest grouping of text, even if broken up into multiple div/p tags.
You'd use the script like: python webarticle2text.py <url>
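The depth-grouping heuristic can be sketched roughly as follows. This is not the webarticle2text.py script, just one simple reading of the idea: record every text block together with its nesting depth, then return the depth group containing the most text. The tag sets and sample page are assumptions for the example.

```python
from collections import defaultdict
from html.parser import HTMLParser

VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}  # never closed
SKIP_TAGS = {"script", "style"}                           # not visible text


class DepthGrouper(HTMLParser):
    """Group visible text blocks by their nesting depth in the document."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.skip = 0
        self.groups = defaultdict(list)

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        self.depth += 1
        if tag in SKIP_TAGS:
            self.skip += 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        self.depth = max(0, self.depth - 1)
        if tag in SKIP_TAGS and self.skip:
            self.skip -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self.skip:
            self.groups[self.depth].append(text)


def main_text(html):
    """Return the text of the depth group that contains the most characters."""
    grouper = DepthGrouper()
    grouper.feed(html)
    if not grouper.groups:
        return ""
    best = max(grouper.groups.values(), key=lambda blocks: sum(map(len, blocks)))
    return "\n".join(best)


# Made-up page with a menu and a two-paragraph article.
sample = """<html><body>
<ul><li><a href="/">Home</a></li><li><a href="/world">World</a></li></ul>
<div><p>First paragraph of the article body, long enough to dominate.</p>
<p>Second paragraph continuing the same story in more detail.</p></div>
</body></html>"""
print(main_text(sample))
```

Grouping by depth alone will merge unrelated blocks that happen to sit at the same level, so a practical version would probably also take the parent element into account.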
There's no way to do this that's guaranteed to work, but one strategy you might use is to try to find the element with the most visible text inside of it.
Diffbot offers a free (10,000 URLs) API to do that. I don't know if that approach is what you are looking for, but it might help someone: http://www.diffbot.com/
Check the following script. It is really amazing:
from newspaper import Article
URL = "https://www.ksat.com/money/philippines-stops-sending-workers-to-qatar"
article = Article(URL)
article.download()
print(article.html)
article.parse()
print(article.authors)
print(article.publish_date)
#print(article.text)
print(article.top_image)
print(article.movies)
article.nlp()
print(article.keywords)
print(article.summary)
More documentation can be found at http://newspaper.readthedocs.io/en/latest/ and https://github.com/codelucas/newspaper. You should install it using:
pip3 install newspaper3k
For a solution in Java have a look at https://github.com/kohlschutter/boilerpipe :
The boilerpipe library provides algorithms to detect and remove the surplus "clutter" (boilerplate, templates) around the main textual content of a web page.
The library already provides specific strategies for common tasks (for example: news article extraction) and may also be easily extended for individual problem settings.
But there is also a python wrapper around this available here:
https://github.com/misja/python-boilerpipe
It might be more useful to extract the RSS feeds (<link type="application/rss+xml" href="..."/>) on that page and parse the data in the feed to get the main content.
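Finding those link elements takes only a few lines with the standard library. In this sketch the class and function names and the sample page are made up; relative href values are resolved against the page URL with urljoin.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class FeedLinkFinder(HTMLParser):
    """Collect the href of every <link type="application/rss+xml">."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if (
            tag == "link"
            and attrs.get("type") == "application/rss+xml"
            and attrs.get("href")
        ):
            # Resolve relative links against the page's own URL.
            self.feeds.append(urljoin(self.base_url, attrs["href"]))


def find_feeds(html, base_url):
    """Return the absolute URLs of all RSS feeds advertised by a page."""
    finder = FeedLinkFinder(base_url)
    finder.feed(html)
    return finder.feeds


# Made-up page head advertising one feed.
page = (
    '<html><head><link rel="alternate" type="application/rss+xml" '
    'href="/rss/news.xml"/></head><body></body></html>'
)
print(find_feeds(page, "https://news.example.com/article/1"))
```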
Another possibility of separating "real" content from noise is by measuring HTML density of the parts of a HTML page.
You will need a bit of experimentation with the thresholds to extract the "real" content, and I guess you could improve the algorithm by applying heuristics to specify the exact bounds of the HTML segment after having identified the interesting content.
Update: I just found out that the URL above does not work right now; here is an alternative link to a cached version on archive.org.
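One simple way to read "HTML density" is the share of characters in a fragment that are text rather than markup. The sketch below applies that measure line by line; the thresholds, the regular expression for tags, and the sample are assumptions to experiment with, and the linked article may define density differently.

```python
import re

TAG_RE = re.compile(r"<[^>]+>")


def text_density(fragment):
    """Share of a fragment's characters that are text rather than markup."""
    total = len(fragment)
    if total == 0:
        return 0.0
    text = TAG_RE.sub("", fragment)
    return len(text.strip()) / total


def dense_lines(html, threshold=0.5, min_chars=20):
    """Keep the text of the lines of HTML that are mostly text."""
    kept = []
    for line in html.splitlines():
        line = line.strip()
        text = TAG_RE.sub("", line).strip()
        if len(text) >= min_chars and text_density(line) >= threshold:
            kept.append(text)
    return kept


# Made-up fragment: a menu, one real sentence, and an ad.
sample = """
<div class="nav"><a href="/a">A</a><a href="/b">B</a><a href="/c">C</a></div>
<p>The committee published its findings after a year-long review of the data.</p>
<div class="ad"><img src="banner.png"><span class="x">Sale</span></div>
"""
print(dense_lines(sample))
```

Working line by line only makes sense when the markup has sensible line breaks; for minified pages the same measure would have to be applied per element instead.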
There is a recent (early 2020) comparison of various methods of extracting the article body without ads, menus, sidebars, user comments, etc.: see https://github.com/scrapinghub/article-extraction-benchmark. A report, data, and evaluation scripts are available. It compares many of the options mentioned in the answers here, as well as some that were not mentioned:
python-readability
boilerpipe
newspaper3k
dragnet
html-text
Diffbot
Scrapinghub AutoExtract
In short, "smart" open source libraries are adequate if you need to remove e.g. sidebar and menu, but they don't handle removal of unnecessary content inside articles, and are quite noisy overall; sometimes they remove an article itself and return nothing. Commercial services use Computer Vision and Machine Learning, which allows them to provide a much more precise output.
For some use cases, simpler libraries like html-text are preferable both to commercial services and to "smart" open source libraries: they are fast and ensure that information is not missing (i.e. recall is high).
I would not recommend copy-pasting code snippets, as there are many edge cases even for a seemingly simple task of extracting text from HTML, and there are libraries available (like html-text or html2text) which should be handling these edge cases.
To use a commercial tool, in general one needs to get an API key and then use a client library. For example, for AutoExtract by Scrapinghub (disclaimer: I work there) you would need to run pip install scrapinghub-autoextract. There is a Python API available (see the README at https://github.com/scrapinghub/scrapinghub-autoextract for details), but an easy way to get extractions is to create a .txt file with the URLs to extract, and then run
python -m autoextract urls.txt --page-type article --api-key <API_KEY> --output res.jl
I wouldn't try to scrape it from the web page - too many things could mess it up - but instead see which web sites publish RSS feeds. For example, the Guardian's RSS feed has most of the text from their leading articles:
http://feeds.guardian.co.uk/theguardian/rss
I don't know if The Times (The London Times, not NY) has one because it's behind a paywall. Good luck with that...
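Once you have a feed, pulling the article text out of it is straightforward with the standard library. This sketch assumes a plain RSS 2.0 layout with item elements containing title, link, and description; the sample feed is made up, and a real one would be downloaded from the publisher's feed URL first.

```python
import xml.etree.ElementTree as ET


def feed_items(rss_xml):
    """Return (title, link, description) for every <item> in an RSS 2.0 feed."""
    root = ET.fromstring(rss_xml)
    items = []
    for item in root.iter("item"):
        items.append((
            item.findtext("title", default=""),
            item.findtext("link", default=""),
            item.findtext("description", default=""),
        ))
    return items


# Made-up feed with a single entry.
sample = """<?xml version="1.0"?>
<rss version="2.0"><channel>
  <title>Example News</title>
  <item>
    <title>Budget approved</title>
    <link>https://news.example.com/budget</link>
    <description>The council voted on Tuesday.</description>
  </item>
</channel></rss>"""
for title, link, description in feed_items(sample):
    print(title, link, description, sep=" | ")
```

Many feeds carry only a summary in description, so how much of the article you get depends on the publisher.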
