I have a list of university URLs like www.harvard.edu, www.berkeley.edu, etc.
I need to find the CSE department URL on each of these websites. What I initially set out to do was to crawl through the links on a given URL and, by specifying a depth (say 3), follow links and look for "cse", "computer", or a list of such words in the links scraped on each page. Matching links, along with their anchor text, should be returned as results into a CSV file.
If no links containing "cse" or similar words are found, it should return "not found" or something like that.
The idea is to push the CSV file onto a database later. How can I do this?
This is quite a complex task, and I recommend using a database with a structure like this:
TABLE pages (
    `absolute_url` VARCHAR(255) NOT NULL,
    `visited` TINYINT(1) DEFAULT 0,
    -- Additional fields
    UNIQUE KEY (`absolute_url`)
)
A little explanation:
absolute_url contains the full URL of the page (starting with http[s]://) and has a unique index placed on it. This way you make sure you won't end up in recursion or process the same link twice.
visited tells you whether the site has already been visited (and processed). This field again prevents double visits and allows you to recover gracefully if your program crashes (e.g. network downtime).
You may implement these things on your own with CSV files or associative arrays, but a database is the most familiar solution for me.
And the algorithm would go like this:
database.insert('http://www.harvard.edu')
database.insert('http://www.berkeley.edu')

# In case of failure you'll start at this point:
while database.get_count(WHERE visited = 0) > 0:
    for url in database.get_records(WHERE visited = 0):
        content = http_client.load(url)
        time.sleep(5)  # You don't want to flood the server

        # Problematic URLs will be parsed later
        if (not content) or (http_client.is_error):
            continue

        for i in content.get_all_urls():
            i = make_absolute(i, url)
            # Also don't crawl remote sites, images, ...
            if not is_valid_url(i):
                continue
            database.insert(i)

        # Mark the page as processed so it isn't fetched again
        database.mark_visited(url)
This is pseudocode, I won't implement it all for you.
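If you want something concrete to start from, here is a minimal sketch of that idea in Python using sqlite3, requests, and BeautifulSoup (those library choices, the keyword list, and the file names are my assumptions, not part of the pseudocode above). It keeps the frontier in a table with a unique URL column and a visited flag, and writes matching links to a CSV file; depth tracking is left out and would go in an extra column:

import csv
import sqlite3
import time
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

KEYWORDS = ("cse", "computer")

conn = sqlite3.connect("crawl.db")
conn.execute("""CREATE TABLE IF NOT EXISTS pages (
                    absolute_url TEXT NOT NULL UNIQUE,
                    visited INTEGER DEFAULT 0)""")

def enqueue(url):
    # The UNIQUE constraint silently drops URLs we have already seen
    conn.execute("INSERT OR IGNORE INTO pages (absolute_url) VALUES (?)", (url,))
    conn.commit()

for seed in ("http://www.harvard.edu", "http://www.berkeley.edu"):
    enqueue(seed)

with open("results.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["page", "link", "anchor_text"])
    while True:
        row = conn.execute(
            "SELECT absolute_url FROM pages WHERE visited = 0 LIMIT 1").fetchone()
        if row is None:
            break
        url = row[0]
        conn.execute("UPDATE pages SET visited = 1 WHERE absolute_url = ?", (url,))
        conn.commit()
        try:
            content = requests.get(url, timeout=10).text
        except requests.RequestException:
            continue  # problematic URLs are simply skipped in this sketch
        time.sleep(5)  # don't flood the server
        for a in BeautifulSoup(content, "html.parser").find_all("a", href=True):
            link = urljoin(url, a["href"])
            if urlparse(link).netloc != urlparse(url).netloc:
                continue  # stay on the same site
            if any(k in link.lower() for k in KEYWORDS):
                writer.writerow([url, link, a.get_text(strip=True)])
            enqueue(link)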
To solve your problem you can use the Scrapy framework.
Extracted from the Scrapy website:
Scrapy is a fast high-level screen scraping and web crawling framework, used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing.
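As a rough idea of what that could look like for your case (a sketch only: the spider name, keyword list, and output file are mine, and the FEEDS setting needs a reasonably recent Scrapy version), a spider with a depth limit can collect links whose URL or anchor text contains "cse" or "computer" and export them to CSV:

import scrapy
from scrapy.crawler import CrawlerProcess

KEYWORDS = ("cse", "computer")

class DeptSpider(scrapy.Spider):
    name = "departments"
    start_urls = ["http://www.harvard.edu", "http://www.berkeley.edu"]
    custom_settings = {
        "DEPTH_LIMIT": 3,  # follow links at most 3 hops deep
        "FEEDS": {"results.csv": {"format": "csv"}},  # write yielded items to CSV
    }

    def parse(self, response):
        for a in response.css("a"):
            href = a.attrib.get("href", "")
            text = a.css("::text").get("")
            if any(k in href.lower() or k in text.lower() for k in KEYWORDS):
                yield {"page": response.url,
                       "link": response.urljoin(href),
                       "anchor_text": text.strip()}
        # Keep following links found on this page, up to DEPTH_LIMIT
        yield from response.follow_all(css="a[href]", callback=self.parse)

process = CrawlerProcess()
process.crawl(DeptSpider)
process.start()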
I am quite new to Scrapy but am designing a web scraper to pull certain information from GoFundMe, specifically in this case the number of people who have donated to a project. I have written an XPath statement which works fine in Chrome but returns null in Scrapy.
A random example project is https://www.gofundme.com/f/passage/donations, which at present has 22 donations. The expression below, when entered in Chrome's inspector, gives me "Donations(22)", which is what I need:
//h2[@class="heading-5 mb0"]/text()
However, in my Scrapy spider the following yields null:
class DonationsSpider(scrapy.Spider):
    name = 'get_donations'
    start_urls = [
        'https://www.gofundme.com/f/passage/donations'
    ]

    def parse(self, response):
        amount_of_donations = response.xpath('//h2[@class="heading-5 mb0"]/text()').extract_first()
        yield {
            'Donations': amount_of_donations
        }
Does anyone know why Scrapy is unable to see this value?
I am doing this in an attempt to find out how many times the rest of the spider needs to loop, as when I hard code this value it works with no problems and yields all of the donations.
Well, that's because many requests go on behind the scenes to fulfil the request to "https://www.gofundme.com/f/passage/donations".
Your Chrome is smart enough to understand JavaScript: it executes the JavaScript code and fetches the responses from several different endpoints to build the page.
One of those requests goes to the endpoint "https://gateway.gofundme.com/web-gateway/v1/feed/passage/counts", which loads the data you're looking for. Your Python script doesn't execute JavaScript, and trying to replicate that is not recommended.
Instead, you can call that API directly and you'll get the data. The good news is that the endpoint responds with JSON, which is very structured and easy to parse.
I'm sure you're also interested in the data coming from this endpoint: "https://gateway.gofundme.com/web-gateway/v1/feed/passage/donations?limit=20&offset=0&sort=recent".
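As a minimal sketch of calling those endpoints with the requests package (I haven't verified the exact shape of the JSON they return, so inspect the responses before relying on specific fields):

import requests

# Endpoint behind the "Donations (N)" heading on the page
counts_url = "https://gateway.gofundme.com/web-gateway/v1/feed/passage/counts"
counts = requests.get(counts_url, timeout=10).json()
print(counts)  # inspect the structure to find the donation count

# Endpoint that returns the individual donations, paginated via limit/offset
donations_url = ("https://gateway.gofundme.com/web-gateway/v1/feed/passage/"
                 "donations?limit=20&offset=0&sort=recent")
donations = requests.get(donations_url, timeout=10).json()
print(donations)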
I have deployed some Scrapy spiders to scrape data which I can download in .csv from ScrapingHub.
Some of these spiders have a FilesPipeline which I used to download files (PDFs) to a specific folder. Is there any way I can retrieve these files from ScrapingHub via the platform or API?
Though I still have to go over Scrapinghub's documentation, I'm quite certain that despite there being a file explorer, no actual file is generated, or it gets ignored during the crawl. I assume so because if you try to deploy one of your projects with anything other than the files that correspond to a Scrapy project, Scrapinghub won't accept it unless you do some hacking around with your settings and setup file to get Scrapinghub to accept the extra files. For example, if you try to keep a ton of start URLs in a file and then use a helper function to parse them into your spider, it works like a charm locally, but Scrapinghub wasn't built with that in mind.
I assume you know that you can download your items in CSV or another desired format straight from the web interface. Personally I use the Scrapinghub client API in Python. All three of the libraries involved are, I believe, deprecated at this point, so you kind of have to mix and match to get something fully functional.
I have a side gig doing content aggregation for a fairly well-known adult website, so I spend a lot of time crawling this kind of content; for people like me it's just fun. Anyway, by using the Scrapinghub API client for Python I'm able to connect to my account with the API key and do as I please. Personally I think there are some limitations; the one that really bothers me is that the function to get the name of a project was deprecated in the first version of their client library. I'd like to see, while parsing my items, the name of the project the spider is running its jobs under, so when I first started messing around with the client it just looked messy.
What's even sweeter is that when you create a project, run your spider, and all your items are collected, you can directly download these files from the web interface as I mentioned, but I can also target my output to get a desired result.
For example, I'm crawling a site and getting media items like videos. There are three things you always need: the name of the media or the title of the video, the URL where the video can be reached or where it is embedded (which you can then request for every instance you need), and of course the metadata, i.e. the tags and categories associated with the video.
The largest crawl so far output about 150,000 items; it was a broad crawl and something like 15 or 17% of them were duplicates. I then access each video through the API client by its key values. Of course in my case I always use all three keys, but I can target the categories or tags under their corresponding key and output only the items (still outputting all three fields) that match a particular string or expression, which lets me sort through my content quite effectively. In this particular Scrapy project, I'm simply printing out, or rather creating, an .m3u playlist from all of it.
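For completeness, here is a minimal sketch of reading items back through the python-scrapinghub client (the API key, project ID, and item field names are placeholders; note this returns the scraped items, not the binary files a FilesPipeline downloaded):

from scrapinghub import ScrapinghubClient

client = ScrapinghubClient("YOUR_API_KEY")  # placeholder API key
project = client.get_project(123456)        # placeholder project ID

# Walk over finished jobs and iterate over the items each one scraped
for job_summary in project.jobs.iter(state="finished"):
    job = project.jobs.get(job_summary["key"])
    for item in job.items.iter():
        # 'title', 'url' and 'tags' are example field names, not a fixed schema
        if "example-tag" in item.get("tags", []):
            print(item.get("title"), item.get("url"))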
I have implemented a distributed web crawler on RabbitMQ. Everything is almost done except the visited-URL set. I want to have some kind of shared variable between the different crawlers.
Furthermore, as I have been reading, the size of this url set will be huge and should be stored in disk.
What is the best way to store, access and share this visited-urls list in a distributed environment?
As majidkabir says, Nutch is quite a good solution...but that doesn't answer the question since it's about how to track state when building your own crawler.
I'll offer the approach I took when I created a crawler in Node (https://www.npmjs.com/package/node-nutch). As you can see from the name, the approach I've taken is in turn modelled on the approach taken in Nutch.
All I did was use the URL as the key (after normalising it), and then store a simple JSON file in S3 that contained the state of the crawl. When it was time to run the next crawl I'd whizz through each of these JSON files looking for candidates to be crawled, and then, after retrieving the page, update the JSON to indicate when to crawl next.
The number of pages I was crawling was never very large so this worked fine, but if it did get larger I'd put the JSON into something like ElasticSearch and then search for URLs to be crawled based on a date field.
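To make that concrete, here is a rough sketch of the same per-URL JSON state kept in S3, written with boto3 since this thread is Python (the bucket name, key scheme, and recrawl interval are all my placeholders, not part of the original Node package):

import json
import hashlib
from datetime import datetime, timedelta, timezone

import boto3

s3 = boto3.client("s3")
BUCKET = "my-crawl-state"  # placeholder bucket name

def state_key(url):
    # One small JSON object per normalised URL, keyed by its hash
    return "state/" + hashlib.sha1(url.encode()).hexdigest() + ".json"

def mark_crawled(url, recrawl_days=7):
    now = datetime.now(timezone.utc)
    state = {
        "url": url,
        "last_crawled": now.isoformat(),
        "next_crawl": (now + timedelta(days=recrawl_days)).isoformat(),
    }
    s3.put_object(Bucket=BUCKET, Key=state_key(url), Body=json.dumps(state))

def should_crawl(url):
    try:
        obj = s3.get_object(Bucket=BUCKET, Key=state_key(url))
    except s3.exceptions.NoSuchKey:
        return True  # never seen before
    state = json.loads(obj["Body"].read())
    return state["next_crawl"] <= datetime.now(timezone.utc).isoformat()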
Ideally, any storage that is scalable and supports indexing can be used for such use cases.
Some of the systems that I know are being used for such purposes are Solr, ElasticSearch, Redis or any SQL databases that can scale.
I have used Redis for the same purpose and have been storing approximately 2 million URLs. I am quite sure that by adding more nodes I would be able to scale easily.
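As a minimal sketch of that idea with the redis-py package (the set name and connection details are arbitrary), a single Redis SET shared by all crawler workers answers "has anyone claimed this URL yet?" atomically:

import redis

r = redis.Redis(host="localhost", port=6379)  # shared by all crawler nodes

def try_claim(url):
    # SADD returns 1 if the URL was new (we claim it), 0 if it was already in the set
    return r.sadd("visited_urls", url) == 1

if try_claim("http://www.example.com/page"):
    print("new URL, crawl it")
else:
    print("already visited, skip it")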
You can use Apache Nutch for crawling; this library has the ability to re-crawl URLs at a specific interval and uses some algorithms for this purpose.
For example: when a page at a specific URL hasn't changed by the second crawl, Nutch increases the interval before the next crawl, and if it has changed, it decreases it.
You can create your own Nutch plugin for parsing the data that Nutch crawled, or use the predefined Nutch plugins.
tl;dr: I am using the Amazon Product Advertising API with Python. How can I do a keyword search for a book and get XML results that contain TITLE, ISBN, and PRICE for each entry?
Verbose version:
I am working in Python on a web site that allows the user to search for textbooks from different sites such as eBay and Amazon. Basically, I need to obtain simple information such as titles, ISBNs, and prices for each item from a set of search results from one of those sites. Then, I can store and format that information as needed in my application (e.g., displaying it as HTML).
In eBay's case, getting the info I needed wasn't too hard. I used urllib2 to make a request based on a sample I found. All I needed was a special security key to add to the URL:
def ebaySearch(keywords): #keywords is a list of strings, e.g. ['moby', 'dick']
    #findItemsAdvanced allows category filter -- 267 is books
    #Of course, I replaced my security appname in the example below
    url = "http://svcs.ebay.com/services/search/FindingService/v1?OPERATION-NAME=findItemsAdvanced&SERVICE-NAME=FindingService&SERVICE-VERSION=1.0.0&SECURITY-APPNAME=[MY-APPNAME]&RESPONSE-DATA-FORMAT=XML&REST-PAYLOAD&categoryId=267&keywords="

    #Complete the url...
    numKeywords = len(keywords)
    for k in range(0, numKeywords-1):
        url += keywords[k]
        url += "%20"

    #There should not be %20 after last keyword
    url += keywords[numKeywords-1]

    request = urllib2.Request(url)
    response = urllib2.urlopen(request) #file like thing (due to library conversion)
    xml_response = response.read()
    ...
...Then I parsed this with minidom.
In Amazon's case, it doesn't seem to be so easy. I thought I would start out by just looking for an easy wrapper. But their developer site doesn't seem to provide a python wrapper for what I am interested in (the Product Advertising API). One that I have tried, python-amazon-product-api 0.2.5 from https://pypi.python.org/pypi/python-amazon-product-api/, has been giving me some installation issues that may not be worth the time to look into (but maybe I'm just exasperated..). I also looked around and found pyaws and pyecs, but these seem to use deprecated authentication mechanisms.
I then figured I would just try to construct the URLs from scratch as I did for eBay. But Amazon requires a time stamp in the URLs, which I suppose I could programmatically construct (perhaps something like these folks, who go the whole 9 yards with the signature: https://forums.aws.amazon.com/thread.jspa?threadID=10048).
Even if that worked (which I doubt will happen, given the amount of frustration the logistics have given so far), the bottom line is that I want name, price, and ISBN for the books that I search for. I was able to generate a sample URL with the tutorial on the API website, and then see the XML result, which indeed contained titles and ISBNs. But no prices! Gah! After some desperate Google searching, a slight modification to the URL (adding &ResponseGroup=Offers and &MerchantID=All) did the trick, but then there were no titles. (I guess yet another question I would have, then, is where can I find an index of the possible ResponseGroup parameters?)
Overall, as you can see, I really just don't have a solid methodology for this. Is the construct-a-url approach a decent way to go, or will it be more trouble than it is worth? Perhaps the tl;dr at the top is a better representation of the overall question.
Another way could be amazon-simple-product-api:
from amazon.api import AmazonAPI

amazon = AmazonAPI(ACCESS_KEY, SECRET, ASSOC)
results = amazon.search(Keywords="book name", SearchIndex="Books")
for item in results:
    print item.title, item.isbn, item.price_and_currency
To install, just clone from github and run
sudo python setup.py install
Hope this helps!
If you have installation issues with python-amazon-product-api, send the details to the mailing list and you will be helped.
I'm new to software development, and I'm not sure how to go about this. I want to visit every page of a website and grab a specific bit of data from each one. My problem is, I don't know how to iterate through all of the existing pages without knowing the individual urls ahead of time. For example, I want to visit every page whose url starts with
"http://stackoverflow.com/questions/"
Is there a way to compile a list and then iterate through that, or is it possible to do this without creating a giant list of urls?
Try Scrapy.
It handles all of the crawling for you and lets you focus on processing the data, not extracting it. Instead of copy-pasting the code already in the tutorial, I'll leave it to you to read it.
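To give a rough idea of the shape such a spider takes (the selector and the item fields are placeholders you'd adapt to whatever specific bit of data you need), a CrawlSpider with a LinkExtractor restricted to /questions/ discovers the pages for you instead of requiring a precompiled list:

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class QuestionSpider(CrawlSpider):
    name = "questions"
    allowed_domains = ["stackoverflow.com"]
    start_urls = ["http://stackoverflow.com/questions/"]

    # Follow every link under /questions/<id> and hand each page to parse_question
    rules = (
        Rule(LinkExtractor(allow=r"/questions/\d+"),
             callback="parse_question", follow=True),
    )

    def parse_question(self, response):
        # Placeholder extraction: grab the page title
        yield {"url": response.url, "title": response.css("title::text").get()}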
To grab a specific bit of data from a website you could use a web scraping tool, e.g. Scrapy.
If the required data is generated by JavaScript, then you might need a browser-like tool such as Selenium WebDriver and have to implement crawling of the links by hand.
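A minimal sketch of the Selenium route (Selenium 4 style; it assumes Chrome and its driver are installed) that renders a page in a real browser and collects the links it finds:

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()  # requires Chrome and a matching driver
try:
    driver.get("http://stackoverflow.com/questions/")
    # The browser executes the JavaScript, so these are the fully rendered links
    links = [a.get_attribute("href") for a in driver.find_elements(By.TAG_NAME, "a")]
    print(links)
finally:
    driver.quit()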
For example, you can make a simple for loop, like this:
def webIterate():
    base_link = "http://stackoverflow.com/questions/"
    for i in xrange(24):
        print "%s%d" % (base_link, i)
The output will be:
http://stackoverflow.com/questions/0
http://stackoverflow.com/questions/1
http://stackoverflow.com/questions/2
...
http://stackoverflow.com/questions/23
It's just an example. You can pass the range of question numbers as a parameter and do whatever you want with the generated URLs.
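Building on that, a hedged sketch of actually visiting each generated URL and skipping the ones that don't exist (using the requests package; the upper bound and the "specific bit of data" are placeholders):

import requests

base_link = "http://stackoverflow.com/questions/"

for i in range(24):  # placeholder upper bound
    url = "%s%d" % (base_link, i)
    resp = requests.get(url, timeout=10)
    if resp.status_code != 200:
        continue  # page doesn't exist (or was removed), skip it
    # Grab your specific bit of data here; printing the length is just a stand-in
    print(url, len(resp.text))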