Scrapy crawl site including non-HTML items - python

I am using Scrapy to crawl an entire website, including images, CSS, JavaScript and external links. I've noticed that Scrapy's default CrawlSpider only processes HTML responses and ignores external links. So I tried overriding the _requests_to_follow method and removing the check at the beginning, but that didn't work. I also tried using a process_request method to allow all requests, but that failed too. Here's my code:
from scrapy import Request
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

# Node and Link are the project's item classes (their definitions are not shown here)


class MySpider(CrawlSpider):
    name = 'myspider'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com']

    rules = (Rule(LinkExtractor(), callback='parse_item', follow=False,
                  process_request='process_request'),)

    def parse_item(self, response):
        node = Node()
        node['location'] = response.url
        node['content_type'] = response.headers['Content-Type']
        yield node

        link = Link()
        link['source'] = response.request.headers['Referer']
        link['destination'] = response.url
        yield link

    def process_request(self, request):
        # Allow everything
        return request

    def _requests_to_follow(self, response):
        # There used to be a check for Scrapy's HtmlResponse here
        seen = set()
        for n, rule in enumerate(self._rules):
            links = [l for l in rule.link_extractor.extract_links(response) if l not in seen]
            if links and rule.process_links:
                links = rule.process_links(links)
            for link in links:
                seen.add(link)
                r = Request(url=link.url, callback=self._response_downloaded)
                r.meta.update(rule=n, link_text=link.text)
                yield rule.process_request(r)
The idea is to build a graph of the domain; that's why parse_item yields a Node object with the resource's location and type, and a Link object to keep track of the relations between nodes. External pages should have their node and link information retrieved, but they shouldn't be crawled, of course.
Thanks in advance for your help.
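For context, Node and Link above are assumed to be ordinary Scrapy items along these lines (a sketch only; the real definitions are not shown in the question):

import scrapy

# Hypothetical item definitions matching the fields used in parse_item
class Node(scrapy.Item):
    location = scrapy.Field()
    content_type = scrapy.Field()

class Link(scrapy.Item):
    source = scrapy.Field()
    destination = scrapy.Field()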

Related

How to get links inside links from webpage in python?

How can I go to a link, get its sub-links, and then get their sub-links in turn? For example, I want to go to
"https://stackoverflow.com"
then extract its links, e.g.
['https://stackoverflow.com/questions/ask', 'https://stackoverflow.com/?tab=bounties']
and then visit each of those sub-links and extract their links as well.
I would recommend using Scrapy for this. With Scrapy, you create a spider object which then is run by the Scrapy module.
First, to get all the links on a page, you can create a Selector object and find all of the hyperlink objects using the XPath:
hxs = scrapy.Selector(response)
urls = hxs.xpath('*//a/@href').extract()
Since hxs.xpath(...).extract() returns an iterable list of URLs, you can iterate over them directly without storing them in a variable. Each URL found should also be passed back into this same function via the callback argument, so the spider recursively finds all the links within each page it visits:
hxs = scrapy.Selector(response)
for url in hxs.xpath('*//a/@href').extract():
    yield scrapy.http.Request(url=url, callback=self.parse)
Each href found might be a relative path rather than a full URL, so that check has to be made:
if not (url.startswith('http://') or url.startswith('https://')):
    url = "https://stackoverflow.com/" + url
Finally, each URL can be passed to a different function to be processed; in this case it's just printed:
self.handle(url)
All of this put together in a full Spider object looks like this:
import scrapy

class StackSpider(scrapy.Spider):
    name = "stackoverflow.com"
    # limit the scope to stackoverflow
    allowed_domains = ["stackoverflow.com"]
    start_urls = [
        "https://stackoverflow.com/",
    ]

    def parse(self, response):
        hxs = scrapy.Selector(response)
        # extract all links from the page
        for url in hxs.xpath('*//a/@href').extract():
            # make it a valid url
            if not (url.startswith('http://') or url.startswith('https://')):
                url = "https://stackoverflow.com/" + url
            # process the url
            self.handle(url)
            # recursively parse each url
            yield scrapy.http.Request(url=url, callback=self.parse)

    def handle(self, url):
        print(url)
And the spider would be run like this:
$ scrapy runspider spider.py > urls.txt
Also, keep in mind that running this code will get you rate-limited by Stack Overflow. You might want to find a different target for testing, ideally a site that you're hosting yourself.
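One further hedged note, not from the original answer: the recursion above follows links without bound; Scrapy's standard DEPTH_LIMIT setting is one way to cap it, for example via custom_settings on the spider (the value 2 here is arbitrary):

class StackSpider(scrapy.Spider):
    name = "stackoverflow.com"
    # Arbitrary illustrative cap; DEPTH_LIMIT is a built-in Scrapy setting.
    custom_settings = {"DEPTH_LIMIT": 2}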

Python Scrapy: Return list of URLs scraped

I am using Scrapy to scrape all the links from a single domain. I follow all links on the domain, but save the links that point off the domain. The following scraper works correctly, but I can't access member variables from within the scraper since I am running it with a CrawlerProcess.
import scrapy
from scrapy.crawler import CrawlerProcess

class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['https://example.com']
    on_domain_urls = set()
    off_domain_urls = set()

    def parse(self, response):
        links = response.xpath('//a/@href')
        for link in links:
            url = link.get()
            if 'example.com' in url and url not in self.on_domain_urls:
                print('On domain links found: {}'.format(
                    len(self.on_domain_urls)))
                self.on_domain_urls.add(url)
                yield scrapy.Request(url, callback=self.parse)
            elif url not in self.off_domain_urls:
                print('Off domain links found: {}'.format(
                    len(self.off_domain_urls)))
                self.off_domain_urls.add(url)

process = CrawlerProcess()
process.crawl(MySpider)
process.start()
# Need access to off_domain_urls
How can I access off_domain_urls? I could probably move it to global scope, but this seems hacky. I could also append to a file, but I'd like to avoid file I/O if possible. Is there a better way to return aggregated data like this?
Did you check the item pipeline? I think you'll have to use that in this scenario and decide what needs to be done with the variable.
See:
https://docs.scrapy.org/en/latest/topics/item-pipeline.html
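A minimal sketch of that suggestion (my own illustration; the pipeline name is an assumption): a pipeline that leaves items untouched and picks up the spider's off_domain_urls set when the crawl finishes. It has to be enabled via the ITEM_PIPELINES setting.

# Hypothetical pipeline: grab the spider's off_domain_urls set at the end of the crawl.
class OffDomainUrlsPipeline:
    def process_item(self, item, spider):
        # Nothing to change per item; just pass it through.
        return item

    def close_spider(self, spider):
        # spider.off_domain_urls is the set maintained in MySpider.parse above;
        # decide here what to do with it (persist it, post it somewhere, etc.).
        print('Total off-domain links: {}'.format(len(spider.off_domain_urls)))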

Only 25 entries are stored in JSON files while scraping data using Scrapy; how to increase?

I am scraping data using Scrapy into an items.json file. The data is being stored, but the problem is that only 25 entries are stored, while the website has more entries. I am using the following spider:
class DmozSpider(Spider):
    name = "dmoz"
    allowed_domains = ["justdial.com"]
    start_urls = ["http://www.justdial.com/Delhi-NCR/Taxi-Services/ct-57371"]

    def parse(self, response):
        hxs = Selector(response)
        sites = hxs.xpath('//section[@class="rslwrp"]/section')
        items = []
        for site in sites:
            item = DmozItem()
            item['title'] = site.xpath('section[2]/section[1]/aside[1]/p[1]/span/a/text()').extract()
            items.append(item)
        return items
The command I'm using to run the script is:
scrapy crawl myspider -o items.json -t json
Is there any setting I am not aware of, or is the page not getting fully loaded before scraping? How do I resolve this?
Abhi, here is some code, but please note that it isn't complete and working; it is just to show you the idea. Usually you have to find the next-page URL and recreate the appropriate request in your spider. In your case AJAX is used. I used Firebug to check which requests are sent by the site.
URL = "http://www.justdial.com/function/ajxsearch.php?national_search=0&...page=%s" # this isn't the complete next page URL
next_page = 2 # how to handle next_page counter is up to you
def parse(self, response):
hxs = Selector(response)
sites = hxs.xpath('//section[#class="rslwrp"]/section')
for site in sites:
item = DmozItem()
item['title'] = site.xpath('section[2]/section[1]/aside[1]/p[1]/span/a/text()').extract()
yield item
# build you pagination URL and send a request
url = self.URL % self.next_page
yield Request(url) # Request is Scrapy request object here
# increment next_page counter if required, make additional
# checks and actions etc
Hope this will help.
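To make the "increment the counter" note concrete, here is a hedged sketch of the tail end of parse(); MAX_PAGES and the empty-page stop condition are my own assumptions, not something derived from the site:

MAX_PAGES = 10  # assumption: an arbitrary cap for illustration only

def parse(self, response):
    sites = Selector(response).xpath('//section[@class="rslwrp"]/section')
    for site in sites:
        item = DmozItem()
        item['title'] = site.xpath('section[2]/section[1]/aside[1]/p[1]/span/a/text()').extract()
        yield item

    # Only keep paginating while results keep coming back and the cap isn't reached.
    if sites and self.next_page <= self.MAX_PAGES:
        yield Request(self.URL % self.next_page)
        self.next_page += 1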

Scrapy crawl all sitemap links

I want to crawl all the links present in the sitemap.xml of a fixed site. I've come across Scrapy's SitemapSpider. So far I've extracted all the URLs in the sitemap. Now I want to crawl each link of the sitemap. Any help would be highly useful. The code so far is:
class MySpider(SitemapSpider):
    name = "xyz"
    allowed_domains = ["xyz.nl"]
    sitemap_urls = ["http://www.xyz.nl/sitemap.xml"]

    def parse(self, response):
        print(response.url)
Essentially you could create new request objects to crawl the urls created by the SitemapSpider and parse the responses with a new callback:
from scrapy import Request

class MySpider(SitemapSpider):
    name = "xyz"
    allowed_domains = ["xyz.nl"]
    sitemap_urls = ["http://www.xyz.nl/sitemap.xml"]

    def parse(self, response):
        print(response.url)
        return Request(response.url, callback=self.parse_sitemap_url)

    def parse_sitemap_url(self, response):
        # do stuff with your sitemap links
        pass
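One caveat (my addition, not part of the answer above): Scrapy's duplicate filter normally drops a request for a URL that has already been fetched, which is exactly what response.url is here, so the re-request may need dont_filter=True:

# Bypass the dupefilter for the deliberate re-request of the same URL.
return Request(response.url, callback=self.parse_sitemap_url, dont_filter=True)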
You need to add sitemap_rules to process the data in the crawled URLs, and you can create as many as you want.
For instance, say you have a page named http://www.xyz.nl//x/ for which you want to create a rule:
class MySpider(SitemapSpider):
    name = 'xyz'
    sitemap_urls = ['http://www.xyz.nl/sitemap.xml']
    # list with tuples - this example contains one rule
    sitemap_rules = [('/x/', 'parse_x')]

    def parse_x(self, response):
        sel = Selector(response)
        paragraph = sel.xpath('//p').extract()
        return {'paragraphs': paragraph}

Scrapy: skip hrefs w/ missing scheme

So I'm trying to crawl the popular.ebay.com page, and I get the error Missing scheme in request url: #mainContent for the # anchor links.
The following is my code:
def parse_links(self, response):
    hxs = HtmlXPathSelector(response)
    links = hxs.select('//a')
    #domain = 'http://popular.ebay.com/'
    for link in links:
        anchor_text = ''.join(link.select('./text()').extract())
        title = ''.join(link.select('./@title').extract())
        url = ''.join(link.select('./@href').extract())
        meta = {'title': title, 'anchor_text': anchor_text}
        yield Request(url, callback=self.parse_page, meta=meta)
I can't just prepend the base URL to every href, because that produces a doubled URL for the hrefs that already have a full URL scheme. I end up getting URLs like this: http://popular.ebay.comhttp://www.ebay.com/sch/i.html?_nkw=grande+mansion
def parse_links(self, response):
    hxs = HtmlXPathSelector(response)
    links = hxs.select('//a')
    #domain = 'http://popular.ebay.com/'
    for link in links:
        anchor_text = ''.join(link.select('./text()').extract())
        title = ''.join(link.select('./@title').extract())
        url = ''.join(link.select('./@href').extract())
        meta = {'title': title, 'anchor_text': anchor_text}
        yield Request(response.url, callback=self.parse_page, meta=meta)
The links I want to get look like this: Antique Chairs | but I get the error because of links like this: <a id="gh-hdn-stm" class="gh-acc-a" href="#mainContent">Skip to main content</a>
How would I go about adding the base URL to only the hash anchor links, or ignoring links without the base URL in them? For a simple solution I've tried setting the rule deny=('#mainContent') and restrict_xpaths, but the crawler still spits out the same error.
The error Missing scheme in request url: #mainContent is caused by requesting a URL without a scheme (the "http://" part of the URL).
#mainContent is an internal link, referring to an HTML element with the id "mainContent". You probably don't want to follow these links, as they only point to a different part of the current page you're on.
I'd suggest looking at this part of the documentation http://doc.scrapy.org/en/latest/topics/link-extractors.html#scrapy.contrib.linkextractors.sgml.SgmlLinkExtractor. You can tell Scrapy to follow links which conform to a certain format and restrict what part of the page it will fetch links from. Take note of the "restrict_xpaths" and "allow" parameters.
Hope this helps :)
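To make the restrict_xpaths suggestion concrete, here is a hedged sketch using the current LinkExtractor import path; the XPath for eBay's content area is a made-up placeholder:

from scrapy import Request
from scrapy.linkextractors import LinkExtractor

def parse_links(self, response):
    # Only pull links out of an assumed main-content container, which
    # naturally skips header anchors like "Skip to main content".
    extractor = LinkExtractor(restrict_xpaths=('//div[@id="content"]',))
    for link in extractor.extract_links(response):
        yield Request(link.url, callback=self.parse_page, meta={'anchor_text': link.text})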
In your for loop:
meta = {'anchor_text': anchor_text}
url = link.select('./@href').extract()[0]
if '#' not in url:  # or: if url[0] != '#'
    yield Request(url, callback=self.parse_page, meta=meta)
This will avoid yielding #foobar as a URL. You could add the base URL to #foobar in an else statement, but since that would just point back to a page Scrapy has already scraped, I don't think there's a point in it.
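Another option (my addition, assuming a Scrapy version where Response.urljoin is available): resolve relative hrefs instead of skipping them, which also avoids the doubled-URL problem described in the question:

# Sketch: skip empty and fragment-only hrefs, and resolve the rest against
# the page URL with urljoin (absolute hrefs pass through unchanged).
for link in links:
    url = ''.join(link.select('./@href').extract())
    if not url or url.startswith('#'):
        continue
    yield Request(response.urljoin(url), callback=self.parse_page)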
I found links other than #mainContent that were missing the scheme, so using @Robin's logic I made sure that the URL contained the base URL before calling parse_page.
for link in links:
    anchor_text = ''.join(link.select('./text()').extract())
    title = ''.join(link.select('./@title').extract())
    url = ''.join(link.select('./@href').extract())
    meta = {'title': title, 'anchor_text': anchor_text}
    if domain in url:
        yield Request(url, callback=self.parse_page, meta=meta)
