I'm new to using Scrapy and I wanted to understand how the rules are being used within the CrawlSpider.
If I have a rule and I'm crawling the Yellow Pages for cupcake listings in Tucson, AZ, how does yielding a URL request activate the rule? Specifically, how does it activate the restrict_xpaths attribute?
Thanks.
The rules attribute of a CrawlSpider specifies how to extract links from a page and which callbacks should be called for those links. The rules are handled by the default parse() method implemented in that class -- look here to read the source.
So, whenever you want to trigger the rules for a URL, you just need to yield a scrapy.Request(url, self.parse), and the Scrapy engine will send a request to that URL and apply the rules to the response.
The extraction of the links (which may or may not use restrict_xpaths) is done by the LinkExtractor object registered for that rule. It searches for all the <a> and <area> elements in the whole page or, if restrict_xpaths is set, only in the elements selected by those XPath expressions.
For example, say you have a CrawlSpider like this:
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class MySpider(CrawlSpider):
    start_urls = ['http://someurlhere.com']
    rules = (
        Rule(
            LinkExtractor(restrict_xpaths=[
                "//ul[@class='menu-categories']",
                "//ul[@class='menu-subcategories']"]),
            callback='parse'
        ),
        Rule(
            # raw string; "." and "?" are escaped so they match literally
            LinkExtractor(allow=r'/product\.php\?id=\d+'),
            callback='parse_product_page'
        ),
    )

    def parse_product_page(self, response):
        # yield product item here
        pass
The engine starts sending requests to the urls in start_urls and executing the default callback (the parse() method in CrawlSpider) for their response.
For each response, the parse() method will execute the link extractors on it to get the links from the page. Namely, it calls the LinkExtractor.extract_links(response) for each response object to get the urls, and then yields scrapy.Request(url, <rule_callback>) objects.
The example code is a skeleton for a spider that crawls an e-commerce site, following the links of product categories and subcategories to get the links for each of the product pages.
With the rules registered in this spider, it would crawl the links inside the "categories" and "subcategories" lists with the parse() method as callback (which triggers the crawl rules again for those pages), and the links matching the product-page pattern (product.php?id= followed by digits) with the callback parse_product_page(), which would finally scrape the product data.
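One detail worth checking in a pattern like the product rule: as far as I know, the link extractor applies allow patterns as regular-expression searches against the absolute URL, and in a regular expression an unescaped ? makes the preceding character optional instead of matching a literal question mark. The sketch below uses only Python's re module to compare the two spellings; the sample URL is invented for illustration.

```python
import re

url = "http://someurlhere.com/product.php?id=42"  # hypothetical product URL

# Unescaped: "." matches any character and "?" makes the final "p" optional.
unescaped = re.compile(r"/product.php?id=\d+")
# Escaped: "." and "?" are matched literally.
escaped = re.compile(r"/product\.php\?id=\d+")

print(bool(unescaped.search(url)))
print(bool(escaped.search(url)))
```

Only the escaped pattern finds the product URL, so escaping the metacharacters (and using a raw string) is the safer way to write such a rule.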
As you can see, pretty powerful stuff. =)
Read more:
CrawlSpider - Scrapy docs
Link extractors - Scrapy docs
Related
I have the crawler implemented as below.
It works and goes through the pages allowed by the link extractor.
Basically, I am trying to extract information from different places in the page:
- href and text() under the class 'news' (if it exists)
- image URL under the class 'think block' (if it exists)
I have three problems with my Scrapy spider:
1) Duplicating link extractor
It seems to process the same page more than once. (I checked the export file and found that the same ~.img appeared many times, which is hardly possible.)
Every page on the website has hyperlinks at the bottom that direct users to the topics they are interested in. My objective is to extract information from each topic's page (which lists the titles of several passages under the same topic) and the images found within each passage's page (you reach a passage's page by clicking its title on the topic page).
I suspect the link extractor loops over the same pages again in this case.
(Maybe this can be solved with DEPTH_LIMIT?)
2) Improving parse_item
I think parse_item is not very efficient. How could I improve it? I need to extract information from different places in the page (it should only extract what exists). Besides, it looks like parse_item only yields HKejImage but not HKejItem (again, I checked the output file). How should I tackle this?
3) I need the spider to be able to read Chinese.
I am crawling a site in Hong Kong, so being able to read Chinese is essential.
The site:
http://www1.hkej.com/dailynews/headline/article/1105148/IMF%E5%82%B3%E4%BF%83%E4%B8%AD%E5%9C%8B%E9%80%80%E5%87%BA%E6%95%91%E5%B8%82
As long as it belongs to 'dailynews', that's the thing I want.
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import Selector
from scrapy.http import Request, FormRequest
from scrapy.contrib.linkextractors import LinkExtractor
import items

class EconjournalSpider(CrawlSpider):
    name = "econJournal"
    allowed_domains = ["hkej.com"]
    login_page = 'http://www.hkej.com/template/registration/jsp/login.jsp'
    start_urls = 'http://www.hkej.com/dailynews'
    rules=(Rule(LinkExtractor(allow=('dailynews', ),unique=True), callback='parse_item', follow =True),
    )

    def start_requests(self):
        yield Request(
            url=self.login_page,
            callback=self.login,
            dont_filter=True
        )

    # name column
    def login(self, response):
        return FormRequest.from_response(response,
            formdata={'name': 'users', 'password': 'my password'},
            callback=self.check_login_response)

    def check_login_response(self, response):
        """Check the response returned by a login request to see if we are
        successfully logged in.
        """
        if "username" in response.body:
            self.log("\n\n\nSuccessfully logged in. Let's start crawling!\n\n\n")
            return Request(url=self.start_urls)
        else:
            self.log("\n\n\nYou are not logged in.\n\n\n")
            # Something went wrong, we couldn't log in, so nothing happens

    def parse_item(self, response):
        hxs = Selector(response)
        news=hxs.xpath("//div[@class='news']")
        images=hxs.xpath('//p')
        for image in images:
            allimages=items.HKejImage()
            allimages['image'] = image.xpath('a/img[not(@data-original)]/@src').extract()
            yield allimages
        for new in news:
            allnews = items.HKejItem()
            allnews['news_title']=new.xpath('h2/@text()').extract()
            allnews['news_url'] = new.xpath('h2/@href').extract()
            yield allnews
Thank you very much and I would appreciate any help!
First, to change settings, either edit the settings.py file or set the custom_settings attribute on the spider, like:
custom_settings = {
    'DEPTH_LIMIT': 3,
}
Then, make sure the spider actually reaches the parse_item method (I suspect it doesn't, but I haven't tested it). Also, be careful when combining callback and follow on the same rule, because it becomes hard to tell which links are followed and which are returned as items.
First, remove follow from your rule, or add a separate rule, so that it is clear which links to follow and which links to return as items.
Second, the XPath expressions in your parse_item method are incorrect. To get all the images, you could use something like:
images=hxs.xpath('//img')
and then to get the image URL:
allimages['image'] = image.xpath('./@src').extract()
For the news, something like this could work:
allnews['news_title']=new.xpath('.//a/text()').extract()
allnews['news_url'] = new.xpath('.//a/@href').extract()
Now, as I understand your problem, this isn't the LinkExtractor duplicating pages but poorly specified rules. Also make sure your XPath expressions are valid; your question didn't mention that you needed them corrected.
I am writing Scrapy code to crawl the first page and one additional level of depth for a given website.
Somehow my crawler doesn't enter the additional depth. It just crawls the given start URLs and then stops.
I added a process_links callback function, but even that is not getting called, so clearly the rules are being ignored. What could be the reason, and what can I change to make it follow the rules?
import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from crawlWeb.items import CrawlwebItem
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class DmozSpider(CrawlSpider):
    name = "premraj"
    start_urls = [
        "http://www.broadcom.com",
        "http://www.qualcomm.com"
    ]
    rules = [Rule(SgmlLinkExtractor(), callback='parse',process_links="process_links",follow=True)]

    def parse(self, response):
        #print dir(response)
        #print dir(response)
        item=CrawlwebItem()
        item["html"]=response.body
        item["url"]=response.url
        yield item

    def process_links(self,links):
        print links
        print "hey!!!!!!!!!!!!!!!!!!!!!"
There is a Warning box in the CrawlSpider documentation. It says:
When writing crawl spider rules, avoid using parse as callback, since
the CrawlSpider uses the parse method itself to implement its logic.
So if you override the parse method, the crawl spider will no longer
work.
Your code probably does not work as expected because you use parse as the callback.
I want to extract data from http://community.sellfree.co.kr/. Scrapy is working, however it appears to only scrape the start_urls, and doesn't crawl any links.
I would like the spider to crawl the entire site.
The following is my code:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from metacritic.items import MetacriticItem

class MetacriticSpider(BaseSpider):
    name = "metacritic" # Name of the spider, to be used when crawling
    allowed_domains = ["sellfree.co.kr"] # Where the spider is allowed to go
    start_urls = [
        "http://community.sellfree.co.kr/"
    ]
    rules = (Rule (SgmlLinkExtractor(allow=('.*',))
        ,callback="parse", follow= True),
    )

    def parse(self, response):
        hxs = HtmlXPathSelector(response) # The XPath selector
        sites = hxs.select('/html/body')
        items = []
        for site in sites:
            item = MetacriticItem()
            item['title'] = site.select('//a[@title]').extract()
            items.append(item)
        return items
There are two kinds of links on the page. One is onclick="location='../bbs/board.php?bo_table=maket_5_3' and another is <span class="list2">solution</span>
How can I get the crawler to follow both kinds of links?
Before I get started, I'd highly recommend using an updated version of Scrapy. It appears you're still using an old one, as many of the methods and classes you're using have been moved around or deprecated.
To the problem at hand: the BaseSpider class will not do anything with the rules you specify. Instead, use the CrawlSpider class (scrapy.contrib.spiders.CrawlSpider in your version), which has the functionality to handle rules built in.
Next, you'll need to rename your parse() method, since CrawlSpider uses parse() internally. (We'll assume parse_page() for the rest of this answer.)
To pick up all basic links and have them crawled, your link extractor will need to be changed. You don't need a catch-all regular expression such as '.*' to say which links to follow, because an empty allow matches every link. The following will pick them up, and links that lead off the site will be filtered out based on allowed_domains:
rules = (
    Rule(SgmlLinkExtractor(allow=('')), callback="parse_page", follow=True),
)
As for the onclick=... links, these are JavaScript links, and the page you are trying to process relies on them heavily. Scrapy cannot crawl things like onclick=location.href="javascript:showLayer_tap('2')" or onclick="win_open('./bbs/profile.php?mb_id=wlsdydahs', because it can't execute showLayer_tap() or win_open() in Javascript.
(the following is untested, but should work and provide the basic idea of what you need to do)
You can write your own functions for parsing these, though. For instance, the following can handle onclick=location.href="./photo/":
import re

def process_onclick(value):
    # pull the target URL out of location.href="..."
    m = re.search(r'location\.href="(.*?)"', value)
    if m:
        return m.group(1)
Then add the following rule (this only handles tables, expand it as needed):
Rule(SgmlLinkExtractor(allow=(''), tags=('table',),
                       attrs=('onclick',), process_value=process_onclick),
     callback="parse_page", follow=True),
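As a sketch of how the process_value idea could be extended, the variant below also accepts the location='...' form quoted in the question. The regular expression and the ONCLICK_URL name are my own assumptions rather than part of the original answer, and the function can be checked with plain Python before wiring it into a rule. Returning None from process_value tells the link extractor to ignore that attribute value.

```python
import re

# Matches location.href="..." as well as location='...' (either quote style).
ONCLICK_URL = re.compile(r"""location(?:\.href)?\s*=\s*(["'])(.*?)\1""")


def process_onclick(value):
    """Return the URL embedded in an onclick handler, or None if there is none."""
    m = ONCLICK_URL.search(value)
    return m.group(2) if m else None


print(process_onclick('location.href="./photo/"'))
print(process_onclick("location='../bbs/board.php?bo_table=maket_5_3'"))
print(process_onclick("win_open('./bbs/profile.php?mb_id=wlsdydahs')"))
```

Handlers that call site-specific functions such as win_open() would need their own patterns, since the URL is passed as a function argument instead of being assigned to location.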
Using Scrapy's CrawlSpider, is there a canonical way to get the URL of the page that a rule was followed from? For example, if there is a link from page A to page B, is there a way to know the URL of page A when I parse page B in the callback method? I am more interested in a built-in feature than in extending the CrawlSpider class.
In your callback, you can read the "Referer" header from the headers of the request that produced the response:
def mycallback(self, response):
    print("Referer:", response.request.headers.get("Referer"))
    ...
It should work with all spiders.
I am trying to use Scrapy to crawl a website, but the website has no sitemap or page index. How can I crawl all of its pages with Scrapy?
I just need to download all the pages of the site without extracting any items. Do I only need to make the spider's Rule follow all links? I also don't know whether Scrapy will avoid duplicate URLs this way.
I just found the answer myself. With the CrawlSpider class, we just need to set allow=() in the SgmlLinkExtractor constructor. As the documentation says:
allow (a regular expression (or list of)) – a single regular expression (or list of regular expressions) that the (absolute) urls must match in order to be extracted. If not given (or empty), it will match all links.
In your Spider, define allowed_domains as a list of domains you want to crawl.
import scrapy

class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    allowed_domains = ['quotes.toscrape.com']
Then you can use response.follow() to follow the links. See the docs for Spiders and the tutorial.
Alternatively, you can filter the domains with a LinkExtractor (like David Thompson mentioned).
import scrapy
from scrapy.linkextractors import LinkExtractor

class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = ['http://quotes.toscrape.com/page/1/']

    def parse(self, response):
        # yield one item per quote block on the page
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }
        # follow only links that stay on the allowed domain
        for a in LinkExtractor(allow_domains=['quotes.toscrape.com']).extract_links(response):
            yield response.follow(a, callback=self.parse)