Scrapy CrawlSpider rules with multiple callbacks - python

I'm trying to create an ExampleSpider that subclasses Scrapy's CrawlSpider. My ExampleSpider should be able to process pages containing only artist info,
pages containing only album info, and some other pages that contain both album and artist info.
I was able to handle the first two scenarios, but the problem occurs in the third. I'm using a parse_artist(response) method to process artist data and a parse_album(response) method to process album data.
My question is: if a page contains both artist and album data, how should I define my rules?
Should I do it like below? (Two rules for the same URL pattern)
Should I use multiple callbacks? (Does Scrapy support multiple callbacks?)
Is there another way to do it? (A proper way)
class ExampleSpider(CrawlSpider):
    name = 'example'
    start_urls = ['http://www.example.com']
    rules = [
        Rule(SgmlLinkExtractor(allow=[r'same regex_rule']), callback='parse_artist', follow=True),
        Rule(SgmlLinkExtractor(allow=[r'same regex_rule']), callback='parse_album', follow=True),
        # more rules .....
    ]
    def parse_artist(self, response):
        artist_item = ArtistItem()
        try:
            # do the scrape and assign to ArtistItem
            pass
        except Exception:
            # ignore for now
            pass
        return artist_item
    def parse_album(self, response):
        album_item = AlbumItem()
        try:
            # do the scrape and assign to AlbumItem
            pass
        except Exception:
            # ignore for now
            pass
        return album_item

CrawlSpider calls its _requests_to_follow() method to extract URLs and generate the requests to follow:
def _requests_to_follow(self, response):
    if not isinstance(response, HtmlResponse):
        return
    seen = set()
    for n, rule in enumerate(self._rules):
        links = [l for l in rule.link_extractor.extract_links(response) if l not in seen]
        if links and rule.process_links:
            links = rule.process_links(links)
        seen = seen.union(links)
        for link in links:
            r = Request(url=link.url, callback=self._response_downloaded)
            r.meta.update(rule=n, link_text=link.text)
            yield rule.process_request(r)
As you can see:
The variable seen keeps track of the links that an earlier rule has already extracted.
Every URL will therefore be handled by at most one rule, and so parsed by at most one callback.
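To make the "first matching rule wins" behaviour concrete, here is a small pure-Python sketch of the same bookkeeping. It does not use Scrapy at all: the function assign_links_to_rules, the (pattern, callback name) tuples and the example URLs are all made up for illustration, and plain regex matching stands in for a real link extractor.

```python
import re

def assign_links_to_rules(rules, page_links):
    """Mimic the `seen` bookkeeping: each link goes to the first rule that matches it."""
    seen = set()
    assigned = {}
    for pattern, callback in rules:
        # Later rules never get a link that an earlier rule already claimed.
        links = [link for link in page_links
                 if re.search(pattern, link) and link not in seen]
        seen.update(links)
        for link in links:
            assigned[link] = callback
    return assigned

# Two rules with the same pattern, as in the question (hypothetical pattern).
rules = [
    (r'/music/', 'parse_artist'),
    (r'/music/', 'parse_album'),
]
page_links = [
    'http://www.example.com/music/1',
    'http://www.example.com/music/2',
    'http://www.example.com/about',
]
assigned = assign_links_to_rules(rules, page_links)
print(assigned)
```

In this sketch both /music/ links end up with parse_artist and parse_album is never used, which is why two rules with the same pattern do not give you two callbacks.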
You can define a parse_item() to call parse_artist() and parse_album():
rules = [
    Rule(SgmlLinkExtractor(allow=[r'same regex_rule']), callback='parse_item', follow=True),
    # more rules .....
]
def parse_item(self, response):
    yield self.parse_artist(response)
    yield self.parse_album(response)
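If the individual callbacks do not all return a single item (one might return None, another might be a generator), the combining callback needs to flatten their output. Below is a rough sketch of that idea in plain Python, without Scrapy. FakeSpider, the dict-shaped response and the helper iterate_callback_output are hypothetical names used only for illustration, and the helper only treats dicts as single items.

```python
from collections.abc import Iterable

def iterate_callback_output(result):
    """Normalise a callback result (None, one item, or many items) into an iterable."""
    if result is None:
        return []
    # A dict is iterable, but here it represents one item, so wrap it.
    if isinstance(result, (dict, str, bytes)) or not isinstance(result, Iterable):
        return [result]
    return result

class FakeSpider:
    """Stand-in for the spider; the 'response' is just a dict in this sketch."""

    def parse_artist(self, response):
        return {'type': 'artist', 'name': response['artist']}

    def parse_album(self, response):
        for title in response['albums']:
            yield {'type': 'album', 'title': title}

    def parse_item(self, response):
        # Run every callback on the same response and flatten the results.
        for callback in (self.parse_artist, self.parse_album):
            yield from iterate_callback_output(callback(response))

response = {'artist': 'Some Artist', 'albums': ['First', 'Second']}
items = list(FakeSpider().parse_item(response))
print(items)
```

The point of the design is that the rule still has exactly one callback, and that callback decides which parsers to run on the page.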

Related

Scrapy spider prefers one domain (which slows down process)

I'm working on a crawler that takes a list of domains and, for every domain, counts the number of URLs found on it.
I use CrawlSpider for this purpose, but there is a problem.
When I start crawling, it seems to send multiple requests to multiple domains, but after some time (about one minute) it ends up crawling only one domain.
SETTINGS
CONCURRENT_REQUESTS = 100
CONCURRENT_REQUESTS_PER_DOMAIN = 3
REACTOR_THREADPOOL_MAXSIZE = 20
Here you can see how many URLs have been scraped for each domain:
After 7 minutes, as you can see, it targets only the first domain and has forgotten about the others.
If Scrapy targets just one domain at a time, that naturally slows the process down. I would like to send requests to multiple domains within a short time.
class MainSpider(CrawlSpider):
    name = 'main_spider'
    allowed_domains = []
    rules = (
        Rule(LinkExtractor(), callback='parse_item', follow=True, ),
    )
    def start_requests(self):
        for d in Domain.objects.all():
            self.allowed_domains.append(d.name)
            yield scrapy.Request(d.main_url, callback=self.parse_item, meta={'domain': d})
    def parse_start_url(self, response):
        self.parse_item(response)
    def parse_item(self, response):
        d = response.meta['domain']
        d.number_of_urls = d.number_of_urls + 1
        d.save()
        extractor = LinkExtractor(allow_domains=d.name)
        links = extractor.extract_links(response)
        for link in links:
            yield scrapy.Request(link.url, callback=self.parse_item, meta={'domain': d})
It seems to focus only on the first domain until it has scraped it entirely.

How to add scraped items into a set and execute when condition is met?

This piece of code is expected to add each extracted reviewId to a set (in order to omit duplicates). Then there is a check: when the set length is 100, a callback is executed and a long URL string with all the ids is passed to the main extract function.
How do I do this (save all ids extracted from different callbacks into the same set and use it further), either with built-in tools or with the code I have? The problem now is that the length-check branch is never entered.
Update: I believe there are two options: pass the set as meta to each callback, or somehow use an Item for this. But I don't know how.
import scrapy
from scrapy.shell import inspect_response
class QuotesSpider(scrapy.Spider):
    name = "tripad"
    list = set()
    def start_requests(self):
        url = "https://www.tripadvisor.com/Hotel_Review-g60763-d122005-Reviews-or{}-The_New_Yorker_A_Wyndham_Hotel-New_York_City_New_York.html#REVIEWS"
        for i in range(0, 500, 5):
            yield scrapy.Request(url=url.format(i), callback=self.parse)
    def parse(self, response):
        for result in response.xpath('//div[contains(@id,"review_")]/@id').extract():
            if "review" in result[:8]:
                QuotesSpider.list.add(result[7:] + "%2C")
        if len(QuotesSpider.list) == 100:
            url = "https://www.tripadvisor.com/OverlayWidgetAjax?Mode=EXPANDED_HOTEL_REVIEWS&metaReferer=Hotel_Review&reviews="
            for i in QuotesSpider.list:
                url += i
            yield scrapy.Request(url=url, callback=self.parse_page)
There are several ways of doing this, however I'd advise splitting your spider into two parts:
Spider that collects review ids
class CollectorSpider(Spider):
    name = 'collect_reviews'
    def parse(self, response):
        review_ids = ...
        for review_id in review_ids:
            yield {'review_id': review_id}
Spider that uses collected review ids to collect review content
import json
from scrapy import Request, Spider
class ConsumerSpider(Spider):
    name = 'consume_reviews'
    def start_requests(self):
        with open(self.file, 'r') as f:
            data = json.load(f)
        # step through the collected ids in batches of 100
        for start in range(0, len(data), 100):
            ids = [row['review_id'] for row in data[start:start + 100]]
            # make url from ids
            url = ''
            yield Request(url)
    def parse(self, response):
        # crawl 100 reviews here
        pass
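The "make url from ids" step could look roughly like the sketch below. It is plain Python with no Scrapy involved. The base URL and the %2C separator are taken from the question; the function name batched_urls and the sample ids are invented for the example, and it assumes the ids are plain strings.

```python
BASE_URL = ("https://www.tripadvisor.com/OverlayWidgetAjax"
            "?Mode=EXPANDED_HOTEL_REVIEWS&metaReferer=Hotel_Review&reviews=")

def batched_urls(review_ids, batch_size=100):
    """Yield one URL per batch of unique review ids, joined with an encoded comma."""
    # dict.fromkeys drops duplicates while keeping the original order.
    unique_ids = list(dict.fromkeys(review_ids))
    for start in range(0, len(unique_ids), batch_size):
        batch = unique_ids[start:start + batch_size]
        yield BASE_URL + "%2C".join(batch)

# 250 made-up ids plus two duplicates that should be ignored.
sample_ids = [str(n) for n in range(250)] + ["0", "1"]
urls = list(batched_urls(sample_ids))
print(len(urls))
```

Because the batches are cut by slicing, the last batch is simply shorter, so there is no need to wait for the set to reach exactly 100 entries.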

Scrapy CSS selector

I am learning how to use Scrapy but I am having some issues. I wrote this code, following an online tutorial, to understand a bit more about it.
import scrapy
class BrickSetSpider(scrapy.Spider):
    name = 'brick_spider'
    start_urls = ['http://brickset.com/sets/year-2016']
    def parse(self, response):
        SET_SELECTOR = '.set'
        for brickset in response.css(SET_SELECTOR):
            NAME_SELECTOR = 'h1 a ::text'
            PIECES_SELECTOR = './/dl[dt/text() = "Pieces"]/dd/a/text()'
            MINIFIGS_SELECTOR = './/dl[dt/text() = "Minifigs"]/dd[2]/a/text()'
            PRICE_SELECTOR = './/dl[dt/text() = "RRP"]/dd[3]/text()'
            IMAGE_SELECTOR = 'img ::attr(src)'
            yield {
                'name': brickset.css(NAME_SELECTOR).extract_first(),
                'pieces': brickset.xpath(PIECES_SELECTOR).extract_first(),
                'minifigs': brickset.xpath(MINIFIGS_SELECTOR).extract_first(),
                'retail price': brickset.xpath(PRICE_SELECTOR).extract_first(),
                'image': brickset.css(IMAGE_SELECTOR).extract_first(),
            }
        NEXT_PAGE_SELECTOR = '.next a ::attr(href)'
        next_page = response.css(NEXT_PAGE_SELECTOR).extract_first()
        if next_page:
            yield scrapy.Request(
                response.urljoin(next_page),
                callback=self.parse
            )
Since the site groups the listed products by year and this code only crawls the data from 2016, I decided to extend it to analyze the data from previous years as well. The idea of the code is this:
PREVIOUS_YEAR_SELECTOR = '...'
previous_year = response.css(PREVIOUS_YEAR_SELECTOR).extract_first()
if previous_year:
    yield scrapy.Request(
        response.urljoin(previous_year),
        callback=self.parse
    )
I tried different things but I really have no idea of what to write instead of '...'
I also tried with xpath but nothing seems to work.
You have at least two options here.
The first is to use the generic CrawlSpider
and define which links you want to extract and follow.
Something like this:
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
class BrickSetSpider(CrawlSpider):
    name = 'brick_spider'
    start_urls = ['http://brickset.com/sets']
    rules = (
        Rule(LinkExtractor(allow=r'/year-\d{4}'), callback='parse_bricks', follow=True),
    )
    # Your method renamed to parse_bricks goes here
Note: you need to rename parse method to some other name like 'parse_bricks' since the CrawlSpider uses the parse method itself.
The second option is to set start_urls to a page, http://brickset.com/browse/sets, that contains links to all the year sets, and add a method to parse those links:
class BrickSetSpider(scrapy.Spider):
    name = 'brick_spider'
    start_urls = ['http://brickset.com/browse/sets']
    def parse(self, response):
        links = response.xpath(
            '//a[contains(@href, "/sets/year")]/@href').extract()
        for link in links:
            yield scrapy.Request(response.urljoin(link), callback=self.parse_bricks)
    # Your method renamed to parse_bricks goes here
Maybe you want to exploit the structure of the href attribute? It seems to follow the structure /sets/year-YYYY. Given that, you can use a regex-based selector or, if you are lazy like me, just a contains():
XPath: //a[contains(@href,"/sets/year-")]/@href
I'm not sure if this is also possible with CSS. So the ... can be filled with:
PREVIOUS_YEAR_SELECTOR_XPATH = '//a[contains(@href,"/sets/year-")]/@href'
previous_year = response.xpath(PREVIOUS_YEAR_SELECTOR_XPATH).extract_first()
But I think you will want ALL years, so maybe you want to loop over the links:
PREVIOUS_YEAR_SELECTOR_XPATH = '//a[contains(@href,"/sets/year-")]/@href'
for previous_year in response.xpath(PREVIOUS_YEAR_SELECTOR_XPATH).extract():
    yield scrapy.Request(response.urljoin(previous_year), callback=self.parse)
I think you are on the right track. Google for a CSS/XPath cheat sheet that matches your needs and check out the FirePath extension or something similar. It speeds up the selector setup a lot :)
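For the regex-based variant, here is a minimal sketch of filtering hrefs by the /sets/year-YYYY structure outside of Scrapy. The href list is made up for the example, and year_urls is a hypothetical helper; in a spider the hrefs would come from something like response.xpath('//a/@href').extract().

```python
import re
from urllib.parse import urljoin

# Matches hrefs that end in /sets/year-YYYY and captures the year.
YEAR_LINK = re.compile(r'/sets/year-(\d{4})$')

def year_urls(base_url, hrefs):
    """Return absolute URLs of the year pages, newest first, without duplicates."""
    found = {}
    for href in hrefs:
        match = YEAR_LINK.search(href)
        if match:
            found[int(match.group(1))] = urljoin(base_url, href)
    return [found[year] for year in sorted(found, reverse=True)]

# Invented sample of hrefs as they might appear on the browse page.
hrefs = [
    '/sets/year-2016',
    '/sets/year-2015',
    '/sets/theme-City',
    '/sets/year-2015',
    '/sets/year-2016/page-2',
]
urls = year_urls('http://brickset.com/browse/sets', hrefs)
print(urls)
```

Anchoring the pattern with $ keeps pagination links such as /sets/year-2016/page-2 out of the result, so each year is requested once and the existing next-page logic handles the rest.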

How to feed a spider with links crawled within the spider?

I'm writing a spider (CrawlSpider) for an online store. According to client requisites, I need to write two rules: one for determining which pages have items and other for extracting the items.
I have both rules already working independently:
If my start_urls = ["www.example.com/books.php", "www.example.com/movies.php"] and I comment out the Rule and the code of parse_category, my parse_item will extract every item.
On the other hand, if start_urls = "http://www.example.com" and I comment out the Rule and the code of parse_item, parse_category will return every link that has items to extract, i.e. parse_category will return www.example.com/books.php and www.example.com/movies.php.
My problem is that I don't know how to merge both modules, so that start_urls = "http://www.example.com" and then parse_category extracts www.example.com/books.php and www.example.com/movies.php and feed those links to parse_item, where I actually extract the info of each item.
I need to find a way to do it this way instead of just using start_urls = ["www.example.com/books.php", "www.example.com/movies.php"] because if in the future a new category is added (e.g. www.example.com/music.php), the spider wouldn't be able to automatically detect that new category and should be manually edited. Not a big deal, but the client doesn't want this.
class StoreSpider(CrawlSpider):
    name = "storyder"
    allowed_domains = ["example.com"]
    start_urls = ["http://www.example.com/"]
    #start_urls = ["http://www.example.com/books.php", "http://www.example.com/movies.php"]
    rules = (
        Rule(LinkExtractor(), follow=True, callback='parse_category'),
        Rule(LinkExtractor(), follow=False, callback="parse_item"),
    )
    def parse_category(self, response):
        category = StoreCategory()
        # some code for determining whether the current page is a category, or just another stuff
        if is_category:  # pseudocode: outcome of the check above
            category['name'] = name
            category['url'] = response.url
        return category
    def parse_item(self, response):
        item = StoreItem()
        # some code for extracting the item's data
        return item
The CrawlSpider rules don't work the way you want, so you'll need to implement the logic yourself. Each extracted link is handed only to the first rule whose link extractor matches it, so with two identical LinkExtractor() rules the second callback is never reached; check the documentation.
You could try something like:
class StoreSpider(CrawlSpider):
    name = "storyder"
    allowed_domains = ["example.com"]
    start_urls = ["http://www.example.com/"]
    # no rules
    def parse(self, response):  # this is parse_category
        category_le = LinkExtractor("something for categories")
        for a in category_le.extract_links(response):
            yield Request(a.url, callback=self.parse_category)
        item_le = LinkExtractor("something for items")
        for a in item_le.extract_links(response):
            yield Request(a.url, callback=self.parse_item)
    def parse_category(self, response):
        category = StoreCategory()
        # some code for determining whether the current page is a category, or just another stuff
        if is_category:  # pseudocode: outcome of the check above
            category['name'] = name
            category['url'] = response.url
            yield category
        for req in self.parse(response):
            yield req
    def parse_item(self, response):
        item = StoreItem()
        # some code for extracting the item's data
        return item
Instead of using a parse_category, I used restrict_css in LinkExtractor to get the links I want, and it seems to be feeding the second Rule with the extracted links, so my question is answered. It ended up this way:
class StoreSpider(CrawlSpider):
    name = "storyder"
    allowed_domains = ["example.com"]
    start_urls = ["http://www.example.com/"]
    rules = (
        Rule(LinkExtractor(restrict_css=("#movies", "#books"))),
        Rule(LinkExtractor(), callback="parse_item"),
    )
    def parse_item(self, response):
        item = StoreItem()
        # some code for extracting the item's data
        return item
It still can't detect newly added categories (and there is no clear pattern to use in restrict_css without fetching other garbage), but at least it complies with the client's requirements: two rules, one for extracting the category links and the other for extracting the item data.

Scrapy spider get information that is inside of links

I have written a spider that can extract the information from this page and follow the "Next page" links. For now, the spider only extracts the information shown in the following structure.
The structure of the page is something like this
Title 1
URL 1 ---------> If you click you go to one page with more information
Location 1
Title 2
URL 2 ---------> If you click you go to one page with more information
Location 2
Next page
What I want is for the spider to visit each URL link and get the full information. I suppose that I must add another rule that specifies this.
The behaviour of the spider should be:
Go to URL1 (get info)
Go to URL2 (get info)
...
Next page
But I don't know how to implement it. Can someone guide me?
Code of my Spider:
class BcnSpider(CrawlSpider):
    name = 'bcn'
    allowed_domains = ['guia.bcn.cat']
    start_urls = ['http://guia.bcn.cat/index.php?pg=search&q=*:*']
    rules = (
        Rule(
            SgmlLinkExtractor(
                allow=(re.escape("index.php")),
                restrict_xpaths=("//div[@class='paginador']")),
            callback="parse_item",
            follow=True),
    )
    def parse_item(self, response):
        self.log("parse_item")
        sel = Selector(response)
        sites = sel.xpath("//div[@id='llista-resultats']/div")
        items = []
        cont = 0
        for site in sites:
            item = BcnItem()
            item['id'] = cont
            item['title'] = u''.join(site.xpath('h3/a/text()').extract())
            item['url'] = u''.join(site.xpath('h3/a/@href').extract())
            item['when'] = u''.join(site.xpath('div[@class="dades"]/dl/dd[1]/text()').extract())
            item['where'] = u''.join(site.xpath('div[@class="dades"]/dl/dd[2]/span/a/text()').extract())
            item['street'] = u''.join(site.xpath('div[@class="dades"]/dl/dd[3]/span/text()').extract())
            item['phone'] = u''.join(site.xpath('div[@class="dades"]/dl/dd[4]/text()').extract())
            items.append(item)
            cont = cont + 1
        return items
EDIT: After searching the internet I found some code that lets me do this.
First of all, I have to get all the links, then I have to call another parse method.
def parse(self, response):
    # Get all URLs
    yield Request(url=_url, callback=self.parse_details)
def parse_details(self, response):
    # Detailed information of each page
    pass
If you want to use Rules because the page has a paginator, you should rename def parse to def parse_start_url and then call this method through the Rule. With these changes you make sure that parsing begins at parse_start_url, and the code would be something like this:
rules = (
    Rule(
        SgmlLinkExtractor(
            allow=(re.escape("index.php")),
            restrict_xpaths=("//div[@class='paginador']")),
        callback="parse_start_url",
        follow=True),
)
def parse_start_url(self, response):
    # Get all URLs
    yield Request(url=_url, callback=self.parse_details)
def parse_details(self, response):
    # Detailed information of each page
    pass
That's all, folks.
There is an easier way of achieving this. Click next on your link, and read the new url carefully:
http://guia.bcn.cat/index.php?pg=search&from=10&q=*:*&nr=10
By looking at the GET parameters in the URL (everything after the question mark), and with a bit of testing, we find that these mean:
from=10 - Starting index
q=*:* - Search query
nr=10 - Number of items to display
This is how I would've done it:
Set nr=100 or higher. (1000 may work as well, just make sure that there is no timeout.)
Loop from from=0 to 34300. This is above the current number of entries. You may want to extract this value first.
Example code:
entries = 34246
step = 100
stop = entries - entries % step + step
for x in range(0, stop, step):
    url = 'http://guia.bcn.cat/index.php?pg=search&from={}&q=*:*&nr={}'.format(x, step)
    # Loop over all entries, and open links if needed
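The same loop as a Python 3 sketch that builds the query string with urllib instead of string formatting. The entry count 34246 is the figure quoted above and will have changed by now; search_page_urls is just an illustrative name, and the parameter meanings are the ones guessed from the URL.

```python
from urllib.parse import urlencode

BASE = 'http://guia.bcn.cat/index.php'

def search_page_urls(entries, step=100):
    """Yield one search URL per page of `step` results, covering all entries."""
    # Round up to the next multiple of step, as in the loop above.
    stop = entries - entries % step + step
    for start in range(0, stop, step):
        query = urlencode(
            {'pg': 'search', 'from': start, 'q': '*:*', 'nr': step},
            safe='*:',  # keep the wildcard query readable
        )
        yield '{}?{}'.format(BASE, query)

urls = list(search_page_urls(34246))
print(len(urls), urls[0], urls[-1])
```

Each of these URLs can then be requested once, and the detail links on every result page followed from the callback.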
