Given I have a single HTML file containing multiple sections with different structures that require widely different scraping. What would be the best practices for the spider layout?
Should I use one spider or several? Should I request the same URL multiple times, each time with a different callback function, or just parse the different parts sequentially? I'm asking with respect to playing nicely with other parts of the framework (like items and pipelines), and also with respect to performance, limits and caching.
So, is there any best-practice advice out there? Rules or conventions used in the community?
Multiple Requests
If I request a URL multiple times, is it cached / throttled, or does every request to the engine result in a request to the external web server?
import scrapy


class MultiSpider(scrapy.Spider):
    """Parse the parts in parallel."""
    name = 'multispider'

    def start_requests(self):
        url = 'https://en.wikipedia.org/wiki/Main_Page'
        yield scrapy.Request(url=url, callback=self.parser_01)
        yield scrapy.Request(url=url, callback=self.parser_02)

    def parser_01(self, response):
        selector = response.xpath('//some/path')
        if selector:
            # do stuff with *selector* and
            yield {}

    def parser_02(self, response):
        selector = response.xpath('//some/other/path')
        if selector:
            # do very different stuff with *selector* and
            yield {}
Multiple Parser Functions
If I want to avoid a huge parse function and instead use multiple functions for different tasks / sections, are there especially good or bad ways to structure this (e.g. how and where to yield from)?
import scrapy


class SeqSpider(scrapy.Spider):
    """Parse the page sequentially."""
    name = 'seqspider'
    start_urls = ['https://en.wikipedia.org/wiki/Main_Page']

    def parse(self, response):
        selector = response.xpath('//some/path')
        if selector:
            yield from self.parser_01(response, selector)

        selector = response.xpath('//some/other/path')
        if selector:
            yield from self.parser_02(response, selector)

    def parser_01(self, response, selector):
        # do stuff with *selector* and
        yield {}

    def parser_02(self, response, selector):
        # do very different stuff with *selector* and
        yield {}
To answer your question on how to structure a spider / best practices, here's what I usually do:
Avoid building multiple spiders that work on the same pages: the targeted website's bandwidth is usually the bottleneck, and creating unnecessary traffic on the website isn't considered courteous scraping.
The same goes for making multiple requests to the same URL: it creates unnecessary traffic for the target website, which isn't nice scraper behavior. Also, Scrapy filters duplicate requests by default, so you might suddenly be wondering why not all of your requests are executed.
(Note: there are ways around this, e.g. using proxies, but that makes things unnecessarily complicated.)
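For completeness: if you really do need to fetch the same URL twice with different callbacks (as in the MultiSpider sketch above), the duplicate filter can be bypassed per request with dont_filter=True. A minimal sketch; the spider name and item fields are made up for illustration:

import scrapy


class DupExampleSpider(scrapy.Spider):
    # Hypothetical spider, only to illustrate dont_filter.
    name = 'dup_example'

    def start_requests(self):
        url = 'https://en.wikipedia.org/wiki/Main_Page'
        yield scrapy.Request(url, callback=self.parser_01)
        # Without dont_filter=True, this second request to the same URL
        # would be dropped by Scrapy's built-in duplicate request filter.
        yield scrapy.Request(url, callback=self.parser_02, dont_filter=True)

    def parser_01(self, response):
        yield {'part': 'one', 'url': response.url}

    def parser_02(self, response):
        yield {'part': 'two', 'url': response.url}

Still, for a single page it's usually better to request it once and dispatch to helper methods, as discussed below.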
So if you want to avoid big parse methods that do a lot of different things, feel free to split them up, very much like what you suggest in your own example. I would even go further and encapsulate complete processing steps in separate parse methods, e.g.
import scrapy


class SomeSpider(scrapy.Spider):
    def parse(self, response):
        yield from self.parse_widgets_type_a(response)
        yield from self.parse_widgets_type_b(response)
        # ....
        yield from self.follow_interesting_links(response)

    def parse_widgets_type_a(self, response):
        # ....

    def parse_widgets_type_b(self, response):
        # ....

    def follow_interesting_links(self, response):
        # ....
        yield scrapy.Request(interesting_url)
Based on this template you might even want to refactor the different parse methods into different Mixin Classes.
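A rough sketch of what that mixin refactoring could look like; the class names and CSS selectors here are made up for illustration:

import scrapy


class WidgetAMixin:
    def parse_widgets_type_a(self, response):
        # hypothetical selector for "type a" widgets
        for widget in response.css('.widget-a'):
            yield {'type': 'a', 'html': widget.get()}


class WidgetBMixin:
    def parse_widgets_type_b(self, response):
        # hypothetical selector for "type b" widgets
        for widget in response.css('.widget-b'):
            yield {'type': 'b', 'html': widget.get()}


class SomeSpider(WidgetAMixin, WidgetBMixin, scrapy.Spider):
    name = 'some_spider'
    start_urls = ['https://en.wikipedia.org/wiki/Main_Page']

    def parse(self, response):
        yield from self.parse_widgets_type_a(response)
        yield from self.parse_widgets_type_b(response)

Each mixin can then be unit-tested or reused in another spider on its own.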
If it is a single page, I would recommend using one spider: request the page once and parse all the data you need (you can use one or more functions for that).
I would also recommend using Items, for example:
import scrapy


class AmazonItem(scrapy.Item):
    product_name = scrapy.Field()
    product_asin = scrapy.Field()
    product_avg_stars = scrapy.Field()
    product_num_reviews = scrapy.Field()
If you want to save your crawled data to a database, you should use a pipeline.
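For illustration, a minimal pipeline sketch that writes the AmazonItem fields above into SQLite; the database file and table name are made up:

import sqlite3


class SaveToDbPipeline:
    def open_spider(self, spider):
        # Hypothetical database file and schema.
        self.conn = sqlite3.connect('items.db')
        self.conn.execute(
            'CREATE TABLE IF NOT EXISTS products '
            '(name TEXT, asin TEXT, avg_stars TEXT, num_reviews TEXT)'
        )

    def close_spider(self, spider):
        self.conn.commit()
        self.conn.close()

    def process_item(self, item, spider):
        # Insert one row per scraped item.
        self.conn.execute(
            'INSERT INTO products VALUES (?, ?, ?, ?)',
            (item.get('product_name'), item.get('product_asin'),
             item.get('product_avg_stars'), item.get('product_num_reviews')),
        )
        return item

Register it in settings.py with something like ITEM_PIPELINES = {'myproject.pipelines.SaveToDbPipeline': 300} (the module path is hypothetical).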
Related
What's the best approach to write contracts for Scrapy spiders that have more than one method to parse the response?
I saw this answer but it didn't sound very clear to me.
My current example: I have a method called parse_product that extracts the information on a page, but some of the data I need for the same product is on another page, so I yield a new request at the end of this method and let the new callback extract these fields and return the item.
The problem is that if I write a contract for the second method, it will fail because it doesn't have the meta attribute (containing the item with most of the fields). If I write a contract for the first method, I can't check that it returns the fields, because it returns a new request instead of the item.
def parse_product(self, response):
    il = ItemLoader(item=ProductItem(), response=response)
    # populate the item in here

    # yield the new request, sending the ItemLoader to another callback
    yield scrapy.Request(new_url, callback=self.parse_images, meta={'item': il})

def parse_images(self, response):
    """
    @url http://foo.bar
    @returns items 1 1
    @scrapes field1 field2 field3
    """
    il = response.request.meta['item']
    # extract the new fields and add them to the item in here
    yield il.load_item()
In the example, I put the contract in the second method, but it gave me a KeyError exception on response.request.meta['item']. Also, the fields field1 and field2 are populated in the first method.
Hope it's clear enough.
Frankly, I don't use Scrapy contracts and I don't really recommend anyone to use them either. They have many issues and someday may be removed from Scrapy.
In practice, I haven't had much luck using unit tests for spiders.
For testing spiders during development, I'd enable the cache and then re-run the spider as many times as needed to get the scraping right.
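Enabling the cache is just a couple of standard settings in settings.py, for example:

# settings.py -- cache every response on disk, so repeated runs
# during development don't hit the target site again.
HTTPCACHE_ENABLED = True
HTTPCACHE_EXPIRATION_SECS = 0   # 0 = cached responses never expire
HTTPCACHE_DIR = 'httpcache'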
For regression bugs, I had better luck using item pipelines (or spider middlewares) that do validation on-the-fly (there is only so much you can catch in early testing anyway). It's also a good idea to have some strategies for recovering.
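As a sketch of that kind of on-the-fly validation (the required field names are placeholders, not from your project), a small pipeline can simply drop items that are missing fields:

from scrapy.exceptions import DropItem


class RequiredFieldsPipeline:
    # Hypothetical list of fields every item must carry.
    required_fields = ('title', 'price')

    def process_item(self, item, spider):
        for field in self.required_fields:
            if not item.get(field):
                # Dropping the item surfaces the problem in the stats/logs.
                raise DropItem(f'Missing {field} in {item!r}')
        return item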
And for maintaining a healthy codebase, I'd be constantly moving library-like code out from the spider itself to make it more testable.
Sorry if this isn't the answer you're looking for.
I want to concatenate multiple spiders in Scrapy, so that the output of one feeds the next. I am aware of the usual Scrapy way of chaining parse functions and using the meta parameter of the request to communicate the item.
import scrapy


class MySpider(scrapy.Spider):
    start_urls = [url1]

    def parse(self, response):
        # parse code and item generated
        yield scrapy.Request(url2, callback=self.parse2, meta={'item': item})

    def parse2(self, response):
        item = response.meta['item']
        # parse2 code
But I have a very long chain of parsing functions to concatenate, and with that growing complexity the project would be more modular and easier to debug as multiple spiders.
Thanks!
An approach other than cron job scheduling would be to use the spider_closed signal.
Since you have written multiple spiders, one for each use case,
you can connect to the spider_closed signal and start the next spider from there, using a script.
You need to do this for every spider, so that the next dependent spider is started as soon as the current spider completes its run.
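Alternatively, instead of wiring up the signal by hand, the spiders can be run strictly one after another from a single script with CrawlerRunner and chained deferreds (the pattern from the Scrapy docs on running multiple spiders in the same process). The two spider classes below are placeholders for your own:

import scrapy
from twisted.internet import defer, reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging


class FirstSpider(scrapy.Spider):
    # Placeholder spider; replace with your real one.
    name = 'first'
    start_urls = ['https://example.com']

    def parse(self, response):
        yield {'url': response.url}


class SecondSpider(scrapy.Spider):
    # Placeholder spider; replace with your real one.
    name = 'second'
    start_urls = ['https://example.com']

    def parse(self, response):
        yield {'url': response.url}


configure_logging()
runner = CrawlerRunner()


@defer.inlineCallbacks
def crawl():
    # Each runner.crawl() only fires after the previous spider has closed,
    # so the spiders run strictly one after another.
    yield runner.crawl(FirstSpider)
    yield runner.crawl(SecondSpider)
    reactor.stop()


crawl()
reactor.run()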
I'd appreciate it if someone could help me understand how rules stack for depth crawling. Does stacking multiple rules result in the rules being processed one at a time? The aim is to grab links from the main page, return the items and the responses, and pass them to the next rule, which will pass the links off to another function, and so on.
rules = (
    Rule(LinkExtractor(restrict_xpaths=('--some xpath--',)),
         callback='function_a', follow=True),
    Rule(LinkExtractor(restrict_xpaths=('--some xpath--',)),
         callback='function_b', process_links='function_c', follow=True),
)

def function_a(self, response):
    # grab sports, games, link3 from the main page
    item = ItemA()
    i = response.xpath('---some xpath---')
    for xpth in i:
        item['name'] = xpth.xpath('---some xpath--').extract()
    # yield each item and URL link from function_a back to the second rule
    yield item, scrapy.Request(url)

def function_b(self, response):
    # receives responses from the second rule
    # grab links, same as function_a

def function_c(self, response):
    # does process_links in the rule send the links it received to function_c?
Can this be done recursively to deep-crawl a single site? I'm not sure I've got the rules concept right. Do I have to add X number of rules to process X levels of depth, or is there a better way to handle recursive depth crawls?
Thanks
From the docs the following passage implies that every rule is applied to every page. (My italics)
rules
Which is a list of one (or more) Rule objects. Each Rule defines a certain behaviour for crawling the site. Rules objects are described below. If multiple rules match the same link, the first one will be used, according to the order they're defined in this attribute.
In your case, target each rule at the appropriate page and then order the rules by depth.
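A sketch of what that could look like for a two-level crawl; the XPaths, URLs and callback names below are placeholders, not taken from your project:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class DepthSpider(CrawlSpider):
    name = 'depth_spider'
    start_urls = ['https://example.com']  # hypothetical start page

    rules = (
        # Level 1: category links, assumed to appear only on the main page.
        Rule(LinkExtractor(restrict_xpaths='//nav[@id="categories"]'),
             callback='parse_category', follow=True),
        # Level 2: article links, assumed to appear only on category pages.
        Rule(LinkExtractor(restrict_xpaths='//div[@class="article-list"]'),
             callback='parse_article', follow=False),
    )

    def parse_category(self, response):
        yield {'level': 1, 'url': response.url}

    def parse_article(self, response):
        yield {'level': 2, 'url': response.url}

Every rule is still applied to every crawled page; the targeting comes from the restrict_xpaths only matching on the pages where those regions exist, so the first rule effectively handles level-1 links and the second handles level-2 links.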
I'm using Scrapy and I read in the docs about the CONCURRENT_REQUESTS setting, described as "The maximum number of concurrent (ie. simultaneous) requests that will be performed by the Scrapy downloader."
I created a spider to take questions and answers from Q&A websites, and I want to know whether it is possible to run multiple concurrent requests.
Right now I have set this value to 1, because I don't want to lose any items or overwrite any.
My main doubt is that I have a global ID, idQuestion (for building an idQuestion.idAnswer id), for every item, so I don't know whether making multiple requests could make a mess and lose items or set wrong ids.
This is a snippet of code:
class Scraper(scrapy.Spider):
    uid = 1

    def parse_page(self, response):
        # Scraping a single question
        item = ScrapeItem()
        hxs = HtmlXPathSelector(response)
        #item['date_time'] = response.meta['data']
        item['type'] = "Question"
        item['uid'] = str(self.uid)
        item['url'] = response.url

        #Do some scraping.

        ans_uid = ans_uid + 1
        item['uid'] = str(str(self.uid) + (":" + str(ans_uid)))
        yield item

        #Call the method recursively on the other page.
        print("NEXT -> " + str(composed_string))
        yield scrapy.Request(composed_string, callback=self.parse_page)
This is the skeleton of my code.
I use uid to keep track of the id for a single question and ans_uid for its answers.
Ex:
Question
1.1) Ans 1 for Question 1
1.2) Ans 2 for Question 1
1.3) Ans 3 for Question 1
Can I simply increase the CONCURRENT_REQUESTS value without compromising anything?
The answer to your question is: no. If you increase the concurrent requests, you can end up with different values for uid even for the same question, because there is no guarantee that your requests are handled in order.
However, you can pass information along with your Request objects using the meta attribute. I would pass the ID along in the yield Request(...) call as a meta entry and then check in parse_page whether this attribute is present. If it is not, it is a new question; if it is, use that id, because this is not a new question.
You can read more about meta here: http://doc.scrapy.org/en/latest/topics/request-response.html#scrapy.http.Request.meta
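A rough sketch of that idea using a plain dict as the item; the spider name, URLs and XPaths are placeholders:

import scrapy


class QuestionSpider(scrapy.Spider):
    # Hypothetical spider showing how to carry an id through meta
    # instead of a spider attribute, so concurrency can't scramble it.
    name = 'question_spider'
    start_urls = ['https://example.com/questions']

    def parse(self, response):
        for i, question in enumerate(response.xpath('//div[@class="question"]'), start=1):
            url = question.xpath('.//a/@href').get()  # placeholder xpath
            if url:
                # The question id travels with the request itself.
                yield response.follow(url, callback=self.parse_question,
                                      meta={'uid': str(i)})

    def parse_question(self, response):
        uid = response.meta['uid']
        yield {'type': 'Question', 'uid': uid, 'url': response.url}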
Scrapy is not a multithreaded environment, but rather uses an event loop driven asynchronous architecture (Twisted, which is a bit like node.js for python).
In that sense, it is completely thread-safe.
You actually have a reference to the request object via the response: response.request gives you response.request.url, the Referer header that was sent, and response.request.meta, so you have a mapping from answers back to questions built in (like a referrer header of sorts). If you are reading a list of questions or answers from a single page, you are guaranteed that those questions and answers will be read in order.
you can do something like the following:
class mySpider(Spider):

    def parse_answer(self, response):
        question_url = response.request.headers.get('Referer', None)
        yield Answer(question_url=..., answerinfo=...)


class Answer(Item):
    answer = ....
    question_url = ...
Hope that helps.
I'm crawling a web site (only two levels deep), and I want to scrape information from pages on both levels. The problem I'm running into is that I want to fill out the fields of one item with information from both levels. How do I do this?
I was thinking of having a list of items as an instance variable that would be accessible by all threads (since it's the same instance of the spider); parse_1 would fill out some fields, and parse_2 would have to check for the correct key before filling in the corresponding value. This approach seems burdensome, and I'm still not sure how to make it work.
What I'm thinking is there must be a better way, maybe somehow passing an item to the callback. I don't know how to do that with the Request() method though. Ideas?
From scrapy documentation:
In some cases you may be interested in passing arguments to those callback functions so you can receive the arguments later, in the second callback. You can use the Request.meta attribute for that.
Here’s an example of how to pass an item using this mechanism, to populate different fields from different pages:
def parse_page1(self, response):
    item = MyItem()
    item['main_url'] = response.url
    request = Request("http://www.example.com/some_page.html",
                      callback=self.parse_page2)
    request.meta['item'] = item
    return request

def parse_page2(self, response):
    item = response.meta['item']
    item['other_url'] = response.url
    return item
So, basically, you scrape the first page and store all of its information in the item, then send the whole item along with the request for the second-level URL, and end up with all the information in one item.
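As a side note, on newer Scrapy versions (1.7+) the same data can be passed with cb_kwargs instead of meta, which hands it straight to the callback as a keyword argument. A minimal sketch along the lines of the docs example above; the spider name and URLs are placeholders:

import scrapy


class TwoLevelSpider(scrapy.Spider):
    # Hypothetical spider illustrating cb_kwargs instead of meta.
    name = 'two_level'
    start_urls = ['https://www.example.com/']

    def parse(self, response):
        item = {'main_url': response.url}
        # The item arrives in parse_page2 as a keyword argument.
        yield scrapy.Request('https://www.example.com/some_page.html',
                             callback=self.parse_page2,
                             cb_kwargs={'item': item})

    def parse_page2(self, response, item):
        item['other_url'] = response.url
        yield item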