I need to raise CloseSpider from a Scrapy Pipeline. Either that or return some parameter from the Pipeline back to the Spider to do the raise.
For example, if the date already exists, raise CloseSpider:
raise CloseSpider('Already been scraped:' + response.url)
Is there a way to do this?
According to the Scrapy docs, the CloseSpider exception can only be raised from a callback function (by default, the parse function) in a spider; raising it in a pipeline will crash the spider. To achieve a similar result from a pipeline, you can initiate a shutdown signal, which will close Scrapy gracefully:
from scrapy.project import crawler
crawler._signal_shutdown(9,0)
Do remember that Scrapy might still process requests that were already fired, or even scheduled, after the shutdown signal is initiated.
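The scrapy.project module has since been removed, so in recent Scrapy versions the equivalent is to grab the crawler object in the pipeline and ask it to stop. A minimal sketch, assuming a recent Scrapy release (the duplicate check is a placeholder):

from scrapy.exceptions import DropItem


class StopOnDuplicatePipeline(object):

    def __init__(self, crawler):
        self.crawler = crawler

    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy hands the running Crawler to the pipeline here.
        return cls(crawler)

    def process_item(self, item, spider):
        if self.already_scraped(item):  # placeholder condition
            # Start a graceful shutdown; requests already in flight
            # may still be processed before the crawl actually stops.
            self.crawler.stop()
            raise DropItem('Already been scraped: %s' % item.get('url'))
        return item

    def already_scraped(self, item):
        # Hypothetical helper; replace with your own date/URL check.
        return False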
To do it from the spider, set a flag on the spider from the pipeline, like this:
def process_item(self, item, spider):
    if some_condition_is_met:
        spider.close_manually = True
    return item
After this, you can raise the CloseSpider exception in the callback function of your spider:
from scrapy.exceptions import CloseSpider

def parse(self, response):
    if self.close_manually:
        raise CloseSpider('Already been scraped.')
I prefer the following solution.
class MongoDBPipeline(object):

    def process_item(self, item, spider):
        spider.crawler.engine.close_spider(spider, reason='duplicate')
        return item
Source: Force spider to stop in scrapy
from scrapy.exceptions import DropItem

class DuplicatesPipeline(object):

    def __init__(self):
        self.ids_seen = set()

    def process_item(self, item, spider):
        if item['id'] in self.ids_seen:
            raise DropItem("Duplicate item found: %s" % item)
        else:
            self.ids_seen.add(item['id'])
            return item
The code above is from the official Scrapy documentation: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
It is used for filtering duplicates.
And as the Scrapy documentation suggests (http://doc.scrapy.org/en/latest/topics/jobs.html),
to pause and resume a spider, I need to use the Jobs system.
So I'm curious whether the Scrapy Jobs system can make the duplicates filter persistent in its directory. The way the duplicates filter is implemented is so simple that I'm in doubt.
You just need to implement your pipeline so that it reads the JOBDIR setting and, when that setting is defined, your pipeline:
Reads the initial value of self.ids_seen from some file inside the JOBDIR directory.
At run time, updates that file as new IDs are added to the set. A sketch of this approach is shown below.
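A minimal sketch of such a pipeline, assuming the seen IDs fit in memory and are persisted one per line to a file inside JOBDIR (the file name ids_seen.txt is an illustrative choice):

import os

from scrapy.exceptions import DropItem


class PersistentDuplicatesPipeline(object):

    def __init__(self, jobdir):
        self.ids_seen = set()
        # Only persist when a JOBDIR is configured; the file name is arbitrary.
        self.path = os.path.join(jobdir, 'ids_seen.txt') if jobdir else None
        self.file = None

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler.settings.get('JOBDIR'))

    def open_spider(self, spider):
        if self.path:
            os.makedirs(os.path.dirname(self.path), exist_ok=True)
            if os.path.exists(self.path):
                with open(self.path) as f:
                    self.ids_seen = set(line.strip() for line in f if line.strip())
            # Keep the file open in append mode so each new ID is written as it is seen.
            self.file = open(self.path, 'a')

    def close_spider(self, spider):
        if self.file:
            self.file.close()

    def process_item(self, item, spider):
        item_id = str(item['id'])
        if item_id in self.ids_seen:
            raise DropItem("Duplicate item found: %s" % item)
        self.ids_seen.add(item_id)
        if self.file:
            self.file.write(item_id + '\n')
        return item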
I was trying to use the following function to wait for a crawler to finish and return all results. However, this function always returns immediately when called while the crawler is still running. What am I missing here? Isn't join() supposed to wait?
def spider_results():
    runner = CrawlerRunner(get_project_settings())
    results = []

    def crawler_results(signal, sender, item, response, spider):
        results.append(item)

    dispatcher.connect(crawler_results, signal=signals.item_passed)
    runner.crawl(QuotesSpider)
    runner.join()
    return results
According to the Scrapy docs (the Common Practices section), the CrawlerProcess class is recommended for cases like this.
Note that runner.join() returns a Deferred rather than blocking, so without a running Twisted reactor your function returns immediately.
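A minimal sketch of that approach, collecting the scraped items through the item_scraped signal (QuotesSpider is the spider from the question and is assumed to be importable from your project):

from scrapy import signals
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings


def spider_results():
    results = []

    def collect_item(item, response, spider):
        results.append(item)

    process = CrawlerProcess(get_project_settings())
    # create_crawler() gives access to the crawler's signal manager
    # before the crawl is started.
    crawler = process.create_crawler(QuotesSpider)
    crawler.signals.connect(collect_item, signal=signals.item_scraped)
    process.crawl(crawler)
    process.start()  # blocks here until the crawl is finished
    return results

Note that CrawlerProcess starts the Twisted reactor, so a function like this can only be called once per Python process.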
I'm trying to capture "finish_reason" in Scrapy after each crawl and insert this info into a database. The crawl instance is created in a pipeline before the first item is collected.
It seems like I have to use the "engine_stopped" signal, but I couldn't find an example of how, or where, I should put my code to do this.
One possible option is to override scrapy.statscollectors.MemoryStatsCollector (docs, code) and its close_spider method:
middlewares.py:
import pprint

from scrapy.statscollectors import MemoryStatsCollector, logger


class MemoryStatsCollectorSender(MemoryStatsCollector):

    # Override the close_spider method
    def close_spider(self, spider, reason):
        # The finish_reason is in the `reason` variable;
        # add your data-sending code here.
        if self._dump:
            logger.info("Dumping Scrapy stats:\n" + pprint.pformat(self._stats),
                        extra={'spider': spider})
        self._persist_stats(self._stats, spider)
Add the newly created stats collector class to settings.py:
STATS_CLASS = 'project.middlewares.MemoryStatsCollectorSender'
#STATS_CLASS = 'scrapy.statscollectors.MemoryStatsCollector'
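For completeness, the spider_closed signal also carries the close reason, so connecting a handler to it is another way to capture finish_reason. A minimal sketch as a small extension, with the database insert left as a placeholder (the class would be enabled through the EXTENSIONS setting):

from scrapy import signals


class FinishReasonLogger(object):

    @classmethod
    def from_crawler(cls, crawler):
        ext = cls()
        # spider_closed is sent with the spider and the close reason.
        crawler.signals.connect(ext.spider_closed, signal=signals.spider_closed)
        return ext

    def spider_closed(self, spider, reason):
        # Placeholder: replace the log call with your database insert.
        spider.logger.info('Spider %s finished with reason: %s', spider.name, reason)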
I have found plenty of information on calling a function when a Scrapy spider quits (viz: Call a function in Settings from spider Scrapy), but I'm looking for how to call a function -- just once -- when the spider opens. I can't find this in the Scrapy documentation.
I've got a project of multiple spiders that scrape event information and post them to different Google Calendars. The event information is updated often, so before the spider runs, I need to clear out the existing Google Calendar information in order to refresh it entirely. I've got a working function that accomplishes this when passed a calendar ID. Each spider posts to a different Google Calendar, so I need to be able to pass the calendar ID from within the spider to the function that clears the calendar.
I've defined a base spider in __init__.py that looks like this:
import scrapy
from scrapy.spiders import CrawlSpider, Rule
## import other stuff I need for the clear_calendar() function

class BaseSpider(CrawlSpider):

    def clear_calendar(self, CalId):
        ## working code to clear the calendar
Now I can call that function from within parse_item like this:
from myproject import BaseSpider

class ExampleSpider(BaseSpider):

    def parse_item(self, response):
        calendarID = 'MycalendarID'
        self.clear_calendar(calendarID)
        ## other stuff to do
And of course that calls the function every single time an item is scraped, which is ridiculous. But if I move the function call outside of def parse_item, I get the error "self is not defined", or, if I remove "self", "clear_calendar is not defined."
How can I call a function that requires an argument just once from within a Scrapy spider? Or, is there a better way to go about this?
There is totally a better way, with the spider_opened signal.
I think on newer versions of scrapy, there is a spider_opened method ready for you to use inside the spider:
from scrapy import Spider, signals


class MySpider(Spider):
    ...
    calendar_id = 'something'

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super(MySpider, cls).from_crawler(crawler, *args, **kwargs)
        crawler.signals.connect(spider.spider_opened, signal=signals.spider_opened)
        return spider

    def spider_opened(self):
        calendar_id = self.calendar_id
        # use my calendar_id
I'm using MongoDB to store the crawled data.
Now I want to query the last date in the data, so that I can continue the crawl without restarting from the beginning of the URL list (the URLs can be determined by the date, like /2014-03-22.html).
I want only one connection object to handle the database operations, and it lives in the pipeline.
So I want to know how I can get that pipeline object (not a new one) from the spider.
Or is there any better solution for incremental updates?
Thanks in advance.
Sorry for my poor English.
Here's a sample:
# This is my Pipeline
class MongoDBPipeline(object):

    def __init__(self, mongodb_db=None, mongodb_collection=None):
        self.connection = pymongo.Connection(settings['MONGODB_SERVER'], settings['MONGODB_PORT'])
        ....

    def process_item(self, item, spider):
        ....

    def get_date(self):
        ....
And the spider:
class Spider(Spider):
    name = "test"
    ....

    def parse(self, response):
        # I want to get the Pipeline object here.
        mongo = MongoDBPipeline()  # doing it this way creates a new Pipeline object
        mongo.get_date()
        # Scrapy must already have a Pipeline object for the spider;
        # I want to get that object, the one created when Scrapy started.
OK, I just don't want to create a new object... I admit I'm a bit obsessive about this.
A Scrapy pipeline has an open_spider method that gets executed after the spider is initialized. You can pass a reference to the database connection, the get_date() method, or the pipeline itself to your spider. An example of the latter, using your code:
# This is my Pipeline
class MongoDBPipeline(object):

    def __init__(self, mongodb_db=None, mongodb_collection=None):
        self.connection = pymongo.Connection(settings['MONGODB_SERVER'], settings['MONGODB_PORT'])
        ....

    def process_item(self, item, spider):
        ....

    def get_date(self):
        ....

    def open_spider(self, spider):
        spider.myPipeline = self
Then, in the spider:
class Spider(Spider):
    name = "test"

    def __init__(self):
        self.myPipeline = None

    def parse(self, response):
        self.myPipeline.get_date()
I don't think the __init__() method is necessary here, but I put it here to show that open_spider replaces it after initialization.
According to the scrapy Architecture Overview:
The Item Pipeline is responsible for processing the items once they
have been extracted (or scraped) by the spiders.
Basically, that means that the spiders do their work first, and the extracted items then go to the pipelines; there is no way to go backwards.
One possible solution would be to check, in the pipeline itself, whether the item you've scraped is already in the database.
Another workaround would be to keep the list of URLs you've crawled in the database and, in the spider, check whether you've already got the data for a URL, as in the sketch below.
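A minimal sketch of that second workaround, assuming a MongoDB collection named crawled_urls and a pymongo client (the connection details, field names, and the build_url_list() helper are illustrative):

import pymongo
import scrapy


class IncrementalSpider(scrapy.Spider):
    name = "incremental"

    def start_requests(self):
        # Connection details are placeholders.
        client = pymongo.MongoClient('localhost', 27017)
        collection = client['mydb']['crawled_urls']
        seen_urls = set(doc['url'] for doc in collection.find({}, {'url': 1}))
        for url in self.build_url_list():
            # Only request URLs that have not been crawled before.
            if url not in seen_urls:
                yield scrapy.Request(url, callback=self.parse)

    def build_url_list(self):
        # Hypothetical helper that generates the date-based URLs, e.g. /2014-03-22.html.
        return []

    def parse(self, response):
        # Extract items here; a pipeline can record response.url into crawled_urls.
        pass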
Since I'm not sure what you mean by "start from the beginning", I cannot suggest anything more specific.
I hope at least this information helps.