I am using Scrapy with Python to scrape several websites.
I have many spiders with a structure like this:
import library as lib

class Spider(Spider):
    ...
    def parse(self, response):
        yield FormRequest(..., callback=lib.parse_after_filtering_results1)
        yield FormRequest(..., callback=lib.parse_after_filtering_results2)

    def parse_after_filtering_results1(self, response):
        return results

    def parse_after_filtering_results2(self, response):
        ...  # (doesn't return anything)
I would like to know if there is any way to move the last two functions, which are used as callbacks, into another module that is common to all my spiders (so that if I modify it, all of them change). I know they are class methods, but is there any way I could put them in another file?
I have tried declaring the functions in my library.py file, but my problem is how to pass the two required parameters (self, response) to them.
Create a base class to contain those common functions; your real spiders can then inherit from it. For example, if all your spiders extend Spider, you can do the following:
spiders/basespider.py:
from scrapy import Spider

class BaseSpider(Spider):
    # Do not give it a name so that it does not show up in the spiders list.
    # This contains only common functions.

    def parse_after_filtering_results1(self, response):
        # ...

    def parse_after_filtering_results2(self, response):
        # ...
spiders/realspider.py:
from .basespider import BaseSpider

class RealSpider(BaseSpider):
    # ...
    def parse(self, response):
        yield FormRequest(..., callback=self.parse_after_filtering_results1)
        yield FormRequest(..., callback=self.parse_after_filtering_results2)
If you have different types of spiders, you can create different base classes. Alternatively, your base class can be a plain object (not a Spider), and then you can use it as a mixin.
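The mixin variant can be sketched standalone. Here the shared callbacks live on a plain object and the spider mixes it in; `ParseHelpersMixin`, `FakeResponse`, and the returned dict are illustrative stand-ins, not Scrapy APIs (in a real project the spider would inherit from both the mixin and `Spider`):

```python
class ParseHelpersMixin:
    # Plain object holding the shared callbacks; no Scrapy dependency here.
    def parse_after_filtering_results1(self, response):
        # shared extraction logic; "response" is whatever Scrapy passes in
        return {"url": getattr(response, "url", None)}

    def parse_after_filtering_results2(self, response):
        # shared side-effect-only logic
        pass

class FakeResponse:
    # minimal stand-in for scrapy.http.Response, just enough to run this sketch
    url = "http://example.com"

class RealSpider(ParseHelpersMixin):  # in practice: (ParseHelpersMixin, Spider)
    pass

result = RealSpider().parse_after_filtering_results1(FakeResponse())
print(result)  # {'url': 'http://example.com'}
```

Editing `ParseHelpersMixin` then changes the behavior of every spider that mixes it in, which is exactly the "modify once, change everywhere" goal from the question.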
I'm trying to capture "finish_reason" in Scrapy after each crawl and insert this information into a database. The crawl instance is created in a pipeline before the first item is collected.
It seems like I have to use the "engine_stopped" signal, but I couldn't find an example of how or where I should put my code to do this.
One possible option is to override scrapy.statscollectors.MemoryStatsCollector (docs, code) and its close_spider method:
middleware.py:
import pprint

from scrapy.statscollectors import MemoryStatsCollector, logger

class MemoryStatsCollectorSender(MemoryStatsCollector):
    # Override the close_spider method
    def close_spider(self, spider, reason):
        # The finish_reason is in the "reason" variable;
        # add your data-sending code here.
        if self._dump:
            logger.info("Dumping Scrapy stats:\n" + pprint.pformat(self._stats),
                        extra={'spider': spider})
        self._persist_stats(self._stats, spider)
Add the newly created stats collector class to settings.py:
STATS_CLASS = 'project.middlewares.MemoryStatsCollectorSender'
# Default: STATS_CLASS = 'scrapy.statscollectors.MemoryStatsCollector'
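If overriding the stats collector feels heavy, an alternative is to connect to the spider_closed signal, which also receives the finish reason. The sketch below simulates the signal dispatch so it runs without Scrapy installed; `store_reason` is a hypothetical stand-in for your database insert, and in a real project you would connect the handler via `crawler.signals.connect(..., signal=signals.spider_closed)`:

```python
recorded = {}

def store_reason(spider_name, reason):
    # hypothetical stand-in for your database insert
    recorded[spider_name] = reason

class FinishReasonExtension:
    # The handler shape Scrapy expects for spider_closed: it is called
    # with the spider and the finish reason (e.g. "finished", "shutdown").
    def spider_closed(self, spider, reason):
        store_reason(spider.name, reason)

class FakeSpider:
    # minimal stand-in for a real spider, just enough to run this sketch
    name = "test"

# Simulate Scrapy firing the signal at the end of the crawl:
ext = FinishReasonExtension()
ext.spider_closed(FakeSpider(), "finished")
print(recorded)  # {'test': 'finished'}
```

Compared to replacing STATS_CLASS, a signal handler only sees the reason; the stats-collector override additionally has all crawl stats in `self._stats` if you want to store those too.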
I have found plenty of information on calling a function when a Scrapy spider quits (viz: Call a function in Settings from spider Scrapy), but I'm looking for how to call a function -- just once -- when the spider opens. I cannot find this in the Scrapy documentation.
I've got a project of multiple spiders that scrape event information and post them to different Google Calendars. The event information is updated often, so before the spider runs, I need to clear out the existing Google Calendar information in order to refresh it entirely. I've got a working function that accomplishes this when passed a calendar ID. Each spider posts to a different Google Calendar, so I need to be able to pass the calendar ID from within the spider to the function that clears the calendar.
I've defined a base spider in __init__.py that looks like this:
import scrapy
from scrapy.spiders import CrawlSpider, Rule
## import other stuff I need for the clear_calendar() function

class BaseSpider(CrawlSpider):
    def clear_calendar(self, CalId):
        ## working code to clear the calendar
Now I can call that function from within parse_item like:
from myproject import BaseSpider

class ExampleSpider(BaseSpider):
    def parse_item(self, response):
        calendarID = 'MycalendarID'
        self.clear_calendar(calendarID)
        ## other stuff to do
And of course that calls the function every single time an item is scraped, which is ridiculous. But if I move the function call outside of def parse_item, I get the error "self is not defined", or, if I remove "self", "clear_calendar is not defined."
How can I call a function that requires an argument just once from within a Scrapy spider? Or, is there a better way to go about this?
There is totally a better way: use the spider_opened signal.
I think that on newer versions of Scrapy there is a spider_opened method ready for you to use inside the spider:
from scrapy import Spider, signals

class MySpider(Spider):
    ...
    calendar_id = 'something'

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super(MySpider, cls).from_crawler(crawler, *args, **kwargs)
        crawler.signals.connect(spider.spider_opened, signal=signals.spider_opened)
        return spider

    def spider_opened(self):
        calendar_id = self.calendar_id
        # use my calendar_id
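The reason this runs exactly once is that Scrapy fires spider_opened a single time, before any parsing starts. The standalone sketch below simulates that dispatch; `FakeSignals`, `MySpider`, and the `cleared` list are stand-ins so the snippet runs without Scrapy, and appending to `cleared` stands in for the real `clear_calendar(self.calendar_id)` call:

```python
cleared = []

class FakeSignals:
    # minimal stand-in for crawler.signals, just enough to run this sketch
    def __init__(self):
        self.handlers = []

    def connect(self, handler, signal=None):
        self.handlers.append(handler)

    def fire(self):
        # Scrapy fires spider_opened once, before any responses are parsed
        for handler in self.handlers:
            handler()

class MySpider:
    calendar_id = "MycalendarID"

    def spider_opened(self):
        # would be: clear_calendar(self.calendar_id)
        cleared.append(self.calendar_id)

signals = FakeSignals()
spider = MySpider()
signals.connect(spider.spider_opened)
signals.fire()
print(cleared)  # ['MycalendarID']
```

Because the handler is a bound method, `self` is available inside it, so each spider subclass can simply set its own `calendar_id` class attribute and inherit the rest.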
I am running a CrawlSpider and I want to implement some logic to stop following some of the links in mid-run, by passing a function to process_request.
This function uses the spider's class variables in order to keep track of the current state, and depending on it (and on the referrer URL), links get dropped or continue to be processed:
class BroadCrawlSpider(CrawlSpider):
    name = 'bitsy'
    start_urls = ['http://scrapy.org']
    foo = 5

    rules = (
        Rule(LinkExtractor(), callback='parse_item', process_request='filter_requests', follow=True),
    )

    def parse_item(self, response):
        <some code>

    def filter_requests(self, request):
        if self.foo == 6 and request.headers.get('Referer', None) == someval:
            raise IgnoreRequest("Ignored request: bla %s" % request)
        return request
I think that if I were to run several spiders on the same machine, they would all use the same class variables, which is not my intention.
Is there a way to add instance variables to CrawlSpiders? Is only a single instance of the spider created when I run Scrapy?
I could probably work around it with a dictionary of values keyed by process ID, but that would be ugly...
I think spider arguments would be the solution in your case.
When invoking Scrapy like scrapy crawl some_spider, you can add arguments like scrapy crawl some_spider -a foo=bar, and the spider will receive the values via its constructor, e.g.:
class SomeSpider(scrapy.Spider):
    def __init__(self, foo=None, *args, **kwargs):
        super(SomeSpider, self).__init__(*args, **kwargs)
        # Do something with foo
What's more, since scrapy.Spider actually sets all additional arguments as instance attributes, you don't even need to explicitly override the __init__ method; you can just access the .foo attribute. :)
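The attribute-setting behavior can be sketched without Scrapy. The simplified `Spider.__init__` below mimics what scrapy.Spider does with extra keyword arguments (this is an approximation of its logic, not the actual implementation), which is why `self.foo` works with no custom constructor; note that `-a` values always arrive as strings:

```python
class Spider:
    # simplified sketch of how scrapy.Spider absorbs extra -a arguments
    def __init__(self, name=None, **kwargs):
        self.name = name
        # every leftover keyword argument becomes an instance attribute
        self.__dict__.update(kwargs)

class SomeSpider(Spider):
    pass

# "scrapy crawl some_spider -a foo=bar" ends up roughly as:
spider = SomeSpider(name="some_spider", foo="bar")
print(spider.foo)  # bar
```

Because each crawl constructs its own spider instance with its own attributes, this also sidesteps the shared-class-variable concern from the question.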
I want to parse a sitemap, find all the URLs in it, append some word to each URL, and then check the response code of all the modified URLs.
For this task I decided to use Scrapy because it has built-in support for crawling sitemaps, as described in Scrapy's documentation.
With the help of that documentation I created my spider, but I want to change the URLs before they are fetched. For this I tried to take help from this link, which suggested using rules and implementing process_request(). But I was not able to make use of these; I tried a little bit, which I have commented out below. Could anyone help me write the exact code for the commented lines, or suggest any other way to do this task in Scrapy?
from scrapy.contrib.spiders import SitemapSpider

class MySpider(SitemapSpider):
    sitemap_urls = ['http://www.example.com/sitemap.xml']
    #sitemap_rules = [some_rules, process_request='process_request')]

    #def process_request(self, request, spider):
    #    modified_url = orginal_url_from_sitemap + 'myword'
    #    return request.replace(url=modified_url)

    def parse(self, response):
        print response.status, response.url
You can attach a handler to the request_scheduled signal and do what you want inside that function. For example:
from scrapy import signals

class MySpider(SitemapSpider):
    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super(MySpider, cls).from_crawler(crawler, *args, **kwargs)
        crawler.signals.connect(spider.request_scheduled, signals.request_scheduled)
        return spider

    def request_scheduled(self, request, spider):
        # Request.url is a read-only property, so it cannot be reassigned
        # directly; _set_url is Scrapy's (private) setter.
        modified_url = orginal_url_from_sitemap + 'myword'
        request._set_url(modified_url)
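Since mutating a scheduled request relies on a private setter, a cleaner pattern in Scrapy is a downloader middleware whose process_request may return a replacement request built with `request.replace(url=...)`. The sketch below mimics just enough of Scrapy's immutable-style `Request` to run standalone; the `Request` class and `process_request` function here are simplified stand-ins, not Scrapy's actual implementations:

```python
class Request:
    # minimal stand-in for scrapy.http.Request: url is read-only,
    # and replace() returns a new request with fields swapped out
    def __init__(self, url):
        self._url = url

    @property
    def url(self):
        return self._url

    def replace(self, url=None):
        return Request(url if url is not None else self._url)

def process_request(request, spider=None):
    # downloader-middleware-style hook: return a replacement request
    # with the extra word appended to the URL
    return request.replace(url=request.url + "myword")

req = Request("http://www.example.com/page/")
modified = process_request(req)
print(modified.url)  # http://www.example.com/page/myword
```

Returning a new request instead of mutating the old one keeps Scrapy's scheduler and dupe filter consistent, which is why the real API is shaped this way.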
SitemapSpider has a sitemap_filter method.
You can override it to implement the required functionality:
class MySpider(SitemapSpider):
    ...
    def sitemap_filter(self, entries):
        for entry in entries:
            entry["loc"] = entry["loc"] + myword
            yield entry
Each of those entry objects is a dict with a structure like this:
{'loc': 'https://example.com/',
 'lastmod': '2019-01-04T08:09:23+00:00',
 'changefreq': 'weekly',
 'priority': '0.8'}
Important note: the SitemapSpider.sitemap_filter method appeared in Scrapy 1.6.0, released in January 2019 (see the "new extensibility features" section of the 1.6.0 release notes).
I've just faced this. Apparently you can't really use process_request because the sitemap rules in SitemapSpider are different from the Rule objects in CrawlSpider: only the latter can take this argument.
After examining the code, it looks like this can be worked around by manually overriding part of the SitemapSpider implementation:
class MySpider(SitemapSpider):
    sitemap_urls = ['...']
    sitemap_rules = [('/', 'parse')]

    def start_requests(self):
        # override to call custom_parse_sitemap instead of _parse_sitemap
        for url in self.sitemap_urls:
            yield Request(url, self.custom_parse_sitemap)

    def custom_parse_sitemap(self, response):
        # modify requests marked to be called with the parse callback
        for request in super()._parse_sitemap(response):
            if request.callback == self.parse:
                yield self.modify_request(request)
            else:
                yield request

    def modify_request(self, request):
        return request.replace(
            # ...
        )

    def parse(self, response):
        # ...
I have used MongoDB to store the crawled data.
Now I want to query the last date of the data, so that I can continue the crawl without restarting from the beginning of the URL list (the URLs can be determined by date, like /2014-03-22.html).
I want only one connection object to handle the database operations, and it lives in the pipeline.
So, I want to know how I can get the pipeline object (not a new one) in the spider.
Or is there any better solution for incremental updates?
Thanks in advance.
Sorry for my poor English...
Here is a sample:
# This is my pipeline
class MongoDBPipeline(object):
    def __init__(self, mongodb_db=None, mongodb_collection=None):
        self.connection = pymongo.Connection(settings['MONGODB_SERVER'], settings['MONGODB_PORT'])
        ....
    def process_item(self, item, spider):
        ....
    def get_date(self):
        ....
And the spider:
class Spider(Spider):
    name = "test"
    ....
    def parse(self, response):
        # Want to get the pipeline object
        mongo = MongoDBPipeline()  # this way creates a new pipeline object
        mongo.get_date()           # Scrapy already has a pipeline object for the spider;
                                   # I want the one created when Scrapy started.
OK, I just don't want to create a new object... I admit I have OCD.
A Scrapy pipeline has an open_spider method that is executed after the spider is initialized. You can pass a reference to the database connection, the get_date() method, or the pipeline itself to your spider. An example of the latter with your code:
# This is my pipeline
class MongoDBPipeline(object):
    def __init__(self, mongodb_db=None, mongodb_collection=None):
        self.connection = pymongo.Connection(settings['MONGODB_SERVER'], settings['MONGODB_PORT'])
        ....
    def process_item(self, item, spider):
        ....
    def get_date(self):
        ....
    def open_spider(self, spider):
        spider.myPipeline = self
Then, in the spider:
class Spider(Spider):
    name = "test"

    def __init__(self):
        self.myPipeline = None

    def parse(self, response):
        self.myPipeline.get_date()
I don't think the __init__() method is necessary here, but I included it to show that open_spider replaces it after initialization.
According to the scrapy Architecture Overview:
The Item Pipeline is responsible for processing the items once they
have been extracted (or scraped) by the spiders.
Basically that means that the spiders do their work first, and the extracted items then go to the pipelines; there is no way to go backwards.
One possible solution would be to check, in the pipeline itself, whether the item you've scraped is already in the database.
Another workaround would be to keep the list of URLs you've crawled in the database and, in the spider, check whether you've already got the data for a URL.
Since I'm not sure what you mean by "start from the beginning", I cannot suggest anything specific.
I hope at least this information helps.
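The first workaround can be sketched as a pipeline-side duplicate check. Here the in-memory `seen` set is a stand-in for a real MongoDB lookup (e.g. `collection.find_one({'url': item['url']})`), and `process_item` is the pipeline hook that would receive each scraped item:

```python
seen = set()

def process_item(item):
    # stand-in for a MongoDB existence check on the item's URL
    if item["url"] in seen:
        return None  # already in the database: drop (or update) instead of inserting
    seen.add(item["url"])  # stand-in for the actual insert
    return item

first = process_item({"url": "/2014-03-22.html"})
second = process_item({"url": "/2014-03-22.html"})
print(first, second)  # {'url': '/2014-03-22.html'} None
```

In a real Scrapy pipeline you would raise DropItem instead of returning None, but the idea is the same: the pipeline, which owns the single database connection, decides whether an item is new, so the spider never needs to reach into the pipeline object at all.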