How to add instance variable to Scrapy CrawlSpider? - python

I am running a CrawlSpider and I want to implement some logic to stop following some of the links in mid-run, by passing a function to process_request.
This function uses the spider's class variables to keep track of the current state, and depending on that state (and on the referrer URL), links are dropped or continue to be processed:
from scrapy.exceptions import IgnoreRequest
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class BroadCrawlSpider(CrawlSpider):
    name = 'bitsy'
    start_urls = ['http://scrapy.org']
    foo = 5

    rules = (
        Rule(LinkExtractor(), callback='parse_item', process_request='filter_requests', follow=True),
    )

    def parse_item(self, response):
        <some code>

    def filter_requests(self, request):
        if self.foo == 6 and request.headers.get('Referer', None) == someval:
            raise IgnoreRequest("Ignored request: bla %s" % request)
        return request
I think that if I were to run several spiders on the same machine, they would all use the same class variables, which is not my intention.
Is there a way to add instance variables to CrawlSpiders? Is only a single instance of the spider created when I run Scrapy?
I could probably work around it with a dictionary of values keyed by process ID, but that would be ugly...

I think spider arguments would be the solution in your case.
When invoking scrapy like scrapy crawl some_spider, you could add arguments like scrapy crawl some_spider -a foo=bar, and the spider would receive the values via its constructor, e.g.:
import scrapy

class SomeSpider(scrapy.Spider):

    def __init__(self, foo=None, *args, **kwargs):
        super(SomeSpider, self).__init__(*args, **kwargs)
        # Do something with foo
What's more, as scrapy.Spider actually sets all additional arguments as instance attributes, you don't even need to explicitly override the __init__ method but just access the .foo attribute. :)
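For the spider from the question, a minimal sketch could look like this (note that values passed with -a arrive as strings, so cast them if you need a number). Running scrapy crawl bitsy -a foo=6 and, in the spider:

class BroadCrawlSpider(CrawlSpider):
    name = 'bitsy'
    start_urls = ['http://scrapy.org']

    def __init__(self, foo=5, *args, **kwargs):
        super(BroadCrawlSpider, self).__init__(*args, **kwargs)
        self.foo = int(foo)  # instance attribute, so concurrently running spiders don't share state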

Related

Scrapy callback function in another file

I am using Scrapy with Python to scrape several websites.
I have many spiders with a structure like this:
import library as lib

class Spider(Spider):
    ...

    def parse(self, response):
        yield FormRequest(..., callback=lib.parse_after_filtering_results1)
        yield FormRequest(..., callback=lib.parse_after_filtering_results2)

    def parse_after_filtering_results1(self, response):
        return results

    def parse_after_filtering_results2(self, response):
        ...  # (doesn't return anything)
I would like to know if there is any way I can put the last 2 functions, which are used as callbacks, in another module that is common to all my spiders (so that if I modify it, all of them change). I know they are class methods, but is there any way I could put them in another file?
I have tried declaring the functions in my library.py file, but my problem is how to pass the 2 parameters they need (self, response) to them.
Create a base class to contain those common functions. Then your real spiders can inherit from that. For example, if all your spiders extend Spider then you can do the following:
spiders/basespider.py:

from scrapy import Spider

class BaseSpider(Spider):
    # Do not give it a name so that it does not show up in the spiders list.
    # This contains only common functions.

    def parse_after_filtering_results1(self, response):
        # ...

    def parse_after_filtering_results2(self, response):
        # ...
spiders/realspider.py:

from .basespider import BaseSpider

class RealSpider(BaseSpider):
    # ...

    def parse(self, response):
        yield FormRequest(..., callback=self.parse_after_filtering_results1)
        yield FormRequest(..., callback=self.parse_after_filtering_results2)
If you have different types of spiders you can create different base classes. Or your base class can be a plain object (not Spider) and then you can use it as a mixin.
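The mixin variant could look roughly like the sketch below (CallbacksMixin, the spider name and the FormRequest arguments are made up for illustration):

from scrapy import FormRequest, Spider

class CallbacksMixin(object):
    # Plain object, not a Spider, so it never shows up in the spiders list.

    def parse_after_filtering_results1(self, response):
        # ... shared parsing logic ...
        pass

    def parse_after_filtering_results2(self, response):
        pass

class RealSpider(CallbacksMixin, Spider):
    name = 'realspider'
    start_urls = ['http://example.com']

    def parse(self, response):
        yield FormRequest(response.url, formdata={}, callback=self.parse_after_filtering_results1)
        yield FormRequest(response.url, formdata={}, callback=self.parse_after_filtering_results2)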

Scrapy: Get Start_Urls from Database by Pipeline

Unfortunately I don't have enough reputation to make a comment, so I have to ask this as a new question, referring to https://stackoverflow.com/questions/23105590/how-to-get-the-pipeline-object-in-scrapy-spider
I have many URLs in a DB, and I want to get the start_urls from my DB. So far, not a big problem.
But I don't want the MySQL code inside the spider, and when I move it into the pipeline I run into a problem.
If I try to hand the pipeline object over to my spider as in the referenced question, I only get an AttributeError with the message:
'NoneType' object has no attribute 'getUrl'
I think the actual problem is that the function spider_opened never gets called (I also inserted a print statement whose output never showed up in the console).
Does somebody have an idea how to get the pipeline object inside the spider?
MySpider.py

    def __init__(self):
        self.pipe = None

    def start_requests(self):
        url = self.pipe.getUrl()
        scrapy.Request(url, callback=self.parse)

Pipeline.py

    @classmethod
    def from_crawler(cls, crawler):
        pipeline = cls()
        crawler.signals.connect(pipeline.spider_opened, signals.spider_opened)

    def spider_opened(self, spider):
        spider.pipe = self

    def getUrl(self):
        ...
Scrapy pipelines already have the expected methods open_spider and close_spider.
Taken from docs: https://doc.scrapy.org/en/latest/topics/item-pipeline.html#open_spider
open_spider(self, spider)
This method is called when the spider is opened.
Parameters: spider (Spider object) – the spider which was opened
close_spider(self, spider)
This method is called when the spider is closed.
Parameters: spider (Spider object) – the spider which was closed
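As a minimal illustration of those two hooks (a sketch; sqlite3 is just a stand-in for whatever resource you need to open and close):

import sqlite3

class ExamplePipeline(object):

    def open_spider(self, spider):
        # Called once when the spider starts; acquire resources here.
        self.db = sqlite3.connect('items.db')

    def close_spider(self, spider):
        # Called once when the spider finishes; release resources here.
        self.db.close()

    def process_item(self, item, spider):
        return item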
However, your original issue doesn't make much sense: why do you want to assign a pipeline reference to your spider? That seems like a very bad idea.
What you should do is open the db and read the urls in the spider itself.
from scrapy import Spider

class MySpider(Spider):
    name = 'myspider'
    start_urls = []

    @classmethod
    def from_crawler(self, crawler, *args, **kwargs):
        spider = super().from_crawler(crawler, *args, **kwargs)
        spider.start_urls = self.get_urls_from_db()
        return spider

    def get_urls_from_db(self):
        db = # get db cursor here
        urls = # use cursor to pop your urls
        return urls
I'm using the accepted solution but it doesn't work as expected:
TypeError: get_urls_from_db() missing 1 required positional argument: 'self'
Here's the version that worked on my side:
import os

from scrapy import Spider

class MySpider(Spider):
    name = 'myspider'
    start_urls = []

    def __init__(self, db_dsn):
        self.db_dsn = db_dsn
        self.start_urls = self.get_urls_from_db(db_dsn)

    @classmethod
    def from_crawler(cls, crawler):
        spider = cls(
            db_dsn=os.getenv('DB_DSN', 'mongodb://localhost:27017'),
        )
        spider._set_crawler(crawler)
        return spider

    def get_urls_from_db(self, db_dsn):
        db = # get db cursor here
        urls = # use cursor to pop your urls
        return urls
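For completeness, a possible body for get_urls_from_db with MongoDB might look like the sketch below (the database name, collection name and the 'url' field are assumptions; it also needs import pymongo at the top of the module):

    def get_urls_from_db(self, db_dsn):
        # Assumes a 'urls' collection in a 'mydb' database where each
        # document has a 'url' field.
        client = pymongo.MongoClient(db_dsn)
        urls = [doc['url'] for doc in client['mydb']['urls'].find({}, {'url': 1})]
        client.close()
        return urls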

Scrapy subclassing LinkExtractor raises TypeError: MyLinkExtractor() got an unexpected keyword argument 'allow'

I am scraping a news website with Scrapy and saving scraped items to a database with sqlalchemy.
The crawling job runs periodically and I would like to ignore URLs which did not change since the last crawling.
I am trying to subclass LinkExtractor and return an empty list in case response.url has been crawled more recently than it was updated.
But when I run 'scrapy crawl spider_name' I get:
TypeError: MyLinkExtractor() got an unexpected keyword argument 'allow'
The code:
def MyLinkExtractor(LinkExtractor):
    '''This class should redefine the method extract_links to
    filter out all links from pages which were not modified since
    the last crawling'''

    def __init__(self, *args, **kwargs):
        """
        Initializes database connection and sessionmaker.
        """
        engine = db_connect()
        self.Session = sessionmaker(bind=engine)
        super(MyLinkExtractor, self).__init__(*args, **kwargs)

    def extract_links(self, response):
        all_links = super(MyLinkExtractor, self).extract_links(response)

        # Return empty list if current url was recently crawled
        session = self.Session()
        url_in_db = session.query(Page).filter(Page.url == response.url).all()
        if url_in_db and url_in_db[0].last_crawled.replace(tzinfo=pytz.UTC) > item['header_last_modified']:
            return []

        return all_links
...
class MySpider(CrawlSpider):

    def __init__(self, *args, **kwargs):
        """
        Initializes database connection and sessionmaker.
        """
        engine = db_connect()
        self.Session = sessionmaker(bind=engine)
        super(MySpider, self).__init__(*args, **kwargs)

    ...

    # Define list of regex of links that should be followed
    links_regex_to_follow = [
        r'some_url_pattern',
    ]

    rules = (Rule(MyLinkExtractor(allow=links_regex_to_follow),
                  callback='handle_news',
                  follow=True),
             )

    def handle_news(self, response):
        item = MyItem()
        item['url'] = response.url
        session = self.Session()

        # ... Process the item and extract meaningful info

        # Register when the item was crawled
        item['last_crawled'] = datetime.datetime.utcnow().replace(tzinfo=pytz.UTC)

        # Register when the page was last-modified
        date_string = response.headers.get('Last-Modified', None).decode('utf-8')
        item['header_last_modified'] = get_datetime_from_http_str(date_string)

        yield item
The weirdest thing is that if I replace MyLinkExtractor with LinkExtractor in the Rule definition, it runs.
But if I leave MyLinkExtractor in the Rule definition and redefine MyLinkExtractor as:
def MyLinkExtractor(LinkExtractor):
    '''This class should redefine the method extract_links to
    filter out all links from pages which were not modified since
    the last crawling'''
    pass
I get the same error.
Your MyLinkExtractor is not a class but a function, since you've declared it with def instead of class. It's hard to spot, since Python allows declaring functions inside other functions and none of the names are really reserved.
Anyway, I believe the stack trace would look a little different if it were a properly declared class that failed to instantiate - you'd see the name of the last function that errored (MyLinkExtractor's __init__).
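In other words, the immediate fix is to declare it with class instead of def, e.g. (keeping the rest of the code from the question unchanged):

class MyLinkExtractor(LinkExtractor):
    '''Redefines extract_links to filter out links from pages that
    were not modified since the last crawl.'''

    def __init__(self, *args, **kwargs):
        engine = db_connect()
        self.Session = sessionmaker(bind=engine)
        super(MyLinkExtractor, self).__init__(*args, **kwargs)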

Scrapy XmlFeedSpider based on String Argument - How to Suppress Automatic Request

Aim: Trigger the execution of an XMLFeedSpider by passing the response as an argument (i.e. no need for start_urls).
Example Command:
scrapy crawl spider_name -a response_as_string="<xml><sometag>abc123</sometag></xml>"
Example Spider:
class ExampleXmlSpider(XMLFeedSpider):
    name = "spider_name"
    itertag = 'sometag'

    def parse_node(self, response, node):
        response2 = XmlResponse(url="Some URL", body=self.response_as_string)
        ProcessResponse().get_data(response2)

    def __init__(self, response_as_string=''):
        self.response_as_string = response_as_string
Problem: the terminal complains that there is no start_urls. I can only get the above to work if I include a dummy.xml within start_urls, e.g.:
start_urls = ['file:///home/user/dummy.xml']
Question: Is there any way to have an XMLFeedSpider that is purely driven by a response provided as an argument (as per the original command)? In that case I would need to suppress the XMLFeedSpider's need to seek out a start_url and issue a request.
Thanks Paul, you were spot on. Updated example code below. I stopped referring to the class as an XMLFeedSpider; the Python script is now a plain class (type "object") that accepts the url and body as arguments.
from scrapy.http import XmlResponse

class ExampleXmlSpider(object):

    def __init__(self, response_url='', response_body=''):
        self.response_url = response_url
        self.response_body = response_body

    def run(self):
        response = XmlResponse(url=self.response_url, body=self.response_body)
        print(response.url)
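Used roughly like this (the XML body echoes the question's example command; the URL is just a placeholder):

spider = ExampleXmlSpider(
    response_url='http://example.com/feed.xml',
    response_body='<xml><sometag>abc123</sometag></xml>',
)
spider.run()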

How to get the pipeline object in Scrapy spider

I am using MongoDB to store the crawled data.
Now I want to query the last date of the data, so that I can continue crawling without having to restart from the beginning of the url list (the url can be determined by the date, e.g. /2014-03-22.html).
I want only one connection object to handle the database operations, and it lives in the pipeline.
So I want to know how I can get that pipeline object (not a new one) in the spider.
Or is there any better solution for incremental updates?
Thanks in advance. Sorry for my poor English...
Just a sample for now:
# This is my Pipeline
class MongoDBPipeline(object):

    def __init__(self, mongodb_db=None, mongodb_collection=None):
        self.connection = pymongo.Connection(settings['MONGODB_SERVER'], settings['MONGODB_PORT'])
        ....

    def process_item(self, item, spider):
        ....

    def get_date(self):
        ....
And the spider:
class Spider(Spider):
    name = "test"
    ....

    def parse(self, response):
        # Want to get the Pipeline object
        mongo = MongoDBPipeline()  # if I do it this way, it has to be a new Pipeline object
        mongo.get_date()           # Scrapy must already have a Pipeline object for this spider
        # I want the Pipeline object that was created when scrapy started.
OK, I just don't want to create a new object... I admit I'm a bit OCD...
A Scrapy pipeline has an open_spider method that gets executed after the spider is initialized. You can pass a reference to the database connection, the get_date() method, or the pipeline itself to your spider. An example of the latter, using your code:
# This is my Pipeline
class MongoDBPipeline(object):

    def __init__(self, mongodb_db=None, mongodb_collection=None):
        self.connection = pymongo.Connection(settings['MONGODB_SERVER'], settings['MONGODB_PORT'])
        ....

    def process_item(self, item, spider):
        ....

    def get_date(self):
        ....

    def open_spider(self, spider):
        spider.myPipeline = self

Then, in the spider:

class Spider(Spider):
    name = "test"

    def __init__(self):
        self.myPipeline = None

    def parse(self, response):
        self.myPipeline.get_date()
I don't think the __init__() method is strictly necessary here, but I included it to show that open_spider sets the attribute after the spider is initialized.
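Note that open_spider is only called if the pipeline is enabled in settings.py; a sketch of the relevant entry (the module path is an assumption based on a typical project layout):

ITEM_PIPELINES = {
    'myproject.pipelines.MongoDBPipeline': 300,
}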
According to the scrapy Architecture Overview:
The Item Pipeline is responsible for processing the items once they
have been extracted (or scraped) by the spiders.
Basically that means that the spiders run first and the extracted items then go to the pipelines - there is no way to go backwards.
One possible solution would be, in the pipeline itself, to check whether the item you've scraped is already in the database.
Another workaround would be to keep the list of urls you've crawled in the database and, in the spider, check whether you've already got the data for a url.
Since I'm not sure what you mean by "start from the beginning", I cannot suggest anything specific.
Hope at least this information helps.
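As a rough illustration of the first workaround (a sketch only; the connection string, database and collection names, and the 'url' field are assumptions), a pipeline could drop items whose url is already stored:

import pymongo
from scrapy.exceptions import DropItem

class DedupMongoPipeline(object):

    def open_spider(self, spider):
        self.client = pymongo.MongoClient('mongodb://localhost:27017')
        self.collection = self.client['mydb']['items']

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # Skip items whose url is already in the database.
        if self.collection.find_one({'url': item['url']}):
            raise DropItem('Already stored: %s' % item['url'])
        self.collection.insert_one(dict(item))
        return item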
