I have a pipeline that runs at spider_opened to retrieve a value lastseen from MySQL; this value is then used in later pipelines to decide whether to drop an item and/or close the spider. The value must stay constant and cannot be updated during the spider session. The question is: since I have multiple spiders in the project, I'd like to use an item field in this method; is that even possible? I understand that at the time this pipeline method runs, the spider has just opened and no items have been returned yet, so maybe I need to code this differently?
import MySQLdb
import MySQLdb.cursors
from scrapy import signals
from scrapy.xlib.pydispatch import dispatcher


class DuplicatesPipeline(object):

    def __init__(self):
        dispatcher.connect(self.spider_opened, signals.spider_opened)

    def spider_opened(self, spider):
        query = """
        mysql query that has a variable %s
        """
        dbConn = MySQLdb.connect()  # connection settings omitted
        dictCursor = dbConn.cursor(MySQLdb.cursors.DictCursor)
        dictCursor.execute(query, (item['catid'],))  # need to supply the item field here, but no item exists yet
        self.lastseen = dictCursor.fetchone()
        dictCursor.close()
        dbConn.close()
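One way to code this differently (just a sketch, not from the original thread) is to defer the MySQL lookup until the first item for a given catid reaches process_item, caching the result so it stays constant for the rest of the session:

# Sketch only: fetches lastseen lazily, once per catid, then keeps it constant.
import MySQLdb
import MySQLdb.cursors


class LastSeenPipeline(object):

    def __init__(self):
        self.lastseen = {}  # catid -> row fetched from MySQL

    def process_item(self, item, spider):
        catid = item['catid']
        if catid not in self.lastseen:
            dbConn = MySQLdb.connect()  # connection settings omitted
            dictCursor = dbConn.cursor(MySQLdb.cursors.DictCursor)
            dictCursor.execute("mysql query that has a variable %s", (catid,))
            self.lastseen[catid] = dictCursor.fetchone()  # cached; never updated during the session
            dictCursor.close()
            dbConn.close()
        # later pipelines (or this one) can compare item fields against
        # self.lastseen[catid] and raise DropItem / close the spider
        return item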
Related
from scrapy.exceptions import DropItem


class DuplicatesPipeline(object):

    def __init__(self):
        self.ids_seen = set()

    def process_item(self, item, spider):
        if item['id'] in self.ids_seen:
            raise DropItem("Duplicate item found: %s" % item)
        else:
            self.ids_seen.add(item['id'])
            return item
The code above is from the official Scrapy documentation, http://doc.scrapy.org/en/latest/topics/item-pipeline.html, and is used for filtering duplicates.
As the Scrapy documentation suggests, http://doc.scrapy.org/en/latest/topics/jobs.html, to pause and resume a spider I need to use the Jobs system.
So I'm curious whether the Scrapy Jobs system can make the duplicates filter persistent in its directory. The way the duplicates filter is implemented is so simple that I have my doubts.
You just need to implement your pipeline so that it reads the JOBDIR setting and, when that setting is defined, your pipeline:
- Reads the initial value of self.ids_seen from some file inside the JOBDIR directory.
- At run time, updates that file as new IDs are added to the set (as sketched below).
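A minimal sketch of that idea, assuming the IDs can be stored one per line in a file (the name ids_seen.txt is arbitrary, not from the original answer):

import os

from scrapy.exceptions import DropItem


class PersistentDuplicatesPipeline(object):

    def open_spider(self, spider):
        jobdir = spider.settings.get('JOBDIR')
        # ids_seen.txt is a hypothetical file name; anything inside JOBDIR works
        self.path = os.path.join(jobdir, 'ids_seen.txt') if jobdir else None
        self.ids_seen = set()
        if self.path and os.path.exists(self.path):
            with open(self.path) as f:
                self.ids_seen = set(line.strip() for line in f)

    def process_item(self, item, spider):
        item_id = str(item['id'])
        if item_id in self.ids_seen:
            raise DropItem("Duplicate item found: %s" % item)
        self.ids_seen.add(item_id)
        if self.path:
            # append the new ID so the set survives a pause/resume cycle
            with open(self.path, 'a') as f:
                f.write(item_id + '\n')
        return item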
I'm trying to capture "finish_reason" in Scrapy after each crawl and insert this information into a database. The crawl instance is created in a pipeline before the first item is collected.
It seems like I have to use the "engine_stopped" signal, but I couldn't find an example of how or where I should put my code to do this.
One possible option is to override scrapy.statscollectors.MemoryStatsCollector (docs, code) and its close_spider method:
middleware.py:

import pprint

from scrapy.statscollectors import MemoryStatsCollector, logger


class MemoryStatsCollectorSender(MemoryStatsCollector):

    # Override the close_spider method
    def close_spider(self, spider, reason):
        # finish_reason is in the reason variable
        # add your data sending code here
        if self._dump:
            logger.info("Dumping Scrapy stats:\n" + pprint.pformat(self._stats),
                        extra={'spider': spider})
        self._persist_stats(self._stats, spider)
Add the newly created stats collector class to settings.py:
STATS_CLASS = 'project.middlewares.MemoryStatsCollectorSender'
#STATS_CLASS = 'scrapy.statscollectors.MemoryStatsCollector'
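For the "data sending" part, here is a sketch of what the override could do; the sqlite3 file crawl_stats.db and the finish_reasons table are assumptions for illustration, not part of the original answer:

import sqlite3

from scrapy.statscollectors import MemoryStatsCollector


class FinishReasonStatsCollector(MemoryStatsCollector):

    def close_spider(self, spider, reason):
        # record the finish reason before the usual stats handling
        conn = sqlite3.connect('crawl_stats.db')  # hypothetical database file
        conn.execute(
            "CREATE TABLE IF NOT EXISTS finish_reasons (spider TEXT, reason TEXT)")
        conn.execute(
            "INSERT INTO finish_reasons (spider, reason) VALUES (?, ?)",
            (spider.name, reason))
        conn.commit()
        conn.close()
        super(FinishReasonStatsCollector, self).close_spider(spider, reason)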
I am running a CrawlSpider and I want to implement some logic to stop following some of the links in mid-run, by passing a function to process_request.
This function uses the spider's class variables to keep track of the current state, and depending on that state (and on the referrer URL), links are either dropped or processed further:
class BroadCrawlSpider(CrawlSpider):
    name = 'bitsy'
    start_urls = ['http://scrapy.org']
    foo = 5

    rules = (
        Rule(LinkExtractor(), callback='parse_item', process_request='filter_requests', follow=True),
    )

    def parse_item(self, response):
        <some code>

    def filter_requests(self, request):
        if self.foo == 6 and request.headers.get('Referer', None) == someval:
            raise IgnoreRequest("Ignored request: bla %s" % request)
        return request
I think that if I were to run several spiders on the same machine, they would all use the same class variables, which is not my intention.
Is there a way to add instance variables to CrawlSpiders? Is only a single instance of the spider created when I run Scrapy?
I could probably work around it with a dictionary with values per process ID, but that will be ugly...
I think spider arguments would be the solution in your case.
When invoking scrapy like scrapy crawl some_spider, you could add arguments like scrapy crawl some_spider -a foo=bar, and the spider would receive the values via its constructor, e.g.:
class SomeSpider(scrapy.Spider):

    def __init__(self, foo=None, *args, **kwargs):
        super(SomeSpider, self).__init__(*args, **kwargs)
        # Do something with foo
What's more, since scrapy.Spider actually sets all additional arguments as instance attributes, you don't even need to explicitly override the __init__ method; you can just access the .foo attribute. :)
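As a quick illustration of that second point (a sketch; the spider name and the foo value are placeholders, and note that -a arguments arrive as strings):

import scrapy


class SomeSpider(scrapy.Spider):
    name = 'some_spider'
    start_urls = ['http://scrapy.org']

    # Run with:  scrapy crawl some_spider -a foo=6
    def parse(self, response):
        # scrapy.Spider stored the -a argument as an instance attribute,
        # so no __init__ override is needed; it arrives as a string
        if getattr(self, 'foo', None) == '6':
            self.logger.info("foo is 6 for this run")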
I use MongoDB to store the crawled data.
Now I want to query the last date of the data, so that I can continue crawling and don't need to restart from the beginning of the URL list (the URLs can be determined by the date, like /2014-03-22.html).
I want only a single connection object to handle the database operations, and it lives in the pipeline.
So I want to know how I can get that pipeline object (not a new one) in the spider.
Or is there any better solution for incremental updates?
Thanks in advance.
Sorry for my poor English.
Just a sample for now:
# This is my Pipeline
class MongoDBPipeline(object):

    def __init__(self, mongodb_db=None, mongodb_collection=None):
        self.connection = pymongo.Connection(settings['MONGODB_SERVER'], settings['MONGODB_PORT'])
        ....

    def process_item(self, item, spider):
        ....

    def get_date(self):
        ....
And the spider:
class Spider(Spider):
    name = "test"
    ....

    def parse(self, response):
        # Want to get the Pipeline object here
        mongo = MongoDBPipeline()  # done this way, a new Pipeline object is created
        mongo.get_date()           # Scrapy already has a Pipeline object for the spider;
                                   # I want that object, the one created when Scrapy started
OK, I just don't want to create a second object... I admit I'm a bit obsessive about this.
A Scrapy Pipeline has an open_spider method that is executed after the spider is initialized. You can pass a reference to the database connection, the get_date() method, or the Pipeline itself to your spider. An example of the latter, using your code, is:
# This is my Pipeline
class MongoDBPipeline(object):

    def __init__(self, mongodb_db=None, mongodb_collection=None):
        self.connection = pymongo.Connection(settings['MONGODB_SERVER'], settings['MONGODB_PORT'])
        ....

    def process_item(self, item, spider):
        ....

    def get_date(self):
        ....

    def open_spider(self, spider):
        spider.myPipeline = self
Then, in the spider:
class Spider(Spider):
    name = "test"

    def __init__(self):
        self.myPipeline = None

    def parse(self, response):
        self.myPipeline.get_date()
I don't think the __init__() method is strictly necessary here, but I included it to show that open_spider sets the attribute after the spider is initialized.
According to the Scrapy Architecture Overview:
The Item Pipeline is responsible for processing the items once they have been extracted (or scraped) by the spiders.
Basically, that means the spiders do their work first and the extracted items then go to the pipelines; there is no way to go backwards.
One possible solution would be to check, in the pipeline itself, whether the item you've scraped is already in the database.
Another workaround would be to keep the list of URLs you've crawled in the database and, in the spider, check whether you already have the data for a URL (see the sketch below).
Since I'm not sure what you mean by "start from the beginning", I cannot suggest anything specific.
I hope this information helps, at least.
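A minimal sketch of that second workaround; the database name, the crawled_urls collection, and the date-based URL pattern are assumptions based on the question, not part of the original answer (and it uses pymongo's current MongoClient):

import pymongo
import scrapy


class IncrementalSpider(scrapy.Spider):
    name = "incremental"

    def start_requests(self):
        # one connection for the check; the pipeline keeps its own for writes
        client = pymongo.MongoClient('localhost', 27017)
        crawled = client['mydb']['crawled_urls']  # hypothetical collection of finished URLs
        for date in ['2014-03-20', '2014-03-21', '2014-03-22']:
            url = 'http://example.com/%s.html' % date
            if crawled.find_one({'url': url}) is None:
                yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        # ... extract items as usual; the pipeline records the URL as crawled ...
        pass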
I wrote a spider using scrapy, one that makes a whole bunch of HtmlXPathSelector Requests to separate sites. It creates a row of data in a .csv file after each request is (asynchronously) satisfied. It's impossible to see which request is satisfied last, because the request is repeated if no data was extracted yet (occasionally it misses the data a few times). Even though I start with a neat list, the output is jumbled because the rows are written immediately after data is extracted.
Now I'd like to sort that list based on one column, but after every request is done. Can the 'spider_closed' signal be used to trigger a real function? As below, I tried connecting the signal with dispatcher, but this function seems to only print out things, rather than work with variables or even call other functions.
def start_requests(self):
    ...
    dispatcher.connect(self.spider_closed, signal=signals.engine_stopped)
    ...

def spider_closed(self, spider):
    print 'this gets printed alright'  # <- only if the next line is omitted...
    out = self.AnotherFunction(data)   # <- this doesn't seem to run
I hacked together a pipeline to solve this problem for you.
file: Project.middleware_module.SortedCSVPipeline
import csv

from scrapy import signals


class SortedCSVPipeline(object):

    def __init__(self):
        self.items = []
        self.file_name = r'YOUR_FILE_PATH_HERE'
        self.key = 'YOUR_KEY_HERE'

    @classmethod
    def from_crawler(cls, crawler):
        pipeline = cls()
        crawler.signals.connect(pipeline.spider_closed, signals.spider_closed)
        return pipeline

    def spider_closed(self, spider):
        for item in sorted(self.items, key=lambda k: k[self.key]):
            self.write_to_csv(item)

    def process_item(self, item, spider):
        self.items.append(item)
        return item

    def write_to_csv(self, item):
        writer = csv.writer(open(self.file_name, 'a'), lineterminator='\n')
        writer.writerow([item[key] for key in item.keys()])
file: settings.py
ITEM_PIPELINES = {"Project.middleware_module.SortedCSVPipeline.SortedCSVPipeline" : 1000}
When running this you won't need to use an item exporter anymore, because this pipeline does the CSV writing for you. Also, the 1000 in the pipeline entry in your settings needs to be a higher value than that of all other pipelines you want to run before this one. I tested this in my project and it resulted in a CSV file sorted by the column I specified! HTH
Cheers