I'm trying to capture "finish_reason" in Scrapy after each crawl and insert this info into a database. The crawl instance is created in a pipeline before the first item is collected.
It seems like I have to use the "engine_stopped" signal, but I couldn't find an example of how or where to put my code to do this.
One possible option is to override scrapy.statscollectors.MemoryStatsCollector (docs, code) and its close_spider method:
middlewares.py:
import pprint

from scrapy.statscollectors import MemoryStatsCollector, logger


class MemoryStatsCollectorSender(MemoryStatsCollector):

    # Override the close_spider method
    def close_spider(self, spider, reason):
        # finish_reason is available in the reason variable
        # add your data-sending code here
        if self._dump:
            logger.info("Dumping Scrapy stats:\n" + pprint.pformat(self._stats),
                        extra={'spider': spider})
        self._persist_stats(self._stats, spider)
Add the newly created stats collector class to settings.py:
STATS_CLASS = 'project.middlewares.MemoryStatsCollectorSender'
#STATS_CLASS = 'scrapy.statscollectors.MemoryStatsCollector'
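For the original goal of inserting the finish reason into a database, the data-sending code could look roughly like this. This is a minimal sketch assuming a local SQLite database; the stats.db path and the crawl_stats table are made up for illustration:

import sqlite3

from scrapy.statscollectors import MemoryStatsCollector


class MemoryStatsCollectorSender(MemoryStatsCollector):

    def close_spider(self, spider, reason):
        # Record the spider name and its finish reason (hypothetical table/schema).
        conn = sqlite3.connect('stats.db')
        conn.execute('CREATE TABLE IF NOT EXISTS crawl_stats (spider TEXT, finish_reason TEXT)')
        conn.execute('INSERT INTO crawl_stats (spider, finish_reason) VALUES (?, ?)',
                     (spider.name, reason))
        conn.commit()
        conn.close()
        # Keep the default behaviour (dumping and persisting stats).
        super(MemoryStatsCollectorSender, self).close_spider(spider, reason)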
from scrapy.exceptions import DropItem


class DuplicatesPipeline(object):

    def __init__(self):
        self.ids_seen = set()

    def process_item(self, item, spider):
        if item['id'] in self.ids_seen:
            raise DropItem("Duplicate item found: %s" % item)
        else:
            self.ids_seen.add(item['id'])
            return item
The above code is from the official Scrapy documentation: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
It is used for filtering duplicates.
And, as the Scrapy documentation suggests (http://doc.scrapy.org/en/latest/topics/jobs.html), to pause and resume a spider I need to use the Jobs system.
So I'm curious whether the Scrapy Jobs system can make the duplicates filter persistent in its directory. The way the duplicates filter is implemented is so simple that I'm in doubt.
You just need to implement your pipeline so that it reads the JOBDIR setting and, when that setting is defined, your pipeline (see the sketch after this list):
Reads the initial value of self.ids_seen from some file inside the JOBDIR directory.
At run time, it updates that file as new IDs are added to the set.
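A minimal sketch of that idea, building on the DuplicatesPipeline above (the ids_seen.txt file name is an arbitrary choice, not something Scrapy prescribes):

import os

from scrapy.exceptions import DropItem


class PersistentDuplicatesPipeline(object):

    def open_spider(self, spider):
        self.ids_seen = set()
        jobdir = spider.settings.get('JOBDIR')
        self.ids_file = os.path.join(jobdir, 'ids_seen.txt') if jobdir else None
        # Restore previously seen IDs when resuming a paused job.
        if self.ids_file and os.path.exists(self.ids_file):
            with open(self.ids_file) as f:
                self.ids_seen = set(line.strip() for line in f if line.strip())

    def process_item(self, item, spider):
        if str(item['id']) in self.ids_seen:
            raise DropItem("Duplicate item found: %s" % item)
        self.ids_seen.add(str(item['id']))
        # Append the new ID so a resumed job keeps the filter state.
        if self.ids_file:
            with open(self.ids_file, 'a') as f:
                f.write(str(item['id']) + '\n')
        return item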
I have found plenty of information on calling a function when a Scrapy spider quits (viz: Call a function in Settings from spider Scrapy), but I'm looking for how to call a function -- just once -- when the spider opens. I cannot find this in the Scrapy documentation.
I've got a project of multiple spiders that scrape event information and post them to different Google Calendars. The event information is updated often, so before the spider runs, I need to clear out the existing Google Calendar information in order to refresh it entirely. I've got a working function that accomplishes this when passed a calendar ID. Each spider posts to a different Google Calendar, so I need to be able to pass the calendar ID from within the spider to the function that clears the calendar.
I've defined a base spider in __init__.py that looks like this:
import scrapy
from scrapy.spiders import CrawlSpider, Rule
## import other stuff I need for the clear_calendar() function


class BaseSpider(CrawlSpider):

    def clear_calendar(self, CalId):
        ## working code to clear the calendar
Now I can call that function from within parse_item like:
from myproject import BaseSpider


class ExampleSpider(BaseSpider):

    def parse_item(self, response):
        calendarID = 'MycalendarID'
        self.clear_calendar(calendarID)
        ## other stuff to do
And of course that calls the function every single time an item is scraped, which is ridiculous. But if I move the function call outside of def parse_item, I get the error "self is not defined", or, if I remove "self", "clear_calendar is not defined."
How can I call a function that requires an argument just once from within a Scrapy spider? Or, is there a better way to go about this?
There is totally a better way, with the spider_opened signal.
I think on newer versions of scrapy, there is a spider_opened method ready for you to use inside the spider:
from scrapy import signals, Spider


class MySpider(Spider):
    ...
    calendar_id = 'something'

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super(MySpider, cls).from_crawler(crawler, *args, **kwargs)
        crawler.signals.connect(spider.spider_opened, signal=signals.spider_opened)
        return spider

    def spider_opened(self):
        calendar_id = self.calendar_id
        # use my calendar_id
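Applied to the calendar example from the question, the base spider could connect the signal once so that every subclass clears its own calendar exactly one time at startup. A sketch under those assumptions (clear_calendar and calendar_id stand in for the question's working code and real calendar ID):

from scrapy import signals
from scrapy.spiders import CrawlSpider


class BaseSpider(CrawlSpider):
    calendar_id = None  # each subclass sets its own calendar ID

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super(BaseSpider, cls).from_crawler(crawler, *args, **kwargs)
        crawler.signals.connect(spider.spider_opened, signal=signals.spider_opened)
        return spider

    def spider_opened(self):
        # Runs exactly once, when the spider starts.
        self.clear_calendar(self.calendar_id)

    def clear_calendar(self, cal_id):
        # placeholder for the question's working calendar-clearing code
        pass


class ExampleSpider(BaseSpider):
    name = 'example'
    calendar_id = 'MyCalendarID'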
I'm writing a crawler in Python that crawls all pages in a given domain, as part of a domain-specific search engine. I'm using Django, Scrapy, and Celery to achieve this. The scenario is as follows:
I receive a domain name from the user and call the crawl task inside the view, passing the domain as an argument:
crawl.delay(domain)
The task itself just calls a function that starts the crawling process:
from celery import shared_task

from .crawler.crawl import run_spider


@shared_task
def crawl(domain):
    return run_spider(domain)
run_spider starts the crawling process, as in this SO answer, replacing MySpider with WebSpider.
WebSpider inherits from CrawlSpider and I'm using it now just to test functionality. The only rule defined takes an SgmlLinkExtractor instance and a callback function parse_page which simply extracts the response url and the page title, populates a new DjangoItem (HTMLPageItem) with them and saves it into the database (not so efficient, I know).
from urlparse import urlparse

from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from ..items import HTMLPageItem
from scrapy.selector import Selector
from scrapy.contrib.spiders import Rule, CrawlSpider


class WebSpider(CrawlSpider):
    name = "web"

    def __init__(self, **kw):
        super(WebSpider, self).__init__(**kw)
        url = kw.get('domain') or kw.get('url')
        if not (url.startswith('http://') or url.startswith('https://')):
            url = "http://%s/" % url
        self.url = url
        self.allowed_domains = [urlparse(url).hostname.lstrip('www.')]
        self.start_urls = [url]
        self.rules = [
            Rule(SgmlLinkExtractor(
                allow_domains=self.allowed_domains,
                unique=True), callback='parse_page', follow=True)
        ]

    def parse_start_url(self, response):
        return self.parse_page(response)

    def parse_page(self, response):
        sel = Selector(response)
        item = HTMLPageItem()
        item['url'] = response.request.url
        item['title'] = sel.xpath('//title/text()').extract()[0]
        item.save()
        return item
The problem is that the crawler only crawls the start_urls and does not follow links (or call the callback function) when using Celery in this scenario. However, calling run_spider through python manage.py shell works just fine!
Another problem is that Item Pipelines and logging are not working with Celery. This is making debugging much harder. I think these problems might be related.
So after inspecting Scrapy's code and enabling Celery logging, by inserting these two lines in web_spider.py:
from celery.utils.log import get_task_logger
logger = get_task_logger(__name__)
I was able to locate the problem:
In the initialization function of WebSpider:
super(WebSpider, self).__init__(**kw)
The __init__ function of the parent CrawlSpider calls the _compile_rules function which, in short, copies the rules from self.rules to self._rules while making some changes. self._rules is what the spider uses when it checks for rules. Calling the initialization function of CrawlSpider before defining the rules led to an empty self._rules, hence no links were followed.
Moving the super(WebSpider, self).__init__(**kw) line to the last line of WebSpider's __init__ fixed the problem.
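For clarity, the corrected __init__ looks like this (the same code as above, with the super() call moved to the end so that _compile_rules() sees self.rules):

    def __init__(self, **kw):
        url = kw.get('domain') or kw.get('url')
        if not (url.startswith('http://') or url.startswith('https://')):
            url = "http://%s/" % url
        self.url = url
        self.allowed_domains = [urlparse(url).hostname.lstrip('www.')]
        self.start_urls = [url]
        self.rules = [
            Rule(SgmlLinkExtractor(
                allow_domains=self.allowed_domains,
                unique=True), callback='parse_page', follow=True)
        ]
        # CrawlSpider.__init__ calls _compile_rules(), so it must run after self.rules is set.
        super(WebSpider, self).__init__(**kw)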
Update: There is a small mistake in the code from the previously mentioned SO answer. It causes the reactor to hang after the second call. The fix is simple: in WebCrawlerScript's __init__ method, simply move this line:
self.crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
out of the if statement, as suggested in the comments there.
Update 2: I finally got pipelines to work! It was not a Celery problem. I realized that the settings module wasn't being read. It was simply an import problem. To fix it:
Set the environment variable SCRAPY_SETTINGS_MODULE in your Django project's settings module myproject/settings.py:
import os
os.environ['SCRAPY_SETTINGS_MODULE'] = 'myapp.crawler.crawler.settings'
In your Scrapy settings module crawler/settings.py, add your Scrapy project path to sys.path so that relative imports in the settings file work:
import sys
sys.path.append('/absolute/path/to/scrapy/project')
Change the paths to suit your case.
I'm using MongoDB to store the data from my crawl.
Now I want to query the last date of the stored data, so that I can continue the crawl without restarting from the beginning of the URL list (the URLs can be determined by the date, e.g. /2014-03-22.html).
I want only one connection object to handle the database operations, and it lives in the pipeline.
So I want to know how I can get the pipeline object (not a new one) in the spider.
Or is there any better solution for incremental updates?
Thanks in advance.
Sorry for my poor English...
Just a sample for now:
# This is my Pipeline
class MongoDBPipeline(object):

    def __init__(self, mongodb_db=None, mongodb_collection=None):
        self.connection = pymongo.Connection(settings['MONGODB_SERVER'], settings['MONGODB_PORT'])
        ....

    def process_item(self, item, spider):
        ....

    def get_date(self):
        ....
And the spider:
class Spider(Spider):
    name = "test"
    ....

    def parse(self, response):
        # Want to get the Pipeline object
        mongo = MongoDBPipeline()  # done this way, it creates a new Pipeline object
        mongo.get_date()           # but Scrapy already has a Pipeline object for the spider
        # I want to get the Pipeline object that was created when Scrapy started.
OK, I just don't want to create a new object... I admit I'm a bit obsessive about this.
A Scrapy Pipeline has an open_spider method that gets executed after the spider is initialized. You can pass a reference to the database connection, the get_date() method, or the Pipeline itself, to your spider. An example of the latter with your code is:
# This is my Pipeline
class MongoDBPipeline(object):

    def __init__(self, mongodb_db=None, mongodb_collection=None):
        self.connection = pymongo.Connection(settings['MONGODB_SERVER'], settings['MONGODB_PORT'])
        ....

    def process_item(self, item, spider):
        ....

    def get_date(self):
        ....

    def open_spider(self, spider):
        spider.myPipeline = self
Then, in the spider:
class Spider(Spider):
    name = "test"

    def __init__(self):
        self.myPipeline = None

    def parse(self, response):
        self.myPipeline.get_date()
I don't think the __init__() method is necessary here, but I put it here to show that open_spider replaces it after initialization.
According to the Scrapy Architecture Overview:
The Item Pipeline is responsible for processing the items once they have been extracted (or scraped) by the spiders.
Basically, that means that the Scrapy spiders run first, and the extracted items then go to the pipelines; there is no way to go backwards.
One possible solution would be, in the pipeline itself, to check whether the item you've scraped is already in the database.
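A rough sketch of that idea, reusing the MongoDB pipeline style from the question (the database/collection names and the use of item['url'] as the key are assumptions):

import pymongo

from scrapy.exceptions import DropItem


class MongoDBPipeline(object):

    def open_spider(self, spider):
        self.connection = pymongo.Connection('localhost', 27017)
        self.collection = self.connection['mydb']['items']

    def process_item(self, item, spider):
        # Skip items that are already stored in the database.
        if self.collection.find_one({'url': item['url']}):
            raise DropItem("Already in database: %s" % item['url'])
        self.collection.insert(dict(item))
        return item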
Another workaround would be to keep the list of urls you've crawled in the database, and, in the spider, check if you've already got the data from a url.
Since I'm not sure what you mean by "start from the beginning", I cannot suggest anything more specific.
I hope this information helps, at least.
Hi, I have Python Scrapy installed on my Mac and I was trying to follow the very first example on their website.
They were trying to run the command:
scrapy crawl mininova.org -o scraped_data.json -t json
I don't quite understand what this means. It looks like Scrapy is a separate program, and I don't think it has a command called crawl. In the example, they have a paragraph of code, which is the definition of the classes MininovaSpider and TorrentItem. I don't know where these two classes should go. Should they go in the same file, and what should that Python file be named?
TL;DR: see Self-contained minimum example script to run scrapy.
First of all, having a normal Scrapy project with a separate .cfg, settings.py, pipelines.py, items.py, spiders package etc. is the recommended way to keep and handle your web-scraping logic. It provides modularity and separation of concerns that keep things organized, clear and testable.
If you are following the official Scrapy tutorial to create a project, you are running web-scraping via a special scrapy command-line tool:
scrapy crawl myspider
But, Scrapy also provides an API to run crawling from a script.
There are several key concepts that should be mentioned:
Settings class - basically a key-value "container" which is initialized with default built-in values
Crawler class - the main class that acts like a glue for all the different components involved in web-scraping with Scrapy
Twisted reactor - since Scrapy is built-in on top of twisted asynchronous networking library - to start a crawler, we need to put it inside the Twisted Reactor, which is in simple words, an event loop:
The reactor is the core of the event loop within Twisted – the loop which drives applications using Twisted. The event loop is a programming construct that waits for and dispatches events or messages in a program. It works by calling some internal or external "event provider", which generally blocks until an event has arrived, and then calls the relevant event handler ("dispatches the event"). The reactor provides basic interfaces to a number of services, including network communications, threading, and event dispatching.
Here is a basic and simplified process of running Scrapy from script:
create a Settings instance (or use get_project_settings() to use existing settings):
settings = Settings() # or settings = get_project_settings()
instantiate Crawler with settings instance passed in:
crawler = Crawler(settings)
instantiate a spider (this is what it is all about eventually, right?):
spider = MySpider()
configure signals. This is an important step if you want to have post-processing logic, collect stats, or at least ever finish crawling, since the Twisted reactor needs to be stopped manually. The Scrapy docs suggest stopping the reactor in the spider_closed signal handler:
Note that you will also have to shutdown the Twisted reactor yourself after the spider is finished. This can be achieved by connecting a handler to the signals.spider_closed signal.
def callback(spider, reason):
    stats = spider.crawler.stats.get_stats()
    # stats here is a dictionary of crawling stats that you usually see on the console
    # here we need to stop the reactor
    reactor.stop()

crawler.signals.connect(callback, signal=signals.spider_closed)
configure and start crawler instance with a spider passed in:
crawler.configure()
crawler.crawl(spider)
crawler.start()
optionally start logging:
log.start()
start the reactor - this would block the script execution:
reactor.run()
Here is an example self-contained script that uses the DmozSpider spider and involves item loaders with input and output processors and an item pipeline:
import json

from scrapy.crawler import Crawler
from scrapy.contrib.loader import ItemLoader
from scrapy.contrib.loader.processor import Join, MapCompose, TakeFirst
from scrapy import log, signals, Spider, Item, Field
from scrapy.settings import Settings
from twisted.internet import reactor


# define an item class
class DmozItem(Item):
    title = Field()
    link = Field()
    desc = Field()


# define an item loader with input and output processors
class DmozItemLoader(ItemLoader):
    default_input_processor = MapCompose(unicode.strip)
    default_output_processor = TakeFirst()

    desc_out = Join()


# define a pipeline
class JsonWriterPipeline(object):

    def __init__(self):
        self.file = open('items.jl', 'wb')

    def process_item(self, item, spider):
        line = json.dumps(dict(item)) + "\n"
        self.file.write(line)
        return item


# define a spider
class DmozSpider(Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        for sel in response.xpath('//ul/li'):
            loader = DmozItemLoader(DmozItem(), selector=sel, response=response)
            loader.add_xpath('title', 'a/text()')
            loader.add_xpath('link', 'a/@href')
            loader.add_xpath('desc', 'text()')
            yield loader.load_item()


# callback fired when the spider is closed
def callback(spider, reason):
    stats = spider.crawler.stats.get_stats()  # collect/log stats?

    # stop the reactor
    reactor.stop()


# instantiate settings and provide a custom configuration
settings = Settings()
settings.set('ITEM_PIPELINES', {
    '__main__.JsonWriterPipeline': 100
})

# instantiate a crawler passing in settings
crawler = Crawler(settings)

# instantiate a spider
spider = DmozSpider()

# configure signals
crawler.signals.connect(callback, signal=signals.spider_closed)

# configure and start the crawler
crawler.configure()
crawler.crawl(spider)
crawler.start()

# start logging
log.start()

# start the reactor (blocks execution)
reactor.run()
Run it in a usual way:
python runner.py
and observe items exported to items.jl with the help of the pipeline:
{"desc": "", "link": "/", "title": "Top"}
{"link": "/Computers/", "title": "Computers"}
{"link": "/Computers/Programming/", "title": "Programming"}
{"link": "/Computers/Programming/Languages/", "title": "Languages"}
{"link": "/Computers/Programming/Languages/Python/", "title": "Python"}
...
Gist is available here (feel free to improve):
Self-contained minimum example script to run scrapy
Notes:
If you define settings by instantiating a Settings() object, you'll get all the default Scrapy settings. But if you want to, for example, configure an existing pipeline, set a DEPTH_LIMIT, or tweak any other setting, you need to either set it in the script via settings.set() (as demonstrated in the example):
pipelines = {
    'mypackage.pipelines.FilterPipeline': 100,
    'mypackage.pipelines.MySQLPipeline': 200
}
settings.set('ITEM_PIPELINES', pipelines, priority='cmdline')
or, use an existing settings.py with all the custom settings preconfigured:
from scrapy.utils.project import get_project_settings
settings = get_project_settings()
Other useful links on the subject:
How to run Scrapy from within a Python script
Confused about running Scrapy from within a Python script
scrapy run spider from script
You may have better luck looking through the tutorial first, as opposed to the "Scrapy at a glance" webpage.
The tutorial implies that Scrapy is, in fact, a separate program.
Running the command scrapy startproject tutorial will create a folder called tutorial with several files already set up for you.
For example, in my case, the modules/packages items, pipelines, settings and spiders have been added to the root package tutorial.
tutorial/
    scrapy.cfg
    tutorial/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            ...
The TorrentItem class would be placed inside items.py, and the MininovaSpider class would go inside the spiders folder.
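A rough sketch of what that placement might look like; the item fields and the spider body here are placeholders rather than a reproduction of the example from the Scrapy site:

# tutorial/items.py
from scrapy.item import Item, Field


class TorrentItem(Item):
    # the actual fields come from the example on the Scrapy site
    url = Field()
    name = Field()
    description = Field()

# tutorial/spiders/mininova_spider.py
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule

from tutorial.items import TorrentItem


class MininovaSpider(CrawlSpider):
    # this is the name used on the command line: scrapy crawl mininova.org
    name = 'mininova.org'
    allowed_domains = ['mininova.org']
    start_urls = ['http://www.mininova.org/today']

    # a placeholder rule; the real one comes from the example on the Scrapy site
    rules = [Rule(SgmlLinkExtractor(allow=[r'/tor/\d+']), callback='parse_torrent')]

    def parse_torrent(self, response):
        torrent = TorrentItem()
        torrent['url'] = response.url
        return torrent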
Once the project is set up, the command-line parameters for Scrapy appear to be fairly straightforward. They take the form:
scrapy crawl <website-name> -o <output-file> -t <output-type>
Alternatively, if you want to run scrapy without the overhead of creating a project directory, you can use the runspider command:
scrapy runspider my_spider.py