The Python framework Scrapy works pretty well, but I cannot figure out how to configure spiders at runtime. It seems that all configuration has to be made "statically", which is not handy. Is this awful design, or did I miss something?
For example, I have a spider that requires a complicated initialization routine.
I use my own scripts to obtain the HTTP headers for crawling (cookies, user-agent, etc.), so the crawl behaves as if it came from a logged-in user.
This takes one to two minutes. After that, these headers should be applied to all requests.
Right now I do this in the __init__ method of the spider, but I cannot set up the User-Agent from the spider's constructor: custom_settings must be set as a class variable. Therefore I have to use a middleware to set the user agent for each request, which is an ugly solution.
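The middleware workaround currently looks roughly like this (just a sketch; the class name and the session_headers attribute are made-up names for illustration):

# Sketch only: a downloader middleware that copies the headers prepared in the
# spider's __init__ onto every outgoing request. "session_headers" is a
# hypothetical attribute filled by my initialization routine.
class HeaderInjectionMiddleware(object):

    def process_request(self, request, spider):
        for name, value in getattr(spider, 'session_headers', {}).items():
            # only set the header if the request does not already define it
            request.headers.setdefault(name, value)
        return None  # let Scrapy continue processing the request

It has to be enabled through the DOWNLOADER_MIDDLEWARES setting, which is exactly the kind of static configuration I am trying to avoid.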
Is there a common pattern for initializing spiders - some kind of spider factory? Something like this:
class SpiderConfigurator:

    def __init__(self):
        ...

    def configureSpider(self, spider, environment):
        ...
        spider.setMyCustomSettings(arg1, arg2)
        ...
        environment.setMyCustomSettings(argName1, argValue1)
        environment.setMyCustomSettings('User-Agent', 'my value')
Scrapy allows launching a crawl from a script: Run Scrapy from script. Thanks to @paultrmbrth for this hint.
But we cannot initialize the spider ourselves: we just pass the spider class to the Crawler instance, and the crawler instantiates the object. What we can do is pass parameters to our spider's constructor, something like this:
os.chdir(scrapyDir)
projectSettings = get_project_settings()
crawlerProcess = CrawlerProcess(projectSettings)
crawlerProcess.crawl(SpiderCls,
                     argumentName1=argumentValue1,
                     argumentName2=argumentValue2)
The arguments argumentName1 and argumentName2 will be passed to the spider's constructor.
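On the spider side, that might look roughly like this (illustrative only; prepare_headers and session_headers are made-up names standing in for the custom login/header-gathering routine):

import scrapy


class SpiderCls(scrapy.Spider):
    name = 'configured_spider'

    def __init__(self, argumentName1=None, argumentName2=None, *args, **kwargs):
        super(SpiderCls, self).__init__(*args, **kwargs)
        # the expensive initialization (login, cookie/header gathering, ...)
        # can run here, using the values passed from crawlerProcess.crawl()
        self.session_headers = self.prepare_headers(argumentName1, argumentName2)

    def prepare_headers(self, arg1, arg2):
        # placeholder for the real routine that obtains the logged-in headers
        return {'User-Agent': 'my value'}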
Related
I'm trying to capture "finish_reason" in Scrapy after each crawl and insert this information into a database. The crawl instance is created in a pipeline before the first item is collected.
It seems like I have to use the "engine_stopped" signal, but I couldn't find an example of how or where I should put my code to do this.
One possible option is to override scrapy.statscollectors.MemoryStatsCollector (docs, code) and its close_spider method:
middlewares.py:
import pprint

from scrapy.statscollectors import MemoryStatsCollector, logger


class MemoryStatsCollectorSender(MemoryStatsCollector):

    # Override the close_spider method
    def close_spider(self, spider, reason):
        # the finish_reason is available in the `reason` variable
        # add your data-sending code here
        if self._dump:
            logger.info("Dumping Scrapy stats:\n" + pprint.pformat(self._stats),
                        extra={'spider': spider})
        self._persist_stats(self._stats, spider)
Add the newly created stats collector class to settings.py:
STATS_CLASS = 'project.middlewares.MemoryStatsCollectorSender'
#STATS_CLASS = 'scrapy.statscollectors.MemoryStatsCollector'
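If you prefer the signal-based route mentioned in the question, a rough sketch would be a small extension connected to spider_closed, which receives the finish reason directly (the extension name and the database helper below are invented; enable the class via the EXTENSIONS setting):

from scrapy import signals


def save_finish_reason(spider_name, reason):
    # placeholder: replace with your actual database insert
    print(spider_name, reason)


class FinishReasonExtension(object):

    @classmethod
    def from_crawler(cls, crawler):
        ext = cls()
        # the `reason` passed to spider_closed is the same value stored as finish_reason
        crawler.signals.connect(ext.spider_closed, signal=signals.spider_closed)
        return ext

    def spider_closed(self, spider, reason):
        save_finish_reason(spider.name, reason)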
I have nearly 20 arguments to take from the command line for a Scrapy project. My project has nearly 10 web pages to parse to get data, so I have made multiple spiders and I am running multiple spiders in the same process using a script, following the example from the Scrapy docs, like this:
if __name__ == '__main__':
    configure_logging(install_root_handler=False)
    logging.basicConfig(
        filename='scraper.log',
        format='%(created)f - %(levelname)s: %(message)s',
        level=logging.INFO
    )

    settings = args.get_args()
    process = CrawlerProcess(settings)
    process.crawl('Connectors')
    process.crawl("helper")
    ....
    process.start()
I am putting the arguments into the Scrapy settings after taking them from the CLI with the argparse library, and passing the same settings object to CrawlerProcess:
settings = args.get_args()
process = CrawlerProcess(settings)
I am unable to access these arguments in my spiders. If I try to get them using
from scrapy.utils.project import get_project_settings
self.settings = get_project_settings()
I am getting a new object with all null values. I have tried using the __init__ method to get the args, but that is not working. Please point me to a similar project on GitHub or suggest some posts to get through this problem; I have already spent so much time on it and have much work to do.
With settings = args.get_args() you are passing argparse's Namespace instance to CrawlerProcess, while you should pass a dict-like object instead.
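A rough sketch of one way to do it (the argument names here are just examples): convert the Namespace to a dict and merge it into the project settings, so the spiders can later read the values via self.settings:

import argparse

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

parser = argparse.ArgumentParser()
parser.add_argument('--category', default='all')        # example argument
parser.add_argument('--max-pages', type=int, default=10)
args = parser.parse_args()

settings = get_project_settings()                        # the project's settings.py
settings.update({k.upper(): v for k, v in vars(args).items()})  # dict, not Namespace

process = CrawlerProcess(settings)
process.crawl('Connectors', category=args.category)      # also available as a spider kwarg
process.start()

Inside a spider, self.settings.get('CATEGORY') would then return the command-line value.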
I have found plenty of information on calling a function when a Scrapy spider quits (viz: Call a function in Settings from spider Scrapy), but I'm looking for how to call a function -- just once -- when the spider opens. I cannot find this in the Scrapy documentation.
I've got a project of multiple spiders that scrape event information and post them to different Google Calendars. The event information is updated often, so before the spider runs, I need to clear out the existing Google Calendar information in order to refresh it entirely. I've got a working function that accomplishes this when passed a calendar ID. Each spider posts to a different Google Calendar, so I need to be able to pass the calendar ID from within the spider to the function that clears the calendar.
I've defined a base spider in __init__.py that looks like this:
import scrapy
from scrapy.spiders import CrawlSpider, Rule
## import other stuff I need for the clear_calendar() function


class BaseSpider(CrawlSpider):

    def clear_calendar(self, CalId):
        ## working code to clear the calendar
Now I can call that function from within parse_item like:
from myproject import BaseSpider


class ExampleSpider(BaseSpider):

    def parse_item(self, response):
        calendarID = 'MycalendarID'
        self.clear_calendar(calendarID)
        ## other stuff to do
And of course that calls the function every single time an item is scraped, which is ridiculous. But if I move the function call outside of def parse_item, I get the error "self is not defined", or, if I remove "self", "clear_calendar is not defined."
How can I call a function that requires an argument just once from within a Scrapy spider? Or, is there a better way to go about this?
There is totally a better way, with the spider_opened signal.
I think on newer versions of scrapy, there is a spider_opened method ready for you to use inside the spider:
from scrapy import Spider, signals


class MySpider(Spider):
    ...
    calendar_id = 'something'

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super(MySpider, cls).from_crawler(crawler, *args, **kwargs)
        crawler.signals.connect(spider.spider_opened, signal=signals.spider_opened)
        return spider

    def spider_opened(self):
        calendar_id = self.calendar_id
        # use my calendar_id
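Applied to the calendar use case from the question, it could look roughly like this (clear_calendar stands in for the existing working function on the base spider):

from scrapy import signals
from scrapy.spiders import CrawlSpider


class ExampleSpider(CrawlSpider):
    name = 'example'
    calendar_id = 'MyCalendarID'

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super(ExampleSpider, cls).from_crawler(crawler, *args, **kwargs)
        crawler.signals.connect(spider.spider_opened, signal=signals.spider_opened)
        return spider

    def spider_opened(self):
        # fires once, when the spider is opened for crawling
        self.clear_calendar(self.calendar_id)

    def clear_calendar(self, cal_id):
        pass  # the existing working code from BaseSpider goes here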
I'm writing a crawler in Python that crawls all pages in a given domain, as part of a domain-specific search engine. I'm using Django, Scrapy, and Celery to achieve this. The scenario is as follows:
I receive a domain name from the user and call the crawl task inside the view, passing the domain as an argument:
crawl.delay(domain)
The task itself just calls a function that starts the crawling process:
from .crawler.crawl import run_spider
from celery import shared_task


@shared_task
def crawl(domain):
    return run_spider(domain)
run_spider starts the crawling process, as in this SO answer, replacing MySpider with WebSpider.
WebSpider inherits from CrawlSpider, and I'm using it now just to test functionality. The only rule defined takes an SgmlLinkExtractor instance and a callback function parse_page, which simply extracts the response URL and the page title, populates a new DjangoItem (HTMLPageItem) with them, and saves it to the database (not very efficient, I know).
from urlparse import urlparse
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from ..items import HTMLPageItem
from scrapy.selector import Selector
from scrapy.contrib.spiders import Rule, CrawlSpider


class WebSpider(CrawlSpider):
    name = "web"

    def __init__(self, **kw):
        super(WebSpider, self).__init__(**kw)

        url = kw.get('domain') or kw.get('url')
        if not (url.startswith('http://') or url.startswith('https://')):
            url = "http://%s/" % url
        self.url = url
        self.allowed_domains = [urlparse(url).hostname.lstrip('www.')]
        self.start_urls = [url]

        self.rules = [
            Rule(SgmlLinkExtractor(
                allow_domains=self.allowed_domains,
                unique=True), callback='parse_page', follow=True)
        ]

    def parse_start_url(self, response):
        return self.parse_page(response)

    def parse_page(self, response):
        sel = Selector(response)
        item = HTMLPageItem()
        item['url'] = response.request.url
        item['title'] = sel.xpath('//title/text()').extract()[0]
        item.save()
        return item
The problem is that the crawler only crawls the start_urls and does not follow links (or call the callback function) when following this scenario and using Celery. However, calling run_spider through python manage.py shell works just fine!
Another problem is that Item Pipelines and logging are not working with Celery. This is making debugging much harder. I think these problems might be related.
So, after inspecting Scrapy's code and enabling Celery logging by inserting these two lines in web_spider.py:
from celery.utils.log import get_task_logger
logger = get_task_logger(__name__)
I was able to locate the problem:
In the initialization function of WebSpider:
super(WebSpider, self).__init__(**kw)
The __init__ function of the parent CrawlSpider calls the _compile_rules function which, in short, copies the rules from self.rules to self._rules while making some changes. self._rules is what the spider uses when it checks for rules. Calling the initialization function of CrawlSpider before defining the rules led to an empty self._rules, hence no links were followed.
Moving the super(WebSpider, self).__init__(**kw) line to the last line of WebSpider's __init__ fixed the problem.
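In other words, the corrected __init__ keeps the same body but defines the rules before the parent constructor runs:

def __init__(self, **kw):
    url = kw.get('domain') or kw.get('url')
    if not (url.startswith('http://') or url.startswith('https://')):
        url = "http://%s/" % url
    self.url = url
    self.allowed_domains = [urlparse(url).hostname.lstrip('www.')]
    self.start_urls = [url]
    self.rules = [
        Rule(SgmlLinkExtractor(
            allow_domains=self.allowed_domains,
            unique=True), callback='parse_page', follow=True)
    ]
    super(WebSpider, self).__init__(**kw)  # _compile_rules now sees self.rules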
Update: There is a small mistake in the code from the previously mentioned SO answer. It causes the reactor to hang after the second call. The fix is simple: in WebCrawlerScript's __init__ method, simply move this line:
self.crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
out of the if statement, as suggested in the comments there.
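A rough reconstruction of what that placement looks like (only the position of the connect call matters here; the surrounding class is assumed to resemble the helper from the linked answer and follows the same old Crawler API used elsewhere in this thread):

from scrapy import signals
from scrapy.crawler import Crawler
from scrapy.settings import Settings
from twisted.internet import reactor


class WebCrawlerScript(object):

    def __init__(self):
        self.crawler = Crawler(Settings())
        self.crawler.configure()
        # connect unconditionally (it used to sit inside an `if` guard and was
        # skipped on later runs, leaving nothing to stop the reactor)
        self.crawler.signals.connect(reactor.stop, signal=signals.spider_closed)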
Update 2: I finally got pipelines to work! It was not a Celery problem. I realized that the settings module wasn't being read. It was simply an import problem. To fix it:
Set the environment variable SCRAPY_SETTINGS_MODULE in your django project's settings module myproject/settings.py:
import os
os.environ['SCRAPY_SETTINGS_MODULE'] = 'myapp.crawler.crawler.settings'
In your Scrapy settings module crawler/settings.py, add your Scrapy project path to sys.path so that relative imports in the settings file would work:
import sys
sys.path.append('/absolute/path/to/scrapy/project')
Change the paths to suit your case.
Hi, I have Python Scrapy installed on my Mac and I was trying to follow the very first example on their website.
They were trying to run the command:
scrapy crawl mininova.org -o scraped_data.json -t json
I don't quite understand what this means. It looks like Scrapy is a separate program, and I don't think they have a command called crawl. In the example, they have a paragraph of code, which is the definition of the classes MininovaSpider and TorrentItem. I don't know where these two classes should go - should they go in the same file, and what should this Python file be named?
TL;DR: see Self-contained minimum example script to run scrapy.
First of all, having a normal Scrapy project with a separate .cfg, settings.py, pipelines.py, items.py, spiders package, etc. is the recommended way to keep and handle your web-scraping logic. It provides modularity and separation of concerns that keep things organized, clear and testable.
If you are following the official Scrapy tutorial to create a project, you run the web scraping via the special scrapy command-line tool:
scrapy crawl myspider
But, Scrapy also provides an API to run crawling from a script.
There are several key concepts that should be mentioned:
Settings class - basically a key-value "container" which is initialized with default built-in values
Crawler class - the main class that acts like a glue for all the different components involved in web-scraping with Scrapy
Twisted reactor - since Scrapy is built on top of the twisted asynchronous networking library, to start a crawler we need to put it inside the Twisted reactor, which is, in simple words, an event loop:
The reactor is the core of the event loop within Twisted – the loop which drives applications using Twisted. The event loop is a programming construct that waits for and dispatches events or messages in a program. It works by calling some internal or external “event provider”, which generally blocks until an event has arrived, and then calls the relevant event handler (“dispatches the event”). The reactor provides basic interfaces to a number of services, including network communications, threading, and event dispatching.
Here is a basic and simplified process of running Scrapy from script:
create a Settings instance (or use get_project_settings() to use existing settings):
settings = Settings() # or settings = get_project_settings()
instantiate Crawler with settings instance passed in:
crawler = Crawler(settings)
instantiate a spider (this is what it is all about eventually, right?):
spider = MySpider()
configure signals. This is an important step if you want to have post-processing logic, collect stats, or at least to ever finish crawling, since the twisted reactor needs to be stopped manually. The Scrapy docs suggest stopping the reactor in the spider_closed signal handler:
Note that you will also have to shutdown the Twisted reactor yourself after the spider is finished. This can be achieved by connecting a handler to the signals.spider_closed signal.
def callback(spider, reason):
    stats = spider.crawler.stats.get_stats()
    # stats here is a dictionary of crawling stats that you usually see on the console

    # here we need to stop the reactor
    reactor.stop()

crawler.signals.connect(callback, signal=signals.spider_closed)
configure and start crawler instance with a spider passed in:
crawler.configure()
crawler.crawl(spider)
crawler.start()
optionally start logging:
log.start()
start the reactor - this would block the script execution:
reactor.run()
Here is an example of a self-contained script that uses the DmozSpider spider and involves item loaders with input and output processors, as well as item pipelines:
import json

from scrapy.crawler import Crawler
from scrapy.contrib.loader import ItemLoader
from scrapy.contrib.loader.processor import Join, MapCompose, TakeFirst
from scrapy import log, signals, Spider, Item, Field
from scrapy.settings import Settings
from twisted.internet import reactor


# define an item class
class DmozItem(Item):
    title = Field()
    link = Field()
    desc = Field()


# define an item loader with input and output processors
class DmozItemLoader(ItemLoader):
    default_input_processor = MapCompose(unicode.strip)
    default_output_processor = TakeFirst()

    desc_out = Join()


# define a pipeline
class JsonWriterPipeline(object):
    def __init__(self):
        self.file = open('items.jl', 'wb')

    def process_item(self, item, spider):
        line = json.dumps(dict(item)) + "\n"
        self.file.write(line)
        return item


# define a spider
class DmozSpider(Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        for sel in response.xpath('//ul/li'):
            loader = DmozItemLoader(DmozItem(), selector=sel, response=response)
            loader.add_xpath('title', 'a/text()')
            loader.add_xpath('link', 'a/@href')
            loader.add_xpath('desc', 'text()')
            yield loader.load_item()


# callback fired when the spider is closed
def callback(spider, reason):
    stats = spider.crawler.stats.get_stats()  # collect/log stats?

    # stop the reactor
    reactor.stop()


# instantiate settings and provide a custom configuration
settings = Settings()
settings.set('ITEM_PIPELINES', {
    '__main__.JsonWriterPipeline': 100
})

# instantiate a crawler passing in settings
crawler = Crawler(settings)

# instantiate a spider
spider = DmozSpider()

# configure signals
crawler.signals.connect(callback, signal=signals.spider_closed)

# configure and start the crawler
crawler.configure()
crawler.crawl(spider)
crawler.start()

# start logging
log.start()

# start the reactor (blocks execution)
reactor.run()
Run it in the usual way:
python runner.py
and observe items exported to items.jl with the help of the pipeline:
{"desc": "", "link": "/", "title": "Top"}
{"link": "/Computers/", "title": "Computers"}
{"link": "/Computers/Programming/", "title": "Programming"}
{"link": "/Computers/Programming/Languages/", "title": "Languages"}
{"link": "/Computers/Programming/Languages/Python/", "title": "Python"}
...
Gist is available here (feel free to improve):
Self-contained minimum example script to run scrapy
Notes:
If you define settings by instantiating a Settings() object, you'll get all the default Scrapy settings. But if you want to, for example, configure an existing pipeline, set a DEPTH_LIMIT, or tweak any other setting, you need to either set it in the script via settings.set() (as demonstrated in the example):
pipelines = {
    'mypackage.pipelines.FilterPipeline': 100,
    'mypackage.pipelines.MySQLPipeline': 200
}
settings.set('ITEM_PIPELINES', pipelines, priority='cmdline')
or, use an existing settings.py with all the custom settings preconfigured:
from scrapy.utils.project import get_project_settings
settings = get_project_settings()
Other useful links on the subject:
How to run Scrapy from within a Python script
Confused about running Scrapy from within a Python script
scrapy run spider from script
You may have better luck looking through the tutorial first, as opposed to the "Scrapy at a glance" webpage.
The tutorial implies that Scrapy is, in fact, a separate program.
Running the command scrapy startproject tutorial will create a folder called tutorial with several files already set up for you.
For example, in my case, the modules/packages items, pipelines, settings and spiders have been added to the root package tutorial.
tutorial/
    scrapy.cfg
    tutorial/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            ...
The TorrentItem class would be placed inside items.py, and the MininovaSpider class would go inside the spiders folder.
Once the project is set up, the command-line parameters for Scrapy appear to be fairly straightforward. They take the form:
scrapy crawl <website-name> -o <output-file> -t <output-type>
Alternatively, if you want to run scrapy without the overhead of creating a project directory, you can use the runspider command:
scrapy runspider my_spider.py
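For example, with a reasonably recent Scrapy version, a minimal self-contained my_spider.py (illustrative only; the URL and field are placeholders) could be as small as:

import scrapy


class MySpider(scrapy.Spider):
    name = 'my_spider'
    start_urls = ['http://example.com']

    def parse(self, response):
        # yield a simple dict item with the page title
        yield {'title': response.xpath('//title/text()').extract_first()}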