Does anyone know if there is a way to set different log levels for Scrapy's modules? I want to log scraped items and the requests sent to a log file, but the logs coming from the scrapy.middleware, scrapy.crawler and scrapy.utils.log modules are always the same and do not add value to the log file.
My biggest constraint is that I have to do everything outside of the spiders (in the pipelines, the settings.py file, etc.). I have more than 200 spiders and cannot possibly add code to each of them.
Scrapy's doc says it is possible to modify the level for a specific logger in the advanced customization section, but it does not seem to work when this is set in the settings.py file. My guess is that the logs from scrapy.middleware and scrapy.crawler are logged before the spider evaluates the settings.py file.
I have read scrapy's doc extensively but I cannot seem to find the answer. I don't want to have to recreate my own loggers since some of Scrapy's logs are useful, like the ones logging requests sent and errors.
I can provide code extracts if necessary. Thank you.
You can create a Scrapy extension that raises the log level of the loggers you do not want to appear. The first three log lines, which come from scrapy.utils.log, are emitted before Scrapy gets to loading its extensions, so for those three I am not sure what to do beyond turning logging off entirely and implementing the logs yourself.
Here is an example of the extension:
extension.py
import logging

from scrapy import signals
from scrapy.exceptions import NotConfigured

logger = logging.getLogger(__name__)


class CustomLogExtension:

    def __init__(self):
        self.level = logging.WARNING
        self.modules = ['scrapy.utils.log', 'scrapy.middleware',
                        'scrapy.extensions.logstats', 'scrapy.statscollectors',
                        'scrapy.core.engine', 'scrapy.core.scraper',
                        'scrapy.crawler', 'scrapy.extensions',
                        __name__]
        # raise the threshold of every listed logger
        for module in self.modules:
            logging.getLogger(module).setLevel(self.level)

    @classmethod
    def from_crawler(cls, crawler):
        # only enable the extension when the setting is switched on
        if not crawler.settings.getbool('CUSTOM_LOG_EXTENSION'):
            raise NotConfigured
        ext = cls()
        crawler.signals.connect(
            ext.spider_opened, signal=signals.spider_opened
        )
        return ext

    def spider_opened(self, spider):
        logger.debug("This log should not appear.")
Then in your settings.py
settings.py
CUSTOM_LOG_EXTENSION = True
EXTENSIONS = {
    'scrapy.extensions.telnet.TelnetConsole': None,
    'my_project_name.extension.CustomLogExtension': 1,
}
The example above removes pretty much all of the logs produced by scrapy. If you only want to keep the request logs, then just remove scrapy.core.engine from the self.modules list in the Extension constructor.
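The mechanism the extension relies on can be tried with the standard logging module alone. The following is a minimal sketch, not Scrapy code: the logger names noisy.module and useful.module, the ListHandler class and the demo function are hypothetical stand-ins chosen for illustration.

```python
import logging


class ListHandler(logging.Handler):
    """Hypothetical handler that stores 'logger name:message' strings."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(f"{record.name}:{record.getMessage()}")


def demo():
    handler = ListHandler()
    root = logging.getLogger()
    previous_level = root.level
    root.addHandler(handler)
    root.setLevel(logging.DEBUG)

    # Raise the threshold for one named logger only.
    noisy = logging.getLogger("noisy.module")
    noisy.setLevel(logging.WARNING)

    noisy.info("dropped")        # below WARNING, discarded by the logger
    noisy.warning("kept")        # passes and propagates to the root handler
    logging.getLogger("useful.module").info("also kept")

    # Restore the root logger so the demo leaves no side effects there.
    root.removeHandler(handler)
    root.setLevel(previous_level)
    return handler.messages
```

Only the logger whose level was raised loses its INFO message; every other logger keeps inheriting the root level.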
I've been trying to figure out how best to set this up, cutting it down as much as I can. I have 4 Python files: core.py (main), logger_controller.py, config_controller.py, and a 4th, a module or singleton that we'll just call tool.py.
The way I have it set up, logger_controller has an init function that sets up Python's built-in logging with the necessary levels, formatter, directory location, etc. I call this init function in main.
import logging
import logger_controller


def main():
    logger_controller.init_log()
    logger = logging.getLogger(__name__)


if __name__ == "__main__":
    main()
config_controller is using configparser and is mainly a singleton as a controller for my config.
import configparser
import logging

logger = logging.getLogger(__name__)


class ConfigController(object):
    def __init__(self, *file_names):
        self.config_parser = configparser.ConfigParser()
        found_files = self.config_parser.read(file_names)
        if not found_files:
            raise ValueError("No config file found.")
        self._validate()

    def _validate(self):
        ...

    def read_config(self, section, field):
        try:
            data = self.config_parser.get(section, field)
        except (configparser.NoSectionError, configparser.NoOptionError) as e:
            logger.error(e)
            data = None
        return data


config = ConfigController("config.ini")
And then my problem is trying to create the 4th file and making sure both my logger and config parser are running before it. I'm also wanting this 4th one to be a singleton so it's following a similar format as the config_controller.
So tool.py uses config_controller to pull anything it needs from the config file. It also has some error checking for if config_controller's read_config returns None as that isn't validated in _validate. I did this as I wanted my logging to have a general layer for error checking and a more specific layer. So _validate just checks if required fields and sections are in the config file. Then wherever the field is read will handle extra error checking.
So my main problem is this:
How do I have it where my logger and configparser are both running and available before anything else. I'm very much willing to rework all of this, but I'd like to keep the functionality of it all.
One attempt that works, but seems very messy, is making my logger_controller a singleton that just returns Python's logging module.
import logging
import os


class MyLogger(object):
    def __new__(cls, *args, **kwargs):
        init_log()
        return logging


def init_log():
    ...


mylogger = MyLogger()
Then in core.py
from logger_controller import mylogger
logger = mylogger.getLogger(__name__)
I feel like there should be a better way to do the above, but I'm honestly not sure how.
A few ideas:
Would I be able to extend the logging class instead of just using that init_log function?
Maybe there's a way I can make all 3 individual modules such that they each initialize in a correct order? My attempts here didn't quite work as I also have some internal data that I wouldn't want exposed to classes using the module, just the functionality.
I'd like to have it where all 3, logging, configparsing, and the tool, available anywhere I import them.
How I have it setup now "works" but if I were to import the tool.py anywhere in core.py and an error occurs that I need to catch, then my logger won't be able to log it as this tool is loading before the init of my logger.
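One detail that may help with the ordering problem: a logger obtained with logging.getLogger at import time does not need the configuration to exist yet, because handlers are looked up when a message is emitted. The sketch below illustrates this under stated assumptions: the logger name tool, the init_log signature taking a stream, and the demo function are hypothetical and only stand in for the files described above.

```python
import io
import logging

# Module-level logger created before any configuration exists,
# as tool.py or config_controller.py would do at import time.
early_logger = logging.getLogger("tool")  # "tool" is a hypothetical name


def init_log(stream):
    """Hypothetical init_log: attach a single handler to the root logger."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter("%(name)s %(levelname)s %(message)s")
    )
    root = logging.getLogger()
    root.addHandler(handler)
    root.setLevel(logging.INFO)
    return handler


def demo():
    root = logging.getLogger()
    previous_level = root.level
    stream = io.StringIO()
    handler = init_log(stream)   # configuration happens later, e.g. in main()
    early_logger.error("config field missing")
    # Undo the configuration so the demo leaves the root logger untouched.
    root.removeHandler(handler)
    root.setLevel(previous_level)
    return stream.getvalue()
```

What this does not cover is a message emitted during import, before init_log has run: at that point only logging's last-resort handler (stderr, WARNING and above) sees it, so work that can fail is better deferred out of module import.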
Scrapy built-in loggers:
scrapy.utils.log
scrapy.crawler
scrapy.middleware
scrapy.core.engine
scrapy.extensions.logstats
scrapy.extensions.telnet
scrapy.core.scraper
scrapy.statscollectors
are very verbose.
I was trying to set a different log level (DEBUG) for them than the user spider's log level (INFO). This way I can reduce the 'noise'.
This helper function works, but only sometimes:
import logging


def set_loggers_level(level=logging.DEBUG):
    loggers = [
        'scrapy.utils.log',
        'scrapy.crawler',
        'scrapy.middleware',
        'scrapy.core.engine',
        'scrapy.extensions.logstats',
        'scrapy.extensions.telnet',
        'scrapy.core.scraper',
        'scrapy.statscollectors'
    ]
    for logger_name in loggers:
        logger = logging.getLogger(logger_name)
        logger.setLevel(level)
        # also set the level on handlers attached directly to this logger
        for handler in logger.handlers:
            handler.setLevel(level)
I call it from UserSpider init:
class UserSpider(scrapy.Spider):
    def __init__(self, *args, **kwargs):
        # customize loggers: some loggers can't be reset at this point
        helpers.set_loggers_level()
        super(UserSpider, self).__init__(*args, **kwargs)
This approach works sometimes, but not always.
What would be the correct solution?
You can just set LOG_LEVEL appropriately in your settings.py, read more here: https://doc.scrapy.org/en/latest/topics/settings.html#std:setting-LOG_LEVEL
LOG_LEVEL
Default: 'DEBUG'
Minimum level to log. Available levels are: CRITICAL, ERROR, WARNING, INFO, DEBUG. For more info see Logging.
If project wide settings are not focused enough, you can set them per-spider by using custom_settings:
class MySpider(scrapy.Spider):
    name = 'myspider'
    custom_settings = {
        'LOG_LEVEL': 'INFO',
    }
Source:
https://doc.scrapy.org/en/latest/topics/settings.html#settings-per-spider
Setting different log levels per log handler is not very reliable.
At the end of the day, the better approach is to launch the scrapy CLI tool from another script and filter the log output with a parser as needed.
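If the filtering is done inside Python rather than on the captured output, one option from the standard library is a logging.Filter. The sketch below is an assumption-laden illustration, not something Scrapy provides: QuietModulesFilter and make_record are hypothetical names, and how to reach the handler that writes Scrapy's log file is not shown.

```python
import logging


class QuietModulesFilter(logging.Filter):
    """Drop records below `level` whose logger name starts with a listed prefix."""

    def __init__(self, prefixes, level=logging.WARNING):
        super().__init__()
        self.prefixes = tuple(prefixes)
        self.level = level

    def filter(self, record):
        noisy = record.name.startswith(self.prefixes)
        # Returning False tells logging to discard the record.
        return not (noisy and record.levelno < self.level)


def make_record(name, level):
    # Build a bare LogRecord so the filter can be exercised without handlers.
    return logging.LogRecord(name, level, "demo.py", 0, "msg", None, None)
```

Such a filter would be attached with handler.addFilter(QuietModulesFilter([...])) to whichever handler produces the output, which leaves the loggers' own levels untouched.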
I stumbled upon the same issue. I tried various methods, but it looks like, since Scrapy uses the logging module, you have to set the level globally, which results in Scrapy printing all the debug information.
I've found it more reliable to use a bool flag with print statements for DEBUG and to use the logger for INFO, ERROR and WARNING.
I'm writing a crawler in Python that crawls all pages in a given domain, as part of a domain-specific search engine. I'm using Django, Scrapy, and Celery to achieve this. The scenario is as follows:
I receive a domain name from the user and call the crawl task inside the view, passing the domain as an argument:
crawl.delay(domain)
The task itself just calls a function that starts the crawling process:
from .crawler.crawl import run_spider
from celery import shared_task


@shared_task
def crawl(domain):
    return run_spider(domain)
run_spider starts the crawling process, as in this SO answer, replacing MySpider with WebSpider.
WebSpider inherits from CrawlSpider and I'm using it now just to test functionality. The only rule defined takes an SgmlLinkExtractor instance and a callback function parse_page which simply extracts the response url and the page title, populates a new DjangoItem (HTMLPageItem) with them and saves it into the database (not so efficient, I know).
from urlparse import urlparse
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from ..items import HTMLPageItem
from scrapy.selector import Selector
from scrapy.contrib.spiders import Rule, CrawlSpider


class WebSpider(CrawlSpider):
    name = "web"

    def __init__(self, **kw):
        super(WebSpider, self).__init__(**kw)
        url = kw.get('domain') or kw.get('url')
        if not (url.startswith('http://') or url.startswith('https://')):
            url = "http://%s/" % url
        self.url = url
        self.allowed_domains = [urlparse(url).hostname.lstrip('www.')]
        self.start_urls = [url]
        self.rules = [
            Rule(SgmlLinkExtractor(
                allow_domains=self.allowed_domains,
                unique=True), callback='parse_page', follow=True)
        ]

    def parse_start_url(self, response):
        return self.parse_page(response)

    def parse_page(self, response):
        sel = Selector(response)
        item = HTMLPageItem()
        item['url'] = response.request.url
        item['title'] = sel.xpath('//title/text()').extract()[0]
        item.save()
        return item
The problem is the crawler only crawls the start_urls and does not follow links (or call the callback function) when following this scenario and using Celery. However calling run_spider through python manage.py shell works just fine!
Another problem is that Item Pipelines and logging are not working with Celery. This is making debugging much harder. I think these problems might be related.
So after inspecting Scrapy's code and enabling Celery logging, by inserting these two lines in web_spider.py:
from celery.utils.log import get_task_logger
logger = get_task_logger(__name__)
I was able to locate the problem:
In the initialization function of WebSpider:
super(WebSpider, self).__init__(**kw)
The __init__ function of the parent CrawlSpider calls the _compile_rules function, which in short copies the rules from self.rules to self._rules while making some changes. self._rules is what the spider uses when it checks for rules. Calling the initialization function of CrawlSpider before defining the rules led to an empty self._rules, hence no links were followed.
Moving the super(WebSpider, self).__init__(**kw) line to the last line of WebSpider's __init__ fixed the problem.
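The ordering effect can be reproduced without Scrapy. The sketch below uses hypothetical classes (RuleSpiderBase, RulesSetTooLate, RulesSetFirst) that only mimic the pattern described above, a base class that snapshots self.rules during __init__; it is not CrawlSpider's actual code.

```python
class RuleSpiderBase:
    """Hypothetical stand-in for a base class that compiles rules in __init__."""

    rules = ()

    def __init__(self):
        # Snapshot taken once, at construction time.
        self._rules = list(self.rules)


class RulesSetTooLate(RuleSpiderBase):
    def __init__(self):
        super().__init__()             # snapshot taken while rules is still empty
        self.rules = ["follow-links"]  # too late: _rules is never refreshed


class RulesSetFirst(RuleSpiderBase):
    def __init__(self):
        self.rules = ["follow-links"]
        super().__init__()             # snapshot now sees the rule
```

RulesSetTooLate()._rules stays empty while RulesSetFirst()._rules contains the rule, which mirrors why moving the super call to the end fixed the spider.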
Update: There is a little mistake in code from the previously mentioned SO answer. It causes the reactor to hang after second call. The fix is simple, in WebCrawlerScript's __init__ method, simply move this line:
self.crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
out of the if statement, as suggested in the comments there.
Update 2: I finally got pipelines to work! It was not a Celery problem. I realized that the settings module wasn't being read. It was simply an import problem. To fix it:
Set the environment variable SCRAPY_SETTINGS_MODULE in your django project's settings module myproject/settings.py:
import os
os.environ['SCRAPY_SETTINGS_MODULE'] = 'myapp.crawler.crawler.settings'
In your Scrapy settings module crawler/settings.py, add your Scrapy project path to sys.path so that relative imports in the settings file would work:
import sys
sys.path.append('/absolute/path/to/scrapy/project')
Change the paths to suit your case.
Hi, I have Python Scrapy installed on my Mac and I was trying to follow the very first example on their website.
They were trying to run the command:
scrapy crawl mininova.org -o scraped_data.json -t json
I don't quite understand what this means. It looks like scrapy turns out to be a separate program, and I don't think they have a command called crawl. In the example, they have a block of code that defines the classes MininovaSpider and TorrentItem. I don't know where these two classes should go: do they go in the same file, and what should this Python file be called?
TL;DR: see Self-contained minimum example script to run scrapy.
First of all, having a normal Scrapy project with a separate .cfg, settings.py, pipelines.py, items.py, spiders package, etc. is the recommended way to keep and handle your web-scraping logic. It provides modularity and a separation of concerns that keep things organized, clear and testable.
If you are following the official Scrapy tutorial to create a project, you are running web-scraping via a special scrapy command-line tool:
scrapy crawl myspider
But, Scrapy also provides an API to run crawling from a script.
There are several key concepts that should be mentioned:
Settings class - basically a key-value "container" which is initialized with default built-in values
Crawler class - the main class that acts like a glue for all the different components involved in web-scraping with Scrapy
Twisted reactor - since Scrapy is built on top of the Twisted asynchronous networking library, to start a crawler we need to put it inside the Twisted reactor, which is, in simple words, an event loop:
The reactor is the core of the event loop within Twisted – the loop which drives applications using Twisted. The event loop is a programming construct that waits for and
dispatches events or messages in a program. It works by calling some
internal or external “event provider”, which generally blocks until an
event has arrived, and then calls the relevant event handler
(“dispatches the event”). The reactor provides basic interfaces to a
number of services, including network communications, threading, and
event dispatching.
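To make the idea of "waiting for and dispatching events" concrete, here is a toy event loop. It is a deliberately simplified sketch and not how Twisted is implemented: ToyReactor, its methods and the event names are all hypothetical, and unlike a real reactor it returns when its queue is empty instead of blocking for new events.

```python
from collections import deque


class ToyReactor:
    """Hypothetical, minimal event loop; not Twisted's implementation."""

    def __init__(self):
        self._events = deque()
        self._handlers = {}
        self._running = False

    def on(self, event_name, handler):
        # Register the handler to call when event_name is dispatched.
        self._handlers[event_name] = handler

    def emit(self, event_name, payload=None):
        # Queue an event; nothing runs until run() is called.
        self._events.append((event_name, payload))

    def stop(self):
        self._running = False

    def run(self):
        # Dispatch queued events until stop() is called or the queue drains.
        self._running = True
        while self._running and self._events:
            name, payload = self._events.popleft()
            handler = self._handlers.get(name)
            if handler is not None:
                handler(payload)


def demo():
    seen = []
    reactor = ToyReactor()
    reactor.on("response", seen.append)
    reactor.on("spider_closed", lambda _reason: reactor.stop())
    reactor.emit("response", "page 1")
    reactor.emit("spider_closed", "finished")
    reactor.emit("response", "never dispatched")
    reactor.run()
    return seen
```

The handler registered for spider_closed stops the loop, so the event queued after it is never dispatched. This is the same reason a script driving the real reactor has to call reactor.stop() from a spider_closed handler.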
Here is a basic and simplified process of running Scrapy from script:
create a Settings instance (or use get_project_settings() to use existing settings):
settings = Settings() # or settings = get_project_settings()
instantiate Crawler with settings instance passed in:
crawler = Crawler(settings)
instantiate a spider (this is what it is all about eventually, right?):
spider = MySpider()
configure signals. This is an important step if you want to have post-processing logic, collect stats or, at least, ever finish crawling, since the Twisted reactor needs to be stopped manually. The Scrapy docs suggest stopping the reactor in the spider_closed signal handler:
Note that you will also have to shutdown the Twisted reactor yourself
after the spider is finished. This can be achieved by connecting a
handler to the signals.spider_closed signal.
def callback(spider, reason):
    stats = spider.crawler.stats.get_stats()
    # stats here is a dictionary of crawling stats that you usually see on the console
    # here we need to stop the reactor
    reactor.stop()

crawler.signals.connect(callback, signal=signals.spider_closed)
configure and start crawler instance with a spider passed in:
crawler.configure()
crawler.crawl(spider)
crawler.start()
optionally start logging:
log.start()
start the reactor - this would block the script execution:
reactor.run()
Here is an example self-contained script that is using DmozSpider spider and involves item loaders with input and output processors and item pipelines:
import json

from scrapy.crawler import Crawler
from scrapy.contrib.loader import ItemLoader
from scrapy.contrib.loader.processor import Join, MapCompose, TakeFirst
from scrapy import log, signals, Spider, Item, Field
from scrapy.settings import Settings
from twisted.internet import reactor


# define an item class
class DmozItem(Item):
    title = Field()
    link = Field()
    desc = Field()


# define an item loader with input and output processors
class DmozItemLoader(ItemLoader):
    default_input_processor = MapCompose(unicode.strip)
    default_output_processor = TakeFirst()

    desc_out = Join()


# define a pipeline
class JsonWriterPipeline(object):
    def __init__(self):
        self.file = open('items.jl', 'wb')

    def process_item(self, item, spider):
        line = json.dumps(dict(item)) + "\n"
        self.file.write(line)
        return item


# define a spider
class DmozSpider(Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        for sel in response.xpath('//ul/li'):
            loader = DmozItemLoader(DmozItem(), selector=sel, response=response)
            loader.add_xpath('title', 'a/text()')
            loader.add_xpath('link', 'a/@href')
            loader.add_xpath('desc', 'text()')
            yield loader.load_item()


# callback fired when the spider is closed
def callback(spider, reason):
    stats = spider.crawler.stats.get_stats()  # collect/log stats?

    # stop the reactor
    reactor.stop()


# instantiate settings and provide a custom configuration
settings = Settings()
settings.set('ITEM_PIPELINES', {
    '__main__.JsonWriterPipeline': 100
})

# instantiate a crawler passing in settings
crawler = Crawler(settings)

# instantiate a spider
spider = DmozSpider()

# configure signals
crawler.signals.connect(callback, signal=signals.spider_closed)

# configure and start the crawler
crawler.configure()
crawler.crawl(spider)
crawler.start()

# start logging
log.start()

# start the reactor (blocks execution)
reactor.run()
Run it in a usual way:
python runner.py
and observe items exported to items.jl with the help of the pipeline:
{"desc": "", "link": "/", "title": "Top"}
{"link": "/Computers/", "title": "Computers"}
{"link": "/Computers/Programming/", "title": "Programming"}
{"link": "/Computers/Programming/Languages/", "title": "Languages"}
{"link": "/Computers/Programming/Languages/Python/", "title": "Python"}
...
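The items.jl file is in JSON Lines form, one JSON object per line, so it can be read back line by line. A minimal sketch with the standard library only follows; write_items, read_items and demo are hypothetical helper names, and an in-memory buffer stands in for the file.

```python
import io
import json


def write_items(items, stream):
    """Write one JSON object per line, in the same format as the pipeline."""
    for item in items:
        stream.write(json.dumps(item) + "\n")


def read_items(stream):
    """Parse a JSON Lines stream back into a list of dicts."""
    return [json.loads(line) for line in stream if line.strip()]


def demo():
    buffer = io.StringIO()  # stands in for open('items.jl')
    write_items([{"link": "/Computers/", "title": "Computers"}], buffer)
    buffer.seek(0)
    return read_items(buffer)
```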
Gist is available here (feel free to improve):
Self-contained minimum example script to run scrapy
Notes:
If you define settings by instantiating a Settings() object, you'll get all of Scrapy's default settings. But if you want to, for example, configure an existing pipeline, set a DEPTH_LIMIT or tweak any other setting, you need to either set it in the script via settings.set() (as demonstrated in the example):
pipelines = {
    'mypackage.pipelines.FilterPipeline': 100,
    'mypackage.pipelines.MySQLPipeline': 200
}
settings.set('ITEM_PIPELINES', pipelines, priority='cmdline')
or, use an existing settings.py with all the custom settings preconfigured:
from scrapy.utils.project import get_project_settings
settings = get_project_settings()
Other useful links on the subject:
How to run Scrapy from within a Python script
Confused about running Scrapy from within a Python script
scrapy run spider from script
You may have better luck looking through the tutorial first, as opposed to the "Scrapy at a glance" webpage.
The tutorial implies that Scrapy is, in fact, a separate program.
Running the command scrapy startproject tutorial will create a folder called tutorial with several files already set up for you.
For example, in my case, the modules/packages items, pipelines, settings and spiders have been added to the root package tutorial.
tutorial/
    scrapy.cfg
    tutorial/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            ...
The TorrentItem class would be placed inside items.py, and the MininovaSpider class would go inside the spiders folder.
Once the project is set up, the command-line parameters for Scrapy appear to be fairly straightforward. They take the form:
scrapy crawl <website-name> -o <output-file> -t <output-type>
Alternatively, if you want to run scrapy without the overhead of creating a project directory, you can use the runspider command:
scrapy runspider my_spider.py
I have multiple spiders in one project. Right now I am defining LOG_FILE in the settings like this:
LOG_FILE = "scrapy_%s.log" % datetime.now()
What I want is scrapy_SPIDERNAME_DATETIME, but I am unable to provide the spider name in the LOG_FILE name.
I found
scrapy.log.start(logfile=None, loglevel=None, logstdout=None)
and called it in each spider's __init__ method, but it's not working.
Any help would be appreciated.
The spider's __init__() is not early enough to call log.start() by itself since the log observer is already started at this point; therefore, you need to reinitialize the logging state to trick Scrapy into (re)starting it.
In your spider class file:
from datetime import datetime
from scrapy import log
from scrapy.spider import BaseSpider


class ExampleSpider(BaseSpider):
    name = "example"
    allowed_domains = ["example.com"]
    start_urls = ["http://www.example.com/"]

    def __init__(self, name=None, **kwargs):
        LOG_FILE = "scrapy_%s_%s.log" % (self.name, datetime.now())
        # remove the current log
        # log.log.removeObserver(log.log.theLogPublisher.observers[0])
        # re-create the default Twisted observer which Scrapy checks
        log.log.defaultObserver = log.log.DefaultObserver()
        # start the default observer so it can be stopped
        log.log.defaultObserver.start()
        # trick Scrapy into thinking logging has not started
        log.started = False
        # start the new log file observer
        log.start(LOG_FILE)
        # continue with the normal spider init
        super(ExampleSpider, self).__init__(name, **kwargs)

    def parse(self, response):
        ...
And the output file might look like:
scrapy_example_2012-08-25 12:34:48.823896.log
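Because str(datetime.now()) contains a space and colons, the resulting file name can be awkward on some filesystems. A small variation, sketched here with a hypothetical helper log_file_name, formats the timestamp explicitly with strftime; the format string is just one possible choice.

```python
from datetime import datetime


def log_file_name(spider_name, now=None):
    """Build a log file name without spaces or colons in the timestamp."""
    now = now or datetime.now()
    return "scrapy_%s_%s.log" % (spider_name, now.strftime("%Y-%m-%d_%H-%M-%S"))
```

For the timestamp shown above this would give scrapy_example_2012-08-25_12-34-48.log.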
There should be a BOT_NAME in your settings.py. This is the project/spider name. So in your case, this would be
LOG_FILE = "scrapy_%s_%s.log" % (BOT_NAME, datetime.now())
This is pretty much what Scrapy does internally.
But why not use log.msg? The docs clearly state that this is for spider-specific messages. It might be easier to use this and just extract (e.g. with grep) the different spiders' log messages from one big log file.
A more complicated approach would be to take the spider locations from the SPIDER_MODULES list and load all spiders inside these packages.
You can use Scrapy's storage URI parameters in your settings.py file for FEED_URI.
%(name)s
%(time)s
For example: /tmp/crawled/%(name)s/%(time)s.log
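The placeholders have the same shape as Python's mapping-style % formatting. Assuming they are filled in that way (Scrapy's own substitution code is not reproduced here, and expand_uri is a hypothetical helper), the expansion can be sketched as:

```python
def expand_uri(template, name, time):
    """Fill %(name)s and %(time)s using Python's mapping-style % formatting."""
    return template % {"name": name, "time": time}
```

For instance, expand_uri("/tmp/crawled/%(name)s/%(time)s.log", "example", "2012-08-25T12-34-48") yields /tmp/crawled/example/2012-08-25T12-34-48.log, where the timestamp value is made up for the illustration.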