I am enabling Scrapy's log like this:
from scrapy import log

class MySpider(BaseSpider):
    name = "myspider"

    def __init__(self, name=None, **kwargs):
        LOG_FILE = "logs/spider.log"
        log.log.defaultObserver = log.log.DefaultObserver()
        log.log.defaultObserver.start()
        log.started = False
        log.start(LOG_FILE, loglevel=log.INFO)
        super(MySpider, self).__init__(name, **kwargs)

    def parse(self, response):
        ....
        raise Exception("Something went wrong!")
        log.msg('Something went wrong!', log.ERROR)
        # Somehow write to a separate error log here.
Then I run the spider like this:
scrapy crawl myspider
This stores all the log.INFO messages as well as the log.ERROR messages in spider.log.
If an error occurs, I would also like to store those details in a separate log file called spider_errors.log. It would make it easier to search for errors that occurred rather than trying to scan through the entire spider.log file (which could be huge).
Is there a way to do this?
EDIT:
Trying with PythonLoggingObserver:
def __init__(self, name=None, **kwargs):
    LOG_FILE = 'logs/spider.log'
    ERR_FILE = 'logs/spider_error.log'
    observer = log.log.PythonLoggingObserver()
    observer.start()
    log.started = False
    log.start(LOG_FILE, loglevel=log.INFO)
    log.start(ERR_FILE, loglevel=log.ERROR)
But I get ERROR: No handlers could be found for logger "twisted"
Just let logging do the job. Try using PythonLoggingObserver instead of DefaultObserver:
configure two loggers (one for INFO and one for ERROR messages) directly in Python, via fileConfig, or via dictConfig (see the docs); a sketch follows below
start the observer in the spider's __init__:
def __init__(self, name=None, **kwargs):
    # TODO: configure logging: e.g. logging.config.fileConfig("logging.conf")
    observer = log.PythonLoggingObserver()
    observer.start()
Let me know if you need help with configuring loggers.
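For the two-logger configuration mentioned above, here is a minimal dictConfig sketch, assuming PythonLoggingObserver routes Twisted/Scrapy messages to the "twisted" logger (its default name); the file paths are illustrative only:

import logging.config

LOGGING = {
    "version": 1,
    "disable_existing_loggers": False,
    "formatters": {
        "plain": {"format": "%(asctime)s [%(levelname)s] %(message)s"},
    },
    "handlers": {
        # everything at INFO and above goes to the main log
        "info_file": {
            "class": "logging.FileHandler",
            "filename": "logs/spider.log",
            "level": "INFO",
            "formatter": "plain",
        },
        # only ERROR and above also goes to the error log
        "error_file": {
            "class": "logging.FileHandler",
            "filename": "logs/spider_error.log",
            "level": "ERROR",
            "formatter": "plain",
        },
    },
    "loggers": {
        "twisted": {
            "handlers": ["info_file", "error_file"],
            "level": "INFO",
        },
    },
}

logging.config.dictConfig(LOGGING)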
EDIT:
Another option is to start two file log observers in the spider's __init__:
import logging

from scrapy.log import ScrapyFileLogObserver
from scrapy import log

class MySpider(BaseSpider):
    name = "myspider"

    def __init__(self, name=None, **kwargs):
        ScrapyFileLogObserver(open("spider.log", 'w'), level=logging.INFO).start()
        ScrapyFileLogObserver(open("spider_error.log", 'w'), level=logging.ERROR).start()

        super(MySpider, self).__init__(name, **kwargs)

    ...
Related
Is it possible to override Scrapy settings after the init function of a spider?
For example, if I want to get settings from a DB, and I pass my query parameters as arguments from the command line.
def __init__(self, spider_id, **kwargs):
    self.spider_id = spider_id
    self.set_params(spider_id)
    super(Base_Crawler, self).__init__(**kwargs)

def set_params(self, spider_id):
    # TODO:
    # make a query in the db
    # get variables from the query result
    # override settings
Technically you can "override" settings after the spider is initialized, but it would have no effect because most of them are applied earlier.
What you can actually do is pass parameters to the spider as command-line options using -a and override project settings using -s, for example:
Spider:
class TheSpider(scrapy.Spider):
    name = 'thespider'

    def __init__(self, *args, **kwargs):
        self.spider_id = kwargs.pop('spider_id', None)
        super(TheSpider, self).__init__(*args, **kwargs)
CLI:
scrapy crawl thespider -a spider_id=XXX -s SETTING_TO_OVERRIDE=YYY
If you need something more advanced, consider writing a custom runner that wraps your spider. Below is an example from the docs:
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())

# 'followall' is the name of one of the spiders of the project.
process.crawl('followall', domain='scrapinghub.com')
process.start()  # the script will block here until the crawling is finished
Just replace get_project_settings with your own routine that returns a Settings instance.
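For illustration, a minimal sketch of such a routine; load_overrides_from_db() is a hypothetical helper (not from the original answer) that returns a dict of setting names to values:

from scrapy.utils.project import get_project_settings

def get_settings_for(spider_id):
    # start from the regular project settings and layer DB overrides on top
    settings = get_project_settings().copy()
    overrides = load_overrides_from_db(spider_id)  # hypothetical helper
    settings.setdict(overrides, priority='cmdline')
    return settings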
Anyway, avoid overloading the spider's code with non-scraping logic, to keep it clean and reusable.
I'm using Scrapy. In my Scrapy project I created several spider classes, and as the official documentation suggests, I used this approach to specify the log file name:
def logging_to_file(file_name):
    """
    :type file_name: str
    :param file_name:
    :rtype: logging
    :return:
    """
    import logging
    from scrapy.utils.log import configure_logging

    configure_logging(install_root_handler=False)
    logging.basicConfig(
        filename=file_name + '.txt',
        filemode='a',
        format='%(levelname)s: %(message)s',
        level=logging.DEBUG,
    )

class Spider_One(scrapy.Spider):
    name = 'xxx1'
    logging_to_file(name)
    ......

class Spider_Two(scrapy.Spider):
    name = 'xxx2'
    logging_to_file(name)
    ......
Now, if I start Spider_One, everything is correct! But if I start Spider_Two, the log file of Spider_Two will also be named after Spider_One!
I have searched many answers on Google and Stack Overflow, but unfortunately none of them worked!
I am using Python 2.7 and Scrapy 1.1!
I hope someone can help me!
That's because you call logging_to_file every time your package is loaded. You are using a class variable here where you should use an instance variable.
When Python loads your package or module, it executes every class body and so on.
class MyClass:
    # everything you do here runs every time the package is loaded
    name = 'something'

    def __init__(self):
        # everything you do here runs ONLY when an object is created
        # from this class
        pass
To resolve your issue, just move the logging_to_file call into your spider's __init__() method.
class MyClass(Spider):
    name = 'xx1'

    def __init__(self):
        super(MyClass, self).__init__()
        logging_to_file(self.name)
One option is to define the LOG_FILE setting in Spider.custom_settings:
from scrapy import Spider

class MySpider(Spider):
    name = "myspider"
    start_urls = ["https://toscrape.com"]

    custom_settings = {
        "LOG_FILE": f"{name}.log",
    }

    def parse(self, response):
        pass
However, because logging starts before the spider custom settings are read, the first few lines of the log will still go into the standard error output.
I just got thrown into the deep end with my new contract. The current system uses the Python logging module to do timed log-file rotation. The problem is that the log file of the process running as a daemon gets rotated correctly, while the other log file, used by the process instances that get created and destroyed when done, never rotates. I now have to find a solution to this problem. After two days of research on the internet and in the Python documentation, I'm only halfway out of the dark. Since I'm new to the logging module I can't see the answer, probably because I'm looking with my eyes closed!
The process is started with:
python /admin/bin/fmlog.py -l 10 -f /tmp/fmlog/fmapp_log.log -d
where:
-l 10 => DEBUG logging-level
-f ... => Filename to log to for app-instance
-d => run as daemon
The following shows a heavily edited version of my code:
#!/usr/bin/env python

from comp.app import app, yamlapp
...
from comp.utils.log4_new import *

# Exceptions handling class
class fmlogException(compException): pass

class fmlog(app):
    # Fmlog application class
    def __init__(self, key, config, **kwargs):
        # Initialise the required variables
        app.__init__(self, key, config, **kwargs)
        self._data = {'sid': self._id}
        ...

    def process(self, tid=None):
        if tid is not None:
            self.logd("Using thread '%d'." % (tid), data=self._data)
        # Run the fmlog process
        self.logi("Processing this '%s'" % (filename), data=self._data)
        ...

    def __doDone__(self, success='Failure', msg='', exception=None):
        ...
        self.logd("Process done!")

if __name__ == '__main__':
    def main():
        with yamlapp(filename=config, cls=fmlog, configcls=fmlogcfg, sections=sections, loglevel=loglevel, \
                     logfile=logfile, excludekey='_dontrun', sortkey='_priority', usethreads=threads, maxthreads=max, \
                     daemon=daemon, sleep=sleep) as a:
            a.run()

    main()
The yamlapp process (a sub-class of app) is instantiated and runs as a daemon until manually stopped. This process creates one or more instances of the fmlog class and calls their process() function when needed (when certain conditions are met). Up to x instances can be created per thread if the yamlapp process is run in thread mode.
The app process code:
#!/usr/bin/env python
...
from comp.utils.log4_new import *

class app(comp.base.comp, logconfig, log):
    def __init__(self, cls, **kwargs):
        self.__setdefault__('_configcls', configitem)
        self.__setdefault__('_daemon', True)
        self.__setdefault__('_maxthreads', 5)
        self.__setdefault__('_usethreads', False)
        ...
        comp.base.comp.__init__(self, **kwargs)
        logconfig.__init__(self, prog(), **getlogkwargs(**kwargs))
        log.__init__(self, logid=prog())

    def __enter__(self):
        self.logi(msg="Starting application '%s:%s' '%d'..." % (self._cls.__name__, \
            self.__class__.__name__, os.getpid()))
        return self

    def ...

    def run(self):
        ...
        if self._usethreads:
            ...
        while True:
            self.logd(msg="Start of run iteration...")
            if not self._usethreads:
                while not self._q.empty():
                    item = self._q.get()
                    try:
                        item.process()
            self.logd(msg="End of run iteration...")
            time.sleep(self._sleep)
The logging config and setup is done via the log4_new.py classes:
#!/usr/bin/env python

import logging
import logging.handlers
import re

class logconfig(comp):
    def __init__(self, logid, **kwargs):
        comp.__init__(self, **kwargs)
        self.__setdefault__('_logcount', 20)
        self.__setdefault__('_logdtformat', None)
        self.__setdefault__('_loglevel', DEBUG)
        self.__setdefault__('_logfile', None)
        self.__setdefault__('_logformat', '[%(asctime)-15s][%(levelname)5s] %(message)s')
        self.__setdefault__('_loginterval', 'S')
        self.__setdefault__('_logintervalnum', 30)
        self.__setdefault__('_logsuffix', '%Y%m%d%H%M%S')
        self._logid = logid
        self.__loginit__()

    def __loginit__(self):
        format = logging.Formatter(self._logformat, self._logdtformat)
        if self._logfile:
            hnd = logging.handlers.TimedRotatingFileHandler(self._logfile, when=self._loginterval, interval=self._logintervalnum, backupCount=self._logcount)
            hnd.suffix = self._logsuffix
            hnd.extMatch = re.compile(strftoregex(self._logsuffix))
        else:
            hnd = logging.StreamHandler()
        hnd.setFormatter(format)
        l = logging.getLogger(self._logid)
        for h in l.handlers:
            l.removeHandler(h)
        l.setLevel(self._loglevel)
        l.addHandler(hnd)

class log():
    def __init__(self, logid):
        self._logid = logid

    def __log__(self, msg, level=DEBUG, data=None):
        l = logging.getLogger(self._logid)
        l.log(level, msg, extra=data)

    def logd(self, msg, **kwargs):
        self.__log__(level=DEBUG, msg=msg, **kwargs)

    def ...

    def logf(self, msg, **kwargs):
        self.__log__(level=FATAL, msg=msg, **kwargs)

def getlogkwargs(**kwargs):
    logdict = {}
    for key, value in kwargs.iteritems():
        if key.startswith('log'):
            logdict[key] = value
    return logdict
Logging works as expected: logs from yamlapp (the sub-class of app) are written to fmapp_log.log, and logs from fmlog are written to fmlog.log.
The problem is that fmapp_log.log is rotated as expected, but fmlog.log is never rotated. How do I solve this? I know the process must run continuously for the rotation to happen, which is why only one logger is used. I suspect another handler must be created for the fmlog process, one that is never destroyed when the process exits.
Requirements:
The app (framework or main) log and the fmlog (process) log must go to different files.
Both log-files must be time-rotated.
Hopefully someone will understand the above and be able to give me a couple of pointers.
I use a LoggerAdapter to let my Python logging output Linux thread IDs (TIDs) instead of the long unique IDs. But this way I don't modify an existing logger; I create a new object:
new_logger = logging.LoggerAdapter(
    logger=logging.getLogger('mylogger'),
    extra=my_tid_extractor())
Now I want this LoggerAdapter to be used by certain modules. As long as I know the global variable used as the logger, I can do something like this:
somemodule.logger = new_logger
But this is not nice - it works only in a couple of cases, and you need to know the logger variables used by the modules.
Do you know a way to make a LoggerAdapter available globally, e.g. by calling something like
logging.setLogger('mylogger', new_logger)
Or alternatively: is there some other way to make Python logging output Linux thread IDs like those printed by ps?
Alternatively, you can implement a custom logger class and make it the default in the logging module.
Here is an example:
import logging
import ctypes

SYS_gettid = 186
libc = ctypes.cdll.LoadLibrary('libc.so.6')

FORMAT = '%(asctime)-15s [thread=%(tid)s] %(message)s'
logging.basicConfig(level=logging.DEBUG, format=FORMAT)

def my_tid_extractor():
    tid = libc.syscall(SYS_gettid)
    return {'tid': tid}

class CustomLogger(logging.Logger):
    def _log(self, level, msg, args, exc_info=None, extra=None):
        if extra is None:
            extra = my_tid_extractor()
        super(CustomLogger, self)._log(level, msg, args, exc_info, extra)

logging.setLoggerClass(CustomLogger)

logger = logging.getLogger('test')
logger.debug('test')
Output sample:
2015-01-20 19:24:09,782 [thread=5017] test
I think you need to override the LoggerAdapter.process() method, because the default LoggerAdapter.process() does nothing. Here is an example:
import logging
import random

# ensure a handler exists so DEBUG messages are actually printed
logging.basicConfig(level=logging.DEBUG)

L = logging.getLogger('name')

class myLogger(logging.LoggerAdapter):
    def process(self, msg, kwargs):
        return '(%d),%s' % (self.extra['name1'](1, 1000), msg), kwargs

# put the randint function object in extra
LA = myLogger(L, {'name1': random.randint})

# now, do some logging
LA.debug('some_logging_message')

Output:

DEBUG:name:(167),some_logging_message
I had a similar problem. My solution might be a bit more generic than the accepted one.
I’ve also used a custom logger class, but I did a generic extension that allows me to register adapters after it’s instantiated.
import logging

class AdaptedLogger(logging.Logger):
    """A logger that allows you to register adapters on an instance."""

    def __init__(self, name):
        """Create a new logger instance."""
        super().__init__(name)
        self.adapters = []

    def _log(self, level, msg, *args, **kwargs):
        """Let adapters modify the message and keyword arguments."""
        for adapter in self.adapters:
            msg, kwargs = adapter.process(msg, kwargs)
        return super()._log(level, msg, *args, **kwargs)
To make your logger use this class you have to instantiate it before it is used elsewhere, for example:
original_class = logging.getLoggerClass()
logging.setLoggerClass(AdaptedLogger)
logcrm_logger = logging.getLogger("test")
logging.setLoggerClass(original_class)
Then you can register adapters on the instance at any time later on.
logger = logging.getLogger("test")
adapter = logging.LoggerAdapter(logger, extra=my_tid_extractor())
logger.adapters.append(adapter)
Actually, the "adapters" can now be any objects, as long as they have a process method with a signature compatible with logging.LoggerAdapter.process().
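For illustration, a minimal sketch of such an object, assuming the AdaptedLogger class above and the my_tid_extractor() helper from the question; the TidTagger name is made up here:

class TidTagger:
    # any object exposing process(msg, kwargs) can be appended to logger.adapters
    def process(self, msg, kwargs):
        extra = kwargs.setdefault('extra', {})
        extra.update(my_tid_extractor())  # reuses the question's helper (assumption)
        return msg, kwargs

logger = logging.getLogger("test")
logger.adapters.append(TidTagger())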
I have multiple spiders in one project. The problem is that right now I am defining LOG_FILE in the settings like
LOG_FILE = "scrapy_%s.log" % datetime.now()
What I want is scrapy_SPIDERNAME_DATETIME,
but I am unable to include the spider name in the log file name.
I found
scrapy.log.start(logfile=None, loglevel=None, logstdout=None)
and called it in each spider's __init__ method, but it's not working.
Any help would be appreciated.
The spider's __init__() is not early enough to call log.start() by itself since the log observer is already started at this point; therefore, you need to reinitialize the logging state to trick Scrapy into (re)starting it.
In your spider class file:
from datetime import datetime
from scrapy import log
from scrapy.spider import BaseSpider

class ExampleSpider(BaseSpider):
    name = "example"
    allowed_domains = ["example.com"]
    start_urls = ["http://www.example.com/"]

    def __init__(self, name=None, **kwargs):
        LOG_FILE = "scrapy_%s_%s.log" % (self.name, datetime.now())
        # remove the current log
        # log.log.removeObserver(log.log.theLogPublisher.observers[0])
        # re-create the default Twisted observer which Scrapy checks
        log.log.defaultObserver = log.log.DefaultObserver()
        # start the default observer so it can be stopped
        log.log.defaultObserver.start()
        # trick Scrapy into thinking logging has not started
        log.started = False
        # start the new log file observer
        log.start(LOG_FILE)
        # continue with the normal spider init
        super(ExampleSpider, self).__init__(name, **kwargs)

    def parse(self, response):
        ...
And the output file might look like:
scrapy_example_2012-08-25 12:34:48.823896.log
There should be a BOT_NAME in your settings.py. This is the project/spider name. So in your case, this would be
LOG_FILE = "scrapy_%s_%s.log" % (BOT_NAME, datetime.now())
This is pretty much the same as what Scrapy does internally.
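A minimal settings.py sketch of that idea, with an illustrative project name:

# settings.py
from datetime import datetime

BOT_NAME = 'myproject'
LOG_FILE = "scrapy_%s_%s.log" % (BOT_NAME, datetime.now())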
But why not use log.msg? The docs clearly state that it is for spider-specific stuff. It might be easier to use it and just extract/grep/... the different spiders' messages from one big log file.
A more complicated approach would be to get the spider locations from the SPIDER_MODULES list and load all the spiders inside those packages (see the sketch below).
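A rough sketch of that idea, not from the original answer; walk_modules and iter_spider_classes are Scrapy helpers, but check that they are available in your Scrapy version before relying on them:

from scrapy.utils.misc import walk_modules
from scrapy.utils.project import get_project_settings
from scrapy.utils.spider import iter_spider_classes

settings = get_project_settings()
for module_path in settings.getlist('SPIDER_MODULES'):
    for module in walk_modules(module_path):
        for spider_cls in iter_spider_classes(module):
            # spider_cls.name could be used to build a per-spider log file name
            print(spider_cls.name)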
You can use Scrapy's storage URI parameters in your settings.py file for the FEED_URI setting:
%(name)s
%(time)s
For example: /tmp/crawled/%(name)s/%(time)s.log
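A minimal settings.py sketch of that idea; FEED_URI/FEED_FORMAT are the legacy feed-export settings (newer Scrapy versions use the FEEDS dict), and %(name)s / %(time)s are expanded per spider at export time:

# settings.py
FEED_URI = '/tmp/crawled/%(name)s/%(time)s.log'
FEED_FORMAT = 'jsonlines'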