Is Python's logging.getLogger scalable with many named loggers? - python

I'm using logging.getLogger() in a way that isn't forbidden by the documentation but also isn't directly addressed by it.
My applications process data files and network connections, sometimes in threads. In order to identify the log lines for each connection and/or data file, I do the following:
import logging

data_file_name = "data_file_123.xml"
logger = logging.getLogger(data_file_name)
logger.info("This is logged.")

With a handler whose format string includes the logger name, this produces:

2013-07-22 05:58:55,721 - data_file_123.xml - INFO - This is logged.
This works very well:
- Avoids passing a logger instance around the software.
- Ensures every line is marked with the appropriate source identifier without having to manually perform it in each logging call.
My concern is this passage from the documentation at http://docs.python.org/2/library/logging.html#logging.getLogger:
All calls to this function with a given name return the same logger
instance. This means that logger instances never need to be passed
between different parts of an application.
How are the logger instances destroyed? Are they destroyed? After processing a million files will there be a million named logger instances waiting in memory to be used? Am I setting myself up for a memory leak as memory fills with these old logging instances?

How are the logger instances destroyed? Are they destroyed? After processing a million files will there be a million named logger instances waiting in memory to be used? Am I setting myself up for a memory leak as memory fills with these old logging instances?
They aren't destroyed until the interpreter exits. All instances are cached, since that is the behaviour you want when logging: after processing a million files there will be one million logger instances alive.
As you stated yourself, you are using the logging module for something outside its intended purpose, so this is a suboptimal solution.
There isn't a public API to get rid of cached loggers, although you can clear the cache by doing:
>>> root = logging.getLogger()
>>> root.manager.loggerDict.clear()
The loggerDict and manager attributes aren't described in the public documentation, although they aren't explicitly marked as private either (there is no leading underscore).
Instead of having a different logger for each file processed, I'd use a different logger for each thread and include the name of the file in the relevant log messages. You can write a simple helper function for logging so that you don't have to insert the filename explicitly in every call to the logger; a sketch follows.
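For example, here is a minimal sketch of such a helper using logging.LoggerAdapter; the 'datafile' field name and the format string are illustrative choices, not part of the original answer:

import logging

# The format assumes every record carries a 'datafile' attribute,
# which the adapter below supplies.
logging.basicConfig(
    format="%(asctime)s - %(datafile)s - %(levelname)s - %(message)s",
    level=logging.INFO,
)
base_logger = logging.getLogger(__name__)

def file_logger(data_file_name):
    """Return an adapter that stamps every message with the given file name."""
    return logging.LoggerAdapter(base_logger, {"datafile": data_file_name})

log = file_logger("data_file_123.xml")
log.info("This is logged.")
# -> 2013-07-22 05:58:55,721 - data_file_123.xml - INFO - This is logged.

Only one named logger is ever created this way, however many files are processed.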

Related

Why get a new logger object in each new module?

The Python logging module has a common pattern (ex1, ex2) where each module gets its own logger object.
I'm not a fan of blindly following patterns and so I would like to understand a little bit more.
Why get a new logger object in each new module?
Why not have everyone just use the same root logger and configure the formatter with %(module)s?
Are there examples where this pattern is NECESSARY/NEEDED (i.e. because of some sort of performance reason[1])?
[1] In a multi-threaded Python program, are there hidden synchronization issues that are fixed by using multiple logging objects?
Each logger can be configured separately. Generally, a module logger is not configured at all in the module itself. You create a distinct logger and use it to log messages of varying levels of detail. Whoever uses the logger decides what level of messages to see, where to send those messages, and even how to display them. They may want everything (DEBUG and up) from one module logged to a file, while for another module they may only care if a serious error occurs (in which case they want it e-mailed directly to them). If every module used the same (root) logger, you wouldn't have that kind of flexibility.
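To illustrate that flexibility, a rough sketch (the module names, file name and e-mail addresses are made up): the consuming application sends one module's DEBUG output to a file while only serious errors from another module are e-mailed:

import logging
import logging.handlers

# Everything from myapp.parser, down to DEBUG, goes to a file.
parser_logger = logging.getLogger("myapp.parser")
parser_logger.setLevel(logging.DEBUG)
file_handler = logging.FileHandler("parser_debug.log")
file_handler.setLevel(logging.DEBUG)
parser_logger.addHandler(file_handler)

# Only serious errors from myapp.network are e-mailed.
network_logger = logging.getLogger("myapp.network")
mail_handler = logging.handlers.SMTPHandler(
    mailhost="localhost",
    fromaddr="app@example.com",
    toaddrs=["ops@example.com"],
    subject="Serious error in myapp.network",
)
mail_handler.setLevel(logging.ERROR)
network_logger.addHandler(mail_handler)

None of this requires touching the modules themselves; they just call logging.getLogger(__name__).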
The logger name defines where (logically) in your application events occur. Hence, the recommended pattern
logger = logging.getLogger(__name__)
uses logger names which track the Python package hierarchy. This in turn allows whoever is configuring logging to turn verbosity up or down for specific loggers. If everything just used the root logger, one couldn't get fine-grained control of verbosity, which is important when systems reach a certain size / complexity.
The logger names don't need to exactly track the package names - you could have multiple loggers in certain packages, for example. The main deciding factor is how much flexibility is needed (if you're writing an application) and perhaps also how much flexibility your users need (if you're writing a library).
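A small sketch of that kind of fine-grained control (the package names are hypothetical):

import logging

logging.basicConfig(level=logging.WARNING)   # default verbosity for everything

# Turn one subsystem up and another down without touching their code.
logging.getLogger("myapp.db").setLevel(logging.DEBUG)
logging.getLogger("myapp.http").setLevel(logging.ERROR)

logging.getLogger("myapp.db").debug("shown: DEBUG is enabled for myapp.db")
logging.getLogger("myapp.http").warning("dropped: below ERROR for myapp.http")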

Design Pattern for logging in Multi threaded system

How can we make use of a design pattern for log generation in a multi-threaded environment? There is one log file, and multiple threads need to write to it. So there has to be a mechanism so that each thread can access the same file handler once it is created.
Should I use the Singleton or Factory design pattern, since there is only one point of instantiation of the object, or is there a better way to do this?
The Python logging module is actually thread-safe by default:
The logging module is intended to be thread-safe without any special
work needing to be done by its clients. It achieves this through using
threading locks; there is one lock to serialize access to the module’s
shared data, and each handler also creates a lock to serialize access
to its underlying I/O.
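So a single module-level logger shared by all threads is enough; a minimal sketch (the file name and format are arbitrary):

import logging
import threading

logging.basicConfig(
    filename="app.log",
    format="%(asctime)s [%(threadName)s] %(levelname)s %(message)s",
    level=logging.INFO,
)
logger = logging.getLogger("worker")

def work(item):
    # All threads share this logger; the handler's internal lock
    # serializes writes to the file.
    logger.info("processing item %d", item)

threads = [threading.Thread(target=work, args=(i,)) for i in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()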

Each thread creates its own logger instance, logging its own events

I have tried logging in Python. It looks like once a logger instance is created by a thread, it is never deleted. However, my program creates more than 100 threads per minute, each of which creates its own logger, and this may result in a kind of memory leak (logging.Logger instances are not collected by the garbage collector).
Can anyone help me on this, is there a way to use logger for multi-threaded applications?
In the Python logging module, loggers are managed by a logging.Manager instance. Usually there is only one logging manager, available as logging.Logger.manager. Loggers are identified by their name. Each time you call logging.getLogger('name'), the call is actually forwarded to logging.Logger.manager.getLogger, which holds a dict of loggers and returns the same logger for each 'name' every time.
So if you don't use a different name when getting the logger from a thread, you're actually using the same logger instance each time and don't have to worry about a memory leak.
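You can verify the caching behaviour directly; this snippet is purely illustrative:

import logging

a = logging.getLogger("worker")
b = logging.getLogger("worker")
assert a is b                       # same name -> same cached instance

# The cache only grows with the number of *distinct* names used.
print(len(logging.Logger.manager.loggerDict))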

Temporary changing python logging handlers

I'm working on an app that uses the standard logging module to do logging. We have a setup where we log to a bunch of files based on levels etc. We also use celery to run some jobs out of the main app (maintenance stuff usually that's time consuming).
The celery task does nothing other than call functions (let's say spam) which do the actual work. These functions use the logging module to output status messages. Now I want to write a decorator that hijacks all the logging calls made by spam and puts them into a StringIO so that I can put them somewhere.
One of the solutions I had was to insert a handler for the root logger while the function is executing that grabs everything. However, this is messing with global shared objects which might be problematic later.
I came across this answer but it's not exactly what I'm looking for.
The thing about the StringIO is, there could be multiple processes running (Celery tasks), hence multiple StringIOs, right?
You can do something like this:
1. In the processes run under Celery, add to the root logger a handler which sends events to a socket (SocketHandler for TCP or DatagramHandler for UDP).
2. Create a socket receiver to receive and handle the events, as documented here. This acts like a consolidated StringIO across multiple Celery-run processes.
If you are using multiprocessing, you can also use the approach described here. Though that post talks about Python 3.2, the functionality is also available for Python 2.x using logutils.
Update: If you want to avoid a separate receiver process, you can log to a database directly, using a handler similar to that in this answer. If you want to buffer all the logging till the end of the process, you can use a MemoryHandler in conjunction with a database handler to achieve this.
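The sender side of step 1 could look roughly like this (the host is a placeholder, and the receiver from the logging cookbook must already be running):

import logging
import logging.handlers

socket_handler = logging.handlers.SocketHandler(
    "localhost", logging.handlers.DEFAULT_TCP_LOGGING_PORT
)
logging.getLogger().addHandler(socket_handler)   # root logger in the Celery process

logging.getLogger(__name__).info("this event is pickled and sent to the receiver")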
For the StringIO handler, you could add an extra handler for the root logger that would grab everything, but at the same time add a dummy filter (Logger.addFilter) that filters everything out (so nothing is actually logged to StringIO).
You could then write a decorator for spam that removes the filter (Logger.removeFilter) before the function executes, and adds the dummy filter back after.
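A rough Python 3 sketch of that idea (capture_logging is an illustrative name, and spam stands in for the work function from the question; here the blocking filter is attached to the capturing handler itself rather than to the logger, so other handlers are unaffected):

import functools
import io
import logging

capture_buffer = io.StringIO()
capture_handler = logging.StreamHandler(capture_buffer)

class _BlockAll(logging.Filter):
    def filter(self, record):
        return False        # drop everything while not capturing

_block_all = _BlockAll()
capture_handler.addFilter(_block_all)
logging.getLogger().addHandler(capture_handler)   # grabs everything, but filtered out

def capture_logging(func):
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        capture_handler.removeFilter(_block_all)   # start capturing
        try:
            return func(*args, **kwargs)
        finally:
            capture_handler.addFilter(_block_all)  # stop capturing again
    return wrapper

@capture_logging
def spam():
    logging.getLogger("tasks").warning("captured into the StringIO")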

Python multi-threaded application with memory leak from the thread-specific logger instances

I have a server subclass spawning threaded response handlers; the handlers in turn start application threads. Everything is going smoothly, except that when I use ObjGraph I see the correct number of application threads running (I am load testing and have it throttled to keep 35 application instances running).
Invoking objgraph.typestats() provides a breakdown of how many instances of each object are currently live in the interpreter (according to the GC). Looking at that output for memory leaks, I find 700 logger instances - which would be the total number of response handlers spawned by the server.
I have called logger.removeHandler(memoryhandler) and logger.removeHandler(filehandler) when the application thread exits the run() method to ensure that there are no lingering references to the logger instances; the logger instance is also completely isolated within the application thread (there are no external references to it). As a final stab at eliminating these logger instances, the last statement in run() is del self.logger.
To get the logger in init(), I give it a suitably large random number as its name so it will be distinct for file access - I use the same large number as part of the log file name to avoid application log collisions.
The long and the short of it is that I have 700 logger instances tracked by the GC but only 35 active threads - how do I go about killing off these loggers? A more cumbersome engineering solution would be to create a pool of loggers and just acquire one for the life of the application thread, but that is creating more code to maintain when the GC should simply handle this automatically.
Don't create potentially unbounded numbers of loggers, that's not good practice - there are other ways of getting context-sensitive information into your logs, as documented here.
You also don't need to have a logger as an instance attribute: loggers are singletons so you can just get a particular one by name from anywhere. The recommended practice is to name loggers at module level using
logger = logging.getLogger(__name__)
which suffices for most scenarios.
From your question I can't tell whether you appreciate that handlers and loggers aren't the same thing - for example you talk about removeHandler calls (which might serve to free the handler instances because their reference counts go to zero, but you won't free any logger instances by doing so).
Generally, loggers are named after parts of your application which generate events of interest.
If you want each thread to e.g. write to a different file, you can create a new filename each time, and then close the handler when you're done and the thread is about to terminate (that closing is important to free handler resources). Or, you can log everything to one file with thread ids or other discriminators included in the log output, and use post-processing on the log file.
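A rough Python 3 sketch of the "different file per thread" option: one shared logger, a per-thread FileHandler kept to that thread's own records by a filter, and the handler closed when the thread finishes (the file-name scheme is made up):

import logging
import threading

logger = logging.getLogger("myapp.jobs")
logger.setLevel(logging.INFO)

def run_job(job_id):
    handler = logging.FileHandler("job_%s.log" % job_id)
    this_thread = threading.get_ident()
    # Only let records emitted by this thread into this thread's file.
    handler.addFilter(lambda record: record.thread == this_thread)
    logger.addHandler(handler)
    try:
        logger.info("job %s started", job_id)
        # ... the actual work ...
        logger.info("job %s finished", job_id)
    finally:
        logger.removeHandler(handler)   # detach first
        handler.close()                 # then free the file descriptor

threading.Thread(target=run_job, args=("123",)).start()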
I ran into the same memory leak when using logging.Logger(); you can try manually closing the handler file descriptors when a logger is no longer needed, like:
for handler in logger.handlers:
    handler.close()  # frees the handler's underlying file descriptor
