I have tried logging in Python. It looks like once a logging instance is created by a thread, it won't be deleted. However, my program will create more than 100 threads per minute, and each one will create its own logger, which may result in a kind of memory leak (logging.Logger instances will not be collected by the garbage collector).
Can anyone help me with this? Is there a way to use loggers in multi-threaded applications?
In the Python logging module, loggers are managed by a logging.Manager instance. Usually there is only one logging manager, available as logging.Logger.manager. Loggers are identified by their name. Each time you call logging.getLogger('name'), the call is actually forwarded to logging.Logger.manager.getLogger, which holds a dict of loggers and returns the same logger for each 'name' every time.
So if you don't use a different name when getting the logger from a thread, you're actually using the same logger instance each time and don't have to worry about a memory leak.
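A minimal sketch to convince yourself of this caching behaviour (the name 'worker' is just an example):
import logging

a = logging.getLogger('worker')
b = logging.getLogger('worker')
print(a is b)  # True - both calls return the same cached Logger instance
print(len(logging.Logger.manager.loggerDict))  # grows only with unique names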
Related
I need some help with implementing logging while multiprocessing and running the application frozen under Windows. There are dozens of topics on this subject and I have spent a lot of time reviewing and testing those. I have also extensively reviewed the documentation, but I cannot figure out how to implement this in my code.
I have created a minimum example which runs fine on Linux, but crashes on Windows (even when not frozen). The example I created is just one of many iterations I have put my code through.
You can find the minimum example on github. Any assistance to get this example working would be greatly appreciated.
Thank you.
Marc.
The basics
On Linux, a child process is created with the fork method by default. That means the child process inherits almost everything from the parent process.
On Windows, the child process is created with the spawn method.
That means a child process starts almost from scratch: it re-imports and re-executes any code that is outside the if __name__ == '__main__' guard clause.
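A minimal sketch of a spawn-safe layout (the worker function is just a placeholder); module-level code is re-executed in every child, while the code under the guard runs only in the parent:
import multiprocessing

def worker():
    print("this runs in the child process")

if __name__ == '__main__':
    # on Windows (spawn), everything outside this guard is re-imported and
    # re-executed in the child, so process creation must live inside it
    p = multiprocessing.Process(target=worker)
    p.start()
    p.join()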
Why it worked or failed
On Linux, because the logger object is inherited, your program will start logging.
But it is far from perfect since you log directly to the file.
Sooner or later, log lines will overlap or an IO error will happen on the file due to a race condition between the processes.
On Windows, since you didn't pass the logger object to the child process and it re-imports your pymp_global module, logger is None. So when you try to log with a None object, it crashes for sure.
The solution
Logging with multiprocessing is not an easy task.
For it to work on Windows, you must pass a logger object to the child processes and/or log with a QueueHandler. Another similar solution for inter-process communication is to use a SocketHandler.
The idea is that only one thread or process does the actual writing; the other processes just send it their log records. This prevents the race condition and ensures the log is written out once the logging process gets time to do its job.
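Here is a minimal sketch of the stdlib QueueHandler/QueueListener pattern (the general idea only, not the logger-tt code shown below; the file name app.log and the worker function are placeholders):
import logging
import logging.handlers
import multiprocessing

def worker(queue):
    # each child only pushes records onto the queue
    root = logging.getLogger()
    root.setLevel(logging.INFO)
    root.addHandler(logging.handlers.QueueHandler(queue))
    logging.getLogger(__name__).info("hello from %s",
                                     multiprocessing.current_process().name)

if __name__ == '__main__':
    queue = multiprocessing.Queue()
    # only the listener in the parent process touches the log file
    file_handler = logging.FileHandler('app.log')
    file_handler.setFormatter(
        logging.Formatter('%(asctime)s %(processName)s %(levelname)s %(message)s'))
    listener = logging.handlers.QueueListener(queue, file_handler)
    listener.start()
    procs = [multiprocessing.Process(target=worker, args=(queue,)) for _ in range(3)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    listener.stop()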
So how to implement it?
I have encountered this logging problem before and have already written the code.
You can just use the logger-tt package.
#pymp.py
from logging import getLogger
from logger_tt import setup_logging
setup_logging(use_multiprocessing=True)
logger = getLogger(__name__)
# other code below
For other modules
#pymp_common.py
from logging import getLogger
logger = getLogger(__name__)
# other code below
This saves you from writing all the logging config code everywhere manually.
You may want to adjust the log_config file to suit your needs.
I am inspecting the logging.Logger.manager.loggerDict by doing:
import logging
logging.Logger.manager.loggerDict
and the dict is as follows:
{
'nose.case': <celery.utils.log.ProcessAwareLogger object at 0x112c8dcd0>,
'apps.friends': <logging.PlaceHolder object at 0x1147720d0>,
'oauthlib.oauth2.rfc6749.grant_types.client_credentials': <celery.utils.log.ProcessAwareLogger object at 0x115c48710>,
'apps.adapter.views': <celery.utils.log.ProcessAwareLogger object at 0x116a847d0>,
'apps.accounts.views': <celery.utils.log.ProcessAwareLogger object at 0x116976990>,
}
There are more entries, but I truncated the output.
My questions are:
How come celery is involved in the logging of various other non-celery apps? Is it because logging is done in an async way and the logging framework somehow detects the presence of celery and uses it?
For my own files that log using logger = logging.getLogger(__name__), I see that one is a logging.PlaceHolder object while the other two are celery.utils.log.ProcessAwareLogger objects, even though those two are used in views and not in celery processes. How did it become this way?
Thanks
Celery itself replaces the (global) logger class, using the logging.setLoggerClass method, with a ProcessAwareLogger class that does a couple of things: avoid trying to log while in a signal handler, and add a process name to logs. This happens as soon as Celery's logging system is set up. You're seeing this class even on your own loggers because of the global nature of setLoggerClass.
As for why, exactly, Celery is designed like that, I think you'd have to ask a developer of Celery, but effectively it allows Celery to ensure that signal handler safety and process name are taken care of even if you use your own loggers in your app.
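A minimal sketch of the mechanism (MyLogger here is a made-up stand-in for Celery's ProcessAwareLogger):
import logging

class MyLogger(logging.Logger):
    # stands in for celery.utils.log.ProcessAwareLogger in this sketch
    pass

logging.setLoggerClass(MyLogger)

# any logger created after this point, including "your own" loggers,
# is an instance of the replacement class
log = logging.getLogger('apps.adapter.views')
print(type(log))  # <class '__main__.MyLogger'>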
The Python logging docs note:
If you are implementing asynchronous signal handlers using the signal module, you may not be able to use logging from within such handlers. This is because lock implementations in the threading module are not always re-entrant, and so cannot be invoked from such signal handlers.
Celery uses signal so this may be a reason for wanting to globally enforce its logger class.
I'm using logging.getLogger() in a way that is not forbidden by the documentation, but also not directly addressed by it.
My applications process data files and network connections, sometimes in threads. In order to identify the log lines for each connection and/or data file, I do the following:
data_file_name = "data_file_123.xml"
logger = logging.getLogger(data_file_name)
logger.info("This is logged.")
which produces:
2013-07-22 05:58:55,721 - data_file_123.xml - INFO - This is logged.
This works very well:
Avoids passing a logger instance around the software.
Ensures every line is marked with the appropriate source identifier without having to manually perform it in each logging call.
My concern is this from the documentation at http://docs.python.org/2/library/logging.html#logging.getLogger:
All calls to this function with a given name return the same logger instance. This means that logger instances never need to be passed between different parts of an application.
How are the logger instances destroyed? Are they destroyed? After processing a million files, will there be a million named logger instances waiting in memory to be used? Am I setting myself up for a memory leak as memory fills with these old logging instances?
How are the logger instances destroyed? Are they destroyed? After processing a million files will there be a million named logger instances waiting in memory to be used? Am I setting myself up for a memory leak as memory fills with these old logging instances?
They aren't destroyed until the interpreter exits. All instances are cached since this is the behaviour that you want when logging. After processing a million files there will be one million logger instances alive.
As you stated yourself, you are using the logging module for something outside its intended purpose, so this is a suboptimal solution.
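You can watch the cache grow yourself; a small demonstration (the logger names are arbitrary):
import logging

before = len(logging.Logger.manager.loggerDict)
for i in range(1000):
    logging.getLogger('data_file_%d.xml' % i)
after = len(logging.Logger.manager.loggerDict)
print(after - before)  # 1000 - one cached Logger per unique name, never freed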
There isn't a public API to get rid of cached loggers, although you can clear the cache by doing:
>>> root = logging.getLogger()
>>> root.manager.loggerDict.clear()
The loggerDict and manager attributes aren't described in the public documentation, although they aren't marked as private with a leading underscore either.
Instead of having a different logger for each file processed, I'd use a different logger for each thread and insert the name of the file into the relevant log messages. You can write a simple helper for the logging calls so you don't have to explicitly insert the filename in every call to the logger.
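One way to sketch that helper is with logging.LoggerAdapter (the function name log_for_file and the datafile key are made up for illustration):
import logging

logging.basicConfig(
    format='%(asctime)s - %(datafile)s - %(levelname)s - %(message)s',
    level=logging.INFO,
)
logger = logging.getLogger(__name__)  # one logger, no matter how many files are processed

def log_for_file(filename):
    # the extra dict becomes attributes on each LogRecord,
    # so the formatter above can reference %(datafile)s
    return logging.LoggerAdapter(logger, {'datafile': filename})

file_log = log_for_file('data_file_123.xml')
file_log.info("This is logged.")
# e.g. 2013-07-22 05:58:55,721 - data_file_123.xml - INFO - This is logged.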
I'm working on an app that uses the standard logging module to do logging. We have a setup where we log to a bunch of files based on levels etc. We also use celery to run some jobs out of the main app (maintenance stuff usually that's time consuming).
The celery task does nothing other than call functions (let's say spam) which do the actual work. These functions use the logging module to output status messages. Now I want to write a decorator that hijacks all the logging calls made by spam and puts them into a StringIO so that I can put them somewhere.
One of the solutions I had was to insert a handler for the root logger while the function is executing that grabs everything. However, this is messing with global shared objects which might be problematic later.
I came across this answer but it's not exactly what I'm looking for.
The thing about the StringIO is, there could be multiple processes running (Celery tasks), hence multiple StringIOs, right?
You can do something like this:
In the processes run under Celery, add to the root logger a handler which sends events to a socket (SocketHandler for TCP or DatagramHandler for UDP); a minimal sender sketch follows these two steps.
Create a socket receiver to receive and handle the events, as documented here. This acts like a consolidated StringIO across multiple Celery-run processes.
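A sketch of the sender side, assuming the cookbook's socket receiver is running on localhost:
import logging
import logging.handlers

root = logging.getLogger()
root.setLevel(logging.DEBUG)
# every record from this process is pickled and sent to the receiver,
# which does the actual writing (or collecting into a single StringIO)
root.addHandler(logging.handlers.SocketHandler(
    'localhost', logging.handlers.DEFAULT_TCP_LOGGING_PORT))

logging.getLogger('spam').info("handled by the central receiver, not locally")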
If you are using multiprocessing, you can also use the approach described here. Though that post talks about Python 3.2, the functionality is also available for Python 2.x using logutils.
Update: If you want to avoid a separate receiver process, you can log to a database directly, using a handler similar to that in this answer. If you want to buffer all the logging till the end of the process, you can use a MemoryHandler in conjunction with a database handler to achieve this.
For the StringIO handler, you could add an extra handler for the root logger that would grab everything, but at the same time add a dummy filter (Logger.addFilter) that filters everything out (so nothing is actually logged to StringIO).
You could then write a decorator for spam that removes the filter (Logger.removeFilter) before the function executes, and adds the dummy filter back after.
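A sketch of that idea; here the block-all filter is attached to the capturing handler itself (a slight variation on the Logger.addFilter suggestion, so only the StringIO capture is affected), and all names are made up:
import functools
import io
import logging

capture_buffer = io.StringIO()
capture_handler = logging.StreamHandler(capture_buffer)

class BlockAll(logging.Filter):
    def filter(self, record):
        return False  # drop every record while this filter is attached

block_all = BlockAll()
capture_handler.addFilter(block_all)
logging.getLogger().addHandler(capture_handler)  # grabs everything, but filtered out

def capture_logging(func):
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        capture_handler.removeFilter(block_all)   # start capturing
        try:
            return func(*args, **kwargs)
        finally:
            capture_handler.addFilter(block_all)  # stop capturing again
    return wrapper

@capture_logging
def spam():
    logging.getLogger('spam').warning("this ends up in capture_buffer")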
I have a server subclass spawning threaded response handlers; the handlers in turn start application threads. Everything is going smoothly except for one thing. Using objgraph, I see the correct number of application threads running (I am load testing and have it throttled to keep 35 application instances running).
Invoking objgraph.typestats() provides a breakdown of how many instances of each object are currently live in the interpreter (according to the GC). Looking at that output for memory leaks, I find 700 logger instances - which would be the total number of response handlers spawned by the server.
I have called logger.removeHandler(memoryhandler) and logger.removeHandler(filehandler) when the application thread exits the run() method to ensure that there are no lingering references to the logger instance, and the logger instance is completely isolated within the application thread (there are no external references to it). As a final stab at eliminating these logger instances, the last statement in run() is del self.logger.
To get the logger in init(), I give it a suitably large random number as its name so it will be distinct for file access - I use the same large number as part of the log file name to avoid collisions between application logs.
The long and the short of it is that I have 700 logger instances tracked by the GC but only 35 active threads - how do I go about killing off these loggers? A more cumbersome engineering solution is to create a pool of loggers and just acquire one for the life of the application thread, but that is more code to maintain when the GC should simply handle this automatically.
Don't create potentially unbounded numbers of loggers, that's not good practice - there are other ways of getting context-sensitive information into your logs, as documented here.
You also don't need to have a logger as an instance attribute: loggers are singletons so you can just get a particular one by name from anywhere. The recommended practice is to name loggers at module level using
logger = logging.getLogger(__name__)
which suffices for most scenarios.
From your question I can't tell whether you appreciate that handlers and loggers aren't the same thing - for example you talk about removeHandler calls (which might serve to free the handler instances because their reference counts go to zero, but you won't free any logger instances by doing so).
Generally, loggers are named after parts of your application which generate events of interest.
If you want each thread to e.g. write to a different file, you can create a new filename each time, and then close the handler when you're done and the thread is about to terminate (that closing is important to free handler resources). Or, you can log everything to one file with thread ids or other discriminators included in the log output, and use post-processing on the log file.
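A minimal sketch of the second option (one shared log file with the thread name in every line; the file name app.log and the worker function are placeholders):
import logging
import threading

logging.basicConfig(
    filename='app.log',
    format='%(asctime)s %(threadName)s %(name)s %(levelname)s %(message)s',
    level=logging.INFO,
)
logger = logging.getLogger(__name__)  # one module-level logger shared by all threads

def worker(n):
    logger.info("processing item %d", n)  # %(threadName)s identifies the thread

threads = [threading.Thread(target=worker, args=(i,)) for i in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()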
I ran into the same memory leak when using logging.Logger(), and you may try to manually close the handler file descriptors once the logger is no longer needed, like:
for handler in list(logger.handlers):  # iterate over a copy so the list can be modified
    handler.close()                    # frees the file descriptor
    logger.removeHandler(handler)      # drop the reference to the handler