Design pattern for logging in a multi-threaded system - Python

How can we make use of a design pattern for log generation in a multi-threaded environment? There is one log file, and multiple threads need to write to it, so there has to be a mechanism by which each thread can access the same file handler once it has been created.
Should I use the Singleton or the Factory design pattern, since there is only one point of instantiation of the object, or is there a better way to do this?

The Python logging module is actually thread-safe by default:
The logging module is intended to be thread-safe without any special
work needing to be done by its clients. It achieves this through using
threading locks; there is one lock to serialize access to the module’s
shared data, and each handler also creates a lock to serialize access
to its underlying I/O.
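So you don't need a Singleton or Factory of your own: configure a handler once and fetch the same named logger from every thread. A rough sketch of what that looks like (the logger name "app" and the file "app.log" are just placeholders):

import logging
import threading

# one FileHandler, configured once at startup; the handler's internal lock
# serializes the writes, so the threads need no extra locking
handler = logging.FileHandler("app.log")
handler.setFormatter(logging.Formatter("%(asctime)s %(threadName)s %(levelname)s %(message)s"))

logger = logging.getLogger("app")
logger.setLevel(logging.INFO)
logger.addHandler(handler)

def worker(n):
    # every thread gets the same logger object back from getLogger("app")
    logging.getLogger("app").info("message %d", n)

threads = [threading.Thread(target=worker, args=(i,)) for i in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()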

Is Python's logging.getLogger scalable with many named loggers?

I'm using logging.getLogger() in a way that is not forbidden by the documentation but also not directly covered by it.
My applications process data files and network connections, sometimes in threads. In order to identify the log lines for each connection and/or data file, I do the following:
import logging
logging.basicConfig(format="%(asctime)s - %(name)s - %(levelname)s - %(message)s", level=logging.INFO)
data_file_name = "data_file_123.xml"
logger = logging.getLogger(data_file_name)
logger.info("This is logged.")
2013-07-22 05:58:55,721 - data_file_123.xml - INFO - This is logged.
This works very well:
Avoids passing a logger instance around the software.
Ensures every line is marked with the appropriate source identifier without having to manually perform it in each logging call.
My concern is this from the documentation at http://docs.python.org/2/library/logging.html#logging.getLogger:
All calls to this function with a given name return the same logger
instance. This means that logger instances never need to be passed
between different parts of an application.
How are the logger instances destroyed? Are they destroyed? After processing a million files, will there be a million named logger instances waiting in memory to be used? Am I setting myself up for a memory leak as memory fills with these old logger instances?
They aren't destroyed until the interpreter exits. All instances are cached since this is the behaviour that you want when logging. After processing a million files there will be one million logger instances alive.
As you stated yourself, you are using the logging module for something that is not part of its intended purpose, so this is a suboptimal solution.
There isn't a public API to get rid of cached loggers, although you can clear the cache by doing:
>>> root = logging.getLogger()
>>> root.manager.loggerDict.clear()
The loggerDict and manager attributes aren't described in the public documentation, although they aren't explicitly marked as private (with a leading underscore) either.
Instead of having a different logger for each file processed, I'd use a different logger for each thread and insert the name of the file into the relevant log messages. You can write a simple helper that does the logging so you don't have to insert the filename explicitly in every call to the logger, as sketched below.
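One way to write such a helper - a sketch, not from the original answer, using the standard LoggerAdapter (the names get_file_logger and data_file are made up):

import logging
import threading

def get_file_logger(data_file_name):
    # one logger per worker thread (a bounded number), with the file name
    # attached to every record through a LoggerAdapter instead of creating
    # a new named logger per file
    logger = logging.getLogger("worker.%s" % threading.current_thread().name)
    return logging.LoggerAdapter(logger, {"data_file": data_file_name})

# a formatter on the handler can then include the file name, e.g.
# logging.Formatter("%(asctime)s - %(data_file)s - %(levelname)s - %(message)s")
log = get_file_logger("data_file_123.xml")
log.info("This is logged.")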

Managing global objects with side effects when reloading a module in Python

I am looking for a way to correctly manage module level global variables that use some operating system resource (like a file or a thread).
The problem is that when the module is reloaded, my resource must be properly disposed of (e.g. the file closed or the thread terminated) before the new one is created.
So I need a better pattern to manage those singleton objects.
I've been reading the docs on module reload and this is quite interesting:
When a module is reloaded, its dictionary (containing the module’s
global variables) is retained. Redefinitions of names will override
the old definitions, so this is generally not a problem. If the new
version of a module does not define a name that was defined by the old
version, the old definition remains. This feature can be used to the
module’s advantage if it maintains a global table or cache of objects
— with a try statement it can test for the table’s presence and skip
its initialization if desired:
try:
    cache
except NameError:
    cache = {}
So I could just check whether the objects already exist, and dispose of them before creating the new ones.
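A sketch of that check-and-dispose idea for a module-level file handle (_log_file and "app.log" are illustrative names):

try:
    _log_file
except NameError:
    _log_file = None

if _log_file is not None:
    # dispose of the handle left over from the previous version of the module
    _log_file.close()
_log_file = open("app.log", "a")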
You would need to monkeypatch or fork Django to hook into the Django dev server's reloading feature and do the proper cleanup there (closing files, etc.).
But since you are developing a Django application, if you intend to serve your app with a proper server later on, you should think about how your global variables are managed and about semaphores and all that jazz.
Before going down this route and writing all that difficult code, prone to errors and hair loss, consider other solutions such as NoSQL databases (Redis, MongoDB, Neo4j, Hadoop...) and background job managers like Celery and Gearman. If none of this fits your use case(s) and you can't avoid creating and managing files and global variables yourself, then consider a client/server pattern where the clients are the web server threads, unless you want to mess with NFS.

Memory model for apache/modwsgi application in python?

In a regular application (like on Windows), objects/variables created at the global level are available to the entire program for as long as the program is running.
In a web application written in PHP for instance, all variables/objects are destroyed at the end of the script so everything has to be written to the database.
a) So what about python running under apache/modwsgi? How does that work in regards to the memory?
b) How do you create objects that persist between web page requests and how do you ensure there isn't threading issues in apache/modwsgi?
Go read the following from the official mod_wsgi documentation:
http://code.google.com/p/modwsgi/wiki/ProcessesAndThreading
It explains the various modes things can be run in and gives some general guidelines about data scope and sharing.
All Python globals are created when the module is imported. When the module is imported again, the existing globals are reused.
Python web servers do not do threading, but use pre-forked processes. Thus there are no threading issues with Apache.
The lifecycle of Python processes under Apache varies: Apache has settings for how many child processes are spawned, kept in reserve, and killed. This means that you can use globals in a Python process for caching (an in-process cache), but the process may terminate after any request, so you cannot keep persistent data in globals. The process does not necessarily terminate, however, and in this regard Python is much more efficient than PHP (the source code is not parsed on every request - but you need to run the server in reload mode to pick up source code changes during development).
Since globals are per-process and there can be N processes, the processes share "web server global" state using mechanisms like memcached.
Usually Python globals contain only:
Settings set during process initialization
Cached data (session/user neutral)
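For illustration, such an in-process cache is usually nothing more than a module-level dict (all names below are made up, and load_price_from_db stands in for a real database lookup):

def load_price_from_db(product_id):
    # stand-in for a real database query
    return {"id": product_id, "price": 9.99}

_price_cache = {}

def get_price(product_id):
    # the module-level dict survives between requests handled by this process,
    # but each mod_wsgi process has its own independent copy of it
    if product_id not in _price_cache:
        _price_cache[product_id] = load_price_from_db(product_id)
    return _price_cache[product_id]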

Temporarily changing Python logging handlers

I'm working on an app that uses the standard logging module to do logging. We have a setup where we log to a bunch of files based on levels etc. We also use Celery to run some jobs out of the main app (usually maintenance stuff that's time consuming).
The Celery task does nothing other than call functions (let's say spam) which do the actual work. These functions use the logging module to output status messages. Now, I want to write a decorator that hijacks all the logging calls made by spam and puts them into a StringIO so that I can put them somewhere.
One of the solutions I had was to insert a handler for the root logger while the function is executing that grabs everything. However, this is messing with global shared objects which might be problematic later.
I came across this answer but it's not exactly what I'm looking for.
The thing about the StringIO is, there could be multiple processes running (Celery tasks), hence multiple StringIOs, right?
You can do something like this:
In the processes run under Celery, add to the root logger a handler which sends events to a socket (SocketHandler for TCP or DatagramHandler for UDP).
Create a socket receiver to receive and handle the events, as documented here. This acts like a consolidated StringIO across multiple Celery-run processes.
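A sketch of the first step (host and port are placeholders; DEFAULT_TCP_LOGGING_PORT is the port the cookbook receiver listens on):

import logging
import logging.handlers

# in each Celery-run process, ship root-logger events to the central receiver
socket_handler = logging.handlers.SocketHandler(
    "localhost", logging.handlers.DEFAULT_TCP_LOGGING_PORT)
root = logging.getLogger()
root.setLevel(logging.INFO)
root.addHandler(socket_handler)
root.info("this event goes to the consolidated receiver")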
If you are using multiprocessing, you can also use the approach described here. Though that post talks about Python 3.2, the functionality is also available for Python 2.x using logutils.
Update: If you want to avoid a separate receiver process, you can log to a database directly, using a handler similar to that in this answer. If you want to buffer all the logging till the end of the process, you can use a MemoryHandler in conjunction with a database handler to achieve this.
For the StringIO handler, you could add an extra handler for the root logger that would grab everything, but at the same time add a dummy filter (Logger.addFilter) that filters everything out (so nothing is actually logged to StringIO).
You could then write a decorator for spam that removes the filter (Logger.removeFilter) before the function executes, and adds the dummy filter back after.
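A sketch of that approach, with the blocking filter attached to the capture handler (a slight variation on the Logger.addFilter suggestion, so only the capture handler is muted; all names are illustrative):

import functools
import io
import logging

class _BlockAll(logging.Filter):
    def filter(self, record):
        return False          # reject everything while this filter is attached

capture_buffer = io.StringIO()
capture_handler = logging.StreamHandler(capture_buffer)
block_filter = _BlockAll()
capture_handler.addFilter(block_filter)
logging.getLogger().addHandler(capture_handler)

def capture_logging(func):
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        capture_handler.removeFilter(block_filter)   # start capturing
        try:
            return func(*args, **kwargs)
        finally:
            capture_handler.addFilter(block_filter)  # stop capturing again
    return wrapper

@capture_logging
def spam():
    logging.getLogger(__name__).warning("this ends up in capture_buffer")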

Python multi-threaded application with memory leak from the thread-specific logger instances

I have a server subclass spawning threaded response handlers; the handlers in turn start application threads. Everything is going smoothly, except that when I use objgraph I see the correct number of application threads running (I am load testing and have it throttled to keep 35 application instances running).
Invoking objgraph.typestats() provides a breakdown of how many instances of each object are currently live in the interpreter (according to the GC). Looking at that output for memory leaks, I find 700 logger instances - which would be the total number of response handlers spawned by the server.
I have called logger.removeHandler(memoryhandler) and logger.removeHandler(filehandler) when the application thread exits the run() method to ensure that there are no lingering references to the logger instance; the logger instance is also completely isolated within the application thread (there are no external references to it). As a final stab at eliminating these logger instances, the last statement in run() is del self.logger.
To get the logger in __init__() I give it a suitably large random number as its name so it will be distinct for file access - I use the same large number as part of the log file name to avoid application log collisions.
The long and the short of it is that I have 700 logger instances tracked by the GC but only 35 active threads - how do I go about killing off these loggers? A more cumbersome engineering solution is to create a pool of loggers and just acquire one for the life of the application thread, but that is more code to maintain when the GC should simply handle this automatically.
Don't create potentially unbounded numbers of loggers, that's not good practice - there are other ways of getting context-sensitive information into your logs, as documented here.
You also don't need to have a logger as an instance attribute: loggers are singletons so you can just get a particular one by name from anywhere. The recommended practice is to name loggers at module level using
logger = logging.getLogger(__name__)
which suffices for most scenarios.
From your question I can't tell whether you appreciate that handlers and loggers aren't the same thing - for example you talk about removeHandler calls (which might serve to free the handler instances because their reference counts go to zero, but you won't free any logger instances by doing so).
Generally, loggers are named after parts of your application which generate events of interest.
If you want each thread to e.g. write to a different file, you can create a new filename each time, and then close the handler when you're done and the thread is about to terminate (that closing is important to free handler resources). Or, you can log everything to one file with thread ids or other discriminators included in the log output, and use post-processing on the log file.
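A sketch of the second option (one shared file, "combined.log" is a placeholder), with the thread name added by the formatter so the log can be split per thread afterwards:

import logging

handler = logging.FileHandler("combined.log")
handler.setFormatter(logging.Formatter(
    "%(asctime)s %(threadName)s %(name)s %(levelname)s %(message)s"))
logging.getLogger().addHandler(handler)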
I hit the same memory leak when using logging.Logger(); you can try manually closing the handler file descriptors once the logger is no longer needed, like:
for handler in logger.handlers:
    handler.close()
