Logging from Multiple Modules to the Same Text File - python

I've inherited a heap of Python code that runs a bunch of different processes, but doesn't log anything. I want to set up a good logging process for some of the more important tasks. (I'll set it up for everything eventually.)
The way the code base is set up, there are a bunch of modules that are reused by multiple scripts. What I'd like to do is set the logging up so that messages are logged to stdout, as well as to a text file associated with the script that called it.
From what I've gathered this should be possible, e.g. logging.basicConfig() appears to do almost what I want.
How do I configure my logging so that all the modules log to the same text file, and to stdout at the same time?
Edit: The difference between this, and What is the most pythonic way of logging for multiple modules and multiple handlers with specified encoding? is that I also want to be able to call the modules from different scripts. Possibly at the same time.

Related

Python Multiprocessing returning results with Logging and running frozen on Windows

I need some help with implementing logging while multiprocessing and running the application frozen under Windows. There are dozens of topics on this subject and I have spent a lot of time reviewing and testing those. I have also extensively reviewed the documentation, but I cannot figure out how to implement this in my code.
I have created a minimum example which runs fine on Linux, but crashes on Windows (even when not frozen). The example I created is just one of many iterations I have put my code through.
You can find the minimum example on github. Any assistance to get this example working would be greatly appreciated.
Thank you.
Marc.
The basics
On Linux, a child process is created by the fork method by default. That means the child process inherits almost everything from the parent process.
On Windows, the child process is created by the spawn method.
That means a child process starts almost from scratch: it re-imports and re-executes any code that is outside of the guard clause if __name__ == '__main__'.
Why it worked or failed
On Linux, because the logger object is inherited, your program will start logging.
But it is far from perfect, since you log directly to the file:
sooner or later, log lines will overlap, or an IO error will occur on the file, due to the race condition between processes.
On Windows, since you didn't pass the logger object to the child process, and it re-imports your pymp_global module, logger is a None object. So when you try logging with a None object, it crashes for sure.
The solution
Logging with multiprocessing is not an easy task.
For it to work on Windows, you must either pass a logger object to the child processes or log with a QueueHandler. Another similar solution for inter-process communication is to use a SocketHandler.
The idea is that only one thread or process does the actual logging; the other processes just send it their log records. This prevents the race condition and ensures each record is written out intact.
So how to implement it?
I have encountered this logging problem before and already written the code.
You can just use it with logger-tt package.
#pymp.py
from logging import getLogger
from logger_tt import setup_logging
setup_logging(use_multiprocessing=True)
logger = getLogger(__name__)
# other code below
For other modules
#pymp_common.py
from logging import getLogger
logger = getLogger(__name__)
# other code below
This saves you from writing all the logging config code everywhere manually.
You may consider changing the log_config file to suit your needs.

Doesn't log_level parameter in python logging module affect performance?

I am using an API to get some service from my project. The API call is taking too long, so I thought one of the reasons could be the lots and lots of logs that I have put across the project, with the IO reads/writes taking time.
I am using logging. My guess was that, since a LOG_LEVEL discards logs of lower priority, the API call should complete in less time at higher priorities. But the time is almost the same (the difference being in the range of 1/10th of a second).
The only reference regarding LOG_LEVEL and performance I got from here is
The beauty of this is that if you set the log level to WARN, info and debug messages have next to no performance impact.
Some points I should note here
I have not configured my logs to stream to any log service, like Kibana.
I have checked for this kind of situation; I am not doing any preprocessing in the log messages.
I have done basic logger initialization, i.e.,
import logging
logger = logging.getLogger(__name__)
and have not used any file to write logs into, as in the following. LOG_LEVEL is given as one of the environment variables.
logging.basicConfig(filename="file_name.log")
Considering everything else is optimal (and even if everything else is not optimal, higher-priority logs should still take less time), am I wrong in my guess that the extra time is because of log reads/writes? If not, then why does using high-priority LOG_LEVEL flags not decrease the time?
In which default location does the logging module store the logs?
What's the difference between log level performances?
Setting the log level can affect performance, but it may not be very noticeable until you are at scale.
When you set the level, you're creating a way to stop the logging process from continuing, and very little happens before this is checked with any individual log. For example, here is what CRITICAL logs look like in the code:
if self.isEnabledFor(CRITICAL):
    self._log(CRITICAL, msg, args, **kwargs)
The logger itself has much more to do as part of _log than just this check, so there would be time gains from setting a log level. But it is fairly optimized, so once you have initiated a logger at all, unless the difference in the number of calls is quite large, you probably won't notice it much.
If you removed any reference to the logging instead of just setting the level, you would get more performance gains, because then that check would not happen at all (and it obviously takes some amount of time).
Where are logs stored by default?
By default, without setting a file, a StreamHandler [source] is enabled, and without specifying a specific stream, it will stream to sys.stderr. When you set a file, it creates a FileHandler, which inherits from StreamHandler [source].
How do I optimize?
For the question you didn't ask, which is How do I speed up logging?, I would suggest looking at this, which gives some advice. Part of that advice is what I pointed out above, but it also tells you to explicitly check your log level; you can even cache that result and check the cache instead, which should reduce the time even further.
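A minimal sketch of that check-and-cache idea (the logger name myapp and the process function are hypothetical):

```python
import logging

logger = logging.getLogger("myapp")  # hypothetical logger name
logger.addHandler(logging.NullHandler())
logger.setLevel(logging.WARNING)

# Check the level once and keep the boolean around.
debug_enabled = logger.isEnabledFor(logging.DEBUG)

def process(items):
    for item in items:
        if debug_enabled:  # a cheap boolean test instead of a logging call
            logger.debug("processing %r", item)
        # ... real work here ...

process(range(1000))
```

Note that the cached boolean goes stale if the level is changed at runtime, so this is only safe when the logging configuration is fixed for the lifetime of the loop.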
Check out this answer for even more on optimizing logging.
And finally, if you want to determine the speed issues with your code, whether it is from logging or not, you need to use a profiler. There are built in profiling functions in Python, check here.
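For instance, a quick session with the built-in cProfile module (the workload here is invented purely to have something to measure) shows how much time the logging calls themselves consume:

```python
import cProfile
import io
import logging
import pstats

# Send log output to an in-memory buffer so the benchmark measures logging
# overhead rather than terminal IO.
logging.basicConfig(level=logging.INFO, stream=io.StringIO())

def noisy():
    log = logging.getLogger("bench")
    for i in range(10000):
        log.info("iteration %d", i)

profiler = cProfile.Profile()
profiler.enable()
noisy()
profiler.disable()

stats = pstats.Stats(profiler)
stats.sort_stats("cumulative").print_stats(5)  # top 5 entries by cumulative time
```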
One log level isn't inherently more performant than another. However, performance can be affected by whether a level is enabled for logging, by whether loggers are nested (in your example, this would happen if __name__ had dots in it, like mypackage.core.logs), and by the version of Python you are running. This is because three things happen when you make a logging call:
The logger determines if the logging level is enabled.
This happens for every call. In versions of Python before 3.7, this check was not cached, and nested loggers took longer to determine whether they were enabled. How much longer? In some benchmarks it was twice as much time. That said, this is heavily dependent on logger nesting, and even when logging millions of messages, this may only save a few seconds of system time.
The logger processes the record.
This is where the optimizations outlined in the documentation come into play. They allow the record creation to skip some steps.
The logger sends the record to the handler.
This may be the default StreamHandler, a FileHandler, a SysLogHandler, or any number of built-in or custom handlers. In your example, you are using a FileHandler to write to file_name.log in the current directory. This may be fine for smaller applications, but larger applications would benefit from using an external logger like syslog or the systemd journal. The main reason is that these operate in a separate process and are optimized for processing a large number of logs.

Why get a new logger object in each new module?

The python logging module has a common pattern (ex1, ex2) where you get a new logger object in each python module.
I'm not a fan of blindly following patterns and so I would like to understand a little bit more.
Why get a new logger object in each new module?
Why not have everyone just use the same root logger and configure the formatter with %(module)s?
Are there examples where this pattern is NECESSARY/NEEDED (i.e. because of some sort of performance reason[1])?
[1]
In a multi-threaded python program, is there some sort of hidden synchronization issue that is fixed by using multiple logging objects?
Each logger can be configured separately. Generally, a module logger is not configured at all in the module itself. You create a distinct logger and use it to log messages of varying levels of detail. Whoever uses the logger decides what level of messages to see, where to send those messages, and even how to display them. They may want everything (DEBUG and up) from one module logged to a file, while another module they may only care if a serious error occurs (in which case they want it e-mailed directly to them). If every module used the same (root) logger, you wouldn't have that kind of flexibility.
The logger name defines where (logically) in your application events occur. Hence, the recommended pattern
logger = logging.getLogger(__name__)
uses logger names which track the Python package hierarchy. This in turn allows whoever is configuring logging to turn verbosity up or down for specific loggers. If everything just used the root logger, one couldn't get fine grained control of verbosity, which is important when systems reach a certain size / complexity.
The logger names don't need to exactly track the package names - you could have multiple loggers in certain packages, for example. The main deciding factor is how much flexibility is needed (if you're writing an application) and perhaps also how much flexibility your users need (if you're writing a library).
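To make that concrete, here is a small sketch (the package names mypackage.core and mypackage.vendor are hypothetical) of the kind of per-logger tuning that only works when modules create their own named loggers:

```python
import logging

# The application configures logging once, at the top level.
logging.basicConfig(level=logging.WARNING)  # the default for everything

# Turn verbosity up for one part of the hierarchy and down for another,
# without touching the modules themselves.
logging.getLogger("mypackage.core").setLevel(logging.DEBUG)
logging.getLogger("mypackage.vendor").setLevel(logging.ERROR)

# Module code elsewhere just does the usual (normally getLogger(__name__)):
logger = logging.getLogger("mypackage.core.logs")
# This child logger inherits DEBUG from its parent "mypackage.core".
```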

Bad idea to have two class opening the same file?

I have two classes in my Python program and one of them is a thread. Is it a bad idea to have both classes open the same log file and write to it?
Is there any good approach to write to the same log file for two classes which are running at the same time?
This is a classical concurrency issue. You need to ensure that you exactly control what is happening. Regarding log files, the easiest solution might be to have a queue collecting log messages from various places (from different threads or even processes) and then have one entity that pops messages from that queue and writes them to the log file. This way, at least single messages stay self-contained.
The operating system does not prevent messages from being mixed up if you write to the file from different unsynchronized entities. Hence, if you do not explicitly control what should happen in which order, you might end up with corrupted messages in that file, even if things seem to work most of the time.
Use the python logging module. It handles the gory details for you.
As long as you control which class is reading from and writing to the file, ensure that only one of them can write to it at a time, and reread the file every time you switch, you should be fine.
Look into using a lock to ensure that the two classes are not accessing the file at the same time.

Temporary changing python logging handlers

I'm working on an app that uses the standard logging module to do logging. We have a setup where we log to a bunch of files based on levels etc. We also use celery to run some jobs out of the main app (maintenance stuff usually that's time consuming).
The celery task does nothing other than call functions (let's say spam) which do the actual work. These functions use the logging module to output status messages. Now, I want to write a decorator that hijacks all the logging calls made by spam and puts them into a StringIO so that I can put them somewhere.
One of the solutions I had was to insert a handler for the root logger while the function is executing that grabs everything. However, this is messing with global shared objects which might be problematic later.
I came across this answer but it's not exactly what I'm looking for.
The thing about the StringIO is, there could be multiple processes running (Celery tasks), hence multiple StringIOs, right?
You can do something like this:
In the processes run under Celery, add to the root logger a handler which sends events to a socket (SocketHandler for TCP or DatagramHandler for UDP).
Create a socket receiver to receive and handle the events, as documented here. This acts like a consolidated StringIO across multiple Celery-run processes.
If you are using multiprocessing, you can also use the approach described here. Though that post talks about Python 3.2, the functionality is also available for Python 2.x using logutils.
Update: If you want to avoid a separate receiver process, you can log to a database directly, using a handler similar to that in this answer. If you want to buffer all the logging till the end of the process, you can use a MemoryHandler in conjunction with a database handler to achieve this.
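As a sketch of that buffering idea with MemoryHandler alone (writing to a plain file here rather than a database, and with invented names): records are held in memory and only pushed to the target handler when the buffer is flushed, fills up, or receives a record at flushLevel.

```python
import logging
from logging.handlers import MemoryHandler

target = logging.FileHandler("buffered.log")
# capacity is in records; flushLevel=ERROR flushes early on serious problems.
memory = MemoryHandler(capacity=1000, flushLevel=logging.ERROR, target=target)

logger = logging.getLogger("task")
logger.setLevel(logging.INFO)
logger.addHandler(memory)

logger.info("step 1")
logger.info("step 2")
# Nothing has reached buffered.log yet; both records sit in the buffer.
memory.flush()  # e.g. at the end of the Celery task
```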
For the StringIO handler, you could add an extra handler for the root logger that would grab everything, but at the same time add a dummy filter (Logger.addFilter) that filters everything out (so nothing is actually logged to StringIO).
You could then write a decorator for spam that removes the filter (Logger.removeFilter) before the function executes, and adds the dummy filter back after.
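A rough sketch of that toggle-the-filter approach (the function spam and the logger names are placeholders):

```python
import functools
import io
import logging

capture_buffer = io.StringIO()
capture_handler = logging.StreamHandler(capture_buffer)
block_all = lambda record: False  # the "dummy filter" that drops every record
capture_handler.addFilter(block_all)

root = logging.getLogger()
root.setLevel(logging.INFO)
root.addHandler(capture_handler)

def capture_logs(func):
    """Lift the filter while the wrapped function runs, then restore it."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        capture_handler.removeFilter(block_all)   # start capturing
        try:
            return func(*args, **kwargs)
        finally:
            capture_handler.addFilter(block_all)  # stop capturing
    return wrapper

@capture_logs
def spam():
    logging.getLogger("tasks.spam").info("doing work")

logging.getLogger("tasks.other").info("not captured")  # filter still in place
spam()  # this call's log output lands in capture_buffer
```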
