I am using an API to get some service for my project. The API call is taking too long, so I thought one of the reasons could be the large number of logs I have put across the project, with the I/O reads/writes taking time.
I am using logging. My guess was that, since LOG_LEVEL discards logs of lower priority, the API call should complete in less time with higher priorities. But the time is almost the same (the difference being in the range of a tenth of a second).
The only reference regarding LOG_LEVEL and performance I got from here is
The beauty of this is that if you set the log level to WARN, info and debug messages have next to no performance impact.
Some points I should note here:
I have not configured my logs to stream to any log service, like Kibana.
I have checked for this kind of situation; I am not doing any preprocessing in the log messages.
I have done basic logger initialization, i.e.,
import logging
logger = logging.getLogger(__name__)
and have not used any file to write logs into, such as the following. LOG_LEVEL is given as one of the environment variables.
logging.basicConfig(filename="file_name.log")
Considering everything else is optimal (and even if everything is not optimal, higher-priority logs should still take less time), am I wrong in my guess that the extra time is because of log reads/writes? If not, then why is using high-priority LOG_LEVEL flags not decreasing the time?
In which default location does the logging module store the logs?
What's the difference in performance between log levels?
Setting the log level can affect performance, but it may not be very noticeable until at scale.
When you set the level, you're creating a way to stop the logging process from continuing, and very little happens before this check is made for any individual log call. For example, here is what CRITICAL logs look like in the code:
if self.isEnabledFor(CRITICAL):
    self._log(CRITICAL, msg, args, **kwargs)
The logger itself has much more to do as part of _log than just this check, so there would be time gains from setting a log level. But it is fairly optimized, so once you have initialized a logger at all, unless the difference in the number of calls is quite large, you probably won't be able to notice it much.
If you removed any reference to logging instead of just setting the level, you would get more performance gains because that check would not happen at all (and it obviously takes some amount of time).
Where are logs stored by default?
By default, without setting a file, a StreamHandler [source] is enabled, and without a specific stream being specified it streams to sys.stderr. When you set a file, a FileHandler is created, which inherits from StreamHandler [source].
How do I optimize?
For the question you didn't ask, which is "How do I speed up logging?", I would suggest looking at this, which gives some advice. Part of that advice is what I pointed out above, but it also tells you to explicitly check your log level, and you can even cache that result and check the cache instead, which should reduce the time even further.
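As a rough illustration of that advice, here is a minimal sketch (the function name and the module-level flag are just placeholders) combining an explicit, cached level check with lazy %-style formatting:

import logging

logger = logging.getLogger(__name__)

# Cache the check once (e.g. before a hot loop) instead of paying for
# isEnabledFor() on every call; note the cache goes stale if the level
# is changed later at runtime.
DEBUG_ENABLED = logger.isEnabledFor(logging.DEBUG)

def process_items(items):
    for item in items:
        if DEBUG_ENABLED:
            # %-style arguments are only formatted if the record is emitted.
            logger.debug("processing item %r", item)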
Check out this answer for even more on optimizing logging.
And finally, if you want to determine the speed issues with your code, whether it is from logging or not, you need to use a profiler. There are built in profiling functions in Python, check here.
One log level isn't more performant than another; however, whether a level is enabled for logging, whether loggers are nested (in your example, this would happen if __name__ had dots in it, like mypackage.core.logs), and the version of Python you are running can all affect performance. This is because three things happen when you make a logging call:
The logger determines if the logging level is enabled.
This will happen for every call. In versions of Python before 3.7, this call was not cached and nested loggers took longer to determine if they were enabled or not. How much longer? In some benchmarks it was twice as much time. That said, this is heavily dependent on log nesting and even when logging millions of messages, this may only save a few seconds of system time.
The logger processes the record.
This is where the optimizations outlined in the documentation come into play. They allow the record creation to skip some steps.
The logger sends the record to the handler.
This may be the default StreamHandler, a FileHandler, a SysLogHandler, or any number of built-in or custom handlers. In your example, you are using a FileHandler to write to file_name.log in the current directory. This may be fine for smaller applications, but larger applications would benefit from using an external logger like syslog or the systemd journal. The main reason for this is that they operate in a separate process and are optimized for processing a large number of logs.
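For illustration, here is a minimal sketch of handing records off to syslog instead of a local file; the /dev/log socket path, the logger name, and the format string are assumptions and vary by platform and setup:

import logging
import logging.handlers

logger = logging.getLogger("mypackage.api")  # example name
handler = logging.handlers.SysLogHandler(address="/dev/log")  # Linux socket path
handler.setFormatter(logging.Formatter("%(name)s: %(levelname)s %(message)s"))
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# syslog, running in a separate process, takes care of buffering and writing.
logger.info("request handled in %.3f s", 0.042)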
Related
I'm using a task queue with Python (RQ). Since workers run concurrently, without any configuration, messages from all workers get mixed up.
I want to organize logging such that at any time I can get the exact full log for a given task run by a given worker. Workers run on different machines, so preferably logs would be sent over the network to some central collector, but to get started, I'd also be happy with local logging to file, as long as the messages of each task end up in a separate log file.
My question has two parts:
how to implement this in Python code. I suppose that, for the "log to file" case, I could do something like this at the beginning of each task function:
logging.basicConfig(filename="some_unique_id_for_this_task_and_worker.log", level=logging.DEBUG, format="whatever")
logging.debug("My message...")
# etc.
but when it comes to logging over the network, I'm struggling to understand how the logger should be configured so that all log messages from the same task are recognizable at the collector. This is purposely vague because I haven't chosen a given technology or protocol to do this collection yet, and I'm looking for suggestions.
Assuming that the previous requirement can be accomplished, when logging over the network, what's a centralized solution that can give me the full log for a given task? I mean really showing me the full text log, not using a search interface returning events or lines (as, e.g., IIRC, in Splunk or Elasticsearch).
Thanks
Since you're running multiple processes (the RQ workers) you could probably use one of the recipes in the logging cookbook. If you want to use a SocketHandler and a socket server to receive and send messages to a file, you should also look at this recipe in the cookbook. It has a part related to running a socket listener in production.
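As a worker-side sketch of the SocketHandler approach (the host name "loghost" and the task-naming scheme are assumptions; a socket listener such as the one from the cookbook recipe would have to be running on the other end):

import logging
import logging.handlers

def configure_task_logging(task_id):
    # One logger per task; the name makes the task recognizable at the collector.
    logger = logging.getLogger(f"worker.task.{task_id}")
    handler = logging.handlers.SocketHandler(
        "loghost", logging.handlers.DEFAULT_TCP_LOGGING_PORT)
    logger.addHandler(handler)
    logger.setLevel(logging.DEBUG)
    return logger

logger = configure_task_logging("some_unique_id_for_this_task_and_worker")
logger.debug("My message...")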
When using the logging library, when should I log using DEBUG, and when should I use INFO instead? All I know is that they are used to show what a program is doing during normal operation.
You can set it up to only show logs of a certain level.
DEBUG and INFO are two such levels: INFO is the more neutral one, used for non-essential information, and DEBUG is the one you might use for displaying information that could help you debug something.
It is up to you what you use each level for, and what levels you might want to see in your logs. If you disable a level, it will simply not be shown in logs.
Logging has five levels, and you can set the level you need via the setLevel() function. See here: https://docs.python.org/3/library/logging.html
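A minimal sketch of that filtering (the logger name is arbitrary):

import logging

logging.basicConfig(level=logging.INFO)  # discard anything below INFO

logger = logging.getLogger("example")
logger.debug("not shown: DEBUG is below the configured level")
logger.info("shown")
logger.warning("shown")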
There are no predetermined roles other than DEBUG being a higher verbosity level than INFO.
Their names imply that INFO is supposed to report on a program's progress while DEBUG is to report info for diagnosing problems.
The key thing to watch for when choosing which level to use for a specific message is to make each level give a full picture of what is going on, with the corresponding level of detail. See How to debug a Python program running as a service? for details.
E.g. in one of my programs that utilized a user-provided script to do tasks, I used:
INFO -- progress on tasks
VERBOSE (custom level with ID 15; a registration sketch follows this list) -- info for diagnosing problems in the user script
DEBUG -- info for diagnosing problems in the program itself
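A minimal sketch of registering such a custom VERBOSE level between DEBUG (10) and INFO (20); the module-level helper is an assumption, not part of the standard library:

import logging

VERBOSE = 15
logging.addLevelName(VERBOSE, "VERBOSE")

logger = logging.getLogger(__name__)

def verbose(log, msg, *args, **kwargs):
    # Hypothetical helper so call sites read like log.info()/log.debug().
    if log.isEnabledFor(VERBOSE):
        log.log(VERBOSE, msg, *args, **kwargs)

logging.basicConfig(level=VERBOSE)
verbose(logger, "running user script step %d", 3)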
If you view your log messages as part of your application's user interface, INFO messages are for consumption by the administrators or users, whereas debug messages are for consumption by its programmers. Messages should be designed and emitted with this in mind.
The Python logging module has a common pattern (ex1, ex2) where you get a new logger object in each Python module.
I'm not a fan of blindly following patterns and so I would like to understand a little bit more.
Why get a new logger object in each new module?
Why not have everyone just use the same root logger and configure the formatter with %(module)s?
Are there examples where this pattern is NECESSARY/NEEDED (i.e., because of some sort of performance reason [1])?
[1]
In a multi-threaded Python program, is there some sort of hidden synchronization issue that is fixed by using multiple logger objects?
Each logger can be configured separately. Generally, a module logger is not configured at all in the module itself. You create a distinct logger and use it to log messages of varying levels of detail. Whoever uses the logger decides what level of messages to see, where to send those messages, and even how to display them. They may want everything (DEBUG and up) from one module logged to a file, while another module they may only care if a serious error occurs (in which case they want it e-mailed directly to them). If every module used the same (root) logger, you wouldn't have that kind of flexibility.
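To make that concrete, here is a minimal sketch, assuming two modules named mypackage.db and mypackage.mail (hypothetical names), configured differently by the consuming application:

import logging

# Everything DEBUG and up from mypackage.db goes to a file...
db_logger = logging.getLogger("mypackage.db")
db_logger.setLevel(logging.DEBUG)
db_logger.addHandler(logging.FileHandler("db.log"))

# ...while mypackage.mail only reports serious errors, to stderr here
# (an SMTPHandler could be used instead to e-mail them).
mail_logger = logging.getLogger("mypackage.mail")
mail_logger.setLevel(logging.ERROR)
mail_logger.addHandler(logging.StreamHandler())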
The logger name defines where (logically) in your application events occur. Hence, the recommended pattern
logger = logging.getLogger(__name__)
uses logger names which track the Python package hierarchy. This in turn allows whoever is configuring logging to turn verbosity up or down for specific loggers. If everything just used the root logger, one couldn't get fine grained control of verbosity, which is important when systems reach a certain size / complexity.
The logger names don't need to exactly track the package names - you could have multiple loggers in certain packages, for example. The main deciding factor is how much flexibility is needed (if you're writing an application) and perhaps also how much flexibility your users need (if you're writing a library).
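For example, a configuration along these lines (a sketch with assumed package names) turns verbosity up for one subtree only:

import logging.config

logging.config.dictConfig({
    "version": 1,
    "handlers": {"console": {"class": "logging.StreamHandler"}},
    "loggers": {
        # Only mypackage.network gets chatty diagnostics; the rest stays quiet.
        "mypackage": {"level": "WARNING", "handlers": ["console"]},
        "mypackage.network": {"level": "DEBUG"},
    },
})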
Currently I have a couple of Python modules that constantly log to files. My problem is that when something happens (consider it a disaster for all the modules: loss of connection to some services, or something else), all of them start to log the same or very similar (within a module) messages, flooding my log gatherer (which leads to other issues).
Is there any Python logger which includes something like reducing the frequency of emitting the same or similar logs (it would be nice if the frequency got smaller and smaller, reducing the number of such logs)?
Or is it possible to write such a logger by myself? I've been searching a lot but unfortunately did not find any suitable implementation/solution/idea.
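For example, the kind of thing I have in mind is a logging.Filter that suppresses repeats; a rough sketch of what I mean (the fixed 60-second window is arbitrary, and a decaying/backoff window would need a little more state):

import logging
import time

class DuplicateFilter(logging.Filter):
    """Suppress records whose exact message was already emitted within the window."""

    def __init__(self, window=60.0):
        super().__init__()
        self.window = window
        self.last_seen = {}

    def filter(self, record):
        now = time.monotonic()
        key = (record.name, record.levelno, record.getMessage())
        last = self.last_seen.get(key)
        self.last_seen[key] = now
        # Returning False drops the record.
        return last is None or (now - last) >= self.window

handler = logging.StreamHandler()
handler.addFilter(DuplicateFilter(window=60.0))
logging.basicConfig(level=logging.INFO, handlers=[handler])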
I have a server subclass spawning threaded response handlers; the handlers in turn start application threads. Everything is going smoothly, except that when I use ObjGraph I see the correct number of application threads running (I am load testing and have it throttled to keep 35 application instances running).
Invoking objgraph.typestats() provides a breakdown of how many instances of each object are currently live in the interpreter (according to the GC). Looking at that output for memory leaks, I find 700 logger instances - which would be the total number of response handlers spawned by the server.
I have called logger.removeHandler(memoryhandler) and logger.removeHandler(filehandler) when the application thread exits the run() method to ensure that there are no lingering references to the logger instances, and the logger instances are completely isolated within the application thread (there are no external references to them). As a final stab at eliminating these logger instances, the last statement in run() is del self.logger.
To get the logger in __init__() I give it a suitably large random number as its name, so it will be distinct for file access - I use the same large number as part of the log file name to avoid application log collisions.
The long and the short of it is that I have 700 logger instances tracked by the GC but only 35 active threads - how do I go about killing off these loggers? A more cumbersome engineering solution is to create a pool of loggers and just acquire one for the life of the application thread, but that is creating more code to maintain when the GC should simply handle this automatically.
Don't create potentially unbounded numbers of loggers, that's not good practice - there are other ways of getting context-sensitive information into your logs, as documented here.
You also don't need to have a logger as an instance attribute: loggers are singletons so you can just get a particular one by name from anywhere. The recommended practice is to name loggers at module level using
logger = logging.getLogger(__name__)
which suffices for most scenarios.
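If the reason for per-thread loggers was to tag each thread's or task's output, one option (a sketch; the task_id field name is an assumption) is to keep a single module-level logger and inject the context with a LoggerAdapter:

import logging

logger = logging.getLogger(__name__)

# The format assumes every record carries task_id, i.e. is logged via the adapter.
logging.basicConfig(
    level=logging.DEBUG,
    format="%(asctime)s %(task_id)s %(levelname)s %(message)s")

def run_task(task_id):
    # One shared logger; the adapter merges the per-task context into each record.
    log = logging.LoggerAdapter(logger, {"task_id": task_id})
    log.info("starting")
    log.debug("details...")

run_task("task-42")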
From your question I can't tell whether you appreciate that handlers and loggers aren't the same thing - for example you talk about removeHandler calls (which might serve to free the handler instances because their reference counts go to zero, but you won't free any logger instances by doing so).
Generally, loggers are named after parts of your application which generate events of interest.
If you want each thread to e.g. write to a different file, you can create a new filename each time, and then close the handler when you're done and the thread is about to terminate (that closing is important to free handler resources). Or, you can log everything to one file with thread ids or other discriminators included in the log output, and use post-processing on the log file.
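A minimal sketch of the single-file variant, with the thread included in each line so the log can be split per thread in post-processing (the file name and format are arbitrary):

import logging

logging.basicConfig(
    filename="app.log",
    level=logging.DEBUG,
    format="%(asctime)s [%(threadName)s:%(thread)d] %(levelname)s %(message)s")

logger = logging.getLogger(__name__)
logger.info("message tagged with the emitting thread")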
I ran into the same memory leak when using logging.Logger(), and you may try to manually close the handler file descriptors when the logger is no longer needed, like:
for handler in logger.handlers:
    handler.close()