How to do logging.DEBUG in django code? - python

In Django, we have settings.py, which defines DEBUG for the whole project.
Now, my logging level is configured independently in settings.py.
How should I use logging.DEBUG?
Way 1:
if settings.DEBUG:
    logging.debug("Debug message")
Way 2:
# Without checking settings.DEBUG
logging.debug("Debug message")
What is good practice?
I think we should use Way 2, since the logging level already decides whether the message will be logged or not.
But some say that Way 1 is standard practice.

I think it's not a good thing to rely too much on a global setting such as DEBUG, which changes the whole behavior of your app.
What if you want to audit code and log things in production? You're not going to set DEBUG to True to do that, are you? You'd rather tone down your log filter.
From a more stylistic point of view, it makes little sense and is not very Pythonic to have two settings (DEBUG and log level) affect a single behavior.
Long story short: my opinion is that Way 2 is superior, technically and stylistically speaking.
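For instance, rather than branching on settings.DEBUG, you can keep verbosity in Django's logging configuration itself. A minimal sketch of the LOGGING setting (the logger and handler names here are only illustrative):
LOGGING = {
    "version": 1,
    "disable_existing_loggers": False,
    "handlers": {
        "console": {"class": "logging.StreamHandler"},
    },
    "loggers": {
        # hypothetical app logger; tune "level" per environment instead of checking settings.DEBUG
        "myapp": {
            "handlers": ["console"],
            "level": "DEBUG",
        },
    },
}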

The second method is fine, and in fact I use it all the time.
The only reason I am putting an answer here is that we did in fact do something like Way 1 in a work project a few years back. It turned out that although we were not logging anything at debug level to a file in production, the cost of creating the debug messages was itself quite expensive and hurt performance.
That is:
With Way 1, in production the debug message is not created at all; there is just a boolean check.
With Way 2, in production the debug messages are created and propagated but just not logged to a file (well, if that is in fact how you have set up your logging).
The project was a pretty big calculation farm where every ounce of performance mattered. That hasn't been the case for me since, and it might not be the case for you, but hey... I just thought I would mention it.
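If message construction is the expensive part, a middle ground worth knowing about (a minimal sketch, not something from that project) is to pass arguments lazily and guard only genuinely costly calls with isEnabledFor:
import logging

logger = logging.getLogger(__name__)

# Lazy %-style arguments: the message string is only built if DEBUG is enabled.
logger.debug("processed %s records in %s ms", record_count, elapsed_ms)  # hypothetical variables

# Guard only truly expensive work behind an explicit level check.
if logger.isEnabledFor(logging.DEBUG):
    logger.debug("state dump: %s", build_state_dump())  # build_state_dump is a hypothetical helper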

Related

Doesn't log_level parameter in python logging module affect performance?

I am using an API to get some service from my project. The API call is taking too long, and I thought one of the reasons could be the lots and lots of logging calls I have put across the project, with the I/O reads/writes taking time.
I am using logging. My guess was that since a LOG_LEVEL discards logs of lower priority, the API call should complete in less time at higher priorities. But the time is almost the same (the difference being in the range of a tenth of a second).
The only reference regarding LOG_LEVEL and performance I got from here is
The beauty of this is that if you set the log level to WARN, info and debug messages have next to no performance impact.
Some points I should note here
I have not configured my logs to stream to any log service, like Kibana.
I have checked for this kind of situation; I am not doing any preprocessing in the log messages.
I have done basic logger initialization, i.e.,
import logging
logger = logging.getLogger(__name__)
and used a file to write logs into as follows; LOG_LEVEL is given as an environment variable.
logging.basicConfig(filename="file_name.log")
Considering everything else is optimal (and even if it is not, higher-priority logging should still take less time), am I wrong in my guess that the extra time is due to log reads/writes? If not, then why is using a higher-priority LOG_LEVEL not decreasing the time?
In which default location does the logging module store the logs?
What's the performance difference between log levels?
Setting the log level can affect performance, but it may not be very noticeable until you are operating at scale.
When you set the level, you create a way to stop the logging process from continuing, and very little happens before this check for any individual log call. For example, here is what CRITICAL logs look like in the code:
if self.isEnabledFor(CRITICAL):
    self._log(CRITICAL, msg, args, **kwargs)
The logger itself has much more to do as part of _log than just this check, so there are time gains from setting a log level. But it is fairly optimized, so once you have initialized a logger at all, unless the difference in the number of calls is quite large you probably won't notice it much.
If you removed the logging calls entirely instead of just setting the level, you would get a bigger performance gain, because then not even that check happens (and it does take some amount of time).
Where are logs stored by default?
By default, without setting a file, a StreamHandler is enabled, and without a specific stream given it writes to sys.stderr. When you set a file, a FileHandler is created, which inherits from StreamHandler.
How do I optimize?
For the question you didn't ask, which is "How do I speed up logging?", I would suggest looking at this, which gives some advice. Part of that advice is what I pointed out above: explicitly check your log level, and you can even cache that result and check the cache instead, which should reduce the time even further.
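A rough sketch of that caching idea (the module-level flag is my own naming and assumes the level does not change at runtime):
import logging

logger = logging.getLogger(__name__)
DEBUG_ENABLED = logger.isEnabledFor(logging.DEBUG)  # computed once at import time

def handle(item):
    if DEBUG_ENABLED:  # cheap name lookup instead of walking the logger hierarchy on each call
        logger.debug("handling %r", item)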
Check out this answer for even more on optimizing logging.
And finally, if you want to determine the speed issues with your code, whether they come from logging or not, you need to use a profiler. There are built-in profiling functions in Python; check the profile and cProfile documentation.
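A minimal profiling sketch with the built-in cProfile module (the function name is just a placeholder for your slow API call):
import cProfile

def call_my_api():
    ...  # placeholder for the code path you want to measure

# sort by cumulative time to see where the call actually spends its time
cProfile.run("call_my_api()", sort="cumtime")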
One log level isn't inherently more performant than another; however, whether a level is enabled for logging, whether loggers are nested (in your example, this would happen if __name__ had dots in it, like mypackage.core.logs), and which version of Python you are running can all affect performance. This is because three things happen when you make a logging call:
The logger determines if the logging level is enabled.
This happens for every call. In versions of Python before 3.7 this check was not cached, and nested loggers took longer to determine whether they were enabled. How much longer? In some benchmarks it was twice the time. That said, this depends heavily on logger nesting, and even when logging millions of messages it may only save a few seconds of system time.
The logger processes the record.
This is where the optimizations outlined in the documentation come into play. They allow the record creation to skip some steps.
The logger sends the record to the handler.
This may be the default StreamHandler, a FileHandler, a SysLogHandler, or any number of built-in or custom handlers. In your example, you are using a FileHandler to write to file_name.log in the current directory. That may be fine for smaller applications, but larger applications would benefit from using an external logger like syslog or the systemd journal, mainly because those operate in a separate process and are optimized for processing a large number of logs.
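For example, handing records to the local syslog daemon instead of writing the file yourself; a minimal sketch assuming a Unix socket at /dev/log:
import logging
from logging.handlers import SysLogHandler

logger = logging.getLogger(__name__)
handler = SysLogHandler(address="/dev/log")  # the syslog daemon does the disk I/O in its own process
handler.setFormatter(logging.Formatter("%(name)s: %(levelname)s %(message)s"))
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.info("application started")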

What is the difference between DEBUG and INFO in Python logging?

When using the logging library, when should I log using DEBUG, and when should I use INFO instead? All I know is that they are used to show what a program is doing during normal operation.
You can set up logging to show only messages at or above a certain level.
DEBUG and INFO are two such levels: INFO is the more neutral one, used for non-essential information, while DEBUG is the one you might use for output that helps you debug something.
It is up to you what you use each level for, and which levels you want to see in your logs. If you disable a level, it simply will not appear in the logs.
The logging module has five standard levels, and you can set the level you need via the setLevel() method. See https://docs.python.org/3/library/logging.html
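A minimal sketch of level filtering:
import logging

logging.basicConfig(level=logging.INFO)  # DEBUG messages are discarded; INFO and above pass through
logging.debug("not shown")
logging.info("shown")
logging.warning("shown")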
There are no predetermined roles other than DEBUG being a higher verbosity level than INFO.
Their names imply that INFO is supposed to report on a program's progress while DEBUG is to report info for diagnosing problems.
The key thing to watch for when choosing which level to use for a specific message is to make each level give a full picture of what is going on, with the corresponding level of detail. See How to debug a Python program running as a service? for details.
E.g. in one of my programs that utilized a user-provided script to do tasks, I used:
INFO -- progress on tasks
VERBOSE (a custom level with ID 15; see the sketch after this list) -- info for diagnosing problems in the user script
DEBUG -- info for diagnosing problems in the program itself
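For reference, a custom level like that VERBOSE can be registered with addLevelName; a minimal sketch mirroring the list above:
import logging

VERBOSE = 15  # sits between DEBUG (10) and INFO (20)
logging.addLevelName(VERBOSE, "VERBOSE")

logging.basicConfig(level=VERBOSE)
logger = logging.getLogger(__name__)
logger.log(VERBOSE, "diagnosing the user script")  # emitted
logger.debug("internal program detail")            # suppressed at this level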
If you view your log messages as part of your application's user interface, INFO messages are for consumption by the administrators or users, whereas debug messages are for consumption by its programmers. Messages should be designed and emitted with this in mind.

Thread locals in Python - negatives, with regards to scalability?

I'm wondering if there are some serious implications I might be creating for myself by using thread locals. I noticed that in the case of Flask, they use thread locals, and mention that it can cause issues with servers that aren't built with threads in mind. Is this an outdated concern? I'm using thread locals with Django for a few things, deploying with NGINX in front of UWSGI, or Gunicorn, on Ubuntu 10.04 with Postgres (not that the OS or DB probably matter, but just for clarity). Do I need to be worried?
Threadlocals aren't the most robust or secure way to do things - check out this note, for instance. [ Though also see Glenn's comment, below ]
I suppose that if you have coded cleanly, with the idea that you're putting stuff into a big global pot of info, accepting unguaranteed data consistency in those thread locals and taking care to avoid race conditions, etc., you might well be OK.
But even with that in mind, there's still the 'magic'ness of thread-local vars, so clearly documenting what is going on, and every place a thread-local var is used, might help you and future developers of the codebase down the line.
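For reference, this is roughly what the pattern looks like with the standard library's threading.local; the per-request "current user" use case is only an illustration:
import threading

_state = threading.local()

def set_current_user(user):
    # each thread sees only its own attribute, e.g. set from a Django middleware
    _state.user = user

def get_current_user():
    return getattr(_state, "user", None)  # None if this thread never set it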

sandbox to execute possibly unfriendly python code [duplicate]

Let's say there is a server on the internet that one can send a piece of code to for evaluation. At some point the server takes all the code that has been submitted and starts running and evaluating it. However, at some point it will definitely bump into "os.system('rm -rf *')" sent by some evil programmer. Apart from "rm -rf", you could expect people to try using the server to send spam or DoS someone, or to fool around with "while True: pass" kinds of things.
Is there a way to cope with such unfriendly/untrusted code? In particular, I'm interested in a solution for Python, but if you have info for any other language, please share.
If you are not tied to the CPython implementation, you should consider looking at PyPy for these purposes — this Python dialect allows transparent code sandboxing.
Otherwise, you can provide fake __builtin__ and __builtins__ in the corresponding globals/locals arguments to exec or eval.
Moreover, you can provide a dictionary-like object instead of a real dictionary and trace what the untrusted code does with its namespace.
Moreover, you can actually trace that code (issuing sys.settrace() inside the restricted environment before any other code is executed) so you can break execution if something goes wrong.
If none of these solutions is acceptable, use OS-level sandboxing like chroot or unionfs, and the standard multiprocessing module to spawn the code worker in a separate, secured process.
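A bare sketch of the fake-__builtins__ idea mentioned above; note that on its own this is trivially escapable and is not a real security boundary:
# expose only a hand-picked set of builtins to the untrusted code
SAFE_BUILTINS = {"len": len, "range": range, "print": print}

untrusted_code = "print(len(range(10)))"

# __import__, open, exec, etc. are simply absent from this namespace
exec(untrusted_code, {"__builtins__": SAFE_BUILTINS}, {})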
You can check pysandbox which does just that, though the VM route is probably safer if you can afford it.
It's impossible to provide an absolute solution for this because the definition of 'bad' is pretty hard to nail down.
Is opening and writing to a file bad or good? What if that file is /dev/ram?
You can profile signatures of behavior, or you can try to block anything that might be bad, but you'll never win. JavaScript is a pretty good example of this: people run arbitrary JavaScript code all the time on their computers -- it's supposed to be sandboxed, yet all sorts of security problems and edge conditions crop up.
I'm not saying don't try, you'll learn a lot from the process.
Many companies have spent millions (Intel just spent billions on McAfee) trying to understand how to detect 'bad code' -- and every day machines running McAfee anti-virus get infected with viruses. Python code isn't any less dangerous than C. You can run system calls, bind to C libraries, etc.
I would seriously consider virtualizing the environment to run this stuff, so that exploits in whatever mechanism you implement can be firewalled one more time by the configuration of the virtual machine.
The number of users and the kind of code you expect to test/run will have a considerable influence on your choices, by the way. If submissions aren't expected to touch files or databases, or run computationally intensive tasks, and you have very low load, you could be almost fine just by preventing file access entirely and imposing a time limit on the process, after which it gets killed and the submission flagged as too expensive or malicious (a sketch of the time-limit approach follows below).
If the code you're supposed to test might be any arbitrary Django extension or page, then you're probably in for a lot of work.
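A rough sketch of the kill-after-a-time-limit approach with the standard multiprocessing module (the five-second budget and the helper names are arbitrary, and the exec call would still need the namespace/OS-level restrictions discussed above):
import multiprocessing

def run_submission(code):
    # hypothetical runner for a single submission
    exec(code, {"__builtins__": {}}, {})

if __name__ == "__main__":
    worker = multiprocessing.Process(target=run_submission, args=("while True: pass",))
    worker.start()
    worker.join(timeout=5)   # give the submission five seconds
    if worker.is_alive():
        worker.terminate()   # kill runaway code and flag the submission
        worker.join()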
You can try a generic sandbox such as Sydbox or Gentoo's sandbox. They are not Python-specific.
Both can be configured to restrict read/write to some directories. Sydbox can even sandbox sockets.
I think a fix like this is going to be really hard, and it reminds me of a lecture I attended about the benefits of programming in a virtual environment.
If you're doing it virtually, it's fine if they bugger it up. It won't solve a while True: pass, but rm -rf / won't matter.
Unless I'm mistaken (and I very well might be), this is much of the reason behind the way Google changed Python for the App Engine. You run Python code on their server, but they've removed the ability to write to files. All data is saved in the "nosql" database.
It's not a direct answer to your question, but an example of how this problem has been dealt with in some circumstances.

Reducing Django Memory Usage. Low hanging fruit?

My memory usage increases over time and restarting Django is not kind to users.
I am unsure how to go about profiling the memory usage but some tips on how to start measuring would be useful.
I have a feeling that there are some simple steps that could produce big gains. Ensuring 'debug' is set to 'False' is an obvious biggie.
Can anyone suggest others? How much improvement would caching give on low-traffic sites?
In this case I'm running under Apache 2.x with mod_python. I've heard mod_wsgi is a bit leaner but it would be tricky to switch at this stage unless I know the gains would be significant.
Edit: Thanks for the tips so far. Any suggestions how to discover what's using up the memory? Are there any guides to Python memory profiling?
Also as mentioned there's a few things that will make it tricky to switch to mod_wsgi so I'd like to have some idea of the gains I could expect before ploughing forwards in that direction.
Edit: Carl posted a slightly more detailed reply here that is worth reading: Django Deployment: Cutting Apache's Overhead
Edit: Graham Dumpleton's article is the best I've found on the MPM and mod_wsgi related stuff. I am rather disappointed that no-one could provide any info on debugging the memory usage in the app itself though.
Final Edit: Well I have been discussing this with Webfaction to see if they could assist with recompiling Apache and this is their word on the matter:
"I really don't think that you will get much of a benefit by switching to an MPM Worker + mod_wsgi setup. I estimate that you might be able to save around 20MB, but probably not much more than that."
So! This brings me back to my original question (which I am still none the wiser about). How does one go about identifying where the problem lies? It's a well-known maxim that you don't optimize without testing to see where you need to optimize, but there is very little in the way of tutorials on measuring Python memory usage and none at all specific to Django.
Thanks for everyone's assistance but I think this question is still open!
Another final edit ;-)
I asked this on the django-users list and got some very helpful replies
Honestly the last update ever!
This was just released. Could be the best solution yet: Profiling Django object size and memory usage with Pympler
Make sure you are not keeping global references to data. That prevents the Python garbage collector from releasing the memory.
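As an illustration of the kind of global reference that pins memory for the life of the process (the model and cache names are made up):
_items_cache = []  # module-level: survives across requests and is never released

def get_items():
    if not _items_cache:
        # pulls the whole table into memory once and keeps it forever
        _items_cache.extend(Item.objects.all())  # Item is a hypothetical Django model
    return _items_cache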
Don't use mod_python. It loads an interpreter inside apache. If you need to use apache, use mod_wsgi instead. It is not tricky to switch. It is very easy. mod_wsgi is way easier to configure for django than brain-dead mod_python.
If you can remove Apache from your requirements, that would be even better for your memory usage. Spawning seems to be the new fast, scalable way to run Python web applications.
EDIT: I don't see how switching to mod_wsgi could be "tricky". It should be a very easy task. Please elaborate on the problem you are having with the switch.
If you are running under mod_wsgi (and presumably Spawning, since it is WSGI-compliant), you can use Dozer to look at your memory usage.
Under mod_wsgi just add this at the bottom of your WSGI script:
from dozer import Dozer
application = Dozer(application)
Then point your browser at http://domain/_dozer/index to see a list of all your memory allocations.
I'll also just add my voice of support for mod_wsgi. It makes a world of difference in terms of performance and memory usage over mod_python. Graham Dumpleton's support for mod_wsgi is outstanding, both in terms of active development and in helping people on the mailing list to optimize their installations. David Cramer at curse.com has posted some charts (which I can't seem to find now unfortunately) showing the drastic reduction in cpu and memory usage after they switched to mod_wsgi on that high traffic site. Several of the django devs have switched. Seriously, it's a no-brainer :)
These are the Python memory profiler solutions I'm aware of (not Django related):
Heapy
pysizer (discontinued)
Python Memory Validator (commercial)
Pympler
Disclaimer: I have a stake in the latter.
The individual project's documentation should give you an idea of how to use these tools to analyze memory behavior of Python applications.
The following is a nice "war story" that also gives some helpful pointers:
Reducing the footprint of python applications
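As a taste of what these tools look like, a minimal Pympler snapshot of the objects currently alive in the process (a sketch; see the Pympler docs for the full API):
from pympler import muppy, summary

all_objects = muppy.get_objects()               # every object the GC currently tracks
summary.print_(summary.summarize(all_objects))  # prints a table of types, counts and total sizes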
Additionally, check that you are not using any known leakers. MySQLdb is known to leak enormous amounts of memory with Django due to a bug in its Unicode handling. Other than that, the Django Debug Toolbar might help you track down the hogs.
In addition to not keeping around global references to large data objects, try to avoid loading large datasets into memory at all wherever possible.
Switch to mod_wsgi in daemon mode, and use Apache's worker mpm instead of prefork. This latter step can allow you to serve many more concurrent users with much less memory overhead.
Webfaction actually has some tips for keeping django memory usage down.
The major points:
Make sure debug is set to false (you already know that).
Use "ServerLimit" in your apache config
Check that no big objects are being loaded in memory
Consider serving static content in a separate process or server.
Use "MaxRequestsPerChild" in your apache config
Find out and understand how much memory you're using
Another plus for mod_wsgi: set a maximum-requests parameter in your WSGIDaemonProcess directive and mod_wsgi will restart the daemon process every so often. There should be no visible effect for the user, other than a slow page load the first time a fresh process is hit, as it'll be loading Django and your application code into memory.
But even if you do have memory leaks, that should keep the process size from getting too large, without having to interrupt service to your users.
Here is the script I use for mod_wsgi (called wsgi.py, and put in the root of my Django project):
import os
import sys
from os import path

import django.core.handlers.wsgi

# mod_wsgi restricts writing to stdout by default, so silence prints; use logging instead
sys.stdout = open('/dev/null', 'a+')
sys.stderr = open('/dev/null', 'a+')

# make the project's parent directory importable so 'myproject.settings' can be found
sys.path.append(path.join(path.dirname(__file__), '..'))
os.environ['DJANGO_SETTINGS_MODULE'] = 'myproject.settings'

application = django.core.handlers.wsgi.WSGIHandler()
Adjust myproject.settings and the path as needed. I redirect all output to /dev/null since mod_wsgi by default prevents printing. Use logging instead.
For apache:
<VirtualHost *>
ServerName myhost.com
ErrorLog /var/log/apache2/error-myhost.log
CustomLog /var/log/apache2/access-myhost.log common
DocumentRoot "/var/www"
WSGIScriptAlias / /path/to/my/wsgi.py
</VirtualHost>
Hopefully this should at least help you set up mod_wsgi so you can see if it makes a difference.
Caches: make sure they're being flushed. It's easy for something to land in a cache but never be GC'd because of the cache reference.
SWIG'd code: make sure any memory management is being done correctly; it's really easy to miss this in Python, especially with third-party libraries.
Monitoring: if you can, get data about memory usage and hits. Usually you'll see a correlation between a certain type of request and memory usage.
We stumbled over a bug in Django with big sitemaps (10,000 items). It seems Django tries to load them all into memory when generating the sitemap: http://code.djangoproject.com/ticket/11572 - this effectively kills the Apache process when Google pays a visit to the site.
