Testing for mysterious load errors in python/django

Testing for mysterious load errors in python/django - python

This is related to this Configure Apache to recover from mod_python errors, although I've since stopped assuming that this has anything to do with mod_python. Essentially, I have a problem that I wasn't able to reproduce consistently and I wanted some feedback on whether the proposed solution seems likely and some potential ways to try and reproduce this problem.
The setup: a django-powered site would begin throwing errors after a few days of use. They were always ImportErrors or ImproperlyConfigured errors, which amount to the same thing, since the message always specified trouble loading some module referenced in the settings.py file. It was not generally the same class. I am using preforked apache with 8 forked children, and whenever this problem would come up, one process would be broken and seven would be fine. Once broken, every request (with Debug On in the apache conf) would display the same trace every time it served a request, even if the failed load is not relevant to the particular request. An httpd restart always made the problem go away in the short run.
Noted problems: installation and updates are performed via svn with some post-update scripts. A few .pyc files accidentally were checked into the repository. Additionally, the project itself was owned by one user (not apache, although apache had permissions on the project) and there was a persistent plugin that ended up getting backgrounded as root. I call these noted problems because they would be wrong whether or not I noticed this error, and hence I have fixed them. The project is owned by apache and the plugin is backgrounded as apache. All .pyc files are out of the repository, and they are all force-recompiled after each checkout while the server and plugin have been stopped.
What I want to know is
Do these configuration disasters seem like a likely explanation for sporadic ImportErrors?
If there is still a problem somewhere else in my code, how would I best reproduce it?
As for 2, my approach thus far has been to write some stress tests that repeatedly request the same page so as to execute common code paths.
Incidentally, this has been running without incident for about 2 days since the fix, but the problem was observed with 1 to 10 day intervals between.

"Do these configuration disasters seem like a likely explanation for sporadic ImportErrors"
Yes. An old .pyc file is a disaster of the first magnitude.
We develop on Windows, but run production on Red Hat Linux. An accidentally moved .pyc file is an absolute mystery to debug because (1) it usually runs and (2) it has a Windows filename for the original source, making the traceback error absolutely senseless. I spent hours staring at logs -- on linux -- wondering why the file was "C:\This\N\That".
"If there is still a problem somewhere else in my code, how would I best reproduce it?"
Before reproducing errors, you should try to prevent them.
First, create unit tests to exercise everything.
Start with Django's tests.py testing. Then expand to unittest for all non-Django components. Then write yourself a "run_tests" script that runs every test you own. Run this periodically. Daily isn't often enough.
Second, be sure you're using logging. Heavily.
Third, wrap anything that uses external resources in generic exception-logging blocks like this.
try:
some_external_resource_processing()
except Exception, e:
logger.exception( e )
raise
This will help you pinpoint problems with external resources. Files and databases are often the source of bad behavior due to permission or access problems.
At this point, you have prevented a large number of errors. If you want to run cyclic load testing, that's not a bad idea either. Use unittest for this.
class SomeLoadtest( unittest.TestCase ):
def test_something( self ):
self.connection = urllib2.urlopen( "localhost:8000/some/path" )
results = self.connection.read()
This isn't the best way to do things, but it shows one approach. You might want to start using Selenium to test the web site "from the outside" as a complement to your unittests.

Related

File descriptor doesn't update for logging in Django

We use Python(2.7)/Django(1.8.1) and Gunicorn(19.4.5) for our web application and supervisor(3.0) to monitor it. I have recently encountered 2 issues in logging:
Django was logging into previous day logs(We have log rotation enabled)
Django was not logging anything at all.
The first scenario is understandable where the log rotation changed the file but Django was not updated.
The second scenario fixed when I restarted the supervisor process. Which led me to believe again the file descriptor was not updated in the django process.
I came by this SO thread which states:
Each child is an independent process, and file handles in the parent
may be closed in the child after a fork (assuming POSIX). In any case,
logging to the same file from multiple processes is not supported.
So I have few questions:
My gunicorn has 4 child processes and if one of them fails while
writing to a log file will the other child process won't be able to
use it? and how to debug these kind of scenarios?
Personally I found debugging errors in python logging module to be
difficult. Can some one point how to debug errors such as this or is
there any way I can monkey patch logging to not fail silently?*(Kindly read update section)*
I have seen Django LogRotation causes the Issue type 1 as explained above and not some script scheduled via cron. So what is preferable?
Note: The logging config is not a problem. I have already spent fair amount of time trying to figure that out. Also if the config is the issue Django will not write log files after a process restart.
Update:
For my second question I see that logging modules provides an option to raiseExceptions on failure although this is discourages in production environment. Documentation here. So now my question becomes how do I set this in Django?

I felt like closing this question. Bit awkward and seems stupid after 2 months. But I guess being stupid is part of the learning and want this to be as a reference for people who stumble across this.
Scenario 1: Django on using TimedRotatingFileHandler seems not to update the file descriptor some times and hence writes to old log files unless we restart the supervisor. We are yet to find the reason for this behaviour and update the reason if found. For now we are using WatchedFileHandler and then using logrotate utility to rotate the logs.
Scenario 2: This is the stupid question. When I was logging with some string formatting I forgot to give enough variables which is why the logger was erring. But this didn't get propagated. But locally when I was testing I found that logging module was actually throwing that error but silently and any logs after it in the module were not getting printed. Lessons learns from this scenario were:
If there is a problem in logging find out if the string formatting does not err
Using log.debug('example: {msg}'.format(msg=msg)) of python instead of log.debug('example: %s', msg).

Parsing .py files performances with mod_wsgi / django / rest_framework

I'm using apache with mod_wsgi on a debian jessie, python3.4, django and django REST framework to power a REST web service.
I'm currently running performance tests. My server is a KS-2 (http://www.kimsufi.com/fr/serveurs.xml) with 4Gb of RAM and an Atom N2800 processor (1.8GHz, 2c/4t). My server already runs plenty of little services, but my load average does not exceed 0.5 and I usually have 2Gb of free RAM. I'm giving those context informations because maybe the performances I describe below is normal in context of this hardware support.
I'm quite new to python powered web services and don't really know what to except in term of performances. I used firefox's network monitor to test the duration of a request.
I've set up a test environnement with django rest framework's first example (http://www.django-rest-framework.org/). When I go to url http://myapi/users/?format=json I have to wait ~1600 ms. If I check the response multiple times in a short period of time it goes down go 60ms. However, as soon as I wait more than ~5 secs, the average time is 1600ms.
My application has about 6k lines of python and includes some django librairies in INSTALLED_APPS (django-cors-headers, django-filter, django-guardian, django-rest-swagger). When I perform the same kind of tests (on a comparable view returning a list of my users) on it I get 6500/90ms.
My data do not require a lot of ressources to retrieve (django-debug-toolbar shows me that my SQL queries take <10ms to perform). So I'm not sure what is going on under the hood but I guess all .py files need to be periodically parsed or .pyc to be read. If it's the case, is it possible to get rid of this behaviour ? I mean, in a production environnement where I know I won't edit often my files. Or if it's not the case, to lower the weight of the first call.
Note : I've read django's documentation about cache (https://docs.djangoproject.com/en/1.9/topics/cache/), but in my application my data (which do not seem to require a lot of ressources) is susceptible to change often. I guess caching does not help for the source code of an application, am I wrong ?
Thanks

I guess all .py files need to be periodically parsed or .pyc to be read
.py files are only parsed (and compiled to bytecode .pyc files) when there's no matching .pyc file or the .py file is newer than the .pyc. Also, the .pyc files are only loaded once per process.
Given your symptom, chances are your problem is mostly with your server's settings. First make sure you're in daemon mode (https://code.google.com/p/modwsgi/wiki/QuickConfigurationGuide#Delegation_To_Daemon_Process), then tweak your settings accroding to your server hardware and application's needs ( https://code.google.com/p/modwsgi/wiki/ConfigurationDirectives#WSGIDaemonProcess)

This looks like your Apache does remove the Python processes from the memory after a while. mod_wsgi loads the Python interpreter and files into Apache which is slow. However you should be able to tune it so it keeps them into memory.

Starting an IPython cluster from the notebook with a delay

Our SGE cluster setup requires there to be a delay between controller and engines starting. If this delay is not there, some of the servers use "old" ipcontroller-client.json files and attempt to connect to previous (and not running) controllers. This is an NFS "feature", so to remedy, I set c.IPClusterStart.delay = 30 in the ipcluster_config.py file and things work well. The controller gets submitted to SGE, has enough time to start and write its json files, and then the engines can start correctly to the newly running controller. However, I'd like to also be able to start the cluster from the notebook. Unfortunately, it appears that this timeout is not used, the controller and engines start up at the same time (as seen with watch qstat), some of the engines connect (because the pick up the new settings from the json file) and some do not (because of NFS).
I ran an strace on the notebook and saw that it's using sge_controller and sge_engines scripts (created by the notebook when you press start) to start these processes.
I'm wondering if there's any way to implement a delay here, as well. It's starting the controller and engines the right way (SGE) so I know it's reading the ipcluster_config.py.
I've Googled around and searched this site, with no luck. Hoping maybe someone can shed some light on the deeper workings of this behavior.
Thanks,
Chris

Well, this is probably too late for the OP, but hopefully it helps someone.
If it is a timeout issue, just set c.EngineFactory.timeout and c.IPEngineApp.wait_for_url_file to some larger times.
If it is due to failure after the first run, it is probably due to lingering security files, which should be deleted ( ipcontroller-engine.json and ipcontroller-client.json ) from the relevant iPython profile using IPython.utils.path.get_security_file to get the full paths. To automate this and make it somewhat less painful, this deletion step can be tacked on to the beginning of the same profile's ipcluster_config.py.
These changes alone were enough for me to get the cluster running with the notebook easily.
If neither of these solve the problem, there are some other thoughts ( http://mail.scipy.org/pipermail/ipython-user/2011-November/008741.html ).

RW-locking a Windows file in Python, so that at most one test instance runs per night

I have written a custom test harness in Python (existing stuff was not a good fit due to lots of custom logic). Windows task scheduler kicks it off once per hour every day. As my tests now take more than 2 hours to run and are growing, I am running into problems. Right now I just check the system time and do nothing unless hour % 3 == 0, but I do not like that. I have a text file that contains:
# This is a comment
LatestTestedBuild = 25100
# Blank lines are skipped too
LatestTestRunStartedDate = 2011_03_26_00:01:21
# This indicates that it has not finished yet.
LatestTestRunFinishDate =
Sometimes, when I kick off a test manually, it can happen at any time, including 12:59:59.99
I want to remove race conditions as much as possible. I would rather put some extra effort once and not worry about practical probability of something happening. So, I think locking a this text file atomically is the best approach.
I am using Python 2.7, Windows Server 2008R2 Pro and Windows 7 Pro. I prefer not to install extra libraries (Python has not been "sold" to my co-workers yet, but I could copy over a file locally that implements it all, granted that the license permits it).
So, please suggest a good, bullet-proof way to solve this.

When you start running a test make a file called __LOCK__ or something. Delete it when you finish, using a try...finally block to ensure that it always gets cleared up. Don't run the test if the file exists. If the computer crashes or similar, delete the file by hand. I doubt you need more cleverness than that.
Are you sure you need 2 hours of tests?! I think 2 minutes is a more reasonable amount of time to spend, though I guess if you are running some complicated numerics you might need more.
example code:
import os
if os.path.exists("__LOCK__"):
raise RuntimeError("Already running.") # or whatever
try:
open("__LOCK__", "w").write("Put some info here if you want.")
finally:
if os.path.exists("__LOCK__"):
os.unlink("__LOCK__")

Pylons REPL reevaluate code in running web server

I'm programming in python on a pre-existing pylons project (the okfn's ckan), but I'm a lisper by trade and used to that way of doing things.
Please correct me if I make false statements:
In pylons it seems that I should say
$ paster serve --reload
to get a web server that will notice changes.
At that point I can change a function, save the file and then go to my browser to test the change.
If I want to examine variables in a function in the process of making a webpage, then I put raise "hello", and then when I load the page, I get a browser based debugger, in which I can examine the program.
This is all very nice and works swimmingly, and I get the impression that that's how people tend to write pylons code.
Unfortunately the reload takes several seconds, and it keeps breaking my train of thought.
What I'd like to do is to run the web server from emacs, (although a python REPL on the command line would be almost as good), so that I can change a function in the editor and then send the new code to the running process without having to restart it. (with a command line repl I guess I'd have to copy and paste the new thing, but that would also be workable, just slightly less convenient)
Python seems very dynamic, and much like lisp in many ways, so I can't see in principle any reason why that wouldn't work.
So I guess the question is:
Is anyone familiar with the lisp way of doing things, and with Pylons, and can they tell me how to program the lisp way in pylons? Or is it impossible or a bad idea for some reason?
Edit:
I can run the webserver from my python interpreter inside emacs with:
from paste.script.serve import ServeCommand
ServeCommand("serve").run(["development.ini"])
And I can get the code to stop and show me what it's doing by inserting:
import pdb
pdb.set_trace()
so now all I need is a way to get the webserver to run on a different thread, so that control returns to the REPL and I can redefine functions and variables in the running process.
def start_server():
from paste.script.serve import ServeCommand
ServeCommand("serve").run(["development.ini"])
server_thread=threading.Thread(target=start_server)
server_thread.start()
This seems to work, except that if I redefine a function at the REPL the change doesn't get reflected in the webserver. Does anyone know why?

It seems that this way of working is impossible in python for the reason given by TokenMacGuy's comment, i.e. because redefining a class doesn't change the code in an instance of that class.
That seems a terrible shame, since in many other respects python seems very flexible, but it does explain why there's no python-swank!

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.