Parsing .py files performances with mod_wsgi / django / rest_framework - python

I'm using Apache with mod_wsgi on Debian Jessie, with Python 3.4, Django and Django REST framework, to power a REST web service.
I'm currently running performance tests. My server is a KS-2 (http://www.kimsufi.com/fr/serveurs.xml) with 4GB of RAM and an Atom N2800 processor (1.8GHz, 2c/4t). It already runs plenty of small services, but my load average does not exceed 0.5 and I usually have 2GB of free RAM. I'm giving this context because maybe the performance I describe below is normal for this hardware.
I'm quite new to Python-powered web services and don't really know what to expect in terms of performance. I used Firefox's network monitor to measure the duration of a request.
I've set up a test environment with Django REST framework's first example (http://www.django-rest-framework.org/). When I go to the URL http://myapi/users/?format=json I have to wait ~1600 ms. If I repeat the request several times in a short period, the time drops to 60 ms. However, as soon as I wait more than ~5 seconds, the average time is back to 1600 ms.
My application has about 6k lines of Python and includes some Django libraries in INSTALLED_APPS (django-cors-headers, django-filter, django-guardian, django-rest-swagger). When I perform the same kind of test on it (on a comparable view returning a list of my users) I get 6500/90 ms.
My data does not require many resources to retrieve (django-debug-toolbar shows my SQL queries take <10 ms to perform). So I'm not sure what is going on under the hood, but I guess all the .py files need to be periodically parsed, or the .pyc files re-read. If that's the case, is it possible to get rid of this behaviour? I mean, in a production environment where I know I won't edit my files often. Or, if that's not the case, to lower the cost of the first call?
Note: I've read Django's documentation about caching (https://docs.djangoproject.com/en/1.9/topics/cache/), but in my application my data (which does not seem to require many resources) is likely to change often. I guess caching does not help with the source code of an application, am I wrong?
Thanks

I guess all .py files need to be periodically parsed or .pyc to be read
.py files are only parsed (and compiled to bytecode .pyc files) when there is no matching .pyc file, or when the .py file is newer than the .pyc. Also, .pyc files are only loaded once per process.
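You can see this on disk with a quick sketch (the module name below is made up for the demo). `py_compile.compile()` does explicitly what the import system does lazily, and on Python 3 the result lands under `__pycache__` and is reused as long as the source's metadata still matches:

```python
import os, py_compile, tempfile

# Write a tiny throwaway module; "demo_module" is a hypothetical name.
src = os.path.join(tempfile.mkdtemp(), "demo_module.py")
with open(src, "w") as fh:
    fh.write("x = 1\n")

# Compile it to bytecode, like import does on first load.
pyc = py_compile.compile(src)
print(os.path.exists(pyc))    # True: the cached bytecode file exists
print("__pycache__" in pyc)   # True on Python 3
```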
Given your symptoms, chances are the problem is mostly in your server's settings. First make sure you're running in daemon mode (https://code.google.com/p/modwsgi/wiki/QuickConfigurationGuide#Delegation_To_Daemon_Process), then tweak the settings according to your server's hardware and your application's needs (https://code.google.com/p/modwsgi/wiki/ConfigurationDirectives#WSGIDaemonProcess).

It looks like your Apache removes the Python processes from memory after a while. mod_wsgi loads the Python interpreter and your files into Apache, which is slow; however, you should be able to tune it so it keeps them in memory.
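For instance, a daemon-mode configuration (the paths, process and thread counts below are illustrative, not taken from the question) keeps a pool of persistent Python processes instead of letting Apache recycle interpreters along with its worker children:

```apache
WSGIDaemonProcess myapi processes=2 threads=15 display-name=%{GROUP}
WSGIProcessGroup myapi
WSGIScriptAlias / /srv/myapi/wsgi.py
```

Avoid setting maximum-requests in such a setup unless you have a leak to work around, since every process restart re-imports the whole application and reproduces exactly the slow first request you're measuring.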

Related

How to detect files in a directory if the files have finished copying/adding? [duplicate]

Files are being pushed to my server via FTP. I process them with PHP code in a Drupal module. O/S is Ubuntu and the FTP server is vsftp.
At regular intervals I will check for new files, process them with SimpleXML and move them to a "Done" folder. How do I avoid processing a partially uploaded file?
vsftp has lock_upload_files defaulted to yes. I thought of attempting to move the files first, expecting the move to fail on a currently uploading file. That doesn't seem to happen, at least on the command line. If I start uploading a large file and move, it just keeps growing in the new location. I guess the directory entry is not locked.
Should I try fopen with mode 'a' or 'r+' just to see if it succeeds before attempting to load into SimpleXML or is there a better way to do this? I guess I could just detect SimpleXML load failing but... that seems messy.
I don't have control of the sender. They won't do an upload and rename.
Thanks
Using the lock_upload_files configuration option of vsftpd leads to locking files with the fcntl() function. This places advisory lock(s) on uploaded file(s) which are in progress. Other programs don't need to consider advisory locks, and mv for example does not. Advisory locks are in general just an advice for programs that care about such locks.
You need another command line tool like lockrun which respects advisory locks.
Note: lockrun must be compiled with the WAIT_AND_LOCK(fd) macro using the lockf() function rather than flock(), in order to work with locks set by fcntl() under Linux. When lockrun is compiled to use lockf(), it will cooperate with the locks set by vsftpd.
With such features (lockrun, mv, lock_upload_files) you can build a shell script or similar that moves files one by one, checking if the file is locked beforehand and holding an advisory lock on it as long as the file is moved. If the file is locked by vsftpd then lockrun can skip the call to mv so that running uploads are skipped.
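If you'd rather do the probe from a script than shell out to lockrun, a sketch in Python (not PHP, to keep it self-contained here) using the standard fcntl module shows the idea. Caveat: POSIX record locks never conflict within a single process, so this only detects locks held by other processes, such as vsftpd:

```python
import fcntl, os, tempfile

def is_locked(path):
    """Probe for an fcntl()-style advisory lock without blocking.

    Returns True if some OTHER process holds a conflicting lock on the file.
    """
    fd = os.open(path, os.O_RDWR)
    try:
        # Try to take an exclusive lock without blocking; fails if held elsewhere.
        fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        fcntl.lockf(fd, fcntl.LOCK_UN)
        return False
    except OSError:
        return True
    finally:
        os.close(fd)

fd, path = tempfile.mkstemp()
os.close(fd)
print(is_locked(path))  # False: nothing holds a lock on a fresh file
```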
If locking doesn't work, I don't know of a solution as clean/simple as you'd like. You could make an educated guess by not processing files whose last modified time (which you can get with filemtime()) is within the past x minutes.
If you want a higher degree of confidence than that, you could check and store each file's size (using filesize()) in a simple database, and every x minutes check new size against its old size. If the size hasn't changed in x minutes, you can assume nothing more is being sent.
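The size-polling idea can be sketched like this (in Python for a self-contained example; in the question's PHP, filesize() plays the role of os.path.getsize, and the `seen` dict would be the simple database mentioned above):

```python
import os, tempfile

def stable_files(paths, seen):
    """Return the files whose size has not changed since the previous poll.

    `seen` maps path -> size recorded on the last poll and is updated in place.
    """
    done = []
    for p in paths:
        size = os.path.getsize(p)
        if seen.get(p) == size:
            done.append(p)
        seen[p] = size
    return done

# Demo with a throwaway file standing in for an upload.
fd, tmp = tempfile.mkstemp()
os.close(fd)
with open(tmp, "w") as fh:
    fh.write("data")

seen = {}
print(stable_files([tmp], seen))           # []: first poll, no previous size known
print(stable_files([tmp], seen) == [tmp])  # True: size unchanged since last poll
```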
The lsof linux command lists opened files on your system. I suggest executing it with shell_exec() from PHP and parsing the output to see what files are still being used by your FTP server.
Picking up on the previous answer, you could copy the file over and then compare the sizes of the copied file and the original file at a fixed interval.
If the sizes match, the upload is done, delete the copy, work with the file.
If the sizes do not match, copy the file again.
Repeat.
Here's another idea: create a super (but hopefully not root) FTP user that can access some or all of the upload directories. Instead of your PHP code reading uploaded files right off the disk, make it connect to the local FTP server and download files. This way vsftpd handles the locking for you (assuming you leave lock_upload_files enabled). You'll only be able to download a file once vsftp releases the exclusive/write lock (once writing is complete).
You mentioned trying flock in your comment (and how it fails). It does indeed seem painful to try to match whatever locking vsftpd is doing, but dio_fcntl might be worth a shot.
I guess you solved this years ago, but still:
If you use some pattern to find the files you need, you can ask the uploading party to use a different name and rename the file once the upload has completed.
You should also check out ProFTPD's HiddenStores directive; more info here:
http://www.proftpd.org/docs/directives/linked/config_ref_HiddenStores.html

How can I force Python code to read input files again without rebooting my computer

I am scanning through a large number of files looking for some markers. I am becoming really confident that, once I have run through the code one time, Python is not rereading the actual files from disk. I find this behavior strange because I was told that one reason I needed to structure my file access the way I have is so that the handle and file content are flushed. But that can't be the case.
There are 9,568 file paths in the list I am reading from. If I shut down Python and reboot my computer it takes roughly 6 minutes to read the files and determine if there is anything returned from the regular expression.
However, if I run the code a second time it takes about 36 seconds. Just for grins, the average document has 53,000 words.
Therefore I am concluding that Python still has access to the file it read in the first iteration.
I want to also observe that the first time I do this I can hear the disk spin (E:\ - Python is on C:). E is just a spinning disk with 126 MB cache - I don't think the cache is big enough to hold the contents of these files. When I do it later I do not hear the disk spin.
Here is the code
import re

test_7A_re = re.compile(r'\n\s*ITEM\s*7\(*a\)*[.]*\s*-*\s*QUANT.*\n', re.IGNORECASE)
no7a = []
for path in path_list:
    path = path.strip()
    with open(path, 'r') as fh:
        string = fh.read()
    items = [item for item in re.finditer(test_7A_re, string)]
    if len(items) == 0:
        no7a.append(path)
        continue
I care about this for a number of reasons, one is that I was thinking about using multi-processing. But if the bottleneck is reading in the files I don't see that I will gain much. I also think this is a problem because I would be worried about the file being modified and not having the most recent version of the file available.
I am tagging this 2.7 because I have no idea if this behavior is persistent across versions.
To confirm this behavior I modified my code to run as a .py file, and added some timing code. I then rebooted my computer - the first time it ran it took 5.6 minutes and the second time (without rebooting) the time was 36 seconds. Output is the same in both cases.
The really interesting thing is that even if I shut down IDLE (but do not reboot my computer) it still takes only 36 seconds to run the code.
All of this suggests to me that the files are not read from disk after the first time - this is amazing behavior to me but it seems dangerous.
To be clear, the results are the same - I believe given the timing tests I have run and the fact that I do not hear the disk spinning that somehow the files are still accessible to Python.
This is caused by caching in Windows. It is not related to Python.
In order to stop Windows from caching your reads:
- Disable the paging file in Windows and fill the RAM up to 90%.
- Use some tool to disable file caching in Windows, like this one.
- Run your code in a Linux VM on your Windows machine with limited RAM; in Linux you can control the caching much better.
- Make the files much bigger, so that they won't fit in the cache.
I fail to see why this is a problem. I'm not 100% certain of how Windows handles file cache invalidation, but unless the "Last modified time" changes, you and I and Windows would assume that the file still holds the same content. If the file holds the same content, I don't see why reading from cache can be a problem.
I'm pretty sure that if you change the last modified date, say, by opening the file for write access then closing it right away, Windows will hold sufficient doubts over the file content and invalidate the cache.
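You can observe that invalidation signal yourself: the "last modified time" is just st_mtime from a stat call, and rewriting a file bumps it. A minimal sketch (assumes a filesystem with sub-second timestamp resolution, which is typical on modern systems):

```python
import os, tempfile, time

fd, path = tempfile.mkstemp()
os.close(fd)
with open(path, "w") as fh:
    fh.write("original")

before = os.stat(path).st_mtime_ns  # nanosecond-resolution mtime

time.sleep(0.05)
with open(path, "w") as fh:         # rewriting the file bumps its mtime
    fh.write("changed")

after = os.stat(path).st_mtime_ns
print(after > before)  # True: a reader (or the OS cache) can use this to invalidate
```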

Why running django's session cleanup command kill's my machine resources?

I have a production site, live for one year, configured with the django.contrib.sessions.backends.cached_db session backend on a MySQL database. The reason I chose cached_db is a mix of security and read performance.
The problem is that the cleanup command, responsible for deleting all expired sessions, was never executed, resulting in a session table with 2.3GB of data, 6 million rows and 500MB of index.
When I try to run ./manage.py cleanup (in Django 1.3), or its Django 1.5 equivalent ./manage.py clearsessions, the process never ends (or at least outlasts my patience of 3 hours).
The code Django uses to do this is:
Session.objects.filter(expire_date__lt=timezone.now()).delete()
At first I thought that was normal because the table has 6M rows but, after inspecting the system monitor, I discovered that all the memory and CPU were being used by the python process, not mysqld, exhausting my machine's resources. I think something is terribly wrong with this command's code. It seems that Python iterates over all the expired session rows it finds, deleting them one by one. In that case, refactoring the code to issue a raw DELETE FROM statement would solve my problem and help the Django community, right? But if that is the case, the QuerySet delete command is acting weirdly and is unoptimized, in my opinion. Am I right?
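The chunked-delete idea behind this can be sketched outside Django. The sketch below uses sqlite3 purely to stay self-contained (the table and column names are made up; the real session table is managed by the ORM): delete expired rows in bounded batches so neither Python nor the database ever materializes all 6M rows at once.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE session (id INTEGER PRIMARY KEY, expire_date INTEGER)")
# Seed 10,000 rows; expire_date 0 means expired, 1 means still live.
conn.executemany("INSERT INTO session (expire_date) VALUES (?)",
                 [(i % 2,) for i in range(10_000)])

deleted = 0
while True:
    # Fetch at most 1000 expired primary keys per round trip.
    ids = [r[0] for r in conn.execute(
        "SELECT id FROM session WHERE expire_date < 1 LIMIT 1000")]
    if not ids:
        break
    conn.executemany("DELETE FROM session WHERE id = ?", [(i,) for i in ids])
    deleted += len(ids)

remaining = conn.execute("SELECT COUNT(*) FROM session").fetchone()[0]
print(deleted)    # 5000 expired rows removed, in batches of 1000
print(remaining)  # 5000 live rows untouched
```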

Complete log management (python)

Similar questions have been asked, but I have not come across an easy way to do it.
We have application logs of various kinds which fill up disk space, and we face other unwanted issues. How do I write a monitoring script (zipping files over a certain size, moving them, watching them, etc.) for this maintenance? I am looking for a simple solution (as in: what should I use?), if possible in Python, or maybe just a shell script.
Thanks.
The "standard" way of doing this (atleast on most Gnu/Linux distros) is to use logrotate. I see a /etc/logrotate.conf on my Debian machine which has details on which files to rotate and at what frequency. It's triggered by a daily cron entry. This is what I'd recommend.
If you want your application itself to do this (which is a pain really since it's not it's job), you could consider writing a custom log handler. A RotatingFileHandler (or TimedRotatingFileHandler) might work but you can write a custom one.
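A minimal RotatingFileHandler setup looks like this (the directory, file name and size limits are illustrative; real logs would use a much larger maxBytes):

```python
import logging.handlers, os, tempfile

logdir = tempfile.mkdtemp()
logfile = os.path.join(logdir, "app.log")

# Rotate once the file would exceed 500 bytes, keeping 3 old copies
# (app.log.1, app.log.2, app.log.3); the oldest is discarded.
handler = logging.handlers.RotatingFileHandler(
    logfile, maxBytes=500, backupCount=3)
logger = logging.getLogger("rotate_demo")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

for i in range(100):
    logger.info("log line number %d", i)

print(sorted(os.listdir(logdir)))  # app.log plus the rotated backups
```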
Most systems are by default set up to automatically rotate log files which are emitted by syslog. You might want to consider using the SysLogHandler and logging to syslog (from all your apps regardless of language) so that the system infrastructure automatically takes care of things for you.
Use logrotate to do the work for you.
Remember that there are a few cases where it may not work properly, for example if the logging application keeps the log file open at all times and cannot reopen it after the file is removed and recreated.
Over the years I have encountered a few applications like that, but even for them you can configure logrotate to restart them when it rotates the logs.
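As an example (the path is made up), a logrotate stanza using the copytruncate directive side-steps exactly that open-handle problem, because the file is copied and then truncated in place rather than renamed out from under the application:

```conf
/var/log/myapp/*.log {
    daily
    rotate 7
    compress
    missingok
    copytruncate
}
```

The trade-off is a small window between the copy and the truncate in which log lines can be lost, so restarting or signalling the application is still the cleaner option when it supports it.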

Testing for mysterious load errors in python/django

This is related to this Configure Apache to recover from mod_python errors, although I've since stopped assuming that this has anything to do with mod_python. Essentially, I have a problem that I wasn't able to reproduce consistently and I wanted some feedback on whether the proposed solution seems likely and some potential ways to try and reproduce this problem.
The setup: a django-powered site would begin throwing errors after a few days of use. They were always ImportErrors or ImproperlyConfigured errors, which amount to the same thing, since the message always specified trouble loading some module referenced in the settings.py file. It was not generally the same class. I am using preforked apache with 8 forked children, and whenever this problem would come up, one process would be broken and seven would be fine. Once broken, every request (with Debug On in the apache conf) would display the same trace every time it served a request, even if the failed load is not relevant to the particular request. An httpd restart always made the problem go away in the short run.
Noted problems: installation and updates are performed via svn with some post-update scripts. A few .pyc files accidentally were checked into the repository. Additionally, the project itself was owned by one user (not apache, although apache had permissions on the project) and there was a persistent plugin that ended up getting backgrounded as root. I call these noted problems because they would be wrong whether or not I noticed this error, and hence I have fixed them. The project is owned by apache and the plugin is backgrounded as apache. All .pyc files are out of the repository, and they are all force-recompiled after each checkout while the server and plugin have been stopped.
What I want to know is
Do these configuration disasters seem like a likely explanation for sporadic ImportErrors?
If there is still a problem somewhere else in my code, how would I best reproduce it?
As for 2, my approach thus far has been to write some stress tests that repeatedly request the same page so as to execute common code paths.
Incidentally, this has been running without incident for about 2 days since the fix, but the problem was observed with 1 to 10 day intervals between.
"Do these configuration disasters seem like a likely explanation for sporadic ImportErrors"
Yes. An old .pyc file is a disaster of the first magnitude.
We develop on Windows, but run production on Red Hat Linux. An accidentally moved .pyc file is an absolute mystery to debug because (1) it usually runs and (2) it has a Windows filename for the original source, making the traceback error absolutely senseless. I spent hours staring at logs -- on linux -- wondering why the file was "C:\This\N\That".
"If there is still a problem somewhere else in my code, how would I best reproduce it?"
Before reproducing errors, you should try to prevent them.
First, create unit tests to exercise everything.
Start with Django's tests.py testing. Then expand to unittest for all non-Django components. Then write yourself a "run_tests" script that runs every test you own. Run this periodically. Daily isn't often enough.
Second, be sure you're using logging. Heavily.
Third, wrap anything that uses external resources in generic exception-logging blocks like this.
try:
    some_external_resource_processing()
except Exception as e:
    logger.exception(e)
    raise
This will help you pinpoint problems with external resources. Files and databases are often the source of bad behavior due to permission or access problems.
At this point, you have prevented a large number of errors. If you want to run cyclic load testing, that's not a bad idea either. Use unittest for this.
import unittest
import urllib2

class SomeLoadtest(unittest.TestCase):
    def test_something(self):
        self.connection = urllib2.urlopen("http://localhost:8000/some/path")
        results = self.connection.read()
This isn't the best way to do things, but it shows one approach. You might want to start using Selenium to test the web site "from the outside" as a complement to your unittests.
