Keeping concurrency in web.py applications on mod_wsgi

Sorry if this makes no sense. Please comment if clarification is needed.
I'm writing a small file upload app in web.py which I am deploying using mod_wsgi + apache. I have been having a problem with my session management and would like clarification on how the threading works in web.py.
Essentially I embed a code in a hidden field of the html page I render when someone accesses my page. The file upload is then done via a standard POST request containing both the file and the code. Then I retrieve the progress of the file by updating it in the file upload POST method and grabbing it with a GET request to a different class. The 'session' (apologies for it being fairly naive) is stored in a session object like this:
class session:
    def __init__(self):
        self.progress = 0
        self.title = ""
        self.finished = False

    def advance(self):
        self.progress = self.progress + 1
The sessions are all kept in a global dictionary within my app script and then accessed with my code (from earlier) as the key.
For some reason my progress seems to stay at 0 and never increments. I've been debugging for a couple hours now and I've found that the two session objects referenced from the upload class and the progress class are not the same. The two codes, however, are (as far as I can tell) equal. This is driving me mad as it worked without any problems on the web.py test server on my local machine.
EDIT: After some research it seems that the dictionary may get copied for every request. I've tried putting the dictionary in another module and importing it, but this doesn't work. Is there some other way, short of using a database, to 'separate' the sessions dictionary?

Apache/mod_wsgi can run in multiprocess configurations, and possibly your requests aren't even being serviced by the same process. They never will be if each process in that multiprocess configuration is single-threaded, because while the upload is occurring no other requests can be handled by that same process. Read:
http://code.google.com/p/modwsgi/wiki/ProcessesAndThreading
Possibly you should use mod_wsgi daemon mode with a single multithreaded daemon process.
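If everything does end up in one multithreaded process, the upload and progress handlers will see the same global dictionary, but access from several threads should be guarded. A minimal sketch, assuming a hypothetical SessionStore wrapper (not part of web.py or mod_wsgi) around a class like the one in the question:

```python
import threading


class Session:
    """Per-upload progress record, mirroring the question's session class."""

    def __init__(self):
        self.progress = 0
        self.title = ""
        self.finished = False


class SessionStore:
    """Hypothetical process-wide registry. It only helps if every request
    is served by the same process (e.g. one multithreaded daemon process)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._sessions = {}

    def get_or_create(self, code):
        # Return the same Session object for the same code, whichever thread asks.
        with self._lock:
            return self._sessions.setdefault(code, Session())

    def advance(self, code):
        # Increment under the lock so concurrent updates are not lost.
        with self._lock:
            session = self._sessions.setdefault(code, Session())
            session.progress += 1
            return session.progress


store = SessionStore()  # module-level: one instance per process
```

The upload handler would call store.advance(code) and the progress handler store.get_or_create(code).progress. With more than one process, each process gets its own store, which is exactly the symptom described in the question.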

From PEP 333, defining WSGI:
Servers that can run multiple requests in parallel, should also provide the option of running an application in a single-threaded fashion, so that applications or frameworks that are not thread-safe may still be used with that server
Check the documentation of your WSGI server.

Related

Web Application Hangs on Download

I'm maintaining an open-source document asset management application called NotreDAM, which is written in Django and runs on Apache and an instance of TwistedWeb.
Whenever any user downloads a file, the application hangs for all users for the entire duration of the download. I've tracked the download down to this point in the code, but I'm not well versed enough in Python/Django to know why this may be happening.
response = HttpResponse(open(fullpath, 'rb').read(), mimetype=mimetype)
response["Last-Modified"] = http_date(statobj.st_mtime)
response["Content-Length"] = statobj.st_size
if encoding:
    response["Content-Encoding"] = encoding
return response
Do you know how I could fix the application hanging while a file downloads?

The web server reads the whole file into memory instead of streaming it. It is not well-written code, but not a bug per se.
This blocks the pre-forked Apache worker process for the duration of the whole file read. If IO is slow and the file is large, it may take some time.
Usually you have several pre-forked Apache workers configured to handle this kind of request, but on a badly configured web server you may see this kind of problem, and it is not a Django issue. Your web server is probably running only one pre-forked process, potentially in a debug mode.
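For the "reads the whole file into memory" part, a minimal sketch of reading in chunks instead. The helper name iter_file_chunks and the 64 KiB default are assumptions for illustration, not NotreDAM code. In newer Django versions an iterator like this can be passed to StreamingHttpResponse. Note that this only reduces memory use; the worker is still busy for the length of the download, so you still need more than one worker.

```python
def iter_file_chunks(path, chunk_size=64 * 1024):
    """Yield the file's bytes in fixed-size chunks instead of one big read()."""
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break  # end of file
            yield chunk
```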

NotreDAM serves the asset files using the django.views.static.serve() view. According to the Django docs: "Using this method is inefficient and insecure. Do not use this in a production setting. Use this only for development." So there we go. I have to serve the files another way.

Web application: Hold large object between requests

I'm working on a web application related to genome searching. This application makes use of this suffix tree library through Cython bindings. Objects of this type are large (hundreds of MB up to ~10GB) and take as long to load from disk as it takes to process them in response to a page request. I'm looking for a way to load several of these objects once on server boot and then use them for all page requests.
I have tried using a remote manager / client setup using the multiprocessing module, modeled after this demo, but it fails when the client connects with an error message that says the object is not picklable.

I would suggest writing a small Flask (or even raw WSGI… But it's probably simpler to use Flask, as it will be easier to get up and running quickly) application which loads the genome database then exposes a simple API. Something like this:
from flask import Flask

app = Flask(__name__)
database = load_database()  # placeholder: load the large objects once at startup

@app.route('/get_genomes')
def get_genomes():
    return database.all_genomes()

app.run(debug=True)
Or, you know, something a bit more sensible.
Also, if you need to be handling more than one request at a time (I believe that app.run will only handle one at a time), start by threading… And if that's too slow, you can os.fork() after the database is loaded and run multiple request handlers from there (that way they will all share the same database in memory).
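To make the "start by threading" step concrete, here is a rough sketch that uses only the standard library instead of Flask (a swapped-in technique, purely for illustration). The names load_database, DATABASE, ThreadedWSGIServer and make_threaded_server are hypothetical, and the tiny dictionary stands in for the real suffix trees:

```python
import json
from socketserver import ThreadingMixIn
from wsgiref.simple_server import WSGIRequestHandler, WSGIServer, make_server


def load_database():
    """Hypothetical stand-in for the expensive one-time load from disk."""
    return {"genomes": ["genome_a", "genome_b"]}


# Loaded once at import time; all request threads share this one object
# and should treat it as read-only.
DATABASE = load_database()


def application(environ, start_response):
    """WSGI app that answers every request from the shared object."""
    body = json.dumps(DATABASE["genomes"]).encode("utf-8")
    start_response("200 OK", [("Content-Type", "application/json"),
                              ("Content-Length", str(len(body)))])
    return [body]


class ThreadedWSGIServer(ThreadingMixIn, WSGIServer):
    """Handle each request in its own thread."""
    daemon_threads = True


def make_threaded_server(host="127.0.0.1", port=8000):
    return make_server(host, port, application,
                       server_class=ThreadedWSGIServer,
                       handler_class=WSGIRequestHandler)
```

Calling make_threaded_server().serve_forever() starts it. Because the threads live in one process, nothing has to be pickled, which sidesteps the error from the multiprocessing manager approach.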

Does Google App Engine run one instance of an app per one request? or for all requests?

Using google app engine:
# more code ahead not shown
application = webapp.WSGIApplication([('/', Home)],
                                     debug=True)

def main():
    run_wsgi_app(application)

if __name__ == "__main__":
    main()
If two different users request the webpage from two different machines, will two individual instances of the server be invoked?
Or is just one instance of the server running all the time, handling all the requests?
What if one user opens the webpage twice in the same browser?
Edit:
According to the answers below, one instance may handle requests from different users one after another. Then consider the following fragment of code, taken from the example Google gave:
class User(db.Model):
    email = db.EmailProperty()
    nickname = db.StringProperty()
1. Are email and nickname here defined as class variables?
2. Do all the requests handled by the same server instance share the same variables, and could they therefore interfere with each other by mistake? (Say, one user's email appears on another user's page.)
PS: I know that I should read the manual and docs more, and I am doing so, but answers from experienced programmers will really help me understand faster and more thoroughly. Thanks.

An instance can handle many requests over its lifetime. In the python runtime's threading model, each instance can only handle a single request at any given time. If 2 requests arrive at the same time they might be handled one after the other by a single instance, or a second instance might be spawned to handle the request.
EDIT:
In general, variables used by each request will be scoped to a RequestHandler instance's .get() or .post() method, and thus can't "leak" into other requests. You should be careful about using global variables in your scripts, as these will be cached in the instance and would be shared between requests. Don't use globals without knowing exactly why you want to (which is good advice for any application, for that matter), and you'll be fine.
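A small framework-free sketch of the difference, assuming a hypothetical handle_request function and REQUEST_LOG global (neither is an App Engine API): the local variable is fresh on every call, while the module-level list survives between requests served by the same instance.

```python
REQUEST_LOG = []  # module-level: lives as long as the instance (process) does


def handle_request(user_email):
    """Simulated handler: `greeting` is local, REQUEST_LOG is shared."""
    greeting = "Hello, " + user_email  # per-request, cannot leak
    REQUEST_LOG.append(user_email)     # persists into later requests
    return greeting, list(REQUEST_LOG)
```

After two calls for two different users, the second call's copy of REQUEST_LOG contains both emails, which is the kind of cross-request sharing to watch out for.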

App Engine dynamically builds up and tears down instances based on request volume.
From the docs:
App Engine applications are powered by any number of instances at any given time, depending on the volume of requests received by your application. As requests for your application increase, so do the number of instances powering it.
Each instance has its own queue for incoming requests. App Engine monitors the number of requests waiting in each instance's queue. If App Engine detects that queues for an application are getting too long due to increased load, it automatically creates a new instance of the application to handle that load.
App Engine scales instances in reverse when request volumes decrease. In this way, App Engine ensures that all of your application's current instances are being used to optimal efficiency. This automatic scaling makes running App Engine so cost effective.
When an application is not being used at all, App Engine turns off its associated instances, but readily reloads them as soon as they are needed.

How to defer a Django DB operation from within Twisted?

I have a normal Django site running. In addition, there is another twisted process, which listens for Jabber presence notifications and updates the Django DB using Django's ORM.
So far it works as I just call the corresponding Django models (after having set up the settings environment correctly). This, however, blocks the Twisted app, which is not what I want.
As I'm new to Twisted, I don't know what the best way would be to access the Django DB (via its ORM) in a non-blocking way using deferreds.
deferredGenerator ?
twisted.enterprise.adbapi ? (circumvent the ORM?)
???
If the presence message is parsed I want to save in the Django DB that the user with jid_str is online/offline (using the Django model UserProfile). I do it with that function:
def django_useravailable(jid_str, user_available):
    try:
        userhost = jid.JID(jid_str).userhost()
        user = UserProfile.objects.get(im_jabber_name=userhost)
        user.im_jabber_online = user_available
        user.save()
        return jid_str, user_available
    except Exception as e:
        print(e)
        # Re-raise the original exception so the errback fires.
        raise
Currently, I invoke it with:
d = threads.deferToThread(django_useravailable, from_attr, user_available)
d.addCallback(self.success)
d.addErrback(self.failure)

"I have a normal Django site running."
Presumably under Apache using mod_wsgi or similar.
If you're using mod_wsgi embedded in Apache, note that Apache is multi-threaded and your Python threads are mashed into Apache's threading. Analysis of what's blocking could get icky.
If you're using mod_wsgi in daemon mode (which you should be) then your Django is a separate process.
Why not continue this design pattern and make your "jabber listener" a separate process?
If you'd like this process to be run on any of a number of servers, then have it be started from init.rc or cron.
Because it's a separate process it will not compete for attention. Your Django process runs quickly and your Jabber listener runs independently.

I have been successful using the method you described as your current method. You'll find by reading the docs that the twisted DB api uses threads under the hood because most SQL libraries have a blocking API.
I have a twisted server that saves data from power monitors in the field, and it does it by starting up a subthread every now and again and calling my Django save code. You can read more about my live data collection pipeline (that's a blog link).
Are you saying that you are starting up a sub thread and that is still blocking?
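To illustrate the general idea of pushing a blocking call into a thread so the event loop stays responsive, here is a sketch using the standard library's asyncio rather than Twisted. This is an analogy, not Twisted's API: blocking_save is a hypothetical stand-in for the Django ORM call, and run_in_executor plays roughly the role that deferToThread plays in the question.

```python
import asyncio
import time


def blocking_save(jid_str, user_available):
    """Stand-in for the blocking ORM call (hypothetical; no Django here)."""
    time.sleep(0.05)  # simulate a slow database round trip
    return jid_str, user_available


async def mark_presence(jid_str, user_available):
    loop = asyncio.get_running_loop()
    # Run the blocking function in the default thread pool so the event
    # loop stays free to process other events in the meantime.
    return await loop.run_in_executor(None, blocking_save, jid_str, user_available)


async def main():
    # Two presence updates are handled concurrently, each in a worker thread.
    return await asyncio.gather(
        mark_presence("alice@example.com", True),
        mark_presence("bob@example.com", False),
    )
```

Running asyncio.run(main()) returns both results in the order they were submitted.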

I have a running Twisted app where I use the Django ORM. I'm not deferring it. I know it's wrong, but I have had no problems yet.

How to Disable Django / mod_WSGI Page Caching

I have Django running in Apache via mod_wsgi. I believe Django is caching my pages server-side, which is causing some of the functionality to not work correctly.
I have a countdown timer that works by getting the current server time, determining the remaining countdown time, and outputting that number to the HTML template. A javascript countdown timer then takes over and runs the countdown for the user.
The problem arises when the user refreshes the page, or navigates to a different page with the countdown timer. The timer appears to jump around to different times sporadically, usually going back to the same time over and over again on each refresh.
Using HTTPFox, the page is not being loaded from my browser cache, so it looks like either Django or Apache is caching the page. Is there any way to disable this functionality? I'm not going to have enough traffic to worry about caching the script output. Or am I completely wrong about why this is happening?
[Edit] From the posts below, it looks like caching is disabled in Django, which means it must be happening elsewhere, perhaps in Apache?
[Edit] I have a more thorough description of what is happening: For the first 7 (or so) requests made to the server, the pages are rendered by the script and returned, although each of those 7 pages seems to be cached as it shows up later. On the 8th request, the server serves up the first page. On the 9th request, it serves up the second page, and so on in a cycle. This lasts until I restart apache, when the process starts over again.
[Edit] I have configured mod_wsgi to run only one process at a time, which causes the timer to reset to the same value in every case. Interestingly though, there's another component on my page that displays a random image on each request, using order('?'), and that does refresh with different images each time, which would indicate the caching is happening in Django and not in Apache.
[Edit] In light of the previous edit, I went back and reviewed the relevant views.py file, finding that the countdown start variable was being set globally in the module, outside of the view functions. Moving that setting inside the view functions resolved the problem. So it turned out not to be a caching issue after all. Thanks everyone for your help on this.
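The pitfall from the last edit can be sketched without Django. All names here (COUNTDOWN_START_AT_IMPORT, DEADLINE, remaining_stale, remaining_fresh) are hypothetical, not the asker's actual code: a value computed at module level is evaluated once per server process, at import time, so every request served by that process sees the same stale number.

```python
import time

# Evaluated once, when the module is first imported by a server process.
# Every request served by that process then sees this same value.
COUNTDOWN_START_AT_IMPORT = time.time()

DEADLINE = COUNTDOWN_START_AT_IMPORT + 3600  # hypothetical fixed deadline


def remaining_stale():
    """Buggy pattern: 'now' was frozen at import time."""
    return DEADLINE - COUNTDOWN_START_AT_IMPORT


def remaining_fresh(now=None):
    """Fixed pattern: read the clock inside the function, on every call."""
    if now is None:
        now = time.time()
    return DEADLINE - now
```

With several processes, each one imports the module at a different moment, which would also explain the cycle of seven or so different "cached" values.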

From my experience with mod_wsgi in Apache, it is highly unlikely that either of them is doing the caching. A couple of things to try:
It is possible that you have some proxy server between your computer and the web server that is appropriately or inappropriately caching pages. Sometimes ISPs run proxy servers to reduce bandwidth outside their network. Can you please provide the HTTP headers for a page that is getting cached? (Firebug can give these to you.) Headers that I would specifically be interested in include Cache-Control, Expires, Last-Modified, and ETag.
Can you post your MIDDLEWARE_CLASSES from your settings.py file? It is possible that you have a middleware that performs caching for you.
Can you grep your code for the following items: "load cache", "django.core.cache", and "cache_page"? A grep -R "search" * will work.
Does the settings.py (or anything it imports, like "from localsettings import *") include CACHE_BACKEND?
What happens when you restart Apache (e.g. sudo service apache restart)? If a restart clears the issue, then it might be Apache doing the caching (although a restart could also clear out a locmem Django cache backend).

Did you specifically set up Django caching? From the docs it seems you would clearly know if Django was caching, as it requires work beforehand to get it working. Specifically, you need to define where the cached files are saved.
http://docs.djangoproject.com/en/dev/topics/cache/

Are you using a multiprocess configuration for Apache/mod_wsgi? If you are, that would explain why different responses show different timer values: the moment at which the timer is initialised is likely to differ for each process handling requests. That is why it can jump around.
Have a read of:
http://code.google.com/p/modwsgi/wiki/ProcessesAndThreading
Work out in what mode or configuration you are running Apache/mod_wsgi and perhaps post what that configuration is. Without knowing, there are too many unknowns.
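One way to check from inside the application is a throwaway WSGI script that reports what the server says about itself. This is a sketch: wsgi.multiprocess and wsgi.multithread are required by the WSGI specification, while the mod_wsgi.process_group key is mod_wsgi-specific and is read here with a fallback in case it is absent. The helper name describe_environment is made up.

```python
import os
import threading


def describe_environment(environ):
    """Summarise how the WSGI server says it is running this request."""
    return {
        "pid": os.getpid(),
        "thread": threading.current_thread().name,
        # Both keys are required by the WSGI specification (PEP 333).
        "multiprocess": environ.get("wsgi.multiprocess"),
        "multithread": environ.get("wsgi.multithread"),
        # mod_wsgi-specific; assumed to be empty in embedded mode.
        "process_group": environ.get("mod_wsgi.process_group", "(not set)"),
    }


def application(environ, start_response):
    """Return the summary as plain text, one 'key: value' pair per line."""
    info = describe_environment(environ)
    lines = ["%s: %s" % item for item in sorted(info.items())]
    body = "\n".join(lines).encode("utf-8")
    start_response("200 OK", [("Content-Type", "text/plain"),
                              ("Content-Length", str(len(body)))])
    return [body]
```

Refreshing the page a few times and watching the pid change is a quick sign that requests are being spread over several processes.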

I just came across this:
Support for Automatic Reloading
To help deployment tools you can activate support for automatic reloading. Whenever something changes the .wsgi file, mod_wsgi will reload all the daemon processes for us.
For that, just add the following directive to your Directory section:
WSGIScriptReloading On
