Web Application Hangs on Download - python

I'm maintaining an open-source document asset management application called NotreDAM, which is written in Django, running on Apache and an instance of TwistedWeb.
Whenever any user downloads a file, the application hangs for all users for the entire duration of the download. I've tracked the download down to this point in the code, but I'm not versed enough in Python/Django to know why this might be happening.
response = HttpResponse(open(fullpath, 'rb').read(), mimetype=mimetype)
response["Last-Modified"] = http_date(statobj.st_mtime)
response["Content-Length"] = statobj.st_size
if encoding:
response["Content-Encoding"] = encoding
return response
Do you know how I could fix the application hanging while a file downloads?

The web server reads the whole file into memory instead of streaming it. That's not well-written code, but not a bug per se.
This blocks the Apache client (pre-forked) for the duration of the whole file read. If I/O is slow and the file is large, it may take some time.
Usually you have several pre-forked Apache clients configured to satisfy this kind of request, but on a badly configured web server you may run into this kind of problem, and it is not a Django issue. Your web server is probably running only one pre-forked process, potentially in a debug mode.

notreDAM serves the asset files using the django.views.static.serve() command, about which the Django docs say: "Using this method is inefficient and insecure. Do not use this in a production setting. Use this only for development." So there we go. I have to use another command.
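For anyone hitting the same problem, here is a minimal sketch of one way to stream the file in chunks instead of reading it into memory all at once. It assumes a Django version that provides StreamingHttpResponse (on older versions the same iterator can be passed to HttpResponse); the view name and arguments are illustrative, not NotreDAM's actual code:
import os
from django.http import StreamingHttpResponse
from django.utils.http import http_date

def serve_asset(request, fullpath, mimetype, encoding=None):
    # Yield the file in fixed-size chunks so one large download no longer
    # holds the whole file in memory for its entire duration.
    def file_iterator(path, chunk_size=8192):
        with open(path, 'rb') as f:
            while True:
                chunk = f.read(chunk_size)
                if not chunk:
                    break
                yield chunk

    statobj = os.stat(fullpath)
    response = StreamingHttpResponse(file_iterator(fullpath), content_type=mimetype)
    response["Last-Modified"] = http_date(statobj.st_mtime)
    response["Content-Length"] = str(statobj.st_size)
    if encoding:
        response["Content-Encoding"] = encoding
    return response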

Related

How can a CGI server based on CGIHTTPRequestHandler require that a script start its response with headers that include a `content-type`?

Later note: the issues in the original posting below have been largely resolved.
Here's the background: For an introductory comp sci course, students develop html and server-side Python 2.7 scripts using a server provided by the instructors. That server is based on CGIHTTPRequestHandler, like the one at pointlessprogramming. When the students' html and scripts seem correct, they port those files to a remote, slow Apache server. Why support two servers? Well, the initial development using a local server has the benefit of reducing network issues and dependency on the remote, weak machine that is running Apache. Eventually porting to the Apache-running machine has the benefit of publishing their results for others to see.
For the local development to be most useful, the local server should closely resemble the Apache server. Currently there is an important difference: Apache requires that a script start its response with headers that include a content-type; if the script fails to provide such a header, Apache sends the client a 500 error ("Internal Server Error"), which is too generic to help the students, who cannot use the server logs. CGIHTTPRequestHandler imposes no similar requirement. So it is common for a student to write header-free scripts that work with the local server, but get the baffling 500 error after copying files to the Apache server. It would be helpful to have a version of the local server that checks for a content-type header and gives a good error if there is none.
I seek advice about creating such a server. I am new to Python and to writing servers. Here are the issues that occur to me, but any helpful advice would be appreciated.
1. Is a content-type header required by the CGI standard? If so, other people might benefit from an answer to the main question here. Also, if so, I despair of finding a way to disable Apache's requirement. Maybe the relevant part of the CGI RFC is section 6.3.1 (CGI Response, Content-Type): "If an entity body is returned, the script MUST supply a Content-Type field in the response."
2. To make a local server that checks for the content-type header, perhaps I should sub-class CGIHTTPServer.CGIHTTPRequestHandler, to override run_cgi() with a version that issues an error for a missing header. I am looking at CGIHTTPServer.py __version__ = "0.4", which was installed with Python 2.7.3. But run_cgi() does a lot of processing, so it is a little unappealing to copy all its code just to add a couple of calls to a header-checking routine. Is there a better way?
3. If the answer to (2) is something like "No, overriding run_cgi() is recommended," I anticipate writing a version that invokes the desired script, then checks the script's output for headers before that output is sent to the client. There are apparently two places in the existing run_cgi() where the script is invoked:
3a. When run_cgi() is executed on a non-Unix system, the script is executed using Python's subprocess module. As a result, the standard output from the script will be available as an in-memory string, which I can presumably check for headers before the call to self.wfile.write. Does this sound right?
3b. But when run_cgi() is executed on a *nix system, the script is executed by a forked process. I think the child's stdout will write directly to self.wfile (I'm a little hazy on this), so I see no opportunity for the code in run_cgi() to check the output. Ugh. Any suggestions?
4. If analyzing the script's output is recommended, is email.parser the standard way to recognize whether there is a content-type header? Is another standard module recommended instead? (A small sketch follows after this list.)
5. Is there a more appropriate forum for asking the main question ("How can a CGI server based on CGIHTTPRequestHandler require...")? It seems odd to ask if there is a better forum for asking programming questions than Stack Overflow, but I guess anything is possible.
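On point 4, here is a minimal sketch of how email.parser could check a captured CGI response for a Content-Type header; the function name and the header/body split are my own illustration, not part of CGIHTTPServer:
from email.parser import Parser

def has_content_type(cgi_output):
    # A CGI response separates headers from the body with a blank line;
    # take everything before it and parse it as RFC 822-style headers.
    separator = '\r\n\r\n' if '\r\n\r\n' in cgi_output else '\n\n'
    header_block = cgi_output.split(separator, 1)[0]
    headers = Parser().parsestr(header_block, headersonly=True)
    return 'content-type' in headers  # Message lookups are case-insensitive

# The first response passes the check, the second does not.
print(has_content_type('Content-Type: text/html\r\n\r\n<html></html>'))
print(has_content_type('<html>oops, no headers</html>'))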
Thanks for any help.

Grabbing log files from production server

I developed a statistics system in Python for researching user behavior in an online web service; it mostly relies on reading and analyzing logs from the production server. Currently I share the log folders internally over the SMB protocol for the routine analytics program to read, but I have two questions about this data-access method:
Is there any other way of accessing the logs besides SMB, or a better strategy altogether?
I suspect that heavy reads may block the production server's disk and interfere with normal log writing. Is there a way to solve this?
I hoped I could come up with some real numbers, but I don't have any yet. Can anyone give me some guidance on doing this more gracefully?
If you are open to using a third party log aggregation tool, you have a couple of options:
http://graylog2.org/
http://www.logstash.net/
http://www.octopussy.pm/
https://github.com/facebook/scribe
In addition, if you are logging to syslog, many of the commonly used syslog daemons (e.g. syslog-ng) can be configured to forward logs from various applications to one or more of these aggregators. It is trivial to log to syslog from a Python application: there is a syslog module in the standard library.
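For example, a minimal sketch of writing to the local syslog daemon from Python (the program name and the message are placeholders):
import syslog

# Tag messages with a program name and PID, and route them to the
# LOCAL0 facility so the syslog daemon can forward them to an aggregator.
syslog.openlog("analytics-webapp", syslog.LOG_PID, syslog.LOG_LOCAL0)
syslog.syslog(syslog.LOG_INFO, "user 42 downloaded report.pdf")
syslog.closelog()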
Well, if you have an HTTP server in between (IHS, OHS, I guess Apache too...), then you can expose your physical repositories via a URL: each of your files gets a URL of its own, and with this kind of code you can download them quite easily:
import os
import urllib2

url = 'http://server.example.com/logs/access.log'  # placeholder URL for an exposed log file

# Fetch the remote file, then write it to a local file of the same name
remote_file = urllib2.urlopen(url)
with open(os.path.basename(url), 'wb') as local_file:
    local_file.write(remote_file.read())

Does local GAE read and write to a local datastore file on the hard drive while it's running?

I have just noticed that when I have a running instance of my GAE application, nothing happens with the datastore file when I add or remove entries using Python code or the admin console. I can even remove the file and still have all the data safe and sound in the admin area and accessible from code. But when I restart my application, all that data obviously goes away and I have a blank datastore. So, the question: does GAE read all the data from the file only when it starts, then deal with it in memory and save it back after I stop the application? Does it make any requests to the datastore file while the application is running? If it doesn't save anything to the file while it's running, couldn't data be lost if the application stops unexpectedly? Please make this clear for me if you know how it works in this respect.
How the datastore reads and writes its underlying files varies - the standard datastore is read on startup, and written progressively, journal-style, as the app modifies data. The SQLite backend uses a SQLite database.
You shouldn't have to care, though - neither backend is designed for robustness in the face of failure, as they're development backends. You shouldn't be modifying or deleting the underlying files, either.
By default the dev_appserver will store its data in a temporary location (which is why it disappears and you can't see anything changing).
If you don't want your data to disappear on restart, set --datastore_path when running your dev server, like:
dev_appserver.py --datastore_path /path/to/app/myapp.db /path/to/app
As Nick said, the dev server is not built to be bulletproof; it's designed to help you quickly develop your app. The production setup is very different and will not do anything unexpected when you are dealing with exceptional circumstances.

How does Google App Engine limit Python?

Does anybody know how GAE limits the Python interpreter? For example, how do they block I/O operations or URL operations?
Does shared hosting also do this in some way?
The sandbox "internally works" by them having a special version of the Python interpreter. You aren't running the standard Python executable, but one especially modified to run on Google App engine.
Update:
And no, it's not a virtual machine in the ordinary sense. Each application does not get a complete virtual PC. There may be some virtualization going on, but Google isn't saying exactly how much or what.
In an operating system, a process normally already has limited access to the rest of the OS and the hardware. Google has limited this even more, and you get an environment where you are only allowed to read very specific parts of the file system and not write to it at all, you are not allowed to open sockets, not allowed to make system calls, etc.
I don't know at which level OS/Filesystem/Interpreter each limitation is implemented, though.
From Google's site:
An application can only access other computers on the Internet through the provided URL fetch and email services. Other computers can only connect to the application by making HTTP (or HTTPS) requests on the standard ports.
An application cannot write to the file system. An app can read files, but only files uploaded with the application code. The app must use the App Engine datastore, memcache or other services for all data that persists between requests.
Application code only runs in response to a web request, a queued task, or a scheduled task, and must return response data within 30 seconds in any case. A request handler cannot spawn a sub-process or execute code after the response has been sent.
Beyond that, you're stuck with Python 2.5, you can't use any C-based extensions, and more up-to-date versions of web frameworks won't work in some cases (Python 2.5 again).
You can read the whole article What is Google App Engine?.
I found this site that has some pretty decent information. What exactly are you trying to do?
Look here: http://code.google.com/appengine/docs/python/runtime.html
Your I/O operations are limited as follows (beyond the disabled modules):
App Engine records how much of each resource an application uses in a calendar day, and considers the resource depleted when this amount reaches the app's quota for the resource. A calendar day is a period of 24 hours beginning at midnight, Pacific Time. App Engine resets all resource measurements at the beginning of each day, except for Stored Data which always represents the amount of datastore storage in use.
When an app consumes all of an allocated resource, the resource becomes unavailable until the quota is replenished. This may mean that your app will not work until the quota is replenished.
An application can determine how much CPU time the current request has taken so far by calling the Quota API. This is useful for profiling CPU-intensive code, and finding places where CPU efficiency can be improved for greater cost savings. You can measure the CPU used for the entire request, or call the API before and after a section of code then subtract to determine the CPU used between those two points.
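For instance, a minimal sketch of that before-and-after measurement using the quota module from the old Python runtime (the handler and the expensive section are placeholders):
import logging
from google.appengine.api import quota

def profile_expensive_section():
    start = quota.get_request_cpu_usage()
    do_expensive_work()  # placeholder for the CPU-intensive code being profiled
    end = quota.get_request_cpu_usage()
    logging.info("That section used %d megacycles of CPU", end - start)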
Resource              | Free Default Quota | Billing Enabled Default Quota
Blobstore Stored Data | 1 GB               | 1 GB free; no maximum

Resource            | Daily Limit (Billing Enabled) | Maximum Rate
Blobstore API Calls | 140,000,000 calls             | 72,000 calls/minute
Hmm, my table isn't that good, but hopefully it's still readable.
IMO it's not standard Python, but a version specifically patched for App Engine. In other words, you can think of it more or less as a "higher level" VM that emulates Python opcodes rather than x86 instructions (if you don't know what opcodes are, try writing a small function named "foo" and then doing "import dis; dis.dis(foo)"; you will see the Python opcodes that the compiler produced).
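For instance (the function is just a throwaway example):
import dis

def foo(x):
    return x * 2 + 1

# Print the Python opcodes the compiler produced for foo
dis.dis(foo)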
By patching Python you can impose on it whatever limitations you like. Of course, you then have to forbid user-supplied C/C++ extension modules, as a C/C++ module would have access to everything the process can access.
Using such a virtual environment, you're able to run Python code safely without needing a separate x86 VM for every instance.

fastcgi, cherrypy, and python

So I'm trying to do more web development in python, and I've picked cherrypy, hosted by lighttpd w/ fastcgi. But my question is a very basic one: why do I need to restart lighttpd (or apache) every time I change my application code, or the code for an underlying library?
I realize this question stems from a basic mis- (i.e. poor) understanding of the FastCGI model, so I'm open to any schooling here, but I'm used to just changing a PHP file and having it show up, versus having to bounce the web server.
Any elucidation/useful mockery appreciated.
This is because of performance. For development, autoreloading is helpful, but for production you don't want to autoreload. This is actually a decently-sized bottleneck in, say, PHP: every time you access a PHP web page, the server has to parse and load the page from scratch. With Python, the script is already loaded and running after the first access.
As has been pointed out, CherryPy has an autoreload setting. I'd recommend using the CherryPy built-in server for development and lighttpd for production. That will likely save you some time. The tutorial shows you how to do this.
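As a minimal sketch (assuming a CherryPy 3.x-style configuration key; the app itself is just a placeholder):
import cherrypy

class HelloWorld(object):
    def index(self):
        return "Hello, world!"
    index.exposed = True

if __name__ == '__main__':
    # With the autoreload engine plugin enabled, the built-in server
    # restarts itself when source files change during development.
    cherrypy.config.update({'engine.autoreload.on': True})
    cherrypy.quickstart(HelloWorld())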
From a system-software writer's point of view: this all depends on how the metadata about the server process is organized within your daemon (lighttpd or FastCGI). Some programs are designed for one-time-only initialization, mostly because this allows a much simpler and better-performing internal programming model.
Often it is very hard to make a server process reload its configuration data in an easy way. You might have to introduce locks and external event objects (signals in UNIX). When you can synchronize the data structures by design -- i.e., by initializing only once -- why complicate things by making the data model modifiable multiple times?
