How do you restrict large file uploads in wsgi?

How do you restrict large file uploads in wsgi? - python

I'm trying to get an understanding of the best way of handling file uploads safely in a wsgi app. It seems a lot of solutions involve using FieldStorage from the cgi module to parse form data. From what I understand about FieldStorage it performs a bit of 'magic' behind the scenes by streaming data into a tempfile.
What I'm not 100% clear on is how to restrict a request containing a file greater than a specified amount (say 10MB). If someone uploads a file which is several GB in size you obviously want to block the request before it chews through your server's disk space right?
What is the best way to restrict file uploads in a wsgi application?

It would depend on your front-end server. If it has any configuration to block big request even before it goes into your app, use it.
If you want to block this with your code I see two approaches:
Look ate the Content-Length HTTP Header. If it's bigger than you can handle, deny the request right away.
Don't trust the headers and start reading the request body, until you reach your limit. Note that this is not a very clever way, but could work. =)
Trusting the HTTP header could lead you to some problems. Supose some one send a request with a Content-Length: 1024 but sends a 1GB request body. If your front-end server trusts the header, it will start do read this request and would find out later that the request body is actually much bigger that it should be. This situation could still fill your server disk, even being a request that "passes" the "too big check".
Although this could happen, I think trusting the Header would be a good start point.

You could use the features of the HTTP server you probably have in front of your WSGI application. For example lighttpd has many options for traffic shaping.

Related

Is there a way to limit the number of concurrent requests from one IP with Gunicorn?

Basically I'm running a Flask web server that crunches a bunch of data and sends it back to the user. We aren't expecting many users ~60, but I've noticed what could be an issue with concurrency. Right now, if I open a tab and send a request to have some data crunched, it takes about 30s, for our application that's ok.
If I open another tab and send the same request at the same time, unicorn will do it concurrently, this is great if we have two seperate users making two seperate requests. But what happens if I have one user open 4 or 8 tabs and send the same request? It backs up the server for everyone else, is there a way I can tell Gunicorn to only accept 1 request at a time from the same IP?

A better solution to the answer by #jon would be limiting the access by your web server instead of the application server. A good way would always be to have separation between the responsibilities to be carried out by the different layers of your application. Ideally, the application server, flask should not have any configuration for the limiting or anything to do with from where the requests are coming. The responsibility of the web server, in this case nginx is to route the request based on certain parameters to the right client. The limiting should be done at this layer.
Now, coming to the limiting, you could do it by using the limit_req_zone directive in the http block config of nginx
http {
limit_req_zone $binary_remote_addr zone=one:10m rate=1r/s;
...
server {
...
location / {
limit_req zone=one burst=5;
proxy_pass ...
}
where, binary_remote_addris the IP of the client and not more than 1 request per second at an average is allowed, with bursts not exceeding 5 requests.
Pro-tip: Since the subsequent requests from the same IP would be held in a queue, there is a good chance of nginx timing out. Hence, it would be advisable to have a better proxy_read_timeout and if the reports take longer then also adjusting the timeout of gunicorn
Documentation of limit_req_zone
A blog post by nginx on rate limiting can be found here

This is probably NOT best handled at the flask level. But if you had to do it there, then it turns out someone else already designed a flask plugin to do just this:
https://flask-limiter.readthedocs.io/en/stable/
If a request takes at least 30s then make your limit by address for one request every 30s. This will solve the issue of impatient users obsessively clicking instead of waiting for a very long process to finish.
This isn't exactly what you requested, since it means that longer/shorter requests may overlap and allow multiple requests at the same time, which doesn't fully exclude the behavior you describe of multiple tabs, etc. That said, if you are able to tell your users to wait 30 seconds for anything, it sounds like you are in the drivers seat for setting UX expectations. Probably a good wait/progress message will help too if you can build an asynchronous server interaction.

Using django for non http requests

Can I use django to handle non http-requests and responses? I have a django web application serving up webpages, and I would like to use it to also communicate with other devices (hand-held gps sending in status reports and receiving ack) over tcp, but django reports that the requests are "
code 400, message Bad HTTP/0.9 request type".
[28/Sep/2015 15:14:26] code 400, message Bad HTTP/0.9 request type ('[V1.0.0,244565434376396,1,abcd,2015-09-28')
[28/Sep/2015 15:14:26] "[V1.0.0,244565434376396,1,abcd,2015-09-28 14:14:12,1-2,865456543459367,2,T1]" 400 -
The message from the device is sent as text over tcp with no http parameters at all.
I haven't found any information on how to do this with django, but it would make my life easier if it was possible.
Thanks!

Not that I know of.
Django is a web framework, so it's designed around a certain paradigm if not a certain protocol.
The design is heavily informed - if not by HTTP - by the notions of URL, request, a stateless protocol, et cetera.
If the template system and the routing system were taken away you would be left with a glorified ORM and some useless bits of code.
However, unless you are dealing with existing devices with their own protocol, you can use Django to build a RESTful service to successfully exchange information with something other than bipeds in front of a web browser.
This article on Dr. Dobb's is very informative.
Django REST, although by no means necessary, can help you.
If you are really stuck with legacy devices and protocols, you could write an adapter/proxy that would receive your devices' requests and translate them to RESTful calls, if you protocol looks enough like HTTP semantically rather than syntactically (as in, if you just have to translate QUUX aaa:bbb:ccc: to GET xx/yy/zz).
If it does not share the slightest bit of HTTP's semantics, I'd say Django can't help you much.

I second the suggestion that you can better handle non-http with other methods, but I do have a suggestion as to how to structure a Django app that could do it. HTTP processing takes place in middleware and you could just make your app be on the top of that stack and either pre-empt other middlewares by returning the response instead of passing it down the stack or preparing a mock request to pass down to other handlers, grabbing the response on the way back to post-process it for your receiver.
This feels hacky and might require a bunch of un-orthodox tricks but that's how I would approach the problem as stated.

Slow access to Django's request.body

Sometimes this line of Django app (hosted using Apache/mod_wsgi) takes a lot of time to execute (eg. 99% of eg. 6 seconds of request handling, as measured by New Relic), when submitted by some mobile clients:
raw_body = request.body
(where request is an incoming request)
The questions I have:
What could have slowed down access to request.body so much?
What would be the correct configuration for Apache to wait before invoking Django until client sends whole payload? Maybe the problem is in Apache configuration.
Django's body attribute in HttpRequest is a property, so that really resolves on what is really being done there and how to make it happen outside of the Django app, if possible. I want Apache to wait for full request before sending it to Django app.

Regarding (1), Apache passes control to the mod_wsgi handler as soon as the request's headers are available, and mod_wsgi then passes control on to Python. The internal implementation of request.body then calls the read() method which eventually calls the implementation within mod_wsgi, which requests the request's body from Apache and, if it hasn't been completely received by Apache yet, blocks until it is available.
Regarding (2), this is not possible with mod_wsgi alone. At least, the hook processing incoming requests doesn't provide a mechanism to block until the full request is available. Another poster suggested to use nginx as a proxy in a response to this duplicate question.

There are two ways you can fix this in Apache.
You can use mod_buffer, available in >=2.3, and change BufferSize to the maximum expected payload size. This should make Apache hold the request in memory until it's either finished sending, or the buffer is reached.
For older Apache versions < 2.3, you can use mod_proxy combined with ProxyIOBufferSize, ProxyReceiveBufferSize and a loopback vhost. This involves putting your real vhost on a loopback interface, and exposing a proxy vhost which connects back to the real vhost. The downside to this is that it uses twice as many sockets, and can make resource calculation difficult.
However, the most ideal choice would be to enable request/response buffering at your L4/L7 load balancer. For example, haproxy lets you add rules based on req_len and same goes for nginx. Most good commercial load balancers also have an option to buffer requests before sending.
All three approaches rely on buffering the full request/response payload, and there are performance considerations depending on your use case and available resources. You could cache the entire payload in memory but this may dramatically decrease your maximum concurrent connections. You could choose to write the payload to local storage (preferably SSD), but you are then limited by IO capacity.
You also need to consider file uploads, because these are not a good fit for memory based payload buffering. In most cases, you would handle upload requests in your webserver, for example HttpUploadModule, then query nginx for the upload progress, rather than handling it directly in WSGI. If you are buffering at your load balancer, then you may wish to exclude file uploads from the buffering rules.
You need to understand why this is happening, and that this problem exists both when sending a response and receiving a request. It's also a good idea to have these protections in place, not just for scalability, but for security reasons.

I'm afraid the problem could be in the amount of data you are transferring and possibly a slow connection. Also note that upload bandwidth is typically much less than download bandwidth.
As already pointed out, when you use request.body Django will wait for the whole body to be fully transferred from the client and available in-memory (or on disk, according to configurations and size) on the server.
I would suggest you to try what happens with the same request if the client is connected to a WiFi access point which is wired to the server itself, and see if it improves grately. If this is not possible, perhaps just run a tool like speedtest.net on the client, get the request size and do the math to see how much time it would require theoretically (I'd expect the mesured time to be more or less 20% more). Be careful that network speed is often mesured in bits per second, while file size is mesured in Bytes.
In some cases, if a lot of processing is needed on the data, it may be convinient to read() the request and do computations on-the-go, or perhaps directly pass the request object to any function that can read from a so-called "file-like object" instead of a string.
In your specific case, however, I'm afraid this would only affect that 1% of time that is not spent in receiving the body from the network.
Edit:
Sorry, ony now I've noticed the extra description in the bounty. I'm afraid I can't help you but, may I ask, what is the point? I'd guess this would only save a tiny bit of server resources for keeping a python thread idle for a while, without any noticable performance gain on the request...

Looking at the Django source, it looks like what actually happens when you call request.body is the the request body is loaded into memory by being read from a stream.
https://github.com/django/django/blob/stable/1.4.x/django/http/init.py#L390-L392
It's likely that if the request is large the time being taken is actually just loading it into memory. Django has methods on the request to handle acting on the body as a stream, which depending on what exactly the content being consumed is could allow you to process the request more efficiently.
https://docs.djangoproject.com/en/dev/ref/request-response/#django.http.HttpRequest.read
You could for example read one line at a time.

seek in http connection when downloading with python

I have actually two questions in one. Firstly, does http protocol allows seeking. If the wording is incorrect, what I mean is this: for example, there is file accessible through http request in some server. File's size is 2 gb. Can I retrieve only last 1 gb of this file using http. If this can be done, how to do it in Python. I am asking this, because I am considering writing a Python script to download same file with paralel connections, and combining the outcome.

The http protocol defines a way for a client to request part of the resource see http://www.w3.org/Protocols/rfc2616/
Since all HTTP entities are represented in HTTP messages as sequences
of bytes, the concept of a byte range is meaningful for any HTTP
entity. (However, not all clients and servers need to support byte-
range operations.)
Therefore in theory, you could specify a range header to specify which part of the file you want, however the server might just ignore the request. Therefore you need to configure the server to supports byte range.
Sorry cant provide you with a code sample, I have never worked in python but this information should be sufficient to get you started. If you need further help, please ask.

HTTP lets you request a "range" of bytes of a resource, this is specified in the HTTP/1.1. RFC. Not every server and not every resource might support range retrieval and may ignore the headers. The answer to this question has some example code you could look at.

Sending gzipped form data

I've heard how browsers can receive gzipped pages from the server. Can they also gzip form data that they send to the server? And if it's possible, how would I decompress this data on the server?
I'm using AppEngine's webapp module, but a general explanation / pointers to tutorials would be sufficient. I've done some googling to no avail.

Short answer: No.
See: Why can't browser send gzip request?

I think that browsers probably can send gzipped form data to the server. I don't know if it is common to do so or not.
The sender would need to make sure to have a Content-Encoding: header with a value that included gzip. The body of the message would then need to be encoded with a gzip encoding, and one can compress / decompress gzipped data in python with the gzip.GzipFile class. I don't know if the gzip module is available on appengine -- if it requires a C-module implementation, then it probably wouldn't be (not sure if it does).
As far as the decoding goes, it's possible that the web machinery that runs before your app-engine program gets any input will decode gzipped content. I've done almost no work with appengine, so I'm not familiar with that sort of detail. It's possible though, that you just don't have to worry about it on the server end...it just gets taken care of automatically. You'd have to check.
It might be useful to look at RFC2616, especially the sections for Accept-Encoding and Content-Encoding.

Short answer:
No, most browsers will not compress form data for client requests.
Long answer:
Yes, all browsers allow the client to send compressed form data. But since the browsers wouldn't compress the data for us, we've got to compress it ourselves.
Gzip uses the DEFLATE algorithm, which is publicly available and free to use. What can be done is to compress the form data on the client-side using JavaScript (download a JS Gzip library if you don't want to write one yourself), then send the compressed data to the server through either GET, POST, or PUT using XMLHttpRequest.
If you are in control of your web server, you could simply grab the data and uncompress it. If you are not in control, you will have to follow whatever policies set in place. For example, some web servers may require you to set a specific Content-Type, while others may not support it at all.
Lastly note that if your resource is a file that is already compressed, there may be no advantages in gzipping it. However if your resource is huge uncompressed data (e.g. a forum post of 80000 characters), the advantages are enormous.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.