Sending gzipped form data - python

I've heard how browsers can receive gzipped pages from the server. Can they also gzip form data that they send to the server? And if it's possible, how would I decompress this data on the server?
I'm using AppEngine's webapp module, but a general explanation / pointers to tutorials would be sufficient. I've done some googling to no avail.

Short answer: No.
See: Why can't browser send gzip request?

I think that browsers probably can send gzipped form data to the server. I don't know if it is common to do so or not.
The sender would need to make sure to include a Content-Encoding header with a value that includes gzip. The body of the message would then need to be gzip-encoded, and one can compress and decompress gzipped data in Python with the gzip.GzipFile class. I don't know if the gzip module is available on App Engine -- if it requires a C-module implementation, then it probably wouldn't be (not sure if it does).
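For example, here is a minimal round trip with the standard library (gzip.GzipFile works on file-like objects, so io.BytesIO can stand in for the request body):

    import gzip
    import io

    # Compress a form payload before sending it.
    payload = b"field1=value1&field2=value2"
    buf = io.BytesIO()
    with gzip.GzipFile(fileobj=buf, mode="wb") as f:
        f.write(payload)
    compressed = buf.getvalue()

    # On the receiving side, decompress the body the same way.
    with gzip.GzipFile(fileobj=io.BytesIO(compressed), mode="rb") as f:
        original = f.read()
    assert original == payload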
As far as the decoding goes, it's possible that the web machinery that runs before your App Engine program gets any input will decode gzipped content. I've done almost no work with App Engine, so I'm not familiar with that sort of detail. It's possible, though, that you just don't have to worry about it on the server end and it gets taken care of automatically. You'd have to check.
It might be useful to look at RFC 2616, especially the sections on Accept-Encoding and Content-Encoding.

Short answer:
No, most browsers will not compress form data for client requests.
Long answer:
Yes, all browsers allow the client to send compressed form data. But since browsers won't compress the data for us, we've got to compress it ourselves.
Gzip uses the DEFLATE algorithm, which is publicly available and free to use. What can be done is to compress the form data on the client-side using JavaScript (download a JS Gzip library if you don't want to write one yourself), then send the compressed data to the server through either GET, POST, or PUT using XMLHttpRequest.
If you are in control of your web server, you can simply grab the data and uncompress it. If you are not in control, you will have to follow whatever policies are in place. For example, some web servers may require you to set a specific Content-Type, while others may not support compressed request bodies at all.
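As a rough sketch of the server side (a hypothetical bare WSGI app, not tied to any framework; on App Engine the details may differ), assuming the client sent Content-Encoding: gzip:

    import gzip
    import io

    def application(environ, start_response):
        # Read the raw request body (trusting Content-Length here for brevity).
        length = int(environ.get("CONTENT_LENGTH") or 0)
        body = environ["wsgi.input"].read(length)

        # If the client declared gzip, decompress before parsing the form data.
        if "gzip" in environ.get("HTTP_CONTENT_ENCODING", ""):
            body = gzip.GzipFile(fileobj=io.BytesIO(body)).read()

        start_response("200 OK", [("Content-Type", "text/plain")])
        return [b"received %d bytes\n" % len(body)]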
Lastly, note that if your resource is a file that is already compressed, there may be no advantage in gzipping it. However, if your resource is large uncompressed data (e.g. a forum post of 80,000 characters), the advantages are enormous.

Slow access to Django's request.body

Sometimes this line of a Django app (hosted using Apache/mod_wsgi) takes a long time to execute (e.g. 99% of a 6-second request handling time, as measured by New Relic) when submitted by some mobile clients:
raw_body = request.body
(where request is an incoming request)
The questions I have:
What could have slowed down access to request.body so much?
What would be the correct Apache configuration to wait until the client sends the whole payload before invoking Django? Maybe the problem is in the Apache configuration.
Django's body attribute on HttpRequest is a property, so it really comes down to what is actually being done there and whether it can be made to happen outside of the Django app, if possible. I want Apache to wait for the full request before handing it to the Django app.
Regarding (1), Apache passes control to the mod_wsgi handler as soon as the request's headers are available, and mod_wsgi then passes control on to Python. The internal implementation of request.body then calls the read() method which eventually calls the implementation within mod_wsgi, which requests the request's body from Apache and, if it hasn't been completely received by Apache yet, blocks until it is available.
Regarding (2), this is not possible with mod_wsgi alone. At least, the hook processing incoming requests doesn't provide a mechanism to block until the full request is available. Another poster suggested using nginx as a proxy in a response to this duplicate question.
There are two ways you can fix this in Apache.
You can use mod_buffer, available in Apache >= 2.3, and change BufferSize to the maximum expected payload size. This should make Apache hold the request in memory until either the client finishes sending it or the buffer limit is reached.
For older Apache versions < 2.3, you can use mod_proxy combined with ProxyIOBufferSize, ProxyReceiveBufferSize and a loopback vhost. This involves putting your real vhost on a loopback interface, and exposing a proxy vhost which connects back to the real vhost. The downside to this is that it uses twice as many sockets, and can make resource calculation difficult.
However, the most ideal choice would be to enable request/response buffering at your L4/L7 load balancer. For example, haproxy lets you add rules based on req_len, and the same goes for nginx. Most good commercial load balancers also have an option to buffer requests before sending.
All three approaches rely on buffering the full request/response payload, and there are performance considerations depending on your use case and available resources. You could cache the entire payload in memory but this may dramatically decrease your maximum concurrent connections. You could choose to write the payload to local storage (preferably SSD), but you are then limited by IO capacity.
You also need to consider file uploads, because these are not a good fit for memory-based payload buffering. In most cases you would handle upload requests in your webserver, for example with nginx's HttpUploadModule, then query nginx for the upload progress, rather than handling the upload directly in WSGI. If you are buffering at your load balancer, you may wish to exclude file uploads from the buffering rules.
You need to understand why this is happening, and that this problem exists both when sending a response and receiving a request. It's also a good idea to have these protections in place, not just for scalability, but for security reasons.
I'm afraid the problem could be the amount of data you are transferring, combined with a slow connection. Also note that upload bandwidth is typically much lower than download bandwidth.
As already pointed out, when you use request.body Django will wait for the whole body to be fully transferred from the client and available in-memory (or on disk, according to configurations and size) on the server.
I would suggest trying the same request with the client connected to a WiFi access point that is wired to the server itself, and seeing if it improves greatly. If this is not possible, run a tool like speedtest.net on the client, get the request size, and do the math to see how much time it should take in theory (I'd expect the measured time to be roughly 20% more). Be careful that network speed is often measured in bits per second, while file size is measured in bytes.
In some cases, if a lot of processing is needed on the data, it may be convenient to read() the request and do the computation on the fly, or to pass the request object directly to any function that can read from a so-called "file-like object" instead of a string.
In your specific case, however, I'm afraid this would only affect the 1% of time that is not spent receiving the body from the network.
Edit:
Sorry, only now have I noticed the extra description in the bounty. I'm afraid I can't help you, but may I ask: what is the point? I'd guess this would only save a tiny bit of server resources (keeping a Python thread idle for a while), without any noticeable performance gain on the request...
Looking at the Django source, it looks like what actually happens when you call request.body is that the request body is loaded into memory by being read from a stream.
https://github.com/django/django/blob/stable/1.4.x/django/http/__init__.py#L390-L392
It's likely that if the request is large, the time being taken is actually just loading it into memory. Django has methods on the request for treating the body as a stream, which, depending on exactly what content is being consumed, could allow you to process the request more efficiently.
https://docs.djangoproject.com/en/dev/ref/request-response/#django.http.HttpRequest.read
You could, for example, read one line at a time.
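A hedged sketch of what that could look like in a Django view (the view name is made up; HttpRequest exposes a file-like interface, including line iteration):

    from django.http import HttpResponse

    def process_upload(request):
        # Consume the body incrementally instead of touching request.body,
        # which would load the whole payload into memory at once.
        line_count = 0
        for line in request:  # yields one line at a time from the stream
            line_count += 1   # ...do the real per-line processing here
        return HttpResponse("read %d lines" % line_count)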

How do you restrict large file uploads in wsgi?

I'm trying to get an understanding of the best way of handling file uploads safely in a wsgi app. It seems a lot of solutions involve using FieldStorage from the cgi module to parse form data. From what I understand about FieldStorage it performs a bit of 'magic' behind the scenes by streaming data into a tempfile.
What I'm not 100% clear on is how to restrict a request containing a file greater than a specified amount (say 10MB). If someone uploads a file which is several GB in size, you obviously want to block the request before it chews through your server's disk space, right?
What is the best way to restrict file uploads in a wsgi application?
It would depend on your front-end server. If it has any configuration to block big requests even before they reach your app, use it.
If you want to block this with your code I see two approaches:
Look at the Content-Length HTTP header. If it's bigger than you can handle, deny the request right away.
Don't trust the headers and start reading the request body until you reach your limit. Note that this is not a very clever way, but it could work. =)
Trusting the HTTP header could lead you to some problems. Suppose someone sends a request with Content-Length: 1024 but a 1 GB request body. If your front-end server trusts the header, it will start to read the request and only find out later that the body is actually much bigger than declared. This situation could still fill your server's disk, even though the request "passes" the too-big check.
Although this could happen, I think trusting the header is a good starting point.
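Here is a minimal sketch of a WSGI middleware combining both approaches: reject obviously oversized requests via Content-Length up front, and cap actual reads as a backstop. The class names and the 10 MB limit are illustrative, not from any library:

    class LimitedReader(object):
        """Wrap wsgi.input and refuse to read past a byte limit."""
        def __init__(self, stream, limit):
            self.stream = stream
            self.remaining = limit

        def read(self, size=-1):
            if size < 0 or size > self.remaining:
                size = self.remaining
            data = self.stream.read(size)
            self.remaining -= len(data)
            return data

    class MaxSizeMiddleware(object):
        def __init__(self, app, max_bytes=10 * 1024 * 1024):
            self.app = app
            self.max_bytes = max_bytes

        def __call__(self, environ, start_response):
            # Approach 1: deny the request based on the declared length.
            length = int(environ.get("CONTENT_LENGTH") or 0)
            if length > self.max_bytes:
                start_response("413 Request Entity Too Large",
                               [("Content-Type", "text/plain")])
                return [b"Upload too large.\n"]
            # Approach 2: don't trust the header; cap what can be read.
            environ["wsgi.input"] = LimitedReader(environ["wsgi.input"],
                                                  self.max_bytes)
            return self.app(environ, start_response)

A fuller version would also cap readline() and readlines(), since form parsers like cgi.FieldStorage may use those as well.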
You could use the features of the HTTP server you probably have in front of your WSGI application. For example lighttpd has many options for traffic shaping.

seek in http connection when downloading with python

I actually have two questions in one. Firstly, does the HTTP protocol allow seeking? If the wording is incorrect, what I mean is this: for example, there is a file accessible through an HTTP request on some server. The file's size is 2 GB. Can I retrieve only the last 1 GB of this file using HTTP? If this can be done, how do I do it in Python? I am asking this because I am considering writing a Python script to download the same file with parallel connections and combine the outcome.
The HTTP protocol defines a way for a client to request part of a resource; see http://www.w3.org/Protocols/rfc2616/
Since all HTTP entities are represented in HTTP messages as sequences of bytes, the concept of a byte range is meaningful for any HTTP entity. (However, not all clients and servers need to support byte-range operations.)
Therefore, in theory, you could specify a Range header indicating which part of the file you want; however, the server might just ignore the request, so you need to make sure the server supports byte ranges.
Sorry, I can't provide you with a code sample (I have never worked in Python), but this information should be sufficient to get you started. If you need further help, please ask.
HTTP lets you request a "range" of bytes of a resource; this is specified in the HTTP/1.1 RFC. Not every server and not every resource supports range retrieval, and the server may ignore the headers. The answer to this question has some example code you could look at.
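A small sketch with the standard library (the URL is a placeholder): request only the last 1 GB of a 2 GB file by asking for everything from the 1 GB mark onward.

    import urllib.request

    req = urllib.request.Request(
        "http://example.com/bigfile.bin",
        headers={"Range": "bytes=1073741824-"},  # from the 1 GB mark to the end
    )
    with urllib.request.urlopen(req) as resp:
        # 206 Partial Content means the server honored the range;
        # 200 means it ignored it and is sending the whole file.
        print(resp.status)
        data = resp.read()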

Writing a Python Music Streamer

I would like to implement a server in Python that streams music in MP3 format over HTTP. I would like it to broadcast the music such that a client can connect to the stream and start listening to whatever is currently playing, much like a radio station.
Previously, I've implemented my own HTTP server in Python using SocketServer.TCPServer (yes I know BaseHTTPServer exists, just wanted to write a mini HTTP stack myself), so how would a music streamer be different architecturally? What libraries would I need to look at on the network side and on the MP3 side?
The mp3 format was designed for streaming, which makes some things simpler than you might have expected. The data is essentially a stream of audio frames with built-in boundary markers, rather than a file header followed by raw data. This means that once a client is expecting to receive audio data, you can just start sending it bytes from any point in an existing mp3 source, whether it be live or a file, and the client will sync up to the next frame it finds and start playing audio. Yay!
Of course, you'll have to give clients a way to set up the connection. The de-facto standard is the SHOUTcast (ICY) protocol. This is very much like HTTP, but with status and header fields just different enough that it isn't directly compatible with Python's built-in http server libraries. You might be able to get those libraries to do some of the work for you, but their documented interfaces won't be enough to get it done; you'll have to read their code to understand how to make them speak SHOUTcast.
Here are a few links to get you started:
https://web.archive.org/web/20220912105447/http://forums.winamp.com/showthread.php?threadid=70403
https://web.archive.org/web/20170714033851/https://www.radiotoolbox.com/community/forums/viewtopic.php?t=74
https://web.archive.org/web/20190214132820/http://www.smackfu.com/stuff/programming/shoutcast.html
http://en.wikipedia.org/wiki/Shoutcast
I suggest starting with a single mp3 file as your data source, getting the client-server connection setup and playback working, and then moving on to issues like live sources, multiple encoding bit rates, inband meta-data, and playlists.
Playlists are generally either .pls or .m3u files, and essentially just static text files pointing at the URL for your live stream. They're not difficult and not even strictly necessary, since many (most?) mp3 streaming clients will accept a live stream URL with no playlist at all.
As for architecture, the field is pretty much wide open. You have as many options as there are for HTTP servers. Threaded? Worker processes? Event driven? It's up to you. To me, the more interesting question is how to share the data from a single input stream (the broadcaster) with the network handlers serving multiple output streams (the players). In order to avoid IPC and synchronization complications, I would probably start with a single-threaded event-driven design. In Python 2, a library like gevent will give you very good I/O performance while allowing you to structure your code in a very understandable way. In Python 3, I would prefer asyncio coroutines.
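To make the broadcaster/players idea concrete, here is a rough asyncio sketch. The file name, port, and fixed pacing are placeholders; real code would pace output by the mp3 bitrate and speak SHOUTcast/ICY rather than bare HTTP:

    import asyncio

    clients = set()  # one asyncio.Queue per connected player

    async def broadcaster():
        # Single input stream: read the source and fan chunks out to clients.
        with open("station.mp3", "rb") as source:
            while True:
                chunk = source.read(4096)
                if not chunk:
                    source.seek(0)  # loop the file for demo purposes
                    continue
                for queue in list(clients):
                    queue.put_nowait(chunk)
                await asyncio.sleep(0.1)  # crude pacing

    async def handle_client(reader, writer):
        # Minimal response headers; a real station would use the ICY variants.
        writer.write(b"HTTP/1.0 200 OK\r\nContent-Type: audio/mpeg\r\n\r\n")
        queue = asyncio.Queue()
        clients.add(queue)
        try:
            while True:
                writer.write(await queue.get())
                await writer.drain()
        except ConnectionError:
            pass
        finally:
            clients.discard(queue)

    async def main():
        asyncio.create_task(broadcaster())
        server = await asyncio.start_server(handle_client, "0.0.0.0", 8000)
        async with server:
            await server.serve_forever()

    asyncio.run(main())

Note that every client joins the stream "live": they receive whatever chunk the broadcaster is currently sending, which is exactly the radio-station behavior asked about.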
Since you already have good python experience (given you've already written an HTTP server) I can only provide a few pointers on how to extend the ground-work you've already done:
Prepare your server to deal with request headers like Accept-Encoding, Range, and TE (Transfer Encoding). An MP3-over-HTTP player (e.g. VLC) is nothing but an mp3 player that knows how to "speak" HTTP and "seek" to different positions in the file.
Use wireshark or tcpdump to sniff the actual HTTP requests made by VLC when playing an mp3 over HTTP, so you know what request headers you'll be receiving, and implement them.
Good luck with your project!
You'll want to look into serving m3u or pls files. That should give you a file format that players understand well enough to hit your http server looking for mp3 files.
A minimal m3u file would just be a simple text file with one song URL per line. Assuming you've got the following URLs available on your server:
/playlists/<playlist_name/playlist_id>
/songs/<song_name/song_id>
You'd serve a playlist from the url:
/playlists/myfirstplaylist
And the contents of the resource would be just:
/songs/1
/songs/mysong.mp3
A player (like Winamp) will be able to open the URL to the m3u file on your HTTP server and will then start streaming the first song on the playlist. All you'll have to do to support this is serve the mp3 file just like you'd serve any other static content.
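A toy sketch with the standard library's http.server, using the hypothetical paths above (the mp3 files are assumed to exist next to the script):

    from http.server import BaseHTTPRequestHandler, HTTPServer

    SONGS = {"/songs/1": "first.mp3", "/songs/mysong.mp3": "mysong.mp3"}

    class MusicHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path == "/playlists/myfirstplaylist":
                # The playlist is just the song URLs, one per line.
                body = "\n".join(SONGS).encode()
                self.send_response(200)
                self.send_header("Content-Type", "audio/x-mpegurl")
                self.end_headers()
                self.wfile.write(body)
            elif self.path in SONGS:
                # Serve the mp3 like any other static file.
                self.send_response(200)
                self.send_header("Content-Type", "audio/mpeg")
                self.end_headers()
                with open(SONGS[self.path], "rb") as f:
                    self.wfile.write(f.read())
            else:
                self.send_error(404)

    HTTPServer(("", 8000), MusicHandler).serve_forever()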
Depending on how many clients you want to support you may want to look into asynchronous IO using a library like Twisted to support tons of simultaneous streams.
Study these before getting too far:
http://wiki.python.org/moin/PythonInMusic
Specifically
http://edna.sourceforge.net/
You'll want to have a .m3u or .pls file that points at a static URI (e.g. http://example.com/now_playing.mp3), and then give clients mp3 data starting from wherever you are in the current song when they ask for that file. There are probably a bunch of minor issues I'm glossing over here... However, at least as forest points out, you can just start streaming the mp3 data from any byte.

How to use twisted for downloading a remote file?

I'm relatively new to twisted and I'm planning on using it to create a file downloader. It would accept a file URL and the number of parts in which to download the file.
What I have in mind is to split the file into however many parts the user specified, download each part through a Deferred, and assemble all the parts when they are done.
But do I need a protocol for each file to be downloaded, with each protocol dispatching a Deferred to download each file's chunks?
Is there a twisted component for reading a remote file that supports seeking? I really don't have any idea where to start.
If your mention of a URL implies that the protocol in use is HTTP (and I hope HTTP 1.1;-), then you could use twisted's relatively new HTTP 1.1 client (discussed at length here, and from the fact that the issue was marked as fixed 9 months ago I assume the client is finally in -- I have not checked that), using HTTP 1.1's range requests to get "slices" of the file.
If you're stuck with HTTP 1.0, or a not fully compliant server, you may be out of luck; if you really mean the "U" part of "URL", i.e., you need a Universal solution across all kinds of protocols, the problem of course becomes much, much harder.
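If the HTTP 1.1 client in question is twisted.web.client.Agent (which shipped around that time), a hedged sketch of fetching one slice of a file via a range request would look like this; the URL and byte range are placeholders:

    from twisted.internet import reactor
    from twisted.web.client import Agent, readBody
    from twisted.web.http_headers import Headers

    agent = Agent(reactor)
    d = agent.request(
        b"GET",
        b"http://example.com/bigfile.bin",
        Headers({b"Range": [b"bytes=0-1048575"]}),  # first 1 MiB slice
    )
    d.addCallback(readBody)  # fires with the bytes of this slice
    d.addCallback(lambda body: print("got %d bytes" % len(body)))
    d.addBoth(lambda _: reactor.stop())
    reactor.run()

You would issue one such request per part, each with its own byte range, then write the slices into the output file at their respective offsets once all the Deferreds fire.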
