Seek in HTTP connection when downloading with Python

I actually have two questions in one. First, does the HTTP protocol allow seeking? If the wording is incorrect, what I mean is this: for example, there is a file accessible through an HTTP request on some server. The file's size is 2 GB. Can I retrieve only the last 1 GB of this file using HTTP? If this can be done, how do I do it in Python? I am asking because I am considering writing a Python script to download the same file with parallel connections and combine the outcome.

The HTTP protocol defines a way for a client to request part of a resource; see http://www.w3.org/Protocols/rfc2616/:
Since all HTTP entities are represented in HTTP messages as sequences of bytes, the concept of a byte range is meaningful for any HTTP entity. (However, not all clients and servers need to support byte-range operations.)
Therefore, in theory, you could specify a Range header to indicate which part of the file you want; however, the server might just ignore the request. So you need to make sure the server supports byte ranges.
Sorry, I can't provide you with a code sample; I have never worked in Python, but this information should be sufficient to get you started. If you need further help, please ask.

HTTP lets you request a "range" of bytes of a resource; this is specified in the HTTP/1.1 RFC. Not every server and not every resource supports range retrieval, and a server may simply ignore the header. The answer to this question has some example code you could look at.
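A minimal sketch of such a range request, assuming the server honours byte ranges and that the requests library is available (the URL and file name are made up for illustration):

import requests

url = "http://example.com/bigfile.bin"   # made-up URL
one_gb = 1024 ** 3

# A suffix range ("bytes=-N") asks for the last N bytes of the resource.
resp = requests.get(url, headers={"Range": "bytes=-%d" % one_gb}, stream=True)

if resp.status_code == 206:   # 206 Partial Content: the server honoured the range
    with open("last_gigabyte.bin", "wb") as out:
        for chunk in resp.iter_content(chunk_size=64 * 1024):
            out.write(chunk)
else:
    # A server that ignores Range answers 200 and sends the whole file.
    print("Server ignored the Range header (status %d)" % resp.status_code)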

Related

how to unit test a REST client for an unreliable server?

I'm making a Python-based REST client for a 3rd-party service that's still under development. The issue is how to test/verify that the client will work under ALL kinds of scenarios, including incorrect responses.
The client uses the Requests library to make the remote REST calls (mostly GET and POST). And for unit testing, I'm thinking of employing the HTTPretty module to simulate/mock the server responses.
The problem is how to deal with the sheer number of possible test cases. Consider the following made-up API;
REQUEST (GET) = http://example.com/new_api?param1=34&param2=hello
RESPONSE = {"value1":34,"value2":"a string"}
I find myself needing to write unit test cases for the following scenarios -
client sending correct number of parameters
client sending incorrect parameter values
client missing a parameter
server's correct responses for above scenarios
server not sending back all the required values
server mixing up value parameters (returning a string instead of a number)
server sending back HTML instead of JSON
... etc
The intent behind all this extensive testing is to help identify where an error could originate from. i.e. is it my client that's having an issue, or the 3rd party server?
Does anyone know of a good way to organize a Python test suite to accommodate these scenarios? Writing unit test functions feels like it will become a never-ending task... :(
That's the goal of unit testing: to test all the cases where you think you might have to handle errors. On the other hand, you do not need to test things that are already handled naturally by the system.
Note that HTTP is an application-level protocol where the client always initiates the request and the server just responds. What I mean by this is that because you are developing the client, you are not responsible for the server's response. Your goal is just to send the appropriate requests.
On the other hand, there are HTTP responses which might trigger behaviours on the client side, and these you want to test. For example, the server answers with a 301, and you want to test whether your client does the right thing by initiating the next request, grabbing the value of the Location: HTTP header.
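Here is a minimal sketch of such a test using HTTPretty together with requests; the URLs reuse the made-up API above, and the bare requests.get call is a stand-in for whatever your client wrapper actually does:

import unittest

import httpretty
import requests


class RedirectHandlingTest(unittest.TestCase):
    @httpretty.activate
    def test_client_follows_301(self):
        # The old endpoint answers 301 and points at the new one.
        httpretty.register_uri(
            httpretty.GET, "http://example.com/old_api",
            status=301,
            adding_headers={"Location": "http://example.com/new_api"},
        )
        # The new endpoint returns the expected JSON payload.
        httpretty.register_uri(
            httpretty.GET, "http://example.com/new_api",
            body='{"value1": 34, "value2": "a string"}',
            content_type="application/json",
        )

        response = requests.get("http://example.com/old_api")

        self.assertEqual(response.status_code, 200)
        self.assertEqual(response.json()["value1"], 34)


if __name__ == "__main__":
    unittest.main()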
In the case of a REST API (i.e., hypertext-driven), your client will parse the content of the HTTP responses, specifically the set of links and/or their associated rel values. Based on these values, the client may make decisions or expose possible choices to the users. These you have to test.
If the server doesn't give the information inside the HTTP response for continuing your exploration on the client side, then it's not a REST API, but a perfectly valid HTTP API. Simple as that. It becomes even easier to test. Nothing much to do.

Slow access to Django's request.body

Sometimes this line of a Django app (hosted using Apache/mod_wsgi) takes a lot of time to execute (e.g. 99% of a 6-second request handling time, as measured by New Relic) when the request is submitted by some mobile clients:
raw_body = request.body
(where request is an incoming request)
The questions I have:
What could have slowed down access to request.body so much?
What would be the correct Apache configuration to wait until the client has sent the whole payload before invoking Django? Maybe the problem is in the Apache configuration.
Django's body attribute on HttpRequest is a property, so the question really comes down to what is actually being done there and how to make it happen outside of the Django app, if possible. I want Apache to wait for the full request before handing it to the Django app.
Regarding (1), Apache passes control to the mod_wsgi handler as soon as the request's headers are available, and mod_wsgi then passes control on to Python. The internal implementation of request.body then calls the read() method which eventually calls the implementation within mod_wsgi, which requests the request's body from Apache and, if it hasn't been completely received by Apache yet, blocks until it is available.
Regarding (2), this is not possible with mod_wsgi alone. At least, the hook processing incoming requests doesn't provide a mechanism to block until the full request is available. Another poster suggested to use nginx as a proxy in a response to this duplicate question.
There are two ways you can fix this in Apache.
You can use mod_buffer, available in Apache >= 2.3, and set BufferSize to the maximum expected payload size. This should make Apache hold the request in memory until it's either finished sending or the buffer limit is reached.
For older Apache versions < 2.3, you can use mod_proxy combined with ProxyIOBufferSize, ProxyReceiveBufferSize and a loopback vhost. This involves putting your real vhost on a loopback interface, and exposing a proxy vhost which connects back to the real vhost. The downside to this is that it uses twice as many sockets, and can make resource calculation difficult.
However, the ideal choice would be to enable request/response buffering at your L4/L7 load balancer. For example, haproxy lets you add rules based on req_len, and the same goes for nginx. Most good commercial load balancers also have an option to buffer requests before sending.
All three approaches rely on buffering the full request/response payload, and there are performance considerations depending on your use case and available resources. You could cache the entire payload in memory but this may dramatically decrease your maximum concurrent connections. You could choose to write the payload to local storage (preferably SSD), but you are then limited by IO capacity.
You also need to consider file uploads, because these are not a good fit for memory based payload buffering. In most cases, you would handle upload requests in your webserver, for example HttpUploadModule, then query nginx for the upload progress, rather than handling it directly in WSGI. If you are buffering at your load balancer, then you may wish to exclude file uploads from the buffering rules.
You need to understand why this is happening, and that this problem exists both when sending a response and receiving a request. It's also a good idea to have these protections in place, not just for scalability, but for security reasons.
I'm afraid the problem could be in the amount of data you are transferring and possibly a slow connection. Also note that upload bandwidth is typically much less than download bandwidth.
As already pointed out, when you use request.body Django will wait for the whole body to be fully transferred from the client and available in-memory (or on disk, according to configurations and size) on the server.
I would suggest you try the same request while the client is connected to a WiFi access point that is wired to the server itself, and see if it improves greatly. If this is not possible, just run a tool like speedtest.net on the client, get the request size and do the math to see how much time it would theoretically require (I'd expect the measured time to be roughly 20% more); for example, a 1.5 MB request body over a 1 Mbit/s uplink takes about 1.5 * 8 / 1 = 12 seconds. Be careful that network speed is often measured in bits per second, while file size is measured in bytes.
In some cases, if a lot of processing is needed on the data, it may be convenient to read() the request and do computations on the fly, or perhaps directly pass the request object to any function that can read from a so-called "file-like object" instead of a string.
In your specific case, however, I'm afraid this would only affect that 1% of time that is not spent in receiving the body from the network.
Edit:
Sorry, only now have I noticed the extra description in the bounty. I'm afraid I can't help you, but may I ask, what is the point? I'd guess this would only save a tiny bit of server resources by not keeping a Python thread idle for a while, without any noticeable performance gain on the request...
Looking at the Django source, it looks like what actually happens when you call request.body is that the request body is loaded into memory by being read from a stream.
https://github.com/django/django/blob/stable/1.4.x/django/http/__init__.py#L390-L392
It's likely that if the request is large, the time being taken is actually just loading it into memory. Django has methods on the request for treating the body as a stream, which, depending on what content is being consumed, could allow you to process the request more efficiently.
https://docs.djangoproject.com/en/dev/ref/request-response/#django.http.HttpRequest.read
You could for example read one line at a time.
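A minimal sketch of consuming the body incrementally instead of materialising it with request.body; the view name and the per-line handling are made up for illustration:

from django.http import HttpResponse

def ingest(request):
    total = 0
    # HttpRequest exposes a file-like interface: read(), readline(), and
    # line-by-line iteration (see the HttpRequest.read docs linked above).
    for line in request:
        total += len(line)        # replace with real per-line processing
    return HttpResponse("received %d bytes\n" % total)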

How do you restrict large file uploads in wsgi?

I'm trying to get an understanding of the best way of handling file uploads safely in a wsgi app. It seems a lot of solutions involve using FieldStorage from the cgi module to parse form data. From what I understand about FieldStorage it performs a bit of 'magic' behind the scenes by streaming data into a tempfile.
What I'm not 100% clear on is how to restrict a request containing a file greater than a specified amount (say 10MB). If someone uploads a file which is several GB in size you obviously want to block the request before it chews through your server's disk space right?
What is the best way to restrict file uploads in a wsgi application?
It depends on your front-end server. If it has any configuration to block big requests even before they reach your app, use it.
If you want to block this with your code I see two approaches:
Look at the Content-Length HTTP header. If it's bigger than you can handle, deny the request right away.
Don't trust the header, and start reading the request body until you reach your limit. Note that this is not a very clever way, but it could work. =)
Trusting the HTTP header could lead you to some problems. Suppose someone sends a request with Content-Length: 1024 but sends a 1 GB request body. If your front-end server trusts the header, it will start to read the request and will find out later that the request body is actually much bigger than it should be. This situation could still fill your server's disk, even for a request that "passes" the "too big" check.
Although this could happen, I think trusting the header would be a good starting point.
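As a rough sketch, both approaches can be combined into a piece of WSGI middleware. The class names and the 10 MB limit are made up, and readline()/readlines() wrapping is omitted for brevity, although form parsers such as cgi.FieldStorage may use them too:

MAX_BODY = 10 * 1024 * 1024  # 10 MB


class LimitedStream(object):
    """Wraps wsgi.input so the application can never read past `limit` bytes."""

    def __init__(self, stream, limit):
        self.stream = stream
        self.remaining = limit

    def read(self, size=None):
        if size is None or size > self.remaining:
            size = self.remaining
        data = self.stream.read(size)
        self.remaining -= len(data)
        return data


class LimitUploadSize(object):
    def __init__(self, app, max_body=MAX_BODY):
        self.app = app
        self.max_body = max_body

    def __call__(self, environ, start_response):
        # Approach 1: reject early based on the declared Content-Length.
        try:
            length = int(environ.get("CONTENT_LENGTH") or 0)
        except ValueError:
            length = 0
        if length > self.max_body:
            start_response("413 Request Entity Too Large",
                           [("Content-Type", "text/plain")])
            return [b"Upload too large\n"]

        # Approach 2: don't trust the header; cap what the app can actually
        # read from wsgi.input so a lying Content-Length can't fill the disk.
        environ["wsgi.input"] = LimitedStream(environ["wsgi.input"], self.max_body)
        return self.app(environ, start_response)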
You could use the features of the HTTP server you probably have in front of your WSGI application. For example lighttpd has many options for traffic shaping.

How to use twisted for downloading a remote file?

I'm relatively new to twisted and I'm planning on using it to create a file downloader. It would accept a file url and a number of parts to download the file.
What I have in mind is to split the file into however many parts the user specifies, download each part through a deferred, and assemble all the parts once they are done.
But do I need a protocol for each file to be downloaded, and have each protocol dispatch a deferred to download each file's chunks?
Is there a twisted component for reading a remote file that supports seek? I really don't have any idea where to start.
If your mention of a URL implies that the protocol in use is HTTP (and I hope HTTP 1.1 ;-), then you could use twisted's relatively new HTTP 1.1 client (discussed at length here; from the fact that the issue was marked as fixed 9 months ago, I assume the client is finally in -- I have not checked that), using HTTP 1.1's range requests to get "slices" of the file.
If you're stuck with HTTP 1.0, or a not fully compliant server, you may be out of luck; if you really mean the "U" part of "URL", i.e., you need a Universal solution across all kinds of protocols, the problem of course becomes much, much harder.
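A minimal sketch of fetching one slice with the modern twisted.web.client.Agent API (readBody appeared in a later Twisted release than the one the answer refers to); the URL and byte range are made up for illustration:

from twisted.internet import reactor
from twisted.web.client import Agent, readBody
from twisted.web.http_headers import Headers


def fetch_range(url, start, end):
    """Fetch bytes [start, end] of url; the deferred fires with that slice."""
    agent = Agent(reactor)
    headers = Headers({b"Range": [b"bytes=%d-%d" % (start, end)]})
    d = agent.request(b"GET", url, headers, None)
    d.addCallback(readBody)   # collect the body of this slice as bytes
    return d


def done(chunk):
    print("got %d bytes" % len(chunk))
    reactor.stop()


def failed(failure):
    print(failure)
    reactor.stop()


d = fetch_range(b"http://example.com/bigfile.bin", 0, 1048575)
d.addCallback(done)
d.addErrback(failed)
reactor.run()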

Sending gzipped form data

I've heard how browsers can receive gzipped pages from the server. Can they also gzip form data that they send to the server? And if it's possible, how would I decompress this data on the server?
I'm using AppEngine's webapp module, but a general explanation / pointers to tutorials would be sufficient. I've done some googling to no avail.
Short answer: No.
See: Why can't browser send gzip request?
I think that browsers probably can send gzipped form data to the server. I don't know if it is common to do so or not.
The sender would need to make sure to include a Content-Encoding: header with a value that includes gzip. The body of the message would then need to be gzip-encoded, and one can compress/decompress gzipped data in Python with the gzip.GzipFile class. I don't know if the gzip module is available on App Engine -- if it requires a C-module implementation, then it probably wouldn't be (I'm not sure whether it does).
As far as the decoding goes, it's possible that the web machinery that runs before your App Engine program gets any input will decode gzipped content. I've done almost no work with App Engine, so I'm not familiar with that sort of detail. It's possible, though, that you just don't have to worry about it on the server end... it just gets taken care of automatically. You'd have to check.
It might be useful to look at RFC2616, especially the sections for Accept-Encoding and Content-Encoding.
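A minimal sketch of the client side, assuming the intermediate machinery does not decode the body for you; the URL is made up, and requests stands in for any HTTP client that lets you set the raw body and headers:

import gzip
import io

import requests

payload = b"param1=34&param2=hello"

# Compress the form-encoded body with gzip.GzipFile.
buf = io.BytesIO()
with gzip.GzipFile(fileobj=buf, mode="wb") as gz:
    gz.write(payload)

resp = requests.post(
    "http://example.com/handler",                     # made-up URL
    data=buf.getvalue(),
    headers={
        "Content-Encoding": "gzip",
        "Content-Type": "application/x-www-form-urlencoded",
    },
)

# Server side (e.g. inside a webapp RequestHandler.post), assuming nothing
# upstream has already decoded the body:
#   body = gzip.GzipFile(fileobj=io.BytesIO(self.request.body)).read()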
Short answer:
No, most browsers will not compress form data for client requests.
Long answer:
Yes, the client can still send compressed form data. But since browsers won't compress the data for us, we've got to compress it ourselves.
Gzip uses the DEFLATE algorithm, which is publicly available and free to use. What can be done is to compress the form data on the client-side using JavaScript (download a JS Gzip library if you don't want to write one yourself), then send the compressed data to the server through either GET, POST, or PUT using XMLHttpRequest.
If you are in control of your web server, you can simply grab the data and uncompress it. If you are not in control, you will have to follow whatever policies are in place. For example, some web servers may require you to set a specific Content-Type, while others may not support compressed request bodies at all.
Lastly, note that if your resource is a file that is already compressed, there may be no advantage in gzipping it. However, if your resource is large uncompressed data (e.g. a forum post of 80,000 characters), the advantages are enormous.
