How to overcome max url limit with python requests

I have two python apps running on separate ports using web.py. I am trying to send JSON strings in the range of 30,000-40,000 characters long from one app to another. The JSON contains all the information necessary to generate a powerpoint report. I tried enabling this communication using requests as such:
import requests
template = <long JSON string>
url = 'http://0.0.0.0:6060/api/getPpt?template={}'.format(template)
resp = requests.get(url).text
I notice that on the receiving end the JSON has been truncated to 803 characters, so when it is decoded I get:
json.decoder.JSONDecodeError: Unterminated string starting at: line 1 column 780 (char 779)
I assume this has to do with a limitation on how long a URL can be, imposed either by web.py or by requests, or that it is a standardised thing. Is there a way around this, or do I need to find another way of enabling communication between these two Python apps? If sending such long JSONs via HTTP isn't possible, could you please suggest alternatives. Thanks!

Do not put that much data into a URL. Most browsers limit the total length of a URL (including the query string) to about 2000 characters, servers to about 8000.
See What is the maximum length of a URL in different browsers?, which quotes the HTTP/1.1 standard, RFC7230:
Various ad hoc limitations on request-line length are found in practice. It is RECOMMENDED that all HTTP senders and recipients support, at a minimum, request-line lengths of 8000 octets.
You need to send that much data in the request body instead. Use POST or PUT as the method.
The requests library itself does not place any limits on the URL length; it sends the URL to the server without truncating it. It is your server that has truncated it here, instead of giving you a 414 URI Too Long status code.
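For example, a minimal sketch of the client side using POST (assuming the receiving app keeps the same /api/getPpt route but is changed to read the request body, e.g. with web.data() in web.py rather than web.input()):
import requests

template = '<long JSON string>'          # placeholder for the real 30,000-40,000 character payload
url = 'http://0.0.0.0:6060/api/getPpt'   # no query string; the data travels in the body

resp = requests.post(url, data=template,
                     headers={'Content-Type': 'application/json'})
print(resp.text)
The request body is not subject to the request-line limits quoted above, so the full template arrives intact.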

Related

Is it possible to get only the header without fetching the body using the requests.get command? the server is blocking HEAD

In a configuration I am using, a minio server hosting files accepts only GET requests and does not accept HEAD requests. I need the header information to check for file-type to avoid fetching the entire file.
I would usually do it with requests.head(url); however, as I mentioned earlier, only the GET method is allowed.
In curl it is possible to do the following:
curl -I -X GET http://domain.dom/path/
which fetches just the headers of the URL but overrides the method used with the GET HTTP method.
Is there something equivalent for the Python3 requests package?
Unfortunately there doesn't seem to be a clean way to do this. If the server accepts the Range header, you could try requesting bytes 0 to 0, which gives you access to the header data but not the body. For example:
import requests
url = "http://stackoverflow.com"
headers = {"Range": "bytes=0-0"}
res = requests.get(url, headers=headers)
print(res.headers)
As said, this still depends on the server implementation. For reference: https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Range
Based on the definition of a GET, it sounds like you could modify the request headers to include a range-request.
A client can alter the semantics of GET to be a "range request", requesting transfer of only some part(s) of the selected representation, by sending a Range header field in the request (Section 14.2).
I haven't tried this, but maybe setting a byte range of 0-1 would skip the body and you'd get the headers for free.

Checking if a URL exists and is smaller than x bytes without consuming full response

I have a use case where I want to check (from within a python/Django project) if a response to a GET request is smaller than x bytes, if the whole response completes within y seconds and if the response status is 200. The URL being tested is submitted by end users.
Some constraints:
A HEAD request is not acceptable, simply because some servers might not include a Content-Length, might lie about it, or might simply block HEAD requests.
I would not like to consume the full GET response body. Imagine an end user submitting a URL to a 10 GB file... all my server bandwidth (and memory) would be consumed by this.
tl;dr: Is there any Python HTTP API that:
Accepts a timeout for the whole transaction. (I think httplib2 does this)
Checks that the response status is 200 (all HTTP libraries do this)
Kills the request (perhaps with RST) once x bytes have been received, to avoid bandwidth starvation.
The x here would probably be in order of KBs, y would be few seconds.
You could open the URL in urllib and read(x+1) from the returned object. If the length of the returned string is x+1, then the resource is larger than x. Then call close() on the object to close the connection, i.e. kill the request. In the worst case, this will fill the OS's TCP buffer, which is something you can not avoid anyway; usually, this should not fetch more than a few kB more than x.
If you furthermore add a Range header to the request, sane servers will close the connection themselves after x+1 bytes. Note that this changes the reply code to 206 Partial Content, or 416 Requested range not satisfiable if the file is too small. Servers which do not support this will ignore the header, so this should be a safe measure.
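A minimal sketch of that approach, assuming Python 3's urllib.request (MAX_BYTES and TIMEOUT_SECONDS are placeholder values for x and y):
import urllib.request
import urllib.error

MAX_BYTES = 64 * 1024   # x: largest acceptable response, placeholder value
TIMEOUT_SECONDS = 5     # y: applies to the connect and each blocking read

def exists_and_small(url, max_bytes=MAX_BYTES, timeout=TIMEOUT_SECONDS):
    # Ask well-behaved servers to stop after x+1 bytes; others ignore the header.
    req = urllib.request.Request(url, headers={"Range": "bytes=0-%d" % max_bytes})
    try:
        resp = urllib.request.urlopen(req, timeout=timeout)
    except urllib.error.URLError:         # also covers HTTPError, e.g. 404 or 416
        return False
    try:
        body = resp.read(max_bytes + 1)   # never read more than x+1 bytes
        return len(body) <= max_bytes     # status was 200 or 206 if we got this far
    finally:
        resp.close()                      # close the connection, i.e. kill the request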

Timeout with Python Requests + Clojure HttpKit Server but not Ring server

I have some Ring routes which I'm running in one of two ways:
lein ring server, with the lein-ring plugin
using org.httpkit.server, like (hs/run-server app {:port 3000})
It's a web app (being consumed by an Angular.js browser client).
I have some API tests written in Python using the Requests library:
my_r = requests.post(MY_ROUTE,
                     data=MY_DATA,
                     headers={"Content-Type": "application/json"},
                     timeout=10)
When I use lein ring server, this request works fine in the JS client and the Python tests.
When I use httpkit, this works fine in the JS client but the Python client times out with
socket.timeout: timed out
I can't figure out why the Python client is timing out. It happens with httpkit but not with lein-ring, so I can only assume that the cause is related to the difference.
I've looked at the traffic in WireShark and both look like they give the correct response. Both have the same Content-Length field (15 bytes).
I've raised the number of threads to 10 (shouldn't need to) and no change.
Any ideas what's wrong?
I found how to fix this, but no satisfactory explanation.
I was using wrap-json-response Ring middleware to take a HashMap and convert it to JSON. I switched to doing my own conversion in my handler with json/write-str, and this fixes it.
At a guess it might be something to do with the server handling output buffering, but that's speculation.
I've combed through the Wireshark dumps and I can't see any relevant differences between the two. The sent Content-Length fields are identical. The 'bytes in flight' differ, at 518 and 524.
No clue as to why the web browser was happy with this but Python Requests wasn't, or whether or not this is a bug in Requests, httpkit, ring-middleware-format or my own code.

Send a GET request with a body

I'm using Elasticsearch, and its RESTful API supports reading bodies in GET requests for search criteria.
I'm currently doing
response = urllib.request.urlopen(url, data).read().decode("utf-8")
If data is present, it issues a POST, otherwise a GET. How can I force a GET despite the fact that I'm including data (which should go in the request body, as with a POST)?
NB: I'm aware I can use a source property in the URL, but the queries we're running are complex and the query definition is quite verbose, resulting in extremely long URLs (long enough that they can interfere with some older browsers and proxies).
I'm not aware of a nice way to do this using urllib. However, requests makes it trivial (and, in fact, trivial with any arbitrary verb and request content) by using the requests.request* function:
requests.request(method='get', url='http://localhost/test', data='some data')
Constructing a small test web server will show that the data is indeed sent in the body of the request, and that the method perceived by the server is indeed a GET.
*Note that I linked to the requests.api code because that's where the actual function definition lives. You should call it using requests.request(...)
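A minimal sketch of such a test server, using only the standard library (the port number is an arbitrary choice for this example):
from http.server import BaseHTTPRequestHandler, HTTPServer

class EchoHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Read the request body, if any, using the Content-Length header.
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length)
        print("method:", self.command, "body:", body)
        self.send_response(200)
        self.end_headers()

HTTPServer(("localhost", 8000), EchoHandler).serve_forever()
Pointing the requests.request call above at http://localhost:8000/test should print the GET method together with b'some data' as the body.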

Inconsistent behavior with HTTP POST requests in Python

Trying to make a POST request between a Python (WSGI) and a NodeJS + Express application. They are on different servers.
The problem is that when using different IP addresses (i.e. private network vs. public network), a urllib2 request on the public network succeeds, but the same request for the private network fails with a 502 Bad Gateway or URLError [32] Broken pipe.
The urllib2 code I'm using is this:
req = urllib2.Request(url, "{'some':'data'}", {'Content-Type' : 'application/json; charset=utf-8'})
res = urllib2.urlopen(req)
print res.read()
Now, I have also coded the request like this, using requests:
r = requests.post(url, headers = {'Content-Type' : 'application/json; charset=utf-8'}, data = "{'some':'data'}")
print r.text
And get a 200 OK response. This alternate method works for both networks.
I am interested in finding out if there is some additional configuration needed for a urllib2 request that I don't know of, or if I need to look into some network configuration which might be missing (I don't believe this is the case, since the alternate request method works, but I could definitely be wrong).
Any suggestions or pointers with this will be greatly appreciated. Thanks!
The problem here is that, as Austin Phillips pointed out, urllib2.Request's constructor's data parameter:
may be a string specifying additional data to send to the server… data should be a buffer in the standard application/x-www-form-urlencoded format. The urllib.urlencode() function takes a mapping or sequence of 2-tuples and returns a string in this format.
By passing it JSON-encoded data instead of urlencoded data, you're confusing it somewhere.
However, Request has a method add_data:
Set the Request data to data. This is ignored by all handlers except HTTP handlers — and there it should be a byte string, and will change the request to be POST rather than GET.
If you use this, you should probably also use add_header rather than passing it in the constructor, although that doesn't seem to be mentioned specifically anywhere in the documentation.
So, this should work:
req = urllib2.Request(url)
req.add_data("{'some':'data'}")
req.add_header('Content-Type', 'application/json; charset=utf-8')
res = urllib2.urlopen(req)
In a comment, you said:
The reason I don't want to just switch over to requests without finding out why I'm seeing this problem is that there may be some deeper underlying issue that this points to that could come back and cause harder-to-detect problems later on.
If you want to find deep underlying issues, you're not going to do that by just looking at your client-side source. The first step to figuring out "Why does X work but Y fails?" with network code is to figure out exactly what bytes X and Y each send. Then you can try to narrow down what the relevant difference is, and then figure out what part of your code is causing Y to send the wrong data in the relevant place.
You can do this by logging things at the service (if you control it), running Wireshark, etc., but the easiest way, for simple cases, is netcat. You'll need to read man nc for your system (and, on Windows, you'll need to get and install netcat before you can run it), because the syntax is different for each version, but it's always something simple like nc -kl 12345.
Then, in your client, change the URL to use localhost:12345 in place of the hostname, and it'll connect up to netcat and send its HTTP request, which will be dumped to the terminal. You can then copy that and use nc HOST 80 and paste it to see how the real server responds, and use that to narrow down where the problem is. Or, if you get stuck, at least you can copy and paste the data to your SO question.
One last thing: This is almost certainly not relevant to your problem (because you're sending the exact same data with requests and it's working), but your data is not actually valid JSON, because it uses single quotes instead of double quotes. According to the docs, string is defined as:
string
    ""
    " chars "
(The docs have a nice graphical representation as well.)
In general, except for really simple test cases, you don't want to write JSON by hand. In many cases (including yours), all you have to do is replace the "…" with json.dumps(…), so this isn't a serious hardship. So:
req = urllib2.Request(url)
req.add_data(json.dumps({'some':'data'}))
req.add_header('Content-Type', 'application/json; charset=utf-8')
res = urllib2.urlopen(req)
So, why is it working? Well, in JavaScript, single-quoted strings are legal, as well as other things like backslash escapes that aren't valid in JSON, and any JS code that uses restricted-eval (or, worse, raw eval) for parsing will accept it. And, because so many people got used to writing bad JSON because of this, many browsers' native JSON parsers and many JSON libraries in other languages have workarounds to allow common errors. But you shouldn't rely on that.
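A quick check with the standard json module shows the same thing: the double-quoted form parses, while the single-quoted form is rejected.
import json

print(json.loads('{"some": "data"}'))    # parses fine: double quotes are valid JSON
try:
    json.loads("{'some': 'data'}")       # single quotes are not valid JSON
except ValueError as err:                # json.JSONDecodeError subclasses ValueError
    print(err)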
