I'm using elasticsearch and the RESTful API supports supports reading bodies in GET requests for search criteria.
I'm currently doing
response = urllib.request.urlopen(url, data).read().decode("utf-8")
If data is present, it issues a POST, otherwise a GET. How can I force a GET despite the fact that I'm including data (which should be in the request body as per a POST)
Nb: I'm aware I can use a source property in the Url but the queries we're running are complex and the query definition is quite verbose resulting in extremely long urls (long enough that they can interfere with some older browsers and proxies).
I'm not aware of a nice way to do this using urllib. However, requests makes it trivial (and, in fact, trivial with any arbitrary verb and request content) by using the requests.request* function:
requests.request(method='get', url='localhost/test', data='some data')
Constructing a small test web server will show that the data is indeed sent in the body of the request, and that the method perceived by the server is indeed a GET.
*note that I linked to the requests.api.requests code because that's where the actual function definition lives. You should call it using requests.request(...)
Related
The documentation I've found explaining http.client for Python seems a bit sparse. I want to use it over requests because requests has not worked for our project.
So, knowing that I'm using Python's http.client, I'm seeing again and again request and putrequest. Both methods are defined here under HTTPConnection.
HTTPConnection.request: This will send a request to the server using
the HTTP request method method and the selector url.
HTTPConnection.putrequest: This should be the first call after the
connection to the server has been made. It sends a line to the server
consisting of the method string, the url string, and the HTTP version
(HTTP/1.1). To disable automatic sending of Host: or Accept-Encoding:
headers (for example to accept additional content encodings), specify
skip_host or skip_accept_encoding with non-False values.
Also, the source code for both is defined in this file.
From my guess and reading things, it seems like request is a more high level API compared to putrequest. Is that correct?
The Answer: request() is an abstracted version of multiple functions, putrequest() being one of them.
Although this is defined in the documentation, it's easy to skip over the line that answers this question.
This is pointed out in this line of the http.client documentation:
As an alternative to using the request() method described above, you can also send your request step by step, by using the four functions below.
I'm working on an application that will have to use multiple external APIs for information and after processing the data, will output the result to a client. The client uses a web interface to query, once query is send to server, server process send requests to different API providers and after joining the responses from those APIs then return response to client.
All responses are in JSON.
current approach:
import requests
def get_results(city, country, query, type, position):
#get list of apis with authentication code for this query
apis = get_list_of_apis(type, position)
results = [ ]
for api in apis:
result = requests.get(api)
#parse json
#combine result in uniform format to display
return results
Server uses Django to generate response.
Problem with this approach
(i) This may generate huge amounts of data even though client is not interested in all.
(ii) JSON response has to be parsed based on different API specs.
How to do this efficiently?
Note: Queries are being done to serve job listings.
Most APIs of this nature allow for some sort of "paging". You should code your requests to only draw a single page from each provider. You can then consolidate the several pages locally into a single stream.
If we assume you have 3 providers, and page size is fixed at 10, you will get 30 responses. Assuming you only show 10 listings to the client, you will have to discard and re-query 20 listings. A better idea might be to locally cache the query results for a short time (say 15 minutes to an hour) so that you don't have to requery the upstream providers each time your user advances a page in the consolidated list.
As far as the different parsing required for different providers, you will have to handle that internally. Create different classes for each. The list of providers is fixed, and small, so you can code a table of which provider-url gets which class behavior.
Shameless plug but I wrote a post on how I did exactly this in Durango REST framework here.
I highly recommend using Django REST framework, it makes everything so much easier
Basically, the model on your APIs end is extremely simple and simply contains information on what external API is used and the ID for that API resource. A GenericProvider class then provides an abstract interface to perform CRUD operations on the external source. This GenericProvider uses other providers that you create and determines what provider to use via the provider field on the model. All of the data returned by the GenericProvider is then serialised as usual.
Hope this helps!
For testing I use pytest so it would be great if you suggest something pytest specific.
I have some code which uses the requests library. What it does is basically simple POST/GET requests for logging in, parsing data, etc.
Surely I want to test that code locally without doing any actual HTTP requests.
A monkeypatch funcarg could be the solution, but I think that mocking request.get(...) calls or directly pythons's urllib isn't good, because, for example, there are functions which do more than one HTTP request inside , so I can't just mock the request.get("anyURL") with a simple lambda *args, **kwaargs: """<html>response</html>""".
There are different URLs which should return different content. Sometimes it should be based on POST/GET data. Also I have no idea how will requests.session behave in case of direct mocking. Besides that how to emulate session termination? How to emulate a connection failure?
So in the end in my opinion it's quite hard to use monkey patching here. At least I am not able to write a good mocking function which will take into account everything. Also if I choose to mock urllib directly and someday requests library starts using something different all my tests will fail.
So the best way I think is to use actual HTTP server which turns on on a test run, and if possible takes into account pytest's scopes, etc (so it's a funcarg). While googling I found only two solutions:
https://pypi.python.org/pypi/pytest-localserver
https://github.com/kevin1024/pytest-httpbin
The first one sets up the HTTP server and serves predefined content over a specific URL. Definitely that does not work for me, because as I mentioned some functions which I intend to test do several requests so all inner HTTP requests.get() will get the same answer. Bad.
The second one as far a I see has the same problem. Or at least do not understand how to use it.
The third option could be writing a small Flask based service, but I guess I'll run into a problem that things I use in tests should be tested as well which is a bad practice.
You can rather unmock get after first call.
class Requester():
def get(*args):
...
def mock_get(requester, response):
orig_get = requester.get
def return_text_and_unmock(*args, **kwargs):
self.get = orig_get
return response
requester.get = return_text_and_unmock.__get__(requester, Requester)
return requester
I believe using a local server for unit testing is not a good idea as this is not really a unit test. I you're using requests one good way of being able to mock the requests is to use the module responses that is developed and maintained by dropbox: response dropbox. With responses you will be able to mock each request you make by specifying that you want a certain content to be return when a request is issued to a given URL. The README gives a quick overview of the module's abilities.
Given that :
WSGI doesn't play very well with async.
Twisted ergonomics suck.
Pyramid is very clean and component oriented.
How could I use Pyramid and Twisted ?
I can imagine making a twisted protocol to get the raw HTML request. But then I can't see how to parse it into a pyramid request objects. All documented pyramid tools seems to expect some wsgi interface at some point.
I could use waitress code to parse the request and turn it into a WSGI env then pass the env to pyramid but that's a lot of work with many issues I'm sure I can't even imagine down the road.
I know twisted includes a WSGI server, but it implies synchronicity in the app code, which does not serve my purpose. I want to be able to use the request and response objects, renderers, routers, and others pyramid tools in a twisted asynchronous protocol, with an asynchronous, non blocking app code as well. Hence I won't want to use WSGI.
Twisted API is verbose, heavy and uninuitive compared to any other asynchronous toolkit you'll find in Python or even other languages. Hence the critic about its ergonomics. I can use it, but training newcomers in my team to do it has a high cost. I wish to lower it.
Indeed, it packs a lot of power that I want to use.
To elaborate on my needs, I'm building a tool using crossbar.io and cyclone to have a WAMP/HTTP framework a bit friendlier to my team that the current tools. But cyclone is not as complete as pyramid, and I was hoping pyramid components were decoupled enough that the WSGI paradigm was not enforced, so I could leverage the tremendous work they did on it. All I need is an entry point : somewhere to get the HTML, and parse it into a request objet, and somewhere to take a response object, and returns HTML to the client. I wish i don't have to write a protocol manually for this, http is tricky and I'm sure I'll get it wrong in many ways.
One precision : i don't wish to use the full pyramid framework, just some components here and there, such as rooting, cookie parsing, CSRF protection, etc. I won't use their view system for it assumes a synchronous API.
Looking at Pyramid, I can see that it expects the entire request be be parsed and turned into a request object. it also returns the response as an object as well. So a part of the problem, to hook twisted and pyramid together, is to :
get the http request text as one big chunk from twisted;
parse it into the request object somehow (couldn't find a simple function to do this, but if I can turn it into an WSGI environ + request object, pyramid can convert it to it's format).
get the pyramid response object and turn it into a generator of strings (an adaptor can be find since that's what WSGI does).
Send the response back with twisted from this generator of strings.
Alternatives can be to use something simpler than pyramid like werkzeug for the glue.
Twisted Web lets you interpret HTTP request bodies (regardless of content-type, HTML or otherwise) incrementally as they're received - but it doesn't make doing so very easy. There's a very old ticket that we never seem to make much progress on for improving this situation. Until it's resolved, there probably isn't a better answer than the one I'm about to give. This incremental HTTP request body delivery, I think, is what you're looking for here (because you said you expect requests to "be a big HTML chunk").
The hook for incremental request body handling is Request.handleContentChunk. You can see a complete demonstration of its use in my answer to Python server for streaming request body content.
This gives you the data as it arrives at the server. If you want to use Pyramid, you'll have to construct a Pyramid request that uses this data. Most of the initialization of the Pyramid request object should be straightforward (eg filling the environ dictionary with the request headers - you can take these from Request.requestHeaders). The slightly trickier part will be initializing the Pyramid request object's body - which is supposed to be a file-like object that provides synchronous access to the request body.
On the one hand, if you dispatch the request before the request body has been completely received then you avoid the cost of buffering the entire request body in memory. On the other hand, if you let application code begin to read the request body then you have to deal with the circumstance that it tries to read beyond the point in the data which has actually arrived at the server. This can be dealt with. The body file-like object is expected to present a blocking interface. All you have to do is block until the data is available.
Here's a brief (incomplete, not meant to actually work) sketch of what I mean:
# XXX Note: Queue is not actually thread-safe. Use a safer primitive.
from Queue import Queue
class Body(object):
def __init__(self):
self._buffer = Queue()
self._pending = b""
self._eof = False
def read(self, how_many):
if self._eof:
return b""
if self._pending == b"":
data = self._buffer.get()
if data is None:
self._eof = True
return b""
else:
self._pending = data
if self._pending is None:
result = self._pending[:how_many]
self._pending = self._pending[how_many:]
return result
def _add_data(self, data):
self._buffer.put(data)
You can create an instance of this type, initialize the Pyramid request object's body attribute with it, and then call _add_data on it in the Twisted Request class's handleContentChunk callback.
You could also implement this as an enhancement to Twisted's own WSGI server. For the sake of simplicity, Twisted's WSGI server does read the entire request body before dispatching the request to the WSGI application - but it doesn't have to. If this is the only problem with WSGI then it'd be better to improve the quality of the WSGI implementation and keep the interface rather than both implementing the improvement and stepping outside of the interface (tying you more closely to both Twisted and Pyramid - unnecessarily).
The second half of the problem, generating response bodies incrementally, shouldn't really be a problem. Twisted's WSGI container will write out response data as the WSGI application object yields it. Or if you use twisted.web.resource instead of the WSGI interface, you can call request.write as many times as you like, at any time you like (up until you call request.finish). The only trick is that if you want to do this you must return NOT_DONE_YET from the render method.
I always had the idea that doing a HEAD request instead of a GET request was faster (no matter the size of the resource) and therefore had it advantages in certain solutions.
However, while making a HEAD request in Python (to a 5+ MB dynamic generated resource) I realized that it took the same time as making a GET request (almost 27 seconds instead of the 'less than 2 seconds' I was hoping for).
Used some urllib2 solutions to make a HEAD request found here and even used pycurl (setting headers and nobody to True). Both of them took the same time.
Am I missing something conceptually? is it possible, using Python, to do a 'quick' HEAD request?
The server is taking the bulk of the time, not your requester or the network. If it's a dynamic resource, it's likely that the server doesn't know all the header information - in particular, Content-Length - until it's built it. So it has to build the whole thing whether you're doing HEAD or GET.
The response time is dominated by the server, not by your request. The HEAD request returns less data (just the headers) so conceptually it should be faster, but in practice, many static resources are cached so there is almost no measureable difference (just the time for the additional packets to come down the wire).
Chances are, the bulk of that request time is actually whatever process generates the 5+MB response on the server rather than the time to transfer it to you.
In many cases, a web application will still execute the full script when responding to a HEAD request--it just won't send the full body back to the requester.
If you have access to the code that is processing that request, you may be able to add a condition in there to make it handle the request differently depending on the the method, which could speed it up dramatically.