I'm writing a web app that uses several third-party web APIs, and I want to keep track of the low-level requests and responses for ad hoc analysis. So I'm looking for a recipe that will get Python's urllib2 to log all bytes transferred via HTTP. Maybe a subclassed Handler?
Well, I've found how to set up the built-in debugging mechanism of the library:
import logging, urllib2, sys

# Handlers with httplib-level debugging enabled; at debuglevel 1,
# httplib prints request lines, headers, and replies as they happen.
hh = urllib2.HTTPHandler()
hsh = urllib2.HTTPSHandler()
hh.set_http_debuglevel(1)
hsh.set_http_debuglevel(1)
opener = urllib2.build_opener(hh, hsh)

# Send any logging output to stdout and let every level through.
logger = logging.getLogger()
logger.addHandler(logging.StreamHandler(sys.stdout))
logger.setLevel(logging.NOTSET)
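To actually route requests through these handlers, use the opener directly or install it globally (URL hypothetical):

urllib2.install_opener(opener)
body = urllib2.urlopen('http://www.example.com/').read()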
But I'm still looking for a way to dump all the information transferred.
This looks pretty tricky to do. There are no hooks in urllib2, urllib, or httplib (which this builds on) for intercepting either input or output data.
The only thing that occurs to me, other than switching tactics to use an external tool (of which there are many, and most people use such things), would be to write a subclass of socket.socket in your own new module (say, "capture_socket") and then insert that into httplib using "import capture_socket; import httplib; httplib.socket = capture_socket". You'd have to copy all the necessary references (anything of the form "socket.foo" that is used in httplib) into your own module, but then you could override things like recv() and sendall() in your subclass to do what you like with the data.
Complications would likely arise if you were using SSL, and I'm not sure whether this would be sufficient or whether you'd also have to make your own socket._fileobject. It appears doable though, and perusing the source of httplib.py and socket.py in the standard library would tell you more.
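A rough, untested sketch of that idea (all names here are illustrative, and Python 2 is assumed). One wrinkle: in Python 2, socket.socket binds recv as an instance attribute in __init__, so a plain method override would be shadowed:

# capture_socket.py
import socket as _socket

def _log(direction, data):
    import sys
    sys.stderr.write('%s %r\n' % (direction, data))

class CaptureSocket(_socket.socket):
    def __init__(self, *args, **kwargs):
        _socket.socket.__init__(self, *args, **kwargs)
        real_recv = self.recv  # bound as an instance attribute by the base class
        def recv(*args):
            data = real_recv(*args)
            _log('<<', data)
            return data
        self.recv = recv

    def sendall(self, data, *args):
        _log('>>', data)
        return _socket.socket.sendall(self, data, *args)

# Mirror every "socket.foo" name that httplib uses; check httplib.py
# for the full set on your Python version.
socket = CaptureSocket
error = _socket.error
timeout = _socket.timeout

def create_connection(address, timeout=_socket._GLOBAL_DEFAULT_TIMEOUT,
                      source_address=None):
    # Simplified: the real create_connection iterates over getaddrinfo results.
    sock = CaptureSocket(_socket.AF_INET, _socket.SOCK_STREAM)
    if timeout is not _socket._GLOBAL_DEFAULT_TIMEOUT:
        sock.settimeout(timeout)
    sock.connect(address)
    return sock

After "import capture_socket; import httplib; httplib.socket = capture_socket", sends should be captured; reads, however, mostly flow through sock.makefile() and the underlying _fileobject, which is exactly the complication mentioned above.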
Related
I am trying to make a simple HTTP request to a server inside my company, from a dev server. I figured out that, depending on the origin/destination server, I might or might not be forced to use the qualified name of the destination server, like srvdestination.com.company.world instead of just srvdestination.
I am OK with this, but I don't understand why my DB connection still works.
Let's say I have srvorigin. Now, to make the HTTP request, I must use the qualified name srvdestination.com.company.world. However, for the database connection, a connection string with the unqualified name is enough: psycopg2.connect(host='srvdestination', ...). I understand that the protocols are different, but how does psycopg2 manage to resolve the real name?
First, it all depends on how the name resolution subsystem of your OS is configured. If you are on Unix (you did not specify), this is governed by /etc/resolv.conf. There you can provide the OS with a search list: if a name does not have "enough" dots (the number is configurable), a suffix is appended and resolution is retried.
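For illustration, a search list for the domain in the question might look like this in /etc/resolv.conf (the nameserver address is hypothetical):

nameserver 10.0.0.53
search com.company.world
options ndots:1

With this in place, a lookup of srvdestination is also attempted as srvdestination.com.company.world.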
The library you use to do the HTTP request may not query the OS for name resolution and may instead do its DNS resolution itself. In that case, it can only work with the information you give it (though it could also re-use the OS's /etc/resolv.conf and the information in it), hence the need to use the full name.
By contrast, psycopg2 may use the OS resolution mechanism and hence deal with "short" names just fine.
Both libraries should have documentation on how they handle hostnames; otherwise you need to study their source code. I guess psycopg2 is a wrapper around libpq, the standard PostgreSQL client library, written in C if I am not mistaken, which hence certainly uses the standard OS resolution process.
I can understand the curiosity around this difference, but anyway my advice is to keep short names for when you type commands in a shell and the like (and even there they can be a problem), but to always use FQDNs (Fully Qualified Domain Names) in your programs and configuration files. You will avoid a lot of problems.
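To see how the OS resolver treats each form, a quick check from Python (these are the hostnames from the question, so they will only resolve inside that network):

import socket

# Succeeds only if the OS search list appends the domain suffix.
print(socket.getaddrinfo('srvdestination', 5432))

# Succeeds regardless of any search list.
print(socket.getaddrinfo('srvdestination.com.company.world', 5432))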
For testing I use pytest, so it would be great if you could suggest something pytest-specific.
I have some code which uses the requests library. What it does is basically simple POST/GET requests for logging in, parsing data, etc.
Naturally, I want to test that code locally without making any actual HTTP requests.
A monkeypatch funcarg could be the solution, but I think that mocking requests.get(...) calls, or Python's urllib directly, isn't good, because, for example, there are functions which make more than one HTTP request inside, so I can't just mock requests.get("anyURL") with a simple lambda *args, **kwargs: """<html>response</html>""".
There are different URLs which should return different content, sometimes based on the POST/GET data. Also, I have no idea how requests.session will behave under direct mocking. Besides that, how do I emulate session termination? How do I emulate a connection failure?
So in the end, in my opinion, it's quite hard to use monkeypatching here. At least I am not able to write a good mocking function which takes everything into account. Also, if I choose to mock urllib directly and someday the requests library starts using something different, all my tests will fail.
So I think the best way is to use an actual HTTP server which starts up for a test run and, if possible, respects pytest's scopes, etc. (so it's a funcarg). While googling I found only two solutions:
https://pypi.python.org/pypi/pytest-localserver
https://github.com/kevin1024/pytest-httpbin
The first one sets up an HTTP server and serves predefined content over a specific URL. That definitely does not work for me because, as I mentioned, some of the functions I intend to test make several requests, so every inner requests.get() call would get the same answer. Bad.
The second one, as far as I can see, has the same problem. Or at least I do not understand how to use it otherwise.
The third option could be writing a small Flask-based service, but then I suspect the things I use in tests would themselves need to be tested, which is bad practice.
You could instead unmock get after the first call:
class Requester(object):
    def get(self, *args):
        ...

def mock_get(requester, response):
    orig_get = requester.get
    def return_text_and_unmock(self, *args, **kwargs):
        # Restore the original get, then return the canned response.
        self.get = orig_get
        return response
    # Bind the replacement to this instance so it receives self.
    requester.get = return_text_and_unmock.__get__(requester, Requester)
    return requester
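Usage might look like this (the response text is illustrative):

r = mock_get(Requester(), '<html>response</html>')
print(r.get('http://example.com'))  # canned response; unmocks itself
print(r.get('http://example.com'))  # the real Requester.get from here on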
I believe using a local server for unit testing is not a good idea, as that is not really a unit test. If you're using requests, one good way of mocking your requests is the responses module, developed and maintained by Dropbox. With responses you can mock each request you make, specifying the content to be returned when a request is issued to a given URL. The README gives a quick overview of the module's abilities.
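A minimal sketch of how that looks (the login URL is hypothetical); responses.add registers a canned reply, and @responses.activate scopes the mocking to the test:

import responses
import requests

@responses.activate
def test_login():
    # Any POST to this URL now returns the canned JSON body.
    responses.add(responses.POST, 'http://example.com/login',
                  body='{"ok": true}', status=200,
                  content_type='application/json')
    resp = requests.post('http://example.com/login')
    assert resp.json()['ok'] is True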
I want to write code to transfer a file from one site to another. This can be a large file, and I'd like to do it without creating a local temporary file.
I saw the trick of using mmap to upload a large file in Python: "HTTP Post a large file with streaming", but what I really need is a way to link up the response from the GET to creating the POST.
Anyone done this before?
You can't, or at least shouldn't.
urllib2 request objects have no way to stream data into them on the fly, period. And in the other direction, response objects are file-like objects, so in theory you can read(8192) out of them instead of read(), but for most protocols—including HTTP—it will either often or always read the whole response into memory and serve your read(8192) calls out of its buffer, making it pointless. So, you have to intercept the request, steal the socket out of it, and deal with it manually, at which point urllib2 is getting in your way more than it's helping.
urllib2 makes some things easy, some things much harder than they should be, and some things next to impossible; when it isn't making things easy, stop using it.
One solution is to use a higher-level third-party library. For example, requests gets you half-way there (it makes it very easy to stream from a response, but can only stream into a request in limited situations), and requests-toolbelt gets you the rest of the way (it adds various ways to stream-upload).
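For instance, one of those limited situations is passing an iterator as the request body, which requests sends with chunked transfer encoding; a minimal sketch (URLs hypothetical):

import requests

# Stream the GET response straight into the POST body, 8 KB at a time.
getresp = requests.get('http://www.example.com/spam', stream=True)
postresp = requests.post('http://www.example.com/eggs',
                         data=getresp.iter_content(chunk_size=8192))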
The other solution is to use a lower-level library. And here, you don't even have to leave the stdlib. httplib forces you to think in terms of sending and receiving things bit by bit, but that's exactly what you want. On the get request, you can just call connect and request, and then call read(8192) repeatedly on the response object. On the post request, you call connect, putrequest, putheader, endheaders, then repeatedly send each buffer from the get request, then getresponse when you're done.
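A sketch of that httplib dance, assuming the server supplies a Content-Length header (hosts and paths hypothetical):

import httplib

getconn = httplib.HTTPConnection('www.example.com')
getconn.request('GET', '/spam')
getresp = getconn.getresponse()

postconn = httplib.HTTPConnection('www.example.com')
postconn.putrequest('POST', '/eggs')
postconn.putheader('Content-Length', getresp.getheader('Content-Length'))
postconn.endheaders()
while True:
    buf = getresp.read(8192)
    if not buf:
        break
    postconn.send(buf)
postresp = postconn.getresponse()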
In fact, in Python 3.2+'s http.client (the equivalent of 2.x's httplib), the body passed to HTTPConnection.request doesn't have to be a string; it can be any iterable, or any file-like object with read and fileno methods… which includes a response object. So it's this simple:
import http.client

getconn = http.client.HTTPConnection('www.example.com')
getconn.request('GET', '/spam')
getresp = getconn.getresponse()

postconn = http.client.HTTPConnection('www.example.com')
postconn.request('POST', '/eggs', body=getresp)
postresp = postconn.getresponse()
… except, of course, that you probably want to craft appropriate headers (you can actually use urllib.request, the 3.x version of urllib2, to build a Request object and not send it…), and pull the host and port out of the URL with urlparse instead of hardcoding them, and you want to exhaust or at least check the response from the POST request, and so on. But this shows the hard part, and it's not hard.
Unfortunately, I don't think this works in 2.x.
Finally, if you're familiar with libcurl, there are at least three wrappers for it (including one that comes with the source distribution). I'm not sure whether to call libcurl higher-level or lower-level than urllib2; it's sort of on its own weird axis of complexity. :)
urllib2 may be too simple for this task. You might want to look into pycurl. I know it supports streaming.
It seems obvious that it would use the twisted.names API and not any blocking way to resolve host names.
However, digging in the source code, I have been unable to find the place where the name resolution occurs. Could someone point me to the relevant source code where the host resolution occurs (when trying to do a connectTCP, for example)?
I really need to be sure that connectTCP won't use blocking DNS resolution.
It seems obvious, doesn't it?
Unfortunately:
Name resolution is not always configured in the obvious way. You think you just have to read /etc/resolv.conf? Even in the specific case of Linux and DNS, you might have to look in an arbitrary number of files looking for name servers.
Name resolution is much more complex than just DNS. You have to do mDNS resolution, possibly look up some LDAP computer records, and then you have to honor local configuration dictating the ordering between these such as /etc/nsswitch.conf.
Name resolution is not exposed via a standard or useful non-blocking API. Even the glibc-specific getaddrinfo_a exposes its non-blockingness via SIGIO, not just a file descriptor you can watch. Which means that, like POSIX AIO, it's probably just a kernel thread behind your back anyway.
For these reasons, among others, Twisted defaults to using a resolver that just calls gethostbyname in a thread.
However, if you know that for your application it is appropriate to have DNS-only hostname resolution, and you'd like to use twisted.names rather than your platform resolver - in other words, if scale matters more to you than esoteric name-resolution use-cases - that is supported. You can install a resolver from twisted.names.client onto the reactor, appropriately configured for your application, and all future built-in name resolutions will be made with that resolver.
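A minimal sketch of that installation, using the default createResolver configuration:

from twisted.internet import reactor
from twisted.names import client

# Replace the default resolver (gethostbyname in a thread) with a
# DNS-only resolver built from twisted.names.
reactor.installResolver(client.createResolver())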
I'm not massively familiar with Twisted; I only recently started using it. It looks like it doesn't block, though, but only on platforms that support threading.
In twisted.internet.base, in ReactorBase, it looks like it does the resolving through its resolve method, which returns a deferred from self.resolver.getHostByName.
self.resolver is an instance of BlockingResolver by default, which does block, but it looks like, if the platform supports threading, the resolver instance is replaced by a ThreadedResolver in the ReactorBase._initThreads method.
I want to make an HTTPS request to a real-time stream and keep the connection open so that I can keep reading content from it and processing it.
I want to write the script in Python. I am unsure how to keep the connection open in my script. I have tested the endpoint with curl, which keeps the connection open successfully. But how do I do it in Python? Currently, I have the following code:
import httplib

# 'req' is built elsewhere in my code; to_postdata() supplies the query string.
c = httplib.HTTPSConnection('userstream.twitter.com')
c.request("GET", "/2/user.json?" + req.to_postdata())
response = c.getresponse()
Where do I go from here?
Thanks!
It looks like your real-time stream is delivered as one endless HTTP GET response, yes? If so, you could just use Python's built-in urllib2.urlopen(). It returns a file-like object, from which you can read as much as you want until the server hangs up on you.
import urllib2

f = urllib2.urlopen('https://encrypted.google.com/')
while True:
    data = f.read(100)
    if not data:
        break  # read() returns '' once the server closes the connection
    print(data)
Keep in mind that although urllib2 speaks HTTPS, it doesn't validate server certificates, so you might want to try an add-on package like pycurl or urlgrabber for better security. (I'm not sure if urlgrabber supports HTTPS.)
Connection keep-alive features are not available in any of the Python standard libraries for HTTPS. The most mature option is probably urllib3.
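A minimal sketch with urllib3 (URL hypothetical); preload_content=False leaves the body unread so you can consume it incrementally:

import urllib3

http = urllib3.PoolManager()
r = http.request('GET', 'https://example.com/stream', preload_content=False)
for chunk in r.stream(1024):
    print(chunk)  # process each chunk as it arrives
r.release_conn()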
httplib2 supports this. (I'd have thought it the most mature option, but I didn't know about urllib3 yet, so TokenMacGuy may still be right.)
EDIT: while httplib2 does support persistent connections, I don't think you can really consume streams with it (i.e. one long response vs. multiple requests over the same connection), which I now realise you may need.