The documentation I've found explaining http.client for Python seems a bit sparse. I want to use it over requests because requests has not worked for our project.
So, given that I'm using Python's http.client, I keep running into both request and putrequest. Both methods are defined here under HTTPConnection.
HTTPConnection.request: This will send a request to the server using
the HTTP request method method and the selector url.
HTTPConnection.putrequest: This should be the first call after the
connection to the server has been made. It sends a line to the server
consisting of the method string, the url string, and the HTTP version
(HTTP/1.1). To disable automatic sending of Host: or Accept-Encoding:
headers (for example to accept additional content encodings), specify
skip_host or skip_accept_encoding with non-False values.
Also, the source code for both is defined in this file.
From my guess and reading things, it seems like request is a more high level API compared to putrequest. Is that correct?
The Answer: request() is an abstracted version of multiple functions, putrequest() being one of them.
Although this is stated in the documentation, it's easy to skip over the line that answers this question.
This is pointed out in this line of the http.client documentation:
As an alternative to using the request() method described above, you can also send your request step by step, by using the four functions below.
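To make the difference concrete, here is a rough sketch (the hostname and path are placeholders, not from the question) showing the same GET sent once with the high-level request() call and once step by step with the lower-level calls:
import http.client

conn = http.client.HTTPConnection('www.example.com')

# High level: request() sends the request line, default headers and any body in one call.
conn.request('GET', '/index.html')
resp = conn.getresponse()
resp.read()  # drain the response so the connection can be reused

# Low level: the same request assembled step by step.
conn.putrequest('GET', '/index.html')
conn.putheader('User-Agent', 'example-client')
conn.endheaders()  # finishes the header block; a body could also be passed here
resp = conn.getresponse()
resp.read()
conn.close()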
Related
I need to somehow extract plain HTTP request message from a Request object in Scrapy (so that I could, for example, copy/paste this request and run from Burp).
So given a scrapy.http.Request object, I would like to get the corresponding request message, for example:
POST /test/demo_form.php HTTP/1.1
Host: w3schools.com

name1=value1&name2=value2
Clearly I have all the information I need in the Request object, but trying to reconstruct the message manually is error-prone, as I could miss some edge cases. My understanding is that Scrapy first converts this Request into a Twisted object, which then writes the headers and body into a TCP transport. So maybe there's a way to do something similar, but write to a string instead?
UPDATE
I could use the following code to get HTTP 1.0 request message, which is based on http.py. Is there a way to do something similar with HTTP 1.1 requests / http11.py, which is what's actually being sent? I would obviously like to avoid duplicating code from Scrapy/Twisted frameworks as much as possible.
from scrapy.core.downloader import webclient  # import paths as of the Scrapy version discussed here
from twisted.test.proto_helpers import StringTransport

factory = webclient.ScrapyHTTPClientFactory(request)
transport = StringTransport()
protocol = webclient.ScrapyHTTPPageGetter()
protocol.factory = factory
protocol.makeConnection(transport)
request_message = transport.value()
print(request_message.decode("utf-8"))
As Scrapy is open source and has plenty of extension points, this should be doable.
The requests are finally assembled and sent out in scrapy/core/downloader/handlers/http11.py in ScrapyAgent.download_request ( https://github.com/scrapy/scrapy/blob/master/scrapy/core/downloader/handlers/http11.py#L270 )
If you place your hook there you can dump the request type, request headers, and request body.
To place your code there, you can either monkey patch ScrapyAgent.download_request, or subclass ScrapyAgent to do the request logging, subclass HTTP11DownloadHandler to use your agent, and then register that handler as the DOWNLOAD_HANDLER for http/https requests in your project's settings.py (for details see: https://doc.scrapy.org/en/latest/topics/settings.html#download-handlers). A monkey-patch sketch is shown below.
In my opinion this is the closest you can get to logging the requests going out without using a packet sniffer or a logging proxy (which might be a bit overkill for your scenario).
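For illustration, a minimal monkey-patch sketch along those lines; the module path and class name are taken from the link above and may differ between Scrapy versions:
from scrapy.core.downloader.handlers import http11

_original_download_request = http11.ScrapyAgent.download_request

def logging_download_request(self, request):
    # dump method, headers and body just before the request is handed to Twisted
    print(request.method, request.url)
    print(request.headers.to_string().decode("utf-8", "replace"))
    if request.body:
        print(request.body)
    return _original_download_request(self, request)

http11.ScrapyAgent.download_request = logging_download_request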
I want to write code to transfer a file from one site to another. This can be a large file, and I'd like to do it without creating a local temporary file.
I saw the trick of using mmap to upload a large file in Python: "HTTP Post a large file with streaming", but what I really need is a way to link up the response from the GET to creating the POST.
Anyone done this before?
You can't, or at least shouldn't.
urllib2 request objects have no way to stream data into them on the fly, period. And in the other direction, response objects are file-like objects, so in theory you can read(8192) out of them instead of read(), but for most protocols—including HTTP—it will either often or always read the whole response into memory and serve your read(8192) calls out of its buffer, making it pointless. So, you have to intercept the request, steal the socket out of it, and deal with it manually, at which point urllib2 is getting in your way more than it's helping.
urllib2 makes some things easy, some things much harder than they should be, and some things next to impossible; when it isn't making things easy, stop using it.
One solution is to use a higher-level third-party library. For example, requests gets you half-way there (it makes it very easy to stream from a response, but can only stream into a request in limited situations), and requests-toolbelt gets you the rest of the way there (it adds various ways to stream-upload).
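As a rough sketch of one of those situations (the URLs are placeholders): requests streams the download when you pass stream=True, and chunk-encodes the upload if you hand it a generator as the body:
import requests

getresp = requests.get('http://source.example.com/bigfile', stream=True)
# passing a generator as data makes requests send a chunked upload,
# so the file is never held in memory all at once
postresp = requests.post('http://dest.example.com/upload',
                         data=getresp.iter_content(chunk_size=8192))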
The other solution is to use a lower-level library. And here, you don't even have to leave the stdlib. httplib forces you to think in terms of sending and receiving things bit by bit, but that's exactly what you want. For the GET request, you can just call connect and request, and then call read(8192) repeatedly on the response object. For the POST request, you call connect, putrequest, putheader, endheaders, then repeatedly send each buffer read from the GET response, then getresponse when you're done.
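A minimal sketch of that manual sequence, using the Python 3 http.client names (hosts and paths are placeholders):
import http.client

getconn = http.client.HTTPConnection('source.example.com')
getconn.request('GET', '/bigfile')
getresp = getconn.getresponse()

postconn = http.client.HTTPConnection('dest.example.com')
postconn.putrequest('POST', '/upload')
# assumes the source reports a Content-Length; otherwise you'd need chunked encoding
postconn.putheader('Content-Length', getresp.getheader('Content-Length'))
postconn.endheaders()
while True:
    chunk = getresp.read(8192)
    if not chunk:
        break
    postconn.send(chunk)
postresp = postconn.getresponse()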
In fact, in Python 3.2+'s http.client (the equivalent of 2.x's httplib), the body you pass to HTTPConnection.request doesn't have to be a string; it can be any iterable, or any file-like object with read and fileno methods… which includes a response object. So, it's this simple:
import http.client

getconn = http.client.HTTPConnection('www.example.com')
getconn.request('GET', '/spam')
getresp = getconn.getresponse()

postconn = http.client.HTTPConnection('www.example.com')
postconn.request('POST', '/eggs', body=getresp)
postresp = postconn.getresponse()
… except, of course, that you probably want to craft appropriate headers (you can actually use urllib.request, the 3.x version of urllib2, to build a Request object and not send it…), and pull the host and port out of the URL with urlparse instead of hardcoding them, and you want to exhaust or at least check the response from the POST request, and so on. But this shows the hard part, and it's not hard.
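For instance, a small hedged illustration of pulling the host and path out of the URL with urlparse instead of hardcoding them (the URL is a placeholder):
from urllib.parse import urlparse
import http.client

url = 'http://www.example.com/spam'
parts = urlparse(url)               # netloc carries host and, if present, port
conn = http.client.HTTPConnection(parts.netloc)
conn.request('GET', parts.path or '/')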
Unfortunately, I don't think passing a response object as the body like this works in 2.x.
Finally, if you're familiar with libcurl, there are at least three wrappers for it (including one that comes with the source distribution). I'm not sure whether to call libcurl higher-level or lower-level than urllib2, it's sort of on its own weird axis of complexity. :)
urllib2 may be too simple for this task. You might want to look into pycurl. I know it supports streaming.
In the Flask documentation on testing (http://flask.pocoo.org/docs/testing/), there is a line of code:
rv = self.app.get('/')
And below it, it mentions "By using self.app.get we can send an HTTP GET request to the application with the given path."
Where can the documentation be found for these direct-access methods? (I'm assuming there's one for each of the RESTful methods.) Specifically, I'm wondering what sort of arguments they can take (for example, passing in data, headers, etc.). Looking through Flask's documentation for the Flask object, these methods don't seem to be listed, even though they're used in the example above.
Alternatively, a knowledgeable individual could answer what I am trying to figure out: I'm trying to simulate sending a POST request to my server, as I would with the following line, if I were doing it over HTTP:
res = requests.post("http://localhost:%d/generate" % port,
data=json.dumps(payload),
headers={"content-type": "application/json"})
The above works when running a Flask app on the proper port. But I tried replacing it with the following:
res = self.app.post("/generate",
data=json.dumps(payload),
headers={"content-type": "application/json"})
And instead, the object I get in response is a 400 BAD REQUEST.
This is documented in the Werkzeug project, from which Flask gets the test client: Werkzeug's test client.
The test client does not issue HTTP requests, it dispatches requests internally, so there is no need to specify a port.
The documentation isn't very clear about support for JSON in the body, but it seems that if you pass a string and set the content type you should be fine, so I'm not exactly sure why you get back a 400. I would check whether your /generate view function is invoked at all. A debugger should be useful for figuring out where the 400 is coming from.
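For reference, a minimal sketch of the same call through the test client, assuming the /generate endpoint and payload from your question; content_type is the keyword the Werkzeug test client accepts for this, and the headers form you used should also work:
import json

res = self.app.post("/generate",
                    data=json.dumps(payload),
                    content_type="application/json")
print(res.status_code, res.data)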
I'm using Elasticsearch, and its RESTful API supports reading bodies in GET requests for search criteria.
I'm currently doing
response = urllib.request.urlopen(url, data).read().decode("utf-8")
If data is present, it issues a POST; otherwise a GET. How can I force a GET despite the fact that I'm including data (which should go in the request body, as it would for a POST)?
NB: I'm aware I can use a source property in the URL, but the queries we're running are complex and the query definition is quite verbose, resulting in extremely long URLs (long enough that they can interfere with some older browsers and proxies).
I'm not aware of a nice way to do this using urllib. However, requests makes it trivial (and, in fact, trivial with any arbitrary verb and request content) by using the requests.request* function:
requests.request(method='get', url='localhost/test', data='some data')
Constructing a small test web server will show that the data is indeed sent in the body of the request, and that the method the server sees is indeed GET (a sketch of such a server follows the footnote below).
*Note that I linked to the requests.api.request code because that's where the actual function definition lives. You should call it as requests.request(...).
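A minimal sketch of such a test server using the standard library (host and port are arbitrary); point the requests.request(...) call above at it and watch what arrives:
from http.server import BaseHTTPRequestHandler, HTTPServer

class EchoHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # read the body (if any) and show the method the server actually saw
        length = int(self.headers.get('Content-Length', 0))
        body = self.rfile.read(length)
        print(self.command, body)  # prints e.g.: GET b'some data'
        self.send_response(200)
        self.end_headers()

HTTPServer(('localhost', 8000), EchoHandler).serve_forever()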
I was trying to download images with multiple threads, with a limited maximum thread count, in Python.
Each time a download thread is started, I leave it alone and start another one. I want each download to finish within 5 seconds, meaning the download has failed if opening the URL takes more than 5 seconds.
But how can I detect that and stop the failed thread?
Can you tell us which version of Python you are using?
A code snippet would also have helped.
Since Python 2.6, urllib2.urlopen accepts a timeout argument.
Hope this helps; the following is from the Python docs.
urllib2.urlopen(url[, data][, timeout])
Open the URL url, which can be either a string or a Request object.
Warning: HTTPS requests do not do any verification of the server's certificate.
data may be a string specifying additional data to send to the server, or None if no such data is needed. Currently HTTP requests are the only ones that use data; the HTTP request will be a POST instead of a GET when the data parameter is provided. data should be a buffer in the standard application/x-www-form-urlencoded format. The urllib.urlencode() function takes a mapping or sequence of 2-tuples and returns a string in this format. The urllib2 module sends HTTP/1.1 requests with the Connection: close header included.
The optional timeout parameter specifies a timeout in seconds for blocking operations like the connection attempt (if not specified, the global default timeout setting will be used). This actually only works for HTTP, HTTPS and FTP connections.
This function returns a file-like object with two additional methods:
geturl() — return the URL of the resource retrieved, commonly used to determine if a redirect was followed
info() — return the meta-information of the page, such as headers, in the form of a mimetools.Message instance (see Quick Reference to HTTP Headers)
Raises URLError on errors.
Note that None may be returned if no handler handles the request (though the default installed global OpenerDirector uses UnknownHandler to ensure this never happens).
In addition, the default installed ProxyHandler makes sure the requests are handled through the proxy when they are set.
Changed in version 2.6: timeout was added.
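A minimal sketch along those lines, assuming Python 2.6+ and a placeholder URL/filename, using the timeout parameter to enforce the 5-second limit per download rather than trying to kill the thread afterwards:
import socket
import urllib2
from threading import Thread

def download(url, path):
    try:
        # fails if connecting or reading blocks for more than 5 seconds
        resp = urllib2.urlopen(url, timeout=5)
        with open(path, 'wb') as f:
            f.write(resp.read())
    except (urllib2.URLError, socket.timeout) as e:
        print('download failed for %s: %s' % (url, e))

t = Thread(target=download, args=('http://example.com/image.jpg', 'image.jpg'))
t.start()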