I'm making HTTP requests with Python's urllib2 which go through a proxy.
proxy_handler = urllib2.ProxyHandler({'http': 'http://myproxy'})
opener = urllib2.build_opener(proxy_handler)
urllib2.install_opener(opener)
r = urllib2.urlopen('http://www.pbr.com')
I'd like to log all headers from this request. I know that using a standard HTTPHandler you can do:
handler = urllib2.HTTPHandler(debuglevel=1)
Is there something like this for ProxyHandler?
I'm pretty sure debuglevel isn't documented.
In practice, it's actually a feature of httplib that urllib2 just forwards along for convenience, so you don't have to pass something like lambda host: httplib.HTTPConnection(host, debuglevel=1) in place of the default httplib.HTTPConnection as your HTTP connection factory. So you're unlikely to find anything similar in any of the other handlers.
But if you want to rely on an undocumented feature of the implementation, you're really going to need to read the source to see for yourself.
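That said, since ProxyHandler only reroutes the request and the connection is still made by HTTPHandler, you can log the headers of proxied requests simply by installing a debugging HTTPHandler alongside your ProxyHandler. A minimal sketch, reusing the proxy URL from the question:

import urllib2

# debuglevel=1 is forwarded to httplib, which prints the request
# and response headers as they go over the wire
opener = urllib2.build_opener(
    urllib2.ProxyHandler({'http': 'http://myproxy'}),
    urllib2.HTTPHandler(debuglevel=1))
urllib2.install_opener(opener)
r = urllib2.urlopen('http://www.pbr.com')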
At any rate, the obvious way to add debugging to any of the handlers is to subclass them and do it yourself. For example:
class LoggingProxyHandler(urllib2.ProxyHandler):
    def proxy_open(self, req, proxy, type):
        had_proxy = req.has_proxy()
        # urllib2's handlers are old-style classes in Python 2, so call
        # the base method directly instead of using super()
        response = urllib2.ProxyHandler.proxy_open(self, req, proxy, type)
        if not had_proxy and req.has_proxy():
            # the request was just routed through the proxy; log the
            # outgoing URL and headers here (print is just one option)
            print 'proxied %s via %s: %r' % (
                req.get_full_url(), proxy, req.header_items())
        return response
I'm relying on internal knowledge that ProxyHandler calls set_proxy on the request if it doesn't have one and needs one. It might be cleaner to instead examine the response… but you may not get all the information you want that way.
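To hook it in, build an opener with the subclass in place of the stock handler; a quick sketch reusing the placeholder proxy URL from the question:

opener = urllib2.build_opener(
    LoggingProxyHandler({'http': 'http://myproxy'}))
urllib2.install_opener(opener)
r = urllib2.urlopen('http://www.pbr.com')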
I'm setting up a small Python service to act as a REST API reverse proxy, and I'm hoping there are libraries available to help speed this process up.
I need to be able to run a function that calculates a value to inject as a request header when the request is proxied through to the backend.
As it stands, I have a simple script that runs that function, injects the value into an Nginx config file, and then forces an Nginx hot reload via signals, but I'm trying to remove this dependency for what should be a fairly simple task.
Would a good approach be to use falcon as the listener and combine it with another approach to inject and forward requests?
Thanks for reading.
Edit: I've been reading https://aiohttp.readthedocs.io/en/stable/, as it seems to be the right direction.
Thanks to someone over at falcon, this is now the accepted answer!
import io

import falcon
import requests


class Proxy(object):
    # upstream service every request is forwarded to
    UPSTREAM = 'https://httpbin.org'

    def __init__(self):
        # one shared session so upstream connections are pooled
        self.session = requests.Session()

    def handle(self, req, resp):
        # copy the incoming headers, tagging the request with a Via header
        headers = dict(req.headers, Via='Falcon')

        # drop hop-by-hop / origin-specific headers before forwarding
        # (falcon exposes header names uppercased)
        for name in ('HOST', 'CONNECTION', 'REFERER'):
            headers.pop(name, None)

        request = requests.Request(req.method, self.UPSTREAM + req.path,
                                   data=req.bounded_stream.read(),
                                   headers=headers)
        prepared = request.prepare()
        # stream=True relays the upstream body without buffering it
        from_upstream = self.session.send(prepared, stream=True)

        resp.content_type = from_upstream.headers.get('Content-Type',
                                                      falcon.MEDIA_HTML)
        resp.status = falcon.get_http_status(from_upstream.status_code)
        resp.stream = from_upstream.iter_content(io.DEFAULT_BUFFER_SIZE)


api = falcon.API()
# a sink catches every path, so the whole URL space is proxied
api.add_sink(Proxy().handle)
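To try it out, you could save the snippet as, say, proxy.py (the filename is just an example) and serve it with any WSGI server, for instance:

gunicorn proxy:api

Requests to the local server are then relayed to httpbin.org with the Via: Falcon header added.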
Scope:
I am currently trying to write a Web scraper for this specific page. I have a pretty strong "Web Crawling" background using C#, but this httplib is beating me.
Problem:
When trying to make an HTTP GET request for the page specified above, I get a "Moved Permanently" response that points to the very same URL. I can make the request using the requests lib, but I want to make it work using httplib so I can understand what I am doing wrong.
Code Sample:
I am completely new to Python, so any wrong language guideline or syntax is C#'s fault.
import httplib


# Wrapper for an "HTTP GET" request
class HttpClient(object):
    def HttpGet(self, url, host):
        connection = httplib.HTTPConnection(host)
        connection.request('GET', url)
        return connection.getresponse().read()
# Using "HttpClient" class
httpclient = httpClient()
# This is the full URL I need to make a get request for : https://420101.com/strain-database
httpResponseText = httpclient.HttpGet('www.420101.com','/strain-database')
print httpResponseText
I really want to make it work using the httplib library, instead of requests or any other fancy one because I feel like I am missing something really small here.
The problem? I've had too little or too much caffeine in my system.
To GET an https URL, I needed the HTTPSConnection class.
Also, there is no 'www' in the address I wanted to GET, so it shouldn't be included in the host.
Both of the wrong addresses redirect to the correct one with a 301 status code. If I were using requests or a more full-featured module, it would have automatically followed the redirect.
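For completeness, here is a minimal sketch of following such redirects by hand with httplib (the helper name and hop limit are my own invention; httplib itself never follows redirects):

import httplib
import urlparse


def http_get_following_redirects(url, max_hops=5):
    for _ in range(max_hops):
        parts = urlparse.urlsplit(url)
        if parts.scheme == 'https':
            connection = httplib.HTTPSConnection(parts.netloc)
        else:
            connection = httplib.HTTPConnection(parts.netloc)
        connection.request('GET', parts.path or '/')
        response = connection.getresponse()
        if response.status in (301, 302):
            # Location may be relative, so resolve against the old URL
            url = urlparse.urljoin(url, response.getheader('Location'))
            continue
        return response.read()
    raise IOError('too many redirects')

print http_get_following_redirects('http://www.420101.com/strain-database')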
My Validation:
c = httplib.HTTPSConnection('420101.com')
c.request("GET", "/strain-database")
r = c.getresponse()
print r.status, r.reason
200 OK
I am using the Python mechanize lib and I am trying to use the HTTP PUT method on some URL, but I can't find any option for this. I see only GET and POST methods...
If the PUT method is not supported, maybe someone can tell me a better lib for doing this?
One possible solution:
class PutRequest(mechanize.Request):
    "Extend the mechanize Request class to allow an HTTP PUT"

    def get_method(self):
        return "PUT"
You can then use this when making a request like this:
browser.open(PutRequest(url, data=your_encoded_params, headers=your_headers))
NOTE: I arrived at this solution by digging into the mechanize code packages to find out where mechanize was setting the HTTP method. I noticed that when we call mechanize.Request, we are using the Request class in _request.py, which in turn extends the Request class in _urllib2_fork.py. The HTTP method is actually set in get_method of the Request class in _urllib2_fork.py. It turns out get_method in _urllib2_fork.py was allowing only the GET and POST methods. To get past this limitation, I ended up writing my own put and delete classes that extended mechanize.Request but overrode get_method() only.
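The delete class mentioned above is the same one-method override; a sketch:

class DeleteRequest(mechanize.Request):
    "Extend the mechanize Request class to allow an HTTP DELETE"

    def get_method(self):
        return "DELETE"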
Use Requests:
>>> import requests
>>> result = requests.put("http://httpbin.org/put", data='hello')
>>> result.text
Per documentation:
requests.put(url, data=None, **kwargs)
Sends a PUT request. Returns Response object.
Parameters:
url – URL for the new Request object.
data – (optional) Dictionary or bytes to send in the body of the Request.
**kwargs – Optional arguments that request takes.
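For instance, passing a dict sends a form-encoded body, and extra keyword arguments such as headers pass straight through (the header name here is purely illustrative):

>>> import requests
>>> result = requests.put("http://httpbin.org/put",
...                       data={'locale': 'en'},
...                       headers={'X-Example': 'demo'})
>>> result.status_code
200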
Via Mechanize:
import mechanize
import json


class PutRequest(mechanize.Request):
    def get_method(self):
        return 'PUT'


browser = mechanize.Browser()
browser.open(
    PutRequest('http://example.com/',
               data=json.dumps({'locale': 'en'}),
               headers={'Content-Type': 'application/json'}))
See also http://qxf2.com/blog/python-mechanize-the-missing-manual/ (probably outdated).
Requests does it in a nicer way, as Key Zhu said.
Using urllib2, are we able to use a method other than 'GET' or 'POST' (when data is provided)?
I dug into the library and it seems that the decision to use GET or POST is 'conveniently' tied to whether or not data is provided in the request.
For example, I want to interact with a CouchDB database, which requires methods such as 'DELETE' and 'PUT'. I want to keep urllib2's handlers but make my own method calls.
I WOULD PREFER NOT to import 3rd party modules into my project, such as the CouchDB Python API. So let's please not go down that road. My implementation must use the modules that ship with Python 2.6. (My design spec requires the use of a barebones PortablePython distribution.) I would write my own interface using httplib before importing external modules.
Thanks so much for the help
You could subclass urllib2.Request like so (untested)
import urllib2


class MyRequest(urllib2.Request):
    # HTTP method names are case-sensitive, so use the canonical
    # uppercase forms (lowercase strings would be sent verbatim in the
    # request line and rejected by many servers)
    GET = 'GET'
    POST = 'POST'
    PUT = 'PUT'
    DELETE = 'DELETE'

    def __init__(self, url, data=None, headers={},
                 origin_req_host=None, unverifiable=False, method=None):
        urllib2.Request.__init__(self, url, data, headers,
                                 origin_req_host, unverifiable)
        self.method = method

    def get_method(self):
        # fall back to urllib2's usual GET/POST choice when no
        # explicit method was given
        if self.method:
            return self.method
        return urllib2.Request.get_method(self)


opener = urllib2.build_opener(urllib2.HTTPHandler)
req = MyRequest('http://yourwebsite.com/put/resource/', method=MyRequest.PUT)
resp = opener.open(req)
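Against CouchDB, that lets you PUT a JSON document and DELETE it again. A sketch, where the database name, document id, and REVISION are placeholders and 5984 is CouchDB's default port:

import json

doc = json.dumps({'title': 'example'})
put_req = MyRequest('http://localhost:5984/mydb/docid', data=doc,
                    headers={'Content-Type': 'application/json'},
                    method=MyRequest.PUT)
print opener.open(put_req).read()

del_req = MyRequest('http://localhost:5984/mydb/docid?rev=REVISION',
                    method=MyRequest.DELETE)
print opener.open(del_req).read()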
It could be:
import urllib2

method = 'PATCH'
request = urllib2.Request('http://host.com')
# override get_method on this single instance; the lambda must return
# the method string itself, not call it
request.get_method = lambda: method

That is, a runtime instance modification, a.k.a. a monkey patch.
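The same one-off trick works for any verb; for example, to issue a DELETE against the placeholder URL:

import urllib2

request = urllib2.Request('http://host.com/resource')
request.get_method = lambda: 'DELETE'
response = urllib2.urlopen(request)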
I do:
con = HTTPConnection(SERVER_NAME)
con.request('GET', PATH, HEADERS)
resp = con.getresponse()
For debugging reasons, I want to see the request I sent (its fields, path, method, ...). I would expect there to be some sort of con.getRequest() or something of the sort, but I didn't find anything. Ideas?
Try
con.set_debuglevel(1)
That will enable debugging output, which among other things, will print out all the data it sends.
If you only want to get the headers and request line, not the request body (or any other debugging output), you can subclass HTTPConnection and override the _output method, which is called by the class itself to produce output (except for the request body). You'd want to do something like this:
class MyHTTPConnection(HTTPConnection):
    def _output(self, s):
        # show each raw line the connection is about to write
        # (request line and headers)
        print repr(s)
        # HTTPConnection is an old-style class in Python 2, so call the
        # base method directly rather than via super()
        HTTPConnection._output(self, s)
For more details on how that works and possible alternatives, have a look at the httplib source code.
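A quick usage sketch (httpbin.org is just an example host):

from httplib import HTTPConnection  # needed by the subclass above

con = MyHTTPConnection('httpbin.org')
con.request('GET', '/get')
resp = con.getresponse()
print resp.status, resp.reason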