I am trying to write a man-in-the-middle proxy for a webserver (to add extra services, not for nefarious reasons).
I am trying to pass a Host header, since the back-end puts its address, as taken from the Host header, in the reply in lots of unpredictable places.
The original code is hundreds of lines, so I've simplified it to just the salient parts here.
import urllib2
opener = urllib2.build_opener(urllib2.HTTPHandler(debuglevel=1))
# Try to set our own Host header (this turns out to be ignored)
opener.addheaders.append(('Host', 'fakedomain.net'))
res = opener.open('http://www.google.com/doodles/finder/2014/All%20doodles')
res.read()
When I run this code, I expect Host: fakedomain.net to be passed to Google's server. However, the debug output clearly shows Host: www.google.com\r\n. Changing Host to HostX works fine.
What is the correct way of sending a Host: header with an opener?
Note: in the actual code I am pointing at my own server, etc.; this is a simplification.
Use urllib2.Request:
import urllib2
opener = urllib2.build_opener(urllib2.HTTPHandler(debuglevel=1))
req = urllib2.Request('http://www.google.com/doodles/finder/2014/All%20doodles')
# Set Host explicitly so urllib2 will not overwrite it with the URL's host
req.add_unredirected_header('Host', 'fakedomain.net')
res = opener.open(req)
res.read()
Thanks to Satoru, whose answer was almost what I was looking for and certainly got me on the right track.
The correct answer is:
import urllib2
opener = urllib2.build_opener(urllib2.HTTPHandler(debuglevel=1))
req = urllib2.Request('http://www.google.com/doodles/finder/2014/All%20doodles', None, {"Host": "fakedomain.net"})
res = opener.open(req)
res.read()
Sorry Satoru, I won't accept your answer as the correct one, in case someone else finds my question, but I have upvoted it.
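For the record, the same thing can be done one level down with httplib, which makes the mechanism clearer. This is a minimal, untested sketch; skip_host is real httplib API that stops it from emitting its own Host header:
import httplib

conn = httplib.HTTPConnection('www.google.com')
# skip_host=1 tells httplib not to add its own Host header,
# so the one we put in by hand is the only one sent.
conn.putrequest('GET', '/doodles/finder/2014/All%20doodles', skip_host=1)
conn.putheader('Host', 'fakedomain.net')
conn.endheaders()
res = conn.getresponse()
print res.status, res.reason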
Scope:
I am currently trying to write a web scraper for this specific page. I have a pretty strong web-crawling background in C#, but this httplib module is getting the better of me.
Problem:
When trying to make an HTTP GET request for the page specified above, I get a "Moved Permanently" that points to the very same URL. I can make the request using the requests lib, but I want to make it work using httplib so I can understand what I am doing wrong.
Code Sample:
I am completely new to Python, so any unidiomatic style or syntax is C#'s fault.
import httplib

# Wrapper for an HTTP GET request
class HttpClient(object):
    def HttpGet(self, url, host):
        connection = httplib.HTTPConnection(host)
        connection.request('GET', url)
        return connection.getresponse().read()

# Using the "HttpClient" class.
# This is the full URL I need to make a GET request for: https://420101.com/strain-database
httpclient = HttpClient()
httpResponseText = httpclient.HttpGet('/strain-database', 'www.420101.com')
print httpResponseText
I really want to make it work using the httplib library, instead of requests or any other fancier one, because I feel like I am missing something really small here.
The problem: I've had too little or too much caffeine in my system.
To GET an https URL, I needed the HTTPSConnection class.
Also, there is no 'www' in the address I wanted to GET, so it shouldn't be included in the host.
Both of the wrong addresses redirect to the correct one with a 301 status code. If I were using requests or a more full-featured module, it would have followed the redirect automatically, as the sketch below does by hand.
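Here is a minimal sketch of following that 301 manually with httplib, which is what requests does for you behind the scenes (it assumes the Location header carries an absolute https URL, as it does here):
import httplib, urlparse

# Hit the redirecting address on purpose, then follow the Location header by hand.
conn = httplib.HTTPSConnection('www.420101.com')
conn.request('GET', '/strain-database')
resp = conn.getresponse()
if resp.status in (301, 302):
    target = urlparse.urlparse(resp.getheader('location'))
    conn = httplib.HTTPSConnection(target.netloc)
    conn.request('GET', target.path)
    resp = conn.getresponse()
print resp.status, resp.reason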
My Validation:
import httplib

c = httplib.HTTPSConnection('420101.com')
c.request("GET", "/strain-database")
r = c.getresponse()
print r.status, r.reason
200 OK
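Putting both fixes together, a corrected version of the wrapper from the question might look like this (note that the call also passes the arguments in the declared (url, host) order):
import httplib

# Wrapper for an HTTPS GET request
class HttpClient(object):
    def HttpGet(self, url, host):
        connection = httplib.HTTPSConnection(host)  # https, per the fix above
        connection.request('GET', url)
        return connection.getresponse().read()

httpclient = HttpClient()
# Bare domain, no 'www', and arguments in the declared order.
httpResponseText = httpclient.HttpGet('/strain-database', '420101.com')
print httpResponseText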
I am checking URL status with this code:
h = httplib2.Http()
hdr = {'User-Agent': 'Mozilla/5.0'}
resp = h.request("http://" + url, headers=hdr)
if int(resp[0]['status']) < 400:
    return 'ok'
else:
    return 'bad'
and getting
Error -3 while decompressing data: incorrect header check
The URL I am checking is:
http://www.sueddeutsche.de/wirtschaft/deutschlands-innovationsangst-wir-neobiedermeier-1.2117528
The exception location is:
Exception Location: C:\Python27\lib\site-packages\httplib2-0.9-py2.7.egg\httplib2\__init__.py in _decompressContent, line 403
try:
    encoding = response.get('content-encoding', None)
    if encoding in ['gzip', 'deflate']:
        if encoding == 'gzip':
            content = gzip.GzipFile(fileobj=StringIO.StringIO(new_content)).read()
        if encoding == 'deflate':
            content = zlib.decompress(content) ##<---- error line
        response['content-length'] = str(len(content))
        # Record the historical presence of the encoding in a way the won't interfere.
        response['-content-encoding'] = response['content-encoding']
        del response['content-encoding']
except IOError:
    content = ""
The HTTP status is 200, which is OK for my case, but I am getting this error.
I actually need only the HTTP status; why is it reading the whole content?
You may have any number of reasons for choosing httplib2, but it's far easier to get the status code of an HTTP request using the requests module. Install it with the following command:
$ pip install requests
See an extremely simple example below.
In [1]: import requests as rq
In [2]: url = "http://www.sueddeutsche.de/wirtschaft/deutschlands-innovationsangst-wir-neobiedermeier-1.2117528"
In [3]: r = rq.get(url)
In [4]: r
Out[4]: <Response [200]>
Unless you have a considerable constraint that needs httplib2 explicitly, this solves your problem.
This may be a bug (or just an uncommon design decision) in httplib2. I don't get this problem with urllib2 or httplib in the 2.x stdlib, urllib.request or http.client in the 3.x stdlib, or the third-party libraries requests, urllib3, or pycurl.
So, is there a reason you need to use this particular library?
If so:
I actually need only http status, why is it reading the whole content?
Well, most HTTP libraries are going to read and parse the whole content, or at least the headers, before returning control. That way they can respond to simple requests about the headers or chunked encoding or MIME envelope or whatever without any delay.
Also, many of them automate things like 100 continue, 302 redirect, various kinds of auth, etc., and there's no way they could do that if they didn't read ahead. In particular, according to the description for httplib2, handling these things automatically is one of the main reasons you should use it in the first place.
Also, the first TCP read is nearly always going to include the headers anyway, so why not read them?
This means that if the headers are invalid, you may get an exception immediately. They may still provide a way to get the status code (or the raw headers, or other information) anyway.
As a side note, if you only want the HTTP status, you should probably send a HEAD request rather than a GET. Unless you're writing and testing a server, you can almost always rely on the fact that, as the RFC says, the status and headers should be identical to what you'd get with GET. In fact, that would almost certainly solve things in this case—if there is no body to decompress, the fact that httplib2 has gotten confused into thinking the body is gzipped when it isn't won't matter anyway.
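With httplib2 that is a one-line change, since request() takes the method as its second argument. A minimal sketch, reusing the URL from the question:
import httplib2

url = "http://www.sueddeutsche.de/wirtschaft/deutschlands-innovationsangst-wir-neobiedermeier-1.2117528"
h = httplib2.Http()
# A HEAD response has no body, so there is nothing for httplib2 to decompress.
resp, content = h.request(url, "HEAD", headers={'User-Agent': 'Mozilla/5.0'})
print resp.status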
I'm currently working on an automated way to interface with a database website that has RESTful web services installed. I am having trouble figuring out how to properly format and send the requests listed on the following site using Python.
https://neesws.neeshub.org:9443/nees.html
A particular example is this:
POST https://neesws.neeshub.org:9443/REST/Project/731/Experiment/1706/Organization
<Organization id="167"/>
The biggest problem is that I do not know where to put the XML-formatted part of the above. I want to send it as a Python HTTPS request, and so far I've been trying something with the following structure.
>>>import httplib
>>>conn = httplib.HTTPSConnection("neesws.neeshub.org:9443")
>>>conn.request("POST", "/REST/Project/731/Experiment/1706/Organization")
>>>conn.send('<Organization id="167"/>')
But this appears to be completely wrong. I've never actually used Python for web-service interfaces, so my primary question is: how exactly am I supposed to use httplib to send the POST request, particularly the XML-formatted part of it? Any help is appreciated.
You need to set some request headers before sending data, for example Content-Type to 'text/xml'. Check out these examples:
Post-XML-Python-1
which has this code as an example:
import sys, httplib

HOST = "www.example.com"
API_URL = "/your/api/url"

def do_request(xml_location):
    """HTTP XML POST request"""
    request = open(xml_location, "r").read()
    webservice = httplib.HTTP(HOST)
    webservice.putrequest("POST", API_URL)
    webservice.putheader("Host", HOST)
    webservice.putheader("User-Agent", "Python post")
    webservice.putheader("Content-type", "text/xml; charset=\"UTF-8\"")
    webservice.putheader("Content-length", "%d" % len(request))
    webservice.endheaders()
    webservice.send(request)
    statuscode, statusmessage, header = webservice.getreply()
    result = webservice.getfile().read()
    print statuscode, statusmessage, header
    print result

do_request("myfile.xml")
Post-XML-Python-2
These should give you some idea.
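If you would rather stay with the HTTPSConnection from your question, the same idea is a matter of passing the body and headers straight to request(), which also computes Content-Length for you. A minimal, untested sketch using your endpoint (whether the service wants text/xml is an assumption to check against the NEEShub documentation):
import httplib

body = '<Organization id="167"/>'
headers = {"Content-Type": 'text/xml; charset="UTF-8"'}  # assumed content type

conn = httplib.HTTPSConnection("neesws.neeshub.org", 9443)
conn.request("POST", "/REST/Project/731/Experiment/1706/Organization", body, headers)
resp = conn.getresponse()
print resp.status, resp.reason
print resp.read()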
I've tried a lot of code to POST parameters through urllib or httplib.
So, this is my code:
import httplib, urllib

para = urllib.urlencode({"username": "test#msn.com", "password": "test"})
conn = httplib.HTTPConnection("account.example.com")  # consider it's https!
conn.request("POST", "/eng/auth/login", para)
res = conn.getresponse()
print res.status, res.reason
It said 301 Moved Permanently!
Any tips or leads?
Thank you even for reading <3
You need to encode the parameters:
params = urllib.urlencode({"username":"test#msn.com","password":"test"})
The 301 might be totally legitimate; your example is posting to a login handler, which will typically accept the POST, issue a cookie, and redirect you to the "correct" page to handle your session.
First, take a look at the response headers: see if there is a cookie, and what page you are being redirected to. This should help you figure it out.
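A minimal sketch of that inspection, reusing the code from the question (the Content-Type header is an addition of mine; httplib won't add one for you, and form posts usually need application/x-www-form-urlencoded):
import httplib, urllib

params = urllib.urlencode({"username": "test#msn.com", "password": "test"})
headers = {"Content-Type": "application/x-www-form-urlencoded"}
conn = httplib.HTTPSConnection("account.example.com")  # the real site is https
conn.request("POST", "/eng/auth/login", params, headers)
res = conn.getresponse()
print res.status, res.reason
# Where the 301 is sending you, and any session cookie issued:
print res.getheader("location")
print res.getheader("set-cookie")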
I wrote this code:
import urllib
proxies = {'http': 'http://112.65.135.54:8080/'}
opener = urllib.FancyURLopener(proxies)
r = opener.open("http://www.python.org/")
print r.read()
When I execute it, this program works fine and sends me the source code of python.org. But when I use this:
import urllib
proxies = {'http': 'http://80.176.245.196:1080/'}
opener = urllib.FancyURLopener(proxies)
r = opener.open("http://www.python.org/")
print r.read()
this program does not send me the source code of python.org.
What should I do?
Hehe :D I found the answer: I must use "socks" instead of "http":
import urllib
proxies = {'socks': 'http://80.176.245.196:1080/'}
opener = urllib.FancyURLopener(proxies)
r = opener.open("http://www.python.org/")
print r.read()
This code works fine.
Presumably, the first IP address and port point to a working proxy, while the second set does not (they're on private IPs, so of course nobody else can check). So, speak with whoever handles your local network, and get the exact specs for the IP and port of the HTTP proxy you're supposed to use!
Edit: aargh, the question had been edited to "mask" the IPs (now they're back, and they're definitely not on private networks!), so the answer above was based on that. Anyway, no need for digging now, as the OP has already discovered that one is a SOCKS proxy, not an HTTP proxy, and so of course it can't be treated as the latter ;-).