I want to crawl a site, but Cloudflare is getting in the way. I was able to get the origin server's IP, so Cloudflare won't bother me.
How can I utilize this in the requests library?
For example, I want to go directly to
www.example.com/foo.php, but requests will resolve the domain to an IP on the Cloudflare network instead of the one I want it to use. How can I make it use the IP I want?
I would have sent a request to the real IP with the Host header set to www.example.com, but that just gives me the home page. How can I visit other links on the site?
You will have to set a custom Host header with the value example.com, something like:
requests.get('http://127.0.0.1/foo.php', headers={'host': 'example.com'})
should do the trick. If you want to verify that, run nc -l -p 80 (requires netcat) and then issue the request above. It will produce output like this in the netcat window:
GET /foo.php HTTP/1.1
Host: example.com
Connection: keep-alive
Accept-Encoding: gzip, deflate
Accept: */*
User-Agent: python-requests/2.6.2 CPython/3.4.3 Windows/8
You'd have to tell requests to fake the Host header, and replace the hostname in the URL with the IP address:
requests.get('http://123.45.67.89/foo.php', headers={'Host': 'www.example.com'})
The URL 'patching' can be done with the urlparse library:
parsed = urlparse.urlparse(url)
hostname = parsed.hostname
parsed = parsed._replace(netloc=ipaddress)
ip_url = parsed.geturl()
response = requests.get(ip_url, headers={'Host': hostname})
Demo against Stack Overflow:
>>> import urlparse
>>> import socket
>>> url = 'http://stackoverflow.com/help/privileges'
>>> parsed = urlparse.urlparse(url)
>>> hostname = parsed.hostname
>>> hostname
'stackoverflow.com'
>>> ipaddress = socket.gethostbyname(hostname)
>>> ipaddress
'198.252.206.16'
>>> parsed = parsed._replace(netloc=ipaddress)
>>> ip_url = parsed.geturl()
>>> ip_url
'http://198.252.206.16/help/privileges'
>>> response = requests.get(ip_url, headers={'Host': hostname})
>>> response
<Response [200]>
In this case I looked up the IP address dynamically.
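On Python 3 the urlparse module has moved to urllib.parse, but the same approach works; a minimal sketch of the flow above:
import socket
from urllib.parse import urlparse

import requests

url = 'http://stackoverflow.com/help/privileges'
parsed = urlparse(url)
hostname = parsed.hostname                    # 'stackoverflow.com'
ipaddress = socket.gethostbyname(hostname)    # look the IP up dynamically, as above
ip_url = parsed._replace(netloc=ipaddress).geturl()
response = requests.get(ip_url, headers={'Host': hostname})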
Answer for HTTPS/SNI support: Use the HostHeaderSSLAdapter in the requests_toolbelt module:
The above solution works fine with virtual hosts for unencrypted HTTP connections. For HTTPS you also need to pass SNI (Server Name Indication) in the TLS handshake, as some servers will present a different SSL certificate depending on what is passed in via SNI. Also, the Python ssl libraries by default don't look at the Host: header when matching the server at connection time.
The HostHeaderSSLAdapter provides a straightforward way of adding a transport adapter to requests that handles this for you.
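The adapter ships in the requests-toolbelt package, which is installed separately from requests itself:
pip install requests-toolbelt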
Example
import requests
from requests_toolbelt.adapters import host_header_ssl
# Create a new requests session
s = requests.Session()
# Mount the adapter for https URLs
s.mount('https://', host_header_ssl.HostHeaderSSLAdapter())
# Send your request
s.get("https://198.51.100.50", headers={"Host": "example.org"})
I think the best way to send HTTPS requests to a specific IP is to add a customized resolver that binds the domain name to the IP you want to hit. That way, both SNI and the Host header are set correctly, and certificate verification can succeed just as it does in a web browser.
Otherwise, you will see various issues such as InsecureRequestWarning and SSLCertVerificationError, and SNI will always be missing from the Client Hello, no matter what combination of headers and verify arguments you try:
requests.get('https://1.2.3.4/foo.php', headers={"Host": "example.com"}, verify=True)
In addition, I tried:
requests_toolbelt
pip install requests[security]
forcediphttpsadapter
all solutions mentioned in "Using requests with TLS doesn't give SNI support"
None of them set SNI when hitting https://IP directly.
import socket

import requests

# mock /etc/hosts
# lock it in multithreading or use multiprocessing if an endpoint is bound to multiple IPs frequently
etc_hosts = {}

# decorate Python's built-in resolver
def custom_resolver(builtin_resolver):
    def wrapper(*args, **kwargs):
        try:
            return etc_hosts[args[:2]]
        except KeyError:
            # fall back to builtin_resolver for endpoints not in etc_hosts
            return builtin_resolver(*args, **kwargs)
    return wrapper

# monkey patching
socket.getaddrinfo = custom_resolver(socket.getaddrinfo)

def _bind_ip(domain_name, port, ip):
    '''
    Resolve (domain_name, port) to the given ip.
    '''
    key = (domain_name, port)
    # (family, type, proto, canonname, sockaddr)
    value = (socket.AddressFamily.AF_INET, socket.SocketKind.SOCK_STREAM, 6, '', (ip, port))
    etc_hosts[key] = [value]

_bind_ip('www.example.com', 443, '1.2.3.4')
# this sends the request to 1.2.3.4
response = requests.get('https://www.example.com/foo.php', verify=True)
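A quick way to check that the patched resolver is in effect (values match the binding above; the exact repr may differ by Python version):
# should return the pinned entry instead of doing a real DNS lookup
print(socket.getaddrinfo('www.example.com', 443))
# [(<AddressFamily.AF_INET: 2>, <SocketKind.SOCK_STREAM: 1>, 6, '', ('1.2.3.4', 443))]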
Is it possible to send an HTTP request through two (or more) proxies at the same time in Python? The order of the proxy servers matters! (Additional info: the 1st proxy is SOCKS5 and requires authentication; the 2nd is HTTP, no auth.)
client -> SOCKS5 proxy server -> HTTP proxy server -> resource
The requests library allows only one proxy at a time:
import requests
from requests.auth import HTTPProxyAuth
url = 'http://example.com'
proxy_1 = {
    'http': 'socks5://host:port',
    'https': 'socks5://host:port'
}
auth = HTTPProxyAuth('user', 'password')
# a second proxy is not accepted by the requests API
# proxy_2 = {
#     'http': 'http://host:port',
#     'https': 'http://host:port'
# }
requests.get(url, proxies=proxy_1, auth=auth)
I need all this to check whether proxy_2 is working while I am behind proxy_1. Maybe there is a better way to do it?
Two basic ways to do proxy chaining in Python:
1. Modified this answer to be used when the FIRST proxy requires auth:
import socket

with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
    # create the connection to the 1st proxy
    sock.connect((proxy_host, proxy_port))
    # connect to the 2nd proxy
    # auth creds are for the FIRST proxy, while here you connect to the 2nd one
    request = b"CONNECT second_proxy_host:second_proxy_port HTTP/1.0\r\n" \
              b"Proxy-Authorization: Basic b64encoded_auth\r\n" \
              b"Connection: Keep-Alive\r\n" \
              b"Proxy-Connection: Keep-Alive\r\n\r\n"
    sock.send(request)
    print('Response 1:\n' + sock.recv(40).decode())
    # this request will be sent through the chain of two proxies
    # auth creds are still for the FIRST proxy
    request2 = b"GET http://www.example.com/ HTTP/1.0\r\n" \
               b"Proxy-Authorization: Basic b64encoded_auth\r\n" \
               b"Connection: Keep-Alive\r\n" \
               b"Proxy-Connection: Keep-Alive\r\n\r\n"
    sock.send(request2)
    print('Response 2:\n' + sock.recv(4096).decode())
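The b64encoded_auth placeholder above is the usual Basic-scheme value, i.e. base64 of user:password, which you can generate like this:
import base64

b64encoded_auth = base64.b64encode(b'user:password').decode()  # 'dXNlcjpwYXNzd29yZA=='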
2. Using PySocks:
# pip install pysocks
import socket

import socks

with socks.socksocket(socket.AF_INET, socket.SOCK_STREAM) as s:
    s.setproxy(proxytype=socks.PROXY_TYPE_SOCKS5,
               addr="proxy1_host",
               port=8080,
               username='user',
               password='password')
    s.connect(("proxy2_host", 8080))
    message = b'GET http://www.example.com/ HTTP/1.0\r\n\r\n'
    s.sendall(message)
    response = s.recv(4096)
    print(response.decode())
You could try using proxychains, something like this:
proxychains4 python test.py
#test.py
import requests
r = requests.get("https://ipinfo.io/ip")
print(r.content)
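The chain itself is configured in the proxychains config file (commonly /etc/proxychains.conf or /etc/proxychains4.conf, depending on the install); a minimal sketch for the SOCKS5-with-auth then HTTP setup from the question, with placeholder hosts, ports, and credentials:
# /etc/proxychains4.conf (excerpt)
strict_chain
[ProxyList]
socks5 proxy1_host 1080 user password
http proxy2_host 8080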
Or check out this question and this one.
Also, you could try using Selenium instead of requests and play with the web driver settings.
I am trying to scrape using proxies (this proxy server is a free one from the internet); in particular, I would like to use their IP, not my private one. To test my script I am trying to access "http://whatismyipaddress.com/" to see which IP this site sees. As it turns out, it sees my private IP. Can somebody tell me what's wrong here?
import requests
from fake_useragent import UserAgent

def getMyIP(proxyServer, myPrivateIP):
    scrape_website = "http://whatismyipaddress.com/"
    ua = UserAgent()
    headers = {'User-Agent': ua.random}
    try:
        response = requests.get(scrape_website, headers=headers, proxies={"https": proxyServer})
    except:
        faultString = proxyServer + " did not work; " + "\n"
        print(faultString)
        return
    if myPrivateIP in str(response.content):
        print("They found my private IP.")

proxyServer = "http://103.250.158.23:61219"
myPrivateIP = "xxx.xxx.xxx.xxx"
getMyIP(proxyServer, myPrivateIP)
Two things:
You set an {'https': ...} proxy configuration. This means the proxy is only used for HTTPS requests. You're requesting an HTTP URL, however, so that proxy isn't getting used. Configure an 'http' proxy instead or in addition (see the sketch after these two points).
If the proxy forwards your IP in an HTTP header, and the target server heeds that header, that's tough luck and there is nothing you can do about it, other than using a different proxy that doesn't forward your IP. I think point 1 is the more likely issue, though.
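For the first point, a minimal sketch of a proxy configuration covering both schemes (reusing the proxyServer and headers values from the question):
import requests

proxies = {
    "http": proxyServer,   # used for plain-HTTP URLs such as http://whatismyipaddress.com/
    "https": proxyServer,  # used for HTTPS URLs
}
response = requests.get("http://whatismyipaddress.com/", headers=headers, proxies=proxies)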
Recently I ported my client upload code from HTTPConnection to requests. When uploading an image:
file_name ='/path/to/216169_1286900924tait.jpg?pt=5&ek=1'
The image stored on disk really has that name, and I want to upload it to the remote server with the same path and name, so I constructed the request like this:
url = 'http://host/bucket_name/%s' % (file_name)
headers = {...} # some other headers
with open(file_name, 'rb') as fd:
    data = fd.read()
r = requests.put(url, data=data, headers=headers)
assert(r.status_code==200)
....
But the request sent to the server changed to this:
/path/to/216169_1286900924tait.jpg
requests should encode the tail as %3Fpt%3D5%26ek%3D1, but it seems that requests does no URL-encoding of the URL. I think it matched the ?pt=5&ek=1 pattern as request parameters. How can I make requests pass URLs through blindly, without this pattern matching?
Update:
The server gets the trimmed URL and calculates the signature with it, which does not match the signature I calculated with the original URL, so 403 is returned.
You might have a problem with the way you construct the URL:
>>> payload = {'pt': 5, 'ek': '1'}
>>> r = requests.get('http://host/bucket_name/file_name', params=payload)
If you call print(r.url) you should see the right form.
Why should requests presume to encode the query parameters? It does not know that you don't want that part of the URL treated as the query string. Besides, the request is sent as-is to the server; the query string is not omitted as you suggest. You can verify that with nc:
# run nc server
$ nc -l 1234
# then send request from Python
>>> requests.put('http://localhost:1234/path/to/216169_1286900924tait.jpg?pt=5&ek=1', data='any old thing')
nc will display the request:
PUT /path/to/216169_1286900924tait.jpg?pt=5&ek=1 HTTP/1.1
Host: localhost:1234
Content-Length: 13
User-Agent: python-requests/2.9.1
Connection: keep-alive
Accept: */*
Accept-Encoding: gzip, deflate
any old thing
So it is the remote server that is (correctly according to the HTTP protocol) interpreting the ?pt=5&ek=1 part of the file name as query parameters. What else should it do?
For comparison, since I assume that it previously worked with httplib.HTTPConnection:
>>> import httplib
>>> r = httplib.HTTPConnection('localhost', 1234)
>>> r.request('PUT', '/path/to/216169_1286900924tait.jpg?pt=5&ek=1', 'hello from httplib')
generates this request:
PUT /path/to/216169_1286900924tait.jpg?pt=5&ek=1 HTTP/1.1
Host: localhost:1234
Accept-Encoding: identity
Content-Length: 18
hello from httplib
Note that there is no difference in the way the URL is sent.
I dug into the requests source code and found the following line of code (yes, requests is based on urllib3):
scheme, auth, host, port, path, query, fragment = urllib3.util.parse_url(url)
It seems that you should URL-encode your URL manually while constructing your URL string, for example:
>>> path = '''~!##$^&*()_+|}{":?><`-=\\][\';.,'''
>>> url = 'http://host.com/bucket/%s' % path
>>> urllib3.util.parse_url(url)
>>> Url(scheme='http', auth=None, host='host.com', port=None, path='/bucket/~!#', query=None, fragment='$^&*()_+|}{":?><`-=B%7C%7D%7B%22%3A%3F%3E%3C%60-%3D%5C%5D%5B%27%3B.%2C')
Notice the path field in the output: it is not the same as path. If you encode path:
>>> path = '''~!##$^&*()_+|}{":?><`-=\\][\';.,'''
>>> url = 'http://host.com/bucket/%s' % (urllib.quote(path, ''))
>>> print url
>>> http://host.com/bucket/%7E%21%40%23%24%25%5E%26%2A%28%29_%2B%7C%7D%7B%22%3A%3F%3E%3C%60-%3D%5C%5D%5B%27%3B.%2C
>>> urllib3.util.parse_url(url)
>>> Url(scheme='http', auth=None, host='host.com', port=None, path='/bucket/%7E%21%40%23%24%25%5E%26%2A%28%29_%2B%7C%7D%7B%22%3A%3F%3E%3C%60-%3D%5C%5D%5B%27%3B.%2C', query=None, fragment=None)
This is what I want. If you want to pass some Unicode characters in the path, you do not need to encode them; they are automatically converted to %xx format. But URL-encoding is good advice for any characters you pass into a URL.
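Applied to the original upload code, that would look roughly like this (Python 2, same placeholders as the question):
import urllib

file_name = '/path/to/216169_1286900924tait.jpg?pt=5&ek=1'
url = 'http://host/bucket_name/%s' % (urllib.quote(file_name))
# the '?', '=' and '&' are now percent-encoded (%3F, %3D, %26), so the whole
# string is sent as the path instead of being split into a query string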
I need to make an API call (of sorts) in Django as part of the custom authentication system we require. A username and password are sent to a specific URL over SSL (using GET for those parameters) and the response should be an HTTP 200 "OK" response with the body containing XML with the user's info.
On an unsuccessful auth, it will return an HTTP 401 "Unauthorized" response.
For security reasons, I need to check:
The request was sent over an HTTPS connection
The server certificate's public key matches an expected value (I use 'certificate pinning' to defend against broken CAs)
Is this possible in python/django using pycurl/urllib2 or any other method?
Using M2Crypto:
from M2Crypto import SSL
ctx = SSL.Context('sslv3')
ctx.set_verify(SSL.verify_peer | SSL.verify_fail_if_no_peer_cert, depth=9)
if ctx.load_verify_locations('ca.pem') != 1:
    raise Exception('No CA certs')
c = SSL.Connection(ctx)
c.connect(('www.google.com', 443)) # automatically checks cert matches host
c.send('GET / \n')
c.close()
Using urllib2_ssl (it goes without saying but to be explicit: use it at your own risk):
import urllib2, urllib2_ssl
opener = urllib2.build_opener(urllib2_ssl.HTTPSHandler(ca_certs='ca.pem'))
xml = opener.open('https://example.com/').read()
Related: Making HTTPS Requests secure in Python.
Using pycurl:
import pycurl

c = pycurl.Curl()
c.setopt(pycurl.URL, "https://example.com?param1=val1&param2=val2")
c.setopt(pycurl.HTTPGET, 1)
c.setopt(pycurl.CAINFO, 'ca.pem')
c.setopt(pycurl.SSL_VERIFYPEER, 1)
c.setopt(pycurl.SSL_VERIFYHOST, 2)
c.setopt(pycurl.SSLVERSION, 3)
c.setopt(pycurl.NOBODY, 1)
c.setopt(pycurl.NOSIGNAL, 1)
c.perform()
c.close()
To implement 'certificate pinning' provide different 'ca.pem' for different domains.
httplib2 can do https requests with certificate validation:
import httplib2
http = httplib2.Http(ca_certs='/path/to/cert.pem')
try:
    http.request('https://...')
except httplib2.SSLHandshakeError, e:
    # do something
    pass
Just make sure that your httplib2 is up to date. The one shipped with my distribution (Ubuntu 10.04) does not have the ca_certs parameter.
Also, in a similar question to yours there is an example of certificate validation with pycurl.
I have Python code to call a REST service that is something like this:
import urllib
import urllib2
username = 'foo'
password = 'bar'
passwordManager = urllib2.HTTPPasswordMgrWithDefaultRealm()
passwordManager.add_password(None, MY_APP_PATH, username, password)
authHandler = urllib2.HTTPBasicAuthHandler(passwordManager)
opener = urllib2.build_opener(authHandler)
urllib2.install_opener(opener)
params = {"param1": param1,
          "param2": param2,
          "param3": param3}
xmlResults = urllib2.urlopen(MY_APP_PATH, urllib.urlencode(params)).read()
results = MyResponseParser.parse(xmlResults)
MY_APP_PATH is currently an HTTP URL. I would like to change it to use SSL ("HTTPS"). How would I go about changing this code to use HTTPS in the simplest way possible?
Unfortunately, urllib2 and httplib, at least up to Python 2.7, don't do any certificate verification when using HTTPS. The result is that you're exchanging information with a server you haven't necessarily identified (it's a bit like exchanging a secret with someone whose identity you haven't verified): this defeats the security purpose of HTTPS.
See this quote from httplib (in Python 2.7):
Note: This does not do any certificate
verification.
(This is independent of httplib.HTTPSConnection being able to send a client-certificate: that's what its key and cert parameters are for.)
There are ways around this, for example:
http://thejosephturner.com/blog/post/https-certificate-verification-in-python-with-urllib2/
http://code.google.com/p/python-httpclient/ (not using urllib2, so possibly not the shortest way for you)
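If you are on Python 2.7.9 or later (an assumption about your interpreter), the standard library can verify certificates: ssl.create_default_context plus urllib2.HTTPSHandler(context=...) can be combined with the basic-auth opener from the question. A minimal sketch, reusing MY_APP_PATH, the credentials and params from the question; the cafile path is a placeholder, and for pinning you would point it at the expected certificate:
import ssl
import urllib
import urllib2

# verify the server against a pinned certificate / CA bundle (path is a placeholder)
ctx = ssl.create_default_context(cafile='/path/to/pinned_cert.pem')

passwordManager = urllib2.HTTPPasswordMgrWithDefaultRealm()
passwordManager.add_password(None, MY_APP_PATH, username, password)
authHandler = urllib2.HTTPBasicAuthHandler(passwordManager)

opener = urllib2.build_opener(authHandler, urllib2.HTTPSHandler(context=ctx))
xmlResults = opener.open(MY_APP_PATH, urllib.urlencode(params)).read()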
Just using HTTPS:// instead of HTTP:// in the URL you are calling should work, at least if you are trying to reach a known/verified server. If necessary, you can use your client-side SSL certificate to secure the API transaction:
mykey = '/path/to/ssl_key_file'
mycert = '/path/to/ssl_cert_file'
auth_handler = urllib2.HTTPBasicAuthHandler()  # add HTTP Basic Authentication information...
auth_handler.add_password(None, MY_APP_PATH, user=settings.USER_ID, passwd=settings.PASSWD)
opener = urllib2.build_opener(HTTPSClientAuthHandler(mykey, mycert), auth_handler)
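Note that HTTPSClientAuthHandler is not part of the standard library; a commonly used sketch of such a handler (Python 2, client key/cert only, and it does not by itself verify the server certificate) looks roughly like this:
import httplib
import urllib2

class HTTPSClientAuthHandler(urllib2.HTTPSHandler):
    def __init__(self, key, cert):
        urllib2.HTTPSHandler.__init__(self)
        self.key = key
        self.cert = cert

    def https_open(self, req):
        # urllib2 expects a callable that builds the connection object
        return self.do_open(self.get_connection, req)

    def get_connection(self, host, timeout=300):
        # present the client certificate to the server
        return httplib.HTTPSConnection(host, key_file=self.key, cert_file=self.cert)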