Requests - proxies dictionary - python

I'm a little confused about the requests module, especially proxies.
From the documentation:

PROXIES
Dictionary mapping protocol to the URL of the proxy (e.g. {'http': 'foo.bar:3128'}) to be used on each Request.

Can there be more than one proxy of the same type in the dictionary? I mean, is it possible to put a list of proxies there so that the requests module tries them and uses only those that are working?
Or can there be only one proxy address per protocol, for example for http?

Using the proxies parameter is limited by the very nature of a Python dictionary (i.e. each key must be unique).
import requests

url = 'http://google.com'
proxies = {'https': '84.22.41.1:3128',
           'http': '185.26.183.14:80',
           'http': '178.33.230.114:3128'}

if __name__ == '__main__':
    print url
    print proxies
    response = requests.get(url, proxies=proxies)
    if response.status_code == 200:
        print response.text
    else:
        print 'Response ERROR', response.status_code
outputs
http://google.com
{'http': '178.33.230.114:3128', 'https': '84.22.41.1:3128'}
<!doctype html><html itemscope="" itemtype="http://schema.org/WebPage" lang="en"><head><meta content="Search the world's information, including webpages, images, videos and more. Google has many special features to help you find exactly what you're looking for."
...more html...
As you can see, the value of the http key in the proxies dictionary corresponds to the last one encountered in the assignment (i.e. 178.33.230.114:3128). Try swapping the http entries around.
So, the answer is no, you cannot specify multiple proxies for the same protocol using a simple dictionary.
I have tried using an iterable as a value, which would make sense to me
proxies = {'https': '84.22.41.1:3128',
           'http': ('178.33.230.114:3128', '185.26.183.14:80', )}
but with no luck; it produces an error.
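That said, nothing stops you from rotating candidates outside of requests. A minimal sketch (my illustration, not part of the original answer; the addresses are placeholders): try each proxy in turn and keep the first one that responds.

import requests

def get_via_first_working_proxy(url, candidates, timeout=10):
    # Try each proxy in turn; return the first successful response.
    for address in candidates:
        proxies = {'http': address, 'https': address}
        try:
            return requests.get(url, proxies=proxies, timeout=timeout)
        except (requests.exceptions.ConnectionError,
                requests.exceptions.Timeout):
            continue  # proxy is dead or unreachable, try the next one
    raise RuntimeError('no working proxy found')

response = get_via_first_working_proxy('http://google.com',
                                       ['178.33.230.114:3128', '185.26.183.14:80'])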

Well, actually you can; I've done this with a few lines of code and it works pretty well.
import requests
from random import choice


class Client:
    def __init__(self):
        self._session = requests.Session()
        self.proxies = None

    def set_proxy_pool(self, proxies, auth=None, https=True):
        """Randomly choose a proxy for every GET/POST request

        :param proxies: list of proxies, like ["ip1:port1", "ip2:port2"]
        :param auth: if proxy needs auth
        :param https: default is True, pass False if you don't need https proxy
        """
        if https:
            self.proxies = [{'http': p, 'https': p} for p in proxies]
        else:
            self.proxies = [{'http': p} for p in proxies]

        def get_with_random_proxy(url, **kwargs):
            proxy = choice(self.proxies)
            kwargs['proxies'] = proxy
            if auth:
                kwargs['auth'] = auth
            return self._session.original_get(url, **kwargs)

        def post_with_random_proxy(url, *args, **kwargs):
            proxy = choice(self.proxies)
            kwargs['proxies'] = proxy
            if auth:
                kwargs['auth'] = auth
            return self._session.original_post(url, *args, **kwargs)

        self._session.original_get = self._session.get
        self._session.get = get_with_random_proxy
        self._session.original_post = self._session.post
        self._session.post = post_with_random_proxy

    def remove_proxy_pool(self):
        self.proxies = None
        self._session.get = self._session.original_get
        self._session.post = self._session.original_post
        del self._session.original_get
        del self._session.original_post

    # You can define whatever operations you need using self._session
I use it like this:
client = Client()
client.set_proxy_pool(['112.25.41.136', '180.97.29.57'])
It's simple, but actually works for me.
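For contrast, the same behavior without monkey-patching (a sketch of my own, not the author's code): subclass requests.Session and override request(), which get() and post() both route through.

import random

import requests


class ProxyPoolSession(requests.Session):
    """Session that picks a random proxy from a pool for every request."""

    def __init__(self, proxies, auth=None):
        super().__init__()
        self._pool = [{'http': p, 'https': p} for p in proxies]
        self._auth = auth

    def request(self, method, url, **kwargs):
        # Only fill in a proxy/auth if the caller didn't pass one explicitly.
        kwargs.setdefault('proxies', random.choice(self._pool))
        if self._auth:
            kwargs.setdefault('auth', self._auth)
        return super().request(method, url, **kwargs)


session = ProxyPoolSession(['112.25.41.136', '180.97.29.57'])

Overriding request() keeps the proxy choice in one place and leaves the session's public methods untouched, at the cost of a dedicated class.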

Related

How to test python's http.client.HTTPResponse?

I'm trying to work with a third-party API and I am having problems sending the request when using requests or even urllib.request.
Somehow, when I use http.client I am successful in sending the request and receiving the response I need.
To make life easier for me, I created an API class below:
import http.client
import json


class API:
    def get_response_data(self, response: http.client.HTTPResponse) -> dict:
        """Get the response data."""
        response_body = response.read()
        response_data = json.loads(response_body.decode("utf-8"))
        return response_data
The way I use it is like this:
api = API()

rest_api_host = "api.app.com"
connection = http.client.HTTPSConnection(rest_api_host)
token = "my_third_party_token"
data = {
    "token": token
}
payload = json.dumps(data)  # serialize the dict to a JSON string (json.loads would fail here)
headers = {
    # some headers
}
connection.request("POST", "/some/endpoint/", payload, headers)
response = connection.getresponse()
response_data = api.get_response_data(response)  # I get a dictionary response
This workflow works for me. Now I just want to write a test for the get_response_data method.
How do I instantiate a http.client.HTTPResponse with the desired output to be tested?
For example:
import http.client

from unittest import TestCase

from . import API


class APITestCase(TestCase):
    """API test case."""

    def setUp(self) -> None:
        super().setUp()
        self.api = API()

    def test_get_response_data_returns_expected_response_data(self) -> None:
        """get_response_data() returns the expected data from an http.client.HTTPResponse."""
        expected_response_data = {"token": "a_secret_token"}
        # I want to do something like this
        response = http.client.HTTPResponse(expected_response_data)
        self.assertEqual(self.api.get_response_data(response), expected_response_data)
How can I do this?
From the http.client docs it says:
class http.client.HTTPResponse(sock, debuglevel=0, method=None, url=None)
Class whose instances are returned upon successful connection. Not instantiated directly by user.
I tried looking at socket for the sock argument in the instantiation but honestly, I don't understand it.
I tried reading the docs in
https://docs.python.org/3/library/http.client.html#http.client.HTTPResponse
https://docs.python.org/3/library/socket.html
I searched the internet for "how to test http.client.HTTPResponse" but haven't found the answer I was looking for.
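One approach (a sketch of mine, not an answer from the original thread): HTTPResponse only calls makefile("rb") on the object it receives as sock, so a tiny fake socket that wraps the raw bytes of an HTTP response is enough for a unit test. FakeSocket and make_response below are hypothetical helpers, not part of http.client:

import http.client
import io
import json

class FakeSocket:
    # Just enough of a socket for HTTPResponse: it only calls makefile().
    def __init__(self, raw: bytes):
        self._raw = raw

    def makefile(self, *args, **kwargs):
        return io.BytesIO(self._raw)

def make_response(body: dict) -> http.client.HTTPResponse:
    # Build a real HTTPResponse whose body is the given JSON payload.
    payload = json.dumps(body).encode("utf-8")
    raw = (b"HTTP/1.1 200 OK\r\n"
           b"Content-Type: application/json\r\n" +
           ("Content-Length: %d\r\n\r\n" % len(payload)).encode("ascii") +
           payload)
    response = http.client.HTTPResponse(FakeSocket(raw))
    response.begin()  # parse the status line and headers so read() works
    return response

In the test above, response = make_response(expected_response_data) would then replace the direct instantiation that fails.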

Derive protocol from url

I have a list of URLs in the format ["www.bol.com", "www.dopper.com"].
In order to input them into Scrapy as start URLs, I need to know the correct HTTP protocol.
For example:
["https://www.bol.com/nl/nl/", "https://dopper.com/nl"]
As you can see, the protocol might be https or http, and the domain may appear with or without www.
Not sure if there are any other variations.
Is there any Python tool that can determine the right protocol?
If not, and I have to build the logic myself, what cases should I take into account?
For option 2, this is what I have so far:
def identify_protocol(url):
    try:
        r = requests.get("https://" + url + "/", timeout=10)
        return r.url, r.status_code
    except requests.HTTPError:
        r = requests.get("http://" + url + "/", timeout=10)
        return r.url, r.status_code
    except requests.HTTPError:
        r = requests.get("https://" + url.replace("www.", "") + "/", timeout=10)
        return r.url, r.status_code
    except:
        return None, None
Is there any other possibility I should take into account?
There is no way to determine the protocol/full domain from the fragment directly; the information simply isn't there. In order to find it you would need either:
a database of the correct protocol/domains, which you can lookup your domain fragment in
to make the request and see what the server tells you
If you do (2) you can of course gradually build your own database to avoid needing the request in future.
On many https servers, if you attempt an http connection you will be redirected to https. If you are not, then you can reliably use http. If the http request fails, you could try again with https and see if it works.
The same applies to the domain: if the site usually redirects, you can perform the request using the original domain and see where you are redirected.
An example using requests:
>>> import requests
>>> r = requests.get('http://bol.com')
>>> r
<Response [200]>
>>> r.url
'https://www.bol.com/nl/nl/'
As you can see, the response object's url attribute holds the final destination URL, protocol included.
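Following up on the point about gradually building your own database, a minimal caching sketch (my illustration; resolve() and the _resolved dict are hypothetical names, not an established API):

import requests

_resolved = {}  # domain fragment -> final URL, filled in as lookups happen

def resolve(fragment):
    # Return the cached final URL for a fragment, making the request only once.
    fragment = fragment.strip()
    if fragment not in _resolved:
        r = requests.get("http://" + fragment, timeout=10)
        _resolved[fragment] = r.url  # URL after any protocol/domain redirects
    return _resolved[fragment]

print(resolve("bol.com"))  # first call hits the network
print(resolve("bol.com"))  # second call is answered from the cache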
As I understood the question, you need to retrieve the final url after all possible redirections. It can be done with the built-in urllib.request. If the provided url has no scheme, you can use http as the default. To parse the input url I used a combination of urlsplit() and urlunsplit().
Code:
import urllib.request as request
import urllib.parse as parse


def find_redirect_location(url, proxy=None):
    parsed_url = parse.urlsplit(url.strip())
    url = parse.urlunsplit((
        parsed_url.scheme or "http",
        parsed_url.netloc or parsed_url.path,
        parsed_url.path.rstrip("/") + "/" if parsed_url.netloc else "/",
        parsed_url.query,
        parsed_url.fragment
    ))

    if proxy:
        handler = request.ProxyHandler(dict.fromkeys(("http", "https"), proxy))
        opener = request.build_opener(handler, request.ProxyBasicAuthHandler())
    else:
        opener = request.build_opener()

    with opener.open(url) as response:
        return response.url
Then you can just call this function on every url in the list:
urls = ["bol.com ","www.dopper.com", "https://google.com"]
final_urls = list(map(find_redirect_location, urls))
You can also use proxies:
from itertools import cycle
urls = ["bol.com ","www.dopper.com", "https://google.com"]
proxies = ["http://localhost:8888"]
final_urls = list(map(find_redirect_location, urls, cycle(proxies)))
To make it a bit faster you can make checks in parallel threads using ThreadPoolExecutor:
from concurrent.futures import ThreadPoolExecutor
urls = ["bol.com ","www.dopper.com", "https://google.com"]
final_urls = list(ThreadPoolExecutor().map(find_redirect_location, urls))

How do I disable verification of the SSL Certificate while using PySocks and Socks5 Proxy?

This seems to get me close:
SOCKS5 proxy using urllib2 and PySocks
But it seems that if I try to add the context to disable the SSL verification, it just gets ignored.
I am not the greatest at Python, but it looks like the handler class in PySocks inherits from, and takes the same arguments as, HTTPSHandler:
https://github.com/Anorov/PySocks/blob/master/sockshandler.py
If so, I thought I could just pass context=context in without an issue.
But it doesn't work.
Here is my method...
def make_http_call(url, socks_hostname=None, socks_port=None, socks_username=None, socks_password=None, params=None):
    """
    Make a HTTP GET request to given url.
    """
    import ssl
    url = add_url_params(url, params)
    opener = urllib2.build_opener()
    context = ssl._create_unverified_context()  # note: must be called, not just referenced
    if socks_hostname and socks_port and socks_username and socks_password:
        # Use proxy instead if params are provided
        print "Socks Proxy is being used..."
        opener = urllib2.build_opener(
            SocksiPyHandler(socks.PROXY_TYPE_SOCKS5, socks_hostname, socks_port, True,
                            socks_username, socks_password, context=context))
    else:
        print "Socks Proxy not in use..."
    request = urllib2.Request(url)
    response = opener.open(request).read()
    return response
Is this possible?
For the context variable I would do this (untested code):

context = ssl.create_default_context()
context.check_hostname = False
context.verify_mode = ssl.CERT_NONE

Note that check_hostname has to be disabled before verify_mode is set to CERT_NONE, otherwise ssl raises a ValueError, and CERT_NONE needs to be referenced through the module as ssl.CERT_NONE. Now, I do not recommend this, as it no longer ensures that you are communicating with who you think you are, and I do not recommend using any Python version below 3, because Python 2 is known to have big security issues; see here.
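As a quick sanity check (my illustration, not from the answer above; requires Python 2.7.9+, where urlopen gained a context parameter), you can confirm verification is really off before wiring in the SOCKS handler. self-signed.badssl.com deliberately serves a self-signed certificate:

import ssl
import urllib2

context = ssl.create_default_context()
context.check_hostname = False
context.verify_mode = ssl.CERT_NONE

# succeeds only because certificate verification is disabled
print urllib2.urlopen('https://self-signed.badssl.com/', context=context).getcode()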

Setting proxy to urllib.request (Python3)

How can I set a proxy for urllib in Python 3?
I am doing the following:

from urllib import request as urlrequest

req = urlrequest.Request(url)  # note that Request has a capital R here, unlike prior versions
response = urlrequest.urlopen(req)
response.read()

I tried adding a proxy as follows:

req = urlrequest.Request.set_proxy(req, proxies, 'http')
However, I don't know how correct it is, since I am getting the following error:
336 def set_proxy(self, host, type):
--> 337 if self.type == 'https' and not self._tunnel_host:
338 self._tunnel_host = self.host
339 else:
AttributeError: 'NoneType' object has no attribute 'type'
You should be calling set_proxy() on an instance of class Request, not on the class itself:
from urllib import request as urlrequest
proxy_host = 'localhost:1234' # host and port of your proxy
url = 'http://www.httpbin.org/ip'
req = urlrequest.Request(url)
req.set_proxy(proxy_host, 'http')
response = urlrequest.urlopen(req)
print(response.read().decode('utf8'))
I needed to disable the proxy in our company environment, because I wanted to access a server on localhost. I could not disable the proxy server with the approach from @mhawke (I tried passing {}, None and [] as proxies).
This worked for me (it can also be used for setting a specific proxy; see the comment in the code).
import urllib.request as request
# disable proxies by passing an empty dictionary
proxy_handler = request.ProxyHandler({})
# alternatively you could set a proxy for http with
# proxy_handler = request.ProxyHandler({'http': 'http://www.example.com:3128/'})
opener = request.build_opener(proxy_handler)
url = 'http://www.example.org'
# open the website with the opener
req = opener.open(url)
data = req.read().decode('utf8')
print(data)
Urllib will automatically detect proxies set up in the environment - so one can just set the HTTP_PROXY variable either in your environment e.g. for Bash:
export HTTP_PROXY=http://proxy_url:proxy_port
or using Python e.g.
import os
os.environ['HTTP_PROXY'] = 'http://proxy_url:proxy_port'
Note from the urllib docs: "HTTP_PROXY [environment variable] will be ignored if a variable REQUEST_METHOD is set; see the documentation on getproxies()".
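To see what urllib actually detected from the environment, you can call getproxies() directly (illustrative):

import urllib.request

# returns a dict such as {'http': 'http://proxy_url:proxy_port'}, built from
# the *_proxy environment variables (and platform-specific settings)
print(urllib.request.getproxies())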
import urllib.request


def set_http_proxy(proxy):
    if proxy is None:  # use system default settings
        proxy_support = urllib.request.ProxyHandler()
    elif proxy == '':  # don't use any proxy
        proxy_support = urllib.request.ProxyHandler({})
    else:  # use the given proxy
        proxy_support = urllib.request.ProxyHandler({'http': proxy, 'https': proxy})
    opener = urllib.request.build_opener(proxy_support)
    urllib.request.install_opener(opener)


proxy = 'user:pass@ip:port'
set_http_proxy(proxy)

url = 'https://www.httpbin.org/ip'
request = urllib.request.Request(url)
response = urllib.request.urlopen(request)
html = response.read()
print(html)

How can I open a website with urllib via proxy in Python?

I have this program that checks a website, and I want to know how I can check it via a proxy in Python...
This is the code, just as an example:
while True:
    try:
        h = urllib.urlopen(website)
        break
    except:
        print '['+time.strftime('%Y/%m/%d %H:%M:%S')+'] '+'ERROR. Trying again in a few seconds...'
        time.sleep(5)
By default, urlopen uses the environment variable http_proxy to determine which HTTP proxy to use:
$ export http_proxy='http://myproxy.example.com:1234'
$ python myscript.py # Using http://myproxy.example.com:1234 as a proxy
If you instead want to specify a proxy inside your application, you can give a proxies argument to urlopen:
proxies = {'http': 'http://myproxy.example.com:1234'}
print("Using HTTP proxy %s" % proxies['http'])
urllib.urlopen("http://www.google.com", proxies=proxies)
Edit: If I understand your comments correctly, you want to try several proxies and print each proxy as you try it. How about something like this?
candidate_proxies = ['http://proxy1.example.com:1234',
                     'http://proxy2.example.com:1234',
                     'http://proxy3.example.com:1234']
for proxy in candidate_proxies:
    print("Trying HTTP proxy %s" % proxy)
    try:
        result = urllib.urlopen("http://www.google.com", proxies={'http': proxy})
        print("Got URL using proxy %s" % proxy)
        break
    except:
        print("Trying next proxy in 5 seconds")
        time.sleep(5)
Python 3 is slightly different here. It will try to auto-detect proxy settings, but if you need specific or manual proxy settings, think about this kind of code:

#!/usr/bin/env python3
import urllib.request

proxy_support = urllib.request.ProxyHandler({'http' : 'http://user:pass@server:port',
                                             'https': 'https://...'})
opener = urllib.request.build_opener(proxy_support)
urllib.request.install_opener(opener)

with urllib.request.urlopen(url) as response:
    html = response.read()  # ... implement things such as this
Refer also to the relevant section in the Python 3 docs
Here is example code showing how to use urllib to connect via a proxy:

authinfo = urllib.request.HTTPBasicAuthHandler()
proxy_support = urllib.request.ProxyHandler({"http": "http://ahad-haam:3128"})

# build a new opener that adds authentication and caching FTP handlers
opener = urllib.request.build_opener(proxy_support, authinfo,
                                     urllib.request.CacheFTPHandler)

# install it
urllib.request.install_opener(opener)

f = urllib.request.urlopen('http://www.google.com/')
"""
For http and https use:

proxies = {'http': 'http://proxy-source-ip:proxy-port',
           'https': 'https://proxy-source-ip:proxy-port'}

Keep in mind that a dictionary holds only one proxy per protocol; a literal with repeated 'http' keys keeps just the last entry (see the first answer above), so rotating through several proxies has to happen outside the dictionary.

Usage:

filehandle = urllib.urlopen(external_url, proxies=proxies)

Don't use any proxies (in case of links within the network):

filehandle = urllib.urlopen(external_url, proxies={})

Use proxy authentication via username and password:

proxies = {'http': 'http://username:password@proxy-source-ip:proxy-port',
           'https': 'https://username:password@proxy-source-ip:proxy-port'}

Note: avoid using special characters such as : and @ in the username and password.
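If the credentials do contain reserved characters, one workaround (my suggestion, not part of the answer above; shown in Python 3 syntax) is to percent-encode them before building the proxy URL:

from urllib.parse import quote

# hypothetical credentials containing reserved characters
username = quote('user@example', safe='')
password = quote('p:ss@word', safe='')
proxies = {'http': 'http://%s:%s@proxy-source-ip:proxy-port' % (username, password)}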
