Web scraping using Python and HTTPS proxies

Is there currently something in Python that supports HTTPS proxies for web scraping? I am currently using Python 2.7 on Windows, but I could use Python 3 if it supports proxying HTTPS.
I tried using mechanize and requests, but both failed when going through an HTTPS proxy.
This bit is using mechanize:
import mechanize
br = mechanize.Browser()
br.set_debug_http(True)
br.set_handle_robots(False)
br.set_proxies({
    "http": "ncproxy1.uk.net.intra:8080",
    "https": "ncproxy1.uk.net.intra:8080",
})
br.add_proxy_password("uname", "pass")
br.open("http://www.google.co.jp/") # OK
br.open("https://www.google.co.jp/") # Proxy Authentication Required
or using requests:
import requests
from requests.auth import HTTPProxyAuth
proxyDict = {
    'http': 'ncproxy1.uk.net.intra:8080',
    'https': 'ncproxy1.uk.net.intra:8080'
}
auth = HTTPProxyAuth('uname', 'pass')
r = requests.get("https://www.google.com", proxies=proxyDict, auth=auth)
print r.text
I obtain the following message:
Traceback (most recent call last):
File "D:\SRC\NuffieldLogger\NuffieldLogger\nuffieldrequests.py", line 10, in <module>
r = requests.get("https://www.google.com", proxies=proxyDict, auth=auth)
File "C:\Python27\lib\site-packages\requests\api.py", line 55, in get
return request('get', url, **kwargs)
File "C:\Python27\lib\site-packages\requests\api.py", line 44, in request
return session.request(method=method, url=url, **kwargs)
File "C:\Python27\lib\site-packages\requests\sessions.py", line 335, in request
resp = self.send(prep, **send_kwargs)
File "C:\Python27\lib\site-packages\requests\sessions.py", line 438, in send
r = adapter.send(request, **kwargs)
File "C:\Python27\lib\site-packages\requests\adapters.py", line 331, in send
raise SSLError(e)
requests.exceptions.SSLError: [Errno 1] _ssl.c:504: error:140770FC:SSL routines:SSL23_GET_SERVER_HELLO:unknown protocol

For the requests module you can use this:
#!/usr/bin/env python3
import requests
proxy_dict = {
    'http': 'http://user:passwd@proxy_ip:port',
    'https': 'https://user:passwd@proxy_ip:port'
}
r = requests.get('https://google.com', proxies=proxy_dict)
print(r.text)
I have tested this.
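A likely reason the original attempts failed only for HTTPS (my reading, not stated in this answer): HTTPProxyAuth merely attaches a Proxy-Authorization header to the request itself, which a proxy can read when relaying plain HTTP, but an HTTPS request is tunnelled through a CONNECT call that does not carry the request's headers. Embedding the credentials in the proxy URL lets them be presented during the CONNECT handshake. A minimal sketch reusing the hostnames from the question:
import requests

# Credentials embedded in the proxy URL (placeholders from the question),
# so they can be sent during the CONNECT handshake for HTTPS targets.
proxies = {
    'http': 'http://uname:pass@ncproxy1.uk.net.intra:8080',
    'https': 'http://uname:pass@ncproxy1.uk.net.intra:8080',
}
r = requests.get('https://www.google.co.jp/', proxies=proxies)
print(r.status_code)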

Related

check_hostname requires server_hostname

I was using this code with Python 3.9 and it worked fine, but I have updated to an alpha version of Python 3.10 and now I get this exception:
proxy = {
    "http": "http://" + proxy_ip,
    "https": "https://" + proxy_ip
}
requests.get("https://www.url.org/", proxies=proxy, timeout=10)
Error:
check_hostname requires server_hostname
I changed the code to the following, as suggested in another question, but I got a different error:
proxy = {
    "http": proxy_ip,
    "https": proxy_ip
}
requests.get("https://www.url.org/", proxies=proxy, timeout=10)
Error:
Proxy URL had no scheme, should start with http:// or https://
Any ideas besides downgrading the Python version?
All code:
import requests
from os import getcwd
from tqdm import tqdm

# Check proxy
def checkProxy(proxy_ip):
    proxy = {
        "http": proxy_ip,
        "https": proxy_ip
    }
    try:
        requests.get("https://www.url.org/", proxies=proxy, timeout=10)
        return True
    except:
        return False

# Main
if __name__ == "__main__":
    with open('proxy.txt', encoding='utf-8') as f:
        proxies = f.readlines()
    proxies = [x.strip() for x in proxies]
    f = open("proxy-valid.txt", 'a+')
    for proxy in tqdm(proxies):
        if checkProxy(proxy):
            f.write(proxy + "\n")
            f.flush()
    f.close()
Traceback:
Traceback (most recent call last):
File "C:\Users\User\Desktop\NG\utils\proxy-checker.py", line 39, in <module>
if checkProxy(proxy):
File "C:\Users\User\Desktop\NG\utils\proxy-checker.py", line 12, in checkProxy
requests.get("https://www.url.org/", proxies=proxy, timeout=10)
File "C:\Users\User\AppData\Local\Programs\Python\Python310\lib\site-packages\requests\api.py", line 76, in get
return request('get', url, params=params, **kwargs)
File "C:\Users\User\AppData\Local\Programs\Python\Python310\lib\site-packages\requests\api.py", line 61, in request
return session.request(method=method, url=url, **kwargs)
File "C:\Users\User\AppData\Local\Programs\Python\Python310\lib\site-packages\requests\sessions.py", line 542, in request
resp = self.send(prep, **send_kwargs)
File "C:\Users\User\AppData\Local\Programs\Python\Python310\lib\site-packages\requests\sessions.py", line 655, in send
r = adapter.send(request, **kwargs)
File "C:\Users\User\AppData\Local\Programs\Python\Python310\lib\site-packages\requests\adapters.py", line 414, in send
raise InvalidURL(e, request=request)
requests.exceptions.InvalidURL: Proxy URL had no scheme, should start with http:// or https://
Thanks
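A fix commonly suggested for this exact error (a sketch based on my own reading, not taken from this thread): keep the http:// scheme for both proxy entries. requests talks to the proxy over plain HTTP and tunnels the HTTPS request through a CONNECT call; an https:// proxy scheme makes newer urllib3 attempt a TLS connection to the proxy itself, which is where check_hostname requires server_hostname comes from.
import requests

proxy_ip = '203.0.113.10:8080'  # hypothetical address standing in for a real proxy

proxy = {
    'http': 'http://' + proxy_ip,
    'https': 'http://' + proxy_ip,  # http:// scheme even for the https entry
}
r = requests.get('https://www.url.org/', proxies=proxy, timeout=10)
print(r.status_code)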

cURL to Python: Connection error when using requests module

I want to move my bash code, which uses a cURL command, to a Python 2.7 script.
The working cURL command is:
$ curl --data "vm_id='52e4130d-ffe0-495a-87c0-fc84200252ed'&gpu_ip='10.2.0.22'&gpu_port='8308'&mock_ip='10.254.254.254'&mock_port='8308'" http://rodvr-services:8080/rodvr-assign_gpu
And my Python script contains this:
import requests
import requests.packages.urllib3
requests.packages.urllib3.disable_warnings()
payload = {'vm_id': '52e4130d-ffe0-495a-87c0-fc84200252ed', 'gpu_ip': '10.2.0.22', 'gpu_port': '8308', 'mock_ip': '10.254.254.254', 'mock_port': '8308'}
r = requests.get('http://rodvr-services:8080/rodvr-assign_gpu', params=payload)
When I execute the script, I get the following error:
$ python exec.py
Traceback (most recent call last):
File "exec.py", line 9, in <module>
r = requests.post('http://rodvr-services:8080/rodvr-assign_gpu', params=payload)
File "/usr/local/lib/python2.7/dist-packages/requests/api.py", line 112, in post
return request('post', url, data=data, json=json, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/requests/api.py", line 58, in request
return session.request(method=method, url=url, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/requests/sessions.py", line 502, in request
resp = self.send(prep, **send_kwargs)
File "/usr/local/lib/python2.7/dist-packages/requests/sessions.py", line 612, in send
r = adapter.send(request, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/requests/adapters.py", line 490, in send
raise ConnectionError(err, request=request)
requests.exceptions.ConnectionError: ('Connection aborted.', BadStatusLine('\n',))
Just in case, I checked what would happen using Python 3, and this is the output:
HTTPConnectionPool(host='rodvr-services', port=8080): Max retries exceeded with url: /rodvr-assign_gpu?mock_ip=10.254.254.254&vm_id=52e4130d-ffe0-495a-87c0-fc84200252ed&gpu_ip=10.2.0.22&mock_port=8308&gpu_port=8308 (Caused by <class 'http.client.BadStatusLine'>:
However, using the urllib2 library, it works:
data = "vm_id='52e4130d-ffe0-495a-87c0-fc84200252ed'&gpu_ip='10.2.0.22'&gpu_port='8308'&mock_ip='10.254.254.254'&mock_port='8308'"
r = urllib2.Request(url='http://rodvr-services:8080/rodvr-assign_gpu', data=data)
f = urllib2.urlopen(r)
print f.read()
Try r = requests.post('http://rodvr-services:8080/rodvr-assign_gpu', data=payload)
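Expanding on that suggestion (my sketch, not part of the original answer): curl --data sends a form-encoded POST body, whereas the script issues a GET with the fields in the query string, and the server's reply to that request is what confuses httplib (the BadStatusLine). Note that the original curl command wraps each value in literal single quotes, reproduced here so the body matches it byte for byte:
import requests

payload = {
    'vm_id': "'52e4130d-ffe0-495a-87c0-fc84200252ed'",
    'gpu_ip': "'10.2.0.22'",
    'gpu_port': "'8308'",
    'mock_ip': "'10.254.254.254'",
    'mock_port': "'8308'",
}
r = requests.post('http://rodvr-services:8080/rodvr-assign_gpu', data=payload)
print r.text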
This website helps you convert your curl command to Python code.
You can see the code suggested by that website below:
import requests
data = [
('vm_id', '\'52e4130d-ffe0-495a-87c0-fc84200252ed\''),
('gpu_ip', '\'10.2.0.22\''),
('gpu_port', '\'8308\''),
('mock_ip', '\'10.254.254.254\''),
('mock_port', '\'8308\''),
]
requests.post('http://rodvr-services:8080/rodvr-assign_gpu', data=data)
# it is slightly different from your code
Due to personal problems with my laptop, I can't test your code; I hope this works for you.

requests.get() raises SSL Error

I am using Python 2.7.3 on a shared FreeBSD host.
The installed version of python requests module is 2.11.1.
import json
import requests
from requests.auth import HTTPBasicAuth
requests.packages.urllib3.disable_warnings()
s = requests.Session()
s.server = "dns-api.company.net"
s.auth = HTTPBasicAuth('user', 'pass')
s.headers = {'User-Agent':'DNS-Client'}
s.verify = False
r = s.get('https://dns-api.company.net/query')  # not the actual URL
As you can see, I am setting verify to False, yet I get the following SSL error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python2.7/site-packages/requests/sessions.py", line 477, in get
return self.request('GET', url, **kwargs)
File "/usr/local/lib/python2.7/site-packages/requests/sessions.py", line 465, in request
resp = self.send(prep, **send_kwargs)
File "/usr/local/lib/python2.7/site-packages/requests/sessions.py", line 573, in send
r = adapter.send(request, **kwargs)
File "/usr/local/lib/python2.7/site-packages/requests/adapters.py", line 431, in send
raise SSLError(e, request=request)
requests.exceptions.SSLError: [Errno bad handshake] [('SSL routines', 'SSL23_GET_SERVER_HELLO', '')]
I have tried the following variations, but to no avail; I end up with the same error.
from requests.packages.urllib3.exceptions import InsecureRequestWarning
requests.packages.urllib3.disable_warnings(InsecureRequestWarning)
r = requests.get("https://dns-api.company.net/query", verify=False, auth=('user','pass'))  # dummy URL
I don't care about SSL verification. What am I doing wrong here?
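One thing worth checking (my suggestion, not from the original post): verify=False only skips certificate verification; it cannot rescue a handshake that fails before verification even happens. Python builds older than 2.7.9 ship an ssl module without SNI, and many servers abort the handshake when SNI is missing. A quick diagnostic:
import ssl
import requests

# HAS_SNI was added in Python 2.7.9; older interpreters lack the
# attribute entirely, hence the getattr default.
print getattr(ssl, 'HAS_SNI', False)
print requests.__version__
On interpreters without SNI, the usual workaround was pip install 'requests[security]', which pulls in pyOpenSSL and related packages so requests can complete modern handshakes.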

HTTP request through proxy in Python having @ in password

How do I escape the @ character in a proxy password so that Python can build the request correctly? I have tried \\ but I am still unable to hit the URL correctly.
proxy = {
"http": "http://UserName:PassWord#X.X.X.X:Port_No"
}
Update:
I am using the Python requests module for the HTTP request. It splits the string (to get the host) at the first occurrence of @, whereas it was supposed to split at the second @.
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python2.7/dist-packages/requests/api.py", line 55, in get
return request('get', url, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/requests/api.py", line 44, in request
return session.request(method=method, url=url, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/requests/sessions.py", line 335, in request
resp = self.send(prep, **send_kwargs)
File "/usr/local/lib/python2.7/dist-packages/requests/sessions.py", line 438, in send
r = adapter.send(request, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/requests/adapters.py", line 327, in send
raise ConnectionError(e)
requests.exceptions.ConnectionError: HTTPConnectionPool(host='XXXXXXXX@X.X.X.X', port=XXXXX): Max retries exceeded with url: http:/URL (Caused by <class 'socket.gaierror'>: [Errno -2] Name or service not known)
You have to do URL-encoding, as in this post:
Escaping username characters in basic auth URLs
This way the @ in the password becomes %40.
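A small sketch of that encoding step (my own illustration; the names and the port are placeholders):
from urllib import quote  # Python 2; on Python 3 use urllib.parse.quote

password = 'Pass@Word'  # hypothetical password containing @
proxy_url = 'http://UserName:%s@X.X.X.X:8080' % quote(password, safe='')
print proxy_url  # http://UserName:Pass%40Word@X.X.X.X:8080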
You don't mention which library you are using to perform your HTTP requests, so you should consider using requests, not only to solve this problem, but because it is a great library.
Here is how to use a proxy with basic authentication:
import requests
proxy = {'http': 'http://UserName:PassWord@X.X.X.X:Port_No'}
r = requests.get("http://whereever.com", proxies=proxy)
Update
Successfully tested with requests and proxy URLs:
http://UserName:PassWord@127.0.0.1:1234
http://UserName:PassWord@@127.0.0.1:1234
http://User@Name:PassWord@1234@127.0.0.1:1234
If, instead, you need to use Python's urllib2 library, you can do this:
import urllib2
handler = urllib2.ProxyHandler({'http': 'http://UserName:PassWord@X.X.X.X:Port_No'})
opener = urllib2.build_opener(handler)
r = opener.open('http://whereever.com')
Note that in neither case is it necessary to escape the @.
A third option is to set environment variables HTTP_PROXY and/or HTTPS_PROXY (in *nix).
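For the environment-variable route, a minimal sketch (values are placeholders, with the @ in the password percent-encoded as %40):
import os
import requests

os.environ['HTTP_PROXY'] = 'http://UserName:Pass%40Word@X.X.X.X:8080'
os.environ['HTTPS_PROXY'] = 'http://UserName:Pass%40Word@X.X.X.X:8080'

# requests honours these automatically when no proxies= argument is given.
r = requests.get('http://whereever.com')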

Make POST request using Python

I am trying to make a POST request, but I am getting this error:
Traceback (most recent call last):
File "demo.py", line 7, in <module>
r = requests.post(url, data=payload, headers=headers)
File "/usr/local/lib/python2.7/dist-packages/requests/api.py", line 87, in post
return request('post', url, data=data, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/requests/api.py", line 44, in request
return session.request(method=method, url=url, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/requests/sessions.py", line 266, in request
prep = req.prepare()
File "/usr/local/lib/python2.7/dist-packages/requests/models.py", line 215, in prepare
p.prepare_body(self.data, self.files)
File "/usr/local/lib/python2.7/dist-packages/requests/models.py", line 338, in prepare_body
body = self._encode_params(data)
File "/usr/local/lib/python2.7/dist-packages/requests/models.py", line 74, in _encode_params
for k, vs in to_key_val_list(data):
ValueError: too many values to unpack
This is my program:
import requests
url = 'http://www.n-gal.com/index.php?route=openstock/openstock/optionStatus'
payload = {'var:1945,product_id:1126'}
headers = {'content-type': 'application/x-www-form-urlencoded'}
r = requests.post(url, data=payload, headers=headers)
I have tried the same POST request through Advanced rest client using following data :
URL : http://www.n-gal.com/index.php?route=openstock/openstock/optionStatus
payload : var=1945&product_id=1126
Content-Type: application/x-www-form-urlencoded
and it works fine there. Can anyone help me, please?
You have made payload a set, not a dictionary: the quotes wrap the whole contents as a single string instead of quoting each key and value separately.
Change:
payload = {'var:1945,product_id:1126'}
To:
payload = {'var':'1945','product_id':'1126'}
Because it is a set rather than a mapping of key/value pairs, preparing the request body fails.
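A quick illustration of the difference (my example):
payload = {'var:1945,product_id:1126'}
print type(payload)  # <type 'set'> - one string, no key/value pairs

payload = {'var': '1945', 'product_id': '1126'}
print type(payload)  # <type 'dict'>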
Try this:
import requests
url = 'http://www.n-gal.com/index.php?route=openstock/openstock/optionStatus'
payload = 'var=1945&product_id=1126'
headers = {'content-type': 'application/x-www-form-urlencoded'}
r = requests.post(url, data=payload, headers=headers)
print r.json()
import requests
import json

url = 'http://www.n-gal.com/index.php?route=openstock/openstock/optionStatus'
payload = {'var': '1945', 'product_id': '1126'}  # must be a dict, not a set, for json.dumps
headers = {'content-type': 'application/x-www-form-urlencoded'}
r = requests.post(url, data=json.dumps(payload), headers=headers)
I know this is a really old question, but I had the same problem when using Docker recently, and managed to solve it by including the requests library in the requirements (or just upgrading it with pip install requests --upgrade).
The older version of the requests library raised that very same error on Python 3.8, and it stopped only after the upgrade.
