Python: how to use/change proxy with mechanize

I'm writing a web-scraping program in Python using mechanize. The problem I'm having is that the website I'm scraping limits the amount of time you can spend on it. When I was doing everything by hand, I would use a SOCKS proxy as a work-around.
What I tried to do is go to the network preferences (MacBook Pro Retina 13", Mavericks) and change the proxy there. However, the program didn't respond to that change; it kept running without the proxy.
Then I added .set_proxies() so now the code to open the website looks something like this:
b=mechanize.Browser() #open browser
b.set_proxies({"http":"96.8.113.76:8080"}) #proxy
DBJ=b.open(URL) #open url
When I ran the program, I got this error:
Traceback (most recent call last):
File "GM1.py", line 74, in <module>
DBJ=b.open(URL)
File "build/bdist.macosx-10.9-intel/egg/mechanize/_mechanize.py", line 203, in open
File "build/bdist.macosx-10.9-intel/egg/mechanize/_mechanize.py", line 230, in _mech_open
File "build/bdist.macosx-10.9-intel/egg/mechanize/_opener.py", line 193, in open
File "build/bdist.macosx-10.9-intel/egg/mechanize/_urllib2_fork.py", line 344, in _open
File "build/bdist.macosx-10.9-intel/egg/mechanize/_urllib2_fork.py", line 332, in _call_chain
File "build/bdist.macosx-10.9-intel/egg/mechanize/_urllib2_fork.py", line 1142, in http_open
File "build/bdist.macosx-10.9-intel/egg/mechanize/_urllib2_fork.py", line 1118, in do_open
urllib2.URLError: <urlopen error [Errno 54] Connection reset by peer>
I'm assuming that the proxy was changed and that this error is a response to that proxy.
Maybe I am misusing .set_proxies().
I'm not sure whether the proxy itself is the issue or the connection is just really slow.
Should I even be using SOCKS proxies for this type of thing, or is there a better alternative for what I am trying to do?
Any information would be extremely helpful. Thanks in advance.

A SOCKS proxy is not the same as an HTTP proxy. The protocol between client and proxy is different. The line:
b.set_proxies({"http":"96.8.113.76:8080"})
tells mechanize to use the HTTP proxy at 96.8.113.76:8080 for requests whose URL has the http scheme, e.g. a request for URL http://httpbin.org/get will be sent via the proxy at 96.8.113.76:8080. Mechanize expects this to be an HTTP proxy server and uses the corresponding protocol. It seems that your SOCKS proxy is closing the connection because it is not receiving a valid SOCKS proxy request (it is actually receiving an HTTP proxy request).
I don't think that mechanize has built-in support for SOCKS, so you may have to resort to some dirty tricks such as those in this answer. For that you will need to install the PySocks package. This might work for you:
import socks
import socket
from mechanize import Browser

SOCKS_PROXY_HOST = '96.8.113.76'
SOCKS_PROXY_PORT = 8080

def create_connection(address, timeout=None, source_address=None):
    sock = socks.socksocket()
    sock.connect(address)
    return sock

# add username and password arguments if proxy authentication required.
socks.setdefaultproxy(socks.PROXY_TYPE_SOCKS5, SOCKS_PROXY_HOST, SOCKS_PROXY_PORT)

# patch the socket module
socket.socket = socks.socksocket
socket.create_connection = create_connection

br = Browser()
response = br.open('http://httpbin.org/get')
>>> print response.read()
{
  "args": {},
  "headers": {
    "Accept-Encoding": "identity",
    "Connection": "close",
    "Host": "httpbin.org",
    "User-Agent": "Python-urllib/2.7",
    "X-Request-Id": "e728cd40-002c-4f96-a26a-78ce4d651fda"
  },
  "origin": "192.161.1.100",
  "url": "http://httpbin.org/get"
}

Related

Python HTTPS request SSLError CERTIFICATE_VERIFY_FAILED

PYTHON
import requests

url = "https://REDACTED/pb/s/api/auth/login"
r = requests.post(
    url,
    data = {
        'username': 'username',
        'password': 'password'
    }
)
NIM
import httpclient, json

let client = newHttpClient()
client.headers = newHttpHeaders({ "Content-Type": "application/json" })
let body = %*{
  "username": "username",
  "password": "password"
}
let resp = client.request("https://REDACTED.com/pb/s/api/auth/login", httpMethod = httpPOST, body = $body)
echo resp.body
I'm calling an API to get some data. Running the Python code I get the traceback below. However, the Nim code works perfectly, so there must be something wrong with the Python code or setup.
I'm running Python version 2.7.15 with requests version 2.19.1.
Traceback (most recent call last):
File "C:/Python27/testht.py", line 21, in <module>
"Referer": "https://REDACTED.com/pb/a/"
File "C:\Python27\lib\site-packages\requests\api.py", line 112, in post
return request('post', url, data=data, json=json, **kwargs)
File "C:\Python27\lib\site-packages\requests\api.py", line 58, in request
return session.request(method=method, url=url, **kwargs)
File "C:\Python27\lib\site-packages\requests\sessions.py", line 512, in request
resp = self.send(prep, **send_kwargs)
File "C:\Python27\lib\site-packages\requests\sessions.py", line 622, in send
r = adapter.send(request, **kwargs)
File "C:\Python27\lib\site-packages\requests\adapters.py", line 511, in send
raise SSLError(e, request=request)
SSLError: HTTPSConnectionPool(host='REDACTED.com', port=443): Max retries exceeded with url: /pb/s/api/auth/login (Caused by SSLError(SSLError(1, u'[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:726)'),))
The requests module will verify the cert it gets from the server, much like a browser would. Rather than letting you click through and "add an exception" like you would in your browser, requests raises that exception.
There's a way around it though: try adding verify=False to your post call.
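For example, a minimal sketch based on the snippet above (disabling verification removes the protection TLS gives you, so treat it as a test/debug measure only):

import requests

url = "https://REDACTED/pb/s/api/auth/login"

# verify=False skips certificate validation entirely, so requests will no longer
# raise CERTIFICATE_VERIFY_FAILED -- but the connection is also no longer authenticated.
r = requests.post(
    url,
    data={'username': 'username', 'password': 'password'},
    verify=False
)
print(r.status_code)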
However, the nim code works perfectly so there must be something wrong with the python code or setup.
Actually, your Python code or setup is not really to blame; it is rather the Nim code, or more precisely the defaults of the httpclient library. The Nim documentation shows that httpclient.request by default uses an SSL context returned by getDefaultSSL, which according to this code creates a context that does not verify the certificate:
proc getDefaultSSL(): SSLContext =
  result = defaultSslContext
  when defined(ssl):
    if result == nil:
      defaultSSLContext = newContext(verifyMode = CVerifyNone)
Your Python code instead attempts to properly verify the certificate since the requests library does this by default. And it fails to verify the certificate because something is wrong - either with your setup or the server.
It is unclear who has issued the certificate for your site but if it is not in your default CA store you can use the verify argument of requests to specify the issuer CA. See this documentation for details.
If the site you are trying to access works with the browser but fails with your program it might be that it uses a special CA which was added as trusted to the browser (like a company certificate). Browsers and Python use different trust stores so this added certificate needs to be added to Python or at least to your program as trusted too. It might also be that the setup of the server has problems. Browsers can sometimes work around problems like a missing intermediate certificate but Python doesn't. In case of a public accessible site you could use SSLLabs to check what's wrong.
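If you do have the issuing CA, a sketch of the verify-with-CA approach looks like this (the bundle filename here is a placeholder, not something from the question):

import requests

# 'company-ca.pem' is a hypothetical path; point it at the PEM file containing
# the CA that actually issued the server's certificate.
r = requests.post(
    "https://REDACTED/pb/s/api/auth/login",
    data={'username': 'username', 'password': 'password'},
    verify='company-ca.pem'
)
print(r.status_code)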

Python Requests post times out despite timeout setting

I am using the Python Requests module (v. 2.19.1) with Python 3.4.3, calling a function on a remote server that generates a .csv file for download. In general, it works perfectly. There is one particular file that takes >6 minutes to complete, and no matter what I set the timeout parameter to, I get an error after exactly 5 minutes trying to generate that file.
import requests
s = requests.Session()
authPayload = {'UserName': 'myloginname','Password': 'password'}
loginURL = 'https://myremoteserver.com/login/authenticate'
login = s.post(loginURL, data=authPayload)
backupURL = 'https://myremoteserver.com/directory/jsp/Backup.jsp'
payload = {'command': fileCommand}
headers = {'Connection': 'keep-alive'}
post = s.post(backupURL, data=payload, headers=headers, timeout=None)
This times out after exactly 5 minutes with the error:
File "/usr/lib/python3/dist-packages/requests/adapters.py", line 330, in send
timeout=timeout
File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 612, in urlopen
raise MaxRetryError(self, url, e)
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='myremoteserver.com', port=443): Max retries exceeded with url: /directory/jsp/Backup.jsp (Caused by <class 'http.client.BadStatusLine'>: '')
If I set timeout to something much smaller, say, 5 seconds, I get a error that makes perfect sense:
urllib3.exceptions.ReadTimeoutError:
HTTPSConnectionPool(host='myremoteserver.com', port=443): Read
timed out. (read timeout=5)
If I run the process from a browser, it works fine, so it doesn't seem like it's the remote server closing the connection, or a firewall or something in-between closing the connection.
Posted at the request of the OP -- my comments on the original question pointed to a related SO problem.
The clue to the problem lies in the http.client.BadStatusLine error.
Take a look at the following related SO Q&A, which discusses the impact of proxy servers on HTTP requests and responses.
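One quick check along those lines (not from the linked post, just a sketch): see whether the environment is silently routing the request through a proxy that could be enforcing its own idle timeout.

import urllib.request

# Proxies picked up from the environment (http_proxy / https_proxy, etc.).
# If one shows up here, that intermediary -- rather than the remote server --
# may be what cuts the connection at the 5 minute mark.
print(urllib.request.getproxies())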

Python proxy connection fails at urllib splituser _userprog

I'm trying to access an HTTP web service from behind an organizational firewall using a proxy. To access the service, I need to generate a token over an HTTPS connection to the service provider. For some reason my connection over the proxy fails, and the Python interpreter throws an error at line 1072 of urllib, which deals with _userprog inside the splituser def:
match = _userprog.match(host)
The corresponding error text is 'expected string or buffer'.
I've added both http_proxy and https_proxy as environment variables using SETX in the command line...
SETX http_proxy http:\\user:pw#proxyIP:port
SETX https_proxy https:\\user:pw#proxyIP:port
...and added the proxy handlers before the GetToken code of my script:
# set proxies
proxy = urllib2.ProxyHandler({
    'http': 'proxy_ip',
    'https': 'proxy_ip'
})
opener = urllib2.build_opener(proxy)
urllib2.install_opener(opener)
class GetToken(object):
    def urlopen(self, url, data=None):
        # open url, send response
        referer = "http://www.arcgis.com/arcgis/rest"
        req = urllib2.Request(url)
        req.add_header('Referer', referer)
        if data:
            response = urllib2.urlopen(req, data)
        else:
            response = urllib2.urlopen(req)
        return response

    def gentoken(self, username, password,
                 referer = 'www.arcgis.com', expiration=60):
        # gets token from referrer
        query_dict = {'username': username,
                      'password': password,
                      'expiration': str(expiration),
                      'client': 'referer',
                      'referer': referer,
                      'f': 'json'}
        query_string = urllib.urlencode(query_dict)
        token_url = "https://www.arcgis.com/sharing/rest/generateToken"
        token_response = urllib.urlopen(token_url, query_string)
        token = json.loads(token_response.read())
        if "token" not in token:
            print token['messages']
            exit()
        else:
            return token['token']
But it still throws the same error. Any advice would be much appreciated and thank you in advance!
UPDATE
Thanks mhawke for the slash suggestion, that changed things... but now I'm getting a new error; here's the traceback:
Traceback
<module> C:\Users\tle\Desktop\Scripts\dl_extract2.py 161
main C:\Users\tle\Desktop\Scripts\dl_extract2.py 157
__init__ C:\Users\tle\Desktop\Scripts\dl_extract2.py 53
gentoken C:\Users\tle\Desktop\Scripts\dl_extract2.py 40
urlopen C:\Python26\ArcGIS10.0\lib\urllib.py 88
open C:\Python26\ArcGIS10.0\lib\urllib.py 207
open_https C:\Python26\ArcGIS10.0\lib\urllib.py 439
endheaders C:\Python26\ArcGIS10.0\lib\httplib.py 904
_send_output C:\Python26\ArcGIS10.0\lib\httplib.py 776
send C:\Python26\ArcGIS10.0\lib\httplib.py 735
connect C:\Python26\ArcGIS10.0\lib\httplib.py 1112
wrap_socket C:\Python26\ArcGIS10.0\lib\ssl.py 350
__init__ C:\Python26\ArcGIS10.0\lib\ssl.py 118
do_handshake C:\Python26\ArcGIS10.0\lib\ssl.py 293
IOError: [Errno socket error] [Errno 1] _ssl.c:480: error:140770FC:SSL routines:SSL23_GET_SERVER_HELLO:unknown protocol
UPDATE 2
As per mhawke's suggestion, I tried using urllib2 instead of urllib for the HTTPS connection that generates the token, which gets rid of the handshake error. Unfortunately I'm now back to square one with the timeout error, except this time it's being thrown at line 1136 of urllib2. I suppose this is because urllib2 doesn't support HTTPS connections. Does this also mean my proxy doesn't support HTTP tunneling, or is there some way I could test for that from my local machine? In any event, here's the latest traceback:
Traceback
<module> C:\Users\tle\Desktop\Scripts\dl_extract2.py 161
main C:\Users\tle\Desktop\Scripts\dl_extract2.py 157
__init__ C:\Users\tle\Desktop\Scripts\dl_extract2.py 53
gentoken C:\Users\tle\Desktop\Scripts\dl_extract2.py 40
urlopen C:\Python26\ArcGIS10.0\lib\urllib2.py 126
open C:\Python26\ArcGIS10.0\lib\urllib2.py 391
_open C:\Python26\ArcGIS10.0\lib\urllib2.py 409
_call_chain C:\Python26\ArcGIS10.0\lib\urllib2.py 369
https_open C:\Python26\ArcGIS10.0\lib\urllib2.py 1169
do_open C:\Python26\ArcGIS10.0\lib\urllib2.py 1136
URLError: <urlopen error [Errno 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or the established connection failed because the connected host has failed to respond>
UPDATE 3
This turned out to be a really easy fix -- all that is needed (in my case) is the system environment variables with normal slashes:
http_proxy: http://user:pw#proxyip:port
https_proxy: http://user:pw#proxyip:port
and the following code removed from the script:
proxy = urllib2.ProxyHandler({
    'http': 'proxy_ip',
    'https': 'proxy_ip'
})
opener = urllib2.build_opener(proxy)
urllib2.install_opener(opener)
This link explains how and why this works:
http://lukasa.co.uk/2013/07/Python_Requests_And_Proxies/
The initial problem was apparently resolved by using forward slashes in the proxy environment variables.
For the SSL connection problem: you appear to be using the same port for both the http and https proxies. Can your proxy server handle that?
First off, note that in gentoken(), urllib.urlopen() is used. urllib.urlopen() connects to the configured proxy using SSL if that scheme is set for the proxy URL. In your case https_proxy is https://user:pw#proxyIP:port, so an SSL connection will be made to your proxy. It would seem that your proxy doesn't handle that, which would explain the failed SSL handshake exception. ** Try using urllib2.urlopen() instead.
Also, the Python code that creates a ProxyHandler applies to urllib2 only, not urllib; urllib connections will use the environment variable settings.
** It is documented here that urllib2 does not support HTTPS through a proxy, but it might work if your proxy supports HTTP tunnelling via HTTP CONNECT.
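To see what the plain urllib calls will actually pick up from those environment variables, a small check like this may help (Python 2, matching the code in the question):

import urllib

# urllib.getproxies() reads http_proxy / https_proxy from the environment, which
# is why fixing the slashes in those variables was enough once the explicit
# ProxyHandler code was removed.
print urllib.getproxies()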

Proxy using Twython

I keep getting this error every time I try running my code through a proxy. I have gone through every single link available on how to get my code running behind a proxy and am simply unable to get this done.
import twython
import requests
TWITTER_APP_KEY = 'key'  # supply the appropriate value
TWITTER_APP_KEY_SECRET = 'key-secret'
TWITTER_ACCESS_TOKEN = 'token'
TWITTER_ACCESS_TOKEN_SECRET = 'secret'

t = twython.Twython(app_key=TWITTER_APP_KEY,
                    app_secret=TWITTER_APP_KEY_SECRET,
                    oauth_token=TWITTER_ACCESS_TOKEN,
                    oauth_token_secret=TWITTER_ACCESS_TOKEN_SECRET,
                    client_args={'proxies': {'http': 'proxy.company.com:10080'}})
Now if I do:
t = twython.Twython(app_key=TWITTER_APP_KEY,
                    app_secret=TWITTER_APP_KEY_SECRET,
                    oauth_token=TWITTER_ACCESS_TOKEN,
                    oauth_token_secret=TWITTER_ACCESS_TOKEN_SECRET,
                    client_args=client_args)
print t.client_args
I get only a {}
and when I try running
t.update_status(status='See how easy this was?')
I get this problem :
Traceback (most recent call last):
File "<pyshell#40>", line 1, in <module>
t.update_status(status='See how easy this was?')
File "build\bdist.win32\egg\twython\endpoints.py", line 86, in update_status
return self.post('statuses/update', params=params)
File "build\bdist.win32\egg\twython\api.py", line 223, in post
return self.request(endpoint, 'POST', params=params, version=version)
File "build\bdist.win32\egg\twython\api.py", line 213, in request
content = self._request(url, method=method, params=params, api_call=url)
File "build\bdist.win32\egg\twython\api.py", line 134, in _request
response = func(url, **requests_args)
File "C:\Python27\lib\site-packages\requests-1.2.3-py2.7.egg\requests\sessions.py", line 377, in post
return self.request('POST', url, data=data, **kwargs)
File "C:\Python27\lib\site-packages\requests-1.2.3-py2.7.egg\requests\sessions.py", line 335, in request
resp = self.send(prep, **send_kwargs)
File "C:\Python27\lib\site-packages\requests-1.2.3-py2.7.egg\requests\sessions.py", line 438, in send
r = adapter.send(request, **kwargs)
File "C:\Python27\lib\site-packages\requests-1.2.3-py2.7.egg\requests\adapters.py", line 327, in send
raise ConnectionError(e)
ConnectionError: HTTPSConnectionPool(host='api.twitter.com', port=443): Max retries exceeded with url: /1.1/statuses/update.json (Caused by <class 'socket.gaierror'>: [Errno 11004] getaddrinfo failed)
I have searched everywhere and tried everything that I possibly could. The only resources available were:
https://twython.readthedocs.org/en/latest/usage/advanced_usage.html#manipulate-the-request-headers-proxies-etc
https://groups.google.com/forum/#!topic/twython-talk/GLjjVRHqHng
https://github.com/fumieval/twython/commit/7caa68814631203cb63231918e42e54eee4d2273
https://groups.google.com/forum/#!topic/twython-talk/mXVL7XU4jWw
There were no topics I could find here (on Stack Overflow) either.
Please help. Hope someone replies. If you have already done this please help me with some code example.
Your code isn't using your proxy. As the example shows, you specified a proxy for plain HTTP, but your stack trace shows an HTTPSConnectionPool. Your local machine probably can't resolve external domains.
Try setting your proxy like this:
client_args = {'proxies': {'https': 'http://proxy.company.com:10080'}}
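Put together with the constructor from the question, that would look roughly like this (a sketch; the proxy URL is still your placeholder and the keys/tokens are assumed to be defined as above):

t = twython.Twython(app_key=TWITTER_APP_KEY,
                    app_secret=TWITTER_APP_KEY_SECRET,
                    oauth_token=TWITTER_ACCESS_TOKEN,
                    oauth_token_secret=TWITTER_ACCESS_TOKEN_SECRET,
                    # The Twitter API is served over HTTPS, so the 'https' entry
                    # is the one that matters for api.twitter.com.
                    client_args={'proxies': {'http': 'http://proxy.company.com:10080',
                                             'https': 'http://proxy.company.com:10080'}})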
In combination with #t-8ch's answer (which is that you must use a proxy as he has defined it), you should also realize that, as of this moment, requests (the underlying library of Twython) does not support proxying over HTTPS. This is a problem with requests' underlying library urllib3. It's a long-running issue, as far as I'm aware.
On top of that, reading a bit of Twython's source explains why t.client_args returns an empty dictionary. In short, if you were to instead print t.client.proxies, you'd see that indeed your proxies are being processed as they very well should be.
Finally, complaining about your workplace while on StackOverflow and linking to GitHub commits that have your GitHub username (and real name) associated with them in the comments is not the best idea. StackOverflow is indexed quite thoroughly by Google and there is little doubt that someone else might find this and associate it with you as easily as I have. On top of that, that commit has absolutely no effect on Twython's current behaviour. You're running down a rabbit hole with no end by chasing the author of that commit.
It looks like a domain name lookup failed. Assuming your configured DNS server can resolve Twitter's domain name (and surely it can), I would presume your DNS lookup for proxy.company.com failed. Try using a proxy by IP address instead of by hostname.
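A quick way to test that theory (a sketch; proxy.company.com is just the hostname from the question):

import socket

try:
    # If this raises gaierror, the proxy hostname itself cannot be resolved from
    # this machine, which matches the getaddrinfo failure in the traceback.
    print socket.gethostbyname('proxy.company.com')
except socket.gaierror as e:
    print 'cannot resolve proxy hostname:', e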

SSL3 POST with Python

I have a pile of tasks to automate within cPanel. There is a cPanel API described at http://videos.cpanel.net/cpanel-api-automation/ but I tried what I thought would be easier for me...
Based on an answer from skyronic at How do I send an HTTP POST value to a (PHP) page using Python?, I tried:
import urllib, urllib2, ssl

url = 'https://mysite.com:2083/login'
user_agent = 'Mozilla/5.0 meridia (Windows NT 5.1; U; en)'
values = {'name': cpaneluser,
          'pass': cpanelpw}
headers = {'User-Agent': user_agent}
data = urllib.urlencode(values)
req = urllib2.Request(url, data, headers)
response = urllib2.urlopen(req)
page = response.read()
The call to urlopen() is raising NameError: global name 'HTTPSConnectionV3' is not defined.
So then based on http://bugs.python.org/issue11220 I tried preceding the code above with
import httplib
import socket  # needed below for socket.create_connection

class HTTPSConnectionV3(httplib.HTTPSConnection):
    def __init__(self, *args, **kwargs):
        httplib.HTTPSConnection.__init__(self, *args, **kwargs)

    def connect(self):
        sock = socket.create_connection((self.host, self.port), self.timeout)
        if self._tunnel_host:
            self.sock = sock
            self._tunnel()
        try:
            self.sock = ssl.wrap_socket(sock, self.key_file, self.cert_file, \
                                        ssl_version=ssl.PROTOCOL_SSLv3)
        except ssl.SSLError, e:
            print("Trying SSLv3.")
            self.sock = ssl.wrap_socket(sock, self.key_file, self.cert_file, \
                                        ssl_version=ssl.PROTOCOL_SSLv23)

class HTTPSHandlerV3(urllib2.HTTPSHandler):
    def https_open(self, req):
        return self.do_open(HTTPSConnectionV3, req)

urllib2.install_opener(urllib2.build_opener(HTTPSHandlerV3()))
This does print the "Trying SSLv3." message and raises URLError: <urlopen error [Errno 1] _ssl.c:504: error:140770FC:SSL routines:SSL23_GET_SERVER_HELLO:unknown protocol>.
And finally that led me to https://github.com/kennethreitz/requests/issues/606, where gregakespret says he solved a similar problem using a solution from Senthil Kumaran at http://bugs.python.org/issue11220:
https_sslv3_handler = urllib.request.HTTPSHandler(context=ssl.SSLContext(ssl.PROTOCOL_SSLv3))
opener = urllib.request.build_opener(https_sslv3_handler)
urllib.request.install_opener(opener)
But that raises AttributeError: 'module' object has no attribute 'request'. And indeed help(urllib) doesn't mention request at all, and import urllib.request results in No module named request.
I'm using Python 2.7.3 within the Enthought Canopy distribution. The cPanel site uses a self-signed certificate, which I mention since it's an irregularity that would trip up a regular browser, though I gather that urllib and urllib2 don't actually authenticate the certificate anyway.
Thank you for reading, more so if you have a suggestion or can help me understand the problem.
I would use the requests library.
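A minimal sketch of what the login POST from the question might look like with requests (it reuses cpaneluser, cpanelpw and user_agent from above; verify=False only because the question mentions a self-signed certificate, and whether an SSLv3-only server negotiates successfully still depends on your local OpenSSL build):

import requests

# Certificate verification disabled because the cPanel host uses a self-signed
# certificate; pass verify='path/to/cert.pem' instead if you have the cert file.
response = requests.post(
    'https://mysite.com:2083/login',
    data={'name': cpaneluser, 'pass': cpanelpw},
    headers={'User-Agent': user_agent},
    verify=False
)
page = response.text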
I'm the OP and it's been a while since I posted this. I've since solved other related tasks (POSTing to Instructure's Canvas API) using the requests library and found that code that had worked with urllib/urllib2 is now much shorter and sweeter.
Someone just upvoted this question, causing me to see that no one had answered it. My answer isn't much of one to my original post, but it is the direction I'd advise, having solved related problems since then.
As for this question, I solved the problem by scripting in Bash on the server that had cPanel running. It was just a matter of identifying which cPanel scripts to call. But I did not get it running through the cPanel Web API.
