I am using Python's urllib2 with Tor as a proxy to access a website. When I
open the site's main page, it works fine, but when I try to view the login page (not actually log in, just view it) I get the following error:
URLError: <urlopen error (10060, 'Operation timed out')>
To counteract this I did the following:
import socket
socket.setdefaulttimeout(None)
I still get the same timeout error.
Does this mean the website is timing out on the server side? (I don't know much
about http processes so sorry if this is a dumb question)
Is there any way I can correct it so that Python is able to view the page?
Thanks,
Rob
According to the Python socket documentation, the default is no timeout, so specifying a value of None is redundant.
There are a number of possible reasons your connection is dropping. One could be that your user agent is "Python-urllib", which may very well be blocked. To change your user agent:
request = urllib2.Request('http://site.com/login')
request.add_header('User-Agent','Mozilla/5.0 (X11; U; Linux i686; it-IT; rv:1.9.0.2) Gecko/2008092313 Ubuntu/9.04 (jaunty) Firefox/3.5')
You may also want to try overriding the proxy settings before you try and open the url using something along the lines of:
proxy = urllib2.ProxyHandler({"http":"http://127.0.0.1:8118"})
opener = urllib2.build_opener(proxy)
urllib2.install_opener(opener)
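Putting those two suggestions together, a rough sketch might look like this (the login URL is a placeholder, and it assumes an HTTP proxy in front of Tor is listening on 127.0.0.1:8118 as above):
import urllib2

# install the Tor-facing proxy opener, then fetch the page with a
# browser-like User-Agent (URL and proxy port are assumptions, as noted above)
proxy = urllib2.ProxyHandler({"http": "http://127.0.0.1:8118"})
opener = urllib2.build_opener(proxy)
urllib2.install_opener(opener)

request = urllib2.Request('http://site.com/login')
request.add_header('User-Agent',
                   'Mozilla/5.0 (X11; U; Linux i686; it-IT; rv:1.9.0.2) '
                   'Gecko/2008092313 Ubuntu/9.04 (jaunty) Firefox/3.5')
print urllib2.urlopen(request).read()[:200]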
I don't know enough about Tor to be sure, but the timeout may not happen on the server side, but on one of the Tor nodes somewhere between you and the server. In that case there is nothing you can do other than to retry the connection.
urllib2.urlopen(url[, data][, timeout])
The optional timeout parameter specifies a timeout in seconds for blocking operations like the connection attempt (if not specified, the global default timeout setting will be used). This actually only works for HTTP, HTTPS and FTP connections.
http://docs.python.org/library/urllib2.html
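For example, a per-request timeout can be passed straight to urlopen (the URL below is just a placeholder):
import urllib2

try:
    # an explicit per-request timeout in seconds, instead of relying on
    # socket.setdefaulttimeout()
    response = urllib2.urlopen('http://site.com/login', timeout=30)
    print response.read()[:200]
except urllib2.URLError as e:
    print 'Request failed:', e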
A few days back I wanted to build a proxy that would let me connect to websites and servers securely and anonymously. At first it seemed like a pretty easy idea: I would create an HTTP proxy that uses SSL between the client and the proxy; it would then create an SSL connection with whatever website/server the client requested and forward that information to and from the client and server. I spent about a day researching and writing code that would do just that. But I then realized that someone could compromise the proxy and use the session key that the proxy held to decrypt and read the data being sent to and from the server.
After a little more research it seems that a SOCKS proxy is what I need. However, there is not much documentation on a Python version of a SOCKS proxy (mostly just how to connect to one). I was able to find the PySocks module and read the Socks.py file. It looks great for creating a client, but I don't see how I could use it to make a proxy.
I was wondering if anyone has a simple example of a SOCKS5 proxy, or if someone could point me to some material that could help me begin learning and building one?
You create a Python server that listens on a port on IP 127.0.0.1. When a client connects to your server, it sends something like "www.facebook.com:80" -- no URL path and no http scheme. If the connection to that host fails, you send back a failure message, which may look something like "number Unable to connect to host.", where number is a specific code that signifies a failed connection attempt. Upon success you send back "200 Connection established". Then data is sent and received as normal. You do not want to use an HTTP proxy because it accepts only website traffic.
You may want to use a framework for the proxy server because it needs to handle multiple connections.
I've read the O'Reilly ebook "Using Asyncio in Python" (2020) multiple times, and I re-read it every now and again, to try to grasp handling multiple connections. I have also just started to search for solutions using Flask because I want my proxy server to run alongside a webserver.
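As a very rough illustration of that handshake-then-relay idea (this is the simplified "host:port" scheme described above, not the real SOCKS5 wire protocol, and the port and error code are arbitrary choices):
import asyncio

async def pipe(reader, writer):
    # copy bytes from one side of the tunnel to the other until EOF
    try:
        while True:
            data = await reader.read(4096)
            if not data:
                break
            writer.write(data)
            await writer.drain()
    finally:
        writer.close()

async def handle_client(client_reader, client_writer):
    # the first line from the client is "host:port", as described above
    request = (await client_reader.readline()).decode().strip()
    host, _, port = request.partition(':')
    try:
        remote_reader, remote_writer = await asyncio.open_connection(host, int(port))
    except OSError:
        # the numeric code here is arbitrary, just mirroring the description above
        client_writer.write(b'502 Unable to connect to host.\r\n')
        client_writer.close()
        return
    client_writer.write(b'200 Connection established\r\n')
    await client_writer.drain()
    # relay data in both directions until either side closes
    await asyncio.gather(pipe(client_reader, remote_writer),
                         pipe(remote_reader, client_writer))

async def main():
    server = await asyncio.start_server(handle_client, '127.0.0.1', 1080)
    async with server:
        await server.serve_forever()

asyncio.run(main())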
I recommend using requesocks along with stem (this assumes Tor); stem is the official Tor controller library. Here's a simplified example based on a scraper that I wrote, which also uses fake_useragent so that you look like a browser:
import requesocks
from fake_useragent import UserAgent
from stem import Signal
from stem.control import Controller
class Proxy(object):
    def __init__(self,
                 socks_port=9050,
                 tor_control_port=9051,
                 tor_connection_password='password'):
        self._socks_port = int(socks_port)
        self._tor_control_port = int(tor_control_port)
        self._tor_connection_password = tor_connection_password
        self._user_agent = UserAgent()
        self._session = None
        self._update_session()

    def _update_session(self):
        self._session = requesocks.session()
        # port 9050 is the default SOCKS port
        self._session.proxies = {
            'http': 'socks5://127.0.0.1:{}'.format(self._socks_port),
            'https': 'socks5://127.0.0.1:{}'.format(self._socks_port),
        }

    def _renew_tor_connection(self):
        with Controller.from_port(port=self._tor_control_port) as controller:
            controller.authenticate(password=self._tor_connection_password)
            controller.signal(Signal.NEWNYM)

    def _sample_get_response(self, url):
        if not self._session:
            self._update_session()
        # generate a random User-Agent string for every request
        headers = {
            'User-Agent': self._user_agent.random,
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
            'Accept-Language': 'en-us,en;q=0.5',
        }  # adjust as desired
        response = self._session.get(url, verify=False, headers=headers)
        return response
You must have the Tor service running before executing this script and you must also modify your torrc file to enable the control port (9051).
Tor puts the torrc file in /usr/local/etc/tor/torrc if you compiled Tor from source, and /etc/tor/torrc or /etc/torrc if you installed a pre-built package. If you installed Tor Browser, look for Browser/TorBrowser/Data/Tor/torrc inside your Tor Browser directory (on Mac OS X, you must right-click or command-click on the Tor Browser icon and select "Show Package Contents" before the Tor Browser directories become visible).
Once you've found your torrc file, you need to uncomment the corresponding lines:
ControlPort 9051
## If you enable the controlport, be sure to enable one of these
## authentication methods, to prevent attackers from accessing it.
HashedControlPassword 16:05834BCEDD478D1060F1D7E2CE98E9C13075E8D3061D702F63BCD674DE
Please note that the HashedControlPassword above is for the password "password". If you want to set a different password (recommended), replace the HashedControlPassword in the torrc file with the output of tor --hash-password "<new_password>", where <new_password> is the password that you want to set.
Once you've changed your torrc file, you will need to restart tor for the changes to take effect (note that you actually only need to send Tor a HUP signal, not actually restart it). To restart it:
sudo service tor restart
I hope this helps and at least gets you started for what you were looking for.
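For what it's worth, a hypothetical usage of the Proxy class above might look like this (it assumes Tor is already running with SOCKS on 9050 and the control port on 9051, and uses httpbin.org only as a convenient IP-echo service):
proxy = Proxy(socks_port=9050,
              tor_control_port=9051,
              tor_connection_password='password')

response = proxy._sample_get_response('http://httpbin.org/ip')
print(response.text)             # should show a Tor exit node's IP, not yours

proxy._renew_tor_connection()    # ask Tor for a fresh circuit (NEWNYM)
response = proxy._sample_get_response('http://httpbin.org/ip')
print(response.text)             # likely a different exit node now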
I have my web app API running.
If I go to http://127.0.0.1:5000/ via any browser I get the right response.
If I use the Advanced REST Client Chrome app and send a GET request to my app at that address I get the right response.
However this gives me a 503:
import requests
response = requests.get('http://127.0.0.1:5000/')
I read to try this for some reason:
s = requests.Session()
response = s.get('http://127.0.0.1:5000/')
But I still get a 503 response.
Other things I've tried: Not prefixing with http://, not using a port in the URL, running on a different port, trying a different API call like Post, etc.
Thanks.
Is http://127.0.0.1:5000/ your localhost? If so, try 'http://localhost:5000' instead
Just in case someone is struggling with this as well, what finally worked was running the application on my local network IP.
I.e., I just opened up the web app and changed the app.run(debug=True) line to app.run(host="my.ip.address", debug=True).
I'm guessing the requests library was perhaps trying to protect me from a localhost attack? Or our corporate proxy or firewall was preventing communication from unknown apps to the 127 address. I had set NO_PROXY to include the 127.0.0.1 address, so I don't think that was the problem. In the end I'm not really sure why it is working now, but I'm glad that it is.
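If a proxy environment variable is in fact intercepting localhost traffic (just a guess on my part), one quick way to rule it out is to tell requests to ignore the environment's proxy settings altogether:
import requests

# trust_env=False makes requests ignore HTTP_PROXY/HTTPS_PROXY/NO_PROXY (and
# .netrc), so the request goes straight to the local server
session = requests.Session()
session.trust_env = False
response = session.get('http://127.0.0.1:5000/')
print(response.status_code)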
I'm trying use SSH tunnels inside of Python's urllib2.
Creating the tunnel:
ssh -N user@machine.place.edu -L 1337:localhost:80
The above line should use port 80 on the remote machine and port 1337 on the local machine.
I used -N, so the bash prompt (intentionally) hangs as long as this tunnel is running.
Using the tunnel in urllib2:
import urllib2
url = "http://ifconfig.me/ip"
headers={'User-agent' : 'Mozilla/5.0'}
proxy_support = urllib2.ProxyHandler({'http': 'http://127.0.0.1:1337'})
opener = urllib2.build_opener(proxy_support, urllib2.HTTPHandler(debuglevel=1))
urllib2.install_opener(opener)
req = urllib2.Request(url, None, headers)
html = urllib2.urlopen(req).read()
print html
When I run the above code, html = urllib2.urlopen(req).read() throws the error urllib2.HTTPError: HTTP Error 404: Not Found.
What might be going wrong, and how can we fix it?
Troubleshooting:
If I turn off the SSH tunnel, the error changes to urllib2.URLError: <urlopen error [Errno 61] Connection refused>. So, Python is clearly "seeing" the SSH tunnel.
If I comment out the proxy stuff by replacing opener = urllib2.build_opener(proxy_support, urllib2.HTTPHandler(debuglevel=1)) with opener = urllib2.build_opener(), then the ifconfig.me page downloads properly. (Of course, the project that I'm working on requires me to access documents from a few different networks, so I still need proxies to work.)
Some StackOverflow posts suggest using Requests instead of urllib2. I wouldn't mind using Requests instead -- I just used urllib2 here because I wasn't sure how to do custom headers (e.g. user-agent, referer) in Requests.
Unfortunately, since you're the only one with access to machine.place.edu, it's going to be impossible for anyone else to reproduce the problem.
First of all, try something like...
$ telnet localhost 1337
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
GET http://ifconfig.me/ip HTTP/1.0
...and hit enter a couple of times after the 'GET' line, and see what you get back.
If you get a 404, there's probably something wrong with the proxy.
If you get a 200, then you should be able to recreate that fairly easily with httplib.
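If the telnet session does return a 200, a rough httplib equivalent might look like the following (assuming the tunnel is still listening on 127.0.0.1:1337):
import httplib

# connect to the local end of the SSH tunnel and issue the same
# proxy-style GET as in the telnet session above
conn = httplib.HTTPConnection('127.0.0.1', 1337)
conn.request('GET', 'http://ifconfig.me/ip',
             headers={'User-Agent': 'Mozilla/5.0'})
response = conn.getresponse()
print response.status, response.reason
print response.read()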
If I run:
urllib2.urlopen('http://google.com')
it fails with [Errno 11004] getaddrinfo failed, and I get the same error even if I use another URL.
I'm pretty sure there is no firewall running on my computer or router, and the internet (from a browser) works fine.
The problem, in my case, was that some install at some point defined an environment variable http_proxy on my machine when I had no proxy.
Removing the http_proxy environment variable fixed the problem.
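A quick way to check for that from Python itself (you can just as easily inspect the variable in your shell):
import os

# see whether a leftover proxy variable is set for this process, and clear it
# so urllib2 stops trying to route requests through a nonexistent proxy
print os.environ.get('http_proxy')
os.environ.pop('http_proxy', None)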
The site's DNS record is such that Python fails the DNS lookup in a peculiar way: it finds the entry, but zero associated IP addresses. (Verify with nslookup.) Hence, 11004, WSANO_DATA.
Prefix the site with 'www.' and try the request again. (Use nslookup to verify that its result is different, too.)
This fails essentially the same way with the Python Requests module:
requests.exceptions.ConnectionError: HTTPConnectionPool(host='...', port=80): Max retries exceeded with url: / (Caused by : [Errno 11004] getaddrinfo failed)
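You can reproduce the lookup behaviour directly from Python with getaddrinfo (the hostnames below are placeholders for the affected site):
import socket

# compare the bare domain with the 'www.' variant; the failing lookup raises
# socket.gaierror, which surfaces as errno 11004 on Windows
for host in ('example.com', 'www.example.com'):
    try:
        print host, socket.getaddrinfo(host, 80)[0][4]
    except socket.gaierror as e:
        print host, 'lookup failed:', e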
This may not help you if it's a network-level issue but you can get some debugging info by setting debuglevel on httplib. Try this:
import urllib, urllib2, httplib
url = 'http://www.mozillazine.org/atom.xml'
httplib.HTTPConnection.debuglevel = 1
print "urllib"
data = urllib.urlopen(url)
print "urllib2"
request = urllib2.Request(url)
opener = urllib2.build_opener()
feeddata = opener.open(request).read()
Which is copied directly from here, hope that's kosher: http://bytes.com/topic/python/answers/517894-getting-debug-urllib2
You probably need to use a proxy. Check your normal browser settings to find out which one. Take a look at "Opening websites using urllib2 from behind corporate firewall - 11004 getaddrinfo failed" for a similar problem with a solution; a minimal proxy-setup sketch also follows after the checklist below.
To troubleshoot the issue:
1. Let us know what OS the script is running on and what version of Python.
2. In a command prompt on that very same machine, run ping google.com and observe whether that works (or whether you get, say, "could not find host").
3. If (2) worked, open a browser on that machine (try IE if on Windows) and try opening "google.com" there. If there is a problem, look closely at the proxy settings in Internet Options / Connections / LAN Settings.
Let us know how it goes either way.
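If the browser's LAN settings do reveal a proxy, a minimal urllib2 sketch might look like this (the host and port are placeholders; substitute whatever your browser shows):
import urllib2

# hypothetical corporate proxy taken from the browser's LAN settings
proxy = urllib2.ProxyHandler({'http': 'http://proxy.example.com:8080',
                              'https': 'http://proxy.example.com:8080'})
opener = urllib2.build_opener(proxy)
urllib2.install_opener(opener)
print urllib2.urlopen('http://google.com').read()[:200]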
Add an "s" to the http, i.e. urllib2.urlopen('https://google.com').
Worked for me.
Does urllib2 in Python 2.6.1 support proxy via https?
I've found the following at http://www.voidspace.org.uk/python/articles/urllib2.shtml:
NOTE: Currently urllib2 does not support fetching of https locations through a proxy. This can be a problem.
I'm trying to automate logging in to a web site and downloading a document; I have a valid username/password.
proxy_info = {
    'host': "axxx",  # commented out the real data
    'port': "1234",  # commented out the real data
}
proxy_handler = urllib2.ProxyHandler(
    {"http": "http://%(host)s:%(port)s" % proxy_info})
opener = urllib2.build_opener(proxy_handler,
                              urllib2.HTTPHandler(debuglevel=1),
                              urllib2.HTTPCookieProcessor())
urllib2.install_opener(opener)
fullurl = 'https://correct.url.to.login.page.com/user=a&pswd=b'  # example
req1 = urllib2.Request(url=fullurl, headers=headers)
response = urllib2.urlopen(req1)
I've had this working for similar pages, but not using HTTPS, and I suspect it is not getting through the proxy - it just gets stuck in the same way as when I did not specify a proxy. I need to go out through the proxy.
I need to authenticate, but not using basic authentication. Will urllib2 figure out the authentication when going via the https site (I supply the username/password to the site via the URL)?
EDIT:
Nope, I tested with
proxies = {
    "http": "http://%(host)s:%(port)s" % proxy_info,
    "https": "https://%(host)s:%(port)s" % proxy_info,
}
proxy_handler = urllib2.ProxyHandler(proxies)
And I get error:
urllib2.URLError: <urlopen error [Errno 8] _ssl.c:480: EOF occurred in violation of protocol>
Fixed in Python 2.6.3 and several other branches:
http://bugs.python.org/issue1424152
http://www.python.org/download/releases/2.6.3/NEWS.txt
Issue #1424152: Fix for httplib, urllib2 to support SSL while working through
proxy. Original patch by Christopher Li, changes made by Senthil Kumaran.
I'm not sure Michael Foord's article, which you quote, has been updated for Python 2.6.1 -- why not give it a try? Instead of telling ProxyHandler that the proxy is only good for http, as you're doing now, register it for https too (of course, you should format the proxy string into a variable just once before you call ProxyHandler and then reuse that variable in the dict): that may or may not work, but as it is you're not even trying, and that is sure not to work!-)
In case anyone else has this issue in the future, I'd like to point out that urllib2 does support https proxying now; make sure the proxy supports it too, or you risk running into a bug that puts the Python library into an infinite loop (this happened to me).
See the unit test in the Python source that exercises https proxying support for further information:
http://svn.python.org/view/python/branches/release26-maint/Lib/test/test_urllib2.py?r1=74203&r2=74202&pathrev=74203