I am trying to use Tor with urllib2 and polipo. What I need is a way to switch to specific exit nodes while program is running.
I have set 'AllowDotExit 1' in /etc/tor/torrc and was trying the following approach:
import urllib2
proxy = '127.0.0.1'
port = '8118'
url='http://ifconfig.me.651d7ace80e0b53e6c05eb4db2491264f049df66.exit'
proxyurl = '%s:%s' % (proxy, port)
proxyhandler = urllib2.ProxyHandler({'http': proxyurl})
opener = urllib2.build_opener(proxyhandler)
page = opener.open(url)
print 'Page opened.'
print page.read()
But what I am getting is:
:!/usr/bin/env python tortest.py
Traceback (most recent call last):
File "tortest.py", line 18, in <module>
page = opener.open(url, timeout=20)
File "/usr/lib/python2.7/urllib2.py", line 406, in open
response = meth(req, response)
File "/usr/lib/python2.7/urllib2.py", line 519, in http_response
'http', request, response, code, msg, hdrs)
File "/usr/lib/python2.7/urllib2.py", line 444, in error
return self._call_chain(*args)
File "/usr/lib/python2.7/urllib2.py", line 378, in _call_chain
result = func(*args)
File "/usr/lib/python2.7/urllib2.py", line 527, in http_error_default
raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
urllib2.HTTPError: HTTP Error 504: Connect to ifconfig.me.651d7ace80e0b53e6c05eb4db2491264f049df66.exit:80 failed: General SOCKS server failure
Could anyone help me with that?
General SOCKS server failure could be anything. For example, your TOR node might not know how to reach the specified exit node. This happens a lot. The exit may be listed on one of the status pages, but still be unreachable from your tor node. Try a different exit, or retry it later. Often when you request a particular exit node, it takes some minutes to establish a link through the network.
Related
I'm trying to download the HTML of a page (http://www.guangxindai.com in this case) but I'm getting back an error 403. Here is my code:
import urllib.request
opener = urllib.request.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
f = opener.open("http://www.guangxindai.com")
f.read()
but I get error response.
Traceback (most recent call last):
File "<pyshell#7>", line 1, in <module>
f = opener.open("http://www.guangxindai.com")
File "C:\Python33\lib\urllib\request.py", line 475, in open
response = meth(req, response)
File "C:\Python33\lib\urllib\request.py", line 587, in http_response
'http', request, response, code, msg, hdrs)
File "C:\Python33\lib\urllib\request.py", line 513, in error
return self._call_chain(*args)
File "C:\Python33\lib\urllib\request.py", line 447, in _call_chain
result = func(*args)
File "C:\Python33\lib\urllib\request.py", line 595, in http_error_default
raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 403: Forbidden
I have tried different request headers, but still can not get correct response. I can view the web through browser. It seems strange for me. I guess the web use some method to block web spider. Does anyone know what is happening? How can I get the HTML of page correctly?
I was having the same problem that you and I found the answer in this link.
The answer provided by Stefano Sanfilippo is quite simple and worked for me:
from urllib.request import Request, urlopen
url_request = Request("http://www.guangxindai.com",
headers={"User-Agent": "Mozilla/5.0"})
webpage = urlopen(url_request).read()
If your aim is to read the html of the page you can use the following code. It worked for me on Python 2.7
import urllib
f = urllib.urlopen("http://www.guangxindai.com")
f.read()
I'm currently trying to fix a Kodi plugin called NetfliXBMC.
It uses this url to get information on specific movies:
http://www.netflix.com/JSON/BOB?movieid=<SOMEID>
While trying to build a minimal case to ask this question I discovered that it's not even necessary to be logged in to access the information, which simplifies my question a lot.
Querying information about a movie works from wget, from curl, from incognito chrome etc. It just never works from urllib2:
# wget works just fine
$: wget -q -O- http://www.netflix.com/JSON/BOB?movieid=80021955
{"contextData":"{\"cookieDisclosure\":{\"data\":{\"showCookieBanner\":false}}}","result":"success","actionErrors":null,"fieldErrors":null,"actionMessages":null,"data":[output omitted for brevity]}
# so does curl
$: curl http://www.netflix.com/JSON/BOB?movieid=80021955
{"contextData":"{\"cookieDisclosure\":{\"data\":{\"showCookieBanner\":false}}}","result":"success","actionErrors":null,"fieldErrors":null,"actionMessages":null,"data":[output omitted for brevity}
# but python's urllib always gets a 500
$: python -c "import urllib2; urllib2.urlopen('http://www.netflix.com/JSON/BOB?movieid=80021955').read()"
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/usr/lib/python2.7/urllib2.py", line 127, in urlopen
return _opener.open(url, data, timeout)
File "/usr/lib/python2.7/urllib2.py", line 410, in open
response = meth(req, response)
File "/usr/lib/python2.7/urllib2.py", line 523, in http_response
'http', request, response, code, msg, hdrs)
File "/usr/lib/python2.7/urllib2.py", line 448, in error
return self._call_chain(*args)
File "/usr/lib/python2.7/urllib2.py", line 382, in _call_chain
result = func(*args)
File "/usr/lib/python2.7/urllib2.py", line 531, in http_error_default
raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
urllib2.HTTPError: HTTP Error 500: Internal Server Error
$: python --version
Python 2.7.6
What I've tried so far: several different user-agent strings, initializing a urlopener with a cookie jar, plain old urllib (doesn't raise an exception but receives the same error page).
I'm really curious as to why this might be. Thanks in advance!
It turned out to be a bug on netflix' side when no Accept header is sent.
This doesn't work:
opener = urllib2.build_opener()
opener.open("http://www.netflix.com/JSON/BOB?movieid=80021955")
Adding a proper accept header makes it work:
opener = urllib2.build_opener()
mimeAccept = "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"
opener.addheaders = [('Accept', mimeAccept)]
opener.open("http://www.netflix.com/JSON/BOB?movieid=80021955")
[...]
Of course, there is another bug there: it returns a 500 internal server error instead of a 400 bad request when the problem was clearly on the request.
I am trying to connect to a REST resource and retrieve the data using Python script (Python 3.2.3). When I run the script I am getting error as HTTP Error 401: Unauthorized. Please note that I am able to access the given REST resource using REST client using Basic Authentication. In the REST Client I have specified the hostname, user and password details (realm is not required).
Below is the code and complete error. Your help is very much appreciated.
Code:
import urllib.request
# set up authentication info
auth_handler = urllib.request.HTTPBasicAuthHandler()
auth_handler.add_password(realm=None,
uri=r'http://hostname/',
user='administrator',
passwd='administrator')
opener = urllib.request.build_opener(auth_handler)
urllib.request.install_opener(opener)
res = opener.open(r'http://hostname:9004/apollo-api/nodes')
nodes = res.read()
Error
Traceback (most recent call last):
File "C:\Python32\scripts\get-nodes.py", line 12, in <module>
res = opener.open(r'http://tolowa.wysdm.lab.emc.com:9004/apollo-api/nodes')
File "C:\Python32\lib\urllib\request.py", line 375, in open
response = meth(req, response)
File "C:\Python32\lib\urllib\request.py", line 487, in http_response
'http', request, response, code, msg, hdrs)
File "C:\Python32\lib\urllib\request.py", line 413, in error
return self._call_chain(*args)
File "C:\Python32\lib\urllib\request.py", line 347, in _call_chain
result = func(*args)
File "C:\Python32\lib\urllib\request.py", line 495, in http_error_default
raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 401: Unauthorized
Try to give the correct realm name. You can find this out for example when opening the page in a browser - the password prompt should display the name.
You can also read the realm by catching the exception that was raised:
import urllib.error
import urllib.request
# set up authentication info
auth_handler = urllib.request.HTTPBasicAuthHandler()
auth_handler.add_password(realm=None,
uri=r'http://hostname/',
user='administrator',
passwd='administrator')
opener = urllib.request.build_opener(auth_handler)
urllib.request.install_opener(opener)
try:
res = opener.open(r'http://hostname:9004/apollo-api/nodes')
nodes = res.read()
except urllib.error.HTTPError as e:
print(e.headers['www-authenticate'])
You should get the following output:
Basic realm="The realm you are after"
Read the realm from above and set it in your add_password method and it should be good to go.
I am fairly inexperienced with user authentication especially through restful apis. I am trying to use python to log in with a user that is set up in parse.com. The following is the code I have:
API_LOGIN_ROOT = 'https://api.parse.com/1/login'
params = {'username':username,'password':password}
encodedParams = urllib.urlencode(params)
url = API_LOGIN_ROOT + "?" + encodedParams
request = urllib2.Request(url)
request.add_header('Content-type', 'application/x-www-form-urlencoded')
# we could use urllib2's authentication system, but it seems like overkill for this
auth_header = "Basic %s" % base64.b64encode('%s:%s' % (APPLICATION_ID, MASTER_KEY))
request.add_header('Authorization', auth_header)
request.add_header('X-Parse-Application-Id', APPLICATION_ID)
request.add_header('X-Parse-REST-API-Key', MASTER_KEY)
request.get_method = lambda: http_verb
# TODO: add error handling for server response
response = urllib2.urlopen(request)
#response_body = response.read()
#response_dict = json.loads(response_body)
This is a modification of an open source library used to access the parse rest interface.
I get the following error:
Traceback (most recent call last):
File "/Applications/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine/google/appengine/ext/webapp/_webapp25.py", line 703, in __call__
handler.post(*groups)
File "/Users/nazbot/src/PantryPal_AppEngine/fridgepal.py", line 464, in post
url = user.login()
File "/Users/nazbot/src/PantryPal_AppEngine/fridgepal.py", line 313, in login
url = self._executeCall(self.username, self.password, 'GET', data)
File "/Users/nazbot/src/PantryPal_AppEngine/fridgepal.py", line 292, in _executeCall
response = urllib2.urlopen(request)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 126, in urlopen
return _opener.open(url, data, timeout)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 400, in open
response = meth(req, response)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 513, in http_response
'http', request, response, code, msg, hdrs)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 438, in error
return self._call_chain(*args)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 372, in _call_chain
result = func(*args)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 521, in http_error_default
raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
HTTPError: HTTP Error 404: Not Found
Can someone point me to where I am screwing up? I'm not quite sure why I'm getting a 404 instead of an access denied or some other issue.
Make sure the "User" class was created on Parse.com as a special user class. When you are adding the class, make sure to change the Class Type to "User" instead of "Custom". A little user head icon will show up next to the class name on the left hand side.
This stumped me for a long time until Matt from the Parse team showed me the problem.
Please change: API_LOGIN_ROOT = 'https://api.parse.com/1/login' to the following: API_LOGIN_ROOT = 'https://api.parse.com/1/login**/**'
I had the same problem using PHP, adding the / at the end fixed the 404 error.
I installed Python 2.6.2 earlier on a Windows XP machine and run the following code:
import urllib2
import urllib
page = urllib2.Request('http://www.python.org/fish.html')
urllib2.urlopen( page )
I get the following error.
Traceback (most recent call last):<br>
File "C:\Python26\test3.py", line 6, in <module><br>
urllib2.urlopen( page )<br>
File "C:\Python26\lib\urllib2.py", line 124, in urlopen<br>
return _opener.open(url, data, timeout)<br>
File "C:\Python26\lib\urllib2.py", line 383, in open<br>
response = self._open(req, data)<br>
File "C:\Python26\lib\urllib2.py", line 401, in _open<br>
'_open', req)<br>
File "C:\Python26\lib\urllib2.py", line 361, in _call_chain<br>
result = func(*args)<br>
File "C:\Python26\lib\urllib2.py", line 1130, in http_open<br>
return self.do_open(httplib.HTTPConnection, req)<br>
File "C:\Python26\lib\urllib2.py", line 1105, in do_open<br>
raise URLError(err)<br>
URLError: <urlopen error [Errno 11001] getaddrinfo failed><br><br><br>
import urllib2
response = urllib2.urlopen('http://www.python.org/fish.html')
html = response.read()
You're doing it wrong.
Have a look in the urllib2 source, at the line specified by the traceback:
File "C:\Python26\lib\urllib2.py", line 1105, in do_open
raise URLError(err)
There you'll see the following fragment:
try:
h.request(req.get_method(), req.get_selector(), req.data, headers)
r = h.getresponse()
except socket.error, err: # XXX what error?
raise URLError(err)
So, it looks like the source is a socket error, not an HTTP protocol related error. Possible reasons: you are not on line, you are behind a restrictive firewall, your DNS is down,...
All this aside from the fact, as mcandre pointed out, that your code is wrong.
Name resolution error.
getaddrinfo is used to resolve the hostname (python.org)in your request. If it fails, it means that the name could not be resolved because:
It does not exist, or the records are outdated (unlikely; python.org is a well-established domain name)
Your DNS server is down (unlikely; if you can browse other sites, you should be able to fetch that page through Python)
A firewall is blocking Python or your script from accessing the Internet (most likely; Windows Firewall sometimes does not ask you if you want to allow an application)
You live on an ancient voodoo cemetery. (unlikely; if that is the case, you should move out)
Windows Vista, python 2.6.2
It's a 404 page, right?
>>> import urllib2
>>> import urllib
>>>
>>> page = urllib2.Request('http://www.python.org/fish.html')
>>> urllib2.urlopen( page )
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Python26\lib\urllib2.py", line 124, in urlopen
return _opener.open(url, data, timeout)
File "C:\Python26\lib\urllib2.py", line 389, in open
response = meth(req, response)
File "C:\Python26\lib\urllib2.py", line 502, in http_response
'http', request, response, code, msg, hdrs)
File "C:\Python26\lib\urllib2.py", line 427, in error
return self._call_chain(*args)
File "C:\Python26\lib\urllib2.py", line 361, in _call_chain
result = func(*args)
File "C:\Python26\lib\urllib2.py", line 510, in http_error_default
raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
urllib2.HTTPError: HTTP Error 404: Not Found
>>>
DJ
First, I see no reason to import urllib; I've only ever seen urllib2 used to replace urllib entirely and I know of no functionality that's useful from urllib and yet is missing from urllib2.
Next, I notice that http://www.python.org/fish.html gives a 404 error to me. (That doesn't explain the backtrace/exception you're seeing. I get urllib2.HTTPError: HTTP Error 404: Not Found
Normally if you just want to do a default fetch of a web pages (without adding special HTTP headers, doing doing any sort of POST, etc) then the following suffices:
req = urllib2.urlopen('http://www.python.org/')
html = req.read()
# and req.close() if you want to be pedantic