urllib2 proxy does not work with tor - python

I want to write some script with python which uses tor/proxy addresses to access web, for the test I have the following script:
import urllib2
from BeautifulSoup import BeautifulSoup
protocol = 'socks4'
ip = '127.0.0.1:9050'
proxy = urllib2.ProxyHandler({protocol:ip})
opener = urllib2.build_opener(proxy)
urllib2.install_opener(opener)
page = urllib2.urlopen("http://www.ifconfig.me/ip").read()
print(page)
The problem is it shows my own IP address, while when run directly from terminal:
proxychains curl ifconfig.me/ip
shows different IP, how can I fix it?
when http used instead of socks 4 it gives the following error:
Traceback (most recent call last):
File "proxy_test.py", line 11, in <module>
page = urllib2.urlopen("http://www.ifconfig.me/ip").read()
File "/usr/lib/python2.7/urllib2.py", line 126, in urlopen
return _opener.open(url, data, timeout)
File "/usr/lib/python2.7/urllib2.py", line 400, in open
response = meth(req, response)
File "/usr/lib/python2.7/urllib2.py", line 513, in http_response
'http', request, response, code, msg, hdrs)
File "/usr/lib/python2.7/urllib2.py", line 438, in error
return self._call_chain(*args)
File "/usr/lib/python2.7/urllib2.py", line 372, in _call_chain
result = func(*args)
File "/usr/lib/python2.7/urllib2.py", line 521, in http_error_default
raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
urllib2.HTTPError: HTTP Error 501: Tor is not an HTTP Proxy

I use http (not sock) and it works
import urllib2
from BeautifulSoup import BeautifulSoup
protocol = 'http'
ip = '127.0.0.1:8118'
proxy = urllib2.ProxyHandler({protocol:ip})
opener = urllib2.build_opener(proxy)
urllib2.install_opener(opener)
page = urllib2.urlopen("http://www.ifconfig.me/ip").read()
print(page)

Related

while reading json data from url using python gives error "urllib.error.HTTPError: HTTP Error 403: Forbidden"

with this code I am reading a URL and using the data for filtration but urllib could not work
url = "myurl"
response = urllib.request.urlopen(url)
data = json.loads(response.read())
yesterday it was working well but now giving me error:
Traceback (most recent call last):
File "vaccine_survey.py", line 22, in <module>
response = urllib.request.urlopen(url)
File "/usr/lib/python3.6/urllib/request.py", line 223, in urlopen
return opener.open(url, data, timeout)
File "/usr/lib/python3.6/urllib/request.py", line 532, in open
response = meth(req, response)
File "/usr/lib/python3.6/urllib/request.py", line 642, in http_response
'http', request, response, code, msg, hdrs)
File "/usr/lib/python3.6/urllib/request.py", line 570, in error
return self._call_chain(*args)
File "/usr/lib/python3.6/urllib/request.py", line 504, in _call_chain
result = func(*args)
File "/usr/lib/python3.6/urllib/request.py", line 650, in http_error_default
raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 403: Forbidden
this works for me, here 'myurl' is a url address
from urllib.request import Request, urlopen
req = Request('myurl', headers={'User-Agent': 'Mozilla/5.0'})
response = urlopen(req).read()
data = json.loads(response.read())

Why is urllib.request.urlopen giving me 404 on Wall Street Journal's website?

Problem
I'm using urllib.request.urlopen on the Wall Street Journal and it gives me a 404.
Details
Other sites work fine. Same error if I use https://. I did this example in REPL but the same error happens in my calls from my Django server:
>>> from urllib.request import urlopen
>>> urlopen('http://www.wsj.com')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/urllib/request.py", line 222, in urlopen
return opener.open(url, data, timeout)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/urllib/request.py", line 531, in open
response = meth(req, response)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/urllib/request.py", line 641, in http_response
'http', request, response, code, msg, hdrs)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/urllib/request.py", line 569, in error
return self._call_chain(*args)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/urllib/request.py", line 503, in _call_chain
result = func(*args)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/urllib/request.py", line 649, in http_error_default
raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 404: Not Found
This is how it should work:
>>> urlopen('http://www.cbc.ca')
<http.client.HTTPResponse object at 0x10b0f8c88>
I'm not sure how to debug this. Anyone know what's going on, and how I can fix it?
first import Request like this:
from urllib.request import **Request**, urlopen
and then pass your url and header to Request like below:
url = 'https://www.wsj.com/'
response_obj = urlopen(Request(url, headers={'User-Agent': 'Mozilla/5.0'}))
print(response_obj)
I tested it now its working

urllib urlopen throws 403 status code: HTTP Error 403: Forbidden

I tried to test the connection to the website on PyCharm.
When I run the code I am faced with this error:
I use BeautifulSoup modules
import urllib.request
from bs4 import BeautifulSoup
def get_html(url):
response = urllib.request.urlopen(url)
return response.read()
def parse(html):
soup = BeautifulSoup(html)
def main():
print(get_html('https://tap.az/'))
if __name__ == '__main__':
main()
C:\Users\Adil\AppData\Local\Programs\Python\Python37-32\python.exe C:/Users/Adil/Desktop/PAbot/pabot.py
Traceback (most recent call last):
File "C:/Users/Adil/Desktop/PAbot/pabot.py", line 21, in <module>
main()
File "C:/Users/Adil/Desktop/PAbot/pabot.py", line 17, in main
print(get_html('https://tap.az/'))
File "C:/Users/Adil/Desktop/PAbot/pabot.py", line 8, in get_html
response = urllib.request.urlopen(url)
File "C:\Users\Adil\AppData\Local\Programs\Python\Python37-32\lib\urllib\request.py", line 222, in urlopen
return opener.open(url, data, timeout)
File "C:\Users\Adil\AppData\Local\Programs\Python\Python37-32\lib\urllib\request.py", line 531, in open
response = meth(req, response)
File "C:\Users\Adil\AppData\Local\Programs\Python\Python37-32\lib\urllib\request.py", line 641, in http_response
'http', request, response, code, msg, hdrs)
File "C:\Users\Adil\AppData\Local\Programs\Python\Python37-32\lib\urllib\request.py", line 569, in error
return self._call_chain(*args)
File "C:\Users\Adil\AppData\Local\Programs\Python\Python37-32\lib\urllib\request.py", line 503, in _call_chain
result = func(*args)
File "C:\Users\Adil\AppData\Local\Programs\Python\Python37-32\lib\urllib\request.py", line 649, in http_error_default
raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 403: Forbidden
Try adding a user agent in your request header.
from bs4 import BeautifulSoup
import urllib.request
user_agent = 'Mozilla/5.0'
headers = {'User-Agent': user_agent}
request = urllib.request.Request(url='https://tap.az/', headers=headers)
response = urllib.request.urlopen(request)
result = BeautifulSoup(response.read(), 'html.parser')
print(result)

Inability to extract data from website using urllib2 [duplicate]

This question already has answers here:
urllib2 Error 403: Forbidden
(2 answers)
Closed 8 years ago.
I was following a tutorial on pythonforbeginners.com, and I came across a code which isn't running right on my OSX.
from bs4 import BeautifulSoup
import urllib2
url = "http://www.pythonforbeginners.com"
content = urllib2.urlopen(url).read()
soup = BeautifulSoup(content)
print soup.prettify()
This gives me the error:
Traceback (most recent call last): File
"/Users/dhruvmullick/CS/Python/Extracting Data/test.py", line 8, in
content = urllib2.urlopen(url).read() File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py",
line 127, in urlopen
return _opener.open(url, data, timeout) File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py",
line 410, in open
response = meth(req, response) File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py",
line 523, in http_response
'http', request, response, code, msg, hdrs) File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py",
line 448, in error
return self._call_chain(*args) File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py",
line 382, in _call_chain
result = func(*args) File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py",
line 531, in http_error_default
raise HTTPError(req.get_full_url(), code, msg, hdrs, fp) urllib2.HTTPError: HTTP Error 403: Forbidden
The 403 error indicates that the server is blocking your connection.
...a request from a client for a web page or resource to indicate that the server can be reached and understood the request, but refuses to take any further action.
Try with a different domain and you'll find that it works as expected.
To make a work-around, you can add a custom user-agent.

Browsing a NTLM protected website using python with python NTLM

I have been tasked with creating a script that logs on to a corporate portal goes to a particular page, downloads the page, compares it to an earlier version and then emails a certain person depending on changes that have been made. The last parts are easy enough but it has been the first step that is giving me the most trouble.
After unsuccessfully using urllib2(I am trying to do this in python) to connect and about 4 or 5 hours of googling I have determined that the reason I can't connect is due to NTLM authentication on the web page. I have tried a bunch of different processes for connecting found on this site and others to no avail. Based on the NTLM example I have done:
import urllib2
from ntlm import HTTPNtlmAuthHandler
user = 'username'
password = "password"
url = "https://portal.whatever.com/"
passman = urllib2.HTTPPasswordMgrWithDefaultRealm()
passman.add_password(None, url, user, password)
# create the NTLM authentication handler
auth_NTLM = HTTPNtlmAuthHandler.HTTPNtlmAuthHandler(passman)
# create and install the opener
opener = urllib2.build_opener(auth_NTLM)
urllib2.install_opener(opener)
# create a header
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
header = { 'Connection' : 'Keep-alive', 'User-Agent' : user_agent}
response = urllib2.urlopen(urllib2.Request(url, None, header))
When I run this (with a real username, password and url) I get the following:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "ntlm2.py", line 21, in <module>
response = urllib2.urlopen(urllib2.Request(url, None, header))
File "C:\Python27\lib\urllib2.py", line 126, in urlopen
return _opener.open(url, data, timeout)
File "C:\Python27\lib\urllib2.py", line 400, in open
response = meth(req, response)
File "C:\Python27\lib\urllib2.py", line 513, in http_response
'http', request, response, code, msg, hdrs)
File "C:\Python27\lib\urllib2.py", line 432, in error
result = self._call_chain(*args)
File "C:\Python27\lib\urllib2.py", line 372, in _call_chain
result = func(*args)
File "C:\Python27\lib\urllib2.py", line 619, in http_error_302
return self.parent.open(new, timeout=req.timeout)
File "C:\Python27\lib\urllib2.py", line 400, in open
response = meth(req, response)
File "C:\Python27\lib\urllib2.py", line 513, in http_response
'http', request, response, code, msg, hdrs)
File "C:\Python27\lib\urllib2.py", line 432, in error
result = self._call_chain(*args)
File "C:\Python27\lib\urllib2.py", line 372, in _call_chain
result = func(*args)
File "C:\Python27\lib\urllib2.py", line 619, in http_error_302
return self.parent.open(new, timeout=req.timeout)
File "C:\Python27\lib\urllib2.py", line 400, in open
response = meth(req, response)
File "C:\Python27\lib\urllib2.py", line 513, in http_response
'http', request, response, code, msg, hdrs)
File "C:\Python27\lib\urllib2.py", line 438, in error
return self._call_chain(*args)
File "C:\Python27\lib\urllib2.py", line 372, in _call_chain
result = func(*args)
File "C:\Python27\lib\urllib2.py", line 521, in http_error_default
raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
urllib2.HTTPError: HTTP Error 401: Unauthorized
The thing that is most interesting about this trace to me is that the final line says a 401 error was sent back. From what I have read the 401 error is the first message sent back to the client when NTLM is started. I was under the impression that the purpose of python-ntml was to handle the NTLM process for me. Is that wrong or am I just using it incorrectly? Also I'm not bounded to using python for this, so if there is an easier way to do this in another language let me know (From what I seen a-googling there isn't).
Thanks!
If the site is using NTLM authentication, the headers attribute of the resulting HTTPError should say so:
>>> try:
... handle = urllib2.urlopen(req)
... except IOError, e:
... print e.headers
...
<other headers>
WWW-Authenticate: Negotiate
WWW-Authenticate: NTLM

Categories