I have a very basic script to download a website using Python urllib2.
It had been working brilliantly for the past six months, and then this morning it stopped working.
#!/usr/bin/python
import urllib2
proxy_support = urllib2.ProxyHandler({'http': 'http://DOMAIN\USER:PASS#PROXY:PORT/'})
opener = urllib2.build_opener(proxy_support)
urllib2.install_opener(opener)
translink = open('/tmp/trains.html', 'w')
response = urllib2.urlopen('http://translink.com.au')
html = response.read()
translink.write(html)
translink.close()
I am now getting the following error
Traceback (most recent call last):
  File "./gettrains.py", line 7, in <module>
    response = urllib2.urlopen('http://translink.com.au')
  File "/usr/lib/python2.7/urllib2.py", line 127, in urlopen
    return _opener.open(url, data, timeout)
  File "/usr/lib/python2.7/urllib2.py", line 407, in open
    response = meth(req, response)
  File "/usr/lib/python2.7/urllib2.py", line 520, in http_response
    'http', request, response, code, msg, hdrs)
  File "/usr/lib/python2.7/urllib2.py", line 445, in error
    return self._call_chain(*args)
  File "/usr/lib/python2.7/urllib2.py", line 379, in _call_chain
    result = func(*args)
  File "/usr/lib/python2.7/urllib2.py", line 528, in http_error_default
    raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
urllib2.HTTPError: HTTP Error 502: Proxy Error ( The HTTP message includes an unsupported header or an unsupported combination of headers. )
I am new to Python; any help would be very much appreciated.
Cheers
#!/usr/bin/python
import requests

# Same placeholder proxy as in the question -- substitute your real values.
proxies = {
    "http": "http://domain\user:pass#proxy:port",
    "https": "http://domain\user:pass#proxy:port",
}

html = requests.get("http://translink.com.au", proxies=proxies)

translink = open('/tmp/trains.html', 'w')
translink.write(html.content)
translink.close()
Try changing the User-Agent header. For example:
opener = urllib2.build_opener(proxy_support)
opener.addheaders = [('User-Agent', 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)')]
urllib2.install_opener(opener)
I had the same problem a few days ago: my proxy didn't accept the default User-Agent header, 'Python-urllib/2.7'.
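Putting that together with the proxy handler from the question (the proxy URL is the question's placeholder, not a real value), the full script would look roughly like this:
import urllib2

# Placeholder proxy from the question -- substitute your real values.
proxy_support = urllib2.ProxyHandler({'http': 'http://DOMAIN\USER:PASS#PROXY:PORT/'})
opener = urllib2.build_opener(proxy_support)
# Replace the default 'Python-urllib/2.7' User-Agent that the proxy rejects.
opener.addheaders = [('User-Agent', 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)')]
urllib2.install_opener(opener)

html = urllib2.urlopen('http://translink.com.au').read()
with open('/tmp/trains.html', 'w') as translink:
    translink.write(html)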
To simplify things a little, I would avoid setting up the proxy from within Python and simply let your OS manage it for you. You can do this by setting an environment variable (export http_proxy="your_proxy" on Linux), as sketched below. Then grab the file directly through Python with urllib2 or requests; you might also consider the wget module.
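A rough sketch of that approach (the proxy URL is a placeholder; urllib2 reads the http_proxy environment variable automatically when it builds its default opener):
import os
import urllib2

# Equivalent to running `export http_proxy="..."` in the shell before starting Python.
os.environ['http_proxy'] = 'http://user:pass@proxy:port'  # placeholder proxy

html = urllib2.urlopen('http://translink.com.au').read()
with open('/tmp/trains.html', 'w') as f:
    f.write(html)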
It's entirely possible that your proxy has changed and now forwards requests with headers that are no longer acceptable to your final destination. In that case there's very little you can do.
I am trying to make some image filters for my API. Some URLs work while most do not, and I wanted to know why and how to fix it. I have looked through another Stack Overflow post but haven't had much luck, as I don't know what the problem is.
Here is an example of a working URL
And one that does not work
Edit: Here is another URL that does not work
Here is the API I am trying to make
Here is my code
import io
import urllib.request
from PIL import Image

def generate_image_Wanted(imageUrl):
    # Download the source image into an in-memory buffer.
    with urllib.request.urlopen(imageUrl) as url:
        f = io.BytesIO(url.read())
    im1 = Image.open("images/wanted.jpg")
    im2 = Image.open(f)
    im2 = im2.resize((300, 285))
    # Paste the downloaded image onto the template.
    img = im1.copy()
    img.paste(im2, (85, 230))
    d = io.BytesIO()
    img.save(d, "PNG")
    d.seek(0)
    return d
Here is my error
Traceback (most recent call last):
  File "c:\Users\micha\OneDrive\Desktop\MicsAPI\test.py", line 23, in <module>
    generate_image_Wanted("https://cdn.discordapp.com/avatars/902240397273743361/9d7ce93e7510f47da2d8ba97ec32fc33.png")
  File "c:\Users\micha\OneDrive\Desktop\MicsAPI\test.py", line 11, in generate_image_Wanted
    with urllib.request.urlopen(imageUrl) as url:
  File "C:\Users\micha\AppData\Local\Programs\Python\Python39\lib\urllib\request.py", line 214, in urlopen
    return opener.open(url, data, timeout)
  File "C:\Users\micha\AppData\Local\Programs\Python\Python39\lib\urllib\request.py", line 523, in open
    response = meth(req, response)
  File "C:\Users\micha\AppData\Local\Programs\Python\Python39\lib\urllib\request.py", line 632, in http_response
    response = self.parent.error(
  File "C:\Users\micha\AppData\Local\Programs\Python\Python39\lib\urllib\request.py", line 561, in error
    return self._call_chain(*args)
  File "C:\Users\micha\AppData\Local\Programs\Python\Python39\lib\urllib\request.py", line 494, in _call_chain
    result = func(*args)
  File "C:\Users\micha\AppData\Local\Programs\Python\Python39\lib\urllib\request.py", line 641, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 403: Forbidden
Thank you for looking at this and have a good day.
Maybe the sites you can't scrape have protection against known bots and spiders and block requests coming from urllib.
You need to provide some headers; see the Python requests library for more on this.
Working example:
import urllib.request

hdr = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64)'}
url = "https://cdn.discordapp.com/avatars/902240397273743361/9d7ce93e7510f47da2d8ba97ec32fc33.png"

req = urllib.request.Request(url, headers=hdr)
response = urllib.request.urlopen(req)
response.read()
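From there you can hand the bytes to PIL the same way your generate_image_Wanted function does, by wrapping them in a BytesIO:
import io
import urllib.request
from PIL import Image

hdr = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64)'}
url = "https://cdn.discordapp.com/avatars/902240397273743361/9d7ce93e7510f47da2d8ba97ec32fc33.png"
req = urllib.request.Request(url, headers=hdr)

# BytesIO lets PIL treat the downloaded bytes like a file.
with urllib.request.urlopen(req) as response:
    im2 = Image.open(io.BytesIO(response.read()))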
I'm trying to download the HTML of a page (http://www.guangxindai.com in this case) but I'm getting back a 403 error. Here is my code:
import urllib.request
opener = urllib.request.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
f = opener.open("http://www.guangxindai.com")
f.read()
but I get an error response.
Traceback (most recent call last):
  File "<pyshell#7>", line 1, in <module>
    f = opener.open("http://www.guangxindai.com")
  File "C:\Python33\lib\urllib\request.py", line 475, in open
    response = meth(req, response)
  File "C:\Python33\lib\urllib\request.py", line 587, in http_response
    'http', request, response, code, msg, hdrs)
  File "C:\Python33\lib\urllib\request.py", line 513, in error
    return self._call_chain(*args)
  File "C:\Python33\lib\urllib\request.py", line 447, in _call_chain
    result = func(*args)
  File "C:\Python33\lib\urllib\request.py", line 595, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 403: Forbidden
I have tried different request headers, but I still cannot get a correct response. I can view the page through a browser, which seems strange to me; I guess the site uses some method to block web spiders. Does anyone know what is happening? How can I get the HTML of the page correctly?
I was having the same problem as you, and I found the answer in this link.
The answer provided by Stefano Sanfilippo is quite simple and worked for me:
from urllib.request import Request, urlopen

url_request = Request("http://www.guangxindai.com",
                      headers={"User-Agent": "Mozilla/5.0"})
webpage = urlopen(url_request).read()
If your aim is just to read the HTML of the page, you can use the following code. It worked for me on Python 2.7.
import urllib
f = urllib.urlopen("http://www.guangxindai.com")
f.read()
I need help logging in to a webpage using an id and password, going to a link within the website, and downloading the complete page code or response that we can see in the page source for that link. I tried using Perl and Python but had no luck.
Specifically, I need to log in to www.server.com:7180, save the cookies, and then go to the www.server.com:7180/healthissues page, or directly download to a text file whatever response we get in the browser.
import urllib
import urllib2
import webbrowser
import cookielib

data1 = {'j_username': 'id', 'j_password': 'pass'}
data = urllib.urlencode(data1)
url = 'http://server.intranet.com:7180/cmf/allHealthIssues'
full_url = url + '?' + data
response = urllib2.urlopen(full_url)

with open("results.html", "w") as f:
    f.write(response.read())
webbrowser.open("results.html")
The above code downloads the webpage, but I always end up with the authentication page in the download. I found a lot of packages, but unfortunately I do not have access to install packages or libraries. Any help is appreciated.
I tried the code suggested by PM 2Ring, but I'm getting the error below. I have Python 2.6.6 and I'm not sure if that method will work there. Please let me know any workaround or way to resolve the error.
Traceback (most recent call last):
  File "a.py", line 15, in <module>
    handle = urllib2.urlopen(req)
  File "/usr/lib64/python2.6/urllib2.py", line 126, in urlopen
    return _opener.open(url, data, timeout)
  File "/usr/lib64/python2.6/urllib2.py", line 397, in open
    response = meth(req, response)
  File "/usr/lib64/python2.6/urllib2.py", line 510, in http_response
    'http', request, response, code, msg, hdrs)
  File "/usr/lib64/python2.6/urllib2.py", line 435, in error
    return self._call_chain(*args)
  File "/usr/lib64/python2.6/urllib2.py", line 369, in _call_chain
    result = func(*args)
  File "/usr/lib64/python2.6/urllib2.py", line 518, in http_error_default
    raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
urllib2.HTTPError: HTTP Error 403: Invalid request
Although you are importing cookielib you aren't actually using it, so you can't get past the authentication page of the website. The Python docs for cookielib have some simple examples of how to use it, e.g.:
import cookielib, urllib2
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
r = opener.open("http://example.com/")
Obviously, your code will need to be a little more complicated, as you need to send the password.
So you'll need to do something like this (untested):
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
urllib2.install_opener(opener)

data1 = {'j_username': 'id', 'j_password': 'pass'}
data = urllib.urlencode(data1)
headers = {'User-agent': 'Mozilla/5.0 (X11; Linux i586; rv:31.0) Gecko/20100101 Firefox/31.0'}

# url should be the address the login form posts to; passing data makes this a POST.
req = urllib2.Request(url, data, headers)
handle = urllib2.urlopen(req)
It's a shame you can't install Requests, as it makes things so much simpler than using the native Python modules.
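For what it's worth, the whole flow would look roughly like this (also untested). The login URL ending in j_security_check is a guess: the j_username/j_password field names suggest a Java-style form login, so check what your login form actually posts to:
import cookielib
import urllib
import urllib2
import webbrowser

# The cookie jar keeps the session cookie from the login for later requests.
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
urllib2.install_opener(opener)

# Hypothetical login endpoint -- adjust to whatever your login form posts to.
login_url = 'http://server.intranet.com:7180/j_security_check'
data = urllib.urlencode({'j_username': 'id', 'j_password': 'pass'})
headers = {'User-agent': 'Mozilla/5.0 (X11; Linux i586; rv:31.0) Gecko/20100101 Firefox/31.0'}

# Passing data makes this a POST, which is what login forms expect.
urllib2.urlopen(urllib2.Request(login_url, data, headers))

# The opener now carries the session cookie, so this should be the real page.
response = urllib2.urlopen('http://server.intranet.com:7180/cmf/allHealthIssues')
with open("results.html", "w") as f:
    f.write(response.read())
webbrowser.open("results.html")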
I'm currently trying to fix a Kodi plugin called NetfliXBMC.
It uses this url to get information on specific movies:
http://www.netflix.com/JSON/BOB?movieid=<SOMEID>
While trying to build a minimal case to ask this question I discovered that it's not even necessary to be logged in to access the information, which simplifies my question a lot.
Querying information about a movie works from wget, from curl, from incognito chrome etc. It just never works from urllib2:
# wget works just fine
$: wget -q -O- http://www.netflix.com/JSON/BOB?movieid=80021955
{"contextData":"{\"cookieDisclosure\":{\"data\":{\"showCookieBanner\":false}}}","result":"success","actionErrors":null,"fieldErrors":null,"actionMessages":null,"data":[output omitted for brevity]}
# so does curl
$: curl http://www.netflix.com/JSON/BOB?movieid=80021955
{"contextData":"{\"cookieDisclosure\":{\"data\":{\"showCookieBanner\":false}}}","result":"success","actionErrors":null,"fieldErrors":null,"actionMessages":null,"data":[output omitted for brevity}
# but python's urllib always gets a 500
$: python -c "import urllib2; urllib2.urlopen('http://www.netflix.com/JSON/BOB?movieid=80021955').read()"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/lib/python2.7/urllib2.py", line 127, in urlopen
    return _opener.open(url, data, timeout)
  File "/usr/lib/python2.7/urllib2.py", line 410, in open
    response = meth(req, response)
  File "/usr/lib/python2.7/urllib2.py", line 523, in http_response
    'http', request, response, code, msg, hdrs)
  File "/usr/lib/python2.7/urllib2.py", line 448, in error
    return self._call_chain(*args)
  File "/usr/lib/python2.7/urllib2.py", line 382, in _call_chain
    result = func(*args)
  File "/usr/lib/python2.7/urllib2.py", line 531, in http_error_default
    raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
urllib2.HTTPError: HTTP Error 500: Internal Server Error
$: python --version
Python 2.7.6
What I've tried so far: several different user-agent strings, initializing a urlopener with a cookie jar, plain old urllib (doesn't raise an exception but receives the same error page).
I'm really curious as to why this might be. Thanks in advance!
It turned out to be a bug on Netflix's side, triggered when no Accept header is sent.
This doesn't work:
opener = urllib2.build_opener()
opener.open("http://www.netflix.com/JSON/BOB?movieid=80021955")
Adding a proper Accept header makes it work:
opener = urllib2.build_opener()
mimeAccept = "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"
opener.addheaders = [('Accept', mimeAccept)]
opener.open("http://www.netflix.com/JSON/BOB?movieid=80021955")
[...]
Of course, there is another bug there: it returns a 500 Internal Server Error instead of a 400 Bad Request, even though the problem is clearly on the request side.
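As an aside, if you want to see exactly which headers urllib2 sends by default (to compare against curl or wget), you can turn on httplib's debug output through the handler:
import urllib2

# debuglevel=1 makes the underlying httplib print the raw request and
# response headers to stdout, showing what urllib2 actually sends.
opener = urllib2.build_opener(urllib2.HTTPHandler(debuglevel=1))
opener.open("http://www.netflix.com/JSON/BOB?movieid=80021955")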
I am trying to pull information from a site every 5 seconds, but it doesn't seem to be working and I get errors every time I run it.
Code below:
import urllib2, threading

def readpage():
    data = urllib2.urlopen('http://forums.zybez.net/runescape-2007-prices').read()
    for line in data:
        if 'forums.zybez.net/runescape-2007-prices/player/' in line:
            a = line.split('/runescape-2007-prices/player/'[1])
            print(a.split('">')[0])

t = threading.Timer(5.0, readpage)
t.start()
I get these errors:
Exception in thread Thread-1:
Traceback (most recent call last):
  File "C:\Python27\lib\threading.py", line 808, in __bootstrap_inner
    self.run()
  File "C:\Python27\lib\threading.py", line 1080, in run
    self.function(*self.args, **self.kwargs)
  File "C:\Users\Jordan\Desktop\username.py", line 3, in readpage
    data = urllib2.urlopen('http://forums.zybez.net/runescape-2007-prices').read()
  File "C:\Python27\lib\urllib2.py", line 127, in urlopen
    return _opener.open(url, data, timeout)
  File "C:\Python27\lib\urllib2.py", line 410, in open
    response = meth(req, response)
  File "C:\Python27\lib\urllib2.py", line 523, in http_response
    'http', request, response, code, msg, hdrs)
  File "C:\Python27\lib\urllib2.py", line 448, in error
    return self._call_chain(*args)
  File "C:\Python27\lib\urllib2.py", line 382, in _call_chain
    result = func(*args)
  File "C:\Python27\lib\urllib2.py", line 531, in http_error_default
    raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
HTTPError: HTTP Error 403: Forbidden
Help would be appreciated, thanks!
The site is rejecting the default User-Agent reported by urllib2. You can change it for all requests in the script using install_opener.
opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0 (X11; Linux i686; rv:5.0) Gecko/20100101 Firefox/5.0')]
urllib2.install_opener(opener)
You'll also need to split the data returned by the site so you can read it line by line:
urllib2.urlopen('http://forums.zybez.net/runescape-2007-prices').read().splitlines()
and change
line.split('/runescape-2007-prices/player/'[1])
to
line.split('/runescape-2007-prices/player/')[1]
Working:
import urllib2, threading

opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0 (X11; Linux i686; rv:5.0) Gecko/20100101 Firefox/5.0')]
urllib2.install_opener(opener)

def readpage():
    data = urllib2.urlopen('http://forums.zybez.net/runescape-2007-prices').read().splitlines()
    for line in data:
        if 'forums.zybez.net/runescape-2007-prices/player/' in line:
            a = line.split('/runescape-2007-prices/player/')[1]
            print(a.split('">')[0])

t = threading.Timer(5.0, readpage)
t.start()
Did you try opening that URL without the thread? The error code says 403: Forbidden; maybe you need authentication for that web page.
This has nothing to do with Python -- the server is denying your requests to that URL.
I suspect that either the URL is incorrect or you've hit some kind of rate limiting and are being blocked.
EDIT: how to make it work
The site is blocking Python's default User-Agent. Try this:
import urllib2, threading

def readpage():
    headers = {'User-Agent': 'Mozilla/5.0'}
    req = urllib2.Request('http://forums.zybez.net/runescape-2007-prices', None, headers)
    # splitlines() so the loop iterates over lines rather than characters.
    data = urllib2.urlopen(req).read().splitlines()
    for line in data:
        if 'forums.zybez.net/runescape-2007-prices/player/' in line:
            # The ] must close split()'s result, not index into the string.
            a = line.split('/runescape-2007-prices/player/')[1]
            print(a.split('">')[0])