Accessing the Netflix API from Python's urllib2 results in 500 error

I'm currently trying to fix a Kodi plugin called NetfliXBMC.
It uses this URL to get information on specific movies:
http://www.netflix.com/JSON/BOB?movieid=<SOMEID>
While trying to build a minimal case to ask this question, I discovered that it's not even necessary to be logged in to access the information, which simplifies my question a lot.
Querying information about a movie works from wget, from curl, from incognito chrome etc. It just never works from urllib2:
# wget works just fine
$: wget -q -O- http://www.netflix.com/JSON/BOB?movieid=80021955
{"contextData":"{\"cookieDisclosure\":{\"data\":{\"showCookieBanner\":false}}}","result":"success","actionErrors":null,"fieldErrors":null,"actionMessages":null,"data":[output omitted for brevity]}
# so does curl
$: curl http://www.netflix.com/JSON/BOB?movieid=80021955
{"contextData":"{\"cookieDisclosure\":{\"data\":{\"showCookieBanner\":false}}}","result":"success","actionErrors":null,"fieldErrors":null,"actionMessages":null,"data":[output omitted for brevity}
# but python's urllib always gets a 500
$: python -c "import urllib2; urllib2.urlopen('http://www.netflix.com/JSON/BOB?movieid=80021955').read()"
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/usr/lib/python2.7/urllib2.py", line 127, in urlopen
return _opener.open(url, data, timeout)
File "/usr/lib/python2.7/urllib2.py", line 410, in open
response = meth(req, response)
File "/usr/lib/python2.7/urllib2.py", line 523, in http_response
'http', request, response, code, msg, hdrs)
File "/usr/lib/python2.7/urllib2.py", line 448, in error
return self._call_chain(*args)
File "/usr/lib/python2.7/urllib2.py", line 382, in _call_chain
result = func(*args)
File "/usr/lib/python2.7/urllib2.py", line 531, in http_error_default
raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
urllib2.HTTPError: HTTP Error 500: Internal Server Error
$: python --version
Python 2.7.6
What I've tried so far: several different user-agent strings, initializing a urlopener with a cookie jar, plain old urllib (doesn't raise an exception but receives the same error page).
I'm really curious as to why this might be. Thanks in advance!

It turned out to be a bug on Netflix's side that is triggered when no Accept header is sent.
This doesn't work:
import urllib2

opener = urllib2.build_opener()
opener.open("http://www.netflix.com/JSON/BOB?movieid=80021955")
Adding a proper Accept header makes it work:
opener = urllib2.build_opener()
mimeAccept = "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"
opener.addheaders = [('Accept', mimeAccept)]
opener.open("http://www.netflix.com/JSON/BOB?movieid=80021955")
[...]
Of course, there is another bug there: it returns a 500 Internal Server Error instead of a 400 Bad Request when the problem is clearly with the request.
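For completeness, a minimal sketch (Python 2, assuming the endpoint still behaves as it did when this was asked) that applies the same Accept-header workaround and decodes the JSON body shown above:
import json
import urllib2

opener = urllib2.build_opener()
# Sending any reasonable Accept header is what avoids the 500.
opener.addheaders = [('Accept',
                      'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8')]
response = opener.open('http://www.netflix.com/JSON/BOB?movieid=80021955')
bob = json.loads(response.read())
print bob['result']  # 'success' in the sample output above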

Related

urllib: Opening a URL always gets 429: Too Many Requests

I just got started with the urllib module. I'm trying to scrape products from supermarkets, and there's a website that seems to always respond with an HTTP Error 429: Too Many Requests. I already did a bit of research on Stack Overflow and no one seems to have the same problem. My code is as simple as it can get:
>>> import urllib.request
>>> resp = urllib.request.urlopen("https://shop.coles.com.au/a/a-national/product/head-shoulders-shampoo-conditioner-2in1-deep-clean")
Traceback (most recent call last):
File "<pyshell#1>", line 1, in <module>
resp = urllib.request.urlopen("https://shop.coles.com.au/a/a-national/product/head-shoulders-shampoo-conditioner-2in1-deep-clean")
File "C:\Users\thank\AppData\Local\Programs\Python\Python37\lib\urllib\request.py", line 222, in urlopen
return opener.open(url, data, timeout)
File "C:\Users\thank\AppData\Local\Programs\Python\Python37\lib\urllib\request.py", line 531, in open
response = meth(req, response)
File "C:\Users\thank\AppData\Local\Programs\Python\Python37\lib\urllib\request.py", line 640, in http_response
'http', request, response, code, msg, hdrs)
File "C:\Users\thank\AppData\Local\Programs\Python\Python37\lib\urllib\request.py", line 568, in error
return self._call_chain(*args)
File "C:\Users\thank\AppData\Local\Programs\Python\Python37\lib\urllib\request.py", line 503, in _call_chain
result = func(*args)
File "C:\Users\thank\AppData\Local\Programs\Python\Python37\lib\urllib\request.py", line 648, in http_error_default
raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 429: Too Many Requests
I've also tried to modify the User-Agent as this answer suggests, but the result is still the same.
Can someone explain which default settings inside the urllib module may cause the problem? Or is it because the website blocks bots? Other product pages of the website don't work either.
A 429 is the server asking you to stop. Basically, the web server thinks you are trying to spam or scrape it, and it doesn't like that. Generally you should honor that, and if the 429 response comes with a Retry-After value you should respect it.
If you feel the server is wrongly asking you to back off, you can make sure that your request is "similar" to a request generated by a user from a browser, which means including the User-Agent and all the other information a regular browser would send with the request. If the server is still sending you 429 despite that, most probably it has blocked your IP either temporarily or permanently. In that case you should look at how to scrape through multiple IPs.
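As a rough illustration of what "similar to a browser request" can mean in practice with urllib.request, here is a minimal sketch; the header values are just examples of what a desktop browser typically sends, and the site may still return 429 if it has blocked your IP:
import urllib.request

url = ("https://shop.coles.com.au/a/a-national/product/"
       "head-shoulders-shampoo-conditioner-2in1-deep-clean")

# Illustrative browser-like headers; there is nothing magic about these values.
headers = {
    "User-Agent": ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                   "AppleWebKit/537.36 (KHTML, like Gecko) "
                   "Chrome/80.0 Safari/537.36"),
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-AU,en;q=0.9",
}

req = urllib.request.Request(url, headers=headers)
with urllib.request.urlopen(req) as resp:
    print(resp.status, len(resp.read()))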

Downloading a file using Python: the format becomes invalid

Hey, I am trying to download stock data from the NSE website of India, so I am using Python for this. This is what I'm doing:
import urllib
urllib.urlretrieve("https://www.nseindia.com/content/historical/DERIVATIVES/2016/JAN/fo01JAN2016bhav.csv.zip", "fo01JAN2016bhav.csv.zip")
But when I try to open the downloaded file, it says that the compressed zipped file is invalid. When I download it normally from the website by simply pasting the link, the file opens fine.
The link:
https://www.nseindia.com/content/historical/DERIVATIVES/2016/JAN/fo01JAN2016bhav.csv.zip
If I try using urllib2, I get this:
f=urllib2.urlopen('https://www.nseindia.com/content/historical/DERIVATIVES/2016/JAN/fo01JAN2016bhav.csv.zip')
Traceback (most recent call last):
File "<pyshell#6>", line 1, in <module>
f=urllib2.urlopen('https://www.nseindia.com/content/historical/DERIVATIVES/2016/JAN/fo01JAN2016bhav.csv.zip')
File "C:\Python27\lib\urllib2.py", line 127, in urlopen
return _opener.open(url, data, timeout)
File "C:\Python27\lib\urllib2.py", line 410, in open
response = meth(req, response)
File "C:\Python27\lib\urllib2.py", line 523, in http_response
'http', request, response, code, msg, hdrs)
File "C:\Python27\lib\urllib2.py", line 448, in error
return self._call_chain(*args)
File "C:\Python27\lib\urllib2.py", line 382, in _call_chain
result = func(*args)
File "C:\Python27\lib\urllib2.py", line 531, in http_error_default
raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
HTTPError: HTTP Error 403: Forbidden
How do I fix this?
It happens for this link only; I have tried downloading images from imgur and the code works fine.
Why is the HTTP 403 error coming up when I can normally access it through my browser?
This link provides an example of what you want to do: https://stackoverflow.com/a/22776/6595777
Found another question regarding downloading zip files. Try this:
url = "http://www.nseindia.com/content/historical/DERIVATIVES/2016/JAN/fo01JAN2016bhav.csv.zip"
download = urllib2.urlopen(url)
with open(os.path.basename(url), "wb") as f:
f.write(download.read())
I don't have commenting permissions yet so I'm posting as an answer.
I can't browse to your link via https; http works, though. Have you tried changing the link in your script to http?
It is possible that your script is downloading the error page that I get when trying to use https (ERR_SSL_PROTOCOL_ERROR). This means that what you download will have the file name you specify (ending in .zip), but it is actually HTML, which is why you get the error that the zip file is invalid.
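One way to confirm that suspicion is to check whether the downloaded file really is a zip archive before trying to open it. A minimal sketch (Python 2, using the file name from the question):
import zipfile

path = "fo01JAN2016bhav.csv.zip"
if zipfile.is_zipfile(path):
    print "looks like a valid zip archive"
else:
    # Probably an HTML error page saved under a .zip name; peek at what
    # the server actually sent.
    with open(path, "rb") as f:
        print f.read(300)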
Hey, I do not know why this is happening with the urllib and urllib2 libraries, but when I used the requests library:
import requests

# url is the same NSE link as above
r = requests.get(url)
with open("code3.zip", "wb") as code:
    code.write(r.content)
it worked. This might be an indirect solution to my question.

Python urllib.request.urlopen() returning error 403

I'm trying to download the HTML of a page (http://www.guangxindai.com in this case) but I'm getting back an error 403. Here is my code:
import urllib.request
opener = urllib.request.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
f = opener.open("http://www.guangxindai.com")
f.read()
but I get an error response.
Traceback (most recent call last):
File "<pyshell#7>", line 1, in <module>
f = opener.open("http://www.guangxindai.com")
File "C:\Python33\lib\urllib\request.py", line 475, in open
response = meth(req, response)
File "C:\Python33\lib\urllib\request.py", line 587, in http_response
'http', request, response, code, msg, hdrs)
File "C:\Python33\lib\urllib\request.py", line 513, in error
return self._call_chain(*args)
File "C:\Python33\lib\urllib\request.py", line 447, in _call_chain
result = func(*args)
File "C:\Python33\lib\urllib\request.py", line 595, in http_error_default
raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 403: Forbidden
I have tried different request headers, but I still cannot get a correct response. I can view the page through a browser, so it seems strange to me. I guess the site uses some method to block web spiders. Does anyone know what is happening? How can I get the HTML of the page correctly?
I was having the same problem as you, and I found the answer in this link.
The answer provided by Stefano Sanfilippo is quite simple and worked for me:
from urllib.request import Request, urlopen
url_request = Request("http://www.guangxindai.com",
headers={"User-Agent": "Mozilla/5.0"})
webpage = urlopen(url_request).read()
If your aim is to read the HTML of the page, you can use the following code. It worked for me on Python 2.7:
import urllib
f = urllib.urlopen("http://www.guangxindai.com")
f.read()

Python urllib2 problems

I have a very basic script to download a website using Python urllib2.
This has been working brilliantly for the past 6 months, but this morning it no longer works.
#!/usr/bin/python
import urllib2
proxy_support = urllib2.ProxyHandler({'http': 'http://DOMAIN\USER:PASS#PROXY:PORT/'})
opener = urllib2.build_opener(proxy_support)
urllib2.install_opener(opener)
translink = open('/tmp/trains.html', 'w')
response = urllib2.urlopen('http://translink.com.au')
html = response.read()
translink.write(html)
translink.close()
I am now getting the following error
Traceback (most recent call last):
File "./gettrains.py", line 7, in <module>
response = urllib2.urlopen('http://translink.com.au')
File "/usr/lib/python2.7/urllib2.py", line 127, in urlopen
return _opener.open(url, data, timeout)
File "/usr/lib/python2.7/urllib2.py", line 407, in open
response = meth(req, response)
File "/usr/lib/python2.7/urllib2.py", line 520, in http_response
'http', request, response, code, msg, hdrs)
File "/usr/lib/python2.7/urllib2.py", line 445, in error
return self._call_chain(*args)
File "/usr/lib/python2.7/urllib2.py", line 379, in _call_chain
result = func(*args)
File "/usr/lib/python2.7/urllib2.py", line 528, in http_error_default
raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
urllib2.HTTPError: HTTP Error 502: Proxy Error ( The HTTP message includes an unsupported header or an unsupported combination of headers. )
I am new to Python; any help would be very much appreciated.
Cheers
#!/usr/bin/python
import requests
proxies = {
    "http": "http://domain\user:pass#proxy:port",
    "https": "http://domain\user:pass#proxy:port",
}
html = requests.get("http://translink.com.au", proxies=proxies)
translink = open('/tmp/trains.html', 'w')
translink.write(html.content)
translink.close()
Try to change a header. For example:
# proxy_support is the ProxyHandler already defined in your script
opener = urllib2.build_opener(proxy_support)
opener.addheaders = [('User-Agent', 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)')]
urllib2.install_opener(opener)
I had the same problem some days ago. My proxy didn't accept the default User-Agent header, 'Python-urllib/2.7'.
To simplify things a little bit, I would avoid the proxy setup from within Python and simply let your OS manage it for you. You can do this by setting an environment variable (like export http_proxy="your_proxy" in Linux). Then simply grab the file directly through Python, which you can do with urllib2 or requests; you may also consider the wget module.
It's entirely possible that something changed on your proxy so that it now forwards the requests with headers that are no longer acceptable to your final destination. In that case there's very little you can do.
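A minimal sketch of that approach (assuming http_proxy is exported in the environment before the script runs; urllib2 picks it up automatically through ProxyHandler's default behaviour, so no explicit proxy setup is needed in the code):
#!/usr/bin/python
import urllib2

# Relies on the environment, e.g. export http_proxy="your_proxy" in the shell.
response = urllib2.urlopen('http://translink.com.au')
with open('/tmp/trains.html', 'w') as translink:
    translink.write(response.read())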

Python urllib2 URLError exception?

I installed Python 2.6.2 earlier on a Windows XP machine and ran the following code:
import urllib2
import urllib
page = urllib2.Request('http://www.python.org/fish.html')
urllib2.urlopen( page )
I get the following error.
Traceback (most recent call last):
  File "C:\Python26\test3.py", line 6, in <module>
    urllib2.urlopen( page )
  File "C:\Python26\lib\urllib2.py", line 124, in urlopen
    return _opener.open(url, data, timeout)
  File "C:\Python26\lib\urllib2.py", line 383, in open
    response = self._open(req, data)
  File "C:\Python26\lib\urllib2.py", line 401, in _open
    '_open', req)
  File "C:\Python26\lib\urllib2.py", line 361, in _call_chain
    result = func(*args)
  File "C:\Python26\lib\urllib2.py", line 1130, in http_open
    return self.do_open(httplib.HTTPConnection, req)
  File "C:\Python26\lib\urllib2.py", line 1105, in do_open
    raise URLError(err)
URLError: <urlopen error [Errno 11001] getaddrinfo failed>
import urllib2
response = urllib2.urlopen('http://www.python.org/fish.html')
html = response.read()
You're doing it wrong.
Have a look in the urllib2 source, at the line specified by the traceback:
File "C:\Python26\lib\urllib2.py", line 1105, in do_open
raise URLError(err)
There you'll see the following fragment:
try:
    h.request(req.get_method(), req.get_selector(), req.data, headers)
    r = h.getresponse()
except socket.error, err: # XXX what error?
    raise URLError(err)
So, it looks like the source is a socket error, not an HTTP-protocol-related error. Possible reasons: you are not online, you are behind a restrictive firewall, your DNS is down, ...
All this aside from the fact, as mcandre pointed out, that your code is wrong.
Name resolution error.
getaddrinfo is used to resolve the hostname (python.org) in your request. If it fails, it means that the name could not be resolved because:
It does not exist, or the records are outdated (unlikely; python.org is a well-established domain name)
Your DNS server is down (unlikely; if you can browse other sites, you should be able to fetch that page through Python)
A firewall is blocking Python or your script from accessing the Internet (most likely; Windows Firewall sometimes does not ask you if you want to allow an application)
You live on an ancient voodoo cemetery. (unlikely; if that is the case, you should move out)
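If you want to check the name resolution directly, here is a minimal sketch using the socket module (getaddrinfo is the call that failed in the traceback above):
import socket

try:
    info = socket.getaddrinfo('www.python.org', 80)
    print 'resolved to:', info[0][4]
except socket.gaierror as err:
    print 'name resolution failed:', err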
Windows Vista, Python 2.6.2
It's a 404 page, right?
>>> import urllib2
>>> import urllib
>>>
>>> page = urllib2.Request('http://www.python.org/fish.html')
>>> urllib2.urlopen( page )
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Python26\lib\urllib2.py", line 124, in urlopen
return _opener.open(url, data, timeout)
File "C:\Python26\lib\urllib2.py", line 389, in open
response = meth(req, response)
File "C:\Python26\lib\urllib2.py", line 502, in http_response
'http', request, response, code, msg, hdrs)
File "C:\Python26\lib\urllib2.py", line 427, in error
return self._call_chain(*args)
File "C:\Python26\lib\urllib2.py", line 361, in _call_chain
result = func(*args)
File "C:\Python26\lib\urllib2.py", line 510, in http_error_default
raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
urllib2.HTTPError: HTTP Error 404: Not Found
>>>
DJ
First, I see no reason to import urllib; I've only ever seen urllib2 used to replace urllib entirely and I know of no functionality that's useful from urllib and yet is missing from urllib2.
Next, I notice that http://www.python.org/fish.html gives a 404 error to me. (That doesn't explain the backtrace/exception you're seeing; I get urllib2.HTTPError: HTTP Error 404: Not Found.)
Normally, if you just want to do a default fetch of a web page (without adding special HTTP headers, doing any sort of POST, etc.) then the following suffices:
req = urllib2.urlopen('http://www.python.org/')
html = req.read()
# and req.close() if you want to be pedantic
