I just got started with the urllib module. I'm trying to scrape products from supermarkets, and one website always responds with HTTP Error 429: Too Many Requests. I already did a bit of research on Stack Overflow and no one seems to have the same problem. My code is as simple as it can get:
>>> import urllib.request
>>> resp = urllib.request.urlopen("https://shop.coles.com.au/a/a-national/product/head-shoulders-shampoo-conditioner-2in1-deep-clean")
Traceback (most recent call last):
File "<pyshell#1>", line 1, in <module>
resp = urllib.request.urlopen("https://shop.coles.com.au/a/a-national/product/head-shoulders-shampoo-conditioner-2in1-deep-clean")
File "C:\Users\thank\AppData\Local\Programs\Python\Python37\lib\urllib\request.py", line 222, in urlopen
return opener.open(url, data, timeout)
File "C:\Users\thank\AppData\Local\Programs\Python\Python37\lib\urllib\request.py", line 531, in open
response = meth(req, response)
File "C:\Users\thank\AppData\Local\Programs\Python\Python37\lib\urllib\request.py", line 640, in http_response
'http', request, response, code, msg, hdrs)
File "C:\Users\thank\AppData\Local\Programs\Python\Python37\lib\urllib\request.py", line 568, in error
return self._call_chain(*args)
File "C:\Users\thank\AppData\Local\Programs\Python\Python37\lib\urllib\request.py", line 503, in _call_chain
result = func(*args)
File "C:\Users\thank\AppData\Local\Programs\Python\Python37\lib\urllib\request.py", line 648, in http_error_default
raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 429: Too Many Requests
I've also tried to modify the user-agent as this answer suggests, but the result is still the same.
Can someone explain which default settings inside the urllib module may cause the problem? Or is it because the website blocks bots? Other product pages of the website don't work either.
A 429 is the server asking you to stop. Basically, the web server thinks you are trying to spam or scrape it, and it doesn't like that. Generally you should honor the server: if the 429 response tells you to try again after some time (the Retry-After header), wait that long before retrying.
If you feel the server is wrongly asking you to stop, make sure your request is similar to the request a user's browser would generate, including the user-agent and all the other information a regular browser sends. If the server still sends you a 429 despite that, it has most probably blocked your IP, either temporarily or permanently. In that case you should look at how to scrape through multiple IPs.
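The "wait and retry" advice above can be sketched as follows. This is a minimal sketch, not a definitive implementation: the function names, the attempt limit, and the backoff constants are all assumptions made for illustration, and the `opener` and `sleep` parameters exist only so the logic can be exercised without touching the network.

```python
import time
import urllib.error
import urllib.request


def retry_delay(headers, attempt, base=1.0, cap=60.0):
    """Seconds to wait: Retry-After if it is a plain number, else exponential backoff."""
    value = headers.get("Retry-After") if headers is not None else None
    if value is not None and value.strip().isdigit():
        return min(float(value), cap)
    # No usable header (or it is an HTTP date, which this sketch ignores).
    return min(base * (2 ** attempt), cap)


def fetch_with_backoff(url, max_attempts=4, opener=urllib.request.urlopen, sleep=time.sleep):
    """Fetch url, retrying only on HTTP 429 and re-raising after max_attempts."""
    for attempt in range(max_attempts):
        try:
            with opener(url) as resp:
                return resp.read()
        except urllib.error.HTTPError as err:
            if err.code != 429 or attempt == max_attempts - 1:
                raise
            sleep(retry_delay(err.headers, attempt))
```

Retry-After may also be sent as an HTTP date; this sketch falls back to exponential backoff in that case rather than parsing the date.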
Hey, I am trying to download stock data from the NSE website of India, and I am using Python for this.
My code is:
import urllib
urllib.urlretrieve("https://www.nseindia.com/content/historical/DERIVATIVES/2016/JAN/fo01JAN2016bhav.csv.zip","fo01JAN2016bhav.csv.zip")
But when I try to open the downloaded file, it says that the compressed (zipped) file is invalid.
When I download it normally from the website by simply pasting the link, the downloaded file opens fine.
Link:
https://www.nseindia.com/content/historical/DERIVATIVES/2016/JAN/fo01JAN2016bhav.csv.zip
If I try urllib2 instead, I get this:
f=urllib2.urlopen('https://www.nseindia.com/content/historical/DERIVATIVES/2016/JAN/fo01JAN2016bhav.csv.zip')
Traceback (most recent call last):
File "<pyshell#6>", line 1, in <module>
f=urllib2.urlopen('https://www.nseindia.com/content/historical/DERIVATIVES/2016/JAN/fo01JAN2016bhav.csv.zip')
File "C:\Python27\lib\urllib2.py", line 127, in urlopen
return _opener.open(url, data, timeout)
File "C:\Python27\lib\urllib2.py", line 410, in open
response = meth(req, response)
File "C:\Python27\lib\urllib2.py", line 523, in http_response
'http', request, response, code, msg, hdrs)
File "C:\Python27\lib\urllib2.py", line 448, in error
return self._call_chain(*args)
File "C:\Python27\lib\urllib2.py", line 382, in _call_chain
result = func(*args)
File "C:\Python27\lib\urllib2.py", line 531, in http_error_default
raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
HTTPError: HTTP Error 403: Forbidden
How do I fix this?
It happens for this link only; I have tried downloading images from imgur and the code works fine.
Why do I get the HTTP 403 error when I can access the file normally through my browser?
This link provides an example of what you want to do: https://stackoverflow.com/a/22776/6595777
Found another question regarding downloading zip files. Try this:
import os
import urllib2
url = "http://www.nseindia.com/content/historical/DERIVATIVES/2016/JAN/fo01JAN2016bhav.csv.zip"
download = urllib2.urlopen(url)
# Save the raw bytes under the last path component of the URL.
with open(os.path.basename(url), "wb") as f:
    f.write(download.read())
I don't have commenting permissions yet so I'm posting as an answer.
I can't browse to your link via https, though http works. Have you tried changing the link in your script to http?
It is possible that your script is downloading the error page that I get when trying to use https (ERR_SSL_PROTOCOL_ERROR). In that case the download has the file name you specify (ending in .zip), but its content is actually HTML, which would explain the error saying that the zip file is invalid.
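One way to test the theory above is to look at the downloaded bytes before trusting the file name. The sketch below is illustrative only; the helper names are hypothetical, and the HTML check is a rough heuristic rather than a full content sniffer.

```python
import io
import zipfile


def looks_like_zip(payload):
    """Return True if the bytes form a readable ZIP archive."""
    return zipfile.is_zipfile(io.BytesIO(payload))


def describe_download(payload):
    """Give a rough guess at what a downloaded payload actually is."""
    if looks_like_zip(payload):
        return "zip"
    # Error pages usually start with a doctype or an <html> tag.
    head = payload[:256].lstrip().lower()
    if head.startswith(b"<!doctype html") or head.startswith(b"<html"):
        return "html"
    return "unknown"
```

Calling `describe_download(open("fo01JAN2016bhav.csv.zip", "rb").read())` on the saved file would show whether an error page was written to disk instead of an archive.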
Hey, I do not know why this happens with the urllib and urllib2 libraries, but when I used the requests library:
import requests
url = "https://www.nseindia.com/content/historical/DERIVATIVES/2016/JAN/fo01JAN2016bhav.csv.zip"
r = requests.get(url)
with open("code3.zip", "wb") as code:
    code.write(r.content)
it worked.
This might be an indirect solution to my question.
I'm trying to download the HTML of a page (http://www.guangxindai.com in this case) but I'm getting back an error 403. Here is my code:
import urllib.request
opener = urllib.request.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
f = opener.open("http://www.guangxindai.com")
f.read()
but I get an error response:
Traceback (most recent call last):
File "<pyshell#7>", line 1, in <module>
f = opener.open("http://www.guangxindai.com")
File "C:\Python33\lib\urllib\request.py", line 475, in open
response = meth(req, response)
File "C:\Python33\lib\urllib\request.py", line 587, in http_response
'http', request, response, code, msg, hdrs)
File "C:\Python33\lib\urllib\request.py", line 513, in error
return self._call_chain(*args)
File "C:\Python33\lib\urllib\request.py", line 447, in _call_chain
result = func(*args)
File "C:\Python33\lib\urllib\request.py", line 595, in http_error_default
raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 403: Forbidden
I have tried different request headers, but I still cannot get a correct response. I can view the site in a browser, which seems strange to me. I guess the site uses some method to block web spiders. Does anyone know what is happening? How can I get the HTML of the page correctly?
I was having the same problem as you, and I found the answer in this link.
The answer provided by Stefano Sanfilippo is quite simple and worked for me:
from urllib.request import Request, urlopen
url_request = Request("http://www.guangxindai.com",
                      headers={"User-Agent": "Mozilla/5.0"})
webpage = urlopen(url_request).read()
If your aim is to read the HTML of the page, you can use the following code. It worked for me on Python 2.7:
import urllib
f = urllib.urlopen("http://www.guangxindai.com")
f.read()
I've been at this for the better part of a day but have been coming up with the same Error 400 for quite some time. Basically, the application's goal is to parse a book's ISBN from the Amazon referral url and use it as the reference key to pull images from Amazon's Product Advertising API. The webpage is written in Python 3.4 and Django 1.8. I spent quite a while researching on here and settled for using python-amazon-simple-product-api since it would make parsing results from Amazon a little easier.
Answers like "How to use Python Amazon Simple Product API to get price of a product" make it seem pretty simple, but I haven't gotten it to look up a product successfully yet. Here's a console printout of what my method usually does, with the correct ISBN already filled in:
>>> from amazon.api import AmazonAPI
>>> access_key='amazon-access-key'
>>> secret ='amazon-secret-key'
>>> assoc ='amazon-associate-account-name'
>>> amazon = AmazonAPI(access_key, secret, assoc)
>>> product = amazon.lookup(ItemId='1632360705')
Traceback (most recent call last):
File "<console>", line 1, in <module>
File "/home/tsuko/.virtualenvs/django17/lib/python3.4/site-packages/amazon/api.py", line 161, in lookup
response = self.api.ItemLookup(ResponseGroup=ResponseGroup, **kwargs)
File "/home/tsuko/.virtualenvs/django17/lib/python3.4/site-packages/bottlenose/api.py", line 242, in __call__
{'api_url': api_url, 'cache_url': cache_url})
File "/home/tsuko/.virtualenvs/django17/lib/python3.4/site-packages/bottlenose/api.py", line 203, in _call_api
return urllib2.urlopen(api_request, timeout=self.Timeout)
File "/usr/lib/python3.4/urllib/request.py", line 153, in urlopen
return opener.open(url, data, timeout)
File "/usr/lib/python3.4/urllib/request.py", line 461, in open
response = meth(req, response)
File "/usr/lib/python3.4/urllib/request.py", line 571, in http_response
'http', request, response, code, msg, hdrs)
File "/usr/lib/python3.4/urllib/request.py", line 499, in error
return self._call_chain(*args)
File "/usr/lib/python3.4/urllib/request.py", line 433, in _call_chain
result = func(*args)
File "/usr/lib/python3.4/urllib/request.py", line 579, in http_error_default
raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 400: Bad Request
Now I guess I'm curious if this is some quirk with PythonAnywhere, or if I've missed a configuration setting in Django? As far as I can tell through AWS and the Amazon Associates page my keys are correct. I'm not too worried about parsing at this point, just getting the object. I've even tried bypassing the API and just using Bottlenose (which the API extends) but I get the same error 400 result.
I'm really new to Django and Amazon's API, any assistance would be appreciated!
You still haven't authorised your account for API access. To do so, you can go to https://affiliate-program.amazon.com/gp/advertising/api/registration/pipeline.html
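Whatever the cause turns out to be, the body of a 400 response often says more than the status line in the traceback. The following is a minimal sketch of surfacing that body with plain urllib; the helper name is hypothetical, and the `opener` parameter is only there so the logic can be tried without a real request.

```python
import urllib.error
import urllib.request


def open_or_explain(request, opener=urllib.request.urlopen):
    """Return the response body, or raise RuntimeError that includes the server's error body."""
    try:
        with opener(request) as resp:
            return resp.read()
    except urllib.error.HTTPError as err:
        # HTTPError is also a response object, so its body can be read.
        detail = err.read().decode("utf-8", errors="replace")
        raise RuntimeError("HTTP %d %s: %s" % (err.code, err.reason, detail[:500])) from err
```

The same idea applies when a library makes the request for you: since the traceback shows `urllib.error.HTTPError` propagating out of the call, wrapping that call in `try`/`except urllib.error.HTTPError` and reading the exception's body should work as well.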
I'm currently trying to fix a Kodi plugin called NetfliXBMC.
It uses this url to get information on specific movies:
http://www.netflix.com/JSON/BOB?movieid=<SOMEID>
While trying to build a minimal case to ask this question I discovered that it's not even necessary to be logged in to access the information, which simplifies my question a lot.
Querying information about a movie works from wget, from curl, from incognito chrome etc. It just never works from urllib2:
# wget works just fine
$: wget -q -O- http://www.netflix.com/JSON/BOB?movieid=80021955
{"contextData":"{\"cookieDisclosure\":{\"data\":{\"showCookieBanner\":false}}}","result":"success","actionErrors":null,"fieldErrors":null,"actionMessages":null,"data":[output omitted for brevity]}
# so does curl
$: curl http://www.netflix.com/JSON/BOB?movieid=80021955
{"contextData":"{\"cookieDisclosure\":{\"data\":{\"showCookieBanner\":false}}}","result":"success","actionErrors":null,"fieldErrors":null,"actionMessages":null,"data":[output omitted for brevity]}
# but python's urllib always gets a 500
$: python -c "import urllib2; urllib2.urlopen('http://www.netflix.com/JSON/BOB?movieid=80021955').read()"
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/usr/lib/python2.7/urllib2.py", line 127, in urlopen
return _opener.open(url, data, timeout)
File "/usr/lib/python2.7/urllib2.py", line 410, in open
response = meth(req, response)
File "/usr/lib/python2.7/urllib2.py", line 523, in http_response
'http', request, response, code, msg, hdrs)
File "/usr/lib/python2.7/urllib2.py", line 448, in error
return self._call_chain(*args)
File "/usr/lib/python2.7/urllib2.py", line 382, in _call_chain
result = func(*args)
File "/usr/lib/python2.7/urllib2.py", line 531, in http_error_default
raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
urllib2.HTTPError: HTTP Error 500: Internal Server Error
$: python --version
Python 2.7.6
What I've tried so far: several different user-agent strings, initializing a urlopener with a cookie jar, plain old urllib (doesn't raise an exception but receives the same error page).
I'm really curious as to why this might be. Thanks in advance!
It turned out to be a bug on Netflix's side that is triggered when no Accept header is sent.
This doesn't work:
opener = urllib2.build_opener()
opener.open("http://www.netflix.com/JSON/BOB?movieid=80021955")
Adding a proper accept header makes it work:
opener = urllib2.build_opener()
mimeAccept = "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"
opener.addheaders = [('Accept', mimeAccept)]
opener.open("http://www.netflix.com/JSON/BOB?movieid=80021955")
[...]
Of course, there is another bug there: it returns a 500 Internal Server Error instead of a 400 Bad Request when the problem is clearly in the request.
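For readers on Python 3, where urllib2 became urllib.request, the same fix can be sketched with a Request object. This is an illustrative sketch only: `build_request` is a hypothetical helper, and nothing here confirms that the endpoint still behaves as described.

```python
import urllib.request

# Same Accept value as in the urllib2 example above.
ACCEPT = "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"


def build_request(url, accept=ACCEPT):
    """Build a Request that carries an explicit Accept header."""
    return urllib.request.Request(url, headers={"Accept": accept})


req = build_request("http://www.netflix.com/JSON/BOB?movieid=80021955")
# Pass req to urllib.request.urlopen(req) to actually send it.
```

Setting the header per request, rather than on a shared opener, keeps the change local to the calls that need it.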
I am using the "Server Side" flow to get a user's permissions to access some information using Python on Google Appengine.
I am able to get the server generated code from Facebook after the user clicks on the "Allow" button.
However when I get the access token, I run into the following error:
Traceback (most recent call last):
File "/base/python_runtime/python_lib/versions/1/google/appengine/ext/webapp/__init__.py", line 515, in __call__
handler.get(*groups)
File "/base/data/home/apps/finisherph/1.348502373491720746/controllers.py", line 21, in get
data = urllib2.urlopen(access_token_url)
File "/base/python_runtime/python_dist/lib/python2.5/urllib2.py", line 124, in urlopen
return _opener.open(url, data)
File "/base/python_runtime/python_dist/lib/python2.5/urllib2.py", line 387, in open
response = meth(req, response)
File "/base/python_runtime/python_dist/lib/python2.5/urllib2.py", line 498, in http_response
'http', request, response, code, msg, hdrs)
File "/base/python_runtime/python_dist/lib/python2.5/urllib2.py", line 425, in error
return self._call_chain(*args)
File "/base/python_runtime/python_dist/lib/python2.5/urllib2.py", line 360, in _call_chain
result = func(*args)
File "/base/python_runtime/python_dist/lib/python2.5/urllib2.py", line 506, in http_error_default
raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
HTTPError: HTTP Error 400: Bad Request
Here's the code in my controller that handles the response from Facebook after the user clicks the "Allow" button. It's still a hack, so the code is a little bit dirty. I'm still trying to make it work.
class Register(webapp.RequestHandler):
    def get(self):
        code=self.request.get('code')
        logging.debug("code: "+code)
        accesst_url=["https://graph.facebook.com/oauth/access_token?"]
        accesst_url.append("client_id=CLIENT_ID&")
        import urllib
        accesst_url.append(urllib.urlencode({'redirect_uri':'http://my.website.com/register/facebook/'}))
        accesst_url.append('&')
        accesst_url.append("client_secret=CLIENT_SECRET&")
        accesst_url.append("".join(["code=",str(code)]))
        logging.debug(accesst_url)
        access_token_url="".join(accesst_url)
        logging.debug(access_token_url)
        import urllib2
        data = urllib2.urlopen(access_token_url)
        ...
        ...
The error occurs here:
data = urllib2.urlopen(access_token_url)
When I copy and paste the access_token_url from my logs, I get the following error:
{
  "error": {
    "type": "OAuthException",
    "message": "Error validating verification code."
  }
}
What am I missing here?
It looks like you are trying to open the access token as a URL, which is not quite right.
Here is an example which illustrates how OAuth authentication via Facebook is done on GAE.
You go to https://graph.facebook.com/oauth/authorize? with your client_id and redirect_uri.
Upon authorization, it gives you a code, and you use that code together with your client_secret to get an access_token from https://graph.facebook.com/oauth/access_token
You then use that access_token to act as the Facebook user.
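The two URLs in the steps above can be sketched like this. It is a hedged illustration, not Facebook's official client: the function names are hypothetical, it uses Python 3's urllib.parse rather than the Python 2.5 runtime from the question, and the parameter names are the ones that appear in the question's own code. In OAuth 2 the redirect_uri sent when exchanging the code has to be identical to the one used in the authorize step, so a mismatch there is one possible (unconfirmed) cause of a "validating verification code" error.

```python
from urllib.parse import urlencode

AUTHORIZE_URL = "https://graph.facebook.com/oauth/authorize"
ACCESS_TOKEN_URL = "https://graph.facebook.com/oauth/access_token"


def authorize_url(client_id, redirect_uri):
    """Step 1: the URL the user is sent to in order to grant access."""
    params = {"client_id": client_id, "redirect_uri": redirect_uri}
    return AUTHORIZE_URL + "?" + urlencode(params)


def access_token_url(client_id, client_secret, redirect_uri, code):
    """Step 2: the URL that exchanges the returned code for an access token."""
    params = {
        "client_id": client_id,
        "redirect_uri": redirect_uri,  # same value as in step 1
        "client_secret": client_secret,
        "code": code,
    }
    # urlencode escapes every value, including the code itself.
    return ACCESS_TOKEN_URL + "?" + urlencode(params)
```

Building the query with a single `urlencode` call escapes all values consistently, whereas concatenating `code=` by hand leaves the code unescaped.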