I cannot scrape a webpage with Beautiful Soup - python

I'm trying to scrape https://www.crowdcube.com/investments?sector=technology with BeautifulSoup in Python 3, but the request fails with this error:
Traceback (most recent call last):
File "D:\DataVisualization\lib\urllib\request.py", line 163, in urlopen
return opener.open(url, data, timeout)
File "D:\DataVisualization\lib\urllib\request.py", line 472, in open
response = meth(req, response)
File "D:\DataVisualization\lib\urllib\request.py", line 582, in http_response
'http', request, response, code, msg, hdrs)
File "D:\DataVisualization\lib\urllib\request.py", line 510, in error
return self._call_chain(*args)
File "D:\DataVisualization\lib\urllib\request.py", line 444, in _call_chain
result = func(*args)
File "D:\DataVisualization\lib\urllib\request.py", line 590, in http_error_default
raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 403: Forbidden

Use requests; with it, this site does not need a custom User-Agent:
In [23]: import requests
In [24]: r = requests.get('https://www.crowdcube.com/investments?sector=technology')
In [25]: r.status_code
Out[25]: 200
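
One plausible explanation for the difference, not confirmed for this particular site, is that the server filters on the User-Agent header. By default urllib identifies itself as Python-urllib/<version>, while requests sends python-requests/<version>. A minimal sketch for checking what urllib would send, using only the standard library:

```python
import urllib.request

# build_opener() returns an OpenerDirector; its addheaders list holds the
# headers urllib attaches to every request unless the request overrides them.
opener = urllib.request.build_opener()
default_headers = dict(opener.addheaders)

# urllib spells the key 'User-agent' in its own default list.
default_user_agent = default_headers["User-agent"]
print(default_user_agent)  # something like Python-urllib/3.x
```

If a site returns 403 to that string but 200 to a browser or to requests, overriding the header is a reasonable first thing to try.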

Related

Reading JSON data from a URL using Python gives error "urllib.error.HTTPError: HTTP Error 403: Forbidden"

With this code I read a URL and use the data for filtering, but urllib has stopped working:
import json
import urllib.request
url = "myurl"
response = urllib.request.urlopen(url)
data = json.loads(response.read())
Yesterday it was working well, but now it gives me this error:
Traceback (most recent call last):
File "vaccine_survey.py", line 22, in <module>
response = urllib.request.urlopen(url)
File "/usr/lib/python3.6/urllib/request.py", line 223, in urlopen
return opener.open(url, data, timeout)
File "/usr/lib/python3.6/urllib/request.py", line 532, in open
response = meth(req, response)
File "/usr/lib/python3.6/urllib/request.py", line 642, in http_response
'http', request, response, code, msg, hdrs)
File "/usr/lib/python3.6/urllib/request.py", line 570, in error
return self._call_chain(*args)
File "/usr/lib/python3.6/urllib/request.py", line 504, in _call_chain
result = func(*args)
File "/usr/lib/python3.6/urllib/request.py", line 650, in http_error_default
raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 403: Forbidden
This works for me; here 'myurl' is the URL:
import json
from urllib.request import Request, urlopen
req = Request('myurl', headers={'User-Agent': 'Mozilla/5.0'})
# keep the response object here; read it only once, when parsing
response = urlopen(req)
data = json.loads(response.read())
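
The same idea can be wrapped in a small helper. This is only a sketch: the name fetch_json and the default 'Mozilla/5.0' value are illustrative choices, not part of urllib, and it assumes the server rejects only urllib's default User-Agent and that the body is UTF-8 encoded JSON.

```python
import json
from urllib.request import Request, urlopen


def fetch_json(url, user_agent="Mozilla/5.0"):
    """Fetch url with an explicit User-Agent and decode the JSON body.

    fetch_json is a hypothetical helper name used for illustration.
    """
    req = Request(url, headers={"User-Agent": user_agent})
    # The with-block closes the connection once the body has been read.
    with urlopen(req) as response:
        return json.loads(response.read().decode("utf-8"))
```

Usage would look like data = fetch_json("myurl"), with 'myurl' standing for the real address as in the answer above.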

urllib request gives 404 error but works fine in browser

When I try this line:
import urllib.request
urllib.request.urlretrieve("https://i.redd.it/53tfh959wnv41.jpg", "photo.jpg")
I get the following error:
Traceback (most recent call last):
File "scraper.py", line 26, in <module>
urllib.request.urlretrieve("https://i.redd.it/53tfh959wnv41.jpg", "photo.jpg")
File "/usr/lib/python3.6/urllib/request.py", line 248, in urlretrieve
with contextlib.closing(urlopen(url, data)) as fp:
File "/usr/lib/python3.6/urllib/request.py", line 223, in urlopen
return opener.open(url, data, timeout)
File "/usr/lib/python3.6/urllib/request.py", line 532, in open
response = meth(req, response)
File "/usr/lib/python3.6/urllib/request.py", line 642, in http_response
'http', request, response, code, msg, hdrs)
File "/usr/lib/python3.6/urllib/request.py", line 570, in error
return self._call_chain(*args)
File "/usr/lib/python3.6/urllib/request.py", line 504, in _call_chain
result = func(*args)
File "/usr/lib/python3.6/urllib/request.py", line 650, in http_error_default
raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 404: Not Found
But the link works fine in my browser. Why does it work in the browser but not for a request? It works with other pictures from the same site.
The request returns a 404 in the browser as well, as you can see if you check your developer console.
So what you see is imgur's custom 404 "page" (which is an image).
EDIT:
So urlretrieve fails on a 404 status code. If you want to use the contents of the response (even if the status code is 404), you can do the following:
import urllib.error
import urllib.request
try:
    urllib.request.urlretrieve("https://i.redd.it/53tfh959wnv41.jpg", "photo.jpg")
except urllib.error.HTTPError as e:
    # HTTPError is also a response object, so its body can be read and saved
    with open("error_photo.jpg", 'wb') as fp:
        fp.write(e.read())
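Try changing the User-Agent. urlretrieve does not accept a headers argument, so build a Request with the header and pass it to urlopen instead:
req = urllib.request.Request("https://i.redd.it/53tfh959wnv41.jpg", headers={"User-Agent": "put custom user agent here"})
data = urllib.request.urlopen(req).read()  # the image bytes; write them to "photo.jpg"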
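
The two suggestions in this thread (keeping the body of an error response, and sending a custom User-Agent) could be combined roughly as follows. This is a sketch under assumptions: download is a hypothetical helper name, and 'Mozilla/5.0' is just a placeholder value that may or may not satisfy a given server.

```python
import shutil
import urllib.error
import urllib.request


def download(url, path, user_agent="Mozilla/5.0"):
    """Save the body at url to path and return the status code.

    On an HTTP error (such as 404) the error body is saved instead,
    so the caller should check the returned code.
    """
    req = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        response = urllib.request.urlopen(req)
    except urllib.error.HTTPError as err:
        # HTTPError doubles as a response object: it has .code and .read().
        response = err
    with response, open(path, "wb") as fp:
        shutil.copyfileobj(response, fp)
    return response.code
```

A caller might write code = download(url, "photo.jpg") and treat the file as a real image only when code is 200.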

Why is urllib.request.urlopen giving me 404 on Wall Street Journal's website?

Problem
I'm using urllib.request.urlopen on the Wall Street Journal and it gives me a 404.
Details
Other sites work fine. Same error if I use https://. I did this example in REPL but the same error happens in my calls from my Django server:
>>> from urllib.request import urlopen
>>> urlopen('http://www.wsj.com')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/urllib/request.py", line 222, in urlopen
return opener.open(url, data, timeout)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/urllib/request.py", line 531, in open
response = meth(req, response)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/urllib/request.py", line 641, in http_response
'http', request, response, code, msg, hdrs)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/urllib/request.py", line 569, in error
return self._call_chain(*args)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/urllib/request.py", line 503, in _call_chain
result = func(*args)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/urllib/request.py", line 649, in http_error_default
raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 404: Not Found
This is how it should work:
>>> urlopen('http://www.cbc.ca')
<http.client.HTTPResponse object at 0x10b0f8c88>
I'm not sure how to debug this. Anyone know what's going on, and how I can fix it?
First, import Request like this:
from urllib.request import Request, urlopen
Then pass your URL and headers to Request as shown below:
url = 'https://www.wsj.com/'
response_obj = urlopen(Request(url, headers={'User-Agent': 'Mozilla/5.0'}))
print(response_obj)
I just tested it, and it works.
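
When many calls need the header, for example from several places in a Django server, one option is to configure an opener once instead of building a Request each time. A minimal sketch, assuming the header value 'Mozilla/5.0' is acceptable to the site; make_opener is a hypothetical helper name:

```python
import urllib.request


def make_opener(user_agent="Mozilla/5.0"):
    """Return an opener that sends user_agent instead of urllib's default."""
    opener = urllib.request.build_opener()
    # Replace the default ('User-agent', 'Python-urllib/3.x') entry.
    opener.addheaders = [("User-Agent", user_agent)]
    return opener


opener = make_opener()
# opener.open(url) sends the header for calls made through this opener only.
# urllib.request.install_opener(opener) would make plain urlopen() use it too.
```

Calling install_opener changes behaviour process-wide, so using opener.open directly is the less invasive choice.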

amazon product api returning HTTP Error 400: Bad Request

I've accessed data using the Amazon Product API before, and that was easy. All I had to do was create some API keys and use a Python module, and it returned the data.
However, that was 4-5 months ago, and now when I try to use the API keys, it gives me a 400 error.
Here is my code:
from amazon.api import AmazonAPI
asin = 'B00CAB5ZKC'
AMAZON_ACCESS_KEY='AKIAI4ZP7EZGNSTAWKTA'
AMAZON_SECRET_KEY ='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'
AMAZON_ASSOC_TAG =246152698300
amazon = AmazonAPI(AMAZON_ACCESS_KEY, AMAZON_SECRET_KEY, AMAZON_ASSOC_TAG, region = "DE")
product = amazon.lookup(ItemId=asin, Condition='All', MerchantId='All')
This is the error I get
Traceback (most recent call last):
File "C:\Users\Hari\Documents\Python\data\amazon_test_script.py", line 22, in <module>
product = amazon.lookup(ItemId=asin, Condition='All', MerchantId='All')
File "build\bdist.win-amd64\egg\amazon\api.py", line 173, in lookup
response = self.api.ItemLookup(ResponseGroup=ResponseGroup, **kwargs)
File "C:\Python27\lib\site-packages\bottlenose\api.py", line 265, in __call__
{'api_url': api_url, 'cache_url': cache_url})
File "C:\Python27\lib\site-packages\bottlenose\api.py", line 226, in _call_api
return urllib2.urlopen(api_request, timeout=self.Timeout)
File "C:\Python27\lib\urllib2.py", line 154, in urlopen
return opener.open(url, data, timeout)
File "C:\Python27\lib\urllib2.py", line 437, in open
response = meth(req, response)
File "C:\Python27\lib\urllib2.py", line 550, in http_response
'http', request, response, code, msg, hdrs)
File "C:\Python27\lib\urllib2.py", line 475, in error
return self._call_chain(*args)
File "C:\Python27\lib\urllib2.py", line 409, in _call_chain
result = func(*args)
File "C:\Python27\lib\urllib2.py", line 558, in http_error_default
raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
HTTPError: HTTP Error 400: Bad Request
>>>
I've been debugging this for 2 days now, with no success. Could anyone tell me what I am missing?

HTTP Error 403 returned on accessing Amazon API via Bottlenose (Python)

I'm trying to write a simple request to the Amazon API using the following code:
ak = "***"
sk = "***"
at = "***"
import bottlenose
amazon = bottlenose.Amazon(ak, sk, at, "DE")
response=amazon.ItemLookup(ItemId="B00KWAO4CI")
print(response.price_and_currency)
It should return an XML object. Instead I get the following result:
Traceback (most recent call last):
File "simpleamazon.py", line 7, in <module>
response=amazon.ItemLookup(ItemId="B00KWAO4CI")
File "/Library/Python/2.7/site-packages/bottlenose/api.py", line 251, in __call__
{'api_url': api_url, 'cache_url': cache_url})
File "/Library/Python/2.7/site-packages/bottlenose/api.py", line 212, in _call_api
return urllib2.urlopen(api_request, timeout=self.Timeout)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 154, in urlopen
return opener.open(url, data, timeout)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 437, in open
response = meth(req, response)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 550, in http_response
'http', request, response, code, msg, hdrs)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 475, in error
return self._call_chain(*args)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 409, in _call_chain
result = func(*args)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 558, in http_error_default
raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
urllib2.HTTPError: HTTP Error 403: Forbidden
Up until recently I received HTTP Error 400 instead. To my knowledge I haven't changed anything. I've also tried using response groups, but it resulted in the same error(s).
Do you have any leads?
Using Python 3.5.2
