I cannot open a website that exists - python

I am getting an error that makes me believe my program is unable to find a website I know exists. The website is
https://www.transfermarkt.com/marco-reus/verletzungen/spieler/35207
My code looks like
from urllib import request as u_r

def strip_webite():
    with u_r.urlopen("https://www.transfermarkt.com/marco-reus/verletzungen/spieler/35207") as f:
        pass

if __name__ == "__main__":
    strip_webite()
And the error I get is
File "database_management.py", line 19, in <module>
strip_webite()
File "database_management.py", line 15, in strip_webite
with u_r.urlopen("https://www.transfermarkt.com/marco-reus/verletzungen/spieler/35207") as f:
File "/usr/local/Cellar/python3/3.6.3/Frameworks/Python.framework/Versions/3.6/lib/python3.6/urllib/request.py", line 223, in urlopen
return opener.open(url, data, timeout)
File "/usr/local/Cellar/python3/3.6.3/Frameworks/Python.framework/Versions/3.6/lib/python3.6/urllib/request.py", line 532, in open
response = meth(req, response)
File "/usr/local/Cellar/python3/3.6.3/Frameworks/Python.framework/Versions/3.6/lib/python3.6/urllib/request.py", line 642, in http_response
'http', request, response, code, msg, hdrs)
File "/usr/local/Cellar/python3/3.6.3/Frameworks/Python.framework/Versions/3.6/lib/python3.6/urllib/request.py", line 570, in error
return self._call_chain(*args)
File "/usr/local/Cellar/python3/3.6.3/Frameworks/Python.framework/Versions/3.6/lib/python3.6/urllib/request.py", line 504, in _call_chain
result = func(*args)
File "/usr/local/Cellar/python3/3.6.3/Frameworks/Python.framework/Versions/3.6/lib/python3.6/urllib/request.py", line 650, in http_error_default
raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 404: Not Found

It looks like Transfermarkt is blocking requests from bots that use the default User-Agent string sent by Python's urllib library, though it doesn't mention anything about that in its robots.txt.
This seems to imply they don't mind us scraping them, but they'd prefer we announce who we are.
To do so with urllib, do the following:
from urllib import request as u_r

def strip_webite():
    request = u_r.Request("https://www.transfermarkt.com/marco-reus/verletzungen/spieler/35207")
    request.add_header('User-Agent', 'my-cool-app')
    with u_r.urlopen(request) as f:
        pass

if __name__ == "__main__":
    strip_webite()
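If you're open to a third-party library, the same idea works with requests; here is a minimal sketch (the 'my-cool-app' User-Agent string is just a placeholder, as in the answer above):

import requests  # third-party: pip install requests

url = "https://www.transfermarkt.com/marco-reus/verletzungen/spieler/35207"
# Identify the client explicitly instead of sending the default User-Agent.
response = requests.get(url, headers={"User-Agent": "my-cool-app"})
response.raise_for_status()  # raises if the server still returns a 4xx/5xx
html = response.text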

Related

urllib request gives 404 error but works fine in browser

When I try this line:
import urllib.request
urllib.request.urlretrieve("https://i.redd.it/53tfh959wnv41.jpg", "photo.jpg")
i get the following error:
Traceback (most recent call last):
File "scraper.py", line 26, in <module>
urllib.request.urlretrieve("https://i.redd.it/53tfh959wnv41.jpg", "photo.jpg")
File "/usr/lib/python3.6/urllib/request.py", line 248, in urlretrieve
with contextlib.closing(urlopen(url, data)) as fp:
File "/usr/lib/python3.6/urllib/request.py", line 223, in urlopen
return opener.open(url, data, timeout)
File "/usr/lib/python3.6/urllib/request.py", line 532, in open
response = meth(req, response)
File "/usr/lib/python3.6/urllib/request.py", line 642, in http_response
'http', request, response, code, msg, hdrs)
File "/usr/lib/python3.6/urllib/request.py", line 570, in error
return self._call_chain(*args)
File "/usr/lib/python3.6/urllib/request.py", line 504, in _call_chain
result = func(*args)
File "/usr/lib/python3.6/urllib/request.py", line 650, in http_error_default
raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 404: Not Found
But the link works fine in my browser? Why does it work in the browser but not for a request? It works with other pictures from the same site.
If you check your developer console, the request actually returns a 404.
So what you see in the browser is imgur's custom 404 "page" (which is itself an image).
EDIT:
So urlretrieve fails on a 404 status code. If you want to use the contents of the response (even if the status code is 404), you can do the following:
import urllib.request
import urllib.error

try:
    urllib.request.urlretrieve("https://i.redd.it/53tfh959wnv41.jpg", "photo.jpg")
except urllib.error.HTTPError as e:
    # The HTTPError object is itself a file-like response, so its body can be read.
    with open("error_photo.jpg", 'wb') as fp:
        fp.write(e.read())
Also try changing the User-Agent. urlretrieve has no headers argument, but you can install a global opener that sends a custom User-Agent:
opener = urllib.request.build_opener()
opener.addheaders = [("User-Agent", "put custom user agent here")]
urllib.request.install_opener(opener)
urllib.request.urlretrieve("https://i.redd.it/53tfh959wnv41.jpg", "photo.jpg")
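Alternatively, a minimal sketch that skips urlretrieve and saves the bytes yourself via a Request object (again, the User-Agent string is just a placeholder):

import urllib.request

req = urllib.request.Request(
    "https://i.redd.it/53tfh959wnv41.jpg",
    headers={"User-Agent": "put custom user agent here"},
)
# Stream the response body straight into a local file.
with urllib.request.urlopen(req) as resp, open("photo.jpg", "wb") as fp:
    fp.write(resp.read())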

Why is urllib.request.urlopen giving me 404 on Wall Street Journal's website?

Problem
I'm using urllib.request.urlopen on the Wall Street Journal and it gives me a 404.
Details
Other sites work fine. Same error if I use https://. I did this example in the REPL, but the same error happens in calls from my Django server:
>>> from urllib.request import urlopen
>>> urlopen('http://www.wsj.com')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/urllib/request.py", line 222, in urlopen
return opener.open(url, data, timeout)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/urllib/request.py", line 531, in open
response = meth(req, response)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/urllib/request.py", line 641, in http_response
'http', request, response, code, msg, hdrs)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/urllib/request.py", line 569, in error
return self._call_chain(*args)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/urllib/request.py", line 503, in _call_chain
result = func(*args)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/urllib/request.py", line 649, in http_error_default
raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 404: Not Found
This is how it should work:
>>> urlopen('http://www.cbc.ca')
<http.client.HTTPResponse object at 0x10b0f8c88>
I'm not sure how to debug this. Anyone know what's going on, and how I can fix it?
First import Request like this:
from urllib.request import Request, urlopen
and then pass your URL and header to Request like below:
url = 'https://www.wsj.com/'
response_obj = urlopen(Request(url, headers={'User-Agent': 'Mozilla/5.0'}))
print(response_obj)
I tested it just now and it's working.
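As a follow-up, to actually use the page you would read and decode the response body; a minimal sketch, assuming the page is UTF-8 encoded:

html = response_obj.read().decode('utf-8')
print(html[:200])  # first 200 characters of the page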

Google Url Shortener API from Python AppEngine: HTTPError: HTTP Error 403: Forbidden

I'm having trouble using Google URL Shortener API in AppEngine production environment.
In the Developers Console, I have the URL Shortener API turned on, and OAuth 2 is also turned on. On top of that, I have the simple API Access browser key obtained from the API Access screen.
Here is the problem. When I run the following code, I get "HTTPError: HTTP Error 403: Forbidden" in the Developers Console log. Interestingly, the same code properly returns the short url in the development environment.
import json
import logging
import urllib2

def goo_shorten_url(url):
    post_url = 'https://www.googleapis.com/urlshortener/v1/url?fields=id'
    logging.info('post_url: {}'.format(post_url))
    postdata = {'longUrl': url}
    headers = {'Content-Type': 'application/json'}
    req = urllib2.Request(
        post_url,
        json.dumps(postdata),
        headers
    )
    ret = urllib2.urlopen(req).read()
    print ret
    return json.loads(ret)['id']
If I include the API key in the post url as follows,
post_url = 'https://www.googleapis.com/urlshortener/v1/url?fields=id&key=MYAPIKEY'
Prod and Dev both return HTTP Error 403.
I suspect one of these three is true, but would like to hear your thoughts.
An API key is required, but I'm not using the right API key.
An API key is not required (which explains why it works with no key in Dev), but my API key is wrong, causing both Prod and Dev to fail.
Google doesn't allow applications to programmatically submit a POST request to its Url shortener API.(this doesn't explain why it would work in Dev at all)
Thanks for reading.
Prod
File "/base/data/home/apps/s~myapp/1.377367579804576653/util/test_module.py", line 50, in get
strin = goo_shorten_url(longurl)
File "/base/data/home/apps/s~myapp/1.377367579804576653/util/JOTools.py", line 41, in goo_shorten_url
ret = urllib2.urlopen(req).read()
File "/base/data/home/runtimes/python27/python27_dist/lib/python2.7/urllib2.py", line 127, in urlopen
return _opener.open(url, data, timeout)
File "/base/data/home/runtimes/python27/python27_dist/lib/python2.7/urllib2.py", line 410, in open
response = meth(req, response)
File "/base/data/home/runtimes/python27/python27_dist/lib/python2.7/urllib2.py", line 523, in http_response
'http', request, response, code, msg, hdrs)
File "/base/data/home/runtimes/python27/python27_dist/lib/python2.7/urllib2.py", line 448, in error
return self._call_chain(*args)
File "/base/data/home/runtimes/python27/python27_dist/lib/python2.7/urllib2.py", line 382, in _call_chain
result = func(*args)
File "/base/data/home/runtimes/python27/python27_dist/lib/python2.7/urllib2.py", line 531, in http_error_default
raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
HTTPError: HTTP Error 403: Forbidden
Dev with API Key
File "C:_dev\eclipse-work\gae\MyProj\util\test_module.py", line 50, in get
strin = goo_shorten_url(longurl)
File "C:_dev\eclipse-work\gae\MyProj\util\JOTools.py", line 41, in goo_shorten_url
ret = urllib2.urlopen(req).read()
File "C:\PYTHON27\lib\urllib2.py", line 127, in urlopen
return _opener.open(url, data, timeout)
File "C:\PYTHON27\lib\urllib2.py", line 410, in open
response = meth(req, response)
File "C:\PYTHON27\lib\urllib2.py", line 523, in http_response
'http', request, response, code, msg, hdrs)
File "C:\PYTHON27\lib\urllib2.py", line 448, in error
return self._call_chain(*args)
File "C:\PYTHON27\lib\urllib2.py", line 382, in _call_chain
result = func(*args)
File "C:\PYTHON27\lib\urllib2.py", line 531, in http_error_default
raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
HTTPError: HTTP Error 403: Forbidden
Google has a nice API for this. You can test your requests here. Hope this helps.
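If the hand-built urllib2 request keeps returning 403, one thing worth trying is the official Python client library instead. This is only a minimal sketch under some assumptions: the google-api-python-client package is installed, the key is a server key that is valid for the URL Shortener API, and the function name is mine:

from googleapiclient.discovery import build

def goo_shorten_url_with_client(url, api_key):
    # Build a client for the URL Shortener API (v1) using a server API key.
    service = build('urlshortener', 'v1', developerKey=api_key)
    # Insert the long URL and return the shortened 'id' from the response.
    result = service.url().insert(body={'longUrl': url}).execute()
    return result['id']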

Python 3.x get JSON from URL

Hello fellow programmers,
today I wanted to get some JSON data from this website using Python 3.3: http://ladv.de/api/-apikey-redacted-/ausDetail?id=884&wettbewerbe=true&all=true
The official API says that calling this URL returns some JSON data, but if I use the following code to get it (which I found on Stack Overflow, too), it throws an error:
import urllib.request
import json
request = 'http://ladv.de/api/mmetzger/ausDetail?id=884&wettbewerbe=true&all=true'
response = urllib.request.urlopen(request)
obj = json.load(response)
str_response = response.readall().decode('utf-8')
obj = json.loads(str_response)
print(obj)
prints out
Traceback (most recent call last):
File "D:/ladvclient/testscrape.py", line 5, in <module>
response = urllib.request.urlopen(request)
File "C:\Python33\lib\urllib\request.py", line 156, in urlopen
return opener.open(url, data, timeout)
File "C:\Python33\lib\urllib\request.py", line 475, in open
response = meth(req, response)
File "C:\Python33\lib\urllib\request.py", line 587, in http_response
'http', request, response, code, msg, hdrs)
File "C:\Python33\lib\urllib\request.py", line 513, in error
return self._call_chain(*args)
File "C:\Python33\lib\urllib\request.py", line 447, in _call_chain
result = func(*args)
File "C:\Python33\lib\urllib\request.py", line 595, in http_error_default
raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 404: Not Found
Where is the bug, and what is the correct code?
Thanks in advance,
forumfresser
The site you're trying to fetch is not available, as seen here:
http://ladv.de/api/-apikey-redacted-/ausDetail?id=884&wettbewerbe=true&all=true
You could also just read the error message by yourself:
urllib.error.HTTPError: HTTP Error 404: Not Found
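For completeness, once you have a URL that actually responds, fetching and parsing the JSON in Python 3 looks roughly like this (a minimal sketch; the URL below is the one from the question and only works with a valid API key):

import json
import urllib.request

url = 'http://ladv.de/api/mmetzger/ausDetail?id=884&wettbewerbe=true&all=true'
with urllib.request.urlopen(url) as response:
    # Read the raw bytes, decode them, and parse the JSON payload.
    obj = json.loads(response.read().decode('utf-8'))
print(obj)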

Facebook publish HTTP Error 400 : bad request

Hey, I am trying to publish a score to Facebook through Python's urllib2 library.
import urllib2,urllib
url = "https://graph.facebook.com/USER_ID/scores"
data = {}
data['score']=SCORE
data['access_token']='APP_ACCESS_TOKEN'
data_encode = urllib.urlencode(data)
request = urllib2.Request(url, data_encode)
response = urllib2.urlopen(request)
responseAsString = response.read()
I am getting this error:
response = urllib2.urlopen(request)
File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/urllib2.py", line 124, in urlopen
return _opener.open(url, data, timeout)
File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/urllib2.py", line 389, in open
response = meth(req, response)
File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/urllib2.py", line 502, in http_response
'http', request, response, code, msg, hdrs)
File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/urllib2.py", line 427, in error
return self._call_chain(*args)
File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/urllib2.py", line 361, in _call_chain
result = func(*args)
File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/urllib2.py", line 510, in http_error_default
raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
urllib2.HTTPError: HTTP Error 400: Bad Request
Not sure if this is related to Facebook's Open Graph or improper urllib2 API use.
I ran your code and got the same error (there is no further error detail in the body; I would have posted that in a comment, but I can't yet, I guess), so I googled "publish facebook scores."
I believe you'll need to grant your app permission to publish scores first, unless you've done that already. See http://developers.facebook.com/blog/post/539/.
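One thing that may help diagnose this: urllib2's HTTPError carries the response body, and the Graph API usually returns a JSON error description in it. A minimal sketch for printing it, reusing the request object built in the question:

import urllib2

try:
    response = urllib2.urlopen(request)
except urllib2.HTTPError as e:
    # Facebook normally explains the 400 in the error body.
    print e.code
    print e.read()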
You may have to provide a User-Agent header identifying the client as a browser. I remember getting a similar error while running a crawler on a website, because it detected that no browser was making the request.
