Inability to extract data from website using urllib2 [duplicate]

Inability to extract data from website using urllib2 [duplicate] - python

This question already has answers here:
urllib2 Error 403: Forbidden
(2 answers)
Closed 8 years ago.
I was following a tutorial on pythonforbeginners.com, and I came across a code which isn't running right on my OSX.
from bs4 import BeautifulSoup
import urllib2
url = "http://www.pythonforbeginners.com"
content = urllib2.urlopen(url).read()
soup = BeautifulSoup(content)
print soup.prettify()
This gives me the error:
Traceback (most recent call last): File
"/Users/dhruvmullick/CS/Python/Extracting Data/test.py", line 8, in
content = urllib2.urlopen(url).read() File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py",
line 127, in urlopen
return _opener.open(url, data, timeout) File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py",
line 410, in open
response = meth(req, response) File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py",
line 523, in http_response
'http', request, response, code, msg, hdrs) File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py",
line 448, in error
return self._call_chain(*args) File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py",
line 382, in _call_chain
result = func(*args) File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py",
line 531, in http_error_default
raise HTTPError(req.get_full_url(), code, msg, hdrs, fp) urllib2.HTTPError: HTTP Error 403: Forbidden

The 403 error indicates that the server is blocking your connection.
...a request from a client for a web page or resource to indicate that the server can be reached and understood the request, but refuses to take any further action.
Try with a different domain and you'll find that it works as expected.
To make a work-around, you can add a custom user-agent.

Related

Why is urllib.request.urlopen giving me 404 on Wall Street Journal's website?

Problem
I'm using urllib.request.urlopen on the Wall Street Journal and it gives me a 404.
Details
Other sites work fine. Same error if I use https://. I did this example in REPL but the same error happens in my calls from my Django server:
>>> from urllib.request import urlopen
>>> urlopen('http://www.wsj.com')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/urllib/request.py", line 222, in urlopen
return opener.open(url, data, timeout)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/urllib/request.py", line 531, in open
response = meth(req, response)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/urllib/request.py", line 641, in http_response
'http', request, response, code, msg, hdrs)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/urllib/request.py", line 569, in error
return self._call_chain(*args)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/urllib/request.py", line 503, in _call_chain
result = func(*args)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/urllib/request.py", line 649, in http_error_default
raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 404: Not Found
This is how it should work:
>>> urlopen('http://www.cbc.ca')
<http.client.HTTPResponse object at 0x10b0f8c88>
I'm not sure how to debug this. Anyone know what's going on, and how I can fix it?

first import Request like this:
from urllib.request import **Request**, urlopen
and then pass your url and header to Request like below:
url = 'https://www.wsj.com/'
response_obj = urlopen(Request(url, headers={'User-Agent': 'Mozilla/5.0'}))
print(response_obj)
I tested it now its working

python: urllib2.HTTPError: HTTP Error 405: Method Not Allowed

I am really an ETL guy trying to learn Python, please help
import urllib2
urls =urllib2.urlopen("url1","url2")
i=0
while i< len(urls):
htmlfile = urllib2.urlopen(urls[i])
htmltext = htmlfile.read()
print htmltext
i+=1
I am getting errors as
Traceback (most recent call last):
File ".\test.py", line 2, in
urls =urllib2.urlopen("url1","url2")
File "c:\python27\Lib\urllib2.py", line 154, in urlopen
return opener.open(url, data, timeout)
File "c:\python27\Lib\urllib2.py", line 437, in open
response = meth(req, response)
File "c:\python27\Lib\urllib2.py", line 550, in http_response
'http', request, response, code, msg, hdrs)
File "c:\python27\Lib\urllib2.py", line 475, in error
return self._call_chain(*args)
File "c:\python27\Lib\urllib2.py", line 409, in _call_chain
result = func(*args)
File "c:\python27\Lib\urllib2.py", line 558, in http_error_default
raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
urllib2.HTTPError: HTTP Error 405: Method Not Allowed

Your error is coming from line 2:
urls =urllib2.urlopen("url1","url2")
Whatever url you're trying to access is returning a http error code
HTTP Error 405: Method Not Allowed
Looking at the urllib2 docs, you should only be using 1 url as an argument
https://docs.python.org/2/library/urllib2.html
Open the URL url, which can be either a string or a Request object.
data may be a string specifying additional data to send to the server, or None if no such data is needed. Currently HTTP requests are the only ones that use data; the HTTP request will be a POST instead of a GET when the data parameter is provided.
The 2nd argument you're putting in may be turning the request into a POST, which would explain the Method Not Allowed code.

python urllib2 can not fetch a specific url [duplicate]

This question already has answers here:
urllib2 returns 404 for a website which displays fine in browsers
(3 answers)
Closed 8 years ago.
I'm using urllib2 to request for URLs and read their contents but unfortunately it's not working for some URLs. look at these commands:
#No problem with this URL
urllib2.urlopen('http://www.huffingtonpost.com/2014/07/19/todd-akin-slavery_n_5602083.html')
#This one produced error
urllib2.urlopen('http://www.foxnews.com/us/2014/07/19/cartels-suspected-as-high-caliber-gunfire-sends-border-patrol-scrambling-on-rio/')
The second URL produced and error like this:
Traceback (most recent call last):
File "D:/Developer Center/Republishan/republishan2/republishan2/test.py", line 306, in <module>
urllib2.urlopen('http://www.foxnews.com/us/2014/07/19/cartels-suspected-as-high-caliber-gunfire-sends-border-patrol-scrambling-on-rio/')
File "C:\Python27\lib\urllib2.py", line 127, in urlopen
return _opener.open(url, data, timeout)
File "C:\Python27\lib\urllib2.py", line 410, in open
response = meth(req, response)
File "C:\Python27\lib\urllib2.py", line 523, in http_response
'http', request, response, code, msg, hdrs)
File "C:\Python27\lib\urllib2.py", line 448, in error
return self._call_chain(*args)
File "C:\Python27\lib\urllib2.py", line 382, in _call_chain
result = func(*args)
File "C:\Python27\lib\urllib2.py", line 531, in http_error_default
raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
urllib2.HTTPError: HTTP Error 404: Not Found
What's the problem with this?

I think the site is checking for a User-Agent and or other headers which urllib doesn't set by default.
You can set a User-Agent manually.
Requests library sets a user-agent automatically.
However remember that requests user-agent may also be blocked by some sites.
Try this. This is working for me. You need to install the requests module first!
pip install requests
Then
import requests
r = requests.get("http://www.foxnews.com/us/2014/07/19/cartels-suspected-as-high-caliber-gunfire-sends-border-patrol-scrambling-on-rio/")
print r.text
Urllib is hard and you've to code more. Requests is simpler and is more in line with the Python philosophy that code should be beautiful!

Python 3.x get JSON from URL

Hello fellow Programmers,
today I wanted to get some JSON Data from this website using Python 3.3: http://ladv.de/api/-apikey-redacted-/ausDetail?id=884&wettbewerbe=true&all=true
The official API tells me that calling this URL returns some JSON Data. But if I use the following code to get it (which I found on stackoverflow, too), it throws an error:
import urllib.request
import json
request = 'http://ladv.de/api/mmetzger/ausDetail?id=884&wettbewerbe=true&all=true'
response = urllib.request.urlopen(request)
obj = json.load(response)
str_response = response.readall().decode('utf-8')
obj = json.loads(str_response)
print(obj)
prints out
Traceback (most recent call last):
File "D:/ladvclient/testscrape.py", line 5, in <module>
response = urllib.request.urlopen(request)
File "C:\Python33\lib\urllib\request.py", line 156, in urlopen
return opener.open(url, data, timeout)
File "C:\Python33\lib\urllib\request.py", line 475, in open
response = meth(req, response)
File "C:\Python33\lib\urllib\request.py", line 587, in http_response
'http', request, response, code, msg, hdrs)
File "C:\Python33\lib\urllib\request.py", line 513, in error
return self._call_chain(*args)
File "C:\Python33\lib\urllib\request.py", line 447, in _call_chain
result = func(*args)
File "C:\Python33\lib\urllib\request.py", line 595, in http_error_default
raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 404: Not Found
Where is the bug, and what is the correct code?
Thanks in advance,
forumfresser

The site you're trying to fetch is not available, as seen here:
http://ladv.de/api/-apikey-redacted-/ausDetail?id=884&wettbewerbe=true&all=true
You could also just read the error message by yourself:
urllib.error.HTTPError: HTTP Error 404: Not Found

Facebook publish HTTP Error 400 : bad request

Hey I am trying to publish a score to Facebook through python's urllib2 library.
import urllib2,urllib
url = "https://graph.facebook.com/USER_ID/scores"
data = {}
data['score']=SCORE
data['access_token']='APP_ACCESS_TOKEN'
data_encode = urllib.urlencode(data)
request = urllib2.Request(url, data_encode)
response = urllib2.urlopen(request)
responseAsString = response.read()
I am getting this error:
response = urllib2.urlopen(request)
File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/urllib2.py", line 124, in urlopen
return _opener.open(url, data, timeout)
File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/urllib2.py", line 389, in open
response = meth(req, response)
File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/urllib2.py", line 502, in http_response
'http', request, response, code, msg, hdrs)
File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/urllib2.py", line 427, in error
return self._call_chain(*args)
File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/urllib2.py", line 361, in _call_chain
result = func(*args)
File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/urllib2.py", line 510, in http_error_default
raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
urllib2.HTTPError: HTTP Error 400: Bad Request
Not sure if this is relating to Facebook's Open Graph or improper urllib2 API use.

I ran your code and got the same error (there is no more error in the body, would have posted that in a comment but can't yet I guess) so I googled "publish facebook scores."
I believe you'll need to grant your app permission to publish scores first, unless you've done that already. See http://developers.facebook.com/blog/post/539/.

You may have to provide user:agent as some browser. I remember getting similar error while running crawler in some website, as it detected that no browser is calling for it.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Inability to extract data from website using urllib2 [duplicate] - python

Related

Why is urllib.request.urlopen giving me 404 on Wall Street Journal's website?

python: urllib2.HTTPError: HTTP Error 405: Method Not Allowed

python urllib2 can not fetch a specific url [duplicate]

Python 3.x get JSON from URL

Facebook publish HTTP Error 400 : bad request

Categories

Resources