I'm learning BeautifulSoup and was trying to write a small script to find houses on a Dutch real estate website. When I try to get the website's content, I immediately get an HTTP 405 error:
File "funda.py", line 2, in <module>
html = urlopen("http://www.funda.nl")
File "<folders>request.py", line 223, in urlopen
return opener.open(url, data, timeout)
File "<folders>request.py", line 532, in open
response = meth(req, response)
File "<folders>request.py", line 642, in http_response
'http', request, response, code, msg, hdrs)
File "<folders>request.py", line 570, in error
return self._call_chain(*args)
File "<folders>request.py", line 504, in _call_chain
result = func(*args)
File "<folders>request.py", line 650, in http_error_default
raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 405: Not Allowed
What I'm trying to execute:
from urllib.request import urlopen
html = urlopen("http://www.funda.nl")
Any idea why this results in HTTP 405? I'm just doing a GET request, right?
Possible duplicate of "HTTPError: HTTP Error 403: Forbidden". You need to pretend to be a regular visitor. This is generally done (it varies from site to site) by sending a common/regular User-Agent HTTP header.
>>> url = "http://www.funda.nl"
>>> import urllib.request
>>> req = urllib.request.Request(
... url,
... data=None,
... headers={
... 'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'
... }
... )
>>> f = urllib.request.urlopen(req)
>>> f.status, f.msg
(200, 'OK')
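Since the question mentions learning BeautifulSoup, this is where it plugs in: once the request succeeds, the response body can go straight into the parser. A minimal sketch, assuming the bs4 package is installed:
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(f.read(), 'html.parser')  # f is the response opened above
>>> soup.title.string  # quick sanity check that a real page came back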
Using the requests library -
>>> import requests
>>> response = requests.get(
... url,
... headers={
... 'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'
... }
... )
>>> response.status_code
200
It works if you don't use Requests or urllib2 but Python 2's plain urllib instead (note that urllib.urlopen was removed in Python 3):
import urllib
html = urllib.urlopen("http://www.funda.nl")
leovp's comment makes sense.
Related
So I am working on a project crawling different sites. All sites work except for caesarscasino.com.
No matter what I try, I get a 403 Forbidden error. I have searched here and elsewhere to no avail.
Here is my code:
import urllib3
import urllib.request, urllib.error
from urllib.request import Request
import ssl

try:
    from urllib2 import urlopen  # Python 2
except ImportError:
    from urllib.request import urlopen  # Python 3

# Override the default HTTPS context creation so certificates are not verified.
ssl._create_default_https_context = ssl._create_unverified_context
urllib3.disable_warnings()

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36',
           'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
           'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
           'Accept-Encoding': 'none',
           'Accept-Language': 'en-US,en;q=0.8',
           'Connection': 'keep-alive'}

url = 'https://www.caesarscasino.com/'
req = Request(url, headers=headers)  # build the request; nothing is sent yet
result = urllib.request.urlopen(req).read()  # the 403 is raised here
print(result)
With this error code:
Traceback (most recent call last):
File "C:\Users\sp\Desktop\untitled0.py", line 30, in <module>
result = urllib.request.urlopen(req).read()
File "C:\Users\sp\anaconda3\envs\spyder\lib\urllib\request.py", line 222, in urlopen
return opener.open(url, data, timeout)
File "C:\Users\sp\anaconda3\envs\spyder\lib\urllib\request.py", line 531, in open
response = meth(req, response)
File "C:\Users\sp\anaconda3\envs\spyder\lib\urllib\request.py", line 640, in http_response
response = self.parent.error(
File "C:\Users\sp\anaconda3\envs\spyder\lib\urllib\request.py", line 569, in error
return self._call_chain(*args)
File "C:\Users\sp\anaconda3\envs\spyder\lib\urllib\request.py", line 502, in _call_chain
result = func(*args)
File "C:\Users\sp\anaconda3\envs\spyder\lib\urllib\request.py", line 649, in http_error_default
raise HTTPError(req.full_url, code, msg, hdrs, fp)
HTTPError: Forbidden
The thing with scraping the web is that not many sites like being scraped, so they do not allow a machine (which your scraper is) to access the page. That is the error you are getting: it basically means "do not access this site if you are a program". However, there are ways around that, like spoofing the IP address and rotating headers while your program checks out the site. I already answered the question of how to do so here. Check it out and let me know in the comments whether or not that works for you.
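To make the header-rotation part concrete, here is a minimal sketch with requests (the User-Agent strings are only examples; picking one at random per request makes the traffic look less uniform, though it does not help against IP-based blocking):
import random
import requests

# A small pool of browser User-Agent strings to rotate through.
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.1.1 Safari/605.1.15',
]

def fetch(url):
    # Pick a different User-Agent for each request.
    headers = {'User-Agent': random.choice(USER_AGENTS)}
    return requests.get(url, headers=headers, timeout=30)

response = fetch('https://www.caesarscasino.com/')
print(response.status_code)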
I believe your issues are related to the fact that it's HTTPS. See here for info on how to fix that.
I am trying to create a web scraper for NBA data. When I am running the below code:
import requests
response = requests.get('https://stats.nba.com/stats/leaguedashplayerstats?College=&Conference=&Country=&DateFrom=10%2F20%2F2017&DateTo=10%2F20%2F2017&Division=&DraftPick=&DraftYear=&GameScope=&GameSegment=&Height=&LastNGames=0&LeagueID=00&Location=&MeasureType=Base&Month=0&OpponentTeamID=0&Outcome=&PORound=0&PaceAdjust=N&PerMode=Totals&Period=0&PlayerExperience=&PlayerPosition=&PlusMinus=N&Rank=N&Season=2017-18&SeasonSegment=&SeasonType=Regular+Season&ShotClockRange=&StarterBench=&TeamID=0&VsConference=&VsDivision=&Weight=')
The request times out with this error:
File "C:\ProgramData\Anaconda3\lib\site-packages\requests\api.py",
line 70, in get
return request('get', url, params=params, **kwargs)
File "C:\ProgramData\Anaconda3\lib\site-packages\requests\api.py",
line 56, in request
return session.request(method=method, url=url, **kwargs)
File
"C:\ProgramData\Anaconda3\lib\site-packages\requests\sessions.py",
line 488, in request
resp = self.send(prep, **send_kwargs)
File
"C:\ProgramData\Anaconda3\lib\site-packages\requests\sessions.py",
line 609, in send
r = adapter.send(request, **kwargs)
File
"C:\ProgramData\Anaconda3\lib\site-packages\requests\adapters.py",
line 473, in send
raise ConnectionError(err, request=request)
ConnectionError: ('Connection aborted.', OSError("(10060,
'WSAETIMEDOUT')",))
However, when I hit the same URL in the browser, I am getting a response.
Looks like the website you mentioned checks for a "User-Agent" in the request's headers. You can fake the "User-Agent" in your request to make it look like it comes from an actual browser, and you'll receive the response.
For example:
import requests
url = "https://stats.nba.com/stats/leaguedashplayerstats?College=&Conference=&Country=&DateFrom=10%2F20%2F2017&DateTo=10%2F20%2F2017&Division=&DraftPick=&DraftYear=&GameScope=&GameSegment=&Height=&LastNGames=0&LeagueID=00&Location=&MeasureType=Base&Month=0&OpponentTeamID=0&Outcome=&PORound=0&PaceAdjust=N&PerMode=Totals&Period=0&PlayerExperience=&PlayerPosition=&PlusMinus=N&Rank=N&Season=2017-18&SeasonSegment=&SeasonType=Regular+Season&ShotClockRange=&StarterBench=&TeamID=0&VsConference=&VsDivision=&Weight="
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'}
# it's the user-agent of my browser ^
response = requests.get(url, headers=headers)
response.status_code # will return: 200
response.text # will return the website content
You can find the user-agent of your browser from here.
If it's still not working, use this header:
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36',
           'Accept-Encoding': 'gzip, deflate, br',
           'Accept-Language': 'en-US,en;q=0.9,hi;q=0.8'}
If the other headers do not work, try this header; it worked pretty well for me.
headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.1.1 Safari/605.1.15",
           "Accept-Language": "en-gb",
           "Accept-Encoding": "br, gzip, deflate",
           "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
           "Referer": "http://www.google.com/"}
I collected these headers from this link.
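On a side note, requests can also build that long query string for you from a params dict, which keeps the URL-encoding correct and is easier to edit. A sketch with a few of the parameters from the URL above (the endpoint appears to expect the full set of parameters, so the remaining ones may need to be included even when empty):
import requests

base_url = 'https://stats.nba.com/stats/leaguedashplayerstats'
params = {
    'DateFrom': '10/20/2017',  # requests URL-encodes the slashes for you
    'DateTo': '10/20/2017',
    'LeagueID': '00',
    'MeasureType': 'Base',
    'PerMode': 'Totals',
    'Season': '2017-18',
    'SeasonType': 'Regular Season',
}
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'}
response = requests.get(base_url, params=params, headers=headers)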
This code works well on my local machine, but when I upload and run it on pythonanywhere.com it gives me this error.
My Code:
url = "http://www.codeforces.com/api/contest.list?gym=false"
hdr = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
'Accept-Encoding': 'none',
'Accept-Language': 'en-US,en;q=0.8',
'Connection': 'keep-alive'}
req = urllib2.Request(url, headers=hdr)
opener = urllib2.build_opener()
openedReq = opener.open(req, timeout=300)
The error:
Traceback (most recent call last):
File "/home/GehadAbdallah/main.py", line 135, in openApi
openedReq = opener.open(req, timeout=300)
File "/usr/lib/python2.7/urllib2.py", line 410, in open
response = meth(req, response)
File "/usr/lib/python2.7/urllib2.py", line 523, in http_response
'http', request, response, code, msg, hdrs)
File "/usr/lib/python2.7/urllib2.py", line 448, in error
return self._call_chain(*args)
File "/usr/lib/python2.7/urllib2.py", line 382, in _call_chain
result = func(*args)
File "/usr/lib/python2.7/urllib2.py", line 531, in http_error_default
raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
urllib2.HTTPError: HTTP Error 403: Forbidden
P.S. I'm working on Python 2.7.
Free accounts on PythonAnywhere are restricted to a whitelist of sites, http/https only, and access goes via a proxy. There's more info here:
PythonAnywhere wiki: "why do I get a 403 forbidden error when opening a url?"
I recently used urllib2 with a Flask project on PythonAnywhere, using their free account, to access an API at donorschoose.org.
This might be helpful:
import json
import urllib2

@app.route('/funding')
def fundingByState():
    # Route requests through PythonAnywhere's proxy (required on free accounts).
    urllib2.install_opener(urllib2.build_opener(urllib2.ProxyHandler({'http': 'proxy.server:3128'})))
    donors_choose_url = "http://api.donorschoose.org/common/json_feed.html?historical=true&APIKey=DONORSCHOOSE"
    response = urllib2.urlopen(donors_choose_url)
    json_response = json.load(response)
    return json.dumps(json_response)
This does work.
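If you are on Python 3 with requests rather than urllib2, the same proxy can be set with a proxies dict; a minimal sketch, assuming the proxy.server:3128 address from the answer above:
import requests

proxies = {
    'http': 'http://proxy.server:3128',
    'https': 'http://proxy.server:3128',
}
response = requests.get('http://www.codeforces.com/api/contest.list?gym=false', proxies=proxies)
print(response.status_code)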
If you're using a paid account but still get this error message, try the pythonanywhere_forums.
For me, I had to delete the console and then start a new one.
I want to create a script with Python that tests whether my combination (username and password) is correct, but I always get a 401 HTTP response, so I think the script can't submit the login data. (The cPanel login isn't a traditional login panel, so I will use the demo login panel as our example-site.com):
import urllib, urllib2, os, sys, re
site = 'http://cpanel.demo.cpanel.net/'
username = 'demo'
password = 'demo'
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:28.0) Gecko/20100101 Firefox/28.0",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Encoding": "gzip, deflate",
    "Accept-Charset": "ISO-8859-1,utf-8;q=0.7,*;q=0.7"}
data = [
    ("user", username),
    ("pass", password),
    ("testcookie", 1),
    ("submit", "Log In"),
    ("redirect_to", 'http://cpanel.demo.cpanel.net/'),
    ("rememberme", "forever")]
req = urllib2.Request(site, urllib.urlencode(dict(data)), dict(headers))
response = urllib2.urlopen(req)
if any('index.html' in v for v in response.headers.values()):
    print "Correct login"
else:
    print "incorrect"
I get this error:
Traceback (most recent call last):
File "C:\Python27\cp\cp4.py", line 19, in <module>
response = urllib2.urlopen(req)
File "C:\Python27\lib\urllib2.py", line 126, in urlopen
return _opener.open(url, data, timeout)
File "C:\Python27\lib\urllib2.py", line 400, in open
response = meth(req, response)
File "C:\Python27\lib\urllib2.py", line 513, in http_response
'http', request, response, code, msg, hdrs)
File "C:\Python27\lib\urllib2.py", line 438, in error
return self._call_chain(*args)
File "C:\Python27\lib\urllib2.py", line 372, in _call_chain
result = func(*args)
File "C:\Python27\lib\urllib2.py", line 521, in http_error_default
raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
urllib2.HTTPError: HTTP Error 401: Access Denied
Any ideas to solve the problem and test the login details?
You need to POST the login info to http://cpanel.demo.cpanel.net/login/
Example using requests (much easier!):
import requests
logininfo = {'user':'demo', 'pass':'demo'}
r = requests.post("http://cpanel.demo.cpanel.net/login/", data=logininfo)
if r.status_code == 200:
    print "Correct login"
Consider using Requests, a much more user-friendly HTTP client library for Python.
import requests
url = 'http://cpanel.demo.cpanel.net/login/'
username = 'demo'
password = 'demo'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:31.0) Gecko/20100101 Firefox/31.0',
}
data = {
    'user': username,
    'pass': password,
}
response = requests.post(url, headers=headers, data=data)
if response.status_code == 200:
    print "Successfully logged in as {username}".format(username=username)
else:
    print "Login unsuccessful: HTTP/{status_code}".format(status_code=response.status_code)
Edited to check for HTTP/200, as cPanel does throw an HTTP/401 if the login is incorrect.
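As a follow-up, if you need to make authenticated calls after checking the login, a requests.Session keeps the cookies cPanel sets on a successful login. A minimal Python 3 sketch against the same demo endpoint:
import requests

session = requests.Session()
session.headers.update({'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:31.0) Gecko/20100101 Firefox/31.0'})

response = session.post('http://cpanel.demo.cpanel.net/login/',
                        data={'user': 'demo', 'pass': 'demo'})
if response.status_code == 200:
    print('Logged in; session cookies are stored for later requests')
else:
    print('Login unsuccessful: HTTP/{}'.format(response.status_code))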